01 Technical comparisons of direct alignment algorithms including DPO, IPO, and KTO
02 Comprehensive breakdown of the 3-stage RLHF pipeline (SFT, RM, PPO)
03 10 GitHub stars
04 Detailed explanations of mathematical frameworks like the Bradley-Terry model
05 Best practices for preference data collection and quality management
06 Strategies for identifying and mitigating over-optimization and reward hacking
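
To make the Bradley-Terry and DPO items above concrete, here is a minimal sketch in plain Python. It assumes scalar rewards and pre-computed sequence log-probabilities. The function names and the default beta = 0.1 are illustrative choices, not taken from the source.

```python
import math


def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred.

    P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    """
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default, for illustration only
) -> float:
    """DPO loss for a single preference pair.

    The implicit reward of a response is beta * (policy log-prob minus
    reference log-prob); the loss is the negative log Bradley-Terry
    likelihood of the observed preference under those implicit rewards.
    """
    implicit_reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    implicit_reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    return -math.log(bradley_terry_prob(implicit_reward_chosen, implicit_reward_rejected))
```

When the policy equals the reference model, both implicit rewards are zero and the loss is log 2. A production implementation would typically work on batched tensors and use a numerically stable log-sigmoid rather than `math.exp` directly.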