01 CLI-integrated workflow for defining, checking, and reporting evals
02 Standardized evaluation reporting and history logging
03 34,604 GitHub stars
04 Multi-modal grading, including code-based, model-based, and human review
05 Capability and regression evaluation frameworks
06 Pass@k and Pass^k reliability metric tracking
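
The list names Pass@k and Pass^k without defining them. As a rough illustration, the sketch below uses the commonly cited combinatorial estimators: Pass@k as the chance that at least one of k sampled trials passes, and Pass^k as the chance that all k sampled trials pass, given c passing trials out of n recorded. The function names `pass_at_k` and `pass_hat_k` are hypothetical, and nothing here is confirmed to match how the tool itself computes these metrics.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = trials recorded, c = trials that passed, k = sample size.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials drawn from the n recorded passes)."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets made only of failing trials;
    # it is 0 when fewer than k trials failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials drawn from the n recorded pass), i.e. Pass^k."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made only of passing trials.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 4 recorded trials, 2 of which passed.
    print(pass_at_k(4, 2, 2), pass_hat_k(4, 2, 2))
```

Under these definitions Pass@k rises with k while Pass^k falls, which is why the first suits capability checks (can it ever succeed?) and the second suits reliability checks (does it succeed every time?).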