01 Deterministic code-based grading using Grep, Bash, and test runners
02 Standardized reporting workflow and version-controlled eval storage
03 Model-based grading for qualitative assessment of AI outputs
04 43,117 GitHub stars
05 Capability and Regression eval templates for structured testing
06 Reliability tracking using pass@k and pass^k metrics
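
The list names pass@k and pass^k without defining them. The sketch below assumes the commonly used meanings: pass@k is the chance that at least one of k attempts passes, and pass^k is the chance that all k attempts pass. Both are estimated here from n recorded trials of which c passed, by counting size-k subsets of those trials. The function names and the estimator are illustrative assumptions, not this tool's actual implementation.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = recorded trials, c = trials that passed, k = attempts considered.
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k attempts passes."""
    _check(n, c, k)
    # Fraction of size-k subsets made up entirely of failing trials,
    # subtracted from 1. comb() returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated probability that all k attempts pass (pass^k)."""
    _check(n, c, k)
    # Fraction of size-k subsets made up entirely of passing trials.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 recorded trials, 5 of them passed.
    for k in (1, 2, 5):
        print(k, round(pass_at_k(10, 5, k), 4), round(pass_hat_k(10, 5, k), 4))
```

Under these assumptions the two metrics coincide at k = 1 and move apart as k grows: pass@k rises toward 1 while pass^k falls toward 0. That is why pass@k is usually read as a capability signal and pass^k as a consistency signal.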