Standardized evaluation reporting and status checking
Capability and regression evaluation templates
Multi-mode grading via code, AI models, or human review
Integrated file-based storage for evaluation history
Reliability tracking using pass@k and pass^k metrics
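
The two reliability metrics in the last item can be illustrated with a short sketch. The source does not say how they are computed, so this assumes the commonly used combinatorial estimators: pass@k as the chance that at least one of k sampled attempts succeeds, and pass^k as the chance that all k sampled attempts succeed, given c successes observed in n trials. The function names pass_at_k and pass_hat_k are hypothetical and chosen only for illustration.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Guard against inputs for which the estimators are undefined.
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes) from c passes in n trials.

    Assumed estimator: 1 - C(n - c, k) / C(n, k).
    """
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts pass) from c passes in n trials.

    Assumed estimator: C(c, k) / C(n, k).
    """
    _check(n, c, k)
    # comb(c, k) is 0 when fewer than k passes exist, giving 0.0.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical run: 10 trials of one evaluation, 5 of which passed.
    n, c = 10, 5
    for k in (1, 2, 5):
        print(k, round(pass_at_k(n, c, k), 4), round(pass_hat_k(n, c, k), 4))
```

Under these assumptions the two metrics agree at k = 1 and diverge as k grows: pass@k rises toward 1 (capability, "can it ever succeed?") while pass^k falls toward 0 (consistency, "does it succeed every time?"). That is why pairing them is useful for tracking reliability.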