01. Integrated evaluation storage and history within the project structure
02. Reliability tracking using pass@k and pass^k metrics
03. Standardized multi-stage workflow from definition to reporting
04. 61 GitHub stars
05. Automated capability and regression testing frameworks
06. Support for code-based, model-based, and human graders
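
The list names the pass@k and pass^k reliability metrics without defining them. The sketch below shows one common way to estimate both from n recorded trials of a task, of which c passed. It assumes the usual combinatorial estimators: pass@k as the chance that at least one of k sampled trials passes, and pass^k as the chance that all k pass. The function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from this project, which may compute the metrics differently.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = total trials, c = passing trials, k = sample size (1 <= k <= n).
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k trials passes.

    Samples k of the n recorded trials without replacement; the result is
    1 minus the chance that all k come from the n - c failures.
    """
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that all k trials pass (pass^k).

    Samples k of the n recorded trials without replacement; the result is
    the chance that all k come from the c successes.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only: 10 trials, 5 of which passed.
    for k in (1, 2, 5):
        print(k, round(pass_at_k(10, 5, k), 4), round(pass_hat_k(10, 5, k), 4))
```

With 10 trials and 5 passes, pass@2 is about 0.78 while pass^2 is about 0.22. The gap shows why the two are tracked separately: pass@k rewards a task that can succeed at least once, whereas pass^k only stays high when it succeeds consistently.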