Automated regression testing for prompt and model changes
Eval-Driven Development (EDD) workflow integration
Automated pass@k and pass^k reliability metrics
Standardized evaluation reporting and session history
Support for code-based, model-based, and human graders
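
The list above names pass@k and pass^k but does not define them. The sketch below assumes the common definitions: pass@k is the chance that at least one of k sampled trials passes, and pass^k is the chance that all k sampled trials pass, each estimated from c passes observed in n recorded trials. The function names and arguments are illustrative only, not this tool's actual API.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Guard against inputs for which the estimators are undefined.
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials passes) from c passes in n trials.

    Hypothetical helper: 1 minus the probability that k trials drawn
    without replacement from the n recorded trials are all failures.
    """
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials pass) from c passes in n trials.

    Hypothetical helper: probability that k trials drawn without
    replacement from the n recorded trials are all passes.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 10 recorded trials, 5 of which passed.
    print(round(pass_at_k(10, 5, 2), 4))
    print(round(pass_pow_k(10, 5, 2), 4))
```

With 5 passes in 10 trials and k = 2, pass@k is about 0.78 while pass^k is about 0.22. The gap shows why the two metrics are reported together: a task can usually be solved in a couple of attempts and still be unreliable when every attempt has to succeed.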