01 Regression testing to prevent functional decay in AI sessions
02 Standardized evaluation reporting and project-level storage
03 Advanced reliability metrics including pass@k and pass^k
04 Support for deterministic code-based and model-based graders
05 Eval-Driven Development (EDD) workflow integration
06 1 GitHub star
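
The list names pass@k and pass^k without defining them. As a rough sketch, assuming the definitions commonly used in evaluation work (pass@k: at least one of k sampled trials passes; pass^k: all k sampled trials pass), both can be estimated from n recorded trials of which c passed. The function names and the combinatorial estimators below are illustrative assumptions, not this project's confirmed implementation.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = trials recorded, c = trials that passed, k = sample size of interest.
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k trials passes.

    Hypothetical helper: 1 - C(n - c, k) / C(n, k), i.e. one minus the
    chance that k trials drawn without replacement are all failures.
    """
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k trials pass (pass^k).

    Hypothetical helper: C(c, k) / C(n, k), i.e. the chance that k trials
    drawn without replacement are all passes.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 10 recorded trials, 5 passed, evaluated at k = 2.
    print(f"pass@2 = {pass_at_k(10, 5, 2):.3f}")
    print(f"pass^2 = {pass_hat_k(10, 5, 2):.3f}")
```

Under these assumed definitions, pass@k rises with k (more chances to succeed at least once) while pass^k falls (more chances to fail at least once), which is why the second is the stricter reliability measure for regression-style checks.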