What is Eval-Driven Development (EDD)?

EDD is a methodology in which evaluations are defined before the code is implemented. The evals act as unit tests for AI behavior, so results can be checked for consistency and quality against criteria fixed in advance.

Does Eval Harness support manual code reviews?

Yes, it includes a 'Human Review Required' grader type specifically for safety-critical checks or subjective UI/UX evaluations that require human oversight.

How does the pass@k metric work in Eval Harness?

Pass@k is the probability that at least one of 'k' attempts succeeds. It tells developers how reliably a specific AI prompt or piece of logic produces a passing result when retries are allowed.
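As a rough illustration of the definition above, the following Python sketch estimates pass@k from a batch of recorded attempts using the combinatorial estimator that is commonly used for this metric. The function names `pass_at_k` and `pass_all_k` are hypothetical, and nothing here is taken from the skill's own implementation. The second function assumes that pass^k means "all k attempts succeed", which this page does not confirm.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n recorded attempts, c of which succeeded.

    Hypothetical helper: the chance that a random sample of k of the
    n attempts contains at least one success.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k samples made up only of failures;
    # it is 0 when there are fewer than k failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_all_k(n: int, c: int, k: int) -> float:
    """Assumed reading of pass^k: every one of k sampled attempts succeeds."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # 4 attempts, 2 successes, sampling 2 at a time.
    print(round(pass_at_k(4, 2, 2), 4))   # 0.8333
    print(round(pass_all_k(4, 2, 2), 4))  # 0.1667
```

With few successes, pass@k rises quickly as k grows, while the all-succeed variant falls, so the two numbers answer different questions: "can it work with retries?" versus "does it work every time?".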

Can I automate regression testing with this skill?

Yes. Eval Harness includes templates for Regression Evals that compare current results against a baseline SHA or checkpoint, so changes that break previously passing behavior are caught.
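The baseline comparison can be pictured with a small Python sketch. It assumes eval results are stored as a simple mapping from eval name to pass/fail. That storage shape, the function name `find_regressions`, and the eval names are all hypothetical and are not the skill's actual format.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return evals that passed at the baseline but fail or are missing now."""
    return sorted(
        name
        for name, passed in baseline.items()
        # A missing eval is treated as a failure so deletions are not silent.
        if passed and not current.get(name, False)
    )


if __name__ == "__main__":
    # Hypothetical results recorded at a baseline commit and at the current one.
    baseline = {"login-flow": True, "csv-export": True, "search": False}
    current = {"login-flow": True, "csv-export": False, "search": True}
    print(find_regressions(baseline, current))  # ['csv-export']
```

Only evals that passed at the baseline count as regressions here; an eval that was already failing is a capability gap rather than a breaking change.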

Eval Harness for Claude Code

Author: affaan-m

Category: Security & Testing

Implements a formal evaluation framework for Claude Code sessions based on Eval-Driven Development (EDD) principles to ensure reliability.

Eval Harness brings the rigor of software testing to AI development by treating evaluations as the 'unit tests' for Claude Code. It allows developers to implement Eval-Driven Development (EDD) by defining expected behaviors before coding, tracking regressions with baseline SHA comparisons, and measuring success through pass@k metrics. By combining deterministic code-based graders with model-based qualitative assessments, this skill ensures that AI-generated code meets specific capability standards and remains stable throughout the development lifecycle.

Key Features

01 Deterministic code-based grading using Grep, Bash, and test runners

02 Standardized reporting workflow and version-controlled eval storage

03 Model-based grading for qualitative assessment of AI outputs

04 43,117 GitHub stars

05 Capability and Regression eval templates for structured testing

06 Reliability tracking using pass@k and pass^k metrics
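To make the idea of a deterministic code-based grader concrete, here is a minimal Python sketch under stated assumptions: one grader passes when a command exits with status 0, and another passes when a regular expression matches some text, in the spirit of a grep check. The names `grade_command` and `grade_pattern` are hypothetical and do not come from the skill itself.

```python
import re
import subprocess
import sys


def grade_command(command: list[str], timeout: float = 60.0) -> bool:
    """Pass if the command exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A hung or missing command counts as a failed check.
        return False
    return result.returncode == 0


def grade_pattern(text: str, pattern: str) -> bool:
    """Pass if the regular expression matches anywhere in the text."""
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    # Stand-in for a real test-runner invocation: a child Python process
    # that exits with status 0.
    print(grade_command([sys.executable, "-c", "raise SystemExit(0)"]))  # True
    print(grade_pattern("def export_csv(rows):", r"def export_csv\("))   # True
```

Both checks return the same verdict for the same input, which is what separates them from model-based grading.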

Use Cases

01 Measuring the reliability of complex logic by tracking success rates across multiple attempts

02 Protecting legacy functionality with automated regression checks during AI refactoring

03 Defining feature requirements as testable evals before starting implementation

Install with 🐟 Skill.Fish

npx skillfish add affaan-m/everything-claude-code eval-harness
