What is the difference between Capability and Regression evals?

Capability evals test if Claude can perform a new task, while Regression evals ensure that new changes haven't broken previously functioning features.

Where are the evaluation files stored?

Evals are stored locally within your project in the .claude/evals/ directory, ensuring they are versioned alongside your source code as first-class artifacts.

Can I use manual checks with the Eval Harness?

Yes, the framework supports a 'Human Grader' type that flags specific changes for manual review when automated or model-based grading is insufficient.
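As a rough illustration of how the three grader types could sit side by side, here is a minimal Python sketch. It is not the skill's actual implementation: `GradeResult`, `code_grader`, `model_grader`, `human_grader`, the `judge` callable, and the 0.7 threshold are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GradeResult:
    # None means no verdict yet: the item is waiting for a human reviewer.
    passed: Optional[bool]
    note: str


def code_grader(output: str, check: Callable[[str], bool]) -> GradeResult:
    """Deterministic grading: run a plain code check against the output."""
    return GradeResult(check(output), "deterministic check")


def model_grader(output: str, judge: Callable[[str], float],
                 threshold: float = 0.7) -> GradeResult:
    """Model-based grading: `judge` is a stand-in for a model call that
    returns a score in [0, 1]; the threshold is an assumed cutoff."""
    score = judge(output)
    return GradeResult(score >= threshold, f"model score {score:.2f}")


def human_grader(output: str) -> GradeResult:
    """Human grading: record no verdict and flag the output for review."""
    return GradeResult(None, "flagged for manual review")


if __name__ == "__main__":
    print(code_grader("ok", lambda s: s == "ok"))
    print(model_grader("draft", lambda s: 0.9))
    print(human_grader("security-sensitive change"))
```

Returning `None` from the human grader keeps "not yet reviewed" distinct from "failed", so a report can list pending reviews separately.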

What is Eval-Driven Development (EDD)?

EDD is a methodology where you define expected AI behavior and success criteria before writing code, treating evaluations as the fundamental unit of AI development progress.

How does this skill measure AI reliability?

It uses two metrics to quantify the reliability of AI solutions: pass@k, which checks whether at least one of k attempts succeeds, and pass^k, which checks whether all k attempts succeed.
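The two metrics can be sketched in a few lines of Python. This is an illustrative sketch, not the skill's own code: the function names `pass_at_k`, `pass_pow_k`, and `expected_rates` are hypothetical, and the last one assumes independent trials with a fixed per-trial success probability.

```python
from typing import Sequence, Tuple


def pass_at_k(trials: Sequence[bool], k: int) -> bool:
    """pass@k: True if at least one of the first k trials succeeded."""
    if not 1 <= k <= len(trials):
        raise ValueError("k must be between 1 and the number of trials")
    return any(trials[:k])


def pass_pow_k(trials: Sequence[bool], k: int) -> bool:
    """pass^k: True only if all of the first k trials succeeded."""
    if not 1 <= k <= len(trials):
        raise ValueError("k must be between 1 and the number of trials")
    return all(trials[:k])


def expected_rates(p: float, k: int) -> Tuple[float, float]:
    """Expected (pass@k, pass^k) under the assumption of independent
    trials that each succeed with probability p."""
    return 1 - (1 - p) ** k, p ** k


if __name__ == "__main__":
    trials = [False, True, True]       # outcomes of three attempts
    print(pass_at_k(trials, 3))        # True: one success is enough
    print(pass_pow_k(trials, 3))       # False: the first attempt failed
    print(expected_rates(0.5, 2))      # (0.75, 0.25)
```

The gap between the two numbers is the point: a task that succeeds half the time has a 75% chance of at least one pass in two attempts but only a 25% chance of passing both, so pass^k is the stricter signal for anything that must work every time.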

Claude Eval Harness

Author: XD3an

Security & Testing

Implements a formal evaluation framework for Claude Code sessions to enable reliable, test-driven AI development.

The Eval Harness skill introduces Eval-Driven Development (EDD) to the Claude Code environment, treating AI evaluations as the modern equivalent of unit tests. It enables developers to define success criteria before implementation, run continuous capability and regression tests, and measure reliability using pass@k metrics. By bridging deterministic code checks with model-based grading, it ensures that AI-generated code remains stable, high-quality, and feature-complete across complex development cycles, providing a structured path from ideation to production-ready code.

Key Features

01 Generates comprehensive evaluation reports to validate readiness for production

02 Supports capability and regression evals with automated success criteria

03 Calculates pass@k and pass^k metrics to track implementation reliability

04 Supports multiple grader types: code-based, model-based, and human-in-the-loop

05 8 GitHub stars

06 Implements Eval-Driven Development (EDD) principles for AI coding

Use Cases

01 Ensuring new AI-generated features don't break existing legacy functionality

02 Creating a standardized testing workflow for AI-assisted software development teams

03 Benchmarking Claude's performance on complex coding tasks across multiple attempts

Install with 🐟 Skill.Fish

npx skillfish add xd3an/awesome-ai-coding-all-in-one eval-harness
