Where are the evaluations stored?

All eval definitions, run logs, and baselines are stored locally in your project's .claude/evals/ directory, allowing them to be versioned alongside your source code.

How does this skill help with regressions?

It lets you create Regression Evals that compare current AI output against a known baseline, so you can detect when a new code change breaks functionality that previously worked.
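The baseline comparison can be pictured with a minimal sketch. It assumes, purely for illustration, that each run is summarized as a mapping from eval name to pass/fail; the function name `find_regressions` and the eval names are hypothetical and are not the skill's actual storage format or API.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return evals that passed in the baseline but fail (or are missing) in the current run."""
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


# Hypothetical results: eval name -> did it pass?
baseline = {"login-flow": True, "api-contract": True, "export-csv": False}
current = {"login-flow": True, "api-contract": False, "export-csv": False}

print(find_regressions(baseline, current))  # ['api-contract']
```

An eval that already failed in the baseline ("export-csv" here) is not reported, because a regression is specifically a pass that turned into a fail.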

Can I use my own test scripts as graders?

Yes, the skill supports code-based graders that can execute shell commands, grep patterns, or npm tests to deterministically verify code changes.
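As a rough illustration of a deterministic, code-based grader, the sketch below treats a command's exit status as the verdict. The helper `run_code_grader` is a hypothetical name and not part of the skill; the stand-in commands use the Python interpreter so the example runs anywhere, where a real project might pass something like `["npm", "test"]` instead.

```python
import subprocess
import sys


def run_code_grader(command: list[str], timeout_s: float = 60.0) -> bool:
    """Pass if the command exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A hung or missing command counts as a failed grade.
        return False
    return result.returncode == 0


# Stand-in commands (assumption): one that succeeds and one that fails.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(run_code_grader(passing))  # True
print(run_code_grader(failing))  # False
```

Relying only on the exit code keeps the grader deterministic: the same code and the same command always give the same verdict.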

What are pass@k and pass^k metrics?

pass@k measures whether at least one of k attempts succeeds (useful for general reliability), while pass^k measures whether all k attempts succeed (useful for critical-path reliability).
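The two metrics can be sketched directly from these definitions. The function names below are illustrative, not the skill's API, and the probability estimates additionally assume that attempts are independent with the same per-attempt success rate `p`.

```python
def pass_at_k(attempts: list[bool]) -> bool:
    """pass@k: at least one of the k attempts succeeded."""
    if not attempts:
        raise ValueError("need at least one attempt")
    return any(attempts)


def pass_all_k(attempts: list[bool]) -> bool:
    """pass^k: every one of the k attempts succeeded."""
    if not attempts:
        raise ValueError("need at least one attempt")
    return all(attempts)


def expected_pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def expected_pass_all_k(p: float, k: int) -> float:
    """Chance that all k independent attempts succeed."""
    return p ** k


attempts = [False, True, True]  # hypothetical outcomes of k = 3 attempts
print(pass_at_k(attempts), pass_all_k(attempts))  # True False

print(round(expected_pass_at_k(0.9, 3), 3))   # 0.999
print(round(expected_pass_all_k(0.9, 3), 3))  # 0.729
```

With a 90% per-attempt success rate, pass@3 is nearly certain while pass^3 drops to about 73%, which is why pass^k is the stricter choice for critical paths.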

What is Eval-Driven Development (EDD)?

EDD is a methodology where you define the expected behavior and success criteria for an AI task before implementation, treating evals as the 'unit tests' of AI development.

Eval Harness for Claude Code

Name: Eval Harness for Claude Code
Author: WorldFlowAI

Security & Testing

Implements Eval-Driven Development (EDD) to rigorously test, validate, and track the reliability of AI-assisted code changes.

The Eval Harness skill brings formal evaluation frameworks to Claude Code, enabling developers to treat AI behavior like unit tests through Eval-Driven Development (EDD). It allows users to define success criteria before coding, run deterministic or model-based graders to verify outputs, and track reliability metrics like pass@k to ensure consistency. This skill is essential for teams looking to move beyond 'vibes-based' AI development into a structured, regression-proof workflow that maintains high code quality across complex refactors and feature additions.

Key Features

01 Automated regression testing to check that new AI changes don't break existing project functionality.

02 Three grader types: deterministic code-based graders, Claude-powered model graders, and human-in-the-loop review.

03 Structured eval storage and versioning within the .claude/ directory for team collaboration.

04 Reliability metrics that track success rates via pass@k and pass^k.

05 Eval-Driven Development (EDD) workflow for defining, implementing, and reporting on AI tasks.

Use Cases

01 Measuring the reliability and consistency of AI-generated code over multiple iterations to quantify production readiness.

02 Performing large-scale refactors while checking for regressions in core business logic or API contracts.

03 Defining clear success criteria before asking Claude to implement a complex feature, so the result can be verified against them.

Install with 🐟 Skill.Fish

npx skillfish add worldflowai/everything-claude-code eval-harness
