What is Eval-Driven Development (EDD)?

EDD is a methodology that treats AI evaluations as unit tests. You define expected behaviors and success criteria before implementing the code, so every change can be checked against them and regressions are caught early.

How does the pass@k metric work in this skill?

The pass@k metric measures the probability that at least one of 'k' attempts produces a successful output. For example, a pass@3 result means the task succeeded in at least one of three trials.
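To make the definition concrete, here is a minimal Python sketch of one common way to estimate pass@k from recorded trials. It assumes you have run n attempts and counted c successes; the function name pass_at_k is hypothetical, and the page does not say how the skill itself computes the number.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n recorded attempts, c of which passed.

    This is the combinatorial estimator: one minus the fraction of
    k-sized subsets of the n attempts that contain no passing attempt.
    (Hypothetical helper, not part of the skill.)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when there are fewer than k failures,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 3 passes in 10 attempts, pass_at_k(10, 3, 1) is about 0.3, and the estimate rises as k grows because more attempts give more chances for at least one success.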

Where are the evaluation results stored?

Evaluation definitions, run histories, and baseline data are stored directly in your project under the .claude/evals/ directory, allowing them to be versioned with your code.

Can I use deterministic tests with Eval Harness?

Yes, Eval Harness supports code-based graders that use bash commands, grep patterns, and existing test runners like npm test to provide objective verification.
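As an illustration of what a code-based grader can look like, the Python sketch below shows two deterministic checks: one that passes when a command exits with status 0, and one that passes when a pattern is found in some text. The names command_grader and pattern_grader are hypothetical; they are not the skill's own API, which this page does not describe.

```python
import re
import subprocess
import sys


def command_grader(argv: list[str], timeout: float = 60.0) -> bool:
    """Pass when the command runs and exits with status 0.

    Hypothetical helper: argv could be a test runner invocation
    such as ["npm", "test"].
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing executable or a hung command counts as a failure.
        return False
    return result.returncode == 0


def pattern_grader(text: str, pattern: str) -> bool:
    """Pass when the regular expression matches anywhere in the text,
    similar in spirit to a grep check."""
    return re.search(pattern, text, flags=re.MULTILINE) is not None
```

For example, command_grader([sys.executable, "-c", "raise SystemExit(0)"]) passes, while the same call with SystemExit(1) fails. Both checks give the same verdict every time for the same input, which is what makes them suitable as objective graders.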

Does it support manual code review?

Yes, it includes a Human Grader type that flags specific changes for manual review, which is recommended for high-risk tasks or security-sensitive code.

Eval Harness for Claude Code

Author: webcrafters-belgium


Security & Testing

Implements an evaluation-driven development framework to test, verify, and track the reliability of Claude's code generation.

Eval Harness introduces Eval-Driven Development (EDD) to your AI-assisted workflow by treating evaluations as the unit tests of AI development. It enables developers to define expected behaviors before implementation, run continuous capability and regression tests, and measure success through robust metrics like pass@k. By providing deterministic code-based graders alongside model-based qualitative assessments, this skill ensures that Claude's contributions are reliable, functional, and free from regressions throughout the development lifecycle.

Key Features

01 Tracks success reliability using standardized pass@k and pass^k metrics

02 Implements Eval-Driven Development (EDD) workflow within Claude sessions

03 Supports deterministic code-based, model-based, and human-in-the-loop graders

04 0 GitHub stars

05 Automates evaluation report generation and status tracking

06 Standardizes eval storage in project-specific .claude/evals directories
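The pass^k metric named in the first feature is usually read as the probability that all k attempts succeed, which makes it a stricter consistency measure than "at least one success". The Python sketch below estimates it from n recorded attempts with c successes. The function name pass_hat_k is hypothetical, and the estimator is an assumption about how such a metric is commonly computed, not a description of the skill's internals.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k from n recorded attempts, c of which passed.

    This is the fraction of k-sized subsets of the n attempts in which
    every attempt passed. (Hypothetical helper, not part of the skill.)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(c, k) is 0 when fewer than k attempts passed.
    return comb(c, k) / comb(n, k)
```

With 9 passes in 10 attempts, pass_hat_k(10, 9, 3) is 0.7: a single failure among ten runs already lowers the chance that three attempts in a row all succeed.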

Use Cases

01 Verifying complex feature implementations with multi-step success criteria

02 Preventing regressions in legacy codebases during AI-driven refactoring

03 Measuring and improving the reliability of Claude's code output over multiple attempts
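The regression use case can be sketched as a comparison between a stored baseline and the latest run. The Python example below assumes results are available as a mapping from eval name to pass/fail; that shape and the function name find_regressions are hypothetical, since this page does not document the skill's file format.

```python
def find_regressions(
    baseline: dict[str, bool], current: dict[str, bool]
) -> list[str]:
    """Return the names of evals that passed in the baseline but do not
    pass in the current run.

    An eval missing from the current run is treated as a regression,
    on the assumption that silently dropped checks should be noticed.
    """
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )
```

Evals that were already failing in the baseline are not reported, so the output lists only behavior that was working before and is broken now.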


Install with 🐟 Skill.Fish

npx skillfish add webcrafters-belgium/webcrafters-studio eval-harness
