Where are the evaluation results stored?

Results are stored locally in your project root under a dedicated .eval-results/ directory, making them easy to track, share, and version-control via Git.

How does it improve AI code reviews?

By enforcing a fixed schema and storing every result, it lets you run the same review several times and see which findings recur consistently and which are likely outliers or hallucinations.
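As a rough illustration of the idea, the sketch below scores each finding by the fraction of runs that reported it. It does not use the skill's own code or schema: the function names, the finding IDs, and the 0.6 threshold are all hypothetical, and findings are assumed to be already normalized to comparable string IDs.

```python
from collections import Counter


def finding_consistency(runs):
    """Map each finding ID to the fraction of runs that reported it.

    `runs` is a list of runs, each a list of finding IDs (hypothetical format).
    """
    counts = Counter()
    for run in runs:
        counts.update(set(run))  # count a finding at most once per run
    total = len(runs)
    return {finding: n / total for finding, n in counts.items()}


def split_by_consistency(runs, threshold=0.5):
    """Split findings into (consistent, outliers) by an assumed threshold."""
    scores = finding_consistency(runs)
    consistent = sorted(f for f, s in scores.items() if s >= threshold)
    outliers = sorted(f for f, s in scores.items() if s < threshold)
    return consistent, outliers


# Three hypothetical review runs over the same code.
runs = [
    ["sql-injection", "missing-auth-check", "unused-import"],
    ["sql-injection", "missing-auth-check"],
    ["sql-injection", "race-condition"],
]

consistent, outliers = split_by_consistency(runs, threshold=0.6)
print(consistent)  # ['missing-auth-check', 'sql-injection']
print(outliers)    # ['race-condition', 'unused-import']
```

A finding that appears in only one run is not necessarily wrong; a low score only flags it for a closer look.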

What metrics does the framework provide?

It calculates Jaccard overlap, precision, recall, severity agreement, and category agreement between two or more evaluation sets.
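For readers unfamiliar with these metrics, here is a minimal sketch of how they are conventionally defined over two sets of findings. It is not the skill's implementation: the function names and sample data are hypothetical, each run is assumed to be a mapping from a normalized finding ID to a severity label, and one run is treated as the reference when computing precision and recall.

```python
def jaccard(a, b):
    """Jaccard overlap: |A ∩ B| / |A ∪ B| (defined here as 1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def precision_recall(candidate, reference):
    """Precision and recall of `candidate` findings against a `reference` set."""
    candidate, reference = set(candidate), set(reference)
    true_positives = len(candidate & reference)
    precision = true_positives / len(candidate) if candidate else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def severity_agreement(run_a, run_b):
    """Fraction of shared findings that both runs rated with the same severity."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        return 0.0
    return sum(run_a[k] == run_b[k] for k in shared) / len(shared)


# Hypothetical runs: finding ID -> severity.
run_a = {"sql-injection": "high", "missing-auth-check": "high", "unused-import": "low"}
run_b = {"sql-injection": "high", "missing-auth-check": "medium", "race-condition": "medium"}

print(jaccard(run_a, run_b))             # 0.5 (2 shared out of 4 distinct findings)
print(precision_recall(run_a, run_b))    # 2 of 3 findings match in each direction
print(severity_agreement(run_a, run_b))  # 0.5 (1 of 2 shared findings agree)
```

Category agreement can be sketched the same way as severity agreement, with category labels in place of severities.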

What is the Eval Framework skill?

It is a specialized tool for Claude Code that structures AI evaluation outputs into a comparable format to measure consistency and completeness across different sessions.

Can I compare different models with this framework?

Yes, you can run evaluations using different Claude models (like Sonnet and Opus) and use this skill to generate a comparison report showing the strengths and unique findings of each.
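The core of such a comparison can be sketched with plain set operations. The example below is an assumption about what "unique findings" means, not the skill's actual report format: `compare_models`, the model keys, and the finding IDs are hypothetical.

```python
def compare_models(findings_by_model):
    """Return findings shared by all models and findings unique to each one.

    `findings_by_model` maps a model name to a list of normalized finding IDs
    (hypothetical format).
    """
    sets = {model: set(found) for model, found in findings_by_model.items()}
    shared = set.intersection(*sets.values()) if sets else set()
    unique = {}
    for model, found in sets.items():
        # Everything reported by any other model.
        others = set().union(*(s for m, s in sets.items() if m != model))
        unique[model] = sorted(found - others)
    return {"shared": sorted(shared), "unique": unique}


report = compare_models({
    "sonnet": ["sql-injection", "unused-import"],
    "opus": ["sql-injection", "race-condition"],
})
print(report["shared"])  # ['sql-injection']
print(report["unique"])  # {'sonnet': ['unused-import'], 'opus': ['race-condition']}
```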

Eval Framework

Name: Eval Framework
Author: maxxentropy

Security & Testing

Standardizes and compares AI-generated evaluations to ensure consistency, accuracy, and reproducibility across multiple runs.

The Eval Framework skill provides a structured meta-framework for managing AI-driven evaluations such as architecture reviews, code audits, and security checks. It addresses the challenge of AI output variance by enforcing a strict YAML schema for findings, storing results in version-controlled files, and providing analytical tools to calculate overlap, precision, and recall between different runs. This allows developers to audit Claude's outputs, validate findings across different models (like Opus vs. Sonnet), and track the evolution of code quality over time with data-backed consistency scores and automated comparison reports.

Key Features

01 Automated comparison engine to calculate Jaccard overlap, precision, and recall

02 Cross-model benchmarking for comparing results from different AI versions

03 0 GitHub stars

04 Normalization system for categorizing issues across different evaluation types

05 Version-control friendly storage convention in .eval-results/ directories

06 Standardized YAML output schema for structured findings and severity ratings

Use Cases

01 Benchmarking different AI models on the same security audit for higher confidence

02 Comparing multiple code review runs to ensure no critical bugs were missed

03 Regression testing to verify if previously identified issues have been successfully resolved
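The regression-testing use case amounts to diffing a stored baseline run against a later one. A minimal sketch, assuming findings are normalized string IDs; `regression_diff` and the sample data are hypothetical and not part of the skill:

```python
def regression_diff(baseline, current):
    """Classify findings as resolved, persisting, or new relative to a baseline run."""
    baseline, current = set(baseline), set(current)
    return {
        "resolved": sorted(baseline - current),    # reported before, absent now
        "persisting": sorted(baseline & current),  # reported in both runs
        "new": sorted(current - baseline),         # reported only in the later run
    }


baseline = ["sql-injection", "missing-auth-check", "unused-import"]
current = ["missing-auth-check", "race-condition"]

diff = regression_diff(baseline, current)
print(diff["resolved"])    # ['sql-injection', 'unused-import']
print(diff["persisting"])  # ['missing-auth-check']
print(diff["new"])         # ['race-condition']
```

Because AI output varies between runs, a finding missing from a single later run may reflect that variance and not an actual fix; comparing against several runs gives a more reliable signal.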

Install with 🐟 Skill.Fish

npx skillfish add maxxentropy/claude-tools eval-framework

A downloadable version of the skill is available for use in Claude.ai and ChatGPT.