Do I need an external API to use this skill?

No. Local evaluation with your own test cases is fully supported. An Ailog API key is required only if you want to benchmark your system against Ailog's production RAG implementation.

What metrics does rag-eval track?

It tracks retrieval metrics (Recall, Precision, MRR, NDCG), generation metrics (Faithfulness, Relevance, Coherence, Conciseness), and latency (P50 and P95 percentiles).
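
As a rough illustration of the retrieval metrics listed above, the sketch below computes them with their standard textbook definitions and binary relevance. The function names and the document IDs are hypothetical and are not taken from the skill itself, which may compute these values differently.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots holding a relevant document (assumes unique IDs)."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result; averaging over queries gives MRR."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: two relevant documents, one of them ranked second.
retrieved_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}
print(recall_at_k(retrieved_ids, relevant_ids, 3))     # 0.5
print(reciprocal_rank(retrieved_ids, relevant_ids))    # 0.5
```

MRR is the mean of `reciprocal_rank` over all test questions; the other three are usually averaged the same way.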

Can it help me create test data?

Yes, the skill can scan your indexed documents and automatically generate representative questions, expected answers, and edge cases to build a test dataset.

Is this compatible with any RAG implementation?

Yes, it is designed to be pipeline-agnostic. You simply need to provide a way to execute your retrieval and generation steps within the evaluation loop.
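
One way to picture a pipeline-agnostic evaluation loop is shown below. This is a minimal sketch, not the skill's actual interface: `run_eval`, `percentile`, the `retrieve` and `generate` callables, and the shape of the test cases are all assumptions made for illustration. The only requirement on the pipeline is that its retrieval and generation steps can be called as plain functions.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def run_eval(
    cases: List[Dict[str, str]],
    retrieve: Callable[[str], List[str]],        # hypothetical: question -> documents
    generate: Callable[[str, List[str]], str],   # hypothetical: question, documents -> answer
) -> List[Dict[str, object]]:
    """Run every test case through the pipeline and time each stage separately."""
    records = []
    for case in cases:
        question = case["question"]
        t0 = time.perf_counter()
        docs = retrieve(question)
        t1 = time.perf_counter()
        answer = generate(question, docs)
        t2 = time.perf_counter()
        records.append({
            "question": question,
            "docs": docs,
            "answer": answer,
            "retrieval_s": t1 - t0,
            "generation_s": t2 - t1,
            "total_s": t2 - t0,
        })
    return records


if __name__ == "__main__":
    # Stand-in pipeline; replace the lambdas with calls into a real RAG system.
    demo = run_eval(
        [{"question": "What is RAG?"}],
        retrieve=lambda q: ["placeholder document"],
        generate=lambda q, docs: f"Answer based on {len(docs)} document(s).",
    )
    totals = [r["total_s"] for r in demo]
    print("P50:", percentile(totals, 50), "P95:", percentile(totals, 95))
```

Timing retrieval and generation separately makes it easier to see which stage dominates the P95 latency.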

How does it measure 'Faithfulness'?

It uses an LLM-as-a-judge approach to verify if the generated response is strictly grounded in and supported by the retrieved document context.
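
The skill's actual judge prompt and scoring scheme are not described here, so the following is only one common way to structure an LLM-as-a-judge faithfulness check: split the answer into sentence-level claims, ask the judge whether each claim is supported by the retrieved context, and report the supported fraction. The prompt text, the `judge` callable, and the helper names are all hypothetical.

```python
import re
from typing import Callable, List

# Hypothetical prompt template; a real judge prompt would likely be more detailed.
JUDGE_PROMPT = (
    "Context:\n{context}\n\n"
    "Claim:\n{claim}\n\n"
    "Is the claim fully supported by the context? Answer YES or NO."
)


def split_claims(answer: str) -> List[str]:
    """Naive sentence splitter; treats each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [part for part in parts if part]


def faithfulness(answer: str, context: str, judge: Callable[[str], str]) -> float:
    """Fraction of claims the judge marks as supported by the context.

    `judge` is any function that sends a prompt to an LLM and returns its text
    reply; it is an assumption of this sketch, not part of the skill.
    """
    claims = split_claims(answer)
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        verdict = judge(JUDGE_PROMPT.format(context=context, claim=claim))
        if verdict.strip().upper().startswith("YES"):
            supported += 1
    return supported / len(claims)
```

A score below 1.0 flags at least one sentence the judge could not ground in the retrieved documents, which is a useful starting point when hunting for hallucinations.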

RAG Evaluation & Benchmarking

Author: davicqueiroz

Data Science & ML

Evaluates and optimizes RAG system performance through comprehensive retrieval, generation, and latency metrics.

The rag-eval skill provides a robust framework for auditing Retrieval-Augmented Generation (RAG) pipelines directly within Claude Code. It allows developers to measure critical performance indicators like retrieval recall and precision, generation faithfulness, and end-to-end latency. Whether you're conducting local evaluations using custom test datasets or benchmarking against production-grade APIs like Ailog, this skill helps identify bottlenecks, detect hallucinations, and refine chunking strategies to ensure high-quality, reliable AI responses.

Key Features

01 Production benchmarking against Ailog's RAG API

02 Comprehensive retrieval metrics including Recall@K, Precision@K, and MRR

03 Detailed latency analysis for retrieval and generation, including P95 thresholds

04 1 GitHub star

05 Automated test dataset generation from existing indexed documents

06 LLM-as-a-judge generation metrics for faithfulness, relevance, and coherence

Use Cases

01 Identifying and fixing hallucinations by measuring response faithfulness

02 Comparing different document chunking strategies to optimize retrieval recall

03 Auditing a RAG pipeline's accuracy before moving from development to production

Install with 🐟 Skill.Fish

npx skillfish add davicqueiroz/claude-rag-skills rag-eval
