About
The RAG Evaluation Skill provides a framework for auditing and optimizing Retrieval-Augmented Generation pipelines directly within Claude Code. It measures retrieval accuracy with standard metrics such as Recall@K and MRR, and assesses generation quality through LLM-as-judge scoring for faithfulness and relevance. Whether you are testing local configurations or benchmarking against production-grade APIs like Ailog, the skill helps surface retrieval bottlenecks, hallucination risks, and latency issues before they degrade response quality.
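As a rough illustration of the kind of metrics involved, the sketch below shows per-query Recall@K and reciprocal rank (averaged over queries, the latter gives MRR), plus a minimal LLM-as-judge faithfulness check. The function names, the `judge` callable, and the document IDs are illustrative assumptions, not the skill's actual API.

```python
from typing import Callable, Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def reciprocal_rank(retrieved_ids: Sequence[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0.0 if none is retrieved).

    Averaging this value over a query set yields MRR.
    """
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def judge_faithfulness(question: str, context: str, answer: str,
                       judge: Callable[[str], str]) -> bool:
    """Ask a judge model whether the answer is grounded in the retrieved context.

    `judge` is a hypothetical callable that sends a prompt to an LLM and returns
    its raw text response; wire it to whatever model client you actually use.
    """
    prompt = (
        "You are grading a RAG answer for faithfulness.\n"
        f"Question: {question}\n"
        f"Retrieved context: {context}\n"
        f"Answer: {answer}\n"
        "Reply with exactly YES if every claim in the answer is supported "
        "by the context, otherwise reply NO."
    )
    return judge(prompt).strip().upper().startswith("YES")


# Example: one query with ground-truth relevant documents (illustrative IDs)
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = {"doc_2", "doc_4"}
print(f"Recall@3        = {recall_at_k(retrieved, relevant, k=3):.2f}")  # 0.50
print(f"Reciprocal rank = {reciprocal_rank(retrieved, relevant):.2f}")   # 0.50
```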