What metrics are supported by this skill?

It supports standard NLP metrics like BLEU and ROUGE, semantic metrics like BERTScore, and specialized custom metrics for toxicity, groundedness, and factuality.

Can I use this for RAG application testing?

Yes, it includes frameworks for measuring retrieval metrics like Precision@K and NDCG, as well as NLI-based groundedness checks to ensure responses are based on the provided context.
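
The two retrieval metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the skill's own implementation: the function names, the choice to divide Precision@K by k (rather than by the number of documents actually returned), and the graded-relevance dictionary are all assumptions made for the example.

```python
import math
from typing import Dict, Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant.

    Assumption: the denominator is k, even if fewer than k documents came back.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(retrieved: Sequence[str], relevance: Dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc, 0.0) for doc in retrieved]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

A ranking that places the most relevant documents first scores an NDCG of 1.0; any other ordering of the same documents scores lower.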

What is regression detection in this context?

Regression detection compares current evaluation results against a saved baseline. If a score drops by more than a set threshold (e.g., 5%), the system flags the change as a performance regression.
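
The baseline comparison could look roughly like the sketch below. It assumes that higher scores are better, that the threshold is a relative drop (0.05 for the 5% example), and that metrics are stored as plain name-to-score dictionaries; `detect_regressions` and the metric names in the usage note are hypothetical, not part of the skill's documented interface.

```python
from typing import Dict


def detect_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 0.05,
) -> Dict[str, float]:
    """Return metrics whose relative drop from the baseline exceeds the threshold.

    Assumptions: higher is better, and the drop is measured relative to the
    baseline score. Metrics missing from `current` or with a zero baseline
    are skipped rather than flagged.
    """
    regressions = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue
        drop = (base - current[name]) / abs(base)
        if drop > threshold:
            regressions[name] = drop
    return regressions
```

For instance, with a baseline of `{"rouge_l": 0.50, "bertscore": 0.80}` and current scores of `{"rouge_l": 0.45, "bertscore": 0.79}`, only `rouge_l` is flagged: it fell by about 10% of its baseline value, while `bertscore` fell by about 1.25%.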

How does the LLM-as-Judge pattern work?

It uses a stronger model (like GPT-4 or Claude 3.5 Sonnet) to evaluate the outputs of other models based on defined criteria like accuracy, helpfulness, and clarity.
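
A pointwise version of this pattern might be structured as follows. Everything here is illustrative: the 1 to 5 scale, the JSON reply format, the prompt wording, and the `call_judge` callable (a stand-in for whatever client sends a prompt to the judge model and returns its text) are assumptions, and no real provider API is shown.

```python
import json
from typing import Callable, Dict, Sequence

# Criteria taken from the description above; the scoring scale is an assumption.
DEFAULT_CRITERIA = ("accuracy", "helpfulness", "clarity")


def build_judge_prompt(
    question: str, answer: str, criteria: Sequence[str] = DEFAULT_CRITERIA
) -> str:
    """Build a grading prompt that asks the judge for one score per criterion."""
    criteria_list = ", ".join(criteria)
    return (
        "You are grading another model's answer.\n"
        f"Score each criterion from 1 (poor) to 5 (excellent): {criteria_list}.\n"
        "Reply with a JSON object mapping each criterion to an integer.\n\n"
        f"Question:\n{question}\n\nAnswer:\n{answer}\n"
    )


def judge_pointwise(
    question: str,
    answer: str,
    call_judge: Callable[[str], str],
    criteria: Sequence[str] = DEFAULT_CRITERIA,
) -> Dict[str, int]:
    """Send the prompt through `call_judge` and parse the per-criterion scores."""
    raw = call_judge(build_judge_prompt(question, answer, criteria))
    scores = json.loads(raw)
    missing = [c for c in criteria if c not in scores]
    if missing:
        raise ValueError(f"judge reply is missing criteria: {missing}")
    return {c: int(scores[c]) for c in criteria}
```

Passing the model call in as a function keeps the parsing logic testable with a stub, and lets the same code run against any judge model.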

LLM Evaluation

Name: LLM Evaluation
Author: ccf

Category: Data Science & ML

Implements comprehensive evaluation frameworks to measure LLM application quality using automated metrics, human feedback, and comparative benchmarks.

The LLM Evaluation skill provides a robust toolkit for systematically assessing the performance of Large Language Model applications. It bridges the gap between development and production by offering standardized methods for automated metric calculation (like BERTScore, ROUGE, and BLEU), LLM-as-Judge patterns, and human annotation workflows. Whether you are validating prompt iterations, comparing model providers, or protecting against regressions, this skill provides the statistical and procedural rigor necessary to build reliable AI systems.

Key Features

01 Automated regression detection against performance baselines

02 Statistical A/B testing framework with significance and effect size calculation

03 LLM-as-Judge patterns for pointwise and pairwise qualitative assessments

04 Automated NLP metrics including BLEU, ROUGE, and BERTScore

05 Customizable metrics for RAG groundedness, toxicity, and factuality
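
As a rough illustration of the statistical A/B testing feature listed above, the sketch below compares per-example scores from two variants. It is not the skill's framework: the choice of Cohen's d for effect size and a two-sided permutation test for significance, along with both function names, are assumptions made for this example.

```python
import random
from statistics import mean, stdev
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Effect size: difference in means (b minus a) over the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0
    return (mean(b) - mean(a)) / pooled_var ** 0.5


def permutation_p_value(
    a: Sequence[float], b: Sequence[float], n_resamples: int = 10_000, seed: int = 0
) -> float:
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(b) - mean(a))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[len(a):]) - mean(pooled[:len(a)]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_resamples + 1)
```

A small p-value suggests the difference between variants is unlikely to be chance, while Cohen's d indicates how large that difference is relative to the spread of the scores.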

Use Cases

01 Validating prompt engineering changes and system instructions before deployment

02 Comparing performance across different foundation models or fine-tuned versions

03 Establishing quality benchmarks for RAG-based knowledge retrieval systems

Install with 🐟 Skill.Fish

npx skillfish add ccf/claude-code-ccf-marketplace llm-evaluation
