Can this help with RAG applications?

Yes. It includes patterns for evaluating Retrieval-Augmented Generation (RAG), such as context-grounding checks and retrieval metrics like MRR (mean reciprocal rank) and Precision@K.
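As an illustration of the two retrieval metrics named above, here is a minimal sketch in plain Python. The function names and the use of document IDs as strings are assumptions made for this example, not the skill's actual API.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def mean_reciprocal_rank(
    runs: Sequence[Sequence[str]], relevant_sets: Sequence[Set[str]]
) -> float:
    """Average over queries of 1/rank of the first relevant result (0 if none)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in zip(runs, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(runs)


# Hypothetical data: one ranked list per query, plus its set of relevant IDs.
runs = [["x", "a"], ["b", "y"], ["z"]]
relevant_sets = [{"a"}, {"b"}, {"q"}]
print(mean_reciprocal_rank(runs, relevant_sets))
print(precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=3))
```

Precision@K here divides by k even when fewer than k documents were retrieved, which is one common convention; other definitions divide by the number of documents actually returned.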

How does it handle regression testing?

It provides a RegressionDetector class that compares new evaluation results against a baseline, flagging significant decreases in performance based on a configurable threshold.
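The baseline comparison described above might look roughly like the following sketch. `SimpleRegressionDetector` is a hypothetical stand-in written for this example and is not the skill's `RegressionDetector`. It assumes every metric is higher-is-better and treats the threshold as a relative drop rather than a statistical significance test.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Regression:
    metric: str
    baseline: float
    current: float
    relative_change: float  # negative means the metric got worse


class SimpleRegressionDetector:
    """Hypothetical sketch: flags metrics that dropped by more than `threshold`."""

    def __init__(self, baseline: Dict[str, float], threshold: float = 0.05):
        self.baseline = baseline
        self.threshold = threshold  # assumed meaning: 0.05 = 5% relative drop

    def check(self, current: Dict[str, float]) -> List[Regression]:
        regressions = []
        for metric, base in self.baseline.items():
            # Skip metrics missing from the new run or with a zero baseline.
            if metric not in current or base == 0:
                continue
            change = (current[metric] - base) / abs(base)
            if change < -self.threshold:
                regressions.append(Regression(metric, base, current[metric], change))
        return regressions


detector = SimpleRegressionDetector({"accuracy": 0.80, "groundedness": 0.90})
for r in detector.check({"accuracy": 0.70, "groundedness": 0.89}):
    print(f"{r.metric}: {r.baseline} -> {r.current} ({r.relative_change:+.1%})")
```

In a CI/CD pipeline, a non-empty result from `check` could be used to fail the build.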

What is LLM-as-Judge?

LLM-as-Judge is a technique in which a more capable model scores the outputs of a smaller or task-specific model against a custom rubric, capturing qualitative aspects that automated metrics often miss.
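A pointwise version of this pattern can be sketched as follows. The prompt wording, the 1 to 5 scale, the `judge_pointwise` name, and the `call_judge` callable are all assumptions for illustration. A real setup would replace `fake_judge` with a call to whichever model client is in use.

```python
import re
from typing import Callable

# Hypothetical rubric prompt; the wording and scale are assumptions.
RUBRIC_PROMPT = (
    "Rate the response from 1 to 5 for {criterion}.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Reply with 'Score: <number>' followed by a one-sentence reason."
)


def judge_pointwise(
    question: str,
    response: str,
    criterion: str,
    call_judge: Callable[[str], str],
) -> int:
    """Ask a judge model for a 1-5 score and parse it from the reply."""
    prompt = RUBRIC_PROMPT.format(
        criterion=criterion, question=question, response=response
    )
    reply = call_judge(prompt)
    match = re.search(r"Score:\s*([1-5])\b", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from judge reply: {reply!r}")
    return int(match.group(1))


def fake_judge(prompt: str) -> str:
    # Stand-in for a real model call, so the sketch runs offline.
    return "Score: 4. The answer is accurate but omits a caveat."


print(judge_pointwise("What is RAG?", "Retrieval plus generation.", "accuracy", fake_judge))
```

Passing the judge in as a callable keeps the scoring logic independent of any particular provider and makes it easy to test with a stub.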

Which automated metrics are supported?

The skill includes implementations for standard metrics like BLEU, ROUGE, and BERTScore, as well as custom metrics for groundedness, toxicity, and factuality.
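To give a feel for what overlap-based metrics such as ROUGE measure, here is a deliberately simplified unigram F1 score. It is an illustrative sketch only: it uses whitespace tokenization and lowercasing, and it is neither the skill's implementation nor a replacement for an established ROUGE or BLEU library.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over unigram overlap between candidate and reference (ROUGE-1 style)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("the cat sat", "the cat sat on the mat"))
```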

LLM Evaluation & Metrics

Author: amurata

Data Science & ML

Implements rigorous evaluation frameworks for Large Language Model applications using automated metrics, LLM-as-judge patterns, and human feedback loops.

This skill provides a comprehensive toolkit for measuring and improving the quality of AI-driven applications. It covers everything from standard NLP metrics like BLEU and ROUGE to advanced 'LLM-as-Judge' methodologies and statistical A/B testing frameworks. By establishing systematic evaluation baselines, developers can confidently detect performance regressions, compare different model versions, and validate prompt engineering improvements throughout the software development lifecycle.

Key Features

01 Statistical A/B testing framework with Cohen's d effect size analysis

02 Regression detection to prevent performance drops before deployment

03 Automated NLP metrics including BLEU, ROUGE, and BERTScore

04 LLM-as-Judge patterns for pointwise and pairwise qualitative assessment

05 Human evaluation structures with inter-rater agreement calculation

GitHub stars: 3
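Two of the statistics named in the Key Features list can be sketched with the standard library alone: Cohen's d for the effect size between two variants in an A/B test, and Cohen's kappa as one possible inter-rater agreement measure. The listing does not say which agreement statistic the skill uses, so kappa is an assumption here, as are the function names.

```python
import math
from collections import Counter
from statistics import mean, variance
from typing import Hashable, Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Effect size of b relative to a, using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)


def cohens_kappa(rater1: Sequence[Hashable], rater2: Sequence[Hashable]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater1)
    observed = sum(x == y for x, y in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scores for prompt variants A and B, and labels from two raters.
print(cohens_d([1, 2, 3], [2, 3, 4]))
print(cohens_kappa(["good", "good", "bad", "bad"], ["good", "bad", "bad", "bad"]))
```

A positive d means variant B scored higher than A on average. By a common rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects.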

Use Cases

01 Establishing groundedness and factuality baselines for RAG systems

02 Detecting quality regressions in CI/CD pipelines for AI applications

03 Comparing the performance of different foundation models or prompt iterations

Install with 🐟 Skill.Fish

npx skillfish add amurata/cc-tools llm-evaluation
