Which automated metrics are supported for text generation?

The skill includes implementations of BLEU (n-gram precision against reference texts), ROUGE (recall-oriented n-gram overlap), METEOR (unigram alignment with stemming and synonym matching), BERTScore (similarity of contextual embeddings), and perplexity.
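
To make the difference between a precision-oriented and a recall-oriented overlap metric concrete, here is a minimal sketch in plain Python. It is a simplified illustration, not the skill's own implementation: the function names are hypothetical, tokenization is a naive whitespace split, and full BLEU additionally combines several n-gram orders with a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """BLEU-style modified n-gram precision for one candidate/reference pair.

    Each candidate n-gram is credited at most as often as it occurs in the
    reference, then the credit is divided by the candidate n-gram count.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N-style recall: share of reference n-grams found in the candidate."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


candidate = "the cat sat"
reference = "the cat sat on the mat"
precision = clipped_precision(candidate, reference)  # every candidate word is in the reference
recall = rouge_n_recall(candidate, reference)        # only half of the reference is covered
```

A short candidate that copies part of the reference scores high on precision but low on recall, which is why the two families of metrics are usually reported together.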

What is the 'LLM-as-judge' approach included in this skill?

It is a pattern that uses a highly capable model to evaluate the output of other models or prompts based on specific criteria like accuracy, helpfulness, and tone, supporting both single-output scoring and pairwise comparisons.
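
A minimal sketch of single-output scoring is shown below. Everything in it is an assumption made for illustration: the prompt wording, the 1 to 5 scale, the `Score: N` reply format, and the `call_model` callable, which stands in for whatever client sends a prompt to the judge model and returns its text reply. The skill's actual prompts and interfaces may differ.

```python
import re
from typing import Callable, Optional, Sequence


def build_judge_prompt(question: str, answer: str, criteria: Sequence[str]) -> str:
    """Assemble a grading prompt (hypothetical wording) for the judge model."""
    criteria_list = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are grading an answer.\n"
        f"Criteria:\n{criteria_list}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        "Reply with a line of the form 'Score: N', "
        "where N is an integer from 1 to 5."
    )


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Extract the score from the judge's reply; None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if not match:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def score_output(
    call_model: Callable[[str], str],
    question: str,
    answer: str,
    criteria: Sequence[str],
) -> Optional[int]:
    """Ask the judge model for a score and parse its reply."""
    prompt = build_judge_prompt(question, answer, criteria)
    return parse_score(call_model(prompt))


# Usage with a stub in place of a real model client:
def stub_judge(prompt: str) -> str:
    return "The answer is accurate and polite.\nScore: 4"


result = score_output(
    stub_judge, "What is 2 + 2?", "4", ["accuracy", "helpfulness", "tone"]
)
```

Returning `None` for an unparseable reply, rather than a default score, keeps malformed judge output from silently skewing averages.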

How does the regression detection work?

It compares current evaluation results against a stored baseline and flags any metric that drops by more than a configurable threshold, so that prompt changes that degrade existing behavior are caught before deployment.
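
The comparison can be sketched as follows. This is an illustration under stated assumptions, not the skill's code: the function name is hypothetical, metrics are passed as plain dictionaries, higher values are assumed to be better, and the threshold is an absolute drop.

```python
def find_regressions(baseline, current, threshold=0.05):
    """Return {metric_name: drop} for metrics that fell by more than threshold.

    Assumes higher is better and that `threshold` is an absolute difference.
    Metrics missing from `current` are skipped rather than flagged.
    """
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue
        drop = base_value - current[name]
        if drop > threshold:
            regressions[name] = drop
    return regressions


baseline = {"accuracy": 0.90, "rouge_l": 0.50}
current = {"accuracy": 0.80, "rouge_l": 0.49}
flagged = find_regressions(baseline, current, threshold=0.05)
# Only "accuracy" is flagged: it fell by about 0.10, while "rouge_l" fell by about 0.01.
```

In a pipeline, a non-empty result would typically fail the build so that the change is reviewed before release.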

Can this skill help evaluate RAG (Retrieval-Augmented Generation) systems?

Yes, it provides specific metrics for retrieval quality, such as Mean Reciprocal Rank (MRR) and NDCG, as well as groundedness checks to ensure responses are based on provided context.
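
For reference, the two retrieval metrics can be computed as in the sketch below. The function names and input shapes are assumptions for illustration. NDCG here uses the linear-gain form with a log2 rank discount, and the ideal ranking is built only from the gains of the retrieved list, which is a simplification when relevant documents were not retrieved at all.

```python
import math


def mean_reciprocal_rank(ranked_relevance):
    """MRR over queries.

    `ranked_relevance` holds one list per query of 0/1 relevance flags in
    ranked order. A query with no relevant result contributes 0.
    """
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(gains, k):
    """NDCG@k: DCG of the top k results divided by the DCG of the ideal order."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


# First relevant hit at rank 2, rank 1, and never: (1/2 + 1 + 0) / 3
mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]])

# Graded relevance already in the best possible order gives a perfect score.
perfect = ndcg_at_k([3, 2, 1], k=3)
```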

LLM Evaluation & Metrics

Author: HermeticOrmus

Data Science & ML

Implements comprehensive evaluation frameworks for LLM applications using automated metrics, human feedback, and LLM-as-judge patterns.

This skill provides a toolkit for systematically assessing the quality and performance of Large Language Model applications throughout the development lifecycle. It spans standard NLP metrics such as BLEU and BERTScore, 'LLM-as-judge' patterns for semantic assessment, and human annotation workflows. Aimed at developers building production AI applications, it includes a statistical framework for A/B testing, regression detection to catch performance drops, and metrics specific to Retrieval-Augmented Generation (RAG) systems, so that decisions rest on measured results rather than anecdotal model responses.

Key Features

01 Automated NLP metrics including BLEU, ROUGE, and BERTScore

02 Automated regression detection to identify performance drops before deployment

03 LLM-as-Judge patterns for pointwise and pairwise qualitative evaluation

04 RAG-specific metrics like MRR, NDCG, and groundedness checks

05 Statistical A/B testing framework with p-value and effect size analysis
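
As an illustration of what p-value and effect size analysis can look like for two prompt variants, here is a minimal standard-library sketch. The choice of test is an assumption: the listing does not say which statistical test the skill uses, so this example uses a two-sided permutation test on the difference of means together with Cohen's d. Function names and sample scores are hypothetical.

```python
import random
import statistics


def cohens_d(a, b):
    """Effect size: difference of means (b minus a) in pooled standard deviations."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0
    return (statistics.mean(b) - statistics.mean(a)) / pooled_var ** 0.5


def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test on the absolute difference of means.

    Group labels are shuffled repeatedly; the p-value is the share of
    shuffles whose difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(statistics.mean(b) - statistics.mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[len(a):]) - statistics.mean(pooled[:len(a)])
        )
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-example scores for two prompt versions.
scores_a = [0.60, 0.62, 0.58, 0.61, 0.59, 0.60]
scores_b = [0.70, 0.72, 0.69, 0.71, 0.68, 0.70]

p_value = permutation_p_value(scores_a, scores_b)
effect = cohens_d(scores_a, scores_b)
```

Reporting the effect size next to the p-value matters because a very small difference can still be statistically significant when the evaluation set is large.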

Use Cases

01 Comparing model performance across different prompt versions or hyperparameters

02 Validating the accuracy and groundedness of RAG-based search results

03 Establishing automated CI/CD testing pipelines for AI application quality

Install with 🐟 Skill.Fish

npx skillfish add hermeticormus/after-the-third-cup llm-evaluation
