What automated metrics are supported by this skill?

It supports reference-based text metrics such as BLEU, ROUGE, METEOR, and BERTScore, as well as retrieval ranking metrics such as MRR and NDCG for RAG systems.
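As an illustration of the two retrieval metrics named above, here is a minimal pure-Python sketch of MRR and NDCG. The function names and input shapes are assumptions made for this example, not the skill's own API.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))

def ndcg_at_k(ranked_ids, gains_by_id, k):
    """DCG of the top-k results divided by the DCG of the ideal ordering."""
    gains = [gains_by_id.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(gains_by_id.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Documents missing from `gains_by_id` are treated as having zero relevance, which is a simplifying assumption of this sketch.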

Does it support regression testing for AI models?

Yes, it includes a RegressionDetector that compares new evaluation results against a baseline and flags statistically significant drops in performance.
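The comparison described above can be sketched roughly as follows. This is not the skill's RegressionDetector; `detect_regression` is a hypothetical function, and the one-sided test with a normal approximation is an assumption chosen to keep the example within the standard library.

```python
import math
from statistics import NormalDist, mean, variance

def detect_regression(baseline, candidate, alpha=0.05):
    """Flag a drop in mean score that is unlikely under 'no change'.

    Uses a one-sided two-sample test with a normal approximation
    (an assumption of this sketch; small samples may warrant a t-test
    or a bootstrap instead).
    """
    if len(baseline) < 2 or len(candidate) < 2:
        raise ValueError("need at least two scores per run")
    diff = mean(candidate) - mean(baseline)
    std_err = math.sqrt(
        variance(baseline) / len(baseline) + variance(candidate) / len(candidate)
    )
    if std_err == 0:
        # No variation in either run: any drop counts as a regression.
        p_value = 0.0 if diff < 0 else 1.0
    else:
        # Probability of seeing a difference this low if the true means were equal.
        p_value = NormalDist().cdf(diff / std_err)
    return {"diff": diff, "p_value": p_value, "regressed": diff < 0 and p_value < alpha}
```

A caller would pass per-example scores from the stored baseline run and the new run, then block deployment when `regressed` is true.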

How does it handle human evaluation?

The skill provides templates for human annotation tasks and includes tools to calculate inter-rater agreement, using metrics like Cohen's Kappa, so you can check how consistently annotators apply the labels.
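For reference, Cohen's Kappa for two raters can be computed as in the sketch below. The function name and list-of-labels input are assumptions made for this example rather than the skill's interface.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's own label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A value of 1.0 means perfect agreement and 0.0 means agreement no better than chance.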

Can I use this for RAG application testing?

Yes, it includes specialized metrics for Retrieval-Augmented Generation, including groundedness checks, context relevance, and retrieval precision/recall at K.
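Retrieval precision and recall at K, mentioned above, can be sketched as follows. The function name and the assumption that retrieved IDs contain no duplicates are choices made for this example.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for one query.

    Assumes retrieved_ids is ranked best-first and contains no duplicates.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k  # share of the k slots filled with relevant documents
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
```

Precision here divides by K even when fewer than K documents were returned, which is one common convention; dividing by the number actually retrieved is another.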

How does the 'LLM-as-Judge' pattern work?

This pattern uses a highly capable model to evaluate the outputs of other models based on specific rubrics, allowing for scalable semantic assessment without constant human intervention.
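A pointwise version of this pattern might look like the sketch below. The rubric text, the `Score:` reply format, and the `call_judge` callable (any function that sends a prompt to a judge model and returns its text) are all assumptions of this example; no specific model API is implied.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; a real one would be tailored to the task.
RUBRIC = (
    "Rate the answer from 1 (poor) to 5 (excellent) for accuracy and helpfulness. "
    "Explain briefly, then end with a line of the form 'Score: <number>'."
)

def build_judge_prompt(question: str, answer: str, rubric: str = RUBRIC) -> str:
    """Assemble the text sent to the judge model."""
    return f"{rubric}\n\nQuestion:\n{question}\n\nAnswer to evaluate:\n{answer}\n"

def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Extract the numeric score, or None if it is missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if not match:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None

def judge_pointwise(
    question: str, answer: str, call_judge: Callable[[str], str]
) -> Optional[int]:
    """Score a single answer using whatever judge function the caller supplies."""
    return parse_score(call_judge(build_judge_prompt(question, answer)))
```

Returning None for unparseable replies lets the caller retry or route the item to human review instead of silently recording a bad score.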

LLM Performance Evaluation

Author: HermeticOrmus

Data Science & ML

Implements comprehensive evaluation frameworks for LLM applications using automated metrics, human feedback, and benchmarking strategies.

This skill provides a framework for measuring and improving the quality of Large Language Model applications. It helps developers implement automated text metrics like BLEU and ROUGE, set up 'LLM-as-Judge' patterns for semantic assessment, and establish human-in-the-loop evaluation workflows. With regression testing and A/B analysis, it helps verify that prompt changes and model updates lead to measurable improvements in accuracy, safety, and helpfulness, replacing anecdotal testing with repeatable, statistically grounded evaluation.
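To make the automated text metrics concrete, here is a simplified sketch of ROUGE-1 F1 (unigram overlap). It assumes lowercase whitespace tokenization with no stemming, so its numbers will differ from those of full ROUGE implementations; the function name is chosen for this example.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """F1 over clipped unigram overlap between a candidate and a reference text."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```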

Key Features

01 LLM-as-Judge patterns for pointwise and pairwise semantic quality assessment

02 Automated text generation metrics including BLEU, ROUGE, and BERTScore

03 Regression detection to identify performance drops before production deployment

04 Human-in-the-loop evaluation structures and inter-rater agreement tools

05 Statistical A/B testing framework with Cohen's d effect size analysis
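For the effect size named in the A/B testing feature, Cohen's d with a pooled standard deviation can be sketched as follows. The function name and the sign convention (variant B minus variant A) are assumptions of this example.

```python
import math
from statistics import mean, variance

def cohens_d(scores_a, scores_b):
    """Standardized mean difference (B minus A) using the pooled standard deviation."""
    n_a, n_b = len(scores_a), len(scores_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("need at least two scores per variant")
    pooled_var = (
        (n_a - 1) * variance(scores_a) + (n_b - 1) * variance(scores_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when neither variant varies")
    return (mean(scores_b) - mean(scores_a)) / math.sqrt(pooled_var)
```

By a common rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects, though the thresholds that matter depend on the application.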

Use Cases

01 Comparing model performance during prompt engineering iterations

02 Establishing production baselines for AI application reliability and safety

03 Validating RAG system accuracy, groundedness, and retrieval quality

Install with 🐟 Skill.Fish

npx skillfish add hermeticormus/libreuiux-claude-code llm-evaluation