What is the 'LLM-as-Judge' approach?

It is a pattern where a highly capable model (like GPT-4 or Claude 3.5 Sonnet) is used to evaluate the outputs of other models based on specific criteria like helpfulness, accuracy, and tone.
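As a rough illustration of the pattern, the sketch below builds a judge prompt, sends it to a model, and parses per-criterion scores. It is not the skill's own implementation: `JUDGE_TEMPLATE`, `judge`, `call_model`, and `fake_judge_model` are hypothetical names, the 1 to 5 scale is an assumption, and the model call is injected as a plain function so no particular provider API is implied.

```python
import json
from typing import Callable

# Hypothetical prompt template; the 1-5 scale is an assumption for illustration.
JUDGE_TEMPLATE = """You are an impartial evaluator.
Rate the response to the question on each criterion from 1 (poor) to 5 (excellent).
Criteria: {criteria}

Question: {question}
Response: {response}

Reply with only a JSON object mapping each criterion to an integer score."""


def judge(question: str, response: str, criteria: list[str],
          call_model: Callable[[str], str]) -> dict[str, int]:
    """Ask a judge model to score `response` and return one score per criterion.

    `call_model` is a hypothetical hook: any function that takes a prompt
    string and returns the judge model's raw text reply.
    """
    prompt = JUDGE_TEMPLATE.format(
        criteria=", ".join(criteria), question=question, response=response
    )
    raw = call_model(prompt)
    scores = json.loads(raw)  # raises if the judge did not return valid JSON
    # Keep only the requested criteria and clamp each score into the 1-5 range.
    return {c: max(1, min(5, int(scores[c]))) for c in criteria}


def fake_judge_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return '{"helpfulness": 4, "accuracy": 5, "tone": 7}'


scores = judge("What is 2 + 2?", "4",
               ["helpfulness", "accuracy", "tone"], fake_judge_model)
print(scores)
```

In practice `fake_judge_model` would be replaced by a call to whichever judge model you use; injecting it as a function keeps the scoring logic testable without network access.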

Can I use this for regression testing?

Yes, the skill includes a RegressionDetector class that compares new model outputs against a baseline to flag significant drops in performance across your chosen metrics.
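The interface of the skill's RegressionDetector is not shown here, so the following is only a minimal sketch of the underlying idea: compare each metric against its baseline and flag drops beyond a tolerance. The function name `find_regressions`, the 5% relative threshold, and the sample scores are all assumptions, and the sketch treats every metric as "higher is better".

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     max_relative_drop: float = 0.05) -> dict[str, float]:
    """Return metrics whose relative drop from baseline exceeds the tolerance.

    Hypothetical helper, not the skill's RegressionDetector API.
    Assumes higher values are better for every metric.
    """
    regressions = {}
    for metric, base in baseline.items():
        # Skip metrics missing from the candidate run or with an unusable baseline.
        if metric not in candidate or base <= 0:
            continue
        drop = (base - candidate[metric]) / base
        if drop > max_relative_drop:
            regressions[metric] = drop
    return regressions


# Illustrative numbers only, not real results.
baseline = {"rouge_l": 0.50, "groundedness": 0.90}
candidate = {"rouge_l": 0.49, "groundedness": 0.72}

flagged = find_regressions(baseline, candidate)
for metric, drop in flagged.items():
    print(f"{metric} dropped by {drop:.0%} relative to baseline")
```

A check like this could fail a CI job whenever `flagged` is non-empty; a fixed threshold ignores run-to-run noise, so a statistical test is a reasonable next step when scores vary between runs.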

Which automated metrics are supported by this skill?

The skill supports a wide range of metrics including BLEU and ROUGE for overlap, BERTScore for semantic similarity, and RAG-specific metrics like MRR, NDCG, and groundedness checks.
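To make the retrieval metrics concrete, here is a small sketch of MRR and NDCG@k in plain Python using their standard definitions. The function names and input shapes are assumptions chosen for illustration, not the skill's API, and the ideal ranking for NDCG is derived from the same list of relevance grades, which assumes that list contains every relevant document.

```python
import math


def mean_reciprocal_rank(results: list[list[bool]]) -> float:
    """MRR over queries; each inner list marks ranked results as relevant or not."""
    if not results:
        return 0.0
    total = 0.0
    for ranked in results:
        for rank, relevant in enumerate(ranked, start=1):
            if relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(results)


def ndcg_at_k(relevances: list[float], k: int) -> float:
    """NDCG@k for one query, given graded relevance in ranked order."""
    def dcg(scores: list[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(s / math.log2(i + 2) for i, s in enumerate(scores[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# First query finds a relevant document at rank 2, second at rank 1.
mrr = mean_reciprocal_rank([[False, True], [True, False]])
print(mrr)
print(ndcg_at_k([3, 2, 1], k=3))
```

Overlap and embedding metrics such as BLEU, ROUGE, and BERTScore are usually computed with an existing library rather than reimplemented by hand.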

Is this skill suitable for RAG (Retrieval-Augmented Generation) apps?

Absolutely. It includes specific patterns for checking if a response is grounded in the provided context and measuring the relevance of retrieved documents.
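The skill's actual groundedness patterns are not reproduced here. As a deliberately crude stand-in, the sketch below scores a response by the share of its sentences whose words mostly appear in the retrieved context. The function name `groundedness`, the 0.6 overlap threshold, and the token-overlap heuristic itself are assumptions; production checks more commonly use an entailment model or an LLM judge.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(response: str, context: str, min_overlap: float = 0.6) -> float:
    """Fraction of response sentences that are lexically supported by the context.

    Hypothetical heuristic: a sentence counts as supported when at least
    `min_overlap` of its distinct tokens also occur in the context.
    """
    context_tokens = _tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        tokens = _tokens(sentence)
        if tokens and len(tokens & context_tokens) / len(tokens) >= min_overlap:
            supported += 1
    return supported / len(sentences)


context = "The Eiffel Tower is in Paris. It was completed in 1889."
response = "The Eiffel Tower is in Paris. It has a rooftop swimming pool."

score = groundedness(response, context)
print(score)  # the second sentence is not supported by the context
```

Token overlap cannot detect paraphrase or negation, so a score from this sketch is best read as a cheap first filter before a stronger check.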

LLM Evaluation Suite

Author: bdeornelas


Data Science & ML

Implements robust evaluation frameworks for Large Language Model applications using automated metrics, human feedback, and statistical testing.

This skill provides a toolkit for measuring and improving the performance of LLM-based applications. It lets developers implement standard metrics such as BERTScore and ROUGE, set up LLM-as-judge patterns for qualitative assessment, and manage structured human evaluation workflows. Its regression detection and statistical A/B testing help verify that prompt changes and model updates produce measurable improvements without degrading production quality.

Key Features

01 Statistical A/B testing framework with significance and effect size calculations

02 LLM-as-Judge implementation for automated qualitative and pairwise comparisons

03 Regression detection system to identify performance drops before deployment

04 RAG-specific evaluation including groundedness, MRR, and NDCG metrics

05 Automated metrics for text generation including BLEU, ROUGE, and BERTScore
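
The significance and effect size calculations mentioned in the feature list can be sketched with the standard library alone. This is one possible approach, not necessarily the one the skill uses: a two-sided permutation test for the p-value and Cohen's d for the effect size. The function names, the resample count, and the sample scores are assumptions made for illustration.

```python
import random
import statistics


def cohens_d(a: list[float], b: list[float]) -> float:
    """Effect size of b relative to a, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(b) - statistics.fmean(a)) / pooled_var ** 0.5


def permutation_p_value(a: list[float], b: list[float],
                        n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(statistics.fmean(b) - statistics.fmean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[len(a):])
                   - statistics.fmean(pooled[:len(a)]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Illustrative judge scores for two prompt versions, not real results.
variant_a = [3.1, 3.4, 2.9, 3.3, 3.0, 3.2]
variant_b = [4.0, 4.2, 3.9, 4.1, 4.3, 3.8]

p_value = permutation_p_value(variant_a, variant_b)
effect = cohens_d(variant_a, variant_b)
print(f"p = {p_value:.4f}, Cohen's d = {effect:.2f}")
```

Reporting the effect size next to the p-value helps separate a difference that is statistically detectable from one that is large enough to matter.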

Use Cases

01 Establishing automated quality gates in CI/CD pipelines for AI features

02 Comparing the performance of different LLM models or prompt versions

03 Measuring the accuracy and retrieval quality of RAG-based systems


Install with 🐟 Skill.Fish

npx skillfish add bdeornelas/bdeornelas.github.io llm-evaluation

For use in Claude.ai and ChatGPT
