Can I use this for RAG (Retrieval-Augmented Generation) applications?

Yes. It includes patterns for checking groundedness, context relevance, and retrieval quality, which help you verify that your RAG pipeline returns accurate, well-supported answers.

How does the 'LLM-as-Judge' pattern work?

This pattern uses a highly capable model (like GPT-4 or Claude 3.5 Sonnet) to evaluate the outputs of other models based on specific criteria like accuracy, helpfulness, and tone.
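As a rough illustration of the pointwise variant, the sketch below asks a judge model to rate one answer on each criterion and parses the reply. It is not the skill's own code: the prompt wording, the 1 to 5 scale, the function name judge_pointwise, and the call_judge hook (any function that sends a prompt to your judge model and returns its text reply) are all assumptions made for this example.

```python
import json
from typing import Callable

# Hypothetical prompt template; doubled braces are literal braces for str.format.
JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer from 1 to 5 on each criterion: {criteria}.
Reply with JSON only, e.g. {{"accuracy": 4}}."""


def judge_pointwise(question: str, answer: str, criteria: list[str],
                    call_judge: Callable[[str], str]) -> dict[str, int]:
    """Score a single answer on each criterion using a judge model.

    call_judge is an assumed hook: it takes the prompt text and returns the
    judge model's raw reply. Wire it to whichever model client you use.
    """
    prompt = JUDGE_PROMPT.format(question=question, answer=answer,
                                 criteria=", ".join(criteria))
    scores = json.loads(call_judge(prompt))
    # Keep only the requested criteria and clamp each score to the 1..5 scale.
    return {c: max(1, min(5, int(scores[c]))) for c in criteria}
```

Passing the model call in as a function keeps the scoring logic testable with a canned reply and independent of any particular provider SDK.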

What metrics does this skill support for automated evaluation?

It supports BLEU and ROUGE for text overlap, BERTScore for semantic similarity, and retrieval metrics for RAG such as MRR (Mean Reciprocal Rank), NDCG (Normalized Discounted Cumulative Gain), and Precision@K.
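For readers unfamiliar with the retrieval metrics named above, here is a minimal standard-library sketch of Precision@K, MRR, and NDCG. The function names and input shapes are assumptions for illustration, not the skill's API, and the NDCG shown uses the linear-gain form of DCG (some libraries use an exponential gain instead).

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are in the relevant set."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved, gains, k):
    """NDCG@k with linear gain; gains maps doc id -> graded relevance."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, with retrieved = ["a", "b", "c"] and only "b" relevant, precision_at_k(retrieved, {"b"}, 2) is 0.5 and reciprocal_rank(retrieved, {"b"}) is 0.5, since the first relevant document sits at rank 2.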

How do I detect if a new prompt causes a performance regression?

The skill includes a RegressionDetector class that compares new evaluation results against a baseline and flags any metric drops that exceed a specified threshold.
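The comparison described above can be sketched as follows. Only the class name RegressionDetector comes from the skill; the constructor arguments, the check method, the Regression record, the relative-drop rule, and the assumption that higher metric values are better are all hypothetical choices for this example and may differ from the skill's actual interface.

```python
from dataclasses import dataclass


@dataclass
class Regression:
    """One flagged metric (hypothetical record type for this sketch)."""
    metric: str
    baseline: float
    current: float
    relative_drop: float


class RegressionDetector:
    """Illustrative stand-in, not the skill's implementation."""

    def __init__(self, baseline: dict[str, float], threshold: float = 0.05):
        self.baseline = baseline      # metric name -> baseline score
        self.threshold = threshold    # maximum tolerated relative drop

    def check(self, current: dict[str, float]) -> list[Regression]:
        """Return metrics whose relative drop exceeds the threshold.

        Assumes higher is better for every metric. Metrics missing from
        the current run or with a zero baseline are skipped.
        """
        flagged = []
        for metric, base in self.baseline.items():
            if metric not in current or base == 0:
                continue
            drop = (base - current[metric]) / abs(base)
            if drop > self.threshold:
                flagged.append(Regression(metric, base, current[metric], drop))
        return flagged
```

With a baseline of {"accuracy": 0.90, "rouge_l": 0.50} and a new run of {"accuracy": 0.80, "rouge_l": 0.49}, a 5% threshold flags accuracy (about an 11% relative drop) but not rouge_l (a 2% drop).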

LLM Evaluation Framework

Author: apassuello

Category: Data Science & ML

Implements comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and comparative benchmarking.

This skill provides a framework for measuring and improving the quality of LLM applications, from raw model outputs to production readiness. It gives developers implementation patterns for automated metrics such as BLEU and BERTScore, LLM-as-judge patterns for qualitative assessment, and statistical A/B testing logic to validate improvements. Whether you are debugging unexpected behavior, detecting performance regressions before deployment, or comparing models, it offers a structured methodology for building confidence in LLM-powered systems.
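As one illustration of the statistical A/B testing mentioned above, the sketch below computes Cohen's d, the effect size listed under Key Features, for two sets of per-example scores. The function name and the choice of a pooled sample standard deviation are assumptions for this example; the skill's own implementation may differ.

```python
import math
from statistics import mean, variance


def cohens_d(scores_a, scores_b):
    """Effect size of variant B relative to variant A.

    Uses the pooled sample standard deviation. Each input needs at least
    two scores. Positive values mean B scored higher on average.
    """
    n_a, n_b = len(scores_a), len(scores_b)
    pooled_var = ((n_a - 1) * variance(scores_a) +
                  (n_b - 1) * variance(scores_b)) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("both samples have zero variance; d is undefined")
    return (mean(scores_b) - mean(scores_a)) / math.sqrt(pooled_var)
```

By common convention, a d near 0.2 is read as a small effect, 0.5 as medium, and 0.8 as large. An effect size complements a significance test: it describes how big the difference between two prompts or models is, not only whether one was detected.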

Key Features

01 LLM-as-Judge patterns for pointwise and pairwise model comparisons

02 Automated metrics integration including BLEU, ROUGE, and BERTScore

03 Retrieval evaluation for RAG systems using MRR and NDCG

04 Automated regression detection to flag performance drops before production

05 Statistical A/B testing framework with Cohen's d effect size analysis

Use Cases

01 Building automated testing pipelines to detect hallucinations in RAG applications

02 Comparing the performance of different LLMs or prompt versions systematically

03 Establishing quality baselines and tracking model improvements over time

Install with 🐟 Skill.Fish

npx skillfish add apassuello/multimodal_insight_engine llm-evaluation
