Can I use this skill to evaluate RAG applications?

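Yes. It includes retrieval metrics for Retrieval-Augmented Generation (RAG), namely MRR (Mean Reciprocal Rank), NDCG (Normalized Discounted Cumulative Gain), and Precision@K, plus custom 'groundedness' checks that use natural language inference (NLI) models.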
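To make the retrieval metrics named above concrete, here is a minimal sketch of Precision@K, MRR, and NDCG@K in plain Python. The function names and signatures are illustrative assumptions and are not taken from the skill itself; the formulas are the standard textbook definitions, with NDCG using linear gains and a log2 discount.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_k(retrieved: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the top-k ranking divided by the DCG of the ideal ranking."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, with `retrieved = ["d3", "d1", "d7"]` and `relevant = {"d1", "d2"}`, the first relevant hit is at rank 2, so the reciprocal rank is 0.5 and Precision@3 is 1/3.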

What automated metrics are supported for text generation?

The skill supports standard metrics such as BLEU (n-gram precision, commonly used for translation), ROUGE (n-gram recall, commonly used for summarization), METEOR, BERTScore (embedding-based semantic similarity), and perplexity.
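As a rough illustration of what two of these metrics measure, the sketch below computes a simplified ROUGE-1 recall (unigram overlap, whitespace tokenization, no stemming) and perplexity from per-token log-probabilities. Both functions are hypothetical helpers written for this example, not the skill's implementation; production evaluations would normally rely on an established metrics library.

```python
import math
from collections import Counter
from typing import Sequence


def rouge1_recall(candidate: str, reference: str) -> float:
    """Share of reference unigrams that also appear in the candidate (clipped counts)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
    return overlap / total


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A candidate of "the cat sat" against the reference "the cat sat on the mat" recovers 3 of the 6 reference tokens, giving a recall of 0.5. A model that assigns probability 0.5 to every token has a perplexity of 2.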

What is the 'LLM-as-Judge' approach?

It is an evaluation pattern in which a stronger model scores a single output (pointwise) or compares two outputs (pairwise) against criteria such as accuracy, helpfulness, and clarity.
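A pointwise version of this pattern might look like the sketch below. The prompt wording, the 1 to 5 scale, and the `call_judge` callable (any function that sends a prompt to a judge model and returns its text reply) are all assumptions made for illustration; the skill's own prompts and interfaces may differ.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; the criteria mirror those named in the answer above.
JUDGE_PROMPT = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer from 1 (poor) to 5 (excellent) for accuracy, "
    "helpfulness, and clarity.\n"
    'Reply with a line of the form "Score: <number>".'
)


def judge_pointwise(
    question: str,
    answer: str,
    call_judge: Callable[[str], str],
) -> Optional[int]:
    """Ask the judge model for a 1-5 score; return None if the reply cannot be parsed."""
    reply = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

Passing the judge in as a plain callable keeps the sketch independent of any particular model provider and makes it easy to substitute a stub in tests. Returning `None` on an unparseable reply lets the caller decide whether to retry or discard the sample.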

How does this skill help with AI regression testing?

It provides a RegressionDetector class that compares new evaluation results against a saved baseline and flags any metric whose score drops by more than a defined threshold.
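The comparison described above can be sketched as follows. This is not the skill's RegressionDetector; the class name `SimpleRegressionDetector`, its constructor, the `check` method, and the relative-drop rule are assumptions chosen for illustration, and the sketch treats every metric as higher-is-better.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Regression:
    metric: str
    baseline: float
    current: float
    relative_drop: float  # e.g. 0.125 means a 12.5% drop from baseline


class SimpleRegressionDetector:
    """Hypothetical detector: flags metrics that fall too far below a baseline."""

    def __init__(self, baseline: Dict[str, float], threshold: float = 0.05):
        self.baseline = dict(baseline)
        self.threshold = threshold  # maximum tolerated relative drop

    def check(self, current: Dict[str, float]) -> List[Regression]:
        regressions = []
        for metric, base in self.baseline.items():
            # Skip metrics missing from the new run or with a non-positive baseline.
            if metric not in current or base <= 0:
                continue
            drop = (base - current[metric]) / base
            if drop > self.threshold:
                regressions.append(Regression(metric, base, current[metric], drop))
        return regressions
```

With a baseline of `{"rouge1": 0.50, "accuracy": 0.80}` and new results of `{"rouge1": 0.49, "accuracy": 0.70}`, only `accuracy` is flagged under the default 5% threshold, since it dropped by 12.5% while `rouge1` dropped by 2%.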

LLM Performance Evaluation

Author: simplysmartai

Data Science & ML

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking.

This skill provides a robust framework for measuring and improving the quality of Large Language Model (LLM) applications. It enables developers to implement systematic testing across multiple dimensions, including automated linguistic metrics like BLEU and ROUGE, semantic similarity via BERTScore, and advanced 'LLM-as-Judge' patterns. By facilitating A/B testing, regression detection, and human annotation workflows, it helps teams move beyond anecdotal evidence to data-driven model optimization and production-ready reliability.

Key Features

01 LLM-as-Judge patterns for automated pointwise and pairwise quality grading

02 Regression detection to prevent performance drops during model or prompt updates

03 Retrieval-Augmented Generation (RAG) metrics like MRR, NDCG, and groundedness

04 Automated text generation metrics including BLEU, ROUGE, and BERTScore

05 Statistical A/B testing framework with Cohen’s d effect size calculation
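For the effect size mentioned in the A/B testing feature, a minimal sketch of Cohen's d with a pooled standard deviation is shown below. The function name and the convention of reporting variant B minus variant A are assumptions for this example and are not taken from the skill.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Standardized mean difference (B minus A) using the pooled sample variance."""
    n_a, n_b = len(scores_a), len(scores_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    pooled_var = (
        (n_a - 1) * variance(scores_a) + (n_b - 1) * variance(scores_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(scores_b) - mean(scores_a)) / math.sqrt(pooled_var)
```

A positive value means variant B scored higher on average. By the commonly cited rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects respectively.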

Use Cases

01 Detecting performance regressions before deploying updates to production AI systems

02 Comparing performance between different models or prompt versions during development

03 Validating the groundedness and factual accuracy of RAG-based chatbots

Install with 🐟 Skill.Fish

npx skillfish add simplysmartai/5cypressautomation llm-evaluation

For use in Claude.ai and ChatGPT
