How does it handle performance regressions?

It includes a Regression Detector module that compares current evaluation results against a baseline and flags any metric drops that exceed a defined threshold.
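The comparison described above can be sketched in a few lines. This is an illustration only, not the skill's actual Regression Detector: the function name `detect_regressions`, the `Regression` record, the relative-drop definition, and the 5% default threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Regression:
    metric: str
    baseline: float
    current: float
    drop: float  # relative drop, e.g. 0.10 means 10% below baseline


def detect_regressions(baseline, current, threshold=0.05):
    """Flag metrics whose relative drop from baseline exceeds `threshold`.

    Assumption: higher is better for every metric and baselines are positive.
    """
    regressions = []
    for name, base_value in baseline.items():
        if name not in current or base_value <= 0:
            continue  # skip metrics that cannot be compared
        drop = (base_value - current[name]) / base_value
        if drop > threshold:
            regressions.append(Regression(name, base_value, current[name], drop))
    return regressions


# Hypothetical scores, for illustration only.
baseline = {"accuracy": 0.90, "rouge_l": 0.50}
current = {"accuracy": 0.80, "rouge_l": 0.49}
flagged = detect_regressions(baseline, current, threshold=0.05)
for r in flagged:
    print(f"{r.metric}: {r.baseline:.2f} -> {r.current:.2f} ({r.drop:.1%} drop)")
```

In a CI/CD gate, a non-empty `flagged` list would typically fail the build.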

Which automated metrics are supported by this skill?

The skill supports BLEU, ROUGE, METEOR, and BERTScore for text generation, as well as RAG-specific retrieval metrics such as MRR, NDCG, and Precision@K.
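For readers unfamiliar with the retrieval metrics, here is a minimal sketch of MRR, Precision@K, and NDCG using binary relevance. The function names and the sample document IDs are hypothetical, and the skill's own implementations may differ (for example, by supporting graded relevance).

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """MRR over a list of (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, d in enumerate(ranked_ids[:k], start=1)
        if d in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical retrieval result for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
rr = reciprocal_rank(ranked, relevant)
p3 = precision_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=4)
```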

What is the 'LLM-as-Judge' approach?

It is a pattern in which a more capable model (such as GPT-4 or Claude 3.5 Sonnet) evaluates the outputs of other models against specific criteria such as helpfulness, accuracy, and tone.
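A rough sketch of the pattern follows. It assumes the judge is reachable through some callable that takes a prompt string and returns text; `call_judge`, `build_judge_prompt`, the 1 to 5 scale, and the JSON reply format are all choices made for this example, not the skill's interface. A stub stands in for the real model call so the snippet runs offline.

```python
import json

CRITERIA = ("helpfulness", "accuracy", "tone")


def build_judge_prompt(question, answer, criteria=CRITERIA):
    """Assemble a grading prompt that asks for one integer score per criterion."""
    rubric = "\n".join(
        f"- {c}: integer from 1 (poor) to 5 (excellent)" for c in criteria
    )
    return (
        "You are grading an assistant's answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Score each criterion:\n"
        f"{rubric}\n"
        "Reply with a JSON object mapping each criterion to its score."
    )


def judge_answer(question, answer, call_judge, criteria=CRITERIA):
    """Send the prompt to the judge and validate the scores it returns."""
    raw = call_judge(build_judge_prompt(question, answer, criteria))
    parsed = json.loads(raw)
    scores = {}
    for c in criteria:
        if c not in parsed:
            raise ValueError(f"judge omitted criterion: {c}")
        value = int(parsed[c])
        if not 1 <= value <= 5:
            raise ValueError(f"score out of range for {c}: {value}")
        scores[c] = value
    return scores


def stub_judge(prompt):
    # Stand-in for a real model call; replace with your provider's client.
    return '{"helpfulness": 4, "accuracy": 5, "tone": 3}'


scores = judge_answer("What is RAG?", "Retrieval-Augmented Generation.", stub_judge)
```

Validating the judge's reply matters in practice, because a judge model can return malformed JSON or skip a criterion.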

Can this skill help with RAG (Retrieval-Augmented Generation)?

Yes, it includes specialized metrics for retrieval performance and groundedness to ensure your RAG system is fetching the right context and citing it accurately.
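To make "groundedness" concrete, here is a deliberately crude lexical proxy: the share of answer sentences whose words mostly appear in the retrieved context. This is an assumption-laden illustration, not how the skill measures groundedness; production checks usually rely on an entailment model or an LLM judge. The `groundedness` function, its `min_overlap` parameter, and the sample texts are hypothetical.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer, context_chunks, min_overlap=0.6):
    """Fraction of answer sentences with at least `min_overlap` of their
    tokens present somewhere in the retrieved context."""
    context_tokens = set()
    for chunk in context_chunks:
        context_tokens |= _tokens(chunk)

    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0

    supported = 0
    for sentence in sentences:
        toks = _tokens(sentence)
        if toks and len(toks & context_tokens) / len(toks) >= min_overlap:
            supported += 1
    return supported / len(sentences)


context = ["The Eiffel Tower is in Paris and was completed in 1889."]
answer = "The Eiffel Tower is in Paris. It was designed by aliens from Mars."
score = groundedness(answer, context)
```

Here the first sentence is fully covered by the context and the second is not, so half of the answer counts as grounded.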

LLM Application Evaluation

Author: yusoofsh

Data Science & ML

Implements comprehensive evaluation frameworks for LLM applications using automated metrics, human feedback, and systematic benchmarking.

This skill provides a robust toolkit for measuring and improving the quality of AI applications throughout the development lifecycle. It covers a wide spectrum of evaluation strategies, from traditional NLP metrics like BLEU and ROUGE to modern embedding-based assessments like BERTScore and LLM-as-Judge patterns. Developers can use this skill to establish rigorous baselines, validate prompt engineering changes through A/B testing, and detect performance regressions in RAG systems or classification models before they reach production.

Key Features

01 LLM-as-Judge patterns for automated qualitative assessment

02 Automated NLP metrics including BLEU, ROUGE, and BERTScore

03 RAG-specific evaluation for retrieval precision and groundedness

04 Statistical A/B testing framework with Cohen's d effect size analysis

05 Automated regression detection for CI/CD quality gates
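As an illustration of the effect size named in the feature list, Cohen's d for two prompt variants can be computed with the standard pooled-standard-deviation formula. The function name and the sample scores are invented for the example, and the skill's own A/B framework may expose a different interface.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    A positive value means sample `b` scores higher on average than `a`.
    """
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)


# Hypothetical per-run quality scores for two prompt versions.
prompt_a = [0.70, 0.72, 0.68, 0.71, 0.69]
prompt_b = [0.75, 0.78, 0.74, 0.77, 0.76]
d = cohens_d(prompt_a, prompt_b)
```

By the usual rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects; d reports the size of a difference and should be paired with a significance test.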

Use Cases

01 Comparing performance between different model providers or prompt versions

02 Validating RAG system accuracy and document retrieval relevance

03 Establishing systematic quality benchmarks for production AI features

Install with 🐟 Skill.Fish

npx skillfish add yusoofsh/dotfiles llm-evaluation
