What metrics does this skill support for LLM testing?

It supports lexical-overlap metrics (BLEU, ROUGE), semantic similarity (BERTScore), and retrieval metrics for RAG systems (MRR, NDCG).
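
To make the retrieval metrics concrete, here is a minimal sketch of MRR and NDCG in plain Python. It is not the skill's own implementation: the function names are illustrative, relevance is assumed to be a list of numeric grades in ranked order, and NDCG uses the linear-gain form (some libraries use an exponential gain instead).

```python
import math


def reciprocal_rank(relevances):
    """Return 1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over a list of per-query relevance lists."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

Each inner list holds the relevance grades of one query's results in the order the retriever returned them, so `mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])` averages a first hit at rank 2 with a first hit at rank 1.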

Does it support human-in-the-loop evaluation?

Yes. It provides structures for human annotation tasks and tools that calculate inter-rater agreement with Cohen's kappa, so you can check that annotators are scoring consistently.
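
For reference, Cohen's kappa for two raters can be sketched as below. This is an illustrative stand-alone function, not the skill's API; it assumes both raters labelled the same items in the same order with categorical labels.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if each rater labelled at random with their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A value of 1.0 means perfect agreement, 0.0 means agreement no better than chance, and negative values mean the raters agree less often than chance would predict.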

How does the 'LLM-as-judge' pattern work?

This pattern uses a highly capable model (like Claude 3.5 Sonnet or GPT-4o) to evaluate the outputs of other models based on specific rubrics, providing both quantitative scores and qualitative reasoning.
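
A pointwise judge can be sketched roughly as follows. Everything here is an assumption made for illustration: the prompt wording, the 1 to 5 scale, the `Judgment` type, and the `call_model` parameter, which stands in for whatever client function sends a prompt to your judge model and returns its text reply. No real provider API is called.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Judgment:
    score: int       # quantitative score on the rubric's scale
    reasoning: str   # the judge's qualitative explanation


PROMPT_TEMPLATE = """You are grading a model response against a rubric.
Rubric: {rubric}
Question: {question}
Response: {response}
Reply with JSON only: {{"score": <integer 1-5>, "reasoning": "<one or two sentences>"}}"""


def judge_pointwise(question: str, response: str, rubric: str,
                    call_model: Callable[[str], str]) -> Judgment:
    """Ask the judge model to score one response and parse its JSON verdict."""
    prompt = PROMPT_TEMPLATE.format(rubric=rubric, question=question, response=response)
    data = json.loads(call_model(prompt))
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"Judge returned an out-of-range score: {score}")
    return Judgment(score=score, reasoning=str(data["reasoning"]))


def fake_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, used only to exercise the code."""
    return json.dumps({"score": 4, "reasoning": "Accurate but omits one caveat."})
```

Calling `judge_pointwise("What is RAG?", "Retrieval plus generation.", "Score factual accuracy.", fake_judge)` returns a `Judgment` with both the score and the reasoning. In practice, replace `fake_judge` with a function that calls your chosen judge model.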

Can I use this skill to prevent regressions in my AI app?

Yes, it includes a RegressionDetector that compares new results against a baseline and flags significant performance drops based on a configurable threshold.
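
The skill's actual `RegressionDetector` interface is not shown here, so the sketch below uses a hypothetical class, `ThresholdRegressionDetector`, to illustrate the idea. It assumes every metric is higher-is-better and treats the threshold as a relative drop from the baseline.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Regression:
    metric: str
    baseline: float
    current: float
    relative_drop: float


class ThresholdRegressionDetector:
    """Hypothetical detector: flags metrics that fell by more than `threshold`."""

    def __init__(self, baseline: Dict[str, float], threshold: float = 0.05):
        self.baseline = baseline
        self.threshold = threshold  # e.g. 0.05 means a 5% relative drop

    def check(self, current: Dict[str, float]) -> List[Regression]:
        regressions = []
        for metric, base in self.baseline.items():
            # Skip metrics missing from the new run or with an unusable baseline.
            if metric not in current or base <= 0:
                continue
            drop = (base - current[metric]) / base
            if drop > self.threshold:
                regressions.append(Regression(metric, base, current[metric], drop))
        return regressions
```

In a CI job, a non-empty result from `check` could be used to fail the build before deployment.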

LLM Evaluation & Testing

Name: LLM Evaluation & Testing
Author: Kingly-Agency


Category: Data Science & ML

Implements comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and LLM-as-judge patterns.

This skill provides a framework for measuring the quality of Large Language Model (LLM) applications. It lets developers compute automated metrics such as BLEU, ROUGE, and BERTScore, and supports LLM-as-judge evaluation, human annotation workflows, and statistical A/B testing. Comparing each run against an established baseline helps confirm that model updates or prompt changes improve the system without degrading quality in production.

Key Features

01 LLM-as-judge patterns for pointwise and pairwise comparisons

02 Statistical A/B testing with Cohen's d effect size analysis

03 Automated text metrics including BLEU, ROUGE, and BERTScore

04 Human evaluation frameworks with inter-rater agreement calculation

05 RAG-specific metrics like MRR, NDCG, and groundedness checks

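As an illustration of the effect size analysis mentioned in the feature list, Cohen's d for two sets of evaluation scores can be sketched as below. This is a generic pooled-standard-deviation formulation, not the skill's own code; the function name and argument order are assumptions.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardised mean difference (B minus A) using the pooled sample std dev."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("Each group needs at least two scores.")

    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("Effect size is undefined when both groups have zero variance.")

    return (mean(group_b) - mean(group_a)) / math.sqrt(pooled_var)
```

A positive value means variant B scored higher than variant A. By the common rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects.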

Use Cases

01 Validating prompt engineering improvements through systematic benchmarking

02 Comparing performance and cost-efficiency between different LLM providers

03 Detecting performance regressions in CI/CD pipelines before deployment


Install with 🐟 Skill.Fish

npx skillfish add kingly-agency/kingly-claude-adapter llm-evaluation
