What are the benefits of using an LLM-as-Judge approach?

LLM-as-Judge automates qualitative assessments of properties such as tone, helpfulness, and nuance by having a more capable model grade outputs, which gives you a scalable alternative to manual human review.
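As a rough illustration of the idea, the sketch below builds a grading prompt and parses a 1-5 score from the judge's reply. It is not the skill's own implementation: the prompt wording, the `judge_score` function, and the `call_judge` callable (a stand-in for whatever client sends a prompt to your judge model) are all assumptions made for this example.

```python
import re
from typing import Callable

# Hypothetical grading prompt; adapt the wording and scale to your own rubric.
JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's {criterion} from 1 (poor) to 5 (excellent).
Reply in the form: SCORE: <integer>"""


def judge_score(question: str, answer: str, criterion: str,
                call_judge: Callable[[str], str]) -> int:
    """Ask a judge model for a 1-5 score and parse it from the reply."""
    prompt = JUDGE_PROMPT.format(
        question=question, answer=answer, criterion=criterion
    )
    reply = call_judge(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        # Fail loudly rather than silently recording a bogus score.
        raise ValueError(f"could not parse a score from judge reply: {reply!r}")
    return int(match.group(1))


def fake_judge(prompt: str) -> str:
    """Stub used here in place of a real model call."""
    return "The answer is polite and complete. SCORE: 4"


print(judge_score("What is RAG?", "Retrieval-augmented generation.",
                  "helpfulness", fake_judge))
```

Injecting the judge as a callable keeps the parsing logic testable without network access; in practice you would pass a function that calls your chosen model.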

How can I detect performance regressions in my LLM?

This skill includes a RegressionDetector that compares new evaluation scores against a baseline. It flags drops in performance that exceed a configurable threshold, helping you catch quality regressions before they reach production.
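The skill's actual RegressionDetector interface is not shown on this page, so the following is only a hypothetical sketch of the underlying comparison. The function name `detect_regressions`, the relative-drop rule, and the default threshold of 5% are assumptions, and the code assumes that higher scores are better for every metric.

```python
def detect_regressions(baseline: dict[str, float],
                       current: dict[str, float],
                       threshold: float = 0.05) -> dict[str, float]:
    """Return metrics whose relative drop from the baseline exceeds `threshold`.

    Assumes higher is better. Metrics missing from `current`, or with a
    zero baseline (relative change undefined), are skipped.
    """
    regressions = {}
    for metric, base in baseline.items():
        if metric not in current or base == 0:
            continue
        relative_change = (current[metric] - base) / abs(base)
        if relative_change < -threshold:
            regressions[metric] = relative_change
    return regressions


baseline = {"accuracy": 0.80, "rouge_l": 0.50}
current = {"accuracy": 0.70, "rouge_l": 0.49}
# accuracy fell by 12.5% (flagged); rouge_l fell by 2% (within tolerance)
print(detect_regressions(baseline, current))
```

A relative threshold treats metrics on different scales consistently; an absolute threshold may suit metrics that are already bounded to a narrow range.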

Does this skill support statistical significance testing?

Yes. It includes an A/B testing framework that calculates p-values and Cohen's d effect sizes, so you can judge whether a difference between model versions is unlikely to be random noise (the p-value) and how large that difference is in practice (the effect size).
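To make the two statistics concrete, here is a minimal standard-library sketch. It is not the skill's framework: the function names are hypothetical, and the p-value comes from a two-sided permutation test on the difference in means, chosen here only because it needs no third-party packages (the skill may well use a t-test instead). Cohen's d uses the pooled sample standard deviation.

```python
import random
from math import sqrt
from statistics import fmean, variance


def cohens_d(a: list[float], b: list[float]) -> float:
    """Effect size: (mean(b) - mean(a)) divided by the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (fmean(b) - fmean(a)) / sqrt(pooled_var)


def permutation_p_value(a: list[float], b: list[float],
                        n_permutations: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)  # seeded so repeated runs give the same p-value
    observed = abs(fmean(b) - fmean(a))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[len(a):]) - fmean(pooled[:len(a)]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Illustrative per-example scores for two model versions (made-up numbers).
model_a = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73, 0.70, 0.71]
model_b = [0.80, 0.82, 0.78, 0.81, 0.79, 0.83, 0.80, 0.81]
print("p-value:", permutation_p_value(model_a, model_b))
print("Cohen's d:", cohens_d(model_a, model_b))
```

Reporting both numbers matters: with enough samples a tiny difference can reach a small p-value, while the effect size shows whether the change is large enough to care about.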

Which metrics are best for evaluating RAG applications?

For RAG, focus on retrieval metrics such as MRR (Mean Reciprocal Rank) and NDCG (Normalized Discounted Cumulative Gain) for the search component, and on generation metrics such as groundedness and faithfulness to check that the model's answer stays within the provided context.
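The two retrieval metrics can be sketched in a few lines of plain Python. The function names and input shapes below are assumptions for illustration, not the skill's API. Note one simplification: the ideal ranking for NDCG is computed from the gains of the retrieved list only, which assumes every relevant document appears somewhere in that list.

```python
from math import log2


def mean_reciprocal_rank(ranked_relevance: list[list[int]]) -> float:
    """MRR over queries; each inner list holds 0/1 relevance flags in rank order."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains: list[float], k: int) -> float:
    """NDCG@k for one query; `gains` are graded relevance values in rank order."""
    def dcg(values: list[float]) -> float:
        return sum(g / log2(i + 1) for i, g in enumerate(values, start=1))

    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


# Three queries: first relevant hit at rank 2, at rank 1, and no hit at all.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))  # 0.5
# A ranking already sorted by relevance scores a perfect 1.0.
print(ndcg_at_k([3, 2, 1], k=3))  # 1.0
```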

LLM Application Evaluation

Author: HermeticOrmus

Data Science & ML

Implement comprehensive evaluation frameworks for LLM applications using automated metrics, human feedback, and benchmarking.

This skill provides a structured methodology for measuring the performance, quality, and reliability of Large Language Model applications. It offers implementation patterns for traditional NLP metrics like BLEU and ROUGE, modern embedding-based assessments like BERTScore, and advanced 'LLM-as-Judge' techniques. Whether you are validating RAG pipelines, comparing model versions, or establishing regression testing in a CI/CD environment, this skill equips you with the statistical tools and code patterns needed to build production-grade AI systems with confidence.
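As a small example of the traditional overlap metrics mentioned above, the sketch below computes a simplified ROUGE-1 F1 score from unigram overlap. It is an illustration only, not the skill's implementation: the function name is hypothetical, and it tokenizes by lowercasing and splitting on whitespace, whereas standard ROUGE tooling typically applies its own tokenization and optional stemming.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(rouge1_f1("the cat sat", "a dog barked"))  # 0.0
```

Overlap metrics like this are cheap and deterministic, but they reward surface similarity; embedding-based scores and LLM-as-Judge grading are the usual complements when wording can legitimately vary.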

Key Features

01 Statistical A/B testing framework with Cohen’s d and p-value analysis

02 RAG-specific evaluation patterns for retrieval and groundedness

03 LLM-as-Judge scoring for automated qualitative assessment

04 Regression detection to identify performance drops between model versions

05 Implementation of automated metrics including BLEU, ROUGE, and BERTScore

Use Cases

01 Validating the accuracy and factuality of RAG-based search systems

02 Benchmarking a new prompt or model version against a production baseline

03 Establishing automated quality gates for AI features in deployment pipelines

Install with 🐟 Skill.Fish

npx skillfish add hermeticormus/alqvimia-contador llm-evaluation

For use in Claude.ai and ChatGPT