Does it support statistical significance testing?

Yes. It includes an ABTest framework that calculates p-values to test whether an improvement is statistically significant, and Cohen's d to measure how large the improvement is.
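As a rough illustration of the two statistics, the sketch below computes Cohen's d with a pooled standard deviation and estimates a two-sided p-value with a permutation test. It is not the skill's ABTest implementation, whose interface is not shown on this page. The function names, the choice of a permutation test, and the default parameters are assumptions made for the example.

```python
import random
import statistics


def cohens_d(a, b):
    """Standardized mean difference (b minus a) using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(b) - statistics.fmean(a)) / pooled_var ** 0.5


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means.

    Scores are pooled and randomly relabeled; the p-value is the share of
    relabelings whose absolute mean difference is at least the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(b) - statistics.fmean(a))
    pooled = list(a) + list(b)
    split = len(a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[split:]) - statistics.fmean(pooled[:split]))
        # Small tolerance so float rounding does not hide exact ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

A common reading is to treat a p-value below 0.05 as significant and to use Cohen's d (roughly 0.2 small, 0.5 medium, 0.8 large) to judge whether the difference matters in practice.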

How does the 'LLM-as-Judge' approach work?

It uses a stronger model (like GPT-4 or Claude 3 Opus) to evaluate the output of other models using structured prompts for accuracy, helpfulness, and clarity.
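A minimal sketch of the pattern, assuming the judge is asked to return JSON scores from 1 to 5 for each criterion. The prompt wording, the score scale, and the `call_model` parameter (any function that sends a prompt to a judge model and returns its text reply) are hypothetical and are not taken from the skill itself.

```python
import json

CRITERIA = ("accuracy", "helpfulness", "clarity")


def build_judge_prompt(question, answer):
    """Build a structured grading prompt for the judge model."""
    criteria = ", ".join(CRITERIA)
    return (
        "You are grading an answer produced by another model.\n"
        f"Rate it from 1 (poor) to 5 (excellent) on: {criteria}.\n"
        "Reply with a JSON object that has exactly those keys and integer values.\n\n"
        f"Question:\n{question}\n\nAnswer:\n{answer}\n"
    )


def parse_judge_scores(raw):
    """Parse the judge's JSON reply and validate each score."""
    data = json.loads(raw)
    scores = {}
    for criterion in CRITERIA:
        value = int(data[criterion])  # KeyError if the judge omitted a criterion
        if not 1 <= value <= 5:
            raise ValueError(f"{criterion} score out of range: {value}")
        scores[criterion] = value
    return scores


def judge(question, answer, call_model):
    """Score one answer. `call_model` is a caller-supplied function: prompt -> reply text."""
    return parse_judge_scores(call_model(build_judge_prompt(question, answer)))
```

Passing the model call in as a function keeps the sketch independent of any particular provider SDK and makes it easy to test with a stub.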

Can this skill help with regression testing?

Yes, it includes a RegressionDetector class that compares new results against a baseline and flags significant performance drops based on a defined threshold.
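The comparison can be sketched as follows. This is a hypothetical stand-in, not the skill's RegressionDetector class, whose interface is not shown on this page. It assumes every metric is "higher is better" and that the threshold is a relative drop (0.05 means 5%).

```python
def detect_regressions(baseline, current, threshold=0.05):
    """Return {metric: relative_drop} for metrics that fell by more than `threshold`.

    Assumes higher values are better. Metrics missing from `current`, or with a
    zero baseline, are skipped because no relative drop can be computed.
    """
    regressions = {}
    for metric, base in baseline.items():
        if metric not in current or base == 0:
            continue
        drop = (base - current[metric]) / abs(base)
        if drop > threshold:
            regressions[metric] = drop
    return regressions


# Example inputs (made-up numbers): accuracy falls from 0.90 to 0.80.
baseline_scores = {"accuracy": 0.90, "rouge_l": 0.50}
current_scores = {"accuracy": 0.80, "rouge_l": 0.50}
flagged = detect_regressions(baseline_scores, current_scores)
```

In a CI job, a non-empty result would typically fail the build so the drop is reviewed before release.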

What metrics are supported for RAG applications?

The skill supports specific retrieval metrics like Mean Reciprocal Rank (MRR), NDCG, and Precision@K, as well as generation metrics like groundedness.
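For reference, the three retrieval metrics can be written in a few lines using their standard textbook definitions. The function names and argument shapes below are assumptions for illustration and may differ from the skill's own API.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in queries) / len(queries)


def ndcg_at_k(retrieved, gains, k):
    """NDCG@k, where `gains` maps document id -> graded relevance (missing = 0)."""
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(i + 2)
        for i, doc in enumerate(retrieved[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Precision@K ignores ordering within the top K, MRR only looks at the first relevant hit, and NDCG rewards placing the most relevant documents earliest, so the three are usually reported together.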

LLM Evaluation Master

Author: cuoreinpace

Data Science & ML

Implements comprehensive evaluation frameworks for LLM applications using automated metrics, human feedback, and statistical benchmarking.

This skill provides a complete toolkit for systematically measuring and improving the performance of LLM-based applications. It covers essential evaluation methodologies including automated n-gram and embedding-based metrics (BLEU, ROUGE, BERTScore), LLM-as-judge patterns for semantic assessment, and structured human annotation frameworks. Whether you are comparing model versions, detecting performance regressions in CI/CD, or validating prompt engineering changes, it supplies the implementation patterns and statistical analysis tools needed to build production-grade confidence in AI systems.
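To give a sense of what an n-gram metric measures, here is a simplified ROUGE-1 style F1 score based on unigram overlap. It assumes lowercase whitespace tokenization with no stemming, so its numbers will not match a full ROUGE implementation, and it is not taken from the skill's code.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate text and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Overlap metrics like this are cheap and deterministic but blind to paraphrase, which is the gap that embedding-based metrics and LLM-as-judge scoring are meant to cover.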

Key Features

01 LLM-as-judge patterns for pointwise and pairwise semantic comparisons

02 Custom metric support for groundedness, toxicity, and factuality checking

03 Automated metrics implementation for text generation, classification, and RAG

04 Statistical A/B testing framework with Cohen’s d and p-value analysis

05 Automated regression detection to prevent performance drops during updates

Use Cases

01 Comparing performance between foundation models to optimize cost and quality

02 Establishing quality baselines before deploying prompt updates to production

03 Implementing automated guardrails to detect hallucinations and ungrounded claims

Install with 🐟 Skill.Fish

npx skillfish add cuoreinpace/bdeornelas.github.io llm-evaluation