Which automated metrics are supported for text generation?

The skill includes implementations of BLEU (n-gram precision), ROUGE (recall-oriented n-gram overlap), METEOR (word alignment with stemming and synonym matching), and BERTScore (embedding-based similarity).
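As a rough illustration of what n-gram overlap means, the sketch below computes a clipped n-gram precision (the quantity BLEU is built on) and an n-gram recall (the ROUGE-N idea) for one candidate/reference pair. It is a simplified stand-in, not the skill's own implementation: the function names are hypothetical, tokenization is plain whitespace splitting, and full BLEU additionally combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams that occur in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(candidate, reference, n=1):
    """Return (clipped n-gram precision, n-gram recall) for two strings.

    Hypothetical helper for illustration only.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram count to the smaller of the two,
    # so a repeated word cannot be credited more often than the reference has it.
    matched = sum((cand & ref).values())
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall


precision, recall = overlap_scores(
    "the cat sat on the mat",
    "the cat is on the mat",
)
print(f"unigram precision={precision:.3f} recall={recall:.3f}")
```

Both scores come out at 5/6 here because five of the six unigrams on each side match; precision and recall diverge once the candidate and reference differ in length.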

Can this skill evaluate RAG (Retrieval-Augmented Generation) applications?

Yes, it provides specific metrics for retrieval quality, including Mean Reciprocal Rank (MRR), NDCG, and Precision/Recall @ K, as well as groundedness checks.
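The retrieval metrics named above can be sketched in a few lines each. The version below assumes binary relevance (a document is either relevant or not) and uses hypothetical function names and document IDs; the skill's actual API may differ.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision@K and Recall@K for a single query."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k, hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@K with binary gains: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Two illustrative queries with made-up document IDs.
runs = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9", "d4"], {"d2", "d4"}),
]
mrr = mean_reciprocal_rank(runs)
p_at_2, r_at_2 = precision_recall_at_k(*runs[1], k=2)
ndcg = ndcg_at_k(*runs[1], k=3)
print(f"MRR={mrr:.3f} P@2={p_at_2:.2f} R@2={r_at_2:.2f} NDCG@3={ndcg:.3f}")
```

Groundedness checks are a separate concern: they compare the generated answer against the retrieved passages, and are not covered by this sketch.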

Does it support statistical significance testing?

Yes, the skill includes an A/B testing framework that runs t-tests and reports p-values and Cohen's d (an effect size), so you can judge whether a model improvement is statistically significant and how large it is.
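To make the two statistics concrete, here is a minimal sketch using only the Python standard library. It assumes two independent samples of per-example scores and uses Welch's variant of the t-test, which is a choice made for this example rather than something the skill is known to use. The function names and the sample scores are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)


def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(b) - mean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up per-example scores for a baseline and a candidate model.
baseline = [0.70, 0.72, 0.68, 0.71, 0.69]
candidate = [0.75, 0.77, 0.74, 0.76, 0.73]

t_stat, dof = welch_t(baseline, candidate)
effect = cohens_d(baseline, candidate)
print(f"t={t_stat:.2f} df={dof:.1f} d={effect:.2f}")
```

The p-value follows from the t distribution with the computed degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind(baseline, candidate, equal_var=False)` returns both the statistic and the p-value in one call.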

What is the benefit of the 'LLM-as-judge' approach?

It allows a more capable model to evaluate the nuance, helpfulness, and safety of a target model's output, providing high-quality qualitative feedback at scale.
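A pointwise judge usually comes down to three steps: build a grading prompt, send it to the judge model, and parse a score out of the reply. The sketch below shows that flow under stated assumptions: the prompt wording, the 1 to 5 scale, and every function name are hypothetical, and `call_judge` stands in for whatever client actually calls the judge model.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; the wording and the scale are assumptions.
JUDGE_TEMPLATE = (
    "You are grading an assistant's answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate helpfulness and accuracy from 1 (poor) to 5 (excellent).\n"
    "Reply with brief reasoning, then a final line 'Score: <n>'."
)


def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the grading template for one question/answer pair."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the 1-5 score, or None if the reply does not contain one."""
    match = re.search(r"Score:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def judge_pointwise(
    question: str, answer: str, call_judge: Callable[[str], str]
) -> Optional[int]:
    """Grade one answer; call_judge is any function that maps prompt to reply."""
    return parse_score(call_judge(build_judge_prompt(question, answer)))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the example runs offline."""
    return "Reasoning: accurate and concise.\nScore: 4"


score = judge_pointwise("What is the capital of France?", "Paris.", fake_judge)
print(score)
```

Returning `None` for an unparseable reply keeps malformed judge output from being silently counted as a low score. Pairwise grading follows the same shape, with two answers in the prompt and a preference in place of the numeric score.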

LLM Evaluation & Benchmarking

Name: LLM Evaluation & Benchmarking
Author: HermeticOrmus


Data Science & ML

Implements comprehensive evaluation frameworks for LLM applications using automated metrics, human feedback, and LLM-as-judge patterns.

The llm-evaluation skill provides a toolkit for measuring and improving the performance of LLM applications. It covers traditional NLP metrics such as BLEU and ROUGE as well as embedding-based metrics such as BERTScore. Beyond these, it supports LLM-as-judge grading for qualitative analysis, statistical A/B testing for model comparison, and RAG-specific evaluation of retrieval accuracy. It is aimed at developers who want to move from vibes-based development to a data-driven approach that checks production readiness and catches regressions.

Key Features

01 Comprehensive automated metrics including BLEU, ROUGE, BERTScore, and Perplexity

02 LLM-as-judge implementation for automated pointwise and pairwise response grading

03 Specialized RAG evaluation metrics for retrieval (MRR, NDCG) and groundedness

04 Statistical A/B testing framework with p-value and Cohen’s d effect size analysis

05 Automated regression detection to identify performance drops before deployment

06 0 GitHub stars

Use Cases

01 Systematically comparing the performance of different prompts or model versions

02 Validating the accuracy and retrieval quality of RAG-based systems

03 Setting up automated evaluation pipelines within CI/CD workflows
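In a CI/CD pipeline, regression detection can be as simple as comparing the current run's metrics against a stored baseline and failing the build when any score drops by more than a tolerance. The sketch below illustrates that idea only: the function name, the metric names, the scores, and the 0.01 tolerance are all hypothetical, and it assumes higher is better for every metric (a metric like perplexity would need the comparison reversed).

```python
def find_regressions(baseline, current, tolerance=0.01):
    """Return {metric: drop} for metrics that fell by more than `tolerance`.

    Assumes higher scores are better. Metrics missing from `current` are skipped.
    """
    regressions = {}
    for name, base_score in baseline.items():
        new_score = current.get(name)
        if new_score is None:
            continue
        drop = base_score - new_score
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Made-up scores for the stored baseline and the current build.
baseline = {"rouge1": 0.62, "groundedness": 0.90, "mrr": 0.75}
current = {"rouge1": 0.63, "groundedness": 0.84, "mrr": 0.745}

regressions = find_regressions(baseline, current)
print(regressions)
```

A CI step could then exit with a non-zero status whenever `regressions` is non-empty, which blocks the deployment until the drop is investigated.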

Install with 🐟 Skill.Fish

npx skillfish add hermeticormus/hermetic-academy llm-evaluation

For use in Claude.ai and ChatGPT

Download Skill