The llm-evaluation skill provides a systematic approach to measuring and improving the quality of AI-driven applications. It enables developers to implement a multi-layered evaluation strategy encompassing automated text metrics (BLEU, ROUGE, BERTScore), LLM-as-Judge patterns for semantic assessment, and structured human evaluation frameworks. By integrating statistical A/B testing and regression detection, this skill helps teams confidently validate prompt changes, compare model performance, and ensure production-grade reliability across text generation, classification, and RAG tasks.
Key Features
1. LLM-as-Judge patterns for automated pointwise and pairwise evaluation
2. Statistical A/B testing framework with Cohen's d effect size calculation
3. Automated metrics for text generation and retrieval (RAG) performance
4. Regression detection to prevent performance drops during deployment
5. Human annotation structures with inter-rater agreement (Cohen's Kappa) analysis
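The pairwise LLM-as-Judge pattern listed above can be sketched as follows. Here `call_judge` is a hypothetical stand-in for whatever model client you use, and the prompt wording is illustrative, not the skill's actual template; running the comparison in both orderings guards against position bias, a common judge failure mode:

```python
import json

# Illustrative judge prompt; the JSON-only instruction makes parsing simple.
PAIRWISE_PROMPT = """You are an impartial judge. Given a question and two
answers, reply with JSON {{"winner": "A" | "B" | "tie", "reason": "..."}}.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def judge_pairwise(question, answer_a, answer_b, call_judge):
    """Compare two answers with an LLM judge, controlling for position bias."""
    first = json.loads(call_judge(PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b)))
    # Second pass with the answers swapped.
    second = json.loads(call_judge(PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_b, answer_b=answer_a)))
    # Map the reversed run's verdict back to the original labels.
    swapped = {"A": "B", "B": "A", "tie": "tie"}[second["winner"]]
    # Only accept a winner when both orderings agree; otherwise call it a tie.
    return first["winner"] if first["winner"] == swapped else "tie"
```

A judge that always favors the first answer it sees will disagree with itself across the two orderings and be downgraded to "tie", which is exactly the behavior you want from a bias check.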
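Cohen's d, mentioned in the A/B testing feature, is a standardized mean difference between two score samples. A minimal standard-library sketch, assuming the common pooled-standard-deviation variant (the skill's exact formula isn't shown here):

```python
import statistics

def cohens_d(scores_a, scores_b):
    """Standardized mean difference between two independent score samples."""
    na, nb = len(scores_a), len(scores_b)
    # Sample variances with Bessel's correction, pooled across both groups.
    var_a, var_b = statistics.variance(scores_a), statistics.variance(scores_b)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    return (statistics.mean(scores_a) - statistics.mean(scores_b)) / pooled_sd
```

By the usual rule of thumb, |d| around 0.2 is a small effect, 0.5 medium, and 0.8 large, which gives A/B comparisons of prompt variants a magnitude, not just a p-value.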
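Of the automated text metrics listed, ROUGE-1 is the simplest to illustrate: unigram overlap between a candidate and a reference. This is a sketch only; production implementations (e.g. the `rouge-score` package) also handle stemming and longest-common-subsequence variants like ROUGE-L:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears
    # in the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```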
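Cohen's Kappa, used in the human annotation feature, corrects raw inter-rater agreement for agreement expected by chance. A minimal two-rater sketch, assuming categorical labels:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own marginal label frequencies.
    counts1, counts2 = Counter(rater1), Counter(rater2)
    expected = sum(counts1[k] * counts2.get(k, 0) for k in counts1) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa near 0 means the annotators agree no more than chance would predict, even if raw agreement looks high; values above roughly 0.6 are usually read as substantial agreement.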
Use Cases
1. Systematically comparing different foundation models or prompt iterations
2. Validating RAG system accuracy, groundedness, and retrieval relevance
3. Building automated CI/CD pipelines to catch LLM performance regressions
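The CI/CD regression check in the last use case can be gated with a simple threshold on aggregate scores. The `min_drop` tolerance and mean-score aggregation here are illustrative assumptions, not the skill's actual interface; a production pipeline would typically pair a gate like this with a significance test on paired per-example scores:

```python
def detect_regression(baseline_scores, candidate_scores, min_drop=0.05):
    """Return True if the candidate's mean eval score falls more than
    min_drop below the baseline's, signaling a deployment-blocking regression."""
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return (baseline_mean - candidate_mean) > min_drop
```

Wired into CI, a `True` result fails the build, so a prompt or model change that quietly degrades eval scores never reaches production.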