How does the LLM-as-judge feature work?

It provides patterns for using a highly capable model to evaluate the outputs of other models against defined rubrics, through either pointwise scoring (rating one output at a time) or pairwise comparison (choosing the better of two outputs).
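As a rough illustration of the pointwise pattern, the sketch below builds a rubric prompt, sends it to a judge model, and parses a numeric score from the reply. The rubric wording, the function names, and the `call_model` callable are all hypothetical and are not taken from the skill itself; `fake_judge` stands in for a real model client so the example runs offline.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; a real one would be tailored to the task.
RUBRIC = (
    "Rate the answer from 1 (poor) to 5 (excellent) for factual accuracy "
    "and helpfulness. Reply with 'Score: <n>' followed by a short reason."
)

def build_judge_prompt(question: str, answer: str) -> str:
    """Combine the rubric with the item under evaluation."""
    return f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"

def parse_score(reply: str) -> Optional[int]:
    """Extract a 1-5 score, or None if the judge did not follow the format."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None

def pointwise_judge(
    question: str, answer: str, call_model: Callable[[str], str]
) -> Optional[int]:
    """Score a single answer; call_model is any prompt -> reply function."""
    return parse_score(call_model(build_judge_prompt(question, answer)))

def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge model, used only for this example."""
    return "Score: 4. Mostly correct but omits a caveat."

score = pointwise_judge("What is the capital of France?", "Paris.", fake_judge)
```

Returning `None` on an unparseable reply, rather than guessing a score, keeps malformed judge output from silently skewing averages.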

Can I use this for RAG (Retrieval-Augmented Generation) applications?

Yes, the skill includes specific metrics and strategies for evaluating retrieval performance, groundedness, and context relevance in RAG pipelines.
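For the retrieval side, two commonly used ranking metrics are MRR and NDCG. The following is a minimal sketch using their standard definitions, not the skill's own implementation; the function names and input shapes are assumptions made for illustration.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_ids, gain_by_id, k):
    """DCG of the top-k results divided by the DCG of the ideal ordering."""
    gains = [gain_by_id.get(d, 0.0) for d in ranked_ids[:k]]
    ideal = sorted(gain_by_id.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Two queries: the first relevant hit is at rank 2, then at rank 1.
runs = [(["a", "b", "c"], {"b"}), (["x", "y"], {"x"})]
mrr = mean_reciprocal_rank(runs)
```

Groundedness and context relevance usually need a judge model or human labels, so they are not covered by this sketch.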

Does it support statistical significance testing?

Yes, it includes tools for A/B testing: p-values to check whether an improvement in model performance is statistically significant, and Cohen's d to measure how large the improvement is.
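To make the two quantities concrete, here is a small standard-library sketch that computes Cohen's d with a pooled standard deviation and estimates a two-sided p-value with a permutation test. The permutation test is chosen here only because it needs no external dependency; the skill may use a different test, and the function names and sample scores are illustrative assumptions.

```python
import random
from statistics import mean, variance

def cohens_d(a, b):
    """Effect size: difference in means over the pooled standard deviation.

    Assumes each sample has at least two values and nonzero pooled variance.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / pooled_var ** 0.5

def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided p-value for the difference in means via label shuffling."""
    rng = random.Random(seed)
    observed = abs(mean(b) - mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        left, right = pooled[: len(a)], pooled[len(a):]
        diff = abs(sum(right) / len(right) - sum(left) / len(left))
        # Small tolerance guards against floating-point summation order.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)

# Made-up per-example scores for a baseline and a candidate prompt.
baseline = [0.61, 0.58, 0.64, 0.60, 0.59, 0.63]
candidate = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73]
d = cohens_d(baseline, candidate)
p = permutation_p_value(baseline, candidate)
```

Reporting both values matters: a tiny p-value on a large test set can accompany a negligible effect size, and vice versa.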

What types of metrics are supported by this skill?

It supports automated text metrics (BLEU, ROUGE, BERTScore), retrieval metrics (MRR, NDCG), classification metrics (F1, Precision/Recall), and custom qualitative metrics using LLM-as-judge.
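As an example of what an n-gram overlap metric measures, the sketch below computes a unigram-overlap F1 score in the spirit of ROUGE-1. It is a deliberately simplified stand-in (whitespace tokenization, no stemming) and is not the official ROUGE implementation or the skill's code; `unigram_f1` is a hypothetical name.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over unigram counts shared by candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = unigram_f1("the cat", "the cat sat on the mat")
```

Here precision is 2/2 and recall is 2/6, which shows how a short candidate can be fully precise yet miss most of the reference.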

LLM Application Evaluation

Author: HermeticOrmus


Data Science & ML

Evaluates Large Language Model application performance using automated metrics, human feedback loops, and LLM-as-judge frameworks.

The llm-evaluation skill provides a comprehensive framework for measuring and improving the quality of AI applications. It bridges the gap between raw model output and production-ready performance by implementing systematic evaluation strategies, including automated n-gram and embedding metrics, LLM-as-judge patterns for semantic validation, and human-in-the-loop annotation structures. Whether you are benchmarking different models, detecting performance regressions, or validating RAG pipelines, this skill equips developers with the statistical and programmatic tools needed to establish reliable baselines and ensure model outputs remain accurate, safe, and helpful over time.

Key Features

01 Automated metrics implementation including BLEU, ROUGE, and BERTScore

02 LLM-as-judge patterns for pointwise and pairwise model comparisons

03 Statistical A/B testing and Cohen's d effect size analysis

04 Regression detection frameworks to prevent performance drops during deployment

05 Retrieval-specific (RAG) evaluation metrics such as MRR and NDCG

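Regression detection can be as simple as comparing fresh metric values against a stored baseline before a deployment proceeds. The sketch below is one hypothetical way to do that; it assumes every metric is "higher is better", and the function name, metric names, and tolerance are illustrative rather than part of the skill.

```python
def detect_regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """Return metrics that dropped more than `tolerance` below the baseline.

    A metric missing from `current` is also reported, with None as its value.
    """
    regressions = {}
    for metric, base_value in baseline.items():
        new_value = current.get(metric)
        if new_value is None or new_value < base_value - tolerance:
            regressions[metric] = (base_value, new_value)
    return regressions

# Made-up scores: rouge1 dips within tolerance, groundedness does not.
baseline = {"rouge1": 0.45, "groundedness": 0.90}
current = {"rouge1": 0.44, "groundedness": 0.85}
failed = detect_regressions(baseline, current)
```

A non-empty result could be used to fail a CI step. A fixed tolerance is a blunt instrument, so a significance test on per-example scores is a reasonable refinement.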

Use Cases

01 Validating prompt engineering changes to prevent quality regressions

02 Benchmarking new model versions against established performance baselines

03 Measuring the groundedness and factual accuracy of RAG application outputs


Install with 🐟 Skill.Fish

npx skillfish add hermeticormus/floraheritage llm-evaluation

For use in Claude.ai and ChatGPT
