About
This skill provides developers with a robust toolkit for measuring and improving LLM application quality throughout the development lifecycle. It covers a wide spectrum of evaluation techniques, including linguistic metrics like BLEU and ROUGE, semantic similarity via BERTScore, RAG-specific retrieval metrics, and sophisticated LLM-as-judge patterns for qualitative assessment. By integrating systematic testing, A/B comparison, and regression detection, it helps teams build confidence in production AI systems, validate prompt engineering improvements, and maintain rigorous performance standards over time.
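As a quick illustration of two of these patterns, below is a minimal Python sketch: a simplified ROUGE-1-style unigram-recall metric and a pairwise LLM-as-judge comparison. It is a hedged example, not the skill's implementation; `rouge1_recall` approximates only the unigram-recall component of ROUGE, and `call_judge` is a hypothetical stand-in for whatever model client you use.

```python
from collections import Counter
from typing import Callable


def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate.

    A simplified stand-in for ROUGE-1 recall; real evaluations should use a
    maintained package such as rouge-score.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


def judge_pairwise(
    prompt: str,
    response_a: str,
    response_b: str,
    call_judge: Callable[[str], str],  # hypothetical: wraps your judge model's API
) -> str:
    """Ask a judge model which response better answers the prompt ('A', 'B', or 'TIE')."""
    judge_prompt = (
        "You are comparing two answers to the same question.\n"
        f"Question: {prompt}\n\n"
        f"Answer A: {response_a}\n\n"
        f"Answer B: {response_b}\n\n"
        "Reply with exactly one letter, A or B, naming the better answer."
    )
    verdict = call_judge(judge_prompt).strip().upper()
    return verdict if verdict in {"A", "B"} else "TIE"


if __name__ == "__main__":
    ref = "The cat sat on the mat"
    cand = "A cat was sitting on the mat"
    print(f"ROUGE-1 recall: {rouge1_recall(ref, cand):.2f}")

    # Stub judge for demonstration only; swap in a real model call in practice.
    stub_judge = lambda _prompt: "B"
    print(judge_pairwise("Where did the cat sit?", "On a chair.", "On the mat.", stub_judge))
```

In practice the same harness structure extends to the other techniques listed above: swap the metric function for BERTScore or a retrieval metric, and run the judge over paired outputs from two prompt versions to support A/B comparison and regression checks.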