What metrics are included for text generation?

The skill includes standard reference-based metrics: BLEU and ROUGE, which measure n-gram overlap with a reference text, and BERTScore, which measures semantic similarity by comparing contextual embeddings.
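
To make the idea of an overlap metric concrete, here is a minimal sketch of a ROUGE-1-style F1 score in Python. It is a simplified illustration, not the skill's implementation: the function name `unigram_f1` and the whitespace tokenization are assumptions, and real ROUGE tooling adds stemming, longer n-grams, and longest-common-subsequence variants.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 (hypothetical helper, not the skill's API).

    Tokenizes on whitespace, lowercases, and scores clipped unigram overlap.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A candidate that copies part of the reference scores high on precision but is penalized on recall, which is why the two are combined into an F1.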

Can I use this for RAG (Retrieval-Augmented Generation)?

Yes, it provides specific metrics for retrieval quality such as Mean Reciprocal Rank (MRR), NDCG, and custom groundedness checks.
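
As a rough illustration of the two ranking metrics named above, the following Python sketch computes MRR and NDCG from relevance labels. The function names and input shapes are assumptions made for this example. Note one simplification: the ideal ordering for NDCG is derived from the retrieved list itself, whereas a full evaluation would use every relevant document known for the query.

```python
import math
from typing import Optional, Sequence


def mean_reciprocal_rank(ranked_flags: Sequence[Sequence[int]]) -> float:
    """MRR over queries; each inner sequence is 0/1 relevance in ranked order."""
    if not ranked_flags:
        return 0.0
    total = 0.0
    for flags in ranked_flags:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_flags)


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains: Sequence[float], k: Optional[int] = None) -> float:
    """NDCG@k; the ideal ranking is the same gains sorted in descending order."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```

MRR only rewards the position of the first relevant result, while NDCG accounts for graded relevance across the whole ranked list.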

What is the LLM-as-Judge approach?

It is a pattern where a more powerful model, such as GPT-4 or Claude 3 Opus, is used to evaluate the outputs of a target model based on specific qualitative criteria.
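
A pointwise version of this pattern can be sketched in a few lines of Python. Everything here is hypothetical: the prompt template, the 1 to 5 scale, and the `call_judge` callable (a stand-in for whatever client sends a prompt to the judge model and returns its text reply) are assumptions for illustration, not the skill's actual interface.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; the wording and the 1-5 scale are assumptions.
JUDGE_TEMPLATE = (
    "You are grading an answer.\n"
    "Criterion: {criterion}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with 'Score: N' where N is an integer from 1 to 5."
)


def judge_pointwise(
    question: str,
    answer: str,
    criterion: str,
    call_judge: Callable[[str], str],
) -> Optional[int]:
    """Ask a judge model for a 1-5 score; return None if the reply is unparseable."""
    prompt = JUDGE_TEMPLATE.format(
        criterion=criterion, question=question, answer=answer
    )
    reply = call_judge(prompt)
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

Passing the model call in as a function keeps the scoring logic testable with a stub, and returning `None` on a malformed reply avoids silently recording a bogus score.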

How does it handle performance regressions?

It includes a Regression Detector that compares new test results against a baseline and flags significant drops in performance across defined metrics.
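
The comparison step might look something like the sketch below. The listing does not say how "significant" is defined, so this example assumes a fixed absolute threshold (`max_drop`) and metrics where higher is better; the function name and dictionary layout are likewise assumptions.

```python
from typing import Dict


def detect_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.02,
) -> Dict[str, float]:
    """Return {metric: drop} for metrics that fell by more than max_drop.

    Assumes higher scores are better. Metrics missing from `current`
    are skipped rather than treated as regressions.
    """
    regressions: Dict[str, float] = {}
    for name, base_score in baseline.items():
        if name not in current:
            continue
        drop = base_score - current[name]
        if drop > max_drop:
            regressions[name] = drop
    return regressions
```

In a CI job, a non-empty result could be used to fail the build before a lower-quality prompt or model reaches deployment.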

LLM Evaluation Suite

Author: synqing

Data Science & ML

Implements comprehensive evaluation frameworks for LLM applications using automated metrics, human feedback, and benchmarking.

This skill provides a robust framework for assessing the quality and performance of LLM applications through a multi-layered approach. It covers a wide spectrum of evaluation techniques, including traditional linguistic metrics like BLEU and ROUGE, semantic evaluation using BERTScore, and modern LLM-as-judge patterns. Whether you are detecting performance regressions in CI/CD, comparing model variants through statistical A/B testing, or establishing human-in-the-loop annotation workflows, this skill helps ensure AI outputs remain accurate, safe, and helpful throughout the development lifecycle.

Key Features

01 Retrieval-augmented generation (RAG) tracking with MRR and NDCG

02 Automated text generation metrics including BLEU, ROUGE, and BERTScore

03 LLM-as-Judge patterns for pointwise and pairwise model comparisons

04 Automated regression detection to prevent quality drops before deployment

05 Statistical A/B testing framework with p-value and effect size analysis

06 0 GitHub stars
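
The A/B testing feature in the list above mentions p-values and effect sizes. One way to obtain both from per-example scores of two variants is sketched below using only the Python standard library. The choice of a permutation test and Cohen's d is an assumption for illustration; the listing does not state which test the skill uses, and the function names are hypothetical.

```python
import random
import statistics
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Effect size: mean difference divided by the pooled standard deviation.

    Needs at least two values per group and a non-zero pooled variance.
    """
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (_mean(a) - _mean(b)) / pooled_var ** 0.5


def permutation_p_value(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the difference in means via label shuffling."""
    rng = random.Random(seed)
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[: len(a)]) - _mean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

Reporting the effect size alongside the p-value helps distinguish a difference that is statistically detectable from one that is large enough to matter.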

Use Cases

01 Benchmarking multiple model providers or prompts to determine the optimal configuration

02 Validating RAG pipeline accuracy by measuring groundedness and context relevance

03 Establishing standardized human evaluation workflows to calibrate AI judging

Install with 🐟 Skill.Fish

npx skillfish add synqing/k1.node1 llm-evaluation
