What is the 'LLM-as-judge' pattern?

LLM-as-judge is a technique that uses a high-capability model (like Claude 3.5) to evaluate the outputs of other models, providing qualitative scores for accuracy, helpfulness, and clarity.
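
As an illustration of the pattern, here is a minimal pointwise sketch. The prompt wording, the 1-5 scale, and the names `judge`, `JUDGE_TEMPLATE`, and `call_model` are assumptions made for this example, not the skill's actual interface; `call_model` stands in for whatever client sends a prompt to the judge model and returns its text reply.

```python
import json
from typing import Callable, Dict

# Hypothetical judge prompt; the doubled braces are literal braces for str.format.
JUDGE_TEMPLATE = """You are an impartial evaluator.
Rate the response to the question on a 1-5 scale for accuracy, helpfulness, and clarity.
Reply with JSON only, for example: {{"accuracy": 4, "helpfulness": 5, "clarity": 3}}

Question: {question}
Response: {response}"""

CRITERIA = ("accuracy", "helpfulness", "clarity")


def judge(question: str, response: str, call_model: Callable[[str], str]) -> Dict[str, int]:
    """Ask a judge model to score one response and validate its reply."""
    prompt = JUDGE_TEMPLATE.format(question=question, response=response)
    raw_reply = call_model(prompt)      # call_model is an assumed, injectable client
    scores = json.loads(raw_reply)      # raises ValueError if the reply is not JSON
    result = {}
    for name in CRITERIA:
        value = int(scores[name])
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score out of range: {value}")
        result[name] = value
    return result


def fake_judge(prompt: str) -> str:
    """Stub used here in place of a real model call."""
    return '{"accuracy": 4, "helpfulness": 5, "clarity": 3}'


scores = judge("What is 2 + 2?", "The answer is 4.", fake_judge)
```

Passing the model call in as a function keeps the scoring logic testable without network access; a real judge would also need retries for malformed replies.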

Can I implement custom evaluation metrics?

Yes, the framework includes an extensible Metric class that lets you define custom logic for domain-specific requirements, such as toxicity or legal compliance.
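
The listing does not show the Metric class's real interface, so the sketch below assumes one: an abstract base with a single `score` method returning a float in [0, 1]. `BlocklistMetric` is a hypothetical subclass that approximates a toxicity check with a simple term blocklist.

```python
import re
from abc import ABC, abstractmethod
from typing import Optional, Set


class Metric(ABC):
    """Assumed base interface; the skill's real Metric class may differ."""

    name = "metric"

    @abstractmethod
    def score(self, output: str, reference: Optional[str] = None) -> float:
        """Return a score in [0, 1], where higher is better."""


class BlocklistMetric(Metric):
    """Hypothetical toxicity proxy: 1.0 if no blocked term appears, else 0.0."""

    name = "blocklist"

    def __init__(self, blocked_terms: Set[str]):
        self.blocked_terms = {term.lower() for term in blocked_terms}

    def score(self, output: str, reference: Optional[str] = None) -> float:
        # Compare whole lowercase words so "class" does not match "ass".
        words = set(re.findall(r"[a-z']+", output.lower()))
        return 0.0 if words & self.blocked_terms else 1.0


metric = BlocklistMetric({"idiot"})
clean_score = metric.score("Thanks for asking, here is the answer.")
toxic_score = metric.score("You are an idiot.")
```

A production toxicity metric would normally use a trained classifier; the blocklist only illustrates where custom logic plugs in.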

How do I detect model regressions?

By using the provided EvaluationSuite, you can run standardized test cases against every model update to track changes in mean scores and identify performance drops.
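
The EvaluationSuite's interface is not shown here, so the following is a standalone sketch of the comparison step only. The function name `detect_regressions`, the input shape (metric name mapped to per-test-case scores), and the 0.02 tolerance are all assumptions for illustration.

```python
from statistics import mean
from typing import Dict, List


def detect_regressions(
    baseline: Dict[str, List[float]],
    candidate: Dict[str, List[float]],
    tolerance: float = 0.02,
) -> Dict[str, float]:
    """Return {metric: change in mean} for metrics that dropped by more than tolerance."""
    regressions = {}
    for metric, base_scores in baseline.items():
        if metric not in candidate:
            continue  # metric was not run on the candidate; nothing to compare
        delta = mean(candidate[metric]) - mean(base_scores)
        if delta < -tolerance:
            regressions[metric] = delta
    return regressions


# Scores from the same test cases, run against the old and the new model version.
baseline = {"accuracy": [0.8, 0.9, 1.0], "clarity": [0.9, 0.9]}
candidate = {"accuracy": [0.7, 0.8, 0.9], "clarity": [0.9, 0.89]}

flagged = detect_regressions(baseline, candidate)
# Only "accuracy" is flagged; the small drop in "clarity" stays within tolerance.
```

A fixed tolerance is the simplest choice; with few test cases, a significance test on the paired scores would be less prone to flagging noise.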

Which metrics are included for RAG evaluation?

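The skill covers standard retrieval metrics, including Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and Precision@K, along with custom checks for groundedness and factuality based on natural language inference (NLI).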
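
For reference, the three retrieval metrics can be sketched from their textbook definitions. The function names and input shapes below are chosen for this example and are not taken from the skill's code.

```python
import math
from typing import Dict, List, Set, Tuple


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: List[Tuple[List[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_k(retrieved: List[str], relevance: Dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    def dcg(gains: List[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))

    actual = dcg([relevance.get(doc, 0.0) for doc in retrieved[:k]])
    ideal = dcg(sorted(relevance.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


p_at_2 = precision_at_k(["a", "b", "c"], {"a", "c"}, k=2)                 # 0.5
mrr = mean_reciprocal_rank([(["a"], {"a"}), (["x", "a"], {"a"})])         # 0.75
perfect = ndcg_at_k(["a", "b"], {"a": 3.0, "b": 1.0}, k=2)                # 1.0
```

This NDCG uses linear gain; some implementations use the exponential form (2^rel - 1), which gives different values for graded relevance.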

LLM Evaluation Framework

Name: LLM Evaluation Framework
Author: duanbiao2000


Data Science & ML

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking.

The LLM Evaluation skill provides a robust framework for measuring and improving the performance of AI applications. It equips developers with standardized methodologies for automated metrics like BLEU and BERTScore, sophisticated LLM-as-judge patterns for qualitative assessment, and human evaluation workflows. Whether you are comparing model versions, validating prompt changes, or detecting performance regressions, this skill establishes a systematic approach to ensuring production-grade quality and reliability in LLM-driven systems.

Key Features

01 Custom metric implementations for toxicity and factuality

02 Specialized RAG evaluation for retrieval and groundedness

03 Human evaluation frameworks with inter-rater agreement tracking

04 LLM-as-judge patterns for pointwise and pairwise comparisons

05 0 GitHub stars

06 Comprehensive automated metrics including BLEU, ROUGE, and BERTScore
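
The listing does not say which statistic backs the inter-rater agreement tracking. Cohen's kappa is one common choice for two raters assigning categorical labels, so the sketch below assumes it; the function name and the sample labels are illustrative.

```python
from collections import Counter
from typing import List


def cohens_kappa(rater_a: List[str], rater_b: List[str]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: share of items on which the raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


rater_a = ["good", "good", "bad", "bad"]
rater_b = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(rater_a, rater_b)  # observed 0.75, expected 0.5, kappa 0.5
```

Kappa near 1 means strong agreement and near 0 means agreement no better than chance; with more than two raters, a statistic such as Fleiss' kappa would be needed instead.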

Use Cases

01 Validating RAG system accuracy and retrieval relevance

02 Detecting performance regressions in production AI pipelines

03 Comparing performance between different models or prompt iterations


Install with 🐟 Skill.Fish

npx skillfish add duanbiao2000/obsidiandoc26 llm-evaluation

For use in Claude.ai and ChatGPT
