How does the LLM-as-judge pattern work?

A high-capability model scores the target model's outputs against a specific rubric, returning rubric-based scores and qualitative feedback without manual human review.
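The pattern can be sketched roughly as follows. This is a minimal illustration, not the skill's own template: `RUBRIC`, `build_judge_prompt`, `judge`, and `fake_judge` are hypothetical names, the 1-5 scale and the JSON reply format are assumptions, and `call_model` stands in for whatever client actually sends a prompt to the judge model.

```python
import json
from typing import Callable

# Hypothetical rubric: dimension name -> description shown to the judge.
RUBRIC = {
    "accuracy": "Claims are factually correct.",
    "clarity": "The answer is easy to follow.",
}


def build_judge_prompt(task: str, output: str, rubric: dict[str, str]) -> str:
    """Render the task, the output under test, and the rubric into one prompt."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "You are grading a model output against a rubric.\n"
        f"Task:\n{task}\n\n"
        f"Output to grade:\n{output}\n\n"
        f"Rubric (score each dimension from 1 to 5):\n{criteria}\n\n"
        'Reply with JSON only: {"scores": {"<dimension>": <int>}, "feedback": "<text>"}'
    )


def judge(task: str, output: str, rubric: dict[str, str],
          call_model: Callable[[str], str]) -> dict:
    """Ask the judge model for scores and validate its reply against the rubric."""
    raw = call_model(build_judge_prompt(task, output, rubric))
    reply = json.loads(raw)  # raises ValueError if the judge did not return JSON
    scores = reply["scores"]
    missing = sorted(set(rubric) - set(scores))
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    clean = {}
    for name in rubric:
        value = int(scores[name])
        if not 1 <= value <= 5:
            raise ValueError(f"score for {name!r} out of range: {value}")
        clean[name] = value
    return {"scores": clean, "feedback": reply.get("feedback", "")}


def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge model call (assumption, for illustration only)."""
    return json.dumps({
        "scores": {"accuracy": 4, "clarity": 5},
        "feedback": "Clear, with one minor factual slip.",
    })


result = judge("Summarize the release notes.", "The release adds ...", RUBRIC, fake_judge)
print(result["scores"], result["feedback"])
```

Validating the reply against the rubric keeps a malformed or partial judge response from silently entering the results.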

What is the '95% Variance Finding' mentioned in the skill?

Research indicates that 80% of LLM output variance stems from prompt construction and 15% from random seeds and sampling. The skill focuses on these two areas because together they account for 95% of the variance in output quality.

Does it include templates for quick setup?

Yes, the skill includes pre-defined templates for rubrics, judge prompts, and structured test cases, along with checklists to ensure your evaluation framework is robust.

Can this skill help with non-deterministic outputs?

Yes, it provides strategies such as running multiple iterations to report mean/variance and using seed control to improve reproducibility during the testing phase.
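As a rough sketch of the repeated-run strategy: score the same test case once per seed, then report the mean and sample variance instead of a single number. The names `evaluate_repeated` and `simulated_score` are hypothetical, and the simulated scorer is an assumption that stands in for a real model call plus scoring step. Whether a given model API accepts a seed at all depends on the provider.

```python
import random
import statistics
from typing import Callable


def evaluate_repeated(score_once: Callable[[int], float], seeds: list[int]) -> dict:
    """Run one scored evaluation per seed and summarize the spread."""
    scores = [score_once(seed) for seed in seeds]
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        # Sample variance needs at least two data points.
        "variance": statistics.variance(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


def simulated_score(seed: int) -> float:
    """Stand-in scorer (assumption): a fixed seed always yields the same score."""
    rng = random.Random(seed)
    return 4.0 + rng.uniform(-0.5, 0.5)


report = evaluate_repeated(simulated_score, seeds=[1, 2, 3, 4, 5])
print(report)
```

Fixing the list of seeds makes the report reproducible between runs, so a change in the mean points to the prompt or model rather than to sampling noise.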

Grey Haven LLM Evaluation

Name: Grey Haven LLM Evaluation
Author: greyhaven-ai

Security & Testing

Systematizes LLM output evaluation using multi-dimensional rubrics, LLM-as-judge patterns, and statistical variance handling.

This skill provides a framework for testing and validating Large Language Model (LLM) outputs within the Claude Code environment. It applies the '95% Variance Finding', the research-backed insight that prompt quality and sampling account for nearly all output variation, to focus developer effort where it matters most. With templates for multi-dimensional rubrics, automated LLM-as-judge prompts, and strategies for handling non-determinism, it lets developers build rigorous quality gates, A/B test prompts, and detect regressions in AI-powered production systems.

Key Features

01 LLM-as-judge implementation for automated, scalable validation

02 Structured test case design templates with ground-truth support

03 Multi-dimensional scoring rubrics for granular output analysis

04 Statistical methods for managing non-deterministic model behavior

05 16 GitHub stars

06 Ready-to-use checklists for evaluation setup and rubric validation
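To illustrate what a multi-dimensional rubric might look like in code, the sketch below combines per-dimension scores into one weighted score between 0 and 1. The `Dimension` class, the `weighted_score` function, and the example weights are all hypothetical; the skill's actual rubric templates may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One rubric dimension (hypothetical structure, for illustration)."""
    name: str
    weight: float
    max_score: int = 5


def weighted_score(dimensions: list[Dimension], scores: dict[str, float]) -> float:
    """Normalize each score by its maximum, then take the weighted average."""
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    total = 0.0
    for d in dimensions:
        score = scores[d.name]  # KeyError if a dimension was not scored
        if not 0 <= score <= d.max_score:
            raise ValueError(f"score for {d.name!r} out of range: {score}")
        total += d.weight * (score / d.max_score)
    return total / total_weight


rubric = [
    Dimension("accuracy", weight=2.0),
    Dimension("clarity", weight=1.0),
]
print(weighted_score(rubric, {"accuracy": 5, "clarity": 2.5}))
```

Keeping the per-dimension scores alongside the aggregate preserves the granular view: a single weighted number can hide a weak dimension behind a strong one.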

Use Cases

01 Performing A/B testing on prompt versions to optimize output quality

02 Detecting performance regressions after updating model versions or system prompts

03 Building automated quality gates for production-grade LLM pipelines
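The use cases above share one mechanism: compare the scores of a candidate prompt or model version against a baseline and fail the run if quality drops. A minimal sketch, assuming each version has already been scored several times; `quality_gate`, `min_mean`, and `max_drop` are hypothetical names and the thresholds are arbitrary examples. The check compares means only and is not a statistical significance test.

```python
import statistics


def quality_gate(baseline: list[float], candidate: list[float],
                 min_mean: float = 4.0, max_drop: float = 0.2) -> dict:
    """Pass only if the candidate clears an absolute floor and has not
    regressed from the baseline by more than max_drop."""
    base_mean = statistics.mean(baseline)
    cand_mean = statistics.mean(candidate)
    drop = base_mean - cand_mean  # positive means the candidate scored lower
    return {
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        "drop": drop,
        "passed": cand_mean >= min_mean and drop <= max_drop,
    }


# Example: scores from repeated runs of prompt version A (baseline) and B (candidate).
report = quality_gate(baseline=[4.2, 4.4, 4.3], candidate=[4.3, 4.5, 4.4])
print(report["passed"])
```

In a CI pipeline the `passed` flag could decide the exit code, which turns the comparison into an automated gate.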

Install with 🐟 Skill.Fish

npx skillfish add greyhaven-ai/claude-code-config evaluation

For use in Claude.ai and ChatGPT

Download Skill