How does this skill handle position bias?

It implements a dual-pass protocol where responses are evaluated twice with their positions swapped, using consistency checks to ensure the verdict isn't influenced by presentation order.
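The dual-pass idea can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual code: the `judge` callable, its `"first"`/`"second"`/`"tie"` return values, and the choice to report a tie when the two passes disagree are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_response, second_response)
# -> "first", "second", or "tie". In practice this would wrap an LLM call.
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, prompt: str, response_a: str, response_b: str) -> str:
    """Judge A vs. B in both presentation orders and return "A", "B", or "tie"."""
    # Pass 1: A is shown first. Pass 2: B is shown first.
    pass_one = judge(prompt, response_a, response_b)
    pass_two = judge(prompt, response_b, response_a)

    # Translate positional verdicts back to response labels.
    winner_one = {"first": "A", "second": "B", "tie": "tie"}[pass_one]
    winner_two = {"first": "B", "second": "A", "tie": "tie"}[pass_two]

    # Consistency check: accept the verdict only if both orders agree.
    # Falling back to "tie" on disagreement is an assumption of this sketch.
    return winner_one if winner_one == winner_two else "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: it always prefers the first slot."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Toy judge whose verdict depends on content (length), not position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With the toy judges above, `compare_with_swap(always_first, "q", "x", "y")` yields `"tie"` because the two passes contradict each other, while `prefers_longer` gives the same winner in both orders.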

Can this skill help with rubric creation?

Yes, it includes frameworks for generating detailed, multi-level rubrics that define clear boundaries and observable characteristics for consistent grading.
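As a rough illustration of what a multi-level rubric could look like in code, the sketch below models each level as a score, a label, and an observable descriptor. The class names, fields, and the sample "factual accuracy" levels are hypothetical and only meant to show the shape of the data, not the skill's own framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricLevel:
    score: int
    label: str
    descriptor: str  # an observable characteristic a grader can check


@dataclass(frozen=True)
class Rubric:
    criterion: str
    levels: tuple

    def __post_init__(self):
        # Clear boundaries: no two levels may share a score.
        scores = [level.score for level in self.levels]
        if len(set(scores)) != len(scores):
            raise ValueError("each rubric level needs a distinct score")

    def render(self) -> str:
        """Format the rubric as plain text for inclusion in a judge prompt."""
        lines = [f"Criterion: {self.criterion}"]
        for level in sorted(self.levels, key=lambda lvl: lvl.score):
            lines.append(f"{level.score} ({level.label}): {level.descriptor}")
        return "\n".join(lines)


# Hypothetical example content.
accuracy_rubric = Rubric(
    criterion="Factual accuracy",
    levels=(
        RubricLevel(1, "Poor", "Contains at least one claim contradicted by the source."),
        RubricLevel(2, "Fair", "No contradictions, but some claims are unsupported."),
        RubricLevel(3, "Good", "Every claim is supported by the source."),
    ),
)
```

Calling `accuracy_rubric.render()` produces one header line followed by one line per level, ordered by score.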

What is LLM-as-a-Judge?

It is an evaluation technique where a highly capable Large Language Model is used to grade or compare the outputs of other models based on specific criteria or rubrics.

Should I use direct scoring or pairwise comparison?

Use direct scoring for objective criteria like factual accuracy; use pairwise comparison for subjective qualities like tone, style, and persuasiveness.

Why is justification required before scoring?

Research shows that requiring a model to provide Chain-of-Thought reasoning or evidence before assigning a score increases evaluation reliability by 15-25%.
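One way to enforce justification-before-score is to ask for the reasoning field first and reject any reply that omits it. The sketch below assumes a JSON reply format, a 1 to 5 scale, and the helper names `build_judge_prompt` and `parse_judgment`; none of these come from the skill itself.

```python
import json

# Assumed prompt wording; the key order asks for reasoning before the score.
JUDGE_PROMPT = (
    "Evaluate the response against this criterion: {criterion}\n"
    "Response:\n{response}\n\n"
    "First write your justification, citing evidence from the response.\n"
    "Only then give an integer score from {min_score} to {max_score}.\n"
    'Reply as JSON with the keys in this order: {{"justification": "...", "score": 0}}'
)


def build_judge_prompt(criterion: str, response: str, min_score: int = 1, max_score: int = 5) -> str:
    """Fill the template for one criterion and one response."""
    return JUDGE_PROMPT.format(
        criterion=criterion, response=response, min_score=min_score, max_score=max_score
    )


def parse_judgment(raw: str, min_score: int = 1, max_score: int = 5) -> dict:
    """Validate a judge reply: a non-empty justification and an in-range integer score."""
    data = json.loads(raw)
    justification = str(data.get("justification", "")).strip()
    if not justification:
        raise ValueError("judge reply has no justification")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not min_score <= score <= max_score:
        raise ValueError("score is outside the allowed range")
    return {"justification": justification, "score": score}
```

A reply such as `{"justification": "Cites the source.", "score": 4}` passes validation, whereas a reply with an empty justification raises `ValueError` and can be retried.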

Advanced LLM Evaluation

Name: Advanced LLM Evaluation
Author: goodnight000


Data Science & ML

Implements production-grade LLM-as-a-Judge techniques for evaluating AI outputs through rigorous scoring, pairwise comparisons, and bias mitigation.

This skill provides specialized knowledge for building and running evaluation systems for Large Language Models. It draws on industry best practices and academic research to implement direct scoring and pairwise comparison while actively mitigating systematic biases such as position bias and length bias. Whether you are building automated evaluation pipelines, designing complex rubrics, or A/B testing prompts, this skill aims to keep your AI assessments consistent, reliable, and closely aligned with human judgment.

Key Features

01 Metric selection guidance including F1, Spearman's ρ, and Cohen's κ

02 Custom rubric generation for domain-specific quality standards

03 Automated LLM-as-a-Judge scoring and comparison frameworks

04 Systematic bias mitigation protocols for position and length bias

05 Chain-of-Thought integration for transparent evaluation reasoning

Use Cases

01 Developing standardized rubrics for consistent human and AI content moderation

02 Comparing model performance variations after prompt or hyperparameter changes

03 Building automated CI/CD evaluation pipelines for LLM applications


Install with 🐟 Skill.Fish

npx skillfish add goodnight000/kittycourt advanced-evaluation

For use in Claude.ai and ChatGPT
