Implements production-grade LLM-as-a-judge techniques to evaluate AI outputs using direct scoring, pairwise comparison, and bias mitigation.
This skill provides a comprehensive framework for building reliable automated evaluation systems for LLM outputs. It synthesizes academic research and industry best practices to help developers implement LLM-as-a-judge patterns, manage complex evaluation rubrics, and mitigate common failure modes such as position bias (favoring whichever candidate is listed first) and length bias (rewarding longer answers regardless of quality). Whether you are A/B testing prompts or enforcing consistent quality standards across production pipelines, this skill offers actionable guidance on selecting the right metrics, structuring evaluation prompts, and keeping automated judgments closely correlated with human ones.
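As a minimal sketch of the two core patterns, the snippet below shows direct scoring against an explicit rubric and pairwise comparison with position-swap debiasing. The `call_judge` stub, the prompt templates, and the function names are illustrative assumptions rather than a fixed API; swap in whichever judge-model client you actually use.

```python
import json
from typing import Callable

def call_judge(prompt: str) -> str:
    """Placeholder: replace with a real LLM client call (e.g. an OpenAI or
    Anthropic chat completion). Must return the judge's raw text reply."""
    raise NotImplementedError("wire up your judge model here")

DIRECT_SCORING_PROMPT = """\
You are an impartial evaluator. Score the response below against the rubric.

Rubric:
{rubric}

Response:
{response}

Reply with JSON only: {{"reasoning": "<brief justification>", "score": <integer 1-5>}}
"""

def direct_score(response: str, rubric: str,
                 judge: Callable[[str], str] = call_judge) -> dict:
    """Direct scoring: rate one output against an explicit rubric.
    Asking for reasoning before the score encourages grounded judgments."""
    raw = judge(DIRECT_SCORING_PROMPT.format(rubric=rubric, response=response))
    return json.loads(raw)

PAIRWISE_PROMPT = """\
You are an impartial evaluator. Compare the two responses to the task below.

Task:
{task}

Response A:
{a}

Response B:
{b}

Judge on correctness and helpfulness, not length or style.
Reply with JSON only: {{"reasoning": "<brief justification>", "winner": "A" | "B" | "tie"}}
"""

def pairwise_compare(task: str, response_1: str, response_2: str,
                     judge: Callable[[str], str] = call_judge) -> str:
    """Pairwise comparison with position-bias mitigation: run the judge twice
    with the candidates in both orders and accept only a consistent verdict."""
    verdicts = []
    for a, b, label_map in [
        (response_1, response_2, {"A": "response_1", "B": "response_2"}),
        (response_2, response_1, {"A": "response_2", "B": "response_1"}),
    ]:
        raw = judge(PAIRWISE_PROMPT.format(task=task, a=a, b=b))
        winner = json.loads(raw)["winner"]
        verdicts.append(label_map.get(winner, "tie"))
    # If the verdict flips when positions are swapped, the judge is
    # position-biased on this pair; report a tie rather than trusting either order.
    return verdicts[0] if verdicts[0] == verdicts[1] else "tie"
```

Treating an order-flipped verdict as a tie trades recall for reliability: it discards genuinely close calls but prevents position bias from silently inflating one candidate's win rate.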