About
The Agent Performance Evaluation skill provides a comprehensive framework for measuring the quality and reliability of autonomous agent systems. It addresses the challenges unique to AI testing, such as non-determinism and context-dependent failures, with standardized rubrics for factual accuracy, tool efficiency, and citation quality. Whether you are building quality gates for production pipelines or validating context engineering strategies, this skill helps you implement scalable LLM-as-judge patterns and complexity-stratified test sets to ensure your agents perform consistently across diverse scenarios.
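
The sketch below illustrates the kind of LLM-as-judge rubric scoring and complexity-stratified test set the skill describes. It is a minimal example under stated assumptions: the rubric wording, the `call_judge_model` callable, the 1-5 scoring scale, and the pass threshold are placeholders for illustration, not the skill's actual interface.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical rubric wording; the skill's bundled rubrics may phrase these differently.
RUBRIC = {
    "factual_accuracy": "Is every claim in the answer supported by the provided sources?",
    "tool_efficiency": "Did the agent reach the result without redundant or failed tool calls?",
    "citation_quality": "Does each nontrivial claim cite a retrievable source?",
}

@dataclass
class JudgeResult:
    scores: Dict[str, int]  # 1-5 score per rubric dimension
    passed: bool            # overall quality-gate verdict

def judge_transcript(
    transcript: str,
    call_judge_model: Callable[[str], str],
    pass_threshold: float = 4.0,
) -> JudgeResult:
    """Score an agent transcript against each rubric dimension with an LLM judge.

    `call_judge_model` is a placeholder for whatever model client you use; it
    takes a prompt string and returns the judge's raw text reply.
    """
    scores: Dict[str, int] = {}
    for dimension, question in RUBRIC.items():
        prompt = (
            "You are grading an autonomous agent's transcript.\n"
            f"Dimension: {dimension}\nQuestion: {question}\n"
            "Reply with a single integer from 1 (worst) to 5 (best).\n\n"
            f"Transcript:\n{transcript}"
        )
        reply = call_judge_model(prompt)
        match = re.search(r"[1-5]", reply)  # tolerate chatty judge replies
        scores[dimension] = int(match.group()) if match else 1

    mean_score = sum(scores.values()) / len(scores)
    return JudgeResult(scores=scores, passed=mean_score >= pass_threshold)

# A toy complexity-stratified test set: one bucket per difficulty tier, so pass
# rates can be reported per tier instead of as a single blended number.
TEST_SET = {
    "simple":      ["Look up the capital of France and cite the source."],
    "multi_step":  ["Compare two products using at least two tool calls and cite both."],
    "adversarial": ["Answer a question whose premise is false; flag the false premise."],
}
```

Because judge replies are themselves non-deterministic, a production quality gate would typically run the judge several times per transcript and aggregate the scores rather than rely on a single pass.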