Why is agent evaluation different from traditional software testing?

Agents are non-deterministic and can take multiple valid paths to a goal, requiring outcome-focused rubrics rather than fixed step-by-step assertions.

How does LLM-as-judge work in this framework?

It uses a high-capability model to score an agent's output against a predefined rubric, providing scalable and consistent quality assessment across large test sets.
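As a rough illustration of the scoring step described above, here is a minimal sketch. The names `build_judge_prompt`, `judge`, `RUBRIC`, and `fake_model` are hypothetical and not part of the skill itself. The model call is injected as a plain function, so any client could be swapped in, and a stub stands in for a real judge model here.

```python
import json
from typing import Callable

# Hypothetical rubric dimensions, chosen for illustration only.
RUBRIC = ["factual_accuracy", "completeness", "citation_accuracy"]


def build_judge_prompt(task: str, answer: str, rubric: list[str]) -> str:
    """Ask the judge for a 0.0-1.0 score per rubric dimension, as JSON."""
    dims = "\n".join(f"- {d}" for d in rubric)
    return (
        "Score the answer on each dimension from 0.0 to 1.0.\n"
        f"Dimensions:\n{dims}\n\n"
        f"Task: {task}\nAnswer: {answer}\n"
        "Reply with a JSON object mapping each dimension name to its score."
    )


def judge(
    task: str,
    answer: str,
    call_model: Callable[[str], str],
    rubric: list[str] = RUBRIC,
) -> dict[str, float]:
    """Send the prompt to the judge model and validate the returned scores."""
    raw = call_model(build_judge_prompt(task, answer, rubric))
    scores = json.loads(raw)
    result = {}
    for dim in rubric:
        value = float(scores[dim])  # KeyError if the judge skipped a dimension
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{dim} score out of range: {value}")
        result[dim] = value
    return result


def fake_model(prompt: str) -> str:
    """Stub standing in for a real judge model; always returns fixed scores."""
    return json.dumps(
        {"factual_accuracy": 1.0, "completeness": 0.5, "citation_accuracy": 0.0}
    )


if __name__ == "__main__":
    print(judge("What is the capital of France?", "Paris", fake_model))
```

Validating the judge's reply before using it keeps a malformed or out-of-range response from silently skewing aggregate scores.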

How do I test context engineering choices?

Run agents with different context strategies on the same test set and compare quality scores against token costs to find the most efficient configuration.
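One way to make that comparison concrete is sketched below. It treats "most efficient" as the cheapest strategy that still clears a quality floor, which is an assumption rather than the skill's stated definition. The strategy names and numbers are invented for illustration and are not measured results.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    strategy: str   # label for the context strategy under test
    quality: float  # mean rubric score over the test set, 0.0-1.0
    tokens: int     # total tokens consumed over the same test set


def pick_most_efficient(results: list[RunResult], min_quality: float) -> RunResult:
    """Return the lowest-token strategy whose quality meets the floor."""
    eligible = [r for r in results if r.quality >= min_quality]
    if not eligible:
        raise ValueError("no strategy meets the quality floor")
    return min(eligible, key=lambda r: r.tokens)


# Illustrative numbers only; real values come from running the same test set.
results = [
    RunResult("full_history", quality=0.86, tokens=120_000),
    RunResult("summarized_history", quality=0.84, tokens=45_000),
    RunResult("last_turn_only", quality=0.61, tokens=20_000),
]

if __name__ == "__main__":
    best = pick_most_efficient(results, min_quality=0.80)
    print(best.strategy, best.quality, best.tokens)
```

Fixing the quality floor first and then minimizing tokens avoids picking a cheap strategy whose answers are too poor to use.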

What is the '95% Finding' in agent performance?

Research shows that three factors (token usage, number of tool calls, and model choice) explain 95% of performance variance in browsing agents, so an evaluation should record them alongside output quality.

What dimensions should an agent rubric include?

Effective rubrics cover factual accuracy, completeness, citation accuracy, source quality, and tool efficiency.
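The five dimensions above could be combined into a single score with a weighted average, as in this minimal sketch. The weights and the name `overall_score` are assumptions for illustration; the source does not prescribe any weighting.

```python
# Hypothetical weights; tune them to what matters for your agent.
WEIGHTS = {
    "factual_accuracy": 0.30,
    "completeness": 0.25,
    "citation_accuracy": 0.15,
    "source_quality": 0.15,
    "tool_efficiency": 0.15,
}


def overall_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-dimension scores, each expected in 0.0-1.0."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


if __name__ == "__main__":
    example = {
        "factual_accuracy": 0.9,
        "completeness": 0.8,
        "citation_accuracy": 1.0,
        "source_quality": 0.7,
        "tool_efficiency": 0.5,
    }
    print(round(overall_score(example), 3))
```

Keeping the per-dimension scores alongside the aggregate is usually worthwhile, since a single number can hide a failure in one dimension.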

Agent Performance Evaluation

Name: Agent Performance Evaluation
Author: goodnight000


Security & Testing

Builds comprehensive evaluation frameworks to measure, validate, and optimize AI agent performance and context engineering strategies.

This skill gives developers standardized methods for assessing autonomous agent systems, addressing challenges such as non-determinism and variable execution paths. It supports the creation of multi-dimensional rubrics that evaluate factors like factual accuracy, tool efficiency, and citation quality, and it works with both LLM-as-judge and human-in-the-loop workflows. Complexity stratification and token budget analysis help keep context engineering decisions data-driven and agent behavior reliable across model upgrades and system changes.
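Complexity stratification can be read as tagging each test case with a difficulty tier and reporting results per tier, so that easy cases do not mask failures on hard ones. The sketch below assumes that reading. The tier labels and the function name `pass_rate_by_tier` are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def pass_rate_by_tier(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Compute the pass rate per complexity tier from (tier, passed) pairs."""
    counts = defaultdict(lambda: [0, 0])  # tier -> [passed, total]
    for tier, passed in results:
        counts[tier][0] += int(passed)
        counts[tier][1] += 1
    return {tier: p / n for tier, (p, n) in counts.items()}


if __name__ == "__main__":
    # Illustrative outcomes only, not real evaluation data.
    outcomes = [
        ("simple", True),
        ("simple", True),
        ("complex", True),
        ("complex", False),
    ]
    print(pass_rate_by_tier(outcomes))
```

In this example the overall pass rate is 75%, while the per-tier view shows that every failure comes from the complex tier.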

Key Features

01 Token budget and model performance variance analysis

02 0 GitHub stars

03 LLM-as-judge automated scoring patterns

04 Complexity stratification for robust test set design

05 Continuous evaluation pipeline integration for CI/CD

06 Multi-dimensional rubric design for accuracy and efficiency

Use Cases

01 Benchmarking autonomous agents against ground-truth datasets

02 Detecting performance regressions during model migrations

03 Validating context engineering choices to optimize token usage
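For the regression-detection use case, one simple approach is to score the same test set before and after a model migration and flag any rubric dimension whose mean drops by more than a tolerance. This is a minimal sketch under that assumption. The function name `find_regressions`, the 0.05 tolerance, and the sample scores are all hypothetical.

```python
from statistics import mean


def find_regressions(
    baseline: dict[str, list[float]],
    candidate: dict[str, list[float]],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return each dimension whose mean score dropped by more than tolerance."""
    regressions = {}
    for dim, old_scores in baseline.items():
        drop = mean(old_scores) - mean(candidate[dim])
        if drop > tolerance:
            regressions[dim] = round(drop, 3)
    return regressions


if __name__ == "__main__":
    # Illustrative scores only, one entry per test case.
    baseline = {"factual_accuracy": [0.9, 0.8], "tool_efficiency": [0.7, 0.7]}
    candidate = {"factual_accuracy": [0.7, 0.7], "tool_efficiency": [0.7, 0.72]}
    print(find_regressions(baseline, candidate))
```

A CI job could fail the build whenever the returned dictionary is non-empty. With small test sets, a fixed tolerance may be noisy, so a statistical test could be a better fit there.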


Install with 🐟 Skill.Fish

npx skillfish add goodnight000/kittycourt evaluation

For use in Claude.ai and ChatGPT
