What is a judge agent in the context of tdx agent-test?

A judge agent is an LLM configured to evaluate your agent's responses against the criteria you've defined in your test.yml file, providing a binary pass/fail result based on your requirements.

Can I test conversations that span multiple turns?

Yes. The skill supports multi-round tests: you define consecutive user inputs and set criteria for each response in the sequence, which lets you test how the agent handles context and memory across turns.
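As a rough illustration of the idea, the Python sketch below models a multi-round test as a list of turns, each with its own criteria, and checks every reply in sequence. It is not the skill's implementation or its test.yml schema: the names Round, MultiRoundTest, run_rounds, toy_agent, and toy_judge are hypothetical, and the substring check only stands in for a real LLM judge.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs; hypothetical shape


@dataclass
class Round:
    user_input: str  # what the simulated user says this turn
    criteria: str    # what the judge checks in this turn's reply


@dataclass
class MultiRoundTest:
    name: str
    rounds: List[Round] = field(default_factory=list)


def run_rounds(test, agent, judge) -> List[bool]:
    """Feed each turn to the agent with the full history; judge each reply."""
    history: History = []
    verdicts = []
    for rnd in test.rounds:
        history.append(("user", rnd.user_input))
        reply = agent(history)
        history.append(("agent", reply))
        verdicts.append(judge(reply, rnd.criteria))
    return verdicts


def toy_agent(history):
    # Stand-in agent: repeats the first user message when asked for the name.
    first_user = next(text for role, text in history if role == "user")
    if "what is my name" in history[-1][1].lower():
        return f"You told me: {first_user}"
    return "Noted."


def toy_judge(reply, criteria):
    # Stand-in for an LLM judge: binary pass/fail via a substring check.
    return criteria.lower() in reply.lower()


memory_test = MultiRoundTest(
    name="remembers_name",
    rounds=[Round("My name is Ada.", "noted"), Round("What is my name?", "Ada")],
)
verdicts = run_rounds(memory_test, toy_agent, toy_judge)
print(verdicts)  # [True, True]
```

Because the second round can only pass if the agent uses information from the first, a per-round verdict pinpoints the turn at which context was lost.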

Where should I store my test files?

You should create a test.yml file within your agent's directory, typically alongside your agent.yml and prompt.md files, following the project structure.

How does the re-evaluation workflow save time?

Using the --reeval flag, you can update your testing criteria and re-run the evaluation against cached conversations without needing to generate new, time-consuming LLM responses.
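The saving comes from separating response generation from judging. The Python sketch below shows that separation under stated assumptions: the cache file name, the test dictionary layout, and the functions run_and_cache, reevaluate, slow_agent, and toy_judge are all hypothetical and do not reflect how tdx stores conversations internally.

```python
import json
import tempfile
from pathlib import Path

agent_calls = 0  # counts how often the expensive step runs


def slow_agent(user_input):
    # Stand-in for a slow LLM call.
    global agent_calls
    agent_calls += 1
    return f"Echo: {user_input}"


def toy_judge(response, criteria):
    # Stand-in for an LLM judge: binary pass/fail via a substring check.
    return criteria.lower() in response.lower()


def run_and_cache(tests, agent, cache_path):
    """Generate one response per test and persist them to disk."""
    cache = {t["name"]: agent(t["input"]) for t in tests}
    cache_path.write_text(json.dumps(cache))


def reevaluate(tests, judge, cache_path):
    """Judge the cached responses again without calling the agent."""
    cache = json.loads(cache_path.read_text())
    return {t["name"]: judge(cache[t["name"]], t["criteria"]) for t in tests}


tests = [{"name": "greeting", "input": "hello", "criteria": "goodbye"}]
cache_file = Path(tempfile.mkdtemp()) / "conversations.json"

run_and_cache(tests, slow_agent, cache_file)
first = reevaluate(tests, toy_judge, cache_file)   # criteria do not match yet

tests[0]["criteria"] = "hello"                     # revise only the criteria
second = reevaluate(tests, toy_judge, cache_file)  # no new agent call

print(first, second, agent_calls)
```

In this sketch the agent runs once, while the criteria can be revised and re-judged as often as needed.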

Can I filter which tests to run?

Yes, you can use the --name flag to target specific tests or the --tags flag to run groups of tests categorized by labels like 'smoke' or 'regression'.
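Conceptually, filtering selects a subset of test definitions before anything runs. The Python sketch below illustrates one plausible selection rule. It assumes that a test matches when it carries at least one of the requested tags, which the source does not confirm, and select_tests and the sample suite are hypothetical.

```python
def select_tests(tests, name=None, tags=None):
    """Keep tests that match the given name and/or share any requested tag."""
    wanted = set(tags or [])
    selected = []
    for test in tests:
        if name is not None and test["name"] != name:
            continue  # name filter given, but this test has a different name
        if wanted and not wanted & set(test.get("tags", [])):
            continue  # tag filter given, but no tag overlaps (assumed rule)
        selected.append(test)
    return selected


suite = [
    {"name": "greets_user", "tags": ["smoke"]},
    {"name": "recalls_order", "tags": ["regression"]},
    {"name": "handles_refund", "tags": ["smoke", "regression"]},
]

smoke_names = [t["name"] for t in select_tests(suite, tags=["smoke"])]
by_name = [t["name"] for t in select_tests(suite, name="recalls_order")]
print(smoke_names)  # ['greets_user', 'handles_refund']
print(by_name)      # ['recalls_order']
```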

Agent Testing Utility

Author: treasure-data

Category: Security & Testing

Automates the testing and evaluation of LLM agents using YAML-defined scenarios and AI-powered judge criteria.

The agent-test skill is a framework for validating LLM agent behavior within the Treasure Data ecosystem. Developers define multi-round interaction scenarios in YAML, and a judge agent evaluates each response against specific, measurable criteria. The tool supports regression testing, prompt refinement, and consistency checks across diverse user inputs without manual review of each response, which shortens the agent development cycle.

Key Features

01 Granular test filtering using tags and specific name-based execution.

02 Dry-run and no-eval modes for syntax validation and conversation logging.

03 8 GitHub stars

04 Efficient re-evaluation workflow to iterate on criteria without re-running LLM calls.

05 Automated YAML-based test definitions for single and multi-round conversations.

06 AI-powered judge agent for objective binary pass/fail evaluation of responses.

Use Cases

01 Validating multi-step conversational flows and memory retention in complex agents.

02 Performing regression testing on agents after updating system prompts or data sources.

03 Standardizing the quality assurance process for LLM-based applications and workflows.

Install with 🐟 Skill.Fish

npx skillfish add treasure-data/td-skills agent-test
