Granular test filtering using tags and specific name-based execution.
Dry-run and no-eval modes for syntax validation and conversation logging.
Efficient re-evaluation workflow to iterate on criteria without re-running LLM calls.
Automated YAML-based test definitions for single and multi-round conversations.
AI-powered judge agent for objective binary pass/fail evaluation of responses.
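
The features above can be illustrated with a minimal Python sketch. It is not the tool's actual API: `TestCase`, `select_tests`, `run_conversation`, `evaluate`, `echo_agent`, and `keyword_judge` are all hypothetical names chosen for illustration, and the keyword check is a simple stand-in for the AI judge. It shows tag- and name-based selection, logging a multi-round conversation once, and re-evaluating the saved log against changed criteria without calling the agent again.

```python
from dataclasses import dataclass, field


@dataclass
class TestCase:
    """Hypothetical in-memory form of one test definition."""
    name: str
    rounds: list          # user messages, one per conversation round
    criteria: str         # what the judge should check for
    tags: set = field(default_factory=set)


def select_tests(tests, tags=None, names=None):
    """Keep tests that share a requested tag or match a requested name.

    With no filter given, every test is kept.
    """
    if not tags and not names:
        return list(tests)
    tags, names = set(tags or ()), set(names or ())
    return [t for t in tests if t.name in names or (t.tags & tags)]


def run_conversation(test, agent):
    """Send each round to the agent and record the transcript."""
    return [{"user": msg, "assistant": agent(msg)} for msg in test.rounds]


def evaluate(test, transcript, judge):
    """Binary pass/fail verdict from the judge for one logged conversation."""
    return bool(judge(test.criteria, transcript))


# --- Stand-ins (assumptions): a real setup would call an LLM for both. ---
calls = []


def echo_agent(message):
    calls.append(message)              # counts how often the agent is invoked
    return f"You said: {message}"


def keyword_judge(criteria, transcript):
    return any(criteria in turn["assistant"] for turn in transcript)


tests = [
    TestCase("greeting", ["hello"], "hello", {"smoke"}),
    TestCase("refund_flow", ["I want a refund", "order 123"], "refund", {"billing"}),
]

# Run only the tests tagged "billing" and keep their conversation logs.
selected = select_tests(tests, tags={"billing"})
logs = {t.name: run_conversation(t, echo_agent) for t in selected}
first = {t.name: evaluate(t, logs[t.name], keyword_judge) for t in selected}

# Change the criteria and re-evaluate from the saved logs only.
selected[0].criteria = "order 123"
second = {t.name: evaluate(t, logs[t.name], keyword_judge) for t in selected}
```

Keeping the conversation run and the evaluation as separate steps is what makes re-evaluation cheap in this sketch: the second pass reads `logs` and never touches `echo_agent`.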