Can I export the evaluation results to my task manager?

Yes. If the MCP Linear plugin is available, the skill can automatically create Linear projects and issues from its findings.

What are the prerequisites for using this skill?

You must have a configured Langfuse setup and an agent configuration file located at .codex/agent-eval/ .yaml within your repository.

Where are the local reports stored?

Reports are generated as Markdown files in the .codex/agent-eval/ /reports/ directory, organized by cycle number.

How does the trace analysis work?

The skill uses langfuse-trace-analysis to examine representative failures and compares them against successful traces with similar inputs to pinpoint the root cause of errors.
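The comparison idea can be illustrated with a minimal sketch. This is not the skill's actual implementation and it does not call Langfuse: the trace dictionaries, the `input` and `steps` keys, and the helper names `closest_success` and `first_divergence` are all hypothetical, and plain string similarity stands in for whatever notion of "similar inputs" the skill really uses.

```python
from difflib import SequenceMatcher


def closest_success(failed_input, successes):
    """Return the successful trace whose input text is most similar to failed_input.

    `successes` is a hypothetical list of dicts with "input" and "steps" keys.
    """
    return max(
        successes,
        key=lambda t: SequenceMatcher(None, failed_input, t["input"]).ratio(),
    )


def first_divergence(failed, success):
    """Return the index of the first step where the two traces differ, or None."""
    for i, (a, b) in enumerate(zip(failed["steps"], success["steps"])):
        if a != b:
            return i
    # One trace is a strict prefix of the other: they diverge where the shorter ends.
    if len(failed["steps"]) != len(success["steps"]):
        return min(len(failed["steps"]), len(success["steps"]))
    return None
```

Pairing a failure with its nearest successful neighbour and locating the first differing step narrows the search for a root cause to one point in the run, which is the general shape of the analysis described above.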

Does this skill automatically apply code fixes?

No. The skill is designed to analyze and recommend. It documents findings and suggests specific fixes at identified file paths, but for safety it does not apply them automatically.

Langfuse Agent Evaluator

Name: Langfuse Agent Evaluator
Author: mberto10


Analytics & Monitoring

Orchestrates end-to-end evaluation cycles for AI agents using Langfuse to identify performance regressions and generate actionable optimization reports.

The Langfuse Agent Evaluator is a specialized skill that brings observability and testing to AI agent development. It automates a multi-phase workflow: running dataset experiments with configured judges, performing root cause analysis on failed traces, and comparing performance across development cycles. It identifies specific failure patterns and symptoms, then provides structured fix recommendations without auto-applying unverified changes. Findings are delivered as Linear issues or local reports.
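To make the cycle-to-cycle comparison concrete, here is a minimal sketch of regression detection over judge scores. It is an illustration under stated assumptions, not the skill's own logic: the score records (dicts with `dimension` and `value`), the function names, and the `tolerance` threshold are all hypothetical.

```python
from statistics import mean


def dimension_means(scores):
    """Average score per dimension for one cycle.

    `scores` is a hypothetical list of {"dimension": str, "value": float}.
    """
    by_dim = {}
    for s in scores:
        by_dim.setdefault(s["dimension"], []).append(s["value"])
    return {dim: mean(values) for dim, values in by_dim.items()}


def find_regressions(previous, current, tolerance=0.05):
    """Return {dimension: delta} for dimensions that dropped by more than tolerance."""
    prev, curr = dimension_means(previous), dimension_means(current)
    return {
        dim: round(curr[dim] - prev[dim], 4)
        for dim in prev
        if dim in curr and curr[dim] < prev[dim] - tolerance
    }
```

Comparing per-dimension averages rather than a single overall score keeps a regression in one dimension from being masked by gains elsewhere.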

Key Features

01 Structured fix recommendations with impact and complexity assessments

02 Comparative trace analysis between successful and failed runs

03 0 GitHub stars

04 Multi-format reporting including Linear project issues or local Markdown summaries

05 Automated experiment execution using Langfuse datasets and judges

06 Systematic failure analysis grouping by dimension and symptom
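Grouping failures by dimension and symptom can be sketched in a few lines. This is a hypothetical illustration rather than the skill's code: the failure records and their `dimension`, `symptom`, and `trace_id` keys are assumed for the example.

```python
from collections import defaultdict


def group_failures(failures):
    """Group failed trace IDs by (dimension, symptom), largest group first.

    `failures` is a hypothetical list of dicts with "dimension", "symptom",
    and "trace_id" keys.
    """
    groups = defaultdict(list)
    for f in failures:
        groups[(f["dimension"], f["symptom"])].append(f["trace_id"])
    # sorted() is stable, so equally sized groups keep their first-seen order.
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
```

Sorting by group size puts the most common failure pattern first, which is usually where a representative trace is worth inspecting.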

Use Cases

01 Benchmarking AI agent performance against a golden dataset before production deployment

02 Performing root-cause analysis on edge-case failures through detailed trace comparisons

03 Tracking and documenting agent improvement cycles over time for stakeholder reporting

Install with 🐟 Skill.Fish

npx skillfish add mberto10/mberto-compound langfuse-agent-eval