About
This skill provides a comprehensive toolkit for measuring and improving the performance of Large Language Model applications. It guides developers through implementing standard NLP metrics such as BLEU and ROUGE, embedding-based metrics such as BERTScore, and modern LLM-as-judge evaluation patterns. Whether you are building RAG pipelines, chatbots, or classification agents, this skill helps you establish rigorous baselines, run A/B tests with statistical significance testing, and catch performance regressions before they reach production. It bridges the gap between raw model output and production-grade reliability by quantifying quality along dimensions such as accuracy, groundedness, and coherence.
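
As a rough illustration of the kind of baseline the skill helps establish, here is a minimal sketch that computes BLEU and ROUGE for a single prediction/reference pair, assuming the `nltk` and `rouge-score` packages are installed. The function and variable names are illustrative only and are not part of the skill's actual API.

```python
# Minimal baseline-metric sketch (illustrative; not the skill's API).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer


def baseline_scores(prediction: str, reference: str) -> dict:
    """Compute BLEU and ROUGE for one prediction/reference pair."""
    # BLEU over whitespace tokens, with smoothing so short outputs
    # do not collapse to a zero score.
    bleu = sentence_bleu(
        [reference.split()],
        prediction.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    # ROUGE-1 and ROUGE-L F-measures via Google's rouge-score package.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, prediction)
    return {
        "bleu": bleu,
        "rouge1_f": rouge["rouge1"].fmeasure,
        "rougeL_f": rouge["rougeL"].fmeasure,
    }


if __name__ == "__main__":
    print(baseline_scores(
        "The cat sat on the mat.",
        "A cat was sitting on the mat.",
    ))
```

In practice these lexical-overlap scores would be complemented by embedding-based metrics such as BERTScore and by LLM-as-judge rubrics, which the skill covers for dimensions like groundedness and coherence.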