Statistical A/B testing frameworks featuring t-tests and Cohen's d effect-size analysis
Regression detection to identify performance drops before deployment
Automated NLP metrics including BLEU, ROUGE, METEOR, and BERTScore
Retrieval-specific metrics for RAG systems like MRR, NDCG, and Precision@K
LLM-as-Judge implementation for automated qualitative and pairwise assessments
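
To make a few of the listed metrics concrete, here is a minimal sketch in standard-library Python. It is not taken from the project itself: the function names (cohens_d, welch_t, precision_at_k, reciprocal_rank, mean_reciprocal_rank, ndcg_at_k) and the input shapes are assumptions chosen for illustration. The t statistic shown is Welch's unequal-variance variant, and the p-value lookup is left out.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d between two score samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def welch_t(a, b):
    """Welch's t statistic (unequal variances). No p-value is computed here."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved ids that appear in the relevant set."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over (ranked, relevant) pairs, one pair per query."""
    return mean(reciprocal_rank(ranked, relevant) for ranked, relevant in runs)


def ndcg_at_k(ranked, gains, k):
    """NDCG@k with linear gains; `gains` maps doc id -> graded relevance."""
    dcg = sum(
        gains.get(doc, 0) / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

One way to use this for regression detection, again as an assumption rather than the project's documented behavior, is to compute cohens_d(candidate_scores, baseline_scores) on a fixed evaluation set and block deployment when the effect is negative and larger in magnitude than a chosen threshold.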