01 Automated metrics integration including BLEU, ROUGE, and BERTScore
02 Automated regression detection to track performance over time
03 23,194 GitHub stars
04 LLM-as-judge patterns for pointwise and pairwise model comparisons
05 RAG-specific retrieval metrics like MRR, NDCG, and Precision@K
06 Statistical A/B testing framework with Cohen's d effect size analysis
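
To make the retrieval metrics and the effect size named in the list concrete, here is a minimal Python sketch using only the standard library. It is an illustration of the textbook definitions, not the project's own implementation: every function name and signature below is hypothetical, the NDCG variant assumes linear gains with a log2 discount, and Cohen's d is computed with the pooled sample standard deviation.

```python
import math
from statistics import mean, variance


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant (assumes k >= 1)."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over an iterable of (retrieved, relevant) pairs, one per query."""
    return mean(reciprocal_rank(ret, rel) for ret, rel in runs)


def ndcg_at_k(retrieved, gains, k):
    """NDCG@k; `gains` maps id -> graded relevance, missing ids count as 0."""
    def dcg(values):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(doc, 0.0) for doc in retrieved[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


def cohens_d(a, b):
    """Cohen's d for two samples (each needs at least two values)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

In an A/B comparison, `cohens_d` would typically be applied to per-example scores from the two variants, alongside a significance test, so that a statistically detectable difference can also be judged by its size.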