01 LLM-as-Judge patterns for pointwise and pairwise model comparisons
02 Automated metrics integration including BLEU, ROUGE, and BERTScore
03 Retrieval evaluation for RAG systems using MRR and NDCG
04 Automated regression detection to flag performance drops before production
05 Statistical A/B testing framework with Cohen's d effect size analysis
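
The retrieval and effect-size metrics named in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's own implementation: the function names (`mrr`, `dcg`, `ndcg`, `cohens_d`) and the input shapes are assumptions, NDCG here uses linear gain and takes the ideal ordering from the retrieved list itself, and Cohen's d uses the pooled sample standard deviation.

```python
import math
from statistics import mean, variance


def mrr(ranked_relevance):
    """Mean reciprocal rank.

    ranked_relevance: one list per query of 0/1 flags in ranked order
    (hypothetical input shape chosen for this sketch).
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, rel in enumerate(flags, start=1):
            if rel:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def dcg(gains, k):
    """Discounted cumulative gain at k with linear gain and log2 discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg(gains, k):
    """NDCG at k; the ideal ranking is the same gains sorted descending."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

For example, `mrr([[0, 1, 0], [1, 0, 0]])` averages reciprocal ranks 1/2 and 1, and `cohens_d` applied to per-example scores from two model variants gives the standardized difference in means that an A/B comparison would report alongside a significance test.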