01 Statistical A/B testing framework with Cohen's d effect size calculations
02 Human evaluation frameworks with inter-rater agreement (Cohen's Kappa) tools
03 GitHub stars: 2
04 Comprehensive automated metrics suite including BLEU, ROUGE, and BERTScore
05 LLM-as-Judge patterns for automated semantic scoring and pairwise comparisons
06 Automated regression detection to prevent quality drops during model updates
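
As a rough illustration of the two statistics named in the list, Cohen's d and Cohen's Kappa can be sketched in plain Python as below. This is a minimal sketch using only the standard library, not the framework's own implementation; the function names `cohens_d` and `cohens_kappa` and the sample data are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def cohens_kappa(rater_1, rater_2):
    """Cohen's Kappa for two raters labelling the same items in the same order."""
    n = len(rater_1)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(x == y for x, y in zip(rater_1, rater_2)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)
    if expected == 1:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical scores for variants A and B of an A/B test.
effect_size = cohens_d([2, 4, 6], [1, 3, 5])

# Hypothetical labels from two human raters on four outputs.
agreement = cohens_kappa(
    ["good", "good", "bad", "bad"],
    ["good", "bad", "bad", "bad"],
)
```

With these toy inputs both `effect_size` and `agreement` come out to 0.5. In practice an effect size would normally be reported alongside a significance test, and Kappa alongside the raw agreement rate.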