01 Statistical A/B testing framework with Cohen's d effect size analysis
02 Regression detection to prevent performance drops before deployment
03 Automated NLP metrics including BLEU, ROUGE, and BERTScore
04 LLM-as-Judge patterns for pointwise and pairwise qualitative assessment
05 Human evaluation structures with inter-rater agreement calculation
06 3 GitHub stars
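
As a rough illustration of the effect-size and regression-detection items above, the following sketch computes Cohen's d with a pooled standard deviation and flags a candidate whose scores fall below a baseline. It is not taken from the framework itself: the function names `cohens_d` and `is_regression` and the `-0.2` threshold are assumptions chosen for illustration (0.2 is the conventional "small" effect size).

```python
import math
from statistics import mean, variance


def cohens_d(baseline, candidate):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Positive values mean the candidate scores higher than the baseline.
    """
    n1, n2 = len(baseline), len(candidate)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two observations")

    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(baseline) + (n2 - 1) * variance(candidate)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")

    return (mean(candidate) - mean(baseline)) / math.sqrt(pooled_var)


def is_regression(baseline, candidate, threshold=-0.2):
    """Hypothetical gate: flag a regression when d is at or below the threshold.

    The default threshold is an illustrative assumption, not a documented value.
    """
    return cohens_d(baseline, candidate) <= threshold
```

A deployment check could call `is_regression(baseline_scores, candidate_scores)` on per-example metric scores and block the release when it returns `True`. In practice the effect size would usually be paired with a significance test, since d alone says nothing about sampling noise.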