01 LLM-as-Judge patterns for pointwise and pairwise semantic quality assessment
02 Automated text generation metrics including BLEU, ROUGE, and BERTScore
03 Regression detection to identify performance drops before production deployment
04 Human-in-the-loop evaluation structures and inter-rater agreement tools
05 Statistical A/B testing framework with Cohen's d effect size analysis
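
To make the effect size item concrete, here is a minimal sketch of Cohen's d for two sets of evaluation scores. It uses the standard pooled-standard-deviation formula; the function name `cohens_d` and the sample scores are hypothetical and are not taken from the framework itself, whose actual API is not shown in this listing.

```python
import math
from statistics import mean, variance


def cohens_d(baseline, candidate):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Positive values mean the candidate scored higher than the baseline.
    This is an illustrative helper, not the framework's own function.
    """
    n_a, n_b = len(baseline), len(candidate)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two scores")
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(baseline) + (n_b - 1) * variance(candidate)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(candidate) - mean(baseline)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Hypothetical per-example quality scores for two prompt variants.
    variant_a = [0.71, 0.68, 0.74, 0.70, 0.69]
    variant_b = [0.76, 0.73, 0.78, 0.75, 0.77]
    print(f"Cohen's d (B vs A): {cohens_d(variant_a, variant_b):.2f}")
```

By the usual rule of thumb, |d| near 0.2 is a small effect, near 0.5 a medium one, and 0.8 or above a large one. An effect size is normally reported next to a significance test and does not replace it.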