01Human evaluation structures and inter-rater agreement (Cohen's Kappa) tracking
02Automated metrics integration including BLEU, ROUGE, and BERTScore
030 GitHub stars
04Automated regression detection to identify performance drops before deployment
05LLM-as-judge patterns for pointwise and pairwise semantic evaluation
06Statistical A/B testing framework with T-test and Cohen's d analysis