01 Statistical A/B testing and Cohen’s d effect size analysis
02 81 GitHub stars
03 LLM-as-Judge implementation for qualitative scoring
04 Human evaluation frameworks and inter-rater agreement tools
05 Automated regression detection against performance baselines
06 Automated metrics for text generation and RAG performance
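
As a rough illustration of the effect size analysis named in item 01, the sketch below computes Cohen's d for two independent groups of scores using the pooled standard deviation. The function name `cohens_d` and the sample scores are hypothetical and chosen only for this example; they are not taken from the project's code, and the project may compute the statistic differently.

```python
import math
import statistics


def cohens_d(baseline, candidate):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    A positive value means the candidate scores are higher on average.
    Hypothetical helper for illustration; not the project's API.
    """
    n_a, n_b = len(baseline), len(candidate)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")

    # statistics.variance is the sample variance (n - 1 denominator).
    var_a = statistics.variance(baseline)
    var_b = statistics.variance(candidate)

    # Pool the two variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")

    mean_diff = statistics.mean(candidate) - statistics.mean(baseline)
    return mean_diff / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Made-up evaluation scores for two prompt variants (illustrative only).
    baseline_scores = [0.62, 0.70, 0.66, 0.71, 0.64]
    candidate_scores = [0.68, 0.75, 0.73, 0.77, 0.70]
    print(f"Cohen's d: {cohens_d(baseline_scores, candidate_scores):.2f}")
```

By the common rule of thumb, values of d near 0.2, 0.5, and 0.8 are read as small, medium, and large effects. A baseline check like the one in item 05 could compare d against a chosen threshold, though that threshold is a judgment call and not something the list specifies.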