01 LLM-as-judge patterns for pointwise and pairwise comparisons
02 Statistical A/B testing with Cohen's d effect size analysis
03 Automated text metrics including BLEU, ROUGE, and BERTScore
04 Human evaluation frameworks with inter-rater agreement calculation
05 RAG-specific metrics like MRR, NDCG, and groundedness checks
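
Three of the quantities named above have short closed-form definitions, so a minimal sketch may help make them concrete. The function names (`cohens_d`, `cohens_kappa`, `mean_reciprocal_rank`) and input shapes are assumptions chosen for illustration, not an API taken from this project. Cohen's kappa is used here as one common inter-rater agreement statistic; the source does not say which one it computes.

```python
import math
from collections import Counter
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample variance."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa for two raters labelling the same items.

    Undefined (division by zero) when chance agreement is exactly 1.
    """
    n = len(rater_1)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(x == y for x, y in zip(rater_1, rater_2)) / n
    # Agreement expected by chance from each rater's label frequencies.
    c1, c2 = Counter(rater_1), Counter(rater_2)
    expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    return (observed - expected) / (1 - expected)


def mean_reciprocal_rank(ranked_relevance):
    """MRR over queries; each inner list holds relevance flags in rank order.

    A query with no relevant result contributes 0.
    """
    total = 0.0
    for flags in ranked_relevance:
        rank = next((i for i, hit in enumerate(flags, start=1) if hit), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_relevance)
```

For example, `cohens_d([2, 4, 6], [1, 3, 5])` compares two sets of scores whose means differ by 1 with a pooled standard deviation of 2, and `mean_reciprocal_rank([[False, True], [True], [False, False]])` averages reciprocal ranks of 1/2, 1, and 0.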