01. Automated metric implementations, including BLEU, ROUGE, and BERTScore
02. LLM-as-judge patterns for pointwise and pairwise model comparisons
03. Statistical A/B testing and Cohen's d effect size analysis
04. Regression detection frameworks to prevent performance drops during deployment
05. Retrieval-specific (RAG) evaluation metrics such as MRR and NDCG
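
To make the retrieval metrics and the effect size in the list above concrete, here is a minimal standard-library sketch of MRR, NDCG@k, and Cohen's d. The function names and input shapes are illustrative assumptions, not the API of this package: relevance is passed as plain lists in ranked order, NDCG uses linear gain with a log2 discount, the ideal ranking is built from the same result list (so it assumes every relevant document was retrieved), and Cohen's d uses the pooled sample standard deviation.

```python
import math
from statistics import mean, stdev  # stdev is the sample standard deviation


def mean_reciprocal_rank(ranked_relevance):
    """MRR over queries. Each element is a list of 0/1 flags in ranked order."""
    total = 0.0
    for flags in ranked_relevance:
        # Reciprocal rank of the first relevant hit; 0.0 if nothing is relevant.
        total += next((1.0 / (i + 1) for i, rel in enumerate(flags) if rel), 0.0)
    return total / len(ranked_relevance)


def dcg_at_k(gains, k):
    """Discounted cumulative gain: linear gain, log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """NDCG@k. Assumption: the ideal order is the same gains sorted descending."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

As a usage example, `mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])` averages the reciprocal ranks 1/2 and 1, and `cohens_d(scores_a, scores_b)` can be applied to per-example scores from two model variants (each sample needs at least two values). A production setup would more likely lean on an established library for these metrics; the sketch only shows what is being computed.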