Statistical A/B testing framework with significance and effect size calculations
LLM-as-Judge implementation for automated qualitative and pairwise comparisons
Regression detection system to identify performance drops before deployment
RAG-specific evaluation including groundedness, MRR, and NDCG metrics
Automated metrics for text generation including BLEU, ROUGE, and BERTScore
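
As an illustration of the retrieval metrics named in the list, here is a minimal Python sketch of MRR and NDCG. It is not taken from the framework itself: the function names (`mean_reciprocal_rank`, `dcg`, `ndcg`) and the input shapes are assumptions made for this example, and the DCG shown uses the linear-gain form (gain divided by log2 of rank + 1), which is one of several common variants.

```python
import math
from typing import Sequence


def mean_reciprocal_rank(ranked_flags: Sequence[Sequence[int]]) -> float:
    """Hypothetical helper: each inner sequence holds 0/1 relevance flags
    for one query, in the order the retriever ranked the documents."""
    if not ranked_flags:
        return 0.0
    total = 0.0
    for flags in ranked_flags:
        # Add 1/rank of the first relevant hit; queries with no hit add 0.
        for rank, flag in enumerate(flags, start=1):
            if flag:
                total += 1.0 / rank
                break
    return total / len(ranked_flags)


def dcg(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg(gains: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Three queries: first hit at rank 2, at rank 1, and no hit at all.
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))
    # Graded relevance labels for one query, in ranked order.
    print(ndcg([3, 2, 1], k=3))
```

MRR only rewards the position of the first relevant result, while NDCG accounts for graded relevance across the whole top-k, so the two are usually reported together for RAG retrieval quality.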