01 Comprehensive automated metrics, including BLEU, ROUGE, BERTScore, and perplexity
02 LLM-as-judge implementation for automated pointwise and pairwise response grading
03 Specialized RAG evaluation metrics for retrieval (MRR, NDCG) and groundedness
04 Statistical A/B testing framework with p-value and Cohen’s d effect size analysis
05 Automated regression detection to identify performance drops before deployment
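As a rough illustration of the retrieval metrics (MRR, NDCG) and the Cohen’s d effect size named in the list above, here is a minimal Python sketch using only the standard library. The function names and input shapes are assumptions made for this example and are not taken from the project itself; NDCG is computed with linear gains and a log2 discount, which is one common variant among several.

```python
import math
from statistics import mean, stdev


def mean_reciprocal_rank(ranked_relevance):
    """MRR over queries.

    ranked_relevance: one list per query of 0/1 relevance flags in rank order
    (hypothetical input shape chosen for this sketch).
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, flag in enumerate(flags, start=1):
            if flag:
                total += 1.0 / rank  # reciprocal rank of the first relevant hit
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains, k):
    """NDCG@k for one query, with linear gains and a log2 position discount."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
    print(ndcg_at_k([3, 2, 1], k=3))
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

In an A/B comparison, a p-value from a significance test is usually reported alongside Cohen’s d, since the p-value says whether a difference is detectable and d says how large it is.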