Automated NLP metrics including BLEU, ROUGE, and BERTScore
Automated regression detection to identify performance drops before deployment
LLM-as-Judge patterns for pointwise and pairwise qualitative evaluation
RAG-specific metrics like MRR, NDCG, and groundedness checks
Statistical A/B testing framework with p-value and effect size analysis
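
To make the retrieval metrics named in the list concrete, here is a minimal sketch of MRR and NDCG in plain Python. It is not taken from the tool itself: the function names (mean_reciprocal_rank, dcg, ndcg) and the input shapes are assumptions chosen for illustration, and the DCG here uses the linear-gain form (gain divided by log2 of rank + 1), which is one of several common variants.

```python
import math
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """MRR over a set of queries.

    Each inner sequence holds 0/1 relevance flags for one query's results,
    in ranked order. A query with no relevant result contributes 0.
    (Hypothetical helper; input shape is an assumption.)
    """
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank  # reciprocal rank of the first hit
                break
    return total / len(ranked_relevance)


def dcg(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain at cutoff k (linear-gain variant)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg(gains: Sequence[float], k: int) -> float:
    """NDCG at cutoff k: DCG divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0
```

For example, mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]) averages reciprocal ranks 1/2, 1, and 0 across three queries, and ndcg([3, 2, 1], k=3) scores a ranking that is already in ideal order.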