LLM-as-Judge patterns for automated pointwise and pairwise quality grading
Regression detection to prevent performance drops during model or prompt updates
Retrieval-Augmented Generation (RAG) metrics like MRR, NDCG, and groundedness
Automated text generation metrics including BLEU, ROUGE, and BERTScore
Statistical A/B testing framework with Cohen’s d effect size calculation
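
To make a few of the listed metrics concrete, here is a minimal Python sketch of MRR, NDCG@k, and Cohen's d using only the standard library. It is not the framework's own implementation. The function names (`mrr`, `ndcg_at_k`, `cohens_d`) and the input shapes are assumptions chosen for illustration. The NDCG variant uses linear gains, and Cohen's d uses the pooled sample standard deviation.

```python
import math
from statistics import mean, stdev


def mrr(ranked_relevance):
    """Mean reciprocal rank.

    ranked_relevance: one list per query of 0/1 relevance flags in rank order
    (hypothetical input shape). A query with no relevant result contributes 0.
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains, k):
    """NDCG@k with linear gains: DCG = sum(gain_i / log2(i + 1)), ranks from 1."""

    def dcg(values):
        # enumerate() starts at 0, so rank i + 1 gives a discount of log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample std dev.

    Each sample needs at least two values, and the pooled variance must be
    non-zero, otherwise this raises an error.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

In an A/B comparison, `cohens_d(scores_new, scores_old)` would report how many pooled standard deviations separate the two variants. By convention, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects.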