01 Statistical A/B testing framework with Cohen’s d and p-value analysis
02 RAG-specific evaluation patterns for retrieval and groundedness
03 LLM-as-Judge scoring for automated qualitative assessment
04 Regression detection to identify performance drops between model versions
05 0 GitHub stars
06 Implementation of automated metrics including BLEU, ROUGE, and BERTScore
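
As a rough illustration of how the A/B testing and regression-detection items might fit together, the sketch below computes Cohen's d with a pooled standard deviation and estimates a two-sided p-value with a permutation test. It is not taken from the project itself: the function names (`cohens_d`, `permutation_p_value`, `detect_regression`), the thresholds (`alpha`, `min_effect`), and the sample scores are all assumptions made for this example, and the permutation test is used here only because it needs nothing beyond the Python standard library.

```python
import random
import statistics


def cohens_d(baseline, candidate):
    """Cohen's d (candidate minus baseline) using the pooled standard deviation."""
    n_a, n_b = len(baseline), len(candidate)
    var_a = statistics.variance(baseline)   # sample variance (n - 1 denominator)
    var_b = statistics.variance(candidate)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.fmean(candidate) - statistics.fmean(baseline)) / pooled_sd


def permutation_p_value(baseline, candidate, n_permutations=5000, seed=0):
    """Two-sided p-value for the difference in means, via a permutation test."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    n_a = len(baseline)
    observed = abs(statistics.fmean(candidate) - statistics.fmean(baseline))
    pooled = list(baseline) + list(candidate)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[n_a:]) - statistics.fmean(pooled[:n_a]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


def detect_regression(baseline, candidate, alpha=0.05, min_effect=0.2):
    """Flag a regression when the candidate is worse by a meaningful, significant margin.

    alpha and min_effect are illustrative defaults, not values from the project.
    """
    d = cohens_d(baseline, candidate)
    p = permutation_p_value(baseline, candidate)
    return {"cohens_d": d, "p_value": p, "regression": d <= -min_effect and p < alpha}


# Hypothetical per-example quality scores for two model versions (made-up data).
baseline_scores = [0.82, 0.85, 0.80, 0.84, 0.83, 0.86, 0.81, 0.84]
candidate_scores = [0.74, 0.78, 0.75, 0.77, 0.73, 0.79, 0.76, 0.75]

result = detect_regression(baseline_scores, candidate_scores)
print(result)
```

Requiring both a minimum effect size and a significance threshold is one possible design choice: it avoids flagging differences that are statistically detectable but too small to matter, as well as large-looking gaps that could plausibly come from noise in a small evaluation set.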