01 Retrieval-augmented generation (RAG) tracking with MRR and NDCG
02 Automated text generation metrics including BLEU, ROUGE, and BERTScore
03 LLM-as-Judge patterns for pointwise and pairwise model comparisons
04 Automated regression detection to prevent quality drops before deployment
05 Statistical A/B testing framework with p-value and effect size analysis
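
As an illustration of the retrieval metrics named in the first item, here is a minimal Python sketch of MRR and NDCG. The function names (`mean_reciprocal_rank`, `dcg`, `ndcg`) and the input layout are assumptions made for this example and are not taken from the project itself. The sketch uses the linear-gain form of DCG; some implementations use `2**rel - 1` as the gain instead, which gives different values for graded relevance.

```python
import math
from typing import Optional, Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """MRR over queries; each inner sequence holds 0/1 relevance flags in ranked order."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        # Add the reciprocal rank of the first relevant result (0 if none is relevant).
        for rank, rel in enumerate(flags, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(gains: Sequence[float], k: Optional[int] = None) -> float:
    """NDCG@k: DCG of the observed ranking divided by DCG of the ideal ranking."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Two queries: the first relevant hit is at rank 2, then at rank 1.
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
    # Graded relevance for one query, in the order the retriever returned it.
    print(ndcg([1, 3, 2], k=3))
```

Tracking these two values per evaluation run gives a simple signal for retrieval quality: MRR reflects how early the first relevant document appears, while NDCG also rewards placing the more relevant documents higher.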