01 LLM-as-Judge patterns for automated qualitative assessment
02 Automated NLP metrics, including BLEU, ROUGE, and BERTScore
03 RAG-specific evaluation of retrieval precision and groundedness
04 Statistical A/B testing framework with Cohen's d effect size analysis
05 Automated regression detection for CI/CD quality gates
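
The effect-size and regression-gate items above can be illustrated with a small sketch. It is not the tool's actual implementation: the function names `cohens_d` and `is_regression`, the `min_effect` threshold of 0.5, and the use of the pooled sample standard deviation are all assumptions chosen for illustration. The scores are assumed to be per-example quality scores where higher is better.

```python
import math
import statistics


def cohens_d(baseline, candidate):
    """Cohen's d of candidate vs. baseline, using the pooled sample SD.

    Positive values mean the candidate scores higher on average.
    (Hypothetical helper; not taken from the tool itself.)
    """
    n1, n2 = len(baseline), len(candidate)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two scores")

    # Sample variances (n - 1 denominator) for each group.
    var1 = statistics.variance(baseline)
    var2 = statistics.variance(candidate)

    # Pooled standard deviation, weighting each group by its degrees of freedom.
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

    diff = statistics.mean(candidate) - statistics.mean(baseline)
    if pooled_sd == 0:
        # No spread in either group: d is 0 for equal means, unbounded otherwise.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / pooled_sd


def is_regression(baseline, candidate, min_effect=0.5):
    """Flag a regression when the candidate is worse by at least min_effect.

    min_effect=0.5 (a conventional "medium" effect) is an assumed default.
    """
    return cohens_d(baseline, candidate) <= -min_effect


if __name__ == "__main__":
    baseline_scores = [0.80, 0.82, 0.78, 0.81, 0.79]
    candidate_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
    print("Cohen's d:", cohens_d(baseline_scores, candidate_scores))
    # A CI job could fail the build when this prints True.
    print("Regression:", is_regression(baseline_scores, candidate_scores))
```

In a CI/CD quality gate, a script along these lines might exit with a non-zero status when `is_regression` returns `True`. Effect size alone does not establish statistical significance, so a real gate would likely pair it with a significance test and a minimum sample size.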