01 Automated regression detection against performance baselines
02 0 GitHub stars
03 Statistical A/B testing framework with significance and effect size calculation
04 LLM-as-Judge patterns for pointwise and pairwise qualitative assessments
05 Automated NLP metrics including BLEU, ROUGE, and BERTScore
06 Customizable metrics for RAG groundedness, toxicity, and factuality
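
As a rough illustration of how regression detection against a baseline could be combined with a significance test and an effect size, the sketch below compares per-example scores from a baseline run and a candidate run. It is only a minimal sketch under stated assumptions, not the framework's actual implementation: the function names (`cohens_d`, `permutation_p_value`, `is_regression`), the thresholds (`alpha`, `min_effect`), and the choice of a permutation test are all hypothetical and chosen for illustration.

```python
import math
import random
from statistics import mean, stdev


def cohens_d(baseline, candidate):
    """Effect size: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(baseline), len(candidate)
    s1, s2 = stdev(baseline), stdev(candidate)
    pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(candidate) - mean(baseline)) / pooled


def permutation_p_value(baseline, candidate, rounds=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    observed = abs(mean(candidate) - mean(baseline))
    combined = list(baseline) + list(candidate)
    n = len(baseline)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(combined)
        diff = abs(mean(combined[n:]) - mean(combined[:n]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (rounds + 1)


def is_regression(baseline, candidate, alpha=0.05, min_effect=0.5):
    """Flag a regression when the candidate is significantly and meaningfully worse.

    Assumes higher scores are better; alpha and min_effect are illustrative defaults.
    """
    d = cohens_d(baseline, candidate)
    p = permutation_p_value(baseline, candidate)
    return d < -min_effect and p < alpha


# Hypothetical per-example scores, for illustration only.
baseline_scores = [0.82, 0.85, 0.84, 0.86, 0.83, 0.85, 0.84, 0.86]
candidate_scores = [0.74, 0.76, 0.75, 0.77, 0.73, 0.76, 0.75, 0.74]
regressed = is_regression(baseline_scores, candidate_scores)
```

Requiring both a small p-value and a minimum effect size is one way to avoid flagging differences that are statistically detectable but too small to matter in practice; a real setup would tune both thresholds to the metric in question.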