LLM-as-judge patterns for pointwise and pairwise semantic comparisons
Custom metric support for groundedness, toxicity, and factuality checking
Automated metric implementations for text generation, classification, and RAG
Statistical A/B testing framework with Cohen’s d and p-value analysis
Automated regression detection to prevent performance drops during updates
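
As a rough illustration of the A/B testing and regression detection items, the sketch below compares per-example scores from a baseline and a candidate system. It is not this project's API: the function names (`cohens_d`, `permutation_p_value`, `is_regression`), the `alpha` threshold, and the use of a permutation test for the p-value are all assumptions made for the example, using only the Python standard library.

```python
import random
import statistics
from math import sqrt


def cohens_d(baseline, candidate):
    """Effect size: mean difference divided by the pooled sample standard deviation."""
    na, nb = len(baseline), len(candidate)
    pooled_var = (
        (na - 1) * statistics.variance(baseline)
        + (nb - 1) * statistics.variance(candidate)
    ) / (na + nb - 2)
    return (statistics.mean(candidate) - statistics.mean(baseline)) / sqrt(pooled_var)


def permutation_p_value(baseline, candidate, n_resamples=2000, seed=0):
    """Two-sided p-value estimated by randomly reshuffling the group labels."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = abs(statistics.mean(candidate) - statistics.mean(baseline))
    pooled = list(baseline) + list(candidate)
    split = len(baseline)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[split:]) - statistics.mean(pooled[:split]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


def is_regression(baseline, candidate, alpha=0.05):
    """Hypothetical rule: flag a drop in the mean score that is significant at alpha."""
    return (
        cohens_d(baseline, candidate) < 0
        and permutation_p_value(baseline, candidate) < alpha
    )


if __name__ == "__main__":
    old_scores = [0.70, 0.72, 0.68, 0.71, 0.69]  # made-up illustrative scores
    new_scores = [0.80, 0.82, 0.78, 0.81, 0.79]
    print("Cohen's d:", round(cohens_d(old_scores, new_scores), 3))
    print("p-value:", permutation_p_value(old_scores, new_scores))
    print("regression:", is_regression(old_scores, new_scores))
```

A positive d here means the candidate scores higher than the baseline. In practice the significance threshold and a minimum effect size would be tuned to the metric being tracked.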