Automated metric implementation for text generation, classification, and RAG retrieval.
Statistical A/B testing to measure performance gains and significance.
Human evaluation frameworks with annotation guidelines and agreement calculations.
LLM-as-judge patterns for pointwise scoring and pairwise model comparisons.
Automated regression detection to prevent performance drops during deployment.
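
As a rough illustration of how the A/B testing and regression detection items could fit together, here is a minimal sketch using only the Python standard library. It assumes each model has already been scored per example on the same evaluation set. The function names `paired_bootstrap` and `is_regression`, the `tolerance` default, and the resample count are hypothetical choices for this example, not part of any specific tool's API.

```python
import random
from statistics import mean


def paired_bootstrap(baseline, candidate, n_resamples=2000, seed=0):
    """Compare per-example scores of two models on the same examples.

    Returns (mean_delta, not_better_share), where not_better_share is the
    fraction of bootstrap resamples in which the candidate's mean gain
    over the baseline is zero or negative. A small share suggests the
    observed gain is unlikely to be resampling noise.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    not_better = 0
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        if mean(sample) <= 0:
            not_better += 1
    return mean(deltas), not_better / n_resamples


def is_regression(baseline, candidate, tolerance=0.01):
    """Flag a regression when the candidate's mean score falls more than
    `tolerance` below the baseline's mean score (hypothetical threshold)."""
    return mean(candidate) < mean(baseline) - tolerance
```

A deployment gate might call `is_regression` on every candidate build and fall back to `paired_bootstrap` when a claimed improvement needs supporting evidence. Sensible thresholds depend on the metric's scale and the size of the evaluation set.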