Custom metric implementations for toxicity and factuality
Specialized RAG evaluation for retrieval and groundedness
Human evaluation frameworks with inter-rater agreement tracking
LLM-as-judge patterns for pointwise and pairwise comparisons
Comprehensive automated metrics including BLEU, ROUGE, and BERTScore
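
As an illustration of the inter-rater agreement tracking listed above, here is a minimal sketch in plain Python. It assumes two raters assigning categorical labels to the same items and uses Cohen's kappa as the agreement statistic. The list does not say which statistic is actually used, so both that choice and the function name `cohens_kappa` are assumptions made for this example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (illustrative helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labeled independently according to their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of six model outputs by two annotators.
ratings_a = ["good", "good", "bad", "good", "bad", "bad"]
ratings_b = ["good", "bad", "bad", "good", "bad", "good"]
kappa = cohens_kappa(ratings_a, ratings_b)
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 0 means the raters agree no more often than chance and a value of 1 means perfect agreement. With more than two raters, a statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual substitute.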