- Comprehensive RAG evaluation metrics for retrieval accuracy and faithfulness (see the metric sketch below)
- Code-first and LLM-as-a-judge evaluator templates for Python and TypeScript (a hedged Python sketch follows the list)
- Systematic error analysis and axial coding workflows to identify failure modes
- Validation workflows to ensure automated evaluators align with human judgment (see the agreement-check sketch below)
- Experiment management tools for running batch evaluations and comparing datasets
8,664 GitHub stars
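As a concrete illustration of the retrieval-accuracy metrics the first bullet describes, here is a minimal, self-contained Python sketch of precision@k and recall@k over retrieved document IDs. The function names, data shapes, and example values are assumptions for illustration, not this library's API.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical example: 5 documents retrieved, 3 known-relevant documents.
retrieved = ["d1", "d7", "d3", "d9", "d2"]
relevant = {"d1", "d2", "d4"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.4  -> d1 and d2 are hits
print(recall_at_k(retrieved, relevant, k=5))     # ~0.667 -> 2 of 3 relevant found
```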
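The second bullet mentions LLM-as-a-judge evaluator templates. The repo's own templates are not reproduced here; the following is a minimal Python sketch of the general pattern, assuming an OpenAI-style client. The prompt wording, model name, and the "faithfulness" PASS/FAIL criterion are all illustrative assumptions.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a RAG answer for faithfulness.
Context:
{context}

Answer:
{answer}

Reply with exactly one word: PASS if every claim in the answer is
supported by the context, FAIL otherwise."""


def judge_faithfulness(context: str, answer: str) -> bool:
    """LLM-as-a-judge: ask a model whether the answer is grounded in the context."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        temperature=0,        # deterministic grading
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(context=context, answer=answer)}
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")
```

Pinning temperature to 0 and forcing a one-word verdict keeps the judge's output easy to parse and reasonably repeatable across runs.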
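For the validation workflow in the fourth bullet, one common approach (an assumption here, not necessarily this tool's exact method) is to measure chance-corrected agreement between the automated evaluator and human labels, e.g. Cohen's kappa over binary PASS/FAIL verdicts on a shared calibration set:

```python
def cohens_kappa(auto_labels: list[bool], human_labels: list[bool]) -> float:
    """Chance-corrected agreement between automated and human PASS/FAIL labels."""
    assert len(auto_labels) == len(human_labels) and auto_labels
    n = len(auto_labels)
    observed = sum(a == h for a, h in zip(auto_labels, human_labels)) / n
    # Expected agreement if both raters labeled independently at their marginal rates.
    p_auto = sum(auto_labels) / n
    p_human = sum(human_labels) / n
    expected = p_auto * p_human + (1 - p_auto) * (1 - p_human)
    if expected == 1.0:  # degenerate case: agreement is guaranteed by the marginals
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical calibration set: evaluator vs. human verdicts on 8 examples.
auto = [True, True, False, True, False, True, True, False]
human = [True, True, False, False, False, True, True, True]
print(round(cohens_kappa(auto, human), 3))  # ~0.467 here; values near 1.0 mean strong alignment
```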