About
The NeMo Evaluator SDK skill provides a comprehensive framework for benchmarking LLMs at scale, supporting 100+ standardized benchmarks drawn from 18+ evaluation harnesses such as lm-evaluation-harness, including tasks like HumanEval. It streamlines evaluation through containerized execution, ensuring reproducible results across local Docker environments, Slurm HPC clusters, and cloud backends. It is aimed at developers and researchers who need to validate model performance, compare model architectures, and automate regression testing across mathematics, coding, and general instruction-following capabilities.
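For orientation, the sketch below shows what a minimal programmatic evaluation run might look like: pointing the evaluator at an OpenAI-compatible endpoint and running a single benchmark with a small sample limit. The module paths, the `evaluate` entry point, and the dataclass names (`EvaluationConfig`, `EvaluationTarget`, `ApiEndpoint`, `ConfigParams`) are assumptions based on the SDK's documented endpoint-and-config pattern, not a verbatim API reference; consult the NeMo Evaluator documentation for exact signatures.

```python
# Hypothetical sketch -- the import paths and dataclass names below are
# assumptions, not verified API; check the NeMo Evaluator docs before use.
from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EvaluationConfig,
    EvaluationTarget,
)

# Point the evaluator at an OpenAI-compatible endpoint serving the model.
target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://localhost:8000/v1/completions",  # hypothetical local endpoint
        model_id="my-model",                         # placeholder model name
    )
)

# Select a benchmark and basic run parameters; available task names depend on
# which evaluation harness containers are installed.
config = EvaluationConfig(
    type="gsm8k",                            # example math benchmark
    output_dir="./results",
    params=ConfigParams(limit_samples=10),   # small smoke-test run
)

# Run the evaluation; results land in output_dir for later comparison.
evaluate(eval_cfg=config, target_cfg=target)
```

The same run can target a Slurm cluster or cloud backend by swapping the execution configuration, which is what makes containerized execution useful for regression testing across environments.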