Can I run evaluations on my local machine?

Yes, the skill supports local execution using Docker containers to ensure environment consistency and reproducibility.

How do I export my evaluation results?

The skill includes built-in commands to export results to popular experiment tracking platforms like MLflow and Weights & Biases, or save them as local JSON/YAML files.

What benchmarks are supported by NeMo Evaluator?

It supports over 100 benchmarks across 18+ harnesses, including MMLU, GSM8K, HumanEval, IFEval, and vision-specific tasks like ChartQA and MMMU.

Does it work with private or self-hosted models?

Absolutely. You can configure the evaluator to target any OpenAI-compatible API endpoint, including those hosted via vLLM, NIM, or TensorRT-LLM.

NeMo LLM Evaluator

Name: NeMo LLM Evaluator
Author: Orchestra-Research

byOrchestra-Research

•

3,983

•

Data Science & ML

Evaluates Large Language Models across 100+ industry-standard benchmarks using NVIDIA's enterprise-grade containerized architecture.

The NeMo Evaluator skill provides a comprehensive suite for benchmarking LLMs and VLMs across various metrics including reasoning, coding, and safety. By leveraging NVIDIA's NeMo Evaluator SDK, it enables reproducible testing across diverse environments like local Docker containers, Slurm HPC clusters, and cloud endpoints. This skill is essential for researchers and engineers who need to validate model performance, compare different architectures, or ensure enterprise-level compliance and safety standards before deployment.

Key Features

01Access to 100+ benchmarks from 18+ harnesses including MMLU, GPQA, and IFEval

023,983 GitHub stars

03Specialized evaluation modules for AI Safety and Vision-Language Models (VLM)

04Automated result exporting to MLflow, Weights & Biases, and local JSON formats

05Multi-backend support for local Docker, Slurm HPC clusters, and cloud platforms

06Seamless integration with OpenAI-compatible endpoints like vLLM and TRT-LLM

Use Cases

01Benchmarking custom-trained models against industry leaders like Llama 3.1 and Mistral

02Performing safety and security probing to identify vulnerabilities in model responses

03Running large-scale, reproducible evaluations on high-performance computing infrastructure

What are Skills?·How to Install

Install with 🐟 Skill.Fish

npx skillfish add orchestra-research/ai-research-skills nemo-evaluator

For use in Claude.ai and ChatGPT

Download Skill