How do I see all available benchmarks?

Run 'lm_eval --tasks list' in your terminal to list all 60+ supported tasks.

Does this skill support code generation benchmarks?

Yes, it includes full support for code execution benchmarks like HumanEval and MBPP, provided the --allow_code_execution flag is used.

What is the LM Evaluation Harness?

The LM Evaluation Harness is EleutherAI's industry-standard open-source framework for evaluating Large Language Models across 60+ academic benchmarks using standardized prompts and metrics.

How does vLLM support improve evaluation?

With the vLLM backend, evaluations can run 5-10x faster than with the standard HuggingFace Transformers backend, because vLLM optimizes inference throughput on supported GPUs.
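As a rough illustration of what switching backends involves, the sketch below assembles an 'lm_eval' command line for either backend. The helper name build_eval_command and the model id org/model-name are hypothetical. The flags (--model, --model_args, --tasks, --batch_size) and the vLLM arguments follow the upstream harness README as best understood here, so verify them against your installed version. The sketch only builds the command; it does not measure any speedup.

```python
import shlex


def build_eval_command(backend, pretrained, tasks, batch_size="auto"):
    """Return the argv list for one lm_eval run (hypothetical helper)."""
    if backend not in ("hf", "vllm"):
        raise ValueError(f"unsupported backend: {backend}")

    model_args = [f"pretrained={pretrained}", "dtype=auto"]
    if backend == "vllm":
        # vLLM-specific settings; the values are illustrative, not tuned.
        model_args += ["tensor_parallel_size=1", "gpu_memory_utilization=0.8"]

    return [
        "lm_eval",
        "--model", backend,
        "--model_args", ",".join(model_args),
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command instead of running it.
    cmd = build_eval_command("vllm", "org/model-name", ["gsm8k", "mmlu"])
    print(shlex.join(cmd))
```

Only the --model value and the extra model arguments differ between the two backends, so the same task list can be reused to compare wall-clock time on your own hardware.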

Can I use this skill to evaluate local model checkpoints?

Yes, the skill supports evaluating local HuggingFace checkpoints, including bfloat16 weights and quantized 4-bit/8-bit formats.
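A minimal sketch of how such a checkpoint might be described to the harness: the function below builds the comma-separated value passed to --model_args. The name local_model_args and the path /models/my-checkpoint are hypothetical. The load_in_4bit and load_in_8bit keys are assumed to be forwarded to the HuggingFace loader (which needs bitsandbytes installed); confirm this against your harness version.

```python
def local_model_args(checkpoint_dir, dtype="bfloat16", quantization=None):
    """Build a --model_args string for a local HF checkpoint (hypothetical helper)."""
    args = {"pretrained": checkpoint_dir, "dtype": dtype}

    # Assumption: these keys are passed through to the HuggingFace loader.
    if quantization == "4bit":
        args["load_in_4bit"] = True
    elif quantization == "8bit":
        args["load_in_8bit"] = True
    elif quantization is not None:
        raise ValueError(f"unknown quantization: {quantization}")

    return ",".join(f"{key}={value}" for key, value in args.items())


if __name__ == "__main__":
    print(local_model_args("/models/my-checkpoint"))
    print(local_model_args("/models/my-checkpoint", quantization="4bit"))
```

The resulting string would be used as 'lm_eval --model hf --model_args <string> --tasks ...'.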

LM Evaluation Harness

Name: LM Evaluation Harness
Author: Orchestra-Research

GitHub stars: 3,983
Category: Data Science & ML

Evaluates Large Language Models across 60+ academic benchmarks using standardized prompts and metrics for reproducible research.

This skill integrates the industry-standard EleutherAI LM Evaluation Harness into your workflow, enabling precise benchmarking of LLM quality. It provides standardized implementations for over 60 academic tasks including MMLU, GSM8K, and HumanEval, making it essential for researchers and engineers who need to compare model performance, track training progress, or report academic results. With support for HuggingFace, vLLM, and API-based models, it allows for high-performance evaluation of both local checkpoints and remote services directly within your development environment.

Key Features

01 Integrated workflows for tracking training progress with automated checkpoint evaluation.

02 Support for multiple backends including HuggingFace Transformers, vLLM, and external APIs.

03 Access to 60+ standardized academic benchmarks including MMLU, GSM8K, and TruthfulQA.

04 Extensive support for quantization, few-shot prompting, and custom task configuration.

05 3,983 GitHub stars

06 High-performance inference options using vLLM for up to 10x faster benchmarking.

Use Cases

01 Comparing the accuracy impact of different quantization methods (e.g., 4-bit vs 8-bit) on specific reasoning tasks.

02 Automating model quality checks during the training loop to visualize performance learning curves.

03 Benchmarking a custom-trained model against industry standards like Llama-2 or Mistral.
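The checkpoint-tracking use case could be sketched roughly as follows: order the saved checkpoints by training step and prepare one 'lm_eval' run per checkpoint, each with its own results directory. The step-<N> directory naming, the results layout, and the function names are assumptions made for illustration; --num_fewshot and --output_path are harness flags as best understood here, so check them against your installed version.

```python
import re


def step_of(checkpoint_path):
    """Extract the training step from a path ending in 'step-<N>' (assumed naming)."""
    match = re.search(r"step-(\d+)/?$", checkpoint_path)
    if match is None:
        raise ValueError(f"no step number in: {checkpoint_path}")
    return int(match.group(1))


def plan_checkpoint_evals(checkpoint_paths, tasks, num_fewshot=5, results_dir="results"):
    """Return (step, argv) pairs in increasing step order (hypothetical helper)."""
    plans = []
    # Sort numerically so step-1000 comes after step-500, unlike a plain string sort.
    for path in sorted(checkpoint_paths, key=step_of):
        step = step_of(path)
        argv = [
            "lm_eval",
            "--model", "hf",
            "--model_args", f"pretrained={path},dtype=bfloat16",
            "--tasks", ",".join(tasks),
            "--num_fewshot", str(num_fewshot),
            "--output_path", f"{results_dir}/step-{step}",
        ]
        plans.append((step, argv))
    return plans


if __name__ == "__main__":
    checkpoints = ["ckpt/step-2000", "ckpt/step-500", "ckpt/step-1000"]
    for step, argv in plan_checkpoint_evals(checkpoints, ["gsm8k"]):
        print(step, " ".join(argv))
```

Each argv list can be handed to subprocess.run(argv, check=True) once the harness is installed; plotting the per-step scores from the results directories then gives the learning curve.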


Install with 🐟 Skill.Fish

npx skillfish add orchestra-research/ai-research-skills lm-evaluation-harness
