- Multiple optimized backends including Marlin, BitBlas, and TorchAO
- PEFT and LoRA compatibility for efficient fine-tuning
- Seamless integration with HuggingFace Transformers and vLLM
- Support for 1, 2, 3, 4, and 8-bit precision
- Calibration-free quantization (no dataset required)
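To illustrate the last point, here is a minimal, generic sketch of what calibration-free quantization means: weights are mapped to low-bit integers using only the tensor's own min/max statistics, so no calibration dataset is ever required. This is a simplified round-to-nearest scheme for illustration only, not the library's actual API; the function names and parameters are hypothetical.

```python
def quantize(weights, bits=4):
    """Calibration-free round-to-nearest quantization (illustrative sketch).

    Maps each float weight to an integer in [0, 2**bits - 1] using only
    the tensor's own min/max -- no calibration data is needed.
    """
    qmax = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0  # guard against a constant tensor
    zero_point = w_min
    q = [round((w - zero_point) / scale) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the quantized integers."""
    return [qi * scale + zero_point for qi in q]


weights = [-0.62, 0.05, 0.31, -0.18, 0.47]
q, scale, zero_point = quantize(weights, bits=4)
recovered = dequantize(q, scale, zero_point)
# Round-to-nearest bounds the per-weight error by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
```

The same idea extends to the 1-, 2-, 3-, and 8-bit settings listed above by changing `bits`; lower bit widths shrink memory at the cost of a larger quantization step.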