Does Miles support low-precision training?

Yes. Miles provides a unified FP8 pipeline and INT4 quantization-aware training (QAT), which lets models of 1TB and larger be trained on a single machine.
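To make the idea of INT4 quantization-aware training more concrete, here is a minimal sketch of the "fake quantization" step such schemes commonly rely on: weights are rounded to 16 integer levels and immediately dequantized, so the forward pass sees the rounding error. This is not Miles's API. The function name `fake_quantize_int4` and the symmetric per-tensor scaling are assumptions chosen for illustration, and a real QAT setup would also pass gradients through the rounding step (for example with a straight-through estimator).

```python
def fake_quantize_int4(weights):
    """Symmetric per-tensor INT4 quantize-dequantize (hypothetical helper, illustration only)."""
    qmin, qmax = -8, 7  # signed 4-bit integer range
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        # Nothing to scale; return the weights unchanged with a dummy scale.
        return list(weights), 1.0
    scale = max_abs / qmax
    # Round to the nearest integer level, then clamp to the INT4 range.
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    # Dequantize so downstream computation sees the quantization error.
    return [c * scale for c in codes], scale


weights = [0.7, -0.35, 0.1, 0.0]
dequantized, scale = fake_quantize_int4(weights)
for w, d in zip(weights, dequantized):
    print(f"{w:+.3f} -> {d:+.3f}")
```

Each dequantized value lies within half a quantization step (`scale / 2`) of the original weight, which is the error the model learns to tolerate during training.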

How does Miles improve MoE training stability?

Miles uses Rollout Routing Replay (R3) to keep expert routing decisions bit-wise consistent between inference and training, which prevents a common source of policy collapse.
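The replay idea can be sketched in a few lines: the expert choice made during rollout is recorded and reused in the training pass instead of being recomputed from router logits that may differ slightly between engines. The functions `top_k_experts` and `route` and the example logits below are hypothetical and only illustrate the principle; they do not reflect how Miles implements R3.

```python
def top_k_experts(logits, k):
    """Return the indices of the k largest router logits (ties go to the lower index)."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return sorted(order[:k])


def route(logits, k, replay=None):
    """Pick experts from the logits, or reuse a recorded choice when `replay` is given."""
    return list(replay) if replay is not None else top_k_experts(logits, k)


# Rollout (inference engine): record the routing decision for one token.
rollout_logits = [1.00, 2.00, 1.99, 0.50]
recorded = route(rollout_logits, k=1)

# Training engine: a tiny numerical drift flips the top-1 expert.
train_logits = [1.00, 1.99, 2.00, 0.50]
fresh = route(train_logits, k=1)                      # recomputed, may disagree
replayed = route(train_logits, k=1, replay=recorded)  # replayed, always agrees

print(recorded, fresh, replayed)
```

Without replay, the training pass would send this token to a different expert than the one that produced the rollout; with replay, both passes use the same expert.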

What is Miles?

Miles is an enterprise-grade reinforcement learning framework optimized for training large-scale Mixture-of-Experts (MoE) models with high performance and efficiency.

What models can I train with Miles?

Miles supports leading model families including DeepSeek (V3/R1), Qwen (including MoE variants), Llama, Gemma, and GLM.

How does speculative RL work in Miles?

Miles integrates the EAGLE algorithm via SGLang: a small draft model generates candidate tokens and the target model verifies them, which increases rollout throughput by up to 40%.
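The draft-then-verify loop can be illustrated with a toy greedy version of speculative decoding. This is a generic sketch, not EAGLE itself and not the SGLang API: `target_next` and `draft_next` are made-up deterministic stand-ins for the two models, and a real engine would verify all drafted positions in a single batched forward pass and use a sampling-aware acceptance rule.

```python
def target_next(prefix):
    """Toy 'target model': a deterministic next token computed from the prefix."""
    return (sum(prefix) * 3 + len(prefix)) % 10


def draft_next(prefix):
    """Toy 'draft model': matches the target unless the prefix sum is divisible by 4."""
    guess = target_next(prefix)
    return (guess + 1) % 10 if sum(prefix) % 4 == 0 else guess


def speculative_step(prefix, k):
    """Draft k tokens, keep the longest run the target agrees with, then add one target token."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))
    accepted = []
    for token in proposal:
        # Stop at the first drafted token the target would not have produced.
        if target_next(prefix + accepted) != token:
            break
        accepted.append(token)
    # The target always contributes one token: a correction or a bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix, n_tokens, k=4):
    """Generate n_tokens with speculative steps of draft length k."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, k))
    return out[:len(prefix) + n_tokens]


def generate_target_only(prefix, n_tokens):
    """Baseline: one target call per generated token."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out


print(generate([1, 2], 12) == generate_target_only([1, 2], 12))
```

The output is identical to decoding with the target alone; the speedup comes from accepting several drafted tokens per verification step whenever the draft model guesses correctly.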

Miles Enterprise RL Training

Name: Miles Enterprise RL Training
Author: zhuangbiaowei


Data Science & ML

Optimizes large-scale Mixture-of-Experts (MoE) model training with enterprise-grade reinforcement learning features and low-precision quantization.

Miles is a high-performance reinforcement learning framework designed for post-training enterprise-scale models like DeepSeek V3 and Qwen3-MoE. As a production-ready fork of Slime, it specializes in stabilizing MoE training through bit-wise train-inference alignment (R3) and maximizing throughput with speculative RL. It enables training 1TB+ models on hardware with limited VRAM using advanced FP8 and INT4 quantization-aware training, making it an essential tool for teams moving from research to production-scale AI alignment.

Key Features

01 Rollout Routing Replay (R3) for exact bit-wise expert alignment between training and inference

02 GitHub stars: 1

03 Unified FP8 and INT4 quantization-aware training for massive MoE models

04 Speculative RL via the EAGLE algorithm for up to 40% faster rollout throughput

05 Comprehensive support for the DeepSeek V3, Qwen3-MoE, and Llama model families

06 Zero-copy weight synchronization using CUDA IPC mapping

Use Cases

01 Scaling reinforcement learning for 1TB+ Mixture-of-Experts models

02 Reducing VRAM footprint for large model training using 4-bit quantization

03 Accelerating RLHF pipelines with speculative decoding and partial rollout recycling


Install with 🐟 Skill.Fish

npx skillfish add zhuangbiaowei/smart_bot miles
