- Automated bottom-up `fully_shard` module wrapping for memory-efficient sharding
- Mixed-precision (BF16/FP16) and CPU-offload policy configuration
- Distributed Checkpointing (DCP) for parallel, high-performance state management
- Standardized `torchrun` environment setup and rank-aware device initialization
- DTensor-based parameter sharding with DeviceMesh and multi-dimensional parallelism support