The Distributed Training Debugger is a specialized skill for troubleshooting complex multi-GPU training setups, with a focus on FSDP2, Tensor Parallelism, Context Parallelism, and Expert Parallelism. It streamlines debugging by automating NCCL and Torch Distributed debug configuration, providing systematic guides for diagnosing deadlocks, and supporting minimal-reproduction strategies. Whether you are dealing with rank-synchronization hangs, numerical inconsistencies across devices, or memory fragmentation in sharded models, the skill supplies the diagnostic scripts and py-spy integration needed to restore training stability quickly.
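A debug session typically starts by enabling verbose collective logging before the process group is initialized. The sketch below sets standard NCCL and Torch Distributed debug variables; exactly which of these the skill configures automatically is an assumption here:

```python
import os

# Standard NCCL / Torch Distributed debug knobs. These must be set before
# torch.distributed.init_process_group() runs (e.g. at the top of the
# training script, or in the launcher environment).
os.environ["NCCL_DEBUG"] = "INFO"                    # verbose NCCL logging
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"        # limit to init + collectives
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"     # extra collective consistency checks
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"  # fail fast instead of hanging
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"         # surface collective timeouts as errors
```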
Key Features
01 Memory profiling for identifying OOM issues in sharded training (memory-snapshot sketch after this list)
02 Automated configuration of NCCL and Torch Distributed debug environment variables (illustrated above)
03 Deadlock and hang diagnosis using py-spy call stack analysis (sketched below)
04 Verification of DTensor placements and Device Mesh configurations (sketched below)
05 Rank-conditional logging and tensor consistency validation tools (sketched below)
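For feature 01, a minimal sketch of per-rank memory profiling using PyTorch's allocator-snapshot hooks (the `_record_memory_history` APIs are private but widely used for this; whether the skill relies on them, and the `step_fn` callback, are assumptions of this sketch):

```python
import torch
import torch.distributed as dist

def profile_training_step(step_fn, out_prefix="mem"):
    # Record allocator events, run one training step, then dump one
    # snapshot per rank for inspection at https://pytorch.org/memory_viz.
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    step_fn()
    rank = dist.get_rank() if dist.is_initialized() else 0
    torch.cuda.memory._dump_snapshot(f"{out_prefix}_rank{rank}.pickle")
    torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```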
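For deadlock diagnosis (feature 03), the usual move is to dump the call stack of every suspect rank with `py-spy dump`; a hung collective typically shows most ranks parked in a blocking wait inside `ProcessGroupNCCL` while one rank took a different code path. A sketch of a wrapper, where the source of `pids` is an assumption (in practice they come from `nvidia-smi` or the launcher):

```python
import subprocess

def dump_all_ranks(pids):
    # Capture Python + native frames for each training process.
    # py-spy must be installed and usually needs ptrace permission
    # (root or CAP_SYS_PTRACE) to attach to another process.
    for pid in pids:
        result = subprocess.run(
            ["py-spy", "dump", "--pid", str(pid), "--native"],
            capture_output=True, text=True, check=False,
        )
        print(f"=== pid {pid} ===\n{result.stdout or result.stderr}")
```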
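For feature 04, a sketch of checking DTensor placements against a parallelism plan, assuming an 8-GPU job launched with torchrun and a hypothetical 2x4 data-parallel/tensor-parallel mesh (the layout and function name are illustrative, not the skill's actual API):

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
# torch >= 2.4; on older versions import from torch.distributed._tensor
from torch.distributed.tensor import Replicate, Shard

# Hypothetical layout: 2 data-parallel replicas x 4 tensor-parallel shards.
# Requires a torchrun launch with world_size == 8.
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

def verify_placement(name, dtensor, expected_placements):
    # Compare a parameter's actual sharding against the plan and emit a
    # rank-tagged diagnostic on mismatch.
    actual = tuple(dtensor.placements)
    if actual != tuple(expected_placements):
        print(f"[rank {dist.get_rank()}] {name}: expected "
              f"{expected_placements}, got {actual} on mesh {dtensor.device_mesh}")

# e.g. a dp-replicated, tp-row-sharded weight:
# verify_placement("linear.weight", param, (Replicate(), Shard(0)))
```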
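And for feature 05, a sketch of rank-conditional logging plus a simple cross-rank consistency check; the function names are illustrative, and both assume an initialized process group:

```python
import torch
import torch.distributed as dist

def log_rank0(msg):
    # Print from rank 0 only, so multi-rank logs stay readable.
    if not dist.is_initialized() or dist.get_rank() == 0:
        print(msg)

def check_consistency(tensor, name=""):
    # Replicated values (losses, grad norms) should match across ranks;
    # broadcast rank 0's copy and compare locally on every rank.
    ref = tensor.detach().clone()
    dist.broadcast(ref, src=0)
    diff = (tensor - ref).abs().max().item()
    if diff > 0:
        print(f"[rank {dist.get_rank()}] {name} diverges from rank 0 "
              f"(max abs diff {diff:.3e})")
```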