The Distributed Training Debugger is a specialized skill for troubleshooting complex multi-GPU training setups, with a focus on FSDP2, Tensor Parallelism, Context Parallelism, and Expert Parallelism. It streamlines debugging by automating NCCL and Torch Distributed debug configuration, providing systematic guides for diagnosing deadlocks, and supporting minimal-reproduction strategies. Whether you are dealing with rank-synchronization hangs, numerical inconsistencies across devices, or memory fragmentation in sharded models, the skill supplies the diagnostic scripts and py-spy integration needed to restore training stability quickly.
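A debug session typically starts by enabling verbose collective logging before the process group is initialized. The sketch below sets standard NCCL and Torch Distributed debug variables; exactly which of these the skill configures automatically is an assumption here:

```python
import os

# Standard NCCL / Torch Distributed debug knobs. These must be set before
# torch.distributed.init_process_group() runs (e.g. at the top of the
# training script, or in the launcher environment).
os.environ["NCCL_DEBUG"] = "INFO"                    # verbose NCCL logging
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"        # limit to init + collectives
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"     # extra collective consistency checks
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"  # fail fast instead of hanging
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"         # surface collective timeouts as errors
```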
Key Features
01 Memory profiling for identifying OOM issues in sharded training (memory-snapshot sketch after this list)
02 Automated configuration of NCCL and Torch Distributed debug environment variables (illustrated above)
03 Deadlock and hang diagnosis using py-spy call stack analysis (sketched below)
04 Verification of DTensor placements and Device Mesh configurations (sketched below)
05 Rank-conditional logging and tensor consistency validation tools (sketched below)
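For feature 01, a minimal sketch of per-rank memory profiling using PyTorch's allocator-snapshot hooks (the `_record_memory_history` APIs are private but widely used for this; whether the skill relies on them, and the `step_fn` callback, are assumptions of this sketch):

```python
import torch
import torch.distributed as dist

def profile_training_step(step_fn, out_prefix="mem"):
    # Record allocator events, run one training step, then dump one
    # snapshot per rank for inspection at https://pytorch.org/memory_viz.
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    step_fn()
    rank = dist.get_rank() if dist.is_initialized() else 0
    torch.cuda.memory._dump_snapshot(f"{out_prefix}_rank{rank}.pickle")
    torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```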
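For deadlock diagnosis (feature 03), the usual move is to dump the call stack of every suspect rank with `py-spy dump`; a hung collective typically shows most ranks parked in a blocking wait inside `ProcessGroupNCCL` while one rank took a different code path. A sketch of a wrapper, where the source of `pids` is an assumption (in practice they come from `nvidia-smi` or the launcher):

```python
import subprocess

def dump_all_ranks(pids):
    # Capture Python + native frames for each training process.
    # py-spy must be installed and usually needs ptrace permission
    # (root or CAP_SYS_PTRACE) to attach to another process.
    for pid in pids:
        result = subprocess.run(
            ["py-spy", "dump", "--pid", str(pid), "--native"],
            capture_output=True, text=True, check=False,
        )
        print(f"=== pid {pid} ===\n{result.stdout or result.stderr}")
```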
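For feature 04, a sketch of checking DTensor placements against a parallelism plan, assuming an 8-GPU job launched with torchrun and a hypothetical 2x4 data-parallel/tensor-parallel mesh (the layout and function name are illustrative, not the skill's actual API):

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
# torch >= 2.4; on older versions import from torch.distributed._tensor
from torch.distributed.tensor import Replicate, Shard

# Hypothetical layout: 2 data-parallel replicas x 4 tensor-parallel shards.
# Requires a torchrun launch with world_size == 8.
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

def verify_placement(name, dtensor, expected_placements):
    # Compare a parameter's actual sharding against the plan and emit a
    # rank-tagged diagnostic on mismatch.
    actual = tuple(dtensor.placements)
    if actual != tuple(expected_placements):
        print(f"[rank {dist.get_rank()}] {name}: expected "
              f"{expected_placements}, got {actual} on mesh {dtensor.device_mesh}")

# e.g. a dp-replicated, tp-row-sharded weight:
# verify_placement("linear.weight", param, (Replicate(), Shard(0)))
```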
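And for feature 05, a sketch of rank-conditional logging plus a simple cross-rank consistency check; the function names are illustrative, and both assume an initialized process group:

```python
import torch
import torch.distributed as dist

def log_rank0(msg):
    # Print from rank 0 only, so multi-rank logs stay readable.
    if not dist.is_initialized() or dist.get_rank() == 0:
        print(msg)

def check_consistency(tensor, name=""):
    # Replicated values (losses, grad norms) should match across ranks;
    # broadcast rank 0's copy and compare locally on every rank.
    ref = tensor.detach().clone()
    dist.broadcast(ref, src=0)
    diff = (tensor - ref).abs().max().item()
    if diff > 0:
        print(f"[rank {dist.get_rank()}] {name} diverges from rank 0 "
              f"(max abs diff {diff:.3e})")
```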