Can it help with specific NVIDIA architectures?

Yes, it provides optimization strategies tailored to specific architectures like Ampere, Hopper, and Ada Lovelace to leverage generation-specific hardware features.

Does it require external profiling tools to work?

While it leverages knowledge from tools like Nsight Copilot, it performs static analysis directly on your code to identify performance patterns before you even run a hardware profiler.

How does this skill help with GPU memory issues?

It analyzes your code for uncoalesced memory access patterns and shared memory bank conflicts, providing specific refactoring suggestions to improve bandwidth utilization.

What file types does it support?

It primarily focuses on CUDA source files (.cu) and headers (.cuh), but can also provide guidance on general GPU compute patterns in C++ or Python-based frameworks.

CUDA Performance Optimizer

Name: CUDA Performance Optimizer
Author: keith-mvs

bykeith-mvs

•

Data Science & ML

Optimizes CUDA kernels and GPU code to maximize hardware utilization and execution speed.

The CUDA Optimization skill provides specialized expertise for auditing and enhancing GPU-accelerated code within the Claude Code environment. It automatically analyzes CUDA kernels to detect uncoalesced memory accesses, warp divergence, and suboptimal thread configurations while suggesting architecture-specific improvements for modern NVIDIA generations like Ampere, Hopper, and Ada Lovelace. By integrating deep profiling knowledge, it helps developers transition from functional GPU code to highly efficient, production-grade parallel algorithms through actionable code transformations and bottleneck identification.

Key Features

011 GitHub stars

02Memory access pattern analysis for coalescing and bank conflicts

03Strategic shared, constant, and texture memory implementation

04Architecture-specific tuning for NVIDIA GPU generations

05Warp efficiency and branch divergence detection

06Thread block and occupancy optimization for maximum throughput

Use Cases

01Identifying memory bottlenecks in large-scale parallel algorithms

02Refactoring slow CUDA kernels to improve execution speed

03Optimizing GPU resource utilization for high-performance computing tasks

What are Skills?·How to Install

Install with 🐟 Skill.Fish

npx skillfish add keith-mvs/nsight-copilot cuda-optimization

For use in Claude.ai and ChatGPT

Download Skill