About
This skill provides standardized implementation patterns for managing shared GPU resources across multiple AI services such as Ollama, Whisper, and ComfyUI. It addresses the common Out-of-Memory (OOM) bottleneck by combining retry loops around model loading, configurable idle timeouts that unload unused models, and a signaling protocol that lets services request VRAM clearance from one another. It is particularly useful for developers running multiple local AI models on a single GPU who need stable, automated handovers without manual intervention.
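As a rough illustration of how these pieces could fit together, the Python sketch below pairs an OOM-aware retry loop with an idle-timeout unloader. Everything here is a hypothetical stand-in, not part of any named service's API: `VRAMExhaustedError`, the `load_model`, `unload_model`, and `request_vram_release` callables, and the timing defaults are all assumptions for illustration.

```python
import threading
import time


class VRAMExhaustedError(RuntimeError):
    """Hypothetical error raised when a model load fails for lack of VRAM."""


def load_with_retry(load_model, request_vram_release, attempts=3, backoff_s=5.0):
    """Try to load a model; on OOM, ask peer services to free VRAM, then retry."""
    for attempt in range(1, attempts + 1):
        try:
            return load_model()
        except VRAMExhaustedError:
            if attempt == attempts:
                raise
            request_vram_release()           # hypothetical: e.g. notify peers to unload
            time.sleep(backoff_s * attempt)  # give the release time to take effect


class IdleUnloader:
    """Unloads a model after idle_timeout_s seconds without a touch()."""

    def __init__(self, unload_model, idle_timeout_s=300.0):
        self._unload = unload_model
        self._timeout = idle_timeout_s
        self._last_used = time.monotonic()
        self._lock = threading.Lock()
        threading.Thread(target=self._watch, daemon=True).start()

    def touch(self):
        """Record that the model was just used, resetting the idle clock."""
        with self._lock:
            self._last_used = time.monotonic()

    def _watch(self):
        # Poll a few times per timeout window; unload once the model sits idle.
        while True:
            time.sleep(self._timeout / 10)
            with self._lock:
                idle = time.monotonic() - self._last_used
            if idle >= self._timeout:
                self._unload()  # free VRAM for the next service
                break
```

A caller would wire its own loaders into these helpers, for example `load_with_retry(lambda: start_ollama("llama3"), notify_peers_to_release)`, where both callables are placeholders supplied by the hosting service. The signaling transport (HTTP endpoint, Unix socket, or otherwise) is left to the implementation.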