Automatic detection of LLM serving frameworks and inference configurations
Focus-driven analysis for specific goals like latency, cost, or throughput
Tiered implementation roadmap from low-effort quick wins to advanced changes
Tailored quantization strategies including INT8, INT4, and FP16 recommendations
Optimization advice for KV cache management and continuous batching
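As a rough illustration of the kind of transformation an INT8 recommendation implies, here is a minimal sketch of symmetric per-tensor weight quantization. This is a generic example, not this tool's actual code; the function names are hypothetical.

```python
# Hypothetical sketch: symmetric per-tensor INT8 quantization.
# A single scale maps floats into the int8 range [-127, 127];
# dequantizing multiplies back by that scale.

def quantize_int8(weights):
    """Quantize a list of float weights to int8 values plus a scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Real tools typically quantize per-channel and calibrate activations as well, but the storage win is the same idea: one int8 per weight plus a shared scale instead of a full-precision float.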