- 2x speedup via vLLM-accelerated `fast_generate()`
- Support for 4-bit pre-quantized model loading
- Automated GPU memory cleanup and monitoring
- Optimized `SamplingParams` for diverse use cases
- Token-based parsing for Thinking/Reasoning models
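The memory-cleanup feature above could look roughly like the following minimal sketch. The function name `free_gpu_memory` is illustrative and not taken from the project; it combines Python garbage collection with PyTorch's CUDA cache release, degrading gracefully when no GPU (or no `torch`) is present.

```python
import gc

def free_gpu_memory():
    """Release cached GPU memory and report current usage.

    A hedged sketch: the project's actual cleanup routine may differ.
    """
    collected = gc.collect()  # drop dangling Python references first
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the CUDA driver
            allocated_mib = torch.cuda.memory_allocated() / 1024**2
            print(f"GPU memory still allocated: {allocated_mib:.1f} MiB")
    except ImportError:
        pass  # torch not installed: nothing GPU-side to clean
    return collected
```

Calling this between generation runs keeps long sessions from accumulating stale allocations.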
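Parsing Thinking/Reasoning model output can be sketched as below. This works at the string level for clarity; the tag names `<think>`/`</think>` are an assumption (real models emit model-specific special tokens, and true token-based parsing would locate those token IDs in the output sequence instead).

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Separate a model's hidden reasoning from its final answer.

    Illustrative sketch: tag names vary by model family.
    Returns (reasoning, answer); if no tags are found, the whole
    text is treated as the answer.
    """
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1:
        return "", text.strip()
    reasoning = text[start + len(open_tag):end].strip()
    answer = text[end + len(close_tag):].strip()
    return reasoning, answer
```

Splitting at delimiter positions rather than with a regex keeps the parser cheap and predictable even on long reasoning traces.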