2026-04-27
vLLM v0.20.0 ships DeepSeek V4 + PyTorch 2.11 + FlashAttention 4
vLLM v0.20.0: 752 commits, 320 contributors. CUDA 13, PyTorch 2.11, Transformers v5, Python 3.14, FlashAttention 4 default, 2-bit KV cache.
vLLM v0.20.0 is a major release with 752 commits from 320 contributors. Key changes:
- Default CUDA bumped to 13.0 and PyTorch to 2.11; Transformers v5 compatibility and Python 3.14 support
- FlashAttention 4 re-enabled as the default MLA prefill backend
- TurboQuant 2-bit KV cache: 4x KV-cache capacity within the same GPU memory budget
- Initial DeepSeek V4 / Hunyuan v3 / Granite 4.1 Vision support
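If you want to kick the tires on the new model support, the offline Python API is the quickest path. A minimal sketch below; the Hugging Face model id is an assumption, so substitute whichever newly supported checkpoint you actually pull.

```python
# Minimal offline-inference sketch with vLLM's Python API.
# The model id is a placeholder (assumption), not a confirmed checkpoint name.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V4")  # hypothetical HF id, for illustration only
params = SamplingParams(temperature=0.7, max_tokens=128)

for out in llm.generate(["Give me a one-line summary of MLA attention."], params):
    print(out.outputs[0].text)
```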
Practitioner note
vLLM is the de facto open inference runtime. CUDA 13 and PyTorch 2.11 raise the floor for self-hosters: older base images will need rebuilding before adopting v0.20.0.
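A quick preflight inside the rebuilt image catches a stale base layer early; a minimal sketch, assuming the version floors quoted in the release notes above.

```python
# Preflight check for a rebuilt base image: confirm the PyTorch / CUDA floor
# from the release notes before installing vLLM v0.20.0 on top of it.
import torch

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
assert torch.__version__.startswith("2.11"), "expected PyTorch 2.11.x"
assert torch.version.cuda and torch.version.cuda.startswith("13."), "expected CUDA 13.x"
```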
The 2-bit KV cache (TurboQuant) is the most economically meaningful change: 4x KV capacity in the same GPU memory means longer contexts or larger batch sizes for the same hardware budget. If you’re running production self-hosted inference, this changes the deployment math. Run your own quality eval (see the A/B sketch below): 2-bit KV typically has a small but measurable impact at long contexts.
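A minimal A/B sketch of that eval, assuming the 2-bit mode is exposed through the existing kv_cache_dtype engine argument; the "turboquant_2bit" value, the model id, and the prompt set are all placeholders, not confirmed names.

```python
# A/B sketch: same prompts, greedy decoding, default KV-cache precision vs. the
# 2-bit KV cache, then count diverging outputs. In practice, run each config in
# a separate process so the first engine's GPU memory is released.
from vllm import LLM, SamplingParams

prompts = ["<your long-context eval prompts here>"]  # placeholder eval set
params = SamplingParams(temperature=0.0, max_tokens=256)

def run(kv_dtype: str) -> list[str]:
    llm = LLM(model="your-org/your-model", kv_cache_dtype=kv_dtype)  # placeholder model id
    return [o.outputs[0].text for o in llm.generate(prompts, params)]

baseline = run("auto")              # default KV-cache precision
quantized = run("turboquant_2bit")  # assumed name for the new 2-bit mode

mismatches = sum(a != b for a, b in zip(baseline, quantized))
print(f"{mismatches}/{len(prompts)} prompts changed under the 2-bit KV cache")
```

Exact-match divergence is a blunt metric; swap in whatever long-context benchmark you already trust before committing the change to production.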