2026-04-27
vLLM v0.20.0 ships DeepSeek V4 + PyTorch 2.11 + FlashAttention 4
vLLM v0.20.0: 752 commits, 320 contributors. CUDA 13, PyTorch 2.11, Transformers v5, Python 3.14, FlashAttention 4 default, 2-bit KV cache.
vLLM v0.20.0 is a major release with 752 commits from 320 contributors. Key changes:
- Default CUDA bumped to 13.0 and PyTorch to 2.11; Transformers v5 compatibility and Python 3.14 support
- FlashAttention 4 re-enabled as the default MLA prefill backend
- TurboQuant 2-bit KV cache: 4x KV-cache capacity within the same GPU memory budget
- Initial DeepSeek V4 / Hunyuan v3 / Granite 4.1 Vision support
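If you want to kick the tires on the new model support, the offline Python API is the quickest path. A minimal sketch below; the Hugging Face model id is an assumption, so substitute whichever newly supported checkpoint you actually pull.

```python
# Minimal offline-inference sketch with vLLM's Python API.
# The model id is a placeholder (assumption), not a confirmed checkpoint name.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V4")  # hypothetical HF id, for illustration only
params = SamplingParams(temperature=0.7, max_tokens=128)

for out in llm.generate(["Give me a one-line summary of MLA attention."], params):
    print(out.outputs[0].text)
```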
Practitioner note
vLLM is the de facto open inference runtime. CUDA 13 and PyTorch 2.11 raise the floor for self-hosters: older base images will need rebuilding before adopting v0.20.0.
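A quick preflight inside the rebuilt image catches a stale base layer early; a minimal sketch, assuming the version floors quoted in the release notes above.

```python
# Preflight check for a rebuilt base image: confirm the PyTorch / CUDA floor
# from the release notes before installing vLLM v0.20.0 on top of it.
import torch

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
assert torch.__version__.startswith("2.11"), "expected PyTorch 2.11.x"
assert torch.version.cuda and torch.version.cuda.startswith("13."), "expected CUDA 13.x"
```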
The 2-bit KV cache (TurboQuant) is the most economically meaningful change: 4x KV capacity in the same GPU memory means longer contexts or larger batch sizes for the same hardware budget. If you’re running production self-hosted inference, this changes the deployment math. Run your own quality eval (see the A/B sketch below): 2-bit KV typically has a small but measurable impact at long contexts.
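A minimal A/B sketch of that eval, assuming the 2-bit mode is exposed through the existing kv_cache_dtype engine argument; the "turboquant_2bit" value, the model id, and the prompt set are all placeholders, not confirmed names.

```python
# A/B sketch: same prompts, greedy decoding, default KV-cache precision vs. the
# 2-bit KV cache, then count diverging outputs. In practice, run each config in
# a separate process so the first engine's GPU memory is released.
from vllm import LLM, SamplingParams

prompts = ["<your long-context eval prompts here>"]  # placeholder eval set
params = SamplingParams(temperature=0.0, max_tokens=256)

def run(kv_dtype: str) -> list[str]:
    llm = LLM(model="your-org/your-model", kv_cache_dtype=kv_dtype)  # placeholder model id
    return [o.outputs[0].text for o in llm.generate(prompts, params)]

baseline = run("auto")              # default KV-cache precision
quantized = run("turboquant_2bit")  # assumed name for the new 2-bit mode

mismatches = sum(a != b for a, b in zip(baseline, quantized))
print(f"{mismatches}/{len(prompts)} prompts changed under the 2-bit KV cache")
```

Exact-match divergence is a blunt metric; swap in whatever long-context benchmark you already trust before committing the change to production.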