Builder Daily

2026-04-27

vLLM v0.20.0 ships DeepSeek V4 + PyTorch 2.11 + FlashAttention 4

vLLM v0.20.0: 752 commits, 320 contributors. CUDA 13, PyTorch 2.11, Transformers v5, Python 3.14, FlashAttention 4 default, 2-bit KV cache.

vLLM v0.20.0 is a major release with 752 commits from 320 contributors. Key changes: CUDA 13 and PyTorch 2.11 as the new baseline, Transformers v5 and Python 3.14 support, FlashAttention 4 as the default attention backend, DeepSeek V4 support, and a 2-bit quantized KV cache (TurboQuant).

Practitioner note

vLLM is the de facto open inference runtime. The CUDA 13 / PyTorch 2.11 baseline raises the floor for self-hosters: older base images will need rebuilding before adopting v0.20.0.

The 2-bit KV cache (TurboQuant) is the most economically meaningful change: four times the KV capacity in the same GPU memory means longer contexts or larger batch sizes for the same hardware budget. If you're running production self-hosted inference, this changes the deployment math. Run your own quality eval first, though: 2-bit KV quantization typically has a small but measurable accuracy impact at long contexts.
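The deployment math can be sketched as back-of-envelope arithmetic. The model shape below is hypothetical, and the 4x figure assumes an 8-bit KV baseline (versus 8x from fp16); plug in your own model's dimensions.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bits: int) -> int:
    """K and V each store num_layers * num_kv_heads * head_dim values per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bits // 8

# Hypothetical model shape (not any specific model's real config):
layers, kv_heads, head_dim = 60, 8, 128
budget = 40 * 1024**3  # say 40 GiB of GPU memory reserved for the KV cache

tokens_8bit = budget // kv_bytes_per_token(layers, kv_heads, head_dim, bits=8)
tokens_2bit = budget // kv_bytes_per_token(layers, kv_heads, head_dim, bits=2)
print(tokens_2bit // tokens_8bit)  # → 4: the 2-bit cache holds 4x the tokens
```

Whether you spend that 4x on one very long context or on higher batch concurrency is a serving-policy choice; the memory arithmetic is the same either way.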

