Cloudflare Infire — disaggregated LLM inference beats vLLM by 20%, Unweight cuts model size 22%
Cloudflare Infire (Rust) uses disaggregated prefill/decode to beat vLLM 0.10 by 20% on H100s. Unweight achieves 15–22% lossless model weight compression.
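The key idea behind disaggregated serving is splitting the compute-bound prefill pass (processing the whole prompt) from the bandwidth-bound decode loop (generating one token at a time), so each can run on separately sized worker pools. The sketch below is a conceptual illustration only, not Infire's actual code; all types and names are hypothetical placeholders.

```rust
// Conceptual sketch of disaggregated prefill/decode (hypothetical types,
// not Cloudflare Infire's implementation). A prefill worker builds a KV
// cache per request and hands it over a channel to a decode worker.

use std::sync::mpsc;
use std::thread;

/// Stand-in for the per-request KV cache produced during prefill.
struct KvCache {
    request_id: u64,
    prompt_len: usize,
}

/// Prefill stage: compute-heavy pass over the entire prompt.
fn prefill(request_id: u64, prompt: &str) -> KvCache {
    // A real engine would run the transformer over all prompt tokens here.
    KvCache { request_id, prompt_len: prompt.split_whitespace().count() }
}

/// Decode stage: memory-bandwidth-bound, one token per step, reusing the cache.
fn decode(cache: &KvCache, max_new_tokens: usize) -> Vec<String> {
    (0..max_new_tokens)
        .map(|i| format!("tok{}_{}", cache.request_id, cache.prompt_len + i))
        .collect()
}

fn main() {
    let (tx, rx) = mpsc::channel::<KvCache>();

    // Prefill worker: in a disaggregated design this would run on a GPU
    // pool sized for prompt processing throughput.
    let prefill_worker = thread::spawn(move || {
        for (id, prompt) in [(1u64, "why is the sky blue"), (2, "write a haiku")] {
            tx.send(prefill(id, prompt)).unwrap();
        }
    });

    // Decode worker: a separate pool optimized for token-by-token generation.
    let decode_worker = thread::spawn(move || {
        for cache in rx {
            let tokens = decode(&cache, 4);
            println!("request {} -> {:?}", cache.request_id, tokens);
        }
    });

    prefill_worker.join().unwrap();
    decode_worker.join().unwrap();
}
```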
vLLM v0.20.0 ships with 752 commits from 320 contributors: CUDA 13, PyTorch 2.11, Transformers v5, and Python 3.14 support, FlashAttention 4 as the default backend, and 2-bit KV cache quantization.