Builder Daily

2026-05-04

Cloudflare Infire: disaggregated LLM inference beats vLLM by up to 20%, Unweight cuts model size by up to 22%

Cloudflare Infire (Rust) uses disaggregated prefill/decode to beat vLLM 0.10.0 by up to 20% on H100s. Unweight achieves 15–22% lossless model weight compression.

Cloudflare published details of two new internal systems powering Workers AI. Infire is an inference engine written in Rust and built around disaggregated prefill/decode: prefill (prompt processing) and decode (token generation) run on separate GPU pools, so each stage can scale independently. In benchmarks against vLLM 0.10.0 on loaded H100 hardware, Infire delivered up to 20% higher tokens/sec. It supports tensor-parallel and pipeline-parallel deployment for MoE models.
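
A minimal sketch of the handoff pattern, with OS threads and channels standing in for the two GPU pools (all type and function names here are invented for illustration, not Infire's API):

```rust
// A toy of the disaggregation pattern: two independent worker pools joined
// by a handoff queue. All names are invented for illustration.
use std::sync::mpsc;
use std::thread;

// Stand-in for the attention state produced by prefill; a real engine
// hands off GPU-resident KV tensors (or transfers them between nodes).
struct KvCache {
    request_id: u64,
    entries: Vec<f32>,
}

struct PrefillRequest {
    request_id: u64,
    prompt_tokens: Vec<u32>,
}

fn main() {
    // Frontend -> prefill pool, and prefill pool -> decode pool. The two
    // pools scale independently: add prefill workers for prompt-heavy
    // traffic, decode workers for generation-heavy traffic.
    let (prefill_tx, prefill_rx) = mpsc::channel::<PrefillRequest>();
    let (decode_tx, decode_rx) = mpsc::channel::<KvCache>();

    // Prefill pool: processes the whole prompt in one compute-bound pass.
    let prefill_pool = thread::spawn(move || {
        for req in prefill_rx {
            let kv = KvCache {
                request_id: req.request_id,
                entries: req.prompt_tokens.iter().map(|&t| t as f32).collect(),
            };
            decode_tx.send(kv).unwrap(); // hand off to the decode pool
        }
        // decode_tx drops here, closing the decode queue once prefill drains
    });

    // Decode pool: emits one token per step from the handed-off cache.
    let decode_pool = thread::spawn(move || {
        for kv in decode_rx {
            for step in 0..4 {
                println!(
                    "request {} (cache len {}) -> token {}",
                    kv.request_id,
                    kv.entries.len(),
                    step
                );
            }
        }
    });

    for id in 0..3 {
        let req = PrefillRequest { request_id: id, prompt_tokens: vec![1, 2, 3] };
        prefill_tx.send(req).unwrap();
    }
    drop(prefill_tx); // close the pipeline so both pools drain and exit
    prefill_pool.join().unwrap();
    decode_pool.join().unwrap();
}
```

The point of the split: prefill is compute-bound while decode is memory-bandwidth-bound, so each pool can be sized for its own bottleneck, and a long prompt in prefill never stalls tokens already streaming out of decode.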

Unweight is a lossless MLP weight compression system that reduces model size by 15–22% with bit-exact output preservation: no accuracy loss, no quantization artifacts. The compression operates on the weight tensors before serving, not at inference time.
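
The summary does not say how Unweight gets there, but one common route to lossless ratios in this range is byte-plane splitting plus entropy coding: trained weights occupy a narrow dynamic range, so the sign/exponent bytes of bf16 values are highly redundant. A sketch of that idea, offered as an assumption about the mechanism rather than Unweight's confirmed method:

```rust
// Byte-plane splitting for bf16 weights: an assumed mechanism, not
// Unweight's confirmed one. High byte = sign + top 7 exponent bits;
// low byte = last exponent bit + 7 mantissa bits. Trained weights use
// few distinct exponents, so the high plane entropy-codes well while
// the round trip stays bit-exact.
fn split_planes(weights: &[u16]) -> (Vec<u8>, Vec<u8>) {
    let hi = weights.iter().map(|&w| (w >> 8) as u8).collect();
    let lo = weights.iter().map(|&w| (w & 0xff) as u8).collect();
    (hi, lo)
}

fn join_planes(hi: &[u8], lo: &[u8]) -> Vec<u16> {
    hi.iter()
        .zip(lo.iter())
        .map(|(&h, &l)| ((h as u16) << 8) | l as u16)
        .collect()
}

// Shannon entropy in bits/byte: a lower bound on what an entropy coder
// could spend per byte of each plane.
fn entropy_bits_per_byte(data: &[u8]) -> f64 {
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // Synthetic stand-in for trained MLP weights: small values in a
    // narrow range, stored as bf16 bit patterns (top 16 bits of f32).
    let weights: Vec<u16> = (0..100_000u32)
        .map(|i| {
            let x = ((i % 997) as f32 / 997.0 - 0.5) * 0.02;
            (x.to_bits() >> 16) as u16
        })
        .collect();

    let (hi, lo) = split_planes(&weights);
    assert_eq!(join_planes(&hi, &lo), weights); // bit-exact reconstruction

    // 8.0 bits/byte is incompressible; the hi plane lands far below it.
    println!("hi plane: {:.2} bits/byte", entropy_bits_per_byte(&hi));
    println!("lo plane: {:.2} bits/byte", entropy_bits_per_byte(&lo));
}
```

On these synthetic weights the high plane lands at a few bits per byte against the low plane's near-incompressible 8, which is the kind of asymmetry an overall 15–22% lossless saving can come from.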

Practitioner note

Disaggregated prefill/decode is the architectural direction for high-throughput inference; vLLM and SGLang are moving the same way. If you run open-source models on H100s, benchmark Infire against your current setup. The 20% claim was measured on loaded hardware, the realistic production case, which makes it more credible than idle-GPU benchmarks. Unweight's lossless claim is strong: read the research PDF if you manage GPU memory budgets, since a 22% size reduction at zero accuracy cost changes capacity math meaningfully. A back-of-envelope sketch follows.
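
Here is that capacity math, assuming the compressed form is what actually occupies HBM (the research PDF is the authority on where decompression happens) and using illustrative model numbers rather than figures from the post:

```rust
// Illustrative capacity math: weight bytes saved become KV-cache headroom.
// All model numbers below are examples, not measurements from the post.
fn main() {
    const GIB: f64 = 1024.0 * 1024.0 * 1024.0;

    let hbm = 80.0 * GIB; // an 80 GB H100-class card
    let weights = 16.0 * GIB; // e.g. an 8B-parameter model at 2 bytes/param
    let compression = 0.22; // the upper end of the 15-22% range

    // Example per-token KV cost: 32 layers * 8 KV heads * 128 head dim
    // * 2 (K and V) * 2 bytes (bf16) = 128 KiB/token. Substitute yours.
    let kv_per_token = 32.0 * 8.0 * 128.0 * 2.0 * 2.0;

    let kv_tokens = |weight_bytes: f64| (hbm - weight_bytes) / kv_per_token;

    let before = kv_tokens(weights);
    let after = kv_tokens(weights * (1.0 - compression));
    println!("KV-cache tokens before: {:.0}", before);
    println!("KV-cache tokens after:  {:.0}", after);
    println!("extra 8k-token contexts: {:.1}", (after - before) / 8192.0);
}
```

On these example numbers the freed ~3.5 GiB buys roughly 28,000 extra KV-cache tokens, about three more concurrent 8k-token contexts per card.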

