AI-Daily-Builder

2026-06-09

Google Ships QAT Checkpoints for the Entire Gemma 4 Family: Q4_0 Weights at Near-BF16 Quality

On June 5, 2026, Google released QAT checkpoints for every Gemma 4 size. Q4_0 cuts E4B from 15 GB to 5 GB, the mobile-specialized format takes text-only E2B to under 1 GB, and day-one support spans llama.cpp, Ollama, MLX, vLLM and SGLang.

What shipped

On June 5, 2026, Google released quantization-aware-training (QAT) checkpoints on Hugging Face for the Gemma 4 family — from the phone-class E2B and E4B, through the encoder-free multimodal 12B that launched just two days earlier on June 3, up to the 26B-A4B mixture-of-experts variant. Two formats are on offer: a standard Q4_0 collection aimed at desktop runtimes, and a novel mobile-specialized schema that pushes token-generation layers down to 2-bit, uses channel-wise quantization, and fixes activations statically for edge accelerators.

The runtime support list is unusually broad for a day-one quantization drop: llama.cpp, Ollama, LM Studio, vLLM, SGLang, MLX, LiteRT-LM, Transformers.js, Unsloth, and Hugging Face Transformers are all named in the announcement. For anyone serving models on unified-memory hardware, this is the most consequential part — the vendor is now shipping the canonical 4-bit artifact rather than leaving quantization to the community.

The numbers

| Model | BF16 footprint | QAT footprint | Format |
| --- | --- | --- | --- |
| E2B (text-only) | 9.6 GB | under 1 GB | mobile-specialized |
| E2B (multimodal, iOS) | | 607 MB active RAM | LiteRT-LM |
| E4B | 15 GB | 5 GB | Q4_0 |
| 12B | ~24 GB | ~7 GB | Q4_0 |
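
As a sanity check on the Q4_0 rows, the sketch below compares the reported footprints with the best-case compression the format allows. It assumes the Q4_0 block layout used by llama.cpp (32 weights per block, 16 bytes of packed 4-bit values plus one 2-byte fp16 scale); the helper names are made up for illustration, and the footprints are the rounded figures from the table.

```python
# Q4_0 block layout as used by llama.cpp: 32 weights per block, stored as
# 16 bytes of packed 4-bit values plus one 2-byte (fp16) scale.
BLOCK_WEIGHTS = 32
BLOCK_BYTES = 16 + 2

Q4_0_BITS_PER_WEIGHT = BLOCK_BYTES * 8 / BLOCK_WEIGHTS  # 4.5 bits per weight
BF16_BITS_PER_WEIGHT = 16.0

# Rounded footprints from the table above: (BF16 GB, QAT GB).
REPORTED_GB = {"E4B": (15.0, 5.0), "12B": (24.0, 7.0)}


def ideal_compression() -> float:
    """Best-case BF16 -> Q4_0 ratio if every tensor were stored as Q4_0."""
    return BF16_BITS_PER_WEIGHT / Q4_0_BITS_PER_WEIGHT


def reported_compression(bf16_gb: float, qat_gb: float) -> float:
    """Ratio implied by a pair of reported footprints."""
    return bf16_gb / qat_gb


if __name__ == "__main__":
    print(f"ideal ratio: {ideal_compression():.2f}x")
    for name, (bf16_gb, qat_gb) in REPORTED_GB.items():
        ratio = reported_compression(bf16_gb, qat_gb)
        print(f"{name}: {ratio:.2f}x")
```

Both reported ratios land below the all-Q4_0 ceiling. One plausible reason, which the announcement as summarized here does not confirm, is that some tensors stay at higher precision.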

The quality claim is what separates this from yet another GGUF re-quant. Because QAT simulates quantization noise during training, the weights settle into values that survive 4-bit rounding. Google reports QAT results “yield even higher overall quality compared to standard PTQ baselines,” and the prior-generation data backs the method: on Gemma 3, QAT reduced the perplexity degradation caused by quantization by 54% versus post-training quantization. On mobile silicon, early coverage reports the E2B QAT build decoding at 56 tokens per second on iOS Metal and 52 tokens per second on Android via OpenCL.

Why QAT beats post-hoc quantization

Most local-inference users run community post-training quantizations: take the BF16 release, run a calibration pass, round to K-quants or Q4_0, and accept whatever quality falls out. That process is at the mercy of the calibration set and tends to hit outlier channels hardest. QAT moves the problem upstream — the fine-tuning loop itself sees fake-quantized weights, so the optimizer routes around the precision cliff before the model ever ships. The result is a 4-bit file that behaves like the BF16 model rather than a degraded copy of it.
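
The "fake-quantized weights" idea can be sketched in a few lines. This is a simplified symmetric 4-bit round trip over one block of weights, not Google's recipe (which the article does not describe) and not the exact Q4_0 encoding. In a typical QAT setup, the forward pass would use the rounded values returned here while gradients flow to the full-precision weights as if rounding were the identity (the straight-through estimator). The function name and the sample weights are illustrative assumptions.

```python
def fake_quantize(block, bits=4):
    """Round a block of weights to a signed `bits`-bit grid and back.

    Returns (dequantized_values, scale). Simplified symmetric scheme:
    one scale per block, chosen so the largest magnitude maps to qmax.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -qmax - 1             # -8 for 4-bit
    peak = max(abs(w) for w in block)
    if peak == 0.0:
        return [0.0 for _ in block], 0.0
    scale = peak / qmax
    dequantized = []
    for w in block:
        q = max(qmin, min(qmax, round(w / scale)))  # integer code
        dequantized.append(q * scale)               # value the forward pass sees
    return dequantized, scale


if __name__ == "__main__":
    weights = [0.31, -0.12, 0.05, -0.44, 0.27, 0.0, -0.09, 0.18]
    seen_in_forward, scale = fake_quantize(weights)
    worst = max(abs(a - b) for a, b in zip(weights, seen_in_forward))
    print(f"scale={scale:.4f}, worst rounding error={worst:.4f}")
```

Because the training loss is computed on the rounded values, the optimizer is pushed toward weights whose rounding error matters less, which is the "routes around the precision cliff" effect described above.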

The mobile schema goes further than anything the community PTQ toolchain typically produces: 2-bit token-generation layers with static activations is a mixed-precision recipe that requires training-time cooperation. You cannot reproduce that with a post-hoc llama-quantize pass.
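
To see why channel-wise scaling matters at very low bit widths, the toy comparison below quantizes a tiny two-row matrix to 2 bits, once with a single per-tensor scale and once with one scale per row. Treating each row as a "channel" is an assumption made for illustration, as are the function names and the values; the actual mobile schema is not specified at this level of detail in the article.

```python
def quantize_row(row, scale, bits):
    """Round one row to a signed `bits`-bit grid with the given scale."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in row]


def squared_error(row, approx):
    return sum((a - b) ** 2 for a, b in zip(row, approx))


def per_tensor(matrix, bits):
    """One scale shared by the whole matrix (assumes a non-zero matrix)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for row in matrix for w in row) / qmax
    return [quantize_row(row, scale, bits) for row in matrix]


def per_channel(matrix, bits):
    """One scale per row, i.e. per channel (assumes non-zero rows)."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in matrix:
        scale = max(abs(w) for w in row) / qmax
        out.append(quantize_row(row, scale, bits))
    return out


# Two channels with very different magnitudes (illustrative values).
MATRIX = [
    [0.1, -0.06, 0.08, -0.1],
    [4.0, -2.4, 3.0, -4.0],
]

if __name__ == "__main__":
    for name, fn in (("per-tensor", per_tensor), ("per-channel", per_channel)):
        approx = fn(MATRIX, bits=2)
        errors = [squared_error(r, a) for r, a in zip(MATRIX, approx)]
        print(name, [f"{e:.4f}" for e in errors])
```

With a shared scale, the small-magnitude row collapses entirely to zero; with its own scale it keeps its sign pattern. This only shows the storage-side effect; choosing weights that tolerate such a coarse grid is the part that needs training-time cooperation.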

What it means on bandwidth-bound local hardware

For unified-memory machines in the DGX Spark class, decode throughput is set by how many bytes of weights you stream from memory per token, not by compute. A Q4_0 QAT checkpoint gives you 4-bit byte counts without the usual PTQ quality tax, a tax local-inference users have been paying reluctantly for years. The 12B at roughly 7 GB leaves the bulk of a 128 GB unified-memory budget free for KV cache, which matters because the model carries a 256,000-token context window: long-context work is where freed memory converts directly into capability rather than just headroom.
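
The bandwidth argument reduces to simple arithmetic: if every generated token requires reading all weights once, tokens per second cannot exceed memory bandwidth divided by weight bytes. The sketch below applies that to the 12B footprints quoted above. The 273 GB/s bandwidth is an assumed illustrative figure, not taken from the article, so substitute a measured value for your hardware. The estimate also ignores KV-cache reads and assumes a dense model.

```python
# Assumption for illustration only; replace with your machine's measured bandwidth.
ASSUMED_BANDWIDTH_GB_S = 273.0
UNIFIED_MEMORY_GB = 128.0  # budget quoted in the text

# 12B footprints quoted in the article (approximate).
BF16_WEIGHTS_GB = 24.0
Q4_0_WEIGHTS_GB = 7.0


def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


def memory_left_gb(total_gb: float, weights_gb: float) -> float:
    """Memory remaining for KV cache and everything else after loading weights."""
    return total_gb - weights_gb


if __name__ == "__main__":
    for label, size in (("BF16", BF16_WEIGHTS_GB), ("Q4_0 QAT", Q4_0_WEIGHTS_GB)):
        ceiling = decode_ceiling_tokens_per_s(ASSUMED_BANDWIDTH_GB_S, size)
        left = memory_left_gb(UNIFIED_MEMORY_GB, size)
        print(f"{label}: <= {ceiling:.1f} tok/s, {left:.0f} GB left")
```

Whatever the real bandwidth is, the ratio between the two ceilings is just the footprint ratio (about 3.4x here), which is why a 4-bit file without a quality penalty matters on this class of hardware.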

The E4B at 5 GB is small enough to keep resident alongside a primary model as a utility worker — summarization, routing, structured extraction — without meaningfully denting the memory budget of the main serving job.

Practitioner note

The Q4_0 and mobile-format collections are on Hugging Face, and Ollama exposes the official builds behind a qat tag. Two cautions from early reports. First, Ollama currently has an active tool-calling bug with Gemma 4 models, so for agent workloads that depend on structured tool calls, llama.cpp is the recommended path until it is patched. Second, watch for name collisions: community PTQ quants of the BF16 weights were already circulating before June 5, and a generic “gemma-4 Q4_0” file is not necessarily the QAT artifact. Verify the checkpoint lineage before benchmarking, or you will measure the wrong thing.
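
One low-tech way to verify lineage is to hash the downloaded file and compare it with a checksum published alongside the official checkpoint. The sketch below assumes such a SHA-256 value is available for the file you downloaded; the function names and any file paths you pass in are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-gigabyte checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """True if the local file's hash equals the published checksum."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

A mismatch does not say what the file is, only that it is not the artifact the checksum describes, which is enough to stop a benchmark run from measuring a community PTQ quant by mistake.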

Under-considered angle

The strategic shift here is who owns quantization. Until now, the 4-bit artifact a local user actually ran was a community product: a quilt of K-quants built with varying calibration sets, of varying provenance. With vendor-blessed QAT checkpoints covering an entire model family, including a 12B that had launched only two days earlier, the canonical low-precision artifact now comes from the lab that trained the model. That standardizes quality, but it also means recipes that need training-time cooperation, such as 2-bit token-generation layers, will increasingly separate official quants from anything the community can replicate post hoc. Expect other labs to follow, and expect the community quant scene to refocus on sizes and formats vendors decline to ship.
