2026-06-09
Google Ships QAT Checkpoints for the Entire Gemma 4 Family: Q4_0 Weights at Near-BF16 Quality
On June 5, 2026, Google released QAT checkpoints for every Gemma 4 size. The Q4_0 build cuts E4B from 15 GB to 5 GB, the mobile-specialized format takes text-only E2B to under 1 GB, and day-one support spans llama.cpp, Ollama, MLX, vLLM and SGLang.
What shipped
On June 5, 2026, Google released quantization-aware-training (QAT) checkpoints on Hugging Face for the whole Gemma 4 family, from the phone-class E2B and E4B, through the encoder-free multimodal 12B that launched just two days earlier on June 3, up to the 26B-A4B mixture-of-experts variant. Two formats are on offer: a standard Q4_0 collection aimed at desktop runtimes, and a novel mobile-specialized schema for edge accelerators that pushes token-generation layers down to 2-bit, uses channel-wise quantization, and quantizes activations statically.
The runtime support list is unusually broad for a day-one quantization drop: llama.cpp, Ollama, LM Studio, vLLM, SGLang, MLX, LiteRT-LM, Transformers.js, Unsloth, and Hugging Face Transformers are all named in the announcement. For anyone serving models on unified-memory hardware, this is the most consequential part — the vendor is now shipping the canonical 4-bit artifact rather than leaving quantization to the community.
The numbers
| Model | BF16 footprint | QAT footprint | Format |
|---|---|---|---|
| E2B (text-only) | 9.6 GB | under 1 GB | mobile-specialized |
| E2B (multimodal, iOS) | — | 607 MB active RAM | LiteRT-LM |
| E4B | 15 GB | 5 GB | Q4_0 |
| 12B | ~24 GB | ~7 GB | Q4_0 |
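The 12B row can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a nominal 12 billion parameters (taken from the model name, not from the announcement) and the Q4_0 block layout used by llama.cpp, where 32 weights share one fp16 scale. It ignores any tensors a real checkpoint keeps at higher precision, so treat it as an estimate only.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Q4_0 stores 32 weights as 16 bytes of 4-bit values plus one 2-byte fp16 scale.
Q4_0_BITS = (16 + 2) * 8 / 32   # 4.5 bits per weight
BF16_BITS = 16.0

PARAMS_12B = 12e9  # assumption: nominal parameter count implied by the name

bf16_gb = weight_footprint_gb(PARAMS_12B, BF16_BITS)  # 24.0
q4_gb = weight_footprint_gb(PARAMS_12B, Q4_0_BITS)    # 6.75
```

Both results line up with the table's ~24 GB and ~7 GB figures, which suggests the quoted footprints are dominated by plain weight storage.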
The quality claim is what separates this from yet another GGUF re-quant. Because QAT simulates quantization noise during training, the weights settle into values that survive 4-bit rounding. Google reports QAT results “yield even higher overall quality compared to standard PTQ baselines,” and the prior-generation data backs the method: on Gemma 3, QAT reduced the perplexity drop from quantization by 54% versus post-training quantization. On mobile silicon, coverage reports the E2B QAT build decoding at 56 tokens per second on iOS Metal and 52 tokens per second on Android via OpenCL.
Why QAT beats post-hoc quantization
Most local-inference users run community post-training quantizations: take the BF16 release, run a calibration pass, round to K-quants or Q4_0, and accept whatever quality falls out. That process is at the mercy of the calibration set and tends to hit outlier channels hardest. QAT moves the problem upstream — the fine-tuning loop itself sees fake-quantized weights, so the optimizer routes around the precision cliff before the model ever ships. The result is a 4-bit file that behaves like the BF16 model rather than a degraded copy of it.
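To make "the fine-tuning loop sees fake-quantized weights" concrete, here is a minimal sketch in plain Python. It assumes a simplified symmetric 4-bit grid with one scale per block and a straight-through estimator for the gradient, which is a common way to implement QAT. Google's actual recipe is not described in the announcement, and the function names here are hypothetical.

```python
def fake_quant(weights, bits=4):
    """Round weights to a symmetric integer grid, then map them back to floats."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, giving the grid -7..7
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return list(weights)
    scale = amax / qmax                   # one scale per block (a simplification)
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def qat_step(weights, inputs, target, lr=0.01, bits=4):
    """One SGD step on squared error; the forward pass uses quantized weights."""
    w_q = fake_quant(weights, bits)
    pred = sum(wq * x for wq, x in zip(w_q, inputs))
    err = pred - target
    # Straight-through estimator: rounding has zero gradient almost everywhere,
    # so the gradient computed for w_q is applied to the full-precision weights.
    return [w - lr * 2.0 * err * x for w, x in zip(weights, inputs)]
```

The loss is always measured on the rounded weights, so the optimizer is rewarded for full-precision values that still predict well after rounding. Post-training quantization only sees the rounding error after training has finished.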
The mobile schema goes further than anything the community PTQ toolchain typically produces. Combining 2-bit token-generation layers with static activations is a mixed-precision recipe that requires training-time cooperation. You cannot reproduce that with a post-hoc llama-quantize pass.
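Channel-wise scaling matters most at very low bit widths, and a toy comparison shows why. The sketch below assumes a symmetric 2-bit grid of three levels (-s, 0, +s) and a made-up two-row weight matrix. It illustrates the general technique only and says nothing about the exact layout of Google's mobile schema.

```python
def quantize(values, scale, qmax):
    """Quantize-dequantize values on a symmetric grid with the given scale."""
    if scale == 0.0:
        return [0.0] * len(values)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quant_mse(matrix, bits, per_channel):
    """Mean squared rounding error with one scale per row or one per tensor."""
    qmax = 2 ** (bits - 1) - 1            # 1 for 2-bit: levels -s, 0, +s
    tensor_scale = max(abs(v) for row in matrix for v in row) / qmax
    total, count = 0.0, 0
    for row in matrix:
        scale = max(abs(v) for v in row) / qmax if per_channel else tensor_scale
        deq = quantize(row, scale, qmax)
        total += sum((v - d) ** 2 for v, d in zip(row, deq))
        count += len(row)
    return total / count


# Hypothetical weights: one large-magnitude channel, one small-magnitude channel.
toy = [[1.0, -0.8, 0.6, -1.0],
       [0.01, -0.02, 0.015, -0.01]]

mse_tensor = quant_mse(toy, bits=2, per_channel=False)
mse_channel = quant_mse(toy, bits=2, per_channel=True)
```

With a single tensor-wide scale, the small channel collapses entirely to zero. A per-channel scale preserves some of its structure, so `mse_channel` comes out lower than `mse_tensor` on this input.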
What it means on bandwidth-bound local hardware
For unified-memory machines in the DGX Spark class, decode throughput is set by how many bytes of weights you stream per token, not by compute. A Q4_0 QAT checkpoint gives you 4-bit byte counts without the usual PTQ quality tax, which is exactly the trade local-inference users have been making reluctantly for years. The 12B at roughly 7 GB leaves the bulk of a 128 GB unified-memory budget free for KV cache, which matters because the model carries a 256,000-token context window: long-context work is where freed memory converts directly into capability rather than just headroom.
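The bandwidth argument reduces to a simple ceiling: if every weight byte is read once per generated token, tokens per second cannot exceed memory bandwidth divided by weight bytes. The sketch below uses the article's 24 GB and 7 GB figures for the 12B and a 128 GB memory budget. The 273 GB/s bandwidth is an assumed illustrative value, not a number from the announcement. KV-cache reads and runtime overhead are ignored, so real throughput will be lower.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


ASSUMED_BANDWIDTH_GB_S = 273.0   # assumption: illustrative unified-memory bandwidth
UNIFIED_MEMORY_GB = 128.0        # budget quoted in the text

bf16_ceiling = decode_ceiling_tok_s(ASSUMED_BANDWIDTH_GB_S, 24.0)  # 11.375
q4_ceiling = decode_ceiling_tok_s(ASSUMED_BANDWIDTH_GB_S, 7.0)     # 39.0
kv_headroom_gb = UNIFIED_MEMORY_GB - 7.0                           # 121.0
```

Whatever the actual bandwidth, the ratio between the two ceilings is fixed by the byte counts alone, here 24 / 7 or about 3.4x. For a mixture-of-experts model, only the weights activated per token would count toward the denominator.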
The E4B at 5 GB is small enough to keep resident alongside a primary model as a utility worker — summarization, routing, structured extraction — without meaningfully denting the memory budget of the main serving job.
Practitioner note
The Q4_0 and mobile-format collections are on Hugging Face, and Ollama exposes the official builds behind a qat tag. Two cautions from early reports. First, Ollama currently has an active tool-calling bug with Gemma 4 models, so for agent workloads that depend on structured tool calls, llama.cpp is the recommended path until it is patched. Second, watch for name collisions: community PTQ quants of the BF16 weights were already circulating before June 5, and a generic “gemma-4 Q4_0” file is not necessarily the QAT artifact. Verify the checkpoint lineage before benchmarking, or you will measure the wrong thing.
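One low-tech way to check lineage is to hash the downloaded file and compare the result against a digest published alongside the official checkpoint. The sketch below assumes the publisher lists a SHA-256 value for the file. The file name and digest in the usage comment are placeholders, not real values.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20) -> str:
    """Stream a file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex: str) -> bool:
    """True if the local file's digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (placeholder names, replace with the real file and published digest):
# ok = matches_published_hash(Path("model-q4_0.gguf"), "<published sha256>")
```

A match only shows the file is byte-identical to the one the digest was published for. Confirming that the digest itself comes from the official QAT collection remains a manual step.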
Under-considered angle
The strategic shift here is who owns quantization. Until now, the 4-bit artifact a local user actually ran was a community product — a quilt of K-quants built with varying calibration sets, of varying provenance. With vendor-blessed QAT checkpoints covering an entire model family on day five of its life, the canonical low-precision artifact now comes from the lab that trained the model. That standardizes quality, but it also means recipes like 2-bit token-generation layers — which need training-time cooperation — will increasingly separate official quants from anything the community can replicate post hoc. Expect other labs to follow, and expect the community quant scene to refocus on sizes and formats vendors decline to ship.
Sources
- Gemma 4 with quantization-aware training (Google, official blog)
- Gemma 4 QAT Cuts E2B to Under 1GB: Deploy It Now (byteiota)
- Google DeepMind launches Gemma 4 12B, bringing frontier AI model to everyday laptops (Tech Startups)
- Gemma 4 Goes Mobile: What Google's New QAT Checkpoints Mean for On-Device AI (DEV Community)
- Gemma 4 QAT Self-Hosting Guide: Ollama, llama.cpp, vLLM (Lushbinary)