arXiv 2604.24763 · 2026-04-27
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal
Zhiheng Liu, Weiming Ren, Xiaoke Huang
Tuna-2 is a native unified multimodal model that encodes images with simple patch embeddings — no VAE, no separate vision encoder — and handles understanding plus generation in one pixel-space stack.
Encoder-based variants converge faster early in training, but Tuna-2's encoder-free design wins at scale, particularly on fine-grained perception, while matching SOTA on multimodal benchmarks.
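The core architectural claim is easy to picture in code. Below is a minimal sketch (mine, not the authors' release) of a ViT-style pixel patch embedding: raw pixel patches are linearly projected straight into the transformer's token space, so one stack consumes image and text tokens together. The patch size (16), hidden dim (768), and toy tensors are illustrative assumptions, not values from the paper.

```python
# Sketch of the encoder-free idea: project raw pixel patches directly
# into the language model's token space. No CLIP/SigLIP encoder, no VAE.
# All hyperparameters below are placeholders, not the paper's settings.
import torch
import torch.nn as nn

class PixelPatchEmbed(nn.Module):
    """Turn an image into transformer tokens with one linear projection."""
    def __init__(self, patch=16, channels=3, dim=768):
        super().__init__()
        # A strided conv is equivalent to "split into P x P patches,
        # flatten each, apply a shared linear layer".
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)

    def forward(self, images):               # images: (B, 3, H, W)
        x = self.proj(images)                # (B, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

# Image tokens are concatenated with text tokens and fed to a single
# transformer -- the "one stack instead of two" point from the abstract.
patcher = PixelPatchEmbed(patch=16, dim=768)
images = torch.randn(2, 3, 224, 224)
image_tokens = patcher(images)               # (2, 196, 768)
text_tokens = torch.randn(2, 32, 768)        # stand-in for embedded text
sequence = torch.cat([image_tokens, text_tokens], dim=1)  # (2, 228, 768)
print(sequence.shape)
```

The design choice this illustrates: the only image-specific machinery is a single learned projection, so there is no frozen encoder whose training objective can mismatch the generation side.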
Practitioner note (mine)
This is the first credible challenge to the "must use a pretrained vision encoder" assumption that has defined VLM architectures since CLIP. If the result holds up under broader evaluation, the next generation of vision-language models could be substantially simpler: one stack instead of two, with no encoder/decoder mismatch.
For builders this is mostly forward-looking. The practical takeaway today: stop assuming “the vision encoder choice matters most” — at the frontier, the bottleneck may be moving elsewhere.