arXiv 2604.24763 · 2026-04-27
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal
Zhiheng Liu, Weiming Ren, Xiaoke Huang
Tuna-2 is a native unified multimodal model that encodes images with simple patch embeddings — no VAE, no separate vision encoder — and handles understanding plus generation in one pixel-space stack.
Encoder-based variants converge faster early in training, but Tuna-2's encoder-free design wins at scale, particularly on fine-grained perception, while matching SOTA on multimodal benchmarks.
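The core architectural claim is easy to picture in code. Below is a minimal sketch (mine, not the authors' release) of a ViT-style pixel patch embedding: raw pixel patches are linearly projected straight into the transformer's token space, so one stack consumes image and text tokens together. The patch size (16), hidden dim (768), and toy tensors are illustrative assumptions, not values from the paper.

```python
# Sketch of the encoder-free idea: project raw pixel patches directly
# into the language model's token space. No CLIP/SigLIP encoder, no VAE.
# All hyperparameters below are placeholders, not the paper's settings.
import torch
import torch.nn as nn

class PixelPatchEmbed(nn.Module):
    """Turn an image into transformer tokens with one linear projection."""
    def __init__(self, patch=16, channels=3, dim=768):
        super().__init__()
        # A strided conv is equivalent to "split into P x P patches,
        # flatten each, apply a shared linear layer".
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)

    def forward(self, images):               # images: (B, 3, H, W)
        x = self.proj(images)                # (B, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

# Image tokens are concatenated with text tokens and fed to a single
# transformer -- the "one stack instead of two" point from the abstract.
patcher = PixelPatchEmbed(patch=16, dim=768)
images = torch.randn(2, 3, 224, 224)
image_tokens = patcher(images)               # (2, 196, 768)
text_tokens = torch.randn(2, 32, 768)        # stand-in for embedded text
sequence = torch.cat([image_tokens, text_tokens], dim=1)  # (2, 228, 768)
print(sequence.shape)
```

The design choice this illustrates: the only image-specific machinery is a single learned projection, so there is no frozen encoder whose training objective can mismatch the generation side.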
Practitioner note (mine)
This is the first credible challenge to the "must use a pretrained vision encoder" assumption that has defined VLM architectures since CLIP. If the result holds up under broader evaluation, the next generation of vision-language models could be substantially simpler: one stack instead of two, with no encoder/decoder mismatch.
For builders this is mostly forward-looking. The practical takeaway today: stop assuming “the vision encoder choice matters most” — at the frontier, the bottleneck may be moving elsewhere.