arXiv 2606.11854 · 2026-06-10

ART: Fine-tuning frozen multimodal LLMs by optimizing visual soft-tokens with reinforcement learning — no weight updates required

Chudoba et al., Alyaev, Galuscakova, Wiktorski

ART adapts frozen multimodal LLMs by optimizing only the visual input tokens, which keeps RL-based adaptation compatible with inference engines that pre-compile their execution graphs, such as vLLM. No weight updates are required.

arxiv.org/abs/2606.11854 ↗

What the paper does

arXiv:2606.11854 (cs.CL, submitted June 10, 2026) proposes ART — Art-based Reinforcement Training — a method for adapting frozen multimodal LLMs (MLLMs) to new tasks without modifying any model weights. Instead of fine-tuning the backbone, ART optimizes the visual input token embeddings — the soft tokens produced by the vision encoder before they enter the language model — using a reinforcement learning objective.

The core motivation is compatibility with high-throughput inference engines. Modern serving stacks like vLLM pre-compile the language model’s computational graph at deployment time (using techniques like CUDA graph capture). Weight-update fine-tuning invalidates these pre-compiled graphs, forcing expensive recompilation. ART sidesteps the problem entirely: because it never touches the weights, the compiled graph stays valid. RL adaptation happens in the input space, not the parameter space.

How it works

The architecture has three components:

1. Frozen MLLM backbone — the language model and its attention layers are locked. No gradients flow through them during training. ART assumes the model is deployed with an inference graph already compiled.

2. Vision encoder + soft-token projector — the vision encoder (e.g., a ViT or CLIP-based model) processes the input image as usual and produces patch embeddings. These embeddings pass through a lightweight projector (an MLP adapter) into the language model’s embedding space.

3. Learnable visual soft-token perturbations — ART adds a learnable perturbation layer on top of the projected visual tokens. These perturbations are optimized via RL (using a reward signal from task performance) to inject task-specific information into the visual stream. The perturbation parameters are small relative to the backbone and can be applied at inference time without modifying the base model.
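The three components above can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the additive form of the perturbation, the class name `VisualPerturbation`, and the toy shapes are all assumptions, and plain Python lists stand in for tensors.

```python
from dataclasses import dataclass, field


@dataclass
class VisualPerturbation:
    """Hypothetical learnable offsets added to projected visual tokens."""
    num_tokens: int
    dim: int
    delta: list = field(default_factory=list)

    def __post_init__(self):
        if not self.delta:
            # Zero init: an untrained perturbation leaves the tokens unchanged.
            self.delta = [[0.0] * self.dim for _ in range(self.num_tokens)]

    def apply(self, tokens):
        """Return tokens + delta as a new list; the input is not mutated."""
        if len(tokens) != self.num_tokens or any(len(t) != self.dim for t in tokens):
            raise ValueError("token grid does not match perturbation shape")
        return [[x + d for x, d in zip(tok, off)]
                for tok, off in zip(tokens, self.delta)]

    def num_parameters(self):
        return self.num_tokens * self.dim


# Stand-in for the projector output: 4 visual tokens of width 3.
projected = [[1.0, 2.0, 3.0] for _ in range(4)]
pert = VisualPerturbation(num_tokens=4, dim=3)
identity_out = pert.apply(projected)   # equals `projected` while delta is zero
pert.delta[0][0] = 0.5                 # pretend training moved one coordinate
perturbed = pert.apply(projected)      # only token 0, coordinate 0 changes
```

In this sketch only `delta` would be trained. The encoder, projector, and backbone are never modified, which is why the trainable parameter count (`num_tokens * dim`) stays small relative to the model.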

The RL training objective rewards token sequences that produce correct task outputs, using standard REINFORCE or PPO-style updates applied only to the perturbation layer.
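One way to picture a REINFORCE-style update that touches only the perturbation is sketched below. The summary does not specify how the policy is parameterized, so the Gaussian distribution over perturbations, the mean-reward baseline, the hyperparameters, and the `toy_reward` function (a stand-in for "run the frozen model and score its answer") are all assumptions made for illustration.

```python
import random


def reinforce_step(mean, reward_fn, rng, sigma=0.1, lr=0.05, samples=16):
    """One REINFORCE update on the mean of a Gaussian over perturbations.

    reward_fn is treated as a black box (forward pass only), so no
    gradient ever flows through the frozen model.
    """
    noises, rewards = [], []
    for _ in range(samples):
        noise = [rng.gauss(0.0, 1.0) for _ in mean]
        delta = [m + sigma * n for m, n in zip(mean, noise)]
        noises.append(noise)
        rewards.append(reward_fn(delta))
    baseline = sum(rewards) / samples  # mean-reward baseline reduces variance
    grad = [0.0] * len(mean)
    for noise, r in zip(noises, rewards):
        advantage = r - baseline
        for i, n in enumerate(noise):
            # d/d(mean) log N(delta; mean, sigma^2 I) = (delta - mean) / sigma^2 = n / sigma
            grad[i] += advantage * n / sigma
    new_mean = [m + lr * g / samples for m, g in zip(mean, grad)]
    return new_mean, baseline


# Toy stand-in for the task reward: it peaks when the perturbation
# matches a hidden target vector (an assumption for demonstration only).
TARGET = [1.0, -1.0, 0.5, 0.0]


def toy_reward(delta):
    return -sum((d - t) ** 2 for d, t in zip(delta, TARGET))


rng = random.Random(0)
mean = [0.0] * len(TARGET)
initial_reward = toy_reward(mean)
for _ in range(200):
    mean, _ = reinforce_step(mean, toy_reward, rng)
final_reward = toy_reward(mean)
```

Because the update needs only reward values from forward passes, a loop of this shape could in principle run against an inference-only deployment. A PPO-style variant would add a clipped objective on top of the same sampling scheme.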

Why it matters for deployment

Preserving the compiled graph is the key insight. Deploying a large MLLM in production carries a significant upfront cost: compiling the computational graph for a specific GPU target typically takes 10–30 minutes for frontier-scale models. Any weight change invalidates that compiled graph. Fine-tuning methods that modify weights, even LoRA with its small adapter matrices, require full recompilation after adaptation. ART’s weight-frozen approach means:

Adaptation can happen post-deployment without a recompile cycle
Multiple tasks can be served from the same compiled backbone with different visual perturbations
The adaptation parameters are small enough to swap in per-request or per-tenant
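The per-tenant swapping described in the points above might look like the following sketch. Everything here is hypothetical: `PerturbationRegistry`, the tenant identifiers, and `frozen_backbone` (a fixed function standing in for the compiled model) are illustrative names, not part of ART or of any serving framework.

```python
class PerturbationRegistry:
    """Hypothetical per-tenant store of trained visual perturbations."""

    def __init__(self, num_tokens, dim):
        self.num_tokens = num_tokens
        self.dim = dim
        self._by_tenant = {}

    def register(self, tenant_id, delta):
        if len(delta) != self.num_tokens or any(len(row) != self.dim for row in delta):
            raise ValueError("perturbation shape does not match the visual token grid")
        self._by_tenant[tenant_id] = [list(row) for row in delta]  # defensive copy

    def tokens_for(self, tenant_id, projected):
        """Return the visual tokens to feed the shared backbone for this tenant."""
        delta = self._by_tenant.get(tenant_id)
        if delta is None:
            return [list(tok) for tok in projected]  # unknown tenant: base behavior
        return [[x + d for x, d in zip(tok, row)]
                for tok, row in zip(projected, delta)]


def frozen_backbone(visual_tokens):
    """Stand-in for the compiled model: a fixed function of its inputs."""
    return round(sum(sum(tok) for tok in visual_tokens), 6)


registry = PerturbationRegistry(num_tokens=2, dim=2)
registry.register("tenant-a", [[0.5, 0.0], [0.0, 0.0]])
registry.register("tenant-b", [[0.0, 0.0], [0.0, -1.0]])

projected = [[1.0, 1.0], [1.0, 1.0]]  # same image, same projector output
out_a = frozen_backbone(registry.tokens_for("tenant-a", projected))
out_b = frozen_backbone(registry.tokens_for("tenant-b", projected))
out_base = frozen_backbone(registry.tokens_for("unknown", projected))
```

The point of the design is that `frozen_backbone` is the same object for every request. Only the small per-tenant offsets change, so nothing about the deployed model has to be rebuilt when a tenant is added.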

For multi-tenant inference serving (one model, many fine-tuned “personalities” per customer), this is a meaningful architectural advantage.

The performance picture

The paper reports that ART achieves effective task-specific adaptation on multimodal reasoning benchmarks, with accuracy competitive with full fine-tuning approaches on tasks where the visual context is the primary task-differentiating signal. The strongest results are in domains where the visual input needs to carry problem-specific context (e.g., specialized diagram reading, domain-specific inspection tasks) rather than general image understanding.

The method underperforms full fine-tuning in cases where the language model’s priors themselves need to shift (pure language tasks, tasks requiring novel reasoning chains). This is the expected limitation: optimizing input representations can compensate only for distribution shifts in the visual domain; it cannot update the backbone’s knowledge.

Practitioner note

ART’s value proposition is sharpest for builders who are already serving a multimodal model in production with a compiled inference graph and want to add task-specific adaptation without a deployment interruption. The pattern it enables: train a set of visual perturbation parameters on your task data offline, then serve the base backbone + perturbations without touching the serving infrastructure. For standard fine-tuning, the equivalent would require a new deployment with new weights.

The honest scope limitation: this is a useful serving optimization, not a general fine-tuning replacement. If your task requires the language model to learn new factual knowledge or new reasoning patterns — rather than learning to interpret specialized visual inputs differently — you need weight updates. ART is a tool for “make this specific visual input distribution interpretable by a model that already knows how to reason” rather than “teach this model something it could not do before.”

Under-considered angle

The paper’s framing as “RL fine-tuning” may understate its relevance to test-time compute scaling. Visual soft-token perturbations are, structurally, a way to inject additional task context into the model at the input layer. The same mechanism could be used not just for fine-tuning but for test-time search: run RL during inference on a specific input, optimizing the visual perturbations to maximize model confidence or task reward on that single instance. This makes ART a potential building block for compute-optimal visual reasoning at inference time, spending extra compute on hard examples instead of a uniform budget on every input. That application is not discussed in the paper but falls out naturally from the architecture.