2026-06-08
llama.cpp Gains Native Video Input: FFmpeg-Subprocess Decoding Lands in the mtmd Stack
On June 8, 2026, llama.cpp merged PR #24269, adding native video input to its multimodal (mtmd) subsystem. Rather than linking FFmpeg, it shells out to an FFmpeg subprocess to decode frames, and a new lazy-bitmap API expands a single media marker per video into multiple frames at tokenization time.
What shipped
On June 8, 2026, llama.cpp merged PR #24269 (author ngxson, reviewed by ggerganov), adding native video input to its multimodal subsystem, mtmd. The change closes issue #18389, which had outlined the plan back in December 2025, and it fulfills one of the three roadmap items the maintainers laid out in their FOSDEM 2026 talk on multimodal support (the other two being text-to-speech and image generation). The feature appears in build b9562 and later.
Before this, running a vision-language model over a video on llama.cpp meant decoding the clip yourself and feeding frames in as a sequence of still images. Video is now a first-class input type for both the CLI and the server.
How it works
The notable engineering choices favor avoiding build and licensing pain over chasing peak throughput.
For decoding, llama.cpp does not link against libavcodec or bundle a codec. Instead it invokes FFmpeg as an external subprocess. The issue thread weighed this against loading libavcodec via dlopen and rejected static linking outright, noting it “will lead to a bad UX in practice.” Two motivations drive the choice: it sidesteps the licensing complications that come with proprietary codecs, and it keeps the build dependency-free. The trade-off is that users must install FFmpeg separately; it is not shipped with llama.cpp.
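To make the subprocess pattern concrete, here is a minimal Python sketch of the general idea: spawn FFmpeg, ask it for raw RGB frames on stdout, and slice the byte stream into fixed-size frames. This is an illustration, not llama.cpp's actual code. The function names, the default 1 fps sampling rate, and the 448x448 output size are all assumptions chosen for the example; only the FFmpeg command-line options themselves are standard.

```python
import shutil
import subprocess


def build_ffmpeg_cmd(video_path, fps=1.0, width=448, height=448, ffmpeg="ffmpeg"):
    """Build an argv list asking FFmpeg to write raw RGB24 frames to stdout.

    fps, width and height are illustrative defaults, not llama.cpp's values.
    """
    return [
        ffmpeg,
        "-hide_banner", "-loglevel", "error",  # keep stderr quiet unless something fails
        "-nostdin",                            # never wait for interactive input
        "-i", video_path,
        "-vf", f"fps={fps},scale={width}:{height}",  # resample frame rate, then resize
        "-f", "rawvideo",
        "-pix_fmt", "rgb24",                   # 3 bytes per pixel, no container
        "pipe:1",                              # write to stdout
    ]


def split_frames(raw, width, height):
    """Cut a raw RGB24 byte stream into per-frame byte strings.

    A trailing partial frame, if any, is dropped.
    """
    frame_size = width * height * 3
    usable = len(raw) - len(raw) % frame_size
    return [raw[i:i + frame_size] for i in range(0, usable, frame_size)]


def decode_video(video_path, fps=1.0, width=448, height=448):
    """Decode a video into raw frames using whatever FFmpeg is on PATH."""
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        raise RuntimeError("ffmpeg not found on PATH; it must be installed separately")
    cmd = build_ffmpeg_cmd(video_path, fps, width, height, ffmpeg)
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)
    return split_frames(proc.stdout, width, height)
```

The point of the design is visible in `decode_video`: the host program needs no codec headers or libraries at build time, only a binary lookup at run time, and a missing FFmpeg becomes a clear runtime error rather than a build failure.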
On the tokenization side, a new lazy-bitmap API (mtmd_bitmap_init_lazy) accepts a single <__media__> marker per video and expands it into multiple decoded image frames at tokenization time. Because the frame expansion happens inside the library, the server and CLI needed only minimal changes to gain full video support. For models that fuse frames (Qwen’s 3D convolution path, for example), the internal clip_image_f32 structure was given an extra dimension so multiple frames can be batched together.
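The lazy-expansion idea can be modeled in a few lines of Python. This is a conceptual sketch only: the real `mtmd_bitmap_init_lazy` is a C API whose signature is not reproduced here, and the class names, the `tokenize` function, and the `frames_per_chunk` parameter are hypothetical stand-ins. It shows one marker per video being replaced by several image chunks, with decoding deferred until tokenization, and frames optionally grouped for models that fuse them.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

MARKER = "<__media__>"


@dataclass
class LazyVideo:
    """Hypothetical placeholder: frames are produced only when load_frames is called."""
    load_frames: Callable[[], List[bytes]]


@dataclass
class TextChunk:
    text: str


@dataclass
class ImageChunk:
    frames: List[bytes]  # one frame, or several frames fused into one encoder input


Chunk = Union[TextChunk, ImageChunk]


def tokenize(prompt: str, media: List[LazyVideo], frames_per_chunk: int = 1) -> List[Chunk]:
    """Expand each marker into image chunks, decoding the video at this point.

    frames_per_chunk > 1 models frame fusion (an assumption for illustration).
    """
    parts = prompt.split(MARKER)
    if len(parts) - 1 != len(media):
        raise ValueError("number of markers must match number of media items")
    chunks: List[Chunk] = []
    for i, text in enumerate(parts):
        if text:
            chunks.append(TextChunk(text))
        if i < len(media):
            frames = media[i].load_frames()  # decoding happens here, not at construction
            for j in range(0, len(frames), frames_per_chunk):
                chunks.append(ImageChunk(frames[j:j + frames_per_chunk]))
    return chunks
```

Because the caller only ever supplies a prompt with one marker and a lazy handle, front ends such as a CLI or server do not need to know how many frames a clip produces, which matches the article's point about minimal changes outside the library.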
The result is model-agnostic: any existing vision model in the GGUF ecosystem can take video without per-model modification. The maintainers tested Qwen3-VL-2B on the CLI and Gemma-4-E4B in the web UI, using a 10-second clip from Blender’s open-movie short Agent 327. Near-term follow-ups scoped in the thread include --video-ffmpeg-path and --video-fps flags, plus audio input as a separate track.
Why it matters for local inference
| Aspect | Before | After PR #24269 |
|---|---|---|
| Video handling | Manual frame extraction, feed as image sequence | Single <__media__> marker, library decodes frames |
| Codec dependency | N/A (your problem) | FFmpeg subprocess, not bundled, not linked |
| Model changes | Per-pipeline glue | Model-agnostic; works with existing GGUF vision models |
| Surface | Ad hoc scripts | CLI and llama-server, first-class |
The original feature request framed the stakes plainly: video understanding is moving “from specialized proprietary APIs to local inference,” with the payoff being “privacy-preserving video analysis on consumer hardware without heavy Python dependencies.” That captures why a C/C++ runtime taking on video matters. The same workstation-class and unified-memory boxes that people already use for local text and image inference can now run temporal-reasoning workloads (event localization, motion and causality questions, embodied-agent perception) without a cloud API and without a PyTorch stack.
A subprocess-based decoder also fits the local-hardware reality well. Frame decode is cheap relative to the vision encoder and the LLM decode pass, so handing it to a battle-tested external binary costs little while dodging the build-system and licensing headaches that would otherwise stall adoption.
Practitioner note
If you want to try it, pull a build at or above b9562, make sure FFmpeg is installed and on your PATH, and pass a video to llama-mtmd-cli or llama-server with a vision model such as Qwen3-VL or Gemma-4. Watch your frame count: the lazy expander turns one marker into many image tokens, so a long clip at a high sampling rate can blow up context length and KV memory fast. Until --video-fps lands broadly, treat the effective frame rate as the lever that decides whether a clip fits in your context budget on a given box. And because FFmpeg runs as a separate subprocess, version-pin it in your environment so decode behavior stays reproducible across machines.
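A rough way to reason about that budget is simple arithmetic: frames are duration times sampling rate, and each frame costs some number of image tokens. The Python sketch below encodes that estimate. The per-frame token count is an assumption you must supply; it varies by model and input resolution and is not specified in the PR coverage, and models that fuse frames may spend fewer tokens than this upper-bound estimate suggests.

```python
import math


def frame_count(duration_s, fps):
    """Number of sampled frames for a clip; at least one frame is always taken."""
    return max(1, math.floor(duration_s * fps))


def video_tokens(duration_s, fps, tokens_per_frame):
    """Estimated image tokens for a clip, assuming every frame costs the same."""
    return frame_count(duration_s, fps) * tokens_per_frame


def max_fps_for_budget(duration_s, tokens_per_frame, token_budget):
    """Highest sampling rate whose frames fit in token_budget, or 0.0 if none fits."""
    max_frames = token_budget // tokens_per_frame
    if max_frames < 1:
        return 0.0
    return max_frames / duration_s


# Example with assumed numbers: a 10-second clip at 2 fps and 256 tokens per frame.
estimate = video_tokens(10, 2, 256)
```

With those assumed numbers the clip costs 20 frames times 256 tokens, so a 4096-token allowance for video would require dropping the sampling rate; `max_fps_for_budget(10, 256, 4096)` gives the ceiling for that case.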
Under-considered angle
The quietly important detail is the licensing posture. By refusing to link or bundle a codec and instead shelling out to whatever FFmpeg the user installed, llama.cpp keeps its own distribution clean of patent-encumbered decoders while still supporting the messy real-world formats people actually have. That is the kind of decision that rarely makes benchmark charts but determines whether a feature can ship at all in a permissively licensed, widely redistributed binary. It also subtly shifts the compliance burden to the operator’s environment, which is the correct place for it in a local-first tool, and it is a pattern other runtimes that have so far avoided video may end up copying.
Sources
- llama.cpp Releases (b9562, June 8 2026, video input feature), ggml-org/llama.cpp
- mtmd: plan to add video input support · Issue #18389 (closed by PR #24269), ggml-org/llama.cpp
- llama.cpp Adds Video Input via FFmpeg Subprocess, AI Weekly
- Feature Request: Add Video Modality Support (Qwen2.5-VL) via llama-mtmd-cli · Issue #17660, ggml-org/llama.cpp
- Multimodal support in llama.cpp - Achievements and Future Directions (FOSDEM 2026)