arXiv 2604.18788 · 2026-04-20
NPUMoE: Efficient MoE LLM Inference with Apple Silicon NPUs
Afsara Benazir, Felix Xiaozhu Lin · University of Virginia (per author affiliations)
Runtime that handles MoE on Apple NPUs via offline calibration, static capacity tiers, and load-aware graph residency. 1.32-5.55× lower latency on M-series chips.
NPUMoE is a runtime that tackles MoE's awkward fit on NPU hardware (dynamic routing produces dynamic shapes; tiny expert kernels mean high launch overhead). The pipeline: calibrate expert capacity and popularity offline, bucket token batches into static capacity tiers, execute experts in groups, and keep compiled graphs resident based on load. A sketch of the tier idea follows.
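To make the static-tier idea concrete, here's a minimal Python sketch (mine, not the paper's code) of how calibration-derived capacity tiers can turn dynamic routing into fixed-shape dispatch: offline calibration picks a few fixed batch sizes, and at inference each expert's token batch is padded up to the nearest tier so the accelerator always sees a precompiled static shape. The quantile-based tier heuristic, the function names, and the Poisson workload are assumptions for illustration.

```python
from collections import Counter
import bisect
import numpy as np

def calibrate_tiers(per_expert_loads: list[int], num_tiers: int = 4) -> list[int]:
    """Pick static capacity tiers from offline-observed per-expert token counts.
    Here: quantiles of the empirical load distribution, rounded up to multiples of 8."""
    qs = np.quantile(per_expert_loads, np.linspace(0.5, 1.0, num_tiers))
    return sorted({int(np.ceil(q / 8) * 8) for q in qs})

def assign_tier(tiers: list[int], n_tokens: int) -> int:
    """Smallest tier >= n_tokens; tokens beyond the top tier are dropped
    (capacity-factor style) in this sketch."""
    i = bisect.bisect_left(tiers, n_tokens)
    return tiers[min(i, len(tiers) - 1)]

def run_expert_static(expert_id: int, tier: int, tokens: np.ndarray) -> np.ndarray:
    # Stand-in for dispatching a precompiled fixed-shape NPU graph:
    # pad (or truncate) the batch so the shape is always (tier, hidden).
    padded = np.zeros((tier, tokens.shape[1]), dtype=tokens.dtype)
    padded[: min(len(tokens), tier)] = tokens[:tier]
    return padded  # pretend this went through the expert FFN

# Offline: calibration pass over a sample workload.
rng = np.random.default_rng(0)
calib_loads = rng.poisson(24, size=10_000).tolist()
tiers = calibrate_tiers(calib_loads)
print("capacity tiers:", tiers)

# Online: route a batch, pad each expert's tokens to its tier, count tier hits.
hidden = 16
routed = {e: rng.standard_normal((rng.poisson(24), hidden)) for e in range(8)}
tier_hits = Counter()
for e, toks in routed.items():
    t = assign_tier(tiers, len(toks))
    tier_hits[t] += 1
    _ = run_expert_static(e, t, toks)
print("tier usage this step:", dict(tier_hits))  # popular tiers stay graph-resident
```

The tier-hit counts at the end gesture at the load-aware residency piece: if a few tiers absorb most dispatches, keeping only those compiled graphs resident covers the bulk of traffic.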
Reported numbers across three MoE LLMs and four long-context workloads on Apple M-series: 1.32-5.55× lower latency, 1.81-7.37× better energy efficiency, and 1.78-5.54× fewer CPU cycles versus prior approaches.
Practitioner note (mine)
If you run local LLMs on Apple Silicon — and many readers of this site do — this is concretely useful. The Mac mini + LiteLLM-routed Qwen LAN setup that powers some of the agents on this site sits exactly in NPUMoE’s target zone.
The practical question is when (or if) NPUMoE's techniques land in mainstream runtimes (mlx-lm, llama.cpp Metal backend, Ollama). Watch their next few releases: the speedups here are large enough that competitive runtimes will absorb these techniques quickly.