2026-05-09

Litespark ternary-CPU inference (arXiv 2605.06485) — 9.2× TTFT, 52× throughput, ships pip package

Litespark replaces FP matmul with integer add/sub SIMD on ternary {-1,0,+1} weight networks. 9.2× TTFT, 52× throughput, 14× memory reduction. Pip-installable, HF-integrated.

Submitted to arXiv on May 7, 2026, Litespark-Inference (Dade, Morri, Rahat, Pal) replaces FP matrix multiplication with integer add/subtract SIMD kernels for ternary {-1, 0, +1} weight networks. The headline numbers versus a PyTorch baseline on Apple Silicon:

Metric	Improvement
Time-to-first-token	9.2× faster
Throughput	52× higher
Memory footprint	14× smaller

Gains on Intel and AMD x86 are comparable. The crucial detail: it ships as a pip-installable package that integrates with HuggingFace Transformers, so it is a working tool you can try today rather than just a paper.

Why ternary matters for inference

Ternary weights {-1, 0, +1} can be encoded in about 1.58 bits each (log2 3 ≈ 1.585) and computed using only integer add/subtract operations, with no multiplication required. This unlocks:

No FPU dependency — runs on the cheapest ARM cores
SIMD-friendly — integer add/sub vectorizes trivially on NEON / AVX2 / AVX-512
Lower energy per token — the dominant arithmetic op (multiply) is replaced by the cheapest one (add)
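To make the two claims above concrete, here is a minimal NumPy sketch of a ternary matrix-vector product that only adds and subtracts activations, plus a simple base-3 packing that stores five weights per byte (1.6 bits per weight, close to the 1.58-bit bound). The function names and the packing layout are illustrative assumptions; they are not taken from Litespark, whose actual kernels and storage format may differ.

```python
import math
import numpy as np

def ternary_matvec(w, x):
    """Compute W @ x for W in {-1, 0, +1} using only adds and subtracts."""
    # Add activations where the weight is +1, subtract where it is -1;
    # zero weights contribute nothing.
    pos = np.where(w == 1, x, 0).sum(axis=1)
    neg = np.where(w == -1, x, 0).sum(axis=1)
    return pos - neg

def pack5(trits):
    """Pack 5 ternary values into one byte (3**5 = 243 fits in 256).

    Hypothetical layout for illustration only.
    """
    code = 0
    for t in reversed(trits):
        code = code * 3 + (t + 1)  # map {-1, 0, +1} to base-3 digits {0, 1, 2}
    return code

def unpack5(code):
    """Inverse of pack5: recover the 5 ternary values from one byte."""
    out = []
    for _ in range(5):
        out.append(code % 3 - 1)
        code //= 3
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # random ternary weight matrix
x = rng.integers(-128, 128, size=8)    # activations in the int8 range

# The add/sub formulation matches an ordinary integer matmul.
assert np.array_equal(ternary_matvec(W, x), W @ x)
# Packing round-trips.
assert unpack5(pack5([-1, 0, 1, 1, -1])) == [-1, 0, 1, 1, -1]

print(f"bits per ternary weight (lower bound): {math.log2(3):.3f}")
print(f"bits per weight with 5-per-byte packing: {8 / 5:.1f}")
```

A production kernel would do the same select-and-accumulate with SIMD integer instructions rather than NumPy masks; the sketch only shows why no multiply is needed.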

The catch: full-precision LLMs aren’t natively ternary. Litespark targets ternary-trained networks specifically (BitNet-family models and successors). Using it for a dense Qwen / Llama is a separate pipeline — you need a ternary distillation step.

Why this matters for DGX Spark operators

DGX Spark has a 20-core Grace CPU alongside the GB10 Blackwell GPU. Most operators leave the Grace cores idle during inference. Litespark gives you a credible reason to use them:

Draft model for speculative decoding. If you’re running Qwen3.6-35B-A3B on the GPU, a ternary draft model on the Grace cores can produce candidate tokens in parallel, leaving the GPU as the verifier. This is the same pattern as MTP-1, but the draft runs on different silicon — no GPU contention.
Routing / classification offload. Small ternary classifiers (intent detection, content moderation, code-vs-prose routing) can run on the Grace side without stealing GPU cycles from your main serving loop.
Embedding generation. Ternary embedding models scale near-linearly with CPU cores. 20 Grace cores × Litespark kernels gives respectable throughput for RAG indexing alongside GPU serving.
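The draft-and-verify pattern from the first item can be sketched as a single greedy speculative decoding round. This is a toy illustration, not Litespark or any serving stack's API: `toy_draft` and `toy_target` are hypothetical stand-ins for the CPU draft model and the GPU verifier, and a real verifier would score all drafted positions in one batched forward pass rather than one call per token.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function

def speculative_step(draft: NextToken, target: NextToken,
                     ctx: List[int], k: int = 4) -> Tuple[List[int], int]:
    """One draft-then-verify round.

    Returns (tokens to append, number of draft tokens accepted).
    """
    # 1) The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(ctx + proposal))

    # 2) The target checks each proposed token in order.
    out: List[int] = []
    for tok in proposal:
        expected = target(ctx + out)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            out.append(expected)
            return out, len(out) - 1
        out.append(tok)

    # 3) Everything accepted: the target contributes one bonus token.
    out.append(target(ctx + out))
    return out, k

# Hypothetical toy models over tokens 0..9: the target counts upward,
# the draft agrees except that it errs right after token 3.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10

def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

print(speculative_step(toy_draft, toy_target, [0], k=4))  # ([1, 2, 3, 4], 3)
```

The output is always identical to what the target alone would have produced greedily; the draft only changes how many target passes are needed to get there.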

What to do

pip install litespark-inference

# Try a ternary draft model alongside your Qwen3.6 verifier
python -c "
from litespark import LitesparkLM
draft = LitesparkLM.from_pretrained('bitnet-b1.58-3b')
# Use as draft model for spec-dec against your main Spark Qwen3.6 verifier
"

A 30-minute experiment is the right scope: measure the draft acceptance rate on your actual workload mix. If acceptance reaches 70% or more, the spec-dec lever is worth wiring into your serving stack. If it is below 50%, the ternary draft and the FP target are too dissimilar, and you are better off with a homologous draft model.
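One way to score that experiment, sketched under assumptions: log an (accepted, proposed) pair for every verify pass, compute the overall acceptance rate, and apply the 70% / 50% thresholds from the paragraph above. The helper names are hypothetical, the "inconclusive" label for the middle band is an added assumption, and the expected-tokens formula is the standard speculative decoding estimate that treats each draft token as accepted independently with probability alpha.

```python
def acceptance_rate(rounds):
    """rounds: list of (accepted, proposed) pairs, one per verify pass."""
    accepted = sum(a for a, _ in rounds)
    proposed = sum(p for _, p in rounds)
    return accepted / proposed if proposed else 0.0

def expected_tokens_per_verify(alpha, k):
    """Expected tokens emitted per target pass when drafting k tokens.

    Assumes independent per-token acceptance with probability alpha:
    (1 - alpha**(k + 1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def verdict(alpha):
    """Apply the thresholds from the text; the middle band is an assumption."""
    if alpha >= 0.70:
        return "worth wiring in"
    if alpha < 0.50:
        return "try a homologous draft"
    return "inconclusive"

# Example log: three verify passes with k = 4 drafted tokens each.
log = [(3, 4), (4, 4), (1, 4)]
alpha = acceptance_rate(log)
print(f"acceptance rate: {alpha:.2f}")
print(f"expected tokens per verify pass: {expected_tokens_per_verify(alpha, 4):.2f}")
print(f"verdict: {verdict(alpha)}")
```

Acceptance rate alone does not guarantee a speedup: the draft's own latency on the CPU also has to be small relative to a target pass, so it is worth timing both sides in the same run.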

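This pattern also bridges to disaggregated serving: ternary draft on CPU → NVFP4 verifier on GPU. The two phases run on separate compute units and never compete for the same cores. On DGX Spark, however, the CPU and GPU share one unified memory pool, so they can still contend for memory bandwidth.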
