2026-05-09
Litespark ternary-CPU inference (arXiv 2605.06485) — 9.2× TTFT, 52× throughput, ships pip package
Litespark replaces FP matmul with integer add/sub SIMD on ternary {-1,0,+1} weight networks. 9.2× TTFT, 52× throughput, 14× memory reduction. Pip-installable, HF-integrated.
Submitted to arXiv on May 7, 2026, Litespark-Inference (Dade, Morri, Rahat, Pal) replaces FP matrix multiplication with integer add/subtract SIMD kernels for ternary {-1, 0, +1} weight networks. The headline numbers versus a PyTorch baseline on Apple Silicon:
| Metric | Improvement |
|---|---|
| Time-to-first-token | 9.2× faster |
| Throughput | 52× higher |
| Memory footprint | 14× smaller |
The authors report comparable gains on Intel and AMD x86. The crucial detail: it ships as a pip-installable package that integrates with HuggingFace Transformers, so it is a working tool you can try today rather than only a paper.
Why ternary matters for inference
Ternary weights {-1, 0, +1} can be encoded in about 1.58 bits each (log2(3) ≈ 1.585) and computed using only integer add/subtract operations, with no multiplication required. This unlocks:
- No FPU dependency — runs on the cheapest ARM cores
- SIMD-friendly — integer add/sub vectorizes trivially on NEON / AVX2 / AVX-512
- Lower energy per token — the dominant arithmetic op (multiply) is replaced by the cheapest one (add)
The catch: full-precision LLMs aren’t natively ternary. Litespark targets ternary-trained networks specifically (BitNet-family models and successors). Using it for a dense Qwen / Llama is a separate pipeline — you need a ternary distillation step.
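To make the "no multiplication" point concrete, here is a minimal NumPy reference sketch of a ternary matrix-vector product built from sums and differences only. It is not Litespark's kernel (the real package uses SIMD intrinsics); the function name `ternary_matvec` and the 5-weights-per-byte packing figure are illustrative assumptions, not details taken from the paper.

```python
import math
import numpy as np

def ternary_matvec(w, x):
    """Compute w @ x for weights in {-1, 0, +1} using only adds and subtracts."""
    w = np.asarray(w)
    x = np.asarray(x)
    out = np.zeros(w.shape[0], dtype=np.int32)
    for i in range(w.shape[0]):
        plus = x[w[i] == 1].sum(dtype=np.int32)    # add activations where weight is +1
        minus = x[w[i] == -1].sum(dtype=np.int32)  # collect activations where weight is -1
        out[i] = plus - minus                      # zero weights are skipped entirely
    return out

# Compare against an ordinary integer matmul on random int8 data.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 16)).astype(np.int8)
x = rng.integers(-128, 128, size=16).astype(np.int8)
assert np.array_equal(ternary_matvec(W, x), W.astype(np.int32) @ x.astype(np.int32))

# Storage cost: the information-theoretic bound versus one simple packing scheme.
bits_per_weight = math.log2(3)  # about 1.585 bits per ternary weight
packed_bits = 8 / 5             # 5 ternary digits fit in one byte because 3**5 = 243 <= 256
```

A vectorized kernel would do the same thing across many lanes at once: select, add, subtract. That is why the operation maps well onto NEON and AVX integer units.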
Why this matters for DGX Spark operators
DGX Spark has a 20-core Grace CPU alongside the GB10 Blackwell GPU. Most operators leave the Grace cores idle during inference. Litespark gives you a credible reason to use them:
- Draft model for speculative decoding. If you’re running Qwen3.6-35B-A3B on the GPU, a ternary draft model on the Grace cores can produce candidate tokens in parallel, leaving the GPU as the verifier. This is the same pattern as MTP-1, but the draft runs on different silicon: no GPU contention.
- Routing / classification offload. Small ternary classifiers (intent detection, content moderation, code-vs-prose routing) can run on the Grace side without stealing GPU cycles from your main serving loop.
- Embedding generation. Ternary embedding models scale near-linearly with CPU cores. 20 Grace cores × Litespark kernels gives respectable throughput for RAG indexing alongside GPU serving.
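The draft-and-verify pattern from the first item can be sketched as a greedy loop. This is a toy illustration under stated assumptions: `draft` and `target` are hypothetical callables that return the next greedy token for a prefix, standing in for the CPU-side ternary model and the GPU-side verifier. It does not use the Litespark API or any serving framework.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next greedy token

def speculative_step(draft: NextToken, target: NextToken,
                     prefix: List[int], k: int) -> Tuple[List[int], int]:
    """One greedy draft-and-verify round. Returns (new tokens, accepted draft count)."""
    # 1. The draft proposes k tokens autoregressively (the CPU side in the text).
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposed position. A real verifier would score
    #    all k positions in one batched forward pass; this loop is for clarity.
    out: List[int] = []
    ctx = list(prefix)
    accepted = 0
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            out.append(expected)  # first mismatch: keep the target's token and stop
            return out, accepted
        out.append(tok)
        ctx.append(tok)
        accepted += 1
    out.append(target(ctx))       # every draft token accepted: one bonus token
    return out, accepted

# Toy models: the target counts upward mod 10; the draft is wrong whenever
# the correct next token would be 5.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10

def toy_draft(ctx: List[int]) -> int:
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 5 else nxt

tokens, accepted = speculative_step(toy_draft, toy_target, [1], k=4)
acceptance_rate = accepted / 4  # fraction of drafted tokens the target agreed with
```

The output always matches what the target alone would have produced under greedy decoding; the draft only changes how many target passes are needed to get there.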
What to do
pip install litespark-inference
# Try a ternary draft model alongside your Qwen3.6 verifier
python -c "
from litespark import LitesparkLM
draft = LitesparkLM.from_pretrained('bitnet-b1.58-3b')
# Use as draft model for spec-dec against your main Spark Qwen3.6 verifier
"
A 30-minute experiment is appropriate scope: measure draft acceptance rate on your actual workload mix. If acceptance hits 70%+, the spec-dec lever is worth wiring into your serving stack. If it’s below 50%, the ternary draft and the FP target are too dissimilar — you’re better off with a homologous draft model.
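This pattern also bridges to disaggregated serving: ternary draft on CPU → NVFP4 verifier on GPU. The two phases run on separate compute units, but DGX Spark's CPU and GPU share one unified memory pool, so they can still compete for memory bandwidth. Measure under combined load before committing.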
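To put the 70% and 50% acceptance thresholds in perspective, the sketch below computes the expected number of tokens emitted per verifier pass. It uses the standard speculative-decoding estimate (1 - α^(k+1)) / (1 - α), which assumes each drafted token is accepted independently with probability α. The function name and the draft length k = 4 are illustrative assumptions, and draft latency is ignored.

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """Expected tokens produced per target pass, given per-token acceptance
    probability alpha and draft length k (independent-acceptance assumption)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for alpha in (0.5, 0.7):
    print(f"alpha={alpha:.1f}, k=4 -> {expected_tokens_per_verify(alpha, 4):.2f} tokens per verify")
```

The real speedup is lower than this figure because the draft's own latency and any shared-resource contention are not modeled, which is why measuring on your own workload matters.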