2026-04-30
LiteLLM + Claude Code on DGX Spark — LAN serving setup and protocol translation
Route Claude Code API calls to a self-hosted Qwen3 model on DGX Spark via a LiteLLM proxy. Covers proxy config, model alias mapping, the vLLM serving command, and the latency tradeoffs vs the cloud API.
Claude Code speaks the Anthropic Messages API. DGX Spark runs vLLM with an OpenAI-compatible endpoint. LiteLLM bridges the gap: it translates Anthropic API calls into OpenAI API calls on the fly, letting Claude Code treat your local Qwen3 model as if it were a Claude model.
Architecture
```
Claude Code (Anthropic API)
        ↓
LiteLLM proxy (localhost:4000)
  • maps claude-* aliases → Qwen3 model IDs
  • translates message format + tool-use schema
        ↓
vLLM (http://192.168.68.155:8888/v1)
  • serving Qwen3.6-35B-A3B-NVFP4 on GB10
```
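Concretely, here is roughly what each hop sees once everything below is running. The endpoints and model IDs are the ones used in the configs in this post; the payloads are illustrative sketches, not captured traffic.

```bash
# 1) Claude Code -> LiteLLM: Anthropic Messages API
curl -s http://localhost:4000/v1/messages \
  -H "x-api-key: sk-local-any-string" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 256,
       "messages": [{"role": "user", "content": "Write a haiku about GPUs."}]}'

# 2) LiteLLM -> vLLM: the same request, rewritten as OpenAI Chat Completions
curl -s http://192.168.68.155:8888/v1/chat/completions \
  -H "Authorization: Bearer none" \
  -H "content-type: application/json" \
  -d '{"model": "Intel/Qwen3.6-35B-A3B-int4-AutoRound", "max_tokens": 256,
       "messages": [{"role": "user", "content": "Write a haiku about GPUs."}]}'
```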
LiteLLM config
```yaml
# litellm_config.yaml
model_list:
  - model_name: claude-sonnet-4-20250514
    litellm_params:
      model: openai/Intel/Qwen3.6-35B-A3B-int4-AutoRound
      api_base: http://192.168.68.155:8888/v1
      api_key: none
  - model_name: claude-3-5-sonnet-20241022
    litellm_params:
      model: openai/Intel/Qwen3.6-35B-A3B-int4-AutoRound
      api_base: http://192.168.68.155:8888/v1
      api_key: none
  - model_name: claude-3-opus-20240229
    litellm_params:
      model: openai/Intel/Qwen3.5-27B-int4-AutoRound
      api_base: http://192.168.68.155:8888/v1
      api_key: none

litellm_settings:
  drop_params: true
  set_verbose: false
```
Start the proxy:
```bash
litellm --config litellm_config.yaml --port 4000
```
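To sanity-check the proxy before pointing Claude Code at it, list the exposed aliases (this assumes no master_key is set in the config, so any bearer token is accepted); for an end-to-end check, replay the /v1/messages request from the sketch above.

```bash
# The aliases from model_list should come back as model IDs
curl -s http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-local-any-string" | python3 -m json.tool
```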
Claude Code configuration
In your Claude Code settings (or via claude config), set:
```json
{
  "apiBaseUrl": "http://localhost:4000",
  "apiKey": "sk-local-any-string-works"
}
```
Or use the shell launcher pattern to avoid editing global config:
```bash
#!/usr/bin/env bash
# claude_local.sh
export ANTHROPIC_API_KEY="sk-local-any-string"
export ANTHROPIC_BASE_URL="http://localhost:4000"
claude "$@"
```
NVFP4 model via vLLM
For best throughput on GB10, serve Qwen3.6-35B-A3B-NVFP4 instead of the int4 AutoRound variant:
```bash
docker run -d --gpus all --ipc host --shm-size 64gb \
  -p 8888:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:cu130-nightly \
  RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --served-model-name Intel/Qwen3.6-35B-A3B-int4-AutoRound \
  --host 0.0.0.0 --port 8000 \
  --dtype bfloat16 --gpu-memory-utilization 0.9 \
  --max-model-len 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --moe-backend=flashinfer_cutlass
```
The --served-model-name can be any string; set it to the model ID that LiteLLM forwards upstream (the part after openai/ in the model: field of litellm_config.yaml), which is why the NVFP4 weights are served here under the Intel/Qwen3.6-35B-A3B-int4-AutoRound name.
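To confirm the rename took effect, query vLLM's OpenAI-compatible model listing directly; the returned id should be the exact name LiteLLM forwards.

```bash
curl -s http://192.168.68.155:8888/v1/models | python3 -m json.tool
# expect: "id": "Intel/Qwen3.6-35B-A3B-int4-AutoRound"
```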
Latency reality check
| Scenario | TTFT (time to first token) |
|---|---|
| Claude Sonnet 4 (cloud) | 0.8–2 s |
| Qwen3.6-35B NVFP4 + MTP (local) | 0.15–0.4 s |
| Qwen3.6-35B NVFP4, no MTP (local) | 0.25–0.6 s |
| Qwen3.5-27B int4 (local) | 0.1–0.25 s |
Local TTFT is lower than cloud when you’re on the same LAN — no TLS handshake, no geographic routing. The tradeoff is generation quality: Qwen3.6-35B in NVFP4 is competitive with Sonnet 3.5 on coding tasks but falls short on complex reasoning that requires Opus-class capability.
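For a rough reproduction of the local numbers, curl's time_starttransfer (time to the first response byte) approximates TTFT on a streaming request; treat this as a smoke test, not a benchmark.

```bash
# Crude TTFT probe through the proxy: time until the first streamed byte arrives.
curl -s -o /dev/null \
  -w "TTFT (first byte): %{time_starttransfer}s\n" \
  http://localhost:4000/v1/messages \
  -H "x-api-key: sk-local-any-string" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 128, "stream": true,
       "messages": [{"role": "user", "content": "Explain mmap in one paragraph."}]}'
```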
Known issues
- Tool-use schema differences: Qwen3's tool-calling schema differs subtly from Anthropic's. LiteLLM's translation handles most cases, but complex nested tool calls (multiple tool results in one turn) occasionally fail. If you hit `Invalid tool_use block` errors, simplify the tool call and make sure `drop_params: true` is set in your LiteLLM config (it is in the config above).
- Streaming with large contexts: at 128K+ context lengths, LiteLLM's streaming buffer can back-pressure under high concurrency. Single-user usage is stable at any context length.
- Model warmup: the first request after server start is slow while the serving path warms up (CUDA graph compilation). Build a keep-alive ping into your launcher script; a minimal sketch follows this list.
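A minimal keep-alive sketch, assuming the proxy from above on localhost:4000; the 60-second interval and the one-token prompt are arbitrary choices, not tuned values.

```bash
#!/usr/bin/env bash
# keepalive.sh: send a tiny request every 60 s so an idle server
# never hands the warmup cost to your first real prompt.
while true; do
  curl -s -o /dev/null http://localhost:4000/v1/messages \
    -H "x-api-key: sk-local-any-string" \
    -H "anthropic-version: 2023-06-01" \
    -H "content-type: application/json" \
    -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 1,
         "messages": [{"role": "user", "content": "ping"}]}'
  sleep 60
done
```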