2026-04-30
LiteLLM + Claude Code on DGX Spark — LAN serving setup and protocol translation
Route Claude Code API calls to a self-hosted Qwen3 model on DGX Spark via a LiteLLM proxy. Covers proxy config, model alias mapping, the vLLM serving command, and the latency tradeoffs vs the cloud API.
Claude Code speaks the Anthropic Messages API. DGX Spark runs vLLM with an OpenAI-compatible endpoint. LiteLLM bridges the gap: it translates Anthropic API calls into OpenAI API calls on the fly, letting Claude Code treat your local Qwen3 model as if it were a Claude model.
Architecture
```
Claude Code (Anthropic API)
        ↓
LiteLLM proxy (localhost:4000)
  • maps claude-* aliases → Qwen3 model IDs
  • translates message format + tool-use schema
        ↓
vLLM (http://192.168.68.155:8888/v1)
  • serving Qwen3.6-35B-A3B-NVFP4 on GB10
```
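Concretely, here is roughly what each hop sees once everything below is running. The endpoints and model IDs are the ones used in the configs in this post; the payloads are illustrative sketches, not captured traffic.

```bash
# 1) Claude Code -> LiteLLM: Anthropic Messages API
curl -s http://localhost:4000/v1/messages \
  -H "x-api-key: sk-local-any-string" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 256,
       "messages": [{"role": "user", "content": "Write a haiku about GPUs."}]}'

# 2) LiteLLM -> vLLM: the same request, rewritten as OpenAI Chat Completions
curl -s http://192.168.68.155:8888/v1/chat/completions \
  -H "Authorization: Bearer none" \
  -H "content-type: application/json" \
  -d '{"model": "Intel/Qwen3.6-35B-A3B-int4-AutoRound", "max_tokens": 256,
       "messages": [{"role": "user", "content": "Write a haiku about GPUs."}]}'
```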
LiteLLM config
```yaml
# litellm_config.yaml
model_list:
  - model_name: claude-sonnet-4-20250514
    litellm_params:
      model: openai/Intel/Qwen3.6-35B-A3B-int4-AutoRound
      api_base: http://192.168.68.155:8888/v1
      api_key: none
  - model_name: claude-3-5-sonnet-20241022
    litellm_params:
      model: openai/Intel/Qwen3.6-35B-A3B-int4-AutoRound
      api_base: http://192.168.68.155:8888/v1
      api_key: none
  - model_name: claude-3-opus-20240229
    litellm_params:
      model: openai/Intel/Qwen3.5-27B-int4-AutoRound
      api_base: http://192.168.68.155:8888/v1
      api_key: none

litellm_settings:
  drop_params: true
  set_verbose: false
```
Start the proxy:
```bash
litellm --config litellm_config.yaml --port 4000
```
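To sanity-check the proxy before pointing Claude Code at it, list the exposed aliases (this assumes no master_key is set in the config, so any bearer token is accepted); for an end-to-end check, replay the /v1/messages request from the sketch above.

```bash
# The aliases from model_list should come back as model IDs
curl -s http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-local-any-string" | python3 -m json.tool
```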
Claude Code configuration
In your Claude Code settings (or via claude config), set:
```json
{
  "apiBaseUrl": "http://localhost:4000",
  "apiKey": "sk-local-any-string-works"
}
```
Or use the shell launcher pattern to avoid editing global config:
```bash
#!/usr/bin/env bash
# claude_local.sh
export ANTHROPIC_API_KEY="sk-local-any-string"
export ANTHROPIC_BASE_URL="http://localhost:4000"
claude "$@"
```
NVFP4 model via vLLM
For best throughput on GB10, serve Qwen3.6-35B-A3B-NVFP4 instead of the int4 AutoRound variant:
```bash
docker run -d --gpus all --ipc host --shm-size 64gb \
  -p 8888:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:cu130-nightly \
  RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --served-model-name Intel/Qwen3.6-35B-A3B-int4-AutoRound \
  --host 0.0.0.0 --port 8000 \
  --dtype bfloat16 --gpu-memory-utilization 0.9 \
  --max-model-len 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --moe-backend=flashinfer_cutlass
```
The --served-model-name can be any string; set it to the model ID that LiteLLM forwards upstream (the part after openai/ in the model: field of litellm_config.yaml), which is why the NVFP4 weights are served here under the Intel/Qwen3.6-35B-A3B-int4-AutoRound name.
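To confirm the rename took effect, query vLLM's OpenAI-compatible model listing directly; the returned id should be the exact name LiteLLM forwards.

```bash
curl -s http://192.168.68.155:8888/v1/models | python3 -m json.tool
# expect: "id": "Intel/Qwen3.6-35B-A3B-int4-AutoRound"
```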
Latency reality check
| Scenario | TTFT (time to first token) |
|---|---|
| Claude Sonnet 4 (cloud) | 0.8–2 s |
| Qwen3.6-35B NVFP4 + MTP (local) | 0.15–0.4 s |
| Qwen3.6-35B NVFP4, no MTP (local) | 0.25–0.6 s |
| Qwen3.5-27B int4 (local) | 0.1–0.25 s |
Local TTFT is lower than cloud when you’re on the same LAN — no TLS handshake, no geographic routing. The tradeoff is generation quality: Qwen3.6-35B in NVFP4 is competitive with Sonnet 3.5 on coding tasks but falls short on complex reasoning that requires Opus-class capability.
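For a rough reproduction of the local numbers, curl's time_starttransfer (time to the first response byte) approximates TTFT on a streaming request; treat this as a smoke test, not a benchmark.

```bash
# Crude TTFT probe through the proxy: time until the first streamed byte arrives.
curl -s -o /dev/null \
  -w "TTFT (first byte): %{time_starttransfer}s\n" \
  http://localhost:4000/v1/messages \
  -H "x-api-key: sk-local-any-string" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 128, "stream": true,
       "messages": [{"role": "user", "content": "Explain mmap in one paragraph."}]}'
```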
Known issues
- Tool-use schema differences: Qwen3's tool-calling schema differs subtly from Anthropic's. LiteLLM's translation handles most cases, but complex nested tool calls (multiple tool results in one turn) occasionally fail. If you hit `Invalid tool_use block` errors, simplify the tool call and make sure `drop_params: true` is set in your LiteLLM config (it is in the config above).
- Streaming with large contexts: at 128K+ context lengths, LiteLLM's streaming buffer can back-pressure under high concurrency. Single-user usage is stable at any context length.
- Model warmup: the first request after server start is slow while the serving path warms up (CUDA graph compilation). Build a keep-alive ping into your launcher script; a minimal sketch follows this list.
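A minimal keep-alive sketch, assuming the proxy from above on localhost:4000; the 60-second interval and the one-token prompt are arbitrary choices, not tuned values.

```bash
#!/usr/bin/env bash
# keepalive.sh: send a tiny request every 60 s so an idle server
# never hands the warmup cost to your first real prompt.
while true; do
  curl -s -o /dev/null http://localhost:4000/v1/messages \
    -H "x-api-key: sk-local-any-string" \
    -H "anthropic-version: 2023-06-01" \
    -H "content-type: application/json" \
    -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 1,
         "messages": [{"role": "user", "content": "ping"}]}'
  sleep 60
done
```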