Running Qwen 3.5 35B on an NVIDIA Jetson AGX Orin with OpenClaw
A real-world tutorial — every command included
What I Built
An NVIDIA Jetson AGX Orin 64GB running Qwen 3.5 35B-A3B (MoE, custom quantized) as a local AI model provider, fully integrated into an OpenClaw agent stack. My Mac calls the Jetson over LAN using a simple alias (agx) and gets 35B-level reasoning back at ~30 tok/sec — $0/month, 60 watts.
Hardware & Specs
Device: NVIDIA Jetson AGX Orin 64GB
OS: Ubuntu, JetPack R36.4.7 (aarch64)
CUDA: 12.6
RAM: 64GB unified memory (CPU + GPU share it)
Storage: 3.7TB NVMe
Power: ~60W under load
Model:
Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16 (w4a16 quantized, served via vLLM)
Why MoE matters
Qwen 3.5 35B-A3B is a Mixture-of-Experts model: 35B total parameters, but only ~3B active per token at inference. Since token generation is memory-bandwidth-bound, that's roughly an order of magnitude less weight traffic per token than a dense model of the same size. The dense 27B Qwen variant is slower on the same hardware. On edge hardware, MoE wins.
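The back-of-the-envelope arithmetic, as a sketch (the ~3B-active figure comes from the model name, and 0.5 bytes per weight for w4a16 is an approximation that ignores quantization scales and activations):

```python
# Rough per-token weight traffic: decoding streams the active weights from
# memory for every generated token, so bytes-per-token tracks active params.
BYTES_PER_PARAM = 0.5  # w4a16: 4-bit weights ~= 0.5 bytes each (approximation)

dense_params = 35e9  # a dense 35B model touches every weight per token
moe_active = 3e9     # Qwen 3.5 35B-A3B activates ~3B params per token

dense_gb_per_token = dense_params * BYTES_PER_PARAM / 1e9
moe_gb_per_token = moe_active * BYTES_PER_PARAM / 1e9

print(f"dense: {dense_gb_per_token:.1f} GB/token, MoE: {moe_gb_per_token:.1f} GB/token")
print(f"bandwidth ratio: ~{dense_params / moe_active:.0f}x")
```

At a fixed memory bandwidth, that ratio translates almost directly into tokens per second, which is why the MoE build outruns the dense 27B on the same board.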
What NOT to Do: The Ollama Trap
The standard Ollama build of Qwen isn't optimized for Orin's CUDA architecture. If you want real performance out of the Jetson, skip Ollama and use a custom quantized build served via vLLM with CUDA acceleration.
NVIDIA's Jetson AI Lab documents this model officially — that's where I found it: 👉 jetson-ai-lab.com/models/qwen3-5-35b-a3b
The specific quantized build I used is the w4a16 variant optimized for Orin's architecture.
vLLM serves it with an OpenAI-compatible API on port 8000.
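For reference, launching a model like this with vLLM looks roughly like the following. This is a sketch, not the exact command from this setup: the flag values (host binding, context length, memory utilization) are assumptions you'd tune for the Orin.

```shell
# Sketch of a vLLM launch (flag values are assumptions, not from this setup)
vllm serve Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 16000 \
  --gpu-memory-utilization 0.90
```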
Step 1: Verify the Model is Running
SSH into your Jetson and confirm vLLM is serving:
# Check what's listening on port 8000
ss -tlnp | grep 8000
# Should show: LISTEN 0 2048 0.0.0.0:8000 0.0.0.0:*
# Verify the model API is responding
curl -s http://localhost:8000/v1/models | python3 -m json.tool

Test a completion:
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16",
"messages": [{"role": "user", "content": "Say hello in one sentence."}],
"max_tokens": 100
}' | python3 -c "import json,sys; r=json.load(sys.stdin); print(r['choices'][0]['message']['content'])"

Step 2: Configure OpenClaw on the Jetson
Check what OpenClaw sees:
openclaw models list

Set the local model as the default:
openclaw config set agents.defaults.model qwen
openclaw gateway restart

Gotcha: openclaw config set model qwen doesn't work — model is not a root-level key. The correct path is agents.defaults.model.
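After that set command, the relevant slice of openclaw.json should look something like this (a sketch with surrounding keys omitted; "qwen" is whatever model name your install exposes):

```json
{
  "agents": {
    "defaults": {
      "model": "qwen"
    }
  }
}
```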
Step 3: Clean Up Stale Model Entries
If you have leftover model entries (e.g., from an old Ollama provider or a stale alias), remove them:
# View current models config
openclaw config get models
# Remove stale provider (e.g. ollama pointing at port 11434 that isn't running)
python3 -c "
import json
with open('/home/agx/.openclaw/openclaw.json') as f:
cfg = json.load(f)
cfg['models']['providers'].pop('ollama', None)
with open('/home/agx/.openclaw/openclaw.json', 'w') as f:
json.dump(cfg, f, indent=2)
print('Done')
"
# Remove stale model alias from agents config
python3 -c "
import json
with open('/home/agx/.openclaw/openclaw.json') as f:
cfg = json.load(f)
cfg['agents']['defaults']['models'].pop('kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16', None)
with open('/home/agx/.openclaw/openclaw.json', 'w') as f:
json.dump(cfg, f, indent=2)
print('Done')
"
openclaw gateway restart

Gotcha: There's no openclaw models remove command. You have to edit the JSON directly.
Note on Ollama errors: OpenClaw has built-in Ollama auto-discovery that tries port 11434 at startup. If Ollama isn't running, you'll see Failed to discover Ollama models: TypeError: fetch failed in the logs. This is cosmetic — it doesn't affect functionality. There's no config key to disable it in 2026.3.2.
Step 4: Allow Remote Exec on the Jetson Node
By default, agx requires approval for every system.run command from a remote session. To allow your main machine to run commands freely:
# Set security to full (no restrictions — fine for a trusted local node)
openclaw config set tools.exec.ask off
openclaw config set tools.exec.security full
openclaw gateway restart

Gotcha: security=allowlist without defined safeBins will give you allowlist miss errors. Use full for a local trusted node, or define your safeBins list explicitly.
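If you'd rather keep allowlist mode, the safeBins list needs entries before anything will pass. The key path below is a guess extrapolated from the tools.exec.* paths used above (hypothetical — verify it against your OpenClaw version):

```shell
# Hypothetical key path: tools.exec.safeBins is assumed, not confirmed
openclaw config set tools.exec.security allowlist
openclaw config set tools.exec.safeBins '["ls", "cat", "grep", "python3"]'
openclaw gateway restart
```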
Step 5: Add the Jetson as a Provider on Your Main Machine
First, confirm your Mac can reach the Jetson's model server:
# Run this on your Mac
curl -s http://YOUR_JETSON_IP:8000/v1/models | python3 -m json.tool | head -10

Then add it as a custom provider in OpenClaw (run on your Mac, or use the gateway tool):
openclaw config set models.providers.agx-qwen '{
"baseUrl": "http://YOUR_JETSON_IP:8000/v1",
"apiKey": "none",
"api": "openai-completions",
"models": [{
"id": "Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16",
"name": "AGX Qwen3.5-35B (local)",
"input": ["text"],
"contextWindow": 16000,
"maxTokens": 4096
}]
}'

Replace YOUR_JETSON_IP with your Jetson's actual LAN IP:
# Check Jetson IP
hostname -I | awk '{print $1}'

Step 6: Add a Model Alias
# Run on your Mac
openclaw models aliases add agx agx-qwen/Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16

Now you can reference it anywhere as agx.
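Scripts on the Mac can also hit the endpoint directly with the Python standard library; no SDK is needed since vLLM speaks the OpenAI chat-completions schema. A minimal sketch (YOUR_JETSON_IP and the helper function names are placeholders of mine):

```python
import json
import urllib.request

JETSON_URL = "http://YOUR_JETSON_IP:8000/v1/chat/completions"
MODEL_ID = "Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16"

def build_payload(prompt: str, max_tokens: int = 300) -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a chat-completions response."""
    return response["choices"][0]["message"]["content"]

def ask(prompt: str) -> str:
    """POST a prompt to the Jetson and return the model's reply."""
    req = urllib.request.Request(
        JETSON_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))

# Example: print(ask("Say hello in one sentence."))
```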
Step 7: Test It from Your Mac
curl -s http://YOUR_JETSON_IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16",
"messages": [{"role": "user", "content": "Write a one-paragraph story about a robot."}],
"max_tokens": 300
}' | python3 -c "import json,sys; r=json.load(sys.stdin); print(r['choices'][0]['message']['content'])"

GPU Health Check
# Real-time stats (run on Jetson)
tegrastats
# Or one-shot summary
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw --format=csv,noheader,nounits
# System overview
free -h # RAM
df -h / # Disk
uptime # Load average

Healthy idle numbers on the AGX 64GB running Qwen 3.5 35B-A3B:
RAM: ~56GB used (model loaded in unified memory)
GPU temp: ~46°C
GPU utilization: ~10% idle, spikes during inference
Power draw: ~3.5W idle
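tegrastats emits one dense line per interval, which is awkward for logging. A small parser makes it friendlier; the sample line below is illustrative only (the field layout varies by JetPack version, so treat the regexes as a sketch to adapt):

```python
import re

# Illustrative tegrastats line (format is an assumption; check your JetPack)
SAMPLE = ("RAM 56012/63462MB (lfb 6x4MB) SWAP 0/31731MB (cached 0MB) "
          "CPU [2%@729,1%@729,0%@729,0%@729] GR3D_FREQ 10% "
          "tj@46.2C VDD_IN 3545mW/3545mW")

def parse_tegrastats(line: str) -> dict:
    """Extract RAM usage, GPU utilization, and power draw from one line."""
    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    gpu = re.search(r"GR3D_FREQ (\d+)%", line)
    power = re.search(r"VDD_IN (\d+)mW", line)
    return {
        "ram_used_mb": int(ram.group(1)) if ram else None,
        "ram_total_mb": int(ram.group(2)) if ram else None,
        "gpu_util_pct": int(gpu.group(1)) if gpu else None,
        "power_mw": int(power.group(1)) if power else None,
    }

stats = parse_tegrastats(SAMPLE)
print(stats)
```

Piped into a loop over `tegrastats` output, this gives you a quick way to log the idle numbers above over an overnight run.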
Gotchas Summary
Issue: Unrecognized key: "model"
Cause: Wrong config path
Fix: Use agents.defaults.model, not model

Issue: SYSTEM_RUN_DENIED: approval required
Cause: Default node security
Fix: Run openclaw config set tools.exec.security full on the node

Issue: SYSTEM_RUN_DENIED: allowlist miss
Cause: security=allowlist with no bins defined
Fix: Switch to full or define safeBins

Issue: openclaw models remove not found
Cause: Command doesn't exist
Fix: Edit openclaw.json directly with python3

Issue: Ollama errors at startup
Cause: Built-in discovery, can't disable
Fix: Ignore — cosmetic only

Issue: Model output includes thinking chain
Cause: Reasoning mode baked into the model
Fix: Add a system prompt telling it to skip thinking, or disable reasoning in the vLLM config
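For that last one, a client-side fallback also works if you can't change the server config. Qwen-family reasoning models typically wrap the chain in <think>...</think> tags (an assumption about this particular build; check the raw output first), which are easy to strip:

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks (assumed Qwen reasoning markers)."""
    # DOTALL lets the chain span multiple lines
    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return cleaned.strip()

raw = "<think>\nThe user wants a greeting.\n</think>\n\nHello there!"
print(strip_thinking(raw))  # -> Hello there!
```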
Hardware Note: Orin Nano Super (8GB)
Same approach works on an Orin Nano Super (8GB) — just use a smaller model. The Qwen 3.5 Small series (just released March 2026, 0.8B–9B range) is built for on-device/edge use and fits in the Nano's 8GB. The method is identical.
Why Bother?
$0/month operating cost
No API latency, no rate limits, no data leaving your network
60W power draw — runs all night on overnight tasks
35B-level reasoning for background jobs: research, batch processing, coding runs
Full tool-use and thinking capabilities
Still might become the brain for a robot someday
Originally set up March 6, 2026. All commands verified in production.

