Running Qwen 3.5 35B on an NVIDIA Jetson AGX Orin with OpenClaw
A real-world tutorial — every command included
What I Built
An NVIDIA Jetson AGX Orin 64GB running Qwen 3.5 35B-A3B (MoE, custom quantized) as a local AI model provider, fully integrated into an OpenClaw agent stack. My Mac calls the Jetson over LAN using a simple alias (agx) and gets 35B-level reasoning back at ~30 tok/sec — $0/month, 60 watts.
Hardware & Specs
Device: NVIDIA Jetson AGX Orin 64GB
OS: Ubuntu, JetPack R36.4.7 (aarch64)
CUDA: 12.6
RAM: 64GB unified memory (CPU + GPU share it)
Storage: 3.7TB NVMe
Power: ~60W under load
Model:
Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16 (w4a16 quantized, served via vLLM)
Why MoE matters
Qwen 3.5 35B-A3B is a Mixture-of-Experts model: 35B total parameters, but only ~3B active per token at inference. Since token generation is memory-bandwidth-bound, that's roughly an order of magnitude less weight traffic per token than a dense model of the same size. The dense 27B Qwen variant is slower on the same hardware. On edge hardware, MoE wins.
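The back-of-the-envelope arithmetic, as a sketch (the ~3B-active figure comes from the model name, and 0.5 bytes per weight for w4a16 is an approximation that ignores quantization scales and activations):

```python
# Rough per-token weight traffic: decoding streams the active weights from
# memory for every generated token, so bytes-per-token tracks active params.
BYTES_PER_PARAM = 0.5  # w4a16: 4-bit weights ~= 0.5 bytes each (approximation)

dense_params = 35e9  # a dense 35B model touches every weight per token
moe_active = 3e9     # Qwen 3.5 35B-A3B activates ~3B params per token

dense_gb_per_token = dense_params * BYTES_PER_PARAM / 1e9
moe_gb_per_token = moe_active * BYTES_PER_PARAM / 1e9

print(f"dense: {dense_gb_per_token:.1f} GB/token, MoE: {moe_gb_per_token:.1f} GB/token")
print(f"bandwidth ratio: ~{dense_params / moe_active:.0f}x")
```

At a fixed memory bandwidth, that ratio translates almost directly into tokens per second, which is why the MoE build outruns the dense 27B on the same board.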
What NOT to Do: The Ollama Trap
The standard Ollama build of Qwen isn't optimized for Orin's CUDA architecture. If you want real performance out of the Jetson, skip Ollama and use a custom quantized build served via vLLM with CUDA acceleration.
NVIDIA's Jetson AI Lab documents this model officially — that's where I found it: 👉 jetson-ai-lab.com/models/qwen3-5-35b-a3b
The specific quantized build I used is the w4a16 variant optimized for Orin's architecture.
vLLM serves it with an OpenAI-compatible API on port 8000.
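For reference, launching a model like this with vLLM looks roughly like the following. This is a sketch, not the exact command from this setup: the flag values (host binding, context length, memory utilization) are assumptions you'd tune for the Orin.

```shell
# Sketch of a vLLM launch (flag values are assumptions, not from this setup)
vllm serve Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 16000 \
  --gpu-memory-utilization 0.90
```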
Step 1: Verify the Model is Running
SSH into your Jetson and confirm vLLM is serving:
# Check what's listening on port 8000
ss -tlnp | grep 8000
# Should show: LISTEN 0 2048 0.0.0.0:8000 0.0.0.0:*
# Verify the model API is responding
curl -s http://localhost:8000/v1/models | python3 -m json.tool

Test a completion:
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16",
"messages": [{"role": "user", "content": "Say hello in one sentence."}],
"max_tokens": 100
}' | python3 -c "import json,sys; r=json.load(sys.stdin); print(r['choices'][0]['message']['content'])"

Step 2: Configure OpenClaw on the Jetson
Check what OpenClaw sees:
openclaw models list

Set the local model as the default:
openclaw config set agents.defaults.model qwen
openclaw gateway restart

Gotcha: openclaw config set model qwen doesn't work — model is not a root-level key. The correct path is agents.defaults.model.
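After that set command, the relevant slice of openclaw.json should look something like this (a sketch with surrounding keys omitted; "qwen" is whatever model name your install exposes):

```json
{
  "agents": {
    "defaults": {
      "model": "qwen"
    }
  }
}
```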
Step 3: Clean Up Stale Model Entries
If you have leftover model entries (e.g., from an old Ollama provider or a stale alias), remove them:
# View current models config
openclaw config get models
# Remove stale provider (e.g. ollama pointing at port 11434 that isn't running)
python3 -c "
import json
with open('/home/agx/.openclaw/openclaw.json') as f:
cfg = json.load(f)
cfg['models']['providers'].pop('ollama', None)
with open('/home/agx/.openclaw/openclaw.json', 'w') as f:
json.dump(cfg, f, indent=2)
print('Done')
"
# Remove stale model alias from agents config
python3 -c "
import json
with open('/home/agx/.openclaw/openclaw.json') as f:
cfg = json.load(f)
cfg['agents']['defaults']['models'].pop('kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16', None)
with open('/home/agx/.openclaw/openclaw.json', 'w') as f:
json.dump(cfg, f, indent=2)
print('Done')
"
openclaw gateway restart

Gotcha: There's no openclaw models remove command. You have to edit the JSON directly.
Note on Ollama errors: OpenClaw has built-in Ollama auto-discovery that tries port 11434 at startup. If Ollama isn't running, you'll see Failed to discover Ollama models: TypeError: fetch failed in the logs. This is cosmetic — it doesn't affect functionality. There's no config key to disable it in 2026.3.2.
Step 4: Allow Remote Exec on the Jetson Node
By default, agx requires approval for every system.run command from a remote session. To allow your main machine to run commands freely:
# Set security to full (no restrictions — fine for a trusted local node)
openclaw config set tools.exec.ask off
openclaw config set tools.exec.security full
openclaw gateway restart

Gotcha: security=allowlist without defined safeBins will give you allowlist miss errors. Use full for a local trusted node, or define your safeBins list explicitly.
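If you'd rather keep allowlist mode, the safeBins list needs entries before anything will pass. The key path below is a guess extrapolated from the tools.exec.* paths used above (hypothetical — verify it against your OpenClaw version):

```shell
# Hypothetical key path: tools.exec.safeBins is assumed, not confirmed
openclaw config set tools.exec.security allowlist
openclaw config set tools.exec.safeBins '["ls", "cat", "grep", "python3"]'
openclaw gateway restart
```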
Step 5: Add the Jetson as a Provider on Your Main Machine
First, confirm your Mac can reach the Jetson's model server:
# Run this on your Mac
curl -s http://YOUR_JETSON_IP:8000/v1/models | python3 -m json.tool | head -10

Then add it as a custom provider in OpenClaw (run on your Mac, or use the gateway tool):
openclaw config set models.providers.agx-qwen '{
"baseUrl": "http://YOUR_JETSON_IP:8000/v1",
"apiKey": "none",
"api": "openai-completions",
"models": [{
"id": "Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16",
"name": "AGX Qwen3.5-35B (local)",
"input": ["text"],
"contextWindow": 16000,
"maxTokens": 4096
}]
}'

Replace YOUR_JETSON_IP with your Jetson's actual LAN IP:
# Check Jetson IP
hostname -I | awk '{print $1}'

Step 6: Add a Model Alias
# Run on your Mac
openclaw models aliases add agx agx-qwen/Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16

Now you can reference it anywhere as agx.
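Scripts on the Mac can also hit the endpoint directly with the Python standard library; no SDK is needed since vLLM speaks the OpenAI chat-completions schema. A minimal sketch (YOUR_JETSON_IP and the helper function names are placeholders of mine):

```python
import json
import urllib.request

JETSON_URL = "http://YOUR_JETSON_IP:8000/v1/chat/completions"
MODEL_ID = "Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16"

def build_payload(prompt: str, max_tokens: int = 300) -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a chat-completions response."""
    return response["choices"][0]["message"]["content"]

def ask(prompt: str) -> str:
    """POST a prompt to the Jetson and return the model's reply."""
    req = urllib.request.Request(
        JETSON_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))

# Example: print(ask("Say hello in one sentence."))
```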
Step 7: Test It from Your Mac
curl -s http://YOUR_JETSON_IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16",
"messages": [{"role": "user", "content": "Write a one-paragraph story about a robot."}],
"max_tokens": 300
}' | python3 -c "import json,sys; r=json.load(sys.stdin); print(r['choices'][0]['message']['content'])"

GPU Health Check
# Real-time stats (run on Jetson)
tegrastats
# Or one-shot summary
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw --format=csv,noheader,nounits
# System overview
free -h # RAM
df -h / # Disk
uptime # Load average

Healthy idle numbers on the AGX 64GB running Qwen 3.5 35B-A3B:
RAM: ~56GB used (model loaded in unified memory)
GPU temp: ~46°C
GPU utilization: ~10% idle, spikes during inference
Power draw: ~3.5W idle
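tegrastats emits one dense line per interval, which is awkward for logging. A small parser makes it friendlier; the sample line below is illustrative only (the field layout varies by JetPack version, so treat the regexes as a sketch to adapt):

```python
import re

# Illustrative tegrastats line (format is an assumption; check your JetPack)
SAMPLE = ("RAM 56012/63462MB (lfb 6x4MB) SWAP 0/31731MB (cached 0MB) "
          "CPU [2%@729,1%@729,0%@729,0%@729] GR3D_FREQ 10% "
          "tj@46.2C VDD_IN 3545mW/3545mW")

def parse_tegrastats(line: str) -> dict:
    """Extract RAM usage, GPU utilization, and power draw from one line."""
    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    gpu = re.search(r"GR3D_FREQ (\d+)%", line)
    power = re.search(r"VDD_IN (\d+)mW", line)
    return {
        "ram_used_mb": int(ram.group(1)) if ram else None,
        "ram_total_mb": int(ram.group(2)) if ram else None,
        "gpu_util_pct": int(gpu.group(1)) if gpu else None,
        "power_mw": int(power.group(1)) if power else None,
    }

stats = parse_tegrastats(SAMPLE)
print(stats)
```

Piped into a loop over `tegrastats` output, this gives you a quick way to log the idle numbers above over an overnight run.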
Gotchas Summary
Issue: Unrecognized key: "model"
Cause: Wrong config path
Fix: Use agents.defaults.model, not model

Issue: SYSTEM_RUN_DENIED: approval required
Cause: Default node security
Fix: Run openclaw config set tools.exec.security full on the node

Issue: SYSTEM_RUN_DENIED: allowlist miss
Cause: security=allowlist with no bins defined
Fix: Switch to full or define safeBins

Issue: openclaw models remove not found
Cause: Command doesn't exist
Fix: Edit openclaw.json directly with python3

Issue: Ollama errors at startup
Cause: Built-in discovery, can't disable
Fix: Ignore — cosmetic only

Issue: Model output includes thinking chain
Cause: Reasoning mode baked into the model
Fix: Add a system prompt telling it to skip thinking, or disable reasoning in the vLLM config
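For that last one, a client-side fallback also works if you can't change the server config. Qwen-family reasoning models typically wrap the chain in <think>...</think> tags (an assumption about this particular build; check the raw output first), which are easy to strip:

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks (assumed Qwen reasoning markers)."""
    # DOTALL lets the chain span multiple lines
    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return cleaned.strip()

raw = "<think>\nThe user wants a greeting.\n</think>\n\nHello there!"
print(strip_thinking(raw))  # -> Hello there!
```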
Hardware Note: Orin Nano Super (8GB)
Same approach works on an Orin Nano Super (8GB) — just use a smaller model. The Qwen 3.5 Small series (just released March 2026, 0.8B–9B range) is built for on-device/edge use and fits in the Nano's 8GB. The method is identical.
Why Bother?
$0/month operating cost
No API latency, no rate limits, no data leaving your network
60W power draw — runs all night on overnight tasks
35B-level reasoning for background jobs: research, batch processing, coding runs
Full tool-use and thinking capabilities
Still might become the brain for a robot someday
Originally set up March 6, 2026. All commands verified in production.

