bringing it back home: gemma 4 on a 60-watt box
I put Gemma 4 on a Jetson AGX Orin. It's better than it should be.
A few weeks ago I wrote about getting Qwen 3.5 35B-A3B running on my Jetson AGX Orin through vLLM and OpenClaw. It worked. It was fast enough. I was happy with it.
But every time I looked at my setup there was a small mental asterisk: I was running my local agent stack on weights trained by Alibaba. Not because I had any concern about the model, but because I'm an American indie builder and my favorite local model was coming out of Hangzhou. It felt slightly off-brand for what I'm trying to build.
Gemma 4 26B-A4B is the first genuinely capable small US open weights model I've been able to run on my own hardware. Not "capable for a small model." Capable, full stop.
Why 26B on an Orin should not work
The AGX Orin is a 100mm x 87mm module that draws about 60 watts under load. 64GB of unified memory shared between CPU and GPU. Ampere GPU, 2048 CUDA cores. It sits on my workbench next to my dev machine.
Gemma 4 26B-A4B is not a normal 26B model. It has 25.2 billion total parameters, but only 3.8 billion are active during any given inference pass. 128 experts per MoE layer, router picks 2 per token, the rest sit idle. The file is 16.8GB at Q4_K_M, but the compute per token is closer to a 4B model.
That's the whole trick. Training quality of a 26B, inference cost of a 4B. On hardware where every watt matters, that's the difference between "runs" and "doesn't."
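The routing step is easy to sketch. Here's a toy top-2 mixture-of-experts layer in NumPy; the dimensions, router, and expert shapes are made up for illustration and are not Gemma's actual architecture:

```python
import numpy as np

def moe_layer(x, router_w, experts, top_k=2):
    """Toy MoE layer: score all experts, run only the top_k,
    and mix their outputs by softmax gate weight."""
    logits = x @ router_w                      # router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                       # softmax over the chosen k only
    # Only top_k experts execute; the other 126 sit idle for this token.
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 128
router_w = rng.normal(size=(d, n_experts))
experts = [lambda x, w=rng.normal(size=(d, d)): x @ w for _ in range(n_experts)]
y = moe_layer(rng.normal(size=d), router_w, experts)
print(y.shape)
```

Per token, 2 of 128 experts fire, which is why the active parameter count (3.8B) is so much smaller than the total (25.2B) even though every expert contributes to training capacity.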
NVIDIA's Jetson AI Lab lists the AGX Orin as a supported platform and ships a Docker container with llama.cpp pre-configured:
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/llama_cpp:gemma4-jetson-orin \
llama-server -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M

One line. The model takes ~24GB of RAM. On my 64GB Orin, that leaves plenty of room for the rest of the system.
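Once the container is up, anything that speaks the OpenAI-style chat API can talk to it. A minimal stdlib-only client sketch; the port and endpoint path are llama-server's defaults, and `build_request`/`ask` are my own names, not part of any library:

```python
import json
import urllib.request

# llama-server listens on port 8080 by default and exposes an
# OpenAI-compatible chat completions endpoint.
LLAMA_URL = "http://localhost:8080/v1/chat/completions"

def build_request(prompt, max_tokens=256, temperature=0.2):
    """Build the JSON body for a single-turn chat completion."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def ask(prompt):
    """POST the prompt to the local server and return the reply text.
    (Not invoked here -- requires the container to be running.)"""
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        LLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

payload = build_request("Summarize this stack trace: ...")
print(json.dumps(payload, indent=2))
```

Point your agent framework's OpenAI base URL at `http://localhost:8080/v1` and it works the same way.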
What "runs" actually means
Set expectations. At Q4 on the Orin, Gemma 4 26B-A4B is not fast. I'm getting around 11 tokens per second, which I'd call "comfortable for non-interactive use." Background tasks, batch processing, offline analysis. Not real-time chat.
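That throughput translates directly into wall-clock budgets. A back-of-envelope helper, using my measured 11 tok/s (yours will vary with quant, power mode, and context; prefill time is ignored):

```python
def response_time_s(output_tokens, toks_per_sec=11.0):
    """Rough decode time for a response, ignoring prefill."""
    return output_tokens / toks_per_sec

# A 500-token answer costs ~45 seconds of decode alone: fine for a
# background agent task, hopeless for interactive chat.
for n in (100, 500, 2000):
    print(f"{n:>5} tokens -> {response_time_s(n):6.1f} s")
```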
For my use cases this is fine. I'm not building a chatbot on the Orin. I'm running background agent tasks: log analysis, code review, config generation, structured extraction. None of those need sub-second latency. They need good output on short-to-medium prompts.
And the output is where 26B earns its place. Sub-10B models on this hardware work for simple stuff (summarize this, extract these fields, classify this log line), but they fall apart on anything that needs real reasoning. A 3B model asked to analyze a stack trace gives you something plausible that's wrong half the time. A 7B does better but hallucinates function names. Gemma 4 26B is a different animal. It understands code, handles tool calling, follows multi-step instructions without losing the thread.
The 256K context window doesn't hurt either. I can feed it a whole config file, a stack trace, and the relevant source all at once.
For the first time, I have a model on the Orin that I trust to do work I'd previously have had to ship to an API.
Rough edges
Skip Ollama on Jetson. There's an open bug where gemma4:26b throws HTTP 500s on moderately long context on the Orin specifically. Looks like a memory management issue with Ollama's CUDA integration on Jetson. Stick with llama.cpp.
Long context is slow. The 256K window exists but filling it is painful. Prefill scales with context length, and the Orin's GPU takes a while on 50K+ tokens. I keep my prompts under 10K for anything that needs to respond in under a minute.
Multimodal is experimental. Vision through llama.cpp on Jetson isn't mature yet. The text capabilities are solid.
The punchline
I've been waiting for a model smart enough to be useful and small enough to run on hardware I actually own. Not a rented H100. Not a cloud API I'm paying per token for. Not a Mac Studio I'd have to buy specifically for this. A 60-watt box I already have on my desk.
Gemma 4 26B-A4B is the first model that meets both criteria on the Orin.
It's not going to replace Claude or GPT for complex, multi-turn reasoning. It's not going to write this newsletter. But for the work that needs to happen locally, without a cloud round-trip, it is the best option available today. And it's US open weights, which means I can finally drop the mental asterisk.
The interesting question isn't whether it's as good as a cloud model. It obviously isn't. The interesting question is whether it's good enough that you stop needing the cloud model for a meaningful chunk of your workload.
On my bench, the answer is yes. For the first time.