OpenClaw Part 4: Local Model Shootout on Strix Halo

Tutorial · 16 min read
Intermediate · ~1 hr

Prerequisites

  • Completed Parts 1-3 (OpenClaw, Uptime Kuma, Matrix all working)

Tools

  • SSH terminal
  • Web browser (Element Web)

Software

  • ollama 0.18.3
  • openclaw 2026.3.24

GLM-4.7-flash has been the default since Part 1 — it's OpenClaw's recommended model and it's been solid. But with 100GB of GPU-accessible memory on Strix Halo, you can run models that most people can't. The question is: should you?

This part is a shootout. Six models, four vendors, ranging from 9B to 122B parameters. Same infrastructure from Parts 2-3 — same diagnostic scripts, same Uptime Kuma webhook, same Matrix reporting. Every model runs through the same tests to see who comes out on top.

The results were surprising. Bigger wasn't always better. The fastest model wasn't the smallest. And the model that won the benchmarks wasn't the one I kept running in production.

NOTE

This is Part 4 of a 6-part series. Parts 1-3 built the monitoring pipeline. This part finds the best model for it. Parts 5-6 cover backup/restore and production hardening.

The Lineup

Six models, all with confirmed Ollama tool calling support:

# | Model | Architecture | Total / Active Params | VRAM | Vendor
1 | Qwen3.5:9b | Dense | 9B / 9B | ~7GB | Alibaba
2 | GPT-OSS:20b | MoE | 20B / ~3B | ~16GB | OpenAI
3 | GLM-4.7-flash | MoE | 30B / 3B | ~20GB | Zhipu AI
4 | Nemotron-Cascade-2 | MoE | 30B / 3B | ~25GB | NVIDIA
5 | Qwen3.5:27b (q8_0) | Dense | 27B / 27B | ~32GB | Alibaba
6 | Qwen3.5:122b-a10b | MoE | 122B / 10B | ~85GB | Alibaba

The narrative angles: tiny vs big (Qwen3.5 at three sizes), MoE vs dense (3B active vs 27B active), OpenAI's first open-weight model as the outsider, and the production baseline (GLM) as the control.

The Tests

Not just monitoring — five different capabilities:

Test | What it measures
Service Crash Diagnostics | Tool calling, SSH execution, structured data analysis
Morning Brief from RSS | Web fetch (curl), XML parsing, summarization, Matrix posting
Script Generation | Code quality, Proxmox CLI knowledge
Capacity Planning | Data analysis, math, resource planning
Incident Report Writing | Writing quality, structure, professionalism

Each model runs every test on the same hardware. Token speed, tool calling success, and output quality are all measured.

Setup: Pull All Models

~173GB total. Pull over ethernet — this takes a while.

On OpenClaw machine
ollama pull qwen3.5:9b
On OpenClaw machine
ollama pull gpt-oss:20b

GLM-4.7-flash is already pulled from Part 1.

On OpenClaw machine
ollama pull nemotron-cascade-2
On OpenClaw machine
ollama pull qwen3.5:27b-q8_0
On OpenClaw machine
ollama pull qwen3.5:122b-a10b

The 122B model is 81GB — expect a long download. Verify all six are ready:

On OpenClaw machine
ollama list

You should see all six models with their sizes.
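
If you'd rather script the check than eyeball it, a quick loop over the expected tags works. This is just a convenience sketch — the tags are assumed to match the pull commands above:

On OpenClaw machine
for m in qwen3.5:9b gpt-oss:20b glm-4.7-flash nemotron-cascade-2 qwen3.5:27b-q8_0 qwen3.5:122b-a10b; do
  # check ollama list for each expected tag
  ollama list | grep -q "^$m" && echo "OK       $m" || echo "MISSING  $m"
done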

Switching Models

To switch OpenClaw between models, use the config CLI. Two values need updating — the model ID and the display name:

On OpenClaw machine
openclaw config set models.providers.ollama.models.0.id "nemotron-cascade-2"
On OpenClaw machine
openclaw config set models.providers.ollama.models.0.name "nemotron-cascade-2"

Then restart the gateway to load the new model:

On OpenClaw machine
systemctl --user restart openclaw-gateway

TIP

The model ID must match the Ollama model name exactly (what you see in ollama list). The name field is just the display label in the UI.
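
If you're going to flip between six models repeatedly, a small wrapper saves typing. This is just a convenience sketch built from the three commands above — the script name is made up:

switch-model.sh (hypothetical helper)
#!/usr/bin/env bash
# Point OpenClaw at a different Ollama model and restart the gateway.
# Usage: ./switch-model.sh nemotron-cascade-2
set -euo pipefail
MODEL="${1:?usage: $0 <ollama-model-name>}"
openclaw config set models.providers.ollama.models.0.id "$MODEL"
openclaw config set models.providers.ollama.models.0.name "$MODEL"
systemctl --user restart openclaw-gateway
echo "Gateway restarted with model: $MODEL"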

Baseline Speed Benchmark

Before testing with OpenClaw, measure raw inference speed via the Ollama API — no tool calling overhead, no chat templates, pure token generation. This gives a clean hardware baseline to compare against.

Methodology

Every model gets the same prompt, run through Ollama's /api/generate endpoint with stream: false. The API response includes nanosecond-precision timing for two distinct phases:

  • Prompt eval (prefill) — how fast the model processes input tokens to build its internal context. This is compute-bound and scales with model size.
  • Generation (decode) — how fast the model produces output tokens. This is memory-bandwidth-bound — and the number most people mean when they say "tokens per second."

Each model is pre-loaded (warm) before the measured run. In production, your agent's model stays in memory, so cold start times aren't relevant to ongoing performance.
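
To warm a model yourself, Ollama's generate endpoint loads a model without producing any output when you send an empty prompt, and keep_alive controls how long it stays resident. A minimal warm-up call (the model name is just an example):

On OpenClaw machine
curl -s http://localhost:11434/api/generate -d '{
  "model": "nemotron-cascade-2",
  "prompt": "",
  "keep_alive": "30m",
  "stream": false
}' > /dev/null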

NOTE

These models don't all use the same quantization. Most use Q4_K_M (4-bit), but the 27B here is Q8_0 (8-bit) — double the precision, double the memory, roughly half the speed. GPT-OSS uses MXFP4, a 4-bit microscaling (MX) format. These are the quants the pull commands above give you — no manual conversion needed.

The Benchmark Command

Hit the Ollama API and parse the timing fields. The response JSON includes prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration — everything needed to calculate tok/s for both phases:

On OpenClaw machine
curl -s http://localhost:11434/api/generate -d '{
  "model": "nemotron-cascade-2",
  "prompt": "Explain what a homelab is, why someone would build one, and what the most common services are. Keep it under 200 words.",
  "stream": false
}' | python3 -c "
import json, sys
d = json.load(sys.stdin)
pp = d['prompt_eval_count'] / (d['prompt_eval_duration'] / 1e9)
tg = d['eval_count'] / (d['eval_duration'] / 1e9)
print(f'Prompt eval: {pp:.1f} tok/s ({d[\"prompt_eval_count\"]} tokens)')
print(f'Generation:  {tg:.1f} tok/s ({d[\"eval_count\"]} tokens)')
"

Here's what Nemotron returns:

Nemotron-Cascade-2 benchmark output
Prompt eval: 141.6 tok/s (48 tokens)
Generation:  57.8 tok/s (449 tokens)

Prompt eval tells you how fast the model digests your input. Generation tells you how fast it produces the answer. For an always-on agent, generation speed is what you feel — it's the pace of every response in Matrix, every diagnostic report, every morning brief.
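
To sweep all six in one pass, you can wrap the same call in a loop. A rough sketch — the model tags are assumed to match ollama list, and the first request to a cold model includes load time, so warm each model first (or discard its first run):

On OpenClaw machine
for m in nemotron-cascade-2 gpt-oss:20b glm-4.7-flash qwen3.5:9b qwen3.5:122b-a10b qwen3.5:27b-q8_0; do
  echo "== $m =="
  curl -s http://localhost:11434/api/generate -d "{
    \"model\": \"$m\",
    \"prompt\": \"Explain what a homelab is, why someone would build one, and what the most common services are. Keep it under 200 words.\",
    \"stream\": false
  }" | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(f'Prompt eval: {d[\"prompt_eval_count\"] / (d[\"prompt_eval_duration\"] / 1e9):.1f} tok/s')
print(f'Generation:  {d[\"eval_count\"] / (d[\"eval_duration\"] / 1e9):.1f} tok/s')
"
done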

Results

Run the same command for each model, swapping the model name:

Model | Prompt Eval | Generation | Quantization | VRAM
Nemotron-Cascade-2 | 141.6 tok/s | 57.8 tok/s | Q4_K_M | ~25GB
GPT-OSS:20b | 646.3 tok/s | 48.3 tok/s | MXFP4 | ~16GB
GLM-4.7-flash | 150.3 tok/s | 47.8 tok/s | Q4_K_M | ~20GB
Qwen3.5:9b | 440.5 tok/s | 31.8 tok/s | Q4_K_M | ~7GB
Qwen3.5:122b-a10b | 90.4 tok/s | 19.3 tok/s | Q4_K_M | ~85GB
Qwen3.5:27b (q8_0) | 62.5 tok/s | 6.7 tok/s | Q8_0 | ~32GB

The first surprise: Nemotron-Cascade-2 is the fastest generator at 57.8 tok/s — not the smallest model. GPT-OSS leads prompt eval at 646 tok/s (its MXFP4 format is highly optimized for prefill) but falls behind in generation. The 27B dense model crawls at 6.7 tok/s — all 27 billion parameters fire on every single token.

NOTE

MoE (Mixture of Experts) models only activate a fraction of their parameters per token. Nemotron-Cascade-2 has 30B total but only ~3B active — that's why it's fast. Qwen3.5:27b is fully dense — all 27B active on every token. Architecture matters more than parameter count for speed.

Test 1: Service Crash Diagnostics

The core test from Parts 2-3. Stop nginx on the test container, Uptime Kuma fires the webhook, the agent investigates via diagnostic scripts, and the report lands in Matrix.

Model | Tool Calling | Diagnosis
Qwen3.5:9b | Tried wrong tool | Failed
GPT-OSS:20b | Didn't attempt tools | Wrote manual runbook
GLM-4.7-flash | Ran diagnostics | Partial — blamed old OOM entries
Nemotron-Cascade-2 | Ran diagnostics | Correct — separated old from current
Qwen3.5:27b | Ran diagnostics | Correct
Qwen3.5:122b-a10b | Ran diagnostics (with retry) | Correct

The 9B model tried to call a tool but used the wrong one — exec denied. GPT-OSS understood the task perfectly and wrote out the exact SSH commands you'd need to run, but never actually called any tools. It wrote a manual runbook instead of executing.

GLM-4.7-flash — the production model — ran the diagnostics correctly but made a mistake in analysis. It found old OOM entries from a previous day in the host dmesg and blamed them for the current outage. The service was cleanly stopped, not OOM-killed.

Nemotron-Cascade-2 ran the same diagnostics and correctly said the old OOM entries "have no direct OOM activity now." Same data, better reasoning.

Test 2: Morning Brief from RSS

A completely different capability — and a genuinely useful one you can set up yourself. The agent fetches an RSS feed, summarizes the top stories, and posts a morning brief to a dedicated Matrix room. Here's how to set it up.

Create the Morning Brief Room

In Element Web, create a new room called Morning Brief — same process as Part 3. Private, encryption OFF, invite @openclaw-bot:matrix.

Note the room ID from Room Settings > Advanced (starts with !).

Set Up the Cron Job

OpenClaw has built-in cron scheduling — no system crontab or curl hacks needed. Create a cron job that triggers the agent every morning at 7am:

On OpenClaw machine
openclaw cron add \
  --name 'morning-brief' \
  --cron '0 7 * * *' \
  --tz 'America/New_York' \
  --message 'Fetch the top 5 stories from the Hacker News RSS feed at https://hnrss.org/frontpage. For each story, include the title, a 2-3 sentence summary, the points and comment count if available, and the link. Post a formatted morning brief to Matrix.' \
  --channel matrix \
  --to '!YOUR_MORNING_BRIEF_ROOM_ID:matrix' \
  --announce

Breaking this down:

  • --cron '0 7 * * *' — standard cron expression: minute 0, hour 7, every day. Runs at 7:00 AM.
  • --tz 'America/New_York' — timezone for the schedule. Change to your timezone (IANA format).
  • --message — the prompt the agent receives. It needs to know what feed to fetch, how to summarize, and where to post.
  • --channel matrix — deliver the output to Matrix.
  • --to — the room ID where the brief lands.
  • --announce — posts the agent's response to the chat.

WARNING

Use single quotes for all the string arguments. Double quotes can cause shell escaping issues that silently break the cron job — I learned this the hard way during recording.

TIP

Replace the timezone and room ID with your own. You can use any RSS feed — not just Hacker News. Tech news, homelab subreddits, whatever you want to wake up to.

Test It Manually

Don't wait until 7am — trigger it now to verify everything works:

On OpenClaw machine
openclaw cron run CRON_JOB_ID

Check the Morning Brief room in Element. Within a minute or two, the agent should post a formatted summary of the top Hacker News stories.

You can also check the cron job status and run history:

On OpenClaw machine
openclaw cron list
On OpenClaw machine
openclaw cron runs --id CRON_JOB_ID

NOTE

In my testing, the morning brief sometimes included meta-commentary alongside the actual content — things like "the cron task is running in an isolated session, let me check if there's a matrix connection." The brief itself was correct, but the agent narrated its own process into the Matrix post. It also reported that Matrix posting "failed" when it had actually posted successfully. The content was good; the delivery was messy. Prompt refinement would likely help here, but I haven't nailed it down yet.

Model Comparison: Morning Brief Quality

Every model except the 9B successfully fetched the RSS feed and posted a brief. Quality varied significantly:

Model | Result | Quality
Qwen3.5:9b | Failed — never posted | -
GPT-OSS:20b | Posted | Clean but basic
GLM-4.7-flash | Posted | Good — added "Trending" section
Nemotron-Cascade-2 | Posted | Functional but plain
Qwen3.5:27b | Posted | Creative — ASCII art, lobster quote
Qwen3.5:122b-a10b | Posted | Best — points, comments, clean headers

The 9B failed again — it couldn't handle the tool calling needed to curl the RSS feed. GLM added an unprompted "Trending today" section with the highest-point stories — a nice touch nobody else thought of. The 27B added ASCII art dividers and a lobster quote from the agent's personality file. The 122B had the cleanest formatting with story metadata (points, comments, links).

Test 3: Script Generation

"Write a bash script that checks container memory usage on Proxmox." No tools needed — pure code generation.

Every model produced a script. Every script had bugs. None got the pct CLI output format exactly right — they all assumed pct status or pct list returns data in formats those commands don't actually produce.

NOTE

This is a fundamental limitation. Niche CLI tools like Proxmox's pct aren't well-represented in training data. Every model — from 9B to 122B — got it wrong. Don't trust LLMs for niche tool knowledge regardless of model size.

The 27B was the most interesting: it wrote a first attempt, realized it was broken, and self-corrected mid-response. The second version used a JSON + jq pipeline — the correct approach, even if the exact flags were wrong.
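
For reference, the direction the 27B was reaching for looks roughly like this: query the Proxmox API as JSON and let jq do the parsing instead of scraping pct output. This is a sketch, not a verified script — the field names (vmid, name, mem, maxmem) are my assumption of what the /nodes/<node>/lxc endpoint returns, so check it against your own host first:

On Proxmox host
# List containers with current vs allocated memory via the API instead of pct
pvesh get "/nodes/$(hostname)/lxc" --output-format json \
  | jq -r '.[] | "\(.vmid)\t\(.name)\t\(.mem/1048576|floor)MiB used / \(.maxmem/1048576|floor)MiB limit"'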

Test 4: Capacity Planning

This test provides real host data and asks the model to reason about infrastructure decisions — no tools needed, pure analysis.

First, get your host's current state by running the diagnostics-host script:

On OpenClaw machine
ssh proxmox diagnostics-host

This outputs memory, ZFS pool status, running containers with VMIDs, and VM summary. Copy the full output, then paste it into Matrix along with the following prompt:

Capacity planning prompt
Here is the current state of my Proxmox host:
 
[paste diagnostics-host output here]
 
I want to deploy three new services: Grafana, Prometheus, and Home Assistant. For each one:
1. Which ZFS pool should the container storage go on and why?
2. What VMID should it get?
3. How much RAM and disk should I allocate?
4. Any dependencies or ordering considerations?

The model needs to read the data correctly (available memory, pool sizes, existing VMID range) and reason about placement. No SSH, no tools — just analysis.

NOTE

I originally wanted the models to run ssh proxmox diagnostics-host themselves, but most couldn't find or execute the diagnostic scripts reliably. Some hallucinated output instead of running the command. I ended up running the script myself and pasting the data into the prompt — which made it a cleaner test of reasoning ability anyway, without the tool-calling variable muddying the results.

Model | Math | Pool Assignment | CT IDs
Qwen3.5:9b | Correct | Correct | Correct
GPT-OSS:20b | Correct | All SSD | Correct
GLM-4.7-flash | Error | Wrong (due to error) | Correct
Nemotron-Cascade-2 | Correct | Correct | Correct
Qwen3.5:27b | Correct | Best reasoning | Correct
Qwen3.5:122b-a10b | Correct | Correct | Correct

GLM made the most concerning error. It correctly parsed the data in its summary — "ssd-pool: 11.2T capacity, 7.81GB used" — then contradicted itself in the recommendation: "ssd-pool only 7.81GB free is tight." It confused used with free. The pool has 11.19TB free, not 7.81GB.

The 27B gave the best answer overall: correct math, sensible pool placement (SSD for Prometheus, HDD for Home Assistant), noted you should deploy Prometheus before Grafana (dependency order), and offered to draft the container creation commands.

Test 5: Incident Report Writing

Given bullet points about a Jellyfin OOM incident, write a formal report. Paste the following into Matrix:

Incident report prompt
Write a formal incident report based on these facts:
- Date: March 27, 2026, 02:15 UTC
- Service: Jellyfin media server (CT 109)
- Impact: All users unable to stream media for 47 minutes
- Root cause: OOM kill — a library scan job consumed 6GB RAM in a 4GB container
- Detection: Uptime Kuma alert at 02:15, OpenClaw investigation at 02:16
- Resolution: Container restarted at 02:58, library scan rescheduled to off-peak
- Prevention: Increased container memory to 8GB, added memory limit to scan job
 
Format it as a proper incident report with: Summary, Timeline, Root Cause Analysis, Impact, Resolution, Prevention, and Lessons Learned sections.

Every model produced a usable report. Quality ranged from "gets the job done" to "I'd actually send this."

The 27B was the standout — incident ID, severity level, contributing factors (not just root cause), verification steps, completed vs recommended mitigations with checkmarks, MTTR analysis, and a distribution list. It was also the slowest at 7 tok/s.

The 9B — which failed both tool-calling tests — wrote a surprisingly good incident report with inferred intermediate timestamps that weren't in the source data. Good reasoning, bad tool calling.

The Results

Speed:

Model | Prompt Eval | Generation | VRAM
Nemotron-Cascade-2 | 141.6 tok/s | 57.8 tok/s | ~25GB
GPT-OSS:20b | 646.3 tok/s | 48.3 tok/s | ~16GB
GLM-4.7-flash | 150.3 tok/s | 47.8 tok/s | ~20GB
Qwen3.5:9b | 440.5 tok/s | 31.8 tok/s | ~7GB
Qwen3.5:122b-a10b | 90.4 tok/s | 19.3 tok/s | ~85GB
Qwen3.5:27b (q8_0) | 62.5 tok/s | 6.7 tok/s | ~32GB

Task Performance:

Model | Diagnostics | Morning Brief | Script Gen | Capacity | Incident Report
Qwen3.5:9b | Failed | Failed | Broken | Correct | Great
GPT-OSS:20b | Failed (no tools) | Good | Decent | Correct | Good
GLM-4.7-flash | Partial | Good+ | Decent | Error | Good
Nemotron-Cascade-2 | Best | Good | Decent | Correct | Good
Qwen3.5:27b | Passed | Creative | Best approach | Best | Best
Qwen3.5:122b-a10b | Passed | Best | Good | Correct | Great

WARNING

These results weren't perfectly reproducible. When I re-ran the same tests during video recording, results were spottier — models that passed initially failed on subsequent attempts. Local models at this size are inconsistent, and a single pass doesn't tell the full story. Take the table above as a general sense of capability, not a guarantee of behavior.

Pipeline Reality Check

Nemotron-Cascade-2 scored the best in controlled testing, but running it through the full OpenClaw pipeline — Matrix delivery, Uptime Kuma webhooks, heartbeat system, queued messages — revealed issues. The model leaked persona instructions into responses, suggested deploying services into existing tutorial containers instead of creating new ones, and struggled with message boundaries when the gateway queued multiple messages. These aren't necessarily model quality issues in isolation — they're interaction effects between the model and OpenClaw's message pipeline that only surface in real production use.

GLM had its own quirks — posting meta-commentary about its process alongside actual content — but it was more predictable overall. At least when GLM failed, it failed in understandable ways.

Controlled benchmarks measure capability. Production use measures reliability. They're not the same thing.

What I Learned

Nemotron-Cascade-2 scored the best in testing — fastest at 57.8 tok/s, reliable tool calling, correct on everything. But controlled testing doesn't tell the full story. Pipeline integration matters, and GLM-4.7-flash handled the full OpenClaw pipeline more cleanly.

Qwen3.5:27b writes the best content — reports, analysis, personality — but at 6.7 tok/s it's too slow for real-time agent work. Great for scheduled tasks where you can wait.

The 122B is diminishing returns. 85GB VRAM for marginal gains over Nemotron. The 122B won morning brief formatting and wrote excellent reports, but Nemotron was better at the actual agent work.

GLM-4.7-flash made a data consistency error that none of the others made. It confused used with free space — not a minor issue for a model you're trusting with infrastructure decisions. But it handled the pipeline cleanly, which matters more for an always-on agent.

The 9B can't handle agent tasks but writes decent reports. Too small for tool calling.

GPT-OSS knows what to do but won't do it. It wrote perfect SSH commands but never actually executed them. Inconsistent tool calling — uses curl but not SSH.

MoE architecture dominates for agent work. 3B active parameters at 57.8 tok/s beats 27B dense at 6.7 tok/s for everything except writing quality. Architecture matters more than parameter count.

Every model failed the Proxmox CLI test. None knew the exact pct output format. Niche tool knowledge is a blind spot regardless of model size.

These local models aren't Claude or ChatGPT. That's not a knock — it's a reality check. Cloud models have orders of magnitude more parameters, better RLHF, and dedicated tool-calling training. Local models are getting better fast, but for now, expect to iterate on prompts and accept some inconsistency. The upside is everything runs on your hardware with zero cloud dependencies.

Prompt engineering is an unexplored variable. I used straightforward prompts and didn't spend time optimizing them per model. Better prompts — especially for the morning brief and diagnostics — would likely improve results. That's work I haven't done yet, and it's worth calling out.

Recommendation

For homelab monitoring and agent tasks: GLM-4.7-flash. Nemotron won the benchmarks, but GLM handles the full pipeline — webhooks, Matrix delivery, queued messages — without leaking persona instructions or misinterpreting message boundaries. The video walkthrough in this tutorial uses GLM for that reason.

For writing-heavy tasks (reports, documentation, analysis): Qwen3.5:27b if you can tolerate the speed, or Qwen3.5:122b-a10b if you have the VRAM and want the quality ceiling.

For production: I'm staying with GLM-4.7-flash for now. It's not the fastest or the smartest, but it's the most predictable — and for an always-on agent managing your infrastructure, predictable beats impressive. I'm still new to OpenClaw, and I suspect better prompts and configuration could change the picture. If you've had better results with a different model or setup, I'd genuinely like to hear about it.

In Part 5, the focus shifts to backup and restore — proving everything built in Parts 1-4 can survive a disaster. In Part 6, Caddy and Pi-hole add HTTPS and production hardening.
