What This Is
A five-round bench harness that tests where local 30B-class models actually compete with frontier cloud models, and where they don’t. Every model gets the same brief and an identical prompt per round, a fresh per-model working directory, and no iteration unless noted. This repo is the companion artifact to the Five Rounds Deep Dive.
Each round tests a different shape of coding task:
| Round | Task | What it measures |
|---|---|---|
| R1 | Particle simulation, 60 FPS spec | Single-shot artifact generation under a hard quantitative constraint |
| R2 | One round of feedback to fix R1 | How models respond to explicit failure feedback |
| R3 | Lisp interpreter, 58-test pytest suite as spec | Semantic precision when the spec is the test suite |
| R4 | notespeak Go full-stack w/ SQLite FTS5 | On-distribution boilerplate generation with multi-file scope |
| R5 | Agentic debug: 10 planted bugs to find and fix | Multi-turn agent loop, local vs frontier head-to-head |
Contestants
| Model | Role |
|---|---|
| Qwen3-Coder-30B-A3B (Q5_K_M) | Local, coding-specialized MoE |
| Qwen3.6-27B (Q5_K_M) | Local, thinking, general flagship |
| Qwen3.5-35B-A3B (Q4_K_M) | Local, thinking MoE |
| Gemma 4 31B IT (Q5_K_M) | Local, dense (Google) |
| OpenAI Codex (gpt-5.3-codex) | Frontier control, R5 only, via Pi + ChatGPT Plus OAuth |
| Claude Sonnet 4.6 | Frontier control, R5 only, via claude -p + Claude Pro |
All six contestants ran at $0 marginal cost (locals are free; the frontier models ran via existing subscriptions).
What’s in the Repo
- `round{1,2,3,4,5}/` - per-round brief, eval artifact (tests/integration script), and per-model output directories
- `scripts/` - runner scripts that drove each round, reusable for new models
- `assets/` - screenshots, composited grids, the bench-wide scoreboard image
- `RESULTS.md` - longer-form internal writeup with details that didn’t fit in the Deep Dive
Reproducing the Bench
The harness is built so adding a new model to any round is a matter of:
- Add it to llama-swap’s `config.yaml` (or expose any OpenAI-compatible endpoint)
- Edit the `MODELS` array in the round’s runner script
- Create the per-model directory
- Run the script. Pass/fail counts and wall clock append to `_logs/SCORES.txt`.
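The bookkeeping half of those steps can be sketched in Python. This is an illustrative sketch, not the repo's actual runner: the model names, the `run_round` helper, the tab-separated log format, and the location of `_logs/` under the round directory are all assumptions, and the step that prompts the model is omitted.

```python
import re
import subprocess
import sys
import time
from pathlib import Path

# Hypothetical model identifiers; the real runner scripts define their own MODELS array.
MODELS = ["qwen3-coder-30b-a3b", "gemma-4-31b-it"]


def parse_pytest_summary(output: str) -> tuple[int, int]:
    """Extract (passed, failed) counts from pytest's summary text."""
    passed = re.search(r"(\d+) passed", output)
    failed = re.search(r"(\d+) failed", output)
    return (
        int(passed.group(1)) if passed else 0,
        int(failed.group(1)) if failed else 0,
    )


def record_score(log_path: Path, model: str, passed: int, failed: int, seconds: float) -> None:
    """Append one result line to the score log (assumed format, not the repo's)."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(f"{model}\tpass={passed}\tfail={failed}\twall={seconds:.1f}s\n")


def run_round(round_dir: Path, models: list[str] = MODELS) -> None:
    """Create each per-model directory, run its tests, and log the result."""
    for model in models:
        workdir = round_dir / model
        workdir.mkdir(parents=True, exist_ok=True)  # fresh per-model directory
        start = time.monotonic()
        # Model generation would happen here; this sketch only runs the eval.
        proc = subprocess.run(
            [sys.executable, "-m", "pytest", "-q"],
            cwd=workdir,
            capture_output=True,
            text=True,
        )
        elapsed = time.monotonic() - start
        passed, failed = parse_pytest_summary(proc.stdout)
        record_score(round_dir / "_logs" / "SCORES.txt", model, passed, failed, elapsed)
```

Appending rather than overwriting means a newly added model lands next to earlier results without re-running the whole bench.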
Hardware: the bench ran on an RTX 5090 workstation (32 GB VRAM). A 3090 (24 GB) works with smaller quants. The bench runs in under an hour end-to-end.
Full setup, prerequisites, and per-round walkthroughs are in the repo README.
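As a rough illustration of why quant size matters on a 24 GB card, weight memory can be estimated as parameters times bits per weight divided by eight. The bits-per-weight figures below are assumed approximate averages, not measured values, and the estimate ignores KV cache and runtime overhead, so treat it as back-of-envelope arithmetic only.

```python
def weights_gb(params: float, bits_per_weight: float) -> float:
    """Estimate weight memory in GB (10^9 bytes) for a quantized model."""
    return params * bits_per_weight / 8 / 1e9


# Assumed approximate average bits per weight; real GGUF files vary by model.
ASSUMED_BPW = {"Q5_K_M": 5.5, "Q4_K_M": 4.8}

# 30e9 stands in for a generic 30B-class model.
for quant, bpw in ASSUMED_BPW.items():
    size = weights_gb(30e9, bpw)
    for card, vram_gb in (("RTX 5090", 32), ("RTX 3090", 24)):
        headroom = vram_gb - size
        print(f"{quant} on {card}: ~{size:.1f} GB weights, ~{headroom:.1f} GB left for context")
```

Under these assumptions a 30B model at Q5_K_M leaves only a few GB of headroom on a 24 GB card, which is consistent with dropping to a smaller quant there.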