TNG LLM Bench

What This Is

A five-round bench harness for testing where local 30B-class models actually compete with frontier cloud models, and where they don’t. Every model in a round gets the same brief and an identical prompt, runs in a fresh per-model working directory, and gets no iteration unless noted. This is the companion artifact to the Five Rounds Deep Dive.

Each round tests a different shape of coding task:

Round	Task	What it measures
R1	Particle simulation, 60 FPS spec	Single-shot artifact generation under a hard quantitative constraint
R2	One round of feedback to fix R1	How models respond to explicit failure feedback
R3	Lisp interpreter, 58-test pytest suite as spec	Semantic precision when the spec is the test suite
R4	notespeak Go full-stack w/ SQLite FTS5	On-distribution boilerplate generation with multi-file scope
R5	Agentic debug: 10 planted bugs to find and fix	Multi-turn agent loop, local vs frontier head-to-head
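
When the test suite is the spec, as in R3, a model's score is simply how many tests pass. The sketch below shows one way to pull those counts out of a pytest summary line. The function name and the sample line are hypothetical and are not taken from the repo or from any bench result.

```python
import re


def parse_pytest_summary(line: str) -> dict[str, int]:
    """Extract outcome counts from a pytest summary line.

    Hypothetical helper: the repo's own scoring code may work differently.
    """
    counts: dict[str, int] = {}
    # pytest summaries look like "2 failed, 5 passed in 0.10s"
    for number, outcome in re.findall(r"(\d+) (passed|failed|errors?|skipped)", line):
        key = "error" if outcome.startswith("error") else outcome
        counts[key] = counts.get(key, 0) + int(number)
    return counts


def score(counts: dict[str, int]) -> str:
    """Format counts as passed/total, counting failures and errors against the model."""
    passed = counts.get("passed", 0)
    total = passed + counts.get("failed", 0) + counts.get("error", 0)
    return f"{passed}/{total}"


# Made-up sample line for illustration only, not a bench result.
sample = "===== 2 failed, 5 passed in 0.10s ====="
print(score(parse_pytest_summary(sample)))
```

Skipped tests are left out of the total here; whether that is the right call depends on how the suite is meant to be read.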

Contestants

Model	Role
Qwen3-Coder-30B-A3B (Q5_K_M)	Local, coding-specialized MoE
Qwen3.6-27B (Q5_K_M)	Local, thinking, general flagship
Qwen3.5-35B-A3B (Q4_K_M)	Local, thinking MoE
Gemma 4 31B IT (Q5_K_M)	Local, dense (Google)
OpenAI Codex (gpt-5.3-codex)	Frontier control, R5 only, via Pi + ChatGPT Plus OAuth
Claude Sonnet 4.6	Frontier control, R5 only, via `claude -p` + Claude Pro

All six contestants ran at $0 marginal cost: the local models are free to run, and the frontier models were accessed through existing subscriptions.

What’s in the Repo

round{1,2,3,4,5}/ - per-round brief, eval artifact (tests/integration script), and per-model output directories
scripts/ - runner scripts that drove each round, reusable for new models
assets/ - screenshots, composited grids, the bench-wide scoreboard image
RESULTS.md - longer-form internal writeup with details that didn’t fit in the Deep Dive

Reproducing the Bench

The harness is built so adding a new model to any round is a matter of:

1. Add it to llama-swap’s config.yaml (or expose any OpenAI-compatible endpoint).
2. Edit the MODELS array in the round’s runner script.
3. Create the per-model directory.
4. Run the script. Pass/fail counts and wall clock append to _logs/SCORES.txt.
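
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual runner: the function names, the placement of `_logs/` inside the round directory, and the tab-separated line format are all assumptions, and `evaluate` stands in for whatever drives the model and runs the round's eval artifact.

```python
import time
from pathlib import Path
from typing import Callable, Iterable

# Hypothetical model names; the real MODELS array lives in each runner script.
MODELS = ["model-a", "model-b"]


def prepare_workdir(round_dir: Path, model: str) -> Path:
    """Create the per-model working directory if it does not exist yet."""
    workdir = round_dir / model
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir


def append_score(log_file: Path, model: str, passed: int, total: int, seconds: float) -> str:
    """Append one result line to the score log (line format is an assumption)."""
    log_file.parent.mkdir(parents=True, exist_ok=True)
    line = f"{model}\t{passed}/{total}\t{seconds:.1f}s"
    with log_file.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line


def run_round(
    round_dir: Path,
    models: Iterable[str],
    evaluate: Callable[[str, Path], tuple[int, int]],
) -> None:
    """Run every model once and log pass/fail counts plus wall-clock time."""
    for model in models:
        workdir = prepare_workdir(round_dir, model)
        start = time.monotonic()
        passed, total = evaluate(model, workdir)  # returns (passed, total)
        elapsed = time.monotonic() - start
        append_score(round_dir / "_logs" / "SCORES.txt", model, passed, total, elapsed)
```

Appending rather than overwriting means a newly added model can be run alone without disturbing earlier results.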

Hardware: the bench was run on a workstation with an RTX 5090 (32 GB VRAM). A 3090 (24 GB) works with smaller quants. The bench runs in under an hour end-to-end.
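
As a rough back-of-the-envelope check on why a 24 GB card needs smaller quants, the sketch below estimates weight size from parameter count and bits per weight. The bits-per-weight figures and the fixed overhead for KV cache and runtime buffers are approximate assumptions, not measurements from this bench; real usage depends on context length and runtime settings.

```python
# Approximate average bits per weight for two quant types (assumed values).
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7}


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate weight size in GB: billions of params times bytes per param."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, quant: str, vram_gb: float, overhead_gb: float = 3.0) -> bool:
    """Rough fit check; overhead_gb is an assumed allowance for KV cache and buffers."""
    needed = weights_gb(params_billion, APPROX_BITS_PER_WEIGHT[quant]) + overhead_gb
    return needed <= vram_gb


# Example: a 31B dense model at two quant levels on 24 GB and 32 GB cards.
for quant in ("Q5_K_M", "Q4_K_M"):
    for vram in (24, 32):
        print(quant, vram, fits(31, quant, vram))
```

Under these assumptions a 31B model at Q5_K_M lands just over a 24 GB budget, while Q4_K_M fits.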

Full setup, prerequisites, and per-round walkthroughs are in the repo README.
