# Results — run 20260610-164327, prompt 001-rts-wfc

Validated 2026-06-10 by claude-fable-5[1m] (bench-validate). All four cells ran to
completion (`ok`). Automated checks executed via `nix develop <bench-root> -c` from each
workspace; code review based on source reading (file:line citations below). Manual
section left for the user.

Models: gpt-5.5 = openai-codex/gpt-5.5@xhigh · grok = grok-build/grok-build@xhigh ·
minimax = ollama-cloud/minimax-m3@xhigh · glm = ollama-cloud/glm-5.1@xhigh

## Comparison table

| Item | gpt-5.5 | grok | minimax | glm |
|---|---|---|---|---|
| A1 install (`npm install`) | 1 | 1 | 1 | 1 |
| A2 no runtime deps / engines | 1 | 1 | 1 | 1 |
| A3 build (`npm run build`) | 1 | 1 | 1 | 1 |
| A4 strict typecheck (`tsc --noEmit`) | 1 | **0** | 1 | 1 |
| A5 lint (`npm run lint`) | 1 | 1 | 1 | 1 |
| A6 tests ≥15 assertions, req. coverage | 1 | 1 | 1 | 1 |
| A7 serves (preview + HTTP 200 + bundle) | 1 | 1 | 1 | 1 |
| A8 seeded determinism | 1 | **0** | **0** | 1 |
| **Automated subtotal /8** | **8** | **6** | **7** | **8** |
| R1 WFC authenticity | 2 | 2 | 2 | 2 |
| R2 playability pass | 2 | 2 | 2 | 2 |
| R3 pathfinding | 2 | 2 | 2 | 2 |
| R4 architecture | 2 | 2 | 2 | 2 |
| R5 AI structure | 2 | 2 | 2 | 2 |
| R6 type discipline | 1 | 1 | 2 | 1 |
| R7 smells | 1 | 1 | 1 | 1 |
| **Review subtotal /14** | **12** | **12** | **13** | **12** |
| **Total /22** | **20** | **18** | **20** | **20** |

Run durations: grok 911s · gpt-5.5 1421s · glm 2156s · minimax 4910s.
Test suites (vitest, all green): minimax 49 tests/~100 assertions · glm 21/33 ·
grok 11/29 · gpt-5.5 9/24.

## Per-cell findings

### openai-codex/gpt-5.5 — 20/22

- All automated items pass; zero `Math.random` in src; seed via `random.ts:78-89`
  (`seedFromQuery`, default 13371337), displayed in menu and HUD (`renderer.ts:90,384`).
- R1=2: bitmask domains + entropy heap + constraint propagation (`wfc.ts:162-183`,
  adjacency table `wfc.ts:10-17`). The most algorithmically sophisticated WFC of the four.
- R2=2: `validatePlayability` (`wfc.ts:387-402`) checks two starts, buildable patches,
  land reachability, gold+forest near both; deterministic repair (corridor carving,
  fair resource placement) and fail-fast throw if repair fails. Start positions are
  formula-placed (margin + jitter) rather than searched.
- R3=2: typed-array A* with corner-cut guard (`pathfinding.ts:132-142`), separation
  steering on movement (`simulation.ts:607-668`).
- R5=2: timer-driven AI state machine (`simulation.ts:943-1044`); count-based build
  plans implicitly rebuild destroyed buildings; difficulty scales resources, harvest
  rate, wave cadence.
- R6=1: zero `any`, but 219 non-null assertions (91 in simulation.ts) — not "rare".
- R7=1: simulation.ts is 1326 lines (god file); otherwise clean, stats data-driven.
- Thinnest test suite (9 cases / 24 assertions) — meets the ≥15-assertion bar with
  required coverage, but barely.
- **Comparability caveat:** this cell had extra tools the others lacked. The codex
  runtime loads `~/.codex/config.toml`, which registers the user's `codegraph` and
  `ledger` MCP servers — pi's `--mcp-config` cannot override codex-internal config.
  The model used the ledger (workspace `docs/` contains a cq ledger with an archived
  milestone, created mid-run at 17:43). `docs/` is harness contamination, not model
  output; it was ignored for grading.
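
The seed handling noted in the first bullet (a `seedFromQuery` helper with default 13371337 and no `Math.random` in src) can be illustrated with a minimal sketch. Only the function name and the default value come from the findings above; the regex parsing, the `createRng` name, and the choice of the mulberry32 generator are assumptions for illustration and may differ from what `random.ts` actually does.

```typescript
// Hypothetical sketch: only `seedFromQuery` and the default 13371337 are from the report.
const DEFAULT_SEED = 13371337;

/** Parse `seed=<n>` from a query string, falling back to the default seed. */
function seedFromQuery(search: string): number {
  const digits = /(?:^|[?&])seed=(\d+)/.exec(search)?.[1];
  if (digits === undefined) return DEFAULT_SEED;
  // Coerce to an unsigned 32-bit integer so the PRNG state is well defined.
  return Number.parseInt(digits, 10) >>> 0;
}

/** mulberry32 (assumed generator): small 32-bit PRNG returning floats in [0, 1). */
function createRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```

In a browser this would be wired up as `createRng(seedFromQuery(window.location.search))`, with the resulting function passed to every system that needs randomness, so that a given `?seed=<n>` replays identically.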

### ollama-cloud/minimax-m3 — 20/22

- A8=0: 6 `Math.random` calls — `ai.ts:85,224` (AI target/composition choices) and
  `sim.ts:785,792,814` (damage rolls). Map generation is fully seeded, but the spec
  required ALL randomness through the seeded PRNG; combat replays are not reproducible.
- R1=2: bitmask wave + popcount entropy + directional adjacency rules with stack
  propagation and seeded restarts (`wfc.ts:30-53,140-206`).
- R2=2: start pairing enforces mutual reachability, gold+forest radii, and a
  resource-fairness ratio ≥0.5 (`mapgen.ts:85-178`); deterministic derived-seed retries.
- R3=2: the only binary-heap A* of the four (`pathfind.ts:31-133`); corner-cut guard;
  soft separation (`sim.ts:490-517`).
- R5=2: 1 Hz throttled AI with priority-sorted build plan (supply > barracks > mill >
  towers), per-difficulty wave tables, escalating composition (`ai.ts:131-323`).
- R6=2: zero `any`, zero non-null assertions — the cleanest type discipline; casts are
  narrow discriminated-union narrowing.
- R7=1: sim.ts 1166 lines; otherwise pristine, all stats in data.ts.
- Strongest test suite by far: 49 tests/~100 assertions including a 10-second
  simulation crash-test and "AI builds barracks within 60s".
- LOS-blocking fog (Bresenham through forest/rock) — beyond spec.
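
The "bitmask wave + popcount entropy" approach in the R1 bullet can be sketched as follows. This is an illustration under stated assumptions, not the code in `wfc.ts`: the function names, the 32-tile limit, and the reservoir-sampling tie-break are hypothetical. Each cell's domain is a bitmask of tiles still allowed, and the number of set bits serves as a cheap entropy proxy.

```typescript
// Hypothetical sketch: names and the tie-break strategy are not taken from wfc.ts.

/** Count set bits in a 32-bit domain mask (the number of tiles still possible). */
function popcount(mask: number): number {
  let v = mask >>> 0;
  v = v - ((v >>> 1) & 0x55555555);
  v = (v & 0x33333333) + ((v >>> 2) & 0x33333333);
  v = (v + (v >>> 4)) & 0x0f0f0f0f;
  return Math.imul(v, 0x01010101) >>> 24;
}

/**
 * Return the index of the uncollapsed cell with the fewest remaining tiles,
 * or -1 when every cell is collapsed. Ties are broken through the supplied
 * seeded `rng`, never through Math.random, so the choice stays reproducible.
 */
function pickLowestEntropyCell(domains: Uint32Array, rng: () => number): number {
  let best = -1;
  let bestCount = Number.POSITIVE_INFINITY;
  let ties = 0;
  for (let i = 0; i < domains.length; i++) {
    const count = popcount(domains[i] ?? 0);
    if (count === 0) throw new Error(`contradiction at cell ${i}`);
    if (count === 1) continue; // already collapsed to a single tile
    if (count < bestCount) {
      best = i;
      bestCount = count;
      ties = 1;
    } else if (count === bestCount) {
      ties++;
      // Reservoir sampling: each tied cell ends up chosen with equal probability.
      if (rng() * ties < 1) best = i;
    }
  }
  return best;
}
```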

### ollama-cloud/glm-5.1 — 20/22

- All automated items pass. The single `Math.random` (`main.ts:15`) only generates a
  fallback seed when `?seed=` is absent; the seed is then displayed — reproducibility
  preserved, so A8=1.
- R1=2: set-based superposition, weighted minimal-entropy collapse, BFS propagation
  over explicit adjacency table (`wfc.ts:17-172`).
- R2=2: `validateAndFixMap` (`wfc.ts:227-266`) — reachability check, gold/forest
  insertion repair, start clearing; 10 deterministic retries then fallback map.
- R3=2: octile-heuristic A* with corner-cut guard (`astar.ts:56-147`); unit collision
  handled via entity-occupancy-aware walkability (`excludeEntity`) plus repath, no
  separation steering.
- R5=2 with a gap: AIController is well-structured (plan, saturation maps, waves,
  defense — `ai.ts:16-104`) but the build order is a fixed sequential array; destroyed
  buildings are never rebuilt.
- R6=1: zero `any`, but 127 non-null assertions across 4131 LOC.
- R7=1: game.ts 1220 lines; mild path-following duplication (`game.ts:387-443` vs
  `game.ts:962-990`); stats data-driven in constants.ts.
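
The R5 gap above (a fixed sequential build order that never rebuilds) contrasts with the count-based plans noted for gpt-5.5. A minimal sketch of the count-based idea follows; the building kinds, target counts, and the `nextBuild` name are hypothetical and not taken from any cell's source. Because the plan compares targets against the buildings currently alive, a destroyed building automatically reappears as the next thing to build.

```typescript
// Hypothetical sketch: building kinds and target counts are illustrative only.
type BuildingKind = "farm" | "barracks" | "tower";

interface BuildTarget {
  kind: BuildingKind;
  count: number;
}

// Listed in priority order; earlier entries are satisfied first.
const PLAN: readonly BuildTarget[] = [
  { kind: "farm", count: 2 },
  { kind: "barracks", count: 1 },
  { kind: "tower", count: 2 },
];

/** Return the first kind whose living count is below its target, or null if satisfied. */
function nextBuild(
  alive: readonly BuildingKind[],
  plan: readonly BuildTarget[] = PLAN,
): BuildingKind | null {
  const counts = new Map<BuildingKind, number>();
  for (const kind of alive) counts.set(kind, (counts.get(kind) ?? 0) + 1);
  for (const target of plan) {
    if ((counts.get(target.kind) ?? 0) < target.count) return target.kind;
  }
  return null;
}
```

A fixed sequential array, by contrast, only advances an index, so once an entry has been built the controller has no way to notice that the building was later lost.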

### grok-build/grok-build — 18/22

- A4=0: `npx tsc --noEmit` fails with 6 × TS2304 — `main.ts` references `hashSeed`,
  `NUM_LEVELS`, `SIM_DT` without importing them. `npm run build` passes only because
  the build script is bare `vite build` (esbuild transpiles without typechecking).
  **Expected runtime impact:** `hashSeed` is called inside `startLevel()`
  (`main.ts:34-35`) and remains a free identifier in the shipped bundle — starting a
  level should throw `ReferenceError`. Manual item M1 will likely fail; verify first.
- A8=0: 8 `Math.random` calls in `sim.ts` (worker/unit spawn positions, training
  decisions, harvest assignment; `sim.ts:776-851,1319`). Separately from the missing
  import, `hashSeed` itself is defective: it calls `Math.imul(0x85ebca6b)` with a single
  argument (`rng.ts:14-23`), and the missing operand coerces to 0, so that
  multiplication always yields 0.
- A5=1 by exit code, but 87 eslint warnings (the other three are warning-free).
- R1=2: the only 8-directional adjacency model; weighted entropy with deterministic
  tie-break; stack propagation (`wfc.ts:140-262`).
- R2=2: searched start locations with resource scoring, gold-mine repair placement,
  flood-fill connectivity check with corridor carving, fallback map (`mapgen.ts:74-248`).
- R3=2: octile A* + corner-cut guard (`pathfind.ts:23-131`); 30-tick replanning;
  push-based separation (`sim.ts:331-405`). The open set is re-sorted on every
  iteration (O(n log n) each time), where minimax's binary heap pays O(log n) per
  operation.
- R5=2: monolithic but complete `updateAI` (`sim.ts:731-932`): build flags, saturation
  targets, escalating waves, reactive defense. Flag-based build tracking never rebuilds
  destroyed structures.
- R6=1: 16 `as any` + 5 `: any` + 156 non-null assertions — the loosest of the four.
- R7=1: sim.ts 1345 lines; dead/wrong code in `hashSeed`; stats data-driven in data.ts.
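
The `Math.imul` defect in the A8 bullet is easy to reproduce in isolation. The sketch below first shows the one-argument call evaluating to 0, then gives one possible corrected hash. The corrected `hashSeed` is a hypothetical replacement (FNV-1a followed by a murmur3-style finalizer, taking a string); the original function's signature and intended algorithm are not known from this report.

```typescript
// Math.imul coerces a missing second operand (undefined) to 0, so the product is 0.
// The cast only bypasses TypeScript's arity check to reproduce the runtime behaviour.
const imulLoose = Math.imul as (...args: number[]) => number;
const brokenProduct = imulLoose(0x85ebca6b); // evaluates to 0

/** murmur3-style 32-bit finalizer with both operands supplied to Math.imul. */
function fmix32(input: number): number {
  let h = input | 0;
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

/** Hypothetical replacement for hashSeed: FNV-1a over the string, then fmix32. */
function hashSeed(text: string): number {
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV 32-bit prime
  }
  return fmix32(h);
}
```

A strict `tsc --noEmit` run would normally reject a one-argument `Math.imul` call outright, which is another reason to keep the typecheck in the build script rather than relying on a bare `vite build`.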

## Manual checklist (user)

Launch any cell with `--host 0.0.0.0` to reach it from outside the VM; append
`?seed=<n>` to reproduce maps.

```
cd runs/20260610-164327/001-rts-wfc/<cell>/workspace && npm run dev -- --host 0.0.0.0
```

Per cell (`openai-codex_gpt-5_5_xhigh` · `grok-build_grok-build_xhigh` ·
`ollama-cloud_minimax-m3_xhigh` · `ollama-cloud_glm-5_1_xhigh`):

- [ ] M1 loads & starts — level select appears, level 1 starts, seed visible, no
      console errors *(grok: expect ReferenceError on start — see A4)*
- [ ] M2 economy loop — harvest gold/wood, build Farm + Barracks, train units,
      supply cap enforced
- [ ] M3 control feel — box select, shift-add, control groups, right-click orders,
      minimap + edge scrolling
- [ ] M4 combat — attack-move, ranged projectiles, auto-acquire, damage/armor
      plausible, corpses clear
- [ ] M5 pathfinding feel — 12+ unit group crosses the map and settles, no permanent
      jams
- [ ] M6 fog of war — unexplored black, explored memory, enemies hidden; minimap
      consistent
- [ ] M7 AI opponent — economy develops, first wave arrives, defends and rebuilds,
      difficulty 1 beatable *(glm, grok: rebuild not implemented — see R5)*
- [ ] M8 WFC maps — same seed reproduces, different seeds differ, terrain coherent,
      both starts viable
- [ ] M9 campaign — winning advances; later levels larger/harder
- [ ] M10 performance — ~100-unit battle stays smooth; pause + speed toggle work

## Summary (ranked)

gpt-5.5, minimax-m3, and glm-5.1 all finish at 20/22, a three-way tie on this rubric,
and the differences are qualitative rather than rankable. gpt-5.5 has the strongest
algorithmic core (bitmask WFC, fail-fast playability validation, zero `Math.random`) but
the thinnest tests (9 tests / 24 assertions) and a comparability caveat (it alone had
the codegraph/ledger MCP tools). minimax-m3 has the cleanest types, the only binary-heap
A*, and by far the largest test suite (49 tests, about 5× gpt-5.5's 9), but broke the
determinism requirement with unseeded AI choices and damage rolls. glm-5.1 is the most
uniformly compliant, with no standout strengths and no defects beyond a non-rebuilding
AI. grok-build trails at 18/22 with the only artifact that fails strict typechecking
(six TS2304 errors from three unimported identifiers, of which `hashSeed` likely
crashes level start), plus the most determinism violations (8 `Math.random` calls) and
the loosest type discipline, despite a genuinely complete feature set on paper. Manual
play (especially M1 for grok and M7 for glm/grok) is the remaining discriminator.
