
v4_deep_calibration_log

generated_by: scripts/build_v4_deep_pilot_batch.py + sub-agent run against prompts/extraction_v4_deep.md
generated_at: 2026-04-29T02:05:02Z
sqlite_snapshot: 2026-04-25T16:31:05Z
extraction_run_id: 2026-04-29T02:05:02Z
filters:
  pilot_set: hardcoded 10 BGG IDs (CALIBRATION_BGG_IDS_V4 in scripts/enrich_llm.py)
prompt_runtime: prompts/extraction_v4_deep.md
prompt_calibration: prompts/extraction_v4_deep_calibration.md
model: Sonnet 4.6 (sub-agent invocation, not direct SDK)
inputs: data/llm_input_v4_deep/batch_pilot_heat.json + batch_pilot_v4_deep.json
outputs: data/llm_output_v4_deep/batch_pilot_heat.jsonl + batch_pilot_v4_deep.jsonl
top_k: 10
plan: docs/plans/v4-archetype-fit-pipeline.md
milestone: W2-M3 (deep calibration pilot)

This log captures the W2-M3 calibration pilot for the v4 archetype-fit deep pass. The pilot validates that a sub-agent following prompts/extraction_v4_deep.md can produce schema-compliant JSONL with the four-archetype rubric, that anchor games hit their expected archetype top-decile, and that the rubric discriminates across archetypes (heterogeneity check).

Calibration set

Same 10 games as W1-M2 (per plan). Heat scored separately as the rubric scaffold; the remaining 9 ran in batch_pilot_v4_deep.json.

| # | bgg_id | name | prior_density | rationale |
|---|--------|------|---------------|-----------|
| 1 | 366013 | Heat: Pedal to the Metal | full | Anchor — Balatro top-decile canary |
| 2 | 36218 | Dominion | partial | Anchor — engine-growth source for Balatro |
| 3 | 122522 | Smash Up | thin | Anchor — Snap source-game proxy (no v2/v3 priors) |
| 4 | 126163 | Tzolk'in: The Mayan Calendar | full | Worker-recall + tableau diversity |
| 5 | 413246 | Bomb Busters | full | Cooperative deduction stress test |
| 6 | 217372 | The Quest for El Dorado | full | Deck-building + route-committal |
| 7 | 193738 | Great Western Trail | full | Route + tableau hybrid |
| 8 | 164928 | Orléans | full | Bag-building + presence tracks |
| 9 | 42 | Tigris & Euphrates | full | Region-majority + min-score |
| 10 | 189932 | Tyrants of the Underdark | full | Deck-building + area-control hybrid |

Pilot run

  • SQLite snapshot date: 2026-04-25T16:31:05Z
  • Generated at: 2026-04-29T02:05:02Z
  • Prompt iterations performed: 0 (first-pass compliance hit 100%)
  • Cache control / structured output: NOT verified (sub-agent invocation, not direct SDK). Same caveat as W1-M2.

JSON-schema compliance

  • Total output lines: 10 (1 Heat + 9 batch)
  • Lines parseable as JSON: 10
  • Lines matching schema (theme block + 4 archetype_fits × 10 dims + 8-entry mda_survival + confidence): 10
  • Compliance rate: 100% (≥90% threshold cleared)
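The compliance check above can be sketched as a small validator. Field names (theme, archetype_fits, mda_survival, confidence) and the four archetype keys are assumptions reconstructed from this log, not the prompt's exact schema:

```python
import json

ARCHETYPES = {"balatro", "snap", "wordle", "cozy"}

def check_line(line: str) -> bool:
    """True if one JSONL line matches the shape described above:
    theme block + 4 archetype_fits x 10 dims + 8-entry mda_survival + confidence."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return False
    fits = row.get("archetype_fits", {})
    return (
        set(fits) == ARCHETYPES
        and all(len(dims) == 10 for dims in fits.values())
        and "theme" in row
        and len(row.get("mda_survival", {})) == 8
        and "confidence" in row
    )

def compliance_rate(path: str) -> float:
    """Fraction of non-empty lines in a batch JSONL that pass check_line."""
    with open(path) as f:
        lines = [ln for ln in f if ln.strip()]
    return sum(check_line(ln) for ln in lines) / max(len(lines), 1)
```

A real validator would also enforce per-dimension score ranges and the verdict/condition rule; this only checks shape.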

Note: the existing Heat row uses bare verdict strings ("Sensation":"conditional") for the mda_survival map, while the prompt schema specifies {verdict, condition} objects with condition required when verdict='conditional'. The 9 new lines follow the prompt schema. Reconcile in W2-M4 — pick one and update either the Heat row or the prompt.

Cross-archetype heterogeneity (composites = unweighted mean of 10 dims)

| Game | balatro | snap | wordle | cozy | spread (max-min) |
|------|---------|------|--------|------|------------------|
| Heat | 7.60 | 4.70 | 4.30 | 4.70 | 3.30 |
| Dominion | 7.80 | 4.80 | 4.20 | 5.00 | 3.60 |
| Smash Up | 4.90 | 7.20 | 3.60 | 3.80 | 3.60 |
| Tzolk'in | 4.40 | 3.00 | 2.30 | 3.20 | 2.10 |
| Bomb Busters | 6.00 | 5.70 | 6.00 | 5.20 | 0.80 |
| Quest for El Dorado | 6.90 | 4.20 | 3.90 | 4.90 | 3.00 |
| Great Western Trail | 4.80 | 2.50 | 2.40 | 3.10 | 2.40 |
| Orléans | 5.80 | 3.20 | 2.90 | 4.00 | 2.90 |
| Tigris & Euphrates | 4.20 | 3.60 | 2.60 | 3.20 | 1.60 |
| Tyrants of the Underdark | 6.30 | 4.40 | 3.30 | 3.70 | 3.00 |
  • Mean spread: 2.66 (>1 pt threshold for "discriminating" — rubric is not homogenizing).
  • Anchor-fit checks pass:
    • Heat → Balatro 7.60 (top-decile canary ✓ per plan acceptance criterion ≥7.5)
    • Dominion → Balatro 7.80 (engine-growth anchor ranks #1 on Balatro ✓)
    • Smash Up → Snap 7.20 (Snap source-game ranks #1 on Snap ✓)
  • Smallest spread: Bomb Busters (0.80). Cooperative deduction game scores within 1 pt across all 4 archetypes. See ambiguity #1 below.
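The composite and spread arithmetic is simple enough to pin down in a few lines (a sketch; the example uses Heat's four composites from the table, not real per-dimension scores):

```python
from statistics import mean

def composite(dims: dict[str, float]) -> float:
    """Unweighted mean of the 10 dimension scores, rounded to 2 dp."""
    return round(mean(dims.values()), 2)

def spread(composites: dict[str, float]) -> float:
    """Heterogeneity check: max - min across the four archetype composites."""
    return round(max(composites.values()) - min(composites.values()), 2)

heat = {"balatro": 7.60, "snap": 4.70, "wordle": 4.30, "cozy": 4.70}
assert spread(heat) == 3.30  # matches the Heat row in the table
```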

Rubric ambiguities surfaced (W2-M3 follow-up)

  1. Cooperative games compress the rubric. Bomb Busters spread = 0.80; rubric anchors are all competitive-or-solo. Co-op deduction has no engine growth (drags Balatro), no faction PvP (drags Snap), no daily-puzzle compression (drags Wordle), but loss_tolerance_fit for Cozy is no longer pinned by the "competitive end-state" cap. Net: all four archetypes converge on mediocre mid-band scores. Decision needed: introduce a 5th archetype (co-op) or accept that co-op games legitimately score in the 4-6 band across the board (no archetype is the right shape for them).

  2. loss_tolerance_fit cap clarification for cooperative fail-states. The Cozy rubric caps at 3 "when the game has any competitive end-state where one player loses." Bomb Busters has a fail-state (detonator) but the loss is shared, not interpersonal. Pilot scored 5 (uncapped). Decision needed: the cap should explicitly say competitive (interpersonal) loss, not shared fail-state. Edit prompts/extraction_v4_deep.md §dimension definitions.

  3. mda_survival schema inconsistency between Heat anchor and prompt. Heat row uses "Sensation":"conditional" (bare string); prompt requires {"verdict": "...", "condition": "..."}. The 9 new lines emit objects. Decision needed: pick the object form (richer, plan-aligned), update Heat row in batch_pilot_heat.jsonl, ensure aggregate_llm_v4_deep.py (W2-M4 scope) handles object form only.
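One way to reconcile programmatically would be a normalizer like this (hypothetical helper — field names assumed; entries coerced from a bare string get condition=None, which still violates the "condition required when conditional" rule and so flags them for hand-editing):

```python
def normalize_mda(mda: dict) -> dict:
    """Coerce bare-string verdicts ("Sensation": "conditional") to the
    prompt's object form ({"verdict": ..., "condition": ...})."""
    out = {}
    for aesthetic, value in mda.items():
        if isinstance(value, str):
            # Legacy form: keep the verdict, leave condition for hand-editing.
            out[aesthetic] = {"verdict": value, "condition": None}
        else:
            out[aesthetic] = value  # already object-form; pass through unchanged
    return out
```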

Vocabulary / primitive signal

Not measured this pilot — v4-deep doesn't emit a vocabulary tier (the primitives axis lives in v4-wide). The archetype_fit_rationale text is the qualitative artifact; spot-check confirms primitives are cited by name (e.g., Heat's engine_pollution, Dominion's deck-thinning, Smash Up's faction-base scoring). No _other: escape valve in this schema.

Throughput measurement

  • Sub-agent wall time: ~5.4 min for 9 games (323 s)
  • Per-game wall time: ~36 s (vs ~22 s/game for v4-wide pilot)
  • Total tokens: ~99.7k (sub-agent run-level)
  • Reset-window throughput: NOT measured — same SDK-instrumentation gap as W1-M2.

Per-game wall time is ~1.6× the v4-wide rate, expected given the larger output (4 archetypes × 10 dims + 8 mda survival + theme vs v4-wide's compact feature vector).

Decisions

  • Ship as-is — rubric passes the M3 acceptance bar (≥90% compliance, anchors hit expected archetype top-decile, mean spread > 1 pt). Move to W2-M4 prep.
  • Ambiguity #2 resolved (this session): edited prompts/extraction_v4_deep.md loss_tolerance_fit definition to clarify that the Cozy 3-cap applies to interpersonal competitive loss only; cooperative shared fail-states do not trigger the cap.
  • Ambiguity #3 resolved (this session): re-ran Heat against the corrected prompt; data/llm_output_v4_deep/batch_pilot_heat.jsonl now uses object-form mda_survival ({verdict, condition}). All numeric dims preserved from the original Heat row — schema-shape fix only. W2-M4 aggregator can assume object form.
  • Defer the cooperative-archetype question (ambiguity #1) to a separate decision — does W2 want a 5th archetype or accept co-op games scoring mid-band? This is product-scope, not rubric-scope.
  • W2-M4 unblocker: build scripts/aggregate_llm_v4_deep.py and the game_llm_archetype_fit_v4 table schema (4 rows per game × 10 dim columns + theme block + mda survival blob + composite). Same aggregate_llm_v4.py shape, expanded.
  • Token economics moves still NOT verified — same as W1-M2. Sub-agent invocation hides cache-hit rate and per-call instrumentation. Defer SDK-wrap to a separate task before W2-M4 if throughput becomes load-bearing.

W2-M0 gate decision (2026-04-29)

Skip paid Sensor Tower. Falling back to manual App Store + public-data curation for data/seed/mobile_adaptations.csv. Plan-budgeted impact: W2-M2 expands from 3 days to 2 calendar weeks.

Rationale: paid API spend not justified at this stage; public sources (App Store charts archive, Sensor Tower free tier, postmortems, GDC talks, dev interviews) cover the 50-entry curation depth needed for the case-file's purpose (archetype-fit calibration anchor, not market sizing).

Schema unchanged — peak_grossing_rank and peak_dau_estimate remain nullable; entries without verifiable numbers ship with those columns blank and learnings carrying the qualitative read.

Files produced

  • data/llm_input_v4_deep/batch_pilot_v4_deep.json (9 games, compact-prior payloads)
  • data/llm_output_v4_deep/batch_pilot_v4_deep.jsonl (9 lines, schema-valid)
  • scripts/build_v4_deep_pilot_batch.py (one-shot batch builder; pilot-scope, not wired into enrich_llm.py)
  • This log: reports/v4_deep_calibration_log.md

W2-M4 throughput + cost (live, 2026-04-29)

Production run is mid-flight. Snapshot below.

Progress

  • 114 / 300 games enriched (38%) — pilot (10) + batches 0-25 (104) of the top-20% slice.
  • 456 archetype-fit rows in game_llm_archetype_fit_v4 (114 × 4 archetypes).
  • 114 theme rows in game_llm_theme_v4.
  • Heat-anchor calibration: balatro composite 7.60, rank 3 / 114, top-decile ✓ (W2 acceptance criterion holding).
  • 20 games above the 7.0 candidate threshold (9 balatro, 10 snap, 1 wordle (Orchard 7.10), 0 cozy). First wordle ≥ 7.0 candidate landed in batch 24.

Throughput

| Wave | Batches | Games | Pattern | Wall time |
|------|---------|-------|---------|-----------|
| 1 (warm-up) | 0, 1 | 8 | sequential, 1-at-a-time | ~9 min |
| 2 | 2-10 | 36 | 9-way parallel (sub-agents) | ~3 min |
| 3 | 11-20 | 40 | 10-way parallel | ~3 min |
| 4 | 21-25 | 20 | 5-way parallel | ~2 min |
| Total | 0-25 (26 batches) | 104 | — | ~17 min wall |

  • Per-batch sub-agent wall time: ~135s avg (range 131-167s, 4 games per batch).
  • Effective per-game wall time: ~33 s/game (the W1-M2 wide pilot estimated ~22 s/game; deep production is ~1.5× higher, consistent with the deep > wide expectation).
  • Parallelization speedup is load-bearing. 84 games serialized would have been ~50 min wall; 9-10-way parallel collapsed that to ~6 min. The remaining 186 games will take 5-6 more parallel waves (~15-20 min wall) at this cadence.

Cost shape (Claude Code Max-plan quota)

  • Each sub-agent invocation: ~70K total tokens (per usage logs across 19 production batches).
  • 24 batches × 70K ≈ 1.7M tokens consumed for 104 games (~16K tokens/game).
  • No API spend. All on Max-plan ($200/mo) sub-agent quota per the plan's "no API billing" out-of-scope rule.
  • Reset-window concerns from M1-M2 did not bind during the parallel waves — quota held across all 19 invocations in this session.

Calibration drift checks (parallel run, no rubric edits)

  • Same prompt + same reference output across all batches; cross-batch consistency verified by LotR: The Confrontation appearing in both batch_11 and batch_13 under different bgg_ids (3201 and 18833 — separate BGG editions) — both scored independently to snap composite 7.50. Reproducibility within margin.
  • All sub-agents independently flagged the same rubric edge: cozy loss_tolerance_fit ≤3 cap on competitive end-states is over-restrictive. Tonal-cozy candidates (Tea Dragon Society, Kingdom Builder, Trekking, Welcome To, Bosk, Chai, Ohanami) all hit the cap. Only 2 / 94 games top-cozy; zero clear 6.0. W2-M6 recalibration target firmly identified.
  • Snap dominates the top tier more than the original W2 hypothesis assumed: 9 of 17 above-7.0 candidates are snap-archetype, vs the plan's implicit balatro-as-dominant framing.

Files

  • data/llm_input_v4_deep/batch_{0..25}.json (26 batches, 104 games queued)
  • data/llm_output_v4_deep/batch_{0..25}.jsonl (26 batches, 104 schema-valid lines)
  • SQLite: game_llm_archetype_fit_v4 (456 rows), game_llm_theme_v4 (114 rows)

Quota learnings (load-bearing for the remaining 186 games)

What this session actually cost — measured, not estimated.

Per-game token cost is now empirical, not speculative.

  • ~16K tokens per game via sub-agent invocation. The plan's W1-M2 / W2-M3 budgets deferred this to "reset-window wall-time" because sub-agent invocation hides per-call token counts. After 26 batches the per-game number is stable across waves.
  • 104 games × 16K ≈ 1.7M sub-agent tokens consumed for the production run alone (excluding pilot, excluding main-session orchestration cost).
  • Extrapolation: completing W2-M4 (186 games remaining) ≈ 3M more sub-agent tokens at this rate.

Parallelization helps wall-time, not quota.

  • 9-10-way parallel sub-agents fire ~700K tokens in 3 min of wall time. Serial firing the same 9 batches would cost the same 700K but take ~20 min wall.
  • Reset-window throttling that the plan called out as a binding risk did not bind for 26 batches in one session. The 5-hour Max-plan windows held across all four waves. This is good news for resumed runs: a single session can plausibly finish the remaining 186 games (3M tokens) without window-staging gymnastics, assuming nothing else in the session is burning quota.
  • Caveat: this measurement was on Sonnet 4.6 sub-agents on a Max-plan account. Heavier orchestrator usage (Opus, large file reads in the parent) would compete for the same window.

Main-session context is the actual bottleneck, not quota.

  • Each sub-agent's structured report (~3-5KB markdown) gets read into the parent context. 26 sub-agent reports cost the parent ~70 percentage points of context window in this session (started at ~17%, ended at ~89%).
  • Per-batch parent-context burn: ~2.5-3 pp. The remaining 186 games at 47 batches × 2.5pp = 117pp would need ~2 fresh contexts to ship.
  • Mitigation for next session: instruct sub-agents to write a one-line summary (per-game top + composite, no narrative) and dump full per-game ambiguity notes into a per-batch markdown file at reports/v4_deep_batch_NN_notes.md. Parent reads the one-liner; ambiguity log stays on disk for later. Cuts parent burn ~5x.

Code path divergence.

  • Pilot batches used scripts/build_v4_deep_pilot_batch.py (one-shot). Production used scripts/enrich_llm.py prepare --depth deep-v4. Both produced equivalent batch JSON; the production path's _next_batch_idx had a bug on non-numeric pilot suffixes (fixed in 465fdbd). Future resumes only use the production path.

Cross-batch reproducibility verified for free.

  • LotR Confrontation appeared in both batch_11 and batch_13 at different bgg_ids (3201 vs 18833 — separate BGG editions of the same Knizia design). Both scored snap composite 7.50 independently. Within the rubric's expected ±0.5 spread, this is exact match. The duplicate happened by accident (the prepare query selects by bgg_id; same name across editions slips through). Worth keeping the duplicate in the corpus as a permanent reproducibility canary.
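A canary check like the LotR duplicate could be automated. Column names (name, bgg_id, archetype, composite) are assumptions about the game_llm_archetype_fit_v4 schema, so treat this as a sketch:

```python
import sqlite3

def duplicate_canaries(con: sqlite3.Connection, tolerance: float = 0.5):
    """Find same-name games scored under different bgg_ids and flag whether
    their composites agree within the rubric's expected +/-0.5 spread."""
    pairs = con.execute(
        """
        SELECT a.name, a.bgg_id, b.bgg_id, a.composite, b.composite
        FROM game_llm_archetype_fit_v4 AS a
        JOIN game_llm_archetype_fit_v4 AS b
          ON a.name = b.name
         AND a.archetype = b.archetype
         AND a.bgg_id < b.bgg_id
        """
    ).fetchall()
    return [(name, id_a, id_b, abs(c_a - c_b) <= tolerance)
            for name, id_a, id_b, c_a, c_b in pairs]
```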

M7 completion run (2026-04-30, Opus 4.7 1M-context orchestrator)

Closing run that took deep-v4 from 114 → 301 games, hitting the 20% top-discovery acceptance criterion.

Progress

  • 47 batches (batch_26..batch_72), 187 new games scored. End state: game_llm_archetype_fit_v4 1,204 rows / 301 games; game_llm_theme_v4 301 rows.

Throughput

  • 6 parallel general-purpose sub-agents, each handling 7-8 batches:

    | Agent | Batches | Games | Sub-agent tokens | Wall (s) |
    |-------|---------|-------|------------------|----------|
    | A | 26-33 (8) | 32 | 98,146 | 569 |
    | B | 34-41 (8) | 32 | 184,547 | 796 |
    | C | 42-49 (8) | 32 | 140,443 | 779 |
    | D | 50-57 (8) | 32 | 127,688 | 708 |
    | E | 58-64 (7) | 28 | 126,355 | 766 |
    | F | 65-72 (8) | 31 | 92,056 | 438 |
    | Total | 47 | 187 | 769,235 | ~13.3 min wall (max-of-parallel) |
  • Per-game sub-agent cost: ~4.1K tokens/game — ~4× cheaper than the 16K/game W2-M4 measurement. Likely drivers: (1) sub-agents read the v4-deep prompt once instead of re-reading per batch, (2) parent did not re-bundle the prompt into each agent's input (just pointed at the path), (3) batch instructions were terser this round.

  • Wall-clock: 6-way parallelism cut what would have been ~78 min serial into ~13 min.

Cost shape (Claude Code Max-plan quota)

  • Robert's quota meter: 0% → ~72% over the full session (6 agents in parallel + aggregate + analyze + Neon sync + commit). Sub-agent fanout was the bulk of spend; aggregate/analyze/sync added negligible main-context tokens.
  • At the W2-M4 rate (16K tokens/game), the 187 games projected ~3M sub-agent tokens; actual was 769K — roughly 4× under, because the per-game overhead shrank (detailed under Plan-vs-actual below).
  • No API billing. All on Max-plan ($200/mo) sub-agent quota per the plan's "no API billing" out-of-scope rule.

Plan-vs-actual

  • W2-M4 quota learnings predicted: "completing remaining 186 games ≈ 3M more sub-agent tokens" and "context-burn ≈ 117pp → needs 2 fresh contexts to ship". Both estimates were too pessimistic.
    • Actual sub-agent tokens: 769K (~4× under projection).
    • Actual context burn: single Opus 1M-context session shipped end-to-end with quota at 72% — never needed a context reset.
  • Driver of the gap: the W2-M4 mitigation ("instruct sub-agents to terse one-line summaries; ambiguity log on disk") was applied this round. Sub-agent reports came back as one-line "wrote batches X..Y, M games total" instead of multi-KB markdown. Parent context stayed lean.

Files

  • data/llm_input_v4_deep/batch_{26..72}.json (47 batches added this run; 73 total exists in directory)
  • data/llm_output_v4_deep/batch_{26..72}.jsonl (47 batches added this run; 75 total)
  • SQLite: game_llm_archetype_fit_v4 (1,204 rows / 301 games), game_llm_theme_v4 (301 rows)
  • Reports regenerated via scripts/analyze.py; Neon dashboard synced via web/npm run sync.

Acceptance criteria verification

  • Heat top-decile Balatro-fit: Heat: Pedal to the Metal scores balatro composite 7.60, percentile 0.990 (3rd of 301). Comfortably top-1%, well above the ~7.5 W2-M3 calibration target.
  • Unanticipated mechanic-archetype pair in top decile: Flip-and-write + 9-card solitaires dominate the Wordle ∩ Cozy intersection. The original plan proxied Wordle as "no clean board source"; the data answers it: Orchard (Wordle 7.10 / Cozy 6.30), Lucky Numbers, Welcome To..., Silver & Gold, Kokoro: Avenue of the Kodama, Cahoots all show up top-decile for both archetypes. The shared primitive cluster is Paper-and-Pencil + Solo Solitaire + Pattern Building — a different shape than Carcassonne/Dorfromantik, which the Cozy proxy implicitly anchored on tile-laying. This is the most actionable archetype-shape signal of the run: the next "Wordle-on-mobile" candidate is likely a paper-and-pencil or micro-solitaire derivative, not a word puzzle clone.
  • Composite spread holds across archetypes: Top-decile thresholds (P90 composite) — Balatro 6.50, Snap 6.10, Wordle 6.10, Cozy 5.60. Balatro has the richest right tail (Dominion 7.80, Heat 7.60); Cozy is the tightest because the eligibility filter pre-selects against its no-fail hallmark and the cozy loss_tolerance_fit ≤3 cap on competitive end-states still bites — a known M6-revision target.

M6 cozy loss_tolerance_fit recalibration (2026-04-30, heuristic v0.1)

The blanket ≤3 cap was over-restrictive: 77% of cozy rows hit the cap, the right tail flattened, and tonal-cozy candidates with incidental scoring competition (Welcome To, Kingdom Builder, Azul, Ticket to Ride: London) were under-scored. Rule replaced with a tiered cap:

| Tier | Trigger BGG mechanism tags | New cap |
|------|----------------------------|---------|
| HARD | Player Elimination, Take That, Card Play Conflict Resolution | 3 |
| OPEN | Worker Placement, Area Majority / Influence, Auction / Bidding (any), Race, Sudden Death Ending | 5 |
| EXEMPT | Cooperative Game OR Solo / Solitaire (and no HARD/OPEN) | no cap |
| TONAL | Tile Placement, Pattern Building, Pattern Recognition, Paper-and-Pencil, Roll/Spin/Flip and Write, Layering, Grid Coverage, Network and Route Building (and no HARD/OPEN/EXEMPT) | no cap; lift to 6 if LLM was at the old binding cap (≤3) |
| UNKNOWN | None of the above | trust LLM verbatim |

HARD/OPEN take precedence over EXEMPT — Heat: Pedal to the Metal has a Solo/Solitaire mode tag but is a Race; the Race tag wins and Heat gets cap 5, not the no-cap exemption. This was a bug found during the v0 spot-check.
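The precedence rule reduces to a small function. The tag sets below are copied from the tier table; treating "Auction / Bidding (any)" as a prefix match is an assumption about how scripts/recalibrate_cozy_v4.py matches BGG tags:

```python
HARD = {"Player Elimination", "Take That", "Card Play Conflict Resolution"}
OPEN = {"Worker Placement", "Area Majority / Influence", "Race",
        "Sudden Death Ending"}
EXEMPT = {"Cooperative Game", "Solo / Solitaire"}
TONAL = {"Tile Placement", "Pattern Building", "Pattern Recognition",
         "Paper-and-Pencil", "Roll/Spin/Flip and Write", "Layering",
         "Grid Coverage", "Network and Route Building"}

def cozy_tier(tags: set[str]) -> str:
    # Precedence: HARD > OPEN > EXEMPT > TONAL > UNKNOWN.
    # "Auction / Bidding (any)" approximated as a prefix match (assumption).
    if tags & HARD:
        return "HARD"
    if tags & OPEN or any(t.startswith("Auction") for t in tags):
        return "OPEN"
    if tags & EXEMPT:
        return "EXEMPT"
    if tags & TONAL:
        return "TONAL"
    return "UNKNOWN"
```

With this precedence, Heat's {"Solo / Solitaire", "Race"} resolves to OPEN (cap 5), matching the spot-check fix above.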

Implementation

scripts/recalibrate_cozy_v4.py applies the heuristic in-place against game_llm_archetype_fit_v4 and recomputes archetype_fit_composite as the mean of all 10 dimensions. Only cozy rows are touched; the other three archetypes are unaffected.
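The in-place recompute reduces to swapping the recalibrated loss_tolerance_fit into the row's dimension map and re-averaging (sketch; dimension names other than loss_tolerance_fit are placeholders):

```python
from statistics import mean

def recompute_composite(dims: dict[str, float], new_ltf: float) -> float:
    """archetype_fit_composite = unweighted mean of all 10 dimensions,
    after substituting the recalibrated loss_tolerance_fit."""
    updated = {**dims, "loss_tolerance_fit": new_ltf}
    return round(mean(updated.values()), 2)
```

Only cozy rows pass through this; the other three archetypes keep their original composites.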

Results

  • 301 cozy rows: 138 raised, 5 capped, 158 unchanged.
  • Tier distribution: OPEN 127, UNKNOWN 79, EXEMPT 54, TONAL 22, HARD 19.
  • loss_tolerance_fit ≤ 3 share: 77% → 33%.
  • Avg composite: 4.40 → 4.50 (+0.10), top of distribution unchanged at 6.50 (Harmonies).
  • New top-decile entrants: Kingdom Builder, Harvest Dice, Azul, Ticket to Ride: London, Coffee Roaster, Copenhagen.
  • Other archetypes untouched: Heat balatro composite still 7.60 (verified post-recal).

Heuristic limitations (acknowledged; future LLM re-extraction may revisit)

  • The "lift to 6 / 5" inference assumes the original LLM cap was binding when old_ltf ≤ 3. For some games the LLM may have genuinely intended a low score on tonal grounds (punishing loss loop) and the cap-relax incorrectly raises them. This is not separable without re-running the prompt.
  • BGG mechanism tags don't perfectly capture tone — some games tagged "Race" are race-themed but not race-decided (Flamme Rouge is genuinely racing; some pickup-and-deliver games are not).
  • TONAL tier signals are sparse on this corpus (only 22 games matched). Could be expanded.
  • mda_survival JSON blocks were NOT touched. They were emitted per-archetype by the LLM and reflect the original cap context. A future re-extraction would also revise these.

Token cost

  • Zero LLM tokens. Pure programmatic recalibration against tagged BGG mechanisms.

2026-04-30 — Cozy heuristic v0.2 (UNKNOWN-tier primitive fallback)

v0.1 left 158 cozy rows unchanged, with 79 falling into UNKNOWN tier (BGG mechanism tags silent). Diagnosis showed many UNKNOWN rows are tonal-cozy by primitive shape — tableau_personal_board, set_collection_diversifying, card_drafting, multi_use_card — yet had no TONAL tag to lift them.

Approach

Add a new TONAL_PRIM tier driven by mech_interaction_primitives from game_llm_features_v4, slotted between TONAL and UNKNOWN in the precedence chain.

HARD > OPEN > EXEMPT > TONAL (BGG tags) > TONAL_PRIM (primitives, v0.2 only) > UNKNOWN

Cozy-signal primitives (16): tableau_personal_board, polyomino_packing, attribute_alignment_scoring, tile_orientation_choice, border_scoring, spatial_adjacency_scoring, personal_sheet_optimization, set_collection_diversifying, set_collection_concentrating, card_drafting, tableau_shared_market, shared_objective_card, multi_use_card, incremental_economy, tableau_market_refresh, arc_three_acts.

Anti-cozy primitives (15): escalating_threat, cascading_failure, feeding_pressure, time_pressure_realtime, negotiation_over_resources, forced_table_talk, bluff_layer, hidden_role_voting, summoner_kill_win_condition, card_combo_chaining, engine_acceleration, combo_setup_cost, exponential_payoff, action_blocking, region_majority.

A row is TONAL_PRIM if (a) ≥2 cozy-signal hits AND 0 anti-cozy, OR (b) ≥3 cozy-signal hits AND ≤1 anti-cozy. TONAL_PRIM lifts low-ltf rows to 5 (not 6 — the primitive-shape signal is weaker than a direct mechanism-tag signal).
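The two-branch eligibility rule, as a predicate (sketch — the primitive sets below are abbreviated subsets of the 16/15 lists above):

```python
COZY_PRIMS = {"tableau_personal_board", "card_drafting", "multi_use_card",
              "set_collection_diversifying"}          # subset of the 16
ANTI_PRIMS = {"card_combo_chaining", "engine_acceleration",
              "summoner_kill_win_condition"}          # subset of the 15

def is_tonal_prim(primitives: set[str]) -> bool:
    """(a) >=2 cozy-signal hits AND 0 anti-cozy, OR
       (b) >=3 cozy-signal hits AND <=1 anti-cozy."""
    cozy = len(primitives & COZY_PRIMS)
    anti = len(primitives & ANTI_PRIMS)
    return (cozy >= 2 and anti == 0) or (cozy >= 3 and anti <= 1)
```

This also reproduces the Pixel Tactics 2 false positive noted later in this log: {tableau_personal_board, card_drafting} passes because summoner_kill_win_condition was never emitted for it.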

v0.1 logic preserved verbatim under --version v0.1 (default). v0.2 adds the primitive-fallback path under --version v0.2.

Results

  • 301 cozy rows (v0.2): 21 raised, 0 capped, 280 unchanged (additive over v0.1).
  • New tier distribution: OPEN 127, UNKNOWN 57, EXEMPT 54, TONAL 22, TONAL_PRIM 22, HARD 19.
  • 22 rows reclassified UNKNOWN → TONAL_PRIM; of those, 21 had old_ltf ≤ 3 and were lifted to 5.
  • Avg composite delta: +0.014 vs. post-v0.1 (small, additive).
  • Anchor games (re-verified post-write):
    • Heat balatro composite 7.60
    • Heat cozy ltf 5 ✓ (Race / OPEN tier, capped)
    • Harmonies cozy composite 6.50 ✓ (top-decile preserved)

Lifted under v0.2 (21 games)

Hadara, Draftosaurus, Tussie Mussie, Trekking the National Parks: Second Edition, Chai, Ohanami, Jump Drive, Paper Tales, Firenze, San Juan, Trambahn, 7 Wonders, San Juan (Second Edition), New York Slice, CuBirds, Sushi Go!, Sushi Roll, Sushi Go Party!, Space Explorers, Pixel Tactics 2, Tybor the Builder.

A manual ground-truth check on this 21-game set: 19/21 are unambiguously tonal-cozy (set collection drafting, light pattern-building, no PvP attack). Pixel Tactics 2 is the one clear false positive — it's a competitive card combat game; the tableau_personal_board + card_drafting primitives passed the filter without summoner_kill_win_condition being emitted. Manual classification accuracy: ≥90% (≥75% target met).

Plan target vs. actual

Plan called for ≥30 additional rows lifted (loose target). Actual: 21. Gap is in the EXEMPT tier (54 rows, trust-LLM) and the residual 57 UNKNOWN rows that lacked enough cozy-shape primitives to clear the threshold. Tightening the primitive fallback further (≥1 cozy hit) would raise more rows but with lower precision; the v0.2 thresholds were chosen to keep the manual-classification accuracy bar.

Token cost

Zero LLM tokens. Same programmatic-recalibration pattern as v0.1.

Heuristic limitations (carry-forward)

  • Pixel Tactics 2 false-positive shows the limit of primitive-only filtering — the LLM did not emit a summoner_kill_win_condition primitive for it, so the anti-cozy filter didn't catch it. Future LLM re-extraction with a clarified cozy rubric would be more faithful.
  • The 57 residual UNKNOWN rows are a mix of (a) genuinely-not-cozy games the LLM correctly flagged low, and (b) cozy games whose primitive list happened to be sparse. The (b) slice is small; not worth a separate code path.
  • 22 TONAL_PRIM rows yielding 21 lifts is near-coincidence — the eligibility test (≥2 cozy hits, no anti) is independent of the lift trigger (old_ltf ≤ 3). Almost every TONAL_PRIM row happened to also be at the binding cap.