
v4_deep_calibration_log

generated_by: scripts/build_v4_deep_pilot_batch.py + sub-agent run against prompts/extraction_v4_deep.md
generated_at: 2026-04-29T02:05:02Z
sqlite_snapshot: 2026-04-25T16:31:05Z
extraction_run_id: 2026-04-29T02:05:02Z
filters:
  pilot_set: hardcoded 10 BGG IDs (CALIBRATION_BGG_IDS_V4 in scripts/enrich_llm.py)
prompt_runtime: prompts/extraction_v4_deep.md
prompt_calibration: prompts/extraction_v4_deep_calibration.md
model: Sonnet 4.6 (sub-agent invocation, not direct SDK)
inputs: data/llm_input_v4_deep/batch_pilot_heat.json + batch_pilot_v4_deep.json
outputs: data/llm_output_v4_deep/batch_pilot_heat.jsonl + batch_pilot_v4_deep.jsonl
top_k: 10
plan: docs/plans/v4-archetype-fit-pipeline.md
milestone: W2-M3 (deep calibration pilot)

This log captures the W2-M3 calibration pilot for the v4 archetype-fit deep pass. The pilot validates that a sub-agent following prompts/extraction_v4_deep.md can produce schema-compliant JSONL with the four-archetype rubric, that anchor games hit their expected archetype top-decile, and that the rubric discriminates across archetypes (heterogeneity check).

Calibration set

Same 10 games as W1-M2 (per plan). Heat scored separately as the rubric scaffold; the remaining 9 ran in batch_pilot_v4_deep.json.

| # | bgg_id | name | prior_density | rationale |
|---|--------|------|---------------|-----------|
| 1 | 366013 | Heat: Pedal to the Metal | full | Anchor — Balatro top-decile canary |
| 2 | 36218 | Dominion | partial | Anchor — engine-growth source for Balatro |
| 3 | 122522 | Smash Up | thin | Anchor — Snap source-game proxy (no v2/v3 priors) |
| 4 | 126163 | Tzolk'in: The Mayan Calendar | full | Worker-recall + tableau diversity |
| 5 | 413246 | Bomb Busters | full | Cooperative deduction stress test |
| 6 | 217372 | The Quest for El Dorado | full | Deck-building + route-committal |
| 7 | 193738 | Great Western Trail | full | Route + tableau hybrid |
| 8 | 164928 | Orléans | full | Bag-building + presence tracks |
| 9 | 42 | Tigris & Euphrates | full | Region-majority + min-score |
| 10 | 189932 | Tyrants of the Underdark | full | Deck-building + area-control hybrid |

Pilot run

  • SQLite snapshot date: 2026-04-25T16:31:05Z
  • Generated at: 2026-04-29T02:05:02Z
  • Prompt iterations performed: 0 (first-pass compliance hit 100%)
  • Cache control / structured output: NOT verified (sub-agent invocation, not direct SDK). Same caveat as W1-M2.

JSON-schema compliance

  • Total output lines: 10 (1 Heat + 9 batch)
  • Lines parseable as JSON: 10
  • Lines matching schema (theme block + 4 archetype_fits × 10 dims + 8-entry mda_survival + confidence): 10
  • Compliance rate: 100% (≥90% threshold cleared)
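The compliance check above can be sketched as a small validator. Field names (theme, archetype_fits, mda_survival, confidence) and the four archetype keys are assumptions reconstructed from this log, not the prompt's exact schema:

```python
import json

ARCHETYPES = {"balatro", "snap", "wordle", "cozy"}

def check_line(line: str) -> bool:
    """True if one JSONL line matches the shape described above:
    theme block + 4 archetype_fits x 10 dims + 8-entry mda_survival + confidence."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return False
    fits = row.get("archetype_fits", {})
    return (
        set(fits) == ARCHETYPES
        and all(len(dims) == 10 for dims in fits.values())
        and "theme" in row
        and len(row.get("mda_survival", {})) == 8
        and "confidence" in row
    )

def compliance_rate(path: str) -> float:
    """Fraction of non-empty lines in a batch JSONL that pass check_line."""
    with open(path) as f:
        lines = [ln for ln in f if ln.strip()]
    return sum(check_line(ln) for ln in lines) / max(len(lines), 1)
```

A real validator would also enforce per-dimension score ranges and the verdict/condition rule; this only checks shape.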

Note: the existing Heat row uses bare verdict strings ("Sensation":"conditional") for the mda_survival map, while the prompt schema specifies {verdict, condition} objects with condition required when verdict='conditional'. The 9 new lines follow the prompt schema. Reconcile in W2-M4 — pick one and update either the Heat row or the prompt.

Cross-archetype heterogeneity (composites = unweighted mean of 10 dims)

| Game | balatro | snap | wordle | cozy | spread (max-min) |
|------|---------|------|--------|------|------------------|
| Heat | 7.60 | 4.70 | 4.30 | 4.70 | 3.30 |
| Dominion | 7.80 | 4.80 | 4.20 | 5.00 | 3.60 |
| Smash Up | 4.90 | 7.20 | 3.60 | 3.80 | 3.60 |
| Tzolk'in | 4.40 | 3.00 | 2.30 | 3.20 | 2.10 |
| Bomb Busters | 6.00 | 5.70 | 6.00 | 5.20 | 0.80 |
| Quest for El Dorado | 6.90 | 4.20 | 3.90 | 4.90 | 3.00 |
| Great Western Trail | 4.80 | 2.50 | 2.40 | 3.10 | 2.40 |
| Orléans | 5.80 | 3.20 | 2.90 | 4.00 | 2.90 |
| Tigris & Euphrates | 4.20 | 3.60 | 2.60 | 3.20 | 1.60 |
| Tyrants of the Underdark | 6.30 | 4.40 | 3.30 | 3.70 | 3.00 |
  • Mean spread: 2.66 (>1 pt threshold for "discriminating" — rubric is not homogenizing).
  • Anchor-fit checks pass:
    • Heat → Balatro 7.60 (top-decile canary ✓ per plan acceptance criterion ≥7.5)
    • Dominion → Balatro 7.80 (engine-growth anchor ranks #1 on Balatro ✓)
    • Smash Up → Snap 7.20 (Snap source-game ranks #1 on Snap ✓)
  • Smallest spread: Bomb Busters (0.80). Cooperative deduction game scores within 1 pt across all 4 archetypes. See ambiguity #1 below.
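The composite and spread arithmetic is simple enough to pin down in a few lines (a sketch; the example uses Heat's four composites from the table, not real per-dimension scores):

```python
from statistics import mean

def composite(dims: dict[str, float]) -> float:
    """Unweighted mean of the 10 dimension scores, rounded to 2 dp."""
    return round(mean(dims.values()), 2)

def spread(composites: dict[str, float]) -> float:
    """Heterogeneity check: max - min across the four archetype composites."""
    return round(max(composites.values()) - min(composites.values()), 2)

heat = {"balatro": 7.60, "snap": 4.70, "wordle": 4.30, "cozy": 4.70}
assert spread(heat) == 3.30  # matches the Heat row in the table
```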

Rubric ambiguities surfaced (W2-M3 follow-up)

  1. Cooperative games compress the rubric. Bomb Busters spread = 0.80; rubric anchors are all competitive-or-solo. Co-op deduction has no engine growth (drags Balatro), no faction PvP (drags Snap), no daily-puzzle compression (drags Wordle), but loss_tolerance_fit for Cozy is no longer pinned by the "competitive end-state" cap. Net: all four archetypes converge on mediocre mid-band scores. Decision needed: introduce a 5th archetype (co-op) or accept that co-op games legitimately score in the 4-6 band across the board (no archetype is the right shape for them).

  2. loss_tolerance_fit cap clarification for cooperative fail-states. The Cozy rubric caps at 3 "when the game has any competitive end-state where one player loses." Bomb Busters has a fail-state (detonator) but the loss is shared, not interpersonal. Pilot scored 5 (uncapped). Decision needed: the cap should explicitly say competitive (interpersonal) loss, not shared fail-state. Edit prompts/extraction_v4_deep.md §dimension definitions.

  3. mda_survival schema inconsistency between Heat anchor and prompt. Heat row uses "Sensation":"conditional" (bare string); prompt requires {"verdict": "...", "condition": "..."}. The 9 new lines emit objects. Decision needed: pick the object form (richer, plan-aligned), update Heat row in batch_pilot_heat.jsonl, ensure aggregate_llm_v4_deep.py (W2-M4 scope) handles object form only.
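One way to reconcile programmatically would be a normalizer like this (hypothetical helper — field names assumed; entries coerced from a bare string get condition=None, which still violates the "condition required when conditional" rule and so flags them for hand-editing):

```python
def normalize_mda(mda: dict) -> dict:
    """Coerce bare-string verdicts ("Sensation": "conditional") to the
    prompt's object form ({"verdict": ..., "condition": ...})."""
    out = {}
    for aesthetic, value in mda.items():
        if isinstance(value, str):
            # Legacy form: keep the verdict, leave condition for hand-editing.
            out[aesthetic] = {"verdict": value, "condition": None}
        else:
            out[aesthetic] = value  # already object-form; pass through unchanged
    return out
```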

Vocabulary / primitive signal

Not measured this pilot — v4-deep doesn't emit a vocabulary tier (the primitives axis lives in v4-wide). The archetype_fit_rationale text is the qualitative artifact; spot-check confirms primitives are cited by name (e.g., Heat's engine_pollution, Dominion's deck-thinning, Smash Up's faction-base scoring). No _other: escape valve in this schema.

Throughput measurement

  • Sub-agent wall time: ~5.4 min for 9 games (323 s)
  • Per-game wall time: ~36 s (vs ~22 s/game for v4-wide pilot)
  • Total tokens: ~99.7k (sub-agent run-level)
  • Reset-window throughput: NOT measured — same SDK-instrumentation gap as W1-M2.

Per-game wall time is ~1.6× the v4-wide rate, expected given the larger output (4 archetypes × 10 dims + 8 mda survival + theme vs v4-wide's compact feature vector).

Decisions

  • Ship as-is — rubric passes the M3 acceptance bar (≥90% compliance, anchors hit expected archetype top-decile, mean spread > 1 pt). Move to W2-M4 prep.
  • Ambiguity #2 resolved (this session): edited prompts/extraction_v4_deep.md loss_tolerance_fit definition to clarify that the Cozy 3-cap applies to interpersonal competitive loss only; cooperative shared fail-states do not trigger the cap.
  • Ambiguity #3 resolved (this session): re-ran Heat against the corrected prompt; data/llm_output_v4_deep/batch_pilot_heat.jsonl now uses object-form mda_survival ({verdict, condition}). All numeric dims preserved from the original Heat row — schema-shape fix only. W2-M4 aggregator can assume object form.
  • Defer the cooperative-archetype question (ambiguity #1) to a separate decision — does W2 want a 5th archetype or accept co-op games scoring mid-band? This is product-scope, not rubric-scope.
  • W2-M4 unblocker: build scripts/aggregate_llm_v4_deep.py and the game_llm_archetype_fit_v4 table schema (4 rows per game × 10 dim columns + theme block + mda survival blob + composite). Same aggregate_llm_v4.py shape, expanded.
  • Token economics moves still NOT verified — same as W1-M2. Sub-agent invocation hides cache-hit rate and per-call instrumentation. Defer SDK-wrap to a separate task before W2-M4 if throughput becomes load-bearing.

W2-M0 gate decision (2026-04-29)

Skip paid Sensor Tower. Falling back to manual App Store + public-data curation for data/seed/mobile_adaptations.csv. Plan-budgeted impact: W2-M2 expands from 3 days to 2 calendar weeks.

Rationale: paid API spend not justified at this stage; public sources (App Store charts archive, Sensor Tower free tier, postmortems, GDC talks, dev interviews) cover the 50-entry curation depth needed for the case-file's purpose (archetype-fit calibration anchor, not market sizing).

Schema unchanged — peak_grossing_rank and peak_dau_estimate remain nullable; entries without verifiable numbers ship with those columns blank and learnings carrying the qualitative read.

Files produced

  • data/llm_input_v4_deep/batch_pilot_v4_deep.json (9 games, compact-prior payloads)
  • data/llm_output_v4_deep/batch_pilot_v4_deep.jsonl (9 lines, schema-valid)
  • scripts/build_v4_deep_pilot_batch.py (one-shot batch builder; pilot-scope, not wired into enrich_llm.py)
  • This log: reports/v4_deep_calibration_log.md

W2-M4 throughput + cost (live, 2026-04-29)

Production run is mid-flight. Snapshot below.

Progress

  • 114 / 300 games enriched (38%) — pilot (10) + batches 0-25 (104) of the top-20% slice.
  • 456 archetype-fit rows in game_llm_archetype_fit_v4 (114 × 4 archetypes).
  • 114 theme rows in game_llm_theme_v4.
  • Heat-anchor calibration: balatro composite 7.60, rank 3 / 114, top-decile ✓ (W2 acceptance criterion holding).
  • 20 games above the 7.0 candidate threshold (9 balatro, 10 snap, 1 wordle (Orchard 7.10), 0 cozy). First wordle ≥ 7.0 candidate landed in batch 24.

Throughput

| Wave | Batches | Games | Pattern | Wall time |
|------|---------|-------|---------|-----------|
| 1 (warm-up) | 0, 1 | 8 | sequential, 1-at-a-time | ~9 min |
| 2 | 2-10 | 36 | 9-way parallel (sub-agents) | ~3 min |
| 3 | 11-20 | 40 | 10-way parallel | ~3 min |
| 4 | 21-25 | 20 | 5-way parallel | ~2 min |
| Total | 0-25 (26 batches) | 104 | — | ~17 min wall |

  • Per-batch sub-agent wall time: ~135s avg (range 131-167s, 4 games per batch).
  • Effective per-game wall time: ~33 s/game (the W1-M2 wide pilot estimated ~22 s/game; deep production is ~1.5× higher, consistent with the deep > wide expectation).
  • Parallelization speedup is load-bearing. 84 games serialized would have been ~50 min wall; 9-10-way parallel collapsed that to ~6 min. The remaining 186 games will take 5-6 more parallel waves (~15-20 min wall) at this cadence.

Cost shape (Claude Code Max-plan quota)

  • Each sub-agent invocation: ~70K total tokens (per usage logs across 19 production batches).
  • 24 batches × 70K ≈ 1.7M tokens consumed for 104 games (~16K tokens/game).
  • No API spend. All on Max-plan ($200/mo) sub-agent quota per the plan's "no API billing" out-of-scope rule.
  • Reset-window concerns from M1-M2 did not bind during the parallel waves — quota held across all 19 invocations in this session.

Calibration drift checks (parallel run, no rubric edits)

  • Same prompt + same reference output across all batches; cross-batch consistency verified by LotR: The Confrontation appearing in both batch_11 and batch_13 under different bgg_ids (3201 and 18833 — separate BGG editions) — both scored independently to snap composite 7.50. Reproducibility within margin.
  • All sub-agents independently flagged the same rubric edge: cozy loss_tolerance_fit ≤3 cap on competitive end-states is over-restrictive. Tonal-cozy candidates (Tea Dragon Society, Kingdom Builder, Trekking, Welcome To, Bosk, Chai, Ohanami) all hit the cap. Only 2 / 94 games top-cozy; zero clear 6.0. W2-M6 recalibration target firmly identified.
  • Snap dominates the top tier more than the original W2 hypothesis assumed: 9 of 17 above-7.0 candidates are snap-archetype, vs the plan's implicit balatro-as-dominant framing.

Files

  • data/llm_input_v4_deep/batch_{0..25}.json (26 batches, 104 games queued)
  • data/llm_output_v4_deep/batch_{0..25}.jsonl (26 batches, 104 schema-valid lines)
  • SQLite: game_llm_archetype_fit_v4 (456 rows), game_llm_theme_v4 (114 rows)

Quota learnings (load-bearing for the remaining 186 games)

What this session actually cost — measured, not estimated.

Per-game token cost is now empirical, not speculative.

  • ~16K tokens per game via sub-agent invocation. The plan's W1-M2 / W2-M3 budgets deferred this to "reset-window wall-time" because sub-agent invocation hides per-call token counts. After 26 batches the per-game number is stable across waves.
  • 104 games × 16K ≈ 1.7M sub-agent tokens consumed for the production run alone (excluding pilot, excluding main-session orchestration cost).
  • Extrapolation: completing W2-M4 (186 games remaining) ≈ 3M more sub-agent tokens at this rate.

Parallelization helps wall-time, not quota.

  • 9-10-way parallel sub-agents fire ~700K tokens in 3 min of wall time. Serial firing the same 9 batches would cost the same 700K but take ~20 min wall.
  • Reset-window throttling that the plan called out as a binding risk did not bind for 26 batches in one session. The 5-hour Max-plan windows held across all four waves. This is good news for resumed runs: a single session can plausibly finish the remaining 186 games (3M tokens) without window-staging gymnastics, assuming nothing else in the session is burning quota.
  • Caveat: this measurement was on Sonnet 4.6 sub-agents on a Max-plan account. Heavier orchestrator usage (Opus, large file reads in the parent) would compete for the same window.

Main-session context is the actual bottleneck, not quota.

  • Each sub-agent's structured report (~3-5KB markdown) gets read into the parent context. 26 sub-agent reports cost the parent ~70 percentage points of context window in this session (started at ~17%, ended at ~89%).
  • Per-batch parent-context burn: ~2.5-3 pp. The remaining 186 games at 47 batches × 2.5pp = 117pp would need ~2 fresh contexts to ship.
  • Mitigation for next session: instruct sub-agents to write a one-line summary (per-game top + composite, no narrative) and dump full per-game ambiguity notes into a per-batch markdown file at reports/v4_deep_batch_NN_notes.md. Parent reads the one-liner; ambiguity log stays on disk for later. Cuts parent burn ~5x.

Code path divergence.

  • Pilot batches used scripts/build_v4_deep_pilot_batch.py (one-shot). Production used scripts/enrich_llm.py prepare --depth deep-v4. Both produced equivalent batch JSON; the production path's _next_batch_idx had a bug on non-numeric pilot suffixes (fixed in 465fdbd). Future resumes only use the production path.

Cross-batch reproducibility verified for free.

  • LotR Confrontation appeared in both batch_11 and batch_13 at different bgg_ids (3201 vs 18833 — separate BGG editions of the same Knizia design). Both scored snap composite 7.50 independently. Within the rubric's expected ±0.5 spread, this is exact match. The duplicate happened by accident (the prepare query selects by bgg_id; same name across editions slips through). Worth keeping the duplicate in the corpus as a permanent reproducibility canary.
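A canary check like the LotR duplicate could be automated. Column names (name, bgg_id, archetype, composite) are assumptions about the game_llm_archetype_fit_v4 schema, so treat this as a sketch:

```python
import sqlite3

def duplicate_canaries(con: sqlite3.Connection, tolerance: float = 0.5):
    """Find same-name games scored under different bgg_ids and flag whether
    their composites agree within the rubric's expected +/-0.5 spread."""
    pairs = con.execute(
        """
        SELECT a.name, a.bgg_id, b.bgg_id, a.composite, b.composite
        FROM game_llm_archetype_fit_v4 AS a
        JOIN game_llm_archetype_fit_v4 AS b
          ON a.name = b.name
         AND a.archetype = b.archetype
         AND a.bgg_id < b.bgg_id
        """
    ).fetchall()
    return [(name, id_a, id_b, abs(c_a - c_b) <= tolerance)
            for name, id_a, id_b, c_a, c_b in pairs]
```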

M7 completion run (2026-04-30, Opus 4.7 1M-context orchestrator)

Closing run that took deep-v4 from 114 → 301 games, hitting the 20% top-discovery acceptance criterion.

Progress

  • 47 batches (batch_26..batch_72), 187 new games scored. End state: game_llm_archetype_fit_v4 1,204 rows / 301 games; game_llm_theme_v4 301 rows.

Throughput

  • 6 parallel general-purpose sub-agents, each handling 7-8 batches:

    | Agent | Batches | Games | Sub-agent tokens | Wall (s) |
    |-------|---------|-------|------------------|----------|
    | A | 26-33 (8) | 32 | 98,146 | 569 |
    | B | 34-41 (8) | 32 | 184,547 | 796 |
    | C | 42-49 (8) | 32 | 140,443 | 779 |
    | D | 50-57 (8) | 32 | 127,688 | 708 |
    | E | 58-64 (7) | 28 | 126,355 | 766 |
    | F | 65-72 (8) | 31 | 92,056 | 438 |
    | Total | 47 | 187 | 769,235 | ~13.3 min wall (max-of-parallel) |
  • Per-game sub-agent cost: ~4.1K tokens/game — ~4× cheaper than the 16K/game W2-M4 measurement. Likely drivers: (1) sub-agents read the v4-deep prompt once instead of re-reading per batch, (2) parent did not re-bundle the prompt into each agent's input (just pointed at the path), (3) batch instructions were terser this round.

  • Wall-clock: 6-way parallelism cut what would have been ~78 min serial into ~13 min.

Cost shape (Claude Code Max-plan quota)

  • Robert's quota meter: 0% → ~72% over the full session (6 agents in parallel + aggregate + analyze + Neon sync + commit). Sub-agent fanout was the bulk of spend; aggregate/analyze/sync added negligible main-context tokens.
  • At the W2-M4 rate (16K tokens/game), the 187 games projected ~3M sub-agent tokens; actual was 769K — roughly 4× under, because the per-game overhead shrank (detailed under Plan-vs-actual below).
  • No API billing. All on Max-plan ($200/mo) sub-agent quota per the plan's "no API billing" out-of-scope rule.

Plan-vs-actual

  • W2-M4 quota learnings predicted: "completing remaining 186 games ≈ 3M more sub-agent tokens" and "context-burn ≈ 117pp → needs 2 fresh contexts to ship". Both estimates were too pessimistic.
    • Actual sub-agent tokens: 769K (~4× under projection).
    • Actual context burn: single Opus 1M-context session shipped end-to-end with quota at 72% — never needed a context reset.
  • Driver of the gap: the W2-M4 mitigation ("instruct sub-agents to terse one-line summaries; ambiguity log on disk") was applied this round. Sub-agent reports came back as one-line "wrote batches X..Y, M games total" instead of multi-KB markdown. Parent context stayed lean.

Files

  • data/llm_input_v4_deep/batch_{26..72}.json (47 batches added this run; 73 total exists in directory)
  • data/llm_output_v4_deep/batch_{26..72}.jsonl (47 batches added this run; 75 total)
  • SQLite: game_llm_archetype_fit_v4 (1,204 rows / 301 games), game_llm_theme_v4 (301 rows)
  • Reports regenerated via scripts/analyze.py; Neon dashboard synced via web/npm run sync.

Acceptance criteria verification

  • Heat top-decile Balatro-fit: Heat: Pedal to the Metal scores balatro composite 7.60, percentile 0.990 (3rd of 301). Comfortably top-1%, well above the ~7.5 W2-M3 calibration target.
  • Unanticipated mechanic-archetype pair in top decile: Flip-and-write + 9-card solitaires dominate the Wordle ∩ Cozy intersection. The original plan proxied Wordle as "no clean board source"; the data answers it: Orchard (Wordle 7.10 / Cozy 6.30), Lucky Numbers, Welcome To..., Silver & Gold, Kokoro: Avenue of the Kodama, Cahoots all show up top-decile for both archetypes. The shared primitive cluster is Paper-and-Pencil + Solo Solitaire + Pattern Building — a different shape than Carcassonne/Dorfromantik, which the Cozy proxy implicitly anchored on tile-laying. This is the most actionable archetype-shape signal of the run: the next "Wordle-on-mobile" candidate is likely a paper-and-pencil or micro-solitaire derivative, not a word puzzle clone.
  • Composite spread holds across archetypes: Top-decile thresholds (P90 composite) — Balatro 6.50, Snap 6.10, Wordle 6.10, Cozy 5.60. Balatro has the richest right tail (Dominion 7.80, Heat 7.60); Cozy is the tightest because the eligibility filter pre-selects against its no-fail hallmark and the cozy loss_tolerance_fit ≤3 cap on competitive end-states still bites — a known M6-revision target.

M6 cozy loss_tolerance_fit recalibration (2026-04-30, heuristic v0.1)

The blanket ≤3 cap was over-restrictive: 77% of cozy rows hit the cap, the right tail flattened, and tonal-cozy candidates with incidental scoring competition (Welcome To, Kingdom Builder, Azul, Ticket to Ride: London) were under-scored. Rule replaced with a tiered cap:

| Tier | Trigger BGG mechanism tags | New cap |
|------|----------------------------|---------|
| HARD | Player Elimination, Take That, Card Play Conflict Resolution | 3 |
| OPEN | Worker Placement, Area Majority / Influence, Auction / Bidding (any), Race, Sudden Death Ending | 5 |
| EXEMPT | Cooperative Game OR Solo / Solitaire (and no HARD/OPEN) | no cap |
| TONAL | Tile Placement, Pattern Building, Pattern Recognition, Paper-and-Pencil, Roll/Spin/Flip and Write, Layering, Grid Coverage, Network and Route Building (and no HARD/OPEN/EXEMPT) | no cap; lift to 6 if LLM was at the old binding cap (≤3) |
| UNKNOWN | None of the above | trust LLM verbatim |

HARD/OPEN take precedence over EXEMPT — Heat: Pedal to the Metal has a Solo/Solitaire mode tag but is a Race; the Race tag wins and Heat gets cap 5, not the no-cap exemption. This was a bug found during the v0 spot-check.
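The precedence rule reduces to a small function. The tag sets below are copied from the tier table; treating "Auction / Bidding (any)" as a prefix match is an assumption about how scripts/recalibrate_cozy_v4.py matches BGG tags:

```python
HARD = {"Player Elimination", "Take That", "Card Play Conflict Resolution"}
OPEN = {"Worker Placement", "Area Majority / Influence", "Race",
        "Sudden Death Ending"}
EXEMPT = {"Cooperative Game", "Solo / Solitaire"}
TONAL = {"Tile Placement", "Pattern Building", "Pattern Recognition",
         "Paper-and-Pencil", "Roll/Spin/Flip and Write", "Layering",
         "Grid Coverage", "Network and Route Building"}

def cozy_tier(tags: set[str]) -> str:
    # Precedence: HARD > OPEN > EXEMPT > TONAL > UNKNOWN.
    # "Auction / Bidding (any)" approximated as a prefix match (assumption).
    if tags & HARD:
        return "HARD"
    if tags & OPEN or any(t.startswith("Auction") for t in tags):
        return "OPEN"
    if tags & EXEMPT:
        return "EXEMPT"
    if tags & TONAL:
        return "TONAL"
    return "UNKNOWN"
```

With this precedence, Heat's {"Solo / Solitaire", "Race"} resolves to OPEN (cap 5), matching the spot-check fix above.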

Implementation

scripts/recalibrate_cozy_v4.py applies the heuristic in-place against game_llm_archetype_fit_v4 and recomputes archetype_fit_composite as the mean of all 10 dimensions. Only cozy rows are touched; the other three archetypes are unaffected.
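The in-place recompute reduces to swapping the recalibrated loss_tolerance_fit into the row's dimension map and re-averaging (sketch; dimension names other than loss_tolerance_fit are placeholders):

```python
from statistics import mean

def recompute_composite(dims: dict[str, float], new_ltf: float) -> float:
    """archetype_fit_composite = unweighted mean of all 10 dimensions,
    after substituting the recalibrated loss_tolerance_fit."""
    updated = {**dims, "loss_tolerance_fit": new_ltf}
    return round(mean(updated.values()), 2)
```

Only cozy rows pass through this; the other three archetypes keep their original composites.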

Results

  • 301 cozy rows: 138 raised, 5 capped, 158 unchanged.
  • Tier distribution: OPEN 127, UNKNOWN 79, EXEMPT 54, TONAL 22, HARD 19.
  • loss_tolerance_fit ≤ 3 share: 77% → 33%.
  • Avg composite: 4.40 → 4.50 (+0.10), top of distribution unchanged at 6.50 (Harmonies).
  • New top-decile entrants: Kingdom Builder, Harvest Dice, Azul, Ticket to Ride: London, Coffee Roaster, Copenhagen.
  • Other archetypes untouched: Heat balatro composite still 7.60 (verified post-recal).

Heuristic limitations (acknowledged; future LLM re-extraction may revisit)

  • The "lift to 6 / 5" inference assumes the original LLM cap was binding when old_ltf ≤ 3. For some games the LLM may have genuinely intended a low score on tonal grounds (punishing loss loop) and the cap-relax incorrectly raises them. This is not separable without re-running the prompt.
  • BGG mechanism tags don't perfectly capture tone — some games tagged "Race" are race-themed but not race-decided (Flamme Rouge is genuinely racing; some pickup-and-deliver games are not).
  • TONAL tier signals are sparse on this corpus (only 22 games matched). Could be expanded.
  • mda_survival JSON blocks were NOT touched. They were emitted per-archetype by the LLM and reflect the original cap context. A future re-extraction would also revise these.

Token cost

  • Zero LLM tokens. Pure programmatic recalibration against tagged BGG mechanisms.

2026-04-30 — Cozy heuristic v0.2 (UNKNOWN-tier primitive fallback)

v0.1 left 158 cozy rows unchanged, with 79 falling into UNKNOWN tier (BGG mechanism tags silent). Diagnosis showed many UNKNOWN rows are tonal-cozy by primitive shape — tableau_personal_board, set_collection_diversifying, card_drafting, multi_use_card — yet had no TONAL tag to lift them.

Approach

Add a new TONAL_PRIM tier driven by mech_interaction_primitives from game_llm_features_v4, slotted between TONAL and UNKNOWN in the precedence chain.

HARD > OPEN > EXEMPT > TONAL (BGG tags) > TONAL_PRIM (primitives, v0.2 only) > UNKNOWN

Cozy-signal primitives (16): tableau_personal_board, polyomino_packing, attribute_alignment_scoring, tile_orientation_choice, border_scoring, spatial_adjacency_scoring, personal_sheet_optimization, set_collection_diversifying, set_collection_concentrating, card_drafting, tableau_shared_market, shared_objective_card, multi_use_card, incremental_economy, tableau_market_refresh, arc_three_acts.

Anti-cozy primitives (15): escalating_threat, cascading_failure, feeding_pressure, time_pressure_realtime, negotiation_over_resources, forced_table_talk, bluff_layer, hidden_role_voting, summoner_kill_win_condition, card_combo_chaining, engine_acceleration, combo_setup_cost, exponential_payoff, action_blocking, region_majority.

A row is TONAL_PRIM if (a) ≥2 cozy-signal hits AND 0 anti-cozy, OR (b) ≥3 cozy-signal hits AND ≤1 anti-cozy. TONAL_PRIM lifts low-ltf rows to 5 (not 6 — the primitive-shape signal is weaker than a direct mechanism-tag signal).
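The two-branch eligibility rule, as a predicate (sketch — the primitive sets below are abbreviated subsets of the 16/15 lists above):

```python
COZY_PRIMS = {"tableau_personal_board", "card_drafting", "multi_use_card",
              "set_collection_diversifying"}          # subset of the 16
ANTI_PRIMS = {"card_combo_chaining", "engine_acceleration",
              "summoner_kill_win_condition"}          # subset of the 15

def is_tonal_prim(primitives: set[str]) -> bool:
    """(a) >=2 cozy-signal hits AND 0 anti-cozy, OR
       (b) >=3 cozy-signal hits AND <=1 anti-cozy."""
    cozy = len(primitives & COZY_PRIMS)
    anti = len(primitives & ANTI_PRIMS)
    return (cozy >= 2 and anti == 0) or (cozy >= 3 and anti <= 1)
```

This also reproduces the Pixel Tactics 2 false positive noted later in this log: {tableau_personal_board, card_drafting} passes because summoner_kill_win_condition was never emitted for it.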

v0.1 logic preserved verbatim under --version v0.1 (default). v0.2 adds the primitive-fallback path under --version v0.2.

Results

  • 301 cozy rows (v0.2): 21 raised, 0 capped, 280 unchanged (additive over v0.1).
  • New tier distribution: OPEN 127, UNKNOWN 57, EXEMPT 54, TONAL 22, TONAL_PRIM 22, HARD 19.
  • 22 rows reclassified UNKNOWN → TONAL_PRIM; of those, 21 had old_ltf ≤ 3 and were lifted to 5.
  • Avg composite delta: +0.014 vs. post-v0.1 (small, additive).
  • Anchor games (re-verified post-write):
    • Heat balatro composite 7.60
    • Heat cozy ltf 5 ✓ (Race / OPEN tier, capped)
    • Harmonies cozy composite 6.50 ✓ (top-decile preserved)

Lifted under v0.2 (21 games)

Hadara, Draftosaurus, Tussie Mussie, Trekking the National Parks: Second Edition, Chai, Ohanami, Jump Drive, Paper Tales, Firenze, San Juan, Trambahn, 7 Wonders, San Juan (Second Edition), New York Slice, CuBirds, Sushi Go!, Sushi Roll, Sushi Go Party!, Space Explorers, Pixel Tactics 2, Tybor the Builder.

A manual ground-truth check on this 21-game set: 19/21 are unambiguously tonal-cozy (set collection drafting, light pattern-building, no PvP attack). Pixel Tactics 2 is the one clear false positive — it's a competitive card combat game; the tableau_personal_board + card_drafting primitives passed the filter without summoner_kill_win_condition being emitted. Manual classification accuracy: ≥90% (≥75% target met).

Plan target vs. actual

Plan called for ≥30 additional rows lifted (loose target). Actual: 21. Gap is in the EXEMPT tier (54 rows, trust-LLM) and the residual 57 UNKNOWN rows that lacked enough cozy-shape primitives to clear the threshold. Tightening the primitive fallback further (≥1 cozy hit) would raise more rows but with lower precision; the v0.2 thresholds were chosen to keep the manual-classification accuracy bar.

Token cost

Zero LLM tokens. Same programmatic-recalibration pattern as v0.1.

Heuristic limitations (carry-forward)

  • Pixel Tactics 2 false-positive shows the limit of primitive-only filtering — the LLM did not emit a summoner_kill_win_condition primitive for it, so the anti-cozy filter didn't catch it. Future LLM re-extraction with a clarified cozy rubric would be more faithful.
  • The 57 residual UNKNOWN rows are a mix of (a) genuinely-not-cozy games the LLM correctly flagged low, and (b) cozy games whose primitive list happened to be sparse. The (b) slice is small; not worth a separate code path.
  • 22 TONAL_PRIM rows yielding 21 lifts is near-coincidence — the eligibility test (≥2 cozy hits, no anti) is independent of the lift trigger (old_ltf ≤ 3). Almost every TONAL_PRIM row happened to also be at the binding cap.