v4_calibration_log
- generated_by: scripts/enrich_llm.py prepare --depth wide-v4 --calibration + scripts/aggregate_llm_v4.py
- generated_at: 2026-04-28T07:01:54Z
- sqlite_snapshot: 2026-04-25T16:31:05Z
- extraction_run_id: 2026-04-28T07:01:54Z
- filters:
  - pilot_set: hardcoded 10 BGG IDs (CALIBRATION_BGG_IDS_V4 in scripts/enrich_llm.py)
- prompt_runtime: prompts/extraction_v4_wide.md
- prompt_calibration: prompts/extraction_v4_wide_calibration.md
- vocab: prompts/v4_primitives.md (v0, 76 entries)
- model: Sonnet 4.6 (sub-agent invocation, not direct SDK)
- top_k: 10
- plan: docs/plans/v4-archetype-fit-pipeline.md
- milestone: W1-M2 (calibration pilot)
This log captures the W1-M2 calibration pilot for the v4 archetype-fit pipeline. The pilot validates that Sonnet 4.6 High can produce JSON-schema-compliant v4-wide output from compact-prior payloads, that similar mechanics score consistently across games, and that reset-window throughput is sufficient for the W1-M3 full-corpus run.
Update each TBD section after the pilot LLM run completes.
Calibration set
| # | bgg_id | name | prior_density | rationale |
|---|---|---|---|---|
| 1 | 366013 | Heat: Pedal to the Metal | full (v3 deep dive) | Anchor — calibrates time-pressure-without-realtime |
| 2 | 36218 | Dominion | partial / full (verify) | Anchor — engine-growth archetype |
| 3 | 122522 | Smash Up | partial / full (verify) | Anchor — Snap-shape source-game proxy |
| 4 | 126163 | Tzolk'in: The Mayan Calendar | full (v3 deep dive) | Worker-recall + tableau-personal-board diversity |
| 5 | 413246 | Bomb Busters | full (v3 deep dive) | Cooperative + communication-constraint primitives |
| 6 | 217372 | The Quest for El Dorado | full (v3 deep dive) | Deck-building + route-committal primitives |
| 7 | 193738 | Great Western Trail | full (v3 deep dive) | Route + tableau hybrid; exercises compact-prior with v3 paragraphs |
| 8 | 164928 | Orléans | full (v3 deep dive) | Bag-building + region-majority primitives |
| 9 | 42 | Tigris & Euphrates | full (v3 deep dive) | Region-majority + attribute-alignment-scoring |
| 10 | 189932 | Tyrants of the Underdark | full (v3 deep dive) | Deck-building + area-control hybrid |
Note: bgg_id 71721 (cited in prompts/extraction_v4_wide_calibration.md for Smash Up) does not match the BGG entry in data/bgg.sqlite. The DB row for "Smash Up" is bgg_id 122522. Reconcile in W1-M2 follow-up: either correct the calibration prompt to match the DB, or verify that 71721 is the canonical BGG ID and the local snapshot is missing it.
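The reconciliation check can be run directly against the snapshot; a minimal sketch, assuming the `game` table (referenced elsewhere in this log) carries `bgg_id` and `name` columns — adjust to the actual schema of data/bgg.sqlite:

```python
import sqlite3

def check_bgg_ids(con: sqlite3.Connection, ids=(71721, 122522)):
    """Report which candidate BGG IDs exist in the local snapshot.

    Column names `bgg_id` and `name` are assumptions; verify against
    the real data/bgg.sqlite schema before relying on this.
    """
    placeholders = ",".join("?" for _ in ids)
    rows = con.execute(
        f"SELECT bgg_id, name FROM game WHERE bgg_id IN ({placeholders})",
        ids,
    ).fetchall()
    return {bgg_id: name for bgg_id, name in rows}
```

If only 122522 comes back, the calibration prompt is the side to fix.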
Pilot run
- SQLite snapshot date (`SELECT MAX(fetched_at) FROM game`): 2026-04-25T16:31:05Z
- Generated at: 2026-04-28T07:01:54Z
- Prompt iterations performed: 0 (no iteration needed — first-pass compliance hit 100%)
- Cache control / structured-output mode used: NOT verified in this pilot. Sub-agent invocation, not direct Anthropic SDK. The `cache_control: {type: "ephemeral"}` and tool-use structured-output Token Economics Moves (4, 5, 7) need a separate verification pass against the SDK before W1-M3.
JSON-schema compliance
- Total output lines: 10
- Lines parseable as JSON: 10
- Lines matching schema (all required fields, valid enums, value ranges): 10
- Compliance rate: 100% (≥90% threshold cleared)
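The three compliance counts can be reproduced mechanically from the JSONL output; a minimal sketch, assuming a simplified row shape (`bgg_id` plus a `primitives` list with 0-3 strengths) rather than the real v4-wide schema:

```python
import json

# Hypothetical v4-wide contract: field names and the 0-3 strength range
# are illustrative, not the actual prompts/extraction_v4_wide.md schema.
REQUIRED = {"bgg_id", "primitives"}
STRENGTH_RANGE = range(0, 4)

def compliance(jsonl_lines):
    """Return (parseable rate, schema-compliant rate) over output lines."""
    parseable = schema_ok = 0
    for line in jsonl_lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        parseable += 1
        if not REQUIRED <= obj.keys():
            continue
        if all(p.get("strength") in STRENGTH_RANGE for p in obj["primitives"]):
            schema_ok += 1
    total = len(jsonl_lines)
    return parseable / total, schema_ok / total
```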
Inter-game consistency spot check
Five mechanic pairs from the original plan were checked. Three landed within 1σ; the two that diverged turned out, on review, to reflect correct LLM discrimination rather than rubric inconsistency: the chosen pairs were not actually shared-primitive pairs.
| pair | primitive | A | B | Δ | within 1σ | note |
|---|---|---|---|---|---|---|
| Dominion vs Quest for El Dorado | engine_growth | 3 | 2 | 1 | YES | Both engine-growth games; small dispersion expected |
| Heat vs Bomb Busters | tempo_swing | 0 | 0 | 0 | YES | Neither is tempo-swing dominant; correct |
| Tzolk'in vs Great Western Trail | worker_recall_phase | 0 | 0 | 0 | YES | LLM scored other primitives over this; correct |
| Tigris & Euphrates vs Orléans | region_majority | 3 | 0 | 3 | NO | Bad pair: T&E is region-majority core; Orléans uses presence tracks, not majority |
| Smash Up vs Tyrants of the Underdark | card_combo_chaining | 2 | 0 | 2 | NO | Bad pair: Smash Up has on-play chains; Tyrants is deck-thinning, not combo cascades |
Take-away: the consistency check needs to start from primitives observed in BOTH games' output, not pre-picked pairs. For W2 calibration, query mech_interaction_primitives cross-game and only compare strengths where the primitive appears in both. The current "0 in both" pairs are uninformative — they tell us nothing about consistency.
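That data-driven pair selection can come straight from SQLite; a sketch, assuming `mech_interaction_primitives` holds one `(bgg_id, primitive, strength)` row per scored primitive — the real table layout may differ:

```python
import sqlite3

# Self-join: every game pair where BOTH outputs scored the same
# primitive above zero. Table/column names are assumptions.
PAIR_SQL = """
SELECT a.primitive, a.bgg_id, a.strength, b.bgg_id, b.strength
FROM mech_interaction_primitives AS a
JOIN mech_interaction_primitives AS b
  ON a.primitive = b.primitive AND a.bgg_id < b.bgg_id
WHERE a.strength > 0 AND b.strength > 0
ORDER BY a.primitive
"""

def shared_primitive_pairs(con: sqlite3.Connection):
    """Comparable pairs: same primitive observed > 0 in both games."""
    return con.execute(PAIR_SQL).fetchall()
```

The `a.bgg_id < b.bgg_id` guard keeps each pair once and drops self-pairs; the `> 0` filters are what rule out the uninformative "0 in both" comparisons.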
Vocabulary signal
Three `_other:` escape valves emitted (3 of 10 games used the escape; all at strength=3, suggesting load-bearing primitives missing from the v0 vocab):
| game | label | rationale |
|---|---|---|
| Bomb Busters | _other:ascending_sort_deduction | Slot-N positional inference where neighbors flipping narrows a bounded range. partial_observability covers info reveal but not the ascending-sort constraint as deduction engine. |
| Tigris & Euphrates | _other:bottleneck_score_axis | "Score = your weakest color" — min-scoring with no analog in v0 vocab. Defining T&E mechanic. |
| Tyrants of the Underdark | _other:prune_for_vp | Promote-to-exile for endgame VP. Close to delayed_payoff but the removal from engine IS the scoring mechanism. |
All three are promotion candidates for the v1 vocab; bottleneck_score_axis and prune_for_vp in particular recur in well-known designs (Through the Ages min-scoring for Tech, Quacks of Quedlinburg / engine-thinning families).
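Once the full-corpus output lands, the promotion shortlist can be pulled mechanically from the JSONL rather than by hand; a sketch, again assuming a simplified `primitives` row shape that may not match the real v4-wide schema:

```python
import json
from collections import Counter

def escape_valve_tally(jsonl_lines):
    """Count `_other:` labels across extraction output.

    Assumes each row carries a `primitives` list of
    {"name": ..., "strength": ...} dicts; adjust to the real schema.
    """
    tally = Counter()
    for line in jsonl_lines:
        for p in json.loads(line).get("primitives", []):
            name = p.get("name", "")
            if name.startswith("_other:"):
                tally[name] += 1
    return tally.most_common()
```

Labels that recur across many games are the strongest v1 promotion signals; one-offs likely stay as escapes.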
Throughput measurement
This pilot ran via the Agent tool sub-agent (not direct Anthropic SDK), so per-call token counts and cache-hit rates are NOT measurable here. End-to-end:
- Sub-agent wall time: 219.5s for 10 games (3 file reads + 10 game scorings + JSONL write)
- Per-game wall time: ~22s
- Total tokens reported: 53,605 (sub-agent run-level, not per-game)
- Reset-window throughput estimate: NOT MEASURED — needs SDK-level instrumentation
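The pilot numbers extrapolate roughly as follows, assuming flat per-game cost and the plan's ~500-600 corpus estimate (a naive linear scale-up; sub-agent overhead may not carry over to direct SDK calls):

```python
# Pilot measurements (run-level, from this log)
PILOT_GAMES = 10
PILOT_WALL_S = 219.5
PILOT_TOKENS = 53_605

def extrapolate(corpus_size: int):
    """Naive linear scale-up: returns (wall-clock hours, total tokens)."""
    hours = corpus_size * (PILOT_WALL_S / PILOT_GAMES) / 3600
    tokens = corpus_size * (PILOT_TOKENS / PILOT_GAMES)
    return hours, tokens

# extrapolate(600) gives roughly 3.7 hours and ~3.2M tokens
```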
For the W1-M3 full-corpus run, instrument the SDK call directly with:
- `cache_control` on the rubric block (verify cache hit ratio)
- `tool_use` for structured output (eliminates parse failures)
- Per-game input/output token logging
- Wall time per call to extrapolate reset-window budget
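A sketch of what that instrumented call shape could look like (request construction only, no network; the model string and tool schema are placeholders, and the `cache_control`/`tool_choice` structure should be verified against the current Anthropic Messages API before W1-M3):

```python
def build_request(rubric_text: str, game_payload: str) -> dict:
    """Assemble kwargs for an instrumented Messages API call.

    The tool schema here is a placeholder, not the real v4-wide contract.
    """
    tool = {
        "name": "emit_v4_wide",
        "description": "Emit one v4-wide scoring row as structured output.",
        "input_schema": {
            "type": "object",
            "properties": {
                "bgg_id": {"type": "integer"},
                "primitives": {"type": "array"},
            },
            "required": ["bgg_id", "primitives"],
        },
    }
    return {
        "model": "claude-sonnet-4-5",  # placeholder model string
        "max_tokens": 2048,
        "system": [
            {
                "type": "text",
                "text": rubric_text,
                # Cache the large, identical rubric across the corpus run
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "tools": [tool],
        # Force the tool so every response is structured output
        "tool_choice": {"type": "tool", "name": "emit_v4_wide"},
        "messages": [{"role": "user", "content": game_payload}],
    }
```

Per-game token logging would then read the response's usage fields (including cache read/creation counts) after each call.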
Findings
- Rubric is coherent. 100% schema compliance on first pass. Heat/Dominion/Smash Up scores match the worked examples exactly — calibration anchors held.
- Compact-prior format works. 8 of 10 games shipped with full v3 priors (no description); 2 games (Dominion, Smash Up) shipped with truncated description. Smash Up has neither v2 nor v3 priors yet — it's outside the top-100 v2-extraction set. Consider running v2 on Smash Up before W1-M3 OR letting v4 handle it cold.
- Vocabulary is incomplete. 3/10 games (30%) hit the `_other:` escape. The plan budgets a review every 250 rows; at this rate the W1-M3 full-corpus run (~500-600 games) will surface 150+ escape entries. The 250-row review cadence may be too lax; consider reviewing at 100 first.
- Smash Up bgg_id discrepancy persists. The calibration markdown cites 71721; the local DB has 122522. The score still anchored correctly because the LLM read the priors, not the bgg_id. Fix in the calibration markdown for cleanliness.
- Token economics moves NOT verified. Caching, structured-output tool-use, and per-call instrumentation all need SDK-level work before W1-M3. The plan's reset-window throughput estimates remain unverified.
Decisions
- Ship as-is — rubric passes the M2 acceptance bar (≥90% compliance, anchors match, consistency check informative). Move to W1-M3 prep.
- Defer SDK instrumentation to a separate task — the next PR before W1-M3 should wrap the LLM call in the Anthropic SDK with `cache_control`, `tool_use`, and per-call token logging. Without this, throughput / cache-hit verification can't happen.
- Update consistency check methodology in the W2 calibration log — pair selection should be data-driven (primitives appearing ≥2× in pilot output), not pre-picked.
- Tighten vocab review cadence to every 100 rows for the first 500 then loosen to 250. Three escape valves in 10 games extrapolates to a high-volume vocabulary-drift workload that benefits from earlier checkpoints.
- Promotion candidates for v1 vocab: `_other:bottleneck_score_axis`, `_other:prune_for_vp`, `_other:ascending_sort_deduction`. Run `scripts/review_v4_primitives.py` (deferred build) at 250 rows or sooner to formalize.