v4_calibration_log

generated_by: scripts/enrich_llm.py prepare --depth wide-v4 --calibration + scripts/aggregate_llm_v4.py generated_at: 2026-04-28T07:01:54Z sqlite_snapshot: 2026-04-25T16:31:05Z extraction_run_id: 2026-04-28T07:01:54Z filters: pilot_set: hardcoded 10 BGG IDs (CALIBRATION_BGG_IDS_V4 in scripts/enrich_llm.py) prompt_runtime: prompts/extraction_v4_wide.md prompt_calibration: prompts/extraction_v4_wide_calibration.md vocab: prompts/v4_primitives.md (v0, 76 entries) model: Sonnet 4.6 (sub-agent invocation, not direct SDK) top_k: 10 plan: docs/plans/v4-archetype-fit-pipeline.md milestone: W1-M2 (calibration pilot)

This log captures the W1-M2 calibration pilot for the v4 archetype-fit pipeline. The pilot validates that Sonnet 4.6 High can produce JSON-schema-compliant v4-wide output from compact-prior payloads, that similar mechanics score consistently across games, and that reset-window throughput is sufficient for the W1-M3 full-corpus run.

Update each TBD section after the pilot LLM run completes.

Calibration set

#	bgg_id	name	prior_density	rationale
1	366013	Heat: Pedal to the Metal	full (v3 deep dive)	Anchor — calibrates time-pressure-without-realtime
2	36218	Dominion	partial / full (verify)	Anchor — engine-growth archetype
3	122522	Smash Up	partial / full (verify)	Anchor — Snap-shape source-game proxy
4	126163	Tzolk'in: The Mayan Calendar	full (v3 deep dive)	Worker-recall + tableau-personal-board diversity
5	413246	Bomb Busters	full (v3 deep dive)	Cooperative + communication-constraint primitives
6	217372	The Quest for El Dorado	full (v3 deep dive)	Deck-building + route-committal primitives
7	193738	Great Western Trail	full (v3 deep dive)	Route + tableau hybrid; exercises compact-prior with v3 paragraphs
8	164928	Orléans	full (v3 deep dive)	Bag-building + region-majority primitives
9	42	Tigris & Euphrates	full (v3 deep dive)	Region-majority + attribute-alignment-scoring
10	189932	Tyrants of the Underdark	full (v3 deep dive)	Deck-building + area-control hybrid

Note: bgg_id 71721 (cited in prompts/extraction_v4_wide_calibration.md for Smash Up) does not match the BGG entry in data/bgg.sqlite. The DB row for "Smash Up" is bgg_id 122522. Reconcile in W1-M2 follow-up: either correct the calibration prompt to match the DB, or verify that 71721 is the canonical BGG ID and the local snapshot is missing it.

Pilot run

SQLite snapshot date (SELECT MAX(fetched_at) FROM game): 2026-04-25T16:31:05Z
Generated at: 2026-04-28T07:01:54Z
Prompt iterations performed: 0 (no iteration needed — first-pass compliance hit 100%)
Cache control / structured-output mode used: NOT verified in this pilot. Sub-agent invocation, not direct Anthropic SDK. The cache_control: {type: "ephemeral"} and tool-use structured-output Token Economics Moves (4, 5, 7) need a separate verification pass against the SDK before W1-M3.

JSON-schema compliance

Total output lines: 10
Lines parseable as JSON: 10
Lines matching schema (all required fields, valid enums, value ranges): 10
Compliance rate: 100% (≥90% threshold cleared)

Inter-game consistency spot check

Five mechanic pairs from the original plan. Three within 1σ; two diverged in ways that on review represent correct LLM discrimination rather than rubric inconsistency — the chosen pairs were not actually shared-primitive pairs.

pair	primitive	A	B	Δ	within 1σ	note
Dominion vs Quest for El Dorado	engine_growth	3	2	1	YES	Both engine-growth games; small dispersion expected
Heat vs Bomb Busters	tempo_swing	0	0	0	YES	Neither is tempo-swing dominant; correct
Tzolk'in vs Great Western Trail	worker_recall_phase	0	0	0	YES	LLM scored other primitives over this; correct
Tigris & Euphrates vs Orléans	region_majority	3	0	3	NO	Bad pair: T&E is region-majority core; Orléans uses presence tracks, not majority
Smash Up vs Tyrants of the Underdark	card_combo_chaining	2	0	2	NO	Bad pair: Smash Up has on-play chains; Tyrants is deck-thinning, not combo cascades

Take-away: the consistency check needs to start from primitives observed in BOTH games' output, not pre-picked pairs. For W2 calibration, query mech_interaction_primitives cross-game and only compare strengths where the primitive appears in both. The current "0 in both" pairs are uninformative — they tell us nothing about consistency.

Vocabulary signal

Three _other: escape valves emitted (3 of 10 games used the escape — all strength=3, suggesting load-bearing primitives missing from v0 vocab):

game	label	rationale
Bomb Busters	`_other:ascending_sort_deduction`	Slot-N positional inference where neighbors flipping narrows a bounded range. `partial_observability` covers info reveal but not the ascending-sort constraint as deduction engine.
Tigris & Euphrates	`_other:bottleneck_score_axis`	"Score = your weakest color" — min-scoring with no analog in v0 vocab. Defining T&E mechanic.
Tyrants of the Underdark	`_other:prune_for_vp`	Promote-to-exile for endgame VP. Close to `delayed_payoff` but the removal from engine IS the scoring mechanism.

Promotion candidates for v1 vocab. All three deserve consideration; bottleneck_score_axis and prune_for_vp recur in well-known designs (Through the Ages min-scoring for Tech, Quacks of Quedlinburg / engine-thinning families).

Throughput measurement

This pilot ran via the Agent tool sub-agent (not direct Anthropic SDK), so per-call token counts and cache-hit rates are NOT measurable here. End-to-end:

Sub-agent wall time: 219.5s for 10 games (3 file reads + 10 game scorings + JSONL write)
Per-game wall time: ~22s
Total tokens reported: 53,605 (sub-agent run-level, not per-game)
Reset-window throughput estimate: NOT MEASURED — needs SDK-level instrumentation

For the W1-M3 full-corpus run, instrument the SDK call directly with:

cache_control on the rubric block (verify cache hit ratio)
tool_use for structured output (eliminates parse failures)
Per-game input/output token logging
Wall time per call to extrapolate reset-window budget

Findings

Rubric is coherent. 100% schema compliance on first pass. Heat/Dominion/Smash Up scores match the worked examples exactly — calibration anchors held.
Compact-prior format works. 8 of 10 games shipped with full v3 priors (no description); 2 games (Dominion, Smash Up) shipped with truncated description. Smash Up has neither v2 nor v3 priors yet — it's outside the top-100 v2-extraction set. Consider running v2 on Smash Up before W1-M3 OR letting v4 handle it cold.
Vocabulary is incomplete. 3/10 games (30%) hit the _other: escape. The plan budgets review every 250 rows; at this rate the W1-M3 full-corpus run (~500-600 games) will surface 150+ escape entries. The 250-row review cadence may be too lax; consider reviewing at 100 first.
Smash Up bgg_id discrepancy persists. Calibration markdown cites 71721; local DB has 122522. Score still anchored correctly because the LLM read the priors, not the bgg_id. Fix in calibration markdown for cleanliness.
Token economics moves NOT verified. Caching, structured-output tool-use, and per-call instrumentation all need SDK-level work before W1-M3. The plan's reset-window throughput estimates remain unverified.

Decisions

Ship as-is — rubric passes the M2 acceptance bar (≥90% compliance, anchors match, consistency check informative). Move to W1-M3 prep.
Defer SDK instrumentation to a separate task — the next PR before W1-M3 should wrap the LLM call in the Anthropic SDK with cache_control, tool_use, and per-call token logging. Without this, throughput / cache-hit verification can't happen.
Update consistency check methodology in W2 calibration log — pair selection should be data-driven (primitives appearing ≥2× in pilot output), not pre-picked.
Tighten vocab review cadence to every 100 rows for the first 500 then loosen to 250. Three escape valves in 10 games extrapolates to a high-volume vocabulary-drift workload that benefits from earlier checkpoints.
Promotion candidates for v1 vocab: _other:bottleneck_score_axis, _other:prune_for_vp, _other:ascending_sort_deduction. Run scripts/review_v4_primitives.py (deferred build) at 250 rows or sooner to formalize.