← all reports

v4_calibration_log

-

generated_by: scripts/enrich_llm.py prepare --depth wide-v4 --calibration + scripts/aggregate_llm_v4.py generated_at: 2026-04-28T07:01:54Z sqlite_snapshot: 2026-04-25T16:31:05Z extraction_run_id: 2026-04-28T07:01:54Z filters: pilot_set: hardcoded 10 BGG IDs (CALIBRATION_BGG_IDS_V4 in scripts/enrich_llm.py) prompt_runtime: prompts/extraction_v4_wide.md prompt_calibration: prompts/extraction_v4_wide_calibration.md vocab: prompts/v4_primitives.md (v0, 76 entries) model: Sonnet 4.6 (sub-agent invocation, not direct SDK) top_k: 10 plan: docs/plans/v4-archetype-fit-pipeline.md milestone: W1-M2 (calibration pilot)

This log captures the W1-M2 calibration pilot for the v4 archetype-fit pipeline. The pilot validates that Sonnet 4.6 High can produce JSON-schema-compliant v4-wide output from compact-prior payloads, that similar mechanics score consistently across games, and that reset-window throughput is sufficient for the W1-M3 full-corpus run.

Update each TBD section after the pilot LLM run completes.

Calibration set

#bgg_idnameprior_densityrationale
1366013Heat: Pedal to the Metalfull (v3 deep dive)Anchor — calibrates time-pressure-without-realtime
236218Dominionpartial / full (verify)Anchor — engine-growth archetype
3122522Smash Uppartial / full (verify)Anchor — Snap-shape source-game proxy
4126163Tzolk'in: The Mayan Calendarfull (v3 deep dive)Worker-recall + tableau-personal-board diversity
5413246Bomb Bustersfull (v3 deep dive)Cooperative + communication-constraint primitives
6217372The Quest for El Doradofull (v3 deep dive)Deck-building + route-committal primitives
7193738Great Western Trailfull (v3 deep dive)Route + tableau hybrid; exercises compact-prior with v3 paragraphs
8164928Orléansfull (v3 deep dive)Bag-building + region-majority primitives
942Tigris & Euphratesfull (v3 deep dive)Region-majority + attribute-alignment-scoring
10189932Tyrants of the Underdarkfull (v3 deep dive)Deck-building + area-control hybrid

Note: bgg_id 71721 (cited in prompts/extraction_v4_wide_calibration.md for Smash Up) does not match the BGG entry in data/bgg.sqlite. The DB row for "Smash Up" is bgg_id 122522. Reconcile in W1-M2 follow-up: either correct the calibration prompt to match the DB, or verify that 71721 is the canonical BGG ID and the local snapshot is missing it.

Pilot run

  • SQLite snapshot date (SELECT MAX(fetched_at) FROM game): 2026-04-25T16:31:05Z
  • Generated at: 2026-04-28T07:01:54Z
  • Prompt iterations performed: 0 (no iteration needed — first-pass compliance hit 100%)
  • Cache control / structured-output mode used: NOT verified in this pilot. Sub-agent invocation, not direct Anthropic SDK. The cache_control: {type: "ephemeral"} and tool-use structured-output Token Economics Moves (4, 5, 7) need a separate verification pass against the SDK before W1-M3.

JSON-schema compliance

  • Total output lines: 10
  • Lines parseable as JSON: 10
  • Lines matching schema (all required fields, valid enums, value ranges): 10
  • Compliance rate: 100% (≥90% threshold cleared)

Inter-game consistency spot check

Five mechanic pairs from the original plan. Three within 1σ; two diverged in ways that on review represent correct LLM discrimination rather than rubric inconsistency — the chosen pairs were not actually shared-primitive pairs.

pairprimitiveABΔwithin 1σnote
Dominion vs Quest for El Doradoengine_growth321YESBoth engine-growth games; small dispersion expected
Heat vs Bomb Busterstempo_swing000YESNeither is tempo-swing dominant; correct
Tzolk'in vs Great Western Trailworker_recall_phase000YESLLM scored other primitives over this; correct
Tigris & Euphrates vs Orléansregion_majority303NOBad pair: T&E is region-majority core; Orléans uses presence tracks, not majority
Smash Up vs Tyrants of the Underdarkcard_combo_chaining202NOBad pair: Smash Up has on-play chains; Tyrants is deck-thinning, not combo cascades

Take-away: the consistency check needs to start from primitives observed in BOTH games' output, not pre-picked pairs. For W2 calibration, query mech_interaction_primitives cross-game and only compare strengths where the primitive appears in both. The current "0 in both" pairs are uninformative — they tell us nothing about consistency.

Vocabulary signal

Three _other: escape valves emitted (3 of 10 games used the escape — all strength=3, suggesting load-bearing primitives missing from v0 vocab):

gamelabelrationale
Bomb Busters_other:ascending_sort_deductionSlot-N positional inference where neighbors flipping narrows a bounded range. partial_observability covers info reveal but not the ascending-sort constraint as deduction engine.
Tigris & Euphrates_other:bottleneck_score_axis"Score = your weakest color" — min-scoring with no analog in v0 vocab. Defining T&E mechanic.
Tyrants of the Underdark_other:prune_for_vpPromote-to-exile for endgame VP. Close to delayed_payoff but the removal from engine IS the scoring mechanism.

Promotion candidates for v1 vocab. All three deserve consideration; bottleneck_score_axis and prune_for_vp recur in well-known designs (Through the Ages min-scoring for Tech, Quacks of Quedlinburg / engine-thinning families).

Throughput measurement

This pilot ran via the Agent tool sub-agent (not direct Anthropic SDK), so per-call token counts and cache-hit rates are NOT measurable here. End-to-end:

  • Sub-agent wall time: 219.5s for 10 games (3 file reads + 10 game scorings + JSONL write)
  • Per-game wall time: ~22s
  • Total tokens reported: 53,605 (sub-agent run-level, not per-game)
  • Reset-window throughput estimate: NOT MEASURED — needs SDK-level instrumentation

For the W1-M3 full-corpus run, instrument the SDK call directly with:

  • cache_control on the rubric block (verify cache hit ratio)
  • tool_use for structured output (eliminates parse failures)
  • Per-game input/output token logging
  • Wall time per call to extrapolate reset-window budget

Findings

  1. Rubric is coherent. 100% schema compliance on first pass. Heat/Dominion/Smash Up scores match the worked examples exactly — calibration anchors held.
  2. Compact-prior format works. 8 of 10 games shipped with full v3 priors (no description); 2 games (Dominion, Smash Up) shipped with truncated description. Smash Up has neither v2 nor v3 priors yet — it's outside the top-100 v2-extraction set. Consider running v2 on Smash Up before W1-M3 OR letting v4 handle it cold.
  3. Vocabulary is incomplete. 3/10 games (30%) hit the _other: escape. The plan budgets review every 250 rows; at this rate the W1-M3 full-corpus run (~500-600 games) will surface 150+ escape entries. The 250-row review cadence may be too lax; consider reviewing at 100 first.
  4. Smash Up bgg_id discrepancy persists. Calibration markdown cites 71721; local DB has 122522. Score still anchored correctly because the LLM read the priors, not the bgg_id. Fix in calibration markdown for cleanliness.
  5. Token economics moves NOT verified. Caching, structured-output tool-use, and per-call instrumentation all need SDK-level work before W1-M3. The plan's reset-window throughput estimates remain unverified.

Decisions

  • Ship as-is — rubric passes the M2 acceptance bar (≥90% compliance, anchors match, consistency check informative). Move to W1-M3 prep.
  • Defer SDK instrumentation to a separate task — the next PR before W1-M3 should wrap the LLM call in the Anthropic SDK with cache_control, tool_use, and per-call token logging. Without this, throughput / cache-hit verification can't happen.
  • Update consistency check methodology in W2 calibration log — pair selection should be data-driven (primitives appearing ≥2× in pilot output), not pre-picked.
  • Tighten vocab review cadence to every 100 rows for the first 500 then loosen to 250. Three escape valves in 10 games extrapolates to a high-volume vocabulary-drift workload that benefits from earlier checkpoints.
  • Promotion candidates for v1 vocab: _other:bottleneck_score_axis, _other:prune_for_vp, _other:ascending_sort_deduction. Run scripts/review_v4_primitives.py (deferred build) at 250 rows or sooner to formalize.