Inputs
What feeds the research. Five sources: one primary scrape, one historical seed, one optional dump, a hand-written editorial backbone, and an LLM-extracted derived layer.
BoardGameGeek XML API
Manual, on demand · Primary source
Game metadata (name, year, weight, ratings, mechanism/category tags). Pulled via the /thing endpoint with a Bearer token; rate-limited to 2 req/sec. Only scripts/refresh_bgg.py talks to it.
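A minimal sketch of what a throttled /thing pull could look like, not the actual contents of scripts/refresh_bgg.py. The base URL, the stats=1 parameter, and the names Throttle, build_request, and parse_things are assumptions made for illustration; only the Bearer token and the 2 req/sec limit come from the description above.

```python
import time
import xml.etree.ElementTree as ET
from urllib.request import Request

# Assumed base URL for the XML API; the real script may configure this differently.
BGG_THING_URL = "https://boardgamegeek.com/xmlapi2/thing"
MIN_INTERVAL = 0.5  # seconds between requests, i.e. 2 req/sec


class Throttle:
    """Sleeps as needed so successive calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


def build_request(game_ids, token):
    """Build a /thing request for a batch of IDs, authenticated with a Bearer token."""
    ids = ",".join(str(i) for i in game_ids)
    url = f"{BGG_THING_URL}?id={ids}&stats=1"
    return Request(url, headers={"Authorization": f"Bearer {token}"})


def parse_things(xml_text):
    """Extract name, year, weight, rating, and mechanism/category tags per item."""
    games = []
    for item in ET.fromstring(xml_text).iter("item"):
        def val(path):
            node = item.find(path)
            return node.get("value") if node is not None else None

        links = item.findall("link")
        games.append({
            "id": int(item.get("id")),
            "name": val("name[@type='primary']"),
            "year": val("yearpublished"),
            "weight": val("statistics/ratings/averageweight"),
            "rating": val("statistics/ratings/average"),
            "mechanics": [l.get("value") for l in links
                          if l.get("type") == "boardgamemechanic"],
            "categories": [l.get("value") for l in links
                           if l.get("type") == "boardgamecategory"],
        })
    return games
```

In use, the script would call Throttle.wait() before each urllib.request.urlopen(build_request(...)) and pass the decoded response body to parse_things. Injecting the clock and sleep functions keeps the throttle testable without real delays.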
2019 historical rank seed
One-time historical · Auxiliary input
data/seed/bgg_ranks_2019.csv from the beefsack/bgg-ranking-historicals dump. Used as a stable list of ~17k IDs to seed the scraper, and also joined in pipeline 3 to compute rank_delta_since_2019 (rank trajectory).
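The rank-trajectory join can be sketched as follows. This is an illustration under assumptions, not the pipeline 3 code: the column names ID and Rank, the sign convention (positive means the game has climbed since 2019), and both function names are hypothetical.

```python
import csv
import io


def load_seed_ranks(csv_file):
    """Map game ID -> 2019 rank from the seed CSV (column names are assumed)."""
    return {int(row["ID"]): int(row["Rank"]) for row in csv.DictReader(csv_file)}


def rank_delta_since_2019(seed_ranks, current_ranks):
    """Return {id: rank_2019 - rank_now}.

    Positive means the game moved up the ranking (assumed convention).
    Games missing from the seed, or currently unranked, get None.
    """
    deltas = {}
    for game_id, rank_now in current_ranks.items():
        rank_2019 = seed_ranks.get(game_id)
        if rank_2019 is None or rank_now is None:
            deltas[game_id] = None
        else:
            deltas[game_id] = rank_2019 - rank_now
    return deltas


# Tiny in-memory stand-in for data/seed/bgg_ranks_2019.csv.
seed = load_seed_ranks(io.StringIO("ID,Name,Rank\n1,Alpha,10\n2,Beta,50\n"))
print(rank_delta_since_2019(seed, {1: 4, 2: 80, 3: 7}))
```

Keeping games without a 2019 rank as None, rather than dropping them, preserves newer releases in the joined table.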
Kaggle dump
When dropped in · Optional input
data/kaggle/games.csv from threnjen/board-games-database-from-boardgamegeek. Watch-folder: when present, pipeline 3 joins it automatically. Currently empty.
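The watch-folder behavior amounts to an optional left join: if the file is absent or empty, the pipeline proceeds without it. A minimal sketch, assuming the dump keys games by a BGGId column; that column name, the kaggle_ prefix, and both function names are hypothetical, and pipeline 3 may do this differently.

```python
import csv
from pathlib import Path

KAGGLE_PATH = Path("data/kaggle/games.csv")


def load_optional_kaggle(path=KAGGLE_PATH, id_column="BGGId"):
    """Return {game_id: row} if the dump is present and non-empty, else {}."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size == 0:
        return {}
    with path.open(newline="", encoding="utf-8") as f:
        return {int(row[id_column]): row for row in csv.DictReader(f)}


def join_optional(games, extra, id_column="BGGId"):
    """Left-join extra columns onto games under a 'kaggle_' prefix.

    Every game is kept; games with no match in the dump pass through unchanged.
    """
    joined = []
    for game in games:
        merged = dict(game)
        for key, value in extra.get(game["id"], {}).items():
            if key != id_column:
                merged[f"kaggle_{key}"] = value
        joined.append(merged)
    return joined
```

Because a missing file yields an empty mapping, the join is a no-op until the dump is dropped in, and no separate configuration flag is needed.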
Hand-written notes
Continuous, by hand · Research backbone
notes/mechanics.md (mechanism taxonomy + translation-difficulty heuristic), notes/lineages.md (game-to-game inheritance graph), and 8 per-game deep dives. These set the prior for what good translation candidates look like.
LLM extractions
Per pipeline 2 run · Derived input
v2 wide-pass terse phrases over 522 games (game_llm_primitives) and v3 deep-pass paragraphs + theme axis over 40 top candidates (game_llm_deep). Driven by web-search-equipped agents using prompts/extraction_v{2,3}.md.