Pipeline 1 · BGG scraping

BGG XML API → SQLite. Rate-limited at 2 req/sec. Ban-sensitive.

Games scraped: 10,035 (10,029 ranked)
Last scrape: 1d ago (2026-04-25)
Rate limit: 2 req/sec (hard cap, never bypass)

The scraper hits BoardGameGeek's XML API 2 — specifically the /thing endpoint — and lands raw game data in data/bgg.sqlite. It's the only path in the project that talks to BGG.

How it runs

python scripts/refresh_bgg.py            # default: top 2000 by 2019-rank seed
python -m src.scraper -n 5000            # broader pull

The seed list lives in data/seed/bgg_ranks_2019.csv. We refresh against that historical anchor so rank-trajectory deltas remain comparable across runs.
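As a rough illustration of seeding from a fixed rank snapshot, the sketch below picks the top N IDs from a CSV. The column names `id` and `rank` and the helper `top_seed_ids` are assumptions made for this example; the real layout of data/seed/bgg_ranks_2019.csv may differ.

```python
import csv
import io


def top_seed_ids(csv_file, n=2000):
    """Return the IDs of the n best-ranked rows in a seed CSV (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(csv_file):
        # "id" and "rank" are assumed column names, not confirmed by the project.
        try:
            rows.append((int(row["rank"]), int(row["id"])))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric rank or id
    rows.sort()  # ascending rank: 1 is best
    return [game_id for _, game_id in rows[:n]]


# Tiny in-memory stand-in for the seed file; the values are placeholders.
sample = io.StringIO("id,rank\n101,1\n102,2\n103,\n104,3\n")
seed_ids = top_seed_ids(sample, n=2)
print(seed_ids)
```

Because the seed file never changes between runs, the same N always selects the same set of games, which is what keeps deltas comparable.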

Rate limit (load-bearing)

2 requests per second, hard cap. Going faster gets the API token banned. The scraper enforces this in src/scraper.py; never bypass it. If a scrape needs to be faster, reduce the seed list, not the delay.
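For illustration only, a minimum-interval limiter for a 2 req/sec cap could look like the sketch below. The class name `MinIntervalLimiter` and its parameters are hypothetical; this is not the code in src/scraper.py.

```python
import time


class MinIntervalLimiter:
    """Spaces calls at least 1/max_per_sec seconds apart (illustrative sketch)."""

    def __init__(self, max_per_sec=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_sec
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now = self._clock()
        # Schedule the earliest time the following request may start.
        self._next_allowed = now + self.min_interval
        return max(delay, 0.0)
```

A caller would create one limiter per process and call `limiter.wait()` immediately before every request. Shrinking the seed list reduces the number of calls without touching `min_interval`.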

When BGG returns 202 Accepted, the request has been queued and the data is not ready yet, so the scraper retries with backoff. A 429 means we're already over the limit; back off harder.
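One way to express that retry policy is sketched below. The function name `fetch_with_backoff`, the injected `fetch` callable, and the delay values and multipliers are all assumptions chosen for this example, not the scraper's actual settings.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body), retrying on 202 and 429 (illustrative sketch)."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 202:
            wait = delay          # queued: try again after the current delay
        elif status == 429:
            wait = delay * 4      # over the limit: back off harder (factor is an assumption)
        else:
            raise RuntimeError(f"unexpected HTTP status {status} for {url}")
        if attempt == max_attempts:
            break
        sleep(wait)
        delay *= 2                # exponential growth between attempts
    raise TimeoutError(f"gave up on {url} after {max_attempts} attempts")
```

Passing `fetch` in as a callable keeps the retry logic independent of the HTTP client, so it can be exercised with canned responses and no network access.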

What's stored

Each /thing response is parsed into the game table plus join tables game_mechanism and game_category. The mechanism and category lookup tables are populated incidentally as new IDs appear.
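A minimal sketch of that parsing step follows, using the standard library's ElementTree. The sample XML is hand-written in the general shape of a /thing response with statistics included; its IDs, names, and numbers are placeholders, and `parse_thing` is a hypothetical helper rather than the project's parser.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; every ID, name, and number here is a placeholder.
SAMPLE = """<items>
  <item type="boardgame" id="1001">
    <name type="primary" value="Example Game"/>
    <minplayers value="1"/>
    <maxplayers value="4"/>
    <playingtime value="90"/>
    <link type="boardgamecategory" id="10" value="Example Category"/>
    <link type="boardgamemechanic" id="20" value="Example Mechanism"/>
    <statistics>
      <ratings>
        <usersrated value="1234"/>
        <bayesaverage value="7.50"/>
        <ranks>
          <rank type="subtype" name="boardgame" value="42"/>
        </ranks>
        <averageweight value="2.75"/>
      </ratings>
    </statistics>
  </item>
  <item type="boardgame" id="1002">
    <name type="primary" value="Unranked Example"/>
    <statistics><ratings><ranks>
      <rank type="subtype" name="boardgame" value="Not Ranked"/>
    </ranks></ratings></statistics>
  </item>
</items>"""


def _num(item, path, cast):
    """Read the 'value' attribute at path, or None if absent or non-numeric."""
    node = item.find(path)
    if node is None:
        return None
    try:
        return cast(node.get("value"))
    except (TypeError, ValueError):
        return None  # e.g. rank value "Not Ranked"


def parse_thing(xml_text):
    """Split a /thing document into game rows and (game_id, id, name) link rows."""
    games, mechanisms, categories = [], [], []
    for item in ET.fromstring(xml_text).iter("item"):
        gid = int(item.get("id"))
        name = item.find("name[@type='primary']")
        ratings = "statistics/ratings/"
        games.append({
            "id": gid,
            "name": name.get("value") if name is not None else None,
            "minplayers": _num(item, "minplayers", int),
            "maxplayers": _num(item, "maxplayers", int),
            "playingtime": _num(item, "playingtime", int),
            "usersrated": _num(item, ratings + "usersrated", int),
            "bayesaverage": _num(item, ratings + "bayesaverage", float),
            "averageweight": _num(item, ratings + "averageweight", float),
            "rank": _num(item, ratings + "ranks/rank[@name='boardgame']", int),
        })
        for link in item.findall("link"):
            row = (gid, int(link.get("id")), link.get("value"))
            if link.get("type") == "boardgamemechanic":
                mechanisms.append(row)
            elif link.get("type") == "boardgamecategory":
                categories.append(row)
    return games, mechanisms, categories


games, mechanisms, categories = parse_thing(SAMPLE)
print(games[0]["name"], games[0]["rank"], games[1]["rank"])
```

Each link row carries both the ID and the display name, which is enough to fill a join table and, as a side effect, insert any lookup entry not seen before.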

Key columns on game:

  • bayesaverage, averageweight, usersrated — the quantitative signal
  • playingtime, minplayers, maxplayers — eligibility filter inputs
  • rank — current BGG rank
  • fetched_at — provenance for reproducibility checks
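To make the roles of these columns concrete, here is a hedged sketch of a compatible schema and an eligibility query, run against an in-memory database. Table and column names follow the text above; the column types, keys, and sample rows are assumptions, not the actual definition of data/bgg.sqlite.

```python
import sqlite3

# Assumed schema: names come from the description above, types and keys are guesses.
SCHEMA = """
CREATE TABLE game (
    id            INTEGER PRIMARY KEY,
    name          TEXT,
    bayesaverage  REAL,
    averageweight REAL,
    usersrated    INTEGER,
    playingtime   INTEGER,
    minplayers    INTEGER,
    maxplayers    INTEGER,
    "rank"        INTEGER,   -- NULL for unranked games
    fetched_at    TEXT
);
CREATE TABLE mechanism (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE category  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE game_mechanism (game_id INTEGER, mechanism_id INTEGER,
                             PRIMARY KEY (game_id, mechanism_id));
CREATE TABLE game_category  (game_id INTEGER, category_id INTEGER,
                             PRIMARY KEY (game_id, category_id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder rows, not real BGG data.
conn.executemany(
    "INSERT INTO game VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "Example A", 7.9, 2.9, 5000, 90, 1, 4, 30, "2026-04-25"),
        (2, "Example B", 8.2, 3.1, 9000, 180, 3, 6, 6, "2026-04-25"),
        (3, "Example C", 0.0, 3.0, 10, 60, 1, 2, None, "2026-04-25"),
    ],
)

# Eligibility filter: ranked games playable by 2 people in 2 hours or less,
# ordered by the quantitative signal.
rows = conn.execute(
    """
    SELECT name FROM game
    WHERE minplayers <= ? AND maxplayers >= ? AND playingtime <= ?
      AND "rank" IS NOT NULL
    ORDER BY bayesaverage DESC
    """,
    (2, 2, 120),
).fetchall()
print(rows)
```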

What's not scraped (yet)

  • Mechanism descriptions. The /thing endpoint returns mechanism IDs and names but no body text. Scraping descriptions is still an open task: they live on BGG's web pages, not the XML API. Until a safe endpoint is found, the mechanism table holds id + name only.
  • Forum / comment data. Out of scope.
  • Image assets. We don't fetch box art.

Most recently scraped

Last 10 games to land in the game table.

Name                                           Year  Rank  Bayes  Weight  Fetched
Star Wars: The Deckbuilding Game               2023   219   7.33    2.04  1d ago
20 Strong: Nemesis                             2026     -   0.00    3.00  1d ago
Orloj: The Prague Astronomical Clock           2025  1722   6.34    3.55  1d ago
Dune: Imperium                                 2020     6   8.22    3.08  1d ago
Slay the Spire: The Board Game                 2024    17   8.02    2.90  1d ago
Endeavor: Deep Sea                             2024    83   7.65    2.92  1d ago
Star Trek: Captain's Chair                     2025   852   6.73    4.04  1d ago
Voidfall                                       2023    86   7.64    4.61  1d ago
Lost Ruins of Arnak                            2020    30   7.92    2.93  1d ago
The Elder Scrolls: Betrayal of the Second Era  2025   223   7.33    4.07  1d ago