Fetcher deep dive — truth_poc.py
The script that captures NBA schedule, box scores, and play-by-play into a canonical SQLite “truth” database.
Current state
- PBP source: nba_api PlayByPlayV3 (tested import + sample frame).
- Box score: BoxScoreTraditionalV3 (V2 may still exist for fallback, but V3 is the forward path).
- Backfill support: --start-date, --end-date, and --resume for long runs.
- Idempotency: primary keys ensure safe re-run; inserts use “ignore on conflict” semantics.
Key CLI args
- --date YYYY-MM-DD (single day) or --start-date / --end-date (range)
- --max-games N limits games per day (useful while testing)
- --pbp-mode nba_api (current stable) and --sleep seconds between calls
- --db path.sqlite output database file
- --resume skips games already captured in earlier runs for the same DB
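As a rough illustration of how these flags fit together, here is a minimal argparse sketch. It is not the parser from truth_poc.py: the function names build_parser and resolve_dates, the default values, and the rule that --date and the range flags are mutually exclusive are all assumptions made for the example.

```python
import argparse
from datetime import date, timedelta


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (defaults are assumed)."""
    p = argparse.ArgumentParser(prog="truth_poc.py")
    p.add_argument("--date", help="single day, YYYY-MM-DD")
    p.add_argument("--start-date", help="range start, YYYY-MM-DD")
    p.add_argument("--end-date", help="range end, YYYY-MM-DD")
    p.add_argument("--max-games", type=int, default=None, help="limit games per day")
    p.add_argument("--pbp-mode", default="nba_api", help="play-by-play source")
    p.add_argument("--sleep", type=float, default=1.0, help="seconds between calls")
    p.add_argument("--db", default="truth.sqlite", help="output SQLite file")
    p.add_argument("--resume", action="store_true", help="skip games already captured")
    return p


def resolve_dates(args: argparse.Namespace) -> list[date]:
    """Turn --date or --start-date/--end-date into an inclusive list of days."""
    if args.date and (args.start_date or args.end_date):
        raise ValueError("use --date or --start-date/--end-date, not both")
    if args.date:
        return [date.fromisoformat(args.date)]
    if not (args.start_date and args.end_date):
        raise ValueError("provide --date, or both --start-date and --end-date")
    start = date.fromisoformat(args.start_date)
    end = date.fromisoformat(args.end_date)
    if end < start:
        raise ValueError("--end-date is before --start-date")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]
```

Resolving the range into explicit days up front keeps the per-day loop simple and makes it easy to apply --max-games and --sleep uniformly.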
What “idempotent” means here
If you run the same date range twice, the second run should not add duplicates. We enforce this with table primary keys:
- canonical_schedule_rest: (captured_time_utc, game_id, source)
- canonical_box_score: (captured_time_utc, game_id, team_id, player_id, source)
- canonical_pbp: (captured_time_utc, game_id, event_num, source)
Practical impact: you can interrupt, resume, or rerun after patching without corrupting the DB.
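The pattern can be sketched with the standard sqlite3 module. The composite key below matches the canonical_pbp key listed above; the column types, the raw_json column placement, and the helper name insert_pbp are assumptions for illustration, not the script's actual schema.

```python
import sqlite3

# Assumed column types; only the primary-key columns come from the doc.
SCHEMA = """
CREATE TABLE IF NOT EXISTS canonical_pbp (
    captured_time_utc TEXT    NOT NULL,
    game_id           TEXT    NOT NULL,
    event_num         INTEGER NOT NULL,
    source            TEXT    NOT NULL,
    raw_json          TEXT,
    PRIMARY KEY (captured_time_utc, game_id, event_num, source)
)
"""


def insert_pbp(conn: sqlite3.Connection, rows: list[tuple]) -> int:
    """Insert rows, silently skipping key conflicts; return rows actually written."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO canonical_pbp "
        "(captured_time_utc, game_id, event_num, source, raw_json) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    # total_changes only counts rows that were really inserted, not ignored ones.
    return conn.total_changes - before
```

Note that captured_time_utc is part of every key, so in this sketch a rerun is only deduplicated when it reuses the same capture timestamp; a rerun with a fresh timestamp would write a new set of rows.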
Past releases
Release 2025-12-16 (bundle v2)
Exactly what the current script does, what it writes, and what’s coming next to connect data → props → EV.
Fetcher overview (truth_poc.py)
This script is our proof-of-collection. It makes small, polite requests, normalizes fields, and writes a durable SQLite snapshot.
Design goals:
- Auditable: small surface area, clear outputs.
- Joinable: stable IDs so we can attach odds/props later.
- Versionable: core functions are contract-hashed so changes are explicit.
Data points: what we can compute immediately
From schedule + box score alone we can produce:
- Minutes trend (season vs last 5)
- Per-minute production rates (PTS/min, AST/min, REB/min)
- Basic volatility estimates (game-to-game variance)
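A minimal sketch of these three calculations, assuming each player's games are already available as chronological lists of per-game minutes and stat totals. The function names and the choice of sample variance are illustrative assumptions, not the script's implementation.

```python
from statistics import mean, variance


def minutes_trend(minutes: list[float], window: int = 5) -> tuple[float, float]:
    """Return (season average minutes, average over the last `window` games)."""
    if not minutes:
        raise ValueError("need at least one game")
    return mean(minutes), mean(minutes[-window:])


def per_minute_rate(totals: list[float], minutes: list[float]) -> float:
    """Season-level rate, e.g. PTS/min: total stat divided by total minutes."""
    total_minutes = sum(minutes)
    return sum(totals) / total_minutes if total_minutes else 0.0


def game_to_game_variance(values: list[float]) -> float:
    """Sample variance across games; 0.0 when fewer than two games exist."""
    return variance(values) if len(values) >= 2 else 0.0
```

Dividing summed totals by summed minutes (rather than averaging per-game rates) keeps short garbage-time appearances from dominating the rate.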
With PBP (optional), we can later derive:
- Event-based pace proxies and possession timing
- Foul trouble and substitution timing features
- Shot profile context (if action subtypes are present)
Known issues and hardening
- PBP uniqueness / reruns: if you rerun the same game into the same DB with the same capture timestamp, you can hit UNIQUE constraint conflicts. Our hardening patches focus on idempotency and safer event IDs.
- Endpoint churn: upstream NBA endpoints deprecate versions over time; we’re moving to v3 endpoints when possible and keeping raw_json for schema resilience.
- Scaling: for multi-season backfills we’ll add chunking, retries with jitter, and a resume ledger so interrupted runs continue safely.
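One possible shape for the planned retries is exponential backoff with "full jitter". That specific strategy, the helper name call_with_retries, and its parameters and defaults are assumptions for this sketch; the plan above only commits to "retries with jitter".

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    *,
    attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn(); on failure wait a random time under a growing cap, then retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # a real fetcher would catch narrower network errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(cap * rng())  # full jitter: uniform in [0, cap)
    raise ValueError("attempts must be at least 1")
```

The sleep and rng parameters are injectable so the backoff can be exercised without real waiting; randomizing the wait also avoids many workers retrying in lockstep against the same endpoint.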
Next: odds & props ingestion
To advise on bets, we must also ingest the sportsbook side:
- Prop markets (points, assists, rebounds, 3PM, combos, etc.)
- Line (e.g., 24.5 points)
- Odds price (e.g., -110)
- Timestamped snapshots so we can measure movement and closing line value (CLV)
We prefer licensed/normalized market data feeds when possible; scraping is a fallback and requires extra stability work.
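To make the line, price, and CLV terms concrete, here is a small sketch of a timestamped snapshot record and the standard American-odds conversion. The PropSnapshot fields and the CLV definition used (closing implied probability minus the implied probability at bet time, with no vig removal) are assumptions for illustration; other CLV conventions exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropSnapshot:
    """Hypothetical record for one observed price on one prop market."""
    captured_time_utc: str  # ISO-8601 timestamp of the observation
    game_id: str
    player_id: str
    market: str             # e.g. "points"
    line: float             # e.g. 24.5
    american_odds: int      # e.g. -110


def implied_probability(american_odds: int) -> float:
    """Convert American odds to the bookmaker's implied probability (vig included)."""
    if american_odds == 0:
        raise ValueError("American odds cannot be 0")
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def closing_line_value(bet_odds: int, closing_odds: int) -> float:
    """Positive when the closing price implies a higher probability than the bet price."""
    return implied_probability(closing_odds) - implied_probability(bet_odds)
```

For example, -110 corresponds to 110 / 210, roughly a 52.4% implied probability, and a bet taken at -110 that closes at -125 has positive CLV under this definition. The comparison is only meaningful when both snapshots refer to the same market and the same line.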