Echo Angel Studio Sports Betting NBA data intake · Build surface
Sports Betting — Cloudflare Report

What we’re building now — detailed

A developer-friendly (but beginner-safe) explanation of the data pipeline we have today, and how it becomes a betting “edge” engine once market data is added.

Latest build state (2025-12-18 CT)

Release: 2025-12-18 (CT)
Record: sportsbetting_Full_Record_v38 → v39
Phase: Phase 6 (Market + availability ingestion)
  • Truth capture (schedule + box + PBP) is stable and has been demonstrated on season-to-date ranges.
  • Outputs are stored in a single local SQLite database with canonical tables and run logs.
  • Next build is a market ingestion layer (odds + player props) and join keys so we can evaluate bets.

What data we are collecting right now

Our “truth layer” captures what happened on the court. It is free (NBA Stats endpoints via nba_api), but it requires careful rate limiting and run-resume safety.
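
A minimal sketch of what "rate limiting and run-resume safety" can look like in practice. This is not the code in truth_poc.py. The `fetch` callable, the `done` set of already-captured game ids, and the retry counts are hypothetical names chosen for illustration. The 1.5-second pause simply mirrors the `--sleep 1.5` value used elsewhere in this report.

```python
import time
from typing import Callable, Iterable, List, Set


def fetch_with_backoff(fetch: Callable[[str], dict], game_id: str,
                       retries: int = 3, base_sleep: float = 1.5,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch(game_id), retrying with exponential backoff on any error."""
    for attempt in range(retries):
        try:
            return fetch(game_id)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_sleep * (2 ** attempt))  # 1.5s, 3s, 6s, ...
    raise RuntimeError("retries must be >= 1")


def pull_games(game_ids: Iterable[str], fetch: Callable[[str], dict],
               done: Set[str], pause: float = 1.5,
               sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch every game not already in `done`, pausing between requests.

    `done` is updated in place, so a crashed run can be resumed by
    passing the same set (for example, rebuilt from the run log).
    """
    pulled = []
    for gid in game_ids:
        if gid in done:
            continue  # resume safety: skip work that already succeeded
        fetch_with_backoff(fetch, gid, sleep=sleep)
        done.add(gid)
        pulled.append(gid)
        sleep(pause)  # rate limiting: fixed pause between upstream calls
    return pulled
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. In a real run the default `time.sleep` would apply.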

  • Schedule + metadata: date, game_id, home/away team ids, start times.
  • Box score: per-player and per-team totals (minutes, points, rebounds, assists, etc.).
  • Play-by-play: event stream with period/clock and event descriptors (shots, fouls, subs, etc.).

Where it lives (SQLite canonical tables)

These tables are designed so that re-running the same pull is safe. Primary keys prevent duplicates.

  • canonical_schedule_rest — one row per game capture (PK: captured_time_utc + game_id + source).
  • canonical_box_score — per-player/per-team box rows (PK: captured_time_utc + game_id + team_id + player_id + source).
  • canonical_pbp — one row per PBP event (PK: captured_time_utc + game_id + event_num + source).
  • canonical_run_log — run metadata, args, counts, status (PK: run_id).
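
To illustrate how a composite primary key makes a repeated write harmless, here is a small sketch using Python's built-in sqlite3 module. The column types, the trimmed-down column list, and the sample rows are assumptions for illustration. The real schema may differ.

```python
import sqlite3

# Assumed, simplified version of canonical_pbp; only the key columns
# plus one payload column are shown.
SCHEMA = """
CREATE TABLE IF NOT EXISTS canonical_pbp (
    captured_time_utc TEXT    NOT NULL,
    game_id           TEXT    NOT NULL,
    event_num         INTEGER NOT NULL,
    source            TEXT    NOT NULL,
    description       TEXT,
    PRIMARY KEY (captured_time_utc, game_id, event_num, source)
);
"""


def row_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM canonical_pbp").fetchone()[0]


def insert_events(conn: sqlite3.Connection, rows) -> int:
    """Insert rows, skipping any whose primary key already exists.

    Returns the number of rows actually added.
    """
    before = row_count(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO canonical_pbp VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return row_count(conn) - before


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical sample rows.
rows = [
    ("2025-12-18T00:00:00Z", "0022500001", 1, "nba_api", "Jump ball"),
    ("2025-12-18T00:00:00Z", "0022500001", 2, "nba_api", "Made shot"),
]
first = insert_events(conn, rows)   # both rows are new
second = insert_events(conn, rows)  # same keys again: nothing is added
```

Because captured_time_utc is part of each key, this protection applies to rows written with the same capture timestamp. A re-run that stamps a new capture time would add a new set of rows instead of colliding.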

How this turns into prop betting edges

  1. Model the player outcome (e.g., “points ≥ 10”): build a distribution for points/assists/rebounds using minutes, usage, pace, matchup, and recent role changes.
  2. Pull the market line + price (e.g., “Over 9.5 points at -115”).
  3. Compute EV: convert your modeled probability into expected value at the offered odds.
  4. Track CLV: compare your bet’s line/price to the closing line; repeatable edges show up as positive CLV over time.
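
The arithmetic behind steps 3 and 4 can be sketched as follows. This is a simplified illustration, not a finished pricing module: it uses American odds, ignores the sportsbook's vig when reading implied probabilities, and measures CLV only on price (it does not handle the line itself moving, e.g. 9.5 to 10.5). The 0.58 model probability and the -130 closing price are made-up inputs.

```python
def american_to_decimal(odds: int) -> float:
    """Convert American odds (e.g. -115, +150) to decimal odds."""
    if odds == 0:
        raise ValueError("American odds cannot be 0")
    if odds > 0:
        return 1 + odds / 100
    return 1 + 100 / abs(odds)


def implied_prob(odds: int) -> float:
    """Break-even win probability at the given odds (vig not removed)."""
    return 1 / american_to_decimal(odds)


def expected_value(p_win: float, odds: int, stake: float = 1.0) -> float:
    """EV = p_win * profit_if_win - (1 - p_win) * stake."""
    profit = stake * (american_to_decimal(odds) - 1)
    return p_win * profit - (1 - p_win) * stake


def clv(bet_odds: int, close_odds: int) -> float:
    """Closing implied probability minus the implied probability we paid.

    Positive means the market closed at a worse price than ours,
    i.e. we beat the close.
    """
    return implied_prob(close_odds) - implied_prob(bet_odds)


# "Over 9.5 points at -115" with a hypothetical modeled probability of 0.58.
breakeven = implied_prob(-115)        # 115 / 215, about 0.535
ev = expected_value(0.58, -115)       # about 0.084 units per unit staked
beat_close = clv(-115, -130)          # about 0.030 if the line closes at -130
```

A bet is only positive-EV when the modeled probability exceeds the break-even probability, which is why the model's calibration matters more than any single pick.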

Right now, step (1) and the ground-truth pieces needed to evaluate step (4) are feasible because we have reliable historical outcomes. What we’re missing is consistent, timestamped market lines for steps (2) and (3).

Past releases

Release 2025-12-16 (bundle v2)

A deeper look at the current fetcher, the exact fields we collect, and how those fields turn into prop evaluation.

Current state of the fetcher

truth_poc.py is our “truth-layer” fetcher. It is intentionally small and auditable. Its job is to:

  • Take a date (YYYY-MM-DD) and pull that day’s NBA games.
  • For each game, pull player box score rows (minutes + key stats).
  • Optionally pull play-by-play (event log) for deeper derived metrics later.
  • Write everything into a local SQLite database in canonical tables.

Right now, the script is reliable enough for small test pulls (1–2 games) and is being hardened for safe re-runs and larger backfills.

What data we’re gathering (today)

The fetcher writes three canonical tables. These are the “truth spine” we’ll join against odds/props later.

1) canonical_schedule_rest

One row per game per capture timestamp. Key columns:

  • captured_time_utc
  • game_id
  • raw_game_date
  • start_time_utc
  • home_team_id
  • away_team_id
  • source
  • requested_date

2) canonical_box_score

One row per player stat line per game. Key columns:

  • captured_time_utc
  • game_id
  • team_id
  • player_id
  • minutes
  • points
  • rebounds
  • assists
  • threes_made
  • steals
  • blocks
  • turnovers
  • source

Why this matters for props: minutes + production are the starting point for every prop model. Minutes are the volume lever; per-minute rates are the efficiency lever.
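
The "volume times efficiency" idea can be written down in a few lines. This is a bare-bones sketch: it assumes minutes are already decimal numbers (upstream sources may deliver them in another format), and the three sample games and the 32-minute projection are invented.

```python
from typing import List, Tuple


def per_minute_rate(games: List[Tuple[float, float]]) -> float:
    """Pooled per-minute rate from (minutes, stat) pairs.

    Pooling totals (sum of stat / sum of minutes) weights each game by
    minutes played, so a short garbage-time stint does not skew the rate.
    """
    total_minutes = sum(minutes for minutes, _ in games)
    if total_minutes == 0:
        return 0.0
    return sum(stat for _, stat in games) / total_minutes


def project_stat(projected_minutes: float, rate: float) -> float:
    """Point estimate: volume lever (minutes) times efficiency lever (rate)."""
    return projected_minutes * rate


# Hypothetical recent games as (minutes, points).
recent = [(30, 15), (35, 21), (25, 9)]
rate = per_minute_rate(recent)          # 45 points / 90 minutes = 0.5
projection = project_stat(32, rate)     # 32 minutes * 0.5 = 16.0 points
```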

3) canonical_pbp

One row per play-by-play event (optional). Key columns:

  • captured_time_utc
  • game_id
  • event_num
  • period
  • clock
  • event_type
  • score
  • description
  • source
  • raw_json

Why PBP matters: it allows deeper features later (usage proxies, possession timing, lineup segments, foul trouble effects, etc.). We keep raw_json to stay forward-compatible as upstream schemas evolve.
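
One way raw_json can help with schema drift is to read a field under whichever key name the upstream payload currently uses. The sketch below is illustrative only: the key names `event_type` and `actionType` are hypothetical stand-ins, not confirmed field names from any NBA endpoint.

```python
import json
from typing import Any


def get_field(raw_json: str, *candidate_keys: str, default: Any = None) -> Any:
    """Return the first key from candidate_keys present in the stored payload.

    Listing old and new key names together lets stored rows stay readable
    even if the upstream source renames a field later.
    """
    payload = json.loads(raw_json)
    for key in candidate_keys:
        if key in payload:
            return payload[key]
    return default


# Two hypothetical payload shapes for the same kind of event.
old_row = '{"event_type": "foul", "period": 2}'
new_row = '{"actionType": "foul", "period": 2}'

old_value = get_field(old_row, "event_type", "actionType")
new_value = get_field(new_row, "event_type", "actionType")
```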

Command to run (developer-friendly)

After installing dependencies, run this from a command line:

python truth_poc.py --date 2025-12-15 --max-games 1 --db pilot.sqlite --pbp-mode nba_api --sleep 1.5

Tip: If you re-run often, use a new DB filename per run or apply the idempotency patch so reruns don’t collide on primary keys.

How we use these tables to build a prop model

The canonical tables are not themselves the model; they are the data substrate. The modeling pipeline will:

  1. Compute baseline rates (e.g., points per minute, assists per minute) from historical box scores.
  2. Project minutes for the upcoming game (recent minutes + role + rotation changes).
  3. Adjust for context: pace, rest, home/away, injuries/usage shifts, and matchup factors (weighted lightly).
  4. Generate a distribution of outcomes (not just a point estimate) so we can price “over/under” lines.
  5. Compare to market once we ingest sportsbook odds/lines snapshots; compute EV and only recommend when edge is meaningful.
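
Steps 4 and 5 above can be sketched with a simple stand-in distribution. The Poisson model here is an assumption made for illustration, not the project's chosen model; real scoring data is usually more spread out than a Poisson allows. The projected mean of 12.0 and the `min_edge` threshold of 0.03 are also hypothetical.

```python
import math


def poisson_sf(k_min: int, mean: float) -> float:
    """P(X >= k_min) for X ~ Poisson(mean)."""
    if k_min <= 0:
        return 1.0
    cdf = sum(math.exp(-mean) * mean ** k / math.factorial(k)
              for k in range(k_min))
    return 1.0 - cdf


def prob_over(line: float, mean: float) -> float:
    """P(stat > line). For a half-point line like 9.5 this is P(X >= 10)."""
    return poisson_sf(math.floor(line) + 1, mean)


def has_edge(model_p: float, breakeven_p: float,
             min_edge: float = 0.03) -> bool:
    """Recommend only when the model beats break-even by at least min_edge."""
    return model_p - breakeven_p >= min_edge


projected_mean = 12.0                      # hypothetical projected points
model_p = prob_over(9.5, projected_mean)   # modeled chance of 10+ points
breakeven_p = 115 / 215                    # break-even at -115 odds
recommend = has_edge(model_p, breakeven_p)
```

Working with a full distribution instead of a single projected number is what lets the same model price any line (9.5, 10.5, 11.5) without being refit.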