Warehouse fresh · 2026-07-02 08:24 UTC
Model R̂ max 1.000
ESS min 1501
Divergences 0
Trained 2026-07-02 08:51 UTC

Methodology

A short, honest description of the model behind these forecasts — the data it sees, the statistical assumptions it makes, and the things it deliberately ignores. Everything you see on the site is generated by the pipeline described here.

1 · Overview

The forecast is produced by a hierarchical Bayesian generalised linear model fitted with Hamiltonian Monte Carlo (PyMC's NUTS sampler). It estimates each team's latent offensive and defensive strength, each starting pitcher's quality as a random effect on opponent run-rate, and a per-venue park factor capturing each ballpark's run environment. The fit uses every regular-season and postseason game from 2022 to date; the remainder of the 2026 regular season is then projected by Monte Carlo simulation from the posterior.

We chose a Bayesian framework for three reasons: posterior credible intervals quantify what we don't know; partial pooling keeps early-season ratings honest when a team has only played a dozen games; and posterior draws plug straight into a forward simulation, so the league standings we show are calibrated, uncertainty-aware projections rather than point estimates.

2 · Data pipeline

All training data flows from a separate warehouse repository. The ingestion layer pulls every MLB game from the MLB Stats API into a bronze table, which dbt transforms into a gold feature table (mlb_gold.feat_matchup) and a game-level fact table (mlb_gold.fct_games). This model reads both.

Critically, the rolling features in the warehouse are computed using only games that occurred before the current row — no target leakage. When we use, for example, a team's last-30-game win percentage on a given date, it genuinely means "as of the morning of that day".

3 · The likelihood

Each game contributes two observations to the model: the runs the home team scored, and the runs the away team scored. Runs are modelled as Poisson-distributed with a rate that depends on the two teams involved, the opposing starting pitcher, the venue, and whether the scoring team was playing at home:

runs[team, game] ~ Poisson(λ)

log(λ) = intercept + offence[team] − defence[opponent]
              − pitcher_quality[opponent_starter]
              + park[venue] + home_adv · is_home

The Poisson is the natural distribution for count outcomes; in practice it also fits MLB run totals well because runs are rare events across a fixed inning structure. The log link means the four ratings combine multiplicatively on the rate scale, which matches how scoring actually works: a strong offence raises expected runs by a proportion, not a constant.

The opponent's starting pitcher is what suppresses the team's runs in any given game, so the random effect is keyed on the pitcher taking the mound for the team being scored against. Two games of the same team-vs-team matchup at the same venue will therefore have different expected run totals if different starters are pitching — which is exactly the source of game-to-game variation across a series that's otherwise identical at team level.
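
To make the equation concrete, here is a minimal Python sketch of the rate calculation. The function name and every numeric input are illustrative assumptions, not fitted values from the model; only the structure of the sum comes from the equation above.

```python
import math

def expected_runs(intercept, offence, defence, pitcher_quality,
                  park, home_adv, is_home):
    """Poisson rate λ for one batting side, following the log-linear equation."""
    log_rate = (intercept + offence - defence - pitcher_quality
                + park + home_adv * (1 if is_home else 0))
    return math.exp(log_rate)

# Placeholder inputs: an average away team facing an average starter.
neutral = expected_runs(intercept=1.5, offence=0.0, defence=0.0,
                        pitcher_quality=0.0, park=0.0,
                        home_adv=0.05, is_home=False)

# Same matchup in a park that inflates scoring by 15%.
hitter_park = expected_runs(intercept=1.5, offence=0.0, defence=0.0,
                            pitcher_quality=0.0, park=math.log(1.15),
                            home_adv=0.05, is_home=False)

print(f"park multiplier on the rate scale: {hitter_park / neutral:.3f}")
```

Because the terms are added on the log scale, the park term shows up as a pure multiplier on λ whatever the other ratings are.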

4 · Partial pooling within league

Team-level ratings are drawn from league-level priors, so early in a season every team is pulled towards the American or National League average until its own results earn it some distance. Without this "shrinkage" a team that started 3–0 would briefly look world-class; with it, the model concedes uncertainty and waits for more data.

offence[team] ~ Normal(league_offence[league(team)], σ_off)
defence[team] ~ Normal(league_defence[league(team)], σ_def)

league_offence, league_defence ~ Normal(0, 0.2)
σ_off, σ_def ~ HalfNormal(0.3)

The hyperparameters (σ_off, σ_def) are themselves learnt from the data; only their HalfNormal hyper-priors are fixed. If the two leagues diverge sharply from the global mean, or if one league is markedly more spread-out than the other, the posterior will reflect that.

Pitcher quality is partially pooled the same way, but around a single league-wide zero (there is no league-of-pitcher hyperparameter, because pitchers move between leagues too freely for one to add much). Each pitcher's individual estimate is shrunk toward zero by a data-learnt σ_pitcher hyperparameter, so a pitcher with three good starts isn't treated as a Cy Young candidate, but a pitcher with 80 starts of consistent above-average performance can earn a meaningful positive estimate.

pitcher_quality[p] ~ Normal(0, σ_pitcher)

σ_pitcher ~ HalfNormal(0.25)

Pitchers not seen in training (recent debuts, traded-in arms, fresh call-ups) fall back to league average — zero effect — at predict time. Once they have a few starts on the books, they're picked up automatically on the next nightly fit.

Each ballpark gets its own random effect on the run-environment intercept. Coors Field, Petco Park and the Yankee Stadium short porch all materially shift the per-game expected runs; previously the model absorbed those venue effects into the home team's offence and defence terms, which is mis-attribution. Park factors are partially pooled toward zero with a tighter prior than team or pitcher effects — realistic park factors are smaller in magnitude (Coors at +15% maps to log(1.15) ≈ 0.14).

park[venue] ~ Normal(0, σ_park)

σ_park ~ HalfNormal(0.10)

Both teams' run rates at a given venue share the same park multiplier — i.e. the term is added to log(λ) regardless of which side is scoring. That keeps the offence and defence ratings clean of venue noise.

5 · Recency weighting

Baseball teams change — rosters turn over, managers get fired, front offices tear down and rebuild. A team's 2022 performance is not particularly informative about its 2026 performance, yet an unweighted likelihood would let four years of ancient results drown out a few dozen current ones. That's how a team in the middle of a genuine rebound (hello, Athletics) ends up stuck near the bottom of the posterior long after it should have moved.

Rather than model each season as its own parameter (tried; the random walk posterior geometry was unkind to NUTS and the fit took too long on the CI runner), each game's log-likelihood contribution is scaled by its age:

w_i = exp(−age_i / τ),  τ = 0.5 years

weighted log-likelihood = Σ_i w_i · log p(runs_i | λ_i)

Concrete weights for a refresh run today:

So the current 2026 season carries essentially all the weight, with the back half of 2025 contributing a meaningful but reduced slice and anything older effectively zero. τ was tightened from 1.5 → 0.5 years on 2026-05-14 after a calibration audit found the posterior home-field advantage collapsing to ~0.027 (despite a Normal(0.05, 0.015) prior) — the older seasons' lower observed HFA was dragging the estimate down. Shorter τ lets the 2026 sample, where home teams are winning ~57% of games, dominate. In PyMC this is applied via pm.Potential on a vectorised log-likelihood; the posterior sees each observation as a fractional contribution proportional to its recency.
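
The weighting rule is easy to check by hand. The sketch below is plain Python rather than the pm.Potential code the pipeline uses, and the helper names are our own for illustration; it evaluates the weight at a few ages and shows how the weight scales a single game's Poisson log-likelihood.

```python
import math

TAU_YEARS = 0.5  # decay constant quoted in the text

def recency_weight(age_years, tau=TAU_YEARS):
    """w_i = exp(-age_i / tau)."""
    return math.exp(-age_years / tau)

def poisson_logpmf(runs, lam):
    """log p(runs | lam) for a Poisson observation."""
    return runs * math.log(lam) - lam - math.lgamma(runs + 1)

def weighted_loglik(games, tau=TAU_YEARS):
    """Sum of w_i * log p(runs_i | lam_i); games holds (runs, lam, age_years)."""
    return sum(recency_weight(age, tau) * poisson_logpmf(runs, lam)
               for runs, lam, age in games)

# Illustrative ages only; a real run derives age from each game's date.
for age in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(f"{age:>4.1f} years old -> weight {recency_weight(age):.4f}")
```

With τ = 0.5, a game six months old carries about 37% of the weight of one played today, and a game a year old about 14%.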

6 · Current form (short-window recency)

Recency weighting (§5) is a long, slow signal — it discounts old games against new ones across years. It does not capture current form: the fact that a strong-rated team can be in a two-week cold streak without the season-level team-strength posterior meaningfully moving against them. A real failure mode surfaced in early-May 2026 evaluation: the model picked currently-slumping favourites at season-strength expectation (Dodgers 1-for-5 when picked; Yankees 3-for-8) because the strength posterior was correct on the year and the recency-weight decay was too long to register a fortnight of bad results.

To address this, the model includes an explicit short-window form feature alongside the long-run team strengths. For each game and each batting team we compute the team's run differential over their previous 10 completed games:

team_form_l10 = Σ (runs_scored − runs_allowed) over last 10 games

This enters the likelihood as a single shared coefficient (β_form) applied to the scaled form value:

log(λ) += β_form · (team_form_l10 / 25)

The prior is tight — β_form ~ Normal(0, 0.02) — so form is a modulation of the team-strength estimate, not a dominant signal. A typical +15 form value with a 1σ posterior coefficient produces about a +3% scoring boost on a given game; a −15 slump, about −3%. The current form column on the Team Strength tab shows each team's raw L10 run differential alongside their long-run rating.
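
As a quick arithmetic check on the size of the form effect, the sketch below converts a form value into a multiplier on the run rate. The coefficient of 0.05 is a hypothetical placeholder chosen for illustration, not a fitted estimate; the divisor of 25 is the scaling from the equation above.

```python
import math

FORM_SCALE = 25.0  # divisor applied to team_form_l10 in the likelihood

def form_multiplier(team_form_l10, beta_form):
    """Multiplier on the run rate implied by the form term."""
    return math.exp(beta_form * team_form_l10 / FORM_SCALE)

beta_form = 0.05  # hypothetical coefficient, for illustration only

hot = form_multiplier(+15, beta_form)
cold = form_multiplier(-15, beta_form)
print(f"+15 form: {100 * (hot - 1):+.1f}%   -15 form: {100 * (cold - 1):+.1f}%")
```

Under that assumed coefficient, a +15 run differential over the last ten games moves the expected run rate by roughly 3%, and a −15 slump moves it down by about the same amount.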

Why this isn't double-counting recency: the likelihood weighting (§5) determines how much each historical game contributes to estimating each team's true skill; the form term modulates the predicted run rate around that skill estimate by current circumstances. They operate at different time scales — years for the strength posterior, days for form — and are jointly identified because of that separation. A team that has always been a 90-win club but is currently 1-9 over two weeks should keep its strength estimate (the long-run signal is years of data) while having its per-game prediction pulled toward weaker outputs (the short signal is two weeks of data).

No leakage: the form value for any given training game is computed from the team's previous 10 completed games strictly before that game's date, via a SQL window function with ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING. The current game never contributes to its own form.
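
For readers who think in Python rather than SQL, here is a small sketch of the same trailing window. It is an illustration of the frame logic only, not the warehouse code, and the function name is hypothetical. An empty frame returns None, mirroring the NULL that SUM gives over an empty window frame.

```python
def form_l10(run_diffs, window=10):
    """Trailing run differential, excluding the current game.

    run_diffs is one team's per-game (runs_scored - runs_allowed),
    in chronological order. Element i of the result only looks at
    games i-window .. i-1, like ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING.
    """
    out = []
    for i in range(len(run_diffs)):
        prior_games = run_diffs[max(0, i - window):i]  # never includes game i
        out.append(sum(prior_games) if prior_games else None)
    return out

print(form_l10([3, -1, 2, 0, -4]))
```

The slice ends at i rather than i + 1, which is the whole point: changing the result of game i cannot change the form value attached to game i.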

7 · Global parameters

Two parameters are shared across the whole league.

8 · Inference

Posteriors are sampled with PyMC's No-U-Turn Sampler (NUTS), the current default for Hamiltonian Monte Carlo on continuous parameter spaces. Settings for each nightly run:

chains: 4 (sampled in parallel on 4 vCPUs)
warm-up: 1,000 iterations per chain
retained draws: 500 iterations per chain
target_accept: 0.95
total samples: 2,000 posterior draws

A single fit takes roughly twelve minutes on a GitHub Actions runner; most of the cost is in the per-pitcher random effect (~1,500 RVs) added to the otherwise-small team-level model. The warm-up phase tunes the mass matrix and step size; only post-warm-up draws are used for inference and simulation. The higher target_accept (0.95 rather than PyMC's 0.8 default) forces smaller step sizes, which keeps divergences at zero on this model.

9 · Convergence diagnostics

Every run writes three headline diagnostics into predictions.model_diagnostics, and the current values appear in the status strip at the top of every page.

10 · Season simulation

The posterior alone tells you how strong each team is; to translate that into standings and playoff odds we simulate the remaining regular season. For each of the 10,000 simulations, and each remaining scheduled game:

  1. Draw a single posterior sample of all parameters.
  2. Compute home-team and away-team expected run rates from the model equation.
  3. Draw a Poisson realisation for each side to get simulated run totals.
  4. Assign the win to whichever team scored more. Tied simulated scores (~13% of Poisson draws at typical run rates) are resolved by a fair coin flip per game — a close approximation to extra innings, and a correctness fix landed 2026-05-14 (previously the per-game predictor dropped ties entirely and the standings predictor awarded all ties to the home team, both of which introduced systematic bias).

Across 10,000 simulations this produces a full distribution of final-season win counts for every team, which is where the "80% CI" projection ranges and the division-winner probabilities come from.
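
The four steps can be sketched in standard-library Python. This is a simplified illustration under stated assumptions: each posterior draw is represented as a dict that already maps a matchup to its two run rates (so step 2 is a lookup rather than the full model equation), and all function names are our own.

```python
import math
import random

def poisson_draw(lam, rng):
    """Knuth's multiplication method; fine for baseball-sized rates."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def home_team_wins(lam_home, lam_away, rng):
    """Steps 3 and 4: Poisson scores, fair coin flip on ties."""
    home, away = poisson_draw(lam_home, rng), poisson_draw(lam_away, rng)
    if home == away:
        return rng.random() < 0.5
    return home > away

def simulate_wins(schedule, posterior_draws, n_sims, seed=0):
    """Return one {team: wins} dict per simulated rest-of-season.

    schedule: list of (home, away) pairs still to be played.
    posterior_draws: list of dicts mapping (home, away) -> (lam_home, lam_away).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        wins = {}
        for home, away in schedule:
            draw = rng.choice(posterior_draws)        # step 1
            lam_home, lam_away = draw[(home, away)]   # step 2 (precomputed)
            if home_team_wins(lam_home, lam_away, rng):
                winner, loser = home, away
            else:
                winner, loser = away, home
            wins[winner] = wins.get(winner, 0) + 1
            wins.setdefault(loser, 0)
        results.append(wins)
    return results
```

Sorting each team's win totals across the returned dicts and reading off the 10th and 90th percentiles would give an 80% range of the kind shown on the site.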

10b · Post-fit calibration & shrinkage

A 2026-05-25 held-out audit graded the model against 293 completed games from the prior 22 days and surfaced two distinct failure modes that no parameter tweak inside the Bayesian fit could cleanly address. Three post-fit adjustments now sit between the Monte Carlo and the probability we publish.

Both the calibrated/shrunk home_win_prob and the raw home_win_prob_raw are written to BigQuery so future audits can attribute which adjustment is paying off vs hurting.

11 · Series Focus pages

The Series Focus tab hosts longer-form editorial pieces that decompose a single matchup or modeling question. Live data on those pages comes from four sources, joined at render time so each refresh of the training pipeline updates the prose embedded in the article:

For series with three or four scheduled games, an outcome-distribution Monte Carlo sits on top of the per-game WPs. The current implementation treats games as independent draws, which is fine for editorial framing but ignores the correlation that bullpen state induces from one night to the next. A bullpen-leverage feature on the roadmap will replace that with a properly correlated multi-game simulator.

12 · Live in-game win probability

Once a game starts, the per-game win-probability bar on the Today tab switches from the morning pre-game number to a live estimate that updates roughly every eleven minutes through the rest of the game. The pre-game number stays visible as a small reference under the live bar so the delta between "what the model said" and "where the game is now" is readable: a live 63% means very different things for a 55% favourite vs a 35% underdog.

The live estimate is a Monte Carlo simulation seeded by the morning fit. For each in-progress game and each of 1,000 iterations:

  1. Take the morning model's expected-runs-for-this-game numbers (pred_home_runs, pred_away_runs) — the posterior means under the specific team / pitcher / park matchup. Divide each by 9 to get a per-inning Poisson rate.
  2. For the current half-inning, draw the remaining runs from Poisson(λ), where λ is the Tom Tango RE24 expected-remaining-runs value indexed by (outs, base_state), rescaled by the batting team's per-inning rate vs the league average and by the run-suppression rating of the pitcher actually on the mound relative to the assumed starter — so a reliever weaker than the starter raises the expected runs in that inning.
  3. For each subsequent half-inning, draw fresh runs from Poisson(per_inning_rate) for the batting team.
  4. Apply MLB end-of-game rules explicitly: the home team doesn't bat in the bottom of any inning ≥ 9 if it's already leading; the game ends as soon as the score is not tied after the bottom of the 9th or any later inning; otherwise extra innings continue.
  5. Compare final simulated totals; count it as a home win or not.

The live home-win probability is the fraction of iterations where home > away.
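
A stripped-down sketch of steps 1 and 3 to 5 follows. It deliberately leaves out step 2 (the RE24 base-out adjustment and the pitcher re-rating), so it only handles a game state at the start of a half-inning. The function names and the safety cap on innings are our own assumptions for illustration.

```python
import math
import random

def poisson_draw(lam, rng):
    """Knuth's multiplication method for a Poisson draw."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def play_out(home, away, inning, top, rate_home, rate_away, rng, max_innings=50):
    """Simulate the rest of one game; return True if the home team wins.

    top is True when the away team is about to bat in this inning.
    """
    while inning <= max_innings:
        if top:
            away += poisson_draw(rate_away, rng)
        # Home skips the bottom half in the 9th or later if already ahead.
        if not (inning >= 9 and home > away):
            home += poisson_draw(rate_home, rng)
        # Any non-tied score after the bottom of the 9th or later ends it.
        if inning >= 9 and home != away:
            return home > away
        inning, top = inning + 1, True
    return rng.random() < 0.5  # assumed safety cap, practically never reached

def live_home_win_prob(home, away, inning, top, pred_home_runs,
                       pred_away_runs, n_iter=1000, seed=0):
    """Fraction of simulated finishes in which the home team wins."""
    rng = random.Random(seed)
    rate_home, rate_away = pred_home_runs / 9, pred_away_runs / 9  # step 1
    wins = sum(play_out(home, away, inning, top, rate_home, rate_away, rng)
               for _ in range(n_iter))
    return wins / n_iter

print(live_home_win_prob(home=2, away=3, inning=7, top=True,
                         pred_home_runs=4.6, pred_away_runs=4.2))
```

The sketch lets a walk-off inning run to its full Poisson total rather than stopping at the winning run. That changes the simulated margin but not who wins, so the probability is unaffected.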

What this captures: score differential, current base-out state (a 2-out empty-bases situation reads very differently from bases-loaded-0-out for the same score), team offensive-rate asymmetry (the trailing team's per-inning rate is used for their innings, not league average), the specific pitcher currently on the mound (a reliever who replaced the assumed starter re-rates the current half-inning — starters by the Bayesian model rating, relievers by a sample-size-regressed FIP rating so the bullpen is covered too), home-team walk-off mechanics.

What this does NOT capture (honest caveats):

Cadence and infrastructure. A scheduled job fires every 11 minutes during the 20:00-06:00 UTC game window, polls the MLB Stats API for live linescore state, runs the simulation for any in-progress games, and writes the result to a public JSON file the browser fetches. The 11-minute interval sits just over the average ~9-minute half-inning so each refresh is likely to land on a genuinely new game state.

The series outcome distribution updates live too. Each card's projected series-outcome table at the bottom is recomputed in the browser on every successful poll using the same Poisson-binomial convolution the morning model uses — independent Bernoulli per remaining game, where each game's win probability is either the morning pre-game number (for future games and pre-game tonight) or the live in-game probability (once the active game is in progress). Past games with decided results contribute deterministically. The take-series percentage and the per-row outcome bars all move as the live game evolves.
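
The convolution itself is short. The sketch below is a Python illustration (the browser code is not reproduced here, and the function names are hypothetical): each game is an independent Bernoulli, a decided game enters with probability 1.0 or 0.0, and "taking the series" is assumed to mean winning more than half of the games.

```python
def series_win_distribution(win_probs):
    """dist[k] = probability of winning exactly k of the listed games."""
    dist = [1.0]
    for p in win_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)   # lose this game
            nxt[k + 1] += mass * p     # win this game
        dist = nxt
    return dist

def take_series_prob(win_probs):
    """Probability of winning more than half the games (assumed definition)."""
    dist = series_win_distribution(win_probs)
    n_games = len(win_probs)
    return sum(mass for k, mass in enumerate(dist) if 2 * k > n_games)

# Game 1 already won, game 2 live at 63%, game 3 at its pre-game 48%.
probs = [1.0, 0.63, 0.48]
print(series_win_distribution(probs))
print(take_series_prob(probs))
```

Swapping the live 0.63 for a new value on each poll and re-running the convolution is all that is needed to move the outcome bars.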

Starter pitch charts live on the Pitching tab. For each starter in every currently-live game, every pitch they've thrown is rendered as a coloured marker on an SVG strike-zone plot (catcher's perspective). Marker position uses the pX / pZ coordinates returned by the MLB Stats API live feed; colour encodes pitch type (four-seam red, sinker/two-seam orange, cutter yellow, slider blue, curveball purple, changeup green, splitter teal). The data refresh shares the live-poller's cadence — pitches accumulate across the start, you see roughly the latest poll's worth of new dots on each refresh. No model inference is applied to this view; it is the raw pitch data the model could in principle ingest in a future pitcher-stuff feature.

13 · Bullpen-leverage projection

MLB games end up in the bullpens. The starter rarely reaches the 9th, the leverage of every plate appearance climbs through the middle innings, and the late-inning matchup choices made by each manager become the difference in any one-run game. Public models typically don't price this: they price the starter, the team strength, the park, and call the bullpen a wash. We have the ingredients to do better.

Two existing data sources combine to produce a leverage projection:

For each team we join the two tables, filter to relievers in fresh or taxed state, sort by posterior quality_mean, and take the top three as that team's high-leverage core — the arms the manager is most likely to lean on if the game is tight in the 7th-9th. The headline number is the average quality_mean across that core. Positive values represent late-inning run suppression above league average; +0.05 is elite, 0.0 average, -0.05 weak. The Today tab surfaces the comparison as a "late-inning edge" strip per game; the Series Focus pages incorporate the same projection into the editorial framing of a multi-game series.
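
The selection rule reduces to a filter, a sort and a mean. In the sketch below the record layout, the `state` field name and the "unavailable" label are assumptions made for illustration; `quality_mean` and the fresh/taxed filter come from the description above.

```python
def late_inning_edge(relievers, core_size=3):
    """Average quality_mean of the best available relievers, or None."""
    available = [r for r in relievers if r["state"] in ("fresh", "taxed")]
    core = sorted(available, key=lambda r: r["quality_mean"], reverse=True)
    core = core[:core_size]
    if not core:
        return None  # no usable arms on record
    return sum(r["quality_mean"] for r in core) / len(core)

# Made-up bullpen for illustration.
bullpen = [
    {"name": "closer", "state": "fresh", "quality_mean": 0.06},
    {"name": "setup", "state": "taxed", "quality_mean": 0.03},
    {"name": "ace_reliever", "state": "unavailable", "quality_mean": 0.08},
    {"name": "middle", "state": "fresh", "quality_mean": 0.00},
    {"name": "mop_up", "state": "fresh", "quality_mean": -0.04},
]
print(late_inning_edge(bullpen))
```

Note that the best arm in this made-up pen is excluded because it is not available, which is exactly the kind of night-to-night swing the strip is meant to surface.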

Limitations. The current iteration is independence-naïve: it does not model that a reliever used in Friday's 9th becomes taxed for Saturday's 8th. A correlated multi-game simulator that walks bullpen state forward across a series is the next iteration. Separately, warehouse ingestion of stg_pitcher_game_log currently only captures pitchers who have appeared in a 2026 game, which leaves bullpens two-to-four-arms deep instead of the full eight; the metric is correct given inputs, but the inputs understate stack depth on both sides until that ingestion widens. Pitcher handedness against the opposing team's projected late-inning lineup is also not factored in yet — a LHP-vs-LHB or RHP-vs-RHB matchup is real signal that the current top-N average ignores.

14 · Pitch-level data + heatmaps

As of 2026-05-15 every tracked pitch from the MLB Stats API v1.1 /game/{gamePk}/feed/live endpoint is persisted to mlb_bronze.pitches by scripts/refresh_pitches.py in the warehouse repo. Each row carries plate coordinates (pX, pZ), official zone (1-9 inside the strike zone, 11-14 outside), pitch type, start velocity, call code, and the at-bat result — the same nested play structure the live pitching tab consumes on a 15-second polling cadence, now aggregated historically.

stg_pitches normalises the raw stream and adds two derived columns the gold layer leans on: pitch_family (fastball / breaking / offspeed / other, collapsed from the noisy pitch_type_code), and zone_grid (a 5x5 row/col string that maps the 9 inner zones plus 4 outer corner zones onto the SVG heatmap grid).

feat_pitcher_zone rolls up to one row per (pitcher, pitch_family, zone_grid) with pitch count, share of the pitcher's total mix, called-strike rate, whiff-per-swing, in-play count, and the modal pitch_type in that cell. The model repo's pitch_zone_for(pitcher_id) data loader reads from this table and the pitcher_zone_svg() renderer turns it into inline SVG for the per-starter cards on Series Focus. Cells are coloured by frequency, and corner cells are weighted more heavily than inner-zone cells so the chart stays readable across viewport sizes.

feat_batter_splits joins stg_player_splits — populated from the MLB Stats API statSplits endpoint per player — to stg_player_metadata for the bat_side_code + current_team_id needed by lineup rendering. Each batter has two rows (vs LHP / vs RHP) per season with a full slash line and counting stats. Lineup blocks on the Today tab swap the season slash line for the actual vs-opposing-hand line, with the (PA) sample size on hover.

15 · Posterior-update preview

The Series Focus page projects, for each team, what the overall rating posterior would look like after each of the four possible outcomes of the upcoming series. The mechanism is a standard conjugate-normal update:

posterior precision = prior precision + n / σ_g²
posterior mean = (prior precision · prior mean + (n / σ_g²) · obs) / posterior precision

The prior precision is read off the current 80% credible interval on overall_mean (half-width divided by 1.2816 = posterior sd; precision = 1 / sd²). The per-game observation variance σ_g² is set to 0.025 on the log-rate scale, calibrated empirically so a noninformative prior would move by roughly the standard deviation of season-end ratings (~0.05) after 162 games. The observation itself is the scenario's average per-game run differential, mapped to log-rate space via the rule that a one-run-per-game change in true talent corresponds to roughly 1/40 in log-rate.
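
Put together, the update is a few lines of arithmetic. The sketch below follows the two formulas and the constants quoted above; the function names and the example inputs are illustrative assumptions, not values from the site.

```python
Z_80 = 1.2816    # z-score bounding a central 80% normal interval
OBS_VAR = 0.025  # per-game observation variance on the log-rate scale

def precision_from_ci(ci_low, ci_high):
    """Turn an 80% credible interval into a normal precision."""
    sd = (ci_high - ci_low) / 2 / Z_80
    return 1.0 / sd ** 2

def run_diff_to_log_rate(avg_run_diff):
    """One run per game of true talent is treated as roughly 1/40 in log-rate."""
    return avg_run_diff / 40.0

def conjugate_update(prior_mean, prior_precision, obs, n_games, obs_var=OBS_VAR):
    """Return (posterior mean, posterior sd) after n_games at average obs."""
    data_precision = n_games / obs_var
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + data_precision * obs) / post_precision
    return post_mean, post_precision ** -0.5

# Example: rating 0.00 with 80% CI of about +/-0.064, then a three-game
# sweep won by an average of two runs per game.
prior_precision = precision_from_ci(-0.06408, 0.06408)
obs = run_diff_to_log_rate(2.0)
post_mean, post_sd = conjugate_update(0.0, prior_precision, obs, n_games=3)
print(f"posterior mean {post_mean:.4f}, posterior sd {post_sd:.4f}")
```

Because three games add only a modest amount of precision next to the prior, the posterior mean moves a fraction of the way toward the observed value, which is the behaviour the preview is meant to show.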

The Pythagorean win-percentage shown alongside each scenario uses the same Bill James 1.83-exponent formula from the Series Focus piece's first section, applied to the team's projected new runs-scored and runs-allowed (with average MLB win-game and loss-game splits — 5.5 RS / 3.5 RA per win, 3.0 RS / 5.0 RA per loss).
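
For reference, the Pythagorean calculation looks like this in Python. How the per-win and per-loss run splits are folded into a team's season totals is our assumption for the sake of the example (they are simply added to the running totals); the exponent and the split values are the ones quoted above.

```python
EXPONENT = 1.83

def pythagorean_win_pct(runs_scored, runs_allowed, exponent=EXPONENT):
    """RS^x / (RS^x + RA^x)."""
    rs, ra = runs_scored ** exponent, runs_allowed ** exponent
    return rs / (rs + ra)

def runs_after_series(runs_scored, runs_allowed, wins, losses):
    """Add average win-game and loss-game run splits to season totals."""
    new_rs = runs_scored + 5.5 * wins + 3.0 * losses
    new_ra = runs_allowed + 3.5 * wins + 5.0 * losses
    return new_rs, new_ra

# Hypothetical team at 400 RS / 380 RA facing a three-game series.
for wins in range(4):
    rs, ra = runs_after_series(400, 380, wins, 3 - wins)
    print(f"{wins}-{3 - wins}: {pythagorean_win_pct(rs, ra):.3f}")
```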

What this is not. The update is a posterior approximation, not the actual posterior from a re-fit of the PyMC model on data + tonight. The real refit would account for the opponent's strength (beating a strong team is more informative than beating a weak team), park effects, pitcher matchup quality, and the recency weighting of §5. The closed-form conjugate version is much faster (cheap enough to compute in a single page render) and gives roughly the right magnitude and sign. For a proper "what is our posterior after this game?" analysis, you'd run an overnight fit on data including the new game and compare; this is the cheap preview.

16 · Playoff qualification

The playoff probability shown on the League Predictions tab is simply the fraction of simulations in which a team finished in the 2026 MLB postseason field. Under current rules, that's six teams per league: three division winners plus three wild cards taken from the remaining teams by regular-season win percentage. Ties are broken by assigning fractional wins to every tied team, so the reported odds always sum exactly to three division winners and three wild cards per league across simulations.

17 · Data sources

18 · Deliberate omissions

Several signals that might plausibly improve predictive accuracy are not currently in the model. Each is a conscious choice; each is a candidate for a future release.