Methodology · MLB 2026 Forecast

1 · Overview

The forecast is produced by a hierarchical Bayesian generalised linear model fitted with Hamiltonian Monte Carlo (PyMC's NUTS sampler). It estimates each team's latent offensive and defensive strength using every regular-season and postseason game from 2022 to date, then projects the remainder of the 2026 regular season via Monte Carlo simulation of the posterior.

We chose a Bayesian framework for three reasons: posterior credible intervals quantify what we don't know; partial pooling keeps early-season ratings honest when a team has only played a dozen games; and posterior draws plug straight into a forward simulation, so the league standings we show are calibrated, uncertainty-aware projections rather than point estimates.

2 · Data pipeline

All training data flows from a separate warehouse repository. The ingestion layer pulls every MLB game from the MLB Stats API into a bronze table, which dbt transforms into a gold feature table (mlb_gold.feat_matchup) and a game-level fact table (mlb_gold.fct_games). This model reads both.

Critically, the rolling features in the warehouse are computed using only games that occurred before the current row — no target leakage. When we use, for example, a team's last-30-game win percentage on a given date, it genuinely means "as of the morning of that day".
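
As an illustration of the "prior games only" rule, here is a minimal Python sketch of a leakage-free rolling win percentage. The warehouse itself computes these features in dbt; the function name `rolling_win_pct` and the per-game ordering used here are assumptions for the example, not the warehouse code.

```python
from collections import deque

def rolling_win_pct(results, window=30):
    """For each game, return the win percentage over the previous `window`
    games, excluding the current game (None when there is no history yet).

    `results` must be in chronological order: True for a win, False for a loss.
    """
    history = deque(maxlen=window)  # holds only the most recent prior results
    features = []
    for won in results:
        # The feature is read BEFORE the current result is appended,
        # so a game's own outcome never leaks into its own feature.
        features.append(sum(history) / len(history) if history else None)
        history.append(1 if won else 0)
    return features

example = rolling_win_pct([True, True, False, True])
print(example)
```

The first game has no history, so its feature is None rather than a value borrowed from the future.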

3 · The likelihood

Each game contributes two observations to the model: the runs the home team scored, and the runs the away team scored. Runs are modelled as Poisson-distributed with a rate that depends on the two teams involved and whether the scoring team was playing at home:

runs_{team, game} ~ Poisson(λ)

log(λ) = intercept + offence[team] − defence[opponent] + home_adv · is_home

The Poisson is the natural distribution for count outcomes; in practice it also fits MLB run totals well because runs are rare events across a fixed inning structure. The log link means the three ratings combine multiplicatively on the rate scale, which matches how scoring actually works: a strong offence raises expected runs by a proportion, not a constant.
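
The multiplicative reading of the log link can be checked with a few lines of arithmetic. This is a sketch of the model equation above using made-up rating values; `expected_runs` is a hypothetical helper name.

```python
import math

def expected_runs(intercept, offence, defence_opp, home_adv, is_home):
    """Poisson rate from the model equation:
    log(lambda) = intercept + offence[team] - defence[opponent] + home_adv * is_home
    """
    log_rate = intercept + offence - defence_opp + home_adv * (1 if is_home else 0)
    return math.exp(log_rate)

# Illustrative values only, not fitted ratings.
intercept = math.log(4.5)
average_away = expected_runs(intercept, 0.0, 0.0, 0.03, is_home=False)
strong_away = expected_runs(intercept, 0.10, 0.0, 0.03, is_home=False)

# An offence rating of +0.10 scales the rate by exp(0.10), whatever the baseline.
print(average_away, strong_away, strong_away / average_away)
```

The same +0.10 offence rating produces the same proportional lift against a strong or a weak defence, which is the sense in which the terms combine multiplicatively.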

4 · Partial pooling within league

Team-level ratings are drawn from league-level priors, so early in a season every team is pulled towards the American or National League average until its own results earn it some distance. Without this "shrinkage" a team that started 3–0 would briefly look world-class; with it, the model concedes uncertainty and waits for more data.

offence[team] ~ Normal(league_offence[league(team)], σ_off)
defence[team] ~ Normal(league_defence[league(team)], σ_def)

league_offence, league_defence ~ Normal(0, 0.2)
σ_off, σ_def ~ HalfNormal(0.3)

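The hyperparameters (the league means league_offence and league_defence, and the spreads σ_off and σ_def) are themselves learnt from the data. If the two leagues diverge sharply from the global mean, the posterior league means will reflect that. As written above, σ_off and σ_def are shared by both leagues, so they capture how spread out teams are overall; letting one league be more spread out than the other would need a separate σ per league.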
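
To make the hierarchy concrete, the sketch below draws one set of offence ratings from the priors listed above, using only the Python standard library. It is a generative illustration of the prior structure, not the PyMC model; the function name and the four-team example are assumptions.

```python
import random

def draw_prior_offence(teams_by_league, rng):
    """One draw of offence ratings from the hierarchical prior.

    sigma_off      ~ HalfNormal(0.3)   (absolute value of a Normal(0, 0.3) draw)
    league_offence ~ Normal(0, 0.2)    (one value per league)
    offence[team]  ~ Normal(league_offence[league(team)], sigma_off)
    """
    sigma_off = abs(rng.gauss(0.0, 0.3))
    league_offence = {}
    offence = {}
    for league, teams in teams_by_league.items():
        league_offence[league] = rng.gauss(0.0, 0.2)
        for team in teams:
            offence[team] = rng.gauss(league_offence[league], sigma_off)
    return sigma_off, league_offence, offence

rng = random.Random(2026)
toy_leagues = {"AL": ["NYY", "BOS"], "NL": ["LAD", "SF"]}  # truncated for brevity
sigma_off, league_offence, offence = draw_prior_offence(toy_leagues, rng)
print(sigma_off, league_offence, offence)
```

A small σ_off draw keeps every team close to its league mean, which is the shrinkage behaviour described above; the defence ratings follow the same pattern.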

5 · Recency weighting

Baseball teams change — rosters turn over, managers get fired, front offices tear down and rebuild. A team's 2022 performance is not particularly informative about its 2026 performance, yet an unweighted likelihood would let four years of ancient results drown out a few dozen current ones. That's how a team in the middle of a genuine rebound (hello, Athletics) ends up stuck near the bottom of the posterior long after it should have moved.

Rather than model each season as its own parameter (tried; the random walk posterior geometry was unkind to NUTS and the fit took too long on the CI runner), each game's log-likelihood contribution is scaled by its age:

w_i = exp(−age_i / τ)
τ = 1.5 years

weighted log-likelihood = Σ_i w_i · log p(runs_i | λ_i)

Concrete weights for a refresh run today:

Game played yesterday → w ≈ 1.00
Game played one year ago → w ≈ 0.51
Game played two years ago → w ≈ 0.26
Game played four years ago → w ≈ 0.07

So a full season played four years ago exerts well under a tenth of the influence that a full current season would (w ≈ 0.07), without the model having to discretise time into seasons explicitly. In PyMC this is applied via pm.Potential on a vectorised log-likelihood; the posterior sees each observation as a fractional contribution that decays exponentially with its age.
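
The weights quoted above can be reproduced directly. The sketch below is plain Python arithmetic for the weighting formula and the weighted Poisson log-likelihood; the helper names are assumptions, and it does not use pm.Potential.

```python
import math

TAU_YEARS = 1.5

def recency_weight(age_years, tau=TAU_YEARS):
    """w = exp(-age / tau), with age measured in years."""
    return math.exp(-age_years / tau)

def poisson_logpmf(k, lam):
    """log P(K = k) for a Poisson with rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def weighted_loglik(observations, tau=TAU_YEARS):
    """Sum of w_i * log p(runs_i | lambda_i).

    `observations` is an iterable of (runs, lam, age_years) tuples.
    """
    return sum(
        recency_weight(age, tau) * poisson_logpmf(runs, lam)
        for runs, lam, age in observations
    )

for label, age in [("yesterday", 1 / 365), ("one year", 1), ("two years", 2), ("four years", 4)]:
    print(label, round(recency_weight(age), 2))
```

A game from today contributes its full log-probability; a game from four years ago contributes about 7% of it.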

6 · Global parameters

Two parameters are shared across all teams in both leagues.

Intercept (~ Normal(log 4.5, 0.2)) — the league-average log run rate, centred on the long-run MLB average of roughly 4.5 runs per team per game.
Home-field advantage (~ Normal(0.03, 0.03)) — a small additive bump to the log run rate for the team at home. The prior is centred on the historical MLB HFA of around 3%.
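
Because both priors live on the log scale, it can help to translate them back into runs. The sketch below computes the central 95% prior range implied by each Normal prior; the 1.96 multiplier and the helper name `implied_interval` are choices made for this example.

```python
import math

def implied_interval(mu, sd, z=1.96):
    """Central ~95% range of exp(X) when X ~ Normal(mu, sd)."""
    return math.exp(mu - z * sd), math.exp(mu + z * sd)

# Intercept prior: Normal(log 4.5, 0.2) on the log run rate.
runs_low, runs_high = implied_interval(math.log(4.5), 0.2)

# Home-field prior: Normal(0.03, 0.03), read as a multiplier on the run rate.
hfa_centre = math.exp(0.03)
hfa_low, hfa_high = implied_interval(0.03, 0.03)

print(runs_low, runs_high)           # prior range for league-average runs per team per game
print(hfa_low, hfa_centre, hfa_high) # prior range for the home multiplier
```

An additive 0.03 on the log scale is a multiplier of exp(0.03), which is where the "around 3%" reading of the home-field prior comes from. The prior also leaves some room for a home multiplier just below 1.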

7 · Inference

Posteriors are sampled with PyMC's No-U-Turn Sampler (NUTS), the current default for Hamiltonian Monte Carlo on continuous parameter spaces. Settings for each nightly run:

chains: 4 (sampled in parallel on 4 vCPUs)
warm-up: 1,000 iterations per chain
retained draws: 500 iterations per chain
target_accept: 0.95
total samples: 2,000 posterior draws

A single fit takes roughly seven minutes on a GitHub Actions runner. The warm-up phase tunes the mass matrix and step size; only post-warm-up draws are used for inference and simulation. The slightly tighter target_accept (0.95 rather than PyMC's 0.8 default) keeps divergences at zero on this model.
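
For reference, the settings above map onto keyword arguments of PyMC's pm.sample roughly as follows. This is a configuration sketch only: the constant name SAMPLER_KWARGS and the `model` object in the commented usage are hypothetical, and the project's actual call may differ.

```python
# Keyword arguments for pm.sample matching the nightly settings.
SAMPLER_KWARGS = {
    "draws": 500,           # retained draws per chain
    "tune": 1000,           # warm-up iterations per chain (discarded)
    "chains": 4,
    "cores": 4,             # one chain per vCPU
    "target_accept": 0.95,  # higher than the 0.8 default
}

total_draws = SAMPLER_KWARGS["draws"] * SAMPLER_KWARGS["chains"]
print(total_draws)

# Usage, assuming a PyMC model object named `model` (hypothetical):
# import pymc as pm
# with model:
#     idata = pm.sample(**SAMPLER_KWARGS)
```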

8 · Convergence diagnostics

Every run writes three headline diagnostics into predictions.model_diagnostics, and the current values appear in the status strip at the top of every page.

R̂ (potential scale reduction factor): compares within-chain to between-chain variance. Values below 1.01 indicate that independent chains have converged to the same posterior. We warn on anything over 1.05.
ESS (effective sample size): the number of effectively independent draws after accounting for autocorrelation. Rough rule of thumb: we want at least 400 per parameter. Low ESS means the chains are mixing slowly; a posterior mean is still roughly right, but tail quantities such as interval endpoints are estimated less reliably and the posterior's spread can be under-estimated.
Divergences: NUTS trajectories that failed to conserve energy, typically caused by a pathological posterior geometry. Any non-zero count is worth investigating; if the number is large, the reported posterior should not be trusted until the model has been reparameterised or refitted with a higher target_accept.
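
To show what the R̂ comparison of within-chain and between-chain variance looks like, here is a sketch of the classic Gelman-Rubin statistic for a single scalar parameter. Modern libraries typically report a rank-normalised split variant, so values will not match a production diagnostic exactly. The "borderline" label for values between the two thresholds is an assumption of this example.

```python
from statistics import mean, variance

def gelman_rubin_rhat(chains):
    """Classic (non-split, non-rank-normalised) R-hat.

    `chains` is a list of equal-length lists of draws for one parameter.
    """
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean(variance(chain) for chain in chains)  # W: average within-chain variance
    between = n * variance(chain_means)                 # B: scaled variance of chain means
    pooled = (n - 1) / n * within + between / n         # pooled posterior variance estimate
    return (pooled / within) ** 0.5

def rhat_status(rhat):
    """Thresholds from the text: below 1.01 is fine, above 1.05 triggers a warning."""
    if rhat < 1.01:
        return "ok"
    if rhat > 1.05:
        return "warn"
    return "borderline"  # assumed label for the gap between the two thresholds

# Two chains stuck in different places produce a large R-hat.
stuck = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]
print(gelman_rubin_rhat(stuck), rhat_status(gelman_rubin_rhat(stuck)))
```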

9 · Season simulation

The posterior alone tells you how strong each team is; to translate that into standings and playoff odds we simulate the remaining regular season. For each of the 10,000 simulations, and each remaining scheduled game:

Draw a single posterior sample of all parameters.
Compute home-team and away-team expected run rates from the model equation.
Draw a Poisson realisation for each side to get simulated run totals.
Assign the win to whichever team scored more; aggregate wins onto each team's running season total.

Across 10,000 simulations this produces a full distribution of final-season win counts for every team, which is where the "80% CI" projection ranges and the division-winner probabilities come from.
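
The simulation loop described above can be sketched in standard-library Python. Several details are assumptions made for this example and are not stated in the text: one posterior draw is reused for every game within a simulated season, tied run totals are settled by a coin flip, and the posterior is represented as a list of dictionaries. The two-team data is a toy, not fitted output.

```python
import math
import random
from statistics import quantiles

def poisson_draw(lam, rng):
    """Knuth's multiplication method; fine for small rates such as runs per game."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_seasons(posterior_draws, schedule, current_wins, n_sims, rng):
    """Return {team: [final win total in each simulation]}."""
    totals = {team: [] for team in current_wins}
    for _ in range(n_sims):
        d = rng.choice(posterior_draws)  # ASSUMPTION: one draw per simulated season
        wins = dict(current_wins)
        for home, away in schedule:
            lam_home = math.exp(d["intercept"] + d["offence"][home]
                                - d["defence"][away] + d["home_adv"])
            lam_away = math.exp(d["intercept"] + d["offence"][away]
                                - d["defence"][home])
            runs_home = poisson_draw(lam_home, rng)
            runs_away = poisson_draw(lam_away, rng)
            if runs_home == runs_away:
                winner = rng.choice((home, away))  # ASSUMPTION: coin flip on ties
            else:
                winner = home if runs_home > runs_away else away
            wins[winner] += 1
        for team, total in wins.items():
            totals[team].append(total)
    return totals

rng = random.Random(0)
toy_draws = [{
    "intercept": math.log(4.5), "home_adv": 0.03,
    "offence": {"AAA": 0.05, "BBB": -0.05},
    "defence": {"AAA": 0.02, "BBB": -0.02},
}]
toy_schedule = [("AAA", "BBB"), ("BBB", "AAA")] * 10  # 20 remaining games
totals = simulate_seasons(toy_draws, toy_schedule, {"AAA": 50, "BBB": 45}, 200, rng)

deciles = quantiles(totals["AAA"], n=10)
interval_80 = (deciles[0], deciles[-1])  # 10th and 90th percentiles
print(interval_80)
```

Taking the 10th and 90th percentiles of the simulated win totals gives an 80% range of the kind shown in the projections.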

10 · Playoff qualification

The playoff probability shown on the League Predictions tab is simply the fraction of simulations in which a team finished in the 2026 MLB postseason field. Under current rules, that's six teams per league: three division winners plus three wild cards taken from the remaining teams by regular-season win percentage. Ties are broken by assigning fractional wins to every tied team, so the reported odds always sum exactly to three division winners and three wild cards per league across simulations.
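
A simplified version of the qualification rule for one league might look like the sketch below. It deliberately does not implement the fractional tie handling described above: tied teams are ordered by name, which is a simplification for this example only. It also assumes every team has played the same number of games, so ranking by wins matches ranking by win percentage. The team and division labels are placeholders.

```python
def playoff_field(wins, division_of):
    """Six-team field for one league in one simulation:
    three division winners plus the three best remaining records.

    SIMPLIFICATION: ties are ordered by team name, not split fractionally.
    """
    ranked = sorted(wins, key=lambda team: (-wins[team], team))
    winners_by_division = {}
    for team in ranked:
        # The first team seen in each division is that division's winner.
        winners_by_division.setdefault(division_of[team], team)
    division_winners = set(winners_by_division.values())
    wild_cards = [team for team in ranked if team not in division_winners][:3]
    return division_winners | set(wild_cards)

def playoff_probability(team, simulated_wins, division_of):
    """Fraction of simulations in which `team` reaches the field."""
    hits = sum(team in playoff_field(wins, division_of) for wins in simulated_wins)
    return hits / len(simulated_wins)

divisions = {"A": "East", "B": "East", "C": "Central", "D": "Central",
             "E": "West", "F": "West", "G": "West", "H": "West"}
sim_1 = {"A": 100, "B": 90, "C": 95, "D": 85, "E": 80, "F": 79, "G": 78, "H": 70}
sim_2 = {"A": 100, "B": 90, "C": 95, "D": 85, "E": 80, "F": 70, "G": 78, "H": 79}

print(sorted(playoff_field(sim_1, divisions)))
print(playoff_probability("F", [sim_1, sim_2], divisions))
```

Team E qualifies in both toy simulations as a division winner despite having fewer wins than the wild cards, which mirrors how the real format treats division titles.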

11 · Data sources

MLB Stats API (statsapi.mlb.com/api/v1) — schedule, game results, probable pitchers, venues, team metadata. The sole source of raw training data.
Seasons used: 2022, 2023, 2024, 2025 (all games complete) plus 2026 to the most recent refresh. Regular season and postseason; spring training and All-Star games are excluded.

12 · Deliberate omissions

Several signals that might plausibly improve predictive accuracy are not in v1. Each is a conscious choice; each is a candidate for a future release.

Starting pitcher effects. The schedule hydrate endpoint gives us probable pitchers, but the current model treats all of a team's pitching as a single "defence" rating. A pitcher-level extension (random effect on the stated starter) would likely improve single-game prediction at some cost in model complexity.
Park factors. Run environments at Coors Field and Oracle Park differ materially, but every game here is modelled with the same league-wide intercept. The venues table is loaded and ready, so a park-factor intercept is a natural next step.
Injuries, roster moves, lineups. The model can't see these and has no way to differentiate a team's performance with its star shortstop on the injured list from its performance when healthy. Recency weighting partially mitigates this; explicit injury data would do more.
Bayesian updating between runs. Each nightly fit is from scratch, with the same priors. A sequential / stateful update using yesterday's posterior as today's prior would be faster and in some senses more philosophically correct, but for a once-daily batch job the cold-start cost is acceptable.