Definitions, formulas, validation gates, calibration results, caveats, and null findings are public-facing by design.
Every number has a paper trail.
Risk archetypes, validation examples, formulas, and version notes. Methodology cards summarize what is live, what validated, and where coverage is intentionally limited.
Raw data licenses, credentials, scheduler internals, model artifacts, and exact feature-pipeline code paths stay out of public copy.
If a surface is experimental, missing coverage, descriptive-only, or parked after validation failure, the page should say that directly.
What each market can honestly show today.
Non-HR experimental lanes use distribution-thresholding while posted-line training data accumulates. Full production status still requires direct posted-line calibration.
| Market | Palantir | Fangorn | Valinor | Triple Confirmed | Consensus | Receipt path |
|---|---|---|---|---|---|---|
| Home Runs | Production | Production | Production | Full | Full | HR production validation receipts |
| Total Bases | Production | CatBoost experimental | LightGBM + beta experimental | Full | Full | outputs/validation/three_lanes_beta/total_bases/ |
| Hits | Production | Random Forest experimental | In active development | Partial, 2 of 3 lanes | Available, 2 lanes | outputs/validation/three_lanes_beta/hits/ |
| Pitcher Outs | Production | Random Forest experimental | In active development | Partial, 2 of 3 lanes | Available, 2 lanes | outputs/validation/three_lanes_beta/pitcher_outs/ |
| Strikeouts | Production | In active development | In active development | Waiting on multi-lane output | Insufficient lanes | outputs/validation/three_lanes_beta/strikeouts/ |
| Futures | Production | Intentionally limited | Intentionally limited | Waiting on multi-lane output | Insufficient lanes | docs/codex_context/three_lanes_beta_preregistration_amendment_001_2026_05_24.md |
Run-value target, full backfill, original feature set.
2020-2023 Statcast pitch-event backfill via pybaseball. Validation 2024; held-out test 2025.
Statcast delta_run_exp. Lower run value is better for pitchers; 100 is league average on the published plus scale.
Velocity, release point, extension, movement, location, zone height, count, pitch type, pitcher hand, and fastball-relative deltas.
Spin axis, batter handedness, and platoon context were tested. They added noise for v1.1 and were dropped from production, even though a proxy-target variant produced better-looking raw year-to-year (Y-Y) numbers.
Phase 2 initially attempted a Kirby Index. An audit showed that the public Kirby Index is a command metric built from release-angle consistency, while Mithrandir had built a pitch-type Stuff+ composition. We renamed our version Arsenal Decomposition and surface it only as descriptive support under Stuff+, not as a predictive metric.
Two-Strike Adjustment Index v1.0 followed the Chamberlain chase-delta framework but missed the locked gate: contact-rate r² .147 vs .300 target, and raw two-strike K% remained the better next-year K% predictor (.445 r² vs .080). This stays parked as an honest null, with bat-tracking integration deferred for any v1.1 attempt.
Per-type validation cleared FB, CH, and SI but exposed sample gaps for CB/FC and a slider miss. Family aggregation by hitter-recognition class validated Fastball, Sinker, and Changeup; Breaking horizontal remains caution-only at r .239; Breaking vertical and Cutters/Splitters stay insufficient-sample with explicit tags.
Conditional OAA was tested as a standalone predictive metric and missed the locked Y-Y gate: r² .174 vs .450 target. The small-sample test was excellent, though: RMSE 2.073 vs raw OAA 6.377. We therefore apply shrinkage internally below 200 defensive outs and label those rows "regression applied" instead of promoting a separate metric page.
Stuff+ v1.1 = 100 + z(predicted pitch run prevention) * 10
production target = delta_run_exp
production features = original pitch-shape + location set
Arsenal Decomposition = pitch-type-relative Stuff+ by pitcher and pitch family
Pitch-Type Vulnerability = regressed same-family wOBA-against, inverted to resistance percentile
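The plus-scale formula above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the function name `plus_scale`, the use of a population standard deviation, and the sample values are all assumptions. Because lower run value is better for pitchers, the sketch flips the sign so that better run prevention lands above 100.

```python
from statistics import mean, pstdev


def plus_scale(values, higher_is_better=True):
    """Map raw values onto a plus scale: 100 + z * 10.

    Assumption: z is computed against the mean and population standard
    deviation of the values passed in. The real reference population
    is not specified in the source.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # No spread: everyone is exactly average.
        return [100.0 for _ in values]
    sign = 1.0 if higher_is_better else -1.0
    return [100.0 + sign * ((v - mu) / sigma) * 10.0 for v in values]


# Hypothetical predicted run values per pitcher (lower is better).
predicted_run_value = [-0.02, 0.00, 0.02]
stuff_plus = plus_scale(predicted_run_value, higher_is_better=False)
```

In this toy input the pitcher with the lowest predicted run value scores above 100, the middle pitcher sits at 100, and the highest run value scores below 100.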
Plate discipline index, reframed after validation.
SEAGER+ was initially pre-registered against next-year ISO at r ≥ .300. The first formula produced r .023 and failed.
A methodology audit found a denominator error: public SEAGER is rate-based ST - HPT, not a per-pitch run-value average. Correcting it moved ISO r to .274.
Alternative-target diagnostics showed the corrected metric is a plate discipline monster: next-year BB% r .648 and chase-rate r -.692.
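A pre-registered gate of this kind reduces to computing a correlation and comparing it to a threshold fixed in advance. The sketch below assumes a plain Pearson r between a metric in year N and a target in year N+1; the helper names `pearson_r` and `passes_gate` are hypothetical, and only the .300 threshold and the two reported r values come from the text above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def passes_gate(r, threshold=0.300):
    """True only if the observed r meets the locked threshold."""
    return r >= threshold


# The two ISO results reported above, checked against the locked gate.
first_formula_ok = passes_gate(0.023)
corrected_formula_ok = passes_gate(0.274)
```

Both reported values fall short of the gate, which is why the ISO framing is recorded as a miss even after the denominator fix.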
SEAGER+ = 100 + z(Selection Tendency - Hittable Pitches Taken) * 10
Selection Tendency = good takes / non-hittable opportunities; HPT = hittable takes / total takes
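As a rough sketch of the two rate components, the functions below take pre-classified counts and return the raw difference that feeds the z-score. How a pitch gets classified as "hittable" or a take as "good" is not described here, so the counts are treated as given inputs; the function names and the example numbers are hypothetical.

```python
def selection_tendency(good_takes, non_hittable_opportunities):
    """Good takes as a share of non-hittable opportunities."""
    if non_hittable_opportunities <= 0:
        raise ValueError("need at least one non-hittable opportunity")
    return good_takes / non_hittable_opportunities


def hittable_pitches_taken(hittable_takes, total_takes):
    """Hittable takes as a share of all takes."""
    if total_takes <= 0:
        raise ValueError("need at least one take")
    return hittable_takes / total_takes


def seager_raw(good_takes, non_hittable_opportunities,
               hittable_takes, total_takes):
    """Rate-based ST - HPT, the quantity that is z-scored for SEAGER+."""
    st = selection_tendency(good_takes, non_hittable_opportunities)
    hpt = hittable_pitches_taken(hittable_takes, total_takes)
    return st - hpt


# Hypothetical hitter: 300 good takes on 400 non-hittable pitches,
# 60 hittable pitches taken out of 360 total takes.
raw = seager_raw(300, 400, 60, 360)
```

Both terms are rates with their own denominators, which is the point of the denominator correction: the difference is not a per-pitch run-value average.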
The ISO threshold miss is documented, not hidden. v1.1 roadmap: called-strike probability weighting plus bat-tracking quality on correct swings if we want to chase the power-prediction framing again.
Models are allowed to be wrong; they are not allowed to hide it.
Last cycle 2026-06-13T02:30:12.596374+00:00 / 1 failed / 8 stale
Failed: train_stuff_plus
| Market | Lane | Logged | Brier | ECE | 14d trend | Calibration | Flag |
|---|---|---|---|---|---|---|---|
| Home Runs | Palantir | 5214 | 0.124 | 0.004 | flat | recalibrated | healthy |
| Home Runs | Fangorn | 5214 | 0.125 | 0.010 | flat | recalibrated | healthy |
| Home Runs | Valinor | 1176 | 0.125 | 0.013 | insufficient sample | recalibrated | insufficient sample |
| Strikeouts | Palantir | 49 | 0.228 | 0.190 | insufficient sample | skipped insufficient sample | insufficient sample |
| Strikeouts | Fangorn | 13 | 0.239 | 0.139 | insufficient sample | skipped insufficient sample | insufficient sample |
| Strikeouts | Valinor | 32 | 0.236 | 0.220 | insufficient sample | skipped insufficient sample | insufficient sample |
| Total Bases | Palantir | 194 | 0.262 | 0.207 | insufficient sample | recalibrated | insufficient sample |
| Total Bases | Fangorn | 0 | -- | -- | tracking | skipped insufficient sample | tracking in progress |
| Total Bases | Valinor | 0 | -- | -- | tracking | training time | tracking in progress |
| Hits | Palantir | 2 | 0.560 | 0.748 | insufficient sample | skipped insufficient sample | insufficient sample |
| Hits | Fangorn | 0 | -- | -- | tracking | training time | tracking in progress |
| Hits | Valinor | 0 | -- | -- | tracking | training time | tracking in progress |
| Pitcher Outs | Palantir | 40 | 0.249 | 0.040 | insufficient sample | skipped insufficient sample | insufficient sample |
| Pitcher Outs | Fangorn | 0 | -- | -- | tracking | training time | tracking in progress |
| Pitcher Outs | Valinor | 0 | -- | -- | tracking | training time | tracking in progress |
| Futures | Palantir | 0 | -- | -- | tracking | skipped insufficient sample | tracking in progress |
| Futures | Fangorn | 0 | -- | -- | tracking | training time | tracking in progress |
| Futures | Valinor | 0 | -- | -- | tracking | training time | tracking in progress |
The prior stays visible; the season earns weight one game at a time.
The frozen projection from the original season simulator. It remains on-page as the baseline.
The raw extrapolation of observed 2026 performance. Useful, but noisy.
The Bayesian blend: preseason prior plus observed performance, weighted by sample size.
observed_weight = opportunities / (opportunities + regression_constant)
rolling_projection = prior * (1 - observed_weight) + current_pace * observed_weight
team game lines = rolling team talent + opponent context + starter rolling ERA + bullpen FIP + lineup status
Small samples mean high uncertainty; early-season rolling projections intentionally mean-revert toward preseason until the observed sample becomes persuasive.
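The blend above is a direct translation of the two formulas into code. The sketch assumes a single scalar rate per player; the regression constant of 200 and the sample rates are made-up illustration values, not published settings.

```python
def observed_weight(opportunities, regression_constant):
    """Share of the blend given to observed performance."""
    return opportunities / (opportunities + regression_constant)


def rolling_projection(prior, current_pace, opportunities,
                       regression_constant):
    """Blend the preseason prior with current pace by sample size."""
    w = observed_weight(opportunities, regression_constant)
    return prior * (1 - w) + current_pace * w


# Hypothetical: .250 prior, .350 current pace, regression constant 200.
early = rolling_projection(0.250, 0.350, 50, 200)   # weight 0.2
late = rolling_projection(0.250, 0.350, 600, 200)   # weight 0.75
```

With 50 opportunities the projection moves only a fifth of the way from the prior toward the pace; with 600 it moves three quarters of the way, which is the mean-reversion behavior described above.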
Regularized, projection-anchored lane built to stay stable rather than chase the sharpest take.
Stable / Readable mainline probabilities and broad projection context.
Nonlinear / Interaction Discovery
Fangorn Model
Nonlinear interaction-seeking lane that can abstain when it does not see a distinct pocket.
Advanced / Comparing where tree-style interaction logic diverges from the stable anchor.
Calibration / Probability Refinement
Valinor Model
Probability-refinement lane built to ask whether center, tails, and reliability are shaped correctly.
Experimental / Calibration review and challenger comparison, not declaring the most aggressive pick.
Production calibration lane is visible with a caveat banner while forward-logged graded sample accumulates. Rows flagged with the fallback path stay disclosed.
production / caveated
Rebadged research lane remains disclosed as non-production independent methodology until a true Total Bases Fangorn engine ships.
rebadged research
HR Valinor is the worked example.
Non-stacked LightGBM + beta calibration promoted on ECE strength; Brier difference was inside bootstrap noise.
ECE = sum_b (n_b / N) * abs(avg_pred_b - avg_outcome_b)
Brier = mean((p_i - y_i)^2)
RA_FV = FutureValue * P(reaching ceiling)
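The ECE and Brier formulas above can be sketched in plain Python. The binning scheme is an assumption: the source does not say how bins are formed, so this version uses ten equal-width probability bins. The function names and the four-row sample are illustrative only.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean gap between average prediction and average outcome.

    Assumption: equal-width bins on [0, 1]; a probability of exactly
    1.0 is placed in the last bin.
    """
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for rows in bins:
        if not rows:
            continue  # empty bins contribute nothing
        avg_pred = sum(p for p, _ in rows) / len(rows)
        avg_outcome = sum(y for _, y in rows) / len(rows)
        ece += (len(rows) / n) * abs(avg_pred - avg_outcome)
    return ece


# Hypothetical graded predictions.
probs = [0.1, 0.1, 0.9, 0.9]
outcomes = [0, 0, 1, 1]
brier = brier_score(probs, outcomes)
ece = expected_calibration_error(probs, outcomes)
```

The two scores answer different questions: Brier rewards sharp, correct probabilities row by row, while ECE only asks whether each bin's average prediction matches its hit rate. That is why a lane can be promoted on ECE while the Brier difference stays inside noise.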
