Mithrandir Metrics
3 identities / audit-backed
Section VI / Methods / Assumptions, validation, change history

Every number has a paper trail.

Risk archetypes, validation examples, formulas, and version notes. Methodology cards summarize what is live, what validated, and where coverage is intentionally limited.

Lanes: 3
Markets: 6
HR ECE: .0149
Method audit: 2026-04
Calibration harness: available
Lane health: 18 rows
Scheduler: 1 failed / 8 stale
Published methodology

Definitions, formulas, validation gates, calibration results, caveats, and null findings are public-facing by design.

Protected implementation

Raw data licenses, credentials, scheduler internals, model artifacts, and exact feature-pipeline code paths stay out of public copy.

Reader promise

If a surface is experimental, missing coverage, descriptive-only, or parked after validation failure, the page should say that directly.

Three Lanes Beta / current market coverage
Post-beta coverage

What each market can honestly show today.

Validation receipts linked

Non-HR experimental lanes use distribution-thresholding while posted-line training data accumulates. Full production status still requires direct posted-line calibration.

Market | Palantir | Fangorn | Valinor | Triple Confirmed | Consensus | Receipt path
Home Runs | Production | Production | Production | Full | Full | HR production validation receipts
Total Bases | Production | CatBoost experimental | LightGBM + beta experimental | Full | Full | outputs/validation/three_lanes_beta/total_bases/
Hits | Production | Random Forest experimental | In active development | Partial, 2 of 3 lanes | Available, 2 lanes | outputs/validation/three_lanes_beta/hits/
Pitcher Outs | Production | Random Forest experimental | In active development | Partial, 2 of 3 lanes | Available, 2 lanes | outputs/validation/three_lanes_beta/pitcher_outs/
Strikeouts | Production | In active development | In active development | Waiting on multi-lane output | Insufficient lanes | outputs/validation/three_lanes_beta/strikeouts/
Futures | Production | Intentionally limited | Intentionally limited | Waiting on multi-lane output | Insufficient lanes | docs/codex_context/three_lanes_beta_preregistration_amendment_001_2026_05_24.md
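
One way to read the "distribution-thresholding" mentioned above for non-HR experimental lanes is: take a predicted distribution over outcomes and sum the probability mass above a given line. The sketch below illustrates that general idea only. The function name `prob_over` and the example distribution are hypothetical, and the page does not state which distribution family or thresholding rule the lanes actually use.

```python
def prob_over(distribution: dict[int, float], line: float) -> float:
    """Probability that a discrete outcome exceeds `line`.

    `distribution` maps each possible outcome (e.g. total bases) to its
    predicted probability. Mass is renormalized in case it does not sum to 1.
    """
    total = sum(distribution.values())
    if total <= 0:
        raise ValueError("distribution must have positive total mass")
    over = sum(p for outcome, p in distribution.items() if outcome > line)
    return over / total


# Hypothetical predicted total-bases distribution for one hitter-game.
example = {0: 0.40, 1: 0.30, 2: 0.15, 3: 0.05, 4: 0.10}
p_over_1_5 = prob_over(example, 1.5)  # mass on outcomes 2, 3, and 4
```

A probability produced this way is not calibrated against posted lines, which is consistent with the page's statement that full production status still requires direct posted-line calibration.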
Mithrandir+ / flagship derived metrics
Stuff+ v1.1 / promoted

Run-value target, full backfill, original feature set.

Y-Y r² .504
Training

2020-2023 Statcast pitch-event backfill via pybaseball. Validation 2024; held-out test 2025.
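
The season boundaries above amount to a strict temporal split. A minimal sketch of that assignment follows. The function name `assign_split` and the `season` field are hypothetical, and nothing here reflects the actual backfill pipeline.

```python
def assign_split(season: int) -> str:
    """Map a season to its role under the split described above."""
    if 2020 <= season <= 2023:
        return "train"
    if season == 2024:
        return "validation"
    if season == 2025:
        return "test"
    raise ValueError(f"season {season} is outside the documented split")


# Hypothetical pitch-event rows; only the season matters for the split.
rows = [{"season": 2021}, {"season": 2024}, {"season": 2025}]
splits = [assign_split(row["season"]) for row in rows]
```

Splitting by season rather than at random keeps every validation and test pitch strictly later than the training data.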

Target

Statcast delta_run_exp. Lower run value is better for pitchers; 100 is league average on the published plus scale.

Feature set

Velocity, release point, extension, movement, location, zone height, count, pitch type, pitcher hand, and fastball-relative deltas.

STABILIZATION: 100 pitches
RMSE VS FIP: -0.168
ABLATION: features tested
Ablation finding

Spin axis, batter handedness, and platoon context were tested. They added noise for v1.1 and were dropped from production despite a proxy-target variant producing shinier raw Y-Y numbers.

Arsenal Decomposition

Phase 2 initially attempted a Kirby Index. Audit showed public Kirby is a command metric built from release-angle consistency, while Mithrandir had built pitch-type Stuff+ composition. We renamed it Arsenal Decomposition and surface it only as descriptive support under Stuff+, not as a predictive metric.

Two-Strike Adjustment null

Two-Strike Adjustment Index v1.0 followed the Chamberlain chase-delta framework but missed the locked gate: contact-rate r² .147 vs .300 target, and raw two-strike K% remained the better next-year K% predictor (.445 r² vs .080). This stays parked as an honest null, with bat-tracking integration deferred for any v1.1 attempt.

Pitch-Type Vulnerability partial promote

Per-type validation cleared FB, CH, and SI but exposed sample gaps for CB/FC and a slider miss. Family aggregation by hitter-recognition class validated Fastball, Sinker, and Changeup; Breaking horizontal remains caution-only at r .239; Breaking vertical and Cutters/Splitters stay insufficient-sample with explicit tags.

Conditional OAA internal reframe

Conditional OAA was tested as a standalone predictive metric and missed the locked Y-Y gate: r² .174 vs .450 target. The small-sample test was excellent, though: RMSE 2.073 vs raw OAA 6.377. We therefore apply shrinkage internally below 200 defensive outs and label those rows "regression applied" instead of promoting a separate metric page.

Stuff+ v1.1 = 100 + z(predicted pitch run prevention) * 10
production target = delta_run_exp
production features = original pitch-shape + location set
Arsenal Decomposition = pitch-type-relative Stuff+ by pitcher and pitch family
Pitch-Type Vulnerability = regressed same-family wOBA-against, inverted to resistance percentile
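
The plus-scale formula above can be sketched in a few lines. This is an illustration under stated assumptions, not the production code: run prevention is taken as the negative of predicted run value (since lower run value is better for pitchers), the z-score uses the population standard deviation of whatever group is passed in, and `plus_scale` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def plus_scale(values: list[float]) -> list[float]:
    """Map raw values to 100 + 10 * z, so 100 is the group average."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("cannot z-score a constant sample")
    return [100 + 10 * (v - mu) / sigma for v in values]


# Hypothetical predicted run values (lower is better for the pitcher).
predicted_run_value = [-0.02, 0.00, 0.02]
# Assumption: flip the sign so that better run prevention scores above 100.
stuff_plus = plus_scale([-rv for rv in predicted_run_value])
```

With this sign convention the pitch with the lowest predicted run value lands above 100 and the highest lands below it.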

Open Stuff+ methodology + leaderboard

Open Pitch-Type Vulnerability methodology + leaderboard

SEAGER+ v1.0 / promoted

Plate discipline index, reframed after validation.

BB% r .648
Original gate

SEAGER+ was initially pre-registered against next-year ISO at r ≥ .300. The first formula produced r .023 and failed.

Audit correction

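A methodology audit found a denominator error: public SEAGER is the rate-based difference ST - HPT (Selection Tendency minus Hittable Pitches Taken), not a per-pitch run-value average. Correcting it moved ISO r from .023 to .274, still short of the .300 gate.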

Reframed signal

Alternative-target diagnostics showed the corrected metric is a plate discipline monster: next-year BB% r .648 and chase-rate r -.692.

PRIMARY TARGET: BB% / chase
ISO OBSERVATION: r .274
ROADMAP: called-strike weighting

SEAGER+ = 100 + z(Selection Tendency - Hittable Pitches Taken) * 10
Selection Tendency = good takes / non-hittable opportunities
HPT = hittable takes / total takes
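
A minimal sketch of the SEAGER+ arithmetic above, assuming per-hitter counts are already available and that the z-score is taken across the supplied hitters using the population standard deviation. The function names and all counts are hypothetical. How a pitch is classified as hittable or a take as "good" is not specified here and is left outside the sketch.

```python
from statistics import mean, pstdev


def seager_raw(good_takes: int, non_hittable_opportunities: int,
               hittable_takes: int, total_takes: int) -> float:
    """Selection Tendency minus Hittable Pitches Taken, both as rates."""
    selection_tendency = good_takes / non_hittable_opportunities
    hpt = hittable_takes / total_takes
    return selection_tendency - hpt


def seager_plus(raw_scores: list[float]) -> list[float]:
    """Rescale raw scores to 100 + 10 * z across the supplied hitters."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("cannot z-score a constant sample")
    return [100 + 10 * (r - mu) / sigma for r in raw_scores]


# Hypothetical counts for three hitters:
# (good takes, non-hittable opportunities, hittable takes, total takes)
hitters = [(300, 400, 50, 350), (240, 400, 80, 320), (200, 400, 110, 310)]
raw = [seager_raw(*h) for h in hitters]
scores = seager_plus(raw)
```

Because both components are rates, the raw score does not depend on how many pitches a hitter saw, which is the denominator point the audit correction turns on.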
Transparency note

The ISO threshold miss is documented, not hidden. v1.1 roadmap: called-strike probability weighting plus bat-tracking quality on correct swings if we want to chase the power-prediction framing again.

Open SEAGER+ methodology + leaderboard

Lane health / calibration drift moat
Epistemic chrome

Models are allowed to be wrong; they are not allowed to hide it.

2026-06-12
Scheduler freshness

Last cycle 2026-06-13T02:30:12.596374+00:00 / 1 failed / 8 stale

Failed: train_stuff_plus

Market | Lane | Logged | Brier | ECE | 14d trend | Calibration | Flag
Home Runs | Palantir | 5214 | 0.124 | 0.004 | flat | recalibrated | healthy
Home Runs | Fangorn | 5214 | 0.125 | 0.010 | flat | recalibrated | healthy
Home Runs | Valinor | 1176 | 0.125 | 0.013 | insufficient sample | recalibrated | insufficient sample
Strikeouts | Palantir | 49 | 0.228 | 0.190 | insufficient sample | skipped insufficient sample | insufficient sample
Strikeouts | Fangorn | 13 | 0.239 | 0.139 | insufficient sample | skipped insufficient sample | insufficient sample
Strikeouts | Valinor | 32 | 0.236 | 0.220 | insufficient sample | skipped insufficient sample | insufficient sample
Total Bases | Palantir | 194 | 0.262 | 0.207 | insufficient sample | recalibrated | insufficient sample
Total Bases | Fangorn | 0 | -- | -- | tracking | skipped insufficient sample | tracking in progress
Total Bases | Valinor | 0 | -- | -- | tracking | training time | tracking in progress
Hits | Palantir | 2 | 0.560 | 0.748 | insufficient sample | skipped insufficient sample | insufficient sample
Hits | Fangorn | 0 | -- | -- | tracking | training time | tracking in progress
Hits | Valinor | 0 | -- | -- | tracking | training time | tracking in progress
Pitcher Outs | Palantir | 40 | 0.249 | 0.040 | insufficient sample | skipped insufficient sample | insufficient sample
Pitcher Outs | Fangorn | 0 | -- | -- | tracking | training time | tracking in progress
Pitcher Outs | Valinor | 0 | -- | -- | tracking | training time | tracking in progress
Futures | Palantir | 0 | -- | -- | tracking | skipped insufficient sample | tracking in progress
Futures | Fangorn | 0 | -- | -- | tracking | training time | tracking in progress
Futures | Valinor | 0 | -- | -- | tracking | training time | tracking in progress

Home Runs / Palantir

Healthy

Tracked
0-10% bucket: predicted 5.9%, observed 5.6%, n=744
10-20% bucket: predicted 15.3%, observed 15.5%, n=3699
20-30% bucket: predicted 21.5%, observed 20.2%, n=771
5214 props / 0.124 Brier / 0.004 ECE

Home Runs / Fangorn

Healthy

Tracked
0-10% bucket: predicted 5.3%, observed 6.8%, n=264
10-20% bucket: predicted 15.7%, observed 14.9%, n=4746
20-30% bucket: predicted 25.9%, observed 22.1%, n=204
5214 props / 0.125 Brier / 0.010 ECE

Home Runs / Valinor

Insufficient Sample

Tracked
0-10% bucket: predicted 8.6%, observed 11.8%, n=153
10-20% bucket: predicted 13.9%, observed 14.9%, n=1008
20-30% bucket: predicted 20.8%, observed 20.0%, n=15
1176 props / 0.125 Brier / 0.013 ECE
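
The ECE figure on each Home Runs card can be recomputed from the bucket summaries shown, as a count-weighted average of the gap between predicted and observed rates. The sketch below uses only the bucket values printed on the three cards. The function name `ece_from_buckets` is hypothetical, and the inputs are the rounded percentages as displayed, so agreement is only expected to three decimals.

```python
def ece_from_buckets(buckets: list[tuple[float, float, int]]) -> float:
    """Count-weighted mean absolute gap between predicted and observed rates.

    Each bucket is (average predicted, average observed, n).
    """
    total = sum(n for _, _, n in buckets)
    return sum(n / total * abs(pred - obs) for pred, obs, n in buckets)


# Bucket summaries copied from the three Home Runs cards above.
hr_cards = {
    "Palantir": [(0.059, 0.056, 744), (0.153, 0.155, 3699), (0.215, 0.202, 771)],
    "Fangorn": [(0.053, 0.068, 264), (0.157, 0.149, 4746), (0.259, 0.221, 204)],
    "Valinor": [(0.086, 0.118, 153), (0.139, 0.149, 1008), (0.208, 0.200, 15)],
}

for lane, buckets in hr_cards.items():
    print(f"{lane}: {ece_from_buckets(buckets):.3f}")
```

Rounded to three decimals, the results are 0.004, 0.010, and 0.013, matching the ECE values printed on the cards.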
Rolling projections / Bayesian shrinkage
Daily update layer

The prior stays visible; the season earns weight one game at a time.

live M1 + rolling
Preseason

The frozen projection from the original season simulator. It remains on-page as the baseline.

Current pace

The raw extrapolation of observed 2026 performance. Useful, but noisy.

Rolling

The Bayesian blend: preseason prior plus observed performance, weighted by sample size.

TEAM WINS: 60-game regression constant
HITTER RATES: 220 PA
PITCHER RATES: 240 BF

observed_weight = opportunities / (opportunities + regression_constant)
rolling_projection = prior * (1 - observed_weight) + current_pace * observed_weight
team game lines = rolling team talent + opponent context + starter rolling ERA + bullpen FIP + lineup status
Limitation

Small samples mean high uncertainty; early-season rolling projections intentionally mean-revert toward preseason until the observed sample becomes persuasive.

Per-model write-ups / status
Validation

HR Valinor is the worked example.

Non-stacked LightGBM + beta calibration was promoted on ECE strength; the Brier difference against the alternative was inside bootstrap noise.
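
A Brier difference being "inside bootstrap noise" can be checked with a paired bootstrap: resample the same props for both models and see whether the interval for the difference contains zero. The sketch below is one common way to do that, not necessarily the procedure used for the HR Valinor write-up. The function names, the 95% percentile interval, the resample count, and the toy data are all assumptions.

```python
import random


def bootstrap_brier_diff(preds_a: list[float], preds_b: list[float],
                         outcomes: list[int], n_resamples: int = 2000,
                         seed: int = 0) -> tuple[float, float]:
    """95% percentile interval for Brier(A) - Brier(B) under paired resampling."""
    rng = random.Random(seed)
    n = len(outcomes)
    diffs = []
    for _ in range(n_resamples):
        # Resample prop indices with replacement; both models see the same props.
        idx = [rng.randrange(n) for _ in range(n)]
        brier_a = sum((preds_a[i] - outcomes[i]) ** 2 for i in idx) / n
        brier_b = sum((preds_b[i] - outcomes[i]) ** 2 for i in idx) / n
        diffs.append(brier_a - brier_b)
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    return low, high


# Toy data: two hypothetical models scored on the same ten props.
outcomes = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
model_a = [0.10, 0.12, 0.30, 0.08, 0.15, 0.11, 0.25, 0.09, 0.14, 0.10]
model_b = [0.11, 0.13, 0.28, 0.09, 0.14, 0.12, 0.27, 0.10, 0.13, 0.11]
low, high = bootstrap_brier_diff(model_a, model_b, outcomes)
inside_noise = low <= 0.0 <= high
```

When the interval straddles zero, Brier alone does not separate the two candidates, so a second criterion such as ECE has to carry the decision.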

Glossary / formulas
ECE = sum_b (n_b / N) * abs(avg_pred_b - avg_outcome_b)
Brier = mean((p_i - y_i)^2)
RA_FV = FutureValue * P(reaching ceiling)
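
For readers who want to compute the first two glossary metrics from raw predictions rather than bucket summaries, here is a minimal sketch. It assumes equal-width probability bins (ten by default, in line with the 10-point buckets shown on the lane cards). The function names and sample data are hypothetical, and the actual harness may bin differently.

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs: list[float], outcomes: list[int],
                               n_bins: int = 10) -> float:
    """Count-weighted mean gap between average prediction and average outcome."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Assumption: equal-width bins; p == 1.0 falls in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_pred = sum(p for p, _ in members) / len(members)
        avg_outcome = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(avg_pred - avg_outcome)
    return ece


# Hypothetical predictions and outcomes.
probs = [0.1, 0.1, 0.9, 0.9]
outcomes = [0, 0, 1, 1]
brier = brier_score(probs, outcomes)
ece = expected_calibration_error(probs, outcomes)
```

Brier rewards both calibration and sharpness, while ECE isolates calibration, which is why the two can disagree about which model looks better.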
Versioned changelog