Mithrandir Metrics

Method audit: 2026-04
Calibration harness: available
Lane health: 18 rows
Scheduler: 1 failed / 8 stale

Published methodology

Definitions, formulas, validation gates, calibration results, caveats, and null findings are public-facing by design.

Protected implementation

Raw data licenses, credentials, scheduler internals, model artifacts, and exact feature-pipeline code paths stay out of public copy.

Reader promise

If a surface is experimental, missing coverage, descriptive-only, or parked after validation failure, the page should say that directly.

Three Lanes Beta / current market coverage

Post-beta coverage

What each market can honestly show today.

Validation receipts linked

Non-HR experimental lanes use distribution-thresholding while posted-line training data accumulates. Full production status still requires direct posted-line calibration.

Market	Palantir	Fangorn	Valinor	Triple Confirmed	Consensus	Receipt path
Home Runs	Production	Production	Production	Full	Full	`HR production validation receipts`
Total Bases	Production	CatBoost experimental	LightGBM + beta experimental	Full	Full	`outputs/validation/three_lanes_beta/total_bases/`
Hits	Production	Random Forest experimental	In active development	Partial, 2 of 3 lanes	Available, 2 lanes	`outputs/validation/three_lanes_beta/hits/`
Pitcher Outs	Production	Random Forest experimental	In active development	Partial, 2 of 3 lanes	Available, 2 lanes	`outputs/validation/three_lanes_beta/pitcher_outs/`
Strikeouts	Production	In active development	In active development	Waiting on multi-lane output	Insufficient lanes	`outputs/validation/three_lanes_beta/strikeouts/`
Futures	Production	Intentionally limited	Intentionally limited	Waiting on multi-lane output	Insufficient lanes	`docs/codex_context/three_lanes_beta_preregistration_amendment_001_2026_05_24.md`
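The Triple Confirmed and Consensus columns in the table above appear to follow from how many lanes are producing output for a market. The sketch below encodes that reading. The rule is inferred from the six rows shown, not taken from the production code, and the function names and status matching are hypothetical.

```python
# Assumption: a lane counts toward coverage when its status is "Production"
# or some "... experimental" variant. "In active development" and
# "Intentionally limited" do not count. Inferred from the table rows only.
ACTIVE_STATUS_MARKERS = ("production", "experimental")


def lane_is_active(status: str) -> bool:
    """Return True when a lane status string indicates usable output."""
    lowered = status.lower()
    return any(marker in lowered for marker in ACTIVE_STATUS_MARKERS)


def coverage_labels(palantir: str, fangorn: str, valinor: str) -> tuple[str, str]:
    """Map three lane statuses to (Triple Confirmed, Consensus) labels."""
    active = sum(lane_is_active(s) for s in (palantir, fangorn, valinor))
    if active == 3:
        return ("Full", "Full")
    if active == 2:
        return ("Partial, 2 of 3 lanes", "Available, 2 lanes")
    return ("Waiting on multi-lane output", "Insufficient lanes")


# Hits row from the table: one production lane, one experimental lane.
print(coverage_labels("Production", "Random Forest experimental", "In active development"))
```

Keeping the rule in one function means a lane status change updates both columns at once, so the table cannot claim consensus that the lanes do not support.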

Mithrandir+ / flagship derived metrics

Stuff+ v1.1 / promoted

Run-value target, full backfill, original feature set.

Y-Y r² .504

Training

2020-2023 Statcast pitch-event backfill via pybaseball. Validation 2024; held-out test 2025.

Target

Statcast delta_run_exp. Lower run value is better for pitchers; 100 is league average on the published plus scale.

Feature set

Velocity, release point, extension, movement, location, zone height, count, pitch type, pitcher hand, and fastball-relative deltas.

STABILIZATION: 100 pitches
RMSE VS FIP: -0.168
ABLATION: features tested

Ablation finding

Spin axis, batter handedness, and platoon context were tested. They added noise for v1.1 and were dropped from production despite a proxy-target variant producing shinier raw Y-Y numbers.

Arsenal Decomposition

Phase 2 initially attempted a Kirby Index. Audit showed public Kirby is a command metric built from release-angle consistency, while Mithrandir had built pitch-type Stuff+ composition. We renamed it Arsenal Decomposition and surface it only as descriptive support under Stuff+, not as a predictive metric.

Two-Strike Adjustment null

Two-Strike Adjustment Index v1.0 followed the Chamberlain chase-delta framework but missed the locked gate: contact-rate r² .147 vs .300 target, and raw two-strike K% remained the better next-year K% predictor (.445 r² vs .080). This stays parked as an honest null, with bat-tracking integration deferred for any v1.1 attempt.

Pitch-Type Vulnerability partial promote

Per-type validation cleared FB, CH, and SI but exposed sample gaps for CB/FC and a slider miss. Family aggregation by hitter-recognition class validated Fastball, Sinker, and Changeup; Breaking horizontal remains caution-only at r .239; Breaking vertical and Cutters/Splitters stay insufficient-sample with explicit tags.

Conditional OAA internal reframe

Conditional OAA was tested as a standalone predictive metric and missed the locked Y-Y gate: r² .174 vs .450 target. The small-sample test was excellent, though: RMSE 2.073 vs raw OAA 6.377. We therefore apply shrinkage internally below 200 defensive outs and label those rows "regression applied" instead of promoting a separate metric page.

Stuff+ v1.1 = 100 + z(predicted pitch run prevention) * 10
production target = delta_run_exp
production features = original pitch-shape + location set
Arsenal Decomposition = pitch-type-relative Stuff+ by pitcher and pitch family
Pitch-Type Vulnerability = regressed same-family wOBA-against, inverted to resistance percentile
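As a minimal sketch of the plus-scale formula above: the function below standardizes a list of predicted run values and maps them to a scale where 100 is average and each standard deviation is worth 10 points. Because lower run value is better for pitchers, the sign is flipped so that run prevention scores above 100. The function name, the use of a population standard deviation, and the sample inputs are assumptions for illustration. The page does not state which population the z-score is taken over.

```python
from statistics import mean, pstdev


def plus_scale(values, higher_is_better=True):
    """Map raw values to a plus scale: 100 + z * 10.

    Assumption: z is computed against the mean and population standard
    deviation of the list passed in. The real reference population is
    not specified on this page.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # No spread: everyone is exactly average.
        return [100.0 for _ in values]
    sign = 1.0 if higher_is_better else -1.0
    return [100.0 + 10.0 * sign * (v - mu) / sigma for v in values]


# Hypothetical predicted run values per pitch (negative = runs prevented).
predicted_run_value = [-0.02, 0.00, 0.02]
stuff_plus = plus_scale(predicted_run_value, higher_is_better=False)
print([round(s, 1) for s in stuff_plus])
```

The pitcher with the lowest predicted run value lands above 100 and the highest lands below it, which matches the published convention that 100 is league average.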

Open Stuff+ methodology + leaderboard

Open Pitch-Type Vulnerability methodology + leaderboard

SEAGER+ v1.0 / promoted

Plate discipline index, reframed after validation.

BB% r .648

Original gate

SEAGER+ was initially pre-registered against next-year ISO at r ≥ .300. The first formula produced r .023 and failed.

Audit correction

A methodology audit found a denominator error: public SEAGER is rate-based ST - HPT, not a per-pitch run-value average. Correcting it moved ISO r to .274.

Reframed signal

Alternative-target diagnostics showed the corrected metric is a plate discipline monster: next-year BB% r .648 and chase-rate r -.692.

PRIMARY TARGET: BB% / chase
ISO OBSERVATION: r .274
ROADMAP: called-strike weighting

SEAGER+ = 100 + z(Selection Tendency - Hittable Pitches Taken) * 10
Selection Tendency = good takes / non-hittable opportunities
HPT = hittable takes / total takes
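The rate-based definition above can be sketched as follows. The counts, hitter labels, and helper names are hypothetical, and the z-score is taken over the small sample passed in rather than over any real league population.

```python
from statistics import mean, pstdev


def selection_tendency(good_takes, non_hittable_opportunities):
    """Share of non-hittable pitches the hitter correctly took."""
    if non_hittable_opportunities == 0:
        raise ValueError("no non-hittable opportunities")
    return good_takes / non_hittable_opportunities


def hittable_pitches_taken(hittable_takes, total_takes):
    """Share of all takes that came on hittable pitches."""
    if total_takes == 0:
        raise ValueError("no takes recorded")
    return hittable_takes / total_takes


def seager_plus(raw_by_hitter):
    """Convert raw ST - HPT values to 100 + z * 10 (assumed population: the input)."""
    values = list(raw_by_hitter.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {name: 100.0 for name in raw_by_hitter}
    return {name: 100.0 + 10.0 * (v - mu) / sigma for name, v in raw_by_hitter.items()}


# Hypothetical pitch counts for three hitters.
counts = {
    "A": dict(good_takes=300, non_hittable=400, hittable_takes=50, total_takes=350),
    "B": dict(good_takes=240, non_hittable=400, hittable_takes=90, total_takes=330),
    "C": dict(good_takes=200, non_hittable=400, hittable_takes=120, total_takes=320),
}
raw = {
    name: selection_tendency(c["good_takes"], c["non_hittable"])
    - hittable_pitches_taken(c["hittable_takes"], c["total_takes"])
    for name, c in counts.items()
}
scores = seager_plus(raw)
print({name: round(score, 1) for name, score in scores.items()})
```

Both inputs are rates with their own denominators, which is the point of the audit correction: the raw value is a difference of two rates, not a per-pitch average.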

Transparency note

The ISO threshold miss is documented, not hidden. v1.1 roadmap: called-strike probability weighting plus bat-tracking quality on correct swings if we want to chase the power-prediction framing again.

Open SEAGER+ methodology + leaderboard

Lane health / calibration drift moat

Epistemic chrome

Models are allowed to be wrong; they are not allowed to hide it.

2026-06-12

Scheduler freshness

Last cycle 2026-06-13T02:30:12.596374+00:00 / 1 failed / 8 stale

Failed: train_stuff_plus

Market	Lane	Logged	Brier	ECE	14d trend	Calibration	Flag
Home Runs	Palantir	5214	0.124	0.004	flat	recalibrated	healthy
Home Runs	Fangorn	5214	0.125	0.010	flat	recalibrated	healthy
Home Runs	Valinor	1176	0.125	0.013	insufficient sample	recalibrated	insufficient sample
Strikeouts	Palantir	49	0.228	0.190	insufficient sample	skipped insufficient sample	insufficient sample
Strikeouts	Fangorn	13	0.239	0.139	insufficient sample	skipped insufficient sample	insufficient sample
Strikeouts	Valinor	32	0.236	0.220	insufficient sample	skipped insufficient sample	insufficient sample
Total Bases	Palantir	194	0.262	0.207	insufficient sample	recalibrated	insufficient sample
Total Bases	Fangorn	0	--	--	tracking	skipped insufficient sample	tracking in progress
Total Bases	Valinor	0	--	--	tracking	training time	tracking in progress
Hits	Palantir	2	0.560	0.748	insufficient sample	skipped insufficient sample	insufficient sample
Hits	Fangorn	0	--	--	tracking	training time	tracking in progress
Hits	Valinor	0	--	--	tracking	training time	tracking in progress
Pitcher Outs	Palantir	40	0.249	0.040	insufficient sample	skipped insufficient sample	insufficient sample
Pitcher Outs	Fangorn	0	--	--	tracking	training time	tracking in progress
Pitcher Outs	Valinor	0	--	--	tracking	training time	tracking in progress
Futures	Palantir	0	--	--	tracking	skipped insufficient sample	tracking in progress
Futures	Fangorn	0	--	--	tracking	training time	tracking in progress
Futures	Valinor	0	--	--	tracking	training time	tracking in progress

Home Runs / Palantir

Healthy

Tracked

5214 props / 0.124 Brier / 0.004 ECE

Home Runs / Fangorn

Healthy

Tracked

5214 props / 0.125 Brier / 0.010 ECE

Home Runs / Valinor

Insufficient Sample

Tracked

1176 props / 0.125 Brier / 0.013 ECE

Rolling projections / Bayesian shrinkage

Daily update layer

The prior stays visible; the season earns weight one game at a time.

live M1 + rolling

Preseason

The frozen projection from the original season simulator. It remains on-page as the baseline.

Current pace

The raw extrapolation of observed 2026 performance. Useful, but noisy.

Rolling

The Bayesian blend: preseason prior plus observed performance, weighted by sample size.

TEAM WINS: 60-game regression constant
HITTER RATES: 220 PA
PITCHER RATES: 240 BF

observed_weight = opportunities / (opportunities + regression_constant)
rolling_projection = prior * (1 - observed_weight) + current_pace * observed_weight
team game lines = rolling team talent + opponent context + starter rolling ERA + bullpen FIP + lineup status
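A short sketch of the shrinkage blend above, using the regression constants listed on this page. The dictionary keys, function names, and the example hitter are illustrative assumptions. The team game line composition is not sketched because its weights are not published here.

```python
# Regression constants as listed above: games, plate appearances, batters faced.
REGRESSION_CONSTANTS = {
    "team_wins": 60,
    "hitter_rates": 220,
    "pitcher_rates": 240,
}


def observed_weight(opportunities, regression_constant):
    """Weight given to observed performance; rises toward 1 as the sample grows."""
    return opportunities / (opportunities + regression_constant)


def rolling_projection(prior, current_pace, opportunities, regression_constant):
    """Blend the preseason prior with current pace by sample size."""
    w = observed_weight(opportunities, regression_constant)
    return prior * (1 - w) + current_pace * w


# Hypothetical hitter: preseason rate .250, current pace .300 after 220 PA.
blended = rolling_projection(0.250, 0.300, 220, REGRESSION_CONSTANTS["hitter_rates"])
print(round(blended, 3))  # 0.275
```

When opportunities equal the regression constant, the observed weight is exactly one half, so the constant can be read as the sample size at which the season and the prior count equally.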

Limitation

Small samples mean high uncertainty; early-season rolling projections intentionally mean-revert toward preseason until the observed sample becomes persuasive.

Per-model write-ups / status

Stable / Projection-Anchored
Palantir Model

Regularized, projection-anchored lane built to stay stable rather than chase the sharpest take.

Stable / Readable mainline probabilities and broad projection context.

Nonlinear / Interaction Discovery
Fangorn Model

Nonlinear interaction-seeking lane that can abstain when it does not see a distinct pocket.

Advanced / Comparing where tree-style interaction logic diverges from the stable anchor.

Calibration / Probability Refinement
Valinor Model

Probability-refinement lane built to ask whether center, tails, and reliability are shaped correctly.

Experimental / Calibration review and challenger comparison, not declaring the most aggressive pick.

Strikeouts / K Valinor Boosted residual + calibration lane

Production calibration lane is visible with a caveat banner while forward-logged graded sample accumulates. Rows flagged with the fallback path stay disclosed.

production / caveated

Total Bases / Fangorn Selective nonlinear signal lane

The rebadged research lane remains disclosed as a non-production, independent methodology until a true Total Bases Fangorn engine ships.

rebadged research

Validation

HR Valinor is the worked example.

Non-stacked LightGBM + beta calibration promoted on ECE strength; Brier difference was inside bootstrap noise.

Glossary / formulas

ECE = sum_b (n_b / N) * abs(avg_pred_b - avg_outcome_b)
Brier = mean((p_i - y_i)^2)
RA_FV = FutureValue * P(reaching ceiling)
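The two calibration formulas above can be computed as in the sketch below. The glossary does not state the binning scheme, so the ten equal-width probability bins are an assumption, as are the function names and the toy inputs.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE = sum over bins of (n_b / N) * |avg_pred_b - avg_outcome_b|.

    Assumption: equal-width bins on [0, 1]; the production binning
    is not published on this page.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Clamp so that p == 1.0 falls in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # Empty bins contribute nothing.
        avg_pred = sum(p for p, _ in members) / len(members)
        avg_outcome = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(avg_pred - avg_outcome)
    return ece


# Toy example: confident and correct, but slightly under-committed.
probs = [0.1, 0.1, 0.9, 0.9]
outcomes = [0, 0, 1, 1]
print(round(brier_score(probs, outcomes), 3))
print(round(expected_calibration_error(probs, outcomes), 3))
```

Brier rewards both sharpness and calibration, while ECE isolates calibration alone. That is why a lane can be promoted on ECE strength while its Brier difference sits inside noise.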

Versioned changelog

P3 M1 / Calibration: HR Valinor promoted to production. LightGBM non-stacked / beta calibration.

Live M1 / Live state: Current standings, deltas, and regression feeds shipped. Cached daily artifacts.