Theory & mathematics
This page summarizes the DEUP framework as defined by Lahlou et al. (2023, TMLR) and extensions for cross-sectional ranking, aggregation reliability, and two-level deployment from Sanderink (2026). The implementation follows Algorithm 2 (K-fold pre-fill of the error dataset) for honest out-of-sample error targets.
Risk decomposition
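For input \(x \in \mathcal{X}\), target \(y \in \mathcal{Y}\), loss \(\ell\), and predictor \(f\), the pointwise risk is
\[R(f, x) = \mathbb{E}\left[\ell(Y, f(x)) \mid X = x\right].\]
The Bayes predictor \(f^*(x) = \arg\min_a \mathbb{E}[\ell(Y, a) \mid X=x]\) achieves the irreducible aleatoric floor
\[A(x) = R(f^*, x) = \min_a \mathbb{E}\left[\ell(Y, a) \mid X = x\right].\]
The excess risk (epistemic uncertainty under this framework) is
\[\mathrm{ER}(f, x) = R(f, x) - A(x).\]
Under squared error with Gaussian \(P(Y \mid X=x) = \mathcal{N}(\mu(x), \sigma^2(x))\):
\[A(x) = \sigma^2(x), \qquad \mathrm{ER}(f, x) = \bigl(f(x) - \mu(x)\bigr)^2.\]
Under log loss for \(K\)-class classification with \(\mu(x) \in \Delta^K\):
\[A(x) = H[\mu(x)], \qquad \mathrm{ER}(f, x) = \mathrm{KL}\bigl(\mu(x) \,\|\, f(x)\bigr).\]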
Unlike posterior-variance estimators, excess risk captures model misspecification (bias): when \(f^* \notin \mathcal{H}\), disagreement among approximate Bayesian predictors can shrink even as the model remains systematically wrong.
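As a quick numerical check of the squared-error case, the sketch below simulates Gaussian outcomes at a single input and compares the Monte Carlo risk with \(\sigma^2(x) + (f(x) - \mu(x))^2\), the sum of the aleatoric floor and the excess risk. The values of mu, sigma, and f_x are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.0, 0.5      # assumed true conditional mean and std at a fixed x
f_x = 1.3                 # assumed prediction f(x) of a misspecified model

y = rng.normal(mu, sigma, size=200_000)   # draws from P(Y | X = x)

risk = np.mean((y - f_x) ** 2)            # Monte Carlo estimate of R(f, x)
aleatoric = sigma ** 2                    # A(x): risk of the Bayes predictor mu
excess = (f_x - mu) ** 2                  # ER(f, x): squared bias at x

print(f"R(f,x) ~ {risk:.4f}, A(x) + ER(f,x) = {aleatoric + excess:.4f}")
```

The excess term stays at \((f(x) - \mu(x))^2\) no matter how many samples are drawn, which is the bias component that posterior-variance estimators can miss.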
DEUP estimator
DEUP trains a secondary error predictor \(e\) (the error_model, written \(g\) in this library's code) to estimate the generalization error \(R(f, x)\), then subtracts an aleatoric estimate \(a(x)\). The paper's estimator (Eq. 9) is
\[u(f, x) = e(f, x) - a(x),\]
an estimator of the excess risk \(\mathrm{ER}(f, x)\). Since epistemic uncertainty is non-negative, deup reports the clipped form
\[\hat{e}(x) = \max\bigl(0,\; g(x) - a(x)\bigr),\]
which coincides with \(u(f,x)\) wherever the latter is non-negative.
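The clipping step \(\max(0, g(x) - a(x))\) can be sketched in a few lines of NumPy. The arrays g_pred and a_pred are made-up stand-ins for the outputs of an error model and an aleatoric estimator; they are not part of the deup API.

```python
import numpy as np

g_pred = np.array([0.40, 0.10, 0.25])   # hypothetical predicted total error g(x)
a_pred = np.array([0.15, 0.20, 0.25])   # hypothetical aleatoric estimates a(x)

u_raw = g_pred - a_pred                 # unclipped excess-risk estimate, can be negative
e_hat = np.maximum(0.0, u_raw)          # clipped epistemic estimate

# With a(x) = 0 the epistemic estimate reduces to g(x) itself.
e_hat_default = np.maximum(0.0, g_pred - 0.0)
```

Clipping only changes points where the aleatoric estimate exceeds the predicted total error, as in the second entry above.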
| Symbol | Role in deup | v0.1 default |
|---|---|---|
| \(f\) | base_model | any sklearn regressor |
| \(g \equiv e\) | error_model | HistGradientBoostingRegressor |
| \(a(x)\) | aleatoric floor | 0 (conservative proxy: \(g(x)\) alone) |
| \(\ell\) | loss | "squared" (per-row error target) |
Estimating the aleatoric floor \(a(x)\)
The paper (Sec. 3) gives three scenarios, each implemented or planned in deup:
- Noiseless (\(A(x) = 0\)): set \(a(x) = 0\), so \(\hat{e}(x) = g(x)\). This is the v0.1 default and the paper's choice for its noiseless experiments.
- Replicate oracle (regression, squared loss): with \(K\) i.i.d. outcomes \(y_1,\dots,y_K \sim P(Y\mid x)\), the unbiased aleatoric target is \(\tfrac{K}{K-1}\widehat{\mathrm{Var}}(y_1,\dots,y_K)\); fit \(a\) on these. Implemented as Heteroscedastic/Quantile aleatoric estimators (P6).
- No estimate available: use \(e(f,x)\) itself as a conservative (pessimistic) proxy for epistemic uncertainty, i.e. \(a(x)=0\). Valid when uncertainty is only used to rank points and aleatoric noise is roughly constant across \(\mathcal{X}\).
Setting \(a(x)\equiv 0\) (scenarios 1 and 3) is what v0.1 ships. The aleatoric estimators of scenario 2 and the \(\hat{e}=\max(0,g-a)\) decomposition land in v0.2 (Prompt P6).
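The replicate-oracle target can be illustrated with NumPy: scaling the biased sample variance by \(K/(K-1)\) is the same as calling np.var with ddof=1. The data-generating process below (sinusoidal mean, noise growing with \(x\)) is a synthetic assumption chosen only to make the check concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

n_points, K = 2000, 5
x = rng.uniform(0.0, 1.0, size=n_points)
true_sigma = 0.1 + 0.4 * x                       # assumed heteroscedastic noise level
# K replicate outcomes per input, each drawn from P(Y | x)
y = rng.normal(np.sin(2 * np.pi * x)[:, None], true_sigma[:, None], size=(n_points, K))

biased_var = y.var(axis=1)                       # divides by K
target = K / (K - 1) * biased_var                # unbiased per-point aleatoric target

# An aleatoric model a(x) would be regressed on (x, target); here we only
# compare the average target with the average true noise variance.
print(target.mean(), (true_sigma ** 2).mean())
```

Each individual target is noisy when \(K\) is small, which is why the text fits a model \(a\) on these targets rather than using them directly.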
Algorithm 1 — fixed training set
Given trained \(f\), validation set \(\{(x_i', y_i')\}\), and aleatoric estimator \(a\):
- Build \(\mathcal{D}_e = \{(x_i', \ell(y_i', f(x_i')))\}_{i=1}^K\).
- Fit \(g\) on \(\mathcal{D}_e\) (regress errors; often with \(\log(\text{error} + \varepsilon)\) target stabilization).
- Return \(\hat{u}(x) = g(x) - a(x)\).
Critical: errors must be out-of-sample for \(f\). Using in-sample residuals underestimates uncertainty (Sec. 3.2).
Algorithm 2 — K-fold pre-fill (what OOFErrorCollector implements)
When no held-out set exists, pre-fill \(\mathcal{D}_e\) via cross-validation:
- For each fold \(k\): clone \(f\), fit on train indices, predict held-out indices.
- Store out-of-fold prediction \(\hat{y}_i\) and error \(\ell(y_i, \hat{y}_i)\).
- Optionally refit \(f\) on all data for deployment (refit_on_all=True).
Each held-out row receives exactly one out-of-sample error, which is the target for \(g\). Rows never held out (e.g. the earliest walk-forward window) are excluded.
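The pre-fill loop can be sketched with scikit-learn's KFold and clone. The helper collect_oof_errors is hypothetical and only mirrors the steps listed above under squared-error loss; it is not the OOFErrorCollector source, and its signature is an assumption.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def collect_oof_errors(model, X, y, n_splits=5, refit_on_all=True):
    """Hypothetical helper: squared out-of-fold errors, NaN where never held out."""
    oof_pred = np.full(len(y), np.nan)
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in splitter.split(X):
        fold_model = clone(model).fit(X[train_idx], y[train_idx])   # fresh copy per fold
        oof_pred[test_idx] = fold_model.predict(X[test_idx])
    oof_err = (y - oof_pred) ** 2
    final_model = clone(model).fit(X, y) if refit_on_all else None
    return oof_pred, oof_err, final_model

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.3, size=300)

oof_pred, oof_err, f_full = collect_oof_errors(Ridge(alpha=1.0), X, y)
mask = ~np.isnan(oof_err)          # rows that were held out at least once
```

With plain K-fold every row is held out once, so mask is all True; a walk-forward splitter would leave NaN entries for the earliest window, and those rows would be dropped before fitting \(g\).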
Refit assumption
With refit_on_all=True, \(g\) learns the errors of fold models \(f_{-k}\) (each trained on a strict subset of the data) but is paired at inference with the full-data \(f\). Under this standard stacking assumption, \(g\) describes models trained on slightly less data than the deployed one; the gap is small for reasonable fold counts.
Stationarizing features \(\phi_{z^N}(x)\)
In interactive settings the error target is non-stationary as \(f\) is retrained. The paper embeds \((x, z^N)\) into stationarizing features (Sec. 3.2, Eq. 12):
| Feature | Builder in deup | Meaning |
|---|---|---|
| \(x\) | RawFeatures | raw inputs |
| \(s\) | SeenBit | 1 if \(x\) was in training set \(z^N\), else 0 |
| \(\log \hat{q}\) | DensityFeature | log-density under training distribution |
| \(\log \hat{V}\) | VarianceFeature | log predictive variance (ensemble / GP) |
| distance | DistanceToTrain | \(k\)-th NN distance to training manifold |
| \(\lvert\text{residual}\rvert\) proxy | ResidualMagnitude | kNN-smoothed training residual magnitude |
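To make two of these features concrete, the sketch below computes a seen bit and a \(k\)-th nearest-neighbor distance with plain NumPy and stacks them next to the raw inputs. The function names and the brute-force distance computation are assumptions for illustration; the deup builders may differ in matching rules, metric, and handling of self-matches.

```python
import numpy as np

def seen_bit(X_query, X_train):
    """1.0 where a query row exactly matches some training row, else 0.0."""
    matches = (X_query[:, None, :] == X_train[None, :, :]).all(axis=2)
    return matches.any(axis=1).astype(float)

def kth_nn_distance(X_query, X_train, k=5):
    """Euclidean distance from each query row to its k-th nearest training row."""
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))          # shape (n_query, n_train)
    return np.sort(dists, axis=1)[:, k - 1]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
X_query = np.vstack([X_train[:2], [[8.0, 8.0]]])       # two seen rows, one far outlier

features = np.column_stack([
    X_query,                                # raw inputs x
    seen_bit(X_query, X_train),             # s
    kth_nn_distance(X_query, X_train),      # distance feature
])
```

Note that a seen row counts itself as its nearest neighbor at distance zero in this sketch, which is one reason to use \(k > 1\).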
Mahalanobis / diagonal Gaussian density (Appendix C; Lee et al. 2018 OOD baseline). With per-dimension MLE \(\mu_d\), \(\sigma^2_d\) (clamped \(\geq 10^{-6}\)):
DensityFeature(method="mahalanobis") implements this closed-form estimator.
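Under the stated assumptions (independent dimensions, per-dimension MLE, variance floor of \(10^{-6}\)), a diagonal Gaussian log-density has the standard closed form
\[\log \hat{q}(x) = -\tfrac{1}{2}\sum_d \left[\frac{(x_d - \mu_d)^2}{\sigma_d^2} + \log\bigl(2\pi\sigma_d^2\bigr)\right].\]
The NumPy sketch below evaluates it. The function names are hypothetical, and this is an illustration of the formula rather than the DensityFeature source.

```python
import numpy as np

def fit_diag_gaussian(X_train, var_floor=1e-6):
    """Per-dimension MLE mean and variance, with the variance clamped from below."""
    mu = X_train.mean(axis=0)
    var = np.maximum(X_train.var(axis=0), var_floor)   # MLE variance divides by N
    return mu, var

def log_density(X, mu, var):
    """Log-density of each row of X under N(mu, diag(var))."""
    maha = ((X - mu) ** 2 / var).sum(axis=1)            # squared Mahalanobis distance
    log_norm = np.log(2.0 * np.pi * var).sum()
    return -0.5 * (maha + log_norm)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))
mu, var = fit_diag_gaussian(X_train)

X_query = np.array([[0.0, 0.0, 0.0], [6.0, 6.0, 6.0]])  # in-distribution vs far away
log_q = log_density(X_query, mu, var)
```

The variance floor matters for constant or near-constant columns, where an unclamped variance would make the Mahalanobis term blow up.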
Finding 3 — density can be null (Sanderink, 2026)
In homogeneous tabular/finance universes, density features may add no signal beyond rank geometry. Treat density as optional and ablatable; rank-geometry residualization (P6) is required for cross-sectional rankers (Sanderink, 2026, Finding 3).
Mapping to library objects
flowchart LR
X["X, y"] --> OOF["OOFErrorCollector"]
OOF -->|"oof errors"| G["error_model g"]
X --> FP["FeaturePipeline φ"]
FP --> G
OOF -->|"f refit"| F["base_model f"]
F --> Pred["predict(X)"]
G --> Unc["predict_epistemic(X)"]
Ranking adaptation & two-level deployment
deup's ranking support (DEUPRanker, rank loss, rank-geometry residualization)
extends DEUP from regression/classification to cross-sectional ranking by
predicting rank displacement and defining an epistemic signal \(\hat{e}\) relative to
a point-in-time (PIT-safe) baseline (Sanderink, 2026). Two empirical findings from
that work motivate the library design:
- Rank-geometry coupling (Finding 3). \(\hat{e}\) is structurally coupled with signal strength (median per-date correlation between \(\hat{e}\) and \(|\text{score}|\) ≈ 0.6), so naive inverse-uncertainty sizing de-levers the strongest signals. This motivates the optional rank-geometry residualization in RankResidualizer (Sanderink, 2026).
- Two-level deployment. Uncertainty is best used as (i) a strategy-level regime-trust gate deciding whether to trade and (ii) a position-level tail-risk cap; that is, DEUP adds value mainly as a tail-risk guard rather than a continuous sizing denominator (Sanderink, 2026). This informs the aggregation-reliability diagnostics (Findings 1–2; see Reliability).
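One simple way to read "rank-geometry residualization" is as a per-date regression of \(\hat{e}\) on \(|\text{score}|\), keeping only the residual. The sketch below illustrates that reading on synthetic data. It is an assumption made for illustration and is not necessarily how RankResidualizer or Sanderink (2026) define the operation; the function name and the coupling strength are made up.

```python
import numpy as np

def residualize_per_date(e_hat, abs_score, dates):
    """OLS-residualize e_hat on |score| within each date (illustrative reading only)."""
    resid = np.empty_like(e_hat, dtype=float)
    for d in np.unique(dates):
        m = dates == d
        A = np.column_stack([np.ones(m.sum()), abs_score[m]])   # intercept + slope
        beta, *_ = np.linalg.lstsq(A, e_hat[m], rcond=None)
        resid[m] = e_hat[m] - A @ beta
    return resid

rng = np.random.default_rng(0)
dates = np.repeat(np.arange(20), 100)                 # 20 dates, 100 assets each
score = rng.normal(size=dates.size)
e_hat = 0.6 * np.abs(score) + rng.normal(0.0, 0.4, size=dates.size)  # coupled by design

resid = residualize_per_date(e_hat, np.abs(score), dates)
corr_before = np.corrcoef(e_hat, np.abs(score))[0, 1]
corr_after = np.corrcoef(resid, np.abs(score))[0, 1]
```

After residualization the linear correlation with \(|\text{score}|\) is zero by construction, so any sizing rule based on the residual no longer penalizes a signal merely for being strong.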
References
- Lahlou, Jain, Nekoei, Butoi, Bertin, Rector-Brooks, Korablyov, Bengio (2023). DEUP: Direct Epistemic Uncertainty Prediction. TMLR. arXiv:2102.08501
- Sanderink, U. (2026). When Alpha Breaks: Two-Level Uncertainty for Safe Deployment of Cross-Sectional Stock Rankers. arXiv:2603.13252. https://arxiv.org/pdf/2603.13252
- Kotelevskii et al. (2025a). Bregman-divergence excess risk (formal cover for DEUP).
- Lee et al. (2018). Mahalanobis OOD score (diagonal Gaussian special case).
- Hüllermeier & Waegeman (2019). Aleatoric vs epistemic uncertainty survey.
- Romano, Patterson, Candès (2019). Conformalized Quantile Regression. NeurIPS.
- Lei et al. (2018). Distribution-Free Predictive Inference for Regression. JASA.