Decomposition & rank residualization

This page covers the v0.2 components that turn the raw error estimate \(g(x)\) into a reported epistemic signal: the error estimator, aleatoric estimators, the \(\hat{e} = \max(0, g - a)\) decomposition, and cross-sectional rank-geometry residualization. See Theory for the underlying math.

ErrorEstimator

ErrorEstimator is the reusable DEUP error model \(g\) — feature pipeline + target transform + non-negativity, fit on out-of-fold errors.

from deup.core import ErrorEstimator
from deup.core.features import DensityFeature, FeaturePipeline, RawFeatures
from deup.core.oof import OOFErrorCollector
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

oof = OOFErrorCollector(
    RandomForestRegressor(), cv=KFold(5), loss="squared"
).fit_collect(X, y)

g = ErrorEstimator(
    features=FeaturePipeline([("raw", RawFeatures()), ("density", DensityFeature())]),
    target_transform="log",
).fit(X[oof.indices], oof.errors)

error_estimate = g.predict(X_new)   # >= 0

Aleatoric estimators \(a(x)\)

Model-agnostic estimates of the irreducible noise floor \(A(x) = \mathrm{Var}(Y\mid X=x)\) (variance scale, matching a squared-error target).

Estimator	\(a(x)\)	When
`Homoscedastic`	constant \(\sigma^2\)	noise ~ constant across \(\mathcal{X}\)
`Heteroscedastic`	local k-NN label variance	input-dependent noise
`Quantile`	\(((q_{hi}-q_{lo})/z)^2\) from quantile regression	skewed / tail noise

from deup.core import Heteroscedastic

a = Heteroscedastic(k=20).fit(X, y).predict(X_new)

Decomposition

from deup.core import decompose_epistemic

e_hat = decompose_epistemic(error_estimate, a)   # max(0, g - a)
# a=None -> conservative proxy e_hat = g (the v0.1 default)

\(\hat{e}\) is always non-negative.

Rank-geometry residualization (Finding 3; Sanderink, 2026)

For cross-sectional rankers, \(g\) and the loss target can be partly mechanical rank geometry rather than genuine error (Sanderink, 2026, Finding 3). RankResidualizer fits an isotonic map from the within-group rank of \(|score|\) to the signal and subtracts it, leaving the part not explained by rank geometry.

from deup.core import RankResidualizer, coupling_retention_report

# decouple g from rank geometry, per date
res = RankResidualizer().fit(g_values, abs_score, groups=dates)
g_decoupled = res.transform(g_values, abs_score, groups=dates)

# diagnostics: coupling before/after + loss-association retention
report = coupling_retention_report(g_values, score, loss, groups=dates)
print(report.coupling_before, report.coupling_after, report.retention)

Empirical motivation (Sanderink, 2026)

Residualization decoupled the signal (per-date \(\rho(\hat{e}, |score|)\): \(0.616 \to 0.317\)) while retaining ~92.5% of the loss association. This is off by default and on in DEUPRanker (P7).

Density kill criterion (Finding 3 corollary; Sanderink, 2026)

Density features can be an informative null in homogeneous universes (Sanderink, 2026). The kill criterion drops them when their gain importance is negligible and they barely move the loss partial-correlation.

from deup.core import density_kill_criterion

decision = density_kill_criterion(gain_importance=1e-5, delta_partial_corr=0.001)
print(decision.keep, decision.reason)   # False, "killed: ..."

Use partial_correlation(a, b, control) to compute the \(\Delta\) partial-correlation with vs without the density feature.