Decomposition

Bases: BaseEstimator

Fit and predict the DEUP error model g.

Parameters:

Name Type Description Default
model Any

The secondary regressor. Defaults to sklearn.ensemble.HistGradientBoostingRegressor (no extra dependencies). Pass any sklearn-style regressor, or use "lightgbm" to request a LightGBM model (requires the [gbm] extra).

None
features Any

Optional feature builder (e.g. a deup.core.features.FeaturePipeline) applied to X before fitting model. If None, raw X is used.

None
target_transform TargetTransform

Stabilization for the error target: "log" (default), "asinh" or "none".

'log'
error_eps float

Stabilizer for log / asinh transforms.

1e-06
clip_negative bool

If True (default), clip predicted errors at 0.

True

Attributes:

Name Type Description
model_

The fitted secondary regressor.

features_

The fitted feature builder (or None).

fit(X, errors, y=None)

Fit g on (features(X), errors).

Parameters:

Name Type Description Default
X Any

Inputs aligned with errors (the rows that received OOF errors).

required
errors ArrayLike

Non-negative pointwise error targets from the OOF collector.

required
y ArrayLike | None

Optional original targets, forwarded to feature builders that need them (e.g. VarianceFeature / ResidualMagnitude).

None

predict(X)

Predict the (non-negative) error estimate g(x).

Bases: BaseEstimator

Constant aleatoric variance a(x) = sigma^2 for all x.

The global noise level is estimated as the mean local label variance among the k nearest neighbors, i.e. a bias-corrected estimate of Var(Y | X) averaged over the training inputs. Use this when label noise is believed to be roughly constant across the input space (the paper's scenario 3 with a non-zero floor).

Parameters:

Name Type Description Default
k int

Neighbors used to estimate local label variance.

10
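
The estimate described above can be sketched as follows. This is an illustration under stated assumptions: constant_aleatoric is a hypothetical helper, "bias-corrected" is read as the ddof=1 sample variance, and each training point is counted among its own k neighbors, which the source does not specify.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def constant_aleatoric(X, y, k=10):
    """Mean over training points of the ddof=1 label variance among k neighbors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(X)  # assumption: each point is its own nearest neighbor
    local_var = y[idx].var(axis=1, ddof=1)
    return float(local_var.mean())


# Smooth signal plus homoscedastic noise with true variance 0.5**2 = 0.25.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(2000, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.5, size=2000)

sigma2 = constant_aleatoric(X, y, k=10)
```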

Bases: BaseEstimator

Input-dependent aleatoric variance via local k-NN label variance.

For each x, the estimate is the bias-corrected variance of the training y values among its k nearest neighbors, a model-free estimate of Var(Y | X = x).

Parameters:

Name Type Description Default
k int

Number of neighbors for the local variance estimate.

10
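
A short sketch of the input-dependent variant, evaluated at query points rather than averaged. The helper local_aleatoric is hypothetical, and the ddof=1 variance is an assumed reading of "bias-corrected".

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def local_aleatoric(X_train, y_train, X_query, k=10):
    """ddof=1 variance of the training labels among the k neighbors of each query."""
    nn = NearestNeighbors(n_neighbors=k).fit(np.asarray(X_train, dtype=float))
    _, idx = nn.kneighbors(np.asarray(X_query, dtype=float))
    return np.asarray(y_train, dtype=float)[idx].var(axis=1, ddof=1)


# Heteroscedastic noise: sd 0.1 on the left half, sd 1.0 on the right half.
rng = np.random.default_rng(1)
X_train = rng.uniform(0.0, 1.0, size=(2000, 1))
noise_sd = np.where(X_train[:, 0] < 0.5, 0.1, 1.0)
y_train = rng.normal(scale=noise_sd)

a_hat = local_aleatoric(X_train, y_train, [[0.25], [0.75]], k=50)
```

A larger k lowers the variance of the estimate but blurs it across regions with different noise levels.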

Bases: BaseEstimator

Aleatoric variance from a quantile-regression spread.

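Fits two quantile regressors at q_lo and q_hi and converts the predicted interval width to a variance via the Gaussian relation sigma = (q_hi - q_lo) / (z_hi - z_lo), then a(x) = sigma^2. In the numerator, q_hi - q_lo denotes the difference between the two predicted quantile values at x; z_lo and z_hi are the standard normal quantiles at the levels q_lo and q_hi.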

Parameters:

Name Type Description Default
estimator Any

A quantile regressor factory taking a quantile= kwarg. Defaults to sklearn.ensemble.HistGradientBoostingRegressor with loss="quantile".

None
q_lo float

Lower quantile (default 0.159, about -1 sigma).

0.159
q_hi float

Upper quantile (default 0.841, about +1 sigma).

0.841

Return the epistemic estimate e_hat = max(0, g - a).

Parameters:

Name Type Description Default
error ArrayLike

The error estimate g(x) (e.g. from ErrorEstimator).

required
aleatoric ArrayLike | None

The aleatoric estimate a(x). If None, a(x) = 0 and e_hat = g (the paper's conservative proxy, scenario 3).

None
clip bool

If True (default), clip the result at 0 (epistemic uncertainty is non-negative).

True
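
The decomposition e_hat = max(0, g - a) is small enough to sketch directly. The function name epistemic_estimate is a hypothetical stand-in; the behavior follows the parameter descriptions above.

```python
import numpy as np


def epistemic_estimate(error, aleatoric=None, clip=True):
    """e_hat = g - a, clipped at 0 when clip is True; a defaults to 0."""
    g = np.asarray(error, dtype=float)
    a = np.zeros_like(g) if aleatoric is None else np.asarray(aleatoric, dtype=float)
    e_hat = g - a
    return np.maximum(e_hat, 0.0) if clip else e_hat


g = np.array([1.0, 0.25, 0.5])
a = np.array([0.5, 0.5, 0.5])

e_clipped = epistemic_estimate(g, a)          # [0.5, 0.0, 0.0]
e_raw = epistemic_estimate(g, a, clip=False)  # [0.5, -0.25, 0.0]
e_proxy = epistemic_estimate(g)               # equals g when no aleatoric term is given
```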

Residualize a signal on the within-group rank of a model score.

For cross-sectional rankers, the raw epistemic signal can be partly mechanical: the within-date rank percentile of |score| tracks the loss target by construction (Finding 3, per-date rho(e_hat, |score|) ~ 0.616). This transform fits an isotonic map from the within-group rank to the signal and subtracts it, leaving the part of the signal not explained by rank geometry.

Apply the same fitted residualizer to both g and the loss target to obtain a decoupled signal whose association with realized loss can then be measured honestly.

The axis to rank on is supplied as the score argument of fit/transform (pass |score| to decouple from rank-of-conviction).

Parameters:

Name Type Description Default
out_of_bounds str

Passed to sklearn.isotonic.IsotonicRegression (default "clip").

'clip'

fit(values, score, groups=None)

Fit the isotonic rank -> value map (pooled across groups).

transform(values, score, groups=None)

Return values minus the rank-explained component.
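
A rough sketch of the fit/transform pair, assuming a pooled isotonic fit on the within-group rank percentile. The helper within_group_rank is hypothetical, ties are broken by position rather than averaged, and increasing="auto" is an assumption about how the direction of the map is chosen.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def within_group_rank(score, groups=None):
    """Rank percentile of score inside each group (ties broken by position)."""
    score = np.asarray(score, dtype=float)
    if groups is None:
        groups = np.zeros(score.shape[0], dtype=int)
    groups = np.asarray(groups)
    ranks = np.empty(score.shape[0])
    for g in np.unique(groups):
        mask = groups == g
        order = score[mask].argsort().argsort()
        ranks[mask] = (order + 0.5) / mask.sum()
    return ranks


# 20 groups (e.g. dates) of 25 rows; the signal partly tracks |score|.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(20), 25)
score = rng.normal(size=500)
values = np.abs(score) + 0.3 * rng.normal(size=500)

# fit: isotonic map from rank to value, pooled across groups
rank = within_group_rank(np.abs(score), groups)
iso = IsotonicRegression(increasing="auto", out_of_bounds="clip").fit(rank, values)

# transform: subtract the rank-explained component
residual = values - iso.predict(rank)

coupling_before = np.corrcoef(values, rank)[0, 1]
coupling_after = np.corrcoef(residual, rank)[0, 1]
```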

Quantify rank-geometry coupling reduction and loss-association retention.

Returns Spearman rho(g, |score|) before/after residualization (coupling) and rho(signal, loss) before/after (loss association). retention is the ratio of after/before loss association (Finding 3 reports R ~ 0.955).
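
One way such a report could be assembled is sketched below. The function name coupling_report, its argument list, and the dictionary keys are all hypothetical; only the quantities (coupling and loss association before/after, and their ratio) come from the description above. For brevity the demo residualizes with a linear fit on |score| as a stand-in for the isotonic rank residualizer.

```python
import numpy as np
from scipy.stats import spearmanr


def coupling_report(g, g_resid, loss, loss_resid, abs_score):
    """Spearman coupling and loss association before/after residualization."""

    def rho(a, b):
        stat, _ = spearmanr(a, b)
        return float(stat)

    before = rho(g, loss)
    after = rho(g_resid, loss_resid)
    return {
        "coupling_before": rho(g, abs_score),
        "coupling_after": rho(g_resid, abs_score),
        "loss_assoc_before": before,
        "loss_assoc_after": after,
        "retention": after / before,
    }


rng = np.random.default_rng(0)
abs_score = np.abs(rng.normal(size=1000))
u = rng.normal(size=1000)  # shared component unrelated to |score|
g = abs_score + 0.5 * u
loss = 0.5 * abs_score + u + 0.5 * rng.normal(size=1000)


def linear_residual(v, x):
    # Stand-in residualizer (not the isotonic rank map): remove a linear fit on x.
    return v - np.polyval(np.polyfit(x, v, 1), x)


report = coupling_report(
    g, linear_residual(g, abs_score), loss, linear_residual(loss, abs_score), abs_score
)
```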

Decide whether to keep density features in g.

Finding 3 corollary: in homogeneous tabular/finance universes density features can add no signal beyond rank geometry. Drop density when BOTH its gain importance is negligible AND adding it changes the loss partial-correlation by less than corr_tol.

Parameters:

Name Type Description Default
gain_importance float

The density feature's relative gain importance in g (in [0, 1]).

required
delta_partial_corr float

|rho_partial(with density) - rho_partial(without density)|.

required
importance_tol float

Threshold below which gain_importance is considered negligible.

0.001
corr_tol float

Threshold below which delta_partial_corr is considered negligible.

0.001

Returns:

Type Description
DensityKillDecision

keep=False (kill) iff both signals are negligible.
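
The rule can be sketched in a few lines. KillDecision is a hypothetical stand-in for DensityKillDecision (only the keep field is taken from the documentation), and the strict "less than" comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KillDecision:
    """Hypothetical stand-in for DensityKillDecision; only `keep` comes from the docs."""

    keep: bool
    negligible_importance: bool
    negligible_corr_change: bool


def density_kill(gain_importance, delta_partial_corr, importance_tol=1e-3, corr_tol=1e-3):
    """Kill (keep=False) only when both signals fall below their thresholds."""
    negligible_importance = gain_importance < importance_tol
    negligible_corr_change = abs(delta_partial_corr) < corr_tol
    keep = not (negligible_importance and negligible_corr_change)
    return KillDecision(keep, negligible_importance, negligible_corr_change)


# Both signals negligible: density is dropped.
drop = density_kill(gain_importance=0.0004, delta_partial_corr=0.0002)
# Importance negligible but the partial correlation moves: density is kept.
stay = density_kill(gain_importance=0.0004, delta_partial_corr=0.02)
```

Requiring both conditions is the conservative choice: a feature with tiny gain importance that still shifts the partial correlation stays in g.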

Pearson partial correlation of a and b controlling for control.

Helper for computing delta_partial_corr in the density kill criterion: the residual correlation of two signals after linearly regressing out a control.
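
A minimal sketch of a Pearson partial correlation computed by regressing out the control with ordinary least squares (with an intercept) and correlating the residuals. The function name partial_corr is a hypothetical stand-in for the helper described above.

```python
import numpy as np


def partial_corr(a, b, control):
    """Correlation of a and b after linearly regressing out control."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    c = np.asarray(control, dtype=float).reshape(len(a), -1)
    design = np.column_stack([np.ones(len(a)), c])  # intercept + control column(s)
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return float(np.corrcoef(res_a, res_b)[0, 1])


# a and b are related only through the shared control c.
rng = np.random.default_rng(0)
c = rng.normal(size=5000)
a = c + 0.5 * rng.normal(size=5000)
b = c + 0.5 * rng.normal(size=5000)

raw = float(np.corrcoef(a, b)[0, 1])
partial = partial_corr(a, b, c)
```

In this example the raw correlation is high while the partial correlation is near zero, because the shared control accounts for the whole association.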