Decomposition

Bases: BaseEstimator

Fit and predict the DEUP error model g.

Parameters:

Name	Type	Description	Default
`model`	`Any`	The secondary regressor. Defaults to `sklearn.ensemble.HistGradientBoostingRegressor` (no extra deps). Pass any sklearn-style regressor, or use `"lightgbm"` to request a LightGBM model (requires the `[gbm]` extra).	`None`
`features`	`Any`	Optional feature builder (e.g. a `deup.core.features.FeaturePipeline`) applied to `X` before fitting `model`. If `None`, raw `X` is used.	`None`
`target_transform`	`TargetTransform`	Stabilization for the error target: `"log"` (default), `"asinh"` or `"none"`.	`'log'`
`error_eps`	`float`	Stabilizer for `log` / `asinh` transforms.	`1e-06`
`clip_negative`	`bool`	If `True` (default), clip predicted errors at 0.	`True`
`extrapolate`	`bool`	If `True` (default), apply a monotone floor when `DistanceToTrain` or `DensityFeature` columns exceed the range covered by the OOF training data (the training hull). Without the floor, tree-based `g` models predict a flat value far from the in-support feature range.	`True`
`extrapolation_percentile`	`float`	OOF error percentile used as the reference scale for the floor (default 90).	`90.0`

Attributes:

Name	Type	Description
`model_`		The fitted secondary regressor.
`features_`		The fitted feature builder (or `None`).
`extrapolation_guide_`		Fitted bounds for extrapolation flooring, or `None`.

`fit(X, errors, y=None)`

Fit g on (features(X), errors).

Parameters:

Name	Type	Description	Default
`X`	`Any`	Inputs aligned with `errors` (the rows that received OOF errors).	required
`errors`	`ArrayLike`	Non-negative pointwise error targets from the OOF collector.	required
`y`	`ArrayLike \| None`	Optional original targets, forwarded to feature builders that need them (e.g. `VarianceFeature` / `ResidualMagnitude`).	`None`

`predict(X)`

Predict the (non-negative) error estimate g(x).
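
As a rough illustration of how `target_transform`, `error_eps` and `clip_negative` could interact, the sketch below maps an error into the stabilized space and back. The helper names `to_target` and `from_target` are hypothetical, and the exact formulas (in particular whether and how `error_eps` enters the `asinh` branch) are assumptions rather than the library's implementation.

```python
import math


def to_target(err, kind="log", eps=1e-6):
    """Map a non-negative error into the space the regressor is fit on."""
    if kind == "log":
        # eps keeps log finite when the error is exactly 0 (assumed form).
        return math.log(err + eps)
    if kind == "asinh":
        # Assumption: eps is not used here; the reference does not give the formula.
        return math.asinh(err)
    if kind == "none":
        return err
    raise ValueError(f"unknown target transform: {kind!r}")


def from_target(z, kind="log", eps=1e-6, clip_negative=True):
    """Invert the transform and optionally clip the prediction at 0."""
    if kind == "log":
        err = math.exp(z) - eps
    elif kind == "asinh":
        err = math.sinh(z)
    elif kind == "none":
        err = z
    else:
        raise ValueError(f"unknown target transform: {kind!r}")
    return max(0.0, err) if clip_negative else err
```

With `kind="none"` the regressor can predict a negative value directly, which is where clipping matters most.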

Bases: BaseEstimator

Constant aleatoric variance a(x) = sigma^2 for all x.

The global noise level is estimated as the mean local label variance among k nearest neighbors (a bias-corrected estimate of Var(Y | X) averaged over the training inputs). Use when label noise is believed roughly constant across the input space (the paper's scenario 3 with a non-zero floor).

Parameters:

Name	Type	Description	Default
`k`	`int`	Neighbors used to estimate local label variance.	`10`

Bases: BaseEstimator

Input-dependent aleatoric variance via local k-NN label variance.

For each x the estimate is the bias-corrected variance of training y among its k nearest neighbors — a model-free estimate of Var(Y | X = x).

Parameters:

Name	Type	Description	Default
`k`	`int`	Number of neighbors for the local variance estimate.	`10`
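
The two k-NN estimators above can be sketched without any dependencies. The function names below are hypothetical, "bias-corrected" is read as the sample variance with an n - 1 denominator, and whether a training point counts as its own neighbor is an assumption of this sketch (here it does).

```python
import math
from statistics import fmean, variance


def knn_label_variance(X_train, y_train, X_query, k=10):
    """Sample variance of y among the k nearest training points of each query."""
    out = []
    for q in X_query:
        # Sort training indices by Euclidean distance to the query point.
        order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], q))
        neighbors = [y_train[i] for i in order[:k]]
        out.append(variance(neighbors))  # n - 1 denominator
    return out


def constant_noise_level(X_train, y_train, k=10):
    """Global sigma^2: the local variance averaged over the training inputs."""
    return fmean(knn_label_variance(X_train, y_train, X_train, k=k))
```

The first function gives an input-dependent a(x); the second collapses it to a single constant for the homoscedastic case.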

Bases: BaseEstimator

Aleatoric variance from a quantile-regression spread.

Fits two quantile regressors at q_lo and q_hi and converts the predicted interval width to a variance via the Gaussian relation sigma = (q_hi(x) - q_lo(x)) / (z_hi - z_lo), where z_lo and z_hi are the standard-normal quantiles at levels q_lo and q_hi, then a(x) = sigma^2.

Parameters:

Name	Type	Description	Default
`estimator`	`Any`	A quantile regressor factory taking a `quantile=` kwarg. Defaults to `sklearn.ensemble.HistGradientBoostingRegressor` with `loss="quantile"`.	`None`
`q_lo`	`float`	Lower quantile (default 0.159, roughly -1 sigma under a Gaussian).	`0.159`
`q_hi`	`float`	Upper quantile (default 0.841, roughly +1 sigma under a Gaussian).	`0.841`
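
The width-to-variance conversion can be checked by hand. The sketch below assumes the two quantile predictions are already available as plain lists; `interval_to_variance` is a hypothetical helper name, and the guard against crossing quantiles is an addition of this sketch.

```python
from statistics import NormalDist


def interval_to_variance(pred_lo, pred_hi, q_lo=0.159, q_hi=0.841):
    """Convert predicted quantile intervals to variances under a Gaussian."""
    std_normal = NormalDist()
    z_lo = std_normal.inv_cdf(q_lo)
    z_hi = std_normal.inv_cdf(q_hi)
    variances = []
    for lo, hi in zip(pred_lo, pred_hi):
        # Guard added in this sketch: crossing quantiles give width 0.
        width = max(hi - lo, 0.0)
        sigma = width / (z_hi - z_lo)
        variances.append(sigma ** 2)
    return variances
```

With the default levels, z_hi - z_lo is close to 2, so an interval of width 2 maps to a variance close to 1.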

Return the epistemic estimate e_hat = max(0, g - a).

Parameters:

Name	Type	Description	Default
`error`	`ArrayLike`	The error estimate `g(x)` (e.g. from `ErrorEstimator`).	required
`aleatoric`	`ArrayLike \| None`	The aleatoric estimate `a(x)`. If `None`, `a(x) = 0` and `e_hat = g` (the paper's conservative proxy, scenario 3).	`None`
`clip`	`bool`	If `True` (default), clip the result at 0 (epistemic uncertainty is non-negative).	`True`
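
A minimal sketch of the subtraction described above, written on plain lists. The function name `epistemic` is a placeholder, since this reference does not show the actual name.

```python
def epistemic(error, aleatoric=None, clip=True):
    """Return e_hat = g - a elementwise, clipped at 0 when clip is True."""
    if aleatoric is None:
        # No aleatoric estimate: a(x) = 0, so e_hat equals g.
        aleatoric = [0.0] * len(error)
    raw = [g - a for g, a in zip(error, aleatoric)]
    return [max(0.0, v) for v in raw] if clip else raw
```

Passing `clip=False` keeps negative differences, which can help when checking how often the aleatoric estimate exceeds the predicted error.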

Residualize a signal on the within-group rank of a model score.

For cross-sectional rankers the raw epistemic signal can be partly an artifact of rank geometry: the within-date rank percentile of |score| mechanically tracks the loss target (Finding 3, per-date rho(e_hat, |score|) ~ 0.616). This transform fits an isotonic map from the within-group rank to the signal and subtracts it, leaving the part of the signal not explained by rank geometry.

Apply the same fitted residualizer to both g and the loss target to obtain a decoupled signal whose association with realized loss can then be measured honestly.

The axis to rank on is supplied as the score argument of fit/transform (pass |score| to decouple from rank-of-conviction).

Parameters:

Name	Type	Description	Default
`out_of_bounds`	`str`	Passed to `sklearn.isotonic.IsotonicRegression` (default `"clip"`).	`'clip'`

`fit(values, score, groups=None)`

Fit the isotonic rank -> value map (pooled across groups).

`transform(values, score, groups=None)`

Return values minus the rank-explained component.
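
The residualization idea can be sketched in a dependency-free way. To stay self-contained, the example replaces `IsotonicRegression` with a hand-written pool-adjacent-violators fit, combines fit and transform on the same data, and uses hypothetical function names. The rank percentile convention (0 to 1 inside each group, ties broken by input order) is also an assumption.

```python
from collections import defaultdict


def within_group_rank(score, groups):
    """Rank percentile in [0, 1] of each score inside its own group."""
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    pct = [0.0] * len(score)
    for idx in members.values():
        order = sorted(idx, key=lambda i: score[i])
        n = len(order)
        for r, i in enumerate(order):
            pct[i] = r / (n - 1) if n > 1 else 0.5
    return pct


def fit_isotonic(x, y):
    """Non-decreasing least-squares fit of y on x, returned as {x: fitted}."""
    agg = defaultdict(lambda: [0.0, 0])  # x level -> [sum of y, count]
    for xi, yi in zip(x, y):
        agg[xi][0] += yi
        agg[xi][1] += 1
    xs = sorted(agg)
    blocks = []  # each block: [sum of y, count, number of x levels]
    for xi in xs:
        s, c = agg[xi]
        blocks.append([s, c, 1])
        # Merge backwards while the block means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s2, c2, k2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += c2
            blocks[-1][2] += k2
    fitted, pos = {}, 0
    for s, c, k in blocks:
        for xi in xs[pos:pos + k]:
            fitted[xi] = s / c
        pos += k
    return fitted


def residualize(values, score, groups):
    """Subtract the pooled rank-explained component from values."""
    pct = within_group_rank(score, groups)
    fitted = fit_isotonic(pct, values)
    return [v - fitted[p] for v, p in zip(values, pct)]
```

A signal that is a pure monotone function of the within-group rank residualizes to zero; only the deviation from that pooled map survives.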

Quantify rank-geometry coupling reduction and loss-association retention.

Returns the Spearman correlation rho(g, |score|) before and after residualization (coupling) and rho(signal, loss) before and after (loss association). The retention value is the ratio of the after to the before loss association (Finding 3 reports R ~ 0.955).
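
One way to compute such a report is sketched below. The function and key names are hypothetical, correlations are pooled rather than computed per date, and the "after" loss association is taken between the residualized signal and the residualized loss; all three are assumptions of this sketch.

```python
import math
from statistics import fmean


def average_ranks(x):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman(a, b):
    """Spearman rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def coupling_report(g, g_resid, abs_score, loss, loss_resid):
    assoc_before = spearman(g, loss)
    assoc_after = spearman(g_resid, loss_resid)
    return {
        "coupling_before": spearman(g, abs_score),
        "coupling_after": spearman(g_resid, abs_score),
        "loss_assoc_before": assoc_before,
        "loss_assoc_after": assoc_after,
        "retention": assoc_after / assoc_before,
    }
```

A useful residualization drives the coupling toward 0 while keeping retention near 1.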

Decide whether to keep density features in g.

Finding 3 corollary: in homogeneous tabular/finance universes density features can add no signal beyond rank geometry. Drop density when BOTH its gain importance is negligible AND adding it changes the loss partial-correlation by less than corr_tol.

Parameters:

Name	Type	Description	Default
`gain_importance`	`float`	The density feature's relative gain importance in `g` (in `[0, 1]`).	required
`delta_partial_corr`	`float`	`\|rho_partial(with density) - rho_partial(without density)\|`.	required
`importance_tol`	`float`	Threshold below which `gain_importance` is considered negligible.	`0.001`
`corr_tol`	`float`	Threshold below which `delta_partial_corr` is considered negligible.	`0.001`

Returns:

Type	Description
`DensityKillDecision`	`keep=False` (kill) iff both signals are negligible.
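
The decision rule above amounts to a two-condition check. In the sketch below, `KillDecisionSketch` and `decide_density` are hypothetical stand-ins: only the `keep` field is documented for the real `DensityKillDecision`, and the use of strict inequalities is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KillDecisionSketch:
    """Stand-in for the returned decision; only `keep` is documented."""
    keep: bool


def decide_density(gain_importance, delta_partial_corr,
                   importance_tol=1e-3, corr_tol=1e-3):
    """Drop density features only when both signals are negligible."""
    negligible_gain = gain_importance < importance_tol
    negligible_corr = abs(delta_partial_corr) < corr_tol
    return KillDecisionSketch(keep=not (negligible_gain and negligible_corr))
```

Requiring both conditions keeps density features whenever either the model relies on them or they shift the loss association.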

Pearson partial correlation of a and b controlling for control.

Helper for computing delta_partial_corr in the density kill criterion: the residual correlation of two signals after linearly regressing out a control.
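
A minimal sketch of this computation for a single one-dimensional control: regress each signal on the control by ordinary least squares, then correlate the residuals. The function names are hypothetical, and a constant control (zero variance) is not handled.

```python
import math
from statistics import fmean


def pearson(a, b):
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def residuals(y, x):
    """Residuals of the least-squares line of y on x (with intercept)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [(yi - my) - slope * (xi - mx) for xi, yi in zip(x, y)]


def partial_corr(a, b, control):
    """Pearson correlation of a and b after removing the linear effect of control."""
    return pearson(residuals(a, control), residuals(b, control))
```

Two signals can be positively correlated overall and still have a negative partial correlation once a shared driver is regressed out, which is why the kill criterion compares partial rather than raw correlations.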