Feature builders for \(g(x)\)
The error predictor \(g\) in DEUP can use stationarizing features
\(\phi_{z^N}(x)\) beyond raw inputs (Lahlou et al., 2023, Sec. 3.2). Each builder
is a scikit-learn TransformerMixin that is fit on training data only, following the same
leakage discipline as OOFErrorCollector (Finding 4).
See Theory for the mathematical definitions.
Quick example
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from deup.core.features import (
    DensityFeature,
    DistanceToTrain,
    FeaturePipeline,
    RawFeatures,
    SeenBit,
)
pipe = FeaturePipeline([
    ("raw", RawFeatures()),
    ("density", DensityFeature(method="mahalanobis")),
    ("dist", DistanceToTrain(k=5)),
    ("seen", SeenBit(atol=1e-8)),
])
X_train = np.random.default_rng(0).normal(size=(500, 8))
X_test = np.random.default_rng(1).normal(size=(50, 8))
phi_train = pipe.fit_transform(X_train)
phi_test = pipe.transform(X_test)
print(phi_train.shape, phi_test.shape)  # expected: (500, 11) (50, 11), i.e. 8 raw + 1 density + 1 distance + 1 seen-bit column
Builders
| Class | Output | Methods / notes |
|---|---|---|
| RawFeatures | \(x\) | passthrough |
| DensityFeature | \(\log \hat{q}(x)\) column | mahalanobis, knn, kde; flow requires [torch] |
| VarianceFeature | \(\log \hat{V}(x)\) column | ensemble (bootstrap); gp requires [torch] |
| DistanceToTrain | \(k\)-th NN distance | default k=5 |
| SeenBit | \(s \in \{0,1\}\) | exact / atol duplicate detection |
| ResidualMagnitude | kNN-smoothed \(\|y-f(x)\|\) | needs estimator + y at fit |
DensityFeature
# Diagonal Gaussian log-density, in the spirit of the Mahalanobis OOD score of Lee et al. (2018)
DensityFeature(method="mahalanobis")
# k-NN distance proxy: log q ≈ -log(d_k + ε)
DensityFeature(method="knn", k=5)
# sklearn KernelDensity
DensityFeature(method="kde", bandwidth=1.0)
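The k-NN proxy \(\log q \approx -\log(d_k + \varepsilon)\) can be illustrated with plain scikit-learn. This is a minimal sketch, not the library's implementation: the function name `knn_log_density_proxy` and the default `eps` are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_log_density_proxy(X_train, X_query, k=5, eps=1e-12):
    """Hypothetical helper: return -log(d_k + eps), where d_k is the distance
    from each query row to its k-th nearest training point."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, _ = nn.kneighbors(X_query)  # shape (n_query, k), sorted ascending
    d_k = dist[:, -1]                 # distance to the k-th neighbor
    return -np.log(d_k + eps)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8))
X_in = rng.normal(size=(50, 8))            # same distribution as training
X_far = rng.normal(loc=6.0, size=(50, 8))  # shifted far away from training

log_q_in = knn_log_density_proxy(X_train, X_in)
log_q_far = knn_log_density_proxy(X_train, X_far)
print(log_q_in.mean() > log_q_far.mean())  # in-distribution points score higher
```

One caveat of this sketch: when the query rows are the training rows themselves, each point is its own nearest neighbor at distance zero, so an implementation would likely need to skip the self-match on training data.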
Finding 3 (Sanderink, 2026)
The density feature can be an informative null in homogeneous tabular panels: it may add
no signal to \(g\). Ablate it with FeaturePipeline column importances, and drop it if \(\Delta\rho < 0.005\).
VarianceFeature (ensemble)
Fits n_estimators bootstrap replicas of a base model and returns
\(\log(\mathrm{Var}_j f_j(x) + \varepsilon)\).
VarianceFeature(
    method="ensemble",
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    n_estimators=10,
)
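The bootstrap computation described above can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not the code behind VarianceFeature: the function name `bootstrap_log_variance`, the `seed` argument, and the use of the population variance (`ddof=0`) are all choices made here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeRegressor


def bootstrap_log_variance(estimator, X, y, X_query,
                           n_estimators=10, eps=1e-12, seed=0):
    """Hypothetical helper: fit n_estimators bootstrap replicas of `estimator`
    and return log(Var_j f_j(x) + eps) for each query row."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    preds = np.empty((n_estimators, X_query.shape[0]))
    for j in range(n_estimators):
        idx = rng.integers(0, n, size=n)             # resample rows with replacement
        model = clone(estimator).fit(X[idx], y[idx])  # fresh, unfitted copy per replica
        preds[j] = model.predict(X_query)
    return np.log(preds.var(axis=0) + eps)           # variance across replicas


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=200)
X_query = rng.normal(size=(50, 4))

log_var = bootstrap_log_variance(
    DecisionTreeRegressor(max_depth=4, random_state=0), X, y, X_query
)
print(log_var.shape)
```

The `eps` term keeps the logarithm finite when all replicas agree exactly at a query point.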
ResidualMagnitude
At fit(X, y) stores training residuals \(|y - f(x)|\). At transform(X) returns
the mean residual magnitude among \(k\) nearest training neighbors — a local error prior
when \(y\) is unavailable at inference.
# y_train: training targets aligned row-for-row with X_train
ResidualMagnitude(
    estimator=RandomForestRegressor(),
    k=5,
).fit(X_train, y_train)
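To make the fit/transform split concrete, here is a minimal sketch of a kNN-smoothed residual feature. The class `KNNResidualSketch` is hypothetical, and it assumes the estimator is fitted inside `fit`; the actual ResidualMagnitude may handle the estimator differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors


class KNNResidualSketch:
    """Hypothetical stand-in: mean |y - f(x)| over the k nearest training rows."""

    def __init__(self, estimator, k=5):
        self.estimator = estimator
        self.k = k

    def fit(self, X, y):
        self.estimator.fit(X, y)  # assumption: the base model is fitted here
        self.abs_resid_ = np.abs(y - self.estimator.predict(X))
        self.nn_ = NearestNeighbors(n_neighbors=self.k).fit(X)
        return self

    def transform(self, X):
        _, idx = self.nn_.kneighbors(X)  # indices of the k nearest training rows
        return self.abs_resid_[idx].mean(axis=1, keepdims=True)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 3))
# Noise scale grows with |x_1|, so residuals vary across the input space.
y_train = X_train[:, 0] + rng.normal(scale=0.1 + np.abs(X_train[:, 1]), size=300)
X_test = rng.normal(size=(50, 3))

feat = KNNResidualSketch(LinearRegression(), k=5).fit(X_train, y_train).transform(X_test)
print(feat.shape)
```

Note that `transform` never needs \(y\): it only looks up stored training residuals. With a flexible base model, in-sample residuals can be optimistically small, which is one reason to pair such a feature with out-of-fold errors.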
FeaturePipeline
FeaturePipeline horizontally stacks named builders (FeatureUnion-style). Names appear
in get_feature_names_out().
from deup.core.features import FeaturePipeline, VarianceFeature, SeenBit
pipe = FeaturePipeline([
    ("var", VarianceFeature(method="ensemble")),
    ("seen", SeenBit()),
])
Torch-dependent methods
DensityFeature(method="flow") and VarianceFeature(method="gp") require
pip install "deup[torch]". Without torch, construction raises ImportError with an
install hint; the module still imports cleanly on a torch-free install.
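The behavior described above (clean module import, failure only at construction) is commonly achieved by deferring the optional import. The following is a generic sketch of that pattern, not the library's code: `require_optional` and `FlowDensitySketch` are hypothetical names, and the error message wording is illustrative.

```python
import importlib
import importlib.util


def require_optional(module_name, extra, package="deup"):
    """Hypothetical helper: import an optional dependency, or raise an
    ImportError that carries an install hint."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name!r} is required for this method. "
            f'Install it with: pip install "{package}[{extra}]"'
        )
    return importlib.import_module(module_name)


class FlowDensitySketch:
    """Hypothetical stand-in for a torch-backed builder. Defining the class
    needs no optional dependency; only constructing it does."""

    def __init__(self, backend="torch"):
        # The check runs at construction time, not at module import time.
        self._backend = require_optional(backend, extra="torch")


try:
    FlowDensitySketch(backend="no_such_backend_module_xyz")
except ImportError as exc:
    print(exc)
```

Because the check lives in `__init__`, importing the module that defines the class never touches the optional package.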
v0.1 vs v0.2
v0.1 (this release): feature builders + pipeline are available as primitives.
DEUPRegressor still trains \(g\) on raw \(X\) by default.
v0.2 (P6): ErrorEstimator wires FeaturePipeline into the DEUP training loop
with target transforms and non-negativity clipping.
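As a rough picture of what "target transforms and non-negativity clipping" can mean when training \(g\), here is a scikit-learn-only sketch. Everything in it is an assumption for illustration: the squared-error target, the `log1p`/`expm1` transform pair, and the use of raw \(X\) as the feature matrix. It does not describe how ErrorEstimator or DEUPRegressor is implemented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] + rng.normal(scale=0.1 + np.abs(X[:, 1]), size=300)

# Out-of-fold predictions of the main model f: each row is predicted by a
# model that never saw it, so the errors are not leaked into training.
f = RandomForestRegressor(n_estimators=50, random_state=0)
oof_pred = cross_val_predict(f, X, y, cv=5)
err = (y - oof_pred) ** 2  # assumed error target: squared error

# Placeholder feature matrix; stacked builder outputs could be used instead.
phi = X

# Train g on a log-transformed target, then invert the transform and clip at zero.
g = RandomForestRegressor(n_estimators=50, random_state=0)
g.fit(phi, np.log1p(err))
err_hat = np.clip(np.expm1(g.predict(phi)), 0.0, None)

print(err_hat.shape, float(err_hat.min()) >= 0.0)
```

The transform compresses the heavy right tail of squared errors, and the final clip guarantees that predicted errors are never negative after inversion.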