Getting started
Install
pip install deup
Optional extras:
pip install "deup[gbm]" # LightGBM tabular backend
pip install "deup[xgb]" # XGBoost tabular backend
pip install "deup[catboost]" # CatBoost tabular backend
pip install "deup[gbm-all]" # all gradient-boosting backends
pip install "deup[finance]" # CrossSectionalDEUP (pandas)
pip install "deup[docs]" # MkDocs site locally
Quickstart — tabular regression
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from deup import DEUPRegressor
X, y = fetch_california_housing(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = DEUPRegressor(
base_model=RandomForestRegressor(n_estimators=100, random_state=0),
cv=5,
random_state=0,
)
model.fit(X_tr, y_tr)
pred, unc = model.predict(X_te, return_uncertainty=True)
pred— point prediction from the base model (refit on all training data)unc— estimated epistemic uncertaintyg(x)(predicted out-of-sample error)
Higher unc means the model is likely to be wrong at that input. On California
housing (v0.1 benchmark), Spearman(unc, realized squared error) ≈ 0.51, beating
ensemble disagreement (0.46) and a conformal residual baseline (0.45).
Time-series / walk-forward
For ordered or panel data, pass a leakage-safe splitter:
from deup import DEUPRegressor
from deup.splitters import PurgedWalkForward
model = DEUPRegressor(
base_model=my_model,
cv=PurgedWalkForward(n_splits=5, embargo=5, min_train_size=20),
)
model.fit(X, y, groups=dates) # dates = cross-section label per row
pred, unc = model.predict(X_test, return_uncertainty=True)
PurgedWalkForward keeps each date's full cross-section together and drops an
embargo between train and test to prevent look-ahead leakage.
Cross-sectional ranking
Use :class:~deup.estimators.DEUPRanker — rank loss, walk-forward CV, and rank-geometry
residualization ON by default (Sanderink, 2026, Finding 3):
from deup import DEUPRanker
model = DEUPRanker(base_model=my_ranker, cv=5)
model.fit(X, y, groups=dates)
pred, unc = model.predict(X_test, return_uncertainty=True, groups=test_dates)
Set residualize_rank=False to disable decoupling (not recommended for rankers).
Classification
from deup import DEUPClassifier
from sklearn.ensemble import RandomForestClassifier
model = DEUPClassifier(base_model=RandomForestClassifier(), cv=5)
model.fit(X, y)
pred, unc = model.predict(X_test, return_uncertainty=True)
proba = model.predict_proba(X_test)
Default loss is logloss; use loss="brier" for Brier score targets.
Active learning — acquire
Select the k pool points with highest epistemic uncertainty (DEUP paper, Sec. 3.2):
idx = model.acquire(X_pool, k=10)
# or with uncertainty values:
idx, unc = model.acquire(X_pool, k=10, return_uncertainty=True)
For rankers pass groups= so residualization is correct on panel data.
Aleatoric decomposition
Subtract an aleatoric floor for a cleaner epistemic signal \(\hat{e}(x) = \max(0, g(x) - a(x))\):
from deup.core import Heteroscedastic
model = DEUPRegressor(
base_model=my_model,
aleatoric=Heteroscedastic(k=20),
)
model.fit(X, y)
unc = model.predict_epistemic(X) # max(0, g - a)
See Decomposition for details.
Prediction intervals (conformal)
Wrap the epistemic score in calibrated intervals with marginal coverage \(1-\alpha\):
model = DEUPRegressor(base_model=my_model).fit(X_train, y_train)
model.calibrate(X_cal, y_cal, alpha=0.1) # held-out split!
interval = model.predict_interval(X_test)
interval.lower, interval.upper
See Conformal calibration for methods (normalized, mondrian,
cqr) and MAPIE interop.
Aggregating g over contexts (read this first)
Averaging g into a context-level signal (mean(g) per day/batch/group) is only
trustworthy at high N with i.i.d. errors. Before relying on it, check:
from deup.diagnostics import should_trust_aggregate
verdict = should_trust_aggregate(g, groups)
print(verdict.trustworthy, verdict.reason)
For low-N / non-i.i.d. (time-series, finance), use the composite HealthIndex
instead. See When is aggregated DEUP reliable?.
Cross-sectional finance preset
For panel data (one row per date × asset), use the flagship preset:
from deup.domains.finance import CrossSectionalDEUP
model = CrossSectionalDEUP(horizon=20, cv=5, embargo=1).fit(panel_df)
model.calibrate(cal_df, alpha=0.1)
pred, unc = model.predict(test_df, return_uncertainty=True)
health = model.health_report(test_df)
See Domain presets for tabular and vision presets too.
Target stabilization
Heavy-tailed error targets are stabilized before training g:
# default: log(error + eps)
model = DEUPRegressor(target_transform="log")
# robust alternative for very heavy tails
model = DEUPRegressor(target_transform="asinh")
# raw errors (no transform)
model = DEUPRegressor(target_transform="none")
How it works (and the refit assumption)
DEUP fits two models:
- Base model
f— predictsyfromx(your existing model). - Error predictor
g— predicts how wrongfis atx.
The subtle part is generating honest training targets for g. If you trained g on
the residuals of an f that had already seen those rows, the residuals would be
optimistically small and g would systematically under-estimate uncertainty —
the canonical DEUP failure mode (Lahlou et al., 2023, Sec. 3.2). deup avoids this
with leakage-correct out-of-fold collection (OOFErrorCollector, the paper's
Algorithm 2): for each CV fold a fresh clone of f is fit on the train rows and used
to predict the held-out rows, so every error target is genuinely out-of-sample.
By default the collector then refits f on all data for deployment. So g is
trained on the errors of fold models f₋ₖ (each fit on a strict subset) but paired at
inference with the full-data f. This is the standard DEUP / stacking assumption:
g describes the error of a slightly smaller model. For reasonable fold counts the
gap is small; under walk-forward the fold models are legitimately smaller (expanding
window), which is the realistic operating regime. Pass refit_on_all=False (or a
pre-fit estimator) if you want to disable the final refit.
What v0.3 includes
Included: estimators, conformal calibration, aggregation-reliability diagnostics
(AggregationReliability, HealthIndex), domain presets (CrossSectionalDEUP,
TabularDEUP, VisionDEUP), benchmark suite, and
step-by-step tutorials.
Configure any use case via the five-axis guide.
Run the benchmark locally
git clone https://github.com/ursinasanderink/deup.git
cd deup
pip install -e ".[dev]"
python benchmarks/run_regression_benchmark.py
Results land in benchmarks/results/regression_benchmark.json.