
Benchmarks

Reproducible uncertainty-quality comparisons for deup. All scripts use seed=42 and write JSON tables under benchmarks/results/.

Quick run

pip install -e ".[dev,benchmark,gbm,finance]" pyarrow
python benchmarks/run_all.py

See the full write-up in BENCHMARKS.md in the repository root (tables are committed from the last benchmark run).
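Since the scripts write JSON tables under benchmarks/results/, a quick way to inspect a finished run is to load every file in that directory. The sketch below is only an illustration: the helper name load_result_tables is hypothetical (it is not part of deup), and because the file names and schema are not documented here, it reports only the top-level shape of each table.

```python
import json
from pathlib import Path


def load_result_tables(results_dir="benchmarks/results"):
    """Return {file stem: parsed JSON} for every *.json file in results_dir.

    Hypothetical helper, not part of deup. Makes no assumption about the schema.
    """
    root = Path(results_dir)
    if not root.is_dir():
        # Nothing has been generated yet (or the path is wrong).
        return {}
    tables = {}
    # sorted() keeps the iteration order stable across platforms.
    for path in sorted(root.glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            tables[path.stem] = json.load(fh)
    return tables


if __name__ == "__main__":
    for name, table in load_result_tables().items():
        size = len(table) if isinstance(table, (dict, list)) else 1
        print(f"{name}: {type(table).__name__} with {size} top-level entries")
```

Run it from the repository root after python benchmarks/run_all.py so that the relative path resolves.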


Tabular regression (California housing)

Method                     Spearman ρ   Notes
DEUP                       0.509        DEUPRegressor + RF
DEUP + LightGBM            0.444        TabularDEUP(backend="lgbm")
DEUP + XGBoost             0.400        TabularDEUP(backend="xgb")
DEUP + CatBoost            0.407        TabularDEUP(backend="catboost")
Ensemble disagreement      0.460        Bootstrap variance
Conformal residual         0.447        Cal-set |residual| model
Laplace (BayesianRidge)    0.015        Posterior variance
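For readers who want to sanity-check a ρ value by hand, here is a minimal sketch of a Spearman rank correlation. It assumes, as is common for uncertainty-quality metrics but not stated on this page, that ρ compares each method's predicted uncertainty with the realized absolute error on held-out points. The function names are illustrative and the numbers are toy values, not benchmark data. The implementation uses only the standard library; scipy.stats.spearmanr computes the same statistic.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(uncertainty, abs_error):
    """Spearman rho is the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(uncertainty), average_ranks(abs_error))


if __name__ == "__main__":
    # Toy numbers only (assumption: rho compares uncertainty with |error|).
    y_true = [2.0, 3.5, 1.0, 4.2, 2.8]
    y_pred = [2.1, 3.0, 1.8, 4.0, 1.5]
    predicted_uncertainty = [0.2, 0.4, 0.9, 0.1, 1.1]
    abs_error = [abs(t - p) for t, p in zip(y_true, y_pred)]
    print(f"Spearman rho = {spearman_rho(predicted_uncertainty, abs_error):.3f}")
```

Under this reading, a higher ρ means the method ranks hard examples above easy ones more reliably; the metric ignores the scale of the uncertainty estimates.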

N-sweep — aggregation reliability (headline)

Figure: N-sweep AUROC vs. context size.

i.i.d. contexts: AUROC(agg_g) rises to ≈0.96 at N≈1,000–10,000 (literature reference on CIFAR-10-C batches: ≈0.955).

Low-N autocorrelated contexts: AUROC(agg_g) ≈ 0.43, below the 0.5 chance level (cross-sectional finance reference: ≈0.55). HealthIndex recovers to AUROC ≈ 1.0 on the synthetic proxy (≈0.75 on a real finance holdout in the published evaluation).

Details: Aggregation reliability.
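The i.i.d. trend can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the benchmark code: it assumes agg_g behaves like the mean of per-sample scores over a context, draws those scores from unit-variance Gaussians with a small shift for degraded contexts, and computes AUROC by direct pairwise comparison. All names and parameters (shift, contexts_per_class) are hypothetical.

```python
import random


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def aggregated_score(rng, mean, n):
    """Mean of n i.i.d. per-sample scores (mean aggregation is an assumption)."""
    return sum(rng.gauss(mean, 1.0) for _ in range(n)) / n


def auroc_vs_context_size(sizes, shift=0.2, contexts_per_class=200, seed=42):
    """AUROC for separating degraded from healthy contexts at each context size."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        degraded = [aggregated_score(rng, shift, n) for _ in range(contexts_per_class)]
        healthy = [aggregated_score(rng, 0.0, n) for _ in range(contexts_per_class)]
        curve[n] = auroc(degraded, healthy)
    return curve


if __name__ == "__main__":
    for n, value in auroc_vs_context_size([1, 10, 100, 1000]).items():
        print(f"N={n:>5}: AUROC={value:.3f}")
```

Averaging shrinks the noise of the aggregate roughly as 1/√N only when samples are independent. Autocorrelated contexts have a smaller effective sample size, which is one plausible reason a plain aggregate struggles in the low-N autocorrelated setting.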


CIFAR & finance

  • CIFAR proxy: oracle agg_g reaches AUROC = 1.0 on a high-N i.i.d. batch simulation (literature reference: 0.955 on CIFAR-10-C)
  • Finance walk-forward: ρ(g, rank_loss)=0.25 DEV / 0.17 FINAL on a cross-sectional ranker panel; see benchmarks/run_finance_walkforward.py

Future work

Planned integrations: torchvision ResNet-18 → VisionDEUP, HuggingFace encoders, and PyTorch Lightning hooks. See the project roadmap.