Robust OpenML dataset selection: retries, resume preference, and binary validation by charlesmartin14 · Pull Request #39 · CalculatedContent/xgbwwdata

charlesmartin14 · 2026-03-22T00:29:03Z

Improve robustness of the random OpenML binary dataset selection so the notebook can resume prior choices, avoid repeatedly failing dataset loads, and ensure selected datasets are actually binary.

Change the default of REUSE_LAST_MODEL to True, and use resume_dataset_uid to optionally try the prior dataset id first unless a fresh start is forced.
Add MAX_DATASET_SELECTION_ATTEMPTS and build a shuffled candidate list capped at that many dataset ids, so selection cannot run long or loop indefinitely.
Introduce a selection loop that calls load_dataset for each candidate, catches loader exceptions, validates that y is binary, and records each failure. The loop selects the first valid candidate and raises a clear RuntimeError if none succeed.
Move configure_checkpoint_paths so that paths are set only after a valid selected_dataset_uid is found, and initialize X, y, and meta defensively before the selection loop.
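
For reviewers, the selection flow described above might look roughly like the sketch below. It is not the notebook's code: the function name select_binary_dataset, the loader signature (a callable returning X, y, meta for a dataset id), the value 5 for MAX_DATASET_SELECTION_ATTEMPTS, and the choice to try the resumed id in addition to the capped candidate list are all assumptions made for illustration.

```python
import random
from typing import Any, Callable, List, Optional, Sequence, Tuple

# Assumed value for illustration; the PR description does not state the number.
MAX_DATASET_SELECTION_ATTEMPTS = 5


def select_binary_dataset(
    candidate_uids: Sequence[str],
    load_dataset: Callable[[str], Tuple[Any, Any, Any]],  # hypothetical loader signature
    resume_dataset_uid: Optional[str] = None,
    max_attempts: int = MAX_DATASET_SELECTION_ATTEMPTS,
    seed: Optional[int] = None,
):
    """Return (uid, X, y, meta, failures) for the first candidate with a binary y."""
    rng = random.Random(seed)
    shuffled = list(candidate_uids)
    rng.shuffle(shuffled)
    # Cap the candidate list so the loop always terminates.
    candidates = shuffled[:max_attempts]

    # Assumption: the resumed id is tried first, on top of the capped list.
    if resume_dataset_uid is not None:
        candidates = [resume_dataset_uid] + [
            uid for uid in candidates if uid != resume_dataset_uid
        ]

    # Defensive initialization so the names exist even if every load fails.
    X = y = meta = None
    failures: List[Tuple[str, str]] = []

    for uid in candidates:
        try:
            X, y, meta = load_dataset(uid)
        except Exception as exc:  # loader errors must not abort the whole selection
            failures.append((uid, f"load failed: {exc!r}"))
            continue

        # Binary validation: exactly two distinct target values.
        n_classes = len(set(y))
        if n_classes != 2:
            failures.append((uid, f"expected 2 classes, found {n_classes}"))
            continue

        return uid, X, y, meta, failures

    raise RuntimeError(
        f"No valid binary dataset among {len(candidates)} candidates: {failures}"
    )
```

In this sketch, a step such as configure_checkpoint_paths would run only after the function returns a uid, which matches the ordering change described above.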

Retry OpenML selection until a valid binary dataset loads

1d3c803

charlesmartin14 added the codex label Mar 22, 2026 — with ChatGPT Codex Connector
