Robust OpenML dataset selection: retries, resume preference, and binary validation #39

Open
charlesmartin14 wants to merge 1 commit into main from codex/add-model-id-persistence-on-restart

@charlesmartin14
Member

Motivation

  • Make the random OpenML binary dataset selection more robust, so that the notebook can resume a prior choice, stop retrying datasets that fail to load, and confirm that the selected dataset is actually binary.

Description

  • Change the default of REUSE_LAST_MODEL to True, and use resume_dataset_uid to try the prior dataset id first unless a fresh start is forced.
  • Add MAX_DATASET_SELECTION_ATTEMPTS and build a shuffled candidate list capped at that many dataset ids, which avoids long or infinite selection loops.
  • Introduce a selection loop that calls load_dataset for each candidate, catches loader exceptions, validates that y is binary, and records each failure. It selects the first valid candidate, or raises a clear RuntimeError if none succeed.
  • Move the configure_checkpoint_paths call so that paths are set after a valid selected_dataset_uid is found, and initialize X, y, and meta defensively before the selection loop.
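
For reviewers who want a picture of the control flow described above, here is a minimal sketch. It is not the notebook's code: the function names `build_candidates` and `select_dataset`, the value 5 for MAX_DATASET_SELECTION_ATTEMPTS, the `(X, y, meta)` return shape of `load_dataset`, and the "exactly two distinct labels" test for binary targets are all assumptions made for illustration. The checkpoint path configuration is omitted.

```python
import random
from typing import Any, Callable, Iterable, List, Optional, Tuple

# Assumed value; the PR description does not state the actual limit.
MAX_DATASET_SELECTION_ATTEMPTS = 5


def build_candidates(
    dataset_ids: Iterable[int],
    resume_dataset_uid: Optional[int] = None,
    reuse_last_model: bool = True,
    max_attempts: int = MAX_DATASET_SELECTION_ATTEMPTS,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return a shuffled candidate list, capped at max_attempts ids."""
    rng = rng or random.Random()
    # Exclude the resume id from the shuffled pool so it is not tried twice.
    pool = [d for d in dataset_ids if d != resume_dataset_uid]
    rng.shuffle(pool)
    candidates: List[int] = []
    # Prefer the prior dataset id unless a fresh start is requested.
    if reuse_last_model and resume_dataset_uid is not None:
        candidates.append(resume_dataset_uid)
    candidates.extend(pool)
    return candidates[:max_attempts]


def select_dataset(
    candidates: List[int],
    load_dataset: Callable[[int], Tuple[Any, Any, Any]],
) -> Tuple[int, Any, Any, Any]:
    """Return the first candidate that loads and has a binary target."""
    failures: List[Tuple[int, str]] = []
    for uid in candidates:
        try:
            X, y, meta = load_dataset(uid)
        except Exception as exc:  # a loader failure must not stop the search
            failures.append((uid, f"load failed: {exc!r}"))
            continue
        n_classes = len(set(y))
        if n_classes != 2:
            failures.append((uid, f"expected 2 classes, found {n_classes}"))
            continue
        return uid, X, y, meta
    raise RuntimeError(
        f"No valid binary dataset among {len(candidates)} candidates: {failures}"
    )
```

Capping the candidate list up front, rather than counting attempts inside the loop, bounds the loop by construction. Collecting the failures into the RuntimeError message keeps the reason for each rejected dataset id visible when nothing succeeds.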

Testing

  • No automated tests were added or executed for this change.

Codex Task
