Skip to content

CLI redesign: three-mode positional parser (closed/open/whatif)#412

Open
FileSystemGuy wants to merge 54 commits into
mainfrom
FileSystemGuy-cli-refactor
Open

CLI redesign: three-mode positional parser (closed/open/whatif)#412
FileSystemGuy wants to merge 54 commits into
mainfrom
FileSystemGuy-cli-refactor

Conversation

@FileSystemGuy

@FileSystemGuy FileSystemGuy commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Refactors the mlpstorage CLI from a flag-gated --open/--closed design to a fully positional, three-mode hierarchy (closed|open|whatif). Resolves merge conflicts, eliminates is_closed throughout, and adds version and context-sensitive help support.

Phases:

  • Phase 1 — Core CLI Refactor: Three-branch positional parser (closed|open|whatif x benchmark x model x command x file|object); is_closed removed from all arg builders; 8 merge conflicts resolved
  • Phase 2 — Version Command: Fixed VERSION bug (wrong distribution name for importlib.metadata); added mlpstorage version subcommand with early-exit dispatch
  • Phase 3 — Help Behavior: --help_all full tree output; context-sensitive --help at each positional level; MLPStorageHelpFormatter (required flags first, positionals suppressed from usage line)
  • Phase 4 — Test Coverage: 47 new tests in test_parser_modes.py covering all three modes, open-gated arg exclusion, model/accelerator allow-lists, and version dispatch
  • Phase 5 — Run Config Summary: Pre-execution summary table of Tier 1 CLI args and env vars; --quiet flag to suppress; centralized S3 resolver with credential redaction; storage_config.py; run_summary.py
  • Fix — Duplicate log output: setup_logging() was not idempotent; multiple calls for the same logger name stacked handlers, causing each message to print twice. Guarded addHandler with if not _logger.handlers.
  • Fix — Closed-mode checkpointing crash: dlio_bin_path is open-gated for checkpointing so it was absent from the Namespace in closed mode; DlioBenchmark.__init__ accessed it without getattr. Added dlio_bin_path=None to set_defaults in _add_checkpointing_core_args.

Key changes

  • mlpstorage_py/cli_parser.py — rewritten with three top-level subparsers (closed, open, whatif)
  • mlpstorage_py/cli/ — all arg builders converted to mode-based three-tier pattern; is_closed parameter removed
  • mlpstorage_py/cli/checkpointing_args.pydlio_bin_path=None added to core set_defaults
  • mlpstorage_py/config.py — added MODELS_CLOSED, MODELS_OPEN, ACCELERATORS_CLOSED
  • mlpstorage_py/cli/help_formatter.py — new: HELP_ALL_TEXT + get_context_help_tokens() + MLPStorageHelpFormatter
  • mlpstorage_py/__init__.py — fixed _resolve_version() with correct distribution name + tomllib fallback
  • mlpstorage_py/main.py — version early-exit + help pre-scan guard; print_run_summary() call before benchmark loop
  • mlpstorage_py/storage_config.py — new: centralized S3 env var resolver with credential redaction
  • mlpstorage_py/run_summary.py — new: structured pre-execution config table
  • mlpstorage_py/mlps_logging.pysetup_logging() made idempotent (no duplicate handlers)
  • .gitignore.planning/ added; planning artifacts untracked

Test plan

  • pytest tests/unit/test_cli_parser.py tests/unit/test_cli.py -v — baseline CLI tests pass
  • pytest tests/unit/test_help_behavior.py -v — 45 help behavior tests pass
  • pytest tests/unit/test_version.py -v — version resolution tests pass
  • pytest tests/unit/test_parser_modes.py -v — 47 mode/arg/model/accelerator tests pass
  • pytest tests/unit/test_run_summary.py -v — 10 run summary tests pass
  • mlpstorage closed training unet3d run --help — shows closed-mode flags only (no --loops, --params)
  • mlpstorage open training unet3d run --help — shows open-gated flags
  • mlpstorage whatif training cosmoflow datasize --help — accepts whatif-only models
  • mlpstorage version — prints version string
  • mlpstorage --help_all — prints full command tree
  • mlpstorage closed checkpointing datasize -cm 512 -m llama3-70b -np 128 — prints run summary without crash
  • Pre-execution summary prints each config line exactly once (no duplicates)

@FileSystemGuy FileSystemGuy requested a review from a team June 8, 2026 16:34
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@dslik

dslik commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

What problem does adding whatif solve? How is this different from open?

@FileSystemGuy

Copy link
Copy Markdown
Contributor Author

In practice I think it only applies to training by allowing use of V100 and H100 GPUs. It's meant to be the "I'm only doing research and I want access to all the knobs".

There will be another PR after this which will add a printout of the effective parameters for this run. We have a 3-layer inheritance scheme right now: 1) the yaml files are the base default, 2) environment variables will override the corresponding yaml file field, and 3) a CLI argument will override the other two. As a result, the submitter needs to know what was actually in effect for this run. A specific case is if the user doesn't specify the S3 bucket but depends upon sourcing it from the environment variable; since it's not on the CLI we would have no other way of showing it to the user if we didn't do this printout.

@dslik

dslik commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

+1 Adding the printout of the effective parameters will be very helpful.

@FileSystemGuy FileSystemGuy marked this pull request as draft June 9, 2026 19:38
Plan 01-01 (Wave 1): resolve common_args.py SyntaxError blocker + config.py constants
Plan 01-02 (Wave 2): training_args.py + checkpointing_args.py three-tier builder
Plan 01-03 (Wave 2): vectordb_args.py + kvcache_args.py three-tier builder (parallel)
Plan 01-04 (Wave 2): utility_args.py + lockfile_args.py signature simplification (parallel)
Plan 01-05 (Wave 3): cli_parser.py three-branch rewrite + main.py/base.py dispatch + test updates
- MODELS_CLOSED changed from [DLRM, RETINANET, FLUX] to [UNET, RETINANET]
- MODELS_OPEN = [UNET, RETINANET] added as new constant
- MODELS (whatif list) left unchanged
…natures, positional storage type

- Remove both conflict blocks (lines 211-264 and 357-400)
- add_universal_arguments signature: (parser, req_results) — drop req_fileobj, offer_object, is_closed
- add_host_arguments signature: (parser, required=False) — drop is_closed
- add_mpi_arguments signature: (parser) — drop is_closed
- add_dlio_arguments signature: (parser) — drop is_closed; --params moves to open builders
- add_timeseries_arguments signature: (parser) — drop is_closed; always adds all three args
- add_storage_type_arguments: new function registering data_access_protocol as positional
- --dry-run replaces --what-if
- --loops and --allow-invalid-params removed (move to per-benchmark open builders)
- No conflict markers, no is_closed anywhere in file
…closed

- add_reports_arguments(parser, is_closed) → add_reports_arguments(parser)
- add_history_arguments(parser, is_closed) → add_history_arguments(parser)
- add_lockfile_arguments(parser, is_closed) → add_lockfile_arguments(parser)
- All add_universal_arguments calls updated from 5-arg to 2-arg signature:
  add_universal_arguments(p, True, True, True, is_closed) → add_universal_arguments(p, req_results=True)
…ode param

- Replace add_training_arguments(parser, is_closed) with add_training_arguments(parser, mode)
- Add model as positional BEFORE subparsers (consumed before command token)
- Three-tier helpers: _add_training_core_args, _add_training_open_args, _add_training_whatif_args
- Storage type positional (file|object) on datagen/run/configview; absent on datasize
- Closed mode: MODELS_CLOSED, ACCELERATORS_CLOSED; open: MODELS_OPEN; whatif: MODELS/ACCELERATORS
- Resolve conflict block (lines 143-154) — dissolves in rewrite
- No is_closed references; no conflict markers
- add_vectordb_arguments(parser, mode) replaces is_closed boolean
- --mode renamed to --benchmark-mode (dest=benchmark_mode) on run subparser
- Three-tier builder: _add_vectordb_core_args + _add_vectordb_open_args
- add_storage_type_arguments called with required=True for datagen/run only
- Conflict block (lines 203-213) deleted; replaced with per-parser calls
- vectordbbench.py: what_if -> dry_run; args.mode -> args.benchmark_mode
…lic API

- Add add_timeseries_arguments to import from common_args and __all__
- add_storage_type_arguments already imported; add to __all__
- Add validate_training/checkpointing/vectordb/kvcache_arguments imports
- Expand __all__ to include all validate_* and add_storage/timeseries functions
checkpointing_args.py:
- Replace add_checkpointing_arguments(parser, is_closed) with add_checkpointing_arguments(parser, mode)
- Add configview subcommand alongside datasize and run
- Three-tier helpers: _add_checkpointing_core_args, _add_checkpointing_open_args
- Storage type positional on run/configview; absent on datasize
- --checkpoint-folder required on run only; --model remains a flag
- Resolve conflict block (lines 108-121) — dissolves in rewrite
- No is_closed references; no conflict markers

dlio.py:
- Rename args.what_if to args.dry_run (getattr pattern preserved)
- Add _run_configview() method to CheckpointingBenchmark — prints DLIO command to stdout
- Add configview dispatch in CheckpointingBenchmark._run()
- add_kvcache_arguments(parser, mode) replaces is_closed boolean
- All 3 conflict blocks resolved: block 1 (three-tier pattern), block 2
  (--seed moves to mlperf_args, duplicate kvcache-bin-path dropped),
  block 3 (_add_kvcache_mlperf_arguments extracted verbatim from origin/main)
- New _add_kvcache_open_args: duration, generation-mode, performance-profile,
  loops, allow-invalid-params (open/whatif only; argparse rejects in closed)
- New _add_kvcache_mlperf_arguments: seed, trials, inter-option-delay, config
- add_storage_type_arguments never called (kvcache architectural constraint)
- Closed branch: model positional absent; open/whatif: model positional present
- kvcache.py: args.closed -> getattr(args, 'mode', None) == 'closed'
- kvcache.py: what_if -> dry_run throughout (3 locations)
…ture

- Replace sys.argv.pop() hack with proper argparse subparser tree (dest='mode')
- Add _build_mode_branch() to create benchmark subparsers under each mode
- Three mode branches: closed, open, whatif — each with training/checkpointing/vectordb/kvcache
- Utility commands (reports, history, lockfile) as top-level siblings of mode subparsers
- validate_args() dispatches on args.benchmark instead of args.program
- Remove post-parse --file/--object consolidation block (data_access_protocol is now positional)
- No sys.argv.pop(), no is_open/is_closed booleans, no args.program references remain
… arg structure

- main.py: replace all args.program references with args.mode (utilities) or args.benchmark (benchmarks)
- base.py: replace args.closed/args.open with mode string comparison; args.what_if -> args.dry_run
- test_cli_parser.py: rewrite to use new positional structure (mode/benchmark/model positionals)
- test_cli.py: update add_universal_arguments to 2-arg; add_training_arguments to (parser, mode);
  --what-if -> --dry-run; add --results-dir/file positionals where required; validate_args uses args.benchmark
- Rule 1 bug fix: --params action='append' with set_defaults(params='') incompatibility fixed by
  adding default=None to --params add_argument in training_args, checkpointing_args, vectordb_args
- test_cli_vectordb.py: fix add_vectordb_arguments missing mode arg (pre-existing break from 01-03)
…resolve_version()

- Replace bare try/except VERSION assignment with _resolve_version() function
- Use correct distribution name "mlpstorage" (not "mlpstorage_py")
- Add tomllib fallback to parse pyproject.toml for source-checkout usage
- Export _resolve_version() for direct testing and downstream use
- test_version_matches_pyproject: asserts VERSION equals pyproject.toml declared version
- test_version_lookup_uses_correct_distribution_name: proves "mlpstorage" dist name resolves
- test_version_fallback_reads_pyproject: monkeypatches metadata to verify tomllib fallback path
…rom cli/__init__

- Append add_version_arguments(parser) with pass body to utility_args.py after add_history_arguments()
- Extend utility_args import in cli/__init__.py to include add_version_arguments
- Add 'add_version_arguments' to __all__ in cli/__init__.py
…ory dispatch in main.py

- Add add_version_arguments to import block in cli_parser.py
- Register version_parser as top-level sibling of closed/open/whatif/reports/history/lockfile
- Add validate_args() early return guard for mode == 'version'
- Add early-exit dispatch in _main_impl() after parse_arguments() and before args.debug check,
  ensuring version invocations are never recorded by hist.add_entry() (HistoryTracker bypass)
…_help_tokens()

- HELP_ALL_TEXT constant contains verbatim full command reference from help_all_spec.md
- get_context_help_tokens() performs static positional-token lookup returning "next: X | Y"
- History subcommands correctly use 'show' and 'rerun' (not list/replay)
- kvcache returns None at run/datasize level (no file|object positional)
- version returns None (leaf — falls through to argparse)
- Both symbols exported in mlpstorage_py/cli/__init__.py __all__
…1/02/03

- HELP-01: --help_all intercept prints HELP_ALL_TEXT and exits 0 before parser construction
- HELP-02/03: unconditional get_context_help_tokens() call on positional-only argv
- R-03-01 fix: bare mid-tree invocations (e.g., 'mlpstorage closed training') now
  show "next: unet3d | retinanet" without requiring --help flag
- Strips -h/--help flags before positional lookup; strips all leading-dash tokens
  so interspersed flags like '-cm 64' don't interfere with path identification
- Leaf paths return None from get_context_help_tokens() -> fall through to argparse
- TestHelpAll (6 tests): --help_all exits 0, banner/placeholder/SYNOPSIS/KV content,
  direct HELP_ALL_TEXT placeholder check guards against drift (R-03-02)
- TestContextHelp (29 parametrized cases): bare, --help, -h at root; mode level;
  bare mid-tree (R-03-01); model level; command level; checkpointing; vectordb;
  kvcache; utility commands
- TestBareInvocation (2 tests): exits 0, prints "next: closed | open | whatif"
- TestLeafHelp (5 tests): datasize, datagen+file, ckpt datasize, kvcache run,
  version --help — all fall through to argparse (no "next:" prefix, has "--")
- TestRegression (3 tests): training datasize, checkpointing run+file, version
  return parsed args without SystemExit (R-03-04)
… classification

SYNOPSIS now uses <> for required positionals (closed|open|whatif and file|object).
Move --data-dir from Optional to Required in all four training placeholder
definitions (TR_DATASIZE, TR_DATAGEN, TR_RUN, TR_CONFIGVIEW) — found during UAT.
… positional in usage

Custom HelpFormatter applied to all leaf parsers via recursive tree walk:
- Required options listed before optional within each group, both sorted alphabetically
- file|object positional excluded from the usage line (it's already in the command path)
  and correctly shown only in the positional arguments section
Usage line now sorts required options before optional (both α).
Positionals (file|object) suppressed from both usage line and positional-
arguments section — consistent with closed/training/run which appear in
_prog and nowhere else.
…-01 through TEST-04)

- TestParserModeStructure: whatif mode attr, all three modes set benchmark, all benchmarks reachable in closed
- TestOpenGatedArgExclusion: 7 rejection tests (closed rejects --loops/--params/--timeseries-interval/--allow-invalid-params), 4 acceptance tests (open/whatif), help-text hiding assertion
- TestModelAcceleratorRestrictions: closed model allow-list (unet3d/retinanet), whatif accepts all 6 MODELS, accelerator allow-list (ACCELERATORS_CLOSED), kvcache positional grammar
- TestVersionDispatch: version subcommand sets mode='version', VERSION non-empty, dispatch does not raise
- 47 tests, 0 failures
FileSystemGuy and others added 22 commits June 9, 2026 13:09
Add Phase 5 to ROADMAP.md with 7 requirements (RUNSUM-01 through RUNSUM-07).
Create phase directory with RESEARCH.md, VALIDATION.md, PATTERNS.md, and 2
PLAN.md files. Wave 1 (05-01) centralizes S3 env var resolution into a new
storage_config.py and refactors 6 scattered readers/writers. Wave 2 (05-02)
implements print_run_summary() and wires it into main.py with --quiet support.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two targeted changes from Gemini review consensus:
- 05-01: add second NOTE to storage_config.py docstring requirement
  clarifying resolver is display-only; raw SDK auth reads stay in
  minio_reader.py and minio_writer.py as direct os.environ.get() calls
- 05-02: widen _WIDTH constant from 28 to 32 to accommodate long
  S3 endpoint URIs and results-dir paths without messy wrapping

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Create mlpstorage_py/storage_config.py with resolve_object_storage_config()
- Implement _redact() helper — credentials always returned as "[SET — N chars]" or "[not set]"
- Implement _resolve_endpoint() with 5-link priority chain (S3_ENDPOINT_URIS → S3_ENDPOINT_TEMPLATE → S3_ENDPOINT_FILE → AWS_ENDPOINT_URL → S3_ENDPOINT)
- Add tests/unit/test_storage_config.py covering TestCredentialRedaction, TestEndpointResolution, TestCentralizedResolver (8 tests, all pass)
… in 6 storage files

- mlpstorage_py/benchmarks/dlio.py: replace 4 os.environ.get calls (BUCKET, STORAGE_LIBRARY, AWS_ENDPOINT_URL, STORAGE_URI_SCHEME) with resolve_object_storage_config()
- mlpstorage_py/checkpointing/storage_readers/minio_reader.py: simplify _detect_endpoint() to delegate to resolver; use config for endpoint, aws_ca_bundle, aws_region
- mlpstorage_py/checkpointing/storage_readers/s3torch_reader.py: same pattern as minio_reader
- mlpstorage_py/checkpointing/storage_writers/minio_writer.py: simplify _detect_and_select_endpoint() to delegate to resolver; raw AWS_ACCESS_KEY_ID/SECRET remain for SDK auth
- mlpstorage_py/checkpointing/storage_writers/s3dlio_writer.py: replace _detect_multi_endpoint_config() body; replace 2x S3_LOAD_BALANCE_STRATEGY reads; PRESERVE os.environ write at line 160
- mlpstorage_py/checkpointing/storage_writers/s3torch_writer.py: simplify _detect_and_select_endpoint() and __init__ env reads via resolver
- New module mlpstorage_py/run_summary.py exports print_run_summary(args)
- Formats Tier-1 CLI args + env section; S3 section only for object protocol
- Credentials pre-redacted via resolve_object_storage_config(); never plain text
- --quiet guard: getattr(args, 'quiet', False) suppresses all output
- _WIDTH = 32 label column matches ban_boto3.py convention
- 10 unit tests covering TDD RED/GREEN cycle (all pass)
…on_args

- main.py: lazy import + print_run_summary(args) call after update_args(),
  before the benchmark for-loop; quiet guard uses getattr(args,'quiet',False)
- common_args.py: --quiet (store_true) added to Output Control group after
  --stream-log-level; no set_defaults entry needed
- 05-02-SUMMARY.md: documents print_run_summary() implementation, TDD cycle,
  test results, acceptance criteria, and self-check
- ROADMAP.md: updated with plan 02 summary count
…e path

When S3_ENDPOINT_FILE wins the priority chain, open the file and return
the first non-empty non-comment line as the endpoint URI.  If the file
is unreadable, fall through to the next priority link.  This prevents
minio_reader/minio_writer from attempting to connect to the file path
string as a hostname.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Line 195 was an unreachable copy of line 194 — paste/merge artifact.
Remove the dead second return statement.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace 3 separate resolve_object_storage_config() calls with a single
_s3cfg = resolve_object_storage_config() at the top of __init__, then
read endpoint, ca_bundle and region from the cached result.  This
eliminates redundant env-var scans and ensures all three values are
read from a consistent snapshot of the environment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…point in s3dlio_writer

The old name implied a read-only query; the method mutates os.environ
when selecting an MPI rank endpoint.  Rename to _configure_multi_endpoint
at definition and all call sites to signal the side-effectful nature.
Add an explicit comment above the os.environ write explaining the intent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…i_endpoint

When S3_ENDPOINT_FILE is set but the file does not exist, emit a
print() warning before returning None so the operator knows the
expected multi-endpoint file is missing and the benchmark is falling
back to single-endpoint mode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t-set

WR-05: Add test_s3_endpoint_template_wins_over_aws_endpoint_url,
test_s3_endpoint_file_wins_over_aws_endpoint_url (verifies the URI
*inside* the file is returned, not the file path — covering CR-01's
fix), and test_s3_endpoint_fallback_last_resort.

WR-06: Add test_secret_key_not_set to TestCredentialRedaction,
mirroring the existing test_access_key_not_set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mary.py

The comment 'matches ban_boto3.py convention' referenced a file that is
not imported and has no enforcement mechanism.  Replace with the
self-contained 'Label column width'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… section

Mark status: fixes_applied in frontmatter and append a summary of all
fixed findings (CR-01, CR-02, WR-01 through WR-06, IN-04) with commit
hashes, plus the list of intentionally skipped items (WR-03, IN-01,
IN-02, IN-03).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ride chain

- test_cli.py: closed training namespace has loops/params/allow_invalid_params
  defaults; checkpointing open mode exposes --loops/--params/--num-checkpoints-read;
  new TestAddCheckpointingArgumentsClosed — rejects --loops/--ncr/--ncw, verifies
  all five set_defaults values
- test_cli_kvcache.py: new TestKVCacheClosedMode — verifies gpu_mem_gb/cpu_mem_gb/
  loops/params/allow_invalid_params/seed/trials/inter_option_delay defaults; rejects
  --gpu-mem-gb and --cpu-mem-gb in closed mode
- test_cli_vectordb.py: new TestVectorDBClosedMode — verifies namespace defaults and
  index-type restriction in closed mode
- test_dlio_object_storage.py: new TestProcessDlioParams — YAML-only, CLI-beats-YAML,
  surgical override, multiple params, falsy-params cases; new
  TestProcessDlioParamsFullChain — env injection, CLI-beats-env, YAML survival,
  file-protocol no-op
- kvcache.py: _interruptible_sleep and _execute_run checked dry_run
  instead of what_if; what-if mode never skipped sleep or aggregation
- test_benchmarks_kvcache.py: TestClosedEnforcement set args.closed=True
  but _execute_run checks args.mode=='closed'; change to args.mode
- test_dlio_object_storage.py: stub pyarrow and dotenv at module level
  so the file can be collected and run without the full ML stack installed
- test_validation_helpers.py: SSH tests with remote hosts also trigger
  MPI check which fails in CI (no mpirun); mock check_mpi_with_hints
- dlrm_mi355.yaml: num_samples_per_file was 4718592, mismatching
  dlrm_datagen.yaml (1536000); datagen and run config must agree

1083 passed, 0 failed
@FileSystemGuy FileSystemGuy force-pushed the FileSystemGuy-cli-refactor branch from 0a5be4e to 2253179 Compare June 9, 2026 20:11
setup_logging() was not idempotent — every call for the same logger name
added another StreamHandler to the shared logger instance, causing each
message to print once per handler. Guard the addHandler call so it only
runs when the logger has no handlers yet.
…space is complete

In closed mode, _add_checkpointing_open_args is not called, so --dlio-bin-path
is never registered and the attribute is absent from the Namespace. DlioBenchmark
.__init__ does a bare args.dlio_bin_path access (no getattr), crashing on any
closed-mode checkpointing command. Adding it to set_defaults follows the same
pattern as loops, params, and num_checkpoints_read/write.
@FileSystemGuy FileSystemGuy marked this pull request as ready for review June 9, 2026 20:56
@FileSystemGuy

Copy link
Copy Markdown
Contributor Author

Please review: @dslik @idevasena

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants