Skip to content

fix(gpu): select single CDI GPU defaults#1675

Open
elezar wants to merge 1 commit into
mainfrom
fix/1477-single-cdi-gpu-defaults-elezar
Open

fix(gpu): select single CDI GPU defaults#1675
elezar wants to merge 1 commit into
mainfrom
fix/1477-single-cdi-gpu-defaults-elezar

Conversation

@elezar

@elezar elezar commented Jun 2, 2026

Copy link
Copy Markdown
Member

Summary

Updates Docker and Podman GPU handling so a bare GPU request selects one concrete NVIDIA CDI device when possible.

Default nvidia.com/gpu=all fallback is allowed only for WSL2 all-only compatibility. Explicit GPU device requests pass through unchanged, including nvidia.com/gpu=all.

Related Issue

Closes #1477

Changes

  • Adds shared CDI inventory and round-robin default selection helpers.
  • Docker uses daemon CDI inventory from DiscoveredDevices and Docker /info to allow WSL2 all-only fallback.
  • Podman builds local CDI inventory from /dev/nvidiaN device nodes and uses /dev/dxg for WSL2 all-only fallback.
  • Keeps explicit CDI GPU device requests unchanged.
  • Updates GPU docs and optional GPU e2e expectations.

Testing

  • mise exec -- cargo fmt --check
  • mise exec -- cargo check -p openshell-core -p openshell-driver-docker -p openshell-driver-podman --tests
  • mise exec -- cargo test -p openshell-core -p openshell-driver-docker -p openshell-driver-podman gpu --tests
  • mise exec -- cargo test -p openshell-driver-docker -p openshell-driver-podman wsl2 --tests
  • mise exec -- cargo test -p openshell-core -p openshell-driver-podman all_only --tests
  • mise run pre-commit

Checklist

  • Follows Conventional Commits
  • Documentation updated

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

@elezar elezar marked this pull request as draft June 3, 2026 13:55
@copy-pr-bot

copy-pr-bot Bot commented Jun 3, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@elezar elezar changed the base branch from main to feat/gpu-request-spec June 3, 2026 13:58
@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from de229a4 to 7fa3cc6 Compare June 3, 2026 14:03
@elezar elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 4c18100 to cec0e21 Compare June 4, 2026 07:54
@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from 7fa3cc6 to 16fe7a2 Compare June 4, 2026 14:47
@copy-pr-bot

copy-pr-bot Bot commented Jun 4, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@elezar elezar changed the base branch from feat/gpu-request-spec to main June 4, 2026 14:47
@elezar elezar marked this pull request as ready for review June 4, 2026 14:51
@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from 16fe7a2 to 4cedfec Compare June 4, 2026 15:49
Kubernetes mirrors each limit into the matching request. VM accepts the fields
but currently ignores them.

GPU requests enter the driver layer through `SandboxSpec.gpu` and

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this will change in #1156

@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from 4cedfec to bbdc514 Compare June 5, 2026 07:11
@elezar elezar added gator:in-review Gator is reviewing or awaiting PR review feedback test:e2e-gpu Requires GPU end-to-end coverage labels Jun 10, 2026
@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

PR Review Status

Validation: this is maintainer-authored, project-valid GPU runtime work that addresses #1477 by making default Docker/Podman GPU requests select a concrete CDI device when possible.
Head SHA: bbdc51489c70995147bcae4298d82fe522c70d0e

Review findings:

  • Blocking: crates/openshell-driver-podman/src/container.rs appears to leave build_container_spec() calling build_container_spec_with_token() while the callee is gated behind #[cfg(test)], which would break non-test Podman driver compilation. Please either ungate build_container_spec_with_token() or gate the caller if it is test-only.
  • Warning: Docker WSL2 detection for permitting default nvidia.com/gpu=all is broad because it considers daemon name and arbitrary labels containing wsl. Please restrict the fallback to stronger OS/kernel indicators so a bare GPU request does not unexpectedly grant all GPUs.
  • Suggested coverage: add a driver-level test that two default GPU creates consume different devices from a multi-GPU inventory.

Docs: Fern docs were updated on the existing sandbox management page; no navigation change appears needed.
E2E: test:e2e-gpu is being applied because this changes GPU runtime/CDI behavior.

Next state: gator:in-review

@github-actions

Copy link
Copy Markdown

Label test:e2e-gpu applied for bbdc514. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute GPU E2E after building the required supervisor image once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from bbdc514 to 2d73188 Compare June 10, 2026 18:06
@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head 2d731882a278cb3f399dccde5b0dd54bcbd55f76 after the new commit fix(gpu): prefer single CDI devices for local runtimes.

Disposition: partially resolved.

Remaining items:

  • Resolved: the prior Podman #[cfg(test)] compile concern is addressed. Non-test creation now calls build_container_spec_with_token_and_gpu_default, while the old wrappers are test-only.
  • Resolved: production Docker WSL2 fallback detection is narrowed and no longer trusts daemon name or arbitrary labels.
  • Remaining warning: e2e/rust/tests/gpu_device_selection.rs still mirrors the older broad WSL2 heuristic in test expectations, including OSType, daemon Name, or Labels. Please mirror the production predicate so E2E expectations cannot disagree with the runtime behavior.
  • Remaining suggested coverage: add equivalent Podman driver-level two-create round-robin coverage using for_tests_with_gpu_inventory, matching the Docker coverage added in this head.

Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, and OpenShell / GPU E2E are green for the current head.

Next state: gator:in-review

@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Blocked

Gator is blocked by merge conflicts: GitHub reports mergeStateStatus=DIRTY for head 2d731882a278cb3f399dccde5b0dd54bcbd55f76, so the PR cannot be merged even though the required CI gates are green.

Next action: @elezar update the branch against current main and resolve the conflicts. The existing review-loop items from the latest gator re-check also remain open after the conflict is resolved:

  • Mirror the production Docker WSL2 fallback predicate in e2e/rust/tests/gpu_device_selection.rs.
  • Add equivalent Podman driver-level two-create round-robin coverage using for_tests_with_gpu_inventory.

Next state: gator:blocked

@elezar elezar added gator:blocked Gator is blocked by process or repository gates and removed gator:in-review Gator is reviewing or awaiting PR review feedback labels Jun 10, 2026
@elezar elezar added gator:watch-pipeline Gator is monitoring PR CI/CD status gator:approval-needed Gator completed review; maintainer approval needed and removed gator:watch-pipeline Gator is monitoring PR CI/CD status labels Jun 10, 2026
@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete for latest head d1f54107564e2b4ada77929f0332bfc8c9a722d3.

Validation: maintainer-authored GPU runtime work that addresses #1477 by making default Docker/Podman GPU requests select a concrete CDI device when possible.
Review: I re-checked after the branch update; the prior merge-conflict blocker and review-loop items are resolved. No blocking review items remain. Nonblocking residual note: the implementation fails bare --gpu when Docker CDI inventory is empty or missing unless WSL2 all-only fallback is detected; if maintainers still want a nvidia.com/gpu=0 fallback for unavailable inventory, that should be handled explicitly before or after merge.
Docs: Fern docs were updated on the existing sandbox management page; no navigation change appears needed.
Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, and OpenShell / GPU E2E are green for the current head.
E2E: test:e2e-gpu is applied and the GPU E2E required gate is green.

Human maintainer approval or merge decision is now required.

@elezar

elezar commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Blocked

Gator is blocked by merge conflicts: GitHub reports mergeStateStatus=DIRTY for head d1f54107564e2b4ada77929f0332bfc8c9a722d3, so this PR is no longer merge-ready despite the prior approval-needed handoff. The required CI gates are green, but the branch must be made mergeable before maintainer approval/merge can proceed.

Next action: @elezar update the branch against current main and resolve the conflicts. After the branch is mergeable again, gator should re-check the required gates and return this to maintainer approval if nothing else changed.

Next state: gator:blocked

@elezar elezar added gator:blocked Gator is blocked by process or repository gates and removed gator:approval-needed Gator completed review; maintainer approval needed labels Jun 11, 2026
@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from d1f5410 to c019321 Compare June 11, 2026 12:06
@elezar

elezar commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head c01932108eaecee93574010e9b31c89c49d37f40 after the branch update that followed the prior merge-conflict blocker.

Disposition: resolved for the merge-conflict blocker; pipeline monitoring is active again.

Remaining items:

  • GitHub REST now reports mergeable=true and mergeable_state=blocked, so the prior mergeStateStatus=DIRTY conflict blocker is resolved. The remaining block is normal required review/check state.
  • The independent bounded re-review found no new blocking issues in this head. It confirmed the Docker WSL2 all-GPU fallback predicate remains mirrored between production and E2E, and the Podman two-create round-robin coverage remains present.
  • Nonblocking residual note remains unchanged: bare Docker --gpu still fails when Docker CDI inventory is empty or missing unless WSL2 all-only fallback is detected. If maintainers still want a nvidia.com/gpu=0 fallback for unavailable inventory, that should be handled explicitly before or after merge.
  • OpenShell / Branch Checks and OpenShell / GPU E2E are pending for the current head; test:e2e-gpu remains applied.

Next state: gator:watch-pipeline

@elezar elezar added gator:watch-pipeline Gator is monitoring PR CI/CD status and removed gator:blocked Gator is blocked by process or repository gates labels Jun 11, 2026
@elezar

elezar commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete for latest head c01932108eaecee93574010e9b31c89c49d37f40.

Validation: maintainer-authored GPU runtime work that addresses #1477 by making default Docker/Podman GPU requests select a concrete CDI device when possible.
Review: I re-checked after the latest branch update; the prior merge-conflict blocker and review-loop items remain resolved. No blocking review items remain. Nonblocking residual note: bare Docker --gpu still fails when Docker CDI inventory is empty or missing unless WSL2 all-only fallback is detected; if maintainers still want a nvidia.com/gpu=0 fallback for unavailable inventory, that should be handled explicitly before or after merge.
Docs: Fern docs were updated on the existing sandbox management page; no navigation change appears needed.
Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, and OpenShell / GPU E2E are green for the current head.
E2E: test:e2e-gpu is applied and the GPU E2E required gate is green.

Human maintainer approval or merge decision is now required.

@elezar elezar added gator:approval-needed Gator completed review; maintainer approval needed and removed gator:watch-pipeline Gator is monitoring PR CI/CD status labels Jun 11, 2026
@elezar

elezar commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Blocked

Gator is blocked by merge conflicts: GitHub reports mergeStateStatus=DIRTY for head c01932108eaecee93574010e9b31c89c49d37f40, so this PR is no longer merge-ready despite the prior approval handoff.

Next action: @elezar, please update the branch against current main and resolve the conflicts. Gator will re-check the latest head after the branch is updated.

@elezar elezar added gator:blocked Gator is blocked by process or repository gates and removed gator:approval-needed Gator completed review; maintainer approval needed labels Jun 11, 2026
@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from c019321 to 29c3702 Compare June 12, 2026 07:14
@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head 29c3702310a16cb0c959d9418d69c6a4b7d5944b after the 2026-06-12 branch update.

Disposition: partially resolved.

Remaining items:

  • Resolved: the prior merge-conflict blocker is cleared. GitHub REST reports mergeable=true and mergeable_state=blocked, so this is no longer blocked by mergeStateStatus=DIRTY.
  • Blocking: OpenShell / Branch Checks is failing for this head. The failed Rust lint jobs on both linux-amd64 and linux-arm64 report crates/openshell-server/src/multiplex.rs errors against tonic 0.14 (tonic::body::BoxBody is private and tonic::body::empty_body is not found). That file is outside this PR's changed GPU/runtime diff, but the required Branch Checks gate must be green before gator can return this to approval.
  • Pending: OpenShell / GPU E2E is still running for this head.
  • The previous nonblocking residual note remains unchanged: bare Docker --gpu still fails when Docker CDI inventory is empty or missing unless WSL2 all-only fallback is detected. If maintainers still want a nvidia.com/gpu=0 fallback for unavailable inventory, that should be handled explicitly before or after merge.

Next action: @elezar or a maintainer should resolve the Branch Checks failure, or update this branch after the base/main compatibility issue is fixed, then gator can re-check the required gates and review state.

Next state: gator:blocked

Prefer a single CDI-qualified device when Docker or Podman resolves the default GPU request to one GPU.

Allow nvidia.com/gpu=all only as a WSL2 all-only compatibility fallback, using Docker daemon info and Podman's /dev/dxg probe to identify that case.

Update driver docs, architecture notes, and GPU e2e coverage for the default selection behavior.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Blocked

Gator is still blocked for latest head 29c3702310a16cb0c959d9418d69c6a4b7d5944b.

Current blockers:

  • OpenShell / Branch Checks is failing in Rust lint on both linux-amd64 and linux-arm64 with tonic 0.14 API errors in openshell-server/src/multiplex.rs (tonic::body::BoxBody is private and tonic::body::empty_body is not found). This matches the Branch Checks blocker reported earlier this cycle.
  • OpenShell / GPU E2E has now also failed for this head. The failing jobs are the Docker GPU linux-amd64, linux-arm64, and wsl-amd64 lanes plus the aggregate GPU E2E result. I attempted to inspect the failed logs, but GitHub returned Forbidden while fetching the action log blob, so gator cannot classify that GPU failure root cause in this cycle.

Next action: @elezar or a maintainer should resolve the required gate failures, or update this branch after the upstream/base compatibility issue is fixed, then gator can re-check the latest head.

Next state: gator:blocked

@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from 29c3702 to 7872e32 Compare June 12, 2026 07:56
@elezar elezar added gator:watch-pipeline Gator is monitoring PR CI/CD status and removed gator:blocked Gator is blocked by process or repository gates labels Jun 12, 2026
@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head 7872e32a1546ef3275eb22faeedc5c1288180b56 after the 2026-06-12 branch update.

Disposition: resolved for the prior process blocker; pipeline monitoring is active again.

Remaining items:

  • The prior required-gate blocker is cleared. OpenShell / Branch Checks, OpenShell / Helm Lint, and DCO are passing for this head, and GitHub REST reports mergeable=true with mergeable_state=blocked, which is normal while required review/check state is still outstanding.
  • OpenShell / GPU E2E is still pending/queued for this head; test:e2e-gpu remains applied.
  • The prior nonblocking residual note remains unchanged: bare Docker --gpu still fails when Docker CDI inventory is empty or missing unless WSL2 all-only fallback is detected. If maintainers still want a nvidia.com/gpu=0 fallback for unavailable inventory, that should be handled explicitly before or after merge.

Next state: gator:watch-pipeline

@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete for latest head 7872e32a1546ef3275eb22faeedc5c1288180b56.

Validation: maintainer-authored GPU runtime work that addresses #1477 by making default Docker/Podman GPU requests select a concrete CDI device when possible.
Review: the prior merge-conflict and CI blockers are resolved for this head. No blocking review items remain. Nonblocking residual note: bare Docker --gpu still fails when Docker CDI inventory is empty or missing unless WSL2 all-only fallback is detected; if maintainers still want a nvidia.com/gpu=0 fallback for unavailable inventory, that can be handled explicitly before or after merge.
Docs: relevant runtime docs were updated.
Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, and DCO are passing.
E2E: OpenShell / E2E and OpenShell / GPU E2E are passing with test:e2e-gpu applied.

Human maintainer approval or merge decision is now required.

@elezar elezar added gator:approval-needed Gator completed review; maintainer approval needed and removed gator:watch-pipeline Gator is monitoring PR CI/CD status labels Jun 12, 2026
@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Review: the prior merge-conflict and CI blockers are resolved for this head. No blocking review items remain. Nonblocking residual note: bare Docker --gpu still fails when Docker CDI inventory is empty or missing unless WSL2 all-only fallback is detected; if maintainers still want a nvidia.com/gpu=0 fallback for unavailable inventory, that can be handled explicitly before or after merge.

We don't want to fall back to nvidia.com/gpu=0 for empty / unavailable inventory. This should be an error state -- even for WSL2.

@elezar

elezar commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Maintainer Update

I re-evaluated latest head 7872e32a1546ef3275eb22faeedc5c1288180b56 after @elezar's 2026-06-12 comment clarifying that an unavailable or empty Docker CDI inventory should remain an error state, including on WSL2, rather than falling back to nvidia.com/gpu=0.

Disposition: resolved for the residual note.

Remaining items:

  • No blocking gator review items remain.
  • OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, OpenShell / E2E, and OpenShell / GPU E2E are passing for this head with test:e2e-gpu applied.

Next state: gator:approval-needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gator:approval-needed Gator completed review; maintainer approval needed test:e2e-gpu Requires GPU end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(gpu): select one CDI GPU by default for Docker and Podman

1 participant