CI: require N consecutive nvidia-smi successes after Windows device cycle #2195

Merged: leofang merged 4 commits into NVIDIA:main from leofang:leof/fix-configure-driver-mode-multi-gpu on Jun 11, 2026

Conversation

@leofang leofang commented Jun 10, 2026

Member

Description

Fix flakiness on the 2× H100 MCDM Windows runner, observed since #2176 landed (e.g. https://github.com/NVIDIA/cuda-python/actions/runs/27293276910/job/80661578895).

What broke: pre-#2176, every Windows row ran install_gpu_driver.ps1 unconditionally, and that script ended with a fixed Start-Sleep -Seconds 5 after the pnputil cycle. #2176 replaced that fixed 5-sec settle with a poll-until-success loop in configure_driver_mode.ps1 — and that loop exits on the first nvidia-smi exit-0, ~2 sec into the cycle. On the H100 pair, NVML briefly reports success mid-init and then flaps back to "Not Found" a few seconds later, so the workflow's next "Ensure GPU is working" step lands in the flap window:

Waiting for nvidia-smi/NVML to come back up after device cycle...
+ nvidia-smi                    # next workflow step, ~4 sec later
Failed to initialize NVML: Not Found
Write-Error: Switching to driver mode MCDM failed!
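
The failure mode can be reproduced with a small Python model. This is only a sketch, not the PowerShell in configure_driver_mode.ps1: `poll_until_first_success`, `make_probe`, and the outcome sequence are hypothetical stand-ins, with each boolean standing in for one nvidia-smi exit status (True for exit 0).

```python
from typing import Callable, Iterable


def poll_until_first_success(probe: Callable[[], bool], max_attempts: int) -> int:
    """Return the 1-based attempt on which the probe first succeeded."""
    for attempt in range(1, max_attempts + 1):
        if probe():
            return attempt
    raise TimeoutError("probe never succeeded")


def make_probe(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Build a probe that replays a fixed sequence of results."""
    it = iter(outcomes)
    return lambda: next(it)


# Assumed flap pattern (illustrative only): fail, brief mid-init success,
# two failures while NVML flaps back, then stable success.
flap = [False, True, False, False, True, True, True]
probe = make_probe(flap)

exit_attempt = poll_until_first_success(probe, max_attempts=len(flap))

# Stand-in for the workflow's next "Ensure GPU is working" step.
next_step_ok = probe()

print(exit_attempt, next_step_ok)  # 2 False
```

The poll returns on the mid-init success, and the very next probe lands in the flap window, which matches the shape of the log above.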

Fix: two layers.

  1. Restore the pre-#2176 5-sec floor (#2176 is "CI: allow specifying custom driver versions in test matrix"): an unconditional Start-Sleep -Seconds 5 before the poll. That's the known-good baseline that worked across all Windows rows for a long time.
  2. Then require N consecutive successes in the poll, where N is the number of devices returned by Get-PnpDevice -Class Display -FriendlyName "NVIDIA*". On single-GPU rows that's still effectively one short success after the 5-sec floor. On the H100 pair it's two consecutive successes 3 sec apart, so a mid-flap "ok" can't fool the loop. The 60-sec deadline cap is unchanged.
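
The two layers can be modeled in Python as follows. This is a sketch under stated assumptions, not the script's actual code: `wait_for_stable`, `FakeClock`, and the parameter names are hypothetical, `required` stands in for the device count that the real script takes from Get-PnpDevice, and the deadline is assumed to be measured from the start of the poll rather than from the start of the floor (the PR does not say which).

```python
import time
from typing import Callable


def wait_for_stable(
    probe: Callable[[], bool],
    required: int,
    *,
    floor_s: float = 5.0,
    interval_s: float = 3.0,
    deadline_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Sleep for the floor, then poll until `required` consecutive successes.

    Returns the number of probes made; raises TimeoutError at the deadline.
    """
    if required < 1:
        raise ValueError("required must be at least 1")
    sleep(floor_s)                      # layer 1: unconditional settle time
    poll_start = clock()                # assumption: deadline covers the poll only
    streak = 0
    probes = 0
    while clock() - poll_start < deadline_s:
        probes += 1
        streak = streak + 1 if probe() else 0   # any failure resets the streak
        if streak >= required:                  # layer 2: N in a row
            return probes
        sleep(interval_s)
    raise TimeoutError("probe did not stabilize before the deadline")


class FakeClock:
    """Deterministic clock so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def sleep(self, seconds: float) -> None:
        self.now += seconds

    def time(self) -> float:
        return self.now


# Illustrative two-GPU case: a mid-flap success followed by a failure.
clock = FakeClock()
outcomes = iter([True, False, True, True])
multi_probes = wait_for_stable(
    lambda: next(outcomes), required=2, sleep=clock.sleep, clock=clock.time
)
multi_elapsed = clock.now
print(multi_probes, multi_elapsed)  # 4 14.0
```

With `required=2` the lone success on the first probe is discarded when the second probe fails, so the function only returns after two successes in a row. With `required=1` and an immediately healthy probe it returns after the floor and a single probe.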

Net effect:

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Multi-GPU Windows rows (observed on 2x H100 MCDM after NVIDIA#2176 landed)
keep failing the "Ensure GPU is working" step with `Failed to
initialize NVML: Not Found`. Root cause: after `pnputil` cycles both
display devices, NVML briefly reports success mid-init then flaps back
to "Not Found" a couple seconds later. The existing poll exits on the
*first* `nvidia-smi` exit code 0, so the loop bails ~2 seconds in and
the next workflow step hits the flap window.

Scale the consecutive-success requirement to the number of cycled
NVIDIA devices (1 for single-GPU rows, 2 for the H100 pair) and bump
the inter-iteration sleep from 2 to 3 seconds. Single-GPU rows pay an
extra 1-sec floor; multi-GPU rows now require ~6 sec of stable NVML
before moving on.

The 60-sec deadline is unchanged; the loop still bails (and the script
fails loudly) if NVML doesn't settle in time.

copy-pr-bot Bot commented Jun 10, 2026

Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the CI/CD CI/CD infrastructure label Jun 10, 2026
@leofang leofang added bug Something isn't working P0 High priority - Must do! labels Jun 10, 2026
@leofang leofang self-assigned this Jun 10, 2026
leofang commented Jun 10, 2026

Member Author

/ok to test 6067a78

@leofang leofang added this to the cuda.core v1.1.0 milestone Jun 10, 2026

@leofang leofang marked this pull request as ready for review June 10, 2026 21:42
leofang added 3 commits June 10, 2026 21:51
… poll

Pre-NVIDIA#2176, every Windows row ran install_gpu_driver.ps1 unconditionally
and that script ended with a fixed `Start-Sleep -Seconds 5` after the
pnputil cycle. NVIDIA#2176 dropped that floor (the poll exits on the first
nvidia-smi success at ~2 sec on single-GPU, ~2 sec mid-flap on the
H100 pair). Put the 5-sec floor back, before the consecutive-success
poll, so we never settle for less than the known-good baseline.

leofang commented Jun 10, 2026

Member Author

/ok to test 4909e11

@leofang leofang enabled auto-merge (squash) June 11, 2026 01:02
@leofang leofang merged commit ece27e4 into NVIDIA:main Jun 11, 2026
274 of 277 checks passed
@leofang leofang deleted the leof/fix-configure-driver-mode-multi-gpu branch June 11, 2026 02:03
leofang commented Jun 11, 2026

Member Author

Thanks, Ralf!

