CI: require N consecutive nvidia-smi successes after Windows device cycle by leofang · Pull Request #2195 · NVIDIA/cuda-python

leofang · 2026-06-10T21:09:46Z

Description

Fix flakiness on the 2× H100 MCDM Windows runner, observed since #2176 landed (e.g. https://github.com/NVIDIA/cuda-python/actions/runs/27293276910/job/80661578895).

What broke: pre-#2176, every Windows row ran install_gpu_driver.ps1 unconditionally, and that script ended with a fixed Start-Sleep -Seconds 5 after the pnputil cycle. #2176 replaced that fixed 5-sec settle with a poll-until-success loop in configure_driver_mode.ps1 — and that loop exits on the first nvidia-smi exit-0, ~2 sec into the cycle. On the H100 pair, NVML briefly reports success mid-init and then flaps back to "Not Found" a few seconds later, so the workflow's next "Ensure GPU is working" step lands in the flap window:

Waiting for nvidia-smi/NVML to come back up after device cycle...
+ nvidia-smi                    # next workflow step, ~4 sec later
Failed to initialize NVML: Not Found
Write-Error: Switching to driver mode MCDM failed!

Fix: two layers.

Restore the pre-#2176 5-sec floor: an unconditional Start-Sleep -Seconds 5 before the poll. That's the known-good baseline that worked across all Windows rows for a long time.
Then require N consecutive successes in the poll, where N is the number of devices returned by Get-PnpDevice -Class Display -FriendlyName "NVIDIA*". On single-GPU rows that's still effectively one short success after the 5-sec floor. On the H100 pair it's two consecutive successes, 3 sec apart, so a mid-flap "ok" can't fool the loop. The 60-sec deadline cap is unchanged.
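
The two layers can be modelled roughly as follows. This is a Python sketch of the logic only, not the contents of configure_driver_mode.ps1: the function name `wait_for_stable`, its parameters, and the injected `sleep`/`clock` callables are hypothetical, and it assumes that the poll sleeps before each probe and that the 60-sec deadline starts counting after the 5-sec floor.

```python
import time
from typing import Callable


def wait_for_stable(
    probe: Callable[[], bool],
    required: int,
    *,
    floor_s: float = 5.0,
    interval_s: float = 3.0,
    deadline_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once `probe` succeeds `required` times in a row.

    `probe` stands in for "nvidia-smi exited with code 0";
    `required` stands in for the number of cycled NVIDIA devices.
    """
    sleep(floor_s)  # layer 1: unconditional settle before any polling
    deadline = clock() + deadline_s  # assumption: deadline excludes the floor
    streak = 0
    while clock() < deadline:
        sleep(interval_s)
        if probe():
            streak += 1
            if streak >= required:  # layer 2: N consecutive successes
                return True
        else:
            streak = 0  # a flap back to failure resets the count
    return False  # caller is expected to fail loudly
```

With `required=1` the loop returns after the floor plus one interval; with `required=2` a success followed by a failure does not count, because the streak is reset to zero.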

Net effect:

Single-GPU rows: 5 sec sleep + ~3 sec (1 success) = ~8 sec floor (pre-#2176 was 5 sec, plus the install overhead).
2× H100 MCDM: 5 sec sleep + ~6 sec (2 successes) = ~11 sec floor (pre-#2176 was 5 sec).
Pathological case: 60-sec deadline still caps total wait, with a loud failure message instead of a silent ride into the next step.
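
The floor figures above follow from one formula: the fixed 5-sec sleep plus one 3-sec poll interval per required success. As a worked check in Python (the helper name `settle_floor_seconds` is hypothetical and not part of the PR):

```python
def settle_floor_seconds(gpu_count: int, floor_s: int = 5, interval_s: int = 3) -> int:
    """Minimum wait when every probe succeeds: fixed sleep + one interval per GPU."""
    return floor_s + interval_s * gpu_count


for row, gpus in (("single-GPU", 1), ("2x H100 MCDM", 2)):
    print(f"{row}: ~{settle_floor_seconds(gpus)} sec floor")
# prints:
# single-GPU: ~8 sec floor
# 2x H100 MCDM: ~11 sec floor
```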

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

Multi-GPU Windows rows (observed on 2x H100 MCDM after NVIDIA#2176 landed) keep failing the "Ensure GPU is working" step with `Failed to initialize NVML: Not Found`.

Root cause: after `pnputil` cycles both display devices, NVML briefly reports success mid-init, then flaps back to "Not Found" a couple of seconds later. The existing poll exits on the *first* `nvidia-smi` exit code 0, so the loop bails ~2 seconds in and the next workflow step hits the flap window.

Scale the consecutive-success requirement to the number of cycled NVIDIA devices (1 for single-GPU rows, 2 for the H100 pair) and bump the inter-iteration sleep from 2 to 3 seconds. Single-GPU rows pay an extra 1-sec floor; multi-GPU rows now require ~6 sec of stable NVML before moving on. The 60-sec deadline is unchanged; the loop still bails (and the script fails loudly) if NVML doesn't settle in time.

copy-pr-bot · 2026-06-10T21:09:53Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

leofang · 2026-06-10T21:13:44Z

/ok to test 6067a78

… poll

Pre-NVIDIA#2176, every Windows row ran install_gpu_driver.ps1 unconditionally and that script ended with a fixed `Start-Sleep -Seconds 5` after the pnputil cycle. NVIDIA#2176 dropped that floor (the poll exits on the first nvidia-smi success at ~2 sec on single-GPU, ~2 sec mid-flap on the H100 pair). Put the 5-sec floor back, before the consecutive-success poll, so we never settle for less than the known-good baseline.

leofang · 2026-06-10T21:56:50Z

/ok to test 4909e11

leofang · 2026-06-11T02:03:25Z

Thanks, Ralf!

github-actions · 2026-06-11T02:11:22Z

Doc Preview CI
Preview removed because the pull request was closed or merged.

github-actions Bot added the CI/CD (CI/CD infrastructure) label Jun 10, 2026

leofang added the bug (Something isn't working) and P0 (High priority - Must do!) labels Jun 10, 2026

leofang self-assigned this Jun 10, 2026

leofang added this to the cuda.core v1.1.0 milestone Jun 10, 2026

leofang marked this pull request as ready for review June 10, 2026 21:42

leofang added 3 commits June 10, 2026 21:51

CI: trim configure_driver_mode.ps1 comments for portability

cf5bfd1

CI: drop redundant @(...) comment

4909e11

leofang mentioned this pull request Jun 10, 2026

Move driver and nvrtc cython and internal layers to new generator #1972

Open

rwgk approved these changes Jun 10, 2026

leofang enabled auto-merge (squash) June 11, 2026 01:02

leofang merged commit ece27e4 into NVIDIA:main Jun 11, 2026
274 of 277 checks passed

leofang deleted the leof/fix-configure-driver-mode-multi-gpu branch June 11, 2026 02:03

leofang merged 4 commits into NVIDIA:main from leofang:leof/fix-configure-driver-mode-multi-gpu
