CI: require N consecutive nvidia-smi successes after Windows device cycle #2195
Merged
leofang merged 4 commits on Jun 11, 2026
Multi-GPU Windows rows (observed on 2x H100 MCDM after NVIDIA#2176 landed) keep failing the "Ensure GPU is working" step with `Failed to initialize NVML: Not Found`.

Root cause: after `pnputil` cycles both display devices, NVML briefly reports success mid-init, then flaps back to "Not Found" a couple of seconds later. The existing poll exits on the *first* `nvidia-smi` exit code 0, so the loop bails ~2 seconds in and the next workflow step hits the flap window.

Scale the consecutive-success requirement to the number of cycled NVIDIA devices (1 for single-GPU rows, 2 for the H100 pair) and bump the inter-iteration sleep from 2 to 3 seconds. Single-GPU rows pay an extra 1-sec floor; multi-GPU rows now require ~6 sec of stable NVML before moving on.

The 60-sec deadline is unchanged; the loop still bails (and the script fails loudly) if NVML doesn't settle in time.
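The consecutive-success rule described in this commit message can be sketched roughly as follows. This is an illustrative Python model, not the PowerShell in `configure_driver_mode.ps1`: the function name `wait_for_stable`, its parameters, and the injectable `probe`, `sleep`, and `clock` callables are all assumptions made so the logic can be exercised without a GPU. In the real script the probe would be an `nvidia-smi` exit-code check.

```python
import time
from typing import Callable


def wait_for_stable(
    probe: Callable[[], bool],
    required: int,
    interval: float = 3.0,
    deadline: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once `probe` succeeds `required` times in a row.

    Any failure resets the streak, so a single mid-flap success cannot
    satisfy the loop when `required` is greater than 1. Returns False
    if `deadline` seconds pass without a long enough streak.
    """
    start = clock()
    streak = 0
    while clock() - start < deadline:
        if probe():
            streak += 1
            if streak >= required:
                return True
        else:
            streak = 0  # flap detected: start counting again
        sleep(interval)
    return False
```

With `required=1` this degenerates to the old exit-on-first-success behavior; with `required=2` a success followed by a failure starts the count over, which is the property the multi-GPU rows need.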
/ok to test 6067a78
… poll

Pre-NVIDIA#2176, every Windows row ran `install_gpu_driver.ps1` unconditionally and that script ended with a fixed `Start-Sleep -Seconds 5` after the `pnputil` cycle. NVIDIA#2176 dropped that floor (the poll exits on the first `nvidia-smi` success at ~2 sec on single-GPU, ~2 sec mid-flap on the H100 pair). Put the 5-sec floor back, before the consecutive-success poll, so we never settle for less than the known-good baseline.
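To see why the floor and the streak requirement matter together, here is a small simulated timeline in Python. The flap trace in `nvml_ok` is made up for illustration (success from 2 to 4 sec after the device cycle, "Not Found" from 4 to 8 sec, stable from 8 sec on); the real timing on the H100 pair is not given beyond "a few seconds". The helper `settle_time` is likewise hypothetical and only models when a poll would declare NVML ready.

```python
from typing import Optional


def nvml_ok(t: float) -> bool:
    """Hypothetical flap trace: brief success mid-init, then a drop, then stable."""
    return 2.0 <= t < 4.0 or t >= 8.0


def settle_time(
    floor: float, required: int, interval: float, deadline: float = 60.0
) -> Optional[float]:
    """Simulated seconds after the device cycle at which the poll reports ready.

    `floor` models the fixed sleep before polling starts; `required` is the
    number of consecutive successes needed. Returns None on deadline.
    """
    t = floor
    streak = 0
    while t - floor < deadline:
        if nvml_ok(t):
            streak += 1
            if streak >= required:
                return t
        else:
            streak = 0
        t += interval
    return None


# Old behavior: no floor, first success wins, 2-sec interval.
old = settle_time(floor=0.0, required=1, interval=2.0)
# New behavior on a two-GPU row: 5-sec floor, two successes 3 sec apart.
new = settle_time(floor=5.0, required=2, interval=3.0)
```

Under this assumed trace, `old` is 2.0, which is inside the brief success window, so a step starting a few seconds later (for example at 5 sec) would see "Not Found". `new` is 11.0, by which point the simulated NVML has been stable for 3 sec.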
/ok to test 4909e11
rwgk approved these changes on Jun 10, 2026
Thanks, Ralf!
Description
Fix flakiness on the 2× H100 MCDM Windows runner, observed since #2176 landed (e.g. https://github.com/NVIDIA/cuda-python/actions/runs/27293276910/job/80661578895).
What broke: pre-#2176, every Windows row ran `install_gpu_driver.ps1` unconditionally, and that script ended with a fixed `Start-Sleep -Seconds 5` after the `pnputil` cycle. #2176 replaced that fixed 5-sec settle with a poll-until-success loop in `configure_driver_mode.ps1`, and that loop exits on the first `nvidia-smi` exit-0, ~2 sec into the cycle. On the H100 pair, NVML briefly reports success mid-init and then flaps back to "Not Found" a few seconds later, so the workflow's next "Ensure GPU is working" step lands in the flap window and fails with `Failed to initialize NVML: Not Found`.

Fix: two layers.

1. Restore the fixed `Start-Sleep -Seconds 5` before the poll. That's the known-good baseline that worked across all Windows rows for a long time.
2. Require N consecutive `nvidia-smi` successes, where N is the number of devices matched by `Get-PnpDevice -Class Display -FriendlyName "NVIDIA*"`. On single-GPU rows that's still effectively one short success after the 5-sec floor. On the H100 pair it's two consecutive 3-sec-apart successes, so a mid-flap "ok" can't fool the loop. The 60-sec deadline cap is unchanged.

Net effect:
Checklist