
Fix CI torch version issue #4391

Merged
IsaacYangSLA merged 9 commits into NVIDIA:main from YuanTingHsieh:fix_ci_torch_version
Apr 3, 2026

Conversation

@YuanTingHsieh
Collaborator

Description

Fix CI accidentally upgrading PyTorch in test container

The Blossom CI machine uses a Tesla V100 (Compute Capability 7.0). A recent PyTorch release (post-2025-03-23) dropped support for CC 7.0, causing swarm_cse_pt and other GPU tests to fail with a CUDA driver compatibility error.

Solution

  1. Update the CI container to CUDA 12.6.
  2. Install the torch version following the release notes: https://github.com/pytorch/pytorch/releases/tag/v2.11.0
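As a rough illustration of the two steps above, the install would follow the pattern below. The `cu126` wheel-index tag is an assumption matched to the CUDA 12.6 container; the exact versions pinned in CI may differ (the diff reviewed below pins different ones).

```shell
# Illustrative only: select the PyTorch wheel index matching the
# container's CUDA version (step 1), then install from it (step 2).
# "cu126" is an assumed tag for CUDA 12.6.
CUDA_TAG="cu126"
TORCH_INDEX="https://download.pytorch.org/whl/${CUDA_TAG}"
echo "pip install torch torchvision --index-url ${TORCH_INDEX}"
```

Pinning the index (rather than letting pip resolve from PyPI) is what prevents a later dependency install from silently upgrading torch past the last release supporting Compute Capability 7.0.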

Types of changes

  - [x] Non-breaking change (fix or new feature that would not break existing functionality).
  - [ ] Breaking change (fix or new feature that would cause existing functionality to change).
  - [ ] New tests added to cover the changes.
  - [ ] Quick tests passed locally by running `./runtest.sh`.
  - [ ] In-line docstrings updated.
  - [ ] Documentation updated.

Copilot AI review requested due to automatic review settings April 1, 2026 22:42
@greptile-apps
Contributor

greptile-apps bot commented Apr 1, 2026

Greptile Summary

This PR fixes CI GPU test failures on Tesla V100 (Compute Capability 7.0) by introducing a dedicated integration_test_pt() function that pins torch==2.6.0 and torchvision==0.21.0 on the cu124 index, and routes client_api, client_api_qa, pytorch, and cifar build types to it. It also removes per-test pytorch_lightning install steps from the test YAML configs.

  • The pytorch_lightning package is no longer installed anywhere in the PT test flow — it's absent from integration_test_pt() and from setup.py's dev extras — so any lightning integration tests (run lightning-client-api, run lightning-client-api-in-process) will fail with an ImportError.
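Based on this summary, the new routine might look roughly like the sketch below. The function name and pinned versions are taken from the review's description of the diff, not verified against the merged script; the absence of `pytorch_lightning` here is the gap flagged above.

```shell
# Hypothetical sketch of integration_test_pt() as described in the
# review summary. Pinning torch/torchvision on a fixed CUDA index
# prevents a transitive dependency from pulling in a newer wheel
# that drops Compute Capability 7.0.
integration_test_pt() {
    pip install -e .[dev]
    pip install torch==2.6.0 torchvision==0.21.0 \
        --index-url https://download.pytorch.org/whl/cu124
    # NOTE: no pytorch_lightning install here — lightning tests would
    # hit an ImportError unless it is installed elsewhere.
    ./run_integration_tests.sh
}
```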

Confidence Score: 4/5

Not safe to merge as-is: lightning integration tests will break due to missing pytorch_lightning installation.

A P1 defect exists: removing the per-test pytorch_lightning setup without installing it anywhere else in the PT flow will cause lightning-related integration tests to fail at import time. The CUDA/version mismatch between the PR description and the code is a P2 documentation concern.

Affected files: ci/run_integration.sh (missing pytorch_lightning install); tests/integration_test/data/test_configs/standalone_job/client_api.yml and client_api_qa.yml (pytorch_lightning setup removed).

Important Files Changed

Filename Overview
ci/run_integration.sh Adds integration_test_pt() function that pins torch==2.6.0/torchvision==0.21.0 on cu124, but the PR description references CUDA 12.6 and PyTorch v2.11.0 — a mismatch that needs clarification. pytorch_lightning is also not installed here, leaving lightning tests without a required dependency.
tests/integration_test/data/test_configs/standalone_job/client_api.yml Removes per-test pytorch_lightning install from two lightning test setups; those tests will now fail at import time since pytorch_lightning is not installed anywhere in the PT test flow.
tests/integration_test/data/test_configs/standalone_job/client_api_qa.yml Removes pytorch_lightning setup step from one test; same missing-dependency risk as client_api.yml, though the impact depends on whether that test exercises lightning.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[run_integration.sh invoked with BUILD_TYPE] --> B{BUILD_TYPE?}
    B -->|tensorflow| C[integration_test_tf]
    B -->|client_api / client_api_qa / pytorch / cifar| D[integration_test_pt]
    B -->|everything else| E[integration_test via pipenv]
    D --> D1[pip install dev extras]
    D1 --> D2[pip install torch==2.6.0 torchvision==0.21.0 cu124 index]
    D2 --> D3[run_integration_tests.sh]
    D3 --> D4{Test requires pytorch_lightning?}
    D4 -->|Yes| D5[ImportError - not installed]
    D4 -->|No| D6[Test passes]
```

Reviews (9): Last reviewed commit: "Merge branch 'main' into fix_ci_torch_ve..."

Contributor

Copilot AI left a comment


Pull request overview

This PR adjusts the integration-test setup to avoid CI accidentally upgrading PyTorch (which can break GPU tests on older compute capabilities) by centralizing GPU Python package installation in the CI runner and removing per-test installs that can perturb the environment.

Changes:

  • Removed pip install pytorch_lightning steps from standalone-job integration test YAML configs.
  • Added a new GPU integration-test path in ci/run_integration.sh that installs torch/torchvision (CUDA 12.6 index) and runs selected backends without pipenv.
  • Routed several PyTorch-related build types (client_api, pytorch, cifar, etc.) through the new GPU integration-test path.
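The routing change can be sketched as a small dispatch. The stub functions below stand in for the real runners, and the function names follow this review's description rather than the actual script.

```shell
# Hypothetical dispatch mirroring the described build-type routing.
integration_test_tf() { echo "tf path"; }
integration_test_pt() { echo "pt path"; }      # new GPU path, no pipenv
integration_test()    { echo "pipenv path"; }  # original default

run_for_build_type() {
    case "$1" in
        tensorflow) integration_test_tf ;;
        client_api|client_api_qa|pytorch|cifar) integration_test_pt ;;
        *) integration_test ;;
    esac
}

run_for_build_type pytorch   # -> pt path
```

Keeping the GPU path outside pipenv means the pinned torch wheels are installed directly into the runner's environment instead of a per-test virtualenv that a lockfile resolution could upgrade.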

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/integration_test/data/test_configs/standalone_job/client_api.yml Removes per-test pytorch_lightning installation to avoid unintended dependency upgrades during runs.
tests/integration_test/data/test_configs/standalone_job/client_api_qa.yml Same removal of per-test pytorch_lightning installation for QA config.
ci/run_integration.sh Introduces a GPU-specific integration test runner that installs CUDA PyTorch wheels and uses it for several GPU-related build types.


@YuanTingHsieh
Collaborator Author

/build

@YuanTingHsieh
Collaborator Author

/build

1 similar comment
@YuanTingHsieh
Collaborator Author

/build

@YuanTingHsieh YuanTingHsieh added the cicd continuous integration/continuous development label Apr 2, 2026
@YuanTingHsieh
Collaborator Author

/build

@YuanTingHsieh
Collaborator Author

/build

@IsaacYangSLA
Collaborator

I do believe there is a reason Compute Capability 7.0 has been dropped. In the long term, we should also cut off support for aging hardware.

Collaborator

@IsaacYangSLA IsaacYangSLA left a comment


LGTM.

@IsaacYangSLA IsaacYangSLA merged commit f7cacbb into NVIDIA:main Apr 3, 2026
29 checks passed
@YuanTingHsieh YuanTingHsieh deleted the fix_ci_torch_version branch April 3, 2026 20:22
YuanTingHsieh added a commit to YuanTingHsieh/NVFlare that referenced this pull request Apr 3, 2026
YuanTingHsieh added a commit that referenced this pull request Apr 6, 2026
Labels

cicd continuous integration/continuous development