
Fix CI torch version issue #4391

Merged
IsaacYangSLA merged 9 commits into NVIDIA:main from YuanTingHsieh:fix_ci_torch_version
Apr 3, 2026

Conversation

@YuanTingHsieh
Collaborator

Description

Fix CI accidentally upgrading PyTorch in test container

The Blossom CI machine uses a Tesla V100 (Compute Capability 7.0). A recent PyTorch release (post-2025-03-23) dropped support for CC 7.0, causing swarm_cse_pt and other GPU tests to fail with a CUDA driver compatibility error.

Solution

  1. Update the CI container to CUDA 12.6.
  2. Install the torch version following the release notes: https://github.com/pytorch/pytorch/releases/tag/v2.11.0
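As a rough illustration of the two steps above, the install would follow the pattern below. The `cu126` wheel-index tag is an assumption matched to the CUDA 12.6 container; the exact versions pinned in CI may differ (the diff reviewed below pins different ones).

```shell
# Illustrative only: select the PyTorch wheel index matching the
# container's CUDA version (step 1), then install from it (step 2).
# "cu126" is an assumed tag for CUDA 12.6.
CUDA_TAG="cu126"
TORCH_INDEX="https://download.pytorch.org/whl/${CUDA_TAG}"
echo "pip install torch torchvision --index-url ${TORCH_INDEX}"
```

Pinning the index (rather than letting pip resolve from PyPI) is what prevents a later dependency install from silently upgrading torch past the last release supporting Compute Capability 7.0.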

Types of changes

  - [x] Non-breaking change (fix or new feature that would not break existing functionality).
  - [ ] Breaking change (fix or new feature that would cause existing functionality to change).
  - [ ] New tests added to cover the changes.
  - [ ] Quick tests passed locally by running `./runtest.sh`.
  - [ ] In-line docstrings updated.
  - [ ] Documentation updated.

Copilot AI review requested due to automatic review settings April 1, 2026 22:42
@greptile-apps
Contributor

greptile-apps bot commented Apr 1, 2026

Greptile Summary

This PR fixes CI GPU test failures on Tesla V100 (Compute Capability 7.0) by introducing a dedicated integration_test_pt() function that pins torch==2.6.0 and torchvision==0.21.0 on the cu124 index, and routes client_api, client_api_qa, pytorch, and cifar build types to it. It also removes per-test pytorch_lightning install steps from the test YAML configs.

  • The pytorch_lightning package is no longer installed anywhere in the PT test flow — it's absent from integration_test_pt() and from setup.py's dev extras — so any lightning integration tests (run lightning-client-api, run lightning-client-api-in-process) will fail with an ImportError.
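Based on this summary, the new routine might look roughly like the sketch below. The function name and pinned versions are taken from the review's description of the diff, not verified against the merged script; the absence of `pytorch_lightning` here is the gap flagged above.

```shell
# Hypothetical sketch of integration_test_pt() as described in the
# review summary. Pinning torch/torchvision on a fixed CUDA index
# prevents a transitive dependency from pulling in a newer wheel
# that drops Compute Capability 7.0.
integration_test_pt() {
    pip install -e .[dev]
    pip install torch==2.6.0 torchvision==0.21.0 \
        --index-url https://download.pytorch.org/whl/cu124
    # NOTE: no pytorch_lightning install here — lightning tests would
    # hit an ImportError unless it is installed elsewhere.
    ./run_integration_tests.sh
}
```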

Confidence Score: 4/5

Not safe to merge as-is: lightning integration tests will break due to missing pytorch_lightning installation.

A P1 defect exists: removing the per-test pytorch_lightning setup without installing it anywhere else in the PT flow will cause lightning-related integration tests to fail at import time. The CUDA/version mismatch between the PR description and the code is a P2 documentation concern.

Affected files: ci/run_integration.sh (missing pytorch_lightning install); tests/integration_test/data/test_configs/standalone_job/client_api.yml and client_api_qa.yml (pytorch_lightning setup removed).

Important Files Changed

Filename Overview
ci/run_integration.sh Adds integration_test_pt() function that pins torch==2.6.0/torchvision==0.21.0 on cu124, but the PR description references CUDA 12.6 and PyTorch v2.11.0 — a mismatch that needs clarification. pytorch_lightning is also not installed here, leaving lightning tests without a required dependency.
tests/integration_test/data/test_configs/standalone_job/client_api.yml Removes per-test pytorch_lightning install from two lightning test setups; those tests will now fail at import time since pytorch_lightning is not installed anywhere in the PT test flow.
tests/integration_test/data/test_configs/standalone_job/client_api_qa.yml Removes pytorch_lightning setup step from one test; same missing-dependency risk as client_api.yml, though the impact depends on whether that test exercises lightning.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[run_integration.sh invoked with BUILD_TYPE] --> B{BUILD_TYPE?}
    B -->|tensorflow| C[integration_test_tf]
    B -->|client_api / client_api_qa / pytorch / cifar| D[integration_test_pt]
    B -->|everything else| E[integration_test via pipenv]
    D --> D1[pip install dev extras]
    D1 --> D2[pip install torch==2.6.0 torchvision==0.21.0 cu124 index]
    D2 --> D3[run_integration_tests.sh]
    D3 --> D4{Test requires pytorch_lightning?}
    D4 -->|Yes| D5[ImportError - not installed]
    D4 -->|No| D6[Test passes]
```

Reviews (9): Last reviewed commit: "Merge branch 'main' into fix_ci_torch_ve..."

Contributor

Copilot AI left a comment


Pull request overview

This PR adjusts the integration-test setup to avoid CI accidentally upgrading PyTorch (which can break GPU tests on older compute capabilities) by centralizing GPU Python package installation in the CI runner and removing per-test installs that can perturb the environment.

Changes:

  • Removed pip install pytorch_lightning steps from standalone-job integration test YAML configs.
  • Added a new GPU integration-test path in ci/run_integration.sh that installs torch/torchvision (CUDA 12.6 index) and runs selected backends without pipenv.
  • Routed several PyTorch-related build types (client_api, pytorch, cifar, etc.) through the new GPU integration-test path.
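The routing change can be sketched as a small dispatch. The stub functions below stand in for the real runners, and the function names follow this review's description rather than the actual script.

```shell
# Hypothetical dispatch mirroring the described build-type routing.
integration_test_tf() { echo "tf path"; }
integration_test_pt() { echo "pt path"; }      # new GPU path, no pipenv
integration_test()    { echo "pipenv path"; }  # original default

run_for_build_type() {
    case "$1" in
        tensorflow) integration_test_tf ;;
        client_api|client_api_qa|pytorch|cifar) integration_test_pt ;;
        *) integration_test ;;
    esac
}

run_for_build_type pytorch   # -> pt path
```

Keeping the GPU path outside pipenv means the pinned torch wheels are installed directly into the runner's environment instead of a per-test virtualenv that a lockfile resolution could upgrade.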

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/integration_test/data/test_configs/standalone_job/client_api.yml Removes per-test pytorch_lightning installation to avoid unintended dependency upgrades during runs.
tests/integration_test/data/test_configs/standalone_job/client_api_qa.yml Same removal of per-test pytorch_lightning installation for QA config.
ci/run_integration.sh Introduces a GPU-specific integration test runner that installs CUDA PyTorch wheels and uses it for several GPU-related build types.


@YuanTingHsieh
Collaborator Author

/build

@YuanTingHsieh
Collaborator Author

/build

1 similar comment
@YuanTingHsieh
Collaborator Author

/build

@YuanTingHsieh YuanTingHsieh added the cicd continuous integration/continuous development label Apr 2, 2026
@YuanTingHsieh
Collaborator Author

/build

@YuanTingHsieh
Collaborator Author

/build

@IsaacYangSLA
Collaborator

I do believe there is a reason Compute Capability 7.0 has been dropped. In the long term, we should also cut off support for aging hardware.

Collaborator

@IsaacYangSLA IsaacYangSLA left a comment


LGTM.

@IsaacYangSLA IsaacYangSLA merged commit f7cacbb into NVIDIA:main Apr 3, 2026
29 checks passed
@YuanTingHsieh YuanTingHsieh deleted the fix_ci_torch_version branch April 3, 2026 20:22
YuanTingHsieh added a commit to YuanTingHsieh/NVFlare that referenced this pull request Apr 3, 2026
YuanTingHsieh added a commit that referenced this pull request Apr 6, 2026
Labels

cicd continuous integration/continuous development