Conversation
Greptile Summary: This PR fixes CI GPU test failures on Tesla V100 (Compute Capability 7.0) by introducing a dedicated GPU integration-test path in ci/run_integration.sh.
Confidence Score: 4/5. Not safe to merge as-is: lightning integration tests will break due to a missing pytorch_lightning installation. A P1 defect exists: removing the per-test pytorch_lightning setup without installing it anywhere else in the PT flow will cause lightning-related integration tests to fail at import time. The CUDA/version mismatch between the PR description and the code is a P2 documentation concern.

Important files changed: `ci/run_integration.sh` (missing pytorch_lightning install), `tests/integration_test/data/test_configs/standalone_job/client_api.yml` and `client_api_qa.yml` (pytorch_lightning setup removed).
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[run_integration.sh invoked with BUILD_TYPE] --> B{BUILD_TYPE?}
    B -->|tensorflow| C[integration_test_tf]
    B -->|client_api / client_api_qa / pytorch / cifar| D[integration_test_pt]
    B -->|everything else| E[integration_test via pipenv]
    D --> D1[pip install dev extras]
    D1 --> D2[pip install torch==2.6.0 torchvision==0.21.0 cu124 index]
    D2 --> D3[run_integration_tests.sh]
    D3 --> D4{Test requires pytorch_lightning?}
    D4 -->|Yes| D5[ImportError - not installed]
    D4 -->|No| D6[Test passes]
```
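The BUILD_TYPE routing in the flowchart can be sketched as a shell case statement. This is an illustrative reconstruction, not the actual body of `ci/run_integration.sh`; the function name `route_build_type` and the echoed labels are invented for the sketch.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the BUILD_TYPE routing shown in the flowchart.
# The real script dispatches to test functions; here we just echo the route.
route_build_type() {
    case "$1" in
        tensorflow)
            echo "integration_test_tf"
            ;;
        client_api|client_api_qa|pytorch|cifar)
            echo "integration_test_pt"
            ;;
        *)
            echo "integration_test via pipenv"
            ;;
    esac
}

route_build_type tensorflow   # → integration_test_tf
route_build_type client_api   # → integration_test_pt
```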
Reviews (9): Last reviewed commit: "Merge branch 'main' into fix_ci_torch_ve..."
Pull request overview
This PR adjusts the integration-test setup to avoid CI accidentally upgrading PyTorch (which can break GPU tests on older compute capabilities) by centralizing GPU Python package installation in the CI runner and removing per-test installs that can perturb the environment.
Changes:
- Removed `pip install pytorch_lightning` steps from standalone-job integration test YAML configs.
- Added a new GPU integration-test path in `ci/run_integration.sh` that installs torch/torchvision (CUDA 12.6 index) and runs selected backends without pipenv.
- Routed several PyTorch-related build types (`client_api`, `pytorch`, `cifar`, etc.) through the new GPU integration-test path.
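Per the review flowchart, the new PT path installs dev extras and pinned CUDA wheels before invoking the test runner. The commands below are reconstructed from the diagram, not copied from the merged script; the full index URL is an assumption based on the `cu124` label in the flowchart.

```shell
# Reconstructed sketch of the integration_test_pt install steps from the
# flowchart: dev extras, pinned CUDA wheels, then the test runner.
pip install -e '.[dev]'
pip install torch==2.6.0 torchvision==0.21.0 \
    --index-url https://download.pytorch.org/whl/cu124
./run_integration_tests.sh
```

Pinning both torch and torchvision against a fixed wheel index is what prevents a transitive dependency from silently upgrading PyTorch past the last release that supports the runner's GPU.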
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/integration_test/data/test_configs/standalone_job/client_api.yml | Removes per-test pytorch_lightning installation to avoid unintended dependency upgrades during runs. |
| tests/integration_test/data/test_configs/standalone_job/client_api_qa.yml | Same removal of per-test pytorch_lightning installation for QA config. |
| ci/run_integration.sh | Introduces a GPU-specific integration test runner that installs CUDA PyTorch wheels and uses it for several GPU-related build types. |
/build
/build
/build
/build
/build
I do believe there is a reason Compute Capability 7.0 has been dropped. In the long term, we should also cut off support for aging hardware.
### Description

Fix CI accidentally upgrading PyTorch in the test container.

The Blossom CI machine uses a Tesla V100 (Compute Capability 7.0). A recent PyTorch release (post-2025-03-23) dropped support for CC 7.0, causing swarm_cse_pt and other GPU tests to fail with a CUDA driver compatibility error.

#### Solution

1. Update the CI container to CUDA 12.6.
2. Install the torch version that follows the release note: https://github.com/pytorch/pytorch/releases/tag/v2.11.0

### Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
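Given the failure mode described above, a guard like the following could choose a torch pin from the GPU's compute capability instead of hard-coding it. This is a hypothetical sketch: the 7.5 cutoff and the `torch==2.6.0` pin are assumptions drawn from this discussion, not an official PyTorch support matrix, and `pick_torch_pin` is an invented helper.

```shell
# Hypothetical helper: choose a torch pin from a compute capability string
# (e.g. "7.0" as reported by `nvidia-smi --query-gpu=compute_cap`).
# The cutoff and the pin below are assumptions, not an official support table.
pick_torch_pin() {
    local cc="$1"
    # awk handles the floating-point comparison portably.
    if awk "BEGIN { exit !($cc < 7.5) }"; then
        echo "torch==2.6.0"   # assumed last pin that works on this CI's V100
    else
        echo "torch"          # newer GPUs can take the latest release
    fi
}

pick_torch_pin 7.0   # → torch==2.6.0
```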