Fix progress bar deadlock on DDP metrics computation. #21322
Conversation
Force-pushed from c1c55c9 to 77bfef4, then from 77bfef4 to a1a3700.
| We used to have a bug where metrics were synced only on the rank 0 process. See
| https://github.com/Lightning-AI/pytorch-lightning/issues/21264
| for more details.
| """
| RichProgressBar()

This `RichProgressBar()` instance seems to not really be used?
Pull request overview
This PR fixes a deadlock in DDP (Distributed Data Parallel) training that occurred when progress bars computed metrics at the end of a training epoch. The bug arose because metric synchronization happened only on the rank 0 process, leaving the other ranks waiting indefinitely.
Key Changes:
- Modified `TQDMProgressBar` and `RichProgressBar` to call `get_metrics()` unconditionally on all ranks before checking whether the progress bar is enabled/disabled (see the sketch after this list)
- Added comprehensive tests for both progress bar types to verify that the DDP deadlock scenarios are resolved
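For context, the pattern behind the deadlock and the fix looks roughly like the sketch below. This is a simplified illustration, not the verbatim Lightning source: `get_metrics()` stands in for the real progress bar hook that can trigger cross-rank metric synchronization, while the class name, `is_enabled`, and `_render` are placeholder names.

```python
# Simplified sketch of the callback pattern; not the verbatim Lightning source.
# `is_enabled` mirrors the fact that the bar is typically enabled only on
# global rank 0, and `_render` is a placeholder for the actual bar update.


class ProgressBarSketch:
    def __init__(self, is_enabled: bool) -> None:
        self.is_enabled = is_enabled

    def get_metrics(self, trainer, pl_module) -> dict:
        # In Lightning this reaches into the logged metrics, whose computation
        # can involve collective ops that every DDP rank must join.
        return dict(trainer.progress_bar_metrics)

    def _render(self, metrics: dict) -> None:
        print(metrics)

    # Before the fix: only the rank with an enabled bar (rank 0) reached the
    # collective call, so the remaining ranks never joined it and DDP hung.
    def on_train_epoch_end_before(self, trainer, pl_module) -> None:
        if self.is_enabled:
            metrics = self.get_metrics(trainer, pl_module)  # collective on rank 0 only
            self._render(metrics)

    # After the fix: get_metrics() runs unconditionally on every rank; only the
    # rendering stays behind the enabled check.
    def on_train_epoch_end_after(self, trainer, pl_module) -> None:
        metrics = self.get_metrics(trainer, pl_module)  # every rank participates
        if self.is_enabled:
            self._render(metrics)
```

The essential point is that any call which can issue a collective has to be made on every rank, including the ranks whose progress bar is disabled.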
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/lightning/pytorch/callbacks/progress/tqdm_progress.py` | Moved the `get_metrics()` call outside the conditional check to ensure all DDP ranks participate in metric synchronization |
| `src/lightning/pytorch/callbacks/progress/rich_progress.py` | Moved the `get_metrics()` call before the early return check to ensure synchronization happens on all ranks |
| `tests/tests_pytorch/callbacks/progress/test_tqdm_progress_bar.py` | Added a DDP deadlock test with proper mocking to isolate the progress bar fix |
| `tests/tests_pytorch/callbacks/progress/test_rich_progress_bar.py` | Added a DDP deadlock test for `RichProgressBar` (contains issues that need fixing) |
| if self._metric_component:
|     self._metric_component.update(metrics)

Copilot AI · Dec 1, 2025

The condition `if self._metric_component:` is redundant: the previous check on line 655 already ensures that `self._metric_component` is not None when this line is reached, so the check can be simplified or removed.

Suggested change — drop the redundant guard and keep only:

|     self._metric_component.update(metrics)
| model = MyModel()
|
| # We need to mock these logger connector hooks, since these also attempt to sync metrics
| # and can "save" otherwise incorrect implementations of TQDMProgressBar.on_train_epoch_end.

Copilot AI · Dec 1, 2025

The comment mentions 'TQDMProgressBar', but this test is for RichProgressBar; the comment should reference RichProgressBar instead.

Suggested change:

| # and can "save" otherwise incorrect implementations of RichProgressBar.on_train_epoch_end.
|     process_group_backend="gloo",  # run on CPU
|     timeout=datetime.timedelta(seconds=5),  # timeout quickly for the test to fail
| ),
| enable_progress_bar=True,

Copilot AI · Dec 1, 2025
The RichProgressBar instance created on line 620 is not passed to the trainer's callbacks. This test should include callbacks=[pbar] in the Trainer initialization to actually test the RichProgressBar implementation. Without this, the test only uses the default progress bar behavior and doesn't exercise the fix in RichProgressBar._update_metrics().
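A minimal sketch of the suggested change, assuming the `DDPStrategy` arguments come straight from the quoted test snippet and that `pbar` is the `RichProgressBar` instance the comment refers to; the `accelerator`/`devices` values are assumptions for a CPU DDP test, and the rest of the setup (model, mocked hooks, `trainer.fit`) is omitted here:

```python
import datetime

from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import RichProgressBar
from lightning.pytorch.strategies import DDPStrategy

# Sketch only: mirrors the quoted snippet and adds the missing callbacks=[pbar].
pbar = RichProgressBar()
trainer = Trainer(
    accelerator="cpu",
    devices=2,
    strategy=DDPStrategy(
        process_group_backend="gloo",            # run on CPU
        timeout=datetime.timedelta(seconds=5),   # fail fast instead of hanging
    ),
    enable_progress_bar=True,
    callbacks=[pbar],  # pass the bar under test so RichProgressBar._update_metrics() is exercised
)
```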
| # This should not raise an AssertionError
| trainer.fit(model)

Copilot AI · Dec 1, 2025
This test should have the @RunIf(rich=True) decorator to ensure it only runs when the rich library is available, since it instantiates RichProgressBar which raises ModuleNotFoundError when rich is not installed.
Suggested change:

| @RunIf(rich=True)
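For illustration, the decorator would sit directly above the test function. The import path below matches how `RunIf` is used elsewhere in the Lightning test suite, while the test name and signature are hypothetical:

```python
from tests_pytorch.helpers.runif import RunIf  # helper used across the Lightning test suite


@RunIf(rich=True)  # skip this test when the `rich` package is not installed
def test_rich_progress_bar_ddp_metrics_deadlock(tmp_path):
    ...
```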
What does this PR do?
Fixes #21264
📚 Documentation preview 📚: https://pytorch-lightning--21322.org.readthedocs.build/en/21322/