Update load mcore checkpoint #9595
Code Review
This pull request refactors checkpoint loading in Megatron utilities by renaming variables for clarity and ensuring strict=False is consistently applied when loading state dicts for multiple DDP models. In wrap_model, it fixes a return bug when DDP wrapping is disabled, dynamically calculates bucket_size based on num_buckets, disables bucketing for non-first pipeline-parallel ranks, and adds proper CUDA stream synchronization for DDP initialization. The review feedback highlights a potential ZeroDivisionError if num_buckets is set to zero, suggesting a safety check before performing the division.
    if ddp_config.bucket_size is None and getattr(ddp_config, 'num_buckets', None) is not None:
        ddp_config.bucket_size = num_parameters // ddp_config.num_buckets
If ddp_config.num_buckets is set to 0, the division num_parameters // ddp_config.num_buckets raises a ZeroDivisionError. To avoid this runtime crash, check that num_buckets is greater than 0 before dividing. When it is 0 or negative, the condition is skipped and the code falls back to the default bucket size calculation below.
Suggested change:
-    if ddp_config.bucket_size is None and getattr(ddp_config, 'num_buckets', None) is not None:
-        ddp_config.bucket_size = num_parameters // ddp_config.num_buckets
+    if ddp_config.bucket_size is None and getattr(ddp_config, 'num_buckets', None) is not None and ddp_config.num_buckets > 0:
+        ddp_config.bucket_size = num_parameters // ddp_config.num_buckets
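The guard discussed in this comment can be exercised in isolation. The following is a minimal sketch, not the code from this PR: `DDPConfigSketch` and `resolve_bucket_size` are hypothetical names introduced only for illustration, and the config class models just the two fields mentioned in the review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DDPConfigSketch:
    # Hypothetical stand-in for the real DDP config object; only the
    # two fields discussed in this review are modeled here.
    bucket_size: Optional[int] = None
    num_buckets: Optional[int] = None


def resolve_bucket_size(ddp_config, num_parameters: int) -> Optional[int]:
    """Derive bucket_size from num_buckets only when num_buckets is positive.

    Hypothetical helper: returns the (possibly updated) bucket_size, or None
    when the caller should fall back to its default calculation.
    """
    num_buckets = getattr(ddp_config, 'num_buckets', None)
    if (
        ddp_config.bucket_size is None
        and num_buckets is not None
        and num_buckets > 0  # guards against ZeroDivisionError
    ):
        ddp_config.bucket_size = num_parameters // num_buckets
    return ddp_config.bucket_size


# Positive num_buckets: bucket_size is derived by integer division.
derived = resolve_bucket_size(DDPConfigSketch(num_buckets=4), 1000)

# num_buckets == 0: the guard skips the division and bucket_size stays None.
skipped = resolve_bucket_size(DDPConfigSketch(num_buckets=0), 1000)

# An explicit bucket_size is never overwritten.
explicit = resolve_bucket_size(DDPConfigSketch(bucket_size=64, num_buckets=4), 1000)
```

Note that integer division still yields a bucket_size of 0 when num_buckets exceeds num_parameters; whether that case needs its own handling depends on how the downstream code treats a zero bucket size.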
No description provided.