Skip to content

Update load mcore checkpoint#9595

Open
Jintao-Huang wants to merge 2 commits into
modelscope:mainfrom
Jintao-Huang:update_load_mcore_checkpoint
Open

Update load mcore checkpoint#9595
Jintao-Huang wants to merge 2 commits into
modelscope:mainfrom
Jintao-Huang:update_load_mcore_checkpoint

Conversation

@Jintao-Huang

Copy link
Copy Markdown
Collaborator

No description provided.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors checkpoint loading in Megatron utilities by renaming variables for clarity and ensuring strict=False is consistently applied when loading state dicts for multiple DDP models. In wrap_model, it fixes a return bug when DDP wrapping is disabled, dynamically calculates bucket_size based on num_buckets, disables bucketing for non-first pipeline-parallel ranks, and adds proper CUDA stream synchronization for DDP initialization. The review feedback highlights a potential ZeroDivisionError if num_buckets is set to zero, suggesting a safety check before performing the division.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +531 to +532
if ddp_config.bucket_size is None and getattr(ddp_config, 'num_buckets', None) is not None:
ddp_config.bucket_size = num_parameters // ddp_config.num_buckets

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If ddp_config.num_buckets is configured to 0, the division num_parameters // ddp_config.num_buckets will raise a ZeroDivisionError. To prevent potential runtime crashes, we should ensure num_buckets is greater than 0 before performing the division. If it is 0 or invalid, it will safely fall back to the default bucket size calculation below.

Suggested change
if ddp_config.bucket_size is None and getattr(ddp_config, 'num_buckets', None) is not None:
ddp_config.bucket_size = num_parameters // ddp_config.num_buckets
if ddp_config.bucket_size is None and getattr(ddp_config, 'num_buckets', None) is not None and ddp_config.num_buckets > 0:
ddp_config.bucket_size = num_parameters // ddp_config.num_buckets

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant