
Gemma4 full SFT + ZeRO-2 hangs on NCCL ALLREDUCE when data mixes text-only and image-text samples #9104

@zealota

Description

Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

Full SFT of google/gemma-4-E2B-it on a mixed dataset (~91% image+text,
~9% text-only) hangs after ~55 steps with an NCCL ALLREDUCE watchdog
timeout. The two ranks time out at the same SeqNum but with different
tensor sizes:

[rank0] ALLREDUCE SeqNum=4176 NumelIn=1
[rank1] ALLREDUCE SeqNum=4176 NumelIn=402653184

Both processes SIGABRT after the 600s watchdog timeout.

How to Reproduce

Environment:

  • ms-swift: 4.1.1
  • transformers: 5.5.4
  • deepspeed: 0.18.9
  • torch: 2.10.0+cu128
  • GPU: A100-SXM4-80GB × 2 (compute capability 8.0)

Command:

swift sft --model google/gemma-4-E2B-it \
  --tuner_type full --torch_dtype bfloat16 --attn_impl sdpa \
  --freeze_vit false --freeze_aligner false \
  --deepspeed zero2 --ddp_find_unused_parameters true \
  --per_device_train_batch_size 2 --gradient_accumulation_steps 4 \
  --max_length 4096 \
  --dataset

Dataset composition: ~91% image+text samples, ~9% text-only samples.
Training proceeds normally until around step 56, then hangs.

Additional Information

Is there a recommended way to train Gemma4 with a mix of image-text and text-only samples, or is the text-only portion expected to be filtered out for multimodal full SFT?

Labels: bug