Checklist / 检查清单
Bug Description / Bug 描述
Full SFT of google/gemma-4-E2B-it on a mixed dataset (~91% image+text,
~9% text-only) hangs after ~55 steps with an NCCL ALLREDUCE watchdog
timeout. The two ranks time out at the same SeqNum but with different
tensor sizes:
[rank0] ALLREDUCE SeqNum=4176 NumelIn=1
[rank1] ALLREDUCE SeqNum=4176 NumelIn=402653184
Both processes abort with SIGABRT after the 600-second watchdog timeout.
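The mismatched NumelIn values at the same SeqNum suggest the two ranks have drifted onto different collective schedules, plausibly because a text-only batch on one rank skips the vision tower, so its parameters receive no gradients that step. A minimal sketch (toy model, not Gemma; `ToyMultimodal` and `unused_params_after_backward` are hypothetical names) showing how a conditionally executed branch leaves gradients unset:

```python
import torch
import torch.nn as nn

class ToyMultimodal(nn.Module):
    """Stand-in for a multimodal model: the vision branch runs only
    when image features are present (toy illustration, not Gemma)."""
    def __init__(self):
        super().__init__()
        self.vision = nn.Linear(4, 4)
        self.text = nn.Linear(4, 4)

    def forward(self, text_feats, image_feats=None):
        out = self.text(text_feats)
        if image_feats is not None:
            out = out + self.vision(image_feats)
        return out

def unused_params_after_backward(model, loss):
    """Return names of parameters that received no gradient this step."""
    model.zero_grad(set_to_none=True)
    loss.backward()
    return sorted(n for n, p in model.named_parameters() if p.grad is None)

model = ToyMultimodal()
x = torch.randn(2, 4)
# Text-only batch: the vision branch never runs, so its parameters get
# no gradient. On a real distributed run, a rank seeing such a batch can
# issue a different sequence of gradient all-reduces than a rank whose
# batch contains images, which is one way ranks end up at the same
# SeqNum with different tensor sizes.
print(unused_params_after_backward(model, model(x).sum()))
```

With an image batch (`image_feats` supplied) the returned list is empty, since every parameter participates in the backward pass.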
How to Reproduce / 如何复现
Environment:
- ms-swift: 4.1.1
- transformers: 5.5.4
- deepspeed: 0.18.9
- torch: 2.10.0+cu128
- GPU: A100-SXM4-80GB × 2 (compute capability 8.0)
Command:
swift sft --model google/gemma-4-E2B-it \
  --tuner_type full --torch_dtype bfloat16 --attn_impl sdpa \
  --freeze_vit false --freeze_aligner false \
  --deepspeed zero2 --ddp_find_unused_parameters true \
  --per_device_train_batch_size 2 --gradient_accumulation_steps 4 \
  --max_length 4096 \
  --dataset
Dataset composition: ~91% image+text samples, ~9% text-only samples.
Training proceeds normally until around step 56, then hangs.
Additional Information / 补充信息
Is there a recommended way to train Gemma4 with a mix of image-text and text-only samples, or is the text-only portion expected to be filtered out for multimodal full SFT?
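One workaround we are considering (an assumption on our part, not a confirmed ms-swift recommendation) is to attach a small blank image to text-only samples at preprocessing time, so every batch exercises the vision tower and all ranks issue the same sequence of gradient all-reduces. A hypothetical sketch, assuming samples are dicts with an optional `images` list and that the pipeline accepts image tensors:

```python
import torch

def pad_text_only(sample, size=(3, 32, 32)):
    """Hypothetical preprocessing hook: give text-only samples a tiny
    blank image tensor so the vision tower runs on every batch.
    `sample` is assumed to be a dict with an optional 'images' list;
    samples that already have images are returned unchanged."""
    if not sample.get("images"):
        sample = dict(sample, images=[torch.zeros(size)])
    return sample

# Text-only sample gains a single blank 3x32x32 image.
padded = pad_text_only({"messages": []})
print(len(padded["images"]), padded["images"][0].shape)
```

This trades a small amount of wasted vision compute on ~9% of samples for uniform collective schedules across ranks; whether it is preferable to filtering the text-only portion out is exactly what we would like guidance on.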