Checklist / 检查清单
Bug Description / Bug 描述
Full SFT of google/gemma-4-E2B-it on a mixed dataset (~91% image+text,
~9% text-only) hangs after ~55 steps with an NCCL ALLREDUCE watchdog
timeout. The two ranks time out at the same SeqNum but with different
tensor sizes:
[rank0] ALLREDUCE SeqNum=4176 NumelIn=1
[rank1] ALLREDUCE SeqNum=4176 NumelIn=402653184
Both processes abort with SIGABRT after the 600-second watchdog timeout.
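The mismatched NumelIn values at the same SeqNum suggest the two ranks have drifted onto different collective schedules, plausibly because a text-only batch on one rank skips the vision tower, so its parameters receive no gradients that step. A minimal sketch (toy model, not Gemma; `ToyMultimodal` and `unused_params_after_backward` are hypothetical names) showing how a conditionally executed branch leaves gradients unset:

```python
import torch
import torch.nn as nn

class ToyMultimodal(nn.Module):
    """Stand-in for a multimodal model: the vision branch runs only
    when image features are present (toy illustration, not Gemma)."""
    def __init__(self):
        super().__init__()
        self.vision = nn.Linear(4, 4)
        self.text = nn.Linear(4, 4)

    def forward(self, text_feats, image_feats=None):
        out = self.text(text_feats)
        if image_feats is not None:
            out = out + self.vision(image_feats)
        return out

def unused_params_after_backward(model, loss):
    """Return names of parameters that received no gradient this step."""
    model.zero_grad(set_to_none=True)
    loss.backward()
    return sorted(n for n, p in model.named_parameters() if p.grad is None)

model = ToyMultimodal()
x = torch.randn(2, 4)
# Text-only batch: the vision branch never runs, so its parameters get
# no gradient. On a real distributed run, a rank seeing such a batch can
# issue a different sequence of gradient all-reduces than a rank whose
# batch contains images, which is one way ranks end up at the same
# SeqNum with different tensor sizes.
print(unused_params_after_backward(model, model(x).sum()))
```

With an image batch (`image_feats` supplied) the returned list is empty, since every parameter participates in the backward pass.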
How to Reproduce / 如何复现
Environment:
- ms-swift: 4.1.1
- transformers: 5.5.4
- deepspeed: 0.18.9
- torch: 2.10.0+cu128
- GPU: A100-SXM4-80GB × 2 (compute capability 8.0)
Command:
swift sft --model google/gemma-4-E2B-it \
  --tuner_type full --torch_dtype bfloat16 --attn_impl sdpa \
  --freeze_vit false --freeze_aligner false \
  --deepspeed zero2 --ddp_find_unused_parameters true \
  --per_device_train_batch_size 2 --gradient_accumulation_steps 4 \
  --max_length 4096 \
  --dataset
Dataset composition: ~91% image+text samples, ~9% text-only samples.
Training proceeds normally until around step 56, then hangs.
Additional Information / 补充信息
Is there a recommended way to train Gemma4 with a mix of image-text and text-only samples, or is the text-only portion expected to be filtered out for multimodal full SFT?
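One workaround we are considering (an assumption on our part, not a confirmed ms-swift recommendation) is to attach a small blank image to text-only samples at preprocessing time, so every batch exercises the vision tower and all ranks issue the same sequence of gradient all-reduces. A hypothetical sketch, assuming samples are dicts with an optional `images` list and that the pipeline accepts image tensors:

```python
import torch

def pad_text_only(sample, size=(3, 32, 32)):
    """Hypothetical preprocessing hook: give text-only samples a tiny
    blank image tensor so the vision tower runs on every batch.
    `sample` is assumed to be a dict with an optional 'images' list;
    samples that already have images are returned unchanged."""
    if not sample.get("images"):
        sample = dict(sample, images=[torch.zeros(size)])
    return sample

# Text-only sample gains a single blank 3x32x32 image.
padded = pad_text_only({"messages": []})
print(len(padded["images"]), padded["images"][0].shape)
```

This trades a small amount of wasted vision compute on ~9% of samples for uniform collective schedules across ranks; whether it is preferable to filtering the text-only portion out is exactly what we would like guidance on.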