Hi @warner-benjamin @ohmeow,
Nice work on the ModernBERT project — I’ve been learning a lot from it.
I have one question regarding the masking logic in ModernBERT/src/sequence_packer.py (lines 264 to 281 at 8c57a0f):
```python
(masked_batch, labels) = SequencePacker.mlm_masking(
    batch, self.mask_prob, self.mask_token_id, self.pad_token_id, self.ignore_token_id, self.np_rng
)
yieldval = {
    "input_ids": torch.from_numpy(masked_batch),
    "labels": torch.from_numpy(labels),
    "cu_seqlens": cu_seq_lens,
    "max_seqlen": max_seq_lens,
    "attention_mask": torch.from_numpy(np.where(batch == self.pad_token_id, 0, 1)),
}
self._token_count += yieldval["attention_mask"].sum().item()
# # assert isinstance(yieldval[0], torch.Tensor), f"Unexpected {type(yieldval[0])=}"
# if not self.suppress_masking:
#     assert isinstance(yieldval[1], torch.Tensor), f"Unexpected {type(yieldval[1])=}"
# assert isinstance(yieldval[2], list), f"Unexpected {type(yieldval[2])=}"
# if yieldval[2]:
#     assert isinstance(yieldval[2][0], torch.Tensor), f"Unexpected {type(yieldval[2][0])=}"
yield yieldval
```
From what I understand, masking is applied after the sequence packing step.
This means that the masking probability is applied across the entire packed sequence (pseq), without regard to the original sample boundaries. As a result, some original samples inside a packed sequence might end up with no masked tokens at all.
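To make that concrete, here is a back-of-the-envelope sketch (my own illustration, not code from the repo): assuming each token is selected for masking i.i.d. with probability `p`, a sample of length `n` inside the packed sequence ends up with zero masked tokens with probability `(1 - p) ** n`, which is non-negligible for short samples.

```python
# Illustration only (not from the repo): probability that a sample of
# length n gets zero masked tokens under i.i.d. token masking at rate p.
import numpy as np

p = 0.30  # masking rate used here only for illustration
for n in (5, 10, 20, 50):
    print(f"len={n:3d}  P(no masked tokens) = {(1 - p) ** n:.4f}")

# Empirical check on one packed sequence with hypothetical boundaries:
rng = np.random.default_rng(0)
packed_len = 128
cu_seqlens = np.array([0, 8, 40, 128])  # hypothetical sample boundaries
mask = rng.random(packed_len) < p
per_sample = [int(mask[a:b].sum()) for a, b in zip(cu_seqlens[:-1], cu_seqlens[1:])]
print("masked tokens per sample:", per_sample)
```

For example, at `p = 0.30` an 8-token sample has roughly a 5.8% chance of receiving no masks at all, so it would contribute nothing to the MLM loss for that step.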
I was curious about the intent behind applying masking at the packed-sequence level rather than per original sample.
Could you share the reasoning or trade-offs for this design choice?
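For clarity, what I mean by per-sample masking is something like the following hypothetical sketch (not the repo's code; I'm assuming a 1-D packed array and `cu_seqlens`-style boundaries, and I'm omitting the usual 80/10/10 replacement split):

```python
# Hypothetical sketch: mask each original sample inside the packed
# sequence independently, so short samples can't be skipped entirely.
import numpy as np

def per_sample_mlm_masking(packed, cu_seqlens, mask_prob, mask_token_id,
                           ignore_token_id, rng):
    masked = packed.copy()
    labels = np.full_like(packed, ignore_token_id)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        seg_len = end - start
        mask = rng.random(seg_len) < mask_prob
        if not mask.any():
            # optionally guarantee at least one masked token per sample
            mask[rng.integers(seg_len)] = True
        labels[start:end][mask] = packed[start:end][mask]
        masked[start:end][mask] = mask_token_id  # 80/10/10 split omitted
    return masked, labels
```

I realize the single `mlm_masking` call over the whole packed sequence is simpler and gives the same masking rate in expectation, so I'm mainly wondering whether the per-sample variance was considered and judged not to matter in practice.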
Thanks,