Bug Fix for parallel expert encoder for SALM automodel #15814
Signed-off-by: Taejin Park <tango4j@gmail.com>
/ok to test 4029bf9
/ok to test f4c04a2
[🤖]: Hi @tango4j 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
@pzelasko I still need to add a fix for the slow data loader, so please do not merge this yet. I will notify you when I finish fixing it. Thanks.
/ok to test e27fc16
@pzelasko There was a critical bug in the "speaker_targets" tensor collate function. It crashed whenever an utterance had 5 or 6 speakers while the maximum number of speakers was set to 4. This affected about 0.001% of cases (a few dozen utterances) in the training datasets. The faulty 5+ speaker datapoints have been removed from the cluster, and the code was changed so that it no longer crashes if this happens. I also added unit tests for this. This PR is now ready for review and merge.
/ok to test e27fc16
/ok to test f7ae8c8
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
Lets the ParallelExpertEncoder (PE) bundle load from a HuggingFace/NGC model card (not just a local `.nemo`), relaxes multispeaker-ASR RTTM parsing so 9-column RTTM files are accepted, and hardens the new RTTM parser with explicit validation and focused unit coverage.
Collection: ASR (with a small touch in SpeechLM2)
Changelog
- `nemo/collections/asr/modules/parallel_expert_encoder.py`
  - `ParallelExpertEncoderPT.load_from_nemo` now follows the standard NeMo `Model` checkpoint-resolution convention: a local `.nemo` path is restored via `ModelPT.restore_from`, otherwise the argument is treated as a pretrained model id (HuggingFace Hub `{repo}/{name}` or NGC alias) and resolved via `Model.from_pretrained` (which downloads/caches the `.nemo` and honours the HuggingFace cache + `HF_HUB_OFFLINE`, so a prefetched cache works on offline cluster nodes).
  - Renamed `nemo_path` to `model_path_or_name` to reflect that it accepts either a local path or a model id; updated the docstring.
- `nemo/collections/speechlm2/parts/pretrained.py`
  - `setup_parallel_expert_encoder` no longer requires `model.pe_encoder_path` to end in `.nemo`; it accepts any non-empty string (local `.nemo` path or pretrained model id), with a clearer error message.
  - [...] `.nemo` bundle?" pre-check (`is_pe_nemo`) now runs only for an actual local `.nemo` file. For HuggingFace/NGC ids the bundle is resolved and validated downstream by `ParallelExpertEncoderPT.load_from_nemo` (`from_pretrained` -> `restore_from`, which checks the bundle target class).
- `nemo/collections/asr/parts/utils/asr_multispeaker_utils.py`
  - `read_rttm_supervisions_lenient()`: a faithful copy of `lhotse.SupervisionSet.from_rttm` (same columns: recording_id=1, channel=2, start=3, duration=4, speaker=7; skips zero-duration segments) that relaxes the strict `len(parts) == 10` check to `len(parts) >= 8`.
  - Replaced `assert` validation with an explicit `ValueError` so malformed RTTM lines are rejected even when Python runs with `-O`/`-OO`.
  - `speaker_to_target` now calls `read_rttm_supervisions_lenient` instead of `SupervisionSet.from_rttm` when a cut carries an `rttm_filepath`.
- `tests/collections/asr/utils/test_asr_multispeaker_utils.py`
  - [...] malformed short RTTM lines, blank/zero-duration line handling, multiple RTTM file input, and selected helper utilities.
- `nemo/collections/asr/parts/utils/sot_speaker_alignment.py`
  - [...] or more speaker columns than the configured target speaker count.
- `tests/collections/asr/utils/test_sot_speaker_alignment.py`
Why
- [...] (e.g. nvidia/.../taejinp/...). The previous `setup_parallel_expert_encoder` hard-required a local `.nemo` path and rejected model ids with `model.pe_encoder_path must point to a ParallelExpertEncoderPT .nemo bundle.` Aligning with the standard `restore_from`/`from_pretrained` convention lets a recipe set `model.pe_encoder_path` to either a local file or a model card, and works offline once the card is prefetched into the HuggingFace cache.
- [...] (the trailing Signal Lookahead Time, specified as always `<NA>` and never read, is dropped). lhotse's `from_rttm` asserts exactly 10 fields, so a single dataloader worker raised `AssertionError: Invalid RTTM line ...` and crashed multispeaker-ASR training. The parser only uses columns 1,2,3,4,7, so accepting `>= 8` fields is sufficient and produces identical output for valid 10-field RTTMs.
- [...] validation must not rely on `assert`, which can be stripped in optimized Python mode.
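The lenient parsing rule described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code added in this PR: `Segment` and `parse_rttm_lines_lenient` are hypothetical names, the sketch works on in-memory lines instead of lhotse supervisions, and skipping blank lines is an assumption (the description only says blank-line handling is tested).

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    """Hypothetical stand-in for a supervision segment."""
    recording_id: str
    channel: int
    start: float
    duration: float
    speaker: str


def parse_rttm_lines_lenient(lines: Iterable[str]) -> List[Segment]:
    """Parse RTTM lines that have at least 8 whitespace-separated fields."""
    segments: List[Segment] = []
    for line_no, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # assumption: blank lines are skipped, not rejected
        # Explicit check instead of `assert`, so it survives `python -O`.
        if len(parts) < 8:
            raise ValueError(
                f"Invalid RTTM line {line_no}: expected >= 8 fields, "
                f"got {len(parts)}: {line!r}"
            )
        # Only fields 1, 2, 3, 4 and 7 are read (0-based, after the type tag).
        recording_id, channel = parts[1], int(parts[2])
        start, duration = float(parts[3]), float(parts[4])
        if duration == 0.0:
            continue  # zero-duration segments are dropped
        segments.append(Segment(recording_id, channel, start, duration, parts[7]))
    return segments


# A 10-field line and the same line without the trailing field parse identically.
ten = "SPEAKER rec1 1 0.50 1.25 <NA> <NA> spk0 <NA> <NA>"
nine = "SPEAKER rec1 1 0.50 1.25 <NA> <NA> spk0 <NA>"
assert parse_rttm_lines_lenient([ten]) == parse_rttm_lines_lenient([nine])
```

Raising `ValueError` rather than asserting keeps the malformed-line check active in optimized mode, which is the property the last bullet asks for.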
Usage
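As a rough illustration of the path-or-id dispatch described in the changelog, the sketch below only classifies the argument and does not import or call NeMo. `resolve_pe_source` is a hypothetical helper, the error text is illustrative, and treating "an existing file ending in `.nemo`" as the local case is an assumption about how the real check is written.

```python
import os


def resolve_pe_source(model_path_or_name: str) -> str:
    """Return which loading route a PE encoder argument would take.

    Hypothetical helper: returns "restore_from" for a local .nemo file and
    "from_pretrained" for anything else (treated as a pretrained model id).
    """
    if not isinstance(model_path_or_name, str) or not model_path_or_name.strip():
        # Illustrative message, not the one used in the PR.
        raise ValueError("pe_encoder_path must be a non-empty string")
    # Assumption: a local bundle is an existing file with a .nemo suffix.
    if model_path_or_name.endswith(".nemo") and os.path.isfile(model_path_or_name):
        return "restore_from"
    # Otherwise the string is handed on as a HuggingFace Hub or NGC model id.
    return "from_pretrained"


# "some-org/some-pe-model" is a made-up id used only for illustration.
assert resolve_pe_source("some-org/some-pe-model") == "from_pretrained"
```

In a recipe, this corresponds to setting `model.pe_encoder_path` to either a local `.nemo` file or a model card id.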
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added
to the PR. To re-run CI remove and add the label again. To run CI on an untrusted
fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information
- [...] `.nemo` `pe_encoder_path` behaves exactly as before, and 10-field RTTMs parse identically to `from_rttm`.
- [...] malformed-input path requested during review.