
Bug Fix for parallel expert encoder for SALM automodel #15814

Open
tango4j wants to merge 25 commits into NVIDIA-NeMo:main from tango4j:pe_encs_pr1

Conversation

@tango4j

tango4j commented Jun 18, 2026

Collaborator

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Lets the ParallelExpertEncoder (PE) bundle load from a HuggingFace/NGC model card
(not just a local .nemo), relaxes multispeaker-ASR RTTM parsing so 9-column
RTTM files are accepted, and hardens the new RTTM parser with explicit validation
and focused unit coverage.

Collection: ASR (with a small touch in SpeechLM2)

Changelog

  • nemo/collections/asr/modules/parallel_expert_encoder.py
    • ParallelExpertEncoderPT.load_from_nemo now follows the standard NeMo
      Model checkpoint-resolution convention: a local .nemo path is restored
      via ModelPT.restore_from, otherwise the argument is treated as a pretrained
      model id (HuggingFace Hub {repo}/{name} or NGC alias) and resolved via
      Model.from_pretrained (which downloads/caches the .nemo and honours the
      HuggingFace cache + HF_HUB_OFFLINE, so a prefetched cache works on offline
      cluster nodes).
    • Renamed the parameter nemo_path to model_path_or_name to reflect that it
      accepts either a local path or a model id; updated the docstring.
  • nemo/collections/speechlm2/parts/pretrained.py
    • setup_parallel_expert_encoder no longer requires model.pe_encoder_path to
      end in .nemo; it accepts any non-empty string (local .nemo path or
      pretrained model id), with a clearer error message.
    • The strict "is this a PE .nemo bundle?" pre-check (is_pe_nemo) now runs
      only for an actual local .nemo file. For HuggingFace/NGC ids the bundle is
      resolved and validated downstream by ParallelExpertEncoderPT.load_from_nemo
      (from_pretrained -> restore_from, which checks the bundle target class).
  • nemo/collections/asr/parts/utils/asr_multispeaker_utils.py
    • Added read_rttm_supervisions_lenient(): a faithful copy of
      lhotse.SupervisionSet.from_rttm (same columns: recording_id=1, channel=2,
      start=3, duration=4, speaker=7; skips zero-duration segments) that relaxes
      the strict len(parts) == 10 check to len(parts) >= 8.
    • Replaced the parser's assert validation with an explicit ValueError so
      malformed RTTM lines are rejected even when Python runs with -O / -OO.
    • speaker_to_target now calls read_rttm_supervisions_lenient instead of
      SupervisionSet.from_rttm when a cut carries an rttm_filepath.
  • tests/collections/asr/utils/test_asr_multispeaker_utils.py
    • Added unit coverage for 9-column RTTM parsing, 10-column RTTM parsing,
      malformed short RTTM lines, blank/zero-duration line handling, multiple RTTM
      file input, and selected helper utilities.
  • nemo/collections/asr/parts/utils/sot_speaker_alignment.py
    • Keeps speaker-activity collation robust when per-example targets have fewer
      or more speaker columns than the configured target speaker count.
  • tests/collections/asr/utils/test_sot_speaker_alignment.py
    • Covers SOT speaker-token parsing and speaker-activity alignment behavior.
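
The speaker-activity collation change in the last two entries can be illustrated with a minimal sketch. It is not the code in sot_speaker_alignment.py: the function names are hypothetical, plain nested lists stand in for tensors, and it assumes that extra speaker columns are truncated and missing ones are zero-padded, which this description does not confirm.

```python
from typing import List

Matrix = List[List[float]]  # rows are frames, columns are speakers


def fit_speaker_columns(target: Matrix, num_speakers: int) -> Matrix:
    """Truncate or zero-pad every frame to exactly num_speakers columns.

    Hypothetical helper for illustration; truncation is an assumption.
    """
    fitted = []
    for row in target:
        row = list(row[:num_speakers])  # drop columns beyond the limit
        row.extend([0.0] * (num_speakers - len(row)))  # pad missing columns
        fitted.append(row)
    return fitted


def collate_speaker_targets(targets: List[Matrix], num_speakers: int) -> List[Matrix]:
    """Build a batch in which every example has the same frame and speaker count."""
    max_frames = max((len(t) for t in targets), default=0)
    batch = []
    for target in targets:
        fitted = fit_speaker_columns(target, num_speakers)
        # Pad shorter examples with all-zero (silent) frames.
        fitted.extend([[0.0] * num_speakers for _ in range(max_frames - len(fitted))])
        batch.append(fitted)
    return batch
```

With num_speakers=4, a 5-speaker example and a 2-speaker example both come out with 4 columns, so stacking them no longer fails on a shape mismatch.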

Why

  • PE encoder loading: PE bundles are distributed as HuggingFace model cards
    (e.g. nvidia/... / taejinp/...). The previous
    setup_parallel_expert_encoder hard-required a local .nemo path and
    rejected model ids with
    model.pe_encoder_path must point to a ParallelExpertEncoderPT .nemo bundle.
    Aligning with the standard restore_from / from_pretrained convention lets a
    recipe set model.pe_encoder_path to either a local file or a model card, and
    works offline once the card is prefetched into the HuggingFace cache.
  • RTTM reading: The nemoSOT multispeaker data ships RTTMs with 9 columns
    (the trailing Signal Lookahead Time, specified as always <NA> and never
    read, is dropped). lhotse's from_rttm asserts exactly 10 fields, so a single
    dataloader worker raised AssertionError: Invalid RTTM line ... and crashed
    multispeaker-ASR training. The parser only reads the fields at zero-based
    indices 1, 2, 3, 4, and 7, so accepting >= 8 fields is sufficient and produces
    identical output for valid 10-field RTTMs.
  • Parser validation: RTTM files are external user/data-pipeline input, so
    validation must not rely on assert, which can be stripped in optimized Python
    mode.
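
The RTTM reading and validation points above can be sketched in a few lines. This is a simplified stand-in, not read_rttm_supervisions_lenient itself: the Segment dataclass and parse_rttm_lines_lenient are hypothetical names, the sketch works on in-memory lines rather than files, and it returns plain dataclasses instead of lhotse supervision objects.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """Hypothetical stand-in for a supervision segment."""

    recording_id: str
    channel: int
    start: float
    duration: float
    speaker: str


def parse_rttm_lines_lenient(lines: Iterable[str], source: str = "<memory>") -> List[Segment]:
    """Parse RTTM lines that have at least 8 whitespace-separated fields."""
    segments = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        # Explicit check instead of assert, so it survives python -O / -OO.
        if len(parts) < 8:
            raise ValueError(
                f"Invalid RTTM line {lineno} in {source}: "
                f"expected >= 8 fields, got {len(parts)}: {line!r}"
            )
        duration = float(parts[4])
        if duration == 0.0:
            continue  # skip zero-duration segments
        segments.append(
            Segment(
                recording_id=parts[1],
                channel=int(parts[2]),
                start=float(parts[3]),
                duration=duration,
                speaker=parts[7],
            )
        )
    return segments
```

Because only indices 1, 2, 3, 4, and 7 are read, a 9-field line and the same line with a trailing tenth field yield equal segments.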

Usage

# 1) Load a ParallelExpertEncoder from a HuggingFace model card or a local .nemo
from nemo.collections.asr.modules.parallel_expert_encoder import ParallelExpertEncoderPT

# HuggingFace Hub id (downloaded/cached; works offline once prefetched):
enc = ParallelExpertEncoderPT.load_from_nemo("nvidia/phPEE-canary-enc-1b-v2-sortformer-v2.1")

# ...or a local bundle:
enc = ParallelExpertEncoderPT.load_from_nemo("/path/to/pe_encoder.nemo")

# In a SpeechLM2 recipe this is driven by:
#   model.pe_encoder_path: nvidia/phPEE-canary-enc-1b-v2-sortformer-v2.1
# or:
#   model.pe_encoder_path: /path/to/pe_encoder.nemo

# 2) Read RTTMs tolerantly (accepts 9- or 10-column lines)
from nemo.collections.asr.parts.utils.asr_multispeaker_utils import read_rttm_supervisions_lenient

sup = read_rttm_supervisions_lenient("/path/to/utt.rttm")
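
For the first usage case, the dispatch between a local bundle and a model id can be pictured roughly as follows. This is a sketch under stated assumptions, not the NeMo implementation: classify_checkpoint_source is a hypothetical helper, and it assumes "local" means an existing file whose name ends in .nemo, with everything else treated as a pretrained model id.

```python
import os


def classify_checkpoint_source(model_path_or_name: str) -> str:
    """Return "local" for an existing .nemo file, otherwise "pretrained".

    Hypothetical helper for illustration only; not part of NeMo.
    """
    if not isinstance(model_path_or_name, str) or not model_path_or_name.strip():
        raise ValueError("model_path_or_name must be a non-empty string")
    if model_path_or_name.endswith(".nemo") and os.path.isfile(model_path_or_name):
        return "local"  # the restore_from branch
    return "pretrained"  # the from_pretrained branch (HuggingFace Hub or NGC id)
```

Under this assumption an id such as "org/model-name" takes the pretrained branch, while a path to a .nemo file that exists on disk takes the local branch.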

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added
to the PR. To re-run CI remove and add the label again. To run CI on an untrusted
fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Both changes are backward compatible: a local .nemo pe_encoder_path behaves
    exactly as before, and 10-field RTTMs parse identically to from_rttm.
  • The RTTM parser now has explicit tests for the 9-column data case and for the
    malformed-input path requested during review.
  • Related to # (issue)

tango4j added 18 commits June 9, 2026 16:47
Signed-off-by: Taejin Park <tango4j@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 18, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the ASR label Jun 18, 2026
@tango4j tango4j requested a review from pzelasko June 18, 2026 21:49
@tango4j tango4j marked this pull request as ready for review June 18, 2026 21:49
Signed-off-by: Taejin Park <tango4j@gmail.com>
@tango4j

tango4j commented Jun 18, 2026

Collaborator Author

/ok to test 4029bf9

@tango4j

tango4j commented Jun 19, 2026

Collaborator Author

/ok to test f4c04a2

@github-actions

Contributor

[🤖]: Hi @tango4j 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

@tango4j

tango4j commented Jun 19, 2026

Collaborator Author

@pzelasko I still need to add a fix for the slow data loader, so please do not merge this yet. I will notify you when it is done. Thanks.

tango4j added 3 commits June 19, 2026 13:52
Signed-off-by: Taejin Park <tango4j@gmail.com>
@tango4j

tango4j commented Jun 20, 2026

Collaborator Author

/ok to test e27fc16

@tango4j

tango4j commented Jun 20, 2026

Collaborator Author

@pzelasko There was a critical bug in the "speaker_targets" tensor collate function. It crashed whenever an example had 5 or 6 speakers while the maximum speaker count was set to 4. This affected about 0.001% of cases (a few dozen utterances) in the training datasets. The faulty 5+ speaker datapoints have been removed from the cluster, and the code has been changed so that it no longer crashes if such examples appear. I also added unit tests for this.

Now this PR is ready for review and merge.

@tango4j

tango4j commented Jun 20, 2026

Collaborator Author

/ok to test e27fc16

tango4j added 2 commits June 21, 2026 08:50
Signed-off-by: Taejin Park <tango4j@gmail.com>
@tango4j

tango4j commented Jun 21, 2026

Collaborator Author

/ok to test f7ae8c8
