fix(geneformer): add nvidia-resiliency-ext>=0.6.0 dependency #1600
svc-bionemo wants to merge 1 commit into
megatron-core now requires nvidia-resiliency-ext>=0.6.0 at import time (via dist_checkpointing.strategies.nvrx.has_nvrx_async_support). The base container image (nvcr.io/nvidia/pytorch:26.04-py3) ships an older version, causing an AssertionError during test collection for all tests that import geneformer.convert (which transitively imports nemo.lightning → megatron.core).

Add the explicit dependency pin so pip installs the required version during the Docker build.

Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
Closing in favor of #1598, which resolves the same issue.
Why this approach failed: Adding
Fix in #1598: Uses a
Problem

The nightly CI run for models/geneformer fails with an AssertionError during test collection.

All 4 failing tests (test_geneformer_checkpoint_loss, test_geneformer_checkpoint_weight_compatibility, test_te_bert_layer_and_hf_bert_layer_similar_output_values_random_inputs, test_geneformer_model_loss_validity) hit this at import time through the chain: geneformer.convert → nemo.lightning → megatron.core → dist_checkpointing.strategies.nvrx.has_nvrx_async_support()

Root Cause

megatron-core now requires nvidia-resiliency-ext >= 0.6.0 at import time, but the base container image (nvcr.io/nvidia/pytorch:26.04-py3) ships an older version. Because the geneformer pyproject.toml did not declare this transitive dependency, pip install during the Docker build did not upgrade it.

Fix

Add nvidia-resiliency-ext>=0.6.0 as an explicit dependency in bionemo-recipes/models/geneformer/pyproject.toml.

CI Run
https://github.com/NVIDIA-BioNeMo/bionemo-framework/actions/runs/26945822391
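
To check inside a container whether the installed nvidia-resiliency-ext meets the 0.6.0 floor described under Root Cause, a small diagnostic along these lines could be run before test collection. This is a minimal sketch, not part of this PR: the helper names (release_tuple, meets_minimum, check_distribution) are hypothetical, and the version comparison is a simplification that looks only at the leading numeric release segment rather than implementing full PEP 440 ordering.

```python
import re
from importlib import metadata


def release_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, e.g. '0.6.0rc1' -> (0, 6, 0).

    Simplification: pre-release, post-release, and local suffixes are ignored,
    so this is not a full PEP 440 comparison.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions by their numeric release segments only."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so that '0.6' compares equal to '0.6.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_distribution(name: str, minimum: str) -> str:
    """Report whether the named distribution is installed and new enough."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed (need >= {minimum})"
    status = "ok" if meets_minimum(installed, minimum) else "too old"
    return f"{name}: {installed} ({status}, need >= {minimum})"


if __name__ == "__main__":
    # Distribution name and floor are taken from the PR description above.
    print(check_distribution("nvidia-resiliency-ext", "0.6.0"))
```

Running this in the built image reports the version pip actually resolved, which shows whether the declared dependency took effect without having to import megatron.core.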