nvidia-modeset: suspend crash (jump_label BUG) caused by missing objtool NOP conversion in DKMS build #1095

@alexanderlerch-arch

NVIDIA Open GPU Kernel Modules Version

595.58.03

Operating System and Version

Ubuntu 26.04 LTS (Resolute Raccoon)

Kernel Release

Linux 7.0.0-10-generic x86_64

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Build Command

No build done, but the template forces me to provide a command.
dkms build nvidia/595.58.03 -k 7.0.0-10-generic

Terminal output/Build Log

Actually also no build done, but the template forces me to provide some log.
objtool is skipped during DKMS build due to pre-compiled blobs lacking -mfunction-return=thunk-extern (see #1077, #1021). This results in static_branch JMPs not being converted to NOPs, causing a runtime crash on suspend.

More Info

Hey,

I had a hard time debugging this, but I finally had some success and found the potential root cause of the bug.
In short, I cannot suspend to S3 (deep) and resume: it always ends in a crash. After some digging to isolate the problematic function, I can now suspend and resume again, and it may come down to nothing more than objtool being skipped during the build.

Here is some AI slop, with insights:

nvidia-modeset: jump_label BUG on suspend — missing objtool NOP conversion in DKMS build

Description

nvidia-modeset.ko built via DKMS crashes on suspend with a jump_label: Fatal kernel bug on kernels ≥ 6.0 (x86_64, CONFIG_HAVE_JUMP_LABEL_HACK=y).

The DKMS build does not run objtool on the compiled shim objects (nvidia-modeset-linux.c, nv-kthread-q.c). With CONFIG_HAVE_JUMP_LABEL_HACK, GCC emits JMP instructions at static_branch_unlikely() sites that objtool is supposed to convert to NOPs (see comment in arch/x86/include/asm/jump_label.h: "jmp %l[l_yes] # objtool NOPs this").

Before kernel 6.0, jump_label_apply_nops() in arch/x86/kernel/module.c corrected this at module load time. Commit fdfd42892f31 (June 2022, merged in 6.0) removed that function for x86, expecting modules to ship correct instructions.

The result: nvidia-modeset.ko has JMPs where the kernel expects NOPs. On suspend, the kernel enables freezer_active and tries to patch NOP→JMP at these sites — finds JMP instead of NOP — triggers BUG().
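
The check that fails here can be modelled in userspace. This is a simplified sketch, not the kernel's code: `expected_before_enable` and `site_ok_before_enable` are hypothetical names, and the byte patterns are the ones quoted in this report. It only shows the idea that the patcher compares the bytes at the site against what it expects before rewriting them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* NOP encodings quoted in this report (5-byte and 2-byte forms). */
static const uint8_t NOP5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
static const uint8_t NOP2[2] = { 0x66, 0x90 };

/*
 * Hypothetical helper: the bytes a patcher expects to find at a site
 * just before it enables the branch (NOP -> JMP).
 */
static const uint8_t *expected_before_enable(size_t size)
{
    assert(size == 2 || size == 5);
    return size == 5 ? NOP5 : NOP2;
}

/*
 * Returns true if the site holds the expected NOP. A false result is the
 * situation in which the kernel reports "unexpected op" and calls BUG().
 */
static bool site_ok_before_enable(const uint8_t *site, size_t size)
{
    return memcmp(site, expected_before_enable(size), size) == 0;
}
```

Feeding it the bytes from the crash (`e9 9b 00 00 00`, size 5) returns false, while `0f 1f 44 00 00` returns true.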

Error signature

jump_label: Fatal kernel bug, unexpected op at nvkms_kthread_q_callback+0x8e/0x1a0 [nvidia_modeset]
  (e9 9b 00 00 00 != 0f 1f 44 00 00)) size:5 type:1
kernel BUG at arch/x86/kernel/jump_label.c:73!

Followed by GSP Heartbeat Timeout, GPU lost, hard reset required.
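
The bytes in the error signature can be decoded by hand: `e9` is a JMP with a 32-bit little-endian displacement (`0x9b` here), taken relative to the end of the 5-byte instruction. With the site at +0x8e, the target is 0x8e + 5 + 0x9b = +0x12e, which lies inside the 0x1a0-byte function. A small sketch of that arithmetic follows; `decode_jmp` is a hypothetical helper written for this illustration, not a kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decode an x86 JMP at function offset site_off.
 * Handles e9 (rel32) and eb (rel8); returns false for anything else.
 * On success, *target receives the destination as a function offset.
 */
static bool decode_jmp(const uint8_t *insn, long site_off, long *target)
{
    assert(insn != 0 && target != 0);

    if (insn[0] == 0xe9) {
        /* Displacement is little-endian, relative to the next instruction. */
        uint32_t raw = (uint32_t)insn[1] |
                       ((uint32_t)insn[2] << 8) |
                       ((uint32_t)insn[3] << 16) |
                       ((uint32_t)insn[4] << 24);
        *target = site_off + 5 + (int32_t)raw;
        return true;
    }
    if (insn[0] == 0xeb) {
        *target = site_off + 2 + (int8_t)insn[1];
        return true;
    }
    return false;
}
```

Called with the five bytes from the log and `site_off = 0x8e`, it yields `0x12e`; the NOP pattern `0f 1f 44 00 00` is rejected as not being a JMP.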

Affected functions in nvidia-modeset

All functions calling nvkms_read_lock_pm_lock(), which inlines try_to_freeze() → freezing() → static_branch_unlikely(&freezer_active):

  • nvkms_kthread_q_callback (5-byte JMP e9 at +0x8e)
  • nvkms_close_pm_unlocked (2-byte JMP eb at +0x26)
  • nvkms_open_from_kapi (2-byte JMP eb at +0x34)
  • nvkms_close_from_kapi (2-byte JMP eb at +0x26)
  • nvkms_ioctl_from_kapi (2-byte JMP eb at +0x36)

How I verified this

1. Disassembly of nvidia-modeset.ko vs a test module

A minimal GPL module (test_freezer.ko) compiled against the same kernel headers with identical try_to_freeze() usage has correct NOPs at static_branch sites. nvidia-modeset.ko has JMPs.

test_freezer.ko .text+0x21:  66 90              ← 2-byte NOP (correct)
nvidia-modeset.ko .text+0x9ae: e9 9b 00 00 00   ← 5-byte JMP (wrong)

2. Runtime memory verification

Reading the loaded module's memory via /proc/kcore:

# nvidia-modeset WITHOUT fix:
0xffffffffc04d09ae: 0xe9 0x9b 0x00 0x00 0x00   ← JMP still in memory (not patched to NOP)

# nvidia-modeset WITH fix (text_poke to NOP):
0xffffffffc04d09ae: 0x0f 0x1f 0x44 0x00 0x00   ← NOP (patched by fix module)

3. Suspend tests

Condition                                      | Result
Suspend with nvidia-modeset (unpatched)        | jump_label: Fatal kernel bug → crash
Suspend without nvidia-modeset (rmmod)         | Clean S3 suspend/resume
Suspend with nvidia-modeset + runtime NOP fix  | Clean S3 suspend/resume

4. The __jump_table entries confirm freezer_active

$ readelf -r nvidia-modeset.ko | grep freezer_active
000000000008  ...  R_X86_64_PC64  freezer_active + 2
000000000018  ...  R_X86_64_PC64  freezer_active + 2
000000000048  ...  R_X86_64_PC64  freezer_active + 2
000000000058  ...  R_X86_64_PC64  freezer_active + 2
000000000068  ...  R_X86_64_PC64  freezer_active + 2
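
To read these offsets, it helps to assume the x86_64 relative `__jump_table` layout: a 32-bit code offset, a 32-bit target offset, and a 64-bit key, 16 bytes per entry. Under that assumption each relocation above lands on the key field (offset 8) of one entry, giving entries 0, 1, 4, 5 and 6, the same count as the five functions listed earlier. The sketch below only does that arithmetic; `struct jump_entry_model` and `key_reloc_to_entry_index` are illustrative names, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed model of one 16-byte __jump_table entry on x86_64. */
struct jump_entry_model {
    int32_t code;    /* site address, relative to this field */
    int32_t target;  /* jump target, relative to this field */
    int64_t key;     /* static key, relative, low bits used as flags */
};

/*
 * Map a relocation offset inside __jump_table to the index of the entry
 * whose key field it patches. Returns -1 if the offset is not a key field.
 */
static long key_reloc_to_entry_index(unsigned long reloc_off)
{
    const unsigned long size = sizeof(struct jump_entry_model);
    const unsigned long key_off = offsetof(struct jump_entry_model, key);

    assert(size == 16 && key_off == 8);
    if (reloc_off % size != key_off)
        return -1;
    return (long)(reloc_off / size);
}
```

For example, offset `0x48` maps to entry 4, while an offset such as `0x04` (a target field) maps to -1.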

Tested driver versions (all crash the same way)

  • nvidia-driver-595-open (595.58.03)
  • nvidia-driver-595 proprietary (595.58.03)
  • nvidia-driver-590-open
  • nvidia-driver-590 proprietary
  • nvidia-driver-580-open (580.126.09)
  • nvidia-driver-580 proprietary

System info

  • Kernel: 7.0.0-10-generic (Ubuntu 26.04 LTS)
  • GPU: NVIDIA GeForce RTX 3060 Ti (GA104)
  • CPU: AMD Ryzen Threadripper 3960X
  • Board: MSI TRX40 PRO WIFI
  • CONFIG_HAVE_JUMP_LABEL_HACK=y
  • CONFIG_JUMP_LABEL=y

Affected kernel versions

Any kernel ≥ 6.0 on x86_64 with CONFIG_HAVE_JUMP_LABEL_HACK=y.

Before 6.0, jump_label_apply_nops() masked the bug by patching all module jump_label sites at load time.

Related issues

This is a direct consequence of the objtool incompatibility tracked in #1077 and #1021 (both referenced above).

In #1077, an NVIDIA collaborator confirmed that the blobs are not yet built with -mfunction-return=thunk-extern and suggested disabling CONFIG_OBJTOOL_WERROR or setting OBJECT_FILES_NON_STANDARD := y as a workaround. Both workarounds disable objtool validation entirely, which also skips the JMP→NOP conversion for static_branch sites — directly causing the suspend crash described here.

Suggested fix

The pre-compiled blobs need to be compiled with the correct compiler flags so that objtool validation passes. This would allow objtool to run during the DKMS build, which would convert the static_branch JMPs to NOPs as the kernel expects.

If fixing the blobs is not feasible short-term, nvidia-modeset's module init could patch its own __jump_table entries to convert JMPs→NOPs for disabled keys (equivalent to the removed jump_label_apply_nops()).

Workaround

A runtime fix module that patches JMPs→NOPs via text_poke() after nvidia-modeset loads:

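/* x86 NOP encodings the kernel expects at disabled static_branch sites */
static const u8 nop5[] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
static const u8 nop2[] = { 0x66, 0x90 };

/*
 * Iterate nvidia_modeset's jump_entries and patch JMPs to NOPs.
 * iter/stop bound the module's __jump_table. A NOP is only the correct
 * default when the key is disabled and the site is a
 * static_branch_unlikely() one, hence the jump_entry_is_branch() check.
 */
for (; iter < stop; iter++) {
    struct static_key *key = jump_entry_key(iter);
    if (!static_key_enabled(key) && !jump_entry_is_branch(iter)) {
        u8 *addr = (u8 *)jump_entry_code(iter);
        if (addr[0] == 0xe9)        /* 5-byte JMP rel32 */
            text_poke(addr, nop5, 5);
        else if (addr[0] == 0xeb)   /* 2-byte JMP rel8 */
            text_poke(addr, nop2, 2);
    }
}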

Full source available on request.
