ce: fix CeUtils scheduling left paused on error paths in kceTopLevelPceLceMappingsUpdate #1210

Open
rawrmonster17 wants to merge 1 commit into
NVIDIA:main from
rawrmonster17:fix/ce-utils-pause-leak
@rawrmonster17

Problem

kceTopLevelPceLceMappingsUpdate_IMPL() calls cePauseCeUtilsScheduling()
to block RM-internal Copy Engine submissions while PCE-LCE mappings are
being reconfigured. The matching ceResumeCeUtilsScheduling() is only
reached on the success path, leaving two error paths that return without
resuming:

  1. NV_ASSERT_OK_OR_RETURN() on rmapiControlCacheFreeForControl():
     the macro returns immediately on failure, bypassing the resume call.
  2. The explicit return status when
     NV2080_CTRL_CMD_CE_UPDATE_PCE_LCE_MAPPINGS_V2 fails.

When either path fires, CeUtils submission stays permanently paused for
the lifetime of the GPU instance. Subsequent RM-internal CE operations
(memory scrubbing, allocation init) will stall or fail silently.
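
To illustrate how an early-return macro can leave a pause unbalanced, here is a minimal, self-contained sketch. It does not use the real RM code: SKETCH_OK_OR_RETURN, sketchPause, sketchResume, sketchCacheFree and the status values are all hypothetical stand-ins, and the macro only approximates the early-return behaviour described for NV_ASSERT_OK_OR_RETURN().

```c
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names exist in the RM codebase. */
typedef int SketchStatus;
#define SKETCH_OK    0
#define SKETCH_ERROR 1

/* Approximates an "OK or return" macro: leaves the enclosing function
 * as soon as the expression yields a non-OK status. */
#define SKETCH_OK_OR_RETURN(expr)          \
    do {                                   \
        SketchStatus s_ = (expr);          \
        if (s_ != SKETCH_OK) return s_;    \
    } while (0)

static bool g_paused = false;

static void sketchPause(void)  { g_paused = true;  }
static void sketchResume(void) { g_paused = false; }

/* Simulated fallible step, standing in for the cache-free call. */
static SketchStatus sketchCacheFree(bool fail)
{
    return fail ? SKETCH_ERROR : SKETCH_OK;
}

/* Leaky shape: the macro returns before sketchResume() is reached. */
SketchStatus sketchUpdateLeaky(bool failCacheFree)
{
    sketchPause();
    SKETCH_OK_OR_RETURN(sketchCacheFree(failCacheFree));
    sketchResume();
    return SKETCH_OK;
}

bool sketchIsPaused(void) { return g_paused; }
```

Calling sketchUpdateLeaky(true) returns SKETCH_ERROR and leaves sketchIsPaused() reporting true, which is the unbalanced state this PR is concerned with.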

Fix

Convert both early returns to goto cleanup so ceResumeCeUtilsScheduling()
is unconditionally called after the pause, regardless of which path is
taken. The NV_ASSERT_OK_OR_RETURN() is replaced with an explicit status
check so the error is captured in status before jumping to cleanup.

cleanup:
    ceResumeCeUtilsScheduling(pGpu);
    return status;

This matches the standard cleanup-label pattern used throughout the RM
codebase for balanced resource acquire/release.
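
For readers who want to see the overall shape, the following is a simplified sketch of the cleanup-label pattern, not the actual diff. Every name in it (DemoStatus, demoPause, demoResume, demoCacheFree, demoControl, demoMappingsUpdate) is a hypothetical stub, and the two fallible steps only simulate the cache-free call and the control call named above.

```c
#include <stdbool.h>

/* Hypothetical status codes, standing in for NV_STATUS values. */
typedef enum {
    DEMO_OK = 0,
    DEMO_ERR_CACHE_FREE,
    DEMO_ERR_CONTROL
} DemoStatus;

/* Counts outstanding pauses; 0 means pause and resume are balanced. */
static int g_pauseDepth = 0;

static void demoPause(void)  { g_pauseDepth++; }
static void demoResume(void) { g_pauseDepth--; }

/* Simulated fallible steps. */
static DemoStatus demoCacheFree(bool fail)
{
    return fail ? DEMO_ERR_CACHE_FREE : DEMO_OK;
}

static DemoStatus demoControl(bool fail)
{
    return fail ? DEMO_ERR_CONTROL : DEMO_OK;
}

/* Every exit after demoPause() goes through the cleanup label,
 * so demoResume() runs exactly once on success and on both error paths. */
DemoStatus demoMappingsUpdate(bool failCacheFree, bool failControl)
{
    DemoStatus status;

    demoPause();

    status = demoCacheFree(failCacheFree);
    if (status != DEMO_OK)
        goto cleanup;   /* previously an early return via macro */

    status = demoControl(failControl);
    if (status != DEMO_OK)
        goto cleanup;   /* previously an explicit early return */

    /* success-path work would go here */

cleanup:
    demoResume();
    return status;
}

int demoPauseDepth(void) { return g_pauseDepth; }
```

In this sketch, demoPauseDepth() is 0 after any call to demoMappingsUpdate(), whichever step fails, while the returned status still identifies the failing step.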

…ceLceMappingsUpdate

cePauseCeUtilsScheduling() is called at the start of
kceTopLevelPceLceMappingsUpdate_IMPL() to block RM-internal CE
submissions while PCE-LCE mappings are being updated. However, two
error paths return without calling the matching
ceResumeCeUtilsScheduling():

  1. NV_ASSERT_OK_OR_RETURN() on rmapiControlCacheFreeForControl()
     returns immediately on failure, skipping the resume.
  2. The early return on NV2080_CTRL_CMD_CE_UPDATE_PCE_LCE_MAPPINGS_V2
     failure likewise skips the resume.

When either path fires, CeUtils submission stays permanently paused
for the lifetime of the GPU instance. Subsequent RM-internal CE
operations (memory scrubbing, allocation init) stall or fail.

Fix by converting both early returns to goto cleanup so that
ceResumeCeUtilsScheduling() is always called after the pause,
regardless of which error path is taken. Also convert the
NV_ASSERT_OK_OR_RETURN() to an explicit status check so the error
is captured in status before branching to cleanup.

Signed-off-by: rawrmonster17 <rawrmonster17@users.noreply.github.com>
@CLAassistant

CLAassistant commented Jun 19, 2026

CLA assistant check
All committers have signed the CLA.