Skip to content

fix(pathfinder): make find_nvidia_binary_utility deterministic, never search CWD#2196

Open
aryanputta wants to merge 3 commits into
NVIDIA:mainfrom
aryanputta:fix/pathfinder-binary-deterministic-no-cwd
Open

fix(pathfinder): make find_nvidia_binary_utility deterministic, never search CWD#2196
aryanputta wants to merge 3 commits into
NVIDIA:mainfrom
aryanputta:fix/pathfinder-binary-deterministic-no-cwd

Conversation

@aryanputta

Copy link
Copy Markdown
Contributor

Summary

Fixes #2119.

find_nvidia_binary_utility assembles a bounded list of trusted directories (NVIDIA wheel bin/, CONDA_PREFIX, CUDA_HOME/CUDA_PATH) and then delegated to shutil.which(name, path=trusted_dirs). On Windows, CPython's shutil.which prepends the process current working directory to the search even when an explicit path= is supplied. As a result a binary located in an arbitrary (possibly attacker-writable) CWD could be returned in preference to the trusted CUDA / Conda / wheel binary, violating the pathfinder contract of a deterministic lookup over a documented, bounded set of trusted roots.

Fix

Replace the shutil.which delegation with an explicit resolver (_resolve_in_trusted_dirs) that searches only the trusted directories, in order, and returns the first executable match. The current working directory and ambient PATH are never consulted.

  • POSIX execute-bit (X_OK) and Windows extension semantics are preserved via _is_executable_file, so behavior is unchanged apart from removing the CWD/PATH leakage.
  • Empty and duplicate trusted dirs are skipped (an empty entry would otherwise resolve relative to CWD).
  • Binaries already resolvable in the existing trusted dirs return exactly as before.

Tests

  • Rewrote the search-path tests to assert the deterministic probe order (site-packages -> Conda -> CUDA_HOME, plus the Windows-specific dir layout) by recording the probed candidates.
  • Added TestResolveInTrustedDirs: CWD isolation (the core regression), first-matching-dir-wins, empty/duplicate dir skipping, and POSIX non-executable rejection.

Local checks: ruff check and ruff format --check pass on both files; the resolver logic and CWD-isolation behavior were verified on a real filesystem (this package does not import on macOS due to the Linux-only dlinfo binding, so the mocker-based suite runs in CI on Linux/Windows).

… search CWD

find_nvidia_binary_utility assembled a bounded list of trusted directories
(NVIDIA wheel bin/, CONDA_PREFIX, CUDA_HOME/CUDA_PATH) and then delegated to
shutil.which(name, path=trusted_dirs). On Windows shutil.which prepends the
process current working directory to the search even when an explicit path=
is supplied, so a binary located in an arbitrary (possibly attacker-writable)
CWD could be returned in preference to the trusted CUDA / Conda / wheel
binary. That violates the pathfinder contract of a deterministic lookup over
a documented, bounded set of trusted roots.

Replace the shutil.which delegation with an explicit resolver that searches
only the trusted directories, in order, returning the first executable match.
The current working directory and ambient PATH are never consulted. POSIX
execute-bit (X_OK) and Windows extension semantics are preserved, so behavior
is unchanged except for removing the CWD/PATH leakage. Names resolved in the
existing trusted dirs return exactly as before.

Rewrites the search-path tests to assert the deterministic probe order and
adds TestResolveInTrustedDirs covering CWD isolation, first-match-wins,
empty/duplicate dir skipping, and POSIX non-executable rejection.

Fixes NVIDIA#2119
@copy-pr-bot

copy-pr-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.pathfinder Everything related to the cuda.pathfinder module label Jun 10, 2026
@rwgk rwgk added this to the cuda.pathfinder next milestone Jun 10, 2026
@rwgk rwgk added the P1 Medium priority - Should do label Jun 10, 2026
@rwgk

rwgk commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Hi @aryanputta, this is great, but in isolation it'll probably break some users. This was pointed out by @kkraus14 here before:

If someone follows this installation guide and sets LD_LIBRARY_PATH for libraries and PATH for executables ...

I ran the context and my thoughts by GPT around the same time, which lead to the conclusion at the end of this comment:

Use a custom bounded finder, add CTK-root canary fallback, and do not implicitly search PATH.

I still think that's the right compromise:

  • We will find the binary utility via LD_LIBRARY_PATH if set. It's still an attack vector, but significantly more difficult to exploit than hacking PATH.
  • Users following the installation guide will not see a difference in most cases, and get a more self-consistent outcome in the rare case that the searches via PATH and LD_LIBRARY_PATH produce diverging results.

The behavior difference is significant enough though that I'd want to bump the pathfinder minor version, which we can do; I don't have concerns about that.

How do you feel about expanding this PR to fold in the CTK-root canary fallback? — I believe the code changes will still be quite modest.

…utility

After the deterministic search over the explicit trusted directories (NVIDIA
wheel bin/, CONDA_PREFIX, CUDA_HOME/CUDA_PATH) misses, fall back to a CTK-root
canary probe: resolve cudart through the OS dynamic loader, which honors
LD_LIBRARY_PATH on Linux and the native DLL search on Windows, derive the CUDA
Toolkit root from its absolute path, and search that root's bin layout.

This addresses the concern raised on NVIDIA#2196: users who follow the CUDA Linux
installation guide set LD_LIBRARY_PATH for libraries and PATH for executables.
The bounded finder alone would stop finding the utility for them because PATH
is intentionally never consulted. The canary fallback recovers that case
through LD_LIBRARY_PATH instead of PATH. LD_LIBRARY_PATH is still an attack
vector, but a significantly harder one to exploit than PATH, and the ambient
PATH and process CWD remain unused.

The canary runs only after the explicit trusted dirs miss, so the common
wheel/conda/CUDA_HOME cases never spawn the resolver subprocess. The
canary -> CTK-root resolution is factored into a shared
resolve_ctk_root_via_canary helper reused by the dynamic-library CTK-root
canary flow.

Adds tests for the fallback (found, ordering, Windows bin layout, not
consulted when found earlier, cached) and for resolve_ctk_root_via_canary.
Adds 1.6.0 release notes for the minor version bump.
@aryanputta

Copy link
Copy Markdown
Contributor Author

Thanks @rwgk, that compromise makes sense and I've folded it in (pushed c3d39f0, no force-push). Agreed on the reasoning: the bounded finder shouldn't regress users who follow the Linux install guide and only set LD_LIBRARY_PATH + PATH, but reaching them via LD_LIBRARY_PATH instead of PATH is the safer trade.

What the new commit adds

Search order is unchanged for the explicit trusted dirs, with the canary as a strict fallback:

  1. NVIDIA wheel bin/ (site-packages)
  2. CONDA_PREFIX
  3. CUDA_HOME / CUDA_PATH
  4. CTK-root canary (new): resolve cudart through the OS dynamic loader, derive the CTK root from its absolute path, search that root's bin layout

Key points:

  • The canary runs only when steps 1-3 miss, so the common wheel/conda/CUDA_HOME cases never spawn the resolver subprocess.
  • It reuses the existing canary toolbox. I factored the cudart -> CTK-root resolution into resolve_ctk_root_via_canary(...) in load_nvidia_dynamic_lib.py and had the existing dynamic-library _try_ctk_root_canary call it too, so lib and binary discovery share one path. The resolver runs in the isolated subprocess that honors LD_LIBRARY_PATH on Linux and the native DLL search on Windows.
  • PATH and the process CWD are still never consulted, so the LD_LIBRARY_PATH surface is the only added vector, exactly as you described.

Version

Added docs/source/release/1.6.0-notes.rst for the minor bump (version is tag-driven, so this is the doc side of it). Happy to renumber if you'd rather it land under a different version.

Tests

Added coverage for the fallback in tests/test_find_nvidia_binaries.py (found via canary, ordering vs CUDA_HOME, Windows bin/x64/bin/x86_64/bin layout, not consulted when an earlier dir hits, cached) and a focused resolve_ctk_root_via_canary unit test in tests/test_ctk_root_discovery.py. ruff check and ruff format --check are clean.

Local note: cuda.pathfinder can't be imported on macOS (load_dl_linux binds dlinfo, absent on Darwin), so I verified the binary-finder logic directly with stubbed parent packages and unittest.mock (10/10 of the fallback scenarios pass). The full pytest suite will exercise everything on the CI runners.

pre-commit.ci mypy flagged returning Any from resolve_ctk_root_via_canary and
_resolve_ctk_root_via_canary (both declared -> str | None), because
derive_ctk_root resolves to Any under the pathfinder mypy config. Bind the
result to an annotated local before returning, matching the pattern used
elsewhere in the package.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda.pathfinder Everything related to the cuda.pathfinder module P1 Medium priority - Should do

Projects

None yet

Development

Successfully merging this pull request may close these issues.

find_nvidia_binary_utility() violates deterministic-search contract on Windows (CWD leakage via shutil.which)

2 participants