Skip to content

Make NVTE tensor handle pool size configurable#3090

Draft
lhb8125 wants to merge 2 commits into
NVIDIA:mainfrom
lhb8125:codex/nvte-tensor-pool-env-var
Draft

Make NVTE tensor handle pool size configurable#3090
lhb8125 wants to merge 2 commits into
NVIDIA:mainfrom
lhb8125:codex/nvte-tensor-pool-env-var

Conversation

@lhb8125
Copy link
Copy Markdown
Contributor

@lhb8125 lhb8125 commented Jun 5, 2026

Summary

  • Add runtime environment variables to configure the internal NVTETensor and NVTEGroupedTensor handle pool sizes.
  • Keep the default pool size at 20 MiB to preserve existing behavior.
  • Improve exhaustion errors to mention the configured pool size and the env var to increase.

Motivation

Large model initialization paths can legitimately create more TE tensor handles than the current fixed-size pool allows, even when GPU and CPU memory are otherwise sufficient. Exposing the pool size as an environment variable avoids downstream source patches for these scale-dependent cases.

Testing

  • git diff --check

Signed-off-by: hongbinl <hongbinl@nvidia.com>
@github-actions github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label Jun 5, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 5, 2026

Greptile Summary

This PR exposes two new runtime environment variables (NVTE_TENSOR_HANDLE_POOL_SIZE_MB and NVTE_GROUPED_TENSOR_HANDLE_POOL_SIZE_MB) to make the internal handle pool sizes configurable, preserving the existing 20 MiB default and improving exhaustion error messages to include the configured size and the relevant env var name.

  • Adds GetTensorHandlePoolSizeMB and GetTensorHandlePoolCapacity helpers in an anonymous namespace that read, validate, and convert env var values to pool capacities; the computed constants replace the previously hardcoded 20 * 1024 * 1024 / sizeof(T) expressions as member initializers in both allocator singletons.
  • Updates docs/envvars.rst with documentation entries for the two new variables.

Confidence Score: 4/5

Safe to merge; the change is backward-compatible (default pool size unchanged) and the new code paths are straightforward.

The core logic is correct — validation guards against zero and overflow, the member-initializer ordering is right, and the singleton construction is safe. Two minor defects exist: partial string inputs like "50bad" are silently accepted as 50 due to no eof check in the shared getenv helper, and negative env var values wrap to SIZE_MAX and trigger the "too large" error rather than the intended "must be a positive integer" message. Both are input-validation edge cases with no correctness impact under documented usage.

transformer_engine/common/transformer_engine.cpp — specifically the env-var parsing and validation in GetTensorHandlePoolSizeMB

Important Files Changed

Filename Overview
transformer_engine/common/transformer_engine.cpp Adds anonymous-namespace helpers to read and validate pool-size env vars; replaces hardcoded constants in TensorAllocator and GroupedTensorAllocator with env-var-driven member initializers. Logic is correct but partial string parsing (no eof check) can silently accept values like "50bad" as 50, and negative inputs produce a confusing "too large" error due to size_t unsigned wrapping.
docs/envvars.rst Adds documentation for both new env vars with correct defaults and description; consistent with the surrounding RST style.

Sequence Diagram

sequenceDiagram
    participant Env as Environment Variable
    participant Helper as GetTensorHandlePoolSizeMB
    participant Cap as GetTensorHandlePoolCapacity
    participant Alloc as TensorAllocator (singleton)
    participant App as Application Code

    App->>Alloc: first call to instance()
    activate Alloc
    Alloc->>Env: std::getenv("NVTE_TENSOR_HANDLE_POOL_SIZE_MB")
    Env-->>Helper: raw string value (or null default 20)
    Helper->>Helper: parse via getenv size_t
    Helper->>Helper: "NVTE_CHECK pool_size_mb > 0"
    Helper->>Helper: NVTE_CHECK pool_size_mb within bounds
    Helper-->>Cap: pool_size_mb
    Cap->>Cap: compute pool_size_bytes
    Cap->>Cap: "NVTE_CHECK pool_size_bytes >= sizeof(Tensor)"
    Cap-->>Alloc: MAX_TENSOR_NUM
    Alloc->>Alloc: memory.reserve(MAX_TENSOR_NUM)
    Alloc-->>App: singleton ready
    deactivate Alloc

    App->>Alloc: Allocate(mode, out, N)
    Alloc->>Alloc: "check available >= N"
    alt pool exhausted
        Alloc-->>App: NVTE_CHECK error with pool_size_mb and env var hint
    else space available
        Alloc-->>App: NVTETensor handles
    end
Loading

Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..." | Re-trigger Greptile

Comment on lines +404 to +407
const size_t pool_size_mb = getenv<size_t>(env_var, kDefaultTensorHandlePoolSizeMB);
NVTE_CHECK(pool_size_mb > 0, env_var, " must be a positive integer.");
NVTE_CHECK(pool_size_mb <= std::numeric_limits<size_t>::max() / kBytesPerMB, env_var,
" is too large.");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Partial string values accepted silently

getenv_helper for numeric types calls iss >> value without checking iss.eof() afterwards, so an env var like NVTE_TENSOR_HANDLE_POOL_SIZE_MB=50bad will parse successfully as 50 — the trailing non-numeric characters are silently ignored. The subsequent range checks in GetTensorHandlePoolSizeMB only reject zero and overflow; they won't catch this case. Consider calling std::getenv directly here and using std::stoull with an idx argument (or verifying iss.eof()) to ensure the entire string is consumed before calling the generic getenv<size_t> wrapper.

Comment on lines +405 to +407
NVTE_CHECK(pool_size_mb > 0, env_var, " must be a positive integer.");
NVTE_CHECK(pool_size_mb <= std::numeric_limits<size_t>::max() / kBytesPerMB, env_var,
" is too large.");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Negative inputs produce a misleading "too large" error

getenv<size_t> parses the env-var string into an unsigned size_t. On virtually all implementations, a negative string like -1 wraps around to SIZE_MAX without setting failbit, so the pool_size_mb > 0 check (designed to catch zero/negative intent) is bypassed, and the overflow guard fires instead with the message "… is too large." — not the intended "… must be a positive integer." The docs describe the type as int (positive integer), so users who supply a negative value will see a confusing diagnostic.

@lhb8125 lhb8125 marked this pull request as draft June 5, 2026 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution PRs from external contributor outside the core maintainers, representing community-driven work.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant