Make NVTE tensor handle pool size configurable by lhb8125 · Pull Request #3090 · NVIDIA/TransformerEngine

lhb8125 · 2026-06-05T07:17:06Z

Summary

Add runtime environment variables to configure the internal NVTETensor and NVTEGroupedTensor handle pool sizes.
Keep the default pool size at 20 MiB to preserve existing behavior.
Improve exhaustion errors to mention the configured pool size and the env var to increase.

Motivation

Large model initialization paths can legitimately create more TE tensor handles than the current fixed-size pool allows, even when GPU and CPU memory are otherwise sufficient. Exposing the pool size as an environment variable avoids downstream source patches for these scale-dependent cases.

Testing

git diff --check

Signed-off-by: hongbinl <hongbinl@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-06-05T07:24:28Z

Greptile Summary

This PR exposes two new runtime environment variables (NVTE_TENSOR_HANDLE_POOL_SIZE_MB and NVTE_GROUPED_TENSOR_HANDLE_POOL_SIZE_MB) to make the internal handle pool sizes configurable, preserving the existing 20 MiB default and improving exhaustion error messages to include the configured size and the relevant env var name.

Adds GetTensorHandlePoolSizeMB and GetTensorHandlePoolCapacity helpers in an anonymous namespace that read, validate, and convert env var values to pool capacities; the computed constants replace the previously hardcoded 20 * 1024 * 1024 / sizeof(T) expressions as member initializers in both allocator singletons.
Updates docs/envvars.rst with documentation entries for the two new variables.

Confidence Score: 4/5

Safe to merge; the change is backward-compatible (default pool size unchanged) and the new code paths are straightforward.

The core logic is correct — validation guards against zero and overflow, the member-initializer ordering is right, and the singleton construction is safe. Two minor defects exist: partial string inputs like "50bad" are silently accepted as 50 due to no eof check in the shared getenv helper, and negative env var values wrap to SIZE_MAX and trigger the "too large" error rather than the intended "must be a positive integer" message. Both are input-validation edge cases with no correctness impact under documented usage.

transformer_engine/common/transformer_engine.cpp — specifically the env-var parsing and validation in GetTensorHandlePoolSizeMB

Important Files Changed

Filename	Overview
transformer_engine/common/transformer_engine.cpp	Adds anonymous-namespace helpers to read and validate pool-size env vars; replaces hardcoded constants in TensorAllocator and GroupedTensorAllocator with env-var-driven member initializers. Logic is correct but partial string parsing (no eof check) can silently accept values like "50bad" as 50, and negative inputs produce a confusing "too large" error due to size_t unsigned wrapping.
docs/envvars.rst	Adds documentation for both new env vars with correct defaults and description; consistent with the surrounding RST style.

Sequence Diagram

sequenceDiagram
    participant Env as Environment Variable
    participant Helper as GetTensorHandlePoolSizeMB
    participant Cap as GetTensorHandlePoolCapacity
    participant Alloc as TensorAllocator (singleton)
    participant App as Application Code

    App->>Alloc: first call to instance()
    activate Alloc
    Alloc->>Env: std::getenv("NVTE_TENSOR_HANDLE_POOL_SIZE_MB")
    Env-->>Helper: raw string value (or null default 20)
    Helper->>Helper: parse via getenv size_t
    Helper->>Helper: "NVTE_CHECK pool_size_mb > 0"
    Helper->>Helper: NVTE_CHECK pool_size_mb within bounds
    Helper-->>Cap: pool_size_mb
    Cap->>Cap: compute pool_size_bytes
    Cap->>Cap: "NVTE_CHECK pool_size_bytes >= sizeof(Tensor)"
    Cap-->>Alloc: MAX_TENSOR_NUM
    Alloc->>Alloc: memory.reserve(MAX_TENSOR_NUM)
    Alloc-->>App: singleton ready
    deactivate Alloc

    App->>Alloc: Allocate(mode, out, N)
    Alloc->>Alloc: "check available >= N"
    alt pool exhausted
        Alloc-->>App: NVTE_CHECK error with pool_size_mb and env var hint
    else space available
        Alloc-->>App: NVTETensor handles
    end

_{Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..." | Re-trigger Greptile}

greptile-apps · 2026-06-05T07:24:32Z

+  const size_t pool_size_mb = getenv<size_t>(env_var, kDefaultTensorHandlePoolSizeMB);
+  NVTE_CHECK(pool_size_mb > 0, env_var, " must be a positive integer.");
+  NVTE_CHECK(pool_size_mb <= std::numeric_limits<size_t>::max() / kBytesPerMB, env_var,
+             " is too large.");


Partial string values accepted silently

getenv_helper for numeric types calls iss >> value without checking iss.eof() afterwards, so an env var like NVTE_TENSOR_HANDLE_POOL_SIZE_MB=50bad will parse successfully as 50 — the trailing non-numeric characters are silently ignored. The subsequent range checks in GetTensorHandlePoolSizeMB only reject zero and overflow; they won't catch this case. Consider calling std::getenv directly here and using std::stoull with an idx argument (or verifying iss.eof()) to ensure the entire string is consumed before calling the generic getenv<size_t> wrapper.

greptile-apps · 2026-06-05T07:24:33Z

+  NVTE_CHECK(pool_size_mb > 0, env_var, " must be a positive integer.");
+  NVTE_CHECK(pool_size_mb <= std::numeric_limits<size_t>::max() / kBytesPerMB, env_var,
+             " is too large.");


Negative inputs produce a misleading "too large" error

getenv<size_t> parses the env-var string into an unsigned size_t. On virtually all implementations, a negative string like -1 wraps around to SIZE_MAX without setting failbit, so the pool_size_mb > 0 check (designed to catch zero/negative intent) is bypassed, and the overflow guard fires instead with the message "… is too large." — not the intended "… must be a positive integer." The docs describe the type as int (positive integer), so users who supply a negative value will see a confusing diagnostic.

Make NVTE tensor handle pool size configurable

ec48388

Signed-off-by: hongbinl <hongbinl@nvidia.com>

github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label Jun 5, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

c3a3285

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Jun 5, 2026

View reviewed changes

lhb8125 marked this pull request as draft June 5, 2026 08:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make NVTE tensor handle pool size configurable#3090

Make NVTE tensor handle pool size configurable#3090
lhb8125 wants to merge 2 commits into
NVIDIA:mainfrom
lhb8125:codex/nvte-tensor-pool-env-var

lhb8125 commented Jun 5, 2026

Uh oh!

greptile-apps Bot commented Jun 5, 2026

Uh oh!

greptile-apps Bot Jun 5, 2026

Uh oh!

greptile-apps Bot Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lhb8125 commented Jun 5, 2026

Summary

Motivation

Testing

Uh oh!

greptile-apps Bot commented Jun 5, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant