Make NVTE tensor handle pool size configurable#3090
Conversation
Signed-off-by: hongbinl <hongbinl@nvidia.com>
for more information, see https://pre-commit.ci
Greptile SummaryThis PR exposes two new runtime environment variables (
Confidence Score: 4/5Safe to merge; the change is backward-compatible (default pool size unchanged) and the new code paths are straightforward. The core logic is correct — validation guards against zero and overflow, the member-initializer ordering is right, and the singleton construction is safe. Two minor defects exist: partial string inputs like "50bad" are silently accepted as 50 due to no eof check in the shared getenv helper, and negative env var values wrap to SIZE_MAX and trigger the "too large" error rather than the intended "must be a positive integer" message. Both are input-validation edge cases with no correctness impact under documented usage. transformer_engine/common/transformer_engine.cpp — specifically the env-var parsing and validation in GetTensorHandlePoolSizeMB Important Files Changed
Sequence DiagramsequenceDiagram
participant Env as Environment Variable
participant Helper as GetTensorHandlePoolSizeMB
participant Cap as GetTensorHandlePoolCapacity
participant Alloc as TensorAllocator (singleton)
participant App as Application Code
App->>Alloc: first call to instance()
activate Alloc
Alloc->>Env: std::getenv("NVTE_TENSOR_HANDLE_POOL_SIZE_MB")
Env-->>Helper: raw string value (or null default 20)
Helper->>Helper: parse via getenv size_t
Helper->>Helper: "NVTE_CHECK pool_size_mb > 0"
Helper->>Helper: NVTE_CHECK pool_size_mb within bounds
Helper-->>Cap: pool_size_mb
Cap->>Cap: compute pool_size_bytes
Cap->>Cap: "NVTE_CHECK pool_size_bytes >= sizeof(Tensor)"
Cap-->>Alloc: MAX_TENSOR_NUM
Alloc->>Alloc: memory.reserve(MAX_TENSOR_NUM)
Alloc-->>App: singleton ready
deactivate Alloc
App->>Alloc: Allocate(mode, out, N)
Alloc->>Alloc: "check available >= N"
alt pool exhausted
Alloc-->>App: NVTE_CHECK error with pool_size_mb and env var hint
else space available
Alloc-->>App: NVTETensor handles
end
Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..." | Re-trigger Greptile |
| const size_t pool_size_mb = getenv<size_t>(env_var, kDefaultTensorHandlePoolSizeMB); | ||
| NVTE_CHECK(pool_size_mb > 0, env_var, " must be a positive integer."); | ||
| NVTE_CHECK(pool_size_mb <= std::numeric_limits<size_t>::max() / kBytesPerMB, env_var, | ||
| " is too large."); |
There was a problem hiding this comment.
Partial string values accepted silently
getenv_helper for numeric types calls iss >> value without checking iss.eof() afterwards, so an env var like NVTE_TENSOR_HANDLE_POOL_SIZE_MB=50bad will parse successfully as 50 — the trailing non-numeric characters are silently ignored. The subsequent range checks in GetTensorHandlePoolSizeMB only reject zero and overflow; they won't catch this case. Consider calling std::getenv directly here and using std::stoull with an idx argument (or verifying iss.eof()) to ensure the entire string is consumed before calling the generic getenv<size_t> wrapper.
| NVTE_CHECK(pool_size_mb > 0, env_var, " must be a positive integer."); | ||
| NVTE_CHECK(pool_size_mb <= std::numeric_limits<size_t>::max() / kBytesPerMB, env_var, | ||
| " is too large."); |
There was a problem hiding this comment.
Negative inputs produce a misleading "too large" error
getenv<size_t> parses the env-var string into an unsigned size_t. On virtually all implementations, a negative string like -1 wraps around to SIZE_MAX without setting failbit, so the pool_size_mb > 0 check (designed to catch zero/negative intent) is bypassed, and the overflow guard fires instead with the message "… is too large." — not the intended "… must be a positive integer." The docs describe the type as int (positive integer), so users who supply a negative value will see a confusing diagnostic.
Summary
Motivation
Large model initialization paths can legitimately create more TE tensor handles than the current fixed-size pool allows, even when GPU and CPU memory are otherwise sufficient. Exposing the pool size as an environment variable avoids downstream source patches for these scale-dependent cases.
Testing