[Bug] maximum/minimum/relu/clip do not propagate NaN (IEEE 754 violation) #19579

@wuyii8941

Description

Expected behavior

maximum(NaN, x) should return NaN per IEEE 754-2019 §9.6, consistent with NumPy, PyTorch, JAX, and ONNX Runtime.

relu(NaN) should return NaN (since relu(x) = max(x, 0)).

Actual behavior

When NaN is the first operand of T.max / T.min, the result is the second operand instead of NaN. This affects R.maximum, R.minimum, R.nn.relu, and R.clip.

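The root cause is that T.max(a, b) compiles to the x86 maxss/maxps instructions, which return the second source operand whenever either operand is NaN. A NaN in the first operand is therefore replaced by the second operand, while a NaN in the second operand happens to propagate. The IEEE 754-2019 maximum/minimum operations require NaN when either operand is NaN.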

Reproducer

import numpy as np
import tvm
from tvm import relax
import tvm.relax.op as R
from tvm.relax.transform import LegalizeOps

bb = relax.BlockBuilder()
a = relax.Var("a", relax.TensorStructInfo((4,), "float32"))
b = relax.Var("b", relax.TensorStructInfo((4,), "float32"))
with bb.function("main", [a, b]):
    with bb.dataflow():
        gv = bb.emit_output(bb.emit(R.maximum(a, b)))
    bb.emit_func_output(gv)
mod = bb.finalize()

pipeline = tvm.ir.transform.Sequential([LegalizeOps()])
exe = tvm.relax.build(pipeline(mod), target="llvm")
vm = tvm.relax.VirtualMachine(exe, device=tvm.cpu())

A = np.array([np.nan, 1.0, np.nan, 0.0], np.float32)
B = np.array([1.0, np.nan, np.nan, np.nan], np.float32)
out = vm["main"](
    tvm.runtime.tensor(A, device=tvm.cpu()),
    tvm.runtime.tensor(B, device=tvm.cpu()),
).numpy()

print(out)       # [1.  nan  nan  nan]  — element 0 is WRONG
print(np.maximum(A, B))  # [nan nan nan nan]  — all NaN per IEEE 754

The pattern is operand-order-dependent:

| Expression              | TVM | Expected (IEEE 754) |
|-------------------------|-----|---------------------|
| max(NaN, 1.0)           | 1.0 | NaN                 |
| max(1.0, NaN)           | NaN | NaN                 |
| relu(NaN) = max(NaN, 0) | 0.0 | NaN                 |
| clip(NaN, -1, 1)        | 1.0 | NaN                 |
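
The operand-order dependence can be reproduced without TVM by modelling the selection rule in plain Python. This is only a sketch: `x86_max` and `x86_min` are hypothetical helper names (not TVM or LLVM APIs), and the clip ordering `max(min(x, hi), lo)` is an assumption chosen because it reproduces the reported result of `hi`.

```python
import math

NAN = float("nan")


def x86_max(src1: float, src2: float) -> float:
    """Model of the maxss selection rule: keep src1 only if src1 > src2.

    Every ordered comparison involving NaN is false, so a NaN in either
    operand makes the function fall through to src2.
    """
    return src1 if src1 > src2 else src2


def x86_min(src1: float, src2: float) -> float:
    """Same model for the min direction: keep src1 only if src1 < src2."""
    return src1 if src1 < src2 else src2


def relu_model(x: float) -> float:
    # relu(x) = max(x, 0), with x as the first operand.
    return x86_max(x, 0.0)


def clip_model(x: float, lo: float, hi: float) -> float:
    # ASSUMPTION: clip is lowered as max(min(x, hi), lo).
    return x86_max(x86_min(x, hi), lo)


cases = {
    "max(NaN, 1.0)": x86_max(NAN, 1.0),
    "max(1.0, NaN)": x86_max(1.0, NAN),
    "relu(NaN)": relu_model(NAN),
    "clip(NaN, -1, 1)": clip_model(NAN, -1.0, 1.0),
}
for name, value in cases.items():
    print(f"{name:18s} -> {value}")
```

Only the case with NaN in the second operand yields NaN in this model, which matches the TVM column of the table.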

Affected operations

R.maximum(a, b)    # when a is NaN
R.minimum(a, b)    # when a is NaN
R.nn.relu(x)       # when x is NaN → returns 0
R.clip(x, lo, hi)  # when x is NaN → returns hi

Not affected (correct NaN propagation):

  • R.add, R.multiply, R.subtract, R.divide — arithmetic propagates NaN correctly
  • R.nn.leakyrelu — uses comparison path, NaN propagates through multiply
  • R.nn.silu, R.nn.gelu — sigmoid/erf path propagates NaN

Why this matters

relu is one of the most common activation functions. When an upstream computation produces NaN (e.g., from 0/0, or from an overflow to infinity followed by inf - inf), the NaN should propagate to signal the error. Instead, TVM's relu silently converts NaN to 0, making the error invisible:

# Suppose upstream overflow produces NaN in one element:
x = [[1.0, 2.0, NaN, 4.0]]
relu(x).sum()
# TVM:   7.0   ← NaN silently disappeared
# NumPy: NaN   ← correctly signals the problem

This can cause silently wrong results in production models, where NaN detection is a standard debugging/monitoring signal.
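
To make the monitoring point concrete, here is a minimal pure-Python sketch of a NaN check placed after a reduction. The two relu variants are hypothetical stand-ins: one models the behavior reported in this issue, the other models NaN propagation. Neither is TVM code.

```python
import math

NAN = float("nan")


def relu_dropping_nan(x: float) -> float:
    # Models the reported behavior: NaN > 0 is false, so NaN becomes 0.
    return x if x > 0.0 else 0.0


def relu_propagating_nan(x: float) -> float:
    # Models the expected behavior: NaN passes through unchanged.
    return x if math.isnan(x) else max(x, 0.0)


def has_nan(total: float) -> bool:
    """A typical monitoring check on a reduced value."""
    return math.isnan(total)


xs = [1.0, 2.0, NAN, 4.0]
dropped_total = sum(relu_dropping_nan(x) for x in xs)
propagated_total = sum(relu_propagating_nan(x) for x in xs)

print(dropped_total, has_nan(dropped_total))
print(propagated_total, has_nan(propagated_total))
```

With the NaN-dropping variant the check never fires, even though one input element was invalid.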

Root cause

In the lowered TIR, maximum becomes T.max(a, b), which LLVM lowers to x86 maxss/maxps. These instructions return the second source operand whenever either operand is NaN, rather than following the IEEE 754-2019 maximum/minimum rule of returning NaN if either operand is NaN.

The fix would be to emit NaN-aware max/min, e.g.:

select(isnan(a) | isnan(b), NaN, max(a, b))
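
The select expression above can be sketched in plain Python to check that it removes the operand-order dependence. This is an illustration of the proposed semantics, not a TVM patch: `raw_max` and `raw_min` are hypothetical stand-ins for the order-dependent primitives, and the clip ordering is an assumption.

```python
import math


def raw_max(a: float, b: float) -> float:
    # Stand-in for the order-dependent primitive: returns b if either is NaN.
    return a if a > b else b


def raw_min(a: float, b: float) -> float:
    return a if a < b else b


def nan_aware_max(a: float, b: float) -> float:
    # select(isnan(a) | isnan(b), NaN, max(a, b))
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return raw_max(a, b)


def nan_aware_min(a: float, b: float) -> float:
    # select(isnan(a) | isnan(b), NaN, min(a, b))
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return raw_min(a, b)


def nan_aware_relu(x: float) -> float:
    return nan_aware_max(x, 0.0)


def nan_aware_clip(x: float, lo: float, hi: float) -> float:
    # ASSUMPTION: clip is expressed as max(min(x, hi), lo).
    return nan_aware_max(nan_aware_min(x, hi), lo)


nan = float("nan")
print(nan_aware_max(nan, 1.0), nan_aware_max(1.0, nan))
print(nan_aware_relu(nan), nan_aware_clip(nan, -1.0, 1.0))
```

Under the selection rule described in this issue, a NaN in the second operand already propagates, so guarding only the first operand might be enough on x86. The two-sided check is shown because it does not depend on that backend detail.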

Environment

  • TVM commit: 0b0afd8 (main, 2026-04-24)
  • OS: Ubuntu 20.04
  • Target: llvm (CPU, x86-64)
