Skip to content

[C++][Parquet] Min/max statistics for all-infinity floating-point columns are truncated to the largest finite value #50182

@clee704

Description

@clee704

Describe the bug, including details regarding any error messages, version, and platform.

When every value in a FLOAT, DOUBLE, or FLOAT16 (HALF_FLOAT) column is the same infinity, the Parquet writer stores a finite min/max statistic (the largest representable finite value) instead of the infinity that is actually in the column.

  • An all-+Inf column gets min = FLT_MAX/DBL_MAX (should be +Inf). max is correctly +Inf.
  • An all--Inf column gets max = -FLT_MAX/-DBL_MAX (should be -Inf). min is correctly -Inf.

Reproduction (PyArrow)

import io
import pyarrow as pa
import pyarrow.parquet as pq

inf = float("inf")
for name, vals in [("all +Inf", [inf, inf, inf]), ("all -Inf", [-inf, -inf, -inf])]:
    buf = io.BytesIO()
    pq.write_table(pa.table({"c": pa.array(vals, type=pa.float64())}), buf)
    st = pq.ParquetFile(buf).metadata.row_group(0).column(0).statistics
    print(f"{name}: min={st.min!r}  max={st.max!r}")

Output (PyArrow 24.0.0):

all +Inf: min=1.7976931348623157e+308  max=inf
all -Inf: min=-inf  max=-1.7976931348623157e+308

Expected: all +Inf -> min=inf, max=inf and all -Inf -> min=-inf, max=-inf.

FLOAT and FLOAT16 behave the same way (an all-+Inf half-float column reports min = 65504.0, the largest finite half).

I also confirmed this at the C++ level (no PyArrow) against libparquet 24.0.0, both through the writer's comparator directly and end-to-end via a real Parquet file:

TypedComparator::GetMinMax:
  double  all +Inf  -> min=1.7976931348623157e+308   max=inf
  double  all -Inf  -> min=-inf                      max=-1.7976931348623157e+308
on-disk (write file, read statistics back):
  double  all +Inf  -> min=DBL_MAX  (WRONG)          max=inf
  double  all -Inf  -> min=-inf                      max=-DBL_MAX  (WRONG)
  double  finite    -> min=1  max=3                  (control: correct)
  double  +/-Inf    -> min=-inf  max=inf             (control: correct)

Root cause

In cpp/src/parquet/statistics.cc, the floating-point comparator seeds the min/max accumulation with the largest/smallest finite values:

constexpr static T DefaultMin() { return std::numeric_limits<T>::max(); }    // FLT_MAX / DBL_MAX
constexpr static T DefaultMax() { return std::numeric_limits<T>::lowest(); } // -FLT_MAX / -DBL_MAX

and folds each value into them in TypedComparatorImpl::GetMinMax / GetMinMaxSpaced:

T min = Helper::DefaultMin();
T max = Helper::DefaultMax();
for (...) {
  min = Helper::Min(type_length_, min, Helper::Coalesce(val, Helper::DefaultMin()));
  max = Helper::Max(type_length_, max, Helper::Coalesce(val, Helper::DefaultMax()));
}

These finite sentinels are not identity elements for min/max once ±Inf is present in the data. Min(FLT_MAX, +Inf) returns FLT_MAX (because FLT_MAX < +Inf), so the seed is never displaced when every value is +Inf; symmetrically, Max(-FLT_MAX, -Inf) returns -FLT_MAX. CleanStatistic() only special-cases NaN and signed zero, so it does not correct this.

Float16CompareHelper has the same flaw (DefaultMin() = Float16::max(), DefaultMax() = Float16::lowest()).

The compute min_max aggregate kernel is not affected (pa.compute.min_max(pa.array([inf, inf, inf])) correctly returns {min: inf, max: inf}); this is specific to the Parquet writer's comparator.

Crucially, the resulting (wrong) value is then written as exact: TypedStatisticsImpl::SetMinMaxPair unconditionally sets

statistics_.is_min_value_exact = true;
statistics_.is_max_value_exact = true;

so the file records min = DBL_MAX and asserts it is the exact minimum.

Impact

The truncation always moves the bound in the conservative direction (min lowered toward finite, max raised toward finite), so range-based row-group filtering / predicate pushdown remains safe — no matching rows are skipped. The damage is to consumers that use the statistic as a value rather than a range bound: because the stored min/max is wrong yet flagged is_min_value_exact = true, a spec-compliant reader is told the exact minimum of an all-+Inf column is DBL_MAX (a value not even present in the column). Concretely this gives wrong query results for engines that answer MIN(col) / MAX(col) directly from Parquet statistics (e.g. Spark's Parquet aggregate push-down, spark.sql.parquet.aggregatePushdown), and a wrong value for any metadata/introspection tooling.

Suggested fix

Seed the floating-point accumulation with the true identities — +infinity for the running min and -infinity for the running max (and the half-float equivalents) — rather than the largest finite values; or track whether any non-NaN value has been observed. Note this also requires updating the "no valid value" detection in CleanStatistic() (currently min == numeric_limits<T>::max() && max == numeric_limits<T>::lowest() → return nullopt) to match the new sentinels. A genuine all-+Inf column has min == +Inf, max == +Inf, which stays distinguishable from the empty/all-NaN case (min == +Inf, max == -Inf).

Confirmed present on main (the relevant code is unchanged from 24.0.0).

Component(s)

C++, Parquet

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions