[C++][Parquet] Min/max statistics for all-infinity floating-point columns are truncated to the largest finite value

### Describe the bug, including details regarding any error messages, version, and platform.

When every value in a `FLOAT`, `DOUBLE`, or `FLOAT16` (HALF_FLOAT) column is the **same infinity**, the Parquet writer stores a **finite** min/max statistic (the largest representable finite value) instead of the infinity that is actually in the column.

- An all-`+Inf` column gets **`min` = `FLT_MAX`/`DBL_MAX`** (should be `+Inf`). `max` is correctly `+Inf`.
- An all-`-Inf` column gets **`max` = `-FLT_MAX`/`-DBL_MAX`** (should be `-Inf`). `min` is correctly `-Inf`.

#### Reproduction (PyArrow)

```python
import io
import pyarrow as pa
import pyarrow.parquet as pq

inf = float("inf")
for name, vals in [("all +Inf", [inf, inf, inf]), ("all -Inf", [-inf, -inf, -inf])]:
    buf = io.BytesIO()
    pq.write_table(pa.table({"c": pa.array(vals, type=pa.float64())}), buf)
    st = pq.ParquetFile(buf).metadata.row_group(0).column(0).statistics
    print(f"{name}: min={st.min!r}  max={st.max!r}")
```

Output (PyArrow 24.0.0):

```
all +Inf: min=1.7976931348623157e+308  max=inf
all -Inf: min=-inf  max=-1.7976931348623157e+308
```

Expected: `all +Inf -> min=inf, max=inf` and `all -Inf -> min=-inf, max=-inf`.

`FLOAT` and `FLOAT16` behave the same way (an all-`+Inf` half-float column reports `min = 65504.0`, the largest finite half).

I also confirmed this at the C++ level (no PyArrow) against `libparquet` 24.0.0, both through the writer's comparator directly and end-to-end via a real Parquet file:

```
TypedComparator::GetMinMax:
  double  all +Inf  -> min=1.7976931348623157e+308   max=inf
  double  all -Inf  -> min=-inf                      max=-1.7976931348623157e+308
on-disk (write file, read statistics back):
  double  all +Inf  -> min=DBL_MAX  (WRONG)          max=inf
  double  all -Inf  -> min=-inf                      max=-DBL_MAX  (WRONG)
  double  finite    -> min=1  max=3                  (control: correct)
  double  +/-Inf    -> min=-inf  max=inf             (control: correct)
```

#### Root cause

In `cpp/src/parquet/statistics.cc`, the floating-point comparator seeds the min/max accumulation with the largest/smallest **finite** values:

```cpp
constexpr static T DefaultMin() { return std::numeric_limits<T>::max(); }    // FLT_MAX / DBL_MAX
constexpr static T DefaultMax() { return std::numeric_limits<T>::lowest(); } // -FLT_MAX / -DBL_MAX
```

and folds each value into them in `TypedComparatorImpl::GetMinMax` / `GetMinMaxSpaced`:

```cpp
T min = Helper::DefaultMin();
T max = Helper::DefaultMax();
for (...) {
  min = Helper::Min(type_length_, min, Helper::Coalesce(val, Helper::DefaultMin()));
  max = Helper::Max(type_length_, max, Helper::Coalesce(val, Helper::DefaultMax()));
}
```

These finite sentinels are **not identity elements** for min/max once `±Inf` is present in the data. `Min(FLT_MAX, +Inf)` returns `FLT_MAX` (because `FLT_MAX < +Inf`), so the seed is never displaced when every value is `+Inf`; symmetrically, `Max(-FLT_MAX, -Inf)` returns `-FLT_MAX`. `CleanStatistic()` only special-cases NaN and signed zero, so it does not correct this.

`Float16CompareHelper` has the same flaw (`DefaultMin()` = `Float16::max()`, `DefaultMax()` = `Float16::lowest()`).

The compute `min_max` aggregate kernel is **not** affected (`pa.compute.min_max(pa.array([inf, inf, inf]))` correctly returns `{min: inf, max: inf}`); this is specific to the Parquet writer's comparator.

Crucially, the resulting (wrong) value is then written as **exact**: `TypedStatisticsImpl::SetMinMaxPair` unconditionally sets

```cpp
statistics_.is_min_value_exact = true;
statistics_.is_max_value_exact = true;
```

so the file records `min = DBL_MAX` *and* asserts it is the exact minimum.

#### Impact

The truncation always moves the bound in the conservative direction (min lowered toward finite, max raised toward finite), so range-based row-group filtering / predicate pushdown remains **safe** — no matching rows are skipped. The damage is to consumers that use the statistic as a *value* rather than a range bound: because the stored min/max is wrong yet flagged `is_min_value_exact = true`, a spec-compliant reader is told the exact minimum of an all-`+Inf` column is `DBL_MAX` (a value not even present in the column). Concretely this gives **wrong query results** for engines that answer `MIN(col)` / `MAX(col)` directly from Parquet statistics (e.g. Spark's Parquet aggregate push-down, `spark.sql.parquet.aggregatePushdown`), and a wrong value for any metadata/introspection tooling.

#### Suggested fix

Seed the floating-point accumulation with the true identities — `+infinity` for the running min and `-infinity` for the running max (and the half-float equivalents) — rather than the largest finite values; or track whether any non-NaN value has been observed. Note this also requires updating the "no valid value" detection in `CleanStatistic()` (currently `min == numeric_limits<T>::max() && max == numeric_limits<T>::lowest()` → return `nullopt`) to match the new sentinels. A genuine all-`+Inf` column has `min == +Inf, max == +Inf`, which stays distinguishable from the empty/all-NaN case (`min == +Inf, max == -Inf`).

Confirmed present on `main` (the relevant code is unchanged from 24.0.0).

### Component(s)

C++, Parquet


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Parquet] Min/max statistics for all-infinity floating-point columns are truncated to the largest finite value #50182

Describe the bug, including details regarding any error messages, version, and platform.

Reproduction (PyArrow)

Root cause

Impact

Suggested fix

Component(s)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[C++][Parquet] Min/max statistics for all-infinity floating-point columns are truncated to the largest finite value #50182

Description

Describe the bug, including details regarding any error messages, version, and platform.

Reproduction (PyArrow)

Root cause

Impact

Suggested fix

Component(s)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions