Describe the bug, including details regarding any error messages, version, and platform.
When every value in a FLOAT, DOUBLE, or FLOAT16 (HALF_FLOAT) column is the same infinity, the Parquet writer stores a finite min/max statistic (the largest representable finite value) instead of the infinity that is actually in the column.
- An all +Inf column gets min = FLT_MAX/DBL_MAX (should be +Inf). max is correctly +Inf.
- An all -Inf column gets max = -FLT_MAX/-DBL_MAX (should be -Inf). min is correctly -Inf.
Reproduction (PyArrow)
    import io
    import pyarrow as pa
    import pyarrow.parquet as pq

    inf = float("inf")
    for name, vals in [("all +Inf", [inf, inf, inf]), ("all -Inf", [-inf, -inf, -inf])]:
        # Write a single-column table to an in-memory Parquet file.
        buf = io.BytesIO()
        pq.write_table(pa.table({"c": pa.array(vals, type=pa.float64())}), buf)
        # Read back the column-chunk statistics the writer stored.
        st = pq.ParquetFile(buf).metadata.row_group(0).column(0).statistics
        print(f"{name}: min={st.min!r} max={st.max!r}")
Output (PyArrow 24.0.0):
    all +Inf: min=1.7976931348623157e+308 max=inf
    all -Inf: min=-inf max=-1.7976931348623157e+308
Expected: all +Inf -> min=inf, max=inf and all -Inf -> min=-inf, max=-inf.
FLOAT and FLOAT16 behave the same way (an all-+Inf half-float column reports min = 65504.0, the largest finite half).
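The finite maxima named above can be checked from their IEEE 754 bit patterns using only the Python standard library. This is a small sanity check of the constants, not a reproduction of the writer bug, and the variable names are chosen here for illustration.

```python
import math
import struct
import sys

# Largest finite half-precision value: bit pattern 0x7BFF (little-endian bytes ff 7b).
half_max = struct.unpack("<e", bytes.fromhex("ff7b"))[0]
# The next bit pattern up, 0x7C00, is +Inf in half precision.
half_next = struct.unpack("<e", bytes.fromhex("007c"))[0]
# Largest finite single-precision value (FLT_MAX): bit pattern 0x7F7FFFFF.
float_max = struct.unpack("<f", bytes.fromhex("ffff7f7f"))[0]
# Python floats are IEEE 754 doubles, so this is DBL_MAX.
double_max = sys.float_info.max

print(half_max, math.isinf(half_next), float_max, double_max)
```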
I also confirmed this at the C++ level (no PyArrow) against libparquet 24.0.0, both through the writer's comparator directly and end-to-end via a real Parquet file:
    TypedComparator::GetMinMax:
      double all +Inf -> min=1.7976931348623157e+308 max=inf
      double all -Inf -> min=-inf max=-1.7976931348623157e+308
    on-disk (write file, read statistics back):
      double all +Inf -> min=DBL_MAX (WRONG) max=inf
      double all -Inf -> min=-inf max=-DBL_MAX (WRONG)
      double finite -> min=1 max=3 (control: correct)
      double +/-Inf -> min=-inf max=inf (control: correct)
Root cause
In cpp/src/parquet/statistics.cc, the floating-point comparator seeds the min/max accumulation with the largest/smallest finite values:
    constexpr static T DefaultMin() { return std::numeric_limits<T>::max(); }     // FLT_MAX / DBL_MAX
    constexpr static T DefaultMax() { return std::numeric_limits<T>::lowest(); }  // -FLT_MAX / -DBL_MAX
and folds each value into them in TypedComparatorImpl::GetMinMax / GetMinMaxSpaced:
    T min = Helper::DefaultMin();
    T max = Helper::DefaultMax();
    for (...) {
      min = Helper::Min(type_length_, min, Helper::Coalesce(val, Helper::DefaultMin()));
      max = Helper::Max(type_length_, max, Helper::Coalesce(val, Helper::DefaultMax()));
    }
These finite sentinels are not identity elements for min/max once ±Inf is present in the data. Min(FLT_MAX, +Inf) returns FLT_MAX (because FLT_MAX < +Inf), so the seed is never displaced when every value is +Inf; symmetrically, Max(-FLT_MAX, -Inf) returns -FLT_MAX. CleanStatistic() only special-cases NaN and signed zero, so it does not correct this.
Float16CompareHelper has the same flaw (DefaultMin() = Float16::max(), DefaultMax() = Float16::lowest()).
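The behavior described above can be modeled in a few lines of plain Python, without Arrow. This is only a sketch of the fold, not the library's code: `fold_min_max` is a hypothetical helper, and it assumes that coalescing a NaN means substituting the seed, as the `Coalesce` calls in the quoted loop suggest.

```python
import math
import sys

DBL_MAX = sys.float_info.max  # Python floats are IEEE 754 doubles


def fold_min_max(values, seed_min, seed_max):
    """Hypothetical model of the comparator loop: NaN is replaced by the seed."""
    lo, hi = seed_min, seed_max
    for v in values:
        lo = min(lo, seed_min if math.isnan(v) else v)
        hi = max(hi, seed_max if math.isnan(v) else v)
    return lo, hi


inf = math.inf

# Finite seeds, as in the current writer: the min seed is never displaced by +Inf.
all_pos = fold_min_max([inf, inf, inf], DBL_MAX, -DBL_MAX)
# Symmetric case: the max seed is never displaced by -Inf.
all_neg = fold_min_max([-inf, -inf, -inf], DBL_MAX, -DBL_MAX)
# Control: finite data displaces both seeds.
finite = fold_min_max([1.0, 2.0, 3.0], DBL_MAX, -DBL_MAX)

print(all_pos, all_neg, finite)
```

The model gives the same min/max pairs as the PyArrow output in the reproduction, which is consistent with the seeds alone being the cause.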
The compute min_max aggregate kernel is not affected (pa.compute.min_max(pa.array([inf, inf, inf])) correctly returns {min: inf, max: inf}); this is specific to the Parquet writer's comparator.
Crucially, the resulting (wrong) value is then written as exact: TypedStatisticsImpl::SetMinMaxPair unconditionally sets
    statistics_.is_min_value_exact = true;
    statistics_.is_max_value_exact = true;
so the file records min = DBL_MAX and asserts it is the exact minimum.
Impact
The substitution always moves the bound in the conservative direction (the stored min is lower than the true min, the stored max is higher than the true max), so range-based row-group filtering / predicate pushdown remains safe: no matching rows are skipped. The damage is to consumers that use the statistic as a value rather than a range bound. Because the stored min/max is wrong yet flagged is_min_value_exact = true (or is_max_value_exact = true), a spec-compliant reader is told that the exact minimum of an all-+Inf column is DBL_MAX, a value not even present in the column. Concretely, this gives wrong query results for engines that answer MIN(col) / MAX(col) directly from Parquet statistics (e.g. Spark's Parquet aggregate push-down, spark.sql.parquet.aggregatePushdown), and a wrong value for any metadata/introspection tooling.
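The difference between the two kinds of consumer can be illustrated with a small stand-alone sketch. It hard-codes the statistics reported above for an all-+Inf double column; `may_contain` is a hypothetical stand-in for a reader's range check, not an API of any engine.

```python
import math
import sys

DBL_MAX = sys.float_info.max

# Statistics as reported above for an all-+Inf double column.
stat_min, stat_max = DBL_MAX, math.inf
actual = [math.inf, math.inf, math.inf]


def may_contain(lo, hi, value):
    """Hypothetical range check: keep the row group if value lies in [lo, hi]."""
    return lo <= value <= hi


# Range use: a value that is really present still falls inside the stored bounds,
# so the row group is not pruned.
keeps_inf = may_contain(stat_min, stat_max, math.inf)

# Value use: answering MIN(col) from the statistic disagrees with the data.
min_from_stats = stat_min
min_from_data = min(actual)

print(keeps_inf, min_from_stats == min_from_data)
```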
Suggested fix
Seed the floating-point accumulation with the true identities — +infinity for the running min and -infinity for the running max (and the half-float equivalents) — rather than the largest finite values; or track whether any non-NaN value has been observed. Note this also requires updating the "no valid value" detection in CleanStatistic() (currently min == numeric_limits<T>::max() && max == numeric_limits<T>::lowest() → return nullopt) to match the new sentinels. A genuine all-+Inf column has min == +Inf, max == +Inf, which stays distinguishable from the empty/all-NaN case (min == +Inf, max == -Inf).
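A minimal Python sketch of the suggested seeding follows. It is a model of the idea, not a patch for statistics.cc: `min_max_or_none` is a hypothetical name, and skipping NaN here stands in for coalescing it to the identity seed, which has the same effect on the fold.

```python
import math


def min_max_or_none(values):
    """Hypothetical sketch: fold with +inf/-inf seeds, ignoring NaN."""
    lo, hi = math.inf, -math.inf
    for v in values:
        if math.isnan(v):
            continue  # same effect as coalescing NaN to the identity seed
        lo = min(lo, v)
        hi = max(hi, v)
    # Seeds untouched and still "crossed": no non-NaN value was observed.
    if lo == math.inf and hi == -math.inf:
        return None
    return lo, hi


inf, nan = math.inf, math.nan
print(min_max_or_none([inf, inf, inf]))
print(min_max_or_none([-inf, -inf, -inf]))
print(min_max_or_none([nan, nan]))
```

The crossed-seed test cannot misfire on real data, because any observed non-NaN value v leaves lo <= v <= hi, and that rules out lo == +Inf together with hi == -Inf.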
Confirmed present on main (the relevant code is unchanged from 24.0.0).
Component(s)
C++, Parquet