Describe the bug
parquet::FormatStatValue in cpp/src/parquet/types.cc does fixed-width loads on the statistics value:
- BOOLEAN: memcpy of sizeof(bool)
- INT32/FLOAT: 4-byte numeric load
- INT64/DOUBLE: 8-byte numeric load
- INT96: memcpy of 3 * sizeof(int32_t)
- Float16 (FIXED_LEN_BYTE_ARRAY with the float16 logical type): 2-byte load
The val argument is the min_value/max_value taken verbatim from the file's Thrift-encoded statistics, so its length is attacker-controlled. A crafted file with a stat shorter than the column's physical type requires (for example, a zero-byte stat on an INT96 column) makes those loads read past the end of the source buffer.
The out-of-bounds read is reachable from the printer/debug path that formats a file's column statistics.
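To illustrate the pattern, here is a minimal sketch of a fixed-width load that checks the source length first. LoadFixedWidth is a hypothetical helper written for this issue, not an existing Parquet C++ function, and it assumes the stat bytes are available as a std::string_view (C++17).

```cpp
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>

// Hypothetical helper, not part of the Parquet C++ API.
// Copies sizeof(T) bytes out of `val` only when that many bytes are present.
template <typename T>
std::optional<T> LoadFixedWidth(std::string_view val) {
  if (val.size() < sizeof(T)) {
    // The stat is shorter than the type: an unchecked memcpy here would
    // read past the end of the source buffer.
    return std::nullopt;
  }
  T out;
  std::memcpy(&out, val.data(), sizeof(T));
  return out;
}
```

With a zero-byte stat, LoadFixedWidth<int32_t> returns std::nullopt instead of reading 4 bytes from an empty buffer.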
Component(s)
Parquet, C++
Suggested fix
Reject any statistics value shorter than the fixed width its type requires before the load runs. Note that the declared width of a non-float16 FIXED_LEN_BYTE_ARRAY (decimal/string) cannot be validated from the Type::type enum alone without the column's type_length, and FormatStatValue is a public API whose signature can't change without breaking compatibility, so a full per-length check is a separate discussion.
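A rough sketch of what such a check could look like, under stated assumptions: PhysicalType is a local stand-in for parquet::Type::type so the snippet compiles on its own, and MinStatWidth and StatValueLongEnough are hypothetical names, not existing functions. The widths mirror the loads listed above. A non-float16 FIXED_LEN_BYTE_ARRAY returns 0 (no check), reflecting the limitation that its width is unknown without type_length.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Local stand-in for parquet::Type::type (assumption for this sketch).
enum class PhysicalType {
  BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
};

// Hypothetical helper: minimum number of bytes the formatter would load for
// a stat of this type, or 0 when no fixed width can be derived here.
constexpr std::size_t MinStatWidth(PhysicalType type, bool is_float16) {
  switch (type) {
    case PhysicalType::BOOLEAN:
      return sizeof(bool);
    case PhysicalType::INT32:
    case PhysicalType::FLOAT:
      return 4;
    case PhysicalType::INT64:
    case PhysicalType::DOUBLE:
      return 8;
    case PhysicalType::INT96:
      return 3 * sizeof(std::int32_t);
    case PhysicalType::FIXED_LEN_BYTE_ARRAY:
      // Only float16 has a width known without the column's type_length.
      return is_float16 ? 2 : 0;
    case PhysicalType::BYTE_ARRAY:
      return 0;  // variable length, no fixed-width load
  }
  return 0;
}

// Hypothetical helper: true when `val` holds enough bytes for the load.
inline bool StatValueLongEnough(PhysicalType type, bool is_float16,
                                std::string_view val) {
  return val.size() >= MinStatWidth(type, is_float16);
}
```

A caller would run StatValueLongEnough before any of the fixed-width loads and return an error (or a placeholder string) when it fails. How the failure is reported is left open here.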
Tracking PR: #50025