[C++][Parquet] FormatStatValue reads past the buffer for too-short statistics values #50184

@jmestwa-coder

Describe the bug

parquet::FormatStatValue in cpp/src/parquet/types.cc does fixed-width loads on the statistics value:

  • BOOLEAN: memcpy of sizeof(bool)
  • INT32/FLOAT: 4-byte numeric load
  • INT64/DOUBLE: 8-byte numeric load
  • INT96: memcpy of 3 * sizeof(int32_t)
  • Float16 (FIXED_LEN_BYTE_ARRAY with the float16 logical type): 2-byte load

The val argument is the min_value/max_value taken verbatim from the file's Thrift-encoded statistics, so its length is attacker-controlled. A crafted file whose stat is shorter than the width of the column's physical type (for example, a zero-byte stat on an INT96 column) makes those loads read past the end of the source buffer.
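
To make the mismatch concrete, here is a minimal sketch of the width each load in the list above consumes, compared against the length of the supplied value. The enum PhysType and the helpers RequiredWidth and WouldOverread are hypothetical names used only for illustration; they are not taken from cpp/src/parquet/types.cc.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the types listed above; this is NOT parquet::Type.
enum class PhysType { kBoolean, kInt32, kInt64, kInt96, kFloat, kDouble, kFloat16 };

// Number of bytes the fixed-width load consumes for each type,
// following the list in this report.
constexpr std::size_t RequiredWidth(PhysType t) {
  switch (t) {
    case PhysType::kBoolean: return sizeof(bool);
    case PhysType::kInt32:
    case PhysType::kFloat:   return 4;
    case PhysType::kInt64:
    case PhysType::kDouble:  return 8;
    case PhysType::kInt96:   return 3 * sizeof(std::int32_t);
    case PhysType::kFloat16: return 2;
  }
  return 0;  // not reached for valid enumerators
}

// True when a load of RequiredWidth(t) bytes would run past a value
// that is only `len` bytes long.
constexpr bool WouldOverread(PhysType t, std::size_t len) {
  return len < RequiredWidth(t);
}
```

With these definitions, WouldOverread(PhysType::kInt96, 0) is true: the zero-byte stat from the example supplies none of the 12 bytes the INT96 load copies.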

The overread is reachable from the printer/debug path that formats a file's column statistics.

Component(s)

Parquet, C++

Suggested fix

Reject any statistics value shorter than the fixed width its type requires before the load runs. Note that the declared width of a non-float16 FIXED_LEN_BYTE_ARRAY (decimal/string) cannot be validated from the Type::type enum alone, because that requires the column's type_length. FormatStatValue is a public API whose signature can't change without breaking compatibility, so a full per-length check is a separate discussion.
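
One possible shape for such a check is sketched below, assuming the value is available as a std::string_view. LoadChecked is a hypothetical helper, not an existing Parquet function, and throwing std::invalid_argument is only a placeholder for whatever error reporting the actual fix uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper (not part of Parquet's API): copy a T out of `val`
// only after confirming that `val` holds at least sizeof(T) bytes.
template <typename T>
T LoadChecked(std::string_view val) {
  if (val.size() < sizeof(T)) {
    // Placeholder error handling; the real fix may report this differently.
    throw std::invalid_argument(
        "statistics value too short: need " + std::to_string(sizeof(T)) +
        " bytes, got " + std::to_string(val.size()));
  }
  T out;
  std::memcpy(&out, val.data(), sizeof(T));  // safe: length verified above
  return out;
}
```

A caller would use LoadChecked<int32_t>(val) in place of an unchecked 4-byte load, so a zero-byte or truncated stat is rejected before any bytes are read.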

Tracking PR: #50025
