Skip to content

refactor: make scalar distance u64 and overflow aware#22892

Open
sweb wants to merge 9 commits into
apache:mainfrom
sweb:scalar-distance-overflow
Open

refactor: make scalar distance u64 and overflow aware#22892
sweb wants to merge 9 commits into
apache:mainfrom
sweb:scalar-distance-overflow

Conversation

@sweb

@sweb sweb commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Closes #22687

Rationale for this change

The distance API in datafusion/common/src/scalar/mod.rs previously returned Option<usize>. usize is machine-width dependent and does not represent value-domain cardinality. This could lead to target-dependent behavior on large integer/temporal ranges. Additionally, downstream callers like interval_arithmetic.rs had to convert the distance back to u64 to compute cardinality.

Exposing an overflow-aware u64-oriented contract (distance_u64) resolves these architecture differences and aligns the API with value-domain semantics.

What changes are included in this PR?

  • Added distance_u64: Added a new public method distance_u64(&self, other: &ScalarValue) -> Option<u64> to ScalarValue.
  • Deprecated distance: Marked the original distance(&self, other: &ScalarValue) -> Option<usize> method as deprecated and redirected it to call distance_u64.
  • Interval Cardinality: Migrated the cardinality calculation in datafusion/expr-common/src/interval_arithmetic.rs to use distance_u64 directly.
  • Selectivity / Stats Overlap: Migrated the overlap calculations in datafusion/common/src/stats.rs to use distance_u64.
  • Boundary/Overflow Tests: Added test_scalar_distance_u64_boundaries in scalar/mod.rs to verify edge cases:
    • Full signed range edge (i64::MIN to i64::MAX)
    • Full unsigned range edge (u64::MIN to u64::MAX)
    • Large temporal range edge (TimestampSecond and Date32 boundaries)
    • Overflow-to-None behavior (exceeding u64::MAX for Float, Decimal128, and Decimal256 values)

Are these changes tested?

Yes, they are covered by the new unit tests in datafusion-common and existing test suites in both datafusion-common and datafusion-expr-common.

Are there any user-facing changes?

No, deprecation of ScalarValue::distance has been removed from this PR

@github-actions github-actions Bot added logical-expr Logical plan and expressions common Related to common crate auto detected api change Auto detected API change labels Jun 10, 2026

@kosiew kosiew left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sweb
Thanks for tightening up the overflow handling here.
I think there is one boundary case we need to fix before this lands: float values at exactly 2^64 can still slip through and saturate to u64::MAX instead of returning None.

Comment thread datafusion/common/src/scalar/mod.rs Outdated
(Self::Float64(Some(l)), Self::Float64(Some(r))) => {
Some((l - r).abs().round() as _)
let diff = (l - r).abs().round();
if diff <= u64::MAX as f64 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u64::MAX as f64 rounds up to 2^64, so a rounded float distance of exactly 2^64 can pass this guard. After that, the as u64 conversion saturates to u64::MAX, which means we return Some(u64::MAX) even though the new overflow-aware contract says distances greater than u64::MAX should return None.

For example, this shape currently returns Some(u64::MAX):

ScalarValue::Float64(Some(0.0))
    .distance_u64(&ScalarValue::Float64(Some(18446744073709551616.0)))

Could we make the upper bound strict, for example diff.is_finite() && diff < u64::MAX as f64, or move this into an explicit helper? It would also be great to add boundary coverage for exact 2^64.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your review! Addressed in c356c99

Comment thread datafusion/common/src/scalar/mod.rs Outdated
(Self::Float32(Some(l)), Self::Float32(Some(r))) => {
Some((l - r).abs().round() as _)
let diff = (l - r).abs().round();
if diff <= u64::MAX as f32 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here for Float32: u64::MAX as f32 is 2^64, so a rounded Float32 distance equal to 2^64 is accepted and then saturates to u64::MAX instead of returning None.

Could we reject the rounded value at the first float representation above the u64 domain and add a boundary test for this case too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in c356c99

(Self::UInt32(Some(l)), Self::UInt32(Some(r))) => Some(l.abs_diff(*r) as u64),
(Self::UInt64(Some(l)), Self::UInt64(Some(r))) => Some(l.abs_diff(*r)),
// TODO: we might want to look into supporting ceil/floor here for floats.
(Self::Float16(Some(l)), Self::Float16(Some(r))) => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking thought: the float arms all repeat the same pattern of taking a rounded absolute diff and then doing the checked conversion. Once the boundary check is fixed, it might be worth using a small helper like fn rounded_float_distance_u64(diff: f64) -> Option<u64> and calling it from Float16, Float32, and Float64, with f16 and f32 widened to f64. That would keep the overflow invariant in one place and make future drift less likely.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in c356c99

Since this is now a helper, I am also checking for >= 0

@sweb

sweb commented Jun 13, 2026

Copy link
Copy Markdown
Contributor Author

@sweb Thanks for tightening up the overflow handling here. I think there is one boundary case we need to fix before this lands: float values at exactly 2^64 can still slip through and saturate to u64::MAX instead of returning None.

Thank you for your review. In addition to your requested changes, I noticed that PartialOrd (https://github.com/apache/datafusion/blob/main/datafusion/common/src/scalar/mod.rs#L621) only compares decimals on scales, so I added a commit (d5f9c86) that adjusts the handling of decimal to align with that. Happy to drop that again if this is not wanted.

@sweb sweb requested a review from kosiew June 13, 2026 20:22

@kosiew kosiew left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sweb
Thanks for the follow-up updates. The float overflow concerns look addressed now, and the decimal behavior looks consistent with the existing PartialOrd semantics. I only have one remaining concern around the new deprecation and versioning policy before this is ready to merge.

Comment thread datafusion/common/src/scalar/mod.rs Outdated
///
///
/// Note: the datatype itself must support subtraction.
#[deprecated(since = "54.0.0", note = "Use distance_u64 instead")]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates here. One remaining concern: ScalarValue::distance is a public method and it is now marked as #[deprecated], while the workspace version is still 54.0.0.

The semver bot is reporting this as a minor-version API change, so could we either remove or defer the deprecation in this PR, or make sure the semver and versioning policy is explicitly satisfied before merging?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kosiew I removed the deprecation note as I am not sure what the right next version would be - I also updated the PR description mentioning that deprecation of distance is not part of this PR.

@github-actions github-actions Bot removed the auto detected api change Auto detected API change label Jun 15, 2026
@sweb sweb requested a review from kosiew June 15, 2026 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate logical-expr Logical plan and expressions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make Scalar Distance Overflow-Aware and Domain-Sized

2 participants