refactor: make scalar distance u64 and overflow aware#22892
Conversation
There was a problem hiding this comment.
@sweb
Thanks for tightening up the overflow handling here.
I think there is one boundary case we need to fix before this lands: float values at exactly 2^64 can still slip through and saturate to u64::MAX instead of returning None.
| (Self::Float64(Some(l)), Self::Float64(Some(r))) => { | ||
| Some((l - r).abs().round() as _) | ||
| let diff = (l - r).abs().round(); | ||
| if diff <= u64::MAX as f64 { |
There was a problem hiding this comment.
u64::MAX as f64 rounds up to 2^64, so a rounded float distance of exactly 2^64 can pass this guard. After that, the as u64 conversion saturates to u64::MAX, which means we return Some(u64::MAX) even though the new overflow-aware contract says distances greater than u64::MAX should return None.
For example, this shape currently returns Some(u64::MAX):
ScalarValue::Float64(Some(0.0))
.distance_u64(&ScalarValue::Float64(Some(18446744073709551616.0)))Could we make the upper bound strict, for example diff.is_finite() && diff < u64::MAX as f64, or move this into an explicit helper? It would also be great to add boundary coverage for exact 2^64.
There was a problem hiding this comment.
Thank you very much for your review! Addressed in c356c99
| (Self::Float32(Some(l)), Self::Float32(Some(r))) => { | ||
| Some((l - r).abs().round() as _) | ||
| let diff = (l - r).abs().round(); | ||
| if diff <= u64::MAX as f32 { |
There was a problem hiding this comment.
Same issue here for Float32: u64::MAX as f32 is 2^64, so a rounded Float32 distance equal to 2^64 is accepted and then saturates to u64::MAX instead of returning None.
Could we reject the rounded value at the first float representation above the u64 domain and add a boundary test for this case too?
| (Self::UInt32(Some(l)), Self::UInt32(Some(r))) => Some(l.abs_diff(*r) as u64), | ||
| (Self::UInt64(Some(l)), Self::UInt64(Some(r))) => Some(l.abs_diff(*r)), | ||
| // TODO: we might want to look into supporting ceil/floor here for floats. | ||
| (Self::Float16(Some(l)), Self::Float16(Some(r))) => { |
There was a problem hiding this comment.
Non-blocking thought: the float arms all repeat the same pattern of taking a rounded absolute diff and then doing the checked conversion. Once the boundary check is fixed, it might be worth using a small helper like fn rounded_float_distance_u64(diff: f64) -> Option<u64> and calling it from Float16, Float32, and Float64, with f16 and f32 widened to f64. That would keep the overflow invariant in one place and make future drift less likely.
There was a problem hiding this comment.
Addressed in c356c99
Since this is now a helper, I am also checking for >= 0
Thank you for your review. In addition to your requested changes, I noticed that PartialOrd (https://github.com/apache/datafusion/blob/main/datafusion/common/src/scalar/mod.rs#L621) only compares decimals on scales, so I added a commit (d5f9c86) that adjusts the handling of decimal to align with that. Happy to drop that again if this is not wanted. |
There was a problem hiding this comment.
@sweb
Thanks for the follow-up updates. The float overflow concerns look addressed now, and the decimal behavior looks consistent with the existing PartialOrd semantics. I only have one remaining concern around the new deprecation and versioning policy before this is ready to merge.
| /// | ||
| /// | ||
| /// Note: the datatype itself must support subtraction. | ||
| #[deprecated(since = "54.0.0", note = "Use distance_u64 instead")] |
There was a problem hiding this comment.
Thanks for the updates here. One remaining concern: ScalarValue::distance is a public method and it is now marked as #[deprecated], while the workspace version is still 54.0.0.
The semver bot is reporting this as a minor-version API change, so could we either remove or defer the deprecation in this PR, or make sure the semver and versioning policy is explicitly satisfied before merging?
There was a problem hiding this comment.
@kosiew I removed the deprecation note as I am not sure what the right next version would be - I also updated the PR description mentioning that deprecation of distance is not part of this PR.
Closes #22687
Rationale for this change
The distance API in
datafusion/common/src/scalar/mod.rspreviously returnedOption<usize>.usizeis machine-width dependent and does not represent value-domain cardinality. This could lead to target-dependent behavior on large integer/temporal ranges. Additionally, downstream callers likeinterval_arithmetic.rshad to convert the distance back tou64to compute cardinality.Exposing an overflow-aware
u64-oriented contract (distance_u64) resolves these architecture differences and aligns the API with value-domain semantics.What changes are included in this PR?
distance_u64: Added a new public methoddistance_u64(&self, other: &ScalarValue) -> Option<u64>toScalarValue.distance: Marked the originaldistance(&self, other: &ScalarValue) -> Option<usize>method as deprecated and redirected it to calldistance_u64.datafusion/expr-common/src/interval_arithmetic.rsto usedistance_u64directly.datafusion/common/src/stats.rsto usedistance_u64.test_scalar_distance_u64_boundariesinscalar/mod.rsto verify edge cases:i64::MINtoi64::MAX)u64::MINtou64::MAX)TimestampSecondandDate32boundaries)u64::MAXfor Float,Decimal128, andDecimal256values)Are these changes tested?
Yes, they are covered by the new unit tests in
datafusion-commonand existing test suites in bothdatafusion-commonanddatafusion-expr-common.Are there any user-facing changes?
No, deprecation of
ScalarValue::distancehas been removed from this PR