perf: improve performance of array_union/array_intersect with batched row conversion #20243

comphead merged 8 commits into apache:main

Conversation
```rust
let r_start = r_offsets[i].as_usize();
let r_end = r_offsets[i + 1].as_usize();

let mut count = 0usize;
```
Could `count` be declared outside the loop and reused?
Perhaps we can also find a better name that clarifies what the variable is counting?
I believe `count` can also be determined solely from the size of `seen`, instead of incrementing it?
Yes, `count` is redundant here since we can just use the size of `seen`.
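To illustrate the point above, here is a minimal, hypothetical sketch (not the PR's actual code) of deduplicating one row's elements and reading the distinct count from the set's length instead of maintaining a separate counter:

```rust
use std::collections::HashSet;

// Hypothetical sketch: the number of distinct elements in a row can be read
// from the dedup set's length, making a manually incremented `count` redundant.
fn distinct_count(row_elements: &[i64]) -> usize {
    let mut seen: HashSet<i64> = HashSet::new();
    for &v in row_elements {
        seen.insert(v); // insert is a no-op for duplicates
    }
    seen.len() // replaces the manually incremented `count`
}

fn main() {
    assert_eq!(distinct_count(&[1, 2, 2, 3]), 3);
}
```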
```rust
seen.clear();
r_set.clear();

match set_op {
```
Could we get more performance gains by moving this match outside the hot loop? Or making it a const generic for example?
Good suggestion, using const generics could definitely give us a nice performance boost.
```
group                              optimized                      optimized_const_generic
-----                              ---------                      -----------------------
array_intersect/high_overlap/10    1.79  800.7±51.79µs  ? ?/sec   1.00  446.9±15.17µs   ? ?/sec
array_intersect/high_overlap/100   1.77  8.2±0.13ms     ? ?/sec   1.00  4.6±0.08ms      ? ?/sec
array_intersect/high_overlap/50    1.77  4.0±0.06ms     ? ?/sec   1.00  2.3±0.07ms      ? ?/sec
array_intersect/low_overlap/10     1.70  570.4±53.84µs  ? ?/sec   1.00  335.3±4.74µs    ? ?/sec
array_intersect/low_overlap/100    1.62  6.7±0.27ms     ? ?/sec   1.00  4.2±0.07ms      ? ?/sec
array_intersect/low_overlap/50     1.71  3.4±0.44ms     ? ?/sec   1.00  1993.1±23.05µs  ? ?/sec
array_union/high_overlap/10        1.62  548.4±30.79µs  ? ?/sec   1.00  337.6±8.12µs    ? ?/sec
array_union/high_overlap/100       2.06  7.5±2.17ms     ? ?/sec   1.00  3.6±0.10ms      ? ?/sec
array_union/high_overlap/50        1.53  2.8±0.06ms     ? ?/sec   1.00  1805.2±72.23µs  ? ?/sec
array_union/low_overlap/10         1.88  718.8±148.49µs ? ?/sec   1.00  382.7±15.45µs   ? ?/sec
array_union/low_overlap/100        1.67  6.9±0.21ms     ? ?/sec   1.00  4.1±0.13ms      ? ?/sec
array_union/low_overlap/50         1.70  3.5±0.10ms     ? ?/sec   1.00  2.0±0.06ms      ? ?/sec
```
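For readers unfamiliar with the technique discussed above, here is a minimal, hypothetical sketch (not the PR's actual code) of how the per-row `match set_op` can be replaced with a const generic parameter, so the branch is resolved at compile time and each operation gets its own monomorphized loop:

```rust
use std::collections::HashSet;

// Hypothetical operation codes; the real PR may use a different encoding.
const UNION: u8 = 0;
const INTERSECT: u8 = 1;

// The set operation is a compile-time parameter, so the `if OP == ...`
// branches below are constant-folded away in each monomorphized copy.
fn apply_op<const OP: u8>(l: &[i64], r: &[i64]) -> Vec<i64> {
    let r_set: HashSet<i64> = r.iter().copied().collect();
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &v in l {
        // For UNION every left element is kept; for INTERSECT only those
        // present on the right. `seen.insert` deduplicates.
        let keep = if OP == INTERSECT { r_set.contains(&v) } else { true };
        if keep && seen.insert(v) {
            out.push(v);
        }
    }
    if OP == UNION {
        for &v in r {
            if seen.insert(v) {
                out.push(v);
            }
        }
    }
    out
}

fn main() {
    assert_eq!(apply_op::<INTERSECT>(&[1, 2, 3], &[2, 3, 4]), vec![2, 3]);
    assert_eq!(apply_op::<UNION>(&[1, 2], &[2, 3]), vec![1, 2, 3]);
}
```

The trade-off is binary size: the compiler emits one specialized loop per operation, which is usually acceptable for a two-variant enum.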
```rust
);
}

fn invoke_array_intersect(
```
This function looks exactly the same as invoke_array_union(). Maybe drop one of them?
done. thanks for the suggestion.
```rust
);
}

for &array_size in ARRAY_SIZES {
```
The two `for &array_size in ARRAY_SIZES` loops could be simplified into one by adding an outer loop: `for (overlap_label, overlap_ratio) in &[("high_overlap", 0.8), ("low_overlap", 0.2)] { ... }`
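A minimal sketch of that restructuring, with the benchmark machinery replaced by a placeholder (`bench_ids`, the body comment, and the concrete sizes are assumptions for illustration, not the benchmark's actual code):

```rust
// Hypothetical sketch: one outer loop over the overlap configurations
// replaces the two near-identical `for &array_size in ARRAY_SIZES` loops.
fn bench_ids(sizes: &[usize]) -> Vec<String> {
    let mut ids = Vec::new();
    for (overlap_label, overlap_ratio) in &[("high_overlap", 0.8_f64), ("low_overlap", 0.2)] {
        for &array_size in sizes {
            // In the real benchmark, create_arrays_with_overlap() and
            // group.bench_with_input(...) would run here, parameterized by
            // `overlap_ratio` and labeled with `overlap_label`.
            ids.push(format!("{overlap_label}/{array_size}@{overlap_ratio}"));
        }
    }
    ids
}

fn main() {
    let ids = bench_ids(&[10, 50, 100]);
    assert_eq!(ids.len(), 6); // 2 overlap configs x 3 sizes
}
```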
```rust
let mut result_offsets = Vec::with_capacity(l.len() + 1);
result_offsets.push(OffsetSize::usize_as(0));
let mut final_rows = Vec::with_capacity(rows_l.num_rows());
```
For SetOp::Intersect the capacity could be optimised to min(rows_l.num_rows(), rows_r.num_rows()).
Updated the capacity for SetOp::Intersect to min(rows_l.num_rows(), rows_r.num_rows()).
```rust
return internal_err!("{set_op}: failed to get array from rows");
SetOp::Intersect => {
    // Build hash set from right array for lookup table
    // then iterator left array to find common elements.
```
```diff
- // then iterator left array to find common elements.
+ // then iterate left array to find common elements.
```
```rust
let overlap_positions = &positions[..overlap_count];

for i in 0..array_size {
    if overlap_positions.contains(&i) {
```
`Slice::contains()` is O(n) (linear search). Using a `HashSet` would be O(1), but create_arrays_with_overlap() is called before group.bench_with_input(...), so maybe it is OK.

```rust
let overlap_positions: std::collections::HashSet<_> =
    positions[..overlap_count].iter().copied().collect();
```

```rust
return internal_err!("{set_op}: failed to get array from rows");
SetOp::Intersect => {
    // Build hash set from right array for lookup table
    // then iterator left array to find common elements.
```
It would be faster to create the HashSet from the shorter array and iterate over the longer one. This would minimise the memory usage of the hash set and can reduce the number of hash operations, especially when there is a significant size difference between the two arrays.
now building the HashSet from the shorter array and iterating over the longer one, good catch.
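A minimal sketch of the size heuristic described above, assuming simple integer slices rather than the PR's row representation (the function name and shapes are illustrative, not the actual code):

```rust
use std::collections::HashSet;

// Hypothetical sketch: build the lookup set from the shorter input so the
// hash set stays small and the longer side is only probed, never inserted.
fn intersect_shorter_side(l: &[i64], r: &[i64]) -> Vec<i64> {
    let (short, long) = if l.len() <= r.len() { (l, r) } else { (r, l) };
    let lookup: HashSet<i64> = short.iter().copied().collect();
    let mut emitted = HashSet::new(); // deduplicate the output
    long.iter()
        .copied()
        .filter(|v| lookup.contains(v) && emitted.insert(*v))
        .collect()
}

fn main() {
    assert_eq!(intersect_shorter_side(&[2, 4], &[1, 2, 3, 4, 5]), vec![2, 4]);
}
```

One caveat: in this sketch the output order follows whichever side is iterated, so the real implementation has to take care that swapping sides does not change the documented first-appearance ordering.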
```diff
 )]
 #[derive(Debug, PartialEq, Eq, Hash)]
-pub(super) struct ArrayIntersect {
+pub struct ArrayIntersect {
```
This is now public to be able to use it in the benchmark test (a separate crate).
Maybe annotate it with `#[doc(hidden)]` to hide it from end users, since it is not really supposed to be part of the public API?
ArrayIntersect already has #[user_doc], and keeping it visible aligns with how other user-facing SQL functions are exposed?
Thanks everyone for the reviews. I've addressed all the feedback. Please let me know if anything else needs adjustment.
Thanks @lyne7-sc, looks like it is fine now.
The test changed because array_ doesn't preserve order, as in https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_intersect.html
However, in the future it could be a reason for flaky tests.
Thanks @martin-g and @Jefffrey for the review
Do you think it is worth making the tests order-insensitive?
I’m not sure we want to make the tests order-insensitive here. My understanding is that set operations should still produce deterministic results for the same input. If ordering ever becomes non-deterministic, that would likely be a regression rather than something we want to mask in tests. So I’d lean toward keeping the tests as-is so they continue to validate deterministic behavior.
Which issue does this PR close?
Rationale for this change
The current implementation of `array_union` and `array_intersect` performs `RowConverter::convert_columns()` on a per-row basis, which introduces avoidable overhead due to repeated conversions and intermediate allocations. This PR improves performance by:

- performing the row conversion in a single batched call instead of per row
- dropping the `sorted().dedup()` pattern in favor of hash-based set operations

What changes are included in this PR?
Refactored the internal set operation implementation to use batch row conversion and a single-pass construction of result arrays.
Benchmarks
Are these changes tested?
Yes. Existing SQL logic tests updated to reflect new output order.
Are there any user-facing changes?
Yes. The output order may differ from the previous implementation.
Previously, results were implicitly sorted due to the use of `sorted().dedup()`. The new implementation preserves the order of first appearance within each list. This is a user-visible behavioral change, but it is consistent with typical SQL set operation semantics, which do not guarantee a specific output order.
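To make the behavioral change concrete, here is a minimal illustration of first-appearance-order union semantics as described above (the function name and element type are illustrative, not the PR's actual code):

```rust
use std::collections::HashSet;

// Hypothetical sketch: hash-based union keeps the order in which each
// distinct element first appears, rather than the implicitly sorted order
// that sorted().dedup() produced.
fn union_first_appearance(l: &[i64], r: &[i64]) -> Vec<i64> {
    let mut seen = HashSet::new();
    l.iter()
        .chain(r.iter())
        .copied()
        .filter(|v| seen.insert(*v)) // insert returns false for duplicates
        .collect()
}

fn main() {
    // The previous implementation would have produced the sorted [1, 2, 3, 5];
    // the new behavior keeps first-appearance order.
    assert_eq!(union_first_appearance(&[3, 1, 3], &[5, 1, 2]), vec![3, 1, 5, 2]);
}
```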