perf: improve performance of array_union/array_intersect with batched row conversion #20243

comphead merged 8 commits into apache:main

Conversation
```rust
let r_start = r_offsets[i].as_usize();
let r_end = r_offsets[i + 1].as_usize();

let mut count = 0usize;
```
Could `count` be declared outside the loop and reused?
Perhaps we can also find a better name that clarifies what the variable is counting?
I believe `count` can also be determined solely from the size of `seen`, instead of incrementing it?
Yes, `count` is redundant here since we can just use the size of `seen`.
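To illustrate the point above, here is a minimal, hypothetical sketch (not the PR's actual code) of deduplicating one row's elements and reading the distinct count from the set's length instead of maintaining a separate counter:

```rust
use std::collections::HashSet;

// Hypothetical sketch: the number of distinct elements in a row can be read
// from the dedup set's length, making a manually incremented `count` redundant.
fn distinct_count(row_elements: &[i64]) -> usize {
    let mut seen: HashSet<i64> = HashSet::new();
    for &v in row_elements {
        seen.insert(v); // insert is a no-op for duplicates
    }
    seen.len() // replaces the manually incremented `count`
}

fn main() {
    assert_eq!(distinct_count(&[1, 2, 2, 3]), 3);
}
```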
```rust
seen.clear();
r_set.clear();

match set_op {
```
Could we get more performance gains by moving this match outside the hot loop? Or making it a const generic for example?
Good suggestion, using const generics could definitely give us a nice performance boost.
```
group                              optimized                      optimized_const_generic
-----                              ---------                      -----------------------
array_intersect/high_overlap/10    1.79  800.7±51.79µs  ? ?/sec   1.00  446.9±15.17µs   ? ?/sec
array_intersect/high_overlap/100   1.77  8.2±0.13ms     ? ?/sec   1.00  4.6±0.08ms      ? ?/sec
array_intersect/high_overlap/50    1.77  4.0±0.06ms     ? ?/sec   1.00  2.3±0.07ms      ? ?/sec
array_intersect/low_overlap/10     1.70  570.4±53.84µs  ? ?/sec   1.00  335.3±4.74µs    ? ?/sec
array_intersect/low_overlap/100    1.62  6.7±0.27ms     ? ?/sec   1.00  4.2±0.07ms      ? ?/sec
array_intersect/low_overlap/50     1.71  3.4±0.44ms     ? ?/sec   1.00  1993.1±23.05µs  ? ?/sec
array_union/high_overlap/10        1.62  548.4±30.79µs  ? ?/sec   1.00  337.6±8.12µs    ? ?/sec
array_union/high_overlap/100       2.06  7.5±2.17ms     ? ?/sec   1.00  3.6±0.10ms      ? ?/sec
array_union/high_overlap/50        1.53  2.8±0.06ms     ? ?/sec   1.00  1805.2±72.23µs  ? ?/sec
array_union/low_overlap/10         1.88  718.8±148.49µs ? ?/sec   1.00  382.7±15.45µs   ? ?/sec
array_union/low_overlap/100        1.67  6.9±0.21ms     ? ?/sec   1.00  4.1±0.13ms      ? ?/sec
array_union/low_overlap/50         1.70  3.5±0.10ms     ? ?/sec   1.00  2.0±0.06ms      ? ?/sec
```
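For readers unfamiliar with the technique discussed above, here is a minimal, hypothetical sketch (not the PR's actual code) of how the per-row `match set_op` can be replaced with a const generic parameter, so the branch is resolved at compile time and each operation gets its own monomorphized loop:

```rust
use std::collections::HashSet;

// Hypothetical operation codes; the real PR may use a different encoding.
const UNION: u8 = 0;
const INTERSECT: u8 = 1;

// The set operation is a compile-time parameter, so the `if OP == ...`
// branches below are constant-folded away in each monomorphized copy.
fn apply_op<const OP: u8>(l: &[i64], r: &[i64]) -> Vec<i64> {
    let r_set: HashSet<i64> = r.iter().copied().collect();
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &v in l {
        // For UNION every left element is kept; for INTERSECT only those
        // present on the right. `seen.insert` deduplicates.
        let keep = if OP == INTERSECT { r_set.contains(&v) } else { true };
        if keep && seen.insert(v) {
            out.push(v);
        }
    }
    if OP == UNION {
        for &v in r {
            if seen.insert(v) {
                out.push(v);
            }
        }
    }
    out
}

fn main() {
    assert_eq!(apply_op::<INTERSECT>(&[1, 2, 3], &[2, 3, 4]), vec![2, 3]);
    assert_eq!(apply_op::<UNION>(&[1, 2], &[2, 3]), vec![1, 2, 3]);
}
```

The trade-off is binary size: the compiler emits one specialized loop per operation, which is usually acceptable for a two-variant enum.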
```rust
);
}

fn invoke_array_intersect(
```
This function looks exactly the same as invoke_array_union(). Maybe drop one of them?
done. thanks for the suggestion.
```rust
);
}

for &array_size in ARRAY_SIZES {
```
The two `for &array_size in ARRAY_SIZES` loops could be simplified into one by adding an outer loop: `for (overlap_label, overlap_ratio) in &[("high_overlap", 0.8), ("low_overlap", 0.2)] { ... }`
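A minimal sketch of that restructuring, with the benchmark machinery replaced by a placeholder (`bench_ids`, the body comment, and the concrete sizes are assumptions for illustration, not the benchmark's actual code):

```rust
// Hypothetical sketch: one outer loop over the overlap configurations
// replaces the two near-identical `for &array_size in ARRAY_SIZES` loops.
fn bench_ids(sizes: &[usize]) -> Vec<String> {
    let mut ids = Vec::new();
    for (overlap_label, overlap_ratio) in &[("high_overlap", 0.8_f64), ("low_overlap", 0.2)] {
        for &array_size in sizes {
            // In the real benchmark, create_arrays_with_overlap() and
            // group.bench_with_input(...) would run here, parameterized by
            // `overlap_ratio` and labeled with `overlap_label`.
            ids.push(format!("{overlap_label}/{array_size}@{overlap_ratio}"));
        }
    }
    ids
}

fn main() {
    let ids = bench_ids(&[10, 50, 100]);
    assert_eq!(ids.len(), 6); // 2 overlap configs x 3 sizes
}
```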
```rust
let mut result_offsets = Vec::with_capacity(l.len() + 1);
result_offsets.push(OffsetSize::usize_as(0));
let mut final_rows = Vec::with_capacity(rows_l.num_rows());
```
For SetOp::Intersect the capacity could be optimised to min(rows_l.num_rows(), rows_r.num_rows()).
Updated the capacity for SetOp::Intersect to min(rows_l.num_rows(), rows_r.num_rows()).
```rust
return internal_err!("{set_op}: failed to get array from rows");
SetOp::Intersect => {
    // Build hash set from right array for lookup table
    // then iterator left array to find common elements.
```
```diff
- // then iterator left array to find common elements.
+ // then iterate left array to find common elements.
```
```rust
let overlap_positions = &positions[..overlap_count];

for i in 0..array_size {
    if overlap_positions.contains(&i) {
```
`Slice::contains()` is O(n) (linear search). Using a `HashSet` would be O(1), but create_arrays_with_overlap() is called before group.bench_with_input(...), so maybe it is OK.

```rust
let overlap_positions: std::collections::HashSet<_> =
    positions[..overlap_count].iter().copied().collect();
```

```rust
return internal_err!("{set_op}: failed to get array from rows");
SetOp::Intersect => {
    // Build hash set from right array for lookup table
    // then iterator left array to find common elements.
```
It would be faster to create the HashSet from the shorter array and iterate over the longer one. This would minimise the memory usage of the hash set and can reduce the number of hash operations, especially when there is a significant size difference between the two arrays.
now building the HashSet from the shorter array and iterating over the longer one, good catch.
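A minimal sketch of the size heuristic described above, assuming simple integer slices rather than the PR's row representation (the function name and shapes are illustrative, not the actual code):

```rust
use std::collections::HashSet;

// Hypothetical sketch: build the lookup set from the shorter input so the
// hash set stays small and the longer side is only probed, never inserted.
fn intersect_shorter_side(l: &[i64], r: &[i64]) -> Vec<i64> {
    let (short, long) = if l.len() <= r.len() { (l, r) } else { (r, l) };
    let lookup: HashSet<i64> = short.iter().copied().collect();
    let mut emitted = HashSet::new(); // deduplicate the output
    long.iter()
        .copied()
        .filter(|v| lookup.contains(v) && emitted.insert(*v))
        .collect()
}

fn main() {
    assert_eq!(intersect_shorter_side(&[2, 4], &[1, 2, 3, 4, 5]), vec![2, 4]);
}
```

One caveat: in this sketch the output order follows whichever side is iterated, so the real implementation has to take care that swapping sides does not change the documented first-appearance ordering.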
```diff
 )]
 #[derive(Debug, PartialEq, Eq, Hash)]
-pub(super) struct ArrayIntersect {
+pub struct ArrayIntersect {
```
This is now public to be able to use it in the benchmark test (a separate crate).
Maybe annotate it with `#[doc(hidden)]` to hide it from end users, since it is not really supposed to be part of the public API?
ArrayIntersect already has #[user_doc], and keeping it visible aligns with how other user-facing SQL functions are exposed?
Thanks everyone for the reviews. I've addressed all the feedback. Please let me know if anything else needs adjustment.
Thanks @lyne7-sc, looks like it is fine now.
The test changed because array_ doesn't preserve order, as in https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_intersect.html
However, in the future it could be a reason for flaky tests.
Thanks @martin-g and @Jefffrey for the review
Do you think it is worth making the tests order-insensitive?
I’m not sure we want to make the tests order-insensitive here. My understanding is that set operations should still produce deterministic results for the same input. If ordering ever becomes non-deterministic, that would likely be a regression rather than something we want to mask in tests. So I’d lean toward keeping the tests as-is so they continue to validate deterministic behavior.
Which issue does this PR close?
Rationale for this change
The current implementation of `array_union` and `array_intersect` performs `RowConverter::convert_columns()` on a per-row basis, which introduces avoidable overhead due to repeated conversions and intermediate allocations. This PR improves performance by:

- performing the row conversion in a single batched call instead of per row
- dropping the `sorted().dedup()` pattern in favor of hash-based set operations

What changes are included in this PR?
Refactored the internal set operation implementation to use batch row conversion and a single-pass construction of result arrays.
Benchmarks
Are these changes tested?
Yes. Existing SQL logic tests updated to reflect new output order.
Are there any user-facing changes?
Yes. The output order may differ from the previous implementation.
Previously, results were implicitly sorted due to the use of `sorted().dedup()`. The new implementation preserves the order of first appearance within each list. This is a user-visible behavioral change, but it is consistent with typical SQL set operation semantics, which do not guarantee a specific output order.
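To make the behavioral change concrete, here is a minimal illustration of first-appearance-order union semantics as described above (the function name and element type are illustrative, not the PR's actual code):

```rust
use std::collections::HashSet;

// Hypothetical sketch: hash-based union keeps the order in which each
// distinct element first appears, rather than the implicitly sorted order
// that sorted().dedup() produced.
fn union_first_appearance(l: &[i64], r: &[i64]) -> Vec<i64> {
    let mut seen = HashSet::new();
    l.iter()
        .chain(r.iter())
        .copied()
        .filter(|v| seen.insert(*v)) // insert returns false for duplicates
        .collect()
}

fn main() {
    // The previous implementation would have produced the sorted [1, 2, 3, 5];
    // the new behavior keeps first-appearance order.
    assert_eq!(union_first_appearance(&[3, 1, 3], &[5, 1, 2]), vec![3, 1, 5, 2]);
}
```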