
Commit cbc046e

Dandandan and claude committed
feat: two-generation early emission for partial aggregation
When the partial aggregate's hash table exceeds a configurable size
threshold (default: 4MB), use a two-generation scheme to emit
intermediate state while keeping the hash table cache-friendly.

When the hot hash table fills up:

1. Emit the cold batch (previous generation's state) downstream
2. Promote the current hot table state to the cold batch
3. Reset the hot hash table and continue reading

This gives recurring groups a second chance to be merged locally before
being sent downstream, reducing the number of partial emissions through
the hash repartition while keeping the working set in CPU cache.

At end-of-input, the remaining hot state and cold batch are concatenated
and emitted together.

New config: datafusion.execution.partial_aggregation_max_table_size

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
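The two-generation scheme described above can be sketched in isolation with a toy aggregation that sums per-key counts. This is a minimal illustration, not the patch's code: `PartialAgg`, `MAX_GROUPS`, and the `Vec`-based "batch" are hypothetical stand-ins for the real hash table, the byte-size threshold (`partial_aggregation_max_table_size`), and Arrow `RecordBatch`es.

```rust
use std::collections::HashMap;

/// Stand-in for the byte-size threshold: promote/emit once the hot
/// table holds this many groups (illustrative, not the real units).
const MAX_GROUPS: usize = 4;

struct PartialAgg {
    hot: HashMap<String, i64>,        // current-generation hash table
    cold: Option<Vec<(String, i64)>>, // previous generation's state
    emitted: Vec<Vec<(String, i64)>>, // batches sent downstream
}

impl PartialAgg {
    fn new() -> Self {
        Self { hot: HashMap::new(), cold: None, emitted: Vec::new() }
    }

    fn update(&mut self, key: &str, value: i64) {
        *self.hot.entry(key.to_string()).or_insert(0) += value;
        if self.hot.len() >= MAX_GROUPS {
            // 1. Emit the cold batch (previous generation), if any
            if let Some(batch) = self.cold.take() {
                self.emitted.push(batch);
            }
            // 2. Promote the hot table's state to the new cold batch
            self.cold = Some(self.hot.drain().collect());
            // 3. Hot table is now empty; continue reading input
        }
    }

    fn finish(&mut self) {
        // At end-of-input, concatenate the remaining hot state with
        // the cold batch and emit them together
        let mut last: Vec<(String, i64)> =
            self.cold.take().into_iter().flatten().collect();
        last.extend(self.hot.drain());
        if !last.is_empty() {
            self.emitted.push(last);
        }
    }
}
```

Note how a key that repeats before the hot table fills ("a" twice, say) is merged locally and leaves as a single row, which is exactly the emission reduction the commit message claims.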
1 parent 0143dfe commit cbc046e

3 files changed

Lines changed: 81 additions & 2 deletions

datafusion/common/src/config.rs

Lines changed: 8 additions & 0 deletions
@@ -639,6 +639,14 @@ config_namespace! {
     /// aggregation ratio check and trying to switch to skipping aggregation mode
     pub skip_partial_aggregation_probe_rows_threshold: usize, default = 100_000

+    /// Maximum memory (in bytes) that the partial aggregation hash table
+    /// may use before emitting intermediate state and resetting. This
+    /// keeps the hash table small enough to fit in CPU cache, improving
+    /// performance for high-cardinality GROUP BY queries. A value of 0
+    /// disables early emission. Only applies to Partial aggregation mode
+    /// with unordered input.
+    pub partial_aggregation_max_table_size: usize, default = 4_194_304
+
     /// Should DataFusion use row number estimates at the input to decide
     /// whether increasing parallelism is beneficial or not. By default,
     /// only exact row numbers (not estimates) are used for this decision.

datafusion/physical-plan/src/aggregates/row_hash.rs

Lines changed: 71 additions & 2 deletions
@@ -437,6 +437,18 @@ pub(crate) struct GroupedHashAggregateStream {
     /// current stream.
     skip_aggregation_probe: Option<SkipAggregationProbe>,

+    /// Maximum size (in bytes) of the hash table before emitting
+    /// intermediate state and resetting during partial aggregation.
+    /// 0 means disabled.
+    early_emit_max_table_size: usize,
+
+    /// Two-generation early emission: the previous generation's partial
+    /// state batch. When the hot hash table fills up, we emit this cold
+    /// batch (if any), then store the hot table's state as the new cold
+    /// batch. Groups appearing across multiple generations get merged
+    /// locally before being sent downstream.
+    early_emit_cold_batch: Option<RecordBatch>,
+
     // ========================================================================
     // EXECUTION RESOURCES:
     // Fields related to managing execution resources and monitoring performance.
@@ -649,6 +661,18 @@ impl GroupedHashAggregateStream {
             None
         };

+        let early_emit_max_table_size = if agg.mode == AggregateMode::Partial
+            && matches!(group_ordering, GroupOrdering::None)
+        {
+            context
+                .session_config()
+                .options()
+                .execution
+                .partial_aggregation_max_table_size
+        } else {
+            0
+        };
+
         let reduction_factor = if agg.mode == AggregateMode::Partial {
             Some(
                 MetricBuilder::new(&agg.metrics)
@@ -680,6 +704,8 @@ impl GroupedHashAggregateStream {
             spill_state,
             group_values_soft_limit: agg.limit_options().map(|config| config.limit()),
             skip_aggregation_probe,
+            early_emit_max_table_size,
+            early_emit_cold_batch: None,
             reduction_factor,
         })
     }
@@ -780,6 +806,39 @@ impl Stream for GroupedHashAggregateStream {
                         }
                     }

+                    // Two-generation early emission: keeps the hash table
+                    // small enough to fit in CPU cache while giving
+                    // recurring groups a second chance to be merged locally.
+                    //
+                    // When the hot table fills:
+                    // 1. Emit the cold batch (previous generation) if any
+                    // 2. Promote current hot table state → cold batch
+                    // 3. Reset hot table and continue reading
+                    if self.early_emit_max_table_size > 0 {
+                        let table_size = self.group_values.size()
+                            + self
+                                .accumulators
+                                .iter()
+                                .map(|x| x.size())
+                                .sum::<usize>();
+                        if table_size >= self.early_emit_max_table_size {
+                            // Take the cold batch to emit
+                            let to_emit = self.early_emit_cold_batch.take();
+                            // Promote hot → cold
+                            let batch_size = self.batch_size;
+                            self.early_emit_cold_batch =
+                                self.emit(EmitTo::All, false)?;
+                            self.clear_shrink(batch_size);
+                            // Emit the previous cold batch if we had one
+                            if let Some(batch) = to_emit {
+                                timer.done();
+                                self.exec_state =
+                                    ExecutionState::ProducingOutput(batch);
+                                break 'reading_input;
+                            }
+                        }
+                    }
+
                     // If we reach this point, try to update the memory reservation
                     // handling out-of-memory conditions as determined by the OOM mode.
                     if let Some(new_state) =
@@ -1221,11 +1280,21 @@ impl GroupedHashAggregateStream {
         self.group_ordering.input_done();
         let elapsed_compute = self.baseline_metrics.elapsed_compute().clone();
         let timer = elapsed_compute.timer();
+
         self.exec_state = if self.spill_state.spills.is_empty() {
             // Input has been entirely processed without spilling to disk.

-            // Flush any remaining group values.
-            let batch = self.emit(EmitTo::All, false)?;
+            // Flush any remaining group values from the hot table,
+            // concatenated with the cold batch from early emission.
+            let hot_batch = self.emit(EmitTo::All, false)?;
+            let cold_batch = self.early_emit_cold_batch.take();
+            let batch = match (hot_batch, cold_batch) {
+                (Some(hot), Some(cold)) => {
+                    Some(arrow::compute::concat_batches(&hot.schema(), &[cold, hot])?)
+                }
+                (Some(b), None) | (None, Some(b)) => Some(b),
+                (None, None) => None,
+            };

             // If there are none, we're done; otherwise switch to emitting them
             batch.map_or(ExecutionState::Done, ExecutionState::ProducingOutput)
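The size check that gates early emission in the hunk above (group-values size plus the sum of all accumulator sizes, compared against the threshold) can be sketched as a free function. `should_emit` and its constant are illustrative names introduced here, not part of the patch.

```rust
/// Default threshold from the new config option: 4_194_304 bytes = 4 MiB.
const PARTIAL_AGGREGATION_MAX_TABLE_SIZE: usize = 4_194_304;

/// Returns true when the hot table's estimated footprint has reached
/// the threshold and its state should be promoted/emitted.
fn should_emit(group_values_size: usize, accumulator_sizes: &[usize]) -> bool {
    // A threshold of 0 disables early emission entirely
    if PARTIAL_AGGREGATION_MAX_TABLE_SIZE == 0 {
        return false;
    }
    // Footprint = group values + every accumulator's size
    let table_size: usize =
        group_values_size + accumulator_sizes.iter().sum::<usize>();
    table_size >= PARTIAL_AGGREGATION_MAX_TABLE_SIZE
}
```

Because the comparison is `>=`, a table that lands exactly on the 4 MiB boundary is also promoted.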

datafusion/sqllogictest/test_files/information_schema.slt

Lines changed: 2 additions & 0 deletions
@@ -262,6 +262,7 @@ datafusion.execution.parquet.statistics_truncate_length 64
 datafusion.execution.parquet.use_content_defined_chunking NULL
 datafusion.execution.parquet.write_batch_size 1024
 datafusion.execution.parquet.writer_version 1.0
+datafusion.execution.partial_aggregation_max_table_size 4194304
 datafusion.execution.perfect_hash_join_min_key_density 0.15
 datafusion.execution.perfect_hash_join_small_build_threshold 1024
 datafusion.execution.planning_concurrency 13
@@ -407,6 +408,7 @@ datafusion.execution.parquet.statistics_truncate_length 64 (writing) Sets statis
 datafusion.execution.parquet.use_content_defined_chunking NULL (writing) EXPERIMENTAL: Enable content-defined chunking (CDC) when writing parquet files. When `Some`, CDC is enabled with the given options; when `None` (the default), CDC is disabled. When CDC is enabled, parallel writing is automatically disabled since the chunker state must persist across row groups.
 datafusion.execution.parquet.write_batch_size 1024 (writing) Sets write_batch_size in rows
 datafusion.execution.parquet.writer_version 1.0 (writing) Sets parquet writer version valid values are "1.0" and "2.0"
+datafusion.execution.partial_aggregation_max_table_size 4194304 Maximum memory (in bytes) that the partial aggregation hash table may use before emitting intermediate state and resetting. This keeps the hash table small enough to fit in CPU cache, improving performance for high-cardinality GROUP BY queries. A value of 0 disables early emission. Only applies to Partial aggregation mode with unordered input.
 datafusion.execution.perfect_hash_join_min_key_density 0.15 The minimum required density of join keys on the build side to consider a perfect hash join (see `HashJoinExec` for more details). Density is calculated as: `(number of rows) / (max_key - min_key + 1)`. A perfect hash join may be used if the actual key density > this value. Currently only supports cases where build_side.num_rows() < u32::MAX. Support for build_side.num_rows() >= u32::MAX will be added in the future.
 datafusion.execution.perfect_hash_join_small_build_threshold 1024 A perfect hash join (see `HashJoinExec` for more details) will be considered if the range of keys (max - min) on the build side is < this threshold. This provides a fast path for joins with very small key ranges, bypassing the density check. Currently only supports cases where build_side.num_rows() < u32::MAX. Support for build_side.num_rows() >= u32::MAX will be added in the future.
 datafusion.execution.planning_concurrency 13 Fan-out during initial physical planning. This is mostly use to plan `UNION` children in parallel. Defaults to the number of CPU cores on the system
