You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem or challenge?
WindowAggExec can require buffering an entire window partition before it can evaluate some window functions. Today that buffering is memory-only, so queries over large or skewed partitions can exhaust the configured memory pool even when the runtime has a spill-capable disk manager.
This is surprising because other memory-intensive physical operators in DataFusion already participate in spill paths, but window aggregation does not. A query with a large PARTITION BY group, or a query without PARTITION BY, can therefore fail with Resources exhausted rather than using the configured spill storage.
Motivation sequence
The desired behavior is for WindowAggExec to keep the existing partition-at-a-time evaluation model, but use disk spill as a fallback when a complete window partition no longer fits in the memory reservation.
sequenceDiagram
autonumber
participant Input as Sorted input stream
participant Window as WindowAggExec
participant Memory as MemoryReservation
participant Spill as SpillManager / DiskManager
participant Expr as Window expression evaluator
participant Output as Downstream consumer
Input->>Window: RecordBatch with ordered window partitions
loop For each logical window partition
Window->>Memory: Reserve memory for buffered partition rows
alt Reservation succeeds
Window->>Window: Buffer partition rows in memory
else Reservation fails and disk spill is enabled
Window->>Spill: Write buffered rows to spill file
Window->>Memory: Release memory for spilled rows
Input->>Window: Continue same partition
Window->>Spill: Append remaining partition rows
else Reservation fails and spill is unavailable
Window-->>Output: Return ResourcesExhausted
end
Window->>Spill: Read spilled rows, if partition spilled
Window->>Expr: Evaluate window functions for complete partition
Expr-->>Window: Window result columns
Window-->>Output: Emit original columns plus window columns
end
Loading
Describe the solution you'd like
Add spill support to WindowAggExec:
Track buffered partition batches with a MemoryReservation.
Preserve the current sorted-input partition semantics and finish one partition at a time.
When the active partition cannot grow its reservation and a disk manager is available, write the buffered partition batches to spill files through the existing SpillManager.
Read spilled partitions back when the partition is ready for window expression evaluation.
Report spill metrics such as spill count, spilled rows, and spilled bytes.
Keep the non-spill path unchanged when memory is sufficient.
Describe alternatives you've considered
A more advanced alternative would be implementing streaming evaluation for more window frame/function combinations. That can reduce memory further for specific functions, but it is a larger semantic change and does not cover all window functions. Operator-level spill support is still useful as a general fallback for large partitions.
Additional context
This would make WindowAggExec behave more consistently with DataFusion's other spill-aware physical operators and make memory-limited window queries fail less often when spill storage is configured.
Is your feature request related to a problem or challenge?
WindowAggExeccan require buffering an entire window partition before it can evaluate some window functions. Today that buffering is memory-only, so queries over large or skewed partitions can exhaust the configured memory pool even when the runtime has a spill-capable disk manager.This is surprising because other memory-intensive physical operators in DataFusion already participate in spill paths, but window aggregation does not. A query with a large
PARTITION BYgroup, or a query withoutPARTITION BY, can therefore fail withResources exhaustedrather than using the configured spill storage.Motivation sequence
The desired behavior is for
WindowAggExecto keep the existing partition-at-a-time evaluation model, but use disk spill as a fallback when a complete window partition no longer fits in the memory reservation.sequenceDiagram autonumber participant Input as Sorted input stream participant Window as WindowAggExec participant Memory as MemoryReservation participant Spill as SpillManager / DiskManager participant Expr as Window expression evaluator participant Output as Downstream consumer Input->>Window: RecordBatch with ordered window partitions loop For each logical window partition Window->>Memory: Reserve memory for buffered partition rows alt Reservation succeeds Window->>Window: Buffer partition rows in memory else Reservation fails and disk spill is enabled Window->>Spill: Write buffered rows to spill file Window->>Memory: Release memory for spilled rows Input->>Window: Continue same partition Window->>Spill: Append remaining partition rows else Reservation fails and spill is unavailable Window-->>Output: Return ResourcesExhausted end Window->>Spill: Read spilled rows, if partition spilled Window->>Expr: Evaluate window functions for complete partition Expr-->>Window: Window result columns Window-->>Output: Emit original columns plus window columns endDescribe the solution you'd like
Add spill support to
WindowAggExec:MemoryReservation.SpillManager.Describe alternatives you've considered
A more advanced alternative would be implementing streaming evaluation for more window frame/function combinations. That can reduce memory further for specific functions, but it is a larger semantic change and does not cover all window functions. Operator-level spill support is still useful as a general fallback for large partitions.
Additional context
This would make
WindowAggExecbehave more consistently with DataFusion's other spill-aware physical operators and make memory-limited window queries fail less often when spill storage is configured.