Support spilling for WindowAggExec #22946

@wirybeaver

Is your feature request related to a problem or challenge?

WindowAggExec can require buffering an entire window partition before it can evaluate some window functions. Today that buffering is memory-only, so queries over large or skewed partitions can exhaust the configured memory pool even when the runtime has a spill-capable disk manager.

This is surprising because other memory-intensive physical operators in DataFusion already participate in spill paths, but window aggregation does not. A query with a large PARTITION BY group, or one with no PARTITION BY at all (where the whole input forms a single partition), can therefore fail with a Resources exhausted error instead of spilling to the configured storage.

Motivation sequence

The desired behavior is for WindowAggExec to keep the existing partition-at-a-time evaluation model, but use disk spill as a fallback when a complete window partition no longer fits in the memory reservation.

sequenceDiagram
    autonumber
    participant Input as Sorted input stream
    participant Window as WindowAggExec
    participant Memory as MemoryReservation
    participant Spill as SpillManager / DiskManager
    participant Expr as Window expression evaluator
    participant Output as Downstream consumer

    Input->>Window: RecordBatch with ordered window partitions
    loop For each logical window partition
        Window->>Memory: Reserve memory for buffered partition rows
        alt Reservation succeeds
            Window->>Window: Buffer partition rows in memory
        else Reservation fails and disk spill is enabled
            Window->>Spill: Write buffered rows to spill file
            Window->>Memory: Release memory for spilled rows
            Input->>Window: Continue same partition
            Window->>Spill: Append remaining partition rows
        else Reservation fails and spill is unavailable
            Window-->>Output: Return ResourcesExhausted
        end

        Window->>Spill: Read spilled rows, if partition spilled
        Window->>Expr: Evaluate window functions for complete partition
        Expr-->>Window: Window result columns
        Window-->>Output: Emit original columns plus window columns
    end

Describe the solution you'd like

Add spill support to WindowAggExec:

  • Track buffered partition batches with a MemoryReservation.
  • Preserve the current sorted-input partition semantics and finish one partition at a time.
  • When the active partition cannot grow its reservation and a disk manager is available, write the buffered partition batches to spill files through the existing SpillManager.
  • Read spilled partitions back when the partition is ready for window expression evaluation.
  • Report spill metrics such as spill count, spilled rows, and spilled bytes.
  • Keep the non-spill path unchanged when memory is sufficient.
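
The buffering rules in the list above can be sketched roughly as follows. This is a minimal, std-only illustration and not DataFusion code: `Batch`, `ToyReservation`, `PartitionBuffer`, and `SpillMetrics` are hypothetical stand-ins for RecordBatch, MemoryReservation, the operator's partition state, and its spill metrics, and the "spill file" is just a second Vec. It also assumes that once a partition has spilled, its remaining batches are appended directly to the spill, as in the sequence diagram.

```rust
/// Hypothetical stand-in for an Arrow RecordBatch: only its size matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub rows: usize,
    pub bytes: usize,
}

/// Toy fixed-limit reservation (illustrative; not DataFusion's MemoryReservation).
pub struct ToyReservation {
    limit: usize,
    used: usize,
}

impl ToyReservation {
    pub fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            return Err(format!("Resources exhausted: need {bytes} more bytes"));
        }
        self.used += bytes;
        Ok(())
    }
}

/// Assumed metric shape: one spill per partition that spilled, plus row/byte totals.
#[derive(Debug, Default, PartialEq)]
pub struct SpillMetrics {
    pub spill_count: usize,
    pub spilled_rows: usize,
    pub spilled_bytes: usize,
}

/// Buffers the batches of one window partition, spilling as a fallback.
pub struct PartitionBuffer {
    reservation: ToyReservation,
    spill_enabled: bool,
    in_memory: Vec<Batch>,
    spilled: Vec<Batch>, // stand-in for a spill file on disk
    has_spilled: bool,
    pub metrics: SpillMetrics,
}

impl PartitionBuffer {
    pub fn new(limit: usize, spill_enabled: bool) -> Self {
        Self {
            reservation: ToyReservation { limit, used: 0 },
            spill_enabled,
            in_memory: Vec::new(),
            spilled: Vec::new(),
            has_spilled: false,
            metrics: SpillMetrics::default(),
        }
    }

    pub fn reserved_bytes(&self) -> usize {
        self.reservation.used
    }

    fn write_spill(&mut self, batch: Batch) {
        self.metrics.spilled_rows += batch.rows;
        self.metrics.spilled_bytes += batch.bytes;
        self.spilled.push(batch);
    }

    /// Add one batch of the active partition.
    pub fn push(&mut self, batch: Batch) -> Result<(), String> {
        if self.has_spilled {
            // The partition already spilled: append the remaining rows to the spill.
            self.write_spill(batch);
            return Ok(());
        }
        match self.reservation.try_grow(batch.bytes) {
            // Non-spill path: memory is sufficient, keep the batch in memory.
            Ok(()) => {
                self.in_memory.push(batch);
                Ok(())
            }
            // No spill storage available: surface the reservation error.
            Err(e) if !self.spill_enabled => Err(e),
            Err(_) => {
                // Spill the buffered batches in order, release their memory,
                // then spill the batch that did not fit.
                for b in std::mem::take(&mut self.in_memory) {
                    self.write_spill(b);
                }
                self.reservation.used = 0;
                self.write_spill(batch);
                self.has_spilled = true;
                self.metrics.spill_count += 1;
                Ok(())
            }
        }
    }

    /// Partition complete: return all of its batches in input order and reset.
    pub fn finish(&mut self) -> Vec<Batch> {
        let mut out = std::mem::take(&mut self.spilled);
        out.append(&mut self.in_memory);
        self.reservation.used = 0;
        self.has_spilled = false;
        out
    }
}
```

In this sketch a caller would `push` every batch of the current partition and call `finish` at the partition boundary to get the rows back for window expression evaluation. A real implementation would stream the spilled batches back from disk rather than hold them in a Vec, and might count spills differently.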

Describe alternatives you've considered

A more advanced alternative would be implementing streaming evaluation for more window frame/function combinations. That can reduce memory further for specific functions, but it is a larger semantic change and does not cover all window functions. Operator-level spill support is still useful as a general fallback for large partitions.
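
To illustrate why streaming evaluation cannot cover every case, here is a small std-only sketch. It is not DataFusion's evaluator, and both function names are hypothetical. A running sum (a frame from the start of the partition to the current row) needs only constant state, while a value that depends on the partition total cannot emit its first row until the whole partition has been seen.

```rust
/// Streaming-friendly: each output depends only on the rows seen so far,
/// so the state is a single accumulator.
pub fn running_sum(values: impl Iterator<Item = i64>) -> impl Iterator<Item = i64> {
    values.scan(0i64, |acc, v| {
        *acc += v;
        Some(*acc)
    })
}

/// Not streaming-friendly: every output row needs the partition total,
/// so the full partition must be available (in memory or read back from spill).
pub fn share_of_total(values: &[i64]) -> Vec<f64> {
    let total: i64 = values.iter().sum();
    values.iter().map(|v| *v as f64 / total as f64).collect()
}
```

The second shape is the one where an operator-level spill fallback still helps, because no amount of incremental evaluation removes the need to see the whole partition first.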

Additional context

This would make WindowAggExec behave more consistently with DataFusion's other spill-aware physical operators and make memory-limited window queries fail less often when spill storage is configured.
