Integrate DataFusion as execution engine for compute-heavy operations #3554

### Feature Request / Improvement

# Problem
PyIceberg cannot perform several operations at production scale due to unbounded memory requirements in the PyArrow execution path:
- Tables with equality deletes are unreadable (hard `ValueError`)
- CoW deletes OOM on large Parquet files (~1GB)
- CoW overwrite OOMs (same pattern as delete)
- Upsert uses O(n) row-by-row comparison
- Compaction not implemented, requires external sort (infeasible in-memory for large tables)
- Orphan file deletion OOMs (LEFT ANTI JOIN of millions of paths)

Other operations (documented below) likewise don't scale in PyIceberg's typical single-node environment. Together, these gaps block PyIceberg from achieving feature parity with Java Iceberg for V2/V3.

# Proposed Solution
Integrate Apache DataFusion (already available via `pip install 'pyiceberg[pyiceberg-core]'`) as an optional execution engine behind an automatic engine-resolution layer. When it is installed, compute-heavy operations use DataFusion's spill-to-disk execution, which keeps memory bounded. When it is not installed, the existing PyArrow path remains unchanged: it works for small data and still OOMs on large data.

No existing behavior changes. No forced dependency. The UX is DuckDB-style: a developer only needs to configure a memory budget if they choose to.
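
A minimal sketch of how the engine-resolution layer could decide between the two paths, assuming the decision is based only on whether the DataFusion module is importable. The names `Engine`, `module_available`, and `resolve_engine` are hypothetical and not part of PyIceberg today.

```python
import importlib.util
from enum import Enum
from typing import Callable


class Engine(Enum):
    DATAFUSION = "datafusion"
    PYARROW = "pyarrow"


def module_available(name: str) -> bool:
    """Return True if `name` can be imported, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def resolve_engine(
    module_name: str = "datafusion",  # assumed module name for the optional engine
    available: Callable[[str], bool] = module_available,
) -> Engine:
    # Use DataFusion only when its module is importable; otherwise
    # keep the existing PyArrow path untouched.
    if available(module_name):
        return Engine.DATAFUSION
    return Engine.PYARROW
```

Passing `available` as a parameter keeps the resolution logic testable without installing or uninstalling the optional dependency.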

# Design Doc
[Support for PyIceberg DataFusion Integration](https://docs.google.com/document/d/1p3Imyhlw_KZq9asP6Wz9VFj9sny1hcqelY9LC0c6J0Y/edit?usp=sharing)

# Implementation Approach

There are two ways to execute DataFusion operations from PyIceberg. Both use the same underlying Rust DataFusion engine; the difference is where orchestration happens.

## Track 1: Python-side DataFusion (`datafusion-python`)

Python orchestrates the plan: it configures the session, registers Parquet files, runs SQL, and gets Arrow data back. New files are written in Python.

**Pros:** Works today, no upstream iceberg-rust changes needed.
**Cons:** Object store config must be duplicated (DataFusion side + PyIceberg FileIO). File writing in Python is less capable than Rust's `IcebergWriteExec` (no automatic target-size splitting, no partition routing).
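
To make the duplication concrete, the object store bridge could be little more than a key translation. The sketch below assumes S3 only. The left-hand keys follow PyIceberg's S3 FileIO property names; the right-hand option names are assumptions about what the DataFusion object store would accept, and `S3_PROPERTY_MAP` and `translate_s3_properties` are hypothetical.

```python
from typing import Dict

# PyIceberg FileIO property -> assumed DataFusion object store option.
S3_PROPERTY_MAP: Dict[str, str] = {
    "s3.endpoint": "endpoint",
    "s3.region": "region",
    "s3.access-key-id": "access_key_id",
    "s3.secret-access-key": "secret_access_key",
    "s3.session-token": "session_token",
}


def translate_s3_properties(file_io_properties: Dict[str, str]) -> Dict[str, str]:
    """Copy the S3 settings that are present; ignore unrelated properties."""
    translated = {}
    for source_key, target_key in S3_PROPERTY_MAP.items():
        value = file_io_properties.get(source_key)
        if value is not None:
            translated[target_key] = value
    return translated
```

Every storage backend PyIceberg supports would need a mapping like this under Track 1, which is the maintenance cost the "Cons" above refers to.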

### Checklist
- [ ] Engine resolution module (`pyiceberg/execution/engine.py`)
- [ ] Object store bridge (translate FileIO properties → DataFusion object store config)
- [ ] Equality delete resolution (register files, anti-join SQL, return Arrow table)
- [ ] Orphan file deletion (register path arrays, anti-join SQL, return paths)
- [ ] Streaming CoW rewrite (register file, filter SQL, iterate batches, write from Python)
- [ ] Compaction (register files, sort SQL, iterate batches, write from Python)
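
The orphan file item above boils down to an anti-join: keep every listed path that has no match among the paths referenced by table metadata. The sketch below shows that query shape using the standard-library `sqlite3` module as a stand-in so it runs without DataFusion installed. It holds everything in memory and does not demonstrate spill-to-disk. The table names and `find_orphans` are hypothetical.

```python
import sqlite3
from typing import Iterable, List


def find_orphans(listed_paths: Iterable[str], referenced_paths: Iterable[str]) -> List[str]:
    """Return listed paths that no metadata entry references."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE listed (path TEXT)")
        conn.execute("CREATE TABLE referenced (path TEXT)")
        conn.executemany("INSERT INTO listed VALUES (?)", [(p,) for p in listed_paths])
        conn.executemany("INSERT INTO referenced VALUES (?)", [(p,) for p in referenced_paths])
        # Anti-join written as LEFT JOIN ... IS NULL: rows in `listed`
        # with no matching row in `referenced`.
        rows = conn.execute(
            """
            SELECT l.path
            FROM listed AS l
            LEFT JOIN referenced AS r ON l.path = r.path
            WHERE r.path IS NULL
            ORDER BY l.path
            """
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()
```

For example, `find_orphans(["a.parquet", "b.parquet"], ["b.parquet"])` returns `["a.parquet"]`. Equality delete resolution has the same anti-join shape, with data rows on the left and delete rows on the right.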

## Track 2: Rust-side execution (`pyiceberg_core.execution`)

One Python function call crosses the FFI boundary. Rust handles everything (file I/O, plan construction, execution, and file writing) and returns only metadata. This is consistent with how `pyiceberg_core` already works (transforms, TableProvider).

**Pros:** No object store duplication (uses Iceberg's FileIO). Rust writes files via `IcebergWriteExec` (target-size splitting, partition routing built in). No per-batch FFI overhead. Consistent with existing `pyiceberg_core` patterns.
**Cons:** Requires contributions to iceberg-rust (bounded session helper, execution module, overwrite commit node). Write-path operations blocked on upstream `OverwriteAction`/`RewriteFiles` landing.

### Checklist (iceberg-rust contributions)
- [ ] Bounded-memory session helper (`iceberg-datafusion` — no blockers)
- [ ] `pyiceberg_core.execution` module stubs (`bindings/python/` — no blockers)
- [ ] `execute_antijoin_paths` implementation (needs session helper)
- [ ] `execute_equality_resolution` implementation (needs session helper)
- [ ] `execute_cow_rewrite` implementation (needs session helper + OverwriteAction)
- [ ] `execute_compaction` implementation (needs session helper + OverwriteAction)

### Checklist (this repo)
- [ ] Engine resolution module (`pyiceberg/execution/engine.py`)
- [ ] Dispatch wiring: call `pyiceberg_core.execution` when available, else PyArrow fallback

## How they relate

- Both tracks share the same user-facing API (`table.compact()`, `table.delete()`, etc.)
- Both share the same commit path (Python `Transaction` API)
- Both share the same engine resolution layer
- Per operation, you pick one — they are interchangeable behind the dispatch layer
- Swapping Track 1 → Track 2 for any operation is a single function body change (same inputs, same outputs)
- Track 2 is the preferred long-term path (matches existing `pyiceberg_core` patterns)
- Track 1 can serve as an interim implementation for operations where Track 2 isn't ready yet
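
The interchangeability described above can be sketched as a per-operation dispatch table. Both functions below are pure-Python stand-ins with the same signature; a real Track 1 body would run SQL through `datafusion-python` and a real Track 2 body would make a single FFI call. All names here (`DISPATCH`, `run`, and the two finder functions) are hypothetical.

```python
from typing import Callable, Dict, List

# Shared contract: listed paths and referenced paths in, orphan paths out.
OrphanFinder = Callable[[List[str], List[str]], List[str]]


def find_orphans_python_track(listed: List[str], referenced: List[str]) -> List[str]:
    """Stand-in for a Track 1 implementation."""
    referenced_set = set(referenced)
    return sorted({p for p in listed if p not in referenced_set})


def find_orphans_rust_track(listed: List[str], referenced: List[str]) -> List[str]:
    """Stand-in for a Track 2 implementation."""
    return sorted(set(listed) - set(referenced))


# One entry per operation; each can move between tracks independently.
DISPATCH: Dict[str, OrphanFinder] = {"orphan_files": find_orphans_python_track}


def run(operation: str, listed: List[str], referenced: List[str]) -> List[str]:
    return DISPATCH[operation](listed, referenced)
```

Switching an operation to Track 2 is then a one-line change, `DISPATCH["orphan_files"] = find_orphans_rust_track`, and callers of `run` are unaffected.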

# Operations Unblocked
- [ ] Equality delete read resolution
- [ ] Streaming CoW delete/overwrite
- [ ] Table compaction (sort + rewrite)
- [ ] Orphan file deletion
- [ ] Upsert via hash join
- [ ] Equality-to-positional conversion
- [ ] Position delete compaction
- [ ] Full MoR compaction
- [ ] Z-Order / Hilbert sorting
- [ ] DV compaction
- [ ] Incremental compaction
- [ ] Sort-order enforcement on write
- [ ] Dynamic partition overwrite (bounded memory)

# Related Issues

## PyIceberg
- #1078 (MoR support epic)
- #1210 / #3270 (equality delete reads)
- #3356 (execution path isolation)
- #1092 (data compaction)
- #1200 (orphan file deletion)
- #3285 (`DeleteFileIndex` for equality deletes)
- #3319 / #3320 (commit retry, prerequisite for safe compaction commits)
- #3130 / #3131 (`REPLACE` API, prerequisite for compaction)
- #1818 (V3 tracking, DV compaction)
- #1808 (positional delete write support)
- #2918 (`DeleteFileIndex` for positional deletes, merged foundation)

## iceberg-rust
- [iceberg-rust#2186](https://github.com/apache/iceberg-rust/issues/2186) (MoR scan-side delete reconciliation)
- [iceberg-rust#2205](https://github.com/apache/iceberg-rust/issues/2205) (equality delete reader)
- [iceberg-rust#1530](https://github.com/apache/iceberg-rust/issues/1530) (delete file support in scan)
- [iceberg-rust#2269](https://github.com/apache/iceberg-rust/issues/2269) (DataFusion write actions epic)
- [iceberg-rust#1607](https://github.com/apache/iceberg-rust/issues/1607) (`RewriteFiles` support: umbrella with sub-issues, actively developed)
- [iceberg-rust#2244](https://github.com/apache/iceberg-rust/issues/2244) (Implement `RewriteFilesAction`: needed for compaction commit)
- [iceberg-rust#2185](https://github.com/apache/iceberg-rust/pull/2185) (`OverwriteAction` PR: CoW primitive, under review)
- [iceberg-rust#2711](https://github.com/apache/iceberg-rust/issues/2711) (DataFusion `InsertOp::Overwrite` silently committed as append)
- [iceberg-rust#2620](https://github.com/apache/iceberg-rust/pull/2620) (`MergingSnapshotProducer`: draft, foundation for `RewriteFiles`)

## datafusion-python
- [datafusion-python#1217](https://github.com/apache/datafusion-python/issues/1217) (FFI boundary stability)
