You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
PyIceberg cannot perform several operations at production scale due to unbounded memory requirements in the PyArrow execution path:
Tables with equality deletes are unreadable (hard ValueError)
CoW deletes OOM on large Parquet files (~1GB)
CoW overwrite OOMs (same pattern as delete)
Upsert uses O(n) row-by-row comparison
Compaction not implemented, requires external sort (infeasible in-memory for large tables)
Orphan file deletion OOMs (LEFT ANTI JOIN of millions of paths)
The list goes on (documented below) with operations that don't scale in the typical single-node environment of PyIceberg. These block PyIceberg from achieving feature parity with Java Iceberg for V2/V3.
Proposed Solution
Integrate Apache DataFusion (already exists via pip install 'pyiceberg[pyiceberg-core]') as an optional execution engine behind an automatic engine-resolution layer. When installed, compute-heavy operations use DataFusion's spill-to-disk execution (bounded memory). When not installed, the existing PyArrow path remains unchanged (works for small data, OOMs gracefully on large).
No existing behavior changes. No forced dependency. DuckDB-style UX where only a developer only needs to configure a memory budget if they so choose.
There are two ways to execute DataFusion operations from PyIceberg. Both use the same underlying Rust DataFusion engine; the difference is where orchestration happens.
Python orchestrates the plan, configures the session, registers Parquet files, runs SQL, gets Arrow data back. Writing new files happens in Python.
Pros: Works today, no upstream iceberg-rust changes needed. Cons: Object store config must be duplicated (DataFusion side + PyIceberg FileIO). File writing in Python is less capable than Rust's IcebergWriteExec (no automatic target-size splitting, no partition routing).
One Python function call crosses the FFI boundary. Rust handles everything, file I/O, plan construction, execution, file writing, and returns only metadata. Consistent with how pyiceberg_core already works (transforms, TableProvider).
Pros: No object store duplication (uses Iceberg's FileIO). Rust writes files via IcebergWriteExec (target-size splitting, partition routing built in). No per-batch FFI overhead. Consistent with existing pyiceberg_core patterns. Cons: Requires contributions to iceberg-rust (bounded session helper, execution module, overwrite commit node). Write-path operations blocked on upstream OverwriteAction/RewriteFiles landing.
Checklist (iceberg-rust contributions)
Bounded-memory session helper (iceberg-datafusion — no blockers)
pyiceberg_core.execution module stubs (bindings/python/ — no blockers)
Feature Request / Improvement
Problem
PyIceberg cannot perform several operations at production scale due to unbounded memory requirements in the PyArrow execution path:
ValueError)The list goes on (documented below) with operations that don't scale in the typical single-node environment of PyIceberg. These block PyIceberg from achieving feature parity with Java Iceberg for V2/V3.
Proposed Solution
Integrate Apache DataFusion (already exists via
pip install 'pyiceberg[pyiceberg-core]') as an optional execution engine behind an automatic engine-resolution layer. When installed, compute-heavy operations use DataFusion's spill-to-disk execution (bounded memory). When not installed, the existing PyArrow path remains unchanged (works for small data, OOMs gracefully on large).No existing behavior changes. No forced dependency. DuckDB-style UX where only a developer only needs to configure a memory budget if they so choose.
Design Doc
Support for PyIceberg DataFusion Integration
Implementation Approach
There are two ways to execute DataFusion operations from PyIceberg. Both use the same underlying Rust DataFusion engine; the difference is where orchestration happens.
Track 1: Python-side DataFusion (
datafusion-python)Python orchestrates the plan, configures the session, registers Parquet files, runs SQL, gets Arrow data back. Writing new files happens in Python.
Pros: Works today, no upstream iceberg-rust changes needed.
Cons: Object store config must be duplicated (DataFusion side + PyIceberg FileIO). File writing in Python is less capable than Rust's
IcebergWriteExec(no automatic target-size splitting, no partition routing).Checklist
pyiceberg/execution/engine.py)Track 2: Rust-side execution (
pyiceberg_core.execution)One Python function call crosses the FFI boundary. Rust handles everything, file I/O, plan construction, execution, file writing, and returns only metadata. Consistent with how
pyiceberg_corealready works (transforms, TableProvider).Pros: No object store duplication (uses Iceberg's FileIO). Rust writes files via
IcebergWriteExec(target-size splitting, partition routing built in). No per-batch FFI overhead. Consistent with existingpyiceberg_corepatterns.Cons: Requires contributions to iceberg-rust (bounded session helper, execution module, overwrite commit node). Write-path operations blocked on upstream
OverwriteAction/RewriteFileslanding.Checklist (iceberg-rust contributions)
iceberg-datafusion— no blockers)pyiceberg_core.executionmodule stubs (bindings/python/— no blockers)execute_antijoin_pathsimplementation (needs session helper)execute_equality_resolutionimplementation (needs session helper)execute_cow_rewriteimplementation (needs session helper + OverwriteAction)execute_compactionimplementation (needs session helper + OverwriteAction)Checklist (this repo)
pyiceberg/execution/engine.py)pyiceberg_core.executionwhen available, else PyArrow fallbackHow they relate
table.compact(),table.delete(), etc.)TransactionAPI)pyiceberg_corepatterns)Operations Unblocked
Related Issues
PyIceberg
pyiceberg-corefrom thepyarrowextra #3356 (execution path isolation)DeleteFileIndexfor equality deletes)REPLACEAPI, prerequisite for compaction)DeleteFileIndexfor positional deletes, merged foundation)iceberg-rust
RewriteFilessupport umbrella with sub-issues, actively developed)RewriteFilesActionneeded for compaction commit)OverwriteActionPR CoW primitive, under review)InsertOp::Overwritesilently committed as append)MergingSnapshotProducerdraft, foundation for RewriteFiles)datafusion-python