Skip to content

Conversation

@dojiong
Copy link
Contributor

@dojiong dojiong commented Dec 17, 2025

Which issue does this PR close?

  • Closes #.

What changes are included in this PR?

Currently, ArrowReader instantiates a new CachingDeleteFileLoader (and consequently a new DeleteFilter) for each FileScanTask when calling load_deletes. This
results in the DeleteFilter state being isolated per task. If multiple tasks reference the same delete file (common in positional deletes), that delete file is
re-read and re-parsed for every task, leading to significant performance overhead and redundant I/O.

Changes

  • Shared State: Moved the DeleteFilter instance into the CachingDeleteFileLoader struct. Since ArrowReader holds a single CachingDeleteFileLoader instance across
    its lifetime, the DeleteFilter state is now effectively shared across all file scan tasks processed by that reader.
  • Positional Delete Caching: Implemented a state machine for loading positional delete files (PosDelState) in DeleteFilter.
    • Added try_start_pos_del_load: Coordinates concurrent access to the same positional delete file.
    • Added finish_pos_del_load: Signals completion of loading.
    • Synchronization: Introduced a WaitFor state. Unlike equality deletes (which are accessed asynchronously), positional deletes are accessed synchronously by
      ArrowReader. Therefore, if a task encounters a file that is currently being loaded by another task, it must asynchronously wait (notify.notified().await)
      during the loading phase to ensure the data is fully populated before ArrowReader proceeds.
  • Refactoring: Updated load_file_for_task and related types in CachingDeleteFileLoader to support the new caching logic and carry file paths through the loading
    context.

Are these changes tested?

Added test_caching_delete_file_loader_caches_results to verify that repeated loads of the same delete file return shared memory objects

@mbutrovich
Copy link
Collaborator

Thank you @dojiong! I will take a look since I worked on this code recently. I'll also test the changes with Comet and the Iceberg Java tests.

@mbutrovich mbutrovich self-requested a review December 17, 2025 13:12
@mbutrovich
Copy link
Collaborator

mbutrovich commented Dec 17, 2025

I ran Iceberg Java 1.10.0 tests with Comet against this branch:

iceberg-spark:iceberg-spark-extensions-3.5_2.13:test:
Screenshot 2025-12-17 at 11 46 55 AM

iceberg-spark:iceberg-spark-3.5_2.13:test
Screenshot 2025-12-17 at 11 47 14 AM

I did a quick review of the code, and I will take another read through later today. I mostly just need another cup of coffee before I think too hard about race conditions with the Notify usage.

Copy link
Collaborator

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, just a (possibly sound) performance suggestion. Thank you @dojiong!

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dojiong for this fix, and Matt for review. Let's wait for a while befor this got merged.

@liurenjie1024 liurenjie1024 merged commit a329a3b into apache:main Dec 22, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants