
[CELEBORN-2314] Optimize the performance of DataBatches.requireBatches #3671

Open

afterincomparableyum wants to merge 1 commit into apache:main from afterincomparableyum:celeborn-2314

Conversation

@afterincomparableyum
Contributor

afterincomparableyum commented Apr 23, 2026

What changes were proposed in this pull request?

requireBatches(int requestSize) currently calls batches.remove(0) per iteration, which shifts all remaining elements each time, making the loop O(k·n) overall. Replacing it with a two-pass approach (find the split point, then subList(0, count).clear()) reduces this to O(n).
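As a quick self-contained illustration of the pattern (a sketch, not the PR's actual code: a plain byte[] list stands in for the DataBatch list, and the class name is made up):

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates the two-pass prefix removal: scan for the split point,
// then detach the prefix with one bulk clear instead of repeated remove(0).
public class RequireBatchesDemo {
  public static void main(String[] args) {
    List<byte[]> batches = new ArrayList<>();
    for (int i = 0; i < 5; i++) batches.add(new byte[100]);

    int requestSize = 250;
    int count = 0;
    int currentSize = 0;
    // Pass 1: find the split point without mutating the list.
    while (currentSize < requestSize) {
      currentSize += batches.get(count).length;
      count++;
    }
    // Pass 2: copy the prefix, then clear it in one operation. The tail
    // shifts once (O(n)) instead of once per remove(0) call (O(k*n)).
    List<byte[]> taken = new ArrayList<>(batches.subList(0, count));
    batches.subList(0, count).clear();

    // Prints "3 taken, 2 left": the loop overshoots requestSize by at most
    // one batch, so 250 requested bytes consume three 100-byte batches.
    System.out.println(taken.size() + " taken, " + batches.size() + " left");
  }
}
```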

Why are the changes needed?

This is a minor performance optimization.

Does this PR resolve a correctness bug?

No.

Does this PR introduce any user-facing change?

No, this is just a performance improvement.

How was this patch tested?

CI Unit/Integration tests.


Copilot AI left a comment


Pull request overview

Optimizes DataBatches.requireBatches(int) to avoid repeated ArrayList.remove(0) calls (which cause repeated element shifting) by selecting a prefix via subList(0, count) and clearing it in one operation.

Changes:

  • Replace per-element remove(0) loop with a prefix-size scan (get(count)) to find the split point.
  • Return the selected prefix batches and remove them from the internal list via subList(0, count).clear().
  • Update totalSize in one subtraction using the accumulated byte size.


Comment thread on common/src/main/java/org/apache/celeborn/common/write/DataBatches.java (outdated)
@afterincomparableyum
Contributor Author

Ping @SteNicholas @FMX for review


Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.



@SteNicholas
Member

SteNicholas commented Apr 30, 2026

@afterincomparableyum

PR Review: [CELEBORN-2314] Optimize the performance of DataBatches.requireBatches

Author: afterincomparableyum | File changed: 1 | +11 / -8

Overview

This PR optimizes DataBatches.requireBatches() by replacing an O(n²) ArrayList.remove(0) loop with an O(n) subList(0, count).clear() approach. The old code repeatedly removed the head element, causing all remaining elements to shift left on each removal. The new code counts elements first, copies the sublist, then clears it in a single bulk operation.

Analysis

What's good:

- The core optimization is correct and well motivated. ArrayList.remove(0) in a loop is a classic O(n²) anti-pattern; subList().clear() performs a single System.arraycopy internally, a solid improvement.
- The requestSize >= totalSize fast path now correctly resets batches to a new empty list instead of returning the mutable internal reference. This is a bugfix on top of the optimization: previously, the caller and DataBatches shared the same ArrayList instance after the fast path, so mutations from either side would corrupt the other.
- The TODO comment from the old code is resolved and removed.
Issues and suggestions:

1. totalSize accounting is moved outside the loop (medium risk)
The old code decremented totalSize per element: totalSize -= elem.body.length. The new code does totalSize -= currentSize once after the loop, which is arithmetically equivalent. Since the while condition is currentSize < requestSize, the loop will consume one extra batch whose body pushes currentSize past requestSize; the old code had the same behavior, so this is not a regression, but it is worth noting that the method returns at least requestSize bytes, not exactly.
2. Potential IndexOutOfBoundsException (bug)
The while loop increments count and reads batches.get(count) without checking count < batches.size(). If requestSize > totalSize but doesn't hit the >= guard (which shouldn't happen if totalSize is maintained correctly), or if totalSize drifts out of sync with the actual batch sizes, this will throw. The old code had the same latent issue (batches.remove(0) on an empty list), so this is pre-existing, but now would be a good time to add a bounds check (see the combined sketch after this list):
while (currentSize < requestSize && count < batches.size()) {
3. Return type is ArrayList<DataBatch> (minor, style)
The method signature returns ArrayList<DataBatch> rather than List<DataBatch>. Since the new code already introduces a List<DataBatch> import for the subList result, the return type could be widened to List<DataBatch>, though that would be a separate API change and is fine to defer.
4. The fast-path change is a behavioral fix, not just an optimization
The old code returned the same batches reference and set totalSize = 0 but never cleared or replaced batches, so any subsequent call to addDataBatch() would append to the list the caller is still holding. The new code (batches = new ArrayList<>()) properly detaches. This should be called out in the PR description, since it's a correctness fix, not just a perf change.
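Putting suggestions 2 and 4 together, a hypothetical defensive variant could look like the following. This is a sketch against assumed internals, not the PR's actual diff; the fields batches, totalSize, and DataBatch.body are taken from the snippets quoted above, everything else is guessed:

```java
public ArrayList<DataBatch> requireBatches(int requestSize) {
  if (requestSize >= totalSize) {
    // Fast path: hand over the whole list, then detach the internal
    // reference so later addDataBatch() calls cannot leak into the
    // list the caller now holds (suggestion 4).
    ArrayList<DataBatch> all = batches;
    batches = new ArrayList<>();
    totalSize = 0;
    return all;
  }
  int count = 0;
  int currentSize = 0;
  // The count < batches.size() guard protects against totalSize drifting
  // out of sync with the actual batch sizes (suggestion 2).
  while (currentSize < requestSize && count < batches.size()) {
    currentSize += batches.get(count).body.length;
    count++;
  }
  // Copy the prefix, then remove it with a single bulk shift of the tail.
  ArrayList<DataBatch> taken = new ArrayList<>(batches.subList(0, count));
  batches.subList(0, count).clear();
  totalSize -= currentSize;
  return taken;
}
```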

Performance

The optimization is sound:

| Operation                | Old    | New                             |
|--------------------------|--------|---------------------------------|
| Removing k of n elements | O(k·n) | O(n) (subList copy + arraycopy) |
| Fast path (return all)   | O(1)   | O(1)                            |

For large batch lists, this is a meaningful improvement on the write-side hot path.

Test Coverage

No new tests are added. I'd recommend (a sketch follows this list):

- A test for partial requireBatches(requestSize) verifying the correct elements are returned and batches retains the remainder.
- A test for the full-drain fast path ensuring the returned list is independent from the internal state (verifying the bugfix).
- An edge-case test where requestSize equals exactly one batch's size.
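A sketch of those tests, assuming JUnit 4; newBatch(int), DataBatches#add(DataBatch), and getTotalSize() are hypothetical stand-ins for the real constructor, adder, and accessor:

```java
import static org.junit.Assert.assertEquals;

import java.util.List;
import org.junit.Test;

public class DataBatchesRequireSuite {

  // newBatch(int) is a hypothetical helper that builds a DataBatch with a
  // body of the given size; db.add and db.getTotalSize are likewise assumed.

  @Test
  public void partialRequireKeepsRemainder() {
    DataBatches db = new DataBatches();
    db.add(newBatch(100));
    db.add(newBatch(100));
    db.add(newBatch(100));
    // 150 requested bytes consume two 100-byte batches (overshoot by one).
    assertEquals(2, db.requireBatches(150).size());
    assertEquals(100, db.getTotalSize()); // one batch remains internally
  }

  @Test
  public void fullDrainReturnsDetachedList() {
    DataBatches db = new DataBatches();
    db.add(newBatch(100));
    List<DataBatch> drained = db.requireBatches(1000); // hits the fast path
    db.add(newBatch(100)); // must not become visible through `drained`
    assertEquals(1, drained.size());
  }

  @Test
  public void exactSingleBatchSize() {
    DataBatches db = new DataBatches();
    db.add(newBatch(100));
    db.add(newBatch(100));
    // requestSize equal to exactly one batch's size takes only that batch.
    assertEquals(1, db.requireBatches(100).size());
    assertEquals(100, db.getTotalSize());
  }
}
```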
                                                                     
Verdict

Approve with minor comments. The core optimization is correct and addresses a real O(n²) bottleneck. The fast-path fix is a bonus correctness improvement. The only actionable suggestions are adding a bounds guard (count < batches.size()) in the while loop for defensive safety and noting the behavioral fix in the PR description.

@RexXiong
Contributor

RexXiong commented May 7, 2026

Overall LGTM. Could you add unit tests for requireBatches(int requestSize)?

