Add StatisticsContext parameter to partition_statistics#21815
asolimando wants to merge 11 commits into
Conversation
|
Hi @xudong963, I have opened this PR as a prerequisite for #21122, as discussed. This is a breaking change, so I added a section under .../library-user-guide/upgrading/54.0.0.md. I looked at what usually goes there, but I'd appreciate it if you could take a closer look and confirm that I captured what's expected for the upgrade guide. Looking forward to your feedback! |
|
@asolimando thanks, I'll review it next Monday! /cc @jonathanc-n |
Gentle reminder @xudong963 :) |
xudong963 left a comment
@asolimando thanks! I'm sorry that I'm busy with others this week.
This PR doesn't fully solve the problem it claims to. The stated goal in the PR description and #20184 is to eliminate exponential recomputation. But for any plan containing a CoalescePartitionsExec, SortPreservingMergeExec, RepartitionExec, HashJoinExec (CollectLeft/Auto), CrossJoinExec, or NestedLoopJoinExec — which is most non-trivial plans — the operator restarts a fresh bottom-up walk from inside its own partition_statistics IIUC. So the recomputation isn't gone;
Caching sounds good. How about making caching part of StatisticsContext from day one? Then we can run some benchmarks to show the gains, which will make the PR easier for the community to accept. WDYT?
Thank you for your input @xudong963, no need to apologize, it's understandable! You raise a fair point: we fully avoid the recomputation only for linear plans, but operators that call Re. the cache, I identified the need for the One limitation I identified on the Cache lifecycle/scope:
The scope of #20184 is, in my understanding, 1. (single walk), if you agree with that, I plan to use Re. benchmarks, do you have a specific workload in mind (e.g., TPC-DS, Q99)? Also, could I be added to the allowlist to trigger benchmark runs so I can iterate without requiring manual re-runs, in case I need multiple iterations? WDYT? |
|
Thanks for the thoughtful response @asolimando — the framing is exactly right, and the prior discussion with @kosiew in #21483 is helpful context. On scope: agreed, let's land per-call caching in this PR (your Option 1) and treat cross-call caching with stable node IDs as a follow-up. Could you open an issue for Option 2 so we don't lose track? On the cache key: (Arc::as_ptr, partition) is safe within a single synchronous compute_statistics walk — the Arcs are held by the plan tree and can't be dropped during the call, so pointer reuse isn't a concern. Good call. On benchmarks: I'd avoid full TPC-DS Q99 — statistics computation is a small fraction of total query time and will get lost in noise. A targeted micro-bench is more informative:
That should cleanly demonstrate the gain. |
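To make the cache-key argument concrete, here is a minimal sketch of a per-call memo keyed by (node address, partition). `Node` and `WalkCache` are hypothetical toy types, not the PR's `StatsCache`, and "statistics" are reduced to a row count. The address is a safe key here only because the tree holds every Arc for the whole walk.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;
use std::sync::Arc;

/// Toy plan node; a hypothetical stand-in for `Arc<dyn ExecutionPlan>`.
pub struct Node {
    pub rows: usize,
    pub children: Vec<Arc<Node>>,
}

/// Per-call memo keyed by (node address, partition). Hypothetical type.
#[derive(Default)]
pub struct WalkCache {
    entries: RefCell<HashMap<(usize, Option<usize>), usize>>,
    computations: Cell<usize>,
}

impl WalkCache {
    /// Row count under `node`, computed at most once per (node, partition).
    pub fn stats(&self, node: &Arc<Node>, partition: Option<usize>) -> usize {
        // The address is a stable key only while the plan tree keeps the Arc alive.
        let key = (Arc::as_ptr(node) as usize, partition);
        let cached = self.entries.borrow().get(&key).copied();
        if let Some(hit) = cached {
            return hit;
        }
        self.computations.set(self.computations.get() + 1);
        // The toy ignores the partition when combining rows; it only affects the key.
        let below: usize = node
            .children
            .iter()
            .map(|child| self.stats(child, partition))
            .sum();
        let total = node.rows + below;
        self.entries.borrow_mut().insert(key, total);
        total
    }

    /// Number of cache misses so far, i.e. how many nodes were actually computed.
    pub fn computations(&self) -> usize {
        self.computations.get()
    }
}
```

With a leaf shared by two parents under one root, the first `stats(&root, None)` call computes four nodes rather than five, and repeating the call computes none. Asking for `Some(0)` afterwards misses on every node again, since the partition is part of the key.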
Thanks for the confirmation and the clarifications. I will hopefully get to it early next week, and I will ping you as soon as I have updates! |
e135e8a to a8a3d6c
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
|
Hey @xudong963, I've pushed new commits implementing what we discussed (force-pushed to rebase on latest main, but the first two commits ( A walkthrough of the new commits:
Re. the benchmark: the numbers are the average of 5 local runs, and they are conservative. The baseline still benefits from an ephemeral per-walk cache within each re-walk, whereas the true baseline would be no caching at all and would show a larger gap. Since this benchmark is new, I couldn't find a better way to show a before/after run. The improvement is clear anyway, but I wanted to mention it for completeness. I will open a follow-up issue for cross-call caching with stable node IDs (Option 2) once this lands, as Looking forward to your review! |
a8a3d6c to 3d66565
|
(rebased on latest |
    let input_stats = match partition {
        Some(_) => Arc::unwrap_or_clone(
            ctx.compute_child_statistics(self.input.as_ref(), partition)?,
        ),
        None => Arc::unwrap_or_clone(Arc::clone(&ctx.child_stats()[0])),
    };
Per-operator boilerplate is repetitive and bug-prone. Almost every partition-preserving operator now contains:
    let stats = match partition {
        Some(_) => ctx.compute_child_statistics(self.input.as_ref(), partition)?,
        None => Arc::clone(&ctx.child_stats()[0]),
    };
This should be a helper on the context: ctx.child_stats_for(0, self.input.as_ref(), partition) or similar. Five identical match blocks across FilterExec, CoalesceBatchesExec, BufferExec, CooperativeExec, and OutputRequirementExec are five places to make the same mistake when the contract evolves.
If we added a StatisticsArgs structure as I proposed above, we could perhaps have this as a method on StatisticsArgs
Makes total sense and, as suggested, StatisticsArgs proved to be a good location for this. We now have:
* args.child_stats_for(self.input.as_ref()), which replaces the match block across all partition-preserving operators
* args.child_stats_of(child), for partition-merging operators
Addressed in bc32cf2
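For readers following the thread, the two helpers could look roughly like the sketch below. Only the method names `child_stats_for` and `child_stats_of` come from this PR; `Args`, `Child`, `Stats`, `filter_stats`, and the error type are hypothetical simplifications, and the real methods go through the shared cache rather than calling the child directly.

```rust
use std::sync::Arc;

/// Toy statistics: just a row count. Hypothetical type.
#[derive(Debug, Clone, PartialEq)]
pub struct Stats {
    pub rows: usize,
}

/// Toy child plan that knows its per-partition row counts. Hypothetical type.
pub struct Child {
    pub partition_rows: Vec<usize>,
}

impl Child {
    pub fn stats(&self, partition: Option<usize>) -> Result<Arc<Stats>, String> {
        let rows: usize = match partition {
            Some(p) => *self
                .partition_rows
                .get(p)
                .ok_or_else(|| format!("invalid partition {p}"))?,
            None => self.partition_rows.iter().sum::<usize>(),
        };
        Ok(Arc::new(Stats { rows }))
    }
}

/// Stand-in for `StatisticsArgs`, carrying only the requested partition.
pub struct Args {
    pub partition: Option<usize>,
}

impl Args {
    /// Partition-aware lookup: forwards the requested partition to the child.
    pub fn child_stats_for(&self, child: &Child) -> Result<Arc<Stats>, String> {
        child.stats(self.partition)
    }

    /// Overall lookup: always asks for whole-plan statistics.
    pub fn child_stats_of(&self, child: &Child) -> Result<Arc<Stats>, String> {
        child.stats(None)
    }
}

/// A partition-preserving operator no longer needs its own `match partition`.
pub fn filter_stats(args: &Args, input: &Child) -> Result<Stats, String> {
    let input_stats = Arc::unwrap_or_clone(args.child_stats_for(input)?);
    // Pretend the filter keeps half of the rows (made-up selectivity).
    Ok(Stats { rows: input_stats.rows / 2 })
}
```

The point of the design is that the `Some(_)` vs. `None` decision lives in one place, so a change to the contract touches the helper instead of every operator.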
Also worth measuring on a deep FilterExec chain queried at Some(0). The context: if you call compute_statistics(plan, Some(0)) on a deep filter chain, the framework first walks the entire tree computing None stats, then each filter turns around and asks for Some(0) stats on demand, which triggers another cached walk. The shared cache makes the second walk cheap, but for partition-preserving plans we end up populating both None and Some(p) entries for every node.
Covered in bf43bc7: I have added a FilterExec chain benchmark at depths 10/20/50. It shows a ~2x cost for per-partition vs. overall statistics, and a ~25x speedup over the non-shared-cache baseline at depth 50.
The 2x cost is expected because of the second walk and, as you anticipated, the cache still makes it cheap enough.
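A toy model of where the 2x comes from, assuming the walk order described in the review comment (a full None walk first, then the Some(p) walk on demand). `ChainCache` is hypothetical, nodes are identified by their depth in a linear chain instead of by pointer, and filters simply pass the row count through.

```rust
use std::collections::HashMap;

/// Toy cache for a linear chain; node 0 is the leaf scan. Hypothetical type.
#[derive(Default)]
pub struct ChainCache {
    entries: HashMap<(usize, Option<usize>), u64>,
    computations: usize,
}

impl ChainCache {
    /// Stats for `node`, recursing into the single child on a cache miss.
    fn stats(&mut self, node: usize, partition: Option<usize>) -> u64 {
        if let Some(hit) = self.entries.get(&(node, partition)).copied() {
            return hit;
        }
        self.computations += 1;
        let rows = if node == 0 {
            1_000 // made-up leaf row count
        } else {
            self.stats(node - 1, partition)
        };
        self.entries.insert((node, partition), rows);
        rows
    }

    /// Models the described order: overall walk first, then the per-partition walk.
    pub fn compute_statistics(&mut self, top: usize, partition: Option<usize>) -> u64 {
        let overall = self.stats(top, None);
        match partition {
            Some(_) => self.stats(top, partition),
            None => overall,
        }
    }

    pub fn entries(&self) -> usize {
        self.entries.len()
    }

    pub fn computations(&self) -> usize {
        self.computations
    }
}
```

In this model a 50-node chain queried with None fills 50 entries, while the same chain queried with Some(0) fills 100 (one None and one Some(0) entry per node). Repeating the Some(0) query on the same cache does no further work.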
|
Thanks for the ping -- I will try and review this shortly. I am totally swamped trying to review multiple 1000+ line PRs (and trying to give them thoughtful reviews and understand the implications) |
alamb left a comment
Thank you @asolimando and @xudong963 -- this is looking like good progress. I left some thoughts.
|
Thank you @xudong963 and @alamb for your feedback and reviews! I am off until early next week with limited connectivity but I will get back to you soon, here and in related PRs/issues around statistics. |
Sounds good -- thank you. It will probably be good timing -- we'll get the 54 release out and then we can add these new APIs in 55 |
3d66565 to 53bbf5e
003d1ab to d25e1ad
|
@alamb @xudong963, there was another conflict, so I had to force-push again. Since a couple of artifacts were left over from the previous rebase, I did a fresh rebase and reorganized the commits by "theme", as the history was getting hard to manage. Current commits:
|
d25e1ad to 25757b9
f93b452 to b4b8e76
…StatsCache StatisticsArgs carries partition index and a per-call cache. Operators look up child stats lazily via compute_child_statistics(child, partition).
…sArgs Callers now create StatisticsArgs directly and call plan.statistics_with_args(). The cache is created in StatisticsArgs::new() and shared through compute_child_statistics calls.
b4b8e76 to e19b719
…n-statistics-context # Conflicts: # datafusion/physical-plan/src/execution_plan.rs # datafusion/physical-plan/src/filter.rs # docs/source/library-user-guide/upgrading/55.0.0.md
|
@alamb @xudong963: I have fixed the new conflicts. Could you take a final look and confirm that everything looks good to you? Happy to address any remaining concerns. |
|
Now that we have released 54.0.0 I have some more time to work on major changes for 55. Checking this one out again |
alamb left a comment
Thank you @asolimando and @xudong963 -- I just went over this PR again and I think it looks like a nice step forward.
I have two small suggestions:
However, I am also happy to implement them as their own follow on PRs
    #[expect(deprecated)]
    self.partition_statistics(args.partition())
I guess I was thinking we could make it
    fn statistics_with_args(&self, args: &StatisticsArgs) -> Result<Arc<Statistics>> {
        if let Some(idx) = args.partition() {
            // Validate partition index
            let partition_count = self.properties().partitioning.partition_count();
            assert_or_internal_err!(
                idx < partition_count,
                "Invalid partition index: {}, the partition count is {}",
                idx,
                partition_count
            );
        }
        Ok(Arc::new(Statistics::new_unknown(&self.schema())))
    }
(aka literally copy/paste the implementation of partition_statistics inline)
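A self-contained sketch of that default, runnable without DataFusion: validate the partition index, then report unknown statistics. `Plan`, `Args`, `Stats`, and `OpaqueExec` are hypothetical stand-ins for `ExecutionPlan`, `StatisticsArgs`, `Statistics`, and an operator that does not override the default. A plain `String` error replaces `assert_or_internal_err!`.

```rust
/// Toy statistics; `None` means "unknown". Hypothetical type.
#[derive(Debug, PartialEq)]
pub struct Stats {
    pub num_rows: Option<usize>,
}

/// Stand-in for `StatisticsArgs`, carrying only the requested partition.
pub struct Args {
    pub partition: Option<usize>,
}

/// Stand-in for the `ExecutionPlan` trait.
pub trait Plan {
    fn partition_count(&self) -> usize;

    /// Default implementation: no delegation to a deprecated method.
    fn statistics_with_args(&self, args: &Args) -> Result<Stats, String> {
        if let Some(idx) = args.partition {
            // Validate the partition index before answering.
            let count = self.partition_count();
            if idx >= count {
                return Err(format!(
                    "Invalid partition index: {idx}, the partition count is {count}"
                ));
            }
        }
        Ok(Stats { num_rows: None })
    }
}

/// An operator that relies entirely on the default.
pub struct OpaqueExec {
    pub partitions: usize,
}

impl Plan for OpaqueExec {
    fn partition_count(&self) -> usize {
        self.partitions
    }
}
```

Inlining the body this way means operators that only implement the new method never touch the deprecated one, which makes its later removal simpler.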
|
Thank you, the implementation is neat! I believe it solves the recomputation issue. I have one idea to improve the API: Currently, the caching logic is explicitly implemented inside each operator’s statistics computation. We could decouple cache management from the operator-level statistics propagation, so that the implementation is easier to evolve. The idea would look like this: Stateless API inside
|
…change

Replace `StatisticsArgs::new(partition)` with a builder-style API:

* `StatisticsArgs::new()` takes no arguments (partition defaults to `None`)
* `with_partition(Some(idx))` / `set_partition(Some(idx))` set the partition

Changing the partition starts a new statistics walk, so the memoization cache (keyed by raw plan pointer + partition) is now reset when the partition changes. This prevents entries computed for one walk from leaking into another, where a since-dropped plan node could share an address with a new node and produce a stale cache hit.

Update all call sites and add a unit test covering the cache reset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
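The builder shape and the reset rule from this commit message can be sketched as follows. `ArgsSketch`, `remember`, and `cached_entries` are hypothetical. Whether the real `set_partition` also clears the cache when the value is unchanged is not stated in the commit message, so this sketch assumes it clears only on an actual change.

```rust
use std::collections::HashMap;

/// Stand-in for `StatisticsArgs`: a partition plus a per-walk memo.
#[derive(Default)]
pub struct ArgsSketch {
    partition: Option<usize>,
    // Keyed by (plan node address, partition), as in the commit message.
    cache: HashMap<(usize, Option<usize>), u64>,
}

impl ArgsSketch {
    /// No arguments; the partition defaults to `None`.
    pub fn new() -> Self {
        Self::default()
    }

    /// Builder-style setter.
    pub fn with_partition(mut self, partition: Option<usize>) -> Self {
        self.set_partition(partition);
        self
    }

    /// Changing the partition starts a new walk, so the memo is dropped.
    /// Assumption: setting the same value again keeps the memo.
    pub fn set_partition(&mut self, partition: Option<usize>) {
        if self.partition != partition {
            self.cache.clear();
        }
        self.partition = partition;
    }

    pub fn partition(&self) -> Option<usize> {
        self.partition
    }

    /// Hypothetical helper: record a result for the current partition.
    pub fn remember(&mut self, node_addr: usize, rows: u64) {
        self.cache.insert((node_addr, self.partition), rows);
    }

    pub fn cached_entries(&self) -> usize {
        self.cache.len()
    }
}
```

Clearing on change matters because the key contains a raw address: once a walk ends, a dropped node's address can be reused by a different node, and an old entry would then look like a valid hit.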
Thank you @alamb for the approval and for preparing the two PRs, I have cherry-picked both and just added a small follow-up commit to adapt the FFI call sites to the new builder-style API. Apologies for the late reply but I was off last week. |
…n-statistics-context Conflicts: - aggregates/mod.rs: keep statistics_with_args, pass args.partition() to statistics_inner (upstream added partition arg) - hash_join/exec.rs: keep StatisticsArgs import + statistics_with_args, drop get_record_batch_memory_size (replaced by RecordBatchMemoryCounter), add missing null_equality arg to (None, _) branch
Thanks @2010YOUY01, I agree that decoupling statistics handling from the computation is the proper long-term solution. I have filed #22958 to track this, and I hope I have summarized your proposal correctly there.
Which issue does this PR close?
Closes #20184
Rationale for this change
ExecutionPlan::partition_statistics forces each operator to re-fetch child statistics internally, causing redundant subtree walks in deep plans.
What changes are included in this PR?
* Deprecate partition_statistics in favor of statistics_with_args(&self, args: &StatisticsArgs), an extensible signature that won't require downstream churn when new parameters are added
* StatisticsArgs carries the partition index and a shared per-call StatsCache, eliminating redundant subtree walks within a single compute_statistics call
* partition=None and cached; operators look them up via args.child_stats_of(child) (overall) or args.child_stats_for(child) (partition-aware)
Tests
Existing tests pass unchanged. New unit test verifies the caching contract.
Test plan
* cargo fmt --all
* cargo clippy --all-targets --all-features -- -D warnings
* cargo test --profile ci --all-features on affected crates
Disclaimer: I used AI to assist in the code generation. I have manually reviewed the output, and it matches my intention and understanding.