perf: batch fragment-reuse index remap into a single rebuild + commit#7303
Closed
xuanyu-z wants to merge 2 commits into
Closed
perf: batch fragment-reuse index remap into a single rebuild + commit#7303xuanyu-z wants to merge 2 commits into
xuanyu-z wants to merge 2 commits into
Conversation
…emapped `remap_column_index` decided whether to remap an index's data from the presence of the old fragments in its fragment bitmap. But `load_indices` coverage-remaps the bitmap onto the new fragments in memory, and a later commit can persist that cleaned bitmap to disk before the index data is remapped -- for example while remapping a sibling index on the same table. Once the bitmap is clean, the data remap was silently skipped: the index kept serving stale row addresses (corrected only by reuse-index auto-remap at load), and its dataset_version never advanced, so the fragment reuse index could never be trimmed and grew without bound. Detect the already-coverage-remapped case (the group's old fragments are gone from the bitmap but its new fragments are present, and the index data still predates the reuse version) and remap the data in that case too. Adds a regression test exercising two indexes on one table.
`remap_index` applied the fragment-reuse index one version at a time, rebuilding the index file and committing once per version. An index touched by K deferred compactions therefore paid K full index rebuilds and K commits. Compose instead: `remap_row_id` already chains every reuse version (passing through addresses a version does not touch), so the union of all versions' keys gives a single baseline->final row-address map; the fragment bitmap is composed in one pass with the same all-or-nothing/straddle semantics. The index is then rebuilt and committed exactly once. Adds a test asserting a single commit for a multi-version reuse chain.
f54ea8a to
d203514
Compare
Contributor
Author
|
Superseded by a stacked PR based on the #7286 fix branch so the diff shows only the batching commit: xuanyu-z#1 . Will reopen against upstream main once #7286 merges. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
remap_index(the post-deferred-compaction catch-up that physically rewrites an index's stored row addresses) applied the fragment-reuse index one version at a time — rebuilding the index file and committing once per reuse version. An index touched by K deferred compactions paid K full index rebuilds + K commits, even though the final state is identical to applying all K at once. This is worst exactly when it matters most: when the reuse index has accumulated many versions before remap runs.Change
Compose the whole chain and rebuild once:
FragReuseIndex::remap_row_idalready chains every version and passes through addresses a version doesn't touch, so the union of all versions' map keys mapped through it yields a single baseline → final address map. Oneindex::remap_indexcall applies it.data_predates_versioncheck is evaluated against the fixed baseline, since there are no longer any intermediate commits.CreateIndexcommit instead of one per version.Semantically equivalent to the previous per-version loop (applying the composed map to a stored address gives the same result as applying each version's map in sequence), just without the intermediate rebuilds/commits.
Why this is safe
remap/frag_reuse/compactiontests pass unchanged, includingtest_remap_index_after_compaction(a 3-version chain) which now completes in one commit.Test
test_remap_index_batches_multiple_reuse_versions: accumulates ≥2 reuse versions (delete-prefix + deferred compaction in a loop), then asserts a singleremap_column_indexadvances the dataset version by exactly 1 (one commit, not one per version) and that the reuse index trims to zero afterward.Note
This optimizes the cost per catch-up; it's complementary to running remap regularly (which keeps the version count — and thus K — small in the first place).