Skip to content

Support bundle: automatic deletion to maintain free dataset buffer#9662

Open
smklein wants to merge 13 commits intomainfrom
support-bundle-auto-delete
Open

Support bundle: automatic deletion to maintain free dataset buffer#9662
smklein wants to merge 13 commits intomainfrom
support-bundle-auto-delete

Conversation

@smklein
Copy link
Copy Markdown
Collaborator

@smklein smklein commented Jan 15, 2026

Summary

Implements automatic deletion of support bundles to maintain a buffer of free debug datasets for new allocations.

  • Config: Added support_bundle_config table with target_free_percent and min_keep_percent
  • DB Query: Added support_bundle_auto_delete() CTE + unit tests
  • Background Task: Added auto_delete_bundles() phase before existing phases
  • Schema: Added index on (state, time_created) for efficient queries
  • omdb: Added nexus support-bundle-config show/set commands

How it works

The SupportBundleCollector background task now runs three phases:

  1. Auto-delete (new): Marks oldest Active bundles as Destroying when free datasets < target
  2. Cleanup (existing): Cleans up storage and DB for Destroying bundles
  3. Collect (existing): Collects pending bundles

Configuration uses percentages of total datasets (calculated with CEIL rounding)

  • target_free_percent: Target % of datasets to keep free for future bundles (default: 10%)
  • min_keep_percent: Minimum % of datasets to retain (default: 10%)

Tests

  • Deletion triggers only when free_datasets < target threshold
  • Oldest bundles (by time_created) are selected first
  • min_keep_percent constraint limits maximum deletions
  • Only Active bundles are candidates; Collecting/Destroying still occupy datasets
  • Failed bundles don't count toward used dataset count
  • SQL validity via EXPLAIN + expectorate for change detection
  • Integration test in test_support_bundle_auto_deletion()

Fixes #9660

@smklein smklein marked this pull request as draft January 15, 2026 21:19
@smklein smklein force-pushed the support-bundle-auto-delete branch from 5bc38cc to c78018e Compare January 15, 2026 21:26
@smklein smklein force-pushed the support-bundle-auto-delete branch from c78018e to 1c94bc0 Compare January 15, 2026 21:38
- Use COUNT queries instead of loading all dataset/bundle rows
- Use GROUP BY to get used_datasets and active_bundles in a single query
- Add LIMIT to the deletion candidates query
- Exclude Failed bundles from used_datasets (their dataset was expunged)

These changes allow the auto-deletion query to scale to systems with
thousands of datasets without loading all rows into memory.
Replace the two-step find-then-update approach with a single atomic CTE
query that calculates how many deletions are needed AND performs them
in one database operation.

The previous approach had a time-of-check to time-of-use (TOCTTOU) issue
where multiple Nexuses running concurrently could over-delete bundles:
1. Nexus A queries: free=2, needs 1 deletion -> gets B1
2. Nexus A transitions B1: Active -> Destroying
3. Nexus B queries: free=2 (unchanged, Destroying still occupies), needs 1 -> gets B2
4. Nexus B transitions B2: Active -> Destroying
5. Result: 2 bundles deleted when only 1 was needed

The new atomic query:
- Calculates free datasets and needed deletions
- Respects min_bundles_to_keep constraint
- Finds the N oldest Active bundles
- Transitions them to Destroying state atomically
- Returns the IDs of deleted bundles

Also adds tests verifying:
- Bundles are actually transitioned to Destroying state
- Failed bundles don't count as occupying datasets
- Explain test for SQL validity
- Expectorate test for SQL output
@smklein smklein marked this pull request as ready for review January 22, 2026 20:06
@smklein smklein requested review from hawkw and papertigers January 22, 2026 20:06
Copy link
Copy Markdown
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a bunch of very insignificant style nits, but overall, I like the approach used here.

Comment thread dev-tools/omdb/src/bin/omdb/nexus.rs Outdated
Comment thread dev-tools/omdb/src/bin/omdb/nexus/support_bundle_config.rs Outdated
.support_bundle_config_get(opctx)
.await
.context("failed to get current support bundle config")?;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wondered if, perhaps, we ought to have some code in here that validates whether the new values are in the range 0-100, but i see that there are CHECK constraints for that in the db schema, and those will fail when we try to set the new values. do we think that the error that bubbles up from the database query will be clear enough to the user in that case, or should we be pre-validating the values here?

if we decide to just rely on the CHECK constraints for this, maybe we should leave a comment here saying that we don't need to validate the desired percentages now since they get CHECKed when we try to go set them?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so: the CHECK constraints will catch that, we do have a catch in the datastore method below, and I have tests for it in test_config_set_rejects_invalid_target_{free,keep}_percent.

I'll tweak them to be more explicit about the error message, but I think they're pretty clear?

Can also add a comment here - basically, "it's the datastore's problem".

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment could be nice, but not a big deal.

Comment thread nexus/db-queries/src/db/datastore/support_bundle.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/support_bundle.rs Outdated
Comment thread nexus/tests/integration_tests/support_bundles.rs Outdated
Comment thread nexus/tests/integration_tests/support_bundles.rs Outdated
Comment thread schema/crdb/dbinit.sql Outdated
Comment on lines +3174 to +3207
CREATE INDEX IF NOT EXISTS lookup_bundle_by_state_and_creation ON omicron.public.support_bundle (
state,
time_created
);

/*
* Support Bundle Config
*
* Configuration for automatic support bundle deletion. This table uses a
* singleton pattern (exactly one row) to store cluster-wide configuration.
*/
CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config (
-- Singleton pattern: only one row allowed
singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE),

-- Percentage (0-100) of total datasets to keep free for new allocations.
-- Calculated as CEIL(total_datasets * target_free_percent / 100).
-- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free.
target_free_percent INT8 NOT NULL
CHECK (target_free_percent >= 0 AND target_free_percent <= 100),

-- Percentage (0-100) of total datasets to retain as bundles (minimum).
-- Calculated as CEIL(total_datasets * min_keep_percent / 100).
-- Prevents aggressive cleanup on small systems.
min_keep_percent INT8 NOT NULL
CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100),

time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Default: 10% free datasets, keep at least 10% worth of bundles
INSERT INTO omicron.public.support_bundle_config (singleton, target_free_percent, min_keep_percent, time_modified)
VALUES (TRUE, 10, 10, NOW())
ON CONFLICT (singleton) DO NOTHING;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: can we wrap some of these lines? perhaps:

Suggested change
CREATE INDEX IF NOT EXISTS lookup_bundle_by_state_and_creation ON omicron.public.support_bundle (
state,
time_created
);
/*
* Support Bundle Config
*
* Configuration for automatic support bundle deletion. This table uses a
* singleton pattern (exactly one row) to store cluster-wide configuration.
*/
CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config (
-- Singleton pattern: only one row allowed
singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE),
-- Percentage (0-100) of total datasets to keep free for new allocations.
-- Calculated as CEIL(total_datasets * target_free_percent / 100).
-- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free.
target_free_percent INT8 NOT NULL
CHECK (target_free_percent >= 0 AND target_free_percent <= 100),
-- Percentage (0-100) of total datasets to retain as bundles (minimum).
-- Calculated as CEIL(total_datasets * min_keep_percent / 100).
-- Prevents aggressive cleanup on small systems.
min_keep_percent INT8 NOT NULL
CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100),
time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Default: 10% free datasets, keep at least 10% worth of bundles
INSERT INTO omicron.public.support_bundle_config (singleton, target_free_percent, min_keep_percent, time_modified)
VALUES (TRUE, 10, 10, NOW())
ON CONFLICT (singleton) DO NOTHING;
CREATE INDEX IF NOT EXISTS
lookup_bundle_by_state_and_creation
ON omicron.public.support_bundle (
state,
time_created
);
/*
* Support Bundle Config
*
* Configuration for automatic support bundle deletion. This table uses a
* singleton pattern (exactly one row) to store cluster-wide configuration.
*/
CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config (
-- Singleton pattern: only one row allowed
singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE),
-- Percentage (0-100) of total datasets to keep free for new allocations.
-- Calculated as CEIL(total_datasets * target_free_percent / 100).
-- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free.
target_free_percent INT8 NOT NULL
CHECK (target_free_percent >= 0 AND target_free_percent <= 100),
-- Percentage (0-100) of total datasets to retain as bundles (minimum).
-- Calculated as CEIL(total_datasets * min_keep_percent / 100).
-- Prevents aggressive cleanup on small systems.
min_keep_percent INT8 NOT NULL
CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100),
time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Default: 10% free datasets, keep at least 10% worth of bundles
INSERT INTO omicron.public.support_bundle_config (
singleton,
target_free_percent,
min_keep_percent,
time_modified
)
VALUES (TRUE, 10, 10, NOW())
ON CONFLICT (singleton) DO NOTHING;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mind line-wrapping this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and wrapping this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, overall this seems good to me --- thanks for addressing all my feedback. I had some minor nitpicks about stuff but I don't really care about them.

Comment on lines +345 to +349
// Update report with state (as of before any deletions)
report.total_datasets = auto_deleted.total_datasets;
report.active_bundles = auto_deleted.active_bundles;
report.free_datasets = auto_deleted.free_datasets;
report.bundles_marked_for_deletion = auto_deleted.deleted_ids.len();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parta me wants to say "why can't we just use the same type for the return value from support_bundle_auto_delete and the report type, but...I realize that would create a weird dependency between nexus-types and nexus-db-queries which we maybe don't wanna have?

i might say "hey what if we exhaustively destructured auto_deleted so that we get a compiler error if someone adds more fields we're not looking at, but...i don't know if i care that much..."

Comment on lines +996 to +1001
// Check auto-deletion report
assert_eq!(output.auto_deletion_err, None);
let report = output
.auto_deletion_report
.expect("Should have auto_deletion_report");
assert_eq!(report.total_datasets, 5);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in addition to asserting about what's in the report, should we maybe also be asserting all the bundles that we expect to exist still exist after each activation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Bundle: Automatic Deletion

2 participants