Support bundle: automatic deletion to maintain free dataset buffer#9662
Support bundle: automatic deletion to maintain free dataset buffer#9662
Conversation
5bc38cc to
c78018e
Compare
c78018e to
1c94bc0
Compare
- Use COUNT queries instead of loading all dataset/bundle rows - Use GROUP BY to get used_datasets and active_bundles in a single query - Add LIMIT to the deletion candidates query - Exclude Failed bundles from used_datasets (their dataset was expunged) These changes allow the auto-deletion query to scale to systems with thousands of datasets without loading all rows into memory.
Replace the two-step find-then-update approach with a single atomic CTE query that calculates how many deletions are needed AND performs them in one database operation. The previous approach had a time-of-check to time-of-use (TOCTTOU) issue where multiple Nexuses running concurrently could over-delete bundles: 1. Nexus A queries: free=2, needs 1 deletion -> gets B1 2. Nexus A transitions B1: Active -> Destroying 3. Nexus B queries: free=2 (unchanged, Destroying still occupies), needs 1 -> gets B2 4. Nexus B transitions B2: Active -> Destroying 5. Result: 2 bundles deleted when only 1 was needed The new atomic query: - Calculates free datasets and needed deletions - Respects min_bundles_to_keep constraint - Finds the N oldest Active bundles - Transitions them to Destroying state atomically - Returns the IDs of deleted bundles Also adds tests verifying: - Bundles are actually transitioned to Destroying state - Failed bundles don't count as occupying datasets - Explain test for SQL validity - Expectorate test for SQL output
hawkw
left a comment
There was a problem hiding this comment.
I had a bunch of very insignificant style nits, but overall, I like the approach used here.
| .support_bundle_config_get(opctx) | ||
| .await | ||
| .context("failed to get current support bundle config")?; | ||
|
|
There was a problem hiding this comment.
i wondered if, perhaps, we ought to have some code in here that validates whether the new values are in the range 0-100, but i see that there are CHECK constraints for that in the db schema, and those will fail when we try to set the new values. do we think that the error that bubbles up from the database query will be clear enough to the user in that case, or should we be pre-validating the values here?
if we decide to just rely on the CHECK constraints for this, maybe we should leave a comment here saying that we don't need to validate the desired percentages now since they get CHECKed when we try to go set them?
There was a problem hiding this comment.
so: the CHECK constraints will catch that, we do have a catch in the datastore method below, and I have tests for it in test_config_set_rejects_invalid_target_{free,keep}_percent.
I'll tweak them to be more explicit about the error message, but I think they're pretty clear?
Can also add a comment here - basically, "it's the datastore's problem".
There was a problem hiding this comment.
Comment could be nice, but not a big deal.
| CREATE INDEX IF NOT EXISTS lookup_bundle_by_state_and_creation ON omicron.public.support_bundle ( | ||
| state, | ||
| time_created | ||
| ); | ||
|
|
||
| /* | ||
| * Support Bundle Config | ||
| * | ||
| * Configuration for automatic support bundle deletion. This table uses a | ||
| * singleton pattern (exactly one row) to store cluster-wide configuration. | ||
| */ | ||
| CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config ( | ||
| -- Singleton pattern: only one row allowed | ||
| singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE), | ||
|
|
||
| -- Percentage (0-100) of total datasets to keep free for new allocations. | ||
| -- Calculated as CEIL(total_datasets * target_free_percent / 100). | ||
| -- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free. | ||
| target_free_percent INT8 NOT NULL | ||
| CHECK (target_free_percent >= 0 AND target_free_percent <= 100), | ||
|
|
||
| -- Percentage (0-100) of total datasets to retain as bundles (minimum). | ||
| -- Calculated as CEIL(total_datasets * min_keep_percent / 100). | ||
| -- Prevents aggressive cleanup on small systems. | ||
| min_keep_percent INT8 NOT NULL | ||
| CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100), | ||
|
|
||
| time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| ); | ||
|
|
||
| -- Default: 10% free datasets, keep at least 10% worth of bundles | ||
| INSERT INTO omicron.public.support_bundle_config (singleton, target_free_percent, min_keep_percent, time_modified) | ||
| VALUES (TRUE, 10, 10, NOW()) | ||
| ON CONFLICT (singleton) DO NOTHING; |
There was a problem hiding this comment.
nitpick: can we wrap some of these lines? perhaps:
| CREATE INDEX IF NOT EXISTS lookup_bundle_by_state_and_creation ON omicron.public.support_bundle ( | |
| state, | |
| time_created | |
| ); | |
| /* | |
| * Support Bundle Config | |
| * | |
| * Configuration for automatic support bundle deletion. This table uses a | |
| * singleton pattern (exactly one row) to store cluster-wide configuration. | |
| */ | |
| CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config ( | |
| -- Singleton pattern: only one row allowed | |
| singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE), | |
| -- Percentage (0-100) of total datasets to keep free for new allocations. | |
| -- Calculated as CEIL(total_datasets * target_free_percent / 100). | |
| -- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free. | |
| target_free_percent INT8 NOT NULL | |
| CHECK (target_free_percent >= 0 AND target_free_percent <= 100), | |
| -- Percentage (0-100) of total datasets to retain as bundles (minimum). | |
| -- Calculated as CEIL(total_datasets * min_keep_percent / 100). | |
| -- Prevents aggressive cleanup on small systems. | |
| min_keep_percent INT8 NOT NULL | |
| CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100), | |
| time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW() | |
| ); | |
| -- Default: 10% free datasets, keep at least 10% worth of bundles | |
| INSERT INTO omicron.public.support_bundle_config (singleton, target_free_percent, min_keep_percent, time_modified) | |
| VALUES (TRUE, 10, 10, NOW()) | |
| ON CONFLICT (singleton) DO NOTHING; | |
| CREATE INDEX IF NOT EXISTS | |
| lookup_bundle_by_state_and_creation | |
| ON omicron.public.support_bundle ( | |
| state, | |
| time_created | |
| ); | |
| /* | |
| * Support Bundle Config | |
| * | |
| * Configuration for automatic support bundle deletion. This table uses a | |
| * singleton pattern (exactly one row) to store cluster-wide configuration. | |
| */ | |
| CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config ( | |
| -- Singleton pattern: only one row allowed | |
| singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE), | |
| -- Percentage (0-100) of total datasets to keep free for new allocations. | |
| -- Calculated as CEIL(total_datasets * target_free_percent / 100). | |
| -- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free. | |
| target_free_percent INT8 NOT NULL | |
| CHECK (target_free_percent >= 0 AND target_free_percent <= 100), | |
| -- Percentage (0-100) of total datasets to retain as bundles (minimum). | |
| -- Calculated as CEIL(total_datasets * min_keep_percent / 100). | |
| -- Prevents aggressive cleanup on small systems. | |
| min_keep_percent INT8 NOT NULL | |
| CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100), | |
| time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW() | |
| ); | |
| -- Default: 10% free datasets, keep at least 10% worth of bundles | |
| INSERT INTO omicron.public.support_bundle_config ( | |
| singleton, | |
| target_free_percent, | |
| min_keep_percent, | |
| time_modified | |
| ) | |
| VALUES (TRUE, 10, 10, NOW()) | |
| ON CONFLICT (singleton) DO NOTHING; |
hawkw
left a comment
There was a problem hiding this comment.
Okay, overall this seems good to me --- thanks for addressing all my feedback. I had some minor nitpicks about stuff but I don't really care about them.
| // Update report with state (as of before any deletions) | ||
| report.total_datasets = auto_deleted.total_datasets; | ||
| report.active_bundles = auto_deleted.active_bundles; | ||
| report.free_datasets = auto_deleted.free_datasets; | ||
| report.bundles_marked_for_deletion = auto_deleted.deleted_ids.len(); |
There was a problem hiding this comment.
parta me wants to say "why can't we just use the same type for the return value from support_bundle_auto_delete and the report type, but...I realize that would create a weird dependency between nexus-types and nexus-db-queries which we maybe don't wanna have?
i might say "hey what if we exhaustively destructured auto_deleted so that we get a compiler error if someone adds more fields we're not looking at, but...i don't know if i care that much..."
| // Check auto-deletion report | ||
| assert_eq!(output.auto_deletion_err, None); | ||
| let report = output | ||
| .auto_deletion_report | ||
| .expect("Should have auto_deletion_report"); | ||
| assert_eq!(report.total_datasets, 5); |
There was a problem hiding this comment.
in addition to asserting about what's in the report, should we maybe also be asserting all the bundles that we expect to exist still exist after each activation?
Summary
Implements automatic deletion of support bundles to maintain a buffer of free debug datasets for new allocations.
support_bundle_configtable withtarget_free_percentandmin_keep_percentsupport_bundle_auto_delete()CTE + unit testsauto_delete_bundles()phase before existing phasesHow it works
The
SupportBundleCollectorbackground task now runs three phases:Configuration uses percentages of total datasets (calculated with CEIL rounding)
target_free_percent: Target % of datasets to keep free for future bundles (default: 10%)min_keep_percent: Minimum % of datasets to retain (default: 10%)Tests
min_keep_percentconstraint limits maximum deletionstest_support_bundle_auto_deletion()Fixes #9660