feat: support mandatory compaction trigger for cascading reindexing supervisors who have unapplied deletion rules #19633
Draft
capistrant wants to merge 1 commit into
To limit PR scope, the web console change to the policy form will be in a follow-up.
Description
Cascading Reindexing introduced deletion rules that delete rows matching a filter after some period of time. This has many applications; one in particular is automatic row-granular deletion for data compliance needs. For example, if your Druid deployment contains customer data with a column `customer_id` identifying which customer a row of data is associated with, you may have different compliance requirements for different customers: for instance, keep customer `foo` data for 30 days and customer `bar` data for 365 days. With cascading reindexing deletion rules, you define a rule for each of these requirements, and the rules lifecycle your data and keep you in compliance.
Currently there is pain when it comes to compaction candidacy. An operator can set minimum uncompacted thresholds on intervals to avoid over-compacting when the gain from such compactions is negligible. From a pure cost perspective, this is great: don't spend compute dollars on negligible space savings or performance gains. However, this breaks down from a compliance perspective in the customer data retention example above. The only option an operator has today to meet compliance is to drop those thresholds to essentially 0 and run compaction overly aggressively, even when a compaction does not apply any missing deletion rules. They end up eating cost to cover a limited set of cases.
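To make the retention example concrete, here is a minimal sketch of per-customer retention (30 days for `foo`, 365 days for `bar`). It does not reflect how Druid actually represents deletion rules; the class `RetentionRuleSketch`, the map, and the method `isDueForDeletion` are all hypothetical names chosen for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Hypothetical illustration only: this is not Druid's deletion rule API.
final class RetentionRuleSketch {

  // Retention period per customer_id, taken from the example in the description.
  static final Map<String, Duration> RETENTION_BY_CUSTOMER = Map.of(
      "foo", Duration.ofDays(30),
      "bar", Duration.ofDays(365));

  // Returns true when a row for the given customer is older than that
  // customer's retention period and should therefore be deleted.
  static boolean isDueForDeletion(String customerId, Instant rowTime, Instant now) {
    Duration retention = RETENTION_BY_CUSTOMER.get(customerId);
    if (retention == null) {
      return false; // No rule for this customer, so nothing is deleted.
    }
    return rowTime.plus(retention).isBefore(now);
  }
}
```

Under these assumptions, a 45-day-old row is due for deletion for customer `foo` but not for customer `bar`, which is why a single interval can be only partially out of compliance.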
This PR introduces a new compaction policy config that an operator can opt into to force compaction to run on an interval if any of the candidate segments are out of sync with the deletion rules that should apply to them. This allows keeping compaction thresholds at the desired level, knowing that this decision will not cause compliance issues when it comes to data lifecycle.
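The intended behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: only the flag name `forcePendingDeletionCompaction` comes from the description, while `CompactionEligibilitySketch`, `IntervalStats`, `PolicyConfig`, `minUncompactedBytes`, and `shouldCompact` are hypothetical.

```java
// Hypothetical sketch: only the flag name forcePendingDeletionCompaction comes
// from this PR; every other name here is illustrative.
final class CompactionEligibilitySketch {

  /** Stand-in for what a policy knows about one candidate interval. */
  static final class IntervalStats {
    final long uncompactedBytes;
    final boolean hasPendingDeletionRules;

    IntervalStats(long uncompactedBytes, boolean hasPendingDeletionRules) {
      this.uncompactedBytes = uncompactedBytes;
      this.hasPendingDeletionRules = hasPendingDeletionRules;
    }
  }

  /** Stand-in for the policy config; minUncompactedBytes is an assumed threshold. */
  static final class PolicyConfig {
    final long minUncompactedBytes;
    final boolean forcePendingDeletionCompaction;

    PolicyConfig(long minUncompactedBytes, boolean forcePendingDeletionCompaction) {
      this.minUncompactedBytes = minUncompactedBytes;
      this.forcePendingDeletionCompaction = forcePendingDeletionCompaction;
    }
  }

  static boolean shouldCompact(IntervalStats stats, PolicyConfig policy) {
    // Opt-in override: pending deletion rules force compaction regardless of stats.
    if (policy.forcePendingDeletionCompaction && stats.hasPendingDeletionRules) {
      return true;
    }
    // Otherwise fall back to the usual threshold check.
    return stats.uncompactedBytes >= policy.minUncompactedBytes;
  }
}
```

With the flag off, an interval below the threshold is skipped even if it has pending deletions. With the flag on, that same interval is compacted, while intervals without pending deletions are still judged by the threshold alone.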
Compliance is just one reason someone may want to enable this config. It is the most compelling that comes to mind because it can be a legal requirement for operators to meet, and the current option to lower thresholds to be compliant is sub-optimal.
`forcePendingDeletionCompaction` is the new policy config. It defaults to false; anyone wishing to force pending deletion rules to be applied regardless of interval compaction stats can set it to true and leave their thresholds in place for candidates that do not require deletions.
Release note
Operators using Cascading Reindexing compaction supervisors with deletion rules can now use compaction policy configuration to opt into having their deletion rules applied to candidate intervals that are not fully compliant, regardless of their interval-level compaction stats. If you opt in, intervals that require deletions to be applied but do not meet the compaction stat thresholds specified in the policy will now be compacted to ensure all required deletion rules are applied. This is critical for operators who use deletion rules for data compliance needs. Note that you must be using the `MostFragmentedIntervalFirst` policy for this config to take effect.
Key changed/added classes in this PR
`CompactionConfigBasedJobTemplate`
`ReindexingDeletionRuleOptimizer`
`CompactionCandidateSearchPolicy`
`MostFragmentedIntervalFirstPolicy`
This PR has: