POC: Add citus.distribution_columns GUC for auto-distributing tables #8482
Open
Codecov Report
❌ Your patch check has failed because the patch coverage (6.77%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8482      +/-   ##
==========================================
- Coverage   88.90%   88.59%   -0.31%
==========================================
  Files         286      286
  Lines       63107    63341     +234
  Branches     7910     7974      +64
==========================================
+ Hits        56108    56120      +12
- Misses       4734     4949     +215
- Partials     2265     2272       +7
Introduce a new GUC citus.distribution_columns that accepts a comma-separated priority list of column names. When set, any CREATE TABLE or CREATE TABLE AS SELECT whose columns match an entry in the list is automatically hash-distributed by that column, removing the need for an explicit create_distributed_table() call.

Implementation:
- Register the GUC in shared_library_init.c
- Add ShouldAutoDistributeNewTable() and AutoDistributeNewTable() in table.c, called from ConvertNewTableIfNecessary() for both CreateStmt and CreateTableAsStmt paths
- Guard against unsupported relation kinds: foreign tables, matviews, partitioned table children, and inherited tables are skipped
- The GUC takes lower priority than tenant schema (citus.enable_schema_based_sharding) but higher priority than citus.use_citus_managed_tables

Test infrastructure:
- Add comprehensive regression test (auto_distribution_columns.sql) covering GUC parsing, priority lists, CTAS from distributed/local/reference tables, partitioned tables, colocation, EXPLAIN CREATE TABLE AS, foreign tables, matviews, inheritance, transactions, schema interactions, and edge cases
- Create post_citus14_schedule for post-Citus-14 feature tests that are expected to fail in n/n-1 mixed-version mode
- Move auto_distribution_columns out of multi_1_schedule into the new post_citus14_schedule with its own Makefile target (check-post-citus14)

Current limitation: CTAS pulls data to the coordinator first, then redistributes it via CopyLocalDataIntoShards, causing a round trip even when source and target share the same distribution column.
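The priority-list behavior described above can be illustrated with a minimal standalone sketch. It is not the PR's code: pick_distribution_column() and its array-based signature are hypothetical, and the real implementation works on PostgreSQL relation metadata instead of plain string arrays. The point is only the loop order: the priority list drives the outer loop, so an earlier GUC entry wins even if a later entry appears first in the table definition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from the PR): return the first name in the
 * priority list that is also a column of the new table, or NULL if no entry
 * matches. The outer loop walks the priority list, so earlier entries win
 * regardless of the column order in the table definition.
 */
const char *
pick_distribution_column(const char *const *priority, size_t npriority,
                         const char *const *columns, size_t ncolumns)
{
    assert(priority != NULL && columns != NULL);

    for (size_t p = 0; p < npriority; p++)
    {
        for (size_t c = 0; c < ncolumns; c++)
        {
            if (strcmp(priority[p], columns[c]) == 0)
            {
                return priority[p];
            }
        }
    }

    return NULL; /* no match: the table is not auto-distributed */
}
```

With a list of tenant_id, customer_id, a table that has both columns would be distributed by tenant_id, and a table that has only customer_id would fall through to the second entry.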
….SELECT

When citus.distribution_columns is set and a CREATE TABLE AS SELECT is executed, the old path materialized all data on the coordinator, then redistributed it to workers via CopyLocalDataIntoShards, which is a full round-trip.

The new path intercepts CTAS before PostgreSQL executes it and decomposes it into two steps:
1. CREATE TABLE (empty) + auto-distribute via SPI
2. INSERT INTO ... SELECT (Citus can push this down to workers)

When source and target are co-located, no data passes through the coordinator at all.

Implementation details:
- TryOptimizeCTASForAutoDistribution() in table.c builds the CREATE TABLE DDL from the Query's targetList (types, collations, WITH options, tablespace) and executes it via SPI.
- After SPI creates the table, AutoDistributeNewTable() is called explicitly since SPI sub-commands use PROCESS_UTILITY_QUERY context and don't trigger the top-level ConvertNewTableIfNecessary hook.
- The SELECT portion is extracted from the original query string by scanning for the AS keyword, then executed as INSERT INTO ... SELECT.
- FindMatchingDistributionColumnFromTargetList() checks output columns against the GUC priority list before the table exists.

Bail-out cases (fall back to old path):
- SELECT INTO syntax (no AS keyword to parse)
- Temp tables, materialized views, binary upgrades
- Internal backends (metadata sync, rebalancer)
- No matching distribution column in the output
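To make the two-step decomposition concrete, here is a rough sketch of the statement shapes only. decompose_ctas() is a hypothetical helper, not a function from the PR: the real path builds the DDL from the Query's targetList and runs it through SPI, and it would also need proper identifier quoting, which this sketch skips.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch: format the two statements that stand in for a single
 * CREATE TABLE ... AS SELECT. column_defs is a ready-made column list such as
 * "tenant_id int, total bigint"; select_sql is the query text that followed
 * the AS keyword. Returns 0 on success, -1 on empty input or a buffer that
 * is too small. No identifier quoting is done here.
 */
int
decompose_ctas(const char *table, const char *column_defs,
               const char *select_sql,
               char *create_buf, size_t create_len,
               char *insert_buf, size_t insert_len)
{
    assert(table != NULL && column_defs != NULL && select_sql != NULL);

    if (strlen(table) == 0 || strlen(select_sql) == 0)
    {
        return -1;
    }

    /* Step 1: an empty table, which would then be auto-distributed. */
    int n = snprintf(create_buf, create_len, "CREATE TABLE %s (%s)",
                     table, column_defs);
    if (n < 0 || (size_t) n >= create_len)
    {
        return -1;
    }

    /* Step 2: the data load, in a form the planner can push to workers. */
    n = snprintf(insert_buf, insert_len, "INSERT INTO %s %s",
                 table, select_sql);
    if (n < 0 || (size_t) n >= insert_len)
    {
        return -1;
    }

    return 0;
}
```

The distribution step would sit between the two statements, so that the INSERT ... SELECT already targets a distributed table.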
Addresses 9 of 12 identified gaps:

GAP-3 (CI blocker): Move citus.distribution_columns GUC registration to correct alphabetical position (after distributed_deadlock_detection_factor).

GAP-1 (High): Add tenant schema precedence check in TryOptimizeCTASForAutoDistribution. When schema-based sharding is enabled and target schema is a tenant schema, fall back to the standard CTAS path so ConvertNewTableIfNecessary creates a single-shard tenant table instead of hash-distributing.

GAP-2 (Medium): Add INTO->accessMethod handling to the optimized CTAS path so USING columnar (and other access methods) is preserved.

GAP-4 (Low): Fix escaped single-quote handling in the AS keyword scanner. The inner while loop now properly handles '' escape sequences.

GAP-10 (Low): Remove ~25 lines of leftover design-thinking comments and dead appendStringInfo/resetStringInfo code.

GAP-12 (Low): Remove unused colIdx variable from FindMatchingDistributionColumnFromTargetList.

GAP-7 (Test): Add CTAS-in-tenant-schema test verifying tenant schema takes precedence over distribution_columns GUC.

GAP-8 (Test): Add CTAS with CTE and parenthesized subquery tests exercising the AS keyword scanner.

GAP-9 (Test): Add CTAS with explicit column name override tests verifying IntoClause.colNames handling.

Not addressed (low priority, no functional risk):
- GAP-5: Dollar-quoting in AS scanner (extremely rare, safe fallback)
- GAP-6: Word boundary in keyword match (no practical risk)
- GAP-11: Duplicate GUC parsing refactor (cosmetic)
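For readers unfamiliar with the AS keyword scanner mentioned in GAP-4, the following is a minimal sketch of the general technique, not the PR's scanner. find_select_after_as() is a hypothetical name. The sketch skips single-quoted literals, treats '' as an escaped quote inside a literal, and (unlike the PR, per GAP-6) adds a word-boundary check. Like the PR per GAP-5, it does not understand dollar-quoting; it also ignores comments and double-quoted identifiers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* True for characters that can appear inside an unquoted SQL word. */
static int
is_word_char(char ch)
{
    return isalnum((unsigned char) ch) || ch == '_';
}

/*
 * Hypothetical sketch: return a pointer just past the first standalone AS
 * keyword that lies outside a single-quoted literal, or NULL if there is none.
 */
const char *
find_select_after_as(const char *sql)
{
    assert(sql != NULL);
    const char *p = sql;

    while (*p != '\0')
    {
        if (*p == '\'')
        {
            p++; /* step over the opening quote */
            while (*p != '\0')
            {
                if (p[0] == '\'' && p[1] == '\'')
                {
                    p += 2; /* '' is an escaped quote: stay in the literal */
                }
                else if (p[0] == '\'')
                {
                    p++; /* closing quote */
                    break;
                }
                else
                {
                    p++;
                }
            }
            continue;
        }

        int starts_word = (p == sql) || !is_word_char(p[-1]);
        if (starts_word &&
            tolower((unsigned char) p[0]) == 'a' &&
            tolower((unsigned char) p[1]) == 's' &&
            !is_word_char(p[2]))
        {
            return p + 2;
        }
        p++;
    }

    return NULL; /* e.g. SELECT INTO: caller falls back to the old path */
}
```

A NULL result corresponds to the bail-out case where no AS keyword can be found, so the statement takes the standard path.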
Remove redundant test cases and add missing coverage:
- Remove ctas_paren: parenthesized CTAS form already exercised 6x in the nested queries section.
- Remove basic CTAS (t_ctas): same scenario covered by ctas_same_dist with more thorough checks. Move the priority-list fallback sub-test into the priority list section where it logically belongs.
- Trim 5 redundant counts_match checks (local, same_dist, ref, nested_agg, both EXPLAIN cases). Keep 3 representative ones: join, no_match, and explain_join.

Add new test coverage for AS keyword scanner:
- TABLE keyword: CREATE TABLE t AS TABLE source
- VALUES keyword: CREATE TABLE t (cols) AS VALUES (...)

Net: -24 lines SQL, +2 new test scenarios, 0 redundancy.
Replace the manual strtok_r-based parsing that ran on every table creation with a pre-parsed List maintained by GUC hooks:
- Add CheckDistributionColumns (check hook): validates the comma-separated identifier list via SplitIdentifierString before the value is accepted. Invalid input (e.g. double commas) is rejected without touching the current parsed list.
- Add AssignDistributionColumns (assign hook): parses the validated string into a List of char* pointers in TopMemoryContext so it survives transaction boundaries. A static 'previousRawList' keeps the backing string alive (SplitIdentifierString returns pointers into its input buffer).
- Refactor FindMatchingDistributionColumn and FindMatchingDistributionColumnFromTargetList to iterate the pre-parsed list instead of re-tokenizing on every call.
- Rename DistributionColumnsGUC → DistributionColumns and AssignDistributionColumnsGUC → AssignDistributionColumns.

Behavioral change: the GUC now uses standard PostgreSQL identifier rules (unquoted names are downcased, quoted names preserve case). To match a case-sensitive column like "Tenant_Id", set the GUC to '"Tenant_Id"'. Tests updated accordingly.

All check-post-citus14 tests pass.
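The identifier rules and the buffer-lifetime point above can be shown with a simplified stand-in. split_identifier_list() below is hypothetical and is not PostgreSQL's SplitIdentifierString: it omits "" escapes inside quoted names and identifier truncation. It does reproduce the three behaviors the commit relies on: unquoted names are downcased, quoted names keep their case, and the returned pointers point into the input buffer, so that buffer must stay alive as long as the parsed list is used.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, simplified identifier-list splitter. It edits buf in place
 * and stores pointers into buf in out[], so buf must outlive the names.
 * Returns the number of names (0 for an empty list), or -1 when the list is
 * malformed: empty item, unterminated or empty quoted name, too many names.
 */
int
split_identifier_list(char *buf, char **out, int max_out)
{
    assert(buf != NULL && out != NULL);
    int count = 0;
    char *p = buf;

    for (;;)
    {
        while (isspace((unsigned char) *p)) p++;
        if (*p == '\0' && count == 0) return 0; /* empty list */

        char *start;
        char *end;
        if (*p == '"')
        {
            start = ++p;
            while (*p != '\0' && *p != '"') p++; /* keep case as written */
            if (*p == '\0' || p == start) return -1;
            end = p++; /* end sits on the closing quote */
        }
        else
        {
            start = p;
            while (*p != '\0' && *p != ',' && !isspace((unsigned char) *p))
            {
                *p = (char) tolower((unsigned char) *p); /* downcase */
                p++;
            }
            if (p == start) return -1; /* empty item, e.g. "a,,b" */
            end = p;
        }

        while (isspace((unsigned char) *p)) p++;
        char next = *p; /* read the separator before terminating the name */
        if (next != ',' && next != '\0') return -1;
        if (count >= max_out) return -1;

        *end = '\0';
        out[count++] = start;
        if (next == '\0') return count;
        p++; /* step over the comma */
    }
}
```

Because out[] aliases buf, freeing or reusing buf right after parsing would leave dangling names, which is the situation the static 'previousRawList' in the commit is described as guarding against.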
0e32e88 to 6903c26
Summary
Adds a new citus.distribution_columns GUC that automatically distributes tables by a priority list of column names on CREATE TABLE and CREATE TABLE AS SELECT.

Key changes

Commit 1: GUC and auto-distribution for CREATE TABLE
- citus.distribution_columns GUC (comma-separated column name priority list)
- distribution_columns > use_citus_managed_tables

Commit 2: Optimized CTAS path
- CREATE TABLE AS SELECT with auto-distribution uses a distribute-first strategy
- INSERT...SELECT to push data directly to workers (avoids coordinator round-trip)

Commit 3: Gap analysis fixes
- accessMethod (USING columnar) preserved in optimized CTAS

Commit 4: Test simplification

Testing
- check-post-citus14 tests pass
- multi_1_schedule run: zero new failures vs baseline