POC: Add citus.distribution_columns GUC for auto-distributing tables by eaydingol · Pull Request #8482 · citusdata/citus

eaydingol · 2026-02-19T13:34:52Z

Summary

Adds a new citus.distribution_columns GUC that automatically distributes tables by a priority list of column names on CREATE TABLE and CREATE TABLE AS SELECT.

Key changes

Commit 1: GUC and auto-distribution for CREATE TABLE

New citus.distribution_columns GUC (comma-separated column name priority list)
Tables with a matching column are auto-distributed on creation
Precedence: tenant schema > distribution_columns > use_citus_managed_tables
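
The priority-list lookup described above can be sketched in a few lines. This is a Python model of the idea, not the C code in the PR: the function name `pick_distribution_column` and the column names are hypothetical, and the sketch assumes the names have already been normalized to a common case.

```python
from typing import Optional, Sequence


def pick_distribution_column(guc_value: str, table_columns: Sequence[str]) -> Optional[str]:
    """Return the first name in the priority list that is also a table column.

    Hypothetical helper: assumes plain, already-normalized column names.
    """
    # Split the comma-separated setting and drop surrounding whitespace.
    priority = [name.strip() for name in guc_value.split(",") if name.strip()]
    columns = set(table_columns)
    # The order of the GUC entries decides, not the order of the table columns.
    for name in priority:
        if name in columns:
            return name
    # No entry matches: this GUC leaves the table alone.
    return None
```

With the setting `tenant_id, customer_id`, a table that has both columns would be distributed by `tenant_id`, and a table that has only `customer_id` would fall back to that column.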

Commit 2: Optimized CTAS path

CREATE TABLE AS SELECT with auto-distribution uses a distribute-first strategy
Creates the empty distributed table, then runs INSERT...SELECT to push data directly to workers (avoids coordinator round-trip)
Handles CTE, TABLE, VALUES, and parenthesized subquery forms
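
As a rough illustration of the distribute-first strategy, the Python sketch below spells out the statement sequence a user would otherwise run by hand. It is an assumption-laden model: `decompose_ctas` and the table and column names are hypothetical, identifiers are not quoted, and the middle step uses an explicit `create_distributed_table()` call where the PR distributes the table internally.

```python
from typing import List, Sequence, Tuple


def decompose_ctas(
    table: str,
    columns: Sequence[Tuple[str, str]],
    select_sql: str,
    dist_column: str,
) -> List[str]:
    """Return the manual equivalent of a distribute-first CTAS.

    Hypothetical helper: no identifier quoting, no WITH options or tablespace.
    """
    col_defs = ", ".join(f"{name} {type_name}" for name, type_name in columns)
    return [
        # 1. Create the empty target table from the SELECT's output columns.
        f"CREATE TABLE {table} ({col_defs})",
        # 2. Distribute it while it is still empty (done internally in the PR).
        f"SELECT create_distributed_table('{table}', '{dist_column}')",
        # 3. Load the data with INSERT ... SELECT, which Citus may push down.
        f"INSERT INTO {table} {select_sql}",
    ]
```

Because the table is already distributed when the data arrives, step 3 is an ordinary distributed INSERT ... SELECT rather than a local load followed by redistribution.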

Commit 3: Gap analysis fixes

GUC alphabetical ordering (CI fix)
Tenant schema precedence in CTAS optimization path
accessMethod (USING columnar) preserved in optimized CTAS
Escaped quote handling in AS keyword scanner
Dead code and unused variable cleanup

Commit 4: Test simplification

Remove redundant test cases (parenthesized CTAS, duplicate counts_match)
Add TABLE and VALUES keyword CTAS tests

Testing

843-line regression test covering 47 scenarios
All check-post-citus14 tests pass
Full multi_1_schedule run: zero new failures vs baseline

codecov · 2026-02-20T19:55:19Z

Codecov Report

❌ Patch coverage is 6.77966% with 220 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.59%. Comparing base (546f206) to head (6903c26).

❌ Your patch check has failed because the patch coverage (6.77%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8482      +/-   ##
==========================================
- Coverage   88.90%   88.59%   -0.31%     
==========================================
  Files         286      286              
  Lines       63107    63341     +234     
  Branches     7910     7974      +64     
==========================================
+ Hits        56108    56120      +12     
- Misses       4734     4949     +215     
- Partials     2265     2272       +7
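
The percentages in the report can be cross-checked from the raw counts. The Python sketch below makes two assumptions that are not stated by Codecov here: the patch is taken to contain 236 lines (16 covered plus the 220 reported missing, inferred from the 6.77966% figure), and percentages are truncated rather than rounded to two decimals, which is what reproduces the reported values. The helper name is made up.

```python
import math


def percent_truncated(part: int, whole: int, digits: int = 2) -> float:
    """Percentage of part in whole, truncated (not rounded) to `digits` decimals."""
    scale = 10 ** digits
    return math.floor(part / whole * 100 * scale) / scale


head = percent_truncated(56120, 63341)   # hits / lines on the PR head
base = percent_truncated(56108, 63107)   # hits / lines on main
patch = percent_truncated(16, 236)       # inferred: 16 covered of 236 patch lines
print(head, base, patch)
```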


Introduce a new GUC citus.distribution_columns that accepts a comma-separated priority list of column names. When set, any CREATE TABLE or CREATE TABLE AS SELECT whose columns match an entry in the list is automatically hash-distributed by that column, removing the need for an explicit create_distributed_table() call.

Implementation:
- Register the GUC in shared_library_init.c
- Add ShouldAutoDistributeNewTable() and AutoDistributeNewTable() in table.c, called from ConvertNewTableIfNecessary() for both CreateStmt and CreateTableAsStmt paths
- Guard against unsupported relation kinds: foreign tables, matviews, partitioned table children, and inherited tables are skipped
- The GUC takes lower priority than tenant schema (citus.enable_schema_based_sharding) but higher priority than citus.use_citus_managed_tables

Test infrastructure:
- Add comprehensive regression test (auto_distribution_columns.sql) covering GUC parsing, priority lists, CTAS from distributed/local/reference tables, partitioned tables, colocation, EXPLAIN CREATE TABLE AS, foreign tables, matviews, inheritance, transactions, schema interactions, and edge cases
- Create post_citus14_schedule for post-Citus-14 feature tests that are expected to fail in n/n-1 mixed-version mode
- Move auto_distribution_columns out of multi_1_schedule into the new post_citus14_schedule with its own Makefile target (check-post-citus14)

Current limitation: CTAS pulls data to the coordinator first, then redistributes it via CopyLocalDataIntoShards, causing a round trip even when source and target share the same distribution column.
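
The guard conditions in this commit can be modeled as a single decision function. The Python sketch below is illustrative only: the `NewTable` fields and the function name are hypothetical stand-ins for the catalog checks the C code performs, and the sketch only answers whether this GUC applies, not what happens to a skipped table afterwards.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class NewTable:
    """Hypothetical summary of a freshly created relation."""
    columns: Sequence[str]
    in_tenant_schema: bool = False
    is_foreign: bool = False
    is_matview: bool = False
    is_partition_child: bool = False
    has_inheritance_parent: bool = False


def auto_distribution_column(table: NewTable, priority: Sequence[str]) -> Optional[str]:
    """Return the column to hash-distribute by, or None if the GUC does not apply."""
    # Tenant schemas take precedence over citus.distribution_columns.
    if table.in_tenant_schema:
        return None
    # Relation kinds the commit message lists as skipped.
    if (table.is_foreign or table.is_matview
            or table.is_partition_child or table.has_inheritance_parent):
        return None
    # Otherwise the first priority-list entry present in the table wins.
    for name in priority:
        if name in table.columns:
            return name
    return None
```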

….SELECT

When citus.distribution_columns is set and a CREATE TABLE AS SELECT is executed, the old path materialized all data on the coordinator, then redistributed it to workers via CopyLocalDataIntoShards, a full round-trip.

The new path intercepts CTAS before PostgreSQL executes it and decomposes it into two steps:
1. CREATE TABLE (empty) + auto-distribute via SPI
2. INSERT INTO ... SELECT (Citus can push this down to workers)

When source and target are co-located, no data passes through the coordinator at all.

Implementation details:
- TryOptimizeCTASForAutoDistribution() in table.c builds the CREATE TABLE DDL from the Query's targetList (types, collations, WITH options, tablespace) and executes it via SPI.
- After SPI creates the table, AutoDistributeNewTable() is called explicitly since SPI sub-commands use PROCESS_UTILITY_QUERY context and don't trigger the top-level ConvertNewTableIfNecessary hook.
- The SELECT portion is extracted from the original query string by scanning for the AS keyword, then executed as INSERT INTO ... SELECT.
- FindMatchingDistributionColumnFromTargetList() checks output columns against the GUC priority list before the table exists.

Bail-out cases (fall back to old path):
- SELECT INTO syntax (no AS keyword to parse)
- Temp tables, materialized views, binary upgrades
- Internal backends (metadata sync, rebalancer)
- No matching distribution column in the output
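
Extracting the SELECT portion by scanning for the AS keyword could look roughly like the Python model below. It is a sketch under stated assumptions, not the scanner in table.c: the function name is hypothetical, doubled quotes inside quoted text are treated as escapes, dollar-quoting is not handled, and the sketch checks word boundaries, which the PR's scanner reportedly does not.

```python
from typing import Optional


def _is_ident_char(ch: str) -> bool:
    return ch.isalnum() or ch == "_"


def find_query_after_as(query: str) -> Optional[str]:
    """Return the text after the first top-level AS keyword, or None if absent."""
    i, n = 0, len(query)
    while i < n:
        ch = query[i]
        if ch in ("'", '"'):
            # Skip a quoted literal or identifier; a doubled quote is an escape.
            quote = ch
            i += 1
            while i < n:
                if query[i] == quote:
                    if i + 1 < n and query[i + 1] == quote:
                        i += 2
                        continue
                    break
                i += 1
            i += 1  # step past the closing quote
            continue
        if query[i:i + 2].upper() == "AS":
            before_ok = i == 0 or not _is_ident_char(query[i - 1])
            after_ok = i + 2 >= n or not _is_ident_char(query[i + 2])
            if before_ok and after_ok:
                return query[i + 2:].strip()
        i += 1
    # No AS keyword (for example SELECT INTO): the caller falls back.
    return None
```

Returning `None` corresponds to the bail-out case in which the optimized path gives up and the standard CTAS path runs.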

Addresses 9 of 12 identified gaps:

GAP-3 (CI blocker): Move citus.distribution_columns GUC registration to correct alphabetical position (after distributed_deadlock_detection_factor).

GAP-1 (High): Add tenant schema precedence check in TryOptimizeCTASForAutoDistribution. When schema-based sharding is enabled and target schema is a tenant schema, fall back to the standard CTAS path so ConvertNewTableIfNecessary creates a single-shard tenant table instead of hash-distributing.

GAP-2 (Medium): Add INTO->accessMethod handling to the optimized CTAS path so USING columnar (and other access methods) is preserved.

GAP-4 (Low): Fix escaped single-quote handling in the AS keyword scanner. The inner while loop now properly handles '' escape sequences.

GAP-10 (Low): Remove ~25 lines of leftover design-thinking comments and dead appendStringInfo/resetStringInfo code.

GAP-12 (Low): Remove unused colIdx variable from FindMatchingDistributionColumnFromTargetList.

GAP-7 (Test): Add CTAS-in-tenant-schema test verifying tenant schema takes precedence over distribution_columns GUC.

GAP-8 (Test): Add CTAS with CTE and parenthesized subquery tests exercising the AS keyword scanner.

GAP-9 (Test): Add CTAS with explicit column name override tests verifying IntoClause.colNames handling.

Not addressed (low priority, no functional risk):
- GAP-5: Dollar-quoting in AS scanner (extremely rare, safe fallback)
- GAP-6: Word boundary in keyword match (no practical risk)
- GAP-11: Duplicate GUC parsing refactor (cosmetic)

Remove redundant test cases and add missing coverage:
- Remove ctas_paren: parenthesized CTAS form already exercised 6x in the nested queries section.
- Remove basic CTAS (t_ctas): same scenario covered by ctas_same_dist with more thorough checks. Move the priority-list fallback sub-test into the priority list section where it logically belongs.
- Trim 5 redundant counts_match checks (local, same_dist, ref, nested_agg, both EXPLAIN cases). Keep 3 representative ones: join, no_match, and explain_join.

Add new test coverage for AS keyword scanner:
- TABLE keyword: CREATE TABLE t AS TABLE source
- VALUES keyword: CREATE TABLE t (cols) AS VALUES (...)

Net: -24 lines SQL, +2 new test scenarios, 0 redundancy.

Replace the manual strtok_r-based parsing that ran on every table creation with a pre-parsed List maintained by GUC hooks:
- Add CheckDistributionColumns (check hook): validates the comma-separated identifier list via SplitIdentifierString before the value is accepted. Invalid input (e.g. double commas) is rejected without touching the current parsed list.
- Add AssignDistributionColumns (assign hook): parses the validated string into a List of char* pointers in TopMemoryContext so it survives transaction boundaries. A static 'previousRawList' keeps the backing string alive (SplitIdentifierString returns pointers into its input buffer).
- Refactor FindMatchingDistributionColumn and FindMatchingDistributionColumnFromTargetList to iterate the pre-parsed list instead of re-tokenizing on every call.
- Rename DistributionColumnsGUC → DistributionColumns and AssignDistributionColumnsGUC → AssignDistributionColumns.

Behavioral change: the GUC now uses standard PostgreSQL identifier rules (unquoted names are downcased, quoted names preserve case). To match a case-sensitive column like "Tenant_Id", set the GUC to '"Tenant_Id"'. Tests updated accordingly.

All check-post-citus14 tests pass.
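
The identifier rules behind this behavioral change can be illustrated with a small Python model. It is a simplified approximation of what SplitIdentifierString does for a comma-separated list, written for this page: the function name is hypothetical, identifier length truncation is ignored, and Python's `lower()` is only a stand-in for PostgreSQL's downcasing.

```python
from typing import List, Optional


def parse_identifier_list(raw: str) -> Optional[List[str]]:
    """Parse a comma-separated identifier list; return None if it is malformed."""
    names: List[str] = []
    i, n = 0, len(raw)
    while i < n and raw[i].isspace():
        i += 1
    if i == n:
        return names  # an empty setting is valid and yields no names
    while True:
        if raw[i] == '"':
            # Quoted identifier: case is preserved, "" stands for one quote.
            i += 1
            buf: List[str] = []
            while True:
                if i >= n:
                    return None  # unterminated quoted identifier
                if raw[i] == '"':
                    if i + 1 < n and raw[i + 1] == '"':
                        buf.append('"')
                        i += 2
                        continue
                    i += 1
                    break
                buf.append(raw[i])
                i += 1
            name = "".join(buf)
        else:
            # Unquoted identifier: read up to a comma or whitespace, then downcase.
            start = i
            while i < n and raw[i] != "," and not raw[i].isspace():
                i += 1
            if i == start:
                return None  # empty entry, e.g. a double comma
            name = raw[start:i].lower()
        names.append(name)
        while i < n and raw[i].isspace():
            i += 1
        if i == n:
            return names
        if raw[i] != ",":
            return None  # two names without a separating comma
        i += 1
        while i < n and raw[i].isspace():
            i += 1
        if i == n:
            return None  # trailing comma
```

Under this model, `Tenant_Id` is stored as `tenant_id`, while `"Tenant_Id"` keeps its capitalization, which is why a case-sensitive column has to be quoted in the setting.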

eaydingol added 5 commits February 23, 2026 11:12

eaydingol force-pushed the feature/auto-distribution-columns branch from 0e32e88 to 6903c26 on February 23, 2026 08:13

eaydingol wants to merge 5 commits into main from feature/auto-distribution-columns
