POC: Add citus.distribution_columns GUC for auto-distributing tables #8482

Open
eaydingol wants to merge 5 commits into main from feature/auto-distribution-columns

@eaydingol
Collaborator

Summary

Adds a new citus.distribution_columns GUC that holds a priority list of column names. On CREATE TABLE and CREATE TABLE AS SELECT, a new table is automatically hash-distributed by the first listed column it contains.

Key changes

Commit 1: GUC and auto-distribution for CREATE TABLE

  • New citus.distribution_columns GUC (comma-separated column name priority list)
  • Tables with a matching column are auto-distributed on creation
  • Precedence: tenant schema > distribution_columns > use_citus_managed_tables

Commit 2: Optimized CTAS path

  • CREATE TABLE AS SELECT with auto-distribution uses a distribute-first strategy
  • Creates the empty distributed table, then runs INSERT...SELECT to push data directly to workers (avoids coordinator round-trip)
  • Handles CTE, TABLE, VALUES, and parenthesized subquery forms

Commit 3: Gap analysis fixes

  • GUC alphabetical ordering (CI fix)
  • Tenant schema precedence in CTAS optimization path
  • accessMethod (USING columnar) preserved in optimized CTAS
  • Escaped quote handling in AS keyword scanner
  • Dead code and unused variable cleanup

Commit 4: Test simplification

  • Remove redundant test cases (parenthesized CTAS, duplicate counts_match)
  • Add TABLE and VALUES keyword CTAS tests

Testing

  • 843-line regression test covering 47 scenarios
  • All check-post-citus14 tests pass
  • Full multi_1_schedule run: zero new failures vs baseline

@codecov

codecov bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 6.77966% with 220 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.59%. Comparing base (546f206) to head (6903c26).

❌ Your patch check has failed because the patch coverage (6.77%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8482      +/-   ##
==========================================
- Coverage   88.90%   88.59%   -0.31%     
==========================================
  Files         286      286              
  Lines       63107    63341     +234     
  Branches     7910     7974      +64     
==========================================
+ Hits        56108    56120      +12     
- Misses       4734     4949     +215     
- Partials     2265     2272       +7     

Introduce a new GUC citus.distribution_columns that accepts a
comma-separated priority list of column names. When set, any
CREATE TABLE or CREATE TABLE AS SELECT whose columns match an
entry in the list is automatically hash-distributed by that column,
removing the need for an explicit create_distributed_table() call.
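
The priority-list lookup described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `find_distribution_column` and its array-based signature are hypothetical, and the real implementation works on PostgreSQL relation metadata rather than plain string arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the PR): return the first name in
 * `priority` (highest priority first) that also appears in `columns`,
 * or NULL when none of the table's columns is listed.
 */
const char *
find_distribution_column(const char *const *priority, size_t npriority,
                         const char *const *columns, size_t ncolumns)
{
    assert(priority != NULL && columns != NULL);

    /* The outer loop walks the priority list, so earlier entries win. */
    for (size_t p = 0; p < npriority; p++)
    {
        for (size_t c = 0; c < ncolumns; c++)
        {
            if (strcmp(priority[p], columns[c]) == 0)
                return priority[p];
        }
    }

    return NULL;    /* no match: auto-distribution does not apply */
}
```

With a list of `tenant_id, customer_id`, a table that has both columns resolves to `tenant_id`, regardless of the order in which the table declares them.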

Implementation:
- Register the GUC in shared_library_init.c
- Add ShouldAutoDistributeNewTable() and AutoDistributeNewTable() in
  table.c, called from ConvertNewTableIfNecessary() for both
  CreateStmt and CreateTableAsStmt paths
- Guard against unsupported relation kinds: foreign tables, matviews,
  partitioned table children, and inherited tables are skipped
- The GUC takes lower priority than tenant schema
  (citus.enable_schema_based_sharding) but higher priority than
  citus.use_citus_managed_tables
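
A rough sketch of the precedence and guard logic listed above, under stated assumptions: the `NewTableInfo` struct, the `NewTableAction` enum, and `choose_new_table_action` are hypothetical names, not Citus APIs. Only the ordering (tenant schema, then distribution_columns, then use_citus_managed_tables) and the list of skipped relation kinds come from this description. Letting a skipped relation fall through to the later checks is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a newly created relation; not a Citus struct. */
typedef struct NewTableInfo
{
    bool isForeignTable;
    bool isMatview;
    bool isPartitionChild;
    bool hasInheritanceParent;
    bool inTenantSchema;          /* schema-based sharding applies */
    bool hasMatchingColumn;       /* a citus.distribution_columns entry matched */
    bool useCitusManagedTables;   /* citus.use_citus_managed_tables is on */
} NewTableInfo;

/* Hypothetical outcome labels for illustration only. */
typedef enum NewTableAction
{
    ACTION_TENANT_TABLE,      /* single-shard table in a tenant schema */
    ACTION_AUTO_DISTRIBUTE,   /* hash-distribute by the matching column */
    ACTION_CITUS_MANAGED,     /* handled by use_citus_managed_tables */
    ACTION_PLAIN_TABLE        /* left as an ordinary PostgreSQL table */
} NewTableAction;

NewTableAction
choose_new_table_action(const NewTableInfo *info)
{
    assert(info != NULL);

    /* Tenant schema has the highest priority. */
    if (info->inTenantSchema)
        return ACTION_TENANT_TABLE;

    /* Relation kinds that auto-distribution skips. */
    bool unsupported = info->isForeignTable || info->isMatview ||
                       info->isPartitionChild || info->hasInheritanceParent;

    if (info->hasMatchingColumn && !unsupported)
        return ACTION_AUTO_DISTRIBUTE;

    /* Lowest priority of the three modes. */
    if (info->useCitusManagedTables)
        return ACTION_CITUS_MANAGED;

    return ACTION_PLAIN_TABLE;
}
```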

Test infrastructure:
- Add comprehensive regression test (auto_distribution_columns.sql)
  covering GUC parsing, priority lists, CTAS from distributed/local/
  reference tables, partitioned tables, colocation, EXPLAIN CREATE
  TABLE AS, foreign tables, matviews, inheritance, transactions,
  schema interactions, and edge cases
- Create post_citus14_schedule for post-Citus-14 feature tests that
  are expected to fail in n/n-1 mixed-version mode
- Move auto_distribution_columns out of multi_1_schedule into the
  new post_citus14_schedule with its own Makefile target
  (check-post-citus14)

Current limitation: CTAS pulls data to the coordinator first, then
redistributes it via CopyLocalDataIntoShards, causing a round trip
even when source and target share the same distribution column.

….SELECT

When citus.distribution_columns is set and a CREATE TABLE AS SELECT is
executed, the old path materialized all data on the coordinator, then
redistributed it to workers via CopyLocalDataIntoShards — a full
round-trip.

The new path intercepts CTAS before PostgreSQL executes it and
decomposes it into two steps:

  1. CREATE TABLE (empty) + auto-distribute via SPI
  2. INSERT INTO ... SELECT (Citus can push this down to workers)

When source and target are co-located, no data passes through the
coordinator at all.
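
The two-step decomposition can be illustrated with a small string-building sketch. `build_ctas_steps` is a hypothetical helper, not part of the PR. The real code derives the column definitions from the Query's targetList and runs the statements via SPI, and it would also need proper identifier quoting, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical sketch: format the two statements of the optimized path.
 * `columnDefs` stands for the column list derived from the SELECT output,
 * e.g. "tenant_id integer, total bigint". Returns 0 on success, or -1 if
 * either output buffer is too small.
 */
int
build_ctas_steps(const char *table, const char *columnDefs,
                 const char *selectText,
                 char *createSql, size_t createLen,
                 char *insertSql, size_t insertLen)
{
    assert(table != NULL && columnDefs != NULL && selectText != NULL);

    /* Step 1: empty table; it is distributed before step 2 runs. */
    int n = snprintf(createSql, createLen, "CREATE TABLE %s (%s)",
                     table, columnDefs);
    if (n < 0 || (size_t) n >= createLen)
        return -1;

    /* Step 2: INSERT ... SELECT into the now-distributed table. */
    n = snprintf(insertSql, insertLen, "INSERT INTO %s %s",
                 table, selectText);
    if (n < 0 || (size_t) n >= insertLen)
        return -1;

    return 0;
}
```

For `CREATE TABLE orders_copy AS SELECT tenant_id, total FROM orders`, this yields `CREATE TABLE orders_copy (tenant_id integer, total bigint)` followed by `INSERT INTO orders_copy SELECT tenant_id, total FROM orders`.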

Implementation details:
- TryOptimizeCTASForAutoDistribution() in table.c builds the CREATE
  TABLE DDL from the Query's targetList (types, collations, WITH
  options, tablespace) and executes it via SPI.
- After SPI creates the table, AutoDistributeNewTable() is called
  explicitly since SPI sub-commands use PROCESS_UTILITY_QUERY context
  and don't trigger the top-level ConvertNewTableIfNecessary hook.
- The SELECT portion is extracted from the original query string by
  scanning for the AS keyword, then executed as INSERT INTO ... SELECT.
- FindMatchingDistributionColumnFromTargetList() checks output columns
  against the GUC priority list before the table exists.

Bail-out cases (fall back to old path):
- SELECT INTO syntax (no AS keyword to parse)
- Temp tables, materialized views, binary upgrades
- Internal backends (metadata sync, rebalancer)
- No matching distribution column in the output

Addresses 9 of 12 identified gaps:

GAP-3 (CI blocker): Move citus.distribution_columns GUC registration
to correct alphabetical position (after distributed_deadlock_detection_factor).

GAP-1 (High): Add tenant schema precedence check in
TryOptimizeCTASForAutoDistribution. When schema-based sharding is
enabled and target schema is a tenant schema, fall back to the standard
CTAS path so ConvertNewTableIfNecessary creates a single-shard tenant
table instead of hash-distributing.

GAP-2 (Medium): Add INTO->accessMethod handling to the optimized CTAS
path so USING columnar (and other access methods) is preserved.

GAP-4 (Low): Fix escaped single-quote handling in the AS keyword
scanner. The inner while loop now properly handles '' escape sequences.
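
A minimal sketch of the kind of scan GAP-4 concerns, assuming a simplified scanner: `find_text_after_as` is a hypothetical function, not the PR's implementation. It skips single-quoted literals and treats a doubled `''` as an escaped quote. The word-boundary check goes beyond what this commit describes (GAP-6 is listed as not addressed), and dollar-quoting, double-quoted identifiers, and comments are not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static int
is_ident_char(char ch)
{
    return isalnum((unsigned char) ch) || ch == '_';
}

/*
 * Hypothetical sketch: return a pointer to the text following the first
 * standalone AS keyword that lies outside a single-quoted literal, or
 * NULL if no such keyword exists.
 */
const char *
find_text_after_as(const char *query)
{
    assert(query != NULL);

    for (const char *p = query; *p != '\0'; p++)
    {
        if (*p == '\'')
        {
            /* Skip the literal; '' inside it is an escaped quote. */
            p++;
            while (*p != '\0')
            {
                if (*p == '\'')
                {
                    if (p[1] == '\'')
                    {
                        p += 2;
                        continue;
                    }
                    break;          /* closing quote */
                }
                p++;
            }
            if (*p == '\0')
                return NULL;        /* unterminated literal: give up */
            continue;               /* loop increment steps past the quote */
        }

        if (tolower((unsigned char) p[0]) == 'a' &&
            tolower((unsigned char) p[1]) == 's' &&
            (p == query || !is_ident_char(p[-1])) &&
            !is_ident_char(p[2]))
        {
            return p + 2;
        }
    }

    return NULL;
}
```

Returning NULL corresponds to the bail-out behavior: when the scanner cannot find a usable AS keyword, the caller would fall back to the standard CTAS path.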

GAP-10 (Low): Remove ~25 lines of leftover design-thinking comments
and dead appendStringInfo/resetStringInfo code.

GAP-12 (Low): Remove unused colIdx variable from
FindMatchingDistributionColumnFromTargetList.

GAP-7 (Test): Add CTAS-in-tenant-schema test verifying tenant schema
takes precedence over distribution_columns GUC.

GAP-8 (Test): Add CTAS with CTE and parenthesized subquery tests
exercising the AS keyword scanner.

GAP-9 (Test): Add CTAS with explicit column name override tests
verifying IntoClause.colNames handling.

Not addressed (low priority, no functional risk):
- GAP-5: Dollar-quoting in AS scanner (extremely rare, safe fallback)
- GAP-6: Word boundary in keyword match (no practical risk)
- GAP-11: Duplicate GUC parsing refactor (cosmetic)

Remove redundant test cases and add missing coverage:

- Remove ctas_paren: parenthesized CTAS form already exercised 6x in
  the nested queries section.
- Remove basic CTAS (t_ctas): same scenario covered by ctas_same_dist
  with more thorough checks. Move the priority-list fallback sub-test
  into the priority list section where it logically belongs.
- Trim 5 redundant counts_match checks (local, same_dist, ref,
  nested_agg, both EXPLAIN cases). Keep 3 representative ones: join,
  no_match, and explain_join.

Add new test coverage for AS keyword scanner:
- TABLE keyword: CREATE TABLE t AS TABLE source
- VALUES keyword: CREATE TABLE t (cols) AS VALUES (...)

Net: -24 lines SQL, +2 new test scenarios, 0 redundancy.

Replace the manual strtok_r-based parsing that ran on every table
creation with a pre-parsed List maintained by GUC hooks:

- Add CheckDistributionColumns (check hook): validates the comma-
  separated identifier list via SplitIdentifierString before the
  value is accepted. Invalid input (e.g. double commas) is rejected
  without touching the current parsed list.

- Add AssignDistributionColumns (assign hook): parses the validated
  string into a List of char* pointers in TopMemoryContext so it
  survives transaction boundaries. A static 'previousRawList' keeps
  the backing string alive (SplitIdentifierString returns pointers
  into its input buffer).

- Refactor FindMatchingDistributionColumn and
  FindMatchingDistributionColumnFromTargetList to iterate the
  pre-parsed list instead of re-tokenizing on every call.

- Rename DistributionColumnsGUC → DistributionColumns and
  AssignDistributionColumnsGUC → AssignDistributionColumns.

Behavioral change: the GUC now uses standard PostgreSQL identifier
rules (unquoted names are downcased, quoted names preserve case).
To match a case-sensitive column like "Tenant_Id", set the GUC to
'"Tenant_Id"'. Tests updated accordingly.
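
The identifier rule behind this behavioral change can be sketched for a single list entry. This is an illustration only: the PR delegates the real parsing to PostgreSQL's SplitIdentifierString, and `normalize_identifier` is a hypothetical stand-in that ignores whitespace trimming and identifier-length truncation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical sketch: unquoted names are downcased, quoted names keep
 * their case and "" inside quotes becomes a single ". Writes the result
 * into `out` and returns 0, or -1 on malformed input or overflow.
 */
int
normalize_identifier(const char *in, char *out, size_t outLen)
{
    assert(in != NULL && out != NULL);

    size_t n = 0;

    if (*in == '"')
    {
        in++;
        for (;;)
        {
            if (*in == '\0')
                return -1;          /* missing closing quote */
            if (*in == '"')
            {
                if (in[1] == '"')
                    in++;           /* "" -> copy one quote below */
                else
                    break;          /* closing quote */
            }
            if (n + 1 >= outLen)
                return -1;
            out[n++] = *in++;
        }
        if (in[1] != '\0')
            return -1;              /* text after the closing quote */
    }
    else
    {
        for (; *in != '\0'; in++)
        {
            if (n + 1 >= outLen)
                return -1;
            out[n++] = (char) tolower((unsigned char) *in);
        }
    }

    if (n == 0)
        return -1;                  /* empty identifier */
    out[n] = '\0';
    return 0;
}
```

Under this rule an unquoted `Tenant_Id` becomes `tenant_id` and no longer matches a column created as `"Tenant_Id"`, which is why the quoted form is needed in the GUC value.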

All check-post-citus14 tests pass.

@eaydingol force-pushed the feature/auto-distribution-columns branch from 0e32e88 to 6903c26 on February 23, 2026 08:13