Shard move in block_writes mode fails with idle_in_transaction_session_timeout on metadata workers#8484
Conversation
…tions during shard moves In block_writes mode, LockShardListMetadataOnWorkers() opens coordinated transactions on all metadata workers to hold advisory shard metadata locks. These connections remain open for the entire duration of the shard move, but workers not involved in the data copy have no commands to execute and they sit idle-in-transaction until the coordinated transaction commits. For large shards, the data copy can take hours, easily exceeding common idle_in_transaction_session_timeout values. When the timeout fires on an uninvolved worker, PostgreSQL terminates the connection and the move fails. Fix by sending SET LOCAL idle_in_transaction_session_timeout = 0 on each metadata worker connection before acquiring locks. SET LOCAL scopes the change to the current transaction only, so normal sessions are unaffected.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files🚀 New features to boost your workflow:
|
|
LGTM. |
There was a problem hiding this comment.
Pull request overview
Fixes shard move / shard split failures in block_writes mode caused by idle_in_transaction_session_timeout terminating idle coordinated transactions on metadata workers during long-running data copy.
Changes:
- Disable
idle_in_transaction_session_timeout(viaSET LOCAL ... = 0) on metadata worker connections before acquiring shard metadata locks. - Add a regression test that configures a low
idle_in_transaction_session_timeoutacross nodes and performs ablock_writesshard move. - Update expected regression output accordingly.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/backend/distributed/utils/resource_lock.c |
Prepends a SET LOCAL idle_in_transaction_session_timeout = 0 command before shard metadata lock acquisition on metadata workers. |
src/test/regress/sql/shard_move_constraints_blocking.sql |
Adds a regression test scenario for shard move under low idle-in-transaction timeout. |
src/test/regress/expected/shard_move_constraints_blocking.out |
Captures the expected output for the new regression test section. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| CREATE TABLE test_move(id int PRIMARY KEY, val text); | ||
| SELECT create_distributed_table('test_move', 'id'); | ||
| INSERT INTO test_move SELECT i, 'val_' || i FROM generate_series(1, 100) i; | ||
|
|
||
| -- move a shard using block_writes; should succeed despite the 1s timeout | ||
| SELECT citus_move_shard_placement(8980000, 'localhost', :worker_1_port, 'localhost', :worker_2_port, shard_transfer_mode:='block_writes'); | ||
| SELECT public.wait_for_resource_cleanup(); |
There was a problem hiding this comment.
This test sets idle_in_transaction_session_timeout to 1s, but the shard being moved is tiny (100 rows spread across 4 shards), so the block_writes move will likely finish well under 1s even without the backend fix. That makes the regression non-deterministic / ineffective at catching the original failure mode. Consider making the data-copy phase reliably exceed the timeout (e.g., add a small per-row delay on the target during copy, or otherwise slow the copy in a deterministic way) so the test would fail on older code paths.
There was a problem hiding this comment.
@copilot open a new pull request to apply changes based on this feedback
|
@emelsimsek I've opened a new pull request, #8485, to work on those changes. Once the pull request is ready, I'll request review from you. |
…n_timeout on metadata workers (#8484) ### Description When performing a shard move using block_writes transfer mode (either directly via citus_move_shard_placement or through the background rebalancer), the operation can fail with: ``` ERROR: terminating connection due to idle-in-transaction timeout CONTEXT: while executing command on <worker_host>:<worker_port> ``` The failing worker is a metadata worker that is neither the source nor the target of the shard move. ### Root Cause LockShardListMetadataOnWorkers() opens coordinated transactions on all metadata workers to acquire advisory shard metadata locks via SELECT lock_shard_metadata(...). These transactions remain open until the entire shard move completes and the coordinated transaction commits. In block_writes mode, the data copy phase (CopyShardsToNode) runs synchronously between the source and target workers. Metadata workers not involved in the copy have no commands to execute and their connections sit completely idle-in-transaction for the entire duration of the data copy. For large shards, the copy can take significantly longer than common idle_in_transaction_session_timeout values, When the timeout fires on an uninvolved worker, PostgreSQL terminates the connection, causing the shard move to fail. This also affects shard splits, since they follow the same code path through LockShardListMetadataOnWorkers. ### Fix LockShardListMetadataOnWorkers() should send SET LOCAL idle_in_transaction_session_timeout = 0 on each metadata worker connection before acquiring the locks. SET LOCAL scopes the change to the current transaction only, so normal sessions on the workers are unaffected.
#8491) …n_timeout on metadata workers (#8484) ### Description When performing a shard move using block_writes transfer mode (either directly via citus_move_shard_placement or through the background rebalancer), the operation can fail with: ``` ERROR: terminating connection due to idle-in-transaction timeout CONTEXT: while executing command on <worker_host>:<worker_port> ``` The failing worker is a metadata worker that is neither the source nor the target of the shard move. ### Root Cause LockShardListMetadataOnWorkers() opens coordinated transactions on all metadata workers to acquire advisory shard metadata locks via SELECT lock_shard_metadata(...). These transactions remain open until the entire shard move completes and the coordinated transaction commits. In block_writes mode, the data copy phase (CopyShardsToNode) runs synchronously between the source and target workers. Metadata workers not involved in the copy have no commands to execute and their connections sit completely idle-in-transaction for the entire duration of the data copy. For large shards, the copy can take significantly longer than common idle_in_transaction_session_timeout values, When the timeout fires on an uninvolved worker, PostgreSQL terminates the connection, causing the shard move to fail. This also affects shard splits, since they follow the same code path through LockShardListMetadataOnWorkers. ### Fix LockShardListMetadataOnWorkers() should send SET LOCAL idle_in_transaction_session_timeout = 0 on each metadata worker connection before acquiring the locks. SET LOCAL scopes the change to the current transaction only, so normal sessions on the workers are unaffected. DESCRIPTION: PR description that will go into the change log, up to 78 characters
…n_timeout on metadata workers (#8484) ### Description When performing a shard move using block_writes transfer mode (either directly via citus_move_shard_placement or through the background rebalancer), the operation can fail with: ``` ERROR: terminating connection due to idle-in-transaction timeout CONTEXT: while executing command on <worker_host>:<worker_port> ``` The failing worker is a metadata worker that is neither the source nor the target of the shard move. ### Root Cause LockShardListMetadataOnWorkers() opens coordinated transactions on all metadata workers to acquire advisory shard metadata locks via SELECT lock_shard_metadata(...). These transactions remain open until the entire shard move completes and the coordinated transaction commits. In block_writes mode, the data copy phase (CopyShardsToNode) runs synchronously between the source and target workers. Metadata workers not involved in the copy have no commands to execute and their connections sit completely idle-in-transaction for the entire duration of the data copy. For large shards, the copy can take significantly longer than common idle_in_transaction_session_timeout values, When the timeout fires on an uninvolved worker, PostgreSQL terminates the connection, causing the shard move to fail. This also affects shard splits, since they follow the same code path through LockShardListMetadataOnWorkers. ### Fix LockShardListMetadataOnWorkers() should send SET LOCAL idle_in_transaction_session_timeout = 0 on each metadata worker connection before acquiring the locks. SET LOCAL scopes the change to the current transaction only, so normal sessions on the workers are unaffected.
Description
When performing a shard move using block_writes transfer mode (either directly via citus_move_shard_placement or through the background rebalancer), the operation can fail with:
The failing worker is a metadata worker that is neither the source nor the target of the shard move.
Root Cause
LockShardListMetadataOnWorkers() opens coordinated transactions on all metadata workers to acquire advisory shard metadata locks via SELECT lock_shard_metadata(...). These transactions remain open until the entire shard move completes and the coordinated transaction commits.
In block_writes mode, the data copy phase (CopyShardsToNode) runs synchronously between the source and target workers. Metadata workers not involved in the copy have no commands to execute and their connections sit completely idle-in-transaction for the entire duration of the data copy.
For large shards, the copy can take significantly longer than common idle_in_transaction_session_timeout values, When the timeout fires on an uninvolved worker, PostgreSQL terminates the connection, causing the shard move to fail.
This also affects shard splits, since they follow the same code path through LockShardListMetadataOnWorkers.
Fix
LockShardListMetadataOnWorkers() should send SET LOCAL idle_in_transaction_session_timeout = 0 on each metadata worker connection before acquiring the locks. SET LOCAL scopes the change to the current transaction only, so normal sessions on the workers are unaffected.