Move tables 1.6 crash safe resume#1708
Conversation
| if !migrationContext.UseGTIDs { | ||
| log.Infof("--move-tables requires GTID coordinates for cutover drain checks; enabling --gtid") | ||
| migrationContext.UseGTIDs = true | ||
| } |
There was a problem hiding this comment.
I chose to not fail the request without these, and just enable them. Could also change it to a Fatal and force the operator to rerun with the required flags. But that feels kinda silly if they are always required.
There was a problem hiding this comment.
Pull request overview
This PR makes --move-tables crash-resume safe across the cooperative cutover by persisting additional cutover state to the checkpoint table (cutover-started + drain GTID), adding logic to resume mid-cutover, and ensuring move-tables runs with checkpointing + GTID coordinates enabled.
Changes:
- Extend checkpoint schema + insert query builder to include move-tables cutover resume fields, and read/write them from the target-side checkpoint table in move-tables mode.
- Add migrator logic to persist cutover checkpoints, drain to a captured GTID with improved “non-DML tail” handling, and resume cutover directly from checkpoint state.
- Update local move-tables scripts/docs to exercise checkpointing and provide better local test flows.
Show a summary per file
| File | Description |
|---|---|
script/move-tables/teardown |
Adds additional Docker cleanup steps after removing containers. |
script/move-tables/README.md |
Updates local testing steps/command to include checkpointing and validation workflow. |
script/move-tables/insert-source-primary-loop |
Adds batched insert support for higher write rates during local testing. |
go/sql/builder.go |
Extends checkpoint insert builder to optionally include move-tables cutover columns. |
go/sql/builder_test.go |
Adds coverage for move-tables checkpoint insert query shape. |
go/logic/migrator.go |
Implements move-tables cutover checkpoint persistence, drain logic, and resume-from-cutover-checkpoint flow. |
go/logic/migrator_move_tables_cutover_test.go |
Adds unit test for resuming move-tables cutover when already drained. |
go/logic/checkpoint.go |
Extends Checkpoint struct with move-tables cutover resume fields. |
go/logic/applier.go |
Routes checkpoint storage to the target DB in move-tables mode and reads/writes new move-tables cutover fields. |
go/logic/applier_test.go |
Adds integration tests for move-tables checkpoint persistence and cutover-checkpoint selection. |
go/cmd/gh-ost/main.go |
Automatically enables --checkpoint and --gtid when running --move-tables. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 11/11 changed files
- Comments generated: 6
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
| return chk, nil | ||
| } | ||
|
|
||
| func (apl *Applier) ReadMoveTablesCutOverCheckpoint() (*Checkpoint, error) { |
There was a problem hiding this comment.
[question] why can't we re-use ReadLastCheckpoint for this? the only significant difference I see is that the original fn is L1053-4 which we should able to ignore.
I assume there's something subtle I'm missing 🤔
There was a problem hiding this comment.
I think we technically could have. There might have been a reason at some point during my commit history but I agree it looks like it would be safe to use ReadLastCheckpoint. The big difference is that ReadMoveTablesCutOverCheckpoint has a where clause for gh_ost_move_tables_cutover_started = 1 and to make sure the drain GTID is set. But the one callsite i see of ReadMoveTablesCutOverCheckpoint does that check in the code as well
Lines 1020 to 1024 in f5273fa
closes https://github.com/github/database-infrastructure/issues/8210
A Pull Request should be associated with an Issue.
Related issue (internal): https://github.com/github/database-infrastructure/issues/8210
Related issue (public): #1681
Description
This PR updates move-tables mode to be crash-resume safe. It manages the
checkpointtable life, and a backgrounded checkpoint process as well as introduces new checkpoint data specifically for the move-tables cutover. Per the docs on crash safe resume across cutover, we now store that the move-tables cutover has started (a bool) and the move-tables cutover drain GTID. We use these two attributes for--move-tables --resumeto pickup if the process failed mid-cutover. It's safe for failure between T1-T3 (the rename and the drain) and also safe for failure post drain, pre hook call (between T3-T5). I simple--resumewill pick back up in the cutover protocol and ensure the rename isn't re-run, the drain is complete, and the complete hook is fired.script/cibuildreturns with no formatting errors, build errors or unit test errors.Demo
ghost-crash-resume-cutover.mp4