diff --git a/pgcopydb-helpers/AGENTS.md b/pgcopydb-helpers/AGENTS.md
index 036d747..3e850ba 100644
--- a/pgcopydb-helpers/AGENTS.md
+++ b/pgcopydb-helpers/AGENTS.md
@@ -201,9 +201,9 @@ Resumes a previously interrupted `pgcopydb clone --follow` migration. Backs up t
 ~/resume-migration.sh ~/migration_YYYYMMDD-HHMMSS # specify explicitly
 ```
 
-**Important:** This script intentionally does NOT use `--split-tables-larger-than` with `--resume`. pgcopydb truncates the entire table before checking split parts on resume, which causes data loss.
+**Important:** If the original migration used `--split-tables-larger-than`, the resume script passes the same value. This is safe when COPY has already completed (the COPY supervisor doesn't run, so no truncation occurs). If COPY was still in progress when the failure happened, use `--restart` instead — pgcopydb truncates split tables before re-queuing parts on resume, which loses already-copied partitions. Run `~/check-migration-status.sh` to determine whether COPY completed before deciding.
 
-**When to use:** After pgcopydb crashes, the instance reboots, or the migration is interrupted. Do NOT use after a successful migration — use `run-migration.sh` to start fresh.
+**When to use:** After pgcopydb crashes, the instance reboots, or the migration is interrupted during indexes, post-data restore, or CDC. If COPY failed mid-flight, use `~/target-clean.sh` + `~/drop-replication-slots.sh` + `~/start-migration-screen.sh` to start fresh instead.
 
 **Requires:** `PGCOPYDB_SOURCE_PGURI`, `PGCOPYDB_TARGET_PGURI`, existing migration directory
 
@@ -396,11 +396,11 @@ All scripts use variables at the top that can be adjusted per migration. See [Cl
 | `TABLE_JOBS` | 16 | run-migration.sh, resume-migration.sh |
 | `INDEX_JOBS` | 12 | run-migration.sh, resume-migration.sh |
 | `FILTER_FILE` | ~/filters.ini | run-migration.sh, resume-migration.sh |
-| `--split-tables-larger-than` | 50GB | run-migration.sh only (not resume) |
+| `--split-tables-larger-than` | 50GB | run-migration.sh, resume-migration.sh |
 
 ## Critical Warnings
 
-- **Never use `--split-tables-larger-than` with `--resume`** — pgcopydb truncates the entire table before checking parts, causing data loss.
+- **If COPY failed mid-flight, use `--restart` instead of `--resume`** — pgcopydb truncates split tables before re-queuing parts on resume, causing data loss for partially-copied tables. If COPY completed and the failure was in a later phase (indexes, CDC), `--resume` with the same `--split-tables-larger-than` value is safe. Run `~/check-migration-status.sh` to check.
 - **Never use `pgcopydb --restart`** without backing up first — it wipes the CDC directory AND SQLite catalogs.
 - **Always clean up replication slots** after a migration — unconsumed slots cause WAL accumulation on the source.
 - **Verify extension filtering after STEP 1** — check `SELECT COUNT(*) FROM s_depend;` in `filter.db`. If it's 0, extension-owned objects in `public` won't be filtered.
diff --git a/pgcopydb-helpers/README.md b/pgcopydb-helpers/README.md
index 3b85e55..cb2293c 100644
--- a/pgcopydb-helpers/README.md
+++ b/pgcopydb-helpers/README.md
@@ -215,7 +215,14 @@ If pgcopydb crashes, the instance reboots, or the migration is interrupted:
 ~/resume-migration.sh ~/migration_YYYYMMDD-HHMMSS # or specify explicitly
 ```
 
-This backs up the SQLite catalog before resuming. It uses `--not-consistent` to allow resuming from a mid-transaction state, and intentionally omits `--split-tables-larger-than` because pgcopydb truncates the entire table before checking split parts on resume, which causes data loss.
+This backs up the SQLite catalog before resuming and uses `--not-consistent` to allow resuming from a mid-transaction state.
+
+**Choosing between `--resume` and `--restart`:**
+
+- **COPY already completed** (failure was during indexes, post-data restore, or CDC): Use `--resume`. If the original migration used `--split-tables-larger-than`, pass the same value — the COPY phase is skipped entirely so there is no truncation risk.
+- **COPY was still in progress** when the failure occurred: Use `--restart` (full restart) instead. pgcopydb truncates split tables before re-queuing parts on resume, which loses data from already-copied partitions.
+
+To check whether COPY completed, run `~/check-migration-status.sh` and look at the copy task progress. If all COPY tasks show as completed with no outstanding jobs, it is safe to `--resume`.
 
 To start completely over, wipe the target and clean up replication:
 
@@ -392,7 +399,7 @@ sqlite3 ~/migration_*/schema/filter.db "SELECT COUNT(*) FROM s_depend;"
 
 ## Critical Warnings
 
-- **Never use `--split-tables-larger-than` with `--resume`** — pgcopydb truncates the entire table before checking parts, causing data loss.
+- **If COPY failed mid-flight, use `--restart` instead of `--resume`** — pgcopydb truncates split tables before re-queuing parts on resume, causing data loss for partially-copied tables. If COPY completed and the failure was in a later phase (indexes, CDC), `--resume` with the same `--split-tables-larger-than` value is safe.
 - **Never use `pgcopydb --restart`** without backing up first — it wipes the CDC directory AND SQLite catalogs.
 - **Always clean up replication slots** when done — unconsumed slots cause unbounded WAL growth on the source.
 - **Verify extension filtering after STEP 1** — if `s_depend` count is 0, extension-owned objects won't be excluded.
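The resume-versus-restart rule described in the README hunk above can be sketched as a small helper. This is a minimal sketch and not part of the patch: `choose_recovery_flag` and its `completed` / `in-progress` argument are hypothetical names. The operator is assumed to judge the COPY state by reading the output of `~/check-migration-status.sh`, whose format is not shown in this diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this patch): map the COPY phase state,
# as judged by the operator from ~/check-migration-status.sh, to the
# pgcopydb flag that the README text recommends.
choose_recovery_flag() {
  case "$1" in
    completed)
      # Failure was in indexes, post-data restore, or CDC. COPY is skipped
      # on resume, so split tables are not truncated.
      echo "--resume"
      ;;
    in-progress)
      # COPY failed mid-flight. Resuming would truncate split tables.
      echo "--restart"
      ;;
    *)
      echo "usage: choose_recovery_flag completed|in-progress" >&2
      return 2
      ;;
  esac
}
```

Calling `choose_recovery_flag completed` prints `--resume`, and `choose_recovery_flag in-progress` prints `--restart`. Any other argument returns a non-zero status, so a wrapper script stops instead of guessing.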
diff --git a/pgcopydb-helpers/resume-migration.sh b/pgcopydb-helpers/resume-migration.sh
index 2675efa..0a2ad5f 100755
--- a/pgcopydb-helpers/resume-migration.sh
+++ b/pgcopydb-helpers/resume-migration.sh
@@ -5,8 +5,15 @@
 #
 # Resumes a previously interrupted pgcopydb clone --follow migration.
 # If no directory is given, uses the most recent ~/migration_* directory.
-# Backs up the SQLite catalog before resuming. Does NOT use
-# --split-tables-larger-than (unsafe with --resume).
+# Backs up the SQLite catalog before resuming.
+#
+# IMPORTANT: --split-tables-larger-than and --resume
+# If the original migration used --split-tables-larger-than, you MUST pass
+# the same value here -- pgcopydb validates catalog consistency and will
+# refuse to resume without it. This is SAFE if the COPY phase already
+# completed (indexes, CDC, etc.). If COPY was still in progress when the
+# failure occurred, use --restart instead -- pgcopydb truncates split tables
+# before re-queuing parts on resume, which loses already-copied partitions.
 #
 set -eo pipefail
 
@@ -57,8 +64,10 @@ cp "$MIGRATION_DIR/schema/source.db" "$MIGRATION_DIR/schema/source.db.bak.$(date
 echo "Migration dir: $MIGRATION_DIR"
 echo "=========================================="
 
- # NOTE: Do NOT use --split-tables-larger-than with --resume.
- # pgcopydb truncates the entire table before checking parts, causing data loss.
+ # If the original migration used --split-tables-larger-than, pass the
+ # same value here. This is safe when COPY is already complete (the COPY
+ # supervisor won't run, so no truncation occurs). If COPY failed
+ # mid-flight, use --restart instead of --resume.
 /usr/lib/postgresql/17/bin/pgcopydb clone \
 --follow \
 --plugin wal2json \
@@ -73,6 +82,8 @@ cp "$MIGRATION_DIR/schema/source.db" "$MIGRATION_DIR/schema/source.db.bak.$(date
 --skip-db-properties \
 --table-jobs "$TABLE_JOBS" \
 --index-jobs "$INDEX_JOBS" \
+ --split-tables-larger-than 50GB \
+ --split-max-parts "$TABLE_JOBS" \
 --dir "$MIGRATION_DIR"
 
 EXIT_CODE=$?
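
The final hunk hardcodes `50GB`, while the header comment says the value must match the one used by the original migration. One way to keep the two scripts in sync is sketched below, under stated assumptions: `SPLIT_TABLES_LARGER_THAN` and `split_args` are hypothetical names that the patch does not define, and the values are assumed to contain no whitespace. Leaving the variable empty omits both split flags.

```shell
#!/usr/bin/env bash
# Hypothetical variable (not in this patch), intended to sit next to
# TABLE_JOBS at the top of both run-migration.sh and resume-migration.sh.
# "${VAR-default}" applies the default only when VAR is unset, so an
# explicitly empty value disables splitting.
SPLIT_TABLES_LARGER_THAN="${SPLIT_TABLES_LARGER_THAN-50GB}"
TABLE_JOBS="${TABLE_JOBS:-16}"

# Print the split-related flags on one line, or nothing when no
# threshold is configured.
split_args() {
  if [ -n "$SPLIT_TABLES_LARGER_THAN" ]; then
    printf '%s %s %s %s\n' \
      --split-tables-larger-than "$SPLIT_TABLES_LARGER_THAN" \
      --split-max-parts "$TABLE_JOBS"
  fi
}

# Intended use inside the existing pgcopydb clone call (unquoted on
# purpose so the flags split into separate words):
#   ... --index-jobs "$INDEX_JOBS" $(split_args) --dir "$MIGRATION_DIR"
```

With this shape, a migration started with a different threshold only needs the variable changed in one place per script, and the resume path cannot drift from the documented "same value" requirement.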