@@ -1,17 +1,17 @@
---
{
"title": "MySQL Full Database Sync",
"sidebar_label": "Full Database Sync",
"title": "MySQL Database-level Sync",
"sidebar_label": "Database-level Sync",
"language": "en",
"description": "Doris can continuously synchronize full and incremental data from an entire database or selected tables in MySQL into Doris using Streaming Job."
"description": "Doris can continuously sync full and incremental data of a group of MySQL tables into Doris at the database level via Streaming Job, auto-creating downstream tables on first sync."
}
---

## Overview

Supports using Job to continuously synchronize full and incremental data from an entire database or selected tables in a MySQL database to Doris via Stream Load. Suitable for scenarios requiring real-time full database sync to Doris.
Database-level Sync is implemented via the native `FROM MYSQL (...) TO DATABASE (...)` DDL, **using a database as the sync unit with a Doris database as the target container**. You can sync one, several, or all tables via `include_tables`; on first sync Doris automatically creates downstream primary-key tables and keeps primary keys consistent with the upstream. Suitable for mirror replication scenarios where downstream schema should track upstream automatically and no SQL processing is needed.

By integrating [Flink CDC](https://github.com/apache/flink-cdc), Doris supports reading change logs from MySQL databases, enabling full and incremental full database sync. When synchronizing for the first time, Doris automatically creates downstream tables (primary key tables) and keeps the primary key consistent with the upstream.
By integrating [Flink CDC](https://github.com/apache/flink-cdc), Doris reads change logs from MySQL and continuously writes full + incremental data of a group of tables into Doris via Stream Load. If you need column mapping, filtering, or data transformation during sync, see [MySQL Table-level Sync](./continuous-load-mysql-table.md).
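Concretely, a database-level job might look like the following sketch. This is illustrative only: the connection property names (`host`, `port`, `user`, `password`, `database`) are assumed placeholders; only `include_tables` and the `FROM MYSQL (...) TO DATABASE (...)` shape come from this page, and the authoritative syntax is given in the Import Command section.

```sql
-- Hypothetical sketch of a database-level sync job. Property names other
-- than include_tables are illustrative placeholders; see the Import
-- Command section for the authoritative syntax.
CREATE JOB mysql_db_sync
ON STREAMING
FROM MYSQL (
    "host" = "mysql-host",
    "port" = "3306",
    "user" = "sync_user",
    "password" = "****",
    "database" = "source_db",
    -- sync only these two tables; omit include_tables to sync every table
    "include_tables" = "orders,customers"
)
TO DATABASE target_test_db;
```

On first run, Doris would create `orders` and `customers` in `target_test_db` as primary-key tables and then stream incremental changes into them.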

**Notes:**

@@ -99,7 +99,7 @@ For more common operations (pause, resume, delete, check Task, etc.), see [Conti

### Import Command

Syntax for creating a full database sync job:
Syntax for creating a database-level sync job:

```sql
CREATE JOB <job_name>
@@ -1,9 +1,9 @@
---
{
"title": "MySQL Single Table Sync",
"sidebar_label": "Single Table Sync",
"title": "MySQL Table-level Sync",
"sidebar_label": "Table-level Sync",
"language": "en",
"description": "Doris supports continuously synchronizing full and incremental data from a single MySQL table into Doris using Job + CDC Stream TVF."
"description": "Doris can continuously sync MySQL data into a specified Doris table using Job + CDC Stream TVF at the table level, with support for column mapping and data transformation."
}
---

@@ -13,9 +13,9 @@ This feature is supported since version 4.1.0.

## Overview

Doris supports continuously synchronizing full and incremental data from a single MySQL table into a specified Doris table using Job + [CDC Stream TVF](../../../sql-manual/sql-functions/table-valued-functions/cdc-stream.md). This is suitable for real-time synchronization scenarios that require flexible column mapping and data transformation on a single table.
Table-level Sync is implemented via Job + [CDC Stream TVF](../../../sql-manual/sql-functions/table-valued-functions/cdc-stream.md), targeting an existing Doris table (`INSERT INTO tbl SELECT * FROM cdc_stream(...)`). It leverages Doris SQL to support column mapping, filtering and data transformation, with exactly-once semantics. Suitable for real-time sync scenarios that require data processing.

By integrating [Flink CDC](https://github.com/apache/flink-cdc) reading capabilities, Doris supports reading change logs (Binlog) from MySQL databases, enabling full and incremental data synchronization for a single table.
By integrating [Flink CDC](https://github.com/apache/flink-cdc), Doris reads change logs (Binlog) from MySQL, enabling full + incremental sync from the source table to the target table. If you prefer Doris to auto-create downstream tables or sync a group of tables at the database granularity, see [MySQL Database-level Sync](./continuous-load-mysql-database.md).
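As a hedged sketch (the `cdc_stream` property names below are assumptions, not the documented parameter list), a table-level job combines `CREATE JOB ... ON STREAMING` with an `INSERT INTO ... SELECT` over the TVF, which is where column mapping and filtering happen:

```sql
-- Hypothetical sketch: sync one MySQL table into an existing Doris table,
-- renaming a column and filtering rows in the SELECT. The cdc_stream
-- property names are illustrative placeholders.
CREATE JOB mysql_orders_sync
ON STREAMING
DO
INSERT INTO doris_orders (order_id, amount)
SELECT id, amount
FROM cdc_stream(
    "connector" = "mysql",
    "host" = "mysql-host",
    "port" = "3306",
    "user" = "sync_user",
    "password" = "****",
    "database" = "source_db",
    "table" = "orders"
)
WHERE amount > 0;  -- filtering expressed in Doris SQL
```

Because the target is a SELECT over an ordinary TVF, anything Doris SQL can express (renames, type casts, conditional filters) can be applied in-flight.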

**Notes:**

51 changes: 44 additions & 7 deletions docs/data-operate/import/streaming-job/continuous-load-overview.md
@@ -13,16 +13,53 @@ Doris supports continuously loading data from multiple data sources into Doris t

Continuous Load supports the following data sources and import modes:

| Data Source | Supported Versions | Single Table Sync | Full Database Sync | Setup Guide |
| Data Source | Supported Versions | Table-level Sync | Database-level Sync | Setup Guide |
|:------|:--------|:--------|:--------|:--------|
| MySQL | 5.6, 5.7, 8.0.x | [MySQL Single Table Sync](./continuous-load-mysql-single.md) | [MySQL Full Database Sync](./continuous-load-mysql-multi.md) | [Amazon RDS MySQL](./prerequisites/amazon-rds-mysql.md) · [Amazon Aurora MySQL](./prerequisites/amazon-aurora-mysql.md) |
| PostgreSQL | 14, 15, 16, 17 | [PostgreSQL Single Table Sync](./continuous-load-postgresql-single.md) | [PostgreSQL Full Database Sync](./continuous-load-postgresql-multi.md) | [Amazon RDS PostgreSQL](./prerequisites/amazon-rds-postgresql.md) · [Amazon Aurora PostgreSQL](./prerequisites/amazon-aurora-postgresql.md) |
| MySQL | 5.6, 5.7, 8.0.x | [MySQL Table-level Sync](./continuous-load-mysql-table.md) | [MySQL Database-level Sync](./continuous-load-mysql-database.md) | [Amazon RDS MySQL](./prerequisites/amazon-rds-mysql.md) · [Amazon Aurora MySQL](./prerequisites/amazon-aurora-mysql.md) |
| PostgreSQL | 14, 15, 16, 17 | [PostgreSQL Table-level Sync](./continuous-load-postgresql-table.md) | [PostgreSQL Database-level Sync](./continuous-load-postgresql-database.md) | [Amazon RDS PostgreSQL](./prerequisites/amazon-rds-postgresql.md) · [Amazon Aurora PostgreSQL](./prerequisites/amazon-aurora-postgresql.md) |
| S3 | - | [S3 Continuous Load](./continuous-load-s3.md) | - | - |

:::tip
- **Single Table Sync**: Uses CDC Stream TVF or S3 TVF to continuously load data into a specific Doris table, supporting flexible column mapping and data transformation.
- **Full Database Sync**: Uses native multi-table CDC capability to continuously sync an entire database or selected tables from the source to Doris, automatically creating downstream tables on first sync.
:::
## How to Choose

Table-level Sync and Database-level Sync are **two fundamentally different mechanisms**, not merely a distinction by number of tables. **Database-level Sync can also sync just one table via `include_tables`**, so the choice should be driven by capability requirements:

| Capability | Table-level Sync | Database-level Sync |
|:--------|:--------|:--------|
| Underlying mechanism | Job + TVF (`INSERT INTO tbl SELECT * FROM tvf()`) | Job + native database DDL (`FROM src TO DATABASE db`) |
| Target granularity | One existing Doris table | A Doris database container |
| Sync scope | A single table | One, several, or all tables (controlled by `include_tables`) |
| Auto-create tables | ❌ Requires pre-creation | ✅ Automatically creates primary-key tables on first sync |
| SQL expressiveness | ✅ Column mapping, filtering, transformation (via SELECT) | ❌ Direct replication, no ETL |
| Delivery semantics | exactly-once | at-least-once |
| Required privileges | Load | Load + Create (when auto-creating) |
| Typical scenarios | Real-time sync needing column pruning, renaming, type conversion, or conditional filtering | Mirror replication of a database or group of tables, where downstream schema should track upstream automatically |

- **Need SQL transformations or strict exactly-once semantics** → Choose **Table-level Sync**
- **Want Doris to auto-create tables and sync a group of tables with one config** → Choose **Database-level Sync**
- **Source is S3 object storage** → Only Table-level Sync is supported (via S3 TVF)

## Job Lifecycle

A Streaming Job transitions between the following states during its lifecycle. Both Table-level Sync and Database-level Sync follow the same state machine:

```mermaid
stateDiagram-v2
[*] --> PENDING: create job
PENDING --> RUNNING: createStreamingTask()
RUNNING --> FINISHED: source consumed
RUNNING --> PAUSED: task failed (with failReason)
PAUSED --> PENDING: autoResume after exponential backoff
FINISHED --> [*]
```

| State | Description |
|:----|:----|
| **PENDING** | The job has been created but no `StreamingTask` has been dispatched yet; awaiting the next scheduling round |
| **RUNNING** | A child task has been dispatched and is running, reading incremental data from the source and writing into Doris |
| **FINISHED** | The source has been fully consumed and the job has terminated. S3 TVF jobs enter this state once all files have been imported |
| **PAUSED** | A child task failed; the job is automatically paused and a `failReason` is recorded. Check the `ErrorMsg` column in `select * from jobs(...)` for details |

**Auto-resume:** After entering `PAUSED`, the scheduler periodically retries with an exponential backoff strategy and transitions the job back to `PENDING` to dispatch a new task. **Transient failures (network jitter, brief upstream unavailability, etc.) are absorbed automatically without manual intervention.** To resume immediately after diagnosing a failure, use [`RESUME JOB`](#resume-import-job); to stop scheduling entirely, use [`PAUSE JOB`](#pause-import-job) (manually paused jobs are NOT woken up by auto-resume) or [`DROP JOB`](#delete-import-job).
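For example, a minimal monitoring flow under the state machine above might look like the following sketch (the `jobs()` TVF argument and the exact RESUME/PAUSE spelling are inferred from the references on this page and may differ by version):

```sql
-- Inspect the job's current state and, if PAUSED, its failure reason
-- (see the ErrorMsg column described above).
select * from jobs("type" = "insert");

-- After fixing the root cause, resume scheduling immediately instead of
-- waiting for the exponential-backoff auto-resume:
RESUME JOB WHERE jobName = 'my_sync_job';

-- Or stop scheduling entirely (a manually paused job is NOT auto-resumed):
PAUSE JOB WHERE jobName = 'my_sync_job';
```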

## Common Operations

@@ -1,17 +1,17 @@
---
{
"title": "PostgreSQL Full Database Sync",
"sidebar_label": "Full Database Sync",
"title": "PostgreSQL Database-level Sync",
"sidebar_label": "Database-level Sync",
"language": "en",
"description": "Doris can continuously synchronize full and incremental data from an entire database or selected tables in PostgreSQL into Doris using Streaming Job."
"description": "Doris can continuously sync full and incremental data of a group of PostgreSQL tables into Doris at the database level via Streaming Job, auto-creating downstream tables on first sync."
}
---

## Overview

Supports using Job to continuously synchronize full and incremental data from an entire database or selected tables in a PostgreSQL database to Doris via Stream Load. Suitable for scenarios requiring real-time full database sync to Doris.
Database-level Sync is implemented via the native `FROM POSTGRES (...) TO DATABASE (...)` DDL, **using a database as the sync unit with a Doris database as the target container**. You can sync one, several, or all tables via `include_tables`; on first sync Doris automatically creates downstream primary-key tables and keeps primary keys consistent with the upstream. Suitable for mirror replication scenarios where downstream schema should track upstream automatically and no SQL processing is needed.

By integrating [Flink CDC](https://github.com/apache/flink-cdc), Doris supports reading change logs from PostgreSQL databases, enabling full and incremental full database sync. When synchronizing for the first time, Doris automatically creates downstream tables (primary key tables) and keeps the primary key consistent with the upstream.
By integrating [Flink CDC](https://github.com/apache/flink-cdc), Doris reads change logs from PostgreSQL and continuously writes full + incremental data of a group of tables into Doris via Stream Load. If you need column mapping, filtering, or data transformation during sync, see [PostgreSQL Table-level Sync](./continuous-load-postgresql-table.md).

**Notes:**

@@ -73,7 +73,7 @@ For more common operations (pause, resume, delete, check Task, etc.), see [Conti

### Import Command

Syntax for creating a full database sync job:
Syntax for creating a database-level sync job:

```sql
CREATE JOB <job_name>
@@ -1,9 +1,9 @@
---
{
"title": "PostgreSQL Single Table Sync",
"sidebar_label": "Single Table Sync",
"title": "PostgreSQL Table-level Sync",
"sidebar_label": "Table-level Sync",
"language": "en",
"description": "Doris supports continuously synchronizing full and incremental data from a single PostgreSQL table into Doris using Job + CDC Stream TVF."
"description": "Doris can continuously sync PostgreSQL data into a specified Doris table using Job + CDC Stream TVF at the table level, with support for column mapping and data transformation."
}
---

@@ -13,9 +13,9 @@ This feature is supported since version 4.1.0.

## Overview

Doris supports continuously synchronizing full and incremental data from a single PostgreSQL table into a specified Doris table using Job + [CDC Stream TVF](../../../sql-manual/sql-functions/table-valued-functions/cdc-stream.md). This is suitable for real-time synchronization scenarios that require flexible column mapping and data transformation on a single table.
Table-level Sync is implemented via Job + [CDC Stream TVF](../../../sql-manual/sql-functions/table-valued-functions/cdc-stream.md), targeting an existing Doris table (`INSERT INTO tbl SELECT * FROM cdc_stream(...)`). It leverages Doris SQL to support column mapping, filtering and data transformation, with exactly-once semantics. Suitable for real-time sync scenarios that require data processing.

By integrating [Flink CDC](https://github.com/apache/flink-cdc) reading capabilities, Doris supports reading change logs (WAL) from PostgreSQL databases, enabling full and incremental data synchronization for a single table.
By integrating [Flink CDC](https://github.com/apache/flink-cdc), Doris reads change logs (WAL) from PostgreSQL, enabling full + incremental sync from the source table to the target table. If you prefer Doris to auto-create downstream tables or sync a group of tables at the database granularity, see [PostgreSQL Database-level Sync](./continuous-load-postgresql-database.md).
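The job shape mirrors the MySQL table-level case; only the source connector differs. A hedged fragment of the TVF portion (property names are illustrative placeholders, not the documented parameter list) might be:

```sql
-- Hypothetical cdc_stream fragment for a PostgreSQL source; property
-- names are illustrative placeholders.
INSERT INTO doris_target
SELECT *
FROM cdc_stream(
    "connector" = "postgres",
    "host" = "pg-host",
    "port" = "5432",
    "user" = "sync_user",
    "password" = "****",
    "database" = "source_db",
    "table" = "public.orders"   -- PostgreSQL tables are schema-qualified
);
```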

**Notes:**

11 changes: 10 additions & 1 deletion docs/ecosystem/flink-doris-connector/flink-doris-connector.md
@@ -1148,4 +1148,13 @@ In the whole database synchronization tool provided by the Connector, no additio

7. **stream load error: HTTP/1.1 307 Temporary Redirect**

Flink first sends the request to FE and, after receiving a 307, follows the redirect to BE. When FE is under FullGC, high load, or network delay, HttpClient by default starts sending data without waiting for a response beyond a certain period (3 seconds). Since the request body is an InputStream by default, the data cannot be replayed when the 307 response arrives, and an error is reported directly. There are three ways to solve this problem: 1. Upgrade to Connector 25.1.0 or above, which increases the default wait time; 2. Set auto-redirect=false to send requests directly to BE (not applicable in some cloud scenarios); 3. For the unique key model, enable batch mode.

8. **When using Flink CDC to sync large tables from databases such as Oracle, an `I/O exception (java.net.SocketException) ... Broken pipe` error is reported. How to handle it?**

This error usually occurs when the data volume of a single Stream Load request exceeds the limit on the BE side. You can address it in the following ways:
- Increase the `streaming_load_max_mb` parameter in `be.conf` on the BE side (default 10240, in MB) so that a single Stream Load can carry more data; restart BE for the change to take effect.
- Enable batch mode (`sink.enable.batch-mode=true`) so that the Connector automatically splits the data into batches internally, avoiding overly large single Stream Loads.
- Increase the parallelism of Oracle CDC by adding `--oracle-conf scan.incremental.snapshot.enabled=true` (an experimental feature) to the startup command, which enables parallel reading of the Oracle full snapshot.
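A sketch of how the batch-mode mitigation might look in a Flink SQL sink definition (only `sink.enable.batch-mode` comes from the text above; the remaining WITH options are used here as assumptions about a typical Doris sink):

```sql
-- Illustrative Flink SQL sink enabling batch mode; apart from
-- sink.enable.batch-mode, the option values are placeholders.
CREATE TABLE doris_sink (
    id BIGINT,
    name STRING
) WITH (
    'connector' = 'doris',
    'fenodes' = 'fe-host:8030',
    'table.identifier' = 'db.tbl',
    'username' = 'root',
    'password' = '',
    'sink.enable.batch-mode' = 'true'  -- split writes into smaller Stream Loads
);
```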

For more Flink CDC related issues, please refer to [Flink CDC FAQ](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.6/docs/faq/faq/).
@@ -13,7 +13,7 @@

The CDC Stream table-valued-function (TVF) enables users to read change data from relational databases (such as MySQL, PostgreSQL) via CDC. By integrating [Flink CDC](https://github.com/apache/flink-cdc) reading capabilities, it supports full and incremental data synchronization.

It is typically used with `CREATE JOB ON STREAMING` to achieve continuous single-table data synchronization. For detailed usage, see [MySQL Single-table Import](../../../data-operate/import/streaming-job/continuous-load-mysql-single.md) and [PostgreSQL Single-table Import](../../../data-operate/import/streaming-job/continuous-load-postgresql-single.md).
It is typically used with `CREATE JOB ON STREAMING` to achieve continuous table-level data synchronization. For detailed usage, see [MySQL Table-level Sync](../../../data-operate/import/streaming-job/continuous-load-mysql-table.md) and [PostgreSQL Table-level Sync](../../../data-operate/import/streaming-job/continuous-load-postgresql-table.md).

## Syntax

@@ -1,17 +1,17 @@
---
{
"title": "MySQL 整库同步",
"sidebar_label": "整库同步",
"title": "MySQL 库级同步",
"language": "zh-CN",
"description": "Doris 可以通过 Streaming Job 的方式,将 MySQL 整库的全量和增量数据持续同步到 Doris 中。"
"sidebar_label": "库级同步",
"description": "Doris 可以通过 Streaming Job 的方式,以库为单位将 MySQL 一组表的全量和增量数据持续同步到 Doris 中,首次同步自动创建下游表。"
}
---

## Overview

Supports using Job to continuously synchronize the full and incremental data of an entire MySQL database, or multiple specified tables, into Doris via Stream Load. Suitable for scenarios that require real-time full-database sync into Doris.
Database-level Sync is implemented via the native `FROM MYSQL (...) TO DATABASE (...)` DDL, **using a database as the sync unit with a Doris database as the target container**; one, several, or all tables can be synced via `include_tables`, and on first sync Doris automatically creates downstream primary-key tables and keeps primary keys consistent with the upstream. Suitable for mirror replication scenarios where no SQL processing is needed and the downstream schema should track the upstream automatically.

By integrating [Flink CDC](https://github.com/apache/flink-cdc), Doris supports reading change logs from MySQL, enabling full and incremental sync of an entire database. On first sync, Doris automatically creates downstream tables (primary-key tables) and keeps primary keys consistent with the upstream.
By integrating [Flink CDC](https://github.com/apache/flink-cdc), Doris reads change logs from MySQL and continuously writes the full + incremental data of a group of tables into Doris via Stream Load. If you need column mapping, filtering, or data transformation during sync, see [MySQL Table-level Sync](./continuous-load-mysql-table.md).

**Notes:**

@@ -99,7 +99,7 @@ TO DATABASE target_test_db

### Import Command

Syntax for creating a full database sync job:
Syntax for creating a database-level sync job:

```sql
CREATE JOB <job_name>