Description
Problem
When self-hosting trigger.dev with a multi-replica ClickHouse cluster (e.g., ClickHouse Operator on Kubernetes with 2+ replicas behind a load-balanced Service), the dashboard shows inconsistent data on each page load.
Root Cause
trigger.dev creates ClickHouse tables using ReplacingMergeTree engine (e.g., task_runs_v2). In a multi-replica setup, each replica maintains its own independent copy of data. Since the webapp writes to whichever replica the Kubernetes Service routes to, data ends up split across replicas. Subsequent reads hit random replicas, causing the runs list to show different subsets of data on each refresh.
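A simplified sketch of the failure mode (column names here are illustrative, not the actual trigger.dev schema):

```sql
-- Current engine: each replica keeps a fully independent copy of data.
CREATE TABLE task_runs_v2
(
    run_id     String,
    status     LowCardinality(String),
    updated_at DateTime64(3),
    _version   UInt64
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY run_id;

-- An INSERT through the load-balanced Service lands on exactly one
-- replica; the other replicas never receive the row. A later SELECT
-- routed to a different replica returns a different subset of runs.
```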
Current Workaround
Pin the webapp to a single ClickHouse replica using the headless service DNS (e.g., clickhouse-shard0-0.clickhouse-headless.clickhouse.svc.cluster.local) instead of the load-balanced service. This works but defeats the purpose of having multiple replicas for HA.
Proposed Solution
Use ReplicatedReplacingMergeTree (or Replicated* variants) instead of ReplacingMergeTree when creating ClickHouse tables. This would allow ClickHouse's native replication via ZooKeeper/ClickHouse Keeper to keep all replicas in sync.
This could be:
- Default behavior — always use Replicated engines (they work fine with single-replica setups too)
- Configuration option — e.g., CLICKHOUSE_USE_REPLICATED_TABLES=true or a Helm value
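For illustration, a hedged sketch of what the replicated DDL could look like. The Keeper path and the {shard}/{replica} macros follow common ClickHouse Operator conventions, and the column list is again illustrative:

```sql
CREATE TABLE task_runs_v2 ON CLUSTER '{cluster}'
(
    run_id     String,
    status     LowCardinality(String),
    updated_at DateTime64(3),
    _version   UInt64
)
ENGINE = ReplicatedReplacingMergeTree(
    '/clickhouse/tables/{shard}/{database}/task_runs_v2', -- Keeper path
    '{replica}',                                          -- replica name macro
    _version                                              -- dedup version column
)
ORDER BY run_id;
```

With a single replica this behaves exactly like ReplacingMergeTree (replication just has no peers to sync with), which is why making it the default should be safe for existing single-node setups.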
Environment
- trigger.dev v4 (self-hosted via Helm chart v4.0.5)
- ClickHouse cluster: 2 replicas, 1 shard (ClickHouse Operator on Kubernetes)
- ClickHouse Keeper enabled for coordination
Additional Context
The RunsReplicationService in the webapp syncs PostgreSQL → ClickHouse via logical replication. The replication itself works correctly — the issue is purely that the ClickHouse table engine doesn't propagate data between replicas.
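One way to confirm the divergence is to compare row counts across all replicas in a single query. This assumes the operator-managed cluster is named 'default' and the table lives in the default database; adjust both to your setup:

```sql
SELECT hostName() AS replica, count() AS rows
FROM clusterAllReplicas('default', default.task_runs_v2)
GROUP BY replica;

-- With plain ReplacingMergeTree the counts differ per replica;
-- with ReplicatedReplacingMergeTree they converge.
```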