Rebalance shards when ingester status changes #6185
ncoiffier-celonis wants to merge 11 commits into quickwit-oss:main
Conversation
Technically, the ingest router should just retry when that happens, and there should be a path for the router to open a new shard if the ingester being decommissioned was the only one to have shard(s) for this index. Is that not what you observed?
@ncoiffier-celonis I'm giving you write access to the repo so next time you can push directly to this repo rather than a fork. It makes it easier for me to check out your changes locally. Though I learnt how to use
guilload
left a comment
I think we're close but we need to fix a few issues.
```rust
}

#[cfg(any(test, feature = "testsuite"))]
pub async fn for_test_with_ingester_status(
```
nit: seems a bit overkill to me
Addressed with 0c1c82b: I unified ClusterNode::for_test_with_ingester_status into ClusterNode::for_test
```diff
 pub struct IngestController {
-    ingester_pool: IngesterPool,
+    pub(crate) ingester_pool: IngesterPool,
+    pub(crate) stats: IngestControllerStats,
```
```rust
let Some(mailbox) = weak_mailbox.upgrade() else {
    return;
};
let mut trigger_rebalance = false;
```
@ncoiffier-celonis please review this tricky logic thoroughly. I'm the initial author of this change and now I'm also reviewing it so I'm more likely to miss something. I could use a second pair of eyes.
Yeah, this logic definitely needs a comment. Here, we're considering both indexers and ingesters. Indexers run indexing pipelines: when they're ready, they are ready to index, so we want to rebuild an indexing plan. Same thing when they leave.
In addition, we're considering ingesters. Technically, all indexers are ingesters and vice versa, because we didn't want to expose users to a new service ("service" as in metastore, janitor, control plane, etc., not microservice as in router, ingester, debug info, etc.).
Ingesters have two levels of readiness. The first is the same as for indexers: "I'm up and running, I can connect to the metastore." The second: "I have loaded my WAL."
So we want to rebalance when the ingester is ready ready, which, from the perspective of the stream of events, can happen as:
- Add(ready, ready)
OR
- Add(ready, not ready)
- Update(ready, ready)
The logic below tries to implement that.
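That detection can be sketched roughly as follows. All names here are illustrative stand-ins, not Quickwit's actual types: a two-level readiness (node-level "ready" plus the ingester's own WAL status) and a predicate over the event stream.

```rust
/// Illustrative two-level readiness: the ingester is "ready ready" only once
/// its WAL is loaded.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum IngesterStatus {
    Initializing, // node is up, WAL not loaded yet
    Ready,        // WAL loaded
}

pub enum ClusterEvent {
    Add { node_ready: bool, ingester: IngesterStatus },
    Update { prev: IngesterStatus, new: IngesterStatus },
}

/// True when an ingester just became "ready ready": either Add(ready, ready),
/// or the Update(ready, ready) that follows an Add(ready, not ready).
pub fn triggers_rebalance(event: &ClusterEvent) -> bool {
    match event {
        ClusterEvent::Add {
            node_ready: true,
            ingester: IngesterStatus::Ready,
        } => true,
        ClusterEvent::Update {
            prev: IngesterStatus::Initializing,
            new: IngesterStatus::Ready,
        } => true,
        _ => false,
    }
}
```

This is only a sketch of the state transitions described above; the real code also has to handle node removals and the indexer-side plan rebuild.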
```rust
}

#[tokio::test]
async fn test_wait_for_ingester_decommission_elapsed_timeout_not_zero() {
```
```rust
// Ingest docs with auto-commit. With a 5s commit timeout, these documents
// sit uncommitted in the ingesters' WAL - exactly the in-flight state we
// want to exercise during draining.
ingest(
```
How do we know the shard for this index is always going to be created on the indexer that we're about to shut down?
```rust
/// Tests that the graceful shutdown sequence works correctly in a multi-indexer
/// cluster: shutting down one indexer does NOT cause 500 errors or data loss,
/// and the cluster eventually rebalances. See #6158.
#[tokio::test]
```
Very very nice! Let's make sure this is not flaky, though. Run it 1,000 times! This is how I do it (fish):

```fish
while true
    c t --manifest-path quickwit/Cargo.toml -p quickwit-integration-tests --nocapture -- test_graceful_shutdown_no_data_loss
end
```

```rust
    Ok((ingest_router, ingest_router_service, ingester_opt))
}

fn setup_ingester_pool(
```
Same here, we need to be extremely careful about this convoluted logic.
Now that I've thought more about this, I think we have an issue with this logic. This creates a pool of write-only ingesters, which is great for the logic in quickwit-ingest, but in quickwit-indexing, the source also holds an ingester pool, and we still want to be able to read and truncate from ingesters while they are in the retiring and decommissioning statuses. I don't think we want to actually create and manage distinct pools, so we should maybe restrict this pool to exclude initializing ingesters and push the additional filtering logic wherever needed (router, control plane).
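One way to read that suggestion: keep a single shared pool that only excludes ingesters that are not usable at all, and apply the stricter filters at each call site. The variant names below are illustrative, not necessarily Quickwit's actual `IngesterStatus` enum.

```rust
/// Illustrative status set; the real Quickwit enum may differ.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum IngesterStatus {
    Initializing,
    Ready,
    Retiring,
    Decommissioning,
    Failed,
}

/// The single shared pool could exclude only ingesters that nobody can use.
pub fn belongs_in_pool(status: IngesterStatus) -> bool {
    !matches!(status, IngesterStatus::Initializing | IngesterStatus::Failed)
}

/// Router / control plane apply the stricter write-side filter...
pub fn is_writable(status: IngesterStatus) -> bool {
    matches!(status, IngesterStatus::Ready)
}

/// ...while sources can still read and truncate from retiring or
/// decommissioning ingesters.
pub fn is_readable(status: IngesterStatus) -> bool {
    matches!(
        status,
        IngesterStatus::Ready | IngesterStatus::Retiring | IngesterStatus::Decommissioning
    )
}
```

The design point is that the pool membership predicate is the weakest filter, and each consumer narrows it for its own needs instead of maintaining separate pools.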
quickwit/quickwit-serve/src/lib.rs
```rust
if let Some(ingester) = &ingester_opt {
    if let Ok(status) = try_get_ingester_status(ingester).await {
        status != IngesterStatus::Failed
    } else {
        // If we couldn't get the ingester status, it's not looking good, so we set
        // the node to not ready.
        false
    }
} else {
    true
}
```
Feels like this logic does not need to be duplicated. Brainfart on my end? WDYT?
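If the duplication is real, one option is to factor the check into a single shared helper. This is a hypothetical sketch: the real check is async and `Ingester`, `IngesterStatus`, and `try_get_ingester_status` are stand-ins for the actual Quickwit types (here, a failed status fetch is modeled as `None`).

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum IngesterStatus {
    Ready,
    Failed,
}

pub struct Ingester {
    /// `None` models an RPC failure when fetching the status.
    pub status: Option<IngesterStatus>,
}

fn try_get_ingester_status(ingester: &Ingester) -> Result<IngesterStatus, ()> {
    ingester.status.ok_or(())
}

/// Single readiness check that both call sites could share.
pub fn is_node_ready(ingester_opt: Option<&Ingester>) -> bool {
    match ingester_opt {
        Some(ingester) => match try_get_ingester_status(ingester) {
            Ok(status) => status != IngesterStatus::Failed,
            // If we couldn't get the ingester status, it's not looking good,
            // so we report the node as not ready.
            Err(()) => false,
        },
        // Nodes that don't run an ingester are considered ready.
        None => true,
    }
}
```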
```rust
    );
    Some(change)
}
ClusterChange::Add(node) | ClusterChange::Update { updated: node, .. }
```
This is going to update the pool each time a chitchat key value changes. This is not buying us anything and will generate some noisy logs.
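One hypothetical way to avoid reacting to every chitchat key-value change is to project each node down to the fields the pool actually depends on, and only update when that projection changes. The `NodeView` struct and its fields below are illustrative, not Quickwit's real types.

```rust
/// Illustrative projection of a cluster node onto the fields the ingester
/// pool cares about; everything else (arbitrary chitchat key-values) is
/// deliberately left out.
#[derive(PartialEq, Debug)]
pub struct NodeView {
    pub grpc_advertise_addr: String,
    pub ingester_status: u8, // stand-in for the real status enum
}

/// Only touch the pool (and log) when a relevant field changed, instead of
/// on every gossip update.
pub fn pool_update_needed(prev: &NodeView, current: &NodeView) -> bool {
    prev != current
}
```

Because `NodeView` only contains pool-relevant fields, equality comparison is exactly the "did anything we care about change?" test.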
@nadav-govari, I need your eyes on this because:
force-pushed from 81e493d to 6bebf0d
(I took the liberty to force-push after signing all the individual commits, no code change)
Description
Attempt to fix #6158
Following @guilload's suggestion here, this PR rebalances shards away from an ingester when its status changes (addressing the `no open shard found on ingester` error).

With this approach, even if we have some 10s propagation delay before decommissioning, it is still possible to fail to ingest some documents if chitchat takes longer than expected to gossip the ingester status to the control plane.
Any feedback is welcome!!
How was this PR tested?
In addition to the unit and integration tests, I've run it against a local cluster with 2 indexers and observed that the number of errors reported in #6158 decreased from a few hundred to none.
Other approaches
This PR is largely identical to the branch `guilload/ingester-status`, rebased on `main` and with some additional bugfixes: