DAOS-18552 pool: Fix a PS start-stop race #17564

Open
liw wants to merge 1 commit into master from liw/pool-svc-stop-wa

@liw liw commented Feb 17, 2026

The following race happened during a pool create operation, triggered by abnormally slow VMs:

ds_rsvc_start
  start
    pool_svc_alloc_cb
      ds_pool_lookup: OK
....VM slowness causes start timeout, which triggers stop....
                            ds_pool_stop
                              pool->sp_stopping = 1
                              ds_pool_svc_stop: none
  insert
                              wait for ds_pool references: hang

This patch is a quick fix that prevents ds_rsvc_start from inserting a PS into the hash table if the ds_pool is stopping, so that ds_pool_stop won't hang. Manual testing shows that such a pool create operation now retries and succeeds transparently.
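
The interleaving above can be modeled with a small, self-contained C sketch. Every name in it (`toy_pool`, `toy_svc_insert`, `TOY_ERR_STOPPING`, and so on) is a hypothetical stand-in rather than the DAOS API, and the error value is an arbitrary placeholder. It only illustrates why refusing the insert once the stopping flag is set lets the stop path drain its references.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a ds_pool: a reference count plus a stopping flag. */
struct toy_pool {
	int  refs;     /* references currently held on the pool */
	bool stopping; /* set once the stop path has begun */
};

/* Toy stand-in for a pool service (PS) that pins its pool. */
struct toy_svc {
	struct toy_pool *pool;
	bool             inserted; /* true once "in the hash table" */
};

#define TOY_ERR_STOPPING (-1) /* placeholder, not a real DAOS error code */

/* Start path: look up the pool and take a reference. */
static void toy_svc_alloc(struct toy_svc *svc, struct toy_pool *pool)
{
	pool->refs++;
	svc->pool = pool;
	svc->inserted = false;
}

/* Undo toy_svc_alloc: drop the reference held by the service. */
static void toy_svc_free(struct toy_svc *svc)
{
	assert(svc->pool->refs > 0);
	svc->pool->refs--;
	svc->pool = NULL;
}

/* Stop path, first half: mark the pool as stopping. */
static void toy_pool_begin_stop(struct toy_pool *pool)
{
	pool->stopping = true;
}

/* Stop path, second half: the wait can only finish with no references left. */
static bool toy_pool_stop_can_finish(const struct toy_pool *pool)
{
	return pool->refs == 0;
}

/* Insert with the guard: refuse if the pool is already stopping. */
static int toy_svc_insert(struct toy_svc *svc)
{
	if (svc->pool->stopping)
		return TOY_ERR_STOPPING;
	svc->inserted = true;
	return 0;
}

/* Replay the interleaving: alloc, stop begins, then the late insert. */
static bool toy_replay_race(void)
{
	struct toy_pool pool = { .refs = 0, .stopping = false };
	struct toy_svc  svc;

	toy_svc_alloc(&svc, &pool);
	toy_pool_begin_stop(&pool); /* finds no inserted service to stop */
	if (toy_svc_insert(&svc) != 0)
		toy_svc_free(&svc); /* start fails, reference is released */
	return toy_pool_stop_can_finish(&pool);
}
```

In this model, `toy_replay_race()` returns true because the rejected insert leads the start path to release its pool reference. Without the `stopping` check, the service would stay inserted with its reference held and the stop-side wait would never complete.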

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'rebuild/container_rf.py:RbldContRfTest.test_rebuild_with_container_rf - pool create failed: DER_BUSY(-1012): Device or resource busy'
Status is 'In Progress'
Labels: 'ci_master_weekly,weekly_test'
https://daosio.atlassian.net/browse/DAOS-18552

@liw liw force-pushed the liw/pool-svc-stop-wa branch from 8d0122c to 4ccf463 on February 17, 2026 06:43
Signed-off-by: Li Wei <liwei@hpe.com>
@liw liw force-pushed the liw/pool-svc-stop-wa branch from 4ccf463 to 0ec1e95 on February 17, 2026 07:40
@liw liw marked this pull request as ready for review February 18, 2026 00:16
@liw liw requested review from a team as code owners February 18, 2026 00:16
@liw liw requested review from kccain and liuxuezhao February 18, 2026 00:17
    rc = rsvc_class(class)->sc_insert(svc);
    if (rc != 0) {
        D_DEBUG(DB_MD, "%s: sc_insert: " DF_RC "\n", svc->s_name, DP_RC(rc));
        goto err_svc_started;

Unusual goto, though I see the reasoning: it avoids duplicating the stop call here, and it avoids generating an inaccurate D_DEBUG log when it is sc_insert that failed rather than d_hash_rec_insert.


Yeah, it's a bit unusual, though I occasionally use this method. If I revise this PR next time, I'll change this to just duplicate the stop call.
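
For readers unfamiliar with the pattern being discussed, here is a minimal, self-contained sketch of a goto that jumps to a label placed inside a later error branch, so that two failure paths share one cleanup call while each keeps its own log line. The layout is an assumption inferred from the comments above, not the actual patch, and all `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

static int toy_stop_calls; /* counts how often the shared cleanup ran */

static int  toy_start(void) { return 0; }
static void toy_stop(void)  { toy_stop_calls++; }

/* Step N fails when N == fail_at; fail_at == 0 means nothing fails. */
static int toy_step(int step, int fail_at)
{
	return step == fail_at ? -1 : 0;
}

static int toy_rsvc_start(int fail_at)
{
	int rc;

	assert(fail_at >= 0 && fail_at <= 2);

	rc = toy_start();
	if (rc != 0)
		goto err;

	rc = toy_step(1, fail_at); /* plays the role of sc_insert */
	if (rc != 0) {
		fprintf(stderr, "first insert step failed: %d\n", rc);
		/* Jump past the second step's log line, straight to cleanup. */
		goto err_svc_started;
	}

	rc = toy_step(2, fail_at); /* plays the role of d_hash_rec_insert */
	if (rc != 0) {
		fprintf(stderr, "second insert step failed: %d\n", rc);
err_svc_started:
		toy_stop(); /* the single copy of the stop call */
		goto err;
	}

	return 0;
err:
	return rc;
}
```

Either failure runs `toy_stop()` exactly once and prints only its own message. The alternative mentioned in the reply, duplicating the stop call in the first branch, trades that single call site for a more conventional control flow.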
