DAOS-18593 test: replace sleep with retry in rebuild/interactive.py#17559

Merged
daltonbohning merged 3 commits into master from dbohning/daos-18593 on Feb 24, 2026
@daltonbohning
Contributor

@daltonbohning daltonbohning commented Feb 13, 2026

Replace the arbitrary sleep with a retry on the expected DER_NONEXIST error.
This mitigates a race condition in which dmg pool query reports rebuild as busy
even though the rebuild has not actually started yet.
When dmg pool rebuild stop fails with DER_NONEXIST, the test now waits and retries.
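
The retry described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `CommandFailed`, `stop_rebuild_with_retry`, and the `stop_rebuild` callable are hypothetical names, and the retry count and interval are illustrative defaults.

```python
import time


class CommandFailed(Exception):
    """Hypothetical stand-in for the test framework's command failure exception."""


def stop_rebuild_with_retry(stop_rebuild, retries=5, interval=3.0, sleep=time.sleep):
    """Call stop_rebuild() until it succeeds, retrying only on DER_NONEXIST.

    stop_rebuild is a hypothetical callable that raises CommandFailed with the
    command's error text when the stop request fails.
    Returns the number of attempts that were made.
    """
    for attempt in range(1, retries + 1):
        try:
            stop_rebuild()
            return attempt
        except CommandFailed as error:
            # Only DER_NONEXIST is treated as "rebuild has not started yet".
            # Any other failure, or running out of attempts, is re-raised.
            if "DER_NONEXIST" not in str(error) or attempt == retries:
                raise
            sleep(interval)
    raise ValueError("retries must be at least 1")
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; any failure other than DER_NONEXIST still surfaces immediately instead of being masked by the retry.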

Test-repeat: 10
Test-tag: RbldInteractive
Skip-unit-tests: true
Skip-fault-injection-test: true

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@daltonbohning daltonbohning self-assigned this Feb 13, 2026
@github-actions

Ticket title is 'rebuild/interactive.py: remove arbitrary sleep'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-18593

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17559/1/execution/node/791/log

Replace arbitrary sleep with a retry on expected DER_NONEXIST.

Test-repeat: 10
Test-tag: RbldInteractive
Skip-unit-tests: true
Skip-fault-injection-test: true

Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
Test-repeat: 10
Test-tag: RbldInteractive
Skip-unit-tests: true
Skip-fault-injection-test: true

Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
@daltonbohning
Contributor Author

daltonbohning commented Feb 18, 2026

This mitigates a race condition in which dmg pool query reports rebuild as busy even though the rebuild has not actually started yet. When dmg pool rebuild stop fails with DER_NONEXIST, we simply wait and retry.
The race condition is hard to reproduce, so for testing purposes I removed the rebuild busy detection to verify that the DER_NONEXIST handling works.

This sample run shows the DER_NONEXIST handling working:
https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17559/3/artifact/Functional%20Hardware%20Large%20MD%20on%20SSD/rebuild/interactive.py/repeat001/job.log

2026-02-17 22:18:56,586 process          L0604 INFO | Running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d pool rebuild stop TestPool_1'
2026-02-17 22:18:56,722 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:56.722631 main.go:228: debug output enabled
2026-02-17 22:18:56,723 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:56.723297 main.go:260: control config loaded from /var/tmp/daos_testing/configs/daos_control.yml
2026-02-17 22:18:56,730 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:56.729999 rpc.go:278: request hosts: [hdr-112:10001 hdr-113:10001 hdr-114:10001 hdr-115:10001 hdr-116:10001]
2026-02-17 22:18:56,779 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:56.779013 response.go:179: hdr-112:10001: err: DER_NONEXIST(-1005): The specified entity does not exist
2026-02-17 22:18:56,779 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:56.779179 response.go:179: hdr-112:10001: err: DER_NONEXIST(-1005): The specified entity does not exist
2026-02-17 22:18:56,779 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:56.779235 pool.go:1084: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist
2026-02-17 22:18:56,779 process          L0416 DEBUG| [stderr] ERROR: dmg: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist
2026-02-17 22:18:56,782 process          L0686 INFO | Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d pool rebuild stop TestPool_1' finished with 1 after 0.19339227676391602s
2026-02-17 22:18:56,802 general_utils    L0176 INFO | Error occurred running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d pool rebuild stop TestPool_1': Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d pool rebuild stop TestPool_1' failed.
stdout: b''
stderr: b'DEBUG 2026/02/17 22:18:56.722631 main.go:228: debug output enabled\nDEBUG 2026/02/17 22:18:56.723297 main.go:260: control config loaded from /var/tmp/daos_testing/configs/daos_control.yml\nDEBUG 2026/02/17 22:18:56.729999 rpc.go:278: request hosts: [hdr-112:10001 hdr-113:10001 hdr-114:10001 hdr-115:10001 hdr-116:10001]\nDEBUG 2026/02/17 22:18:56.779013 response.go:179: hdr-112:10001: err: DER_NONEXIST(-1005): The specified entity does not exist\nDEBUG 2026/02/17 22:18:56.779179 response.go:179: hdr-112:10001: err: DER_NONEXIST(-1005): The specified entity does not exist\nDEBUG 2026/02/17 22:18:56.779235 pool.go:1084: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist\nERROR: dmg: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist\n'
additional_info: None
2026-02-17 22:18:56,802 interactive      L0107 INFO | Assuming rebuild is not started yet. Retrying in 3 seconds...
2026-02-17 22:18:59,805 command_utils_ba L0203 DEBUG| Updated param pool => TestPool_1
2026-02-17 22:18:59,805 command_utils_ba L0203 DEBUG| Updated param force => False
2026-02-17 22:18:59,806 general_utils    L0151 INFO | Command environment vars:
  {}
2026-02-17 22:18:59,806 process          L0604 INFO | Running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d pool rebuild stop TestPool_1'
2026-02-17 22:18:59,945 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:59.944884 main.go:228: debug output enabled
2026-02-17 22:18:59,946 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:59.946206 main.go:260: control config loaded from /var/tmp/daos_testing/configs/daos_control.yml
2026-02-17 22:18:59,950 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:59.950520 rpc.go:278: request hosts: [hdr-112:10001 hdr-114:10001 hdr-116:10001 hdr-117:10001 hdr-118:10001]
2026-02-17 22:18:59,998 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:59.998910 response.go:179: hdr-112:10001: *mgmt.DaosResp status:DER_SUCCESS(0): Success
2026-02-17 22:18:59,999 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:59.999101 pool.go:1091: Pool-rebuild stop request succeeded
2026-02-17 22:18:59,999 process          L0416 DEBUG| [stdout] Pool-rebuild stop request succeeded
2026-02-17 22:19:01,002 process          L0686 INFO | Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d pool rebuild stop TestPool_1' finished with 0 after 1.1937801837921143s
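
For reference, the status seen in the log above can be pulled out of dmg's stderr text with a small helper. This is only a sketch: `first_der_status` is a hypothetical name, and the pattern assumes the `DER_NAME(code)` format shown in these log lines.

```python
import re

# Matches statuses as they appear in the log, e.g. "DER_NONEXIST(-1005)".
DER_PATTERN = re.compile(r"(DER_[A-Z0-9_]+)\((-?\d+)\)")


def first_der_status(stderr_text):
    """Return (name, code) for the first DER status in the text, or None."""
    match = DER_PATTERN.search(stderr_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

With the first attempt's error line this yields the DER_NONEXIST name and its code, which is the condition that triggers the 3-second retry in the log.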

This reverts commit 7318e84.

Test-repeat: 10
Test-tag: RbldInteractive
Skip-unit-tests: true
Skip-fault-injection-test: true
Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
@daltonbohning daltonbohning marked this pull request as ready for review February 18, 2026 21:44
@daltonbohning daltonbohning requested review from a team as code owners February 18, 2026 21:44
Collaborator

@jamesanunez jamesanunez left a comment

I don't think you need 'break'. Please review.

@daltonbohning daltonbohning requested a review from a team February 24, 2026 16:16
@daltonbohning daltonbohning added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Feb 24, 2026
@daltonbohning daltonbohning merged commit f980974 into master Feb 24, 2026
33 checks passed
@daltonbohning daltonbohning deleted the dbohning/daos-18593 branch February 24, 2026 16:16
Labels

forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed.

4 participants