Skip to content

DAOS-18587 chk: handle report upcall failure#17546

Merged
gnailzenh merged 1 commit intomasterfrom
Nasf-Fan/DAOS-18587
Feb 26, 2026
Merged

DAOS-18587 chk: handle report upcall failure#17546
gnailzenh merged 1 commit intomasterfrom
Nasf-Fan/DAOS-18587

Conversation

@Nasf-Fan
Copy link
Copy Markdown
Contributor

@Nasf-Fan Nasf-Fan commented Feb 11, 2026

Anytime when DAOS engine logic needs interaction with admin, it will generate new interaction record in chk_instance::ci_pending_hdl tree, and then trigger dRPP upcall to control plane that may fail for some reason. If hit failure, the dRPC sponsor needs to remove such record from chk_instance::ci_pending_hdl tree before destroying it to avoid triggering fake assertion.

The patch also fixes a container label check issue:
If the label is transferred as d_iov_t instead of string, then the buffer maybe not '\0' terminated, need to check its buffer length.

Test-tag: recovery

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 11, 2026

Ticket title is 'CR - dmg check start causes engine crash on Aurora'
Status is 'In Progress'
Labels: 'catastrophic_recovery,test_2.8'
https://daosio.atlassian.net/browse/DAOS-18587

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18587 branch 2 times, most recently from b998e0d to c542bf2 Compare February 11, 2026 13:31
@Nasf-Fan Nasf-Fan marked this pull request as ready for review February 12, 2026 12:51
@Nasf-Fan
Copy link
Copy Markdown
Contributor Author

All required CI tests got pass.

@Nasf-Fan Nasf-Fan requested a review from wangshilong February 13, 2026 00:55
kjacque
kjacque previously approved these changes Feb 13, 2026
Copy link
Copy Markdown
Contributor

@kjacque kjacque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Only one minor suggestion, which I wouldn't block on.

Comment thread src/chk/chk_engine.c Outdated
Comment on lines +3225 to +3226
cpr->cpr_busy = 0;
chk_pending_del(ins, *seq, NULL);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor suggestion - Maybe would be good to have a helper function for this pair of actions, since it's done in multiple places.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I will update the patch.

Anytime when DAOS engine logic needs interaction with admin, it will
generate new interaction record in chk_instance::ci_pending_hdl tree,
and then trigger dRPP upcall to control plane that may fail for some
reason. If hit failure, the dRPC sponsor needs to remove such record
from chk_instance::ci_pending_hdl tree before destroying it to avoid
triggering fake assertion.

The patch also fixes a container label check issue:
If the label is transferred as d_iov_t instead of string, then the
buffer maybe not '\0' terminated, need to check its buffer length.

Test-tag: recovery

Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18587 branch from bf8af89 to d13a0b1 Compare February 13, 2026 03:03
Copy link
Copy Markdown
Contributor

@kjacque kjacque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to evaluate changes from the last version due to the force push, but looks OK to me. Comment is minor, address if you find it useful.

Comment thread src/chk/chk_engine.c
@Nasf-Fan
Copy link
Copy Markdown
Contributor Author

Ping reviewers, thanks!

Comment thread src/chk/chk_leader.c
pending->cpr_action = act;
ABT_cond_broadcast(pending->cpr_cond);
ABT_mutex_unlock(pending->cpr_mutex);
chk_pending_del(ins, seq, &pending);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the returned "pending" is not used by anyone?

Copy link
Copy Markdown
Contributor Author

@Nasf-Fan Nasf-Fan Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interaction sponsor is still using it (waiting on it). Here, we just set pending->cpr_action and remove it from the tree instead of freeing it. The interaction sponsor will release the pending record after using.

Comment thread src/chk/chk_common.c
rc = dbtree_lookup(ins->ci_pending_hdl, &kiov, &riov);
ABT_rwlock_unlock(ins->ci_abt_lock);
if (rc == 0)
*cpr = (struct chk_pending_rec *)riov.iov_buf;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there could race like after releasing the rw_lock, another thread deletes and frees the instance before you copy it to "cpr"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the caller of chk_pending_lookup() and chk_pending_del() are the same ULT.

@gnailzenh gnailzenh merged commit 94b5b83 into master Feb 26, 2026
50 of 54 checks passed
@gnailzenh gnailzenh deleted the Nasf-Fan/DAOS-18587 branch February 26, 2026 08:02
@Nasf-Fan Nasf-Fan requested a review from a team February 26, 2026 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants