DAOS-18587 chk: handle report upcall failure#17546
Conversation
|
Ticket title is 'CR - dmg check start causes engine crash on Aurora' |
b998e0d to
c542bf2
Compare
|
All required CI tests got pass. |
kjacque
left a comment
There was a problem hiding this comment.
LGTM. Only one minor suggestion, which I wouldn't block on.
| cpr->cpr_busy = 0; | ||
| chk_pending_del(ins, *seq, NULL); |
There was a problem hiding this comment.
minor suggestion - Maybe would be good to have a helper function for this pair of actions, since it's done in multiple places.
There was a problem hiding this comment.
Good idea. I will update the patch.
ccd00ef to
bf8af89
Compare
Anytime when DAOS engine logic needs interaction with admin, it will generate new interaction record in chk_instance::ci_pending_hdl tree, and then trigger dRPP upcall to control plane that may fail for some reason. If hit failure, the dRPC sponsor needs to remove such record from chk_instance::ci_pending_hdl tree before destroying it to avoid triggering fake assertion. The patch also fixes a container label check issue: If the label is transferred as d_iov_t instead of string, then the buffer maybe not '\0' terminated, need to check its buffer length. Test-tag: recovery Signed-off-by: Fan Yong <fan.yong@hpe.com>
bf8af89 to
d13a0b1
Compare
kjacque
left a comment
There was a problem hiding this comment.
Hard to evaluate changes from the last version due to the force push, but looks OK to me. Comment is minor, address if you find it useful.
|
Ping reviewers, thanks! |
| pending->cpr_action = act; | ||
| ABT_cond_broadcast(pending->cpr_cond); | ||
| ABT_mutex_unlock(pending->cpr_mutex); | ||
| chk_pending_del(ins, seq, &pending); |
There was a problem hiding this comment.
the returned "pending" is not used by anyone?
There was a problem hiding this comment.
The interaction sponsor is still using it (waiting on it). Here, we just set pending->cpr_action and remove it from the tree instead of freeing it. The interaction sponsor will release the pending record after using.
| rc = dbtree_lookup(ins->ci_pending_hdl, &kiov, &riov); | ||
| ABT_rwlock_unlock(ins->ci_abt_lock); | ||
| if (rc == 0) | ||
| *cpr = (struct chk_pending_rec *)riov.iov_buf; |
There was a problem hiding this comment.
I'm not sure if there could race like after releasing the rw_lock, another thread deletes and frees the instance before you copy it to "cpr"
There was a problem hiding this comment.
Currently, the caller of chk_pending_lookup() and chk_pending_del() are the same ULT.
Anytime when DAOS engine logic needs interaction with admin, it will generate new interaction record in chk_instance::ci_pending_hdl tree, and then trigger dRPP upcall to control plane that may fail for some reason. If hit failure, the dRPC sponsor needs to remove such record from chk_instance::ci_pending_hdl tree before destroying it to avoid triggering fake assertion.
The patch also fixes a container label check issue:
If the label is transferred as d_iov_t instead of string, then the buffer maybe not '\0' terminated, need to check its buffer length.
Test-tag: recovery
Steps for the author:
After all prior steps are complete: