SDSTOR-21989: check index_table existence before using it#409
Conversation
0a92251 to
dca49f2
Compare
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## stable/v4.x #409 +/- ##
==============================================
Coverage ? 52.71%
==============================================
Files ? 36
Lines ? 5332
Branches ? 662
==============================================
Hits ? 2811
Misses ? 2225
Partials ? 296 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
In general LGTM
I think we should redo the
|
|
No, I haven't met the destroy_pg crash issue before in my memory. But also agree on redo pg_destroy.
|
|
I have added an accessor to index_table. redo pg destory will come in a later separate PR |
| auto pg_index_table = hs_pg->index_table_; | ||
| RELEASE_ASSERT(pg_index_table, "Index table not found for PG pg_id={}", pg_id); | ||
| auto pg_index_table = hs_pg->get_index_table(); | ||
| if (!pg_index_table) { return false; } |
There was a problem hiding this comment.
If the PG was destroyed, will GC/BR/put_blob/get_blob operations still occur after a restart? If not, can we limit the changes to refresh_pg_statistics?
The reason is when I reviewing the replace_blob_index caller, I noticed that returning false causes a progress assertion failure, as do most functions that access it after retrieval.
There was a problem hiding this comment.
sorry, I don`t fully get the point, can you pls put more details here?
actually, assertion failure is necessary. there are two case that false will be returned.
1 pg index table is not found. this should not happen since if there is an on-going gc task, pg can not be destroyed. pls refer to drain_pg_pending_gc_task, which will wait until all the gc task of this pg is completed
2 failed to update pg index table. this means partial updated(some are updated successfully and others not) probably happens and it is very dangerous since data loss will happen if we don`t assert.
There was a problem hiding this comment.
If the PG was destroyed, will GC/BR/put_blob/get_blob operations still occur after a restart
pg is destroyed in 3 cases: br (only at follower), destroy repl_dev(all member) and replace_member(only at out member)
if br case , pg will be recreated again, and thus GC/BR/put_blob/get_blob operations probably occurs after restart.
There was a problem hiding this comment.
What I mean is: keep the existing assert logic in those places unchanged, because:
- Correct me if I’m wrong, but GC / put_blob / get_blob are not expected to run on a destroyed PG. BR will destroy the PG again, so this case shouldn’t happen at that step either. This scenario would only show up during a restart when the PG state is inconsistent
- Even if you get nullptr in those places, you should still keep the original asserts. Because even if you don’t assert there, we will still assert later based on the return value, or we’ll crash when dereferencing a null pointer anyway
xiaoxichen
left a comment
There was a problem hiding this comment.
lgtm considering this is a temporary fix
if homeobject crashes after index_table is destroyed in destroy_pg and before the pg is created in baseline resync, we can not find pg index table when restarts. so we need add a sanity check to see if index_table exists. if not, skip refreshing pg metrics