Skip to content

SDSTOR-21989: check index_table existence before using it#409

Merged
JacksonYao287 merged 2 commits into
eBay:stable/v4.xfrom
JacksonYao287:21989
May 14, 2026
Merged

SDSTOR-21989: check index_table existence before using it#409
JacksonYao287 merged 2 commits into
eBay:stable/v4.xfrom
JacksonYao287:21989

Conversation

@JacksonYao287
Copy link
Copy Markdown
Collaborator

@JacksonYao287 JacksonYao287 commented May 6, 2026

if homeobject crashes after index_table is destroyed in destroy_pg and before the pg is created in baseline resync, we can not find pg index table when restarts. so we need add a sanity check to see if index_table exists. if not, skip refreshing pg metrics

@JacksonYao287 JacksonYao287 requested a review from Besroy May 6, 2026 01:50
@JacksonYao287 JacksonYao287 force-pushed the 21989 branch 2 times, most recently from 0a92251 to dca49f2 Compare May 6, 2026 03:26
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 6, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 31.57895% with 13 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (stable/v4.x@2de356a). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/lib/homestore_backend/hs_pg_manager.cpp 33.33% 6 Missing and 2 partials ⚠️
src/lib/homestore_backend/hs_blob_manager.cpp 28.57% 3 Missing and 2 partials ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@              Coverage Diff               @@
##             stable/v4.x     #409   +/-   ##
==============================================
  Coverage               ?   52.71%           
==============================================
  Files                  ?       36           
  Lines                  ?     5332           
  Branches               ?      662           
==============================================
  Hits                   ?     2811           
  Misses                 ?     2225           
  Partials               ?      296           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@xiaoxichen
Copy link
Copy Markdown
Collaborator

In general LGTM

  1. During recovery, should we better handling destroyed PG? Currently we only partially handled the case that Index is destroyed on destroyed PG. (
    } else {
    RELEASE_ASSERT(hs_pg->pg_sb_->state == PGState::DESTROYED, "IndexTable should be recovered before PG");
    hs_pg->index_table_ = nullptr;
    LOGI("Index table not found for destroyed pg={}, index_table_uuid={}", pg_id, uuid_str);
    }
    )

I think we should redo the HSHomeObject::pg_destroy during recovery.

  1. the check index_table logic spread across the code, is it better to add an accessor into HS_PG? which checks the existence of index and logging (if the PG is destroyed) or assert.

@Besroy
Copy link
Copy Markdown
Contributor

Besroy commented May 7, 2026

Agree, checking the PG state seems safer and cleaner. @yuwmao @koujl Do you recall the previous solution for handling crashes during destroy_pg? Perhaps we can integrate it with this fix as a unified solution

@JacksonYao287
Copy link
Copy Markdown
Collaborator Author

JacksonYao287 commented May 8, 2026

agree, we need to redo pg_destroy to make sure all the pg resource are cleared before pg is recreated by BR or permanently destroyed. cc @yuwmao @koujl if you want to do this, feel free to close this PR.

@JacksonYao287 JacksonYao287 requested review from koujl and yuwmao May 8, 2026 03:54
@yuwmao
Copy link
Copy Markdown
Contributor

yuwmao commented May 8, 2026

No, I haven't met the destroy_pg crash issue before in my memory. But also agree on redo pg_destroy.

Agree, checking the PG state seems safer and cleaner. @yuwmao @koujl Do you recall the previous solution for handling crashes during destroy_pg? Perhaps we can integrate it with this fix as a unified solution

@JacksonYao287
Copy link
Copy Markdown
Collaborator Author

I have added an accessor to index_table. redo pg destory will come in a later separate PR

@JacksonYao287 JacksonYao287 requested a review from xiaoxichen May 11, 2026 08:01
auto pg_index_table = hs_pg->index_table_;
RELEASE_ASSERT(pg_index_table, "Index table not found for PG pg_id={}", pg_id);
auto pg_index_table = hs_pg->get_index_table();
if (!pg_index_table) { return false; }
Copy link
Copy Markdown
Contributor

@Besroy Besroy May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the PG was destroyed, will GC/BR/put_blob/get_blob operations still occur after a restart? If not, can we limit the changes to refresh_pg_statistics?
The reason is when I reviewing the replace_blob_index caller, I noticed that returning false causes a progress assertion failure, as do most functions that access it after retrieval.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I don`t fully get the point, can you pls put more details here?

actually, assertion failure is necessary. there are two case that false will be returned.

1 pg index table is not found. this should not happen since if there is an on-going gc task, pg can not be destroyed. pls refer to drain_pg_pending_gc_task, which will wait until all the gc task of this pg is completed

2 failed to update pg index table. this means partial updated(some are updated successfully and others not) probably happens and it is very dangerous since data loss will happen if we don`t assert.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the PG was destroyed, will GC/BR/put_blob/get_blob operations still occur after a restart

pg is destroyed in 3 cases: br (only at follower), destroy repl_dev(all member) and replace_member(only at out member)

if br case , pg will be recreated again, and thus GC/BR/put_blob/get_blob operations probably occurs after restart.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is: keep the existing assert logic in those places unchanged, because:

  1. Correct me if I’m wrong, but GC / put_blob / get_blob are not expected to run on a destroyed PG. BR will destroy the PG again, so this case shouldn’t happen at that step either. This scenario would only show up during a restart when the PG state is inconsistent
  2. Even if you get nullptr in those places, you should still keep the original asserts. Because even if you don’t assert there, we will still assert later based on the return value, or we’ll crash when dereferencing a null pointer anyway

Copy link
Copy Markdown
Collaborator Author

@JacksonYao287 JacksonYao287 May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Besroy , done, ptal

Copy link
Copy Markdown
Collaborator

@xiaoxichen xiaoxichen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm considering this is a temporary fix

@JacksonYao287 JacksonYao287 merged commit f604993 into eBay:stable/v4.x May 14, 2026
25 checks passed
@JacksonYao287 JacksonYao287 deleted the 21989 branch May 14, 2026 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants