Support streamState restore to avoid unnecessary read retries during worker restarts #3667

Draft
saurabhd336 wants to merge 1 commit into apache:main from saurabhd336:optimizeOpenStreamResponse

@saurabhd336
Contributor

What changes were proposed in this pull request?

Whenever a worker restarts (including graceful restarts), all ongoing client read requests (i.e. PbChunkFetchRequest requests) start failing with
"Stream ${streamChunkSlice.streamId} is not registered with worker. This can happen if the worker was restart recently."
This is because the DiskFileInfos in StorageManager are restored from the recovery file, but the streamState is not.

This failure triggers expensive retry logic on the client side, which includes registering a new stream state, unnecessarily excluding the worker from hosting the next set of shuffles, and in some cases (e.g. when replication is enabled) a complete re-read of the entire partition file.

We can avoid this by persisting and restoring the streamState wherever possible.
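
To make the idea concrete, here is a rough, self-contained sketch of persisting and restoring a stream registry. It is not the code in this PR: the class `StreamStateRecoverySketch`, the `StreamState` fields, and the flat-file format are all assumptions chosen for illustration, and the worker's actual recovery mechanism may look quite different.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: class, field and method names are illustrative only
// and are not taken from the project's code base.
final class StreamStateRecoverySketch {

  /** Assumed minimal metadata needed to serve a chunk fetch for a stream. */
  static final class StreamState {
    final String shuffleKey;
    final String fileName;
    final int startIndex;
    final int endIndex;

    StreamState(String shuffleKey, String fileName, int startIndex, int endIndex) {
      this.shuffleKey = shuffleKey;
      this.fileName = fileName;
      this.startIndex = startIndex;
      this.endIndex = endIndex;
    }
  }

  // streamId -> state, as registered when a client opens a stream.
  private final Map<Long, StreamState> streams = new ConcurrentHashMap<>();

  void register(long streamId, StreamState state) {
    streams.put(streamId, state);
  }

  /** Returns null when the stream is unknown, i.e. the "not registered" case. */
  StreamState lookup(long streamId) {
    return streams.get(streamId);
  }

  /** Writes a snapshot of all registered streams, e.g. during graceful shutdown. */
  void persist(Path recoveryFile) throws IOException {
    // Copy first so the entry count written below matches the entries written.
    Map<Long, StreamState> snapshot = new HashMap<>(streams);
    try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(recoveryFile))) {
      out.writeInt(snapshot.size());
      for (Map.Entry<Long, StreamState> e : snapshot.entrySet()) {
        StreamState s = e.getValue();
        out.writeLong(e.getKey());
        out.writeUTF(s.shuffleKey);
        out.writeUTF(s.fileName);
        out.writeInt(s.startIndex);
        out.writeInt(s.endIndex);
      }
    }
  }

  /** Reloads previously persisted streams on startup; a missing file is a no-op. */
  void restore(Path recoveryFile) throws IOException {
    if (!Files.exists(recoveryFile)) {
      return;
    }
    try (DataInputStream in = new DataInputStream(Files.newInputStream(recoveryFile))) {
      int count = in.readInt();
      for (int i = 0; i < count; i++) {
        long streamId = in.readLong();
        // Fields are read in the same order persist() wrote them.
        StreamState s =
            new StreamState(in.readUTF(), in.readUTF(), in.readInt(), in.readInt());
        streams.put(streamId, s);
      }
    }
  }
}
```

With something along these lines, a chunk fetch that arrives after the restart could find its streamId in the restored map and be served from the already restored file metadata, rather than failing and forcing the client to open a new stream.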

Why are the changes needed?

So that chunk read retries for already-registered streamIds succeed after a worker restart instead of failing.

Does this PR resolve a correctness bug?

No

Does this PR introduce any user-facing change?

No

How was this patch tested?

TODO

@saurabhd336
Contributor Author

@SteNicholas / @s0nskar / @zaynt4606 / others, I wanted to start an early review of the idea. This PR is not complete yet, but I wanted to check whether the community finds value in this proposal. We've noticed that after a server restart, many chunk requests are bound to fail because the streamState isn't restored, unlike the diskFileInfo and other metadata.