fix(rtc): auto-reconnect signaling WebSocket on unexpected loss#241
Draft
aliev wants to merge 2 commits into
Conversation
`WebSocketClient` had no reconnect logic of its own: `_on_close`, `_on_error`, and the health-check timeout in `_ping_loop` only logged and stopped the connection. When the signaling WS dropped (transient TCP reset, missed health check, server-side close without an SFU error event), the session sat hanging until the frontend's own timeout fired a DELETE. Under concurrent load this is the dominant remaining cause of "the agent disconnected randomly" reports: even with the late-offer fix in place, a single brief WS blip is fatal because nothing kicks reconnection.

- `signaling.py`: emit a `connection_lost` event with a reason string on (a) unexpected `_on_close`, (b) `_on_error` after the initial handshake completed, and (c) `_ping_loop` health-check timeout (notify *before* `self.close()` so the user-initiated guard does not suppress it). Idempotent via `_connection_lost_sent`.
- `connection_manager.py`: subscribe to `connection_lost` on the WS client and route it into the existing `ReconnectionManager` (`ReconnectionStrategy.FAST`). The manager already handles strategy escalation, locking, and the disconnection-timeout deadline.
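A minimal sketch of the idempotent emit described above. The class and attribute names (`_connection_lost_sent`, `closed`, `on_event`) follow the commit message, but the surrounding structure is an assumption, not the actual `signaling.py` implementation:

```python
import threading


class SignalingClientSketch:
    """Sketch of the connection_lost emit paths (structure assumed)."""

    def __init__(self):
        self._connection_lost_sent = False
        self._lock = threading.Lock()
        self._handlers = {"connection_lost": []}
        self.closed = False  # set True only by a user-initiated close()

    def on_event(self, name, handler):
        self._handlers.setdefault(name, []).append(handler)

    def _notify_connection_lost(self, reason: str):
        # Idempotent: a chained _on_error -> _on_close fires the event once,
        # and a user-initiated close() suppresses it entirely.
        with self._lock:
            if self._connection_lost_sent or self.closed:
                return
            self._connection_lost_sent = True
        for handler in self._handlers["connection_lost"]:
            handler(reason)

    def _on_close(self, code, msg):
        self._notify_connection_lost(f"unexpected close: {code} {msg}")

    def _on_ping_timeout(self):
        # Notify *before* close(), otherwise the closed guard above
        # would swallow the event.
        self._notify_connection_lost("health-check timeout")
        self.close()

    def close(self):
        self.closed = True
```

The ordering in `_on_ping_timeout` is the subtle part: calling `close()` first would flip the same flag the guard checks, silently dropping the notification.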
Force-pushed 7be299a to a81381f
`ConnectionManager._connect_internal` stores `join_response.data` as `self.join_response` (see `connection_manager.py:387`), so credentials live at the top level. `_reconnect_fast` was reading `self.join_response.data.credentials.token`, which raises `AttributeError: 'JoinCallResponse' object has no attribute 'data'` the first time a FAST reconnect actually runs. This codepath was never exercised before because nothing triggered a reconnect on signaling-WS loss; with the prior commit on this branch finally driving into `ReconnectionManager`, the latent bug surfaces. Drop the spurious `.data` hop.
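A tiny illustration of the shape mismatch, with hypothetical field layouts (the real `JoinCallResponse` has more fields; only the nesting matters here):

```python
from dataclasses import dataclass


@dataclass
class Credentials:
    token: str


@dataclass
class JoinCallResponse:
    """Illustrative stand-in: credentials already at the top level."""
    credentials: Credentials


# _connect_internal stores join_response.data, so what lands in
# self.join_response is already the unwrapped response object:
join_response = JoinCallResponse(credentials=Credentials(token="abc"))

# Buggy read: join_response.data.credentials.token
#   -> AttributeError: 'JoinCallResponse' object has no attribute 'data'
# Fixed read, with the spurious .data hop dropped:
token = join_response.credentials.token
```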
Why
`WebSocketClient` in `getstream/video/rtc/signaling.py` has no reconnect logic of its own. On unexpected close, error after handshake, or health-check timeout, it logs the failure and stops. The session then sits hanging until the frontend's own timeout fires a `DELETE /sessions/...` and the agent goes away.

Under concurrent load this is now the dominant cause of "the agent disconnected randomly" reports on a production deploy: a single brief TCP reset or missed heartbeat between agent and SFU is fatal because nothing reconnects.
We can drive into the existing `ReconnectionManager`: it already handles strategy escalation (FAST → REJOIN → MIGRATE), locking, and the disconnection-timeout deadline. The missing piece is just the notification.

Changes
- `signaling.py`: emit a `connection_lost` event (with a reason string) from the WS worker thread when the connection drops outside of a user-initiated `close()`. Three sites:
  - `_on_close`: unexpected close.
  - `_on_error`: error after `first_message_event` is set (i.e. past the initial handshake; pre-handshake errors continue to surface via `first_message`).
  - `_ping_loop`: health-check timeout. Notify before `self.close()`, otherwise the `not self.closed` guard suppresses the event.

  `_connection_lost_sent` ensures a chained `_on_error` → `_on_close` only fires once.
- `connection_manager.py`: register `on_event("connection_lost", self._on_signaling_connection_lost)` after the other handlers. The handler invokes `self._reconnector.reconnect(ReconnectionStrategy.FAST, …)`.

No new public API; consumers that don't care about the event are unaffected.
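The `connection_manager.py` side can be sketched as below. The stubs for the WS client and the reconnector are stand-ins; the real `ReconnectionManager.reconnect` signature is assumed from the description above:

```python
from enum import Enum, auto


class ReconnectionStrategy(Enum):
    FAST = auto()
    REJOIN = auto()
    MIGRATE = auto()


class StubWSClient:
    """Minimal stand-in for the signaling WebSocketClient (interface assumed)."""

    def __init__(self):
        self._handlers = {}

    def on_event(self, name, handler):
        self._handlers[name] = handler

    def simulate_drop(self, reason):
        # Emulates the WS worker thread firing connection_lost.
        self._handlers["connection_lost"](reason)


class StubReconnector:
    """Stand-in for the real ReconnectionManager."""

    def __init__(self):
        self.calls = []

    def reconnect(self, strategy, reason):
        self.calls.append((strategy, reason))


class ConnectionManagerSketch:
    def __init__(self, ws_client, reconnector):
        self._reconnector = reconnector
        # Registered after the other event handlers, per the PR description.
        ws_client.on_event("connection_lost", self._on_signaling_connection_lost)

    def _on_signaling_connection_lost(self, reason):
        # Route WS-level loss into the existing machinery, starting FAST;
        # escalation to REJOIN/MIGRATE stays inside the reconnector.
        self._reconnector.reconnect(ReconnectionStrategy.FAST, reason)
```

Keeping the handler a thin one-liner is the design point: all retry policy lives in the reconnector, so the WS client never needs to know about strategies.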
Test plan
`tests/test_signaling.py`: all 7 cases pass locally (error path during connect, event callbacks, threading, traces) and don't trip on the new emit.
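The pre-handshake guard in `_on_error` is the easiest path to get wrong; a test for it could look like the sketch below. The client class is a reduced stand-in, not the actual test fixture from `tests/test_signaling.py`:

```python
import threading


class GuardedClientSketch:
    """Sketch of the first_message_event guard in _on_error (names assumed)."""

    def __init__(self):
        self.first_message_event = threading.Event()
        self.lost_reasons = []       # would become connection_lost emits
        self.handshake_errors = []   # would surface via first_message instead

    def _on_error(self, err):
        if self.first_message_event.is_set():
            # Past the initial handshake: treat as connection loss.
            self.lost_reasons.append(str(err))
        else:
            # Pre-handshake: keep the existing first_message error path.
            self.handshake_errors.append(str(err))


def test_pre_handshake_error_does_not_emit_connection_lost():
    c = GuardedClientSketch()
    c._on_error("tls failure")
    assert c.lost_reasons == []
    assert c.handshake_errors == ["tls failure"]

    c.first_message_event.set()
    c._on_error("reset")
    assert c.lost_reasons == ["reset"]
```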