
fix(rtc): auto-reconnect signaling WebSocket on unexpected loss#241

Draft
aliev wants to merge 2 commits into main from fix/auto-reconnect-signaling-ws

Conversation

@aliev
Member

@aliev aliev commented May 11, 2026

Why

WebSocketClient in getstream/video/rtc/signaling.py has no reconnect logic of its own. On an unexpected close, an error after the handshake, or a health-check timeout, it logs the failure and stops. The session then hangs until the frontend's own timeout fires a DELETE /sessions/... and the agent goes away.

Under concurrent load this is now the dominant cause of "the agent disconnected randomly" reports on a production deploy: a single brief TCP reset or missed heartbeat between agent and SFU is fatal because nothing reconnects.

We can route the loss into the existing ReconnectionManager — it already handles strategy escalation (FAST → REJOIN → MIGRATE), locking, and the disconnection-timeout deadline. The missing piece is just the notification.

Changes

  • signaling.py: emit a connection_lost event (with a reason string) from the WS worker thread when the connection drops outside of a user-initiated close(). Three sites:
    • _on_close — unexpected close.
    • _on_error — error after first_message_event is set (i.e. past initial handshake; pre-handshake errors continue to surface via first_message).
    • _ping_loop — health-check timeout. Notify before self.close(), otherwise the not self.closed guard suppresses the event.
    • Idempotent via _connection_lost_sent, so a chained _on_error → _on_close only fires once.
  • connection_manager.py: register on_event("connection_lost", self._on_signaling_connection_lost) after the other handlers. The handler invokes self._reconnector.reconnect(ReconnectionStrategy.FAST, …).

No new public API; consumers that don't care about the event are unaffected.
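The emission logic described above can be sketched as follows. This is a minimal stand-in, not the actual `signaling.py` code — the class body, lock, and handler registry are assumptions; only the names `connection_lost`, `_connection_lost_sent`, `closed`, `_on_error`, and `_on_close` come from the PR text:

```python
import threading


class WebSocketClientSketch:
    """Minimal stand-in showing the idempotent connection_lost emission."""

    def __init__(self):
        self.closed = False                 # set by a user-initiated close()
        self._connection_lost_sent = False  # idempotency guard
        self._lock = threading.Lock()
        self._handlers = {"connection_lost": []}

    def on_event(self, name, handler):
        self._handlers[name].append(handler)

    def _emit_connection_lost(self, reason: str) -> None:
        # Fire at most once, and never for a user-initiated close().
        # Note the ordering constraint from the PR: the health-check path
        # must call this *before* close(), or the closed guard wins.
        with self._lock:
            if self.closed or self._connection_lost_sent:
                return
            self._connection_lost_sent = True
        for handler in self._handlers["connection_lost"]:
            handler(reason)

    def _on_error(self, exc):
        self._emit_connection_lost(f"ws error: {exc}")

    def _on_close(self, code, msg):
        self._emit_connection_lost(f"ws closed: {code} {msg}")
```

A chained `_on_error` → `_on_close` (common when the socket errors and then closes) fires the event exactly once, and a socket that was closed deliberately fires nothing.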

Test plan

  • Existing tests/test_signaling.py — all 7 cases pass locally (error path during connect, event callbacks, threading, traces) and don't trip on the new emit.
  • E2E: verified in a deployed environment that prior to this fix, transient WS drops under load caused the symptom this PR addresses. Will re-verify after deploy with the fix.


`WebSocketClient` had no reconnect logic of its own — `_on_close`,
`_on_error`, and the health-check timeout in `_ping_loop` only logged
and stopped the connection. When the signaling WS dropped (transient
TCP reset, missed health check, server-side close without an SFU error
event), the session sat hanging until the frontend's own timeout fired
a DELETE.

Under concurrent load this is the dominant remaining cause of "the
agent disconnected randomly" reports: even with the late-offer fix in
place, a single brief WS blip is fatal because nothing kicks
reconnection.

- `signaling.py`: emit a `connection_lost` event with a reason string
  on (a) unexpected `_on_close`, (b) `_on_error` after the initial
  handshake completed, and (c) `_ping_loop` health-check timeout
  (notify *before* `self.close()` so the user-initiated guard does not
  suppress it). Idempotent via `_connection_lost_sent`.
- `connection_manager.py`: subscribe to `connection_lost` on the WS
  client and route it into the existing `ReconnectionManager`
  (`ReconnectionStrategy.FAST`). The manager already handles strategy
  escalation, locking, and the disconnection-timeout deadline.

`ConnectionManager._connect_internal` stores `join_response.data` as
`self.join_response` (see `connection_manager.py:387`), so credentials
live at the top level. `_reconnect_fast` was reading
`self.join_response.data.credentials.token`, which raises
`AttributeError: 'JoinCallResponse' object has no attribute 'data'`
the first time a FAST reconnect actually runs.

This codepath was never exercised before because nothing triggered a
reconnect on signaling-WS loss; with the prior commit on this branch
finally driving into `ReconnectionManager`, the latent bug surfaces.

Drop the spurious `.data` hop.
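The shape of the bug can be reproduced in isolation. The dataclasses below are stand-ins that mirror only the field names from the commit message (`credentials.token` on a `JoinCallResponse`):

```python
from dataclasses import dataclass


@dataclass
class Credentials:
    token: str


@dataclass
class JoinCallResponse:
    credentials: Credentials


# _connect_internal already stored join_response.data, so what lives on
# self.join_response is the JoinCallResponse itself — no `.data` level left.
join_response = JoinCallResponse(credentials=Credentials(token="abc"))


def token_buggy(jr):
    # Old read path: raises AttributeError the first time FAST reconnect runs.
    return jr.data.credentials.token


def token_fixed(jr):
    # Fixed read path: credentials hang directly off the stored response.
    return jr.credentials.token
```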
