
fix(rtc): auto-reconnect signaling WebSocket on unexpected loss#241

Draft
aliev wants to merge 2 commits into main from fix/auto-reconnect-signaling-ws

Conversation

@aliev
Member

@aliev aliev commented May 11, 2026

Why

WebSocketClient in getstream/video/rtc/signaling.py has no reconnect logic of its own. On an unexpected close, an error after the handshake, or a health-check timeout, it logs the failure and stops. The session then hangs until the frontend's own timeout fires a DELETE /sessions/... and the agent goes away.

Under concurrent load this is now the dominant cause of "the agent disconnected randomly" reports on a production deploy: a single brief TCP reset or missed heartbeat between agent and SFU is fatal because nothing reconnects.

We can route the loss into the existing ReconnectionManager — it already handles strategy escalation (FAST → REJOIN → MIGRATE), locking, and the disconnection-timeout deadline. The missing piece is just the notification.

Changes

  • signaling.py: emit a connection_lost event (with a reason string) from the WS worker thread when the connection drops outside of a user-initiated close(). Three sites:
    • _on_close — unexpected close.
    • _on_error — error after first_message_event is set (i.e. past initial handshake; pre-handshake errors continue to surface via first_message).
    • _ping_loop — health-check timeout. Notify before self.close(), otherwise the not self.closed guard suppresses the event.
    • Idempotent via _connection_lost_sent, so a chained _on_error → _on_close only fires once.
  • connection_manager.py: register on_event("connection_lost", self._on_signaling_connection_lost) after the other handlers. The handler invokes self._reconnector.reconnect(ReconnectionStrategy.FAST, …).

No new public API; consumers that don't care about the event are unaffected.
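The emission logic described above can be sketched as follows. This is a minimal stand-in, not the actual `signaling.py` code — the class body, lock, and handler registry are assumptions; only the names `connection_lost`, `_connection_lost_sent`, `closed`, `_on_error`, and `_on_close` come from the PR text:

```python
import threading


class WebSocketClientSketch:
    """Minimal stand-in showing the idempotent connection_lost emission."""

    def __init__(self):
        self.closed = False                 # set by a user-initiated close()
        self._connection_lost_sent = False  # idempotency guard
        self._lock = threading.Lock()
        self._handlers = {"connection_lost": []}

    def on_event(self, name, handler):
        self._handlers[name].append(handler)

    def _emit_connection_lost(self, reason: str) -> None:
        # Fire at most once, and never for a user-initiated close().
        # Note the ordering constraint from the PR: the health-check path
        # must call this *before* close(), or the closed guard wins.
        with self._lock:
            if self.closed or self._connection_lost_sent:
                return
            self._connection_lost_sent = True
        for handler in self._handlers["connection_lost"]:
            handler(reason)

    def _on_error(self, exc):
        self._emit_connection_lost(f"ws error: {exc}")

    def _on_close(self, code, msg):
        self._emit_connection_lost(f"ws closed: {code} {msg}")
```

A chained `_on_error` → `_on_close` (common when the socket errors and then closes) fires the event exactly once, and a socket that was closed deliberately fires nothing.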

Test plan

  • Existing tests/test_signaling.py — all 7 cases pass locally (error path during connect, event callbacks, threading, traces) and don't trip on the new emit.
  • E2E: verified in a deployed environment that prior to this fix, transient WS drops under load caused the symptom this PR addresses. Will re-verify after deploy with the fix.


`WebSocketClient` had no reconnect logic of its own — `_on_close`,
`_on_error`, and the health-check timeout in `_ping_loop` only logged
and stopped the connection. When the signaling WS dropped (transient
TCP reset, missed health check, server-side close without an SFU error
event), the session sat hanging until the frontend's own timeout fired
a DELETE.

Under concurrent load this is the dominant remaining cause of "the
agent disconnected randomly" reports: even with the late-offer fix in
place, a single brief WS blip is fatal because nothing kicks
reconnection.

- `signaling.py`: emit a `connection_lost` event with a reason string
  on (a) unexpected `_on_close`, (b) `_on_error` after the initial
  handshake completed, and (c) `_ping_loop` health-check timeout
  (notify *before* `self.close()` so the user-initiated guard does not
  suppress it). Idempotent via `_connection_lost_sent`.
- `connection_manager.py`: subscribe to `connection_lost` on the WS
  client and route it into the existing `ReconnectionManager`
  (`ReconnectionStrategy.FAST`). The manager already handles strategy
  escalation, locking, and the disconnection-timeout deadline.

`ConnectionManager._connect_internal` stores `join_response.data` as
`self.join_response` (see `connection_manager.py:387`), so credentials
live at the top level. `_reconnect_fast` was reading
`self.join_response.data.credentials.token`, which raises
`AttributeError: 'JoinCallResponse' object has no attribute 'data'`
the first time a FAST reconnect actually runs.

This codepath was never exercised before because nothing triggered a
reconnect on signaling-WS loss; with the prior commit on this branch
finally driving into `ReconnectionManager`, the latent bug surfaces.

Drop the spurious `.data` hop.
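The shape of the bug can be reproduced in isolation. The dataclasses below are stand-ins that mirror only the field names from the commit message (`credentials.token` on a `JoinCallResponse`):

```python
from dataclasses import dataclass


@dataclass
class Credentials:
    token: str


@dataclass
class JoinCallResponse:
    credentials: Credentials


# _connect_internal already stored join_response.data, so what lives on
# self.join_response is the JoinCallResponse itself — no `.data` level left.
join_response = JoinCallResponse(credentials=Credentials(token="abc"))


def token_buggy(jr):
    # Old read path: raises AttributeError the first time FAST reconnect runs.
    return jr.data.credentials.token


def token_fixed(jr):
    # Fixed read path: credentials hang directly off the stored response.
    return jr.credentials.token
```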
