fix(socket): clear stuck "Reconnecting" indicator after invite accept#4919

Closed
waleedlatif1 wants to merge 1 commit into staging from worktree-fix+invite-reconnecting

Conversation

@waleedlatif1

@waleedlatif1 waleedlatif1 commented Jun 9, 2026

Collaborator

Summary

  • Accepting an invite switches the active org and immediately redirects into the workspace, so the realtime socket bootstraps under a just-rotated session. A transient token-mint failure in that window left the socket stuck showing "Reconnecting…" until a manual page reload.
  • Fix follows the documented socket.io pattern: branch connect_error on socket.active.
    • active === true (transport/network blip) → let socket.io auto-reconnect (unchanged).
    • active === false (server-denied handshake — a null/expired token rejected by the auth middleware destroys the socket and disables auto-reconnect) → retry socket.connect() with capped exponential backoff, which re-runs the auth callback to mint a fresh token.
  • This recovers a transient failure on the next attempt and bounds a genuine logout to MAX_AUTH_RETRY_ATTEMPTS (10, ~3 min) before latching authFailed for a manual reload — no infinite re-minting.
  • The connect handler now clears isReconnecting, so a healthy socket can never sit showing "Reconnecting…".
  • Replaces the prior error-message sniffing and socket teardown/rebuild with the framework's native socket.active signal + socket.connect() manual reconnect.
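The branching described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `SocketLike` is a hypothetical structural stand-in for the `active` property and `connect()` method of socket.io-client's `Socket`, the `schedule` parameter is injected so the logic can run without real timers, and the 1 s base and 30 s cap are assumed values. Jitter is left out for clarity.

```typescript
// Hypothetical stand-in for the two socket.io-client Socket members used here.
interface SocketLike {
  readonly active: boolean
  connect(): unknown
}

interface RetryState {
  attempt: number
  authFailed: boolean
}

type Outcome = 'auto-reconnect' | 'retry-scheduled' | 'auth-failed'
type Schedule = (fn: () => void, delayMs: number) => unknown

const MAX_AUTH_RETRY_ATTEMPTS = 10 // cap named in the PR summary
const BASE_DELAY_MS = 1000 // assumed base delay
const MAX_DELAY_MS = 30000 // assumed cap

function handleConnectError(
  socket: SocketLike,
  state: RetryState,
  schedule: Schedule = (fn, delayMs) => setTimeout(fn, delayMs)
): Outcome {
  // Transport/network blip: socket.io keeps reconnecting on its own.
  if (socket.active) return 'auto-reconnect'

  // Denied handshake: no auto-reconnect, so retry manually, but only up to the cap.
  if (state.attempt >= MAX_AUTH_RETRY_ATTEMPTS) {
    state.authFailed = true
    return 'auth-failed'
  }

  const delayMs = Math.min(BASE_DELAY_MS * 2 ** state.attempt, MAX_DELAY_MS)
  state.attempt += 1
  // connect() re-runs the auth callback, which can mint a fresh token.
  schedule(() => socket.connect(), delayMs)
  return 'retry-scheduled'
}
```

In a provider this function would presumably be called from the `connect_error` listener, with `state` backed by refs so the attempt counter survives re-renders.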

Type of Change

  • Bug fix

Testing

Tested manually. Biome clean, existing app/workspace/providers tests pass (14/14), typecheck clean. Diagnostic logs (Retrying socket connection after denied handshake, Socket auth retries exhausted…) are included to confirm behavior on the next real occurrence.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel Bot commented Jun 9, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Skipped | Skipped | Jun 9, 2026 5:32am |

@cursor

cursor Bot commented Jun 9, 2026


PR Summary

Medium Risk
Changes realtime connection and session-token handshake behavior for all workspace users; bounded retries limit runaway reconnect loops but incorrect retry logic could still affect collaboration presence/sync.

Overview
Fixes a stuck Reconnecting… state when the workspace socket boots right after an org switch (e.g. accepting an invite) and token mint or handshake auth fails transiently.

socket-provider no longer treats an auth connect_error as a permanent failure that tears down the socket and blocks re-init. When Socket.IO marks the connection inactive after a denied handshake, it now calls scheduleAuthRetry, which applies exponential backoff with jitter for up to 10 attempts, each calling connect() so the auth callback can mint a fresh token. After the attempts are exhausted it latches authFailed for manual recovery.

Token mint errors in the auth callback always complete the handshake with token: null (fast deny + retry path) instead of hanging or skipping the callback. A successful connect clears isReconnecting, authFailed, and the auth retry timer/counter. retryConnection resets backoff and reconnects the existing socket instead of only flipping state to re-run the init effect. Socket initialization no longer bails out while authFailed is set.
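The "always complete the handshake" behavior can be illustrated with a small sketch. The names `createAuth` and `mintToken` are hypothetical and stand in for whatever the provider actually uses; the only assumption about socket.io-client is that its `auth` option accepts a function that receives a callback.

```typescript
type AuthPayload = { token: string | null }
type AuthCallback = (data: AuthPayload) => void

// mintToken is a hypothetical stand-in for the app's token-minting request.
function createAuth(mintToken: () => Promise<string>) {
  return async (cb: AuthCallback): Promise<void> => {
    let token: string | null = null
    try {
      token = await mintToken()
    } catch {
      // Swallow the mint error: a null token lets the server deny the
      // handshake quickly instead of leaving the client waiting.
    }
    // Called exactly once on both the success and the failure path.
    cb({ token })
  }
}
```

The design point is that the callback is invoked outside the `try` block, so a mint failure produces a fast denial that feeds the retry path rather than a handshake that never starts.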

Reviewed by Cursor Bugbot for commit b27dca7. Configure here.

@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Contributor

Greptile Summary

Fixes a stuck "Reconnecting…" indicator that appeared after an invite-accept redirect, where a transient token-minting failure during org-switch was latched as a permanent auth failure. The fix fast-fails the handshake on any token error, adds bounded exponential-backoff retry for namespace rejections, and ensures the connect event always clears isReconnecting.

  • Auth callback now always calls cb({ token: null }) on any error instead of leaving the handshake hanging; the resulting server rejection triggers scheduleAuthRetry with a 10-attempt cap and exponential backoff (1 s base, 30 s max).
  • connect handler clears isReconnecting, authFailed, and resets authRetryAttemptRef, so a successful handshake always returns the UI to a clean state.
  • retryConnection is repaired to directly call socketRef.current?.connect() instead of relying on a state-driven effect re-run that was never triggered.
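As a rough check on the numbers quoted above, the worst-case wait can be worked out from a 1 s base, a 30 s cap, and 10 attempts. The sketch assumes the delay doubles per attempt starting from the base and ignores jitter, neither of which is confirmed by this page.

```typescript
const BASE_DELAY_MS = 1000 // 1 s base
const MAX_DELAY_MS = 30000 // 30 s max
const MAX_ATTEMPTS = 10

// Delay before retry number `attempt` (0-based), assuming plain doubling.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS)
}

const schedule: number[] = []
for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
  schedule.push(backoffDelayMs(attempt))
}

const totalMs = schedule.reduce((sum, delay) => sum + delay, 0)

console.log(schedule.map((delay) => delay / 1000)) // [1, 2, 4, 8, 16, 30, 30, 30, 30, 30]
console.log(totalMs / 1000) // 181
```

Under these assumptions the retries span about 181 s, which is consistent with the "~3 min" bound given in the PR summary.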

Confidence Score: 5/5

Safe to merge — the change is contained to the socket reconnection path, all event handlers remain registered on the same socket instance, and the bounded retry cap prevents endless re-minting.

The auth-failure recovery path is now deterministic: tokens are always fast-failed to the server, inactive-socket errors are handled by a capped backoff loop, and the connect handler always clears isReconnecting. The retryConnection fix is straightforward. No data loss, no new async races, and the cap correctly bounds the worst-case behaviour for a genuine logged-out session.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/app/workspace/providers/socket-provider.tsx | Rewrites the socket auth-failure recovery path: auth callback now always calls cb (fast-fail), connect_error delegates to scheduleAuthRetry when the socket is inactive, connect clears isReconnecting, and retryConnection directly calls socket.connect() instead of depending on effect re-run. Capped at 10 auto-retries before latching authFailed. |

Sequence Diagram

sequenceDiagram
    participant UI as SocketProvider
    participant SIO as Socket.IO Client
    participant API as /api/auth/socket-token
    participant SRV as Socket.IO Server

    UI->>SIO: "io(url, { auth: async cb })"
    SIO->>API: generateSocketToken()
    alt Token minted OK
        API-->>SIO: token
        SIO->>SRV: connect (token)
        SRV-->>SIO: connect_error (bad token)
        SIO-->>UI: "connect_error (socket.active = false)"
        UI->>UI: scheduleAuthRetry(socket, attempt++)
        Note over UI: backoff delay (1s to 30s)
        UI->>SIO: socket.connect()
    else Token mint failed (any error)
        API-->>SIO: throws
        SIO->>SRV: connect (token: null)
        SRV-->>SIO: namespace rejection
        SIO-->>UI: "connect_error (socket.active = false)"
        UI->>UI: scheduleAuthRetry(socket, attempt++)
    end

    alt "Retry succeeds (attempt < 10)"
        SIO->>SRV: connect (fresh token)
        SRV-->>SIO: connect OK
        SIO-->>UI: connect event
        UI->>UI: clearIsReconnecting, resetAttemptCounter
    else "Retries exhausted (attempt >= 10)"
        UI->>UI: setAuthFailed(true), stop retrying
        Note over UI: User must call retryConnection() or reload
    end

    opt Manual retryConnection()
        UI->>UI: reset counter, setAuthFailed(false)
        UI->>SIO: socket.connect()
    end
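
The two recovery paths at the end of the diagram (a successful connect and a manual retryConnection) can be modeled as plain state transitions. This is a simplified sketch: `ConnectionState` and `Connectable` are hypothetical types, and the real provider presumably keeps these values in React state and refs rather than in one object.

```typescript
interface ConnectionState {
  isReconnecting: boolean
  authFailed: boolean
  authRetryAttempt: number
}

// Hypothetical minimal view of the socket: only connect() is needed here.
interface Connectable {
  connect(): unknown
}

// Successful handshake: return the UI to a clean state.
function onConnect(state: ConnectionState): ConnectionState {
  return { ...state, isReconnecting: false, authFailed: false, authRetryAttempt: 0 }
}

// Manual recovery: reset the backoff and reconnect the existing socket
// directly instead of waiting for an effect to re-run.
function retryConnection(state: ConnectionState, socket: Connectable): ConnectionState {
  socket.connect()
  return { ...state, authFailed: false, authRetryAttempt: 0 }
}
```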

Reviews (2): Last reviewed commit: "fix(socket): recover from denied handsha..." | Re-trigger Greptile

Comment thread apps/sim/app/workspace/providers/socket-provider.tsx Outdated
Accepting an invite switches the active org and immediately redirects into the
workspace, so the socket bootstraps under a just-rotated session. A transient
token-mint failure during that window left the realtime socket stuck showing
"Reconnecting..." until a manual page reload.

Follow the documented socket.io pattern and branch connect_error on
socket.active. A server-denied handshake (active === false — e.g. a null or
expired token rejected by the auth middleware, which destroys the socket and
does not auto-reconnect) now retries socket.connect() with capped exponential
backoff, re-running the auth callback to mint a fresh token. This recovers a
transient failure on the next attempt and bounds a genuine logout to
MAX_AUTH_RETRY_ATTEMPTS before latching authFailed for a manual reload. The
connect handler clears isReconnecting so a healthy socket never shows it.

Replaces the prior error-message sniffing and socket teardown/rebuild.
@waleedlatif1 waleedlatif1 force-pushed the worktree-fix+invite-reconnecting branch from bc83720 to b27dca7 Compare June 9, 2026 05:32
@waleedlatif1

Collaborator Author

@greptile

@waleedlatif1

Collaborator Author

@cursor review

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit b27dca7. Configure here.

@waleedlatif1

Collaborator Author

Closing — this is already fixed more completely on dev by @vikhyath in 9b1ec8ec1c ("fix sockets err classification") and f498731b88 ("sockets invite flow fix").

That fix and this PR independently converged on the same idiomatic client pattern (branch connect_error on socket.active; manual socket.connect() with bounded backoff for server-denied handshakes; drop the teardown/rebuild). The difference: the dev fix also addresses the server-side root cause — the realtime HTTP handler was racing Engine.IO's /socket.io/ polling handshake and 404-ing it (apps/realtime/src/index.ts) — which this client-only PR does not. It also adds focus/online auto-recovery.

Letting the complete fix reach staging via the normal dev → staging flow instead of landing a partial, duplicate client change here.

@waleedlatif1 waleedlatif1 deleted the worktree-fix+invite-reconnecting branch June 9, 2026 17:50