[Bug]: Failed startup session probe bricks the app — fatal, sticky error in root beforeLoad with no recovery #3513

Description

@aanishs

Before submitting

  • I searched existing issues and did not find a duplicate.
  • I included enough detail to reproduce or investigate the problem.

Area

apps/web (fatal, sticky error in root beforeLoad) — with an underlying apps/server persistence defect (see below)

Status: PR #3520 addresses the recoverability slice (Suggested fix #1 — auto-retry the stuck startup error boundary). This issue intentionally stays open to track the rest: the underlying session-store write failure (Suggested fix #3), degrade-to-gate (#2), and the renderer diagnosability gap (#4).

Summary

On desktop, a failed startup session probe bricks the whole app with an unrecoverable error screen: "Something went wrong. Primary environment request failed during fetch-session-state (HTTP 500)." The root TanStack Router beforeLoad awaits resolveInitialServerAuthGateState() → bootstrapServerAuth() → fetchSessionState() (unguarded, auth.ts:316); a throw there renders the root errorComponent (RootRouteErrorView) before any UI loads. The overlay is sticky: beforeLoad runs once and the screen persists, and neither button recovers it. Try again (reset()) only resets the error boundary and does not re-run beforeLoad, so it re-shows the same error. Reload app (window.location.reload()) re-runs everything but re-hits the failure for as long as it persists.

Two clarifications after digging (to save triage time):

  • The "(HTTP 500)" is a client-side fallback label, not necessarily a server 500. PrimaryEnvironmentRequestError.fromCause sets status = readHttpApiStatus(cause) ?? 500 (connection_errors.ts:54). A non-HTTP cause — transport error, a response decode/schema mismatch, or a server defect-channel error — is labeled "HTTP 500" with operation: fetch-session-state even when no 500 response existed. (Consistent with the server logs showing zero internal_error lines — every typed session-500 would be logged via failEnvironmentInternal, and none are.)
  • The renderer error is logged nowhere (~/.t3/userdata/logs has no renderer sink), so the exact cause._tag of the probe failure can't be confirmed from artifacts. What is verified is the fatal+sticky handling and the absence of any typed server session-500.
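
To illustrate both points, here is a minimal sketch of how a `?? 500` fallback labels non-HTTP causes, and of the cause tag a renderer log entry would need to capture. `readHttpApiStatus` and `fromCause` below are simplified stand-ins written for this example, not the code in connection_errors.ts. `RequestErrorInfo` and `causeTag` are hypothetical names.

```typescript
// Simplified stand-in for the real error type; field names are assumptions.
interface RequestErrorInfo {
  operation: string;
  status: number;
  cause: unknown;
}

// Stand-in: returns the HTTP status carried by a cause, if it has one.
function readHttpApiStatus(cause: unknown): number | undefined {
  if (typeof cause === "object" && cause !== null && "status" in cause) {
    const status = (cause as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

function fromCause(operation: string, cause: unknown): RequestErrorInfo {
  // Any cause without an HTTP status falls through to 500.
  return { operation, status: readHttpApiStatus(cause) ?? 500, cause };
}

// Hypothetical helper: the tag that would tell the failure classes apart in a log.
function causeTag(cause: unknown): string {
  if (typeof cause === "object" && cause !== null && "_tag" in cause) {
    const tag = (cause as { _tag: unknown })._tag;
    if (typeof tag === "string") return tag;
  }
  if (cause instanceof Error) return cause.name;
  return typeof cause;
}

const transport = fromCause("fetch-session-state", new TypeError("Failed to fetch"));
const decode = fromCause("fetch-session-state", { _tag: "ParseError" });
const upstream = fromCause("fetch-session-state", { status: 503 });

console.log(transport.status, causeTag(transport.cause)); // 500 TypeError
console.log(decode.status, causeTag(decode.cause)); // 500 ParseError
console.log(upstream.status); // 503
```

The transport failure and the decode failure both surface as "HTTP 500" even though no response with that status existed; only the cause tag distinguishes them.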

Steps to reproduce

Trigger is intermittent / not reproducible on demand (the app self-heals after a clean re-auth). Observed:

  1. Use the desktop nightly; let stored session/credential state get into a bad state.
  2. Relaunch (or auto-update + relaunch).
  3. The app shows the full-screen error before any UI; Try again / Reload app re-hit it.
  4. Clearing the local auth token / re-authenticating recovers; afterward it's no longer reproducible.

Expected behavior

A failed fetchSessionState at startup should degrade to the sign-in / re-pair flow (a recoverable, non-fatal gate state), never brick the app in beforeLoad.
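
A minimal sketch of that degradation, under stated assumptions: the gate-state union below is hypothetical (only `requires-auth` is named in this issue), the `unavailable` state does not exist today, and `fetchSessionState` is passed in as a parameter because its real signature and return type are not shown here.

```typescript
// Hypothetical gate-state union; "unavailable" is an assumed new state.
type AuthGateState =
  | { status: "authenticated" }
  | { status: "requires-auth" }
  | { status: "unavailable"; reason: string };

// Assumed shape of the probe result, for illustration only.
interface SessionState {
  authenticated: boolean;
}

async function resolveGateState(
  fetchSessionState: () => Promise<SessionState>,
): Promise<AuthGateState> {
  try {
    const session = await fetchSessionState();
    return session.authenticated
      ? { status: "authenticated" }
      : { status: "requires-auth" };
  } catch (error) {
    // The probe failure becomes a value instead of escaping into beforeLoad.
    const reason = error instanceof Error ? error.message : String(error);
    return { status: "unavailable", reason };
  }
}
```

With this shape, beforeLoad always receives a gate state it can route on, and the root error boundary is reserved for genuinely unexpected failures.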

Actual behavior

Whole app unusable at launch; both recovery buttons re-trigger the same failure. Overlay stack (asset hashes only):

PrimaryEnvironmentRequestError
    at e.fromCause (t3code://app/assets/index-<hash>.js)
    at async Object.beforeLoad (t3code://app/assets/index-<hash>.js)
    ...

Root cause (verified handling) + the underlying defect

Handling (verified in source — this is the brick):

  • bootstrapServerAuth awaits fetchSessionState() at apps/web/src/environments/primary/auth.ts:316 outside any try/catch (the try only wraps the later token exchange at :328-338, which does degrade to requires-auth).
  • A 500/throw is not retried: retryTransientBootstrap only retries {502,503,504} / TypeError / AbortError (auth.ts:273,302-312).
  • The throw reaches beforeLoad → RootRouteErrorView (apps/web/src/routes/__root.tsx:73,79,187-228), which is sticky and offers no sign-out/reset.
  • The recoverable { status: "requires-auth" } gate state already exists and routes to sign-in (__root.tsx:108) — the session-probe failure just doesn't use it.
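
The retry boundary described in the second bullet can be sketched as a small classifier. This illustrates the described behavior only; it is not the body of retryTransientBootstrap, and `isTransientBootstrapFailure` and the `status` field on the error are hypothetical names.

```typescript
const TRANSIENT_STATUSES: ReadonlySet<number> = new Set([502, 503, 504]);

// Returns true only for the failure classes the issue lists as retried.
function isTransientBootstrapFailure(error: unknown): boolean {
  if (error instanceof TypeError) return true; // e.g. a fetch network failure
  if (error instanceof Error && error.name === "AbortError") return true;
  if (typeof error === "object" && error !== null && "status" in error) {
    const status = (error as { status: unknown }).status;
    return typeof status === "number" && TRANSIENT_STATUSES.has(status);
  }
  return false;
}

console.log(isTransientBootstrapFailure({ status: 503 })); // true
console.log(isTransientBootstrapFailure({ status: 500 })); // false
```

Because a status of 500 is outside the transient set, the labeled probe failure is thrown on the first attempt and goes straight to the error view.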

Underlying server defect (observed in logs, distinct from the brick path):

  • The backend repeatedly fails to mint a session credential: failEnvironmentInternal("access_token_issuance_failed") (token endpoint, apps/server/src/auth/http.ts:305-307) with cause ServerAuthAuthenticatedAccessTokenIssueError → SessionCredentialIssueError — i.e. a sessions.issue WRITE failure in the session-credential persistence layer (14 occurrences over one session). The local state.sqlite was 2.48 GB (+ a multi-MB WAL), so a SQLite health/space/WAL-bloat problem on the Nightly profile is the prime suspect.

Suggested fix

  1. Primary (cause-independent, smallest correct fix): make the startup error recoverable, not sticky. Have RootRouteErrorView auto-retry the probe when the window regains focus/visibility or the network returns, so a recovered backend un-sticks the app instead of stranding the user. Use router.invalidate() to re-run beforeLoad; reset() only resets the boundary and does not re-run the loader. (Guard with an in-flight lock; resolveInitialServerAuthGateState doesn't cache failures, so invalidate genuinely re-probes.)
  2. Deeper follow-up: degrade a failed startup probe to a recoverable gate state. The existing requires-auth state can't be reused directly — the sign-in screen needs auth.bootstrapMethods (PairingRouteSurface.tsx:115,159), which isn't available when the probe throws — so this needs a new no-auth "unavailable/retry" gate state plus an auth-state audit (the _chat.tsx/pair redirect). Also add a sign-out / reset action so users aren't forced to delete ~/.t3/userdata/ token files by hand.
  3. Underlying defect: investigate the sessions.issue / SessionCredentialIssueError WRITE failure and the state.sqlite health (size/WAL/corruption); surface a specific error (e.g. "local session store unavailable") rather than a generic internal error.
  4. Diagnosability: the renderer PrimaryEnvironmentRequestError (operation, status, cause._tag) is logged nowhere — add a renderer log/Sentry capture so future occurrences can be classified (transport vs decode vs typed server 500) without guessing.
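
The in-flight lock from fix #1 could look roughly like the sketch below. It is deliberately framework-agnostic: `invalidate` stands in for TanStack Router's `router.invalidate()`, and `createGuardedRetry` is a hypothetical name. The event wiring appears only as a trailing comment because the actual RootRouteErrorView code is not part of this issue.

```typescript
// Wraps a re-probe so overlapping triggers (focus + online firing together)
// start at most one probe at a time. Resolves to false when a trigger was skipped.
function createGuardedRetry(
  invalidate: () => Promise<void>,
): () => Promise<boolean> {
  let inFlight = false;
  return async () => {
    if (inFlight) return false; // a probe is already running
    inFlight = true;
    try {
      await invalidate();
    } catch {
      // Assumption: a repeated failure is re-rendered by the error view,
      // so it is swallowed here to avoid unhandled rejections in event handlers.
    } finally {
      inFlight = false;
    }
    return true;
  };
}

// Hypothetical wiring inside the error view (browser only):
// const retry = createGuardedRetry(() => router.invalidate());
// window.addEventListener("focus", () => void retry());
// window.addEventListener("online", () => void retry());
// document.addEventListener("visibilitychange", () => {
//   if (document.visibilityState === "visible") void retry();
// });
```

The lock is released in `finally`, so a probe that fails does not block the next focus or network event from retrying.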

Do NOT add blanket retry-on-500 to bootstrap — a 500 is correctly treated as non-transient.

Environment

  • OS: macOS 26.2 (arm64)
  • App: T3 Code (Nightly) desktop, versions 0.0.28-nightly.20260622.622 and …20260623.629
  • Reproducibility: intermittent; recovers after re-auth, not reproducible on demand
