Before submitting
Area
apps/web (fatal, sticky error in root beforeLoad) — with an underlying apps/server persistence defect (see below)
Status: PR #3520 addresses the recoverability slice (Suggested fix #1 — auto-retry the stuck startup error boundary). This issue intentionally stays open to track the rest: the underlying session-store write failure (Suggested fix #3), degrade-to-gate (#2), and the renderer diagnosability gap (#4).
Summary
On desktop, a failed startup session probe bricks the whole app with an unrecoverable error screen: "Something went wrong. Primary environment request failed during fetch-session-state (HTTP 500)." The root TanStack Router beforeLoad awaits resolveInitialServerAuthGateState() → bootstrapServerAuth() → fetchSessionState() (unguarded, auth.ts:316); a throw there renders the root errorComponent (RootRouteErrorView) before any UI loads. The overlay is sticky — beforeLoad runs once and the screen persists; neither button recovers it: Try again (reset()) only resets the error boundary and does not re-run beforeLoad (it re-shows the same error), and Reload app (window.location.reload()) re-runs everything but re-hits the failure while it persists.
Two clarifications after digging (to save triage time):
- The "(HTTP 500)" is a client-side fallback label, not necessarily a server 500. PrimaryEnvironmentRequestError.fromCause sets status = readHttpApiStatus(cause) ?? 500 (connection_errors.ts:54). A non-HTTP cause (transport error, response decode/schema mismatch, or server defect-channel error) is labeled "HTTP 500" with operation: fetch-session-state even when no 500 response existed. This is consistent with the server logs showing zero internal_error lines: every typed session-500 would be logged via failEnvironmentInternal, and none are.
- The renderer error is logged nowhere (~/.t3/userdata/logs has no renderer sink), so the exact cause._tag of the probe failure can't be confirmed from artifacts. What is verified is the fatal+sticky handling and the absence of any typed server session-500.
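To illustrate the first clarification, here is a minimal sketch of how a `?? 500` fallback turns a status-less cause into an "HTTP 500" label. `readStatus` is a hypothetical stand-in for readHttpApiStatus (its real implementation is not shown in this issue), and `classify` is an illustrative alternative rather than existing code.

```typescript
// Hypothetical stand-in for readHttpApiStatus: yields a status only when the
// cause actually carries a numeric HTTP status.
function readStatus(cause: unknown): number | undefined {
  if (typeof cause === "object" && cause !== null && "status" in cause) {
    const status = (cause as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

// Mirrors the `?? 500` fallback described above: any cause without a status
// is reported as 500, whether or not a 500 response ever existed.
function labelStatus(cause: unknown): number {
  return readStatus(cause) ?? 500;
}

// Illustrative alternative: keep "no HTTP status" distinguishable from a real 500.
type ProbeFailure =
  | { kind: "http"; status: number }
  | { kind: "non-http"; tag: string };

function classify(cause: unknown): ProbeFailure {
  const status = readStatus(cause);
  if (status !== undefined) return { kind: "http", status };
  const tag = cause instanceof Error ? cause.name : typeof cause;
  return { kind: "non-http", tag };
}
```

With this shape, a transport-level TypeError and a genuine 500 response would no longer share the same label in the error view or in logs.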
Steps to reproduce
Trigger is intermittent / not reproducible on demand (the app self-heals after a clean re-auth). Observed:
- Use the desktop nightly; let stored session/credential state get into a bad state.
- Relaunch (or auto-update + relaunch).
- The app shows the full-screen error before any UI; Try again / Reload app re-hit it.
- Clearing the local auth token / re-authenticating recovers; afterward it's no longer reproducible.
Expected behavior
A failed fetchSessionState at startup should degrade to the sign-in / re-pair flow (a recoverable, non-fatal gate state), never brick the app in beforeLoad.
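A rough sketch of what degrading could look like, assuming a hypothetical "unavailable" gate state. Only requires-auth is confirmed to exist in the source; the other variants and all function names here are illustrative.

```typescript
// Hypothetical gate-state union. Only "requires-auth" is named in this issue;
// "authenticated" and "unavailable" are assumptions made for the sketch.
type GateState =
  | { status: "authenticated" }
  | { status: "requires-auth" }
  | { status: "unavailable"; reason: string };

// Maps a thrown probe failure to a recoverable gate state instead of letting
// it escape to the root error boundary.
function gateFromProbeFailure(error: unknown): GateState {
  const reason = error instanceof Error ? error.message : String(error);
  return { status: "unavailable", reason };
}

// Wraps any startup probe so a rejection becomes a gate state, not a throw.
async function resolveGateSafely(
  probe: () => Promise<GateState>,
): Promise<GateState> {
  try {
    return await probe();
  } catch (error) {
    return gateFromProbeFailure(error);
  }
}
```

The route guard would then render a retry surface for "unavailable" rather than the fatal overlay.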
Actual behavior
Whole app unusable at launch; both recovery buttons re-trigger the same failure. Overlay stack (asset hashes only):
PrimaryEnvironmentRequestError
at e.fromCause (t3code://app/assets/index-<hash>.js)
at async Object.beforeLoad (t3code://app/assets/index-<hash>.js)
...
Root cause (verified handling) + the underlying defect
Handling (verified in source — this is the brick):
- bootstrapServerAuth awaits fetchSessionState() at apps/web/src/environments/primary/auth.ts:316 outside any try/catch (the try only wraps the later token exchange at :328-338, which does degrade to requires-auth).
- A 500/throw is not retried: retryTransientBootstrap only retries {502,503,504} / TypeError / AbortError (auth.ts:273,302-312).
- The throw reaches beforeLoad → RootRouteErrorView (apps/web/src/routes/__root.tsx:73,79,187-228), which is sticky and offers no sign-out/reset.
- The recoverable { status: "requires-auth" } gate state already exists and routes to sign-in (__root.tsx:108); the session-probe failure just doesn't use it.
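The retry rule in the second bullet can be sketched as a predicate. This is an approximation built only from the description above; `isTransientBootstrapFailure` is a hypothetical name and the real logic at auth.ts:302-312 is not reproduced here.

```typescript
// Statuses the issue lists as retried by retryTransientBootstrap.
const TRANSIENT_STATUSES = new Set<number>([502, 503, 504]);

// Hypothetical predicate: true only for the failure kinds described as
// transient (502/503/504, TypeError, AbortError). A 500 falls through.
function isTransientBootstrapFailure(error: unknown, status?: number): boolean {
  if (status !== undefined && TRANSIENT_STATUSES.has(status)) return true;
  if (error instanceof TypeError) return true;
  return error instanceof Error && error.name === "AbortError";
}
```

Under this rule a failure labeled 500 is never retried, so it propagates straight to the root error view.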
Underlying server defect (observed in logs, distinct from the brick path):
- The backend repeatedly fails to mint a session credential: failEnvironmentInternal("access_token_issuance_failed") (token endpoint, apps/server/src/auth/http.ts:305-307) with cause ServerAuthAuthenticatedAccessTokenIssueError → SessionCredentialIssueError, i.e. a sessions.issue WRITE failure in the session-credential persistence layer (14 occurrences over one session). The local state.sqlite was 2.48 GB (+ a multi-MB WAL), so a SQLite health/space/WAL-bloat problem on the Nightly profile is the prime suspect.
Suggested fix
- Primary (cause-independent, smallest correct fix): make the startup error recoverable, not sticky. Have RootRouteErrorView auto-retry the probe when the window regains focus/visibility or the network returns, so a recovered backend un-sticks the app instead of stranding the user. Use router.invalidate() to re-run beforeLoad; reset() only resets the boundary and does not re-run the loader. (Guard with an in-flight lock; resolveInitialServerAuthGateState doesn't cache failures, so invalidate genuinely re-probes.)
- Deeper follow-up: degrade a failed startup probe to a recoverable gate state. The existing requires-auth state can't be reused directly: the sign-in screen needs auth.bootstrapMethods (PairingRouteSurface.tsx:115,159), which isn't available when the probe throws. This therefore needs a new no-auth "unavailable/retry" gate state plus an auth-state audit (the _chat.tsx→/pair redirect). Also add a sign-out / reset action so users aren't forced to delete ~/.t3/userdata/ token files by hand.
- Underlying defect: investigate the sessions.issue / SessionCredentialIssueError WRITE failure and the state.sqlite health (size/WAL/corruption); surface a specific error (e.g. "local session store unavailable") rather than a generic internal error.
- Diagnosability: the renderer PrimaryEnvironmentRequestError (operation, status, cause._tag) is logged nowhere. Add a renderer log/Sentry capture so future occurrences can be classified (transport vs decode vs typed server 500) without guessing.
Do NOT add blanket retry-on-500 to bootstrap — a 500 is correctly treated as non-transient.
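The in-flight lock for the primary fix could look roughly like the sketch below. All names (`createGuardedRetry`, `attachAutoRetry`, `ListenerTarget`) are hypothetical, and the re-probe is passed in as a function so the sketch does not depend on the router.

```typescript
// Minimal structural type so the sketch works with window, document, or a test double.
interface ListenerTarget {
  addEventListener(type: string, listener: () => void): void;
  removeEventListener(type: string, listener: () => void): void;
}

// Wraps a re-probe so overlapping triggers share one in-flight attempt.
function createGuardedRetry(rerun: () => Promise<void>): () => Promise<void> {
  let inFlight: Promise<void> | null = null;
  return () => {
    if (inFlight !== null) return inFlight;
    const attempt = rerun();
    const clear = () => {
      inFlight = null;
    };
    attempt.then(clear, clear); // release the lock on success or failure
    inFlight = attempt;
    return attempt;
  };
}

// Re-runs the guarded retry whenever one of the given events fires.
// Returns a cleanup function that removes the listeners.
function attachAutoRetry(
  target: ListenerTarget,
  events: readonly string[],
  retry: () => Promise<void>,
): () => void {
  const onEvent = () => {
    retry().catch(() => {
      /* still failing; the error screen stays up until the next trigger */
    });
  };
  for (const type of events) target.addEventListener(type, onEvent);
  return () => {
    for (const type of events) target.removeEventListener(type, onEvent);
  };
}
```

In the error view this might be wired as `attachAutoRetry(window, ["focus", "online"], createGuardedRetry(() => router.invalidate()))`, assuming router.invalidate() returns a promise; a visibilitychange listener would go on document rather than window. Because triggers are event-driven and not status-driven, this does not amount to a blanket retry-on-500.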
Environment
- OS: macOS 26.2 (arm64)
- App: T3 Code (Nightly) desktop, versions 0.0.28-nightly.20260622.622 and …20260623.629
- Reproducibility: intermittent; recovers after re-auth, not reproducible on demand
Possibly related