[Bug]: Failed startup session probe bricks the app — fatal, sticky error in root beforeLoad with no recovery

## Before submitting
- [x] I searched existing issues and did not find a duplicate.
- [x] I included enough detail to reproduce or investigate the problem.

## Area
apps/web (fatal, sticky error in root `beforeLoad`) — with an underlying apps/server persistence defect (see below)

<img width="2168" height="1538" alt="Image" src="https://github.com/user-attachments/assets/9b9535a2-ecdb-4e18-b3c6-36e0a02ce1a9" />

> **Status:** PR #3520 addresses the recoverability slice (Suggested fix #1 — auto-retry the stuck startup error boundary). This issue intentionally stays open to track the rest: the underlying session-store write failure (Suggested fix #3), degrade-to-gate (#2), and the renderer diagnosability gap (#4).

## Summary
On desktop, a failed **startup session probe** bricks the whole app with an unrecoverable error screen: *"Something went wrong. Primary environment request failed during fetch-session-state (HTTP 500)."* The root TanStack Router `beforeLoad` awaits `resolveInitialServerAuthGateState()` → `bootstrapServerAuth()` → `fetchSessionState()` (unguarded, `auth.ts:316`); a throw there renders the root `errorComponent` (`RootRouteErrorView`) **before any UI loads**.

The overlay is **sticky**: `beforeLoad` runs once and the screen persists. Neither button recovers it:
- `Try again` (`reset()`) only resets the error boundary and does **not** re-run `beforeLoad`, so it re-shows the same error.
- `Reload app` (`window.location.reload()`) re-runs everything but hits the same failure for as long as the underlying condition persists.

**Two clarifications after digging (to save triage time):**
- **The "(HTTP 500)" is a client-side fallback label, not necessarily a server 500.** `PrimaryEnvironmentRequestError.fromCause` sets `status = readHttpApiStatus(cause) ?? 500` (`connection_errors.ts:54`). A non-HTTP cause — transport error, a response **decode/schema** mismatch, or a server **defect-channel** error — is labeled "HTTP 500" with `operation: fetch-session-state` even when no 500 response existed. (Consistent with the server logs showing **zero** `internal_error` lines — every *typed* session-500 would be logged via `failEnvironmentInternal`, and none are.)
- **The renderer error is logged nowhere** (`~/.t3/userdata/logs` has no renderer sink), so the exact `cause._tag` of the probe failure can't be confirmed from artifacts. What *is* verified is the fatal+sticky handling and the absence of any typed server session-500.
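
To make the first clarification concrete, here is a minimal sketch of how a `?? 500` fallback makes a non-HTTP failure look like a server 500. Only the `?? 500` fallback itself comes from the source; `readStatus`, `readCauseTag`, `describeFailure`, and the `statusIsFallback` flag are hypothetical stand-ins, not the actual `connection_errors.ts` code.

```typescript
// Hypothetical stand-ins for illustration; not the real connection_errors.ts helpers.
interface FailureDescription {
  status: number;
  statusIsFallback: boolean; // true when no HTTP status was actually observed
  causeTag: string;
}

// Reads a numeric `status` off the cause, if it has one.
function readStatus(cause: unknown): number | undefined {
  if (typeof cause !== "object" || cause === null) return undefined;
  const status = (cause as { status?: unknown }).status;
  return typeof status === "number" ? status : undefined;
}

// Prefers an Effect-style `_tag`, then an Error name, then the JS type.
function readCauseTag(cause: unknown): string {
  if (typeof cause === "object" && cause !== null) {
    const tag = (cause as { _tag?: unknown })._tag;
    if (typeof tag === "string") return tag;
  }
  if (cause instanceof Error) return cause.name;
  return typeof cause;
}

function describeFailure(cause: unknown): FailureDescription {
  const observed = readStatus(cause);
  return {
    status: observed ?? 500, // same fallback shape as described above
    statusIsFallback: observed === undefined,
    causeTag: readCauseTag(cause),
  };
}
```

Under these assumptions, a transport `TypeError` and a genuine 500 response both report `status: 500`; only an extra field such as `statusIsFallback` (or the cause tag) tells them apart.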

## Steps to reproduce
Trigger is intermittent / not reproducible on demand (the app self-heals after a clean re-auth). Observed:
1. Use the desktop nightly; let stored session/credential state get into a bad state.
2. Relaunch (or auto-update + relaunch).
3. The app shows the full-screen error before any UI; `Try again` / `Reload app` re-hit it.
4. Clearing the local auth token / re-authenticating recovers; afterward it's no longer reproducible.

## Expected behavior
A failed `fetchSessionState` at startup should **degrade to the sign-in / re-pair flow** (a recoverable, non-fatal gate state), never brick the app in `beforeLoad`.
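
As a rough illustration of the expected behavior, the probe could be wrapped so that a throw becomes a gate state rather than a fatal root error. This is a sketch under assumptions: the `unavailable` state, `resolveGateStateSafely`, `describeCause`, and the simplified probe result shape are all hypothetical; only the `requires-auth` status is known to exist today.

```typescript
// Hypothetical gate-state shape; only "requires-auth" is confirmed by the issue.
type GateState =
  | { status: "authenticated" }
  | { status: "requires-auth" }
  | { status: "unavailable"; reason: string }; // assumed new, retryable state

function describeCause(cause: unknown): string {
  return cause instanceof Error ? cause.message : String(cause);
}

// Wraps the startup probe so a throw is mapped to a recoverable gate state.
// The probe signature is a simplification, not the real fetchSessionState.
async function resolveGateStateSafely(
  probe: () => Promise<{ authenticated: boolean }>,
): Promise<GateState> {
  try {
    const session = await probe();
    return session.authenticated
      ? { status: "authenticated" }
      : { status: "requires-auth" };
  } catch (cause) {
    // The probe itself failed, so no auth metadata is available here.
    return { status: "unavailable", reason: describeCause(cause) };
  }
}
```

The design choice here is that a probe failure never reaches the router's error boundary; the route tree can then render a retry surface for `unavailable` the same way it renders sign-in for `requires-auth`.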

## Actual behavior
Whole app unusable at launch; both recovery buttons re-trigger the same failure. Overlay stack (asset hashes only):
```
PrimaryEnvironmentRequestError
    at e.fromCause (t3code://app/assets/index-<hash>.js)
    at async Object.beforeLoad (t3code://app/assets/index-<hash>.js)
    ...
```

## Root cause (verified handling) + the underlying defect
**Handling (verified in source — this is the brick):**
- `bootstrapServerAuth` awaits `fetchSessionState()` at `apps/web/src/environments/primary/auth.ts:316` **outside any try/catch** (the try only wraps the later token exchange at :328-338, which *does* degrade to `requires-auth`).
- A 500/throw is **not retried** — `retryTransientBootstrap` only retries `{502,503,504}` / `TypeError` / `AbortError` (`auth.ts:273,302-312`).
- The throw reaches `beforeLoad` → `RootRouteErrorView` (`apps/web/src/routes/__root.tsx:73,79,187-228`), which is sticky and offers no sign-out/reset.
- The recoverable `{ status: "requires-auth" }` gate state already exists and routes to sign-in (`__root.tsx:108`) — the session-probe failure just doesn't use it.
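
The retry policy described in the second bullet can be sketched as a predicate. This is a hypothetical reconstruction from the description above (`{502,503,504}` / `TypeError` / `AbortError`), not the real `retryTransientBootstrap`; the function and constant names are assumptions.

```typescript
// Hypothetical reconstruction of the retry policy as the issue describes it.
const TRANSIENT_STATUSES: ReadonlySet<number> = new Set([502, 503, 504]);

function isTransientBootstrapFailure(cause: unknown): boolean {
  if (typeof cause !== "object" || cause === null) return false;
  const { status, name } = cause as { status?: unknown; name?: unknown };
  // A numeric status decides the outcome on its own: 500 is not transient.
  if (typeof status === "number") return TRANSIENT_STATUSES.has(status);
  // Without a status, only network-level error names count as transient.
  return name === "TypeError" || name === "AbortError";
}
```

Under this policy a failure carrying status 500 is never retried, so it falls straight through to the unguarded `await` and on to the root error view.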

**Underlying server defect (observed in logs, distinct from the brick path):**
- The backend repeatedly fails to mint a session credential: `failEnvironmentInternal("access_token_issuance_failed")` (token endpoint, `apps/server/src/auth/http.ts:305-307`) with cause `ServerAuthAuthenticatedAccessTokenIssueError → SessionCredentialIssueError` — i.e. a **`sessions.issue` WRITE failure** in the session-credential persistence layer (14 occurrences over one session). The local `state.sqlite` was **2.48 GB** (+ a multi-MB WAL), so a SQLite health/space/WAL-bloat problem on the Nightly profile is the prime suspect.

## Suggested fix
1. **Primary (cause-independent, smallest correct fix): make the startup error recoverable, not sticky.** Have `RootRouteErrorView` auto-retry the probe when the window regains focus/visibility or the network returns, so a recovered backend un-sticks the app instead of stranding the user. Use **`router.invalidate()`** to re-run `beforeLoad` — `reset()` only resets the boundary and does not re-run the loader. (Guard with an in-flight lock; `resolveInitialServerAuthGateState` doesn't cache failures, so invalidate genuinely re-probes.)
2. **Deeper follow-up:** degrade a failed startup probe to a recoverable gate state. The existing `requires-auth` state can't be reused directly — the sign-in screen needs `auth.bootstrapMethods` (`PairingRouteSurface.tsx:115,159`), which isn't available when the probe throws — so this needs a new no-auth "unavailable/retry" gate state plus an auth-state audit (the `_chat.tsx`→`/pair` redirect). Also add a **sign-out / reset** action so users aren't forced to delete `~/.t3/userdata/` token files by hand.
3. **Underlying defect:** investigate the `sessions.issue` / `SessionCredentialIssueError` WRITE failure and the `state.sqlite` health (size/WAL/corruption); surface a specific error (e.g. "local session store unavailable") rather than a generic internal error.
4. **Diagnosability:** the renderer `PrimaryEnvironmentRequestError` (operation, status, `cause._tag`) is logged nowhere — add a renderer log/Sentry capture so future occurrences can be classified (transport vs decode vs typed server 500) without guessing.
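
The in-flight lock from fix #1 could look roughly like the sketch below. `createGuardedRetry` and `RetryOutcome` are hypothetical names; in the app, the `invalidate` argument would be something like `() => router.invalidate()`, which is passed in here so the helper stays independent of the router.

```typescript
// Hypothetical helper: runs at most one re-probe at a time.
type RetryOutcome = "skipped" | "completed" | "failed";

function createGuardedRetry(
  invalidate: () => Promise<unknown>,
): () => Promise<RetryOutcome> {
  let inFlight = false;
  return async () => {
    // Ignore triggers that arrive while a re-probe is already running.
    if (inFlight) return "skipped";
    inFlight = true;
    try {
      await invalidate();
      return "completed";
    } catch {
      // Swallow the error; the error view stays up and a later trigger retries.
      return "failed";
    } finally {
      inFlight = false;
    }
  };
}
```

In the error view, the returned function could be registered for `focus` and `online` on `window` and for `visibilitychange` on `document`, and removed on unmount. Overlapping events (for example focus and visibility firing together) then result in a single probe rather than a burst.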

**Do NOT** add blanket retry-on-500 to bootstrap — a 500 is correctly treated as non-transient.

## Environment
- OS: macOS 26.2 (arm64)
- App: T3 Code (Nightly) desktop, versions `0.0.28-nightly.20260622.622` and `…20260623.629`
- Reproducibility: intermittent; recovers after re-auth, not reproducible on demand

## Possibly related
- #3252 (fix: stabilize local dev auth startup) · #3180 ([codex] align server auth Effect services) · #3409 / #3413 / #3471 (introduced `PrimaryEnvironmentRequestError`)

