chore(api): log reason for failing to connect, resume sandboxes on state change #2145
matthewlouisbrockman wants to merge 8 commits into main from
Conversation
```go
err = a.orchestrator.WaitForStateChange(ctx, teamID, sandboxID)
if err != nil {
	logger.L().Error(ctx, "Error waiting for sandbox state change",
```
WaitForStateChange returns ctx.Err() when the polling loop context is cancelled (state_change.go:250). A normal client disconnect will trigger this path, logging a spurious Error and attempting to send a 500 to a client that is already gone.
Consider skipping the Error log when the error is a context cancellation or deadline exceeded.
making 'em debug
```go
logger.L().Debug(ctx, "Waiting for sandbox to pause", logger.WithSandboxID(sandboxID))
err = a.orchestrator.WaitForStateChange(ctx, teamID, sandboxID)
if err != nil {
	logger.L().Error(ctx, "Error waiting for sandbox to pause",
```
Same issue: WaitForStateChange returns ctx.Err() on context cancellation, so client disconnects will be logged at Error level. Filter context.Canceled / context.DeadlineExceeded before logging.
making 'em debug, i want to know they're happening
should we add him to
eh, can worry about getting the reason for the failed db guys later; pretty sure those are all just timing out on the 30 second callback from the redis route
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Currently, we drop the `err` from `err = a.orchestrator.WaitForStateChange(ctx, teamID, sandboxID)`. This makes it hard for us to know why requests during pausing errored. This change logs the dropped `err` so we can get better confidence in why we're getting 500 errors during state transitions. It also updates the error logging to use `telemetry.ReportCriticalError` to maintain the `Error` text while helping with the spans.

Note
Low Risk
Low risk: changes are limited to additional telemetry/error reporting on existing failure paths, without altering core sandbox state logic.
Overview
Adds `telemetry.ReportCriticalError` calls in sandbox connect/resume handlers to capture the underlying errors (including `WaitForStateChange` failures, unexpected non-running/unknown states, snapshot fetch failures, and secure envd token errors) and to attach sandbox/team/build/template identifiers for easier debugging of 500s during state transitions.

Written by Cursor Bugbot for commit ab21f5a. This will update automatically on new commits.
telemetry.ReportCriticalErrorcalls in sandboxconnect/resumehandlers to capture the underlying errors (includingWaitForStateChangefailures, unexpected non-running/unknown states, snapshot fetch failures, and secure envd token errors) and to attach sandbox/team/build/template identifiers for easier debugging of 500s during state transitions.Written by Cursor Bugbot for commit ab21f5a. This will update automatically on new commits. Configure here.