
chore(api): log reason for failing to connect, resume sandboxes on state change #2145

Open

matthewlouisbrockman wants to merge 8 commits into main from pause-resume-failure-logs

Conversation

@matthewlouisbrockman
Contributor

matthewlouisbrockman commented Mar 16, 2026

Currently, we drop the err from err = a.orchestrator.WaitForStateChange(ctx, teamID, sandboxID). This makes it hard to know why requests during pausing errored.

This change logs the dropped err so we can get better confidence in why we're getting 500 errors during state transitions.

It also updates the error logging to use telemetry.ReportCriticalError, which preserves the error text while also populating the spans.


Note

Low risk: changes are limited to additional telemetry/error reporting on existing failure paths, without altering core sandbox state logic.

Overview
Adds telemetry.ReportCriticalError calls in sandbox connect/resume handlers to capture the underlying errors (including WaitForStateChange failures, unexpected non-running/unknown states, snapshot fetch failures, and secure envd token errors) and to attach sandbox/team/build/template identifiers for easier debugging of 500s during state transitions.

Written by Cursor Bugbot for commit ab21f5a. This will update automatically on new commits.


err = a.orchestrator.WaitForStateChange(ctx, teamID, sandboxID)
if err != nil {
logger.L().Error(ctx, "Error waiting for sandbox state change",

WaitForStateChange returns ctx.Err() when the polling loop context is cancelled (state_change.go:250). A normal client disconnect will trigger this path, logging a spurious Error and attempting to send a 500 to a client that is already gone.

Consider skipping the Error log when the error is a context cancellation or deadline exceeded.

Contributor Author

making 'em debug

logger.L().Debug(ctx, "Waiting for sandbox to pause", logger.WithSandboxID(sandboxID))
err = a.orchestrator.WaitForStateChange(ctx, teamID, sandboxID)
if err != nil {
logger.L().Error(ctx, "Error waiting for sandbox to pause",

Same issue: WaitForStateChange returns ctx.Err() on context cancellation, so client disconnects will be logged at Error level. Filter context.Canceled / context.DeadlineExceeded before logging.

Contributor Author

making 'em debug, i want to know they're happening

@matthewlouisbrockman marked this pull request as ready for review March 16, 2026 22:09
Contributor

@dobrac left a comment

you can use the telemetry.ReportError and telemetry.ReportCriticalError instead to also populate the traces correctly. It does logging automatically

@matthewlouisbrockman
Contributor Author

you can use the telemetry.ReportError and telemetry.ReportCriticalError instead to also populate the traces correctly. It does logging automatically

should we add him to logger.L().Error(ctx, "Error getting last snapshot", logger.WithSandboxID(sandboxID), zap.Error(err)) too to debug those guys? (i think it's a 30s timeout somewhere but can't find it yet)

@matthewlouisbrockman
Contributor Author

eh, can worry about getting the reason for the failed db guys later, pretty sure those are all just timing out on the 30 second callback from the redis route

@matthewlouisbrockman changed the title from "log reason for failing to connect, resume sandboxes on state change" to "observability: log reason for failing to connect, resume sandboxes on state change" Mar 17, 2026
@matthewlouisbrockman changed the title from "observability: log reason for failing to connect, resume sandboxes on state change" to "chore[api]: log reason for failing to connect, resume sandboxes on state change" Mar 17, 2026
@matthewlouisbrockman changed the title from "chore[api]: log reason for failing to connect, resume sandboxes on state change" to "chore(api): log reason for failing to connect, resume sandboxes on state change" Mar 17, 2026
@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

