fix(orch): keep the sandbox in the map even after it starts shutting down to handle ip reverse lookups#2116
fix(orch): keep the sandbox in the map even after it starts shutting down to handle ip reverse lookups#2116
Conversation
| // MarkRunning transitions a sandbox from starting to running and notifies OnInsert subscribers | ||
| func (m *Map) MarkRunning(sbx *Sandbox) { | ||
| sbx.started.Store(true) | ||
| sbx.status.Store(int32(Status`Running)) |
There was a problem hiding this comment.
Syntax error / compilation failure: Status followed by a backtick is an unterminated raw string literal in Go, so the package will not compile. This should be StatusRunning:
sbx.status.Store(int32(StatusRunning))| // notified immediately (so the proxy / firewall limiter can clean up), but the | ||
| // entry stays in the map for deadEvictionGracePeriod so that IP-based lookups | ||
| // still resolve while the Firecracker process finishes shutting down. | ||
| func (m *Map) MarkDead(ctx context.Context, sandboxID string) { |
There was a problem hiding this comment.
MarkDead is not idempotent, so OnRemove can fire twice in a concurrent scenario. The old Remove used RemoveCb(return exists) which was naturally idempotent: a second concurrent call would see exists=false and skip subscriber notification.
With MarkDead, the entry intentionally stays in the map for 30 s. If two goroutines both pass the Get/IsRunning() check before either calls MarkDead (a TOCTOU race on concurrent Delete RPCs for the same sandbox), both find the entry still present, both call sbx.status.Store(StatusDead), and both fire go m.trigger(OnRemove), notifying every subscriber twice.
Using CAS would prevent this:
if !sbx.status.CompareAndSwap(int32(StatusRunning), int32(StatusDead)) {
return
}…down to handle ip reverse lookups
c14e526 to
7562220
Compare
…down to handle ip reverse lookups
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Note
Medium Risk
Changes sandbox lifecycle state and eviction timing, which can affect routing/firewall lookups and cleanup correctness under concurrent deletes or slot reuse. Risk is moderated by CAS idempotency and guarded eviction, but timing-based behavior could still cause edge-case leaks or mis-association.
Overview
Keeps sandboxes discoverable by host IP for a short grace period after shutdown by introducing an explicit
SandboxStatus(starting/running/dead) and replacing immediate map removal withMarkDeadplus delayed eviction; live queries (Get,Items,Count) now ignore non-running entries whileGetByHostPortcan still resolve dead sandboxes. Also adds safeguards to prevent stale IP reuse and accidental eviction of replacement sandboxes (CAS state transition, pointer-checked eviction, and insert-time removal of older entries sharing the same IP).Written by Cursor Bugbot for commit f923bee. This will update automatically on new commits. Configure here.