Onboarding: Timeout on active instance eventually#271
Conversation
📝 WalkthroughWalkthroughAdds a Clock field to OnboardingController, uses clock-based timing in the smoke-test flow to detect timeouts, patches onboarding status on timeout, and deletes timed-out test servers; tests updated to inject and advance clocks; Changes
Sequence DiagramsequenceDiagram
participant OC as OnboardingController
participant Clock as Clock
participant Server as TestServer
participant Status as StatusPatcher
OC->>Clock: r.Clock.Now()
Clock-->>OC: current time
OC->>Server: GET /servers/detail (read LaunchedAt)
Server-->>OC: LaunchedAt timestamp
OC->>OC: Compare LaunchedAt + timeout to current time
alt Timeout detected
OC->>Status: Set onboarding condition -> testing (timeout) & patch status
Status-->>OC: status patched
OC->>Server: DELETE /servers/{id}
Server-->>OC: deletion confirmed
else Not timed out
OC->>Server: Continue normal smoke-test actions (POST action, etc.)
Server-->>OC: responses
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
c79bb87 to
b27678f
Compare
b27678f to
28a3038
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
internal/controller/onboarding_controller_test.go (1)
168-170: Drive the timeout through the existing-server branch.Right now the spec gets
ACTIVEpluslaunched_atfromPOST /servers, so the timeout can fire on the first reconcile. That skips the more representative path wherecreateOrGetTestServer()returns an already-running server on a later reconcile. Seeding the stale server viaGET /servers/detailand reconciling twice would keep this test closer to the controller flow the PR is protecting.Also applies to: 607-625
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/onboarding_controller_test.go` around lines 168 - 170, The test currently seeds the server as already ACTIVE with a launched_at in the POST /servers response which lets the timeout fire on the first reconcile; change the test to instead seed the stale server via the GET /servers/detail response so that createOrGetTestServer() will take the path of "already-running server" on a subsequent reconcile—update the mocked responses in onboarding_controller_test.go to return a non-ACTIVE result from POST /servers (or delay launched_at) and supply the ACTIVE + launched_at payload on the GET /servers/detail mock, then run the controller reconcile twice in the test to exercise the intended branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/onboarding_controller.go`:
- Around line 303-305: The delete in the timed-out cleanup (servers.Delete(ctx,
r.testComputeClient, server.ID).ExtractErr()) should tolerate concurrent "not
found" (404) races like the other delete paths; change the error handling so
that after calling servers.Delete(...).ExtractErr() you only return an error if
it's not a 404 (use the same not-found check used elsewhere in this file),
otherwise treat the delete as successful and continue. Ensure you reference the
same ctx, r.testComputeClient, and server.ID values and mirror the 404-tolerant
pattern used in the other delete sites.
- Line 292: SetupWithManager currently unconditionally sets r.Clock =
clock.RealClock{} which overwrites any test-injected clock; modify
SetupWithManager to only assign the fallback RealClock when r.Clock is nil
(i.e., wrap the existing assignment in if r.Clock == nil { r.Clock =
clock.RealClock{} }) so that caller-injected clocks on the OnboardingController
are preserved while still providing a default in production.
---
Nitpick comments:
In `@internal/controller/onboarding_controller_test.go`:
- Around line 168-170: The test currently seeds the server as already ACTIVE
with a launched_at in the POST /servers response which lets the timeout fire on
the first reconcile; change the test to instead seed the stale server via the
GET /servers/detail response so that createOrGetTestServer() will take the path
of "already-running server" on a subsequent reconcile—update the mocked
responses in onboarding_controller_test.go to return a non-ACTIVE result from
POST /servers (or delay launched_at) and supply the ACTIVE + launched_at payload
on the GET /servers/detail mock, then run the controller reconcile twice in the
test to exercise the intended branch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: db638a1b-bf7f-4c78-9010-61a52cb37865
📒 Files selected for processing (3)
go.modinternal/controller/onboarding_controller.gointernal/controller/onboarding_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- go.mod
The code currently assumes, that an instance would report the expected string eventually, which assumes that the metadata service lookup works. The active state only reflects that the hypervisor could boot the VM. As the instance usually boots within seconds, a five minute timeout seems save enough to catch this eventuality.
28a3038 to
69bcc0c
Compare
|
In production, the VMs are active, and the log message says: This change should at least make the operator retry. |
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/controller/onboarding_controller_test.go (1)
629-631: Add an exact-deadline case to harden timeout behavior.This test jumps to
+6m; adding a+5m exactlycase would catch regressions between exclusive/inclusive timeout checks.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/onboarding_controller_test.go` around lines 629 - 631, Add a test case that sets onboardingReconciler.Clock to exactly 5 minutes after the recorded launched_at timestamp (use clocktesting.NewFakeClock with time.Date(...) at +5m) and assert the same timeout behavior expectations (e.g., whether serverDeletedCalled is true/false) to verify inclusive vs exclusive checks; update the test around the existing +6m setup (the code that sets onboardingReconciler.Clock and serverDeletedCalled) to include this exact +5m scenario and ensure the assertions reference the same launched_at-based logic used by the reconciler.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/onboarding_controller.go`:
- Line 292: The timeout check at the line using
r.Clock.Now().After(server.LaunchedAt.Add(smokeTestTimeout)) is exclusive and
should be inclusive; change the condition to test inclusive equality by
replacing the After(...) check with a non-Before check (e.g.,
!r.Clock.Now().Before(server.LaunchedAt.Add(smokeTestTimeout))) or explicitly
include r.Clock.Now().Equal(...). Update the condition that references
server.LaunchedAt, r.Clock.Now(), and smokeTestTimeout in the reconcile logic so
the timeout fires exactly at launched_at + smokeTestTimeout rather than only
after it.
---
Nitpick comments:
In `@internal/controller/onboarding_controller_test.go`:
- Around line 629-631: Add a test case that sets onboardingReconciler.Clock to
exactly 5 minutes after the recorded launched_at timestamp (use
clocktesting.NewFakeClock with time.Date(...) at +5m) and assert the same
timeout behavior expectations (e.g., whether serverDeletedCalled is true/false)
to verify inclusive vs exclusive checks; update the test around the existing +6m
setup (the code that sets onboardingReconciler.Clock and serverDeletedCalled) to
include this exact +5m scenario and ensure the assertions reference the same
launched_at-based logic used by the reconciler.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aad422d1-fe77-49e6-aaba-1782ac1e0824
📒 Files selected for processing (3)
go.modinternal/controller/onboarding_controller.gointernal/controller/onboarding_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- go.mod
| } | ||
|
|
||
| if !strings.Contains(consoleOutput, server.Name) { | ||
| if !server.LaunchedAt.IsZero() && r.Clock.Now().After(server.LaunchedAt.Add(smokeTestTimeout)) { |
There was a problem hiding this comment.
Timeout threshold is exclusive at Line 292.
Using After(...) means the timeout does not fire exactly at launched_at + 5m; it fires on the next reconcile. Make the check inclusive.
⏱️ Proposed fix
- if !server.LaunchedAt.IsZero() && r.Clock.Now().After(server.LaunchedAt.Add(smokeTestTimeout)) {
+ deadline := server.LaunchedAt.Add(smokeTestTimeout)
+ if !server.LaunchedAt.IsZero() && !r.Clock.Now().Before(deadline) {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if !server.LaunchedAt.IsZero() && r.Clock.Now().After(server.LaunchedAt.Add(smokeTestTimeout)) { | |
| deadline := server.LaunchedAt.Add(smokeTestTimeout) | |
| if !server.LaunchedAt.IsZero() && !r.Clock.Now().Before(deadline) { |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/controller/onboarding_controller.go` at line 292, The timeout check
at the line using r.Clock.Now().After(server.LaunchedAt.Add(smokeTestTimeout))
is exclusive and should be inclusive; change the condition to test inclusive
equality by replacing the After(...) check with a non-Before check (e.g.,
!r.Clock.Now().Before(server.LaunchedAt.Add(smokeTestTimeout))) or explicitly
include r.Clock.Now().Equal(...). Update the condition that references
server.LaunchedAt, r.Clock.Now(), and smokeTestTimeout in the reconcile logic so
the timeout fires exactly at launched_at + smokeTestTimeout rather than only
after it.
The code currently assumes, that an instance would report the expected string eventually, which assumes that the metadata service lookup works.
The active state only reflects that the hypervisor could boot the VM.
As the instance usually boots within seconds,
a five minute timeout seems save enough to
catch this eventuality.
Summary by CodeRabbit
Bug Fixes
Tests