Skip to content

Onboarding: Timeout on active instance eventually#271

Merged
fwiesel merged 1 commit intomainfrom
onboarding-timeout
Mar 18, 2026
Merged

Onboarding: Timeout on active instance eventually#271
fwiesel merged 1 commit intomainfrom
onboarding-timeout

Conversation

@fwiesel
Copy link
Contributor

@fwiesel fwiesel commented Mar 17, 2026

The code currently assumes, that an instance would report the expected string eventually, which assumes that the metadata service lookup works.

The active state only reflects that the hypervisor could boot the VM.

As the instance usually boots within seconds,
a five minute timeout seems save enough to
catch this eventuality.

Summary by CodeRabbit

  • Bug Fixes

    • Improved onboarding smoke-test timeout handling so stalled or unresponsive test servers are reliably detected, removed, and onboarding status is updated with a clear timeout state and message.
  • Tests

    • Added and extended deterministic time-based tests for onboarding to verify timed-out servers are cleaned up and status transitions/reporting occur as expected.

@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

Adds a Clock field to OnboardingController, uses clock-based timing in the smoke-test flow to detect timeouts, patches onboarding status on timeout, and deletes timed-out test servers; tests updated to inject and advance clocks; k8s.io/utils moved from indirect to direct in go.mod.

Changes

Cohort / File(s) Summary
Dependency Management
go.mod
Converted k8s.io/utils from indirect to direct requirement (removed // indirect) without changing the version.
Controller Implementation
internal/controller/onboarding_controller.go
Added public Clock clock.Clock field and smokeTestTimeout constant; smokeTest now uses r.Clock.Now() to detect timeouts, update onboarding condition/status, patch status, and delete timed-out test servers. Initialized Clock in SetupWithManager when nil. Added k8s.io/utils/clock import.
Controller Tests
internal/controller/onboarding_controller_test.go
Tests set controller Clock (RealClock/fake clock) and new clocktesting usage, expose server detail/action/delete handlers to simulate launched_at and deletion, advance fake clock to trigger timeout, and assert onboarding condition updates and server deletion.

Sequence Diagram

sequenceDiagram
    participant OC as OnboardingController
    participant Clock as Clock
    participant Server as TestServer
    participant Status as StatusPatcher

    OC->>Clock: r.Clock.Now()
    Clock-->>OC: current time

    OC->>Server: GET /servers/detail (read LaunchedAt)
    Server-->>OC: LaunchedAt timestamp

    OC->>OC: Compare LaunchedAt + timeout to current time

    alt Timeout detected
        OC->>Status: Set onboarding condition -> testing (timeout) & patch status
        Status-->>OC: status patched
        OC->>Server: DELETE /servers/{id}
        Server-->>OC: deletion confirmed
    else Not timed out
        OC->>Server: Continue normal smoke-test actions (POST action, etc.)
        Server-->>OC: responses
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I tick the clock, I watch the test,

When time runs out I do my best,
Mark the status, clean the rest,
A little hop, the suite at rest.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Onboarding: Timeout on active instance eventually' directly summarizes the main change: adding a timeout mechanism for active instances during the onboarding flow when metadata service lookup or instance reporting does not occur.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch onboarding-timeout
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@fwiesel fwiesel force-pushed the onboarding-timeout branch from c79bb87 to b27678f Compare March 18, 2026 09:45
@fwiesel fwiesel marked this pull request as ready for review March 18, 2026 09:45
@fwiesel fwiesel force-pushed the onboarding-timeout branch from b27678f to 28a3038 Compare March 18, 2026 10:55
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
internal/controller/onboarding_controller_test.go (1)

168-170: Drive the timeout through the existing-server branch.

Right now the spec gets ACTIVE plus launched_at from POST /servers, so the timeout can fire on the first reconcile. That skips the more representative path where createOrGetTestServer() returns an already-running server on a later reconcile. Seeding the stale server via GET /servers/detail and reconciling twice would keep this test closer to the controller flow the PR is protecting.

Also applies to: 607-625

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/onboarding_controller_test.go` around lines 168 - 170,
The test currently seeds the server as already ACTIVE with a launched_at in the
POST /servers response which lets the timeout fire on the first reconcile;
change the test to instead seed the stale server via the GET /servers/detail
response so that createOrGetTestServer() will take the path of "already-running
server" on a subsequent reconcile—update the mocked responses in
onboarding_controller_test.go to return a non-ACTIVE result from POST /servers
(or delay launched_at) and supply the ACTIVE + launched_at payload on the GET
/servers/detail mock, then run the controller reconcile twice in the test to
exercise the intended branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/onboarding_controller.go`:
- Around line 303-305: The delete in the timed-out cleanup (servers.Delete(ctx,
r.testComputeClient, server.ID).ExtractErr()) should tolerate concurrent "not
found" (404) races like the other delete paths; change the error handling so
that after calling servers.Delete(...).ExtractErr() you only return an error if
it's not a 404 (use the same not-found check used elsewhere in this file),
otherwise treat the delete as successful and continue. Ensure you reference the
same ctx, r.testComputeClient, and server.ID values and mirror the 404-tolerant
pattern used in the other delete sites.
- Line 292: SetupWithManager currently unconditionally sets r.Clock =
clock.RealClock{} which overwrites any test-injected clock; modify
SetupWithManager to only assign the fallback RealClock when r.Clock is nil
(i.e., wrap the existing assignment in if r.Clock == nil { r.Clock =
clock.RealClock{} }) so that caller-injected clocks on the OnboardingController
are preserved while still providing a default in production.

---

Nitpick comments:
In `@internal/controller/onboarding_controller_test.go`:
- Around line 168-170: The test currently seeds the server as already ACTIVE
with a launched_at in the POST /servers response which lets the timeout fire on
the first reconcile; change the test to instead seed the stale server via the
GET /servers/detail response so that createOrGetTestServer() will take the path
of "already-running server" on a subsequent reconcile—update the mocked
responses in onboarding_controller_test.go to return a non-ACTIVE result from
POST /servers (or delay launched_at) and supply the ACTIVE + launched_at payload
on the GET /servers/detail mock, then run the controller reconcile twice in the
test to exercise the intended branch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: db638a1b-bf7f-4c78-9010-61a52cb37865

📥 Commits

Reviewing files that changed from the base of the PR and between b27678f and 28a3038.

📒 Files selected for processing (3)
  • go.mod
  • internal/controller/onboarding_controller.go
  • internal/controller/onboarding_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • go.mod

The code currently assumes, that an instance would
report the expected string eventually, which assumes
that the metadata service lookup works.

The active state only reflects that the hypervisor
could boot the VM.

As the instance usually boots within seconds,
a five minute timeout seems save enough to
catch this eventuality.
@fwiesel fwiesel force-pushed the onboarding-timeout branch from 28a3038 to 69bcc0c Compare March 18, 2026 13:25
@fwiesel
Copy link
Contributor Author

fwiesel commented Mar 18, 2026

In production, the VMs are active, and the log message says:

BdsDxe: failed to load Boot0002 "UEFI Misc Device" from PciRoot(0x0)/Pci(0x3,0x0): Not Found

>>Start PXE over IPv4.
  PXE-E16: No valid offer received.
BdsDxe: failed to load Boot0003 "UEFI PXEv4 (MAC:FA163EC9E29F)" from PciRoot(0x0)/Pci(0x2,0x0)/MAC(FA163EC9E29F,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0): Not Found
BdsDxe: No bootable option or device was found.
BdsDxe: Press any key to enter the Boot Manager Menu.

This change should at least make the operator retry.

@github-actions
Copy link

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller 63.27% (+0.40%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller/onboarding_controller.go 55.28% (+1.89%) 246 (+10) 136 (+10) 110 👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller/onboarding_controller_test.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/controller/onboarding_controller_test.go (1)

629-631: Add an exact-deadline case to harden timeout behavior.

This test jumps to +6m; adding a +5m exactly case would catch regressions between exclusive/inclusive timeout checks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/onboarding_controller_test.go` around lines 629 - 631,
Add a test case that sets onboardingReconciler.Clock to exactly 5 minutes after
the recorded launched_at timestamp (use clocktesting.NewFakeClock with
time.Date(...) at +5m) and assert the same timeout behavior expectations (e.g.,
whether serverDeletedCalled is true/false) to verify inclusive vs exclusive
checks; update the test around the existing +6m setup (the code that sets
onboardingReconciler.Clock and serverDeletedCalled) to include this exact +5m
scenario and ensure the assertions reference the same launched_at-based logic
used by the reconciler.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/onboarding_controller.go`:
- Line 292: The timeout check at the line using
r.Clock.Now().After(server.LaunchedAt.Add(smokeTestTimeout)) is exclusive and
should be inclusive; change the condition to test inclusive equality by
replacing the After(...) check with a non-Before check (e.g.,
!r.Clock.Now().Before(server.LaunchedAt.Add(smokeTestTimeout))) or explicitly
include r.Clock.Now().Equal(...). Update the condition that references
server.LaunchedAt, r.Clock.Now(), and smokeTestTimeout in the reconcile logic so
the timeout fires exactly at launched_at + smokeTestTimeout rather than only
after it.

---

Nitpick comments:
In `@internal/controller/onboarding_controller_test.go`:
- Around line 629-631: Add a test case that sets onboardingReconciler.Clock to
exactly 5 minutes after the recorded launched_at timestamp (use
clocktesting.NewFakeClock with time.Date(...) at +5m) and assert the same
timeout behavior expectations (e.g., whether serverDeletedCalled is true/false)
to verify inclusive vs exclusive checks; update the test around the existing +6m
setup (the code that sets onboardingReconciler.Clock and serverDeletedCalled) to
include this exact +5m scenario and ensure the assertions reference the same
launched_at-based logic used by the reconciler.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aad422d1-fe77-49e6-aaba-1782ac1e0824

📥 Commits

Reviewing files that changed from the base of the PR and between 28a3038 and 69bcc0c.

📒 Files selected for processing (3)
  • go.mod
  • internal/controller/onboarding_controller.go
  • internal/controller/onboarding_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • go.mod

}

if !strings.Contains(consoleOutput, server.Name) {
if !server.LaunchedAt.IsZero() && r.Clock.Now().After(server.LaunchedAt.Add(smokeTestTimeout)) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Timeout threshold is exclusive at Line 292.

Using After(...) means the timeout does not fire exactly at launched_at + 5m; it fires on the next reconcile. Make the check inclusive.

⏱️ Proposed fix
-			if !server.LaunchedAt.IsZero() && r.Clock.Now().After(server.LaunchedAt.Add(smokeTestTimeout)) {
+			deadline := server.LaunchedAt.Add(smokeTestTimeout)
+			if !server.LaunchedAt.IsZero() && !r.Clock.Now().Before(deadline) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if !server.LaunchedAt.IsZero() && r.Clock.Now().After(server.LaunchedAt.Add(smokeTestTimeout)) {
deadline := server.LaunchedAt.Add(smokeTestTimeout)
if !server.LaunchedAt.IsZero() && !r.Clock.Now().Before(deadline) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/onboarding_controller.go` at line 292, The timeout check
at the line using r.Clock.Now().After(server.LaunchedAt.Add(smokeTestTimeout))
is exclusive and should be inclusive; change the condition to test inclusive
equality by replacing the After(...) check with a non-Before check (e.g.,
!r.Clock.Now().Before(server.LaunchedAt.Add(smokeTestTimeout))) or explicitly
include r.Clock.Now().Equal(...). Update the condition that references
server.LaunchedAt, r.Clock.Now(), and smokeTestTimeout in the reconcile logic so
the timeout fires exactly at launched_at + smokeTestTimeout rather than only
after it.

@fwiesel fwiesel merged commit 0c09786 into main Mar 18, 2026
7 checks passed
@fwiesel fwiesel deleted the onboarding-timeout branch March 18, 2026 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants