controller: implement cordon and drain logic #53

Merged
jlebon merged 9 commits into main from milestone-3b-phase2
Jun 8, 2026
@jlebon jlebon commented Jun 4, 2026

Collaborator

More milestone 3b bits. See individual commit messages.

jlebon added 2 commits June 3, 2026 22:15
The drain.Helper from k8s.io/kubectl/pkg/drain requires a
kubernetes.Interface clientset rather than controller-runtime's
client.Client. Add a KubeClient field to the reconciler and wire it
up in cmd/controller/main.go and suite_test.go.

Assisted-by: Pi (Claude Opus 4.6)
The drain.Helper needs to list pods on a node, evict them, and look
up DaemonSets (for IgnoreAllDaemonSets). Add RBAC markers for pods
(get/list), pods/eviction (create), and apps/daemonsets (get).
Regenerate the ClusterRole manifest.

Assisted-by: Pi (Claude Opus 4.6)
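
For readers unfamiliar with RBAC markers, the permissions listed in the commit above could be expressed roughly as follows. This is a sketch in the kubebuilder marker style, not the exact lines from the commit; the package name is an assumption.

```go
// Package name is hypothetical; markers normally sit above the Reconcile method.
package controller

// The empty group ("") is the core API group, where pods and the
// pods/eviction subresource live. DaemonSets are in the apps group.

// +kubebuilder:rbac:groups="",resources=pods,verbs=get;list
// +kubebuilder:rbac:groups="",resources=pods/eviction,verbs=create
// +kubebuilder:rbac:groups=apps,resources=daemonsets,verbs=get
```

controller-gen reads markers of this shape when regenerating the ClusterRole manifest.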

jlebon commented Jun 4, 2026

Collaborator Author

Still want to test this manually as well locally, but I think it's ready for review.

Tested this more now and yeah I think it's ready to go!

@jlebon jlebon force-pushed the milestone-3b-phase2 branch from 62a1d04 to d787c9c Compare June 6, 2026 02:34
jlebon added 7 commits June 5, 2026 23:00
For each candidate selected by selectCandidates, fetch the K8s Node
and call assignRebootSlot. This function sets the in-reboot-slot and
was-cordoned annotations on the BootcNode and cordons the K8s Node.
All operations are idempotent so controller restarts re-enter the
flow cleanly.

Drains are not started yet; that is the next commit.

While we're here, also add the matching `freeRebootSlot` function, even
though it's not used yet.

Assisted-by: Pi (Claude Opus 4.6)
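
To illustrate what "idempotent" means for the slot assignment described above, here is a minimal sketch using simplified stand-in types rather than the real BootcNode and Node objects. The annotation keys are hypothetical, and the meaning given to was-cordoned (remembering whether the node was already cordoned before we touched it) is an assumption, not something this PR description spells out.

```go
package main

// Hypothetical annotation keys; the real ones are not shown in this PR.
const (
	annInRebootSlot = "example.invalid/in-reboot-slot"
	annWasCordoned  = "example.invalid/was-cordoned"
)

// bootcNode and node are simplified stand-ins for the BootcNode custom
// resource and the core Node object.
type bootcNode struct {
	Annotations map[string]string
}

type node struct {
	Unschedulable bool
}

// assignRebootSlot gives bn a reboot slot and cordons n. It reports whether
// anything changed, so calling it again after a restart is a no-op.
func assignRebootSlot(bn *bootcNode, n *node) bool {
	changed := false
	if bn.Annotations == nil {
		bn.Annotations = map[string]string{}
	}
	if _, held := bn.Annotations[annInRebootSlot]; !held {
		// Record the prior cordon state only on first entry, so a re-run
		// after we cordoned the node ourselves does not overwrite it.
		if n.Unschedulable {
			bn.Annotations[annWasCordoned] = "true"
		}
		bn.Annotations[annInRebootSlot] = "true"
		changed = true
	}
	if !n.Unschedulable {
		n.Unschedulable = true
		changed = true
	}
	return changed
}

// freeRebootSlot releases the slot and uncordons n unless it was already
// cordoned before the slot was assigned. Without a slot it does nothing.
func freeRebootSlot(bn *bootcNode, n *node) {
	if _, held := bn.Annotations[annInRebootSlot]; !held {
		return
	}
	if _, wasCordoned := bn.Annotations[annWasCordoned]; !wasCordoned {
		n.Unschedulable = false
	}
	delete(bn.Annotations, annWasCordoned)
	delete(bn.Annotations, annInRebootSlot)
}
```

The key design point is that the prior cordon state is captured before the slot annotation is written, so an admin-cordoned node stays cordoned after the slot is freed.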
drain.Helper uses io.Writer (Out and ErrOut) for its output rather
than a structured logger. This adapter bridges the two so that drain
progress and errors flow through our logr pipeline with appropriate
levels instead of being lost or going to stdout/stderr.

Assisted-by: Pi (Claude Opus 4.6)
After assigning reboot slots to candidates, iterate all Staged nodes
with the in-reboot-slot annotation and start drain goroutines for
them via ensureDrain.

Drain completion is not handled yet; that is the next commit.

Assisted-by: Pi (Claude Opus 4.6)
The Eviction API has been stable for some time now. This is something
that wasn't available at the time the MCO was started for example.

I think we should strongly consider using that directly instead of
pulling in kubectl, which is a huge dependency. There are some gotchas
there we could lift from the kubectl code, but nothing insurmountable.

Anyway, just write this down as a future enhancement for now.
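
For context on what using the Eviction API directly would involve, the sketch below builds the request path and JSON body for an API-initiated eviction (a POST of a `policy/v1` Eviction object to the pod's `eviction` subresource). The function and type names are hypothetical, and this is not a proposal for how the controller would actually issue the call.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// objectMeta and eviction mirror only the fields needed for the request body.
type objectMeta struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

type eviction struct {
	APIVersion string     `json:"apiVersion"`
	Kind       string     `json:"kind"`
	Metadata   objectMeta `json:"metadata"`
}

// evictionRequest returns the API path to POST to and the JSON body for
// evicting the given pod.
func evictionRequest(namespace, pod string) (string, []byte, error) {
	path := fmt.Sprintf("/api/v1/namespaces/%s/pods/%s/eviction", namespace, pod)
	body, err := json.Marshal(eviction{
		APIVersion: "policy/v1",
		Kind:       "Eviction",
		Metadata:   objectMeta{Name: pod, Namespace: namespace},
	})
	return path, body, err
}
```

One of the gotchas a direct implementation would need to handle is retrying when the API server answers 429 because a PodDisruptionBudget does not currently allow the eviction.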
Add collectDrainResults, called at the start of driveRollout before
assigning new reboot slots. It does a non-blocking check on each
drain's result channel. On success, it sets desiredImageState to
Booted on the BootcNode so the daemon knows to apply the staged image
and reboot. Drain errors and cancellations are logged but otherwise
deferred.

Assisted-by: Pi (Claude Opus 4.6)
Simulates a 3-node rollout where all nodes have staged the target image.
With maxUnavailable: 1, verifies that only one node gets a reboot slot
(cordoned, annotated) at a time.

This is pretty simple for now because the rollout logic is only
partially implemented. As we add more of the missing pieces, we can keep
extending this test to be more complete.

Assisted-by: Pi (Claude Opus 4.6)
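
The invariant this test checks (no more than maxUnavailable nodes holding a reboot slot) can be sketched as a pure function. This is not the actual selectCandidates logic, which is not shown here; the type, the function name, and the alphabetical tie-break are all assumptions.

```go
package main

import "sort"

// nodeState is a simplified, hypothetical view of one node in the rollout.
type nodeState struct {
	Name         string
	Staged       bool
	InRebootSlot bool
}

// pickForSlots returns the staged nodes that may be given a reboot slot
// without exceeding maxUnavailable, counting slots already held.
func pickForSlots(nodes []nodeState, maxUnavailable int) []string {
	inSlot := 0
	var ready []string
	for _, n := range nodes {
		switch {
		case n.InRebootSlot:
			inSlot++
		case n.Staged:
			ready = append(ready, n.Name)
		}
	}
	free := maxUnavailable - inSlot
	if free <= 0 {
		return nil
	}
	// Sort for a deterministic choice; the real ordering may differ.
	sort.Strings(ready)
	if len(ready) > free {
		ready = ready[:free]
	}
	return ready
}
```

With three staged nodes and maxUnavailable set to 1, the first call picks one node and later calls pick none until that slot is freed.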
This is subtle but by having the tag be `:latest`, K8s will use a
default of `Always` for the image pull policy:

https://kubernetes.io/docs/concepts/containers/images/#imagepullpolicy-defaulting

(We don't specify an explicit `imagePullPolicy` in our configs. We may
do that in the future, though there are trade-offs there to think about. I
think it depends a lot on the packaging/productization layers once we
get there. But I think the K8s defaults mostly make sense.)
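
To make the defaulting rule concrete, here is a small sketch of the behavior described in the linked docs: `:latest` or no tag defaults to `Always`, while any other tag or a digest defaults to `IfNotPresent`. The function is hypothetical and uses simplified parsing; how it treats references that carry both a tag and a digest is an assumption.

```go
package main

import "strings"

// defaultPullPolicy approximates the policy Kubernetes applies when
// imagePullPolicy is omitted from a container spec.
func defaultPullPolicy(image string) string {
	digested := false
	if i := strings.Index(image, "@"); i >= 0 {
		digested = true
		image = image[:i]
	}
	// Only look for a tag after the last "/", so a registry port such as
	// "localhost:5000/app" is not mistaken for a tag.
	name := image
	if i := strings.LastIndex(image, "/"); i >= 0 {
		name = image[i+1:]
	}
	tag := ""
	if i := strings.LastIndex(name, ":"); i >= 0 {
		tag = name[i+1:]
	}
	if tag == "" && !digested {
		tag = "latest" // an untagged reference is treated like :latest
	}
	if tag == "latest" {
		return "Always"
	}
	return "IfNotPresent"
}
```

For example, `defaultPullPolicy("quay.io/example/app:latest")` yields `Always`, while a pinned tag such as `:v1.2` yields `IfNotPresent`.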
@jlebon jlebon force-pushed the milestone-3b-phase2 branch from d787c9c to 1dc31e8 Compare June 6, 2026 03:05
@jlebon jlebon merged commit dc56644 into main Jun 8, 2026
8 of 9 checks passed