feat(anc): gate check-hotfix on enable_provisioning_hotfix contract field #8717

Open
Devinwong wants to merge 1 commit into devinwong/anc-wire-check-hotfix-wrapper from devinwong/anc-hotfix-env-delivery

@Devinwong Devinwong commented Jun 16, 2026

Collaborator

2.1d - gate check-hotfix on the enable_provisioning_hotfix contract field

POC / M1 draft. AgentBaker / Node SIG side only.

This is the final layer of the provisioning-hotfix stack. It makes the AKSNodeConfig
contract field the single source of truth for whether aks-node-controller check-hotfix
does any work, and relaxes the env gate added in 2.1c.

What changed

  • Proto contract field: add bool enable_provisioning_hotfix = 45; to
    aksnodeconfig/v1/config.proto (next free tag after cse_timeout = 44) and regenerate
    the Go bindings.
  • Go gate: check-hotfix reads the field at the very top of checkHotfix() via
    App.provisioningHotfixEnabled() (reads the node-config JSON that is already on disk and
    calls GetEnableProvisioningHotfix()). When the field is not true (false, unset, or the
    config cannot be read/parsed) it returns the new telemetry outcome disabled and exits 0
    WITHOUT contacting the apiserver. Fail-open everywhere.
  • Wrapper relaxation: aks-node-controller-wrapper.sh now calls check-hotfix
    UNCONDITIONALLY (still wrapped defensively so it can never block provisioning). The
    Go binary self-gates on the contract field.
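
The fail-open read described in the Go gate bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the real gate uses the generated AKSNodeConfig bindings and GetEnableProvisioningHotfix(), whereas the nodeConfig struct, the snake_case JSON key, and the enabledFromJSON helper below are assumptions made so the example is self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// nodeConfig is a hypothetical stand-in for the generated AKSNodeConfig type.
// Only the field relevant to the gate is modeled; the JSON key is an assumption.
type nodeConfig struct {
	EnableProvisioningHotfix bool `json:"enable_provisioning_hotfix"`
}

// enabledFromJSON returns true only when the payload parses and the field is
// explicitly true. A parse error, a missing field, or false all mean "off".
func enabledFromJSON(data []byte) bool {
	var cfg nodeConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return false
	}
	return cfg.EnableProvisioningHotfix
}

// provisioningHotfixEnabled reads the config from disk and applies the same
// rule. A read error is treated as "off" so provisioning is never blocked.
func provisioningHotfixEnabled(configPath string) bool {
	data, err := os.ReadFile(configPath)
	if err != nil {
		return false
	}
	return enabledFromJSON(data)
}

func main() {
	samples := []struct {
		name string
		raw  string
	}{
		{"field true", `{"enable_provisioning_hotfix": true}`},
		{"field false", `{"enable_provisioning_hotfix": false}`},
		{"field unset", `{}`},
		{"unparseable", `not json`},
	}
	for _, s := range samples {
		fmt.Printf("%-12s -> enabled=%v\n", s.name, enabledFromJSON([]byte(s.raw)))
	}
	// The path below is a placeholder that is expected not to exist.
	fmt.Printf("%-12s -> enabled=%v\n", "missing file",
		provisioningHotfixEnabled("/nonexistent/aks-node-config.json"))
}
```

Splitting the parse step from the file read keeps every error path collapsing to the same "off" answer, which is the property the gate relies on.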

Supersedes the env-delivery approach

An earlier revision of this PR delivered the toggle as an env var via a cse_cmd.sh
template var plus a systemd drop-in (Environment="ENABLE_PROVISIONING_HOTFIX=...") on
aks-node-controller.service, mirroring the IMDS-restriction pattern. That approach was
dropped because:

  • check-hotfix already parses the AKSNodeConfig (it reads the apiserver FQDN and bootstrap
    token from it), so a real contract field is available to the binary with zero new plumbing -
    no template var, no drop-in, no env var.
  • In the self-provisioning path the wrapper and the drop-in writer are the same service, so an
    env/drop-in written during provisioning would only take effect on the NEXT boot. Reading the
    contract field directly avoids that activation-timing problem - it works on the same boot
    because the config JSON is on disk before the service starts.

This also means absvc sets ONE field (the contract bool), not an env var plus a field.

Relaxes the 2.1c env gate

This PR relaxes the ENABLE_PROVISIONING_HOTFIX env gate introduced in #8715 (2.1c); gating
now lives in the Go binary via the enable_provisioning_hotfix contract field, which is the
single source of truth. The 2.1c env gate is intentionally added-then-relaxed across the
stack so each PR stays reviewable on its own.

Default-off and fail-open

When enable_provisioning_hotfix is false or unset, behavior is byte-identical to before this
stack: check-hotfix makes no apiserver call and provisioning proceeds unchanged. Any read or
parse error is treated as off. This preserves the 6-month VHD support window in both directions
(older VHD + newer config, and newer VHD + older binary are both safe).

Before / after

  • field false or unset -> check-hotfix returns outcome=disabled, no apiserver call, exit 0
  • field true -> check-hotfix reads the kube-system hotfix-version ConfigMap and stages the
    pointer (existing 2.1b behavior)

Stack

main
 \- #8694  2.1a  base->version hotfix map (Go)
     \- #8696  2.1b  check-hotfix ConfigMap reader (Go)
         \- #8715  2.1c  wire check-hotfix into wrapper (shell)
             \- #8717  2.1d  enable_provisioning_hotfix contract field + Go self-gate   <- this PR

The aks-rp region toggle that sets the field is in a different repo and is the only remaining
out-of-repo piece. With the field settable on a node, the on-node PoC e2e tests (fail-open and
multi-base) become runnable.

Tests

  • go test ./... in aks-node-controller: all check-hotfix tests pass, including new gate
    tests (disabled -> outcome=disabled and the injected fetcher is never called; enabled ->
    fetch path runs). Pre-existing Windows-only failures (CRLF goldens, file locks, os-release
    message text) are unrelated and also fail on the base branch.
  • Wrapper shellspec updated for unconditional check-hotfix: 7 examples, 0 failures.
  • shellcheck clean on the wrapper.

github-actions Bot commented Jun 16, 2026

Contributor

The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).

Build     | Format        | Lint      | Breaking  | Updated (UTC)
✅ passed | ❌ failed (1) | ✅ passed | ✅ passed | Jun 16, 2026, 6:17 PM

@Devinwong force-pushed the devinwong/anc-hotfix-env-delivery branch from 2d4b37d to 59b7cef on June 16, 2026 01:33
@Devinwong changed the title from "feat(anc): deliver ENABLE_PROVISIONING_HOTFIX to node-controller via contract field + systemd drop-in" to "feat(anc): gate check-hotfix on enable_provisioning_hotfix contract field" on Jun 16, 2026
@Devinwong force-pushed the devinwong/anc-hotfix-env-delivery branch from 59b7cef to d80ae7d on June 16, 2026 02:21
@Devinwong

Collaborator Author

Acknowledged - no action needed. This is the automated Buf CI status, and it reports Build, Format, Lint, and Breaking all passing for the additive optional field enable_provisioning_hotfix = 45. That matches the local buf verification (lint STANDARD clean, WIRE_JSON breaking clean against the 2.1c base, format clean), so the proto change is confirmed compatible and no change is warranted.

…ield

Replaces the env-delivery approach (systemd drop-in + cse_cmd.sh) with a single
contract field. check-hotfix self-gates on the new AKSNodeConfig field
enable_provisioning_hotfix (proto tag 45, optional bool); when it is not true the
command no-ops with telemetry outcome=disabled and makes no apiserver call.
Default-off, fail-open.

Relaxes the ENABLE_PROVISIONING_HOTFIX env gate introduced in 2.1c so the wrapper
calls check-hotfix unconditionally; gating now lives in the Go binary via the
contract field as the single source of truth.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Devinwong force-pushed the devinwong/anc-hotfix-env-delivery branch from 6854cfa to 297282e on June 16, 2026 18:16
@github-actions

Contributor

Changes cached containers or packages on Windows VHDs

Please get a Windows SIG member to approve.

The following diff shows any additions or deletions from what will be cached on Windows VHDs, organised by VHD type.

  • Additions are new things cached.
  • Deletions are things no longer cached.
diff --git a/vhd_files/2022-containerd-gen2.txt b/vhd_files/2022-containerd-gen2.txt
index 7039bac..c51a47f 100644
--- a/vhd_files/2022-containerd-gen2.txt
+++ b/vhd_files/2022-containerd-gen2.txt
@@ -122,0 +123 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -124 +124,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
diff --git a/vhd_files/2022-containerd.txt b/vhd_files/2022-containerd.txt
index 5915cf1..7312c49 100644
--- a/vhd_files/2022-containerd.txt
+++ b/vhd_files/2022-containerd.txt
@@ -122,0 +123 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -124 +124,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
diff --git a/vhd_files/2025-gen2.txt b/vhd_files/2025-gen2.txt
index 37d9326..36e3641 100644
--- a/vhd_files/2025-gen2.txt
+++ b/vhd_files/2025-gen2.txt
@@ -52,0 +53 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -54 +54,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
diff --git a/vhd_files/2025.txt b/vhd_files/2025.txt
index 5b08280..b8873d5 100644
--- a/vhd_files/2025.txt
+++ b/vhd_files/2025.txt
@@ -52,0 +53 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -54 +54,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
