
fix: network isolated cluster incorrect credential provider config for sovereign clouds #8709

Open
fseldow wants to merge 2 commits into main from xinhl/mcrsov


@fseldow

@fseldow fseldow commented Jun 15, 2026

Contributor

The credential provider config for network-isolated clusters is incorrect in sovereign clouds: it hardcodes mcr.microsoft.com for all clouds.

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Copilot AI left a comment

Contributor


Pull request overview

This PR updates the Linux CSE credential provider config generation for network-isolated clusters so the MCR registry domain is not hardcoded to mcr.microsoft.com, enabling correct behavior in sovereign/custom cloud environments.

Changes:

  • Use MCR_REPOSITORY_BASE in the credential provider matchImages list for network-isolated clusters.
  • Use MCR_REPOSITORY_BASE in the --registry-mirror= argument mapping for network-isolated clusters.
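The two changes listed above can be illustrated with a minimal sketch. This is not the code from cse_config.sh: the function names (`mcr_host`, `render_match_images`, `render_registry_mirror_arg`), the fallback to mcr.microsoft.com when `MCR_REPOSITORY_BASE` is unset, the trailing-slash handling, and the `source:target` shape of the mirror value are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; none of these function names come from cse_config.sh.

# Assumption: MCR_REPOSITORY_BASE carries the cloud-specific MCR host and may be
# unset in public cloud, in which case we fall back to mcr.microsoft.com.
mcr_host() {
  local base="${MCR_REPOSITORY_BASE:-mcr.microsoft.com}"
  # Strip a single trailing slash so the host can be used as a plain domain.
  echo "${base%/}"
}

# Emit the matchImages entries for the credential provider config.
# The MCR entry follows the cloud-specific host instead of a hardcoded domain.
render_match_images() {
  printf '      - "%s"\n' "*.azurecr.io" "$(mcr_host)"
}

# Build the --registry-mirror= argument.
# $1 is the private registry host (hypothetical input); the "source:target"
# value layout is an assumption of this sketch.
render_registry_mirror_arg() {
  local acr_host="$1"
  echo "--registry-mirror=$(mcr_host):${acr_host}"
}
```

With `MCR_REPOSITORY_BASE` set to a sovereign-cloud host, both helpers would emit that host rather than mcr.microsoft.com, which is the behavior the overview describes.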

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh
@aks-node-assistant

Contributor

🕵️ AgentBaker Linux Gate Detective

Build 168052408 (Run AgentBaker E2E) FAILED: known shared-cluster infra outage, not caused by this PR.

TL;DR

Matches existing wiki signature kubenet-v5-node-not-ready-scriptless on shared cluster abe2e-kubenet-v5-150ee (westus3). Repair item #38403603 is already tracking. The cluster is partially recovering — failure count is down to 39/469 (vs 200+ during peak burst). This PR touches NetworkIsolated sovereign-cloud credential provider config; the failed scenarios visible in the log (Test_Ubuntu2204_ContainerdURL_IMDSRestrictionFilterTable_Scriptless, Test_Ubuntu2204_ANCHotfix_BinarySelection, etc.) don't exercise that code path.

3-level RCA

1. Surface symptom — 39 of 469 tests failed (97 skipped); all visible failures are on shared cluster abe2e-kubenet-v5-150ee with VMSS scenarios that never reach node-ready (kube.go:195 pre-kubelet-registration timeout pattern, same as the ongoing infra signature). panic.go:694: ✗ preparing AKS node failed.

2. Corroboration — Adjacent build 168051200 on PR #8659 (renovate/patch-nvidia-dcgm) running ~12 minutes earlier on the same cluster has the identical signature (40/469 failures) — same infra fingerprint, independent unrelated PR. The kubenet-v5-node-not-ready-scriptless signature has now hit 26 distinct builds across 10+ PRs over the past 72h.

3. Root-cause challenge — Strongest alternative: PR-caused regression via the sovereign-cloud credential provider config fix. Why less likely: the failing test names (ContainerdURL_IMDSRestrictionFilterTable_Scriptless, ANCHotfix_BinarySelection) are not the NetworkIsolated_NonAnonymousACR / sovereign-cloud scenarios that would exercise the PR's MCRSov code path. Failure point is pre-kubelet-registration on a known-unhealthy shared cluster, matching the cross-PR infra pattern.

Classification

  • Test infrastructure / shared-cluster fleet stress (not PR-caused)
  • Wiki signature: kubenet-v5-node-not-ready-scriptless (Count → 27 distinct builds incl. this one)
  • Confidence: high that this is an infra outage; high that the PR is unrelated.

Recommended next action

  • For this PR: safe to rerun the gate; no code change required. Once cluster fully recovers, the NetworkIsolated_NonAnonymousACR scenarios should exercise your sovereign-cloud fix path.
  • Owner of the underlying issue: AgentBaker E2E test-infra (repair item #38403603 — Active, persists past 72h).

Evidence

Posted by Clawpilot AgentBaker Linux Gate Detective Watcher.

@fseldow

fseldow commented Jun 15, 2026

Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fseldow

fseldow commented Jun 15, 2026

Contributor Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@aks-node-assistant

Contributor

🕵️ AgentBaker Linux Gate Detective

Build 168091386 (Run AgentBaker E2E) FAILED: 2nd recurrence of kubenet-v5-node-not-ready-scriptless on this PR. This is a known infra outage tracked under #38403603, not caused by this PR.

TL;DR

28/469 failures on shared cluster abe2e-kubenet-v5-150ee — same pattern as the prior build (168052408) on this PR. Cluster is still in partial-recovery mode (failure rate down from 200+ at peak, now hovering 24-40 across builds today). PR's NetworkIsolated sovereign-cloud credential-provider config change is unrelated to pre-kubelet-registration cluster timeouts.

Classification

  • Test infrastructure / shared-cluster fleet stress (not PR-caused), repair item #38403603 (Active)
  • Wiki signature: kubenet-v5-node-not-ready-scriptless (Count → 29 distinct builds)
  • Confidence: High

Recommended next action

Safe to rerun once cluster fully recovers. Prior comment: #issuecomment-4707948564.

Posted by Clawpilot AgentBaker Linux Gate Detective Watcher.

@fseldow

fseldow commented Jun 15, 2026

Contributor Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fseldow

fseldow commented Jun 15, 2026

Contributor Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fseldow

fseldow commented Jun 16, 2026

Contributor Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fseldow

fseldow commented Jun 16, 2026

Contributor Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fseldow

fseldow commented Jun 16, 2026

Contributor Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fseldow

fseldow commented Jun 16, 2026

Contributor Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fseldow

fseldow commented Jun 16, 2026

Contributor Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fseldow

fseldow commented Jun 16, 2026

Contributor Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fseldow

fseldow commented Jun 16, 2026

Contributor Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fseldow

fseldow commented Jun 16, 2026

Contributor Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fseldow

fseldow commented Jun 16, 2026

Contributor Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


2 participants