feat(anc): add check-hotfix subcommand to read hotfix pointer from ConfigMap#8696

Open
Devinwong wants to merge 1 commit into devinwong/laughing-pancake from devinwong/anc-check-hotfix-configmap

@Devinwong Devinwong commented Jun 12, 2026

Collaborator

Add a check-hotfix subcommand that reads the hotfix pointer from a ConfigMap

This adds a new fail-open check-hotfix subcommand to aks-node-controller. It reads a cluster ConfigMap that maps an ANC version base to a hotfix version, and writes that pointer to the file download-hotfix already consumes. download-hotfix then re-resolves the pointer and keeps its existing patch-only, strictly-higher gating. check-hotfix only fetches and stages the pointer - it never installs anything and never blocks provisioning.

Stacking

This branch is stacked on the base-to-version hotfix map change (PR #8694). The PR base is set to that branch so the diff shows only this change (app.go wiring + checkhotfix.go + checkhotfix_test.go). It must merge after #8694; once #8694 has merged, retarget this PR to main.

main
 \- #8694  2.1a  base->version hotfix map (Go)
     \- #8696  2.1b  check-hotfix ConfigMap reader (Go)        <- this PR
         \- #8715  2.1c  wire check-hotfix into wrapper (shell)
             \- #8717  2.1d  enable_provisioning_hotfix contract field + Go self-gate

On its own, this PR's check-hotfix command is always on: it has no feature gate. Gating arrives later in 2.1d (#8717), which adds an enable_provisioning_hotfix contract field that check-hotfix self-gates on.

What it does

  1. Reads the kube-system/anc-hotfix-version ConfigMap from the apiserver with a raw net/http HTTPS GET (no client-go dependency).
    • Endpoint/creds all come from the node config that ANC already parses: the apiserver FQDN, the TLS bootstrap token, and the cluster CA (carried base64-encoded in the node config). check-hotfix runs before the provisioning scripts, so the on-node kubelet kubeconfigs and the decoded CA file (/etc/kubernetes/certs/ca.crt) do not exist yet - the node config is the only credential source guaranteed to be present at this point.
    • Short-timeout (~10s) HTTPS client trusting the cluster CA.
  2. Parses the ConfigMap: .data holds the full {"hotfixes":{...}} JSON object under a single key (prefers hotfixes.json, else the only entry). The value unmarshals directly into the same config type download-hotfix uses, so both commands share one parser and data contract.
  3. Writes the pointer to /opt/azure/containers/aks-node-controller-hotfix.json in the same {"hotfixes":{...}} shape (atomic temp-file + rename), so download-hotfix re-resolves it and applies its unchanged gating.
  4. Fail-open: the command always exits 0 so provisioning is never blocked. Any 404 / 403 / timeout / parse failure is logged, emitted as telemetry, and swallowed.
  5. Cold-start fallback: if the ConfigMap read fails, it reads a lenient top-level hotfixes object embedded in the node config and uses that. (Marked with a TODO to switch to a typed config field once that contract exists.)
  6. Telemetry: guest-agent events under task name CheckHotfix with outcomes configMapRead, noHotfixForBase, customDataFallback, failed.
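
As an illustration of step 1, here is a minimal sketch of a raw net/http setup that trusts only the cluster CA and authenticates with the bootstrap token. The helper names (newClusterClient, newConfigMapRequest), the default HTTPS port, and sending the token as a bearer header are assumptions made for this example, not taken from the PR's code. The URL path is the standard Kubernetes core/v1 ConfigMap endpoint.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"encoding/base64"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// Standard core/v1 path for the kube-system/anc-hotfix-version ConfigMap.
const configMapPath = "/api/v1/namespaces/kube-system/configmaps/anc-hotfix-version"

// newClusterClient (hypothetical helper) builds a short-timeout HTTPS client
// that trusts only the base64-encoded cluster CA from the node config.
func newClusterClient(caBase64 string, timeout time.Duration) (*http.Client, error) {
	caPEM, err := base64.StdEncoding.DecodeString(caBase64)
	if err != nil {
		return nil, fmt.Errorf("decode cluster CA: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("cluster CA contains no usable PEM certificate")
	}
	return &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12},
		},
	}, nil
}

// newConfigMapRequest (hypothetical helper) builds the GET request.
// Assumption: the bootstrap token is presented as a bearer token.
func newConfigMapRequest(apiserverFQDN, bootstrapToken string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, "https://"+apiserverFQDN+configMapPath, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+bootstrapToken)
	req.Header.Set("Accept", "application/json")
	return req, nil
}

func main() {
	// Placeholder FQDN and token; nothing is sent over the network here.
	req, err := newConfigMapRequest("example.hcp.invalid", "abcdef.0123456789abcdef")
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	fmt.Println(req.Method, req.URL)

	// "bm90IGEgY2VydA==" is base64 for "not a cert", so this reports an error.
	if _, err := newClusterClient("bm90IGEgY2VydA==", 10*time.Second); err != nil {
		fmt.Println("client error:", err)
	}
}
```

In a fail-open command, any error from either helper would be logged and swallowed rather than returned as a non-zero exit.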

Net effect (examples)

ConfigMap published to the cluster:

{
  "data": {
    "hotfixes.json": "{\"hotfixes\":{\"202604.01\":\"202604.01.1\",\"202605.01\":\"202605.01.2\"}}"
  }
}
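
A minimal sketch of how a ConfigMap response like the one above could be parsed, following the key-selection rule from step 2 (prefer hotfixes.json, else the only entry). The commit message names the shared type hotfixConfig, but its fields here, the configMap struct, and the function name parseHotfixConfigMap are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// hotfixConfig mirrors the {"hotfixes":{...}} shape. Field layout is assumed.
type hotfixConfig struct {
	Hotfixes map[string]string `json:"hotfixes"`
}

// configMap holds only the part of the API response this sketch needs.
type configMap struct {
	Data map[string]string `json:"data"`
}

// parseHotfixConfigMap (hypothetical name) picks the "hotfixes.json" key,
// or the sole key when there is exactly one, and decodes its JSON value.
func parseHotfixConfigMap(body []byte) (hotfixConfig, error) {
	var cfg hotfixConfig
	var cm configMap
	if err := json.Unmarshal(body, &cm); err != nil {
		return cfg, fmt.Errorf("decode ConfigMap: %w", err)
	}
	raw, ok := cm.Data["hotfixes.json"]
	if !ok {
		if len(cm.Data) != 1 {
			return cfg, errors.New("ConfigMap has no hotfixes.json key and not exactly one entry")
		}
		for _, v := range cm.Data {
			raw = v // the only entry
		}
	}
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return cfg, fmt.Errorf("decode hotfix pointer: %w", err)
	}
	return cfg, nil
}

func main() {
	body := []byte(`{"data":{"hotfixes.json":"{\"hotfixes\":{\"202604.01\":\"202604.01.1\",\"202605.01\":\"202605.01.2\"}}"}}`)
	cfg, err := parseHotfixConfigMap(body)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(cfg.Hotfixes["202604.01"], cfg.Hotfixes["202605.01"])
}
```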

check-hotfix stages /opt/azure/containers/aks-node-controller-hotfix.json:

{"hotfixes":{"202604.01":"202604.01.1","202605.01":"202605.01.2"}}
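
The temp-file-plus-rename staging from step 3 can be sketched as follows. This is a generic version of the pattern, not the PR's code: the function names, the 0644 mode, and the temp-file naming are assumptions. The demo writes into a throwaway directory rather than /opt/azure/containers.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic (hypothetical helper) writes data to a temp file in the
// target directory and renames it into place, so a reader sees either the
// old file or the complete new one, never a partial write.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // fails harmlessly once the rename has succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return fmt.Errorf("write temp file: %w", err)
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return fmt.Errorf("sync temp file: %w", err)
	}
	if err := tmp.Close(); err != nil {
		return fmt.Errorf("close temp file: %w", err)
	}
	if err := os.Chmod(tmpName, perm); err != nil {
		return fmt.Errorf("chmod temp file: %w", err)
	}
	return os.Rename(tmpName, path)
}

// stageInTempDir stages payload in a scratch directory and reads it back.
func stageInTempDir(payload []byte) (string, error) {
	dir, err := os.MkdirTemp("", "hotfix-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "aks-node-controller-hotfix.json")
	if err := writeFileAtomic(path, payload, 0o644); err != nil { // mode is an assumption
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := stageInTempDir([]byte(`{"hotfixes":{"202604.01":"202604.01.1"}}`))
	if err != nil {
		fmt.Println("stage error:", err)
		return
	}
	fmt.Println(got)
}
```

Creating the temp file in the same directory as the target keeps the rename on one filesystem, which is what makes it atomic on POSIX systems.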

| Node baked ANC version | ConfigMap read | check-hotfix outcome | download-hotfix then does |
| --- | --- | --- | --- |
| 202604.01.0 | OK | configMapRead | base 202604.01 -> 202604.01.1, patch 1 > 0, upgrades |
| 202605.01.2 | OK | configMapRead | base 202605.01 -> 202605.01.2, patch not higher, no-op |
| 202607.15.0 | OK (no matching base) | noHotfixForBase | no pointer for this base, no-op |
| 202604.01.0 | fails, node config has embedded hotfixes | customDataFallback | reads staged fallback pointer, resolves as above |
| 202604.01.0 | fails, no fallback present | failed (still exit 0) | nothing staged, no-op |
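
The "download-hotfix then does" results above follow from the patch-only, strictly-higher rule. Here is a rough sketch of that rule, assuming versions have exactly three dot-separated components with the last one being the patch number. The function names are hypothetical and the real gating lives in download-hotfix, which this PR leaves unchanged.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitVersion splits "202604.01.0" into base "202604.01" and patch 0.
// Assumption: versions always have exactly three dot-separated components.
func splitVersion(v string) (string, int, error) {
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return "", 0, fmt.Errorf("version %q does not have three components", v)
	}
	patch, err := strconv.Atoi(parts[2])
	if err != nil || patch < 0 {
		return "", 0, fmt.Errorf("version %q has an invalid patch component", v)
	}
	return parts[0] + "." + parts[1], patch, nil
}

// resolveHotfix (hypothetical) returns the hotfix version to move to, and
// true only when the pointer has the same base and a strictly higher patch.
func resolveHotfix(baked string, hotfixes map[string]string) (string, bool) {
	base, patch, err := splitVersion(baked)
	if err != nil {
		return "", false
	}
	target, ok := hotfixes[base]
	if !ok {
		return "", false // no pointer for this base
	}
	targetBase, targetPatch, err := splitVersion(target)
	if err != nil || targetBase != base || targetPatch <= patch {
		return "", false // not patch-only, or not strictly higher
	}
	return target, true
}

func main() {
	hotfixes := map[string]string{"202604.01": "202604.01.1", "202605.01": "202605.01.2"}
	for _, baked := range []string{"202604.01.0", "202605.01.2", "202607.15.0"} {
		target, upgrade := resolveHotfix(baked, hotfixes)
		fmt.Printf("baked=%s upgrade=%t target=%q\n", baked, upgrade, target)
	}
}
```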

Tests

New network-free unit tests (creds/ConfigMap source injected, no real networking): success read+write, 404/403/timeout/connection fail-open, invalid ConfigMap JSON fail-open, noHotfixForBase, cold-start fallback (and no-pointer failure), telemetry outcomes and always-exit-0 wiring, shared-parser equivalence with download-hotfix, and kubeconfig parsing (token + client-cert, inline-data and file forms).

All new tests pass. The full go test ./... run shows no new failures versus the base branch. The only failures are pre-existing Windows-only environmental ones (they need /etc/os-release, bash, and unix file perms) that pass in Linux CI.
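
To show what an injected ConfigMap source makes possible, here is a small sketch that maps fetch results to the four telemetry outcomes without any networking. The classify function, the configMapFetcher type, and errSimulatedFetch are hypothetical names for this example. The PR's actual injection points are the App fields named in the commit message, whose signatures are not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

type outcome string

// Outcome names as listed in the PR description.
const (
	configMapRead      outcome = "configMapRead"
	noHotfixForBase    outcome = "noHotfixForBase"
	customDataFallback outcome = "customDataFallback"
	failed             outcome = "failed"
)

// configMapFetcher (hypothetical) stands in for the real ConfigMap read so
// tests can supply canned results.
type configMapFetcher func() (map[string]string, error)

var errSimulatedFetch = errors.New("simulated ConfigMap fetch failure")

// classify decides the outcome for a node with the given version base.
// It never returns an error: every failure maps to an outcome (fail-open).
func classify(fetch configMapFetcher, fallback map[string]string, base string) outcome {
	hotfixes, err := fetch()
	if err != nil {
		if len(fallback) > 0 {
			return customDataFallback // cold-start pointer embedded in node config
		}
		return failed
	}
	if _, ok := hotfixes[base]; !ok {
		return noHotfixForBase
	}
	return configMapRead
}

func main() {
	pointer := map[string]string{"202604.01": "202604.01.1"}
	up := func() (map[string]string, error) { return pointer, nil }
	down := func() (map[string]string, error) { return nil, errSimulatedFetch }

	fmt.Println(classify(up, nil, "202604.01"))
	fmt.Println(classify(up, nil, "202607.15"))
	fmt.Println(classify(down, pointer, "202604.01"))
	fmt.Println(classify(down, nil, "202604.01"))
}
```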

Note: wiring this command into the provisioning wrapper script is intentionally out of scope for this PR and will land separately behind a feature flag.

@Devinwong Devinwong changed the title feat(anc): provisioning-hotfix M1 - check-hotfix ConfigMap reader (2.1b) feat(anc): add check-hotfix subcommand to read hotfix pointer from ConfigMap Jun 12, 2026
@Devinwong Devinwong force-pushed the devinwong/laughing-pancake branch from 5fff98d to 061ba60 Compare June 15, 2026 21:53
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 64e782d to ede050a Compare June 15, 2026 21:56
@Devinwong Devinwong marked this pull request as ready for review June 16, 2026 02:27
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from ede050a to 0c90761 Compare June 16, 2026 17:37
…2.1b)

Add a fail-open 'check-hotfix' CLI subcommand that reads the
kube-system/anc-hotfix-version ConfigMap published by the
live-patching-controller and stages the resolved {hotfixes:{...}} pointer
to the path download-hotfix already reads. download-hotfix keeps its
unchanged patch-only, strictly-higher gating; check-hotfix only fetches and
writes the pointer.

- Raw net/http HTTPS GET (no client-go); creds from AKSNodeConfig bootstrap
  token + apiserver FQDN (primary) or on-node kubeconfigs (secondary).
- Shares the 2.1a hotfixConfig parser/data contract with download-hotfix.
- Always exits 0; emits CheckHotfix telemetry (configMapRead,
  noHotfixForBase, customDataFallback, failed).
- PoC cold-start fallback reads a lenient top-level hotfixes object from the
  node config when the ConfigMap read fails (TODO: typed absvc contract).
- Injectable App fields (checkHotfixConfigMapFetcher, nodeConfigPath) for
  network-free unit tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 0c90761 to b33ec66 Compare June 16, 2026 18:08
@github-actions

Contributor

Changes cached containers or packages on windows VHDs

Please get a Windows SIG member to approve.

The following diff file shows any additions or deletions from what will be cached on Windows VHDs, organised by VHD type.

  • Additions are new things cached.
  • Deletions are things no longer cached.
diff --git a/vhd_files/2022-containerd-gen2.txt b/vhd_files/2022-containerd-gen2.txt
index 7039bac..c51a47f 100644
--- a/vhd_files/2022-containerd-gen2.txt
+++ b/vhd_files/2022-containerd-gen2.txt
@@ -122,0 +123 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -124 +124,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
diff --git a/vhd_files/2022-containerd.txt b/vhd_files/2022-containerd.txt
index 5915cf1..7312c49 100644
--- a/vhd_files/2022-containerd.txt
+++ b/vhd_files/2022-containerd.txt
@@ -122,0 +123 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -124 +124,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
diff --git a/vhd_files/2025-gen2.txt b/vhd_files/2025-gen2.txt
index 37d9326..36e3641 100644
--- a/vhd_files/2025-gen2.txt
+++ b/vhd_files/2025-gen2.txt
@@ -52,0 +53 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -54 +54,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
diff --git a/vhd_files/2025.txt b/vhd_files/2025.txt
index 5b08280..b8873d5 100644
--- a/vhd_files/2025.txt
+++ b/vhd_files/2025.txt
@@ -52,0 +53 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -54 +54,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
