fix: prevent CSE hang when curl verbose output blocks on unstable disks by pdamianov-dev · Pull Request #8711 · Azure/AgentBaker

pdamianov-dev · 2026-06-15T15:04:20Z

What this PR does / why we need it:

Redirect CURL_OUTPUT to /dev/shm (tmpfs) instead of /tmp to avoid blocking on unstable OS disks. /dev/shm is kernel-mounted tmpfs (CONFIG_TMPFS=y on all Azure images) across all VM SKUs including CVM and ARM64.
Add --max-time to curl in _retry_file_curl_internal so curl enforces its own deadline even if shell timeout cannot deliver signals.
Add -k 5 to timeout in _retrycmd_internal and _retry_file_curl_internal to escalate to SIGKILL if SIGTERM is ignored (e.g. process in D-state).
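
The SIGKILL escalation in the last item can be observed in isolation with a child that ignores SIGTERM. This is a minimal sketch, assuming GNU coreutils `timeout`; the ignoring child is a hypothetical stand-in for a hung curl, and the durations are shortened for illustration.

```shell
#!/usr/bin/env bash
# Sketch: show `timeout -k` escalating from SIGTERM to SIGKILL.
# Assumes GNU coreutils timeout; not taken from cse_helpers.sh.

start=$(date +%s)

# The child ignores SIGTERM (an ignored signal is inherited by `sleep`),
# so only the follow-up SIGKILL can end it.
timeout -k 1 1 bash -c 'trap "" TERM; sleep 30'
status=$?

elapsed=$(( $(date +%s) - start ))

# GNU timeout documents 124 for a plain timeout and 137 (128+9) when
# the command had to be killed with SIGKILL.
echo "exit status: $status, elapsed: ${elapsed}s"
```

A process in genuinely uninterruptible sleep generally cannot act on SIGKILL until the blocking kernel call returns, so this escalation probably helps most when SIGTERM is merely ignored or blocked.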

Which issue(s) this PR fixes:
Bug 36680094: [Repair Item] Improve CSE script to handle curl timeouts and prevent blocking by redirecting verbose output away from unstable disks and enforcing strict timeout with forced kill signals.
Fixes #

Copilot

Pull request overview

This PR hardens Linux CSE download/retry behavior to reduce the chance of CSE hangs when storage is unstable by moving curl verbose logging to tmpfs and making timeout enforcement more aggressive.

Changes:

Redirects CURL_OUTPUT to /dev/shm (tmpfs) when available, falling back to /tmp.
Updates retry helpers to use timeout -k 5 (SIGKILL escalation) and adds curl --max-time in _retry_file_curl_internal.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
parts/linux/cloud-init/artifacts/cse_install.sh	Sets `CURL_OUTPUT` to prefer `/dev/shm` (tmpfs) over `/tmp` for verbose curl logs.
parts/linux/cloud-init/artifacts/cse_helpers.sh	Applies the same `CURL_OUTPUT` change and strengthens retry timeout behavior (`timeout -k 5`, `curl --max-time`).

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

+                            echo "timeout_args: $*" >> $CURL_OUTPUT
+                            return 0
+                        }
+                        touch /tmp/nonexistent
+                        When call _retry_file_curl_internal 1 1 30 0 "/tmp/nonexistent" "https://dummy.url/file" "[ -f /tmp/nonexistent ]"
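
The stubbing technique in this test fragment can be sketched in plain bash: shadowing `timeout` with a shell function records the arguments the helper would have passed, without running curl. The `download_once` function below is a hypothetical stand-in for `_retry_file_curl_internal`, not the real helper, and the temporary log file is an assumption.

```shell
#!/usr/bin/env bash
# Sketch of argument-capture stubbing; names here are illustrative only.

ARGS_LOG=$(mktemp)

# Stub: shadow the real `timeout` binary and record what it was called with.
timeout() {
    echo "timeout_args: $*" >> "$ARGS_LOG"
    return 0
}

# Hypothetical stand-in for the helper under test.
download_once() {
    local effectiveTimeout=$1 url=$2 filePath=$3
    timeout -k 5 "$effectiveTimeout" curl --max-time "$effectiveTimeout" -fsSLv "$url" -o "$filePath"
}

download_once 30 "https://dummy.url/file" "/tmp/nonexistent"

# Assert on the recorded arguments instead of on network behaviour.
recorded=$(cat "$ARGS_LOG")
case "$recorded" in
    *"-k 5 30 curl --max-time 30"*) echo "flags present" ;;
    *) echo "flags missing" ;;
esac
rm -f "$ARGS_LOG"
```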


+if [ -d "/dev/shm" ] && [ -w "/dev/shm" ]; then
+    CURL_OUTPUT=/dev/shm/curl_verbose.out
+else
+    CURL_OUTPUT=/tmp/curl_verbose.out
+fi


Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

cameronmeissner · 2026-06-15T17:11:40Z

+# /dev/shm is kernel-mounted tmpfs (CONFIG_TMPFS=y on all Azure images) and available on
+# all VM SKUs including CVM and ARM64. Fallback to /tmp for non-Azure environments.
+if [ -d "/dev/shm" ] && [ -w "/dev/shm" ]; then
+    CURL_OUTPUT=/dev/shm/curl_verbose.out


Can we log a message when /dev/shm is selected?
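
One possible shape for that logging, sketched under assumptions: the message text and the `select_curl_output` function name are hypothetical and not taken from the PR.

```shell
#!/usr/bin/env bash
# Sketch: pick the verbose-log location and report which one was chosen.

select_curl_output() {
    if [ -d "/dev/shm" ] && [ -w "/dev/shm" ]; then
        CURL_OUTPUT=/dev/shm/curl_verbose.out
        echo "CURL_OUTPUT: using tmpfs at ${CURL_OUTPUT}"
    else
        CURL_OUTPUT=/tmp/curl_verbose.out
        echo "CURL_OUTPUT: /dev/shm unavailable, falling back to ${CURL_OUTPUT}"
    fi
}

select_curl_output
```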

aks-node-assistant · 2026-06-15T18:01:53Z

🕵️ AgentBaker Linux Gate Detective — Build 168091952 FAILED at Setup Cue (cascading into Build VHD / Test, Scan, and Cleanup) — 2nd occurrence of go-toolchain-tls-handshake-timeout, infra-side, not caused by this PR.

TL;DR

The pipeline's Setup Cue step needs Go 1.25.11 to go install cuelang.org/go/cmd/cue@latest. Build agent has Go 1.25.10 pre-installed, so Go auto-downloads the toolchain from storage.googleapis.com/proxy-golang-org-prod/... → TLS handshake timeout after ~10s.
cue binary is therefore not on PATH; the next step cue export ./schemas/manifest.cue exits with cue: command not found (exit 127).
Downstream Build VHD fails with SKU_NAME must be set for linux VHD builds because init-packer reads SKU_NAME from the Cue-generated manifest that was never produced — that's a cascading symptom, not the root cause.
Matches existing wiki signature go-toolchain-tls-handshake-timeout (first seen on build 168010239 / PR chore(deps): bump github.com/onsi/gomega from 1.41.0 to 1.42.0 #8704).

3-level RCA

1. Surface symptom — Setup Cue log: go: download go1.25.11: golang.org/toolchain@v0.0.1-go1.25.11.linux-amd64: Get "https://storage.googleapis.com/proxy-golang-org-prod/...": net/http: TLS handshake timeout, then cue: command not found, exit 127. Build VHD log: SKU_NAME must be set for linux VHD builds from packer.mk:101: init-packer. Test, Scan, and Cleanup + Publish BCC Tools Installation Log fail downstream because no VHD was published.

2. Corroboration — Identical pattern to build 168010239 on PR #8704 (gomega dep bump) — same Go-toolchain auto-download to storage.googleapis.com/proxy-golang-org-prod/..., same net/http: TLS handshake timeout after ~10s. Hosted build agent egress to the Google module-proxy CDN is flaky; first occurrence was 2 days ago. This PR doesn't change cuelang.org/go version requirements (CUE module requires Go 1.25.0, and the toolchain selector picked 1.25.11 which is the newest minor available at module-resolution time).

3. Root-cause challenge — Strongest alternative: PR-caused regression via the curl/CSE change. Why less likely: the PR touches CSE bash scripts (cse_helpers.sh / cse_cmd.sh curl behavior to prevent hangs on unstable disks); it does not modify go.mod, Cue schemas, or anything Go-toolchain related. Failure occurs at Setup Cue — the very first task that requires Go — before any AgentBaker source code from the PR is even compiled. Cascading SKU_NAME must be set is a Make-time consequence, not a Packer/script bug.

Classification

Test infrastructure / build-agent egress flakiness (Google module-proxy TLS handshake)
Wiki signature: go-toolchain-tls-handshake-timeout (Count → 2)
Confidence: High that the failure mechanism is network-side; High that PR is unrelated.

Recommended next action

For this PR: rerun the gate — toolchain downloads usually succeed on retry.
Owner of the underlying issue: AgentBaker E2E build infra — recurrence in <72h on a distinct pipeline path (Setup Cue, not just Run AgentBaker E2E) confirms this is worth pre-baking Go 1.25.11 + 1.26.4 into the agent image, or constraining GOTOOLCHAIN=local and bumping the host Go install separately.
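
The `GOTOOLCHAIN=local` option mentioned above could be applied in the Setup Cue step roughly as follows. This is a sketch, not the pipeline's actual script: with `GOTOOLCHAIN=local` the go command uses only the installed toolchain and does not download another one, so a too-old host Go would surface as a version error rather than a proxy download.

```shell
#!/usr/bin/env bash
# Sketch: pin the go command to the locally installed toolchain.
export GOTOOLCHAIN=local

if command -v go >/dev/null 2>&1; then
    # Confirm the setting is visible to the go command.
    go env GOTOOLCHAIN
else
    echo "go not found on PATH"
fi

# The install step quoted in the TL;DR would then run as usual:
#   go install cuelang.org/go/cmd/cue@latest
```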

Evidence

Failed run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168091952
Failed tasks: Setup Cue (root), Build VHD (cascading), Test, Scan, and Cleanup (cascading), Publish BCC Tools Installation Log (cascading)
Source commit: ac7ebd9ca01f1450652566cb4229ef261f417711
Prior occurrence on PR chore(deps): bump github.com/onsi/gomega from 1.41.0 to 1.42.0 #8704 (gomega bump): build 168010239 comment

Posted by Clawpilot AgentBaker Linux Gate Detective Watcher.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

        fi

-        timeout $effectiveTimeout curl -fsSLv $url -o $filePath > $CURL_OUTPUT 2>&1
+        timeout -k 5 "$effectiveTimeout" curl --max-time "$effectiveTimeout" -fsSLv "$url" -o "$filePath" > "$CURL_OUTPUT" 2>&1


- Add --max-time to curl in _retry_file_curl_internal so curl enforces its own deadline even if shell timeout cannot deliver signals.
- Add -k 5 to timeout in _retry_file_curl_internal to escalate to SIGKILL if SIGTERM is ignored (e.g. process in D-state on disk).
- Add ShellSpec test asserting -k 5 and --max-time flags are passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Validates that the timeout -k mechanism in cse_helpers.sh properly kills hung curl processes when a download URL is unreachable. Uses a non-routable IP (192.0.2.1, RFC 5737 TEST-NET-1) to cause curl to hang, and a short CSETimeout (90s) to verify the global timeout -k5s in cse_start.sh fires and terminates the provisioning script with exit code 124.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

+        timeout -k 5 $effectiveTimeout curl --max-time $effectiveTimeout -fsSLv $url -o $filePath > $CURL_OUTPUT 2>&1
        if [ "$?" -ne 0 ]; then
            cat $CURL_OUTPUT
        fi
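
When the command in this hunk fails, the exit status hints at which deadline fired. The mapping below is a sketch: 124 and 137 are the statuses GNU `timeout` documents for a timeout and for a SIGKILL, and 28 is curl's own operation-timeout code; the `describe_download_exit` function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: translate the exit status of `timeout -k 5 ... curl --max-time ...`.

describe_download_exit() {
    case "$1" in
        0)   echo "success" ;;
        28)  echo "curl --max-time deadline reached" ;;
        124) echo "timeout sent SIGTERM after the deadline" ;;
        137) echo "process was killed with SIGKILL" ;;
        *)   echo "other failure (exit $1)" ;;
    esac
}

describe_download_exit 28
describe_download_exit 124
```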


+// Test_Ubuntu2204_CSETimeoutOnUnreachableDownload validates that the per-curl timeout (-k 5)
+// in cse_helpers.sh _retry_file_curl_internal properly kills hung download attempts.
+// It sets CustomKubeBinaryURL to a non-routable IP (RFC 5737 TEST-NET-1) which causes curl
+// to hang until the timeout fires, and sets a short CSETimeout so the test completes quickly.
+// The global timeout in cse_start.sh (timeout -k5s) kills the entire provisioning script.


aks-node-assistant · 2026-06-15T22:32:48Z

🕵️ AgentBaker Linux Gate Detective — Build 168131287 FAILED at Test, Scan, and Cleanup / build2004fipsgen2containerd — new signature trivy-db-mcr-unauthorized, infra-side, not caused by this PR.

TL;DR

The post-build VHD scan step runs trivy --scanners vuln rootfs -f json --db-repository mcr.microsoft.com/mirror/ghcr/aquasecurity/trivy-db:2,... on the built VHD. The Trivy DB download from mcr.microsoft.com consistently returns:

OCI repository error: GET https://mcrprod.azurecr.io/oauth2/token?scope=repository%3Amirror%2Fghcr%2Faquasecurity%2Ftrivy-db%3Apull&service=mcrprod.azurecr.io: UNAUTHORIZED: authentication required

Retried 10 times with 30s sleep, all 10 attempts fail identically. Downstream vhd-scanning.sh exited with code 1, blob ref not uploaded → "ERROR: The specified blob does not exist", Test, Scan, and Cleanup exits 2.

This PR fixes a CSE curl-hang bug — completely unrelated to MCR auth or Trivy.

3-level RCA

1. Surface symptom — Test, Scan, and Cleanup task for SKU build2004fipsgen2containerd fails. Failed step: trivy vulnerability DB download (10 retries). Each retry returns HTTP 401 UNAUTHORIZED from mcrprod.azurecr.io token endpoint. CorrelationIds visible in the log (e.g. 5d0d06a2-aa1f-4a1b-9a3b-b16c62071247) — surface this to ACR ops if needed. vhd-scanning.sh exits 1, build aborts.

2. Corroboration — Only one SKU job failed (build2004fipsgen2containerd); other parallel SKU builds appear to have completed without this issue, suggesting either an intermittent MCR auth glitch or per-job-agent identity propagation problem. No corresponding entry in the wiki source-of-truth — new signature.

3. Root-cause challenge — Strongest alternative: PR-caused CSE script regression. Why less likely: the PR (fix: prevent CSE hang when curl verbose output blocks on unstable disks) modifies cse_helpers.sh / cse_cmd.sh to redirect curl verbose output away from blocking IO. That code only runs on the booted VM during CSE, not during the VHD-builder agent's post-build Trivy scan. The agent's MCR call uses the agent's managed identity, which is unaffected by anything in the PR. Trivy's exit is at HTTP-auth time, not at runtime — clearly an infra-side auth/network failure.

Classification

Test infrastructure / VHD-builder agent MCR auth flakiness (new signature, first observed occurrence)
Wiki signature: trivy-db-mcr-unauthorized (Count → 1, new row)
Confidence: High that trivy DB download is the root cause; Medium about whether this is transient (per-agent MSI token) or persistent (MCR mirror access policy changed). Will be confirmed by next build on this PR or any other PR's build2004fipsgen2containerd job.

Recommended next action

For this PR: rerun the gate — 10 retries within one job aren't enough if the agent's managed-identity token to mcrprod.azurecr.io is genuinely cold. A fresh build should re-acquire the token.
Owner: AgentBaker E2E build infra — if it recurs, check the agent pool's managed-identity ACR-pull role assignment on mcrprod.azurecr.io/mirror/ghcr/aquasecurity/trivy-db and whether MCR's mirror policy recently changed to require auth where it previously was anonymous.

Evidence

Failed run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168131287
Failed task: Test, Scan, and Cleanup for SKU build2004fipsgen2containerd
Source commit: cd266014804fcac9277bb2e87551187f2a4f4f51
Earlier build on this PR (168091952) hit a different infra signature (go-toolchain-tls-handshake-timeout) at Setup Cue: comment

Posted by Clawpilot AgentBaker Linux Gate Detective Watcher.

pdamianov-dev · 2026-06-16T13:24:11Z

It was determined that this PR and the associated work item were not needed to address the issue and did not fully match the result of the linked ticket.

Copilot AI review requested due to automatic review settings June 15, 2026 15:04

pdamianov-dev requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, ganeshkumarashok, lilypan26, mxj220, phealy, r2k1, runzhen, sulixu, surajssd, timmy-wright and zachary-bailey as code owners June 15, 2026 15:04

Copilot started reviewing on behalf of pdamianov-dev June 15, 2026 15:04

Copilot AI reviewed Jun 15, 2026

Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated

Copilot AI review requested due to automatic review settings June 15, 2026 15:45

pdamianov-dev force-pushed the pdamianov/fix-curl-hang-unstable-disk branch from 201e349 to ce9efb7 Compare June 15, 2026 15:45

Copilot started reviewing on behalf of pdamianov-dev June 15, 2026 15:46

Copilot AI reviewed Jun 15, 2026

pdamianov-dev force-pushed the pdamianov/fix-curl-hang-unstable-disk branch from ce9efb7 to 9d46a0f Compare June 15, 2026 16:41

Copilot AI review requested due to automatic review settings June 15, 2026 16:58

pdamianov-dev force-pushed the pdamianov/fix-curl-hang-unstable-disk branch from 9d46a0f to ac7ebd9 Compare June 15, 2026 16:58

Copilot started reviewing on behalf of pdamianov-dev June 15, 2026 16:58

Copilot AI reviewed Jun 15, 2026

Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh

Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh

cameronmeissner reviewed Jun 15, 2026

pdamianov-dev force-pushed the pdamianov/fix-curl-hang-unstable-disk branch from ac7ebd9 to 50188b4 Compare June 15, 2026 18:03

awesomenix reviewed Jun 15, 2026

Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated

Copilot AI review requested due to automatic review settings June 15, 2026 19:39

Copilot started reviewing on behalf of pdamianov-dev June 15, 2026 19:40

Copilot AI reviewed Jun 15, 2026

Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated

pdamianov-dev force-pushed the pdamianov/fix-curl-hang-unstable-disk branch from a03d4c9 to 554fd16 Compare June 15, 2026 20:42

Copilot AI review requested due to automatic review settings June 15, 2026 21:08

Copilot started reviewing on behalf of pdamianov-dev June 15, 2026 21:09

Copilot AI reviewed Jun 15, 2026

pdamianov-dev closed this Jun 16, 2026

pdamianov-dev wants to merge 2 commits into main from pdamianov/fix-curl-hang-unstable-disk

4 participants
