fix: prevent CSE hang when curl verbose output blocks on unstable disks#8711

Closed
pdamianov-dev wants to merge 2 commits into main from pdamianov/fix-curl-hang-unstable-disk

Conversation

@pdamianov-dev

Contributor

What this PR does / why we need it:

  • Redirect CURL_OUTPUT to /dev/shm (tmpfs) instead of /tmp to avoid blocking on unstable OS disks. /dev/shm is kernel-mounted tmpfs (CONFIG_TMPFS=y on all Azure images) across all VM SKUs including CVM and ARM64.
  • Add --max-time to curl in _retry_file_curl_internal so curl enforces its own deadline even if shell timeout cannot deliver signals.
  • Add -k 5 to timeout in _retrycmd_internal and _retry_file_curl_internal to escalate to SIGKILL if SIGTERM is ignored (e.g. process in D-state).
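
The escalation described in the last bullet can be tried locally without curl. A minimal sketch, assuming GNU coreutils `timeout` and bash; the `stubborn` stand-in command and the 1-second values are illustrative and not taken from the PR:

```shell
#!/usr/bin/env bash
# Stand-in for a hung download: ignores SIGTERM, then sleeps.
# (Hypothetical command for illustration; the PR wraps curl instead.)
stubborn='trap "" TERM; sleep 10'

start=$(date +%s)
status=0
# Send SIGTERM after 1s; if the command is still alive 1s later, send SIGKILL (-k 1).
timeout -k 1 1 bash -c "$stubborn" || status=$?
elapsed=$(( $(date +%s) - start ))

# GNU timeout documents 124 for a plain timeout and 128+9 when it had to
# fall back to SIGKILL, so the status is non-zero in both cases.
echo "status=$status elapsed=${elapsed}s"
```

Without `-k`, the same command would run for the full 10 seconds, because the only signal sent is the one it ignores.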

Which issue(s) this PR fixes:
Bug 36680094: [Repair Item] Improve CSE script to handle curl timeouts and prevent blocking by redirecting verbose output away from unstable disks and enforcing strict timeout with forced kill signals.
Fixes #

Copilot AI left a comment

Contributor

Pull request overview

This PR hardens Linux CSE download/retry behavior to reduce the chance of CSE hangs when storage is unstable by moving curl verbose logging to tmpfs and making timeout enforcement more aggressive.

Changes:

  • Redirects CURL_OUTPUT to /dev/shm (tmpfs) when available, falling back to /tmp.
  • Updates retry helpers to use timeout -k 5 (SIGKILL escalation) and adds curl --max-time in _retry_file_curl_internal.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| parts/linux/cloud-init/artifacts/cse_install.sh | Sets CURL_OUTPUT to prefer /dev/shm (tmpfs) over /tmp for verbose curl logs. |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Applies the same CURL_OUTPUT change and strengthens retry timeout behavior (timeout -k 5, curl --max-time). |

Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated
Copilot AI review requested due to automatic review settings June 15, 2026 15:45
@pdamianov-dev force-pushed the pdamianov/fix-curl-hang-unstable-disk branch from 201e349 to ce9efb7 on June 15, 2026 15:45

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment on lines +269 to +273
echo "timeout_args: $*" >> $CURL_OUTPUT
return 0
}
touch /tmp/nonexistent
When call _retry_file_curl_internal 1 1 30 0 "/tmp/nonexistent" "https://dummy.url/file" "[ -f /tmp/nonexistent ]"
Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh
Comment on lines +31 to +35
if [ -d "/dev/shm" ] && [ -w "/dev/shm" ]; then
CURL_OUTPUT=/dev/shm/curl_verbose.out
else
CURL_OUTPUT=/tmp/curl_verbose.out
fi
@pdamianov-dev force-pushed the pdamianov/fix-curl-hang-unstable-disk branch from ce9efb7 to 9d46a0f on June 15, 2026 16:41
Copilot AI review requested due to automatic review settings June 15, 2026 16:58
@pdamianov-dev force-pushed the pdamianov/fix-curl-hang-unstable-disk branch from 9d46a0f to ac7ebd9 on June 15, 2026 16:58

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh
Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh
# /dev/shm is kernel-mounted tmpfs (CONFIG_TMPFS=y on all Azure images) and available on
# all VM SKUs including CVM and ARM64. Fallback to /tmp for non-Azure environments.
if [ -d "/dev/shm" ] && [ -w "/dev/shm" ]; then
CURL_OUTPUT=/dev/shm/curl_verbose.out

@cameronmeissner cameronmeissner Jun 15, 2026

Contributor

can we log something out when /dev/shm is selected?
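
One way this could look, sketched under assumptions: the function name `select_curl_output` and the log wording are hypothetical, and the directories are parameters only so that the fallback branch can be exercised on any machine.

```shell
#!/usr/bin/env bash
# Pick a location for curl's verbose log and report which one was chosen.
# $1: preferred directory (defaults to /dev/shm)
# $2: fallback directory (defaults to /tmp)
select_curl_output() {
    local preferred="${1:-/dev/shm}"
    local fallback="${2:-/tmp}"
    if [ -d "$preferred" ] && [ -w "$preferred" ]; then
        CURL_OUTPUT="$preferred/curl_verbose.out"
        echo "curl verbose output: using $CURL_OUTPUT" >&2
    else
        CURL_OUTPUT="$fallback/curl_verbose.out"
        echo "curl verbose output: $preferred not usable, falling back to $CURL_OUTPUT" >&2
    fi
}

select_curl_output
```

Logging to stderr keeps the message out of any command substitution that captures the script's stdout.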

@aks-node-assistant

Contributor

🕵️ AgentBaker Linux Gate Detective: Build 168091952 FAILED at Setup Cue (cascading into Build VHD / Test, Scan, and Cleanup). 2nd occurrence of go-toolchain-tls-handshake-timeout; infra-side, not caused by this PR.

TL;DR

  • The pipeline's Setup Cue step needs Go 1.25.11 to go install cuelang.org/go/cmd/cue@latest. The build agent has Go 1.25.10 pre-installed, so Go auto-downloads the toolchain from storage.googleapis.com/proxy-golang-org-prod/..., which fails with a TLS handshake timeout after ~10s.
  • cue binary is therefore not on PATH; the next step cue export ./schemas/manifest.cue exits with cue: command not found (exit 127).
  • Downstream Build VHD fails with SKU_NAME must be set for linux VHD builds because init-packer reads SKU_NAME from the Cue-generated manifest that was never produced — that's a cascading symptom, not the root cause.
  • Matches existing wiki signature go-toolchain-tls-handshake-timeout (first seen on build 168010239 / PR chore(deps): bump github.com/onsi/gomega from 1.41.0 to 1.42.0 #8704).

3-level RCA

1. Surface symptom — Setup Cue log: go: download go1.25.11: golang.org/toolchain@v0.0.1-go1.25.11.linux-amd64: Get "https://storage.googleapis.com/proxy-golang-org-prod/...": net/http: TLS handshake timeout, then cue: command not found, exit 127. Build VHD log: SKU_NAME must be set for linux VHD builds from packer.mk:101: init-packer. Test, Scan, and Cleanup + Publish BCC Tools Installation Log fail downstream because no VHD was published.

2. Corroboration — Identical pattern to build 168010239 on PR #8704 (gomega dep bump) — same Go-toolchain auto-download to storage.googleapis.com/proxy-golang-org-prod/..., same net/http: TLS handshake timeout after ~10s. Hosted build agent egress to the Google module-proxy CDN is flaky; first occurrence was 2 days ago. This PR doesn't change cuelang.org/go version requirements (CUE module requires Go 1.25.0, and the toolchain selector picked 1.25.11 which is the newest minor available at module-resolution time).

3. Root-cause challenge — Strongest alternative: PR-caused regression via the curl/CSE change. Why less likely: the PR touches CSE bash scripts (cse_helpers.sh / cse_cmd.sh curl behavior to prevent hangs on unstable disks); it does not modify go.mod, Cue schemas, or anything Go-toolchain related. Failure occurs at Setup Cue — the very first task that requires Go — before any AgentBaker source code from the PR is even compiled. Cascading SKU_NAME must be set is a Make-time consequence, not a Packer/script bug.

Classification

  • Test infrastructure / build-agent egress flakiness (Google module-proxy TLS handshake)
  • Wiki signature: go-toolchain-tls-handshake-timeout (Count → 2)
  • Confidence: High that the failure mechanism is network-side; High that PR is unrelated.

Recommended next action

  • For this PR: rerun the gate — toolchain downloads usually succeed on retry.
  • Owner of the underlying issue: AgentBaker E2E build infra — recurrence in <72h on a distinct pipeline path (Setup Cue, not just Run AgentBaker E2E) confirms this is worth pre-baking Go 1.25.11 + 1.26.4 into the agent image, or constraining GOTOOLCHAIN=local and bumping the host Go install separately.
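
The version mismatch behind this failure can be approximated with a plain comparison. A rough sketch, assuming GNU `sort -V`; the function name is hypothetical, and this is not how the go command itself selects toolchains, only an illustration of why Go 1.25.10 on the agent leads to a download attempt when 1.25.11 is required:

```shell
#!/usr/bin/env bash
# Returns 0 when the required Go version is newer than the installed one,
# which is the case where the go command attempts a toolchain download
# (unless GOTOOLCHAIN=local is set).
needs_toolchain_download() {
    local required="$1"
    local installed="$2"
    local newest
    newest=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | tail -n 1)
    [ "$required" != "$installed" ] && [ "$newest" = "$required" ]
}

if needs_toolchain_download 1.25.11 1.25.10; then
    echo "agent Go is older than required; a toolchain download would be attempted"
fi
```

With GOTOOLCHAIN=local the go command does not download toolchains, so the step would fail right away with a version error instead of waiting on the network. That is why the recommendation pairs it with bumping the host Go install.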

Posted by Clawpilot AgentBaker Linux Gate Detective Watcher.

@pdamianov-dev force-pushed the pdamianov/fix-curl-hang-unstable-disk branch from ac7ebd9 to 50188b4 on June 15, 2026 18:03
Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated
Copilot AI review requested due to automatic review settings June 15, 2026 19:39

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

fi

timeout $effectiveTimeout curl -fsSLv $url -o $filePath > $CURL_OUTPUT 2>&1
timeout -k 5 "$effectiveTimeout" curl --max-time "$effectiveTimeout" -fsSLv "$url" -o "$filePath" > "$CURL_OUTPUT" 2>&1
- Add --max-time to curl in _retry_file_curl_internal so curl enforces
  its own deadline even if shell timeout cannot deliver signals.
- Add -k 5 to timeout in _retry_file_curl_internal to escalate to
  SIGKILL if SIGTERM is ignored (e.g. process in D-state on disk).
- Add ShellSpec test asserting -k 5 and --max-time flags are passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@pdamianov-dev force-pushed the pdamianov/fix-curl-hang-unstable-disk branch from a03d4c9 to 554fd16 on June 15, 2026 20:42
Validates that the timeout -k mechanism in cse_helpers.sh properly kills
hung curl processes when a download URL is unreachable. Uses a non-routable
IP (192.0.2.1, RFC 5737 TEST-NET-1) to cause curl to hang, and a short
CSETimeout (90s) to verify the global timeout -k5s in cse_start.sh fires
and terminates the provisioning script with exit code 124.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
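
The behaviour this test relies on can be reproduced by hand. A minimal sketch with much shorter limits than the 90s CSETimeout; the 3-second limit and the `/file` path are illustrative assumptions, and on a host with no route at all curl may fail immediately rather than hang:

```shell
#!/usr/bin/env bash
# 192.0.2.1 lies in TEST-NET-1 (RFC 5737), a block reserved for documentation.
url="http://192.0.2.1/file"
limit=3

start=$(date +%s)
status=0
# Outer deadline from timeout (SIGTERM, then SIGKILL 2s later),
# inner deadline from curl's own --max-time.
timeout -k 2 "$limit" curl --max-time "$limit" -fsS "$url" -o /dev/null 2>/dev/null || status=$?
elapsed=$(( $(date +%s) - start ))

# A non-zero status is expected: 124 if timeout fired first, or a curl
# error code (such as 28 for its own timeout) if curl gave up first.
echo "status=$status elapsed=${elapsed}s"
```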
Copilot AI review requested due to automatic review settings June 15, 2026 21:08

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment on lines +402 to 405
timeout -k 5 $effectiveTimeout curl --max-time $effectiveTimeout -fsSLv $url -o $filePath > $CURL_OUTPUT 2>&1
if [ "$?" -ne 0 ]; then
cat $CURL_OUTPUT
fi
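
For comparison, a quoted variant of the same call wrapped in a function. This is a sketch, not the PR's code: `download_with_deadline` and its argument order are hypothetical, and the log path defaults to /tmp only when CURL_OUTPUT is unset.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the hardened curl call.
# $1: timeout in seconds, $2: URL, $3: destination file
download_with_deadline() {
    local limit="$1"
    local url="$2"
    local dest="$3"
    local log="${CURL_OUTPUT:-/tmp/curl_verbose.out}"
    local status=0

    # timeout enforces the outer deadline (SIGTERM, then SIGKILL 5s later);
    # --max-time makes curl enforce the same deadline itself.
    timeout -k 5 "$limit" curl --max-time "$limit" -fsSLv "$url" -o "$dest" > "$log" 2>&1 || status=$?

    # Surface the verbose log only when the download failed.
    if [ "$status" -ne 0 ]; then
        cat "$log"
    fi
    return "$status"
}
```

A call such as `download_with_deadline 30 "$url" "$filePath"` then behaves like the lines above, and quoting the expansions keeps URLs or paths containing spaces or glob characters from being split or expanded by the shell.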
Comment thread e2e/scenario_test.go
Comment on lines +682 to +686
// Test_Ubuntu2204_CSETimeoutOnUnreachableDownload validates that the per-curl timeout (-k 5)
// in cse_helpers.sh _retry_file_curl_internal properly kills hung download attempts.
// It sets CustomKubeBinaryURL to a non-routable IP (RFC 5737 TEST-NET-1) which causes curl
// to hang until the timeout fires, and sets a short CSETimeout so the test completes quickly.
// The global timeout in cse_start.sh (timeout -k5s) kills the entire provisioning script.
@aks-node-assistant

Contributor

🕵️ AgentBaker Linux Gate Detective: Build 168131287 FAILED at Test, Scan, and Cleanup / build2004fipsgen2containerd. New signature trivy-db-mcr-unauthorized; infra-side, not caused by this PR.

TL;DR

The post-build VHD scan step runs trivy --scanners vuln rootfs -f json --db-repository mcr.microsoft.com/mirror/ghcr/aquasecurity/trivy-db:2,... on the built VHD. The Trivy DB download from mcr.microsoft.com consistently returns:

OCI repository error: GET https://mcrprod.azurecr.io/oauth2/token?scope=repository%3Amirror%2Fghcr%2Faquasecurity%2Ftrivy-db%3Apull&service=mcrprod.azurecr.io: UNAUTHORIZED: authentication required

Retried 10 times with 30s sleep, all 10 attempts fail identically. Downstream vhd-scanning.sh exited with code 1, blob ref not uploaded → "ERROR: The specified blob does not exist", Test, Scan, and Cleanup exits 2.

This PR fixes a CSE curl-hang bug — completely unrelated to MCR auth or Trivy.

3-level RCA

1. Surface symptom: The Test, Scan, and Cleanup task for SKU build2004fipsgen2containerd fails. Failed step: trivy vulnerability DB download (10 retries). Each retry returns HTTP 401 UNAUTHORIZED from the mcrprod.azurecr.io token endpoint. CorrelationIds are visible in the log (e.g. 5d0d06a2-aa1f-4a1b-9a3b-b16c62071247); surface these to ACR ops if needed. vhd-scanning.sh exits 1 and the build aborts.

2. Corroboration — Only one SKU job failed (build2004fipsgen2containerd); other parallel SKU builds appear to have completed without this issue, suggesting either an intermittent MCR auth glitch or per-job-agent identity propagation problem. No corresponding entry in the wiki source-of-truth — new signature.

3. Root-cause challenge — Strongest alternative: PR-caused CSE script regression. Why less likely: the PR (fix: prevent CSE hang when curl verbose output blocks on unstable disks) modifies cse_helpers.sh / cse_cmd.sh to redirect curl verbose output away from blocking IO. That code only runs on the booted VM during CSE, not during the VHD-builder agent's post-build Trivy scan. The agent's MCR call uses the agent's managed identity, which is unaffected by anything in the PR. Trivy's exit is at HTTP-auth time, not at runtime — clearly an infra-side auth/network failure.

Classification

  • Test infrastructure / VHD-builder agent MCR auth flakiness (new signature, first observed occurrence)
  • Wiki signature: trivy-db-mcr-unauthorized (Count → 1, new row)
  • Confidence: High that trivy DB download is the root cause; Medium about whether this is transient (per-agent MSI token) or persistent (MCR mirror access policy changed). Will be confirmed by next build on this PR or any other PR's build2004fipsgen2containerd job.

Recommended next action

  • For this PR: rerun the gate — 10 retries within one job aren't enough if the agent's managed-identity token to mcrprod.azurecr.io is genuinely cold. A fresh build should re-acquire the token.
  • Owner: AgentBaker E2E build infra — if it recurs, check the agent pool's managed-identity ACR-pull role assignment on mcrprod.azurecr.io/mirror/ghcr/aquasecurity/trivy-db and whether MCR's mirror policy recently changed to require auth where it previously was anonymous.

Posted by Clawpilot AgentBaker Linux Gate Detective Watcher.

@pdamianov-dev

Contributor Author

It was determined that this PR and the associated work item are not needed to address the issue, and that they do not fully match the outcome of the linked ticket.

4 participants