fix: prevent CSE hang when curl verbose output blocks on unstable disks#8711
fix: prevent CSE hang when curl verbose output blocks on unstable disks#8711pdamianov-dev wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens Linux CSE download/retry behavior to reduce the chance of CSE hangs when storage is unstable by moving curl verbose logging to tmpfs and making timeout enforcement more aggressive.
Changes:
- Redirects
CURL_OUTPUTto/dev/shm(tmpfs) when available, falling back to/tmp. - Updates retry helpers to use
timeout -k 5(SIGKILL escalation) and addscurl --max-timein_retry_file_curl_internal.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/cse_install.sh | Sets CURL_OUTPUT to prefer /dev/shm (tmpfs) over /tmp for verbose curl logs. |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Applies the same CURL_OUTPUT change and strengthens retry timeout behavior (timeout -k 5, curl --max-time). |
201e349 to
ce9efb7
Compare
| echo "timeout_args: $*" >> $CURL_OUTPUT | ||
| return 0 | ||
| } | ||
| touch /tmp/nonexistent | ||
| When call _retry_file_curl_internal 1 1 30 0 "/tmp/nonexistent" "https://dummy.url/file" "[ -f /tmp/nonexistent ]" |
| if [ -d "/dev/shm" ] && [ -w "/dev/shm" ]; then | ||
| CURL_OUTPUT=/dev/shm/curl_verbose.out | ||
| else | ||
| CURL_OUTPUT=/tmp/curl_verbose.out | ||
| fi |
ce9efb7 to
9d46a0f
Compare
9d46a0f to
ac7ebd9
Compare
| # /dev/shm is kernel-mounted tmpfs (CONFIG_TMPFS=y on all Azure images) and available on | ||
| # all VM SKUs including CVM and ARM64. Fallback to /tmp for non-Azure environments. | ||
| if [ -d "/dev/shm" ] && [ -w "/dev/shm" ]; then | ||
| CURL_OUTPUT=/dev/shm/curl_verbose.out |
There was a problem hiding this comment.
can we log something out when /dev/shm is selected?
|
🕵️ AgentBaker Linux Gate Detective — Build 168091952 FAILED at TL;DR
3-level RCA1. Surface symptom — Setup Cue log: 2. Corroboration — Identical pattern to build 168010239 on PR #8704 (gomega dep bump) — same Go-toolchain auto-download to 3. Root-cause challenge — Strongest alternative: PR-caused regression via the curl/CSE change. Why less likely: the PR touches CSE bash scripts ( Classification
Recommended next action
Evidence
Posted by Clawpilot AgentBaker Linux Gate Detective Watcher. |
ac7ebd9 to
50188b4
Compare
| fi | ||
|
|
||
| timeout $effectiveTimeout curl -fsSLv $url -o $filePath > $CURL_OUTPUT 2>&1 | ||
| timeout -k 5 "$effectiveTimeout" curl --max-time "$effectiveTimeout" -fsSLv "$url" -o "$filePath" > "$CURL_OUTPUT" 2>&1 |
- Add --max-time to curl in _retry_file_curl_internal so curl enforces its own deadline even if shell timeout cannot deliver signals. - Add -k 5 to timeout in _retry_file_curl_internal to escalate to SIGKILL if SIGTERM is ignored (e.g. process in D-state on disk). - Add ShellSpec test asserting -k 5 and --max-time flags are passed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
a03d4c9 to
554fd16
Compare
Validates that the timeout -k mechanism in cse_helpers.sh properly kills hung curl processes when a download URL is unreachable. Uses a non-routable IP (192.0.2.1, RFC 5737 TEST-NET-1) to cause curl to hang, and a short CSETimeout (90s) to verify the global timeout -k5s in cse_start.sh fires and terminates the provisioning script with exit code 124. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| timeout -k 5 $effectiveTimeout curl --max-time $effectiveTimeout -fsSLv $url -o $filePath > $CURL_OUTPUT 2>&1 | ||
| if [ "$?" -ne 0 ]; then | ||
| cat $CURL_OUTPUT | ||
| fi |
| // Test_Ubuntu2204_CSETimeoutOnUnreachableDownload validates that the per-curl timeout (-k 5) | ||
| // in cse_helpers.sh _retry_file_curl_internal properly kills hung download attempts. | ||
| // It sets CustomKubeBinaryURL to a non-routable IP (RFC 5737 TEST-NET-1) which causes curl | ||
| // to hang until the timeout fires, and sets a short CSETimeout so the test completes quickly. | ||
| // The global timeout in cse_start.sh (timeout -k5s) kills the entire provisioning script. |
|
🕵️ AgentBaker Linux Gate Detective — Build 168131287 FAILED at TL;DRThe post-build VHD scan step runs Retried 10 times with 30s sleep, all 10 attempts fail identically. Downstream This PR fixes a CSE curl-hang bug — completely unrelated to MCR auth or Trivy. 3-level RCA1. Surface symptom — 2. Corroboration — Only one SKU job failed ( 3. Root-cause challenge — Strongest alternative: PR-caused CSE script regression. Why less likely: the PR ( Classification
Recommended next action
Evidence
Posted by Clawpilot AgentBaker Linux Gate Detective Watcher. |
|
It was determined that this PR and the associated work item were not needed to address the issue and did not fully match the result of the linked ticket |
What this PR does / why we need it:
Which issue(s) this PR fixes:
Bug 36680094: [Repair Item] Improve CSE script to handle curl timeouts and prevent blocking by redirecting verbose output away from unstable disks and enforcing strict timeout with forced kill signals.
Fixes #