Fix flaky CI: fast test scratch (/ci) + UFFD orphan reaping + transient vsock retry#286
Fix flaky CI: fast test scratch (/ci) + UFFD orphan reaping + transient vsock retry#286rgarcia wants to merge 3 commits into
Conversation
TestFCUFFDOneShotLifecycle (Firecracker + UFFD snapshot-pager) flakes under shared-runner I/O contention: the post-restore in-guest exec over vsock intermittently returns EOF / gRPC Unavailable. Worse, when the test aborts it leaks the Firecracker guest, and an orphaned UFFD guest whose pager has exited spins its vCPU at 100% forever (no pager to service page faults), permanently burning a core per flake and worsening contention. The leak happens because test cleanup only killed meta.HypervisorPID, which is deliberately set to nil during standby/fork/restore transitions -- so a failure in that window left the guest unreaped. Fixes: 1. Robust orphan reaping (lib/instances/manager_test.go): cleanupOrphanedProcesses now also scans /proc/*/cmdline for any firecracker / cloud-hypervisor / qemu-system* / hypeman-uffd-pager process whose command line references this test's t.TempDir() data dir, and SIGKILLs them. Matching is scoped to the exact per-test data-dir prefix so it can never touch a sibling parallel test's guests. 2. Teardown ordering (lib/instances/manager_test.go): The reaper kills guests before pagers, so an orphaned UFFD guest is never stranded without a pager (which is what causes the vCPU spin). 3. Tolerate transient vsock errors (lib/instances/firecracker_test.go, compression_integration_linux_test.go): assertGuestFile and the in-guest exec inside requireRunningSleepInstance now use execCommandWithRetry, which retries ONLY transient connection errors (EOF / Unavailable / "client connection is closing" / etc.), never real assertion failures. 4. Production-safety guard (deferred, lib/instances/hypervisor_linux.go): Added a detailed TODO for terminating a UFFD guest when its pager is lost. The pager is a single shared, version-keyed systemd service (not per-guest) and there is no existing pager-health watchdog or pager->guest map, so a correct fix needs a new monitoring loop; doing it carelessly risks killing healthy guests on a benign pager restart. Deferred per "correctness over completeness"; the test-side reaper mitigates the CI flake today. Tested via gofmt + go vet (linux and darwin); integration tests require root+linux+network+KVM and run on the CI runner. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Created a monitoring plan for this PR. What this PR does: Fixes a CI test flake where leaked Firecracker/UFFD guests from aborted tests were permanently burning CPU on the shared runner, compounding I/O contention and causing more flakes. This is a test-infrastructure fix with no production code changes. Intended effect:
Risks:
Status updates will be posted automatically on this PR as monitoring progresses. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Data-dir substring kills sibling tests
- Updated cmdline directory matching to require a path-boundary after the matched dir and added coverage proving sibling prefix paths like /0010 no longer match /001.
Or push these changes by commenting:
@cursor push 7779d03804
Preview (7779d03804)
diff --git a/lib/instances/manager_test.go b/lib/instances/manager_test.go
--- a/lib/instances/manager_test.go
+++ b/lib/instances/manager_test.go
@@ -306,22 +306,75 @@
// cmdlineReferencesDir reports whether any argv token is, or lives under, dir.
func cmdlineReferencesDir(argv []string, dir string) bool {
+ dir = filepath.Clean(dir)
for _, tok := range argv {
tok = strings.TrimSpace(tok)
if tok == "" {
continue
}
// Tokens may embed the path inside larger strings (e.g. qemu's
- // "socket,...,path=/tmp/.../qmp.sock"), so a substring check on the
- // cleaned prefix is the robust match. The prefix is a unique per-test
- // t.TempDir() root, so substring matching cannot collide across tests.
- if strings.Contains(tok, dir) {
- return true
+ // "socket,...,path=/tmp/.../qmp.sock"), so scan for occurrences and
+ // require a path boundary at the end of the matched directory.
+ for start := 0; start < len(tok); {
+ idx := strings.Index(tok[start:], dir)
+ if idx < 0 {
+ break
+ }
+ idx += start
+ end := idx + len(dir)
+ if end == len(tok) || (end < len(tok) && os.IsPathSeparator(tok[end])) {
+ return true
+ }
+ start = idx + 1
}
}
return false
}
+func TestCmdlineReferencesDir(t *testing.T) {
+ t.Parallel()
+
+ dir := filepath.Clean(filepath.Join(string(os.PathSeparator), "tmp", "hypeman", "001"))
+
+ tests := []struct {
+ name string
+ argv []string
+ want bool
+ }{
+ {
+ name: "exact token",
+ argv: []string{"qemu-system-x86_64", dir},
+ want: true,
+ },
+ {
+ name: "embedded under dir",
+ argv: []string{
+ "cloud-hypervisor",
+ "--api-sock",
+ "socket,id=qmp,path=" + filepath.Join(dir, "instances", "vm-1", "qmp.sock"),
+ },
+ want: true,
+ },
+ {
+ name: "sibling dir with prefix is not a match",
+ argv: []string{
+ "cloud-hypervisor",
+ "--api-sock",
+ "socket,id=qmp,path=" + filepath.Join(string(os.PathSeparator), "tmp", "hypeman", "0010", "instances", "vm-1", "qmp.sock"),
+ },
+ want: false,
+ },
+ }
+
+ for _, tt := range tests {
+ tt := tt
+ t.Run(tt.name, func(t *testing.T) {
+ t.Parallel()
+ assert.Equal(t, tt.want, cmdlineReferencesDir(tt.argv, dir))
+ })
+ }
+}
+
func isGuestProcess(exe string) bool {
switch {
case exe == "firecracker":You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit 2945399. Configure here.
| } | ||
| } | ||
| return false | ||
| } |
There was a problem hiding this comment.
Data-dir substring kills sibling tests
Medium Severity
cmdlineReferencesDir treats any argv token containing the cleaned data-dir string as a match. When one parallel test’s t.TempDir() path is a strict prefix of another’s (e.g. …/001 vs …/0010), cleanup for the shorter path can SIGKILL hypervisor or UFFD pager processes whose cmdlines reference the longer path, breaking still-running sibling tests.
Reviewed by Cursor Bugbot for commit 2945399. Configure here.
Point the Linux test job's TMPDIR at /ci, the dedicated unencrypted OS-SSD scratch provisioned by the dev_ci_scratch Ansible role (kernel/infra #616). VM snapshot/fork/restore I/O all flows through $TMPDIR; today that is /tmp, bind-mounted onto the encrypted RAID5 data volume which delivers ~158 MB/s idle and as low as ~23 MB/s under concurrent CI load — squarely in the range that starves fork/restore and produces the flaky "did not reach Running within 45s" / vsock-EOF timeouts. /ci is on a separate disk (~400-500 MB/s) and cannot be starved by data_crypt contention. A controlled cgroup io.max experiment confirmed I/O causation (dose-response; CPU-throttle control passes), and /ci measured 406 MB/s vs /tmp 23 MB/s under live load. /ci is also shorter than /tmp, so it stays well under the 108-char AF_UNIX sun_path limit for VMM control sockets (unlike a /tmp/hci$N prefix). Requires the passthrough of TMPDIR through the test-linux sudo env, otherwise sudo strips it and os.TempDir() falls back to /tmp. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The image build symlinks the prewarm oci-cache into each per-test cache and
builds rootfs there; those operations are not cross-filesystem safe. With
TMPDIR=/ci but the prewarm cache on $HOME (data_crypt), the build spans two
filesystems and mkfs.erofs fails ("failed to opendir ... rootfs"). /tmp worked
only because it is on the same filesystem (data_crypt) as the prewarm cache.
Point HYPEMAN_TEST_PREWARM_DIR at /ci/prewarm/linux-amd64 so prewarm and test
scratch share one (fast) filesystem again.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>



Addresses the flaky failures on the self-hosted Linux runner (
deft-kernel-dev) —instance ... did not reach Running within 45s,vm.restore ... Client.Timeout exceeded,vsock ... EOF— which are environmental (disk-I/O contention) plus a test-harness orphan leak, not product bugs.Root cause (proven)
The integration tests are heavily I/O-bound on VM snapshot/fork/restore, all of which flow through
$TMPDIR. On the runner,/tmpis bind-mounted onto the encrypted RAID5 data volume (data_crypt), which delivers ~158 MB/s idle and ~23 MB/s under concurrent CI load. A controlled cgroupio.maxexperiment confirmed causation: throttling I/O reproduces the failures with a clean dose-response (10 MB/s → 6/6 fail), while a CPU-throttle control passes 6/6. Separately, when a test is SIGKILLed/times out during a standby/fork/restore transition, its Firecracker guest is leaked and — once its UFFD pager exits — spins a vCPU at 100% forever, compounding host contention.Changes
test.ymlsetsTMPDIR=/ci, a dedicated unencrypted OS-SSD scratch (~400–500 MB/s, isolated fromdata_crypt/Docker/prod) provisioned by thedev_ci_scratchAnsible role (kernel/infra #616, merged + applied to deft). Measured under live load:/ci406 MB/s vs/tmp23 MB/s.MakefilepassesTMPDIRthrough thetest-linuxsudo env(otherwise sudo strips it)./ciis shorter than/tmp, so it stays under the 108-charAF_UNIXsun_pathlimit for VMM control sockets (unlike a/tmp/hci$Nprefix).cleanupOrphanedProcessesnow also scans/procforfirecracker/cloud-hypervisor/hypeman-uffd-pagerwhose cmdline is under the test'st.TempDir(), killing guests before pagers so an orphaned UFFD guest is never left to spin. Catches the PID-nil window wheremeta.HypervisorPIDis cleared during standby/fork/restore.assertGuestFile/requireRunningSleepInstanceuseexecCommandWithRetry, retrying only transient vsock/gRPC connection errors (EOF/Unavailable/closing), never real assertion failures.(Fix 4, a production "kill guest if its pager dies" guard, is left as a documented TODO — it needs a real pager-health watchdog and is risky to do narrowly; the test-side reaper mitigates CI today.)
Dependency / rollout
TMPDIR=/cirequires kernel/infra #616 applied on the runner (done —/ciexists, hourly reaper active). Safe ordering: this can merge now since/ciis already live.Testing
gofmt/vet/build pass. FCUFFD passed 3/3 across reruns pre-/ci. Fork-test on/civs/tmpat idle: both pass,/cifaster (22.6s vs 25.5s median); the under-load win is carried by the cgroup experiment + the 406-vs-23 MB/s throughput measurement (the heavy-load window didn't naturally recur during the fork-test run).🤖 Generated with Claude Code
Note
Low Risk
Changes are limited to CI workflow, Makefile test env, integration test cleanup/retry logic, and a comment-only production TODO—no runtime product behavior changes.
Overview
Reduces flaky Linux CI on the self-hosted KVM runner by moving VM scratch and prewarm cache onto
/ci(TMPDIR+HYPEMAN_TEST_PREWARM_DIR) so snapshot/rootfs work stays on the same fast filesystem, and by forwardingTMPDIRthroughsudoinmake test-linux.Test harness hardening:
cleanupOrphanedProcessesnow scans/procfor hypervisor and UFFD-pager processes tied to the test’s data dir (guests before pagers), covering leaks when metadata PIDs are cleared mid standby/fork/restore. Integration tests retry in-guest exec only on transient vsock/gRPC errors viaisTransientExecError/execCommandWithRetry(including Firecracker UFFD paths).A production UFFD orphan-kill guard is documented as a deferred TODO in
hypervisor_linux.go; CI relies on the expanded reaper instead.Reviewed by Cursor Bugbot for commit 1520cb9. Bugbot is set up for automated code reviews on this repo. Configure here.