Skip to content

Fix flaky CI: fast test scratch (/ci) + UFFD orphan reaping + transient vsock retry#286

Open
rgarcia wants to merge 3 commits into
mainfrom
fix/uffd-test-orphan-reap
Open

Fix flaky CI: fast test scratch (/ci) + UFFD orphan reaping + transient vsock retry#286
rgarcia wants to merge 3 commits into
mainfrom
fix/uffd-test-orphan-reap

Conversation

@rgarcia

@rgarcia rgarcia commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Addresses the flaky failures on the self-hosted Linux runner (deft-kernel-dev) — instance ... did not reach Running within 45s, vm.restore ... Client.Timeout exceeded, vsock ... EOF — which are environmental (disk-I/O contention) plus a test-harness orphan leak, not product bugs.

Root cause (proven)

The integration tests are heavily I/O-bound on VM snapshot/fork/restore, all of which flow through $TMPDIR. On the runner, /tmp is bind-mounted onto the encrypted RAID5 data volume (data_crypt), which delivers ~158 MB/s idle and ~23 MB/s under concurrent CI load. A controlled cgroup io.max experiment confirmed causation: throttling I/O reproduces the failures with a clean dose-response (10 MB/s → 6/6 fail), while a CPU-throttle control passes 6/6. Separately, when a test is SIGKILLed/times out during a standby/fork/restore transition, its Firecracker guest is leaked and — once its UFFD pager exits — spins a vCPU at 100% forever, compounding host contention.

Changes

  1. Run tests on a fast scratch disktest.yml sets TMPDIR=/ci, a dedicated unencrypted OS-SSD scratch (~400–500 MB/s, isolated from data_crypt/Docker/prod) provisioned by the dev_ci_scratch Ansible role (kernel/infra #616, merged + applied to deft). Measured under live load: /ci 406 MB/s vs /tmp 23 MB/s. Makefile passes TMPDIR through the test-linux sudo env (otherwise sudo strips it). /ci is shorter than /tmp, so it stays under the 108-char AF_UNIX sun_path limit for VMM control sockets (unlike a /tmp/hci$N prefix).
  2. Reap orphaned guests in test cleanupcleanupOrphanedProcesses now also scans /proc for firecracker/cloud-hypervisor/hypeman-uffd-pager whose cmdline is under the test's t.TempDir(), killing guests before pagers so an orphaned UFFD guest is never left to spin. Catches the PID-nil window where meta.HypervisorPID is cleared during standby/fork/restore.
  3. Tolerate transient post-restore exec blipsassertGuestFile/requireRunningSleepInstance use execCommandWithRetry, retrying only transient vsock/gRPC connection errors (EOF/Unavailable/closing), never real assertion failures.

(Fix 4, a production "kill guest if its pager dies" guard, is left as a documented TODO — it needs a real pager-health watchdog and is risky to do narrowly; the test-side reaper mitigates CI today.)

Dependency / rollout

TMPDIR=/ci requires kernel/infra #616 applied on the runner (done — /ci exists, hourly reaper active). Safe ordering: this can merge now since /ci is already live.

Testing

gofmt/vet/build pass. FCUFFD passed 3/3 across reruns pre-/ci. Fork-test on /ci vs /tmp at idle: both pass, /ci faster (22.6s vs 25.5s median); the under-load win is carried by the cgroup experiment + the 406-vs-23 MB/s throughput measurement (the heavy-load window didn't naturally recur during the fork-test run).

🤖 Generated with Claude Code


Note

Low Risk
Changes are limited to CI workflow, Makefile test env, integration test cleanup/retry logic, and a comment-only production TODO—no runtime product behavior changes.

Overview
Reduces flaky Linux CI on the self-hosted KVM runner by moving VM scratch and prewarm cache onto /ci (TMPDIR + HYPEMAN_TEST_PREWARM_DIR) so snapshot/rootfs work stays on the same fast filesystem, and by forwarding TMPDIR through sudo in make test-linux.

Test harness hardening: cleanupOrphanedProcesses now scans /proc for hypervisor and UFFD-pager processes tied to the test’s data dir (guests before pagers), covering leaks when metadata PIDs are cleared mid standby/fork/restore. Integration tests retry in-guest exec only on transient vsock/gRPC errors via isTransientExecError / execCommandWithRetry (including Firecracker UFFD paths).

A production UFFD orphan-kill guard is documented as a deferred TODO in hypervisor_linux.go; CI relies on the expanded reaper instead.

Reviewed by Cursor Bugbot for commit 1520cb9. Bugbot is set up for automated code reviews on this repo. Configure here.

TestFCUFFDOneShotLifecycle (Firecracker + UFFD snapshot-pager) flakes under
shared-runner I/O contention: the post-restore in-guest exec over vsock
intermittently returns EOF / gRPC Unavailable. Worse, when the test aborts it
leaks the Firecracker guest, and an orphaned UFFD guest whose pager has exited
spins its vCPU at 100% forever (no pager to service page faults), permanently
burning a core per flake and worsening contention.

The leak happens because test cleanup only killed meta.HypervisorPID, which is
deliberately set to nil during standby/fork/restore transitions -- so a failure
in that window left the guest unreaped.

Fixes:

1. Robust orphan reaping (lib/instances/manager_test.go):
   cleanupOrphanedProcesses now also scans /proc/*/cmdline for any
   firecracker / cloud-hypervisor / qemu-system* / hypeman-uffd-pager process
   whose command line references this test's t.TempDir() data dir, and SIGKILLs
   them. Matching is scoped to the exact per-test data-dir prefix so it can
   never touch a sibling parallel test's guests.

2. Teardown ordering (lib/instances/manager_test.go):
   The reaper kills guests before pagers, so an orphaned UFFD guest is never
   stranded without a pager (which is what causes the vCPU spin).

3. Tolerate transient vsock errors (lib/instances/firecracker_test.go,
   compression_integration_linux_test.go):
   assertGuestFile and the in-guest exec inside requireRunningSleepInstance now
   use execCommandWithRetry, which retries ONLY transient connection errors
   (EOF / Unavailable / "client connection is closing" / etc.), never real
   assertion failures.

4. Production-safety guard (deferred, lib/instances/hypervisor_linux.go):
   Added a detailed TODO for terminating a UFFD guest when its pager is lost.
   The pager is a single shared, version-keyed systemd service (not per-guest)
   and there is no existing pager-health watchdog or pager->guest map, so a
   correct fix needs a new monitoring loop; doing it carelessly risks killing
   healthy guests on a benign pager restart. Deferred per "correctness over
   completeness"; the test-side reaper mitigates the CI flake today.

Tested via gofmt + go vet (linux and darwin); integration tests require
root+linux+network+KVM and run on the CI runner.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@firetiger-agent

Copy link
Copy Markdown

Created a monitoring plan for this PR.

What this PR does: Fixes a CI test flake where leaked Firecracker/UFFD guests from aborted tests were permanently burning CPU on the shared runner, compounding I/O contention and causing more flakes. This is a test-infrastructure fix with no production code changes.

Intended effect:

  • Test workflow pass rate: baseline 50–69% failure (week of Jun 1–8); confirmed improving if failure rate drops and holds below 50% over the next 7 days after merge.
  • Orphan reap log lines: "Cleaning up orphaned guest/uffd pager process" messages should appear in test output on flaky runs instead of silent CPU leaks that degrade the runner.

Risks:

  • Reaper over-broad match/proc cmdline substring scan could in theory match a sibling test's guest; alert if Test failure rate stays above 60% for 3+ consecutive days post-merge and logs show unexpected instance-not-found or mid-test SIGKILL errors.
  • Retry masking real failuresisTransientExecError string matching could hide a genuine assertion failure that produces a matching error message; alert if new failures surface only as context deadline exceeded on tests expected to fail fast.

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Data-dir substring kills sibling tests
    • Updated cmdline directory matching to require a path-boundary after the matched dir and added coverage proving sibling prefix paths like /0010 no longer match /001.

Create PR

Or push these changes by commenting:

@cursor push 7779d03804
Preview (7779d03804)
diff --git a/lib/instances/manager_test.go b/lib/instances/manager_test.go
--- a/lib/instances/manager_test.go
+++ b/lib/instances/manager_test.go
@@ -306,22 +306,75 @@
 
 // cmdlineReferencesDir reports whether any argv token is, or lives under, dir.
 func cmdlineReferencesDir(argv []string, dir string) bool {
+	dir = filepath.Clean(dir)
 	for _, tok := range argv {
 		tok = strings.TrimSpace(tok)
 		if tok == "" {
 			continue
 		}
 		// Tokens may embed the path inside larger strings (e.g. qemu's
-		// "socket,...,path=/tmp/.../qmp.sock"), so a substring check on the
-		// cleaned prefix is the robust match. The prefix is a unique per-test
-		// t.TempDir() root, so substring matching cannot collide across tests.
-		if strings.Contains(tok, dir) {
-			return true
+		// "socket,...,path=/tmp/.../qmp.sock"), so scan for occurrences and
+		// require a path boundary at the end of the matched directory.
+		for start := 0; start < len(tok); {
+			idx := strings.Index(tok[start:], dir)
+			if idx < 0 {
+				break
+			}
+			idx += start
+			end := idx + len(dir)
+			if end == len(tok) || (end < len(tok) && os.IsPathSeparator(tok[end])) {
+				return true
+			}
+			start = idx + 1
 		}
 	}
 	return false
 }
 
+func TestCmdlineReferencesDir(t *testing.T) {
+	t.Parallel()
+
+	dir := filepath.Clean(filepath.Join(string(os.PathSeparator), "tmp", "hypeman", "001"))
+
+	tests := []struct {
+		name string
+		argv []string
+		want bool
+	}{
+		{
+			name: "exact token",
+			argv: []string{"qemu-system-x86_64", dir},
+			want: true,
+		},
+		{
+			name: "embedded under dir",
+			argv: []string{
+				"cloud-hypervisor",
+				"--api-sock",
+				"socket,id=qmp,path=" + filepath.Join(dir, "instances", "vm-1", "qmp.sock"),
+			},
+			want: true,
+		},
+		{
+			name: "sibling dir with prefix is not a match",
+			argv: []string{
+				"cloud-hypervisor",
+				"--api-sock",
+				"socket,id=qmp,path=" + filepath.Join(string(os.PathSeparator), "tmp", "hypeman", "0010", "instances", "vm-1", "qmp.sock"),
+			},
+			want: false,
+		},
+	}
+
+	for _, tt := range tests {
+		tt := tt
+		t.Run(tt.name, func(t *testing.T) {
+			t.Parallel()
+			assert.Equal(t, tt.want, cmdlineReferencesDir(tt.argv, dir))
+		})
+	}
+}
+
 func isGuestProcess(exe string) bool {
 	switch {
 	case exe == "firecracker":

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit 2945399. Configure here.

}
}
return false
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data-dir substring kills sibling tests

Medium Severity

cmdlineReferencesDir treats any argv token containing the cleaned data-dir string as a match. When one parallel test’s t.TempDir() path is a strict prefix of another’s (e.g. …/001 vs …/0010), cleanup for the shorter path can SIGKILL hypervisor or UFFD pager processes whose cmdlines reference the longer path, breaking still-running sibling tests.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 2945399. Configure here.

Point the Linux test job's TMPDIR at /ci, the dedicated unencrypted OS-SSD
scratch provisioned by the dev_ci_scratch Ansible role (kernel/infra #616).
VM snapshot/fork/restore I/O all flows through $TMPDIR; today that is /tmp,
bind-mounted onto the encrypted RAID5 data volume which delivers ~158 MB/s
idle and as low as ~23 MB/s under concurrent CI load — squarely in the range
that starves fork/restore and produces the flaky "did not reach Running
within 45s" / vsock-EOF timeouts. /ci is on a separate disk (~400-500 MB/s)
and cannot be starved by data_crypt contention.

A controlled cgroup io.max experiment confirmed I/O causation (dose-response;
CPU-throttle control passes), and /ci measured 406 MB/s vs /tmp 23 MB/s under
live load. /ci is also shorter than /tmp, so it stays well under the 108-char
AF_UNIX sun_path limit for VMM control sockets (unlike a /tmp/hci$N prefix).

Requires the passthrough of TMPDIR through the test-linux sudo env, otherwise
sudo strips it and os.TempDir() falls back to /tmp.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rgarcia rgarcia changed the title Reap orphaned UFFD guests in tests; tolerate transient post-restore exec Fix flaky CI: fast test scratch (/ci) + UFFD orphan reaping + transient vsock retry Jun 10, 2026
The image build symlinks the prewarm oci-cache into each per-test cache and
builds rootfs there; those operations are not cross-filesystem safe. With
TMPDIR=/ci but the prewarm cache on $HOME (data_crypt), the build spans two
filesystems and mkfs.erofs fails ("failed to opendir ... rootfs"). /tmp worked
only because it is on the same filesystem (data_crypt) as the prewarm cache.

Point HYPEMAN_TEST_PREWARM_DIR at /ci/prewarm/linux-amd64 so prewarm and test
scratch share one (fast) filesystem again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant