Skip to content

test: de-flake TestUpstreamManagerDetectsChromiumAndRestart#284

Closed
hiroTamada wants to merge 2 commits into
mainfrom
hypeship/deflake-devtoolsproxy-chromium-test
Closed

test: de-flake TestUpstreamManagerDetectsChromiumAndRestart#284
hiroTamada wants to merge 2 commits into
mainfrom
hypeship/deflake-devtoolsproxy-chromium-test

Conversation

@hiroTamada

@hiroTamada hiroTamada commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

TestUpstreamManagerDetectsChromiumAndRestart (lib/devtoolsproxy) is flaky in two distinct modes — it reproduces locally at ~20% (3/15 runs) and has been failing CI runs on main and PRs (including several attempts on #280's run, and runs on other branches the same day):

  1. Teardown race — the test kills the chromium parent process (Process.Kill()), but chromium's renderer/crashpad children outlive it and keep writing the --user-data-dir while Go's t.TempDir() cleanup deletes it → TempDir RemoveAll cleanup: unlinkat .../Default: directory not empty. This is the dominant mode locally.
  2. Detection timeout — the 20s waitForCondition for the DevTools endpoint is too tight on loaded CI runners; chromium starts but hasn't printed DevTools listening within the window (log shows only dbus noise).

Fix

  • Start chromium in its own process group (Setpgid, same pattern as lib/recorder/ffmpeg.go and cmd/api/api/process.go) and kill the whole group on cleanup via a small killProcessGroup helper, replacing the three parent-only kill sites.
  • Raise the two detection waits from 20s to a shared upstreamDetectTimeout = 60s. A genuinely hung chromium still fails (with the existing log dump), slow runners no longer do.

Test-only change; no production code touched.

Verification

  • Before: 3/15 local runs fail (teardown race).
  • After: 60/60 local runs pass (-count=30 twice); full lib/devtoolsproxy package suite passes.

🤖 Generated with Claude Code


Note

Low Risk
Test-only changes in proxy_test.go; no production code paths affected.

Overview
Stabilizes TestUpstreamManagerDetectsChromiumAndRestart in proxy_test.go by fixing Chromium teardown and loosening DevTools detection timing on slow CI.

Chromium is started with Setpgid and teardown uses a new killProcessGroup helper (SIGKILL on the process group + Wait) instead of killing only the parent—avoiding leftover renderer/crashpad processes racing t.TempDir() cleanup. The two upstream-detection waits share upstreamDetectTimeout = 60s (was 20s).

Reviewed by Cursor Bugbot for commit 4566e1c. Bugbot is set up for automated code reviews on this repo. Configure here.

Two flake modes: chromium's renderer/crashpad children outlive
Process.Kill() on the parent and keep writing the user-data-dir while
t.TempDir() cleanup deletes it (TempDir RemoveAll: directory not
empty), and the 20s devtools-detection wait is too tight on loaded CI
runners. Start chromium in its own process group and kill the whole
group on cleanup, and raise the detection wait to 60s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@hiroTamada hiroTamada marked this pull request as ready for review June 10, 2026 17:24
@firetiger-agent

Copy link
Copy Markdown

Created a monitoring plan for this PR.

What this PR does: Fixes a flaky CI test in the devtools proxy package; no production code is changed.

Intended effect:

  • TestUpstreamManagerDetectsChromiumAndRestart pass rate: baseline ~80% (3/15 local runs failing); confirmed if it reaches 100% across repeated CI runs with no teardown-race or timeout failures.
  • CI suite wall time for lib/devtoolsproxy: no significant regression expected; the longer 60 s timeout only fires when Chromium hangs, which is the failure mode being fixed.

Risks:

  • Orphaned process group (misuse)Kill(-pid, SIGKILL) without Setpgid on the target could affect the caller's process group; alert if any CI job hangs or exits abnormally after calling killProcessGroup outside the startChromium helper path.
  • 60 s timeout inflates suite time on genuine hangs — if Chromium consistently fails to start on CI, the test now takes up to 120 s before failing instead of 40 s; alert if lib/devtoolsproxy suite runtime exceeds 5 min on CI.
  • Platform portabilitySetpgid / syscall.Kill(-pid) are POSIX-only; alert if CI is ever extended to Windows runners and this package fails to compile.

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

@hiroTamada hiroTamada requested a review from rgarcia June 10, 2026 17:35

@rgarcia rgarcia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed — solid de-flake, the Setpgid + process-group kill is the right fix for the teardown race, and the verification (60/60) is convincing. approving.

one thing worth a look (non-blocking):

worth-a-look

  • server/lib/devtoolsproxy/proxy_test.go:357 (+ defer at :330) — cmd1 is killed twice: explicitly at 357 to restart, then again via the defer killProcessGroup(cmd1). the original code had the same double-kill, but the deferred cmd1.Process.Kill() was a safe no-op — go's os.Process.Kill returns ErrProcessDone after Wait() and skips the syscall. the new helper uses raw syscall.Kill(-pid, SIGKILL), which has no such guard, so the deferred call issues a real kill(-pid) against an already-reaped PID/PGID. if that PGID got recycled in the window, you'd SIGKILL an unrelated group. probability is very low (needs PGID reuse within the test, higher in a container with a small pid_max), but it's a behavior change from the prior no-op. consider a killed-flag/sync.Once in the helper, or a closure defer — note defer killProcessGroup(cmd1) captures the arg at 330, so reassigning cmd1 = nil later won't help; it'd need defer func(){ killProcessGroup(cmd1) }().

The deferred kill after an explicit kill+reap re-signalled the freed
PGID, which could hit a recycled process group. Guard on ProcessState
(set by Wait) so a reaped cmd is never signalled again — restoring the
no-op-after-Wait semantics the previous os.Process.Kill provided.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@hiroTamada hiroTamada closed this Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants