fix(e2e): improve e2e test reliability (16 fixes) by timmy-wright · Pull Request #8707 · Azure/AgentBaker

timmy-wright · 2026-06-15T07:11:02Z

E2E Reliability Improvements

This PR implements 16 targeted reliability fixes to the AgentBaker e2e test suite identified through a comprehensive code review.

Critical fixes

Do not memoize transient errors in cachedFunc (prevents suite-wide poisoning)
Avoid mutating global DefaultVMSKU under parallel tests (fixes data race)
Wait for VMSS deletion and register cleanup immediately after create (prevents orphans)
Propagate context cancellation through SSH exec and bastion paths

Systematic flakiness fixes

Retry transient API errors in wait/poll loops instead of failing immediately
Replace fixed sleeps with polling loops in validators
Add jitter to retry delays to prevent synchronized storms
Fix WaitUntilPodRunning to not accept PodSucceeded; add WaitUntilPodCompleted
Treat ssh.ExitMissingError as failure, not success

Correctness & debuggability

Remove nested t.Parallel() to reduce isolation pressure
Implement real deep copy in copyScenario
Generate unique pod names per retry attempt
Fix sysctl arg slice init (leading empty strings)
Fail Windows version validator when settings unreadable
Require full type+reason+message match in validateNPDCondition
Add cycle detection to DAG

Closes AB#38423358

Copilot

Pull request overview

This PR targets reliability and debuggability improvements across the AgentBaker e2e test suite, focusing on reducing flakiness (transient Azure/K8s errors, fixed sleeps) and improving cleanup/cancellation behavior.

Changes:

Improve polling/retry behavior (transient error handling, jittered backoffs, replace some fixed sleeps with polling).
Harden VMSS lifecycle handling (earlier cleanup registration, wait for VMSS deletion completion, avoid mutating global VM SKU under parallelism via per-scenario VMSize).
Improve correctness primitives (pod readiness semantics, NPD condition matching/debug output, DAG cycle detection, SSH exec context cancellation propagation).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
e2e/vmss.go	Registers cleanup earlier, waits for VMSS deletion to complete, and uses per-scenario VM size in the VMSS model.
e2e/validators.go	Fixes sysctl key slice init, replaces restart sleep with polling, strengthens NPD validation logging, hardens Windows settings read, and improves transient retry logging.
e2e/validation.go	Adds per-retry unique pod naming and improves resource polling cancellation diagnostics.
e2e/types.go	Adds `ScenarioRuntime.VMSize` to avoid global DefaultVMSKU mutation.
e2e/test_helpers.go	Removes top-level parallelization, implements a deeper scenario copy for parallel subtests, and computes/stores an effective VM SKU per scenario.
e2e/kube.go	Splits pod wait into running vs completed modes, adds transient error handling, and uses container readiness checks.
e2e/exec.go	Propagates context into SCP timeout, supports SSH exec cancellation, treats ExitMissingError as failure, and adds jitter to pod-exec retries.
e2e/dag/dag.go	Adds runtime cycle detection for explicitly declared dependencies.
e2e/cluster.go	Adds transient error handling in cluster readiness polling and jitter to cluster creation retries.
e2e/cache.go	Avoids memoizing transient errors by caching only successful results.
e2e/bastionssh.go	Propagates context through token/session creation and adds jitter to bastion SSH dial retries.

cachedFunc now caches only successful results so transient first-call failures do not poison later callers sharing the same key. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

prepareAKSNode now keeps the fallback VM size local to the scenario so parallel tests do not race on the global default SKU. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…er create VMSS cleanup is now registered as soon as creation starts and delete waits for completion so failed setups do not leave orphaned scale sets behind. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… paths SSH command execution and bastion setup now honor caller deadlines so canceled tests stop remote work instead of hanging behind background contexts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…iling immediately Polling helpers now log transient Kubernetes and ARM read failures and keep retrying until timeout instead of aborting on the first hiccup. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…and GPU validators Service restart and GPU scheduling validators now poll real readiness signals instead of relying on fixed sleeps that flake under load. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…d storms Application-level retry loops now add randomized backoff so parallel failures do not hammer the same endpoints in lockstep. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…aitUntilPodCompleted Pod wait helpers now keep running and completed success states separate so completed one-shot pods cannot satisfy running readiness checks. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

SSH channel closures without an exit status now surface as explicit failures instead of being misreported as successful command execution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…on pressure RunScenario no longer marks the parent test parallel so only one subtest layer fans out and shared e2e state sees less contention. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…l subtests copyScenario now clones mutable scenario slices and pointer-backed state so parallel subtests stop sharing the same scenario internals by accident. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…Exists conflicts Pod validation retries now create uniquely named pods so a slow delete from a prior attempt cannot mask the real failure with AlreadyExists. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…uments ValidateSysctlConfig now builds its sysctl argument list without prefilled empty entries so the command checks only the requested keys. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…e or version empty The Windows version validator now stops on unreadable settings or empty expected versions instead of passing vacuously on an empty contains check. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Node problem detector validation now succeeds only on the intended condition tuple and reports the last observed condition when polling times out. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…iring DAG task registration now rejects explicit dependency cycles up front so bad wiring fails fast instead of deadlocking Wait forever. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

e2e/cluster.go:497

math/rand is never seeded in this repo, so the jitter here is deterministic across processes (default seed = 1). That reduces the benefit of jitter for de-synchronizing retries in CI runs. Consider using a per-call RNG seeded with time, or a package-level RNG seeded once at startup.

	// Note, it seems like the operation still can start a trigger a new operation even if nothing has changes
	pollerResp, err := config.Azure.AKS.BeginCreateOrUpdate(
		ctx,
		config.ResourceGroupName(*cluster.Location),
		*cluster.Name,
		*cluster,

- validators.go: add backoff sleep before continue on transient node GET error to prevent tight-loop hammering the API server - validation.go: guard empty baseName in podNameForAttempt to avoid DNS-1123 invalid names starting with '-' - validators.go: replace bash-specific [[ ]] with POSIX [ ] in ServiceCanRestartValidator SSH script

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

- cache.go: update doc comment to accurately describe success-only caching semantics (errors are retried, not memoized) - cache_test.go: replace Test_cachedFunc_caches_errors with Test_cachedFunc_retries_on_error and add Test_cachedFunc_caches_after_success_following_error to match new behavior - kube.go: add isPodReadyCondition helper; check pod-level PodReady condition in WaitUntilPodRunning to capture readiness gates and minReadySeconds, not just per-container readiness

Copilot AI review requested due to automatic review settings June 15, 2026 07:11

timmy-wright requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, ganeshkumarashok, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, runzhen, sulixu, surajssd and zachary-bailey as code owners June 15, 2026 07:11

Copilot started reviewing on behalf of timmy-wright June 15, 2026 07:11 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread e2e/validation.go

Copilot AI review requested due to automatic review settings June 15, 2026 07:22

Copilot started reviewing on behalf of timmy-wright June 15, 2026 07:22 View session

timmy-wright and others added 8 commits June 15, 2026 07:26

fix(e2e): do not memoize transient errors in cachedFunc

2ff0de6

cachedFunc now caches only successful results so transient first-call failures do not poison later callers sharing the same key. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix(e2e): avoid mutating global DefaultVMSKU under parallel tests

9a0ffbf

prepareAKSNode now keeps the fallback VM size local to the scenario so parallel tests do not race on the global default SKU. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix(e2e): add jitter to app-level retry delays to prevent synchronize…

70974e5

…d storms Application-level retry loops now add randomized backoff so parallel failures do not hammer the same endpoints in lockstep. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

timmy-wright and others added 8 commits June 15, 2026 07:27

fix(e2e): treat ssh.ExitMissingError as failure, not success

ff955c7

SSH channel closures without an exit status now surface as explicit failures instead of being misreported as successful command execution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix(e2e): add cycle detection to DAG to prevent silent hangs on bad w…

6e203f9

…iring DAG task registration now rejects explicit dependency cycles up front so bad wiring fails fast instead of deadlocking Wait forever. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

timmy-wright force-pushed the timmy/e2e-reliability branch from 8803f4e to 6e203f9 Compare June 15, 2026 07:28

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread e2e/validators.go

Comment thread e2e/validation.go

Comment thread e2e/bastionssh.go

Comment thread e2e/exec.go

Comment thread e2e/validators.go

Copilot AI review requested due to automatic review settings June 15, 2026 07:32

Copilot started reviewing on behalf of timmy-wright June 15, 2026 07:32 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread e2e/cache.go

Comment thread e2e/kube.go

Comment thread e2e/bastionssh.go

Comment thread e2e/cluster.go

Comment thread e2e/exec.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(e2e): improve e2e test reliability (16 fixes)#8707

fix(e2e): improve e2e test reliability (16 fixes)#8707
timmy-wright wants to merge 18 commits into
mainfrom
timmy/e2e-reliability

timmy-wright commented Jun 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

timmy-wright commented Jun 15, 2026

E2E Reliability Improvements

Critical fixes

Systematic flakiness fixes

Correctness & debuggability

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants