Skip to content

fix(e2e): improve e2e test reliability (16 fixes)#8707

Open
timmy-wright wants to merge 18 commits into
mainfrom
timmy/e2e-reliability
Open

fix(e2e): improve e2e test reliability (16 fixes)#8707
timmy-wright wants to merge 18 commits into
mainfrom
timmy/e2e-reliability

Conversation

@timmy-wright

Copy link
Copy Markdown
Contributor

E2E Reliability Improvements

This PR implements 16 targeted reliability fixes to the AgentBaker e2e test suite identified through a comprehensive code review.

Critical fixes

  • Do not memoize transient errors in cachedFunc (prevents suite-wide poisoning)
  • Avoid mutating global DefaultVMSKU under parallel tests (fixes data race)
  • Wait for VMSS deletion and register cleanup immediately after create (prevents orphans)
  • Propagate context cancellation through SSH exec and bastion paths

Systematic flakiness fixes

  • Retry transient API errors in wait/poll loops instead of failing immediately
  • Replace fixed sleeps with polling loops in validators
  • Add jitter to retry delays to prevent synchronized storms
  • Fix WaitUntilPodRunning to not accept PodSucceeded; add WaitUntilPodCompleted
  • Treat ssh.ExitMissingError as failure, not success

Correctness & debuggability

  • Remove nested t.Parallel() to reduce isolation pressure
  • Implement real deep copy in copyScenario
  • Generate unique pod names per retry attempt
  • Fix sysctl arg slice init (leading empty strings)
  • Fail Windows version validator when settings unreadable
  • Require full type+reason+message match in validateNPDCondition
  • Add cycle detection to DAG

Closes AB#38423358

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets reliability and debuggability improvements across the AgentBaker e2e test suite, focusing on reducing flakiness (transient Azure/K8s errors, fixed sleeps) and improving cleanup/cancellation behavior.

Changes:

  • Improve polling/retry behavior (transient error handling, jittered backoffs, replace some fixed sleeps with polling).
  • Harden VMSS lifecycle handling (earlier cleanup registration, wait for VMSS deletion completion, avoid mutating global VM SKU under parallelism via per-scenario VMSize).
  • Improve correctness primitives (pod readiness semantics, NPD condition matching/debug output, DAG cycle detection, SSH exec context cancellation propagation).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
e2e/vmss.go Registers cleanup earlier, waits for VMSS deletion to complete, and uses per-scenario VM size in the VMSS model.
e2e/validators.go Fixes sysctl key slice init, replaces restart sleep with polling, strengthens NPD validation logging, hardens Windows settings read, and improves transient retry logging.
e2e/validation.go Adds per-retry unique pod naming and improves resource polling cancellation diagnostics.
e2e/types.go Adds ScenarioRuntime.VMSize to avoid global DefaultVMSKU mutation.
e2e/test_helpers.go Removes top-level parallelization, implements a deeper scenario copy for parallel subtests, and computes/stores an effective VM SKU per scenario.
e2e/kube.go Splits pod wait into running vs completed modes, adds transient error handling, and uses container readiness checks.
e2e/exec.go Propagates context into SCP timeout, supports SSH exec cancellation, treats ExitMissingError as failure, and adds jitter to pod-exec retries.
e2e/dag/dag.go Adds runtime cycle detection for explicitly declared dependencies.
e2e/cluster.go Adds transient error handling in cluster readiness polling and jitter to cluster creation retries.
e2e/cache.go Avoids memoizing transient errors by caching only successful results.
e2e/bastionssh.go Propagates context through token/session creation and adds jitter to bastion SSH dial retries.

Comment thread e2e/validation.go
Copilot AI review requested due to automatic review settings June 15, 2026 07:22
timmy-wright and others added 8 commits June 15, 2026 07:26
cachedFunc now caches only successful results so transient first-call failures do not poison later callers sharing the same key.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
prepareAKSNode now keeps the fallback VM size local to the scenario so parallel tests do not race on the global default SKU.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…er create

VMSS cleanup is now registered as soon as creation starts and delete waits for completion so failed setups do not leave orphaned scale sets behind.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… paths

SSH command execution and bastion setup now honor caller deadlines so canceled tests stop remote work instead of hanging behind background contexts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…iling immediately

Polling helpers now log transient Kubernetes and ARM read failures and keep retrying until timeout instead of aborting on the first hiccup.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…and GPU validators

Service restart and GPU scheduling validators now poll real readiness signals instead of relying on fixed sleeps that flake under load.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d storms

Application-level retry loops now add randomized backoff so parallel failures do not hammer the same endpoints in lockstep.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aitUntilPodCompleted

Pod wait helpers now keep running and completed success states separate so completed one-shot pods cannot satisfy running readiness checks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
timmy-wright and others added 8 commits June 15, 2026 07:27
SSH channel closures without an exit status now surface as explicit failures instead of being misreported as successful command execution.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…on pressure

RunScenario no longer marks the parent test parallel so only one subtest layer fans out and shared e2e state sees less contention.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…l subtests

copyScenario now clones mutable scenario slices and pointer-backed state so parallel subtests stop sharing the same scenario internals by accident.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Exists conflicts

Pod validation retries now create uniquely named pods so a slow delete from a prior attempt cannot mask the real failure with AlreadyExists.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…uments

ValidateSysctlConfig now builds its sysctl argument list without prefilled empty entries so the command checks only the requested keys.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e or version empty

The Windows version validator now stops on unreadable settings or empty expected versions instead of passing vacuously on an empty contains check.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Node problem detector validation now succeeds only on the intended condition tuple and reports the last observed condition when polling times out.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…iring

DAG task registration now rejects explicit dependency cycles up front so bad wiring fails fast instead of deadlocking Wait forever.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@timmy-wright timmy-wright force-pushed the timmy/e2e-reliability branch from 8803f4e to 6e203f9 Compare June 15, 2026 07:28

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

e2e/cluster.go:497

  • math/rand is never seeded in this repo, so the jitter here is deterministic across processes (default seed = 1). That reduces the benefit of jitter for de-synchronizing retries in CI runs. Consider using a per-call RNG seeded with time, or a package-level RNG seeded once at startup.
	// Note, it seems like the operation still can start a trigger a new operation even if nothing has changes
	pollerResp, err := config.Azure.AKS.BeginCreateOrUpdate(
		ctx,
		config.ResourceGroupName(*cluster.Location),
		*cluster.Name,
		*cluster,

Comment thread e2e/validators.go
Comment thread e2e/validation.go
Comment thread e2e/bastionssh.go
Comment thread e2e/exec.go
Comment thread e2e/validators.go
- validators.go: add backoff sleep before continue on transient node GET
  error to prevent tight-loop hammering the API server
- validation.go: guard empty baseName in podNameForAttempt to avoid
  DNS-1123 invalid names starting with '-'
- validators.go: replace bash-specific [[ ]] with POSIX [ ] in
  ServiceCanRestartValidator SSH script
Copilot AI review requested due to automatic review settings June 15, 2026 07:32

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Comment thread e2e/cache.go
Comment thread e2e/kube.go
Comment thread e2e/bastionssh.go
Comment thread e2e/cluster.go
Comment thread e2e/exec.go
- cache.go: update doc comment to accurately describe success-only
  caching semantics (errors are retried, not memoized)
- cache_test.go: replace Test_cachedFunc_caches_errors with
  Test_cachedFunc_retries_on_error and add
  Test_cachedFunc_caches_after_success_following_error to match new behavior
- kube.go: add isPodReadyCondition helper; check pod-level PodReady
  condition in WaitUntilPodRunning to capture readiness gates and
  minReadySeconds, not just per-container readiness
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants