fix(e2e): improve e2e test reliability (16 fixes)#8707
Open
timmy-wright wants to merge 18 commits into
Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR targets reliability and debuggability improvements across the AgentBaker e2e test suite, focusing on reducing flakiness (transient Azure/K8s errors, fixed sleeps) and improving cleanup/cancellation behavior.
Changes:
- Improve polling/retry behavior (transient error handling, jittered backoffs, replace some fixed sleeps with polling).
- Harden VMSS lifecycle handling (earlier cleanup registration, wait for VMSS deletion completion, avoid mutating global VM SKU under parallelism via per-scenario VMSize).
- Improve correctness primitives (pod readiness semantics, NPD condition matching/debug output, DAG cycle detection, SSH exec context cancellation propagation).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| e2e/vmss.go | Registers cleanup earlier, waits for VMSS deletion to complete, and uses per-scenario VM size in the VMSS model. |
| e2e/validators.go | Fixes sysctl key slice init, replaces restart sleep with polling, strengthens NPD validation logging, hardens Windows settings read, and improves transient retry logging. |
| e2e/validation.go | Adds per-retry unique pod naming and improves resource polling cancellation diagnostics. |
| e2e/types.go | Adds ScenarioRuntime.VMSize to avoid global DefaultVMSKU mutation. |
| e2e/test_helpers.go | Removes top-level parallelization, implements a deeper scenario copy for parallel subtests, and computes/stores an effective VM SKU per scenario. |
| e2e/kube.go | Splits pod wait into running vs completed modes, adds transient error handling, and uses container readiness checks. |
| e2e/exec.go | Propagates context into SCP timeout, supports SSH exec cancellation, treats ExitMissingError as failure, and adds jitter to pod-exec retries. |
| e2e/dag/dag.go | Adds runtime cycle detection for explicitly declared dependencies. |
| e2e/cluster.go | Adds transient error handling in cluster readiness polling and jitter to cluster creation retries. |
| e2e/cache.go | Avoids memoizing transient errors by caching only successful results. |
| e2e/bastionssh.go | Propagates context through token/session creation and adds jitter to bastion SSH dial retries. |
cachedFunc now caches only successful results so transient first-call failures do not poison later callers sharing the same key. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
prepareAKSNode now keeps the fallback VM size local to the scenario so parallel tests do not race on the global default SKU. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…er create VMSS cleanup is now registered as soon as creation starts and delete waits for completion so failed setups do not leave orphaned scale sets behind. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… paths SSH command execution and bastion setup now honor caller deadlines so canceled tests stop remote work instead of hanging behind background contexts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…iling immediately Polling helpers now log transient Kubernetes and ARM read failures and keep retrying until timeout instead of aborting on the first hiccup. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…and GPU validators Service restart and GPU scheduling validators now poll real readiness signals instead of relying on fixed sleeps that flake under load. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d storms Application-level retry loops now add randomized backoff so parallel failures do not hammer the same endpoints in lockstep. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aitUntilPodCompleted Pod wait helpers now keep running and completed success states separate so completed one-shot pods cannot satisfy running readiness checks. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SSH channel closures without an exit status now surface as explicit failures instead of being misreported as successful command execution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…on pressure RunScenario no longer marks the parent test parallel so only one subtest layer fans out and shared e2e state sees less contention. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…l subtests copyScenario now clones mutable scenario slices and pointer-backed state so parallel subtests stop sharing the same scenario internals by accident. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Exists conflicts Pod validation retries now create uniquely named pods so a slow delete from a prior attempt cannot mask the real failure with AlreadyExists. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…uments ValidateSysctlConfig now builds its sysctl argument list without prefilled empty entries so the command checks only the requested keys. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e or version empty The Windows version validator now stops on unreadable settings or empty expected versions instead of passing vacuously on an empty contains check. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Node problem detector validation now succeeds only on the intended condition tuple and reports the last observed condition when polling times out. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…iring DAG task registration now rejects explicit dependency cycles up front so bad wiring fails fast instead of deadlocking Wait forever. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
8803f4e to
6e203f9
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
e2e/cluster.go:497
math/randis never seeded in this repo, so the jitter here is deterministic across processes (default seed = 1). That reduces the benefit of jitter for de-synchronizing retries in CI runs. Consider using a per-call RNG seeded with time, or a package-level RNG seeded once at startup.
// Note, it seems like the operation still can start a trigger a new operation even if nothing has changes
pollerResp, err := config.Azure.AKS.BeginCreateOrUpdate(
ctx,
config.ResourceGroupName(*cluster.Location),
*cluster.Name,
*cluster,
- validators.go: add backoff sleep before continue on transient node GET error to prevent tight-loop hammering the API server - validation.go: guard empty baseName in podNameForAttempt to avoid DNS-1123 invalid names starting with '-' - validators.go: replace bash-specific [[ ]] with POSIX [ ] in ServiceCanRestartValidator SSH script
- cache.go: update doc comment to accurately describe success-only caching semantics (errors are retried, not memoized) - cache_test.go: replace Test_cachedFunc_caches_errors with Test_cachedFunc_retries_on_error and add Test_cachedFunc_caches_after_success_following_error to match new behavior - kube.go: add isPodReadyCondition helper; check pod-level PodReady condition in WaitUntilPodRunning to capture readiness gates and minReadySeconds, not just per-container readiness
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
E2E Reliability Improvements
This PR implements 16 targeted reliability fixes to the AgentBaker e2e test suite identified through a comprehensive code review.
Critical fixes
Systematic flakiness fixes
Correctness & debuggability
Closes AB#38423358