Split from #421 (Fix 1 resilience items). The 2.2.0-beta.6 incident's attempt #1 failed because cosign signing hit a transient Rekor 404 getLogEntryByUuidNotFound across all 3 retry attempts. Rekor intermittency can last several minutes, so the current retry policy is insufficient.
Current state (develop)
build.yml inputs: cosign_max_attempts default 3, cosign_initial_delay default 5.
src/security/cosign-sign/action.yml: exponential backoff (delay ×3 per failed attempt), no jitter.
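To make the gap concrete, the schedule implied by these defaults can be tabulated with a small sketch. It assumes that cosign_initial_delay is in seconds and that the loop does not sleep after the final attempt; neither is stated above, and backoff_schedule is a hypothetical helper, not code from action.yml.

```shell
#!/usr/bin/env bash
# Print the wait schedule of a fixed-multiplier backoff without jitter.
# Assumptions: delays are seconds; no sleep follows the final attempt.
backoff_schedule() {
  local max_attempts=$1 delay=$2 multiplier=$3
  local attempt=1 total=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    echo "after attempt ${attempt}: wait ${delay}s"
    total=$(( total + delay ))
    delay=$(( delay * multiplier ))
    attempt=$(( attempt + 1 ))
  done
  echo "total wait: ${total}s"
}

backoff_schedule 3 5 3   # current defaults: 3 attempts, initial delay 5, x3
backoff_schedule 5 5 3   # same policy with cosign_max_attempts raised to 5
```

Under those assumptions the current defaults spend about 20 seconds waiting in total (5s + 15s), which is well short of a multi-minute Rekor outage. Five attempts with the same multiplier would wait about 200 seconds (5s + 15s + 45s + 135s).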
Proposed
- Add jitter to the cosign retry delay (randomized component) to avoid thundering-herd when multiple jobs hit Rekor simultaneously.
- Review/raise the cosign_max_attempts default (e.g. 3 → 5) and consider a higher backoff ceiling, to ride out multi-minute Rekor outages.
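One possible shape for the two proposals is sketched below: a retry wrapper with a capped delay and a randomized component. It is a sketch, not the existing loop in action.yml. The names retry_with_jitter, MAX_DELAY and SLEEP_CMD are hypothetical, the 120-second ceiling is an arbitrary placeholder, and delays are assumed to be in seconds.

```shell
#!/usr/bin/env bash
# Retry a command with capped exponential backoff plus random jitter.
# Hypothetical names: retry_with_jitter, MAX_DELAY, SLEEP_CMD.
retry_with_jitter() {
  local max_attempts=$1 delay=$2
  shift 2
  local multiplier=3                    # same x3 growth as the current policy
  local max_delay=${MAX_DELAY:-120}     # assumed ceiling for the base delay
  local sleep_cmd=${SLEEP_CMD:-sleep}   # overridable so the loop can be tested
  local attempt=1 jitter pause

  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -eq "$max_attempts" ]; then
      break                             # no point sleeping after the last try
    fi
    if [ "$delay" -gt "$max_delay" ]; then
      delay=$max_delay
    fi
    # RANDOM is a bash feature; this adds 0..delay extra seconds.
    jitter=$(( RANDOM % (delay + 1) ))
    pause=$(( delay + jitter ))
    echo "attempt ${attempt}/${max_attempts} failed; retrying in ${pause}s" >&2
    "$sleep_cmd" "$pause"
    delay=$(( delay * multiplier ))
    attempt=$(( attempt + 1 ))
  done
  echo "all ${max_attempts} attempts failed" >&2
  return 1
}
```

A call might look like `retry_with_jitter 5 5 cosign sign "$IMAGE_REF"`, where IMAGE_REF is a placeholder for the image being signed. The jitter here is additive (up to 100% of the base delay), so each wait is at least as long as today's while concurrent jobs spread out. Other schemes, such as drawing the whole wait at random, would also avoid the thundering-herd effect.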
Scope notes
- src/security/cosign-sign/action.yml (retry loop) and possibly the build.yml defaults.
- continue_gitops_on_signing_failure during this same incident is fixed separately (see the PR for #421's attempt-#1/#2 fixes).
Related: #421