Summary
gitops-update.yml reports an ArgoCD Sync failure for downstream apps even when ArgoCD itself completes the sync successfully shortly afterwards. The culprit is the hard-coded 180s timeout on argocd app wait, which is shorter than the time some apps take to finish their PostSync hooks on slower or busier clusters.
Symptom
Workflow runs that legitimately update GitOps end with ❌ ArgoCD Sync (<server>/<env>) jobs while the corresponding ArgoCD Application transitions to Synced + Healthy within ~1–2 minutes after the GHA timeout. This produces a steady stream of false-failure notifications for owners of apps with multi-stage sync hooks (PreSync migrations + PostSync init jobs).
Evidence
Run: https://github.com/LerianStudio/plugin-access-manager/actions/runs/26538953954 (tag v3.0.0-beta.7, triggered from build-auth-init.yml)
| server/env | GHA conclusion | ArgoCD timeline (UTC, from argocd app history) | Current state |
|---|---|---|---|
| firmino/dev | ✅ success | (within timeout) | Synced/Healthy |
| benedita/dev | ✅ success | (within timeout) | Synced/Healthy |
| clotilde/dev | ❌ failure (timeout at 21:22:31Z) | sync to 5038e5c started 21:21:09Z, completed 21:22:38Z | Synced/Healthy |
| anacleto/dev | ❌ failure (timeout at 21:19:51Z) | sync to 5038e5c started 21:17:52Z, completed 21:20:08Z | Synced/Healthy |
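The gaps implied by the table above can be worked out directly from its timestamps. The following is a minimal sketch of that arithmetic; the helper names to_secs and elapsed are illustrative and not part of the workflow, and all timestamps are assumed to fall on the same UTC day.

```shell
#!/bin/sh
# Timestamps are copied from the evidence table (HH:MM:SS, same UTC day assumed).

# Convert HH:MM:SS to seconds since midnight.
# ${n#0} strips one leading zero so values like "08" are not read as octal.
to_secs() {
  old_ifs=$IFS; IFS=:; set -- $1; IFS=$old_ifs
  echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# Seconds elapsed from the first timestamp to the second.
elapsed() {
  echo $(( $(to_secs "$2") - $(to_secs "$1") ))
}

echo "clotilde: sync ran $(elapsed 21:21:09 21:22:38)s, completed $(elapsed 21:22:31 21:22:38)s after the GHA timeout"
echo "anacleto: sync ran $(elapsed 21:17:52 21:20:08)s, completed $(elapsed 21:19:51 21:20:08)s after the GHA timeout"
```

For this run the script reports 89s and 136s for the sync operations recorded in argocd app history, and 7s and 17s between the GHA timeout and ArgoCD completing the sync.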
Failure messages in both failed jobs are identical and benign:
Sync Status: Synced to main (5038e5c)
Health Status: Degraded
Message: waiting for completion of hook batch/Job/plugin-access-manager-auth-init-user
##[error]Timeout waiting for sync completion of <app>
The PreSync auth-backend-migrations Job completed on both clusters; the PostSync auth-init-user Job (which seeds Casdoor via API) needed slightly more than 180s to finish on clotilde and anacleto. Per the timeline above, ArgoCD itself converged only 7s (clotilde) and 17s (anacleto) after the GHA job gave up.
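Until the timeout is configurable, the benign case can be told apart from a real failure by the log pattern quoted above. The sketch below is one possible check, not something the workflow does today; is_hook_wait_timeout is a hypothetical helper, and it assumes the job log contains the three lines exactly as shown in this issue.

```shell
#!/bin/sh
# Returns 0 when the log text matches the benign pattern from this issue:
# the app is already Synced and argocd app wait timed out on a hook Job.
is_hook_wait_timeout() {
  log=$1
  printf '%s\n' "$log" | grep -q '^Sync Status: *Synced' &&
  printf '%s\n' "$log" | grep -q 'waiting for completion of hook' &&
  printf '%s\n' "$log" | grep -q 'Timeout waiting for sync completion'
}

# Sample log copied from the failed jobs (<app> is left elided as in the issue).
sample_log='Sync Status: Synced to main (5038e5c)
Health Status: Degraded
Message: waiting for completion of hook batch/Job/plugin-access-manager-auth-init-user
##[error]Timeout waiting for sync completion of <app>'

if is_hook_wait_timeout "$sample_log"; then
  echo "likely false failure: hook still running at timeout"
else
  echo "needs investigation"
fi
```

A match only says the hook was still running when the wait expired; whether the hook later succeeded still has to be confirmed in ArgoCD.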
Root cause
.github/workflows/gitops-update.yml (v1.30.0):
# line 808
argocd app sync "$APP_NAME" ... --async --timeout 180 $PRUNE_FLAG
# line 822
argocd app wait "$APP_NAME" ... --timeout 180
The 180s app wait timeout is shorter than the worst-case PostSync hook duration for apps like plugin-access-manager on slower clusters. There is no input to override it, so callers cannot tune per app/server.
Proposed solution
- Expose two new inputs with sensible defaults:
  - argocd_sync_timeout (default 180, applied to argocd app sync --timeout)
  - argocd_wait_timeout (default 600, applied to argocd app wait --timeout)
- Raise the default wait timeout to 600s so apps with PostSync seed jobs stop producing false negatives out of the box. The existing 5-attempt retry loop with 30s sleep is already in place for transient sync failures; the new value only affects how long each attempt blocks on health.
- (Optional) Add an argocd_skip_wait_on_hooks input to make app wait return as soon as resources are Synced (ignore hook health). Useful for callers that explicitly do not want to gate the workflow on long-running seed jobs.
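The proposal above could reach the argocd CLI roughly as follows. This is a dry-run sketch under stated assumptions, not the workflow's actual step: the environment variables ARGOCD_SYNC_TIMEOUT, ARGOCD_WAIT_TIMEOUT and ARGOCD_SKIP_WAIT_ON_HOOKS are hypothetical names the workflow would populate from the new inputs, the flags elided as "..." in the current commands are left out, and using --sync to return once the app is Synced without waiting on hook health is an assumption to verify against the ArgoCD version in use.

```shell
#!/bin/sh
# Placeholder app name so the dry run is self-contained.
APP_NAME="${APP_NAME:-example-app}"

# Use the given value if it is a positive integer, otherwise the default.
resolve_timeout() {
  case "$1" in
    ''|*[!0-9]*|0*) echo "$2" ;;   # empty, non-numeric, or zero/leading zero
    *) echo "$1" ;;
  esac
}

# Build the wait command; $3 = "true" limits the wait to sync status.
build_wait_cmd() {
  cmd="argocd app wait $1 --timeout $2"
  if [ "$3" = "true" ]; then
    cmd="$cmd --sync"   # assumption: returns once Synced, ignoring hook health
  fi
  echo "$cmd"
}

SYNC_TIMEOUT=$(resolve_timeout "${ARGOCD_SYNC_TIMEOUT:-}" 180)
WAIT_TIMEOUT=$(resolve_timeout "${ARGOCD_WAIT_TIMEOUT:-}" 600)

# Dry run: print the commands instead of executing them.
echo "argocd app sync $APP_NAME --async --timeout $SYNC_TIMEOUT"
build_wait_cmd "$APP_NAME" "$WAIT_TIMEOUT" "${ARGOCD_SKIP_WAIT_ON_HOOKS:-false}"
```

Falling back to the default on invalid input keeps existing callers working when the new inputs are left unset.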
Acceptance criteria
- Re-running the linked plugin-access-manager release with the new defaults produces ✅ on all four servers without any change to the downstream apps.
- The new inputs are documented in docs/gitops-update-workflow.md.
- Existing callers (e.g. plugin-br-bank-transfer, midaz, ungoliant-controller) continue to work without changes.
References