gitops-update: argocd app wait 180s timeout causes false-failure on apps with long PostSync hooks #381

Description

@bedatty

Summary

gitops-update.yml reports an ArgoCD Sync failure for downstream apps even when ArgoCD itself completes the sync successfully a short time later. The culprit is the hard-coded 180s timeout on argocd app wait, which is shorter than the time some apps need to finish their PostSync hooks on slower or busier clusters.

Symptom

Workflow runs that successfully update GitOps still end with failed ❌ ArgoCD Sync (<server>/<env>) jobs, while the corresponding ArgoCD Application transitions to Synced + Healthy within ~1–2 minutes after the GHA timeout. This produces a steady stream of false-failure notifications for owners of apps with multi-stage sync hooks (PreSync migrations + PostSync init jobs).

Evidence

Run: https://github.com/LerianStudio/plugin-access-manager/actions/runs/26538953954 (tag v3.0.0-beta.7, triggered from build-auth-init.yml)

| server/env | GHA conclusion | ArgoCD timeline (UTC, from argocd app history) | Current state |
|---|---|---|---|
| firmino/dev | ✅ success | (within timeout) | Synced/Healthy |
| benedita/dev | ✅ success | (within timeout) | Synced/Healthy |
| clotilde/dev | ❌ failure (timeout at 21:22:31Z) | sync to 5038e5c started 21:21:09Z, completed 21:22:38Z | Synced/Healthy |
| anacleto/dev | ❌ failure (timeout at 21:19:51Z) | sync to 5038e5c started 21:17:52Z, completed 21:20:08Z | Synced/Healthy |

Failure messages in both failed jobs are identical and benign:

Sync Status:  Synced to main (5038e5c)
Health Status: Degraded
Message:      waiting for completion of hook batch/Job/plugin-access-manager-auth-init-user
##[error]Timeout waiting for sync completion of <app>

The PreSync auth-backend-migrations Job completed in both clusters; the PostSync auth-init-user Job (which seeds Casdoor via API) had not finished on clotilde/anacleto when the 180s wait expired. Per the timeline above, ArgoCD itself converged 7s (clotilde) and 17s (anacleto) after the GHA gave up.
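For reference, the gap between the GHA timeout and ArgoCD completion in this run can be recomputed from the table timestamps. The sketch below is illustrative only: the helper names to_seconds and gap_seconds are hypothetical, and it assumes both timestamps fall on the same UTC day and use two-digit HH:MM:SS fields.

```shell
#!/bin/sh
# Convert an HH:MM:SS timestamp (optional trailing "Z") to seconds since midnight.
# Assumes two-digit fields, as in the table above.
to_seconds() {
  t=${1%Z}
  h=${t%%:*}; rest=${t#*:}
  m=${rest%%:*}; s=${rest#*:}
  # Strip one leading zero so values like "08" are not parsed as octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# gap_seconds GHA_TIMEOUT_TIME ARGOCD_COMPLETION_TIME
gap_seconds() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

echo "clotilde/dev: $(gap_seconds 21:22:31Z 21:22:38Z)s"   # prints "clotilde/dev: 7s"
echo "anacleto/dev: $(gap_seconds 21:19:51Z 21:20:08Z)s"   # prints "anacleto/dev: 17s"
```

In both failed jobs the wait expired only seconds before the sync completed.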

Root cause

.github/workflows/gitops-update.yml (v1.30.0):

# line 808
argocd app sync "$APP_NAME" ... --async --timeout 180 $PRUNE_FLAG
# line 822
argocd app wait "$APP_NAME" ... --timeout 180

The 180s app wait timeout is shorter than the worst-case PostSync hook duration for apps like plugin-access-manager on slower clusters. There is no input to override it, so callers cannot tune per app/server.
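A minimal sketch of how the two hard-coded values could be made overridable, not the workflow's actual code. It assumes the caller-supplied values reach the step as environment variables; the names ARGOCD_SYNC_TIMEOUT and ARGOCD_WAIT_TIMEOUT and the helper resolve_timeout are hypothetical. The fallback of 180 keeps today's behavior when nothing is set.

```shell
#!/bin/sh
# resolve_timeout VALUE FALLBACK
# Prints VALUE if it is a positive integer, otherwise FALLBACK.
resolve_timeout() {
  case "$1" in
    ''|*[!0-9]*) echo "$2" ;;   # unset, empty, or non-numeric: use fallback
    *) if [ "$1" -gt 0 ]; then echo "$1"; else echo "$2"; fi ;;
  esac
}

# Hypothetical variable names; 180 matches the current hard-coded value.
SYNC_TIMEOUT=$(resolve_timeout "${ARGOCD_SYNC_TIMEOUT:-}" 180)
WAIT_TIMEOUT=$(resolve_timeout "${ARGOCD_WAIT_TIMEOUT:-}" 180)

# Printed rather than executed so the sketch runs without an ArgoCD server;
# "..." stands for the arguments elided in the workflow lines quoted above.
echo "argocd app sync \"\$APP_NAME\" ... --async --timeout $SYNC_TIMEOUT \$PRUNE_FLAG"
echo "argocd app wait \"\$APP_NAME\" ... --timeout $WAIT_TIMEOUT"
```

Falling back on invalid input, instead of failing the step, is one possible design choice; rejecting bad values with an explicit error would be equally reasonable.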

Proposed solution

  1. Expose two new inputs with sensible defaults:
    • argocd_sync_timeout (default 180, applied to argocd app sync --timeout)
    • argocd_wait_timeout (default 600, applied to argocd app wait --timeout)
  2. Raise the default wait timeout to 600s so apps with PostSync seed jobs stop producing false failures out of the box. The existing 5-attempt retry loop with 30s sleep is already in place for transient sync failures; the new value only affects how long each attempt blocks on health.
  3. (Optional) Add an argocd_skip_wait_on_hooks input that makes app wait return as soon as resources are Synced (ignoring hook health). Useful for callers that explicitly do not want to gate the workflow on long-running seed jobs.
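To make items 2 and 3 concrete, here is a rough sketch under stated assumptions, not a proposed implementation. The helpers build_wait_args and worst_case_seconds are hypothetical. The first assumes the skip option can be mapped to the existing --sync flag of argocd app wait, so the command waits on sync status only; whether that returns early enough while hooks are still running would need to be verified. The second computes a loose upper bound that assumes every one of the 5 attempts can exhaust both timeouts, with a 30s sleep between attempts; the real loop in gitops-update.yml may be structured differently.

```shell
#!/bin/sh
# build_wait_args WAIT_TIMEOUT SKIP_WAIT_ON_HOOKS
# Prints the flags to pass to `argocd app wait`.
build_wait_args() {
  if [ "$2" = "true" ]; then
    # Assumption: waiting on sync status only does not gate on hook health.
    echo "--sync --timeout $1"
  else
    # No condition flag: same form as the current workflow invocation.
    echo "--timeout $1"
  fi
}

# worst_case_seconds ATTEMPTS SYNC_TIMEOUT WAIT_TIMEOUT SLEEP
# Upper bound if every attempt exhausts both timeouts, sleeping between attempts.
worst_case_seconds() {
  echo $(( $1 * ($2 + $3) + ($1 - 1) * $4 ))
}

echo "default wait flags: $(build_wait_args 600 false)"   # --timeout 600
echo "skip-hooks flags:   $(build_wait_args 600 true)"    # --sync --timeout 600
echo "current worst case:  $(worst_case_seconds 5 180 180 30)s"   # 1920s
echo "proposed worst case: $(worst_case_seconds 5 180 600 30)s"   # 4020s
```

Under these assumptions the proposed default raises the theoretical worst case from 32 to 67 minutes per server, which is worth comparing against any job-level timeout the workflow sets.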

Acceptance criteria

  • Re-running the linked plugin-access-manager release with the new defaults produces ✅ on all four servers without any change to the downstream apps.
  • The new inputs are documented in docs/gitops-update-workflow.md.
  • Existing callers (e.g. plugin-br-bank-transfer, midaz, ungoliant-controller) continue to work without changes.

Labels

bug: Something is not working as expected
stale: No recent activity, will be closed if not updated
