Summary
During a normal rollout, when a Worker Deployment Version drains, the controller deletes that version's Kubernetes Deployment (and the HPA rendered from the WorkerResourceTemplate), but it never deletes the Temporal-side Worker Deployment Version record. The only call to DeleteWorkerDeploymentVersion is in the CRD-deletion finalizer (added in #240), which runs only when the whole WorkerDeployment custom resource is deleted, not when an individual version drains during a rollout.
As a result, Temporal-side version records accumulate by one per deploy, and the controller relies entirely on server-side GC to keep the count under matching.maxVersionsInDeployment.
Why this is a problem
In practice the server's at-cap reclamation does not keep up. We see deployments accumulate 100+ drained versions that are never reclaimed (companion issue: temporalio/temporal#10737). Once a deployment reaches the cap, a new build cannot register as a poller and the rollout silently wedges: the change is merged and appears deployed, but the fleet keeps running the old version.
Note on existing issues
#270 ("version retention policy: keep last N after drain") was closed with a pointer to the Kubernetes sunset config (scaledownDelay / deleteDelay). But sunset only deletes the Kubernetes Deployment, not the Temporal-side version record, so it does not bound server-side version growth. The two cleanups are independent.
Request
An opt-in, controller-side prune of drained versions (for example a minVersionsToKeep retention policy that deletes drained, non-current versions beyond the newest N), or at minimum documentation that operators must externally prune drained versions to avoid hitting maxVersionsInDeployment. Happy to contribute a PR if there is interest.
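To make the proposed retention rule concrete, here is a minimal sketch of the selection step only, written in Python for brevity rather than in the controller's own language. Everything in it is hypothetical: the VersionRecord type, the status labels, and the function name are illustrative and are not the controller's or Temporal's API. It also assumes that "newest N" means the newest N drained versions by creation time; a real policy might count differently.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class VersionRecord:
    """Hypothetical view of one Temporal-side Worker Deployment Version."""
    build_id: str
    status: str  # assumed labels: "current", "ramping", "draining", "drained"
    created_at: datetime


def select_versions_to_prune(versions, min_versions_to_keep):
    """Return drained versions older than the newest N drained ones.

    Versions that are current, ramping, or still draining are never
    returned, regardless of N.
    """
    if min_versions_to_keep < 0:
        raise ValueError("min_versions_to_keep must be >= 0")
    # Newest first, so the slice below drops the N we want to keep.
    drained = sorted(
        (v for v in versions if v.status == "drained"),
        key=lambda v: v.created_at,
        reverse=True,
    )
    return drained[min_versions_to_keep:]


if __name__ == "__main__":
    def day(d):
        return datetime(2024, 1, d, tzinfo=timezone.utc)

    fleet = [
        VersionRecord("v1", "drained", day(1)),
        VersionRecord("v2", "drained", day(2)),
        VersionRecord("v3", "drained", day(3)),
        VersionRecord("v4", "draining", day(4)),
        VersionRecord("v5", "current", day(5)),
    ]
    for v in select_versions_to_prune(fleet, min_versions_to_keep=1):
        print("would prune", v.build_id)
```

In the controller, each selected version would presumably then be passed to the same DeleteWorkerDeploymentVersion call the finalizer already uses. Keeping the selection as a pure function makes the policy easy to unit test separately from the Temporal client.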
Version
v1.7.0.