[Bug] HorizontalPodAutoscaler continuously rescales due to controller interference #350

Description

@gustavo-marquez-byx

Summary

When using WorkerResourceTemplate to configure HPA for a TemporalWorkerDeployment, the controller continuously resets the underlying Deployment's replica count to 1, directly conflicting with HPA scaling decisions. This results in a cascading rescale loop where the HPA scales up to minReplicas every ~10-15 seconds, only to be reset by the controller milliseconds later.

Severity

HIGH - HPA is completely non-functional for TemporalWorkerDeployment resources.


Environment

  • temporal-worker-controller chart version: 0.24.1
  • temporal-worker-controller container image version: v1.5.2
  • Kubernetes version: 1.33
  • Container platform: EKS with Karpenter
  • Reproducer: Minimal example provided below

Problem Description

Observed Behavior

  1. Deployment is locked at 1 replica despite HPA configuration specifying minReplicas: 3
  2. HPA generates continuous rescale events — 11,087+ "SuccessfulRescale" events in 24 hours with message: "Current number of replicas below Spec.MinReplicas"
  3. Controller logs show active interference:
    • Repeated "msg":"scaling deployment" entries every 10-15 seconds
    • Each entry explicitly sets "replicas":1
    • Each entry also carries a deploymentError about ObjectReference handling
  4. HPA status confirms the conflict:
    status:
      currentReplicas: 1       # What's running
      desiredReplicas: 3       # What HPA wants

Expected Behavior

When a WorkerResourceTemplate with HPA is created for a TemporalWorkerDeployment, the controller should:

  1. Recognize that HPA is managing the Deployment
  2. NOT modify the spec.replicas field (allow HPA full control)
  3. Allow HPA to scale between minReplicas and maxReplicas based on metrics
  4. Maintain steady state at the desired replica count

Configuration & Reproduction

Minimal TemporalWorkerDeployment

apiVersion: temporal.io/v1alpha1
kind: TemporalWorkerDeployment
metadata:
  name: temporal-worker-example
  namespace: default
spec:
  # NOTE: No replicas field — HPA should manage scaling
  rollout:
    strategy: AllAtOnce
  sunset:
    deleteDelay: 2h
    scaledownDelay: 15m
  template:
    metadata:
      labels:
        app: temporal-worker-example
    spec:
      containers:
        - name: temporal-worker-example
          image: myrepo/myapp:v1.0.0
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 500m
              memory: 512Mi
      serviceAccountName: default
  workerOptions:
    connectionRef:
      name: temporal-connection
    temporalNamespace: default.local

WorkerResourceTemplate with HPA (Desired Scaling Config)

apiVersion: temporal.io/v1alpha1
kind: WorkerResourceTemplate
metadata:
  name: temporal-worker-example-hpa
  namespace: default
spec:
  temporalWorkerDeploymentRef:
    name: temporal-worker-example
  template:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    spec:
      scaleTargetRef: {}  # Controller should populate this
      minReplicas: 3
      maxReplicas: 10
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 30
          policies:
            - type: Pods
              value: 2
              periodSeconds: 60
            - type: Percent
              value: 50
              periodSeconds: 60
          selectPolicy: Max
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 75

Steps to Reproduce

  1. Apply the above TemporalWorkerDeployment (without spec.replicas)
  2. Apply the WorkerResourceTemplate with HPA configuration
  3. Observe the underlying Deployment created by the controller
  4. Expected: Deployment scales to 3+ replicas based on CPU/memory
  5. Actual: Deployment stays at 1 replica, HPA generates continuous SuccessfulRescale events

Validation

Check the generated HPA status:

kubectl get hpa -n default -l temporal.io/deployment-name=temporal-worker-example -o yaml

You'll see:

status:
  currentReplicas: 1
  desiredReplicas: 3  # or higher based on metrics
  conditions:
    - reason: SucceededRescale
      message: the HPA controller was able to update the target scale to 3

But the actual Deployment will show only 1 replica running.


Controller Log Evidence

From temporal-worker-controller logs, the interference is clear:

Pattern: Controller resetting replicas

{
  "level": "info",
  "msg": "scaling deployment",
  "controller": "temporalworkerdeployment",
  "TemporalWorkerDeployment": {
    "name": "temporal-worker-zili",
    "namespace": "byx-originacao"
  },
  "deployment": {
    "apiVersion": "apps/v1",
    "kind": "Deployment"
  },
  "deploymentError": "got runtime.Object without object metadata: &ObjectReference{...}",
  "replicas": 1
}

This entry repeats every 10-15 seconds, which indicates that the controller:

  1. Regularly reconciles the TemporalWorkerDeployment
  2. Explicitly sets replicas: 1 on the underlying Deployment
  3. Hits an ObjectReference deserialization error (secondary issue)
  4. Still propagates replicas: 1 to the Deployment despite that error
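
To quantify the pattern from a log dump, a small filter can count the reset entries. This is only a sketch: it assumes the controller emits one JSON object per line (the excerpt above is pretty-printed for readability) and uses just the `msg`, `replicas`, and `deploymentError` fields shown there. `logEntry` and `countResets` are illustrative names, not part of the controller.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds only the fields of a controller log line used here.
type logEntry struct {
	Msg             string `json:"msg"`
	Replicas        *int32 `json:"replicas"`
	DeploymentError string `json:"deploymentError"`
}

// countResets returns how many "scaling deployment" entries set the given
// replica count, and how many of those also carried a deploymentError.
// Assumption: one JSON object per line; other lines are skipped.
func countResets(logs string, replicas int32) (resets, withError int) {
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue // not a JSON log line
		}
		if e.Msg != "scaling deployment" || e.Replicas == nil || *e.Replicas != replicas {
			continue
		}
		resets++
		if e.DeploymentError != "" {
			withError++
		}
	}
	return resets, withError
}

func main() {
	// Sample lines modeled on the excerpts in this report.
	logs := `{"level":"info","msg":"scaling deployment","replicas":1,"deploymentError":"got runtime.Object without object metadata"}
{"level":"info","msg":"unknown field \"status.conditions\""}
not json
{"level":"info","msg":"scaling deployment","replicas":1}`
	resets, withError := countResets(logs, 1)
	fmt.Printf("resets=%d withError=%d\n", resets, withError) // resets=2 withError=1
}
```

Comparing the reset count over a time window with the number of SuccessfulRescale events in the same window would show whether the two are paired one-to-one.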

Secondary errors in logs

{
  "level": "info",
  "msg": "unknown field \"status.conditions\"",
  "controller": "temporalworkerdeployment"
}

This suggests the controller has issues with status field management, which may be related to the deployment error above.


Root Cause Analysis

The controller's reconciliation logic appears to:

  1. Always set spec.replicas on the underlying Deployment (defaulting to 1 or from TemporalWorkerDeployment spec)
  2. Not check if a WorkerResourceTemplate with HPA is managing this Deployment
  3. Override HPA decisions on every reconciliation cycle

This creates an irreconcilable conflict:

  • HPA: "Target 3+ replicas based on metrics" → scales Deployment
  • Controller: "TemporalWorkerDeployment says 1 replica" → resets Deployment
  • HPA: Detects replicas < minReplicas → scales again
  • Loop: Repeats indefinitely
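
The loop above can be illustrated with a toy model of two writers competing for the same spec.replicas field. This is a simplified sketch, not the controller's or the HPA's actual code: `simulate` and its parameters are hypothetical, each tick stands for one HPA sync followed by one controller reconcile, and metrics-driven scaling above minReplicas is ignored.

```go
package main

import "fmt"

// simulate models two writers to the same spec.replicas field.
// Each tick, the HPA raises replicas to at least minReplicas; then the
// controller, unless hpaAware, overwrites the value with controllerReplicas.
// It returns the final replica count and the number of HPA rescales.
func simulate(ticks int, minReplicas, controllerReplicas int32, hpaAware bool) (final int32, rescales int) {
	replicas := controllerReplicas
	for i := 0; i < ticks; i++ {
		if replicas < minReplicas {
			replicas = minReplicas // HPA: "below Spec.MinReplicas"
			rescales++
		}
		if !hpaAware {
			replicas = controllerReplicas // controller resets the field
		}
	}
	return replicas, rescales
}

func main() {
	final, rescales := simulate(5, 3, 1, false)
	fmt.Printf("conflicting: final=%d rescales=%d\n", final, rescales) // final=1 rescales=5
	final, rescales = simulate(5, 3, 1, true)
	fmt.Printf("hpa-aware:   final=%d rescales=%d\n", final, rescales) // final=3 rescales=1
}
```

With the conflicting writer, every tick produces a rescale and the count never settles. When the controller defers to the HPA, a single rescale reaches steady state.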

Questions for the Maintainers

  1. Should the controller implement HPA-awareness (check for WorkerResourceTemplate presence and skip replica management)?
  2. What is the intended workflow for autoscaling TemporalWorkerDeployments?
  3. Can the ObjectReference deserialization error be addressed (secondary issue)?

Possible Solutions

Option 1: Controller Detects HPA and Defers Replica Management (Recommended)

// workerResourceTemplateExists is a hypothetical helper that reports whether
// a WorkerResourceTemplate (e.g. an HPA) targets this Deployment.
if !workerResourceTemplateExists(ctx, client, deployment) {
    // No autoscaler attached: set replicas from the TemporalWorkerDeployment spec.
    deployment.Spec.Replicas = ptr.To(int32(twd.Spec.Replicas))
}
// Otherwise leave deployment.Spec.Replicas untouched so the HPA owns it.

Option 2: Explicit Validation

Reject TemporalWorkerDeployments that have both:

  • No explicit spec.replicas AND
  • An associated WorkerResourceTemplate with HPA

With a clear error: "Cannot use HPA-managed WorkerResourceTemplate without explicit replicas configuration"
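
The check for Option 2 could look roughly like the sketch below. `workerSpec` is a hypothetical stand-in for the relevant parts of a TemporalWorkerDeployment and its WorkerResourceTemplates, not the controller's real types; only the rule and the error message come from this proposal.

```go
package main

import (
	"errors"
	"fmt"
)

// workerSpec is a hypothetical, reduced view of a TemporalWorkerDeployment.
type workerSpec struct {
	Replicas       *int32 // nil when spec.replicas is omitted
	HasHPATemplate bool   // true when a WorkerResourceTemplate with an HPA references it
}

// Error text taken verbatim from the proposal above.
var errHPAWithoutReplicas = errors.New(
	"Cannot use HPA-managed WorkerResourceTemplate without explicit replicas configuration")

// validate rejects the combination described in Option 2:
// no explicit replicas together with an HPA-managed template.
func validate(s workerSpec) error {
	if s.Replicas == nil && s.HasHPATemplate {
		return errHPAWithoutReplicas
	}
	return nil
}

func main() {
	three := int32(3)
	fmt.Println(validate(workerSpec{HasHPATemplate: true}))                   // rejected
	fmt.Println(validate(workerSpec{Replicas: &three, HasHPATemplate: true})) // <nil>
}
```

This only makes the failure explicit. It does not resolve the conflict itself, since the controller would still own spec.replicas.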

Option 3: User-facing Workaround Documentation

Document that HPA + TemporalWorkerDeployment currently requires:

  • Explicit spec.replicas set to minReplicas value
  • HPA configured to respect this as the baseline
  • Understanding that controller will manage replicas, not HPA (defeats the purpose)

Workarounds Attempted (All Failed)

  1. Omitting spec.replicas: Controller defaults to 1, HPA can't override
  2. Setting spec.replicas: 1: Controller locks it, HPA can't scale
  3. Setting spec.replicas: 3: Controller overrides HPA, no dynamic scaling
  4. Setting spec.replicas: null: Controller treats as 1

Conclusion: None of these attempts worked. As far as we can tell, HPA cannot currently be combined with TemporalWorkerDeployment.


Impact

  • Affected Users: Anyone attempting to use WorkerResourceTemplate with HPA for autoscaling
  • Scope: All TemporalWorkerDeployment resources configured with HPA
  • Duration: Persistent; the Deployment stays stuck at the default replica count for as long as the HPA template is attached
  • Resource Waste: Continuous HPA reconciliation cycles create unnecessary API load and logging overhead

Additional Context

The HPA side is functioning correctly:

  • The HorizontalPodAutoscaler resource is created successfully
  • It correctly identifies when replicas are below minReplicas
  • It successfully patches the Deployment to scale up
  • But: within milliseconds, the controller re-patches the Deployment back to 1
