Skip to content

Skip stale worker heartbeats from orphaned worker directories#8789

Open
mwkang wants to merge 1 commit into
apache:masterfrom
mwkang:8590-even-rebalance-on-idle-supervisor
Open

Skip stale worker heartbeats from orphaned worker directories#8789
mwkang wants to merge 1 commit into
apache:masterfrom
mwkang:8590-even-rebalance-on-idle-supervisor

Conversation

@mwkang

@mwkang mwkang commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

What is the purpose of the change

Fixes #8786.

The supervisor builds the heartbeat batch it sends to Nimbus from every worker directory present on disk (SupervisorUtils.readWorkerHeartbeatsReportWorkerHeartbeats.getSupervisorWorkerHeartbeatsFromLocal), with no check on whether the worker is still alive. When a worker directory is left behind — e.g. a worker that died before the supervisor finished Container.cleanUpForRestart() — its frozen heartbeat keeps being reported on every reporting round. Orphaned directories are only reclaimed when ReadClusterState is constructed at supervisor startup, not from the periodic sync loop, so at runtime the stale heartbeat is reported indefinitely until the next supervisor restart.

Because the reported topology is usually no longer assigned (and its conf may already be deleted), Nimbus repeatedly tries to read the topology conf while processing that heartbeat and floods its log with Exception when getting heartbeat timeout / NotAliveException. STORM-4022 only suppressed the NotAliveException logging on the Nimbus side; the underlying behavior — the supervisor reporting heartbeats for dead/orphaned workers — was never addressed, and the generic-exception branch still logs.

This change filters stale heartbeats out in ReportWorkerHeartbeats: any heartbeat whose age exceeds supervisor.worker.timeout.secs is skipped and not forwarded to Nimbus. That is the same timeout Slot uses to decide a worker is dead, so a worker past it is already considered dead and should not be reported. A live worker always refreshes its heartbeat well within the timeout, so no valid heartbeat is ever dropped; a heartbeat exactly at the boundary (age == timeout) is still reported.

How was the change tested

Added ReportWorkerHeartbeatsTest (driven by Time.SimulatedTime) covering getSupervisorWorkerHeartbeatsFromLocal:

  • a fresh heartbeat is reported;
  • a heartbeat exactly at the timeout boundary (age == timeout) is still reported;
  • a heartbeat just past the timeout (age == timeout + 1) is filtered out;
  • a long-stale orphaned heartbeat (a day old) is filtered out;
  • null local heartbeats continue to be skipped.

The supervisor builds the heartbeat batch it sends to Nimbus from every
worker directory present on disk, with no check on whether the worker is
still alive. If a worker directory is ever left behind (a worker that died
before the supervisor finished cleanup), its frozen heartbeat keeps being
reported until the next supervisor restart -- the only time orphaned
directories are reclaimed. Nimbus then repeatedly reads the (often deleted)
topology conf, flooding its log with "Exception when getting heartbeat
timeout" / NotAliveException.

Filter out heartbeats older than supervisor.worker.timeout.secs in
ReportWorkerHeartbeats: such a worker is already considered dead by the
same timeout the slot uses, so it should not be reported. A live worker
always refreshes its heartbeat well within the timeout, so this never
drops a valid heartbeat.
@rzo1 rzo1 requested review from reiabreu and rzo1 June 12, 2026 09:57
@rzo1

rzo1 commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Thx for the PR. Think we should get #8788 in first, so we can adjust the heartbeat getters.

@rzo1 rzo1 added this to the 3.0.0 milestone Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Supervisor keeps reporting stale heartbeats for orphaned worker directories

2 participants