Skip stale worker heartbeats from orphaned worker directories by mwkang · Pull Request #8789 · apache/storm

mwkang · 2026-06-12T02:31:25Z

What is the purpose of the change

The supervisor builds the heartbeat batch it sends to Nimbus from every worker directory present on disk (SupervisorUtils.readWorkerHeartbeats → ReportWorkerHeartbeats.getSupervisorWorkerHeartbeatsFromLocal), with no check on whether the worker is still alive. When a worker directory is left behind — e.g. a worker that died before the supervisor finished Container.cleanUpForRestart() — its frozen heartbeat keeps being reported on every reporting round. Orphaned directories are only reclaimed when ReadClusterState is constructed at supervisor startup, not from the periodic sync loop, so at runtime the stale heartbeat is reported indefinitely until the next supervisor restart.

Because the reported topology is usually no longer assigned (and its conf may already be deleted), Nimbus repeatedly tries to read the topology conf while processing that heartbeat and floods its log with Exception when getting heartbeat timeout / NotAliveException. STORM-4022 only suppressed the NotAliveException logging on the Nimbus side; the underlying behavior — the supervisor reporting heartbeats for dead/orphaned workers — was never addressed, and the generic-exception branch still logs.

This change filters stale heartbeats out in ReportWorkerHeartbeats: any heartbeat whose age exceeds supervisor.worker.timeout.secs is skipped and not forwarded to Nimbus. That is the same timeout Slot uses to decide a worker is dead, so a worker past it is already considered dead and should not be reported. A live worker always refreshes its heartbeat well within the timeout, so no valid heartbeat is ever dropped; a heartbeat exactly at the boundary (age == timeout) is still reported.

How was the change tested

Added ReportWorkerHeartbeatsTest (driven by Time.SimulatedTime) covering getSupervisorWorkerHeartbeatsFromLocal:

a fresh heartbeat is reported;
a heartbeat exactly at the timeout boundary (age == timeout) is still reported;
a heartbeat just past the timeout (age == timeout + 1) is filtered out;
a long-stale orphaned heartbeat (a day old) is filtered out;
null local heartbeats continue to be skipped.

The supervisor builds the heartbeat batch it sends to Nimbus from every worker directory present on disk, with no check on whether the worker is still alive. If a worker directory is ever left behind (a worker that died before the supervisor finished cleanup), its frozen heartbeat keeps being reported until the next supervisor restart -- the only time orphaned directories are reclaimed. Nimbus then repeatedly reads the (often deleted) topology conf, flooding its log with "Exception when getting heartbeat timeout" / NotAliveException. Filter out heartbeats older than supervisor.worker.timeout.secs in ReportWorkerHeartbeats: such a worker is already considered dead by the same timeout the slot uses, so it should not be reported. A live worker always refreshes its heartbeat well within the timeout, so this never drops a valid heartbeat.

rzo1 · 2026-06-12T09:59:05Z

Thx for the PR. Think we should get #8788 in first, so we can adjust the heartbeat getters.

rzo1 requested review from reiabreu and rzo1 June 12, 2026 09:57

rzo1 added this to the 3.0.0 milestone Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Skip stale worker heartbeats from orphaned worker directories#8789

Skip stale worker heartbeats from orphaned worker directories#8789
mwkang wants to merge 1 commit into
apache:masterfrom
mwkang:8590-even-rebalance-on-idle-supervisor

mwkang commented Jun 12, 2026

Uh oh!

rzo1 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mwkang commented Jun 12, 2026

What is the purpose of the change

How was the change tested

Uh oh!

rzo1 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants