Skip stale worker heartbeats from orphaned worker directories#8789
Open
mwkang wants to merge 1 commit into
Open
Skip stale worker heartbeats from orphaned worker directories#8789mwkang wants to merge 1 commit into
mwkang wants to merge 1 commit into
Conversation
The supervisor builds the heartbeat batch it sends to Nimbus from every worker directory present on disk, with no check on whether the worker is still alive. If a worker directory is ever left behind (a worker that died before the supervisor finished cleanup), its frozen heartbeat keeps being reported until the next supervisor restart -- the only time orphaned directories are reclaimed. Nimbus then repeatedly reads the (often deleted) topology conf, flooding its log with "Exception when getting heartbeat timeout" / NotAliveException. Filter out heartbeats older than supervisor.worker.timeout.secs in ReportWorkerHeartbeats: such a worker is already considered dead by the same timeout the slot uses, so it should not be reported. A live worker always refreshes its heartbeat well within the timeout, so this never drops a valid heartbeat.
Contributor
|
Thx for the PR. Think we should get #8788 in first, so we can adjust the heartbeat getters. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What is the purpose of the change
Fixes #8786.
The supervisor builds the heartbeat batch it sends to Nimbus from every worker directory present on disk (
SupervisorUtils.readWorkerHeartbeats→ReportWorkerHeartbeats.getSupervisorWorkerHeartbeatsFromLocal), with no check on whether the worker is still alive. When a worker directory is left behind — e.g. a worker that died before the supervisor finishedContainer.cleanUpForRestart()— its frozen heartbeat keeps being reported on every reporting round. Orphaned directories are only reclaimed whenReadClusterStateis constructed at supervisor startup, not from the periodic sync loop, so at runtime the stale heartbeat is reported indefinitely until the next supervisor restart.Because the reported topology is usually no longer assigned (and its conf may already be deleted), Nimbus repeatedly tries to read the topology conf while processing that heartbeat and floods its log with
Exception when getting heartbeat timeout/NotAliveException. STORM-4022 only suppressed theNotAliveExceptionlogging on the Nimbus side; the underlying behavior — the supervisor reporting heartbeats for dead/orphaned workers — was never addressed, and the generic-exception branch still logs.This change filters stale heartbeats out in
ReportWorkerHeartbeats: any heartbeat whose age exceedssupervisor.worker.timeout.secsis skipped and not forwarded to Nimbus. That is the same timeoutSlotuses to decide a worker is dead, so a worker past it is already considered dead and should not be reported. A live worker always refreshes its heartbeat well within the timeout, so no valid heartbeat is ever dropped; a heartbeat exactly at the boundary (age == timeout) is still reported.How was the change tested
Added
ReportWorkerHeartbeatsTest(driven byTime.SimulatedTime) coveringgetSupervisorWorkerHeartbeatsFromLocal: