Fix Y2038 worker heartbeat overflow by migrating timestamps to i64 #8788
Open
dpol1 wants to merge 6 commits into
Five regression tests for STORM apache#7897:
- CWH time_secs i64 round-trip survives post-2038 epochs
- HeartbeatCache accepts Long TIME_SECS beats without false timeout
- documents int currentTimeSecs() overflow (negative guard)
- legacy i32 LSWorkerHeartbeat blob fails required-field validation under the i64 schema (wire-compat semantics)
- post-2038 executor launch window not misclassified as dead
Time.currentTimeSecs() narrows epoch seconds to int, which overflows on 2038-01-19T03:14:07Z and is the root cause of STORM apache#7897 (workers falsely timed out post-2038). Add currentTimeSecsLong() and deltaSecsLong(long) for absolute timestamps. The int variants stay, deprecated, for short relative durations (uptime/UI).
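The narrowing described in this commit can be reproduced in isolation. The following is a minimal sketch, not Storm's `Time` class: `EpochClockSketch` and its method names are hypothetical stand-ins for the int and long clock variants, and they take the clock reading as a parameter so the 2038 boundary can be exercised directly.

```java
import java.time.Instant;

// Hypothetical stand-ins for the int and long clock helpers; not Storm's Time class.
final class EpochClockSketch {
    // Mirrors the narrowing pattern: epoch milliseconds -> seconds -> int.
    static int secsAsInt(long epochMillis) {
        return (int) (epochMillis / 1000L);
    }

    // Long variant: no narrowing, so it is safe for absolute timestamps.
    static long secsAsLong(long epochMillis) {
        return epochMillis / 1000L;
    }

    // Elapsed seconds between two absolute long timestamps.
    static long deltaSecsLong(long nowSecs, long thenSecs) {
        return nowSecs - thenSecs;
    }

    public static void main(String[] args) {
        long lastGood = Instant.parse("2038-01-19T03:14:07Z").toEpochMilli();
        long firstBad = lastGood + 1000L;
        System.out.println(secsAsInt(lastGood));   // 2147483647 (Integer.MAX_VALUE)
        System.out.println(secsAsInt(firstBad));   // -2147483648 (wrapped)
        System.out.println(secsAsLong(firstBad));  // 2147483648
    }
}
```

The int variant is still adequate for short relative durations such as uptime, which is why only absolute timestamps need the long path.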
Promote time_secs from i32 to i64 in ClusterWorkerHeartbeat, SupervisorWorkerHeartbeat and LSWorkerHeartbeat; uptime_secs stays i32 (relative duration). Regenerate the Java and Python thrift bindings for the three structs.

Widen the callers that the type change forces at compile time: ExecutorBeat.timeSecs, ClusterUtils.convertExecutorBeats and the latest-heartbeat comparison in PaceMakerStateStorage.get_worker_hb (an absolute comparison that broke across the 2038 rollover).

Wire compat: thrift tags i32/i64 differently, so blobs written by the old schema fail required-field validation under the new one. Heartbeats self-heal on re-report; a full-cluster bounce upgrade is required.
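The wire-compat note can be illustrated with a toy model. This is a sketch under stated assumptions, not the Thrift library or the generated Storm bindings: `WireTagSketch`, `WireField`, `readTimeSecs` and `validate` are hypothetical names, and the reader is simplified to a single field. The only values taken from Thrift are the type ids 8 (i32) and 10 (i64).

```java
import java.util.Optional;

// Toy model of a type-tagged required field. It is not the Thrift library;
// WireTagSketch, WireField and both methods are hypothetical names.
final class WireTagSketch {
    // Type ids that Thrift's TType assigns to the two integer widths.
    static final byte I32 = 8;
    static final byte I64 = 10;

    // A field as it appears on the wire: a type tag plus a payload.
    static final class WireField {
        final byte type;
        final long value;

        WireField(byte type, long value) {
            this.type = type;
            this.value = value;
        }
    }

    // Reader for the new schema: a field whose tag is not I64 is skipped,
    // so time_secs stays unset.
    static Optional<Long> readTimeSecs(WireField field) {
        return field.type == I64 ? Optional.of(field.value) : Optional.empty();
    }

    // Required-field validation: an unset time_secs is rejected.
    static long validate(Optional<Long> timeSecs) {
        return timeSecs.orElseThrow(
            () -> new IllegalStateException("required field time_secs is unset"));
    }
}
```

In this model a legacy blob tagged `I32` is never misread as a truncated value; it is rejected outright, which matches the commit's description of heartbeats that fail validation and recover when the worker reports again.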
Switch every heartbeat writer to Time.currentTimeSecsLong():
- Worker.doHeartBeat (LSWorkerHeartbeat, on-disk local state)
- ClientStatsUtil.mkZkWorkerHb (ZK beat map, TIME_SECS now a Long)
- ClientStatsUtil.thriftifyZkWorkerHb (keep full long, no intValue narrowing)
- StatsUtil.thriftifyRpcWorkerHb (SupervisorWorkerHeartbeat)
- SupervisorHeartbeat (SupervisorInfo.time_secs, already i64 on the wire but previously fed a wrapped int)
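The SupervisorInfo case shows why a 64-bit wire field alone is not enough: if the value passes through an int first, widening it back to long preserves the wrapped value rather than restoring the original. A minimal sketch, with `WidenWrappedIntSketch` and its methods as hypothetical names:

```java
// Hypothetical illustration of feeding an i64 field from an int clock.
final class WidenWrappedIntSketch {
    // Writer that goes through an int clock before storing into a long field.
    static long fromIntClock(long epochSecs) {
        int narrowed = (int) epochSecs; // what an int clock would hand back
        return narrowed;                // implicit widening keeps the wrapped value
    }

    // Writer that keeps the full long end to end.
    static long fromLongClock(long epochSecs) {
        return epochSecs;
    }

    public static void main(String[] args) {
        long post2038 = 2147483648L; // one second past Integer.MAX_VALUE
        System.out.println(fromIntClock(post2038));  // -2147483648
        System.out.println(fromLongClock(post2038)); // 2147483648
    }
}
```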
HeartbeatCache (the Nimbus worker-liveness site) tracks receipt time and reported time as Long and computes timeouts with deltaSecsLong; beat map values are read through Number so Integer beats from legacy producers still work. The executor launch-window check no longer truncates assignment start times through intValue().

Supervisor and logviewer consumers move off the int clock as well: Slot heartbeat-age checks use deltaSecsLong, and the logviewer alive-worker scan (WorkerLogs/LogCleaner) carries epoch seconds as long end to end.
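Reading the beat value through `Number` and comparing ages as long can be sketched as follows. This is not the HeartbeatCache code: `BeatTimeoutSketch`, the `"time-secs"` key, and the `>=` threshold are assumptions chosen for illustration.

```java
import java.util.Map;

// Hypothetical sketch of a liveness check over a beat map; not HeartbeatCache itself.
final class BeatTimeoutSketch {
    // Assumed key; stands in for the TIME_SECS entry of the beat map.
    static final String TIME_SECS = "time-secs";

    // Accepts Integer (legacy producer) and Long (new producer) values alike.
    static long reportedSecs(Map<String, Object> beat) {
        return ((Number) beat.get(TIME_SECS)).longValue();
    }

    // Age is computed in long arithmetic, so it stays correct past 2038.
    // The >= comparison is an assumption about the threshold semantics.
    static boolean isTimedOut(long nowSecs, long lastBeatSecs, long timeoutSecs) {
        return nowSecs - lastBeatSecs >= timeoutSecs;
    }
}
```

With a beat reported at 2147483690 and a clock reading of 2147483700, a 30-second timeout is not reached. If the reported time had been narrowed to int first, it would read as a large negative number and the same check would report a timeout.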
Explain the STORM-7897 time_secs i32->i64 promotion, why uptime_secs stays i32, the deprecation of the int-based Time methods, and the full-cluster bounce upgrade requirement (legacy heartbeat blobs fail required-field validation and self-heal on re-report).
What is the purpose of the change

Fixes the Y2038 heartbeat overflow issue. Worker heartbeat timestamps (`time_secs`) were carried as `i32` seconds, which would overflow on 2038-01-19. This change migrates the timestamps to 64-bit values (i64) to prevent false timeouts and cluster failures post-2038. It also establishes the expected bounce-upgrade semantics for legacy heartbeats.

How was the change tested

- Added `Y2038HeartbeatTest` to ensure that 64-bit timestamps round-trip correctly through Thrift serialization.
- Verified that `HeartbeatCache` successfully parses and handles post-2038 heartbeats without flagging them as timed out.
- Tested schema backward-compatibility to ensure legacy `i32` payloads fail validation gracefully during the bounce-upgrade window.

Fixes [STORM-4116] Heartbeats mechanism is affected by Y2038 bug #7897