Skip to content

Fix Y2038 worker heartbeat overflow by migrating timestamps to i64#8788

Open
dpol1 wants to merge 6 commits into
apache:masterfrom
dpol1:fix/7897
Open

Fix Y2038 worker heartbeat overflow by migrating timestamps to i64#8788
dpol1 wants to merge 6 commits into
apache:masterfrom
dpol1:fix/7897

Conversation

@dpol1

@dpol1 dpol1 commented Jun 11, 2026

Copy link
Copy Markdown
Member

What is the purpose of the change

Fixes the Y2038 heartbeat overflow issue. Worker heartbeat timestamps (time_secs) were carried as i32 seconds, which would overflow on 2038-01-19. This change migrates the timestamps to 64-bit values (i64) to prevent false timeouts and cluster failures post-2038. It also establishes the expected bounce-upgrade semantics for legacy heartbeats.

How was the change tested

  • Added Y2038HeartbeatTest to ensure that 64-bit timestamps round-trip correctly through Thrift serialization.

  • Verified that HeartbeatCache successfully parses and handles post-2038 heartbeats without flagging them as timed out.

  • Tested schema backward-compatibility to ensure legacy i32 payloads fail validation gracefully during the bounce-upgrade window.

  • Fixes [STORM-4116] Heartbeats mechanism is affected by Y2038 bug #7897

dpol1 added 6 commits June 11, 2026 09:46
Five regression tests for STORM apache#7897:
- CWH time_secs i64 round-trip survives post-2038 epochs
- HeartbeatCache accepts Long TIME_SECS beats without false timeout
- documents int currentTimeSecs() overflow (negative guard)
- legacy i32 LSWorkerHeartbeat blob fails required-field validation
  under the i64 schema (wire-compat semantics)
- post-2038 executor launch window not misclassified as dead
Time.currentTimeSecs() narrows epoch seconds to int, which overflows on
2038-01-19T03:14:07Z and is the root cause of STORM apache#7897 (workers
falsely timed out post-2038). Add currentTimeSecsLong() and
deltaSecsLong(long) for absolute timestamps. The int variants stay,
deprecated, for short relative durations (uptime/UI).
Promote time_secs from i32 to i64 in ClusterWorkerHeartbeat,
SupervisorWorkerHeartbeat and LSWorkerHeartbeat; uptime_secs stays i32
(relative duration). Regenerate the Java and Python thrift bindings for
the three structs.

Widen the callers that the type change forces at compile time:
ExecutorBeat.timeSecs, ClusterUtils.convertExecutorBeats and the
latest-heartbeat comparison in PaceMakerStateStorage.get_worker_hb
(an absolute comparison that broke across the 2038 rollover).

Wire compat: thrift tags i32/i64 differently, so blobs written by the
old schema fail required-field validation under the new one. Heartbeats
self-heal on re-report; a full-cluster bounce upgrade is required.
Switch every heartbeat writer to Time.currentTimeSecsLong():
- Worker.doHeartBeat (LSWorkerHeartbeat, on-disk local state)
- ClientStatsUtil.mkZkWorkerHb (ZK beat map, TIME_SECS now a Long)
- ClientStatsUtil.thriftifyZkWorkerHb (keep full long, no intValue narrowing)
- StatsUtil.thriftifyRpcWorkerHb (SupervisorWorkerHeartbeat)
- SupervisorHeartbeat (SupervisorInfo.time_secs, already i64 on the wire
  but previously fed a wrapped int)
HeartbeatCache (the Nimbus worker-liveness site) tracks receipt time
and reported time as Long and computes timeouts with deltaSecsLong;
beat map values are read through Number so Integer beats from legacy
producers still work. The executor launch-window check no longer
truncates assignment start times through intValue().

Supervisor and logviewer consumers move off the int clock as well:
Slot heartbeat-age checks use deltaSecsLong, and the logviewer
alive-worker scan (WorkerLogs/LogCleaner) carries epoch seconds as
long end to end.
Explain the STORM-7897 time_secs i32->i64 promotion, why uptime_secs
stays i32, the deprecation of the int-based Time methods, and the
full-cluster bounce upgrade requirement (legacy heartbeat blobs fail
required-field validation and self-heal on re-report).
@dpol1 dpol1 changed the title Fix/7897 Fix Y2038 worker heartbeat overflow by migrating timestamps to i64 Jun 11, 2026
@rzo1 rzo1 requested review from reiabreu and rzo1 June 11, 2026 13:56
@rzo1 rzo1 added this to the 3.0.0 milestone Jun 11, 2026
@rzo1 rzo1 added bug java Pull requests that update Java code labels Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug java Pull requests that update Java code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[STORM-4116] Heartbeats mechanism is affected by Y2038 bug

2 participants