HDDS-15533. DNS refresh on heartbeat failure for DN to SCM#10488
Draft
kerneltime wants to merge 4 commits into
Copilot stopped reviewing on behalf of
kerneltime due to an error
June 11, 2026 15:30
78e8646 to
45b4dd8
Rebased onto the updated PR-3 (#10487) tip and retitled to HDDS-15533 per @szetszwo's subtask request. Copilot's earlier review pass errored out and posted no inline comments. Will re-request a Copilot review once this PR's status is settled. The substantive Copilot findings on PR-1, PR-2, and PR-3 have been addressed in their owning PRs and propagate forward via rebase.
Ratis builds gRPC channels via NettyChannelBuilder.forTarget(address),
where the default DnsNameResolver re-resolves hostnames on connection
failure. Two of the three pre-existing createRaftPeer paths in OM, and
the AddSCMRequest path in SCMHAManagerImpl, were passing
new InetSocketAddress(omNode.getInetAddress(), ratisPort) -- which
bakes the resolved IP into RaftPeer.address. Once baked, Ratis (and
gRPC under it) keeps dialing that IP for the channel's lifetime, so
peer-pod restarts in Kubernetes never recover until the parent process
is restarted.
Switch every createRaftPeer / AddSCMRequest call to pass the
hostname:port string. Collapse the two OzoneManagerRatisServer
overloads into one. Replace the misleading
"// TODO : Should we use IP instead of hostname??" comment in
SCMRatisServerImpl.buildRaftGroup and SCMHAManagerImpl with
explanatory comments citing HDDS-15514.
Add testCreateRaftPeerUsesHostnameAddress to assert the contract:
RaftPeer.address must NEVER be an IPv4 numeric form. This catches any
future regression that re-introduces InetSocketAddress at this seam.
This is the first of four PRs splitting HDDS-15514 along its natural
code-path boundaries. No flag, no exception classifier, and no atomic
swap machinery in this PR -- those land with the proxy-provider PRs
that follow.
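The hostname-versus-baked-IP contract described above can be illustrated with a minimal sketch. This is not the PR's test code: PeerAddressSketch, its helpers, the om1.example hostname, and port 9872 are hypothetical, and the string produced by bakedAddress only stands in for whatever a resolved InetSocketAddress renders to at the Ratis seam.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.regex.Pattern;

/** Hypothetical illustration; not part of Ozone or Ratis. */
final class PeerAddressSketch {
  // Matches "a.b.c.d:port", the shape a baked IPv4 address takes.
  private static final Pattern IPV4_WITH_PORT =
      Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}:\\d+$");

  /** Builds an InetAddress from literal octets without any DNS lookup. */
  static InetAddress ip(int a, int b, int c, int d) {
    try {
      return InetAddress.getByAddress(
          new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException(e);
    }
  }

  /** The safe form: the hostname survives, so it can be re-resolved later. */
  static String hostnameAddress(String host, int port) {
    return host + ":" + port;
  }

  /** The problematic form: only the resolved IP survives. */
  static String bakedAddress(InetAddress resolved, int port) {
    InetSocketAddress addr = new InetSocketAddress(resolved, port);
    return addr.getAddress().getHostAddress() + ":" + addr.getPort();
  }

  /** The kind of check a regression test could apply to a peer address. */
  static boolean isIpv4Literal(String address) {
    return IPV4_WITH_PORT.matcher(address).matches();
  }

  public static void main(String[] args) {
    System.out.println(isIpv4Literal(hostnameAddress("om1.example", 9872)));
    System.out.println(isIpv4Literal(bakedAddress(ip(10, 0, 0, 7), 9872)));
  }
}
```

A regex check of this kind only guards against IPv4 literals, which matches the contract as stated; an IPv6 literal would need a separate pattern.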
OMProxyInfo constructs an InetSocketAddress at process start and reuses
it for the proxy's lifetime. InetSocketAddress freezes the resolved IP
at construction; when an OM pod is rescheduled to a new IP under a
stable DNS name (Kubernetes), every subsequent client RPC dials the
gone-away IP forever and only a process restart recovers.
Fix it at the FailoverProxyProvider seam, gated by a new opt-in flag
(ozone.client.failover.resolve-needed, default false).
Shared infrastructure (used by subsequent PRs in this series):
- ConnectionFailureUtils: classifies a Throwable's cause chain
(depth-bounded to 16) as a connection-class failure. Connection
types: ConnectException, SocketTimeoutException,
NoRouteToHostException, UnknownHostException, EOFException,
SocketException. Application errors (OMException, OMNotLeaderEx,
AccessControlException, RetryAction-coded responses) are NOT
classified as connection failures, so DNS load is not amplified
by logical errors.
- ozone.client.failover.resolve-needed flag.
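As a rough illustration of the classifier described in the list above, the following sketch walks a cause chain with a fixed depth bound. It is an assumption-laden stand-in, not the ConnectionFailureUtils source: the class name ConnectionFailureSketch is hypothetical, and matching by instanceof (rather than by exact type) is a choice made here for brevity.

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

/** Hypothetical sketch of a depth-bounded cause-chain classifier. */
final class ConnectionFailureSketch {
  // The commit message states a bound of 16; the bound also makes
  // cyclic cause chains terminate without extra bookkeeping.
  private static final int MAX_DEPTH = 16;

  static boolean isConnectionFailure(Throwable t) {
    Throwable current = t;
    for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
      if (current instanceof ConnectException
          || current instanceof SocketTimeoutException
          || current instanceof NoRouteToHostException
          || current instanceof UnknownHostException
          || current instanceof EOFException
          || current instanceof SocketException) {
        return true;
      }
      current = current.getCause();
    }
    // Application-level errors fall through and never trigger a DNS refresh.
    return false;
  }

  public static void main(String[] args) {
    // prints "true": a connection failure wrapped in an IOException
    System.out.println(
        isConnectionFailure(new IOException(new ConnectException("refused"))));
    // prints "false": a logical error with no connection-class cause
    System.out.println(
        isConnectionFailure(new IllegalStateException("not leader")));
  }
}
```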
Client -> OM Hadoop RPC mechanism:
- OMProxyInfo preserves the original host:port string and
refreshAddressIfChanged() re-resolves it outside the entry
monitor; on IP change, atomically swaps the cached
InetSocketAddress / dtService / proxy=null under the monitor;
stops the stale proxy via RPC.stopProxy outside the monitor.
- OMFailoverProxyProviderBase.shouldRetry calls the refresh on
connection-class exceptions only when the flag is on. On a
successful refresh, returns FAILOVER_AND_RETRY but pins
nextProxyIndex to the current node so RetryInvocationHandler
does NOT skip past the just-refreshed peer.
- HadoopRpcOMFailoverProxyProvider and the follower-read variant
pass the preserved hostname string to OMProxyInfo at
construction.
OM <-> OM Hadoop-RPC control-plane (OMInterServiceProtocol) rides on
the same OMFailoverProxyProvider machinery, so OM-to-OM Hadoop-RPC
recovery is a free transitive benefit of this PR. The gRPC OM client
(GrpcOMFailoverProxyProvider) was already correct (placeholder
InetSocketAddress(0); gRPC's NameResolver re-resolves on its own) and
is unchanged.
Secure-cluster prerequisite documented inline in ozone-default.xml:
when this flag is true on a Kerberos cluster, operators must also set
hadoop.security.token.service.use_ip=false in core-site.xml. Same
prerequisite HADOOP-17068 carries: the Hadoop delegation-token service
ID defaults to IP:port and would silently fail token selection after
a refresh without that co-config.
This is PR 2 of 4 splitting HDDS-15514 along its natural code-path
boundaries. PR-1 (Ratis hostname-only fix) is the merge base.
Subsequent PRs:
- PR-3: OM -> SCM (SCMFailoverProxyProviderBase / SCMProxyInfo).
- PR-4: DN -> SCM heartbeat (EndpointStateMachine /
SCMConnectionManager / StateContext).
Tests:
- TestConnectionFailureUtils (new, 20 tests): bare types,
IOException-wrapped, deeply nested chains (3 levels), application
negative cases, length-2 cause cycles (terminates), 1024-deep
non-matching chains (cost bound).
- TestOMProxyInfoDnsRefresh (new, 4 tests): no-op preserves cached
proxy, swap on IP change, rebuilt proxy uses freshly-resolved
address, dtService updates. Uses a @VisibleForTesting setter.
- TestOMFailoverProxyProviderRefreshWired (new, 5 tests):
SocketTimeoutException triggers refresh (the AWS EC2 silent-drop
case end-to-end); ConnectException triggers refresh; OMException
does NOT; flag-off does NOT; nextProxyIndex stays pinned after
successful refresh.
Existing TestOMFailoverProxyProvider (8) and TestOMFailovers (1)
verified non-regressed.
SCMProxyInfo constructs an InetSocketAddress at OM startup and reuses
it for the SCM proxy's lifetime. InetSocketAddress freezes the
resolved IP at construction; when an SCM pod is rescheduled to a new
IP under a stable DNS name (Kubernetes), every subsequent OM to SCM
RPC dials the gone-away IP forever, and only an OM process restart
recovers.
Apply the same DNS-refresh-on-failure pattern PR-2 introduced for
Client to OM. Reuses the ConnectionFailureUtils classifier and the
ozone.client.failover.resolve-needed flag landed in PR-2.
SCMProxyInfo:
- New final hostAndPort String preserves the config-time host:port
string. The string is the source of truth for re-resolution; the
InetSocketAddress is now a derived cache.
- rpcAddr becomes mutable behind the entry monitor (was effectively
final).
- getHostAndPort() accessor for the provider's refresh path.
SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged(nodeId):
- PHASE A (no lock): re-resolve hostAndPort. Compare with cached
rpcAddr.getAddress(). If unchanged or unresolved, return false.
- PHASE B (under provider monitor): capture stale proxy reference,
swap rpcAddr, clear cached proxy. Next getProxy(nodeId) rebuilds
via existing createSCMProxy(nodeId) path.
- PHASE C (no lock): RPC.stopProxy(staleProxy). Holding the monitor
across socket teardown would stall every concurrent getProxy()
caller.
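The three phases above can be sketched in isolation. This is a simplified model under stated assumptions, not the SCMFailoverProxyProviderBase code: ThreePhaseRefreshSketch is hypothetical, the resolver is injected as a function (returning null when unresolved) so the example needs no real DNS, and the stopProxy callback stands in for RPC.stopProxy.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical model of resolve-outside-lock, swap-under-lock, stop-outside-lock. */
final class ThreePhaseRefreshSketch {
  private final String host;
  private final Function<String, InetAddress> resolver;
  private final Consumer<Object> stopProxy;
  private volatile InetAddress cachedAddress;
  private Object cachedProxy; // guarded by "this"

  ThreePhaseRefreshSketch(String host, InetAddress initial, Object proxy,
      Function<String, InetAddress> resolver, Consumer<Object> stopProxy) {
    this.host = host;
    this.cachedAddress = initial;
    this.cachedProxy = proxy;
    this.resolver = resolver;
    this.stopProxy = stopProxy;
  }

  /** Builds an InetAddress from literal octets without any DNS lookup. */
  static InetAddress ip(int a, int b, int c, int d) {
    try {
      return InetAddress.getByAddress(
          new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException(e);
    }
  }

  boolean refreshIfChanged() {
    // PHASE A (no lock): a slow resolver must not block other callers.
    InetAddress fresh = resolver.apply(host);
    if (fresh == null || fresh.equals(cachedAddress)) {
      return false;
    }
    Object stale;
    // PHASE B (under monitor): swap the address and drop the cached proxy.
    synchronized (this) {
      stale = cachedProxy;
      cachedAddress = fresh;
      cachedProxy = null; // the next lookup would rebuild it
    }
    // PHASE C (no lock): teardown may block, so it runs outside the monitor.
    if (stale != null) {
      stopProxy.accept(stale);
    }
    return true;
  }

  InetAddress address() {
    return cachedAddress;
  }

  synchronized Object proxy() {
    return cachedProxy;
  }
}
```

A caller would invoke refreshIfChanged() from its retry path and treat a true result as "retry against the same peer".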
shouldRetry wiring:
- When the flag is true AND
ConnectionFailureUtils.isConnectionFailure(exception) matches,
the provider calls refreshProxyAddressIfChanged for the current
SCM nodeId.
- On a successful refresh, returns FAILOVER_AND_RETRY but pins
updatedLeaderNodeID to the just-refreshed nodeId so
RetryInvocationHandler does NOT advance past the now-fixed peer.
Without the pin, an N-peer SCM HA cluster would skip the fixed
SCM for up to N-1 attempts.
Tests:
- TestSCMFailoverProxyProviderRefresh (new, 3 tests): per-instance
swap on IP change, no-op when unchanged, no-op without
preserved hostAndPort (legacy code path).
- TestSCMFailoverProxyProviderRefreshWired (new, 5 tests): end-to-end
retry path. ConnectException + SocketTimeoutException trigger
refresh; application errors and flag-off do NOT;
updatedLeaderNodeID stays pinned across successful refresh.
Existing TestSCMFailoverProxyProvider verified non-regressed.
This is PR 3 of 4 splitting HDDS-15514. Stacked on PR-2
(HDDS-15514-client-om-refresh). PR-4 (DN to SCM heartbeat) follows.
EndpointStateMachine.address is constructed at DN startup from the
configured host:port and reused for the lifetime of the DN heartbeat
loop. InetSocketAddress freezes the resolved IP at construction; when
an SCM pod is rescheduled to a new IP under a stable DNS name
(Kubernetes), every heartbeat to that peer dials the gone-away IP
forever. The DN's endpoints set still contains the broken peer's
EndpointStateMachine, but that machine never recovers without a DN
process restart.
Apply DNS-refresh-on-failure for the DN heartbeat path. Reuses
ConnectionFailureUtils and the ozone.client.failover.resolve-needed
flag landed in PR-2. Adds a separate threshold knob since the
heartbeat path runs at a much higher cadence than the failover-proxy
seams.
EndpointStateMachine.resolveLatestAddress():
- Re-resolve the preserved hostAndPort via NetUtils.createSocketAddr.
- Return the freshly-resolved InetSocketAddress only if its
getAddress() differs from the cached one.
- Return null on legacy endpoints (no preserved hostAndPort),
unresolved DNS, or unchanged IP -- so callers can opt into a
swap without committing to one.
SCMConnectionManager.refreshSCMServer() -- 4-phase atomic swap:
- PHASE A (read lock): snapshot endpoint reference + hostAndPort.
- PHASE B (no lock): resolveLatestAddress. DNS lookup must NEVER
hold any lock; a slow / dead resolver under
lock would freeze every concurrent heartbeat
and reconfiguration path.
- PHASE C (write lock): re-check snapshot (defends against
concurrent removeSCMServer / refresh races),
enforce collision invariant (refuse swap if
the resolved IP collides with another
registered peer key -- transient kube-DNS
can return peer-B's IP for peer-A's
hostname; overwriting peer-B would leak
its executor and orphan its task thread),
BUILD replacement endpoint BEFORE removing
stale (build failure must NOT leave the
peer absent from scmMachines), commit swap.
- PHASE D (no lock): close stale endpoint. RPC.stopProxy +
socket teardown blocks; holding the write
lock across that stalls every concurrent
heartbeat.
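A compact model of the four phases may make the ordering easier to follow. It is a sketch only: FourPhaseSwapSketch and its nested Endpoint are hypothetical stand-ins for SCMConnectionManager and EndpointStateMachine, the resolver is injected as a function so no real DNS is needed, and the map is keyed by InetAddress instead of InetSocketAddress to keep the example short.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

/** Hypothetical model of a snapshot / resolve / swap / close sequence. */
final class FourPhaseSwapSketch {
  static final class Endpoint {
    final String host;
    final InetAddress address;
    volatile boolean closed;
    Endpoint(String host, InetAddress address) {
      this.host = host;
      this.address = address;
    }
    void close() { closed = true; }
  }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<InetAddress, Endpoint> machines = new HashMap<>();
  private final Function<String, InetAddress> resolver;

  FourPhaseSwapSketch(Function<String, InetAddress> resolver) {
    this.resolver = resolver;
  }

  static InetAddress ip(int a, int b, int c, int d) {
    try {
      return InetAddress.getByAddress(
          new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException(e);
    }
  }

  void add(Endpoint e) {
    lock.writeLock().lock();
    try { machines.put(e.address, e); } finally { lock.writeLock().unlock(); }
  }

  Endpoint get(InetAddress key) {
    lock.readLock().lock();
    try { return machines.get(key); } finally { lock.readLock().unlock(); }
  }

  boolean refresh(InetAddress staleKey) {
    // PHASE A (read lock): snapshot the endpoint reference.
    Endpoint snapshot = get(staleKey);
    if (snapshot == null) {
      return false;
    }
    // PHASE B (no lock): DNS lookup.
    InetAddress fresh = resolver.apply(snapshot.host);
    if (fresh == null || fresh.equals(snapshot.address)) {
      return false;
    }
    // PHASE C (write lock): re-check, collision check, build, then commit.
    lock.writeLock().lock();
    try {
      if (machines.get(staleKey) != snapshot) {
        return false; // a concurrent remove or refresh won the race
      }
      if (machines.containsKey(fresh)) {
        return false; // resolved IP belongs to another registered peer
      }
      // Build first: if this threw, the stale endpoint would stay registered.
      Endpoint replacement = new Endpoint(snapshot.host, fresh);
      machines.put(fresh, replacement);
      machines.remove(staleKey);
    } finally {
      lock.writeLock().unlock();
    }
    // PHASE D (no lock): blocking teardown of the stale endpoint.
    snapshot.close();
    return true;
  }
}
```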
StateContext.migrateEndpoint -- preserve in-flight reports across
swap. Per-endpoint queues (incrementalReportsQueue, containerActions,
pipelineActions, isFullReportReadyToBeSent) are keyed by
InetSocketAddress; without migration a swap orphans every queued
report. Migration ordering preserves the invariant "every endpoint
in `endpoints` has a queue at every observable point":
1. PUBLISH: install new-key queues alongside old-key queues.
2. SWITCH: add newEndpoint to endpoints; remove oldEndpoint.
3. RETIRE: drop old-key queues (no producer can reach them now).
endpoints is now a CopyOnWriteArraySet (was HashSet).
incrementalReportsQueue / containerActions / pipelineActions /
isFullReportReadyToBeSent are now ConcurrentHashMap. Producers
null-skip queue lookups as defense-in-depth against producer-vs-
migration races. The full-report flags get a special case: a swapped
endpoint is effectively a fresh peer (the new SCM pod has no idea
which reports we already shipped), so its isFullReportReadyToBeSent
flags are seeded fresh -- not copied from the old key.
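The PUBLISH / SWITCH / RETIRE ordering can be sketched with plain concurrent collections. This is an illustrative model, not StateContext: MigrateEndpointSketch is hypothetical, endpoint keys are strings instead of InetSocketAddress, a single report queue and a single boolean flag stand in for the real per-endpoint structures, and "seeded fresh" is assumed here to mean "a full report is due". Sharing the old queue instance under the new key during PUBLISH is also an assumption of this sketch, chosen so a producer that still writes to the old key does not lose its report.

```java
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArraySet;

/** Hypothetical model of migrating per-endpoint queues to a new key. */
final class MigrateEndpointSketch {
  private final Set<String> endpoints = new CopyOnWriteArraySet<>();
  private final Map<String, Queue<String>> reports = new ConcurrentHashMap<>();
  private final Map<String, Boolean> fullReportDue = new ConcurrentHashMap<>();

  void addEndpoint(String key) {
    // Queue first, then membership, so the invariant holds from the start.
    reports.put(key, new ConcurrentLinkedQueue<>());
    fullReportDue.put(key, Boolean.TRUE);
    endpoints.add(key);
  }

  void addReport(String report) {
    for (String key : endpoints) {
      Queue<String> queue = reports.get(key);
      if (queue != null) { // null-skip: tolerate a racing migration
        queue.add(report);
      }
    }
  }

  void markFullReportSent(String key) {
    fullReportDue.replace(key, Boolean.FALSE);
  }

  void migrate(String oldKey, String newKey) {
    // 1. PUBLISH: the new key gets a queue before it becomes visible.
    Queue<String> carried = reports.get(oldKey);
    reports.put(newKey, carried != null ? carried : new ConcurrentLinkedQueue<>());
    fullReportDue.put(newKey, Boolean.TRUE); // fresh peer, not copied
    // 2. SWITCH: add the new endpoint, then remove the old one.
    endpoints.add(newKey);
    endpoints.remove(oldKey);
    // 3. RETIRE: no producer iterating "endpoints" can reach the old key now.
    reports.remove(oldKey);
    fullReportDue.remove(oldKey);
  }

  boolean hasEndpoint(String key) { return endpoints.contains(key); }

  Queue<String> pending(String key) { return reports.get(key); }

  Boolean isFullReportDue(String key) { return fullReportDue.get(key); }
}
```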
HeartbeatEndpointTask trigger:
In the heartbeat catch block, after logIfNeeded(ex):
if (resolveOnFailureEnabled
&& missedCount >= refreshThreshold
&& ConnectionFailureUtils.isConnectionFailure(ex)
&& hostAndPort != null) {
maybeRefreshScmAddress();
}
All four gates required. Application-level errors do NOT trigger.
Endpoints without a preserved hostname (legacy code path) do NOT
trigger. Threshold prevents over-reaction to a one-off blip.
New config knob:
- ozone.datanode.scm.heartbeat.address.refresh.threshold = 3.
Conservative default: at the typical 30-second heartbeat interval
and 6-second socketTimeout, this means at most ~108 seconds
(3 x (30 s + 6 s)) of dialing the stale IP before the first DNS
retry. In practice
failures are usually fast (TCP RST or routing failure), so
recovery is much faster.
Tests:
- TestSCMConnectionManager (extended, 7 = 1 prior + 6 new):
resolveLatestAddress edge cases; refreshSCMServer happy-path
swap; no-op when hostAndPort not preserved; rollback regression
-- when buildScmEndpoint throws, stale endpoint stays registered
(uses @VisibleForTesting overridable hook to inject the
failure).
- TestHeartbeatEndpointTaskDnsRefresh (new, 6): production trigger
chain. HeartbeatEndpointTask.call() catch block fires
refreshSCMServer only when (a) flag enabled, (b) threshold met,
(c) cause is connection-class, (d) hostAndPort preserved.
AccessControlException at threshold does NOT trigger. After
successful swap, StateContext's incremental-reports map has the
new key and not the old key.
- TestSCMConnectionManagerDnsRefreshE2E (new, 1, @Timeout(30)):
real-RPC swap mechanism. Stands up a real ScmTestMock RPC server
on a loopback OS-assigned port, primes the connection manager
with a stale 127.0.0.99 cache + preserved localhost:port, calls
refreshSCMServer, asserts a real sendHeartbeat round-trips
through the swapped endpoint.
Existing TestEndPoint (17) and TestHeartbeatEndpointTask (8)
verified non-regressed.
Real-world failure shapes this fix targets:
- AWS EC2 / EKS silent packet drop: stale-IP packets are silently
dropped, surfacing as SocketTimeoutException after socketTimeout.
Without this fix, the DN retries the dead IP forever.
- OpenStack TCP RST / ICMP unreachable: stale-IP packets fast-
rejected, surfacing as ConnectException. Same recovery path.
Scope: the refresh fires from the HEARTBEAT phase. If a DN starts
up with the SCM peer already at a stale IP and never reaches
HEARTBEAT, the recovery path does NOT engage. Initial-bringup DNS
staleness is the concern of HDDS-5919
(ozone.network.jvm.address.cache.enabled=false).
This is PR 4 of 4 splitting HDDS-15514. Stacked on PR-3
(HDDS-15514-om-scm-refresh).
45b4dd8 to
3d9ba8b
What changes were proposed in this pull request?
This is PR 4 of 4 splitting HDDS-15514 (originally proposed as a single ~160KB patch in #10473, split per @szetszwo's review feedback).
This PR fixes the DN → SCM heartbeat path, the largest and most invasive of the four split PRs. Unlike the failover-proxy-provider seams, the DN does not fail over; it heartbeats every SCM in parallel via the EndpointStateMachine / SCMConnectionManager abstraction. The fix introduces:
- An atomic EndpointStateMachine swap when DNS re-resolution detects an IP change.
- Queue migration in StateContext so in-flight reports survive the swap.
- A threshold knob (ozone.datanode.scm.heartbeat.address.refresh.threshold, default 3): the heartbeat path runs at a much higher cadence than the failover-proxy path, so a count-based gate prevents over-reaction to transient blips.
Why this matters
EndpointStateMachine.address is the cached InetSocketAddress that the DN heartbeat task uses to dial each SCM peer. It is constructed at DN startup from the configured host:port and never re-resolved. When an SCM pod is rescheduled in Kubernetes, every heartbeat to that peer dials the now-defunct IP forever. The DN's endpoints set still contains the broken peer's EndpointStateMachine, but that machine never recovers without a DN process restart.
This is the path the AWC ozone-operator's existing 7-layer workaround was built to defeat: after watching SCM pod IP changes, the operator force-restarts every DN. PR-4 is the upstream fix that lets the operator drop those restarts.
What this PR does
1. EndpointStateMachine preserves a hostname
- final String hostAndPort field.
- resolveLatestAddress(): re-resolves hostAndPort via NetUtils.createSocketAddr and returns the freshly-resolved InetSocketAddress only if its getAddress() differs from the cached one. Returns null on legacy endpoints (no preserved hostAndPort), unresolved DNS, or unchanged IP.
2. SCMConnectionManager.refreshSCMServer: 4-phase atomic swap
Crucial properties (each had a corresponding bug in the original combined PR that Copilot's failure-injection lens caught):
- Build before remove. If buildScmEndpoint throws (transient DNS, peer not yet accepting on the new IP, NetUtils refusing the address), the stale endpoint stays registered. Otherwise the peer would disappear from scmMachines entirely and no heartbeat could recover it. Tested by TestSCMConnectionManager.testRefreshSCMServerLeavesStaleEndpointOnBuildFailure using a @VisibleForTesting overridable buildScmEndpoint hook.
- Collision invariant. The swap is refused if the resolved IP collides with another registered peer's key; otherwise that peer's EndpointStateMachine would be silently replaced, leaking its executor and orphaning its task thread.
- Snapshot re-check. A concurrent removeSCMServer or refresh may have raced ahead while we were resolving. The write-lock phase verifies the snapshot is still current before swapping.
- close() outside the lock. Stale-endpoint teardown blocks on RPC.stopProxy; holding writeLock() across that would stall every concurrent heartbeat / reconfiguration.
3. StateContext.migrateEndpoint: preserve in-flight reports across swap
Per-endpoint queues (incrementalReportsQueue, containerActions, pipelineActions, isFullReportReadyToBeSent) are keyed by InetSocketAddress. Without migration, a swap would orphan all queued reports for that peer. The migration is ordered to preserve the invariant "every endpoint in endpoints has a queue at every observable point":
  1. PUBLISH: install new-key queues alongside old-key queues.
  2. SWITCH: add newEndpoint to the endpoints set; remove oldEndpoint from the endpoints set.
  3. RETIRE: drop old-key queues (no producer can reach them now).
endpoints is now a CopyOnWriteArraySet (was HashSet). incrementalReportsQueue, containerActions, pipelineActions, and isFullReportReadyToBeSent are now ConcurrentHashMap (some already were). Producers null-skip queue lookups as defense-in-depth: a producer racing migration MUST NOT NPE on a concurrent remove.
The full-report flags get a special case: a swapped endpoint is effectively a fresh peer (the new SCM pod has no idea which reports we have already shipped), so its isFullReportReadyToBeSent[type] flags are seeded fresh rather than copied from the old key. Tested in TestHeartbeatEndpointTaskDnsRefresh.
4. HeartbeatEndpointTask trigger
In the heartbeat catch block, after logIfNeeded(ex), the refresh fires only when resolveOnFailureEnabled is set, missedCount >= refreshThreshold, ConnectionFailureUtils.isConnectionFailure(ex) holds, and hostAndPort is non-null. All four gates are required. Application-level errors don't trigger refresh. Endpoints without a preserved hostname (legacy code path) don't trigger. The threshold prevents over-reaction to a one-off blip.
5. New config knob
ozone.datanode.scm.heartbeat.address.refresh.threshold (default 3). Conservative default: at the typical 30-second heartbeat interval and 6-second socketTimeout, this means at most ~108 seconds of dialing the stale IP before the first DNS retry. In practice the failures are usually fast (TCP RST or routing failure), so the recovery is much faster.
Real-world failure shapes this fix targets
Two distinct failure modes drove the requirement:
- AWS EC2 / EKS silent packet drop: when the DN dials the stale IP of scm-0 after the pod has moved, AWS silently drops the packet. The TCP retry loop expires after socketTimeout (default 6 seconds in Ozone). Without this PR, the DN retries the same dead IP forever. With this PR, after threshold consecutive SocketTimeoutExceptions, the DN re-resolves DNS and swaps to the new IP.
- OpenStack TCP RST / ICMP unreachable: stale-IP packets are fast-rejected, surfacing as ConnectException. Same recovery path: after threshold consecutive failures, refresh.
How was this patch tested?
- TestSCMConnectionManager (extended): resolveLatestAddress edge cases. refreshSCMServer happy-path swap. No-op when hostAndPort not preserved. Rollback regression: when buildScmEndpoint throws, the stale endpoint remains registered (uses @VisibleForTesting overridable hook to inject the failure).
- TestHeartbeatEndpointTaskDnsRefresh (new): HeartbeatEndpointTask.call() catch block fires refreshSCMServer only when (a) flag enabled, (b) threshold met, (c) cause is connection-class, (d) hostAndPort preserved. AccessControlException at threshold does NOT trigger. After a successful swap, StateContext's incremental-reports map has the new key and not the old key.
- TestSCMConnectionManagerDnsRefreshE2E (new, @Timeout(30)): stands up a real ScmTestMock RPC server on a loopback OS-assigned port, primes the connection manager with a stale 127.0.0.99 cache + preserved localhost:port, calls refreshSCMServer, asserts a real sendHeartbeat round-trips through the swapped endpoint. Lives in hadoop-hdds/server-scm because it depends on ScmTestMock.
Existing regression suite verified non-regressed: TestEndPoint (17), TestHeartbeatEndpointTask (8).
Scope and known limitations
- The refresh fires from the HEARTBEAT phase via HeartbeatEndpointTask. If a DN starts up with the SCM peer already at a stale IP and never reaches HEARTBEAT, the recovery path does not engage. Initial-bringup DNS staleness is the existing concern of HDDS-5919 (ozone.network.jvm.address.cache.enabled=false).
- InitDatanodeState.java already postpones initialization on initial-resolution failure.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-15514