HDDS-15533. DNS refresh on heartbeat failure for DN to SCM #10488

Draft
kerneltime wants to merge 4 commits into
apache:master from
kerneltime:HDDS-15514-dn-scm-refresh

@kerneltime

What changes were proposed in this pull request?

This is PR 4 of 4 splitting HDDS-15514 (originally proposed as a single ~160KB patch in #10473, split per @szetszwo's review feedback).

This PR fixes the DN → SCM heartbeat path, the largest and most invasive of the four split PRs. Unlike the failover-proxy-provider seams, the DN does not fail over; it heartbeats every SCM in parallel via the EndpointStateMachine / SCMConnectionManager abstraction. The fix introduces:

  1. An atomic EndpointStateMachine swap when DNS re-resolution detects an IP change.
  2. Per-endpoint queue migration in StateContext so in-flight reports survive the swap.
  3. A separate threshold knob (ozone.datanode.scm.heartbeat.address.refresh.threshold, default 3) — the heartbeat path runs at a much higher cadence than the failover-proxy path, so a count-based gate prevents over-reaction to transient blips.

Stacked on #10487 (PR-3 of 4 — OM → SCM DNS refresh). Reuses the ConnectionFailureUtils classifier and the ozone.client.failover.resolve-needed flag landed in PR-2 (#10486).

Why this matters

EndpointStateMachine.address is the cached InetSocketAddress that the DN heartbeat task uses to dial each SCM peer. It is constructed at DN startup from the configured host:port and never re-resolved. When an SCM pod is rescheduled in Kubernetes, every heartbeat to that peer dials the now-defunct IP forever. The DN's endpoints set still contains the broken peer's EndpointStateMachine, but that machine never recovers without a DN process restart.

This is the path the AWC ozone-operator's existing 7-layer workaround was built to defeat: after watching SCM pod IP changes, the operator force-restarts every DN. PR-4 is the upstream fix that lets the operator drop those restarts.

What this PR does

1. EndpointStateMachine preserves a hostname

| Change | Why |
| --- | --- |
| New final String hostAndPort field. | Source of truth for re-resolution. |
| resolveLatestAddress(): re-resolves hostAndPort via NetUtils.createSocketAddr and returns the freshly-resolved InetSocketAddress only if its getAddress() differs from the cached one. Returns null on legacy endpoints (no preserved hostAndPort), unresolved DNS, or unchanged IP. | Lets the heartbeat task ask "did the IP just change?" without committing to a swap. |
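
The contract in the table above can be sketched in isolation. This is a minimal illustration, not the code in this PR: the class name `AddressCacheSketch`, the injected `resolver` function (used here instead of NetUtils.createSocketAddr so the example needs no real DNS), and the sample addresses are all assumptions.

```java
import java.net.InetSocketAddress;
import java.util.function.Function;

/** Hypothetical stand-in for the endpoint's cached address; not the Ozone class. */
final class AddressCacheSketch {
  private final String hostAndPort;        // preserved host:port; null on legacy endpoints
  private final InetSocketAddress cached;  // address resolved at construction time
  // Assumption: the resolver is injected so the example needs no real DNS.
  private final Function<String, InetSocketAddress> resolver;

  AddressCacheSketch(String hostAndPort, InetSocketAddress cached,
      Function<String, InetSocketAddress> resolver) {
    this.hostAndPort = hostAndPort;
    this.cached = cached;
    this.resolver = resolver;
  }

  /** Returns the fresh address only if its IP differs from the cached one, else null. */
  InetSocketAddress resolveLatestAddress() {
    if (hostAndPort == null) {
      return null;                         // legacy endpoint: nothing to re-resolve
    }
    InetSocketAddress fresh = resolver.apply(hostAndPort);
    if (fresh == null || fresh.isUnresolved()) {
      return null;                         // DNS did not resolve
    }
    if (fresh.getAddress().equals(cached.getAddress())) {
      return null;                         // unchanged IP
    }
    return fresh;
  }
}
```

A null return covers all three "do nothing" cases, so a caller only has to check for a non-null result before attempting a swap.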

2. SCMConnectionManager.refreshSCMServer — 4-phase atomic swap

PHASE A (read lock):       snapshot the endpoint reference and hostAndPort
PHASE B (no lock):         resolveLatestAddress  (DNS lookup must NEVER hold a lock)
PHASE C (write lock):      re-check snapshot, enforce collision invariant,
                           build replacement endpoint, commit swap
PHASE D (no lock):         close stale endpoint  (RPC.stopProxy + socket teardown)

Crucial properties (each had a corresponding bug in the original combined PR that Copilot's failure-injection lens caught):

  • Build-then-swap, never remove-then-build. If buildScmEndpoint throws (transient DNS, peer not yet accepting on the new IP, NetUtils refusing the address), the stale endpoint stays registered. Otherwise the peer would disappear from scmMachines entirely and no heartbeat could recover it. Tested by TestSCMConnectionManager.testRefreshSCMServerLeavesStaleEndpointOnBuildFailure using a @VisibleForTesting overridable buildScmEndpoint hook.
  • Refuse swaps that collide with another registered peer key. If transient kube-DNS returns peer-B's IP for peer-A's hostname, the swap is refused rather than overwriting peer-B's endpoint. Without this, peer-B's EndpointStateMachine would be silently replaced, leaking its executor and orphaning its task thread.
  • Re-check after DNS lookup. A concurrent removeSCMServer or refresh may have raced ahead while we were resolving. The write-lock phase verifies the snapshot is still current before swapping.
  • close() outside the lock. Stale-endpoint teardown blocks on RPC.stopProxy; holding writeLock() across that would stall every concurrent heartbeat / reconfiguration.
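
The four phases and the properties above can be modelled with a plain ReentrantReadWriteLock. This is a simplified sketch under stated assumptions, not the SCMConnectionManager code: `EndpointSwapSketch`, its nested `Endpoint`, the string-typed address keys, and the `resolver` / `builder` parameters are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Hypothetical model of the four-phase swap; names are illustrative only. */
final class EndpointSwapSketch {
  static final class Endpoint {
    final String hostAndPort;  // preserved name used for re-resolution
    final String address;      // resolved ip:port this endpoint dials
    volatile boolean closed;

    Endpoint(String hostAndPort, String address) {
      this.hostAndPort = hostAndPort;
      this.address = address;
    }

    void close() {
      closed = true;           // stand-in for proxy and socket teardown
    }
  }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, Endpoint> machines = new HashMap<>();  // keyed by address

  void add(Endpoint endpoint) {
    lock.writeLock().lock();
    try {
      machines.put(endpoint.address, endpoint);
    } finally {
      lock.writeLock().unlock();
    }
  }

  Endpoint get(String address) {
    lock.readLock().lock();
    try {
      return machines.get(address);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Returns true only if the endpoint keyed by oldAddress was swapped. */
  boolean refresh(String oldAddress, Function<String, String> resolver,
      BiFunction<String, String, Endpoint> builder) {
    Endpoint snapshot = get(oldAddress);                  // PHASE A: read lock
    if (snapshot == null) {
      return false;
    }
    String fresh = resolver.apply(snapshot.hostAndPort);  // PHASE B: no lock held
    if (fresh == null || fresh.equals(oldAddress)) {
      return false;
    }
    lock.writeLock().lock();                              // PHASE C: write lock
    try {
      if (machines.get(oldAddress) != snapshot) {
        return false;                                     // raced with a remove or refresh
      }
      if (machines.containsKey(fresh)) {
        return false;                                     // collides with another peer key
      }
      Endpoint replacement;
      try {
        replacement = builder.apply(snapshot.hostAndPort, fresh);
      } catch (RuntimeException e) {
        return false;                                     // stale endpoint stays registered
      }
      machines.put(fresh, replacement);                   // build first, then swap
      machines.remove(oldAddress);
    } finally {
      lock.writeLock().unlock();
    }
    snapshot.close();                                     // PHASE D: no lock held
    return true;
  }
}
```

Because the replacement is built before anything is removed, a throwing `builder` leaves the map exactly as it was, which is the behaviour the rollback test described above checks for.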

3. StateContext.migrateEndpoint — preserve in-flight reports across swap

Per-endpoint queues (incrementalReportsQueue, containerActions, pipelineActions, isFullReportReadyToBeSent) are keyed by InetSocketAddress. Without migration, a swap would orphan all queued reports for that peer. The migration is ordered to preserve the invariant "every endpoint in endpoints has a queue at every observable point":

  1. PUBLISH — install new-key queues alongside the old-key queues.
  2. SWITCH — add newEndpoint to the endpoints set; remove oldEndpoint from the endpoints set.
  3. RETIRE — drop the old-key queues (no producer can reach them after step 2).

endpoints is now a CopyOnWriteArraySet (was HashSet). incrementalReportsQueue, containerActions, pipelineActions, and isFullReportReadyToBeSent are now ConcurrentHashMap (some already were). Producers null-skip queue lookups as defense-in-depth — a producer racing migration MUST NOT NPE on a concurrent remove.

The full-report flags get a special case: a swapped endpoint is effectively a fresh peer (the new SCM pod has no idea which reports we have already shipped), so its isFullReportReadyToBeSent[type] flags are seeded fresh rather than copied from the old key. Tested in TestHeartbeatEndpointTaskDnsRefresh.
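
The PUBLISH / SWITCH / RETIRE ordering can be illustrated with a small model. This is a sketch under assumptions rather than the StateContext implementation: `QueueMigrationSketch`, the string keys, and the single report queue per endpoint are hypothetical, and "seeded fresh" is assumed here to mean the flag starts as true. Publishing the same queue object under the new key is one possible way to carry queued reports across; the PR may copy them instead.

```java
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArraySet;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of per-endpoint queue migration; not the Ozone class. */
final class QueueMigrationSketch {
  final Set<String> endpoints = new CopyOnWriteArraySet<>();
  final Map<String, Queue<String>> reports = new ConcurrentHashMap<>();
  final Map<String, AtomicBoolean> fullReportReady = new ConcurrentHashMap<>();

  void addEndpoint(String key) {
    reports.put(key, new ConcurrentLinkedQueue<>());
    fullReportReady.put(key, new AtomicBoolean(true));
    endpoints.add(key);
  }

  /** Producer side: skip quietly if a concurrent migration already retired the queue. */
  void addReport(String report) {
    for (String key : endpoints) {
      Queue<String> queue = reports.get(key);
      if (queue != null) {
        queue.add(report);
      }
    }
  }

  void migrate(String oldKey, String newKey) {
    Queue<String> queue = reports.get(oldKey);
    if (queue == null) {
      return;                                  // nothing registered under the old key
    }
    // 1. PUBLISH: the new key gets a queue before it becomes visible in endpoints.
    reports.put(newKey, queue);
    fullReportReady.put(newKey, new AtomicBoolean(true));  // seeded fresh, not copied
    // 2. SWITCH: producers iterating endpoints now find the new key.
    endpoints.add(newKey);
    endpoints.remove(oldKey);
    // 3. RETIRE: the old key is no longer reachable through endpoints.
    reports.remove(oldKey);
    fullReportReady.remove(oldKey);
  }
}
```

In this model a producer that runs between the two SWITCH steps would see both keys and enqueue the same report twice into the shared queue; the sketch does not attempt to handle that window.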

4. HeartbeatEndpointTask trigger

In the heartbeat catch block, after logIfNeeded(ex):

if (resolveOnFailureEnabled                    // ozone.client.failover.resolve-needed
    && missedCount >= refreshThreshold         // ozone.datanode.scm.heartbeat.address.refresh.threshold
    && ConnectionFailureUtils.isConnectionFailure(ex)
    && hostAndPort != null) {
  maybeRefreshScmAddress();                    // calls SCMConnectionManager.refreshSCMServer
}

All four gates are required. Application-level errors don't trigger refresh. Endpoints without a preserved hostname (legacy code path) don't trigger. The threshold prevents over-reaction to a one-off blip.
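
To make the interaction between the threshold and the other gates concrete, here is a small simulation. It is a sketch, not the HeartbeatEndpointTask code: `HeartbeatGateSketch` is hypothetical, the two-type `instanceof` check is a simplified stand-in for ConnectionFailureUtils.isConnectionFailure, and resetting the miss counter on a successful heartbeat is an assumption.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;

/** Hypothetical simulation of the four refresh gates; not the Ozone task. */
final class HeartbeatGateSketch {
  private final boolean resolveOnFailureEnabled;
  private final int refreshThreshold;
  private final String hostAndPort;
  private int missedCount;
  private int refreshAttempts;

  HeartbeatGateSketch(boolean resolveOnFailureEnabled, int refreshThreshold,
      String hostAndPort) {
    this.resolveOnFailureEnabled = resolveOnFailureEnabled;
    this.refreshThreshold = refreshThreshold;
    this.hostAndPort = hostAndPort;
  }

  /** Pass null for a successful heartbeat, or the exception the heartbeat threw. */
  void onHeartbeat(Exception failure) {
    if (failure == null) {
      missedCount = 0;                 // assumption: success resets the miss counter
      return;
    }
    missedCount++;
    // Simplified stand-in for ConnectionFailureUtils.isConnectionFailure(ex).
    boolean connectionFailure = failure instanceof ConnectException
        || failure instanceof SocketTimeoutException;
    if (resolveOnFailureEnabled
        && missedCount >= refreshThreshold
        && connectionFailure
        && hostAndPort != null) {
      refreshAttempts++;               // stand-in for maybeRefreshScmAddress()
    }
  }

  int refreshAttempts() {
    return refreshAttempts;
  }
}
```

With a threshold of 3, the first two connection failures only increment the counter; the third is the first one that attempts a refresh.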

5. New config knob

ozone.datanode.scm.heartbeat.address.refresh.threshold (default 3). Conservative default — at the typical 30-second heartbeat interval and 6-second socketTimeout, this means at most ~108 seconds of dialing the stale IP before the first DNS retry. In practice the failures are usually fast (TCP RST or routing failure), so the recovery is much faster.
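
One reading of the ~108-second figure is that each of the `threshold` failed heartbeats costs one heartbeat interval plus one full socket timeout. The helper below is only a worked version of that arithmetic; the class and method names are hypothetical.

```java
import java.time.Duration;

/** Hypothetical helper that reproduces the worst-case arithmetic. */
final class StaleDialBoundSketch {
  /** threshold consecutive failures, each costing one interval plus one timeout. */
  static Duration worstCase(int threshold, Duration heartbeatInterval,
      Duration socketTimeout) {
    return heartbeatInterval.plus(socketTimeout).multipliedBy(threshold);
  }

  public static void main(String[] args) {
    // 3 * (30 s + 6 s) = 108 s
    Duration bound = worstCase(3, Duration.ofSeconds(30), Duration.ofSeconds(6));
    System.out.println(bound.getSeconds());  // prints 108
  }
}
```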

Real-world failure shapes this fix targets

Two distinct failure modes drove the requirement:

  • AWS EC2 / EKS — silent packet drop. When a DN attempts to connect to the cached IP of scm-0 after the pod has moved, AWS silently drops the packet. The TCP retry loop expires after socketTimeout (default 6 seconds in Ozone). Without this PR, the DN retries the same dead IP forever. With this PR, after threshold consecutive SocketTimeoutExceptions, the DN re-resolves DNS and swaps to the new IP.
  • OpenStack — TCP RST or ICMP unreachable. The network stack fast-rejects packets to the dead IP, surfacing as ConnectException. Same recovery path: after threshold consecutive failures, refresh.

How was this patch tested?

| Test class | Count | Coverage |
| --- | --- | --- |
| TestSCMConnectionManager (extended) | 7 (1 prior + 6 new) | resolveLatestAddress edge cases. refreshSCMServer happy-path swap. No-op when hostAndPort not preserved. Rollback regression: when buildScmEndpoint throws, the stale endpoint remains registered (uses @VisibleForTesting overridable hook to inject the failure). |
| TestHeartbeatEndpointTaskDnsRefresh (new) | 6 | Production trigger chain. HeartbeatEndpointTask.call() catch block fires refreshSCMServer only when (a) flag enabled, (b) threshold met, (c) cause is connection-class, (d) hostAndPort preserved. AccessControlException at threshold does NOT trigger. After a successful swap, StateContext's incremental-reports map has the new key and not the old key. |
| TestSCMConnectionManagerDnsRefreshE2E (new) | 1 (@Timeout(30)) | Real-RPC swap mechanism. Stands up a real ScmTestMock RPC server on a loopback OS-assigned port, primes the connection manager with a stale 127.0.0.99 cache + preserved localhost:port, calls refreshSCMServer, asserts a real sendHeartbeat round-trips through the swapped endpoint. Lives in hadoop-hdds/server-scm because it depends on ScmTestMock. |

Existing regression suite verified non-regressed: TestEndPoint (17), TestHeartbeatEndpointTask (8).

Scope and known limitations

  • DN initial bringup with stale DNS: the refresh fires from the HEARTBEAT phase via HeartbeatEndpointTask. If a DN starts up with the SCM peer already at a stale IP and never reaches HEARTBEAT, the recovery path does not engage. Initial-bringup DNS staleness is the existing concern of HDDS-5919's ozone.network.jvm.address.cache.enabled=false. InitDatanodeState.java already postpones initialization on initial-resolution failure.
  • HDFS-14118-style construction-time DNS fan-out (one hostname → multiple persistent IPs, for round-robin DNS HA) is a different problem and out of scope. Worth a follow-on JIRA if needed.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-15514

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@adoroszlai adoroszlai marked this pull request as draft June 11, 2026 16:31
@kerneltime kerneltime force-pushed the HDDS-15514-dn-scm-refresh branch from 78e8646 to 45b4dd8 Compare June 12, 2026 06:23
@kerneltime kerneltime changed the title from "HDDS-15514. DNS refresh on heartbeat failure for DN to SCM" to "HDDS-15533. DNS refresh on heartbeat failure for DN to SCM" Jun 12, 2026
@kerneltime

Rebased onto the updated PR-3 (#10487) tip and retitled to HDDS-15533 per @szetszwo's subtask request.

Copilot's earlier review pass errored out and posted no inline comments. Will re-request a Copilot review once this PR's status is settled. The substantive Copilot findings on PR-1, PR-2, and PR-3 have been addressed in their owning PRs and propagate forward via rebase.

Ratis builds gRPC channels via NettyChannelBuilder.forTarget(address),
where the default DnsNameResolver re-resolves hostnames on connection
failure. Two of the three pre-existing createRaftPeer paths in OM, and
the AddSCMRequest path in SCMHAManagerImpl, were passing
new InetSocketAddress(omNode.getInetAddress(), ratisPort) -- which bakes
the resolved IP into RaftPeer.address. Once baked, Ratis (and gRPC under
it) keeps dialing that IP for the channel's lifetime, so peer-pod
restarts in Kubernetes never recover until the parent process is
restarted.

Switch every createRaftPeer / AddSCMRequest call to pass the
hostname:port string. Collapse the two OzoneManagerRatisServer
overloads into one.

Replace the misleading "// TODO : Should we use IP instead of hostname??"
comment in SCMRatisServerImpl.buildRaftGroup and SCMHAManagerImpl with
explanatory comments citing HDDS-15514.

Add testCreateRaftPeerUsesHostnameAddress to assert the contract:
RaftPeer.address must NEVER be an IPv4 numeric form. This catches any
future regression that re-introduces InetSocketAddress at this seam.
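
The kind of check such a test could make is sketched below. This is not the test added in this commit: the class name, the regular expression, and the sample addresses are assumptions, and the pattern only recognises dotted-quad IPv4 with an optional port.

```java
import java.util.regex.Pattern;

/** Hypothetical check for "address is an IPv4 literal"; not the PR's test. */
final class RaftPeerAddressCheckSketch {
  // Matches a dotted-quad IPv4 address with an optional :port suffix.
  private static final Pattern IPV4 =
      Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}(:\\d+)?");

  static boolean isIpv4Numeric(String address) {
    return IPV4.matcher(address).matches();
  }
}
```

A test in this spirit would assert that `isIpv4Numeric` is false for whatever string ends up in the peer address, so a reintroduced resolved IP fails fast.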

This is the first of four PRs splitting HDDS-15514 along its natural
code-path boundaries. No flag, no exception classifier, and no atomic
swap machinery in this PR -- those land with the proxy-provider PRs
that follow.

OMProxyInfo constructs an InetSocketAddress at process start and reuses
it for the proxy's lifetime. InetSocketAddress freezes the resolved IP
at construction; when an OM pod is rescheduled to a new IP under a
stable DNS name (Kubernetes), every subsequent client RPC dials the
gone-away IP forever and only a process restart recovers.

Fix it at the FailoverProxyProvider seam, gated by a new opt-in flag
(ozone.client.failover.resolve-needed, default false).

Shared infrastructure (used by subsequent PRs in this series):
  - ConnectionFailureUtils: classifies a Throwable's cause chain
    (depth-bounded to 16) as a connection-class failure. Connection
    types: ConnectException, SocketTimeoutException,
    NoRouteToHostException, UnknownHostException, EOFException,
    SocketException. Application errors (OMException, OMNotLeaderEx,
    AccessControlException, RetryAction-coded responses) are NOT
    classified as connection failures, so DNS load is not amplified
    by logical errors.
  - ozone.client.failover.resolve-needed flag.
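
A depth-bounded walk of the cause chain, as described for the classifier above, might look like the following. It is a sketch only: `ConnectionFailureSketch` is a hypothetical name, and the real ConnectionFailureUtils may order or structure its checks differently.

```java
import java.io.EOFException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

/** Hypothetical cause-chain classifier; not the ConnectionFailureUtils source. */
final class ConnectionFailureSketch {
  private static final int MAX_DEPTH = 16;  // bound taken from the description above

  static boolean isConnectionFailure(Throwable t) {
    Throwable current = t;
    // The depth bound also guarantees termination on cyclic cause chains.
    for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
      if (current instanceof ConnectException
          || current instanceof SocketTimeoutException
          || current instanceof NoRouteToHostException
          || current instanceof UnknownHostException
          || current instanceof EOFException
          || current instanceof SocketException) {
        return true;
      }
      current = current.getCause();
    }
    return false;
  }
}
```

ConnectException and NoRouteToHostException are subclasses of SocketException, so the explicit checks for them are redundant in this sketch and kept only to mirror the list above.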

Client -> OM Hadoop RPC mechanism:
  - OMProxyInfo preserves the original host:port string and
    refreshAddressIfChanged() re-resolves it outside the entry
    monitor; on IP change, atomically swaps the cached
    InetSocketAddress / dtService / proxy=null under the monitor;
    stops the stale proxy via RPC.stopProxy outside the monitor.
  - OMFailoverProxyProviderBase.shouldRetry calls the refresh on
    connection-class exceptions only when the flag is on. On a
    successful refresh, returns FAILOVER_AND_RETRY but pins
    nextProxyIndex to the current node so RetryInvocationHandler
    does NOT skip past the just-refreshed peer.
  - HadoopRpcOMFailoverProxyProvider and the follower-read variant
    pass the preserved hostname string to OMProxyInfo at
    construction.

OM <-> OM Hadoop-RPC control-plane (OMInterServiceProtocol) rides on
the same OMFailoverProxyProvider machinery, so OM-to-OM Hadoop-RPC
recovery is a free transitive benefit of this PR. The gRPC OM client
(GrpcOMFailoverProxyProvider) was already correct (placeholder
InetSocketAddress(0); gRPC's NameResolver re-resolves on its own) and
is unchanged.

Secure-cluster prerequisite documented inline in ozone-default.xml:
when this flag is true on a Kerberos cluster, operators must also set
hadoop.security.token.service.use_ip=false in core-site.xml. Same
prerequisite HADOOP-17068 carries: the Hadoop delegation-token service
ID defaults to IP:port and would silently fail token selection after
a refresh without that co-config.

This is PR 2 of 4 splitting HDDS-15514 along its natural code-path
boundaries. PR-1 (Ratis hostname-only fix) is the merge base.
Subsequent PRs:
  - PR-3: OM -> SCM (SCMFailoverProxyProviderBase / SCMProxyInfo).
  - PR-4: DN -> SCM heartbeat (EndpointStateMachine /
          SCMConnectionManager / StateContext).

Tests:
  - TestConnectionFailureUtils (new, 20 tests): bare types,
    IOException-wrapped, deeply nested chains (3 levels), application
    negative cases, length-2 cause cycles (terminates), 1024-deep
    non-matching chains (cost bound).
  - TestOMProxyInfoDnsRefresh (new, 4 tests): no-op preserves cached
    proxy, swap on IP change, rebuilt proxy uses freshly-resolved
    address, dtService updates. Uses a @VisibleForTesting setter.
  - TestOMFailoverProxyProviderRefreshWired (new, 5 tests):
    SocketTimeoutException triggers refresh (the AWS EC2 silent-drop
    case end-to-end); ConnectException triggers refresh; OMException
    does NOT; flag-off does NOT; nextProxyIndex stays pinned after
    successful refresh.

Existing TestOMFailoverProxyProvider (8) and TestOMFailovers (1)
verified non-regressed.

SCMProxyInfo constructs an InetSocketAddress at OM startup and reuses
it for the SCM proxy's lifetime. InetSocketAddress freezes the
resolved IP at construction; when an SCM pod is rescheduled to a new
IP under a stable DNS name (Kubernetes), every subsequent OM to SCM
RPC dials the gone-away IP forever, and only an OM process restart
recovers.

Apply the same DNS-refresh-on-failure pattern PR-2 introduced for
Client to OM. Reuses the ConnectionFailureUtils classifier and the
ozone.client.failover.resolve-needed flag landed in PR-2.

SCMProxyInfo:
  - New final hostAndPort String preserves the config-time host:port
    string. The string is the source of truth for re-resolution; the
    InetSocketAddress is now a derived cache.
  - rpcAddr becomes mutable behind the entry monitor (was effectively
    final).
  - getHostAndPort() accessor for the provider's refresh path.

SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged(nodeId):
  - PHASE A (no lock): re-resolve hostAndPort. Compare with cached
    rpcAddr.getAddress(). If unchanged or unresolved, return false.
  - PHASE B (under provider monitor): capture stale proxy reference,
    swap rpcAddr, clear cached proxy. Next getProxy(nodeId) rebuilds
    via existing createSCMProxy(nodeId) path.
  - PHASE C (no lock): RPC.stopProxy(staleProxy). Holding the monitor
    across socket teardown would stall every concurrent getProxy()
    caller.
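
The three phases can be modelled with a plain object monitor. This is an illustrative sketch, not SCMFailoverProxyProviderBase: `ProxyRefreshSketch`, the nested `Proxy`, the string-typed addresses, and the injected `resolver` are hypothetical, and setting `stopped` stands in for RPC.stopProxy.

```java
import java.util.function.Function;

/** Hypothetical model of resolve-outside, swap-inside, stop-outside. */
final class ProxyRefreshSketch {
  static final class Proxy {
    final String address;
    volatile boolean stopped;

    Proxy(String address) {
      this.address = address;
    }
  }

  private final String hostAndPort;                 // preserved config-time string
  private final Function<String, String> resolver;  // assumption: injected DNS lookup
  private String rpcAddr;                           // guarded by this
  private Proxy proxy;                              // guarded by this, built lazily

  ProxyRefreshSketch(String hostAndPort, String initialAddr,
      Function<String, String> resolver) {
    this.hostAndPort = hostAndPort;
    this.rpcAddr = initialAddr;
    this.resolver = resolver;
  }

  synchronized Proxy getProxy() {
    if (proxy == null) {
      proxy = new Proxy(rpcAddr);                   // rebuilt against the current address
    }
    return proxy;
  }

  private synchronized String currentAddr() {
    return rpcAddr;
  }

  boolean refreshIfChanged() {
    if (hostAndPort == null) {
      return false;                                 // legacy path: nothing to re-resolve
    }
    String fresh = resolver.apply(hostAndPort);     // PHASE A: lookup without the monitor
    if (fresh == null || fresh.equals(currentAddr())) {
      return false;
    }
    Proxy stale;
    synchronized (this) {                           // PHASE B: swap under the monitor
      stale = proxy;
      rpcAddr = fresh;
      proxy = null;
    }
    if (stale != null) {
      stale.stopped = true;                         // PHASE C: teardown without the monitor
    }
    return true;
  }
}
```

The next `getProxy()` call after a successful refresh builds a new proxy against the fresh address, mirroring the lazy rebuild described in PHASE B.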

shouldRetry wiring:
  - When the flag is true AND
    ConnectionFailureUtils.isConnectionFailure(exception) matches,
    the provider calls refreshProxyAddressIfChanged for the current
    SCM nodeId.
  - On a successful refresh, returns FAILOVER_AND_RETRY but pins
    updatedLeaderNodeID to the just-refreshed nodeId so
    RetryInvocationHandler does NOT advance past the now-fixed peer.
    Without the pin, an N-peer SCM HA cluster would skip the fixed
    SCM for up to N-1 attempts.
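
Why the pin matters can be shown with a toy round-robin provider. Everything here is hypothetical (`PinnedFailoverSketch`, `pinCurrent`, `performFailover`, and the node IDs); it models only the idea that a failover request after a successful refresh should land on the same node rather than advancing.

```java
import java.util.List;

/** Hypothetical round-robin provider with a one-shot pin; illustrative only. */
final class PinnedFailoverSketch {
  private final List<String> nodeIds;
  private int currentIndex;
  private String pinnedNodeId;  // set when a refresh just fixed the current node

  PinnedFailoverSketch(List<String> nodeIds) {
    this.nodeIds = nodeIds;
  }

  String current() {
    return nodeIds.get(currentIndex);
  }

  /** Called after a successful address refresh of the current node. */
  void pinCurrent() {
    pinnedNodeId = current();
  }

  /** Called by the retry handler when the policy says fail over and retry. */
  void performFailover() {
    if (pinnedNodeId != null) {
      currentIndex = nodeIds.indexOf(pinnedNodeId);  // stay on the refreshed peer
      pinnedNodeId = null;                           // the pin applies once
      return;
    }
    currentIndex = (currentIndex + 1) % nodeIds.size();
  }
}
```

Without the pin, the retry after a refresh would move on to the next node and would only come back to the fixed one after cycling through the others.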

Tests:
  - TestSCMFailoverProxyProviderRefresh (new, 3 tests): per-instance
    swap on IP change, no-op when unchanged, no-op without
    preserved hostAndPort (legacy code path).
  - TestSCMFailoverProxyProviderRefreshWired (new, 5 tests): end-to-end
    retry path. ConnectException + SocketTimeoutException trigger
    refresh; application errors and flag-off do NOT;
    updatedLeaderNodeID stays pinned across successful refresh.

Existing TestSCMFailoverProxyProvider verified non-regressed.

This is PR 3 of 4 splitting HDDS-15514. Stacked on PR-2
(HDDS-15514-client-om-refresh). PR-4 (DN to SCM heartbeat) follows.

EndpointStateMachine.address is constructed at DN startup from the
configured host:port and reused for the lifetime of the DN heartbeat
loop. InetSocketAddress freezes the resolved IP at construction; when
an SCM pod is rescheduled to a new IP under a stable DNS name
(Kubernetes), every heartbeat to that peer dials the gone-away IP
forever. The DN's endpoints set still contains the broken peer's
EndpointStateMachine, but that machine never recovers without a DN
process restart.

Apply DNS-refresh-on-failure for the DN heartbeat path. Reuses
ConnectionFailureUtils and the ozone.client.failover.resolve-needed
flag landed in PR-2. Adds a separate threshold knob since the
heartbeat path runs at a much higher cadence than the failover-proxy
seams.

EndpointStateMachine.resolveLatestAddress():
  - Re-resolve the preserved hostAndPort via NetUtils.createSocketAddr.
  - Return the freshly-resolved InetSocketAddress only if its
    getAddress() differs from the cached one.
  - Return null on legacy endpoints (no preserved hostAndPort),
    unresolved DNS, or unchanged IP -- so callers can opt into a
    swap without committing to one.

SCMConnectionManager.refreshSCMServer() -- 4-phase atomic swap:
  - PHASE A (read lock): snapshot endpoint reference + hostAndPort.
  - PHASE B (no lock):   resolveLatestAddress. DNS lookup must NEVER
                         hold any lock; a slow / dead resolver under
                         lock would freeze every concurrent heartbeat
                         and reconfiguration path.
  - PHASE C (write lock): re-check snapshot (defends against
                         concurrent removeSCMServer / refresh races),
                         enforce collision invariant (refuse swap if
                         the resolved IP collides with another
                         registered peer key -- transient kube-DNS
                         can return peer-B's IP for peer-A's
                         hostname; overwriting peer-B would leak
                         its executor and orphan its task thread),
                         BUILD replacement endpoint BEFORE removing
                         stale (build failure must NOT leave the
                         peer absent from scmMachines), commit swap.
  - PHASE D (no lock):   close stale endpoint. RPC.stopProxy +
                         socket teardown blocks; holding the write
                         lock across that stalls every concurrent
                         heartbeat.

StateContext.migrateEndpoint -- preserve in-flight reports across
swap. Per-endpoint queues (incrementalReportsQueue, containerActions,
pipelineActions, isFullReportReadyToBeSent) are keyed by
InetSocketAddress; without migration a swap orphans every queued
report. Migration ordering preserves the invariant "every endpoint
in `endpoints` has a queue at every observable point":
  1. PUBLISH: install new-key queues alongside old-key queues.
  2. SWITCH:  add newEndpoint to endpoints; remove oldEndpoint.
  3. RETIRE:  drop old-key queues (no producer can reach them now).
endpoints is now a CopyOnWriteArraySet (was HashSet).
incrementalReportsQueue / containerActions / pipelineActions /
isFullReportReadyToBeSent are now ConcurrentHashMap. Producers
null-skip queue lookups as defense-in-depth against producer-vs-
migration races. The full-report flags get a special case: a swapped
endpoint is effectively a fresh peer (the new SCM pod has no idea
which reports we already shipped), so its isFullReportReadyToBeSent
flags are seeded fresh -- not copied from the old key.

HeartbeatEndpointTask trigger:
  In the heartbeat catch block, after logIfNeeded(ex):
    if (resolveOnFailureEnabled
        && missedCount >= refreshThreshold
        && ConnectionFailureUtils.isConnectionFailure(ex)
        && hostAndPort != null) {
      maybeRefreshScmAddress();
    }
  All four gates required. Application-level errors do NOT trigger.
  Endpoints without a preserved hostname (legacy code path) do NOT
  trigger. Threshold prevents over-reaction to a one-off blip.

New config knob:
  - ozone.datanode.scm.heartbeat.address.refresh.threshold = 3.
    Conservative default: at the typical 30-second heartbeat interval
    and 6-second socketTimeout, this means at most ~108 seconds of
    dialing the stale IP before the first DNS retry. In practice
    failures are usually fast (TCP RST or routing failure), so
    recovery is much faster.

Tests:
  - TestSCMConnectionManager (extended, 7 = 1 prior + 6 new):
    resolveLatestAddress edge cases; refreshSCMServer happy-path
    swap; no-op when hostAndPort not preserved; rollback regression
    -- when buildScmEndpoint throws, stale endpoint stays registered
    (uses @VisibleForTesting overridable hook to inject the
    failure).
  - TestHeartbeatEndpointTaskDnsRefresh (new, 6): production trigger
    chain. HeartbeatEndpointTask.call() catch block fires
    refreshSCMServer only when (a) flag enabled, (b) threshold met,
    (c) cause is connection-class, (d) hostAndPort preserved.
    AccessControlException at threshold does NOT trigger. After
    successful swap, StateContext's incremental-reports map has the
    new key and not the old key.
  - TestSCMConnectionManagerDnsRefreshE2E (new, 1, @Timeout(30)):
    real-RPC swap mechanism. Stands up a real ScmTestMock RPC server
    on a loopback OS-assigned port, primes the connection manager
    with a stale 127.0.0.99 cache + preserved localhost:port, calls
    refreshSCMServer, asserts a real sendHeartbeat round-trips
    through the swapped endpoint.

Existing TestEndPoint (17) and TestHeartbeatEndpointTask (8)
verified non-regressed.

Real-world failure shapes this fix targets:
  - AWS EC2 / EKS silent packet drop: stale-IP packets are silently
    dropped, surfacing as SocketTimeoutException after socketTimeout.
    Without this fix, the DN retries the dead IP forever.
  - OpenStack TCP RST / ICMP unreachable: stale-IP packets fast-
    rejected, surfacing as ConnectException. Same recovery path.

Scope: the refresh fires from the HEARTBEAT phase. If a DN starts
up with the SCM peer already at a stale IP and never reaches
HEARTBEAT, the recovery path does NOT engage. Initial-bringup DNS
staleness is the concern of HDDS-5919's
ozone.network.jvm.address.cache.enabled=false setting.

This is PR 4 of 4 splitting HDDS-15514. Stacked on PR-3
(HDDS-15514-om-scm-refresh).
@kerneltime kerneltime force-pushed the HDDS-15514-dn-scm-refresh branch from 45b4dd8 to 3d9ba8b Compare June 12, 2026 06:36