HDDS-15533. DNS refresh on heartbeat failure for DN to SCM by kerneltime · Pull Request #10488 · apache/ozone

kerneltime · 2026-06-11T05:21:38Z

What changes were proposed in this pull request?

This is PR 4 of 4 splitting HDDS-15514 (originally proposed as a single ~160KB patch in #10473, split per @szetszwo's review feedback).

This PR fixes the DN → SCM heartbeat path, the largest and most invasive of the four split PRs. Unlike the failover-proxy-provider seams, the DN does not fail over; it heartbeats every SCM in parallel via the EndpointStateMachine / SCMConnectionManager abstraction. The fix introduces:

An atomic EndpointStateMachine swap when DNS re-resolution detects an IP change.
Per-endpoint queue migration in StateContext so in-flight reports survive the swap.
A separate threshold knob (ozone.datanode.scm.heartbeat.address.refresh.threshold, default 3) — the heartbeat path runs at a much higher cadence than the failover-proxy path, so a count-based gate prevents over-reaction to transient blips.

Stacked on #10487 (PR-3 of 4 — OM → SCM DNS refresh). Reuses the ConnectionFailureUtils classifier and the ozone.client.failover.resolve-needed flag landed in PR-2 (#10486).

Why this matters

EndpointStateMachine.address is the cached InetSocketAddress that the DN heartbeat task uses to dial each SCM peer. It is constructed at DN startup from the configured host:port and never re-resolved. When an SCM pod is rescheduled in Kubernetes, every heartbeat to that peer dials the now-defunct IP forever. The DN's endpoints set still contains the broken peer's EndpointStateMachine, but that machine never recovers without a DN process restart.

This is the path the AWC ozone-operator's existing 7-layer workaround was built to defeat: after watching SCM pod IP changes, the operator force-restarts every DN. PR-4 is the upstream fix that lets the operator drop those restarts.

What this PR does

1. `EndpointStateMachine` preserves a hostname

| Change | Why |
| --- | --- |
| New `final String hostAndPort` field. | Source of truth for re-resolution. |
| `resolveLatestAddress()`: re-resolves `hostAndPort` via `NetUtils.createSocketAddr` and returns the freshly-resolved `InetSocketAddress` only if its `getAddress()` differs from the cached one. Returns null on legacy endpoints (no preserved `hostAndPort`), unresolved DNS, or unchanged IP. | Lets the heartbeat task ask "did the IP just change?" without committing to a swap. |
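
The resolve-and-compare check in the table can be illustrated with a minimal sketch. It is not the PR's code: `EndpointAddressSketch` is a hypothetical class, the plain `java.net.InetSocketAddress` constructor stands in for `NetUtils.createSocketAddr`, and the `host:port` string is assumed to be well-formed.

```java
import java.net.InetSocketAddress;

/** Hypothetical stand-in for the address-handling part of EndpointStateMachine. */
final class EndpointAddressSketch {
  private final String hostAndPort;        // preserved "host:port"; null on legacy endpoints
  private final InetSocketAddress cached;  // address resolved at construction time

  EndpointAddressSketch(String hostAndPort, InetSocketAddress cached) {
    this.hostAndPort = hostAndPort;
    this.cached = cached;
  }

  /** Returns the freshly resolved address only when its IP differs from the cached one. */
  InetSocketAddress resolveLatestAddress() {
    if (hostAndPort == null) {
      return null;                         // legacy endpoint: nothing to re-resolve
    }
    int idx = hostAndPort.lastIndexOf(':');
    String host = hostAndPort.substring(0, idx);
    int port = Integer.parseInt(hostAndPort.substring(idx + 1));
    // This constructor attempts a lookup (assumption: stands in for NetUtils.createSocketAddr).
    InetSocketAddress fresh = new InetSocketAddress(host, port);
    if (fresh.isUnresolved()) {
      return null;                         // DNS could not resolve the name
    }
    if (fresh.getAddress().equals(cached.getAddress())) {
      return null;                         // same IP: no swap needed
    }
    return fresh;
  }
}
```

A null return covers all three "do nothing" cases, so the caller only needs one check before deciding whether to attempt a swap.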

2. `SCMConnectionManager.refreshSCMServer` — 4-phase atomic swap

PHASE A (read lock):       snapshot the endpoint reference and hostAndPort
PHASE B (no lock):         resolveLatestAddress  (DNS lookup must NEVER hold a lock)
PHASE C (write lock):      re-check snapshot, enforce collision invariant,
                           build replacement endpoint, commit swap
PHASE D (no lock):         close stale endpoint  (RPC.stopProxy + socket teardown)

Crucial properties (each had a corresponding bug in the original combined PR that Copilot's failure-injection lens caught):

Build-then-swap, never remove-then-build. If buildScmEndpoint throws (transient DNS, peer not yet accepting on the new IP, NetUtils refusing the address), the stale endpoint stays registered. Otherwise the peer would disappear from scmMachines entirely and no heartbeat could recover it. Tested by TestSCMConnectionManager.testRefreshSCMServerLeavesStaleEndpointOnBuildFailure using a @VisibleForTesting overridable buildScmEndpoint hook.
Refuse swaps that collide with another registered peer key. If transient kube-DNS returns peer-B's IP for peer-A's hostname, the swap is refused rather than overwriting peer-B's endpoint. Without this, peer-B's EndpointStateMachine would be silently replaced, leaking its executor and orphaning its task thread.
Re-check after DNS lookup. A concurrent removeSCMServer or refresh may have raced ahead while we were resolving. The write-lock phase verifies the snapshot is still current before swapping.
close() outside the lock. Stale-endpoint teardown blocks on RPC.stopProxy; holding writeLock() across that would stall every concurrent heartbeat / reconfiguration.
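
The four phases and the properties listed above can be sketched in a simplified, self-contained model. Everything here is hypothetical rather than taken from the patch: `RefreshSwapSketch`, its nested `Endpoint`, and the injected `resolver` function are illustrative names, and treating a build failure as "return false" is an assumption about how the real method reports it.

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

/** Hypothetical, simplified model of the four-phase swap; not the Ozone class. */
class RefreshSwapSketch {
  /** Minimal endpoint: cached address, preserved host:port, and a closed flag. */
  static final class Endpoint {
    final InetSocketAddress address;
    final String hostAndPort;
    volatile boolean closed;

    Endpoint(InetSocketAddress address, String hostAndPort) {
      this.address = address;
      this.hostAndPort = hostAndPort;
    }

    void close() {
      closed = true;
    }
  }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<InetSocketAddress, Endpoint> scmMachines = new HashMap<>();
  private final Function<String, InetSocketAddress> resolver; // injected DNS lookup

  RefreshSwapSketch(Function<String, InetSocketAddress> resolver) {
    this.resolver = resolver;
  }

  void add(Endpoint e) {
    lock.writeLock().lock();
    try {
      scmMachines.put(e.address, e);
    } finally {
      lock.writeLock().unlock();
    }
  }

  Endpoint get(InetSocketAddress key) {
    lock.readLock().lock();
    try {
      return scmMachines.get(key);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Overridable so a test can inject a build failure. */
  Endpoint buildEndpoint(InetSocketAddress addr, String hostAndPort) {
    return new Endpoint(addr, hostAndPort);
  }

  /** Returns true only if the endpoint under oldKey was swapped to a new address. */
  boolean refresh(InetSocketAddress oldKey) {
    Endpoint stale = get(oldKey);                          // PHASE A: snapshot under read lock
    if (stale == null || stale.hostAndPort == null) {
      return false;
    }
    InetSocketAddress fresh = resolver.apply(stale.hostAndPort); // PHASE B: no lock held
    if (fresh == null || fresh.equals(oldKey)) {
      return false;
    }
    lock.writeLock().lock();                               // PHASE C: re-check, build, swap
    try {
      if (scmMachines.get(oldKey) != stale) {
        return false;                                      // raced with a remove or refresh
      }
      if (scmMachines.containsKey(fresh)) {
        return false;                                      // would overwrite another peer
      }
      Endpoint replacement;
      try {
        replacement = buildEndpoint(fresh, stale.hostAndPort);
      } catch (RuntimeException e) {
        return false;                                      // build failed: stale stays registered
      }
      scmMachines.put(fresh, replacement);                 // build-then-swap
      scmMachines.remove(oldKey);
    } finally {
      lock.writeLock().unlock();
    }
    stale.close();                                         // PHASE D: teardown outside the lock
    return true;
  }
}
```

Because the replacement is built before anything is removed, every early return in PHASE C leaves the map exactly as it was, which is the property the rollback regression test exercises.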

3. `StateContext.migrateEndpoint` — preserve in-flight reports across swap

Per-endpoint queues (incrementalReportsQueue, containerActions, pipelineActions, isFullReportReadyToBeSent) are keyed by InetSocketAddress. Without migration, a swap would orphan all queued reports for that peer. The migration is ordered to preserve the invariant "every endpoint in endpoints has a queue at every observable point":

1. PUBLISH: install new-key queues alongside the old-key queues.
2. SWITCH: add newEndpoint to the endpoints set; remove oldEndpoint from the endpoints set.
3. RETIRE: drop the old-key queues (no producer can reach them after step 2).

endpoints is now a CopyOnWriteArraySet (was HashSet). incrementalReportsQueue, containerActions, pipelineActions, and isFullReportReadyToBeSent are now ConcurrentHashMap (some already were). Producers null-skip queue lookups as defense-in-depth — a producer racing migration MUST NOT NPE on a concurrent remove.

The full-report flags get a special case: a swapped endpoint is effectively a fresh peer (the new SCM pod has no idea which reports we have already shipped), so its isFullReportReadyToBeSent[type] flags are seeded fresh rather than copied from the old key. Tested in TestHeartbeatEndpointTaskDnsRefresh.
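
The PUBLISH / SWITCH / RETIRE ordering can be sketched on a reduced model with one queue map and one flag map. This is an illustration under stated assumptions, not the patch: `MigrateEndpointSketch` and its field names are hypothetical, the sketch re-publishes the same queue object under the new key (one possible way to carry queued reports over), and "seeded fresh" is assumed to mean the same initial value that `addEndpoint` uses.

```java
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArraySet;

/** Hypothetical, reduced model of StateContext with one queue map and one flag map. */
final class MigrateEndpointSketch {
  final Set<InetSocketAddress> endpoints = new CopyOnWriteArraySet<>();
  final Map<InetSocketAddress, Queue<String>> incrementalReports = new ConcurrentHashMap<>();
  final Map<InetSocketAddress, Boolean> fullReportReady = new ConcurrentHashMap<>();

  void addEndpoint(InetSocketAddress ep) {
    // Queues are installed before the endpoint becomes visible in the set.
    incrementalReports.putIfAbsent(ep, new ConcurrentLinkedQueue<>());
    fullReportReady.putIfAbsent(ep, Boolean.TRUE);
    endpoints.add(ep);
  }

  /** Producer side: skip silently if the queue was retired by a concurrent migration. */
  void addReport(InetSocketAddress ep, String report) {
    Queue<String> queue = incrementalReports.get(ep);
    if (queue != null) {
      queue.add(report);
    }
  }

  void migrateEndpoint(InetSocketAddress oldEp, InetSocketAddress newEp) {
    // 1. PUBLISH: make the queue reachable under the new key while the old key still works.
    Queue<String> queue = incrementalReports.get(oldEp);
    incrementalReports.put(newEp, queue != null ? queue : new ConcurrentLinkedQueue<>());
    fullReportReady.put(newEp, Boolean.TRUE);   // seeded fresh, not copied from the old key

    // 2. SWITCH: add the new endpoint before removing the old one.
    endpoints.add(newEp);
    endpoints.remove(oldEp);

    // 3. RETIRE: the old key is no longer in the endpoints set, so drop its entries.
    incrementalReports.remove(oldEp);
    fullReportReady.remove(oldEp);
  }
}
```

At every point between the three steps, each address in `endpoints` has an entry in both maps, which is the invariant the ordering is meant to preserve.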

4. `HeartbeatEndpointTask` trigger

In the heartbeat catch block, after logIfNeeded(ex):

if (resolveOnFailureEnabled                    // ozone.client.failover.resolve-needed
    && missedCount >= refreshThreshold         // ozone.datanode.scm.heartbeat.address.refresh.threshold
    && ConnectionFailureUtils.isConnectionFailure(ex)
    && hostAndPort != null) {
  maybeRefreshScmAddress();                    // calls SCMConnectionManager.refreshSCMServer
}

All four gates are required. Application-level errors don't trigger refresh. Endpoints without a preserved hostname (legacy code path) don't trigger. The threshold prevents over-reaction to a one-off blip.

5. New config knob

ozone.datanode.scm.heartbeat.address.refresh.threshold (default 3). This default is conservative: at the typical 30-second heartbeat interval and 6-second socketTimeout, the DN dials the stale IP for at most roughly 3 × (30 s + 6 s) = 108 seconds before the first DNS re-resolution. In practice the failures are usually fast (TCP RST or routing failure), so recovery is much faster.

Real-world failure shapes this fix targets

Two distinct failure modes drove the requirement:

AWS EC2 / EKS — silent packet drop. When a DN attempts to connect to the cached IP of scm-0 after the pod has moved, AWS silently drops the packet. The TCP retry loop expires after socketTimeout (default 6 seconds in Ozone). Without this PR, the DN retries the same dead IP forever. With this PR, after threshold consecutive SocketTimeoutExceptions, the DN re-resolves DNS and swaps to the new IP.
OpenStack — TCP RST or ICMP unreachable. The network stack fast-rejects packets to the dead IP, surfacing as ConnectException. Same recovery path: after threshold consecutive failures, refresh.

How was this patch tested?

| Test class | Count | Coverage |
| --- | --- | --- |
| `TestSCMConnectionManager` (extended) | 7 (1 prior + 6 new) | `resolveLatestAddress` edge cases. `refreshSCMServer` happy-path swap. No-op when `hostAndPort` not preserved. Rollback regression: when `buildScmEndpoint` throws, the stale endpoint remains registered (uses `@VisibleForTesting` overridable hook to inject the failure). |
| `TestHeartbeatEndpointTaskDnsRefresh` (new) | 6 | Production trigger chain. `HeartbeatEndpointTask.call()` catch block fires `refreshSCMServer` only when (a) flag enabled, (b) threshold met, (c) cause is connection-class, (d) `hostAndPort` preserved. `AccessControlException` at threshold does NOT trigger. After a successful swap, `StateContext`'s incremental-reports map has the new key and not the old key. |
| `TestSCMConnectionManagerDnsRefreshE2E` (new) | 1 (`@Timeout(30)`) | Real-RPC swap mechanism. Stands up a real `ScmTestMock` RPC server on a loopback OS-assigned port, primes the connection manager with a stale `127.0.0.99` cache + preserved `localhost:port`, calls `refreshSCMServer`, asserts a real `sendHeartbeat` round-trips through the swapped endpoint. Lives in `hadoop-hdds/server-scm` because it depends on `ScmTestMock`. |

Existing regression suite verified non-regressed: TestEndPoint (17), TestHeartbeatEndpointTask (8).

Scope and known limitations

DN initial bringup with stale DNS: the refresh fires from the HEARTBEAT phase via HeartbeatEndpointTask. If a DN starts up with the SCM peer already at a stale IP and never reaches HEARTBEAT, the recovery path does not engage. Initial-bringup DNS staleness falls under HDDS-5919 and its ozone.network.jvm.address.cache.enabled=false setting, not this PR. InitDatanodeState.java already postpones initialization on initial-resolution failure.
HDFS-14118-style construction-time DNS fan-out (one hostname → multiple persistent IPs, for round-robin DNS HA) is a different problem and out of scope. Worth a follow-on JIRA if needed.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-15514

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

kerneltime · 2026-06-12T06:23:26Z

Rebased onto the updated PR-3 (#10487) tip and retitled to HDDS-15533 per @szetszwo's subtask request.

Copilot's earlier review pass errored out and posted no inline comments. Will re-request a Copilot review once this PR's status is settled. The substantive Copilot findings on PR-1, PR-2, and PR-3 have been addressed in their owning PRs and propagate forward via rebase.

Ratis builds gRPC channels via NettyChannelBuilder.forTarget(address), where the default DnsNameResolver re-resolves hostnames on connection failure. Two of the three pre-existing createRaftPeer paths in OM, and the AddSCMRequest path in SCMHAManagerImpl, were passing new InetSocketAddress(omNode.getInetAddress(), ratisPort) -- which bakes the resolved IP into RaftPeer.address. Once baked, Ratis (and gRPC under it) keeps dialing that IP for the channel's lifetime, so peer-pod restarts in Kubernetes never recover until the parent process is restarted.

Switch every createRaftPeer / AddSCMRequest call to pass the hostname:port string. Collapse the two OzoneManagerRatisServer overloads into one. Replace the misleading "// TODO : Should we use IP instead of hostname??" comment in SCMRatisServerImpl.buildRaftGroup and SCMHAManagerImpl with explanatory comments citing HDDS-15514.

Add testCreateRaftPeerUsesHostnameAddress to assert the contract: RaftPeer.address must NEVER be an IPv4 numeric form. This catches any future regression that re-introduces InetSocketAddress at this seam.

This is the first of four PRs splitting HDDS-15514 along its natural code-path boundaries. No flag, no exception classifier, and no atomic swap machinery in this PR -- those land with the proxy-provider PRs that follow.


OMProxyInfo constructs an InetSocketAddress at process start and reuses it for the proxy's lifetime. InetSocketAddress freezes the resolved IP at construction; when an OM pod is rescheduled to a new IP under a stable DNS name (Kubernetes), every subsequent client RPC dials the gone-away IP forever and only a process restart recovers.

Fix it at the FailoverProxyProvider seam, gated by a new opt-in flag (ozone.client.failover.resolve-needed, default false).

Shared infrastructure (used by subsequent PRs in this series):
- ConnectionFailureUtils: classifies a Throwable's cause chain (depth-bounded to 16) as a connection-class failure. Connection types: ConnectException, SocketTimeoutException, NoRouteToHostException, UnknownHostException, EOFException, SocketException. Application errors (OMException, OMNotLeaderEx, AccessControlException, RetryAction-coded responses) are NOT classified as connection failures, so DNS load is not amplified by logical errors.
- ozone.client.failover.resolve-needed flag.

Client -> OM Hadoop RPC mechanism:
- OMProxyInfo preserves the original host:port string and refreshAddressIfChanged() re-resolves it outside the entry monitor; on IP change, atomically swaps the cached InetSocketAddress / dtService / proxy=null under the monitor; stops the stale proxy via RPC.stopProxy outside the monitor.
- OMFailoverProxyProviderBase.shouldRetry calls the refresh on connection-class exceptions only when the flag is on. On a successful refresh, returns FAILOVER_AND_RETRY but pins nextProxyIndex to the current node so RetryInvocationHandler does NOT skip past the just-refreshed peer.
- HadoopRpcOMFailoverProxyProvider and the follower-read variant pass the preserved hostname string to OMProxyInfo at construction.

OM <-> OM Hadoop-RPC control-plane (OMInterServiceProtocol) rides on the same OMFailoverProxyProvider machinery, so OM-to-OM Hadoop-RPC recovery is a free transitive benefit of this PR.

The gRPC OM client (GrpcOMFailoverProxyProvider) was already correct (placeholder InetSocketAddress(0); gRPC's NameResolver re-resolves on its own) and is unchanged.

Secure-cluster prerequisite documented inline in ozone-default.xml: when this flag is true on a Kerberos cluster, operators must also set hadoop.security.token.service.use_ip=false in core-site.xml. Same prerequisite HADOOP-17068 carries: the Hadoop delegation-token service ID defaults to IP:port and would silently fail token selection after a refresh without that co-config.

This is PR 2 of 4 splitting HDDS-15514 along its natural code-path boundaries. PR-1 (Ratis hostname-only fix) is the merge base. Subsequent PRs:
- PR-3: OM -> SCM (SCMFailoverProxyProviderBase / SCMProxyInfo).
- PR-4: DN -> SCM heartbeat (EndpointStateMachine / SCMConnectionManager / StateContext).

Tests:
- TestConnectionFailureUtils (new, 20 tests): bare types, IOException-wrapped, deeply nested chains (3 levels), application negative cases, length-2 cause cycles (terminates), 1024-deep non-matching chains (cost bound).
- TestOMProxyInfoDnsRefresh (new, 4 tests): no-op preserves cached proxy, swap on IP change, rebuilt proxy uses freshly-resolved address, dtService updates. Uses a @VisibleForTesting setter.
- TestOMFailoverProxyProviderRefreshWired (new, 5 tests): SocketTimeoutException triggers refresh (the AWS EC2 silent-drop case end-to-end); ConnectException triggers refresh; OMException does NOT; flag-off does NOT; nextProxyIndex stays pinned after successful refresh.

Existing TestOMFailoverProxyProvider (8) and TestOMFailovers (1) verified non-regressed.
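
The depth-bounded cause-chain classification described in this commit message can be sketched as follows. This is a hypothetical illustration, not the ConnectionFailureUtils source: the class name `ConnectionFailureSketch` is invented, and the exact meaning of the bound (inspect at most 16 throwables, starting with the one passed in) is an assumption.

```java
import java.io.EOFException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

/** Hypothetical sketch of a depth-bounded connection-failure classifier. */
final class ConnectionFailureSketch {
  /** Assumed bound: at most this many throwables in the chain are inspected. */
  private static final int MAX_DEPTH = 16;

  private ConnectionFailureSketch() {
  }

  static boolean isConnectionFailure(Throwable t) {
    Throwable current = t;
    // The depth bound also guarantees termination on cyclic cause chains.
    for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
      // ConnectException and NoRouteToHostException are subclasses of
      // SocketException, so the SocketException check covers them too.
      if (current instanceof SocketException
          || current instanceof SocketTimeoutException
          || current instanceof UnknownHostException
          || current instanceof EOFException) {
        return true;
      }
      current = current.getCause();
    }
    return false;
  }
}
```

Application-level exceptions never match any of the listed types, so a logical error such as a not-leader response would not trigger a DNS lookup under this scheme.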

SCMProxyInfo constructs an InetSocketAddress at OM startup and reuses it for the SCM proxy's lifetime. InetSocketAddress freezes the resolved IP at construction; when an SCM pod is rescheduled to a new IP under a stable DNS name (Kubernetes), every subsequent OM to SCM RPC dials the gone-away IP forever, and only an OM process restart recovers.

Apply the same DNS-refresh-on-failure pattern PR-2 introduced for Client to OM. Reuses the ConnectionFailureUtils classifier and the ozone.client.failover.resolve-needed flag landed in PR-2.

SCMProxyInfo:
- New final hostAndPort String preserves the config-time host:port string. The string is the source of truth for re-resolution; the InetSocketAddress is now a derived cache.
- rpcAddr becomes mutable behind the entry monitor (was effectively final).
- getHostAndPort() accessor for the provider's refresh path.

SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged(nodeId):
- PHASE A (no lock): re-resolve hostAndPort. Compare with cached rpcAddr.getAddress(). If unchanged or unresolved, return false.
- PHASE B (under provider monitor): capture stale proxy reference, swap rpcAddr, clear cached proxy. Next getProxy(nodeId) rebuilds via existing createSCMProxy(nodeId) path.
- PHASE C (no lock): RPC.stopProxy(staleProxy). Holding the monitor across socket teardown would stall every concurrent getProxy() caller.

shouldRetry wiring:
- When the flag is true AND ConnectionFailureUtils.isConnectionFailure(exception) matches, the provider calls refreshProxyAddressIfChanged for the current SCM nodeId.
- On a successful refresh, returns FAILOVER_AND_RETRY but pins updatedLeaderNodeID to the just-refreshed nodeId so RetryInvocationHandler does NOT advance past the now-fixed peer. Without the pin, an N-peer SCM HA cluster would skip the fixed SCM for up to N-1 attempts.

Tests:
- TestSCMFailoverProxyProviderRefresh (new, 3 tests): per-instance swap on IP change, no-op when unchanged, no-op without preserved hostAndPort (legacy code path).
- TestSCMFailoverProxyProviderRefreshWired (new, 5 tests): end-to-end retry path. ConnectException + SocketTimeoutException trigger refresh; application errors and flag-off do NOT; updatedLeaderNodeID stays pinned across successful refresh.

Existing TestSCMFailoverProxyProvider verified non-regressed.

This is PR 3 of 4 splitting HDDS-15514. Stacked on PR-2 (HDDS-15514-client-om-refresh). PR-4 (DN to SCM heartbeat) follows.


EndpointStateMachine.address is constructed at DN startup from the configured host:port and reused for the lifetime of the DN heartbeat loop. InetSocketAddress freezes the resolved IP at construction; when an SCM pod is rescheduled to a new IP under a stable DNS name (Kubernetes), every heartbeat to that peer dials the gone-away IP forever. The DN's endpoints set still contains the broken peer's EndpointStateMachine, but that machine never recovers without a DN process restart.

Apply DNS-refresh-on-failure for the DN heartbeat path. Reuses ConnectionFailureUtils and the ozone.client.failover.resolve-needed flag landed in PR-2. Adds a separate threshold knob since the heartbeat path runs at a much higher cadence than the failover-proxy seams.

EndpointStateMachine.resolveLatestAddress():
- Re-resolve the preserved hostAndPort via NetUtils.createSocketAddr.
- Return the freshly-resolved InetSocketAddress only if its getAddress() differs from the cached one.
- Return null on legacy endpoints (no preserved hostAndPort), unresolved DNS, or unchanged IP -- so callers can opt into a swap without committing to one.

SCMConnectionManager.refreshSCMServer() -- 4-phase atomic swap:
- PHASE A (read lock): snapshot endpoint reference + hostAndPort.
- PHASE B (no lock): resolveLatestAddress. DNS lookup must NEVER hold any lock; a slow / dead resolver under lock would freeze every concurrent heartbeat and reconfiguration path.
- PHASE C (write lock): re-check snapshot (defends against concurrent removeSCMServer / refresh races), enforce collision invariant (refuse swap if the resolved IP collides with another registered peer key -- transient kube-DNS can return peer-B's IP for peer-A's hostname; overwriting peer-B would leak its executor and orphan its task thread), BUILD replacement endpoint BEFORE removing stale (build failure must NOT leave the peer absent from scmMachines), commit swap.
- PHASE D (no lock): close stale endpoint. RPC.stopProxy + socket teardown blocks; holding the write lock across that stalls every concurrent heartbeat.

StateContext.migrateEndpoint -- preserve in-flight reports across swap. Per-endpoint queues (incrementalReportsQueue, containerActions, pipelineActions, isFullReportReadyToBeSent) are keyed by InetSocketAddress; without migration a swap orphans every queued report. Migration ordering preserves the invariant "every endpoint in `endpoints` has a queue at every observable point":
1. PUBLISH: install new-key queues alongside old-key queues.
2. SWITCH: add newEndpoint to endpoints; remove oldEndpoint.
3. RETIRE: drop old-key queues (no producer can reach them now).

endpoints is now a CopyOnWriteArraySet (was HashSet). incrementalReportsQueue / containerActions / pipelineActions / isFullReportReadyToBeSent are now ConcurrentHashMap. Producers null-skip queue lookups as defense-in-depth against producer-vs-migration races.

The full-report flags get a special case: a swapped endpoint is effectively a fresh peer (the new SCM pod has no idea which reports we already shipped), so its isFullReportReadyToBeSent flags are seeded fresh -- not copied from the old key.

HeartbeatEndpointTask trigger: In the heartbeat catch block, after logIfNeeded(ex):

    if (resolveOnFailureEnabled
        && missedCount >= refreshThreshold
        && ConnectionFailureUtils.isConnectionFailure(ex)
        && hostAndPort != null) {
      maybeRefreshScmAddress();
    }

All four gates required. Application-level errors do NOT trigger. Endpoints without a preserved hostname (legacy code path) do NOT trigger. Threshold prevents over-reaction to a one-off blip.

New config knob:
- ozone.datanode.scm.heartbeat.address.refresh.threshold = 3. Conservative default: at the typical 30-second heartbeat interval and 6-second socketTimeout, this means at most ~108 seconds of dialing the stale IP before the first DNS retry. In practice failures are usually fast (TCP RST or routing failure), so recovery is much faster.

Tests:
- TestSCMConnectionManager (extended, 7 = 1 prior + 6 new): resolveLatestAddress edge cases; refreshSCMServer happy-path swap; no-op when hostAndPort not preserved; rollback regression -- when buildScmEndpoint throws, stale endpoint stays registered (uses @VisibleForTesting overridable hook to inject the failure).
- TestHeartbeatEndpointTaskDnsRefresh (new, 6): production trigger chain. HeartbeatEndpointTask.call() catch block fires refreshSCMServer only when (a) flag enabled, (b) threshold met, (c) cause is connection-class, (d) hostAndPort preserved. AccessControlException at threshold does NOT trigger. After successful swap, StateContext's incremental-reports map has the new key and not the old key.
- TestSCMConnectionManagerDnsRefreshE2E (new, 1, @Timeout(30)): real-RPC swap mechanism. Stands up a real ScmTestMock RPC server on a loopback OS-assigned port, primes the connection manager with a stale 127.0.0.99 cache + preserved localhost:port, calls refreshSCMServer, asserts a real sendHeartbeat round-trips through the swapped endpoint.

Existing TestEndPoint (17) and TestHeartbeatEndpointTask (8) verified non-regressed.

Real-world failure shapes this fix targets:
- AWS EC2 / EKS silent packet drop: stale-IP packets are silently dropped, surfacing as SocketTimeoutException after socketTimeout. Without this fix, the DN retries the dead IP forever.
- OpenStack TCP RST / ICMP unreachable: stale-IP packets fast-rejected, surfacing as ConnectException. Same recovery path.

Scope: the refresh fires from the HEARTBEAT phase. If a DN starts up with the SCM peer already at a stale IP and never reaches HEARTBEAT, the recovery path does NOT engage. Initial-bringup DNS staleness is HDDS-5919's ozone.network.jvm.address.cache.enabled=false's concern.

This is PR 4 of 4 splitting HDDS-15514. Stacked on PR-3 (HDDS-15514-om-scm-refresh).

kerneltime mentioned this pull request Jun 11, 2026

HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths #10473

Closed

kerneltime requested review from Gargi-jais11, Copilot and szetszwo and removed request for Copilot and szetszwo June 11, 2026 15:28

kerneltime marked this pull request as ready for review June 11, 2026 15:28

Copilot started reviewing on behalf of kerneltime June 11, 2026 15:28

Copilot AI reviewed Jun 11, 2026

Copilot stopped reviewing on behalf of kerneltime due to an error June 11, 2026 15:30
Error performing PR review: Git command failed: git -c ...: exited with code 128: error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500 fatal: expected 'packfile'

adoroszlai marked this pull request as draft June 11, 2026 16:31

kerneltime force-pushed the HDDS-15514-dn-scm-refresh branch from 78e8646 to 45b4dd8 Compare June 12, 2026 06:23

kerneltime changed the title ~~HDDS-15514. DNS refresh on heartbeat failure for DN to SCM~~ HDDS-15533. DNS refresh on heartbeat failure for DN to SCM Jun 12, 2026

kerneltime added 4 commits June 11, 2026 23:34

kerneltime force-pushed the HDDS-15514-dn-scm-refresh branch from 45b4dd8 to 3d9ba8b Compare June 12, 2026 06:36

kerneltime wants to merge 4 commits into apache:master from kerneltime:HDDS-15514-dn-scm-refresh

2 participants
