
feat(core): multi-link routing #123

Open
gaoyifan wants to merge 13 commits into encodeous:main from gaoyifan:feat/multi-link-routing


@gaoyifan

Background

Currently, nylon employs a "one neighbor, many candidate endpoints, one best endpoint" model. In this implementation, a neighbor is keyed strictly by its NodeId. While multiple remote endpoints can be configured for a single peer, only the single best-performing endpoint is active for routing at any given time.
This approach has several limitations:

  • Lack of Interface Awareness: It cannot distinguish between different physical paths, such as reaching a peer via a local WAN interface versus a local LAN interface.

  • Restricted Control Plane: Control messages and probes are tied to the peer's primary endpoint, preventing independent liveness and metric tracking for alternative paths.
  • Limited Path Redundancy: It treats all connections to a peer as a single next-hop, rather than treating independent physical or logical links as distinct routing adjacencies.

To address these constraints, this PR implements a Multi-Link Routing design. The core change shifts the routing adjacency from a "node-level" view to a "link-level" view. By introducing LocalBind (local interface/source selection), each unique (Peer, LocalBind, RemoteEndpoint) tuple is now treated as an independent routing link. This allows the router to independently track metrics for multiple paths between the same two nodes and select the optimal link for traffic based on real-time performance and local policy.
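The link-level identity described above can be pictured as a comparable struct used as a map key. This is only a minimal sketch: the type and field names (`LinkID`, `LinkStats`, `LocalBind` as a string) are hypothetical and are not taken from the PR's actual state package.

```go
package main

import (
	"fmt"
	"net/netip"
)

// LinkID is a hypothetical key for one routing link. All three fields are
// comparable, so the struct can be used directly as a map key.
type LinkID struct {
	Peer      string         // NodeId of the neighbour
	LocalBind string         // local bind name (assumed to be a plain string here)
	Remote    netip.AddrPort // remote endpoint
}

// LinkStats is per-link state that is tracked independently of other links.
type LinkStats struct {
	RTTMillis float64
}

// demoLinkCount registers the same remote endpoint behind two local binds and
// returns how many distinct links result.
func demoLinkCount() int {
	remote := netip.MustParseAddrPort("203.0.113.7:57175")
	links := map[LinkID]LinkStats{}
	links[LinkID{Peer: "bob", LocalBind: "wan", Remote: remote}] = LinkStats{RTTMillis: 42}
	links[LinkID{Peer: "bob", LocalBind: "lan", Remote: remote}] = LinkStats{RTTMillis: 3}
	return len(links)
}

func main() {
	fmt.Println("distinct links:", demoLinkCount())
}
```

Because the local bind is part of the key, the same remote address reached over two binds yields two entries, each with its own metrics.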

Full design: docs/reference/multi-link-routing.mdx.

What changed

  • state: add LocalBindID, RemoteEndpointID, LinkID, and Link; store links in RouterState; key selected routes by next-hop link (SelRoute.NhLink).
  • config: parse local binds and structured endpoint IDs while keeping plain string-endpoint compatibility; reject explicit binds off Linux.
  • conn: pair a remote endpoint with a local bind selector (sticky source / IP_PKTINFO) so the same remote address on different binds is a distinct link.
  • discovery: probe the local bind × remote endpoint product, track probes by link, dedupe duplicate transport tuples, and skip bind/endpoint address-family mismatches.
  • router: resolve every incoming control packet to a link before it reaches router logic; select the lowest-metric active link per peer with stable tie-breaking; carry a per-retraction acknowledgment token; keep seqno-request suppression router-wide.
  • forwarding: set TCElement.ToEp from the selected link endpoint so data follows the selected link rather than the peer default endpoint.
  • status / IPC: expose per-link bind, endpoint, and neighbour-route info.
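As a rough illustration of "lowest-metric active link per peer with stable tie-breaking", the sketch below picks the best link from a slice and breaks ties on a string identifier. The `Link` type, its fields, and the choice of tie-break key are assumptions made for this example, not the PR's actual router code.

```go
package main

import (
	"fmt"
	"math"
)

// Link is a hypothetical, simplified view of one routing link.
type Link struct {
	ID     string  // stable identifier, used only for tie-breaking here
	Metric float64 // lower is better; +Inf means no usable RTT samples yet
	Active bool
}

// selectBest returns the lowest-metric active link. Equal metrics are resolved
// by the smaller ID, so the result does not depend on the order of the slice.
func selectBest(links []Link) (Link, bool) {
	var best Link
	found := false
	for _, l := range links {
		if !l.Active || math.IsInf(l.Metric, 1) {
			continue
		}
		if !found || l.Metric < best.Metric || (l.Metric == best.Metric && l.ID < best.ID) {
			best, found = l, true
		}
	}
	return best, found
}

func main() {
	best, ok := selectBest([]Link{
		{ID: "bob/wan", Metric: 10, Active: true},
		{ID: "bob/lan", Metric: 10, Active: true},
		{ID: "bob/lte", Metric: 1, Active: false},
	})
	fmt.Println(best.ID, ok)
}
```

A deterministic tie-break matters because flapping between two equally good links would otherwise depend on iteration order.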

Supporting commits

  • fix(conn): StdNetBind.Send reused a pooled net.UDPAddr whose IP slice could have been shrunk to 4 bytes by a prior IPv4 send, truncating the next IPv6 destination (e.g. 2001:db8::1 became 2001:db8::). The link then never collected RTT samples and its metric stayed at INF. The fix resizes the slice before copying.
  • perf(core): batch a received bundle's control packets into a single dispatch that recomputes routes at most once, and coalesce pong-driven recomputation behind a pending flag, to avoid saturating the dispatch queue on multi-link meshes.
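The class of bug behind the fix(conn) commit can be reproduced with plain byte slices. The sketch below is not the StdNetBind.Send code; the helper names are made up, and it only shows why copying into a buffer left at length 4 drops most of a 16-byte IPv6 address, and how re-slicing first avoids that.

```go
package main

import (
	"fmt"
	"net"
)

// copyWithoutResize mimics the bug class: copy() writes only len(dst) bytes,
// so a buffer left at length 4 silently drops the tail of a 16-byte address.
func copyWithoutResize(pooled, ip []byte) []byte {
	n := copy(pooled, ip)
	return pooled[:n]
}

// copyWithResize grows or re-slices the buffer to len(ip) before copying.
func copyWithResize(pooled, ip []byte) []byte {
	if cap(pooled) < len(ip) {
		pooled = make([]byte, len(ip))
	}
	pooled = pooled[:len(ip)]
	copy(pooled, ip)
	return pooled
}

// demoLengths returns the byte lengths produced by the buggy and fixed copies.
func demoLengths() (buggy, fixed int) {
	ip6 := net.ParseIP("2001:db8::1") // 16-byte slice
	afterIPv4 := make([]byte, 16)[:4] // pooled buffer shrunk by an earlier IPv4 send
	buggy = len(copyWithoutResize(afterIPv4, ip6))
	fixed = len(copyWithResize(afterIPv4, ip6))
	return buggy, fixed
}

func main() {
	buggy, fixed := demoLengths()
	fmt.Println("buggy length:", buggy, "fixed length:", fixed)
}
```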

Testing

The feature has been tested on 12 nodes deployed in different geographic regions over the public Internet, running continuously for more than 24 hours. The test coverage includes:

  1. Dual-stack nodes with both IPv4 and IPv6 connectivity.
  2. IPv4-only nodes.
  3. Multi-homed nodes with multiple network interfaces, each assigned its own independent IP address.
  4. Nodes without any publicly reachable endpoint configuration.

Copilot AI review requested due to automatic review settings May 29, 2026 00:47
@encodeous
Owner

Hi Yifan. Thanks for the PR, I appreciate the enthusiasm.

Nylon already supports multi-endpoint probing. It will only send data through the active/best link, but will continually probe (send control packets) over all configured endpoints.

My suggestion is to look at code under polyamide, and compare the diff to wireguard-go upstream (use git subtree).

Can you double check?

Also, can you elaborate on "Lack of Interface Awareness"? Nylon currently does not support sending packets directly over a specified interface, but that should be a relatively small change without needing to do a large refactor.

Thanks

P.S. This is a very big change. If possible, please split it into a set of smaller PRs so it is easier for me to review.

@gaoyifan
Author

Thank you very much for the comment and suggestions.

I re-checked the current probing logic in polyamide and Nylon, and you are right: Nylon already continuously probes all configured remote endpoints for a peer and sends data through the active/best endpoint. My original PR description was inaccurate.

What I meant to describe is the lack of local egress/source/interface awareness. For example, if nodes A and B each have three interfaces, A1/A2/A3 and B1/B2/B3, and A1/B1 are the default egress paths, today A will probe B1/B2/B3 mostly from A1, while B will probe A1/A2/A3 mostly from B1. So Nylon observes only part of the possible interface-pair combinations. With explicit local binds, it can probe the full local bind × remote endpoint set, including paths such as A2-B2, A2-B3, A3-B2, and A3-B3, which may otherwise never be selected by the host routing table.
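To make the "local bind × remote endpoint" set concrete, here is a small sketch that enumerates the product, skips IPv4/IPv6 mismatches, and drops duplicate (source, remote) tuples. The `LocalBind` and `Candidate` types and all addresses are illustrative assumptions, not the PR's discovery code.

```go
package main

import (
	"fmt"
	"net/netip"
)

// LocalBind is a hypothetical local egress selector with a fixed source address.
type LocalBind struct {
	Name   string
	Source netip.Addr
}

// Candidate is one probeable (bind, remote endpoint) pair.
type Candidate struct {
	Bind   string
	Source netip.Addr
	Remote netip.AddrPort
}

// candidates builds the bind × endpoint product, skipping address-family
// mismatches and duplicate (source, remote) transport tuples.
func candidates(binds []LocalBind, remotes []netip.AddrPort) []Candidate {
	type tuple struct {
		src netip.Addr
		dst netip.AddrPort
	}
	seen := make(map[tuple]bool)
	var out []Candidate
	for _, b := range binds {
		for _, r := range remotes {
			if b.Source.Unmap().Is4() != r.Addr().Unmap().Is4() {
				continue // an IPv4 source cannot reach an IPv6 endpoint, and vice versa
			}
			key := tuple{src: b.Source, dst: r}
			if seen[key] {
				continue
			}
			seen[key] = true
			out = append(out, Candidate{Bind: b.Name, Source: b.Source, Remote: r})
		}
	}
	return out
}

// demoCandidateCount uses two IPv4 binds plus one IPv6 bind on node A, and two
// IPv4 endpoints plus one IPv6 endpoint on node B.
func demoCandidateCount() int {
	binds := []LocalBind{
		{Name: "A1", Source: netip.MustParseAddr("192.0.2.1")},
		{Name: "A2", Source: netip.MustParseAddr("198.51.100.1")},
		{Name: "A3", Source: netip.MustParseAddr("2001:db8::a3")},
	}
	remotes := []netip.AddrPort{
		netip.MustParseAddrPort("203.0.113.1:57175"),
		netip.MustParseAddrPort("203.0.113.2:57175"),
		netip.MustParseAddrPort("[2001:db8::b3]:57175"),
	}
	return len(candidates(binds, remotes))
}

func main() {
	fmt.Println("candidate links:", demoCandidateCount())
}
```

With these inputs the two IPv4 binds pair with the two IPv4 endpoints and the IPv6 bind pairs with the IPv6 endpoint, so five candidates remain out of nine raw combinations.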

So this PR is not intended to replace Nylon’s existing multi-endpoint probing. It reuses polyamide’s multi-endpoint support, and tries to add local egress as part of the link identity and metric model. Most of the larger changes come from carrying that link identity as a first-class citizen through probes, control packets, routing state, forwarding, and status output.

I agree the current PR is too large. I’ll try to restructure it into smaller, easier-to-review pieces, and see whether the local bind/source selection part can be extracted first with a smaller router change.

If you have guidance on what the smallest acceptable version should look like, I would really appreciate it.

Thanks again for creating Nylon. It's been a pleasure to work with the codebase and learn from its design.

@gaoyifan gaoyifan marked this pull request as draft May 30, 2026 10:47
@gaoyifan gaoyifan force-pushed the feat/multi-link-routing branch from ebe46ad to ea5f42f Compare May 30, 2026 10:48
@gaoyifan
Author

I have force-pushed a rewritten history with smaller, buildable commits that might make the dependency chain easier to review.

If you have a particular smallest acceptable version in mind for this PR, I would be very grateful for your guidance.

@gaoyifan gaoyifan marked this pull request as ready for review May 30, 2026 11:09
gaoyifan added 9 commits May 30, 2026 11:26
  • Add the design note before the implementation commits so reviewers can read the intended model first. The document explains LinkID, local bind semantics, endpoint generations, deduplication rules, and the expected deployment/validation flow for multi-link routing.
  • Keep pooled UDPAddr reuse from truncating IPv6 endpoints. This is a small correctness fix that keeps later multi-endpoint probing from inheriting a corrupted destination address.
  • Add Linux pktinfo support for carrying a source address/interface selector on conn endpoints. The router uses this later to bind a probe or control packet to a specific local egress path while preserving the existing default behaviour elsewhere.
  • Extend config/state with explicit local binds and stable remote endpoint IDs. This commit also adds endpoint clone/resolution helpers, bind validation, and IP-family checks. It deliberately stops before changing router adjacency semantics.
  • Expand configured probing from remote endpoints to the deduplicated local-bind × remote-endpoint product. At this point the router still stores routes at the neighbour level; the change only makes the extra candidate transports discoverable and probeable. Cover the transport deduplication and IP-family filtering invariants here so they bisect with the probing change.
  • Add retraction tokens to route updates and acknowledgement packets. The token lets held-route cleanup track acknowledgements per routed adjacency after neighbour routes become link-scoped.
  • Add LinkID, Link, and RouterState link helpers as the explicit routed adjacency model. This is still a state-only step; later commits populate and consume Links from core routing code.
  • Expose local bind ID, interface, and source in status IPC and CLI output. This gives operators a way to inspect which local egress selector backs each endpoint before routing is fully link-scoped.
  • Populate RouterState.Links during config reconcile and endpoint discovery while keeping the old neighbour route path active. This separates link construction from the larger router algorithm switch, so reviewers can verify the new adjacency inventory independently.
@gaoyifan gaoyifan force-pushed the feat/multi-link-routing branch 2 times, most recently from fd363c4 to 19f079d Compare May 30, 2026 11:38
gaoyifan added 4 commits May 30, 2026 11:44
  • Switch the Babel router from peer-scoped adjacency state to LinkID-scoped state. This moves pending IO, route updates, seqno requests, retraction ACKs, selected next hops, GC, forwarding endpoint selection, and status route aggregation onto routed links. The peer-level neighbour view remains for grouping status output and configured endpoints, but route ownership now lives on Link.Routes. This keeps the existing selected-route retention and link switch deadband behavior; those policy changes are split into a later commit.
  • Schedule a single delayed route recomputation when probe pongs update link RTT samples instead of recomputing routes for every pong. Make Dispatch report whether the function was queued so the pending recompute flag can be cleared if the dispatch queue is full.
  • Remove the selected-route retention shortcut so route recomputation can switch away from a current link when another feasible link is no worse. Set the default LinkSwitchDeadband to 1.0 to match the immediate switching policy.
  • Add focused regressions for local-bind incoming resolution, remote-init learned links, endpoint generation rotation, bind-specific discovery, and merged peer route reporting. These tests target the edge cases that are easiest to miss without a full live mesh.
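The pong-driven coalescing described in these commits (a pending flag, plus a Dispatch that reports whether the function was queued) might look roughly like the sketch below. It is a simplified stand-in: the `Router` type, its fields, and `Drain` are hypothetical, and the delay before recomputation is left out.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Router is a hypothetical stand-in with a bounded dispatch queue.
type Router struct {
	queue      chan func()
	pending    atomic.Bool // true while a recompute is already queued
	recomputes int         // only touched from the dispatch loop (Drain)
}

func newRouter(queueCap int) *Router {
	return &Router{queue: make(chan func(), queueCap)}
}

// Dispatch queues fn and reports whether it was accepted (false if the queue is full).
func (r *Router) Dispatch(fn func()) bool {
	select {
	case r.queue <- fn:
		return true
	default:
		return false
	}
}

// OnPong is called for every pong but queues at most one recompute at a time.
func (r *Router) OnPong() {
	if !r.pending.CompareAndSwap(false, true) {
		return // a recompute is already pending
	}
	queued := r.Dispatch(func() {
		r.pending.Store(false)
		r.recomputes++ // stand-in for the real route recomputation
	})
	if !queued {
		r.pending.Store(false) // queue full: allow a later pong to retry
	}
}

// Drain runs everything currently queued, like a single-threaded dispatch loop.
func (r *Router) Drain() {
	for {
		select {
		case fn := <-r.queue:
			fn()
		default:
			return
		}
	}
}

// demoCoalesce feeds pongs into a router and returns how many recomputes ran.
func demoCoalesce(pongs, queueCap int) int {
	r := newRouter(queueCap)
	for i := 0; i < pongs; i++ {
		r.OnPong()
	}
	r.Drain()
	return r.recomputes
}

func main() {
	fmt.Println("recomputes for 100 pongs:", demoCoalesce(100, 4))
}
```

Clearing the flag when Dispatch fails is the important detail: without it, one dropped dispatch would block every later recomputation.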
@gaoyifan gaoyifan force-pushed the feat/multi-link-routing branch from 19f079d to 6bef547 Compare May 30, 2026 11:57
@encodeous
Owner

Hi Yifan,

Thanks for the quick response and tidying up the commit history!

Regarding your changes, I looked over the diff, as well as your design doc.

Here are some comments:

  1. I think it might not be necessary to change the routing/adjacency model
    • Much of what this change involves can be just as succinctly implemented at the endpoint-level.
    • When I designed nylon, I intentionally separated the routing level (decision of which nodes to visit), and the link level (which connection to traverse from one node to the next).
      • At the routing level, there is no need to have multiple edges between nodes, since they are (and should be) functionally the same, outside of a single metric score.
      • Thus, we can simply surface the single "best" link without losing generality
      • Additionally, as you mentioned in the "Risks" section, WireGuard ultimately requires at most one link between two nodes, so we still have to pick one "best" link.
      • As you also stated, we do not support bandwidth aggregation, ECMP, etc, so I don't see any necessity for this at the routing level. Besides, we don't even have the correct routing algorithm for supporting those use cases (consider the difference between Max-Flow, and shortest path)
      • I think you should work around NylonEndpoint to include the local bind (interface, src addr). This might be a bit tricky, and I think we should discuss what the API/user experience would look like for central/local configuration.
  2. However, I do think it would be a good idea to let the user decide which interface(s) a specific endpoint should be reached over. Thus, the changes in polyamide for supporting this are indeed necessary machinery, and I would love to see them in a separate PR. Do note, however, that we need to support non-Linux platforms such as macOS.

Let's discuss this before making more changes to the code. We also need a less clunky API for specifying the interface.

One trivial way would be to just produce I*E links from I interfaces and E endpoints per peering (but this can also lead to a mess).

I'd love to hear your POV

@gaoyifan
Author

I completely agree with your separation of the link and routing models. I think I had fallen into the mindset of FRRouting-style designs, where multiple interfaces are explicitly exposed to Babel. In retrospect, Nylon's design is much more elegant: it finds a very nice optimal substructure by keeping all multi-link complexity confined to the peer-to-peer layer, which significantly simplifies the overall architecture.


Your last comments actually inspired me to think about a different possible design. Using semantics similar to the Timestamp Sub-TLV from RFC 9616, it may be possible to improve asymmetric routing behavior relatively easily within the current endpoints model.

Consider an extreme example: nodes A and B have two paths, A1 <-> B1 and A2 <-> B2, with the following one-way latencies:

  • A1 -> B1: 1 ms
  • B1 -> A1: 100 ms
  • A2 -> B2: 100 ms
  • B2 -> A2: 1 ms

In theory, if asymmetric routing is allowed, the best RTT would be:

A1 -> B1 | B2 -> A2 = 1 ms + 1 ms = 2 ms

rather than 101 ms.
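The arithmetic of this example can be checked with a few lines of Go. The `Path` type and function names are invented for the illustration; the latencies are the ones listed above.

```go
package main

import (
	"fmt"
	"math"
)

// Path holds the one-way latencies of one interface pair between A and B.
type Path struct {
	Name      string
	ForwardMs float64 // A -> B
	ReverseMs float64 // B -> A
}

// bestSymmetric is the best RTT when both directions must use the same path.
func bestSymmetric(paths []Path) float64 {
	best := math.Inf(1)
	for _, p := range paths {
		best = math.Min(best, p.ForwardMs+p.ReverseMs)
	}
	return best
}

// bestAsymmetric is the best RTT when each direction may pick its own path.
func bestAsymmetric(paths []Path) float64 {
	fwd, rev := math.Inf(1), math.Inf(1)
	for _, p := range paths {
		fwd = math.Min(fwd, p.ForwardMs)
		rev = math.Min(rev, p.ReverseMs)
	}
	return fwd + rev
}

// examplePaths returns the two paths from the example above.
func examplePaths() []Path {
	return []Path{
		{Name: "A1<->B1", ForwardMs: 1, ReverseMs: 100},
		{Name: "A2<->B2", ForwardMs: 100, ReverseMs: 1},
	}
}

func main() {
	paths := examplePaths()
	fmt.Println("symmetric:", bestSymmetric(paths), "ms")
	fmt.Println("asymmetric:", bestAsymmetric(paths), "ms")
}
```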

In practice, asymmetric routing is quite common on the Internet. Under the current endpoints model, exploiting this property at a single-hop level actually becomes relatively straightforward. However, this would likely require changes to the Ping/Pong packet format, replacing the current random-token + PingBuf RTT measurement mechanism with a Timestamp Sub-TLV-based measurement model. It would also be more efficient. My understanding is that this would be a fairly self-contained optimization.

Would you prefer implementing something like this together with the interface-awareness work, or opening a separate issue for discussion and potentially addressing it in a later PR?


Regarding the configuration schema, I initially considered a design that would automatically discover interfaces and addresses instead of requiring the current manual nylon_binds[] configuration. However, it seems somewhat tricky in practice:

  • It would require introducing platform-specific interface discovery and parsing logic, such as AF_NETLINK on Linux or getifaddrs/ioctl on macOS, which would add a substantial amount of code (although perhaps there are third-party libraries that provide cleaner abstractions).
  • Not every interface is necessarily expected—or appropriate—to be used as a Nylon underlay interface. Loopback interfaces, VPN virtual interfaces, internal-only interfaces, Thunderbolt interfaces, and others may be unsuitable in certain environments. To address this, we would likely need either heuristic filtering rules or some more sophisticated detection mechanism. As far as I know, Tailscale chose the former approach with a fixed filtering policy, but static block lists are inherently inflexible and cannot accommodate all deployment scenarios.
  • To preserve the semantics of automatic full-mesh connectivity, we would likely need to monitor interface and address changes from the kernel. That would further increase implementation complexity.

It is also worth noting that, for multi-homing to work correctly, it is usually necessary to either explicitly bind sockets using SO_BINDTODEVICE and ensure that a corresponding default route exists on that interface, or manually configure policy routing, such as:

ip rule add from <public IP A1> lookup <some routing table>
ip rule add from <public IP A2> lookup <another routing table>

For those reasons, I was thinking of starting with a purely manual configuration model for this feature. In practice, since the nylon cluster deployment is handled by Ansible + an AI agent, the configuration burden is not actually too high. On the contrary, purely manual specification makes the expected behavior clearer and more predictable.

That said, perhaps we can find a middle ground between simplicity and completeness. For example, we could provide a built-in heuristic interface-name filter (such as a regular-expression-based rule) and automatically use all addresses on matching interfaces as source addresses. At the same time, users could override the interface or address filtering rules when necessary. This would allow most nodes to work with the default configuration, while only a small number of special cases would require manual configuration.

We could also defer dynamic interface/address change handling for now, to avoid introducing too much uncertainty in the initial implementation.

Do you have any preference regarding which direction would make the most sense?

@encodeous
Owner

Hmm, in regards to asymmetric routing... I actually added an experimental implementation over a year ago, but have since removed it.

  • In theory, yes, this is definitely a case where nylon can actually improve latency
  • However... Outside of special datacenters (or GPS), the typical time drift is on the order of 5ms. (Basically accounting for the latency for NTP)
  • This means that: there is no way to compare timestamps between two servers.
  • This implies, there is no simple and reliable way to measure asymmetrical latency
  • If you know how, let me know :)

In regards to interfaces.

I also think it's a good starting point to just specify interfaces when desired. Since nylon runs as root in most deployments anyway, I think it's fine to do SO_BINDTODEVICE. Do note that we might now need to bind to multiple interfaces, so the polyamide change needs to be thought out...

I think when you do need to specify an interface, that interface tends to typically not change a lot. Your "middle ground" approach makes sense to me.

  • Probably by default, we just want to use the system routing table, thus no bind to interface
  • In each node's config, we should be able to add rules for specific endpoints to override which interface(s) it can be reached over.

Let's not worry about dynamically changing interfaces yet!

@gaoyifan
Author

gaoyifan commented May 30, 2026

As far as I know, there are roughly three approaches to routing over asymmetric paths without GPS or atomic clocks:

1. NTP-synchronized clocks and one-way delay estimation

This is the most straightforward approach. If clocks are synchronized, we can compare absolute timestamps and estimate one-way latency directly. Routing decisions can then be made based on the estimated one-way delays.

The downside is that the error can be quite large. For asymmetric paths, NTP synchronization error is often on the same order of magnitude as the network latency being measured, making the estimates rather noisy.

2. Only measure cycle latency, without solving for one-way delay

Instead of trying to estimate one-way delays, we can work entirely with cycle latency. Cycle latency can be measured accurately using a mechanism similar to Babel's Timestamp Sub-TLV.

The key observation is that clock offsets cancel out when measuring a cycle. As a result, time synchronization errors do not affect the final measurement.
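A small simulation shows why the offset cancels. It uses the familiar four-timestamp form (t4 - t1) - (t3 - t2), where t1 and t4 are read on the sender's clock and t2 and t3 on the responder's clock. The function names and the microsecond unit are assumptions for this sketch; it does not reproduce any particular wire format.

```go
package main

import "fmt"

// cycleLatency applies (t4 - t1) - (t3 - t2). t1 and t4 come from the sender's
// clock, t2 and t3 from the responder's clock.
func cycleLatency(t1, t2, t3, t4 int64) int64 {
	return (t4 - t1) - (t3 - t2)
}

// simulate generates the four timestamps for the given true one-way delays and
// responder hold time (all in microseconds). The responder's clock runs
// `offset` microseconds ahead of the sender's clock.
func simulate(forward, reverse, hold, offset int64) int64 {
	const t1 = int64(1_000_000)         // sender clock: probe sent
	t2 := t1 + forward + offset         // responder clock: probe received
	t3 := t2 + hold                     // responder clock: reply sent
	t4 := t1 + forward + hold + reverse // sender clock: reply received
	return cycleLatency(t1, t2, t3, t4)
}

func main() {
	// The same 1 ms + 1 ms cycle measured with two very different clock offsets.
	fmt.Println(simulate(1000, 1000, 250, 5_000_000))
	fmt.Println(simulate(1000, 1000, 250, -777))
}
```

Both calls yield forward + reverse, because the offset appears in t2 and t3 with the same sign and disappears in their difference. The forward and reverse legs may also travel over different paths without changing the calculation.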

Reference:

https://gemini.google.com/share/d470d773636d

3. Estimate one-way delays using measurements from many nodes

This approach first estimates one-way delays across the network and then applies the first method for routing decisions.

See the paper:

https://ieeexplore.ieee.org/document/1638554

or my old notes below:

based on "One-way delay estimation using network-wide measurements"
    https://ieeexplore.ieee.org/document/1638554


the theory

 n: number of nodes

 td: (clock) time differences between nodes, this is an n-dimensional vector
    td[i] denotes (clock) time at node i - (clock) time at node 0
        so naturally td[0] is always 0
        they're independent so this forms an (n-1)-dimensional space
    this is not time zone, but similar
    td[i,j] denotes (clock) time at node j - (clock) time at node i
        so td[i,j] = td[j] - td[i]
    we can't measure them directly

 d: delay/latency between nodes, this is an n*n matrix
    d[i,j] denotes delay from node i to node j
        so naturally diagonal entries are zero

 dm: delay measured
    dm[i,j] = d[i,j] + td[i,j] = d[i,j] + (td[j] - td[i])
        again diagonal entries are zero
        we might not have all of them due to incomplete tests;
    this is the only measure we can/will get

 naturally d[i,j] should always be > 0
    thus (d[i,j] = ) dm[i,j] + td[i] - td[j] > 0
        this is a half space
    for every measurement, we get a half space constraint
    intersection of multiple half spaces = convex polytope
    and td[1] ~ td[n-1] is the 0~(n-2)th variable in that (n-1)-d space
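The half-space constraints in these notes can be checked directly for a candidate offset vector. The sketch below only tests feasibility; it is not the solver used in the simulations mentioned in this comment, and the function names and two-node numbers are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// feasible reports whether candidate clock offsets td (with td[0] == 0) keep
// every implied one-way delay d[i][j] = dm[i][j] + td[i] - td[j] strictly
// positive. A NaN entry in dm marks a pair that was never measured.
func feasible(dm [][]float64, td []float64) bool {
	for i := range dm {
		for j := range dm[i] {
			if i == j || math.IsNaN(dm[i][j]) {
				continue
			}
			if dm[i][j]+td[i]-td[j] <= 0 {
				return false
			}
		}
	}
	return true
}

// demoFeasible uses two nodes with true delays d[0][1] = 1 and d[1][0] = 100
// and a true offset td[1] = 30, which gives dm[0][1] = 31 and dm[1][0] = 70.
func demoFeasible(td1 float64) bool {
	dm := [][]float64{
		{0, 31},
		{70, 0},
	}
	return feasible(dm, []float64{0, td1})
}

func main() {
	for _, td1 := range []float64{-80, 0, 30, 40} {
		fmt.Println("td[1] =", td1, "feasible:", demoFeasible(td1))
	}
}
```

In this two-node case every offset between -70 and 31 is feasible, so the measurements alone cannot pin down the one-way delays. The sum dm[0][1] + dm[1][0] = 101 is the cycle latency and does not depend on the offset.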

Several years ago I ran some simulation experiments. For a 100-node cluster with roughly 50% symmetric routes, the average one-way-delay estimation error could be kept below 2 ms. Running the solver on a CPU, a single optimization round in PyTorch took on the order of a few seconds.

The larger the absolute number of symmetric links, the stronger the constraints become, and the more accurate the one-way-delay estimates are across the entire cluster.

For an arbitrary strongly connected directed graph, it can be shown mathematically that—ignoring the small amount of drift inherent to hardware clocks—the cycle-latency approach (method 2) is equivalent, from a routing perspective, to routing based on perfectly accurate one-way-delay measurements.

Intuitively, the best route from A to B must ultimately be part of some minimum cycle containing both A and B. Since method 2 can already compute the exact latency of arbitrary cycles, it provides equivalent routing information.

From that perspective, method 3 is probably not particularly useful for routing itself. However, since we were discussing one-way delay estimation, I thought it was an interesting idea worth sharing. :)

For the asymmetric-path scenario we discussed earlier between two peers, this is actually just the special case of a two-node graph, and can obviously be solved using method 2 as well.


Regarding the "middle-ground" configuration approach, I'd like to make the proposal a bit more concrete. If we don't have any major disagreements, I plan to implement something along these lines in the near future:

  1. The default behavior remains unchanged and fully compatible with the behavior prior to this PR.
  2. Support interface-name filtering using regular expressions, either as a whitelist or a blacklist. Whitelist and blacklist modes are mutually exclusive. Specifying either enables the multi-interface feature.
  3. Support IP-address filtering using either a whitelist or a blacklist. Specifying either enables the multi-interface feature.
  4. Interface filtering (2) and IP filtering (3) can coexist and are combined with logical AND semantics.

Please let me know if there's anything I've overlooked or misunderstood. And don't hesitate to share any additional thoughts or concerns.


update:

  • We can directly use negative assertions in regular expressions, without distinguishing between a blacklist and a whitelist.
