feat(core): multi-link routing #123
|
Hi Yifan. Thanks for the PR, I appreciate the enthusiasm. Nylon already supports multi-endpoint probing. It will only send data through the active/best link, but will continually probe (send control packets) over all configured endpoints. My suggestion is to look at code under Can you double check? Also, can you elaborate on "Lack of Interface Awareness"? Nylon currently does not support sending packets directly over a specified interface, but that should be a relatively small change without needing to do a large refactor. Thanks P.S: This is a very big change, if possible, split it into a set of smaller PRs so it is easier for me to review. |
|
Thank you very much for the comment and suggestions. I re-checked the current probing logic in

What I meant to describe is the lack of local egress/source/interface awareness. For example, if nodes A and B each have three interfaces, A1/A2/A3 and B1/B2/B3, and A1/B1 are the default egress paths, today A will probe B1/B2/B3 mostly from A1, while B will probe A1/A2/A3 mostly from B1. So Nylon observes only part of the possible interface-pair combinations. With explicit local binds, it can probe the full

So this PR is not intended to replace Nylon’s existing multi-endpoint probing. It reuses polyamide’s multi-endpoint support, and tries to add local egress as part of the link identity and metric model. Most of the larger changes come from carrying that link identity as a first-class citizen through probes, control packets, routing state, forwarding, and status output.

I agree the current PR is too large. I’ll try to restructure it into smaller, easier-to-review pieces, and see whether the local bind/source selection part can be extracted first with a smaller router change. If you have guidance on what the smallest acceptable version should look like, I would really appreciate it.

Thanks again for creating Nylon. It's been a pleasure to work with the codebase and learn from its design. |
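The interface-pair argument in the comment above can be sketched in Go as follows. This is only an illustration, not Nylon's implementation: the `LocalBind`, `RemoteEndpoint`, and `Candidate` types, their fields, the addresses, and the port are all assumptions made for the example. It enumerates the local-bind × remote-endpoint product, drops pairs whose address families differ, and deduplicates repeated transport tuples.

```go
package main

import (
	"fmt"
	"net/netip"
)

// LocalBind is a hypothetical local egress selector (a source address).
type LocalBind struct {
	ID  string
	Src netip.Addr
}

// RemoteEndpoint is a hypothetical remote address for a peer.
type RemoteEndpoint struct {
	ID  string
	Dst netip.AddrPort
}

// Candidate is one (local bind, remote endpoint) pair to probe.
type Candidate struct {
	Bind   LocalBind
	Remote RemoteEndpoint
}

// Candidates returns the bind x endpoint product, skipping pairs whose
// address families differ and pairs that repeat a (source, destination)
// transport tuple that was already emitted.
func Candidates(binds []LocalBind, eps []RemoteEndpoint) []Candidate {
	type tuple struct {
		src netip.Addr
		dst netip.AddrPort
	}
	seen := make(map[tuple]bool)
	var out []Candidate
	for _, b := range binds {
		src := b.Src.Unmap()
		for _, e := range eps {
			dst := netip.AddrPortFrom(e.Dst.Addr().Unmap(), e.Dst.Port())
			if src.Is4() != dst.Addr().Is4() {
				continue // an IPv4 source cannot reach an IPv6 destination, and vice versa
			}
			key := tuple{src, dst}
			if seen[key] {
				continue // duplicate transport tuple
			}
			seen[key] = true
			out = append(out, Candidate{Bind: b, Remote: e})
		}
	}
	return out
}

// demoCount builds a small example (documentation addresses, arbitrary port)
// and returns the number of candidates that survive filtering.
func demoCount() int {
	binds := []LocalBind{
		{"A1", netip.MustParseAddr("192.0.2.1")},
		{"A2", netip.MustParseAddr("198.51.100.1")},
		{"A3", netip.MustParseAddr("2001:db8::a")},
	}
	eps := []RemoteEndpoint{
		{"B1", netip.MustParseAddrPort("203.0.113.1:9000")},
		{"B2", netip.MustParseAddrPort("203.0.113.2:9000")},
		{"B3", netip.MustParseAddrPort("[2001:db8::b]:9000")},
		{"B1-dup", netip.MustParseAddrPort("203.0.113.1:9000")},
	}
	return len(Candidates(binds, eps))
}

func main() {
	fmt.Println(demoCount())
}
```

In this example the two IPv4 binds each pair with the two distinct IPv4 endpoints, and the IPv6 bind pairs only with the IPv6 endpoint, so the duplicate endpoint and the cross-family pairs never become probe targets.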
Force-pushed: ebe46ad to ea5f42f
|
I have force-pushed a rewritten history with smaller, buildable commits that might make the dependency chain easier to review. If you have a particular smallest acceptable version in mind for this PR, I would be very grateful for your guidance. |
Add the design note before the implementation commits so reviewers can read the intended model first. The document explains LinkID, local bind semantics, endpoint generations, deduplication rules, and the expected deployment/validation flow for multi-link routing.
Keep pooled UDPAddr reuse from truncating IPv6 endpoints. This is a small correctness fix that keeps later multi-endpoint probing from inheriting a corrupted destination address.
Add Linux pktinfo support for carrying a source address/interface selector on conn endpoints. The router uses this later to bind a probe or control packet to a specific local egress path while preserving the existing default behaviour elsewhere.
Extend config/state with explicit local binds and stable remote endpoint IDs. This commit also adds endpoint clone/resolution helpers, bind validation, and IP-family checks. It deliberately stops before changing router adjacency semantics.
Expand configured probing from remote endpoints to the deduplicated local-bind x remote-endpoint product. At this point the router still stores routes at the neighbour level; the change only makes the extra candidate transports discoverable and probeable. Cover the transport deduplication and IP-family filtering invariants here so they bisect with the probing change.
Add retraction tokens to route updates and acknowledgement packets. The token lets held-route cleanup track acknowledgements per routed adjacency after neighbour routes become link-scoped.
Add LinkID, Link, and RouterState link helpers as the explicit routed adjacency model. This is still a state-only step; later commits populate and consume Links from core routing code.
Expose local bind ID, interface, and source in status IPC and CLI output. This gives operators a way to inspect which local egress selector backs each endpoint before routing is fully link-scoped.
Populate RouterState.Links during config reconcile and endpoint discovery while keeping the old neighbour route path active. This separates link construction from the larger router algorithm switch, so reviewers can verify the new adjacency inventory independently.
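The pooled `UDPAddr` fix mentioned in the commit notes above is easier to see with a small reproduction. The sketch below is an assumption about the failure mode, not the actual `StdNetBind.Send` code: `setBuggy` and `setFixed` are hypothetical helpers, and a single `net.UDPAddr` stands in for an address taken from a pool. Copying into a slice that a previous IPv4 send shrank to 4 bytes transfers only the first 4 bytes of the IPv6 address.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
)

// setBuggy copies into the slice at its current length and only afterwards
// reslices it, so a slice left at 4 bytes receives just 4 bytes of an IPv6
// address.
func setBuggy(ua *net.UDPAddr, ap netip.AddrPort) {
	if ap.Addr().Is6() {
		as16 := ap.Addr().As16()
		copy(ua.IP, as16[:]) // copies min(len(ua.IP), 16) bytes
		ua.IP = ua.IP[:16]
	} else {
		as4 := ap.Addr().As4()
		copy(ua.IP, as4[:])
		ua.IP = ua.IP[:4]
	}
	ua.Port = int(ap.Port())
}

// setFixed resizes the slice to the source length before copying.
func setFixed(ua *net.UDPAddr, ap netip.AddrPort) {
	src := ap.Addr().AsSlice() // 4 bytes for IPv4, 16 for IPv6
	ua.IP = append(ua.IP[:0], src...)
	ua.Port = int(ap.Port())
}

// demo reuses one address for an IPv4 send followed by an IPv6 send and
// returns the IPv6 destination as the socket would see it.
func demo(set func(*net.UDPAddr, netip.AddrPort)) string {
	ua := &net.UDPAddr{IP: make(net.IP, 16)} // stands in for a pooled address
	set(ua, netip.MustParseAddrPort("192.0.2.1:9000"))
	set(ua, netip.MustParseAddrPort("[2001:db8::1]:9000"))
	return ua.IP.String()
}

func main() {
	fmt.Println("buggy:", demo(setBuggy))
	fmt.Println("fixed:", demo(setFixed))
}
```

With the buggy helper the second destination loses its trailing bytes, which is consistent with the truncation described in the PR; the fixed helper preserves the full address.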
Force-pushed: fd363c4 to 19f079d
Switch the Babel router from peer-scoped adjacency state to LinkID-scoped state. This moves pending IO, route updates, seqno requests, retraction ACKs, selected next hops, GC, forwarding endpoint selection, and status route aggregation onto routed links. The peer-level neighbour view remains for grouping status output and configured endpoints, but route ownership now lives on Link.Routes. This keeps the existing selected-route retention and link switch deadband behavior; those policy changes are split into a later commit.
Schedule a single delayed route recomputation when probe pongs update link RTT samples instead of recomputing routes for every pong. Make Dispatch report whether the function was queued so the pending recompute flag can be cleared if the dispatch queue is full.
Remove the selected-route retention shortcut so route recomputation can switch away from a current link when another feasible link is no worse. Set the default LinkSwitchDeadband to 1.0 to match the immediate switching policy.
Add focused regressions for local-bind incoming resolution, remote-init learned links, endpoint generation rotation, bind-specific discovery, and merged peer route reporting. These tests target the edge cases that are easiest to miss without a full live mesh.
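The pong-driven recomputation described in the commit notes above can be sketched as a pending flag guarded by a dispatch that reports success. This is a simplified, single-goroutine model under stated assumptions: the `Router` type, its fields, and `Drain` are hypothetical, and the delay before recomputation is omitted. Only the idea that `Dispatch` returns whether the function was queued comes from the commit message.

```go
package main

import "fmt"

// Router is a hypothetical, single-goroutine stand-in for the dispatch loop.
type Router struct {
	queue      chan func()
	pending    bool
	Recomputes int
}

// NewRouter creates a router with a bounded dispatch queue.
func NewRouter(queueSize int) *Router {
	return &Router{queue: make(chan func(), queueSize)}
}

// Dispatch tries to enqueue fn and reports whether it was queued.
func (r *Router) Dispatch(fn func()) bool {
	select {
	case r.queue <- fn:
		return true
	default:
		return false // queue is full
	}
}

// OnPong is called for every probe pong. It schedules at most one
// recomputation until that recomputation has run.
func (r *Router) OnPong() {
	if r.pending {
		return // a recomputation is already queued
	}
	r.pending = true
	queued := r.Dispatch(func() {
		r.pending = false
		r.Recomputes++
	})
	if !queued {
		r.pending = false // let a later pong retry
	}
}

// Drain runs everything currently queued.
func (r *Router) Drain() {
	for {
		select {
		case fn := <-r.queue:
			fn()
		default:
			return
		}
	}
}

// burst delivers n pongs before the queue is drained.
func burst(n int) int {
	r := NewRouter(8)
	for i := 0; i < n; i++ {
		r.OnPong()
	}
	r.Drain()
	return r.Recomputes
}

// fullQueueThenRetry shows that a failed dispatch does not wedge the flag.
func fullQueueThenRetry() int {
	r := NewRouter(1)
	r.Dispatch(func() {}) // fill the queue
	r.OnPong()            // dispatch fails, flag is cleared
	r.Drain()
	r.OnPong() // this one is queued
	r.Drain()
	return r.Recomputes
}

func main() {
	fmt.Println(burst(100), fullQueueThenRetry())
}
```

If the flag were left set after a failed dispatch, no later pong could schedule a recomputation, which is the case the boolean return value guards against.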
Force-pushed: 19f079d to 6bef547
|
Hi Yifan, Thanks for the quick response and tidying up the commit history! Regarding your changes, I looked over the diff, as well as your design doc. Here are some comments:
Let's discuss this before making more changes to the code. We also need a less clunky API for specifying the interface. One trivial way would be to just produce I*E links from I interfaces and E endpoints per peering (but this can also lead to a mess). I'd love to hear your POV. |
|
I completely agree with your separation of the link and routing models. I think I had fallen into the mindset of FRRouting-style designs, where multiple interfaces are explicitly exposed to Babel. In retrospect, Nylon's design is much more elegant: it finds a very nice optimal substructure by keeping all multi-link complexity confined to the peer-to-peer layer, which significantly simplifies the overall architecture. Your last comments actually inspired me to think about a different possible design. Using semantics similar to the Timestamp Sub-TLV from RFC 9616, it may be possible to improve asymmetric routing behavior relatively easily within the current endpoints model. Consider an extreme example: nodes A and B have two paths, A1 <-> B1 and A2 <-> B2, with the following one-way latencies:
In theory, if asymmetric routing is allowed, the best RTT would be:
rather than 101 ms. In practice, asymmetric routing is quite common on the Internet. Under the current endpoints model, exploiting this property at a single-hop level actually becomes relatively straightforward. However, this would likely require changes to the Ping/Pong packet format, replacing the current random-token + PingBuf RTT measurement mechanism with a Timestamp Sub-TLV-based measurement model. It would also be more efficient. My understanding is that this would be a fairly self-contained optimization. Would you prefer implementing something like this together with the interface-awareness work, or opening a separate issue for discussion and potentially addressing it in a later PR? Regarding the configuration schema, I initially considered a design that would automatically discover interfaces and addresses instead of requiring the current manual
It is also worth noting that, for multi-homing to work correctly, it is usually necessary to either explicitly bind sockets using SO_BINDTODEVICE and ensure that a corresponding default route exists on that interface, or manually configure policy routing, such as:

For those reasons, I was thinking of starting with a purely manual configuration model for this feature. In practice, since the nylon cluster deployment is handled by Ansible + an AI agent, the configuration burden is not actually too high. On the contrary, purely manual specification makes the expected behavior clearer and more predictable.

That said, perhaps we can find a middle ground between simplicity and completeness. For example, we could provide a built-in heuristic interface-name filter (such as a regular-expression-based rule) and automatically use all addresses on matching interfaces as source addresses. At the same time, users could override the interface or address filtering rules when necessary. This would allow most nodes to work with the default configuration, while only a small number of special cases would require manual configuration.

We could also defer dynamic interface/address change handling for now, to avoid introducing too much uncertainty in the initial implementation. Do you have any preference regarding which direction would make the most sense? |
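A minimal Go sketch of the "middle ground" filter proposed above, assuming a regular-expression exclude list with an optional user-supplied include pattern. The default pattern, the `SelectInterfaces` function, and its parameters are hypothetical and are not part of Nylon's configuration schema. Only `net.Interfaces`, `net.InterfaceByName`, and `Interface.Addrs` are real standard-library calls.

```go
package main

import (
	"fmt"
	"net"
	"regexp"
)

// defaultExclude is a hypothetical built-in heuristic: it skips loopback and
// common virtual or tunnel interface names.
const defaultExclude = `^(lo|docker\d*|veth.*|br-.*|virbr\d*|tun\d*|wg\d*)$`

// SelectInterfaces returns the names that match includePat (empty means all)
// and do not match excludePat (empty means defaultExclude).
func SelectInterfaces(names []string, includePat, excludePat string) ([]string, error) {
	if excludePat == "" {
		excludePat = defaultExclude
	}
	exclude, err := regexp.Compile(excludePat)
	if err != nil {
		return nil, fmt.Errorf("exclude pattern: %w", err)
	}
	var include *regexp.Regexp
	if includePat != "" {
		if include, err = regexp.Compile(includePat); err != nil {
			return nil, fmt.Errorf("include pattern: %w", err)
		}
	}
	var out []string
	for _, n := range names {
		if exclude.MatchString(n) {
			continue
		}
		if include != nil && !include.MatchString(n) {
			continue
		}
		out = append(out, n)
	}
	return out, nil
}

func main() {
	ifaces, err := net.Interfaces()
	if err != nil {
		fmt.Println("list interfaces:", err)
		return
	}
	var names []string
	for _, ifc := range ifaces {
		names = append(names, ifc.Name)
	}
	selected, err := SelectInterfaces(names, "", "")
	if err != nil {
		fmt.Println("filter:", err)
		return
	}
	// Every address on a selected interface becomes a candidate source.
	for _, name := range selected {
		ifc, err := net.InterfaceByName(name)
		if err != nil {
			continue
		}
		addrs, err := ifc.Addrs()
		if err != nil {
			continue
		}
		fmt.Println(name, addrs)
	}
}
```

Passing explicit patterns corresponds to the per-node override case, while the empty defaults correspond to nodes that need no manual configuration.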
|
Hmm, in regards to asymmetric routing... I actually added an experimental implementation over a year ago, but have since removed it.
In regards to interfaces, I also think it's a good starting point to just specify interfaces when desired. Since nylon runs as root in most deployments anyways, I think it's fine to do I think when you do need to specify an interface, that interface tends not to change much. Your "middle ground" approach makes sense to me.
Let's not worry about dynamically changing interfaces yet! |
|
As far as I know, there are roughly several approaches to routing with asymmetric paths without GPS or atomic clocks:

1. NTP-synchronized clocks and one-way delay estimation

This is the most straightforward approach. If clocks are synchronized, we can compare absolute timestamps and estimate one-way latency directly. Routing decisions can then be made based on the estimated one-way delays. The downside is that the error can be quite large. For asymmetric paths, NTP synchronization error is often on the same order of magnitude as the network latency being measured, making the estimates rather noisy.

2. Only measure cycle latency, without solving for one-way delay

Instead of trying to estimate one-way delays, we can work entirely with cycle latency. Cycle latency can be measured accurately using a mechanism similar to Babel's Timestamp Sub-TLV. The key observation is that clock offsets cancel out when measuring a cycle. As a result, time synchronization errors do not affect the final measurement. Reference: https://gemini.google.com/share/d470d773636d

3. Estimate one-way delays using measurements from many nodes

This approach first estimates one-way delays across the network and then applies the first method for routing decisions. See the paper: https://ieeexplore.ieee.org/document/1638554 or my old notes below:

Several years ago I ran some simulation experiments. For a 100-node cluster with roughly 50% symmetric routes, the average one-way-delay estimation error could be kept below 2 ms. Running the solver on a CPU, a single optimization round in PyTorch took on the order of a few seconds. The larger the absolute number of symmetric links, the stronger the constraints become, and the more accurate the one-way-delay estimates are across the entire cluster.
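The clock-offset cancellation behind method 2 can be checked with a few lines of Go. This is a sketch with made-up numbers, not Nylon's Ping/Pong format: `CycleRTT` and `Simulate` are hypothetical helpers, and the timestamps follow the usual four-timestamp scheme (send and receive on A's clock, receive and reply on B's clock) in microseconds.

```go
package main

import "fmt"

// CycleRTT computes the cycle latency from four timestamps in microseconds:
// t1 (A sends) and t4 (A receives) are read from A's clock, t2 (B receives)
// and t3 (B replies) are read from B's clock.
func CycleRTT(t1, t2, t3, t4 int64) int64 {
	return (t4 - t1) - (t3 - t2)
}

// Simulate generates the four timestamps for the given one-way delays, the
// time B holds the packet before replying, and B's clock offset relative to
// A, then returns the measured cycle latency.
func Simulate(fwd, rev, hold, offsetB int64) int64 {
	const start = 1_000_000
	t1 := int64(start)          // A's clock
	t2 := t1 + fwd + offsetB    // B's clock: arrival time shifted by the offset
	t3 := t2 + hold             // B's clock
	t4 := t1 + fwd + hold + rev // A's clock
	return CycleRTT(t1, t2, t3, t4)
}

func main() {
	// Hypothetical delays: 30 ms forward, 70 ms reverse, 5 ms hold at B.
	for _, offset := range []int64{0, 123_456_789, -987_654} {
		fmt.Println(offset, Simulate(30_000, 70_000, 5_000, offset))
	}
}
```

The offset appears in both t2 and t3, so it drops out of t3 - t2 and the result is always the sum of the two one-way delays, regardless of how far apart the clocks are.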
For an arbitrary strongly connected directed graph, it can be shown mathematically that, ignoring the small amount of drift inherent to hardware clocks, the cycle-latency approach (method 2) is equivalent, from a routing perspective, to routing based on perfectly accurate one-way-delay measurements. Intuitively, the best route from A to B must ultimately be part of some minimum cycle containing both A and B. Since method 2 can already compute the exact latency of arbitrary cycles, it provides equivalent routing information.

From that perspective, method 3 is probably not particularly useful for routing itself. However, since we were discussing one-way delay estimation, I thought it was an interesting idea worth sharing. :)

For the asymmetric-path scenario we discussed earlier between two peers, this is actually just the special case of a two-node graph, and can obviously be solved using method 2 as well.

Regarding the "middle-ground" configuration approach, I'd like to make the proposal a bit more concrete. If we don't have any major disagreements, I plan to implement something along these lines in the near future:
Please let me know if there's anything I've overlooked or misunderstood. And don't hesitate to share any additional thoughts or concerns. update:
|
Background
Currently, nylon employs a "one neighbor, many candidate endpoints, one best endpoint" model. In this implementation, a neighbor is keyed strictly by its NodeId. While multiple remote endpoints can be configured for a single peer, only the single best-performing endpoint is active for routing at any given time.
This approach has several limitations:
To address these constraints, this PR implements a Multi-Link Routing design. The core change shifts the routing adjacency from a "node-level" view to a "link-level" view. By introducing LocalBind (local interface/source selection), each unique (Peer, LocalBind, RemoteEndpoint) tuple is now treated as an independent routing link. This allows the router to independently track metrics for multiple paths between the same two nodes and select the optimal link for traffic based on real-time performance and local policy.
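The link-level view described above can be illustrated with a comparable struct used as a map key. This is a sketch only: `LinkID` and `Link` are names from the PR, but the field layout, the `BestLink` helper, and the RTT values below are assumptions made for the example, and real selection would also involve the routing metric and local policy.

```go
package main

import (
	"fmt"
	"time"
)

// LinkID identifies one routed link as a (Peer, LocalBind, RemoteEndpoint)
// tuple. The string fields are an assumption for illustration.
type LinkID struct {
	Peer           string
	LocalBind      string
	RemoteEndpoint string
}

// Link holds per-link measurements (hypothetical fields).
type Link struct {
	RTT    time.Duration
	Usable bool
}

// BestLink returns the usable link to peer with the lowest RTT.
func BestLink(links map[LinkID]Link, peer string) (LinkID, bool) {
	var best LinkID
	found := false
	for id, l := range links {
		if id.Peer != peer || !l.Usable {
			continue
		}
		if !found || l.RTT < links[best].RTT {
			best, found = id, true
		}
	}
	return best, found
}

// demoBest tracks several links to the same peer and picks the best one.
func demoBest() string {
	links := map[LinkID]Link{
		{"B", "A1", "B1"}: {40 * time.Millisecond, true},
		{"B", "A2", "B2"}: {12 * time.Millisecond, true},
		{"B", "A1", "B2"}: {5 * time.Millisecond, false}, // fastest, but not usable
		{"C", "A1", "C1"}: {1 * time.Millisecond, true},  // different peer
	}
	id, ok := BestLink(links, "B")
	if !ok {
		return ""
	}
	return id.LocalBind + "->" + id.RemoteEndpoint
}

func main() {
	fmt.Println(demoBest())
}
```

Because the key includes the local bind, two links that share a remote endpoint but leave through different local interfaces keep separate metrics.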
Full design: docs/reference/multi-link-routing.mdx.

What changed
LocalBindID, RemoteEndpointID, LinkID, and Link; store links in RouterState; key selected routes by next-hop link (SelRoute.NhLink).
IP_PKTINFO) so the same remote address on different binds is a distinct link.
local bind × remote endpoint product, track probes by link, dedupe duplicate transport tuples, and skip bind/endpoint address-family mismatches.
TCElement.ToEp from the selected link endpoint so data follows the selected link rather than the peer default endpoint.

Supporting commits
fix(conn): StdNetBind.Send reused a pooled net.UDPAddr whose IP slice could have been shrunk to 4 bytes by a prior IPv4 send, truncating the next IPv6 destination (e.g. 2001:db8::1 → 2001:db8::). The link then never collected RTT samples and its metric stayed at INF. Resize the slice before copying.
perf(core): batch a received bundle's control packets into a single dispatch that recomputes routes at most once, and coalesce pong-driven recomputation behind a pending flag, to avoid saturating the dispatch queue on multi-link meshes.

Testing
The feature has currently been tested across a total of 12 nodes deployed in different geographic regions over the public Internet, with continuous operation exceeding 24 hours. The test coverage includes: