Skip to content

Intoduce tokio/async-based BFD implementation#795

Open
jgallagher wants to merge 23 commits into
mainfrom
john/bfd-async
Open

Intoduce tokio/async-based BFD implementation#795
jgallagher wants to merge 23 commits into
mainfrom
john/bfd-async

Conversation

@jgallagher

Copy link
Copy Markdown
Contributor

This PR adds a tokio/async-based implementation of BFD. It's heavily based on the existing implementation both in API shape and names, but has some intentional architectural differences (primarily to support pretty extensive unit testing). The non-test code is handwritten, but I leaned pretty heavily on Claude to help with writing tests (although I went over all of them by hand - it has a tendency to write flaky-in-a-not-obvious-way tests when sockets are involved, so hopefully I caught all of those) and used it to compare this implementation against the existing one. The primary motivator for moving to async is fixing #787, which it does: running the reproducer against this implementation results in the DELETEs returning more or less immediately (under 1ms).

The biggest architectural changes are:

  1. The state machine is written in a pure "sans I/O" style, and must be driven by an external source that provides both input/output and a clock source. This allows straightforward unit testing without having to deal with timeouts, threads, or sockets.
  2. Each BFD session starts 3 tokio tasks (one to drive the state machine, one to send outgoing packets, and one to sync with the RIB), plus a tokio task for the local listening address that's shared with all BFD sessions using that same listening address. This is pretty close to swapping out threads for tasks; the sync implementation started 3 OS threads per session (two to drive the state machine, one to send outgoing packets) plus a local listening thread shared between sessions. It didn't need a separate thread for RIB syncing, but we do that here to bridge between the sync and async worlds.
  3. Communication between the UDP send/recv tasks and the state machine driver use bounded channels and drop packets if the channels are full. The old implementation used unbounded channels. I don't expect this to matter in practice, but it's a spot where something going off the rails (e.g., a send_to() getting stuck or something) used to mean we'd have unbounded memory growth now means we have a lot of dropped packets instead.

Some less-architectural-but-still-intentional changes:

  1. I made the detection threshold a NonZeroU8, both because I was concerned about multiplying by 0 and because the RFC says such packets MUST be rejected. We now do so, and also return a 400 if we're asked to configure a session with a threshold of 0. This might be the wrong call and I'm happy to discuss - there was a "panic on startup if we can't re-add all peers" path, so I made that path bump any persisted 0s to 1s, but maybe I should do the same in the API handler instead of returning a 400, in case there are Nexus configs persisted that have a threshold of 0?
  2. Some more careful math in the event of unreasonable large numbers (e.g., replacing an as u32 that wrapped around on large values with .saturating_*() methods).
  3. We now increment SessionCounters::control_packets_received on incoming packets - I think this was just an oversight before?
  4. If we fail to bind an egress socket, the old code would retry every 5 seconds; this code retries every time it has a packet it wants to send. I'm not sure if this is better or worse but it was simpler to implement - happy to discuss.

Some things that I left unchanged but I'm not sure are correct:

  1. Calculations of the "next send" and "recv deadline" timers in the state machine - I left TODO-correctness comments on these with my notes but didn't want to change that behavior as part of a big rework.
  2. Outgoing packets don't populate the local detection_multiplier value (also tagged with a TODO-correctness in the state machine).
  3. We close the egress socket and rebind on any error from UdpSocket::send_to(). Could that leave us in a constant loop of closing/opening sockets if sends are failing for some reason not related to the local socket?
  4. When removing the peer, there's no final sync with the RIB (nor is there in the sync implementation) - I think this means if we remove a peer in the Up state, we'll leave behind a nexthop_shudown of false for that address? I don't know whether this is okay or gets cleaned up by something else.
  5. Should required_min_rx also be a NonZero type? If that gets set to 0 we'll also have a "multiply by 0" in the calculation of the recv deadline.

This PR is only the async implementation with no modifications to the main mgd binary except the change of detection threshold to NonZeroU8 mentioned above. I'll open a second PR that swaps the implementations over.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant