feat(rocev2): RoCEv2 RDMA-SEND datapath, drivers, and PRBS validation harness#15

Open
ruck314 wants to merge 8 commits into main from rocev2-dev-1

ruck314 commented Jun 25, 2026

Contributor

Summary

Adds end-to-end RoCEv2 RDMA-SEND support to the Simple-10GbE-RUDP-KCU105 example:
a hardware RDMA datapath built on the surf RoCEv2AxiStreamRdma engine, three new
RoCEv2 build targets, a PyRogue device driver with host↔FPGA bring-up
sequencing, and the rocev2PrbsTest.py line-rate PRBS validation harness.

The branch history is organized into five reviewable commits, one per feature:

  1. chore(rocev2): bump surf — surf 634c8d9 → dcc0155: AXI-Stream RDMA
    datapath + engine refactor + RoCEv2 rename, DCQCN congestion control with
    runtime bypass, regenerated blue-rdma transport engine, runtime ECN/DSCP IPv4
    header register, AxiStreamMon frameUpdate.
  2. feat(rocev2): wire RoCEv2 RDMA engine into shared RTL — instantiate
    RoCEv2AxiStreamRdma in the shared datapath; thread a single 64-bit RDMA
    AXI-Stream pair through CorePkg/Core/App/Rudp; wrap the SsiPrbsTx
    payload with AxiStreamPacketizer2 (CRC_MODE_G="NONE",
    MAX_PACKET_BYTES_G=4096) ahead of the RDMA SEND; default to point-to-point
    posture (DSCP_G=0, ECN_G="00" Not-ECT) so the host NIC keeps DCQCN
    disengaged unless configured at runtime.
  3. feat(rocev2): add RoCEv2 build targets + plumbing — three RoCEv2 KCU105
    targets (1GbE / 10GbE / RJ45) with HDL, ruckus.tcl, Makefile, and
    promgen.tcl; a top-level firmware/Makefile batch build over all six
    targets; hoist the Vivado VersionCheck 2023.1 out of each Simple* target
    into shared ruckus.tcl; switch releases.yaml to FW_only (mcs/ltx); bump
    firmware version to v3.0.0.0.
  4. feat(rocev2): add RoCEv2 PyRogue device driver — new
    rocev2_10gbe_rudp_kcu105_example package (Root/App + host↔FPGA
    bring-up/tear-down sequencer, with a packetizer.CoreV2 depacketizer stage
    ahead of PrbsRx); teach the shared Core driver about rocev2/dcqcn
    knobs and instantiate RoCEv2AxiStreamRdma at 0x0015_0000.
  5. feat(rocev2): add rocev2PrbsTest.py line-rate PRBS harness — end-to-end
    PRBS validation that derives PacketLength from maxPayload minus the 16 B
    packetizer overhead; --p2p / throttle / checkPayload knobs; pass/fail
    gated on rxErrors and FW-egress bandwidth telemetry. Plus
    updateBootProm.py post-reload settle 5 s → 10 s and setup_env_slac.sh
    conda env → rogue_v6.15.0.
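
The PacketLength derivation in item 5 is simple arithmetic. A minimal sketch follows, assuming the value is expressed in bytes; the function and constant names are illustrative and are not taken from rocev2PrbsTest.py:

```python
# Illustrative only: names below are hypothetical, not the harness's own.
PACKETIZER_OVERHEAD_BYTES = 16  # packetizer framing overhead stated in this PR


def prbs_packet_bytes(max_payload: int,
                      overhead: int = PACKETIZER_OVERHEAD_BYTES) -> int:
    """Return the raw PRBS frame size that fits in one RDMA SEND payload."""
    if max_payload <= overhead:
        raise ValueError("maxPayload must exceed the packetizer overhead")
    return max_payload - overhead


if __name__ == "__main__":
    # With maxPayload = 4096, the raw frame is 4080 bytes.
    print(prbs_packet_bytes(4096))
```

With maxPayload=4096 this gives 4080 raw bytes per frame, so one packetized frame occupies exactly one maximum-size SEND payload on the wire.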

Test plan

  • FW synthesizes and meets timing (Vivado 2025.2, xcku040)
  • Docs CI (cd docs && make html) succeeds locally
  • End-to-end on KCU105 @ 192.168.2.10: programmed the bitstream and ran
    rocev2PrbsTest.py (throttled, checkPayload=True) → PASS (rxErrors=0),
    confirming the FW packetizer → RDMA SEND → SW CoreV2 depacketizer →
    PrbsRx chain reassembles and validates intact.

ruck314 added 5 commits June 25, 2026 07:18
Pulls in the surf-side RoCEv2 work that the rest of this branch builds on:
- AXI-Stream RDMA datapath, engine refactor, and RoCEv2 rename
- DCQCN congestion control with runtime bypass
- regenerated blue-rdma transport engine for line-rate RDMA-SEND
- runtime ECN/DSCP IPv4 header register and AxiStreamMon frameUpdate

surf 634c8d9 -> dcc0155.
Instantiate the surf RoCEv2AxiStreamRdma engine in the shared firmware
datapath and thread a single 64-bit RDMA AXI-Stream pair through the
hierarchy:

- CorePkg.vhd: define RDMA_AXIS_CONFIG_C (64-bit) as the shared RDMA
  stream config.
- App.vhd: drive SsiPrbsTx on RDMA_AXIS_CONFIG_C and wrap the payload
  with AxiStreamPacketizer2 (CRC_MODE_G="NONE", MAX_PACKET_BYTES_G=4096)
  ahead of the RDMA SEND.
- Core.vhd: pass the single RDMA master/slave pair through Core.
- Rudp.vhd: replace the bare engine with the RoCEv2AxiStreamRdma wrapper;
  default it to point-to-point posture (DSCP_G=0, ECN_G="00" Not-ECT) so
  the host NIC keeps DCQCN disengaged unless configured at runtime.
Add the three RoCEv2 KCU105 build targets (1GbE, 10GbE, RJ45), each with
its top-level HDL, ruckus.tcl, Makefile, and promgen.tcl, and the
supporting build/release plumbing:

- firmware/Makefile: top-level batch build/clean over all six targets.
- shared/ruckus.tcl + Simple* targets: hoist the Vivado VersionCheck
  2023.1 out of each target into the shared ruckus.tcl (single source).
- releases.yaml: drop the Rogue packaging block and switch the release to
  FW_only (mcs/ltx) now that firmware ships independently of the driver.
- shared_version.mk: bump firmware version to v3.0.0.0.
Add the rocev2_10gbe_rudp_kcu105_example PyRogue package (Root/App and
RoCEv2 bring-up sequencer) and teach the shared Core driver about RoCEv2:

- simple_10gbe_rudp_kcu105_example/_Core.py: add rocev2/dcqcn ctor knobs,
  add a UDP client (numClt=1) and instantiate surf RoCEv2AxiStreamRdma at
  0x0015_0000 when rocev2 is enabled.
- rocev2_10gbe_rudp_kcu105_example/_Root.py: host<->FPGA RoCEv2 bring-up
  and tear-down sequencing, plus a packetizer CoreV2 stage that strips the
  AxiStreamPacketizer2 framing ahead of PrbsRx.
- rocev2_10gbe_rudp_kcu105_example/_App.py + __init__.py: App device tree
  and package exports.
Add the rocev2PrbsTest.py end-to-end PRBS validation harness for the
RoCEv2 RDMA-SEND datapath: derives PacketLength from maxPayload minus the
16 B packetizer overhead, exposes --p2p / throttle / checkPayload knobs,
and gates pass/fail on rxErrors and FW-egress bandwidth telemetry.

Supporting environment updates:
- updateBootProm.py: extend the post-FpgaReload settle from 5 s to 10 s.
- setup_env_slac.sh: bump the conda env to rogue_v6.15.0.
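
For throttled runs, the harness log reports the gate as "rxErrors==0 AND rxCount>=target". A minimal sketch of that check, using a hypothetical result container rather than the script's real data structures, and leaving out the FW-egress bandwidth gate:

```python
from dataclasses import dataclass


@dataclass
class PrbsResult:
    """Hypothetical container for the two counters used by the throttled gate."""
    rx_errors: int
    rx_count: int


def throttled_gate(result: PrbsResult, target: int) -> bool:
    """Pass only if no PRBS errors were seen and the frame target was reached."""
    return result.rx_errors == 0 and result.rx_count >= target


if __name__ == "__main__":
    # A run that stalls at 499 of 1000 frames fails even with zero errors.
    print("PASS" if throttled_gate(PrbsResult(0, 499), 1000) else "FAIL")
```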
ruck314 changed the title from "feat(rocev2): packetize RDMA PRBS path + software depacketizer" to "feat(rocev2): RoCEv2 RDMA-SEND datapath, drivers, and PRBS validation harness" Jun 25, 2026
ruck314 marked this pull request as ready for review June 25, 2026 14:36
ruck314 requested a review from FilMarini June 25, 2026 14:36
@FilMarini

Collaborator

Hi @ruck314,
I've been testing the RoCEv2-10GbE firmware with Soft-RoCE and have run into an issue.
Running the PRBS script sometimes fails.
For example:

python3 rocev2PrbsTest.py --ip 192.168.2.10 --roceDevice rxe0 --target 1000                                                              [2/07/26 | 14:36:23]
Rogue/pyrogue version v6.15.0. https://github.com/slaclab/rogue
WARNING: --roceDevice='rxe0' is not an mlx5_* HW NIC; proceeding anyway (softRoCE / bench escape hatch).
Start: Started zmqServer on ports 9099-9101
    To start a gui: python -m pyrogue gui --server='localhost:9099'
    To use a virtual client: client = pyrogue.interfaces.VirtualClient(addr='localhost', port=9099)
setP2pMode: DcqcnBypass toggled LIVE; minRnrTimer=1 recorded — RNR backoff takes effect on the NEXT reconnect/restart.
--- RoCEv2 connection info ---
  Host QPN        : 0x20
  Host GID        : 0000:0000:0000:0000:0000:ffff:c0a8:0264
  MR addr         : 0x7cc21754e000
  MR rkey         : 0x116b
  FPGA lkey       : 0x1e239dbd
  MaxPayload      : 4096
  RxQueueDepth    : 256
  MrLen           : 1048576
------------------------------
Root.Core.AxiVersion count reset called
Streaming until rxCount >= 1000 (4080B raw / 4096B on-wire per frame, native RNR flow control, minRnrTimer=1, p2p=True)...
WARNING: timed out after 10.0s waiting for rxCount >= 1000
--- PRBS result ---
  PrbsRx.rxErrors        : 0
  PrbsRx.rxCount         : 499 (target 1000)
  Dma.SuccessCounter     : 0
  RESULT                 : FAIL
-------------------
--- FW telemetry ---
  run mode               : throttled
  trajectory samples     : 20 (cadence 0.5s)
  Dcqcn.Rc start         : 1250000000 B/s (10.000 Gb/s)
  Dcqcn.Rc min           : 1250000000 B/s (10.000 Gb/s)
  Dcqcn.Rc max           : 1250000000 B/s (10.000 Gb/s)
  Dcqcn.Rc last          : 1250000000 B/s (10.000 Gb/s)
  Dcqcn.CnpCounter peak  : 0
  Rdma.MonBandwidth      : 0.000 Gb/s (FW egress, steady)
  gate band              : throttled per-frame integrity: rxErrors==0 AND rxCount>=target
  VERDICT                : FAIL
-------------------
Root.Core.AxiVersion count reset called
Root.Core.AxiVersion count reset called
Connected to Root at 127.0.0.1:9099
Running GUI. Close window, hit cntrl-c or send SIGTERM to 38246 to exit.

This can happen after running the script several times, but also on the very first run after FPGA programming.

Sometimes, after the error, the RoCEv2Engine times out while attempting the DESTROY QP/MR/PD sequence.

Maybe I'm running the script wrong?

ruck314 added 2 commits July 2, 2026 16:15
Root.start() now unwinds through stop() if the host<->FPGA hand-off throws, so a
failed bring-up cannot leak the transport/poll threads or a half-open QP.
Root.stop() runs teardownConnection() then an Engine SoftReset inside a
try/finally, forcing the FW RoceConfigurator back to IDLE (it has no response
timeout, so an out-of-order DESTROY could otherwise wedge it) and guaranteeing
the transport is always released. setP2pMode() is decoupled from the RNR timer:
min_rnr_timer is QP state owned solely by transportCfg, so a slow softRoCE
responder can keep a larger backoff while FW DCQCN is still bypassed.
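
The unwind pattern described above can be sketched as follows. This is a stand-alone illustration, not the code in _Root.py: the class, the helper method names, and the exact nesting of the try/finally blocks are assumptions chosen to show the ordering (teardown, then SoftReset, then transport release).

```python
class RootSketch:
    """Hypothetical stand-in for Root; records the order of calls in self.log."""

    def __init__(self, fail_bringup: bool = False):
        self.log = []
        self._fail_bringup = fail_bringup

    # --- placeholders for the real transport/FW operations (assumed names) ---
    def _open_transport(self):
        self.log.append("open_transport")

    def _handoff(self):
        self.log.append("handoff")
        if self._fail_bringup:
            raise RuntimeError("host<->FPGA hand-off failed")

    def _teardown_connection(self):
        self.log.append("teardown")

    def _soft_reset(self):
        self.log.append("soft_reset")

    def _release_transport(self):
        self.log.append("release_transport")

    # --- the pattern itself ---
    def start(self):
        self._open_transport()
        try:
            self._handoff()
        except Exception:
            self.stop()  # unwind so no threads or half-open QP are leaked
            raise

    def stop(self):
        try:
            try:
                self._teardown_connection()
            finally:
                self._soft_reset()  # force the FW configurator back to IDLE
        finally:
            self._release_transport()  # always runs, even if the above raise


if __name__ == "__main__":
    root = RootSketch(fail_bringup=True)
    try:
        root.start()
    except RuntimeError:
        pass
    print(root.log)
```

The nested finally blocks mean a failure in teardown cannot skip the reset, and a failure in either cannot skip the transport release.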

rocev2PrbsTest.py defaults --trigRate per device (5 kHz for softRoCE vs 25 kHz
for an mlx5 NIC) and, under --p2p on softRoCE, keeps the configured --minRnrTimer
instead of forcing the minimal code-1 backoff that storms RNR NAKs on a kernel
responder. Adds FW diagnostic counters (Unsuccess/DmaRead/Oversize) to the
result summary for one-shot failure classification.
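
The per-device defaults can be pictured with a small sketch. The function names are hypothetical, the mlx5_ prefix test mirrors the warning the script prints for non-mlx5 devices, and treating an mlx5 NIC under --p2p as still using code 1 is an assumption inferred from the description rather than something this PR states outright:

```python
def is_hw_nic(roce_device: str) -> bool:
    """Treat mlx5_* devices as hardware NICs; everything else as softRoCE/bench."""
    return roce_device.startswith("mlx5_")


def default_trig_rate_hz(roce_device: str) -> int:
    """25 kHz for an mlx5 NIC, 5 kHz for softRoCE."""
    return 25_000 if is_hw_nic(roce_device) else 5_000


def effective_min_rnr_timer(roce_device: str, p2p: bool, configured: int) -> int:
    """Pick the RNR timer code handed to the QP.

    Assumption: only a hardware NIC under --p2p gets the minimal code 1;
    softRoCE keeps whatever --minRnrTimer was configured.
    """
    if p2p and is_hw_nic(roce_device):
        return 1
    return configured


if __name__ == "__main__":
    print(default_trig_rate_hz("rxe0"), effective_min_rnr_timer("rxe0", True, 12))
```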

Validated on a softRoCE-only host: 25/25 PRBS runs pass with zero errors and no
teardown segfault (paired with the rogue rocev2 Server teardown fix).
surf dcc0155 -> 142903a: RoCEv2 UdpEngine ECN/DSCP fix (move ECN/DSCP out of the
localMac word bits) and AxiStream/UdpEngine PR-review fixes.
ruckus 4c60fea -> 684ecc6: build-system updates (versal 2026.1, release
compare-URL, docs Pages deploy).
ruck314 commented Jul 2, 2026

Contributor Author

Thanks @FilMarini — not you running it wrong; reproduced it and root-caused two real bugs:

1. Intermittent RESULT: FAIL (rxCount stalls). An RNR-NAK storm on soft-RoCE: the bench default forced minRnrTimer=1 (0.01 ms) under --p2p and drove the PRBS source too fast, so a kernel rxe0 responder can't keep its recv queue armed — the FPGA SQ (rnr_retry=infinite) stalls/retransmits and throughput collapses into the 10 s timeout. Fixed on rocev2-dev-1: on soft-RoCE keep minRnrTimer=12 even under --p2p and default --trigRate to 5 kHz (25 kHz stays the mlx5 default).

2. DESTROY QP/MR/PD timeout + Segmentation fault at exit. Two parts:

  • The FW RoceConfigurator FSM has no response timeout, so a stray/out-of-order DESTROY wedges it — now cleared by a SoftReset pulse in Root.stop() (best-effort try/finally).
  • The crash is a rogue use-after-free: RoCEv2Server::stop() freed the registered RX slab while zero-copy frames were still held downstream (the GUI leaves the stream armed, so there's always one in flight). Fixed by freeing the slab in ~Server instead (its lifetime is the Pool object's), plus the missing GilRelease on bring-up/teardown, with a hardware-gated rxe0 regression test → fix(rocev2.Server): free RX slab in destructor to fix zero-copy teardown segfault rogue#1276.
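
As a reading aid for the two minRnrTimer codes in item 1, the sketch below decodes them with the 5-bit RNR NAK timer encoding from the InfiniBand specification (the same table appears in the ibv_modify_qp documentation for min_rnr_timer). The table is reproduced from that encoding, not from this PR's code, and the helper name is illustrative:

```python
# RNR NAK timer encoding (milliseconds), indexed by the 5-bit code 0..31.
RNR_TIMER_MS = [
    655.36, 0.01, 0.02, 0.03, 0.04, 0.06, 0.08, 0.12,
    0.16, 0.24, 0.32, 0.48, 0.64, 0.96, 1.28, 1.92,
    2.56, 3.84, 5.12, 7.68, 10.24, 15.36, 20.48, 30.72,
    40.96, 61.44, 81.92, 122.88, 163.84, 245.76, 327.68, 491.52,
]


def rnr_timer_ms(code: int) -> float:
    """Return the RNR backoff in milliseconds for a min_rnr_timer code."""
    if not 0 <= code < len(RNR_TIMER_MS):
        raise ValueError("min_rnr_timer code must be in 0..31")
    return RNR_TIMER_MS[code]


if __name__ == "__main__":
    for code in (1, 12):
        print(f"minRnrTimer={code}: {rnr_timer_ms(code)} ms")
```

Code 1 is the 0.01 ms backoff quoted above, while code 12 corresponds to 0.64 ms, which gives a kernel responder 64 times longer to re-arm its receive queue before the sender retries.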

Side note on your other observation: on a host that also has a hard-RoCE NIC, make sure the IPv4-mapped GID index is used — rxe index 0 is the link-local fe80:: GID and bring-up misbehaves with it (the driver auto-detects the right one on the FPGA subnet).
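
To make the GID distinction concrete, the sketch below builds the IPv4-mapped GID for a host address and picks the first GID-table index that falls on the FPGA subnet. It is an illustration only: the helper names and the example GID table are made up, and this is not the driver's actual auto-detection logic.

```python
import ipaddress
from typing import Optional, Sequence


def ipv4_mapped_gid(ipv4: str) -> str:
    """Return the ::ffff:a.b.c.d GID for an IPv4 address, fully expanded."""
    v4 = int(ipaddress.IPv4Address(ipv4))
    return ipaddress.IPv6Address((0xFFFF << 32) | v4).exploded


def pick_gid_index(gid_table: Sequence[str], fpga_subnet: str) -> Optional[int]:
    """Return the first index whose GID is IPv4-mapped and on the FPGA subnet."""
    net = ipaddress.ip_network(fpga_subnet)
    for idx, gid in enumerate(gid_table):
        v4 = ipaddress.IPv6Address(gid).ipv4_mapped  # None for fe80:: GIDs
        if v4 is not None and v4 in net:
            return idx
    return None


if __name__ == "__main__":
    # Hypothetical table: index 0 link-local, index 1 IPv4-mapped.
    table = [
        "fe80:0000:0000:0000:0211:22ff:fe33:4455",
        ipv4_mapped_gid("192.168.2.100"),
    ]
    print(table[1])
    print(pick_gid_index(table, "192.168.2.0/24"))
```

For 192.168.2.100 the mapped GID is 0000:0000:0000:0000:0000:ffff:c0a8:0264, which matches the Host GID printed in the connection info of the failing run above.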

Validated on a soft-RoCE-only host: 25/25 PRBS runs pass, zero errors, no segfault (with rogue#1276). The app-side fixes are pushed here; please re-test once #1276 merges.
