From 1f9c397325f003da11a0314954d7e8341d4ec8e5 Mon Sep 17 00:00:00 2001 From: MC952-arch Date: Wed, 27 May 2026 17:29:54 +0800 Subject: [PATCH 1/8] Add FlagCX v0.13.0 new features --- .../flagcx-v0.13.0-new-features.md | 227 ++++++++++++++++++ 1 file changed, 227 insertions(+) create mode 100644 fep/sig-network/flagcx-v0.13.0-new-features.md diff --git a/fep/sig-network/flagcx-v0.13.0-new-features.md b/fep/sig-network/flagcx-v0.13.0-new-features.md new file mode 100644 index 0000000..1ed07d6 --- /dev/null +++ b/fep/sig-network/flagcx-v0.13.0-new-features.md @@ -0,0 +1,227 @@ +# FEP(sig-network): Add FlagCX v0.13.0 New Features + +**Status:** `Provisional` + +**Created:** 2026-05-27 + +**Owner:** @flagos-ai + +**SIG:** sig-network + +**Target Version:** FlagOS 2.1 + +--- + +## Summary + +Compared with v0.11.0 (commit `cceb96d`), v0.13.0 (current main branch) introduces three significant feature areas in FlagCX: + +1. **P2P Engine** — a one-sided RDMA engine for point-to-point communication, enabling prefill-decode disaggregation in LLM inference scenarios and NIXL integration. +2. **Device API-based CustomAllReduce** — an intra-node AllReduce collective implemented entirely through FlagCX Device API primitives, allowing custom kernels to perform AllReduce without host-side scheduling overhead. +3. **Device API IR Bindings for Triton** — a set of C IR wrapper functions (compilable via LLVM bitcode) that expose FlagCX device-side communication primitives to Triton-generated kernels. + +Repository: https://github.com/flagos-ai/FlagCX + +--- + +## Motivation + +### Goals + +- **P2P Engine:** Provide a hardware-abstracted, one-sided RDMA engine that supports the high-performance P2P communication widely used in LLM inference scenarios, such as prefill-decode disaggregation. Currently, the FlagCX P2P engine is used as a vLLM KV transfer connector and is integrated as a NIXL backend. 
+- **Device API CustomAllReduce:** Achieve low-latency intra-node AllReduce for small-to-medium message sizes using the Device API. +- **Device API IR Bindings:** Enable Triton-compiled kernels to call the FlagCX Device API (rank queries, intra-node pointer access, barriers, etc.) via LLVM bitcode. + +--- + +## Proposal + +### Feature 1: P2P Engine + +A standalone P2P engine (`FlagcxP2pEngine`) is introduced with a C++ API for one-sided RDMA and two-sided send/recv operations. The engine: + +- Creates and manages RDMA connections over IBRC (InfiniBand Reliable Connected) QPs. +- Exposes vectorized read/write (`flagcxP2pEngineReadVector`, `flagcxP2pEngineWriteVector`) suitable for scatter-gather KV cache transfers. +- Provides an out-of-band notification channel for completion signaling. +- Integrates with FlagCX's existing topology manager (`flagcxP2pTopoManager`) to select the optimal NIC per GPU. + +Users of vLLM, NIXL, Mooncake or custom disaggregation frameworks can use the P2P engine as a low-level RDMA substrate. A patch for NIXL v1.1.0 integration (`plugin/nixl/flagcx_p2p_on_nixl_v1.1.0.patch`) is provided. + +### Feature 2: Device API-based CustomAllReduce + +`flagcxIntraAllReduce` is a kernel-based AllReduce that operates on a registered shared memory window (`flagcxDevMem_t`) using LSA or Multicast. The host-side setup is: + +1. Allocate a symmetric buffer (`flagcxMemAlloc` for VMM/window mode, or `cudaMalloc` for IPC mode). +2. Register it: `flagcxCommWindowRegister` (window mode) or `flagcxCommRegister` (IPC mode). +3. Create device handles: `flagcxDevCommCreate` + `flagcxDevMemCreate`. +4. Get device pointers: `flagcxDevCommGetDevicePtr` + `flagcxDevMemGetDevicePtr`. +5. Call `flagcxIntraAllReduce(devMem, count, datatype, devComm, stream)` from the host. 
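As a rough illustration of what the AllReduce is expected to compute once the host-side setup is done, the following pure-Python sketch models each rank's registered buffer as a plain list and performs a sum reduction across ranks. It does not call FlagCX; `simulate_intra_allreduce` and `run_check` are hypothetical names introduced here, and the two-phase "reduce, then write back" structure is a modeling assumption rather than a description of the actual kernel. The fill pattern (each rank writes `rank + 1`, expected sum `N*(N+1)/2`) follows the correctness check described in this FEP's test plan.

```python
def simulate_intra_allreduce(buffers):
    """Element-wise sum across all rank buffers; every rank ends with the result.

    `buffers` is a list with one list of numbers per rank, standing in for
    the symmetric registered buffer on each GPU (an assumption of this sketch).
    """
    if not buffers:
        return buffers
    count = len(buffers[0])
    if any(len(buf) != count for buf in buffers):
        raise ValueError("all ranks must contribute the same element count")

    # Phase 1: read local and peer data and reduce (sum) per element.
    reduced = [sum(buf[i] for buf in buffers) for i in range(count)]

    # Phase 2: write the reduced result back into every rank's buffer.
    for buf in buffers:
        buf[:] = reduced
    return buffers


def run_check(num_ranks, count):
    """Fill rank r with (r + 1) and verify every element equals N*(N+1)/2."""
    buffers = [[float(rank + 1)] * count for rank in range(num_ranks)]
    simulate_intra_allreduce(buffers)
    expected = num_ranks * (num_ranks + 1) / 2
    return all(value == expected for buf in buffers for value in buf)
```

Calling `run_check(8, 4)` models an 8-rank run with four elements per rank. Computing the reduced values from a snapshot before writing back sidesteps, in this model, the ordering that a real in-place kernel has to enforce through synchronization.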
+ +### Feature 3: Device API IR Bindings for Triton + +A set of `extern "C"` wrapper functions (declared in `flagcx_device_wrapper.h`, implemented in `flagcx_device_wrapper_impl.h`) expose the following categories of device-side primitives for LLVM bitcode compilation: + +| Category | Functions | +|---|---| +| Comm Queries | `flagcxDevCommGetRank`, `flagcxDevCommGetSize`, `flagcxDevCommGetIntraRank`, `flagcxDevCommGetIntraSize` | +| Cooperative Group | `flagcxCoopAnyInitBlock`, `flagcxCoopThreadRankC`, `flagcxCoopSizeC`, `flagcxCoopSyncC` | +| Team Queries | `flagcxGetTeamIntra`, `flagcxTeamRankToWorldC`, `flagcxTeamRankToIntraC` | +| Local Pointer | `flagcxGetLocalPointerC` | +| Intra Pointer (LSA) | `flagcxGetIntraPointerC` | +| Data Type Size | `flagcxDataTypeSizeC` | +| Intra Barrier | `flagcxIntraBarrierSessionInit`, `flagcxIntraBarrierSyncC` | +| Intra Barrier Arrive/Wait | `flagcxIntraBarrierArriveC`, `flagcxIntraBarrierWaitC` | + +The `flagcx_kernel.h` umbrella header guards these with `#ifndef __clang_llvm_bitcode_lib__` so that Triton's bitcode path only includes the device-safe subset (`flagcx_kernel_core.h`). + +--- + +## Design Details + +### P2P Engine Architecture + +``` +FlagcxP2pEngine + ├── IBRC adaptor (flagcxP2pDevCtx, ibv_pd per device) + ├── Accept thread (TCP handshake → QP setup) + ├── Notification thread (out-of-band completion signals) + └── MR registry (base VA → lkey/rkey mapping) + +FlagcxP2pConn + ├── flagcxP2pSendComm / flagcxP2pRecvComm (IB QP + CQ) + ├── flagcxP2pRequest ring (128 slots) + └── IPC handle cache (intra-node transfers) + +FlagcxP2pRdmaDesc (64 bytes) + ├── addr : remote virtual address + ├── size : transfer size + ├── rkey : remote MR key + └── padding : reserved for bookkeeping +``` + +Connection setup follows a TCP-based handshake where both sides exchange QP numbers, GIDs, and MTU via `flagcxP2pConnMeta`. 
The topology manager (`flagcxP2pTopoInit`) enumerates local GPUs and NICs, builds a node-scoped topology graph, and selects the best NIC for each GPU via `flagcxP2pTopoGetNetDev`. + +### Device API CustomAllReduce Data Flow + +``` +Host Setup: + flagcxMemAlloc(regBuff) + flagcxCommWindowRegister(comm, regBuff, size, &win, FLAGCX_WIN_COLL_SYMMETRIC) + flagcxDevCommCreate(comm, &reqs, &devComm) // reqs.intraBarrierCount = CTA_COUNT + flagcxDevMemCreate(comm, regBuff, size, win, &devMem) + +Kernel Execution: + flagcxIntraAllReduce(devMem, count, flagcxFloat, devComm, stream) + └── Device kernel: + 1. Each CTA reads local data from regBuff + 2. Reads peer data via flagcxGetIntraPointerC(devMem, offset, peer) + 3. Performs reduction (sum) + 4. Writes result back to regBuff + 5. Synchronizes via flagcxIntraBarrier +``` + +Two registration modes are supported: +- **Window mode** (`-R 2`): Uses `flagcxCommWindowRegister` + VMM-allocated memory. Preferred for NCCL >= 2.28. +- **IPC mode** (`-R 1`): Uses `flagcxCommRegister` + `cudaMalloc` memory. Compatible with all NCCL versions. + +### Device API IR Bindings Architecture + +``` +Triton Kernel (.py) + → Triton IR → LLVM IR + → Links flagcx_device_wrapper bitcode (.bc) + → Final PTX/CUBIN + +flagcx_device_wrapper.h (extern "C" declarations, bitcode-safe) +flagcx_device_wrapper_impl.h (inline implementations using adaptor) +flagcx_kernel_core.h (device-side types: flagcxDevComm, flagcxDevMem, etc.) +``` + +The IR functions operate on opaque `devCommPtr` and `devMemPtr` pointers obtained from the host-side `flagcxDevCommGetDevicePtr` / `flagcxDevMemGetDevicePtr` APIs. This allows Triton kernels to: +- Query communicator topology (rank, size, intra-rank). +- Access peer memory directly via LSA pointers. +- Synchronize across intra-node ranks using barriers. +- Perform cooperative group operations within a CTA. 
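The write, synchronize, then read-peer pattern that these bindings are meant to support can be modeled on the host with threads standing in for intra-node ranks. This is a minimal Python sketch that uses only the standard library; it does not involve Triton or FlagCX, `run_write_barrier_read` is a hypothetical name, and the choice of the right-hand neighbour as the peer is an assumption made for illustration.

```python
import threading


def run_write_barrier_read(num_ranks):
    """Each simulated rank writes its slot, waits on a barrier, then reads a peer."""
    slots = [None] * num_ranks       # stands in for peer-visible memory
    observed = [None] * num_ranks    # what each rank read from its peer
    barrier = threading.Barrier(num_ranks)

    def rank_body(rank):
        slots[rank] = rank + 1               # write the local buffer
        barrier.wait()                       # no rank reads until all have written
        peer = (rank + 1) % num_ranks        # right-hand neighbour (assumption)
        observed[rank] = slots[peer]         # read the peer's data

    threads = [threading.Thread(target=rank_body, args=(r,)) for r in range(num_ranks)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return observed
```

Without the barrier, a rank could read its peer's slot before the peer has written it; the barrier is what makes the observed values deterministic in this model.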
+ +--- + +## Packaging + +### Build + +```bash +# Build FlagCX with Device API and P2P support +cd flagcx && make -j + +# Build with NIXL integration (optional) +cd plugin/nixl && make -j +``` + +### Dependencies + +- MPI (for multi-process tests) +- libibverbs (for IBRC P2P adaptor) +- CUDA toolkit (for NVIDIA backend) +- NCCL >= 2.25 (for Device API vendor path; >= 2.28 for window mode) +- Triton >= 3.6 (for IR bindings integration) + +--- + +## Test Plan + +### P2P Engine Tests + +| Test | Command | Description | +|---|---|---| +| Unit test: P2P read engine | `mpirun -np 2 ./test_p2p_engine_read` | Verifies one-sided RDMA read between two ranks | +| Perf test: PUT | `mpirun -np 2 ./test_put -b 1024 -e 67108864 -f 2` | Bandwidth benchmark for one-sided PUT | +| Perf test: GET | `mpirun -np 2 ./test_get -b 1024 -e 67108864 -f 2` | Bandwidth benchmark for one-sided GET | +| Perf test: IPC sendrecv | `mpirun -np 2 ./test_ipc_sendrecv` | Intra-node IPC-based send/recv | +| KV transfer benchmark | `python test/perf/kv_transfer/kv_transfer_benchmark.py --connector=flagcx --role=server` | End-to-end KV cache transfer benchmark supporting NIXL, Mooncake, and FlagCX backends | + +### Device API CustomAllReduce Tests + +| Test | Command | Description | +|---|---|---| +| Unit test: AllReduce correctness | `mpirun -np N ./test_runner --gtest_filter=DeviceApiTest.IntraAllReduceViaDevicePtr` | Each rank fills buffer with (rank+1), verifies sum = N*(N+1)/2 | +| Perf test: AllReduce bandwidth | `mpirun -np N ./test_device_api_allreduce -b 1024 -e 67108864 -f 2 -R 2` | Sweeps message sizes, reports algBW/busBW, verifies correctness | +| Intra-node kernel test | `mpirun -np N ./test_intranode` | Full intra-node AllReduce kernel test | + +### Device API IR Bindings Tests + +| Test | Command | Description | +|---|---|---| +| IR function test (8 kernels) | `mpirun -np N ./test_device_ir` | Tests all 8 kernel categories (K1-K8) covering comm queries, cooperative groups, team queries, 
local/intra pointers, data type size, and intra barriers | + +**K1-K8 test categories:** +- K1: Comm Queries — verifies `GetRank`, `GetSize`, `GetIntraRank`, `GetIntraSize` +- K2: Cooperative Group — verifies `InitBlock`, `ThreadRank`, `Size`, `Sync` +- K3: Team Queries — verifies `GetTeamIntra`, `RankToWorld`, `RankToIntra` +- K4: Local Pointer — verifies `GetLocalPointerC` returns correct buffer address +- K5: Intra Pointer — verifies LSA read of peer's data via `GetIntraPointerC` +- K6: Data Type Size — verifies `DataTypeSizeC` for float(4), half(2), double(8), int32(4), uint64(8) +- K7: Intra Barrier Sync — write buffer, barrier, read peer's data +- K8: Intra Barrier Arrive/Wait — write buffer, arrive, wait, read peer's data + +--- + +## Related PRs + +- [ ] flagos-ai/FlagCX#450 — [PAL] IBRC P2P adaptor for FlagCX P2P engine +- [ ] flagos-ai/FlagCX#452 — [CRL] Refactor P2P zerocopy +- [ ] flagos-ai/FlagCX#453 — [CRL] P2P topo manager +- [ ] flagos-ai/FlagCX#454 — [CRL] Using Device API for customAllReduce implementation +- [ ] flagos-ai/FlagCX#466 — [CRL] Add & implement P2P interface for integration with NIXL +- [ ] flagos-ai/FlagCX#433 — [PAL] Introduce traits abstraction and DeviceAPI for unified vendor/fallback support +- [ ] flagos-ai/FlagCX#445 — [PAL] Support Device API Transport +- [ ] flagos-ai/FlagCX#442 — [PAL] Add Device API DU support +- [ ] flagos-ai/FlagCX#447 — [CRL] Add Device API multi-FIFO support +- [ ] flagos-ai/FlagCX#471 — [CRL] Add Device API symmem and multicast support +- [ ] flagos-ai/FlagCX#474 — [Others] KV transfer benchmark +- [ ] flagos-ai/FlagCX#475 — [UIL] Support Device API IR Bindings + +--- + +## Implementation History + +- 2026-05-27: FEP created for FlagCX v0.13.0 (features under development) under `sig-network`. 
\ No newline at end of file From 071d09c27cdea959093da42c06354ac3bc9e22de Mon Sep 17 00:00:00 2001 From: MC952-arch Date: Tue, 9 Jun 2026 20:22:29 +0800 Subject: [PATCH 2/8] docs(fep): update packaging and test commands for sig-network v0.13.0 --- .../flagcx-v0.13.0-new-features.md | 62 +++++++++++-------- 1 file changed, 35 insertions(+), 27 deletions(-) diff --git a/fep/sig-network/flagcx-v0.13.0-new-features.md b/fep/sig-network/flagcx-v0.13.0-new-features.md index 1ed07d6..0e4ab2e 100644 --- a/fep/sig-network/flagcx-v0.13.0-new-features.md +++ b/fep/sig-network/flagcx-v0.13.0-new-features.md @@ -147,23 +147,32 @@ The IR functions operate on opaque `devCommPtr` and `devMemPtr` pointers obtaine ## Packaging +### Obtain Source Code + +```bash +git clone https://github.com/flagos-ai/FlagCX.git +cd FlagCX +git submodule update --init --recursive +``` + ### Build ```bash -# Build FlagCX with Device API and P2P support -cd flagcx && make -j +# Build FlagCX core library (choose your backend) +make =1 -j$(nproc) -# Build with NIXL integration (optional) -cd plugin/nixl && make -j +# Build with Device API kernel support (required for CustomAllReduce) +make USE_NVIDIA=1 COMPILE_KERNEL=1 -j$(nproc) ``` +Where `` is one of: `USE_NVIDIA`, `USE_ASCEND`, `USE_ILUVATAR_COREX`, `USE_CAMBRICON`, `USE_METAX`, `USE_MUSA`, `USE_KUNLUNXIN`, `USE_DU`, `USE_AMD`, `USE_TSM`, `USE_ENFLAME`. 
+ ### Dependencies - MPI (for multi-process tests) - libibverbs (for IBRC P2P adaptor) - CUDA toolkit (for NVIDIA backend) - NCCL >= 2.25 (for Device API vendor path; >= 2.28 for window mode) -- Triton >= 3.6 (for IR bindings integration) --- @@ -171,37 +180,36 @@ cd plugin/nixl && make -j ### P2P Engine Tests +```bash +cd test/perf/host_api +make USE_NVIDIA=1 +cd build/bin +``` + | Test | Command | Description | |---|---|---| -| Unit test: P2P read engine | `mpirun -np 2 ./test_p2p_engine_read` | Verifies one-sided RDMA read between two ranks | -| Perf test: PUT | `mpirun -np 2 ./test_put -b 1024 -e 67108864 -f 2` | Bandwidth benchmark for one-sided PUT | -| Perf test: GET | `mpirun -np 2 ./test_get -b 1024 -e 67108864 -f 2` | Bandwidth benchmark for one-sided GET | -| Perf test: IPC sendrecv | `mpirun -np 2 ./test_ipc_sendrecv` | Intra-node IPC-based send/recv | -| KV transfer benchmark | `python test/perf/kv_transfer/kv_transfer_benchmark.py --connector=flagcx --role=server` | End-to-end KV cache transfer benchmark supporting NIXL, Mooncake, and FlagCX backends | +| Unit test: P2P read engine | `mpirun --allow-run-as-root -np 2 ./test_p2p_engine_read` | Verifies one-sided RDMA read between two ranks | +| Perf test: PUT | `mpirun --allow-run-as-root -np 2 ./test_put -b 1024 -e 67108864 -f 2` | Bandwidth benchmark for one-sided PUT | +| Perf test: GET | `mpirun --allow-run-as-root -np 2 ./test_get -b 1024 -e 67108864 -f 2` | Bandwidth benchmark for one-sided GET | +| Perf test: IPC sendrecv | `mpirun --allow-run-as-root -np 2 ./test_ipc_sendrecv` | Intra-node IPC-based send/recv | +| KV transfer benchmark | `python test/perf/kv_transfer/kv_transfer_benchmark.py --connector=flagcx --role=server` | End-to-end KV cache transfer benchmark | ### Device API CustomAllReduce Tests -| Test | Command | Description | -|---|---|---| -| Unit test: AllReduce correctness | `mpirun -np N ./test_runner --gtest_filter=DeviceApiTest.IntraAllReduceViaDevicePtr` | Each rank fills 
buffer with (rank+1), verifies sum = N*(N+1)/2 | -| Perf test: AllReduce bandwidth | `mpirun -np N ./test_device_api_allreduce -b 1024 -e 67108864 -f 2 -R 2` | Sweeps message sizes, reports algBW/busBW, verifies correctness | -| Intra-node kernel test | `mpirun -np N ./test_intranode` | Full intra-node AllReduce kernel test | +```bash +# FlagCX must be built with COMPILE_KERNEL=1 (from project root) +make USE_NVIDIA=1 COMPILE_KERNEL=1 -j$(nproc) -### Device API IR Bindings Tests +cd test/perf/device_api +make USE_NVIDIA=1 +cd build/bin +``` | Test | Command | Description | |---|---|---| -| IR function test (8 kernels) | `mpirun -np N ./test_device_ir` | Tests all 8 kernel categories (K1-K8) covering comm queries, cooperative groups, team queries, local/intra pointers, data type size, and intra barriers | - -**K1-K8 test categories:** -- K1: Comm Queries — verifies `GetRank`, `GetSize`, `GetIntraRank`, `GetIntraSize` -- K2: Cooperative Group — verifies `InitBlock`, `ThreadRank`, `Size`, `Sync` -- K3: Team Queries — verifies `GetTeamIntra`, `RankToWorld`, `RankToIntra` -- K4: Local Pointer — verifies `GetLocalPointerC` returns correct buffer address -- K5: Intra Pointer — verifies LSA read of peer's data via `GetIntraPointerC` -- K6: Data Type Size — verifies `DataTypeSizeC` for float(4), half(2), double(8), int32(4), uint64(8) -- K7: Intra Barrier Sync — write buffer, barrier, read peer's data -- K8: Intra Barrier Arrive/Wait — write buffer, arrive, wait, read peer's data +| Unit test: AllReduce correctness | `mpirun --allow-run-as-root -np N -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./test_runner --gtest_filter=DeviceApiTest.IntraAllReduceViaDevicePtr` | Each rank fills buffer with (rank+1), verifies sum = N*(N+1)/2 | +| Perf test: AllReduce bandwidth | `mpirun --allow-run-as-root -np N -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./perf_allreduce_intranode 
-b 1M -e 64M -f 2 -R 1` | Sweeps message sizes, reports algBW/busBW, verifies correctness | +| Intra-node kernel test | `mpirun --allow-run-as-root -np N -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./test_intranode -b 1M -e 4M -f 2 -R 2` | Full intra-node AllReduce kernel test | --- From f8cd113cdf508dc165faba9117e16c0ad84e188c Mon Sep 17 00:00:00 2001 From: MC952-arch Date: Wed, 10 Jun 2026 11:08:01 +0800 Subject: [PATCH 3/8] docs(fep): add Device API IR Bindings test section to v0.13.0 FEP --- fep/sig-network/flagcx-v0.13.0-new-features.md | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/fep/sig-network/flagcx-v0.13.0-new-features.md b/fep/sig-network/flagcx-v0.13.0-new-features.md index 0e4ab2e..74be159 100644 --- a/fep/sig-network/flagcx-v0.13.0-new-features.md +++ b/fep/sig-network/flagcx-v0.13.0-new-features.md @@ -198,7 +198,7 @@ cd build/bin ```bash # FlagCX must be built with COMPILE_KERNEL=1 (from project root) -make USE_NVIDIA=1 COMPILE_KERNEL=1 -j$(nproc) +make USE_NVIDIA=1 COMPILE_KERNEL=1 FORCE_DEFAULT_PATH=1 -j$(nproc) cd test/perf/device_api make USE_NVIDIA=1 @@ -211,6 +211,21 @@ cd build/bin | Perf test: AllReduce bandwidth | `mpirun --allow-run-as-root -np N -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./perf_allreduce_intranode -b 1M -e 64M -f 2 -R 1` | Sweeps message sizes, reports algBW/busBW, verifies correctness | | Intra-node kernel test | `mpirun --allow-run-as-root -np N -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./test_intranode -b 1M -e 4M -f 2 -R 2` | Full intra-node AllReduce kernel test | +### Device API IR Bindings Tests + +```bash +# FlagCX must be built with COMPILE_KERNEL=1 (from project root) +make USE_NVIDIA=1 COMPILE_KERNEL=1 FORCE_DEFAULT_PATH=1 -j$(nproc) + +cd test/unittest/device_api +make USE_NVIDIA=1 FORCE_DEFAULT_PATH=1 
-j$(nproc) +cd build/bin +``` + +| Test | Command | Description | +|---|---|---| +| IR bindings correctness | `mpirun --allow-run-as-root -np 8 -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./test_device_ir -b 1M -e 4M -f 2 -R 2` | Tests 8 kernel categories covering 69 IR wrapper functions (comm queries, cooperative group, team queries, local/intra pointers, barriers) | + --- ## Related PRs From 45d34ce60e767dd9bf8994bdcb4ed6a9d0291b1c Mon Sep 17 00:00:00 2001 From: MC952-arch Date: Wed, 10 Jun 2026 15:30:51 +0800 Subject: [PATCH 4/8] fix(fep): correct test binary names and remove non-existent tests --- fep/sig-network/flagcx-v0.13.0-new-features.md | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/fep/sig-network/flagcx-v0.13.0-new-features.md b/fep/sig-network/flagcx-v0.13.0-new-features.md index 74be159..db96777 100644 --- a/fep/sig-network/flagcx-v0.13.0-new-features.md +++ b/fep/sig-network/flagcx-v0.13.0-new-features.md @@ -188,11 +188,8 @@ cd build/bin | Test | Command | Description | |---|---|---| -| Unit test: P2P read engine | `mpirun --allow-run-as-root -np 2 ./test_p2p_engine_read` | Verifies one-sided RDMA read between two ranks | -| Perf test: PUT | `mpirun --allow-run-as-root -np 2 ./test_put -b 1024 -e 67108864 -f 2` | Bandwidth benchmark for one-sided PUT | -| Perf test: GET | `mpirun --allow-run-as-root -np 2 ./test_get -b 1024 -e 67108864 -f 2` | Bandwidth benchmark for one-sided GET | -| Perf test: IPC sendrecv | `mpirun --allow-run-as-root -np 2 ./test_ipc_sendrecv` | Intra-node IPC-based send/recv | -| KV transfer benchmark | `python test/perf/kv_transfer/kv_transfer_benchmark.py --connector=flagcx --role=server` | End-to-end KV cache transfer benchmark | +| Perf test: PUT | `mpirun --allow-run-as-root -np 2 ./perf_put -b 1024 -e 67108864 -f 2` | Bandwidth benchmark for one-sided PUT | +| Perf test: GET | `mpirun --allow-run-as-root -np 2 ./perf_get -b 1024 -e 
67108864 -f 2` | Bandwidth benchmark for one-sided GET | ### Device API CustomAllReduce Tests @@ -207,9 +204,7 @@ cd build/bin | Test | Command | Description | |---|---|---| -| Unit test: AllReduce correctness | `mpirun --allow-run-as-root -np N -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./test_runner --gtest_filter=DeviceApiTest.IntraAllReduceViaDevicePtr` | Each rank fills buffer with (rank+1), verifies sum = N*(N+1)/2 | -| Perf test: AllReduce bandwidth | `mpirun --allow-run-as-root -np N -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./perf_allreduce_intranode -b 1M -e 64M -f 2 -R 1` | Sweeps message sizes, reports algBW/busBW, verifies correctness | -| Intra-node kernel test | `mpirun --allow-run-as-root -np N -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./test_intranode -b 1M -e 4M -f 2 -R 2` | Full intra-node AllReduce kernel test | +| Perf test: AllReduce intranode | `mpirun --allow-run-as-root -np 8 -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./perf_allreduce_intranode -b 1M -e 64M -f 2 -R 1` | Sweeps message sizes, reports algBW/busBW, verifies AllReduce correctness | ### Device API IR Bindings Tests From b612110f1434ff5028098531a0d25352c834f899 Mon Sep 17 00:00:00 2001 From: MC952-arch Date: Wed, 10 Jun 2026 16:07:41 +0800 Subject: [PATCH 5/8] fix(fep): replace perf benchmarks with P2P unit tests for regression testing --- fep/sig-network/flagcx-v0.13.0-new-features.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/fep/sig-network/flagcx-v0.13.0-new-features.md b/fep/sig-network/flagcx-v0.13.0-new-features.md index db96777..fbd6ce3 100644 --- a/fep/sig-network/flagcx-v0.13.0-new-features.md +++ b/fep/sig-network/flagcx-v0.13.0-new-features.md @@ -181,15 +181,14 @@ Where `` is one of: `USE_NVIDIA`, `USE_ASCEND`, 
`USE_ILUVATAR_COREX`, ` ### P2P Engine Tests ```bash -cd test/perf/host_api -make USE_NVIDIA=1 +cd test/unittest/p2p +make cd build/bin ``` | Test | Command | Description | |---|---|---| -| Perf test: PUT | `mpirun --allow-run-as-root -np 2 ./perf_put -b 1024 -e 67108864 -f 2` | Bandwidth benchmark for one-sided PUT | -| Perf test: GET | `mpirun --allow-run-as-root -np 2 ./perf_get -b 1024 -e 67108864 -f 2` | Bandwidth benchmark for one-sided GET | +| Unit test: P2P engine | `mpirun --allow-run-as-root -np 2 ./p2p_unit_tests` | Verifies P2P engine correctness: one-sided read, RPC, adaptor, batch, and slice task | ### Device API CustomAllReduce Tests From a24cab803c94e0d706e0847bb7763c2fb17774c5 Mon Sep 17 00:00:00 2001 From: MC952-arch Date: Thu, 11 Jun 2026 00:18:11 +0800 Subject: [PATCH 6/8] Update P2P Engine test plan to use perf_p2p_engine benchmark --- fep/sig-network/flagcx-v0.13.0-new-features.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fep/sig-network/flagcx-v0.13.0-new-features.md b/fep/sig-network/flagcx-v0.13.0-new-features.md index fbd6ce3..4e5b911 100644 --- a/fep/sig-network/flagcx-v0.13.0-new-features.md +++ b/fep/sig-network/flagcx-v0.13.0-new-features.md @@ -181,14 +181,14 @@ Where `` is one of: `USE_NVIDIA`, `USE_ASCEND`, `USE_ILUVATAR_COREX`, ` ### P2P Engine Tests ```bash -cd test/unittest/p2p -make +cd test/perf/host_api +make USE_NVIDIA=1 cd build/bin ``` | Test | Command | Description | |---|---|---| -| Unit test: P2P engine | `mpirun --allow-run-as-root -np 2 ./p2p_unit_tests` | Verifies P2P engine correctness: one-sided read, RPC, adaptor, batch, and slice task | +| Perf test: P2P Engine (read/write) | `mpirun --allow-run-as-root -np 2 ./perf_p2p_engine -b 4K -e 64M -f 2 -n 10` | One-sided RDMA GET/PUT bandwidth benchmark via P2P Engine RPC control-plane. Set `FLAGCX_P2P_PERF_OP=read\|write\|both` to select operation. 
| ### Device API CustomAllReduce Tests From 61ee2b912ec59a8072dbe48589377db2666f2b58 Mon Sep 17 00:00:00 2001 From: MC952-arch Date: Thu, 11 Jun 2026 14:35:54 +0800 Subject: [PATCH 7/8] Add multi-platform support status for v0.13.0 features --- fep/sig-network/flagcx-v0.13.0-new-features.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/fep/sig-network/flagcx-v0.13.0-new-features.md b/fep/sig-network/flagcx-v0.13.0-new-features.md index 4e5b911..dde389d 100644 --- a/fep/sig-network/flagcx-v0.13.0-new-features.md +++ b/fep/sig-network/flagcx-v0.13.0-new-features.md @@ -174,6 +174,14 @@ Where `` is one of: `USE_NVIDIA`, `USE_ASCEND`, `USE_ILUVATAR_COREX`, ` - CUDA toolkit (for NVIDIA backend) - NCCL >= 2.25 (for Device API vendor path; >= 2.28 for window mode) +### Multi-Platform Support + +| Feature | Platform Support | Current Test Status | +|---|---|---| +| P2P Engine | All supported backends (no vendor-specific adaptation required) | Tested on NVIDIA only during development | +| Device API CustomAllReduce | NVIDIA only | Other vendors need to add compilation pipeline and kernel implementation | +| Device API IR Bindings | NVIDIA only | Other vendors need to add compilation pipeline and kernel implementation | + --- ## Test Plan From acf2275676e7a38ecfdd9f0f77ded2e16f36aa3a Mon Sep 17 00:00:00 2001 From: MC952-arch Date: Thu, 18 Jun 2026 13:08:39 +0800 Subject: [PATCH 8/8] docs(fep): update P2P Engine multi-platform test status for v0.13.0 --- .../flagcx-v0.13.0-new-features.md | 25 ++++++++++++++----- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/fep/sig-network/flagcx-v0.13.0-new-features.md b/fep/sig-network/flagcx-v0.13.0-new-features.md index dde389d..aabd1f7 100644 --- a/fep/sig-network/flagcx-v0.13.0-new-features.md +++ b/fep/sig-network/flagcx-v0.13.0-new-features.md @@ -169,18 +169,28 @@ Where `` is one of: `USE_NVIDIA`, `USE_ASCEND`, `USE_ILUVATAR_COREX`, ` ### Dependencies -- MPI (for multi-process tests) +- MPI (for 
multi-process tests): OpenMPI or mpich - libibverbs (for IBRC P2P adaptor) - CUDA toolkit (for NVIDIA backend) - NCCL >= 2.25 (for Device API vendor path; >= 2.28 for window mode) +#### Installing mpich (required for multi-platform P2P Engine testing) + +```bash +wget https://www.mpich.org/static/downloads/4.2.3/mpich-4.2.3.tar.gz +tar xzvf mpich-4.2.3.tar.gz +cd mpich-4.2.3 && ./configure --prefix="$PWD/build" --with-device=ch3 --disable-fortran && make -j64 && make install +export MPI_HOME=$PWD/build +export LD_LIBRARY_PATH=$MPI_HOME/lib:$MPI_HOME:$LD_LIBRARY_PATH +``` + ### Multi-Platform Support | Feature | Platform Support | Current Test Status | |---|---|---| -| P2P Engine | All supported backends (no vendor-specific adaptation required) | Tested on NVIDIA only during development | -| Device API CustomAllReduce | NVIDIA only | Other vendors need to add compilation pipeline and kernel implementation | -| Device API IR Bindings | NVIDIA only | Other vendors need to add compilation pipeline and kernel implementation | +| P2P Engine | All supported backends (no vendor-specific adaptation required) | Tested on NVIDIA, Mthreads (USE_MUSA), Metax (USE_METAX), Hygon (USE_DU) | +| Device API CustomAllReduce | NVIDIA only | Tested on NVIDIA only; other vendors need to add compilation pipeline and kernel implementation | +| Device API IR Bindings | NVIDIA only | Tested on NVIDIA only; other vendors need to add compilation pipeline and kernel implementation | --- @@ -189,14 +199,17 @@ Where `` is one of: `USE_NVIDIA`, `USE_ASCEND`, `USE_ILUVATAR_COREX`, ` ### P2P Engine Tests ```bash +# Build (choose your backend) +make USE_NVIDIA=1 -j$(nproc) # or USE_MUSA=1, USE_METAX=1, USE_DU=1 + cd test/perf/host_api -make USE_NVIDIA=1 +make USE_NVIDIA=1 # match your backend flag cd build/bin ``` | Test | Command | Description | |---|---|---| -| Perf test: P2P Engine (read/write) | `mpirun --allow-run-as-root -np 2 ./perf_p2p_engine -b 4K -e 64M -f 2 -n 10` | One-sided RDMA GET/PUT 
bandwidth benchmark via P2P Engine RPC control-plane. Set `FLAGCX_P2P_PERF_OP=read\|write\|both` to select operation. | +| Perf test: P2P Engine (read/write) | `$MPI_HOME/bin/mpirun --genv FLAGCX_MEM_ENABLE=1 -np 2 ./perf_p2p_engine -b 4K -e 64M -f 2 -n 10` | One-sided RDMA GET/PUT bandwidth benchmark via P2P Engine RPC control-plane. Set `FLAGCX_P2P_PERF_OP=read\|write\|both` to select operation. | ### Device API CustomAllReduce Tests