diff --git a/fep/sig-network/flagcx-v0.13.0-new-features.md b/fep/sig-network/flagcx-v0.13.0-new-features.md new file mode 100644 index 0000000..aabd1f7 --- /dev/null +++ b/fep/sig-network/flagcx-v0.13.0-new-features.md @@ -0,0 +1,265 @@ +# FEP(sig-network): Add FlagCX v0.13.0 New Features + +**Status:** `Provisional` + +**Created:** 2026-05-27 + +**Owner:** @flagos-ai + +**SIG:** sig-network + +**Target Version:** FlagOS 2.1 + +--- + +## Summary + +Comparing v0.13.0 (current main branch) against v0.11.0 (commit `cceb96d`), three significant feature areas have been introduced in FlagCX: + +1. **P2P Engine** — a one-sided RDMA engine for point-to-point communication, enabling prefill-decode disaggregation in LLM inference scenarios and NIXL integration. +2. **Device API-based CustomAllReduce** — an intra-node AllReduce collective implemented entirely through FlagCX Device API primitives, allowing custom kernels to perform AllReduce without host-side scheduling overhead. +3. **Device API IR Bindings for Triton** — a set of C IR wrapper functions (compilable via LLVM bitcode) that expose FlagCX device-side communication primitives to Triton-generated kernels. + +Repository: https://github.com/flagos-ai/FlagCX + +--- + +## Motivation + +### Goals + +- **P2P Engine:** Provide a hardware-abstracted, one-sided RDMA engine that supports high-performance P2P communications widely used in LLM inference scenarios, such as prefill-decode disaggregation. Currently, FlagCX P2P engine have been to used as vLLM KV transfer connector and integrated as a NIXL backend. +- **Device API CustomAllReduce:** Achieve low-latency AllReduce using Device API to address intra-node small-to-medium message size communication. +- **Device API IR Bindings:** Enable Triton-compiled kernels to call FlagCX Device API (rank queries, intra-node pointer access, barriers, etc.) via LLVM bitcode. + +--- + +## Proposal + +### Feature 1: P2P Engine + +A standalone P2P engine (`FlagcxP2pEngine`) is introduced with a C++ API for one-sided RDMA and two-sided send/recv operations. The engine: + +- Creates and manages RDMA connections over IBRC (InfiniBand Reliable Connected) QPs. +- Exposes vectorized read/write (`flagcxP2pEngineReadVector`, `flagcxP2pEngineWriteVector`) suitable for scatter-gather KV cache transfers. +- Provides an out-of-band notification channel for completion signaling. +- Integrates with FlagCX's existing topology manager (`flagcxP2pTopoManager`) to select the optimal NIC per GPU. + +Users of vLLM, NIXL, Mooncake or custom disaggregation frameworks can use the P2P engine as a low-level RDMA substrate. A patch for NIXL v1.1.0 integration (`plugin/nixl/flagcx_p2p_on_nixl_v1.1.0.patch`) is provided. + +### Feature 2: Device API-based CustomAllReduce + +`flagcxIntraAllReduce` is a kernel-based AllReduce that operates on a registered shared memory window (`flagcxDevMem_t`) using LSA or Multicast. The host-side setup: + +1. Allocate a symmetric buffer (`flagcxMemAlloc` for VMM/window mode, or `cudaMalloc` for IPC mode). +2. Register it: `flagcxCommWindowRegister` (window mode) or `flagcxCommRegister` (IPC mode). +3. Create device handles: `flagcxDevCommCreate` + `flagcxDevMemCreate`. +4. Get device pointers: `flagcxDevCommGetDevicePtr` + `flagcxDevMemGetDevicePtr`. +5. Call `flagcxIntraAllReduce(devMem, count, datatype, devComm, stream)` from the host. + +### Feature 3: Device API IR Bindings for Triton + +A set of `extern "C"` wrapper functions (declared in `flagcx_device_wrapper.h`, implemented in `flagcx_device_wrapper_impl.h`) expose the following categories of device-side primitives for LLVM bitcode compilation: + +| Category | Functions | +|---|---| +| Comm Queries | `flagcxDevCommGetRank`, `flagcxDevCommGetSize`, `flagcxDevCommGetIntraRank`, `flagcxDevCommGetIntraSize` | +| Cooperative Group | `flagcxCoopAnyInitBlock`, `flagcxCoopThreadRankC`, `flagcxCoopSizeC`, `flagcxCoopSyncC` | +| Team Queries | `flagcxGetTeamIntra`, `flagcxTeamRankToWorldC`, `flagcxTeamRankToIntraC` | +| Local Pointer | `flagcxGetLocalPointerC` | +| Intra Pointer (LSA) | `flagcxGetIntraPointerC` | +| Data Type Size | `flagcxDataTypeSizeC` | +| Intra Barrier | `flagcxIntraBarrierSessionInit`, `flagcxIntraBarrierSyncC` | +| Intra Barrier Arrive/Wait | `flagcxIntraBarrierArriveC`, `flagcxIntraBarrierWaitC` | + +The `flagcx_kernel.h` umbrella header guards these with `#ifndef __clang_llvm_bitcode_lib__` so that Triton's bitcode path only includes the device-safe subset (`flagcx_kernel_core.h`). + +--- + +## Design Details + +### P2P Engine Architecture + +``` +FlagcxP2pEngine + ├── IBRC adaptor (flagcxP2pDevCtx, ibv_pd per device) + ├── Accept thread (TCP handshake → QP setup) + ├── Notification thread (out-of-band completion signals) + └── MR registry (base VA → lkey/rkey mapping) + +FlagcxP2pConn + ├── flagcxP2pSendComm / flagcxP2pRecvComm (IB QP + CQ) + ├── flagcxP2pRequest ring (128 slots) + └── IPC handle cache (intra-node transfers) + +FlagcxP2pRdmaDesc (64 bytes) + ├── addr : remote virtual address + ├── size : transfer size + ├── rkey : remote MR key + └── padding : reserved for bookkeeping +``` + +Connection setup follows a TCP-based handshake where both sides exchange QP numbers, GIDs, and MTU via `flagcxP2pConnMeta`. The topology manager (`flagcxP2pTopoInit`) enumerates local GPUs and NICs, builds a node-scoped topology graph, and selects the best NIC for each GPU via `flagcxP2pTopoGetNetDev`. + +### Device API CustomAllReduce Data Flow + +``` +Host Setup: + flagcxMemAlloc(regBuff) + flagcxCommWindowRegister(comm, regBuff, size, &win, FLAGCX_WIN_COLL_SYMMETRIC) + flagcxDevCommCreate(comm, &reqs, &devComm) // reqs.intraBarrierCount = CTA_COUNT + flagcxDevMemCreate(comm, regBuff, size, win, &devMem) + +Kernel Execution: + flagcxIntraAllReduce(devMem, count, flagcxFloat, devComm, stream) + └── Device kernel: + 1. Each CTA reads local data from regBuff + 2. Reads peer data via flagcxGetIntraPointerC(devMem, offset, peer) + 3. Performs reduction (sum) + 4. Writes result back to regBuff + 5. Synchronizes via flagcxIntraBarrier +``` + +Two registration modes are supported: +- **Window mode** (`-R 2`): Uses `flagcxCommWindowRegister` + VMM-allocated memory. Preferred for NCCL >= 2.28. +- **IPC mode** (`-R 1`): Uses `flagcxCommRegister` + `cudaMalloc` memory. Compatible with all NCCL versions. + +### Device API IR Bindings Architecture + +``` +Triton Kernel (.py) + → Triton IR → LLVM IR + → Links flagcx_device_wrapper bitcode (.bc) + → Final PTX/CUBIN + +flagcx_device_wrapper.h (extern "C" declarations, bitcode-safe) +flagcx_device_wrapper_impl.h (inline implementations using adaptor) +flagcx_kernel_core.h (device-side types: flagcxDevComm, flagcxDevMem, etc.) +``` + +The IR functions operate on opaque `devCommPtr` and `devMemPtr` pointers obtained from the host-side `flagcxDevCommGetDevicePtr` / `flagcxDevMemGetDevicePtr` APIs. This allows Triton kernels to: +- Query communicator topology (rank, size, intra-rank). +- Access peer memory directly via LSA pointers. +- Synchronize across intra-node ranks using barriers. +- Perform cooperative group operations within a CTA. + +--- + +## Packaging + +### Obtain Source Code + +```bash +git clone https://github.com/flagos-ai/FlagCX.git +cd FlagCX +git submodule update --init --recursive +``` + +### Build + +```bash +# Build FlagCX core library (choose your backend) +make =1 -j$(nproc) + +# Build with Device API kernel support (required for CustomAllReduce) +make USE_NVIDIA=1 COMPILE_KERNEL=1 -j$(nproc) +``` + +Where `` is one of: `USE_NVIDIA`, `USE_ASCEND`, `USE_ILUVATAR_COREX`, `USE_CAMBRICON`, `USE_METAX`, `USE_MUSA`, `USE_KUNLUNXIN`, `USE_DU`, `USE_AMD`, `USE_TSM`, `USE_ENFLAME`. + +### Dependencies + +- MPI (for multi-process tests): OpenMPI or mpich +- libibverbs (for IBRC P2P adaptor) +- CUDA toolkit (for NVIDIA backend) +- NCCL >= 2.25 (for Device API vendor path; >= 2.28 for window mode) + +#### Installing mpich (required for multi-platform P2P Engine testing) + +```bash +wget https://www.mpich.org/static/downloads/4.2.3/mpich-4.2.3.tar.gz +tar xzvf mpich-4.2.3.tar.gz +cd mpich-4.2.3 && ./configure --prefix="$PWD/build" --with-device=ch3 --disable-fortran && make -j64 && make install +export MPI_HOME=$PWD/build +export LD_LIBRARY_PATH=$MPI_HOME/lib:$MPI_HOME:$LD_LIBRARY_PATH +``` + +### Multi-Platform Support + +| Feature | Platform Support | Current Test Status | +|---|---|---| +| P2P Engine | All supported backends (no vendor-specific adaptation required) | Tested on NVIDIA, Mthreads (USE_MUSA), Metax (USE_METAX), Hygon (USE_DU) | +| Device API CustomAllReduce | NVIDIA only | Tested on NVIDIA only; other vendors need to add compilation pipeline and kernel implementation | +| Device API IR Bindings | NVIDIA only | Tested on NVIDIA only; other vendors need to add compilation pipeline and kernel implementation | + +--- + +## Test Plan + +### P2P Engine Tests + +```bash +# Build (choose your backend) +make USE_NVIDIA=1 -j$(nproc) # or USE_MUSA=1, USE_METAX=1, USE_DU=1 + +cd test/perf/host_api +make USE_NVIDIA=1 # match your backend flag +cd build/bin +``` + +| Test | Command | Description | +|---|---|---| +| Perf test: P2P Engine (read/write) | `$MPI_HOME/bin/mpirun --genv FLAGCX_MEM_ENABLE=1 -np 2 ./perf_p2p_engine -b 4K -e 64M -f 2 -n 10` | One-sided RDMA GET/PUT bandwidth benchmark via P2P Engine RPC control-plane. Set `FLAGCX_P2P_PERF_OP=read\|write\|both` to select operation. | + +### Device API CustomAllReduce Tests + +```bash +# FlagCX must be built with COMPILE_KERNEL=1 (from project root) +make USE_NVIDIA=1 COMPILE_KERNEL=1 FORCE_DEFAULT_PATH=1 -j$(nproc) + +cd test/perf/device_api +make USE_NVIDIA=1 +cd build/bin +``` + +| Test | Command | Description | +|---|---|---| +| Perf test: AllReduce intranode | `mpirun --allow-run-as-root -np 8 -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./perf_allreduce_intranode -b 1M -e 64M -f 2 -R 1` | Sweeps message sizes, reports algBW/busBW, verifies AllReduce correctness | + +### Device API IR Bindings Tests + +```bash +# FlagCX must be built with COMPILE_KERNEL=1 (from project root) +make USE_NVIDIA=1 COMPILE_KERNEL=1 FORCE_DEFAULT_PATH=1 -j$(nproc) + +cd test/unittest/device_api +make USE_NVIDIA=1 FORCE_DEFAULT_PATH=1 -j$(nproc) +cd build/bin +``` + +| Test | Command | Description | +|---|---|---| +| IR bindings correctness | `mpirun --allow-run-as-root -np 8 -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./test_device_ir -b 1M -e 4M -f 2 -R 2` | Tests 8 kernel categories covering 69 IR wrapper functions (comm queries, cooperative group, team queries, local/intra pointers, barriers) | + +--- + +## Related PRs + +- [ ] flagos-ai/FlagCX#450 — [PAL] IBRC P2P adaptor for FlagCX P2P engine +- [ ] flagos-ai/FlagCX#452 — [CRL] Refactor P2P zerocopy +- [ ] flagos-ai/FlagCX#453 — [CRL] P2P topo manager +- [ ] flagos-ai/FlagCX#454 — [CRL] Using Device API for customAllReduce implementation +- [ ] flagos-ai/FlagCX#466 — [CRL] Add & implement P2P interface for integration with NIXL +- [ ] flagos-ai/FlagCX#433 — [PAL] Introduce traits abstraction and DeviceAPI for unified vendor/fallback support +- [ ] flagos-ai/FlagCX#445 — [PAL] Support Device API Transport +- [ ] flagos-ai/FlagCX#442 — [PAL] Add Device API DU support +- [ ] flagos-ai/FlagCX#447 — [CRL] Add Device API multi-FIFO support +- [ ] flagos-ai/FlagCX#471 — [CRL] Add Device API symmem and multicast support +- [ ] flagos-ai/FlagCX#474 — [Others] KV transfer benchmark +- [ ] flagos-ai/FlagCX#475 — [UIL] Support Device API IR Bindings + +--- + +## Implementation History + +- 2026-05-27: FEP created for FlagCX v0.13.0 (features under development) under `sig-network`. \ No newline at end of file