flagos-ai · MC952-arch · May 27, 2026 · Jun 9, 2026 · Jun 10, 2026 · Jun 10, 2026
diff --git a/fep/sig-network/flagcx-v0.13.0-new-features.md b/fep/sig-network/flagcx-v0.13.0-new-features.md
@@ -0,0 +1,265 @@
+# FEP(sig-network): Add FlagCX v0.13.0 New Features
+
+**Status:** `Provisional`
+
+**Created:** 2026-05-27
+
+**Owner:** @flagos-ai
+
+**SIG:** sig-network
+
+**Target Version:** FlagOS 2.1
+
+---
+
+## Summary
+
+Compared with v0.11.0 (commit `cceb96d`), FlagCX v0.13.0 (current main branch) introduces three significant feature areas:
+
+1. **P2P Engine** — a one-sided RDMA engine for point-to-point communication, enabling prefill-decode disaggregation in LLM inference scenarios and NIXL integration.
+2. **Device API-based CustomAllReduce** — an intra-node AllReduce collective implemented entirely through FlagCX Device API primitives, allowing custom kernels to perform AllReduce without host-side scheduling overhead.
+3. **Device API IR Bindings for Triton** — a set of C IR wrapper functions (compilable via LLVM bitcode) that expose FlagCX device-side communication primitives to Triton-generated kernels.
+
+Repository: https://github.com/flagos-ai/FlagCX
+
+---
+
+## Motivation
+
+### Goals
+
+- **P2P Engine:** Provide a hardware-abstracted, one-sided RDMA engine for the high-performance P2P communication patterns common in LLM inference, such as prefill-decode disaggregation. The FlagCX P2P engine is currently used as a vLLM KV transfer connector and is integrated as a NIXL backend.
+- **Device API CustomAllReduce:** Achieve low-latency intra-node AllReduce for small-to-medium message sizes using the Device API.
+- **Device API IR Bindings:** Enable Triton-compiled kernels to call the FlagCX Device API (rank queries, intra-node pointer access, barriers, etc.) via LLVM bitcode.
+
+---
+
+## Proposal
+
+### Feature 1: P2P Engine
+
+A standalone P2P engine (`FlagcxP2pEngine`) is introduced with a C++ API for one-sided RDMA and two-sided send/recv operations. The engine:
+
+- Creates and manages RDMA connections over IBRC (InfiniBand Reliable Connected) QPs.
+- Exposes vectorized read/write (`flagcxP2pEngineReadVector`, `flagcxP2pEngineWriteVector`) suitable for scatter-gather KV cache transfers.
+- Provides an out-of-band notification channel for completion signaling.
+- Integrates with FlagCX's existing topology manager (`flagcxP2pTopoManager`) to select the optimal NIC per GPU.
+
+Users of vLLM, NIXL, Mooncake or custom disaggregation frameworks can use the P2P engine as a low-level RDMA substrate. A patch for NIXL v1.1.0 integration (`plugin/nixl/flagcx_p2p_on_nixl_v1.1.0.patch`) is provided.
+
+### Feature 2: Device API-based CustomAllReduce
+
+`flagcxIntraAllReduce` is a kernel-based AllReduce that operates on a registered shared memory window (`flagcxDevMem_t`) using LSA or Multicast. The host-side setup consists of five steps:
+
+1. Allocate a symmetric buffer (`flagcxMemAlloc` for VMM/window mode, or `cudaMalloc` for IPC mode).
+2. Register it: `flagcxCommWindowRegister` (window mode) or `flagcxCommRegister` (IPC mode).
+3. Create device handles: `flagcxDevCommCreate` + `flagcxDevMemCreate`.
+4. Get device pointers: `flagcxDevCommGetDevicePtr` + `flagcxDevMemGetDevicePtr`.
+5. Call `flagcxIntraAllReduce(devMem, count, datatype, devComm, stream)` from the host.
+
+### Feature 3: Device API IR Bindings for Triton
+
+A set of `extern "C"` wrapper functions (declared in `flagcx_device_wrapper.h`, implemented in `flagcx_device_wrapper_impl.h`) expose the following categories of device-side primitives for LLVM bitcode compilation:
+
+| Category | Functions |
+|---|---|
+| Comm Queries | `flagcxDevCommGetRank`, `flagcxDevCommGetSize`, `flagcxDevCommGetIntraRank`, `flagcxDevCommGetIntraSize` |
+| Cooperative Group | `flagcxCoopAnyInitBlock`, `flagcxCoopThreadRankC`, `flagcxCoopSizeC`, `flagcxCoopSyncC` |
+| Team Queries | `flagcxGetTeamIntra`, `flagcxTeamRankToWorldC`, `flagcxTeamRankToIntraC` |
+| Local Pointer | `flagcxGetLocalPointerC` |
+| Intra Pointer (LSA) | `flagcxGetIntraPointerC` |
+| Data Type Size | `flagcxDataTypeSizeC` |
+| Intra Barrier | `flagcxIntraBarrierSessionInit`, `flagcxIntraBarrierSyncC` |
+| Intra Barrier Arrive/Wait | `flagcxIntraBarrierArriveC`, `flagcxIntraBarrierWaitC` |
+
+The `flagcx_kernel.h` umbrella header guards these with `#ifndef __clang_llvm_bitcode_lib__` so that Triton's bitcode path only includes the device-safe subset (`flagcx_kernel_core.h`).
+
+---
+
+## Design Details
+
+### P2P Engine Architecture
+
+```
+FlagcxP2pEngine
+  ├── IBRC adaptor (flagcxP2pDevCtx, ibv_pd per device)
+  ├── Accept thread (TCP handshake → QP setup)
+  ├── Notification thread (out-of-band completion signals)
+  └── MR registry (base VA → lkey/rkey mapping)
+
+FlagcxP2pConn
+  ├── flagcxP2pSendComm / flagcxP2pRecvComm (IB QP + CQ)
+  ├── flagcxP2pRequest ring (128 slots)
+  └── IPC handle cache (intra-node transfers)
+
+FlagcxP2pRdmaDesc (64 bytes)
+  ├── addr    : remote virtual address
+  ├── size    : transfer size
+  ├── rkey    : remote MR key
+  └── padding : reserved for bookkeeping
+```
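+
+The descriptor and MR registry above can be pictured with a small host-only sketch. The field widths, the struct name `RdmaDescSketch`, and the `MrEntry` / `MrRegistry` types are illustrative assumptions, not the FlagCX definitions; only the 64-byte total size and the base VA → lkey/rkey mapping come from the layout above.
+
+```cpp
+#include <cstdint>
+#include <map>
+#include <optional>
+
+// Hypothetical 64-byte descriptor; field widths are assumed, not taken from FlagCX.
+struct RdmaDescSketch {
+  uint64_t addr;        // remote virtual address
+  uint64_t size;        // transfer size in bytes
+  uint32_t rkey;        // remote MR key
+  uint8_t padding[44];  // reserved for bookkeeping
+};
+static_assert(sizeof(RdmaDescSketch) == 64, "descriptor must stay 64 bytes");
+
+// Hypothetical registry entry: length and keys of one registered region.
+struct MrEntry {
+  uint64_t length;
+  uint32_t lkey;
+  uint32_t rkey;
+};
+
+// Hypothetical registry keyed by the base virtual address of each region.
+class MrRegistry {
+ public:
+  void add(uint64_t base, MrEntry entry) { regions_[base] = entry; }
+
+  // Build a descriptor for [addr, addr + size) if one region fully covers it.
+  std::optional<RdmaDescSketch> describe(uint64_t addr, uint64_t size) const {
+    auto it = regions_.upper_bound(addr);  // first region starting after addr
+    if (it == regions_.begin()) return std::nullopt;
+    --it;  // candidate region whose base is <= addr
+    const uint64_t offset = addr - it->first;
+    if (offset > it->second.length || size > it->second.length - offset) {
+      return std::nullopt;  // range runs past the end of the region
+    }
+    RdmaDescSketch desc{};
+    desc.addr = addr;
+    desc.size = size;
+    desc.rkey = it->second.rkey;
+    return desc;
+  }
+
+ private:
+  std::map<uint64_t, MrEntry> regions_;
+};
+```
+
+An ordered map keeps the lookup logarithmic in the number of registered regions, and the bounds check rejects any transfer that would span past a single registration.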
+
+Connection setup follows a TCP-based handshake where both sides exchange QP numbers, GIDs, and MTU via `flagcxP2pConnMeta`. The topology manager (`flagcxP2pTopoInit`) enumerates local GPUs and NICs, builds a node-scoped topology graph, and selects the best NIC for each GPU via `flagcxP2pTopoGetNetDev`.
+
+### Device API CustomAllReduce Data Flow
+
+```
+Host Setup:
+  flagcxMemAlloc(regBuff)
+  flagcxCommWindowRegister(comm, regBuff, size, &win, FLAGCX_WIN_COLL_SYMMETRIC)
+  flagcxDevCommCreate(comm, &reqs, &devComm)   // reqs.intraBarrierCount = CTA_COUNT
+  flagcxDevMemCreate(comm, regBuff, size, win, &devMem)
+
+Kernel Execution:
+  flagcxIntraAllReduce(devMem, count, flagcxFloat, devComm, stream)
+    └── Device kernel:
+        1. Each CTA reads local data from regBuff
+        2. Reads peer data via flagcxGetIntraPointerC(devMem, offset, peer)
+        3. Performs reduction (sum)
+        4. Writes result back to regBuff
+        5. Synchronizes via flagcxIntraBarrier
+```
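+
+The kernel steps listed above can be modeled on the CPU to make the data flow concrete. The sketch below is a plain C++ simulation, not FlagCX code: `simulateIntraAllReduce` is a hypothetical name, and each inner vector stands in for one rank's registered buffer. The staging vector keeps peer reads and write-back from interleaving, which on the GPU is the job of synchronization such as the intra-node barrier.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// CPU-only model: buffers[r] stands in for rank r's registered buffer.
+// All buffers are assumed to have the same length.
+void simulateIntraAllReduce(std::vector<std::vector<float>>& buffers) {
+  if (buffers.empty()) return;
+  const std::size_t count = buffers[0].size();
+  std::vector<float> reduced(count, 0.0f);
+  // Read the local element and every peer's element, then sum them.
+  for (const auto& peer : buffers) {
+    for (std::size_t i = 0; i < count; ++i) reduced[i] += peer[i];
+  }
+  // Write the result back to every rank once all reads are done.
+  for (auto& local : buffers) local = reduced;
+}
+```
+
+With three ranks holding `{1, 2}`, `{3, 4}` and `{5, 6}`, every buffer ends up as `{9, 12}`.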
+
+Two registration modes are supported:
+- **Window mode** (`-R 2`): Uses `flagcxCommWindowRegister` + VMM-allocated memory. Preferred for NCCL >= 2.28.
+- **IPC mode** (`-R 1`): Uses `flagcxCommRegister` + `cudaMalloc` memory. Compatible with all NCCL versions.
+
+### Device API IR Bindings Architecture
+
+```
+Triton Kernel (.py)
+  → Triton IR → LLVM IR
+  → Links flagcx_device_wrapper bitcode (.bc)
+  → Final PTX/CUBIN
+
+flagcx_device_wrapper.h   (extern "C" declarations, bitcode-safe)
+flagcx_device_wrapper_impl.h  (inline implementations using adaptor)
+flagcx_kernel_core.h      (device-side types: flagcxDevComm, flagcxDevMem, etc.)
+```
+
+The IR functions operate on opaque `devCommPtr` and `devMemPtr` pointers obtained from the host-side `flagcxDevCommGetDevicePtr` / `flagcxDevMemGetDevicePtr` APIs. This allows Triton kernels to:
+- Query communicator topology (rank, size, intra-rank).
+- Access peer memory directly via LSA pointers.
+- Synchronize across intra-node ranks using barriers.
+- Perform cooperative group operations within a CTA.
+
+---
+
+## Packaging
+
+### Obtain Source Code
+
+```bash
+git clone https://github.com/flagos-ai/FlagCX.git
+cd FlagCX
+git submodule update --init --recursive
+```
+
+### Build
+
+```bash
+# Build FlagCX core library (choose your backend)
+make <backend>=1 -j$(nproc)
+
+# Build with Device API kernel support (required for CustomAllReduce)
+make USE_NVIDIA=1 COMPILE_KERNEL=1 -j$(nproc)
+```
+
+Where `<backend>` is one of: `USE_NVIDIA`, `USE_ASCEND`, `USE_ILUVATAR_COREX`, `USE_CAMBRICON`, `USE_METAX`, `USE_MUSA`, `USE_KUNLUNXIN`, `USE_DU`, `USE_AMD`, `USE_TSM`, `USE_ENFLAME`.
+
+### Dependencies
+
+- MPI (for multi-process tests): OpenMPI or mpich
+- libibverbs (for IBRC P2P adaptor)
+- CUDA toolkit (for NVIDIA backend)
+- NCCL >= 2.25 (for Device API vendor path; >= 2.28 for window mode)
+
+#### Installing mpich (required for multi-platform P2P Engine testing)
+
+```bash
+wget https://www.mpich.org/static/downloads/4.2.3/mpich-4.2.3.tar.gz
+tar xzvf mpich-4.2.3.tar.gz
+cd mpich-4.2.3 && ./configure --prefix="$PWD/build" --with-device=ch3 --disable-fortran && make -j64 && make install
+export MPI_HOME=$PWD/build
+export LD_LIBRARY_PATH=$MPI_HOME/lib:$MPI_HOME:$LD_LIBRARY_PATH
+```
+
+### Multi-Platform Support
+
+| Feature | Platform Support | Current Test Status |
+|---|---|---|
+| P2P Engine | All supported backends (no vendor-specific adaptation required) | Tested on NVIDIA, Mthreads (USE_MUSA), Metax (USE_METAX), Hygon (USE_DU) |
+| Device API CustomAllReduce | NVIDIA only | Tested on NVIDIA only; other vendors need to add compilation pipeline and kernel implementation |
+| Device API IR Bindings | NVIDIA only | Tested on NVIDIA only; other vendors need to add compilation pipeline and kernel implementation |
+
+---
+
+## Test Plan
+
+### P2P Engine Tests
+
+```bash
+# Build (choose your backend)
+make USE_NVIDIA=1 -j$(nproc)   # or USE_MUSA=1, USE_METAX=1, USE_DU=1
+
+cd test/perf/host_api
+make USE_NVIDIA=1              # match your backend flag
+cd build/bin
+```
+
+| Test | Command | Description |
+|---|---|---|
+| Perf test: P2P Engine (read/write) | `$MPI_HOME/bin/mpirun --genv FLAGCX_MEM_ENABLE=1 -np 2 ./perf_p2p_engine -b 4K -e 64M -f 2 -n 10` | One-sided RDMA GET/PUT bandwidth benchmark via P2P Engine RPC control-plane. Set `FLAGCX_P2P_PERF_OP=read\|write\|both` to select operation. |
+
+### Device API CustomAllReduce Tests
+
+```bash
+# FlagCX must be built with COMPILE_KERNEL=1 (from project root)
+make USE_NVIDIA=1 COMPILE_KERNEL=1 FORCE_DEFAULT_PATH=1 -j$(nproc)
+
+cd test/perf/device_api
+make USE_NVIDIA=1
+cd build/bin
+```
+
+| Test | Command | Description |
+|---|---|---|
+| Perf test: AllReduce intranode | `mpirun --allow-run-as-root -np 8 -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./perf_allreduce_intranode -b 1M -e 64M -f 2 -R 1` | Sweeps message sizes, reports algBW/busBW, verifies AllReduce correctness |
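+
+When reading the reported numbers, it can help to see how the two bandwidth figures relate. The sketch below assumes the benchmark follows the nccl-tests convention for AllReduce, where busBW = algBW × 2(n − 1)/n for n ranks. Whether `perf_allreduce_intranode` uses exactly this formula is an assumption, and both function names are hypothetical.
+
+```cpp
+#include <cstddef>
+
+// Algorithm bandwidth in GB/s: message size divided by elapsed time.
+double algBandwidthGBs(std::size_t bytes, double seconds) {
+  return static_cast<double>(bytes) / seconds / 1e9;
+}
+
+// Bus bandwidth under the assumed nccl-tests AllReduce convention.
+double busBandwidthGBs(double algBW, int ranks) {
+  return algBW * 2.0 * (ranks - 1) / ranks;
+}
+```
+
+Under that assumption, the 8-rank run above scales algBW by 2 × 7 / 8 = 1.75 to obtain busBW.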
+
+### Device API IR Bindings Tests
+
+```bash
+# FlagCX must be built with COMPILE_KERNEL=1 (from project root)
+make USE_NVIDIA=1 COMPILE_KERNEL=1 FORCE_DEFAULT_PATH=1 -j$(nproc)
+
+cd test/unittest/device_api
+make USE_NVIDIA=1 FORCE_DEFAULT_PATH=1 -j$(nproc)
+cd build/bin
+```
+
+| Test | Command | Description |
+|---|---|---|
+| IR bindings correctness | `mpirun --allow-run-as-root -np 8 -x FLAGCX_USE_HETERO_COMM=1 -x FLAGCX_MEM_ENABLE=1 -x FLAGCX_VMM_ENABLE=0 -x FLAGCX_P2P_DISABLE=1 ./test_device_ir -b 1M -e 4M -f 2 -R 2` | Tests 8 kernel categories covering 69 IR wrapper functions (comm queries, cooperative group, team queries, local/intra pointers, barriers) |
+
+---
+
+## Related PRs
+
+- [ ] flagos-ai/FlagCX#450 — [PAL] IBRC P2P adaptor for FlagCX P2P engine
+- [ ] flagos-ai/FlagCX#452 — [CRL] Refactor P2P zerocopy
+- [ ] flagos-ai/FlagCX#453 — [CRL] P2P topo manager
+- [ ] flagos-ai/FlagCX#454 — [CRL] Using Device API for customAllReduce implementation
+- [ ] flagos-ai/FlagCX#466 — [CRL] Add & implement P2P interface for integration with NIXL
+- [ ] flagos-ai/FlagCX#433 — [PAL] Introduce traits abstraction and DeviceAPI for unified vendor/fallback support
+- [ ] flagos-ai/FlagCX#445 — [PAL] Support Device API Transport
+- [ ] flagos-ai/FlagCX#442 — [PAL] Add Device API DU support
+- [ ] flagos-ai/FlagCX#447 — [CRL] Add Device API multi-FIFO support
+- [ ] flagos-ai/FlagCX#471 — [CRL] Add Device API symmem and multicast support
+- [ ] flagos-ai/FlagCX#474 — [Others] KV transfer benchmark
+- [ ] flagos-ai/FlagCX#475 — [UIL] Support Device API IR Bindings
+
+---
+
+## Implementation History
+
+- 2026-05-27: FEP created for FlagCX v0.13.0 (features under development) under `sig-network`.