Add SM90 FP8 MegaMoE split kernel #352
Optimize SM90 MegaMoE split path
Optimize MegaMoE split path
Hi authors, thanks for your great work! Have you compared your split version against the original one in #323? If so, could you please explain where the performance gain comes from? Thx
Does MegaMoE currently support SM120, or is support still under development?
No plan to support SM120.
Here are the benchmark results tested on an 8x H800 setup:
💻 Environment & Configurations
📊 End-to-End Latency Comparison (μs)
The baseline seems a bit different from the one used on SM100; it looks like it isn't using DeepEP-v2? For reference, here are my test results on H200 (8 ranks,
co-author: @jychen21
Summary
What's new
Generic touches
Benchmark — H20-3e, 8 ranks, hidden=7168, ih=2048, num_experts=256, topk=8
Values for M ≤ 128 are 10-run medians; values for M ≥ 256 come from the full sweep. Medians are used for M ≤ 128 because at that scale per-call time is only 700–950 μs, so a single measurement is dominated by routing-draw noise, idle-SM under-utilisation, and roughly M-independent launch/barrier overhead that doesn't amortise away.
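To illustrate why a median is the more robust summary at this scale, here is a minimal sketch of a 10-run median measurement. It is not the benchmark harness used in this PR: `time_once_us`, `median_latency_us`, and `summarize` are hypothetical helper names, the timing is plain host-side wall clock, and the sample values in the usage note are made up purely for illustration. For a GPU kernel, the callable passed in would also need to synchronize the device before returning (an assumption not shown here).

```python
import statistics
import time
from typing import Callable, Dict, List


def time_once_us(fn: Callable[[], None]) -> float:
    """Wall-clock duration of a single call to fn, in microseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1e6


def median_latency_us(fn: Callable[[], None], runs: int = 10, warmup: int = 3) -> float:
    """Median latency over `runs` timed calls, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not recorded
    samples = [time_once_us(fn) for _ in range(runs)]
    return statistics.median(samples)


def summarize(samples_us: List[float]) -> Dict[str, float]:
    """Median next to mean/min/max, to show how outliers shift the mean."""
    return {
        "median": statistics.median(samples_us),
        "mean": statistics.fmean(samples_us),
        "min": min(samples_us),
        "max": max(samples_us),
    }


if __name__ == "__main__":
    # Illustrative, made-up samples (μs): eight quiet runs plus two noisy ones.
    fake = [698.0, 702.0, 705.0, 709.0, 712.0, 715.0, 720.0, 731.0, 950.0, 1480.0]
    print(summarize(fake))
```

With the made-up samples above, the two noisy runs pull the mean well above the typical run, while the median stays inside the cluster of quiet runs. That is the property that makes a median preferable when per-call time is under a millisecond.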