fix: use 64-bit mega moe sf offsets by netaddi · Pull Request #331 · deepseek-ai/DeepGEMM

netaddi · 2026-05-10T08:58:50Z

Summary

Fix SM100 Mega MoE scale-factor pool address arithmetic for large
num_max_tokens_per_rank.

This PR:

uses 64-bit arithmetic for SM100 Mega MoE SF-pool offsets;
adds a manual SM100 capacity regression test;
excludes the heavy multi-process regression from sanitizer auto-discovery.

Bug Discovery

While validating Mega MoE in a serving-like SM100 setup, the output changed when
only the symmetric-memory buffer capacity changed.

The reproducer uses identical live tokens, inputs, routing, and weights, then
runs:

small capacity: num_max_tokens_per_rank=32
large capacity: num_max_tokens_per_rank=200000

The capacity should only affect allocation size. It should not affect the
mathematical output for the same live tokens.

Root Cause

In sm100_fp8_fp4_mega_moe_impl, some SF-pool offsets were computed with
32-bit intermediate values.

For large num_max_tokens_per_rank, kNumPaddedSFPoolTokens becomes large
enough that expressions like:

j * kNumPaddedSFPoolTokens + sf_pool_token_idx

and:

k_uint_idx * mn_stride + sf_pool_token_idx * sizeof(uint32_t) + byte_idx

can overflow 32-bit arithmetic before being used as addresses.

That makes the kernel write or read FP8/FP4 scale factors at the wrong SF-pool
locations, so the subsequent GEMMs consume incorrect scales and produce wrong
outputs.
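The wraparound in the first expression can be reproduced outside the kernel. The sketch below is not the kernel code: it emulates unsigned 32-bit intermediates by masking, which is an assumption about the intermediate type (the PR only says "32-bit"). The function names and numeric values are hypothetical, chosen only so that the product exceeds 2**32.

```python
U32_MASK = 0xFFFFFFFF          # keeps the low 32 bits (unsigned wraparound)
U64_MASK = 0xFFFFFFFFFFFFFFFF  # keeps the low 64 bits


def sf_pool_index_32(j, num_padded_sf_pool_tokens, sf_pool_token_idx):
    """j * N + idx with every intermediate truncated to 32 bits."""
    product = (j * num_padded_sf_pool_tokens) & U32_MASK
    return (product + sf_pool_token_idx) & U32_MASK


def sf_pool_index_64(j, num_padded_sf_pool_tokens, sf_pool_token_idx):
    """The same expression evaluated with 64-bit intermediates."""
    return (j * num_padded_sf_pool_tokens + sf_pool_token_idx) & U64_MASK


# Hypothetical values: 100 * 50_000_000 = 5_000_000_000 > 2**32.
j, n, idx = 100, 50_000_000, 123
wrapped = sf_pool_index_32(j, n, idx)
exact = sf_pool_index_64(j, n, idx)
print(f"32-bit: {wrapped}, 64-bit: {exact}, differ: {wrapped != exact}")
```

With these inputs the 32-bit result lands roughly 4.29 billion elements below the intended index, which is the kind of misdirected access described above.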

Fix

Promote the SF-pool offset calculations to uint64_t in both affected paths:

dispatch-side local SF pool write;
L2 SF buffer address calculation.

This keeps the indexing valid for large Mega MoE capacities while preserving the
existing layout and kernel behavior for normal capacities.
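A small model of the second expression illustrates why widening is safe for normal capacities: when no intermediate exceeds 32 bits, the 32-bit and 64-bit evaluations agree, and they diverge only once the stride term overflows. This is a sketch under assumptions, not the kernel code; `sf_byte_offset`, its `bits` parameter, and all numeric values are hypothetical.

```python
SIZEOF_U32 = 4  # sizeof(uint32_t) in bytes


def sf_byte_offset(k_uint_idx, mn_stride, sf_pool_token_idx, byte_idx, bits):
    """k_uint_idx * mn_stride + sf_pool_token_idx * sizeof(uint32_t) + byte_idx,
    with every intermediate truncated to `bits` bits (unsigned wraparound)."""
    mask = (1 << bits) - 1
    stride_term = (k_uint_idx * mn_stride) & mask
    token_term = (sf_pool_token_idx * SIZEOF_U32) & mask
    return (stride_term + token_term + byte_idx) & mask


# Small (hypothetical) stride: both widths give the same byte offset.
small_32 = sf_byte_offset(3, 1024, 10, 2, bits=32)
small_64 = sf_byte_offset(3, 1024, 10, 2, bits=64)

# Large (hypothetical) stride: 8 * 800_000_000 no longer fits in 32 bits.
large_32 = sf_byte_offset(8, 800_000_000, 10, 2, bits=32)
large_64 = sf_byte_offset(8, 800_000_000, 10, 2, bits=64)

print(small_32 == small_64, large_32 == large_64)
```

In C++ or CUDA the equivalent change is to make at least one operand of each multiplication a 64-bit value before the multiply happens; casting only the final result would still leave the overflow in place.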

Validation

Syntax / diff checks:

python3 -m py_compile tests/test_mega_moe_capacity.py tests/test_sanitizer.py
git diff --check

Manual SM100 regression test:

python tests/test_mega_moe_capacity.py \
--num-processes 8 \
--small-capacity 32 \
--large-capacity 200000

Expected result after the fix:

all ranks produce matching small/large output hashes;
max_abs_diff == 0.0;
mean_abs_diff == 0.0.
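The three pass criteria can be expressed as a small comparison helper. This is an illustrative sketch only and is not taken from tests/test_mega_moe_capacity.py: `compare_outputs` is a hypothetical name, the outputs are modelled as flat lists of floats, and SHA-256 over float32 bytes stands in for whatever hash the real test uses.

```python
import hashlib
import struct


def compare_outputs(small_out, large_out):
    """Compare two flat output vectors from the small- and large-capacity runs."""
    if len(small_out) != len(large_out):
        raise ValueError("outputs must have the same number of elements")

    def digest(values):
        # Hash the little-endian float32 byte representation of the values.
        packed = struct.pack(f"<{len(values)}f", *values)
        return hashlib.sha256(packed).hexdigest()

    diffs = [abs(a - b) for a, b in zip(small_out, large_out)]
    return {
        "hash_match": digest(small_out) == digest(large_out),
        "max_abs_diff": max(diffs, default=0.0),
        "mean_abs_diff": sum(diffs) / len(diffs) if diffs else 0.0,
    }


# Identical outputs satisfy all three criteria.
reference = [0.5, -1.25, 3.0]
print(compare_outputs(reference, list(reference)))
```

Requiring an exact match rather than a tolerance fits the claim made earlier in the description: capacity should only change allocation size, so the same live tokens should produce identical outputs.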

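When num_max_tokens_per_rank is large, kNumPaddedSFPoolTokens causes 32-bit integer overflow in the SF-pool offset calculations, so the kernel reads and writes scale factors at the wrong addresses and silently produces incorrect results. Promote the offsets in the dispatch-side SF write and the L2 SF buffer address calculation to uint64_t. Reference: deepseek-ai#331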

fix: use 64-bit mega moe sf offsets

b0ba0cb

netaddi wants to merge 1 commit into deepseek-ai:main from netaddi:fix/mega-moe-sf-offset-u64-pr
