fix: use 64-bit mega moe sf offsets #331
Open
netaddi wants to merge 1 commit into
zhijiehou pushed a commit to zhijiehou/DeepGEMM that referenced this pull request on May 19, 2026
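When num_max_tokens_per_rank is large, kNumPaddedSFPoolTokens causes a 32-bit integer overflow in the SF-pool offset calculation, so the kernel reads and writes scale factors at the wrong addresses and silently produces incorrect results. Promote the offsets used in the dispatch-side SF write and in the L2 SF buffer address calculation to uint64_t. Reference: deepseek-ai#331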
Summary
Fix SM100 Mega MoE scale-factor pool address arithmetic for large
num_max_tokens_per_rank.
This PR:
Bug Discovery
While validating Mega MoE in a serving-like SM100 setup, the output changed when
only the symmetric-memory buffer capacity changed.
The reproducer uses identical live tokens, inputs, routing, and weights, then
runs:
num_max_tokens_per_rank=32
num_max_tokens_per_rank=200000
The capacity should only affect allocation size. It should not affect the
mathematical output for the same live tokens.
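The invariant above (same live tokens, same output, regardless of capacity) can be checked on the host by comparing the two runs element by element. The following is a minimal sketch, not the PR's reproducer: `count_mismatches` is a hypothetical helper, and it assumes the outputs for the live tokens have already been copied into two host vectors of equal length.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of DeepGEMM): counts elements of `a` and `b`
// that differ by more than `tol`. A NaN on either side counts as a mismatch,
// because the comparison is written as !(diff <= tol).
inline std::size_t count_mismatches(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    float tol = 0.0f) {
    // A length difference means the two runs are not comparable at all;
    // report every element of the longer vector as mismatched.
    if (a.size() != b.size()) return std::max(a.size(), b.size());

    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (!(std::fabs(a[i] - b[i]) <= tol)) ++mismatches;
    }
    return mismatches;
}
```

With `tol = 0.0f` this is an exact comparison, which is only appropriate if the kernel is deterministic for fixed inputs; that is an assumption here, and a small tolerance may be more suitable otherwise.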
Root Cause
In sm100_fp8_fp4_mega_moe_impl, some SF-pool offsets were computed with 32-bit intermediate values.
For large num_max_tokens_per_rank, kNumPaddedSFPoolTokens becomes large enough that expressions like:
j * kNumPaddedSFPoolTokens + sf_pool_token_idx
and:
can overflow 32-bit arithmetic before being used as addresses.
That makes the kernel write or read FP8/FP4 scale factors at the wrong SF-pool
locations, so the subsequent GEMMs consume incorrect scales and produce wrong
outputs.
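To make the wraparound concrete, here is a small host-side sketch of the same expression shape evaluated in 32 bits. The function names and the numeric values are illustrative assumptions; the actual value of kNumPaddedSFPoolTokens depends on how the kernel derives it from num_max_tokens_per_rank, which is not shown here.

```cpp
#include <cstdint>
#include <limits>

// Illustrative only: the same shape as
//   j * kNumPaddedSFPoolTokens + sf_pool_token_idx
// evaluated entirely in 32-bit unsigned arithmetic. Unsigned overflow wraps
// modulo 2^32, so the result can silently point at a different SF slot.
inline uint32_t sf_offset_wrapping(uint32_t j, uint32_t padded_tokens,
                                   uint32_t token_idx) {
    return j * padded_tokens + token_idx;
}

// Hypothetical guard: evaluates the offset in 64 bits and reports whether it
// would still fit in a 32-bit index.
inline bool sf_offset_fits_u32(uint32_t j, uint32_t padded_tokens,
                               uint32_t token_idx) {
    const uint64_t wide = static_cast<uint64_t>(j) * padded_tokens + token_idx;
    return wide <= std::numeric_limits<uint32_t>::max();
}
```

For example, with `j = 4`, `padded_tokens = 1u << 30`, and `token_idx = 5`, the true offset is 2^32 + 5, but `sf_offset_wrapping` returns 5, which is a valid-looking index into an unrelated part of the pool.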
Fix
Promote the SF-pool offset calculations to uint64_t in both affected paths:
This keeps the indexing valid for large Mega MoE capacities while preserving the
existing layout and kernel behavior for normal capacities.
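A sketch of what the promotion looks like in isolation, under the assumption that the operands themselves remain 32-bit values. The names below are hypothetical and this is not the PR's diff. It also shows a common pitfall: assigning a 32-bit expression to a uint64_t does not help, because the multiply has already wrapped before the conversion.

```cpp
#include <cstdint>

// Pitfall: the product is still computed in 32 bits and only then widened,
// so the stored value has already wrapped.
inline uint64_t sf_offset_widened_too_late(uint32_t j, uint32_t padded_tokens,
                                           uint32_t token_idx) {
    const uint64_t offset = j * padded_tokens + token_idx;
    return offset;
}

// Promotion: widening one operand first makes the whole expression evaluate
// in 64 bits, so large pools index correctly.
inline uint64_t sf_offset_promoted(uint32_t j, uint32_t padded_tokens,
                                   uint32_t token_idx) {
    return static_cast<uint64_t>(j) * padded_tokens + token_idx;
}
```

For offsets that fit in 32 bits the two functions return the same value, which is consistent with the layout and behavior being unchanged at normal capacities.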
Validation
Syntax / diff checks:
Manual SM100 regression test:
Expected result after the fix: