
fix: use 64-bit mega moe sf offsets #331

Open
netaddi wants to merge 1 commit into deepseek-ai:main from netaddi:fix/mega-moe-sf-offset-u64-pr

netaddi commented May 10, 2026


Summary

Fix SM100 Mega MoE scale-factor pool address arithmetic for large
num_max_tokens_per_rank.

This PR:

  • uses 64-bit arithmetic for SM100 Mega MoE SF-pool offsets;
  • adds a manual SM100 capacity regression test;
  • excludes the heavy multi-process regression from sanitizer auto-discovery.

Bug Discovery

While validating Mega MoE in a serving-like SM100 setup, the output changed when
only the symmetric-memory buffer capacity changed.

The reproducer uses identical live tokens, inputs, routing, and weights, then
runs:

  • small capacity: num_max_tokens_per_rank=32
  • large capacity: num_max_tokens_per_rank=200000

The capacity should only affect allocation size. It should not affect the
mathematical output for the same live tokens.

Root Cause

In sm100_fp8_fp4_mega_moe_impl, some SF-pool offsets were computed with
32-bit intermediate values.

For large num_max_tokens_per_rank, kNumPaddedSFPoolTokens becomes large
enough that expressions like:

j * kNumPaddedSFPoolTokens + sf_pool_token_idx

and:

k_uint_idx * mn_stride + sf_pool_token_idx * sizeof(uint32_t) + byte_idx

can exceed the 32-bit range and wrap before the result is widened and used as an
address.

That makes the kernel write or read FP8/FP4 scale factors at the wrong SF-pool
locations, so the subsequent GEMMs consume incorrect scales and produce wrong
outputs.
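
The wraparound described above can be illustrated outside the kernel. The following is a minimal sketch in plain Python that emulates 32-bit unsigned arithmetic with a mask. The function names and every numeric value are hypothetical and chosen only to cross the 2**32 boundary; they are not taken from sm100_fp8_fp4_mega_moe_impl, and the kernel's actual operand types are not stated in this PR.

```python
# Illustrative only: emulates 32-bit unsigned wraparound in plain Python.
# All names and numeric values here are hypothetical, not taken from the kernel.
U32_MASK = 0xFFFFFFFF
U64_MASK = 0xFFFFFFFFFFFFFFFF


def offset_u32(j, num_padded_sf_pool_tokens, sf_pool_token_idx):
    """j * kNumPaddedSFPoolTokens + sf_pool_token_idx with 32-bit intermediates."""
    product = (j * num_padded_sf_pool_tokens) & U32_MASK  # wraps modulo 2**32
    return (product + sf_pool_token_idx) & U32_MASK


def offset_u64(j, num_padded_sf_pool_tokens, sf_pool_token_idx):
    """The same expression with 64-bit intermediates."""
    return (j * num_padded_sf_pool_tokens + sf_pool_token_idx) & U64_MASK


# Small pool: both widths agree.
small = (3, 1_024, 7)
# Large pool (hypothetical size): 3 * 1_500_000_000 exceeds 2**32.
large = (3, 1_500_000_000, 7)

print(offset_u32(*small), offset_u64(*small))
print(offset_u32(*large), offset_u64(*large))
```

For the large case the 32-bit result differs from the 64-bit one by exactly 2**32, so it still looks like a plausible offset. That is consistent with the symptom reported here: wrong scale factors rather than an obvious crash.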

Fix

Promote the SF-pool offset calculations to uint64_t in both affected paths:

  • dispatch-side local SF pool write;
  • L2 SF buffer address calculation.

This keeps the indexing valid for large Mega MoE capacities while preserving the
existing layout and kernel behavior for normal capacities.
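
One detail worth spelling out is that the promotion has to apply to the operands, not to the finished result. The sketch below emulates both orderings for the byte-offset expression in plain Python. The helper names and input values are hypothetical, and the constant 4 stands in for sizeof(uint32_t); this is an illustration of the general pitfall, not a copy of the patch.

```python
# Illustrative only: compares "compute in 32 bits, then widen" with
# "widen first, then compute". Names and values are hypothetical.
U32_MASK = 0xFFFFFFFF
U64_MASK = 0xFFFFFFFFFFFFFFFF
SIZEOF_UINT32 = 4  # stands in for sizeof(uint32_t)


def widen_after(k_uint_idx, mn_stride, sf_pool_token_idx, byte_idx):
    """32-bit intermediates; widening the result cannot undo the wrap."""
    total = (k_uint_idx * mn_stride) & U32_MASK
    total = (total + ((sf_pool_token_idx * SIZEOF_UINT32) & U32_MASK)) & U32_MASK
    total = (total + byte_idx) & U32_MASK
    return total & U64_MASK  # the "cast" happens too late


def widen_before(k_uint_idx, mn_stride, sf_pool_token_idx, byte_idx):
    """64-bit intermediates throughout."""
    total = k_uint_idx * mn_stride + sf_pool_token_idx * SIZEOF_UINT32 + byte_idx
    return total & U64_MASK


args = (8, 800_000_000, 5, 1)  # hypothetical: 8 * 800_000_000 exceeds 2**32
print(widen_after(*args), widen_before(*args))
```

For inputs that stay below 2**32 the two functions return the same value, which matches the statement that behavior for normal capacities is preserved.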

Validation

Syntax / diff checks:

python3 -m py_compile tests/test_mega_moe_capacity.py tests/test_sanitizer.py
git diff --check

Manual SM100 regression test:

python tests/test_mega_moe_capacity.py \
--num-processes 8 \
--small-capacity 32 \
--large-capacity 200000

Expected result after the fix:

  • all ranks produce matching small/large output hashes;
  • max_abs_diff == 0.0;
  • mean_abs_diff == 0.0.
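
As a rough picture of what these three checks mean, here is a minimal sketch that compares two output vectors. It is not the code in tests/test_mega_moe_capacity.py: the helper names, the choice of SHA-256 over a float32 encoding, and the sample values are all assumptions made for illustration.

```python
# Illustrative only: one possible way to compare small- and large-capacity
# outputs. The hashing scheme and helper names are assumptions.
import hashlib
import struct


def output_hash(values):
    """SHA-256 over the little-endian float32 encoding of the outputs."""
    packed = struct.pack(f"<{len(values)}f", *values)
    return hashlib.sha256(packed).hexdigest()


def abs_diff_stats(a, b):
    """Return (max_abs_diff, mean_abs_diff) for two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    if not a:
        return 0.0, 0.0
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return max(diffs), sum(diffs) / len(diffs)


# Hypothetical outputs for the same live tokens at two capacities.
small_capacity_out = [0.5, -1.25, 3.0]
large_capacity_out = [0.5, -1.25, 3.0]

max_abs_diff, mean_abs_diff = abs_diff_stats(small_capacity_out, large_capacity_out)
print(output_hash(small_capacity_out) == output_hash(large_capacity_out))
print(max_abs_diff, mean_abs_diff)
```

Requiring exact equality rather than a tolerance fits the claim in this PR that capacity should only affect allocation size, not the result for the same live tokens.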

zhijiehou pushed a commit to zhijiehou/DeepGEMM that referenced this pull request May 19, 2026
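When num_max_tokens_per_rank is large, kNumPaddedSFPoolTokens causes 32-bit
integer overflow in the SF-pool offset calculations, so the kernel reads and
writes scale factors at the wrong addresses and silently produces incorrect
results.

Promote the offsets in the dispatch-side SF write and in the L2 SF buffer
address calculation to uint64_t.

Reference: deepseek-ai#331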