Skip to content

CUDA: make DeepSeek-V4-Pro correct on the indexed-attention path (top_k 512→1024) + enable decode LUT gate for in_dim>4096#478

Open
slackarea wants to merge 2 commits into
antirez:mainfrom
vcnngr:pro-cuda-fixes
Open

CUDA: make DeepSeek-V4-Pro correct on the indexed-attention path (top_k 512→1024) + enable decode LUT gate for in_dim>4096#478
slackarea wants to merge 2 commits into
antirez:mainfrom
vcnngr:pro-cuda-fixes

Conversation

@slackarea

Copy link
Copy Markdown

What

Two surgical CUDA fixes that make DeepSeek-V4-Pro run correctly and a bit
faster on the CUDA backend. Both are gated so Flash is byte-identical (its
top_k == 512 keeps its exact prior path).

1. Generalize the CUDA DSA indexer top_k 512 → 1024

PRO uses indexer top_k = 1024 (Flash = 512). The CUDA indexed-attention path was
hardcoded for 512, so PRO emitted garbage on --cuda:

  • ds4_gpu_attention_indexed_mixed_batch_heads_tensor() hard-rejected top_k > 512
    (if (top_k > 512u) return 0;), no-oping the whole indexed attention for PRO;
  • the single + online indexed kernels capped candidate rows at 512 via
    comp_rows[512] / scores[768] / comp_count clamps, truncating PRO's 1024
    selected compressed rows → wrong sparse attention;
  • ds4_gpu_indexer_topk_tensor() gated its fast parallel top-k kernels behind
    top_k == 512u, forcing PRO onto the O(n_comp·top_k) single-thread fallback.

Fix (Flash-safe): widen comp_rows[512]→[1024], scores[768]→[1280], the caps,
raise the dispatch wall to top_k <= 1024u, force the streaming online heads8 kernel
for top_k > 512, and relax the 5 indexer_topk selection gates to <= 1024u (those
parallel kernels are already generic in top_k <= SORT_N).

2. Enable the decode LUT iq2_xxs gate kernel for expert_in_dim > 4096

moe_gate_up_mid_decode_lut_qwarp32_kernel staged the quantized activation in a fixed
__shared__ sxq[16], gated xq_blocks <= 16. PRO's experts have expert_in_dim = 7168
xq_blocks = 28 > 16, so PRO fell off this fast path onto the slower global-memory
gate. Widen to sxq[32] / <= 32u. Inert for Flash/GLM (xq_blocks <= 16 run the
identical staged path).

Validation (2× H200, PRO IQ2XXS, --ssd-streaming)

  • before fix Fix spelling typos in README #1: control-char garbage on every prompt;
  • after: coherent output ("The capital of France is" → "Paris."; think mode reasons
    then answers);
  • fix Fix typo #2: MoE gateup 1.03 → 0.179 ms/layer (5.7×); decode gen 1.74 → 1.84 t/s (+5.7%);
  • Flash IQ2XXS regression byte-stable: prefill 97.3 / gen 40.9 t/s, output unchanged.

🤖 Generated with Claude Code

slackarea and others added 2 commits June 29, 2026 21:09
DeepSeek-V4-PRO uses indexer top_k=1024 (Flash=512). The CUDA indexed
attention path was hardcoded for 512, so PRO emitted garbage on --cuda:

  - ds4_gpu_attention_indexed_mixed_batch_heads_tensor() hard-rejected
    top_k>512 (`if (top_k > 512u) return 0;`), so the whole indexed
    attention no-oped for PRO.
  - attention_indexed_mixed_kernel / *_online_kernel capped candidate
    rows at 512 via comp_rows[512]/scores[768] and comp_count clamps,
    truncating PRO's 1024 selected compressed rows -> wrong sparse attn.
  - ds4_gpu_indexer_topk_tensor() gated its fast parallel top-k kernels
    behind `top_k == 512u`, forcing PRO onto the O(n_comp*top_k)
    single-thread fallback.

Fix (Flash-safe; 512 keeps its exact prior path):
  - widen comp_rows[512]->[1024], scores[768]->[1280], caps 512->1024
    in the single + online indexed kernels (the existing parallel topk
    kernels are already generic in top_k <= SORT_N).
  - raise the dispatch wall to top_k <= 1024u and force the streaming
    online heads8 kernel for top_k>512 (its candidates stream, so no
    shared-mem blowup), bypassing the 512-wide bitonic sort path.
  - relax the 5 indexer_topk selection gates from `== 512u` to
    `<= 1024u`.

Validated on 2x H200 (atlas01), PRO IQ2XXS:
  - before: control-char garbage on every prompt.
  - after: coherent output ("The capital of France is" -> "Paris.",
    think mode reasons then answers).
  - parallel top-k vs forced fallback: gen 1.69 vs 1.33 t/s (+27%).
  - Flash IQ2XXS regression byte-stable: prefill 97.3 / gen 40.9 t/s.

PRO single-GPU is SSD-streaming bound (432 GiB GGUF > 287 GiB VRAM);
this fixes correctness + the GPU-side selection cost. Throughput is
gated by NVMe expert paging, not this path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01N1GeZuRmy2sERszdL3oRXJ
The fast n_tokens==1 MoE gate path (moe_gate_up_mid_decode_lut_qwarp32_kernel,
DS4_CUDA_MOE_NO_DECODE_LUT_GATE to disable) staged the quantized activation into
a fixed __shared__ sxq[16] and was gated `xq_blocks <= 16u`. PRO's experts have
expert_in_dim=7168 -> xq_blocks=28 > 16, so PRO fell off this path onto the
slower global-memory gate, making iq2_xxs gateup the dominant decode-MoE cost.

Widen the staging buffer to sxq[32] and the guard/dispatch to `xq_blocks <= 32u`
so PRO (28 blocks) also stages x in shared memory. 32 q8_K blocks (~9 KiB) plus
the iq2 grid/signs stays well under the per-SM shared-memory limit.

Logically inert for Flash/GLM (xq_blocks<=16 run the identical staged path).
Validated on 2x H200, PRO IQ2XXS, --ssd-streaming:
  - MoE gateup 1.03 ms -> 0.179 ms/layer (5.7x).
  - decode gen 1.74 -> 1.84 t/s (+5.7%).
  - output unchanged ("The capital of France is" -> "Paris." with LUT on and off).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01N1GeZuRmy2sERszdL3oRXJ
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant