CUDA: make DeepSeek-V4-Pro correct on the indexed-attention path (top_k 512→1024) + enable decode LUT gate for in_dim>4096 by slackarea · Pull Request #478 · antirez/ds4

slackarea · 2026-06-29T19:10:26Z

What

Two surgical CUDA fixes that make DeepSeek-V4-Pro run correctly and a bit
faster on the CUDA backend. Both are gated so Flash is byte-identical (its
top_k == 512 keeps its exact prior path).

1. Generalize the CUDA DSA indexer `top_k` 512 → 1024

PRO uses indexer top_k = 1024 (Flash = 512). The CUDA indexed-attention path was
hardcoded for 512, so PRO emitted garbage on --cuda:

ds4_gpu_attention_indexed_mixed_batch_heads_tensor() hard-rejected top_k > 512
(if (top_k > 512u) return 0;), no-oping the whole indexed attention for PRO;
the single + online indexed kernels capped candidate rows at 512 via
comp_rows[512] / scores[768] / comp_count clamps, truncating PRO's 1024
selected compressed rows → wrong sparse attention;
ds4_gpu_indexer_topk_tensor() gated its fast parallel top-k kernels behind
top_k == 512u, forcing PRO onto the O(n_comp·top_k) single-thread fallback.

Fix (Flash-safe): widen comp_rows[512]→[1024], scores[768]→[1280], the caps,
raise the dispatch wall to top_k <= 1024u, force the streaming online heads8 kernel
for top_k > 512, and relax the 5 indexer_topk selection gates to <= 1024u (those
parallel kernels are already generic in top_k <= SORT_N).

2. Enable the decode LUT iq2_xxs gate kernel for `expert_in_dim > 4096`

moe_gate_up_mid_decode_lut_qwarp32_kernel staged the quantized activation in a fixed
__shared__ sxq[16], gated xq_blocks <= 16. PRO's experts have expert_in_dim = 7168
→ xq_blocks = 28 > 16, so PRO fell off this fast path onto the slower global-memory
gate. Widen to sxq[32] / <= 32u. Inert for Flash/GLM (xq_blocks <= 16 run the
identical staged path).

Validation (2× H200, PRO IQ2XXS, `--ssd-streaming`)

before fix Fix spelling typos in README #1: control-char garbage on every prompt;
after: coherent output ("The capital of France is" → "Paris."; think mode reasons
then answers);
fix Fix typo #2: MoE gateup 1.03 → 0.179 ms/layer (5.7×); decode gen 1.74 → 1.84 t/s (+5.7%);
Flash IQ2XXS regression byte-stable: prefill 97.3 / gen 40.9 t/s, output unchanged.

🤖 Generated with Claude Code

DeepSeek-V4-PRO uses indexer top_k=1024 (Flash=512). The CUDA indexed attention path was hardcoded for 512, so PRO emitted garbage on --cuda: - ds4_gpu_attention_indexed_mixed_batch_heads_tensor() hard-rejected top_k>512 (`if (top_k > 512u) return 0;`), so the whole indexed attention no-oped for PRO. - attention_indexed_mixed_kernel / *_online_kernel capped candidate rows at 512 via comp_rows[512]/scores[768] and comp_count clamps, truncating PRO's 1024 selected compressed rows -> wrong sparse attn. - ds4_gpu_indexer_topk_tensor() gated its fast parallel top-k kernels behind `top_k == 512u`, forcing PRO onto the O(n_comp*top_k) single-thread fallback. Fix (Flash-safe; 512 keeps its exact prior path): - widen comp_rows[512]->[1024], scores[768]->[1280], caps 512->1024 in the single + online indexed kernels (the existing parallel topk kernels are already generic in top_k <= SORT_N). - raise the dispatch wall to top_k <= 1024u and force the streaming online heads8 kernel for top_k>512 (its candidates stream, so no shared-mem blowup), bypassing the 512-wide bitonic sort path. - relax the 5 indexer_topk selection gates from `== 512u` to `<= 1024u`. Validated on 2x H200 (atlas01), PRO IQ2XXS: - before: control-char garbage on every prompt. - after: coherent output ("The capital of France is" -> "Paris.", think mode reasons then answers). - parallel top-k vs forced fallback: gen 1.69 vs 1.33 t/s (+27%). - Flash IQ2XXS regression byte-stable: prefill 97.3 / gen 40.9 t/s. PRO single-GPU is SSD-streaming bound (432 GiB GGUF > 287 GiB VRAM); this fixes correctness + the GPU-side selection cost. Throughput is gated by NVMe expert paging, not this path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01N1GeZuRmy2sERszdL3oRXJ

The fast n_tokens==1 MoE gate path (moe_gate_up_mid_decode_lut_qwarp32_kernel, DS4_CUDA_MOE_NO_DECODE_LUT_GATE to disable) staged the quantized activation into a fixed __shared__ sxq[16] and was gated `xq_blocks <= 16u`. PRO's experts have expert_in_dim=7168 -> xq_blocks=28 > 16, so PRO fell off this path onto the slower global-memory gate, making iq2_xxs gateup the dominant decode-MoE cost. Widen the staging buffer to sxq[32] and the guard/dispatch to `xq_blocks <= 32u` so PRO (28 blocks) also stages x in shared memory. 32 q8_K blocks (~9 KiB) plus the iq2 grid/signs stays well under the per-SM shared-memory limit. Logically inert for Flash/GLM (xq_blocks<=16 run the identical staged path). Validated on 2x H200, PRO IQ2XXS, --ssd-streaming: - MoE gateup 1.03 ms -> 0.179 ms/layer (5.7x). - decode gen 1.74 -> 1.84 t/s (+5.7%). - output unchanged ("The capital of France is" -> "Paris." with LUT on and off). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01N1GeZuRmy2sERszdL3oRXJ

slackarea and others added 2 commits June 29, 2026 21:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA: make DeepSeek-V4-Pro correct on the indexed-attention path (top_k 512→1024) + enable decode LUT gate for in_dim>4096#478

CUDA: make DeepSeek-V4-Pro correct on the indexed-attention path (top_k 512→1024) + enable decode LUT gate for in_dim>4096#478
slackarea wants to merge 2 commits into
antirez:mainfrom
vcnngr:pro-cuda-fixes

slackarea commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

slackarea commented Jun 29, 2026

What

1. Generalize the CUDA DSA indexer top_k 512 → 1024

2. Enable the decode LUT iq2_xxs gate kernel for expert_in_dim > 4096

Validation (2× H200, PRO IQ2XXS, --ssd-streaming)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. Generalize the CUDA DSA indexer `top_k` 512 → 1024

2. Enable the decode LUT iq2_xxs gate kernel for `expert_in_dim > 4096`

Validation (2× H200, PRO IQ2XXS, `--ssd-streaming`)