Support DeepSeek V4 Flash 4Expert (top-4) #474

Open
yuhai-china wants to merge 2 commits into antirez:main from yuhai-china:4expert

@yuhai-china yuhai-china commented Jun 28, 2026

This PR enables ds4 to run the 4Expert variant of DeepSeek V4 Flash, which routes to top-4 experts instead of top-6.

Motivation

DeepSeek V4 Flash comes in variants that activate different numbers of routed experts per token. The original ds4 hardcodes 6 active experts. With this PR, the 4Expert variant (256 total experts, 4 active per token) is supported out of the box, while 6-expert models remain fully backward compatible.

4Expert safetensors: https://huggingface.co/cloudyu/DeepSeek-V4-Flash-4Expert
4Expert Q4_K GGUF: https://huggingface.co/cloudyu/DeepSeek-V4-Flash-4Expert-GGUF

Changes

ds4.c — 4Expert support with backward compatibility

  • DS4_SHAPE_FLASH.n_expert_used changed from 6 to 4
  • g_ds4_shape.n_expert_used changed from 6 to 4
  • ds4_select_shape_from_metadata(): when matching the Flash variant, accepts an n_expert_used of either 4 or 6 and keeps 6 at runtime for older 6-expert GGUFs
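
The backward-compatible shape match described above can be sketched roughly as follows. This is not the code from ds4.c: the struct shape_sketch, the constant FLASH_DEFAULT, and the function select_flash_shape() are hypothetical names made up for illustration, and the sketch assumes that the 4-expert and 6-expert Flash variants share the same total of 256 routed experts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down shape record; the real struct in ds4.c
 * is assumed to carry many more fields. */
typedef struct {
    uint32_t n_expert;      /* total routed experts */
    uint32_t n_expert_used; /* experts activated per token */
} shape_sketch;

/* Default Flash shape after this PR: top-4 instead of top-6. */
static const shape_sketch FLASH_DEFAULT = { 256u, 4u };

/* Returns true and fills *out when the GGUF metadata looks like a Flash
 * variant. Both top-4 and top-6 are accepted, and the value read from the
 * file is kept, so older 6-expert GGUFs still run with 6 experts. */
static bool select_flash_shape(uint32_t meta_n_expert,
                               uint32_t meta_n_expert_used,
                               shape_sketch *out)
{
    assert(out != NULL);
    if (meta_n_expert != FLASH_DEFAULT.n_expert) {
        return false; /* not the Flash expert count (assumption) */
    }
    if (meta_n_expert_used != 4u && meta_n_expert_used != 6u) {
        return false; /* unsupported top-k */
    }
    *out = FLASH_DEFAULT;
    out->n_expert_used = meta_n_expert_used; /* preserve the file's value */
    return true;
}
```

The point of the design is that the compiled-in default only decides what a new model looks like; the value stored in the file still wins whenever it is one of the supported counts.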

gguf-tools/gen_gguf_template.py — GGUF template generator

Generates GGUF metadata templates from the safetensors index, mapping HF tensor names to GGUF names via the same layer_map as deepseek4-quantize.c. Converts the tid2eid routing table from I64 to I32.
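
The I64→I32 step for tid2eid is performed by the Python script in the PR; the C sketch below only illustrates what a safe narrowing of that kind might look like. The function name narrow_i64_to_i32() is hypothetical, and the range check is an assumption about what a careful converter would do rather than a description of the script.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Narrow n int64 values into int32 storage.
 * Returns false as soon as a value does not fit, instead of silently
 * truncating it; dst may be partially written in that case. */
static bool narrow_i64_to_i32(const int64_t *src, int32_t *dst, size_t n)
{
    assert(n == 0 || (src != NULL && dst != NULL));
    for (size_t i = 0; i < n; i++) {
        if (src[i] < INT32_MIN || src[i] > INT32_MAX) {
            return false; /* out of range for I32 */
        }
        dst[i] = (int32_t)src[i];
    }
    return true;
}
```

Expert indices for a 256-expert model fit comfortably in 32 bits, so the check should never fire on valid data; it is there to catch a corrupted or misread tensor early.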

test-4expert.sh — One-click end-to-end test

Single script that clones, builds, downloads weights, generates template, quantizes, and runs inference. Anyone can verify the PR on a fresh machine with one command.

docs/

  • gguf-conversion.md — step-by-step GGUF conversion guide
  • test-pr-on-linux.md — Linux testing quickstart

Testing

  • Q4_K GGUF (~153 GiB) converted from 4Expert safetensors, generates at ~26.70 tokens/s
  • Existing 6-expert GGUF files continue to work (shape auto-detection)
  • All changes compile cleanly with make cpu (Linux) and make (macOS)

Quick Test

git clone https://github.com/yuhai-china/ds4 && cd ds4 && git checkout 4expert
bash test-4expert.sh

@yuhai-china yuhai-china force-pushed the 4expert branch 3 times, most recently from 129a9f8 to 4a0380c Compare June 28, 2026 23:25
- Change default n_expert_used from 6 to 4 in DS4_SHAPE_FLASH
- Add backward compatibility: auto-detect 6-expert Flash models
  and set n_expert_used accordingly
- Add gen_gguf_template.py: generate GGUF template from
  safetensors metadata for the quantizer pipeline
- Add docs/gguf-conversion.md: step-by-step GGUF conversion guide

Model: https://huggingface.co/cloudyu/DeepSeek-V4-Flash-4Expert

- Add n_expert_used parameter to router_select_kernel,
  router_select_parallel_kernel, and router_select_warp_topk_kernel
- Replace all hardcoded 6/6u expert count with n_expert_used
- Update guard checks to accept both n_expert_used=4 and n_expert_used=6
- Fix buffer size and hash byte calculations to use n_expert_used
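
As a rough CPU-side illustration of the kernel change listed above, the sketch below takes n_expert_used as a runtime parameter instead of a hardcoded 6. It is not the code of router_select_kernel or its siblings: router_select_topk() and selected_ids_bytes() are hypothetical names, selecting by raw score is an assumption (the real router may apply biases or normalization), and the buffer layout of one int32 id per selected expert is likewise assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pick the n_expert_used highest-scoring experts, best first.
 * Returns 0 on success, -1 if the top-k value fails the guard.
 * Ties go to the lower expert index. */
static int router_select_topk(const float *scores, uint32_t n_expert,
                              uint32_t n_expert_used, int32_t *out_ids)
{
    assert(scores != NULL && out_ids != NULL);
    /* Guard mirroring the PR: only top-4 and top-6 are accepted. */
    if (n_expert_used != 4u && n_expert_used != 6u) return -1;
    if (n_expert_used > n_expert) return -1;

    uint32_t count = 0; /* number of ids currently held in out_ids */
    for (uint32_t e = 0; e < n_expert; e++) {
        /* Find where expert e belongs in the descending list. */
        uint32_t pos = count;
        while (pos > 0 && scores[e] > scores[out_ids[pos - 1]]) pos--;
        if (pos >= n_expert_used) continue; /* not in the top k */

        /* Shift lower-ranked ids down, dropping the last one if full. */
        uint32_t last = count < n_expert_used ? count : n_expert_used - 1u;
        for (uint32_t j = last; j > pos; j--) out_ids[j] = out_ids[j - 1];
        out_ids[pos] = (int32_t)e;
        if (count < n_expert_used) count++;
    }
    return 0;
}

/* Size of the selected-id buffer: derived from n_expert_used, not from 6. */
static size_t selected_ids_bytes(uint32_t n_tokens, uint32_t n_expert_used)
{
    return (size_t)n_tokens * n_expert_used * sizeof(int32_t);
}
```

Deriving the buffer size from the same parameter as the selection loop keeps the two from drifting apart, which is the kind of mismatch the last bullet of the commit message addresses.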