Support DeepSeek V4 Flash 4Expert (top-4) #474
Open
yuhai-china wants to merge 2 commits into
- Change default n_expert_used from 6 to 4 in DS4_SHAPE_FLASH
- Add backward compatibility: auto-detect 6-expert Flash models and set n_expert_used accordingly
- Add gen_gguf_template.py: generate GGUF template from safetensors metadata for the quantizer pipeline
- Add docs/gguf-conversion.md: step-by-step GGUF conversion guide
Model: https://huggingface.co/cloudyu/DeepSeek-V4-Flash-4Expert
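The backward-compatibility rule in this commit can be pictured with a small C sketch. This is not the code from ds4.c: the struct, the constant, and the function name below are hypothetical stand-ins, and only the behavior (default of 4, accept 4 or 6, keep the value found in the file) is taken from the description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the shape record; the real struct in ds4.c
 * has more fields. */
typedef struct {
    uint32_t n_expert;       /* total routed experts (256 for Flash) */
    uint32_t n_expert_used;  /* routed experts active per token */
} shape_sketch;

/* The default now describes the 4Expert variant. */
static const shape_sketch SHAPE_FLASH_SKETCH = { 256u, 4u };

/*
 * Returns true and fills *out when the metadata looks like a Flash model
 * with 4 or 6 active experts. The count read from the file is preserved,
 * so an older 6-expert GGUF keeps routing to 6 experts at runtime.
 */
static bool select_flash_shape(uint32_t meta_n_expert,
                               uint32_t meta_n_expert_used,
                               shape_sketch *out)
{
    if (meta_n_expert != SHAPE_FLASH_SKETCH.n_expert) {
        return false;
    }
    if (meta_n_expert_used != 4u && meta_n_expert_used != 6u) {
        return false;
    }
    *out = SHAPE_FLASH_SKETCH;
    out->n_expert_used = meta_n_expert_used;
    return true;
}
```

Any other expert count is rejected in this sketch rather than silently mapped to the default, which is an assumption about how a mismatch should be handled.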
- Add n_expert_used parameter to router_select_kernel, router_select_parallel_kernel, and router_select_warp_topk_kernel
- Replace all hardcoded 6/6u expert count with n_expert_used
- Update guard checks to accept both n_expert_used=4 and n_expert_used=6
- Fix buffer size and hash byte calculations to use n_expert_used
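To show what replacing the hardcoded 6 with a parameter means, here is a CPU reference sketch in C. It is not one of the GPU kernels named above: router_select_ref and router_ids_bytes are hypothetical names, the tie-breaking rule is an assumption, and the byte count assumes 32-bit expert ids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Writes the indices of the n_expert_used highest scores into out_ids,
 * best first. Ties go to the lower expert index (an assumption).
 */
static void router_select_ref(const float *scores, uint32_t n_expert,
                              uint32_t n_expert_used, uint32_t *out_ids)
{
    /* Guard mirrors the commit: only top-4 and top-6 are accepted. */
    assert(n_expert_used == 4u || n_expert_used == 6u);
    assert(n_expert_used <= n_expert);

    for (uint32_t k = 0; k < n_expert_used; k++) {
        uint32_t best = UINT32_MAX;
        for (uint32_t e = 0; e < n_expert; e++) {
            int taken = 0;
            for (uint32_t j = 0; j < k; j++) {
                if (out_ids[j] == e) taken = 1;
            }
            if (taken) continue;
            if (best == UINT32_MAX || scores[e] > scores[best]) best = e;
        }
        out_ids[k] = best;
    }
}

/* Bytes needed to store the selected ids for n_tokens tokens; the size
 * scales with n_expert_used instead of a fixed 6. */
static size_t router_ids_bytes(size_t n_tokens, uint32_t n_expert_used)
{
    return n_tokens * (size_t)n_expert_used * sizeof(uint32_t);
}
```

The repeated-scan selection is chosen for readability; a GPU kernel would typically use a parallel reduction, but the role of n_expert_used as the loop bound and buffer multiplier is the same.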
This PR enables ds4 to run the 4Expert variant of DeepSeek V4 Flash, which routes to top-4 experts instead of top-6.
Motivation
DeepSeek V4 Flash comes in variants that activate different numbers of routed experts per token. The original ds4 hardcodes 6 active experts. With this PR, the 4Expert variant (256 total experts, 4 active per token) is supported out of the box, while 6-expert models remain fully backward compatible.
4Expert safetensors: https://huggingface.co/cloudyu/DeepSeek-V4-Flash-4Expert
4Expert Q4_K GGUF: https://huggingface.co/cloudyu/DeepSeek-V4-Flash-4Expert-GGUF
Changes
ds4.c — 4Expert support with backward compatibility
- DS4_SHAPE_FLASH.n_expert_used changed from 6 to 4
- g_ds4_shape.n_expert_used changed from 6 to 4
- ds4_select_shape_from_metadata(): when matching the Flash variant, accepts both 4 and 6, preserving 6 at runtime for old GGUFs
gguf-tools/gen_gguf_template.py — GGUF template generator
Generates GGUF metadata templates from the safetensors index, mapping HF tensor names to GGUF names via the same layer_map as deepseek4-quantize.c. Handles I64→I32 conversion for the tid2eid routing table.
test-4expert.sh — One-click end-to-end test
Single script that clones, builds, downloads weights, generates template, quantizes, and runs inference. Anyone can verify the PR on a fresh machine with one command.
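For the I64→I32 conversion of the tid2eid routing table noted under gen_gguf_template.py, the following C sketch shows the narrowing step on its own. The script in this PR is Python; the function name here is hypothetical, and the range check is an assumption about what a safe conversion should do, not a description of the script.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copies n 64-bit values into 32-bit storage. Returns false at the first
 * value that does not fit in int32_t; entries of dst written before that
 * point are left as they are.
 */
static bool narrow_i64_to_i32(const int64_t *src, int32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (src[i] < INT32_MIN || src[i] > INT32_MAX) {
            return false;
        }
        dst[i] = (int32_t)src[i];
    }
    return true;
}
```

With 256 total experts, every expert id fits comfortably in 32 bits, so the check is a safeguard rather than an expected failure path.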
docs/
- gguf-conversion.md — step-by-step GGUF conversion guide
- test-pr-on-linux.md — Linux testing quickstart
Testing
make cpu (Linux) and make (macOS)
Quick Test