feat(ptq): add SmoothQuant calibration with vLLM & offline weight transform #317
Closed
gavingavin99 wants to merge 1 commit into
…eline for HY3

Introduces an end-to-end SmoothQuant flow on top of the existing vLLM calibration utilities, covering per-channel statistics collection, optional per-layer alpha grid search, and offline QK / VO / down_proj weight scaling.

Core library (angelslim/compressor/quant/core/vllm_calibrate_utils):
- hooks.py: SmoothAttnHook / SmoothDownProjInputHook for per-channel absmax + EMA on q, k, attn_out and dense down_proj inputs; setup / get / print entry points with TP-aware batched all_gather (groups keys by (size, kv_replicas) to cut O(N) NCCL round-trips down to O(S)); collect_fused_moe_smooth_stats for kernel-injected per-expert stats; optional percentile clipping with stride sub-sampling (set/get_percentile_subsample).
- search.py: SmoothAlphaSearchConfig + SmoothAlphaSearcher grid search over alpha (default and per-tensor-act-first modes) with int8/fp8 x per_tensor/per_token/per_channel/per_group/per_block QDQ; raw activation capture via SmoothAlphaValueHook (dense) and collect_fused_moe_alpha_search_values (MoE kernel injection); rank-0 JSON write to avoid mp executor pickling.
- EP is rejected up front (smooth stats are incompatible with expert parallelism).

vLLM patch (tools/vllm_patch):
- envs.py: register VLLM_MOE_COLLECT_SMOOTH_STATS / VLLM_MOE_COLLECT_ALPHA_SEARCH.
- fused_moe.py: inject collect_fused_moe_smooth_stats and collect_fused_moe_alpha_search_values at the down_proj input point in TritonExperts.
- install.sh: extend post-install checks for the new env vars and kernel injection points.

Tooling (tools/smooth):
- run_vllm_smooth.py: vLLM-driven calibration entrypoint that runs forward over the calibration set, collects smooth stats, and optionally runs the alpha grid search.
- convert_smooth_weights.py: offline weight transform that loads the stats JSON and applies QK / VO / down_proj scaling, with attn / mlp output diff verification and parallel safetensors save.
- README.md: full usage / concept / troubleshooting doc.
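For reviewers unfamiliar with the technique, the statistics collection and offline transform described above can be sketched on a single linear layer. This is a minimal pure-Python illustration, not the code in this PR: every function name below is hypothetical, the EMA momentum of 0.9 and alpha of 0.5 are assumed values, and the scale formula is the standard SmoothQuant one, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). It also mimics the output diff verification by checking that the smoothed layer reproduces the original output.

```python
def channel_absmax(rows):
    """Per-channel max |x| over a batch of token rows ([tokens][channels])."""
    return [max(abs(row[j]) for row in rows) for j in range(len(rows[0]))]


def ema_update(running, current, momentum=0.9):
    """Blend a new per-channel absmax into the running statistic (assumed EMA form)."""
    if running is None:  # first calibration batch
        return list(current)
    return [momentum * r + (1.0 - momentum) * c for r, c in zip(running, current)]


def smooth_scales(act_absmax, weight, alpha=0.5, eps=1e-5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); weight is [in][out]."""
    scales = []
    for a, w_row in zip(act_absmax, weight):
        w_absmax = max(abs(v) for v in w_row)
        scales.append(max(a, eps) ** alpha / max(w_absmax, eps) ** (1.0 - alpha))
    return scales


def apply_smoothing(x, weight, scales):
    """Divide activations by s per input channel and multiply weight rows by s."""
    x_s = [[v / s for v, s in zip(row, scales)] for row in x]
    w_s = [[v * s for v in w_row] for w_row, s in zip(weight, scales)]
    return x_s, w_s


def matmul(x, w):
    return [[sum(row[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for row in x]


def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# Toy calibration data: channel 1 carries the activation outliers.
batches = [
    [[0.5, 40.0, -1.0], [-0.2, -35.0, 0.8]],
    [[0.3, 38.0, 0.9], [0.4, -42.0, -0.7]],
]
weight = [[0.1, -0.2], [0.05, 0.02], [-0.3, 0.4]]

running = None
for batch in batches:
    running = ema_update(running, channel_absmax(batch))

scales = smooth_scales(running, weight, alpha=0.5)
x = batches[0]
x_s, w_s = apply_smoothing(x, weight, scales)
diff = max_abs_diff(matmul(x, weight), matmul(x_s, w_s))
print("scales:", scales)
print("max output diff:", diff)
```

In a real model the division by s is not applied at runtime but folded into the producer of the activation (for example the preceding projection), which is why the transform can be done entirely offline.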
Configs & scripts:
- configs/hy3/ptq/hy3_smooth.yaml: shared config for both phases.
- scripts/ptq/run_smooth_for_HY3.sh (one-shot pipeline), run_smooth_calibrate_for_HY3.sh (stats only), run_smooth_convert_for_HY3.sh (offline transform only).
- scripts/ptq/README.md: document the new HY3 smooth scripts.
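The optional per-layer alpha grid search mentioned in the description can be pictured as follows. This is a hedged sketch rather than the SmoothAlphaSearcher implementation: it assumes symmetric int8 per-tensor QDQ for both activations and weights, mean squared output error as the objective, and an 11-point alpha grid. All helper names are hypothetical.

```python
def qdq_int8_per_tensor(mat):
    """Fake-quantize a matrix to symmetric int8 with one scale, then dequantize."""
    amax = max(abs(v) for row in mat for v in row)
    if amax == 0.0:
        return [list(row) for row in mat]
    step = amax / 127.0
    return [[max(-127, min(127, round(v / step))) * step for v in row] for row in mat]


def matmul(x, w):
    return [[sum(row[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for row in x]


def quant_error(x, w, alpha, eps=1e-5):
    """MSE between the float output and the output after smoothing + QDQ."""
    act_absmax = [max(abs(row[j]) for row in x) for j in range(len(x[0]))]
    scales = [
        max(a, eps) ** alpha / max(max(abs(v) for v in w_row), eps) ** (1.0 - alpha)
        for a, w_row in zip(act_absmax, w)
    ]
    x_q = qdq_int8_per_tensor([[v / s for v, s in zip(row, scales)] for row in x])
    w_q = qdq_int8_per_tensor([[v * s for v in w_row] for w_row, s in zip(w, scales)])
    ref, out = matmul(x, w), matmul(x_q, w_q)
    count = len(ref) * len(ref[0])
    return sum((p - q) ** 2 for r, o in zip(ref, out) for p, q in zip(r, o)) / count


def search_alpha(x, w, alphas):
    """Return the (alpha, error) pair with the lowest quantization error."""
    best_alpha, best_err = None, float("inf")
    for alpha in alphas:
        err = quant_error(x, w, alpha)
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err


# Toy layer: channel 1 carries activation outliers.
x = [[0.5, 40.0, -1.0], [-0.2, -35.0, 0.8], [0.3, 38.0, 0.9]]
w = [[0.1, -0.2], [0.05, 0.02], [-0.3, 0.4]]
alphas = [i / 10 for i in range(11)]  # assumed grid: 0.0, 0.1, ..., 1.0

best_alpha, best_err = search_alpha(x, w, alphas)
print("best alpha:", best_alpha, "mse:", best_err)
```

Swapping `qdq_int8_per_tensor` for a per-token or per-channel variant changes only the error function, so the same loop would cover the other QDQ granularities listed for search.py.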