feat(ptq): add SmoothQuant calibration with vLLM & offline weight transform by gavingavin99 · Pull Request #317 · Tencent/AngelSlim

gavingavin99 · 2026-05-27T12:55:15Z

Introduces an end-to-end SmoothQuant flow on top of the existing vLLM calibration utilities, covering per-channel statistics collection, optional per-layer alpha grid search, and offline QK / VO / down_proj weight scaling.

Core library (angelslim/compressor/quant/core/vllm_calibrate_utils):

- hooks.py: SmoothAttnHook / SmoothDownProjInputHook for per-channel absmax + EMA on q, k, attn_out and dense down_proj inputs; setup / get / print entry points with TP-aware batched all_gather (groups keys by (size, kv_replicas) to cut O(N) NCCL round-trips down to O(S)); collect_fused_moe_smooth_stats for kernel-injected per-expert stats; optional percentile clipping with stride sub-sampling (set/get_percentile_subsample).
- search.py: SmoothAlphaSearchConfig + SmoothAlphaSearcher grid search over alpha (default and per-tensor-act-first modes) with int8/fp8 x per_tensor/per_token/per_channel/per_group/per_block QDQ; raw activation capture via SmoothAlphaValueHook (dense) and collect_fused_moe_alpha_search_values (MoE kernel injection); rank-0 JSON write to avoid mp executor pickling.
- EP is rejected up front (smooth stats are incompatible with expert parallelism).
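
The statistics and search steps above can be illustrated with a minimal, dependency-free sketch. It is not the code in hooks.py or search.py: every function name below is hypothetical, the EMA momentum and the int8 per-tensor QDQ are assumed choices, and the scale uses the standard SmoothQuant form s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).

```python
def update_absmax_ema(running, batch, momentum=0.9):
    """Blend the per-channel absmax of `batch` (rows = tokens) into `running`."""
    batch_absmax = [max(abs(row[c]) for row in batch) for c in range(len(batch[0]))]
    if running is None:  # first batch: nothing to blend with yet
        return batch_absmax
    return [momentum * r + (1.0 - momentum) * b for r, b in zip(running, batch_absmax)]


def smooth_scales(act_absmax, weight_absmax, alpha, eps=1e-5):
    """Per-input-channel SmoothQuant scale: a_j^alpha / w_j^(1 - alpha)."""
    return [max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
            for a, w in zip(act_absmax, weight_absmax)]


def qdq_int8_per_tensor(matrix):
    """Symmetric int8 quantize-dequantize with one scale for the whole matrix."""
    amax = max(abs(v) for row in matrix for v in row)
    step = amax / 127.0 if amax > 0 else 1.0
    return [[max(-127, min(127, round(v / step))) * step for v in row] for row in matrix]


def matmul(x, w):
    """x: [tokens][in], w: [in][out] -> [tokens][out]."""
    return [[sum(xi * w[i][o] for i, xi in enumerate(row)) for o in range(len(w[0]))]
            for row in x]


def search_alpha(x, w, alphas):
    """Return (alpha, mse) with the lowest output MSE after smoothing + int8 QDQ."""
    ref = matmul(x, w)                                        # full-precision reference
    act_absmax = update_absmax_ema(None, x)                   # per input channel
    weight_absmax = [max(abs(v) for v in row) for row in w]   # per input channel
    best = None
    for alpha in alphas:
        s = smooth_scales(act_absmax, weight_absmax, alpha)
        xs = [[v / s[i] for i, v in enumerate(row)] for row in x]   # X / s
        ws = [[v * s[i] for v in row] for i, row in enumerate(w)]   # s * W
        out = matmul(qdq_int8_per_tensor(xs), qdq_int8_per_tensor(ws))
        err = sum((a - b) ** 2 for ra, rb in zip(out, ref) for a, b in zip(ra, rb))
        mse = err / (len(ref) * len(ref[0]))
        if best is None or mse < best[1]:
            best = (alpha, mse)
    return best


# Toy layer with one outlier activation channel (channel 0).
x = [[100.0, 1.0], [-80.0, 0.5]]
w = [[0.01, 0.02], [1.0, -1.0]]
print(search_alpha(x, w, [0.0, 0.25, 0.5, 0.75, 1.0]))
```

The sketch searches a single layer against a single QDQ mode; the PR describes a per-layer search across several int8/fp8 granularities, which would replace `qdq_int8_per_tensor` with the mode under test.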

vLLM patch (tools/vllm_patch):

- envs.py: register VLLM_MOE_COLLECT_SMOOTH_STATS / VLLM_MOE_COLLECT_ALPHA_SEARCH.
- fused_moe.py: inject collect_fused_moe_smooth_stats and collect_fused_moe_alpha_search_values at the down_proj input point in TritonExperts.
- install.sh: extend post-install checks for the new env vars and kernel injection points.

Tooling (tools/smooth):

- run_vllm_smooth.py: vLLM-driven calibration entrypoint that runs forward over the calibration set, collects smooth stats, and optionally runs the alpha grid search.
- convert_smooth_weights.py: offline weight transform that loads the stats JSON and applies QK / VO / down_proj scaling, with attn / mlp output diff verification and parallel safetensors save.
- README.md: full usage / concept / troubleshooting doc.
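
The idea behind the offline transform and its output diff check can be sketched for one producer/consumer pair of linear layers. This is a simplified illustration, not convert_smooth_weights.py: the function names are hypothetical, and it assumes the 1/s factor is folded into the output channels of the layer that feeds the scaled input (for example v_proj ahead of o_proj), with no nonlinearity in between.

```python
def matmul(x, w):
    """x: [rows][in], w: [in][out] -> [rows][out]."""
    return [[sum(xi * w[i][o] for i, xi in enumerate(row)) for o in range(len(w[0]))]
            for row in x]


def fold_smooth_scales(producer_w, consumer_w, scales):
    """Fold per-channel scales into two adjacent weight matrices.

    producer_w: [in][hidden]  -> hidden channel j is divided by scales[j]
    consumer_w: [hidden][out] -> input row j is multiplied by scales[j]
    """
    new_producer = [[v / scales[j] for j, v in enumerate(row)] for row in producer_w]
    new_consumer = [[v * scales[j] for v in row] for j, row in enumerate(consumer_w)]
    return new_producer, new_consumer


def max_abs_diff(a, b):
    """Largest elementwise difference between two equally shaped matrices."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


# Toy check: the transformed pair should reproduce the original output.
x = [[1.0, 2.0], [3.0, -1.0]]
producer = [[0.5, 2.0], [1.0, -1.0]]
consumer = [[1.0, 0.0, 2.0], [0.5, -1.0, 1.0]]
new_producer, new_consumer = fold_smooth_scales(producer, consumer, [4.0, 0.5])

before = matmul(matmul(x, producer), consumer)
after = matmul(matmul(x, new_producer), new_consumer)
assert max_abs_diff(before, after) < 1e-9
```

The transform leaves the full-precision output unchanged while shrinking the outlier channels that the consumer layer sees, which is why a diff check on attn / mlp outputs is a natural sanity test after conversion.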

Configs & scripts:

- configs/hy3/ptq/hy3_smooth.yaml: shared config for both phases.
- scripts/ptq/run_smooth_for_HY3.sh (one-shot pipeline), run_smooth_calibrate_for_HY3.sh (stats only), run_smooth_convert_for_HY3.sh (offline transform only).
- scripts/ptq/README.md: document the new HY3 smooth scripts.


gavingavin99 closed this Jun 2, 2026

gavingavin99 wants to merge 1 commit into Tencent:main from gavingavin99:dev_smoothvllm0527
