feat(ptq): add SmoothQuant calibration with vLLM & offline weight transform #317

Closed
gavingavin99 wants to merge 1 commit into Tencent:main from gavingavin99:dev_smoothvllm0527
@gavingavin99
Collaborator

Introduces an end-to-end SmoothQuant flow on top of the existing vLLM calibration utilities, covering per-channel statistics collection, optional per-layer alpha grid search, and offline QK / VO / down_proj weight scaling.
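As a rough illustration of the statistics-collection step, the sketch below tracks a per-channel running absmax together with an EMA of the per-batch absmax. It is not the code from `hooks.py`: the class name `ChannelAbsmaxEMA`, the `momentum` parameter, and the use of plain Python lists in place of tensors are all assumptions made for the example.

```python
class ChannelAbsmaxEMA:
    """Hypothetical tracker: running per-channel absmax plus an EMA of it."""

    def __init__(self, num_channels, momentum=0.9):
        self.momentum = momentum            # assumed EMA decay, not from the PR
        self.absmax = [0.0] * num_channels  # running max over all batches
        self.ema = None                     # initialised from the first batch

    def update(self, batch):
        """batch: list of tokens, each a list of num_channels floats."""
        batch_absmax = [
            max(abs(tok[c]) for tok in batch) for c in range(len(self.absmax))
        ]
        self.absmax = [max(a, b) for a, b in zip(self.absmax, batch_absmax)]
        if self.ema is None:
            self.ema = list(batch_absmax)
        else:
            m = self.momentum
            self.ema = [m * e + (1.0 - m) * b
                        for e, b in zip(self.ema, batch_absmax)]


stats = ChannelAbsmaxEMA(num_channels=2, momentum=0.5)
stats.update([[1.0, -4.0], [-2.0, 0.5]])  # batch absmax per channel: [2.0, 4.0]
stats.update([[3.0, 1.0]])                # batch absmax per channel: [3.0, 1.0]
print(stats.absmax, stats.ema)
```

The running absmax is sensitive to a single outlier batch, while the EMA gives a smoothed estimate; keeping both lets a later step choose between them.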

Core library (angelslim/compressor/quant/core/vllm_calibrate_utils):

  • hooks.py: SmoothAttnHook / SmoothDownProjInputHook for per-channel absmax + EMA on q, k, attn_out and dense down_proj inputs; setup / get / print entry points with TP-aware batched all_gather (keys are grouped by (size, kv_replicas), so NCCL round-trips drop from O(N), one per key, to O(S), one per distinct group); collect_fused_moe_smooth_stats for kernel-injected per-expert stats; optional percentile clipping with stride sub-sampling (set/get_percentile_subsample).
  • search.py: SmoothAlphaSearchConfig + SmoothAlphaSearcher grid search over alpha (default and per-tensor-act-first modes) with int8/fp8 x per_tensor/per_token/per_channel/per_group/per_block QDQ; raw activation capture via SmoothAlphaValueHook (dense) and collect_fused_moe_alpha_search_values (MoE kernel injection); rank-0 JSON write to avoid mp executor pickling.
  • Expert parallelism (EP) is rejected up front, since smooth stats are incompatible with it.
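The alpha grid search can be pictured with the minimal sketch below. It assumes the usual SmoothQuant scale `s_j = act_absmax_j**alpha / weight_absmax_j**(1 - alpha)`, symmetric int8 QDQ with per-token activations and per-output-channel weights, and a squared-error objective against the unquantized output. The function names (`qdq_int8`, `matmul_t`, `search_alpha`) and the objective are hypothetical and do not reflect the `SmoothAlphaSearcher` API.

```python
def qdq_int8(values):
    """Symmetric int8 quantize-dequantize of one vector with a single scale."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def matmul_t(x, weight):
    """x: [tokens][in], weight: [out][in] -> [tokens][out]."""
    return [[sum(a * b for a, b in zip(tok, row)) for row in weight]
            for tok in x]


def search_alpha(x, weight, alphas, eps=1e-5):
    """Return (best_alpha, best_err) over the candidate alphas."""
    n_in = len(weight[0])
    act_absmax = [max(abs(tok[j]) for tok in x) for j in range(n_in)]
    w_absmax = [max(abs(row[j]) for row in weight) for j in range(n_in)]
    y_ref = matmul_t(x, weight)  # unquantized reference output
    best_alpha, best_err = None, float("inf")
    for alpha in alphas:
        s = [max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
             for a, w in zip(act_absmax, w_absmax)]
        # Smooth, then QDQ: one scale per token, one per weight output channel.
        x_q = [qdq_int8([v / sj for v, sj in zip(tok, s)]) for tok in x]
        w_q = [qdq_int8([v * sj for v, sj in zip(row, s)]) for row in weight]
        y_q = matmul_t(x_q, w_q)
        err = sum((a - b) ** 2
                  for r, q in zip(y_ref, y_q) for a, b in zip(r, q))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err


x = [[8.0, 0.5, -0.25], [-6.0, 0.25, 0.75]]  # channel 0 carries outliers
weight = [[0.1, -0.4, 0.3], [0.2, 0.5, -0.6]]
alphas = [i / 10 for i in range(11)]
best_alpha, best_err = search_alpha(x, weight, alphas)
print(best_alpha, best_err)
```

Other granularities listed above (per_tensor, per_group, per_block) would only change how `qdq_int8` is applied to slices of the tensors; the search loop itself stays the same.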

vLLM patch (tools/vllm_patch):

  • envs.py: register VLLM_MOE_COLLECT_SMOOTH_STATS / VLLM_MOE_COLLECT_ALPHA_SEARCH.
  • fused_moe.py: inject collect_fused_moe_smooth_stats and collect_fused_moe_alpha_search_values at the down_proj input point in TritonExperts.
  • install.sh: extend post-install checks for the new env vars and kernel injection points.
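For readers unfamiliar with env-gated kernel injection, the following is a loose sketch of the idea only. It assumes a name-to-lazy-getter mapping for the flags and treats `"1"` as the enabling value; the default values, the `down_proj_input_point` function, and the collected statistic are hypothetical stand-ins, not the patched vLLM code.

```python
import os

# Hypothetical registration: each flag maps to a lazy getter, read at call time.
environment_variables = {
    "VLLM_MOE_COLLECT_SMOOTH_STATS":
        lambda: os.getenv("VLLM_MOE_COLLECT_SMOOTH_STATS", "0") == "1",
    "VLLM_MOE_COLLECT_ALPHA_SEARCH":
        lambda: os.getenv("VLLM_MOE_COLLECT_ALPHA_SEARCH", "0") == "1",
}


def down_proj_input_point(hidden, collected):
    """Stand-in for the injection point: record stats only when enabled."""
    if environment_variables["VLLM_MOE_COLLECT_SMOOTH_STATS"]():
        collected.append(max(abs(v) for v in hidden))
    return hidden  # the forward path is unchanged either way


collected = []
os.environ["VLLM_MOE_COLLECT_SMOOTH_STATS"] = "0"
down_proj_input_point([0.5, -3.0], collected)   # flag off: nothing recorded
os.environ["VLLM_MOE_COLLECT_SMOOTH_STATS"] = "1"
down_proj_input_point([0.5, -3.0], collected)   # flag on: absmax recorded
print(collected)
```

Gating on an environment variable keeps the normal inference path free of collection overhead unless calibration is explicitly requested.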

Tooling (tools/smooth):

  • run_vllm_smooth.py: vLLM-driven calibration entrypoint that runs forward over the calibration set, collects smooth stats, and optionally runs the alpha grid search.
  • convert_smooth_weights.py: offline weight transform that loads the stats JSON and applies QK / VO / down_proj scaling, with attn / mlp output diff verification and parallel safetensors save.
  • README.md: full usage / concept / troubleshooting doc.
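The down_proj part of the offline transform, along with an output diff check, could look roughly like the sketch below. It assumes a gated MLP of the form `down_proj(silu(gate_proj(x)) * up_proj(x))` and folds the inverse scale into the rows of `up_proj`, which is one common way to keep the output unchanged. Whether `convert_smooth_weights.py` folds the scale into the same place is not stated here, so treat the helper `fold_down_proj_scales` and the layout as assumptions.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def matvec(weight, x):
    """weight: [out][in], x: [in] -> [out]."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def mlp(x, gate_w, up_w, down_w):
    gate, up = matvec(gate_w, x), matvec(up_w, x)
    h = [silu(g) * u for g, u in zip(gate, up)]  # this is the down_proj input
    return matvec(down_w, h)


def fold_down_proj_scales(up_w, down_w, scales):
    """Divide row i of up_proj by s_i and multiply column i of down_proj by s_i."""
    up_s = [[w / s for w in row] for row, s in zip(up_w, scales)]
    down_s = [[w * s for w, s in zip(row, scales)] for row in down_w]
    return up_s, down_s


# Toy shapes: hidden size 2, intermediate size 3.
gate_w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
up_w = [[0.3, 0.7], [-0.5, 0.2], [0.9, -0.1]]
down_w = [[0.2, -0.6, 0.4], [0.8, 0.1, -0.3]]
scales = [4.0, 0.5, 2.0]  # illustrative per-channel smooth scales
x = [1.0, -2.0]

up_s, down_s = fold_down_proj_scales(up_w, down_w, scales)
y_ref = mlp(x, gate_w, up_w, down_w)
y_new = mlp(x, gate_w, up_s, down_s)
max_diff = max(abs(a - b) for a, b in zip(y_ref, y_new))
print(max_diff)
```

Because the scale on the down_proj input is cancelled exactly by the scale on its weight columns, the diff should sit at floating-point noise level; a larger value would point to a mismatched channel ordering or a scale applied on only one side.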

Configs & scripts:

  • configs/hy3/ptq/hy3_smooth.yaml: shared config for both phases.
  • scripts/ptq/run_smooth_for_HY3.sh (one-shot pipeline), run_smooth_calibrate_for_HY3.sh (stats only), run_smooth_convert_for_HY3.sh (offline transform only).
  • scripts/ptq/README.md: document the new HY3 smooth scripts.
