feat: NCCL-Xfer refit merge PR #2808

Open
youngeunkwon0405 wants to merge 4 commits into main from youngeunk/new-refit-merge

@youngeunkwon0405 youngeunkwon0405 commented Jun 15, 2026

Contributor

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Current test results

| Test | Backend | Mapping (train → gen) | Refit time | KL range |
| --- | --- | --- | --- | --- |
| 4b | Megatron → vLLM | DP8 → DP8 | 0.07s–0.08s | 0.0004–0.0009 |
| 4b | Megatron → vLLM | DP8 → TP8 | 0.07s–0.08s | 0.0005–0.0007 |
| 4b | Megatron → vLLM | PP2×DP4 → TP4×DP2 | 0.09s | 0.0004–0.0007 |
| 4b | Megatron → vLLM | PP4×DP2 → TP4×DP2 | 0.13s–0.15s | 0.0004–0.0007 |
| 4b | Megatron → vLLM | TP2×PP2×DP2 → TP4×DP2 | 0.13s | 0.0005–0.0012 |
| 4b | Megatron → vLLM | TP2×DP4 → TP4×DP2 | 0.11s–0.15s | 0.0005–0.0007 |
| 30b | Megatron → vLLM | EP4×TP2×DP2 → TP4×DP4 | 0.54s–0.67s | 0.0016–0.0031 |
| 30b | Megatron → vLLM | EP4×TP2×PP2 → TP4×DP4 | 0.48s–0.52s | 0.0017–0.0033 |
| 30b | Megatron → vLLM | EP8×PP2 → TP2×DP8 | 0.51s–0.55s | 0.0016–0.0029 |
| 30b | Megatron → vLLM | EP8×PP2 → TP8×DP2 | 0.50s–0.51s | 0.0016–0.0031 |
| nano-v3 | Megatron → vLLM | EP8×TP2 → TP4×DP4 | 0.48s–0.55s | 0.0011–0.0029 |
| super-v3 | Megatron → vLLM | EP8×TP8×PP2×DP2 → TP8×DP2 | 1.66s–1.94s | 0.0042–0.0081 |
| 235b | Megatron → vLLM | TP4×PP8×EP16 → TP8×DP16 | 4.56s–4.85s | 0.0041–0.0076 |
| dsv3 | Megatron → vLLM | PP16×EP16 → TP32×DP8 | 7.65s–9.94s | 0.0014–0.0024 |
| 4b (FP8) | Megatron → vLLM | DP8 → DP8 | 0.06s–0.07s | 0.0028–0.0045 |
| 4b (FP8) | Megatron → vLLM | DP8 → TP4×DP2 | 0.05s–0.10s | 0.0029–0.0046 |
| 4b (FP8) | Megatron → vLLM | PP2×DP4 → TP4×DP2 | 0.06s–0.10s | 0.0029–0.0042 |
| 4b (FP8) | Megatron → vLLM | PP4×DP2 → TP4×DP2 | 0.08s–0.10s | 0.0031–0.0041 |
| 4b (FP8) | Megatron → vLLM | TP2×DP4 → TP4×DP2 | 0.09s–0.15s | 0.0030–0.0050 |
| 30b (FP8) | Megatron → vLLM | EP4×TP2×PP2 → TP2×DP8 | 1.54s–1.69s | 0.0056–0.0107 |
| 30b (FP8) | Megatron → vLLM | EP8×PP2 → TP2×DP8 | 1.34s–1.60s | 0.0056–0.0101 |
| 30b (FP8) | Megatron → vLLM | EP8×DP2 → TP2×DP8 | 1.21s–1.62s | 0.0054–0.0096 |

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@youngeunkwon0405 youngeunkwon0405 self-assigned this Jun 15, 2026
@youngeunkwon0405 youngeunkwon0405 requested review from a team as code owners June 15, 2026 02:29
@copy-pr-bot

copy-pr-bot Bot commented Jun 15, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@youngeunkwon0405 youngeunkwon0405 added the "Performance" label (Related to improving performance) Jun 15, 2026
@youngeunkwon0405 youngeunkwon0405 changed the title from "[WIP] NCCL-Xfer refit merge PR" to "feat: NCCL-Xfer refit merge PR" Jun 15, 2026
@youngeunkwon0405 youngeunkwon0405 added the "CI:Lfast" label (Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)) Jun 15, 2026
@youngeunkwon0405

Contributor Author

/okay to test 374c26e

@youngeunkwon0405

Contributor Author

/okay to test bda6edf

@youngeunkwon0405

Contributor Author

/okay to test f800a11

youngeunkwon0405 and others added 2 commits June 15, 2026 10:31
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Add in-tree pytest coverage for the nccl_xfer refit path (Megatron
train -> vLLM gen disaggregated weight transfer), which previously had
only multi-node SLURM validation:

- tests/unit/distributed/test_nccl_xfer_utils.py: pure-CPU unit tests for
  build_mesh_info, get_placements, is_expert_param, MeshInfo,
  group_expert_params_in_metadata, and build_nccl_xfer_refit_info.
- tests/unit/distributed/test_xferdtensor_golden.py: shard-slice helpers
  plus end-to-end golden reshard over gloo/CPU.
- tests/unit/models/generation/test_nccl_xfer_backend.py (vllm marker):
  _fused_param_merge_slice and _build_hf_to_gen_backend_mapping.
- tests/unit/models/megatron/test_group_experts.py (mcore marker):
  _group_experts expert stacking.
- tests/functional/grpo_nccl_xfer_refit.sh: 2-node TP4DP2->TP2DP4 GRPO
  smoke (golden fallback) gated on token_mult_prob_error.

No implementation changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
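
For readers unfamiliar with what a "golden reshard" check verifies, here is a minimal pure-Python sketch. It is not the code in `test_xferdtensor_golden.py`: the helper names `shard_slice` and `reshard` are hypothetical, and it assumes a 1-D tensor sharded evenly along one dimension. The idea is that each destination shard is rebuilt by copying the overlapping ranges from the source shards, and the result is compared against a reference slice of the full tensor.

```python
from typing import List, Tuple


def shard_slice(dim_size: int, num_shards: int, rank: int) -> Tuple[int, int]:
    """Return the [start, stop) range of an evenly sharded dimension owned by `rank`."""
    assert dim_size % num_shards == 0, "this sketch assumes even sharding"
    width = dim_size // num_shards
    return rank * width, (rank + 1) * width


def reshard(src_shards: List[List[int]], dst_num_shards: int) -> List[List[int]]:
    """Rebuild destination shards by copying overlapping ranges from source shards."""
    dim_size = sum(len(s) for s in src_shards)
    src_num_shards = len(src_shards)
    dst_shards = []
    for d in range(dst_num_shards):
        d_start, d_stop = shard_slice(dim_size, dst_num_shards, d)
        out: List[int] = []
        for s in range(src_num_shards):
            s_start, s_stop = shard_slice(dim_size, src_num_shards, s)
            lo, hi = max(d_start, s_start), min(d_stop, s_stop)
            if lo < hi:
                # Translate the global overlap into the source shard's local offsets.
                out.extend(src_shards[s][lo - s_start:hi - s_start])
        dst_shards.append(out)
    return dst_shards


# Toy TP4 -> TP2 reshard of an 8-element "weight", checked against a golden slice.
full = list(range(8))
src = [full[a:b] for a, b in (shard_slice(8, 4, r) for r in range(4))]
dst = reshard(src, 2)
golden = [full[a:b] for a, b in (shard_slice(8, 2, r) for r in range(2))]
assert dst == golden
```

The real tests run over gloo on CPU with actual tensors. The sketch only shows the overlap arithmetic that such a check exercises.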
@youngeunkwon0405 youngeunkwon0405 force-pushed the youngeunk/new-refit-merge branch from 25987dc to 977bfe4 June 15, 2026 17:31
@youngeunkwon0405

Contributor Author

/okay to test 977bfe4

@youngeunkwon0405

Contributor Author

/okay to test d90bf0c

When the model ties embeddings (share_embeddings_and_output_weights /
tie_word_embeddings, e.g. Qwen3-0.6B/1.7B), Bridge's export still
materializes lm_head.weight (reconstructed from embed_tokens), so it landed
in the nccl-xfer transfer list -- but no rank owns a standalone lm_head
tensor (task.param_weight is None), so nccl_xfer_refit asserted
"no local tensor for 'lm_head.weight'".

Skip lm_head.weight from the nccl-xfer metadata when the tie flag is set.
The gen backend has no standalone lm_head param either -- it reads logits
from the tied embed_tokens, which IS transferred -- so the shared refit_info
stays consistent across train+gen. Keyed on the tie flag (NOT on per-rank
local ownership) so it is a strict no-op for non-tied models, whose lm_head
is a real param transferred from the last PP stage (an ownership-based
filter would have wrongly dropped non-rank-0 EP/PP params).

Found by tests/functional/grpo_nccl_xfer_refit.sh (Qwen3-0.6B, tied).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
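
The tie-flag filter described in this commit message can be sketched roughly as follows. This is an illustration, not the code in the PR: `filter_transfer_names` and `ownership_filter` are hypothetical helpers, and the parameter names in the example list are assumed Hugging Face-style names. The sketch contrasts the flag-keyed filter with an ownership-based filter, which would drop parameters that simply live on another rank.

```python
from typing import Iterable, List, Set

LM_HEAD_KEY = "lm_head.weight"  # the only name the commit message singles out


def filter_transfer_names(names: Iterable[str], tie_word_embeddings: bool) -> List[str]:
    """Drop lm_head.weight from the transfer list only when embeddings are tied."""
    if not tie_word_embeddings:
        # Non-tied models: lm_head is a real parameter, so this is a strict no-op.
        return list(names)
    return [n for n in names if n != LM_HEAD_KEY]


def ownership_filter(names: Iterable[str], owned_locally: Set[str]) -> List[str]:
    """The rejected alternative: keep only names this rank holds a tensor for."""
    return [n for n in names if n in owned_locally]


# Hypothetical export list for a tied model.
exported = [
    "model.embed_tokens.weight",
    "model.layers.0.mlp.up_proj.weight",
    "lm_head.weight",
]

tied = filter_transfer_names(exported, tie_word_embeddings=True)
untied = filter_transfer_names(exported, tie_word_embeddings=False)

# A rank that only owns the embedding (e.g. a first PP stage) would also lose
# the layer weight under the ownership-based filter, so metadata would differ
# from rank to rank.
by_ownership = ownership_filter(exported, {"model.embed_tokens.weight"})

print(tied)
print(untied)
print(by_ownership)
```

Keying on the flag keeps the resulting list identical on every rank, which the shared `refit_info` relies on according to the commit message.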
@youngeunkwon0405

Contributor Author

/okay to test 989d3c7

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>