feat: NCCL-Xfer refit merge PR by youngeunkwon0405 · Pull Request #2808 · NVIDIA-NeMo/RL

youngeunkwon0405 · 2026-06-15T02:29:54Z

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Current test results

| Test | Backend | Mapping | Refit time | KL range |
| --- | --- | --- | --- | --- |
| 4b | Megatron → vLLM | DP8 → DP8 | 0.07s–0.08s | 0.0004–0.0009 |
| 4b | Megatron → vLLM | DP8 → TP8 | 0.07s–0.08s | 0.0005–0.0007 |
| 4b | Megatron → vLLM | PP2×DP4 → TP4×DP2 | 0.09s | 0.0004–0.0007 |
| 4b | Megatron → vLLM | PP4×DP2 → TP4×DP2 | 0.13s–0.15s | 0.0004–0.0007 |
| 4b | Megatron → vLLM | TP2×PP2×DP2 → TP4×DP2 | 0.13s | 0.0005–0.0012 |
| 4b | Megatron → vLLM | TP2×DP4 → TP4×DP2 | 0.11s–0.15s | 0.0005–0.0007 |
| 30b | Megatron → vLLM | EP4×TP2×DP2 → TP4×DP4 | 0.54s–0.67s | 0.0016–0.0031 |
| 30b | Megatron → vLLM | EP4×TP2×PP2 → TP4×DP4 | 0.48s–0.52s | 0.0017–0.0033 |
| 30b | Megatron → vLLM | EP8×PP2 → TP2×DP8 | 0.51s–0.55s | 0.0016–0.0029 |
| 30b | Megatron → vLLM | EP8×PP2 → TP8×DP2 | 0.50s–0.51s | 0.0016–0.0031 |
| nano-v3 | Megatron → vLLM | EP8×TP2 → TP4×DP4 | 0.48s–0.55s | 0.0011–0.0029 |
| super-v3 | Megatron → vLLM | EP8×TP8×PP2×DP2 → TP8×DP2 | 1.66s–1.94s | 0.0042–0.0081 |
| 235b | Megatron → vLLM | TP4×PP8×EP16 → TP8×DP16 | 4.56s–4.85s | 0.0041–0.0076 |
| dsv3 | Megatron → vLLM | PP16×EP16 → TP32×DP8 | 7.65s–9.94s | 0.0014–0.0024 |
| 4b (FP8) | Megatron → vLLM | DP8 → DP8 | 0.06s–0.07s | 0.0028–0.0045 |
| 4b (FP8) | Megatron → vLLM | DP8 → TP4×DP2 | 0.05s–0.10s | 0.0029–0.0046 |
| 4b (FP8) | Megatron → vLLM | PP2×DP4 → TP4×DP2 | 0.06s–0.10s | 0.0029–0.0042 |
| 4b (FP8) | Megatron → vLLM | PP4×DP2 → TP4×DP2 | 0.08s–0.10s | 0.0031–0.0041 |
| 4b (FP8) | Megatron → vLLM | TP2×DP4 → TP4×DP2 | 0.09s–0.15s | 0.0030–0.0050 |
| 30b (FP8) | Megatron → vLLM | EP4×TP2×PP2 → TP2×DP8 | 1.54s–1.69s | 0.0056–0.0107 |
| 30b (FP8) | Megatron → vLLM | EP8×PP2 → TP2×DP8 | 1.34s–1.60s | 0.0056–0.0101 |
| 30b (FP8) | Megatron → vLLM | EP8×DP2 → TP2×DP8 | 1.21s–1.62s | 0.0054–0.0096 |

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

copy-pr-bot · 2026-06-15T02:30:02Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

youngeunkwon0405 · 2026-06-15T02:47:08Z

/okay to test 374c26e

youngeunkwon0405 · 2026-06-15T05:18:17Z

/okay to test bda6edf

youngeunkwon0405 · 2026-06-15T05:21:43Z

/okay to test f800a11

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

Add in-tree pytest coverage for the nccl_xfer refit path (Megatron train -> vLLM gen disaggregated weight transfer), which previously had only multi-node SLURM validation:

- tests/unit/distributed/test_nccl_xfer_utils.py: pure-CPU unit tests for build_mesh_info, get_placements, is_expert_param, MeshInfo, group_expert_params_in_metadata, and build_nccl_xfer_refit_info.
- tests/unit/distributed/test_xferdtensor_golden.py: shard-slice helpers plus end-to-end golden reshard over gloo/CPU.
- tests/unit/models/generation/test_nccl_xfer_backend.py (vllm marker): _fused_param_merge_slice and _build_hf_to_gen_backend_mapping.
- tests/unit/models/megatron/test_group_experts.py (mcore marker): _group_experts expert stacking.
- tests/functional/grpo_nccl_xfer_refit.sh: 2-node TP4DP2->TP2DP4 GRPO smoke (golden fallback) gated on token_mult_prob_error.

No implementation changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
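The shard-slice helpers named in this commit message are not shown in the thread. As a rough illustration of what a golden reshard check has to get right, here is a minimal pure-Python sketch that plans the copies needed to move one evenly sharded dimension from a 4-way split to a 2-way split. The names `shard_range` and `reshard_plan` are hypothetical and are not taken from the PR; the sketch assumes a 1-D even split and ignores expert and pipeline parallelism.

```python
def shard_range(dim_size: int, num_shards: int, rank: int) -> tuple[int, int]:
    """Half-open [start, end) range owned by `rank` under an even 1-D split."""
    assert dim_size % num_shards == 0, "sketch assumes an evenly divisible dimension"
    chunk = dim_size // num_shards
    return rank * chunk, (rank + 1) * chunk


def reshard_plan(dim_size: int, src_shards: int, dst_shards: int):
    """List the (src_rank, dst_rank, start, end) copies needed to reshard one dimension."""
    plan = []
    for dst in range(dst_shards):
        d0, d1 = shard_range(dim_size, dst_shards, dst)
        for src in range(src_shards):
            s0, s1 = shard_range(dim_size, src_shards, src)
            # Overlap of the source shard with the destination shard, if any.
            lo, hi = max(s0, d0), min(s1, d1)
            if lo < hi:
                plan.append((src, dst, lo, hi))
    return plan


# Hypothetical example: a dimension of size 8 moving from a 4-way to a 2-way split.
plan = reshard_plan(dim_size=8, src_shards=4, dst_shards=2)

# Golden-style check: every index of the dimension is written exactly once.
covered = sorted(i for _, _, lo, hi in plan for i in range(lo, hi))
print(plan)
print(covered == list(range(8)))
```

A real test would additionally compare the resharded tensors against a reference built from the full, unsharded weight.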

youngeunkwon0405 · 2026-06-15T17:32:57Z

/okay to test 977bfe4

youngeunkwon0405 · 2026-06-15T20:14:53Z

/okay to test d90bf0c

When the model ties embeddings (share_embeddings_and_output_weights / tie_word_embeddings, e.g. Qwen3-0.6B/1.7B), Bridge's export still materializes lm_head.weight (reconstructed from embed_tokens), so it landed in the nccl-xfer transfer list -- but no rank owns a standalone lm_head tensor (task.param_weight is None), so nccl_xfer_refit asserted "no local tensor for 'lm_head.weight'".

Skip lm_head.weight from the nccl-xfer metadata when the tie flag is set. The gen backend has no standalone lm_head param either -- it reads logits from the tied embed_tokens, which IS transferred -- so the shared refit_info stays consistent across train+gen.

Keyed on the tie flag (NOT on per-rank local ownership) so it is a strict no-op for non-tied models, whose lm_head is a real param transferred from the last PP stage (an ownership-based filter would have wrongly dropped non-rank-0 EP/PP params).

Found by tests/functional/grpo_nccl_xfer_refit.sh (Qwen3-0.6B, tied).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
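The tied-embedding fix described in this commit message can be sketched as a filter keyed only on the tie flag. This is an illustration under stated assumptions, not the PR's code: the function name `filter_tied_lm_head`, the metadata layout (parameter name mapped to a shape and dtype string), and the small shapes are all hypothetical.

```python
from typing import Any


def filter_tied_lm_head(
    metadata: dict[str, Any],
    tie_word_embeddings: bool,
    lm_head_name: str = "lm_head.weight",
) -> dict[str, Any]:
    """Drop the lm_head entry from the transfer metadata when embeddings are tied.

    The decision depends only on the tie flag, never on which rank owns a
    tensor, so non-tied models pass through unchanged.
    """
    if not tie_word_embeddings:
        return metadata
    return {name: info for name, info in metadata.items() if name != lm_head_name}


# Hypothetical metadata with illustrative shapes: name -> (shape, dtype string).
metadata = {
    "model.embed_tokens.weight": ((8, 4), "bfloat16"),
    "model.layers.0.mlp.down_proj.weight": ((4, 6), "bfloat16"),
    "lm_head.weight": ((8, 4), "bfloat16"),
}

tied = filter_tied_lm_head(metadata, tie_word_embeddings=True)
untied = filter_tied_lm_head(metadata, tie_word_embeddings=False)
print(sorted(tied))
print(untied == metadata)
```

Keying on the flag rather than on local ownership keeps the filter a no-op for non-tied models, which matches the reasoning given in the commit message.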

youngeunkwon0405 · 2026-06-15T21:28:03Z

/okay to test 989d3c7

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

youngeunkwon0405 self-assigned this Jun 15, 2026

youngeunkwon0405 requested review from a team as code owners June 15, 2026 02:29

youngeunkwon0405 added the Performance label (Related to improving performance) Jun 15, 2026

youngeunkwon0405 changed the title from "[WIP] NCCL-Xfer refit merge PR" to "feat: NCCL-Xfer refit merge PR" Jun 15, 2026

youngeunkwon0405 added the CI:Lfast label (runs a fast test suite and reuses the nightly `main` container, but syncs dependencies to the PR's version) Jun 15, 2026

youngeunkwon0405 force-pushed the youngeunk/new-refit-merge branch from 374c26e to bda6edf June 15, 2026 05:17

youngeunkwon0405 requested a review from a team as a code owner June 15, 2026 05:17

youngeunkwon0405 force-pushed the youngeunk/new-refit-merge branch from bda6edf to f800a11 June 15, 2026 05:21

youngeunkwon0405 force-pushed the youngeunk/new-refit-merge branch from 3cd1f4b to 25987dc June 15, 2026 17:29

youngeunkwon0405 and others added 2 commits June 15, 2026 10:31

nccl-xfer-refit implementation rebased on the main

7bd6e2a

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

youngeunkwon0405 force-pushed the youngeunk/new-refit-merge branch from 25987dc to 977bfe4 June 15, 2026 17:31

NVIDIA-NeMo deleted a comment from copy-pr-bot Bot Jun 15, 2026

youngeunkwon0405 force-pushed the youngeunk/new-refit-merge branch from 977bfe4 to d90bf0c June 15, 2026 20:11

youngeunkwon0405 force-pushed the youngeunk/new-refit-merge branch from d90bf0c to 977bfe4 June 15, 2026 20:43

drop the qkv support for the megatron

7346778

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

youngeunkwon0405 wants to merge 4 commits into main from youngeunk/new-refit-merge