feat: NCCL-Xfer refit merge PR#2808
Open
youngeunkwon0405 wants to merge 4 commits into
Open
Conversation
Contributor
Author
|
/okay to test 374c26e |
374c26e to
bda6edf
Compare
Contributor
Author
|
/okay to test bda6edf |
bda6edf to
f800a11
Compare
Contributor
Author
|
/okay to test f800a11 |
3cd1f4b to
25987dc
Compare
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Add in-tree pytest coverage for the nccl_xfer refit path (Megatron train -> vLLM gen disaggregated weight transfer), which previously had only multi-node SLURM validation: - tests/unit/distributed/test_nccl_xfer_utils.py: pure-CPU unit tests for build_mesh_info, get_placements, is_expert_param, MeshInfo, group_expert_params_in_metadata, and build_nccl_xfer_refit_info. - tests/unit/distributed/test_xferdtensor_golden.py: shard-slice helpers plus end-to-end golden reshard over gloo/CPU. - tests/unit/models/generation/test_nccl_xfer_backend.py (vllm marker): _fused_param_merge_slice and _build_hf_to_gen_backend_mapping. - tests/unit/models/megatron/test_group_experts.py (mcore marker): _group_experts expert stacking. - tests/functional/grpo_nccl_xfer_refit.sh: 2-node TP4DP2->TP2DP4 GRPO smoke (golden fallback) gated on token_mult_prob_error. No implementation changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
25987dc to
977bfe4
Compare
Contributor
Author
|
/okay to test 977bfe4 |
977bfe4 to
d90bf0c
Compare
Contributor
Author
|
/okay to test d90bf0c |
d90bf0c to
977bfe4
Compare
When the model ties embeddings (share_embeddings_and_output_weights / tie_word_embeddings, e.g. Qwen3-0.6B/1.7B), Bridge's export still materializes lm_head.weight (reconstructed from embed_tokens), so it landed in the nccl-xfer transfer list -- but no rank owns a standalone lm_head tensor (task.param_weight is None), so nccl_xfer_refit asserted "no local tensor for 'lm_head.weight'". Skip lm_head.weight from the nccl-xfer metadata when the tie flag is set. The gen backend has no standalone lm_head param either -- it reads logits from the tied embed_tokens, which IS transferred -- so the shared refit_info stays consistent across train+gen. Keyed on the tie flag (NOT on per-rank local ownership) so it is a strict no-op for non-tied models, whose lm_head is a real param transferred from the last PP stage (an ownership-based filter would have wrongly dropped non-rank-0 EP/PP params). Found by tests/functional/grpo_nccl_xfer_refit.sh (Qwen3-0.6B, tied). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Contributor
Author
|
/okay to test 989d3c7 |
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Current test results
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information