feat(mimo_v25): support MiMo-V2.5-Pro #2514
@Simar-malhotra09 Thank you! Could you please attach the wandb/training loss curve for the model on the hellaswag dataset?
@HuiyingLi I don't have access to GPUs to run the training. Is there something you can do on your end? I added the YAML file in the latest commit. It mostly follows the MiMo-V2-Flash one, so it should be good.
/ok to test 6414e2f
@akoumpa I added the coverage docs for the model since that was the main failing test I saw. Should be good to run again. Although the tests specific to this model still need to be written in |
…to match expected arch
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>
…V2.5-Pro

Implements MiMoV2StateDictAdapter (MoeSplitExpertsStateDictMixin + StateDictAdapter) for XiaomiMiMo/MiMo-V2.5-Pro:
- from_hf: FP8 dequantisation (weight + _scale_inv pairs) followed by per-expert weight merging via _from_hf_w_merged_experts; fused QKV keys (self_attn.qkv_proj.weight) pass through unchanged since HF and NeMo use the same name
- to_hf: splits merged expert tensors back to per-expert layout and re-quantises eligible weights to float8_e4m3fn with scale_inv companions; NON_QUANTIZED_KEY_PATTERNS matches the V2-Flash precedent (norms, embeddings, lm_head, router gate, o_proj, attention_sink_bias)
- Registers MiMoV2ForCausalLM in MODEL_ARCH_MAPPING and mimo_v2 in _CUSTOM_CONFIG_REGISTRATIONS so NeMoAutoModelForCausalLM can resolve the model from an HF config

Smoke-tested end-to-end on CPU with a tiny MiMo-V2.5-Pro config (4 layers, fused QKV, mixed full/SWA attention, MoE layers): imports, registry lookup, model instantiation, adapter attachment, and a forward pass all pass cleanly.

Signed-off-by: Simar <malhotrasimar009@gmail.com>
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>
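To make the from_hf dequantisation step above concrete, here is a minimal pure-Python sketch, not the adapter's actual code. It assumes each quantised tensor `X.weight` has a companion `X.weight_scale_inv` and that scales are block-wise (one scale per square tile); the helper names, the tiny block size, and the use of nested lists in place of tensors are all illustrative assumptions.

```python
def dequantize_blockwise(weight, scale_inv, block):
    """Multiply every element by the scale of the tile it falls in.

    Assumption: scale_inv[i][j] covers rows i*block..(i+1)*block-1 and
    columns j*block..(j+1)*block-1 of `weight`.
    """
    return [
        [w * scale_inv[r // block][c // block] for c, w in enumerate(row)]
        for r, row in enumerate(weight)
    ]


def dequantize_state_dict(state_dict, block=2):
    """Return a copy with every (weight, weight_scale_inv) pair collapsed.

    Keys without a companion scale (norms, embeddings, ...) pass through
    unchanged; the scale entries themselves are dropped.
    """
    out = {}
    for key, value in state_dict.items():
        if key.endswith("_scale_inv"):
            continue  # consumed together with its weight below
        scale = state_dict.get(key + "_scale_inv")
        if scale is None:
            out[key] = value
        else:
            out[key] = dequantize_blockwise(value, scale, block)
    return out


# Hypothetical two-entry checkpoint: one quantised matrix, one plain norm.
example = {
    "layers.0.mlp.up_proj.weight": [[1.0, 2.0], [3.0, 4.0]],
    "layers.0.mlp.up_proj.weight_scale_inv": [[0.5]],
    "layers.0.norm.weight": [1.0, 1.0],
}
dequantized = dequantize_state_dict(example, block=2)
```

In a real adapter the multiplication would run on tensors after casting the FP8 weight to a higher-precision dtype; the key pairing logic is the part this sketch is meant to show.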
Adds examples/llm_finetune/mimo_v25/mimo_v25_pro_hellaswag.yaml:
- 16-node (128 H100) recipe using pp_size=4, ep_size=32 matching the declared ModelCapabilities (supports_pp=True, supports_ep=True)
- dequantize_base_checkpoint=true to handle the FP8 base checkpoint via MiMoV2StateDictAdapter before training
- Same hyperparameters (lr=1e-5, AdamW, max_steps=100) and dataset splits as the MiMo-V2-Flash hellaswag recipe

Signed-off-by: Simar <malhotrasimar009@gmail.com>
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>
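The parallel layout in the recipe above can be sanity-checked with simple arithmetic. The sketch below is a hypothetical helper, not part of NeMo AutoModel, and it assumes that expert-parallel groups are formed inside each pipeline stage; how the framework actually composes the two dimensions may differ.

```python
def check_layout(nodes, gpus_per_node, pp_size, ep_size):
    """Check that pp_size and ep_size tile the cluster evenly.

    Assumption (not confirmed by the recipe): EP groups live inside a
    single pipeline stage, so ep_size must divide the ranks per stage.
    """
    world_size = nodes * gpus_per_node
    if world_size % pp_size != 0:
        raise ValueError("pp_size must divide the world size")
    ranks_per_stage = world_size // pp_size
    if ranks_per_stage % ep_size != 0:
        raise ValueError("ep_size must divide the ranks in each stage")
    return {
        "world_size": world_size,
        "ranks_per_stage": ranks_per_stage,
        "ep_groups_per_stage": ranks_per_stage // ep_size,
    }


# 16 nodes with 128 GPUs in total means 8 GPUs per node.
layout = check_layout(nodes=16, gpus_per_node=8, pp_size=4, ep_size=32)
```

Under that assumption, 128 GPUs split into 4 pipeline stages of 32 ranks each, and every stage holds exactly one 32-way expert-parallel group.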
Adds docs/model-coverage/llm/xiaomimimo/mimo-v2-5-pro.mdx so that test_every_registered_arch_has_model_coverage_doc passes for the newly registered MiMoV2ForCausalLM architecture.

Signed-off-by: Simar <malhotrasimar009@gmail.com>
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>
- Adds MiMo-V2.5-Pro page entry to docs/fern/versions/nightly.yml so the page appears in the sidebar alongside MiMo-V2-Flash
- Adds missing Parameters row to the Info table to match the MiMo-V2-Flash page format

Signed-off-by: Simar <malhotrasimar009@gmail.com>
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>
jgerh left a comment
Completed tech pubs review of docs/model-coverage/llm/xiaomimimo/mimo-v2-5-pro.mdx. Looks great; added two minor copyedits.
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
/ok to test 6beb16e
🌿 Preview your docs: https://nvidia-preview-main.docs.buildwithfern.com/nemo/automodel
What does this PR do?
Adds NeMo AutoModel support for XiaomiMiMo/MiMo-V2.5-Pro, closing #2462.
Changelog
- nemo_automodel/components/models/mimo_v25/config.py: MiMoV2Config adapted from the HF source; handles fused QKV (attention_projection_layout: fused_qkv), hybrid attention pattern, partial RoPE, sigmoid/noaux_tc MoE routing, and torch_dtype assignment after super().__init__() to avoid PretrainedConfig overwriting it
- nemo_automodel/components/models/mimo_v25/model.py: full model implementation (MiMoV2Attention, MiMoV2RotaryEmbedding, MiMoV2DecoderLayer, MiMoV2Model, MiMoV2ForCausalLM) using NeMo infrastructure (initialize_linear_module, initialize_rms_norm_module, MoE, HFCheckpointingMixin, MoEFSDPSyncMixin); supports both full and sliding-window attention layers and EP via ModelCapabilities
- nemo_automodel/components/models/mimo_v25/state_dict_adapter.py: MiMoV2StateDictAdapter handling FP8 dequantisation (weight_scale_inv pairs) and per-expert weight merging/splitting via MoeSplitExpertsStateDictMixin; fused QKV keys pass through unchanged
- nemo_automodel/_transformers/registry.py: registers MiMoV2ForCausalLM in MODEL_ARCH_MAPPING and mimo_v2 in _CUSTOM_CONFIG_REGISTRATIONS
- pyproject.toml: adds per-file D101/D103 ruff ignores matching the project baseline pattern

Before your PR is "Ready for review"
Pre checks:
This is my first contribution to the repo so I may have missed some patterns or conventions. Happy to make changes as needed.
Additional Information
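For reviewers unfamiliar with the per-expert merging and splitting mentioned in the changelog, the round trip can be sketched as follows. This is an illustration only, not the MoeSplitExpertsStateDictMixin implementation: the key layout (`...experts.<idx>.<name>`), the helper names, and the use of nested lists in place of stacked tensors are all assumptions.

```python
import re

# Hypothetical key layouts, chosen for illustration only.
PER_EXPERT_KEY = re.compile(r"^(?P<prefix>.*\.experts)\.(?P<idx>\d+)\.(?P<name>.+)$")
MERGED_KEY = re.compile(r"^(?P<prefix>.*\.experts)\.(?P<name>[A-Za-z_].*)$")


def merge_experts(state_dict):
    """Stack per-expert entries into one list per (prefix, name) pair."""
    merged, grouped = {}, {}
    for key, value in state_dict.items():
        m = PER_EXPERT_KEY.match(key)
        if m is None:
            merged[key] = value  # non-expert keys pass through unchanged
            continue
        grouped.setdefault((m["prefix"], m["name"]), {})[int(m["idx"])] = value
    for (prefix, name), by_idx in grouped.items():
        # Order by expert index so position i always holds expert i.
        merged[f"{prefix}.{name}"] = [by_idx[i] for i in sorted(by_idx)]
    return merged


def split_experts(state_dict):
    """Inverse of merge_experts: one key per expert again."""
    out = {}
    for key, value in state_dict.items():
        m = MERGED_KEY.match(key)
        if m is None:
            out[key] = value
            continue
        for idx, expert_value in enumerate(value):
            out[f"{m['prefix']}.{idx}.{m['name']}"] = expert_value
    return out


hf_style = {
    "model.layers.1.mlp.experts.0.up_proj.weight": [[1.0]],
    "model.layers.1.mlp.experts.1.up_proj.weight": [[2.0]],
    "model.norm.weight": [1.0],
}
merged = merge_experts(hf_style)
round_trip = split_experts(merged)
```

The property worth testing in the model-specific unit tests is the same one this sketch exercises: splitting a merged state dict should reproduce the original per-expert keys and values.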