feat(mimo_v25): support MiMo-V2.5-Pro #2514

Open
Simar-malhotra09 wants to merge 9 commits into NVIDIA-NeMo:main from Simar-malhotra09:main

Conversation

@Simar-malhotra09

What does this PR do ?

Adds NeMo AutoModel support for XiaomiMiMo/MiMo-V2.5-Pro, closing #2462.

Changelog

  • nemo_automodel/components/models/mimo_v25/config.py: MiMoV2Config adapted from the HF source; handles fused QKV (attention_projection_layout: fused_qkv), the hybrid attention pattern, partial RoPE, and sigmoid/noaux_tc MoE routing, and assigns torch_dtype after super().__init__() so that PretrainedConfig does not overwrite it
  • nemo_automodel/components/models/mimo_v25/model.py: full model implementation (MiMoV2Attention, MiMoV2RotaryEmbedding, MiMoV2DecoderLayer, MiMoV2Model, MiMoV2ForCausalLM) using NeMo infrastructure (initialize_linear_module, initialize_rms_norm_module, MoE, HFCheckpointingMixin, MoEFSDPSyncMixin); supports both full and sliding-window attention layers, and EP via ModelCapabilities
  • nemo_automodel/components/models/mimo_v25/state_dict_adapter.py: MiMoV2StateDictAdapter handling FP8 dequantisation (weight_scale_inv pairs) and per-expert weight merging/splitting via MoeSplitExpertsStateDictMixin; fused QKV keys pass through unchanged
  • nemo_automodel/_transformers/registry.py: registers MiMoV2ForCausalLM in MODEL_ARCH_MAPPING and mimo_v2 in _CUSTOM_CONFIG_REGISTRATIONS
  • pyproject.toml: adds per-file D101/D103 ruff ignores matching the project baseline pattern
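The torch_dtype ordering issue mentioned for config.py can be illustrated with a minimal sketch. `BaseConfig` below is a hypothetical stand-in, not the real `PretrainedConfig`; it only mimics a base class whose `__init__` sets `torch_dtype` itself, which is the behaviour the changelog describes. The subclass names are likewise made up for illustration.

```python
class BaseConfig:
    """Hypothetical stand-in for a base config whose __init__ sets torch_dtype."""

    def __init__(self, **kwargs):
        # The base class assigns torch_dtype from kwargs (None if absent),
        # clobbering anything a subclass stored before calling super().
        self.torch_dtype = kwargs.pop("torch_dtype", None)


class EarlyAssignConfig(BaseConfig):
    def __init__(self, torch_dtype="bfloat16", **kwargs):
        self.torch_dtype = torch_dtype  # set first...
        super().__init__(**kwargs)      # ...then overwritten by the base class


class LateAssignConfig(BaseConfig):
    def __init__(self, torch_dtype="bfloat16", **kwargs):
        super().__init__(**kwargs)
        self.torch_dtype = torch_dtype  # assigned last, so the value survives


early = EarlyAssignConfig(torch_dtype="float32")
late = LateAssignConfig(torch_dtype="float32")
print(early.torch_dtype, late.torch_dtype)
```

Under these assumptions, only the late assignment keeps the requested dtype; whether the real base class behaves exactly this way depends on the installed transformers version.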

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

This is my first contribution to the repo so I may have missed some patterns or conventions. Happy to make changes as needed.

Additional Information

  • Related to issue #2462 (Support MiMo-V2.5-Pro)
  • Smoke-tested on CPU with a tiny config (4 layers, fused QKV, mixed full/SWA attention, MoE layers): imports, registry lookup, model instantiation, adapter attachment, and forward pass all pass
  • An example YAML recipe and documentation updates are not yet included; they can be added once the implementation is approved

@copy-pr-bot

copy-pr-bot Bot commented Jun 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Simar-malhotra09 Simar-malhotra09 marked this pull request as ready for review June 10, 2026 20:46
@Simar-malhotra09 Simar-malhotra09 requested review from a team as code owners June 10, 2026 20:46
@HuiyingLi

Contributor

@Simar-malhotra09 Thank you! Could you please attach the wandb run or training-loss curve for the model on the HellaSwag dataset?

@Simar-malhotra09

Author

@HuiyingLi I don't have access to GPUs to run the training. Is there something you can do on your end? I added the YAML recipe in the latest commit; it mostly follows the one for MiMo-V2-Flash, so it should be good.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers label (Waiting on maintainers to respond) Jun 13, 2026
@akoumpa

akoumpa commented Jun 15, 2026

Contributor

/ok to test 6414e2f

@Simar-malhotra09

Author

@akoumpa I added the model coverage docs, since that was the main failing test I saw, so it should be good to run again. Tests specific to this model still need to be written in tests/unit_tests/models/, but I don't know whether that is in scope at the moment.

…to match expected arch

Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>
…V2.5-Pro

Implements MiMoV2StateDictAdapter (MoeSplitExpertsStateDictMixin +
StateDictAdapter) for XiaomiMiMo/MiMo-V2.5-Pro:

- from_hf: FP8 dequantisation (weight + _scale_inv pairs) followed by
  per-expert weight merging via _from_hf_w_merged_experts; fused QKV
  keys (self_attn.qkv_proj.weight) pass through unchanged since HF and
  NeMo use the same name
- to_hf: splits merged expert tensors back to per-expert layout and
  re-quantises eligible weights to float8_e4m3fn with scale_inv
  companions; NON_QUANTIZED_KEY_PATTERNS matches the V2-Flash precedent
  (norms, embeddings, lm_head, router gate, o_proj, attention_sink_bias)
- Registers MiMoV2ForCausalLM in MODEL_ARCH_MAPPING and mimo_v2 in
  _CUSTOM_CONFIG_REGISTRATIONS so NeMoAutoModelForCausalLM can resolve
  the model from an HF config
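
The from_hf flow described above (dequantise, then merge experts) can be sketched in plain Python, with nested lists standing in for tensors. This is an illustration only, not the adapter's code: the helper names, the merged key name, the tiny block size, and the convention that a weight is multiplied by its block's `_scale_inv` value are all assumptions.

```python
def dequantize_blockwise(weight, scale_inv, block=2):
    """Multiply each element by the scale of the block it falls in (assumed convention)."""
    return [
        [w * scale_inv[r // block][c // block] for c, w in enumerate(row)]
        for r, row in enumerate(weight)
    ]


def from_hf_sketch(hf_state, num_experts, block=2):
    # Step 1: fold every "<key>_scale_inv" companion into its weight.
    dense = {}
    for key, value in hf_state.items():
        if key.endswith("_scale_inv"):
            continue  # consumed together with its weight below
        scale = hf_state.get(key + "_scale_inv")
        dense[key] = value if scale is None else dequantize_blockwise(value, scale, block)

    # Step 2: stack per-expert matrices into one merged entry per projection.
    merged = {k: v for k, v in dense.items() if ".experts." not in k}
    for proj in ("gate_proj", "up_proj", "down_proj"):
        keys = [f"mlp.experts.{e}.{proj}.weight" for e in range(num_experts)]
        if all(k in dense for k in keys):
            merged[f"mlp.experts.{proj}"] = [dense[k] for k in keys]  # hypothetical key name
    return merged


hf_state = {
    "mlp.experts.0.up_proj.weight": [[1.0, 2.0], [3.0, 4.0]],
    "mlp.experts.0.up_proj.weight_scale_inv": [[0.5]],
    "mlp.experts.1.up_proj.weight": [[2.0, 2.0], [2.0, 2.0]],
    "mlp.experts.1.up_proj.weight_scale_inv": [[2.0]],
    # No scale companion in this toy example, so the key and value pass through as-is.
    "self_attn.qkv_proj.weight": [[1.0]],
}
merged = from_hf_sketch(hf_state, num_experts=2)
print(merged["mlp.experts.up_proj"])
```

The to_hf direction would reverse both steps: split the stacked entry back into per-expert keys, then re-quantise eligible weights and emit their scale companions.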

Smoke-tested end-to-end on CPU with a tiny MiMo-V2.5-Pro config (4
layers, fused QKV, mixed full/SWA attention, MoE layers): imports,
registry lookup, model instantiation, adapter attachment, and a forward
pass all pass cleanly.

Signed-off-by: Simar <malhotrasimar009@gmail.com>
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>

Adds examples/llm_finetune/mimo_v25/mimo_v25_pro_hellaswag.yaml:

- 16-node (128 H100) recipe using pp_size=4, ep_size=32 matching
  the declared ModelCapabilities (supports_pp=True, supports_ep=True)
- dequantize_base_checkpoint=true to handle the FP8 base checkpoint
  via MiMoV2StateDictAdapter before training
- Same hyperparameters (lr=1e-5, AdamW, max_steps=100) and dataset
  splits as the MiMo-V2-Flash hellaswag recipe
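
The sizes in this recipe can be sanity-checked with a little arithmetic. The sketch below assumes 8 GPUs per node (16 nodes, 128 H100) and assumes that expert-parallel groups are formed from the ranks within one pipeline stage; the second point is an assumption about the layout, not something the commit states, and `parallel_layout` is a hypothetical helper.

```python
def parallel_layout(nodes, gpus_per_node, pp_size, ep_size):
    """Check that the pipeline and expert parallel sizes tile the available GPUs."""
    world_size = nodes * gpus_per_node
    if world_size % pp_size:
        raise ValueError("pp_size must divide the world size")
    ranks_per_stage = world_size // pp_size
    # Assumption: expert-parallel groups are formed inside a pipeline stage.
    if ranks_per_stage % ep_size:
        raise ValueError("ep_size must divide the ranks in one pipeline stage")
    return {
        "world_size": world_size,
        "ranks_per_stage": ranks_per_stage,
        "ep_groups_per_stage": ranks_per_stage // ep_size,
    }


layout = parallel_layout(nodes=16, gpus_per_node=8, pp_size=4, ep_size=32)
print(layout)
```

With these numbers, each of the 4 pipeline stages holds 32 ranks, which is exactly one expert-parallel group of size 32.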

Signed-off-by: Simar <malhotrasimar009@gmail.com>
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>

Adds docs/model-coverage/llm/xiaomimimo/mimo-v2-5-pro.mdx so that
test_every_registered_arch_has_model_coverage_doc passes for the newly
registered MiMoV2ForCausalLM architecture.

Signed-off-by: Simar <malhotrasimar009@gmail.com>
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>

- Adds MiMo-V2.5-Pro page entry to docs/fern/versions/nightly.yml so
  the page appears in the sidebar alongside MiMo-V2-Flash
- Adds missing Parameters row to the Info table to match the
  MiMo-V2-Flash page format

Signed-off-by: Simar <malhotrasimar009@gmail.com>
Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>

@jgerh jgerh left a comment

Contributor

Completed tech pubs review of docs/model-coverage/llm/xiaomimimo/mimo-v2-5-pro.mdx. Looks great; added two minor copyedits.

Comment thread docs/model-coverage/llm/xiaomimimo/mimo-v2-5-pro.mdx Outdated
Comment thread docs/model-coverage/llm/xiaomimimo/mimo-v2-5-pro.mdx Outdated
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer label (Waiting on the original author to respond) Jun 16, 2026
akoumpa and others added 2 commits June 16, 2026 15:46
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
@akoumpa

akoumpa commented Jun 16, 2026

Contributor

/ok to test 6beb16e

@github-actions

Contributor

Labels

community-request, waiting-on-customer (Waiting on the original author to respond)

Projects

None yet

5 participants