Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/fern/versions/nightly.yml
Original file line number Diff line number Diff line change
Expand Up @@ -169,6 +169,8 @@ navigation:
path: ../../model-coverage/llm/tencent/hy-mt2.mdx
- page: "MiMo-V2-Flash"
path: ../../model-coverage/llm/xiaomimimo/mimo-v2-flash.mdx
- page: "MiMo-V2.5-Pro"
path: ../../model-coverage/llm/xiaomimimo/mimo-v2-5-pro.mdx
- page: "Ling 2.0"
path: ../../model-coverage/llm/inclusionai/ling-2.mdx
- section: "Vision Language Models"
Expand Down
93 changes: 93 additions & 0 deletions docs/model-coverage/llm/xiaomimimo/mimo-v2-5-pro.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
---
title: "MiMo-V2.5-Pro"
description: ""
---
[MiMo-V2.5-Pro](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro) is Xiaomi's hybrid attention Mixture-of-Experts language model. It alternates full and sliding-window attention layers, uses a `sigmoid_with_bias` router with group-limited expert routing, and ships as an FP8 HF checkpoint.

<Info>

| | |
|---|---|
| **Task** | Text Generation (MoE, hybrid attention) |
| **Architecture** | `MiMoV2ForCausalLM` |
| **Parameters** | Approx. several hundred B total / much smaller active |
| **HF Org** | [XiaomiMiMo](https://huggingface.co/XiaomiMiMo) |

</Info>

## Available Models

- **MiMo-V2.5-Pro**: hybrid full/sliding-window attention with FP8 weights.

## Architecture

- `MiMoV2ForCausalLM`
- Sliding-window attention using the `MiMoV2Attention(is_swa=True)` path.
- MoE blocks use `nemo_automodel.components.moe.layers.MoE` with `score_func="sigmoid_with_bias"` and `gate_precision=fp32`.
- FP8 round-trip in `MiMoV2StateDictAdapter` covers the bulk of attention/expert weights; layer norms, the gate, `lm_head`, and `embed_tokens` stay in bf16 per `NON_QUANTIZED_KEY_PATTERNS`.

## Example HF Models

| Model | HF ID |
|---|---|
| MiMo-V2.5-Pro | [`XiaomiMiMo/MiMo-V2.5-Pro`](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro) |

## Example Recipes

| Recipe | Description |
|---|---|
| [mimo_v25_pro_hellaswag.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/mimo_v25/mimo_v25_pro_hellaswag.yaml) | SFT: MiMo-V2.5-Pro on HellaSwag |

## Try with NeMo AutoModel

**1. Install** ([full instructions](/get-started/installation)):

```bash
pip install nemo-automodel
```

**2. Clone the repo** to get the example recipes:

```bash
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```

**3. Run the recipe** from inside the repo:

```bash
automodel --nproc-per-node=8 examples/llm_finetune/mimo_v25/mimo_v25_pro_hellaswag.yaml
```

<Accordion title="Run with Docker">
**1. Pull the container** and mount a checkpoint directory:

```bash
docker run --gpus all -it --rm \
--shm-size=8g \
-v $(pwd)/checkpoints:/opt/Automodel/checkpoints \
nvcr.io/nvidia/nemo-automodel:26.02.00
```

**2. Navigate to the AutoModel directory**:

```bash
cd /opt/Automodel
```

**3. Run the recipe**:

```bash
automodel --nproc-per-node=8 examples/llm_finetune/mimo_v25/mimo_v25_pro_hellaswag.yaml
```
</Accordion>

See the [Installation Guide](/get-started/installation) and [LLM Fine-Tuning Guide](/recipes-e2e-examples/sft-peft).

## Fine-Tuning

See the [LLM Fine-Tuning Guide](/recipes-e2e-examples/sft-peft).

## Hugging Face Model Cards

- [XiaomiMiMo/MiMo-V2.5-Pro](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro)
134 changes: 134 additions & 0 deletions examples/llm_finetune/mimo_v25/mimo_v25_pro_hellaswag.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,134 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# 16 H100 nodes (128 GPUs):
# torchrun --nproc-per-node 8 -m nemo_automodel.cli.app examples/llm_finetune/mimo_v25/mimo_v25_pro_hellaswag.yaml

recipe: TrainFinetuneRecipeForNextTokenPrediction

seed: 1234

step_scheduler:
global_batch_size: 256
local_batch_size: 8
ckpt_every_steps: 25
val_every_steps: 500
num_epochs: 1
max_steps: 100

distributed:
strategy: fsdp2
tp_size: 1
cp_size: 1
pp_size: 4
ep_size: 32

sequence_parallel: false
activation_checkpointing: true

pipeline:
pp_schedule: interleaved1f1b
pp_microbatch_size: 1
layers_per_stage: 2
round_virtual_stages_to_pp_multiple: down
scale_grads_in_schedule: false
patch_inner_model: false
patch_causal_lm_model: false

moe:
reshard_after_forward: false
wrap_outer_model: false

dist_env:
backend: nccl
timeout_minutes: 30

model:
_target_: nemo_automodel.NeMoAutoModelForCausalLM.from_config
config:
_target_: nemo_automodel.components.models.mimo_v25.config.MiMoV2Config.from_pretrained
pretrained_model_name_or_path: XiaomiMiMo/MiMo-V2.5-Pro
name_or_path: XiaomiMiMo/MiMo-V2.5-Pro
trust_remote_code: false
load_base_model: true
backend:
_target_: nemo_automodel.components.models.common.BackendConfig
attn: sdpa
linear: torch
rms_norm: torch_fp32
rope_fusion: false
dispatcher: deepep
experts: torch_mm
gate_precision: float32
enable_hf_state_dict_adapter: true
enable_fsdp_optimizations: true

checkpoint:
enabled: true
checkpoint_dir: checkpoints/mimo_v25_pro
model_save_format: safetensors
save_consolidated: false
dequantize_base_checkpoint: true

loss_fn:
_target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

dataset:
_target_: nemo_automodel.components.datasets.llm.hellaswag.HellaSwag
path_or_dataset: rowan/hellaswag
split: train
tokenizer:
_target_: transformers.AutoTokenizer.from_pretrained
pretrained_model_name_or_path: XiaomiMiMo/MiMo-V2.5-Pro
trust_remote_code: true

packed_sequence:
packed_sequence_size: 0

dataloader:
_target_: torchdata.stateful_dataloader.StatefulDataLoader
collate_fn:
_target_: nemo_automodel.components.datasets.utils.default_collater
pad_seq_len_divisible: 64
shuffle: true

validation_dataset:
_target_: nemo_automodel.components.datasets.llm.hellaswag.HellaSwag
path_or_dataset: rowan/hellaswag
split: validation
num_samples_limit: 64
tokenizer:
_target_: transformers.AutoTokenizer.from_pretrained
pretrained_model_name_or_path: XiaomiMiMo/MiMo-V2.5-Pro
trust_remote_code: true

validation_dataloader:
_target_: torchdata.stateful_dataloader.StatefulDataLoader
collate_fn:
_target_: nemo_automodel.components.datasets.utils.default_collater
pad_seq_len_divisible: 64
shuffle: false
drop_last: true

optimizer:
_target_: torch.optim.AdamW
betas: [0.9, 0.95]
eps: 1e-8
lr: 1e-5
weight_decay: 0.1

wandb:
project: automodel-mimo-v25-pro
name: mimo_v25_pro_hellaswag_16n
mode: online
5 changes: 5 additions & 0 deletions nemo_automodel/_transformers/registry.py
Original file line number Diff line number Diff line change
Expand Up @@ -140,6 +140,10 @@
"MiMoV2FlashForCausalLM",
("nemo_automodel.components.models.mimo_v2_flash.model", "MiMoV2FlashForCausalLM"),
),
(
"MiMoV2ForCausalLM",
("nemo_automodel.components.models.mimo_v25.model", "MiMoV2ForCausalLM"),
),
(
"Ministral3ForCausalLM",
("nemo_automodel.components.models.mistral3.model", "Ministral3ForCausalLM"),
Expand Down Expand Up @@ -282,6 +286,7 @@
"kimi_vl": ("nemo_automodel.components.models.kimivl.model", "KimiVLConfig"),
"llavaonevision1_5": ("nemo_automodel.components.models.llava_onevision.model", "Llavaonevision1_5Config"),
"mimo_v2_flash": ("nemo_automodel.components.models.mimo_v2_flash.config", "MiMoV2FlashConfig"),
"mimo_v2": ("nemo_automodel.components.models.mimo_v25.config", "MiMoV2Config"),
"minimax_m3_vl": ("nemo_automodel.components.models.minimax_m3_vl.config", "MiniMaxM3VLConfig"),
"mistral4": ("nemo_automodel.components.models.mistral4.configuration", "Mistral4Config"),
"step3p5v": ("nemo_automodel.components.models.step3p7.configuration_step3p7", "Step3p5VConfig"),
Expand Down
18 changes: 18 additions & 0 deletions nemo_automodel/components/models/mimo_v25/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from nemo_automodel.components.models.mimo_v25.config import MiMoV2Config
from nemo_automodel.components.models.mimo_v25.model import MiMoV2ForCausalLM, MiMoV2Model

__all__ = ["MiMoV2Config", "MiMoV2ForCausalLM", "MiMoV2Model"]
Loading
Loading