NVIDIA-NeMo · yuki-97 · Jul 2, 2026 · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026
@@ -0,0 +1,86 @@
+# Audio-Visual GRPO with Qwen2.5-Omni-7B
+
+This guide explains how to use NeMo RL to train [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) with GRPO on the [PhilipC/IntentTrain](https://huggingface.co/datasets/PhilipC/IntentTrain) audio-visual intent-recognition dataset and evaluate on [Daily-Omni](https://huggingface.co/datasets/liarliar/Daily-Omni), following the dataset structure used in [HumanOmniV2](https://arxiv.org/abs/2506.21277).
+
+Each training sample feeds the Qwen2.5-Omni processor both the video stream (8 frames) and the audio track decoded from the same file at 16 kHz mono. Audio and video flow as two **independent multimodal items** per prompt: the dataset emits `{type: video}` + `{type: audio}` content items, the Qwen2.5-Omni chat template renders both `<|VIDEO|>` and `<|AUDIO|>` placeholders, and vLLM rollouts populate `multi_modal_data["video"]` and `multi_modal_data["audio"]` from the same sample.
+
+## 1. Train the Model
+
+Run GRPO training with the provided config:
+
+```
+uv run examples/run_vlm_grpo.py --config examples/configs/recipes/vlm/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1.yaml
+```
+
+Config: `examples/configs/recipes/vlm/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1.yaml`
+
+Key hyperparameters:
+
+| Parameter | Value |
+| --- | --- |
+| Model | Qwen2.5-Omni-7B |
+| Train dataset | PhilipC/IntentTrain (problem_type = "multiple choice") |
+| Validation dataset | PhilipC/IntentBench (problem_type = "multiple choice") |
+| Modalities per prompt | video (8 frames, `<\|VIDEO\|>` placeholder) + audio (16 kHz mono, `<\|AUDIO\|>` placeholder) — independent multimodal items, no `use_audio_in_video` alignment |
+| GPUs | 8 x 1 node, Megatron backend, `tensor_model_parallel_size=2` (data parallel = 4) |
+| Learning rate | 1e-6 |
+| KL penalty | 0.01 |
+| Generations per prompt | 8 |
+| Prompts per step | 32 |
+| Train global / micro batch | 32 / 1 |
+| Max steps | 1000 |
+| Save period | 20 |
+| Reward | format (0.2) + exact_alnum (0.8) |
+
+The dataset class downloads `PhilipC/IntentTrain` and `PhilipC/IntentBench` via `huggingface_hub.snapshot_download` and extracts each `videos.zip` once into the corresponding HuggingFace cache directory. Re-instantiating the dataset on a machine that already has the archives extracted is a no-op.
+
+Only `problem_type == "multiple choice"` samples are used. The allow-list is configurable through `data.train.allowed_problem_types` and `data.validation.allowed_problem_types` if you want to extend scope (for example, to `emer_ov_mc`); doing so requires picking an answer-correctness reward that handles those answer formats.
+
+### 7B training notes
+
+- **8 video frames** keep the prompt around ~4.5k tokens (8×360 video + ~1.5k audio + text), under `max_total_sequence_length=8192`, and roughly halve the training-forward activation memory versus 16 frames. Do **not** switch to fps-based sampling — at fps=2 the clips expand to ~43k video tokens, blow past the token budget, and `vlm_hf_data_processor` then empties the multimodal items and sets `loss_multiplier=0`.
+- **`activation_checkpointing: true` + `gpu_memory_utilization: 0.4`** keep the Megatron forward inside the memory vLLM leaves resident after sleep mode. If `tensor_model_parallel_size=2` OOMs, fall back to `tensor_model_parallel_size=4` (proven to run at 8 frames).
+- If `loss_multiplier` is logged at 0 for many samples, the multimodal prompt is exceeding `max_total_sequence_length`; bump it until validation samples consistently produce non-zero loss.
+- Set `HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1` once `Qwen/Qwen2.5-Omni-7B`, `PhilipC/IntentTrain`, and `PhilipC/IntentBench` are pre-fetched, so Megatron's tokenizer worker doesn't hit the network.
+
+## 2. Convert Checkpoint (Megatron to HF)
+
+Checkpoints are saved under `results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1` (`checkpointing.checkpoint_dir`), one every `save_period=20` steps. Convert a checkpoint from Megatron to Hugging Face format before evaluating:
+
+```
+uv run --extra mcore python examples/converters/convert_megatron_to_hf.py \
+    --config results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1/step_43/config.yaml \
+    --megatron-ckpt-path results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1/step_43/policy/weights/iter_0000000 \
+    --hf-ckpt-path results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1/step_43/hf --no-strict
+```
+
+Replace the step number with the checkpoint you want to evaluate. `--no-strict` is expected here: only the Qwen2.5-Omni *thinker* is trained, so the talker tensors are reported as "not written". The `--extra mcore` flag is required for the Megatron converter.
+
+## 3. Evaluate
+
+In-training validation uses IntentBench as the validation set, so `val_period`, `val_batch_size`, and `max_val_samples` from the config drive evaluation cadence.
+
+For a standalone benchmark, decode the converted HF checkpoint on [Daily-Omni](https://huggingface.co/datasets/liarliar/Daily-Omni) (1197 audio-visual multiple-choice questions) with `examples/run_eval.py`:
+
+```
+uv run examples/run_eval.py --config examples/configs/evals/daily_omni.yaml \
+    generation.model_name=results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1/step_43/hf
+```
+
+The eval config (`examples/configs/evals/daily_omni.yaml`) feeds audio + video (32 frames — eval has no training-forward memory pressure, so it samples more densely than training), uses the same think+answer prompt as training, and scores with `exact_alnum` (case-insensitive exact match on the `<answer>` content).
+
+## 4. Results
+
+Daily-Omni accuracy (1197 questions, greedy decoding) for the base Qwen2.5-Omni-7B versus the GRPO-trained checkpoint:
+
+| Question type | Base | After GRPO |
+| --- | --- | --- |
+| **Overall** | **0.498** | **0.590** |
+| AV Event Alignment | 0.353 | 0.450 |
+| Comparative | 0.618 | 0.725 |
+| Context understanding | 0.446 | 0.534 |
+| Event Sequence | 0.395 | 0.490 |
+| Inference | 0.714 | 0.760 |
+| Reasoning | 0.651 | 0.766 |
+
+GRPO lifts overall Daily-Omni accuracy by ~9 points, with gains across every question category. The largest relative gains are on the reasoning-style questions.
@@ -121,6 +121,13 @@ Configure offline and online Eagle3 draft-model workflows to accelerate rollout
 Train Qwen2.5-Omni-3B with GRPO on AVQA and evaluate on MMAU, following the R1-AQA approach.
 :::
 
+:::{grid-item-card} {octicon}`device-camera-video` Audio-Visual Intent GRPO
+:link: guides/grpo-audio-visual
+:link-type: doc
+
+Train Qwen2.5-Omni-7B with GRPO on PhilipC/IntentTrain (audio-visual intent recognition) and evaluate on Daily-Omni, following HumanOmniV2's joint audio-visual setup.
+:::
+
 :::{grid-item-card} {octicon}`terminal` Two-Stage SWE RL (Qwen3 Thinking)
 :link: guides/swe-rl-qwen3
 :link-type: doc
@@ -271,6 +278,7 @@ guides/ppo.md
 guides/grpo-deepscaler.md
 guides/grpo-sliding-puzzle.md
 guides/grpo-audio.md
+guides/grpo-audio-visual.md
 guides/rm.md
 guides/environments.md
 guides/eval.md

@@ -0,0 +1,49 @@
+# Daily-Omni audio-visual eval. Inherits the shared eval defaults and only
+# overrides what differs for the Qwen2.5-Omni audio+video setup.
+defaults: "eval.yaml"
+
+generation:
+  model_name: "Qwen/Qwen2.5-Omni-7B"
+  vllm_cfg:
+    # 0.5 (vs the 0.9 default): with 32 video frames + audio, the Qwen2.5-Omni
+    # vision/audio encoder forward needs a large chunk of *transient
+    # activation* memory outside vLLM's KV-cache budget. At 0.9 the KV cache
+    # claims almost all VRAM and the first multimodal forward OOM-crashes the
+    # workers. 0.5 leaves ample headroom; KV cache is still far more than eval
+    # needs.
+    gpu_memory_utilization: 0.5
+    # Fit 32 video frames + the 16 kHz audio track without truncating the
+    # multimodal prompt (truncation silently masks samples out -> reward 0).
+    max_model_len: 32000
+    # Audio/multimodal models need the tokenizer initialized before generation.
+    skip_tokenizer_init: False
+    limit_mm_per_prompt:
+      video: 1
+      audio: 1
+  vllm_kwargs:
+    # Disable mm processor cache to avoid vLLM cache eviction during eval.
+    mm_processor_cache_gb: 0
+    # Cap concurrent sequences so the vision/audio encoder only processes a few
+    # clips per step. With audio + 32 frames, vLLM otherwise batches ~66 clips
+    # into one encoder forward and OOM-crashes the workers (encoder *activation*
+    # memory, not KV cache). Eval throughput is not a concern.
+    max_num_seqs: 8
+
+tokenizer:
+  video:
+    # 32 frames (vs 16): 60s clips at 16 frames is ~1 frame / 3.75s, too sparse
+    # for fine-grained temporal (Event Sequence) questions.
+    num_frames: 32
+
+data:
+  prompt_file: examples/prompts/daily_omni.txt
+  dataset_name: "daily-omni"
+  split: "train"
+  env_name: vlm
+
+env:
+  vlm:
+    num_workers: 8
+    reward_functions:
+    - name: exact_alnum
+      weight: 1.0
@@ -47,7 +47,7 @@ data:
   env_name: vlm
 
 env:
-  mmau:
+  vlm:
     num_workers: 8
     reward_functions:
     - name: exact_alnum

@@ -0,0 +1,79 @@
+defaults: ../../grpo_math_1B_megatron.yaml
+grpo:
+  num_generations_per_prompt: 8
+  max_num_steps: 1000
+  val_batch_size: 32
+checkpointing:
+  enabled: true
+  checkpoint_dir: results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1
+  keep_top_k: 10
+  save_period: 20
+  checkpoint_must_save_by: 00:03:45:00
+policy:
+  model_name: Qwen/Qwen2.5-Omni-7B
+  train_global_batch_size: 32
+  train_micro_batch_size: 1
+  generation_batch_size: 32
+  logprob_batch_size: 1
+  max_total_sequence_length: 8192
+  tokenizer:
+    video:
+      num_frames: 8
+  sequence_packing:
+    enabled: false
+  generation:
+    max_new_tokens: 1024
+    vllm_cfg:
+      skip_tokenizer_init: false
+      gpu_memory_utilization: 0.4
+      limit_mm_per_prompt:
+        video: 1
+        audio: 1
+    vllm_kwargs:
+      mm_processor_cache_gb: 0
+  megatron_cfg:
+    converter_type: Qwen2_5OmniForConditionalGeneration
+    apply_rope_fusion: false
+    activation_checkpointing: true
+    tensor_model_parallel_size: 2
+    optimizer:
+      lr: 1.0e-06
+      min_lr: 1.0e-07
+    scheduler:
+      lr_warmup_iters: 10
+      lr_warmup_init: 1.0e-07
+    distributed_data_parallel_config:
+      overlap_grad_reduce: false
+data:
+  num_workers: 0
+  train:
+    dataset_name: intent-train
+    split: train
+    allowed_problem_types:
+    - multiple choice
+  validation:
+    dataset_name: intent-bench
+    split: validation
+    allowed_problem_types:
+    - multiple choice
+  default:
+    prompt_file: null
+    processor: vlm_hf_data_processor
+    env_name: vlm
+env:
+  vlm:
+    num_workers: 8
+    reward_functions:
+    - name: format
+      weight: 0.2
+    - name: exact_alnum
+      weight: 0.8
+logger:
+  wandb_enabled: true
+  tensorboard_enabled: true
+  wandb:
+    name: vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1
+  swanlab:
+    name: vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1
+cluster:
+  gpus_per_node: 8
@@ -0,0 +1 @@
+{} First reason briefly between <think> </think> tags, then output only the single option letter (e.g., A, B, C, D, ...) between <answer> </answer> tags. Format example: <think>your reasoning</think><answer>A</answer>
@@ -47,18 +47,19 @@ def parse_args():
     return args, overrides
 
 
-def setup_data(tokenizer, data_config, env_configs):
+def setup_data(tokenizer, data_config, env_configs, is_multimodal=False):
     print("Setting up data...")
 
     # load dataset
     base_dataset = load_eval_dataset(data_config)
     rekeyed_ds = base_dataset.rekeyed_ds
 
-    # Determine env from config: use explicit env_name if provided,
-    # otherwise fall back to the single key in env_configs.
+    # Mirrors nemo_rl/data/utils.py: use data.env_name to look up the env
+    # config block and determine the registered environment class.
     env_key = next(iter(env_configs))
     env_name = data_config.get("env_name", env_key)
-    env = create_env(env_name=env_name, env_config=env_configs[env_key])
+    registered_env_name = "vlm" if is_multimodal else env_name
+    env = create_env(env_name=registered_env_name, env_config=env_configs[env_name])
 
     dataset = AllTaskProcessedDataset(
         dataset=rekeyed_ds,
@@ -113,7 +114,7 @@ def main():
         dataset,
         env,
         tokenizer,
-    ) = setup_data(tokenizer, config.data, config.env)
+    ) = setup_data(tokenizer, config.data, config.env, is_multimodal=is_multimodal)
 
     # Setup
     (

@@ -177,6 +177,32 @@ class MMAUEvalDataConfig(TypedDict):
     env_name: NotRequired[str]
 
 
+class DailyOmniEvalDataConfig(TypedDict):
+    """Config for the Daily-Omni audio-visual eval dataset.
+
+    Mirrors the MMAU multimodal schema but with its own ``dataset_name`` literal
+    so the eval-config union resolves daily-omni unambiguously. Kept as a
+    ``TypedDict`` for consistency with the other (still v1) eval-data configs in
+    this union, whose consumers access the resolved config by key
+    (``config.data["dataset_name"]``).
+
+    Fields:
+        max_input_seq_length: Max prompt length passed to the generation backend.
+        dataset_name: Must be ``"daily-omni"``.
+        split: HuggingFace split to load.
+        prompt_file: Optional prompt template path.
+        system_prompt_file: Optional system prompt path.
+        env_name: Reward/eval environment name (e.g. ``"vlm"``).
+    """
+
+    max_input_seq_length: int
+    dataset_name: Literal["daily-omni"]
+    split: NotRequired[str | None]
+    prompt_file: NotRequired[str | None]
+    system_prompt_file: NotRequired[str | None]
+    env_name: NotRequired[str]
+
+
 # Union type for all eval dataset configs
 EvalDataConfigType = Union[
     MMLUEvalDataConfig,
@@ -185,5 +211,6 @@ class MMAUEvalDataConfig(TypedDict):
     GPQAEvalDataConfig,
     MathEvalDataConfig,
     MMAUEvalDataConfig,
+    DailyOmniEvalDataConfig,
     LocalMathEvalDataConfig,
 ]
@@ -117,6 +117,7 @@ def eval_collate_fn(data_batch: list[DatumSpec]) -> BatchedDataDict[Any]:
     message_log = [datum_spec["message_log"] for datum_spec in data_batch]
     extra_env_info = [datum_spec["extra_env_info"] for datum_spec in data_batch]
     idx = [datum_spec["idx"] for datum_spec in data_batch]
+    task_names = [datum_spec.get("task_name", None) for datum_spec in data_batch]
 
     # Check if any of the data batch has vllm content (multimodal data)
     extra_args = {}
@@ -132,11 +133,15 @@ def eval_collate_fn(data_batch: list[DatumSpec]) -> BatchedDataDict[Any]:
         extra_args["vllm_audios"] = [
             datum_spec.get("vllm_audios", []) for datum_spec in data_batch
         ]
+        extra_args["vllm_videos"] = [
+            datum_spec.get("vllm_videos", []) for datum_spec in data_batch
+        ]
 
     output: BatchedDataDict[Any] = BatchedDataDict(
         message_log=message_log,
         extra_env_info=extra_env_info,
         idx=idx,
+        task_name=task_names,
         **extra_args,
     )
     return output

@@ -15,6 +15,7 @@
 from typing import cast
 
 from nemo_rl.data.datasets.eval_datasets.aime import AIMEDataset, AIMEVariant
+from nemo_rl.data.datasets.eval_datasets.daily_omni import DailyOmniEvalDataset
 from nemo_rl.data.datasets.eval_datasets.gpqa import GPQADataset
 from nemo_rl.data.datasets.eval_datasets.local_math_dataset import LocalMathDataset
 from nemo_rl.data.datasets.eval_datasets.math import MathDataset
@@ -23,7 +24,7 @@
 from nemo_rl.data.datasets.eval_datasets.mmlu_pro import MMLUProDataset
 
 # Dataset names that require multimodal (VLM) processing
-MULTIMODAL_DATASETS = {"mmau", "TwinkStart/MMAU"}
+MULTIMODAL_DATASETS = {"mmau", "TwinkStart/MMAU", "daily-omni"}
 
 
 def _is_multimodal_dataset(dataset_name):
@@ -94,6 +95,14 @@ def load_eval_dataset(data_config):
             dataset_name="TwinkStart/MMAU",
             split=split,
         )
+    # daily-omni
+    elif dataset_name == "daily-omni":
+        split = data_config.get("split", "train")
+        base_dataset = DailyOmniEvalDataset(
+            split=split,
+            prompt_file=data_config.get("prompt_file"),
+            system_prompt_file=data_config.get("system_prompt_file"),
+        )
     # fall back to local dataset
     else:
         print(f"Loading dataset from {dataset_name}...")
@@ -112,6 +121,7 @@ def load_eval_dataset(data_config):
 
 __all__ = [
     "AIMEDataset",
+    "DailyOmniEvalDataset",
     "GPQADataset",
     "LocalMathDataset",
     "MathDataset",
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1 @@
		{} First reason briefly between <think> </think> tags, then output only the single option letter (e.g., A, B, C, D, ...) between <answer> </answer> tags. Format example: <think>your reasoning</think><answer>A</answer>