Merged
40 commits
ddc76a7
feat(eval): support Daily-Omni + Qwen2.5-Omni eval
yuekaizhang Jun 10, 2026
74b2a69
feat(grpo): add audio+video Intent GRPO recipe for Qwen2.5-Omni-3B
yuekaizhang Jun 10, 2026
6bfa8d5
fix(grpo-intent): use two-step processor call for audio+video samples
yuekaizhang Jun 10, 2026
25add99
fix(grpo-intent): pass prompt_token_ids to vLLM for audio+video samples
yuekaizhang Jun 10, 2026
57e0e53
fix(grpo-intent): pass audio + video as independent streams (no use_a…
yuekaizhang Jun 10, 2026
263d506
docs(grpo-intent): align comments + tests with verified independent-s…
yuekaizhang Jun 10, 2026
e2aeba9
docs(grpo-intent): match guide to verified independent-streams smoke run
yuekaizhang Jun 10, 2026
5e9a818
feat(grpo): expose per-component reward metrics in VLM validation
yuekaizhang Jun 10, 2026
0fd3f48
docs(grpo-intent): match Results section to per-component validation …
yuekaizhang Jun 10, 2026
e3dbc5e
fix(grpo-intent): explicit think+answer prompt + fallback reward to b…
yuekaizhang Jun 10, 2026
6fb6c00
feat(grpo-intent): audio+video Daily-Omni eval + intent prompt/config…
yuekaizhang Jun 15, 2026
2530250
revert: drop per-component VLM validation reward logging
yuekaizhang Jun 15, 2026
099ec14
refactor: drop exact_alnum_with_fallback reward
yuekaizhang Jun 15, 2026
37c2c11
refactor(grpo-audio-visual): standalone 7B recipe + 7B guide
yuekaizhang Jun 15, 2026
55bcf7f
test: add audio-visual GRPO megatron L1 functional test
yuekaizhang Jun 15, 2026
45ece43
docs(grpo-audio-visual): retitle, eval on Daily-Omni, link HumanOmniV…
yuekaizhang Jun 15, 2026
d9ca267
Merge branch 'main' into audio_video
yuekaizhang Jun 16, 2026
cc150ab
chore: apply ruff-format to intent dataset
yuekaizhang Jun 16, 2026
2871d57
chore: add eval_datasets/daily_omni.py to pyrefly project-includes
yuekaizhang Jun 16, 2026
e1aa32c
test: update test_dailyomni_dataset for audio+video content shape
yuekaizhang Jun 16, 2026
b5fd215
Merge branch 'main' into audio_video
yuekaizhang Jun 16, 2026
3d4c98a
Merge branch 'main' into audio_video
yuekaizhang Jun 23, 2026
5a5aca0
Merge branch 'main' into audio_video
yuekaizhang Jun 24, 2026
44cd7db
chore: address PR #2823 review comments (yuki-97)
yuekaizhang Jun 26, 2026
1a0d423
chore: inline intent 7B config into the nightly recipe
yuekaizhang Jun 26, 2026
a5f3b99
Merge branch 'main' into audio_video
yuekaizhang Jun 26, 2026
85b6499
fix(ci): add execute permission to intent test suite script
yuekaizhang Jun 29, 2026
e55f280
Merge branch 'main' into audio_video
yuekaizhang Jun 29, 2026
ba22b6e
fix(ci): raise nightly GPU hours limit from 2300 to 2310
yuekaizhang Jun 30, 2026
3c315ee
Merge branch 'main' into audio_video
yuekaizhang Jun 30, 2026
81d8869
fix: address PR #2823 review comments
yuekaizhang Jun 30, 2026
f41aca9
fix: tighten eval_daily_omni score threshold and add to fast tests
yuekaizhang Jun 30, 2026
b59840f
Merge branch 'main' into audio_video
yuekaizhang Jun 30, 2026
b7ee660
fix: reorder imports in datasets/utils.py per pre-commit
yuekaizhang Jun 30, 2026
93bc42f
fix(run_eval): handle env_name vs env config key mismatch
yuekaizhang Jun 30, 2026
9a41371
Merge branch 'main' into audio_video
yuekaizhang Jul 1, 2026
3bea1ac
fix: align run_eval env dispatch with data/utils.py and remove wrapper
yuekaizhang Jul 1, 2026
13db875
fix: use data_config.env_name to select env config block in run_eval
yuekaizhang Jul 1, 2026
95c92cd
fix: fall back to env_key when env_name not in env_configs
yuekaizhang Jul 1, 2026
bbff4cf
fix: unify env config key to vlm in mmau.yaml, simplify run_eval
yuekaizhang Jul 1, 2026
86 changes: 86 additions & 0 deletions docs/guides/grpo-audio-visual.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# Audio-Visual GRPO with Qwen2.5-Omni-7B

This guide explains how to use NeMo RL to train [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) with GRPO on the [PhilipC/IntentTrain](https://huggingface.co/datasets/PhilipC/IntentTrain) audio-visual intent-recognition dataset and evaluate on [Daily-Omni](https://huggingface.co/datasets/liarliar/Daily-Omni), following the dataset structure used in [HumanOmniV2](https://arxiv.org/abs/2506.21277).

Each training sample feeds the Qwen2.5-Omni processor both the video stream (8 frames) and the audio track decoded from the same file at 16 kHz mono. Audio and video flow as two **independent multimodal items** per prompt: the dataset emits `{type: video}` + `{type: audio}` content items, the Qwen2.5-Omni chat template renders both `<|VIDEO|>` and `<|AUDIO|>` placeholders, and vLLM rollouts populate `multi_modal_data["video"]` and `multi_modal_data["audio"]` from the same sample.
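
To make the two-stream layout concrete, here is a minimal sketch of how one prompt's content items could be grouped into a per-modality dictionary with the same shape as `multi_modal_data`. The helper name `group_multimodal_items`, the payload keys (`item["video"]`, `item["audio"]`), and the file name are illustrative assumptions, not NeMo RL or vLLM APIs.

```python
from typing import Any


def group_multimodal_items(content: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Group non-text content items by modality.

    Assumption: each item stores its payload under a key equal to its "type".
    """
    grouped: dict[str, list[Any]] = {}
    for item in content:
        modality = item["type"]
        if modality == "text":
            continue  # text is rendered by the chat template, not passed as media
        grouped.setdefault(modality, []).append(item[modality])
    return grouped


# One sample: the video frames and the audio track come from the same file
# but are passed as two independent items.
content = [
    {"type": "video", "video": "clip_0001.mp4"},
    {"type": "audio", "audio": "clip_0001.mp4"},
    {"type": "text", "text": "What is the speaker's intent?"},
]
multi_modal_data = group_multimodal_items(content)
print(sorted(multi_modal_data))  # ['audio', 'video']
```

Keeping one entry per modality is what lets the `<|VIDEO|>` and `<|AUDIO|>` placeholders each resolve to their own item.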

## 1. Train the Model

Run GRPO training with the provided config:

```
uv run examples/run_vlm_grpo.py --config examples/configs/recipes/vlm/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1.yaml
```

Config: `examples/configs/recipes/vlm/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1.yaml`

Key hyperparameters:

| Parameter | Value |
| --- | --- |
| Model | Qwen2.5-Omni-7B |
| Train dataset | PhilipC/IntentTrain (problem_type = "multiple choice") |
| Validation dataset | PhilipC/IntentBench (problem_type = "multiple choice") |
| Modalities per prompt | video (8 frames, `<\|VIDEO\|>` placeholder) + audio (16 kHz mono, `<\|AUDIO\|>` placeholder) — independent multimodal items, no `use_audio_in_video` alignment |
| GPUs | 1 node × 8 GPUs, Megatron backend, `tensor_model_parallel_size=2` (data parallel = 4) |
| Learning rate | 1e-6 |
| KL penalty | 0.01 |
| Generations per prompt | 8 |
| Prompts per step | 32 |
| Train global / micro batch | 32 / 1 |
| Max steps | 1000 |
| Save period | 20 |
| Reward | format (0.2) + exact_alnum (0.8) |
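
As a rough illustration of the reward row, the sketch below combines two component scores with the 0.2 / 0.8 weights from the table. It assumes the final reward is a plain weighted sum of per-component scores in [0, 1]; `combine_rewards` is a hypothetical helper, not the environment's actual implementation.

```python
REWARD_WEIGHTS = {"format": 0.2, "exact_alnum": 0.8}


def combine_rewards(component_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-component scores (assumed to lie in [0, 1])."""
    return sum(weight * component_scores[name] for name, weight in weights.items())


# A well-formed response with the wrong option only earns the format share.
well_formed_but_wrong = combine_rewards({"format": 1.0, "exact_alnum": 0.0}, REWARD_WEIGHTS)
# A well-formed response with the right option earns the full reward.
well_formed_and_right = combine_rewards({"format": 1.0, "exact_alnum": 1.0}, REWARD_WEIGHTS)
print(well_formed_but_wrong, well_formed_and_right)
```

Under this assumption, most of the signal comes from answer correctness, while the small format share still rewards responses that follow the think+answer layout.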

The dataset class downloads `PhilipC/IntentTrain` and `PhilipC/IntentBench` via `huggingface_hub.snapshot_download` and extracts each `videos.zip` once into the corresponding HuggingFace cache directory. Re-instantiating the dataset on a machine that already has the archives extracted is a no-op.

Only `problem_type == "multiple choice"` samples are used. The allow-list is configurable through `data.train.allowed_problem_types` and `data.validation.allowed_problem_types` if you want to extend scope (for example, to `emer_ov_mc`); doing so requires picking an answer-correctness reward that handles those answer formats.
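
The allow-list behaviour can be pictured with a small sketch. `filter_by_problem_type` and the sample records are hypothetical; the dataset class may apply the filter differently.

```python
from typing import Any, Iterable


def filter_by_problem_type(
    samples: Iterable[dict[str, Any]],
    allowed_problem_types: Iterable[str],
) -> list[dict[str, Any]]:
    """Keep only samples whose problem_type is on the allow-list."""
    allowed = set(allowed_problem_types)
    return [sample for sample in samples if sample.get("problem_type") in allowed]


samples = [
    {"id": 0, "problem_type": "multiple choice"},
    {"id": 1, "problem_type": "emer_ov_mc"},
    {"id": 2, "problem_type": "multiple choice"},
]
kept = filter_by_problem_type(samples, ["multiple choice"])
print([sample["id"] for sample in kept])  # [0, 2]
```

Extending the allow-list in this sketch is a one-line change; the harder part, as noted above, is choosing a reward that can score the additional answer formats.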

### 7B training notes

- **8 video frames** keep the prompt near 4.5k tokens (8×360 video + ~1.5k audio + text), under `max_total_sequence_length=8192`, and roughly halve the training-forward activation memory versus 16 frames. Do **not** switch to fps-based sampling: at fps=2 the clips expand to ~43k video tokens, which exceeds the token budget, and `vlm_hf_data_processor` then empties the multimodal items and sets `loss_multiplier=0`.
- **`activation_checkpointing: true` + `gpu_memory_utilization: 0.4`** keep the Megatron forward inside the memory vLLM leaves resident after sleep mode. If `tensor_model_parallel_size=2` OOMs, fall back to `tensor_model_parallel_size=4` (proven to run at 8 frames).
- If `loss_multiplier` is logged at 0 for many samples, the multimodal prompt is exceeding `max_total_sequence_length`; raise that limit until validation samples consistently produce a non-zero loss.
- Set `HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1` once `Qwen/Qwen2.5-Omni-7B`, `PhilipC/IntentTrain`, and `PhilipC/IntentBench` are pre-fetched, so Megatron's tokenizer worker doesn't hit the network.
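
The token-budget arithmetic in the first note can be checked with a back-of-the-envelope sketch. The 360 tokens per frame and ~1.5k audio tokens come from the estimate above; `text_tokens=120` is an assumed placeholder for the question and prompt text, and `estimate_prompt_tokens` is a hypothetical helper.

```python
MAX_TOTAL_SEQUENCE_LENGTH = 8192  # from the recipe config


def estimate_prompt_tokens(
    num_frames: int,
    tokens_per_frame: int = 360,  # per-frame estimate quoted in the note above
    audio_tokens: int = 1500,     # approximate audio-track estimate from the note above
    text_tokens: int = 120,       # assumption: rough size of the text prompt
) -> int:
    """Rough prompt length for one audio+video sample."""
    return num_frames * tokens_per_frame + audio_tokens + text_tokens


prompt_tokens = estimate_prompt_tokens(num_frames=8)
headroom = MAX_TOTAL_SEQUENCE_LENGTH - prompt_tokens
print(f"~{prompt_tokens} prompt tokens, {headroom} tokens left in the budget")
```

Plugging in a different frame count is a quick way to see whether a sampling change is likely to push samples past the limit before launching a run.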

## 2. Convert Checkpoint (Megatron to HF)

Checkpoints are saved under `results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1` (`checkpointing.checkpoint_dir`), one every `save_period=20` steps. Convert a checkpoint from Megatron to Hugging Face format before evaluating:

```
uv run --extra mcore python examples/converters/convert_megatron_to_hf.py \
  --config results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1/step_43/config.yaml \
  --megatron-ckpt-path results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1/step_43/policy/weights/iter_0000000 \
  --hf-ckpt-path results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1/step_43/hf --no-strict
```

Replace the step number with the checkpoint you want to evaluate. `--no-strict` is expected here: only the Qwen2.5-Omni *thinker* is trained, so the talker tensors are reported as "not written". The `--extra mcore` flag is required for the Megatron converter.
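
When converting several checkpoints, it can help to derive the three paths from the step number. The sketch below only mirrors the directory layout visible in the command above; `checkpoint_paths` is a hypothetical helper, and the fixed `iter_0000000` directory name is assumed to hold for other steps as well.

```python
from pathlib import PurePosixPath


def checkpoint_paths(checkpoint_dir: str, step: int) -> dict[str, str]:
    """Build the converter arguments for one saved step."""
    step_dir = PurePosixPath(checkpoint_dir) / f"step_{step}"
    return {
        "config": str(step_dir / "config.yaml"),
        # Assumption: the Megatron weights always sit under iter_0000000.
        "megatron_ckpt_path": str(step_dir / "policy" / "weights" / "iter_0000000"),
        "hf_ckpt_path": str(step_dir / "hf"),
    }


paths = checkpoint_paths("results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1", 43)
for flag, value in paths.items():
    print(f"--{flag.replace('_', '-')} {value}")
```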

## 3. Evaluate

In-training validation uses IntentBench as the validation set, so `val_period`, `val_batch_size`, and `max_val_samples` from the config drive evaluation cadence.

For a standalone benchmark, decode the converted HF checkpoint on [Daily-Omni](https://huggingface.co/datasets/liarliar/Daily-Omni) (1197 audio-visual multiple-choice questions) with `examples/run_eval.py`:

```
uv run examples/run_eval.py --config examples/configs/evals/daily_omni.yaml \
  generation.model_name=results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1/step_43/hf
```

The eval config (`examples/configs/evals/daily_omni.yaml`) feeds audio + video (32 frames — eval has no training-forward memory pressure, so it samples more densely than training), uses the same think+answer prompt as training, and scores with `exact_alnum` (case-insensitive exact match on the `<answer>` content).
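
For intuition, the scoring rule can be sketched as follows. This is not the library's `exact_alnum` implementation: stripping non-alphanumeric characters before comparison is an assumption suggested by the name, and `exact_alnum_sketch` is a hypothetical function.

```python
import re

ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize_alnum(text: str) -> str:
    """Lower-case the text and drop everything that is not a letter or digit."""
    return "".join(ch for ch in text if ch.isalnum()).lower()


def exact_alnum_sketch(response: str, ground_truth: str) -> float:
    """Return 1.0 if the <answer> content matches the ground truth, else 0.0."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return 0.0  # no <answer> block, nothing to score
    predicted = normalize_alnum(match.group(1))
    return 1.0 if predicted == normalize_alnum(ground_truth) else 0.0


print(exact_alnum_sketch("<think>The dog barks first.</think><answer> b </answer>", "B"))  # 1.0
print(exact_alnum_sketch("<think>Unsure.</think><answer>A</answer>", "B"))  # 0.0
```

A response that states the right letter outside the `<answer>` tags scores 0.0 in this sketch, which is why the prompt spells out the expected format.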

## 4. Results

Daily-Omni accuracy (1197 questions, greedy decoding) for the base Qwen2.5-Omni-7B versus the GRPO-trained checkpoint:

| Question type | Base | After GRPO |
| --- | --- | --- |
| **Overall** | **0.498** | **0.590** |
| AV Event Alignment | 0.353 | 0.450 |
| Comparative | 0.618 | 0.725 |
| Context understanding | 0.446 | 0.534 |
| Event Sequence | 0.395 | 0.490 |
| Inference | 0.714 | 0.760 |
| Reasoning | 0.651 | 0.766 |

GRPO lifts overall Daily-Omni accuracy by ~9 points (0.498 → 0.590), with gains in every question category. The largest absolute gain is on Reasoning (+11.5 points); the largest relative gains are on AV Event Alignment (+27%) and Event Sequence (+24%).
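
The per-category gains can be recomputed directly from the table. The sketch below copies the reported accuracies and derives absolute and relative improvements; only the numbers come from the table, the variable and function names are illustrative.

```python
# (base, after GRPO) accuracies copied from the results table.
RESULTS = {
    "AV Event Alignment": (0.353, 0.450),
    "Comparative": (0.618, 0.725),
    "Context understanding": (0.446, 0.534),
    "Event Sequence": (0.395, 0.490),
    "Inference": (0.714, 0.760),
    "Reasoning": (0.651, 0.766),
}


def gains(base: float, tuned: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain) for one category."""
    absolute = tuned - base
    return absolute, absolute / base


per_category = {name: gains(base, tuned) for name, (base, tuned) in RESULTS.items()}
largest_absolute = max(per_category, key=lambda name: per_category[name][0])
largest_relative = max(per_category, key=lambda name: per_category[name][1])

for name, (absolute, relative) in per_category.items():
    print(f"{name:<22} +{absolute * 100:.1f} points  (+{relative * 100:.1f}% relative)")
print("largest absolute gain:", largest_absolute)
print("largest relative gain:", largest_relative)
```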
8 changes: 8 additions & 0 deletions docs/index.md
@@ -121,6 +121,13 @@ Configure offline and online Eagle3 draft-model workflows to accelerate rollout
 Train Qwen2.5-Omni-3B with GRPO on AVQA and evaluate on MMAU, following the R1-AQA approach.
 :::

+:::{grid-item-card} {octicon}`device-camera-video` Audio-Visual Intent GRPO
+:link: guides/grpo-audio-visual
+:link-type: doc
+
+Train Qwen2.5-Omni-7B with GRPO on PhilipC/IntentTrain (audio-visual intent recognition) and evaluate on Daily-Omni, following HumanOmniV2's joint audio-visual setup.
+:::
+
 :::{grid-item-card} {octicon}`terminal` Two-Stage SWE RL (Qwen3 Thinking)
 :link: guides/swe-rl-qwen3
 :link-type: doc
@@ -271,6 +278,7 @@ guides/ppo.md
 guides/grpo-deepscaler.md
 guides/grpo-sliding-puzzle.md
 guides/grpo-audio.md
+guides/grpo-audio-visual.md
 guides/rm.md
 guides/environments.md
 guides/eval.md
49 changes: 49 additions & 0 deletions examples/configs/evals/daily_omni.yaml
@@ -0,0 +1,49 @@
# Daily-Omni audio-visual eval. Inherits the shared eval defaults and only
# overrides what differs for the Qwen2.5-Omni audio+video setup.
defaults: "eval.yaml"

generation:
model_name: "Qwen/Qwen2.5-Omni-7B"
vllm_cfg:
# 0.5 (vs the 0.9 default): with 32 video frames + audio, the Qwen2.5-Omni
# vision/audio encoder forward needs a large chunk of *transient
# activation* memory outside vLLM's KV-cache budget. At 0.9 the KV cache
# claims almost all VRAM and the first multimodal forward OOM-crashes the
# workers. 0.5 leaves ample headroom; KV cache is still far more than eval
# needs.
gpu_memory_utilization: 0.5
# Fit 32 video frames + the 16 kHz audio track without truncating the
# multimodal prompt (truncation silently masks samples out -> reward 0).
max_model_len: 32000
# Audio/multimodal models need the tokenizer initialized before generation.
skip_tokenizer_init: False
limit_mm_per_prompt:
video: 1
audio: 1
vllm_kwargs:
# Disable mm processor cache to avoid vLLM cache eviction during eval.
mm_processor_cache_gb: 0
# Cap concurrent sequences so the vision/audio encoder only processes a few
# clips per step. With audio + 32 frames, vLLM otherwise batches ~66 clips
# into one encoder forward and OOM-crashes the workers (encoder *activation*
# memory, not KV cache). Eval throughput is not a concern.
max_num_seqs: 8

tokenizer:
video:
# 32 frames (vs 16): 60s clips at 16 frames is ~1 frame / 3.75s, too sparse
# for fine-grained temporal (Event Sequence) questions.
num_frames: 32

data:
prompt_file: examples/prompts/daily_omni.txt
dataset_name: "daily-omni"
split: "train"
env_name: vlm

env:
vlm:
num_workers: 8
reward_functions:
- name: exact_alnum
weight: 1.0
2 changes: 1 addition & 1 deletion examples/configs/evals/mmau.yaml
@@ -47,7 +47,7 @@ data:
   env_name: vlm

 env:
-  mmau:
+  vlm:
     num_workers: 8
     reward_functions:
       - name: exact_alnum
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
defaults: ../../grpo_math_1B_megatron.yaml
grpo:
num_generations_per_prompt: 8
max_num_steps: 1000
val_batch_size: 32
checkpointing:
enabled: true
checkpoint_dir: results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1
keep_top_k: 10
save_period: 20
checkpoint_must_save_by: 00:03:45:00
policy:
model_name: Qwen/Qwen2.5-Omni-7B
train_global_batch_size: 32
train_micro_batch_size: 1
generation_batch_size: 32
logprob_batch_size: 1
max_total_sequence_length: 8192
tokenizer:
video:
num_frames: 8
sequence_packing:
enabled: false
generation:
max_new_tokens: 1024
vllm_cfg:
skip_tokenizer_init: false
gpu_memory_utilization: 0.4
limit_mm_per_prompt:
video: 1
audio: 1
vllm_kwargs:
mm_processor_cache_gb: 0
megatron_cfg:
converter_type: Qwen2_5OmniForConditionalGeneration
apply_rope_fusion: false
activation_checkpointing: true
tensor_model_parallel_size: 2
optimizer:
lr: 1.0e-06
min_lr: 1.0e-07
scheduler:
lr_warmup_iters: 10
lr_warmup_init: 1.0e-07
distributed_data_parallel_config:
overlap_grad_reduce: false
data:
num_workers: 0
train:
dataset_name: intent-train
split: train
allowed_problem_types:
- multiple choice
validation:
dataset_name: intent-bench
split: validation
allowed_problem_types:
- multiple choice
default:
prompt_file: null
processor: vlm_hf_data_processor
env_name: vlm
env:
vlm:
num_workers: 8
reward_functions:
- name: format
weight: 0.2
- name: exact_alnum
weight: 0.8
logger:
wandb_enabled: true
tensorboard_enabled: true
wandb:
name: vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1
swanlab:
name: vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1
cluster:
gpus_per_node: 8
1 change: 1 addition & 0 deletions examples/prompts/daily_omni.txt
@@ -0,0 +1 @@
{} First reason briefly between <think> </think> tags, then output only the single option letter (e.g., A, B, C, D, ...) between <answer> </answer> tags. Format example: <think>your reasoning</think><answer>A</answer>
11 changes: 6 additions & 5 deletions examples/run_eval.py
@@ -47,18 +47,19 @@ def parse_args():
     return args, overrides


-def setup_data(tokenizer, data_config, env_configs):
+def setup_data(tokenizer, data_config, env_configs, is_multimodal=False):
     print("Setting up data...")

     # load dataset
     base_dataset = load_eval_dataset(data_config)
     rekeyed_ds = base_dataset.rekeyed_ds

-    # Determine env from config: use explicit env_name if provided,
-    # otherwise fall back to the single key in env_configs.
+    # Mirrors nemo_rl/data/utils.py: use data.env_name to look up the env
+    # config block and determine the registered environment class.
     env_key = next(iter(env_configs))
     env_name = data_config.get("env_name", env_key)
-    env = create_env(env_name=env_name, env_config=env_configs[env_key])
+    registered_env_name = "vlm" if is_multimodal else env_name
+    env = create_env(env_name=registered_env_name, env_config=env_configs[env_name])

     dataset = AllTaskProcessedDataset(
         dataset=rekeyed_ds,
@@ -113,7 +114,7 @@ def main():
         dataset,
         env,
         tokenizer,
-    ) = setup_data(tokenizer, config.data, config.env)
+    ) = setup_data(tokenizer, config.data, config.env, is_multimodal=is_multimodal)

     # Setup
     (
27 changes: 27 additions & 0 deletions nemo_rl/data/__init__.py
@@ -177,6 +177,32 @@ class MMAUEvalDataConfig(TypedDict):
     env_name: NotRequired[str]


+class DailyOmniEvalDataConfig(TypedDict):
+    """Config for the Daily-Omni audio-visual eval dataset.
+
+    Mirrors the MMAU multimodal schema but with its own ``dataset_name`` literal
+    so the eval-config union resolves daily-omni unambiguously. Kept as a
+    ``TypedDict`` for consistency with the other (still v1) eval-data configs in
+    this union, whose consumers access the resolved config by key
+    (``config.data["dataset_name"]``).
+
+    Fields:
+        max_input_seq_length: Max prompt length passed to the generation backend.
+        dataset_name: Must be ``"daily-omni"``.
+        split: HuggingFace split to load.
+        prompt_file: Optional prompt template path.
+        system_prompt_file: Optional system prompt path.
+        env_name: Reward/eval environment name (e.g. ``"vlm"``).
+    """
+
+    max_input_seq_length: int
+    dataset_name: Literal["daily-omni"]
+    split: NotRequired[str | None]
+    prompt_file: NotRequired[str | None]
+    system_prompt_file: NotRequired[str | None]
+    env_name: NotRequired[str]
+
+
 # Union type for all eval dataset configs
 EvalDataConfigType = Union[
     MMLUEvalDataConfig,
@@ -185,5 +211,6 @@ class MMAUEvalDataConfig(TypedDict):
     GPQAEvalDataConfig,
     MathEvalDataConfig,
     MMAUEvalDataConfig,
+    DailyOmniEvalDataConfig,
     LocalMathEvalDataConfig,
 ]
5 changes: 5 additions & 0 deletions nemo_rl/data/collate_fn.py
@@ -117,6 +117,7 @@ def eval_collate_fn(data_batch: list[DatumSpec]) -> BatchedDataDict[Any]:
     message_log = [datum_spec["message_log"] for datum_spec in data_batch]
     extra_env_info = [datum_spec["extra_env_info"] for datum_spec in data_batch]
     idx = [datum_spec["idx"] for datum_spec in data_batch]
+    task_names = [datum_spec.get("task_name", None) for datum_spec in data_batch]

     # Check if any of the data batch has vllm content (multimodal data)
     extra_args = {}
@@ -132,11 +133,15 @@ def eval_collate_fn(data_batch: list[DatumSpec]) -> BatchedDataDict[Any]:
         extra_args["vllm_audios"] = [
             datum_spec.get("vllm_audios", []) for datum_spec in data_batch
         ]
+        extra_args["vllm_videos"] = [
+            datum_spec.get("vllm_videos", []) for datum_spec in data_batch
+        ]

     output: BatchedDataDict[Any] = BatchedDataDict(
         message_log=message_log,
         extra_env_info=extra_env_info,
         idx=idx,
+        task_name=task_names,
         **extra_args,
     )
     return output
12 changes: 11 additions & 1 deletion nemo_rl/data/datasets/eval_datasets/__init__.py
@@ -15,6 +15,7 @@
 from typing import cast

 from nemo_rl.data.datasets.eval_datasets.aime import AIMEDataset, AIMEVariant
+from nemo_rl.data.datasets.eval_datasets.daily_omni import DailyOmniEvalDataset
 from nemo_rl.data.datasets.eval_datasets.gpqa import GPQADataset
 from nemo_rl.data.datasets.eval_datasets.local_math_dataset import LocalMathDataset
 from nemo_rl.data.datasets.eval_datasets.math import MathDataset
@@ -23,7 +24,7 @@
 from nemo_rl.data.datasets.eval_datasets.mmlu_pro import MMLUProDataset

 # Dataset names that require multimodal (VLM) processing
-MULTIMODAL_DATASETS = {"mmau", "TwinkStart/MMAU"}
+MULTIMODAL_DATASETS = {"mmau", "TwinkStart/MMAU", "daily-omni"}


 def _is_multimodal_dataset(dataset_name):
@@ -94,6 +95,14 @@ def load_eval_dataset(data_config):
             dataset_name="TwinkStart/MMAU",
             split=split,
         )
+    # daily-omni
+    elif dataset_name == "daily-omni":
+        split = data_config.get("split", "train")
+        base_dataset = DailyOmniEvalDataset(
+            split=split,
+            prompt_file=data_config.get("prompt_file"),
+            system_prompt_file=data_config.get("system_prompt_file"),
+        )
     # fall back to local dataset
     else:
         print(f"Loading dataset from {dataset_name}...")
@@ -112,6 +121,7 @@ def load_eval_dataset(data_config):

 __all__ = [
     "AIMEDataset",
+    "DailyOmniEvalDataset",
     "GPQADataset",
     "LocalMathDataset",
     "MathDataset",