feat: video + audio understanding GRPO training recipe #2823
Merged
40 commits
ddc76a7
feat(eval): support Daily-Omni + Qwen2.5-Omni eval
yuekaizhang 74b2a69
feat(grpo): add audio+video Intent GRPO recipe for Qwen2.5-Omni-3B
yuekaizhang 6bfa8d5
fix(grpo-intent): use two-step processor call for audio+video samples
yuekaizhang 25add99
fix(grpo-intent): pass prompt_token_ids to vLLM for audio+video samples
yuekaizhang 57e0e53
fix(grpo-intent): pass audio + video as independent streams (no use_a…
yuekaizhang 263d506
docs(grpo-intent): align comments + tests with verified independent-s…
yuekaizhang e2aeba9
docs(grpo-intent): match guide to verified independent-streams smoke run
yuekaizhang 5e9a818
feat(grpo): expose per-component reward metrics in VLM validation
yuekaizhang 0fd3f48
docs(grpo-intent): match Results section to per-component validation …
yuekaizhang e3dbc5e
fix(grpo-intent): explicit think+answer prompt + fallback reward to b…
yuekaizhang 6fb6c00
feat(grpo-intent): audio+video Daily-Omni eval + intent prompt/config…
yuekaizhang 2530250
revert: drop per-component VLM validation reward logging
yuekaizhang 099ec14
refactor: drop exact_alnum_with_fallback reward
yuekaizhang 37c2c11
refactor(grpo-audio-visual): standalone 7B recipe + 7B guide
yuekaizhang 55bcf7f
test: add audio-visual GRPO megatron L1 functional test
yuekaizhang 45ece43
docs(grpo-audio-visual): retitle, eval on Daily-Omni, link HumanOmniV…
yuekaizhang d9ca267
Merge branch 'main' into audio_video
yuekaizhang cc150ab
chore: apply ruff-format to intent dataset
yuekaizhang 2871d57
chore: add eval_datasets/daily_omni.py to pyrefly project-includes
yuekaizhang e1aa32c
test: update test_dailyomni_dataset for audio+video content shape
yuekaizhang b5fd215
Merge branch 'main' into audio_video
yuekaizhang 3d4c98a
Merge branch 'main' into audio_video
yuekaizhang 5a5aca0
Merge branch 'main' into audio_video
yuekaizhang 44cd7db
chore: address PR #2823 review comments (yuki-97)
yuekaizhang 1a0d423
chore: inline intent 7B config into the nightly recipe
yuekaizhang a5f3b99
Merge branch 'main' into audio_video
yuekaizhang 85b6499
fix(ci): add execute permission to intent test suite script
yuekaizhang e55f280
Merge branch 'main' into audio_video
yuekaizhang ba22b6e
fix(ci): raise nightly GPU hours limit from 2300 to 2310
yuekaizhang 3c315ee
Merge branch 'main' into audio_video
yuekaizhang 81d8869
fix: address PR #2823 review comments
yuekaizhang f41aca9
fix: tighten eval_daily_omni score threshold and add to fast tests
yuekaizhang b59840f
Merge branch 'main' into audio_video
yuekaizhang b7ee660
fix: reorder imports in datasets/utils.py per pre-commit
yuekaizhang 93bc42f
fix(run_eval): handle env_name vs env config key mismatch
yuekaizhang 9a41371
Merge branch 'main' into audio_video
yuekaizhang 3bea1ac
fix: align run_eval env dispatch with data/utils.py and remove wrapper
yuekaizhang 13db875
fix: use data_config.env_name to select env config block in run_eval
yuekaizhang 95c92cd
fix: fall back to env_key when env_name not in env_configs
yuekaizhang bbff4cf
fix: unify env config key to vlm in mmau.yaml, simplify run_eval
yuekaizhang
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,86 @@ | ||
| # Audio-Visual GRPO with Qwen2.5-Omni-7B | ||
|
|
||
| This guide explains how to use NeMo RL to train [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) with GRPO on the [PhilipC/IntentTrain](https://huggingface.co/datasets/PhilipC/IntentTrain) audio-visual intent-recognition dataset and evaluate on [Daily-Omni](https://huggingface.co/datasets/liarliar/Daily-Omni), following the dataset structure used in [HumanOmniV2](https://arxiv.org/abs/2506.21277). | ||
|
|
||
| Each training sample feeds the Qwen2.5-Omni processor both the video stream (8 frames) and the audio track decoded from the same file at 16 kHz mono. Audio and video flow as two **independent multimodal items** per prompt: the dataset emits `{type: video}` + `{type: audio}` content items, the Qwen2.5-Omni chat template renders both `<|VIDEO|>` and `<|AUDIO|>` placeholders, and vLLM rollouts populate `multi_modal_data["video"]` and `multi_modal_data["audio"]` from the same sample. | ||
|
|
||
| ## 1. Train the Model | ||
|
|
||
| Run GRPO training with the provided config: | ||
|
|
||
| ``` | ||
| uv run examples/run_vlm_grpo.py --config examples/configs/recipes/vlm/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1.yaml | ||
| ``` | ||
|
|
||
| Config: `examples/configs/recipes/vlm/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1.yaml` | ||
|
|
||
| Key hyperparameters: | ||
|
|
||
| | Parameter | Value | | ||
| | --- | --- | | ||
| | Model | Qwen2.5-Omni-7B | | ||
| | Train dataset | PhilipC/IntentTrain (problem_type = "multiple choice") | | ||
| | Validation dataset | PhilipC/IntentBench (problem_type = "multiple choice") | | ||
| | Modalities per prompt | video (8 frames, `<\|VIDEO\|>` placeholder) + audio (16 kHz mono, `<\|AUDIO\|>` placeholder) — independent multimodal items, no `use_audio_in_video` alignment | | ||
| | GPUs | 1 node x 8 GPUs, Megatron backend, `tensor_model_parallel_size=2` (data parallel = 4) | | ||
| | Learning rate | 1e-6 | | ||
| | KL penalty | 0.01 | | ||
| | Generations per prompt | 8 | | ||
| | Prompts per step | 32 | | ||
| | Train global / micro batch | 32 / 1 | | ||
| | Max steps | 1000 | | ||
| | Save period | 20 | | ||
| | Reward | format (0.2) + exact_alnum (0.8) | | ||
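The reward row combines a format check with an answer check. A minimal sketch of such a weighted sum follows; the function names, the regular expressions, and the alphanumeric normalization are illustrative assumptions inferred from the reward names, not NeMo RL's actual implementation.

```python
import re

# Hypothetical stand-ins for the `format` and `exact_alnum` rewards.
# Assumed format: a <think> block followed by an <answer> block, nothing else.
THINK_ANSWER = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def _alnum(text: str) -> str:
    """Lower-case the text and keep only alphanumeric characters."""
    return "".join(ch for ch in text if ch.isalnum()).lower()


def format_reward(response: str) -> float:
    """1.0 if the response follows the think+answer layout, else 0.0."""
    return 1.0 if THINK_ANSWER.match(response) else 0.0


def exact_alnum_reward(response: str, ground_truth: str) -> float:
    """1.0 if the normalized <answer> content equals the normalized label."""
    match = ANSWER.search(response)
    if match is None:
        return 0.0
    return 1.0 if _alnum(match.group(1)) == _alnum(ground_truth) else 0.0


def combined_reward(
    response: str, ground_truth: str, w_format: float = 0.2, w_answer: float = 0.8
) -> float:
    """Weighted sum using the weights from the table (0.2 and 0.8)."""
    return w_format * format_reward(response) + w_answer * exact_alnum_reward(
        response, ground_truth
    )
```

With these weights, a well-formatted wrong answer still earns 0.2, while a correct answer without the `<think>` block earns 0.8.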
|
|
||
| The dataset class downloads `PhilipC/IntentTrain` and `PhilipC/IntentBench` via `huggingface_hub.snapshot_download` and extracts each `videos.zip` once into the corresponding HuggingFace cache directory. Re-instantiating the dataset on a machine that already has the archives extracted is a no-op. | ||
|
|
||
| Only `problem_type == "multiple choice"` samples are used. The allow-list is configurable through `data.train.allowed_problem_types` and `data.validation.allowed_problem_types` if you want to extend scope (for example, to `emer_ov_mc`); doing so requires picking an answer-correctness reward that handles those answer formats. | ||
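The allow-list behaviour described above can be pictured as a simple filter. This is a sketch only: the record layout (a dict with a `problem_type` key) and the helper name are assumptions for illustration, not the dataset class's real code.

```python
from typing import Iterable


def filter_by_problem_type(
    samples: Iterable[dict], allowed_problem_types: Iterable[str]
) -> list[dict]:
    """Keep only samples whose `problem_type` is in the allow-list."""
    allowed = set(allowed_problem_types)
    return [s for s in samples if s.get("problem_type") in allowed]


# Hypothetical records, shaped loosely like the dataset's annotations.
samples = [
    {"id": 0, "problem_type": "multiple choice"},
    {"id": 1, "problem_type": "emer_ov_mc"},
    {"id": 2, "problem_type": "multiple choice"},
]
kept = filter_by_problem_type(samples, ["multiple choice"])
```

Extending the list (for example with `emer_ov_mc`) only changes which samples pass the filter; the reward still has to understand the resulting answer format.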
|
|
||
| ### 7B training notes | ||
|
|
||
| - **8 video frames** keep the prompt at roughly 4.5k tokens (8×360 video + ~1.5k audio + text), under `max_total_sequence_length=8192`, and roughly halve the training-forward activation memory versus 16 frames. Do **not** switch to fps-based sampling: at fps=2 the clips expand to ~43k video tokens, blow past the token budget, and `vlm_hf_data_processor` then empties the multimodal items and sets `loss_multiplier=0`. | ||
| - **`activation_checkpointing: true` + `gpu_memory_utilization: 0.4`** keep the Megatron forward inside the memory vLLM leaves resident after sleep mode. If `tensor_model_parallel_size=2` OOMs, fall back to `tensor_model_parallel_size=4` (proven to run at 8 frames). | ||
| - If `loss_multiplier` is logged at 0 for many samples, the multimodal prompt is exceeding `max_total_sequence_length`; bump it until validation samples consistently produce non-zero loss. | ||
| - Set `HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1` once `Qwen/Qwen2.5-Omni-7B`, `PhilipC/IntentTrain`, and `PhilipC/IntentBench` are pre-fetched, so Megatron's tokenizer worker doesn't hit the network. | ||
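The token budget in the first note can be checked with a few lines of arithmetic. The per-frame and audio figures come from the guide; the text-token count and the 60-second clip length are assumptions chosen for illustration (the latter reproduces the ~43k figure at fps=2).

```python
TOKENS_PER_FRAME = 360            # from the guide: 8 x 360 video tokens
AUDIO_TOKENS = 1500               # from the guide: ~1.5k audio tokens
TEXT_TOKENS = 120                 # assumption: question + options + template
MAX_TOTAL_SEQUENCE_LENGTH = 8192  # policy.max_total_sequence_length


def video_tokens(num_frames: int) -> int:
    """Video tokens for a fixed number of sampled frames."""
    return num_frames * TOKENS_PER_FRAME


def prompt_tokens(num_frames: int) -> int:
    """Approximate prompt length: video + audio + text."""
    return video_tokens(num_frames) + AUDIO_TOKENS + TEXT_TOKENS


def frames_at_fps(fps: float, clip_seconds: float) -> int:
    """Frames produced by fps-based sampling of a clip."""
    return int(fps * clip_seconds)


fixed = prompt_tokens(8)                         # fixed 8-frame sampling
fps_based = video_tokens(frames_at_fps(2, 60))   # assumed 60 s clip at fps=2

print(f"8 frames: ~{fixed} prompt tokens, fits={fixed <= MAX_TOTAL_SEQUENCE_LENGTH}")
print(f"fps=2:    ~{fps_based} video tokens, fits={fps_based <= MAX_TOTAL_SEQUENCE_LENGTH}")
```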
|
|
||
| ## 2. Convert Checkpoint (Megatron to HF) | ||
|
|
||
| Checkpoints are saved under `results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1` (`checkpointing.checkpoint_dir`), one every `save_period=20` steps. Convert a checkpoint from Megatron to Hugging Face format before evaluating: | ||
|
|
||
| ``` | ||
| uv run --extra mcore python examples/converters/convert_megatron_to_hf.py \ | ||
| --config results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1/step_43/config.yaml \ | ||
| --megatron-ckpt-path results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1/step_43/policy/weights/iter_0000000 \ | ||
| --hf-ckpt-path results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1/step_43/hf --no-strict | ||
| ``` | ||
|
|
||
| Replace the step number with the checkpoint you want to evaluate. `--no-strict` is expected here: only the Qwen2.5-Omni *thinker* is trained, so the talker tensors are reported as "not written". The `--extra mcore` flag is required for the Megatron converter. | ||
|
|
||
| ## 3. Evaluate | ||
|
|
||
| In-training validation uses IntentBench as the validation set, so `val_period`, `val_batch_size`, and `max_val_samples` from the config drive evaluation cadence. | ||
|
|
||
| For a standalone benchmark, decode the converted HF checkpoint on [Daily-Omni](https://huggingface.co/datasets/liarliar/Daily-Omni) (1197 audio-visual multiple-choice questions) with `examples/run_eval.py`: | ||
|
|
||
| ``` | ||
| uv run examples/run_eval.py --config examples/configs/evals/daily_omni.yaml \ | ||
| generation.model_name=results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1/step_43/hf | ||
| ``` | ||
|
|
||
| The eval config (`examples/configs/evals/daily_omni.yaml`) feeds audio + video (32 frames — eval has no training-forward memory pressure, so it samples more densely than training), uses the same think+answer prompt as training, and scores with `exact_alnum` (case-insensitive exact match on the `<answer>` content). | ||
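To see why the frame count matters for temporal questions, the sketch below computes the spacing between sampled frames and picks evenly spaced frame indices. Uniform sampling is an assumption here; the actual frame selection is done by the processor. The 60-second clip length follows the comment in the eval config, and the 30 fps source rate is hypothetical.

```python
def seconds_per_frame(clip_seconds: float, num_frames: int) -> float:
    """Average time span covered by each sampled frame."""
    return clip_seconds / num_frames


def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` indices spread evenly over [0, total_frames - 1]."""
    if num_frames >= total_frames:
        return list(range(total_frames))
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


# Assumed 60 s clip decoded at a hypothetical 30 fps (1800 frames).
train_idx = uniform_frame_indices(1800, 8)
eval_idx = uniform_frame_indices(1800, 32)

for n in (8, 16, 32):
    print(f"{n:2d} frames -> one frame every {seconds_per_frame(60, n):.3f} s")
```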
|
|
||
| ## 4. Results | ||
|
|
||
| Daily-Omni accuracy (1197 questions, greedy decoding) for the base Qwen2.5-Omni-7B versus the GRPO-trained checkpoint: | ||
|
|
||
| | Question type | Base | After GRPO | | ||
| | --- | --- | --- | | ||
| | **Overall** | **0.498** | **0.590** | | ||
| | AV Event Alignment | 0.353 | 0.450 | | ||
| | Comparative | 0.618 | 0.725 | | ||
| | Context understanding | 0.446 | 0.534 | | ||
| | Event Sequence | 0.395 | 0.490 | | ||
| | Inference | 0.714 | 0.760 | | ||
| | Reasoning | 0.651 | 0.766 | | ||
|
|
||
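| GRPO lifts overall Daily-Omni accuracy by ~9 points (0.498 to 0.590), with gains in every question category. The largest absolute gains are on Reasoning (+11.5 points) and Comparative (+10.7 points); the largest relative gains are on AV Event Alignment (+27%) and Event Sequence (+24%). |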
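The per-category deltas behind this summary can be recomputed directly from the table. A small sketch, with the accuracies copied from the rows above and helper names chosen for illustration:

```python
# (base, after GRPO) accuracies copied from the results table.
RESULTS = {
    "AV Event Alignment": (0.353, 0.450),
    "Comparative": (0.618, 0.725),
    "Context understanding": (0.446, 0.534),
    "Event Sequence": (0.395, 0.490),
    "Inference": (0.714, 0.760),
    "Reasoning": (0.651, 0.766),
}
OVERALL = (0.498, 0.590)


def gains(base: float, tuned: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain) for one row."""
    return tuned - base, (tuned - base) / base


def rank(by: str) -> list[str]:
    """Category names sorted by 'absolute' or 'relative' gain, largest first."""
    idx = 0 if by == "absolute" else 1
    return sorted(RESULTS, key=lambda name: gains(*RESULTS[name])[idx], reverse=True)


for name in rank("absolute"):
    abs_gain, rel_gain = gains(*RESULTS[name])
    print(f"{name:22s} +{100 * abs_gain:.1f} pts  (+{100 * rel_gain:.1f}% relative)")
```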
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,49 @@ | ||
| # Daily-Omni audio-visual eval. Inherits the shared eval defaults and only | ||
| # overrides what differs for the Qwen2.5-Omni audio+video setup. | ||
| defaults: "eval.yaml" | ||
|
|
||
| generation: | ||
| model_name: "Qwen/Qwen2.5-Omni-7B" | ||
| vllm_cfg: | ||
| # 0.5 (vs the 0.9 default): with 32 video frames + audio, the Qwen2.5-Omni | ||
| # vision/audio encoder forward needs a large chunk of *transient | ||
| # activation* memory outside vLLM's KV-cache budget. At 0.9 the KV cache | ||
| # claims almost all VRAM and the first multimodal forward OOM-crashes the | ||
| # workers. 0.5 leaves ample headroom; KV cache is still far more than eval | ||
| # needs. | ||
| gpu_memory_utilization: 0.5 | ||
| # Fit 32 video frames + the 16 kHz audio track without truncating the | ||
| # multimodal prompt (truncation silently masks samples out -> reward 0). | ||
| max_model_len: 32000 | ||
| # Audio/multimodal models need the tokenizer initialized before generation. | ||
| skip_tokenizer_init: False | ||
| limit_mm_per_prompt: | ||
| video: 1 | ||
| audio: 1 | ||
| vllm_kwargs: | ||
| # Disable mm processor cache to avoid vLLM cache eviction during eval. | ||
| mm_processor_cache_gb: 0 | ||
| # Cap concurrent sequences so the vision/audio encoder only processes a few | ||
| # clips per step. With audio + 32 frames, vLLM otherwise batches ~66 clips | ||
| # into one encoder forward and OOM-crashes the workers (encoder *activation* | ||
| # memory, not KV cache). Eval throughput is not a concern. | ||
| max_num_seqs: 8 | ||
|
|
||
| tokenizer: | ||
| video: | ||
| # 32 frames (vs 16): 60s clips at 16 frames is ~1 frame / 3.75s, too sparse | ||
| # for fine-grained temporal (Event Sequence) questions. | ||
| num_frames: 32 | ||
|
|
||
| data: | ||
| prompt_file: examples/prompts/daily_omni.txt | ||
| dataset_name: "daily-omni" | ||
| split: "train" | ||
| env_name: vlm | ||
|
|
||
| env: | ||
| vlm: | ||
| num_workers: 8 | ||
| reward_functions: | ||
| - name: exact_alnum | ||
| weight: 1.0 | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -47,7 +47,7 @@ data: | |
| env_name: vlm | ||
|
|
||
| env: | ||
| - mmau: | ||
| + vlm: | ||
| num_workers: 8 | ||
| reward_functions: | ||
| - name: exact_alnum | ||
|
|
||
79 changes: 79 additions & 0 deletions
examples/configs/recipes/vlm/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,79 @@ | ||
| defaults: ../../grpo_math_1B_megatron.yaml | ||
| grpo: | ||
| num_generations_per_prompt: 8 | ||
| max_num_steps: 1000 | ||
| val_batch_size: 32 | ||
| checkpointing: | ||
| enabled: true | ||
| checkpoint_dir: results/vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1 | ||
| keep_top_k: 10 | ||
| save_period: 20 | ||
| checkpoint_must_save_by: 00:03:45:00 | ||
| policy: | ||
| model_name: Qwen/Qwen2.5-Omni-7B | ||
| train_global_batch_size: 32 | ||
| train_micro_batch_size: 1 | ||
| generation_batch_size: 32 | ||
| logprob_batch_size: 1 | ||
| max_total_sequence_length: 8192 | ||
| tokenizer: | ||
| video: | ||
| num_frames: 8 | ||
| sequence_packing: | ||
| enabled: false | ||
| generation: | ||
| max_new_tokens: 1024 | ||
| vllm_cfg: | ||
| skip_tokenizer_init: false | ||
| gpu_memory_utilization: 0.4 | ||
| limit_mm_per_prompt: | ||
| video: 1 | ||
| audio: 1 | ||
| vllm_kwargs: | ||
| mm_processor_cache_gb: 0 | ||
| megatron_cfg: | ||
| converter_type: Qwen2_5OmniForConditionalGeneration | ||
| apply_rope_fusion: false | ||
| activation_checkpointing: true | ||
| tensor_model_parallel_size: 2 | ||
| optimizer: | ||
| lr: 1.0e-06 | ||
| min_lr: 1.0e-07 | ||
| scheduler: | ||
| lr_warmup_iters: 10 | ||
| lr_warmup_init: 1.0e-07 | ||
| distributed_data_parallel_config: | ||
| overlap_grad_reduce: false | ||
| data: | ||
| num_workers: 0 | ||
| train: | ||
| dataset_name: intent-train | ||
| split: train | ||
| allowed_problem_types: | ||
| - multiple choice | ||
| validation: | ||
| dataset_name: intent-bench | ||
| split: validation | ||
| allowed_problem_types: | ||
| - multiple choice | ||
| default: | ||
| prompt_file: null | ||
| processor: vlm_hf_data_processor | ||
| env_name: vlm | ||
| env: | ||
| vlm: | ||
| num_workers: 8 | ||
| reward_functions: | ||
| - name: format | ||
| weight: 0.2 | ||
| - name: exact_alnum | ||
| weight: 0.8 | ||
| logger: | ||
| wandb_enabled: true | ||
| tensorboard_enabled: true | ||
| wandb: | ||
| name: vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1 | ||
| swanlab: | ||
| name: vlm_grpo-qwen2.5-omni-7b-intent-1n8g-megatron.v1 | ||
| cluster: | ||
| gpus_per_node: 8 |
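A few derived quantities follow from this recipe. The sketch below recomputes them; pipeline and context parallel sizes of 1 are assumptions (the recipe does not set them), the prompts-per-step value is taken from the guide's hyperparameter table, and the reading of "global batches per step" assumes the trainer splits each step's rollouts into batches of `train_global_batch_size`.

```python
GPUS_PER_NODE = 8                 # cluster.gpus_per_node
NUM_NODES = 1                     # "1n8g" in the recipe name
TENSOR_MODEL_PARALLEL_SIZE = 2    # policy.megatron_cfg.tensor_model_parallel_size
PIPELINE_MODEL_PARALLEL_SIZE = 1  # assumption: not set in this recipe
CONTEXT_PARALLEL_SIZE = 1         # assumption: not set in this recipe

NUM_PROMPTS_PER_STEP = 32         # from the guide's hyperparameter table
NUM_GENERATIONS_PER_PROMPT = 8    # grpo.num_generations_per_prompt
TRAIN_GLOBAL_BATCH_SIZE = 32      # policy.train_global_batch_size


def data_parallel_size(world_size: int, tp: int, pp: int = 1, cp: int = 1) -> int:
    """Data-parallel replicas left after model parallelism is accounted for."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tp * pp * cp")
    return world_size // model_parallel


world_size = GPUS_PER_NODE * NUM_NODES
dp = data_parallel_size(
    world_size,
    TENSOR_MODEL_PARALLEL_SIZE,
    PIPELINE_MODEL_PARALLEL_SIZE,
    CONTEXT_PARALLEL_SIZE,
)
rollouts_per_step = NUM_PROMPTS_PER_STEP * NUM_GENERATIONS_PER_PROMPT
global_batches_per_step = rollouts_per_step // TRAIN_GLOBAL_BATCH_SIZE

print(f"data parallel size:      {dp}")
print(f"rollouts per step:       {rollouts_per_step}")
print(f"global batches per step: {global_batches_per_step}")
```

Falling back to `tensor_model_parallel_size=4`, as the guide suggests for OOM cases, would halve the data-parallel size under the same assumptions.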
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| {} First reason briefly between <think> </think> tags, then output only the single option letter (e.g., A, B, C, D, ...) between <answer> </answer> tags. Format example: <think>your reasoning</think><answer>A</answer> |
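The `{}` in this prompt file is a placeholder for the question. A minimal sketch of filling it, assuming the placeholder is substituted with `str.format`; the question, the option layout, and the `build_prompt` helper are hypothetical.

```python
# Template text copied from the prompt file above.
PROMPT_TEMPLATE = (
    "{} First reason briefly between <think> </think> tags, then output only "
    "the single option letter (e.g., A, B, C, D, ...) between <answer> "
    "</answer> tags. Format example: <think>your reasoning</think>"
    "<answer>A</answer>"
)


def build_prompt(question: str, options: list[str]) -> str:
    """Join the question and its options, then fill the template placeholder."""
    body = question.strip() + "\n" + "\n".join(options)
    return PROMPT_TEMPLATE.format(body)


# Hypothetical multiple-choice question.
prompt = build_prompt(
    "What happens right after the doorbell rings?",
    ["A. The dog barks", "B. The lights go out", "C. Someone laughs"],
)
print(prompt)
```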