feat: single controller e2e by yuki-97 · Pull Request #2819 · NVIDIA-NeMo/RL

yuki-97 · 2026-06-15T06:53:06Z

Summary

Adds the SC (Single Controller) entrypoint for async GRPO on the TransferQueue data plane.

SingleControllerActor (nemo_rl/algorithms/single_controller.py): CPU-only Ray actor with three asyncio pumps.
- _rollout_pump: iterates the dataloader → RolloutManager.generate_and_push → TQReplayBuffer.add writes N training rows per group. Gated by _buffer_capacity / max_inflight_prompts / _rollout_permitted. Supports per-weight-version dispatch quota (over_sampling=false, reserve/commit slot accounting) and exact target-step matching (force_in_order=true).
- _train_pump: sampler.evict → sampler.select → _advantage_pump → TQPolicy split API (begin/train_microbatch/finish_train_step) → dp_client.clear_samples. One RL step = one optimizer step on train_global_batch_size; enforced by an init-time assertion num_prompts_per_step * num_generations_per_prompt == train_global_batch_size.
- _sync_weights: drain gate + WeightSynchronizer.sync_weights, wired through setup_single_controller.
single_controller_utils/ (config.py / setup.py / utils.py):
- SC MasterConfig + AsyncRLConfig;
- Driver-side setup_single_controller() builds the full SingleControllerBundle (clusters / generation / trainer / weight synchronizer).
async_utils/: TQReplayBuffer (group-granular, producer-side tensorization, reserve/commit slot accounting) + StalenessSampler (filter-only; strict_on_policy / staleness_window, optional force_in_order).
Launcher + exemplar: examples/run_grpo_single_controller.py, examples/configs/grpo_math_1B_single_controller.yaml.
Recipes + tests:
- L1 functional: tests/functional/grpo_dp_single_controller.sh (Qwen3-0.6B, 2 GPU, 2 steps).
- Nightly: grpo-llama3.1-8b-instruct-2n8g-async-1off-single-controller (matches legacy async-1off) and grpo-qwen2.5-math-1.5b-instruct-1n8g-megatron-single-controller.
- Unit: test_rollout_pump, test_single_controller_setup, test_staleness_sampler, test_tq_replay_buffer, test_megatron_split_state.
Megatron worker fix: begin_train_step also nulls config.no_sync_func so the outer model.no_sync() covers the last MB — otherwise register_grad_ready leaks counts past the explicit sync and asserts on step 2.
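
A minimal sketch of the pump shape and the init-time batch-size check described above. It assumes a plain `asyncio.Queue` as a stand-in for TQReplayBuffer; `StepConfig`, `rollout_pump`, `train_pump`, `run`, and `demo` are hypothetical names for illustration, not the PR's API. Ray, TransferQueue, staleness sampling, and weight sync are left out.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StepConfig:
    # Field names follow the PR text; the dataclass itself is hypothetical.
    num_prompts_per_step: int
    num_generations_per_prompt: int
    train_global_batch_size: int

    def __post_init__(self) -> None:
        # One RL step must fill exactly one optimizer step.
        expected = self.num_prompts_per_step * self.num_generations_per_prompt
        assert expected == self.train_global_batch_size, (
            f"num_prompts_per_step * num_generations_per_prompt = {expected}, "
            f"but train_global_batch_size = {self.train_global_batch_size}"
        )


async def rollout_pump(cfg: StepConfig, buffer: asyncio.Queue, num_steps: int) -> None:
    """Push one group of rows per prompt; waits while the buffer is full."""
    for prompt_id in range(cfg.num_prompts_per_step * num_steps):
        group = [(prompt_id, g) for g in range(cfg.num_generations_per_prompt)]
        await buffer.put(group)  # back-pressure from the bounded queue


async def train_pump(cfg: StepConfig, buffer: asyncio.Queue, num_steps: int) -> list:
    """Collect groups until a full global batch is ready, once per step."""
    batch_sizes = []
    for _ in range(num_steps):
        rows = []
        while len(rows) < cfg.train_global_batch_size:
            rows.extend(await buffer.get())
        batch_sizes.append(len(rows))  # stand-in for one optimizer step
    return batch_sizes


async def run(cfg: StepConfig, num_steps: int, buffer_capacity: int) -> list:
    buffer = asyncio.Queue(maxsize=buffer_capacity)
    _, batch_sizes = await asyncio.gather(
        rollout_pump(cfg, buffer, num_steps),
        train_pump(cfg, buffer, num_steps),
    )
    return batch_sizes


def demo() -> list:
    cfg = StepConfig(
        num_prompts_per_step=4,
        num_generations_per_prompt=2,
        train_global_batch_size=8,
    )
    return asyncio.run(run(cfg, num_steps=2, buffer_capacity=3))
```

In this sketch the bounded queue plays the role of the `_buffer_capacity` gate: the producer stalls once three groups are waiting, and the consumer only steps when a full batch of 8 rows has arrived.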

Known limits (TODOs in code)

compute_prev_logprobs / compute_reference_logprobs gating is provisional.
Multi-mini-step inside a single RL step is not supported (one optimizer.step per RL step).
Validation not wired (no eval dataset, no eval loop).

Current state

Runnable end-to-end via the launcher.
Functional: L1 grpo_dp_single_controller.sh (Qwen3-0.6B, 2 GPU, 2 steps).
Nightly: Llama 3.1 8B 2n8g async-1off SC and Qwen 2.5 Math 1.5B 1n8g Megatron SC wired and tracked.

Compared grpo-llama3.1-8b-instruct-2n8g-async-1off-single-controller against the legacy code with the same settings (grpo-llama3.1-8b-instruct-2n8g-async-1off). Mini-batching is not used for now.

Reward is slightly lower, possibly because of the grad_norm difference.
grad_norm does not match the legacy run; investigating.

copy-pr-bot · 2026-06-15T06:53:10Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Two bugs around the split-API state machine fixed in one pass:

1. **no_sync_func leak (latent assertion on step 2 with PP=1).** `forward_backward_no_pipelining` (the PP=1 path, which is the common one) wraps inner microbatches in `model.config.no_sync_func` but runs the *last* microbatch OUTSIDE of it. Our outer `with self.model.no_sync():` in `train_microbatch` was therefore bypassed for the trailing MB. `register_grad_ready` leaked per-param counts past the explicit DP sync at `finish_train_step`, and the next `begin_train_step` then asserted on the stale counts (typically step 2). Fix: in `begin_train_step` also save and null `no_sync_func` (set to `contextlib.nullcontext`). Restore both hooks at finish/abort via the new `_restore_saved_grad_sync_func` helper. Spotted in yuki-97's PR NVIDIA-NeMo#2819 (`begin_train_step` also nulls `config.no_sync_func`); same fix landed here.

2. **No exception safety around the open step (terrykong, NVIDIA-NeMo#2683:640).** If `train_microbatch` or `finish_train_step` raised mid-body, both `grad_sync_func` and (now also) `no_sync_func` would stay nulled and future steps would run with the PP scheduler bypass disabled silently. Fix: extract `_train_microbatch_body` and `_finish_train_step_body`, wrap each entry method in try/except that calls `_restore_saved_grad_sync_func` before re-raising. Caller is still expected to invoke `abort_train_step` (idempotent on the saved values) to drop `_train_step_state`.

3. **Cleanup (terrykong, NVIDIA-NeMo#2683:582).** Drop dead `no_sync_active` field from `_split_step_state_init`: never read or written; `no_sync` is applied via the `with self.model.no_sync():` context manager.

Also: add module-level `log = logging.getLogger(__name__)` so the try/except handlers can `log.exception(...)` if the restore itself fails.

Signed-off-by: Akash Mehra <akamehra@nvidia.com>
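
The save/null/restore pattern with exception safety described in this commit message could look roughly like the sketch below. `FakeConfig` and `SplitStepWorker` are hypothetical stand-ins, and the method bodies are illustrative only; the method names are borrowed from the message, but this is not the Megatron worker's actual code.

```python
import contextlib
import logging

log = logging.getLogger(__name__)


class FakeConfig:
    """Hypothetical stand-in for a model config that carries both hooks."""

    def __init__(self):
        self.grad_sync_func = lambda params=None: None
        self.no_sync_func = lambda: contextlib.nullcontext()


class SplitStepWorker:
    """Illustrative split-step state machine: begin / microbatch / finish."""

    def __init__(self, config):
        self.config = config
        self._saved = None  # (grad_sync_func, no_sync_func) while a step is open

    def begin_train_step(self):
        assert self._saved is None, "previous step was not finished or aborted"
        # Save both hooks, then null them so the inner schedule cannot
        # bypass the outer no_sync context for the last microbatch.
        self._saved = (self.config.grad_sync_func, self.config.no_sync_func)
        self.config.grad_sync_func = None
        self.config.no_sync_func = contextlib.nullcontext

    def _restore_saved_grad_sync_func(self):
        # Idempotent: a second call finds nothing saved and does nothing.
        if self._saved is not None:
            self.config.grad_sync_func, self.config.no_sync_func = self._saved
            self._saved = None

    def _restore_quietly(self):
        # Used on the error path so a failing restore cannot mask the
        # original exception.
        try:
            self._restore_saved_grad_sync_func()
        except Exception:
            log.exception("failed to restore grad-sync hooks")

    def train_microbatch(self, body):
        try:
            return body()
        except Exception:
            self._restore_quietly()
            raise

    def finish_train_step(self):
        self._restore_saved_grad_sync_func()

    def abort_train_step(self):
        self._restore_saved_grad_sync_func()
```

With this shape, a microbatch that raises leaves the config hooks as they were before `begin_train_step`, and a later `abort_train_step` call is a harmless no-op.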

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Akash Mehra <akamehra@nvidia.com>