feat: single controller e2e #2819

Draft
yuki-97 wants to merge 28 commits into main from yukih/sc-entrypoint

@yuki-97 yuki-97 commented Jun 15, 2026

Contributor

Summary

Adds the SC (Single Controller) entrypoint for async GRPO on the TransferQueue data plane.

  • SingleControllerActor (nemo_rl/algorithms/single_controller.py): CPU-only Ray actor with three asyncio pumps.
    • _rollout_pump: iterates the dataloader → RolloutManager.generate_and_push → TQReplayBuffer.add, which writes N training rows per group. Gated by _buffer_capacity / max_inflight_prompts / _rollout_permitted. Supports per-weight-version dispatch quota (over_sampling=false, reserve/commit slot accounting) and exact target-step matching (force_in_order=true).
    • _train_pump: sampler.evict → sampler.select → _advantage_pump → TQPolicy split API (begin_train_step / train_microbatch / finish_train_step) → dp_client.clear_samples. One RL step = one optimizer step on train_global_batch_size, enforced by the init-time assertion num_prompts_per_step * num_generations_per_prompt == train_global_batch_size.
    • _sync_weights: drain gate + WeightSynchronizer.sync_weights, wired through setup_single_controller.
  • single_controller_utils/ (config.py / setup.py / utils.py):
    • SC MasterConfig + AsyncRLConfig;
    • Driver-side setup_single_controller() builds the full SingleControllerBundle (clusters / generation / trainer / weight synchronizer).
  • async_utils/: TQReplayBuffer (group-granular, producer-side tensorization, reserve/commit slot accounting) + StalenessSampler (filter-only; strict_on_policy / staleness_window, optional force_in_order).
  • Launcher + exemplar: examples/run_grpo_single_controller.py, examples/configs/grpo_math_1B_single_controller.yaml.
  • Recipes + tests:
    • L1 functional: tests/functional/grpo_dp_single_controller.sh (Qwen3-0.6B, 2 GPU, 2 steps).
    • Nightly: grpo-llama3.1-8b-instruct-2n8g-async-1off-single-controller (matches legacy async-1off) and grpo-qwen2.5-math-1.5b-instruct-1n8g-megatron-single-controller.
    • Unit: test_rollout_pump, test_single_controller_setup, test_staleness_sampler, test_tq_replay_buffer, test_megatron_split_state.
  • Megatron worker fix: begin_train_step also nulls config.no_sync_func so the outer model.no_sync() covers the last MB — otherwise register_grad_ready leaks counts past the explicit sync and asserts on step 2.
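
For readers unfamiliar with staleness filtering, here is a minimal sketch of what a filter-only sampler with strict_on_policy / staleness_window could look like. It is not the PR's StalenessSampler: the `Group` type, the `select_fresh` function, and the definition of staleness as `current_version - weight_version` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    """One prompt group in the buffer (hypothetical stand-in, not the PR's type)."""

    prompt_id: int
    weight_version: int  # policy version that generated this group's rollouts


def select_fresh(groups, current_version, staleness_window=1, strict_on_policy=False):
    """Return the groups whose staleness falls inside the allowed window.

    Assumptions: staleness is current_version - weight_version, and
    strict_on_policy is equivalent to a window of zero.
    """
    window = 0 if strict_on_policy else staleness_window
    return [g for g in groups if 0 <= current_version - g.weight_version <= window]


groups = [Group(0, 3), Group(1, 4), Group(2, 5)]

# With a window of 1 at version 5, groups generated at versions 4 and 5 pass.
fresh = select_fresh(groups, current_version=5, staleness_window=1)

# strict_on_policy keeps only groups generated by the current weights.
on_policy = select_fresh(groups, current_version=5, strict_on_policy=True)
```

The filter only decides eligibility; eviction of groups that can never become eligible again would be a separate step, as the sampler.evict call in _train_pump suggests.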

Known limits (TODOs in code)

  • compute_prev_logprobs / compute_reference_logprobs gating is provisional.
  • Multi-mini-step inside a single RL step is not supported (one optimizer.step per RL step).
  • Validation not wired (no eval dataset, no eval loop).
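
The single-optimizer-step limit ties back to the init-time invariant from the Summary. A minimal sketch of that check is below; the function name and the example numbers (32 prompts, 16 generations, batch size 512) are hypothetical, and only the equality itself comes from the PR description.

```python
def check_batch_invariant(
    num_prompts_per_step: int,
    num_generations_per_prompt: int,
    train_global_batch_size: int,
) -> int:
    """Check that one RL step produces exactly one global batch of rows.

    Returns the number of training rows per RL step when the invariant holds.
    """
    rows_per_step = num_prompts_per_step * num_generations_per_prompt
    assert rows_per_step == train_global_batch_size, (
        f"num_prompts_per_step * num_generations_per_prompt = {rows_per_step}, "
        f"but train_global_batch_size = {train_global_batch_size}; "
        "one RL step must map to exactly one optimizer step"
    )
    return rows_per_step


# Illustrative values only: 32 prompts x 16 generations = 512 rows per step.
rows = check_batch_invariant(32, 16, 512)
```

If the product were smaller or larger than train_global_batch_size, one RL step would no longer correspond to a single optimizer step, which is the unsupported multi-mini-step case listed above.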

Current state

  • Runnable end-to-end via the launcher.
  • Functional: L1 grpo_dp_single_controller.sh (Qwen3-0.6B, 2 GPU, 2 steps).
  • Nightly: Llama 3.1 8B 2n8g async-1off SC and Qwen 2.5 Math 1.5B 1n8g Megatron SC wired and tracked.

Compared grpo-llama3.1-8b-instruct-2n8g-async-1off-single-controller against the legacy code with the same settings (grpo-llama3.1-8b-instruct-2n8g-async-1off). Mini-batches are not used for now.

  • Reward is slightly lower, possibly because of the grad_norm difference.
  • grad_norm does not match the legacy run; investigating.
image

copy-pr-bot Bot commented Jun 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mehraakash added a commit to mehraakash/RL that referenced this pull request Jun 16, 2026
Two bugs around the split-API state machine fixed in one pass, plus one cleanup:

1. **no_sync_func leak (latent assertion on step 2 with PP=1).**
   `forward_backward_no_pipelining` (the PP=1 path, which is the common
   one) wraps inner microbatches in `model.config.no_sync_func` but runs
   the *last* microbatch OUTSIDE of it. Our outer `with self.model.no_sync():`
   in `train_microbatch` was therefore bypassed for the trailing MB.
   `register_grad_ready` leaked per-param counts past the explicit DP
   sync at `finish_train_step`, and the next `begin_train_step` then
   asserted on the stale counts (typically step 2).

   Fix: in `begin_train_step` also save and null `no_sync_func` (set to
   `contextlib.nullcontext`). Restore both hooks at finish/abort via the
   new `_restore_saved_grad_sync_func` helper.

   Spotted in yuki-97's PR NVIDIA-NeMo#2819 (`begin_train_step` also nulls
   `config.no_sync_func`); same fix landed here.

2. **No exception safety around the open step (terrykong, NVIDIA-NeMo#2683:640).**
   If `train_microbatch` or `finish_train_step` raised mid-body, both
   `grad_sync_func` and (now also) `no_sync_func` would stay nulled and
   future steps would run with the PP scheduler bypass disabled silently.

   Fix: extract `_train_microbatch_body` and `_finish_train_step_body`,
   wrap each entry method in try/except that calls
   `_restore_saved_grad_sync_func` before re-raising. Caller is still
   expected to invoke `abort_train_step` (idempotent on the saved
   values) to drop `_train_step_state`.

3. **Cleanup (terrykong, NVIDIA-NeMo#2683:582).** Drop dead `no_sync_active` field
   from `_split_step_state_init` — never read or written; `no_sync` is
   applied via the `with self.model.no_sync():` context manager.

Also: add module-level `log = logging.getLogger(__name__)` so the
try/except handlers can `log.exception(...)` if the restore itself fails.

Signed-off-by: Akash Mehra <akamehra@nvidia.com>
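
The save/null/restore pattern described in this commit message can be sketched roughly as follows. This is a simplified stand-in, not the Megatron worker code: `SplitStepWorker`, the `_saved_hooks` attribute, the `body` callable, and nulling `grad_sync_func` to `None` are assumptions. Only the method names, the use of `contextlib.nullcontext` for `no_sync_func`, and the restore-before-re-raise behavior come from the text above.

```python
import contextlib
import logging
from types import SimpleNamespace

log = logging.getLogger(__name__)


class SplitStepWorker:
    """Hypothetical worker showing hook save/null/restore with exception safety."""

    def __init__(self, config):
        self.config = config
        self._saved_hooks = None  # (grad_sync_func, no_sync_func) while a step is open

    def begin_train_step(self):
        assert self._saved_hooks is None, "previous train step is still open"
        self._saved_hooks = (self.config.grad_sync_func, self.config.no_sync_func)
        self.config.grad_sync_func = None  # assumption: nulled by setting to None
        self.config.no_sync_func = contextlib.nullcontext

    def _restore_saved_grad_sync_func(self):
        # Idempotent: a second call finds nothing saved and does nothing.
        if self._saved_hooks is not None:
            self.config.grad_sync_func, self.config.no_sync_func = self._saved_hooks
            self._saved_hooks = None

    def train_microbatch(self, body):
        try:
            return body()
        except Exception:
            try:
                self._restore_saved_grad_sync_func()
            except Exception:
                log.exception("failed to restore grad sync hooks")
            raise  # re-raises the original microbatch error

    def finish_train_step(self):
        self._restore_saved_grad_sync_func()


def original_grad_sync():
    """Placeholder for the real gradient sync hook."""


def original_no_sync():
    """Placeholder for the real no_sync hook."""
    return contextlib.nullcontext()


def failing_body():
    raise RuntimeError("microbatch failed")


config = SimpleNamespace(grad_sync_func=original_grad_sync, no_sync_func=original_no_sync)
worker = SplitStepWorker(config)
worker.begin_train_step()
nulled = config.grad_sync_func is None and config.no_sync_func is contextlib.nullcontext

try:
    worker.train_microbatch(failing_body)
except RuntimeError:
    pass

restored = (
    config.grad_sync_func is original_grad_sync
    and config.no_sync_func is original_no_sync
)
```

After the failed microbatch, both hooks are back to their original values, so a later step does not silently run with them nulled.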
@yuki-97 yuki-97 force-pushed the yukih/sc-entrypoint branch 10 times, most recently from 36c7e50 to 44fc9a4 Compare June 21, 2026 15:25
@yuki-97 yuki-97 changed the title feat: single controller (w/o sync_weight) feat: single controller e2e Jun 24, 2026
yuki-97 and others added 16 commits June 28, 2026 02:14
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Akash Mehra <akamehra@nvidia.com>

update functional test

Signed-off-by: Yuki Huang <yukih@nvidia.com>
… reserve/commit slots

Signed-off-by: Yuki Huang <yukih@nvidia.com>

update unit test

Signed-off-by: Yuki Huang <yukih@nvidia.com>
yuki-97 added 6 commits June 28, 2026 02:14
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/sc-entrypoint branch 2 times, most recently from deb3b4b to e0513e5 Compare June 28, 2026 11:27
yuki-97 added 3 commits June 28, 2026 05:16
…um_prompts_per_step

Signed-off-by: Yuki Huang <yukih@nvidia.com>
…unctional test buffer sizing

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/sc-entrypoint branch from e0513e5 to 1e0107e Compare June 28, 2026 12:17
yuki-97 added 2 commits June 28, 2026 08:06
Signed-off-by: Yuki Huang <yukih@nvidia.com>
…troller and grpo_async_gym_single_controller into it

Signed-off-by: Yuki Huang <yukih@nvidia.com>
@github-actions github-actions Bot added the CI Relating to CI label Jun 28, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>