perf: reduce srun overhead in ray.sub and gate driver on sandbox readiness by ananthsub · Pull Request #2827 · NVIDIA-NeMo/RL

ananthsub · 2026-06-15T17:55:24Z

What does this PR do ?

ray.sub previously scales srun/slurmctld RPCs with the node count: it launches one srun per worker (each followed by a 3s sleep), slept a fixed 120s before workers, polls head readiness via srun --overlap test -f, and polls cluster status via repeated srun ... ray status. For larger jobs, this dominates cluster bringup time and risks throttling slurmctld.

This PR replaces the per-node fan-out with shared-filesystem file signaling so srun calls stay constant relative to the node count:

Launch all workers with a single batched srun (--nodes/--ntasks=N-1, --ntasks-per-node=1, --exclude=head); workers self-identify via SLURM_PROCID / SLURMD_NODENAME. This change drops the per-worker loop, the 3s stagger, and the 120s sleep.
Start the head without --block and touch STARTED_RAY_HEAD only after the GCS is listening, so workers connect on the first try; workers poll that file before ray start --address and retry the GCS connect.
A head-side ray-status sidecar publishes the live worker_units count to a file; the submit host reads it instead of issuing srun ... ray status RPCs.
The head container runs the driver inline (via a driver_command.sh file) and the submit host just waits on the head srun, removing the dedicated driver srun.
Sandbox readiness becomes a blocking gate: each per-node sandbox task polls its local port and touches SANDBOX_READY_<host>; the head waits for all instances before launching the driver. The SANDBOX_PORTS_DIR signal dir is always mounted and exported, independent of the SANDBOX_EXTRA_MOUNTS / SANDBOX_ENV_VARS knobs.

This change assumes that the LOG_DIR is on a shared filesystem visible across compute nodes.

This change also does the following:

cleans stale Ray session state inside the head and worker retry loops
Adds 0-indexed, zero-padded worker logs for natural sorting
Suffix LOG_DIR with SLURM_RESTART_COUNT so a requeue does not clobber the previous attempt's logs and signal files

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

copy-pr-bot · 2026-06-15T17:55:29Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

ananthsub · 2026-06-15T21:14:52Z

/ok to test 2bcbb17

…iness ray.sub previously scaled srun/slurmctld RPCs with the node count: it launched one srun per worker (each followed by a 3s sleep), slept a fixed 120s before workers, polled head readiness via `srun --overlap test -f`, and polled cluster status via repeated `srun ... ray status`. At scale this dominated cluster bringup time and risked throttling slurmctld. Replace the per-node fan-out with shared-filesystem file signaling so srun calls stay roughly constant in the node count: - Launch all workers with a single batched srun (--nodes/--ntasks=N-1, --ntasks-per-node=1, --exclude=head); workers self-identify via SLURM_PROCID / SLURMD_NODENAME. Drops the per-worker loop, the 3s stagger, and the 120s sleep. - Start the head without --block and touch STARTED_RAY_HEAD only after the GCS is listening, so workers connect on the first try; workers poll that file before `ray start --address` and retry the GCS connect. - A head-side ray-status sidecar publishes the live worker_units count to a file; the submit host reads it instead of issuing `srun ... ray status` RPCs. - The head container runs the driver inline (via a driver_command.sh file) and the submit host just waits on the head srun, removing the dedicated driver srun. Sandbox readiness becomes a blocking gate: each per-node sandbox task polls its local port and touches SANDBOX_READY_<host>; the head waits for all instances before launching the driver. The SANDBOX_PORTS_DIR signal dir is always mounted and exported, independent of the SANDBOX_EXTRA_MOUNTS / SANDBOX_ENV_VARS knobs. Also: clean stale Ray session state inside the head and worker retry loops (avoids the persisted-session AssertionError on retry); 0-indexed, zero-padded worker logs (ray-worker-%0Nt.log) for natural sorting; support single-node jobs (no workers needed); and suffix LOG_DIR with SLURM_RESTART_COUNT so a requeue does not clobber the previous attempt's logs and signal files. Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

terrykong

Reviewed by an agent team (bug-finder + RL-infra/guidelines + adversarial verification). Overall this is a solid, well-constructed bringup change — the shared-FS file-signaling design is sound, the bash -n pre-validation of the generated head/worker scripts is a nice safety net, heredoc/quoting expansion timing is correct throughout, and single-node mode is handled. No correctness bugs found. The 4 inline comments are all suggestions/nits (one perf-evidence ask, one fail-fast trade-off, two minor robustness nits) — none blocking.

Note: as a multi-node Slurm launcher this couldn't be executed end-to-end in review; findings are from static analysis with a few behaviors confirmed via standalone bash tests.

Generated by Claude Code

terrykong · 2026-06-24T20:57:15Z

+# NOTE: LOG_DIR must be on a shared filesystem visible to both the submit host and
+# all compute nodes. File-based signaling between the submit host and the containers
+# (STARTED_RAY_HEAD, ray_worker_units, ENDED, sandbox readiness) relies on this.
+# On systems without a shared FS, srun-based polling would be needed instead.


Suggested change

# On systems without a shared FS, srun-based polling would be needed instead.

i'm thinking we might not even mention that b/c there's so many assumptions about shared FS that someone attempting it will run into many other issues

terrykong · 2026-06-24T20:58:32Z

 fi

 # Create logs directory
+# NOTE: LOG_DIR must be on a shared filesystem visible to both the submit host and


OOC, could we simply check this by doing the mkdir and writing .slurm_job_id or something there and checking that all head/workers can view it before starting? otherwise the FS is not shared

terrykong · 2026-06-24T21:30:52Z

+    echo "\$units" > "$LOG_DIR/.ray_worker_units.tmp"
+    mv "$LOG_DIR/.ray_worker_units.tmp" "$LOG_DIR/ray_worker_units"


ray.sub:412

Minor: a one-line note on why the tmp+mv right here would help (the function-level comment at L406 mentions it, but it's easy to miss at the operation). Suggested:

Suggested change

echo "\$units" > "$LOG_DIR/.ray_worker_units.tmp"

mv "$LOG_DIR/.ray_worker_units.tmp" "$LOG_DIR/ray_worker_units"

# Publish atomically: a plain `>` truncates then writes, so the reader (the submit

# host, on another node, reading over the shared FS) could catch the file empty or

# partial. rename(2) is atomic, so the reader always sees the complete prior-or-new value.

echo "\$units" > "$LOG_DIR/.ray_worker_units.tmp"

mv "$LOG_DIR/.ray_worker_units.tmp" "$LOG_DIR/ray_worker_units"

terrykong · 2026-06-24T21:38:17Z

-  echo "Worker failed \$count/$num_retries times, restarting in 5 seconds..."
-  sleep 5
+  echo "[WARN] Worker \$SLURM_PROCID failed \$count/$WORKER_NUM_RETRIES times, restarting in 10 seconds..."
+  sleep 10


why change to 10 but have head stay at 5?

terrykong · 2026-06-24T21:57:07Z

+  # SANDBOX_READY_<hostname> once its port is listening (see the sandbox srun below).
+  if [[ -n "$SANDBOX_PORTS_DIR" ]]; then
+    echo "[INFO] Waiting for sandbox to be ready on all $SLURM_JOB_NUM_NODES nodes..."
+    SANDBOX_DEADLINE=\$((SECONDS + 600))


should we flip around this check to check the 600 second deadline first and then the 1800 so we can exit earlier?

terrykong · 2026-06-24T21:58:15Z

@@ -565,81 +681,134 @@ if [[ -n "${SANDBOX_CONTAINER:-}" ]] && [[ -n "${SANDBOX_COMMAND:-}" ]]; then
    --nodes="$SLURM_JOB_NUM_NODES" \


@ananthsub i think we actually only use one of these, and don't need to wait for N of them to start. doesn't necessarily block this PR, but can we have an issue to fix this?

terrykong · 2026-06-24T22:47:36Z

+  set +e
+  wait ${SRUN_PIDS["ray-head"]}
+  exit_code=$?
+  set -e
+  exit $exit_code


ray.sub:747-751

Non-interactive waits on the head PID only, so a hard-killed worker/sandbox srun during bringup isn't caught until the head's 30-min ray-status deadline (L476) — vs ~2s before via check_srun_processes. Costly at large node counts. (Driver-execution phase is unchanged: both old and new rely on Ray's node-failure detection + the driver erroring out.) One way to restore fast-fail, with wait -n (bash ≥5.1):

Suggested change

set +e

wait ${SRUN_PIDS["ray-head"]}

exit_code=$?

set -e

exit $exit_code

set +e

# Fail fast if a worker/sandbox srun dies, instead of waiting out the head's deadline.

_watch=("${SRUN_PIDS["ray-head"]}")

[[ -n "${SRUN_PIDS["ray-workers"]:-}" ]] && _watch+=("${SRUN_PIDS["ray-workers"]}")

[[ -n "${SRUN_PIDS["sandbox"]:-}" ]] && _watch+=("${SRUN_PIDS["sandbox"]}")

wait -n -p _done_pid "${_watch[@]}"

exit_code=$?

if [[ "$_done_pid" != "${SRUN_PIDS["ray-head"]}" ]]; then

echo "[ERROR] Background srun (pid=$_done_pid) exited before the head; failing fast." >&2

touch "$LOG_DIR/ENDED"

fi

set -e

exit $exit_code

terrykong · 2026-06-24T22:47:37Z

+  set +x
+  while true; do
+    sleep 10
+    units=\$(ray status 2>/dev/null | grep "worker_units" | awk -F'[/. ]' '{print \$4}' || echo 0)


ray.sub:411

|| echo 0 never fires: the pipeline's exit status is awk's (which exits 0 even when grep matches nothing), so units is empty, not 0. The set -o pipefail at L25 is in the submit-host shell and doesn't propagate into the bash -c "$head_cmd" subshell (SHELLOPTS isn't exported), so it's off here and at the inline poll on L478. Benign today (empty compares not-equal → loop retries), but worth fixing.

Cleanest fix covers L411, L478, and transitively the L758 read at once by enabling pipefail at the top of head_cmd — it spans more than this line, so as a diff to apply manually (pipefail only, not set -e — the retry loops and [[ … -eq … ]] poll conditions rely on non-zero being non-fatal):

fi +# pipefail so 'ray status | grep | awk || echo 0' reports grep's miss and the fallback fires. +# NOTE: pipefail only here — do NOT add 'set -e'. +set -o pipefail exit-dramatically() {

ananthsub requested review from macandro96 and yfw June 15, 2026 17:55

ananthsub requested a review from a team as a code owner June 15, 2026 17:55

ananthsub added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Jun 15, 2026

copy-pr-bot Bot temporarily deployed to public June 15, 2026 21:15 Inactive

copy-pr-bot Bot temporarily deployed to public June 15, 2026 21:16 Inactive

copy-pr-bot Bot temporarily deployed to test June 15, 2026 21:18 Inactive

copy-pr-bot Bot temporarily deployed to public June 15, 2026 21:19 Inactive

ananthsub force-pushed the ansubramania/raysub-srun-optimizations branch from 2bcbb17 to 90ccff8 Compare June 24, 2026 06:52

yfw requested a review from terrykong June 24, 2026 18:09

terrykong reviewed Jun 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: reduce srun overhead in ray.sub and gate driver on sandbox readiness#2827

perf: reduce srun overhead in ray.sub and gate driver on sandbox readiness#2827
ananthsub wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
ananthsub:ansubramania/raysub-srun-optimizations

ananthsub commented Jun 15, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Jun 15, 2026

Uh oh!

ananthsub commented Jun 15, 2026

Uh oh!

terrykong left a comment

Uh oh!

terrykong Jun 24, 2026

Uh oh!

terrykong Jun 24, 2026

Uh oh!

terrykong Jun 24, 2026

Uh oh!

Uh oh!

terrykong Jun 24, 2026

Uh oh!

terrykong Jun 24, 2026

Uh oh!

terrykong Jun 24, 2026

Uh oh!

terrykong Jun 24, 2026

Uh oh!

terrykong Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		echo "\$units" > "$LOG_DIR/.ray_worker_units.tmp"
		mv "$LOG_DIR/.ray_worker_units.tmp" "$LOG_DIR/ray_worker_units"

		@@ -565,81 +681,134 @@ if [[ -n "${SANDBOX_CONTAINER:-}" ]] && [[ -n "${SANDBOX_COMMAND:-}" ]]; then
		--nodes="$SLURM_JOB_NUM_NODES" \

-  set +e
-  wait ${SRUN_PIDS["ray-head"]}
-  exit_code=$?
-  set -e
-  exit $exit_code
+  set +e
+  # Fail fast if a worker/sandbox srun dies, instead of waiting out the head's deadline.
+  _watch=("${SRUN_PIDS["ray-head"]}")
+  [[ -n "${SRUN_PIDS["ray-workers"]:-}" ]] && _watch+=("${SRUN_PIDS["ray-workers"]}")
+  [[ -n "${SRUN_PIDS["sandbox"]:-}" ]] && _watch+=("${SRUN_PIDS["sandbox"]}")
+  wait -n -p _done_pid "${_watch[@]}"
+  exit_code=$?
+  if [[ "$_done_pid" != "${SRUN_PIDS["ray-head"]}" ]]; then
+    echo "[ERROR] Background srun (pid=$_done_pid) exited before the head; failing fast." >&2
+    touch "$LOG_DIR/ENDED"
+  fi
+  set -e
+  exit $exit_code

Uh oh!

Conversation

ananthsub commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issues

Usage

Before your PR is "Ready for review"

Additional Information

Uh oh!

copy-pr-bot Bot commented Jun 15, 2026

Uh oh!

ananthsub commented Jun 15, 2026

Uh oh!

terrykong left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ananthsub commented Jun 15, 2026 •

edited

Loading