Skip to content

perf: reduce srun overhead in ray.sub and gate driver on sandbox readiness#2827

Open
ananthsub wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
ananthsub:ansubramania/raysub-srun-optimizations
Open

perf: reduce srun overhead in ray.sub and gate driver on sandbox readiness#2827
ananthsub wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
ananthsub:ansubramania/raysub-srun-optimizations

Conversation

@ananthsub

@ananthsub ananthsub commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

What does this PR do ?

ray.sub previously scales srun/slurmctld RPCs with the node count: it launches one srun per worker (each followed by a 3s sleep), slept a fixed 120s before workers, polls head readiness via srun --overlap test -f, and polls cluster status via repeated srun ... ray status. For larger jobs, this dominates cluster bringup time and risks throttling slurmctld.

This PR replaces the per-node fan-out with shared-filesystem file signaling so srun calls stay constant relative to the node count:

  1. Launch all workers with a single batched srun (--nodes/--ntasks=N-1, --ntasks-per-node=1, --exclude=head); workers self-identify via SLURM_PROCID / SLURMD_NODENAME. This change drops the per-worker loop, the 3s stagger, and the 120s sleep.
  2. Start the head without --block and touch STARTED_RAY_HEAD only after the GCS is listening, so workers connect on the first try; workers poll that file before ray start --address and retry the GCS connect.
  3. A head-side ray-status sidecar publishes the live worker_units count to a file; the submit host reads it instead of issuing srun ... ray status RPCs.
  4. The head container runs the driver inline (via a driver_command.sh file) and the submit host just waits on the head srun, removing the dedicated driver srun.
  5. Sandbox readiness becomes a blocking gate: each per-node sandbox task polls its local port and touches SANDBOX_READY_<host>; the head waits for all instances before launching the driver. The SANDBOX_PORTS_DIR signal dir is always mounted and exported, independent of the SANDBOX_EXTRA_MOUNTS / SANDBOX_ENV_VARS knobs.

This change assumes that the LOG_DIR is on a shared filesystem visible across compute nodes.

This change also does the following:

  1. cleans stale Ray session state inside the head and worker retry loops
  2. Adds 0-indexed, zero-padded worker logs for natural sorting
  3. Suffix LOG_DIR with SLURM_RESTART_COUNT so a requeue does not clobber the previous attempt's logs and signal files

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@ananthsub ananthsub requested review from macandro96 and yfw June 15, 2026 17:55
@ananthsub ananthsub requested a review from a team as a code owner June 15, 2026 17:55
@copy-pr-bot

copy-pr-bot Bot commented Jun 15, 2026

Copy link
Copy Markdown

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ananthsub ananthsub added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Jun 15, 2026
@ananthsub

Copy link
Copy Markdown
Contributor Author

/ok to test 2bcbb17

…iness

ray.sub previously scaled srun/slurmctld RPCs with the node count: it launched
one srun per worker (each followed by a 3s sleep), slept a fixed 120s before
workers, polled head readiness via `srun --overlap test -f`, and polled cluster
status via repeated `srun ... ray status`. At scale this dominated cluster
bringup time and risked throttling slurmctld.

Replace the per-node fan-out with shared-filesystem file signaling so srun calls
stay roughly constant in the node count:

- Launch all workers with a single batched srun (--nodes/--ntasks=N-1,
  --ntasks-per-node=1, --exclude=head); workers self-identify via SLURM_PROCID /
  SLURMD_NODENAME. Drops the per-worker loop, the 3s stagger, and the 120s sleep.
- Start the head without --block and touch STARTED_RAY_HEAD only after the GCS is
  listening, so workers connect on the first try; workers poll that file before
  `ray start --address` and retry the GCS connect.
- A head-side ray-status sidecar publishes the live worker_units count to a file;
  the submit host reads it instead of issuing `srun ... ray status` RPCs.
- The head container runs the driver inline (via a driver_command.sh file) and the
  submit host just waits on the head srun, removing the dedicated driver srun.

Sandbox readiness becomes a blocking gate: each per-node sandbox task polls its
local port and touches SANDBOX_READY_<host>; the head waits for all instances
before launching the driver. The SANDBOX_PORTS_DIR signal dir is always mounted
and exported, independent of the SANDBOX_EXTRA_MOUNTS / SANDBOX_ENV_VARS knobs.

Also: clean stale Ray session state inside the head and worker retry loops
(avoids the persisted-session AssertionError on retry); 0-indexed, zero-padded
worker logs (ray-worker-%0Nt.log) for natural sorting; support single-node jobs
(no workers needed); and suffix LOG_DIR with SLURM_RESTART_COUNT so a requeue
does not clobber the previous attempt's logs and signal files.

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
@ananthsub ananthsub force-pushed the ansubramania/raysub-srun-optimizations branch from 2bcbb17 to 90ccff8 Compare June 24, 2026 06:52
@yfw yfw requested a review from terrykong June 24, 2026 18:09

@terrykong terrykong left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed by an agent team (bug-finder + RL-infra/guidelines + adversarial verification). Overall this is a solid, well-constructed bringup change — the shared-FS file-signaling design is sound, the bash -n pre-validation of the generated head/worker scripts is a nice safety net, heredoc/quoting expansion timing is correct throughout, and single-node mode is handled. No correctness bugs found. The 4 inline comments are all suggestions/nits (one perf-evidence ask, one fail-fast trade-off, two minor robustness nits) — none blocking.

Note: as a multi-node Slurm launcher this couldn't be executed end-to-end in review; findings are from static analysis with a few behaviors confirmed via standalone bash tests.

Generated by Claude Code

Comment thread ray.sub
# NOTE: LOG_DIR must be on a shared filesystem visible to both the submit host and
# all compute nodes. File-based signaling between the submit host and the containers
# (STARTED_RAY_HEAD, ray_worker_units, ENDED, sandbox readiness) relies on this.
# On systems without a shared FS, srun-based polling would be needed instead.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# On systems without a shared FS, srun-based polling would be needed instead.

i'm thinking we might not even mention that b/c there's so many assumptions about shared FS that someone attempting it will run into many other issues

Comment thread ray.sub
fi

# Create logs directory
# NOTE: LOG_DIR must be on a shared filesystem visible to both the submit host and

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOC, could we simply check this by doing the mkdir and writing .slurm_job_id or something there and checking that all head/workers can view it before starting? otherwise the FS is not shared

Comment thread ray.sub
Comment on lines +412 to +413
echo "\$units" > "$LOG_DIR/.ray_worker_units.tmp"
mv "$LOG_DIR/.ray_worker_units.tmp" "$LOG_DIR/ray_worker_units"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ray.sub:412

Minor: a one-line note on why the tmp+mv right here would help (the function-level comment at L406 mentions it, but it's easy to miss at the operation). Suggested:

Suggested change
echo "\$units" > "$LOG_DIR/.ray_worker_units.tmp"
mv "$LOG_DIR/.ray_worker_units.tmp" "$LOG_DIR/ray_worker_units"
# Publish atomically: a plain `>` truncates then writes, so the reader (the submit
# host, on another node, reading over the shared FS) could catch the file empty or
# partial. rename(2) is atomic, so the reader always sees the complete prior-or-new value.
echo "\$units" > "$LOG_DIR/.ray_worker_units.tmp"
mv "$LOG_DIR/.ray_worker_units.tmp" "$LOG_DIR/ray_worker_units"

Comment thread ray.sub
Comment thread ray.sub
echo "Worker failed \$count/$num_retries times, restarting in 5 seconds..."
sleep 5
echo "[WARN] Worker \$SLURM_PROCID failed \$count/$WORKER_NUM_RETRIES times, restarting in 10 seconds..."
sleep 10

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change to 10 but have head stay at 5?

Comment thread ray.sub
# SANDBOX_READY_<hostname> once its port is listening (see the sandbox srun below).
if [[ -n "$SANDBOX_PORTS_DIR" ]]; then
echo "[INFO] Waiting for sandbox to be ready on all $SLURM_JOB_NUM_NODES nodes..."
SANDBOX_DEADLINE=\$((SECONDS + 600))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we flip around this check to check the 600 second deadline first and then the 1800 so we can exit earlier?

Comment thread ray.sub
@@ -565,81 +681,134 @@ if [[ -n "${SANDBOX_CONTAINER:-}" ]] && [[ -n "${SANDBOX_COMMAND:-}" ]]; then
--nodes="$SLURM_JOB_NUM_NODES" \

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ananthsub i think we actually only use one of these, and don't need to wait for N of them to start. doesn't necessarily block this PR, but can we have an issue to fix this?

Comment thread ray.sub
Comment on lines +747 to +751
set +e
wait ${SRUN_PIDS["ray-head"]}
exit_code=$?
set -e
exit $exit_code

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ray.sub:747-751

Non-interactive waits on the head PID only, so a hard-killed worker/sandbox srun during bringup isn't caught until the head's 30-min ray-status deadline (L476) — vs ~2s before via check_srun_processes. Costly at large node counts. (Driver-execution phase is unchanged: both old and new rely on Ray's node-failure detection + the driver erroring out.) One way to restore fast-fail, with wait -n (bash ≥5.1):

Suggested change
set +e
wait ${SRUN_PIDS["ray-head"]}
exit_code=$?
set -e
exit $exit_code
set +e
# Fail fast if a worker/sandbox srun dies, instead of waiting out the head's deadline.
_watch=("${SRUN_PIDS["ray-head"]}")
[[ -n "${SRUN_PIDS["ray-workers"]:-}" ]] && _watch+=("${SRUN_PIDS["ray-workers"]}")
[[ -n "${SRUN_PIDS["sandbox"]:-}" ]] && _watch+=("${SRUN_PIDS["sandbox"]}")
wait -n -p _done_pid "${_watch[@]}"
exit_code=$?
if [[ "$_done_pid" != "${SRUN_PIDS["ray-head"]}" ]]; then
echo "[ERROR] Background srun (pid=$_done_pid) exited before the head; failing fast." >&2
touch "$LOG_DIR/ENDED"
fi
set -e
exit $exit_code

Comment thread ray.sub
set +x
while true; do
sleep 10
units=\$(ray status 2>/dev/null | grep "worker_units" | awk -F'[/. ]' '{print \$4}' || echo 0)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ray.sub:411

|| echo 0 never fires: the pipeline's exit status is awk's (which exits 0 even when grep matches nothing), so units is empty, not 0. The set -o pipefail at L25 is in the submit-host shell and doesn't propagate into the bash -c "$head_cmd" subshell (SHELLOPTS isn't exported), so it's off here and at the inline poll on L478. Benign today (empty compares not-equal → loop retries), but worth fixing.

Cleanest fix covers L411, L478, and transitively the L758 read at once by enabling pipefail at the top of head_cmd — it spans more than this line, so as a diff to apply manually (pipefail only, not set -e — the retry loops and [[ … -eq … ]] poll conditions rely on non-zero being non-fatal):

 fi
+# pipefail so 'ray status | grep | awk || echo 0' reports grep's miss and the fallback fires.
+# NOTE: pipefail only here — do NOT add 'set -e'.
+set -o pipefail
 
 exit-dramatically() {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants