perf: reduce srun overhead in ray.sub and gate driver on sandbox readiness#2827
perf: reduce srun overhead in ray.sub and gate driver on sandbox readiness#2827ananthsub wants to merge 1 commit into
Conversation
|
Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test 2bcbb17 |
…iness ray.sub previously scaled srun/slurmctld RPCs with the node count: it launched one srun per worker (each followed by a 3s sleep), slept a fixed 120s before workers, polled head readiness via `srun --overlap test -f`, and polled cluster status via repeated `srun ... ray status`. At scale this dominated cluster bringup time and risked throttling slurmctld. Replace the per-node fan-out with shared-filesystem file signaling so srun calls stay roughly constant in the node count: - Launch all workers with a single batched srun (--nodes/--ntasks=N-1, --ntasks-per-node=1, --exclude=head); workers self-identify via SLURM_PROCID / SLURMD_NODENAME. Drops the per-worker loop, the 3s stagger, and the 120s sleep. - Start the head without --block and touch STARTED_RAY_HEAD only after the GCS is listening, so workers connect on the first try; workers poll that file before `ray start --address` and retry the GCS connect. - A head-side ray-status sidecar publishes the live worker_units count to a file; the submit host reads it instead of issuing `srun ... ray status` RPCs. - The head container runs the driver inline (via a driver_command.sh file) and the submit host just waits on the head srun, removing the dedicated driver srun. Sandbox readiness becomes a blocking gate: each per-node sandbox task polls its local port and touches SANDBOX_READY_<host>; the head waits for all instances before launching the driver. The SANDBOX_PORTS_DIR signal dir is always mounted and exported, independent of the SANDBOX_EXTRA_MOUNTS / SANDBOX_ENV_VARS knobs. Also: clean stale Ray session state inside the head and worker retry loops (avoids the persisted-session AssertionError on retry); 0-indexed, zero-padded worker logs (ray-worker-%0Nt.log) for natural sorting; support single-node jobs (no workers needed); and suffix LOG_DIR with SLURM_RESTART_COUNT so a requeue does not clobber the previous attempt's logs and signal files. Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
2bcbb17 to
90ccff8
Compare
terrykong
left a comment
There was a problem hiding this comment.
Reviewed by an agent team (bug-finder + RL-infra/guidelines + adversarial verification). Overall this is a solid, well-constructed bringup change — the shared-FS file-signaling design is sound, the bash -n pre-validation of the generated head/worker scripts is a nice safety net, heredoc/quoting expansion timing is correct throughout, and single-node mode is handled. No correctness bugs found. The 4 inline comments are all suggestions/nits (one perf-evidence ask, one fail-fast trade-off, two minor robustness nits) — none blocking.
Note: as a multi-node Slurm launcher this couldn't be executed end-to-end in review; findings are from static analysis with a few behaviors confirmed via standalone bash tests.
Generated by Claude Code
| # NOTE: LOG_DIR must be on a shared filesystem visible to both the submit host and | ||
| # all compute nodes. File-based signaling between the submit host and the containers | ||
| # (STARTED_RAY_HEAD, ray_worker_units, ENDED, sandbox readiness) relies on this. | ||
| # On systems without a shared FS, srun-based polling would be needed instead. |
There was a problem hiding this comment.
| # On systems without a shared FS, srun-based polling would be needed instead. |
i'm thinking we might not even mention that b/c there's so many assumptions about shared FS that someone attempting it will run into many other issues
| fi | ||
|
|
||
| # Create logs directory | ||
| # NOTE: LOG_DIR must be on a shared filesystem visible to both the submit host and |
There was a problem hiding this comment.
OOC, could we simply check this by doing the mkdir and writing .slurm_job_id or something there and checking that all head/workers can view it before starting? otherwise the FS is not shared
| echo "\$units" > "$LOG_DIR/.ray_worker_units.tmp" | ||
| mv "$LOG_DIR/.ray_worker_units.tmp" "$LOG_DIR/ray_worker_units" |
There was a problem hiding this comment.
Minor: a one-line note on why the tmp+mv right here would help (the function-level comment at L406 mentions it, but it's easy to miss at the operation). Suggested:
| echo "\$units" > "$LOG_DIR/.ray_worker_units.tmp" | |
| mv "$LOG_DIR/.ray_worker_units.tmp" "$LOG_DIR/ray_worker_units" | |
| # Publish atomically: a plain `>` truncates then writes, so the reader (the submit | |
| # host, on another node, reading over the shared FS) could catch the file empty or | |
| # partial. rename(2) is atomic, so the reader always sees the complete prior-or-new value. | |
| echo "\$units" > "$LOG_DIR/.ray_worker_units.tmp" | |
| mv "$LOG_DIR/.ray_worker_units.tmp" "$LOG_DIR/ray_worker_units" |
| echo "Worker failed \$count/$num_retries times, restarting in 5 seconds..." | ||
| sleep 5 | ||
| echo "[WARN] Worker \$SLURM_PROCID failed \$count/$WORKER_NUM_RETRIES times, restarting in 10 seconds..." | ||
| sleep 10 |
There was a problem hiding this comment.
why change to 10 but have head stay at 5?
| # SANDBOX_READY_<hostname> once its port is listening (see the sandbox srun below). | ||
| if [[ -n "$SANDBOX_PORTS_DIR" ]]; then | ||
| echo "[INFO] Waiting for sandbox to be ready on all $SLURM_JOB_NUM_NODES nodes..." | ||
| SANDBOX_DEADLINE=\$((SECONDS + 600)) |
There was a problem hiding this comment.
should we flip around this check to check the 600 second deadline first and then the 1800 so we can exit earlier?
| @@ -565,81 +681,134 @@ if [[ -n "${SANDBOX_CONTAINER:-}" ]] && [[ -n "${SANDBOX_COMMAND:-}" ]]; then | |||
| --nodes="$SLURM_JOB_NUM_NODES" \ | |||
There was a problem hiding this comment.
@ananthsub i think we actually only use one of these, and don't need to wait for N of them to start. doesn't necessarily block this PR, but can we have an issue to fix this?
| set +e | ||
| wait ${SRUN_PIDS["ray-head"]} | ||
| exit_code=$? | ||
| set -e | ||
| exit $exit_code |
There was a problem hiding this comment.
Non-interactive waits on the head PID only, so a hard-killed worker/sandbox srun during bringup isn't caught until the head's 30-min ray-status deadline (L476) — vs ~2s before via check_srun_processes. Costly at large node counts. (Driver-execution phase is unchanged: both old and new rely on Ray's node-failure detection + the driver erroring out.) One way to restore fast-fail, with wait -n (bash ≥5.1):
| set +e | |
| wait ${SRUN_PIDS["ray-head"]} | |
| exit_code=$? | |
| set -e | |
| exit $exit_code | |
| set +e | |
| # Fail fast if a worker/sandbox srun dies, instead of waiting out the head's deadline. | |
| _watch=("${SRUN_PIDS["ray-head"]}") | |
| [[ -n "${SRUN_PIDS["ray-workers"]:-}" ]] && _watch+=("${SRUN_PIDS["ray-workers"]}") | |
| [[ -n "${SRUN_PIDS["sandbox"]:-}" ]] && _watch+=("${SRUN_PIDS["sandbox"]}") | |
| wait -n -p _done_pid "${_watch[@]}" | |
| exit_code=$? | |
| if [[ "$_done_pid" != "${SRUN_PIDS["ray-head"]}" ]]; then | |
| echo "[ERROR] Background srun (pid=$_done_pid) exited before the head; failing fast." >&2 | |
| touch "$LOG_DIR/ENDED" | |
| fi | |
| set -e | |
| exit $exit_code |
| set +x | ||
| while true; do | ||
| sleep 10 | ||
| units=\$(ray status 2>/dev/null | grep "worker_units" | awk -F'[/. ]' '{print \$4}' || echo 0) |
There was a problem hiding this comment.
|| echo 0 never fires: the pipeline's exit status is awk's (which exits 0 even when grep matches nothing), so units is empty, not 0. The set -o pipefail at L25 is in the submit-host shell and doesn't propagate into the bash -c "$head_cmd" subshell (SHELLOPTS isn't exported), so it's off here and at the inline poll on L478. Benign today (empty compares not-equal → loop retries), but worth fixing.
Cleanest fix covers L411, L478, and transitively the L758 read at once by enabling pipefail at the top of head_cmd — it spans more than this line, so as a diff to apply manually (pipefail only, not set -e — the retry loops and [[ … -eq … ]] poll conditions rely on non-zero being non-fatal):
fi
+# pipefail so 'ray status | grep | awk || echo 0' reports grep's miss and the fallback fires.
+# NOTE: pipefail only here — do NOT add 'set -e'.
+set -o pipefail
exit-dramatically() {
What does this PR do ?
ray.subpreviously scales srun/slurmctld RPCs with the node count: it launches one srun per worker (each followed by a 3s sleep), slept a fixed 120s before workers, polls head readiness viasrun --overlap test -f, and polls cluster status via repeatedsrun ... ray status. For larger jobs, this dominates cluster bringup time and risks throttling slurmctld.This PR replaces the per-node fan-out with shared-filesystem file signaling so srun calls stay constant relative to the node count:
ray start --addressand retry the GCS connect.srun ... ray statusRPCs.SANDBOX_READY_<host>; the head waits for all instances before launching the driver. TheSANDBOX_PORTS_DIRsignal dir is always mounted and exported, independent of theSANDBOX_EXTRA_MOUNTS/SANDBOX_ENV_VARSknobs.This change assumes that the LOG_DIR is on a shared filesystem visible across compute nodes.
This change also does the following:
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information