Add support for Slurm arrays by sarahyurick · Pull Request #2059 · NVIDIA-NeMo/Curator

sarahyurick · 2026-06-09T21:31:11Z

TODO:

Add Slurm array parameters to FilePartitioningStage
Propagate Slurm array parameters through JsonlReader, ParquetReader, etc.
Add retry support
Add FailedTask support
Add a tutorial
Add nemo-curator-slurm-cli (not planned for this PR)
Address case when SLURM_ARRAY_TASK_COUNT > cluster limit
Add tests

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
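
For readers unfamiliar with Slurm arrays, here is a minimal sketch of how a file-partitioning step could pick its share of the input from the array environment. It is not the implementation in this PR. `slurm_array_shard` is a hypothetical helper, and the sketch assumes the array indices are contiguous (step of 1). `SLURM_ARRAY_TASK_ID`, `SLURM_ARRAY_TASK_MIN`, and `SLURM_ARRAY_TASK_COUNT` are the standard variables Slurm sets for array jobs.

```python
import os


def slurm_array_shard(files, env=None):
    """Return the subset of `files` this array task should process (hypothetical helper)."""
    env = os.environ if env is None else env
    task_id = env.get("SLURM_ARRAY_TASK_ID")
    if task_id is None:
        # Not running inside a Slurm array job: process everything.
        return sorted(files)
    # Assumption: indices are contiguous, so (id - min) is a 0-based position.
    task_min = int(env.get("SLURM_ARRAY_TASK_MIN", "0"))
    task_count = int(env["SLURM_ARRAY_TASK_COUNT"])
    index = int(task_id) - task_min
    # Sort first so every array task sees the same ordering, then take a strided slice.
    return sorted(files)[index::task_count]
```

With five files and an array of two tasks, task 0 would take the files at positions 0, 2, 4 and task 1 the files at positions 1, 3, so the shards are disjoint and together cover the input.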

copy-pr-bot · 2026-06-09T21:31:14Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

sarahyurick · 2026-06-12T16:37:10Z

        # Guarantee every emitted task has a task_id (derived id, or uuid fallback).
        results = self._post_process_task_ids(tasks, results)

+        self._record_failed_tasks([r for r in results if isinstance(r, FailedTask)])


Discussed with @abhinavg4. For now, the PR tracks FailedTask instances by checking for a user-set environment variable, FAILED_TASKS_DIR_ENV_VAR = "NEMO_CURATOR_FAILED_TASKS_DIR", and writing one JSON file per failed task to the specified directory.

I went with the environment variable plus file-write approach because it seems more reliable than trying to manage a global Python variable. It is an environment variable so that BaseStageAdapter does not have to propagate an additional parameter to every single stage (which I think would also mean updating the executors?). Open to other suggestions.
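
A rough sketch of the approach described above, for discussion only. The environment variable name comes from this comment. The `FailedTask` fields, the file naming, and the `record_failed_tasks` function are assumptions for illustration and may not match the PR.

```python
import json
import os
import uuid
from dataclasses import asdict, dataclass
from pathlib import Path

FAILED_TASKS_DIR_ENV_VAR = "NEMO_CURATOR_FAILED_TASKS_DIR"


@dataclass
class FailedTask:
    # Hypothetical fields; the real FailedTask in the PR may differ.
    task_id: str
    stage_name: str
    error: str


def record_failed_tasks(failed):
    """Write one JSON file per failed task; do nothing if the env var is unset."""
    out_dir = os.environ.get(FAILED_TASKS_DIR_ENV_VAR)
    if not out_dir or not failed:
        return []
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for task in failed:
        # A random suffix avoids clobbering when the same task fails more than once.
        path = directory / f"{task.task_id}-{uuid.uuid4().hex}.json"
        path.write_text(json.dumps(asdict(task)))
        written.append(path)
    return written
```

Because the directory is read from the environment at call time, nothing has to be threaded through stage constructors or executors, which is the trade-off the comment describes.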

praateekmahajan

Took a super quick look; here are some general thoughts:

1. Instead of adding the same 3-4 fields to every "source" stage, can we have a base class and inherit from it?
2. Alternatively (or maybe in addition), pipeline.build IIRC now dynamically marks the first stage with is_source_stage=True, so can we just rely on that? If we do, then inside backends/base.py we can say "if this is a source stage AND Slurm is enabled, use task_id as the key and decide which shard it belongs to". This is something @abhinavg4 and I had discussed. It reduces the number of changes needed across the Curator code base, and it also generalizes, since source-stage tasks have a task_id that is likely assigned using get_deterministic_task_id, which is a hash(metadata['source_files']).
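
To make the suggestion concrete, here is a minimal sketch of deriving a stable task id from source files and mapping it to a shard. Both function names are hypothetical stand-ins, not the helpers in the Curator code base. The sketch uses `hashlib` rather than the built-in `hash()`, because `hash()` on strings is randomized per process by default and separate array tasks would disagree on the result.

```python
import hashlib


def deterministic_task_id(source_files):
    """Stable id for a task, independent of file order (hypothetical stand-in)."""
    joined = "\n".join(sorted(source_files))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def shard_for_task(task_id, num_shards):
    """Map a task id to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Every array task can evaluate `shard_for_task` independently and reach the same answer, so no coordination between array tasks is needed to decide ownership.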

sarahyurick · 2026-06-15T21:18:41Z

> Took a super quick look; here are some general thoughts:
>
> 1. Instead of adding the same 3-4 fields to every "source" stage, can we have a base class and inherit from it?
> 2. Alternatively (or maybe in addition), pipeline.build IIRC now dynamically marks the first stage with is_source_stage=True, so can we just rely on that? If we do, then inside backends/base.py we can say "if this is a source stage AND Slurm is enabled, use task_id as the key and decide which shard it belongs to". This is something @abhinavg4 and I had discussed. It reduces the number of changes needed across the Curator code base, and it also generalizes, since source-stage tasks have a task_id that is likely assigned using get_deterministic_task_id, which is a hash(metadata['source_files']).

For 1, sure.

For 2, we could, but it makes this PR dependent on the resumability PR, which I thought we were trying to avoid. It is also not immediately obvious to me how it would work for source stages that are not a FilePartitioningStage. I get the general idea, but I am not convinced it would always work. It sounds like it would probably have to convert all unselected tasks to NoneTask?
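
One possible reading of the "convert unselected tasks to NoneTask" idea, sketched under assumptions: `Task`, `NoneTask`, and `select_for_shard` below are illustrative stand-ins rather than the classes in this PR, and CRC32 is used only as a simple process-stable hash.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Minimal stand-in for a source-stage task.
    task_id: str


class NoneTask:
    """Hypothetical sentinel marking a task that this array task should skip."""


NONE_TASK = NoneTask()


def select_for_shard(tasks, shard_index, num_shards):
    """Keep tasks owned by `shard_index`; replace all others with the sentinel."""
    selected = []
    for task in tasks:
        # zlib.crc32 is stable across processes, unlike the built-in hash() on str.
        owner = zlib.crc32(task.task_id.encode("utf-8")) % num_shards
        selected.append(task if owner == shard_index else NONE_TASK)
    return selected
```

The output keeps the same length and ordering as the input, so downstream code would need to drop or ignore the sentinel entries. Whether that is acceptable for source stages that are not a FilePartitioningStage is the open question raised above.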

basic slurm array file partitioning

bde2217

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

sarahyurick and others added 4 commits June 9, 2026 14:54

add slurm array params to composite stages using filepartitioningstage

a0595f6

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

add tutorial and tests

43ee179

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

Merge branch 'main' into slurm_array

cae17b3

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

ruff

acfeceb

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

sarahyurick commented Jun 11, 2026

Comment thread nemo_curator/stages/text/deduplication/semantic.py Outdated

sarahyurick marked this pull request as ready for review June 11, 2026 17:31

sarahyurick requested review from a team, abhinavg4 and suiyoubi as code owners June 11, 2026 17:31

copy-pr-bot Bot temporarily deployed to public June 11, 2026 17:31 Inactive

copy-pr-bot Bot temporarily deployed to test June 11, 2026 17:32 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 11, 2026 17:32 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 11, 2026 18:22 Inactive

sarahyurick and others added 6 commits June 11, 2026 12:53

more greptile comments

2ccbd3f

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

add nonetask and failedtask sentinels

1b659ea

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

add failedtask detection and repeat

3522809

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

ruff

717edac

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

Merge branch 'main' into slurm_array

8f2345b

greptile comments

ebba73e

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

sarahyurick commented Jun 12, 2026

Merge branch 'main' into slurm_array

5e58793

praateekmahajan reviewed Jun 15, 2026

sarahyurick and others added 4 commits June 16, 2026 13:33

TextSemanticDeduplicationWorkflow revert

ad8f68a

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

Merge branch 'main' into slurm_array

437270f

Merge branch 'main' into slurm_array

b55ec47

use SlurmArrayConfig dataclass

672f3d2

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

Add support for Slurm arrays #2059
sarahyurick wants to merge 18 commits into NVIDIA-NeMo:main from sarahyurick:slurm_array