Add support for Slurm arrays#2059
Conversation
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
| # Guarantee every emitted task has a task_id (derived id, or uuid fallback). | ||
| results = self._post_process_task_ids(tasks, results) | ||
|
|
||
| self._record_failed_tasks([r for r in results if isinstance(r, FailedTask)]) |
There was a problem hiding this comment.
Discussed with @abhinavg4 . For now the PR keeps track of FailedTask instances by looking for a user-set FAILED_TASKS_DIR_ENV_VAR = "NEMO_CURATOR_FAILED_TASKS_DIR" and writing a JSON file per failed task in the specified directory.
I did the environment variable and write approach because it seems more reliable than trying to handle a global Python variable, etc. And the reason it is an environment variable is so that BaseStageAdapter does not have to propagate an additional parameter for every single stage (which I think would involve having to update the executors as well?). Open to other suggestions.
praateekmahajan
left a comment
There was a problem hiding this comment.
Took a super quick look, here are some general thoughts
- Instead of adding the same 3/4 fields to every "source" stage, can we have a base class and inherit that?
- Alternatively (or maybe in addition),
pipeline.buildiirc now dynamically sets the first stage as is_source_stage=True, so can we just rely on those? If we do then inside backends/base.py we can say "if this is a source stage AND slurm is enabled then just usetask_idas my key and decide which shard it belongs to"... this is something @abhinavg4 and I had discussed, this reduces the number of changes needed across curator code base, and also generalizes, since source_stage have task_id which is likely assigned usingget_determenistic_task_idwhich is a hash(metadat['source_files'])
For 1, sure. For 2, we could but it makes this PR dependent on the resumability PR, which is what we were trying to avoid I thought... also, I guess it is not immediately obvious to me how it can work for source stages that are not a |
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
TODO:
FilePartitioningStageJsonlReader,ParquetReader, etc.FailedTasksupportnemo-curator-slurm-cli(not planned for this PR)SLURM_ARRAY_TASK_COUNT> cluster limit