fix: default workflow input extensions by filetype#2045
Conversation
|
Follow-up from #1982 review:
Validation:
|
Greptile SummaryThis PR fixes the default
Confidence Score: 5/5Safe to merge — all changes are additive, well-isolated, and covered by new tests. The change is narrow in scope: a new utility function, five one-line call-site updates, and matching tests. The new get_default_file_extensions helper raises eagerly for unknown filetypes and is exercised by all five affected workflows. No existing public interface changed, no data-path logic was altered, and every new branch has a corresponding test. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Workflow init
input_file_extensions / input_filetype] --> B{input_file_extensions
provided and non-empty?}
B -- Yes --> C[Use provided
extensions]
B -- No --> D[get_default_file_extensions
input_filetype]
D --> E{filetype in
FILETYPE_TO_DEFAULT_EXTENSIONS?}
E -- Yes --> F[Return canonical extensions
e.g. jsonl → .jsonl .json
parquet → .parquet]
E -- No --> G[Raise ValueError
Unsupported filetype]
C --> H[FilePartitioningStage
or Reader stage]
F --> H
H --> I[Pipeline execution]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[Workflow init
input_file_extensions / input_filetype] --> B{input_file_extensions
provided and non-empty?}
B -- Yes --> C[Use provided
extensions]
B -- No --> D[get_default_file_extensions
input_filetype]
D --> E{filetype in
FILETYPE_TO_DEFAULT_EXTENSIONS?}
E -- Yes --> F[Return canonical extensions
e.g. jsonl → .jsonl .json
parquet → .parquet]
E -- No --> G[Raise ValueError
Unsupported filetype]
C --> H[FilePartitioningStage
or Reader stage]
F --> H
H --> I[Pipeline execution]
Reviews (6): Last reviewed commit: "fix: address deduplication test failures" | Re-trigger Greptile |
| file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype) | ||
| if not file_extensions: | ||
| msg = f"Unsupported filetype: {self.input_filetype}" | ||
| raise ValueError(msg) |
There was a problem hiding this comment.
The
if not file_extensions: guard is now unreachable. get_default_file_extensions either returns a non-empty list or raises ValueError, so file_extensions will never be falsy after the assignment. The block can be removed.
| file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype) | |
| if not file_extensions: | |
| msg = f"Unsupported filetype: {self.input_filetype}" | |
| raise ValueError(msg) | |
| file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype) |
There was a problem hiding this comment.
Addressed in 19a7290: removed the unreachable guard and now rely on get_default_file_extensions to raise for unsupported input_filetype.
sarahyurick
left a comment
There was a problem hiding this comment.
Shouldn't https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/deduplication/semantic/workflow.py#L268 be updated too or no?
| file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype) | ||
| if not file_extensions: | ||
| msg = f"Unsupported filetype: {self.input_filetype}" | ||
| raise ValueError(msg) |
There was a problem hiding this comment.
Is there a test for this?
There was a problem hiding this comment.
Added focused KMeansStage decomposition tests in tests/stages/deduplication/semantic/test_kmeans.py covering default extensions for parquet/jsonl, custom extension overrides, and unsupported filetype errors.
|
Thanks for the follow-up. I pushed 19a7290 with the requested updates:
Validation:
Targeted pytest is still blocked on this macOS runner by NeMo-Curator’s Linux-only import guard. |
sarahyurick
left a comment
There was a problem hiding this comment.
Thanks @nightcityblade ! LGTM, although I will default to @praateekmahajan before merging, since he was the one who created the issue.
|
/ok to test 8ba4a15 |
|
Thanks — I pushed 51e8925 to address the failing deduplication/text tests:
Validation:
Targeted pytest is still blocked locally by NeMo-Curator's Linux-only import guard on this macOS runner. |
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
|
I fixed the DCO failure by rebasing the branch onto current Validation run:
|
Description
Fixes #1330. Replacement for #1982 after the source branch had to be recovered.
Workflows that expose
input_file_extensionsnow default to the extensions for their selectedinput_filetypewhen no explicit override is provided, instead of falling back to the genericFilePartitioningStagedefaults.Usage
Checklist
Tests:
uv run ruff check tests/stages/text/deduplication/test_semantic.pyuv run ruff format --check tests/stages/text/deduplication/test_semantic.pypython3 -m py_compile tests/stages/text/deduplication/test_semantic.pyuv run pytest tests/stages/text/deduplication/test_semantic.py -k 'embedding_reader_extensions'(blocked locally: NeMo-Curator raises at import time on non-Linux/darwin)