Skip to content

fix: default workflow input extensions by filetype#2045

Open
nightcityblade wants to merge 4 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1330
Open

fix: default workflow input extensions by filetype#2045
nightcityblade wants to merge 4 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1330

Conversation

@nightcityblade

Copy link
Copy Markdown
Contributor

Description

Fixes #1330. Replacement for #1982 after the source branch had to be recovered.

Workflows that expose input_file_extensions now default to the extensions for their selected input_filetype when no explicit override is provided, instead of falling back to the generic FilePartitioningStage defaults.

Usage

from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

workflow = ExactDeduplicationWorkflow(
    input_path="/data/input",
    output_path="/data/output",
    input_filetype="jsonl",
)
# Defaults to [".jsonl", ".json"] unless input_file_extensions is set.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Tests:

  • uv run ruff check tests/stages/text/deduplication/test_semantic.py
  • uv run ruff format --check tests/stages/text/deduplication/test_semantic.py
  • python3 -m py_compile tests/stages/text/deduplication/test_semantic.py
  • uv run pytest tests/stages/text/deduplication/test_semantic.py -k 'embedding_reader_extensions' (blocked locally: NeMo-Curator raises at import time on non-Linux/darwin)

@copy-pr-bot

copy-pr-bot Bot commented Jun 3, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nightcityblade

Copy link
Copy Markdown
Contributor Author

Follow-up from #1982 review:

  • Moved the semantic file-extension tests into the existing tests/stages/text/deduplication/test_semantic.py file instead of creating a new test file.
  • Kept nemo_curator/stages/deduplication/semantic/workflow.py unchanged because it delegates input partitioning to KMeansStage, and this PR passes input_file_extensions through to that stage; the default resolution happens in KMeansStage.decompose() before constructing FilePartitioningStage.

Validation:

  • uv run ruff check tests/stages/text/deduplication/test_semantic.py
  • uv run ruff format --check tests/stages/text/deduplication/test_semantic.py
  • python3 -m py_compile tests/stages/text/deduplication/test_semantic.py
  • Targeted pytest remains blocked on this Darwin runner by NeMo-Curator’s Linux-only import guard.

@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes the default input_file_extensions behaviour across five deduplication workflows by introducing a new get_default_file_extensions utility that maps each supported input_filetype to its canonical extensions, replacing the previous fall-through to the union of all extensions ([".jsonl", ".json", ".parquet"]).

  • New utility (file_utils.get_default_file_extensions): raises ValueError for unknown filetypes; used as the or-fallback when input_file_extensions is not provided in ExactDeduplicationWorkflow, FuzzyDeduplicationWorkflow, KMeansStage, TextDuplicatesRemovalWorkflow, and TextSemanticDeduplicationWorkflow.
  • removal_workflow.py adds an explicit early guard for unsupported input_filetype values before the lazy FilePartitioningStage import.
  • Tests cover default extension selection, user override, and unsupported-filetype errors for all five affected workflows.

Confidence Score: 5/5

Safe to merge — all changes are additive, well-isolated, and covered by new tests.

The change is narrow in scope: a new utility function, five one-line call-site updates, and matching tests. The new get_default_file_extensions helper raises eagerly for unknown filetypes and is exercised by all five affected workflows. No existing public interface changed, no data-path logic was altered, and every new branch has a corresponding test.

No files require special attention.

Important Files Changed

Filename Overview
nemo_curator/utils/file_utils.py Adds get_default_file_extensions — a thin, well-scoped wrapper around FILETYPE_TO_DEFAULT_EXTENSIONS that raises ValueError for unknown types.
nemo_curator/stages/deduplication/exact/workflow.py Replaces bare self.input_file_extensions with self.input_file_extensions or get_default_file_extensions(self.input_filetype) in _create_input_filegroups; straightforward and correct.
nemo_curator/stages/deduplication/fuzzy/workflow.py Same one-line fix as exact workflow applied to _create_minhash_pipeline; consistent with the rest of the PR.
nemo_curator/stages/deduplication/semantic/kmeans.py Replaces the manual FILETYPE_TO_DEFAULT_EXTENSIONS.get + unreachable guard with get_default_file_extensions, simplifying decompose(); previously discussed dead-code guard removed.
nemo_curator/stages/text/deduplication/removal_workflow.py Adds an explicit early filetype guard and uses get_default_file_extensions for the FilePartitioningStage; the guard fires before the lazy import, providing a slightly more specific error message.
nemo_curator/stages/text/deduplication/semantic.py Applies get_default_file_extensions in both the jsonl and parquet branches of _run_embedding_generation; unsupported filetypes still hit the existing else: raise NotImplementedError path before the utility is ever called.
tests/stages/text/deduplication/test_semantic.py Adds three module-level unit tests using monkeypatching to verify extension defaults, overrides, and the unsupported-filetype error path; also removes autouse=True from the session fixture and wires it explicitly into test_config.
tests/stages/text/deduplication/test_removal_workflow.py Parametrises test_reader_stage on (filetype, expected_extensions) and adds a separate override test; correctly updates the previously hard-coded [".jsonl", ".json", ".parquet"] assertion.
tests/stages/deduplication/exact/test_workflow.py Adds default-extension and override tests for ExactDeduplicationWorkflow; coverage is complete for the one changed site.
tests/stages/deduplication/fuzzy/test_fuzzy_workflow.py Mirrors exact-workflow tests for FuzzyDeduplicationWorkflow; adequate coverage.
tests/stages/deduplication/semantic/test_kmeans.py Adds a new TestKMeansStage class with parametrised defaults, override, and unsupported-filetype error tests; correctly annotated with skipif guard.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Workflow init
input_file_extensions / input_filetype] --> B{input_file_extensions
provided and non-empty?}
    B -- Yes --> C[Use provided
extensions]
    B -- No --> D[get_default_file_extensions
input_filetype]
    D --> E{filetype in
FILETYPE_TO_DEFAULT_EXTENSIONS?}
    E -- Yes --> F[Return canonical extensions
e.g. jsonl → .jsonl .json
parquet → .parquet]
    E -- No --> G[Raise ValueError
Unsupported filetype]
    C --> H[FilePartitioningStage
or Reader stage]
    F --> H
    H --> I[Pipeline execution]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[Workflow init
input_file_extensions / input_filetype] --> B{input_file_extensions
provided and non-empty?}
    B -- Yes --> C[Use provided
extensions]
    B -- No --> D[get_default_file_extensions
input_filetype]
    D --> E{filetype in
FILETYPE_TO_DEFAULT_EXTENSIONS?}
    E -- Yes --> F[Return canonical extensions
e.g. jsonl → .jsonl .json
parquet → .parquet]
    E -- No --> G[Raise ValueError
Unsupported filetype]
    C --> H[FilePartitioningStage
or Reader stage]
    F --> H
    H --> I[Pipeline execution]
Loading

Reviews (6): Last reviewed commit: "fix: address deduplication test failures" | Re-trigger Greptile

Comment on lines 347 to 350
file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype)
if not file_extensions:
msg = f"Unsupported filetype: {self.input_filetype}"
raise ValueError(msg)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The if not file_extensions: guard is now unreachable. get_default_file_extensions either returns a non-empty list or raises ValueError, so file_extensions will never be falsy after the assignment. The block can be removed.

Suggested change
file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype)
if not file_extensions:
msg = f"Unsupported filetype: {self.input_filetype}"
raise ValueError(msg)
file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 19a7290: removed the unreachable guard and now rely on get_default_file_extensions to raise for unsupported input_filetype.

@sarahyurick sarahyurick left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 347 to 350
file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype)
if not file_extensions:
msg = f"Unsupported filetype: {self.input_filetype}"
raise ValueError(msg)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test for this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added focused KMeansStage decomposition tests in tests/stages/deduplication/semantic/test_kmeans.py covering default extensions for parquet/jsonl, custom extension overrides, and unsupported filetype errors.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-customer Waiting on the original author to respond labels Jun 3, 2026
@nightcityblade

Copy link
Copy Markdown
Contributor Author

Thanks for the follow-up. I pushed 19a7290 with the requested updates:

  • Removed the unreachable if not file_extensions: guard in KMeansStage.decompose().
  • Added focused tests for KMeansStage default file-extension selection, override handling, and unsupported filetype errors.
  • I left semantic/workflow.py unchanged: it already passes input_file_extensions through to KMeansStage, so the default resolution/validation now happens in one place during KMeansStage.decompose().

Validation:

  • uv run ruff check nemo_curator/stages/deduplication/semantic/kmeans.py tests/stages/deduplication/semantic/test_kmeans.py
  • uv run ruff format --check nemo_curator/stages/deduplication/semantic/kmeans.py tests/stages/deduplication/semantic/test_kmeans.py
  • python3 -m py_compile nemo_curator/stages/deduplication/semantic/kmeans.py tests/stages/deduplication/semantic/test_kmeans.py

Targeted pytest is still blocked on this macOS runner by NeMo-Curator’s Linux-only import guard.

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label Jun 4, 2026

@sarahyurick sarahyurick left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nightcityblade ! LGTM, although I will default to @praateekmahajan before merging, since he was the one who created the issue.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Jun 6, 2026
@sarahyurick

Copy link
Copy Markdown
Contributor

/ok to test 8ba4a15

@nightcityblade

Copy link
Copy Markdown
Contributor Author

Thanks — I pushed 51e8925 to address the failing deduplication/text tests:

  • Restored TextDuplicatesRemovalWorkflow's existing invalid-input-filetype error before default extension lookup.
  • Marked the new semantic/KMeans extension tests to skip cleanly when their GPU-only dependencies are unavailable in CPU-only jobs, instead of failing with NameError after suppressed imports.

Validation:

  • uv run ruff check nemo_curator/stages/text/deduplication/removal_workflow.py tests/stages/deduplication/semantic/test_kmeans.py tests/stages/text/deduplication/test_semantic.py
  • uv run ruff format --check nemo_curator/stages/text/deduplication/removal_workflow.py tests/stages/deduplication/semantic/test_kmeans.py tests/stages/text/deduplication/test_semantic.py
  • python3 -m py_compile nemo_curator/stages/text/deduplication/removal_workflow.py tests/stages/deduplication/semantic/test_kmeans.py tests/stages/text/deduplication/test_semantic.py

Targeted pytest is still blocked locally by NeMo-Curator's Linux-only import guard on this macOS runner.

nightcityblade added 4 commits June 18, 2026 23:17
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@nightcityblade

Copy link
Copy Markdown
Contributor Author

I fixed the DCO failure by rebasing the branch onto current main, removing the old merge commits, and replaying the PR commits with proper Signed-off-by trailers. The branch was updated with --force-with-lease to keep the PR history DCO-clean.

Validation run:

  • uv run python -m py_compile nemo_curator/stages/deduplication/exact/workflow.py nemo_curator/stages/deduplication/fuzzy/workflow.py nemo_curator/stages/deduplication/semantic/kmeans.py nemo_curator/stages/text/deduplication/removal_workflow.py nemo_curator/stages/text/deduplication/semantic.py tests/stages/deduplication/exact/test_workflow.py tests/stages/deduplication/fuzzy/test_fuzzy_workflow.py tests/stages/deduplication/semantic/test_kmeans.py tests/stages/text/deduplication/test_removal_workflow.py tests/stages/text/deduplication/test_semantic.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

See if all workflows expose file_extensions for reader i/o

4 participants