Add option to drop deduplication id field by nightcityblade · Pull Request #2078 · NVIDIA-NeMo/Curator

nightcityblade · 2026-06-16T03:16:25Z

Description

Closes #1580.

Adds an optional drop_id_field flag to the text duplicate removal stage/workflow so generated deduplication IDs can be removed from final outputs. The semantic deduplication workflow enables it by default when using generated IDs and no explicit output_fields are provided.

Usage

workflow = TextDuplicatesRemovalWorkflow(
    input_path="input",
    ids_to_remove_path="duplicates",
    output_path="output",
    drop_id_field=True,
)

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Testing

uv run ruff check nemo_curator/stages/text/deduplication/removal.py nemo_curator/stages/text/deduplication/removal_workflow.py nemo_curator/stages/text/deduplication/semantic.py tests/stages/text/deduplication/test_removal_workflow.py
uv run pytest tests/stages/text/deduplication/test_removal_workflow.py -q (blocked on macOS: ValueError: NeMo-Curator currently only supports Linux systems, while the current machine has a darwin system.)

Signed-off-by: nightcityblade <nightcityblade@gmail.com>

copy-pr-bot · 2026-06-16T03:16:29Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-16T03:19:09Z

Greptile Summary

This PR adds an optional drop_id_field flag to TextDuplicatesRemovalStage and TextDuplicatesRemovalWorkflow so that synthetic deduplication IDs can be stripped from the final output. The semantic deduplication workflow automatically enables this when it generated the IDs and no explicit output_fields constraint was given.

TextDuplicatesRemovalStage.process() drops the ID column after the duplicate-filter step when drop_id_field=True.
TextDuplicatesRemovalWorkflow.__post_init__ validates that drop_id_field=True and id_field in output_fields cannot be used together, surfacing misconfiguration at construction time with a clear message.
TextSemanticDeduplicationWorkflow auto-selects drop_id_field with self.use_id_generator and self.output_fields is None, which is always safe under the existing invariant that use_id_generator=True requires id_field == CURATOR_DEDUP_ID_STR.

Confidence Score: 5/5

Safe to merge — the change is additive, defaults to the existing behaviour (drop_id_field=False), and the conflicting-option guard surfaces misconfiguration at construction time before any I/O occurs.

All four changed files are straightforward: a new boolean flag, a single-column drop after filtering, an early validation guard, and matching tests. The semantic workflow's auto-enable condition is tightly constrained by the existing use_id_generator invariants, and the default False value keeps all existing pipelines unchanged.

No files require special attention.

Important Files Changed

Filename	Overview
nemo_curator/stages/text/deduplication/removal.py	Adds drop_id_field: bool = False to the stage dataclass and drops the column after duplicate filtering; logic is straightforward and correct.
nemo_curator/stages/text/deduplication/removal_workflow.py	Adds drop_id_field flag, propagates it to the removal stage, and adds an early-exit validation in post_init that catches the conflicting drop_id_field=True + id_field in output_fields case.
nemo_curator/stages/text/deduplication/semantic.py	Auto-enables drop_id_field when use_id_generator=True and no explicit output_fields are set; condition is safe because it is always paired with output_fields=None, so the new validation in the workflow never fires.
tests/stages/text/deduplication/test_removal_workflow.py	Adds unit test for the stage drop behaviour, a validation test for the conflicting-options guard, and updates the existing stage-generation tests to assert drop_id_field is propagated correctly.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[TextDuplicatesRemovalWorkflow\nconstructor] --> B{drop_id_field=True\nAND output_fields\ncontains id_field?}
    B -- Yes --> C[raise ValueError\nearly exit]
    B -- No --> D[_generate_stages]
    D --> E[FilePartitioningStage]
    E --> F[ParquetReader / JsonlReader\n_assign_ids if id_generator_path]
    F --> G[TextDuplicatesRemovalStage\ndrop_id_field passed through]
    G --> H{drop_id_field?}
    H -- True --> I[df.drop id_field column]
    H -- False --> J[keep id_field in batch]
    I --> K[ParquetWriter / JsonlWriter\noutput_fields filter]
    J --> K

    subgraph SemanticDedup auto-enable
        SD[TextSemanticDeduplicationWorkflow] --> SC{use_id_generator\nAND output_fields is None?}
        SC -- Yes --> SD2[drop_id_field=True\nauto-set]
        SC -- No --> SD3[drop_id_field=False]
    end

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[TextDuplicatesRemovalWorkflow\nconstructor] --> B{drop_id_field=True\nAND output_fields\ncontains id_field?}
    B -- Yes --> C[raise ValueError\nearly exit]
    B -- No --> D[_generate_stages]
    D --> E[FilePartitioningStage]
    E --> F[ParquetReader / JsonlReader\n_assign_ids if id_generator_path]
    F --> G[TextDuplicatesRemovalStage\ndrop_id_field passed through]
    G --> H{drop_id_field?}
    H -- True --> I[df.drop id_field column]
    H -- False --> J[keep id_field in batch]
    I --> K[ParquetWriter / JsonlWriter\noutput_fields filter]
    J --> K

    subgraph SemanticDedup auto-enable
        SD[TextSemanticDeduplicationWorkflow] --> SC{use_id_generator\nAND output_fields is None?}
        SC -- Yes --> SD2[drop_id_field=True\nauto-set]
        SC -- No --> SD3[drop_id_field=False]
    end

_{Reviews (2): Last reviewed commit: "Validate dropped id output field conflic..." | Re-trigger Greptile}

greptile-apps · 2026-06-16T03:19:13Z

    output_kwargs: dict[str, Any] | None = None
    output_fields: list[str] | None = None
    output_mode: Literal["ignore", "overwrite", "append", "error"] | None = None
+    drop_id_field: bool = False


Conflicting drop_id_field + output_fields not validated

When a caller sets drop_id_field=True and also includes id_field in output_fields, the removal stage will have already dropped that column by the time the writer stage tries to select it, producing a KeyError at runtime. The semantic workflow avoids this with an explicit self.output_fields is None guard, but the base TextDuplicatesRemovalWorkflow has no equivalent check. Adding a __post_init__ guard (after the id_generator warning) like if self.drop_id_field and self.output_fields and self.id_field in self.output_fields: raise ValueError(...) would surface this misconfiguration early with a clear message.

+1 I think this is decent feedback. WDYT @nightcityblade ?

Good call — I added an early __post_init__ validation that raises a clear ValueError when drop_id_field=True conflicts with output_fields containing the id field, plus a focused unit test for that configuration.

Pushed: nightcityblade/Curator@2dbe71d
Validation: uv run ruff check nemo_curator/stages/text/deduplication/removal_workflow.py tests/stages/text/deduplication/test_removal_workflow.py and python3 -m py_compile ... passed locally. Targeted pytest collection is blocked on this macOS host by Curator’s import-time Linux-only guard.

sarahyurick · 2026-06-16T17:07:16Z

    output_kwargs: dict[str, Any] | None = None
    output_fields: list[str] | None = None
    output_mode: Literal["ignore", "overwrite", "append", "error"] | None = None
+    drop_id_field: bool = False


+1 I think this is decent feedback. WDYT @nightcityblade ?

sarahyurick · 2026-06-16T17:09:10Z

+
+    result = stage.process(task).to_pandas()
+
+    assert result.to_dict(orient="list") == {"text": ["keep"]}


I misread this test at first and was confused why we were checking that this text was kept. Can you add another assertion that explicitly checks that CURATOR_DEDUP_ID_STR is not in the result?

Added — the test now explicitly asserts that CURATOR_DEDUP_ID_STR is absent from the result columns, in addition to checking the remaining row contents.

Pushed in nightcityblade/Curator@2dbe71d.

Add option to drop deduplication id field

efc4cbe

Signed-off-by: nightcityblade <nightcityblade@gmail.com>

nightcityblade requested a review from a team as a code owner June 16, 2026 03:16

nightcityblade requested review from sarahyurick and removed request for a team June 16, 2026 03:16

github-actions Bot added the community-request label Jun 16, 2026

greptile-apps Bot reviewed Jun 16, 2026

View reviewed changes

sarahyurick reviewed Jun 16, 2026

View reviewed changes

svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label Jun 16, 2026

Validate dropped id output field conflict

2dbe71d

svcnvidia-nemo-ci added waiting-on-maintainers Waiting on maintainers to respond and removed waiting-on-customer Waiting on the original author to respond labels Jun 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option to drop deduplication id field#2078

Add option to drop deduplication id field#2078
nightcityblade wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1580

nightcityblade commented Jun 16, 2026

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

greptile-apps Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot Jun 16, 2026

Uh oh!

sarahyurick Jun 16, 2026

Uh oh!

nightcityblade Jun 17, 2026

Uh oh!

sarahyurick Jun 16, 2026

Uh oh!

sarahyurick Jun 16, 2026

Uh oh!

nightcityblade Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants


		result = stage.process(task).to_pandas()

		assert result.to_dict(orient="list") == {"text": ["keep"]}

Conversation

nightcityblade commented Jun 16, 2026

Description

Usage

Checklist

Testing

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

greptile-apps Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

nightcityblade Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

nightcityblade Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps Bot commented Jun 16, 2026 •

edited

Loading