Skip to content

Add option to drop deduplication id field#2078

Open
nightcityblade wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1580
Open

Add option to drop deduplication id field#2078
nightcityblade wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1580

Conversation

@nightcityblade

Copy link
Copy Markdown
Contributor

Description

Closes #1580.

Adds an optional drop_id_field flag to the text duplicate removal stage/workflow so generated deduplication IDs can be removed from final outputs. The semantic deduplication workflow enables it by default when using generated IDs and no explicit output_fields are provided.

Usage

workflow = TextDuplicatesRemovalWorkflow(
    input_path="input",
    ids_to_remove_path="duplicates",
    output_path="output",
    drop_id_field=True,
)

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Testing

  • uv run ruff check nemo_curator/stages/text/deduplication/removal.py nemo_curator/stages/text/deduplication/removal_workflow.py nemo_curator/stages/text/deduplication/semantic.py tests/stages/text/deduplication/test_removal_workflow.py
  • uv run pytest tests/stages/text/deduplication/test_removal_workflow.py -q (blocked on macOS: ValueError: NeMo-Curator currently only supports Linux systems, while the current machine has a darwin system.)

Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@nightcityblade nightcityblade requested a review from a team as a code owner June 16, 2026 03:16
@nightcityblade nightcityblade requested review from sarahyurick and removed request for a team June 16, 2026 03:16
@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds an optional drop_id_field flag to TextDuplicatesRemovalStage and TextDuplicatesRemovalWorkflow so that synthetic deduplication IDs can be stripped from the final output. The semantic deduplication workflow automatically enables this when it generated the IDs and no explicit output_fields constraint was given.

  • TextDuplicatesRemovalStage.process() drops the ID column after the duplicate-filter step when drop_id_field=True.
  • TextDuplicatesRemovalWorkflow.__post_init__ validates that drop_id_field=True and id_field in output_fields cannot be used together, surfacing misconfiguration at construction time with a clear message.
  • TextSemanticDeduplicationWorkflow auto-selects drop_id_field with self.use_id_generator and self.output_fields is None, which is always safe under the existing invariant that use_id_generator=True requires id_field == CURATOR_DEDUP_ID_STR.

Confidence Score: 5/5

Safe to merge — the change is additive, defaults to the existing behaviour (drop_id_field=False), and the conflicting-option guard surfaces misconfiguration at construction time before any I/O occurs.

All four changed files are straightforward: a new boolean flag, a single-column drop after filtering, an early validation guard, and matching tests. The semantic workflow's auto-enable condition is tightly constrained by the existing use_id_generator invariants, and the default False value keeps all existing pipelines unchanged.

No files require special attention.

Important Files Changed

Filename Overview
nemo_curator/stages/text/deduplication/removal.py Adds drop_id_field: bool = False to the stage dataclass and drops the column after duplicate filtering; logic is straightforward and correct.
nemo_curator/stages/text/deduplication/removal_workflow.py Adds drop_id_field flag, propagates it to the removal stage, and adds an early-exit validation in post_init that catches the conflicting drop_id_field=True + id_field in output_fields case.
nemo_curator/stages/text/deduplication/semantic.py Auto-enables drop_id_field when use_id_generator=True and no explicit output_fields are set; condition is safe because it is always paired with output_fields=None, so the new validation in the workflow never fires.
tests/stages/text/deduplication/test_removal_workflow.py Adds unit test for the stage drop behaviour, a validation test for the conflicting-options guard, and updates the existing stage-generation tests to assert drop_id_field is propagated correctly.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[TextDuplicatesRemovalWorkflow\nconstructor] --> B{drop_id_field=True\nAND output_fields\ncontains id_field?}
    B -- Yes --> C[raise ValueError\nearly exit]
    B -- No --> D[_generate_stages]
    D --> E[FilePartitioningStage]
    E --> F[ParquetReader / JsonlReader\n_assign_ids if id_generator_path]
    F --> G[TextDuplicatesRemovalStage\ndrop_id_field passed through]
    G --> H{drop_id_field?}
    H -- True --> I[df.drop id_field column]
    H -- False --> J[keep id_field in batch]
    I --> K[ParquetWriter / JsonlWriter\noutput_fields filter]
    J --> K

    subgraph SemanticDedup auto-enable
        SD[TextSemanticDeduplicationWorkflow] --> SC{use_id_generator\nAND output_fields is None?}
        SC -- Yes --> SD2[drop_id_field=True\nauto-set]
        SC -- No --> SD3[drop_id_field=False]
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[TextDuplicatesRemovalWorkflow\nconstructor] --> B{drop_id_field=True\nAND output_fields\ncontains id_field?}
    B -- Yes --> C[raise ValueError\nearly exit]
    B -- No --> D[_generate_stages]
    D --> E[FilePartitioningStage]
    E --> F[ParquetReader / JsonlReader\n_assign_ids if id_generator_path]
    F --> G[TextDuplicatesRemovalStage\ndrop_id_field passed through]
    G --> H{drop_id_field?}
    H -- True --> I[df.drop id_field column]
    H -- False --> J[keep id_field in batch]
    I --> K[ParquetWriter / JsonlWriter\noutput_fields filter]
    J --> K

    subgraph SemanticDedup auto-enable
        SD[TextSemanticDeduplicationWorkflow] --> SC{use_id_generator\nAND output_fields is None?}
        SC -- Yes --> SD2[drop_id_field=True\nauto-set]
        SC -- No --> SD3[drop_id_field=False]
    end
Loading

Reviews (2): Last reviewed commit: "Validate dropped id output field conflic..." | Re-trigger Greptile

output_kwargs: dict[str, Any] | None = None
output_fields: list[str] | None = None
output_mode: Literal["ignore", "overwrite", "append", "error"] | None = None
drop_id_field: bool = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Conflicting drop_id_field + output_fields not validated

When a caller sets drop_id_field=True and also includes id_field in output_fields, the removal stage will have already dropped that column by the time the writer stage tries to select it, producing a KeyError at runtime. The semantic workflow avoids this with an explicit self.output_fields is None guard, but the base TextDuplicatesRemovalWorkflow has no equivalent check. Adding a __post_init__ guard (after the id_generator warning) like if self.drop_id_field and self.output_fields and self.id_field in self.output_fields: raise ValueError(...) would surface this misconfiguration early with a clear message.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I think this is decent feedback. WDYT @nightcityblade ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — I added an early __post_init__ validation that raises a clear ValueError when drop_id_field=True conflicts with output_fields containing the id field, plus a focused unit test for that configuration.

Pushed: nightcityblade/Curator@2dbe71d
Validation: uv run ruff check nemo_curator/stages/text/deduplication/removal_workflow.py tests/stages/text/deduplication/test_removal_workflow.py and python3 -m py_compile ... passed locally. Targeted pytest collection is blocked on this macOS host by Curator’s import-time Linux-only guard.

output_kwargs: dict[str, Any] | None = None
output_fields: list[str] | None = None
output_mode: Literal["ignore", "overwrite", "append", "error"] | None = None
drop_id_field: bool = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I think this is decent feedback. WDYT @nightcityblade ?


result = stage.process(task).to_pandas()

assert result.to_dict(orient="list") == {"text": ["keep"]}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misread this test at first and was confused why we were checking that this text was kept. Can you add another assertion that explicitly checks that CURATOR_DEDUP_ID_STR is not in the result?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added — the test now explicitly asserts that CURATOR_DEDUP_ID_STR is absent from the result columns, in addition to checking the remaining row contents.

Pushed in nightcityblade/Curator@2dbe71d.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label Jun 16, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-maintainers Waiting on maintainers to respond and removed waiting-on-customer Waiting on the original author to respond labels Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-request waiting-on-maintainers Waiting on maintainers to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow the textRemovalWorkflow to skip writing the id columns/_curator_dedup_id column on removal

3 participants