Add option to drop deduplication id field#2078
Conversation
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Greptile SummaryThis PR adds an optional
Confidence Score: 5/5Safe to merge — the change is additive, defaults to the existing behaviour (drop_id_field=False), and the conflicting-option guard surfaces misconfiguration at construction time before any I/O occurs. All four changed files are straightforward: a new boolean flag, a single-column drop after filtering, an early validation guard, and matching tests. The semantic workflow's auto-enable condition is tightly constrained by the existing use_id_generator invariants, and the default False value keeps all existing pipelines unchanged. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[TextDuplicatesRemovalWorkflow\nconstructor] --> B{drop_id_field=True\nAND output_fields\ncontains id_field?}
B -- Yes --> C[raise ValueError\nearly exit]
B -- No --> D[_generate_stages]
D --> E[FilePartitioningStage]
E --> F[ParquetReader / JsonlReader\n_assign_ids if id_generator_path]
F --> G[TextDuplicatesRemovalStage\ndrop_id_field passed through]
G --> H{drop_id_field?}
H -- True --> I[df.drop id_field column]
H -- False --> J[keep id_field in batch]
I --> K[ParquetWriter / JsonlWriter\noutput_fields filter]
J --> K
subgraph SemanticDedup auto-enable
SD[TextSemanticDeduplicationWorkflow] --> SC{use_id_generator\nAND output_fields is None?}
SC -- Yes --> SD2[drop_id_field=True\nauto-set]
SC -- No --> SD3[drop_id_field=False]
end
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[TextDuplicatesRemovalWorkflow\nconstructor] --> B{drop_id_field=True\nAND output_fields\ncontains id_field?}
B -- Yes --> C[raise ValueError\nearly exit]
B -- No --> D[_generate_stages]
D --> E[FilePartitioningStage]
E --> F[ParquetReader / JsonlReader\n_assign_ids if id_generator_path]
F --> G[TextDuplicatesRemovalStage\ndrop_id_field passed through]
G --> H{drop_id_field?}
H -- True --> I[df.drop id_field column]
H -- False --> J[keep id_field in batch]
I --> K[ParquetWriter / JsonlWriter\noutput_fields filter]
J --> K
subgraph SemanticDedup auto-enable
SD[TextSemanticDeduplicationWorkflow] --> SC{use_id_generator\nAND output_fields is None?}
SC -- Yes --> SD2[drop_id_field=True\nauto-set]
SC -- No --> SD3[drop_id_field=False]
end
Reviews (2): Last reviewed commit: "Validate dropped id output field conflic..." | Re-trigger Greptile |
| output_kwargs: dict[str, Any] | None = None | ||
| output_fields: list[str] | None = None | ||
| output_mode: Literal["ignore", "overwrite", "append", "error"] | None = None | ||
| drop_id_field: bool = False |
There was a problem hiding this comment.
Conflicting
drop_id_field + output_fields not validated
When a caller sets drop_id_field=True and also includes id_field in output_fields, the removal stage will have already dropped that column by the time the writer stage tries to select it, producing a KeyError at runtime. The semantic workflow avoids this with an explicit self.output_fields is None guard, but the base TextDuplicatesRemovalWorkflow has no equivalent check. Adding a __post_init__ guard (after the id_generator warning) like if self.drop_id_field and self.output_fields and self.id_field in self.output_fields: raise ValueError(...) would surface this misconfiguration early with a clear message.
There was a problem hiding this comment.
+1 I think this is decent feedback. WDYT @nightcityblade ?
There was a problem hiding this comment.
Good call — I added an early __post_init__ validation that raises a clear ValueError when drop_id_field=True conflicts with output_fields containing the id field, plus a focused unit test for that configuration.
Pushed: nightcityblade/Curator@2dbe71d
Validation: uv run ruff check nemo_curator/stages/text/deduplication/removal_workflow.py tests/stages/text/deduplication/test_removal_workflow.py and python3 -m py_compile ... passed locally. Targeted pytest collection is blocked on this macOS host by Curator’s import-time Linux-only guard.
| output_kwargs: dict[str, Any] | None = None | ||
| output_fields: list[str] | None = None | ||
| output_mode: Literal["ignore", "overwrite", "append", "error"] | None = None | ||
| drop_id_field: bool = False |
There was a problem hiding this comment.
+1 I think this is decent feedback. WDYT @nightcityblade ?
|
|
||
| result = stage.process(task).to_pandas() | ||
|
|
||
| assert result.to_dict(orient="list") == {"text": ["keep"]} |
There was a problem hiding this comment.
I misread this test at first and was confused why we were checking that this text was kept. Can you add another assertion that explicitly checks that CURATOR_DEDUP_ID_STR is not in the result?
There was a problem hiding this comment.
Added — the test now explicitly asserts that CURATOR_DEDUP_ID_STR is absent from the result columns, in addition to checking the remaining row contents.
Pushed in nightcityblade/Curator@2dbe71d.
Description
Closes #1580.
Adds an optional
drop_id_fieldflag to the text duplicate removal stage/workflow so generated deduplication IDs can be removed from final outputs. The semantic deduplication workflow enables it by default when using generated IDs and no explicitoutput_fieldsare provided.Usage
Checklist
Testing
uv run ruff check nemo_curator/stages/text/deduplication/removal.py nemo_curator/stages/text/deduplication/removal_workflow.py nemo_curator/stages/text/deduplication/semantic.py tests/stages/text/deduplication/test_removal_workflow.pyuv run pytest tests/stages/text/deduplication/test_removal_workflow.py -q(blocked on macOS:ValueError: NeMo-Curator currently only supports Linux systems, while the current machine has a darwin system.)