Draft: ASR Open Source Datasets Processing Pipeline by sushmitha-deva-09 · Pull Request #2067 · NVIDIA-NeMo/Curator

sushmitha-deva-09 · 2026-06-11T11:17:36Z

Description

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

copy-pr-bot · 2026-06-11T11:17:39Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-11T11:29:07Z

Greptile Summary

This PR introduces a new ASR open-source dataset processing pipeline under nemo_curator/stages/audio/asr/, covering HuggingFace Arrow dataset extraction, transcript normalization with language-specific resources (10+ Indic languages), streaming transcript quality statistics, and tarred dataset writing. A tutorial for the IndicVoices dataset and a modest fix to ManifestReader.decompose() to handle OmegaConf sequences are also included.

HuggingFaceASRDatasetHandler loads Arrow datasets, decodes audio via datasets.Audio, coerces to mono, resamples, and emits AudioTask objects; joblib threading is used for per-split parallel extraction.
TranscriptNormalizationStage / TranscriptStatsStage apply language resource files (alphabet, pretok rules, pnc chars) to normalize and quality-gate transcripts, writing a rolling JSON summary per utterance when output_summary_path is set.
SplitAwareManifestWriter / TarredAudioDatasetWriterStage route tasks to per-language/per-split JSONL manifests and wrap NeMo's tarred dataset converter for final packaging.
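
The mono-coercion step mentioned in the summary can be sketched as follows. This is a minimal illustration, not the PR's code: the function name `to_mono` and the rule "the shorter axis is the channel axis" are assumptions made for the example, and resampling is deliberately left out.

```python
import numpy as np


def to_mono(samples: np.ndarray) -> np.ndarray:
    """Downmix decoded audio to a 1-D mono array (hypothetical helper)."""
    arr = np.asarray(samples, dtype=np.float32)
    if arr.ndim == 1:
        # Already mono: nothing to do.
        return arr
    if arr.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D audio, got {arr.ndim}-D")
    # Assumption: the shorter axis holds the channels, e.g. (2, n) or (n, 2).
    channel_axis = 0 if arr.shape[0] <= arr.shape[1] else 1
    # Average the channels so the result stays in the original amplitude range.
    return arr.mean(axis=channel_axis)


stereo = np.array([[0.0, 2.0, 4.0, 6.0], [2.0, 4.0, 6.0, 8.0]])
mono = to_mono(stereo)  # one value per frame, shape (4,)
```

Averaging rather than summing keeps the output within the input's amplitude range, which matters if the next step writes PCM16.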

Confidence Score: 4/5

The pipeline logic is sound for the primary HuggingFace handler path, but stats.py has open issues around per-utterance disk reads that could cause serious slowdowns on large multi-language datasets.

The rolling alphabet-file reads inside TranscriptStatsStage.process() (one summary() call per utterance, each triggering disk reads for every language/source bucket) remain unresolved and will significantly degrade throughput at scale. The new finding here is confined to the convert_audio contract gap, which is a lower-risk maintenance concern.

nemo_curator/stages/audio/asr/normalization/stats.py deserves attention before this PR is merged out of draft status.
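
One possible way to avoid re-reading alphabet files on every utterance is to memoize the loader. The sketch below is an assumption-laden illustration rather than the code in stats.py: `load_alphabet`, `unknown_char_rate`, and the "one file of allowed characters" format are all hypothetical.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_alphabet(path: str) -> frozenset[str]:
    """Read an alphabet file once; repeat calls with the same path hit the cache."""
    # Assumption: the file simply lists allowed characters, whitespace ignored.
    text = Path(path).read_text(encoding="utf-8")
    return frozenset(ch for ch in text if not ch.isspace())


def unknown_char_rate(transcript: str, alphabet_path: str) -> float:
    """Fraction of non-whitespace characters that are not in the alphabet."""
    alphabet = load_alphabet(alphabet_path)  # disk read only on the first call
    chars = [ch for ch in transcript if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ch not in alphabet for ch in chars) / len(chars)
```

With this shape, a per-utterance summary would cost one set lookup per character instead of one file read per language/source bucket. If resource files can change during a run, `load_alphabet.cache_clear()` would need to be called explicitly.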

Important Files Changed

Filename	Overview
nemo_curator/stages/audio/asr/normalization/stats.py	New streaming transcript stats stage; `_metrics_snapshot` calls `summary()` → `_bucket_summary()` → disk reads on every utterance (flagged previously), and `_write_summary()` has a re-open file handle leak (flagged previously).
nemo_curator/stages/audio/asr/datasets/base.py	New base class for ASR dataset handlers; `convert_audio` does not enforce mono despite `target_channels=1`, creating a contract gap for subclasses.
nemo_curator/stages/audio/asr/datasets/huggingface.py	New HuggingFace dataset handler; stats dict keyed from skip_reason strings is fragile but all current skip reasons are covered.
nemo_curator/stages/audio/asr/io/split_manifest_writer.py	New split-aware manifest writer; `teardown()` clears `_handles` but not `_counts` (flagged previously), which would skew logged entry counts on reuse.
nemo_curator/stages/audio/asr/normalization/transcript.py	New transcript normalization stage; normalizers are lazily cached per-language and resource files are loaded once per language. Logic is clean.
nemo_curator/stages/audio/asr/io/tarred_dataset_writer.py	New stage wrapping NeMo's tarred dataset converter; straightforward delegation with proper validation of manifest/target_dir length parity.
nemo_curator/stages/audio/asr/metadata.py	New typed ASRMetadata dataclass with clean to_dict/from_dict round-trip; `extra` fields are spread at serialization with core fields taking precedence.
nemo_curator/stages/audio/common.py	Minor change: adds `_coerce_manifest_path()` to convert OmegaConf sequences to plain Python lists before passing to FilePartitioningStage.
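
The round-trip behaviour described for the metadata dataclass (extra fields spread at serialization, core fields winning on key collisions) might look roughly like this. The class name `ASRMetadataSketch` and its field names are assumptions for illustration; the actual `ASRMetadata` fields are not shown in this PR page.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class ASRMetadataSketch:
    # Hypothetical core fields; the real dataclass may differ.
    audio_filepath: str
    text: str
    duration: float
    lang: str
    extra: dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        core = {f.name: getattr(self, f.name) for f in fields(self) if f.name != "extra"}
        # Spread extra first so a core field overwrites any colliding extra key.
        return {**self.extra, **core}

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ASRMetadataSketch":
        core_names = {f.name for f in fields(cls) if f.name != "extra"}
        core = {name: data[name] for name in core_names}
        # Everything that is not a core field is preserved under extra.
        extra = {k: v for k, v in data.items() if k not in core_names}
        return cls(**core, extra=extra)


meta = ASRMetadataSketch("a.wav", "namaste", 1.5, "hi", extra={"speaker": "s1", "text": "shadowed"})
row = meta.to_dict()  # row["text"] is the core value, not the extra one
```

Note that an extra key that collides with a core field is dropped at serialization, so it does not survive the round trip.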

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Driver
    participant HFHandler as HuggingFaceASRDatasetHandler
    participant Joblib as Joblib Threads
    participant NormStage as TranscriptNormalizationStage
    participant StatsStage as TranscriptStatsStage
    participant ManifestWriter as SplitAwareManifestWriter
    participant TarredWriter as TarredAudioDatasetWriterStage

    Driver->>HFHandler: process(_EmptyTask)
    loop For each lang x native_split
        HFHandler->>HFHandler: load_from_disk + cast_column(Audio)
        HFHandler->>Joblib: Parallel(load_and_process(i))
        Joblib->>HFHandler: coerce_audio mono 1D array
        Joblib->>HFHandler: convert_audio WAV/16kHz/PCM16
        Joblib-->>HFHandler: _RowResult(ASRMetadata)
    end
    HFHandler-->>Driver: list[AudioTask]
    Driver->>NormStage: process(AudioTask)
    NormStage->>NormStage: ResourceTranscriptNormalizer.normalize(text)
    NormStage-->>Driver: AudioTask + unknown_chars + transcript_error
    Driver->>StatsStage: process(AudioTask)
    StatsStage->>StatsStage: accumulate buckets (global + per-lang/source)
    StatsStage->>StatsStage: _write_summary() rolling JSON file
    StatsStage-->>Driver: AudioTask (or None if drop_invalid)
    Driver->>ManifestWriter: process(AudioTask)
    ManifestWriter->>ManifestWriter: route to lang/split.jsonl
    ManifestWriter-->>Driver: AudioTask
    Driver->>TarredWriter: process(_EmptyTask)
    TarredWriter->>TarredWriter: create_tar_datasets(manifest, target_dir)
    TarredWriter-->>Driver: []

Reviews (2): Last reviewed commit: "Use single handler for huggingface type ..."

greptile-apps · 2026-06-11T11:29:14Z

+    def teardown(self) -> None:
+        for (lang, split), handle in self._handles.items():
+            handle.close()
+            filename = self.output_filename_pattern.format(lang=lang, split=split, split_type=split)
+            logger.info(f"[{self.name}] {lang}/{filename}: {self._counts.get((lang, split), 0)} entries")
+        self._handles = {}


teardown() clears _handles but leaves _counts populated. If the stage is reused, the counts from the previous run carry over into the next one, so the logged entry counts are wrong. Reset both dicts together.

Suggested change

-    def teardown(self) -> None:
-        for (lang, split), handle in self._handles.items():
-            handle.close()
-            filename = self.output_filename_pattern.format(lang=lang, split=split, split_type=split)
-            logger.info(f"[{self.name}] {lang}/{filename}: {self._counts.get((lang, split), 0)} entries")
-        self._handles = {}
+    def teardown(self) -> None:
+        for (lang, split), handle in self._handles.items():
+            handle.close()
+            filename = self.output_filename_pattern.format(lang=lang, split=split, split_type=split)
+            logger.info(f"[{self.name}] {lang}/{filename}: {self._counts.get((lang, split), 0)} entries")
+        self._handles = {}
+        self._counts = {}
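
The effect of resetting both dicts can be seen in a small stand-alone sketch. `SplitWriterSketch` is a hypothetical stand-in that uses in-memory buffers instead of real JSONL files; it is not the PR's `SplitAwareManifestWriter`.

```python
import io
from collections import defaultdict


class SplitWriterSketch:
    """Hypothetical stand-in for a per-(lang, split) manifest writer."""

    def __init__(self) -> None:
        self._handles: dict[tuple[str, str], io.StringIO] = {}
        self._counts: defaultdict[tuple[str, str], int] = defaultdict(int)

    def write(self, lang: str, split: str, line: str) -> None:
        key = (lang, split)
        if key not in self._handles:
            # A real writer would open <lang>/<split>.jsonl here.
            self._handles[key] = io.StringIO()
        self._handles[key].write(line + "\n")
        self._counts[key] += 1

    def teardown(self) -> dict[tuple[str, str], int]:
        final_counts = dict(self._counts)
        for handle in self._handles.values():
            handle.close()
        # Reset both pieces of state together so a reused stage starts clean.
        self._handles = {}
        self._counts = defaultdict(int)
        return final_counts


writer = SplitWriterSketch()
writer.write("hi", "train", '{"text": "a"}')
writer.write("hi", "train", '{"text": "b"}')
first_run = writer.teardown()
writer.write("hi", "train", '{"text": "c"}')
second_run = writer.teardown()  # counts only the entry written after the reset
```

Without the `_counts` reset, the second run would report three entries for `("hi", "train")` even though only one was written.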

sushmitha-deva-09 added 9 commits June 8, 2026 16:34

Add IndicVoices dataset handler

508ad9f

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Add normalization and stats stage

f0dc59a

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Write stats to summary json

97ee2f2

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update scripts

2d049ce

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Add support to display stats per language and per source

388685f

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Add unknown character rate stats

80a22e2

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update manifest reader

a9cb486

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Format summary

4931ea7

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Add alphabet to more indic languages

f87f364

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

sushmitha-deva-09 requested a review from a team as a code owner June 11, 2026 11:17

sushmitha-deva-09 requested review from meatybobby and removed request for a team June 11, 2026 11:17

greptile-apps Bot reviewed Jun 11, 2026

Use single handler for huggingface type datasets

87d3c22

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Draft: ASR Open Source Datasets Processing Pipeline #2067
sushmitha-deva-09 wants to merge 10 commits into NVIDIA-NeMo:main from sushmitha-deva-09:asr_dp
