45 commits
328944e
nemo/collections/tts/models/easy_magpietts_inference.py: remove dupli…
vklimkov-nvidia Jun 2, 2026
78404cb
examples/tts/easymagpie_vllm_omni: initial commit for vllm_omni defin…
vklimkov-nvidia Jun 2, 2026
87d742c
examples/tts/easymagpie_vllm_omni: switch to actual configuration
vklimkov-nvidia Jun 2, 2026
bb8b427
examples/tts/easymagpie_vllm_omni: make sure model runs with cuda graphs
vklimkov-nvidia Jun 2, 2026
9992569
examples/tts/easymagpie_vllm_omni: extend preprocess to take speaker …
vklimkov-nvidia Jun 2, 2026
3a8d50b
examples/tts/easymagpie_vllm_omni: introduce script to convert the ch…
vklimkov-nvidia Jun 2, 2026
85be128
examples/tts/easymagpie_vllm_omni: clean up, add readme
vklimkov-nvidia Jun 2, 2026
f984ee1
examples/tts/easymagpie_vllm_omni: implement delay and proper phoneme…
vklimkov-nvidia Jun 2, 2026
9ab0038
examples/tts/easymagpie_vllm_omni: take text as input instead of tokens
vklimkov-nvidia Jun 3, 2026
36ce9a5
examples/tts/easymagpie_vllm_omni: add script to benchmark the acoust…
vklimkov-nvidia Jun 3, 2026
f5c06a5
examples/tts/easymagpie_vllm_omni/easy_magpietts_convert_to_vllm.py: …
vklimkov-nvidia Jun 3, 2026
8721e54
examples/tts/easymagpie_vllm_omni/tests: add tests to check equivalen…
vklimkov-nvidia Jun 3, 2026
4eda162
examples/tts/easymagpie_vllm_omni: hotfix for nemotron_h in fp16, nee…
vklimkov-nvidia Jun 3, 2026
4c7388f
examples/tts/easymagpie_vllm_omni: introduce EOS forwarding from LT s…
vklimkov-nvidia Jun 4, 2026
4210535
examples/tts/easymagpie_vllm_omni: initial version of TTS service
vklimkov-nvidia Jun 4, 2026
d11bece
examples/tts/easymagpie_vllm_omni/model_repository/easymp/1/model.py:…
vklimkov-nvidia Jun 5, 2026
2e10248
examples/tts/easymagpie_vllm_omni/Dockerfile: add installation of the…
vklimkov-nvidia Jun 5, 2026
d5d8dd6
examples/tts/easymagpie_vllm_omni/README.md: add info on service
vklimkov-nvidia Jun 5, 2026
5dd4ad2
examples/tts/easymagpie_vllm_omni: add benchmarks
vklimkov-nvidia Jun 5, 2026
ec50073
examples/tts/easymagpie_vllm_omni/benchmark_model.py: reduce default …
vklimkov-nvidia Jun 5, 2026
235c9ee
examples/tts/easymagpie_vllm_omni/Dockerfile: try to fix warning abou…
vklimkov-nvidia Jun 5, 2026
8d3c65a
examples/tts/easymagpie_vllm_omni/easymagpie_vllm_omni/easymagpie.py:…
vklimkov-nvidia Jun 5, 2026
dbe1374
examples/tts/easymagpie_vllm_omni/model_repository/easymp/config.pbtx…
vklimkov-nvidia Jun 5, 2026
0f90c2a
examples/tts/easymagpie_vllm_omni: small tweaks
vklimkov-nvidia Jun 5, 2026
a1bcbe7
examples/tts/easymagpie_vllm_omni: log fails, adjust numpy installation
vklimkov-nvidia Jun 8, 2026
4bc5218
examples/tts/easymagpie_vllm_omni/benchmark_model.py: allow dummy mod…
vklimkov-nvidia Jun 9, 2026
63976b3
examples/tts/easymagpie_vllm_omni: return DELTA from the model
vklimkov-nvidia Jun 9, 2026
7754d90
examples/tts/easymagpie_vllm_omni: allow streaming text tokens in the…
vklimkov-nvidia Jun 9, 2026
87232e6
xamples/tts/easymagpie_vllm_omni/easymagpie_inference_demo.ipynb: add…
vklimkov-nvidia Jun 9, 2026
bc51d8d
examples/tts/easymagpie_vllm_omni: custom scheduler to resume after t…
vklimkov-nvidia Jun 9, 2026
bafae30
examples/tts/easymagpie_vllm_omni/benchmark_model.py: add benchmarkin…
vklimkov-nvidia Jun 9, 2026
2b19ee5
examples/tts/easymagpie_vllm_omni: add input streaming support into t…
vklimkov-nvidia Jun 10, 2026
9cf98c9
examples/tts/easymagpie_vllm_omni: clean up service logs
vklimkov-nvidia Jun 10, 2026
bc041bc
examples/tts/easymagpie_vllm_omni: fixes to docker file so it runs pr…
vklimkov-nvidia Jun 10, 2026
6dfbbeb
examples/tts/easymagpie_vllm_omni: reduce initial audio chunk latency
vklimkov-nvidia Jun 10, 2026
058a240
examples/tts/easymagpie_vllm_omni: some optimizations
vklimkov-nvidia Jun 11, 2026
8ce7d86
examples/tts/easymagpie_vllm_omni: replace local-transformer FFN Conv…
vklimkov-nvidia Jun 12, 2026
8828a93
examples/tts/easymagpie_vllm_omni: capture local-transformer AR loop …
vklimkov-nvidia Jun 12, 2026
ea9689a
examples/tts/easymagpie_vllm_omni: optimize preprocess prefill path
vklimkov-nvidia Jun 12, 2026
a423740
examples/tts/easymagpie_vllm_omni: fix service to work with speaker_i…
vklimkov-nvidia Jun 15, 2026
2429b04
examples/tts/easymagpie_vllm_omni: extend benchmarking scripts
vklimkov-nvidia Jun 15, 2026
332579a
easymagpie_vllm_omni/easymagpie.py: simplify preprocessing, expect li…
vklimkov-nvidia Jun 15, 2026
39a4005
examples/tts/easymagpie_vllm_omni: allow to feed chunks of text tokens
vklimkov-nvidia Jun 15, 2026
1fba7d3
examples/tts/easymagpie_vllm_omni: update benchmarking with streaming…
vklimkov-nvidia Jun 15, 2026
39283e7
examples/tts/easymagpie_vllm_omni: add scripts subdir
vklimkov-nvidia Jun 16, 2026
38 changes: 38 additions & 0 deletions examples/tts/easymagpie_vllm_omni/Dockerfile
@@ -0,0 +1,38 @@
FROM nvcr.io/nvidia/tritonserver:26.02-py3

# 1. System dependency for git-based installs
RUN apt-get update && \
    apt-get install -y git sox libsox-fmt-all

# 2. upstream vllm
RUN pip install --no-cache-dir \
    "vllm==0.21.0" \
    "vllm_omni==0.21.0rc1"

# 3. Install easymagpie vllm omni model definition
# TODO: replace this clone with the upstream NeMo repo once the code is merged
RUN git clone --depth 1 --branch easymp_vllm_omni \
    https://github.com/vklimkov-nvidia/NeMoDuplexRealtime.git \
    /tmp/NeMoDuplexRealtime && \
    pip install --no-cache-dir /tmp/NeMoDuplexRealtime/examples/tts/easymagpie_vllm_omni && \
    rm -rf /tmp/NeMoDuplexRealtime

# 4. Extra python requirements needed to compile the model
RUN pip install --no-cache-dir \
    onnxscript \
    librosa \
    sox \
    onnx-graphsurgeon \
    "tritonclient[grpc]"
# numpy has to be force-reinstalled after the previous installations; otherwise vllm_omni doesn't work
RUN pip uninstall -y numpy || true
RUN rm -rf \
    /usr/local/lib/python3.12/dist-packages/numpy \
    /usr/local/lib/python3.12/dist-packages/numpy.libs \
    /usr/local/lib/python3.12/dist-packages/numpy-*.dist-info
RUN pip install --no-cache-dir numpy==2.3.5

# 5. Restrict vLLM plugin auto-discovery to the EasyMagpie plugin only.
ENV VLLM_PLUGINS=easymagpie_omni

WORKDIR /workspace
79 changes: 79 additions & 0 deletions examples/tts/easymagpie_vllm_omni/README.md
@@ -0,0 +1,79 @@
## EasyMagpie TTS — vLLM-Omni + Triton service

Streaming TTS server for **EasyMagpieTTS** (NeMo model
`nemo.collections.tts.models.easy_magpietts.EasyMagpieTTSModel` /
`EasyMagpieTTSInferenceModel`, Nemotron-H backbone + per-codebook local
transformer over a 25 fps spectral codec).

The vLLM-Omni model definition (talker that runs the backbone + local
transformer as a single CUDA graph during uniform-batch decoding, piecewise
during prefill/mixed) lives in
[`vllm_plugin_easymagpie_omni/`](vllm_plugin_easymagpie_omni). A Triton
ensemble wraps it together with a TensorRT codec decoder to serve gRPC
streaming requests.

### Pipeline

1. **Convert the NeMo checkpoint to a vLLM-Omni model directory** — bakes the
text embedding + CAS lookup, dumps `config.json`, `model.safetensors`, the
text tokenizer, and optional reference speaker embeddings.

```bash
python examples/tts/easymagpie_vllm_omni/scripts/easy_magpietts_convert_to_vllm.py \
--nemo_file <ckpt>/2605_EMTTS_SmallMamba_Step150k_posttrained_epoch12.nemo \
--codec_model_path <ckpt>/25fps_spectral_codec_with_bandwidth_extension.nemo \
--outdir examples/tts/easymagpie_vllm_omni/easymp_vllm_model \
--context_audio english_sample.wav --speaker_name eng \
--phoneme_tokenizer_path <ckpt>/bpe_ipa_tokenizer_2048_en_de_es_fr_hi_it_vi_zh_ko-KR_pt-BR_ar.json
```

Checkpoints: <https://huggingface.co/nvidia/easymagpietts_NEXT/tree/main/2605_NemotronTTS_V0.2/v2>.

2. **Export the codec decoder to ONNX** — wraps `AudioCodecModel` so a single
`(B, T, C*S)` int tensor of stacked model codes decodes to a 22.05 kHz
waveform (clamp specials → unstack → FSQ index-convert → decode baked in).

```bash
python examples/tts/easymagpie_vllm_omni/scripts/export_codec_decoder_onnx.py \
--codec_model_path <ckpt>/25fps_spectral_codec_with_bandwidth_extension.nemo \
--nemo_file <ckpt>/2605_EMTTS_SmallMamba_Step150k_posttrained_epoch12.nemo \
--onnx-path examples/tts/easymagpie_vllm_omni/codec.onnx \
--frames 15 --device cuda
```

3. **Build the serving container** (Triton 26.02 + vLLM 0.21.0 +
vllm-omni 0.21.0rc1 + this plugin).

```bash
docker build --network=host -t easymp-vllm-omni examples/tts/easymagpie_vllm_omni/
```

4. **Launch the container** with the workspace and a GPU mounted.

```bash
docker run --rm -it --gpus all --network host --shm-size=8g \
-v "$PWD":/workspace -w /workspace \
easymp-vllm-omni bash
```

5. **Build the TensorRT engine from the ONNX** (inside the container) and drop
it into the Triton repo as `model.plan`. For now fp32 seems to be mandatory.

```bash
python examples/tts/easymagpie_vllm_omni/scripts/export_codec_decoder_trt.py \
--onnx-path examples/tts/easymagpie_vllm_omni/codec.onnx \
--trt-path examples/tts/easymagpie_vllm_omni/model_repository/codec/1/model.plan \
--batch-profile 1 8 32 --frames-profile 15 15 15 --fp32
```

6. **Start the Triton inference server** against
[`model_repository/`](model_repository) (two models: `easymp` python
backend + `codec` TRT plan).

```bash
tritonserver --model-repository=examples/tts/easymagpie_vllm_omni/model_repository
```

7. **Send a request.** End-to-end gRPC streaming example in
[`scripts/run_service_request.ipynb`](scripts/run_service_request.ipynb) —
sends `text`, receives streamed `audio` chunks at 22.05 kHz.
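
As a rough check on streaming granularity, the figures quoted above (a 25 fps codec, 22.05 kHz output, and `--frames 15` in the export steps) imply how much audio one decoded chunk carries. The sketch below is plain arithmetic; the helper name `chunk_size` is hypothetical, and it assumes each codec call decodes exactly the exported frame count with no overlap between chunks.

```python
# Hypothetical helper: audio carried by one decoded codec chunk.
# Constants come from the README (25 fps codec, 22.05 kHz, --frames 15).
CODEC_FPS = 25
SAMPLE_RATE_HZ = 22_050
FRAMES_PER_CHUNK = 15


def chunk_size(frames=FRAMES_PER_CHUNK, fps=CODEC_FPS, sample_rate=SAMPLE_RATE_HZ):
    """Return (samples, seconds) produced by decoding `frames` codec frames."""
    samples_per_frame = sample_rate // fps  # 22050 / 25 divides evenly
    samples = frames * samples_per_frame
    return samples, samples / sample_rate


samples, seconds = chunk_size()
print(f"{samples} samples, {seconds:.2f} s per chunk")
```

Under these assumptions a 15-frame chunk corresponds to 0.6 s of audio, which gives a sense of the granularity at which `audio` chunks can arrive.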
28 changes: 28 additions & 0 deletions examples/tts/easymagpie_vllm_omni/easymagpie_vllm_omni/__init__.py
@@ -0,0 +1,28 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""EasyMagpieTTS model definition for vLLM-Omni.

This package provides an inference-only re-implementation of EasyMagpieTTS
(decoder-only, Nemotron-H hybrid-Mamba backbone + autoregressive local
transformer over the stacked audio codebooks) that plugs into the vLLM-Omni
serving stack via the standard ``preprocess`` / ``postprocess`` /
``make_omni_output`` hooks.

The companion ``vllm_plugin_easymagpie_omni`` package registers the model with
vLLM's ``ModelRegistry`` through the ``vllm.general_plugins`` entry point.
"""

from easymagpie_vllm_omni.config import EASYMAGPIE_SMALLMAMBA, EasyMagpieOmniArch

__all__ = ["EASYMAGPIE_SMALLMAMBA", "EasyMagpieOmniArch"]
@@ -0,0 +1,170 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Backbone-side patches applied at model ``__init__``.

Runtime fixes for the constructed ``NemotronHModel`` backbone. They live with
the model because they're inherent to running EasyMagpie SmallMamba
(``mlp_hidden_act=silu``) on vLLM's NemotronH implementation. Mirrors the
EasyMagpie vLLM *sidecar* (``easymagpie_vllm/backbone_patches.py``).
"""
from __future__ import annotations

import torch
import torch.nn as nn
import torch.nn.functional as F
import vllm.v1.attention.backends.mamba_attn as _mamba_attn
from vllm.logger import init_logger

logger = init_logger(__name__)


def patch_mamba_streaming_decode() -> None:
    """Treat 1-token streaming extends as decodes so FULL decode cudagraphs work.

    EasyMagpie's streaming-input path keeps extending each request's prompt with
    every chunk, so ``num_computed_tokens < num_prompt_tokens`` (the engine's
    ``is_prefilling`` flag) stays True for the whole stream. vLLM's Mamba2
    metadata builder calls
    :func:`vllm.v1.attention.backends.utils.split_decodes_and_prefills` with
    ``treat_short_extends_as_decodes=False``, so every single-token decode step
    is classified as a *prefill* (``num_prefills>0``).

    That collides with the cudagraph dispatcher, which keys only on query length:
    a uniform ``query_len==1`` batch dispatches the **FULL decode** graph
    regardless of ``is_prefilling``. Two failures result:

    * the replayed decode graph runs the decode Mamba kernels while the metadata
      says prefill, and
    * because ``num_prefills>0``, ``_update_metadata_for_cudagraph_capture``
      never refreshes the persistent ``state_indices_tensor_d`` buffer, so the
      captured kernel reads the capture-time dummy slot (0) instead of the
      request's real Mamba-cache slot -> garbage hidden states.

    Forcing ``treat_short_extends_as_decodes=True`` makes single-token extends
    classify as decodes (``num_prefills==0``), which both matches the dispatched
    FULL decode graph and re-enables the per-step ``state_indices_tensor_d``
    refresh. Multi-token context prefills (``query_len>1``) still classify as
    prefills, so this is safe for mixed batches. Advancing Mamba state by one
    token via the decode kernels is semantically identical to a 1-token prefill
    chunk (it reads the slot's state and writes the advanced state back in
    place), so no state update is lost; the only requirement is exactly one new
    token per streamed step (``SamplingParams(max_tokens=1)``).

    Idempotent and process-global; the EasyMagpie plugin only ever serves this
    model so the global patch is acceptable.
    """
    orig = _mamba_attn.split_decodes_and_prefills
    if getattr(orig, "_easymagpie_patched", False):
        return

    def patched(
        common_attn_metadata,
        decode_threshold: int = 1,
        require_uniform: bool = False,
        treat_short_extends_as_decodes: bool = True,
    ):
        return orig(
            common_attn_metadata,
            decode_threshold=decode_threshold,
            require_uniform=require_uniform,
            treat_short_extends_as_decodes=True,
        )

    patched._easymagpie_patched = True
    _mamba_attn.split_decodes_and_prefills = patched
    logger.info("Mamba streaming-decode classification patch installed")
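
The classification rule described in the docstring above can be illustrated with a toy model. This is only a sketch of the behaviour as the docstring states it, not vLLM's actual `split_decodes_and_prefills` (which works on batched attention metadata); the `Step` dataclass and `is_decode` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical per-request view of one scheduler step."""

    query_len: int        # new tokens processed in this step
    is_prefilling: bool   # num_computed_tokens < num_prompt_tokens


def is_decode(step, decode_threshold=1, treat_short_extends_as_decodes=False):
    """Toy rule: classify a step as decode (True) or prefill (False)."""
    if step.query_len > decode_threshold:
        return False  # multi-token context prefill is always a prefill
    if step.is_prefilling and not treat_short_extends_as_decodes:
        return False  # short extend of a still-growing prompt counts as prefill
    return True


streamed = Step(query_len=1, is_prefilling=True)    # streaming-input step
context = Step(query_len=40, is_prefilling=True)    # initial context prefill

print(is_decode(streamed))                                       # unpatched
print(is_decode(streamed, treat_short_extends_as_decodes=True))  # patched
print(is_decode(context, treat_short_extends_as_decodes=True))   # unaffected
```

In this toy model only the single-token streamed step changes class when the flag is forced on, which is the property the patch relies on for mixed batches.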


class _SiluActivation(nn.Module):
    """``nn.Module`` wrapper around ``F.silu`` (so vLLM's NemotronHMLP can hold it)."""

    def forward(self, x):
        return F.silu(x)


def patch_silu_shared_experts(backbone) -> int:
    """Replace ``shared_experts.act_fn`` with SiLU on every NemotronHMoE layer.

    vLLM's ``NemotronHMLP`` hard-codes ReLU² for ``shared_experts`` (ignoring
    ``config.mlp_hidden_act``). SmallMamba trained with SiLU, so the mismatch
    blows up shared-expert norms ~5× and the per-layer cosine drops to ≈-0.7 by
    layer 30. Patching only ``act_fn`` (not the whole forward) keeps
    ``NemotronHMLP.forward`` in charge so torch.compile / CUDA-graph capture
    continue to wrap it unchanged.

    Args:
        backbone: the ``NemotronHModel`` instance.

    Returns:
        Number of layers patched.
    """
    patched = 0
    for layer in backbone.layers:
        mixer = getattr(layer, "mixer", None)
        if mixer is None or mixer.__class__.__name__ != "NemotronHMoE":
            continue
        se = getattr(mixer, "shared_experts", None)
        if se is None:
            continue
        se.act_fn = _SiluActivation()
        patched += 1
    logger.info("SiLU shared_experts fix installed on %d layers", patched)
    return patched
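
To see why the activation mismatch matters, the two functions can be compared on a few scalar inputs. The sketch below uses only the standard library and the textbook definitions (ReLU² as `max(x, 0)**2`, SiLU as `x * sigmoid(x)`); the helper names are hypothetical, and the sample inputs are arbitrary rather than taken from the model.

```python
import math


def relu_squared(x):
    """ReLU² activation: zero for negative inputs, x**2 otherwise."""
    return max(x, 0.0) ** 2


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


# Arbitrary sample pre-activations, chosen only for illustration.
for x in (-1.0, 0.5, 3.0):
    print(f"x={x:+.1f}  relu2={relu_squared(x):.4f}  silu={silu(x):.4f}")
```

The two activations disagree in sign for negative inputs and grow at different rates for large positive ones, so weights trained with one cannot be expected to behave sensibly under the other.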


def patch_moe_routed_scale(backbone) -> int:
    """Restore ``routed_scaling_factor`` on the NemotronHMoE output in FP16.

    vLLM's ``FusedMoE`` uses an FP16 overflow trick: with
    ``apply_routed_scale_to_output=True`` it does **not** multiply the routed
    output by ``s`` (=routed_scaling_factor); in FP16 it instead divides the
    *shared* output by ``s`` and relies on the decoder layer to keep the whole
    residual stream scaled by ``1/s`` (see ``DeepseekV2DecoderLayer.forward``).
    NemotronH's decoder layer never applies that compensation, so in FP16 the
    MoE block emits ``routed_raw + shared/s == (s*routed_raw + shared)/s``,
    i.e. the correct value divided by ``s``. The MoE contribution to the
    residual ends up ``s``× too small and the error accumulates across the MoE
    layers.

    We re-multiply each MoE mixer's output by ``s`` in FP16::

        s * (routed_raw + shared/s) = s*routed_raw + shared

    which matches the NeMo reference. FP32/BF16 already take the correct
    ``fused_output *= s`` branch, so the hook is a no-op there.

    Args:
        backbone: the ``NemotronHModel`` instance.

    Returns:
        Number of layers patched.
    """
    patched = 0
    for layer in backbone.layers:
        mixer = getattr(layer, "mixer", None)
        if mixer is None or mixer.__class__.__name__ != "NemotronHMoE":
            continue
        scale = float(getattr(mixer, "routed_scaling_factor", 1.0))
        if scale == 1.0:
            continue

        # Bind ``scale`` as a default argument so each hook keeps its own value.
        def _scale_output(_mod, _inp, out, _scale=scale):
            # FusedMoE only defers the scale in FP16; leave other dtypes alone.
            if isinstance(out, torch.Tensor) and out.dtype == torch.float16:
                return out * _scale
            return out

        mixer.register_forward_hook(_scale_output)
        patched += 1
    logger.info("FP16 MoE routed-scale fix installed on %d layers", patched)
    return patched
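
The algebra in the docstring above can be checked with scalars. This is a minimal sketch with made-up numbers (`s`, `routed_raw`, and `shared` are arbitrary illustrative values, and the function names are hypothetical); it mirrors only the arithmetic, not the tensor code or the dtype check in the hook.

```python
import math


def fp16_moe_output(routed_raw, shared, s):
    """What the MoE block emits in FP16 per the docstring: shared is divided by s."""
    return routed_raw + shared / s


def reference_output(routed_raw, shared, s):
    """Intended value: routed output scaled by s, plus the shared output."""
    return s * routed_raw + shared


# Arbitrary illustrative values, not taken from the model.
s, routed_raw, shared = 2.5, 0.4, 1.5

emitted = fp16_moe_output(routed_raw, shared, s)
fixed = s * emitted  # what the forward hook does for FP16 outputs

print(emitted, fixed, reference_output(routed_raw, shared, s))
```

With these values the emitted output is the reference divided by `s`, and multiplying by `s` recovers the reference, as the identity in the docstring predicts.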