[WIP] EasyMP vllm-omni model definition #15741

Open

vklimkov-nvidia wants to merge 45 commits into NVIDIA-NeMo:easymp_voiceagent from vklimkov-nvidia:easymp_vllm_omni

vklimkov-nvidia commented Jun 1, 2026

Member

EasyMP model definition, where the backbone and the local transformer (LT) are compiled into a single CUDA graph for uniform batches.
Loads real weights, but does not produce valid acoustic tokens yet.

vklimkov-nvidia requested a review from a team as a code owner

June 1, 2026 16:59

copy-pr-bot Bot commented Jun 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


github-actions Bot added the TTS label

vklimkov-nvidia added 8 commits

June 2, 2026 18:20


          nemo/collections/tts/models/easy_magpietts_inference.py: remove dupli…

328944e

…cate speaker encoder application

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: initial commit for vllm_omni defin…

78404cb

…ition of Easy Magpie

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: switch to actual configuration

87d742c

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: make sure model runs with cuda graphs

bb8b427

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: extend preprocess to take speaker …

…embeddings and prepare prefill embeddings

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: introduce script to convert the ch…

3a8d50b

…eckpoint to vllm omni one

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: clean up, add readme

85be128

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: implement delay and proper phoneme…

f984ee1

… prediction processing

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>

vklimkov-nvidia force-pushed the easymp_vllm_omni branch from 5468baa to f984ee1

June 2, 2026 16:21

vklimkov-nvidia added 18 commits

June 3, 2026 12:27


          examples/tts/easymagpie_vllm_omni: take text as input instead of tokens

9ab0038

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: add script to benchmark the acoust…

36ce9a5

…ic token prediction

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni/easy_magpietts_convert_to_vllm.py: …

f5c06a5

…do ckpt conversion without precision loss

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni/tests: add tests to check equivalen…

8721e54

…ce of cudagraph-friendly LT re-implementation

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: hotfix for nemotron_h in fp16, nee…

4eda162

…d scaling

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: introduce EOS forwarding from LT s…

4c7388f

…ampled tokens

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: initial version of TTS service

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni/model_repository/easymp/1/model.py:…

d11bece

… fix sending back chunks of audio

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni/Dockerfile: add installation of the…

2e10248

… model definition

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni/README.md: add info on service

d5d8dd6

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: add benchmarks

5dd4ad2

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni/benchmark_model.py: reduce default …

ec50073

…memory utilization

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni/Dockerfile: try to fix warning abou…

235c9ee

…t other plugin

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni/easymagpie_vllm_omni/easymagpie.py:…

8d3c65a

… fix preprocessing start_idx usage

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni/model_repository/easymp/config.pbtx…

dbe1374

…t: increased batched tokens in case of many simultaneous prefills

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: small tweaks

0f90c2a

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: log fails, adjust numpy installation

a1bcbe7

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni/benchmark_model.py: allow dummy mod…

4bc5218

…el weights

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>

vklimkov-nvidia added 19 commits

June 9, 2026 14:15


          examples/tts/easymagpie_vllm_omni: return DELTA from the model

63976b3

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: allow streaming text tokens in the…

7754d90

… model

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni/easymagpie_inference_demo.ipynb: add…

87232e6

… demo for streaming text tokens

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: custom scheduler to resume after t…

bc51d8d

…ext tokens are all streamed

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni/benchmark_model.py: add benchmarkin…

bafae30

…g of streaming mode, simplify

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: add input streaming support into t…

2b19ee5

…ts service

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: clean up service logs

9cf98c9

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: fixes to docker file so it runs pr…

bc041bc

…operly

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: reduce initial audio chunk latency

6dfbbeb

Lower the first service audio chunk to one frame based on local TTFA benchmarks, and record the measured codec/streaming investigation notes.
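
The effect of a one-frame first chunk on streaming can be sketched as below. This is an illustrative generator, not the service code: `chunk_frames`, `first_chunk`, and `later_chunks` are hypothetical names, and the later chunk size of 4 frames is an assumption chosen only for the example.

```python
from typing import Iterable, Iterator, List


def chunk_frames(
    frames: Iterable[int],
    first_chunk: int = 1,   # one frame, as in the commit above
    later_chunks: int = 4,  # assumed value, for illustration only
) -> Iterator[List[int]]:
    """Group a stream of frames into chunks, emitting a small first chunk early."""
    buffer: List[int] = []
    target = first_chunk
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == target:
            yield buffer
            buffer = []
            target = later_chunks  # larger chunks once audio has started
    if buffer:
        yield buffer  # flush the remainder at end of stream


chunks = list(chunk_frames(range(10)))
print(chunks)
```

A small first chunk lets the client start playback after a single decoded frame, while larger later chunks keep per-chunk overhead down.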


          examples/tts/easymagpie_vllm_omni: some optimizations

058a240

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: replace local-transformer FFN Conv…

8ce7d86

…1d with Linear

The local transformer's feed-forward used kernel-1 Conv1d layers, forcing a
[b,t,c]<->[b,c,t] transpose on entry/exit that torch.compile could not fuse
away (showed up as transpose/convolution triton kernels in profiling). Switch
to bias-free nn.Linear operating directly on [b,t,c]; the conv submodule
attribute name is kept and load_weights squeezes the trailing singleton dim so
existing checkpoints still load 1:1. Also cache the positional arange index to
avoid re-running an embedding gather every autoregressive step.

Benchmark (Nemotron-H, -n 32 -c 1 32 --max-new-tokens 64): c=32 ITL 45.6->26.4ms,
req/s 10.54->11.82.
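
The swap above relies on a kernel-1, bias-free convolution being the same per-timestep matrix multiply as a linear layer. A minimal NumPy sketch of that equivalence follows; it is not the PR's code, and the shapes, the `conv1d_k1` and `linear` helpers, and the random weights are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
b, t, c_in, c_out = 2, 5, 4, 6

# Conv1d-style weight as an existing checkpoint would store it: [c_out, c_in, kernel=1].
conv_weight = rng.standard_normal((c_out, c_in, 1))


def conv1d_k1(x_btc: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Bias-free kernel-1 Conv1d, including the transposes it forces."""
    x_bct = x_btc.transpose(0, 2, 1)                  # [b,t,c] -> [b,c,t]
    y_bct = np.einsum("oik,bit->bot", weight, x_bct)  # kernel size 1: no sliding window
    return y_bct.transpose(0, 2, 1)                   # [b,c,t] -> [b,t,c]


def linear(x_btc: np.ndarray, weight_2d: np.ndarray) -> np.ndarray:
    """Bias-free Linear applied directly on [b,t,c]."""
    return x_btc @ weight_2d.T


# Squeeze the trailing singleton dim so the same checkpoint tensor fits the Linear layout.
linear_weight = conv_weight.squeeze(-1)               # [c_out, c_in]

x = rng.standard_normal((b, t, c_in))
y_conv = conv1d_k1(x, conv_weight)
y_lin = linear(x, linear_weight)
print(np.allclose(y_conv, y_lin))
```

The linear path needs no layout change on entry or exit, which is the transpose overhead the commit message describes.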


          examples/tts/easymagpie_vllm_omni: capture local-transformer AR loop …

8828a93

…in a single graph

The per-frame codebook loop replayed the compiled transformer N times with eager
projection/sampling in between. Move the whole loop (transformer stack +
per-codebook heads + sampling) into one @support_torch_compile module
(EasyMagpieCodeLoop) so vLLM captures a single CUDA graph replayed once per frame
instead of N times. Same FLOPs; removes per-step Python/launch overhead that
dominates throughput scaling.

Sampling is kept graph-safe: Gumbel noise is drawn eagerly outside the graph and
injected (so RNG isn't replayed), temperature is a runtime tensor (per-request
temperature still works), and top_k is a capture-time constant. The loop owns no
params; the heads/embeddings/mask stay on EasyMagpieCodePredictor so the
checkpoint still loads 1:1. Also squeeze the kernel-1 Conv1d->Linear weight in the
test's NeMo->vLLM copy (follow-up to the FFN dense change).

Benchmark (Nemotron-H, -n 32 -c 1 32 --max-new-tokens 64): c=32 req/s 11.82->17.55.
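
The graph-safe sampling described above can be illustrated with the Gumbel-max trick: `argmax(logits / T + g)` with Gumbel noise `g` is a sample from `softmax(logits / T)`. The NumPy sketch below shows the idea only and is not the PR's implementation; `TOP_K`, `draw_gumbel`, and `sample_with_injected_noise` are hypothetical names, and the module-level constant stands in for a capture-time constant.

```python
import numpy as np

TOP_K = 3  # fixed ahead of time, standing in for a capture-time constant


def draw_gumbel(rng: np.random.Generator, shape) -> np.ndarray:
    """Draw Gumbel(0, 1) noise outside the sampling function."""
    u = rng.uniform(low=np.finfo(np.float64).tiny, high=1.0, size=shape)
    return -np.log(-np.log(u))


def sample_with_injected_noise(
    logits: np.ndarray,        # [batch, vocab]
    gumbel_noise: np.ndarray,  # [batch, vocab], supplied by the caller
    temperature: np.ndarray,   # [batch, 1], a runtime input per request
) -> np.ndarray:
    """Top-k + temperature sampling that is deterministic given its inputs."""
    # Keep the TOP_K largest logits per row (ties may keep more); mask the rest.
    kth_largest = np.sort(logits, axis=-1)[:, -TOP_K][:, None]
    masked = np.where(logits >= kth_largest, logits, -np.inf)
    return np.argmax(masked / temperature + gumbel_noise, axis=-1)


rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 10))
temperature = np.array([[0.7], [1.0], [1.0], [1.3]])

noise = draw_gumbel(rng, logits.shape)
tokens = sample_with_injected_noise(logits, noise, temperature)
# Re-running with the same noise yields the same tokens; fresh noise is needed per step.
replayed = sample_with_injected_noise(logits, noise, temperature)
print(tokens, replayed)
```

Because the function contains no random number generation, replaying it is a pure function of its inputs, which is why the noise has to be refreshed by the caller each step.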


          examples/tts/easymagpie_vllm_omni: optimize preprocess prefill path

ea9689a

preprocess runs on the host, once per request, serially on the runner's
critical path; shipping a per-request (T_audio, embedding_dim) speaker
embedding (ZMQ serialize/deserialize + H2D) and reassembling the prefill
context there dominated TTFT under concurrency.

For the fixed speaker set we serve, bake the speaker embeddings into model
state: load each speaker_embeddings/<id>.pt once at construction into a
GPU-resident, model-dtype tensor, so a request carries only a short
speaker_id string. Custom / one-off voices may still pass a raw
speaker_embedding tensor. Loaded in __init__ (not load_weights) so it is
present under --load-format dummy too.

Also drop the silent zero/last-row padding of short prefill chunks in favor
of an assertion (the backbone was never trained on padded context).

Benchmark (dummy weights, RTX A6000, n=64, speaker_id path):
c=32 TTFT 188ms -> ~95ms, c=1 31ms -> 27ms.
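
The request path described above can be sketched as a lookup table that is filled once and then resolved per request. This is a plain-Python illustration under stated assumptions, not the model code: `SpeakerTable`, `resolve`, and `require_full_chunk` are hypothetical names, nested lists stand in for GPU tensors, and the exact length condition checked by the real assertion is not stated in the commit message.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Embedding = List[List[float]]  # stand-in for a (T_audio, embedding_dim) tensor


@dataclass
class SpeakerTable:
    """Hypothetical holder for speaker embeddings loaded once at construction."""

    embeddings: Dict[str, Embedding] = field(default_factory=dict)

    def resolve(
        self,
        speaker_id: Optional[str] = None,
        speaker_embedding: Optional[Embedding] = None,
    ) -> Embedding:
        # Fast path: the request carries only a short string.
        if speaker_id is not None:
            if speaker_id not in self.embeddings:
                raise KeyError(f"unknown speaker_id: {speaker_id!r}")
            return self.embeddings[speaker_id]
        # Fallback: custom or one-off voices pass a raw embedding.
        if speaker_embedding is not None:
            return speaker_embedding
        raise ValueError("request needs speaker_id or speaker_embedding")


def require_full_chunk(chunk: List[int], expected_len: int) -> List[int]:
    """Fail loudly on a short chunk instead of padding it silently (assumed check)."""
    assert len(chunk) == expected_len, (
        f"prefill chunk has {len(chunk)} rows, expected {expected_len}"
    )
    return chunk


table = SpeakerTable({"spk_a": [[0.1, 0.2], [0.3, 0.4]]})
cached = table.resolve(speaker_id="spk_a")
custom = table.resolve(speaker_embedding=[[0.5, 0.5]])
print(len(cached), len(custom))
```

Resolving by id keeps the large tensor out of the per-request payload, which is the serialization and host-to-device cost the commit message identifies.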


          examples/tts/easymagpie_vllm_omni: fix service to work with speaker_i…

a423740

…d, no-op for codec as debug

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: extend benchmarking scripts

2429b04

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          easymagpie_vllm_omni/easymagpie.py: simplify preprocessing, expect li…

332579a

…st of text tokens

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: allow to feed chunks of text tokens

39a4005

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: update benchmarking with streaming…

1fba7d3

… chunks of tokens

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


          examples/tts/easymagpie_vllm_omni: add scripts subdir

39283e7

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>


Labels

TTS