[WIP] EasyMP vllm-omni model definition #15741
Open
vklimkov-nvidia wants to merge 45 commits into
…cate speaker encoder application Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…ition of Easy Magpie Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…embeddings and prepare prefill embeddings Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…eckpoint to vllm omni one Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… prediction processing Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
5468baa to
f984ee1
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…ic token prediction Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…do ckpt conversion without precision loss Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…ce of cudagraph-friendly LT re-implementation Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…d scaling Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…ampled tokens Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… fix sending back chunks of audio Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… model definition Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…memory utilization Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…t other plugin Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… fix preprocessing start_idx usage Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…t: increased batched tokens in case of many simultaneous prefills Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…el weights Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… model Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… demo for streaming text tokens Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…ext tokens are all streamed Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…g of streaming mode, simplify Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…ts service Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…operly Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Lower the first service audio chunk to one frame based on local TTFA benchmarks, and record the measured codec/streaming investigation notes.
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
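The chunking change in the commit above can be illustrated with a minimal sketch: emit a small first chunk so the first audio arrives quickly, then switch to larger steady-state chunks. The function name `chunk_frames` and the steady-state size of 4 frames are assumptions made for illustration; only the one-frame first chunk comes from the commit message.

```python
def chunk_frames(frames, first_chunk_frames=1, later_chunk_frames=4):
    """Yield a small first chunk, then larger chunks (sizes other than 1 are hypothetical)."""
    buffer, target = [], first_chunk_frames
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == target:
            yield buffer
            # After the first chunk is sent, switch to the steady-state size.
            buffer, target = [], later_chunk_frames
    if buffer:
        # Flush whatever is left when the stream ends.
        yield buffer


chunks = list(chunk_frames(range(10)))
print(chunks)
```

A smaller first chunk trades some per-chunk overhead for lower time to first audio; whether that trade is worthwhile depends on the measured codec cost.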
…1d with Linear

The local transformer's feed-forward used kernel-1 Conv1d layers, forcing a [b,t,c]<->[b,c,t] transpose on entry/exit that torch.compile could not fuse away (showed up as transpose/convolution Triton kernels in profiling). Switch to bias-free nn.Linear operating directly on [b,t,c]; the conv submodule attribute name is kept and load_weights squeezes the trailing singleton dim so existing checkpoints still load 1:1.

Also cache the positional arange index to avoid re-running an embedding gather every autoregressive step.

Benchmark (Nemotron-H, -n 32 -c 1 32 --max-new-tokens 64): c=32 ITL 45.6->26.4ms, req/s 10.54->11.82.
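To see why the Conv1d-to-Linear swap can load existing checkpoints unchanged, here is a small pure-Python sketch (deliberately without PyTorch) showing that a bias-free kernel-1 convolution on [b,c,t] gives the same result as a bias-free linear layer on [b,t,c] once the trailing singleton weight dimension is squeezed. All function names here are illustrative and not taken from the PR.

```python
def conv1d_k1(x_bct, weight_oik):
    """Bias-free Conv1d with kernel size 1; weight is [out, in, 1], input is [b, c, t]."""
    out = []
    for sample in x_bct:
        t_len = len(sample[0])
        out.append([
            [sum(w_in[0] * sample[i][t] for i, w_in in enumerate(w_out)) for t in range(t_len)]
            for w_out in weight_oik
        ])
    return out  # [b, out, t]


def linear(x_btc, weight_oi):
    """Bias-free Linear; weight is [out, in], input is [b, t, c]."""
    return [
        [[sum(w * v for w, v in zip(w_out, frame)) for w_out in weight_oi] for frame in sample]
        for sample in x_btc
    ]  # [b, t, out]


def squeeze_last(weight_oik):
    """Drop the trailing singleton kernel dim: [out, in, 1] -> [out, in]."""
    return [[w_in[0] for w_in in w_out] for w_out in weight_oik]


def transpose_inner(x):
    """Swap the last two dims of a [b, m, n] nested list."""
    return [[list(col) for col in zip(*sample)] for sample in x]


conv_weight = [[[1.0], [2.0]], [[0.5], [-1.0]], [[0.0], [3.0]]]  # [out=3, in=2, 1]
x_btc = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]]       # [b=1, t=4, c=2]

# Old path: transpose in, convolve, transpose out.
via_conv = transpose_inner(conv1d_k1(transpose_inner(x_btc), conv_weight))
# New path: squeeze the checkpoint weight once, then apply Linear on [b, t, c] directly.
via_linear = linear(x_btc, squeeze_last(conv_weight))
print(via_conv == via_linear)
```

The two paths perform the same multiply-adds per frame; the linear form simply avoids the two layout transposes.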
…in a single graph

The per-frame codebook loop replayed the compiled transformer N times with eager projection/sampling in between. Move the whole loop (transformer stack + per-codebook heads + sampling) into one @support_torch_compile module (EasyMagpieCodeLoop) so vLLM captures a single CUDA graph replayed once per frame instead of N times. Same FLOPs; removes per-step Python/launch overhead that dominates throughput scaling.

Sampling is kept graph-safe: Gumbel noise is drawn eagerly outside the graph and injected (so RNG isn't replayed), temperature is a runtime tensor (per-request temperature still works), and top_k is a capture-time constant. The loop owns no params; the heads/embeddings/mask stay on EasyMagpieCodePredictor so the checkpoint still loads 1:1.

Also squeeze the kernel-1 Conv1d->Linear weight in the test's NeMo->vLLM copy (follow-up to the FFN dense change).

Benchmark (Nemotron-H, -n 32 -c 1 32 --max-new-tokens 64): c=32 req/s 11.82->17.55.
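The graph-safe sampling described above relies on the Gumbel-max trick: taking the argmax of `logits / temperature + gumbel_noise` draws a sample from the softmax distribution, so the random part can be generated outside the replayed function and passed in as plain data. The sketch below is pure Python and only illustrates the idea; the names `draw_gumbel`, `sample_token`, and the value of `TOP_K` are assumptions, not code from the PR.

```python
import math
import random

TOP_K = 2  # stands in for a capture-time constant; the value is hypothetical


def draw_gumbel(n, rng):
    """Draw Gumbel(0, 1) noise outside the replayed function, so no RNG state is baked in."""
    return [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in range(n)]


def sample_token(logits, temperature, gumbel_noise, top_k=TOP_K):
    """Gumbel-max sampling restricted to the top_k largest logits."""
    # Keep only the indices of the top_k largest logits; the rest are masked out.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Deterministic given its inputs: all randomness lives in gumbel_noise.
    return max(kept, key=lambda i: logits[i] / temperature + gumbel_noise[i])


rng = random.Random(0)
logits = [2.0, 0.5, 1.5, -1.0]
noise = draw_gumbel(len(logits), rng)
token = sample_token(logits, temperature=0.8, gumbel_noise=noise)
print(token)
```

Because `sample_token` is a pure function of its arguments, replaying it with fresh noise yields fresh samples, while temperature can vary per call without changing the function itself.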
preprocess runs on the host, once per request, serially on the runner's critical path; shipping a per-request (T_audio, embedding_dim) speaker embedding (ZMQ serialize/deserialize + H2D) and reassembling the prefill context there dominated TTFT under concurrency.

For the fixed speaker set we serve, bake the speaker embeddings into model state: load each speaker_embeddings/<id>.pt once at construction into a GPU-resident, model-dtype tensor, so a request carries only a short speaker_id string. Custom / one-off voices may still pass a raw speaker_embedding tensor. Loaded in __init__ (not load_weights) so it is present under --load-format dummy too.

Also drop the silent zero/last-row padding of short prefill chunks in favor of an assertion (the backbone was never trained on padded context).

Benchmark (dummy weights, RTX A6000, n=64, speaker_id path): c=32 TTFT 188ms -> ~95ms, c=1 31ms -> 27ms.
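The lookup pattern from this commit can be sketched as follows, using plain Python lists in place of GPU tensors. `SpeakerTable`, `resolve`, and `check_prefill_chunk` are hypothetical names, and the rule that a raw `speaker_embedding` takes precedence over `speaker_id` is an assumption made for the sketch, not something the commit states.

```python
class SpeakerTable:
    """Hypothetical stand-in for model state holding pre-loaded speaker embeddings."""

    def __init__(self, embeddings_by_id):
        # Loaded once at construction; the real model would keep GPU-resident tensors here.
        self._table = dict(embeddings_by_id)

    def resolve(self, request):
        """Return the embedding for a request: a raw custom one, else a lookup by speaker_id."""
        if "speaker_embedding" in request:
            # Assumed precedence: a custom voice overrides the fixed speaker set.
            return request["speaker_embedding"]
        return self._table[request["speaker_id"]]


def check_prefill_chunk(chunk, expected_len):
    """Reject short prefill chunks instead of silently padding them."""
    assert len(chunk) == expected_len, (
        f"prefill chunk has {len(chunk)} rows, expected {expected_len}"
    )
    return chunk


table = SpeakerTable({"spk_a": [[0.1, 0.2]]})
by_id = table.resolve({"speaker_id": "spk_a"})
custom = table.resolve({"speaker_embedding": [[9.0, 9.0]]})
print(by_id, custom)
```

With this shape, the common request carries only a short string, and the expensive per-request tensor transfer is limited to custom voices.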
…d, no-op for codec as debug Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…st of text tokens Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… chunks of tokens Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
EasyMP model definition, where the backbone and the local transformer (LT) are compiled into a single CUDA graph for uniform batches.
Loads real weights, but does not produce valid acoustic tokens at this point.