Multitalker Parakeet Streaming 0.6B v1 -- ONNX

ONNX export of NVIDIA's multitalker-parakeet-streaming-0.6b-v1, a 600M-parameter streaming multi-speaker ASR model. Designed for use with parakeet-rs or similar.

The original NeMo model injects speaker kernels via forward hooks, which are lost during a standard ONNX export. These ONNX files were exported with a custom wrapper that exposes the speaker targets as explicit graph inputs, preserving the full multi-speaker pipeline.

Files

| File | Size | Description |
|---|---|---|
| encoder.onnx | 40MB | Encoder graph (references encoder.onnx.data) |
| encoder.onnx.data | 2.3GB | Encoder weights (fp32) |
| encoder.int8.onnx | 627MB | Encoder, dynamically quantised to uint8 |
| decoder_joint.onnx | 34MB | Decoder + joint network (fp32) |
| decoder_joint.int8.onnx | 8.6MB | Decoder + joint, dynamically quantised to int8 |
| tokenizer.model | 245KB | SentencePiece vocabulary (1024 tokens) |
| multitalker_config.json | <1KB | Model dimensions |

For inference you need either the fp32 or int8 encoder, either the fp32 or int8 decoder, and the tokenizer. The int8 models are recommended for most use cases -- they are significantly smaller with minimal quality loss from dynamic quantisation.

How it works

The model runs one encoder instance per active speaker, each with independent cache and decoder state. A Sortformer diarisation model provides per-frame speaker activity probabilities, which are injected into the encoder as masks at layer 0 via learned feedforward networks (speaker kernels).

```
Audio Chunk
    |
    v
[Mel Spectrogram] -- computed once, shared
    |
    +---> [Sortformer] --> raw speaker activity [T, 4]
    |
    v
For each active speaker k:
    |   spk_targets_k    = activity[:, k]
    |   bg_spk_targets_k = max(activity[:, others])
    |
    +---> [Encoder(mel, cache_k, spk_targets_k, bg_targets_k)] --> encoded_k
    +---> [RNNT Decoder(encoded_k, state_k)] --> tokens_k
    |
    v
Per-Speaker Transcripts
```

The speaker kernel at layer 0 applies: x = x + FF(x * spk_mask) + FF_bg(x * bg_mask), where each FF is Linear(1024,1024) -> ReLU -> Dropout -> Linear(1024,1024).
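As a shape-level illustration only (random stand-in weights, biases omitted for brevity; the real parameters live inside the ONNX graph), the layer-0 injection can be sketched in numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 14, 1024  # encoded frames per chunk, hidden dimension

def ff(x, w1, w2):
    # Linear(1024,1024) -> ReLU -> Linear(1024,1024); dropout is identity at inference
    return np.maximum(x @ w1, 0.0) @ w2

# Random stand-ins for the two learned speaker-kernel networks (FF and FF_bg)
w1, w2 = rng.normal(0, 0.02, (D, D)), rng.normal(0, 0.02, (D, D))
v1, v2 = rng.normal(0, 0.02, (D, D)), rng.normal(0, 0.02, (D, D))

x = rng.normal(size=(T, D))        # layer-0 hidden states
spk_mask = np.ones((T, 1))         # target speaker active on every frame
bg_mask = np.zeros((T, 1))         # no competing speakers

y = x + ff(x * spk_mask, w1, w2) + ff(x * bg_mask, v1, v2)
```

With an all-zero background mask the FF_bg branch contributes nothing, so a single-speaker frame reduces to `x + FF(x)`.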

Model architecture

| Parameter | Value |
|---|---|
| Architecture | FastConformer + RNNT |
| Parameters | 600M |
| Encoder layers | 24 |
| Hidden dimension | 1024 |
| Subsampling factor | 8x |
| Streaming chunk size | 112 mel frames (~1.12s) |
| Left context | 70 frames |
| Conv context | 8 (kernel_size - 1) |
| Decoder | 2-layer LSTM, 640 hidden |
| Vocabulary | 1024 SentencePiece tokens + 1 blank |
| Speaker kernel layers | [0] (layer 0 only) |
| Max speakers | 4 (from Sortformer) |
| Sample rate | 16kHz mono |
| Mel bins | 128 |
| FFT size | 512 |
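These numbers fit together as follows. The 10 ms mel hop is an assumption on my part, but it is consistent with the 112-frame / ~1.12 s chunk size and the 17920-sample chunks used in the parakeet-rs example below:

```python
sample_rate = 16_000
hop_ms = 10                     # assumed mel hop size
chunk_mel_frames = 112
subsampling = 8

chunk_seconds = chunk_mel_frames * hop_ms / 1000    # 1.12 s per chunk
chunk_samples = round(chunk_seconds * sample_rate)  # 17920 samples per chunk
encoded_frames = chunk_mel_frames // subsampling    # 14 encoder frames out per chunk
```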

ONNX inputs and outputs

Encoder (7 inputs, 5 outputs)

Inputs:

| Name | Shape | Type |
|---|---|---|
| processed_signal | [1, 128, time] | float32 |
| processed_signal_length | [1] | int64 |
| cache_last_channel | [1, 24, 70, 1024] | float32 |
| cache_last_time | [1, 24, 1024, 8] | float32 |
| cache_last_channel_len | [1] | int64 |
| spk_targets | [1, spk_time] | float32 |
| bg_spk_targets | [1, spk_time] | float32 |

Outputs:

| Name | Shape | Type |
|---|---|---|
| encoded | [1, 1024, encoded_time] | float32 |
| encoded_len | [1] | int64 |
| cache_last_channel_next | [1, 24, 70, 1024] | float32 |
| cache_last_time_next | [1, 24, 1024, 8] | float32 |
| cache_last_channel_len_next | [1] | int64 |

Cache tensors are batch-first [batch, n_layers, ...]. Initialise with zeros; pass outputs back as inputs for subsequent chunks.
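For example, a zero-initialised cache set (one per active speaker) with the shapes from the table above can be built like this; the dict keys mirror the ONNX input names, and the actual `session.run` plumbing is left to whichever runtime binding you use:

```python
import numpy as np

# Shapes from the encoder input table: 24 layers, 70-frame left context,
# 1024 hidden dim, 8-frame conv context
n_layers, left_ctx, hidden, conv_ctx = 24, 70, 1024, 8

def init_cache():
    return {
        "cache_last_channel": np.zeros((1, n_layers, left_ctx, hidden), np.float32),
        "cache_last_time": np.zeros((1, n_layers, hidden, conv_ctx), np.float32),
        "cache_last_channel_len": np.zeros((1,), np.int64),
    }

cache = init_cache()
# After each chunk, feed the *_next outputs back in, e.g.:
# cache["cache_last_channel"] = outs["cache_last_channel_next"]
```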

spk_targets and bg_spk_targets are per-frame speaker activity probabilities in [0, 1]. For single-speaker mode, use spk_targets=1.0 and bg_spk_targets=0.0. The model internally handles time dimension mismatches between the mask and encoder hidden states.
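For multi-speaker use, both masks come from the Sortformer activity matrix, per the per-speaker loop in the diagram above. A hypothetical numpy helper (the function name is mine, not part of parakeet-rs) for speaker k:

```python
import numpy as np

def speaker_masks(activity: np.ndarray, k: int):
    """activity: [T, 4] per-frame speaker probabilities from Sortformer.
    Returns (spk_targets, bg_spk_targets) for speaker k, each shaped [1, T]."""
    spk = activity[:, k]
    others = np.delete(activity, k, axis=1)
    bg = others.max(axis=1)  # strongest competing speaker on each frame
    return spk[None, :].astype(np.float32), bg[None, :].astype(np.float32)

# Single-speaker mode: speaker always on, background silent
T = 112
spk_single = np.ones((1, T), np.float32)
bg_single = np.zeros((1, T), np.float32)
```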

Decoder + Joint (4 inputs, 4 outputs)

Inputs:

| Name | Shape | Type |
|---|---|---|
| encoder_outputs | [1, enc_time, 1024] | float32 |
| targets | [1, 1] | int64 |
| input_states_1 | [2, 1, 640] | float32 |
| input_states_2 | [2, 1, 640] | float32 |

Outputs:

| Name | Shape | Type |
|---|---|---|
| outputs | [1, enc_time, 1, 1025] | float32 |
| prednet_lengths | scalar | int64 |
| states_1 | [2, 1, 640] | float32 |
| states_2 | [2, 1, 640] | float32 |
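The 1025-wide logit vector corresponds to the 1024 SentencePiece tokens plus one blank. A single greedy RNNT decision over that vector can be sketched as below; treating blank as the last index (1024) is an assumption based on the "1024 tokens + 1 blank" layout, so verify it against your tokenizer:

```python
import numpy as np

BLANK_ID = 1024  # assumed: vocabulary ids 0..1023, blank last

def greedy_step(logits, hypothesis):
    """One greedy decision over a [1025] logit vector.
    Returns (token_or_None, advance_time)."""
    tok = int(np.argmax(logits))
    if tok == BLANK_ID:
        return None, True          # blank: move on to the next encoder frame
    hypothesis.append(tok)
    return tok, False              # non-blank: emit, re-run joint at the same frame

hyp = []
fake_logits = np.full(1025, -1.0)
fake_logits[42] = 5.0              # pretend the joint favours token 42
tok, advance = greedy_step(fake_logits, hyp)
```

In a real loop, each non-blank emission also feeds the new token and `states_1`/`states_2` back into the decoder graph before the next call.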

Reproducing the export

The conversion_scripts/ directory contains everything needed to re-export from the original .nemo checkpoint. See conversion_scripts/README.md for setup and usage.

```
cd conversion_scripts
uv venv --python 3.12
uv pip install --python .venv/bin/python3.12 -r requirements.txt
source .venv/bin/activate
python export_multitalker.py --nemo-path /path/to/model.nemo --output-dir ../
```

Usage with parakeet-rs

Requires the multitalker feature and a Sortformer v2 ONNX model for speaker diarisation.

```rust
use parakeet_rs::MultitalkerASR;

let mut model = MultitalkerASR::from_pretrained(
    "path/to/this/repo",        // directory with encoder, decoder, tokenizer
    "path/to/sortformer.onnx",  // Sortformer v2 ONNX model
    None,                       // use default execution config
)?;

// Stream audio in chunks
for chunk in audio.chunks(17920) {  // 1.12s at 16kHz
    let results = model.transcribe_chunk(chunk)?;
    for r in &results {
        println!("[Speaker {}] {}", r.speaker_id, r.text);
    }
}

// Get final per-speaker transcripts
for transcript in model.get_transcripts() {
    println!("Speaker {}: {}", transcript.speaker_id, transcript.text);
}
```

Or from the command line:

```
cargo run --release --example multitalker --features multitalker -- \
    audio.wav path/to/this/repo path/to/sortformer.onnx
```

The model also implements the Transcriber trait for single-speaker fallback (no diarisation needed).

Upstream model

NVIDIA's multitalker-parakeet-streaming-0.6b-v1, distributed as a NeMo checkpoint; see "Reproducing the export" above for how these ONNX files were derived from it.