# Multitalker Parakeet Streaming 0.6B v1 -- ONNX
ONNX export of NVIDIA's multitalker-parakeet-streaming-0.6b-v1, a 600M-parameter streaming multi-speaker ASR model. Designed for use with parakeet-rs or similar.
The original NeMo model uses speaker kernel injection via forward hooks which are lost during standard ONNX export. These ONNX files were exported with a custom wrapper that exposes speaker targets as explicit graph inputs, preserving the full multi-speaker pipeline.
## Files

| File | Size | Description |
|---|---|---|
| `encoder.onnx` | 40MB | Encoder graph (references `encoder.onnx.data`) |
| `encoder.onnx.data` | 2.3GB | Encoder weights (fp32) |
| `encoder.int8.onnx` | 627MB | Encoder, dynamically quantised to uint8 |
| `decoder_joint.onnx` | 34MB | Decoder + joint network (fp32) |
| `decoder_joint.int8.onnx` | 8.6MB | Decoder + joint, dynamically quantised to int8 |
| `tokenizer.model` | 245KB | SentencePiece vocabulary (1024 tokens) |
| `multitalker_config.json` | <1KB | Model dimensions |
For inference you need one encoder (fp32 or int8), one decoder (fp32 or int8), and the tokenizer. The int8 models are recommended for most use cases -- they are significantly smaller, and dynamic quantisation costs little quality.
## How it works
The model runs one encoder instance per active speaker, each with independent cache and decoder state. A Sortformer diarisation model provides per-frame speaker activity probabilities, which are injected into the encoder as masks at layer 0 via learned feedforward networks (speaker kernels).
```
Audio Chunk
     |
     v
[Mel Spectrogram] -- computed once, shared
     |
     +---> [Sortformer] --> raw speaker activity [T, 4]
     |
     v
For each active speaker k:
  |  spk_targets_k    = activity[:, k]
  |  bg_spk_targets_k = max(activity[:, others])
  |
  +---> [Encoder(mel, cache_k, spk_targets_k, bg_targets_k)] --> encoded_k
  +---> [RNNT Decoder(encoded_k, state_k)] --> tokens_k
     |
     v
Per-Speaker Transcripts
```
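The mask derivation in the diagram above can be sketched with numpy: the target mask is the chosen speaker's column of the Sortformer activity matrix, and the background mask is the framewise maximum over all other speakers.

```python
import numpy as np

def speaker_masks(activity: np.ndarray, k: int):
    """Derive target/background masks for speaker k from Sortformer
    activity probabilities of shape [T, n_speakers]."""
    spk_targets = activity[:, k]              # this speaker's per-frame activity
    others = np.delete(activity, k, axis=1)   # columns for all other speakers
    bg_spk_targets = others.max(axis=1)       # loudest competing speaker per frame
    return spk_targets, bg_spk_targets

# Example: 3 frames, 4 speaker slots
activity = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.7, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
], dtype=np.float32)
spk, bg = speaker_masks(activity, k=0)
# spk = [0.9, 0.8, 0.1]; bg = [0.1, 0.7, 0.9]
```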
The speaker kernel at layer 0 applies `x = x + FF(x * spk_mask) + FF_bg(x * bg_mask)`, where each FF is `Linear(1024, 1024) -> ReLU -> Dropout -> Linear(1024, 1024)`.
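That injection can be sketched in numpy as follows. The weights here are random stand-ins (the trained speaker-kernel weights live inside the encoder graph); biases and the inference-inactive dropout are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024  # encoder hidden dimension

def make_ff(d):
    # Stand-in for Linear(d, d) -> ReLU -> Linear(d, d) with random weights
    w1 = rng.standard_normal((d, d)).astype(np.float32) * 0.01
    w2 = rng.standard_normal((d, d)).astype(np.float32) * 0.01
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

ff_spk, ff_bg = make_ff(d), make_ff(d)

def speaker_kernel(x, spk_mask, bg_mask):
    """x: [T, d] layer-0 hidden states; masks: [T] activities in [0, 1]."""
    m, bg = spk_mask[:, None], bg_mask[:, None]  # broadcast masks over features
    return x + ff_spk(x * m) + ff_bg(x * bg)

x = rng.standard_normal((5, d)).astype(np.float32)
# Single-speaker mode: full target mask, empty background mask
out = speaker_kernel(x, np.ones(5, np.float32), np.zeros(5, np.float32))
```

With a zero background mask (and no biases in this sketch), the background branch contributes nothing, so the output reduces to `x + ff_spk(x)`.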
## Model architecture
| Parameter | Value |
|---|---|
| Architecture | FastConformer + RNNT |
| Parameters | 600M |
| Encoder layers | 24 |
| Hidden dimension | 1024 |
| Subsampling factor | 8x |
| Streaming chunk size | 112 mel frames (~1.12s) |
| Left context | 70 frames |
| Conv context | 8 (kernel_size - 1) |
| Decoder | 2-layer LSTM, 640 hidden |
| Vocabulary | 1024 SentencePiece tokens + 1 blank |
| Speaker kernel layers | [0] (layer 0 only) |
| Max speakers | 4 (from Sortformer) |
| Sample rate | 16kHz mono |
| Mel bins | 128 |
| FFT size | 512 |
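The timing figures in the table fit together under a standard 10 ms mel hop (an assumption inferred from the ~1.12 s chunk duration, not read from the config), which is useful when sizing audio buffers:

```python
sample_rate = 16_000
hop_samples = 160                        # 10 ms mel hop at 16 kHz (assumed)
chunk_mel_frames = 112                   # streaming chunk size from the table
subsampling = 8

chunk_samples = chunk_mel_frames * hop_samples    # audio samples per chunk
chunk_seconds = chunk_samples / sample_rate
encoded_frames = chunk_mel_frames // subsampling  # encoder frames per chunk

print(chunk_samples, round(chunk_seconds, 2), encoded_frames)  # 17920 1.12 14
```

The 17920-sample figure matches the chunk size used in the usage example further down.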
## ONNX inputs and outputs

### Encoder (7 inputs, 5 outputs)
Inputs:
| Name | Shape | Type |
|---|---|---|
| `processed_signal` | `[1, 128, time]` | float32 |
| `processed_signal_length` | `[1]` | int64 |
| `cache_last_channel` | `[1, 24, 70, 1024]` | float32 |
| `cache_last_time` | `[1, 24, 1024, 8]` | float32 |
| `cache_last_channel_len` | `[1]` | int64 |
| `spk_targets` | `[1, spk_time]` | float32 |
| `bg_spk_targets` | `[1, spk_time]` | float32 |
Outputs:
| Name | Shape | Type |
|---|---|---|
| `encoded` | `[1, 1024, encoded_time]` | float32 |
| `encoded_len` | `[1]` | int64 |
| `cache_last_channel_next` | `[1, 24, 70, 1024]` | float32 |
| `cache_last_time_next` | `[1, 24, 1024, 8]` | float32 |
| `cache_last_channel_len_next` | `[1]` | int64 |
Cache tensors are batch-first [batch, n_layers, ...]. Initialise with zeros; pass outputs back as inputs for subsequent chunks.
spk_targets and bg_spk_targets are per-frame speaker activity probabilities in [0, 1]. For single-speaker mode, use spk_targets=1.0 and bg_spk_targets=0.0. The model internally handles time dimension mismatches between the mask and encoder hidden states.
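Putting that I/O spec together, a minimal feed dictionary for the first chunk might look like the sketch below (shapes only; `mel` is a zero placeholder and no actual onnxruntime session is run here):

```python
import numpy as np

n_layers, d, left_ctx, conv_ctx, n_mels = 24, 1024, 70, 8, 128
time = 112  # mel frames in one streaming chunk

mel = np.zeros((1, n_mels, time), dtype=np.float32)  # placeholder features

feed = {
    "processed_signal": mel,
    "processed_signal_length": np.array([time], dtype=np.int64),
    # Caches start at zero; feed the *_next outputs back in for later chunks
    "cache_last_channel": np.zeros((1, n_layers, left_ctx, d), dtype=np.float32),
    "cache_last_time": np.zeros((1, n_layers, d, conv_ctx), dtype=np.float32),
    "cache_last_channel_len": np.zeros((1,), dtype=np.int64),
    # Single-speaker mode: target everywhere, no background
    "spk_targets": np.ones((1, time), dtype=np.float32),
    "bg_spk_targets": np.zeros((1, time), dtype=np.float32),
}
```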
### Decoder + Joint (4 inputs, 4 outputs)
Inputs:
| Name | Shape | Type |
|---|---|---|
| `encoder_outputs` | `[1, enc_time, 1024]` | float32 |
| `targets` | `[1, 1]` | int64 |
| `input_states_1` | `[2, 1, 640]` | float32 |
| `input_states_2` | `[2, 1, 640]` | float32 |
Outputs:
| Name | Shape | Type |
|---|---|---|
| `outputs` | `[1, enc_time, 1, 1025]` | float32 |
| `prednet_lengths` | scalar | int64 |
| `states_1` | `[2, 1, 640]` | float32 |
| `states_2` | `[2, 1, 640]` | float32 |
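The decoder + joint graph scores one label step at a time, so greedy RNNT decoding is an outer loop over encoder frames with an inner emission loop. The sketch below captures that control flow; `step` is a hypothetical stand-in for one decoder_joint call (the real call takes `targets` and the two LSTM states and returns 1025-way logits). Blank is index 1024, the last of the 1025 outputs, matching the 1024-token vocabulary plus one blank.

```python
BLANK = 1024  # index of the blank symbol in the 1025-way output

def greedy_rnnt(enc_frames, step, max_symbols=10):
    """enc_frames: iterable of encoder frames.
    step(frame, last_token, state) -> (logits, state): one decoder_joint call."""
    tokens, state, last = [], None, BLANK
    for frame in enc_frames:
        for _ in range(max_symbols):  # cap emissions per frame
            logits, new_state = step(frame, last, state)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == BLANK:
                break                 # blank: advance to the next frame
            tokens.append(best)
            last, state = best, new_state  # predictor only advances on emission
    return tokens

# Toy stub: each "frame" is the list of token ids it should emit, then blank
def make_toy_step():
    progress = {}
    def toy_step(frame, last, state):
        i = progress.get(id(frame), 0)
        logits = [0.0] * 1025
        if i < len(frame):
            logits[frame[i]] = 1.0
            progress[id(frame)] = i + 1
        else:
            logits[BLANK] = 1.0
        return logits, state
    return toy_step

print(greedy_rnnt([[5, 7], [9]], make_toy_step()))  # → [5, 7, 9]
```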
## Reproducing the export
The conversion_scripts/ directory contains everything needed to re-export from the original .nemo checkpoint. See conversion_scripts/README.md for setup and usage.
```bash
cd conversion_scripts
uv venv --python 3.12
uv pip install --python .venv/bin/python3.12 -r requirements.txt
source .venv/bin/activate
python export_multitalker.py --nemo-path /path/to/model.nemo --output-dir ../
```
## Usage with parakeet-rs
Requires the multitalker feature and a Sortformer v2 ONNX model for speaker diarisation.
```rust
use parakeet_rs::MultitalkerASR;

let mut model = MultitalkerASR::from_pretrained(
    "path/to/this/repo",       // directory with encoder, decoder, tokenizer
    "path/to/sortformer.onnx", // Sortformer v2 ONNX model
    None,                      // use default execution config
)?;

// Stream audio in chunks
for chunk in audio.chunks(17920) { // ~1.12s at 16kHz
    let results = model.transcribe_chunk(chunk)?;
    for r in &results {
        println!("[Speaker {}] {}", r.speaker_id, r.text);
    }
}

// Get final per-speaker transcripts
for transcript in model.get_transcripts() {
    println!("Speaker {}: {}", transcript.speaker_id, transcript.text);
}
```
Or from the command line:
```bash
cargo run --release --example multitalker --features multitalker -- \
  audio.wav path/to/this/repo path/to/sortformer.onnx
```
The model also implements the Transcriber trait for single-speaker fallback (no diarisation needed).
## Upstream model
- Source: nvidia/multitalker-parakeet-streaming-0.6b-v1
- Paper: Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR (Wang et al., 2025)
- Base model: nvidia/nemotron-speech-streaming-en-0.6b
- Licence: NVIDIA Open Model License