# Multitalker Parakeet Streaming 0.6B v1 -- ONNX
ONNX export of NVIDIA's multitalker-parakeet-streaming-0.6b-v1, a 600M-parameter streaming multi-speaker ASR model. Designed for use with parakeet-rs or similar.
The original NeMo model uses speaker kernel injection via forward hooks which are lost during standard ONNX export. These ONNX files were exported with a custom wrapper that exposes speaker targets as explicit graph inputs, preserving the full multi-speaker pipeline.
## Files

| File | Size | Description |
|---|---|---|
| `encoder.onnx` | 40MB | Encoder graph (references `encoder.onnx.data`) |
| `encoder.onnx.data` | 2.3GB | Encoder weights (fp32) |
| `encoder.int8.onnx` | 627MB | Encoder, dynamically quantised to uint8 |
| `decoder_joint.onnx` | 34MB | Decoder + joint network (fp32) |
| `decoder_joint.int8.onnx` | 8.6MB | Decoder + joint, dynamically quantised to int8 |
| `tokenizer.model` | 245KB | SentencePiece vocabulary (1024 tokens) |
| `multitalker_config.json` | <1KB | Model dimensions |
For inference you need one encoder (fp32 or int8), one decoder (fp32 or int8), and the tokenizer. The int8 models are recommended for most use cases -- they are significantly smaller, and dynamic quantisation costs little quality.
## How it works
The model runs one encoder instance per active speaker, each with independent cache and decoder state. A Sortformer diarisation model provides per-frame speaker activity probabilities, which are injected into the encoder as masks at layer 0 via learned feedforward networks (speaker kernels).
```
Audio Chunk
     |
     v
[Mel Spectrogram] -- computed once, shared
     |
     +---> [Sortformer] --> raw speaker activity [T, 4]
     |
     v
For each active speaker k:
  |  spk_targets_k    = activity[:, k]
  |  bg_spk_targets_k = max(activity[:, others])
  |
  +---> [Encoder(mel, cache_k, spk_targets_k, bg_targets_k)] --> encoded_k
  +---> [RNNT Decoder(encoded_k, state_k)] --> tokens_k
     |
     v
Per-Speaker Transcripts
```
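The mask derivation in the diagram above can be sketched with numpy: the target mask is the chosen speaker's column of the Sortformer activity matrix, and the background mask is the framewise maximum over all other speakers.

```python
import numpy as np

def speaker_masks(activity: np.ndarray, k: int):
    """Derive target/background masks for speaker k from Sortformer
    activity probabilities of shape [T, n_speakers]."""
    spk_targets = activity[:, k]              # this speaker's per-frame activity
    others = np.delete(activity, k, axis=1)   # columns for all other speakers
    bg_spk_targets = others.max(axis=1)       # loudest competing speaker per frame
    return spk_targets, bg_spk_targets

# Example: 3 frames, 4 speaker slots
activity = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.7, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
], dtype=np.float32)
spk, bg = speaker_masks(activity, k=0)
# spk = [0.9, 0.8, 0.1]; bg = [0.1, 0.7, 0.9]
```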
The speaker kernel at layer 0 applies `x = x + FF(x * spk_mask) + FF_bg(x * bg_mask)`, where each FF is `Linear(1024, 1024) -> ReLU -> Dropout -> Linear(1024, 1024)`.
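That injection can be sketched in numpy as follows. The weights here are random stand-ins (the trained speaker-kernel weights live inside the encoder graph); biases and the inference-inactive dropout are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024  # encoder hidden dimension

def make_ff(d):
    # Stand-in for Linear(d, d) -> ReLU -> Linear(d, d) with random weights
    w1 = rng.standard_normal((d, d)).astype(np.float32) * 0.01
    w2 = rng.standard_normal((d, d)).astype(np.float32) * 0.01
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

ff_spk, ff_bg = make_ff(d), make_ff(d)

def speaker_kernel(x, spk_mask, bg_mask):
    """x: [T, d] layer-0 hidden states; masks: [T] activities in [0, 1]."""
    m, bg = spk_mask[:, None], bg_mask[:, None]  # broadcast masks over features
    return x + ff_spk(x * m) + ff_bg(x * bg)

x = rng.standard_normal((5, d)).astype(np.float32)
# Single-speaker mode: full target mask, empty background mask
out = speaker_kernel(x, np.ones(5, np.float32), np.zeros(5, np.float32))
```

With a zero background mask (and no biases in this sketch), the background branch contributes nothing, so the output reduces to `x + ff_spk(x)`.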
## Model architecture
| Parameter | Value |
|---|---|
| Architecture | FastConformer + RNNT |
| Parameters | 600M |
| Encoder layers | 24 |
| Hidden dimension | 1024 |
| Subsampling factor | 8x |
| Streaming chunk size | 112 mel frames (~1.12s) |
| Left context | 70 frames |
| Conv context | 8 (kernel_size - 1) |
| Decoder | 2-layer LSTM, 640 hidden |
| Vocabulary | 1024 SentencePiece tokens + 1 blank |
| Speaker kernel layers | [0] (layer 0 only) |
| Max speakers | 4 (from Sortformer) |
| Sample rate | 16kHz mono |
| Mel bins | 128 |
| FFT size | 512 |
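The timing figures in the table fit together under a standard 10 ms mel hop (an assumption inferred from the ~1.12 s chunk duration, not read from the config), which is useful when sizing audio buffers:

```python
sample_rate = 16_000
hop_samples = 160                        # 10 ms mel hop at 16 kHz (assumed)
chunk_mel_frames = 112                   # streaming chunk size from the table
subsampling = 8

chunk_samples = chunk_mel_frames * hop_samples    # audio samples per chunk
chunk_seconds = chunk_samples / sample_rate
encoded_frames = chunk_mel_frames // subsampling  # encoder frames per chunk

print(chunk_samples, round(chunk_seconds, 2), encoded_frames)  # 17920 1.12 14
```

The 17920-sample figure matches the chunk size used in the usage example further down.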
## ONNX inputs and outputs

### Encoder (7 inputs, 5 outputs)
Inputs:
| Name | Shape | Type |
|---|---|---|
| `processed_signal` | `[1, 128, time]` | float32 |
| `processed_signal_length` | `[1]` | int64 |
| `cache_last_channel` | `[1, 24, 70, 1024]` | float32 |
| `cache_last_time` | `[1, 24, 1024, 8]` | float32 |
| `cache_last_channel_len` | `[1]` | int64 |
| `spk_targets` | `[1, spk_time]` | float32 |
| `bg_spk_targets` | `[1, spk_time]` | float32 |
Outputs:
| Name | Shape | Type |
|---|---|---|
| `encoded` | `[1, 1024, encoded_time]` | float32 |
| `encoded_len` | `[1]` | int64 |
| `cache_last_channel_next` | `[1, 24, 70, 1024]` | float32 |
| `cache_last_time_next` | `[1, 24, 1024, 8]` | float32 |
| `cache_last_channel_len_next` | `[1]` | int64 |
Cache tensors are batch-first [batch, n_layers, ...]. Initialise with zeros; pass outputs back as inputs for subsequent chunks.
spk_targets and bg_spk_targets are per-frame speaker activity probabilities in [0, 1]. For single-speaker mode, use spk_targets=1.0 and bg_spk_targets=0.0. The model internally handles time dimension mismatches between the mask and encoder hidden states.
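Putting that I/O spec together, a minimal feed dictionary for the first chunk might look like the sketch below (shapes only; `mel` is a zero placeholder and no actual onnxruntime session is run here):

```python
import numpy as np

n_layers, d, left_ctx, conv_ctx, n_mels = 24, 1024, 70, 8, 128
time = 112  # mel frames in one streaming chunk

mel = np.zeros((1, n_mels, time), dtype=np.float32)  # placeholder features

feed = {
    "processed_signal": mel,
    "processed_signal_length": np.array([time], dtype=np.int64),
    # Caches start at zero; feed the *_next outputs back in for later chunks
    "cache_last_channel": np.zeros((1, n_layers, left_ctx, d), dtype=np.float32),
    "cache_last_time": np.zeros((1, n_layers, d, conv_ctx), dtype=np.float32),
    "cache_last_channel_len": np.zeros((1,), dtype=np.int64),
    # Single-speaker mode: target everywhere, no background
    "spk_targets": np.ones((1, time), dtype=np.float32),
    "bg_spk_targets": np.zeros((1, time), dtype=np.float32),
}
```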
### Decoder + Joint (4 inputs, 4 outputs)
Inputs:
| Name | Shape | Type |
|---|---|---|
| `encoder_outputs` | `[1, enc_time, 1024]` | float32 |
| `targets` | `[1, 1]` | int64 |
| `input_states_1` | `[2, 1, 640]` | float32 |
| `input_states_2` | `[2, 1, 640]` | float32 |
Outputs:
| Name | Shape | Type |
|---|---|---|
| `outputs` | `[1, enc_time, 1, 1025]` | float32 |
| `prednet_lengths` | scalar | int64 |
| `states_1` | `[2, 1, 640]` | float32 |
| `states_2` | `[2, 1, 640]` | float32 |
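The decoder + joint graph scores one label step at a time, so greedy RNNT decoding is an outer loop over encoder frames with an inner emission loop. The sketch below captures that control flow; `step` is a hypothetical stand-in for one decoder_joint call (the real call takes `targets` and the two LSTM states and returns 1025-way logits). Blank is index 1024, the last of the 1025 outputs, matching the 1024-token vocabulary plus one blank.

```python
BLANK = 1024  # index of the blank symbol in the 1025-way output

def greedy_rnnt(enc_frames, step, max_symbols=10):
    """enc_frames: iterable of encoder frames.
    step(frame, last_token, state) -> (logits, state): one decoder_joint call."""
    tokens, state, last = [], None, BLANK
    for frame in enc_frames:
        for _ in range(max_symbols):  # cap emissions per frame
            logits, new_state = step(frame, last, state)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == BLANK:
                break                 # blank: advance to the next frame
            tokens.append(best)
            last, state = best, new_state  # predictor only advances on emission
    return tokens

# Toy stub: each "frame" is the list of token ids it should emit, then blank
def make_toy_step():
    progress = {}
    def toy_step(frame, last, state):
        i = progress.get(id(frame), 0)
        logits = [0.0] * 1025
        if i < len(frame):
            logits[frame[i]] = 1.0
            progress[id(frame)] = i + 1
        else:
            logits[BLANK] = 1.0
        return logits, state
    return toy_step

print(greedy_rnnt([[5, 7], [9]], make_toy_step()))  # → [5, 7, 9]
```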
## Reproducing the export
The conversion_scripts/ directory contains everything needed to re-export from the original .nemo checkpoint. See conversion_scripts/README.md for setup and usage.
```bash
cd conversion_scripts
uv venv --python 3.12
uv pip install --python .venv/bin/python3.12 -r requirements.txt
source .venv/bin/activate
python export_multitalker.py --nemo-path /path/to/model.nemo --output-dir ../
```
## Usage with parakeet-rs
Requires the multitalker feature and a Sortformer v2 ONNX model for speaker diarisation.
```rust
use parakeet_rs::MultitalkerASR;

let mut model = MultitalkerASR::from_pretrained(
    "path/to/this/repo",       // directory with encoder, decoder, tokenizer
    "path/to/sortformer.onnx", // Sortformer v2 ONNX model
    None,                      // use default execution config
)?;

// Stream audio in chunks
for chunk in audio.chunks(17920) { // ~1.12s at 16kHz
    let results = model.transcribe_chunk(chunk)?;
    for r in &results {
        println!("[Speaker {}] {}", r.speaker_id, r.text);
    }
}

// Get final per-speaker transcripts
for transcript in model.get_transcripts() {
    println!("Speaker {}: {}", transcript.speaker_id, transcript.text);
}
```
Or from the command line:
```bash
cargo run --release --example multitalker --features multitalker -- \
  audio.wav path/to/this/repo path/to/sortformer.onnx
```
The model also implements the Transcriber trait for single-speaker fallback (no diarisation needed).
## Upstream model
- Source: nvidia/multitalker-parakeet-streaming-0.6b-v1
- Paper: Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR (Wang et al., 2025)
- Base model: nvidia/nemotron-speech-streaming-en-0.6b
- Licence: NVIDIA Open Model License