alexwengg's picture
Upload 54 files
e888982 verified

Nemotron Streaming 0.6B - WER Benchmark Results

Model: nvidia/nemotron-speech-streaming-en-0.6b Dataset: LibriSpeech test-clean Chunk size: 1.12s

Results

10 Files

Mode WER Errors Words
pad_and_drop_preencoded=False 1.79% 3 168
pad_and_drop_preencoded=True 3.57% 6 168

100 Files

Mode WER Errors Words
pad_and_drop_preencoded=False 1.88% - -

NVIDIA Claimed

Dataset WER
LibriSpeech test-clean (1.12s chunks) 2.31%

Notes

  • pad_and_drop_preencoded=False: Better WER, but cannot be exported to ONNX/CoreML
  • pad_and_drop_preencoded=True: Worse WER (~3%), but required for ONNX/CoreML export
  • NVIDIA's 2.31% likely uses pad_and_drop_preencoded=True on full 2620 files
  • Our implementation uses conformer_stream_step API with CacheAwareStreamingAudioBuffer

Run Benchmark

cd nemotron-speech-streaming-0.6b/coreml
uv sync
uv run python benchmark_wer.py --num-files 100