# Nemotron Speech Streaming 0.6B - CoreML Conversion

CoreML conversion of NVIDIA's `nvidia/nemotron-speech-streaming-en-0.6b` for real-time streaming ASR on Apple devices.

## Model Overview

| Property | Value |
|----------|-------|
| Source Model | `nvidia/nemotron-speech-streaming-en-0.6b` |
| Architecture | FastConformer RNNT (Streaming) |
| Parameters | 0.6B |
| Chunk Size | 1.12 seconds (112 mel frames) |
| Sample Rate | 16 kHz |
| Mel Features | 128 bins |

## CoreML Models

4 mlpackage files for the streaming RNNT pipeline:

| Model | Size | Function |
|-------|------|----------|
| `preprocessor.mlpackage` | 1.2M | audio → 128-dim mel spectrogram |
| `encoder.mlpackage` | 2.2G | mel + cache → encoded + new_cache |
| `decoder.mlpackage` | 28M | token + LSTM state → decoder_out + new_state |
| `joint.mlpackage` | 6.6M | encoder + decoder → logits |

Plus:
- `metadata.json` - Model configuration
- `tokenizer.json` - Vocabulary (1024 tokens)

## Streaming Configuration

```json
{
  "sample_rate": 16000,
  "mel_features": 128,
  "chunk_mel_frames": 112,
  "pre_encode_cache": 9,
  "total_mel_frames": 121,
  "vocab_size": 1024,
  "blank_idx": 1024,
  "encoder_dim": 1024,
  "decoder_hidden": 640,
  "decoder_layers": 2
}
```

### Chunk Timing

| Parameter | Value |
|-----------|-------|
| window_stride | 10ms |
| chunk_mel_frames | 112 |
| **chunk duration** | 112 × 10ms = **1.120s** |
| samples per chunk | 17,920 |

### Cache Shapes

| Cache | Shape | Description |
|-------|-------|-------------|
| cache_channel | [1, 24, 70, 1024] | Attention context cache |
| cache_time | [1, 24, 1024, 8] | Convolution time cache |
| cache_len | [1] | Cache fill level |

## Benchmark Results

### WER on LibriSpeech test-clean

| Mode | Files | WER | Notes |
|------|-------|-----|-------|
| PyTorch `pad_and_drop=False` | 100 | 1.88% | Non-streaming (full context) |
| PyTorch `pad_and_drop=True` | 10 | 3.57% | True streaming |
| CoreML Non-streaming | 100 | 1.83% | Full audio preprocessed |
| CoreML Streaming | 100 | 1.79% | Audio chunked at 1.12s |
| NVIDIA Claimed | 2620 | 2.31% | Full test-clean |

### Streaming Modes Explained

```
NON-STREAMING (test_coreml_inference.py):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Full audio → preprocessor → FULL mel (one continuous spectrogram)
2. Slice mel into chunks for encoder
3. Each slice has natural continuity (no chunk boundaries)

CHEAT: The mel was computed with full audio context
WER: ~1.83%
```

```
TRUE STREAMING (test_coreml_streaming.py):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Audio chunk 1 → preprocessor → mel_1
2. Audio chunk 2 → preprocessor → mel_2 (computed separately!)
3. Prepend last 9 frames of mel_1 to mel_2 (mel_cache)

mel_cache = bridge between separately-computed mels (NOT cheating)
WER: ~1.79%
```

### What is mel_cache?

The encoder's subsampling layer needs 9 frames (~90ms) of look-back context:

```
ENCODER INPUT (needs 121 frames = 9 cache + 112 new)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
│9│                      112 frames                        │
 ↑
 mel_cache = last 9 frames from PREVIOUS chunk's mel

Chunk 1: [000000000][mel_chunk_1]  ← pad with zeros (no previous)
Chunk 2: [mel_1_end][mel_chunk_2]  ← 9 frames from chunk 1
Chunk 3: [mel_2_end][mel_chunk_3]  ← 9 frames from chunk 2
```

This is **NOT cheating** - in real-time streaming you DO have the previous 90ms of audio.

## Inference Pipeline

```
┌─────────────────────────────────────────────────────────────────┐
│                     STREAMING RNNT PIPELINE                     │
└─────────────────────────────────────────────────────────────────┘

1. PREPROCESSOR (per 1.12s audio chunk)
   audio [1, 17920] → mel [1, 128, 112]

2. ENCODER (with cache)
   mel [1, 128, 121] + cache → encoded [1, 1024, 14] + new_cache
   (121 = 9 mel_cache + 112 new frames)
   (14 output frames after 8x subsampling)

3. DECODER + JOINT (greedy loop per encoder frame)
   For each of 14 encoder frames:
   ┌──────────────────────────────────────────┐
   │ token → DECODER → decoder_out            │
   │ encoder_step + decoder_out → JOINT       │
   │ → logits → argmax → predicted token      │
   │ if token == BLANK: next encoder frame    │
   │ else: emit token, update decoder state   │
   └──────────────────────────────────────────┘
```

## Usage

### Convert to CoreML

```bash
cd conversion_scripts
uv sync
uv run python convert_nemotron_streaming.py --output-dir ../nemotron_coreml
```

Options:
- `--encoder-cu`: Encoder compute units (default: CPU_AND_NE)
- `--precision`: FLOAT32 or FLOAT16

### Run WER Benchmark (PyTorch)

```bash
cd conversion_scripts
uv run python ../benchmark_wer.py --num-files 100
```

### Test CoreML Inference

Non-streaming (full audio preprocessing):

```bash
uv run python ../test_coreml_inference.py --model-dir ../nemotron_coreml --num-files 10
```

True streaming (audio chunked at 1.12s):

```bash
uv run python ../test_coreml_streaming.py --model-dir ../nemotron_coreml --num-files 10
```

## Files

```
nemotron-speech-streaming-0.6b/coreml/
├── README.md                          # This file
├── BENCHMARK_RESULTS.md               # WER benchmark results
├── benchmark_wer.py                   # PyTorch streaming WER benchmark
├── nemo_streaming_reference.py        # NeMo streaming reference implementation
├── test_coreml_inference.py           # CoreML non-streaming test
├── test_coreml_streaming.py           # CoreML true streaming test
├── conversion_scripts/
│   ├── pyproject.toml                 # Python dependencies (uv)
│   ├── convert_nemotron_streaming.py  # Main conversion script
│   └── individual_components.py       # Wrapper classes for export
├── nemotron_coreml/                   # Exported CoreML models
│   ├── preprocessor.mlpackage
│   ├── encoder.mlpackage
│   ├── decoder.mlpackage
│   ├── joint.mlpackage
│   ├── metadata.json
│   └── tokenizer.json
└── datasets/
    └── LibriSpeech/test-clean/        # 2620 test files
```

## Dependencies

- Python 3.10
- PyTorch 2.x
- NeMo Toolkit 2.x
- CoreMLTools 7.x
- soundfile, numpy, typer

## Notes

- The encoder is the largest model (2.2GB) with 24 Conformer layers
- Model uses 128 mel bins (not the typical 80)
- RNNT blank token index is 1024 (vocab_size)
- Decoder uses 2-layer LSTM with 640 hidden units
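
The numbers in the Chunk Timing table and the Streaming Configuration follow from a few constants. The short check below recomputes them; the variable names are illustrative, and it assumes the 14 encoder frames correspond to the 112 new mel frames divided by the 8x subsampling factor.

```python
# Values copied from the Streaming Configuration and Chunk Timing tables.
SAMPLE_RATE_HZ = 16_000
WINDOW_STRIDE_MS = 10
CHUNK_MEL_FRAMES = 112
PRE_ENCODE_CACHE = 9
SUBSAMPLING_FACTOR = 8

# Audio samples covered by one mel frame (10 ms at 16 kHz).
hop_samples = SAMPLE_RATE_HZ * WINDOW_STRIDE_MS // 1000

# Audio samples and seconds covered by one streaming chunk.
samples_per_chunk = CHUNK_MEL_FRAMES * hop_samples
chunk_seconds = samples_per_chunk / SAMPLE_RATE_HZ

# Mel frames fed to the encoder: look-back cache plus new frames.
total_mel_frames = PRE_ENCODE_CACHE + CHUNK_MEL_FRAMES

# Encoder output frames per chunk (assumption: only the new frames count).
encoder_frames = CHUNK_MEL_FRAMES // SUBSAMPLING_FACTOR

print(hop_samples, samples_per_chunk, chunk_seconds, total_mel_frames, encoder_frames)
```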
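
The mel_cache bridging described under "What is mel_cache?" can be sketched in plain Python. This is a minimal illustration, not the code in `test_coreml_streaming.py`: `MelCacheBuffer` and its `push` method are hypothetical names, and frames are stored time-major as lists of 128 floats for simplicity, whereas the CoreML encoder expects `[1, 128, 121]`. Handling of a final partial chunk is not covered.

```python
from typing import List

MEL_FEATURES = 128
CHUNK_MEL_FRAMES = 112
PRE_ENCODE_CACHE = 9

Frame = List[float]  # one mel frame holding MEL_FEATURES values


class MelCacheBuffer:
    """Hypothetical helper: prepends the previous chunk's tail to each new chunk."""

    def __init__(self) -> None:
        # The first chunk has no history, so the cache starts as zeros.
        self.cache: List[Frame] = [
            [0.0] * MEL_FEATURES for _ in range(PRE_ENCODE_CACHE)
        ]

    def push(self, mel_chunk: List[Frame]) -> List[Frame]:
        """Return 9 cached frames followed by the 112 new frames."""
        if len(mel_chunk) != CHUNK_MEL_FRAMES:
            raise ValueError(f"expected {CHUNK_MEL_FRAMES} frames, got {len(mel_chunk)}")
        encoder_input = self.cache + mel_chunk
        # Keep a copy of the last 9 frames for the next chunk.
        self.cache = [list(frame) for frame in mel_chunk[-PRE_ENCODE_CACHE:]]
        return encoder_input


# Toy data: constant-valued frames stand in for real mel spectrogram output.
buffer = MelCacheBuffer()
chunk_1 = [[1.0] * MEL_FEATURES for _ in range(CHUNK_MEL_FRAMES)]
chunk_2 = [[2.0] * MEL_FEATURES for _ in range(CHUNK_MEL_FRAMES)]
input_1 = buffer.push(chunk_1)  # zeros + chunk 1
input_2 = buffer.push(chunk_2)  # tail of chunk 1 + chunk 2
```

Each call returns 121 frames, matching `total_mel_frames` in the streaming configuration.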
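
The greedy decoder + joint loop from the Inference Pipeline diagram can be sketched as follows. This is a sketch under stated assumptions rather than the repository's implementation: `greedy_decode_chunk`, `toy_decoder`, and `toy_joint` are hypothetical names, the decoder and joint are passed in as plain callables instead of CoreML models, the blank index is used as the start token, and the `max_symbols_per_frame` guard is a common safety limit that the README does not mention.

```python
from typing import Any, Callable, List, Sequence, Tuple

BLANK_IDX = 1024  # blank_idx from the streaming configuration

DecoderStep = Callable[[int, Any], Tuple[Any, Any]]  # (token, state) -> (decoder_out, new_state)
JointStep = Callable[[Any, Any], Sequence[float]]    # (encoder_frame, decoder_out) -> logits


def greedy_decode_chunk(
    encoder_frames: Sequence[Any],
    decoder_step: DecoderStep,
    joint_step: JointStep,
    last_token: int,
    state: Any,
    max_symbols_per_frame: int = 10,  # assumption: guard against endless emission
) -> Tuple[List[int], int, Any]:
    """Greedy RNNT decoding over one chunk of encoder frames."""
    tokens: List[int] = []
    decoder_out, next_state = decoder_step(last_token, state)
    for encoder_frame in encoder_frames:
        for _ in range(max_symbols_per_frame):
            logits = joint_step(encoder_frame, decoder_out)
            token = max(range(len(logits)), key=logits.__getitem__)  # argmax
            if token == BLANK_IDX:
                break  # blank: advance to the next encoder frame
            # Non-blank: emit the token and commit the decoder state.
            tokens.append(token)
            last_token, state = token, next_state
            decoder_out, next_state = decoder_step(last_token, state)
    return tokens, last_token, state


# Toy stand-ins so the loop can run without the CoreML models.
def toy_decoder(token: int, state: int) -> Tuple[int, int]:
    return token, state + 1  # "decoder_out" is just the token itself


def toy_joint(encoder_frame: int, decoder_out: int) -> List[float]:
    logits = [0.0] * (BLANK_IDX + 1)
    # Predict the frame's token once, then blank after it has been emitted.
    target = BLANK_IDX if encoder_frame == decoder_out else encoder_frame
    logits[target] = 1.0
    return logits


tokens, last_token, state = greedy_decode_chunk(
    [5, BLANK_IDX, 7, 7], toy_decoder, toy_joint, last_token=BLANK_IDX, state=0
)
```

Returning `last_token` and `state` lets the caller carry the decoder context into the next 1.12s chunk, so decoding continues across chunk boundaries.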