---
license: cc-by-4.0
language:
- en
tags:
- speech
- asr
- coreml
- parakeet
- transducer
base_model: nvidia/parakeet-tdt-0.6b-v2
---

# Parakeet TDT v3 — CoreML INT8

CoreML conversion of [NVIDIA Parakeet-TDT 0.6B v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) with INT8-quantized encoder for Apple Neural Engine acceleration.

## Models

| Model | Description | Compute | Quantization |
|-------|-------------|---------|-------------|
| `encoder.mlmodelc` | FastConformer encoder (24L, 1024 hidden) | CPU + Neural Engine | INT8 palettized |
| `decoder.mlmodelc` | LSTM prediction network (2L, 640 hidden) | CPU + Neural Engine | FP16 |
| `joint.mlmodelc` | TDT dual-head joint (token + duration logits) | CPU + Neural Engine | FP16 |

## Additional Files

| File | Description |
|------|-------------|
| `vocab.json` | SentencePiece vocabulary (1024 tokens) |
| `config.json` | Model configuration |

## Notes

- **INT8 vs INT4**: INT8 uses 8-bit palettization for the encoder, offering higher accuracy than INT4 at the cost of ~2x encoder weight size.
- **Mel preprocessing** is done in Swift using Accelerate/vDSP (not CoreML) because `torch.stft` tracing bakes audio length as a constant, breaking per-feature normalization for variable-length inputs.
- **Encoder** uses `EnumeratedShapes` (100–3000 mel frames, covering 1–30s audio) to avoid BNNS crashes with dynamic shapes.

## Usage

Used by [speech-swift](https://github.com/soniqo/speech-swift) `ParakeetASR` module:

```swift
let model = try await ParakeetASRModel.fromPretrained(modelId: ParakeetASRModel.int8ModelId)
let text = try model.transcribeAudio(samples, sampleRate: 16000)
```

---

- **Guide**: [soniqo.audio/guides/parakeet](https://soniqo.audio/guides/parakeet)
- **Docs**: [soniqo.audio](https://soniqo.audio)
- **GitHub**: [soniqo/speech-swift](https://github.com/soniqo/speech-swift)
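
The note on mel preprocessing says that per-feature normalization depends on the real audio length. The following is a minimal sketch of that step in plain Swift, to show why the frame count cannot be a baked-in constant. It is not the speech-swift implementation (which uses Accelerate/vDSP): the function name `normalizePerFeature`, the `[bin][frame]` layout, the unbiased (N-1) variance, the `1e-5` epsilon, and the zeroing of padded frames are all illustrative assumptions.

```swift
import Foundation

/// Normalizes each mel bin to zero mean and unit variance, using only the
/// first `validFrames` frames of every row. Hypothetical helper for illustration.
/// `mel` is laid out as [bin][frame]; frames at index >= validFrames are padding.
func normalizePerFeature(_ mel: [[Float]],
                         validFrames: Int,
                         epsilon: Float = 1e-5) -> [[Float]] {
    precondition(validFrames > 1, "need at least two frames to estimate a variance")
    return mel.map { row in
        precondition(validFrames <= row.count, "validFrames exceeds row length")
        let valid = row[0..<validFrames]
        // Statistics depend on the actual number of frames, not a fixed constant.
        let mean = valid.reduce(0, +) / Float(validFrames)
        let sumSq = valid.reduce(Float(0)) { $0 + ($1 - mean) * ($1 - mean) }
        let std = (sumSq / Float(validFrames - 1)).squareRoot() + epsilon
        // Normalize real frames; keep padded frames at zero (an assumption).
        return row.enumerated().map { index, value in
            index < validFrames ? (value - mean) / std : 0
        }
    }
}

// Two mel bins, three real frames, one padded frame.
let mel: [[Float]] = [[1, 2, 3, 0], [2, 4, 6, 0]]
let normalized = normalizePerFeature(mel, validFrames: 3)
```

If `validFrames` were fixed at trace time, a shorter clip would have its mean and variance computed over padding as well, which is the failure mode the note describes.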