---
license: cc-by-4.0
language:
- en
tags:
- speech
- asr
- coreml
- parakeet
- transducer
base_model: nvidia/parakeet-tdt-0.6b-v2
---

# Parakeet TDT v3 — CoreML INT8

CoreML conversion of [NVIDIA Parakeet-TDT 0.6B v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) with an INT8-quantized encoder for Apple Neural Engine acceleration.

## Models

| Model | Description | Compute | Quantization |
|-------|-------------|---------|-------------|
| `encoder.mlmodelc` | FastConformer encoder (24L, 1024 hidden) | CPU + Neural Engine | INT8 palettized |
| `decoder.mlmodelc` | LSTM prediction network (2L, 640 hidden) | CPU + Neural Engine | FP16 |
| `joint.mlmodelc` | TDT dual-head joint (token + duration logits) | CPU + Neural Engine | FP16 |
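
To illustrate what the dual-head joint output means for decoding, here is a minimal sketch of one greedy TDT step. It is not the speech-swift implementation. The names `TDTStep`, `argmax`, and `greedyTDTStep` are hypothetical, and the layout (token logits first with blank as the last token index, followed by duration logits) is an assumption about how the joint output is arranged.

```swift
/// Result of one greedy step (hypothetical type for illustration).
struct TDTStep: Equatable {
    let token: Int     // index into the token head
    let isBlank: Bool  // true when the blank token won
    let skip: Int      // number of encoder frames to advance
}

/// Offset (0-based within the slice) of the largest value.
func argmax(_ values: ArraySlice<Float>) -> Int {
    var bestOffset = 0
    var bestValue = -Float.infinity
    for (offset, value) in values.enumerated() where value > bestValue {
        bestValue = value
        bestOffset = offset
    }
    return bestOffset
}

/// Splits one joint output vector into token and duration heads and
/// picks the most likely token and frame skip.
/// Assumption: `logits` = [token logits (blank last)] + [duration logits].
func greedyTDTStep(logits: [Float], tokenCount: Int, durations: [Int]) -> TDTStep {
    precondition(logits.count == tokenCount + durations.count,
                 "unexpected joint output size")
    let token = argmax(logits[0..<tokenCount])
    let durationIndex = argmax(logits[tokenCount...])
    let isBlank = token == tokenCount - 1
    var skip = durations[durationIndex]
    // A blank with zero duration would never advance, so force one frame.
    if isBlank && skip == 0 { skip = 1 }
    return TDTStep(token: token, isBlank: isBlank, skip: skip)
}
```

With the 1024-token vocabulary listed below, `tokenCount` would be 1025 under this layout (vocabulary plus blank). The duration set is part of the model configuration, so it is passed in rather than hard-coded.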

## Additional Files

| File | Description |
|------|-------------|
| `vocab.json` | SentencePiece vocabulary (1024 tokens) |
| `config.json` | Model configuration |

## Notes

- **INT8 vs INT4**: The encoder uses 8-bit palettization, which preserves more accuracy than the INT4 variant at the cost of roughly 2x the encoder weight size.
- **Mel preprocessing** is done in Swift using Accelerate/vDSP (not CoreML) because tracing `torch.stft` bakes the audio length in as a constant, which breaks per-feature normalization for variable-length inputs.
- **Encoder** uses `EnumeratedShapes` (100–3000 mel frames, covering 1–30 s of audio) to avoid BNNS crashes with dynamic shapes.
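
The last two notes interact: normalization statistics depend on the real number of frames, while the encoder only accepts a fixed set of lengths. The sketch below shows one way to handle this, by normalizing over the valid frames first and then zero-padding to the smallest enumerated length that fits. It uses plain Swift arrays rather than Accelerate/vDSP for readability. The function names, the unbiased standard deviation, the `epsilon` value, and any list of enumerated lengths passed in are assumptions for illustration and are not taken from the converted model.

```swift
/// Smallest enumerated length that fits `frames`, or nil if the input is too long.
func smallestEnumeratedLength(fitting frames: Int, in lengths: [Int]) -> Int? {
    lengths.sorted().first { $0 >= frames }
}

/// Per-feature normalization over the valid frames only.
/// `mel` is laid out as [featureCount][frameCount].
func normalizePerFeature(_ mel: [[Float]], epsilon: Float = 1e-5) -> [[Float]] {
    mel.map { (row: [Float]) -> [Float] in
        guard !row.isEmpty else { return row }
        let count = Float(row.count)
        let sum: Float = row.reduce(0, +)
        let mean = sum / count
        var squared: Float = 0
        for value in row {
            let delta = value - mean
            squared += delta * delta
        }
        // Unbiased estimate (n - 1), guarded for single-frame rows.
        let std = (squared / max(count - 1, 1)).squareRoot() + epsilon
        return row.map { ($0 - mean) / std }
    }
}

/// Zero-pads every feature row on the right up to `length` frames.
func pad(_ mel: [[Float]], to length: Int) -> [[Float]] {
    mel.map { row in
        row + [Float](repeating: 0, count: max(0, length - row.count))
    }
}
```

Computing the mean and standard deviation before padding keeps the statistics tied to the actual audio length, which a traced graph with a baked-in length cannot do.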

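The size trade-off in the INT8 vs INT4 note follows from how palettization stores weights: each weight becomes an index into a lookup table with 2^bits entries. The sketch below estimates storage and shows the lookup. It assumes a single FP16 table per tensor, which is a simplification, and `palettizedBytes` and `dequantize` are hypothetical helper names.

```swift
/// Approximate storage for palettized weights: one `bits`-wide index per
/// weight plus a lookup table of 2^bits FP16 entries (2 bytes each).
func palettizedBytes(weightCount: Int, bits: Int) -> Int {
    let indexBytes = (weightCount * bits + 7) / 8  // round up to whole bytes
    let tableBytes = (1 << bits) * 2
    return indexBytes + tableBytes
}

/// Reconstructs weights by looking each index up in the table.
func dequantize(indices: [UInt8], table: [Float]) -> [Float] {
    indices.map { table[Int($0)] }
}
```

For large tensors the index storage dominates, so 8-bit indices take about twice the space of 4-bit indices, while the 256-entry table can represent the weight distribution more finely than a 16-entry one.
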
## Usage

Used by the `ParakeetASR` module in [speech-swift](https://github.com/soniqo/speech-swift):

```swift
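// Load the INT8 CoreML variant by its model identifier.
let model = try await ParakeetASRModel.fromPretrained(modelId: ParakeetASRModel.int8ModelId)
// Transcribe the audio in `samples`, passing its sample rate (16 kHz here).
let text = try model.transcribeAudio(samples, sampleRate: 16000)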
```

---

- **Guide**: [soniqo.audio/guides/parakeet](https://soniqo.audio/guides/parakeet)
- **Docs**: [soniqo.audio](https://soniqo.audio)
- **GitHub**: [soniqo/speech-swift](https://github.com/soniqo/speech-swift)