---
license: mit
language:
- en
base_model:
- ResembleAI/chatterbox
pipeline_tag: text-to-speech
tags:
- tts
- text-to-speech
- chatterbox
- flow-matching
- hifi-gan
- gguf
- crispasr
library_name: ggml
---

# Chatterbox TTS — GGUF (ggml-quantised)

GGUF / ggml conversion of [`ResembleAI/chatterbox`](https://huggingface.co/ResembleAI/chatterbox) for use with **[CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR)**.

Chatterbox is a complete TTS pipeline: character tokenizer → T3 (30-layer Llama autoregressive model, 520M parameters) → speech tokens → S3Gen (Conformer encoder + UNet1D CFM denoiser, 10 Euler steps) → HiFTGenerator vocoder (conv chains + Snake activations + iSTFT) → 24 kHz WAV. It is distributed under the **MIT license**.

Two GGUF files are needed: the **T3 model** (text → speech tokens) and the **S3Gen model** (speech tokens → audio).

## Files

| File | Quant | Size | Notes |
|---|---|---:|---|
| `chatterbox-t3-f16.gguf`     | F16  | 1.1 GB | T3 AR model — reference quality |
| `chatterbox-t3-q8_0.gguf`    | Q8_0 | 542 MB | T3 AR model — recommended |
| `chatterbox-t3-q4_k.gguf`    | Q4_K | 287 MB | T3 AR model — smallest |
| `chatterbox-s3gen-f16.gguf`  | F16  | 548 MB | S3Gen + vocoder — reference quality |
| `chatterbox-s3gen-q8_0.gguf` | Q8_0 | 342 MB | S3Gen + vocoder — recommended |
| `chatterbox-s3gen-q4_k.gguf` | Q4_K | 237 MB | S3Gen + vocoder — smallest |

Note: vocoder weights (conv_pre, resblocks, conv_post, source fusion) stay at F32 at every quantisation level to preserve audio quality. Only the Conformer encoder, UNet decoder, and T3 Llama layers are quantised.
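
For intuition about what the Q8_0 files store, here is a minimal sketch of Q8_0-style block quantisation: 32 weights share one scale, and each weight becomes a signed 8-bit integer. The function names are hypothetical, and the sketch keeps the scale as a Python float rather than the F16 value ggml stores, so treat it as an illustration rather than the converter's actual code.

```python
BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32 values


def quantize_block_q8_0(values):
    """Quantise one block of floats to (scale, list of int8-range integers)."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    scale = amax / 127.0  # largest magnitude maps to +/-127
    if scale == 0.0:
        return 0.0, [0] * BLOCK_SIZE
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block_q8_0(scale, quants):
    """Reconstruct approximate floats from a quantised block."""
    return [scale * q for q in quants]


# Toy block with values in [-0.32, 0.31] (illustrative data, not model weights).
block = [((i * 37) % 64 - 32) / 100.0 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block_q8_0(block)
restored = dequantize_block_q8_0(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

The round-trip error per weight is bounded by half the block scale. Because each block of 32 one-byte values carries one 2-byte scale, Q8_0 costs roughly 8.5 bits per weight, which is why the Q8_0 files are about half the size of their F16 counterparts.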

## Quick start

```bash
# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
cmake --build build -j --target chatterbox

# 2. Pull both model files
huggingface-cli download cstr/chatterbox-GGUF chatterbox-t3-q8_0.gguf --local-dir .
huggingface-cli download cstr/chatterbox-GGUF chatterbox-s3gen-q8_0.gguf --local-dir .

# 3. Synthesise (C API / test binary — CLI adapter in progress)
# See tests/test_voc_wav.cpp for vocoder-only usage
```
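
If you prefer to fetch the files from Python, the sketch below pairs the T3 and S3Gen filenames for a given quant level and downloads them with `huggingface_hub.hf_hub_download`. The helper names `model_filenames` and `download_pair` are hypothetical; the repo ID and filenames are taken from the table and commands above.

```python
REPO_ID = "cstr/chatterbox-GGUF"
QUANTS = ("f16", "q8_0", "q4_k")


def model_filenames(quant="q8_0"):
    """Return the (T3, S3Gen) filenames for one quant level."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant {quant!r}; expected one of {QUANTS}")
    return (f"chatterbox-t3-{quant}.gguf", f"chatterbox-s3gen-{quant}.gguf")


def download_pair(quant="q8_0", local_dir="."):
    """Download both model files and return their local paths."""
    # Imported lazily so the filename helper works without the package installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in model_filenames(quant)
    ]
```

Calling `download_pair("q4_k")` would fetch the smallest pair into the current directory. Both files are always required, since T3 alone only produces speech tokens.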

## Architecture

```
Text → Character tokenizer (704 tokens)
     → T3 Llama AR (30 layers, 1024D, 16 heads, RoPE, SwiGLU, CFG)
     → 25 Hz speech tokens (6561 codebook)
     → Conformer encoder (6 pre + 4 post upsample, 512D, 8 heads)
     → 80-channel mel spectrogram
     → UNet1D CFM denoiser (1 down + 12 mid + 1 up, 256 ch, 10 Euler steps)
     → HiFTGenerator vocoder (3× ConvTranspose1d + 9 ResBlocks + Snake + iSTFT)
     → 24 kHz mono WAV
```
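
The "10 Euler steps" in the denoiser stage refer to fixed-step integration of an ODE that carries a noise sample towards a mel spectrogram. The sketch below shows that loop with a toy velocity field. In the real model the velocity is predicted by the UNet1D conditioned on the encoder output, and the time schedule may not be uniform; `euler_integrate`, `toy_velocity`, and `TARGET` are illustrative names only.

```python
def euler_integrate(velocity, x0, n_steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the learned velocity field (assumption for illustration):
# a straight-line flow that points from the current state to a fixed target.
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


result = euler_integrate(toy_velocity, [0.0, 0.0, 0.0], n_steps=10)
```

With this straight-line field, Euler integration lands on the target up to floating-point error. A learned field is not exactly straight, which is why the step count trades speed against fidelity.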

## Quality verification

ASR round trip on the Python reference mel spectrogram (source fusion disabled, deterministic):

| Metric | Value |
|---|---|
| ASR output (moonshine-base) | **"Hello world"** (correct) |
| Per-stage cosine vs Python ref | **1.000** (conv_pre through rb_2) |
| Waveform cosine vs torch.istft | **0.93** |
| STFT range | [-0.82, 2.0] (ref [-1.1, 1.7]) |

All three precision levels (F16, Q8_0, Q4_K) produce the same ASR transcript on the reference mel.
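
The cosine figures in the table compare a ggml output against its Python reference as flat vectors. A minimal sketch of that metric follows, assuming both arrays have been dumped to equal-length lists of floats. The function name is hypothetical and this is not the project's test harness.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for zero vectors; report 0 by convention here
    return dot / (norm_a * norm_b)


# Illustrative data: a sine wave and the same wave at twice the amplitude.
reference = [math.sin(0.1 * i) for i in range(100)]
scaled = [2.0 * v for v in reference]
similarity = cosine_similarity(reference, scaled)
```

Cosine similarity ignores overall scale, so a value near 1 says the shapes match but not that the amplitudes do. That is one reason to report the STFT value range alongside it.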

## Conversion

```bash
python models/convert-chatterbox-to-gguf.py \
    --input ResembleAI/chatterbox \
    --output-dir .
```

Install the dependencies first: `pip install gguf safetensors torch huggingface_hub`.
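
After conversion, a quick sanity check is to read the fixed GGUF header and confirm the magic bytes and tensor count. The sketch below uses only the standard library and assumes GGUF version 2 or later, where both counts are little-endian 64-bit integers. `parse_gguf_header` and `read_gguf_header` are hypothetical helpers, not part of the converter.

```python
import struct

GGUF_MAGIC = b"GGUF"
# magic (4 bytes), version (uint32), tensor_count (uint64), metadata_kv_count (uint64)
HEADER = struct.Struct("<4sIQQ")


def parse_gguf_header(data):
    """Parse the first 24 bytes of a GGUF file into a dict."""
    if len(data) < HEADER.size:
        raise ValueError("file too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = HEADER.unpack_from(data)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def read_gguf_header(path):
    """Read and parse the header of the GGUF file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER.size))
```

For example, `read_gguf_header("chatterbox-t3-q8_0.gguf")` should return a non-zero `tensor_count`. For a full tensor listing, the `gguf` Python package installed above is the better tool.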

## Related models

- [`cstr/lahgtna-chatterbox-v1-GGUF`](https://huggingface.co/cstr/lahgtna-chatterbox-v1-GGUF) — Arabic T3 variant (MIT, shares S3Gen)
- [`cstr/orpheus-3b-base-GGUF`](https://huggingface.co/cstr/orpheus-3b-base-GGUF) — Llama-3.2 + SNAC TTS
- [`cstr/qwen3-tts-0.6b-customvoice-GGUF`](https://huggingface.co/cstr/qwen3-tts-0.6b-customvoice-GGUF) — Qwen3-TTS with fixed speakers

## License

MIT — same as the upstream [ResembleAI/chatterbox](https://huggingface.co/ResembleAI/chatterbox).