---
language:
- sw
pipeline_tag: text-to-speech
library_name: pytorch
tags:
- text-to-speech
- swahili
- voice-cloning
- audio
- f5-tts
base_model:
- SWivid/F5-TTS
datasets:
- google/WaxalNLP
license: other
license_name: derived-checkpoint-upstream-obligations
license_link: "https://huggingface.co/msingiai/sauti-tts/blob/main/THIRD_PARTY_NOTICES.md"
---

# Sauti TTS

Sauti TTS is a **Swahili text-to-speech checkpoint** released by **MsingiAI** and built on top of **F5-TTS v1 Base**. It is intended for research and development workflows involving Swahili speech synthesis and reference-audio-conditioned voice transfer.

This upload is a **model checkpoint package**, not a standalone Python library. It is designed to be used with the `sauti-tts` inference code and the upstream F5-TTS stack.

## Model Summary

- **Model name:** Sauti TTS
- **Developer:** MsingiAI
- **Primary language:** Swahili
- **Base model family:** F5-TTS v1 Base
- **Vocoder path:** Vocos via the F5-TTS stack
- **Task:** text-to-speech
- **Conditioning mode:** text + short reference audio

## Release Snapshot

This Hub release contains:

- `model_last.pt`
- `vocab.txt`
- `training_config.json`
- `LICENSE`
- `THIRD_PARTY_NOTICES.md`

The uploaded checkpoint was taken from the Modal checkpoint volume path `/sauti_tts_multi/model_last.pt` and corresponds to a full fine-tuning run of the multi-GPU recipe.

The checkpoint includes:

- model weights
- EMA weights
- optimizer state
- scheduler state

The checkpoint metadata reports `update = 15350`.

## Training Data

The model was trained on the **Google WaxalNLP** Swahili TTS subset (`swa_tts`), after local preparation and export into an F5-TTS-compatible format.
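Statistics like the ones reported below can be recomputed from a prepared metadata export with a few lines of Python. This is a minimal sketch, assuming the export is a CSV with `audio_path`, `text`, `speaker`, and `duration` columns — the schema and column names here are illustrative assumptions, not the project's actual format:

```python
import csv
import io

# Toy stand-in for a prepared metadata export; real data would be
# read from a file produced by the sauti-tts preparation pipeline.
metadata = io.StringIO(
    "audio_path,text,speaker,duration\n"
    "clips/0001.wav,Habari yako,spk01,3.2\n"
    "clips/0002.wav,Karibu sana,spk02,12.0\n"
    "clips/0003.wav,Asante,spk01,5.5\n"
)

rows = list(csv.DictReader(metadata))
durations = [float(r["duration"]) for r in rows]

stats = {
    "utterances": len(rows),
    "hours": sum(durations) / 3600,
    "speakers": len({r["speaker"] for r in rows}),
    "avg_s": sum(durations) / len(durations),
    "min_s": min(durations),
    "max_s": max(durations),
}
print(stats)
```

The same loop, run over train/validation/test metadata files separately, yields the per-split utterance and hour counts listed below.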
Prepared subset statistics for the run artifacts used in this release:

- **Total prepared utterances:** 1,245
- **Total prepared audio:** 4.20 hours
- **Speakers:** 7
- **Gender distribution:** 696 female / 549 male utterances
- **Average utterance duration:** 12.15 seconds
- **Min / max utterance duration:** 2.94 s / 29.95 s

Observed split sizes:

- **Train:** 976 utterances, 3.31 hours
- **Validation:** 133 utterances, 0.44 hours
- **Test:** 136 utterances, 0.45 hours

Data preparation in the `sauti-tts` project includes:

- resampling
- silence trimming
- loudness normalization
- Swahili text normalization
- F5-TTS-compatible metadata export

## Training Configuration

This checkpoint corresponds to the multi-GPU training recipe:

- `learning_rate = 2e-5`
- `batch_size_per_gpu = 2000` frames
- `num_warmup_updates = 300`
- `mixed_precision = bf16`
- `use_ema = true`
- `ema_decay = 0.9999`

The release also includes the exact `training_config.json` used for the run.

## Intended Use

This checkpoint is intended for:

- Swahili TTS research
- speech generation experiments for African language technology
- reference-audio-conditioned voice transfer experiments
- benchmarking and reproducibility work around F5-TTS fine-tuning

## Out-of-Scope Use

This checkpoint is **not** intended for:

- impersonation, fraud, or deception
- biometric identification or identity claims
- safety-critical systems
- any deployment that ignores upstream license or dataset obligations

## Limitations

- Output quality depends strongly on reference audio quality and the accuracy of the reference transcript.
- This release is focused on Swahili; quality outside Swahili has not been established.
- Very short pauses and waveform boundaries may still benefit from cleanup during inference.
- This is a full training checkpoint package, so the file is significantly larger than a stripped inference-only export.
- This release does not yet include benchmark tables or objective evaluation results in the Hub repo itself.

## How to Use

### 1. Clone the inference code

Use this checkpoint with the `sauti-tts` codebase and the upstream F5-TTS dependency.

```bash
git clone https://github.com/Msingi-AI/sauti-tts
cd sauti-tts
pip install -r requirements.txt
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS && pip install -e . && cd ..
```

### 2. Download the checkpoint and vocab

```bash
hf download msingiai/sauti-tts model_last.pt --local-dir ckpts/sauti_tts_multi
hf download msingiai/sauti-tts vocab.txt --local-dir data/waxal_swahili
```

### 3. Run inference

```bash
python scripts/inference.py \
  --checkpoint ckpts/sauti_tts_multi/model_last.pt \
  --vocab_path data/waxal_swahili/vocab.txt \
  --text "Habari, karibu kwenye Sauti TTS." \
  --ref_audio path/to/reference.wav \
  --ref_text "Habari, karibu kwenye Sauti TTS." \
  --output outputs/generated.wav
```

Note that `--ref_text` should contain the transcript of the audio passed to `--ref_audio`; transcript accuracy strongly affects output quality.

The inference script also supports:

- long-text chunking
- configurable sampling steps
- configurable guidance strength
- batch generation from a text file

## Repository Notes

This model card describes the Hub release artifact. The checkpoint was produced from the `sauti-tts` training stack, which provides:

- dataset preparation
- model wrapping for F5-TTS
- training scripts
- Modal training and inference launchers
- evaluation utilities

## Licensing

This Hub repo is deliberately marked `license: other`, because:

- the uploaded artifact is a **derived model checkpoint**
- the checkpoint is based on **F5-TTS**
- the training data comes from **WaxalNLP**

You must comply with the obligations attached to the upstream model and dataset, in addition to any licensing that applies to the surrounding repository code.
See:

- `LICENSE`
- `THIRD_PARTY_NOTICES.md`

## Citation

If you use this release, cite the project and the upstream work:

```bibtex
@misc{sauti_tts_2026,
  title={Sauti TTS: Swahili Text-to-Speech via F5-TTS Fine-tuning on WaxalNLP},
  author={MsingiAI},
  year={2026}
}

@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and others},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}

@misc{waxal2026,
  title={WAXAL: A Large-Scale Multilingual African Speech Corpus},
  author={Diack, Abdoulaye and others},
  year={2026},
  url={https://huggingface.co/datasets/google/WaxalNLP}
}
```