---
language:
- sw
pipeline_tag: text-to-speech
library_name: pytorch
tags:
- text-to-speech
- swahili
- voice-cloning
- audio
- f5-tts
base_model:
- SWivid/F5-TTS
datasets:
- google/WaxalNLP
license: other
license_name: derived-checkpoint-upstream-obligations
license_link: "https://huggingface.co/msingiai/sauti-tts/blob/main/THIRD_PARTY_NOTICES.md"
---

# Sauti TTS

Sauti TTS is a **Swahili text-to-speech checkpoint** released by **MsingiAI** and built on top of **F5-TTS v1 Base**. It is intended for research and development workflows involving Swahili speech synthesis and reference-audio-conditioned voice transfer.

This upload is a **model checkpoint package**, not a standalone Python library. It is designed to be used with the `sauti-tts` inference code and the upstream F5-TTS stack.

## Model Summary

- **Model name:** Sauti TTS
- **Developer:** MsingiAI
- **Primary language:** Swahili
- **Base model family:** F5-TTS v1 Base
- **Vocoder path:** Vocos via the F5-TTS stack
- **Task:** text-to-speech
- **Conditioning mode:** text + short reference audio

## Release Snapshot

This Hub release contains:

- `model_last.pt`
- `vocab.txt`
- `training_config.json`
- `LICENSE`
- `THIRD_PARTY_NOTICES.md`

The uploaded checkpoint was taken from the Modal checkpoint volume path `/sauti_tts_multi/model_last.pt` and corresponds to a full fine-tuning run of the multi-GPU recipe.

The checkpoint includes:

- model weights
- EMA weights
- optimizer state
- scheduler state

The checkpoint metadata reports `update = 15350`.

## Training Data

The model was trained on the **Google WaxalNLP** Swahili TTS subset (`swa_tts`), after local preparation and export into an F5-TTS-compatible format.
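Statistics like the ones reported below can be recomputed from a prepared metadata export with a few lines of Python. This is a minimal sketch, assuming the export is a CSV with `audio_path`, `text`, `speaker`, and `duration` columns — the schema and column names here are illustrative assumptions, not the project's actual format:

```python
import csv
import io

# Toy stand-in for a prepared metadata export; real data would be
# read from a file produced by the sauti-tts preparation pipeline.
metadata = io.StringIO(
    "audio_path,text,speaker,duration\n"
    "clips/0001.wav,Habari yako,spk01,3.2\n"
    "clips/0002.wav,Karibu sana,spk02,12.0\n"
    "clips/0003.wav,Asante,spk01,5.5\n"
)

rows = list(csv.DictReader(metadata))
durations = [float(r["duration"]) for r in rows]

stats = {
    "utterances": len(rows),
    "hours": sum(durations) / 3600,
    "speakers": len({r["speaker"] for r in rows}),
    "avg_s": sum(durations) / len(durations),
    "min_s": min(durations),
    "max_s": max(durations),
}
print(stats)
```

The same loop, run over train/validation/test metadata files separately, yields the per-split utterance and hour counts listed below.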
Prepared subset statistics for the run artifacts used in this release:

- **Total prepared utterances:** 1,245
- **Total prepared audio:** 4.20 hours
- **Speakers:** 7
- **Gender distribution:** 696 female / 549 male utterances
- **Average utterance duration:** 12.15 seconds
- **Min / max utterance duration:** 2.94 s / 29.95 s

Observed split sizes:

- **Train:** 976 utterances, 3.31 hours
- **Validation:** 133 utterances, 0.44 hours
- **Test:** 136 utterances, 0.45 hours

Data preparation in the `sauti-tts` project includes:

- resampling
- silence trimming
- loudness normalization
- Swahili text normalization
- F5-TTS-compatible metadata export

## Training Configuration

This checkpoint corresponds to the multi-GPU training recipe:

- `learning_rate = 2e-5`
- `batch_size_per_gpu = 2000` frames
- `num_warmup_updates = 300`
- `mixed_precision = bf16`
- `use_ema = true`
- `ema_decay = 0.9999`

The release also includes the exact `training_config.json` used for the run.

## Intended Use

This checkpoint is intended for:

- Swahili TTS research
- speech generation experiments for African language technology
- reference-audio-conditioned voice transfer experiments
- benchmarking and reproducibility work around F5-TTS fine-tuning

## Out-of-Scope Use

This checkpoint is **not** intended for:

- impersonation, fraud, or deception
- biometric identification or identity claims
- safety-critical systems
- any deployment that ignores upstream license or dataset obligations

## Limitations

- Output quality depends strongly on reference audio quality and the accuracy of the reference transcript.
- This release is focused on Swahili; quality outside Swahili has not been established.
- Very short pauses and waveform boundaries may still benefit from cleanup during inference.
- This is a full training checkpoint package, so the file is significantly larger than a stripped inference-only export.
- This release does not yet include benchmark tables or objective evaluation results in the Hub repo itself.

## How to Use

### 1. Clone the inference code

Use this checkpoint with the `sauti-tts` codebase and the upstream F5-TTS dependency.

```bash
git clone https://github.com/Msingi-AI/sauti-tts
cd sauti-tts
pip install -r requirements.txt
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS && pip install -e . && cd ..
```

### 2. Download the checkpoint and vocab

```bash
hf download msingiai/sauti-tts model_last.pt --local-dir ckpts/sauti_tts_multi
hf download msingiai/sauti-tts vocab.txt --local-dir data/waxal_swahili
```

### 3. Run inference

```bash
python scripts/inference.py \
  --checkpoint ckpts/sauti_tts_multi/model_last.pt \
  --vocab_path data/waxal_swahili/vocab.txt \
  --text "Habari, karibu kwenye Sauti TTS." \
  --ref_audio path/to/reference.wav \
  --ref_text "Habari, karibu kwenye Sauti TTS." \
  --output outputs/generated.wav
```

Note that `--ref_text` should contain the transcript of the audio passed to `--ref_audio`; transcript accuracy strongly affects output quality.

The inference script also supports:

- long-text chunking
- configurable sampling steps
- configurable guidance strength
- batch generation from a text file

## Repository Notes

This model card describes the Hub release artifact. The checkpoint was produced from the `sauti-tts` training stack, which provides:

- dataset preparation
- model wrapping for F5-TTS
- training scripts
- Modal training and inference launchers
- evaluation utilities

## Licensing

This Hub repo is deliberately marked `license: other`, because:

- the uploaded artifact is a **derived model checkpoint**
- the checkpoint is based on **F5-TTS**
- the training data comes from **WaxalNLP**

You must comply with the obligations attached to the upstream model and dataset, in addition to any licensing that applies to the surrounding repository code.
See:

- `LICENSE`
- `THIRD_PARTY_NOTICES.md`

## Citation

If you use this release, cite the project and the upstream work:

```bibtex
@misc{sauti_tts_2026,
  title={Sauti TTS: Swahili Text-to-Speech via F5-TTS Fine-tuning on WaxalNLP},
  author={MsingiAI},
  year={2026}
}

@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and others},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}

@misc{waxal2026,
  title={WAXAL: A Large-Scale Multilingual African Speech Corpus},
  author={Diack, Abdoulaye and others},
  year={2026},
  url={https://huggingface.co/datasets/google/WaxalNLP}
}
```