Qwen3-TTS Voice Clone – ExecuTorch (Android-ready)

On-device text-to-speech with voice cloning, converted from Qwen3-TTS-1.7B-Base to ExecuTorch .pte format for mobile/edge deployment.

A 1.9B-parameter end-to-end TTS model: clone any voice from a short audio sample and synthesize speech entirely on-device. No cloud, no internet required.

Models

INT8 Quantized (⭐ Recommended for on-device)

| Module | Size | Description |
|---|---|---|
| speaker_encoder_int8.pte | 46 MB | Extract speaker identity from reference audio |
| talker_int8.pte | 1.4 GB | Main autoregressive LM (generates audio codec tokens) |
| code_predictor_int8.pte | 78 MB | Multi-codebook prediction (15 additional codebooks) |
| vocoder_int8.pte | 301 MB | Neural vocoder (codec tokens → PCM waveform) |
| **Total** | **1.8 GB** | Fits on 8 GB+ phones |

FP32 Unquantized

| Module | Size |
|---|---|
| speaker_encoder.pte | 46 MB |
| talker_prefill.pte | 5.3 GB |
| code_predictor.pte | 309 MB |
| vocoder.pte | 436 MB |

Auxiliary Files

| File | Description |
|---|---|
| talker_embeddings.pt | Text + codec embedding tables (loaded in the Python orchestrator) |
| code_predictor_extras.pt | Code predictor embedding + projection weights |

Architecture

Qwen3-TTS Voice Clone Pipeline (1.9B params total)

Input: text + reference audio (3-5s voice sample)
                    │
        ┌───────────┴───────────┐
        ▼                       ▼
  Speaker Encoder (12M)    Speech Tokenizer
  TDNN → AttPool → FC      (encode ref audio
  ref_audio → x_vector      to codec codes)
  [1, 2048]                [T, 16]
        │                       │
        └───────┬───────────────┘
                ▼
         Talker LM (1.7B)
         Qwen3, 28 layers, GQA 16/8
         dim=2048, audio vocab=3072
         Autoregressive codec generation
                │
                ▼
         Code Predictor (175M)
         Predict 15 additional codebooks
         per token (residual VQ)
                │
                ▼
         Vocoder (154M)
         Codec tokens → 24 kHz PCM audio
                │
                ▼
         Output: speech waveform (.wav)

Component Details

| Component | Params | Architecture | Input → Output |
|---|---|---|---|
| Speaker Encoder | 12M | TDNN + Attentive Stats Pooling | mel spectrogram → x_vector [1, 2048] |
| Talker (Main LM) | 1,727M | Qwen3, 28 layers, GQA 16/8 heads, dim 2048 | text + speaker emb → codec tokens (vocab 3072) |
| Text Projection | 8M | MLP | text hidden → audio hidden dim |
| Codec Head | 6M | Linear | hidden states → first-codebook logits |
| Code Predictor | 175M | Small LM + 15 heads | main LM output → codebooks 2-16 |
| Vocoder | 154M | Qwen3TTSTokenizerV2Model | [16, T] codes → 24 kHz waveform |

How It Works

Voice Clone Pipeline

1. ref_audio (24 kHz, 3-5 s) → mel_spectrogram → speaker_encoder.pte → x_vector [1, 2048]
2. ref_audio → speech_tokenizer.encode() → ref_codes [T, 16]  (runs on CPU, not exported)
3. text → Qwen2 tokenizer → input_ids
4. Embed: text_embedding(input_ids) + codec_embedding(ref_codes) + x_vector → inputs_embeds
5. talker.pte(inputs_embeds, kv_cache, ...) → codec logits  (autoregressive loop; sketched below)
6. code_predictor.pte(hidden_states) → codebooks 2-16  (per step)
7. All codec codes → vocoder.pte → 24 kHz PCM waveform
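
The generation loop in steps 5-6 can be sketched as below. This is a minimal illustration, not the repo's API: the callables, the greedy sampling, and the eos_id value are assumptions (see scripts/test_e2e.py for the actual orchestration, including KV-cache handling).

import torch

def generate_codec_frames(talker_step, predict_residual, embed_token,
                          prompt_embeds, max_frames=256, eos_id=3071):
    """Greedy decode: one first-codebook token per frame, then 15 residual codebooks."""
    frames = []
    step_input = prompt_embeds                       # [1, T, 2048] from step 4
    for _ in range(max_frames):
        logits, hidden = talker_step(step_input)     # talker.pte forward (the real export also threads a KV cache)
        token = int(logits[0, -1].argmax())          # greedy pick from the 3072-way codec vocab
        if token == eos_id:                          # assumed end-of-audio id
            break
        residual = predict_residual(hidden[:, -1:])  # code_predictor.pte -> codebooks 2-16, shape [1, 15]
        frames.append([token] + residual[0].tolist())
        step_input = embed_token(token)              # [1, 1, 2048] input for the next step
    return torch.tensor(frames, dtype=torch.long).T  # [16, T] layout for the vocoder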

Token Format

The talker uses an interleaved text+audio token sequence:

[BOS] [text tokens...] [speaker x-vector] [ref audio codes...] [SEP] [generated audio codes...]
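
As a rough sketch of how this sequence becomes the inputs_embeds of step 4, assuming talker_embeddings.pt exposes plain embedding tables (the key names, the summing of residual-codebook embeddings, and the omission of special tokens are all illustrative assumptions):

import torch

def build_prompt_embeds(text_ids, ref_codes, x_vector, emb):
    """text_ids: [T_text] int64; ref_codes: [T_ref, 16] int64; x_vector: [1, 2048]."""
    text = emb["text_embedding"][text_ids]              # [T_text, 2048]; key name assumed
    ref = emb["codec_embedding"][ref_codes].sum(dim=1)  # [T_ref, 16, 2048] -> [T_ref, 2048]
    seq = torch.cat([text, x_vector, ref], dim=0)       # BOS/SEP markers omitted for brevity
    return seq.unsqueeze(0)                             # [1, T, 2048] for the talker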

Key Parameters

| Parameter | Value |
|---|---|
| Audio sample rate | 24,000 Hz |
| Codec frame rate | 12.5 Hz (80 ms per frame) |
| Codebooks | 16 (1 from talker + 15 from code predictor) |
| Audio vocab size | 3,072 |
| Text vocab size | 151,936 (Qwen2 tokenizer) |
| Max sequence length | 2,048 tokens |
| Speaker embedding dim | 2,048 |
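
A quick sanity check on the frame math in the table: 24,000 Hz at 12.5 codec frames per second means each frame covers 80 ms, i.e. 1,920 PCM samples.

import math

SAMPLE_RATE = 24_000   # Hz
FRAME_RATE = 12.5      # codec frames per second

samples_per_frame = int(SAMPLE_RATE / FRAME_RATE)  # 1920 PCM samples per codec frame
ms_per_frame = 1000 / FRAME_RATE                   # 80.0 ms per frame

def frames_for(seconds: float) -> int:
    """Codec frames covering a duration, e.g. frames_for(5.0) == 63."""
    return math.ceil(seconds * FRAME_RATE)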

Quick Start – Python

from huggingface_hub import hf_hub_download
from executorch.runtime import Runtime
import torch
import numpy as np

REPO = "acul3/Qwen3-TTS-1.7B-Base-ExecuTorch"

# Download INT8 models
spk_path = hf_hub_download(REPO, "speaker_encoder_int8.pte")
talker_path = hf_hub_download(REPO, "talker_int8.pte")
cp_path = hf_hub_download(REPO, "code_predictor_int8.pte")
voc_path = hf_hub_download(REPO, "vocoder_int8.pte")
emb_path = hf_hub_download(REPO, "talker_embeddings.pt")
cp_extras_path = hf_hub_download(REPO, "code_predictor_extras.pt")

# Load ExecuTorch runtime
runtime = Runtime.get()
speaker_enc = runtime.load_program(spk_path).load_method("forward")
vocoder = runtime.load_program(voc_path).load_method("forward")

# Load embeddings (used in Python orchestration)
embeddings = torch.load(emb_path, weights_only=True)
cp_extras = torch.load(cp_extras_path, weights_only=True)

# For full generation, see scripts/test_e2e.py
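
An ExecuTorch method executes with a flat list of input tensors and returns a list of outputs. A hedged usage sketch for the speaker encoder follows; the 469-frame input size comes from Export Details below, while the 128 mel bins and the [batch, frames, bins] layout are assumptions:

# Placeholder mel spectrogram of the reference audio (real code would compute
# it from a 3-5 s clip); 469 frames matches the fixed-size export.
mel = torch.zeros(1, 469, 128)
outputs = speaker_enc.execute([mel])  # ExecuTorch methods take a list of inputs
x_vector = outputs[0]                 # expected shape [1, 2048]
print(x_vector.shape)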

Quick Start – Android (Kotlin)

import org.pytorch.executorch.Module

// Load all 4 modules
val speakerEnc = Module.load("speaker_encoder_int8.pte")
val talker = Module.load("talker_int8.pte")
val codePred = Module.load("code_predictor_int8.pte")
val vocoder = Module.load("vocoder_int8.pte")

// Pipeline:
// 1. Process ref audio → mel → speakerEnc.forward() → x_vector
// 2. Build input embeddings (text + speaker + ref codes)
// 3. Autoregressive loop: talker.forward() → codec token → codePred.forward() → all codebooks
// 4. vocoder.forward(all_codes) → PCM audio
// See scripts/ for full implementation details

Validation Results

| Component | Method | Cosine Similarity |
|---|---|---|
| Speaker Encoder | .pte vs PyTorch | 0.965* |
| Talker | Wrapper vs Original | 1.000 ✅ |
| Vocoder | .pte vs PyTorch | 1.000 ✅ |
| Code Predictor | .pte | validated ✅ |

*The speaker encoder scores 0.965 because of mel padding for the fixed-size export; with matching input sizes it scores 1.000.

INT8 quantization produces valid, intelligible speech, tested with full-pipeline generation.
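
For reference, the metric above is plain cosine similarity between flattened module outputs; a minimal sketch (not necessarily the exact code in scripts/test_e2e.py):

import torch
import torch.nn.functional as F

def cosine_sim(pte_out: torch.Tensor, ref_out: torch.Tensor) -> float:
    """Cosine similarity between a .pte output and its PyTorch reference."""
    return F.cosine_similarity(pte_out.flatten(), ref_out.flatten(), dim=0).item()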

Export Details

| Property | Value |
|---|---|
| ExecuTorch | 1.1.0 |
| Backend | XNNPACK (CPU, cross-platform) |
| Quantization | torchao INT8 weight-only (per-channel, no calibration pass needed) |
| Source model | Qwen/Qwen3-TTS-1.7B-Base |
| Max sequence length | 2,048 |
| Speaker encoder input | Fixed 469 mel frames (~3.8 s at 24 kHz) |
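
A minimal sketch of that quantization recipe, assuming a recent torchao release where Int8WeightOnlyConfig is the weight-only config object (scripts/quantize_all.py has the exact invocation used for these exports):

import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig

def quantize_int8_weight_only(model: torch.nn.Module) -> torch.nn.Module:
    """In-place per-channel INT8 weight-only quantization; no calibration data needed."""
    quantize_(model, Int8WeightOnlyConfig())
    return model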

Export Challenges Solved

  1. Conv1d padding="same": replaced with explicit F.pad() + Conv1d(padding=0), since ExecuTorch does not support padding="same" (see the sketch after this list)
  2. DynamicCache: replaced with static KV-cache tensors passed as model inputs/outputs
  3. MRoPE (Multimodal RoPE): simplified so that all three dimensions share identical position_ids for TTS
  4. Stride-0 tensors: used .repeat() instead of .expand() for ExecuTorch compatibility
  5. Vocoder dynamic chunking: bypassed chunked_decode with a fixed code length
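
For item 1, a minimal sketch of the replacement pattern; ConvSame is an illustrative name, not a class from this repo:

import torch
import torch.nn.functional as F
from torch import nn

class ConvSame(nn.Module):
    """Conv1d that emulates padding="same" (stride 1) with an explicit, exportable F.pad."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int):
        super().__init__()
        self.pad = ((kernel - 1) // 2, kernel // 2)   # (left, right), matches "same" for any kernel size
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, padding=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(F.pad(x, self.pad))          # x: [batch, channels, time]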

Scripts

| Script | Description |
|---|---|
| scripts/analyze_model.py | Deep architecture analysis + shape tracing |
| scripts/export_speaker_encoder.py | Speaker encoder surgery + .pte export |
| scripts/export_talker.py | Main talker LM surgery + .pte export |
| scripts/export_code_predictor.py | Code predictor surgery + .pte export |
| scripts/export_vocoder.py | Vocoder surgery + .pte export |
| scripts/quantize_all.py | INT8 weight-only quantization of all modules |
| scripts/test_e2e.py | End-to-end validation |

Hardware Requirements

On-device inference (INT8)

  • RAM: 8 GB minimum (models use ~1.8 GB + KV cache + audio buffers; see the sizing sketch below)
  • Storage: 1.8 GB for all 4 model files + extras
  • CPU: ARM64 (Android) or x86_64
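
The KV-cache term can be sized from the talker dimensions in the component table; a back-of-envelope sketch, assuming an fp16 cache at the full 2,048-token context:

# 28 layers, GQA with 8 KV heads, head_dim = 2048 / 16 = 128, max seq 2,048;
# the fp16 (2-byte) cache element size is an assumption.
layers, kv_heads, head_dim, max_seq, bytes_per_el = 28, 8, 2048 // 16, 2048, 2
kv_bytes = 2 * layers * kv_heads * head_dim * max_seq * bytes_per_el  # keys + values
print(f"~{kv_bytes / 2**20:.0f} MiB")  # ~224 MiB on top of the ~1.8 GB of weights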

Export/development

  • 64 GB+ unified memory recommended (Jetson AGX Orin or similar)
  • Python 3.10+, PyTorch 2.10+, ExecuTorch 1.1.0, torchao

Reproduce

# Setup
git clone https://huggingface.co/acul3/Qwen3-TTS-1.7B-Base-ExecuTorch
cd Qwen3-TTS-1.7B-Base-ExecuTorch
pip install executorch torchao qwen-tts

# Export all modules
python scripts/export_speaker_encoder.py
python scripts/export_talker.py
python scripts/export_code_predictor.py
python scripts/export_vocoder.py

# Quantize
python scripts/quantize_all.py

# Validate
python scripts/test_e2e.py

License

Apache 2.0 (same as source model)

Citation

@misc{qwen3tts_executorch_2026,
  title = {Qwen3-TTS-1.7B-Base-ExecuTorch: On-Device Voice Clone TTS},
  author = {Samsul Rahmadani},
  year = {2026},
  url = {https://huggingface.co/acul3/Qwen3-TTS-1.7B-Base-ExecuTorch},
  note = {Converted from Qwen/Qwen3-TTS-1.7B-Base}
}
