# Qwen3-TTS Voice Clone for ExecuTorch (Android-ready)
On-device text-to-speech with voice cloning, converted from Qwen3-TTS-1.7B-Base to ExecuTorch .pte format for mobile/edge deployment.
A 1.9B-parameter end-to-end TTS: clone any voice from a short audio sample and synthesize speech entirely on-device. No cloud, no internet needed.
## Models

### INT8 Quantized (recommended for on-device)
| Module | Size | Description |
|---|---|---|
| `speaker_encoder_int8.pte` | 46 MB | Extract speaker identity from reference audio |
| `talker_int8.pte` | 1.4 GB | Main autoregressive LM (generates audio codec tokens) |
| `code_predictor_int8.pte` | 78 MB | Multi-codebook prediction (15 additional codebooks) |
| `vocoder_int8.pte` | 301 MB | Neural vocoder (codec tokens → PCM waveform) |
| **Total** | **1.8 GB** | Fits on 8 GB+ phones |
### FP32 Unquantized
| Module | Size |
|---|---|
| `speaker_encoder.pte` | 46 MB |
| `talker_prefill.pte` | 5.3 GB |
| `code_predictor.pte` | 309 MB |
| `vocoder.pte` | 436 MB |
### Auxiliary Files
| File | Description |
|---|---|
| `talker_embeddings.pt` | Text + codec embedding tables (loaded in the Python orchestrator) |
| `code_predictor_extras.pt` | Code predictor embedding + projection weights |
## Architecture

```text
Qwen3-TTS Voice Clone Pipeline (1.9B params total)

Input: text + reference audio (3-5 s voice sample)
                    │
        ┌───────────┴────────────┐
        ▼                        ▼
Speaker Encoder (12M)      Speech Tokenizer
TDNN → AttPool → FC        (encode ref audio
ref_audio → x_vector        to codec codes)
[1, 2048]                  [T, 16]
        │                        │
        └───────────┬────────────┘
                    ▼
          Talker LM (1.7B)
          Qwen3, 28 layers, GQA 16/8
          dim=2048, audio vocab=3072
          Autoregressive codec generation
                    │
                    ▼
          Code Predictor (175M)
          Predict 15 additional codebooks
          per token (residual VQ)
                    │
                    ▼
          Vocoder (154M)
          Codec tokens → 24 kHz PCM audio
                    │
                    ▼
Output: speech waveform (.wav)
```
### Component Details
| Component | Params | Architecture | Input → Output |
|---|---|---|---|
| Speaker Encoder | 12M | TDNN + Attentive Stats Pooling | mel spectrogram → x_vector [1, 2048] |
| Talker (main LM) | 1,727M | Qwen3, 28 layers, GQA 16/8 heads, dim 2048 | text + speaker emb → codec tokens (vocab 3072) |
| Text Projection | 8M | MLP | text hidden → audio hidden dim |
| Codec Head | 6M | Linear | hidden states → first-codebook logits |
| Code Predictor | 175M | Small LM + 15 heads | main LM output → codebooks 2-16 |
| Vocoder | 154M | Qwen3TTSTokenizerV2Model | [16, T] codes → 24 kHz waveform |
## How It Works

### Voice Clone Pipeline
1. `ref_audio` (24 kHz, 3-5 s) → mel spectrogram → `speaker_encoder.pte` → `x_vector` [1, 2048]
2. `ref_audio` → `speech_tokenizer.encode()` → `ref_codes` [T, 16] (runs on CPU, not exported)
3. text → Qwen2 tokenizer → `input_ids`
4. Embed: `text_embedding(input_ids)` + `codec_embedding(ref_codes)` + `x_vector` → `inputs_embeds`
5. `talker.pte(inputs_embeds, kv_cache, ...)` → codec logits (autoregressive loop; see the sketch below)
6. `code_predictor.pte(hidden_states)` → codebooks 2-16 (per step)
7. All codec codes → `vocoder.pte` → 24 kHz PCM waveform
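Steps 5-6 form the core decode loop. A minimal sketch of how an orchestrator could drive it, with hypothetical callables standing in for the two `.pte` methods; the real exported `talker_int8.pte` also carries static KV-cache tensors, so check `scripts/export_talker.py` for the actual input layout:

```python
import torch

def generate_codec_frames(talker, code_predictor, inputs_embeds, codec_embed,
                          max_steps=250, eos_id=3071):
    """Greedy decode loop; the signatures below are assumptions, not the exported ones.

    talker(embeds [1, T, 2048]) -> (logits [1, T, 3072], hidden [1, T, 2048])
    code_predictor(hidden [1, 1, 2048]) -> codes [1, 15]
    codec_embed(ids [1, 1]) -> embeds [1, 1, 2048]
    """
    frames = []
    embeds = inputs_embeds
    for _ in range(max_steps):
        logits, hidden = talker(embeds)
        tok = int(logits[0, -1].argmax())           # first codebook, greedy
        if tok == eos_id:                           # assumed end-of-audio token id
            break
        rest = code_predictor(hidden[:, -1:, :])    # codebooks 2-16 for this frame
        frames.append(torch.cat([torch.tensor([tok]), rest.flatten().long()]))
        # Feed the new frame back in; a real loop would advance the static KV
        # cache instead of re-running the whole prefix each step.
        embeds = torch.cat([embeds, codec_embed(torch.tensor([[tok]]))], dim=1)
    return torch.stack(frames, dim=1)               # [16, T] for vocoder.pte
```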
### Token Format

The talker uses an interleaved text+audio token sequence:

```text
[BOS] [text tokens...] [speaker x-vector] [ref audio codes...] [SEP] [generated audio codes...]
```
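In embedding space that sequence is assembled roughly as follows. This is a sketch with hypothetical argument names (BOS/SEP embeddings omitted); the real tables live in `talker_embeddings.pt`:

```python
import torch
import torch.nn.functional as F

def build_prompt_embeds(text_table: torch.Tensor,       # [151936, 2048] text embeddings
                        input_ids: torch.Tensor,        # [1, T_text] tokenized text
                        x_vector: torch.Tensor,         # [1, 2048] speaker embedding
                        ref_code_embeds: torch.Tensor,  # [1, T_ref, 2048] embedded ref codes
                        ) -> torch.Tensor:
    """Concatenate [text][speaker x-vector][ref codes] along the sequence axis."""
    text_embeds = F.embedding(input_ids, text_table)    # [1, T_text, 2048]
    return torch.cat([text_embeds, x_vector.unsqueeze(1), ref_code_embeds], dim=1)
```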
### Key Parameters

| Parameter | Value |
|---|---|
| Audio sample rate | 24,000 Hz |
| Codec frame rate | 12.5 Hz (80 ms per frame) |
| Codebooks | 16 (1 from talker + 15 from code predictor) |
| Audio vocab size | 3,072 |
| Text vocab size | 151,936 (Qwen2 tokenizer) |
| Max sequence length | 2,048 tokens |
| Speaker embedding dim | 2,048 |
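Two of these numbers fix the real-time budget: at 12.5 Hz, each talker step buys 80 ms of audio, i.e. 1,920 PCM samples at 24 kHz. A quick sanity check:

```python
SAMPLE_RATE = 24_000   # Hz
FRAME_RATE = 12.5      # codec frames per second

samples_per_frame = SAMPLE_RATE / FRAME_RATE  # 1920.0 samples = 80 ms per frame
print(f"10 s of speech needs {10 * FRAME_RATE:.0f} talker steps")  # 125 steps
```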
## Quick Start (Python)

```python
from huggingface_hub import hf_hub_download
from executorch.runtime import Runtime
import torch

REPO = "acul3/Qwen3-TTS-1.7B-Base-ExecuTorch"

# Download the four .pte modules plus the auxiliary weight files.
spk_path = hf_hub_download(REPO, "speaker_encoder_int8.pte")
talker_path = hf_hub_download(REPO, "talker_int8.pte")
cp_path = hf_hub_download(REPO, "code_predictor_int8.pte")
voc_path = hf_hub_download(REPO, "vocoder_int8.pte")
emb_path = hf_hub_download(REPO, "talker_embeddings.pt")
cp_extras_path = hf_hub_download(REPO, "code_predictor_extras.pt")

# Load the ExecuTorch programs and grab their "forward" methods.
runtime = Runtime.get()
speaker_enc = runtime.load_program(spk_path).load_method("forward")
vocoder = runtime.load_program(voc_path).load_method("forward")

# Embedding tables and projections run in the Python orchestrator, not in .pte.
embeddings = torch.load(emb_path, weights_only=True)
cp_extras = torch.load(cp_extras_path, weights_only=True)
```
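Continuing the snippet above, a loaded method can be sanity-checked directly. The 469-frame input comes from the export details below; the mel-bin count and the `[batch, frames, bins]` layout are assumptions to verify against `scripts/export_speaker_encoder.py`:

```python
# Stand-in mel spectrogram; replace with real features from the reference audio.
NUM_MEL_FRAMES = 469   # fixed by the export (~3.8 s at 24 kHz)
NUM_MEL_BINS = 80      # assumption -- check the export script
mel = torch.randn(1, NUM_MEL_FRAMES, NUM_MEL_BINS)

(x_vector,) = speaker_enc.execute([mel])
print(x_vector.shape)  # expected: torch.Size([1, 2048])
```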
## Quick Start (Android, Kotlin)

```kotlin
import org.pytorch.executorch.Module

// Load the four ExecuTorch modules (paths point at files shipped with the app).
val speakerEnc = Module.load("speaker_encoder_int8.pte")
val talker = Module.load("talker_int8.pte")
val codePred = Module.load("code_predictor_int8.pte")
val vocoder = Module.load("vocoder_int8.pte")
```
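Invoking a loaded module goes through ExecuTorch's `EValue`/`Tensor` wrappers. A sketch for the speaker encoder, with the same assumed input shape as in the Python example:

```kotlin
import org.pytorch.executorch.EValue
import org.pytorch.executorch.Tensor

// Shape is an assumption (1 x 469 mel frames x 80 bins); verify against the export script.
val mel = FloatArray(469 * 80) // fill with real mel-spectrogram values
val input = EValue.from(Tensor.fromBlob(mel, longArrayOf(1, 469, 80)))
val xVector = speakerEnc.forward(input)[0].toTensor() // expected shape [1, 2048]
```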
## Validation Results

| Component | Method | Cosine Similarity |
|---|---|---|
| Speaker Encoder | .pte vs PyTorch | 0.965* |
| Talker | Wrapper vs original | 1.000 ✅ |
| Vocoder | .pte vs PyTorch | 1.000 ✅ |
| Code Predictor | .pte validated | ✅ |
\*The speaker encoder scores 0.965 because of mel padding for the fixed-size export; with matching input sizes the similarity is 1.000.

INT8 quantization produces valid, intelligible speech, verified with full-pipeline generation.
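The similarity column is presumably computed by flattening both modules' outputs on identical inputs and comparing directions, along these lines (the actual check lives in `scripts/test_e2e.py`):

```python
import torch
import torch.nn.functional as F

def cosine_sim(pte_out: torch.Tensor, torch_out: torch.Tensor) -> float:
    """1.0 means the .pte output matches the PyTorch reference up to scale."""
    return F.cosine_similarity(pte_out.flatten(), torch_out.flatten(), dim=0).item()
```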
## Export Details

| Property | Value |
|---|---|
| ExecuTorch | 1.1.0 |
| Backend | XNNPACK (CPU, cross-platform) |
| Quantization | torchao INT8 weight-only (per-channel, no calibration needed) |
| Source model | Qwen/Qwen3-TTS-1.7B-Base |
| Max sequence length | 2,048 |
| Speaker encoder input | Fixed 469 mel frames (~3.8 s at 24 kHz) |
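The quantize-and-lower flow likely follows the standard torchao + XNNPACK recipe sketched below. This is an illustration, not the verbatim contents of `scripts/quantize_all.py`; `model` and `example_inputs` are placeholders, and depending on the torchao version the config may be spelled `int8_weight_only()` instead:

```python
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

def export_int8(model: torch.nn.Module, example_inputs: tuple, out_path: str) -> None:
    quantize_(model, Int8WeightOnlyConfig())          # per-channel weight-only INT8
    exported = torch.export.export(model.eval(), example_inputs)
    lowered = to_edge_transform_and_lower(exported, partitioner=[XnnpackPartitioner()])
    with open(out_path, "wb") as f:
        f.write(lowered.to_executorch().buffer)       # serialized .pte program
```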
### Export Challenges Solved

- `Conv1d(padding="same")` → replaced with explicit `F.pad()` + `Conv1d(padding=0)`, since ExecuTorch doesn't support `padding="same"` (see the sketch below)
- `DynamicCache` → replaced with static KV-cache tensors passed as model inputs/outputs
- MROPE (Multi-Resolution RoPE) → simplified: all 3 dimensions share identical `position_ids` for TTS
- Stride-0 tensors → used `.repeat()` instead of `.expand()` for ExecuTorch compatibility
- Vocoder dynamic chunking → bypassed `chunked_decode` with a fixed code length
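For the first item, a drop-in replacement that exports cleanly might look like this, a sketch for the stride-1 case (the only case `padding="same"` supports anyway):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExplicitSamePadConv1d(nn.Module):
    """Conv1d(padding="same") rewritten as explicit F.pad + Conv1d(padding=0)."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        total = dilation * (kernel_size - 1)            # total padding for stride=1 "same"
        self.pad = (total // 2, total - total // 2)     # (left, right)
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation, padding=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(F.pad(x, self.pad))
```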
## Scripts

| Script | Description |
|---|---|
| `scripts/analyze_model.py` | Deep architecture analysis + shape tracing |
| `scripts/export_speaker_encoder.py` | Speaker encoder surgery + .pte export |
| `scripts/export_talker.py` | Main talker LM surgery + .pte export |
| `scripts/export_code_predictor.py` | Code predictor surgery + .pte export |
| `scripts/export_vocoder.py` | Vocoder surgery + .pte export |
| `scripts/quantize_all.py` | INT8 weight-only quantization of all modules |
| `scripts/test_e2e.py` | End-to-end validation |
## Hardware Requirements

### On-device inference (INT8)

- RAM: 8 GB minimum (models use ~1.8 GB, plus KV cache and audio buffers; estimated below)
- Storage: 1.8 GB for all 4 model files + extras
- CPU: ARM64 (Android) or x86_64
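For sizing intuition, the talker's KV cache at the full 2,048-token context can be estimated from the architecture table above (28 layers, 8 KV heads, head dim 2048/16 = 128); the fp32 dtype here is an assumption:

```python
layers, kv_heads, head_dim, seq_len, fp32_bytes = 28, 8, 2048 // 16, 2048, 4
kv_cache_bytes = layers * 2 * kv_heads * head_dim * seq_len * fp32_bytes  # K and V
print(f"{kv_cache_bytes / 2**20:.0f} MiB")  # ~448 MiB in fp32, half that in fp16
```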
### Export/development

- 64 GB+ unified memory recommended (Jetson AGX Orin or similar)
- Python 3.10+, PyTorch 2.10+, ExecuTorch 1.1.0, torchao
## Reproduce

```bash
git clone https://huggingface.co/acul3/Qwen3-TTS-1.7B-Base-ExecuTorch
cd Qwen3-TTS-1.7B-Base-ExecuTorch
pip install executorch torchao qwen-tts

python scripts/export_speaker_encoder.py
python scripts/export_talker.py
python scripts/export_code_predictor.py
python scripts/export_vocoder.py
python scripts/quantize_all.py
python scripts/test_e2e.py
```
## License

Apache 2.0 (same as the source model).
## Citation

```bibtex
@misc{qwen3tts_executorch_2026,
  title  = {Qwen3-TTS-1.7B-Base-ExecuTorch: On-Device Voice Clone TTS},
  author = {Samsul Rahmadani},
  year   = {2026},
  url    = {https://huggingface.co/acul3/Qwen3-TTS-1.7B-Base-ExecuTorch},
  note   = {Converted from Qwen/Qwen3-TTS-1.7B-Base}
}
```
## Acknowledgments