bol-tts-marathi-onnx — ONNX export

ONNX-format export of the Marathi Kokoro-82M fine-tune at shreyask/bol-tts-marathi. Designed for WebGPU / transformers.js / onnxruntime deployments.

Architecture: Kokoro-82M with disable_complex=True, which swaps the default TorchSTFT for CustomSTFT (TorchSTFT relies on complex tensors that ONNX doesn't support).

Files

onnx/model.onnx          — fp32 model, 326 MB
config.json              — Kokoro inference config with ɭ at slot 144 (Marathi retroflex lateral)
voice_speeds.json        — per-voice optimal default speed (usage sketched below)
voices/*.pt              — 25 voicepack .pt files, [510, 1, 256] float32 each
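
A minimal sketch of consuming voice_speeds.json, assuming it is a flat {voice_id: float} mapping (the schema isn't documented here, so verify against the shipped file):

import json

with open("voice_speeds.json") as f:
    voice_speeds = json.load(f)
# Fall back to neutral pacing for voices without a tuned default.
default_speed = float(voice_speeds.get("mf_asha", 1.0))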

Model I/O

Inputs:
  input_ids: int64   [1, n_phonemes]  — phoneme token IDs (per config.json vocab).
                                         MUST be wrapped with BOS=0 and EOS=0:
                                         [0, *content_ids, 0]
  style:     float32 [1, 256]         — voicepack slice at position [content_n_phonemes].
                                         (Naming follows kokoro-js + thewh1teagle/kokoro-onnx
                                          ecosystem convention.)
  speed:     float32 [1]              — pacing multiplier (1.0 = neutral; <1.0 slows, >1.0 speeds up).
                                         Divides the predictor's per-phoneme duration BEFORE
                                         rounding, so it scales actual frame allocation —
                                         not just playback rate.

Outputs:
  audio:     float32 [1, n_samples]   — 24 kHz waveform. Includes BOS+EOS audio at start/end —
                                         strip `bos_frames * 600` samples from the front and
                                         `eos_frames * 600` from the back if you want
                                         content-only audio (Rasa-trained voicepacks generate
                                         a soft breathy pre-roll for BOS that surfaces as
                                         "umm" if not stripped).
  pred_dur:  int64   [1, n_phonemes]  — per-phoneme durations in predictor frames.
                                         1 frame = 600 audio samples at 24 kHz.
                                         pred_dur[0] = BOS duration; pred_dur[-1] = EOS.

pred_dur is exposed so downstream apps can build phoneme/word-level timestamps.
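
A minimal sketch of that reconstruction (the helper name is illustrative, not part of the model API): cumulative-sum the frame counts, convert frames to seconds at 600 samples per frame / 24 kHz, and shift by the BOS duration so times line up with BOS/EOS-stripped audio.

import numpy as np

def phoneme_timestamps(pred_dur, sr=24000, hop=600):
    """(start_s, end_s) per content phoneme, aligned to BOS/EOS-stripped audio."""
    dur = pred_dur.flatten().astype(np.int64)
    ends = np.cumsum(dur) * hop / sr      # end time of each token, seconds
    starts = ends - dur * hop / sr        # start time of each token
    bos = starts[1]                       # content begins after the BOS frames
    return [(s - bos, e - bos) for s, e in zip(starts[1:-1], ends[1:-1])]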

Usage — onnxruntime (Python)

import numpy as np
import onnxruntime as ort
import torch
import soundfile as sf
import json
from misaki import espeak

sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
vocab = json.load(open("config.json"))["vocab"]
voice = torch.load("voices/mf_asha.pt", map_location="cpu", weights_only=True)

g2p = espeak.EspeakG2P(language="mr")
text = "नमस्कार, मी मराठी बोलतो."
phonemes, _ = g2p(text)
content_ids = [vocab[p] for p in phonemes if p in vocab]

# Wrap with BOS=0, EOS=0
input_ids = np.array([[0, *content_ids, 0]], dtype=np.int64)
# Voicepack indexed by CONTENT length (not wrapped length): [510, 1, 256] -> slot
style = voice[len(content_ids)].numpy().astype(np.float32)
speed = np.array([1.0], dtype=np.float32)

audio, pred_dur = sess.run(None, {
    "input_ids": input_ids,
    "style": style,
    "speed": speed,
})

# Strip BOS+EOS audio (optional but recommended; see I/O notes above)
HOP = 600
audio = audio.flatten()                      # [1, n_samples] -> [n_samples]
bos_frames = int(pred_dur.flatten()[0])
eos_frames = int(pred_dur.flatten()[-1])
audio = audio[bos_frames * HOP : len(audio) - eos_frames * HOP]

sf.write("out.wav", audio, 24000)
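
To confirm that speed rescales frame allocation rather than playback rate, re-run the same inputs at a different speed and compare total predicted frames (variable names continue the example above):

audio_fast, dur_fast = sess.run(None, {
    "input_ids": input_ids,
    "style": style,
    "speed": np.array([1.5], dtype=np.float32),
})
# Total frames (and hence samples) shrink by roughly 1/1.5, modulo per-phoneme rounding.
print(int(pred_dur.sum()), int(dur_fast.sum()))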

Usage — WebGPU / transformers.js

The live demo at shreyask/bol-tts-marathi uses this exact ONNX file via @huggingface/transformers. The TS client calls await model({ input_ids, style, speed }), then applies the BOS/EOS strip and per-utterance silence injection at punctuation boundaries client-side. Source: the Space's src/model.ts.

For Marathi support in upstream Kokoro-JS pipelines, you'll need to monkey-patch 'm' as a Marathi lang_code (espeak 'mr').

Voicepacks (25)

This repo ships all 25 voicepacks deployed in the live demo as .pt files (use them as style input):

  • 4 trained on Marathi corpora: mf_asha, mm_vivek (Rasa), mf_mukta, mm_dnyanesh (SPRINGLab)
  • 19 stock-Kokoro crossovers: af_heart (Svara), af_nova (Tara), am_liam (Atharv), bf_emma-style (Ira), hm_omega (Vihaan), zf_xiaoxiao (Pari, kid), zf_xiaoyi (Vir, kid), …; see the demo's voicepacks.json for the full ID → display-name mapping.
  • 2 synthetic: syn_sama (centroid mean of 5 voicepacks), syn_navya (centroid + Gaussian noise) — generated arithmetically with no reference audio; see the sketch below.
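
The construction is simple enough to reproduce. A sketch of the idea; which five packs fed syn_sama and the noise scale used for syn_navya are not documented here, so the choices below are placeholders:

import torch

names = ["mf_asha", "mm_vivek", "mf_mukta", "mm_dnyanesh", "af_heart"]  # hypothetical 5
packs = torch.stack([torch.load(f"voices/{n}.pt", map_location="cpu", weights_only=True)
                     for n in names])                  # [5, 510, 1, 256]
centroid = packs.mean(dim=0)                           # syn_sama-style centroid
noisy = centroid + 0.01 * torch.randn_like(centroid)   # syn_navya-style (noise scale is a guess)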

Export details

Exported via scripts/export_onnx.py:

torch.onnx.export(
    KModelForONNX(kmodel),                   # upstream wrapper, runs forward_with_tokens
    (dummy_input_ids, dummy_style, dummy_speed),
    output_path,
    input_names=["input_ids", "style", "speed"],
    output_names=["audio", "pred_dur"],
    dynamic_axes={
        "input_ids": {1: "n_phonemes"},
        "audio":     {1: "n_samples"},
        "pred_dur":  {1: "n_phonemes"},
    },
    opset_version=17,
    dynamo=False,                  # legacy TorchScript tracer; pinned for torch ≤ 2.8
    do_constant_folding=True,
)

⚠️ torch ≤ 2.8 is required for export. With the legacy tracer (dynamo=False), torch ≥ 2.9 silently emits a static-output ONNX on Kokoro's InstanceNorm-under-spectral-norm + LSTM + CustomSTFT combo: the exported file loads and runs in onnxruntime but produces silence. We pin torch==2.6 in our export venv; see the bol-tts-marathi pyproject.toml for the constraint.
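
Given that failure mode, a cheap post-export smoke test is worth wiring into the export script. A sketch; the random token IDs, random style vector, and silence threshold are illustrative (a stricter check would feed a real phoneme sequence and voicepack slice):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
for n in (16, 48):  # two content lengths, to exercise the dynamic axes
    ids = np.array([[0, *np.random.randint(1, 100, size=n), 0]], dtype=np.int64)
    style = (0.1 * np.random.randn(1, 256)).astype(np.float32)
    audio, dur = sess.run(None, {"input_ids": ids, "style": style,
                                 "speed": np.array([1.0], dtype=np.float32)})
    assert np.abs(audio).max() > 1e-4, "silent export -- check torch version"
    print(n, audio.shape, int(dur.sum()) * 600)  # samples should track frames * 600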

disable_complex=True is mandatory — Kokoro's default TorchSTFT uses complex tensors that ONNX doesn't support.

License

Apache 2.0. See the base PyTorch model for full citation/attribution.
