KhanhTTS 🗣️🔥

KhanhTTS is a Text-to-Speech (TTS) model based on OmniVoice, fine-tuned to synthesize Vietnamese and English speech, with support for voice cloning.

🧠 Training details

  • Base model: k2-fsa/OmniVoice
  • Dataset: ~1,500 hours of Vietnamese and English audio
  • Steps: ~500,000
  • Goals:
    • Natural Vietnamese and English pronunciation
    • Stable voice cloning from a short reference clip

☕ Support this project

Training high-quality TTS models requires significant GPU resources. If you find this model useful, please consider supporting its development:

Buy Me a Coffee

Every contribution motivates me to build better models in the future ❤️


🦜 Sample

Reference Voice (Speaker Example):

Input Text:

Đêm đó, anh xoá số cô khỏi danh bạ.
Nhưng khi màn hình tối đi, anh vẫn nhớ rất rõ… số ấy nằm ở đâu trong tim mình.

Ngoài cửa sổ, gió thổi khẽ.
Có những thứ đã rời đi rồi,
nhưng cảm giác thì ở lại lâu hơn ta tưởng.

Generated Output (Cloned Voice):

🚀 Installation & inference

1. Set up the environment

pip install omnivoice
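
The inference example in this card imports `omnivoice`, `soundfile`, and `torch`. As a quick sanity check after installation, the sketch below reports which of those modules cannot be found in the current environment. The helper name `missing_packages` is hypothetical, and this card does not say whether `soundfile` and `torch` are installed automatically with `omnivoice` or need to be installed separately.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Modules imported by the inference example in this card.
    missing = missing_packages(["omnivoice", "soundfile", "torch"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Any name it reports can then be installed with `pip install` before running the inference step.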

2. Load the model & run inference

from omnivoice import OmniVoice
import soundfile as sf
import torch

# Load the model
model = OmniVoice.from_pretrained(
    "kjanh/KhanhTTS-OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16
)
# Generate speech; uncomment ref_audio and ref_text to clone a reference voice
audio = model.generate(
    text="Xin chào các bạn.",
    # ref_audio="refvoice.wav",
    # ref_text="có người từng nói với cô, đó là hơi thở của mùa đông, hơi thở của đất trời, hơi thở của tình yêu.",
)  # audio is a list of `np.ndarray` with shape (T,) at 24 kHz.

sf.write("out.wav", audio[0], 24000)
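
The commented-out `ref_audio` and `ref_text` arguments above switch the call from plain synthesis to voice cloning. A minimal sketch of a wrapper covering both modes follows. `synthesize` is a hypothetical helper, not part of the `omnivoice` package, and it relies only on the `model.generate(text=..., ref_audio=..., ref_text=...)` keywords that appear in the example above. Requiring the reference clip and its transcript together is a conservative assumption of this sketch; this card does not state whether the library accepts one without the other.

```python
def synthesize(model, text, ref_audio=None, ref_text=None):
    """Generate speech for `text`, cloning a voice when a reference is given.

    `model` is any object exposing generate(text=..., ref_audio=..., ref_text=...),
    such as the OmniVoice model loaded above. Returns the first waveform.
    """
    if not text or not text.strip():
        raise ValueError("text must not be empty")
    # Assumption of this sketch: the clip and its transcript come as a pair.
    if (ref_audio is None) != (ref_text is None):
        raise ValueError("provide both ref_audio and ref_text, or neither")

    kwargs = {"text": text}
    if ref_audio is not None:
        kwargs["ref_audio"] = ref_audio
        kwargs["ref_text"] = ref_text

    audio = model.generate(**kwargs)  # list of waveforms, per the example above
    return audio[0]
```

With the model loaded as above, `synthesize(model, "Xin chào các bạn.")` returns a waveform that can be passed to `sf.write("out.wav", waveform, 24000)`.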

⚠️ Disclaimer & usage advisory (TTS)

This Text-to-Speech (TTS) model is provided solely for research, experimentation, and technology development. Audio generated by the model does not reflect, represent, or imply the voice, identity, views, or endorsement of any real individual or organization. The author and related parties accept no legal liability for misuse, violations of law, infringement of privacy, personal rights, or intellectual property rights, or for any direct or indirect damages arising from use of this model.

Users bear full legal responsibility for deploying, distributing, and using the model. It is strictly prohibited to use the model to impersonate, copy, or imitate a person's voice without lawful consent, to create misleading content, to commit fraud, to manipulate public opinion, or to do anything else that violates applicable law. When using or sharing generated audio, you must clearly disclose that it was generated by artificial intelligence (AI) and comply fully with the relevant laws, platform policies, and ethical standards.

This model is released for research and development purposes only. We do not recommend using it in production or for commercial purposes without rigorous testing, risk assessment, and safety validation. Please use the model responsibly.

Businesses or organizations interested in commercial use can get in touch to discuss a collaboration: https://www.facebook.com/khanh20204569/

📚 Citation

If you use this model, or build on OmniVoice, in research or a product, please cite the original OmniVoice paper:

@article{zhu2026omnivoice,
      title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
      author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
      journal={arXiv preprint arXiv:2604.00688},
      year={2026}
}