KhanhTTS
KhanhTTS is a Text-to-Speech (TTS) model based on OmniVoice, fine-tuned to synthesize Vietnamese and English speech, with support for voice cloning.
Training information
- Base model: k2-fsa/OmniVoice
- Dataset: ~1,500 hours of Vietnamese and English audio
- Steps: ~500,000
- Goals:
  - Natural Vietnamese and English pronunciation
  - Stable voice cloning from a short reference clip
Support this project
Training high-quality TTS models requires substantial GPU resources. If you find this model useful, please consider supporting its development.
Every contribution motivates me to build better models in the future.
Sample
Reference Voice (Speaker Example):
Input Text:
Đêm đó, anh xoá số cô khỏi danh bạ.
Nhưng khi màn hình tối đi, anh vẫn nhớ rất rõ… số ấy nằm ở đâu trong tim mình. Ngoài cửa sổ, gió thổi khẽ.
Có những thứ đã rời đi rồi,
nhưng cảm giác thì ở lại lâu hơn ta tưởng.
Generated Output (Cloned Voice):
Installation & inference
1. Set up the environment
pip install omnivoice
2. Load model & Inference
from omnivoice import OmniVoice
import soundfile as sf
import torch

# Load the model
model = OmniVoice.from_pretrained(
    "kjanh/KhanhTTS-OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

audio = model.generate(
    text="Xin chào các bạn.",
    # ref_audio="refvoice.wav",
    # ref_text="có người từng nói với cô, đó là hơi thở của mùa đông, hơi thở của đất trời, hơi thở của tình yêu.",
)  # audio is a list of `np.ndarray` with shape (T,) at 24 kHz.

sf.write("out.wav", audio[0], 24000)
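For voice cloning, the commented-out `ref_audio` and `ref_text` arguments in the snippet above are the relevant inputs. The following is a minimal sketch, not part of the OmniVoice API: `clone_voice` is a hypothetical helper, and it assumes that `ref_text` is the transcript of the reference clip and that `generate` returns a list of waveforms at 24 kHz, as the comment above states.

```python
SAMPLE_RATE = 24000  # output rate stated in the inference snippet


def clone_voice(model, text, ref_audio, ref_text):
    """Generate `text` in the voice of `ref_audio` (hypothetical helper).

    `model` is any object exposing
    generate(text=..., ref_audio=..., ref_text=...),
    such as the OmniVoice instance loaded in the inference snippet.
    Returns the first waveform and its duration in seconds.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    audio = model.generate(text=text, ref_audio=ref_audio, ref_text=ref_text)
    waveform = audio[0]  # first generated utterance
    duration_s = len(waveform) / SAMPLE_RATE
    return waveform, duration_s


# Usage, assuming `model` and `sf` from the inference snippet:
#   waveform, seconds = clone_voice(
#       model, "Xin chào các bạn.", "refvoice.wav", "<transcript of refvoice.wav>"
#   )
#   sf.write("cloned.wav", waveform, SAMPLE_RATE)
```

Returning the duration alongside the waveform gives a quick sanity check that the output length is plausible for the input text before the file is written.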
Disclaimer & usage advisory (TTS)
This Text-to-Speech (TTS) model is provided solely for research, experimentation, and technology development. Audio generated by the model does not reflect, represent, or imply the voice, identity, views, or endorsement of any real person or organization. The author and related parties accept no legal liability for misuse, violations of the law, infringement of privacy, personal rights, or intellectual property rights, or for any direct or indirect damages arising from use of this model.
Users bear full legal responsibility for deploying, distributing, and using the model. It is strictly forbidden to use the model to impersonate, copy, or imitate an individual's voice without lawful consent, to create misleading content, to commit fraud, to manipulate public opinion, or to carry out any act that violates applicable law. When using or sharing generated audio, you must clearly disclose that the content was generated by artificial intelligence (AI), and you must comply fully with the relevant laws, platform policies, and ethical standards.
This model is released for research and development purposes only. We do not recommend using it in production or for commercial purposes unless it has first gone through rigorous testing, risk assessment, and safety validation. Please use the model responsibly.
Businesses or organizations that want to use the model commercially can get in touch to discuss a collaboration: https://www.facebook.com/khanh20204569/
Citation
If you use this model, or build on OmniVoice, in research or a product, please cite the original OmniVoice paper:
@article{zhu2026omnivoice,
title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
journal={arXiv preprint arXiv:2604.00688},
year={2026}
}