Voice cloning doesn't actually clone voice

#2
by wgwine - opened

https://kanitts.com/kani-tts-2

It basically just produces a random voice every time you extract embeddings.

NineNineSix org

This website does not belong to us.
But we'll figure out on our side what's wrong with the voice cloning. Thanks for the feedback!

The resulting voice doesn't sound like the supplied audio sample, and repeated generations yield drastically different voices. I tried it locally too and got the same result, so I don't think the website is the problem. The sample was a 25-second .mp3 file, in case that helps. Maybe it's a conversion problem?
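One way to narrow this down would be to extract the embedding twice from the same file and compare the two results numerically. Below is a minimal sketch of that check, assuming each embedding can be flattened into a plain list of floats; how the embedding is obtained from the package is not shown, and the helper names and the 0.99 threshold are hypothetical choices of mine, not part of kani-tts.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def embeddings_match(a: Sequence[float], b: Sequence[float],
                     threshold: float = 0.99) -> bool:
    """Hypothetical check: treat two extractions as 'the same voice'
    if their cosine similarity reaches the (arbitrary) threshold."""
    return cosine_similarity(a, b) >= threshold
```

If two extractions from the same file score well below 1.0, the variation is coming from the extraction step. If they match closely but the output voices still differ, the variation is more likely coming from generation.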

NineNineSix org

Yeah, we're trying to figure it out. Meanwhile, try kani-tts-2-pt and let me know if it's better at voice cloning (VC). If so, the current model is simply undertrained and hasn't generalized well.

multilingual(pt) is doing the same thing as far as I can tell. I did not find a publicly available kani-tts-2-pt demo page that included cloning, but I ran pt locally and made 2 clones from the same mp3. These are the results.


NineNineSix org

https://huggingface.co/spaces/nineninesix/kani-tts-2-pt. Don't forget to press the "Extract Embedding" button.
Meanwhile, I'll check how embeddings are made in the pip package.

I tried it with sample audio of a male voice and a female voice. The cloning seems to get the gender correct, but beyond that the cloned voice sounds almost nothing like the sample audio.

Maybe it is functioning according to expectations. Not complaining, but I wouldn't really call it cloning.

NineNineSix org

Hello!
You are correct: the current behavior is expected given the model design.

This model is intentionally optimized for real-time inference, which imposes certain architectural and capacity constraints. As a result, its ability to generalize for high-fidelity voice cloning is limited compared to larger, offline-oriented models.

We have implemented stable speaker conditioning and can reliably control speaker identity, but achieving high-quality voice cloning under real-time constraints remains an active area of development.

We are continuously working on improving cloning quality while preserving low-latency performance. Updates will be released as improvements become available.

wgwine changed discussion status to closed
