Voice cloning doesn't actually clone voice

by wgwine - opened 23 days ago

Discussion

wgwine

23 days ago

https://kanitts.com/kani-tts-2

It basically just produces a random voice every time you extract embeddings.

ylankgz

NineNineSix org 22 days ago

This website does not belong to us.
But we'll look into what's wrong with the voice cloning on our side. Thanks for the feedback!

wgwine

22 days ago

edited 22 days ago

The resulting voice doesn't sound like the supplied audio sample, and repeated generations yield drastically different voices. I tried it locally too and got the same result, so I don't suspect the website is the problem. The sample was a 25-second .mp3 file, in case that helps. Maybe it's some conversion problem?
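
One way to narrow this down is to check whether two embeddings extracted from the same file are actually close to each other before blaming the generation step. The sketch below is only an illustration: it assumes the speaker embeddings can be obtained as flat lists of floats, it does not call any kani-tts API, and the function names and the 0.99 threshold are hypothetical choices rather than anything from the project.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def looks_stable(emb_1, emb_2, threshold=0.99):
    """Hypothetical check: treat two extractions as 'the same speaker'
    if their cosine similarity reaches the (arbitrary) threshold."""
    return cosine_similarity(emb_1, emb_2) >= threshold
```

If two extractions from the same clip score well below 1.0, the variation is probably coming from the embedding step (or the audio conversion before it). If they match but the output voices still differ, the generation side is the more likely suspect.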

ylankgz

NineNineSix org 22 days ago

Yeah, trying to figure it out. Meanwhile, try kani-tts-2-pt and let me know if it's better at voice cloning (VC). If so, then the current model is simply undertrained and hasn't generalized well.

wgwine

22 days ago

The multilingual (pt) model is doing the same thing as far as I can tell. I did not find a publicly available kani-tts-2-pt demo page that included cloning, but I ran it locally with pt and made 2 clones from the same mp3; these are the results.

ylankgz

NineNineSix org 22 days ago

https://huggingface.co/spaces/nineninesix/kani-tts-2-pt. Don't forget to press the "Extract Embedding" button.
Meanwhile, I'll check how embeddings are made in the pip package.

wgwine

22 days ago

I tried it with sample audio of a male voice and of a female voice. The cloning seems to get the gender correct, but beyond that the cloned voice sounds almost nothing like the sample audio.

Maybe it is functioning according to expectations. Not complaining, but I wouldn't really call it cloning.

Simonlob

NineNineSix org 22 days ago

Hello!
You are correct: the current behavior is expected given the model design.

This model is intentionally optimized for real-time inference, which imposes certain architectural and capacity constraints. As a result, its ability to generalize for high-fidelity voice cloning is limited compared to larger, offline-oriented models.

We have implemented stable speaker conditioning and can reliably control speaker identity, but achieving high-quality voice cloning under real-time constraints remains an active area of development.

We are continuously working on improving cloning quality while preserving low-latency performance. Updates will be released as improvements become available.

wgwine changed discussion status to closed 21 days ago
