---
license: mit
language:
- en
tags:
- text-to-speech
- tts
- dctts
- pytorch
- speech-synthesis
- deep-convolutional-tts
pipeline_tag: text-to-speech
---

# DC-TTS Geralt Voice Model

A Deep Convolutional Text-to-Speech (DC-TTS) model trained to synthesize speech in the voice of Geralt of Rivia from The Witcher series.

## Model Description

This model is part of the [Deepstory](https://github.com/thetobysiu/deepstory) project, which combines Natural Language Generation, Text-to-Speech, and animation technologies to create interactive storytelling experiences.

The DC-TTS architecture is based on the paper:

> Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara. "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention" ([arXiv:1710.08969](https://arxiv.org/abs/1710.08969))

## Model Architecture

This model consists of two components:

### Text2Mel Network

Converts text input to mel-spectrograms.

| Parameter | Value |
|-----------|-------|
| Embedding Dimension (e) | 128 |
| Hidden Unit Dimension (d) | 512 |
| Vocabulary | `PE abcdefghijklmnopqrstuvwxyz'.,!?` |
| Max Characters (N) | 259 |
| Max Mel Frames (T) | 326 |
| Basic Block Type | Gated Convolution |
| Normalization | Layer Normalization |
| Dropout Rate | 0.05 |

### SSRN (Spectrogram Super-Resolution Network)

Upsamples mel-spectrograms to full spectrograms for audio synthesis.

| Parameter | Value |
|-----------|-------|
| Hidden Unit Dimension (c) | 640 (512 + 128) |
| Number of Mel Bins (f) | 80 |
| FFT Points | 2048 |
| Full Spectrogram Dimension | 1025 |
| Reduction Rate | 4 |
| Basic Block Type | Residual |
| Normalization | Weight Normalization |
| Weight Initialization | Kaiming |

### Audio Parameters

| Parameter | Value |
|-----------|-------|
| Sample Rate | 22050 Hz |
| Frame Shift | 0.0125 s (12.5 ms) |
| Frame Length | 0.05 s (50 ms) |
| Hop Length | 276 samples |
| Win Length | 1102 samples |
| Power | 1.5 |
| Preemphasis | 0.97 |
| Max dB | 100 |
| Reference dB | 20 |
| Griffin-Lim Iterations | 50 |

## Files

- `t2m_step-102000_first.pth` - Text2Mel model checkpoint
- `ssrn.pth` - SSRN model checkpoint

## Usage

```python
import numpy as np
import torch

from modules.dctts import Text2Mel, SSRN, hp, spectrogram2wav

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load both networks and switch them to inference mode
text2mel = Text2Mel(hp.vocab).to(device).eval()
text2mel.load_state_dict(torch.load('t2m_step-102000_first.pth', map_location=device)['state_dict'])
ssrn = SSRN().to(device).eval()
ssrn.load_state_dict(torch.load('ssrn.pth', map_location=device)['state_dict'])


def synthesize(text, timeout=10000):
    """Synthesize a waveform from text; `timeout` caps the number of decoder steps."""
    # normalize_text is not defined in this snippet; it is assumed to be a helper
    # that lowercases the input and drops characters outside hp.vocab.
    normalized_text = normalize_text(text) + "E"  # E: EOS
    # np.int64 replaces np.long, which recent NumPy releases no longer provide as an alias
    L = torch.from_numpy(np.array([[hp.char2idx[char] for char in normalized_text]], np.int64)).to(device)
    zeros = torch.from_numpy(np.zeros((1, hp.n_mels, 1), np.float32)).to(device)
    Y = zeros

    with torch.no_grad():
        # Generate mel frames step by step until the attention peak lands on EOS
        for i in range(timeout):
            _, Y_t, A = text2mel(L, Y, monotonic_attention=True)
            Y = torch.cat((zeros, Y_t), -1)
            _, attention = torch.max(A[0, :, -1], 0)
            if L[0, attention.item()] == hp.vocab.index('E'):
                break
        # Upsample the mel-spectrogram to a full spectrogram
        _, Z = ssrn(Y)

    Z = Z.cpu().numpy()
    # Convert the spectrogram to a waveform
    wav = spectrogram2wav(Z[0, :, :].T)
    return wav
```

## Training Data

The model was trained on audio samples of Geralt's voice from The Witcher 3: Wild Hunt video game.

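## Text Normalization

The usage snippet calls `normalize_text`, which is not defined there. The following is a minimal sketch of what such a helper might look like, built only from the vocabulary string in the Text2Mel table. The function bodies, the `encode` helper, and the reading of `P` as a padding symbol are assumptions for illustration, not the Deepstory implementation.

```python
import re
import unicodedata

# Vocabulary copied from the Text2Mel table.
# Assumption: "P" is a padding symbol; "E" marks end of sequence (EOS).
VOCAB = "PE abcdefghijklmnopqrstuvwxyz'.,!?"
CHAR2IDX = {char: idx for idx, char in enumerate(VOCAB)}


def normalize_text(text):
    """Map raw text onto the model vocabulary (hypothetical helper)."""
    # Decompose accented characters and drop the combining marks ("é" -> "e")
    text = "".join(
        char for char in unicodedata.normalize("NFD", text)
        if unicodedata.category(char) != "Mn"
    )
    # Lowercasing also keeps a literal "P" or "E" from acting as a control symbol
    text = text.lower()
    # Replace anything outside a-z, space, and '.,!? with a space
    text = re.sub(r"[^ a-z'.,!?]", " ", text)
    # Collapse runs of spaces and trim both ends
    return re.sub(r" +", " ", text).strip()


def encode(text):
    """Return vocabulary indices for the normalized text plus the EOS marker."""
    return [CHAR2IDX[char] for char in normalize_text(text) + "E"]
```

Digits and symbols are dropped rather than spelled out in this sketch, so an input such as "24/7" would need to be written as words before synthesis.
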
## Intended Use

This model is intended for:

- Research and experimentation in speech synthesis
- Creative projects and fan content
- Educational purposes

## Limitations

- The model works best with English text
- The vocabulary is limited to lowercase letters, spaces, and the punctuation marks `'.,!?`; digits and other symbols are not in the vocabulary
- Audio quality may vary depending on input text complexity
- The character voice is based on copyrighted material

## Citation

If you use this model, please cite the original DC-TTS paper and the Deepstory project:

```bibtex
@article{tachibana2018efficiently,
  title={Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention},
  author={Tachibana, Hideyuki and Uenoyama, Katsuya and Aihara, Shunsuke},
  journal={arXiv preprint arXiv:1710.08969},
  year={2018}
}

@misc{deepstory,
  author = {Siu King Wai},
  title = {Deepstory},
  year = {2020},
  publisher = {GitHub},
  url = {https://github.com/thetobysiu/deepstory}
}
```

## License

This model is released under the MIT License. Please note that the voice characteristics are based on copyrighted material from The Witcher 3: Wild Hunt.

## Acknowledgments

- Original DC-TTS implementation: [tugstugi/pytorch-dc-tts](https://github.com/tugstugi/pytorch-dc-tts)
- The Witcher 3: Wild Hunt by CD Projekt Red
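
## Notes on Audio Parameters

The sample-based values in the Audio Parameters table follow from the sample rate and the frame timings. The sketch below reproduces that arithmetic as a sanity check. The variable names are illustrative. Whether the training code truncated or rounded the hop length is not stated in this card, so both results are shown.

```python
SAMPLE_RATE = 22050    # Hz
FRAME_SHIFT = 0.0125   # seconds
FRAME_LENGTH = 0.05    # seconds
N_FFT = 2048

# Exact (fractional) sizes in samples
hop_exact = SAMPLE_RATE * FRAME_SHIFT    # about 275.625
win_exact = SAMPLE_RATE * FRAME_LENGTH   # about 1102.5

# Truncating and rounding the hop length disagree by one sample
hop_truncated = int(hop_exact)   # 275
hop_rounded = round(hop_exact)   # 276, the value listed in the table

# Truncation gives the listed window length
win_truncated = int(win_exact)   # 1102

# A one-sided FFT keeps bins 0 .. N_FFT / 2 inclusive
n_freq_bins = N_FFT // 2 + 1     # 1025, the full spectrogram dimension

print(hop_truncated, hop_rounded, win_truncated, n_freq_bins)
```

If audio is processed outside the original code, it is worth checking which hop length the checkpoint was actually trained with, since a one-sample mismatch slightly changes the frame timing.
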