---
license: mit
language:
  - en
tags:
  - text-to-speech
  - tts
  - dctts
  - pytorch
  - speech-synthesis
  - deep-convolutional-tts
pipeline_tag: text-to-speech
---

# DC-TTS Geralt Voice Model

A Deep Convolutional Text-to-Speech (DC-TTS) model trained to synthesize speech in the voice of Geralt of Rivia from The Witcher series.

## Model Description

This model is part of the [Deepstory](https://github.com/thetobysiu/deepstory) project, which combines Natural Language Generation, Text-to-Speech, and animation technologies to create interactive storytelling experiences.

The DC-TTS architecture is based on the paper:
> Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara. "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention" ([arXiv:1710.08969](https://arxiv.org/abs/1710.08969))

## Model Architecture

This model consists of two components:

### Text2Mel Network
Converts a character sequence into a coarse mel-spectrogram, generated frame by frame with attention over the input text.

| Parameter | Value |
|-----------|-------|
| Embedding Dimension (e) | 128 |
| Hidden Unit Dimension (d) | 512 |
| Vocabulary | `PE abcdefghijklmnopqrstuvwxyz'.,!?` |
| Max Characters (N) | 259 |
| Max Mel Frames (T) | 326 |
| Basic Block Type | Gated Convolution |
| Normalization | Layer Normalization |
| Dropout Rate | 0.05 |
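
To illustrate how the vocabulary in the table maps text to model input, here is a minimal sketch of character encoding. It assumes that `P` is the padding symbol and `E` the end-of-sequence symbol, and that indices follow the order of the vocabulary string, which is presumably how `hp.char2idx` is built. The names `TEXT_CHARS` and `encode` are hypothetical and not part of the project.

```python
# Vocabulary copied from the table above. "P" (padding) and "E" (end of
# sequence) are assumed to be special symbols rather than letters.
VOCAB = "PE abcdefghijklmnopqrstuvwxyz'.,!?"
PAD, EOS = "P", "E"

# Assumption: indices follow the order of the vocabulary string.
char2idx = {ch: i for i, ch in enumerate(VOCAB)}

# Characters that may appear in normalized input text.
TEXT_CHARS = set(VOCAB) - {PAD, EOS}


def encode(text):
    """Map already-normalized text to a list of indices, appending EOS."""
    unsupported = sorted(set(text) - TEXT_CHARS)
    if unsupported:
        raise ValueError(f"unsupported characters: {unsupported}")
    return [char2idx[ch] for ch in text + EOS]


print(encode("hi!"))
```

Uppercase letters and digits are rejected here, so text has to be normalized before encoding.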

### SSRN (Spectrogram Super-Resolution Network)
Upsamples the coarse mel-spectrogram from Text2Mel to a full-resolution linear spectrogram, which is then converted to a waveform with Griffin-Lim.

| Parameter | Value |
|-----------|-------|
| Hidden Unit Dimension (c) | 640 (512 + 128) |
| Number of Mel Bins (f) | 80 |
| FFT Points | 2048 |
| Full Spectrogram Dimension | 1025 |
| Reduction Rate | 4 |
| Basic Block Type | Residual |
| Normalization | Weight Normalization |
| Weight Initialization | Kaiming |

### Audio Parameters

| Parameter | Value |
|-----------|-------|
| Sample Rate | 22050 Hz |
| Frame Shift | 0.0125s (12.5ms) |
| Frame Length | 0.05s (50ms) |
| Hop Length | 276 samples |
| Win Length | 1102 samples |
| Power | 1.5 |
| Preemphasis | 0.97 |
| Max dB | 100 |
| Reference dB | 20 |
| Griffin-Lim Iterations | 50 |
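
Several values in these tables can be derived from the others. The sketch below works through the arithmetic. It assumes that each Text2Mel frame spans `REDUCTION_RATE` hops of the full-rate spectrogram, which is the usual DC-TTS convention but is not stated in this card. Note that 22050 × 0.0125 = 275.625, so the listed hop length of 276 matches rounding, while integer truncation would give 275; check the project's hyperparameters for the value actually used.

```python
# Values copied from the tables above.
SAMPLE_RATE = 22050      # Hz
FRAME_SHIFT = 0.0125     # seconds
FRAME_LENGTH = 0.05      # seconds
N_FFT = 2048
REDUCTION_RATE = 4
MAX_MEL_FRAMES = 326

hop_exact = SAMPLE_RATE * FRAME_SHIFT         # 275.625 samples
hop_truncated = int(hop_exact)                # 275
hop_rounded = round(hop_exact)                # 276, the value in the table
win_length = int(SAMPLE_RATE * FRAME_LENGTH)  # 1102
n_freq_bins = N_FFT // 2 + 1                  # 1025 linear-frequency bins


def mel_frames_to_seconds(n_mel_frames, hop_length=hop_rounded):
    """Approximate audio duration covered by n coarse mel frames.

    Assumption: one coarse mel frame covers REDUCTION_RATE hops.
    """
    return n_mel_frames * REDUCTION_RATE * hop_length / SAMPLE_RATE


print(hop_truncated, hop_rounded, win_length, n_freq_bins)
print(f"{mel_frames_to_seconds(MAX_MEL_FRAMES):.1f} s")
```

Under these assumptions, the maximum of 326 mel frames corresponds to roughly 16 seconds of audio.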

## Files

- `t2m_step-102000_first.pth` - Text2Mel model checkpoint
- `ssrn.pth` - SSRN model checkpoint

## Usage

The snippet below assumes that the `modules.dctts` package is importable (for example, from a checkout of the Deepstory repository) and that both checkpoint files are in the working directory.

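```python
import numpy as np
import torch
from modules.dctts import Text2Mel, SSRN, hp, spectrogram2wav
# normalize_text is not defined in this snippet. It is assumed to be the text
# normalization helper used by the Deepstory project; import it from wherever
# it lives in your checkout.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the Text2Mel and SSRN checkpoints
text2mel = Text2Mel(hp.vocab).to(device).eval()
text2mel.load_state_dict(torch.load('t2m_step-102000_first.pth', map_location=device)['state_dict'])

ssrn = SSRN().to(device).eval()
ssrn.load_state_dict(torch.load('ssrn.pth', map_location=device)['state_dict'])

# Synthesize speech
def synthesize(text, timeout=10000):
    """Synthesize `text` and return the waveform produced by spectrogram2wav."""
    normalized_text = normalize_text(text) + "E"  # E: EOS
    # Character indices with shape (1, N). np.int64 is used instead of
    # np.long, which was removed in NumPy 1.24.
    L = torch.from_numpy(np.array([[hp.char2idx[char] for char in normalized_text]], np.int64)).to(device)
    # All-zero first mel frame used to start autoregressive decoding
    zeros = torch.from_numpy(np.zeros((1, hp.n_mels, 1), np.float32)).to(device)
    Y = zeros

    with torch.no_grad():
        for i in range(timeout):
            _, Y_t, A = text2mel(L, Y, monotonic_attention=True)
            Y = torch.cat((zeros, Y_t), -1)
            # Stop once the newest frame attends most strongly to the EOS character
            _, attention = torch.max(A[0, :, -1], 0)
            if L[0, attention.item()] == hp.vocab.index('E'):
                break

        # Upsample the mel-spectrogram to a full spectrogram
        _, Z = ssrn(Y)
        Z = Z.cpu().numpy()

    # Convert the spectrogram to a waveform
    wav = spectrogram2wav(Z[0, :, :].T)
    return wav
```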
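
The usage snippet returns a waveform but does not save it. As a minimal sketch, the helper below writes a sequence of float samples to a 16-bit mono WAV file using only the Python standard library. The function name `write_wav` is hypothetical, and the sketch assumes the samples are floats roughly in the range [-1, 1]; values outside that range are clipped.

```python
import sys
import wave
from array import array


def write_wav(path, samples, sample_rate=22050):
    """Write float samples as a 16-bit mono PCM WAV file.

    Assumption: samples are floats nominally in [-1, 1]; anything outside
    that range is clipped before conversion.
    """
    pcm = array(
        "h",
        (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples),
    )
    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        pcm.byteswap()
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 16-bit samples
        f.setframerate(sample_rate)  # 22050 Hz, per the Audio Parameters table
        f.writeframes(pcm.tobytes())
```

With the models loaded, a call such as `write_wav("geralt.wav", synthesize("wind's howling."))` would save the result, provided the text normalization helper accepts that input.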

## Training Data

The model was trained on audio samples of Geralt's voice from The Witcher 3: Wild Hunt video game.

## Intended Use

This model is intended for:
- Research and experimentation in speech synthesis
- Creative projects and fan content
- Educational purposes

## Limitations

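- The model supports English text only
- The vocabulary is limited to lowercase letters `a`-`z`, the space character, and the punctuation `'.,!?`; uppercase letters, digits, and other symbols must be normalized or removed before synthesis
- Audio quality may degrade on long or unusual inputs; Text2Mel was configured with a maximum of 259 characters and 326 mel frames, so longer utterances may be less reliable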
- The character voice is based on copyrighted material
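
Because of the vocabulary restriction above, input text needs cleaning before synthesis. The following is a rough illustration only: `simple_normalize` is a hypothetical stand-in, not the project's own normalization routine, and it simply lowercases, strips accents, and replaces unsupported characters with spaces. A real pipeline would likely also spell out numbers and abbreviations.

```python
import re
import unicodedata

# Characters the model accepts as text (vocabulary minus the special symbols).
ALLOWED = set(" abcdefghijklmnopqrstuvwxyz'.,!?")


def simple_normalize(text):
    """Reduce arbitrary text to the model's character set (illustrative only)."""
    # Lowercase, then split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks, keeping the base letters.
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything outside the vocabulary with a space.
    filtered = "".join(ch if ch in ALLOWED else " " for ch in no_accents)
    # Collapse runs of spaces and trim the ends.
    return re.sub(r" +", " ", filtered).strip()


print(simple_normalize("Wind's howling, Géralt 42!"))
```

Digits are dropped rather than spelled out here, which changes the spoken content; that trade-off keeps the sketch short.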

## Citation

If you use this model, please cite the original DC-TTS paper and the Deepstory project:

```bibtex
@article{tachibana2018efficiently,
  title={Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention},
  author={Tachibana, Hideyuki and Uenoyama, Katsuya and Aihara, Shunsuke},
  journal={arXiv preprint arXiv:1710.08969},
  year={2018}
}

@misc{deepstory,
  author = {Siu King Wai},
  title = {Deepstory},
  year = {2020},
  publisher = {GitHub},
  url = {https://github.com/thetobysiu/deepstory}
}
```

## License

This model is released under the MIT License. Please note that the voice characteristics are based on copyrighted material from The Witcher 3: Wild Hunt.

## Acknowledgments

- Original DC-TTS implementation: [tugstugi/pytorch-dc-tts](https://github.com/tugstugi/pytorch-dc-tts)
- The Witcher 3: Wild Hunt by CD Projekt Red