The Poor Man's TTS (Experimental)
Tick for English synthesis, leave unchecked for Russian.
Select a pre-defined reference voice.
Select an example to load into the text box.
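If you want to drive these controls from a script instead of the UI, the gradio_client package can call the Space directly. The Space name, argument order, and endpoint below are placeholders, so check the demo's "Use via API" page for the real signature:

```python
from gradio_client import Client

# Placeholder Space name and argument order; consult the demo's API page.
client = Client("username/poor-mans-tts")
audio_path = client.predict(
    "Your sentence to synthesize.",  # contents of the text box
    True,                            # English checkbox (False = Russian)
    "Default Voice 1",               # pre-defined reference voice from the dropdown
    api_name="/predict",
)
print(audio_path)  # path to the generated audio file
```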
Quick Notes:
This is run on a single RTX 3090.
These networks can only generate natural speech with correct intonation (i.e., generating NSFW content, non-speech sounds, stutters, etc. doesn't work).
Make sure your inputs are not too short (at least a sentence long).
I will gradually post updates here and on GitHub.
Everything in this demo & the repo (coming soon) is experimental. The main idea is just playing around with different things to see what works when you're limited to training on a pair of RTX 3090s.
The data used for the English model is rough and pretty tough for any TTS model (think debates, real conversations, plus a little bit of cleaner professional performances). It mostly comes from public sources or third parties (no TOS signed). I'll probably write a blog post later with more details.
So far I've focused on English and Russian; more languages can be covered later.
Voice-Guided Tab (Using Audio Reference)
Options:
- Default Voices: Pick one from the dropdown (these are stored locally).
- Upload Audio: While the data isn't nearly enough for zero-shotting, you can still test your own samples. Try decreasing the beta if the output doesn't sound similar to your reference.
- Speaker ID: Use a number (RU: 0-196, EN: 0-2006) to grab a random clip of that speaker from the server's dataset. Hit 'Randomize' to explore. (Invalid IDs use a default voice on the server).
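If you're scripting speaker exploration, here is a minimal sketch for staying inside those ID ranges (the ranges come from the note above; the helper itself is made up):

```python
import random

# Valid ID ranges as stated above: RU 0-196, EN 0-2006 (inclusive).
SPEAKER_ID_RANGES = {"ru": (0, 196), "en": (0, 2006)}

def random_speaker_id(lang: str) -> int:
    """Return a random valid speaker ID for the given language."""
    lo, hi = SPEAKER_ID_RANGES[lang]
    return random.randint(lo, hi)  # randint is inclusive on both ends

print(random_speaker_id("en"))  # e.g. 1427
```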
Some notes:
- Not all speakers are equal. Randomized samples might give you a poor reference sometimes.
- IDs are not stable: the base model didn't require speaker labels, so IDs were generated automatically; the same ID can therefore give you different speakers.
- Play with Beta: values from 0.2 to 0.9 can work well. Higher beta = LESS like the reference. It works great for some voices and breaks others, so please try different values; a quick sweep is sketched below. (0 = diffusion off).
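Since the best beta is voice-dependent, sweeping it is the easiest way to find a good value. A sketch, assuming a synthesize() call that stands in for whatever inference entry point you actually use:

```python
import numpy as np

def synthesize(text: str, reference_wav: str, beta: float) -> np.ndarray:
    """Hypothetical stand-in for the real inference call; returns dummy audio."""
    return np.zeros(44100, dtype=np.float32)  # replace with the actual model call

# beta=0 disables diffusion; higher values drift further from the reference.
for beta in (0.0, 0.2, 0.5, 0.7, 0.9):
    audio = synthesize("Your test sentence goes here.", "my_voice.wav", beta)
    print(f"beta={beta}: {audio.shape[0]} samples")  # listen and compare per beta
```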
Text-Guided Tab (Style is conditioned on the content of the text)
- Intuition: it figures out the voice style from the text itself (using semantic encoders). No audio is needed, which makes it suitable for real-time use cases.
- Speaker Prefix: for Russian, use 'Speaker_' + number + ':'. For English, you can use any name; names were randomly assigned during the training of the encoder. See the examples below.
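For example (the specific number and name below are arbitrary):

```python
# Illustrative text-guided inputs; the prefix steers the inferred style.
ru_input = "Speaker_7: Привет! Сегодня отличный день."  # Russian: 'Speaker_' + number + ':'
en_input = "Emma: Well, that didn't go as planned."     # English: any name works
```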
General Tips
- Punctuation matters for intonation; don't use unsupported symbols.
Model Details (The Guts)
Darya (Russian Model) - More Stable
Generally more controlled than the English model, which is also why its acoustic quality should sound much better.
- Setup: Non-End-to-End (separate steps).
- Components:
- Style Encoder: Conformer-based.
- Duration Predictor: Conformer-based (with cross-attention).
- Semantic Encoder: RuModernBERT-base (for text guidance).
- Diffusion Sampler: **Yes**.
- Vocoder: RiFornet.
- Training: ~200K steps on ~320 hours of Russian data (mix of conversation & narration, hundreds of speakers).
- Size: Lightweight (under ~200M params).
- Specs: 44.1kHz output, 128 mel bins.
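For reference, here is what those front-end specs (44.1kHz, 128 mel bins) look like with librosa; the FFT and hop sizes are common defaults I picked for illustration, not confirmed values from the repo:

```python
import librosa
import numpy as np

# Stated specs: 44.1 kHz audio, 128 mel bins. n_fft/hop_length are assumptions.
y, sr = librosa.load("reference.wav", sr=44100)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, n_fft=2048, hop_length=512)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (128, n_frames)
```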
Kalliope (English Model) - Wild
More expressive potential, but also less predictable. Showed signs of overfitting on the noisy data.
- Setup: Non-End-to-End.
- Components:
- Style Encoder: Conformer-based.
- Text Encoder: ConvNextV2.
- Duration Predictor: Conformer-based (with cross-attention).
- Acoustic Decoder: Conformer-based.
- Semantic Encoder: DeBERTa V3 Base (for text guidance).
- Diffusion Sampler: Yes.
- Vocoder: RiFornet.
- Training: ~100K steps on ~300-400 hours of very complex & noisy English data (conversational, whisper, narration, wide emotion range).
- Size: Bigger (~1.2B params total, but not all active at once; training was surprisingly doable). Hidden dim 1024, style vector 512.
- Specs: 44.1kHz output, 128 mel bins (though more than half the dataset was 22-24kHz or even phone-call quality).
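To make the non-end-to-end setup concrete, here is a rough sketch of how the components listed above could chain together at inference time. Every function name is a hypothetical stand-in rather than the actual repo API, and the exact dataflow between stages is my guess:

```python
import numpy as np

def synthesize_en(text: str, reference_wav: np.ndarray | None = None,
                  beta: float = 0.5) -> np.ndarray:
    """Hypothetical wiring of the Kalliope stages; bind each call to a real module."""
    semantics = semantic_encoder(text)               # DeBERTa V3 Base features
    phones = text_encoder(text)                      # ConvNextV2 text representation
    style = (style_encoder(reference_wav)            # Conformer-based, from audio...
             if reference_wav is not None
             else style_from_text(semantics))        # ...or text-guided style
    durations = duration_predictor(phones, style)    # Conformer + cross-attention
    mel = acoustic_decoder(phones, durations, style) # 128-bin mel frames
    if beta > 0:                                     # beta=0 means diffusion off
        mel = diffusion_sampler(mel, style, beta=beta)
    return vocoder(mel)                              # RiFornet -> 44.1 kHz waveform
```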
More details might show up in a blog post later.