The Poor Man's TTS (Experimental)
Tick for English synthesis, leave unchecked for Russian.
Select a pre-defined reference voice.
Select an example to load into the text box.
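If you want to drive these controls from a script instead of the UI, the gradio_client package can call the Space directly. The Space name, argument order, and endpoint below are placeholders, so check the demo's "Use via API" page for the real signature:

```python
from gradio_client import Client

# Placeholder Space name and argument order; consult the demo's API page.
client = Client("username/poor-mans-tts")
audio_path = client.predict(
    "Your sentence to synthesize.",  # contents of the text box
    True,                            # English checkbox (False = Russian)
    "Default Voice 1",               # pre-defined reference voice from the dropdown
    api_name="/predict",
)
print(audio_path)  # path to the generated audio file
```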
Quick Notes:
This is run on a single RTX 3090.
These networks can only generate natural speech with correct intonation (i.e., generating NSFW content, non-speech sounds, stutters, etc. doesn't work).
Make sure your inputs are not too short (at least a sentence long).
I will gradually post updates here and on GitHub.
Everything in this demo & the repo (coming soon) is experimental. The main idea is just playing around with different things to see what works when you're limited to training on a pair of RTX 3090s.
The data used for the English model is rough and pretty tough for any TTS model (think debates, real conversations, plus a little bit of cleaner professional performances). It mostly comes from public sources or third parties (no TOS signed). I'll probably write a blog post later with more details.
So far I've focused on English and Russian; more languages can be covered later.
Voice-Guided Tab (Using Audio Reference)
Options:
- Default Voices: Pick one from the dropdown (these are stored locally).
- Upload Audio: While the data isn't nearly enough for zero-shotting, you can still test your own samples. Try decreasing the beta if the output doesn't sound similar to your reference.
- Speaker ID: Use a number (RU: 0-196, EN: 0-2006) to grab a random clip of that speaker from the server's dataset. Hit 'Randomize' to explore. (Invalid IDs use a default voice on the server).
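If you're scripting speaker exploration, here is a minimal sketch for staying inside those ID ranges (the ranges come from the note above; the helper itself is made up):

```python
import random

# Valid ID ranges as stated above: RU 0-196, EN 0-2006 (inclusive).
SPEAKER_ID_RANGES = {"ru": (0, 196), "en": (0, 2006)}

def random_speaker_id(lang: str) -> int:
    """Return a random valid speaker ID for the given language."""
    lo, hi = SPEAKER_ID_RANGES[lang]
    return random.randint(lo, hi)  # randint is inclusive on both ends

print(random_speaker_id("en"))  # e.g. 1427
```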
Some notes:
- Not all speakers are equal. Randomized samples might give you a poor reference sometimes.
- IDs are not stable: the base model didn't require speaker labels, so IDs were generated automatically; the same ID can therefore give you different speakers.
- Play with Beta: values from 0.2 to 0.9 can work well. Higher beta = LESS like the reference. It works great for some voices and breaks others, so please try different values; a quick sweep is sketched below. (0 = diffusion off).
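Since the best beta is voice-dependent, sweeping it is the easiest way to find a good value. A sketch, assuming a synthesize() call that stands in for whatever inference entry point you actually use:

```python
import numpy as np

def synthesize(text: str, reference_wav: str, beta: float) -> np.ndarray:
    """Hypothetical stand-in for the real inference call; returns dummy audio."""
    return np.zeros(44100, dtype=np.float32)  # replace with the actual model call

# beta=0 disables diffusion; higher values drift further from the reference.
for beta in (0.0, 0.2, 0.5, 0.7, 0.9):
    audio = synthesize("Your test sentence goes here.", "my_voice.wav", beta)
    print(f"beta={beta}: {audio.shape[0]} samples")  # listen and compare per beta
```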
Text-Guided Tab (Style is conditioned on the content of the text)
- Intuition: it figures out the voice style from the text itself (using semantic encoders). No audio is needed, which makes it suitable for real-time use cases.
- Speaker Prefix: for Russian, use 'Speaker_' + number + ':'. For English, you can use any name; names were randomly assigned during the training of the encoder. See the examples below.
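For example (the specific number and name below are arbitrary):

```python
# Illustrative text-guided inputs; the prefix steers the inferred style.
ru_input = "Speaker_7: Привет! Сегодня отличный день."  # Russian: 'Speaker_' + number + ':'
en_input = "Emma: Well, that didn't go as planned."     # English: any name works
```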
General Tips
- Punctuation matters for intonation; don't use unsupported symbols.
Model Details (The Guts)
Darya (Russian Model) - More Stable
Generally more controlled than the English model, which is also why its acoustic quality should sound much better.
- Setup: Non-End-to-End (separate steps).
- Components:
- Style Encoder: Conformer-based.
- Duration Predictor: Conformer-based (with cross-attention).
- Semantic Encoder: RuModernBERT-base (for text guidance).
- Diffusion Sampler: **Yes**.
- Vocoder: RiFornet.
- Training: ~200K steps on ~320 hours of Russian data (mix of conversation & narration, hundreds of speakers).
- Size: Lightweight (under ~200M params).
- Specs: 44.1kHz output, 128 mel bins.
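For reference, here is what those front-end specs (44.1kHz, 128 mel bins) look like with librosa; the FFT and hop sizes are common defaults I picked for illustration, not confirmed values from the repo:

```python
import librosa
import numpy as np

# Stated specs: 44.1 kHz audio, 128 mel bins. n_fft/hop_length are assumptions.
y, sr = librosa.load("reference.wav", sr=44100)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, n_fft=2048, hop_length=512)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (128, n_frames)
```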
Kalliope (English Model) - Wild
More expressive potential, but also less predictable. Showed signs of overfitting on the noisy data.
- Setup: Non-End-to-End.
- Components:
- Style Encoder: Conformer-based.
- Text Encoder: ConvNextV2.
- Duration Predictor: Conformer-based (with cross-attention).
- Acoustic Decoder: Conformer-based.
- Semantic Encoder: DeBERTa V3 Base (for text guidance).
- Diffusion Sampler: Yes.
- Vocoder: RiFornet.
- Training: ~100K steps on ~300-400 hours of very complex & noisy English data (conversational, whisper, narration, wide emotion range).
- Size: Bigger (~1.2B params total, but not all active at once; training was surprisingly doable). Hidden dim 1024, style vector 512.
- Specs: 44.1kHz output, 128 mel bins (though more than half the dataset was 22-24kHz or even phone-call quality).
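To make the non-end-to-end setup concrete, here is a rough sketch of how the components listed above could chain together at inference time. Every function name is a hypothetical stand-in rather than the actual repo API, and the exact dataflow between stages is my guess:

```python
import numpy as np

def synthesize_en(text: str, reference_wav: np.ndarray | None = None,
                  beta: float = 0.5) -> np.ndarray:
    """Hypothetical wiring of the Kalliope stages; bind each call to a real module."""
    semantics = semantic_encoder(text)               # DeBERTa V3 Base features
    phones = text_encoder(text)                      # ConvNextV2 text representation
    style = (style_encoder(reference_wav)            # Conformer-based, from audio...
             if reference_wav is not None
             else style_from_text(semantics))        # ...or text-guided style
    durations = duration_predictor(phones, style)    # Conformer + cross-attention
    mel = acoustic_decoder(phones, durations, style) # 128-bin mel frames
    if beta > 0:                                     # beta=0 means diffusion off
        mel = diffusion_sampler(mel, style, beta=beta)
    return vocoder(mel)                              # RiFornet -> 44.1 kHz waveform
```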
More details might show up in a blog post later.