Coqui TTS has blown my mind!

Linux for blind general discussion blinux-list at
Wed Feb 9 10:25:13 UTC 2022

Hello everyone,

Maybe I'm just discovering America here, but yesterday I came across
this more or less by chance:

And the voice has completely blown my mind!

Like, I knew the TTS field had advanced significantly in recent
years, but I thought the new neural voices were mostly closed features of
companies like Google or Microsoft.

I had no idea we had something so beautiful on Linux and completely

Plus, it's not just the license that makes this so interesting, but also
the usability.

There were the DeepMind papers even before, and some open projects trying
to implement them, but the level of completeness and usability varied
significantly. Even when a project was usable, getting it to work required
some effort (at least for the projects I saw).

With Coqui, the situation is completely different.

As the above-mentioned blog says, all you need to do is:

$ pip3 install TTS

$ tts --text "Hello, this is an experimental sentence."

And you have a synthesized result!
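
For scripting, the same command can be driven from Python. This is only a
minimal sketch: it assumes the tts executable installed above is on the
PATH and that it accepts an --out_path option for the output file. The
helper names build_tts_command and synthesize are made up for
illustration.

```python
import subprocess


def build_tts_command(text, out_path="output.wav"):
    """Return the argument list for the tts command-line tool."""
    # Assumption: --out_path selects where the WAV file is written.
    return ["tts", "--text", text, "--out_path", out_path]


def synthesize(text, out_path="output.wav"):
    """Run the tts CLI and return the path of the generated file.

    Raises subprocess.CalledProcessError if the command fails.
    """
    subprocess.run(build_tts_command(text, out_path), check=True)
    return out_path


# Example call (needs the TTS package installed):
# synthesize("Hello, this is an experimental sentence.")
```

Passing the arguments as a list avoids shell quoting problems when the
text contains quotes or other special characters.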

Or you can launch the server:

$ tts-server

And play with it in the web browser. Note that the audio is sent only
after it's fully synthesized, so you'll need to wait a bit before you can
listen to it.
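
The server can presumably be queried from a script as well. The sketch
below rests on assumptions not stated in this message: that the server
listens on localhost port 5002 and answers GET requests at /api/tts with
the text passed as a "text" query parameter. Check the address that
tts-server prints at startup before relying on it. The function names are
hypothetical.

```python
import urllib.parse
import urllib.request


def tts_url(text, host="localhost", port=5002):
    """Build the request URL for the (assumed) /api/tts endpoint."""
    query = urllib.parse.urlencode({"text": text})
    return f"http://{host}:{port}/api/tts?{query}"


def fetch_wav(text, out_path="server_output.wav"):
    """Download the synthesized audio and save it to out_path.

    The call blocks until the server has finished synthesizing,
    because the audio is only sent once it is complete.
    """
    with urllib.request.urlopen(tts_url(text)) as response:
        data = response.read()
    with open(out_path, "wb") as wav_file:
        wav_file.write(data)
    return out_path


# Example call (needs a running tts-server):
# fetch_wav("Hello, this is an experimental sentence.")
```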

The only problematic part is the limit of decoder steps, which is set to
500 by default.

I'm not sure why they set it so low. With this value, the TTS is
unable to speak longer sentences.

Fortunately, the fix is very easy. All I needed to do was to open

and modify the line:

     max_decoder_steps: int = 500

to:

     max_decoder_steps: int = 0

which seems to disable the limit.
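
The manual edit could also be scripted, for example to reapply it after
upgrading the package. This is a rough sketch, not an official
procedure: the helper names are invented, and the path of the file to
patch has to be supplied by you, since it depends on where pip installed
the package.

```python
import re

# Matches the default value on the max_decoder_steps line, keeping indentation.
_LIMIT_LINE = re.compile(
    r"^([ \t]*max_decoder_steps:[ \t]*int[ \t]*=[ \t]*)\d+", re.MULTILINE
)


def disable_decoder_limit(source):
    """Return (patched_source, number_of_lines_changed)."""
    # \g<1> re-inserts everything before the number; 0 replaces the old limit.
    return _LIMIT_LINE.subn(r"\g<1>0", source)


def patch_file(path):
    """Rewrite the file at path in place; returns how many lines changed."""
    with open(path, encoding="utf-8") as config_file:
        patched, changed = disable_decoder_limit(config_file.read())
    if changed:
        with open(path, "w", encoding="utf-8") as config_file:
            config_file.write(patched)
    return changed
```

Keeping a backup copy of the file before running patch_file on it is a
good idea, since the change is made in place.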

After this step, I can synthesize very long sentences, and the quality
is absolutely gorgeous!

So I wanted to share. I may actually be the last person here to discover
it, though I did not see it mentioned in TTS discussions on this list.

I've even thought about creating a Speech Dispatcher module for this. It
would certainly be doable, though I'm worried about what the synthesis
would sound like with the irregular fragments of text a screen reader
produces during navigation. These voices are intended for reading longer
texts and consistent phrases, with punctuation, complete information etc.

The intonation would probably get a bit weird with, for example, just
half a sentence, as happens when navigating a document or webpage line by
line.

Another limitation would be speed. On my laptop, the real-time
factor (processing duration / audio length) is around 0.8, which means it
could handle real-time synthesis at the default speed without delays.

The situation would get more complicated with higher speeds, though.

It wouldn't be impossible, but one would need a GPU to handle
significantly higher speech rates.
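
To put a number on it, here is a small sketch of the arithmetic. It
assumes that the processing time for a given text stays the same while
the speech is played back faster, so the highest rate the synthesizer can
keep up with is simply the inverse of the real-time factor. The function
names are only illustrative.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing duration divided by the length of the produced audio."""
    return processing_seconds / audio_seconds


def max_realtime_speedup(rtf):
    """Highest speech-rate multiplier the synthesis can keep up with.

    Assumes the processing time does not change with the speech rate.
    """
    return 1.0 / rtf


# Example: 8 seconds of processing for 10 seconds of audio.
rtf = real_time_factor(8.0, 10.0)
speedup = max_realtime_speedup(rtf)
```

With a factor of 0.8, that leaves room for only about a 1.25x speech
rate, which is well below what many screen reader users are used to.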

So I wonder.

But anyway, this definitely made my day. :)

Best regards

