Best TTS for Video

Heya all

I've been testing the best TTS tools out there for use in videos.
For Eleven Labs and Fish Audio I used the paid versions, and with both I was able to create a custom voice and convert text to speech.

My findings:

Eleven Labs
  • great voice; it can vary intonation and understands the text
  • very good end result
  • paid version
  • limited number of chars, e.g. 30k chars for $5; you end up using them up quite quickly
Fish Audio
  • good voice, it gets close to Eleven Labs
  • but it does not handle all text; if the text contains certain characters, it goes wild
  • I got some strange "shush"-type sounds that were very weird (no voice, just something like wind noise)
  • the paid plan is listed as unlimited usage on their pricing page
    • so far I have not been charged extra, and I've generated heaps of audio
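
One workaround for the "goes wild on some characters" problem is to clean the script before sending it to the TTS tool. I haven't pinned down which characters actually trigger it, so the sketch below is only an assumption: it swaps typographic quotes and dashes for plain ones and drops anything outside a conservative English whitelist. The function name `clean_for_tts` and the whitelist are hypothetical, not something Fish Audio documents.

```python
import re
import unicodedata

# Typographic characters mapped to plain ASCII (an assumed list, extend as needed).
_REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
}

# Anything NOT in this conservative set is replaced by a space.
# Note: this also strips accented letters, so it only suits English scripts.
_NOT_ALLOWED = re.compile(r"[^A-Za-z0-9 .,;:!?'\"()\-\n]")


def clean_for_tts(text: str) -> str:
    """Return a plain-text version of `text` that is less likely to confuse a TTS engine."""
    text = unicodedata.normalize("NFKC", text)   # fold compatibility forms
    for src, dst in _REPLACEMENTS.items():
        text = text.replace(src, dst)
    text = _NOT_ALLOWED.sub(" ", text)           # drop symbols like $, #, *
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces
    return text.strip()


if __name__ == "__main__":
    print(clean_for_tts("Price: $5 \u2013 \u201cgreat\u201d"))
```

If the strange sounds still show up after cleaning, the cause is probably something other than the characters themselves.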

XTTS-v2 & other open source
  • quite time-consuming to set up
  • sometimes you'll get errors about missing packages
  • very slow at generating even small amounts of audio, e.g. 2 sentences took about 2 minutes on a powerful PC
  • I gave up because of the time it wasted
  • audio quality was mediocre compared to the other two above
  • the only one that gets close to Fish Audio is XTTS-v2, but personally I'm not impressed

I also used TTS Arena https://huggingface.co/spaces/TTS-AGI/TTS-Arena to check the current live leaderboard, and Fish Audio gets pretty close there too.

My conclusion:
  • Eleven Labs is the one to use for high-quality sound; it's quite impressive.
    • But at scale, I think the costs might build up quite fast.
  • Fish Audio gets close to Eleven Labs in quality, but it needs more tweaking and appears to have some bugs.
    • I'll report back as I get more insight into those bugs.
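
To put a rough number on "costs might build up quite fast", here is a minimal sketch based on the Eleven Labs figure above ($5 for 30k chars). It assumes pricing scales linearly (real plans have tiers, so treat this as a ballpark) and that narration runs at about 900 characters per minute, which is my guess and not a measured value. The names `tts_cost` and `script_chars` are made up for this example.

```python
def tts_cost(total_chars: int, price_usd: float = 5.0, chars_included: int = 30_000) -> float:
    """Linear estimate: price_usd buys chars_included characters (figures from the post)."""
    return total_chars * price_usd / chars_included


def script_chars(minutes: float, chars_per_minute: int = 900) -> int:
    """Approximate script length; 900 chars/min is an assumed narration pace."""
    return round(minutes * chars_per_minute)


if __name__ == "__main__":
    # Cost of narrating batches of 10-minute videos under the assumptions above.
    for videos in (1, 10, 100):
        chars = videos * script_chars(10)
        print(f"{videos:>3} x 10-min videos: {chars:>9,} chars, about ${tts_cost(chars):,.2f}")
```

Swap in your own plan's price and character allowance to compare tools; re-generations after a bad take count against the allowance too, so real usage will run higher than the script length alone suggests.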

Has anybody else tested TTS tools? What are your findings / feedback so far?
 
Great post! We've been chatting about this privately, so it's great to make it all public. My current experience:

Eleven Labs: The best solution for mission-critical audio; when used well, it delivers nearly undetectable results.
Audiosonic: By Writesonic. It has some really high-quality voices, and you can buy "minutes" relatively cheaply.
Natural Reader: Has some realistic voices. For $10/mo you get 500K chars/day of conversion with 1M chars/mo of MP3 downloads.
OpenAI: OpenAI voices are cheaper and can do well on material they are suited to. They added 5 new voices in October. There is talk of a new generation being added to the API soon that will be much higher quality, like Eleven Labs.
GPT-4o: I see comments in chat about some amazing voices it can do, but I've never been able to replicate them.

There are probably better options on the horizon, but as of today I still use Eleven Labs for everything because it has that slight quality edge. It's not cheap at scale yet, though.
 
P.S. @lucian.harhata I do voiceover. If you want to do an SEO test for "TTS vs human voice" just send me scripts and I'll send you human audio.

It would be interesting to test SoundCloud. Bradley Bennett just did a video on using it to get rankings by syndicating AI-generated podcast content. No one has ever definitively proved or disproved that TTS vs. human voice even matters. We should do that and release the results to the community. 💪
 