Palo Alto-based AI startup Zyphra unveiled a pair of open text-to-speech (TTS) models this week said to be capable of cloning your voice with as little as five seconds of sample audio. In our testing, we generated realistic results with less than half a minute of recorded speech.
Founded in 2021 by Danny Martinelli and Krithik Puthalath, the startup aims to build a multimodal agent system called MaiaOS. To date, these efforts have produced its Zamba family of small language models, optimizations such as tree attention, and now its Zonos TTS models.
Weighing in at 1.6 billion parameters each, the models were trained on more than 200,000 hours of speech data, including both neutral-toned speech, such as audiobook narration, and “highly expressive” speech. According to the upstart’s release notes for Zonos, the majority of the data was in English, but there were “substantial” quantities of Chinese, Japanese, French, Spanish, and German. Zyphra tells El Reg this data was acquired from the web and was not obtained from data brokers.
[…]
Zyphra offers a demo environment where you can play with its Zonos models, along with paid API access and subscription plans on its website. But if you’re hesitant to upload your voice to a random startup’s servers, getting the model running locally is relatively easy.
We’ll go into more detail on how to set that up in a bit, but first, let’s take a look at how well it actually works in the wild.
To test it out, we spun up Zyphra’s Zonos demo locally on an Nvidia RTX 6000 Ada Generation graphics card. We then uploaded 20- to 30-second clips of ourselves reading a random passage of text, and fed them into the Zonos-v0.1 transformer and hybrid models along with a text prompt of roughly 50 words, leaving all hyperparameters at their defaults. The goal is to have the trained model predict your voice from the provided sample recording and prompt, and output the result as an audio file.
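If you plan to script this rather than use the demo, it can help to check the length of a reference clip before handing it to the model. The snippet below is a minimal sketch using only Python's standard library, and it is not part of Zonos. It assumes the clip is an uncompressed PCM WAV file; the function names and the `MIN_SECONDS` constant are our own, with the five-second floor taken from Zyphra's claim rather than from any limit enforced by the software.

```python
import sys
import wave

# Assumption: five seconds is the minimum Zyphra says is needed for cloning.
MIN_SECONDS = 5.0


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration is the number of audio frames divided by frames per second.
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, min_seconds: float = MIN_SECONDS) -> bool:
    """Return True if the clip lasts at least min_seconds."""
    return clip_duration_seconds(path) >= min_seconds


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_clip.py my_voice.wav
    clip = sys.argv[1]
    seconds = clip_duration_seconds(clip)
    verdict = "OK" if is_long_enough(clip) else "too short"
    print(f"{clip}: {seconds:.1f}s ({verdict})")
```

Clips in compressed formats such as MP3 would need to be converted to WAV first, since the `wave` module only reads uncompressed PCM audio.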
Using a 24-second sample clip, we were able to achieve a voice clone good enough to fool close friends and family, at least at first blush. After we revealed that the clip was AI generated, they noted that the pacing and speed of the speech felt a little off, and said they would likely have caught on to the fact the audio wasn’t authentic given a longer clip.
[…]
If you’d like to use Zonos to clone your own voice, deploying the model is relatively easy, assuming you’ve got a compatible GPU and some familiarity with Linux and containerization.
[…]
Source: Zyphra’s speech model can clone your voice with 5s of audio • The Register

Robin Edgar
Organisational Structures | Technology and Science | Military, IT and Lifestyle consultancy | Social, Broadcast & Cross Media | Flying aircraft