Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

06/23/2023
by Matthew Le, et al.

Large-scale generative models such as GPT and DALL-E have revolutionized natural language processing and computer vision research. These models not only generate high-fidelity text or image outputs, but are also generalists that can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech given audio context and text, using over 50K hours of speech that are neither filtered nor enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono- or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. See voicebox.metademolab.com for a demo of the model.
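To make the phrase "flow-matching model trained to infill speech" more concrete, the following toy sketch shows how one training example for a masked flow-matching objective might be built. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the objective, so the straight-line (optimal-transport) probability path commonly used in flow matching is assumed, each frame is reduced to a single scalar feature, and the names `SIGMA_MIN`, `make_infill_mask`, `flow_matching_example`, and `masked_mse` are hypothetical.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant; not stated in the abstract


def make_infill_mask(num_frames, start, end):
    """Return 1 for frames to be generated and 0 for frames kept as audio context."""
    return [1 if start <= i < end else 0 for i in range(num_frames)]


def flow_matching_example(x1, mask, t, rng):
    """Build one toy training example for a masked flow-matching objective.

    x1   : target features, one scalar per frame (a simplification)
    mask : 1 for frames to infill, 0 for context frames
    t    : flow time in [0, 1]
    rng  : random.Random instance used to draw the noise sample
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    scale = 1.0 - (1.0 - SIGMA_MIN) * t
    # Point on the straight path between noise (t = 0) and data (t = 1).
    xt = [scale * n + t * d for n, d in zip(x0, x1)]
    # Velocity of that path; this is what a model would be trained to predict.
    target = [d - (1.0 - SIGMA_MIN) * n for n, d in zip(x0, x1)]
    # Audio context given to the model: masked frames are zeroed out.
    context = [d * (1 - m) for d, m in zip(x1, mask)]
    return xt, target, context


def masked_mse(pred, target, mask):
    """Mean squared error computed only over the frames being infilled."""
    total = sum(m * (p - q) ** 2 for p, q, m in zip(pred, target, mask))
    return total / max(1, sum(mask))
```

In this sketch a model would receive `xt`, `context`, the aligned text, and `t`, and its prediction would be scored against `target` with `masked_mse`. Because the mask can sit anywhere in the utterance, the unmasked frames on both sides serve as context, which is one way to read the abstract's remark about conditioning on future context.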


