Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

06/01/2023
by Hubert Siuzdak, et al.

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. A Fourier-based time-frequency representation is an appealing alternative: it aligns more accurately with human auditory perception and benefits from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has historically been problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that addresses the key challenges of modeling spectral coefficients. Vocos demonstrates improved computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. As shown by objective evaluation, Vocos not only matches state-of-the-art audio quality but, thanks to its frequency-aware generator, also effectively mitigates the periodicity issues frequently associated with time-domain GANs. The source code and model weights have been open-sourced at https://github.com/charactr-platform/vocos.
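To make the phase recovery issue mentioned in the abstract concrete, the following is a minimal NumPy sketch, not the Vocos implementation. It assumes that spectral coefficients are mapped back to a waveform with an inverse short-time Fourier transform (ISTFT). The helper names `stft` and `istft`, the Hann window, and the values of `N_FFT` and `HOP` are illustrative choices made here, not details taken from the paper.

```python
import numpy as np

N_FFT = 256  # frame length (illustrative value)
HOP = 64     # hop size, i.e. 75% overlap (illustrative value)


def stft(x, n_fft=N_FFT, hop=HOP):
    """Return complex STFT coefficients with shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)


def istft(spec, n_fft=N_FFT, hop=HOP):
    """Invert an STFT by windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        start = i * hop
        out[start:start + n_fft] += frames[i] * window
        norm[start:start + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)


# Test signal: a 440 Hz sine sampled at 16 kHz.
t = np.arange(4096) / 16000.0
x = np.sin(2 * np.pi * 440.0 * t)

spec = stft(x)
magnitude, phase = np.abs(spec), np.angle(spec)

# With both magnitude and phase, a single ISTFT recovers the waveform.
y_with_phase = istft(magnitude * np.exp(1j * phase))

# With the phase discarded (set to zero), the same ISTFT does not.
y_without_phase = istft(magnitude.astype(complex))

# Compare away from the signal edges, where the window overlap is incomplete.
interior = slice(N_FFT, -N_FFT)
err_with_phase = np.max(np.abs(x[interior] - y_with_phase[interior]))
err_without_phase = np.max(np.abs(x[interior] - y_without_phase[interior]))
print(err_with_phase, err_without_phase)
```

In this sketch the reconstruction with the true phase matches the input up to floating-point error, while the zero-phase reconstruction differs substantially. This illustrates why a generator that outputs spectral coefficients has to get the phase right, and also why, once it does, the waveform follows from one inverse transform rather than a stack of learned upsampling layers.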

research
02/11/2019

Adversarial Generation of Time-Frequency Features with application in audio synthesis

Time-frequency (TF) representations provide powerful and intuitive featu...
research
06/16/2020

Comparing Representations for Audio Synthesis Using Generative Adversarial Networks

In this paper, we compare different audio signal representations, includ...
research
06/21/2019

The Shape of RemiXXXes to Come: Audio Texture Synthesis with Time-frequency Scattering

This article explains how to apply time-frequency scattering, a convolut...
research
03/12/2021

Signal Representations for Synthesizing Audio Textures with Generative Adversarial Networks

Generative Adversarial Networks (GANs) currently achieve the state-of-th...
research
03/08/2023

Vector Quantized Time Series Generation with a Bidirectional Prior Model

Time series generation (TSG) studies have mainly focused on the use of G...
research
02/23/2019

GANSynth: Adversarial Neural Audio Synthesis

Efficient audio synthesis is an inherently difficult machine learning ta...
research
12/08/2018

Estimates of the Reconstruction Error in Partially Redressed Warped Frames Expansions

In recent work, redressed warped frames have been introduced for the ana...
