High Fidelity Neural Audio Compression

10/24/2022
by   Alexandre Défossez, et al.

We introduce a state-of-the-art, real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion. We simplify and speed up training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model, including the training objective, architectural changes, and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baseline methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.
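The loss balancer idea from the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `balance_gradients`, the `total_norm` parameter, and the toy gradient values are all hypothetical, and the sketch uses the instantaneous gradient norm, whereas a practical version would likely smooth the norm over several batches. Each per-loss gradient is rescaled so that its share of the overall gradient norm equals its weight divided by the sum of the weights, regardless of the raw scale of the loss.

```python
import math


def balance_gradients(grads, weights, total_norm=1.0, eps=1e-12):
    """Rescale per-loss gradients (hypothetical helper, not the paper's code).

    After rescaling, the gradient of loss `name` has norm
    total_norm * weights[name] / sum(weights), whatever its original scale.
    """
    weight_sum = sum(weights.values())
    balanced = {}
    for name, grad in grads.items():
        # Euclidean norm of this loss's gradient (instantaneous, not smoothed).
        norm = math.sqrt(sum(g * g for g in grad))
        scale = total_norm * weights[name] / (weight_sum * (norm + eps))
        balanced[name] = [scale * g for g in grad]
    return balanced


# Toy gradients with very different raw scales (illustrative values only).
grads = {"reconstruction": [3.0, 4.0], "adversarial": [0.0, 0.002]}
weights = {"reconstruction": 1.0, "adversarial": 3.0}

balanced = balance_gradients(grads, weights)
norms = {k: math.sqrt(sum(g * g for g in v)) for k, v in balanced.items()}
print(norms)
```

With these weights, the reconstruction term ends up carrying one quarter of the total gradient norm and the adversarial term three quarters, even though their raw gradients differ by more than three orders of magnitude. This is the decoupling of weight from loss scale that the abstract describes.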


research
07/07/2021

SoundStream: An End-to-End Neural Audio Codec

We present SoundStream, a novel neural audio codec that can efficiently ...
research
08/02/2023

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

Deep generative models can generate high-fidelity audio conditioned on v...
research
08/30/2020

Hierarchical Timbre-Painting and Articulation Generation

We present a fast and high-fidelity method for music generation, based o...
research
08/16/2020

Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder

In recent works, a flow-based neural vocoder has shown significant impro...
research
06/02/2023

Towards Robust FastSpeech 2 by Modelling Residual Multimodality

State-of-the-art non-autoregressive text-to-speech (TTS) models based on...
research
05/18/2023

FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs

This paper presents FastFit, a novel neural vocoder architecture that re...
research
10/31/2022

Audio Time-Scale Modification with Temporal Compressing Networks

We proposed a novel approach in the field of time-scale modification on ...
