AudioLM: a Language Modeling Approach to Audio Generation

09/07/2022
by Zalán Borsos, et al.

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
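The hybrid tokenization idea can be illustrated with a small sketch. This is not the AudioLM implementation: the random arrays stand in for the activations of a masked language model and for the latent frames of a neural audio codec, and the codebook sizes, frame counts, nearest-neighbour quantizer, and the way the two streams are joined into one sequence are all assumptions made for illustration.

```python
import numpy as np


def discretize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature frame to the index of its nearest codebook vector."""
    # Squared Euclidean distance between every frame and every codebook entry.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two representations named in the abstract.
# A coarse stream (fewer frames) plays the role of the masked-LM activations,
# a finer stream (more frames) plays the role of the codec latents.
semantic_feats = rng.normal(size=(25, 16))
acoustic_feats = rng.normal(size=(50, 8))

# Hypothetical codebooks; a real system would learn these from data.
semantic_codebook = rng.normal(size=(32, 16))
acoustic_codebook = rng.normal(size=(64, 8))

semantic_tokens = discretize(semantic_feats, semantic_codebook)
acoustic_tokens = discretize(acoustic_feats, acoustic_codebook)

# Assumption: offset the acoustic ids so both vocabularies can share a single
# token sequence that a language model could be trained on.
sequence = np.concatenate(
    [semantic_tokens, acoustic_tokens + len(semantic_codebook)]
)
print(sequence.shape)
```

In this toy setup the coarse tokens carry fewer symbols per unit of audio, which is the property the abstract associates with long-term structure, while the finer tokens carry the detail needed for high-quality synthesis.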


