Stack-and-Delay: a new codebook pattern for music generation

09/15/2023
by Gaël Le Lan, et al.

In language-modeling-based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either auto-regressively or in parallel, depending on the codebook pattern. In particular, flattening the codebooks yields the highest-quality decoding strategy but is notoriously slow. To address this, we propose a novel stack-and-delay decoding strategy that improves upon flat pattern decoding, with generation four times faster than vanilla flat decoding. This brings the inference time close to that of the delay decoding strategy and allows for faster inference on GPU for small batch sizes. For the same inference efficiency budget as the delay pattern, we show that the proposed approach performs better in objective evaluations, almost closing the gap with the flat pattern in terms of quality. The results are corroborated by subjective evaluations, which show that samples generated by the new model are preferred slightly more often than samples generated by the competing model given the same text prompts.
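To make the speed trade-off between codebook patterns concrete, the following minimal sketch counts auto-regressive decoding steps for the two baseline patterns named above. It rests on assumptions not stated in the abstract: 4 codebooks per frame, a flat pattern that emits one token per step, and a delay pattern that emits all codebooks at each step with codebook k shifted by k steps. The function names are hypothetical, and the stack-and-delay pattern itself is not reproduced here because the abstract does not specify its layout.

```python
def flat_pattern(num_frames, num_codebooks):
    """Assumed flat layout: one (frame, codebook) token per decoding step."""
    return [[(t, k)] for t in range(num_frames) for k in range(num_codebooks)]


def delay_pattern(num_frames, num_codebooks):
    """Assumed delay layout: every step predicts all codebooks in parallel,
    with codebook k lagging k frames behind codebook 0."""
    num_steps = num_frames + num_codebooks - 1
    steps = []
    for s in range(num_steps):
        # Keep only the (frame, codebook) pairs that fall inside the sequence.
        step = [(s - k, k) for k in range(num_codebooks) if 0 <= s - k < num_frames]
        steps.append(step)
    return steps


if __name__ == "__main__":
    T, K = 6, 4  # illustrative sizes, not taken from the paper
    flat = flat_pattern(T, K)
    delay = delay_pattern(T, K)
    print("flat steps: ", len(flat))   # T * K
    print("delay steps:", len(delay))  # T + K - 1
```

Under these assumptions the flat pattern needs T * K sequential steps while the delay pattern needs T + K - 1, so for long sequences with 4 codebooks the flat pattern takes roughly four times as many steps, which is in line with the speedup quoted in the abstract.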
