KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE using Mel-spectrograms

10/08/2021
by Chien-Feng Liao, et al.

In this paper, we propose a novel neural network model called KaraSinger for a less-studied singing voice synthesis (SVS) task named score-free SVS, in which the prosody and melody are spontaneously decided by the machine. KaraSinger comprises a vector-quantized variational autoencoder (VQ-VAE) that compresses the Mel-spectrograms of singing audio into sequences of discrete codes, and a language model (LM) that learns to predict the discrete codes given the corresponding lyrics. For the VQ-VAE part, we employ a Connectionist Temporal Classification (CTC) loss to encourage the discrete codes to carry phoneme-related information. For the LM part, we use location-sensitive attention to learn a robust alignment between the input phoneme sequence and the output discrete codes. We keep the architectures of both the VQ-VAE and the LM lightweight for fast training and inference. We validate the effectiveness of the proposed design choices using a proprietary collection of 550 English pop songs sung by multiple amateur singers. The results of a listening test show that KaraSinger achieves high scores in intelligibility, musicality, and overall quality.
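
To make the "sequences of discrete codes" idea concrete, the following is a minimal sketch of the nearest-neighbour quantization step that VQ-VAEs in general rely on. It is not the authors' implementation: the function names, the 2-D toy codebook, and the frame values are all hypothetical, and a real model would learn the codebook and operate on encoder outputs computed from Mel-spectrograms.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector."""
    codes = []
    for frame in frames:
        distances = [squared_distance(frame, entry) for entry in codebook]
        codes.append(distances.index(min(distances)))
    return codes


def dequantize(codes: List[int], codebook: List[Vector]) -> List[Vector]:
    """Look up the codebook vector for each discrete code."""
    return [codebook[c] for c in codes]


# Hypothetical toy codebook with three 2-D entries (assumed values).
CODEBOOK: List[Vector] = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]

# Hypothetical frames standing in for encoder outputs (assumed values).
FRAMES: List[Vector] = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7), (0.4, 0.4)]

codes = quantize(FRAMES, CODEBOOK)
print(codes)
```

In a setup like the one the abstract describes, a sequence such as `codes` is what the language model would be trained to predict from the lyrics, and `dequantize` yields the vectors a decoder would turn back into a Mel-spectrogram.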


research
10/12/2022

JukeDrummer: Conditional Beat-aware Audio-domain Drum Accompaniment Generation via Transformer VQ-VA

This paper proposes a model that generates a drum track in the audio dom...
research
05/16/2022

SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

One noted issue of vector-quantized variational autoencoder (VQ-VAE) is ...
research
10/15/2020

The NeteaseGames System for Voice Conversion Challenge 2020 with Vector-quantization Variational Autoencoder and WaveNet

This paper presents the description of our submitted system for Voice Co...
research
12/06/2021

Conditional Deep Hierarchical Variational Autoencoder for Voice Conversion

Variational autoencoder-based voice conversion (VAE-VC) has the advantag...
research
12/03/2022

UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

Text-to-speech (TTS) and singing voice synthesis (SVS) aim at generating...
research
05/02/2019

Investigation of F0 conditioning and Fully Convolutional Networks in Variational Autoencoder based Voice Conversion

In this work, we investigate the effectiveness of two techniques for imp...
research
05/27/2019

VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019

We describe our submitted system for the ZeroSpeech Challenge 2019. The ...
