DeepSinger: Singing Voice Synthesis with Data Mined From the Web

07/09/2020 ∙ by Yi Ren, et al. ∙ Zhejiang University ∙ Microsoft

In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing alignment model to automatically extract the duration of each phoneme in lyrics, starting from the coarse-grained sentence level and moving to the fine-grained phoneme level, and further design a multi-lingual multi-singer singing model based on a feed-forward Transformer to directly generate linear-spectrograms from lyrics, and synthesize voices using Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites, 2) the lyrics-to-singing alignment model avoids any human effort for alignment labeling and greatly reduces labeling cost, 3) the singing model based on a feed-forward Transformer is simple and efficient, since it removes the complicated acoustic feature modeling in parametric synthesis and leverages a reference encoder to capture the timbre of a singer from noisy singing data, and 4) it can synthesize singing voices in multiple languages and for multiple singers. We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages (Chinese, Cantonese and English). The results demonstrate that with singing data purely mined from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness. (footnote: Our audio samples are available at https://speechresearch.github.io/deepsinger/.)

1. Introduction

Singing voice synthesis (SVS) (Nishimura et al., 2016; Blaauw and Bonada, 2017; Lee et al., 2019; Lu et al., 2020), which generates singing voices from lyrics, has attracted a lot of attention in both the research and industrial communities in recent years. Similar to text to speech (TTS) (Shen et al., 2018; Ren et al., 2019, 2020), which enables machines to speak, SVS enables machines to sing, and both have been greatly improved by the rapid development of deep neural networks. Singing voices have more complicated prosody than normal speaking voices (Umbert et al., 2015), and therefore SVS needs additional information to control the duration and the pitch of singing voices, which makes SVS more challenging than TTS (Nguyen, 2018).

Previous works on SVS include lyrics-to-singing alignment (Fujihara et al., 2011; Chien et al., 2016; Gupta et al., 2018), parametric synthesis (Kim et al., 2018; Blaauw and Bonada, 2017), acoustic modeling (Nishimura et al., 2016; Nakamura et al., 2019; Lu et al., 2020), and adversarial synthesis (Chandna et al., 2019; Lee et al., 2019; Hono et al., 2019). Although they achieve reasonably good performance, these systems typically require 1) a large amount of high-quality singing recordings as training data, and 2) strict data alignments between lyrics and singing audio for accurate singing modeling, both of which incur considerable data labeling cost. Previous works collect these two kinds of data as follows:

  • For the first kind of data, previous works usually ask humans to sing songs in a professional recording studio to record at least a few hours of high-quality singing voices, which requires the involvement of human experts and is costly.

  • For the second kind of data, previous works usually first manually split a whole song into aligned lyrics and audio at the sentence level (Lee et al., 2019), and then extract the duration of each phoneme either by manual alignment or with a phonetic timing model (Nishimura et al., 2016; Blaauw and Bonada, 2017), which incurs data labeling cost.

As can be seen, the training data for SVS mostly rely on human recording and annotation. What is more, there are few publicly available singing datasets, which increases the entry cost for researchers to work on SVS and slows down research and product application in this area. Considering that many tasks such as language modeling and generation, search and ads ranking, and image classification rely heavily on data collected from the Web, a natural question is: can we build an SVS system with data collected from the Web? While there are plenty of songs on music websites and mining training data from the Web seems promising, we face several technical challenges:

  • The songs on music websites usually mix singing with accompaniment, which makes them too noisy to train SVS systems directly. In order to leverage such data, singing and accompaniment separation is needed to obtain clean singing voices.

  • The singing and the corresponding lyrics in crawled songs are not always well matched due to possible errors on music websites. Filtering out the mismatched songs is important to ensure the quality of the mined dataset.

  • Although some songs have time alignments between the singing and lyrics at the sentence level, most alignments are not accurate. For accurate singing modeling, we need to build our own alignment model for both sentence-level and phoneme-level alignment.

  • Noise still remains in the singing audio after separation. Designing a singing model that can learn from such noisy data is challenging.

In this paper, we develop DeepSinger, a singing voice synthesis system that is built from scratch by using singing training data mined from music websites. To address the above challenges, we design a pipeline in DeepSinger that consists of several data mining and modeling steps, including:

  • Data crawling. We crawl popular songs of top singers in multiple languages from a music website.

  • Singing and accompaniment separation. We use a popular music separation tool Spleeter (Hennequin et al., 2019) to separate singing voices from song accompaniments.

  • Lyrics-to-singing alignment. We build an alignment model to segment the audio into sentences and extract the singing duration of each phoneme in lyrics.

  • Data filtration. We filter the aligned lyrics and singing voices according to their confidence scores in alignment.

  • Singing modeling. We build a FastSpeech (Ren et al., 2019, 2020) based singing model and leverage a reference encoder to handle noisy data.

Specifically, the detailed designs of the lyrics-to-singing alignment model and singing model are as follows:

  • We build the lyrics-to-singing alignment model based on automatic speech recognition to extract the duration of each phoneme in lyrics starting from coarse-grained sentence level to fine-grained phoneme level.

  • We design a multi-lingual multi-singer singing model based on FastSpeech (Ren et al., 2019, 2020) to directly generate linear-spectrogram from lyrics, instead of the traditional acoustic features in parametric synthesis. Additionally, we design a reference encoder in the singing model to capture the timbre of a singer from noisy singing data, instead of using singer ID.

We conduct experiments on our mined singing dataset (92 hours of data from 89 singers in three languages) to evaluate the effectiveness of DeepSinger. Experimental results show that with singing data purely mined from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness.

The contributions of this paper are summarized as follows:

  • To the best of our knowledge, DeepSinger is the first SVS system built from data directly mined from the Web, without any high-quality singing data recorded by humans.

  • The lyrics-to-singing alignment model avoids any human effort for alignment labeling and greatly reduces labeling cost.

  • The FastSpeech based singing model is simple and efficient, by removing the complicated acoustic feature modeling in parametric synthesis and leveraging a reference encoder to capture the timbre of a singer from noisy singing data.

  • DeepSinger can synthesize high-quality singing voices in multiple languages and multiple singers.

2. Background

Figure 1. Overview of the DeepSinger pipeline.

In this section, we introduce the background of DeepSinger, including text to speech (TTS), singing voice synthesis (SVS), text-to-audio alignment, which is a key component of an SVS system, as well as some other works that leverage training data mined from the Web.

Text to Speech

Text to Speech (TTS) (Shen et al., 2018; Ren et al., 2019, 2020) aims to synthesize natural and intelligible speech given text as input, which has witnessed great progress in recent years. TTS systems have changed from concatenative synthesis (Hunt and Black, 1996), to statistical parametric synthesis (Wu et al., 2016; Li et al., 2018), and to end-to-end neural based synthesis (Shen et al., 2018; Ping et al., 2019; Ren et al., 2019, 2020). Neural network based end-to-end TTS models usually first convert input text to acoustic features (e.g., mel-spectrograms) and then transform mel-spectrograms into audio samples with a vocoder. Griffin-Lim (Griffin and Lim, 1984) is a popular vocoder to reconstruct voices given linear-spectrograms. Other neural vocoders such as WaveNet (Oord et al., 2016) and WaveRNN (Kalchbrenner et al., 2018) directly generate waveform conditioned on acoustic features. SVS systems are mostly inspired by TTS and follow the basic components in TTS such as text-to-audio alignment, parametric acoustic modeling and vocoder.

Singing Voice Synthesis

Previous works have studied SVS from different aspects, including lyrics-to-singing alignment (Fujihara et al., 2011; Chien et al., 2016; Gupta et al., 2018), parametric synthesis (Kim et al., 2018; Blaauw and Bonada, 2017), acoustic modeling (Nishimura et al., 2016; Nakamura et al., 2019), and adversarial synthesis (Chandna et al., 2019; Lee et al., 2019; Hono et al., 2019). Blaauw and Bonada (2017) leverage the WaveNet architecture and separate the influence of pitch and timbre for parametric singing synthesis. Lee et al. (2019) introduce adversarial training in linear-spectrogram generation for better voice quality. Jukebox (Dhariwal et al., 2020) conditions on artist and lyrics to generate many musical styles and realistic singing voices together with accompaniments, while our work focuses on generating singing voices after vocal separation. Previous SVS systems usually obtain lyrics-to-singing alignment by human labeling or through a combination of human labeling and additional alignment tools, while DeepSinger builds an SVS system from scratch (using only the original song-level lyrics and audio, without any alignment information). Furthermore, most previous SVS systems use complicated acoustic parameters, while DeepSinger generates linear-spectrograms directly with a simple feed-forward Transformer model.

Text-to-Audio Alignment

Text-to-audio alignment is a key component for both TTS and SVS, which aims to obtain the duration of the pronunciation or the singing for each word/character/phoneme. TTS usually leverages a hidden Markov model (HMM) speech recognizer for phoneme-audio alignment (McAuliffe et al., 2017). For lyrics-to-singing alignment in SVS, traditional methods usually leverage the timing information from musical structure such as chords and chorus, or directly use musical score to align lyrics. However, such methods either need the presence of background accompaniments or require professional singing recordings where the notes are correctly sung, which are not practical for SVS. Recent works also leverage the alignment methods used in TTS for lyrics-to-singing alignment. Gupta et al. (2018) propose a semi-supervised method for lyrics and singing alignment using transcripts from an automatic speech recognition system. Sharma et al. (2019); Gupta et al. (2019) propose to align polyphonic music by adapting a solo-singing alignment model. Lee et al. (2019) leverage laborious human labeling combined with additional tools for alignment. The works on lyrics-to-singing alignment either leverage a large amount of speaking voices such as LibriSpeech for pre-training or need human labeling efforts. In this work, we propose an automatic speech recognition based alignment model that is based on an encoder-attention-decoder framework to align lyrics and audio first in song level and then in sentence level, which does not need additional data or human labeling.

Training Data Mined From the Web

A variety of tasks collect training data from the Web, such as the large-scale web-crawled text datasets ClueWeb (Callan et al., 2009) and Common Crawl (https://commoncrawl.org/) for language modeling (Yang et al., 2019), LETOR (Qin et al., 2010) for search ranking (Cao et al., 2007), and WebVision (Li et al., 2017) for image classification. Similar to these works, collecting singing data from music websites also needs a lot of specific data processing, data mining, and data modeling techniques, including voice separation, alignment modeling, data filtration, and singing modeling.

3. DeepSinger

In this section, we introduce DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch by leveraging singing training data mined from music websites. We first describe the pipeline of DeepSinger and briefly introduce each step in the pipeline, and then introduce the formulation of the lyrics-to-singing alignment model and singing model.

3.1. Pipeline Overview

In order to build an SVS system from scratch by leveraging singing training data mined from music websites, as shown in Figure 1, DeepSinger consists of several data processing, mining, and modeling steps, including 1) data crawling, 2) singing and accompaniment separation, 3) lyrics-to-singing alignment, 4) data filtration, and 5) singing modeling. Next, we introduce each step in detail.

Data Crawling

In order to obtain a large number of songs from the Internet, we crawl tens of thousands of songs and their lyrics from a well-known music website. Our crawled songs cover three languages (Chinese, Cantonese and English) and hundreds of singers. We perform some rough filtration and cleaning on the crawled dataset according to song duration and data quality:

  • We filter out songs that are too long (more than 5 minutes) or too short (less than 1 minute).

  • We filter out concert versions of songs, identified by their names, since they are usually noisy.

  • We remove some songs performed by music bands or multiple singers.

  • We also perform data cleaning on the crawled lyrics by removing some meta information (such as composer, singer, etc.) and unvoiced symbols.

  • We keep the separation marks between each sentence in the lyrics for further sentence-level alignment and segmentation.
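
The rough filtration rules above can be sketched as a simple predicate. This is only an illustration under stated assumptions: the `Song` record, its field names, and the keyword list used to spot concert versions are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Song:
    # Hypothetical record; the field names are assumptions for illustration.
    title: str
    duration_sec: float
    num_singers: int


# Assumed keywords for recognizing concert versions from the song name.
CONCERT_KEYWORDS = ("(live)", "concert")


def keep_song(song: Song) -> bool:
    """Return True if the song passes the rough filtration rules."""
    # Drop songs shorter than 1 minute or longer than 5 minutes.
    if not 60 <= song.duration_sec <= 300:
        return False
    # Drop concert versions, detected from the song name.
    title = song.title.lower()
    if any(keyword in title for keyword in CONCERT_KEYWORDS):
        return False
    # Drop songs performed by bands or multiple singers.
    if song.num_singers > 1:
        return False
    return True


songs = [
    Song("Quiet Night", 210.0, 1),
    Song("Quiet Night (Live)", 215.0, 1),
    Song("Short Intro", 45.0, 1),
    Song("Duet Song", 200.0, 2),
]
kept = [s.title for s in songs if keep_song(s)]
```

In practice the keyword list would need to be adapted to the naming conventions of the crawled website and to each language.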

We convert the lyrics into phoneme sequences using open-source tools: 1) we use phonemizer (https://github.com/bootphon/phonemizer) to convert English and Cantonese lyrics into the corresponding phonemes, and 2) we use pypinyin (https://github.com/mozillazg/python-pinyin) to convert Chinese lyrics into phonemes.

Singing and Accompaniment Separation

Since the singing voices are mixed with the accompaniments in almost all the crawled songs, we leverage Spleeter (Hennequin et al., 2019), an open-source music source separation tool that achieves state-of-the-art accuracy, to separate singing voices from accompaniments and extract singing voices from the crawled songs. We normalize the loudness of the singing voices to a fixed loudness level (-16 LUFS, measured according to ITU-R BS.1770-3). (footnote: LUFS is short for loudness units relative to full scale.) (footnote: ITU-R BS.1770-3 is a standard to measure audio loudness level; see https://www.itu.int/rec/R-REC-BS.1770/en.)
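
As a small illustration of the normalization step, the sketch below computes the gain needed to move a clip to the -16 LUFS target. It assumes the integrated loudness of the clip has already been measured by some BS.1770-compliant meter (the measurement itself is not implemented here), and the function names are hypothetical.

```python
def gain_to_target(measured_lufs: float, target_lufs: float = -16.0) -> float:
    """Linear amplitude gain that shifts the measured loudness to the target.

    Loudness is expressed on a decibel-like scale, so a difference of
    `gain_db` loudness units corresponds to an amplitude factor of
    10 ** (gain_db / 20).
    """
    gain_db = target_lufs - measured_lufs
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples, gain):
    """Scale every sample by the same gain factor."""
    return [s * gain for s in samples]


# A clip measured at -22 LUFS needs +6 dB of gain to reach -16 LUFS.
gain = gain_to_target(-22.0)
louder = apply_gain([0.1, -0.2, 0.05], gain)
```

A real pipeline would also have to guard against clipping after the gain is applied, which this sketch does not do.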

Lyrics-to-Singing Alignment

The duration of each phoneme determines how long each phoneme can be sung, and is critical for an SVS system to learn alignments between phonemes and acoustic features automatically. Therefore, we design a lyrics-to-singing alignment model to extract the duration of each phoneme first in sentence level and then in phoneme level. We describe the details of the alignment model in Section 3.2.

Data Filtration

After the lyrics-to-singing alignment, we found that some lyrics and singing voices are misaligned, which may be due to the following reasons: 1) the voice quality of some songs is poor, since they are not recorded in a professional recording studio or contain mixed chorus; 2) some lyrics do not belong to their songs, due to errors on music websites; 3) the quality of some separated singing voices is poor, since the accompaniments cannot be totally removed. Therefore, we filter out those misaligned singing voices according to their alignment quality: we use the splitting reward introduced in Section 3.2 as the filtration criterion and discard singing data whose splitting reward is lower than a threshold.

Singing Modeling

After the previous steps of data crawling, separation, alignment, and filtration, the mined singing data are ready for singing modeling. We design a FastSpeech based singing model that takes lyrics, duration, pitch information, as well as a reference audio as input to generate the singing voices. We introduce the details of the singing model in Section 3.3.

3.2. Lyrics-to-Singing Alignment

Lyrics-to-singing alignment is important for SVS to decide how long each phoneme is sung in synthesized voices. Previous works (Lee et al., 2019; Yi et al., 2019) usually leverage human labeling to split songs into sentences and then conduct phoneme alignment within each sentence by leveraging an HMM (hidden Markov model) based speech recognition model. In this paper, we propose a new alignment model to extract the duration of each phoneme, by leveraging raw lyrics and song recordings, without relying on any human labeling effort.

It is hard to directly extract the duration of each phoneme given the lyrics and audio of a whole song. Instead, we use a two-stage alignment pipeline that first splits a whole song into sentences, and then extracts the phoneme duration in each sentence, as follows:

  • We train our alignment model (which is introduced in the next paragraphs) with the lyrics and audio in a whole song, and use it to split the whole song into aligned lyrics and audio in sentence level. We segment an audio by the frames that are aligned to the separation marks in raw lyrics.

  • We continue to train the previous alignment model with the aligned lyrics and audio in the sentence level that are obtained in the previous step. We use this trained alignment model to obtain lyrics-to-singing alignments in the phoneme level. We further design a duration extraction algorithm to obtain the duration of each phoneme from the lyrics-to-singing alignments.

Specifically, we train the alignment model based on automatic speech recognition with some strategies of greedy training and guided attention, and design a dynamic programming based duration extraction to obtain the phoneme duration. We introduce the two modules as follows.

Figure 2. The alignment model based on the architecture of automatic speech recognition.

Alignment Model Training

We leverage an automatic speech recognition model with location sensitive attention (Shen et al., 2018) to obtain lyrics-to-singing alignments, as shown in Figure 2. The alignment model consists of a bi-directional LSTM encoder and an LSTM decoder; it takes singing audio (mel-spectrograms) as input and produces lyrics (phonemes) as output. The attention weights between the encoder and decoder are regarded as the alignments between audio and lyrics. To ensure alignment accuracy, we propose two techniques to help the training of the alignment model:

  • Greedy Training. It is hard to train the model by taking the whole song as input. Instead, we train the alignment model with a greedy strategy: 1) First, only a small part of the audio and lyrics from the beginning of each song is fed into the model for training. 2) Second, the length of the audio and lyrics is gradually increased during training. 3) Finally, the whole song is fed into the model for training. By gradually increasing the difficulty of the training task, the model first learns to align the very beginning of a song and then gradually learns to align the whole song.

  • Guided Attention. In order to further help the model learn reasonable alignments, we leverage guided attention (Tachibana et al., 2018) to constrain the attention alignments. Guided attention is based on the prior knowledge that the attention between lyrics and audio should be diagonal and monotonic. We first construct a diagonal mask matrix $M \in \{0, 1\}^{T \times S}$ (shown in Figure 3) as follows:

    $$M_{i,j} = \begin{cases} 1, & i \cdot \frac{S}{T} - W \le j \le i \cdot \frac{S}{T} + W \\ 0, & \text{otherwise} \end{cases}$$

    where $W$ is the bandwidth of $M$, which determines how many elements are equal to one, $S$ is the length of the mel-spectrogram frames and $T$ is the length of the phoneme sequence. The guided attention loss is defined as $-\sum_{i,j} M_{i,j} A_{i,j}$, where $A$ is the weights of the encoder-decoder attention in the alignment model, and the guided attention loss is added to the original loss of the alignment model. If the attention weight is far from the diagonal, it is pushed towards the diagonal for better alignment quality.
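
To make the guided attention idea concrete, here is a minimal sketch in plain Python. It assumes that mask entries inside a band of half-width `bandwidth` around the diagonal are one, and that the loss is the negative attention mass falling inside the band; the averaging over phonemes and the 0-based indexing are choices made for this illustration only and may differ from the authors' implementation.

```python
def diagonal_mask(num_phonemes, num_frames, bandwidth):
    """Band mask: entry [i][j] is 1 if frame j lies within `bandwidth`
    frames of the diagonal position of phoneme i, else 0 (0-indexed)."""
    mask = []
    for i in range(num_phonemes):
        center = i * num_frames / num_phonemes  # diagonal position for row i
        row = [1 if abs(j - center) <= bandwidth else 0 for j in range(num_frames)]
        mask.append(row)
    return mask


def guided_attention_loss(attention, mask):
    """Negative attention mass inside the band, averaged over phonemes.

    The loss is lowest when all attention weight lies near the diagonal.
    """
    inside = sum(
        a * m
        for a_row, m_row in zip(attention, mask)
        for a, m in zip(a_row, m_row)
    )
    return -inside / len(attention)


mask = diagonal_mask(num_phonemes=2, num_frames=4, bandwidth=1)
diagonal_attention = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
off_diagonal_attention = [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
good = guided_attention_loss(diagonal_attention, mask)
bad = guided_attention_loss(off_diagonal_attention, mask)
```

In a training setup this term would be computed on the attention tensor of each batch and added to the recognition loss, so that gradients push attention weight back into the band.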

Figure 3. The details of the guided attention.

Duration Extraction

After we get the lyrics-to-singing alignments from the alignment model, an intuitive way to extract the phoneme duration is to count how many consecutive mel-spectrogram frames a phoneme attends to. However, this is non-trivial for the following reasons: 1) the attention map from the alignment model is not always diagonal and monotonic; 2) a mel-spectrogram frame can be attended by multiple phonemes, and it is hard to decide which phoneme actually corresponds to this frame. In order to extract the duration from the attention map accurately, we design a novel duration extraction algorithm to compute the duration $D$ of each phoneme in a sequence, which satisfies:

$$\sum_{i=1}^{T} D_i = S \quad (1)$$

where $T$ is the length of a phoneme sequence and $S$ is the length of mel-spectrogram frames. The extraction of $D$ consists of several steps:

  • First, we obtain the attention alignment $A \in \mathbb{R}^{T \times S}$ from the alignment model. The $i$-th row $A_i$ is a non-negative probability distribution, where each element $A_{i,j}$ represents the probability of the $i$-th phoneme attending to the $j$-th mel-spectrogram frame.

  • Second, we define the splitting boundary vector $B$ of length $T + 1$, which splits the mel-spectrogram frames into $T$ segments corresponding to the $T$ phonemes. The splitting reward $O$ for a given set of split points is defined as follows:

    $$O = \sum_{i=1}^{T} \sum_{j=B_i}^{B_{i+1}-1} A_{i,j} \quad (2)$$

  • Third, the goal of duration extraction becomes finding the best splitting boundary vector $B$ by maximizing the splitting reward $O$, and then extracting the duration $D$ according to this boundary vector. Maximizing the splitting reward can be solved by a standard dynamic programming (DP) algorithm. The details of the DP algorithm for duration extraction are summarized in Section A.2.

  • Fourth, after we get the best splitting boundary vector $B$ from the DP algorithm, the duration can be simply calculated by

    $$D_i = B_{i+1} - B_i \quad (3)$$
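
The steps above can be sketched with a small dynamic program. This is not the authors' algorithm from Section A.2, only one standard way to maximize the splitting reward under two assumptions made here: segments are contiguous and ordered, and every phoneme receives at least one frame. The function name and the tie-breaking rule are likewise choices for this illustration.

```python
def extract_durations(attention):
    """Split S frames into T contiguous segments, one per phoneme, so that
    the attention mass collected inside the segments is maximal.

    attention[i][j] is the weight of phoneme i on frame j (T rows, S columns).
    Returns (durations, reward); durations sum to S.
    """
    num_phonemes, num_frames = len(attention), len(attention[0])
    if num_phonemes > num_frames:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    # best[i][j]: max reward when the first j frames are covered by the first i phonemes.
    best = [[neg_inf] * (num_frames + 1) for _ in range(num_phonemes + 1)]
    # stay[i][j]: True if frame j-1 continues phoneme i-1 rather than starting it.
    stay = [[False] * (num_frames + 1) for _ in range(num_phonemes + 1)]
    best[0][0] = 0.0
    for i in range(1, num_phonemes + 1):
        for j in range(i, num_frames + 1):
            continue_same = best[i][j - 1]    # previous frame already belongs to phoneme i-1
            start_new = best[i - 1][j - 1]    # previous frame belongs to phoneme i-2
            stay[i][j] = continue_same >= start_new
            best[i][j] = max(continue_same, start_new) + attention[i - 1][j - 1]
    # Backtrack from the last frame, counting frames per phoneme.
    durations = [0] * num_phonemes
    i, j = num_phonemes, num_frames
    while j > 0:
        durations[i - 1] += 1
        if not stay[i][j]:
            i -= 1
        j -= 1
    return durations, best[num_phonemes][num_frames]


attention = [
    [0.5, 0.4, 0.1, 0.0, 0.0],
    [0.1, 0.1, 0.7, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.4, 0.5],
]
durations, reward = extract_durations(attention)
```

The returned reward is the same quantity that the data filtration step compares against a threshold, so a single pass yields both the durations and the filtration criterion.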

3.3. Singing Modeling

After we get the duration of each phoneme in lyrics with the alignment model, we build our singing model to synthesize singing voices. Our singing model is based on FastSpeech (Ren et al., 2019, 2020), which is a feed-forward Transformer network that synthesizes speech in parallel. FastSpeech leverages a length regulator to expand the phoneme sequence to the length of the target speech according to the duration of each phoneme, which is suitable for singing modeling since duration information is needed by default in SVS. Different from FastSpeech, our singing model 1) leverages separate lyrics and pitch encoders for pronunciation and pitch modeling, 2) designs a reference encoder to capture the timbre of a singer instead of a simple singer embedding, to improve the robustness of learning from noisy data, and 3) directly generates linear-spectrograms instead of mel-spectrograms and uses Griffin-Lim (Griffin and Lim, 1984) for voice synthesis. As shown in Figure 4, we describe each module in our singing model as follows.

Lyrics Encoder

The lyrics encoder consists of 1) a phoneme embedding lookup table to convert the phoneme ID into embedding vector, 2) several Transformer blocks (Vaswani et al., 2017) to convert the phoneme embedding sequence into a hidden sequence, and 3) a length expansion operation to expand the length of the hidden sequence to match the length of the target linear-spectrograms according to the phoneme duration.
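
The length expansion operation can be illustrated in a few lines: each phoneme's hidden vector is repeated as many times as its duration in frames. The function name is hypothetical, and plain lists stand in for tensors.

```python
def expand_by_duration(hidden, durations):
    """Repeat each phoneme's hidden vector `duration` times so that the
    output has one vector per spectrogram frame."""
    if len(hidden) != len(durations):
        raise ValueError("one duration is required per phoneme")
    expanded = []
    for vector, duration in zip(hidden, durations):
        # Copy the vector so that frames do not share the same list object.
        expanded.extend(list(vector) for _ in range(duration))
    return expanded


phoneme_hidden = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
frame_hidden = expand_by_duration(phoneme_hidden, [2, 1, 3])
```

The expanded sequence has `sum(durations)` entries, which is what allows it to be added frame by frame to the pitch encoder output.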

Figure 4. The architecture of the singing model.

Pitch Encoder

The pitch encoder consists of 1) a pitch embedding lookup table to convert the pitch ID into an embedding vector, and 2) several Transformer blocks to generate the pitch hidden sequence. Since the pitch sequence is directly extracted from the audio in the training set and has the same length as the linear-spectrogram sequence, we do not need any length expansion as used in the lyrics encoder.

Reference Encoder

The reference encoder consists of 1) a pre-net to preprocess the linear-spectrograms of the reference audio, 2) several Transformer blocks to generate a hidden sequence, and 3) an average pooling over the time dimension to compress the hidden sequence into a vector, which is called the reference embedding and contains the timbre information of the speaker. The reference embedding is broadcast along the time dimension when it is added to the outputs of the lyrics and pitch encoders. The reference encoder shows advantages over a singer embedding, especially when the singing training data are noisy:

  • A singer embedding averages the timbre as well as the voice characteristics of singing data with different noise levels. Therefore, the singer embedding contains the characteristics of noisy audio, which makes the synthesized voice noisy.

  • A reference encoder only learns the characteristics from a reference audio, which can ensure the model can synthesize clean voice given clean reference audio.

Decoder

The decoder consists of 1) several Transformer blocks to convert the frame-wise addition of the outputs of the three encoders into a hidden sequence, and 2) a linear layer to convert the hidden sequence into linear-spectrograms. Finally, we use Griffin-Lim (Griffin and Lim, 1984) to directly synthesize singing voices given the predicted linear-spectrograms.
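
The way the three encoder outputs are combined before the decoder can be sketched as follows: the reference hidden sequence is average-pooled over time into a single vector, which is then broadcast and added to every frame of the lyrics and pitch outputs. Plain lists stand in for tensors, and the function names are hypothetical.

```python
def mean_pool(sequence):
    """Average a sequence of vectors over the time dimension."""
    length, dim = len(sequence), len(sequence[0])
    return [sum(frame[k] for frame in sequence) / length for k in range(dim)]


def combine_encoder_outputs(lyrics_hidden, pitch_hidden, reference_hidden):
    """Frame-wise sum of the lyrics and pitch outputs plus the broadcast
    reference embedding."""
    if len(lyrics_hidden) != len(pitch_hidden):
        raise ValueError("lyrics and pitch sequences must have the same length")
    reference_embedding = mean_pool(reference_hidden)
    return [
        [l + p + r for l, p, r in zip(lyrics_frame, pitch_frame, reference_embedding)]
        for lyrics_frame, pitch_frame in zip(lyrics_hidden, pitch_hidden)
    ]


lyrics = [[1.0, 0.0], [2.0, 0.0]]                 # already expanded to frame length
pitch = [[0.0, 1.0], [0.0, 2.0]]
reference = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]  # may have a different length
decoder_input = combine_encoder_outputs(lyrics, pitch, reference)
```

Because the reference sequence is pooled to a fixed-size vector, the reference audio does not need to match the length of the target, which is why any clip of the singer can be used at inference.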

Figure 5. The inference process of singing voice synthesis.

The training and inference processes of the singing model are described as follows.

  • Training. We extract the pitches from the training audio and transform the audio into linear-spectrograms. Then we feed the phoneme sequence together with duration, pitch sequence, and linear-spectrograms into the lyrics encoder, pitch encoder, and reference encoder correspondingly. We minimize the mean square error (MSE) loss between the output linear-spectrograms and the reference linear-spectrograms to optimize the model.

  • Inference. DeepSinger synthesizes voices from lyrics and demo singing audio, as shown in Figure 5. We extract the duration and pitch information in the same way as in training. The reference audio can be any singing or speaking audio that provides voice characteristics such as timbre for the synthesized voices.

4. The Mined Singing Dataset

In this section, we briefly introduce our mined singing dataset (named Singing-Wild, which means singing voice dataset in the wild).

Singing-Wild contains more than 90 hours of singing audio in multiple languages (Chinese, Cantonese and English) and from multiple singers, together with the corresponding lyrics, phonemes, and singing-phoneme alignment files. In addition, we provide some metadata that store the song name, singer information and download links. The detailed statistics of the Singing-Wild dataset are listed in Table 1. We further analyze some statistics in detail, including the duration distributions of the singing voices at the sentence level and phoneme level, and the pitch distributions.

Language     #Singers   #Songs   #Sentences   Duration (hours)
Chinese      25         1910     52109        62.61
Cantonese    41         322      14853        18.83
English      23         741      13217        11.44
Total        89         2973     70179        92.88
Table 1. The statistics of the Singing-Wild dataset.

Sentence-Level Duration Distribution

We segment the songs into sentence-level singing voices using the lyrics-to-singing alignment model mentioned in Section 3.2. We plot the sentence-level duration distribution of the singing voices in Table 6 in Section A.5. We find that the distributions of Chinese and Cantonese are very similar, and most sentences in these languages are around 4 seconds long, while most sentences in English are around 2.5 seconds long.

Phoneme-Level Duration Distribution

We align the phoneme sequence to the singing audio frames using the lyrics-to-singing alignment model mentioned in Section 3.2. We plot the distributions of the phoneme duration in Table 7 in Section A.5. We find that most phoneme durations in English are distributed between about 10 ms and 100 ms, while individual phoneme durations in Chinese and Cantonese are longer than in English and are distributed between about 10 ms and 200 ms.

Pitch Distribution

We extract the F0 (fundamental frequency, also known as pitch) from the audio using Parselmouth (https://github.com/YannickJadoul/Parselmouth) and convert F0 to a sequence of international pitch notation (http://www.flutopedia.com/octave_notation.htm). The international pitch notation expresses a note with a musical note name and a number that identifies the octave of a pitch, e.g., C4, D4 and A5. We also apply a median filter to the note sequence to remove noise. We plot the pitch distributions in Table 8 in Section A.5. We can see that most of the pitches are distributed around C2 to C4 in all languages.
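
The conversion from F0 to international pitch notation, followed by median filtering, can be sketched as below. The sketch assumes equal temperament with A4 = 440 Hz, works on MIDI note numbers internally so that the median is well defined, uses a window of 3 frames, and expects only voiced frames (positive F0); none of these details are specified in the text, and the F0 extraction itself is not shown.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def f0_to_midi(f0_hz):
    """Nearest MIDI note number for a positive frequency (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(f0_hz / 440.0))


def midi_to_note(midi):
    """International pitch notation, e.g. 60 -> 'C4', 69 -> 'A4'."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def median_filter(values, window=3):
    """Replace each value by the median of its neighbourhood; the window is
    truncated at the sequence boundaries."""
    half = window // 2
    filtered = []
    for index in range(len(values)):
        neighbourhood = sorted(values[max(0, index - half): index + half + 1])
        filtered.append(neighbourhood[len(neighbourhood) // 2])
    return filtered


f0_track = [261.63, 262.0, 523.25, 261.0, 293.66, 294.0]  # one octave jump at frame 2
midi_track = [f0_to_midi(f) for f in f0_track]
notes = [midi_to_note(m) for m in median_filter(midi_track)]
```

Here the isolated octave jump, a typical pitch-tracking error, is removed by the filter while the genuine note change from C4 to D4 is preserved.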

5. Experiments and Results

In this section, we first describe the experimental setup, report the accuracy of the alignment model in DeepSinger, and then evaluate the synthesized voices both quantitatively in terms of pitch accuracy and qualitatively in terms of mean opinion score (MOS). Finally, we conduct some analyses of DeepSinger.

5.1. Experimental Setup

5.1.1. Datasets

We use our mined Singing-Wild dataset to train the multi-lingual and multi-singer singing model. For each language, we randomly select 5 songs from different singers and pick 10 sentences from each song (50 sentences in total) to construct the test set, similarly construct another 50 sentences as the validation set, and use the remaining songs as the training set. Inspired by Zhang et al. (2019), we also leverage extra multi-speaker TTS datasets to help the training of the singing model and improve the quality of the generated voices. We use 1) the THCHS-30 (Dong Wang, 2015) dataset for Chinese, which consists of 10893 sentences (27 hours) from 30 speakers, 2) a subset of LibriTTS (Zen et al., 2019) for English, which consists of 7205 sentences (12 hours) from 32 speakers, and 3) an internal dataset for Cantonese, which consists of 10000 sentences (25 hours) from 30 speakers.

For the audio data, we first convert the sampling rate of all audios to Hz, and then convert the audio waveform into mel-spectrograms for alignment model and linear-spectrograms for singing model. The frame size and hop size of both the mel-spectrograms and linear-spectrograms are set to 1024 and 256 respectively. For the text data, we convert text sequence into phoneme sequence with the open-source tools as mentioned in Section 3.1.

5.1.2. Alignment Model

Data Preprocessing

We first mark the non-vocal frames whose pitches cannot be successfully extracted or volumes are lower than a threshold. Then we detect the non-vocal segments which contain at least 10 consecutive non-vocal frames. We replace each non-vocal segment with 10 silence frames, which can significantly shorten the audio and make the alignment training easier.
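The preprocessing step above can be illustrated with a short sketch. It assumes the per-frame vocal/non-vocal decision has already been made (by pitch extraction and a volume threshold) and is passed in as a list of booleans. The function name and the `silence` placeholder value are hypothetical.

```python
def compress_nonvocal(frames, is_vocal, min_run=10, keep=10, silence=0.0):
    """Replace each run of at least `min_run` non-vocal frames by `keep` silence frames.

    frames:   list of per-frame features (any type)
    is_vocal: list of booleans, same length as `frames`
    Shorter non-vocal runs are kept unchanged.
    """
    out = []
    i, n = 0, len(frames)
    while i < n:
        if is_vocal[i]:
            out.append(frames[i])
            i += 1
            continue
        # Find the end of the current non-vocal run.
        j = i
        while j < n and not is_vocal[j]:
            j += 1
        if j - i >= min_run:
            out.extend([silence] * keep)
        else:
            out.extend(frames[i:j])
        i = j
    return out
```

With these settings a 20-frame instrumental gap shrinks to 10 frames, while a 4-frame pause inside a phrase is left untouched.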

Model Configuration

As shown in Figure 2, the alignment model consists of a mel-spectrogram encoder and a phoneme decoder. The pre-net in the encoder consists of a 3-layer 1D convolution with hidden size and kernel size of 512 and 9 respectively. The hidden size of the bi-directional LSTM in the encoder is 512. The decoder consists of a phoneme embedding lookup table with dimension of 512, and a 2-layer LSTM with hidden size of 1024.

Model Training

The alignment model is trained with a batch size of 64 sentences on one NVIDIA V100 GPU. We use Adam optimizer (Kingma and Ba, 2014) with , , and a learning rate of that is exponentially decaying to after 30,000 iterations. We also apply L2 regularization with weight of . We use greedy training to better learn the alignment model. Specifically, we train the alignment model with the first 10% length of lyrics and audio in each song, and then we gradually increase the length by 2% every epoch. Training takes 100 epochs to converge. We then fine-tune the alignment model with the aligned lyrics and audio at the sentence level to further obtain the phoneme-level alignment.
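The greedy length schedule described above can be written down in a few lines. This sketch assumes zero-indexed epochs and that the ratio is capped at 100%. The function names are illustrative and integer percentages are used to avoid floating-point drift.

```python
def length_ratio_percent(epoch, start=10, step=2):
    """Percentage of each song used at a given (zero-indexed) epoch."""
    return min(100, start + step * epoch)


def truncate_to_ratio(sequence, epoch):
    """Keep the first part of a sequence according to the schedule (at least one item)."""
    keep = max(1, len(sequence) * length_ratio_percent(epoch) // 100)
    return sequence[:keep]
```

Under these assumptions the full song length is reached at epoch 45, so the remaining epochs of the 100-epoch run train on whole songs.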

When training the alignment model with the lyrics and audio at the whole-song level, it is difficult to store all the necessary activations in GPU memory for back-propagation due to the extremely long sequence. Therefore, we apply truncated back-propagation through time (TBTT) (Jaeger, 2002), which is widely used in training RNN models. TBTT does not require a complete backward pass through the whole sequence and thus saves GPU memory. We detail the whole training procedure in Section A.2.

5.1.3. Singing Model

Model Configuration

Our singing model consists of 4, 2, 4 and 4 Transformer blocks in the lyrics encoder, pitch encoder, reference encoder and decoder respectively, and the output linear layer converts the 384-dimensional hidden states into 513-dimensional linear-spectrograms. Other configurations follow Ren et al. (2019) and we list them in Section A.4.

Model Training and Inference

We train the singing model with a total batch size of 64 sequences on 4 NVIDIA V100 GPUs. We use Adam optimizer with , , and follow the same learning rate schedule in Vaswani et al. (2017). Training takes 160k steps to converge. For each singer in the test set, we choose a high-quality audio of this singer from the training set as the reference audio to synthesize voices for evaluation.
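For reference, the learning rate schedule of Vaswani et al. (2017) increases the rate linearly during warm-up and then decays it with the inverse square root of the step. The sketch below uses the hidden size of 384 from the model configuration. The warm-up length of 4000 steps is the default from that paper and is an assumption here, since the text does not state the value used for DeepSinger.

```python
def noam_lr(step, d_model=384, warmup_steps=4000):
    """Learning rate at a given training step (step >= 1).

    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate peaks at `step == warmup_steps` and is lower both earlier and later in training.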

5.2. Accuracy of Alignment Model

We use two metrics to measure the alignment accuracy: 1) percentage of correct segments (PCS) (Dzhambazov et al., 2017) for sentence-level alignment, which measures the ratio between the length of the correctly aligned segments and the total length of the song, 2) average absolute error (ASE) (Mesaros and Virtanen, 2008) for phoneme-level alignment, which measures the phoneme boundary deviation.
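A simplified reading of the two metrics is sketched below. The exact definitions in the cited works may differ in detail. Here PCS is taken as the total overlap between each predicted sentence segment and its reference segment divided by the song length, and ASE as the mean absolute deviation between matching boundaries. All function and parameter names are illustrative.

```python
def pcs(ref_segments, pred_segments, total_length):
    """Percentage of correct segments.

    ref_segments, pred_segments: lists of (start, end) in seconds, one pair per
    sentence and in the same order. total_length: song length in seconds.
    """
    correct = 0.0
    for (r_start, r_end), (p_start, p_end) in zip(ref_segments, pred_segments):
        # Only the part of the prediction that overlaps the reference counts.
        correct += max(0.0, min(r_end, p_end) - max(r_start, p_start))
    return correct / total_length


def ase(ref_boundaries, pred_boundaries):
    """Average absolute error between matching boundaries (same unit as the input)."""
    if len(ref_boundaries) != len(pred_boundaries) or not ref_boundaries:
        raise ValueError("boundary lists must be non-empty and of equal length")
    errors = [abs(r - p) for r, p in zip(ref_boundaries, pred_boundaries)]
    return sum(errors) / len(errors)
```

For instance, if two reference sentences span 0 to 4 s and 4 to 10 s and the predicted split point is at 5 s, 9 of the 10 seconds are correctly assigned.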

For the 50 test sentences in each language, we manually annotate the sentence-level and phoneme-level boundaries as the ground truth to evaluate the accuracy of the alignment model. It is easy for humans to annotate the sentence-level boundaries, but hard to annotate the phoneme-level boundaries. Therefore, we annotate the word/character-level (word level for English and character level for Chinese and Cantonese) boundaries instead of phoneme-level boundaries.

The PCS and ASE for the three languages are shown in Table 2. It can be seen that PCS is above 80% for Chinese and Cantonese and 77.4% for English, while ASE is below 100 ms on all three languages, which demonstrates the high accuracy of our alignment model at both the sentence level and the word/character level, considering the distributions of sentence-level and phoneme-level duration shown in Figures 6 and 7.

We compare the accuracy of our alignment model on Chinese with Montreal Forced Aligner (MFA) (McAuliffe et al., 2017), an open-source system for speech-text alignment with good performance, which is also trained without any manual alignment annotations. Since MFA does not handle the alignment of long sentences (above 60 seconds) well, we only compare our model with MFA in terms of character-level ASE. MFA achieves 78.5 ms character-level ASE on the test set while our model achieves 76.3 ms, which is very close. However, our model supports the alignment of a whole song while MFA does not, which demonstrates the advantage of our alignment model.

Language     PCS (sentence-level)   ASE
Chinese      83.1%                  76.3 ms
Cantonese    81.2%                  85.2 ms
English      77.4%                  92.5 ms
Table 2. The accuracy of the alignment model on the three languages, in terms of the sentence-level metric: percentage of correct segments (PCS) and word/character-level metric: average absolute error (ASE). For ASE, we use word level for English and character level for Chinese and Cantonese.

5.3. Voice Quality

We evaluate the quality of the synthesized voices by our singing model, both quantitatively with pitch accuracy and qualitatively with mean opinion score (MOS).

Quantitative Evaluation

We evaluate the accuracy of the pitches in the synthesized singing voices, following Lee et al. (2019). We first extract the fundamental frequency (F0) sequence from the synthesized audio using Praat (Boersma et al., 2002), and then convert it into pitch sequence. We calculate the frame-wise accuracy of the extracted pitch sequence with regard to the ground-truth pitch sequence that the singing model conditions on. We also provide the upper bound of the pitch accuracy by calculating the pitch accuracy of the voices reconstructed from the ground-truth linear-spectrograms, with regard to the ground-truth voices. As shown in Table 3, the upper bounds of the pitch accuracy on the three languages are higher than 95%, while DeepSinger can generate songs with pitch accuracy higher than 85% on three languages. Considering that human singing to a music score can usually achieve about 80% accuracy according to Lee et al. (2019), the accuracy achieved by DeepSinger is high enough.
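The frame-wise pitch accuracy can be computed as the fraction of frames whose extracted note equals the note the model was conditioned on. The sketch below assumes both sequences are already frame-aligned lists of note names. How unvoiced frames are treated is not specified in the text, so they are simply compared like any other label.

```python
def pitch_accuracy(predicted_notes, target_notes):
    """Fraction of frames where the predicted note matches the target note."""
    if len(predicted_notes) != len(target_notes) or not target_notes:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == t for p, t in zip(predicted_notes, target_notes))
    return matches / len(target_notes)
```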

Setting        Chinese   Cantonese   English
DeepSinger     87.60%    85.32%      86.02%
Upper Bound    96.28%    95.62%      96.06%
Table 3. The pitch accuracy of DeepSinger and the corresponding upper bound.

Qualitative Evaluation

We conduct MOS (mean opinion score) evaluation on the test set to measure the quality of the synthesized voices, following Blaauw and Bonada (2017). Each audio sample is rated by at least 20 listeners, who are all native speakers of the corresponding language. We compare the MOS of the synthesized audio samples among the following systems: 1) GT, the ground-truth audio; 2) GT (Linear+GL), where we synthesize voices based on the ground-truth linear-spectrograms using Griffin-Lim; 3) DeepSinger, where the audio is generated by DeepSinger. The results are shown in Table 4. It can be seen that DeepSinger synthesizes high-quality singing voices, with only 0.34, 0.76 and 0.43 MOS gap to the GT (Linear+GL) upper bound on the three languages. Our audio samples are available on the demo website (https://speechresearch.github.io/deepsinger/).

Setting           Chinese        Cantonese      English
GT                4.36 ± 0.08    4.38 ± 0.09    4.15 ± 0.10
GT (Linear+GL)    4.12 ± 0.07    4.18 ± 0.09    3.95 ± 0.10
DeepSinger        3.78 ± 0.10    3.42 ± 0.10    3.52 ± 0.11
Table 4. The MOS of DeepSinger with 95% confidence intervals on the three languages.

Setting                             MOS
DeepSinger                          3.78 ± 0.10
DeepSinger w/o reference encoder    3.36 ± 0.11
DeepSinger w/o TTS data             3.25 ± 0.12
DeepSinger w/o multilingual         3.79 ± 0.08
Table 5. The MOS with 95% confidence intervals for different methods.

5.4. Method Analyses

We conduct experimental studies on Chinese to analyze some specific designs in DeepSinger, including the effectiveness of the reference encoder, the benefits of leveraging TTS data for auxiliary training, and the influence of multilingual training on voice quality. We introduce the analyses as follows.

Reference Encoder

We analyze the effectiveness of the reference encoder in DeepSinger to handle noisy training data, from three perspectives:

  • We compare DeepSinger (denoted as DeepSinger in Table 5) to the system without the reference encoder but with a singer embedding to differentiate multiple singers (denoted as DeepSinger w/o reference encoder in Table 5). It can be seen that DeepSinger generates singing voices with higher quality than DeepSinger w/o reference encoder, which demonstrates the advantage of the reference encoder over a singer embedding.

  • We further analyze how the reference encoder takes effect with reference audios in different noise levels. As shown in Table 6, we choose clean, normal, and noisy reference audios and evaluate the MOS of the synthesized speech. According to the MOS, it can be seen that clean voice can be synthesized given clean reference audio while noisy reference leads to noisy synthesized voice, which indicates that the reference encoder can learn the characteristics from the reference audio, verifying the analyses in Section 3.3.

  • We also compare DeepSinger w/o reference encoder with DeepSinger in terms of preference scores (introduced in Section A.6). We evaluate the preference scores on singing training data with two noise levels: clean singing data and noisy singing data. As shown in Table 7, DeepSinger largely outperforms DeepSinger w/o reference encoder on noisy singing training data, while it is only slightly better on clean singing training data. The results indicate that 1) the singer embedding averages the timbre of the training data with different noise levels and thus performs differently when trained with singing data of different noise levels; 2) our singing model can tolerate noisy data with the help of the reference encoder and can still generate good voices from noisy training data so long as a clean reference audio is given.

Setting                    Ref MOS        Syn MOS
DeepSinger (Clean Ref)     4.38 ± 0.09    3.78 ± 0.10
DeepSinger (Normal Ref)    4.12 ± 0.10    3.42 ± 0.12
DeepSinger (Noisy Ref)     3.89 ± 0.10    3.27 ± 0.11
Table 6. The MOS with 95% confidence intervals. We generate singing with different (clean, normal and noisy) reference audios of the same singer. Ref and Syn MOS represent the MOS of the reference audio and synthesized audio respectively.

Data Setting     DeepSinger   Neutral   DeepSinger w/o RE
Clean Singing    28.50%       45.30%    26.20%
Noisy Singing    66.20%       26.80%    7.00%
Table 7. The preference scores of singing model with reference encoder (our model, denoted as DeepSinger) and singing model with speaker embedding (denoted as DeepSinger w/o RE) when training with clean and noisy singing data.

TTS Training Data

We further conduct experiments to explore the effectiveness of extra multi-speaker TTS data for auxiliary training. We train the singing model without multi-speaker TTS data (denoted as DeepSinger w/o TTS data) and compare it to DeepSinger. As shown in Table 5, the voice quality of DeepSinger w/o TTS data drops compared to that of DeepSinger, indicating that extra TTS data is helpful for the singing model, which is consistent with previous work (Zhang et al., 2019). To demonstrate that our mined Singing-Wild dataset is critical for singing model training, we train another singing model with only TTS data (denoted as DeepSinger (only TTS)) and compare it to DeepSinger in terms of preference scores. The results are shown in Table 8. It can be seen that DeepSinger largely outperforms the model trained with only TTS data, which demonstrates the importance of the Singing-Wild dataset for SVS.

                    DeepSinger   Neutral   DeepSinger (only TTS)
Preference Score    65.00%       15.50%    19.50%
Table 8. The preference scores of DeepSinger and DeepSinger (only TTS). We choose one TTS audio as the reference audio to generate singing voices on Chinese singing test set.

Multilingual Training

We then analyze whether multilingual training affects the quality of the synthesized voices for a certain language. We train a singing model with singing training data only in Chinese (denoted as DeepSinger w/o multilingual in Table 5) and compare it to DeepSinger. As can be seen, the voice quality of DeepSinger is nearly the same as that of DeepSinger w/o multilingual, which demonstrates that multilingual training does not affect the voice quality of the singing model. As a byproduct of multilingual training, DeepSinger can also perform cross-lingual singing voice synthesis, as described in Section A.7.

6. Discussions

While DeepSinger can synthesize reasonably good singing voices, expert studies show that the synthesized voices have unique styles and are very different from human voices:

  • The voices synthesized by DeepSinger do not contain the breathing and breaks that are common in human singing voices, since AI has no physical constraints caused by human vocal organs.

  • The synthesized singing voices do not have as rich and diverse expressiveness and emotion as human voices, because DeepSinger simply learns average patterns from training data.

Those differences make it easy to distinguish synthesized voices from human singing voices. While one may think this is a disadvantage of DeepSinger, we would like to emphasize that our goal is not to clone human singing voices; in contrast, we aim to generate beautiful AI singing voices with unique styles which can bring new artistic experiences to humans.

Acknowledgments

This work was supported in part by the National Key R&D Program of China (Grant No.2018AAA0100603), Zhejiang Natural Science Foundation (LR19F020006), National Natural Science Foundation of China (Grant No.61836002), National Natural Science Foundation of China (Grant No.U1611461), National Natural Science Foundation of China (Grant No.61751209) and the Fundamental Research Funds for the Central Universities (2020QNA5024). This work was also partially funded by Microsoft Research Asia.

Appendix A Reproducibility

A.1. Details in Data Crawling

We use the Python library Requests (https://github.com/psf/requests) to build the data crawler. We fetch the singer ids from the website first and then use the singer id S_ID to visit the song page of the singer. Finally, we can download all available songs of each singer.

A.2. Details in Lyrics-to-Singing Alignment

Training Details
1:Input: Training dataset , where is a pair of mel-spectrograms and text, total training epoch , TBTT chunk size .
2:Initialize: Set current maximum target length ratio = , phoneme-to-mel position mapping , alignment model .
3:for each  do
4:     for each  do
5:         
6:         set to the length of , set to the length of
7:         while   do
8:              
9:              
10:               =
11:              Optimize with loss
12:              Update with
13:              
14:         end while
15:     end for
16:     
17:end for
18:return
Algorithm 1 Alignment Model Training (with mini batch size 1)

The training of the alignment model is shown in Algorithm 1 (we show the training process with a mini-batch size of 1 for simplicity). In Line 3, we begin training the model for epochs. In Line 4, we traverse the dataset to fetch data id (id), mel-spectrogram () and phone sequence (). In Line 5, we use to represent the start position of phoneme subsequence in the chunk training, and initialize the decoder states for TBTT training. In Line 9, we get the phoneme and mel-spectrogram subsequences ( and ) for this chunk. In Line 10 to Line 12, we train the alignment model and update the phoneme-to-mel position mapping with , where can be transformed from duration which is extracted by the duration extraction algorithm. When the training ends, we can get lyrics-to-singing alignment from . In our work, we set to , and to .

Algorithm Details in Duration Extraction
1:Input: Alignment matrix
2:Output: Phoneme duration
3:Initialize: Initialize reward matrix with zero matrix. Initialize the prefix sum matrix to the prefix sum of each row of , that is, . Initialize all elements in the splitting boundary matrix to zero.
4:for each  do
5:     
6:end for
7:for each  do
8:     for each  do
9:         for each  do
10:              
11:              if  then
12:                  
13:                  
14:              end if
15:         end for
16:     end for
17:end for
18:
19:for each  do
20:     
21:     
22:end for
23:return
Algorithm 2 DP for Duration Extraction

The duration extraction is shown in Algorithm 2. In Line 3, the element in reward matrix represents the maximum reward for the submatrix , and the corresponding splitting point of the position is stored in . From Line 4 to Line 17, we use DP to find the best splitting boundary and record the reward. From Line 18 to Line 22, we collect the best splitting boundary, starting from the position and tracing back, and in the meanwhile, calculate the duration of each phoneme.
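One way to realize the dynamic program described above is sketched below. It is an illustration under stated assumptions, not a transcription of Algorithm 2: the alignment matrix is taken to have one row per phoneme and one column per mel frame, every phoneme is required to receive at least one frame, and the names `extract_durations`, `prefix`, `reward` and `split` are ours.

```python
def extract_durations(alignment):
    """Split mel frames into contiguous, monotonic segments, one per phoneme.

    alignment: list of T rows (phonemes), each a list of S attention weights
    (mel frames). Returns (durations, total_reward), where durations[i] is the
    number of frames given to phoneme i and total_reward is the summed
    attention weight inside the chosen segments.
    """
    T, S = len(alignment), len(alignment[0])
    if T > S:
        raise ValueError("need at least one frame per phoneme")

    # prefix[i][j] = sum of alignment[i][0:j], so a segment sum is a difference.
    prefix = [[0.0] * (S + 1) for _ in range(T)]
    for i in range(T):
        for j in range(S):
            prefix[i][j + 1] = prefix[i][j] + alignment[i][j]

    NEG = float("-inf")
    # reward[i][j]: best reward when the first i phonemes cover the first j frames.
    reward = [[NEG] * (S + 1) for _ in range(T + 1)]
    split = [[0] * (S + 1) for _ in range(T + 1)]
    reward[0][0] = 0.0
    for i in range(1, T + 1):
        # Leave at least one frame for each of the remaining T - i phonemes.
        for j in range(i, S - (T - i) + 1):
            for k in range(i - 1, j):  # phoneme i-1 takes frames k .. j-1
                if reward[i - 1][k] == NEG:
                    continue
                cand = reward[i - 1][k] + prefix[i - 1][j] - prefix[i - 1][k]
                if cand > reward[i][j]:
                    reward[i][j] = cand
                    split[i][j] = k

    # Trace the stored boundaries back from the last phoneme and last frame.
    durations = [0] * T
    j = S
    for i in range(T, 0, -1):
        k = split[i][j]
        durations[i - 1] = j - k
        j = k
    return durations, reward[T][S]
```

If each row of the alignment matrix sums to 1, dividing the returned reward by the number of phonemes gives a value between 0 and 1. Whether this normalization matches the splitting reward used for filtering in Section A.3 is an assumption. The triple loop costs O(T·S²), which is acceptable for a sketch but would likely need vectorization for whole songs.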

A.3. Details in Data Filtration

We use the splitting reward introduced in Section 3.2 to evaluate the alignment quality between singing and lyrics. We plot the alignments of some data and listen to the corresponding audio at the same time. We find that most alignments are accurate when the splitting reward is larger than 0.6. Therefore, we set the threshold to 0.6 and filter the singing data accordingly.

A.4. Details in Singing Model

Model Settings

We list the model settings of the singing model in DeepSinger in Table 9.

Model Setting                         Value
Phoneme Embedding Dimension           384
Lyrics Encoder Layers                 4
Pitch Encoder Layers                  2
Reference Encoder Pre-net Layers      5
Reference Encoder Pre-net Hidden      384
Reference Encoder Layers              4
Decoder Layers                        4
Encoder/Decoder Hidden                384
Encoder/Decoder Conv1D Kernel         3
Encoder/Decoder Conv1D Filter Size    1536
Encoder/Decoder Attention Heads       2
Dropout                               0.1
Total Number of Parameters            34.22M
Table 9. Model settings of the singing model in DeepSinger.

Exploration of Autoregressive Transformer-based Singing Model

We replace the decoder of our singing model with an autoregressive decoder which takes the last output of the decoder (a linear-spectrogram frame) and the outputs of the encoder as inputs. In the inference process, we generate the linear-spectrograms in an autoregressive manner. The results show that the voice quality is not as good as that of our singing model with non-autoregressive generation. Since the error propagation problem in autoregressive generation is much more serious under noisy data, non-autoregressive generation in our singing model shows advantages when given accurate phoneme duration and pitch sequence.

Exploration of Mel-Spectrogram Auxiliary Loss

Inspired by Tacotron, which generates mel-spectrograms first and transforms the mel-spectrograms to linear-spectrograms, we also try to modify the singing model to make it generate mel-spectrograms first and then convert them to linear-spectrograms using a post-net. We minimize the MSE loss of both mel-spectrograms and linear-spectrograms in the training stage and only use linear-spectrograms in the inference stage. The results show that the model trained with the mel-spectrogram auxiliary loss does not improve performance.

A.5. Singing-Wild Dataset

We plot the distributions of sentence-level duration, phoneme-level duration and pitch as analyzed in Section 4 in Figures 6, 7 and 8.

Figure 6. The distributions of sentence-level duration on three languages.
Figure 7. The distributions of phoneme-level duration on three languages.
Figure 8. The pitch distributions on three languages.

A.6. Evaluation Metric

Mean Opinion Score (MOS)

The scale of MOS ranges from 1 to 5. Scores from 1 to 5 denote bad, poor, fair, good and excellent respectively.
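A MOS with a 95% confidence interval, as reported in the tables, can be computed from raw ratings as sketched below. The text does not say how the intervals were obtained. This sketch assumes the common normal approximation, mean ± 1.96 · s / √n, with the sample standard deviation s.

```python
import math
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of the confidence interval)."""
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    half_width = z * stdev(ratings) / math.sqrt(n)
    return mean(ratings), half_width
```

A result of `(3.78, 0.10)` would then be reported as 3.78 ± 0.10.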

Preference Score

We give each listener two voices and let the listener choose a better one or keep neutral based on the audio quality.
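The preference scores reported in the tables are the percentages of listener votes for each option. A minimal sketch follows; the vote labels "A", "neutral" and "B" are hypothetical.

```python
from collections import Counter


def preference_scores(votes, options=("A", "neutral", "B")):
    """Percentage of votes received by each option."""
    if not votes:
        raise ValueError("no votes given")
    counts = Counter(votes)
    return {opt: 100.0 * counts[opt] / len(votes) for opt in options}
```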

Similarity Score

We use the similarity score to evaluate how similar the generated audio is to the singer in the reference audio. The similarity evaluation is performed by 20 people, who are told to focus on the similarity of the singers to one another rather than the content or audio quality. The scale of the similarity score ranges from 1 to 5, denoting not at all similar, slightly similar, moderately similar, very similar and extremely similar respectively.

A.7. Analysis on Cross-Lingual Synthesis

We analyze whether DeepSinger can synthesize singing voices with the reference audio in one language but lyrics in another language, and evaluate the quality in terms of similarity score. For each language (English and Chinese), we conduct evaluation on the following settings: 1) the similarity between ground-truth audios from the same singer, which can be regarded as the upper bound (denoted as GT (Same)); 2) the similarity between ground-truth audios from different singers, which can be regarded as the lower bound of the similarity score (denoted as GT (Diff)); 3) the similarity between the generated singing audio and the reference audio in the same language (denoted as Same Lan); 4) the similarity between the generated singing audio and the reference audio in different languages (denoted as Cross Lan).

Language    GT (Same)      GT (Diff)      Same Lan       Cross Lan
English     4.53 ± 0.08    1.48 ± 0.13    3.39 ± 0.10    3.19 ± 0.14
Chinese     4.65 ± 0.07    1.32 ± 0.10    3.68 ± 0.08    3.21 ± 0.17
Table 10. The similarity scores of DeepSinger with 95% confidence intervals.

We synthesize voices in both Chinese and English to analyze the cross-lingual synthesis of DeepSinger. As shown in Table 10, DeepSinger achieves similarity scores above 3.0 (moderately similar) in “Cross Lan” setting, which demonstrates that the voice and timbre information captured by the reference encoder can be transferred across different languages.