With the recent development of deep learning, learning-based singing voice synthesis (SVS) systems have been proposed that synthesize sounds as natural as those of the concatenative method [macon1997concatenation, bonada2003sample, kenmochi2007vocaloid] while being more flexible to extend. For example, three SVS systems based on DNN, LSTM, and WaveNet architectures were proposed in [nishimura2016singing, kim2018korean, blaauw2017neural], respectively. These systems all include an acoustic model that is trained on paired singing, lyrics, and sheet music data, and each acoustic model is trained to predict the vocoder features used as input to the vocoder.
Although these neural network-based SVS systems can achieve adequate performance, networks that predict vocoder features cannot exceed the upper bound set by the vocoder's performance. Therefore, it is meaningful to propose an end-to-end framework that directly generates a linear-spectrogram rather than vocoder features. However, extending the SVS system to an end-to-end framework is challenging. Generating a more complex target, the linear-spectrogram, increases the complexity of the model and requires correspondingly more training data for the model to train and generalize sufficiently. Moreover, gathering singing audio with aligned lyrics in a controlled environment requires a lot of effort.
In this paper, we propose a Korean SVS system that can be trained in an end-to-end manner with a moderate amount of data (the generated results can be found at ksinging.strikingly.com). Our baseline network is inspired by DCTTS [tachibana2018efficiently], which is known as an efficiently trainable text-to-speech (TTS) system. We applied the following novel approaches to enable end-to-end network training. First, we used a phonetic enhancement masking method, which separately models low-level acoustic features related to pronunciation from text information, to make more efficient use of the information contained in the training data. Second, we proposed reusing the input data at the super-resolution stage and training in an adversarial manner to produce singing with better sound quality.
The contributions of this paper are as follows: 1) We designed an end-to-end Korean SVS system and suggested a way to train it effectively. 2) We proposed a phonetic enhancement masking method that helps to produce more accurate pronunciation. 3) We proposed a conditional adversarial training method for the generation of more realistic singing voices.
2 Related work
The SVS system is similar to the TTS system in that both synthesize natural human speech. Recently, end-to-end TTS systems that are trained in an autoregressive manner, such as Tacotron [wang2017tacotron] and Deep Voice [gibiansky2017deep], have shown better performance than conventional methods. In addition, various follow-up studies have added further controllable elements such as prosody and style [skerry2018towards, wang2018style], or have proposed models that can be trained more efficiently [tachibana2018efficiently, chung2018semi]. We conducted our study by modifying a TTS model to suit the SVS task, based on DCTTS [tachibana2018efficiently], which is known to be capable of efficient end-to-end training.
Generative adversarial networks (GANs) are a widely used technique for training an arbitrary function to generate samples similar to those drawn from a desired data distribution. This training method has been widely accepted in the computer vision community and has become one of the key components for attaining photo-realism in the super-resolution task. Unlike in the image domain, however, only a few works have achieved reasonable success with adversarial training for the super-resolution task (specifically, the bandwidth extension task) in the audio signal processing community [li2018speech]. To further leverage the promise of adversarial training in the audio generation process, we adopted a few recent techniques that stabilize adversarial training, namely, a conditional GAN with projection discriminator [miyato2018cgans] and R1 regularization [brock2018large], which allow us to jointly train the autoregressive network (mel-synthesis) and the super-resolution network, making the proposed system an end-to-end framework.
3 Proposed Network
As illustrated in Figure 1, our proposed model consists of two main modules, a mel-synthesis network and a super-resolution network. The mel-synthesis network is trained to produce a mel-spectrogram from the previous mel input, the time-aligned text, and the pitch inputs. With text and pitch information as conditional input, the super-resolution network upsamples the generated mel-spectrogram to a linear-spectrogram. The discriminator then takes the upsampled result together with the generated mel-spectrogram to train the network in an adversarial manner. During the test phase, a sequence of mel-spectrogram frames is generated in an autoregressive manner from a given text and pitch input and is then upsampled to a linear-spectrogram by the super-resolution network. Finally, the generated linear-spectrogram is converted to a waveform using the Griffin-Lim algorithm [griffin1984signal].
3.1 Input representation
Our training data includes recorded singing voice along with the corresponding text and MIDI. A single MIDI note represents pitch information with an onset and an offset. For each MIDI note, one syllable and its corresponding vocal audio section are manually aligned. Figure 2 shows our input representation more concretely. To construct the frame-level text input sequence, we referred to the pronunciation system of Korean. A Korean syllable can be decomposed into three phonemes, which correspond to the onset, nucleus, and coda, respectively. Since the nucleus occupies most of the duration of a sung syllable in Korean, we assigned the onset and coda to the first and last frames of the input text array, respectively, and filled the remaining frames with the nucleus. Although this does not reflect accurate timing for each phoneme, we found empirically that a convolution-based network with a sufficiently wide receptive field can handle this problem. For the pitch input, we simply assigned a pitch number to each frame. For the mel input, we used the mel-spectrogram itself, extracted from the recorded audio.
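The frame-level text input described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the decomposition relies on the standard Unicode arithmetic for precomposed Hangul syllables, and the handling of syllables without a coda and of notes shorter than two frames is an assumption.

```python
HANGUL_BASE = 0xAC00          # code point of the first precomposed syllable
N_NUCLEUS, N_CODA = 21, 28    # 21 vowels; 27 codas plus "no coda" (index 0)
N_SYLLABLES = 19 * N_NUCLEUS * N_CODA


def decompose_hangul(syllable):
    """Return (onset, nucleus, coda) indices; coda 0 means no coda."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < N_SYLLABLES:
        raise ValueError("not a precomposed Hangul syllable")
    onset, rest = divmod(index, N_NUCLEUS * N_CODA)
    nucleus, coda = divmod(rest, N_CODA)
    return onset, nucleus, coda


def frame_labels(syllable, n_frames):
    """Spread one syllable over the frames of its MIDI note."""
    onset, nucleus, coda = decompose_hangul(syllable)
    labels = [("nucleus", nucleus)] * n_frames
    if n_frames >= 1:
        labels[0] = ("onset", onset)          # first frame carries the onset
    if coda != 0 and n_frames >= 2:
        labels[-1] = ("coda", coda)           # assumption: no coda keeps nucleus
    return labels
```

In a real pipeline the three index spaces would presumably be mapped into one shared phoneme vocabulary before embedding; that mapping is left out here.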
3.2 Mel-synthesis network
The mel-synthesis network aims to generate the mel-spectrogram of the next time step from the given text, pitch, and mel input. We took the text-to-mel network proposed in [tachibana2018efficiently] and modified it to fit the SVS system.
First, in order to feed in pitch information, we added a pitch encoder with the same structure as the text encoder. In addition, the local conditioning method proposed in [oord2016wavenet] was used to condition the mel decoder on the encoded pitch.
Second, we assumed that, among the various elements forming a singing voice, information about pronunciation can be controlled independently using text information. We also assumed that if the low-level audio features that constitute pronunciation can be modeled independently, the network can focus on pronunciation within data composed of various pronunciation-pitch combinations, so that the training data are used more efficiently and the generated singing voice is pronounced more accurately. To this end, we designed an additional phonetic enhancement mask decoder, which receives only the encoded text as input. The output of this decoder is multiplied element-wise by the output of the mel decoder to create the final mel-spectrogram.
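The masking step can be written compactly. The notation below is assumed for illustration and is not taken from the original: $E_T$ is the encoded text, $D_{\mathrm{mel}}(\cdot)$ is the mel decoder output for its (unspecified) inputs, $D_{\mathrm{mask}}(E_T)$ is the output of the phonetic enhancement mask decoder, and $\odot$ is element-wise multiplication.

```latex
\hat{M} = D_{\mathrm{mel}}(\cdot) \odot D_{\mathrm{mask}}(E_T)
```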
We trained the network with L1 and binary divergence losses between the ground truth and generated mel-spectrograms, together with the guided attention loss, as the objective function. Please see [tachibana2018efficiently] for a more detailed explanation of the loss terms. We also assumed that a loss between the differential spectrograms would help the network learn more about the relatively short onset and coda. The overall objective function for the mel-synthesis network therefore combines these terms.
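A minimal sketch of the differential spectrogram term is given below. It rests on assumptions not stated in the text: the difference is taken between consecutive frames along the time axis, the distance is L1, and the two terms are summed with equal weight. The binary divergence and guided attention terms are omitted, and all function names are hypothetical.

```python
def l1_loss(pred, target):
    """Mean absolute error over two time-major spectrograms (lists of frames)."""
    count = sum(len(frame) for frame in pred)
    if count == 0:
        return 0.0
    total = sum(abs(p - t)
                for pred_frame, target_frame in zip(pred, target)
                for p, t in zip(pred_frame, target_frame))
    return total / count


def time_diff(spec):
    """Frame-to-frame difference along the time axis."""
    return [[cur - prev for prev, cur in zip(prev_frame, cur_frame)]
            for prev_frame, cur_frame in zip(spec, spec[1:])]


def mel_loss(pred, target):
    """L1 on the spectrogram plus L1 on its time difference (equal weights assumed)."""
    return l1_loss(pred, target) + l1_loss(time_diff(pred), time_diff(target))
```

The difference term responds to how quickly the spectrum changes between frames, which is the property that distinguishes short onsets and codas from the sustained nucleus.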
3.3 Super-resolution network
In this section, we describe the details of the training method for the super-resolution (SR) network. The purpose of the SR step is to upsample the generated mel-spectrogram into a linear-spectrogram, thereby bringing it into a form that can be made audible. The idea of the SR network was proposed in previous TTS studies, including Tacotron and its variants [tachibana2018efficiently, wang2017tacotron]. Our work differs from these previous works in two ways. First, we additionally feed the aligned text and pitch information into the SR network, exploiting this useful information again in the generation process. Second, we use an adversarial training method to make the SR network produce more realistic sound.
3.3.1 Local conditioning of text and pitch information
Unlike the attention-based TTS literature, an SVS system requires aligned text and pitch information as inputs for controllability in the generation process. This information can therefore be easily reused in the SR step without any time-alignment process, as follows.
More specifically, the outputs of the text encoder and the pitch encoder are each fed into a sequence of convolutional and dropout layers [srivastava2014dropout], and the result is fed into a highway network as local conditioning, as proposed in [oord2016wavenet]. The SR network is trained with its own objective function on the upsampled linear-spectrogram. For the exact network configuration, please refer to Figure 1.
3.3.2 Adversarial training method
To generate realistic sound, we adopted a conditional adversarial training method, which helps the output distribution of the super-resolution network become similar to the real data distribution. Intuitively, in the conditional adversarial training framework, the discriminator not only checks whether the generated linear-spectrogram is realistic but also checks the paired correspondence between the linear-spectrogram and its mel-spectrogram condition. Note that we make a minor assumption that the distribution of the generated mel-spectrogram approximately follows that of the ground truth mel-spectrogram, which allows the joint training of the mel-synthesis and super-resolution networks. The conditioning of the discriminator follows [miyato2018cgans] with a minor modification. First, the condition is fed into a 1d-convolutional layer and the intermediate output of the discriminator is fed into a 2d-convolutional layer. Then, the inner product between the two outputs is computed as a projection. Finally, the obtained scalar value is added to the output of the last layer of the discriminator, resulting in the final logit value. For the exact network configuration, please refer to Figure 1.
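The projection step can be illustrated with a toy example. This sketch assumes that the condition and the intermediate discriminator features have already been mapped to vectors of equal length (the 1d- and 2d-convolutional layers are omitted) and that the last layer is a single linear unit; the function and parameter names are hypothetical.

```python
def projection_logit(features, condition, out_weights, out_bias=0.0):
    """Logit = unconditional last-layer output + <condition, features>."""
    if not len(features) == len(condition) == len(out_weights):
        raise ValueError("all vectors must have the same length")
    # Unconditional part: last linear layer applied to the features.
    unconditional = sum(w * f for w, f in zip(out_weights, features)) + out_bias
    # Conditional part: inner product between condition and features.
    projection = sum(c * f for c, f in zip(condition, features))
    return unconditional + projection
```

The inner-product term is large only when the features agree with the condition, which is how the discriminator can penalize a realistic output that does not match its conditioning input.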
For stable adversarial training, regularization techniques on the discriminator have been proposed in several GAN-related works [gulrajani2017improved, miyato2018spectral, mescheder2018training, brock2018large]. We adopted a simple yet effective gradient penalty technique called R1 regularization. This technique penalizes the squared 2-norm of the gradients of the discriminator only when the sample is taken from the true data distribution.
Note that the output of the discriminator denotes the logit value before the sigmoid function. In the final adversarial loss terms for the discriminator and the generator, the generator parameters include not only the parameters of the super-resolution network but also those of the mel-synthesis network; hence, the two consecutive modules act as one generator function. The function applied to the logits is chosen so that the result is the vanilla GAN loss, as in the original GAN paper [goodfellow2014generative].
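The loss terms discussed above can be sketched on scalar logits. This is an illustration under stated assumptions, not the paper's exact formulation: the generator term uses the common non-saturating form, the R1 penalty uses the usual gamma/2 scaling with gamma left as a free parameter, and the gradient of the discriminator with respect to a real sample is passed in directly rather than computed by automatic differentiation.

```python
import math


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def discriminator_loss(logit_real, logit_fake):
    """Vanilla GAN: -log(sigmoid(real)) - log(1 - sigmoid(fake))."""
    return softplus(-logit_real) + softplus(logit_fake)


def generator_loss(logit_fake):
    """Non-saturating generator loss: -log(sigmoid(fake)) (assumed form)."""
    return softplus(-logit_fake)


def r1_penalty(grad_real, gamma):
    """gamma / 2 * squared 2-norm of dD/dx, evaluated on a real sample only."""
    return 0.5 * gamma * sum(g * g for g in grad_real)
```

For a toy linear discriminator D(x) = w·x + b the gradient with respect to x is simply w, so `r1_penalty(w, gamma)` gives the penalty for any real sample.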
Since there is no publicly available Korean singing voice dataset, we created the dataset as follows. First, we prepared accompaniment and singing voice MIDI files of 60 Korean pop songs. Next, a professional female vocalist was asked to sing to the accompaniment. Then, the singing voice MIDI files were manually realigned so that the recorded audio is exactly aligned with the singing voice MIDI files. Finally, we manually assigned the syllables in the lyrics to each MIDI note of the singing voice MIDI file. The audio length of the entire dataset, excluding silence, is about 2 hours. We used 49 songs for the training dataset, 1 song for validation, and 10 songs for the test dataset.
We trained the discriminator to minimize and the rest of the network to minimize , and . For SR networks, we have to start training after the appropriate level of mel is generated, so we have separately controlled and to add to the objective function. At this point, it was set to , , respectively.
In both cases, we used the Adam optimizer [kingma2014adam] with β1 = 0.5 and β2 = 0.9. The learning rate started from 0.0002 and was halved every 30,000 iterations. All parameters of the networks were initialized with the Xavier initializer [glorot2010understanding].
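The learning rate schedule can be sketched as a step function. The sketch assumes that iterations are counted from zero and that the halving is applied in discrete steps rather than as a smooth decay; the constant and function names are hypothetical.

```python
BASE_LR = 2e-4          # initial learning rate stated in the text
HALVE_EVERY = 30_000    # iterations between halvings
ADAM_BETAS = (0.5, 0.9)  # Adam moment coefficients stated in the text


def learning_rate(iteration):
    """Learning rate after `iteration` updates (step-wise halving assumed)."""
    return BASE_LR * 0.5 ** (iteration // HALVE_EVERY)
```

Under these assumptions the rate stays at 0.0002 for iterations 0 to 29,999, drops to 0.0001 at iteration 30,000, and to 0.00005 at iteration 60,000.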
For the ground truth mel/linear-spectrogram, we first extracted the linear-spectrogram from the audio. We then normalized the linear-spectrogram using a pre-emphasis factor of 0.6 (note that the output was post-emphasized with a post-emphasis factor of 1.3). Afterwards, the mel-spectrogram was obtained by applying an 80-dimensional mel filter bank to the linear-spectrogram, and the same normalization method was used. In order to reduce the complexity of the model, we downsampled the mel-spectrogram by a factor of four by taking the first frame of every four frames, so that the linear-spectrogram has four times as many temporal bins as the mel-spectrogram.
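The temporal downsampling amounts to keeping every fourth frame. A minimal sketch, assuming the mel-spectrogram is stored time-major as a list of frames (the function name is hypothetical):

```python
def downsample_mel(frames, factor=4):
    """Keep the first frame of every `factor` consecutive frames."""
    return frames[::factor]
```

When the number of frames is not a multiple of four, this keeps the first frame of the final incomplete group; whether the original pipeline pads or truncates in that case is not stated.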
We trained a total of five models to see how the three proposed methods actually affect the network: method 1, phonetic enhancement masking; method 2, local conditioning of pitch and text on the super-resolution network; and method 3, the adversarial training method. The differences between the five models are described in Table 1. 20 audio samples from each model were generated from the test dataset. In addition to the generated samples, we also compared ground truth samples. Ground denotes the actual recorded audio, and Recons denotes the audio reconstructed from the magnitude-only ground truth linear-spectrogram using the Griffin-Lim algorithm. Note that the Recons samples were included to evaluate the loss of sound quality caused by the missing phase information.
| Method | baseline | +(method 1) | +(method 2) | +(method 1,2) | +(method 1,2,3) |
4.3.1 Quantitative evaluation
We evaluated whether the network actually produced a singing voice conditioned on the given input. To do this, we extracted the f0 sequence from the generated audio with the WORLD vocoder [morise2016world], converted it into a pitch sequence, and compared it to the input pitch sequence. The higher the similarity between the two sequences, the better the generated singing reflects the input condition. We calculated the frame-wise precision, recall, and f-score of the generated pitch sequence, and the results are shown in Table 1.
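One way to compute such frame-wise scores is sketched below. The exact definitions are not given in the text, so the following are assumptions: f0 is mapped to the nearest MIDI note number with the standard 440 Hz reference, a value of 0 marks an unvoiced or silent frame, and a frame counts as correct when it is voiced in the reference and the estimated pitch equals the reference pitch. Function names are hypothetical.

```python
import math


def hz_to_midi(f0):
    """Nearest MIDI note number for f0 in Hz; 0 marks an unvoiced frame."""
    if f0 <= 0:
        return 0
    return int(round(69 + 12 * math.log2(f0 / 440.0)))


def framewise_scores(reference, estimate):
    """Return (precision, recall, f-score) over aligned per-frame pitches."""
    if len(reference) != len(estimate):
        raise ValueError("sequences must have the same number of frames")
    correct = sum(1 for r, e in zip(reference, estimate) if r != 0 and r == e)
    n_estimated = sum(1 for e in estimate if e != 0)   # voiced frames in output
    n_reference = sum(1 for r in reference if r != 0)  # voiced frames in input
    precision = correct / n_estimated if n_estimated else 0.0
    recall = correct / n_reference if n_reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With this definition, singing during a rest lowers precision, while missing or mistuned notes lower recall.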
Even a real recording, sung while listening to the original MIDI accompaniment, does not match the timing and pitch of every note exactly, so a 100% f-score cannot be obtained. For all generated samples, we obtained an f-score similar to or higher than that of the real recording. This means that, for the given input, the model generated a singing voice whose pitch and timing are at least as accurate as those of the real recording.
4.3.2 Qualitative evaluation
We conducted a listening test to evaluate the quality of the generated singing voice. 19 native Korean speakers were asked to listen to the 20 audio samples from each model. Each participant was asked to evaluate the pronunciation accuracy, sound quality, and naturalness. During the listening test, lyrics of audio samples were provided for more accurate evaluation of pronunciation accuracy. The MOS results are shown in Table 1.
We conducted paired t-tests on the responses for each model and, based on these, verified the effectiveness of the proposed methods. For pronunciation accuracy, we obtained significant differences for all comparisons except between models 2 and 3. In other words, all of the proposed methods helped to create singing voices with more accurate pronunciation, and the performance improved the most when all three methods were applied. For sound quality, methods 1 and 2 did not yield a significant improvement, but applying method 3 showed a significant increase in score. From this we can confirm that training the network in an adversarial manner improves the quality of the generated audio. Finally, for naturalness, there was a significant improvement when all methods were applied.
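For reference, the test statistic of a paired t-test can be computed as below. This is a generic sketch using only the standard library, not the analysis script used in the study; the function name is hypothetical, and the p-value (which requires the t-distribution with n - 1 degrees of freedom) is not computed here.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)   # sample standard deviation (n - 1)
    # Raises ZeroDivisionError if all differences are identical.
    t = mean_diff / (std_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

Each pair would be the two ratings that one listener gave to the same item under two models, which is what makes the paired form appropriate.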
4.4 Analysis on generated spectrogram
In this section, we analyze the features generated by the mel-synthesis and super-resolution networks. For the mel-synthesis network, by observing internally generated features, we found that the low-level acoustic features of pronunciation and pitch were separated without any supervision. In Figure 3, the output of the mel decoder shows the underlying structure of the spectrogram, such as the harmonic structure and the location of f0. In the phonetic enhancement mask, on the other hand, we can observe a shape that determines the intensity of each frequency at every time step, similar to a spectral envelope, which contains non-periodic information. This suggests that, from the perspective of the source-filter model, one of the classical speech modelling techniques, our network can generate sources (the mel decoder output) and filters (the mask) separately in the frequency domain without any supervised training.
We also analyzed the effect of the adversarial training method by observing the generated linear-spectrograms. Three spectrograms, from model4 (without adversarial loss), from model5 (with adversarial loss), and the ground truth, are shown in the second row of Figure 3. While the model4 spectrogram shows blurry high-frequency areas, the model5 spectrogram clearly shows that adversarial training allows the proposed network to generate a sample that is closer to the ground truth. Note that the listening test in Section 4.3.2 confirmed that sound quality improves significantly from model4 to model5, which reinforces our observation.
In this paper, we proposed an end-to-end Korean singing voice synthesis system. We showed that using text information to model the phonetic enhancement mask worked in practice and produced more accurate pronunciation. We also successfully applied the conditional adversarial training method to the super-resolution stage, which resulted in a higher-quality voice.
This work was partly supported by the National Research Foundation of Korea (NRF) funded by the Korea government (NRF-2017R1E1A1A01076284), and partly by an Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No.2019-0-01367).