Learning Singing From Speech

12/20/2019 ∙ by Liqiang Zhang, et al. ∙ Tencent ∙ Beijing Institute of Technology

We propose an algorithm that is capable of synthesizing a high-quality target speaker's singing voice given only their normal speech samples. The proposed algorithm first integrates speech and singing synthesis into a unified framework, and learns universal speaker embeddings that are shareable between the speech and singing synthesis tasks. Specifically, the speaker embeddings learned from normal speech via the speech synthesis objective are shared with those learned from singing samples via the singing synthesis objective in the unified training framework. This makes the learned speaker embedding a transferable representation for both speaking and singing. We evaluate the proposed algorithm on a singing voice conversion task where the content of the original singing is rendered in the timbre of another speaker's voice learned purely from their normal speech samples. Our experiments indicate that the proposed algorithm generates high-quality singing voices that sound highly similar to the target speaker's voice given only his or her normal speech samples. We believe the proposed algorithm will open up new opportunities for singing synthesis and conversion for broader users and applications.




1 Introduction

Singing is one of the most important forms of musical expression, and singing synthesis techniques have many applications in the entertainment industry. Over the past decades, many approaches have been proposed for singing synthesis. These include methods based on concatenative unit selection [3], as well as more recent approaches based on deep neural networks (DNN) [10] and autoregressive generation models [2].

While existing singing synthesis algorithms are capable of producing natural singing, they normally require a large amount of singing data for training new voices. Compared to normal speech data, singing data is much more difficult and expensive to collect. To address this limitation, more data-efficient singing synthesis approaches [1] have been proposed recently, which adapt a multi-speaker singing synthesis model with a small amount of the target speaker's singing data.

Alternatively, singing synthesis with new voices can be achieved through singing voice conversion. The task of singing voice conversion is to convert one's singing to the voice of another while keeping the singing content the same. Traditional singing voice conversion [6, 7, 14] relies on parallel singing data to learn a conversion function between different speakers. However, a recent study [9] on unsupervised singing voice conversion uses a WaveNet [11] based autoencoder architecture to achieve singing voice conversion without parallel singing data, or even transcribed lyrics or notes.

While the data-efficient singing synthesis approach [1] and the unsupervised singing voice conversion method [9] can efficiently generate singing with new voices, they still require a minimal amount of singing voice samples from the target speakers. This has limited the applications of singing voice synthesis to relatively restricted scenarios where the target speaker's singing voice has to be available.

On the other hand, normal speech samples are much easier to collect than singing. However, only a few studies have investigated the use of speech samples for singing synthesis. The speech-to-singing synthesis method proposed in [13] attempts to convert a speaking voice to singing by directly modifying acoustic features such as the f0 contour and phoneme durations in read speech. While speech-to-singing approaches can produce singing from read lyrics, they normally require a non-trivial amount of manual tuning of acoustic features to achieve high intelligibility and naturalness of the singing voice.

Figure 1: Model architecture of DurIAN-4S.

In this paper, we propose an algorithm that directly synthesizes natural singing in a target speaker's voice by learning their voice characteristics from speech samples (a sound demo of the proposed algorithm can be found at https://tencent-ailab.github.io/learning_singing_from_speech). The key part of the proposed algorithm is to learn universal speaker embeddings, such that the speaker embeddings learned for the task of speech synthesis can be used for singing synthesis, and vice versa. For this purpose, we use our recently proposed autoregressive generation model, the Duration Informed Attention Network (DurIAN) [16], to unify text-to-speech and singing synthesis in a single framework. DurIAN, originally proposed for the task of multimodal synthesis, is essentially an autoregressive feature generation framework that can generate acoustic features (e.g., mel-spectrograms) from any audio source frame by frame. In the proposed method, phoneme duration, fundamental frequency (F0), and root-mean-square energy (RMSE) are extracted from training data containing either singing or normal speech, and used as inputs for reconstructing the target acoustic features. The entire model is trained jointly with learnable speaker embeddings as conditional input. The trained model and speaker embeddings can then be used to convert any singing into the target speaker's voice by using his or her speaker embedding as the conditional input.

The paper is organized as follows. Section 2 introduces the architecture of our conversion model. Section 3 describes the experiments. Sections 4 and 5 present the conclusion and acknowledgements.

2 Model Architecture

In this section, we first describe the DurIAN-based Speech and Singing Synthesis system (DurIAN-4S), a unified speech and singing synthesis system based on DurIAN. We then present a singing voice conversion approach based on DurIAN-4S.

2.1 DurIAN-4S

While DurIAN was originally proposed for the task of multimodal speech synthesis, it is a general autoregressive framework that can be used for other synthesis tasks. The original DurIAN model is modified here to perform speech and singing synthesis at the same time. The major difference of DurIAN-4S compared to DurIAN is that it takes additional inputs: attributes of singing that are useful for singing synthesis (music notes, f0, etc.). As the focus of this study is singing voice conversion (for the task of singing synthesis from notes and lyrics, the music notes can be used as additional inputs), we use frame-level f0 and root-mean-square energy (RMSE) extracted from the original singing/speech as the additional inputs (Fig. 1). Note that f0 and RMSE would not be available at inference time for speech synthesis; however, our objective is singing voice conversion, and the model is not used for speech synthesis inference.

The architecture of DurIAN-4S is illustrated in Fig. 1. It includes (1) an encoder that encodes the context of each phoneme, (2) an alignment model that aligns the input phoneme sequence to the target acoustic frames, and (3) an autoregressive decoder network that generates the target mel-spectrogram features frame by frame.

2.1.1 Encoder

We use the phoneme sequence directly as input for both speech and singing synthesis. The output of the encoder is a sequence of hidden states containing the sequential representation of the input phonemes:

$h_{1:N} = \mathrm{encoder}(x_{1:N}),$

where $N$ is the length of the input phoneme sequence. (The state-skipping structure in DurIAN [16] is not used here, as it is not a necessary component for singing synthesis or conversion.)
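As a toy illustration of the encoder input, the sketch below maps a phoneme sequence to learned embedding vectors, the representation the CBHG-style encoder consumes. The phoneme table and dimensions are assumptions for illustration only, not the trained model's.

```python
import numpy as np

# Hypothetical phoneme inventory and embedding table (vocab_size, dim).
phoneme_table = {"sil": 0, "a": 1, "b": 2}
embedding = np.random.randn(len(phoneme_table), 4)

def encode_phonemes(phonemes):
    """Look up the embedding vector for each phoneme in the sequence."""
    ids = [phoneme_table[p] for p in phonemes]
    return embedding[ids]  # (N, dim), N = sequence length

states = encode_phonemes(["sil", "a", "b"])
print(states.shape)  # (3, 4)
```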

2.1.2 Alignment model

The purpose of the alignment model is to generate frame-aligned hidden states that will be used as input for autoregressive generation. The output hidden sequence from the encoder is first concatenated with the speaker embedding and passed through a fully connected layer for dimension reduction:

$h'_{1:N} = \mathrm{FC}(h_{1:N} \oplus e_s),$

where $\oplus$ indicates concatenation and $e_s$ indicates the embedding of speaker $s$. The output hidden states after the dimension reduction layer are expanded according to the duration of each phoneme:

$u_{1:T} = \mathrm{expand}(h'_{1:N}, d_{1:N}),$

where $T$ is the total number of input audio frames. The state expansion is simply the replication of hidden states according to the provided phoneme duration. The duration of each phoneme is obtained from forced alignment of the input phonemes and acoustic features. The frame-aligned hidden states are then concatenated with the frame-level f0, RMSE, and the relative position of every frame inside each phoneme:

$u'_{1:T} = u_{1:T} \oplus f_{1:T} \oplus r_{1:T} \oplus p_{1:T},$

where $f_{1:T}$ and $r_{1:T}$ represent the f0 and RMSE of each frame respectively, and $p_{1:T}$ is the position code of each frame.
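The state expansion and frame-level concatenation described above can be sketched in a few lines of NumPy. All shapes and names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def expand_states(hidden, durations):
    """Replicate each phoneme's hidden state for its duration in frames."""
    # hidden: (N, D) phoneme-level states; durations: (N,) frame counts.
    return np.repeat(hidden, durations, axis=0)  # (T, D), T = sum(durations)

def position_code(durations):
    """Relative position of every frame inside its phoneme, in [0, 1)."""
    return np.concatenate([np.arange(d) / d for d in durations])

def frame_conditions(hidden, durations, f0, rmse):
    """Frame-aligned states concatenated with f0, RMSE and position code."""
    expanded = expand_states(hidden, durations)  # (T, D)
    pos = position_code(durations)               # (T,)
    return np.concatenate(
        [expanded, f0[:, None], rmse[:, None], pos[:, None]], axis=1)

hidden = np.random.randn(3, 8)    # 3 phonemes, 8-dim hidden states
durations = np.array([2, 3, 1])   # frames per phoneme, T = 6 in total
f0 = np.random.rand(6)
rmse = np.random.rand(6)
cond = frame_conditions(hidden, durations, f0, rmse)
print(cond.shape)  # (6, 11)
```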

Figure 2: The process diagram of training and conversion. The yellow parts are used in the training stage, the green parts are used in the conversion stage, and the blue parts are used in both stages. The WaveRNN [5] model is trained separately.

2.1.3 Decoder

The decoder is the same as in DurIAN, composed of two autoregressive RNN layers. Different from the attention mechanism used in end-to-end systems, the attention context is computed from a small number of encoded hidden states that are aligned with the target frames, which reduces the artifacts observed in end-to-end systems. We decode two frames per time step in this paper. The output from the decoder network is passed through a post-CBHG [15] to improve the quality of the predicted mel-spectrogram:

$\hat{y}_{1:T} = \mathrm{CBHG}(\mathrm{decoder}(u'_{1:T})).$

The entire network is trained to minimize the mel-spectrogram prediction loss before and after the post-CBHG:

$\mathcal{L} = \mathcal{L}(y_{1:T}, \hat{y}^{\mathrm{dec}}_{1:T}) + \mathcal{L}(y_{1:T}, \hat{y}_{1:T}) + \lambda\mathcal{R},$

where $\mathcal{R}$ represents l2 regularization.
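A minimal sketch of this training loss is shown below: the mel-spectrogram prediction error both before and after the post-CBHG, plus l2 weight regularization. The choice of an l1 distance for the prediction terms and all names are assumptions.

```python
import numpy as np

def total_loss(target, pred_decoder, pred_postnet, weights, lam=1e-6):
    """Dual prediction loss plus l2 regularization, as described above.

    target:       (T, n_mels) ground-truth mel-spectrogram
    pred_decoder: (T, n_mels) decoder output before the post-CBHG
    pred_postnet: (T, n_mels) output after the post-CBHG
    weights:      list of model weight arrays for regularization
    """
    loss_before = np.abs(target - pred_decoder).mean()  # pre-CBHG term
    loss_after = np.abs(target - pred_postnet).mean()   # post-CBHG term
    reg = lam * sum(np.sum(w ** 2) for w in weights)    # l2 regularization
    return loss_before + loss_after + reg
```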

2.2 Singing Voice Conversion

The whole process of our method is illustrated in Fig. 2. The training dataset contains a multi-speaker speech and singing corpus. For the singing voice conversion task, the target speaker or singer must be included in the training data, while the source singing or singer to be converted does not have to be seen in training. The preprocessing module mainly consists of two parts: a TDNN-based phoneme alignment model [12] and the WORLD vocoder [8]. The TDNN model is a component of a pre-trained general speech recognition model, which generates the phoneme sequence and its duration alignment from speech and singing data. The WORLD vocoder is used to extract F0, which reflects the rhythm and melody of the singing. Because the F0 envelope also determines the tone of each phone, we use non-tonal phones in our experiments. In addition, we found that RMSE can greatly improve the quality and stability of singing voice conversion. The inputs of DurIAN-4S are the phoneme sequence, phoneme durations, f0, RMSE, and speaker identity. The training target of DurIAN-4S is to reconstruct the mel-spectrogram. In the training stage, the embeddings of speakers with speech samples and those with singing samples are all optimized jointly.
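The RMSE feature used above can be computed directly from the waveform; here is a plain NumPy sketch. The frame and hop sizes are assumptions, and F0 extraction (done with the WORLD vocoder in the paper, e.g. via a package such as pyworld) is not shown.

```python
import numpy as np

def rmse(signal, frame_length=1024, hop_length=256):
    """Frame-level root-mean-square energy of a waveform."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    out = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop_length: i * hop_length + frame_length]
        out[i] = np.sqrt(np.mean(frame ** 2))  # RMS of this frame
    return out
```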

After the DurIAN-4S model is trained, it can be used to convert any singing to a target speaker's voice. To perform singing voice conversion, we first extract the f0, phoneme duration, and RMSE with the preprocessing module, and use these as the input for singing generation. By choosing a different speaker embedding during singing generation, we can produce singing with a different voice. The mel-spectrogram generated by DurIAN-4S after conversion is passed to a WaveRNN [5] model for waveform generation.

When converting between male and female voices, the input F0 should be multiplied by a scalar:

$f_0^{\mathrm{conv}} = f_0^{(s)} \cdot \dfrac{\bar{F}_0^{(t)}}{\bar{F}_0^{(s)}},$

where $s$ denotes the source singing, $t$ denotes the target speaker, and $\bar{F}_0$ is the average F0 of the vowel phones in the audio. However, the pitch of one's singing is usually higher than the pitch of speech from the same person, and it is common to adjust the key of a song within a certain range for different singers. We can therefore control the scalar to obtain flexible conversion results.

3 Experiment

3.1 Dataset

The training set contains the Tencent multi-speaker speech corpus (TSP) and the Tencent singing corpus (TSG). From the TSP corpus, we choose 3 male speakers and 4 female speakers, each with 1.5 hours of data. The TSG corpus contains a total of 28 hours of singing data recorded by 3 female singers. For the singing voice conversion task, we choose source singing from a separate singing corpus that is not used in training. All the data has a sampling rate of 24 kHz.

3.2 Model Parameters

In our experiments, the dimensions of the phoneme embedding, speaker embedding, encoder CBHG module, and attention layer are all 256. The decoder has 2 GRU layers with 256 dimensions, and batch normalization is used in the encoder and post-net modules. We use the Adam optimizer and an initial learning rate with a warm-up [4] schedule. The model converges after a total of 250,000 steps with a batch size of 32. We found that a multi-speaker trained WaveRNN model improves synthesis stability in this singing voice conversion task.
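The gradual warm-up schedule cited above [4] can be sketched as a linear ramp from zero to the initial learning rate over a number of warm-up steps, then held constant. The base rate and step counts here are assumptions, as the paper does not state them.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=4000):
    """Linear learning-rate warm-up, constant after warmup_steps."""
    return base_lr * min(1.0, step / warmup_steps)
```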

3.3 Quality and Similarity Evaluation

Since we were not able to find any public benchmarks on speech-based singing voice conversion, we compare the proposed approach against singing voice conversion based on singing samples. Both the quality of the converted singing voice and the similarity between the converted singing voice and the target speaker's voice are compared. Subjective evaluation with Mean Opinion Scores (MOS) is used. A total of 14 subjects participated in our listening tests.

We select 20 segments from two different songs in a separate singing corpus. Three male speakers and two female speakers from the TSP corpus are selected as target speakers, and 1 singer from the TSG corpus as the target singer. We also conduct an ablation study on the importance of using RMSE for singing voice conversion. For the timbre similarity evaluation, the subjects are asked to score the similarity of voice timbre between the converted singing and the target speaker's normal speech.

The evaluation results are shown in Table 1. The MOS scale is set between 1 and 5, with 5 being the highest score. We first examine the effect of RMSE as an additional input for singing voice conversion. The results show that using RMSE improves both quality and similarity significantly. We found that the energy information of each frame, concatenated with F0, helps the model learn the pronunciation of long vowels: the energy of each frame indicates the loudness of pronunciation, helping the model determine when vowels should stop.

We also compare the performance of singing voice conversion using the target speaker's normal speech versus using their singing samples. Singing voice conversion using the target speaker's singing samples receives better scores than conversion using their speech samples. This is expected, as it is much easier to learn a speaker's singing voice from their singing samples than from speech samples. However, the similarity scores for conversion using speech samples are not too far off, showing that the proposed algorithm can synthesize the target speaker's singing voice, with both high quality and high similarity, from speech samples only. The samples used in our experiments can be found at https://tencent-ailab.github.io/learning_singing_from_speech.

Method      Target speaker   Naturalness   Similarity
f0          singing          3.23          3.00
f0          speech           2.77          2.84
f0 + RMSE   singing          3.80          3.65
f0 + RMSE   speech           3.42          3.49
Table 1: MOS for singing conversion quality and similarity. The target speaker type indicates which samples from the target speaker were used for singing voice conversion: "singing" means the target speaker's singing samples were used, and "speech" means their speech samples were used.

4 Conclusion

In this paper, we proposed an algorithm that synthesizes natural singing in a target speaker's voice given only their normal speech samples. We evaluated the proposed algorithm on a singing voice conversion task with speech samples and obtained very promising results. In future work, we will focus on reducing the amount of target speech samples needed for both singing synthesis and conversion tasks.

5 Acknowledgements

The authors would like to thank Chunlei Zhang, Dongxiang Xu, and other members of the Tencent AI Lab team for providing suggestions on model structure and optimization.


  • [1] M. Blaauw, J. Bonada, and R. Daido (2019) Data efficient voice cloning for neural singing synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6840–6844. Cited by: §1, §1.
  • [2] M. Blaauw and J. Bonada (2017) A neural parametric singing synthesizer modeling timbre and expression from natural songs. Applied Sciences 7 (12), pp. 1313. Cited by: §1.
  • [3] J. Bonada, M. Umbert, and M. Blaauw (2016) Expressive singing synthesis based on unit selection for the singing synthesis challenge 2016.. In INTERSPEECH, pp. 1230–1234. Cited by: §1.
  • [4] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §3.2.
  • [5] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu (2018) Efficient neural audio synthesis. CoRR abs/1802.08435. External Links: Link, 1802.08435 Cited by: Figure 2, §2.2.
  • [6] K. Kobayashi, T. Toda, G. Neubig, S. Sakti, and S. Nakamura (2014) Statistical singing voice conversion with direct waveform modification based on the spectrum differential. In Fifteenth Annual Conference of the International Speech Communication Association, Cited by: §1.
  • [7] K. Kobayashi, T. Toda, G. Neubig, S. Sakti, and S. Nakamura (2015) Statistical singing voice conversion based on direct waveform modification with global variance. In Sixteenth Annual Conference of the International Speech Communication Association, Cited by: §1.
  • [8] M. Morise, F. Yokomori, and K. Ozawa (2016) WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE TRANSACTIONS on Information and Systems 99 (7), pp. 1877–1884. Cited by: §2.2.
  • [9] E. Nachmani and L. Wolf (2019) Unsupervised singing voice conversion. arXiv preprint arXiv:1904.06590. Cited by: §1, §1.
  • [10] M. Nishimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda (2016) Singing voice synthesis based on deep neural networks.. In Interspeech, pp. 2478–2482. Cited by: §1.
  • [11] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §1.
  • [12] V. Peddinti, D. Povey, and S. Khudanpur (2015) A time delay neural network architecture for efficient modeling of long temporal contexts. In Sixteenth Annual Conference of the International Speech Communication Association, Cited by: §2.2.
  • [13] T. Saitou, M. Goto, M. Unoki, and M. Akagi (2007) Speech-to-singing synthesis: converting speaking voices to singing voices by controlling acoustic features unique to singing voices. In 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 215–218. Cited by: §1.
  • [14] F. Villavicencio and J. Bonada (2010) Applying voice conversion to concatenative singing-voice synthesis. In Eleventh Annual Conference of the International Speech Communication Association, Cited by: §1.
  • [15] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al. (2017) Tacotron: towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135. Cited by: §2.1.3.
  • [16] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei, et al. (2019) DurIAN: duration informed attention network for multimodal synthesis. arXiv preprint arXiv:1909.01700. Cited by: §1, footnote 4.