The rise of deep learning [lecun2015deep] has made more complex sequence generation tasks [sutskever2014sequence, Wang2017TacotronTE, shen2018natural, oord2016wavenet, Kalchbrenner2018EfficientNA] feasible. Text-based generation of natural speech has been continuously investigated over the past decades. Concatenative synthesis with unit selection [hunt1996unit] and statistical parametric speech synthesis [zen2009statistical] were the state-of-the-art systems for many years. However, such systems require substantial human labour and lack naturalness. Recently, a sequence-to-sequence architecture, Tacotron [Wang2017TacotronTE, shen2018natural], has greatly improved the naturalness and similarity of synthesized speech compared with traditional statistical parametric speech synthesis systems [zen2009statistical]. Tacotron, usually followed by a traditional or neural vocoder [oord2016wavenet, Griffin1984SignalEF], takes linguistic features and speaker identity as input and generates a mel spectrogram as output. Unfortunately, when dealing with out-of-domain or abnormal text inputs, Tacotron-like attention-based end-to-end structures can produce unacceptable errors, including skipping, repeating, long unexpected pauses and attention collapse [shen2018natural, tiantencent]. More recently, the stepwise monotonic attention (SMA) method [He2019RobustSA], based on monotonic attention [raffel2017online], was proposed to enforce strict constraints of locality, monotonicity and completeness in the speech synthesis process.
As far as we know, building a natural-sounding TTS system requires at least ten hours of recorded audio. Moreover, every utterance should be recorded in a professional recording studio and the transcribed phonemes should be evenly distributed. Preparing such a large amount of high-quality data for multiple speakers is impractical and extremely expensive. Typically, it is troublesome and unnecessary to ask a native Chinese speaker to record English if they know little of the language. Moreover, there is no chance of gathering 10 hours of training data for a specific person such as a pop star; the only resources available are the limited talks or shows from TV. Therefore, synthesizing arbitrary speech in a target's voice from only a few minutes of audio remains a very important task.
However, building a TTS system with limited data often sacrifices quality and reliability [chung2019semi]. To scale to new speakers, we can adapt an existing pre-trained multi-speaker system to generate new speakers' voices, a well-studied subject of few-shot learning [fink2005object, fei2006one], also known as speaker adaptation [yamagishi2009analysis, leggetter1995maximum]. There are mainly two approaches: the first is to update only the new speaker embedding and combine it with linguistic features as input to a TTS model [jia2018transfer, li2017deep], which may require a very strong speaker encoder network trained on thousands of speakers; the second is to fine-tune the entire multi-speaker network to obtain an optimal single-speaker model [arik2018neural, chen2018sample, 9054301]. Although fine-tuning can combine the advantages of multiple speakers and improve a new speaker's performance, as described above, end-to-end attention models such as Tacotron may suffer unpredictable instability and poor cross-lingual speech in few-shot learning settings. To achieve naturalness and robustness in speech synthesis, FastSpeech [ren2019fastspeech] and the duration informed attention network (DurIAN) [Yu2019DurIANDI] have recently been proposed to overcome the unexpected errors of end-to-end systems by incorporating the duration information of traditional statistical parametric speech synthesis systems [zen2009statistical]. FastSpeech is a non-autoregressive feed-forward framework without attention. DurIAN, originally proposed for multi-modal speech synthesis, is an autoregressive framework that achieves robustness and naturalness by using a skip-state encoder and combining duration with windowed content-based attention [Bahdanau2015NeuralMT].
To improve the scalability of TTS in few-shot speaker adaptation, we introduce AdaDurIAN, an adaptive neural TTS system based on DurIAN that can synthesize natural cross-lingual speech in a new speaker's voice with just a few minutes of monolingual data. We investigate three aspects that have not been fully explored in previous work. First, we employ sequences of phonemes and tones (or stresses) to obtain a robust speaker-independent content encoder, and incorporate the concatenated representation of speaker characteristics into the output states of the content encoder. Second, instead of fine-tuning the weights of the whole architecture, we find that fine-tuning only the speaker embedding and the decoder network leads to fewer pronunciation errors. Last, to generate smooth mel spectrograms in a streaming inference manner, we adopt a time-delayed LSTM post-net instead of a global CBHG-like [Wang2017TacotronTE] module. Through various evaluations, the proposed AdaDurIAN significantly surpasses a Tacotron-like model [He2019RobustSA] in terms of naturalness, speaker similarity and cross-lingual speaking, and also shows promising performance in few-shot emotion transfer tasks.
The rest of this paper is organized as follows. Section 2 describes the detailed architecture of AdaDurIAN and the speaker adaptation strategy. The experiment setup and evaluations are presented in Section 3. Concluding remarks are summarized in the final section.
2 The proposed method
2.1 Architecture of AdaDurIAN
The original DurIAN [Yu2019DurIANDI] is a single-speaker TTS system: a separate model must be trained for each speaker on that speaker's own voice. We make improvements to support multi-speaker, multi-style and multi-lingual speech synthesis. Figure 1 shows the architecture of the proposed AdaDurIAN. It is composed of (1) a speaker-independent content encoder that encodes the linguistic sequences, (2) an alignment model that predicts the duration of each phoneme and then aligns the output states of the content encoder to acoustic frames, and (3) a decoder network that autoregressively generates frames of the mel spectrogram.
2.1.1 Speaker-independent Content Encoder
It is hard to ensure that the input tokens (phonemes, tones, stresses and so on) are evenly distributed in a single speaker's training corpus. To benefit from the knowledge of a multi-speaker training corpus, and differently from DurIAN, we take both the phoneme and the tone (or stress, which appears in English words) sequences, with prosodic boundary symbols, as input to the content encoder of AdaDurIAN. With state skipping [Yu2019DurIANDI], the output of the content encoder is a sequence of hidden states containing a speaker-independent global linguistic feature transformation.
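As a minimal sketch of this input layout, the following toy code pairs each phoneme token with a parallel tone/stress token, treating prosodic boundary symbols as ordinary tokens. The token inventories and names here are hypothetical, not the paper's actual symbol set:

```python
# Toy inventories (assumed for illustration); in practice each stream would
# feed its own embedding table in the content encoder.
PHONEMES = {"sil": 0, "n": 1, "i2": 2, "h": 3, "ao3": 4, "#1": 5}
TONES = {"none": 0, "tone1": 1, "tone2": 2, "tone3": 3, "stress0": 4, "stress1": 5}

def encode_inputs(phones, tones):
    """Map parallel phoneme and tone/stress sequences to ID pairs."""
    assert len(phones) == len(tones)
    return [(PHONEMES[p], TONES[t]) for p, t in zip(phones, tones)]

# "ni hao" with a prosodic boundary token #1 after it
pairs = encode_inputs(["n", "i2", "h", "ao3", "#1"],
                      ["none", "tone2", "none", "tone3", "none"])
```

Keeping phoneme and tone in separate parallel streams, rather than one fused token, lets the encoder share phoneme knowledge across tonal contexts even when a single speaker's corpus covers only a few combinations.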
2.1.2 Alignment Model
To combine the linguistic feature transformation with a representation of speaker characteristics, we incorporate speaker, emotion and language embeddings into the expanded states of the content encoder. Language code switching is implemented based on the language to which the current phoneme belongs. Such a speaker-dependent concatenated representation enables AdaDurIAN to synthesize speech for different speakers in different styles. Each speaker-dependent frame state is repeated according to the alignment model and then concatenated with a relative position encoding [Yu2019DurIANDI] inside each phoneme. The detailed structure of the alignment model is shown in Figure 1; the alignment model does not share any trainable embeddings with the content encoder, which stabilizes training.
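The expansion step can be sketched as follows. This is an illustrative simplification with plain Python lists; the function name, toy dimensions and the exact relative-position formula are our assumptions, not the paper's implementation:

```python
def expand_states(encoder_states, durations, speaker_vec, emotion_vec, lang_vecs):
    """Repeat each encoder state durations[i] times, concatenating speaker and
    emotion embeddings, a per-phoneme language embedding, and a relative
    position in [0, 1] inside each phoneme."""
    frames = []
    for state, dur, lang_vec in zip(encoder_states, durations, lang_vecs):
        for k in range(dur):
            rel_pos = k / max(dur - 1, 1)  # relative position inside the phoneme
            frames.append(state + speaker_vec + emotion_vec + lang_vec + [rel_pos])
    return frames

frames = expand_states(
    encoder_states=[[0.1, 0.2], [0.3, 0.4]],  # two phoneme states (toy 2-dim)
    durations=[2, 3],                         # predicted frames per phoneme
    speaker_vec=[1.0], emotion_vec=[0.0],
    lang_vecs=[[0.0], [1.0]],                 # per-phoneme language code
)
```

Note the language vector switches per phoneme, which is what makes code-switched (mixed Chinese/English) inputs possible within one utterance.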
Different from DurIAN, we adopt residual LSTM [Kim2017ResidualLD] layers for their efficient training, which is of great importance in few-shot learning. Instead of a CBHG [Wang2017TacotronTE] module, we adopt a vanilla LSTM layer with a time delay of frames as the post-net. In practice, such a post-net structure can significantly improve the quality of the mel spectrogram predicted by the decoder, and also enables streaming synthesis for deployment in a production environment. As a result, AdaDurIAN inference runs times faster than real time with two CPU cores.
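The streaming benefit of a fixed time delay can be illustrated with a simple buffer: a frame is emitted only once a bounded amount of lookahead has arrived, so latency stays constant regardless of utterance length. This sketch uses a plain average as a stand-in for the delayed LSTM; the class name and delay value are hypothetical:

```python
from collections import deque

class DelayedPostnet:
    """Toy streaming post-net: emit a smoothed frame once `delay` frames of
    lookahead are buffered, instead of waiting for the whole utterance."""

    def __init__(self, delay):
        self.delay = delay
        self.buf = deque()

    def push(self, frame):
        """Feed one decoder frame; return a smoothed frame or None."""
        self.buf.append(frame)
        if len(self.buf) > self.delay:
            window = list(self.buf)
            self.buf.popleft()
            # stand-in for the time-delayed LSTM: average frame with lookahead
            return sum(window) / len(window)
        return None

pn = DelayedPostnet(delay=2)
outs = [pn.push(f) for f in [1.0, 2.0, 3.0, 4.0]]  # first outputs are None
```

A global module such as CBHG, by contrast, needs the entire spectrogram before it can smooth anything, which rules out this kind of streaming deployment.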
2.2 Speaker Adaptation Strategy
Few-shot speaker adaptation is a challenging task because there are very few training samples. The core dilemma in few-shot speaker adaptation is that the distribution of linguistic tokens is hardly even. Following the general training procedure, the model of a new speaker would soon become unable to synthesize out-of-domain words, let alone preserve naturalness and speaker similarity. Fortunately, the speaker-independent content encoder of AdaDurIAN can absorb knowledge across different speakers, so a pre-trained content encoder can be borrowed to transform linguistic features for any new speaker.
Straightforwardly, the training procedure of AdaDurIAN for few-shot speaker adaptation transfers the linguistic feature transformation of a multi-speaker system to a new speaker with limited training data, without losing naturalness, speaker similarity or cross-lingual speaking. Inspired by [Fan2015MultispeakerMA], the modules in the light red dotted rectangle in Figure 1 are fixed and shared for any new speaker. To achieve this, we first fully shuffle the training data to ensure that each mini-batch contains data from different speakers, and then train AdaDurIAN to get an average multi-speaker TTS model. At the few-shot speaker adaptation stage, the phone embedding, tone (or stress) embedding, language embedding, emotion embedding and encoder of the average model are fixed. With this adaptation strategy, AdaDurIAN can be applied to speakers with very limited data. We will validate that, by borrowing knowledge from other speakers and optimizing only the speaker embedding and decoder, AdaDurIAN achieves better naturalness, speaker similarity and cross-lingual speaking.
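The partition of frozen versus tuned modules can be sketched as a simple filter; the module names are illustrative labels, and in a real framework this logic would toggle per-parameter gradient flags:

```python
# Modules of the average model that stay fixed during adaptation,
# following the strategy described above (names are our own labels).
FROZEN = {"phone_embedding", "tone_embedding", "language_embedding",
          "emotion_embedding", "encoder"}

def trainable_modules(all_modules):
    """Return the modules updated during few-shot speaker adaptation."""
    return [m for m in all_modules if m not in FROZEN]

modules = ["phone_embedding", "tone_embedding", "language_embedding",
           "emotion_embedding", "encoder", "speaker_embedding", "decoder"]
tuned = trainable_modules(modules)  # only speaker_embedding and decoder remain
```

Freezing the shared linguistic modules keeps them in the distribution learned from many speakers, so a few minutes of imbalanced data cannot drag them off that manifold.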
We take SMA [He2019RobustSA] as our baseline model, a strong variant of Tacotron 2 [shen2018natural] in which the memory at each decoding step is computed by stepwise monotonic attention [He2019RobustSA] instead of an alignment model. We find that, compared with the original Tacotron 2, SMA performs much better in terms of accurate pronunciation and synthesizing long or abnormal utterances.
We performed three sets of adaptive TTS experiments to show the performance of the proposed AdaDurIAN system. First, we investigate the stability of pronunciation under different adaptation strategies for AdaDurIAN. Second, we compare the performance of SMA and AdaDurIAN on random subsets of the audio with total durations of , and minutes, respectively. Finally, we perform few-shot emotion transfer tasks on two unseen speakers with limited neutral speech data. We highly recommend that readers listen to the generated audio samples at https://xusongvae.github.io/adadurian.
3.1 Experiment Setup
The data we used is our internal, carefully annotated -hour speech corpus, collected from around speakers of different genders and nationalities. All audio is sampled at kHz with a mono channel, windowed with ms frames shifted every ms. The -th order mel spectrograms are extracted to represent the spectral envelope.
Two neural TTS systems are implemented for comparison. For AdaDurIAN, as shown in Figure 1, sequences of linguistic tokens are passed through a pre-net that contains three fully-connected layers followed by a CBHG module. The same group of sequences is taken as input to the duration model, composed of two BLSTM layers with units. The output of the content encoder is expanded according to the duration of each phoneme. The pre-net of the decoder is composed of two -unit fully-connected layers. The expanded state and the output of the decoder pre-net are passed into a content-based attention with depth . The output of the attention is then passed through the bottom residual LSTM layer. At each decoding step, the second LSTM layer generates non-overlapping frames of the mel spectrogram. A post-net with two stacked fully-connected layers with and hidden units, followed by a -unit LSTM layer with a time delay of frames, makes the mel spectrogram smoother and allows it to be generated in a streaming manner. As for the baseline SMA model, the architecture used in this paper is almost the same as AdaDurIAN's, except that the unique attention mechanism of SMA is implemented with a GRU cell of units.
We first trained two average models for k steps each for SMA and AdaDurIAN, and then conducted the speaker adaptation and emotion transfer tasks with the strategy described above. The batch size is , and the validation step interval is , chosen to select a better model. The gradient descent optimization procedure is the same as in the original DurIAN [Yu2019DurIANDI].
In addition to the acoustic model, we used a robust and fast WaveRNN [Kalchbrenner2018EfficientNA] variant as the vocoder, which consists of a D convolution as the condition network, a sparse GRU, and four fully-connected layers with a dual-softmax structure. We trained this WaveRNN on the previously described dataset collected from around speakers, excluding the speakers evaluated in this paper. To eliminate the influence of WaveRNN, all audio in the MOS test, including the recordings, is converted by the Griffin-Lim algorithm [Griffin1984SignalEF], while audio in the other tests is converted by this speaker-independent WaveRNN.
3.2 Objective Evaluation
We performed a pronunciation error statistical task using -minute data from several speakers to compare the performance of AdaDurIAN under different few-shot speaker adaptation strategies. We randomly selected long and abnormal sentences with a total of words for synthesis. The task was carried out with - anonymous, untrained subjects participating in several evaluation sessions, constructed so that each sentence was evaluated by distinct subjects. Each participant was asked to count the number of errors in each sentence, including wrong pronunciation, unclearness and incorrect tone. Although an automatic speech recognition (ASR) system could be used instead, we find that ASR systems are too robust to spot minor pronunciation errors.
We evaluate the performance of each adaptation strategy by calculating the word error rate (WER). Table 1 shows the WER of each adaptation strategy. We find that fixing the phone embedding, tone (or stress) embedding, language embedding and encoder achieves the fewest pronunciation errors, which is reasonable because these fixed parameters remain in the same distribution space even given very limited, unbalanced data. With extremely imbalanced -min data, the much lower WER indicates that this fine-tuning strategy is reliable for few-shot speaker adaptation.
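For reference, WER here is the standard word-level edit distance normalized by reference length. The implementation below is our own sketch of that standard metric, assumed to match what is computed in the tables:

```python
def wer(ref_words, hyp_words):
    """Word error rate = (substitutions + deletions + insertions) / len(ref),
    computed with the classic Levenshtein dynamic program."""
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[n][m] / n

rate = wer("the cat sat".split(), "the cat sat down".split())  # one insertion
```

With one inserted word against a three-word reference, the rate is 1/3; note that insertions can push WER above 100% on short references.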
3.3 Subjective Evaluation
3.3.1 Speaker Adaptation
We selected one native Chinese speaker (denoted "CN speaker") with no English speech corpus, and one native English speaker (denoted "EN speaker") with no Chinese speech corpus to perform few-shot speaker adaptation tasks. We constructed three datasets for each test speaker, with total durations of minute, minutes and minutes, respectively. There is a held-out validation set for each speaker, and each few-shot speaker adaptation system was trained and selected according to the lowest validation loss. Then, we synthesized Chinese sentences and English sentences that were excluded from all training samples. We performed two subjective tests on these sentences. In the first, the Mean Opinion Score (MOS) test, subjects were asked to rate the naturalness of each generated audio sample from the lowest score to the highest score. In the second, the similarity ABX test, subjects were asked to listen to a recorded audio sample first, and then choose which of the converted audio samples sounds more like the recording, or neither.
As shown in Table 2, for both Chinese and English sentences, AdaDurIAN achieves a higher naturalness MOS than SMA. In particular, the CN speaker reaches a MOS of when given only minutes of training data, with a gap of only compared with the MOS of the recordings. Given just minute of data, the CN speaker with the AdaDurIAN system can still achieve a MOS of on Chinese sentences and on English sentences, respectively. Although the EN speaker with AdaDurIAN shows a decreasing MOS on Chinese sentences as the amount of data increases, the highest MOS is only a small distance from the MOS of the recordings. This indicates that AdaDurIAN can generalize well to cross-lingual speaking even when only a few minutes of monolingual data are given.
The results of the speaker similarity preference test are shown in Table 3. For the CN speaker, AdaDurIAN gains far more preferences than SMA in all experiments. For the EN speaker, AdaDurIAN outperforms SMA by a significant margin on English sentences, and still performs comparably to SMA on Chinese sentences. Such promising evaluation results motivate us to apply AdaDurIAN to further tasks with high requirements on speaker similarity.
3.3.2 Emotion Transfer
To explore the ability of AdaDurIAN to transfer different emotions with little neutral data, we used an available female corpus with four annotated emotion styles: neutral, anger, happiness and sadness. We first trained a base emotional model on top of the previous AdaDurIAN average model, then fine-tuned this female emotional model with a -minute male speech corpus and a -minute female speech corpus using the strategy described above. We evaluate the performance of the female-to-male (F2M) and female-to-female (F2F) few-shot emotion transfer tasks by subjective emotion classification. As shown in Table 4 and Table 5, emotion transfer in the F2F task is less difficult than in F2M: the mean emotion classification accuracy of F2F is , while that of F2M is only . Specifically, we find that transferring neutral and happiness is the easiest, transferring sadness comes second, and transferring anger is the hardest. This finding provides an important reference for future few-shot emotion transfer research.
In summary, we proposed AdaDurIAN, a few-shot adaptive neural TTS system with higher naturalness and speaker similarity. We described the improvements of AdaDurIAN over the original DurIAN and demonstrated the adaptation strategy for when a speaker's data is very limited. Based on AdaDurIAN, we performed several few-shot speaker adaptation tasks to evaluate stability, naturalness, speaker similarity and emotion transfer ability. The evaluations show that, compared with a Tacotron-like model, AdaDurIAN achieves a higher naturalness MOS, more speaker-similarity preferences and especially fluent cross-lingual speaking. Furthermore, we also applied AdaDurIAN to emotion transfer tasks and showed its promising performance.