Linear networks based speaker adaptation for speech synthesis

03/05/2018 · Zhiying Huang, et al.

Speaker adaptation methods aim to create a fair-quality synthetic voice font for a target speaker when only limited resources are available. Recently, as deep neural network based statistical parametric speech synthesis (SPSS) methods have become dominant in TTS back-end modeling, speaker adaptation under the neural network based SPSS framework has also become an important task. In this paper, linear networks (LN) are inserted at multiple neural network layers and fine-tuned together with the output layer for the best speaker adaptation performance. When the adaptation data is extremely small, the low-rank plus diagonal (LRPD) decomposition of the LN is employed to make the adapted voice more stable. Speaker adaptation experiments are conducted across a range of adaptation utterance numbers. Moreover, speaker adaptation from 1) female to female, 2) male to female and 3) female to male is investigated. Objective measurements and subjective tests show that LN with LRPD decomposition performs most stably when adaptation data is extremely limited, and our best speaker adaptation (SA) model with only 200 adaptation utterances achieves quality comparable to a speaker dependent (SD) model trained with 1000 utterances, in both naturalness and similarity to the target speaker.


1 Introduction

Given an adequate amount of training data from a target speaker, one can always build an SD acoustic model that generates speech very similar to the target speaker himself or herself. Unfortunately, most of the time, getting enough data from the target speaker is not a trivial task, and building SD voices with limited data and poor phoneme coverage can lead to very poor voice quality and intelligibility. By reusing information from existing source speaker models, speaker adaptation can obtain a satisfactory voice font for the target speaker using only limited target speaker data. In this way, speaker adaptation saves the labor of massive recording, manual transcription and checking, and finally makes the cost of creating new voices small and acceptable.

In the conventional Hidden Markov Model (HMM) based speech synthesis system, most adaptation methods first build an average voice using multiple speakers' data, and then conduct speaker adaptation from the large average model with a small amount of target speaker data [1]. Compared to the large data requirement for building an SD model, speaker adaptation can adapt from a speaker independent (SI) model with a much smaller amount of target speaker data. Many effective speaker adaptation methods have been proposed under the HMM-based speech synthesis framework. Maximum Likelihood Linear Regression (MLLR) was originally developed for automatic speech recognition tasks [2], and it was extended to speech synthesis in [3, 4]. In MLLR, the mean vectors of the HMM state distributions of the average voice model are transformed into a target speaker dependent model through linear transformation. The Hidden Semi-Markov Model (HSMM) based MLLR adaptation [5, 6] transforms not only the distributions of the spectrum and pitch models but also the distribution of the duration model. Besides, the Constrained Structural Maximum A Posteriori Linear Regression (CSMAPLR) adaptation [7, 8, 9] simultaneously transforms the mean vectors and covariance matrices of the state output and state duration distributions of the HSMMs in speech synthesis for speaker adaptation.
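For reference, the MLLR transform mentioned above takes the standard affine form (a brief recap of [2, 3], not an equation from this paper): each mean vector $\boldsymbol{\mu}$ of the average voice model is mapped to an adapted mean

$$\hat{\boldsymbol{\mu}} = \mathbf{A}\boldsymbol{\mu} + \mathbf{b},$$

where $\mathbf{A}$ and $\mathbf{b}$ are a regression matrix and bias estimated from the target speaker's adaptation data by maximum likelihood, typically shared across clusters of distributions.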

In recent years, neural network (NN) based speech synthesis has dominated back-end acoustic modeling in speech synthesis due to its powerful modeling capacity. It has been shown that the NN-based speech synthesis system obtains better voice quality than the conventional HMM-based method with the same number of model parameters [10]. Since then, many research works have investigated speaker adaptation for NN-based speech synthesis. By combining the information from multiple speakers, a multi-speaker DNN [11] has been proposed. It is assumed that the differences among speakers can be learned by different output layers; in this way, each speaker owns his or her own output layer while all hidden layers are shared among all speakers. In [12], speaker adaptation is performed at different levels: i-vectors [13] are augmented with linguistic features to represent speaker identity information at the input level, Learning Hidden Unit Contributions (LHUC) [14] is used to conduct model adaptation at the middle level, and feature space transformations are applied at the output layer. Also, in some multi-speaker speech synthesis systems, i-vectors and speaker codes representing the speaker [15, 16] are combined with linguistic features as input features for the neural network based acoustic model.

In this paper, a new linear network based speaker adaptation approach for speech synthesis is proposed. Inspired by the use of Linear Network (LN) [17, 18, 19] based adaptation methods in speech recognition tasks, we introduce LN at multiple layers of the speech synthesis network structure and fine-tune them together with the output linear layer for speaker adaptation. Moreover, LRPD decomposition is applied to the LN to remove redundant free parameters when the adaptation data is very small. Both objective and subjective experiments show that the proposed methods render good adaptation ability in terms of naturalness and similarity to the target speaker.

The remainder of this paper is organized as follows: Section 2 introduces the adaptation framework used in this paper, including 1) the multi-task DNN-BLSTM source acoustic model for the speech synthesis baseline, 2) the framework of LN based speaker adaptation and 3) the LRPD decomposition based LN method in detail. Section 3 evaluates the adaptation methods using objective measurements and subjective tests, and Section 4 draws conclusions.

2 Adaptation framework

Figure 1: Multi-task DNN-BLSTM based acoustic model (left) and Linear Network based speaker adaptation (right).

2.1 Multi-task DNN-BLSTM based acoustic model

Similar to [20, 21], our back-end base acoustic model is shown on the left side of Fig. 1; it is a multi-task DNN-BLSTM hybrid model. The input features of the back-end base acoustic model are linguistic features, including binary answers to questions about linguistic contexts and numeric values, e.g. absolute positions at different levels of units. The output acoustic features include mel-cepstral coefficients, logarithmic fundamental frequency (log F0) values, band aperiodicities and the unvoiced/voiced flag. A DNN layer is placed right after the input features for better bottom-level feature extraction, which leads to faster overall convergence. On the left side of Fig. 1, mel-cepstral coefficients, log F0s, band aperiodicities and the unvoiced/voiced flag have their separate output layers while sharing the same lower layers. Our preliminary experiments show that multi-task learning renders more stable synthesized voice than a single large output layer for all acoustic features. The back-end model described in this section serves as the source NN structure in the following experiments.
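To make the architecture concrete, the following PyTorch sketch mirrors the shared trunk and the four task-specific output layers; layer sizes follow Section 3.2 (a 1024-node DNN layer and 3 BLSTM layers with 512 units per direction), while the class and variable names are our own illustrative choices rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiTaskDNNBLSTM(nn.Module):
    """Shared DNN + BLSTM trunk with separate output layers per acoustic stream."""
    def __init__(self, in_dim=753, hidden=1024):
        super().__init__()
        self.dnn = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.blstm = nn.LSTM(hidden, hidden // 2, num_layers=3,
                             bidirectional=True, batch_first=True)
        # Multi-task heads: each acoustic stream has its own output layer.
        self.out_mcep = nn.Linear(hidden, 60)   # mel-cepstral coefficients
        self.out_lf0 = nn.Linear(hidden, 3)     # log F0 stream
        self.out_bap = nn.Linear(hidden, 11)    # band aperiodicities
        self.out_uv = nn.Linear(hidden, 1)      # unvoiced/voiced flag

    def forward(self, x):                        # x: (batch, frames, in_dim)
        h, _ = self.blstm(self.dnn(x))
        return (self.out_mcep(h), self.out_lf0(h),
                self.out_bap(h), self.out_uv(h))
```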

2.2 Linear Network based adaptation

LN based adaptation was originally explored in speech recognition tasks; its structure is shown on the right side of Fig. 1. LN (the yellow parts in Fig. 1) are inserted at multiple layers of the source NN-based acoustic model. According to the position of the LN, LN based adaptation methods include the Linear Input Network (LIN [17]), Linear Hidden Network (LHN [18]) and Linear Output Network (LON [19]), where an LHN can be inserted at any position between two successive hidden layers.

When an LN is inserted between the $l$-th and $(l+1)$-th hidden layers, the output of the LN is computed as in equation (1):

$$\mathbf{h}_{l}^{LN} = \mathbf{W}_{s}\mathbf{h}_{l} + \mathbf{b}_{s} \qquad (1)$$

where $\mathbf{h}_{l}$ is the activation output of the $l$-th hidden layer (or the input features at the input layer for LIN), $\mathbf{W}_{s}$ denotes the speaker-specific linear transformation matrix, and $\mathbf{b}_{s}$ is the speaker-specific bias vector.

In the adaptation process, LN are first inserted at specific positions in the source model, with the linear transformation matrix $\mathbf{W}_{s}$ initialized as an identity matrix and $\mathbf{b}_{s}$ initialized to $\mathbf{0}$. Then, with adaptation data from the target speaker, the inserted layers are optimized by back-propagation while keeping the other layers fixed. It is worth noting that LIN, LHN and LON can be combined to train a better SA model so as to achieve good adapted voice quality.
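As an illustration of this initialization, the sketch below shows a speaker-specific linear layer initialized to the identity transform, so that inserting it leaves the source model unchanged before adaptation; this is a minimal PyTorch sketch, not the authors' code, and names such as FullLN are assumptions.

```python
import torch
import torch.nn as nn

class FullLN(nn.Module):
    """Speaker-specific linear network, initialized to the identity mapping."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        with torch.no_grad():
            self.linear.weight.copy_(torch.eye(dim))  # W_s = I
            self.linear.bias.zero_()                   # b_s = 0

    def forward(self, h):
        return self.linear(h)

# During adaptation, only the inserted LN (and the output layer) are updated:
#   for p in source_model.parameters():
#       p.requires_grad = False
#   ln = FullLN(1024)
#   optimizer = torch.optim.SGD(ln.parameters(), lr=1e-4)
```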

2.3 LN with Low-rank Plus Diagonal decomposition

Figure 2: LN with Low-rank Plus Diagonal decomposition.

LRPD decomposition is proposed to reduce the number of free adaptation parameters in the inserted LN. LN with LRPD decomposition (named LRPD-LN) can reduce the speaker-specific footprint by 82% compared to LN with a full adaptation matrix (named Full-LN) without significant loss in word error rate in speech recognition tasks [22]. The LRPD decomposition restructures the adaptation matrix as the superposition of a diagonal matrix and a low-rank matrix, as shown in Fig. 2 and equation (2).

$$\mathbf{W}_{s} = \mathbf{D} + \mathbf{P}\mathbf{Q} \qquad (2)$$

In equation (2), $\mathbf{P}$ and $\mathbf{Q}$ are two small matrices with dimensions $d \times k$ and $k \times d$ respectively, and $\mathbf{D}$ is a $d \times d$ diagonal matrix. The number of parameters in the adaptation matrix of Full-LN is $d \times d$, while that of LRPD-LN is $2 \times d \times k$, which is much smaller when $k \ll d$. Moreover, LRPD-LN is more suitable for speaker adaptation than Full-LN when the target speaker adaptation data is small, because the number of parameters to be fine-tuned is also smaller.

In [22], the diagonal matrix $\mathbf{D}$ is initialized as an identity matrix and is either fine-tuned or kept fixed during the adaptation stage. It is shown that keeping $\mathbf{D}$ fixed renders performance comparable to fine-tuning it, so we fix the diagonal matrix as an identity matrix in the following experiments. LRPD-LN can be trained in two ways: 1) initialize the two small matrices $\mathbf{P}$ and $\mathbf{Q}$ randomly, then train all inserted parameters with target speaker data; 2) decompose a well-trained Full-LN as the seed model, then fine-tune with the target speaker's data. Preliminary experiments show that these two training methods give comparable results, so for simplicity we only train LRPD-LN with the first method in the following experiments.
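The following PyTorch sketch illustrates an LRPD-LN layer under the setting above (diagonal part fixed to the identity, $\mathbf{P}$ and $\mathbf{Q}$ trained, corresponding to the first training method); the module name, rank argument and random initialization scale are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class LRPDLN(nn.Module):
    """LN with low-rank plus diagonal decomposition: W_s = D + P @ Q."""
    def __init__(self, dim, rank=10):
        super().__init__()
        # Diagonal part fixed to the identity (not fine-tuned), as in Section 2.3.
        self.register_buffer("diag", torch.eye(dim))
        # Low-rank part: P (dim x rank) and Q (rank x dim), small random init.
        self.P = nn.Parameter(0.01 * torch.randn(dim, rank))
        self.Q = nn.Parameter(0.01 * torch.randn(rank, dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, h):
        W = self.diag + self.P @ self.Q   # effective adaptation matrix
        return h @ W.t() + self.bias

# With dim=1024 and rank=10, the trainable low-rank part has
# 2 * 1024 * 10 = 20480 parameters, versus 1024 * 1024 for Full-LN.
```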

3 Experiments

Figure 3: Objective measurement of validation set utterances (from C-female to B-female).

3.1 Speech corpus

Experiments are conducted on a corpus of 3 native Mandarin speakers, all phonetically and prosodically rich. Speech signals are sampled at 16 kHz; the window size and window shift in feature extraction are 25 ms and 5 ms, respectively. Of the three speakers in the corpus, one is male (named male-A) and the other two are female (named female-B and female-C). Each speaker has approximately 5000 utterances.

3.2 Experiment setup

For the adaptation experiments, the number of adaptation utterances for each target speaker varies from 50 to 1000. For each target speaker, 200 utterances are held out as a validation set for objective measurement, and another 20 utterances are held out as a testing set for subjective tests. Here, we only focus on the adaptation scenario from one source speaker to one target speaker. The source speaker SD model, trained with approximately 5000 utterances, is used as the source model for speaker adaptation. To investigate the adaptation effect between speakers with different distances, three kinds of adaptation experiments are designed: from female to female, from male to female and from female to male. Adaptation from female to female is regarded as adaptation between two similar speakers (easy task), while the other two are regarded as adaptation between two far-away speakers (tough task), since two speakers of different genders usually have different characteristics in terms of pitch and spectrum. Moreover, speaker dependent systems of the target speaker trained with different numbers of utterances are compared as references in the experiments.

The input linguistic features contain 738-dim binary features and 15-dim numerical features, 753 dimensions in total. The output acoustic feature vectors are 75-dim, comprising 60-dim mel-cepstral coefficients, 3-dim log F0 (static, delta and delta-delta), 11-dim band aperiodicities and a 1-dim U/V flag. Linear interpolation of F0 is performed over unvoiced segments before modeling, and both the input linguistic features and the output acoustic features are normalized to zero mean and unit variance. In the synthesis stage, the predicted acoustic features are de-normalized and sent to the WORLD vocoder [23] for synthesis.
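As a concrete illustration of these two pre-processing steps, the NumPy sketch below linearly interpolates log F0 over unvoiced frames and z-normalizes a feature matrix; the function names and the epsilon value are illustrative, not taken from the paper.

```python
import numpy as np

def interpolate_lf0(lf0, voiced):
    """Linearly interpolate log F0 over unvoiced frames.
    lf0: (T,) log F0 values; voiced: (T,) boolean voiced mask."""
    idx = np.where(voiced)[0]
    return np.interp(np.arange(len(lf0)), idx, lf0[idx])

def znorm(feats, mean=None, std=None):
    """Normalize features to zero mean and unit variance (per dimension)."""
    mean = feats.mean(axis=0) if mean is None else mean
    std = feats.std(axis=0) + 1e-8 if std is None else std
    return (feats - mean) / std, mean, std
```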

The topology of the base DNN-BLSTM model contains a DNN layer of 1024 nodes and 3 upper BLSTM layers with 1024 nodes each (512 per direction). For LRPD-LN, $k$ is set to 10, so the number of adaptation matrix parameters per LN in LRPD-LN is $2 \times 1024 \times 10 = 20480$, while in Full-LN it is $1024 \times 1024 \approx 1.05 \times 10^{6}$. In training the SD model and the adaptation model, Minimum Square Error (MSE) is adopted as the training criterion and Stochastic Gradient Descent (SGD) based back-propagation is used to optimize the model parameters. When training with limited data, we manually tuned the learning rate and corresponding hyper-parameters to avoid over-fitting. In informal experiments, we found that only inserting LN cannot achieve satisfactory adaptation results; inspired by [11], we therefore combine fine-tuning the output layer with the Full-LN/LRPD-LN based methods in the following experiments. Different insertion positions of LN and different combinations were compared in preliminary experiments, and we found that inserting LN before the last hidden layer and before the output layer together achieves the best performance; this setting is used in the following experiments.

Both objective measurements and subjective tests are conducted in the experiments. The objective measurements include Mel-Cepstral Distortion (MCD), root mean squared error (RMSE) of F0, unvoiced/voiced (U/V) prediction error and the overall MSE of the multi-task output layer. Mean opinion score (MOS) tests on both naturalness and similarity to the target speaker are conducted in the subjective comparison. Each utterance is listened to 5 times by different listeners.
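For reference, the sketch below computes MCD under its common definition (excluding the 0-th cepstral coefficient, averaged over time-aligned frames); the paper does not spell out its exact formula, so this follows the standard convention as an assumption.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """MCD in dB between time-aligned mel-cepstra of shape (T, D),
    excluding the 0-th (energy) coefficient, averaged over frames."""
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    frame_dist = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(frame_dist)
```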

3.3 Easy task: from female to female

Adaptation from one female speaker to another female speaker is first evaluated.

3.3.1 objective measure

Fig. 3 shows the relationship between the amount of target speaker adaptation utterances and the objective measurements detailed above. "SD" means the SD model of the target speaker trained with the same amount of data as used for adaptation. "OL" means adaptation by only fine-tuning the output layer of the source model. "OL + Full-LN" and "OL + LRPD-LN" mean the Full-LN and LRPD-LN based adaptation methods combined with fine-tuning the output layer, respectively.

As shown in Fig. 3, as the number of training/adaptation utterances increases, the objective distance of all systems becomes closer to 0, which indicates that all systems perform better with more data. Compared to the SD model of the target speaker, all adaptation methods show better performance with the same amount of adaptation data. The experimental results also reveal that, by fine-tuning together with the inserted LN, the performance of the basic output-layer-only adaptation method jumps from the pink curve to the blue (OL+LRPD-LN) and green (OL+Full-LN) curves. These results indicate that by only fine-tuning the output transformation layer, without tuning more hidden layers, it is not possible to fully adapt to the target voice. This is especially the case when more adaptation data is available, and the gap between the OL and OL+LN methods grows as more adaptation data becomes available. Since LRPD-LN and Full-LN are always fine-tuned together with OL in our experiments, LRPD-LN and Full-LN are used to denote OL+LRPD-LN and OL+Full-LN for short.

Moreover, Full-LN is worse than LRPD-LN when adaptation utterances are limited (fewer than 50). This is because Full-LN introduces too many speaker-specific parameters and causes over-fitting due to the lack of data. When adaptation utterances are adequate (more than 100), Full-LN is better than LRPD-LN, mainly because the number of speaker-specific parameters in LRPD-LN is limited by the fixed decomposition dimension $k$.

3.3.2 subjective measure

Figure 4: Subjective tests of testing set utterances (from C-female to B-female).

Naturalness and similarity to the target speaker for the different systems are shown in Fig. 4. In these two subjective tests, the performance of the SD system (see the blue bar) degrades quickly as the number of adaptation utterances decreases, and LRPD-LN based adaptation (see the green bar) is more stable compared to SD and Full-LN (see the red bar). Moreover, both Full-LN and LRPD-LN show better performance than SD with the same number of utterances. Full-LN and LRPD-LN with 200 adaptation utterances can achieve performance similar to SD with 1000 utterances. Different from the objective measurements, LRPD-LN outperforms Full-LN until the adaptation data reaches 500 utterances. This is because over-fitting makes the synthesized voice unstable, sometimes sounding very strange and unintelligible. We can still conclude that over-fitting makes Full-LN worse when adaptation utterances are insufficient.

3.4 Tough task: from male to female and from female to male

Both male-to-female adaptation and female-to-male adaptation show objective trends similar to Fig. 3, so we skip these results and only show the subjective results.

3.4.1 subjective measure

Figure 5: Subjective tests of testing set utterances (from A-male to B-female).

Figure 6: Subjective tests of testing set utterances (from B-female to A-male).

As shown in Fig. 5 and Fig. 6, the same conclusion as in Section 3.3.2 can be drawn. LRPD-LN is more stable than SD and Full-LN, and the performance of SD declines quickly as the number of training utterances decreases. Also, the Full-LN and LRPD-LN based methods both still show better performance than SD with the same number of utterances. Similarly, 200 adaptation utterances for Full-LN and LRPD-LN adaptation achieve performance comparable to SD with 1000 utterances.

In addition, it is interesting to see in Fig. 6 that the gap between the adaptation voices and the SD voice is larger when the adaptation data is extremely small. This may indicate that adaptation is especially beneficial when the target speaker is male and the adaptation data is limited. However, the gap shrinks quickly as the adaptation data grows. Overall, Fig. 6 reveals the same trend as Fig. 4 and Fig. 5.

4 Conclusion

In this paper, LN based methods are investigated for speaker adaptation in speech synthesis, and LRPD decomposition of the LN is employed to make the adapted voice more stable. After conducting speaker adaptation from 1) female to female, 2) male to female, and 3) female to male, the experimental results show that LN with LRPD decomposition performs most stably when adaptation data is extremely limited. Also, our best SA model with only 200 adaptation utterances achieves quality comparable to an SD model trained with 1000 utterances, in both naturalness and similarity to the target speaker.

References

  • [1] Junichi Yamagishi, Masatsune Tamura, Takashi Masuko, Keiichi Tokuda, and Takao Kobayashi, “A training method of average voice model for HMM-based speech synthesis,” IEICE transactions on fundamentals of electronics, communications and computer sciences, vol. 86, no. 8, pp. 1956–1963, 2003.
  • [2] Christopher J Leggetter and Philip C Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Computer Speech & Language, vol. 9, no. 2, pp. 171–185, 1995.
  • [3] Masatsune Tamura, Takashi Masuko, Keiichi Tokuda, and Takao Kobayashi, “Speaker adaptation for HMM-based speech synthesis system using MLLR,” in the third ESCA/COCOSDA Workshop (ETRW) on Speech Synthesis, 1998.
  • [4] Masatsune Tamura, Takashi Masuko, Keiichi Tokuda, and Takao Kobayashi, “Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR,” in Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01). 2001 IEEE International Conference on. IEEE, 2001, vol. 2, pp. 805–808.
  • [5] Junichi Yamagishi, Katsumi Ogata, Yuji Nakano, Juri Isogai, and Takao Kobayashi, “HSMM-based model adaptation algorithms for average-voice-based speech synthesis,” in Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on. IEEE, 2006, vol. 1, pp. I–I.
  • [6] Junichi Yamagishi and Takao Kobayashi, “Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training,” IEICE TRANSACTIONS on Information and Systems, vol. 90, no. 2, pp. 533–543, 2007.
  • [7] Yuji Nakano, Makoto Tachibana, Junichi Yamagishi, and Takao Kobayashi, “Constrained structural maximum a posteriori linear regression for average-voice-based speech synthesis,” in Ninth International Conference on Spoken Language Processing, 2006.
  • [8] Junichi Yamagishi, Takashi Nose, Heiga Zen, Zhen-Hua Ling, Tomoki Toda, Keiichi Tokuda, Simon King, and Steve Renals, “Robust speaker-adaptive HMM-based text-to-speech synthesis,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1208–1230, 2009.
  • [9] Junichi Yamagishi, Takao Kobayashi, Yuji Nakano, Katsumi Ogata, and Juri Isogai, “Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 1, pp. 66–83, 2009.
  • [10] Heiga Zen, Andrew Senior, and Mike Schuster, “Statistical parametric speech synthesis using deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7962–7966.
  • [11] Yuchen Fan, Yao Qian, Frank K Soong, and Lei He, “Multi-speaker modeling and speaker adaptation for DNN-based tts synthesis,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4475–4479.
  • [12] Zhizheng Wu, Pawel Swietojanski, Christophe Veaux, Steve Renals, and Simon King, “A study of speaker adaptation for DNN-based speech synthesis.,” in INTERSPEECH, 2015, pp. 879–883.
  • [13] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
  • [14] Pawel Swietojanski and Steve Renals, “Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models,” in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014, pp. 171–176.
  • [15] Yi Zhao, Daisuke Saito, and Nobuaki Minematsu, “Speaker representations for speaker adaptation in multiple speakers’ BLSTM-RNN-based speech synthesis,” space, vol. 5, no. 6, pp. 7, 2016.
  • [16] Hieu-Thi Luong, Shinji Takaki, Gustav Eje Henter, and Junichi Yamagishi, “Adapting and controlling DNN-based speech synthesis using input codes,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4905–4909.
  • [17] Joao Neto, Luís Almeida, Mike Hochberg, Ciro Martins, Luis Nunes, Steve Renals, and Tony Robinson, “Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system,” 1995.
  • [18] Roberto Gemello, Franco Mana, Stefano Scanzio, Pietro Laface, and Renato De Mori, “Linear hidden transformations for adaptation of hybrid ANN/HMM models,” Speech Communication, vol. 49, no. 10, pp. 827–835, 2007.
  • [19] Bo Li and Khe Chai Sim, “Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems,” 2010.
  • [20] Yuchen Fan, Yao Qian, Feng-Long Xie, and Frank K Soong, “TTS synthesis with bidirectional LSTM based recurrent neural networks,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • [21] Heng Lu, Ming Lei, Zeyu Meng, Yuping Wang, and Miaomiao Wang, “The Alibaba-iDST entry to Blizzard Challenge 2017”.
  • [22] Yong Zhao, Jinyu Li, and Yifan Gong, “Low-rank plus diagonal adaptation for deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5005–5009.
  • [23] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE TRANSACTIONS on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.