Neural text-to-speech has become very popular in recent years [1, 3, 4, 5], and it can already produce speech whose naturalness and voice quality approach those of a real person. However, data collection remains a major challenge: to obtain high voice quality and consistent recordings, a large amount of data usually has to be collected in a high-fidelity recording studio under professional guidance. This is costly, time-consuming, or even impossible, e.g. in the cases of custom speech and Lombard speech. Noisy and diverse data, by contrast, is much easier to collect. Multi-speaker speech synthesis was therefore proposed: diverse data is collected from many speakers to train a robust multi-speaker generative model, which can be further adapted to tasks such as speaker adaptation, cross-lingual text-to-speech synthesis, and style conversion.
State-of-the-art systems use an encoder-decoder network with speaker embeddings as conditions [9, 10, 11, 12, 13, 14, 15]. Some works investigated effective speaker representations: e.g. [11, 13] studied the effects of different speaker embeddings such as the d-vector, the x-vector, and LDE-based speaker encoding; one work proposed an attention-based variable-length embedding; another measured the speaker similarity between the predicted mel-spectrogram and the reference. Some works focused on the problem of noisy data [7, 9, 19, 20]: e.g. [9, 19] studied transfer-learning methods for noisy samples, and another work aimed to disentangle the speaker embedding and noise through data augmentation and a conditional generative model. Other works were interested in zero-shot controllability of the system: [10, 13] tried to obtain the target voice by feeding the target speaker embedding without speaker adaptation, while [22, 23] introduced latent variables to control the speaking style.
Previous studies rarely gave insight into the role played by information other than the text content (called control information in this paper, e.g. speaker embedding, pitch and energy). The control information is usually represented by a fixed- or variable-length embedding, which may not be as effective as expected; e.g. the pitch embedding is relevant to the harmonics of speech, but it is not an effective representation of the harmonic structure. Besides, the embedding of control information is typically concatenated or added to the text-content representation, or is simply used to perform an affine transformation on it. In this way, the control information plays a similar role to the text-content information in the network. However, the text content is the most important characteristic of speech, determining intelligibility, while the control information affects other characteristics of speech such as timbre.
In this paper, we investigate a better use of the control information under an encoder-decoder architecture. The major contributions are: 1) an excitation spectrogram is designed to explicitly characterize the harmonic structure of speech, and is fed to the decoder instead of pitch/energy embeddings; 2) a conditional gated LSTM (CGLSTM) is proposed, whose input/output/forget gates are re-weighted by the speaker embedding while its cell/hidden states depend on the text content. That is to say, the speaker embedding controls the flow of text-content information through the gates without directly affecting the cell state and hidden state.
The rest of this article is organized as follows: Section 2 describes the proposed multi-speaker generative model, covering the overall framework (Section 2.1), the excitation spectrogram generator (Section 2.2), and the CGLSTM decoder (Section 2.3). Section 3 presents the detailed settings and results of the experiments. Finally, conclusions are drawn in Section 4.
2 Multi-speaker Generative Model
2.1 Overall Framework
The framework of the proposed system is illustrated in Figure 1. It is a state-of-the-art Tacotron-like structure with a jointly trained speaker encoder. Because the attention mechanism performs insufficiently on diverse data, phoneme durations are predicted to align the phoneme sequence with the mel-spectrogram sequence through a length-regulator module. In addition, energy and pitch are predicted to generate the excitation spectrogram, which is finally fed to the CGLSTM decoder.
The text-encoder is the standard Tacotron2 encoder, which consists of stacked Conv1d layers followed by a BLSTM. It takes phoneme sequences with tones and prosody notations as input; its output text-content embedding, along with the speaker embedding, is used first to predict phoneme durations, and then, after length regulation, lf0/energy/mel-spectrogram.
The speaker-encoder takes the mel-spectrogram of a reference utterance as input, and outputs the speaker embedding, which is used to classify the speakers on the one hand and as the control information of the system on the other. Instead of introducing a Gradient Reversal Layer (GRL) to remove text-content information from the speaker embedding, the reference is randomly chosen from the same speaker as the target mel-spectrogram.
The duration-predictor is simply a layer of BLSTM with pre- and post-dense layers. It replaces attention-based alignment between the phoneme sequence and the mel-spectrogram sequence. In the length-regulator module, the phoneme sequence is repeated to the length of the mel-spectrogram sequence according to the durations; besides, the frame position within each phoneme is concatenated.
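The length-regulation step can be sketched as follows (a minimal sketch; the function name and the exact form of the position feature are our assumptions, not from the paper):

```python
import numpy as np

def length_regulate(phoneme_emb, durations):
    """Repeat each phoneme embedding for its predicted duration (in frames)
    and concatenate the frame position within the phoneme as an extra feature."""
    frames = []
    for emb, dur in zip(phoneme_emb, durations):
        for pos in range(dur):
            # relative position of this frame inside the phoneme, in [0, 1)
            frames.append(np.concatenate([emb, [pos / max(dur, 1)]]))
    return np.stack(frames)

# 3 phonemes with 4-dim embeddings and durations of 2/1/3 frames
out = length_regulate(np.zeros((3, 4)), [2, 1, 3])
print(out.shape)  # (6, 5): 6 frames, 4 embedding dims + 1 position feature
```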
Pitch and energy are predicted separately with the same network structure, stacked Conv1d layers with a post-dense layer. The excitation spectrogram is then generated from the pitch and energy (see Section 2.2); it aims to mitigate the one-to-many mapping problem by providing information about the harmonic structure.
Finally, the decoder is an auto-regressive structure, as shown in Figure 2, built around the proposed CGLSTM (see Section 2.3).
In the flow of information, an affine transformation conditioned on the speaker embedding is carried out on the text content. It is defined as Equation 1:

$h' = (W_{\gamma} \cdot s) \odot h + W_{\beta} \cdot s \quad (1)$

where $\odot$ means element-wise multiplication and $\cdot$ means matrix multiplication; $h$ is the text-content representation, $s$ is the speaker embedding, and $W_{\gamma}$, $W_{\beta}$ are learned projection matrices.
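As a reference, this speaker-conditioned affine transformation can be sketched as below (the weight names W_scale/W_shift are our own, assumed for illustration):

```python
import numpy as np

def speaker_affine(h, s, W_scale, W_shift):
    """Affine transformation of the text-content representation h,
    with scale and shift derived from the speaker embedding s."""
    scale = W_scale @ s          # matrix multiplication
    shift = W_shift @ s
    return scale * h + shift     # element-wise multiplication, then shift

h = np.ones(4)                   # toy text-content vector
s = np.ones(3)                   # toy speaker embedding
out = speaker_affine(h, s, np.ones((4, 3)), np.zeros((4, 3)))
print(out)  # [3. 3. 3. 3.]
```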
2.2 Excitation Spectrogram Generator
In source-filter analysis, speech is produced when an excitation signal passes through a system composed of the chest, glottis, oral cavity, etc., in which resonance occurs. The resonance phenomenon is reflected in the voice as the harmonic structure, a very important characteristic of speech. Unfortunately, existing studies pay more attention to the use of pitch than to the harmonic structure: pitch reflects the periodic characteristics of the excitation signal, but it does not reflect the resonance phenomenon. We therefore propose an excitation spectrogram generator that acts as a simple resonator: it takes pitch/energy as inputs and generates an excitation spectrogram with harmonics at vowels and a uniform spectrum at consonants. It provides a starting point with explicit harmonic structure for the prediction of the target mel-spectrogram.
The harmonics are defined as multiples of the fundamental frequency, as Equation 3:

$f_k = k \cdot f_0, \quad k = 1, 2, \dots, K \quad (3)$

where $f_k$ is the $k$-th harmonic position of the speech, $K$ is the number of harmonics, and $f_0$ is the fundamental frequency.
Then the excitation spectrogram is supposed to have energy only at the harmonic positions during vowels and at all positions during consonants, with the energy distributed equally over these positions, as Equation 4:

$S_t(n) = \begin{cases} E_t / K, & n \in \{f_1, \dots, f_K\} \ \text{(voiced)} \\ E_t / N, & \text{otherwise (unvoiced)} \end{cases} \quad (4)$

where $S_t$ is the linear spectrogram at frame $t$, $E_t$ is the total energy of frame $t$, and $N$ is the FFT number used in the calculation of the linear spectrum.
Finally, the linear excitation spectrogram is converted to the mel excitation spectrogram by Equation 5:

$M_t = T \cdot S_t \quad (5)$

where $T \in \mathbb{R}^{D \times N}$ is the transformation matrix from linear to mel spectrogram and $D$ is the dimension of the mel-spectrogram.
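The generator can be sketched per frame as follows (a minimal sketch; parameter values such as n_fft and the bin-rounding scheme are our assumptions, not from the paper):

```python
import numpy as np

def excitation_frame(f0, energy, n_fft=1024, sr=16000):
    """One frame of the linear excitation spectrum.
    Voiced frame (f0 > 0): energy split equally over the harmonics k*f0.
    Unvoiced frame: energy spread uniformly over all bins."""
    n_bins = n_fft // 2 + 1
    spec = np.zeros(n_bins)
    if f0 > 0:
        harmonics = np.arange(f0, sr / 2, f0)              # k*f0 up to Nyquist
        bins = np.round(harmonics / sr * n_fft).astype(int)
        spec[bins] = energy / len(bins)
    else:
        spec[:] = energy / n_bins
    return spec

def to_mel(linear_spec, mel_basis):
    """Project a linear excitation spectrum to mel bands;
    mel_basis is the D x N linear-to-mel transformation matrix."""
    return mel_basis @ linear_spec

frame = excitation_frame(f0=100.0, energy=1.0)
print(frame.shape, round(frame.sum(), 6))  # (513,) 1.0
```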
2.3 Conditional Gated LSTM
Text content is the most important feature of speech due to its decisive role in intelligibility. In addition, speech can be characterized in terms of timbre, style, speaker, emotion, etc. Many studies aim to change or control some of these characteristics without negatively influencing intelligibility. For this purpose, the control information is usually added or concatenated to the text content and fed as the input of the network. However, in this way the control information plays a similar role to the text content: both directly take effect, in the same way, on intelligibility and the other characteristics at the same time, which is not what we expect. Consequently, we propose the conditional gated LSTM (CGLSTM), where the control information is used to re-weight the gates while the text content flows through the hidden/cell states. Thereby the control information directly takes part in the gate-based flow of the text content without operating on the text content itself.
Compared with the Long Short-Term Memory (LSTM), which is frequently used in speech synthesis tasks due to its good capacity for learning long dependencies, the proposed CGLSTM calculates the hidden/cell states in the same manner, from the current inputs and the previous hidden/cell states. For the calculation of the input/output/forget gates, however, the control information is used to re-weight the LSTM gates, as Equations 6-8:

$f_t = \sigma(W_f \cdot [x_t, s_t, h_{t-1}] + b_f) \quad (6)$
$i_t = \sigma(W_i \cdot [x_t, s_t, h_{t-1}] + b_i) \quad (7)$
$o_t = \sigma(W_o \cdot [x_t, s_t, h_{t-1}] + b_o) \quad (8)$

where $f_t$, $i_t$ and $o_t$ are the forget, input and output gates; $x_t$, $s_t$ and $h_{t-1}$ are the current text-content inputs, the current control-information inputs, and the previous hidden state; $W_*$ and $b_*$ are the corresponding weights and biases.
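A single CGLSTM step can be sketched as follows (a sketch under our reading of the description above; the exact gating form, weight shapes, and names are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cglstm_step(x, s, h_prev, c_prev, W, b):
    """One CGLSTM step: the gates see the control information s, but the
    candidate cell update is computed from the text content x and h_prev
    only, so s steers the flow of text-content information without
    entering the cell state directly."""
    gate_in = np.concatenate([x, s, h_prev])
    f = sigmoid(W["f"] @ gate_in + b["f"])     # forget gate
    i = sigmoid(W["i"] @ gate_in + b["i"])     # input gate
    o = sigmoid(W["o"] @ gate_in + b["o"])     # output gate
    content_in = np.concatenate([x, h_prev])   # no control information here
    g = np.tanh(W["g"] @ content_in + b["g"])  # candidate cell update
    c = f * c_prev + i * g                     # new cell state
    h = o * np.tanh(c)                         # new hidden state
    return h, c

# toy dimensions: 4-dim text input, 3-dim control input, 5-dim state
rng = np.random.default_rng(0)
W = {"f": rng.normal(size=(5, 12)), "i": rng.normal(size=(5, 12)),
     "o": rng.normal(size=(5, 12)), "g": rng.normal(size=(5, 9))}
b = {k: np.zeros(5) for k in W}
h, c = cglstm_step(np.ones(4), np.ones(3), np.zeros(5), np.zeros(5), W, b)
print(h.shape, c.shape)  # (5,) (5,)
```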
3 Experiments

3.1 Data Set

The data set of our experiments is the public multi-speaker Mandarin corpus AISHELL-3, which contains roughly 85 hours of recordings spoken by 218 native Chinese Mandarin speakers. Among them, the recordings of 173 speakers have Chinese character-level and pinyin-level transcripts, totaling 63,263 utterances. This transcribed part of the data is used in our experiments and is divided into non-overlapping training and test sets.
Training set: contains 57,304 utterances from 165 speakers (46,915 utterances from 133 female speakers and 10,389 utterances from 32 male speakers). The training set is used to pre-train the multi-speaker generative model, which is further adapted using the test set.
Test set: contains 4 female and 4 male speakers; only 20 utterances per speaker are randomly chosen for speaker adaptation.
The recordings are mono, 16-bit, and down-sampled from 44,100 Hz to 16,000 Hz. Preprocessing is conducted on both the training and test sets to reduce their diversity: 1) energy normalization by scaling the maximal amplitude of each utterance; 2) silence trimming, keeping 60 ms of silence at the head and tail of each utterance.
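These two preprocessing steps can be sketched as below (a minimal sketch; the silence threshold is an assumed value, not given in the paper):

```python
import numpy as np

def preprocess(wav, sr=16000, keep_ms=60, thresh=0.01):
    """Peak-normalize an utterance, then trim leading/trailing silence
    while keeping `keep_ms` of silence at the head and tail."""
    wav = wav / (np.abs(wav).max() + 1e-8)    # 1) energy normalization
    voiced = np.where(np.abs(wav) > thresh)[0]
    if len(voiced) == 0:
        return wav                            # all silence: nothing to trim
    pad = int(sr * keep_ms / 1000)            # 60 ms -> 960 samples at 16 kHz
    start = max(voiced[0] - pad, 0)
    end = min(voiced[-1] + 1 + pad, len(wav))
    return wav[start:end]                     # 2) silence trimming

# 2 s silence + 1 s tone + 2 s silence
wav = np.concatenate([np.zeros(32000), 0.5 * np.ones(16000), np.zeros(32000)])
out = preprocess(wav)
print(len(out), round(np.abs(out).max(), 3))  # 17920 1.0
```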
3.2 Experimental Setup

The pipeline of our experiments includes: 1) pre-training: train the multi-speaker generative model using the training set; 2) speaker adaptation: train the target model by transfer learning using single-speaker data from the test set; and 3) inference: infer the mel-spectrogram and synthesize the waveform with a vocoder. Here the modified neural vocoder LPCNet is used, which takes the mel-spectrogram as input.
In our experiments, the frame hop size is set to 12.5 ms, the window size to 50 ms, and the number of mel bands to 80 for the mel-spectrogram. The Mean Absolute Error (MAE) is used to measure the reconstruction error of lf0 and energy, while the Mean Square Error (MSE) is applied to the mel-spectrogram. Besides, the speaker classification task uses cross-entropy as its loss function. The setup of our experiments is described as follows:
Baseline: compared with the framework in Figure 1, the following modifications are made: 1) the excitation spectrogram generator is removed; 2) the CGLSTM in the decoder is replaced with the standard LSTM, while the speaker embedding is used to transform the text content through an affine layer before it is fed to the decoder.
System-1: Baseline + excitation spectrogram generator
System-2: Baseline + CGLSTM decoder
System-3: Baseline + excitation spectrogram generator + CGLSTM decoder.
3.3 Multi-speaker Generative Model
Figure 3 shows the mel-spectrogram reconstruction error of the different systems in the pre-training stage. Compared with the baseline, the excitation-spectrogram generator (System-1) and the CGLSTM decoder (System-2) each brought an obvious improvement in reconstruction error, and the error was reduced further in System-3. This shows that the excitation spectrogram and the CGLSTM, used together or separately, can greatly improve the modeling capability for multi-speaker data.
We also compared the number of parameters of each system, as shown in Table 1. In general, there is no big difference among them; compared with the baseline, the parameter count of System-3 even drops by 10%. In other words, we can achieve better performance with less computation.
3.4 Speaker Adapted Model
For the unseen speakers in the test set, we adapted the multi-speaker model using data from each target speaker. A Mean Opinion Score (MOS) test was carried out to evaluate the intelligibility of speech, voice quality and speaker similarity; twenty native Chinese testers participated. The MOS results are shown in Table 2.
According to the MOS results, System-1 outperforms the baseline in all aspects: intelligibility, voice quality and speaker similarity. This indicates that the excitation spectrogram, which captures the explicit harmonic structure of speech, is much more effective than the simple use of pitch and energy. It can improve the clarity of pronunciation for some speakers and, at the same time, reduce the noise or signal distortion caused by insufficient modeling capability for complex data.
The proposed CGLSTM decoder also brings a substantial improvement, as a comparison of System-2 with the baseline shows. The MOS of intelligibility increased by 0.26 points, which indicates that the CGLSTM can reduce the negative impact of the control information on intelligibility. Besides, the improvement in speaker similarity indicates that the CGLSTM can control the specific characteristics of the voice better than the LSTM.
Using the excitation spectrogram and the CGLSTM decoder together in System-3, we achieve the best MOS performance. In addition, an AB-preference test was conducted between System-1 and System-3, as shown in Figure 4. In voice quality, System-3 performs slightly better than System-1 for male speakers and slightly worse for female speakers, with 37.5% of testers on average having no preference. Considering both the MOS and AB-preference results, System-1 and System-3 are comparable.
Finally, the performance for the male speakers in the test set is obviously worse than for the female speakers. One possible reason is that the training data is not balanced between females and males, with a rough ratio of female:male = 9:2. The performance gap becomes smaller after using the CGLSTM decoder, e.g. the voice-quality gap drops from 1.17 (System-1) to 0.83 (System-3). A possible explanation is that, in the case of imbalanced data, the CGLSTM shares information better than the LSTM and can control the specific features of the voice through the control information; further investigation is needed to verify this.
4 Conclusions

In this paper, we have proposed 1) the excitation spectrogram generator, which captures the harmonic structure of speech and aims to handle the diversity of multi-speaker data by providing a starting point for the mel-spectrogram, and 2) the CGLSTM, which controls the specific characteristics of speech with less impact on intelligibility than the LSTM. The experiments showed a large reduction in the mel-spectrogram reconstruction error when using the excitation spectrogram generator and the CGLSTM decoder. In System-3, the multi-speaker generative model obtained better modeling capability with a 10% reduction in model size. The effectiveness of the proposed methods was further verified in the subjective evaluations of the speaker-adapted models, which showed comprehensive improvements in intelligibility, voice quality and speaker similarity; e.g. in System-3, the MOS of intelligibility improved from 3.30 to 4.11, and the voice quality from 2.54 to 3.93, for female speakers. However, we also found that the performance for male speakers is worse than for female speakers, which perhaps derives from the imbalanced female/male data and needs further research in the future.
-  Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” ArXiv, vol. abs/2006.04558, 2020.
-  N. Kumar, S. Goel, A. Narang, and B. Lall, “Few shot adaptive normalization driven multi-speaker speech synthesis,” ArXiv, vol. abs/2012.07252, 2020.
-  J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783, 2018.
-  S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. C. Courville, and Y. Bengio, “Samplernn: An unconditional end-to-end neural audio generation model,” ArXiv, vol. abs/1612.07837, 2017.
-  J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. C. Courville, and Y. Bengio, “Char2wav: End-to-end speech synthesis,” in ICLR, 2017.
-  B. Bollepalli, L. Juvela, and P. Alku, “Lombard speech synthesis using transfer learning in a tacotron text-to-speech system,” in INTERSPEECH, 2019.
-  Q. Hu, E. Marchi, D. Winarsky, Y. Stylianou, D. K. Naik, and S. S. Kajarekar, “Neural text-to-speech adaptation from low quality public recordings,” in Speech Synthesis Workshop 10, 2019.
-  M. Chen, M. Chen, S. Liang, J. Ma, L. Chen, S. Wang, and J. Xiao, “Cross-lingual, multi-speaker text-to-speech synthesis using neural speaker embedding,” in INTERSPEECH, 2019.
-  D. Paul, P. V. M. Shifas, Y. Pantazis, and Y. Stylianou, “Enhancing speech intelligibility in text-to-speech synthesis using speaking style conversion,” ArXiv, vol. abs/2008.05809, 2020.
-  S. Choi, S. Han, D.-Y. Kim, and S. Ha, “Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding,” ArXiv, vol. abs/2005.08484, 2020.
-  C.-M. Chien, J. hao Lin, C. yu Huang, P. chun Hsu, and H. yi Lee, “Investigating on incorporating pretrained and learnable speaker representations for multi-speaker multi-style text-to-speech,” ArXiv, vol. abs/2103.04088, 2021.
-  M. Chen, X. Tan, Y. Ren, J. Xu, H. Sun, S. Zhao, and T. Qin, “Multispeech: Multi-speaker text to speech with transformer,” in INTERSPEECH, 2020.
-  E. Cooper, C.-I. Lai, Y. Yasuda, F. Fang, X. E. Wang, N. Chen, and J. Yamagishi, “Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings,” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6184–6188, 2020.
-  E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers based on a short untranscribed sample,” ArXiv, vol. abs/1802.06984, 2018.
-  Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez-Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in NeurIPS, 2018.
-  L. Wan, Q. Wang, A. Papir, and I. Lopez-Moreno, “Generalized end-to-end loss for speaker verification,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879–4883, 2018.
-  D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333, 2018.
-  Z. Cai, C. Zhang, and M. Li, “From speaker verification to multispeaker speech synthesis, deep transfer with feedback constraint,” ArXiv, vol. abs/2005.04587, 2020.
-  J. Cong, S. Yang, L. Xie, G. Yu, and G. Wan, “Data efficient voice cloning from noisy samples with domain adversarial training,” ArXiv, vol. abs/2008.04265, 2020.
-  Z. Kons, S. Shechtman, A. Sorin, C. Rabinovitz, and R. Hoory, “High quality, lightweight and adaptable tts using lpcnet,” in INTERSPEECH, 2019.
-  W.-N. Hsu, Y. Zhang, R. J. Weiss, Y.-A. Chung, Y. Wang, Y. Wu, and J. R. Glass, “Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5901–5905, 2019.
-  W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” ArXiv, vol. abs/1810.07217, 2019.
-  Y.-J. Zhang, S. Pan, L. He, and Z. Ling, “Learning latent representations for style control and transfer in end-to-end speech synthesis,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6945–6949, 2019.
-  Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” in ICML, 2018.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” ArXiv, vol. abs/1706.03762, 2017.
-  Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” ArXiv, vol. abs/1409.7495, 2015.
-  M. Morise, F. Yokomori, and K. Ozawa, “World: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. 99-D, pp. 1877–1884, 2016.
-  Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “Aishell-3: A multi-speaker mandarin tts corpus and the baselines,” ArXiv, vol. abs/2010.11567, 2020.
-  J. Valin and J. Skoglund, “Lpcnet: Improving neural speech synthesis through linear prediction,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5891–5895, 2019.