FeatherTTS: Robust and Efficient Attention-Based Neural TTS

11/02/2020 ∙ by Qiao Tian, et al. ∙ Tencent

Attention-based neural TTS is an elegant speech synthesis pipeline and has shown a powerful ability to generate natural speech. However, it is still not robust enough to meet the stability requirements of industrial products. Besides, it suffers from slow inference speed owing to the autoregressive generation process. In this work, we propose FeatherTTS, a robust and efficient attention-based neural TTS system. First, we propose a novel Gaussian attention which exploits the interpretability of Gaussian attention and the strictly monotonic nature of alignment in TTS. With this method, we replace the commonly used stop token prediction architecture with attentive stop prediction. Second, we apply block sparsity to the autoregressive decoder to speed up speech synthesis. The experimental results show that the proposed FeatherTTS not only nearly eliminates word skipping and repetition on particularly hard texts while keeping the naturalness of the generated speech, but also speeds up acoustic feature generation by 3.5 times over Tacotron. Overall, the proposed FeatherTTS can run 35x faster than real time on a single CPU.




1 Introduction

In recent years, with the rapid development of deep learning, neural text-to-speech (TTS) can synthesize speech that is more natural and expressive than traditional TTS pipelines. Neural TTS is usually divided into two parts: an acoustic model and a neural vocoder. First, the input text (phoneme) sequence is converted into an intermediate acoustic feature sequence (linear spectrogram or mel-spectrogram) by an acoustic model such as Tacotron [19], Tacotron2 [15], Transformer TTS [9], FastSpeech [14], etc. Then, the Griffin-Lim algorithm [6] or a neural vocoder such as WaveNet [18] and WaveRNN [10] is used to generate the final waveform from the acoustic features. Sequence-to-sequence models with an attention mechanism are currently the predominant paradigm in neural acoustic modeling and have shown a powerful ability to generate expressive and high-quality speech. These models learn the alignment between the text sequence and frame-level acoustic features through the attention mechanism, and then predict spectral features that contain information such as pronunciation and prosody. The quality of speech synthesized by neural TTS is limited by the alignment generated by the attention mechanism. Although attention-based neural TTS has achieved great success, it is difficult to deploy in industry due to its occasional alignment errors.

Tacotron [19], with its content-based attention mechanism, does not take into account the monotonicity and locality of TTS alignment. An improved hybrid location-sensitive mechanism proposed in Tacotron2 [15] combines content-based and location-based features to achieve the synthesis of longer utterances. However, such a hybrid mechanism also causes alignment issues occasionally. Recently, inspired by the purely location-based GMM attention mechanism [5], an improved location-based GMM attention mechanism called GMMv2b was proposed in Google's work [3], which shows that the GMMv2b-based mechanism is able to generalize to long utterances and can also improve the speed and consistency of alignment during training. However, the commonly used stop token architecture often stops too early on complex texts and long sentences. In addition, such GMM attention is unnormalized and not strictly monotonic, which leads to unstable performance.

In this paper, we propose a novel attention-based neural TTS model named FeatherTTS, which can perform stable, fast and high-quality synthesis. Our major contributions are as follows: (1) We introduce Gaussian attention for acoustic modeling, a monotonic, normalized and stable attention mechanism, which is highly interpretable for end-to-end speech synthesis. (2) To solve the early-stop issue, we remove the widely adopted stop token architecture of Tacotron2 and propose the attentive stop loss (ATL), which determines when to stop directly from the alignment and converges quickly for Gaussian attention. (3) To improve the inference speed and reduce the number of parameters without sacrificing speech quality, we adopt a block sparse strategy to prune the weights of the decoder.

2 Related work

2.1 Hybrid attention based Tacotron2

Sequence-to-sequence models with an attention mechanism are currently the predominant paradigm in neural TTS. Attention-based neural TTS such as Tacotron2 [15] generally uses an encoder to encode the input sequence x = (x_1, ..., x_N) into a hidden representation h = (h_1, ..., h_N), where N is the length of the input phoneme sequence. Then, the attention RNN generates a state vector s_i, which is used as the query vector of the attention mechanism to generate the alignment α_i at decoding time step i. According to the alignment α_i, a weighted average of the encoder outputs is calculated as the context vector

c_i = Σ_{j=1}^{N} α_{i,j} h_j

Finally, the context vector is fed into the decoder, and the final acoustic feature sequence (y_1, ..., y_T) is computed through the post-net, where T is the length of the output mel-spectrogram sequence.
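To make the attention step concrete, the following is a minimal numpy sketch of the context-vector computation, i.e. the weighted average of encoder outputs under the alignment; the toy weights and encoder states stand in for the learned quantities:

```python
import numpy as np

def context_vector(alignment, encoder_outputs):
    # c_i = sum_j alpha_{i,j} * h_j : weighted average of encoder states
    # alignment: (N,) attention weights, encoder_outputs: (N, D)
    return alignment @ encoder_outputs

h = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])           # N = 3 encoder states, D = 2
alpha = np.array([0.5, 0.25, 0.25])  # normalized attention weights
c = context_vector(alpha, h)         # -> [0.75, 0.5]
```

In the real model the query s_i produced by the attention RNN determines alpha at every decoding step; here alpha is fixed only to show the reduction.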

Recently, many works have proposed various attention mechanisms: Tacotron [19] uses the purely content-based attention mechanism introduced in [1], Tacotron2 [15] uses an improved hybrid location-sensitive mechanism introduced in [4], some works [13, 20, 7] explore monotonic attention mechanisms, and other authors [8, 2] use location-based GMM attention.

2.2 Location based GMMv2b

Recently, Google's work [3] proposed a modified location-based attention mechanism called GMMv2b, which has achieved great success. The GMMv2b mechanism is inspired by the location-based GMM attention mechanism introduced in [5], which uses K Gaussian components to compute the alignment as (5). The mean of each Gaussian component is computed following the recurrence relation in (6):

α_{i,j} = Σ_{k=1}^{K} w_{i,k} exp(−(j − μ_{i,k})² / (2σ_{i,k}²))    (5)

μ_{i,k} = μ_{i−1,k} + Δ_{i,k}    (6)

The monotonicity of GMM attention is guaranteed by making Δ_{i,k} non-negative.


GMM attention usually calculates the intermediate variables (ŵ_{i,k}, Δ̂_{i,k}, σ̂_{i,k}) first, and then uses the exponential function to obtain the final variables. In order to stabilize GMM attention, GMMv2b-based attention instead uses the softmax and the softplus functions to compute the final mixture parameters as

w_{i,k} = softmax_k(ŵ_{i,k}),  Δ_{i,k} = softplus(Δ̂_{i,k}),  σ_{i,k} = softplus(σ̂_{i,k})

where softmax normalizes the mixture weights over the K components and softplus(x) = log(1 + e^x). Besides, GMMv2b-based attention adds initial biases to the intermediate parameters Δ̂_{i,k} and σ̂_{i,k}, which encourages the final parameters to take on useful values at initialization.
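A small sketch of this parameter mapping, assuming the intermediate variables have already been produced by the attention network; the bias values here are illustrative, not taken from the paper:

```python
import numpy as np

def softplus(x):
    # softplus(x) = log(1 + e^x), always positive
    return np.log1p(np.exp(x))

def gmmv2b_params(w_hat, delta_hat, sigma_hat, delta_bias=1.0, sigma_bias=10.0):
    # softmax over the K mixture components -> normalized mixture weights
    e = np.exp(w_hat - w_hat.max())
    w = e / e.sum()
    # softplus keeps the step non-negative and the width positive; the
    # initial biases (illustrative values) push the parameters toward
    # useful ranges at initialization
    delta = softplus(delta_hat + delta_bias)
    sigma = softplus(sigma_hat + sigma_bias)
    return w, delta, sigma

# at initialization the intermediate variables are roughly zero
w, delta, sigma = gmmv2b_params(np.zeros(5), np.zeros(5), np.zeros(5))
```

Note how, unlike an exponential parameterization, softplus grows only linearly for large inputs, which is what stabilizes training.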

As shown in [3], the GMMv2b-based mechanism is able to generalize to long utterances and maintains good naturalness, which makes the synthesis of the entire paragraph possible.

3 The proposed method

Although the GMMv2b-based mechanism performs well, it still has several problems. First, the model still uses the stop token architecture, which can lead to early stopping. Second, GMM attention is not completely monotonic because it uses a mixture of distributions with infinite support. Finally, GMMv2b attention is unnormalized because the attention weights are sampled from a continuous probability density function, which can lead to occasional spikes or dropouts in the alignment. In particular, repetition problems occur when synthesizing short utterances such as monophones and vowels. Therefore, we propose FeatherTTS, a more robust attention-based acoustic model, as shown in Fig. 1. Our model is based on the Tacotron2 [15] architecture and consists of a CBHG encoder, Gaussian attention and a block sparse decoder.

Figure 1: The architecture of FeatherTTS

3.1 Gaussian attention

In order to solve the incomplete monotonicity and the unnormalized alignment of GMM attention, we propose to use a Gaussian attention mechanism to model the alignment, as shown in (8). We also calculate the intermediate variables (Δ̂_i, σ̂_i) first, and then obtain the final parameters (Δ_i, σ_i) through the softplus function:

α_{i,j} = (1 / (σ_i √(2π))) exp(−(j − μ_i)² / (2σ_i²))    (8)

μ_i = μ_{i−1} + Δ_i

We use this simple and normalized Gaussian attention function to calculate the alignment α_i. The mean μ_i and the variance σ_i² of the Gaussian attention mechanism control the position and width of the attention window, respectively. Δ_i is non-negative, so the mean μ_i is monotonically increasing, which guarantees that the alignment process of the Gaussian attention mechanism is completely monotonic.
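A minimal numpy sketch of one decoding step of this mechanism; in the actual model the intermediate variables come from the attention network conditioned on the query, whereas here they are fixed toy values:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def gaussian_attention_step(mu_prev, delta_hat, sigma_hat, N):
    delta = softplus(delta_hat)   # non-negative step -> monotonic mean
    sigma = softplus(sigma_hat)   # positive window width
    mu = mu_prev + delta          # mean can only move forward
    j = np.arange(N)
    # normalized Gaussian density over the N input positions
    alpha = np.exp(-((j - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))
    return alpha, mu

alpha, mu = gaussian_attention_step(mu_prev=2.0, delta_hat=0.5, sigma_hat=0.0, N=10)
```

Because the step Δ_i is squashed through softplus, mu can never decrease, so the attention window sweeps strictly left-to-right over the phoneme sequence.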

3.2 Attentive stop loss

The stop token architecture used in Tacotron2 [15] causes early-stop problems. In addition, the alignment information learned by the Gaussian attention is too weak, which makes the alignment difficult to converge. In order to solve the above problems, we remove the stop token architecture and propose the attentive stop loss, which judges when to stop directly from the alignment. It is calculated as

L_ATL = |μ_I − N|

where μ_I is the mean value of the Gaussian attention function at the last decoding step, and N is the length of the input phoneme sequence.

During training, the attentive stop loss forces the mean of the Gaussian attention to move toward the end of the phoneme sequence to ensure accurate alignment. In the inference stage, FeatherTTS stops predicting when μ_i > N.
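The training loss and the inference-time stop rule can be sketched in a few lines; this is our reading of the mechanism described above, with toy values:

```python
def attentive_stop_loss(mu_last, N):
    # ATL = |mu_I - N|: pull the final attention mean to the end of the
    # phoneme sequence (our reading of the paper's definition)
    return abs(mu_last - N)

def should_stop(mu_i, N):
    # inference-time rule: stop once the attention mean passes the input length
    return mu_i > N

loss = attentive_stop_loss(mu_last=18.5, N=20)  # -> 1.5
```

Unlike a learned stop token, this rule cannot fire before the attention window has actually reached the end of the input, which is why it avoids early stops.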

3.3 Sparse autoregressive decoder

It has been demonstrated that, at the same computational complexity, a larger sparse network performs better than a smaller dense network [10, 17]. In this work, to reduce the amount of computation in the LSTM layers of the decoder without a significant loss in quality, we reduce the number of non-zero values in each LSTM kernel weight. Inspired by [12, 11], we adopt a weight pruning scheme based on weight magnitude.

We start weight pruning after 20K steps, and every 500 steps we sort the weights of the sparsified LSTM layers and zero out a certain number of weights with the smallest magnitudes, until the target sparsity is reached at 200K steps. After block sparsification, the number of main operations in every sparsified LSTM layer is approximately

4 × (1 − s) × (d_i + d_h) × d_h

where d_i and d_h are the dimensions of the input and hidden state of the LSTM cell, respectively, and s is the target sparsity.
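The core pruning operation can be sketched as follows. This is a one-shot simplification for illustration; the paper prunes gradually every 500 steps, and the block size here is a hypothetical choice:

```python
import numpy as np

def block_prune(weights, block=(4, 4), sparsity=0.75):
    # Magnitude-based block pruning sketch: zero out the fraction `sparsity`
    # of blocks with the smallest mean absolute magnitude.
    h, w = weights.shape
    bh, bw = block
    assert h % bh == 0 and w % bw == 0
    # mean |weight| per block, shape (h/bh, w/bw)
    mags = np.abs(weights).reshape(h // bh, bh, w // bw, bw).mean(axis=(1, 3))
    n_zero = int(round(mags.size * sparsity))
    mask = np.ones_like(mags)
    mask.flat[np.argsort(mags, axis=None)[:n_zero]] = 0.0
    # expand the block mask back to the full weight shape
    return weights * np.kron(mask, np.ones((bh, bw)))

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
W_sparse = block_prune(W, block=(4, 4), sparsity=0.75)
```

Pruning whole blocks rather than individual weights is what makes the remaining computation friendly to vectorized (e.g. AVX512) kernels at inference time.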

In FeatherTTS, we used the time-delayed post-net as in [21], which is a vanilla LSTM layer with 256 units. Overall, FeatherTTS is trained to minimize the total loss

L = ||y − ŷ_before||_1 + ||y − ŷ_after||_1 + λ L_ATL    (12)

where the post-net output is delayed by d frames, d is the number of frames of the time delay, and λ is a scaling factor. On the right-hand side of Eq. 12, the first two terms are the L1 losses between the reference mel-spectrogram y and the mel-spectrograms predicted before and after the post-net, ŷ_before and ŷ_after. The last term is the attentive stop loss.
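Putting the pieces together, the total objective can be sketched as below; this follows our reading of Eq. 12, and the frame-delay handling of the post-net is omitted for brevity:

```python
import numpy as np

def feathertts_loss(mel_ref, mel_before, mel_after, mu_last, N, lam=1.0):
    # Two L1 terms on the mel-spectrograms predicted before and after the
    # post-net, plus the attentive stop loss scaled by `lam`.
    l1_before = np.mean(np.abs(mel_ref - mel_before))
    l1_after = np.mean(np.abs(mel_ref - mel_after))
    atl = abs(mu_last - N)
    return l1_before + l1_after + lam * atl

mel = np.zeros((5, 80))  # 5 frames of an 80-band mel-spectrogram
loss = feathertts_loss(mel, mel + 0.1, mel, mu_last=20.0, N=20, lam=1.0)  # ≈ 0.1
```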

4 Experiments

4.1 Data Set

We used a corpus containing 20 hours of Mandarin recordings by a professional broadcaster for all experiments. The corpus was split into a training set of approximately 18 hours and a test set of 2 hours. All recordings were down-sampled to a 24 kHz sampling rate with 16-bit quantization. We used 80-band mel-scale spectrograms as the training target, and the mel-scale spectrograms were converted into waveforms by the FeatherWave neural vocoder [16].

4.2 Experimental Setup

For comparison, we implemented two models: GMMv2b-based Tacotron2 and FeatherTTS. As the baseline model, the GMMv2b-based model is composed of five mixture components. In order to reduce the model size and the training and inference time, two consecutive frames were predicted at each decoding time step. For FeatherTTS, the post-net was delayed by d frames and the scaling factor λ of the attentive stop loss was set to a fixed value. All models were trained for 300k steps with batch size 32 on a single GPU. Other experimental setups are the same as AdaDurIAN [21] unless otherwise specified.

4.3 Evaluations

In this section, we evaluate the proposed FeatherTTS and Tacotron2 (GMMv2b) in terms of naturalness and robustness, and compare the synthesis speed of the two models and FastSpeech.

4.3.1 Mean Opinion Score

We used the Mean Opinion Score (MOS) to measure the naturalness of the synthesized speech (part of the synthesized samples can be found at this URL: ). We used 20 unseen sentences for evaluating the models. The naturalness of the generated utterances was rated by human subjects through crowdsourced listening tests. The results of the subjective MOS evaluation are presented in Table 1. They show that, under the same vocoder configuration, FeatherTTS and Tacotron2 (GMMv2b) have similar MOS values. In addition, we compared the effect of block sparsity on sound quality. The experimental results show that FeatherTTS with block sparsity outperforms FeatherTTS without block sparsity by 0.01 MOS, which is basically in line with our expectations.

Model                          MOS on speech quality
Tacotron2 (GMMv2b)             4.31 ± 0.03
FeatherTTS w/o block sparsity  4.32 ± 0.04
FeatherTTS                     4.33 ± 0.04
Table 1: Mean Opinion Score (MOS) with confidence intervals for different models.

4.3.2 Word Error Rate

The design goal of FeatherTTS is to keep the naturalness of Tacotron2 (GMMv2b) while avoiding the mispronunciations observed in Tacotron2 (GMMv2b). Therefore, we compared the robustness of the two systems on generated speech. To evaluate the robustness of FeatherTTS, we prepared 20 hard sentences for both systems and focused on word skipping, word repetition and inaccurate intonation. The results are shown in Table 2. Tacotron2 (GMMv2b) has an error rate of 4.1%, while FeatherTTS is more robust, with an error rate of only 0.9%. This strongly demonstrates the role of Gaussian attention and the attentive stop loss in improving model stability.

Model               Word error rate
Tacotron2 (GMMv2b)  4.1%
FeatherTTS          0.9%
Table 2: The Word Error Rate (WER) for different models.

4.3.3 Synthesis Speed

In this experiment, we demonstrate the effectiveness of the proposed block sparse decoder for accelerating training and inference. We compared the real-time rates of FastSpeech, Tacotron2 (GMMv2b) and FeatherTTS when generating mel-spectrograms on a single CPU core (Intel Xeon Platinum 8255C). The results are presented in Table 3. Tacotron2 (GMMv2b) achieves an inference speed 10.4 times faster than real time on a single CPU core, while FeatherTTS is a further 3.5 times faster than Tacotron2 (GMMv2b). In addition, compared with the non-autoregressive FastSpeech, FeatherTTS is also about 2.6 times faster. Furthermore, we truncated the parameters and ran inference in the BF16 format to reduce memory consumption, finally achieving 60 times faster than real time on a single CPU core (Cooper Lake, 3rd Gen Intel Xeon Scalable processors). The above experiments demonstrate the acceleration provided by the block sparse decoder for inference, which makes it possible to deploy TTS on edge devices.

Model               Speed (× real time)
FastSpeech          13.3x
Tacotron2 (GMMv2b)  10.4x
FeatherTTS          35.0x
FeatherTTS BF16     60.0x
Table 3: The inference speed of different models.

5 Conclusions

In this work, we proposed FeatherTTS, an improved neural TTS system with Gaussian attention, attentive stop loss and a block sparse decoder. Experiments demonstrate that this attention mechanism is very efficient and greatly improves the robustness of attention-based neural TTS systems. With the block sparse decoder, the proposed FeatherTTS speeds up mel-spectrogram generation by 3.5 times over Tacotron2 with nearly no performance degradation. The ideas introduced in FeatherTTS pave a new way for both efficient and robust speech synthesis, and could also be applied to other sequence-to-sequence tasks, including automatic speech recognition.

For future work, we will continue to investigate the performance of FeatherTTS on edge-devices.

6 Acknowledgments

The authors would like to thank Yi Xie of IAGS, Intel Asia-Pacific Research & Development Co., Ltd., who helped optimize our algorithm with AVX512 and BF16 intrinsics to achieve good performance on the 3rd Gen Intel Xeon Scalable processors.


  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.1.
  • [2] E. Battenberg, S. Mariooryad, D. Stanton, R. Skerry-Ryan, M. Shannon, D. Kao, and T. Bagby (2019) Effective use of variational embedding capacity in expressive end-to-end speech synthesis. arXiv preprint arXiv:1906.03402. Cited by: §2.1.
  • [3] E. Battenberg, R. Skerry-Ryan, S. Mariooryad, D. Stanton, D. Kao, M. Shannon, and T. Bagby (2020) Location-relative attention mechanisms for robust long-form speech synthesis. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6194–6198. Cited by: §1, §2.2, §2.2.
  • [4] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. In Advances in neural information processing systems, pp. 577–585. Cited by: §2.1.
  • [5] A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §1, §2.2.
  • [6] D. Griffin and J. Lim (1984) Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (2), pp. 236–243. Cited by: §1.
  • [7] M. He, Y. Deng, and L. He (2019) Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural tts. arXiv preprint arXiv:1906.00672. Cited by: §2.1.
  • [8] K. Kastner, J. F. Santos, Y. Bengio, and A. Courville (2019) Representation mixing for tts synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5906–5910. Cited by: §2.1.
  • [9] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu (2019) Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6706–6713. Cited by: §1.
  • [10] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu (2018) Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435. Cited by: §1, §3.3.
  • [11] S. Narang, E. Elsen, G. Diamos, and S. Sengupta (2017) Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119. Cited by: §3.3.
  • [12] S. Narang, E. Undersander, and G. Diamos (2017) Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782. Cited by: §3.3.
  • [13] C. Raffel, M. Luong, P. J. Liu, R. J. Weiss, and D. Eck (2017) Online and linear-time attention by enforcing monotonic alignments. arXiv preprint arXiv:1704.00784. Cited by: §2.1.
  • [14] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu (2019) Fastspeech: fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, pp. 3171–3180. Cited by: §1.
  • [15] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al. (2018) Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783. Cited by: §1, §1, §2.1, §2.1, §3.2, §3.
  • [16] Q. Tian, Z. Zhang, H. Lu, L. Chen, and S. Liu (2020) FeatherWave: an efficient high-fidelity neural vocoder with multi-band linear prediction. arXiv preprint arXiv:2005.05551. Cited by: §4.1.
  • [17] J. Valin and J. Skoglund (2019) LPCNet: improving neural speech synthesis through linear prediction. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5891–5895. Cited by: §3.3.
  • [18] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu (2016) WaveNet: a generative model for raw audio.. In SSW, pp. 125. Cited by: §1.
  • [19] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al. (2017) Tacotron: towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135. Cited by: §1, §1, §2.1.
  • [20] J. Zhang, Z. Ling, and L. Dai (2018) Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4789–4793. Cited by: §2.1.
  • [21] Z. Zhang, Q. Tian, H. Lu, L. Chen, and S. Liu (2020) AdaDurIAN: few-shot adaptation for neural text-to-speech with durian. arXiv preprint arXiv:2005.05642. Cited by: §3.3, §4.2.