Text-to-speech (TTS) is a typical sequence-to-sequence modeling task. In general, its input is a grapheme or phoneme sequence, while the output is a much longer sequence of frame-level acoustic parameters. In recent popular encoder-decoder architectures, the attention mechanism has demonstrated a strong capability for mapping between two sequences of different lengths and has achieved high naturalness in TTS tasks [17, 14, 6]. However, for unseen texts, it may also introduce errors such as missing or repeated phonemes, unexpectedly long silences, and even complete failure to produce speech [20, 5, 1]. Many efforts have been made to enhance attention robustness by constraining the attention to satisfy locality, monotonicity, and completeness, such as Forward attention [20], Stepwise monotonic attention [5], and Location-Relative attentions [1]. However, none of them constrains how many frames one token should occupy. Without such a constraint, phonemes in an unseen text may still be articulated far too briefly or for far too long in synthesized speech.
Apart from attention-based methods, many studies use a separate duration model to implement the sequence upsampling. FastSpeech [12], FastSpeech 2 [11], and DurIAN [18] duplicate encoder outputs according to the phoneme duration. Non-Attentive Tacotron [13] implements upsampling with Gaussian weights. The ground-truth duration is obtained from external forced-alignment tools [11, 18, 13, 4, 9] or by internal joint training [19, 7]. Regardless of how the alignment is obtained and how the duplicated tokens are smoothed, duration-informed methods consistently show naturalness degradation due to hard duration control.
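The duplication-based upsampling described above can be sketched as follows. This is a minimal illustration, not the implementation of any cited system: the function name `length_regulate` and the toy values are assumptions made here, and NumPy stands in for a deep learning framework.

```python
import numpy as np

def length_regulate(encoder_out: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each token's encoder vector durations[i] times along the time axis."""
    if encoder_out.shape[0] != len(durations):
        raise ValueError("one duration per token is required")
    return np.repeat(encoder_out, durations, axis=0)

# Three tokens with 2-dimensional hidden states and durations of 2, 1 and 3 frames.
hidden = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
frames = length_regulate(hidden, np.array([2, 1, 3]))
print(frames.shape)  # (6, 2)
```

Because every frame of a token receives exactly the same vector, the boundaries between tokens are hard, which is the property the paragraph above associates with reduced naturalness.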
Differentiable duration models [2, 3] have also been designed. They need no phoneme alignment guidance; instead, they optimize the duration model parameters by directly minimizing the final spectrogram reconstruction loss. For the duration loss, only the total duration of the phonemes in a sequence is taken into account. This improves on the naturalness of duration-informed methods. However, in such networks, the output of the duration model may not physically correspond to phoneme duration. In particular, when the predicted duration of one word is adjusted at inference time, the durations of other words in the synthesized speech are often affected unexpectedly.
This paper proposes Progression-Aware Monotonic Attention (PAMA; audio examples: https://pama-tts.github.io/) for sequence-to-sequence TTS to realize accurate phoneme duration control without naturalness degradation. The network is based on Tacotron2, but the Location Sensitive Attention (LSA) is replaced by stepwise monotonic attention. In addition, a soft guidance attention matrix is generated from the ground-truth alignment to improve both the efficiency of attention training and the correctness of the learned alignment. At the same time, an auxiliary duration model is trained with the same alignment labels. The duration model provides a latent duration representation to the attention memory and a backward position embedding to the attention query. The main contributions of this paper are:
- Design an innovative guidance attention matrix for the alignment constraint. The guidance is soft at phoneme boundaries, since there are no solid ground-truth breaks;
- Introduce a latent duration representation into the encoder output that serves as the attention memory. With this information, the alignment loss converges faster and more stably;
- Introduce the backward frame position within a phoneme into the prenet output that serves as the attention query. In this way, the generation of the current spectrum is conditioned not only on the preceding spectrogram but also on how many frames remain before the current phoneme should end. The former ensures spectral smoothness, while the latter helps phoneme duration control. Their impacts are balanced dynamically by the network.
2 Related Works
Although PAMA-TTS calculates the attention alignment vector recursively in the same way as stepwise monotonic attention [5], both the attention query and the attention memory differ. For the query, PAMA-TTS adds backward position information to make the attention aware of its progression through each token. For the memory, PAMA-TTS adds a latent duration representation for efficient and stable training convergence.
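For readers unfamiliar with the recursion mentioned above, the following sketch shows one soft (training-time) update as it is commonly written for stepwise monotonic attention: at each decoder step the alignment mass on a token either stays or moves forward by exactly one token. The function name, the use of a plain sigmoid for the stay probability, and the omission of any noise or hard sampling are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

def stepwise_monotonic_step(prev_alpha: np.ndarray, energies: np.ndarray) -> np.ndarray:
    """One decoder step of a soft stay-or-move-by-one alignment recursion.

    prev_alpha: (N,) alignment over tokens at the previous decoder step.
    energies:   (N,) unnormalised attention scores at the current step.
    """
    p_stay = 1.0 / (1.0 + np.exp(-energies))      # probability of staying on each token
    stay = prev_alpha * p_stay                     # mass that remains on token j
    move = np.zeros_like(prev_alpha)
    move[1:] = prev_alpha[:-1] * (1.0 - p_stay[:-1])  # mass arriving from token j-1
    # Mass that would move past the last token is simply dropped in this sketch.
    return stay + move

alpha = np.array([1.0, 0.0, 0.0])  # attention starts on the first token
for _ in range(2):
    alpha = stepwise_monotonic_step(alpha, np.zeros(3))
print(alpha)  # [0.25 0.5  0.25]
```

With zero energies the stay probability is 0.5 everywhere, so the mass spreads forward like a binomial walk, which makes the strictly monotonic, no-skip behaviour easy to see.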
VAENAR-TTS [8] introduces a latent variable to help soft attention alignment, which implicitly stands for phoneme duration. However, it has no phoneme-level duration labels for explicit guidance. Besides, VAENAR-TTS relies on an annealing reduction factor and a causality mask to help attention-based alignment learning, rather than applying a monotonic constraint.
Moreover, the attention alignment loss in PAMA-TTS is quite similar to PAG in [21]. However, PAMA-TTS generates guidance matrices in a softer way for better flexibility, since the results of a forced alignment tool may be slightly inaccurate, especially on found data.
The architecture of PAMA-TTS is shown in Fig. 1. Tacotron2 [14] with stepwise monotonic attention [5] is employed as the backbone. The modified modules are highlighted and are described one by one below.
3.1 Text Encoder & Phoneme Classifier
The text encoder takes a sequence of token IDs as input and outputs their latent representations. The tokens consist of regular phonemes, tones, prosodic boundaries, and silence, and they are placed in a carefully designed order to build the input sequences, as demonstrated in Fig. 2.
Since tones and most prosodic boundaries (except the intonation phrase boundary #3, which can also be regarded as silence or a short pause) do not correspond to any acoustic frames in speech, a filter is applied to skip their hidden states, as shown in Fig. 2. A similar strategy is used in DurIAN [18], which removes only prosodic boundaries. Moreover, the trimmed encoder output is fed into a phoneme classifier to ensure that the token location information is retained. Both designs aim to make the alignment subsequently learned by the attention mechanism more meaningful.
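The filtering step can be pictured as a boolean mask over the encoder's hidden states. The sketch below is illustrative only: the token type labels (`"phoneme"`, `"tone"`, `"boundary"`, `"boundary3"`) and the function name are hypothetical, chosen here to mirror the description that tones and all boundaries except #3 are skipped.

```python
import numpy as np

# Hypothetical type labels: tones and ordinary prosodic boundaries carry no frames.
NON_ACOUSTIC_TYPES = {"tone", "boundary"}

def filter_encoder_states(hidden: np.ndarray, token_types: list) -> tuple:
    """Drop hidden states of tokens that do not correspond to acoustic frames."""
    keep = np.array([t not in NON_ACOUSTIC_TYPES for t in token_types], dtype=bool)
    return hidden[keep], keep

# "boundary3" stands for the intonation phrase boundary #3, which is kept.
token_types = ["phoneme", "phoneme", "tone", "boundary", "phoneme", "tone", "boundary3"]
hidden = np.arange(len(token_types) * 4, dtype=float).reshape(len(token_types), 4)
filtered, keep = filter_encoder_states(hidden, token_types)
print(filtered.shape)  # (4, 4)
```

The encoder still sees tones and boundaries as context; only the sequence handed to the attention mechanism is shortened, so every remaining column of the alignment corresponds to audible frames.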
The encoder structure is the same as that of Tacotron2, i.e. three convolutional layers followed by a BLSTM layer. For the phoneme classifier, a single feed-forward layer with softmax cross-entropy loss is employed.
3.2 Guided Attention Matrix
Guided attention is used to help the attention module efficiently learn a correct mapping between the phoneme sequence and the acoustic frames. Previous work [21] used time-aligned phoneme sequences obtained by forced alignment to generate hard guidance matrices. To account for alignment errors, this paper modifies the guidance matrix to have fuzzy weights at phoneme boundaries, as shown in Fig. 3.
According to statistics on a large amount of data, most phoneme alignment errors are within 3 frames. Therefore, the weights at each boundary in the guidance matrix transition linearly from 0 to 1 over six frames with a step of 0.2. The alignment loss is then computed as a mean squared error:
$$L_{att} = \frac{1}{TN} \sum_{t=1}^{T} \sum_{n=1}^{N} \left(G_{t,n} - A_{t,n}\right)^2$$
where $T$ and $N$ denote the number of spectrogram frames and filtered tokens, and $G$ and $A$ are the guidance matrix and the attention weight matrix, respectively.
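One way to build such a soft guidance matrix from per-token frame counts is sketched below. The text only specifies a linear 0-to-1 transition over six frames in steps of 0.2; centring that ramp on the boundary (three frames on either side), the function names, and the parameters `half_width` and `step` are assumptions of this sketch.

```python
import numpy as np

def soft_guidance_matrix(durations, half_width: int = 3, step: float = 0.2) -> np.ndarray:
    """Build a (T, N) guidance matrix with linear ramps around each token boundary."""
    durations = np.asarray(durations)
    ends = np.cumsum(durations)          # exclusive end frame of each token
    starts = ends - durations            # first frame of each token
    t = np.arange(int(ends[-1]))[:, None]
    # Weight rises 0, 0.2, ..., 1.0 across the six frames around a token's start ...
    rise = np.clip((t - starts[None, :] + half_width) * step, 0.0, 1.0)
    # ... and falls 1.0, 0.8, ..., 0 across the six frames around its end.
    fall = np.clip((ends[None, :] + half_width - 1 - t) * step, 0.0, 1.0)
    rise[:, 0] = 1.0    # no ramp before the first token
    fall[:, -1] = 1.0   # no ramp after the last token
    return np.minimum(rise, fall)

def alignment_loss(guidance: np.ndarray, attention: np.ndarray) -> float:
    """Mean squared error between guidance and attention weights."""
    return float(np.mean((guidance - attention) ** 2))

G = soft_guidance_matrix([5, 6])   # two tokens, boundary at frame 5
print(np.round(G[2:8, 1], 1))      # ramp of the second token around the boundary
```

With this centring, the weights of the two neighbouring tokens sum to one inside the ramp, so the guidance never asks the attention to put more than unit mass on a frame.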
3.3 Progression-Aware Monotonic Attention
The proposed PAMA is based on stepwise monotonic attention. To make the monotonic attention aware of the mapping progression between phonemes and the spectrogram, two additional pieces of information are leveraged: a latent duration code for the attention memory, and a relative position embedding for the attention query.
The latent duration code is taken from the last hidden layer of a duration predictor and transformed by a linear layer. For each phoneme, its duration code is added to its encoder output to generate the key and value for the attention mechanism. In this way, the attention's key and value vectors carry duration information more explicitly.
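As a rough illustration of this step, the sketch below projects the duration predictor's last hidden layer to the encoder dimension and adds it to the encoder output. All dimensions, the random parameters, and the function name are placeholders invented for the example; in a real model the projection would be a trained layer.

```python
import numpy as np

def build_attention_memory(encoder_out: np.ndarray, duration_hidden: np.ndarray,
                           weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Add a linearly projected duration code to each token's encoder output."""
    duration_code = duration_hidden @ weight + bias   # (N, enc_dim)
    return encoder_out + duration_code                # used as attention key and value

rng = np.random.default_rng(0)
num_tokens, enc_dim, dur_dim = 4, 8, 6                # illustrative sizes only
encoder_out = rng.normal(size=(num_tokens, enc_dim))
duration_hidden = rng.normal(size=(num_tokens, dur_dim))
weight = rng.normal(size=(dur_dim, enc_dim)) * 0.1
bias = np.zeros(enc_dim)

memory = build_attention_memory(encoder_out, duration_hidden, weight, bias)
print(memory.shape)  # (4, 8)
```

Adding rather than concatenating keeps the memory dimension unchanged, so the attention module itself needs no modification.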
The relative position embedding is a concatenation of two vectors drawn from learnable lookup tables. One encodes the forward position within a phoneme, i.e. the distance to the beginning of the token. The other encodes the backward position, i.e. the distance to the end of the token. Both distances are capped at a constant maximum. For each acoustic frame, its relative position embedding is concatenated with the prenet output to form the attention query.
Generally speaking, the prenet output only carries information about the preceding spectrogram. Injecting the relative position embedding, and in particular the knowledge of how many frames remain before the current phoneme should end, helps the attention plan ahead.
At the training stage, the forward and backward positions are both derived from forced alignment labels. At the inference stage, the forward position is calculated from the attention weights of the preceding steps, and the backward position is estimated from the predicted duration. To convert the forward/backward distances into learnable vectors, an embedding lookup layer is used, in which separate lookup tables are learned for the forward and backward distances.
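The training-time construction of the query can be sketched as follows, deriving per-frame forward and backward positions from phoneme durations. The cap value `max_pos`, the zero-based counting of distances, the embedding sizes, and the function names are all assumptions of this example; random tables stand in for learned embeddings.

```python
import numpy as np

def relative_positions(durations, max_pos: int):
    """Per-frame distance to the start (forward) and end (backward) of the current token."""
    fwd, bwd = [], []
    for d in durations:
        idx = np.arange(d)
        fwd.append(np.minimum(idx, max_pos))          # frames since the token began
        bwd.append(np.minimum(d - 1 - idx, max_pos))  # frames left before the token ends
    return np.concatenate(fwd), np.concatenate(bwd)

def build_queries(prenet_out, fwd, bwd, fwd_table, bwd_table):
    """Concatenate prenet output with forward and backward position embeddings."""
    return np.concatenate([prenet_out, fwd_table[fwd], bwd_table[bwd]], axis=-1)

fwd, bwd = relative_positions([3, 5], max_pos=2)
print(fwd.tolist())  # [0, 1, 2, 0, 1, 2, 2, 2]
print(bwd.tolist())  # [2, 1, 0, 2, 2, 2, 1, 0]

rng = np.random.default_rng(0)
fwd_table = rng.normal(size=(3, 4))      # one row per capped distance 0..max_pos
bwd_table = rng.normal(size=(3, 4))
prenet_out = rng.normal(size=(8, 16))    # one vector per acoustic frame
queries = build_queries(prenet_out, fwd, bwd, fwd_table, bwd_table)
print(queries.shape)  # (8, 24)
```

At inference the same lookup would be used, but with the forward index taken from the running attention state and the backward index from the predicted duration, as described above.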
3.4 Training Loss
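The overall loss is a weighted sum of four parts:
$$L = \lambda_{mel} L_{mel} + \lambda_{cls} L_{cls} + \lambda_{dur} L_{dur} + \lambda_{att} L_{att}$$
where $L_{mel}$, $L_{cls}$, $L_{dur}$, and $L_{att}$ denote the MSE loss of Mel-spectrogram reconstruction, the cross-entropy (CE) loss of the phoneme classifier, the L1 loss of the duration predictor, and the MSE loss of guided attention, respectively. Their weights $\lambda$ are set empirically.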
Here, the stop token predictor [14] is not used. Instead, the decoder stops once the attention has stayed at the last token for its predicted duration.
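A possible reading of this stopping rule is sketched below. It assumes that "staying at the last token" means the attention peak (argmax) is on the last token for consecutive decoder steps, and that the predicted duration is expressed in frames; both are interpretations made for this example, and `should_stop` is a hypothetical helper.

```python
import numpy as np

def should_stop(attn_history: list, last_token_duration: int) -> bool:
    """Return True once attention has peaked on the last token for enough steps."""
    if not attn_history:
        return False
    last = len(attn_history[0]) - 1
    stayed = 0
    # Count how many of the most recent steps peaked on the last token.
    for alpha in reversed(attn_history):
        if int(np.argmax(alpha)) != last:
            break
        stayed += 1
    return stayed >= last_token_duration

# Attention peaks on tokens 0, 1, 2, 2, 2 over five decoder steps (3 tokens).
history = [np.eye(3)[j] for j in (0, 1, 2, 2, 2)]
print(should_stop(history, last_token_duration=3))  # True
print(should_stop(history, last_token_duration=4))  # False
```

Tying termination to the duration model removes one learned component, and it means a scaled duration for the final token also shifts the end of the utterance.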
4.1 Training Setup
We evaluated the proposed model on an internal corpus recorded by a non-professional female speaker, containing about 10 hours of speech (about 12,000 utterances). The recordings are in native Mandarin Chinese and were resampled to 16 kHz, 16-bit mono waveform format.
A proprietary front-end engine was used to convert input texts into token sequences, which contain phonemes, tones, prosodic boundaries, and silence marks. In addition, a Kaldi-based forced alignment tool [10] was used to obtain phoneme duration labels from the recordings.
Two variants of Tacotron2 are used as baselines. One replaces the attention mechanism in Tacotron2 with a duration-informed length regulator (called TLR), and the other employs stepwise monotonic attention (called TSW). The postnet module is removed because of its limited effectiveness. The reduction factor is set to 1 for better speech quality.
The same pre-trained LPCNet [15] is used as the vocoder to generate audio signals from the predicted Mel-spectrograms.
Table 1: MOS with 95% confidence intervals for the proposed method (PAMA), ground-truth samples (GT), and the two baselines (TLR and TSW). The ground truth is obtained via analysis-synthesis.
4.2 Evaluation Setup
Two objective evaluations were conducted using 1,000 sentences. First, duration consistency was measured to show the duration controllability of the models. It was calculated as the mean absolute error (MAE) between the phoneme durations predicted by the duration predictor and those obtained from a forced aligner. For TSW, the duration $d_i$ of the $i$-th phoneme was estimated from the final attention matrix $A$. Second, the phoneme error rate (PER) given by an automatic speech recognition (ASR) model was adopted as the metric of robustness. The ASR model was based on a TDNN-LSTM structure and trained on nearly 100,000 hours of recordings collected from various Xiaomi mobile phones.
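The duration consistency metric can be illustrated with a small sketch. How durations were read off the attention matrix for TSW is not recoverable from the text, so the estimator below, which sums each token's attention weights over all frames, is an assumption of this example rather than the paper's formula; the function names are likewise hypothetical.

```python
import numpy as np

def durations_from_attention(attention: np.ndarray) -> np.ndarray:
    """Estimate per-token durations (in frames) from a (T, N) attention matrix.

    Assumption: a token's duration is the total attention mass it receives.
    """
    return attention.sum(axis=0)

def duration_mae(predicted, reference) -> float:
    """Mean absolute error between two per-phoneme duration sequences."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean(np.abs(predicted - reference)))

# Five frames attending to token 0 twice and token 1 three times.
A = np.eye(2)[[0, 0, 1, 1, 1]]
estimated = durations_from_attention(A)
print(estimated.tolist())                # [2.0, 3.0]
print(duration_mae(estimated, [3, 3]))   # 0.5
```

For a sharp, nearly one-hot alignment this estimate coincides with counting the frames whose attention peak falls on each token.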
Subjective evaluations were conducted using 30 sentences. They were not included in the training data. The naturalness of the synthetic speech was evaluated through the mean opinion score (MOS) test and AB preference test. 16 native listeners participated in the test, and the speech samples were shuffled in each test.
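For reference, a MOS with a 95% confidence interval is often reported as the sample mean plus or minus a normal-approximation half-width. The sketch below follows that common convention; the text does not state how the intervals were computed, so the use of z = 1.96 and the sample standard deviation is an assumption, and the ratings are made-up numbers for illustration.

```python
import numpy as np

def mos_with_ci(scores, z: float = 1.96):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half_width = z * scores.std(ddof=1) / np.sqrt(len(scores))
    return float(mean), float(half_width)

ratings = [3, 4, 5]  # illustrative listener ratings on a 1-5 scale
mean, half_width = mos_with_ci(ratings)
print(f"{mean:.2f} +/- {half_width:.2f}")
```

With 30 sentences and 16 listeners there would be 480 ratings per system if every listener rated every sentence, so the half-width shrinks considerably compared with this three-rating toy case.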
4.3 Results & Discussion
As shown in Table 1, the proposed model (PAMA) achieves the highest mean opinion score. TLR shows a slightly mechanical rhythm, while TSW has clarity issues in some cases.
Results of the AB preference test shown in Fig. 4 confirm the importance of progression-awareness for attention. If the relative position embedding is not used, the naturalness of the synthetic speech degrades remarkably.
To check the duration controllability of the different systems, MAE and PER are calculated for three duration factors (DF). We find that stepwise monotonic attention is very weak at speech rate control. When the attention score bias is shifted within a small range [-3, 3], the speech rate changes only very slightly. However, if a greater shift is applied, serious word skipping or repeating issues occur frequently. Therefore, only TLR and PAMA are evaluated for duration modification. Table 2 compares the capability of duration control. It shows that PAMA has on-par or even fewer duration errors than TLR, and an overwhelming advantage over TSW. Table 3 compares robustness using an ASR tool: PAMA has far fewer deletion errors than TSW and even outperforms TLR overall.
This paper introduced progression-aware monotonic attention for robust sequence-to-sequence speech synthesis. The proposed model (PAMA-TTS) demonstrates that injecting duration and relative position information into attention can achieve a better balance between the robustness and naturalness of synthetic speech. It also enables accurate control of phoneme duration. Subjective and objective evaluation results show that PAMA-TTS outperforms the attention-based model on robustness and duration controllability, while outperforming the duration-informed model on naturalness. Progression-aware monotonic attention is shown to be feasible for token length control and may be readily extended to other similar applications.
References
- [1] (2020) Location-relative attention mechanisms for robust long-form speech synthesis. In Proc. ICASSP, pp. 6189–6193.
- [2] (2020) End-to-end adversarial text-to-speech. arXiv:2006.03575.
- [3] (2021) Parallel Tacotron 2: a non-autoregressive neural TTS model with differentiable duration modeling. arXiv:2103.14574.
- [4] (2021) Parallel Tacotron: non-autoregressive and controllable TTS. In Proc. ICASSP, pp. 5694–5698.
- [5] (2019) Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural TTS. In Proc. Interspeech, pp. 1293–1297.
- [6] Neural speech synthesis with Transformer network. In Proc. AAAI, Vol. 33, pp. 6706–6713.
- [7] (2020) JDI-T: jointly trained duration informed Transformer for text-to-speech without explicit alignment. arXiv:2005.07799.
- [8] (2021) VAENAR-TTS: variational auto-encoder based non-autoregressive text-to-speech synthesis. In Proc. Interspeech, pp. 3775–3779.
- [9] (2019) Tacotron-based acoustic model using phoneme alignment for practical neural text-to-speech systems. In Proc. ASRU, pp. 7254–7258.
- [10] (2011) The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
- [11] (2020) FastSpeech 2: fast and high-quality end-to-end text to speech. arXiv:2006.04558.
- [12] (2019) FastSpeech: fast, robust and controllable text to speech. In Proc. NeurIPS, pp. 3165–3174.
- [13] (2020) Non-attentive Tacotron: robust and controllable neural TTS synthesis including unsupervised duration modeling. arXiv:2010.04301.
- [14] (2018) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proc. ICASSP, pp. 4779–4783.
- [15] (2019) LPCNet: improving neural speech synthesis through linear prediction. In Proc. ICASSP, pp. 5891–5895.
- [16] (2017) Attention is all you need. In Proc. NIPS, pp. 6000–6010.
- [17] (2017) Tacotron: towards end-to-end speech synthesis. In Proc. Interspeech, pp. 4006–4010.
- [18] (2019) DurIAN: duration informed attention network for multimodal synthesis. arXiv:1909.01700.
- [19] (2020) AlignTTS: efficient feed-forward text-to-speech system without explicit alignment. In Proc. ICASSP, pp. 6714–6718.
- [20] (2018) Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. In Proc. ICASSP, pp. 4789–4793.
- [21] (2019) Pre-alignment guided attention for improving training efficiency and model stability in end-to-end speech synthesis. IEEE Access 7, pp. 65955–65964.