Corresponding author email: email@example.com. Paper submitted to IEEE ICASSP 2020
Recent advances in TTS have improved the naturalness of synthetic speech to near human-like levels [1, 2, 3, 4]. This means that for simple sentences, or for situations in which we can correctly predict the most appropriate prosodic representation, TTS systems provide speech practically indistinguishable from that of humans.
One aspect that most systems still lack is the natural variability of human speech, which has been observed as one of the reasons why the cognitive load of synthetic speech is higher than that of human speech. This is something that variational models such as those based on Variational Auto-Encoding (VAE) [4, 6] attempt to solve by exploiting the sampling capabilities of the acoustic embedding space at inference time.
Despite the advantages that VAE-based inference brings, it suffers from the limitation that, to synthesize a sample, one has to select an appropriate acoustic embedding for it, which can be challenging. A possible solution is to remove the selection process and consistently use a centroid to represent speech. This provides reliable acoustic representations but suffers again from the monotonicity problem of conventional TTS. Another approach is to randomly sample the acoustic space. This would certainly solve the monotonicity problem if the acoustic embeddings were varied enough. It can, however, introduce erratic prosodic representations of longer texts, which can prove to be worse than being monotonous. Finally, one can consider text-based selection or prediction, as done in this research.
In the traditional Natural Language Processing (NLP) pipeline, constituency parsing produces full syntactic trees. More recent approaches based on Contextual Word Embeddings (CWE) suggest that CWE are largely able to implicitly represent the classic NLP pipeline while still retaining the ability to model lexical semantics. Thus, in this work we explore how TTS systems can enhance the quality of speech synthesis by using such linguistic features to guide the prosodic contour of generated speech.
Relevant recent work exploring the advantages of exploiting syntactic information for TTS can be seen in [11, 12]. Those studies, without any explicit acoustic pairing to the linguistic information, inject a number of curated features, concatenated to the phonetic sequence, as a way of informing the TTS system. The present study instead uses the linguistic information to drive the acoustic embedding selection, rather than as additional model features.
Linguistics has also been explored as a way of predicting adequate acoustic embeddings, with the system informed by a set of linguistic and semantic features. The main difference of the present work is that, rather than predicting a point in a high-dimensional space from sparse input information (a challenging task that is potentially vulnerable to training-domain dependencies), we use the linguistic information to retrieve the most similar embedding in our training set, reducing the complexity of the task significantly.
The main contributions of this work are: i) we propose a novel approach of embedding selection in the acoustic space by using linguistic features; ii) we demonstrate that including syntactic information-driven acoustic embedding selection improves the overall speech quality, including its prosody; iii) we compare the improvements achieved by exploiting syntactic information in contrast with those brought by CWE; iv) we demonstrate that the approach improves the TTS quality in LFR experience as well.
2 Proposed Systems
CWE seem the obvious choice to drive embedding selection, as they contain both syntactic and semantic information. A possible drawback of relying on CWE, however, is that the linguistic-acoustic mapping space is sparse, so the generalization capability of such systems in unseen scenarios will be poor. Also, since CWE model lexical semantics, two semantically similar sentences are likely to have similar CWE representations. This does not necessarily correspond to a similarity in prosody, as the structure of the two sentences can be very different.
We hypothesize that, in some scenarios, syntax will generalize better than semantics and that CWE have not been optimally exploited for driving prosody in speech synthesis. We explore these two hypotheses in our experiments. The objective of this work is to exploit the sentence-level prosody variations available in the training dataset when synthesizing speech for a test sentence. The proposed approach executes the following steps: (i) generate suitable vector representations containing linguistic information for all sentences in the train and test sets; (ii) measure the similarity of the test sentence to each sentence in the train set, using cosine similarity between the vector representations to evaluate linguistic similarity; (iii) choose the acoustic embedding of the train sentence with the highest similarity to the test sentence; (iv) synthesize speech via VAE-based inference using this acoustic embedding.
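The selection loop above can be sketched in a few lines. This is a minimal illustration with toy vectors; all names and values here are ours, not taken from the paper:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_acoustic_embedding(test_vec, train_vecs, train_embs):
    # Steps (ii)-(iii): score every training sentence against the test
    # sentence and return the acoustic embedding of the best match.
    sims = [cosine_similarity(test_vec, v) for v in train_vecs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return train_embs[best], best

# Toy data: 3 training sentences with 4-dim linguistic vectors and
# 2-dim stand-in VAE acoustic embeddings.
train_vecs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
train_embs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
emb, idx = select_acoustic_embedding([1.0, 1.0, 0.0, 0.0], train_vecs, train_embs)
# idx == 2: the third training sentence is the closest linguistic match
```

The same skeleton is reused for all three linguistic representations; only the vectors fed into the similarity computation change.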
We experiment with three different systems for generating vector representations of the sentences, which allow us to explore the impact of both syntax and semantics on the overall quality of speech synthesis. The representations from the first system use syntactic information only, the second relies solely on CWE, while the third uses a combination of CWE and explicit syntactic information.
2.1.1 Syntactic
Syntactic representations of sentences, such as constituency parse trees, need to be transformed into vectors in order to be usable in neural TTS models. Some dimensions describing the tree can be transformed into word-based categorical features, such as the identity of the parent and the position of a word within a phrase.
The syntactic distance between adjacent words is known to be a prosodically relevant numerical source of information that is easily extracted from the constituency tree: if many nodes must be traversed to find the first common ancestor of two adjacent words, the syntactic distance between them is high. Large syntactic distances correlate with acoustically relevant events such as phrasing breaks or prosodic resets.
To compute syntactic distance vector representations for sentences, we use the algorithm described in prior work: for a sentence of n tokens, there are n corresponding distances, which are concatenated into a vector of length n. The distance between the start of the sentence and the first token is always 0.
We can see an example in Fig. 1: for the sentence "The brown fox is quick and it is jumping over the lazy dog", the distance vector is d = [0, 2, 1, 3, 1, 8, 7, 6, 5, 4, 3, 2, 1]. The completion of the subject noun phrase (after 'fox') triggers a prosodic reset, reflected in the distance of 3 between 'fox' and 'is'. There should also be a more emphasized reset at the end of the first clause, represented by the distance of 8 between 'quick' and 'and'.
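As a rough sketch of how such a vector can be derived from a tree, the function below takes a constituency tree encoded as nested tuples and uses the height of the lowest common ancestor as the distance between adjacent words. This is a simplification of the cited algorithm (the absolute values differ from the Fig. 1 example), but it preserves the property that deeper prosodic resets yield larger distances; the nested-tuple input format is our own assumption:

```python
def distance_profile(tree):
    # Returns (boundary_distances, height) for a constituency tree
    # given as nested tuples with string leaves.
    if isinstance(tree, str):
        return [], 1  # a single token has no internal word boundary
    children = [distance_profile(c) for c in tree]
    height = 1 + max(h for _, h in children)
    dists = children[0][0]
    for child_dists, _ in children[1:]:
        # The boundary between consecutive children is governed by this
        # node, so its height becomes the distance at that boundary.
        dists = dists + [height] + child_dists
    return dists, height

def syntactic_distance_vector(tree):
    # Prepend 0 for the start-of-sentence boundary, as in the paper.
    dists, _ = distance_profile(tree)
    return [0] + dists
```

For a toy tree `(("the", "dog"), ("ran", "fast"))` this yields `[0, 2, 3, 2]`: the largest distance falls on the clause-internal boundary governed by the root, mirroring the prosodic-reset behavior described above.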
2.1.2 BERT
To generate CWE we use BERT, as it is one of the best-performing pre-trained models, with state-of-the-art results on a large number of NLP tasks. BERT has also been shown to generate strong representations of both syntax and semantics. We use the word representations from the uncased base (12-layer) model without fine-tuning. Sentence-level representations are obtained by averaging the second-to-last hidden layer over each token in the sentence. These embeddings are used to drive acoustic embedding selection.
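The pooling step can be sketched as follows. The `bert_sentence_embedding` helper illustrates how this could look with the Hugging Face `transformers` package; that toolchain (and the `output_hidden_states` flag) is our assumption, as the paper does not specify an implementation. `mean_pool` is the plain averaging operation itself:

```python
def mean_pool(token_vectors):
    # Average a list of per-token vectors (num_tokens x hidden_dim)
    # into a single fixed-size sentence vector.
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[j] for vec in token_vectors) / n for j in range(dim)]

def bert_sentence_embedding(text):
    # Illustration only: requires `torch`, `transformers` and a model
    # download, so it is defined here but not executed.
    import torch
    from transformers import BertModel, BertTokenizer
    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased",
                                      output_hidden_states=True)
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    # out.hidden_states holds 13 tensors (embedding layer + 12 layers);
    # index -2 is the second-to-last hidden layer, as used in the paper.
    token_states = out.hidden_states[-2][0].tolist()
    return mean_pool(token_states)
```

The resulting fixed-size vectors can then be compared with cosine similarity, exactly as with the syntactic distance vectors.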
2.1.3 BERT Syntactic
Even though BERT embeddings capture some aspects of syntactic information along with semantics, we decided to experiment with a system combining the information captured by the two systems described above. The information from syntactic distances and BERT embeddings cannot be combined at token level into a single vector representation, since the two systems use different tokenization algorithms: tokenization in BERT is based on the WordPiece algorithm, which eliminates out-of-vocabulary issues, whereas the tokenization used to generate parse trees is based on morphological considerations rooted in linguistic theory. At inference time, we therefore average the similarity scores obtained by comparing the BERT embeddings and the syntactic distance vectors.
2.2 Applications to LFR
The approaches described in Section 2.1 produce utterances with more varied prosody compared to the long-term monotonicity of those obtained via centroid-based VAE inference. However, when considering multi-sentence texts, we have to be mindful of the issues that can be introduced by erratic transitions. We tackle this by minimizing the acoustic variation a sentence can have with respect to the previous one, while still minimizing the linguistic distance. As a measure of acoustic variation we use the Euclidean distance between the 2D Principal Component Analysis (PCA) projections of the acoustic embeddings, as we observe that the projected space is acoustically relevant and distances in it are easily obtained. Doing the same in the 64-dimensional VAE space did not perform as intended, likely because of the non-linear manifold underlying our system, in which distances are not linear. As a result, a sentence may be the closest linguistic match in terms of syntactic distance or CWE, but it will still not be selected if its acoustic embedding is far from that of the previous sentence.
We modify the similarity metric used for choosing the closest match from the train set by adding a weighted cost to account for acoustic variation. This approach considers only the sentence transitions within a paragraph, rather than optimizing the entire acoustic embedding path. It works as follows: (i) define the weights for linguistic similarity and acoustic similarity (in this work, the two weights sum to 1); (ii) minimize the following loss, considering the acoustic embedding chosen for the previous sentence in the paragraph:
Loss = LSW * (1 - LS) + (1 - LSW) * D,
where LSW is the linguistic similarity weight, LS is the linguistic similarity between the test and train sentences, and D is the Euclidean distance between the acoustic embedding of the train sentence and the acoustic embedding chosen for the previous sentence.
We fix D = 0 for the first sentence of every paragraph. This approach is therefore most suitable when the first sentence is a carrier sentence, i.e. one that uses a structural template, which is particularly the case for news stories such as those considered in this research.
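Under these definitions, the greedy left-to-right selection over a paragraph can be sketched as follows. The toy vectors and all names are ours; LS is cosine similarity over linguistic vectors and D the Euclidean distance in the projected acoustic space:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_for_paragraph(sentence_vecs, train_vecs, train_embs_2d, lsw=0.9):
    # Greedy per-sentence selection minimizing
    #   Loss = LSW * (1 - LS) + (1 - LSW) * D,
    # with D = 0 for the first sentence of the paragraph.
    chosen, prev_emb = [], None
    for vec in sentence_vecs:
        best_idx, best_loss = None, float("inf")
        for i, train_vec in enumerate(train_vecs):
            ls = cosine_similarity(vec, train_vec)
            d = 0.0 if prev_emb is None else euclidean(train_embs_2d[i], prev_emb)
            loss = lsw * (1.0 - ls) + (1.0 - lsw) * d
            if loss < best_loss:
                best_idx, best_loss = i, loss
        prev_emb = train_embs_2d[best_idx]
        chosen.append(best_idx)
    return chosen
```

With LSW = 1.0 this reduces to pure linguistic matching; lowering LSW penalizes candidates whose acoustic embeddings sit far from the previous sentence's choice, smoothing the transitions at the cost of some linguistic similarity.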
The distances between the chosen acoustic embeddings for a sample paragraph, and the effect of varying the weights, are depicted in the matrices in Fig. 2. These are symmetric matrices in which each row and column represents the sentence at index i in the paragraph, and each cell gives the Euclidean distance between the acoustic embeddings chosen for sentences at indices i and j. In (a), the sentence at index 4 stands out as the most acoustically dissimilar from the rest of the paragraph, and the overall acoustic distance between sentences is much higher in (a) than in (b). As we are particularly concerned with transitions from the previous to the current sentence, we focus on cells (i, i-1) of each row. In (a), the sentences at indices 4 and 5 stand out as potential erratic transitions due to the high values in cells (4,3) and (5,4). In (b), the distances are significantly reduced, and sentence transitions are thus expected to be smooth.
As LSW decreases, the transitions become smoother. This is not 'free': there is a trade-off, as increasing transition smoothness decreases the linguistic similarity, which in turn reduces the prosodic divergence. Fig. 3 shows this trade-off across the test set when using syntactic distance to evaluate LS; ideally, both the linguistic distance (i.e. 1 - LS) and the acoustic distance should be low.
The plot shows a sharp decrease in acoustic distance between LSW values of 1.0 and 0.9, after which the reduction slows down, while the linguistic distance grows linearly. We informally evaluated the systems by reducing LSW from 1.0 to 0.7 with a step size of 0.05, looking for an optimal balance. At LSW = 0.9, the first elbow of the acoustic distance curve, there was a significant decrease in the perceived erraticness, so we chose this value for our LFR evaluations.
3 Experimental Protocol
The research questions we attempt to answer are:
Q1: Can linguistics-driven selection of acoustic embeddings from the existing dataset lead to improved prosody and naturalness when synthesizing speech?
Q2: How does syntactic selection compare with CWE selection?
Q3: Does this approach also improve the LFR experience?
To answer these questions, we used in our experiments the systems, data and subjective evaluations described below.
3.1 Text-to-Speech System
The evaluated TTS system is a Tacotron-like system already verified for the newscaster domain. A schematic description can be seen in Fig. 4, and a detailed explanation of the baseline system and the training data can be found in [22, 23]. The produced spectrograms are converted to waveforms using a universal WaveRNN-like model.
For this study, we consider an improved system that replaces the one-hot style modeling approach with a VAE-based reference encoder similar to [6, 4], in which the VAE embedding represents an acoustic encoding of a speech signal, allowing us to drive the prosodic representation of the synthesized text. The embedding used at inference time is selected with the approaches introduced in Sections 2.1 and 2.2. The dimension of the embedding is set to 64, as this allows for the best convergence without collapsing the KLD loss during training.
3.2 Datasets
3.2.1 Training Dataset
(i) TTS system dataset: We trained our TTS system on a mixture of neutral and newscaster-style speech, for a total of 24 hours of training data, split into 20 hours of neutral speech (22,000 utterances) and 4 hours of newscaster-style speech (3,000 utterances).
(ii) Embedding selection dataset: As the evaluation was carried out only on the newscaster speaking style, we restrict our linguistic search space to the 3,000 utterances associated with the newscaster style.
3.2.2 Evaluation Dataset
The systems were evaluated on two datasets:
(i) Common Prosody Errors (CPE): The dataset on which the baseline Prostron model fails to generate appropriate prosody. It consists of complex utterances containing compound nouns (22%), "or" questions (9%), and "wh" questions (18%), further enhanced by sourcing additional complex utterances (51%).
(ii) LFR: As demonstrated in previous work, evaluating sentences in isolation does not suffice when assessing the quality of long-form speech. For the LFR evaluations we therefore curated a dataset of news samples, concatenating the news-style sentences into full news stories to capture the overall experience of our intended use case.
3.3 Subjective evaluation
Our tests are based on MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA), but without forcing a system to be rated as 100 and without always including a top anchor. All of our listeners, regardless of linguistic knowledge, were native US English speakers.
For the CPE dataset, we carried out two tests. The first was run with 10 linguistic experts as listeners, who were asked to rate the appropriateness of the prosody, ignoring the speaking style, on a scale from 0 (very inappropriate) to 100 (very appropriate). The second was run with 10 crowd-sourced listeners, who evaluated the naturalness of the speech from 0 to 100. In both tests, each listener rated 28 different screens with 4 randomly ordered samples per screen, for a total of 112 samples. The 4 systems were the 3 proposed ones plus the centroid-based VAE inference as the baseline.
For the LFR dataset, we conducted only a crowd-sourced evaluation in which listeners were asked to assess the suitability of the newscaster style on a scale from 0 (completely unsuitable) to 100 (completely adequate). Each listener was presented with 51 news stories, each synthesized by one of 5 systems: the original recordings as a top anchor, the centroid-based VAE as the baseline, and the 3 proposed linguistics-driven embedding selection systems.
4 Results
Table 1 reports the average MUSHRA scores for prosody and naturalness for each of the test systems on the CPE dataset. These results answer Q1, as the proposed approach improves significantly over the baseline on both counts, giving us evidence that linguistics-driven acoustic embedding selection can significantly improve speech quality. We also observe that better prosody does not directly translate into improved naturalness, and that acoustic modeling needs to improve in order to better reflect the prosodic gains.
We validate the differences between MUSHRA scores using pairwise t-tests. All proposed systems improved significantly over the baseline in prosody (p < 0.01). For naturalness, BERT Syntactic performed best, improving significantly over the baseline (p = 0.04); the other systems did not give a statistically significant improvement over the baseline (p > 0.05). The difference between BERT and BERT Syntactic is also statistically insignificant.
Q2 is explored in Table 2, which gives a breakdown of the prosody results by the major categories in CPE. For 'wh' questions, Syntactic alone brings an improvement of 4%, and BERT Syntactic performs best, improving 8% over the baseline. This suggests that 'wh' questions generally share a closely related syntactic structure, which can be exploited to achieve better prosody. This intuition is further strengthened by the improvements observed for 'or' questions: Syntactic alone improves 9% over the baseline, and BERT Syntactic performs best, improving 21%. The improvement for 'or' questions is greater than for 'wh' questions, as most 'or' questions share a syntactic structure unique to them that is consistent across samples in the category. For both categories, Syntactic, BERT and BERT Syntactic show incremental improvements: the first system contains only syntactic information, the second captures some aspects of syntax along with semantics, and the third enhances the CWE representation with an explicit representation of syntax to drive selection. It is thus evident that the extent of syntactic information captured drives the synthesis quality for these two categories.
Compound nouns proved harder to improve upon than questions. BERT performed best in this category, with a 1.2% improvement over the baseline. We attribute this to BERT's capability to capture context, which Syntactic does not. This plays a critical role in compound nouns, where achieving suitable prosody requires understanding the context in which the nouns are used. For the remaining complex sentences, BERT also performed best, improving over the baseline by 6%; again, most of these sentences require contextual knowledge. Although Syntactic does improve over the baseline, syntax does not appear to be the driving factor, as BERT Syntactic performs slightly worse than BERT. This indicates that enhancing the syntax representation hinders BERT from fully leveraging the contextual knowledge it captures to drive embedding selection.
Q3 is answered in Table 3, which reports the MUSHRA scores on the LFR dataset. The Syntactic system performed best, with high statistical significance over the baseline (p = 0.02), closing the gap between the baseline and the recordings by almost 20%. The other systems show statistically insignificant improvements over the baseline (p > 0.05). Achieving suitable prosody in LFR requires longer-distance dependencies and knowledge of prosodic groups, information that the Syntactic system appears to approximate more effectively than the CWE-based systems. However, this remains a topic for future exploration, as the difference between BERT and Syntactic is statistically insignificant (p = 0.6).
5 Conclusions
Current VAE-based TTS systems are susceptible to monotonous speech generation due to the need to select a suitable acoustic embedding for synthesis. In this work, we proposed to generate dynamic prosody from the same TTS systems by using linguistics to drive acoustic embedding selection. Our approach improves overall speech quality, including prosody and naturalness. We proposed 3 techniques (Syntactic, BERT and BERT Syntactic) and evaluated their performance on 2 datasets: common prosodic errors and LFR. The Syntactic system improved significantly over the baseline on almost all measures (except naturalness on CPE). The information captured by BERT further improved prosody in cases where contextual knowledge was required. For LFR, we bridged the gap between the baseline and actual recordings by 20%. The approach can be further extended by making the model aware of these features, rather than using them only to drive embedding selection.
-  Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, et al., “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
-  Ryan Prenger, Rafael Valle, and Bryan Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.
-  Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, and Vatsal Aggarwal, “Towards achieving robust universal neural vocoding,” in Proc. Interspeech 2019, 09 2019, pp. 181–185.
-  Wei-Ning Hsu, Yu Zhang, Ron J Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, et al., “Hierarchical generative modeling for controllable speech synthesis,” arXiv preprint arXiv:1810.07217, 2018.
-  Avashna Govender and Simon King, “Using pupillometry to measure the cognitive load of synthetic speech,” System, vol. 50, pp. 100.
-  Kei Akuzawa, Yusuke Iwasawa, and Yutaka Matsuo, “Expressive speech synthesis via modeling expressions with variational autoencoder,” in Proc. Interspeech 2018, 2018, pp. 3067–3071.
-  Arne Köhn, Timo Baumann, and Oskar Dörfler, “An empirical analysis of the correlation of syntax and prosody,” in Proc. Interspeech 2018, 2018, pp. 2157–2161.
-  Michael Wagner and Duane G. Watson, “Experimental and theoretical advances in prosody: A review,” Language and Cognitive Processes, vol. 25, no. 7-9, pp. 905–945, 2010, PMID: 22096264.
-  Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick, “What do you learn from context? probing for sentence structure in contextualized word representations,” in International Conference on Learning Representations, 2019.
-  Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds., pp. 3111–3119. 2013.
-  Haohan Guo, Frank K. Soong, Lei He, and Lei Xie, “Exploiting syntactic features in a parsed tree to improve end-to-end tts,” in Proc. Interspeech 2019, 2019, pp. 4460–4464.
-  Adèle Aubin, Alessandra Cervone, Oliver Watts, and Simon King, “Improving speech synthesis with discourse relations,” in Proc. Interspeech 2019, 2019, pp. 4470–4474.
-  Daisy Stanton, Yuxuan Wang, and RJ Skerry-Ryan, “Predicting expressive speaking style from text in end-to-end speech synthesis,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 595–602.
-  Sabrina Stehwien, Ngoc Thang Vu, and Antje Schweitzer, “Effects of word embeddings on neural network-based pitch accent detection,” in Proc. 9th International Conference on Speech Prosody 2018, 2018, pp. 719–723.
-  Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia, “SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation,” in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, Aug. 2017, pp. 1–14, Association for Computational Linguistics.
-  Rasmus Dall, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda, “Redefining the linguistic context feature set for hmm and dnn tts through position and parsing,” in INTERSPEECH, 2016.
-  Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville, “Neural language modeling by jointly learning syntax and lexicon,” in International Conference on Learning Representations, 2018.
-  Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron Courville, and Yoshua Bengio, “Straight to the tree: Constituency parsing with neural syntactic distance,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, July 2018, pp. 1171–1180, Association for Computational Linguistics.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, June 2019, pp. 4171–4186, Association for Computational Linguistics.
-  Mike Schuster and Kaisuke Nakajima, “Japanese and Korean voice search,” in International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 5149–5152.
-  Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” 2017.
-  Nishant Prateek, Mateusz Lajszczak, Roberto Barra-Chicote, Thomas Drugman, Jaime Lorenzo-Trueba, Thomas Merritt, Srikanth Ronanki, and Trevor Wood, “In other news: A bi-style text-to-speech model for synthesizing newscaster voice with limited data,” NAACL HLT 2019, pp. 205–213, 2019.
-  Javier Latorre, Jakub Lachowicz, Jaime Lorenzo-Trueba, Thomas Merritt, Thomas Drugman, Srikanth Ronanki, and Viacheslav Klimkov, “Effect of data reduction on sequence-to-sequence neural tts,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7075–7079.
-  Zack Hodari, Oliver Watts, and Simon King, “Using generative modelling to produce varied intonation for speech synthesis,” arXiv preprint arXiv:1906.04233, 2019.
-  Will Coster and David Kauchak, “Learning to simplify sentences using Wikipedia,” in Proceedings of the Workshop on Monolingual Text-To-Text Generation, Portland, Oregon, June 2011, pp. 1–9, Association for Computational Linguistics.
-  Rob Clark, Hanna Silen, Tom Kenter, and Ralph Leith, “Evaluating long-form text-to-speech: Comparing the ratings of sentences and paragraphs,” 2019.
-  ITU-R Recommendation BS.1534-1, “Method for the subjective assessment of intermediate sound quality (MUSHRA),” ITU, 2001.