With the advent of deep learning, neural TTS has shown many advantages over conventional TTS techniques [tokuda2013speech, ze2013statistical, liu2017mongolian]. For example, encoder-decoder architectures with attention mechanism, such as Tacotron [wang2017tacotron, shen2018natural, liu2020wavetts, lee2019robust], have consistently achieved high voice quality. The key idea is to integrate the conventional TTS pipeline [hunt1996unit, tokuda2002hmm] into a unified framework that learns a sequence-to-sequence mapping from text to a sequence of acoustic features [lee2019robust, chung2019semi, He2019, Luong2019, liu2019teacher]. Furthermore, together with a neural vocoder [hayashi2017investigation, shen2018natural, chen2018high, Okamoto2019, berrak_is18, berrak-journal, sisman2018adaptive], neural TTS generates natural-sounding and human-like speech, achieving state-of-the-art performance. Despite this progress, the expressiveness of the synthesized speech remains to be improved.
Speech conveys information not only through its phonetic content, but also through its prosody. Speech prosody can affect the syntactic and semantic interpretation of an utterance [hirschberg2004pragmatics], which is called linguistic prosody. Speech prosody is also used to display one's emotional state, which is referred to as affective prosody. Both linguistic prosody and affective prosody are manifested over a segment of speech beyond the short-time speech frame. Linguistically, speech prosody in general refers to stress, intonation, and rhythm in spoken words, phrases, and sentences. As speech prosody is the result of the interplay of multiple speech properties, it is not easy to define speech prosody by a simple labeling scheme [Luong, lin2019investigation, hodariusing, zhao2020improved, hodari2020perception]. Even if a labeling scheme is possible [Silverman1992TOBIAS, taylor1998assigning], a set of discrete labels may not be sufficient to describe the entire continuum of speech prosody.
Besides naturalness, one of the factors that differentiates human speech from today's synthesized speech is its expressiveness. Prosody is one of the defining features of expressiveness that makes speech lively. Several recent studies successfully improve the expressiveness of the Tacotron TTS framework [wang2018style, Stanton2018Predicting, skerry2018towards, sun2020fully, sun2020generating]. The idea is to learn a latent prosody embedding, i.e. a style token, from training data [wang2018style]. At run-time, the style token can be used to predict the speech style from text [Stanton2018Predicting], or to transfer the speech style from a reference utterance to a target [skerry2018towards]. It is observed that such speech styling is effective and consistently improves speech quality. Sun et al. [sun2020fully, sun2020generating] further study a hierarchical, fine-grained and interpretable latent variable model for prosody rendering. These studies show that precise control of prosody style improves prosody expressiveness in the Tacotron TTS framework. However, several issues have hindered the effectiveness of the above prosody modeling techniques.
First, the latent embedding space of prosody is learnt in an unsupervised manner, where style is defined as anything but speaker identity and phonetic content in speech. We note that many different styles co-exist in speech. Some are speaker dependent, such as accent and idiolect; others are speaker independent, such as prosodic phrasing, lexical stress and prosodic stress. There is no guarantee that such a latent embedding space of style represents only the intended prosody. Second, while these techniques do not require prosody annotations on training data, they require a reference speech or a manual selection of style token [wang2018style] in order to explicitly control the style of the output speech during run-time inference. While it is possible to automate the style token selection [Stanton2018Predicting], a correct prediction of the style token is subject to both the design of the style token dictionary and the run-time style token prediction algorithm. Third, the style token dictionary in Tacotron is trained on a collection of speech utterances to represent a large range of acoustic expressiveness for a speaker or an audiobook [wang2018style]. It is not intended to provide differential prosodic details at the phrase or utterance level. It is desirable for a Tacotron system to learn to automate prosody styling in response to the input text at run-time, which is the focus of this paper.
To address the above issues, we believe that Tacotron training should minimize a frame-level reconstruction loss [wang2017tacotron, shen2018natural] and an utterance-level perceptual loss at the same time. Perceptual loss was first proposed for image stylization and synthesis [dosovitskiy2016generating, johnson2016perceptual, chen2017photographic, 9052944], where feature activation patterns, or deep features, derived from pre-trained auxiliary networks are used to optimize the perceptual quality of the output image. Several computational models have been proposed to approximate human perception of audio quality, such as Perceptual Evaluation of Audio Quality (PEAQ) [thiede2000peaq], Perceptual Evaluation of Speech Quality (PESQ) [rix2001perceptual], and Perceptual Evaluation of Audio methods for Source Separation (PEASS) [emiya2010peass]. However, such models are not differentiable, hence cannot be directly employed during TTS training. We believe that an utterance-level perceptual loss based on deep features that reflect global speech style would be useful to improve overall speech quality.
We are motivated to study a novel training strategy for TTS systems that learns to associate prosody styles with input text implicitly. We would like to avoid the use of prosody annotations. We do not attempt to model prosody explicitly either, but rather learn the association between prosody styles and input text using an existing neural TTS system, such as Tacotron. As the strategy is only involved during training, it does not change the run-time inference process of the neural TTS system. At run-time, we require neither a reference signal nor a manual selection of prosody style.
The main contributions of this paper include: 1) we propose a novel training strategy for Tacotron TTS that effectively models both spectrum and prosody generation; 2) we propose to supervise the training of Tacotron with a fully differentiable perceptual loss, derived from a pre-trained auxiliary network, in addition to the frame reconstruction loss; 3) we successfully implement a system that requires neither reference speech nor a manual selection of prosody style at run-time; and 4) we successfully validate the proposed perceptual loss by showing consistent speech quality improvement. To the best of our knowledge, this is the first study to incorporate perceptual loss into Tacotron training for improved expressiveness.
This paper is organized as follows: In Section II, we present the research background and related work to motivate our study. In Section III, we propose a novel training strategy for TTS system with frame and style reconstruction loss. In Section IV, we report the subjective and objective evaluations. Section V concludes the discussion.
II Background and Related Work
This work is built on several previous studies on neural TTS, prosody modeling, perceptual loss, and speech emotion recognition. Here we briefly summarize the related previous work to set the stage for our study, and to place our novel contributions in a proper context.
II-A Tacotron2-based TTS
In this paper, we adopt the Tacotron2-based [shen2018natural] TTS model as a reference baseline, also referred to as the Tacotron baseline for brevity. For rapid turn-around, we use Griffin-Lim [griffin1984signal] waveform reconstruction instead of a WaveNet vocoder in this study. We note that the choice of waveform generation technique does not affect our judgment and performance comparison.
The overall architecture of the reference baseline includes an encoder, an attention-based decoder, and the Griffin-Lim algorithm, as illustrated in Fig. 1. The encoder consists of two components: a convolutional neural network (CNN) module [krizhevsky2012imagenet, ak2018learning] with 3 convolutional layers, and a bidirectional LSTM (BLSTM) [emir2019semantically] layer. The decoder consists of four components: a 2-layer pre-net, 2 LSTM layers, a linear projection layer and a 5-convolution-layer post-net. The decoder is a standard autoregressive recurrent neural network that generates mel-spectrum features and stop tokens frame by frame.
Just like other TTS systems, the Tacotron [wang2017tacotron, shen2018natural] TTS system predicts mel-spectrum features from an input sequence of characters by minimizing a frame-level reconstruction loss. Such a frame-level objective function focuses on the distance between spectral features; it does not seek to optimize perceptual quality at the utterance level. To improve suprasegmental expressiveness, there have been studies [yasuda2019investigation, Stanton2018Predicting, sun2020generating] on latent prosody representations, which make prosody styling possible in the Tacotron TTS framework. However, most of these studies rely on the style token mechanism to explicitly model prosody. Simply speaking, they build a Tacotron TTS system that synthesizes speech and learns global style tokens (GST) at the same time. At run-time inference, they apply the style tokens to control the expressive effect [skerry2018towards, wang2018style], which is referred to as the GST-Tacotron paradigm.
In this paper, we advocate a new way of addressing the expressiveness issue by integrating a perceptual quality motivated objective function into the training process, in addition to the frame level reconstruction loss function. We no longer require any dedicated prosody control mechanism during run-time inference, such as style tokens in Tacotron system.
II-B Prosody Modeling in TTS
Prosody conveys linguistic, para-linguistic and various types of non-linguistic information, such as speaker identity, intention, attitude and mood [f0_cwt_dct, Ann]. It is inherently supra-segmental [robertt, tokuda2013speech], as prosody patterns cannot be derived solely from short-time segments [prosody3]. Prosody is hierarchical in nature [berrak2, prosody3, prosody4, WuProsody] and affected by long-term dependencies at different levels, such as the word, phrase and utterance levels [Sanchez2014]. Studies on hierarchical modeling of F0 in speech synthesis [Vainio2013, tokuda2013speech, Suni2013] suggest that utterance-level prosody modeling is more effective. Similar studies, such as those using the continuous wavelet transform, can be found in many speech synthesis related applications [Ming2016a, Sanchez2014, Luo2017a, Luo2017, Ming2016]. In this paper, we study a novel technique that observes utterance-level prosody quality during Tacotron training to achieve expressive synthesis.
The early studies of modeling speaking styles were carried out with Hidden Markov Models (HMM) [tokuda2002hmm, yamagishi2003modeling], where one can synthesize speech with an intermediate speaking style between two speakers through model interpolation [tachibana2004hmm]. To improve the HMM-based TTS model, there have been studies that incorporate unsupervised expression cluster information during training [eyben2012unsupervised]. Deep learning opens up many possibilities for expressive speech synthesis, where speaker, gender, and age codes can be used as control vectors to change the TTS output in different ways [luong2017adapting]. Style tokens, or prosody embeddings, represent one type of such control vectors, derived from a representation learning network. The success of prosody embedding motivates us to further develop the idea.
The Tacotron TTS framework has achieved remarkable performance in terms of spectral feature generation. With a large training corpus, it may be able to generate natural prosody and expression by memorizing the training data with a large number of network parameters. However, its training process does not aim to optimize the system for expressive prosody rendering. As a result, a Tacotron TTS system tends to generate speech outputs that represent the model average, rather than the intended prosody.
The idea of global style tokens [wang2018style, Stanton2018Predicting] represents a success in controlling prosody style of Tacotron output. Style tokens learn to represent high level styles, such as speaker style, pitch range, and speaking rate across a collection of utterances or a speech database. We argue that they neither necessarily represent the useful styles to describe the continuum of prosodic expressions [kenter2019chive], nor provide the dynamic and differential prosodic details with the right level of granularity at utterance level. Sun et al. [sun2020fully, sun2020generating] study a way to include a hierarchical, fine-grained prosody representation, that represents the recent attempts to address the problems in GST-Tacotron paradigm.
We would like to address three issues in the existing prosody modeling in the Tacotron framework: 1) the lack of prosodic supervision during training; 2) the limitation of explicit prosody modeling, such as style tokens, in describing the continuum of prosodic expressions; and 3) the lack of dynamic and differential prosody at the utterance level.
II-C Perceptual Loss for Style Reconstruction
It is noted that the frame-level reconstruction loss, denoted as frame reconstruction loss in short, is not always consistent with human perception, because it does not take into account human sensitivity to temporal and spectral information, such as prosody and the temporal structure of the utterance. For example, if one asks the same question twice, the two utterances may be perceptually similar, yet measure as very different under frame-level losses.
Perceptual loss refers to a training loss derived from a pre-trained auxiliary network [johnson2016perceptual]. The auxiliary network is usually trained on a different task that provides perceptual quality evaluation of an input at a higher level than a speech frame. The intermediate feature representations generated by the auxiliary network, in the form of hidden layer activations, are usually referred to as deep features. They are used as a high-level abstraction to measure the training loss between reconstructed signals and reference signals. Such a training loss is also called deep feature loss [9053110, 9054578].
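As a toy illustration of a deep feature loss, the sketch below measures the distance between hidden-layer activations of a small, randomly initialized stand-in "auxiliary network"; the network, its weights and all names here are hypothetical, for illustration only, and not taken from any of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained auxiliary network: a single hidden layer
# whose activations play the role of "deep features".
W = rng.standard_normal((40, 16))

def deep_features(x):
    """Hidden-layer activations used as deep features."""
    return np.tanh(x @ W)

def deep_feature_loss(y_ref, y_gen):
    """Mean squared distance between deep features of two signals."""
    return float(np.mean((deep_features(y_ref) - deep_features(y_gen)) ** 2))

y_ref = rng.standard_normal(40)                 # reference feature frame
y_gen = y_ref + 0.1 * rng.standard_normal(40)   # slightly perturbed output

assert deep_feature_loss(y_ref, y_ref) == 0.0   # identical inputs: zero loss
assert deep_feature_loss(y_ref, y_gen) > 0.0    # perturbed input: positive loss
```

In a real deep feature loss, the auxiliary network would be pre-trained on a perceptually relevant task and the distance would be computed over one or more of its hidden layers.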
In speech enhancement, perceptual loss has been used successfully in an end-to-end speech denoising pipeline, with an auxiliary network pre-trained on an audio classification task [germain2019speech]. Kataria et al. propose a perceptual loss that optimizes the enhancement network with an auxiliary network pre-trained on a speaker recognition task. In voice conversion, Lo et al. [lo2019mosnet] propose deep learning-based assessment models to predict human ratings of converted speech. Lee [lee2020voice] proposes a perceptually meaningful criterion in which the human auditory system is taken into consideration when measuring the distances between the converted speech and the reference.
In speech synthesis, Oord et al. propose to train a WaveNet-like classifier with perceptual loss for phone recognition [pmlr-v80-oord18a]. As the classifier extracts high-level features that are relevant for phone recognition, this loss term supervises the training of WaveNet to capture temporal dynamics and penalizes bad pronunciations. Cai et al. [cai2020speaker] study the use of a pre-trained speaker embedding network to provide a feedback constraint, which serves as the perceptual loss for training a multi-speaker TTS system.
In the context of prosody modeling, the perceptual loss in the above studies can be generally described as a style reconstruction loss [johnson2016perceptual]. Following the same principle, we propose a novel auxiliary network, pre-trained on a speech emotion recognition (SER) task, to extract high-level prosody representations. By comparing prosody representations in a continuous space, we measure the perceptual loss between two utterances. While perceptual loss is not new in speech reconstruction, the idea of using a pre-trained emotion recognition network for perceptual loss is a novel attempt in speech synthesis.
II-D Deep Features for Perceptual Loss
Now the question is which deep features are suitable for measuring perceptual loss. We benefit from prior work in prosody modeling. Prosody embedding in Tacotron is a type of feature learning, which learns representations for prediction or classification tasks. With deep learning algorithms, automatic feature learning can be achieved in either a supervised manner, such as with a multilayer perceptron [bottleneckSER], or an unsupervised manner, such as with a variational autoencoder [kingma2014auto]. Deep features are usually more generalizable, and easier to manage, than hand-crafted or manually designed features [Zhong2016AnOO]. There have been studies on representation learning for prosody patterns, such as speech emotion [vaefeaturelearning] and speech styles [wang2018style].
Affective prosody refers to the expression of emotion in speech [Zhang2018SpeechER, 10.1109/TASLP.2019.2898816]. It is prominently exhibited in emotional speech databases. Therefore, studies in speech emotion recognition provide valuable insights into prosodic modeling. Emotions are usually characterized by discrete categories, such as happy, angry, and sad, or by continuous attributes, such as activation, valence and dominance [murray1993toward, pierre2003production]. Recent studies show that latent representations of deep neural networks also characterize emotion well in a continuous space [bottleneckSER].
There have been studies leveraging emotion speech modeling for expressive TTS [eyben2012unsupervised, skerry2018towards, wu2019end, gao2020interactive, um2020emotional]. Eyben et al. [eyben2012unsupervised] incorporate unsupervised expression cluster information into an HMM-based TTS system. Skerry-Ryan et al. [skerry2018towards] study learning prosody representations from an animated and emotive storytelling speech corpus. Wu et al. [wu2019end] propose a semi-supervised training of the Tacotron TTS framework for emotional speech synthesis, where style tokens are defined to represent emotion categories. Gao et al. [gao2020interactive] propose to use an emotion recognizer to extract the style embedding for speech style transfer. Um et al. [um2020emotional] study a technique to apply style embedding to the Tacotron system to generate emotional speech, and to control the intensity of emotion expressiveness.
All the studies point to the fact that emotion-related deep features serve as excellent descriptors of speech prosody and speech styles. In this paper, instead of using the style tokens to control the TTS outputs, we would like to study how to use deep style features to measure perceptual loss for training of neural TTS system in general. While the idea is proposed for neural TTS, we use Tacotron TTS system as an example to carry out the study.
III Tacotron with Frame and Style Reconstruction Loss
We propose a novel training strategy for Tacotron with both frame and style reconstruction loss. As the style reconstruction loss is formulated as a perceptual loss (PL) [johnson2016perceptual], the proposed frame and style training strategy is called Tacotron-PL in short. It seeks to optimize the frame-level spectral loss, that is, the frame reconstruction loss, and the utterance-level style loss, that is, the style reconstruction loss, at the same time.
The overall framework is illustrated in Fig. 2 and has three stages: 1) training of the style descriptor, 2) the proposed frame and style training of the Tacotron-PL model, and 3) run-time inference. In Stage I, we train an auxiliary network to serve as the style descriptor for input speech utterances. In Stage II, the proposed frame and style training strategy is implemented to associate input text with acoustic features, as well as with the prosody style of natural speech, assisted by the style descriptor obtained in Stage I. In Stage III, the Tacotron-PL system takes input text and generates expressive speech in the same way as a standard Tacotron does. Unlike other Tacotron variants [wang2018style], Tacotron-PL does not require any add-on module or process for run-time inference.
As discussed in Section II-A, the traditional Tacotron architecture contains a text encoder and an attention-based decoder. We first encode the input character embedding into a hidden state, from which the decoder generates mel-spectrum features. During training, we adopt a frame-level mel-spectrum loss as in [shen2018natural], which is a loss between the synthesized mel-spectrum $\hat{Y}$ and the target mel-spectrum $Y$:

$$\mathcal{L}_{frame} = \frac{1}{T}\sum_{t=1}^{T} \left\| y_t - \hat{y}_t \right\|_2^2 \qquad (1)$$

where $y_t$ and $\hat{y}_t$ denote the target and synthesized mel-spectrum frames at time $t$, and $T$ is the number of frames. This loss is designed to minimize frame-level distortion; it does not guarantee utterance-level similarity concerning speech expression, such as speech prosody and speech style. We next study a new loss function that measures the utterance-level style reconstruction loss.
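As a minimal sketch, the frame reconstruction loss above can be computed as a frame-wise MSE over mel-spectrograms; the array shapes below (100 frames, 40 mel channels) are illustrative assumptions:

```python
import numpy as np

def frame_reconstruction_loss(mel_ref, mel_gen):
    """Frame-level MSE between target and synthesized mel-spectrograms.

    mel_ref, mel_gen: arrays of shape (T, D), i.e. T frames of D mel channels.
    """
    assert mel_ref.shape == mel_gen.shape
    return float(np.mean((mel_ref - mel_gen) ** 2))

rng = np.random.default_rng(1)
mel_ref = rng.standard_normal((100, 40))                  # reference frames
mel_gen = mel_ref + 0.05 * rng.standard_normal((100, 40)) # imperfect synthesis

assert frame_reconstruction_loss(mel_ref, mel_ref) == 0.0
assert frame_reconstruction_loss(mel_ref, mel_gen) > 0.0
```

In practice, the synthesized and target sequences must have the same number of frames, which the teacher-forced Tacotron decoder guarantees during training.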
III-A Stage I: Training of Style Descriptor
One of the great difficulties of prosody modelling is the lack of a reference. In linguistics, we usually describe prosody styles qualitatively. However, precise annotation of speech prosody is not straightforward. One way to describe a prosody style is to show it by example. The idea of the style token [wang2018style] shows a way to compare two prosody styles quantitatively using deep features.
Manual prosodic annotations of recorded speech [Silverman1992TOBIAS] provide quantifiable prosodic labels that allow us to associate speech styles with actual acoustic features. Prosody labelling schemes often attempt to describe prosodic phenomena, such as the supra-segmental features of intonation, stress, rhythm and speech rate, in discrete categories. Categorical labels of speech emotion [busso2008iemocap] also seek to achieve a similar goal. The prosody labelling schemes serve as a type of style descriptor. With a deep neural network, one is able to learn feature representations of the data at different levels of abstraction in a continuous space [goodfellow2016deeplearning]. As speech styles naturally spread over a continuum rather than being force-fitted into a finite set of categorical labels, we believe that a deep neural network learned from animated and emotive speech serves as a more suitable style descriptor.
We propose to use a speech emotion recognizer (SER) [Zhang2018SpeechER, 10.1109/TASLP.2019.2898816] as a style descriptor $\Phi(\cdot)$, which extracts deep style features $\phi$ from an utterance $Y$, i.e. $\phi = \Phi(Y)$. We use the neuronal activations of hidden units in a deep neural network as the deep style features, representing a high-level prosodic abstraction at the utterance level. In practice, we first train an SER network on highly animated and emotive speech with supervised learning. We then derive deep style features from a small intermediate layer. As the intermediate layer is small relative to the size of the other layers, it creates a constriction in the network that forces the information pertinent to emotion classification into a low-dimensional prosody representation [bottlenectDong]. If the network classifies emotion well, the derived deep features are believed to describe the prosody style of speech well.
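The bottleneck idea can be illustrated with a toy network: a narrow hidden layer whose mean-pooled activations form a fixed-size, utterance-level style embedding. The layer sizes and weights below are illustrative assumptions, not the SER configuration used in this paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy SER-like network: wide input -> narrow bottleneck -> class logits.
# The narrow layer forces emotion-relevant information into a
# low-dimensional prosody representation (weights are random placeholders).
W_in = rng.standard_normal((40, 8))    # 40-dim frame -> 8-dim bottleneck
W_out = rng.standard_normal((8, 4))    # bottleneck -> 4 emotion classes

def style_embedding(utterance):
    """Utterance-level style embedding: mean-pooled bottleneck activations."""
    h = np.tanh(utterance @ W_in)      # (T, 8) bottleneck activations
    return h.mean(axis=0)              # fixed-size descriptor of the utterance

utt = rng.standard_normal((120, 40))   # 120 frames of 40-dim mel features
emb = style_embedding(utt)
assert emb.shape == (8,)               # fixed size, regardless of utterance length
```

Note how the embedding has a fixed dimension regardless of the utterance length, which is what allows two utterances of different durations to be compared in style space.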
We follow the SER implementation in [chen20183], as illustrated in Fig. 3, which forms part of Fig. 2. The SER network includes: 1) a three-dimensional (3-D) CNN layer; 2) a BLSTM layer [Greff2017LSTM]; 3) an attention layer; and 4) a fully connected (FC) layer. The 3-D CNN [chen20183] first extracts a latent representation from the mel-spectrum and its delta and delta-delta values, converting the input utterance of variable length into a fixed-size latent representation, denoted as the deep feature sequence $\Phi_L$, which reflects the semantics of emotion. The BLSTM summarizes the temporal information of $\Phi_L$ into another latent representation $\Phi_M$. Finally, the attention layer assigns weights to $\Phi_M$ and generates $\Phi_H$ for emotion prediction.
The question is which of the latent representations, $\Phi_L$, $\Phi_M$, and $\Phi_H$, is suitable to serve as the deep style features. To validate the descriptiveness of the deep style features, we perform an analysis on the LJ-Speech corpus [ljspeech17]. Specifically, we randomly select five utterances from each of six style groups in the database, each group having a distinctive speech style, namely: 1) Short question; 2) Long question; 3) Short answer; 4) Short statement; 5) Long statement; and 6) Digit string. The complete list of utterances can be found in Table III in Appendix A.
We visualize the $\Phi_L$, $\Phi_M$ and $\Phi_H$ of the utterances using the t-SNE algorithm [maaten2008visualizing] in a two-dimensional plane, as shown in Fig. 4. It is observed that the $\Phi_L$, $\Phi_M$ and $\Phi_H$ of the utterances form clear style groups in terms of feature distribution, which is encouraging. We will further compare the performance of the different deep style features through TTS experiments in Section IV.
III-B Stage II: Tacotron-PL Training
During the training of Tacotron-PL, the SER-based style descriptor $\Phi(\cdot)$ is used to extract the deep style features. We define a style reconstruction loss $\mathcal{L}_{style}$ that compares the prosody style between the reference speech $Y$ and the generated speech $\hat{Y}$:

$$\mathcal{L}_{style} = \left\| \phi - \hat{\phi} \right\|_{2}^{2} \qquad (2)$$

where $\phi = \Phi(Y)$ and $\hat{\phi} = \Phi(\hat{Y})$. As illustrated in Fig. 3, the proposed training strategy involves two loss functions: 1) $\mathcal{L}_{frame}$, which minimizes the loss between the synthesized and original mel-spectrum at the frame level; and 2) $\mathcal{L}_{style}$, which minimizes the style differences between the synthesized and reference speech at the utterance level. The total loss is:

$$\mathcal{L} = \mathcal{L}_{frame} + \mathcal{L}_{style} \qquad (3)$$

where $\mathcal{L}_{frame}$ is also the loss function of a traditional Tacotron [shen2018natural] system.
The style reconstruction loss can be seen as perceptual quality feedback at the utterance level to supervise the training of prosody style. All parameters of the TTS model are updated with the gradients of the total loss through back-propagation. We expect mel-spectrum generation to learn from both local and global viewpoints through the frame and style reconstruction losses.
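The combined objective can be sketched as follows, under the assumption of an unweighted sum of the frame-level and utterance-level terms; the shapes and the 8-dimensional style features are illustrative placeholders:

```python
import numpy as np

def total_loss(mel_ref, mel_gen, phi_ref, phi_gen):
    """Frame reconstruction loss plus utterance-level style loss."""
    l_frame = np.mean((mel_ref - mel_gen) ** 2)   # local, per-frame term
    l_style = np.mean((phi_ref - phi_gen) ** 2)   # global, per-utterance term
    return float(l_frame + l_style)

rng = np.random.default_rng(3)
mel_ref = rng.standard_normal((50, 40))           # reference mel frames
mel_gen = mel_ref + 0.1 * rng.standard_normal((50, 40))
phi_ref = rng.standard_normal(8)                  # deep style features (reference)
phi_gen = phi_ref + 0.1 * rng.standard_normal(8)  # deep style features (synthesized)

assert total_loss(mel_ref, mel_ref, phi_ref, phi_ref) == 0.0
assert total_loss(mel_ref, mel_gen, phi_ref, phi_gen) > 0.0
```

In the actual system, the style features come from the frozen SER descriptor, so the gradient of the style term flows through the descriptor into the TTS model parameters.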
III-C Stage III: Run-time Inference
The inference stage follows exactly the same Tacotron workflow, involving only the TTS model in Fig. 3. The difference between Tacotron-PL and the global style token variants of Tacotron is that Tacotron-PL encodes prosody styling inside the standard Tacotron architecture; it does not require any add-on module.
At run-time, the Tacotron architecture takes text as input and generates expressive mel-spectrum features as output, followed by the Griffin-Lim algorithm [griffin1984signal] in this paper to generate audio signals.
IV Experiments
We train an SER as the style descriptor on the IEMOCAP dataset [busso2008iemocap], which consists of five sessions, each performed by a pair of speakers (female and male) in scripted and improvised scenarios. The dataset contains a total of 10,039 utterances, with an average duration of 4.5 seconds at a sampling rate of 16 kHz. We only use the subset of improvised data with four emotional categories, namely happy, angry, sad, and neutral, recorded in hypothetical scenarios designed to elicit specific types of emotions.
With the style descriptor, we further train a Tacotron system on the LJ-Speech database [ljspeech17], which consists of 13,100 short clips with a total of nearly 24 hours of speech from one single speaker reading 7 non-fiction books. The speech samples are available from the demo link: https://ttslr.github.io/Expressive-TTS-Training-with-Frame-and-Style-Reconstruction-Loss/
IV-A Comparative Study
We develop five Tacotron-based TTS systems for a comparative study: the Tacotron baseline, and four variants of Tacotron with the proposed training strategy, Tacotron-PL. To study the effect of different style descriptors, we compare the use of four deep style features in $\mathcal{L}_{style}$, which include three single features and a combination of them, as illustrated in Fig. 5 and summarized as follows:
Tacotron: Tacotron [shen2018natural] trained with $\mathcal{L}_{frame}$ as in Eq. (1).
Tacotron-PL(L): Tacotron-PL which uses $\Phi_L$ in $\mathcal{L}_{style}$.
Tacotron-PL(M): Tacotron-PL which uses $\Phi_M$ in $\mathcal{L}_{style}$.
Tacotron-PL(H): Tacotron-PL which uses $\Phi_H$ in $\mathcal{L}_{style}$.
Tacotron-PL(LMH): Tacotron-PL which uses $\Phi_L$, $\Phi_M$ and $\Phi_H$ together in $\mathcal{L}_{style}$.
IV-B Experimental Setup
For SER training, we split the speech signals into segments of 3 seconds as in [chen20183]. Then 40-channel mel-spectrum features are extracted with a frame size of 50 ms and a 12.5 ms frame shift. The first convolution layer has 128 feature maps, while the remaining convolution layers have 256 feature maps. The filter size for all convolution layers is 5x3, with 5 along the time axis and 3 along the frequency axis, and the pooling size for the max pooling layer is 2x2. We add a linear layer with 200 output units after the 3-D CNN for dimension reduction.
In this way, the 3-D CNN extracts a fixed-size latent representation from the input utterance, which we use as the deep style features $\Phi_L$, representing a temporal sequence of segments, each with a fixed-dimensional embedding. The BLSTM layer produces output activations in both directions for each input segment, which are further mapped via a linear layer; the BLSTM thus summarizes the temporal information of $\Phi_L$ into another fixed-size latent representation $\Phi_M$. The attention layer assigns weights to $\Phi_M$ and generates a new latent representation $\Phi_H$. All latent representations $\Phi_L$, $\Phi_M$ and $\Phi_H$ have the same dimension.
The fully connected layer contains 64 output units. Batch normalization [ioffe2015batch] is applied to the fully connected layer to accelerate training and improve the generalization performance. The parameters of the SER model are optimized by minimizing the cross-entropy objective function, with a minibatch of 40 samples, using the Adam optimizer with Nesterov momentum; the momentum is set to 0.9.
The SER-based style descriptor is used to extract deep style features for the computation of $\mathcal{L}_{style}$. For TTS training, the encoder takes a 256-dimensional character sequence as input and the decoder generates the 40-channel mel-spectrum. The training utterances from the LJ-Speech database are of variable length. Mel-spectrum features are also extracted with a frame size of 50 ms and a 12.5 ms frame shift. They are normalized to zero mean and unit variance to serve as the reference target. The decoder predicts only one non-overlapping output frame at each decoding step. We use the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a learning rate of $10^{-3}$ exponentially decaying to $10^{-5}$ starting at 50k iterations. We also apply $L_2$ regularization with weight $10^{-6}$. All models are trained with a batch size of 32 for 150k steps.
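The learning-rate schedule can be sketched as below. The decay horizon (`decay_steps`) is an illustrative assumption, since the schedule is only stated to start decaying at 50k iterations; the endpoint values follow the setup described above:

```python
def learning_rate(step, lr_init=1e-3, lr_final=1e-5,
                  decay_start=50_000, decay_steps=100_000):
    """Exponential decay from lr_init to lr_final, starting at decay_start.

    Before decay_start the rate is constant; afterwards it decays
    geometrically, reaching lr_final after decay_steps further steps.
    """
    if step < decay_start:
        return lr_init
    progress = min((step - decay_start) / decay_steps, 1.0)
    return lr_init * (lr_final / lr_init) ** progress

assert learning_rate(0) == 1e-3                       # constant before decay
assert abs(learning_rate(150_000) - 1e-5) < 1e-12     # reaches the floor
assert 1e-5 < learning_rate(100_000) < 1e-3           # decaying in between
```

A geometric interpolation like this is the usual reading of "exponentially decaying" between two rates; framework schedulers (e.g. exponential-decay schedules in common deep learning libraries) implement the same shape.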
IV-C Frame and Style Reconstruction Loss
To examine the effect of the proposed training strategy and the influence of the perceptual loss $\mathcal{L}_{style}$, we observe how $\mathcal{L}_{frame}$ converges under different training schemes on the same training data. For brevity, we only compare the convergence trajectories of $\mathcal{L}_{frame}$ between the Tacotron baseline and the $\mathcal{L}_{frame}$ component of $\mathcal{L}$ in the training of Tacotron-PL(L), in Fig. 6.
A lower frame-level reconstruction loss $\mathcal{L}_{frame}$ indicates better convergence, and thus better frame-level spectral prediction. We observe that the $\mathcal{L}_{frame}$ component of $\mathcal{L}$ achieves a lower convergence value than in traditional Tacotron training. This suggests that the utterance-level style objective function not only optimizes the style reconstruction loss, but also reduces the frame-level reconstruction loss over the Tacotron baseline. We note that the trajectories of Tacotron-PL(M), Tacotron-PL(H) and Tacotron-PL(LMH) follow a similar trend to that of Tacotron-PL(L).
IV-D Objective Evaluation
We conduct objective evaluation experiments to compare the systems in a comparative study. The results are summarized in Table I.
IV-D1 Performance Evaluation Metrics
Mel-cepstral distortion (MCD) [kubichek1993mel] is used to measure the spectral distance between the synthesized and reference mel-spectrum features. MCD is calculated as:

$$\mathrm{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} \left( c_{t,d} - \hat{c}_{t,d} \right)^{2}}$$

where $D$ represents the dimension of the mel-spectrum, $c_{t,d}$ denotes the $d$-th mel-spectrum component in frame $t$ for the reference target mel-spectrum, and $\hat{c}_{t,d}$ for the synthesized mel-spectrum. A lower MCD value indicates smaller distortion.
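As a sketch, MCD averaged over a pair of aligned mel-spectrum sequences could be computed as follows (function and variable names are ours):

```python
import math

def mel_cepstral_distortion(ref, syn):
    """Average frame-level MCD in dB between two time-aligned
    mel-spectrum sequences (each a list of equal-length frames)."""
    assert len(ref) == len(syn)
    k = 10.0 / math.log(10.0)  # dB scaling constant
    total = 0.0
    for r_frame, s_frame in zip(ref, syn):
        # squared Euclidean distance over the mel dimensions of one frame
        sq = sum((r - s) ** 2 for r, s in zip(r_frame, s_frame))
        total += k * math.sqrt(2.0 * sq)
    return total / len(ref)
```

When the synthesized and reference sequences differ in length, the frames are first paired by a DTW alignment before this per-frame distance is accumulated.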
We use Root Mean Squared Error (RMSE) as the evaluation metric for F0 modeling, which is calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( F0_{t} - \hat{F0}_{t} \right)^{2}}$$

where $F0_{t}$ and $\hat{F0}_{t}$ denote the reference and synthesized F0 at frame $t$, and $T$ is the number of frames. We note that a lower RMSE value suggests that the two F0 contours are more similar.
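The F0 RMSE computation is straightforward once the two contours are aligned; a minimal sketch (names are ours):

```python
import math

def f0_rmse(ref_f0, syn_f0):
    """RMSE (Hz) between two time-aligned F0 contours."""
    assert len(ref_f0) == len(syn_f0)
    t = len(ref_f0)
    return math.sqrt(sum((r - s) ** 2 for r, s in zip(ref_f0, syn_f0)) / t)
```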
Moreover, we propose to use frame disturbance, denoted as FD, to calculate the deviation in the dynamic time warping (DTW) alignment path [8300542, jusoh2015investigation, gupta2017perceptual]. FD is calculated as:

$$\mathrm{FD} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( x_{t} - y_{t} \right)^{2}}$$

where $x_{t}$ and $y_{t}$ denote the x-coordinate and the y-coordinate of the $t$-th frame in the DTW alignment path. As FD represents the duration deviation of the synthesized speech from the target, it serves as a proxy for duration distortion. A larger value indicates poorer duration modeling performance, and a smaller value indicates otherwise.
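Given a DTW alignment path, frame disturbance measures how far the path strays from the diagonal; a minimal sketch (names are ours):

```python
import math

def frame_disturbance(path):
    """Frame disturbance over a DTW alignment path.

    path: list of (x, y) frame-index pairs; a perfectly diagonal
    path (no timing deviation) yields FD = 0.
    """
    t = len(path)
    return math.sqrt(sum((x - y) ** 2 for x, y in path) / t)
```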
IV-D2 Spectral Modeling
We observe that all implementations of the Tacotron-PL model consistently provide lower MCD values than the Tacotron baseline, with Tacotron-PL(L) yielding the lowest MCD, as can be seen in Table I. We also visualize the spectrograms of the same speech content synthesized by the five different models, together with that of the reference natural speech, in Fig. 7. A visual inspection of the spectrograms suggests that the Tacotron-PL models consistently provide finer spectral details than the Tacotron baseline. All results confirm the observations in Fig. 6, that Tacotron-PL training provides a better spectral model.
TABLE I: Comparison of the systems in terms of MCD [dB], RMSE [Hz], and FD [frame].
IV-D3 F0 Modeling
Fundamental frequency, or F0, is an essential prosodic feature of speech [Stanton2018Predicting, sun2020generating]. As there is no guarantee that the synthesized speech and the reference speech have the same length, we apply DTW [muller2007dynamic] to align speech pairs and calculate the RMSE between their F0 contours. The results are reported in Table I. It is observed that the Tacotron-PL models consistently generate F0 contours that are closer to the reference speech than the Tacotron baseline.
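A minimal DTW alignment of two 1-D F0 contours could be sketched as follows (classic O(nm) dynamic programming with backtracking; the implementation details are our illustration, not the exact setup of [muller2007dynamic]):

```python
def dtw_path(a, b):
    """Align 1-D sequences a and b with DTW; return the alignment path
    as a list of (i, j) index pairs from (0, 0) to (len(a)-1, len(b)-1)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    # backtrack from the end, always moving to the cheapest predecessor
    path = [(n - 1, m - 1)]
    i, j = n, m
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    path.reverse()
    return path
```

The aligned frame pairs can then be used to compute the F0 RMSE, and the path coordinates themselves feed the frame disturbance measure.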
We note that both F0 and prosody style contribute to the RMSE measurement. To show the effect of the various deep style features on the F0 contours, we also plot the F0 contours of the utterances in Fig. 7. A visual inspection suggests that the Tacotron-PL models benefit from the perceptual loss training and produce F0 contours with a better fit to those of the reference speech, with Tacotron-PL(L) producing the best fit (see Fig. 7(c)).
IV-D4 Duration Modeling
Frame disturbance is a proxy for the duration difference [gupta2017perceptual] between synthesized speech and reference natural speech. We report the frame disturbance of the five systems in Table I. As shown in Table I, the Tacotron-PL models obtain significantly lower FD values than the Tacotron baseline, with Tacotron-PL(L) giving the lowest FD. From Figs. 6 and 7, we can also observe that the Tacotron-PL(L) example clearly provides better duration prediction than the other models. We can conclude that perceptual loss training with the style reconstruction loss helps Tacotron achieve more accurate rendering of prosodic patterns.
IV-D5 Deep Style Features
We compare four different deep style features by evaluating the performance of their use in Tacotron-PL models, namely Tacotron-PL(L), Tacotron-PL(M), Tacotron-PL(H) and Tacotron-PL(LMH).
In supervised feature learning, the features near the input layer capture low-level characteristics, while those near the output layer relate to the supervision target, i.e., the categorical emotion labels. While we expect the style descriptor to capture utterance-level prosody style, we do not expect the style reconstruction loss function to relate directly to emotion categories. Hence, lower-level deep features, denoted as L in Fig. 5, would be more appropriate than higher-level deep features such as M and H.
We observe that the L-level deep style feature is more descriptive than the other deep style features for perceptual loss evaluation, as reported for spectral modeling (MCD), F0 modeling (RMSE) and duration modeling (FD) in the Tacotron-PL experiments in Table I. The observations confirm our intuition and the analysis in Fig. 4.
IV-E Subjective Evaluation
We conduct listening experiments to evaluate several aspects of the synthesized speech, as well as the choice of deep style features for the perceptual loss $\mathcal{L}_{style}$.
IV-E1 Voice Quality
Each audio sample is evaluated by 15 subjects, each of whom listens to 90 synthesized speech samples. We first evaluate the voice quality in terms of mean opinion score (MOS) among Tacotron, Tacotron-PL(L), Tacotron-PL(M), Tacotron-PL(H) and Tacotron-PL(LMH). As shown in Fig. 8, the Tacotron-PL models consistently outperform the Tacotron baseline, with Tacotron-PL(L) achieving the best result.
IV-E2 Expressiveness
In the objective evaluations and MOS listening tests, Tacotron-PL(L) and Tacotron-PL(LMH) consistently offer better results. We next focus on comparing Tacotron-PL(L) and Tacotron-PL(LMH) with the Tacotron baseline. We first conduct an AB preference test to assess the speech expressiveness of the systems. Each audio sample is evaluated by 15 subjects, each of whom listens to 90 synthesized speech samples. Fig. 9 reports the speech expressiveness evaluation results. We note that Tacotron-PL(L) outperforms both the Tacotron baseline and Tacotron-PL(LMH) in the preference test. The results suggest that the L-level deep style feature is more effective than the other deep style features in informing the speech style.
IV-E3 Naturalness
We further conduct an AB preference test to assess the naturalness of the systems. Each audio sample is evaluated by 15 subjects, each of whom listens to 90 synthesized speech samples. Fig. 10 reports the naturalness evaluation results. As in the expressiveness evaluation, Tacotron-PL(L) outperforms both the Tacotron baseline and Tacotron-PL(LMH) in the preference test. The results confirm that the L-level deep style feature is more effective in informing the speech style.
TABLE II: Best-Worst Scaling results, in terms of the percentage of times each system is selected as Best (%) and Worst (%).
IV-E4 Deep Style Features
We finally conduct Best-Worst Scaling (BWS) listening experiments to compare the four Tacotron-PL systems with different deep style features. The subjects are invited to evaluate multiple samples derived from the different models and to choose the best and the worst sample. We perform this experiment for 18 different utterances, and each subject listens to 72 speech samples in total. Each audio sample is evaluated by 15 subjects.
Table II summarizes the results. We can see that Tacotron-PL(L) is selected as the best model 83% of the time and as the worst model only 2% of the time, which shows that the L-level deep style feature is the most effective.
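The best/worst percentages reported above can be tallied from the individual judgments as sketched below (the data layout and names are our assumption for illustration):

```python
from collections import Counter

def bws_percentages(trials):
    """Tally Best-Worst Scaling judgments.

    trials: list of (best_pick, worst_pick) system-name pairs,
    one per subject judgment.
    Returns {system: (best_pct, worst_pct)} over all systems seen.
    """
    n = len(trials)
    best = Counter(b for b, _ in trials)
    worst = Counter(w for _, w in trials)
    systems = set(best) | set(worst)
    return {s: (100.0 * best[s] / n, 100.0 * worst[s] / n) for s in systems}
```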
V Conclusion
We have studied a novel training strategy for Tacotron-based TTS that includes both frame and style reconstruction losses. We implement a SER model as the style descriptor to extract deep style features for evaluating the style reconstruction loss. We have conducted a series of experiments and demonstrated that the proposed Tacotron-PL training strategy outperforms the state-of-the-art Tacotron baseline without the need for any add-on mechanism at run-time. While we conduct the experiments only on Tacotron, the proposed idea is applicable to other end-to-end neural TTS systems, which we will explore in future work.
(1) What did he say to that?
(2) Where would be the use?
(3) Where is it?
(4) The soldiers then?
(5) What is my proposal?

(1) Answer: Yes.
(2) Answer: No.
(3) Answer: Thank you.
(4) Answer: No, sir.
(5) Answer: By not talking to him.

(1) In September he began to review Spanish.
(2) They agree that Hosty told Revill.
(3) Hardly any one.
(4) They are photographs of the same scene.
(5) and other details in the picture.

(1) Nineteen sixty-three.
(2) Fourteen sixty-nine, fourteen seventy.
(3) March nine, nineteen thirty-seven. Part one.
(4) Section ten. March nine, nineteen thirty-seven. Part two.
(5) On November eight, nineteen sixty-three.