The goal of text-to-speech (TTS) synthesis is to generate human-like speech from a text input. Recently, end-to-end trainable neural networks have become increasingly popular for this task. For example, Tacotron [11] and Tacotron-2 [9] use an encoder-decoder architecture that is trained on pairs of text and audio samples, with the learning objective that the synthesized speech should faithfully reconstruct the paired audio. With the success of neural TTS systems, the focus has shifted to TTS stylization [7], where the goal is to control the style of speech during the synthesis process. Stylization occurs when the system can generate speech for a given text input in a style that differs from what exists in the training data. The ability to control speech style is crucial for developing natural, human-like TTS systems.
We use style dimension to refer to the category of the given style, such as speaker identity, emotion, or accent, and style class to refer to a specific type such as speaker1, happy, or Scottish. An audio sample has style class labels for either all the defined style dimensions, e.g., it is from speaker1 with happy emotion and a Scottish accent, or only for a subset of the style dimensions, e.g., it is missing the emotion and accent labels.
Multiple systems exist to model the style of speech [7, 12, 13], where a reference audio sample with the desired style is used as a conditioning variable during the TTS process. However, most existing approaches require a large number of text-audio training samples of different style dimensions/classes. They also often fail to generalize to new domains unseen during training. For example, to create speech in different speaker identities and emotion classes using a single model, a dataset containing audio samples for each emotion class and speaker identity is needed, and yet the model could still fail to transfer the emotion style to an unseen speaker. Collection of such datasets is challenging and this limits a timely deployment of large-scale TTS stylization systems.
In this paper, we focus on multi-reference neural TTS stylization with disjoint datasets. Disjoint datasets occur when one dataset contains samples of only a single style class for one of the style dimensions. Table 1 shows the particular scenario we consider in this paper: we use an internal dataset of North American English with two speakers. The dataset for Speaker 1 contains examples of only a single emotion (Neutral), whereas the dataset for Speaker 2 contains examples of all four emotion classes (Neutral, Sad, Angry, Happy). This represents a minimal instance of the aforementioned issue: a model must learn disentangled representations of the two style dimensions and properly transfer the knowledge about one dimension (emotion) across another dimension (speaker identity) where no variation of style classes is available. This poses a significant challenge to TTS stylization, similar to domain adaptation [2], yet in the unique setting of style transfer in the speech signal processing domain.
Previous work on TTS stylization has primarily focused on the transfer of a single style reference audio sample [7, 12, 13, 5]. Those methods are inadequate for disjoint datasets because of their lack of domain adaptation capability. In an extreme case, those methods could, for example, learn to identify the emotion using features from the speaker identity dimension. They could also simply ignore the other style dimension (emotion) entirely and always map Speaker 1 samples to the only available style class (Neutral).
Recently, Bian et al. [1] tackled multi-reference TTS stylization based on GST-Tacotron [12] and an intercross training scheme. They showed successful style transfer in a speaker-prosody multi-reference scenario using a 30-hour corpus with 27 speakers and 5 prosodies. However, their intercross training scheme does not guarantee that each combination of style classes is seen during training, which causes a missed opportunity to learn disentangled representations of styles and leads to sub-optimal results on disjoint datasets.
In this paper, we address the challenges of multi-reference style transfer on disjoint datasets by using an adversarial cycle consistency training scheme. Unlike intercross training, our training scheme sweeps across all combinations of style classes via paired and unpaired triplets. This provides disentanglement of multiple style dimensions and classes, enabling our model to transfer style more faithfully than existing methods. Testing on our 40-hour disjoint dataset of 2 speakers and 4 emotions, we observe improved emotional expressiveness in synthesized speech, achieving 98.34% emotion classification accuracy, a 78.48% relative improvement over the baseline model.
2 Our Method
Fig. 1 shows a schematic diagram of our system. It consists of a text encoder, two reference audio encoders, and an audio decoder. Each audio encoder captures a different style dimension, e.g., one captures speaker identity and the other captures emotion.
At inference time, our model encodes a text string and two reference audio inputs and produces a spectrogram using the audio decoder; the spectrogram is then converted to a waveform using the Griffin-Lim vocoder [4]. More specifically, our text encoder and audio decoder follow the encoder-decoder architecture of Tacotron-2 [9]. We augment this with a reference encoder for each style dimension and concatenate each output embedding to the text context vector at each decoder step. The reference audio encoders follow the same structure as the audio encoder in [10].
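The conditioning step described above can be sketched as follows. This is only an illustration: the function name, the vector sizes, and the use of plain Python lists in place of tensors are assumptions made here, not details of the actual model.

```python
def condition_decoder_input(context, style_embeddings):
    """Concatenate one text context vector with every style embedding."""
    conditioned = list(context)
    for embedding in style_embeddings:
        conditioned.extend(embedding)
    return conditioned


# Hypothetical sizes: a 4-dim context vector and two small style embeddings.
context_t = [0.1, 0.2, 0.3, 0.4]   # text context vector at one decoder step
speaker_emb = [1.0, 0.0]           # output of the speaker reference encoder
emotion_emb = [0.0, 1.0, 0.5]      # output of the emotion reference encoder

decoder_input = condition_decoder_input(context_t, [speaker_emb, emotion_emb])
```

The same two embeddings are appended at every decoder step, so the style information is constant over time while the text context vector changes.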
During training, we attach style classifiers with gradient reversal layers and feed the generated spectrograms back to the reference audio encoders. This forms the adversarial cycle consistency objective [14]. Below we provide details of our training method.
2.1 Model Training
Learning from disjoint datasets is difficult because we do not have text-audio pairs for all possible combinations of style classes across each style dimension. To encourage disentangling of the style embeddings, we require the model to use both style embeddings during training, with each capturing a different style dimension. Further, we carefully select reference audio samples to ensure each style and speaker is seen during training, filling in the gaps in Table 1.
We achieve this by synthesizing speech from both paired and unpaired triplets. Each reference audio sample is characterized by the style dimension it serves as a reference for and by its pairing type. The pairing type can take one of three values: 1) a paired audio sample with the same verbal content as the input text, 2) a style-matched audio sample with the same style class as the paired audio sample but with different verbal content from the input text, 3) a random audio sample with a random style class.
A paired triplet contains a text sample, a paired audio sample, and a style-matched audio sample; either style dimension can take the paired sample as its reference, with the other taking the style-matched sample. An unpaired triplet contains a text sample and two random audio samples, one per style dimension. Our style-matched sample is similar to that in the intercross training scheme used by Bian et al. [1], and our random sample is similar to the unpaired training scheme used by Ma et al. [7]. In this work, we combine those ideas to enable multi-reference TTS stylization from disjoint datasets. Next, we discuss the loss functions used for the paired and unpaired triplets.
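The triplet construction can be sketched as below. The toy records mirror the layout of Table 1 (Speaker 1 with neutral samples only, Speaker 2 with all four emotions), but the field names, helper functions, and sample texts are hypothetical and chosen only for illustration.

```python
import random

TEXTS = ["hello there", "good morning", "see you soon", "it is raining"]

# Toy disjoint dataset: Speaker 1 is neutral only, Speaker 2 covers every emotion.
DATASET = [{"text": t, "speaker": "speaker1", "emotion": "neutral"} for t in TEXTS[:2]]
for emotion in ["neutral", "sad", "angry", "happy"]:
    for t in TEXTS[2:]:
        DATASET.append({"text": t, "speaker": "speaker2", "emotion": emotion})


def paired_triplet(sample, dataset, paired_dim, rng):
    """Text, the paired sample for one dimension, a style-matched sample for the other."""
    other_dim = "emotion" if paired_dim == "speaker" else "speaker"
    candidates = [s for s in dataset
                  if s[other_dim] == sample[other_dim] and s["text"] != sample["text"]]
    return {"text": sample["text"], paired_dim: sample,
            other_dim: rng.choice(candidates), "target": sample}


def unpaired_triplet(sample, dataset, rng):
    """Text and two random references; no ground-truth audio exists for this case."""
    return {"text": sample["text"], "speaker": rng.choice(dataset),
            "emotion": rng.choice(dataset), "target": None}


rng = random.Random(0)
paired = paired_triplet(DATASET[0], DATASET, "speaker", rng)
unpaired = unpaired_triplet(DATASET[0], DATASET, rng)
```

Only the paired triplet carries a target spectrogram, which is why the reconstruction loss can be computed on paired triplets alone.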
Reconstruction loss. For the paired triplets only, we force the synthesized spectrogram to reconstruct the paired audio sample. We follow [11, 7, 12] and define a reconstruction loss between the input spectrogram and the output spectrogram.
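As a hedged illustration, one common form of such a loss is written below. The symbols $x$ (input spectrogram), $\hat{x}$ (output spectrogram), $T$ and $F$ (number of frames and frequency bins), and the choice of an $L_1$ distance are all assumptions introduced here; the text above does not fix the norm.

```latex
\mathcal{L}_{\mathrm{rec}}
  = \frac{1}{TF} \sum_{t=1}^{T} \sum_{f=1}^{F}
    \left| x_{t,f} - \hat{x}_{t,f} \right|
```

An $L_2$ variant would replace the absolute value with a squared difference; either choice is zero only when the synthesized spectrogram matches the paired sample exactly.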
Adversarial cycle consistency loss. The reconstruction loss alone is insufficient to constrain our model. Inspired by [14], we introduce an adversarial cycle consistency loss to further constrain it. Our main idea is that an embedding from a real audio sample must capture the correct style information. Thus, when we synthesize audio from it and feed the result back to the same audio encoder, the resulting embedding should contain the same style information as the original embedding; hence the cycle consistency. Furthermore, each of the two audio embeddings should contain information only about its corresponding style dimension; in other words, the embedding for style dimension one should have no information about style dimension two, and similarly the embedding for style dimension two should have no information about style dimension one. This can be enforced via adversarial learning.
We design our adversarial cycle consistency loss by combining the two ideas above. To this end, we define style classifiers indexed by a pair $(i, j)$, where $i$ refers to the style dimension of the input embedding and $j$ refers to the style dimension upon which the classification occurs. Each classifier is a two-layer MLP with a softmax output layer whose number of outputs equals the number of style classes in the $j$-th style dimension. We train it with a cross-entropy loss between the ground-truth style class for the $i$-th embedding in the $j$-th style dimension and the predicted style class. For $i = j$, the classifier encourages an embedding to contain the correct information of the $i$-th style dimension. For $i \neq j$, the classifier discourages the use of information about the other style dimension. We use the gradient reversal layer [3] before the classifiers for $i \neq j$ to enable adversarial learning.
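The interplay between the cross-entropy loss and the gradient reversal layer can be sketched in plain Python. This is a simplified illustration, not the model's implementation: the two-layer MLP is omitted so that the classifier logits stand in directly for its output, and the class name `GradientReversal` and its `scale` parameter are assumptions.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the ground-truth class."""
    return -math.log(softmax(logits)[target])


def cross_entropy_grad(logits, target):
    """Gradient of the cross-entropy with respect to the logits."""
    return [p - (1.0 if k == target else 0.0)
            for k, p in enumerate(softmax(logits))]


class GradientReversal:
    """Identity in the forward pass; flips the gradient sign in the backward pass."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def forward(self, x):
        return list(x)

    def backward(self, grad):
        return [-self.scale * g for g in grad]


logits = [2.0, 0.5, -1.0]          # hypothetical classifier outputs
target = 0                         # ground-truth style class
grl = GradientReversal()
loss = cross_entropy(grl.forward(logits), target)
grad_to_classifier = cross_entropy_grad(logits, target)
grad_to_encoder = grl.backward(grad_to_classifier)
```

The classifier still descends its own loss, while the encoder placed before the reversal layer receives the negated gradient and is therefore pushed to remove the information that the classifier relies on.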
The adversarial cycle consistency loss is then the combination of these classification losses over paired triplets, unpaired triplets, and synthesized samples.
Orthogonality loss. Finally, we introduce an orthogonality constraint to help the model learn disentangled style representations, similar to [1]. It is a Frobenius-norm penalty defined over the pair of style embeddings, one from each style dimension.
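A minimal sketch of an orthogonality penalty of this kind follows. It assumes the widely used form: the squared Frobenius norm of the product between the two batches of style embeddings. The exact expression used by the model may differ, and the function name and the batch-by-dimension list layout are illustrative choices.

```python
def orthogonality_loss(emb_a, emb_b):
    """Squared Frobenius norm of emb_a^T emb_b for two (batch x dim) matrices."""
    dim_a, dim_b = len(emb_a[0]), len(emb_b[0])
    total = 0.0
    for p in range(dim_a):
        for q in range(dim_b):
            # Entry (p, q) of emb_a^T emb_b: correlation of two feature columns.
            entry = sum(row_a[p] * row_b[q] for row_a, row_b in zip(emb_a, emb_b))
            total += entry ** 2
    return total


# Embeddings that share no direction incur no penalty.
disjoint = orthogonality_loss([[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]])
# Identical embeddings are maximally entangled and are penalized.
entangled = orthogonality_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Minimizing this quantity drives the feature columns of the two embeddings toward being uncorrelated across the batch, which complements the adversarial classifiers.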
Training details. Our final loss function is a weighted sum of the reconstruction loss, the adversarial cycle consistency loss, and the orthogonality loss. We chose the weights through cross-validation and found that the results are insensitive to small changes in their values. We train our model on a single machine with four NVIDIA Tesla M40 GPUs for 40k epochs using a batch size of 96 text/audio pairs, each with a paired and an unpaired triplet, for a total of 192 triplets. Note that the reconstruction loss is defined over only the paired triplets, while the other two loss terms are defined over both paired and unpaired triplets. We use teacher forcing for the reconstruction loss throughout the entire training procedure. At inference time, we use a window constraint for the text context attention, enforcing the maximum attention weight to be within a window of seven frames from the previous maximum. For the rest of the hyperparameters, we follow the same setup as outlined in [9]. After the 40k epochs, we add the adversarial game loss presented in Ma et al. [7] and train for an additional 1k epochs. Fine-tuning the model with this loss increases the fidelity of the synthesized unpaired samples.
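The inference-time window constraint can be sketched as follows. The text specifies only that the attention peak must stay within seven frames of the previous peak; the masking-and-renormalizing strategy, the fallback, and the function name below are assumptions made for illustration.

```python
def constrain_attention(weights, prev_peak, window=7):
    """Zero attention outside [prev_peak - window, prev_peak + window], then renormalize."""
    lo = max(0, prev_peak - window)
    hi = min(len(weights) - 1, prev_peak + window)
    masked = [w if lo <= k <= hi else 0.0 for k, w in enumerate(weights)]
    total = sum(masked)
    if total == 0.0:
        # Fallback (an assumption): keep attending to the previous peak.
        return [1.0 if k == prev_peak else 0.0 for k in range(len(weights))]
    return [w / total for w in masked]


# Hypothetical attention weights with a spurious peak far from the previous one.
weights = [0.01] * 20
weights[4] = 0.30
weights[18] = 0.51
constrained = constrain_attention(weights, prev_peak=3)
new_peak = max(range(len(constrained)), key=constrained.__getitem__)
```

Without the constraint the decoder would jump to position 18; with it, the peak advances to position 4, which keeps the alignment between text and audio close to monotonic.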
3 Experiments and Discussions
Our disjoint datasets are defined over two style dimensions, speaker identity and emotion, as shown in Table 1. The datasets contain 15,226 samples (18.55 hours) and 22,325 samples (21.62 hours) for Speakers 1 and 2, respectively. To the best of our knowledge, there exists only one published method that tackles multi-reference TTS stylization; we therefore compare to Bian et al. [1] in our experiments.
Style Classification Accuracy. We train two speech style classifiers (speaker identity and emotion) using the reference audio samples from the TTS training data. The classifiers have the same structure as the reference encoder and the style classifier in our model. Their final validation accuracies are 99% and 95% respectively.
Next, we synthesize speech from each test text sample four times, once in each emotion, and predict their style class labels using the trained classifiers. For the emotion reference, we use a random sample in the appropriate emotion from the Speaker 2 test set. For the speaker identity reference, we use the paired audio sample.
Figure 2 shows the confusion matrices. Both models achieve greater than 96% accuracy on speaker identity, showing their ability to retain speaker identity in synthesized samples. However, the baseline model performs poorly on emotion classification, achieving only 55.1% classification accuracy. As can be seen in the confusion matrix, many samples from the angry, happy, and sad classes are grouped into the neutral class, demonstrating a lack of style transfer. Our model achieves 98.3% classification accuracy, demonstrating a much higher rate of emotion style transfer.
| Model | Emt Acc (%) | Spk Acc (%) |
| Bian et al. [1] | 55.1 | 97.1 |
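The improvement figure quoted for emotion classification is a relative gain over the baseline accuracy, not a difference in percentage points. The short check below reproduces it from the two reported accuracies; the function name is an illustrative choice.

```python
def relative_improvement(baseline, ours):
    """Relative gain of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


baseline_acc = 55.1    # emotion accuracy of the baseline (%)
our_acc = 98.34        # emotion accuracy of our model (%)

gain = relative_improvement(baseline_acc, our_acc)
absolute_gap = our_acc - baseline_acc
```

Here `gain` is approximately 78.48, while `absolute_gap` is approximately 43.24 percentage points.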
We also visualize 100 embeddings (25 from each emotion) created by the emotion classifier’s reference encoder using t-SNE [6] in Figure 3. Our model produces tighter and more separable clusters due to the improved emotion style transfer; the results suggest an improved disentanglement of the two style dimensions using our model.
Human Subject Evaluation. We recruited eight human subjects to qualitatively evaluate our adversarial cycle consistency model. To test style transfer, we performed a side-by-side comparison of 20 synthesized Speaker 1 samples (5 texts in each of the 4 emotions). Subjects evaluated the samples on a 7-point scale (-3 to 3) where -3 refers to “sample A is closest to the reference emotion”. The results show our model was consistently rated as closer to the reference, especially for the three emotions unseen in the Speaker 1 dataset (sad, angry, and happy).
To test naturalness, we asked subjects to rate voice quality on a 5-point scale. Our model achieved a 3.29 mean opinion score (MOS) while the baseline reached a 3.43 MOS. Our model’s reduction in perceived quality may result from its more pronounced style transfer – on neutral samples, our model (3.63 MOS) outperforms the baseline (3.40 MOS). Perhaps the style transfer was too strong (almost exaggerated) for the other three emotions, leading to a decrease in the naturalness score.
Speech Fidelity. Finally, we evaluated the fidelity of the synthesized speech samples. We synthesized each Speaker 1 test text sample in each of the four emotions, then used the Microsoft Azure speech-to-text service to generate transcripts. The baseline reaches a 15.75% word error rate (WER) while our model reaches 16.95%. As with naturalness, our model’s improved emotional expressiveness may be the cause of its higher error rate, since the emotion can confound the automatic speech recognition system. We also believe that improved fidelity could be achieved with a more powerful vocoder such as WaveNet [8].
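For reference, WER is conventionally computed as the word-level edit distance between the reference text and the transcript, divided by the number of reference words. The sketch below implements that standard definition; it is not the scoring code used in the experiments, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))           # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1] / len(ref)


wer_deletion = word_error_rate("the cat sat on the mat", "the cat sat on mat")
wer_substitution = word_error_rate("see you soon", "see me soon")
```

The first call has one deleted word out of six reference words, and the second has one substitution out of three.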
Comparison with Bian et al. [1]. We believe the baseline model’s sub-optimal performance stems from the limitations of the intercross training procedure. Since the procedure only presents combinations of style classes that exist in the dataset (e.g., entries with a check mark in Table 1), unrepresented combinations (e.g., the gaps in Table 1) do not impact the model loss and thus are not accounted for during backpropagation. By training on unpaired triplets with random references, our cycle consistency training scheme ensures each combination of style classes (e.g., each entry in Table 1) is seen during training, forcing the model to learn to create speech for every style combination.
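The coverage argument can be made concrete with a small enumeration over the layout of Table 1. The dictionary below encodes which emotions are recorded for each speaker; the variable names are illustrative, and the sketch counts style combinations only, without modeling any training.

```python
from itertools import product

# Emotions recorded for each speaker, following Table 1.
AVAILABLE = {
    "speaker1": ["neutral"],
    "speaker2": ["neutral", "sad", "angry", "happy"],
}
EMOTIONS = sorted({emo for emos in AVAILABLE.values() for emo in emos})

# Style-matched references can only present combinations that exist in the data.
seen_with_matched = {(spk, emo) for spk, emos in AVAILABLE.items() for emo in emos}

# Random references draw the speaker and emotion references independently.
seen_with_random = set(product(AVAILABLE, EMOTIONS))

only_with_random = seen_with_random - seen_with_matched
```

The three combinations in `only_with_random` are exactly the gaps for Speaker 1 (sad, angry, happy), which are the cases the unpaired triplets add to training.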
4 Conclusion
We present an adversarial cycle-consistent training procedure for multi-reference neural TTS stylization on disjoint datasets. Because recording training samples for new style classes is labor-intensive, transferring style from one dataset to another (including disjoint datasets) is an appealing feature for TTS systems. Using our adversarial cycle consistency training scheme, we achieve a much higher rate of style transfer on disjoint datasets than previous models. Our model provides a 78% relative improvement in style transfer (based on emotion classification accuracy) over an existing method, with a small reduction in fidelity and naturalness.
References
[1] Y. Bian, C. Chen, Y. Kang, and Z. Pan (2019) Multi-reference Tacotron by intercross training for style disentangling, transfer and control in speech synthesis. arXiv preprint arXiv:1904.02373.
[2] H. Daumé III and D. Marcu (2006) Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research 26, pp. 101–126.
[3] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1180–1189.
[4] D. Griffin and J. Lim (1984) Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (2), pp. 236–243.
[5] Y. Jia et al. (2018) Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in Neural Information Processing Systems, pp. 4480–4490.
[6] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
[7] S. Ma, D. McDuff, and Y. Song (2019) Neural TTS stylization with adversarial and collaborative games. In ICLR 2019.
[8] A. van den Oord et al. (2016) WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
[9] J. Shen et al. (2018) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783.
[10] R. Skerry-Ryan et al. (2018) Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. arXiv preprint arXiv:1803.09047.
[11] Y. Wang et al. (2017) Tacotron: towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.
[12] Y. Wang et al. (2018) Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017.
[13] P. Wu et al. (2019) End-to-end emotional speech synthesis using style tokens and semi-supervised training. arXiv preprint arXiv:1906.10859.
[14] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.