Syntactic representation learning for neural network based TTS with syntactic parse tree traversal

12/13/2020 ∙ by Changhe Song, et al.

The syntactic structure of a sentence is correlated with the prosodic structure of its speech, which is crucial for improving the prosody and naturalness of a text-to-speech (TTS) system. Current TTS systems usually incorporate syntactic structure information through manually designed features based on expert knowledge. In this paper, we propose a syntactic representation learning method based on syntactic parse tree traversal to automatically utilize the syntactic structure information. Two constituent label sequences are linearized from the constituent parse tree through left-first and right-first traversals. Syntactic representations are then extracted at word level from each constituent label sequence by a corresponding uni-directional gated recurrent unit (GRU) network. Meanwhile, a nuclear-norm maximization loss is introduced to enhance the discriminability and diversity of the embeddings of constituent labels. Upsampled syntactic representations and phoneme embeddings are concatenated to serve as the encoder input of Tacotron 2. Experimental results demonstrate the effectiveness of the proposed approach, with the mean opinion score (MOS) increasing from 3.70 to 3.82 and the ABX preference exceeding the baseline by 17%. In addition, for sentences with multiple syntactic parse trees, prosodic differences can be clearly perceived in the synthesized speech.


1 Introduction

Recently, neural network based text-to-speech (TTS) systems have improved the prosody and naturalness of synthesized speech over conventional methods [1, 2, 3, 4]. By applying an encoder-decoder framework with attention [5], these systems can directly predict speech parameters from graphemes or phonemes, learning acoustic and prosodic patterns via a flexible mapping from linguistic to acoustic space. However, the learnt prosodic patterns contain only part of the prosodic structural information [4], resulting in poor prosody and naturalness, or even improper prosody.

To further improve prosody and naturalness of synthesized speech, adding prosodic structure annotations such as tones and break indices (ToBI) labels [6] or other prosodic structure labels [7] to the input sequence of neural network based TTS models has been proposed. Prosodic structure annotations need to be subjectively labeled from speech, which is time-consuming.

Although these annotations can be produced automatically by training a separate prosodic structure prediction model [8], the accuracy of the predicted labels is still limited because subjectively labeled annotations serve as the ground truth. The high correlation between syntactic structure and prosodic information has been demonstrated by successful syntactic-to-prosodic mapping [9, 10]. A set of rule-based syntactic features, such as part-of-speech (POS) and the position of the current word in its parent phrases, has been proposed and used in hidden Markov model (HMM) based acoustic models [11]. To utilize more syntactic structure information, the phrase structure based feature (PSF) and the word relation based feature (WRF) have been proposed for neural network based TTS [12]. PSF and WRF expand the set of syntactic features used in the HMM model. More features, such as the highest-level phrase beginning with the current word (HBCW) and the lowest common ancestor (LCA), are further introduced to model syntactic structure [12].

However, the expanded features are still manually designed features rather than automatically learned high-level representations. PSF only contains features from limited layers of the whole syntactic tree structure. WRF only exposes the information of partial nodes and edges from the whole syntactic parse tree.

To make better use of the syntactic information, and motivated by the syntactic parse tree traversal approach in neural machine translation [13], we propose a syntactic representation learning method to further improve the prosody and naturalness of synthesized speech in neural network based TTS. The syntactic parse tree is linearized into two constituent label sequences through left-first and right-first traversals. Syntactic representations are then extracted from the constituent label sequences using a separate uni-directional GRU network for each sequence. Next, the syntactic representations are up-sampled from word level to phoneme level and concatenated with phoneme embeddings. Tacotron 2 is employed to generate the spectrogram from the concatenated syntactic representations and phoneme embeddings, with Griffin-Lim [14] used to reconstruct the waveform. A nuclear-norm maximization loss (NML) is applied to the constituent label embedding layer to enhance discriminability and diversity. Compared with using only the left-first traversal [13], the additional right-first traversal is proposed to alleviate ambiguity.

Experimental results show that our proposed model outperforms the baseline in terms of prosody and naturalness. The mean opinion score (MOS) increases from 3.70 to 3.82 compared with the baseline approach (t-test, p=0.0079). The ABX preference rate exceeds that of the baseline approach by 17%. For sentences with multiple different syntactic parse trees, prosodic differences can be clearly perceived in the corresponding synthesized speech.

2 Methodology

Fig. 1 shows the framework of our proposed method. Our work focuses on introducing a trainable syntactic structure information extractor as part of a neural network based TTS system to improve the prosody and naturalness of the synthesized speech.

2.1 Syntactic representation learning

To provide high-level syntactic representations with rich syntactic information to a neural network based TTS system, we propose a syntactic representation learning network based on syntactic parse tree traversal. Constituent parse trees are extracted, including the labels and the tree structure of the constituents.

To represent the tree structure for neural network based TTS, depth-first traversals can be used to linearize the syntactic parse tree into constituent label sequences. Since any single tree traversal algorithm can map multiple syntactic parse trees to the same sequence, both left-first and right-first traversals are used to alleviate the ambiguity. The sequences of constituent labels generated by the two traversals can be formulated as:

$$S^{l} = [s^{l}_{1}, s^{l}_{2}, \dots, s^{l}_{n}], \qquad S^{r} = [s^{r}_{1}, s^{r}_{2}, \dots, s^{r}_{n}] \tag{1}$$

where $S^{l}$ and $S^{r}$ are the constituent label sequences generated from the left-first and right-first traversals respectively, $s^{l}_{j}$ and $s^{r}_{j}$ are constituent labels, and $n$ is the length of the sequences. The constituent labels are then embedded by a shared embedding layer and modeled by two different uni-directional GRU networks, one GRU network for each sequence. The process can be represented as:

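$$E^{l} = \mathrm{Embedding}(S^{l}), \quad E^{r} = \mathrm{Embedding}(S^{r}), \quad H^{l} = \mathrm{GRU}^{l}(E^{l}), \quad H^{r} = \mathrm{GRU}^{r}(E^{r}) \tag{2}$$

where $E^{l}$ and $E^{r}$ are the embedding sequences of the constituent labels in the left-first and right-first sequences $S^{l}$ and $S^{r}$, $\mathrm{GRU}^{l}$ and $\mathrm{GRU}^{r}$ are two different uni-directional GRUs, and $H^{l}$ and $H^{r}$ are the outputs of $\mathrm{GRU}^{l}$ and $\mathrm{GRU}^{r}$ respectively.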
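As a minimal sketch of the linearization step, the Python snippet below runs both traversals on two toy constituent trees. It assumes a pre-order depth-first traversal, trees encoded as nested `(label, children)` tuples, and leaves encoded as `(POS label, word)`; none of these details are specified in the text, and the function name `linearize` and the toy trees are hypothetical.

```python
def linearize(tree, right_first=False):
    """Pre-order depth-first linearization of a constituent tree.

    Returns (labels, word_positions): the label sequence and, for each word
    in sentence order, the index of its leaf label inside that sequence.
    """
    labels, leaf_positions = [], []

    def visit(node):
        label, rest = node
        is_leaf = isinstance(rest, str)  # leaf = (POS label, word)
        if is_leaf:
            leaf_positions.append(len(labels))
        labels.append(label)
        if not is_leaf:
            children = reversed(rest) if right_first else rest
            for child in children:
                visit(child)

    visit(tree)
    if right_first:
        # Right-first traversal meets the words in reverse sentence order.
        leaf_positions.reverse()
    return labels, leaf_positions


# Two different (hypothetical) trees over the same three words.
tree_a = ("NP", [("NP", [("NN", "w1"), ("NN", "w2")]), ("NN", "w3")])
tree_b = ("NP", [("NP", [("NN", "w1")]), ("NN", "w2"), ("NN", "w3")])

left_a, left_b = linearize(tree_a), linearize(tree_b)
right_a, right_b = linearize(tree_a, True), linearize(tree_b, True)
```

In this toy case the two trees produce the same left-first label sequence but different right-first sequences, which illustrates why adding a second traversal can reduce ambiguity.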

Syntactic features are the concatenations of the outputs of GRUs for each word, which can be formulated as:

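$$R = \big[\, [H^{l}_{p^{l}_{1}}; H^{r}_{p^{r}_{1}}], \dots, [H^{l}_{p^{l}_{N}}; H^{r}_{p^{r}_{N}}] \,\big] \tag{3}$$

where $H^{l}$ and $H^{r}$ are the outputs of the two GRUs, $p^{l}_{i}$ and $p^{r}_{i}$ are the positions of the $i$-th word in the left-first and right-first label sequences $S^{l}$ and $S^{r}$, $[\cdot\,;\cdot]$ denotes concatenation, $N$ is the number of words of the input text, and $R$ is the learnt syntactic representation.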
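The word-level extraction can be sketched in a few lines of Python. The snippet assumes the GRU outputs are available as plain lists of vectors and that the word positions in each label sequence are already known; the helper name `word_level_representations` and all numeric values are placeholders for illustration only.

```python
def word_level_representations(h_left, h_right, pos_left, pos_right):
    """For each word, concatenate the left-first and right-first GRU outputs
    taken at that word's position in the corresponding label sequence."""
    if len(pos_left) != len(pos_right):
        raise ValueError("both traversals must cover the same words")
    return [h_left[pl] + h_right[pr] for pl, pr in zip(pos_left, pos_right)]


# Toy GRU outputs for a 5-label sequence with hidden size 2 (arbitrary values).
h_left = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
h_right = [[0.5, 0.6], [1.5, 1.6], [2.5, 2.6], [3.5, 3.6], [4.5, 4.6]]

# Positions of three words inside the left-first and right-first sequences.
pos_left = [2, 3, 4]
pos_right = [4, 3, 1]

reps = word_level_representations(h_left, h_right, pos_left, pos_right)
```

Each entry of `reps` has twice the GRU hidden size, and there is one entry per word regardless of how long the label sequences are.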

2.2 Nuclear-norm Maximization Loss

To improve the discriminability and diversity of the embeddings of the syntactic labels, global nuclear-norm maximization loss (NML) [15] is proposed to increase the rank of the embeddings of all possible constituent labels. The NML is defined as:

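$$\mathcal{L}_{NML} = -\frac{\lVert E_{C} \rVert_{*}}{N_{C}} \tag{4}$$

where $C$ is the set of all possible constituent labels, and $E_{C}$ and $N_{C}$ are the embedding and length of $C$ respectively. $\lVert E_{C} \rVert_{*}$ is computed as:

$$\lVert E_{C} \rVert_{*} = \sum_{i} \sigma_{i} \tag{5}$$

where $\sigma_{i}$ is the $i$-th singular value of $E_{C}$.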
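To make the effect of the nuclear norm concrete, the following NumPy sketch compares two toy embedding matrices. It assumes the loss is the negative nuclear norm of the label embedding matrix divided by the number of labels; that normalization and sign are a reading of the definitions above rather than a confirmed formula, and the function names and matrices are hypothetical.

```python
import numpy as np


def nuclear_norm(embeddings):
    """Sum of the singular values of the embedding matrix."""
    singular_values = np.linalg.svd(embeddings, compute_uv=False)
    return float(singular_values.sum())


def nml_loss(embeddings):
    """Negative nuclear norm, normalized by the number of labels (assumed form)."""
    n_labels = embeddings.shape[0]
    return -nuclear_norm(embeddings) / n_labels


# Three label embeddings that collapse onto a single direction (rank 1).
collapsed = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
# Three label embeddings with the same Frobenius norm but spread out (rank 2).
spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

loss_collapsed = nml_loss(collapsed)
loss_spread = nml_loss(spread)
```

Both matrices have the same Frobenius norm, yet the spread-out one has the larger nuclear norm and therefore the lower loss, so minimizing this term pushes the embeddings toward higher rank.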

2.3 TTS with syntactic representations

The learnt syntactic representations are at word level, so they are upsampled to phoneme level and concatenated with the phoneme embeddings. Each word's syntactic representation is copied to match the length of the phoneme sequence of that word. Tacotron 2 [4] is employed to generate the spectrogram from the concatenated syntactic representations and phoneme embeddings, and Griffin-Lim [14] is further utilized to reconstruct the waveform. The whole model is trained with a loss function which can be formulated as:

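$$\mathcal{L} = \mathcal{L}_{taco} + \lambda \mathcal{L}_{NML} \tag{6}$$

where $\mathcal{L}_{taco}$ is the loss function defined in Tacotron 2 and $\lambda$ is the loss weight for the NML term $\mathcal{L}_{NML}$.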
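The upsampling, concatenation and loss combination described in this subsection can be sketched as follows. This is an illustration under stated assumptions: vectors are plain Python lists, the phoneme embedding is placed before the syntactic representation (the concatenation order is not specified), and the helper names and toy values are hypothetical.

```python
def upsample_to_phonemes(word_reps, phonemes_per_word):
    """Repeat each word-level vector once per phoneme of that word."""
    if len(word_reps) != len(phonemes_per_word):
        raise ValueError("need one phoneme count per word")
    upsampled = []
    for rep, count in zip(word_reps, phonemes_per_word):
        upsampled.extend(list(rep) for _ in range(count))
    return upsampled


def build_encoder_input(phoneme_embeddings, word_reps, phonemes_per_word):
    """Concatenate each phoneme embedding with its word's syntactic vector."""
    syntactic = upsample_to_phonemes(word_reps, phonemes_per_word)
    if len(syntactic) != len(phoneme_embeddings):
        raise ValueError("phoneme counts do not match the phoneme sequence")
    return [p + s for p, s in zip(phoneme_embeddings, syntactic)]


def total_loss(tacotron_loss, nml_term, nml_weight=0.05):
    """Weighted sum of the Tacotron 2 loss and the NML term."""
    return tacotron_loss + nml_weight * nml_term


# Two words with 2 and 3 phonemes; all numbers are arbitrary placeholders.
word_reps = [[1.0, 2.0], [3.0, 4.0]]
phonemes_per_word = [2, 3]
phoneme_embeddings = [[0.1], [0.2], [0.3], [0.4], [0.5]]

encoder_input = build_encoder_input(phoneme_embeddings, word_reps, phonemes_per_word)
```

The resulting sequence has one vector per phoneme, so it can replace the plain phoneme embedding sequence as encoder input without changing the sequence length.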

3 Experiment

3.1 Training setup

We train models on a public Chinese female corpus [16], which contains 10 hours of professional speech covering 10,000 sentences. 500 sentences are used for validation and the remaining sentences are used for training. We down-sample the speech to a 16 kHz sampling frequency. The Tacotron 2 part of our model is trained with the vanilla setup, except that the sampling frequency is set to match our speech. The learning rate is kept fixed and the loss weight of NML is 0.05.

We train the WRF based TTS [12] as the baseline approach. The parser used in WRF [17] is replaced by the state-of-the-art syntactic parsing model Benepar [18].

We implement all the models based on an open-source TensorFlow implementation of Tacotron 2 [19]. We train all the models for 50k iterations with a batch size of 16 on an NVIDIA 2080 Ti GPU.

3.2 Subjective evaluation

We randomly select 30 sentences from the Internet as the test set, 5 of which have multiple different syntactic parse trees. The synthesized speech samples are presented in random order and rated by 20 native speakers on a scale from 1 to 5, from which a subjective mean opinion score (MOS) is calculated.
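The MOS computation itself is simple arithmetic. The sketch below uses invented ratings purely for illustration (they are not the listening-test data), averages the 1-to-5 scores, and reports the population variance. Whether the reported variances are population or sample variances is not stated, so that choice is an assumption, as is the helper name `mos_summary`.

```python
from statistics import mean, pvariance


def mos_summary(ratings):
    """Return (mean opinion score, population variance) for 1-to-5 ratings."""
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError("ratings must lie on the 1-to-5 scale")
    return mean(ratings), pvariance(ratings)


# Hypothetical ratings from five listeners for one system.
example_ratings = [4, 3, 4, 5, 4]
score, variance = mos_summary(example_ratings)
```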

As shown in Table 1, the proposed system receives a MOS of 3.82 while the baseline approach receives a MOS of 3.70, with a comparable variance. A t-test reveals that our proposed approach significantly outperforms the baseline with a p-value of 0.0079.

We also conduct an ABX preference test between pairs of systems on the synthesized speech. The listeners are presented with the speech synthesized by the baseline and proposed approaches in random order, and decide which one has the better prosody and naturalness. As shown in Table 2, the preference rate of the proposed approach exceeds that of the baseline approach by 17%.

WRF Proposed
MOS
Table 1: Comparison between the baseline and the proposed method. MOS variances are given in brackets.
WRF Proposed Neutral
ABX-PR
Table 2: ABX preference comparison between the baseline and the proposed method. ABX-PR denotes the preference rate of the ABX test.

Figure 2: PCA results of constituent label embeddings in the NML ablation study: (a) without NML, (b) with NML.

3.3 Ablation study

To assess the contribution of NML, we train our model without NML under the same settings. Another ABX preference test is conducted with the same test set and listeners. The listeners are presented with the speech synthesized by our models with and without NML, and decide which one has the better prosody and naturalness. As shown in Table 3, the approach with NML receives a higher preference rate than the approach without NML.

Proposed Proposed Neutral
w/o NML w/ NML
ABX-PR
Table 3: ABX preference result with or without NML.

We visualize the learnt embeddings of constituent labels with and without NML by principal component analysis (PCA). As shown in Fig. 2, the embeddings with NML are more scattered than the embeddings without NML, which demonstrates the effectiveness of NML in improving discriminability and diversity.
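A two-dimensional PCA projection of this kind can be sketched with NumPy as below. How the projection was actually computed is not described, so the SVD-based approach, the function name `pca_2d`, and the random placeholder matrix (which stands in for the learnt label embeddings) are all assumptions.

```python
import numpy as np


def pca_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Placeholder "embeddings": 6 labels with 4 dimensions each, random values.
rng = np.random.default_rng(0)
label_embeddings = rng.normal(size=(6, 4))
projected = pca_2d(label_embeddings)
```

The rows of `projected` can then be drawn as a scatter plot, one point per constituent label, to compare how spread out the embeddings are with and without NML.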

3.4 Analysis and discussion

Figure 3: Mel spectrogram and pitch contour comparison on the sentence “At present, the funny memorization methods invented by students to memorize words are leftovers” between the proposed model (below) and WRF (upper).

Figure 4: Prosodic differences of the sentence with multiple syntactic parse trees: (a) “White swan swims on the lake.”; (b) “During the day, goose swims on the lake.”

We further conduct case studies by comparing the spectrograms and pitch contours of the speech generated by the different methods, as shown in Fig. 3. The speech from the baseline approach has an unexpectedly long pause between the 5th and 6th characters of this long sentence.

In addition, the word spanning the 16th and 17th characters, “ji4 yi4”, both of which carry the fourth tone in Chinese, is synthesized with an unnatural up-and-down tone by WRF. In contrast, the spectrogram and pitch contour of the speech synthesized by our proposed method are more stable, indicating the stability of the proposed syntactic representations from bidirectional traversals. For the last three characters, there is an unexpected stress (high pitch value) on the 23rd character “xia4” in the WRF result, while the proposed method shows a gradually decreasing pitch contour at the end of the sentence, leading to higher naturalness. This is caused by the WRF features in the baseline approach, which consider only uni-directional information in the syntactic parse tree.

With the proposed syntactic representation learning method, a sentence with the same text but different syntactic parse trees may lead to synthesized speech with different prosody, which opens a possibility for prosody control in speech synthesis. To validate this, we conduct further experiments by feeding sentences with multiple different syntactic parse trees to the proposed model. One example is shown in Fig. 4, from which the prosodic differences of the synthesized speech can be clearly observed. The upper structure regards the first three characters as one word, and the sentence means “White swan swims on the lake”, while the lower structure treats these characters as two words, and the sentence means “During the day, goose swims on the lake”. Although the graphemes and phonemes are the same, the meaning of the sentence differs with each tree, and the prosodic differences match the respective meanings. In addition, the prosody of the last five characters is similar in both synthesized utterances, since their corresponding syntactic structure information is similar.

4 Conclusion

In this study, we investigate a syntactic representation learning method to automatically utilize the syntactic structure information for neural network based TTS. A nuclear-norm maximization loss is introduced to enhance the discriminability and diversity of synthesized speech prosody. Experimental results demonstrate the effectiveness of our proposed approach. For sentences with multiple syntactic parse trees, prosodic differences can be clearly observed in the synthesized speech.

References