Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning

08/07/2020 ∙ by Devang S Ram Mohan, et al.

Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised. This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation. Interleaving the action of reading a character with that of synthesising audio reduces this latency. However, the order of this sequence of interleaved actions varies across sentences, which raises the question of how the actions should be chosen. We propose a reinforcement learning based framework to train an agent to make this decision. We compare our performance against that of deterministic, rule-based systems. Our results demonstrate that our agent successfully balances the trade-off between the latency of audio generation and the quality of synthesised audio. More broadly, we show that neural sequence-to-sequence models can be adapted to run in an incremental manner.




1 Introduction

Efforts towards incremental text to speech (TTS) have typically focused on more traditional, non-neural architectures [baumann2012inprotk, baumann-schlangen-2012-inpro, pouget2015hmm]. However, advancements in neural TTS [shen2018natural, sotelo2017char2wav, vasquez2019melnet] have resulted in near human levels of naturalness and thus motivate an exploration of neural incremental TTS systems.

Neural TTS systems typically adopt sequence-to-sequence architectures which require the entire input sequence to be processed before generating any units of the output sequence. This offline characteristic is often useful; for example, a question mark at the end of a sentence would impact the intonation of preceding words. On the other hand, synthesising speech incrementally from text could be valuable. Such a model could be placed at the tail-end of an incremental speech recognition and machine translation pipeline to obtain a real-time speech-to-speech translation system.

The development of these streaming, end-to-end architectures has seen considerable attention for the tasks of automatic speech recognition [he2019streaming, zhang2020transformer, sak2017recurrent, jaitly2015neural] and machine translation (MT) [DBLP:journals/corr/ChoE16, gu2016learning, zheng2019simpler, arivazhagan-etal-2019-monotonic]. Inspired by the approach of [gu2016learning], our proposed framework develops an agent that decides whether to trigger the encoder with the next input character (i.e., READ in Figure 1), or trigger the decoder with the characters read thus far (i.e., SPEAK in Figure 1). In this manner, our approach enables us to start generating mel-spectrograms while having read only a part of the input sentence. The mapping of these mel-spectrogram frames to raw audio waveforms can be achieved with an existing neural vocoder [shen2018natural, DBLP:journals/corr/abs-1802-08435] by adjusting its inference behaviour.

The challenge then lies in deciding when to incorporate an additional character into this restricted input subsequence. We use the REINFORCE algorithm [williams1992simple] to train an agent to make this decision.
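The interleaved decision loop described above can be sketched as follows; `policy`, `encode_char` and `decode_frame` are hypothetical stand-ins for the learnt agent and the encoder/decoder steps, not the paper's actual components:

```python
def incremental_synthesis(text, policy, encode_char, decode_frame):
    """Interleave READ and SPEAK actions over an input character sequence."""
    buffer, frames = [], []
    chars_read = 0
    while True:
        if chars_read < len(text) and policy(buffer, frames) == "READ":
            # READ: encode the next character into the restricted buffer
            buffer.append(encode_char(text[chars_read]))
            chars_read += 1
        else:
            # SPEAK: decode one frame from the characters read so far
            frame = decode_frame(buffer, frames)
            if frame is None:  # stop criterion (e.g. stop token) triggered
                break
            frames.append(frame)
    return frames
```

A toy policy that reads three characters and then speaks would drive this loop exactly as in Figure 1: vertical READ steps followed by horizontal SPEAK steps.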

Figure 1: Trajectory for an arbitrary sequence of READ (red, along y-axis) and SPEAK actions (blue, along x-axis)

2 Background

[ma2019incremental] proposes an approach for incremental neural TTS. The model is based on the prefix-to-prefix framework [ma2018stacl] and leverages a policy which maintains a fixed latency (in terms of number of words) behind the input. However, it would be challenging to construct such a rule-based approach if the desired latency was to be measured in a more granular unit, such as characters or phonemes. Furthermore, a dynamic, learnt policy would allow this approach to be used for new languages and speakers without manual calibration of these parameters.

The arena of incremental machine translation has also seen advancements. [DBLP:journals/corr/ChoE16] proposes the READ/WRITE framework and once again uses rule-based policies to enable incremental machine translation. [gu2016learning] models this discrete action selection task using a reinforcement learning (RL) system, which we adapt in our work. Alternatively, [zheng2019simpler] turns this non-differentiable framework into a supervised learning problem by training a model on sequences of interleaved READ/WRITE decisions generated from a pre-trained model.

A major challenge in any sequence transduction task is to align the target sequence with the source at each step. [he2019streaming, zhang2020transformer] propose methods that leverage the RNN-T model [graves2012sequence] to address this for the task of speech recognition. As an alternative, the approaches in [sak2017recurrent, jaitly2015neural] propose architectures which utilise the fact that in speech recognition, the length of the target sequence is less than that of the source. [DBLP:journals/corr/abs-1712-05382, DBLP:journals/corr/BahdanauCSBB15, tjandra-etal-2017-local] use encoder-decoder architectures with attention, but compute the attention alignments in an online manner.

[1906.00672] adapts the online, monotonic attention mechanism proposed by [raffel2017online] for the Tacotron 2 model. However, the motivation behind this was to ensure the surjectivity of the mapping between input elements and output frames; thus, the encoder and decoder architectures remain offline. Furthermore, the atomic input unit is a phoneme, which can only be computed given the entire word. RL-based approaches have also been used to generate attention weights for image captioning [xu2015show, DBLP:journals/corr/ZarembaS15, ling2017coarse]. However, these attention mechanisms generate hard attention weights, which is undesirable for TTS [battenberg2019locationrelative].

3 Tacotron 2 Modifications

Our base model builds on the Tacotron 2 model, with certain modifications for the incremental setting. Note that while these modifications may affect the quality of synthesised speech, they are necessary restrictions for incremental synthesis.

The encoder is altered by simply removing the convolutional layers and replacing the bi-directional LSTM [schuster1997bidirectional, hochreiter1997long] with a uni-directional one. We further discard the post-net module, leaving only the attention mechanism that renders this model offline. Rather than modifying the computation of the alignment weights and potentially enforcing a hardness constraint, we maintain the soft attention weights and suitably restrict its scope as described in Section 4.

Finally, note that Tacotron 2 also has a vocoder component, which maps the mel-spectrogram to the raw audio waveform. We use a different vocoder architecture [DBLP:journals/corr/abs-1802-08435] and adapt its inference behaviour to work in a purely auto-regressive manner by restricting the number of mel-spectrogram frames input to its residual and up-sampling networks.

For the remainder of this paper, we use this modified Tacotron 2 architecture to generate mel-spectrograms with the understanding that any incremental vocoder can be leveraged for synthesis.

4 Incremental Text to Speech using Reinforcement Learning

Inspired by [gu2016learning], we maintain an increasing buffer of input characters, which the model attends over to synthesise the next mel-spectrogram frame. We then train an agent to make the decision of whether to add the next input character into this buffer, or to synthesise a frame of audio based on the information in the buffer. To train this agent, we leverage the RL paradigm.

4.1 RL Setup and Notation

The RL setup consists of a decision maker, called the agent, interacting with an environment, typically over a sequence of discrete steps which we index by t. At the t-th interaction step, the agent selects an action a_t, which the environment executes, returning a new observation o_{t+1} (a representation of how its internal state has changed) and a numerical reward r_t. In addition, the environment returns a flag which indicates whether this particular episode of interactions has completed, called the terminal flag. The task for the agent, then, is to learn a mapping from the space of all possible observations to a suitable action. Such a mapping, called a policy, should attempt to maximise the cumulative numerical reward achieved over the course of an episode (typically discounted temporally by a factor γ) [sutton1998introduction].
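The cumulative discounted reward that the policy maximises can be computed from an episode's rewards; a minimal sketch:

```python
def discounted_returns(rewards, gamma):
    """Compute R_t = r_t + gamma * R_{t+1}, working backwards from episode end."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]  # restore chronological order
```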

Formally, let x_1, …, x_n denote the sequence of input character embeddings and h_1, …, h_n denote the corresponding encoder outputs from our modified Tacotron 2 (Section 3). Our modifications enable h_i to be computed without knowledge of x_{i+1}, …, x_n. Let the associated ground-truth mel-spectrogram consist of m frames y_1, …, y_m. At the t-th step of an episode, let i_t denote the number of characters that have been read and j_t represent the number of audio frames generated (aligned by teacher-forcing [shen2018natural] during training). Let α_{j,i} denote the alignment weight over h_i while generating the j-th decoder output, ŷ_j.

Instead of using the full set {h_1, …, h_n} to compute these weights (and thence the attention context), we use our restricted buffer {h_1, …, h_{i_t}}. This approach guarantees that, at the time of synthesising the j_t-th frame of audio, our Tacotron 2 model only has access to the first i_t characters.
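A minimal sketch of soft attention restricted to a buffer of already-read characters; the dot-product scoring here is an illustrative assumption (Tacotron 2 itself uses a location-sensitive mechanism):

```python
import numpy as np

def restricted_attention(query, encoder_outputs, num_read):
    """Soft attention whose scope is restricted to the first num_read outputs."""
    h = encoder_outputs[:num_read]           # restricted buffer
    scores = h @ query                       # one score per visible character
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights = weights / weights.sum()
    context = weights @ h                    # convex combination of encoder outputs
    return weights, context
```

Because later encoder outputs are simply sliced away, characters beyond position `num_read` can receive no attention mass.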

4.2 Agent

The actions available to the agent are:

  • READ: (step along the vertical axis in Figure 1) Provides the attention mechanism with an additional character over which it may attend.

  • SPEAK: (step along the horizontal axis in Figure 1) Results in the generation of a mel-spectrogram frame based on the characters read thus far.

A desirable learnt policy, then, might be one where the agent learns to SPEAK as soon as there is enough READ context, and to resume READing only when the existing context has been fully synthesised. Observe that the offline behaviour can also be obtained as a specific policy (READ all characters, then SPEAK until all frames are synthesised).

4.3 Environment

The environment uses a trained modified Tacotron 2 model to provide the agent with the requisite information and feedback.

4.3.1 Observations

Suppose we have just received action a_t. The environment increments the appropriate counter (i_t or j_t, based on a_t) and passes {h_1, …, h_{i_t}} to the attention module, which computes the alignment weights α_{j_t,1}, …, α_{j_t,i_t}. The context vector is then

c_{j_t} = Σ_{i=1}^{i_t} α_{j_t,i} h_i
Since we want the observation o_t to contain enough information for the agent to decide whether to READ or SPEAK, we define o_t to be the concatenation of:

  • c_{j_t}: The attention context vector based on the characters read thus far.

  • w_t: A fixed-length moving window of the latest attention weights. This term was found to be crucial for learning a good policy.

  • y_{j_t} (during training) or ŷ_{j_t} (during evaluation): The most recent mel-spectrogram frame.
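The concatenation above can be sketched as follows; the window length and the zero-padding of early steps are illustrative assumptions:

```python
import numpy as np

def build_observation(context, attn_weights, last_frame, window=5):
    """Concatenate context vector, recent attention weights and last frame."""
    # fixed-length window of the most recent attention weights,
    # left-padded with zeros when fewer than `window` weights exist yet
    w = np.zeros(window)
    recent = np.asarray(attn_weights[-window:])
    if recent.size:
        w[window - recent.size:] = recent
    return np.concatenate([context, w, last_frame])
```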

4.3.2 Rewards

Underpinning our RL framework is the understanding that the quality of the generated output may trade off against the delay incurred. Thus, we define our reward as

r_t = r_t^D + r_t^Q

where r_t^D encourages low latency while r_t^Q encourages high quality synthesis. Motivated by the treatment in [gu2016learning], we define r_t^D as the sum of two penalty terms:
  • r_t^{CR} is a local signal to discourage consecutive READ actions:

    r_t^{CR} = α · max(c_t − c*, 0)

    where c_t is a counter for consecutive READs, c* is an acceptable number of consecutive READs and α ≤ 0 is a hyper-parameter.

  • r_t^{AP} is a global penalty incurred only at the end of an episode:

    r_t^{AP} = β · max(d − d*, 0)

    Geometrically, d corresponds to the average proportion of area under the policy path (Figure 1). A value of 1 for d corresponds to READing the entire input sequence before generating any output, while d = 0 corresponds to the unattainable scenario of synthesising all the audio without READing any characters. d* is a target value for d and β ≤ 0 is a hyper-parameter.

Prior works in MT [gu2016learning, ma2018stacl] have a detailed description of these terms.
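A hedged sketch of the delay reward described above, using hypothetical names `c_star` for the acceptable number of consecutive READs, `d_star` for the target area proportion, and non-positive weights `alpha` and `beta`:

```python
def delay_reward(consec_reads, c_star, alpha, episode_done, area_prop, d_star, beta):
    """Delay reward: local consecutive-READ penalty plus end-of-episode area penalty."""
    # local penalty for exceeding an acceptable run of consecutive READs
    local = alpha * max(consec_reads - c_star, 0)
    # global penalty on the average proportion of area under the policy path,
    # incurred only at the end of the episode
    global_pen = beta * max(area_prop - d_star, 0.0) if episode_done else 0.0
    return local + global_pen
```

With alpha and beta non-positive, both terms are penalties; mid-episode steps see only the local term.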

To compute r_t^Q, we use the mean squared error (MSE) between the ground truth and generated mel-spectrograms (aligned using teacher forcing). While the MSE is limited as a measure of perceived quality [DBLP:journals/corr/abs-1708-05987], its usage as a training objective for our underlying Tacotron 2 model suggests it is suitable for our setting. We obtain a quality penalty term given by

r_t^Q = −MSE(y_{j_t}, ŷ_{j_t})

where y_{j_t} and ŷ_{j_t} are the aligned ground-truth and generated frames respectively. When a READ is executed, r_t^Q is set to 0.
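A minimal sketch of this quality term (teacher-forced frame alignment is assumed to have happened upstream):

```python
import numpy as np

def quality_reward(action, gt_frame, gen_frame):
    """Negative MSE between ground-truth and generated frame on SPEAK, 0 on READ."""
    if action == "READ":
        return 0.0  # no frame is produced, so no quality penalty
    return -float(np.mean((gt_frame - gen_frame) ** 2))
```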

4.3.3 Terminal Flag

At train time, there are two ways that the episode can terminate:

  • i_t = n (all the characters have been read): at this point the agent is forced to SPEAK until j_t = m. It is then given a cumulative reward for these SPEAK actions.

  • j_t = m (all the aligned mel-spectrograms have been consumed): at this point, the agent is given an additional penalty equal to the number of unread characters, n − i_t, and the episode is terminated.

During inference, the episode runs until our Tacotron 2 model’s termination criterion (i.e., the stop token) is triggered.

4.4 Agent Setup and Learning

The agent receives an observation o_t, which is passed through a policy network consisting of a 512-dimensional GRU unit, a 2-layer dense network with ReLU non-linearity, and a softmax layer, to produce a 2-dimensional vector of action probabilities.

To learn the policy parameters θ, we use the policy gradient method [williams1992simple], which maximises the expected cumulative discounted reward. However, as a variance reduction technique, we replace the discounted returns R_t in the update with a normalised advantage value [mnih2014neural]. To compute this, we subtract a baseline return b_φ(o_t) (where φ parameterises a 3-layer fully connected network), and then normalise the result [mnih2014neural, dsilverlectures]. To learn the baseline network parameters φ, we minimise the expected squared loss between b_φ(o_t) and R_t.

For both terms, the expectation is approximated by sampling a trajectory under the policy π_θ. All parameters are trained jointly on collected batches of transitions.
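The baseline-subtracted, normalised advantage can be sketched as:

```python
import numpy as np

def normalised_advantages(returns, baselines, eps=1e-8):
    """Subtract baseline predictions, then normalise to zero mean, unit variance."""
    adv = np.asarray(returns, dtype=float) - np.asarray(baselines, dtype=float)
    return (adv - adv.mean()) / (adv.std() + eps)
```

These values then weight the log-probabilities of the sampled actions in the policy gradient update.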

5 Experiments

5.1 Settings

We use the LJ Speech dataset [ljspeech17], which consists of English audio from a single speaker. We partition this dataset into 12,000 train and 1,100 test/validation data points. We train our modified Tacotron 2 model for 300,000 iterations following the training routine in [shen2018natural].

We set the weights of each reward component (α, β and the quality term) so that the scale of each contribution is comparable. The target number of consecutive characters read, c*, and the target average proportion of area under the policy path, d*, are interpretable levers that allow the model's behaviour to be tweaked. The look-back length of the attention window is likewise a fixed hyper-parameter.

During training, actions are sampled according to the probabilities returned by the policy to encourage exploration of the observation space. While evaluating, actions are chosen greedily. We discount rewards by a factor γ and train on batches of transitions, collected over a fixed number of episodes, using an Adam optimiser [kingma2014adam] with a fixed initial learning rate.

5.2 Benchmark Policies

To gauge the performance of our agent, we used two types of benchmark policies, inspired by [ma2019incremental, gu2016learning]:

Wait-Until-End (WUE): Execute READ actions until the text buffer is empty and then decode everything. Since this policy has access to the entire input sentence at the time of decoding, this gives an upper bound on the quality of the synthesised speech, at the cost of the largest possible delay.

Wait-k-Steps (WkS): Execute a READ action every k steps, and decode in between. Despite incurring a smaller delay, the restricted access to the input sentence while decoding may impact the quality of the generated speech.
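A minimal sketch of this rule-based policy (the string actions and argument names are illustrative):

```python
def wait_k_policy(k, step, chars_remaining):
    """READ once every k steps while characters remain, SPEAK otherwise."""
    if chars_remaining > 0 and step % k == 0:
        return "READ"
    return "SPEAK"
```

The WUE benchmark is the degenerate case that keeps returning READ until the text buffer is empty.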

5.3 Qualitative Analysis

(a) Wait Until End (WUE) Policy
(b) Wait 2 Steps (W2S) Policy
(c) Wait 3 Steps (W3S) Policy
(d) Learnt Policy
Figure 2: Policy Path with Attention Alignments (English): Each plot depicts the policy path and the attention alignments (by colour). The greyed-out section represents portions of the input sentence that are unavailable because those input characters have not yet been read.

Figure 2 depicts the attention alignments and policy path for a sample sentence (English and French audio samples can be found on our samples page). Figures 2(a) and 2(b) show that, for a large part of the decoding process, the WUE and W2S policies have access to more characters than required, which highlights an avoidable latency. Figure 2(c) suggests that the W3S policy is able to reduce these unnecessary READs. However, the resulting policy path appears to collide with the ‘prominent’ alignments on multiple occasions. As a result, the audio quality at these points is compromised because the decoder does not have sufficient context. This motivates the idea that an ideal policy path should hug the prominent alignment diagonal closely from above, to successfully balance the quality of synthesis against the latency incurred. Our learnt policy (Figure 2(d)) does precisely that. This suggests that the agent has in fact learnt to READ only when necessary and to SPEAK only when it has something relevant to output.

5.4 Quantitative Analysis

There are two aspects of the agent’s performance that we track:

Quality: We compute the Mean Opinion Score (MOS) to measure the naturalness of our audio [streijl2016mean, shen2018natural]. We considered using a MUSHRA test [series2014method]. However, since some policies may generate unintelligible samples of audio, which in turn could be scored below a noisy anchor, this approach was set aside. We are also interested in measuring the intelligibility of the synthesised speech. Automatic speech recognition systems use word-error rate (WER) to measure the transcription quality [ali2018word]. Following this approach, we obtain human transcriptions of the speech and compute the WER against the ground truth.

Latency: We use the proportion of area under the policy path, d, described in Section 4.3. This metric lacks interpretability in terms of the actual delay incurred (e.g. the number of extra characters read). An alternative average lagging metric has been proposed in the MT setting [ma2018stacl]. However, the skewed ratio between the source and target lengths in TTS, coupled with a soft alignment between source and target, makes this metric challenging to adapt to TTS.
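The proportion-of-area latency metric can be computed directly from an action sequence; a sketch consistent with the geometric description in Section 4.3:

```python
def area_under_path(actions, n_chars, m_frames):
    """Proportion of the n_chars-by-m_frames grid lying under the policy path."""
    # for each SPEAK (a horizontal step in Figure 1), the height of the path
    # is the number of characters read so far; normalise by the full grid
    chars_read, area = 0, 0
    for a in actions:
        if a == "READ":
            chars_read += 1
        else:
            area += chars_read
    return area / (n_chars * m_frames)
```

Reading the whole sentence before speaking yields 1.0; speaking everything before any READ (unattainable) would yield 0.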

5.4.1 Results

Figures 3 and 4 depict the inherent trade-off between quality and latency. The ground truth marker depicts the value of the relevant metric for the vocoded ground truth mel-spectrograms.

Figure 3: Average WER vs Latency (d) on a test set comprising 40 samples from LJ Speech labelled by 5 annotators
Figure 4: Average MOS vs Latency (d) on a test set comprising 40 samples from LJ Speech labelled by 10 evaluators

We begin by observing that the W3S policy incurs the least delay, closely followed by our online agent, while the W2S and WUE policies incur substantial delays. In terms of intelligibility, our online agent achieves a better WER than W3S, and even outperforms W2S despite the latter's sizeable latency advantage. In terms of naturalness, our agent similarly outperforms W3S on MOS; in this case, however, W2S was, as expected, able to leverage its additional latency to produce more natural sounding speech.

These findings establish that our agent is able to learn a policy that successfully balances the quality of the synthesised output against the latency incurred. The W2S policy is comparable (intelligibility) or marginally better (naturalness) relative to our online agent, but achieves this by performing a large number of premature READ actions. Our agent incurs a slightly larger delay than the W3S policy, yet outperforms it on all quality metrics.

6 Future Work

Our results show that, for neural sequence-to-sequence, attention-based TTS models, there is no algorithmic barrier to incrementally synthesising speech from text. It would also be interesting to analyse the learnt policy for different languages, given the varied challenges they pose (e.g. elisions and liaisons in French [tranel1996french]). We provide samples from an agent trained on the French SIWIS dataset [siwisdataset], with the same setup as described above, on our samples page.

Furthermore, we used a modified Tacotron 2 model, pre-trained on full sentences. It would be interesting to analyse whether jointly learning the Tacotron weights helps synthesise partial fragments of a sentence better.

7 Acknowledgements

We would like to thank Simon King, Mark Herbster and Mark Gales for their valuable input on this research.