Text-to-speech synthesis (TTS), which generates speech from text, is an important task with wide applications in dialog systems, speech translation, natural language user interfaces, assistive technologies, etc. Recently, TTS research has benefited greatly from advances in deep learning, with neural TTS systems becoming capable of generating audio with near human-level naturalness oord+:2016; shen+:2018.
State-of-the-art neural TTS systems generally consist of two stages: the text-to-spectrogram stage which generates an intermediate acoustic representation (linear or mel-spectrogram) from text, and the spectrogram-to-wave stage (vocoder) which converts the aforementioned acoustic representation into actual wave signals. In both stages, there are sequential approaches based on the seq-to-seq framework, as well as more recent parallel methods; the first stage, being relatively fast, is more commonly sequential wang+:2017; shen+:2018; ping+:2017; li+:2019 with notable exceptions ren+:2019; peng+:2019, while the second stage, being much slower, is more commonly parallel oord+:2018; prenger+:18; ping+:18.
Despite these successes, standard full-sentence neural TTS systems still suffer from two types of latencies: (a) the computational latency (synthesizing time), which still grows linearly with the sentence length even using parallel inference (esp. in the second stage), and (b) the input latency in scenarios where the input text is incrementally generated or revealed, such as in simultaneous translation bangalore+:2012; ma+:2019, dialog generation skantze+:2010; buschmeier+:2012, and assistive technologies elliott:2003. These latencies limit the applicability of neural TTS systems; for example, in simultaneous speech-to-speech translation, the TTS module has to wait until a full sentence of translation is available, causing the undesirable delay of at least one sentence.
To reduce these latencies, we devise the first neural incremental TTS approach based on the recently proposed prefix-to-prefix framework ma+:2019. Our idea is based on two observations: (a) in both stages, the dependencies on input are very local (see Fig. 1 for monotonic attention between text and spectrogram, for example); and (b) audio playing is inherently sequential in nature, but can be done simultaneously with audio generation, i.e., playing a segment of audio while generating the next. In a nutshell, we start to generate the spectrogram for the first word after receiving the first two words, and this spectrogram is fed into the vocoder right away to generate the waveform for the first word, which is also played immediately (see Fig. 2). This results in an $O(1)$ rather than $O(n)$ latency, for the first time in neural TTS. Experiments on English TTS show that our approach achieves speech naturalness similar to full-sentence methods, while using only a fraction of the time and a constant (1–2 words) latency.
Prior to the deep learning era, there were also several efforts on (non-neural) incremental TTS, using very different techniques; see Sec. 6.
We first briefly review the full-sentence neural TTS pipeline to set up the notation. Then we review the prefix-to-prefix framework, which was introduced for simultaneous translation by ma+:2019.
2.1 Full-sentence TTS Pipeline
As shown in Fig. 3, the neural-based text-to-speech synthesis system generally has two main steps: (1) the text-to-spectrogram step which converts a sequence of textual features (e.g. characters, phonemes, words) into another sequence of spectrograms (e.g. mel-spectrogram or linear-spectrogram); and (2) the spectrogram-to-wave step, which takes the predicted spectrograms as inputs and generates the audio wave through one specific vocoder.
|scalar|a sample in wave|
|vector|a spectrogram frame|
|sequence of scalars|phonemes in a word, waveform of a word|
|seq. of vectors|spectrogram of a word|
|seq. of sequences of scalars|phonemes in a sentence, waveform of a sentence|
|seq. of sequences of vectors|spectrogram of a sent.|
2.1.1 Step I: Text-to-Spectrogram
Conventional neural text-to-spectrogram frameworks (e.g. Tacotron 1 wang+:2017, Tacotron 2 shen+:2018, Deep Voice 3 ping+:2017, Transformer-based TTS li+:2019) employ the seq-to-seq framework to encode the source text sequence (characters or phonemes) and decode the spectrogram sequentially, regardless of the choice of computation unit (RNN, CNN or Transformer).
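Regardless of the actual design of the seq-to-seq framework, with the granularity defined on words, the encoder always takes as input a word sequence $\boldsymbol{x} = [x_1, x_2, \dots, x_n]$, where any word $x_t = [x_{t,1}, x_{t,2}, \dots]$ could be a sequence of phoneme or character indices, and produces another sequence of hidden states $\boldsymbol{h} = f(\boldsymbol{x}) = [h_1, h_2, \dots, h_n]$ to represent the textual features.
On the other side, the decoder produces the new spectrogram $y_t$ given the entire sequence of hidden states and the previously generated spectrogram, denoted by $\boldsymbol{y}_{<t} = [y_1, \dots, y_{t-1}]$, where $y_t = [y_{t,1}, y_{t,2}, \dots]$ is a sequence of spectrogram frames with $y_{t,i} \in \mathbb{R}^{d_y}$, $d_y$ being the number of frequency bands. Formally, on word level, we define the inference process as follows:
$$y_t = \phi(\boldsymbol{x}, \boldsymbol{y}_{<t}) \tag{1}$$
and for each frame within one word, we have
$$y_{t,i} = \phi(\boldsymbol{x}, \boldsymbol{y}_{<t} \circ y_{t,<i}) \tag{2}$$
where $y_{t,<i} = [y_{t,1}, \dots, y_{t,i-1}]$, and $\circ$ represents concatenation between two sequences.
During training, we minimize the difference between the gold spectrogram $\boldsymbol{y}^\star$ and the model's prediction as follows:
$$\ell(\phi) = \sum_{t} \mathrm{loss}\big(\phi(\boldsymbol{x}, \boldsymbol{y}^\star_{<t}),\ y^\star_t\big) \tag{3}$$
Here $\mathrm{loss}(\cdot, \cdot)$ can be one of several loss criteria, e.g. mean squared error. It is standard to use the short-time Fourier transform (STFT) to obtain the linear-frequency spectrogram, followed by a non-linear transform of the frequency axis to obtain the mel-frequency spectrogram used as the gold signal.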
2.1.2 Step II: Spectrogram-to-Wave
Given a sequence of acoustic features $\boldsymbol{y}$, the vocoder generates the waveform $\boldsymbol{w} = [w_1, w_2, \dots, w_n]$, where $w_t = [w_{t,1}, w_{t,2}, \dots]$ is the waveform of the $t$-th word, given only the linear- or mel-spectrogram without any phase information. The vocoder model can be either autoregressive, e.g. WaveNet oord+:2016, or non-autoregressive, such as Parallel WaveNet oord+:2018, ClariNet ping+:18, WaveGlow prenger+:18 and FloWaveNet kim+:2019.
For the sake of both computational efficiency and near-human sound quality, we choose a non-autoregressive model as our vocoder, which can be defined as follows without loss of generality:
$$\boldsymbol{w} = \psi(\boldsymbol{z}, \boldsymbol{y}) \tag{4}$$
where the vocoder function $\psi$ takes as input a random signal $\boldsymbol{z}$ to generate the wave signal conditioned on the given spectrogram $\boldsymbol{y}$, and $\boldsymbol{z}$ is drawn from a simple tractable distribution, such as a zero-mean spherical Gaussian distribution. The length of each $z_t$ is determined by the length of $y_t$, and we have $|z_t| = \gamma \cdot |y_t|$. Depending on the STFT procedure, $\gamma$ can be 256 or 300. More specifically, the wave generation of the $t$-th word with an inverse autoregressive flow (IAF) model can be defined as follows
$$w_t = \psi(z_t, y_t) \tag{5}$$
2.2 Prefix-to-prefix Framework
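ma+:2019 propose a prefix-to-prefix framework for simultaneous machine translation. Given a monotonic non-decreasing function $g(t)$, the model predicts each target word $b_t$ based on the currently available source prefix $\boldsymbol{a}_{\le g(t)}$ and the previously predicted target words $\boldsymbol{b}_{<t}$:
$$p(\boldsymbol{b} \mid \boldsymbol{a}) = \prod_{t=1}^{|\boldsymbol{b}|} p\big(b_t \mid \boldsymbol{a}_{\le g(t)},\ \boldsymbol{b}_{<t}\big)$$
As a simple example in this framework, they present a wait-$k$ policy, which first waits for $k$ source words, and then alternates between emitting one target word and receiving one source word. With this policy, the output is always $k$ words behind the input. This policy can be defined with the following function, where $|\boldsymbol{a}|$ is the length of the source sentence:
$$g_{\text{wait-}k}(t) = \min\{k + t - 1,\ |\boldsymbol{a}|\}$$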
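The wait-$k$ schedule can be made concrete with a few lines of Python. This is only an illustrative sketch: the names `wait_k` and `schedule` are hypothetical, and the formula $\min\{k + t - 1, \text{source length}\}$ is the standard form of the policy from ma+:2019, assumed here with 1-indexed target positions.

```python
def wait_k(t: int, k: int, src_len: int) -> int:
    """Number of source words visible when emitting the t-th target word (t is 1-indexed)."""
    return min(k + t - 1, src_len)


def schedule(k: int, src_len: int, tgt_len: int) -> list:
    """Visible source-prefix length for every target position."""
    return [wait_k(t, k, src_len) for t in range(1, tgt_len + 1)]


# With k = 2 and a 5-word source, the first target word sees 2 source words,
# and the prefix grows by one word per step until the source is exhausted.
print(schedule(k=2, src_len=5, tgt_len=6))  # [2, 3, 4, 5, 5, 5]
```

Once the whole source has been read, the function saturates at the source length and the remaining target words are produced with full context.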
3 Incremental TTS
Both steps in the above conventional TTS pipeline require the fully observed source text or spectrograms as input to start the inference. In this section, we first propose a general framework to do inference at both steps with partial source information, then we present one simple specific example in this framework.
3.1 Prefix-to-Prefix for TTS
As shown in Fig. 1, there is no long-distance reordering between the input and output sides in the text-to-spectrogram task, and the alignment from the output side to the input side is monotonic. One way to exploit this monotonicity is to generate the audio piece for each word independently: feed each word of the input text to the model separately to predict its spectrogram in step I, and then generate the audio piece from that spectrogram. After generating audio pieces for all words, we can concatenate them as the result. However, this simple incremental approach mostly produces robotic, unnatural speech. To generate speech with better prosody, we need to consider some local information around the current word when generating its audio. This is also necessary to connect the audio pieces smoothly.
To solve the above issue, we propose a prefix-to-prefix framework for TTS, which is inspired by the prefix-to-prefix framework ma+:2019 for simultaneous translation. Within this new framework, the spectrogram $y_t$ and the waveform $w_t$ are generated incrementally as follows:
$$y_t = \phi\big(\boldsymbol{x}_{\le g(t)},\ \boldsymbol{y}_{<t}\big), \qquad w_t = \psi\big(z_t,\ \boldsymbol{y}_{\le h(t)}\big)$$
where $g(t)$ and $h(t)$ are monotonic functions that define the number of words whose information is taken as input at each step when generating results for the $t$-th word.
3.2 Lookahead-$k$ Policy
As a simple example in the prefix-to-prefix framework, we define two lookahead policies for the two steps with the $g(\cdot)$ and $h(\cdot)$ functions respectively. These are similar to the wait-$k$ policy introduced by ma+:2019.
$$g_{\text{lookahead-}k_1}(t) = \min\{k_1 + t,\ |\boldsymbol{x}|\}, \qquad h_{\text{lookahead-}k_2}(t) = \min\{k_2 + t,\ |\boldsymbol{y}|\}$$
Intuitively, the function $g_{\text{lookahead-}k_1}$ implies that we wait for the first $k_1 + 1$ words, and then generate the mel spectrogram for each word continuously until the end of the input sentence. Similarly, the function $h_{\text{lookahead-}k_2}$ implies that we first wait for the spectrograms of $k_2 + 1$ words, and then start generating the audio piece for each word continuously. Combining these together, we obtain a lookahead-$k$ policy for the whole TTS system, where $k = k_1 + k_2$. An example of the lookahead-1 policy is provided in Fig. 2, where we take $k_1 = 1$ for the first step and $k_2 = 0$ for the second step.
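To see how the two per-step lookaheads combine, the following sketch composes them and lists how many input words must have arrived before the audio of each word can be produced. It assumes the forms $g(t) = \min\{k_1 + t, n\}$ and $h(t) = \min\{k_2 + t, n\}$, which match the behaviour described in the text (with lookahead-1 in step I, the first word is generated after two words are received); all function names are hypothetical.

```python
def g_lookahead(t: int, k1: int, n_words: int) -> int:
    """Words of text needed to predict the spectrogram of word t (1-indexed)."""
    return min(k1 + t, n_words)


def h_lookahead(t: int, k2: int, n_words: int) -> int:
    """Word-level spectrograms needed to generate the waveform of word t."""
    return min(k2 + t, n_words)


def words_needed(t: int, k1: int, k2: int, n_words: int) -> int:
    """Input words that must be available before the audio of word t can be generated."""
    last_spectrogram = h_lookahead(t, k2, n_words)    # vocoder needs spectrograms up to here
    return g_lookahead(last_spectrogram, k1, n_words)  # text needed for that spectrogram


n = 5
print([words_needed(t, 1, 0, n) for t in range(1, n + 1)])  # [2, 3, 4, 5, 5]
print([words_needed(t, 1, 1, n) for t in range(1, n + 1)])  # [3, 4, 5, 5, 5]
```

Under these assumptions the composition equals $\min\{k_1 + k_2 + t, n\}$, which is why the system-level lookahead is the sum of the two per-step values.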
4 Implementation Details
In this section, we provide some implementation details for the two steps. Note that we assume the pre-trained models for both steps are given, and we only perform inference-time adaptations. For the first step, we assume the model is a Deep Voice 3 ping+:2017 model, and for the second step the model is a ClariNet vocoder.
4.1 Incremental Generation of Spectrogram
Different from the full-sentence scenario, where we feed the entire source text to the encoder, we gradually provide the source text to the model word by word as more input words become available. By our prefix-to-prefix framework, we predict the mel spectrogram for the $t$-th word when there are $g(t)$ words available. Thus, the decoder predicts the $i$-th spectrogram frame of the $t$-th word with only partial source information as follows:
$$y_{t,i} = \phi\big(\boldsymbol{x}_{\le g(t)},\ \boldsymbol{y}_{<t} \circ y_{t,<i}\big)$$
where $y_{t,<i} = [y_{t,1}, \dots, y_{t,i-1}]$ represents the first $i-1$ spectrogram frames in the $t$-th word.
In order to obtain the correspondence between the predicted spectrogram and the currently available source text, we rely on the attention alignment applied in our decoder, which is usually monotonic. For the $i$-th spectrogram frame of the $t$-th word, we can define the attention function $\sigma$ in our decoder as follows
$$c_{t,i} = \sigma\big(\boldsymbol{x}_{\le g(t)},\ \boldsymbol{y}_{<t} \circ y_{t,<i}\big)$$
The output $c_{t,i}$ represents the alignment distribution over the input text for the $i$-th predicted spectrogram frame. We choose the input element with the highest probability as the input element corresponding to this predicted spectrogram frame, that is, $\arg\max c_{t,i}$. When we have $\arg\max c_{t,i} > \sum_{\tau=1}^{t} |x_\tau|$, it implies that the $i$-th spectrogram frame corresponds to the $(t+1)$-th word, and all the spectrogram frames for the $t$-th word have been predicted.
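The boundary test described above can be sketched as follows. This is a minimal illustration, assuming the attention output is a plain list of probabilities over the phonemes received so far and that positions are counted from 1; `word_is_finished` and its arguments are hypothetical names, not part of any Deep Voice 3 implementation.

```python
def word_is_finished(attention, phonemes_per_word, t):
    """Return True when the current frame already attends past the t-th word.

    attention: alignment probabilities over the available input phonemes.
    phonemes_per_word: number of phonemes in each available word.
    t: 1-indexed word whose spectrogram is being generated.
    """
    # 1-indexed position of the most attended input element (the argmax).
    attended = max(range(len(attention)), key=attention.__getitem__) + 1
    # Position of the last phoneme belonging to words 1..t.
    boundary = sum(phonemes_per_word[:t])
    return attended > boundary


lengths = [3, 2, 4]  # three words with 3, 2 and 4 phonemes
still_in_word_1 = [0.1, 0.2, 0.6, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]
moved_to_word_2 = [0.0, 0.1, 0.3, 0.5, 0.1, 0.0, 0.0, 0.0, 0.0]
print(word_is_finished(still_in_word_1, lengths, t=1))  # False
print(word_is_finished(moved_to_word_2, lengths, t=1))  # True
```

The frame that triggers the test belongs to the next word, so a decoding loop built on this check would close the current word with the frames generated before it.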
For models like Deep Voice 3, there are usually multiple attention layers in the decoder. For our implementation, we only consider the alignment obtained from the first attention layer. When the encoder has observed the entire source sentence, a special symbol <eos> is fed into the encoder, and the decoder continues to generate the spectrogram word by word. The decoding process ends when the binary “stop” predictor of the model outputs a probability larger than a preset threshold.
4.2 Generation of Waveform
After we obtain the predicted spectrograms for a new word, we feed them into our vocoder to generate the waveform. Since we use a non-autoregressive vocoder, we can generate each audio piece for the given spectrograms in the same way as in full-sentence generation. Thus, we do not need to modify the vocoder implementation. The straightforward way to generate each audio piece is then to apply Eq. 5 at each step, conditioned on the spectrograms $y_t$ of each word. However, when we concatenate the audio pieces generated in this way, we observe some noise where two audio pieces are joined.
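To avoid such noise, we sample a long enough random vector $\boldsymbol{z}$ as the input vector and fix it when generating audio pieces. Further, we append $\delta$ additional spectrogram frames to each side of the current spectrograms $y_t$ if possible. That is, at most the last $\delta$ frames of $y_{t-1}$ are added in front of $y_t$, and at most the first $\delta$ frames of $y_{t+1}$ are added at the end of $y_t$. This may give a longer audio piece than we need, so we remove the extra parts from it. Formally, the generation procedure of the wave for each word can be defined as follows
$$w_t = \psi\big(z'_t,\ y_{t-1,\,>|y_{t-1}|-\delta_l} \circ y_t \circ y_{t+1,\,\le \delta_r}\big)_{[\gamma\delta_l \,:\, \gamma(\delta_l + |y_t|)]}$$
where $\delta_l = \min\{\delta, |y_{t-1}|\}$ and $\delta_r = \min\{\delta, |y_{t+1}|\}$ (taken as 0 when the neighboring word is not available), $z'_t$ is the segment of the fixed random vector $\boldsymbol{z}$ aligned with the padded frames, and the subscript range selects the samples that correspond to $y_t$.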
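The pad-then-trim procedure can be illustrated with a toy stand-in for the vocoder. Everything below is an assumption made for the demonstration: `toy_vocoder` is not ClariNet, it simply mixes each frame with its immediate neighbours so that boundary effects become visible; frames are represented by single floats; and `GAMMA` is set to 4 samples per frame to keep the example small (the text uses 256 or 300).

```python
import random

GAMMA = 4  # samples per spectrogram frame; tiny value chosen only for this demo


def toy_vocoder(z, frames):
    """Toy non-autoregressive vocoder: each sample depends on a frame and its neighbours."""
    out = []
    for i, noise in enumerate(z):
        f = i // GAMMA
        lo, hi = max(0, f - 1), min(len(frames), f + 2)
        out.append(sum(frames[lo:hi]) / (hi - lo) + 0.01 * noise)
    return out


def generate_word(prev, cur, nxt, z, offset, delta):
    """Generate the waveform of `cur`, padded with up to `delta` neighbouring frames.

    prev / nxt: frames of the neighbouring words (empty list if unavailable).
    z: noise vector fixed for the whole utterance.
    offset: index of the first frame of `cur` within the utterance.
    """
    left = prev[-delta:] if delta > 0 else []
    right = nxt[:delta] if delta > 0 else []
    padded = left + cur + right
    start = (offset - len(left)) * GAMMA       # noise aligned with the padded frames
    wave = toy_vocoder(z[start:start + len(padded) * GAMMA], padded)
    keep_from = len(left) * GAMMA              # drop samples of the padding frames
    return wave[keep_from:keep_from + len(cur) * GAMMA]


def incremental(words, z, delta):
    wave, offset = [], 0
    for t, cur in enumerate(words):
        prev = words[t - 1] if t > 0 else []
        nxt = words[t + 1] if t + 1 < len(words) else []
        wave += generate_word(prev, cur, nxt, z, offset, delta)
        offset += len(cur)
    return wave


random.seed(0)
words = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
flat = [f for w in words for f in w]
z = [random.gauss(0.0, 1.0) for _ in range(GAMMA * len(flat))]
full = toy_vocoder(z, flat)
print(incremental(words, z, delta=1) == full)  # True: padding covers the toy receptive field
print(incremental(words, z, delta=0) == full)  # False: pieces differ at word boundaries
```

In this toy setting one frame of padding is enough because the stand-in vocoder only looks one frame to each side; a real vocoder has a much larger receptive field, so padding reduces rather than eliminates the mismatch. Using the frames of the next word also requires that word's spectrogram to be available, i.e. a lookahead of at least one in the second step.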
5.1 Experimental Setup
For simplicity, we assume that the given inputs are all phonemes in our experiments, and we will include the text-to-phoneme model in a future version. In our experiments, we work at the chunk level, where a chunk consists of one or more words depending on a minimum-length hyper-parameter. That is, a chunk consists of the minimum number of words such that the number of phonemes in the chunk is at least this minimum length. We use chunks instead of single words because some words are too short when pronounced, which may hurt the performance and efficiency of our system.
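The chunking rule can be sketched as a greedy grouping of words. The function name `make_chunks`, the separate thresholds for the first and later chunks (mirroring the settings reported in the experiments), and the handling of trailing words that never reach the threshold are assumptions made for this illustration.

```python
def make_chunks(words, first_min, other_min):
    """Greedily group words into chunks with at least a minimum number of phonemes.

    words: list of words, each a list of phonemes.
    first_min / other_min: minimum phoneme count of the first / later chunks.
    """
    chunks, current, count = [], [], 0
    for word in words:
        current.append(word)
        count += len(word)
        threshold = first_min if not chunks else other_min
        if count >= threshold:
            chunks.append(current)
            current, count = [], 0
    if current:
        # Assumption: leftover words at the end form a final, shorter chunk.
        chunks.append(current)
    return chunks


# Words with 3, 4, 2, 5, 1, 6 and 2 placeholder phonemes.
words = [["p"] * n for n in [3, 4, 2, 5, 1, 6, 2]]
chunks = make_chunks(words, first_min=6, other_min=4)
print([len(c) for c in chunks])                   # words per chunk: [2, 2, 2, 1]
print([sum(len(w) for w in c) for c in chunks])   # phonemes per chunk: [7, 7, 7, 2]
```

A larger threshold for the first chunk gives the model more context before any audio is produced, at the cost of a longer initial wait.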
We use an internal English speech dataset containing 13,708 audio clips from a female speaker and the corresponding phoneme transcripts. The total audio length is about 20 hours, and the audio is sampled at 48 kHz; we downsample it to 24 kHz. We remove all intermediate punctuation marks in the transcripts, and randomly split the dataset into three sets: 13,158 samples for training, 275 samples for validation and 275 samples for testing. Our mel-scale spectrogram has 80 bands, and is computed through a short-time Fourier transform (STFT) with a window size of 1200 and a hop size of 300.
We use the fully-convolutional text-to-spectrogram architecture introduced in Deep Voice 3 (DV3) ping+:2017 as our phoneme-to-spectrogram model. The original DV3 architecture consists of three components: an encoder, a decoder and a converter. We do not use the converter since we only need the mel spectrogram output for our vocoder. Our encoder consists of eleven convolution blocks, and our decoder consists of seven convolution blocks. Different from the original architecture in DV3, our decoder only contains two attention blocks: one for the first convolution block and one for the last convolution block.
We use a 60-layer Gaussian inverse autoregressive flow (IAF) model introduced in ClariNet ping+:18 as our waveform synthesizer (i.e. vocoder). The model consists of four stacked Gaussian IAF blocks, which are parameterized by [10, 10, 10, 30]-layer WaveNets with 64 residual channels, 64 skip channels and filter size 3 in dilated convolutions. We distill this model from a 20-layer Gaussian autoregressive WaveNet, which has the same architecture as the teacher WaveNet model in ClariNet.
We train the DV3 model and the ClariNet vocoder separately. To train the DV3 model, we follow the original DV3 paper ping+:2017 and add to our training loss the guided attention loss, which was introduced by tachibana+:2018 to improve the efficiency of training. We use the Adam optimizer with a batch size of 16 to train this model on an NVIDIA GTX TITAN X GPU. We follow the original ClariNet paper ping+:18 to train the teacher autoregressive WaveNet and distill the student Gaussian IAF, both of which are trained using ground-truth mel-spectrograms and audio waveforms.
For inference, we apply the monotonic constraint introduced for DV3 to obtain a better attention alignment. Specifically, we compute the softmax function over a fixed window instead of over all generated encoder hidden states; the window starts from the last attended position and has a size of 3. For our proposed method in the following sections, we consider two different policies: the lookahead-1 policy for step I and the lookahead-0 policy for step II, giving a lookahead-1 policy for the system; and the lookahead-1 policy for both steps, giving a lookahead-2 policy for the system. In the following sections, we only consider the lookahead-$k$ policy for the whole TTS system rather than for each step.
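The windowed softmax used for the monotonic constraint can be sketched in plain Python. This is an illustrative version under stated assumptions, not the DV3 implementation: `windowed_softmax` is a hypothetical name, attention scores are a flat list, positions are 0-indexed, and positions outside the window simply receive zero probability.

```python
import math


def windowed_softmax(scores, last_pos, window=3):
    """Softmax restricted to positions last_pos .. last_pos + window - 1.

    scores: unnormalised attention scores over encoder positions.
    last_pos: 0-indexed position attended at the previous decoding step.
    """
    lo, hi = last_pos, min(last_pos + window, len(scores))
    peak = max(scores[lo:hi])                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores[lo:hi]]
    total = sum(exps)
    probs = [0.0] * len(scores)
    probs[lo:hi] = [e / total for e in exps]
    return probs


# The highest raw score sits at position 4, outside the window [1, 3],
# so it is ignored and attention cannot jump ahead.
probs = windowed_softmax([5.0, 0.0, 0.0, 0.0, 9.0], last_pos=1)
print(probs)
```

Because the window starts at the last attended position, the attention can only stay in place or move forward by at most two positions per step, which keeps the alignment monotonic.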
5.2 Audio Quality
We first compare the audio quality generated by different methods. For this purpose, we choose 50 sentences from our test set and generate audio samples for these sentences with different methods, which include (1) Ground Truth Audio; (2) Ground Truth Mel, where we convert the ground-truth mel spectrograms into audio samples using our vocoder; (3) Full-sentence, where we first predict all mel spectrograms given the full sentence text and then convert those to audio samples; (4) Lookahead-2, where we incrementally generate audio samples with the lookahead-2 policy; (5) Lookahead-1, where we incrementally generate audio samples with the lookahead-1 policy. For our method, we set the minimum length of the first chunk to 18 phonemes and that of the other chunks to 6. The MOS (Mean Opinion Score) of the evaluation is provided in Table 2.
Table 2: Mean Opinion Score (MOS) ratings with 95% confidence intervals for Ground Truth Audio, Ground Truth Mel, Full-sentence, Lookahead-2 and Lookahead-1.
From Table 2, we can see that with lookahead-2 policy we can generate high quality audio similar to the full-sentence method, and the MOS decreases slightly with lookahead-1 policy. We also provide the relation between lookahead and MOS in Fig. 4. We can see there is a trade-off between lookahead numbers and audio quality (measured by MOS).
In this section, we compare the latency of the full-sentence method and our proposed method. We consider two different scenarios: (1) when all text input is available; and (2) when the text input is provided incrementally at the speed of audio playback. The first scenario is the common monolingual text-to-speech application, while the second scenario is more similar to the simultaneous translation application.
5.3.1 All Input Available
For full-sentence generation, the latency is the generation time of the whole audio sample; for our proposed incremental method, the latency is the generation time of the first phoneme chunk, provided that the next audio piece can be generated before the previous audio piece has finished playing. (We will show this later.) So we compare the generation time for sentences of different lengths, averaged over sentences with the same length. Here we set the minimum length of the first chunk to 18 phonemes and that of the other chunks to 6. We run inference on the test set on both CPU and GPU and provide the results in Fig. 5.
We can see that the latency of the full-sentence method increases with the number of phonemes in the sentence: it is more than 1 second on GPU for sentences longer than 60 phonemes and can be more than 4 seconds on CPU for those sentences. In contrast, our incremental method has a constant latency for sentences of different lengths: its average latency is less than 0.5 seconds on GPU and less than 2.5 seconds on CPU.
We now show that our method can generate the audio piece for each chunk fast enough that the next audio piece is ready before we finish playing the previous ones, i.e. the generated audio can be played continuously without interruption. At each generation step, the available time to generate the current audio piece is equal to the total duration of the previous audio minus the generation time of all previous chunks except the first one. Starting from the second chunk, as long as the available time is larger than the generation time of the current chunk, we can play the generated audio continuously without any interruption. That is, the remaining time (time balance), i.e. the available time minus the generation time of the current chunk, should be larger than zero at each step. To evaluate this, we compute the time balance at several steps and show the results in Fig. 6. We find that the time balance is always larger than 0 on GPU with the lookahead-2 policy, while on CPU the time balance is smaller than on GPU. If we apply the lookahead-1 policy, the time balance is always larger than 0; if we apply the lookahead-2 policy on CPU, the time balance is less than 0 starting from the second chunk. This is because the lookahead-2 policy needs to generate each audio piece with appended mel spectrogram frames, which takes more computing time.
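The time balance defined above reduces to a short computation over per-chunk timings. The sketch below follows the definition in the text; the function name `time_balances` and the timing values are made up for illustration and are not measurements from the experiments.

```python
def time_balances(gen_times, audio_times):
    """Time balance for every chunk from the second one on.

    gen_times[j]: time needed to generate the audio of chunk j (seconds).
    audio_times[j]: playing time of the audio of chunk j (seconds).
    """
    balances = []
    for j in range(1, len(gen_times)):
        # Audio of all previous chunks, minus generation time of previous chunks
        # except the first one (playback only starts after the first is generated).
        available = sum(audio_times[:j]) - sum(gen_times[1:j])
        balances.append(available - gen_times[j])
    return balances


# Generation faster than playback: every balance is positive, so no interruption.
print(time_balances([0.5, 0.25, 0.25, 0.5], [1.0, 0.5, 0.5, 0.75]))  # [0.75, 1.0, 1.0]
# Generation slower than playback: the balance is negative from the second chunk on.
print(time_balances([0.5, 1.5, 1.5], [1.0, 1.0, 1.0]))  # [-0.5, -1.0]
```

A negative entry marks the first chunk at which playback would have to pause and wait for generation.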
5.3.2 Input Provided Incrementally
When the text input (e.g. obtained from a speech or translation system) is given incrementally, the latency of the full-sentence method is larger than its running time, since it has to wait for the whole text before starting generation. Similarly, the latency of our proposed method is at least the sum of the generation time of the first phoneme chunk and the time spent waiting for enough input to start generation. To mimic this scenario, we consider an experiment where the goal is to repeat the sentence from the speaker as soon as possible after the speaker starts speaking (this is also called shadowing). Here we define the averaged chunk delay as our latency metric, which is the average lag between the ending time of each input chunk and the ending time of its corresponding generated audio piece.
We choose the ground-truth audio from the test set as the input and extract the ending time of each chunk in the audio with the Montreal Forced Aligner mcauliffe+:2017. The ending time of each of our chunks can be obtained from the generation time, the audio playing time and the ending time of the input chunk. For this experiment, we set the minimum length of all chunks to 7 phonemes, and the latency results are averaged over sentences with the same length. These results are provided in Fig. 7.
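One way to compute an averaged chunk delay from per-chunk timings is sketched below. The timing model is an assumption made for illustration: a chunk can be generated only after the input chunk it looks ahead to has ended and the previous chunk has been generated, and it is played only after the previous audio piece has finished. The name `average_chunk_delay` and all numbers are hypothetical.

```python
def average_chunk_delay(input_ends, gen_times, audio_times, lookahead=1):
    """Average lag between the end of each input chunk and the end of its output audio.

    input_ends[j]: time at which input chunk j ends (seconds).
    gen_times[j] / audio_times[j]: generation and playing time of output chunk j.
    lookahead: number of future input chunks required before generating chunk j.
    """
    n = len(input_ends)
    gen_done = play_done = 0.0
    delays = []
    for j in range(n):
        ready = input_ends[min(j + lookahead, n - 1)]           # last input chunk needed
        gen_done = max(gen_done, ready) + gen_times[j]           # chunks generated in order
        play_done = max(play_done, gen_done) + audio_times[j]    # audio played in order
        delays.append(play_done - input_ends[j])
    return sum(delays) / n


ends = [1.0, 2.0, 3.0]
print(average_chunk_delay(ends, [0.5, 0.5, 0.5], [1.0, 1.0, 1.0], lookahead=1))  # 2.5
print(average_chunk_delay(ends, [0.5, 0.5, 0.5], [1.0, 1.0, 1.0], lookahead=2))  # 3.5
```

In this toy setting, a larger lookahead delays the start of generation and raises the average delay by the duration of the extra input that has to be awaited.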
We find that the latency of our method is almost constant for different sentence lengths, which is less than 3 seconds on GPU and 4.5 seconds on CPU; while the latency of full-sentence method increases along with the number of the phonemes in the sentence. This is similar to the results from previous scenario, but its magnitude is larger than that in the previous scenario because of the waiting time for the input.
Generation Speed for Chunks
When input is given incrementally, we cannot always guarantee the continuity of the generated audio pieces since we have no control over the speed of the source audio. That is, an interruption is unavoidable when the next chunk arrives after the previous audio piece has finished playing. But we can compare the averaged generation time and audio playing time for chunks to show that generation is faster than audio playback, so that our method can keep a low latency.
In the next experiment, we consider sentences that have more than 4 chunks and compute the averaged generation time and audio playing time for the first three chunks and the last two chunks. The results are provided in Fig. 8.
We find that the averaged generation time of the lookahead-1 policy on GPU is about 25% of the audio time for all five chunks, and that of the lookahead-2 policy is about 30% of the audio time, since it appends mel spectrogram frames for the vocoder. The generation time is much higher on CPU: it is about 80% of the audio playing time for the lookahead-1 policy, except for the last chunk, and about 130% of the audio playing time for the lookahead-2 policy. These results show that with the lookahead-1 policy the generation of the audio piece for each chunk can be faster than real time on both GPU and CPU.
Effects of CPU Numbers
In the above experiments, we use a machine with an 80-core CPU, which may be expensive to use in practice. So we evaluate the ratio of generation time to audio time with different numbers of CPU cores to understand the effect of the core count. The results are provided in Fig. 9. We can see that the averaged ratio is smaller than 1 for the lookahead-1 policy when at least 40 cores are used, but for the lookahead-2 policy this ratio is still larger than 1 even with 80 cores.
In this section, we provide some analysis of the attention and the generated mel spectrogram to understand our method better. Our method traces the relationship between mel spectrogram frames and input phonemes during decoding based on the learned attention alignment. So we compare the alignment of the first attention layer from the full-sentence method and from our lookahead-1 policy in Fig. 10. We can see in this figure that the first two alignments are very similar, implying that our method maintains the same attention as the full-sentence method, although we encode the given phonemes incrementally. And since the alignment is monotonic, we can usually get the correct correspondence from the predicted mel spectrogram to the input phonemes during decoding.
We visualize the predicted mel spectrograms from different methods to make a comparison. We consider mel spectrograms obtained in the following ways: (1) generated from the ground-truth audio wave, (2) predicted with the full-sentence method, (3) predicted with the lookahead-1 policy. These are provided in Fig. 11. From the visualization, we can see that the mel spectrogram predicted with the lookahead-1 policy is very similar to that predicted by the full-sentence method, which is also similar to the ground truth. These results show that our method can predict good mel spectrograms.
6 Related Work
Incremental TTS was previously studied in the statistical parametric speech synthesis framework based on the Hidden Markov Model (HMM). This kind of framework usually consists of several steps: extracting linguistic features, establishing an HMM sequence to estimate acoustic features, and constructing the speech waveform from those acoustic features. Based on this framework, baumann+:2012a propose an incremental spoken dialogue system architecture and toolkit called INPROTK, including recognition, dialogue management and TTS modules. With this toolkit, baumann+:2012b present a component for incremental speech synthesis, which is not fully incremental on the HMM level. pouget+:2015 propose a training strategy based on HMM with unknown linguistic features for incremental TTS. baumann:2014b; baumann:2014a propose to flexibly use linguistic features and choose default values when they are not available. The above works all focus on stress-timed languages, such as English and German, while yanagita+:2018 propose a system for Japanese, a mora-timed language. Although these works show that speech quality can be improved for incremental TTS, these systems require full context labels of linguistic features, making it difficult to improve the audio quality when the input text is revealed incrementally. Further, each component in their systems is trained and tuned separately, resulting in error propagation.
Neural speech synthesis systems provide a solution to this problem. Such systems no longer need the full context labels of linguistic features, and the quality of the speech they synthesize has reached state-of-the-art results. Several different systems have been proposed, including Deep Voice arik+:2017, Deep Voice 2 gibiansky+:2017, Deep Voice 3 ping+:2017, Tacotron wang+:2017, Tacotron 2 shen+:2018, ClariNet ping+:18. However, these systems all need the entire sentence as input to generate the speech, resulting in large latency for some applications such as spoken dialogue systems and simultaneous speech translation systems. More recently, some parallel systems have been proposed for TTS ren+:2019; peng+:2019, which avoid the autoregressive steps and provide faster audio generation. But these systems still suffer from large input latency compared with a truly incremental TTS system, since they generate the waveform at the sentence level instead of the word level, implying that they need to wait for long enough input to start speech generation.
We have presented a prefix-to-prefix inference framework for incremental TTS, and a lookahead-$k$ policy in which the audio generation is always $k$ words behind the input. We show that this policy can maintain good audio quality compared with the full-sentence method and can achieve low latency in different scenarios: when all the input is available and when the input is given incrementally.