Attention Forcing for Sequence-to-sequence Model Training

09/26/2019 ∙ by Qingyun Dou, et al. ∙ University of Cambridge

Auto-regressive sequence-to-sequence models with attention mechanism have achieved state-of-the-art performance in many tasks such as machine translation and speech synthesis. These models can be difficult to train. The standard approach, teacher forcing, guides a model with reference output history during training. The problem is that the model is unlikely to recover from its mistakes during inference, where the reference output is replaced by generated output. Several approaches deal with this problem, largely by guiding the model with generated output history. To make training stable, these approaches often require a heuristic schedule or an auxiliary classifier. This paper introduces attention forcing, which guides the model with generated output history and reference attention. This approach can train the model to recover from its mistakes, in a stable fashion, without the need for a schedule or a classifier. In addition, it allows the model to generate output sequences aligned with the references, which can be important for cascaded systems like many speech synthesis systems. Experiments on speech synthesis show that attention forcing yields significant performance gain. Experiments on machine translation show that for tasks where various re-orderings of the output are valid, guiding the model with generated output history is challenging, while guiding the model with reference attention is beneficial.


1 Introduction

Auto-regressive sequence-to-sequence (seq2seq) models with attention mechanism are widely used in a variety of areas including Neural Machine Translation (NMT) (Neubig, 2017; Huang et al., 2016) and speech synthesis (Shen et al., 2018; Wang et al., 2018), also known as Text-To-Speech (TTS). These models excel at connecting sequences of different length, but can be difficult to train. A standard approach is teacher forcing, which guides a model with reference output history during training. This makes the model unlikely to recover from its mistakes during inference, where the reference output is replaced by generated output. One alternative is to train the model in free running mode, where the model is guided by generated output history. This approach often struggles to converge, especially for attention-based models, which need to infer the correct output and align it with the input at the same time.

Several approaches have been introduced to tackle the above problem, namely scheduled sampling (Bengio et al., 2015) and professor forcing (Lamb et al., 2016). Scheduled sampling randomly decides, for each time step, whether the reference or generated output token is added to the output history. The probability of choosing the reference output token decays from 1 to 0 with a heuristic schedule. A natural extension is sequence-level scheduled sampling, where the decision is made for each sequence instead of each token. Professor forcing views the seq2seq model as a generator. During training, the generator operates in both teacher forcing mode and free running mode. In teacher forcing mode, it tries to maximize the standard likelihood. In free running mode, it tries to fool a discriminator, which is trained to tell if the model is running in teacher forcing mode. To make training stable, the above approaches require either a well tuned schedule, or a well trained discriminator.

This paper introduces attention forcing, which guides the model with generated output history and reference attention. This approach makes training stable by decoupling the learning of the output and that of the alignment. There is no need for a schedule or a discriminator. Furthermore, for cascaded systems like many TTS systems, attention forcing can be particularly useful. A model trained with attention forcing can generate (in attention forcing mode) output sequences aligned with the references. These output sequences can be used to train a downstream model, enabling it to fix some upstream errors. The TTS experiments show that attention forcing yields significant gain in speech quality. The NMT experiments show that for tasks where various re-orderings of the output are valid, guiding the model with generated output history can be problematic, while guiding the model with reference attention yields slight but consistent gain in BLEU score (Papineni et al., 2002).

2 Sequence-to-sequence generation

Sequence-to-sequence generation can be defined as the problem of mapping an input sequence $x_{1:L}$ to an output sequence $y_{1:T}$. From a probabilistic perspective, a model $\theta$ estimates the distribution of $y_{1:T}$ given $x_{1:L}$, typically as a product of distributions conditioned on output history:

$\hat{p}(y_{1:T} | x_{1:L}; \theta) = \prod_{t=1}^{T} \hat{p}(y_t | y_{1:t-1}, x_{1:L}; \theta)$   (1)

Ideally, the model is trained through minimizing the KL-divergence between the true distribution $p(y_{1:T} | x_{1:L})$ and the estimated distribution $\hat{p}(y_{1:T} | x_{1:L}; \theta)$:

$\hat{\theta} = \arg\min_{\theta} \, \mathbb{E}_{x_{1:L} \sim p(x)} \, \mathrm{KL}\big( p(y_{1:T} | x_{1:L}) \,\|\, \hat{p}(y_{1:T} | x_{1:L}; \theta) \big)$   (2)

In practice, this is approximated by minimizing the Negative Log-Likelihood (NLL) of some training data $\{ y^{(n)}_{1:T}, x^{(n)}_{1:L} \}_{n=1}^{N}$, sampled from the true distribution:

$\mathcal{L}_y(\theta) = -\sum_{n=1}^{N} \log \hat{p}(y^{(n)}_{1:T} | x^{(n)}_{1:L}; \theta)$   (3)

While $T$ and $L$ are functions of $n$, the subscripts are omitted to simplify notation, i.e. $y^{(n)}_{1:T^{(n)}}$ and $x^{(n)}_{1:L^{(n)}}$ are written as $y^{(n)}_{1:T}$ and $x^{(n)}_{1:L}$. At inference stage, given an input $x_{1:L}$, the output $\hat{y}_{1:T}$ can be obtained through searching for the most probable sequence from the estimated distribution:

$\hat{y}_{1:T} = \arg\max_{y_{1:T}} \hat{p}(y_{1:T} | x_{1:L}; \hat{\theta})$   (4)

The exact search is computationally expensive, and is often approximated by greedy search if the output space is continuous, or beam search if the output space is discrete (Bengio et al., 2015).

2.1 Attention-based seq2seq model

Attention mechanisms (Bahdanau et al., 2014; Chorowski et al., 2015) are commonly used to connect sequences of different length. This paper focuses on attention-based encoder-decoder models. For these models, the probability $\hat{p}(y_t | y_{1:t-1}, x_{1:L}; \theta)$ is estimated as:

$p(y_t | y_{1:t-1}, x_{1:L}) \approx \hat{p}(y_t | y_{1:t-1}, x_{1:L}; \theta) = \hat{p}(y_t | s_t, c_t; \theta_y)$   (5)
$s_t = f(y_{1:t-1}; \theta_s)$   (6)
$c_t = f(\alpha_t, h_{1:L}; \theta_c)$   (7)

$\alpha_t$ is an alignment vector (a set of attention weights). $s_t$ is a state vector representing the output history $y_{1:t-1}$, and $c_t$ is a context vector summarizing the encodings $h_{1:L}$ for the prediction of $y_t$. The following equations, as well as figure 1, give a more detailed illustration of how $s_t$, $\alpha_t$ and $c_t$ can be computed:

$h_{1:L} = f(x_{1:L}; \theta_h)$   (8)
$s_t = f(s_{t-1}, y_{t-1}; \theta_s)$   (9)
$\alpha_t = f(s_t, h_{1:L}; \theta_\alpha)$   (10)
$c_t = \sum_{l=1}^{L} \alpha_{t,l} h_l$   (11)
$\hat{y}_t \sim \hat{p}(y_t | s_t, c_t; \theta_y)$   (12)

First the encoder maps $x_{1:L}$ to an encoding sequence $h_{1:L}$. For each decoder time step, $s_t$ is updated with $y_{t-1}$. Based on $s_t$ and $h_{1:L}$, the attention mechanism computes $\alpha_t$, and then $c_t$ as the weighted sum of $h_{1:L}$. Finally, the decoder estimates a distribution based on $s_t$ and $c_t$, and optionally generates an output token $\hat{y}_t$ by either sampling or taking the most probable token. Note that the output history $y_{1:t-1}$ plays an important role, as it impacts $\hat{p}(y_t | s_t, c_t; \theta_y)$ through both $s_t$ and $c_t$. Also note that there are many forms of attention-based encoder-decoder models. While attention forcing is illustrated with this particular form, it is not limited to it.

Figure 1: Illustration of an attention-based encoder-decoder model
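To make the computation in equations 8 to 12 concrete, the following is a minimal numpy sketch of a single decoder step. The linear parametrisation, the tanh-based score function and the toy dimensions are illustrative assumptions, not the parametrisation used in the paper; the random `h` is a stand-in for the encoder output of equation 8.

```python
# Minimal sketch of one attention decoder step (cf. equations 9-12).
import numpy as np

rng = np.random.default_rng(0)
L, d, V = 6, 8, 10                   # input length, hidden size, vocabulary size
h = rng.normal(size=(L, d))          # stand-in for the encodings h_{1:L} (eq. 8)

W_s = rng.normal(scale=0.1, size=(d, d))
W_a = rng.normal(scale=0.1, size=(d,))
W_o = rng.normal(scale=0.1, size=(2 * d, V))

def decoder_step(s_prev, y_prev_emb):
    s_t = np.tanh(s_prev @ W_s + y_prev_emb)          # toy state update (cf. eq. 9)
    scores = np.tanh(h + s_t) @ W_a                   # content-based scores
    alpha_t = np.exp(scores) / np.exp(scores).sum()   # alignment vector (cf. eq. 10)
    c_t = alpha_t @ h                                 # context = weighted sum of h (eq. 11)
    logits = np.concatenate([s_t, c_t]) @ W_o
    p_t = np.exp(logits) / np.exp(logits).sum()       # output distribution (cf. eq. 12)
    return s_t, alpha_t, c_t, p_t

s0 = np.zeros(d)                                      # initial state
y0 = np.zeros(d)                                      # embedding of a start token
s1, alpha1, c1, p1 = decoder_step(s0, y0)
print(alpha1.round(3), p1.argmax())
```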

2.2 Training approaches

As shown in equations 2 and 3, minimizing the KL-divergence between the true distribution and the model distribution can be approximated by minimizing the NLL. This motivates the approach to train the model in teacher forcing mode, where $\hat{p}(y_t | \cdot)$ is computed with the correct output history $y_{1:t-1}$, as shown in equations 5 and 6. In this case, the loss can be written as:

$\mathcal{L}_y(\theta) = -\sum_{n=1}^{N} \sum_{t=1}^{T} \log \hat{p}(y^{(n)}_t | y^{(n)}_{1:t-1}, x^{(n)}_{1:L}; \theta)$   (13)

This approach yields the correct model (zero KL-divergence) if the following assumptions hold: 1) the model is powerful enough; 2) the model is optimized correctly; 3) there is enough training data to approximate the expectation shown in equation 2. In practice, these assumptions often do not hold, hence the model is prone to making mistakes. To illustrate the problem, suppose there is a reference output $y_{1:T}$ for the test input $x_{1:L}$. Due to data sparsity in high-dimensional space, $y_{1:T}$ is likely to be unseen during training. If the probability $\hat{p}(y_t | y_{1:t-1}, x_{1:L}; \theta)$ is wrongly estimated to be small at some time step $t$, the probability of the reference output sequence $\hat{p}(y_{1:T} | x_{1:L}; \theta)$ will also be small, i.e. it will be unlikely for the model to generate $y_{1:T}$.
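A hedged sketch of the teacher forcing loss of equation 13 is shown below. `step_prob` is a hypothetical placeholder for the model's per-step distribution $\hat{p}(y_t | y_{1:t-1}, x_{1:L}; \theta)$; the snippet only illustrates how the reference history enters the computation.

```python
# Sketch of the teacher forcing NLL (cf. equation 13): each step is
# conditioned on the *reference* output history.
import numpy as np

def step_prob(history, x, vocab_size=5):
    # placeholder model: near-uniform distribution, purely for illustration
    logits = np.ones(vocab_size) + 0.01 * len(history)
    return np.exp(logits) / np.exp(logits).sum()

def teacher_forcing_nll(y_ref, x):
    nll = 0.0
    for t, y_t in enumerate(y_ref):
        p_t = step_prob(y_ref[:t], x)   # conditioned on the reference history y_{1:t-1}
        nll -= np.log(p_t[y_t])         # negative log-likelihood of the reference token
    return nll

print(teacher_forcing_nll([1, 3, 2, 0], x=None))
```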

In practice, the model can be assessed by some loss $D(y_{1:T}, \hat{y}_{1:T})$ between the reference output $y_{1:T}$ and the generated output $\hat{y}_{1:T}$. Taking the expected value yields the Bayes risk: $\mathbb{E}_{x_{1:L} \sim p(x)} \mathbb{E}_{y_{1:T} \sim p(y|x)} \mathbb{E}_{\hat{y}_{1:T} \sim \hat{p}(y|x;\theta)} D(y_{1:T}, \hat{y}_{1:T})$. This motivates training the model with the following loss:

$\mathcal{L}^{(\mathrm{B})}(\theta) = \sum_{n=1}^{N} \mathbb{E}_{\hat{y}_{1:T} \sim \hat{p}(y_{1:T} | x^{(n)}_{1:L}; \theta)} D(y^{(n)}_{1:T}, \hat{y}_{1:T})$   (14)

$\hat{y}_{1:T}$ is sampled from the estimated distribution $\hat{p}(y_{1:T} | x^{(n)}_{1:L}; \theta)$. $D$ is minimal when the two sequences are equal. So the model is trained to not only assign high probability to the reference sequences in the training data, but also assign low probability to other sequences. This makes minimum Bayes risk training prone to overfitting.

Very often, $D$ is computed at sub-sequence level. Examples include BLEU score for NMT, word error rate for speech recognition and root mean square error for TTS. So if an approach trains the model to predict the reference output, based on erroneous output history, it will indirectly reduce the Bayes risk. One example is to train the model in free running mode, where $\hat{p}(y_t | \cdot)$ is estimated with the generated output history $\hat{y}_{1:t-1}$:

$p(y_t | y_{1:t-1}, x_{1:L}) \approx \hat{p}(y_t | \hat{y}_{1:t-1}, x_{1:L}; \theta) = \hat{p}(y_t | \hat{s}_t, \hat{c}_t; \theta_y)$   (15)
$\hat{s}_t = f(\hat{s}_{t-1}, \hat{y}_{t-1}; \theta_s)$   (16)

$\hat{y}_{t-1}$ is obtained from the estimated distribution $\hat{p}(y_{t-1} | \hat{s}_{t-1}, \hat{c}_{t-1}; \theta_y)$, as shown in equation 12. (The approaches discussed in this section are designed for all auto-regressive models, with or without attention mechanism, so the realization of $\hat{c}_t$ is not shown.) The corresponding loss function is:

$\mathcal{L}_{\mathrm{F}}(\theta) = -\sum_{n=1}^{N} \sum_{t=1}^{T} \log \hat{p}(y^{(n)}_t | \hat{y}^{(n)}_{1:t-1}, x^{(n)}_{1:L}; \theta)$   (17)

Note that if there is enough data and modeling power, and the model is optimized correctly, the distribution $\hat{p}(y_t | \hat{y}_{1:t-1}, x_{1:L}; \theta)$ can be the same as the true distribution $p(y_t | y_{1:t-1}, x_{1:L})$. The problem with this approach is that training often struggles to converge. One concern is that the model needs to learn to infer the correct output and align it with the input at the same time. Therefore, several approaches, namely scheduled sampling and professor forcing, have been proposed to train the model in a mode between teacher forcing and free running.
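For contrast with the teacher forcing sketch above, here is a hedged sketch of the free running loss (equations 15 to 17): the history passed to the model is now sampled from the model itself, while the loss is still evaluated on the reference tokens. `step_prob` is again a hypothetical placeholder model.

```python
# Sketch of free running training (cf. equations 15-17): the model is
# conditioned on its own samples rather than the reference history.
import numpy as np

rng = np.random.default_rng(0)

def step_prob(history, x, vocab_size=5):
    logits = np.ones(vocab_size) + 0.01 * len(history)   # placeholder model
    return np.exp(logits) / np.exp(logits).sum()

def free_running_nll(y_ref, x):
    nll, y_hat = 0.0, []
    for t, y_t in enumerate(y_ref):
        p_t = step_prob(y_hat, x)                  # conditioned on the *generated* history
        nll -= np.log(p_t[y_t])                    # still scored against the reference token
        y_hat.append(rng.choice(len(p_t), p=p_t))  # sample the token added to the history
    return nll

print(free_running_nll([1, 3, 2, 0], x=None))
```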

Scheduled sampling (Bengio et al., 2015) randomly decides, for each time step, whether the reference or generated output token is added to the output history $\tilde{y}_{1:t-1}$. For this approach, $\hat{p}(y_t | \cdot)$ is estimated as:

$p(y_t | y_{1:t-1}, x_{1:L}) \approx \hat{p}(y_t | \tilde{y}_{1:t-1}, x_{1:L}; \theta)$   (18)
$\tilde{y}_{t-1} = y_{t-1} \;\; \text{with probability } \epsilon_k$   (19)
$\tilde{y}_{t-1} = \hat{y}_{t-1} \;\; \text{with probability } 1 - \epsilon_k$   (20)

$\epsilon_k$ gradually decays from 1 to 0 with a heuristic schedule. Considering that during training, $\tilde{y}_{1:t-1}$ is mostly an inconsistent mixture of the reference output and the generated output, a natural extension is sequence-level scheduled sampling (Bengio et al., 2015), where the decision is made for each sequence instead of each token:

$\tilde{y}_{1:t-1} = y_{1:t-1} \;\; \text{with probability } \epsilon_k, \qquad \tilde{y}_{1:t-1} = \hat{y}_{1:t-1} \;\; \text{with probability } 1 - \epsilon_k$   (21)

This type of training improves the results of many experiments, but sometimes leads to worse results (Wang et al., 2017; Bengio et al., 2015). One concern is that the decay schedule does not fit the learning pace of the model.
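The sampling decision and the decay schedule can be sketched as below. The linear decay is an illustrative choice only, since the paper merely states that $\epsilon_k$ decays from 1 to 0 with a heuristic schedule; `k_max` is a hypothetical hyperparameter.

```python
# Sketch of scheduled sampling (cf. equations 18-21): the history token is
# the reference with probability eps_k and the model's own sample otherwise.
import numpy as np

rng = np.random.default_rng(0)

def epsilon(k, k_max=100):
    return max(0.0, 1.0 - k / k_max)          # illustrative linear decay from 1 to 0

def choose_history_token(y_ref_t, y_hat_t, k):
    use_reference = rng.random() < epsilon(k)
    return y_ref_t if use_reference else y_hat_t        # token-level decision (eqs. 19-20)

def choose_history_sequence(y_ref, y_hat, k):
    # sequence-level variant (eq. 21): one decision for the whole sequence
    return list(y_ref) if rng.random() < epsilon(k) else list(y_hat)

print(choose_history_token(3, 7, k=10), choose_history_sequence([1, 2], [5, 6], k=90))
```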

Professor forcing (Lamb et al., 2016) is an alternative trade-off. During training, the model is viewed as a generator, which generates two output sequences for each input sequence, respectively in teacher forcing mode and free running mode.¹ For the training example $\{y^{(n)}_{1:T}, x^{(n)}_{1:L}\}$, let $\hat{y}^{\mathrm{T}}_{1:T}$ denote the output generated in teacher forcing mode, and $\hat{y}^{\mathrm{F}}_{1:T}$ the output generated in free running mode; this can be expressed as:

$\hat{y}^{\mathrm{T}}_t \sim \hat{p}(y_t | y^{(n)}_{1:t-1}, x^{(n)}_{1:L}; \theta)$   (22)
$\hat{y}^{\mathrm{F}}_t \sim \hat{p}(y_t | \hat{y}^{\mathrm{F}}_{1:t-1}, x^{(n)}_{1:L}; \theta)$   (23)

¹ The term "teacher forcing", as well as "attention forcing", can refer to either an operation mode, or the approach to train a model in that operation mode. An operation mode can be used not only to train a model, but also to generate from it. For example, in teacher forcing mode, given the reference output $y_{1:T}$, a model can generate a guided output $\hat{y}^{\mathrm{T}}_{1:T}$, without evaluating the loss. $\hat{y}^{\mathrm{T}}_{1:T}$ is likely to be different from, but similar to, $y_{1:T}$, and can be useful for training the discriminator.

In addition to the final output, some intermediate output sequences are saved. Let $h^{\mathrm{T}}$ and $h^{\mathrm{F}}$ denote the intermediate output sequences generated respectively in teacher forcing and free running mode. These generated sequences form a dataset that is used to train a discriminator with parameters $\phi$. The discriminator $D$ is trained to predict the probability that a group of sequences is generated in teacher forcing mode, and the loss function is:

$\mathcal{L}_D(\phi; \theta) = -\sum_{n=1}^{N} \big[ \log D(\hat{y}^{\mathrm{T}}_{1:T}, h^{\mathrm{T}}; \phi) + \log \big( 1 - D(\hat{y}^{\mathrm{F}}_{1:T}, h^{\mathrm{F}}; \phi) \big) \big]$   (24)

While this loss function is optimized w.r.t. $\phi$, it depends on $\theta$, hence the notation $\mathcal{L}_D(\phi; \theta)$. For the generator $\theta$, there are three training objectives. The first one is the standard likelihood shown in equation 13. The second one is to fool the discriminator in free running mode:

$\mathcal{L}^{\mathrm{F}}_G(\theta; \phi) = -\sum_{n=1}^{N} \log D(\hat{y}^{\mathrm{F}}_{1:T}, h^{\mathrm{F}}; \phi)$   (25)

The third one, which is optional, is to fool the discriminator in teacher forcing mode:

$\mathcal{L}^{\mathrm{T}}_G(\theta; \phi) = -\sum_{n=1}^{N} \log \big( 1 - D(\hat{y}^{\mathrm{T}}_{1:T}, h^{\mathrm{T}}; \phi) \big)$   (26)

This approach makes the distribution estimated in free running mode similar to the corresponding distribution estimated in teacher forcing mode. In addition, it regularizes some hidden layers, encouraging them to behave as if in teacher forcing mode. The disadvantage is that it requires designing and training the discriminator.
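The three professor forcing objectives can be sketched with a toy logistic discriminator, as below. The "behaviour" vectors `b_T` and `b_F`, and the discriminator itself, are placeholders rather than the architecture used by Lamb et al. (2016); the snippet only shows how the discriminator and generator losses relate.

```python
# Compact sketch of the professor forcing objectives (cf. equations 24-26).
import numpy as np

def discriminator(behaviour, phi):
    # toy logistic discriminator: probability of "teacher forcing mode"
    return 1.0 / (1.0 + np.exp(-phi @ behaviour))

def discriminator_loss(b_T, b_F, phi):
    # cf. eq. 24: classify teacher-forced behaviour as 1, free-running as 0
    return -(np.log(discriminator(b_T, phi)) + np.log(1 - discriminator(b_F, phi)))

def generator_fool_free_running(b_F, phi):
    # cf. eq. 25: make free-running behaviour look like teacher forcing
    return -np.log(discriminator(b_F, phi))

def generator_fool_teacher_forcing(b_T, phi):
    # cf. eq. 26 (optional): make teacher-forced behaviour look like free running
    return -np.log(1 - discriminator(b_T, phi))

phi = np.array([0.5, -0.2, 0.1])                       # toy discriminator parameters
b_T, b_F = np.array([1.0, 0.0, 1.0]), np.array([0.2, 0.8, 0.4])
print(discriminator_loss(b_T, b_F, phi),
      generator_fool_free_running(b_F, phi),
      generator_fool_teacher_forcing(b_T, phi))
```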

3 Attention forcing

3.1 Guiding the model with attention

For attention-based seq2seq generation, we propose a new algorithm: attention forcing. The basic idea is to use reference attention (i.e. reference alignment) and generated output to guide the model during training. In attention forcing mode, the model does not need to learn to simultaneously infer the output and align it with the input. As the reference alignment is known, the decoder can focus on inferring the output, and the attention mechanism can focus on generating the correct alignment.

Let $\theta_A$ denote the model that is trained in attention forcing mode, and later used for inference. In attention forcing mode, $y_t$ is estimated with the generated output history $\hat{y}_{1:t-1}$ and the reference alignment $\alpha_t$, and equation 5 becomes:

$p(y_t | y_{1:t-1}, x_{1:L}) \approx \hat{p}(y_t | \hat{y}_{1:t-1}, \alpha_t, x_{1:L}; \theta_A) = \hat{p}(y_t | \hat{s}_t, \hat{c}_t; \theta_{A,y})$   (27)

$\hat{s}_t$ and $\hat{c}_t$ denote the state vector and context vector generated by $\theta_A$. Details of attention forcing can be illustrated by figure 2, as well as the following equations:

$h_{1:L} = f(x_{1:L}; \theta_h) \qquad \hat{h}_{1:L} = f(x_{1:L}; \theta_{A,h})$   (28)
$s_t = f(s_{t-1}, y_{t-1}; \theta_s) \qquad \hat{s}_t = f(\hat{s}_{t-1}, \hat{y}_{t-1}; \theta_{A,s})$   (29)
$\alpha_t = f(s_t, h_{1:L}; \theta_\alpha) \qquad \hat{\alpha}_t = f(\hat{s}_t, \hat{h}_{1:L}; \theta_{A,\alpha})$   (30)
$\hat{c}_t = \sum_{l=1}^{L} \alpha_{t,l} \hat{h}_l$   (31)
$\hat{y}_t \sim \hat{p}(y_t | \hat{s}_t, \hat{c}_t; \theta_{A,y})$   (32)

The right side of equations 28 to 30, as well as equations 31 and 32, shows how the attention forcing model $\theta_A$ operates. $\hat{h}_{1:L}$ and $\hat{\alpha}_t$ denote the encoding and alignment vectors generated by $\theta_A$. $\hat{s}_t$ is computed with the generated output $\hat{y}_{t-1}$. While an alignment $\hat{\alpha}_t$ is generated by $\theta_A$, it is not used by the decoder, because $\hat{c}_t$ is computed with the reference alignment $\alpha_t$. In most cases, $\alpha_t$ is not available. One option for obtaining it is shown by the left side of equations 28 to 30, which is the same as equations 8 to 10: the option is to generate $\alpha_t$ from a teacher forcing model $\theta_T$. $\theta_T$ is trained in teacher forcing mode, as described in section 2.2. Once trained, it can generate $\alpha_t$, again in teacher forcing mode.

Figure 2: Illustration of attention forcing

During inference, the attention forcing model operates in free running mode. In this case, equation 31 becomes $\hat{c}_t = \sum_{l=1}^{L} \hat{\alpha}_{t,l} \hat{h}_l$. The decoder is guided by $\hat{\alpha}_t$, instead of $\alpha_t$.
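A minimal sketch of one attention forcing decoder step (equations 29 to 32) follows, reusing the toy parametrisation assumed in the earlier decoder sketch. During training the context is built from the reference alignment, while the model's own alignment is still computed (for the alignment loss introduced below); calling the step with `alpha_ref=None` reproduces the free running behaviour used at inference.

```python
# Sketch of one attention forcing decoder step (cf. equations 29-32).
import numpy as np

rng = np.random.default_rng(0)
L, d, V = 6, 8, 10
h_hat = rng.normal(size=(L, d))      # stand-in for the encodings of theta_A (eq. 28)
W_s = rng.normal(scale=0.1, size=(d, d))
W_a = rng.normal(scale=0.1, size=(d,))
W_o = rng.normal(scale=0.1, size=(2 * d, V))

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def attention_forcing_step(s_prev, y_hat_prev_emb, alpha_ref=None):
    s_t = np.tanh(s_prev @ W_s + y_hat_prev_emb)       # state from *generated* history (eq. 29)
    alpha_hat = softmax(np.tanh(h_hat + s_t) @ W_a)    # generated alignment (eq. 30)
    alpha = alpha_ref if alpha_ref is not None else alpha_hat
    c_t = alpha @ h_hat                                # context from reference alignment (eq. 31)
    p_t = softmax(np.concatenate([s_t, c_t]) @ W_o)    # output distribution (eq. 32)
    return s_t, alpha_hat, p_t

alpha_ref = softmax(rng.normal(size=L))                # e.g. produced by a teacher forcing model
s1, alpha_hat1, p1 = attention_forcing_step(np.zeros(d), np.zeros(d), alpha_ref)
print(alpha_hat1.round(3), p1.argmax())
```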

During training, there are two objectives: to infer the reference output and to imitate the reference alignment. For the first objective, the loss function is:

$\mathcal{L}_y(\theta_A) = -\sum_{n=1}^{N} \sum_{t=1}^{T} \log \hat{p}(y^{(n)}_t | \hat{y}^{(n)}_{1:t-1}, \alpha^{(n)}_t, x^{(n)}_{1:L}; \theta_A)$   (33)

For the second objective, as an alignment corresponds to a categorical distribution, the loss function is the average KL-divergence between the reference alignment and the generated alignment:

$\mathcal{L}_\alpha(\theta_A) = \sum_{n=1}^{N} \sum_{t=1}^{T} \mathrm{KL}\big( \alpha^{(n)}_t \,\|\, \hat{\alpha}^{(n)}_t \big) = \sum_{n=1}^{N} \sum_{t=1}^{T} \sum_{l=1}^{L} \alpha^{(n)}_{t,l} \log \frac{\alpha^{(n)}_{t,l}}{\hat{\alpha}^{(n)}_{t,l}}$   (34)

The two losses can be jointly optimized as $\mathcal{L}_{y,\alpha}(\theta_A) = \mathcal{L}_y(\theta_A) + \gamma \mathcal{L}_\alpha(\theta_A)$. $\gamma$ is a scaling factor that should be set according to the dynamic range of the two losses, which roughly indicates the norm of the gradient. The alignment loss can be interpreted as a regularization term, which encourages the attention mechanism of $\theta_A$ to behave like that of $\theta_T$. Our default optimization option is as follows. $\theta_T$ is trained in teacher forcing mode, with the loss shown in equation 13, and then fixed to generate the reference attention. $\theta_A$ is trained with the joint loss $\mathcal{L}_{y,\alpha}(\theta_A)$. In our experiments, this option makes training more stable, most probably because the reference attention is the same from epoch to epoch. There are several alternative options. One example is to tie $\theta_A$ and $\theta_T$, i.e. use only one set of model parameters, and train them with the joint loss. This option is less stable, but more efficient.
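The joint loss can be sketched as follows. The per-step KL term treats each alignment vector as a categorical distribution; the value $\gamma = 50$ is the one reported for the TTS experiments in appendix A.2.1, used here purely as an example.

```python
# Sketch of the joint attention forcing loss L_y + gamma * L_alpha
# (cf. equations 33-34), over a single sequence.
import numpy as np

def kl_categorical(alpha_ref, alpha_hat, eps=1e-12):
    # KL divergence between two alignment (categorical) vectors
    return float(np.sum(alpha_ref * (np.log(alpha_ref + eps) - np.log(alpha_hat + eps))))

def joint_loss(nll_terms, alpha_refs, alpha_hats, gamma=50.0):
    loss_y = sum(nll_terms)                                  # output loss (eq. 33)
    loss_alpha = sum(kl_categorical(a_ref, a_hat)            # alignment loss (eq. 34)
                     for a_ref, a_hat in zip(alpha_refs, alpha_hats))
    return loss_y + gamma * loss_alpha

nll_terms = [0.7, 0.4, 0.9]                                  # per-step -log p_hat(y_t | ...)
alpha_refs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1]), np.array([0.1, 0.2, 0.7])]
alpha_hats = [np.array([0.6, 0.3, 0.1]), np.array([0.2, 0.7, 0.1]), np.array([0.2, 0.2, 0.6])]
print(joint_loss(nll_terms, alpha_refs, alpha_hats))
```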

3.2 Comparison with related approaches

Intuitively, attention forcing, as well as scheduled sampling and professor forcing, sits between teacher forcing and free running. Unlike scheduled sampling, attention forcing does not require a decay schedule, which can be difficult to tune. While the scaling factor $\gamma$ is a hyperparameter, it can be set according to the dynamic ranges of the two losses, as described in section 3.1. In addition, it can be tuned according to the alignment vector, which is an interpretable indicator of how well the attention mechanism works. In terms of regularization, attention forcing is similar to professor forcing. The output layer of the attention mechanism, which can be viewed as a special hidden layer, is encouraged to behave as if in teacher forcing mode. The difference is that attention forcing does not require a discriminator to learn a loss function, as the KL-divergence is a natural loss function for the alignment vector.

A limitation of attention forcing is that it is less general than the approaches described in section 2.2, which are well defined for all auto-regressive models, with or without attention mechanism. To apply attention forcing to a model without attention mechanism, attention needs to be defined first. For convolutional neural networks, for example, attention maps can be defined based on activation or gradient (Zagoruyko and Komodakis, 2016).

4 Application to speech synthesis

Attention forcing has a feature that is essential for many cascaded systems: when the reference alignment is available, the output can be generated in attention forcing mode, and will be aligned with the reference. TTS is a typical example. For TTS, the task is to map a sequence of characters $x_{1:L}$ to a sequence of waveform samples $w_{1:\tilde{T}}$. Directly mapping $x_{1:L}$ to $w_{1:\tilde{T}}$ is difficult because the two sequences are not aligned and are orders of magnitude different in length. (10 characters can correspond to more than 1000 waveform samples.) As shown in figure 3, TTS is often realized by first mapping $x_{1:L}$ to a vocoder feature sequence $y_{1:T}$, and then mapping $y_{1:T}$ to $w_{1:\tilde{T}}$. The vocoder feature sequence is a compact and interpretable representation of the waveform; a vocoder can be used to map vocoder features to a waveform, or the reverse, with a series of signal processing techniques. Each feature frame corresponds to a window of waveform samples, i.e. each time step in the feature sequence corresponds to a fixed number of time steps in the waveform sequence.

Figure 3: Illustration of a speech synthesis system

The model mapping $x_{1:L}$ to $y_{1:T}$ can be referred to as the frame-level model $\theta_y$, and the model mapping $y_{1:T}$ to $w_{1:\tilde{T}}$ as the waveform-level model $\theta_w$. Conventionally, $\theta_w$ is a vocoder, and is not learnable. $\theta_y$ contains a text processing frontend, a duration model and a feature model (Li et al., 2018). The text processing frontend extracts linguistic features from $x_{1:L}$; the duration model predicts the duration of each linguistic feature; the feature model maps the linguistic features to $y_{1:T}$. This paper focuses on the state-of-the-art approach, where $\theta_y$, as well as $\theta_w$, is a neural network. $\theta_w$ can be considered a neural vocoder, which is not limited by the assumptions made by conventional vocoders (Lorenzo-Trueba et al., 2018; Kalchbrenner et al., 2018). $\theta_y$ is an attention-based seq2seq model, as described in section 2.1. Compared with the conventional approach, the attention-based model has several advantages, such as a performance gain and less need for data labeling (Wang et al., 2017). Note that as shown in figure 3, $\theta_y$ learns not only to map a character sequence to a feature sequence, but also to align them. In contrast, $\theta_w$ does not align its input and output (Shen et al., 2018; Oord et al., 2016).

The training dataset usually contains pairs of waveform and text $\{w^{(n)}_{1:\tilde{T}}, x^{(n)}_{1:L}\}_{n=1}^{N}$. (To simplify notation, the superscript $(n)$ is omitted by default in the following discussion.) For each $w_{1:\tilde{T}}$, a vocoder feature sequence $y_{1:T}$ can be extracted. The frame-level model $\theta_y$ is trained with $\{y_{1:T}, x_{1:L}\}$. The waveform-level model $\theta_w$ can be trained with $\{w_{1:\tilde{T}}, y_{1:T}\}$, or $\{w_{1:\tilde{T}}, \hat{y}_{1:T}\}$, where $\hat{y}_{1:T}$ is generated by $\theta_y$. Training with $\{w_{1:\tilde{T}}, \hat{y}_{1:T}\}$ allows $\theta_w$ to fix some mistakes made by $\theta_y$, but this is only possible when $\hat{y}_{1:T}$ is aligned with $w_{1:\tilde{T}}$. To ensure the alignment, the standard approach is to train $\theta_y$ in teacher forcing mode, and then generate $\hat{y}_{1:T}$ from it in the same mode. This paper proposes an alternative approach: to use attention forcing instead of teacher forcing. As analyzed in section 3.1, training $\theta_y$ with attention forcing improves its performance. Furthermore, in attention forcing mode, each output $\hat{y}_t$ is predicted based on $\hat{y}_{1:t-1}$ (instead of $y_{1:t-1}$), hence $\hat{y}_{1:T}$ is more likely (than in teacher forcing mode) to contain the errors that $\theta_y$ makes at inference stage. Training $\theta_w$ with $\{w_{1:\tilde{T}}, \hat{y}_{1:T}\}$ can enable it to correct these errors, improving the quality of the waveform. Note that if $\theta_y$ is trained with scheduled sampling or professor forcing, it is often not possible to predict, based only on generated output history, a vocoder feature sequence aligned with the reference waveform. Also note that $\theta_w$ is trained in teacher forcing mode, as it does not have an attention mechanism. Hence the rest of this section focuses on discussing $\theta_y$ at training stage and inference stage.

During training, it is often assumed that the output tokens follow a certain type of distribution, so that minimizing the loss shown in equation 33 can be approximated by minimizing some distance metric between $y_{1:T}$ and $\hat{y}_{1:T}$. For example, assuming that the distribution shown in equation 27 is a Laplace distribution, minimizing $\mathcal{L}_y(\theta_A)$ is equivalent to minimizing the average $L_1$ distance:

$\mathcal{L}_y(\theta_A) \approx \frac{1}{T} \sum_{t=1}^{T} \| y_t - \hat{y}_t \|_1$   (35)
$\hat{y}_t = \arg\max_{y_t} \hat{p}(y_t | \hat{s}_t, \hat{c}_t; \theta_{A,y})$   (36)

The notation is the same as in section 3.1: $\theta_A$ denotes the attention forcing model; $\theta_T$ denotes the teacher forcing model generating the reference alignment. Equation 36 replaces equation 32; in this case, $\hat{y}_t$ is not sampled, and is always the mode of the predicted distribution. During inference, the exact search (equation 4) is approximated by greedy search. (Note that for TTS, the main difference between training and inference is the alignment, which influences duration more than quality.)

$\hat{y}_t = \arg\max_{y_t} \hat{p}(y_t | \hat{y}_{1:t-1}, x_{1:L}; \theta_A)$   (37)
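To make the Laplace-to-$L_1$ step above explicit, assume each output dimension follows a Laplace distribution with a fixed scale $b$, so that $\hat{y}_t$ is the predicted location (mode); this is the standard one-line derivation under that assumption:

$-\log \hat{p}(y_t | \hat{s}_t, \hat{c}_t; \theta_{A,y}) = -\log \frac{1}{2b} \exp\!\left(-\frac{|y_t - \hat{y}_t|}{b}\right) = \frac{1}{b}\,|y_t - \hat{y}_t| + \log 2b$

Summing over time steps (and output dimensions), minimizing the NLL of equation 33 is therefore the same as minimizing the $L_1$ distance of equation 35, up to additive and multiplicative constants.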

5 Experiments

5.1 Speech Synthesis

The TTS experiments are conducted on LJ dataset (Ito, 2017), which contains 13,100 utterances from a single speaker. The utterances vary in length from 1 to 10 seconds, totaling approximately 24 hours. A transcription is provided for each waveform, and the corresponding vocoder features are extracted with PML vocoder (Degottex et al., 2016). The training-validation-test split is 13000-50-50. The waveform-level model is the Hierarchical Recurrent Neural Network (HRNN) neural vocoder (Mehri et al., 2016; Dou et al., 2018). The model structure is exactly the same as described in Dou et al. (2018), and the model configuration is adjusted for efficiency. The frame-level model is very similar to Tacotron (Wang et al., 2017). The model structure and configuration are the same as described in Wang et al. (2017), except that: 1) the decoder target is vocoder features; 2) the attention mechanism is the hybrid (content-based + location-based) attention (Chorowski et al., 2015); 3) each decoding step predicts 5 vocoder feature frames. The neural vocoder is always trained with teacher forcing. The frame-level model is trained with either teacher forcing or attention forcing. Details of the setup (data, models and training) are presented in appendix A.2.1.

Two TTS systems are built: a teacher forcing system and an attention forcing system. For the teacher forcing system, the frame-level model $\theta_T$ is trained in teacher forcing mode. The neural vocoder is trained with the vocoder features generated (in teacher forcing mode) by $\theta_T$. For the attention forcing system, the frame-level model $\theta_A$ is trained in attention forcing mode, with reference attention generated (in teacher forcing mode) by $\theta_T$. At this stage, $\theta_A$ is updated, while $\theta_T$ is fixed. The neural vocoder is trained with the vocoder features generated (in attention forcing mode) by $\theta_A$. At inference stage, all the models operate in free running mode.

For TTS, human perception is the gold standard. The two systems are compared in a subjective listening test. Over 30 workers from Amazon Mechanical Turk are instructed to listen to pairs of utterances, and indicate which one they prefer in terms of overall quality. Each comparison includes 5 pairs of utterances randomly selected among all the test utterances. Figure 4 shows the result of the listening test. Each number indicates the percentage of a certain preference. Most participants prefer attention forcing. We strongly encourage readers to listen to the generated utterances.² It is obvious that attention forcing yields utterances that are significantly more natural and expressive.

² Generated test utterances are randomly selected and made available at http://mi.eng.cam.ac.uk/~qd212/iclr2020/samples.html

Figure 4: Result of the listening test comparing teacher forcing and attention forcing

5.2 Machine translation

The NMT experiments are conducted on the English-to-Vietnamese task in IWSLT 2015. It is a low resource NMT task, where the training set contains 133K sentence pairs. The Stanford pre-processed data is used. TED tst2012 is used as the validation set, and BLEU scores on TED tst2013 are reported. The scores use a 4-gram corpus level BLEU with equal weights. Google's attention-based encoder-decoder LSTM model (Wu et al., 2016) is adopted. Details of the setup (data, model and training) are presented in appendix A.2.2.

Our initial experiments show that directly applying attention forcing to NMT can degrade performance. One concern is that for translation, various re-orderings of the output sequence are valid. In this case, guiding the model with generated output can be problematic, as the reference output can take an ordering that is different from the generated output. To test whether this is the reason, we tried a modified attention forcing mode, where the model is guided with reference attention and reference output. The right side of equation 29 becomes $\hat{s}_t = f(\hat{s}_{t-1}, y_{t-1}; \theta_{A,s})$: the state $\hat{s}_t$ is computed with the reference output $y_{t-1}$, and thus matches the reference attention $\alpha_t$. Other parts of attention forcing (equations 28 to 31) stay the same, hence $y_t$ is predicted with the reference output history and the reference alignment.

In the following experiments, two NMT models are compared: one is trained in teacher forcing mode, with the NLL loss in equation 13; the other is trained in the modified attention forcing mode described above, with both the NLL loss and the attention loss in equation 34. An ensemble of 10 models is trained with teacher forcing. Each of these models then generates reference attention for a corresponding model trained with the additional attention loss. The average performance of the teacher forcing models is 26.35 BLEU, and adding the attention loss yields an average gain of +0.35 BLEU. 9 out of 10 times, the performance improves. The slight but consistent gain shows that for NMT, guiding the model with generated output is indeed what degrades performance. It also shows that guiding the model with reference attention can be beneficial. One possible reason is that the attention loss regularizes the attention mechanism. Another is that the model does not need to learn to simultaneously infer the output and align it with the input.

6 Conclusion

This paper introduces attention forcing, which guides a seq2seq model with generated output history and reference attention. This approach can train the model to recover from its mistakes, in a stable fashion, without the need for a schedule or a classifier. In addition, it allows the model to generate output sequences aligned with the reference output sequences, which can be important for cascaded systems like many TTS systems. The TTS experiments show that attention forcing yields significant gain in speech quality. The NMT experiments show that for tasks where various re-orderings of the output are valid, guiding the model with generated output history can be problematic, while guiding the model with reference attention yields slight but consistent gain in BLEU score.

References

  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.1.
  • S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179. Cited by: §A.1, §1, §2.2, §2.
  • J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577–585. Cited by: §A.2.1, §2.1, §5.1.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §A.2.1.
  • G. A. Degottex, P. K. Lanchantin, and M. J. Gales (2016) A pulse model in log-domain for a uniform synthesizer. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 230–236. Cited by: §A.2.1, §5.1.
  • Q. Dou, M. Wan, G. Degottex, Z. Ma, and M. J. Gales (2018) Hierarchical rnns for waveform-level speech synthesis. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 618–625. Cited by: §A.2.1, §5.1.
  • P. Huang, F. Liu, S. Shiang, J. Oh, and C. Dyer (2016) Attention-based multimodal neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp. 639–645. Cited by: §1.
  • K. Ito (2017) The lj speech dataset. Note: https://keithito.com/LJ-Speech-Dataset/ Cited by: §A.2.1, §5.1.
  • N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu (2018) Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435. Cited by: §4.
  • D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.2.1.
  • A. M. Lamb, A. G. A. P. Goyal, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio (2016) Professor forcing: a new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pp. 4601–4609. Cited by: §1, §2.2.
  • N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou (2018) Close to human quality tts with transformer. arXiv preprint arXiv:1809.08895. Cited by: §4.
  • J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote (2018) Robust universal neural vocoding. arXiv preprint arXiv:1811.06292. Cited by: §4.
  • M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §A.2.2.
  • S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio (2016) SampleRNN: an unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837. Cited by: §A.2.1, §5.1.
  • G. Neubig (2017) Neural machine translation and sequence-to-sequence models: a tutorial. arXiv preprint arXiv:1703.01619. Cited by: §1.
  • A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §4.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §1.
  • J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al. (2018) Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783. Cited by: §1, §4.
  • R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. arXiv preprint arXiv:1505.00387. Cited by: §A.2.1.
  • Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al. (2017) Tacotron: towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135. Cited by: §A.2.1, §2.2, §4, §5.1.
  • Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous (2018) Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017. Cited by: §1.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §A.2.2, §5.2.
  • S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §3.2.

Appendix A Appendix

A.1 Details of sequence-to-sequence generation

The exact search shown in equation 4 is computationally expensive, and is often approximated by greedy search if the output space is continuous, or beam search if the output space is discrete (Bengio et al., 2015). For greedy search, the model generates the output sequence one token at a time based on previous output tokens, until a special end-of-sequence token is generated. For beam search, a heap of the best candidate sequences is kept. At each time step, the candidates are updated by extending each candidate by one step, and pruning the heap to only keep the best candidates. The beam search stops when no new sequences are added.
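Below is a minimal beam search sketch matching the description above. The vocabulary size, the end-of-sequence token and the random `step_log_probs` scorer are toy placeholders for the model's per-step distribution; they are not part of the paper's setup.

```python
# Illustrative beam search over a toy per-step scorer.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 5
EOS = 0                                        # token 0 plays the role of end-of-sequence

def step_log_probs(prefix):
    logits = rng.normal(size=VOCAB)            # placeholder for the model's step distribution
    return logits - np.log(np.exp(logits).sum())

def beam_search(beam_width=3, max_len=10):
    beams = [([], 0.0)]                        # (prefix, cumulative log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:            # extend every candidate by one step
            log_p = step_log_probs(prefix)
            for tok in range(VOCAB):
                candidates.append((prefix + [tok], score + log_p[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:   # keep only the best candidates
            (finished if prefix[-1] == EOS else beams).append((prefix, score))
        if not beams:                          # stop once every kept candidate has finished
            break
    return max(finished + beams, key=lambda c: c[1])

print(beam_search())
```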

A.2 Details of experimental setup

A.2.1 Speech synthesis

The TTS experiments are conducted on LJ dataset (Ito, 2017). This public domain dataset contains 13,100 utterances from a single speaker reading passages from 7 non-fiction books. The utterances vary in length from 1 to 10 seconds and have a total length of approximately 24 hours. A transcription (character sequence) is provided for each utterance (waveform sequence). The waveforms are resampled to 16kHz to increase the efficiency of neural vocoders. Corresponding vocoder features are extracted at the frame rate of 0.2kHz, using a PML vocoder (Degottex et al., 2016). The training-validation-test split is 13000-50-50.

The frame-level model is very similar to Tacotron (Wang et al., 2017), a powerful attention-based encoder-decoder model. The differences are: 1) the decoder target is vocoder features; 2) the attention mechanism is the hybrid (content-based + location-based) attention described in Chorowski et al. (2015); 3) the reduction factor is 5, i.e. each decoding step predicts 5 vocoder feature frames. Apart from these, the model structure is the same as described in Wang et al. (2017). The input characters are represented as one-hot vectors. The encoder has an embedding layer mapping the one-hot vectors to continuous vectors, a bottleneck layer with dropout, and a CBHG module generating the final encoding sequence. The CBHG module consists of a bank of 1-D convolutional filters, followed by highway networks (Srivastava et al., 2015) and a bidirectional GRU. The decoder has a stack of GRUs with vertical residual connections and generates the intermediate vocoder features. These features are post-processed by another CBHG module, yielding the final vocoder features. The model configuration is the same as described by Table 1 in Wang et al. (2017).

The waveform-level model is the Hierarchical Recurrent Neural Network (HRNN) neural vocoder (Mehri et al., 2016; Dou et al., 2018). The HRNN structure is a hierarchy of tiers; each tier includes several neural network layers and operates at a different frequency. The lowest tier operates at waveform-level frequency, and outputs distributions of waveform samples. Each higher tier operates at a lower frequency, and supervises the tier below it. The model configuration is as follows. Tier 0 is a 4-layer DNN, including three fully connected layers with ReLU activation and a softmax output layer; the dimension is 1024 for the first two fully connected layers, and is 256 for the other two layers. The other tiers are all 1-layer RNNs; the Gated Recurrent Unit (GRU) (Chung et al., 2014) is used and the dimension is 512 for all layers. The frequencies for tiers 0 to 3 are respectively 16kHz, 8kHz, 2kHz and 0.4kHz. This neural vocoder models each waveform sample with a categorical distribution. Hence the waveform samples are quantized into 256 integer values.

The frame-level model is trained with either teacher forcing or attention forcing. In both cases, the loss shown in equation 35 is used for both the decoder and post-processing CBHG. The two losses have equal weights. For attention forcing, the additional alignment loss shown in equation 34 is used for the attention mechanism, and the scaling factor is 50. The neural vocoder is always trained in teacher forcing mode, and the loss function is shown in equation 13. For all experiments, the optimizer is Adam (Kingma and Ba, 2014), and the initial learning rate is 0.001.

A.2.2 Machine translation

The NMT experiments are conducted on the English-to-Vietnamese task in IWSLT 2015. It is a low resource NMT task, with the parallel training set containing 133K sentence pairs. The Stanford pre-processed data (https://nlp.stanford.edu/projects/nmt/) is used. The attention-based encoder-decoder LSTM model (Wu et al., 2016) is adopted. The model is simplified with a smaller number of LSTM layers due to the small scale of data: the encoder has 2 layers of bi-LSTM and the decoder has 4 layers of uni-LSTM; the general form of Luong attention (Luong et al., 2015) is used; both English and Vietnamese word embeddings have 200 dimensions and are randomly initialised. The Adam optimiser is used with a learning rate of 0.002, and the maximum gradient norm is set to 1. Dropout is used with a probability of 0.2. During inference, predictions are made using beam search with a width of 10.