Autoregressive sequence models based on deep neural networks, such as RNNs, WaveNet and the Transformer, attain state-of-the-art results on many tasks. However, they are difficult to parallelize and are thus slow at processing long sequences. RNNs lack parallelism both during training and decoding, while architectures like WaveNet and Transformer are much more parallelizable during training, yet still operate sequentially during decoding. Inspired by [arxiv:1711.00937], we present a method to extend sequence models using discrete latent variables that makes decoding much more parallelizable. We first auto-encode the target sequence into a shorter sequence of discrete latent variables, which at inference time is generated autoregressively, and finally decode the output sequence from this shorter latent sequence in parallel. To this end, we introduce a novel method for constructing a sequence of discrete latent variables and compare it with previously introduced methods. Finally, we evaluate our model end-to-end on the task of neural machine translation, where it is an order of magnitude faster at decoding than comparable autoregressive models. While lower in BLEU than purely autoregressive models, our model achieves higher scores than previously proposed non-autoregressive translation models.
Neural networks have been applied successfully to a variety of tasks involving natural language. In particular, recurrent neural networks (RNNs) with long short-term memory (LSTM) cells (Hochreiter & Schmidhuber, 1997) in a sequence-to-sequence configuration (Sutskever et al., 2014) have proven successful at tasks including machine translation (Sutskever et al., 2014; Bahdanau et al., 2014; Cho et al., 2014), parsing (Vinyals et al., 2015), and many others. RNNs are inherently sequential, however, and thus tend to be slow to execute on modern hardware optimized for parallel execution. Recently, a number of more parallelizable sequence models were proposed, and architectures such as WaveNet (van den Oord et al., 2016), ByteNet (Kalchbrenner et al., 2016) and the Transformer (Vaswani et al., 2017) can indeed be trained faster due to improved parallelism.
When actually generating sequential output, however, their autoregressive nature still fundamentally prevents these models from taking full advantage of parallel computation. When generating a sequence $y = y_1 \dots y_n$ in a canonical order, say from left to right, predicting the symbol $y_t$ first requires generating all symbols $y_1 \dots y_{t-1}$, as the model predicts
$$P(y_t \mid y_{t-1}, y_{t-2}, \dots, y_1).$$
During training, the ground truth is known, so the conditioning on previous symbols can be parallelized. But during decoding, this is a fundamental limitation, as at least $n$ sequential steps need to be made to generate $y_n$.
To overcome this limitation, we propose to introduce a sequence of discrete latent variables $l = l_1 \dots l_m$, with $m < n$, that summarizes the relevant information from the sequence $y$. We will still generate $l$ autoregressively, but it will be much faster as $m < n$ (in our experiments we mostly use $m = n/8$). Then, we reconstruct each position in the sequence $y$ from $l$ in parallel.
For the above strategy to work, we need to autoencode the target sequence $y$ into a shorter sequence $l$. Autoencoders have a long history in deep learning (Hinton & Salakhutdinov, 2006; Salakhutdinov & Hinton, 2009a; Vincent et al., 2010; Kingma & Welling, 2013). Autoencoders mostly operate on continuous representations, either by imposing a bottleneck (Hinton & Salakhutdinov, 2006), requiring them to remove added noise (Vincent et al., 2010), or adding a variational component (Kingma & Welling, 2013). In our case though, we prefer the sequence $l$ to be discrete, as we use standard autoregressive models to predict it. Despite some success (Bowman et al., 2016; Yang et al., 2017), predicting continuous latent representations does not work as well as the discrete case in our setting.
However, using discrete latent variables can be challenging when training models end-to-end. Three techniques have recently shown how to successfully use discrete variables in deep models: the Gumbel-Softmax (Jang et al., 2016; Maddison et al., 2016), VQ-VAE (van den Oord et al., 2017) and improved semantic hashing (Kaiser & Bengio, 2018). We compare all these techniques in our setting and introduce another one: decomposed vector quantization (DVQ), which performs better than VQ-VAE for large latent alphabet sizes.
Using either DVQ or improved semantic hashing, we are able to create a neural machine translation model that achieves good BLEU scores on the standard benchmarks while being an order of magnitude faster at decoding time than autoregressive models. A recent paper (Gu et al., 2017) reported similar gains for neural machine translation, but their techniques are hand-tuned for translation and require training with reinforcement learning. Our latent variables are learned and the model is trained end-to-end, so it can be applied to any sequence problem. Despite being more generic, our model outperforms the hand-tuned technique from (Gu et al., 2017), yielding better BLEU. To summarize, our main contributions are:
A method for fast decoding for autoregressive models.
An improved discretization technique: the DVQ.
The resulting Latent Transformer model, achieving good results on translation while decoding much faster.
In this section we introduce various discretization bottlenecks used to train discrete autoencoders for the target sequence. We will use the notation from (van den Oord et al., 2017), where the target sequence $y$ is passed through an encoder, $\mathrm{enc}$, to produce a continuous latent representation $\mathrm{enc}(y) \in \mathbb{R}^D$, where $D$ is the dimension of the latent space. Let $K$ be the size of the discrete latent space and let $[K]$ denote the set $\{1, 2, \dots, K\}$. The continuous latent $\mathrm{enc}(y)$ is subsequently passed through a discretization bottleneck to produce a discrete latent representation $z_d(y) \in [K]$, and an input $z_q(y)$ to be passed to the decoder $\mathrm{dec}$. For integers $i, m$ we will use $\tau_m(i)$ to denote the binary representation of $i$ using $m$ bits, with the inverse operation, i.e. conversion from binary to decimal, denoted by $\tau_m^{-1}$.
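The bit-level notation can be made concrete with a small sketch. The helper names `tau` and `tau_inv` are hypothetical and chosen only to mirror the symbols in the text; the sketch assumes that codes are numbered from 0 (the text numbers them from 1) and that the most significant bit comes first.

```python
from typing import List


def tau(i: int, m: int) -> List[int]:
    """Binary representation of i using exactly m bits (most significant first)."""
    if not 0 <= i < 2 ** m:
        raise ValueError(f"{i} does not fit in {m} bits")
    return [(i >> shift) & 1 for shift in range(m - 1, -1, -1)]


def tau_inv(bits: List[int]) -> int:
    """Inverse of tau: convert a list of bits back to an integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value
```

With this convention, a latent alphabet of size $K$ corresponds to bit vectors of length $\log_2 K$, which is the form in which the decomposed codes later in this section are combined.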
A discretization technique that has recently received a lot of interest is the Gumbel-Softmax trick proposed by (Jang et al., 2016; Maddison et al., 2016). In this case one simply projects the encoder output $\mathrm{enc}(y)$ using a learnable projection $W \in \mathbb{R}^{K \times D}$ to get the logits $l = W\,\mathrm{enc}(y)$, with the discrete code $z_d(y)$ being defined as
$$z_d(y) = \arg\max_{i \in [K]} l_i. \tag{1}$$
The decoder input $z_q(y) \in \mathbb{R}^D$ during evaluation and inference is computed using an embedding $e \in \mathbb{R}^{K \times D}$, where $z_q(y) = e_j$ with $j = z_d(y)$. For training, the Gumbel-Softmax trick is used by drawing i.i.d. samples $g_1, g_2, \dots, g_K$ from the Gumbel distribution: $g_i = -\log(-\log u_i)$, where $u_i \sim U(0, 1)$ are uniform samples. Then, as in (Jang et al., 2016; Maddison et al., 2016), one computes the softmax of the perturbed logits at temperature $\tau$ to get $w \in \mathbb{R}^K$:
$$w_i = \frac{\exp((l_i + g_i)/\tau)}{\sum_{j} \exp((l_j + g_j)/\tau)}, \tag{2}$$
with the input to the decoder $z_q(y)$ being simply the matrix-vector product $e^\top w$. Note that the Gumbel-Softmax trick makes the model differentiable and thus it can be trained using backpropagation.
For low temperature $\tau$ the vector $w$ is close to the 1-hot vector representing the maximum index of $l$, which is what is used during evaluation and testing. But at higher temperatures, it is only an approximation (see Figure 1 in Jang et al. (2016)).
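The sampling step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, indices are 0-based, and the clamp on the uniform sample is only there to avoid taking the logarithm of zero.

```python
import math
import random
from typing import List, Sequence


def hard_code(logits: Sequence[float]) -> int:
    """Discrete code used at evaluation time: index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def gumbel_softmax(logits: Sequence[float], temperature: float,
                   rng: random.Random) -> List[float]:
    """Softmax of Gumbel-perturbed logits at the given temperature."""
    perturbed = []
    for logit in logits:
        u = max(rng.random(), 1e-20)      # uniform sample, clamped away from 0
        g = -math.log(-math.log(u))       # Gumbel(0, 1) sample
        perturbed.append((logit + g) / temperature)
    top = max(perturbed)                  # subtract the max for numerical stability
    exps = [math.exp(p - top) for p in perturbed]
    total = sum(exps)
    return [value / total for value in exps]


def soft_decoder_input(weights: Sequence[float],
                       embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Weighted sum of embedding rows, i.e. the product of e-transpose and w."""
    dim = len(embeddings[0])
    return [sum(w * row[d] for w, row in zip(weights, embeddings))
            for d in range(dim)]
```

When one logit dominates and the temperature is low, the sampled weights are nearly one-hot, so `soft_decoder_input` returns almost exactly the embedding row that `hard_code` would select at inference time.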
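Another discretization technique, proposed by (Kaiser & Bengio, 2018) and recently explored, stems from semantic hashing (Salakhutdinov & Hinton, 2009b). The main idea behind this technique is to use a simple rounding bottleneck after squashing the encoder state $\mathrm{enc}(y)$ using a saturating sigmoid. Recall the saturating sigmoid function from (Kaiser & Sutskever, 2016; Kaiser & Bengio, 2016):
$$\sigma'(x) = \max(0, \min(1, 1.2\,\sigma(x) - 0.1)). \tag{3}$$
During training, Gaussian noise $\eta \sim \mathcal{N}(0, 1)^D$ is added to $\mathrm{enc}(y)$, which is then passed through the saturating sigmoid to get the vector $f_e(y)$:
$$f_e(y) = \sigma'(\mathrm{enc}(y) + \eta). \tag{4}$$
To compute the discrete latent representation, the binary vector $g_e(y)$ is constructed via rounding, i.e.:
$$g_e(y)_i = \begin{cases} 1 & \text{if } f_e(y)_i > 0.5, \\ 0 & \text{otherwise,} \end{cases} \tag{5}$$
with the discrete latent code $z_d(y)$ corresponding to $\tau^{-1}_{\log_2 K}(g_e(y))$. The input to the decoder $z_q(y)$ is computed using two embedding spaces $e^1, e^2 \in \mathbb{R}^{K \times D}$, with $z_q(y) = e^1_{h_e(y)} + e^2_{1 - h_e(y)}$, where the function $h_e$ is randomly chosen to be $f_e$ or $g_e$ half of the time during training, while $h_e$ is set equal to $g_e$ during inference.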
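The squashing and rounding steps of this bottleneck can be sketched as follows. The function names are hypothetical, the embedding lookup that produces the decoder input is deliberately left out, and the bit order used to turn the binary vector into an integer code (most significant bit first) is an assumption.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def saturating_sigmoid(x: float) -> float:
    """max(0, min(1, 1.2 * sigmoid(x) - 0.1)), with a numerically stable sigmoid."""
    if x >= 0:
        sigma = 1.0 / (1.0 + math.exp(-x))
    else:
        sigma = math.exp(x) / (1.0 + math.exp(x))
    return max(0.0, min(1.0, 1.2 * sigma - 0.1))


def semhash(enc: Sequence[float],
            rng: Optional[random.Random] = None) -> Tuple[List[float], List[int]]:
    """Return the squashed vector f and the rounded bits g.

    Unit Gaussian noise is added only when an rng is supplied (training);
    passing rng=None gives the deterministic inference-time behaviour.
    """
    noise = [rng.gauss(0.0, 1.0) if rng is not None else 0.0 for _ in enc]
    f = [saturating_sigmoid(v + n) for v, n in zip(enc, noise)]
    g = [1 if value > 0.5 else 0 for value in f]
    return f, g


def bits_to_code(bits: Sequence[int]) -> int:
    """Binary-to-decimal conversion of the rounded bit vector."""
    code = 0
    for bit in bits:
        code = (code << 1) | bit
    return code
```

Because the sigmoid is scaled by 1.2 and shifted by 0.1, moderately large activations already saturate at exactly 0 or 1, which keeps the training-time vector close to the bits used at inference.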
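The Vector Quantized - Variational Autoencoder (VQ-VAE) discretization bottleneck method was proposed in (van den Oord et al., 2017). Note that vector quantization based methods have a long history of being used successfully in various Hidden Markov Model (HMM) based machine learning models (see e.g., (Huang & Jack, 1989; Lee et al., 1989)). In VQ-VAE, the encoder output $\mathrm{enc}(y) \in \mathbb{R}^D$ is passed through a discretization bottleneck using a nearest-neighbor lookup on embedding vectors $e \in \mathbb{R}^{K \times D}$.
More specifically, the decoder input $z_q(y)$ is defined as
$$z_q(y) = e_k, \quad k = \arg\min_{j \in [K]} \| \mathrm{enc}(y) - e_j \|_2. \tag{6}$$
The corresponding discrete latent $z_d(y)$ is then the index $k$ of the embedding vector closest to $\mathrm{enc}(y)$ in $\ell_2$ distance. Let $l_r$ be the reconstruction loss of the decoder $\mathrm{dec}$ given $z_q(y)$ (e.g., the cross entropy loss); then the model is trained to minimize
$$L = l_r + \beta \, \| \mathrm{enc}(y) - \mathrm{sg}(z_q(y)) \|_2, \tag{7}$$
where $\beta$ is a scalar weight and $\mathrm{sg}(\cdot)$ is the stop gradient operator defined as follows:
$$\mathrm{sg}(x) = \begin{cases} x & \text{forward pass,} \\ 0 & \text{backward pass.} \end{cases} \tag{8}$$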
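The nearest-neighbor lookup and the value of the training objective in the forward pass can be sketched as below. This is only an illustration: the names are hypothetical, indices are 0-based, and the penalty is taken to be a plain (unsquared) L2 distance, which is an assumption; a squared distance is a common alternative. The stop gradient has no effect on forward values, so it appears only as a comment.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quantize(enc: Sequence[float],
             embeddings: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the index k of the closest embedding and the embedding itself."""
    k = min(range(len(embeddings)),
            key=lambda j: l2_distance(enc, embeddings[j]))
    return k, list(embeddings[k])


def vq_objective(reconstruction_loss: float, enc: Sequence[float],
                 z_q: Sequence[float], beta: float) -> float:
    """Forward value of: reconstruction loss + beta * ||enc - sg(z_q)||.

    In an autodiff framework z_q would be wrapped in a stop-gradient so that
    this term only pulls the encoder output towards its chosen embedding.
    """
    return reconstruction_loss + beta * l2_distance(enc, z_q)
```

For example, with embeddings `[[0, 0], [1, 1], [4, 0]]` and encoder output `[0.9, 1.2]`, `quantize` selects index 1, and the penalty term is the distance between `[0.9, 1.2]` and `[1, 1]`.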
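We maintain an exponential moving average (EMA) over the following two quantities: 1) the embeddings $e_j$ for every $j \in [K]$ and, 2) the count $c_j$ measuring the number of encoder hidden states that have $e_j$ as their nearest neighbor. The counts are updated over a mini-batch of targets $\{y_1, \dots, y_l, \dots\}$ as:
$$c_j \leftarrow \lambda c_j + (1 - \lambda) \sum_l \mathbb{1}\left[z_q(y_l) = e_j\right], \tag{9}$$
with the embedding $e_j$ being subsequently updated as:
$$e_j \leftarrow \lambda e_j + (1 - \lambda) \sum_l \frac{\mathbb{1}\left[z_q(y_l) = e_j\right] \mathrm{enc}(y_l)}{c_j}, \tag{10}$$
where $\mathbb{1}[\cdot]$ is the indicator function and $\lambda$ is a decay parameter which we set to $0.999$ in our experiments.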
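A literal, list-based sketch of one EMA step over a mini-batch is given below. The function name and argument layout are hypothetical, and the `assignments` list (the nearest-neighbor index chosen for each encoder output) is assumed to have been computed beforehand. The count is updated first and the new count is used when updating the embedding.

```python
from typing import List, Sequence, Tuple


def ema_update(embeddings: Sequence[Sequence[float]], counts: Sequence[float],
               encoder_outputs: Sequence[Sequence[float]],
               assignments: Sequence[int],
               decay: float = 0.999) -> Tuple[List[List[float]], List[float]]:
    """One EMA step for the counts and the embeddings."""
    new_embeddings, new_counts = [], []
    for j, (e_j, c_j) in enumerate(zip(embeddings, counts)):
        # Encoder outputs whose nearest neighbor is embedding j.
        members = [enc for enc, k in zip(encoder_outputs, assignments) if k == j]
        c_new = decay * c_j + (1.0 - decay) * len(members)
        e_new = [decay * value for value in e_j]
        if members:  # c_new > 0 here, so the division is safe
            for d in range(len(e_j)):
                total = sum(enc[d] for enc in members)
                e_new[d] += (1.0 - decay) * total / c_new
        new_embeddings.append(e_new)
        new_counts.append(c_new)
    return new_embeddings, new_counts
```

An embedding that attracts many encoder outputs receives a large update in every step, while an unused one receives none, which is the mechanism behind the index collapse discussed next.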
When the size of the discrete latent space $K$ is large, an issue with the approach of Section 2.3 is index collapse, where only a few of the embedding vectors get trained due to a rich-get-richer phenomenon. In particular, if an embedding vector $e_j$ is close to a lot of encoder outputs, then it receives the strongest signal to get even closer via the EMA update of Equations (9) and (10). Thus only a few of the embedding vectors will end up actually being used. To circumvent this issue, we propose two variants of decomposing VQ-VAE that make more efficient use of the embedding vectors for large values of $K$.
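The main idea behind this approach is to break up the encoder output $\mathrm{enc}(y)$ into $n_d$ smaller slices
$$\mathrm{enc}^1(y) \circ \mathrm{enc}^2(y) \circ \dots \circ \mathrm{enc}^{n_d}(y), \tag{11}$$
where each $\mathrm{enc}^i(y)$ is a $D/n_d$ dimensional vector and $\circ$ denotes the concatenation operation. Corresponding to each $\mathrm{enc}^i(y)$ we have an embedding space $e^i \in \mathbb{R}^{K' \times D/n_d}$, where $K' = 2^{(\log_2 K)/n_d}$. Note that the reason for the particular choice of $K'$ is information theoretic: using an embedding space of size $K$ from Section 2.3 allows us to express discrete codes of size $\log_2 K$. In the case when we have $n_d$ different slices, we want the total expressible size of the discrete code to be still $\log_2 K$, and so $K'$ is set to $2^{(\log_2 K)/n_d}$. We now compute nearest neighbors for each subspace as:
$$z_q^i(y) = e^i_{k_i}, \quad k_i = \arg\min_{j \in [K']} \| \mathrm{enc}^i(y) - e^i_j \|_2, \tag{12}$$
with the decoder input being $z_q(y) = z_q^1(y) \circ z_q^2(y) \circ \dots \circ z_q^{n_d}(y)$.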
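The training objective $L$ is the same as in Section 2.3, with each embedding space $e^i$ trained individually via EMA updates from $\mathrm{enc}^i(y)$ over a mini-batch of targets $\{y_1, \dots, y_l, \dots\}$:
$$c^i_j \leftarrow \lambda c^i_j + (1 - \lambda) \sum_l \mathbb{1}\left[z_q^i(y_l) = e^i_j\right], \tag{13}$$
$$e^i_j \leftarrow \lambda e^i_j + (1 - \lambda) \sum_l \frac{\mathbb{1}\left[z_q^i(y_l) = e^i_j\right] \mathrm{enc}^i(y_l)}{c^i_j}, \tag{14}$$
where $\mathbb{1}[\cdot]$ is the indicator function as before, and $\lambda$ is the decay parameter.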
The discrete latent code $z_d(y)$ is now defined as
$$z_d(y) = \tau^{-1}_{\log_2 K}\left(\tau_{\log_2 K'}(k_1) \circ \tau_{\log_2 K'}(k_2) \circ \dots \circ \tau_{\log_2 K'}(k_{n_d})\right). \tag{15}$$
Observe that when $n_d = 1$, sliced Vector Quantization reduces to the VQ-VAE of (van den Oord et al., 2017). On the other hand, when $n_d = \log_2 K$, sliced DVQ is, loosely speaking, equivalent to improved semantic hashing of Section 2.2: the individual table size $K'$ for each slice $\mathrm{enc}^i(y)$ is $2$, and it gets rounded to $0$ or $1$ depending on which embedding is closer. However, the rounding bottleneck in semantic hashing of Section 2.2 proceeds via a saturating sigmoid and thus, strictly speaking, the two techniques are different.
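The slicing, per-slice lookup and code composition can be put together in a short sketch. It is illustrative only: the function name is hypothetical, indices are 0-based, every codebook is assumed to have the same power-of-two size, and the per-slice indices are concatenated with the first slice in the most significant bits.

```python
from typing import List, Sequence, Tuple


def sliced_dvq(enc: Sequence[float],
               codebooks: Sequence[Sequence[Sequence[float]]]
               ) -> Tuple[int, List[float]]:
    """Quantize each slice of enc with its own codebook.

    Returns the combined discrete code and the concatenated decoder input.
    """
    n_d = len(codebooks)
    width = len(enc) // n_d                       # D / n_d
    bits_per_slice = (len(codebooks[0]) - 1).bit_length()  # log2 of table size
    code, z_q = 0, []
    for i, codebook in enumerate(codebooks):
        piece = enc[i * width:(i + 1) * width]
        k = min(range(len(codebook)),
                key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(piece, codebook[j])))
        code = (code << bits_per_slice) | k       # append the bits of this index
        z_q.extend(codebook[k])
    return code, z_q
```

With two slices and four entries per codebook, the combined code ranges over 16 values, the same as a single table of size 16, but each table only needs to keep four embeddings in use. Passing a single codebook recovers plain nearest-neighbor vector quantization.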
Note that similar decomposition approaches to vector quantization in the context of HMMs have been studied in the past under the name multiple code-books, see for instance (Huang et al., 1989; Rogina & Waibel, 1994; Peinado et al., 1996). The approach of sliced Vector Quantization has also been studied more recently in the context of clustering, under the name of Product or Cartesian Quantization in (Jegou et al., 2011; Norouzi & Fleet, 2013). A more recent work (Shu & Nakayama, 2018) explores a similar quantization approach coupled with the Gumbel-Softmax trick to learn compressed word embeddings (see also (Lam, 2018)).
Another natural way to decompose Vector Quantization is to use a set of fixed randomly initialized projections
$$\left\{ \pi^i \in \mathbb{R}^{D \times D/n_d} \;\middle|\; i \in [n_d] \right\} \tag{16}$$
to project the encoder output $\mathrm{enc}(y)$ into $D/n_d$-dimensional subspaces. For each projected vector $\mathrm{enc}^i(y) \in \mathbb{R}^{D/n_d}$ we have an embedding space $e^i \in \mathbb{R}^{K' \times D/n_d}$, where $K' = 2^{(\log_2 K)/n_d}$ as before. The training objective, the embedding updates, the input to the decoder, and the discrete latent representation are computed exactly as in Section 2.4.1. Note that when $n_d = 1$, projected Vector Quantization reduces to the VQ-VAE of (van den Oord et al., 2017) with an extra encoder layer corresponding to the projections $\pi^i$. Similarly, when $n_d = \log_2 K$, projected DVQ is equivalent to improved semantic hashing of Section 2.2 with the same analogy as in Section 2.4.1, except the encoder now has an extra layer. The VQ-VAE paper (van den Oord et al., 2017) also uses multiple latents in the experiments reported on CIFAR-10 and in Figure 5, using an approach similar to what we call projected DVQ.
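A sketch of the projected variant follows. All names are hypothetical, and the projections here are drawn from a unit Gaussian purely for illustration; the initializer actually used is not assumed. The only difference from slicing is that each sub-vector is obtained by multiplying the encoder output with a fixed matrix.

```python
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def random_projections(dim: int, n_d: int, rng: random.Random) -> List[Matrix]:
    """n_d fixed (non-trained) projection matrices of shape dim x (dim / n_d)."""
    width = dim // n_d
    return [[[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(dim)]
            for _ in range(n_d)]


def project(enc: Sequence[float], projection: Matrix) -> List[float]:
    """Row vector times matrix: maps enc to a (dim / n_d)-dimensional vector."""
    width = len(projection[0])
    return [sum(value * row[c] for value, row in zip(enc, projection))
            for c in range(width)]


def nearest(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(vector, codebook[j])))


def projected_dvq(enc: Sequence[float], projections: Sequence[Matrix],
                  codebooks: Sequence[Sequence[Sequence[float]]]
                  ) -> Tuple[List[int], List[float]]:
    """Per-projection nearest-neighbor indices and the concatenated decoder input."""
    indices, z_q = [], []
    for projection, codebook in zip(projections, codebooks):
        k = nearest(project(enc, projection), codebook)
        indices.append(k)
        z_q.extend(codebook[k])
    return indices, z_q
```

If the projection matrices simply select disjoint coordinate blocks, this reduces to the sliced variant, which gives a convenient sanity check.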
Using the discretization techniques from Section 2 we can now introduce the Latent Transformer (LT) model. Given an input-output pair $(x, y) = (x_1, \dots, x_k, y_1, \dots, y_n)$, the LT will make use of the following components.
The function $ae(y, x)$ will autoencode $y$ into a shorter sequence $l = l_1, \dots, l_m$ of discrete latent variables using the discretization bottleneck from Section 2.
The latent prediction model $lp(x)$ (a Transformer) will autoregressively predict $l$ based on $x$.
The decoder $ad(l, x)$ is a parallel model that will decode $y$ from $l$ and the input sequence $x$.
The functions $ae(y, x)$ and $ad(l, x)$ together form an autoencoder of the targets $y$ that has additional access to the input sequence $x$. For the autoregressive latent prediction we use a Transformer (Vaswani et al., 2017), a model based on multiple self-attention layers that was originally introduced in the context of neural machine translation. In this work we focused on the autoencoding functions and did not tune the Transformer: we used all the defaults from the baseline provided by the Transformer authors ($6$ layers, hidden size of $512$ and filter size of $4096$) and only varied parameters relevant to $ae$ and $ad$, which we describe below. The three components above give rise to two losses:
The autoencoder reconstruction loss $l_r$ coming from comparing $ad(ae(y, x), x)$ to $y$.
The latent prediction loss $l_{lp}$ that comes from comparing $l = ae(y, x)$ to the generated $lp(x)$.
We train the LT model by minimizing $l_r + l_{lp}$. Note that the final outputs $y_i$ are generated only depending on the latents $l$ but not on each other, as depicted in Figure 1. In an autoregressive model, each $y_i$ would have a dependence on all previous $y_j$, $j < i$, as is the case for the $l$s in Figure 1.
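The control flow at inference time can be sketched with stand-in callables. Everything here is hypothetical scaffolding: `next_latent` stands for one step of the latent prediction model and `decode_position` for the parallel decoder evaluated at a single output position; neither is an actual model, and the target length `n` is simply passed in.

```python
from typing import Callable, List, Sequence


def latent_transformer_decode(
        x: Sequence[int], m: int, n: int,
        next_latent: Callable[[Sequence[int], Sequence[int]], int],
        decode_position: Callable[[int, Sequence[int], Sequence[int]], int],
) -> List[int]:
    """Generate m latents one after another, then n outputs independently."""
    latents: List[int] = []
    for _ in range(m):
        # Sequential part: each latent depends on x and on all previous latents.
        latents.append(next_latent(x, latents))
    # Parallel part: every output depends only on the latents and x, never on
    # another output, so these n calls could all run at the same time.
    return [decode_position(i, latents, x) for i in range(n)]
```

Only the first loop is inherently sequential, so the number of dependent steps drops from $n$ to $m$; the list comprehension is written serially here, but its iterations carry no dependencies on one another.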
The autoencoding function we use is a stack of residual convolutions followed by an attention layer attending to $x$ and a stack of strided convolutions. We first apply to $y$ a $3$-layer block of $1$-dimensional convolutions with kernel size $3$ and padding with $0$s on both sides (SAME-padding). We use ReLU non-linearities between the layers and layer-normalization (Ba et al., 2016). Then, we add the input to the result, forming a residual block. Next we have an encoder-decoder attention layer with dot-product attention, same as in (Vaswani et al., 2017), with a residual connection. Finally, we process the result with a convolution with kernel size $2$ and stride $2$, effectively halving the length of the sequence. We do this strided processing $c$ times so as to decrease the length $C = 2^c$ times (later $C = n/m$). The result is put through the discretization bottleneck of Section 2. The final compression function is given by $ae(y, x) = \mathrm{bottleneck}(\mathrm{compress}(y, x))$, and the architecture described above is depicted in Figure 2.
To decode from the latent sequence $l = ae(y, x)$, we use the function $ad(l, x)$. It consists of $c$ steps that include up-convolutions that double the length, so effectively it increases the length $2^c = C = n/m$ times. Each step starts with the residual block, followed by an encoder-decoder attention to $x$ (both as in the $ae$ function above). Then it applies an up-convolution, which in our case is a feed-forward layer (equivalently a kernel-$1$ convolution) that doubles the internal dimension, followed by a reshape to twice the length. The result after the $c$ steps is then passed to a self-attention decoder, same as in the Transformer model (Vaswani et al., 2017).
Note that at the beginning of training (for the first 10K steps), we give the true targets $y$ to the transformer-decoder here, instead of the decompressed latents $l$. This pre-training ensures that the self-attention part has reasonable gradients that are then back-propagated to the convolution stack and then back to the $ae$ function and the discretization bottleneck of Section 2.
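The up-convolution used in the decoder (a position-wise feed-forward layer that doubles the internal dimension, followed by a reshape to twice the length) can be sketched with plain lists. The names are hypothetical, the residual block and attention of each step are omitted, and reusing one weight matrix for every step is a simplification made only to keep the example short.

```python
from typing import List, Sequence

Vector = List[float]


def up_convolution(sequence: Sequence[Sequence[float]],
                   weights: Sequence[Sequence[float]]) -> List[Vector]:
    """Map each d-dim position to 2d dims, then split it into two positions.

    `weights` has shape d x 2d; the output is twice as long as the input and
    keeps the internal dimension d.
    """
    d = len(sequence[0])
    doubled: List[Vector] = []
    for vec in sequence:
        wide = [sum(vec[r] * weights[r][c] for r in range(d))
                for c in range(2 * d)]
        doubled.append(wide[:d])   # reshape: first half becomes one position
        doubled.append(wide[d:])   # second half becomes the next position
    return doubled


def decompress(latent_inputs: Sequence[Sequence[float]],
               weights: Sequence[Sequence[float]], steps: int) -> List[Vector]:
    """Apply the up-convolution `steps` times: length grows by 2 ** steps."""
    sequence = [list(vec) for vec in latent_inputs]
    for _ in range(steps):
        sequence = up_convolution(sequence, weights)
    return sequence
```

Three such steps turn a latent sequence of length $m$ into one of length $8m$, mirroring the three stride-2 convolutions that would compress by a factor of 8 on the way in.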
Machine translation using deep neural networks achieved great success with sequence-to-sequence models (Sutskever et al., 2014; Bahdanau et al., 2014; Cho et al., 2014) that used recurrent neural networks (RNNs) with LSTM cells (Hochreiter & Schmidhuber, 1997). The basic sequence-to-sequence architecture is composed of an RNN encoder which reads the source sentence one token at a time and transforms it into a fixed-sized state vector. This is followed by an RNN decoder, which generates the target sentence, one token at a time, from the state vector. While a pure sequence-to-sequence recurrent neural network can already obtain good translation results (Sutskever et al., 2014; Cho et al., 2014), it suffers from the fact that the whole input sentence needs to be encoded into a single fixed-size vector. This clearly manifests itself in the degradation of translation quality on longer sentences and was overcome in (Bahdanau et al., 2014) by using a neural model of attention. Convolutional architectures have been used to obtain good results in word-level neural machine translation starting from (Kalchbrenner & Blunsom, 2013) and later in (Meng et al., 2015). These early models used a standard RNN on top of the convolution to generate the output, which creates a bottleneck and hurts performance. Fully convolutional neural machine translation without this bottleneck was first achieved in (Kaiser & Bengio, 2016) and (Kalchbrenner et al., 2016). The model in (Kaiser & Bengio, 2016) (Extended Neural GPU) used a recurrent stack of gated convolutional layers, while the model in (Kalchbrenner et al., 2016) (ByteNet) did away with recursion and used left-padded convolutions in the decoder. This idea, introduced in WaveNet (van den Oord et al., 2016), significantly improves efficiency of the model. The same technique was improved in a number of neural translation models recently, including (Gehring et al., 2017) and (Kaiser et al., 2017). 
Instead of convolutions, one can use stacked self-attention layers. This was introduced in the Transformer model (Vaswani et al., 2017) and has significantly improved state-of-the-art in machine translation while also improving the speed of training. Thus, we use the Transformer model as a baseline in this work.
Variational autoencoders were first introduced in (Kingma & Welling, 2013; Rezende et al., 2014); however, training them for discrete latent variable models has been challenging. The NVIL estimator of (Mnih & Gregor, 2014) proposes using a single-sample objective to optimize the variational lower bound, while VIMCO (Mnih & Rezende, 2016) proposes using a multi-sample objective of (Burda et al., 2015), which further speeds up convergence by using multiple samples from the inference network. There have also been several discretization bottlenecks proposed recently that have been used successfully in various learning tasks; see Section 2 for a more detailed description of the techniques directly relevant to this work. Other recent works with a similar approach to autoencoding include (Subakan et al., 2018).
Many of the recent state-of-the-art models in neural machine translation are autoregressive, meaning that the model consumes previously generated tokens to predict the next one. A recent work that attempts to speed up decoding by training a non-autoregressive model is (Gu et al., 2017). The approach of (Gu et al., 2017) is to use the self-attention Transformer model of (Vaswani et al., 2017), together with the REINFORCE algorithm (Williams, 1992), to model the fertilities of words to tackle the multi-modality problem in translation. However, the main drawback of this work is the need for extensive fine-tuning to make policy gradients work for REINFORCE, as well as the issue that this approach only works for machine translation and is not generic, so it cannot be directly applied to other sequence learning tasks.
The core of our approach to fast decoding consists of finding a sequence $l$ of latent variables such that we can predict the output sequence $y$ in parallel from $l$ and the input $x$. In other words, we assume that each token $y_i$ is conditionally independent of all other tokens $y_j$ ($j \neq i$) given $l$ and $x$: $y_i \perp y_j \mid l, x$. Our autoencoder is thus learning to create a one-layer graphical model with $m$ variables ($l_1 \dots l_m$) that can then be used to predict $y_1 \dots y_n$ independently of each other.
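Written out, the conditional independence assumption corresponds to the following factorization. The notation $l_{<j}$ for the latents preceding position $j$ is introduced here only for illustration:

$$P(y \mid l, x) = \prod_{i=1}^{n} P(y_i \mid l, x), \qquad P(l \mid x) = \prod_{j=1}^{m} P(l_j \mid l_{<j}, x).$$

The first product has no dependence between its factors, so all $n$ terms can be evaluated at once; only the second product, over the $m < n$ latents, has to be generated step by step.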
We train the Latent Transformer with the base configuration to make it comparable to both the autoregressive baseline (Vaswani et al., 2017) and to the recent non-autoregressive NMT results (Gu et al., 2017). We used around 33K subword units as vocabulary and implemented our model in TensorFlow (Abadi et al., 2015). Our implementation, together with hyper-parameters and everything needed to reproduce our results, is available as open-source (the code is available under redacted).
For non-autoregressive models, it is beneficial to generate a number of possible translations and re-score them with an autoregressive model. This can be done in parallel, so it is still fast, and it improves performance. This is called noisy parallel decoding in (Gu et al., 2017) and we include results both with and without it. The best BLEU scores obtained by different methods are summarized in Table 1. As you can see, our method with re-scoring almost matches the baseline autoregressive model without beam search.
Model | BLEU |
---|---|
Baseline Transformer [1] | 27.3 |
Baseline Transformer [2] | 23.5 |
Baseline Transformer [2] (no beam-search) | 22.7 |
NAT+FT (no NPD) [2] | 17.7 |
LT without rescoring | 19.8 |
NAT+FT (NPD rescoring 10) [2] | 18.7 |
LT rescoring top-10 | 21.0 |
NAT+FT (NPD rescoring 100) [2] | 19.2 |
LT rescoring top-100 | 22.5 |
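The re-scoring step described before Table 1 amounts to choosing, among several candidate translations, the one an autoregressive model scores highest. A minimal sketch, in which `score` is a hypothetical stand-in for that model's log-probability of a candidate:

```python
from typing import Callable, Sequence


def rescore(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score.

    Each candidate is scored independently of the others, so in practice all
    candidates can be scored in one parallel batch.
    """
    return max(candidates, key=score)
```

Scoring a complete candidate with an autoregressive model does not require step-by-step generation, because all tokens are already known; this is why re-scoring the top 10 or top 100 candidates stays fast.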
To get a better understanding of the non-autoregressive models, we focus on performance without rescoring and investigate different variants of the Latent Transformer. We include different discretization bottlenecks, and report the final BLEU scores together with decoding speeds in Table 2. The LT is slower in non-batch mode than the simple NAT baseline of (Gu et al., 2017), which might be caused by system differences (our code is in TensorFlow and has not been optimized, while their implementation is in Torch). Latency at higher batch-size is much smaller, showing that the speed of the LT can still be significantly improved with batching. The choice of the discretization bottleneck seems to have a small impact on speed, and both DVQ and improved semantic hashing yield good BLEU scores, while VQ-VAE fails in this context (see below for a discussion).
Model | BLEU | Latency (no batching) | Latency (batched) |
---|---|---|---|
Baseline (no beam-search) | 22.7 | 408 ms | - |
NAT | 17.7 | 39 ms | - |
NAT+NPD=10 | 18.7 | 79 ms | - |
NAT+NPD=100 | 19.2 | 257 ms | - |
LT, Improved Semhash | 19.8 | 105 ms | 8 ms |
LT, VQ-VAE | 2.78 | 148 ms | 7 ms |
LT, s-DVQ | 19.7 | 177 ms | 7 ms |
LT, p-DVQ | 19.8 | 182 ms | 8 ms |
Since the discretization bottleneck is critical to obtaining good results for fast decoding of sequence models, we focused on looking into it, especially in conjunction with the size of the latent vocabulary, the dimension of the latent space, and the number of decompositions for DVQ.
An issue with the VQ-VAE of (van den Oord et al., 2017) that motivated the introduction of DVQ in Section 2.4 is index collapse, where only a few embeddings are used and subsequently trained. This can be visualized in the histogram of Figure 4, where the $x$-axis corresponds to the possible values of the discrete latents, and the $y$-axis corresponds to the training progression of the model (time steps increase in a downward direction). On the other hand, using the DVQ from Section 2.4.1 with $n_d = 2$ leads to a much more balanced use of the available discrete latent space, as can be seen from Figure 5. We also report the percentage of available latent code-words used for different settings of $n_d$ in Table 3; the usage of the code-words is maximized for $n_d = 2$.
The other variables for DVQ are the choice of the decomposition and the number of decompositions. For the projected DVQ, we use fixed projections $\pi^i$ initialized using the Glorot initializer (Glorot & Bengio, 2010). We also found that the optimal number of decompositions for our choice of latent vocabulary size $\log_2 K = 14$ and $D = 512$ was $n_d = 2$, with $n_d = 1$ (i.e., regular VQ-VAE) performing noticeably worse (see Table 2 and Figure 4). Setting higher values of $n_d$ led to a decline in performance, possibly because the expressive power of each decomposition was reduced, and the model also ended up using fewer latents (see Table 3).
$n_d$ | Percentage of latents used |
---|---|
1 | |
2 | |
4 | |
8 | |
Another important point about LT is that it allows making different trade-offs by tuning the fraction $n/m$ of the length of the original output sequence to the length of the latent sequence. As $n/m$ increases, so does the parallelism and decoding speed, but the latents need to encode more and more information to be able to decode the outputs in parallel. To study this tradeoff, we measure the reconstruction loss (the perplexity of the reconstructed vs the original) for different $n/m$ and varying the number of bits in the latent variables. The results, presented in Table 4, show clearly that reconstruction gets better, as expected, if the latent state has more bits or is used to compress fewer subword units.
1.33 | 0.64 | |
2.04 | 1.26 | |
2.44 | 1.77 |
Autoregressive sequence models based on deep neural networks were made successful due to their applications in machine translation (Sutskever et al., 2014) and have since yielded state-of-the-art results on a number of tasks. With models like WaveNet and Transformer, it is possible to train them fast in a parallel way, which opened the way to applications to longer sequences, such as WaveNet for sound generation (van den Oord et al., 2016) or Transformer for long text summarization (Liu et al., 2018) and image generation (Vaswani et al., 2018). The key problem appearing in these new applications is the slowness of decoding: it is not practical to wait for minutes to generate a single example. In this work, while still focusing on the original problem of machine translation, we lay the groundwork for fast decoding for sequence models in general. While the Latent Transformer does not yet recover the full performance of the autoregressive model, it is already an order of magnitude faster and performs better than a heavily hand-tuned, task-specific non-autoregressive model.
In the future, we plan to improve both the speed and the accuracy of the Latent Transformer. A simple way to improve speed that we did not yet try is to use the methods from this work in a hierarchical way. As illustrated in Figure 1, the latents are still generated autoregressively, which takes most of the time for longer sentences. In the future, we will apply the LT model to generate the latents in a hierarchical manner, which should result in further speedup. To improve the BLEU scores, on the other hand, we intend to investigate methods related to Gibbs sampling or even make the model partially autoregressive. For example, one could generate only the odd-indexed outputs, $y_1 y_3 y_5 \dots$, based on the latent symbols $l$, and then generate the even-indexed ones based on both the latents and the odd-indexed outputs. We believe that including such techniques has the potential to remove the gap between fast-decoding models and purely autoregressive ones and will lead to many new applications.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
Norouzi, M. and Fleet, D. J. Cartesian k-means. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 3017–3024. IEEE, 2013.
Salakhutdinov, R. and Hinton, G. E. Deep Boltzmann machines. In Proceedings of AISTATS’09, pp. 448–455, 2009a.
Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., and Sohl-Dickstein, J. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pp. 2624–2633, 2017.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.