Discrete Autoencoders for Sequence Models

by   Łukasz Kaiser, et al.

Recurrent models for sequences have been recently successful at many tasks, especially for language modeling and machine translation. Nevertheless, it remains challenging to extract good representations from these models. For instance, even though language has a clear hierarchical structure going from characters through words to sentences, it is not apparent in current language models. We propose to improve the representation in sequence models by augmenting current approaches with an autoencoder that is forced to compress the sequence through an intermediate discrete latent space. In order to propagate gradients through this discrete representation we introduce an improved semantic hashing technique. We show that this technique performs well on a newly proposed quantitative efficiency measure. We also analyze latent codes produced by the model, showing how they correspond to words and phrases. Finally, we present an application of the autoencoder-augmented model to generating diverse translations.



1 Introduction

Autoencoders have a long history in deep learning (Hinton & Salakhutdinov, 2006; Salakhutdinov & Hinton, 2009a; Vincent et al., 2010; Kingma & Welling, 2013). In most cases, autoencoders operate on continuous representations, either by simply making a bottleneck (Hinton & Salakhutdinov, 2006), denoising (Vincent et al., 2010), or adding a variational component (Kingma & Welling, 2013). In many cases though, a discrete latent representation is potentially a better fit.

Language is inherently discrete, and autoregressive models based on sequences of discrete symbols yield impressive results. A discrete representation can be fed into a reasoning or planning system or act as a bridge towards any other part of a larger system. Even in reinforcement learning where action spaces are naturally continuous, Metz et al. (2017) show that discretizing them and using autoregressive models can yield improvements.

Unfortunately, using discrete latent variables is challenging in deep learning, and even with continuous autoencoders, the interactions with an autoregressive component cause difficulties. Despite some success (Bowman et al., 2016; Yang et al., 2017), the task of meaningfully autoencoding text in the presence of an autoregressive decoder has remained a challenge.

In this work we present an architecture that autoencodes a sequence $s$ of $N$ discrete symbols from any vocabulary (e.g., a tokenized sentence) into a $K$-fold (we test $K=8$ and $K=32$) compressed sequence $c(s)$ of $\lceil N/K \rceil$ latent symbols from a new vocabulary which is learned. The compressed sequence is generated to minimize perplexity in a (possibly conditional) language model trained to predict the next token on $c(s)s$: the concatenation of $c(s)$ with the original sequence $s$.

Since gradient signals can vanish when propagating over discrete variables, the compression function $c(s)$ can be hard to train. To solve this problem, we draw from the old technique of semantic hashing (Salakhutdinov & Hinton, 2009b). There, to discretize a dense vector $v$ one computes $\sigma(v + n)$, where $\sigma$ is the sigmoid function and $n$ represents annealed Gaussian noise that pushes the network to not use middle values in $v$. We enhance this method by using a saturating sigmoid and a straight-through pass with only bits passed forward. These techniques, described in detail below, make it possible to forgo the annealing of the noise and provide a stable discretization mechanism that requires neither annealing nor additional loss factors.

We test our discretization technique by amending language models over $s$ with the autoencoded sequence $c(s)$. We compare the perplexity achieved on $s$ with and without the $c(s)$ component, and contrast this value with the number of bits used in $c(s)$. We argue that this number is a proper measure for the performance of a discrete autoencoder. It is easy to compute and captures the performance of the autoencoding part of the model. This quantitative measure allows us to compare the technique we introduce with other methods, and we show that it performs better than a Gumbel-Softmax (Jang et al., 2016; Maddison et al., 2016) in this context.

Finally, we discuss the use of adding the autoencoded part $c(s)$ to a sequence model. We present samples from a character-level language model and show that the latent symbols correspond to words and phrases when the architecture of $c(s)$ is local. Then, we introduce a decoding method in which $c(s)$ is sampled and then $s$ is decoded using beam search. This method alleviates a number of problems observed with beam search or pure sampling. We show how our decoding method can be used to obtain diverse translations of a sentence from a neural machine translation model. To summarize, the main contributions of this paper are:

  1. a discretization technique that works well without any extra losses or parameters to tune,

  2. a way to measure performance of autoencoders for sequence models with baselines,

  3. an improved way to sample from sequence models trained with an autoencoder part.

2 Techniques

Below, we introduce our discretization method, the autoencoding function $c(s)$, and finally the complete model that we use for our experiments. All code and hyperparameter settings needed to replicate our experiments are available as open-source (see transformer_vae.py in https://github.com/tensorflow/tensor2tensor).

2.1 Discretization by Improved Semantic Hashing

As already mentioned above, our discretization method stems from semantic hashing (Salakhutdinov & Hinton, 2009b). To discretize a $b$-dimensional vector $v$, we first add noise, so $v^n = v + n$. The noise $n$ is drawn from a $b$-dimensional Gaussian distribution with mean $0$ and standard deviation $1$ (deviations between $0$ and $1.5$ all work fine, see ablations below). The sum is component-wise, as are all operations below. Note that noise is used only for training; during evaluation and inference $n = 0$. From $v^n$ we compute two vectors: $v_1 = \sigma'(v^n)$ and $v_2 = (v^n > 0)$, where $\sigma'$ is the saturating sigmoid function from (Kaiser & Sutskever, 2016; Kaiser & Bengio, 2016):

$$\sigma'(x) = \max(0, \min(1, 1.2\,\sigma(x) - 0.1)).$$

The vector $v_2$ represents the discretized value of $v$ and is used for evaluation and inference. During training, in the forward pass we use $v_1$ half of the time and $v_2$ the other half. In the backward pass, we let gradients always flow to $v_1$, even if we used $v_2$ in the forward computation. This can be done in TensorFlow using:

    v2 += v1 - tf.stop_gradient(v1)

We will denote the vector $v$ discretized in the above way by $v^d$. Note that if $v$ is $b$-dimensional then $v^d$ will have $b$ bits. Since in other parts of the system we will predict $v^d$ with a softmax, we want the number of bits to not be too large. In our experiments we stick with $b = 16$, so $v^d$ is a vector of $16$ bits, and so can be interpreted as an integer between $0$ and $2^{16} - 1 = 65535$.
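The discretization step can be sketched in plain Python as follows. This is a minimal illustration, assuming the saturating sigmoid has the form max(0, min(1, 1.2·sigmoid(x) − 0.1)) used in the cited Neural GPU work. The function names (`saturating_sigmoid`, `discretize`, `bits_to_int`), the single coin flip per call, and the most-significant-bit-first ordering are illustrative assumptions rather than the authors' implementation. The straight-through gradient is only noted in a comment, since plain Python has no automatic differentiation.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def saturating_sigmoid(x):
    """Saturating sigmoid: max(0, min(1, 1.2 * sigmoid(x) - 0.1))."""
    return max(0.0, min(1.0, 1.2 * sigmoid(x) - 0.1))


def discretize(v, noise_std=1.0, training=True, rng=random):
    """Return (forward_values, bits) for a dense vector v given as a list of floats."""
    # Gaussian noise is added only during training; at inference the noise is zero.
    vn = [x + (rng.gauss(0.0, noise_std) if training else 0.0) for x in v]
    v1 = [saturating_sigmoid(x) for x in vn]   # dense values in [0, 1]
    v2 = [1.0 if x > 0 else 0.0 for x in vn]   # hard bits
    # Assumption: one coin flip per call picks the forward values during training.
    # In an autodiff framework the gradient would flow to v1 in both cases.
    use_dense = training and rng.random() < 0.5
    return (v1 if use_dense else v2), [int(bit) for bit in v2]


def bits_to_int(bits):
    """Interpret a list of bits as an integer (first bit most significant; an assumption)."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


forward, bits = discretize([2.0, -1.0, 0.5, -3.0], training=False)
print(forward, bits, bits_to_int(bits))
```

At inference time the forward values are exactly the hard bits, so a 16-dimensional input would map to one integer in the range 0 to 65535.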

The dense vectors representing activations in our sequence models have much larger dimensionality than $16$ (often $512$, see the details in the experimental section below). To discretize such a high-dimensional vector $w$ we first have a simple fully-connected layer converting it into $v = \mathrm{dense}(w, 16)$. In our notation, $\mathrm{dense}(x, n)$ denotes a fully-connected layer applied to $x$ and mapping it into $n$ dimensions, i.e., $\mathrm{dense}(x, n) = xW + B$, where $W$ is a learned matrix of shape $d \times n$, $d$ is the dimensionality of $x$, and $B$ is a learned bias vector of size $n$. The discretized vector $v^d$ is converted back into a high-dimensional vector using a 3-layer feed-forward network:

    h1a = dense(vd, filter_size)
    h1b = dense(1.0 - vd, filter_size)
    h2 = dense(relu(h1a + h1b), filter_size)
    result = dense(relu(h2), hidden_size)

Above, every time we apply dense we create a new weight matrix and bias to be learned. The relu function is defined in the standard way: $\mathrm{relu}(x) = \max(x, 0)$. In the network above, we usually use a large filter_size; in our experiments we set it to $4096$ while hidden_size was usually $512$. We suspect that this allows the above network to recover from the discretization bottleneck by simulating the distribution of $w$ encountered during training. Given a dense, high-dimensional vector $w$ we will denote the corresponding result returned from the network above by $\mathrm{bottleneck}(w)$ and the corresponding discrete vector $v_2$ by $\mathrm{discrete}(w)$.
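To make the shapes in the 3-layer network concrete, here is a small pure-Python sketch. It uses randomly initialized, untrained weights and deliberately tiny sizes; `make_dense` and `make_expander` are hypothetical helper names, and the Gaussian initialization is an assumption chosen only so the example runs.

```python
import random


def make_dense(in_dim, out_dim, rng):
    """Create a fully-connected layer x -> xW + B with its own fresh parameters."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]
    B = [0.0] * out_dim

    def dense(x):
        return [sum(x[i] * W[i][j] for i in range(in_dim)) + B[j]
                for j in range(out_dim)]

    return dense


def relu(x):
    return [max(a, 0.0) for a in x]


def make_expander(num_bits, filter_size, hidden_size, seed=0):
    """Build the network mapping a bit vector back to a dense vector."""
    rng = random.Random(seed)
    dense_1a = make_dense(num_bits, filter_size, rng)
    dense_1b = make_dense(num_bits, filter_size, rng)
    dense_2 = make_dense(filter_size, filter_size, rng)
    dense_3 = make_dense(filter_size, hidden_size, rng)

    def expand(vd):
        h1a = dense_1a(vd)
        h1b = dense_1b([1.0 - bit for bit in vd])  # the complemented bits
        h2 = dense_2(relu([a + b for a, b in zip(h1a, h1b)]))
        return dense_3(relu(h2))

    return expand


expand = make_expander(num_bits=4, filter_size=8, hidden_size=6)
result = expand([1.0, 0.0, 1.0, 0.0])
print(len(result))
```

Feeding both the bits and their complement means that a zero bit still contributes a learned signal through the second input layer.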

2.2 Gumbel-Softmax for Discretization

As an alternative discretization method, we consider the recently studied Gumbel-Softmax (Jang et al., 2016; Maddison et al., 2016). In that case, given a vector $w$ we compute $\mathrm{discrete}_g(w)$ by applying a linear layer mapping into $2^{16}$ elements, resulting in the logits $l$. During evaluation and inference we simply pick the index of $l$ with maximum value for $\mathrm{discrete}_g(w)$, and the vector $\mathrm{bottleneck}_g(w)$ is computed by an embedding. During training we first draw samples $g$ from the Gumbel distribution: $g = -\log(-\log(u))$, where $u \sim U(0, 1)$ are uniform samples. Then, as in (Jang et al., 2016), we compute $x$, the log-softmax of $l$, and set:

$$y_i = \frac{\exp((x_i + g_i)/\tau)}{\sum_j \exp((x_j + g_j)/\tau)}.$$

With low temperature $\tau$ this vector is close to the 1-hot vector representing the maximum index of $l$. But with higher temperature, it is an approximation (see Figure 1 in Jang et al. (2016)). We multiply this vector $y$ by the embedding matrix to compute $\mathrm{bottleneck}_g(w)$ during training.
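The Gumbel-Softmax sampling step can be illustrated with a short pure-Python sketch. The function names and the clamping of the uniform sample away from zero are assumptions made for numerical safety; the embedding lookup that follows in the model is omitted.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax of a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


def gumbel_softmax(logits, tau, rng=random):
    """Return a relaxed one-hot sample y with temperature tau."""
    x = log_softmax(logits)
    # g = -log(-log(u)) with u ~ U(0, 1); u is clamped away from 0 to keep the log finite.
    g = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    z = [(xi + gi) / tau for xi, gi in zip(x, g)]
    m = max(z)  # subtract the maximum before exponentiating for stability
    e = [math.exp(zi - m) for zi in z]
    total = sum(e)
    return [ei / total for ei in e]


y = gumbel_softmax([0.0, 100.0, 0.0], tau=0.1, rng=random.Random(0))
print(y)
```

With a large gap between logits and a low temperature, the sample is essentially the 1-hot vector of the largest logit; raising `tau` spreads the mass across entries.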

2.3 Autoencoding Function

Having the functions $\mathrm{bottleneck}(w)$ and $\mathrm{discrete}(w)$ (respectively their Gumbel-Softmax versions), we can now describe the architecture of the autoencoding function $c(s)$. We assume that $s$ is already a sequence of dense vectors, e.g., coming from embedding vectors from a tokenized sentence. To halve the size of $s$, we first apply to it $3$ layers of $1$-dimensional convolutions with kernel size $3$ and padding with $0$s on both sides (SAME-padding). We use ReLU non-linearities between the layers and layer-normalization (Ba et al., 2016). Then, we add the input to the result, forming a residual block. Finally, we process the result with a convolution with kernel size $2$ and stride $2$, effectively halving the size of $s$. In the local version of this function we only do the final strided convolution, without the residual block.

To autoencode a sequence $s$ and shorten it $K$-fold, with $K = 2^k$, we first apply the above step $k$ times, obtaining a sequence $s'$ that is $K$ times shorter. Then we put it through the discretization bottleneck described above. The final compression function is given by $c(s) = \mathrm{bottleneck}(s')$ and the architecture described above is depicted in Figure 1.

Note that, since we perform $3$ convolutions with kernel $3$ in each step, the network has access to a large context: $3 \cdot 2^{k-1}$ just from the receptive fields of convolutions in the last step. That is why we also consider the local version. With only strided convolutions, the $i$-th symbol in the local $c(s)$ has access only to a fixed $2^k$ symbols from the sequence $s$ and can only compress them.
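The length arithmetic of the compression can be checked with a short sketch. It assumes that each halving step rounds an odd length up (as stride-2 convolution with SAME padding would) and uses 0-based positions; both helper names are hypothetical.

```python
def compressed_length(n, k):
    """Length of the latent sequence after k halving steps (odd lengths round up)."""
    for _ in range(k):
        n = (n + 1) // 2
    return n


def local_window(i, k):
    """Half-open range [start, end) of input positions seen by latent symbol i
    in the local variant, where each of the k steps has kernel 2 and stride 2."""
    width = 2 ** k
    return i * width, (i + 1) * width


print(compressed_length(100, 3))  # 8-fold compression of 100 symbols -> 13
print(local_window(2, 3))         # third latent symbol covers positions 16..23 -> (16, 24)
```

In the local variant, each latent symbol therefore summarizes one fixed, non-overlapping block of the input.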

Figure 1: Architecture of the autoencoding function $c(s)$ (panels: single step; autoencoding function). We write conv$^k_s$ to denote a 1D convolutional layer with kernel size $k$ and stride $s$. See text for more details.

Training with $c(s)$ defined above from scratch is hard, since at the beginning of training $s'$ is generated by many layers of untrained convolutions that are only getting gradients through the discretization bottleneck. To help training, we add a side-path for $c(s)$ without discretization: we just use $c(s) = s'$ for the first $10000$ training steps. In this pretraining stage the network reaches a loss of almost $0$, as everything needed to reconstruct $s$ is encoded in $s'$. After switching to $c(s) = \mathrm{bottleneck}(s')$ the loss is high again and improves during further training.

2.4 Autoencoding Sequence Model

To test the autoencoding function $c(s)$ we will use it to prefix the sequence $s$ in a sequence model. Normally, a sequence model would generate the $i$-th element of $s$ conditioning on all elements of $s$ before that, $s_{<i}$, and possibly on some other inputs. For example, a language model would just condition on $s_{<i}$ while a neural machine translation model would condition on the input sentence (in the other language) and $s_{<i}$. We do not change the sequence models in any way other than adding the sequence $c(s)$ as the prefix of $s$. Actually, for reasons analogous to those in (Sutskever et al., 2014), we first reverse the sequence $c(s)$, then add a separator symbol (#), and only then concatenate it with $s$, as depicted in Figure 2. We also use a separate set of parameters for the model predicting $c(s)$ so as to make sure that the models predicting $s$ with and without $c(s)$ have the same capacity.
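The construction of the model input from the latent prefix can be sketched in a few lines. The function name and the use of plain Python lists are assumptions for illustration; the latent symbols and tokens shown are made up.

```python
def build_model_input(latents, tokens, separator="#"):
    """Reverse the latent sequence, append the separator, then append the tokens."""
    return list(reversed(latents)) + [separator] + list(tokens)


example = build_model_input([7, 42, 3], ["a", "b"])
print(example)  # [3, 42, 7, '#', 'a', 'b']
```

The sequence model is then trained for next-symbol prediction on this concatenation, so every original token can attend to the whole latent prefix.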

Figure 2: Comparison of a standard language model (first panel) and our autoencoder-augmented language model (second panel). The architecture for $c(s)$ is presented in Figure 1 and the arrows from $s$ to $c(s)$ depict dependence.

As the architecture for the sequence model we use the Transformer (Vaswani et al., 2017). Transformer is based on multiple attention layers and was originally introduced in the context of neural machine translation. We focused on the autoencoding function $c(s)$ and did not tune the sequence model in this work: we used all the defaults from the baseline provided by the Transformer authors ($6$ layers, hidden size of $512$ and filter size of $4096$) and only varied parameters relevant to $c(s)$.

3 Experiments

We experimented with autoencoding on different sequence tasks: (1) on a character-level language model, (2) on a word-level language model, and (3) on a word-level translation model. The goal for (1) was to check if our technique works at all, since character sequences are naturally amenable to compression into shorter sequences of objects from a larger vocabulary. For (2), we wanted to check if the good results obtained in (1) will still hold if the input is from a larger vocabulary and inherently more compressed space. Finally, in (3) we want to check if this method is applicable to conditional models and how it can be used to improve decoding.

We use the LM1B corpus (Chelba et al., 2013) for language modelling and we tokenize it using a subword (wordpiece) tokenizer (Sennrich et al., 2016) into a vocabulary of 32000 words and word-pieces. For translation, we use the WMT English-German corpus, similarly tokenized into a vocabulary of 32000 words and word-pieces. (We used https://github.com/tensorflow/tensor2tensor for data preparation.)

Below we report both qualitative and quantitative results. First, we focus on measuring the performance of our autoencoder quantitatively. To do that, we introduce a measure of discrete autoencoder performance on sequence tasks and compare our semantic hashing based method to Gumbel-Softmax on this scale.

3.1 Discrete Sequence Autoencoding Efficiency

Sequence models trained for next-symbol prediction are usually trained (and often also evaluated) based on the perplexity per token that they reach. Perplexity is defined as $2^H$, where $H$ is the entropy (in bits) of a distribution. Therefore, a language model that reaches a per-word perplexity of $p$, say $p = 32$, on a sentence $s$ can be said to compress each word from $s$ into $\log_2(p) = 5$ bits of information.

Let us now assume that this model is allowed to access some additional bits of information about $s$ before decoding. In our autoencoding case, we let it peek at $c(s)$ before decoding $s$, and $c(s)$ has $K$ times fewer symbols and $b$ bits in each symbol. So $c(s)$ has the information capacity of $b/K$ bits per word. If our autoencoder were perfectly aligned with the needs of the language model, then allowing it to peek into $c(s)$ would lower its information needs by these $b/K$ bits per word. The perplexity $p'$ of the model with access to $c(s)$ would thus satisfy $\log_2(p') = \log_2(p) - b/K$, so its perplexity would be $p' = 2^{\log_2(p) - b/K}$.

Getting the autoencoder perfectly aligned with the language model is hard, so in practice the perplexity $p'$ is always higher. But since we measure it (and optimize for it during training), we can calculate how many bits the $c(s)$ part has actually contributed to lowering the perplexity. We calculate $\log_2(p) - \log_2(p')$ and then, if $c(s)$ is $K$ times shorter than $s$ and uses $b$ bits, we define the discrete sequence autoencoding efficiency as:

$$\mathrm{DSAE} = \frac{K\,(\log_2(p) - \log_2(p'))}{b} = \frac{K\,(\ln(p) - \ln(p'))}{b \ln(2)}.$$

The second formulation is useful when the raw numbers are given as natural logarithms, as is often the case during neural network training.

Defined in this way, DSAE measures how many of the available bits in $c(s)$ are actually used well by the model that peeks into the autoencoded part. Note that some models may have autoencoding capacity higher than the number of bits per word that $\log_2(p)$ indicates. In that case achieving DSAE=1 is impossible even if $\log_2(p') = 0$ and the autoencoding is perfect. One should be careful when reporting DSAE for such over-capacitated models.
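The efficiency measure is simple to compute from natural-log perplexities. The sketch below is an illustration with hypothetical function names; the default of 16 bits per latent symbol is an assumption that reproduces the DSAE percentages reported in Table 1 from the listed ln(p) and ln(p') values.

```python
import math


def ideal_perplexity(p, K, b=16):
    """Perplexity of a model whose b/K latent bits per word are all used perfectly."""
    return 2 ** (math.log2(p) - b / K)


def dsae(ln_p, ln_p_prime, K, b=16):
    """Discrete sequence autoencoding efficiency from natural-log perplexities."""
    return K * (ln_p - ln_p_prime) / (b * math.log(2))


print(ideal_perplexity(32, K=8))             # 5 bits minus 2 bits per word -> 8.0
print(round(dsae(1.027, 0.822, K=32), 2))    # LM-en (characters) -> 0.59
print(round(dsae(3.586, 2.823, K=8), 2))     # LM-en (word)       -> 0.55
print(round(dsae(1.449, 1.191, K=8), 2))     # NMT-en-de (word)   -> 0.19
```

A value of 1.0 would mean that every latent bit lowered the per-word information need by exactly one bit.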

| Problem | ln(p) | ln(p') | K | DSAE |
|---|---|---|---|---|
| LM-en (characters) | 1.027 | 0.822 | 32 | 59% |
| LM-en (word) | 3.586 | 2.823 | 8 | 55% |
| NMT-en-de (word) | 1.449 | 1.191 | 8 | 19% |
| LM-en (word, Gumbel-Softmax) | 3.586 | 3.417 | 8 | 12% |
| NMT-en-de (word, Gumbel-Softmax) | 1.449 | 1.512 | 8 | 0% |

Table 1: Log-perplexities per word of sequence models with and without autoencoders, and their autoencoding efficiency. Results for Gumbel-Softmax heavily depend on tuning; see text for details.

So how does our method perform on DSAE and how does it compare with Gumbel-Softmax? In Table 1 we list log-perplexities of baseline and autoencoder models. We report numbers for the global version of $c(s)$ on our 3 problems and compare it to Gumbel-Softmax on word-level problems. We did not manage to run the Gumbel-Softmax on character-level data in our baseline configuration because it requires too much memory (as it needs to learn the embeddings for each latent discrete symbol). Also, we found that the results for Gumbel-Softmax heavily depend on how the temperature parameter $\tau$ is annealed during training. We tuned this on 5 runs of a smaller model and chose the best configuration. This was still not enough, as in many runs the Gumbel-Softmax would only utilize a small portion of the discrete symbols. We added an extra loss term to increase the variance of the Gumbel-Softmax and ran another 5 tuning runs to optimize this loss term. We used the best configuration for the experiments above. Still, we did not manage to get any information autoencoded in the translation model, and got only $12\%$ efficiency in the language model (see Table 1).

Our method, on the other hand, was most efficient on character-level language modeling, where we reach almost $60\%$ efficiency, and it retained high efficiency on the word-level language modeling task. On the translation task, our efficiency goes down to $19\%$, possibly because the $c(s)$ function does not take inputs into account, and so may not be able to compress the right parts to align with the conditional model that outputs $s$ depending on the inputs. But even with $19\%$ efficiency it is still useful for sampling from the model, as shown below.

3.2 Sensitivity to Noise

To make sure that our autoencoding method is stable, we experiment with different standard deviations for the noise in the semantic hashing part. We perform these experiments on word-level language modelling with a smaller model configuration ($3$ layers, hidden size of $384$ and filter size of $2048$). The results, presented in Table 2, show that our method is robust to the amount of noise.

| Noise standard deviation | ln(p) | ln(p') | K | DSAE |
|---|---|---|---|---|
| 1.5 | 3.912 | 3.313 | 8 | 43.2% |
| 1.0 | 3.912 | 3.239 | 8 | 48.5% |
| 0.5 | 3.912 | 3.236 | 8 | 48.5% |
| 0.0 | 3.912 | 3.288 | 8 | 45.0% |

Table 2: Autoencoder-augmented language models with different noise deviations. All values from no noise ($0.0$) up to a deviation of $1.5$ yield DSAE between $40\%$ and $50\%$.

Interestingly, we see that our method works even without any noise (standard deviation $0.0$). We suspect that this is due to the fact that half of the time in the forward computation we use the discrete values anyway and pass gradients through to the dense part. Also, note that a standard deviation of $1.5$ still works, despite the fact that our saturating sigmoid is saturated for values above $2.4$, as $1.2\,\sigma(2.4) - 0.1 = 1.0002$. Finally, with deviation $1.0$ the small model achieves DSAE of $48.5\%$, not much worse than the $55\%$ achieved by the large baseline model and better than the larger baseline model with Gumbel-Softmax.

3.3 Deciphering the Latent Code

Having trained the models, we try to find out whether the discrete latent symbols have any interpretable meaning. We start by asking a simpler question: do the latent symbols correspond to some fixed phrases or topics?

We first investigate this in a $32$-fold compressed character-level language model. We set $c(s)$ to $4$ random latent symbols $[l_1, l_2, l_3, l_4]$ and decode $s$ with beam search, obtaining:

All goods are subject to the Member States’ environmental and security aspects of the common agricultural policy.

Now, to find out whether the second symbol in $c(s)$ stands for anything fixed, we replace the third symbol by the second one, hoping for some phrase to be repeated. Indeed, decoding $s$ from the new $c(s) = [l_1, l_2, l_2, l_4]$ with beam search we obtain:

All goods are charged EUR 50.00 per night and EUR 50.00 per night stay per night.

Note that the beginning of the sentence remained the same, as we did not change the first symbol, and we see a repetition of EUR 50.00 per night. Could it be that this is what that second latent symbol stands for? But there was no EUR in the first sentence. Let us try again, now changing the first symbol to a different one. With $c(s) = [l_5, l_2, l_2, l_4]$ the decoded $s$ is:

All bedrooms suited to the large suite of the large living room suites are available.

We see a repetition again, but of a different phrase. So we are forced to conclude that the latent code is structured: the meaning of the latent symbols can depend on other symbols before them.

Failing to decipher the code from this model, we try again with an $8$-fold compressed character-level language model that uses the local version of the function $c(s)$. Recall (see Section 2.3) that a local function with 8-fold compression generates every latent symbol from the exact $8$ symbols that correspond to it in $s$, without any context. With this simpler $c(s)$ the model has lower DSAE, 35%, but we expect the latent symbols to be more context-independent. And indeed: if we pick the first $2$ latent symbols at random but fix the third, fourth and fifth to be the same, we obtain the following:

It’s studio, rather after a gallery gallery ...
When prices or health after a gallery gallery ...
I still offer hotels at least gallery gallery ...

So the fixed latent symbol corresponds to the word gallery in various contexts. Let us now ignore context-dependence, fix the first three symbols, and randomly choose another one that we repeat after them. Here are a few sample decodes:

Come to earth and culturalized climate climate ...
Come together that contribution itself, itself, ...
Come to learn that countless threat this gas threat...

In the first two samples we see that the latent symbol corresponds to climate or itself, respectively. Note that all these words or phrases are $7$ characters long (and one character for space), most probably due to the architecture of $c(s)$. But in the last sample we see a different phenomenon: the latent symbol seems to correspond to X threat, where X depends on the context, showing that this latent code also has an interesting structure.

3.4 Mixed Sample-Beam Decoding

From the results above we know that our discretization method works quantitatively and we see interesting patterns in the latent code. But how can we use the autoencoder models in practice? One well-known problem with autoregressive sequence models is decoding. In settings where the possible outputs are fairly restricted, such as translation, one can obtain good results with beam search. But results obtained by beam search lack diversity (Vijayakumar et al., 2016). Sampling can improve diversity, but it can introduce artifacts or even change semantics in translation. We present an example of this problem in Figure 3. We pick an English sentence from the validation set of our English-German dataset and translate it using beam search and sampling (left and middle columns).

In the left column, we show the top 3 results from beam search using our baseline model (without autoencoder). It is not necessary to speak German to see that they are all very similar; the only difference between the first and the last one is the spacing before "%". Further beams are also like this, providing no real diversity.

In the middle column we show 3 results sampled from the baseline model. There is more diversity in them, but they still share most of the first half and, unfortunately, all of them actually changed the semantics of the sentence in the second half. The part African-Americans, who accounted however for only 13% of voters in the State becomes The american voters were only 13% of voters in the state in the first case, African-Americans, who accounted however for only 13% of all people in the State in the second one, and African-Americans, who elected only 13% of people in the State in the third case. This illustrates the dangers of just sampling different words during decoding.

Using a model with access to the autoencoded part $c(s)$ presents us with another option: sample $c(s)$ and then run beam search for the sequence $s$ appropriate for that $c(s)$. In this way we do not introduce low-level artifacts from sampling, but still preserve high-level diversity. To sample $c(s)$ we train a language model on $c(s)$ with the same architecture as the model for $s$ (and also conditioned on the input), but with a different set of weights. We then use the standard multinomial sampling from this model to obtain $c(s)$ and run a beam search on the model for $s$ with the sampled $c(s)$.
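The two-stage decoding procedure can be sketched with generic callables standing in for the trained models. Everything here is a simplified assumption: `latent_next_probs` and `output_next_log_probs` are hypothetical interfaces to the latent language model and the sequence model, sequences have fixed lengths, and the beam search is a plain textbook version without length normalization or end-of-sequence handling.

```python
import random


def sample_sequence(next_probs, length, rng):
    """Multinomial sampling: draw `length` symbols one at a time."""
    seq = []
    for _ in range(length):
        probs = next_probs(seq)  # dict mapping symbol -> probability
        symbols = list(probs)
        weights = [probs[sym] for sym in symbols]
        seq.append(rng.choices(symbols, weights=weights)[0])
    return seq


def beam_search(next_log_probs, length, beam_size=3):
    """Plain beam search returning the best-scoring sequence of `length` symbols."""
    beams = [(0.0, [])]
    for _ in range(length):
        candidates = []
        for score, seq in beams:
            for sym, log_prob in next_log_probs(seq).items():
                candidates.append((score + log_prob, seq + [sym]))
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][1]


def mixed_decode(latent_next_probs, output_next_log_probs,
                 latent_len, output_len, num_samples, seed=0):
    """Sample a latent sequence, then beam-search the output conditioned on it."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_samples):
        latents = sample_sequence(latent_next_probs, latent_len, rng)

        def conditioned(seq, latents=latents):
            # The output model sees the sampled latents plus the output prefix.
            return output_next_log_probs(latents, seq)

        results.append((latents, beam_search(conditioned, output_len)))
    return results
```

Diversity comes only from the sampled latent sequence, while each output is still the most likely continuation found by beam search for that latent sequence.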

In the right column in Figure 3 we show 3 samples obtained in this way. As you can see, these samples are much more diverse and they still preserve the semantics of the original sentence, even if sometimes with strange syntax. One would back-translate the first example as: It turned out, for example, in the course of the parliamentary elections in Florida, that 33% of the early voters are African-Americans, which were, however, only 13% of the voters of the state. Note the addition of It turned out and the restructuring of the sentence. In the third sample the whole order is reversed, as it starts with 33% of the voters … instead of the election phrase. Obtaining such samples that differ in phrase order and other aspects but preserve semantics has been a challenge in neural translation.

English sentence:

For example, during the 2008 general election in Florida, 33% of early voters were African-Americans, who accounted however for only 13% of voters in the State.

Base model, beam decoding.

Während der Parlamentswahlen 2008 in Florida beispielsweise waren 33 % der frühen Wähler Afroamerikaner, die jedoch nur 13 % der Wähler im Staat ausmachten.

Während der Parlamentswahlen 2008 in Florida beispielsweise waren 33 % der frühen Wähler Afroamerikaner, die jedoch nur 13 % der Wähler im Staat stellten.

Während der Parlamentswahlen 2008 in Florida beispielsweise waren 33% der frühen Wähler Afroamerikaner, die jedoch nur 13% der Wähler im Staat ausmachten.

Base model, sampling.

So waren zum Beispiel bei den Parlamentswahlen 2008 in Florida 33 % der frühen Wähler Afroamerikaner. Die amerikanischen Wähler waren aber nur 13 % der Wähler im Staat.

So waren während der Parlamentswahlen 2008 in Florida 33 % der frühen Wähler Afroamerikaner, die aber nur 13 % der Bevölkerung im Staat ausmachten.

So waren während der Parlamentswahlen 2008 in Florida 33% der frühen Wähler Afroamerikaner, die jedoch nur 13% der Bevölkerung im Staat wählten.

Mixed decoding.

Es stellte sich beispielsweise im Verlauf der Parlamentswahlen in Florida heraus, dass 33% der frühen Wähler zu den afrikanischen Amerikanern zählten, die allerdings nur 13% der Wähler des Staates betrafen.

Dabei ist zum Beispiel im Laufe der Parlamentswahlen 2008 in Florida 33% in den frühen Wahlen der Afro-Amerikaner vertreten, die allerdings nur 13% der Wähler des Staates betrafen.

33% der frühen Wähler beispielsweise waren während der Hauptwahlen 2008 in Florida afrikanische Amerikaner, die für einen Anteil von nur 13% der Wähler im Staat verantwortlich waren.

Figure 3: Decoding from baseline and autoencoder-enhanced sequence-to-sequence models.

4 Conclusion

In this work, the study of text autoencoders (Bowman et al., 2016; Yang et al., 2017) is combined with the research on discrete autoencoders (Jang et al., 2016; Maddison et al., 2016). It turns out that the semantic hashing technique (Salakhutdinov & Hinton, 2009b) can be improved and then yields good results in this context. We introduce a measure of efficiency of discrete autoencoders in sequence models and show that improved semantic hashing has over $50\%$ efficiency on language modeling. In some cases, we can decipher the latent code, showing that latent symbols correspond to words and phrases. On the practical side, sampling from the latent code and then running beam search makes it possible to obtain valid but highly diverse samples, addressing an important problem with beam search (Vijayakumar et al., 2016).

We leave a number of questions open for future work. How does the architecture of the function $c(s)$ affect the latent code? How can we further improve discrete sequence autoencoding efficiency? Despite remaining questions, we can already see potential applications of discrete sequence autoencoders. One is the training of multi-scale generative models end-to-end, opening a way to generating truly realistic images, audio and video. Another application is in reinforcement learning. Using latent code may allow the agents to plan in larger time scales and explore more efficiently by sampling from high-level latent actions instead of just atomic moves.