1 Introduction
Autoencoders have a long history in deep learning (Hinton & Salakhutdinov, 2006; Salakhutdinov & Hinton, 2009a; Vincent et al., 2010; Kingma & Welling, 2013). In most cases, autoencoders operate on continuous representations, either by simply making a bottleneck (Hinton & Salakhutdinov, 2006), denoising (Vincent et al., 2010), or adding a variational component (Kingma & Welling, 2013). In many cases though, a discrete latent representation is potentially a better fit.
Language is inherently discrete, and autoregressive models based on sequences of discrete symbols yield impressive results. A discrete representation can be fed into a reasoning or planning system or act as a bridge towards any other part of a larger system. Even in reinforcement learning where action spaces are naturally continuous, Metz et al. (2017) show that discretizing them and using autoregressive models can yield improvements.
Unfortunately, using discrete latent variables is challenging in deep learning. And even with continuous autoencoders, the interactions with an autoregressive component cause difficulties. Despite some success (Bowman et al., 2016; Yang et al., 2017), the task of meaningfully autoencoding text in the presence of an autoregressive decoder has remained a challenge.
In this work we present an architecture that autoencodes a sequence $s$ of discrete symbols from any vocabulary (e.g., a tokenized sentence) into a $K$-fold (we test $K = 8$ and $K = 32$) compressed sequence $c(s)$ of latent symbols from a new vocabulary which is learned. The compressed sequence is generated to minimize perplexity in a (possibly conditional) language model trained to predict the next token on $c(s) \circ s$: the concatenation of $c(s)$ with the original sequence $s$.
Since gradient signals can vanish when propagating over discrete variables, the compression function $c(s)$ can be hard to train. To solve this problem, we draw from the old technique of semantic hashing (Salakhutdinov & Hinton, 2009b). There, to discretize a dense vector $v$ one computes $\sigma(v + n)$, where $\sigma$ is the sigmoid function and $n$ represents annealed Gaussian noise that pushes the network to not use middle values in $v$. We enhance this method by using a saturating sigmoid and a straight-through pass with only bits passed forward. These techniques, described in detail below, allow us to forgo the annealing of the noise and provide a stable discretization mechanism that requires neither annealing nor additional loss factors.
We test our discretization technique by amending language models over $s$ with the autoencoded sequence $c(s)$. We compare the perplexity achieved on $s$ with and without the $c(s)$ component, and contrast this value with the number of bits used in $c(s)$. We argue that this number is a proper measure for the performance of a discrete autoencoder. It is easy to compute and captures the performance of the autoencoding part of the model. This quantitative measure allows us to compare the technique we introduce with other methods, and we show that it performs better than a Gumbel-Softmax (Jang et al., 2016; Maddison et al., 2016) in this context.
Finally, we discuss the use of adding the autoencoded part $c(s)$ to a sequence model. We present samples from a character-level language model and show that the latent symbols correspond to words and phrases when the architecture of $c(s)$ is local. Then, we introduce a decoding method in which $c(s)$ is sampled and then $s$ is decoded using beam search. This method alleviates a number of problems observed with beam search or pure sampling. We show how our decoding method can be used to obtain diverse translations of a sentence from a neural machine translation model.
To summarize, the main contributions of this paper are:

a discretization technique that works well without any extra losses or parameters to tune,

a way to measure performance of autoencoders for sequence models with baselines,

an improved way to sample from sequence models trained with an autoencoder part.
2 Techniques
Below, we introduce our discretization method, the autoencoding function $c(s)$ and finally the complete model that we use for our experiments. All code and hyperparameter settings needed to replicate our experiments are available as open source (see transformer_vae.py in https://github.com/tensorflow/tensor2tensor).
2.1 Discretization by Improved Semantic Hashing
As already mentioned above, our discretization method stems from semantic hashing (Salakhutdinov & Hinton, 2009b). To discretize a $b$-dimensional vector $v$, we first add noise, so $v^n = v + n$. The noise $n$ is drawn from a $b$-dimensional Gaussian distribution with mean $0$ and a fixed standard deviation (deviations between $0$ and $1.5$ all work fine, see the ablations below). The sum is component-wise, as are all operations below. Note that noise is used only for training; during evaluation and inference $n = 0$. From $v^n$ we compute two vectors: $v_1 = \sigma'(v^n)$ and $v_2 = (v^n > 0)$, where $\sigma'$ is the saturating sigmoid function from (Kaiser & Sutskever, 2016; Kaiser & Bengio, 2016):
$$\sigma'(x) = \max(0, \min(1, 1.2\,\sigma(x) - 0.1)).$$
The vector $v_2$ represents the discretized value of $v$ and is used for evaluation and inference. During training, in the forward pass we use $v_1$ half of the time and $v_2$ the other half. In the backward pass, we let gradients always flow to $v_1$, even if we used $v_2$ in the forward computation. (This can be done in TensorFlow using: v2 += v1 - tf.stop_gradient(v1).)
We will denote the vector $v$ discretized in the above way by $v^d$. Note that if $v$ is $b$-dimensional then $v^d$ will have $b$ bits. Since in other parts of the system we will predict $v^d$ with a softmax, we want the number of bits to not be too large. In our experiments we stick with $b = 16$, so $v^d$ is a vector of $16$ bits, and so can be interpreted as an integer between $0$ and $2^{16} - 1 = 65535$.
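The forward computation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the default `noise_std=1.0`, and the choice to flip one coin per call (rather than per example or per batch) are ours, and gradients are not modeled at all, so the straight-through backward pass is not shown.

```python
import math
import random


def sigmoid(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def saturating_sigmoid(x):
    """Saturating sigmoid: max(0, min(1, 1.2 * sigmoid(x) - 0.1))."""
    return max(0.0, min(1.0, 1.2 * sigmoid(x) - 0.1))


def semantic_hash_forward(v, noise_std=1.0, training=True, rng=random):
    """Forward pass of the discretization for one vector v (list of floats).

    Returns (v1, v2, used): the saturating-sigmoid values, the bits, and
    the vector that is actually passed forward.
    noise_std and the per-call coin flip are illustrative assumptions.
    """
    # Noise is added only during training; at evaluation/inference n = 0.
    vn = [x + (rng.gauss(0.0, noise_std) if training else 0.0) for x in v]
    v1 = [saturating_sigmoid(x) for x in vn]
    v2 = [1.0 if x > 0 else 0.0 for x in vn]
    if training:
        # Use v1 half of the time and v2 the other half.
        used = v1 if rng.random() < 0.5 else v2
    else:
        used = v2
    return v1, v2, used


def bits_to_int(bits):
    """Interpret a bit vector (most significant bit first) as an integer."""
    value = 0
    for bit in bits:
        value = 2 * value + int(bit)
    return value
```

With 16 bits, `bits_to_int` maps the all-ones vector to 65535, matching the integer range stated above.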
The dense vectors representing activations in our sequence models have much larger dimensionality than $b$ (see the details in the experimental section below). To discretize such a high-dimensional vector $w$ we first have a simple fully-connected layer converting it into $v = \mathrm{dense}(w, b)$. In our notation, $\mathrm{dense}(x, n)$ denotes a fully-connected layer applied to $x$ and mapping it into $n$ dimensions, i.e., $\mathrm{dense}(x, n) = xW + B$, where $W$ is a learned matrix of shape $d \times n$, $d$ is the dimensionality of $x$, and $B$ is a learned bias vector of size $n$. The discretized vector $v^d$ is converted back into a high-dimensional vector using a 3-layer feed-forward network:
    h1a = dense(vd, filter_size)
    h1b = dense(1.0 - vd, filter_size)
    h2 = dense(relu(h1a + h1b), filter_size)
    result = dense(relu(h2), hidden_size)
Above, every time we apply dense we create a new weight matrix and bias to be learned. The relu function is defined in the standard way: $\mathrm{relu}(x) = \max(x, 0)$. In the network above, we usually use a filter_size that is large relative to hidden_size. We suspect that this allows the above network to recover from the discretization bottleneck by simulating the distribution of $w$ encountered during training. Given a dense, high-dimensional vector $w$ we will denote the corresponding result returned from the network above by $\mathrm{bottleneck}(w)$ and the corresponding discrete vector $v_2$ by $\mathrm{discrete}(w)$.
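The 3-layer network above can be written out as a small, self-contained sketch. The helper names (`make_dense`, `make_bottleneck_decoder`), the uniform initialization, and the tiny sizes used in the usage note are assumptions made for illustration; they are not taken from the released code.

```python
import random


def make_dense(in_dim, out_dim, rng):
    """Create a dense layer x -> xW + B with its own fresh parameters.

    The uniform initialization in [-0.1, 0.1] is an arbitrary choice.
    """
    W = [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)]
         for _ in range(in_dim)]
    B = [0.0] * out_dim

    def dense(x):
        return [sum(x[i] * W[i][j] for i in range(in_dim)) + B[j]
                for j in range(out_dim)]

    return dense


def relu(x):
    """Component-wise relu(x) = max(x, 0)."""
    return [max(xi, 0.0) for xi in x]


def make_bottleneck_decoder(bits, filter_size, hidden_size, seed=0):
    """Build the network mapping a bit vector vd back to a dense vector."""
    rng = random.Random(seed)
    # Every application of dense has its own weight matrix and bias.
    dense_1a = make_dense(bits, filter_size, rng)
    dense_1b = make_dense(bits, filter_size, rng)
    dense_2 = make_dense(filter_size, filter_size, rng)
    dense_3 = make_dense(filter_size, hidden_size, rng)

    def decode(vd):
        h1a = dense_1a(vd)
        h1b = dense_1b([1.0 - bit for bit in vd])
        h2 = dense_2(relu([a + b for a, b in zip(h1a, h1b)]))
        return dense_3(relu(h2))

    return decode
```

For example, `make_bottleneck_decoder(bits=4, filter_size=8, hidden_size=6)([1.0, 0.0, 1.0, 1.0])` returns a list of 6 floats. Feeding both `vd` and `1.0 - vd` gives zero bits their own learned contribution in the first layer.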
2.2 Gumbel-Softmax for Discretization
As an alternative discretization method, we consider the recently studied Gumbel-Softmax (Jang et al., 2016; Maddison et al., 2016). In that case, given a vector $w$ we compute $\mathrm{discrete}_g(w)$ by applying a linear layer mapping $w$ into $2^{16}$ elements, resulting in the logits $l$. During evaluation and inference we simply pick the index of $l$ with maximum value for $\mathrm{discrete}_g(w)$, and the vector $\mathrm{bottleneck}_g(w)$ is computed by an embedding. During training we first draw samples $g$ from the Gumbel distribution: $g = -\log(-\log(u))$, where $u \sim U(0, 1)$ are uniform samples. Then, as in (Jang et al., 2016), we compute $x$, the log-softmax of $l$, and set:
$$y_i = \frac{\exp((x_i + g_i)/\tau)}{\sum_j \exp((x_j + g_j)/\tau)}.$$
With low temperature $\tau$ this vector is close to the 1-hot vector representing the maximum index of $l$. But with higher temperature, it is an approximation (see Figure 1 in Jang et al. (2016)). We multiply this vector $y$ by the embedding matrix to compute $\mathrm{bottleneck}_g(w)$ during training.
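The training-time computation above can be sketched as follows. The function names and the optional `g` argument (which lets the caller supply the Gumbel noise, mainly for inspection) are assumptions of this sketch; the embedding lookup that follows is omitted.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax of a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


def sample_gumbel(n, rng=random):
    """Draw n samples g = -log(-log(u)) with u uniform in (0, 1)."""
    samples = []
    for _ in range(n):
        # rng.random() may return 0.0; clamp to avoid log(0).
        u = max(rng.random(), 1e-20)
        samples.append(-math.log(-math.log(u)))
    return samples


def gumbel_softmax(logits, tau, g=None, rng=random):
    """Relaxed one-hot vector y for the given logits and temperature tau.

    If g is None, fresh Gumbel noise is sampled.
    """
    x = log_softmax(logits)
    if g is None:
        g = sample_gumbel(len(logits), rng)
    z = [(xi + gi) / tau for xi, gi in zip(x, g)]
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(zi - m) for zi in z]
    total = sum(e)
    return [ei / total for ei in e]
```

The result always sums to 1. As the temperature `tau` decreases, the largest entry of `x + g` dominates and the vector approaches a 1-hot vector.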
2.3 Autoencoding Function
Having the functions $\mathrm{bottleneck}(w)$ and $\mathrm{discrete}(w)$ (respectively their Gumbel-Softmax versions), we can now describe the architecture of the autoencoding function $c(s)$. We assume that $s$ is already a sequence of dense vectors, e.g., coming from embedding vectors from a tokenized sentence. To halve the size of $s$, we first apply to it a stack of 1-dimensional convolutions, padding with $0$s on both sides (SAME-padding). We use ReLU non-linearities between the layers and layer normalization (Ba et al., 2016). Then, we add the input to the result, forming a residual block. Finally, we process the result with a convolution with kernel size $2$ and stride $2$, effectively halving the size of $s$. In the local version of this function we only do the final strided convolution, without the residual block.
To autoencode a sequence $s$ and shorten it $K$-fold, with $K = 2^k$, we first apply the above step $k$ times, obtaining a sequence $s'$ that is $K$ times shorter. Then we put it through the discretization bottleneck described above. The final compression function is given by $c(s) = \mathrm{bottleneck}(s')$ and the architecture described above is depicted in Figure 1.
Note that, since the residual block performs several convolutions in each step, the network has access to a large context, just from the receptive fields of the convolutions in the last step. That is why we also consider the local version. With only strided convolutions, the $i$-th symbol in the local $c(s)$ has access only to a fixed $2^k$ symbols from the sequence $s$ and can only compress them.
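To make the bookkeeping of the local version concrete, the sketch below computes the compressed length and the input window seen by each latent symbol. It assumes kernel size 2 and stride 2 for the strided convolution and that an odd-length sequence is padded so that each halving step rounds up; the function names are illustrative.

```python
def compressed_length(n, k):
    """Length of the sequence after k halving steps.

    Assumes each stride-2 step maps length n to ceil(n / 2).
    """
    for _ in range(k):
        n = (n + 1) // 2
    return n


def local_receptive_field(i, k):
    """Input positions visible to latent symbol i (0-based) in the local
    version, assuming k strided convolutions with kernel 2 and stride 2."""
    width = 2 ** k
    return range(i * width, (i + 1) * width)
```

For instance, with `k = 3` (8-fold compression), latent symbol 1 sees exactly input positions 8 to 15, and a sequence of 64 symbols is compressed to 8 latent symbols.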
Training with $c(s)$ defined above from scratch is hard, since at the beginning of training $s'$ is generated by many layers of untrained convolutions that are only getting gradients through the discretization bottleneck. To help training, we add a side-path for $c(s)$ without discretization: we just use $c(s) = s'$ for the initial training steps. In this pretraining stage the network reaches a loss of almost $0$, as everything needed to reconstruct $s$ is encoded in $s'$. After switching to $c(s) = \mathrm{bottleneck}(s')$ the loss is high again and improves during further training.
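The pretraining side-path amounts to a simple switch on the training step. In the sketch below, `pretrain_steps` and the `bottleneck` callable are hypothetical parameters introduced for illustration; the source does not name them.

```python
def compress(s_prime, step, pretrain_steps, bottleneck):
    """Return c(s) for the current training step.

    During the first `pretrain_steps` steps the dense sequence s' is passed
    through unchanged; afterwards every vector goes through the bottleneck.
    """
    if step < pretrain_steps:
        return s_prime
    return [bottleneck(w) for w in s_prime]
```

A toy bottleneck such as `lambda w: [1.0 if x > 0 else 0.0 for x in w]` is enough to see the switch in action.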
2.4 Autoencoding Sequence Model
To test the autoencoding function $c(s)$ we will use it to prefix the sequence $s$ in a sequence model. Normally, a sequence model would generate the $i$-th element of $s$ conditioning on all elements of $s$ before that, $s_{<i}$, and possibly on some other inputs. For example, a language model would just condition on $s_{<i}$, while a neural machine translation model would condition on the input sentence (in the other language) and $s_{<i}$. We do not change the sequence models in any way other than adding the sequence $c(s)$ as the prefix of $s$. Actually, for reasons analogous to those in (Sutskever et al., 2014), we first reverse the sequence $c(s)$, then add a separator symbol (#), and only then concatenate it with $s$, as depicted in Figure 2. We also use a separate set of parameters for the model predicting $c(s)$ so as to make sure that the models predicting $s$ with and without $c(s)$ have the same capacity.
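The construction of the sequence that the model is trained on can be sketched in a few lines. The function name and the use of Python lists of string tokens are assumptions for illustration.

```python
def build_model_sequence(latents, tokens, separator="#"):
    """Reverse the latent sequence c(s), append a separator symbol,
    and concatenate the original sequence s."""
    return list(reversed(latents)) + [separator] + list(tokens)
```

For example, latents `["c1", "c2"]` and tokens `["a", "b", "c"]` give `["c2", "c1", "#", "a", "b", "c"]`.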
As the architecture for the sequence model we use the Transformer (Vaswani et al., 2017). The Transformer is based on multiple attention layers and was originally introduced in the context of neural machine translation. We focused on the autoencoding function $c(s)$ and did not tune the sequence model in this work: we used all the defaults from the baseline provided by the Transformer authors (including the number of layers, hidden size, and filter size) and only varied parameters relevant to $c(s)$.
3 Experiments
We experimented with autoencoding on different sequence tasks: (1) on a character-level language model, (2) on a word-level language model, and (3) on a word-level translation model. The goal for (1) was to check if our technique works at all, since character sequences are naturally amenable to compression into shorter sequences of objects from a larger vocabulary. For (2), we wanted to check if the good results obtained in (1) will still hold if the input is from a larger vocabulary and inherently more compressed space. Finally, in (3) we want to check if this method is applicable to conditional models and how it can be used to improve decoding.
We use the LM1B corpus (Chelba et al., 2013) for language modelling and we tokenize it using a subword (word-piece) tokenizer (Sennrich et al., 2016) into a vocabulary of 32000 words and word-pieces. For translation, we use the WMT English-German corpus, similarly tokenized into a vocabulary of 32000 words and word-pieces. (We used https://github.com/tensorflow/tensor2tensor for data preparation.)
Below we report both qualitative and quantitative results. First, we focus on measuring the performance of our autoencoder quantitatively. To do that, we introduce a measure of discrete autoencoder performance on sequence tasks and compare our semantic hashing based method to Gumbel-Softmax on this scale.
3.1 Discrete Sequence Autoencoding Efficiency
Sequence models trained for next-symbol prediction are usually trained (and often also evaluated) based on the perplexity per token that they reach. Perplexity is defined as $2^H$, where $H$ is the entropy (in bits) of a distribution. Therefore, a language model that reaches a per-word perplexity of $p$, say $p = 32$, on a sentence $s$ can be said to compress each word from $s$ into $\log_2(p) = 5$ bits of information.
Let us now assume that this model is allowed to access some additional bits of information about $s$ before decoding. In our autoencoding case, we let it peek at $c(s)$ before decoding $s$, and $c(s)$ has $K = 8$ times fewer symbols and $b = 16$ bits in each symbol. So $c(s)$ has the information capacity of $b/K = 2$ bits per word. If our autoencoder was perfectly aligned with the needs of the language model, then allowing it to peek into $c(s)$ would lower its information needs by these $2$ bits per word. The perplexity $p'$ of the model with access to $c(s)$ would thus satisfy $\log_2(p') = 5 - 2 = 3$, so its perplexity would be $p' = 8$.
Getting the autoencoder perfectly aligned with the language model is hard, so in practice the perplexity $p'$ is always higher. But since we measure it (and optimize for it during training), we can calculate how many bits the $c(s)$ part has actually contributed to lowering the perplexity. We calculate $\log_2(p) - \log_2(p')$ and then, if $c(s)$ is $K$ times shorter than $s$ and uses $b$ bits per symbol, we define the discrete sequence autoencoding efficiency as:
$$\mathrm{DSAE} = \frac{K\,(\log_2(p) - \log_2(p'))}{b} = \frac{K\,(\ln(p) - \ln(p'))}{b\,\ln(2)}.$$
The second formulation is useful when the raw numbers are given as natural logarithms, as is often the case during neural network training.
Defined in this way, DSAE measures how many of the available bits in $c(s)$ are actually used well by the model that peeks into the autoencoded part. Note that some models may have autoencoding capacity higher than the number of bits per word that $\log_2(p)$ indicates. In that case achieving DSAE=1 is impossible even if $\log_2(p') = 0$ and the autoencoding is perfect. One should be careful when reporting DSAE for such over-capacitated models.
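Table 1: Log-perplexities of baseline (ln(p)) and autoencoder (ln(p')) models, with the resulting DSAE.
Problem                          ln(p)   ln(p')   K    DSAE
LM-en (characters)               1.027   0.822    32   59%
LM-en (word)                     3.586   2.823    8    55%
NMT-en-de (word)                 1.449   1.191    8    19%
LM-en (word, Gumbel-Softmax)     3.586   3.417    8    12%
NMT-en-de (word, Gumbel-Softmax) 1.449   1.512    8    0%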
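The DSAE column can be recomputed from the log-perplexities in the table with the natural-logarithm form of the definition. The sketch below assumes $b = 16$ bits per latent symbol, as used in the experiments; the function name is ours.

```python
import math


def dsae(ln_p, ln_p_prime, K, b=16):
    """Discrete sequence autoencoding efficiency from natural-log perplexities.

    ln_p: log-perplexity of the baseline model.
    ln_p_prime: log-perplexity of the model with access to c(s).
    K: compression factor; b: bits per latent symbol.
    """
    return K * (ln_p - ln_p_prime) / (b * math.log(2))


# Rows of the table above: (problem, ln(p), ln(p'), K).
rows = [
    ("LM-en (characters)", 1.027, 0.822, 32),
    ("LM-en (word)", 3.586, 2.823, 8),
    ("NMT-en-de (word)", 1.449, 1.191, 8),
    ("LM-en (word, Gumbel-Softmax)", 3.586, 3.417, 8),
]
for name, ln_p, ln_p_prime, K in rows:
    print(f"{name}: {100 * dsae(ln_p, ln_p_prime, K):.0f}%")
```

Rounded to whole percentages, these four rows give 59%, 55%, 19% and 12%. For the last row of the table (NMT with Gumbel-Softmax) the formula gives a negative value, because the autoencoder model has higher log-perplexity than the baseline; the table lists this as 0%.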
So how does our method perform on DSAE and how does it compare with Gumbel-Softmax? In Table 1 we list log-perplexities of baseline and autoencoder models. We report numbers for the global version of $c(s)$ on our 3 problems and compare it to Gumbel-Softmax on word-level problems. We did not manage to run the Gumbel-Softmax on character-level data in our baseline configuration because it requires too much memory (as it needs to learn the embeddings for each latent discrete symbol). Also, we found that the results for Gumbel-Softmax heavily depend on how the temperature parameter $\tau$ is annealed during training. We tuned this on 5 runs of a smaller model and chose the best configuration. This was still not enough, as in many runs the Gumbel-Softmax would only utilize a small portion of the discrete symbols. We added an extra loss term to increase the variance of the Gumbel-Softmax and ran another 5 tuning runs to optimize this loss term. We used the best configuration for the experiments above. Still, we did not manage to get any information autoencoded in the translation model, and got only $12\%$ efficiency in the language model (see Table 1).
Our method, on the other hand, was most efficient on character-level language modeling, where we reach almost $60\%$ efficiency, and it retained high efficiency on the word-level language modeling task. On the translation task, our efficiency goes down to $19\%$, possibly because the function $c(s)$ does not take inputs into account, and so may not be able to compress the right parts to align with the conditional model that outputs $s$ depending on the inputs. But even with $19\%$ efficiency it is still useful for sampling from the model, as shown below.
3.2 Sensitivity to Noise
To make sure that our autoencoding method is stable, we experiment with different standard deviations for the noise in the semantic hashing part. We perform these experiments on word-level language modelling with a smaller model configuration (a reduced number of layers, hidden size, and filter size). The results, presented in Table 2, show that our method is robust to the amount of noise.
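Table 2: Autoencoder performance for different noise standard deviations (smaller word-level language model).
Noise standard deviation   ln(p)   ln(p')   K   DSAE
1.5                        3.912   3.313    8   43.2%
1.0                        3.912   3.239    8   48.5%
0.5                        3.912   3.236    8   48.5%
0.0                        3.912   3.288    8   45.0%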
Interestingly, we see that our method works even without any noise (standard deviation $0$). We suspect that this is due to the fact that half of the time in the forward computation we use the discrete values anyway and pass gradients through to the dense part. Also, note that a standard deviation of $1.5$ still works, despite the fact that our saturating sigmoid is saturated for values above $2.4$, as $1.2 \cdot \sigma(2.4) - 0.1 = 1.0002$. Finally, with deviation $1.0$ the small model achieves a DSAE of $48.5\%$, not much worse than the $55\%$ achieved by the large baseline model and better than the larger baseline model with Gumbel-Softmax.
3.3 Deciphering the Latent Code
Having trained the models, we try to find out whether the discrete latent symbols have any interpretable meaning. We start by asking a simpler question: do the latent symbols correspond to some fixed phrases or topics?
We first investigate this in a $32$-fold compressed character-level language model. We set $c(s)$ to random latent symbols and decode $s$ with beam search, obtaining:
All goods are subject to the Member States’ environmental and security aspects of the common agricultural policy.
Now, to find out whether the second symbol in $c(s)$ stands for anything fixed, we replace the third symbol by the second one, hoping for some phrase to be repeated. Indeed, decoding $s$ from the new $c(s)$ with beam search we obtain:
All goods are charged EUR 50.00 per night and EUR 50.00 per night stay per night.
Note that the beginning of the sentence remained the same, as we did not change the first symbol, and we see a repetition of EUR 50.00 per night. Could it be that this is what that second latent symbol stands for? But there was no EUR in the first sentence. Let us try again, now changing the first symbol to a different one. With the new $c(s)$, the decoded $s$ is:
All bedrooms suited to the large suite of the large living room suites are available.
We see a repetition again, but of a different phrase. So we are forced to conclude that the latent code is structured: the meaning of the latent symbols can depend on other symbols before them.
Failing to decipher the code from this model, we try again with an $8$-fold compressed character-level language model that uses the local version of the function $c(s)$. Recall (see Section 2.3) that a local function with 8-fold compression generates every latent symbol from the exact $8$ symbols that correspond to it in $s$, without any context. With this simpler $c(s)$ the model has lower DSAE, 35%, but we expect the latent symbols to be more context-independent. And indeed: if we pick the first two latent symbols at random but fix the third, fourth and fifth to be the same, we obtain the following:
It’s studio, rather after a gallery gallery ...
When prices or health after a gallery gallery ...
I still offer hotels at least gallery gallery ...
So the fixed latent symbol corresponds to the word gallery in various contexts. Let us now ignore context-dependence, fix the first three symbols, and randomly choose another one that we repeat after them. Here are a few sample decodes:
Come to earth and culturalized climate climate ...
Come together that contribution itself, itself, ...
Come to learn that countless threat this gas threat...
In the first two samples we see that the latent symbol corresponds to climate or itself, respectively. Note that all these words or phrases are $7$ characters long (and one character for space), most probably due to the architecture of $c(s)$. But in the last sample we see a different phenomenon: the latent symbol seems to correspond to X threat, where X depends on the context, showing that this latent code also has an interesting structure.
3.4 Mixed Sample-Beam Decoding
From the results above we know that our discretization method works quantitatively and we see interesting patterns in the latent code. But how can we use the autoencoder models in practice? One well-known problem with autoregressive sequence models is decoding. In settings where the possible outputs are fairly restricted, such as translation, one can obtain good results with beam search. But results obtained by beam search lack diversity (Vijayakumar et al., 2016). Sampling can improve diversity, but it can introduce artifacts or even change semantics in translation. We present an example of this problem in Figure 3. We pick an English sentence from the validation set of our English-German dataset and translate it using beam search and sampling (left and middle columns).
In the left column, we show the top 3 results from beam search using our baseline model (without autoencoder). It is not necessary to speak German to see that they are all very similar; the only difference between the first and the last one is the spacing before "%". Further beams are also like this, providing no real diversity.
In the middle column we show 3 results sampled from the baseline model. There is more diversity in them, but they still share most of the first half and unfortunately all of them actually changed the semantics of the sentence in the second half. The part "African-Americans, who accounted however for only 13% of voters in the State" becomes "The american voters were only 13% of voters in the state" in the first case, "African-Americans, who accounted however for only 13% of all people in the State" in the second one, and "African-Americans, who elected only 13% of people in the State" in the third case. This illustrates the dangers of just sampling different words during decoding.
Using a model with access to the autoencoded part $c(s)$ presents us with another option: sample $c(s)$ and then run beam search for the sequence $s$ appropriate for that $c(s)$. In this way we do not introduce low-level artifacts from sampling, but still preserve high-level diversity. To sample $c(s)$ we train a language model on $c(s)$ with the same architecture as the model for $s$ (and also conditioned on the input), but with a different set of weights. We then use standard multinomial sampling from this model to obtain $c(s)$ and run a beam search on the model for $s$ with the sampled $c(s)$.
In the right column in Figure 3 we show 3 samples obtained in this way. As you can see, these samples are much more diverse and they still preserve the semantics of the original sentence, even if with sometimes strange syntax. One would back-translate the first example as: "It turned out, for example, in the course of the parliamentary elections in Florida, that 33% of the early voters are African-Americans, which were, however, only 13% of the voters of the state." Note the addition of "It turned out" and the restructuring of the sentence. In the third sample the whole order is reversed, as it starts with "33% of the voters …" instead of the election phrase. Obtaining such samples that differ in phrase order and other aspects but preserve semantics has been a challenge in neural translation.
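The mixed sample-beam procedure can be sketched with a generic beam search and two callables standing in for the trained models. Everything named here (`beam_search`, `mixed_sample_beam_decode`, the toy model and its vocabulary) is hypothetical and only illustrates the control flow: sample the latent sequence first, then run beam search conditioned on it. The beam search is a simplified variant without length normalization, and it assumes the model always returns at least one candidate token.

```python
import math
import random


def beam_search(next_log_probs, beam_size, max_len, eos):
    """Simplified beam search.

    next_log_probs(prefix) returns a dict mapping each candidate next token
    to its log-probability given the tuple `prefix`.
    """
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = [
            (prefix + (token,), score + logp)
            for prefix, score in beams
            for token, logp in next_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was hit
    return max(finished, key=lambda c: c[1])[0]


def mixed_sample_beam_decode(sample_latents, make_next_log_probs,
                             beam_size=4, max_len=20, eos="</s>", rng=random):
    """Sample c(s), then decode s with beam search conditioned on it."""
    latents = sample_latents(rng)                # sampling step for c(s)
    step_fn = make_next_log_probs(latents)       # model for s given c(s)
    return latents, beam_search(step_fn, beam_size, max_len, eos)


# Toy stand-ins for the two trained models (purely illustrative).
VOCAB = ["hello", "world", "good", "day", "</s>"]
TARGETS = {"A": ("hello", "world", "</s>"), "B": ("good", "day", "</s>")}


def toy_sample_latents(rng):
    return [rng.choice(["A", "B"])]


def toy_make_next_log_probs(latents):
    target = TARGETS[latents[0]]

    def next_log_probs(prefix):
        want = target[min(len(prefix), len(target) - 1)]
        return {t: math.log(0.8 if t == want else 0.05) for t in VOCAB}

    return next_log_probs
```

Calling `mixed_sample_beam_decode(toy_sample_latents, toy_make_next_log_probs)` repeatedly yields different latents, and for each sampled latent the beam search returns the single most likely sentence for it, so diversity comes only from the latent sample.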
4 Conclusion
In this work, the study of text autoencoders (Bowman et al., 2016; Yang et al., 2017) is combined with the research on discrete autoencoders (Jang et al., 2016; Maddison et al., 2016). It turns out that the semantic hashing technique (Salakhutdinov & Hinton, 2009b) can be improved and then yields good results in this context. We introduce a measure of efficiency of discrete autoencoders in sequence models and show that improved semantic hashing has over $50\%$ efficiency. In some cases, we can decipher the latent code, showing that latent symbols correspond to words and phrases. On the practical side, sampling from the latent code and then running beam search allows us to obtain valid but highly diverse samples, addressing an important problem with beam search (Vijayakumar et al., 2016).
We leave a number of questions open for future work. How does the architecture of the function $c(s)$ affect the latent code? How can we further improve discrete sequence autoencoding efficiency? Despite remaining questions, we can already see potential applications of discrete sequence autoencoders. One is the training of multi-scale generative models end-to-end, opening a way to generating truly realistic images, audio and video. Another application is in reinforcement learning. Using a latent code may allow the agents to plan over larger time scales and explore more efficiently by sampling from high-level latent actions instead of just atomic moves.
References
Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Bowman et al. (2016) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of the SIGNLL’16, pp. 10–21, 2016.
Chelba et al. (2013) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013. URL http://arxiv.org/abs/1312.3005.
Hinton & Salakhutdinov (2006) Geoffrey E. Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. CoRR, abs/1611.01144, 2016. URL http://arxiv.org/abs/1611.01144.
Kaiser & Bengio (2016) Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural Information Processing Systems (NIPS), 2016.
Kaiser & Sutskever (2016) Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations (ICLR), 2016.
Kingma & Welling (2013) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
Maddison et al. (2016) Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/1611.00712, 2016. URL http://arxiv.org/abs/1611.00712.
Metz et al. (2017) Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. Discrete sequential prediction of continuous actions for deep RL. arXiv, 2017. URL https://arxiv.org/abs/1705.05035.
Salakhutdinov & Hinton (2009a) Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep Boltzmann machines. In Proceedings of AISTATS’09, pp. 448–455, 2009a.
Salakhutdinov & Hinton (2009b) Ruslan Salakhutdinov and Geoffrey E. Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969–978, 2009b.
Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of ACL’16, 2016.
Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014. URL http://arxiv.org/abs/1409.3215.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, 2017. URL http://arxiv.org/abs/1706.03762.
Vijayakumar et al. (2016) Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. CoRR, abs/1610.02424, 2016. URL http://arxiv.org/abs/1610.02424.
Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
Yang et al. (2017) Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of ICML’17, pp. 3881–3890, 2017.