1 Introduction
Unsupervised learning of meaningful representations is a fundamental problem in machine learning since obtaining labeled data can often be very expensive. Continuous representations have largely been the workhorse of unsupervised deep learning models of images goodfellow2014generative ; van2016conditional ; kingma2016improved ; salimans2017pixelcnn++ ; imagetrans , audio van2016wavenet ; reed2017parallel , and video kalchbrenner2016video . However, it is often the case that datasets are more naturally modeled as a sequence of discrete symbols rather than continuous ones. For example, language and speech are inherently discrete in nature and images are often concisely described by language, see e.g., vinyals2015show . Improved discrete latent variable models could also prove useful for learning novel data compression algorithms theis2017lossy , while having far more interpretable representations of the data.
We build on the Vector Quantized Variational Autoencoder (VQ-VAE) vqvae , a recently proposed training technique for learning discrete latent variables. The method uses a learned codebook combined with nearest-neighbor search to train the discrete latent variable model. The nearest-neighbor search is performed between the encoder output and the embedding of the latent code using the $\ell_2$ distance metric. The generative process begins by sampling a sequence of discrete latent codes from an autoregressive model fitted on the encoder latents, which acts as a learned prior. The discrete latent sequence is then consumed by the decoder to generate data. The resulting discrete autoencoder obtains impressive results on unconditional image, speech, and video generation. In particular, on image generation its performance is almost on par with continuous VAEs on datasets such as CIFAR-10 vqvae . An extension of this method to conditional supervised generation outperforms continuous autoencoders on the WMT English-German translation task: kaiser2018fast introduced the Latent Transformer, which achieved impressive results using discrete autoencoders for fast neural machine translation. However, additional training heuristics, namely exponential moving averages (EMA) of cluster assignment counts and product quantization norouzi2013cartesian , were essential to achieve competitive results with VQ-VAE. In this work, we show that tuning the codebook size alone can significantly outperform the results presented in kaiser2018fast . We also exploit VQ-VAE's connection with the expectation maximization (EM) algorithm dempster1977maximum , yielding additional improvements. With both improvements, we achieve a BLEU score of 22.4 on English-to-German translation, outperforming kaiser2018fast by 2.6 BLEU. Knowledge distillation hinton2015distilling ; seqd provides significant gains with our best models and EM, achieving 26.7 BLEU, which almost matches the greedy autoregressive Transformer (no beam search) at 27.0 BLEU, while being 3.3× faster.

Our contributions can be summarized as follows:

We show that the VQ-VAE of vqvae can outperform the previous state-of-the-art without product quantization.

Inspired by the EM algorithm, we introduce a new training algorithm for discrete variational autoencoders that outperforms the previous best result with discrete latent autoencoders for neural machine translation.

Using EM training, we achieve better image generation results on CIFAR-10, and the additional use of knowledge distillation allows us to develop a non-autoregressive machine translation model whose accuracy almost matches a strong greedy autoregressive baseline Transformer, while being 3.3 times faster at inference.
2 VQVAE and the Hard EM Algorithm
The connection between $k$-means and hard EM, or the Viterbi EM algorithm, is well known bottou1995convergence , where the former can be seen as a special case of a hard-EM style algorithm with a mixture-of-Gaussians model with identity covariance and a uniform prior over cluster probabilities. In the following sections we briefly explain the VQ-VAE discretization algorithm for completeness and its connection to classical EM.
2.1 VQVAE discretization algorithm
VQ-VAE models the joint distribution $P_\theta(x, z)$, where $\theta$ are the model parameters, $x$ is the data point and $z$ is the sequence of discrete latent variables or codes. Each position in the encoded sequence has its own set of latent codes. Given a data point, the discrete latent code in each position is selected independently using the encoder output. For simplicity, we describe the procedure for selecting the discrete latent code ($z$) in one position given the data point ($x$). The encoder output $z_e(x) \in \mathbb{R}^d$ is passed through a discretization bottleneck using a nearest-neighbor lookup on embedding vectors $e_1, \dots, e_K \in \mathbb{R}^d$. Here $K$ is the number of latent codes (in a particular position of the discrete latent sequence) in the model. More specifically, the discrete latent variable assignment is given by

$$z = \arg\min_{1 \le i \le K} \| z_e(x) - e_i \|_2 \qquad (1)$$
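To make the discretization bottleneck concrete, the nearest-neighbor lookup of Equation (1) can be sketched in a few lines of NumPy; the function name and the toy codebook below are illustrative, not from the paper:

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-neighbor lookup of Equation (1): return the index of the
    codebook embedding closest, in l2 distance, to the encoder output.

    z_e:       (d,) encoder output for one position.
    codebook:  (K, d) embedding vectors e_1, ..., e_K.
    """
    dists = np.linalg.norm(codebook - z_e, axis=1)  # ||z_e - e_i||_2 for each i
    return int(np.argmin(dists))

# Toy example: a 3-code codebook in 2 dimensions.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z = quantize(np.array([0.9, 1.2]), codebook)  # nearest code is index 1
```

In a full model this lookup is applied independently at every position of the encoded sequence.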
The selected latent variable's embedding $e_z$ is passed as input to the decoder. The model is trained to minimize

$$L = l_r\left(x, \hat{x}\right) + \beta \left\| z_e(x) - \mathrm{sg}\left(e_z\right) \right\|_2^2 \qquad (2)$$

where $l_r$ is the reconstruction loss of the decoder output $\hat{x}$ given $e_z$ (e.g., the cross-entropy loss), $\beta$ is a hyperparameter weighting the commitment term, and $\mathrm{sg}(\cdot)$ is the stop-gradient operator defined as follows:

$$\mathrm{sg}(x) = \begin{cases} x & \text{in the forward pass,} \\ 0 & \text{in the backward pass.} \end{cases}$$
It was observed in kaiser2018fast that an exponential moving average (EMA) update of the latent embeddings and codebook assignments results in more stable training than using gradient-based methods.
Specifically, they maintain an EMA of the following two quantities: 1) the embeddings $e_i$ for every $i \in \{1, \dots, K\}$ and 2) the counts $c_i$, measuring the number of encoder hidden states that have $e_i$ as their nearest neighbor. The counts are updated over a minibatch of targets $\{x_1, \dots, x_B\}$ as

$$c_i \leftarrow \lambda c_i + (1 - \lambda) \sum_{j=1}^{B} \mathbb{1}\left[ z(x_j) = i \right] \qquad (3)$$

with the embedding $e_i$ being subsequently updated as

$$e_i \leftarrow \lambda e_i + (1 - \lambda) \sum_{j=1}^{B} \frac{\mathbb{1}\left[ z(x_j) = i \right] z_e(x_j)}{c_i} \qquad (4)$$

where $\mathbb{1}[\cdot]$ is the indicator function and $\lambda \in (0, 1)$ is a decay parameter. This amounts to doing stochastic gradient descent in the space of both codebook embeddings and cluster assignments. These techniques have also been successfully used in mini-batch $k$-means sculley2010web and online EM liang2009online ; sato2000line .
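The EMA updates of Equations (3) and (4) can be sketched as follows. This is a minimal NumPy illustration under our own naming; the actual training loop additionally updates the encoder and decoder by backpropagation:

```python
import numpy as np

def ema_codebook_update(codebook, counts, z_e_batch, assignments, lam=0.99):
    """One EMA step of Equations (3)-(4) for a minibatch.

    codebook:    (K, d) embeddings e_i.
    counts:      (K,) EMA cluster counts c_i.
    z_e_batch:   (B, d) encoder outputs z_e(x_j).
    assignments: (B,) nearest-neighbor indices z(x_j).
    lam:         decay parameter; a value close to 1 is typical.
    """
    K = codebook.shape[0]
    one_hot = np.eye(K)[assignments]        # (B, K) indicators 1[z(x_j) = i]
    batch_counts = one_hot.sum(axis=0)      # per-code assignment counts
    batch_sums = one_hot.T @ z_e_batch      # per-code sum of encoder outputs
    counts = lam * counts + (1 - lam) * batch_counts                # Equation (3)
    codebook = (lam * codebook
                + (1 - lam) * batch_sums / np.maximum(counts, 1e-8)[:, None])  # Equation (4)
    return codebook, counts
```

With `lam = 0` the update reduces to overwriting each code with the mean of its assigned encoder outputs, which is exactly the k-means M step discussed below.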
The generative process begins by sampling a sequence of discrete latent codes from an autoregressive model, which we refer to as the Latent Predictor model. The decoder then consumes this sequence of discrete latent variables to generate the data. The autoregressive model which acts as a learned prior is fitted on the discrete latent variables produced by the encoder. The architecture of the encoder, the decoder, and the latent predictor model are described in further detail in the experiments section.
2.2 Hard EM and the k-means algorithm
In this section we briefly recall the hard expectation maximization (EM) algorithm dempster1977maximum . Given a set of data points $\{x_1, \dots, x_N\}$, the hard EM algorithm approximately solves the following optimization problem:

$$\max_{\theta} \; \max_{z_1, \dots, z_N} \; \log P_\theta\left(x_1, \dots, x_N, z_1, \dots, z_N\right) \qquad (5)$$
Hard EM performs coordinate descent over the following two coordinates: the model parameters $\theta$, and the hidden variables $z_1, \dots, z_N$. In other words, hard EM consists of repeating the following two steps until convergence:

E step: $z_i \leftarrow \arg\max_{z} P_\theta\left(z \mid x_i\right)$,

M step: $\theta \leftarrow \arg\max_{\theta} \sum_{i=1}^{N} \log P_\theta\left(x_i, z_i\right)$
A special case of the hard EM algorithm is $k$-means clustering macqueen1967some ; bottou1995convergence , where the likelihood is modelled by a Gaussian with identity covariance matrix. Here, the means $\mu_1, \dots, \mu_K$ of the $K$ Gaussians are the parameters to be estimated. With a uniform prior over the hidden variables ($P_\theta(z_i) = \frac{1}{K}$), the marginal is given by $P_\theta(x_i) = \frac{1}{K} \sum_{z_i = 1}^{K} \mathcal{N}\left(x_i; \mu_{z_i}, I\right)$. In this case, equation (5) is equivalent to:

$$\min_{\mu_1, \dots, \mu_K} \; \min_{z_1, \dots, z_N} \; \sum_{i=1}^{N} \left\| x_i - \mu_{z_i} \right\|_2^2 \qquad (6)$$
Note that optimizing equation (6) is NP-hard; however, one can find a local optimum by applying coordinate descent until convergence:

E step: the cluster assignment is given by

$$z_i \leftarrow \arg\min_{k} \left\| x_i - \mu_k \right\|_2^2 \qquad (7)$$
M step: the means of the clusters are updated as

$$\mu_k \leftarrow \frac{1}{\left|\{i : z_i = k\}\right|} \sum_{i : z_i = k} x_i \qquad (8)$$
We can now easily see the connection between the training updates of VQ-VAE and $k$-means clustering. The encoder output $z_e(x)$ corresponds to the data point, while the discrete latent variables correspond to clusters. Given this, Equation 1 is equivalent to the E step (Equation 7), and the EMA updates in Equation 3 and Equation 4 converge to the M step (Equation 8) in the limit. The M step in $k$-means overwrites the old values, while the EMA updates interpolate between the old values and the M step update.
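This correspondence can be seen by running the coordinate descent of Equations (7) and (8) directly; below is a minimal k-means sketch (function names and toy data are ours):

```python
import numpy as np

def kmeans_hard_em(X, mu_init, iters=20):
    """Coordinate descent of Equations (7)-(8): alternate the hard-EM
    E step (nearest-mean assignment) and M step (cluster-mean update)."""
    mu = mu_init.copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (N, K)
        z = dists.argmin(axis=1)                                        # E step (Eq. 7)
        for k in range(len(mu)):                                        # M step (Eq. 8)
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return mu, z

# Two well-separated blobs; initializing one mean in each blob,
# coordinate descent recovers the blob centers.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0.0, 0.1, (50, 2)),
                    rng.normal(5.0, 0.1, (50, 2))])
mu, z = kmeans_hard_em(X, mu_init=X[[0, 50]])
```

The loop above overwrites each mean at every M step; replacing that overwrite with an exponential moving average toward the same target yields the VQ-VAE update.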
3 VQVAE training with EM
In this section, we investigate a new training strategy for VQVAE using the soft EM algorithm.
3.1 Soft EM
First, we briefly describe the soft EM algorithm. While the hard EM procedure selects one cluster or latent variable assignment for a data point, here the data point is assigned to a mixture of clusters. Now, the optimization objective is given by the marginal log-likelihood,

$$\max_{\theta} \; \sum_{i=1}^{N} \log \sum_{z_i} P_\theta\left(x_i, z_i\right)$$
Coordinate descent is again used to approximately solve the above optimization problem. The E and M steps are given by:

E step: compute the posterior over latent assignments,

$$\gamma_i\left(z_i\right) \leftarrow P_\theta\left(z_i \mid x_i\right) \qquad (9)$$
M step: update the parameters,

$$\theta \leftarrow \arg\max_{\theta} \sum_{i=1}^{N} \mathbb{E}_{z_i \sim \gamma_i}\left[ \log P_\theta\left(x_i, z_i\right) \right] \qquad (10)$$
3.2 Vector Quantized Autoencoders trained with EM
Now, we describe training vector quantized autoencoders using the soft EM algorithm. As discussed in the previous section, the encoder output $z_e(x)$ corresponds to the data point, while the discrete latent variables correspond to clusters. Instead of a hard assignment, the E step now produces a probability distribution over the set of discrete latent variables (Equation 9). Following VQ-VAE, we continue to assume a uniform prior over clusters, since we observe that training the cluster priors seems to cause the cluster assignments to collapse to only a few clusters. The probability distribution is modeled as a Gaussian with identity covariance matrix,

$$P\left(z = i \mid x\right) \propto \exp\left( - \left\| z_e(x) - e_i \right\|_2^2 \right)$$

Since computing the expectation in the M step (Equation 10) is computationally infeasible in our case, we instead perform Monte-Carlo Expectation Maximization wei1990monte by drawing $m$ samples $z_1, \dots, z_m \sim \mathrm{Multinomial}\left(l_1, \dots, l_K\right)$, where $\mathrm{Multinomial}\left(l_1, \dots, l_K\right)$ refers to the $K$-way multinomial distribution with logits $l_i = - \left\| z_e(x) - e_i \right\|_2^2$. Thus, the E step can finally be written as:

$$z_1, \dots, z_m \sim \mathrm{Multinomial}\left( - \left\| z_e(x) - e_1 \right\|_2^2, \dots, - \left\| z_e(x) - e_K \right\|_2^2 \right)$$

The model parameters $\theta$ are then updated to maximize this Monte-Carlo estimate in the M step given by

$$\theta \leftarrow \arg\max_{\theta} \frac{1}{m} \sum_{j=1}^{m} \log P_\theta\left(x, z_j\right)$$
Instead of exactly following the above M step update, we use the EMA version of this update similar to the one described in Section 2.1.
When sending the embedding of the discrete latent to the decoder, instead of sending the posterior mode, $e_{\arg\max_i P(z = i \mid x)}$, as in hard EM and $k$-means, we send the average of the embeddings of the sampled latents:

$$e_z = \frac{1}{m} \sum_{j=1}^{m} e_{z_j} \qquad (11)$$
Since $m$ latent code embeddings are sent to the decoder in the forward pass, all of them are updated in the backward pass for a single training example. In hard EM training, only one of them is updated per example. Sending averaged embeddings also results in more stable training using the soft EM algorithm compared to hard EM, as shown in Section 5.
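A minimal sketch of this sampled E step and the averaged decoder input of Equation (11), under our own naming and with a toy codebook:

```python
import numpy as np

def soft_em_decoder_input(z_e, codebook, m=10, rng=None):
    """Sample m latent codes with logits l_i = -||z_e - e_i||^2, then
    average their embeddings (Equation 11) as the decoder input."""
    rng = rng if rng is not None else np.random.default_rng(0)
    logits = -np.sum((codebook - z_e) ** 2, axis=1)   # l_i for each code
    probs = np.exp(logits - logits.max())             # softmax over K codes
    probs /= probs.sum()
    samples = rng.choice(len(codebook), size=m, p=probs)  # z_1, ..., z_m
    return codebook[samples].mean(axis=0), samples        # (1/m) sum_j e_{z_j}

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
dec_in, samples = soft_em_decoder_input(np.array([0.1, 0.0]), codebook, m=5)
# With such well-separated codes, every sample falls on code 0.
```

When two codes are comparably close to `z_e`, the decoder input lies between their embeddings, which is what spreads gradient across multiple codebook entries.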
To train the latent predictor model (Section 2.1) in this case, we use an approach similar to label smoothing pereyra2017regularizing : the latent predictor model is trained to minimize the cross-entropy loss with the labels being the average of the one-hot labels of the sampled latents $z_1, \dots, z_m$.
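The resulting soft targets are just the empirical distribution of the sampled latents; a small illustrative sketch (names are ours):

```python
import numpy as np

def latent_predictor_targets(samples, K):
    """Soft targets for the latent predictor: the average of the one-hot
    labels of the sampled latents z_1, ..., z_m, as in label smoothing."""
    one_hot = np.eye(K)[samples]   # (m, K) one-hot rows
    return one_hot.mean(axis=0)    # soft label vector summing to 1

targets = latent_predictor_targets(np.array([2, 2, 0, 1, 2]), K=4)
# samples [2, 2, 0, 1, 2] -> targets [0.2, 0.2, 0.6, 0.0]
```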
4 Other Related Work
Variational autoencoders were first introduced by kingma2016improved ; rezende2014stochastic for training continuous representations; unfortunately, training them for discrete latent variable models has proved challenging. One promising approach has been to use various gradient estimators for discrete latent variable models, starting with the REINFORCE estimator of williams1992simple , an unbiased but high-variance gradient estimator. An alternate approach is to use continuous relaxations of categorical distributions, e.g., the Gumbel-Softmax reparametrization trick gs1 ; gs2 . These methods provide biased but low-variance gradients for training.

Machine translation using deep neural networks has been shown to achieve impressive results sutskever14 ; bahdanau2014neural ; cho2014learning ; transformer . The state-of-the-art models in neural machine translation are all autoregressive, which means that during decoding, the model consumes all previously generated tokens to predict the next one. Very recently, there have been multiple efforts to speed up machine translation decoding. nonautoregnmt attempt to address this issue by using the Transformer model transformer together with the REINFORCE algorithm williams1992simple to model the fertilities of words. The main drawback of the approach of nonautoregnmt is the need for extensive fine-tuning to make policy gradients work, as well as the non-generic nature of the solution. lee2018deterministic propose a non-autoregressive model using iterative refinement: instead of decoding the target sentence in one shot, the output is successively refined to produce the final output. While the output is produced in parallel at each step, the refinement steps happen sequentially.
5 Experiments
We evaluate our proposed methods on unconditional image generation on the CIFAR-10 dataset and supervised conditional language generation on the WMT English-to-German translation task. Our models and generative process follow the architecture proposed in vqvae for unconditional image generation and kaiser2018fast for neural machine translation. For all our experiments, we use the Adam kingma2014adam optimizer and decay the learning rate exponentially after initial warm-up steps. Unless otherwise stated, the dimension of the hidden states of the encoder and the decoder is 512; see Table 5 for a comparison of models with lower dimension. The code to reproduce our experiments will be released with the next version of the paper.
5.1 Machine Translation
In neural machine translation with latent variables, we model $P(y \mid x)$, where $y$ and $x$ are the target and source sentence respectively. Our model architecture, depicted in Figure 2, is similar to the one in kaiser2018fast . The encoder function is a series of strided convolutional layers with residual convolutional layers in between, and takes the target sentence $y$ as input. The source sentence $x$ is converted to a sequence of hidden states through multiple causal self-attention layers. In kaiser2018fast , the encoder of the autoencoder attends additionally to this sequence of continuous representations of the source sentence. We use VQ-VAE as the discretization algorithm. The decoder, applied after the bottleneck layer, uses transposed convolution layers whose continuous output is fed to a transformer decoder with causal attention, which generates the output.

The results are summarized in Table 1. Our implementation of VQ-VAE achieves a significantly better BLEU score and faster decoding speed compared to kaiser2018fast . We found that tuning the codebook size (number of clusters) gives the best accuracy with a codebook 16 times smaller than the one used in kaiser2018fast . Additionally, we see a large improvement in the performance of the model by using sequence-level distillation seqd , as has been observed previously in non-autoregressive models nonautoregnmt ; lee2018deterministic . Our teacher model is a base Transformer transformer that achieves a BLEU score of 28.1 and 27.0 on the WMT'14 test set using beam search decoding and greedy decoding respectively. For distillation purposes, we use the beam-search-decoded Transformer. Our VQ-VAE model trained with soft EM and distillation achieves a BLEU score of 26.7, without noisy parallel decoding nonautoregnmt . This performance is 1.4 BLEU points lower than an autoregressive model decoded with a beam size of 4, while being 4.1× faster. Importantly, we nearly match the same autoregressive model with beam size 1 (greedy decoding), with a 3.3× speedup.
The length of the sequence of discrete latent variables is shorter than that of the target sentence $y$. Specifically, at each compression step of the encoder we reduce its length by half. We denote by $n_c$ the compression factor for the latents, i.e., the number of steps for which we do this compression. In almost all our experiments, we use $n_c = 3$, reducing the length by a factor of 8. We can decrease the decoding time further by increasing the number of compression steps. As shown in Table 1, by setting $n_c$ to 4, the decoding time drops to 58 milliseconds, achieving 25.4 BLEU, while a NAT model with similar decoding speed achieves only 18.7 BLEU. Note that all NAT models also train with sequence-level knowledge distillation from an autoregressive teacher.
5.1.1 Analysis
Attention to Source Sentence Encoder:
While the encoder of the discrete autoencoder in kaiser2018fast attends to the output of the encoder of the source sentence, we find this to be unnecessary, with both models achieving the same BLEU score. Moreover, removing this attention step results in more stable training, particularly for large codebook sizes; see e.g., Figure 3.
VQVAE vs Other Discretization Techniques:
We compare the Gumbel-Softmax of gs1 ; gs2 and the improved semantic hashing discretization technique proposed in kaiser2018fast to VQ-VAE. When trained with sequence-level knowledge distillation, the model using Gumbel-Softmax reached BLEU, the model using improved semantic hashing reached BLEU, while the model using VQ-VAE reached BLEU on WMT'14 English-German.
Size of Discrete Latent Variable codebook:
Table 3 in the Appendix shows the BLEU score for different codebook sizes for models trained using hard EM without distillation. While kaiser2018fast use a larger codebook size, we find that a codebook 16 times smaller gives the best performance.
Robustness of EM to Hyperparameters:
While the soft EM training gives a small performance improvement, we find that it also leads to more robust training (Figure 3).
Model Size:
The effect of model size on BLEU score for models trained with soft EM and distillation is shown in Table 5 in Appendix.
Number of samples in Monte-Carlo EM update:
While training with soft EM, we perform a Monte-Carlo update with a small number of samples (Section 3.2). Table 4 in the Appendix shows the impact of the number of samples on the final BLEU score.
Model  $n_c$  Samples  BLEU  Latency  Speedup
Autoregressive Model (beam size=4)  –  –  28.1  –  –
Autoregressive Baseline (no beam search)  –  –  27.0  265 ms  –
NAT + distillation  –  –  17.7  39 ms  –^{*}
NAT + distillation + NPD=10  –  –  18.7  79 ms  –^{*}
NAT + distillation + NPD=100  –  –  19.2  257 ms  –^{*}
LT + Semhash  –  –  19.8  105 ms  –
Our Results
VQVAE  3  –  21.4  81 ms  4.08×
VQVAE with EM  3  5  22.4  81 ms  4.08×
VQVAE + distillation  3  –  26.4  81 ms  4.08×
VQVAE with EM + distillation  3  10  26.7  81 ms  4.08×
VQVAE with EM + distillation  4  10  25.4  58 ms  5.71×
^{*} Speedups for these items are computed relative to the decode time reported for an autoregressive Transformer in nonautoregnmt .
5.2 Image Generation
Model  Log perplexity
ImageTransformer  
VAE  
VQVAE vqvae  
VQVAE (Ours)  
VQVAE with EM (Ours)  
We train the unconditional VQ-VAE model on the CIFAR-10 dataset, modeling the joint probability $P(x, z)$, where $x$ is the image and $z$ are the discrete latent codes. We use a 2D field of latents with a learned codebook of embedding vectors. We maintain the same encoder and decoder as used in machine translation. Our Latent Predictor uses an Image Transformer imagetrans autoregressive decoder with layers of local self-attention. For the encoder, we use convolutional layers with strides, followed by residual layers and a single dense layer. For the decoder, we use a single dense layer, residual layers, and deconvolutional layers.
We calculate the lower bound on the negative log-likelihood in terms of the Latent Predictor loss and the negative log-perplexity of the autoencoder. Let $D$ be the total number of positions in the image and $D_l$ the number of latent codes. The lower bound on the negative log-likelihood in bits/dim is then computed as

$$\frac{D_l \cdot L_{\mathrm{lp}} + D \cdot L_{\mathrm{rec}}}{D},$$

where $L_{\mathrm{lp}}$ is the Latent Predictor cross-entropy per latent code and $L_{\mathrm{rec}}$ is the reconstruction loss per position, both measured in bits. Note that for CIFAR-10, $D = 32 \times 32 \times 3 = 3072$, while $D_l \ll D$. We report the results in Table 2 and show reconstructions from the autoencoder in Figure 4. As seen from the results, our VQ-VAE model trained with EM obtains a better negative log-likelihood (in bits/dim) than the baseline VQ-VAE.
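Assuming the per-dimension bookkeeping described above, the bound can be computed as follows; the function name and the loss values in the example are ours, for illustration only:

```python
def nll_bound_bits_per_dim(lp_bits_per_latent, rec_bits_per_dim,
                           num_latents, num_dims):
    """Total latent-predictor bits plus total reconstruction bits,
    normalized by the number of image dimensions."""
    total_bits = num_latents * lp_bits_per_latent + num_dims * rec_bits_per_dim
    return total_bits / num_dims

# CIFAR-10 has D = 32 * 32 * 3 = 3072 dimensions; the loss values
# below are made up for illustration.
bound = nll_bound_bits_per_dim(8.0, 4.0, num_latents=64, num_dims=3072)
```

Because $D_l \ll D$, the Latent Predictor term contributes only a small amount per dimension on top of the reconstruction cost.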
6 Conclusion
We investigate an alternate training technique for VQ-VAE inspired by its connection to the EM algorithm. Training the discrete bottleneck with EM helps us achieve better image generation results on CIFAR-10, and together with knowledge distillation, allows us to develop a non-autoregressive machine translation model whose accuracy almost matches the greedy autoregressive baseline, while being 3.3 times faster at inference.
References
 (1) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

 (2) Leon Bottou and Yoshua Bengio. Convergence properties of the k-means algorithms. In Advances in Neural Information Processing Systems, pages 585–592, 1995.
 (3) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
 (4) Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.
 (5) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 (6) Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. CoRR, abs/1711.02281, 2017.
 (7) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 (8) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. CoRR, abs/1611.01144, 2016.
 (9) Łukasz Kaiser and Samy Bengio. Discrete autoencoders for sequence models. CoRR, abs/1801.09797, 2018.
 (10) Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, and Noam Shazeer. Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382, 2018.
 (11) Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
 (12) Yoon Kim and Alexander Rush. Sequence-level knowledge distillation. In Proceedings of EMNLP, 2016.
 (13) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 (14) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
 (15) Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic nonautoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901, 2018.
 (16) Percy Liang and Dan Klein. Online EM for unsupervised models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 611–619. Association for Computational Linguistics, 2009.
 (17) James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
 (18) Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/1611.00712, 2016.
 (19) Mohammad Norouzi and David J Fleet. Cartesian k-means. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3017–3024. IEEE, 2013.
 (20) Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, and Alexander Ku. Image transformer. arXiv, 2018.
 (21) Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
 (22) Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Dan Belov, and Nando de Freitas. Parallel multiscale autoregressive density estimation. arXiv preprint arXiv:1703.03664, 2017.
 (23) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. CoRR, abs/1401.4082, 2014.
 (24) Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
 (25) Masa-Aki Sato and Shin Ishii. On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12(2):407–432, 2000.
 (26) David Sculley. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pages 1177–1178. ACM, 2010.
 (27) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
 (28) Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017.
 (29) Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 (30) Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
 (31) Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. CoRR, abs/1711.00937, 2017.
 (32) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, 2017.
 (33) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3156–3164. IEEE, 2015.
 (34) Greg CG Wei and Martin A Tanner. A monte carlo implementation of the em algorithm and the poor man’s data augmentation algorithms. Journal of the American statistical Association, 85(411):699–704, 1990.

 (35) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.
Appendix A Ablation Tables
Model  Codebook size  BLEU 

VQVAE  20.8  
VQVAE  21.6  
VQVAE  21.0  
VQVAE  21.8 
Model  $n_c$  Samples  BLEU  Latency  Speedup

VQVAE + distillation  3    26.4  81 ms  4.08 
VQVAE with EM + distillation  3  5  26.4  81 ms  4.08 
VQVAE with EM + distillation  3  10  26.7  81 ms  4.08 
VQVAE with EM + distillation  3  25  26.6  81 ms  4.08 
VQVAE with EM + distillation  3  50  26.5  81 ms  4.08 
VQVAE + distillation  4    22.4  58 ms  5.71 
VQVAE with EM + distillation  4  5  22.3  58 ms  5.71 
VQVAE with EM + distillation  4  10  25.4  58 ms  5.71 
VQVAE with EM + distillation  4  25  25.1  58 ms  5.71 
VQVAE with EM + distillation  4  50  23.6  58 ms  5.71 
Model  Hidden dimension  Samples  BLEU  Latency  Speedup

VQVAE + distillation  256    24.5  76 ms  
VQVAE with EM + distillation  256  10  21.9  76 ms  
VQVAE with EM + distillation  256  25  25.8  76 ms  
VQVAE + distillation  384    25.6  80 ms  
VQVAE with EM + distillation  384  10  22.2  80 ms  
VQVAE with EM + distillation  384  25  26.2  80 ms 