1 Introduction
Autoencoder models are widely used in various NLP tasks such as machine translation (Bahdanau et al., 2014; Cho et al., 2014; Sutskever et al., 2014; Luong et al., 2015; Vaswani et al., 2017; Devlin et al., 2018) and dialog response generation (Vinyals and Le, 2015; Olabiyi and Mueller, 2019). Generally speaking, an autoencoder model learns a function that maps each input to a latent representation and then back to the original data space.
Unlike autoencoders, VAEs aim to learn a probability distribution over a dataset, from which new instances that look similar to the original data can be generated. With such a generative model, one can easily draw samples from the distribution following a decoding scheme. VAEs have achieved tremendous success in generating high-quality images, videos, and speech (van den Oord et al., 2017; Razavi et al., 2019). At the same time, VAEs have also been applied in NLP to improve traditional maximum-likelihood-estimation (MLE) based models, achieving impressive progress in language modeling (Bowman et al., 2015; Fabius and van Amersfoort, 2014; Miao et al., 2016; Yang et al., 2017), controllable text generation (Hu et al., 2017), neural machine translation (Shah and Barber, 2018), and many other applications. Despite this impressive success, a well-known pitfall of VAEs, especially in applications of sequence-to-sequence (Seq2Seq) modeling, is a phenomenon called latent variable collapse (or posterior collapse) (Bowman et al., 2015), where the encoder yields meaningless posteriors that collapse to the prior. With this pitfall, VAEs usually fail to learn meaningful representations of individual data samples. Several attempts have been made to alleviate this issue (Bowman et al., 2015; Hoffman and Johnson, 2016; Sønderby et al., 2016; Kingma et al., 2016; Chen et al., 2016; Zhao et al., 2017b; Yeung et al., 2017; Alemi et al., 2017; Dieng et al., 2018; Fu et al., 2019; He et al., 2019; Fang et al., 2019); however, most of these approaches are heuristic in nature.
Our solution is motivated by two possible explanations of posterior collapse. First, recent research shows that the prior plays an important role in density estimation (Hoffman and Johnson, 2016; Takahashi et al., 2019). Although Gaussian priors and posteriors are widely adopted, such simplified priors tend to induce latent variable collapse and poor density estimation. To overcome this issue, we argue that a flexible prior should be learned simultaneously during training. In this way, even if one encounters posterior collapse during learning, the collapsed variational distribution is still meaningful. Second, related work has shown that posterior collapse is caused by a lack of good latent codes (Fu et al., 2019); thus, designing an effective way of learning useful representations without supervision is key to addressing the problem. In this paper, based on the above two arguments, we propose to enforce a discrete latent space for VAEs. The discrete space consists of learnable atoms that are shared by all data inputs. The discrete latent space automatically brings at least three benefits: (i) the atoms of a discrete prior can be efficiently learned during training; (ii) the discrete nature of the prior makes the KL-divergence between the prior and a variational distribution non-vanishing, hence free of posterior collapse; (iii) the discrete VAE is formulated following the standard VAE setting, making learning and inference particularly efficient. The contributions of our paper are summarized as follows:

We propose the concept of discretized-bottleneck VAEs for RNN-based Seq2Seq models, which can overcome the posterior-collapse problem, a long-standing issue in NLP applications.

We showcase how to inject the discretized bottleneck into Seq2Seq models on a variety of NLP tasks. When the model and training strategy are carefully managed, our DB-VAE can accurately model discrete text without sacrificing its reliance on latent representations or suffering from posterior collapse. We also find that, under our framework, the discrete bottleneck can capture more sentence-level semantic features.

Inference in the proposed DB-VAE requires a nearest-neighbor (NN) search over the discrete atoms in the latent space. We extend 1-NN search to the k-NN setting and show that it can provide more correct translations for a given source text, thus increasing the BLEU score. The method is referred to as top-k search. Naturally, it can also provide diverse responses in the dialog response generation task.
2 Preliminaries
2.1 Variational Autoencoder
VAEs consist of two parts, an encoder (inference network) and a decoder (generative network). The decoder corresponds to the following generative process for an input x:

p_\theta(x) = \int p(z)\, p_\theta(x \mid z)\, dz \quad (1)

where p(z) is a predefined prior distribution and p_\theta(x \mid z) is a conditional distribution (likelihood) induced by the decoder. To learn the parameters \theta, one typically maximizes the following marginal log-likelihood:

\log p_\theta(x) = \log \int p(z)\, p_\theta(x \mid z)\, dz \quad (2)

Direct optimization of the log-likelihood is usually intractable. VAEs instead parameterize a family of variational distributions q_\phi(z \mid x) (often known as an encoder) to approximate the true posterior p_\theta(z \mid x), ending up optimizing the following evidence lower bound (ELBO):

\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \quad (3)
2.2 Posterior collapse
In neural language models, both the encoder and the decoder are often parameterized by strong autoregressive neural networks, e.g., LSTMs and GRUs. For an input x = (x_1, \ldots, x_T), every token is fully conditioned on all previous tokens:

p_\theta(x \mid z) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, z) \quad (4)

An issue with the VAE is that p_\theta(x \mid z) is defined in such a flexible manner that the model can push the KL term towards zero, leading to posterior collapse and meaningless latent codes (Bowman et al., 2015; Kingma et al., 2016; Chen et al., 2016).
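To make the collapse concrete (an illustration of ours, not from the paper): with the common Gaussian parameterization q_\phi(z \mid x) = N(\mu, \mathrm{diag}(\sigma^2)) and prior N(0, I), the KL term has a closed form that reaches zero exactly when the posterior matches the prior, i.e., when the latent code carries no information about x:

```python
import math

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

print(gaussian_kl([1.0, -0.5], [0.0, 0.0]))  # 0.625: an informative posterior
print(gaussian_kl([0.0, 0.0], [0.0, 0.0]))   # 0.0: the posterior has collapsed to the prior
```

In a collapsed VAE this quantity is driven to (near) zero for every input, which is exactly what the discrete construction of Section 3 rules out.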
3 Discretized Bottleneck in VAE
3.1 Model
Our proposed model is general and can be applied to most existing Seq2Seq models. Without loss of generality, we describe our framework under the setting of an RNN-based language model. As shown in Figure 1, our model consists of three parts: an encoder, a latent code generator, and a decoder.
Encoder
Let an input sequence be defined as x = (x_1, \ldots, x_T). The encoder encodes the input token at each time step into a latent representation. This is implemented by feeding the input sequence to an LSTM encoder, resulting in

h_t = \mathrm{LSTM}_{\mathrm{enc}}(h_{t-1}, e_{x_t}) \quad (5)

where e_{x_t} is the word embedding vector of the word x_t. The latent representation of the input sequence is h = (h_1, \ldots, h_T).

Latent code generation
Different from the vanilla VAE mechanism, we define the latent code to be a combination of a set of latent codes from a global codebook e \in \mathbb{R}^{K \times d}, where K is the codebook size and d is the latent embedding dimension. Specifically, the idea is to associate each h_t with one codebook atom e_{m(t)}, where m(\cdot) is an index mapping function that maps the time index t to a codebook index (defined below). The latent code z for input x is then defined via an aggregation function (a "mean function" for simplicity):

z = \frac{1}{T} \sum_{t=1}^{T} e_{m(t)} \quad (6)
There are several ways to define the index mapping m(\cdot). We adopt the idea of nearest-neighbor search to define m(t): it chooses the code from e that is closest to h_t after a linear transformation. Formally, let f(h_t) = W h_t with learnable parameters W; we define m(t) as

m(t) = \arg\min_{k \in \{1, \ldots, K\}} \left\| f(h_t) - e_k \right\|_2 \quad (7)
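A minimal sketch of the lookup in Eq. (7) and the mean aggregation in Eq. (6) (our illustration; the linear map f is taken as the identity here, and the tiny codebook is hypothetical):

```python
def nearest_atom(h_t, codebook):
    """Eq. (7): index of the codebook atom closest to the (transformed) hidden state.
    For simplicity the linear map f is taken as the identity."""
    dists = [sum((hi - ei) ** 2 for hi, ei in zip(h_t, e_k)) for e_k in codebook]
    return min(range(len(codebook)), key=dists.__getitem__)

def latent_code(hidden_states, codebook):
    """Eq. (6): mean of the selected atoms over the whole sequence."""
    atoms = [codebook[nearest_atom(h, codebook)] for h in hidden_states]
    dim = len(codebook[0])
    return [sum(a[j] for a in atoms) / len(atoms) for j in range(dim)]

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]   # K = 3 toy atoms, d = 2
hs = [[0.9, 1.2], [3.8, -0.1]]                    # two time steps
print([nearest_atom(h, codebook) for h in hs])    # [1, 2]
print(latent_code(hs, codebook))                  # [2.5, 0.5]
```

Note that the latent code is a deterministic function of the input given the codebook, which is what makes the variational distribution below a point mass.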
Based on the above construction, it is easy to see that, given the codebook e, the final latent code of the input can be formulated as a discrete distribution, i.e.,

q(z \mid x) = \delta(z - z_x) \quad (8)

where z_x = \frac{1}{T} \sum_{t=1}^{T} e_{m(t)}, and \delta(\cdot) is a delta function with point mass at z_x. With such a construction, one can easily check that the KL-divergence between q(z \mid x) and a prior distribution p(z) from the generative model (usually set to be a simple Gaussian or uniform distribution) can be calculated as

\mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right) = -\log p(z_x) \quad (9)
Remark 1
We observe that, by optimizing the global codebook during training so that most of the codes are informative and lie within the low-density regions of the prior p(z), the KL-divergence will always be larger than zero, effectively preventing posterior collapse.
Decoder
Similar to the encoder, we parameterize the decoder with another LSTM. The target hidden state s_t can be progressively calculated as

s_t = \mathrm{LSTM}_{\mathrm{dec}}(s_{t-1}, [e_{x_{t-1}}; z]) \quad (10)

Finally, we calculate the output distribution over the entire vocabulary at time t as

p(x_t \mid x_{<t}, z) = \mathrm{softmax}(W_o s_t) \quad (11)
3.2 Training
Learning DB-VAE is divided into two parts: 1) learning the encoder and decoder; and 2) learning the global codebook.
Learning the encoder and decoder
Our proposed DB-VAE model belongs to the general VAE framework, with the special form of the variational distribution defined in (8). As a result, the encoder and decoder can be learned by optimizing the evidence lower bound (ELBO):

\mathrm{ELBO} = \mathbb{E}_{x \sim p_D}\left[ \mathbb{E}_{q(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right) \right] \quad (12)

where p_D denotes the training data distribution, and the KL term is evaluated following (9).
Learning the codebook
Directly optimizing the codebook with the above ELBO is infeasible because gradients cannot propagate back to the codebook through the non-differentiable operator defined in (7). To this end, we follow (van den Oord et al., 2017) and define a new objective for updating the codebook. The key observation is that the codebook only appears in (7); thus the goal is to update the codebook so as to minimize the distance between a latent code and the corresponding codebook atom. Specifically, the loss is defined as

\mathcal{L}_{\mathrm{code}} = \sum_{t=1}^{T} \left[ \left\| \mathrm{sg}\left(f(h_t)\right) - e_{m(t)} \right\|_2^2 + \beta \left\| f(h_t) - \mathrm{sg}\left(e_{m(t)}\right) \right\|_2^2 \right] \quad (13)

where sg(\cdot) denotes the stop-gradient operator, used to avoid complicated gradient flows and to stabilize training, and \beta is a constant balancing the two terms.
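In an autodiff framework, sg(\cdot) is typically implemented as a detach operation. The following framework-free sketch only evaluates the two terms of Eq. (13) for a single time step; the identity transform f and the value \beta = 0.25 are assumptions of ours, not values from the paper:

```python
def codebook_loss(h_t, atom, beta=0.25):
    """Eq. (13) for one time step. In an autodiff framework the first term would
    pull the codebook atom toward sg(f(h_t)), while the second (commitment) term
    would pull f(h_t) toward sg(atom); numerically both terms share the same
    squared distance, and only their gradient targets differ."""
    sq_dist = sum((hi - ai) ** 2 for hi, ai in zip(h_t, atom))
    return sq_dist + beta * sq_dist

print(codebook_loss([0.9, 1.2], [1.0, 1.0]))  # ≈ 0.0625 with beta = 0.25
```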
The overall algorithm
The full training algorithm is summarized in Algorithm 1. We find that it is important to balance learning the encoder-decoder against learning the codebook. At the beginning, if the codebook does not learn as fast as the encoder, the low utilization rate of the codebook prevents it from being learned well, e.g., most input samples concentrate on only a limited number of codebook atoms. To overcome this issue, we add a pre-training step, in which the decoder is fed with the latent codes directly from the encoder, bypassing the codebook. This ensures that reasonable gradients can be passed through the latent space and the encoder. In the following, we apply the superscript "(i)" to a variable (or function) to denote its dependency on the i-th input sample, e.g., h_t^{(i)}. To determine whether one should perform a pre-training step, we define a perplexity score to monitor the utilization of the codebook:
\mathrm{Perplexity} = \exp\left(-\sum_{k=1}^{K} u_k \log u_k\right), \quad u = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{T^{(i)}} \sum_{t=1}^{T^{(i)}} \mathbf{1}_{m^{(i)}(t)} \quad (14)

where \mathbf{1}_k denotes an all-zero vector except the k-th bit, which is set to 1. The perplexity is large when the elements of u are close to uniform; thus it indicates the utilization rate of the codebook.
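A small sketch of the perplexity score in Eq. (14), computed here from the multiset of selected codebook indexes (our illustration):

```python
import math
from collections import Counter

def codebook_perplexity(indices, K):
    """Eq. (14): exponential of the entropy of the empirical atom-usage
    distribution; ranges from 1 (one atom used) to K (uniform usage)."""
    counts = Counter(indices)
    u = [counts.get(k, 0) / len(indices) for k in range(K)]
    return math.exp(-sum(p * math.log(p) for p in u if p > 0))

print(codebook_perplexity([0, 1, 2, 3], K=4))  # ≈ 4.0: uniform, fully utilized codebook
print(codebook_perplexity([0, 0, 0, 0], K=4))  # ≈ 1.0: a single atom dominates
```

A low value of this score would trigger the pre-training step described above.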
Extension: top-k NN search
In our construction of a latent code, we search for the nearest code in the codebook via the index mapping defined in (7). This construction has a limitation: a hidden state from the LSTM corresponds to only one atom of the codebook. This scheme, however, does not fit real applications well. For example, in neural machine translation, one source sentence (one hidden state) can correspond to multiple correct translations (multiple atoms); and in dialog response generation, a good model should be able to generate multiple relevant and diverse responses for the same context. Furthermore, when a VAE is well trained, input texts with similar semantics should be mapped to close clusters in the latent space (see Section 5.1). As a result, we propose a generalization that extends the 1-NN search to a k-NN search over the codebook when constructing latent codes. In other words, Eq. (7) returns a set of k indexes, corresponding to the k nearest codebook atoms. These atoms are then averaged over the whole sequence to generate the final latent code, as in Eq. (6). The corresponding algorithm is summarized in Algorithm 2.
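The top-k generalization of Eq. (7) can be sketched as follows (our illustration, again assuming an identity transform and a hypothetical toy codebook):

```python
def k_nearest_atoms(h_t, codebook, k):
    """Top-k generalization of Eq. (7): indexes of the k closest codebook atoms,
    ordered from nearest to farthest."""
    dists = [(sum((hi - ei) ** 2 for hi, ei in zip(h_t, e)), idx)
             for idx, e in enumerate(codebook)]
    return [idx for _, idx in sorted(dists)[:k]]

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
print(k_nearest_atoms([0.9, 1.2], codebook, k=2))  # [1, 0]
```

Each of the k atoms can then seed its own decoding pass, yielding k candidate translations or responses per source input.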
4 Related Work on Posterior Collapse
Several attempts have been made to alleviate the posterior-collapse issue. Among them, perhaps the simplest solution is KL cost annealing, in which the weight of the KL penalty term is scheduled to gradually increase during training (Bowman et al., 2015). Later, Fu et al. (2019) proposed a cyclical annealing schedule, which allows progressive learning of more meaningful latent codes by leveraging the informative representations of previous cycles as warm restarts. These approaches manually encourage the use of latent codes, but might hurt a model's density-approximation ability, as pointed out in He et al. (2019). Our method differs in that it maintains a model's representation power while learning an informative latent space.
Other solutions weaken the capacity of the generative network or enhance the inference network. Yang et al. (2017) propose the use of a dilated CNN as the decoder in a VAE, controlling the size of the context from previously generated words. Kim et al. (2018) propose a semi-amortized approach that uses stochastic variational inference to iteratively refine the inference network; this method, however, is expensive to train. Similarly, He et al. (2019) propose a simple yet effective training algorithm that aggressively optimizes the inference network with more updates. Other threads of work introduce more complicated priors in the latent space (Tomczak and Welling, 2017; Xu and Durrett, 2018). Makhzani et al. (2015); Joulin et al. (2016) further replace the KL regularizer with an adversarial regularizer. Our work outperforms these methods without incurring additional training burden.
For discrete representations in VAEs, the most related work is (Zhao et al., 2018), which applies the Gumbel-Softmax trick (Jang et al., 2016) to train discrete variables, resulting in effective and interpretable dialog generation. Our approach has wider applicability and is readily extended to more NLP tasks. Other approaches combine vector quantization with the Transformer model (Kaiser et al., 2018; Roy et al., 2018). These approaches focus primarily on non-autoregressive neural machine translation and do not investigate the posterior-collapse issue in sequential variational inference.
5 Experiments
We conduct extensive experiments to demonstrate the effectiveness and efficiency of the proposed DB-VAE on various language processing tasks, including language modeling (LM), unaligned text-style transfer, dialog-response generation, and neural machine translation (NMT). In addition, we evaluate how the codebook size K affects the model's performance. Code for reproducing these results will be made publicly available.
5.1 Language modeling
Following Yang et al. (2017), we evaluate our model for language modeling mainly on two large-scale document corpora, Yahoo and Yelp. Detailed statistics of the two datasets are given in Table 7 in the Supplementary Material (SM) A.1. We first use a simple synthetic dataset (He et al., 2019), consisting of 16k training sentences and 4k testing sentences, to evaluate how the codebook size affects the model's performance.
The impact of codebook size
Baseline and training details
Four representative LM models are chosen as baselines: LSTM-LM, the standard VAE, SA-VAE (Kim et al., 2018), and LagVAE (He et al., 2019), the current state of the art. For fair comparison, both the recognition network and the generative network are implemented as a one-layer LSTM with 1024 hidden units for all models. The word embedding dimension is set to 1024 and the latent dimension to 32. The SGD optimizer with the same settings is applied to all models. The latent variable is used to initialize the hidden state of the decoder and is fed as an additional input at each time step.
LM results
The results in terms of reconstruction error, perplexity, and training time are shown in Table 1 and Figure 2. As expected, our model achieves the best performance on all the metrics, owing to the flexibility of the discrete variational distribution, which makes the model free of posterior collapse. Remarkably, our model runs almost as fast as the standard VAE. The faster convergence of LagVAE at the beginning is because it trains the encoder aggressively, so that substantially more data are used to train the encoder in each epoch.
Latent space visualization
For better understanding, we visualize the latent representations of the whole dataset using t-SNE projection (Maaten and Hinton, 2008) in Figure 3. Our model learns a much smoother and more separable transition from 0-star to 4-star reviews. To visualize codebook utilization, we also compute the perplexity score (Eq. 14) on a random batch of testing data after each training epoch. As shown in Figure 4, the usage of the codebook becomes more balanced as training goes on.
Codebook interpolation
Particularly in text modeling, when performing a convex combination between any two latent codes z_1 and z_2, the interpolation is z_\alpha = (1 - \alpha) z_1 + \alpha z_2. Ideally, adjusting \alpha from 0 to 1 will generate a series of sentences, where the sentence for z_\alpha becomes less semantically similar to the sentence corresponding to z_1 and more semantically similar to that of z_2 (Berthelot et al., 2018). Table 2 shows the generated sentences when \alpha ranges from 0.0 to 1.0 with a step size of 0.2. Indeed, intermediate sentences produced by the proposed model provide a semantically smooth morphing between the two endpoints. More detailed examples are provided in Table 8 in the SM A.
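The interpolation itself is just a convex combination of the two codes; as a trivial sketch (our illustration, with made-up two-dimensional codes):

```python
def interpolate(z1, z2, alpha):
    """Convex combination z_alpha = (1 - alpha) * z1 + alpha * z2."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(z1, z2)]

# Sweeping alpha over 0.0, 0.2, ..., 1.0 produces the code sequence whose
# decodings are shown in Table 2; the midpoint is the plain average.
print(interpolate([0.0, 0.0], [1.0, 2.0], 0.5))  # [0.5, 1.0]
```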
5.2 Unaligned neural text style transfer
Next, we evaluate the proposed model on the unaligned sentiment transfer task on the Yelp dataset. Review ratings above three are considered positive, and those below three are considered negative. Hence, we split the corpus into two sets of unaligned positive reviews (350k) and negative reviews (250k). The goal of the style transfer task is to change the underlying sentiment between positive and negative reviews.
Experiment setup
We denote y as the sentiment attribute and construct a decoder to implement the conditional distribution p(x | z, y). Following the setup in (Zhao et al., 2017a; Shen et al., 2017), we train two separate decoders, one for positive reviews, p(x | z, y = 1), and one for negative reviews, p(x | z, y = 0). Normally, the latent code z would encode all the semantic and attribute information of the input. In our model, we want the attribute information to be excluded from z and let the decoder learn to produce the transferred reviews. Following (Zhao et al., 2017a), a classifier is introduced to distinguish the latent code's attribute, and the encoder is adversarially trained to fool the classifier, thus removing the sentiment attribute from the latent space.
Baseline
We compare our model with two strong baselines: 1) the adversarially regularized autoencoder (ARAE) (Zhao et al., 2017a), which learns the prior via more expensive and unstable adversarial training; and 2) a recently developed implicit deep-latent-variable model (iVAE) (Fang et al., 2019) that applies sample-based representations of variational distributions.
Quantitative metrics
We adopt several quantitative metrics: (i) Transfer: style-transfer accuracy, evaluated by an automatic classification model (the fastText library (Joulin et al., 2016)); (ii) BLEU: the consistency between the transferred candidate and the original reference; (iii) PPL and Reverse PPL (RPPL): PPL measures the fluency of the generated text, while RPPL works in the reverse direction, computed by training an LM on the generated data and evaluating it on the original data. Mode collapse may be detected by the RPPL value.
Quantitative analysis
Table 3 shows the sentiment transfer results. The proposed method outperforms ARAE on all metrics. On the one hand, in addition to the better PPL and RPPL, our model preserves the advantages already highlighted in Section 5.1. On the other hand, compared with ARAE, the higher transfer accuracy and BLEU score indicate that our model captures more sentiment-related information while keeping the grammatical structure of the original and transferred text consistent.
Qualitative results
5.3 Dialog response generation
In this experiment, we follow (Gu et al., 2018) and evaluate the proposed model on two widely used dialog datasets: Switchboard (Godfrey and Holliman, 1997) and DailyDialog (Li et al., 2017). Responses generated by VAE-based models (Zhao et al., 2017c; Gu et al., 2018) are conditioned on the latent variable, so this task examines whether a model can capture a richer latent space and thus generate more diverse, informative, and consistent responses.
Baselines
We compare our model's performance with six representative baselines for dialog modeling: (i) SeqGAN: a GAN-based model for sequence generation (Yu et al., 2017); (ii) CVAE: a conditional VAE model (Zhao et al., 2017c); (iii) CVAE-BOW: CVAE with a bag-of-words loss (Zhao et al., 2017c); (iv) VHRED: a hierarchical VAE model (Serban et al., 2017); (v) WAE-GMP: a conditional Wasserstein autoencoder with a Gaussian mixture prior network (Gu et al., 2018), which holds the state of the art; and (vi) DI-VAE: a discrete VAE (Zhao et al., 2018), which is most related to our work.
Quantitative metrics
Following the evaluation setup in (Gu et al., 2018), three evaluation metrics (see details in SM A.2) are used: (i) Sentence-level BLEU, which matches n-grams in the candidate (generated) sentences against n-grams in the reference text; (ii) BOW Embedding, the cosine similarity of bag-of-words embeddings between the candidate and the reference; (iii) Distinct, which computes the diversity of the generated responses.
Quantitative analysis
Table 5 and Table 9 show the quantitative results of our model and other strong baselines on Switchboard and DailyDialog. Our model outperforms the baselines on most metrics. Although our method obtains a BLEU score similar to WAE-GMP, the inter-dist and intra-dist scores are much higher. In terms of intra-dist, the dist-1 and dist-2 scores on Switchboard are 19.2% and 24.6% higher than WAE-GMP, indicating that our model generates fewer repeated n-grams in each response. As for inter-dist, dist-1 and dist-2 are 66.1% and 45.6% higher than WAE-GMP, meaning that our model generates much more diverse responses.
5.4 Extension: RNNbased NMT model
We finally evaluate our model with the proposed top-k NN search on the German-English translation task. Our model is built on the baseline RNNsearch architecture (Bahdanau et al., 2014). The recently proposed variational attention model (Deng et al., 2018) is also adopted as a baseline. We use the IWSLT14 dataset (Cettolo et al., 2014), a standard benchmark for experimental NMT models, containing around 153K, 7K, and 7K sentences for training, validation, and testing, respectively. The same preprocessing as in (Ott et al., 2018) is applied. As for the architecture, both the encoder and the decoder have one layer, each with 512-dimensional embeddings. For BLEU evaluation, the beam size in beam search is 5. The Fairseq library (Ott et al., 2019) is adopted as the codebase. The codebook size is set to , and only the final hidden state of the encoder passes through the discretized bottleneck.
Results averaged over 5 different runs are reported in Table 6 and Figure 5. Note that the attention mechanism is used in RNNsearch, so each decoder state has direct access to the encoder states. Although we only discretize the final hidden state of the encoder, as formulated in Section 3.1, a notable improvement in PPL and BLEU score is still observed. Following Algorithm 2, as we increase k from 1 to 10, the BLEU score keeps increasing until k reaches 5. The reason might be that the top-5 latent codes already encode most source-target combinations. Moreover, the BLEU score drops to as low as 26.1 when we instead choose the farthest latent code from the codebook. These results validate the effectiveness of the proposed top-k inference strategy, which applies to most RNN-based autoencoder models.
6 Conclusion
We propose DB-VAE, a variant of the VAE that uses a discretized bottleneck, obtained from a global codebook, for latent representations. Our model can overcome the posterior-collapse issue in Seq2Seq models and provides a good balance between optimizing the inference network and the generative network. Moreover, DB-VAE can capture richer semantic information of discrete structured sequences. Extensive experiments demonstrate the effectiveness of the proposed approach. DB-VAE is flexible enough to be extended to other NLP models such as the Transformer and BERT, which we leave as interesting future work.
References
 Alemi et al. (2017) Alexander A Alemi, Ben Poole, Ian Fischer, Joshua V Dillon, Rif A Saurous, and Kevin Murphy. 2017. Fixing a broken elbo. arXiv preprint arXiv:1711.00464.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
 Berthelot et al. (2018) David Berthelot, Colin Raffel, Aurko Roy, and Ian Goodfellow. 2018. Understanding and improving interpolation in autoencoders via an adversarial regularizer. arXiv preprint arXiv:1807.07543.
 Bowman et al. (2015) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
 Cettolo et al. (2014) Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th iwslt evaluation campaign, iwslt 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, page 57.
 Chen et al. (2016) Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.
 Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
 Deng et al. (2018) Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush. 2018. Latent alignment and variational attention. In Advances in Neural Information Processing Systems, pages 9712–9724.
 Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
 Dieng et al. (2018) Adji B Dieng, Yoon Kim, Alexander M Rush, and David M Blei. 2018. Avoiding latent variable collapse with generative skip models. arXiv preprint arXiv:1807.04863.
 Fabius and van Amersfoort (2014) Otto Fabius and Joost R van Amersfoort. 2014. Variational recurrent autoencoders. arXiv preprint arXiv:1412.6581.
 Fang et al. (2019) Le Fang, Chunyuan Li, Jianfeng Gao, Wen Dong, and Changyou Chen. 2019. Implicit deep latent variable models for text generation. arXiv preprint arXiv:1908.11527.

 Forgues et al. (2014) Gabriel Forgues, Joelle Pineau, Jean-Marie Larchevêque, and Réal Tremblay. 2014. Bootstrapping dialog systems with word embeddings. In NIPS, Modern Machine Learning and Natural Language Processing Workshop, volume 2.
 Fu et al. (2019) Hao Fu, Chunyuan Li, Xiaodong Liu, Jianfeng Gao, Asli Celikyilmaz, Lawrence Carin, et al. 2019. Cyclical annealing schedule: A simple approach to mitigating KL vanishing. arXiv preprint arXiv:1903.10145.
 Godfrey and Holliman (1997) John J Godfrey and Edward Holliman. 1997. Switchboard1 release 2. Linguistic Data Consortium, Philadelphia, 926:927.
 Gu et al. (2018) Xiaodong Gu, Kyunghyun Cho, Jung-Woo Ha, and Sunghun Kim. 2018. DialogWAE: Multimodal response generation with conditional Wasserstein autoencoder. arXiv preprint arXiv:1805.12352.
 He et al. (2019) Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. 2019. Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534.

 Hoffman and Johnson (2016) Matthew D Hoffman and Matthew J Johnson. 2016. ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, volume 1.
 Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1587–1596. JMLR.org.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
 Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
 Kaiser et al. (2018) Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, and Noam Shazeer. 2018. Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382.
 Kim et al. (2018) Yoon Kim, Sam Wiseman, Andrew C Miller, David Sontag, and Alexander M Rush. 2018. Semi-amortized variational autoencoders. arXiv preprint arXiv:1802.02550.
 Kingma et al. (2016) Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743–4751.
 Li et al. (2017) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.
 Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
 Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
 Makhzani et al. (2015) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
 Miao et al. (2016) Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In International conference on machine learning, pages 1727–1736.
 Mitchell and Lapata (2008) Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244.
 Olabiyi and Mueller (2019) Oluwatobi Olabiyi and Erik T Mueller. 2019. Multi-turn dialogue response generation with autoregressive transformer models. arXiv preprint arXiv:1908.01841.
 van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315.
 Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.
 Ott et al. (2018) Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. arXiv preprint arXiv:1806.00187.
 Razavi et al. (2019) Ali Razavi, Aaron van den Oord, and Oriol Vinyals. 2019. Generating diverse high-fidelity images with VQ-VAE-2. arXiv preprint arXiv:1906.00446.
 Roy et al. (2018) Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. 2018. Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063.
 Rus and Lintean (2012) Vasile Rus and Mihai Lintean. 2012. A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 157–162. Association for Computational Linguistics.

 Serban et al. (2017) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence.
 Shah and Barber (2018) Harshil Shah and David Barber. 2018. Generative neural machine translation. In Advances in Neural Information Processing Systems, pages 1346–1355.
 Shen et al. (2017) Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in neural information processing systems, pages 6830–6841.
 Sønderby et al. (2016) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. How to train deep variational autoencoders and probabilistic ladder networks. In 33rd International Conference on Machine Learning (ICML 2016).
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
 Takahashi et al. (2019) Hiroshi Takahashi, Tomoharu Iwata, Yuki Yamanaka, Masanori Yamada, and Satoshi Yagi. 2019. Variational autoencoder with implicit optimal priors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5066–5073.
 Tomczak and Welling (2017) Jakub M Tomczak and Max Welling. 2017. VAE with a VampPrior. arXiv preprint arXiv:1705.07120.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
 Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
 Xu and Durrett (2018) Jiacheng Xu and Greg Durrett. 2018. Spherical latent spaces for stable variational autoencoders. arXiv preprint arXiv:1808.10805.
 Yang et al. (2017) Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3881–3890. JMLR.org.
 Yeung et al. (2017) Serena Yeung, Anitha Kannan, Yann Dauphin, and Li Fei-Fei. 2017. Tackling over-pruning in variational autoencoders. arXiv preprint arXiv:1706.03643.
 Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.
 Zhao et al. (2017a) Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M Rush, and Yann LeCun. 2017a. Adversarially regularized autoencoders. arXiv preprint arXiv:1706.04223.
 Zhao et al. (2017b) Shengjia Zhao, Jiaming Song, and Stefano Ermon. 2017b. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658.
 Zhao et al. (2018) Tiancheng Zhao, Kyusong Lee, and Maxine Eskenazi. 2018. Unsupervised discrete sentence representation learning for interpretable neural dialog generation. arXiv preprint arXiv:1804.08069.
 Zhao et al. (2017c) Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017c. Learning discourselevel diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960.
Appendix A Supplemental Material
A.1 Language modeling
Sentences generated at intermediate points between two latent codes; the leading number is the interpolation weight:

0.0  had a great experience at this place ! i had a great experience with the staff and the staff was very friendly and helpful ! i had a great experience and i will definitely be back ! 
0.1  had a great experience here ! the staff was very friendly and helpful ! i had a great time and i will definitely be back ! 
0.2  stopped in for a quick bite before heading out to the airport . i had the chicken and waffles and it was delicious ! i would definitely recommend this place to anyone looking for a great breakfast ! 
0.3  stopped in for a quick bite before heading out to the airport . i had the chicken and waffles and it was delicious ! the service was fast and friendly . i will definitely be back ! 
0.4  stopped in for a quick bite before heading out to the airport . i had the chicken and waffles and it was delicious ! the service was friendly and fast . i ’ll be back ! 
0.5  my husband and i stopped in for a quick bite before heading out to the airport . we were seated right away and we were seated right away . our server was very friendly and helpful . the food was pretty good and the service was great . 
0.6  my husband and i stopped in for a quick bite before heading out to the airport . we were seated right 
0.7  this was my first time here and i will definitely be back . the service was fast and friendly and the food was delicious . i ’ll be back . 
0.8  this was my first time here and i will definitely be back . the service was good , the food was good , and the prices were reasonable . i ’ll be back . 
0.9  this place was pretty good . i had the chicken and waffles and it was pretty good . i ’d definitely go back . 
1.0  this place was pretty good . i had the pulled pork sandwich and it was pretty good , but nothing special . the fries were pretty good though . 
A.2 Dialog response generation
Detailed evaluation metrics used in the dialog-response-generation task:
(i) Sentence-level BLEU, which compares n-grams in the candidate (generated) sentences against n-grams in the reference text. To compute the score, the setting is identical to that of WAE-GMP (Gu et al., 2018): 10 responses (candidates) are sampled from the model for each test context, i.e., the number of samples is set to 10 in Algorithm 2. The precision and recall of BLEU are defined in Zhao et al. (2017c).
(ii) BOW Embedding, which calculates the cosine similarity of bag-of-words embeddings between the candidate and the reference. We adopt three metrics to compute the similarity: greedy (Rus and Lintean, 2012), average (Mitchell and Lapata, 2008), and extreme (Forgues et al., 2014).
(iii) Distinct, which measures the diversity of the generated responses. dist-n is defined as the ratio of unique n-grams (n = 1, 2) over all n-grams in the generated responses. As multiple responses are sampled from the models, we define intra-dist as the average of the distinct values within each sampled response, and inter-dist as the distinct value among all sampled responses.
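The sentence-level BLEU in (i) can be illustrated with a minimal sketch. The helper below is a simplified, hypothetical implementation (clipped n-gram precision up to bigrams plus a brevity penalty, with a small floor in place of proper smoothing); the actual evaluation follows the precision/recall definitions of Zhao et al. (2017c) and would typically use a smoothed BLEU from a standard library.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # floor avoids log(0)
    # brevity penalty discourages overly short candidates
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical candidate scores 1.0, while a candidate sharing no n-grams with the reference scores near 0.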
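The dist-n, intra-dist, and inter-dist quantities in (iii) follow directly from their definitions; the function names below are illustrative, not taken from the paper's code.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(gram_list):
    """Ratio of unique n-grams to total n-grams."""
    return len(set(gram_list)) / len(gram_list) if gram_list else 0.0

def intra_dist(responses, n):
    """Average of the per-response distinct ratios."""
    return sum(dist_n(ngrams(r, n)) for r in responses) / len(responses)

def inter_dist(responses, n):
    """Distinct ratio over the pooled n-grams of all sampled responses."""
    return dist_n([g for r in responses for g in ngrams(r, n)])
```

For example, two identical sampled responses each have perfect intra-dist, but their pooled n-grams repeat, so inter-dist drops to 0.5, capturing the lack of diversity across samples.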
Model     |  BLEU (R / P / F1)      |  BOW Embedding (A / E / G)  |  intra-dist (dist-1 / dist-2)  |  inter-dist (dist-1 / dist-2)
SeqGAN    |  0.270 / 0.270 / 0.270  |  0.907 / 0.495 / 0.774      |  0.747 / 0.806                 |  0.075 / 0.081
CVAE      |  0.265 / 0.222 / 0.242  |  0.923 / 0.543 / 0.811      |  0.938 / 0.973                 |  0.177 / 0.222
CVAE-BOW  |  0.256 / 0.224 / 0.239  |  0.923 / 0.540 / 0.812      |  0.949 / 0.976                 |  0.165 / 0.206
VHRED     |  0.271 / 0.260 / 0.265  |  0.892 / 0.507 / 0.786      |  0.633 / 0.711                 |  0.071 / 0.089
WAE-GMP   |  0.372 / 0.286 / 0.323  |  0.952 / 0.591 / 0.853      |  0.754 / 0.892                 |  0.313 / 0.597
DI-VAE    |  0.323 / 0.190 / 0.239  |  0.874 / 0.600 / 0.814      |  0.947 / 0.963                 |  0.500 / 0.718
DB-VAE    |  0.373 / 0.276 / 0.317  |  0.944 / 0.615 / 0.839      |  0.954 / 0.997                 |  0.467 / 0.787
(R = recall, P = precision; A = average, E = extreme, G = greedy BOW-embedding metrics.)