Discretized Bottleneck in VAE: Posterior-Collapse-Free Sequence-to-Sequence Learning

by   fcq, et al.
Duke University
University at Buffalo

Variational autoencoders (VAEs) are important tools in end-to-end representation learning. VAEs can capture complex data distributions and have been applied extensively in many natural-language-processing (NLP) tasks. However, a common pitfall in sequence-to-sequence learning with VAEs is the posterior-collapse issue in latent space, wherein the model tends to ignore latent variables when a strong auto-regressive decoder is implemented. In this paper, we propose a principled approach to eliminate this issue by applying a discretized bottleneck in the latent space. Specifically, we impose a shared discrete latent space where each input learns to choose a combination of shared latent atoms as its latent representation. Compared with VAEs employing continuous latent variables, our model is better suited to modeling the underlying semantics of discrete sequences and can thus provide more interpretable latent structures. Empirically, we demonstrate the efficiency and effectiveness of our model on a broad range of tasks, including language modeling, unaligned text style transfer, dialog response generation, and neural machine translation.



1 Introduction

Auto-encoder models are widely used in various NLP tasks, such as machine translation (Bahdanau et al., 2014; Cho et al., 2014; Sutskever et al., 2014; Luong et al., 2015; Vaswani et al., 2017; Devlin et al., 2018) and dialog response generation (Vinyals and Le, 2015; Olabiyi and Mueller, 2019). Generally speaking, an auto-encoder model learns a function to map each input to a latent representation and then back to the original data space.

Unlike auto-encoders, VAEs aim to learn a probability distribution over a dataset, from which one can generate new instances that look similar to the original data. With such a generative model, one can easily draw samples from the distribution following a decoding scheme. VAEs have achieved tremendous success in generating high-quality images, videos, and speech (van den Oord et al., 2017; Razavi et al., 2019). At the same time, VAEs have also been applied in NLP to improve traditional maximum-likelihood-estimation (MLE) based models, achieving impressive progress in language modeling (Bowman et al., 2015; Fabius and van Amersfoort, 2014; Miao et al., 2016; Yang et al., 2017), controllable text generation (Hu et al., 2017), neural machine translation (Shah and Barber, 2018), and many other applications.

Despite this success, a well-known pitfall of VAEs, especially in sequence-to-sequence (Seq2Seq) modeling, is a phenomenon called latent variable collapse (or posterior collapse) (Bowman et al., 2015), where the encoder yields meaningless posteriors that collapse to the prior. With this pitfall, VAEs usually fail to learn meaningful representations of individual data samples. Several attempts have been made to alleviate this issue (Bowman et al., 2015; Hoffman and Johnson, 2016; Sønderby et al., 2016; Kingma et al., 2016; Chen et al., 2016; Zhao et al., 2017b; Yeung et al., 2017; Alemi et al., 2017; Dieng et al., 2018; Fu et al., 2019; He et al., 2019; Fang et al., 2019); however, most of these approaches are heuristic in nature.

Our solution is motivated by two possible explanations of posterior collapse. First, recent research shows that the prior plays an important role in density estimation (Hoffman and Johnson, 2016; Takahashi et al., 2019). Although Gaussian priors and posteriors are widely adopted, such simplified priors tend to incur latent variable collapse because of poor density estimation. To overcome this issue, we argue that a flexible prior should be learned simultaneously during training. In this way, even if one encounters posterior collapse while learning, the collapsed variational distribution is still meaningful. Second, related work has shown that posterior collapse is caused by a lack of good latent codes (Fu et al., 2019). Thus, designing an effective way of learning useful representations without supervision is the key to addressing the problem.

In this paper, based on the above two arguments, we propose to enforce a discrete latent space for VAEs. The discrete space consists of learnable atoms that are shared by all data inputs. The discrete latent space brings at least three benefits: (i) the atoms of a discrete prior can be learned efficiently during training; (ii) the discrete nature of the prior keeps the KL-divergence between the prior and the variational distribution from vanishing, so the model is free of posterior collapse; (iii) the discrete VAE is formulated following the standard VAE setting, making learning and inference particularly efficient.

The contributions of our paper are summarized as follows:

  • We propose the concept of discretized bottleneck VAEs for RNN-based Seq2Seq models, which can overcome the posterior-collapse problem, a long-standing issue that needs to be well addressed in NLP applications.

  • We showcase how to inject the discretized bottleneck in Seq2Seq models on a variety of NLP tasks. When the model and the training strategy are carefully managed, our DB-VAE can accurately model discrete text without sacrificing reliance on latent representations or experiencing posterior collapse. We also find that under our framework, the discrete bottleneck can capture more sentence-level semantic features.

  • Inference of the proposed DB-VAE requires a nearest-neighbor (NN) search for the discrete atoms in a latent space. We extend NN to the k-NN setting and show that it can provide more correct translations given one source text, thus increasing the BLEU score. The method is referred to as top-k search. Naturally, it can also provide diverse responses in the dialog response generation task.

2 Preliminaries

2.1 Variational Autoencoder

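VAEs consist of two parts, an encoder (inference network) and a decoder (generative network). The decoder corresponds to the following generative process for an input $x$:

$$z \sim p(z), \qquad x \sim p_\theta(x \mid z), \tag{1}$$

where $p(z)$ is a pre-defined prior distribution and $p_\theta(x \mid z)$ is a conditional distribution (likelihood) induced by a decoder. To learn the parameters $\theta$, one typically maximizes the following marginal log-likelihood:

$$\log p_\theta(x) = \log \int p(z)\, p_\theta(x \mid z)\, dz. \tag{2}$$

Direct optimization of the log-likelihood is usually intractable. VAEs instead parameterize a family of variational distributions $q_\phi(z \mid x)$ (often known as an encoder) to approximate the true posterior $p_\theta(z \mid x)$, ending up optimizing the following evidence lower bound (ELBO):

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \le \log p_\theta(x). \tag{3}$$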
2.2 Posterior collapse

In neural language models, both the encoder and the decoder are often parameterized by strong auto-regressive neural networks, e.g., LSTM and GRU. Given an input $x = (x_1, \dots, x_T)$, every token is fully conditioned on all previous tokens:

$$p_\theta(x \mid z) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, z). \tag{4}$$

An issue with VAEs is that $p_\theta(x \mid z)$ is defined in a very flexible manner that allows pushing the KL term towards zero, leading to posterior collapse, in which the model learns meaningless latent codes (Bowman et al., 2015; Kingma et al., 2016; Chen et al., 2016).
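To make the collapse condition concrete, the following minimal sketch (an illustration, not the authors' code) evaluates the closed-form KL term for the common choice of a diagonal Gaussian posterior and a standard normal prior. The function name `gaussian_kl` and the NumPy setting are assumptions made here for illustration only.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var))

# A posterior that matches the prior exactly has zero KL: the latent code
# then carries no information about the input (posterior collapse).
collapsed = gaussian_kl(mu=[0.0, 0.0], log_var=[0.0, 0.0])

# A posterior whose mean depends on the input pays a positive KL cost.
informative = gaussian_kl(mu=[1.0, 0.0], log_var=[0.0, 0.0])
```

With a strong auto-regressive decoder, the reconstruction term can be optimized without the latent code, so the optimizer is free to drive the KL term to the collapsed value.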

3 Discretized Bottleneck in VAE

3.1 Model

Our proposed model is general and can be applied to most existing Seq2Seq models. Without loss of generality, we will describe our framework under the setting of an RNN-based language model. As shown in Figure 1, our model consists of three parts, an encoder, a latent code generator, and a decoder.

Figure 1: The graphical illustration of the proposed model


Let an input sequence be defined as $x = (x_1, \dots, x_T)$. The encoder aims at encoding the input token at each time step into a latent representation. This is implemented by feeding the input sequence to an LSTM encoder, resulting in

$$h_t = \mathrm{LSTM}_{\mathrm{enc}}(e_{x_t}, h_{t-1}), \tag{5}$$

where $e_{x_t}$ is the word embedding vector of the word $x_t$. The latent representation of the input sequence is $H = (h_1, \dots, h_T)$.

Latent code generation

Different from the vanilla VAE mechanism, we define a latent code $z$ to be a combination of a set of latent codes from a global codebook $C = \{c_k\}_{k=1}^{K} \in \mathbb{R}^{K \times d}$, where $K$ is the codebook size and $d$ is the latent embedding dimension. Specifically, the idea is to associate each $h_t$ with one $c_{\nu(t)}$, where $\nu(\cdot)$ is an index mapping function that maps the index $t$ to another index in $\{1, \dots, K\}$ (defined below). The latent code $z$ for input $x$ is then defined via an aggregation function (we define it as a “mean function” for simplicity):

$$z = \frac{1}{T} \sum_{t=1}^{T} c_{\nu(t)}. \tag{6}$$

There are several ways to define the index mapping $\nu(\cdot)$. We adopt the idea of nearest neighbor to define $\nu(t)$ by choosing the code from $C$ that is closest to $h_t$ after a linear transformation. Formally, let $\tilde{h}_t = W h_t$ with learnable parameters $W$; we define $\nu(t)$ as

$$\nu(t) = \arg\min_{k} \| c_k - \tilde{h}_t \|_2. \tag{7}$$


Based on the above construction, it is easy to see that given the codebook $C$, the final latent code of the input $x$ can be formulated as a discrete distribution, i.e.,

$$q_\phi(z \mid x) = \delta_{\bar{z}}(z), \tag{8}$$

where $\bar{z} = \frac{1}{T} \sum_{t=1}^{T} c_{\nu(t)}$; and $\delta_{\bar{z}}$ is a delta function with point mass at $\bar{z}$. With such a construction, one can easily check that the KL-divergence between $q_\phi(z \mid x)$ and a prior distribution $p(z)$ from the generative model (usually set to be a simple Gaussian or uniform distribution) can be calculated as

$$\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) = -\log p(\bar{z}). \tag{9}$$

Remark 1

When the global codebook is optimized during training so that most of the codes are informative and lie within the low-density regions of the prior $p(z)$, the KL-divergence is always larger than zero, which effectively prevents posterior collapse.
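The latent code construction above (linear map, nearest-neighbor lookup, then mean aggregation) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `quantize_sequence`, the array shapes, and the use of squared Euclidean distance (which yields the same nearest atom as the Euclidean distance) are choices made here.

```python
import numpy as np

def quantize_sequence(H, W, C):
    """Map encoder states to codebook atoms and aggregate them.

    H: (T, h) encoder hidden states, one row per time step.
    W: (d, h) linear map applied before the lookup (assumed shape).
    C: (K, d) codebook with K atoms.
    Returns the chosen atom index per time step and the mean latent code.
    """
    H = np.asarray(H, dtype=float)
    W = np.asarray(W, dtype=float)
    C = np.asarray(C, dtype=float)
    H_tilde = H @ W.T                                   # (T, d)
    # Squared distance between every transformed state and every atom: (T, K).
    dists = ((H_tilde[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)                          # nearest atom per step
    z = C[idx].mean(axis=0)                             # mean aggregation
    return idx, z

# Toy example: two atoms, three time steps, identity transformation.
C_demo = np.array([[0.0, 0.0], [10.0, 10.0]])
H_demo = np.array([[1.0, 1.0], [9.0, 9.0], [0.5, 0.0]])
idx_demo, z_demo = quantize_sequence(H_demo, np.eye(2), C_demo)
```

In the toy example the first and third states select atom 0 and the second selects atom 1, so the latent code is the average of those three atoms.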


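Similar to the encoder, we parameterize the decoder with another LSTM. The target hidden state $s_t$ can be progressively calculated as

$$s_t = \mathrm{LSTM}_{\mathrm{dec}}(e_{x_{t-1}}, s_{t-1}, z). \tag{10}$$

Finally, we calculate the output distribution over the entire vocabulary at time $t$ as

$$p_\theta(x_t \mid x_{<t}, z) = \mathrm{softmax}(W_s s_t). \tag{11}$$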
3.2 Training

Learning DB-VAE is divided into two parts: 1) learning the encoder and decoder; and 2) learning the global codebook.

Learning the encoder and decoder

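Our proposed DB-VAE model belongs to the general VAE framework, with a special form of the variational distribution defined in (8). As a result, the encoder and decoder can be learned by optimizing the evidence lower bound (ELBO):

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{x \sim p_{\mathcal{D}}(x)}\Big[\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)\Big], \tag{12}$$

where $p_{\mathcal{D}}(x)$ denotes the training data distribution; and the KL term is evaluated following (9).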

Learning the codebook

Directly optimizing the codebook $C$ with the above ELBO is infeasible because gradients cannot propagate back to the codebook through the non-differentiable $\arg\min$ operator defined in (7). To this end, we follow (van den Oord et al., 2017) and define a new objective for updating the codebook. The key observation is that the codebook only appears in (7); thus the goal is to update the codebook so that the distance between a latent code and the corresponding codebook atom is minimal. Specifically, the loss is defined as

$$\mathcal{L}_{\mathrm{code}} = \sum_{t=1}^{T} \Big( \big\| \mathrm{sg}(\tilde{h}_t) - c_{\nu(t)} \big\|_2^2 + \beta \big\| \tilde{h}_t - \mathrm{sg}(c_{\nu(t)}) \big\|_2^2 \Big), \tag{13}$$

where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator, used to avoid complicated gradient flows and stabilize the training; and $\beta$ is a constant to balance the two terms.
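The role of the stop-gradient operator can be illustrated by writing out the two gradients by hand. The sketch below is an assumption-laden illustration in NumPy, not the authors' code: the function name `codebook_loss_and_grads`, the summation over time steps, and the default `beta=0.25` (the value commonly used with the vector-quantization objective of van den Oord et al.) are choices made here. In the forward pass the stop-gradient is the identity, so both terms have the same value; they differ only in which parameters receive a gradient.

```python
import numpy as np

def codebook_loss_and_grads(H_tilde, C, idx, beta=0.25):
    """Forward value and hand-derived gradients of a VQ-style codebook loss.

    H_tilde: (T, d) transformed encoder states.
    C:       (K, d) codebook.
    idx:     (T,) index of the atom assigned to each time step.
    """
    H_tilde = np.asarray(H_tilde, dtype=float)
    C = np.asarray(C, dtype=float)
    idx = np.asarray(idx)
    diff = H_tilde - C[idx]                     # (T, d)
    sq = (diff ** 2).sum(axis=1)                # squared distance per step
    loss = (1.0 + beta) * sq.sum()
    # First term: encoder output is frozen, so only the atoms move.
    grad_C = np.zeros_like(C)
    np.add.at(grad_C, idx, -2.0 * diff)         # accumulates repeated indices
    # Second term: atoms are frozen, so only the encoder output moves.
    grad_H = 2.0 * beta * diff
    return loss, grad_C, grad_H

C_demo = np.array([[0.0, 0.0], [4.0, 4.0]])
H_demo = np.array([[1.0, 0.0], [3.0, 4.0]])
loss_demo, grad_C_demo, grad_H_demo = codebook_loss_and_grads(H_demo, C_demo, [0, 1])
```

A gradient-descent step on `grad_C` pulls each selected atom towards the states assigned to it, while a step on `grad_H` commits the encoder output to its atom.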

The overall algorithm

The full training algorithm is summarized in Algorithm 1. We find that it is important to balance learning the encoder-decoder against learning the codebook. At the beginning, if the codebook does not learn as fast as the encoder, the utilization rate of the codebook will be low, which hinders codebook learning; e.g., most of the input samples only use a limited number of atoms of the codebook. To overcome this issue, we add a strike-through pretraining step, where the decoder is fed with the latent codes directly from the encoder. This ensures that reasonable gradients can be passed through the latent space and the encoder. In the following, we apply the superscript “$(n)$” to a variable (or function) to denote its dependency on the $n$-th input sample, e.g., $\nu^{(n)}(t)$. To determine whether one should perform a pretraining step, we define a perplexity score to monitor the utilization of the codebook:

$$\mathrm{perplexity} = \exp\Big(-\sum_{k=1}^{K} \bar{p}_k \log \bar{p}_k\Big), \qquad \bar{p} = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \mathbf{1}_{\nu^{(n)}(t)}, \tag{14}$$

where $\mathbf{1}_k$ denotes an all-zero vector except the $k$-th bit, which is set to 1. The value is large when the elements in $\bar{p}$ are close to uniform. Thus it can be used to indicate the utilization rate of the codebook.
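As a rough illustration of such a utilization score, the sketch below computes the exponentiated entropy of the empirical atom-usage distribution over a batch. The function name `codebook_perplexity` and its arguments are hypothetical and introduced only for this example; this is the usual perplexity used to monitor vector-quantized codebooks, and it is assumed here to match the score described above.

```python
import numpy as np

def codebook_perplexity(indices, codebook_size):
    """Exponentiated entropy of the empirical atom-usage distribution.

    indices: integer atom indices chosen for all time steps of all samples
             in a batch (any shape; flattened internally).
    Ranges from 1 (a single atom is used) to codebook_size (uniform usage).
    """
    flat = np.asarray(indices).ravel()
    counts = np.bincount(flat, minlength=codebook_size)
    probs = counts / counts.sum()
    nonzero = probs > 0                       # treat 0 * log 0 as 0
    entropy = -np.sum(probs[nonzero] * np.log(probs[nonzero]))
    return float(np.exp(entropy))

uniform_usage = codebook_perplexity([0, 1, 2, 3], codebook_size=4)
collapsed_usage = codebook_perplexity([2, 2, 2, 2], codebook_size=4)
```

A score close to the codebook size indicates balanced usage, whereas a score close to 1 indicates that most inputs select the same few atoms.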

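Algorithm 1 DB-VAE training
Require: encoder $q_\phi$, decoder $p_\theta$, codebook $C$, threshold $\tau$ and batch size $N$
Step 1: Strike-through pretraining
while perplexity $< \tau$ do
        Optimize the ELBO to train $q_\phi$ and $p_\theta$
        Optimize $\mathcal{L}_{\mathrm{code}}$ to learn $C$
Step 2: Joint training
while not done do
        Optimize $\mathcal{L}_{\mathrm{code}}$ to learn $C$
        Backprop $-\mathrm{ELBO} + \mathcal{L}_{\mathrm{code}}$ to train $q_\phi$ and $p_\theta$

Algorithm 2 top-k NN search extension
Require: encoder $q_\phi$, decoder $p_\theta$, codebook $C$
while not done do
        Find k-NN instead of 1-NN as in Eq. (7) to calculate $z$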

Extension: top-k NN search

In our construction of a latent code, we search for the nearest code in the codebook via the index mapping defined in (7). Such a construction imposes a limitation: a hidden state from the LSTM corresponds to only one atom of the codebook. This scheme, however, does not fit real applications well. For example, in neural machine translation, one source sentence (one hidden state) can correspond to multiple correct translations (multiple atoms); and in dialog response generation, a good model should be able to generate multiple relevant and diverse responses when the same context is given. Furthermore, when a VAE is well trained, input texts with similar semantics should be mapped to nearby clusters in the latent space (see Section 5.1). As a result, we propose a generalization that extends the 1-NN search to a k-NN search when searching the codebook to construct latent codes. In other words, Eq. (7) returns a set of $k$ indexes, corresponding to the $k$ nearest codebook atoms. These atoms are then averaged over the whole sequence to generate the final latent code, as in Eq. (6). The corresponding algorithm is summarized in Algorithm 2.
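One possible reading of the top-k extension is sketched below: for every time step the k nearest atoms are retrieved, and the j-th candidate latent code averages the j-th nearest atom of each time step, so that k candidate codes can be decoded into k outputs. This interpretation, the function name `topk_latent_codes`, and the array shapes are assumptions made for illustration; the text does not pin down how the k atoms are combined.

```python
import numpy as np

def topk_latent_codes(H_tilde, C, k):
    """Return k candidate latent codes built from the k nearest atoms.

    H_tilde: (T, d) transformed encoder states.
    C:       (K, d) codebook.
    Returns (topk_idx, Z) where topk_idx is (T, k), nearest first, and
    Z is (k, d): candidate j averages the j-th nearest atom over time.
    """
    H_tilde = np.asarray(H_tilde, dtype=float)
    C = np.asarray(C, dtype=float)
    dists = ((H_tilde[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    topk_idx = np.argsort(dists, axis=1)[:, :k]                        # (T, k)
    Z = C[topk_idx].mean(axis=0)                                       # (k, d)
    return topk_idx, Z

# Single hidden state (as when only the final encoder state is quantized).
C_demo = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
idx_demo, Z_demo = topk_latent_codes([[0.9, 0.0]], C_demo, k=2)
```

With `k=1` this reduces to the nearest-neighbor lookup; larger `k` trades faithfulness to the closest atom for diversity among the decoded candidates.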

4 Related Work on Posterior Collapse

Several attempts have been made to alleviate the posterior-collapse issue. Among them, perhaps the simplest solution is KL cost annealing, where the weight of the KL penalty term is scheduled to gradually increase during training (Bowman et al., 2015). Later, Fu et al. (2019) propose a cyclical annealing schedule, which allows progressive learning of more meaningful latent codes by leveraging informative representations of previous cycles as warm restarts. These approaches manually encourage the use of latent codes, but might hurt a model’s density approximation ability, as pointed out in He et al. (2019). Our method differs from these methods in that it maintains a model’s representation power while learning an informative latent space.
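The two annealing ideas can be sketched as simple weight schedules for the KL term. The functions below are illustrative only: the names, the linear ramp shape, and the `ramp_fraction` parameter are assumptions made here rather than the exact schedules used in the cited works.

```python
def linear_kl_weight(step, warmup_steps):
    """Monotonic annealing: the KL weight grows linearly from 0 to 1."""
    return min(1.0, step / warmup_steps)

def cyclical_kl_weight(step, cycle_length, ramp_fraction=0.5):
    """Cyclical annealing: the linear ramp restarts at every cycle.

    Within a cycle the weight rises from 0 to 1 over the first
    ramp_fraction of the cycle and then stays at 1 until the cycle ends.
    """
    position = (step % cycle_length) / cycle_length
    return min(1.0, position / ramp_fraction)

# The weighted objective would then be: reconstruction + weight * KL.
weights = [cyclical_kl_weight(s, cycle_length=100) for s in (0, 25, 50, 75, 100)]
```

In both cases the weight only rescales the KL penalty during training; the model and the prior are unchanged, which is the sense in which these methods encourage the use of latent codes by hand.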

Other solutions include weakening the capacity of the generative network or enhancing the inference network. Yang et al. (2017) propose the use of a dilated CNN as the decoder in a VAE, controlling the size of the context from previously generated words. Kim et al. (2018) propose a semi-amortized approach that uses stochastic variational inference to iteratively refine an inference network. This method, however, is expensive to train. Similarly, He et al. (2019) propose a simple yet effective training algorithm that aggressively optimizes the inference network with more updates. Other threads of solutions introduce more complicated priors in the latent space (Tomczak and Welling, 2017; Xu and Durrett, 2018). Makhzani et al. (2015); Joulin et al. (2016) further replace the KL regularizer with an adversarial regularizer. Our work outperforms these methods without adding training burden.

In the case of discrete representations in VAE, the most related work is (Zhao et al., 2018). It applies the Gumbel-Softmax trick Jang et al. (2016) to train discrete variables, resulting in effective and interpretable dialog generation. Our approach has wider applicability and is ready to be extended to more NLP tasks. Other approaches combine vector quantization and the Transformer model Kaiser et al. (2018); Roy et al. (2018). These approaches have primarily focused on non-autoregressive neural machine translation, which did not investigate the posterior collapse issue in sequential variational inference.

5 Experiments

We conduct extensive experiments to demonstrate the effectiveness and efficiency of the proposed DB-VAE on various language processing tasks, including language modeling (LM), unaligned text-style transfer, dialog-response generation and neural machine translation (NMT). In addition, we also evaluate how the codebook size K will affect a model’s performance. Code for reproducing these results will be made publicly available.

5.1 Language modeling

Following Yang et al. (2017), we evaluate our model for language modeling mainly on two large-scale document corpora, Yahoo and Yelp. Detailed statistics of the two datasets are given in Table 7 in the Supplementary Material (SM) A.1. We first use a simple synthetic dataset (He et al., 2019) consisting of 16k training sentences and 4k testing sentences to evaluate how the codebook size affects the model’s performance.

Figure 2: Learning curves of VAE, Lag-VAE and DB-VAE on Yahoo.

The impact of codebook size

We first investigate the impact of codebook size K on the model’s behavior. The learning curves with different are shown in Figure 7 of the SM A.1. Because validation ’s are very close when , we adopt in all our experiments (a trade-off between memory and performance) unless explicitly declared.

Baseline and training details

Four representative LM models are chosen as baselines, including LSTM-LM, the standard VAE, SA-VAE (Kim et al., 2018) and Lag-VAE (He et al., 2019), the current state-of-the-art. For fair comparisons, both the recognition network and generative network are implemented as a 1-layer LSTM with 1024 hidden units for all models. The word embedding dimension is set to 1024 and the latent dimension to 32. The SGD optimizer with the same setting is applied to all models. The latent variable is used to initialize the hidden state of the decoder and fed as additional input at each time step.

LM results

The results in terms of reconstruction error, perplexity and training time are shown in Table 1 and in Figure 2. As expected, our model achieves the best reconstruction error and perplexity on both corpora due to the flexibility of the discrete variational distribution, which makes the model free of posterior collapse. Remarkably, our model runs almost as fast as the standard VAE. The faster convergence of Lag-VAE at the beginning is because it aggressively trains the encoder, where approximately

more data are used to train the encoder in one epoch.

Models      Rec(KL)   Rec-PPL   Time
Yelp corpus
LSTM-LM     358.1     40.64     -
VAE         357.9     40.56     5.4
SA-VAE      357.5     40.39     56.3
Lag-VAE     351.4     37.92     20.3
DB-VAE      349.7     37.26     5.4
Yahoo corpus
LSTM-LM     328.0     60.75     -
VAE         328.6     61.21     6.9
SA-VAE      329.1     61.59     69.2
Lag-VAE     322.6     56.78     15.30
DB-VAE      320.4     55.24     7.0

Table 1: Performance comparisons on language modeling on the Yelp and Yahoo corpus.

Generated intermediate sentences
0.0 i had a great experience with the staff and the staff was very friendly and helpful ! i will definitely…
0.2 stopped in for a quick bite before heading out to the airport . i would definitely recommend…
0.4 stopped in for a quick bite before heading out to the airport . the service was fast and friendly…
0.6 my husband and i stopped in for a quick bite before heading out to the airport . we were seated…
0.8 this was my first time here and i will definitely be back . the service was good , the food was good…
1.0 this place was pretty good . i had the pulled pork sandwich and it was pretty good , but nothing…

Table 2: Interpolating between latent codes

Latent space visualization

For better understanding, we visualize the latent representations of the whole dataset using a t-SNE projection (Maaten and Hinton, 2008) in Figure 3. Our model learns a much smoother and more separable transition from 0-star to 4-star reviews. To visualize the codebook utilization, we also compute the perplexity score (Eq. 14) on a random batch of testing data after each training epoch. As shown in Figure 4, the usage of the codebook becomes more balanced as training goes on.

Figure 3: t-SNE embeddings of latent space on Yelp corpus. Left: Lag-VAE, Right: DB-VAE. 0-4 represents the review score, from negative to positive.
Figure 4: The heatmap of codebook learning on Yelp. The x-axis corresponds to the training epoch, and y-axis corresponds to indices of different codes.

Codebook interpolation

Particularly in text modeling, when performing a convex combination between any two latent codes $z_1$ and $z_2$, the interpolation is equivalent to $z_\lambda = (1 - \lambda) z_1 + \lambda z_2$. Ideally, adjusting $\lambda$ from 0 to 1 will generate a series of sentences, where the sentence decoded from $z_\lambda$ becomes less semantically similar to the sentence corresponding to $z_1$ and more semantically similar to that of $z_2$ (Berthelot et al., 2018). Table 2 shows the generated sentences when $\lambda$ ranges from 0.0 to 1.0 with a step size of 0.2. Indeed, intermediate sentences produced by the proposed model provide a semantically smooth morphing between the two endpoints. More detailed examples are provided in Table 8 in the SM A.
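The interpolation itself is a one-line convex combination; a minimal NumPy sketch is given below. The function name `interpolate_codes` and the convention that a weight of 0 returns the first code and a weight of 1 returns the second are assumptions made for this illustration, and the decoding of each interpolated code into a sentence is not shown.

```python
import numpy as np

def interpolate_codes(z1, z2, num_steps=6):
    """Convex combinations of two latent codes at evenly spaced weights."""
    z1 = np.asarray(z1, dtype=float)
    z2 = np.asarray(z2, dtype=float)
    lambdas = np.linspace(0.0, 1.0, num_steps)       # 0.0, 0.2, ..., 1.0
    codes = np.stack([(1.0 - lam) * z1 + lam * z2 for lam in lambdas])
    return lambdas, codes

lambdas_demo, codes_demo = interpolate_codes([0.0, 0.0], [1.0, 2.0])
```

Each row of `codes_demo` would be passed to the decoder in place of a quantized latent code to produce one intermediate sentence.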

Model     Transfer   BLEU   PPL   RPPL
ARAE      95.0       32.5   6.8   395
iVAE      92.0       36.7   6.2   285
DB-VAE    97.1       40.2   4.8   254

Table 3: Performances on Yelp sentiment transfer

5.2 Unaligned neural text style transfer

Next, we evaluate the proposed model on the unaligned sentiment transfer task on the Yelp dataset. Review ratings above three are considered positive, and those below three are considered negative. Hence, we split the corpus into two sets of unaligned positive reviews (350k) and negative reviews (250k). The goal of the style transfer task is to change the underlying sentiment between positive and negative reviews.

Experiment setup

We denote $y$ as the sentiment attribute and construct a decoder to implement the conditional distribution $p(x \mid z, y)$. Following the setup in (Zhao et al., 2017a; Shen et al., 2017), we train two separate decoders, one for positive reviews, $p(x \mid z, y = 1)$, and the other for negative reviews, $p(x \mid z, y = 0)$. Normally, the latent code $z$ will encode all the semantic and attribute information of the input. In our model, we want the attribute information to be excluded from $z$ and let the decoder learn to produce the transferred reviews. Following (Zhao et al., 2017a), a classifier is introduced to distinguish the latent code’s attribute, and the encoder is adversarially trained to fool the classifier, thus removing the sentiment attribute from the latent space.

Negative → Positive
Input    the staff was very rude as well .
DB-VAE   the staff here is also fantastic .
ARAE     the staff was very friendly .
Input    but , the food is not good .
DB-VAE   but , the food and brews are the best .
ARAE     well, nice atmosphere with a nice selection .
Input    just had a bad experience with a _num_ minutes.
DB-VAE   always a great spot for happy hour or lunch .
ARAE     i love their happy hour .
Positive → Negative
Input    but , it ’s worth it !
DB-VAE   however , it ’s just ok .
ARAE     but , i was so disappointed .
Input    the food is always fresh and tasty .
DB-VAE   the food was n’t good , and not fresh .
ARAE     the food was not good but the food was not very good .
Input    the service was top notch and so was the food .
DB-VAE   the service was slow and the food was very slow.
ARAE     i was told the server was nice but the food was cold .

Table 4: Sentiment transfer results on Yelp


We compare our model with two strong baselines: 1) an adversarially regularized autoencoder (ARAE) Zhao et al. (2017a), which learns the prior via a more expensive and unstable adversarial training; 2) a recently developed implicit deep-latent-variable model (iVAE) Fang et al. (2019) that applies sample-based representations of variational distributions.

Model BLEU BOW Embedding intra-dist inter-dist
R P F1 A E G dist-1 dist-2 dist-1 dist-2
SeqGAN 0.282 0.282 0.282 0.817 0.515 0.748 0.705 0.521 0.070 0.052
CVAE 0.295 0.258 0.275 0.836 0.572 0.846 0.803 0.415 0.112 0.102
CVAE-BOW 0.298 0.272 0.284 0.828 0.555 0.840 0.819 0.493 0.107 0.099
VHRED 0.253 0.231 0.242 0.810 0.531 0.844 0.881 0.522 0.110 0.092
WAE-GMP 0.420 0.258 0.319 0.925 0.661 0.894 0.713 0.671 0.333 0.555
DI-VAE 0.310 0.175 0.224 0.802 0.583 0.862 0.891 0.779 0.489 0.767
DB-VAE 0.386 0.274 0.320 0.925 0.668 0.906 0.905 0.836 0.553 0.808

Table 5: Performance comparison on dialog response generation, Switchboard Dataset

Quantitative metrics

We adopt several quantitative metrics: (i) Transfer: it measures the style transfer accuracy evaluated on an automatic classification model (fastText library (Joulin et al., 2016)); (ii) BLEU: the consistency between the translated candidate and the original reference; (iii) PPL and Reverse PPL (RPPL): PPL measures the fluency of the generated text, and RPPL works in a reverse fashion, which is computed by training an LM on generated data and evaluated on the original data. Mode collapse may be detected by the RPPL value.

Quantitative analysis

Table 3 shows the sentiment transfer results. The proposed method outperforms ARAE in all metrics. On the one hand, the lower PPL and RPPL show that our model preserves the language-modeling advantage already highlighted in Section 5.1. On the other hand, compared with ARAE, the higher transfer accuracy and BLEU score indicate that our model captures more sentiment-related information while keeping the grammatical structure of the original text and the transferred text consistent.

Qualitative results

Some randomly selected examples are given in Table 4. Both ARAE and DB-VAE can successfully transfer the sentiment of the input. However, DB-VAE preserves the content better, which is consistent with the BLEU scores in Table 3.

5.3 Dialog response generation

In this experiment, we follow (Gu et al., 2018) and evaluate the proposed model on two widely-used dialog datasets Switchboard (Godfrey and Holliman, 1997) and DailyDialog Dataset (Li et al., 2017). Responses generated by VAE-based models (Zhao et al., 2017c; Gu et al., 2018) are conditioned on the latent variable. So, this task can examine whether a model can capture a richer latent space and thus generate more diverse, informative and consistent responses.


We compare our model’s performance with six representative baselines for dialog modeling: (i) SeqGAN: a GAN-based model for sequence generation (Yu et al., 2017); (ii) CVAE: a conditional VAE model (Zhao et al., 2017c); (iii) CVAE-BOW: CVAE with a bag-of-words loss (Zhao et al., 2017c); (iv) VHRED: a hierarchical VAE model (Serban et al., 2017); (v) WAE-GMP: a conditional Wasserstein autoencoder with a Gaussian mixture prior network (Gu et al., 2018), which holds the state of the art; (vi) DI-VAE: a discrete VAE (Zhao et al., 2018), which is most related to our work.

Quantitative metrics

Following the evaluation setup in (Gu et al., 2018), three evaluation metrics (see details in A.2) are used: (i) Sentence-level BLEU, which works by matching n-grams in the candidate (generated) sentences against n-grams in the reference text. (ii) BOW Embedding, which calculates the cosine similarity of bag-of-words embeddings between the candidate and the reference. (iii) Distinct, which measures the diversity of the generated responses.

Quantitative analysis

Table 5 and Table 9 show the quantitative results of our model and other strong baselines on Switchboard and DailyDialog. Our model outperforms the baselines in most metrics. Although our method obtains a BLEU score similar to that of WAE-GMP, the inter-dist and intra-dist scores are much higher. In terms of intra-dist, dist-1 and dist-2 on Switchboard are 26.9% and 24.6% higher than WAE-GMP in relative terms (0.905 vs. 0.713 and 0.836 vs. 0.671). This indicates that our model generates fewer repeated n-grams in each response. As for inter-dist, dist-1 and dist-2 are 66.1% and 45.6% higher than WAE-GMP, meaning that our model generates much more diverse responses than WAE-GMP.
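The relative improvements over WAE-GMP can be re-derived from the Switchboard entries of Table 5. The helper below is a small arithmetic check written for this purpose; the function name `relative_gain` is hypothetical.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Switchboard scores from Table 5: (DB-VAE, WAE-GMP).
intra_dist2 = relative_gain(0.836, 0.671)
inter_dist1 = relative_gain(0.553, 0.333)
inter_dist2 = relative_gain(0.808, 0.555)
```

Rounded to one decimal place, these evaluate to 24.6, 66.1 and 45.6 percent, respectively.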

Model                        PPL    BLEU
Variational Attention        6.13   33.41
RNNsearch                    5.72   33.29
RNNsearch w/ top-k (ours)    5.63   33.59

Table 6: Evaluation on NMT, IWSLT14
Figure 5: BLEU score on IWSLT14

5.4 Extension: RNN-based NMT model

We finally evaluate our model with the proposed top-k NN search on the German-English translation task. Our model is built on a baseline RNNsearch architecture (Bahdanau et al., 2014). The recently proposed variational attention model (Deng et al., 2018) is also adopted as a baseline.

We use the IWSLT14 dataset (Cettolo et al., 2014), which is a standard benchmark for experimental NMT models. This dataset contains around 153K, 7K and 7K sentences for training, validation and testing, respectively. The same preprocessing as in (Ott et al., 2018) is applied. As for the architecture, both the encoder and the decoder have one layer, each with 512-dimensional embeddings. For BLEU evaluation, the beam size in beam search is 5. The library Fairseq (Ott et al., 2019) is adopted as the codebase. The codebook size is set to , and only the final hidden state of the encoder passes through the discretized bottleneck.

Results averaged over 5 different runs are reported in Table 6 and in Figure 5. Note that the attention mechanism is used in RNNsearch, where each decoder state has direct access to the encoder states. Although we only discretize the final hidden state of the encoder as formulated in Section 3.1, a notable improvement in PPL and BLEU score is still observed. Following Algorithm 2, as we increase the value of k from 1 to 10, the BLEU score keeps increasing until k reaches 5. The reason might be that the top-5 latent codes have already encoded most source-target combinations. Besides, the BLEU score is as low as 26.1 when we choose the farthest latent code from the codebook instead. These results validate the effectiveness of our proposed top-k inference strategy, which applies to most RNN-based autoencoder models.

6 Conclusion

We propose the DB-VAE, a variant of VAE that uses a discretized bottleneck obtained from a global codebook for latent representations. Our model can potentially overcome the posterior collapse issues in Seq2Seq models. The proposed DB-VAE can provide a good balance between optimization of the inference network and the generative network. Moreover, our DB-VAE can also interpret richer semantic information of discrete structured sequences. Extensive experiments demonstrate the effectiveness of the proposed approach. DB-VAE is flexible enough to be extended to other NLP models such as the Transformer and BERT, which are left as interesting future work.


Appendix A Supplemental Material

A.1 Language modeling

Corpus   #vocabulary   #sentences   avg.length
Yahoo    20001         10000        80
Yelp     19997         10000        97

Table 7: Statistics of LM datasets
Figure 6: Learning curves of VAE, Lag-VAE and DB-VAE on Yelp
Figure 7: Learning curves with different codebook size K
Figure 8: t-SNE embeddings of latent space on Yelp corpus. Top: Lag-VAE, Bottom: DB-VAE. 0-4 represents the review score, from negative to positive.
Generated intermediate sentences
0.0 had a great experience at this place ! i had a great experience with the staff and the staff was very friendly and helpful ! i had a great experience and i will definitely be back !
0.1 had a great experience here ! the staff was very friendly and helpful ! i had a great time and i will definitely be back !
0.2 stopped in for a quick bite before heading out to the airport . i had the chicken and waffles and it was delicious ! i would definitely recommend this place to anyone looking for a great breakfast !
0.3 stopped in for a quick bite before heading out to the airport . i had the chicken and waffles and it was delicious ! the service was fast and friendly . i will definitely be back !
0.4 stopped in for a quick bite before heading out to the airport . i had the chicken and waffles and it was delicious ! the service was friendly and fast . i ’ll be back !
0.5 my husband and i stopped in for a quick bite before heading out to the airport . we were seated right away and we were seated right away . our server was very friendly and helpful . the food was pretty good and the service was great .
0.6 my husband and i stopped in for a quick bite before heading out to the airport . we were seated right
0.7 this was my first time here and i will definitely be back . the service was fast and friendly and the food was delicious . i ’ll be back .
0.8 this was my first time here and i will definitely be back . the service was good , the food was good , and the prices were reasonable . i ’ll be back .
0.9 this place was pretty good . i had the chicken and waffles and it was pretty good . i ’d definitely go back .
1.0 this place was pretty good . i had the pulled pork sandwich and it was pretty good , but nothing special . the fries were pretty good though .
Table 8: Detailed interpolation results

A.2 Dialog response generation

Detailed evaluation metrics used in dialog-response-generation task:
(i) Sentence-level BLEU, which works by matching n-grams in the candidate (generated) sentences against n-grams in the reference text. To compute the score, the setting is identical to that of WAE-GMP (Gu et al., 2018), where 10 responses (candidates) are sampled from the models for each test context; accordingly, k is set to 10 in Algorithm 2. The precision and recall of BLEU are defined in (Zhao et al., 2017c).
(ii) BOW Embedding, which calculates the cosine similarity of bag-of-words embedding between the candidate and the reference. We adopt three metrics here to compute the similarity, greedy (Rus and Lintean, 2012), average (Mitchell and Lapata, 2008) and extreme (Forgues et al., 2014).
(iii) Distinct, which computes the diversity of the generated responses. dist-n is defined as the ratio of unique n-grams (n=1,2) over all n-grams in the generated responses. As multiple responses are sampled from the models, we can define intra-dist as the average of distinct values within each sampled response and inter-dist as the distinct value among all sampled responses.
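The Distinct metrics in (iii) can be sketched directly from their definitions. The code below is a minimal illustration in plain Python and follows the wording above; the function names are hypothetical, responses are assumed to be pre-tokenized lists of strings, and the evaluation scripts used for the reported numbers may differ in details such as tokenization.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(responses, n):
    """Ratio of unique n-grams to all n-grams over a list of responses."""
    all_ngrams = [g for tokens in responses for g in ngrams(tokens, n)]
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)

def intra_dist(responses, n):
    """Average of dist-n computed within each sampled response."""
    return sum(dist_n([tokens], n) for tokens in responses) / len(responses)

def inter_dist(responses, n):
    """dist-n computed across all sampled responses together."""
    return dist_n(responses, n)

# Two sampled responses for the same context (toy tokens).
samples = [["a", "b", "a"], ["a", "b", "c"]]
intra1, inter1 = intra_dist(samples, 1), inter_dist(samples, 1)
intra2, inter2 = intra_dist(samples, 2), inter_dist(samples, 2)
```

A high intra-dist means little repetition inside each response, while a high inter-dist means the sampled responses differ from one another.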

Model BLEU BOW Embedding intra-dist inter-dist
R P F1 A E G dist-1 dist-2 dist-1 dist-2
SeqGAN 0.270 0.270 0.270 0.907 0.495 0.774 0.747 0.806 0.075 0.081
CVAE 0.265 0.222 0.242 0.923 0.543 0.811 0.938 0.973 0.177 0.222
CVAE-BOW 0.256 0.224 0.239 0.923 0.540 0.812 0.949 0.976 0.165 0.206
VHRED 0.271 0.260 0.265 0.892 0.507 0.786 0.633 0.711 0.071 0.089
WAE-GMP 0.372 0.286 0.323 0.952 0.591 0.853 0.754 0.892 0.313 0.597
DI-VAE 0.323 0.190 0.239 0.874 0.600 0.814 0.947 0.963 0.500 0.718
DB-VAE 0.373 0.276 0.317 0.944 0.615 0.839 0.954 0.997 0.467 0.787
Table 9: Performance comparison on dialog response generation, DailyDialog Dataset