Introduction
Neural sequence-to-sequence models with attention have become the de facto methods for machine translation [1, 30]. NMT models require a large amount of parallel data to surpass the quality of phrase-based statistical models, and they are very sensitive to data quality [15]. As a conditional text generation task, machine translation contains both intrinsic uncertainty, where a given sentence usually has multiple valid reference translations, and extrinsic uncertainty, due to noise in the sentence alignment that produces parallel training data [18].
As an option for handling data uncertainty, latent variable models such as variational autoencoders (VAE) have been investigated in language modeling and conditional text generation [16, 35, 34]. However, in contrast to their success when applied to computer vision tasks [14, 22], VAEs in natural language processing suffer from posterior collapse, where the learnt latent code is ignored by the decoder [3].

In this work, we propose to address posterior collapse when using latent variable models in neural machine translation. First, we provide an analysis of the evidence lower bound (ELBO) used in the conditional variational autoencoder (CVAE) commonly used in conditional text generation. Our analysis reveals that optimizing the CVAE's ELBO not only inevitably leads to vanishing divergence of the posterior from the prior during training, but also to decreasing mutual information between latent codes and data. Based on this insight, we propose two modifications of the CVAE's ELBO to address this problem: 1) we explicitly add mutual information back to the training objective in a principled way, and 2) we use a factorized decoder, predicting "bag of words" as an auxiliary decoding distribution to regularize latent variables, finding that the two are complementary. We summarize our contributions as follows:

We improve CVAE by enhancing mutual information between latent variables and data, effectively mitigating posterior collapse in conditional text generation.

We apply the proposed model to neural machine translation with the Transformer architecture. Experiments demonstrate that the latent variables are not ignored even in the presence of the powerful autoregressive decoder. Compared to variational NMT with the CVAE architecture or the non-latent Transformer, the proposed improvements yield improved robustness and data-efficiency.

We extend the proposed model to semi-supervised learning with monolingual data, and show that it achieves superior performance in self-training by effectively learning from source-side monolingual data.
Background
Neural Machine Translation
Problem instances in machine translation are pairs of sequences $(x, y)$, where $x$ and $y$ represent the source and target sentences, respectively. Conventionally, a neural machine translation (NMT) model is a parameterized conditional distribution whose likelihood factors in an autoregressive fashion:

$p(y \mid x; \theta) = \prod_{t=1}^{|y|} p(y_t \mid y_{<t}, x; \theta).$ (1)
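The factorization in Eq. 1 means a hypothesis is scored by summing per-token conditional log-probabilities. A minimal sketch, where the function name and per-step probabilities are illustrative stand-ins for a real decoder's outputs:

```python
import math

def sequence_log_prob(step_probs):
    """step_probs[t] stands in for p(y_t | y_<t, x; theta); the log-likelihood
    of the whole hypothesis is the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in step_probs)

# A 3-token hypothesis whose per-step model probabilities are made up.
score = sequence_log_prob([0.5, 0.8, 0.9])
assert abs(math.exp(score) - 0.5 * 0.8 * 0.9) < 1e-12
```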
The dominant translation paradigm first represents the source sentence as a sequence of contextualized vectors (using the encoder), then decodes this representation token-by-token into a target hypothesis according to the above factorization. The parameters are learned by optimizing the log-likelihood of training pairs with stochastic gradient methods [2]. Decoding the model occurs in a deterministic fashion, using an efficient approximate search like beam search [28]. Recently, the Transformer with multi-head attention has become the state of the art for NMT [30].

Conditional Variational Autoencoder (CVAE)
Our NMT approach extends the conditional variational autoencoder (CVAE) [26], of which variational NMT [35] is a particular case. It introduces a latent variable $z$ to model the conditional distribution:

$p(y \mid x; \theta) = \int_z p(y \mid z, x; \theta)\, p(z \mid x; \theta)\, dz.$ (2)
However, it is intractable to directly marginalize over $z$. Instead, the CVAE objective is to maximize the evidence lower bound (ELBO) of the (log-)likelihood:

$\mathcal{L}_{\mathrm{CVAE}} = \mathbb{E}_{q_\phi(z \mid x, y)}[\log p(y \mid x, z; \theta)] - D_{\mathrm{KL}}(q_\phi(z \mid x, y) \,\|\, p(z \mid x; \theta)),$ (3)

where $D_{\mathrm{KL}}$ represents the Kullback–Leibler (KL) divergence between two distributions. Learning is done by amortized variational inference, where the variational distribution $q_\phi(z \mid x, y)$ is an inference network parameterized by $\phi$.
Posterior Collapse
Posterior collapse can be explained mathematically by analysis of the ELBO objective, as well as from the perspective of a powerful decoder. We consider both in this subsection.
We first provide an analysis of the CVAE's objective and identify its drawback. Recall that our computed loss approximates the loss on the true data distribution $p_D(x, y)$ by using a finite number of samples:

$\mathbb{E}_{p_D(x,y)}[\mathcal{L}_{\mathrm{CVAE}}] \approx \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_{\mathrm{CVAE}}(\theta, \phi; x^{(n)}, y^{(n)}).$ (4)
Thus, the expected KL term is:

$\mathbb{E}_{p_D(x,y)}\big[D_{\mathrm{KL}}(q_\phi(z \mid x, y) \,\|\, p(z \mid x))\big] = D_{\mathrm{KL}}(q_\phi(z) \,\|\, p(z)) + I_{q_\phi}(z; x, y),$ (5)

where $q_\phi(z) = \mathbb{E}_{p_D(x,y)}[q_\phi(z \mid x, y)]$ is the aggregated posterior and $I_{q_\phi}(z; x, y)$ is the mutual information between $z$ and the data under $q_\phi$. The decomposition follows the ELBO surgery of Hoffman and Johnson (2016): multiplying the numerator and denominator inside the expectation by $q_\phi(z)$, defining $p(z \mid x) \triangleq p(z)$ for all training samples (the prior is conditionally independent of $x$), and then factoring yields the aggregated KL and mutual information terms.
As the two resulting terms are non-negative [5], the global minimum of Eq. 5 is $D_{\mathrm{KL}}(q_\phi(z) \,\|\, p(z)) = I_{q_\phi}(z; x, y) = 0$. Unfortunately, at this point, the consequence of the optimization is that $z$ is conditionally independent of the data.
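The decomposition in Eq. 5 can be verified numerically. The sketch below uses made-up discrete distributions (all numbers are illustrative, not the model's) to check that the expected KL equals the aggregated KL plus the mutual information:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as prob lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy setup: two data points with uniform p_D, a two-state latent z, and a
# prior p(z | x) := p(z) that ignores x, as in the text.
q_given = {0: [0.9, 0.1], 1: [0.2, 0.8]}   # q(z | x)
prior = [0.5, 0.5]                          # p(z)
p_data = {0: 0.5, 1: 0.5}                   # p_D(x)

# Aggregated posterior q(z) = E_x[q(z | x)]
q_agg = [sum(p_data[x] * q_given[x][k] for x in q_given) for k in range(2)]

expected_kl = sum(p_data[x] * kl(q_given[x], prior) for x in q_given)
mutual_info = sum(p_data[x] * kl(q_given[x], q_agg) for x in q_given)
agg_kl = kl(q_agg, prior)

# The identity behind Eq. 5: E_x[KL(q(z|x) || p(z))] = KL(q(z) || p(z)) + I_q(z; x).
assert abs(expected_kl - (agg_kl + mutual_info)) < 1e-12
```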
Another explanation of posterior collapse is the "powerful decoder" perspective: an autoregressive model with large capacity comes to approximate a complex distribution without using the latent variables [3, 10]. This is a challenge for NMT, which requires a powerful decoder such as the Transformer with direct attention to the encoder.

Addressing Posterior Collapse
CVAE Guided by Mutual Information
Adding Mutual Information to the ELBO
To combat the optimization dilemma of Eq. 5, we explicitly add the mutual information term $I_{q_\phi}(z; x, y)$ back to the CVAE's ELBO and obtain a new training objective:

$\mathcal{L}_{\mathrm{MICVAE}} = \mathcal{L}_{\mathrm{CVAE}} + I_{q_\phi}(z; x, y) = \mathbb{E}_{q_\phi(z \mid x, y)}[\log p(y \mid x, z)] - D_{\mathrm{KL}}(q_\phi(z) \,\|\, p(z)).$ (6)
The new training objective aims to match the aggregated posterior distribution of the latent variable, $q_\phi(z)$, to the aggregated prior distribution $p(z)$. It can be seen as an extension of InfoVAE [37] to conditional generative models, where we have overcome the mismatch between the (joint) data distribution and the (conditional) log-likelihood objective.
Guiding z to Encode Global Information
Several existing approaches weaken the decoder to encourage the latent variables to be utilized, which is not preferred in practice [3, 9]. Here we propose a different approach: explicitly guiding the information encoded in $z$ without reducing the decoder's capacity.
Inspired by an information-theoretic view of posterior collapse based on Bits-Back Coding theory [31, 11, 4], we add an auxiliary loss for $z$ to encode information which cannot be modelled locally by the autoregressive decoder distribution $p(y_t \mid y_{<t}, x)$. We use bag-of-words (BoW) prediction as the auxiliary loss: it encodes global information while having a non-autoregressive factorization $\prod_t p(y_t \mid z)$. The auxiliary decoder complements the autoregressive decoder (which is locally factorized) by combining predictions at the softmax layer, i.e., $p(y_t \mid y_{<t}, x, z)$ is a mixture of softmaxes [33]:

$p(y_t \mid y_{<t}, x, z) = (1 - \lambda)\, p_{\mathrm{AR}}(y_t \mid y_{<t}, x, z) + \lambda\, p_{\mathrm{BoW}}(y_t \mid z).$ (7)
Thus, the bag-of-words objective regularizes the log-likelihood bound.
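The mixture-of-softmaxes combination can be sketched as follows; the mixture weight and the toy distributions are illustrative, not the paper's values:

```python
def mix_predictions(p_ar, p_bow, lam=0.1):
    """Combine the two decoders at the softmax layer as a convex mixture
    over the vocabulary: (1 - lam) * p_ar + lam * p_bow."""
    return [(1 - lam) * a + lam * b for a, b in zip(p_ar, p_bow)]

p_ar = [0.7, 0.2, 0.1]    # autoregressive decoder's next-token distribution
p_bow = [0.2, 0.5, 0.3]   # BoW decoder's (order-agnostic) distribution
p = mix_predictions(p_ar, p_bow, lam=0.1)
assert abs(sum(p) - 1.0) < 1e-12   # a convex combination stays a distribution
```

Because both components are valid distributions, the mixture is too, so the combined prediction can be trained with the usual cross-entropy loss.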
Architecture
Inference Network
We use discrete latent variables with reparameterization via Gumbel-Softmax [12] to allow back-propagation through discrete sampling. Compared to the multivariate Gaussian distribution commonly used in VAE and CVAE, our parameterization allows us to explicitly account for multiple modes in the data. To make our model more general, we introduce a set of discrete latent variables that are independently sampled from their own inference networks. Specifically, each inference network computes dot-product attention with the encoder outputs:

(8)
We can now sample $z_k$ by the Gumbel-Softmax reparameterization trick [12]:

$z_k = \mathrm{softmax}\!\left(\frac{\log \pi_k + g}{\tau}\right),$ (9)

where $g = -\log(-\log(u)),\ u \sim \mathrm{Uniform}(0, 1)$ is Gumbel noise and $\tau$ is a temperature held fixed in this paper. At inference time, we use a discrete version by directly sampling from the latent variable distribution.
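A minimal sketch of the Gumbel-Softmax sampling step in Eq. 9; the logits, temperature, and seed below are hand-picked for illustration:

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
sample = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5)
assert abs(sum(sample) - 1.0) < 1e-9   # the sample lies on the simplex
# At inference time, a discrete code is taken instead (e.g., the argmax).
```

Lower temperatures push the relaxed sample toward a one-hot vector, while keeping the sampling step differentiable for training.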
BoW Auxiliary Decoder
Given an inferred sample $z$, the BoW decoder predicts all tokens at once without considering their order. We compute the cross-entropy loss for the predicted tokens over the output vocabulary space $V$:

$\mathcal{L}_{\mathrm{BoW}} = -\sum_{v \in V} \hat{p}(v)\, \log p_{\mathrm{BoW}}(v \mid z).$ (10)

We take the empirical distribution $\hat{p}$ to be a token's frequency within a sentence normalized by its total frequency within a mini-batch, mitigating the effect of frequent (stop) words. $p_{\mathrm{BoW}}$ is computed by conditioning on the latent code only, without direct attention to encoder outputs. We use dot-product attention between the latent embeddings and the token embeddings (each of the same dimensionality):

(11)
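The empirical BoW target described above can be sketched as follows; the function names and the toy batch are illustrative, and the renormalization to a proper distribution is an assumption of this sketch:

```python
import math
from collections import Counter

def bow_target(sentence, batch):
    """A token's in-sentence count divided by its mini-batch count,
    renormalized to sum to 1; frequent (stop) words get down-weighted."""
    sent_counts = Counter(sentence)
    batch_counts = Counter(tok for s in batch for tok in s)
    weights = {t: c / batch_counts[t] for t, c in sent_counts.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

def bow_cross_entropy(target, predicted):
    """Cross-entropy of predicted token probabilities under the BoW target."""
    return -sum(p * math.log(predicted[t]) for t, p in target.items())

batch = [["the", "cat", "sat"], ["the", "dog", "ran"]]
target = bow_target(batch[0], batch)
# "the" appears twice in the batch, so it is down-weighted vs. "cat"/"sat".
assert target["the"] < target["cat"]
```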
Training
We train our model using amortized variational inference, where samples $z$ are drawn from the posterior distributions to obtain a Monte Carlo estimate of the gradient. In addition to standard CVAE supervised learning with parallel data, we also extend our model to be jointly trained with monolingual data.
Semi-supervised learning
We apply the same modification to the VAE's ELBO, following Zhao et al. [37]. For joint training with source-side monolingual data, we add the corresponding (modified) VAE ELBO on the source text to the objective (learning to copy the source text has proven useful for low-resource NMT [6]), and for target-side monolingual data we add the analogous term on the target text. The joint objective sums the modified CVAE and VAE objectives:
$\mathcal{L}_{\mathrm{MIVAE}}(x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p(x \mid z)] - D_{\mathrm{KL}}(q_\phi(z) \,\|\, p(z)),$ (12)

$\mathcal{J} = \sum_{(x,y)} \mathcal{L}_{\mathrm{MICVAE}}(x, y) + \sum_{x} \mathcal{L}_{\mathrm{MIVAE}}(x).$ (13)
Algorithm 1 describes the overall training strategy.
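A high-level sketch of one joint update; all names are placeholders standing in for Algorithm 1, which may interleave parallel and monolingual batches differently:

```python
def train_step(parallel_batch, mono_batch, cvae_loss, vae_loss, update):
    """One joint update: the modified CVAE loss on a parallel batch, plus the
    modified VAE loss when a monolingual batch is available."""
    loss = cvae_loss(parallel_batch)      # modified CVAE ELBO on (x, y)
    if mono_batch is not None:
        loss += vae_loss(mono_batch)      # modified VAE ELBO on monolingual text
    update(loss)                          # e.g., backprop + optimizer step
    return loss
```

Because both objectives share the inference network and embeddings, the monolingual term regularizes the same latent space used for translation.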
Experiments
Here we describe our experiments, showing that our techniques have practical value for both mitigating posterior collapse and improving translation quality.
Setup
Datasets
First, we evaluate our models on standard WMT benchmark datasets. Second, we focus on two representative challenges in NMT: lowresource and robustness to noisy data.
 WMT14 German–English

We use data from the WMT14 news translation shared task, which has 3.9M sentence pairs for training with BPE tokenization.
 WMT16 Romanian–English

We use data from the WMT16 news translation shared task. We use the same BPE-preprocessed [24] train, dev and test splits as in Gu et al. (2018), with 608k sentence pairs for training.
 Low resource benchmark (FLoRes) Sinhala–English

We use the same preprocessed data as in Guzmán et al. (2019). There are 646k sentence pairs.
 MT for Noisy Text (MTNT) French–English

We use 30K subword units built jointly from source and target sentences, and only keep sentences with fewer than 100 tokens. For training, there are 34,380 sentence pairs for English–French and 17,616 sentence pairs for French–English [17]. We also used 18,676 monolingual sentences per language from the same data source (Reddit).
Implementation
All of our models are implemented using the Transformer architecture. For WMT14 De–En and WMT16 Ro–En, we use the base configuration [30]: 6 blocks, with 512-dimensional embeddings, 2048-dimensional FFN, and 8 attention heads. For FLoRes (low-resource) and MTNT (both low-resource and noisy), we use a smaller Transformer: 4 layers, 256-dimensional embeddings, 1024-dimensional inner layers, and 4 attention heads. Input and output embeddings are shared between the inference network and the decoder. We use categorical latent variables, each of dimension 16, found by grid search over the validation set. Auxiliary bag-of-words predictions are combined with the decoder prediction with a fixed interpolation weight. All models are optimized using Adam with a weight decay of 0.001 and the same warmup and learning rate schedule as in Ott et al. (2018). All models are trained on 8 Nvidia V100 GPUs with 32K tokens per mini-batch. We train WMT14 De–En with 200k updates and all other models with 100k updates.
We employ joint BPE vocabularies. The sizes are 32k for En–De and En–Ro; 30k for Fr–En; and 3k for Si–En. In addition, we use a word dropout rate of 0.4 during training of the baseline and latent variable models, which is complementary to our approach.
Baselines
We compare our model to three baselines: 1) Transformer, non-latent: the standard Transformer model without latent variables (denoted as non-latent); 2) VNMT: a CVAE model with a Gaussian distribution, as proposed in Variational NMT by Zhang et al. (2016) [35], which we reimplemented using the Transformer; and 3) DCVAE: a CVAE model with the same discrete latent variable parameterization as ours but without the proposed enhancements promoting mutual information, i.e., the only differences are the modified ELBO and the bag-of-words regularizer.
Main Results
Preventing Posterior Collapse
In this set of experiments, we compare our model to a standard DCVAE without the proposed mutual information enhancement. We report four metrics of posterior collapse on the validation set of WMT Ro–En:

The KL term of the ELBO.

Mutual information between the latent variable and the data: I(z; x) and I(z; y).

Negative log-likelihood (NLL) per token.
Table 1 shows that when using the standard DCVAE ELBO, even with the common practice of KL annealing (KLA), both the KL loss and mutual information settle to almost 0, which is consistent with the analysis in Eq. 5. We also plot the progression of the KL term, I(z; x), and I(z; y) during training in Figure 2. The posterior collapse of the baseline model is apparent: both mutual information terms drop to 0 at the beginning of training as a result of the ELBO's design. On the other hand, our model, without using any annealing schedule, effectively increases mutual information and prevents the KL loss from settling to a degenerate solution early on.
Translation Quality
We report corpus-level BLEU [19] (in particular, detokenized SacreBLEU [20]) on the test sets, where the translations are generated by sampling each latent variable with soft assignment (vs. argmax).
Supervised Learning on Parallel Data
First, we evaluate our model's performance when trained with parallel data on standard WMT datasets. Table 2 shows that our model consistently outperforms both the VNMT and DCVAE models, which require ad-hoc KL annealing (KLA), while remaining on par with a strong Transformer baseline.
Semi-supervised with Source-side Monolingual Data
Leveraging monolingual data is a common practice to improve low-resource NMT. Current approaches have mostly focused on using target-side monolingual data through "back-translation" as data augmentation, while how to effectively leverage source-side monolingual data to facilitate self-training is still an open challenge [23, 36]. We use the joint training objective described in Eq. 13. For a fair comparison, we also extend VNMT and DCVAE with the same joint training algorithm, i.e., the newly added monolingual data is used to train their corresponding sequence encoder and inference network with the standard VAE ELBO. That is, the only difference is that our model is trained to promote mutual information, I(z; x) and I(z; y). As shown in Table 3, the proposed model thereby brings larger gains during self-training with source-side monolingual data.
Robustness to noisy data
While high-quality parallel data is scarce for low-resource language pairs, weakly aligned sentence pairs can be mined from massive unpaired data such as Paracrawl (https://paracrawl.eu/). We evaluate our model's performance when augmenting the training set with increasingly noisy parallel data filtered by Zipporah [32]. Figure 3 shows the results in the Sinhala–English direction. Our model always outperforms the standard Transformer, which struggles as more (and noisier) data is added.
Analysis
Ablation Study
We further investigate how the different ingredients of our proposed approach contribute to preventing posterior collapse and improving translation quality. We conduct further experiments with two variants of the proposed model: 1) modified ELBO only: adding only the mutual information term to the training objective, without gradients from the BoW auxiliary loss; and 2) BoW only: equivalent to the DCVAE combined with the BoW decoder.
First, we perform the same collapse-metric evaluation as in Table 1. Figure 2 (B) suggests that by explicitly adding the mutual information term back to the training objective, both I(z; x) and I(z; y) are effectively raised, while the remaining aggregated KL term is still optimized to zero. Such behavior is consistent with the analysis in Eq. 5. On the other hand, regularizing z with the BoW decoder only, as shown in Figure 2 (C), is very effective in preventing KL vanishing as well as in increasing mutual information. When the two approaches are combined, as shown in Figure 2, the model retains higher mutual information for both I(z; x) and I(z; y).
Analysis of Outputs
Delving into model predictions helps us understand how our model outperforms the others. We provide some 1-best predictions from the Romanian–English data in Table 5. Several examples support the fact that our model produces more fluent and accurate translations than the baseline or VNMT. VNMT often struggles by introducing disfluent words, and both VNMT and the baseline can select justifiable but incorrect words. For instance, in our second example, the gender and animacy of the possessor are not specified in Romanian. Our model selects a more plausible pronoun for this context.
More broadly, we find that the reference translations are quite loose and context-dependent (rather than word-for-word translations), making them difficult for models to reproduce; the models instead give reasonable translations with greater fidelity to source word order and content. (As an extreme example, the English reference for ed miliband isi cunostea dusmanii adds information to the beginning: for all his foolishness ed miliband knew who his enemies were; no model is able to add this.) Our model often makes superior judgments in terms of lexical choice and fluency.
Analysis of Latent Variables
Finally, we probe whether different latent variables encode different information. We randomly sample 100 sentences from two test sets of distinct domains, MTNT (Reddit comments) and WMT (news), with 50 sentences each. We plot the t-SNE projections of the corresponding latent variable samples inferred from each inference network. Figure 4 indicates that different latent variables learn to organize the data in different manners, although there was no clear signal that any of them exclusively specializes in encoding a domain label. We leave a thorough analysis of their information specialization to future work.
Related Work
Unlike most prior work in (conditional) text generation, we are able to address posterior collapse without requiring an annealing schedule [3], a weakened decoder [9], or a restriction on the variational family [21].
Unlike Ma et al. (2018), who also employ bag-of-words as an objective for NMT, our bag-of-words decoder only has access to z, not the encoder states. Conversely, unlike Weng et al. (2017), our generative decoder has access to both the latent variable and the encoder states, and the bag-of-words prediction is handled by a separate set of parameters.
Posterior collapse for text VAEs was first identified in language modeling [3]. VNMT [35] applies CVAE with Gaussian priors to conditional text generation. VRNMT [27] extends VNMT by modeling the translation process at a finer granularity. All of these require manually designed annealing schedules to increase the KL loss in order to mitigate posterior collapse. Discrete latent variables have been applied to NMT [8, 25, 13] but without variational inference or addressing posterior collapse. Tackling posterior collapse has received more attention lately, with general approaches such as aggressively trained inference networks [10], skip connections [7], and more expressive priors [21, 29].
Conclusion
We have presented a conditional generative model with latent variables whose distribution is learned with variational inference, and applied it to machine translation. Our approach does not require an annealing schedule or a hamstrung decoder to avoid posterior collapse. Instead, by providing a new analysis of the conditional VAE objective, improving it in a principled way, and incorporating an auxiliary decoding objective, our model measurably relies on the latent variables. In addition to preventing posterior collapse, our approach improves translation quality in terms of BLEU. Empirical evaluation demonstrates that the proposed method improves performance when dealing with uncertainty in data, including weakly supervised learning from source-side monolingual data as well as noisy parallel data.
References
 [1] (2015) Neural machine translation by jointly learning to align and translate. In ICLR, External Links: Link Cited by: Introduction.
 [2] (2004) Large scale online learning. In Advances in Neural Information Processing Systems 16, pp. 217–224. External Links: Link Cited by: Neural Machine Translation.
 [3] (2016) Generating sentences from a continuous space. In CoNLL, Berlin, Germany, pp. 10–21. External Links: Link, Document Cited by: Introduction, Posterior Collapse, Guiding to Encode Global Information, Related Work, Related Work.
 [4] (2017) Variational lossy autoencoder. In ICLR, External Links: Link Cited by: Guiding to Encode Global Information.
 [5] (2006) Elements of information theory (wiley series in telecommunications and signal processing). New York, NY, USA. External Links: ISBN 0471241954 Cited by: Posterior Collapse.
 [6] (2017) Copied monolingual data improves low-resource neural machine translation. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark, pp. 148–156. External Links: Link, Document Cited by: footnote 1.

 [7] (2019) Avoiding latent variable collapse with generative skip models. In Proceedings of Machine Learning Research, Vol. 89, pp. 2397–2405. External Links: Link Cited by: Related Work.
 [8] (2018) Non-autoregressive neural machine translation. In ICLR, External Links: Link Cited by: Related Work.
 [9] (2017) PixelVAE: A latent variable model for natural images. In ICLR, External Links: Link Cited by: Guiding to Encode Global Information, Related Work.
 [10] (2019) Lagging inference networks and posterior collapse in variational autoencoders. In ICLR, External Links: Link Cited by: Posterior Collapse, Related Work.

 [11] (1993) Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT ’93, New York, NY, USA, pp. 5–13. External Links: ISBN 0897916115, Link, Document Cited by: Guiding to Encode Global Information.
 [12] (2017) Categorical reparameterization with Gumbel–Softmax. In ICLR, External Links: Link Cited by: Inference Network.
 [13] (2017) One model to learn them all. CoRR abs/1706.05137v1. External Links: Link, 1706.05137v1 Cited by: Related Work.
 [14] (2013) Autoencoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: Introduction.
 [15] (2017) Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, Vancouver, pp. 28–39. External Links: Link, Document Cited by: Introduction.
 [16] (2016) Neural variational inference for text processing. In ICML, pp. 1727–1736. Cited by: Introduction.
 [17] (2018) MTNT: a testbed for machine translation of noisy text. In Proc. EMNLP, Brussels, Belgium, pp. 543–553. External Links: Link, Document Cited by: item MT for Noisy Text (MTNT) French–English.
 [18] (2018) Analyzing uncertainty in neural machine translation. In ICML, Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm, Sweden, pp. 3956–3965. External Links: Link Cited by: Introduction.
 [19] (2002) Bleu: a method for automatic evaluation of machine translation. In ACL, Philadelphia, Pennsylvania, USA, pp. 311–318. External Links: Link, Document Cited by: Translation Quality.
 [20] (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation, Belgium, Brussels, pp. 186–191. External Links: Link, Document Cited by: footnote 2.
 [21] (2019) Preventing posterior collapse with deltaVAEs. In ICLR, External Links: Link Cited by: Related Work, Related Work.
 [22] (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: Introduction.
 [23] (2016) Improving neural machine translation models with monolingual data. pp. 86–96. External Links: Link, Document Cited by: Semisupervised with Sourceside Monolingual Data.
 [24] (2016) Neural machine translation of rare words with subword units. In ACL, Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: item WMT16 Romanian–English.
 [25] (2019) Mixture models for diverse machine translation: tricks of the trade. In ICML, Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 5719–5728. External Links: Link Cited by: Related Work.
 [26] (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems 28, pp. 3483–3491. External Links: Link Cited by: Conditional Variational Autoencoder (CVAE).
 [27] (2018) Variational recurrent neural machine translation. In AAAI 2018, pp. 5488–5495. External Links: Link Cited by: Related Work.
 [28] (2003) Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics 29 (1), pp. 97–133. External Links: Link, Document Cited by: Neural Machine Translation.
 [29] (2017) VAE with a VampPrior. CoRR abs/1705.07120v5. External Links: Link, 1705.07120v5 Cited by: Related Work.
 [30] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. External Links: Link Cited by: Introduction, Neural Machine Translation, Implementation.
 [31] (1987) Estimation and inference by compact coding. Journal of the Royal Statistical Society. Series B (Methodological) 49 (3), pp. 240–265. External Links: ISSN 00359246, Link Cited by: Guiding to Encode Global Information.
 [32] (2017) Zipporah: a fast and scalable data cleaning system for noisy web-crawled parallel corpora. In Proc. EMNLP, Copenhagen, Denmark, pp. 2945–2950. External Links: Link, Document Cited by: Robustness to noisy data.
 [33] (2018) Breaking the softmax bottleneck: a highrank RNN language model. In ICLR, External Links: Link Cited by: Guiding to Encode Global Information.
 [34] (2017) Improved variational autoencoders for text modeling using dilated convolutions. In ICML, Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 3881–3890. External Links: Link Cited by: Introduction.
 [35] (2016) Variational neural machine translation. In EMNLP, Austin, Texas, pp. 521–530. External Links: Link, Document Cited by: Introduction, Conditional Variational Autoencoder (CVAE), Related Work.
 [36] (2016) Exploiting sourceside monolingual data in neural machine translation. In Proc. EMNLP, pp. 1535–1545. Cited by: Semisupervised with Sourceside Monolingual Data.
 [37] (2017) InfoVAE: information maximizing variational autoencoders. CoRR abs/1706.02262v3. External Links: Link, 1706.02262v3 Cited by: Adding to ELBO.