flowseq
Generative Flow based Sequence-to-Sequence Toolkit written in Python.
Most sequence-to-sequence (seq2seq) models are autoregressive; they generate each token by conditioning on previously generated tokens. In contrast, non-autoregressive seq2seq models generate all tokens in one pass, which leads to increased efficiency through parallel processing on hardware such as GPUs. However, directly modeling the joint distribution of all tokens simultaneously is challenging, and even with increasingly complex model structures accuracy lags significantly behind autoregressive models. In this paper, we propose a simple, efficient, and effective model for non-autoregressive sequence generation using latent variable models. Specifically, we turn to generative flow, an elegant technique to model complex distributions using neural networks, and design several layers of flow tailored for modeling the conditional density of sequential latent variables. We evaluate this model on three neural machine translation (NMT) benchmark datasets, achieving comparable performance with state-of-the-art non-autoregressive NMT models and almost constant decoding time w.r.t. the sequence length.
Neural sequence-to-sequence (seq2seq) models (Bahdanau et al., 2015; Rush et al., 2015; Vinyals et al., 2015; Vaswani et al., 2017) generate an output sequence $y = \{y_1, \ldots, y_T\}$ given an input sequence $x = \{x_1, \ldots, x_{T'}\}$ using conditional probabilities $P_\theta(y|x)$ predicted by neural networks (parameterized by $\theta$).

Most seq2seq models are autoregressive, meaning that they factorize the joint probability of the output sequence given the input sequence into the product of probabilities over the next token in the sequence given the input sequence and previously generated tokens:

$$P_\theta(y|x) = \prod_{t=1}^{T} P_\theta(y_t \mid y_{<t}, x) \tag{1}$$

Each factor, $P_\theta(y_t \mid y_{<t}, x)$, can be implemented by function approximators such as RNNs (Bahdanau et al., 2015) and Transformers (Vaswani et al., 2017). This factorization takes the complicated problem of joint estimation over an exponentially large space of outputs $y$, and turns it into a sequence of tractable multi-class classification problems predicting $y_t$ given the previous words, allowing for simple maximum log-likelihood training. However, this assumption of left-to-right factorization may be sub-optimal from a modeling perspective (Gu et al., 2019; Stern et al., 2019), and generation of outputs must be done through a linear left-to-right pass through the output tokens using beam search, which is not easily parallelizable on hardware such as GPUs.

Recently, there has been work on non-autoregressive sequence generation for neural machine translation (NMT; Gu et al., 2018; Lee et al., 2018; Ghazvininejad et al., 2019) and language modeling (Ziegler and Rush, 2019). Non-autoregressive models attempt to model the joint distribution $P_\theta(y|x)$ directly, decoupling the dependencies of decoding history during generation. A naïve solution is to assume that each token of the target sequence is independent given the input:

$$P_\theta(y|x) = \prod_{t=1}^{T} P_\theta(y_t \mid x) \tag{2}$$

Unfortunately, the performance of this simple model falls far behind autoregressive models, as seq2seq tasks usually do have strong conditional dependencies between output variables (Gu et al., 2018). This problem can be mitigated by introducing a latent variable $z$ to model these conditional dependencies:

$$P_\theta(y|x) = \int_z P_\theta(y \mid z, x)\, p_\theta(z \mid x)\, dz \tag{3}$$

where $p_\theta(z|x)$ is the prior distribution over latent $z$ and $P_\theta(y|z,x)$ is the "generative" distribution (a.k.a. the decoder). Non-autoregressive generation can be achieved by the following independence assumption in the decoding process:

$$P_\theta(y \mid z, x) = \prod_{t=1}^{T} P_\theta(y_t \mid z, x) \tag{4}$$

Gu et al. (2018) proposed a $z$ representing fertility scores specifying the number of output words each input word generates, significantly improving the performance over Eq. (2). But the performance still falls behind state-of-the-art autoregressive models due to the limited expressiveness of fertility to model the interdependence between words in $y$.
In this paper, we propose a simple, effective, and efficient model, FlowSeq, which models the expressive prior distribution $p_\theta(z|x)$ using a powerful mathematical framework called generative flow (Rezende and Mohamed, 2015). This framework can elegantly model complex distributions, and has obtained remarkable success in modeling continuous data such as images and speech through efficient density estimation and sampling (Kingma and Dhariwal, 2018; Prenger et al., 2019; Ma and Hovy, 2019). Based on this, we posit that generative flow also has the potential to introduce more meaningful latent variables $z$ in the non-autoregressive generation in Eq. (3).
FlowSeq is a flow-based sequence-to-sequence model, which is (to our knowledge) the first non-autoregressive seq2seq model utilizing generative flows. It allows for efficient parallel decoding while modeling the joint distribution of the output sequence. Experimentally, on three benchmark datasets for machine translation (WMT2014, WMT2016 and IWSLT2014), FlowSeq achieves comparable performance with state-of-the-art non-autoregressive models, and almost constant decoding time w.r.t. the sequence length, whereas the decoding time of a typical left-to-right Transformer model grows super-linearly.
As noted above, incorporating expressive latent variables $z$ is essential to decouple the dependencies between tokens in the target sequence in non-autoregressive models. However, in order to model all of the complexities of sequence generation to the point that we can read off all of the words in the output in an independent fashion (as in Eq. (4)), the prior distribution $p_\theta(z|x)$ will necessarily be quite complex. In this section, we describe generative flows (Rezende and Mohamed, 2015), an effective method for modeling arbitrarily complicated distributions, before describing how we apply them to sequence-to-sequence generation in §3.
Put simply, flow-based generative models work by transforming a simple distribution (e.g. a simple Gaussian) into a complex one (e.g. the complex prior distribution over $z$ that we want to model) through a chain of invertible transformations.
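Formally, a set of latent variables $\upsilon \in \Upsilon$ are introduced with a simple prior distribution $p_\Upsilon(\upsilon)$. We then define a bijection function $f: \mathcal{Z} \rightarrow \Upsilon$ (with $g = f^{-1}$), whereby we can define a generative process over variables $z$:

$$\upsilon \sim p_\Upsilon(\upsilon), \qquad z = g_\theta(\upsilon) \tag{5}$$

An important insight behind flow-based models is that given this bijection function, the change of variable formula defines the model distribution on $z \in \mathcal{Z}$ by:

$$p_\theta(z) = p_\Upsilon(f_\theta(z)) \left| \det\left( \frac{\partial f_\theta(z)}{\partial z} \right) \right| \tag{6}$$

Here $\frac{\partial f_\theta(z)}{\partial z}$ is the Jacobian matrix of $f_\theta$ at $z$.
Eq. (6) provides a way to calculate the (complex) density of $z$ by calculating the (simple) density of $\upsilon$ and the Jacobian of the transformation from $z$ to $\upsilon$. For efficiency purposes, flow-based models generally use certain types of transformations where both the inverse functions and the Jacobian determinants are tractable to compute. A stacked sequence of such invertible transformations is also called a (normalizing) flow (Rezende and Mohamed, 2015):

$$z \;\underset{g_1}{\overset{f_1}{\longleftrightarrow}}\; H_1 \;\underset{g_2}{\overset{f_2}{\longleftrightarrow}}\; H_2 \;\cdots\; \underset{g_K}{\overset{f_K}{\longleftrightarrow}}\; \upsilon$$

where $f = f_1 \circ f_2 \circ \cdots \circ f_K$ is a flow of $K$ transformations (omitting $\theta$s for brevity).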
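As a small illustration of the change of variable formula in Eq. (6), the sketch below uses a one-dimensional affine flow with a standard normal base distribution. The function names and the choice of an affine map are assumptions made for this example only; they are not taken from the FlowSeq implementation. Because an affine map of a Gaussian is again a Gaussian, the flow density can be checked against the closed form.

```python
import math

def standard_normal_logpdf(u):
    """Log-density of the simple base distribution N(0, 1)."""
    return -0.5 * (u * u + math.log(2.0 * math.pi))

def flow_log_density(z, scale, shift):
    """log p(z) for the flow z = g(u) = scale * u + shift with u ~ N(0, 1)."""
    u = (z - shift) / scale               # f(z) = g^{-1}(z)
    log_abs_det = -math.log(abs(scale))   # log |df/dz| for the affine map
    return standard_normal_logpdf(u) + log_abs_det

def gaussian_logpdf(z, mean, std):
    """Closed-form log-density of N(mean, std^2), used only as a reference."""
    return -0.5 * (((z - mean) / std) ** 2 + math.log(2.0 * math.pi)) - math.log(std)

# The flow density agrees with the Gaussian N(shift, scale^2).
flow_value = flow_log_density(1.3, scale=2.0, shift=0.5)
reference_value = gaussian_logpdf(1.3, mean=0.5, std=2.0)
```

A deeper flow composes several such maps and adds up their log-determinants.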
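In the context of maximum likelihood estimation (MLE), we wish to minimize the negative log-likelihood of the parameters:

$$\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} -\log P_\theta(y^i \mid x^i) \tag{7}$$

where $D = \{(x^i, y^i)\}_{i=1}^{N}$ is the set of training data. However, the likelihood $P_\theta(y|x)$ after marginalizing out latent variables $z$ (LHS in Eq. (3)) is intractable to compute or differentiate directly. Variational inference (Wainwright et al., 2008) provides a solution by introducing a parametric inference model $q_\phi(z|y,x)$ (a.k.a. the posterior), which is then used to approximate this integral by sampling individual examples of $z$. These models then optimize the evidence lower bound (ELBO), which considers both the "reconstruction error" $\log P_\theta(y|z,x)$ and the KL-divergence between the posterior and the prior:

$$\log P_\theta(y|x) \geq \mathbb{E}_{q_\phi(z|y,x)}\left[\log P_\theta(y \mid z, x)\right] - \mathrm{KL}\left(q_\phi(z \mid y, x) \,\|\, p_\theta(z \mid x)\right) \tag{8}$$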
Both inference model and decoder parameters are optimized according to this objective.
We first give an overview of FlowSeq's architecture (shown in Figure 2) and training process here before detailing each component in the following sections. Similarly to classic seq2seq models, at both training and test time FlowSeq first reads the whole input sequence $x$ and calculates a vector for each word in the sequence, the source encoding.
At training time, FlowSeq's parameters are learned using the variational training paradigm overviewed in §2.2. First, we draw samples of latent codes $z$ from the current posterior $q_\phi(z|y,x)$. Next, we feed $z$ together with the source encodings into the decoder network and the prior flow to compute the probabilities $P_\theta(y|z,x)$ and $p_\theta(z|x)$ for optimizing the ELBO (Eq. (8)).
At test time, generation is performed by first sampling a latent code $z$ from the prior flow by executing the generative process defined in Eq. (5). In this step, the source encodings produced by the encoder are used as conditional inputs. Then the decoder receives both the sampled latent code $z$ and the source encoder outputs to generate the target sequence $y$ from $P_\theta(y|z,x)$.
The source encoder encodes the source sequences into hidden representations, which are used in computing attention when generating latent variables in the posterior network and prior network, as well as in the cross-attention of the decoder. Any standard neural sequence model can be used as the encoder, including RNNs (Bahdanau et al., 2015) or Transformers (Vaswani et al., 2017).

The latent variables $z$ are represented as a sequence of continuous random vectors $z = \{z_1, \ldots, z_T\}$ with the same length as the target sequence $y$. Each $z_t$ is a $d_z$-dimensional vector, where $d_z$ is the dimension of the latent space. The posterior distribution $q_\phi(z|y,x)$ models each $z_t$ as a diagonal Gaussian with learned mean and variance:

$$q_\phi(z \mid y, x) = \prod_{t=1}^{T} \mathcal{N}\left(z_t \mid \mu_t(x, y), \sigma_t^2(x, y)\right) \tag{9}$$

where $\mu_t(\cdot)$ and $\sigma_t(\cdot)$ are neural networks such as RNNs or Transformers.
While we perform standard random initialization for most layers of the network, we initialize the last linear transforms that generate the $\mu$ and $\log \sigma^2$ values with zeros. This ensures that the posterior distribution starts out as a simple normal distribution, which we found helps train very deep generative flows more stably.
The motivation for introducing the latent variable $z$ into the model is to capture the uncertainty in the generative process. Thus, it is preferable that $z$ capture contextual interdependence between tokens in $y$. However, there is an obvious local optimum where the posterior network generates a latent vector $z_t$ that only encodes the information about the corresponding target token $y_t$, and the decoder simply generates the "correct" token at each step $t$ with $z_t$ as input. In this case, FlowSeq reduces to the baseline model in Eq. (2). To escape this undesired local optimum, we apply token-level dropout to randomly drop an entire token when calculating the posterior, to ensure the model also has to learn how to use contextual information. This technique is similar to the "masked language model" in previous studies (Melamud et al., 2016; Devlin et al., 2018; Ma et al., 2018).
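The token-level dropout described above can be sketched as follows. This is an illustrative version only: replacing a dropped token with a placeholder string, and the names `token_dropout` and `mask_token`, are assumptions of this example rather than details confirmed by the text.

```python
import random

def token_dropout(tokens, drop_prob, mask_token="<mask>", rng=None):
    """Independently replace each target token with a placeholder.

    The posterior network would then see the corrupted sequence, so it
    cannot copy every y_t into z_t and has to rely on context.
    """
    rng = rng or random.Random()
    return [mask_token if rng.random() < drop_prob else tok for tok in tokens]

target = ["there", "are", "basic", "foods", "everywhere"]
corrupted = token_dropout(target, drop_prob=0.3, rng=random.Random(0))
```

The sequence length is preserved, which keeps the one-to-one alignment between the target tokens and the latent vectors.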
As the decoder, we take the latent sequence $z$ as input, run it through several layers of a neural sequence model such as a Transformer, then directly predict the output tokens in $y$ individually and independently. Notably, unlike standard seq2seq decoders, we do not perform causal masking to prevent attending to future tokens, making the model fully non-autoregressive.
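Because every position is predicted independently (Eq. (4)), the most probable output for a given latent sequence can be read off position by position. The sketch below assumes the decoder has already produced one probability vector per position; the vocabulary and the probabilities are made-up values for illustration.

```python
def argmax_decode(position_probs, vocab):
    """Pick the most probable token at every position independently."""
    tokens = []
    for probs in position_probs:
        best = max(range(len(probs)), key=lambda i: probs[i])
        tokens.append(vocab[best])
    return tokens

# Hypothetical decoder outputs for a three-token target.
vocab = ["<eos>", "the", "cat", "sat"]
position_probs = [
    [0.05, 0.80, 0.10, 0.05],
    [0.05, 0.10, 0.70, 0.15],
    [0.10, 0.05, 0.15, 0.70],
]
decoded = argmax_decode(position_probs, vocab)
```

No position waits for another, so all positions can be computed in one parallel pass.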
The flow architecture is based on Glow (Kingma and Dhariwal, 2018). It consists of a series of steps of flow, combined in a multi-scale architecture (see Figure 2). Each step of flow consists of three types of elementary flows: actnorm, invertible multi-head linear, and coupling. Note that all three functions are invertible and conducive to calculation of log-determinants (details in Appendix A).
The activation normalization layer (actnorm; Kingma and Dhariwal, 2018) is an alternative to batch normalization (Ioffe and Szegedy, 2015) that has mainly been used in the context of image data to alleviate problems in model training. Actnorm performs an affine transformation of the activations using a scale and bias parameter per feature for sequences:

$$z'_t = s \odot z_t + b \tag{10}$$

Both $z$ and $z'$ are tensors of shape $[T \times d_z]$ with time dimension $T$ and feature dimension $d_z$. The parameters are initialized such that, over each feature, $z'_t$ has zero mean and unit variance given an initial mini-batch of data.

To incorporate general permutations of variables along the feature dimension, ensuring that each dimension can affect every other one after a sufficient number of steps of flow, Kingma and Dhariwal (2018) proposed a trainable invertible $1 \times 1$ convolution layer for 2D images. It is straightforward to apply similar transformations to sequential data:

$$z'_t = z_t W \tag{11}$$

where $W$ is the weight matrix of shape $[d_z \times d_z]$. The log-determinant of this transformation is:

$$\log \left| \det\left( \frac{\partial\, \mathrm{linear}(z; W)}{\partial z} \right) \right| = T \cdot \log |\det(W)|$$

The cost of computing $\det(W)$ is $O(d_z^3)$.
Unfortunately, $d_z$ in seq2seq generation is commonly large, which significantly slows down the computation of $\det(W)$. To apply this to sequence generation, we propose a multi-head invertible linear layer, which first splits each $d_z$-dimensional feature vector into $h$ heads with dimension $d_h = d_z / h$. Then the linear transformation in (11) is applied to each head, with a $d_h \times d_h$ weight matrix $W$, significantly reducing the dimension. For the splitting of heads, one step of flow contains one linear layer with either row-major or column-major splitting format, and these steps with different linear layers are composed in an alternating pattern.
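The multi-head invertible linear layer can be sketched in a few lines of plain Python. The sketch assumes one weight matrix shared by all heads and a contiguous split of the feature vector into heads; the helper names (`row_times_matrix`, `determinant`, `multihead_linear`) are hypothetical and chosen for this example.

```python
import math

def row_times_matrix(row, W):
    """Row vector times matrix, i.e. row @ W."""
    return [sum(row[i] * W[i][j] for i in range(len(row))) for j in range(len(W[0]))]

def determinant(W):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [list(r) for r in W]
    n, det = len(A), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            det = -det          # a row swap flips the sign
        det *= A[c][c]
        for r in range(c + 1, n):
            factor = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= factor * A[c][k]
    return det

def multihead_linear(z, W, num_heads):
    """Apply the d_h x d_h matrix W to every head of every time step.

    z is a T x d_z list of lists. Returns (z', log|det Jacobian|), where the
    log-determinant is T * h * log|det(W)| because W is applied T * h times.
    """
    d_z = len(z[0])
    d_h = d_z // num_heads
    assert d_h * num_heads == d_z and len(W) == d_h
    out = []
    for z_t in z:
        new_t = []
        for head in range(num_heads):
            chunk = z_t[head * d_h:(head + 1) * d_h]  # contiguous split (assumption)
            new_t.extend(row_times_matrix(chunk, W))
        out.append(new_t)
    logdet = len(z) * num_heads * math.log(abs(determinant(W)))
    return out, logdet

W = [[2.0, 0.0], [0.0, 3.0]]
z = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0]]
z_new, logdet = multihead_linear(z, W, num_heads=2)
```

The determinant is computed on the small $d_h \times d_h$ matrix only, which is where the savings over a full $d_z \times d_z$ matrix come from.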
To model interdependence across time steps, we use affine coupling layers (Dinh et al., 2016):

$$\begin{aligned} z_a, z_b &= \mathrm{split}(z) \\ z'_a &= z_a \\ z'_b &= s(z_a, x) \odot z_b + b(z_a, x) \\ z' &= \mathrm{concat}(z'_a, z'_b) \end{aligned}$$

where $s(z_a, x)$ and $b(z_a, x)$ are outputs of two neural networks with $z_a$ and $x$ as input. These are shown in Figure 3 (c). In experiments, we implement $s(\cdot)$ and $b(\cdot)$ with one Transformer decoder layer (Vaswani et al., 2017): multi-head self-attention over $z_a$, followed by multi-head inter-attention over $x$, followed by a position-wise feed-forward network. The input $z_a$ is fed into this layer in one pass, without causal masking.
As in Dinh et al. (2016), the $\mathrm{split}()$ function splits the input tensor into two halves, while the $\mathrm{concat}()$ operation performs the corresponding reverse concatenation. In our architecture, three types of split functions are used, based on the split dimension and pattern. Figure 3 (b) illustrates the three splitting types. The first type of split groups $z$ along the time dimension on alternate indices. In this case, FlowSeq mainly models the interactions between time steps. The second and third types of split operate on the feature dimension, with continuous and alternate patterns, respectively. For each type of split, we alternate $z_a$ and $z_b$ to increase the flexibility of the split function. Different types of affine coupling layers alternate in the flow, similar to the linear layers.
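A minimal sketch of one affine coupling layer on a single feature vector is given below, using the continuous split along the feature dimension. The function `toy_conditioner` is a hypothetical stand-in for the Transformer layer that produces the scale and bias; any function of the untouched half and the source encoding keeps the layer invertible.

```python
import math

def toy_conditioner(z_a, x):
    """Hypothetical stand-in for the networks s(z_a, x) and b(z_a, x)."""
    context = sum(z_a) + sum(x)
    scale = [math.exp(math.tanh(context + i)) for i in range(len(z_a))]  # always positive
    bias = [0.1 * context - i for i in range(len(z_a))]
    return scale, bias

def coupling_forward(z, x):
    """z' = concat(z_a, s * z_b + b); returns (z', log|det Jacobian|)."""
    half = len(z) // 2                  # assumes an even feature dimension
    z_a, z_b = z[:half], z[half:]
    s, b = toy_conditioner(z_a, x)
    z_b_new = [si * zi + bi for si, zi, bi in zip(s, z_b, b)]
    logdet = sum(math.log(abs(si)) for si in s)
    return z_a + z_b_new, logdet

def coupling_inverse(z_new, x):
    """Recover z from z'; z_a is unchanged, so s and b can be recomputed."""
    half = len(z_new) // 2
    z_a, z_b_new = z_new[:half], z_new[half:]
    s, b = toy_conditioner(z_a, x)
    z_b = [(zi - bi) / si for si, zi, bi in zip(s, z_b_new, b)]
    return z_a + z_b

z = [0.5, -1.0, 2.0, 0.3]
x = [0.2, 0.1]
z_new, logdet = coupling_forward(z, x)
z_back = coupling_inverse(z_new, x)
```

Swapping the roles of the two halves in the next layer lets every dimension be transformed eventually.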
We follow Dinh et al. (2016) in implementing a multi-scale architecture using the squeezing operation on the feature dimension, which has been demonstrated to be helpful for training deep flows. Formally, each scale is a combination of several steps of the flow (see Figure 3 (a)). After each scale, the model drops half of the dimensions with the third type of split in Figure 3 (b) to reduce computational and memory cost, outputting a tensor with shape $[T \times \frac{d}{2}]$. Then the squeezing operation transforms the $T \times \frac{d}{2}$ tensor into a $\frac{T}{2} \times d$ one as the input of the next scale. We pad each sentence with EOS tokens to ensure $T$ is divisible by $2$. The right component of Figure 2 illustrates the multi-scale architecture.

In autoregressive seq2seq models, it is natural to determine the length of the sequence dynamically by simply predicting a special EOS token. However, for FlowSeq to predict the entire sequence in parallel, it needs to know its length in advance to generate the latent sequence $z$. Instead of predicting the absolute length of the target sequence, we predict the length difference between source and target sequences using a classifier over a bounded range of differences. Numbers in this range are predicted by max-pooling the source encodings into a single vector, running this through a linear layer, and taking a softmax. (We experimented with other methods such as mean-pooling or taking the last hidden state and found no major difference in our experiments.) This classifier is learned jointly with the rest of the model.

At inference time, the model needs to identify the sequence with the highest conditional probability by marginalizing over all possible latent variables (see Eq. (3)), which is intractable in practice. We propose three approximating decoding algorithms to reduce the search space.
A more accurate approximation of decoding, noisy parallel decoding (NPD) proposed in Gu et al. (2018), is to draw samples from the latent space and compute the best output for each latent sequence. Then, a pre-trained autoregressive model is adopted to rank these sequences. In FlowSeq, different candidates can be generated by sampling different target lengths or different samples from the prior, and both of the strategies can be batched via masks during decoding. In our experiments, we first select the top $l$ length candidates from the length predictor in §3.5. Then, for each length candidate we use $r$ random samples from the prior network to generate output sequences, yielding a total of $l \times r$ candidates.
The third approximating method, importance weighted decoding (IWD), is based on the lower bound of importance weighted estimation (Burda et al., 2015). Similarly to NPD, IWD first draws samples from the latent space and computes the best output $\hat{y}$ for each latent sequence. Then, IWD ranks these candidate sequences with $K$ importance samples:

$$z_i \sim q_\phi(z \mid \hat{y}, x), \; i = 1, \ldots, K; \qquad P(\hat{y} \mid x) \approx \frac{1}{K} \sum_{i=1}^{K} \frac{P_\theta(\hat{y} \mid z_i, x)\, p_\theta(z_i \mid x)}{q_\phi(z_i \mid \hat{y}, x)}$$

IWD does not rely on a separate pre-trained model, though it significantly slows down the decoding speed. The detailed comparison of these three decoding methods is provided in §4.2.
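The importance weighted score used to rank candidates can be computed stably in log space. The sketch below assumes that, for one candidate, the decoder, prior and posterior log-probabilities of each importance sample are already available as plain lists; the function names are hypothetical.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def importance_weighted_log_prob(log_decoder, log_prior, log_posterior):
    """log of (1/K) * sum_i P(y|z_i,x) p(z_i|x) / q(z_i|y,x) for one candidate."""
    log_weights = [d + p - q for d, p, q in zip(log_decoder, log_prior, log_posterior)]
    return log_sum_exp(log_weights) - math.log(len(log_weights))

def rank_candidates(candidate_scores):
    """Candidate indices sorted from best to worst score."""
    return sorted(range(len(candidate_scores)),
                  key=lambda i: candidate_scores[i], reverse=True)

# Two importance samples with identical weights: the estimate equals that weight.
score = importance_weighted_log_prob([-3.0, -3.0], [-2.0, -2.0], [-1.0, -1.0])
order = rank_candidates([-5.0, -2.0, -3.5])
```

Each candidate needs extra passes through the posterior, prior and decoder for its importance samples, which is consistent with the slowdown noted above.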
Different from the architecture proposed in Ziegler and Rush (2019), the architecture of FlowSeq does not use any autoregressive flow (Kingma et al., 2016; Papamakarios et al., 2017), yielding a truly non-autoregressive model with efficient generation. Note that FlowSeq remains non-autoregressive even if we use an RNN in the architecture, because the RNN is only used to encode a complete sequence of codes and all the input tokens can be fed into the RNN in parallel. This makes it possible to use highly optimized implementations of RNNs such as those provided by cuDNN (https://devblogs.nvidia.com/optimizingrecurrentneuralnetworkscudnn5/). Thus while RNNs do experience some drop in speed, it is less extreme than that experienced when using autoregressive models.
We evaluate FlowSeq on three machine translation benchmark datasets: WMT2014 DE-EN (around 4.5M sentence pairs), WMT2016 RO-EN (around 610K sentence pairs) and a smaller dataset, IWSLT2014 DE-EN (around 150K sentence pairs). We use scripts from fairseq (Ott et al., 2019) to preprocess WMT2014 and IWSLT2014, where the preprocessing steps follow Vaswani et al. (2017) for WMT2014. We use the data provided in Lee et al. (2018) for WMT2016. For both WMT datasets, the source and target languages share the same set of BPE embeddings, while for IWSLT2014 we use separate embeddings. During training, we filter out sentences that exceed a maximum length, set separately for the WMT datasets and for IWSLT.
Table 1: BLEU scores of FlowSeq with argmax decoding and of baselines with purely non-autoregressive decoding.
Models  WMT2014 EN-DE  WMT2014 DE-EN  WMT2016 EN-RO  WMT2016 RO-EN  IWSLT2014 DE-EN
Raw Data
CMLM-base  10.88  –  20.24  –  –
LV NAR  11.80  –  –  –  –
FlowSeq-base  18.55  23.36  29.26  30.16  24.75
FlowSeq-large  20.85  25.40  29.86  30.69  –
Knowledge Distillation
NAT-IR  13.91  16.77  24.45  25.73  21.86
CTC Loss  17.68  19.80  19.93  24.71  –
NAT w/ FT  17.69  21.47  27.29  29.06  20.32
NAT-REG  20.65  24.77  –  –  23.89
CMLM-small  15.06  19.26  20.12  20.36  –
CMLM-base  18.12  22.26  23.65  22.78  –
FlowSeq-base  21.45  26.16  29.34  30.44  27.55
FlowSeq-large  23.72  28.39  29.73  30.72  –
Table 2: BLEU scores of FlowSeq and baselines with advanced decoding methods (iterative refinement, IWD and NPD rescoring).
Models  WMT2014 EN-DE  WMT2014 DE-EN  WMT2016 EN-RO  WMT2016 RO-EN
Autoregressive Methods
Transformer-base  27.30  –  –  –
Our Implementation  27.16  31.44  32.92  33.09
Raw Data
CMLM-base (refinement 4)  22.06  –  30.89  –
CMLM-base (refinement 10)  24.65  –  32.53  –
FlowSeq-base (IWD)  20.20  24.63  30.61  31.50
FlowSeq-base (NPD)  20.81  25.76  31.38  32.01
FlowSeq-base (NPD)  21.15  26.04  31.74  32.45
FlowSeq-large (IWD)  22.94  27.16  31.08  32.03
FlowSeq-large (NPD)  23.14  27.71  31.97  32.46
FlowSeq-large (NPD)  23.64  28.29  32.35  32.91
Knowledge Distillation
NAT-IR (refinement 10)  21.61  25.48  29.32  30.19
NAT w/ FT (NPD)  18.66  22.42  29.02  31.44
NAT-REG (NPD)  24.61  28.90  –  –
LV NAR (refinement 4)  24.20  –  –  –
CMLM-small (refinement 10)  25.51  29.47  31.65  32.27
CMLM-base (refinement 10)  26.92  30.86  32.42  33.06
FlowSeq-base (IWD)  22.49  27.40  30.59  31.58
FlowSeq-base (NPD)  23.08  28.07  31.35  32.11
FlowSeq-base (NPD)  23.48  28.40  31.75  32.49
FlowSeq-large (IWD)  24.70  29.44  31.02  31.97
FlowSeq-large (NPD)  25.03  30.48  31.89  32.43
FlowSeq-large (NPD)  25.31  30.68  32.20  32.84
We implement the encoder, decoder and posterior networks with standard (unmasked) Transformer layers (Vaswani et al., 2017). For the WMT datasets, the encoder consists of 6 layers, and the decoder and posterior are composed of 4 layers, with 8 attention heads. For IWSLT, the encoder has 5 layers, and the decoder and posterior have 3 layers, with 4 attention heads. The prior flow consists of 3 scales, each with its own number of flow steps. To dissect the impact of model dimension on translation quality and speed, we perform experiments on two versions of FlowSeq with model dimension 256 (base) and 512 (large). More model details are provided in Appendix B.
Parameter optimization is performed with the Adam optimizer (Kingma and Ba, 2014), using fixed-size mini-batches of sentences. The learning rate decays exponentially from its initial value, and gradient clipping is applied. For all the FlowSeq models, we apply label smoothing and average the 5 best checkpoints to create the final model.

At the beginning of training, the posterior network is randomly initialized, producing noisy supervision for the prior. To mitigate this issue, we first set the weight of the KL term in the ELBO to zero for 30,000 updates to train the encoder, decoder and posterior networks. Then the weight linearly increases to one over another 10,000 updates, which we found essential to accelerate training and achieve stable performance.
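The KL warm-up schedule described above can be written as a small function of the update count. The function name and the convention that updates are counted from zero are assumptions of this sketch; the 30,000 and 10,000 update counts come from the text.

```python
def kl_weight(step, warmup_steps=30000, anneal_steps=10000):
    """Weight of the KL term in the ELBO after `step` parameter updates.

    The weight stays at zero for `warmup_steps` updates, then rises
    linearly to one over the next `anneal_steps` updates.
    """
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / anneal_steps)

# Weight at a few points in training.
schedule = [kl_weight(s) for s in (0, 29999, 35000, 40000, 100000)]
```

With these defaults the list evaluates to zero during warm-up, one half in the middle of the ramp, and one from update 40,000 onwards.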
We first conduct experiments to compare the performance of FlowSeq with strong baseline models, including NAT w/ Fertility (Gu et al., 2018), NAT-IR (Lee et al., 2018), NAT-REG (Wang et al., 2019), LV NAR (Shu et al., 2019), CTC Loss (Libovickỳ and Helcl, 2018), and CMLM (Ghazvininejad et al., 2019).
Table 1 provides the BLEU scores of FlowSeq with argmax decoding, together with baselines with purely non-autoregressive decoding methods that generate the output sequence in one parallel pass. The first block lists results of models trained on raw data, while the second block lists results using knowledge distillation. Without using knowledge distillation, the FlowSeq base model achieves a significant improvement over CMLM-base and LV NAR (18.55 vs. 10.88 and 11.80 BLEU on WMT2014 EN-DE). This demonstrates the effectiveness of FlowSeq in modeling the complex interdependence in target languages.
Regarding the effect of knowledge distillation, we make two observations: i) Similar to the findings in previous work, knowledge distillation still benefits the translation quality of FlowSeq. ii) Compared to previous models, the benefit of knowledge distillation for FlowSeq is less significant, yielding less than 3 BLEU improvement on the WMT2014 DE-EN corpus, and almost no improvement on the WMT2016 RO-EN corpus. The reason might be that FlowSeq does not rely much on knowledge distillation to alleviate the multi-modality problem.
Table 2 shows the BLEU scores of FlowSeq and baselines with advanced decoding methods such as iterative refinement, IWD and NPD rescoring. The first block in Table 2 includes the baseline results from the autoregressive Transformer. For the sampling procedure in IWD and NPD, we sampled from a reduced-temperature model (Kingma and Dhariwal, 2018) to obtain high-quality samples. We vary the temperature and select the best value based on the performance on development sets. The analysis of the impact of sampling temperature and other hyper-parameters on samples is in §4.4. For FlowSeq, NPD obtains better results than IWD, showing that FlowSeq still falls behind the autoregressive Transformer in modeling data distributions. Compared with CMLM (Ghazvininejad et al., 2019) with iterative refinement, which is a contemporaneous work that achieves state-of-the-art translation performance, FlowSeq obtains competitive performance on both WMT2014 and WMT2016 corpora, with only slight degradation in translation quality. Leveraging iterative refinement to further improve the performance of FlowSeq is left to future work.
In this section, we compare the decoding speed (measured as the average time in seconds required to decode one sentence) of FlowSeq at test time with that of the autoregressive Transformer model. We use the test set of WMT14 EN-DE for evaluation and all experiments are conducted on a single NVIDIA TITAN X GPU.
First, we investigate how the decoding batch size affects the decoding speed. Figure 4(a) shows that for both FlowSeq and Transformer, decoding is faster when using a larger batch size. However, FlowSeq gains much more in decoding speed as the batch size increases, reaching a speed-up of 594% for the base model and 403% for the large model when using a batch size of 128. We hypothesize that this is because the operations in FlowSeq are more friendly to batching, while the Transformer model with beam search at test time benefits less from batching.
Next, we examine whether sentence length is a major factor affecting the decoding speed. We bucket the test data by target sentence length. From Fig. 4(b), we can see that as the sentence length increases, FlowSeq achieves almost constant decoding time while Transformer has a linearly increasing decoding time. The relative decoding speed-up of FlowSeq versus Transformer increases linearly with the sequence length. The potential of decoding long sequences in constant time is an attractive property of FlowSeq.
In Fig. 5, we analyze how different sampling hyper-parameters affect the performance of rescoring. First, we observe that the number of samples for each length is the most important factor. The performance always improves with a larger sample size. Second, a larger number of length candidates does not necessarily increase the rescoring performance. Third, we find that a larger sampling temperature (0.3 to 0.5) can increase the diversity of translations and leads to better rescoring BLEU. However, the latent samples become noisy when a large temperature (1.0) is used.
Following Shen et al. (2019), we analyze the output diversity of FlowSeq. Shen et al. (2019) proposed pairwise-BLEU and BLEU computed in a leave-one-out manner to calibrate the diversity and quality of translation hypotheses. A lower pairwise-BLEU score implies a more diverse hypothesis set, and a higher BLEU score implies better translation quality. We experiment on a subset of the test set of WMT14 EN-DE with ten references for each sentence (Ott et al., 2018). In Fig. 6, we compare FlowSeq with other multi-hypothesis generation methods (ten hypotheses for each sentence) to analyze the diversity and quality of FlowSeq's outputs. The right corner area of the figure indicates the ideal generations: high diversity and high quality. While FlowSeq still lags behind the autoregressive generations, increasing the sampling temperature provides a way of generating more diverse outputs while keeping the translation quality almost unchanged. More analysis of translation outputs and detailed results are provided in Appendices D and E.
We propose FlowSeq, an efficient and effective model for non-autoregressive sequence generation using generative flows. One potential direction for future work is to leverage iterative refinement techniques such as masked language models to further improve translation quality. Another exciting direction is to investigate, theoretically and empirically, the latent space in FlowSeq, providing deeper insights into the model and possibly enhancing controllable text generation.
This work was supported in part by DARPA grant FA87501820018 funded under the AIDA program and grant HR001115C0114 funded under the LORELEI program. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of DARPA. The authors thank Amazon for their gift of AWS cloud credits and anonymous reviewers for their helpful suggestions.
Burda et al. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
Ioffe and Szegedy (2015). In International Conference on Machine Learning, pp. 448–456.
Lee et al. (2018). In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1173–1182.
Rush et al. (2015). A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 379–389.
Vinyals et al. (2015). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164.

Actnorm. Log-determinant: $T \cdot \mathrm{sum}(\log |s|)$
Invertible multi-head linear. Log-determinant: $T \cdot h \cdot \log |\det(W)|$, where $h$ is the number of heads.
Affine coupling. Log-determinant: $\mathrm{sum}(\log |s|)$
Model  Dimensions (Model/Hidden)  #Params
Transformer-base  512/2048  65M
Transformer-large  1024/4096  218M
FlowSeq-base  256/512  73M
FlowSeq-large  512/2014  258M
In Fig. 7, we plot the train and dev loss together with dev BLEU scores for the first 50 epochs. We can see that the reconstruction loss increases at the initial stage of training, then starts to decrease when training with the full KL loss. In addition, we observed that FlowSeq does not suffer from the KL collapse problem (Bowman et al., 2015; Ma et al., 2019). This is because the decoder of FlowSeq is non-autoregressive, with the latent variable as the only input.

Source  Grundnahrungsmittel gibt es schließlich überall und jeder Supermarkt hat mittlerweile Sojamilch und andere Produkte. 
Ground Truth  There are basic foodstuffs available everywhere , and every supermarket now has soya milk and other products. 
Sample 1  After all, there are basic foods everywhere and every supermarket now has soya amch and other products. 
Sample 2  After all, the food are available everywhere everywhere and every supermarket has soya milk and other products. 
Sample 3  After all, basic foods exist everywhere and every supermarket has now had soy milk and other products. 
Source  Es kann nicht erklären, weshalb die National Security Agency Daten über das Privatleben von Amerikanern sammelt und warum Whistleblower bestraft werden, die staatliches Fehlverhalten offenlegen. 
Ground Truth  And, most recently, it cannot excuse the failure to design a simple website more than three years since the Affordable Care Act was signed into law. 
Sample 1  And recently, it cannot apologise for the inability to design a simple website in the more than three years since the adoption of Affordable Care Act. 
Sample 2  And recently, it cannot excuse the inability to design a simple website in more than three years since the adoption of Affordable Care Act. 
Sample 3  Recently, it cannot excuse the inability to design a simple website in more than three years since the Affordable Care Act has passed. 
Source  Doch wenn ich mir die oben genannten Beispiele ansehe, dann scheinen sie weitgehend von der Regierung selbst gewählt zu sein. 
Ground Truth  Yet, of all of the examples that I have listed above, they largely seem to be of the administration’s own choosing. 
Sample 1  However, when I look at the above mentioned examples, they seem to be largely elected by the government itself. 
Sample 2  But if I look at the above mentioned examples, they seem to have been largely elected by the government itself. 
Sample 3  But when I look at the above examples, they seem to be largely chosen by the government itself. 
Source  Damit wollte sie auf die Gefahr von noch größeren Ruinen auf der Schweizer Wiese hinweisen  sollte das Riesenprojekt eines Tages scheitern. 
Ground Truth  In so doing they wanted to point out the danger of even bigger ruins on the Schweizer Wiese  should the huge project one day fail. 
Sample 1  In so doing, it wanted to highlight the risk of even greater ruins on the Swiss meadow  the giant project should fail one day. 
Sample 2  In so doing, it wanted to highlight the risk of even greater ruins on the Swiss meadow  if the giant project fail one day. 
Sample 3  In doing so, it wanted point out the risk of even greater ruins on the Swiss meadow  the giant project would fail one day. 
In Tab. 4, we present randomly picked translation outputs from the test set of WMT14 DE-EN. For each German input sentence, we pick three hypotheses from 30 samples. We have the following observations: First, in most cases, FlowSeq can accurately express the meaning of the source sentence, sometimes in a different way from the reference sentence, which cannot be precisely reflected by the BLEU score. Second, by controlling the sampling hyper-parameters such as the number of length candidates, the sampling temperature and the number of samples under each length, FlowSeq is able to generate diverse translations expressing the same meaning. Third, repetition and broken translations also exist in some cases due to the lack of language model dependencies in the decoder.
Table 5 shows the detailed results of translation diversity.
Models  Pairwise BLEU  LOO BLEU  
Human  –  35.48  69.07  
Sampling  –  24.10  37.80  
Beam Search  –  73.00  69.90  
HardMoE  –  50.02  63.80  

0.1  79.39  61.61  
0.2  72.12  61.05  
0.3  67.85  60.79  
0.4  64.75  60.07  
0.5  61.12  59.54  
1.0  43.53  52.86  

0.1  70.32  60.54  
0.2  66.45  60.21  
0.3  63.72  59.81  
0.4  61.29  59.47  
0.5  58.49  58.80  
1.0  42.93  52.58  

0.1  62.21  58.70  
0.2  59.74  58.59  
0.3  57.57  57.96  
0.4  55.66  57.45  
0.5  53.49  56.93  
1.0  39.75  50.94 