1 Introduction and Motivation
Standard training of neural sequence-to-sequence (seq2seq) models requires the construction of a cross-entropy loss (Sutskever et al., 2014; Lipton et al., 2015). This loss normally operates at the level of generating individual tokens in the target sequence, and hence potentially suffers from label or observation bias (Wiseman and Rush, 2016; Pereyra et al., 2017). Thus, it might be difficult for neural seq2seq models to capture semantics at the sequence level. This may be detrimental when the generated sequence lacks some desired property: for example, avoiding repetitions; preserving a consistent source/target length ratio; scoring well on external evaluation measures such as ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002) in summarisation and translation tasks, respectively; or avoiding omissions and additions of semantic material in natural language generation. Such sequence properties, in other words, can be associated with prior knowledge about the sequences the model aims to generate.
In fact, the cross-entropy loss with sequence-level constraints is intractable. In order to inject such prior knowledge into seq2seq models, methods from reinforcement learning (RL) (Sutton and Barto, 1998) emerge as reasonable choices. In principle, RL is a general-purpose framework for sequential decision-making processes. In RL, an agent interacts with an environment over a certain number of discrete timesteps (Sutton and Barto, 1998). The ultimate goal of the agent is to select actions according to a policy that maximises a future cumulative reward. This reward is the objective function of RL guided by the policy, and is defined specifically for the application task. Considering seq2seq models in our case, the action of choosing the next word is guided by a stochastic policy and receives a task-specific, real-valued reward; the agent tries to maximise the expected cumulative reward over timesteps. The idea of RL has recently been applied to a variety of neural seq2seq tasks. For instance, Ranzato et al. (2015) applied it to abstractive summarisation with neural seq2seq models, using the ROUGE evaluation measure (Lin, 2004) as a reward. Similarly, some success has also been achieved for neural machine translation (Ranzato et al., 2015; He et al., 2016, 2017). Ranzato et al. (2015) and He et al. (2017) used the BLEU score (Papineni et al., 2002) as a reward function in their RL setups, whereas He et al. (2016)
used a reward interpolating the probabilistic scores from reverse-translation and language models.
1.1 Why Moment Matching?
The main motivation of moment matching (MM) is to inject prior knowledge into the model in a way that takes the properties of whole sequences into consideration. We aim to develop a generic method that is applicable to any seq2seq model.
Inspired by the method of moments in statistics (https://en.wikipedia.org/wiki/Method_of_moments_(statistics)), we propose the following moment matching approach. The underlying idea of moment matching is to seek model parameters reconciling two distributions: one from samples generated by the model, and one from the empirical data. These distributions are compared on generated sequences as a whole, via feature functions or constraints that one would like to behave similarly between the two distributions, encoding the prior knowledge about sequences. It is worth noting that the proposed moment matching technique is not standalone, but is to be used in alternation or combination with standard cross-entropy training. This is similar to the way RL is typically applied in seq2seq models (Ranzato et al., 2015).
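To make the idea concrete, here is a minimal, hypothetical sketch with a single whole-sequence feature (sequence length); the data, function names, and feature choice are all illustrative, not the paper's actual setup:

```python
# Toy sketch of the moment matching idea: compare the average value of a
# whole-sequence feature under model samples with its empirical average
# under reference data. A large gap signals that the model violates the
# prior encoded by the feature.

def length_feature(seq):
    """A simple whole-sequence feature: the number of tokens."""
    return float(len(seq))

def average_feature(sequences, feature):
    """Monte Carlo / empirical average of a feature over sequences."""
    return sum(feature(s) for s in sequences) / len(sequences)

# Hypothetical data: the model's samples are too short relative to references.
model_samples = [["a", "b"], ["a"], ["a", "b", "c"]]
references = [["a", "b", "c"], ["a", "b", "c", "d"]]

model_avg = average_feature(model_samples, length_feature)    # 2.0
empirical_avg = average_feature(references, length_feature)   # 3.5
mm_loss = (model_avg - empirical_avg) ** 2                    # squared moment gap
```

Training with the MM loss would push the model's average feature value toward the empirical one, here encouraging longer outputs.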
Here, we first discuss some important differences with RL; the details of how the MM technique works follow in the next sections.
The first difference is that RL assumes that one has defined some reward function, which is done quite independently of what the training data tells us. By contrast, MM only assumes that one has defined certain features deemed important for the task, and then relies on the actual training data to tell us how these features should behave. One could say that the "arbitrariness" in MM lies only in the choice of features to focus on, while the arbitrariness in RL is that we want the model to obtain a good reward, even if that reward is not connected to the training data at all.
Suppose that we are in the context of NLG and are trying to reconcile several objectives at the same time, such as (1) avoiding omissions of semantic material, (2) avoiding additions of semantic material, and (3) avoiding repetitions (Agarwal and Dymetman, 2017). In general, to address this kind of problem in an RL framework, we need to "invent" a reward function based on certain computable features of the model outputs, which in particular means inventing a formula for combining the different objectives we have in mind into a single real number. This can be a rather arbitrary process, and it does not guarantee any fit with actual training data. The point of MM is that the only arbitrariness is in choosing the features to focus on; after that, the actual training data tells us what should be done.
The second difference is that RL tries to maximise a reward, and is only sensitive to the rewards of individual instances, while MM tries to maximise the fit of the model distribution with the empirical distribution, where the fit is measured on specific features.
This difference is especially clear in the case of language modelling, where RL will try to find a model that is strongly peaked on the output with the strongest reward (assuming no ties in the rewards), while MM will try to find a distribution over outputs that has certain properties in common with the empirical distribution, e.g., generating diverse outputs. For language modelling, RL is a strange method, because language modelling requires the model to be able to produce different outputs; for MT, the situation is a bit less clear, in case one wanted to argue that for each source sentence there is a single best translation; but in principle the observation also holds for MT, which is a conditional language model.
2 Proposed Model
In this section, we will describe our formulation of moment matching for seq2seq modeling in detail.
2.1 Moment Matching for Sequence to Sequence Models
Recall the sequence-to-sequence problem, whose goal is to generate an output sequence given an input sequence. In the context of neural machine translation, which is our main focus here, the input sequence is a source-language sentence and the output sequence is a target-language sentence.
Suppose that we are modeling the target sequence given a source sequence, using a sequential (autoregressive) process. This sequential process can be implemented via a neural mechanism, e.g., recurrent neural networks within an (attentional) encoder-decoder framework (Bahdanau et al., 2015) or a transformer framework (Vaswani et al., 2017). Regardless of its implementation, such a neural mechanism depends on model parameters.

Our proposal is that we would like this sequential process to satisfy some moment constraints. Such moment constraints can be modeled based on features that encode prior (or external) knowledge or semantics about the generated target sentence. Mathematically, the features can be represented as a vector, where each component is a conditional feature function of a target sequence given a source sequence, and the vector dimension is the number of features or moment constraints. As a simple example, a moment feature for controlling the length of a target sequence would just return the number of elements in that target sequence.

2.2 Formulation of the MM Objective Function
In order to incorporate such constraints into the seq2seq learning process, we introduce a new objective function, namely the moment matching loss. Generally speaking, given a vector of features, the goal of the moment matching loss is to encourage the identity of the model average estimate,
with the empirical average estimate,
(1) 
where the empirical average is taken over the source and target sequence pairs of the training data. This can be formulated as minimising a squared distance between the two average estimates with respect to the model parameters:
(2) 
To be more precise, the model average estimate is taken over samples drawn i.i.d. from the model distribution given the source sequence, while the empirical average estimate is computed from the training instances, which are drawn i.i.d. from the empirical data distribution.
2.3 Derivation of the Moment Matching Gradient
We now show how to compute the gradient of the moment matching loss in Equation 2, which is required for optimisation. We first define:
then the gradient can be computed as:
(3) 
Next, we need to compute the gradient of the model average estimate. We have the following:
(4) 
Proof.
Mathematically, this is the gradient of the composition of two functions. Noting that the gradient is equal to the Jacobian, and applying the chain rule for Jacobians, we have:
(5) 
Next, we need to compute the two factors in Equation 5. First, we have:
(6) 
where the quantities involved are vectors of the same size. We also have:
(7) 
A key part of the identities in Equation 7 is the gradient of the model distribution, which can be expressed as:
(8) 
Next, using the well-known "log-derivative trick" from the policy gradient technique in reinforcement learning (Sutton et al., 2000),

$\nabla_\theta\, p_\theta(y \mid x) = p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x),$

we can rewrite Equation 8 as follows:
(9) 
Combining Equations 8 and 9, we have:
so in turn we obtain the computation of the gradient of the model average estimate. Note that the expectation is easy to approximate by sampling, and the gradient of the log-probability is easy to evaluate as well.
Since we already have the computations of both required terms, we can finalise the gradient computation as follows:
By the reasoning just made, we obtain the gradient of the moment matching loss, which is the central formula of the proposed moment matching technique. ∎
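The log-derivative trick underlying this derivation can be checked on a toy model. The sketch below is ours, not the paper's estimator: it uses a small categorical model with softmax parameters and a hypothetical per-outcome feature, and compares the exact gradient of the expected feature with its score-function Monte Carlo estimate:

```python
import numpy as np

# Score-function (log-derivative trick) check on a toy categorical model
# p_theta(y) = softmax(theta)_y over 3 outcomes, with feature phi(y).
# Identity being verified: grad_theta E[phi] = E[phi(y) * grad_theta log p(y)].

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.5])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(theta)
phi = np.array([0.0, 1.0, 2.0])  # hypothetical per-outcome feature values

# Exact gradient via the softmax Jacobian: d E[phi] / d theta_k
#   = p_k * (phi_k - E[phi])
exact = p * (phi - p @ phi)

# Monte Carlo estimate: for a sampled y, grad_theta log p(y) = e_y - p.
samples = rng.choice(3, size=200_000, p=p)
onehot = np.eye(3)[samples]                       # one sample per row
scores = onehot - p                               # grad log p per sample
mc = (phi[samples][:, None] * scores).mean(axis=0)
```

With 200k samples the Monte Carlo estimate agrees with the exact gradient to within roughly 0.01 per component; the same identity is what makes the MM gradient estimable by sampling.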
Table 1: Comparison of CE, RL with Policy Gradient (PG), and MM training, in the unconditional and conditional cases.

Method | Formulation | Note

Unconditional Case
CE
RL w/ PG
MM

Conditional Case
CE
RL w/ PG
MM
2.4 MM training vs CE training vs RL training with Policy Gradient
Based on Equation 4, and ignoring the constant factor, we can use as our gradient update, for each training pair, the value
where the empirical average of the features can be estimated through the observed value on the reference target.
Note that the above gradient update draws a very close connection to the RL policy gradient method (Sutton et al., 2000), where the "multiplication score" plays a similar role to the reward; however, unlike RL training, which uses a predefined reward, the major difference in MM training is that MM's multiplication score does depend on the model parameters and looks at what the empirical data tells the model via explicit prior features. Table 1 compares the three methods, namely CE, RL with Policy Gradient (PG), and our proposed MM, for neural seq2seq models in both the unconditional (e.g., language modelling) and conditional (e.g., NMT, summarisation) cases.
2.5 Computing the Moment Matching Gradient
We have derived the gradient of the moment matching loss as shown in Equation 4. In order to compute it, we still need to evaluate two estimates, namely the model average estimate and the empirical average estimate.
Empirical Average Estimate.
First, we need to estimate the empirical average. In the general case, given a source sequence, there may be multiple associated target sequences, and the empirical average is taken over all of them. When we have only one reference sequence per source sequence, which is the standard case in neural machine translation training, the empirical average is simply the feature value of that reference.
Model Average Estimate.
In practice, it is impossible to compute the model average exactly, due to the intractable search over the space of target sequences. Therefore, we resort to estimating it by a sampling process. There are several possible options for doing this.
The simplest approach would be to:
First, estimate the model average by sampling and then estimating:
Next, estimate the expectation in Equation 4 by independently sampling a second set of values, and then estimate:
Note that the two sample sets are separate. This would provide an unbiased estimate of the gradient, but at the cost of producing two independent sample sets used for two different purposes, which would be computationally wasteful.
A more economical approach might consist in using the same sample set for both purposes. However, this would produce a biased estimate of the gradient. This can be illustrated by considering the single-sample case: there, the model average estimate coincides with the feature of the current sample, so the dot product involved is a squared norm and hence strictly positive; the current sample would thus be systematically discouraged by the model.
Here, we propose a better approach, resulting in an unbiased estimate of the gradient, formulated as follows:
First, we sample J values from the model distribution; then:
(10) 
where
(11) 
We can then prove that this computation provides an unbiased estimate of the gradient (see §2.6). Note that here we have exploited the same J samples for both purposes, but have taken care not to exploit the exact same sample for both, akin to a Jackknife resampling estimator (https://en.wikipedia.org/wiki/Jackknife_resampling).
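The leave-one-out construction in Equations 10-11 can be sketched as follows; the function and variable names are ours, and the feature values stand in for whatever whole-sequence features are used:

```python
# Jackknife-style leave-one-out model averages (in the spirit of Eqs. 10-11):
# for each sample y_j, the model average used in its gradient term is computed
# from the J-1 OTHER samples only, which keeps the overall stochastic
# gradient unbiased while reusing a single sample set.

def leave_one_out_averages(feature_values):
    """feature_values[j] is the feature of sample y_j; returns, for each j,
    the mean feature over all samples except y_j."""
    J = len(feature_values)
    assert J >= 2, "need at least two samples to leave one out"
    total = sum(feature_values)
    return [(total - f) / (J - 1) for f in feature_values]
```

For example, `leave_one_out_averages([1.0, 2.0, 3.0])` returns `[2.5, 2.0, 1.5]`: each entry is the average of the other two samples, so no sample's own feature enters its own multiplication score.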
2.6 Proof of Unbiasedness
For simplicity, we will consider here the "unconditional case", dropping the conditioning on the source sequence; the conditional case follows easily.
Lemma 2.1.
We wish to compute the following expected quantity:
Let us sample J sequences independently from the model distribution (where J is a predefined number of generated samples), and let us compute:
where the inner term is formulated as:
Then we have :
in other words, this estimator provides an unbiased estimate of the desired quantity.
Proof.
See Appendix. ∎
To ground this lemma in our problem setting, note that with the appropriate instantiations, the quantity in question is equal to the overall gradient of the MM loss, for a given value of the model parameters (by formula (4) obtained earlier, and up to a constant factor). We would like to obtain an unbiased stochastic estimator of this gradient; in other words, we want an unbiased estimator of this quantity.
By Lemma 2.1, this quantity is equal to the expectation of the estimator computed on J samples drawn i.i.d. from the model distribution. In other words, if we sample one set of J samples and compute the estimator on this set, then we obtain an unbiased estimate of the quantity. As a result, we obtain an unbiased estimate of the gradient of the overall MM loss, which is exactly what we need.
In principle, therefore, we need to first sample the J sequences, and to compute
and then use this quantity as our stochastic gradient. In practice, what we do is to first sample the J sequences, and then use the components of the sum:
as our individual stochastic gradients. Note that this computation differs from the original one by a constant factor, which can be accounted for by adjusting the learning rate.
2.7 Training with the Moment Matching Technique
Recall that the goal of our technique is to preserve certain aspects of generated target sequences according to prior knowledge. In principle, the technique does not teach the model how to generate a proper target sequence given the source sequence (Ranzato et al., 2015). For that reason, it has to be used along with standard CE training of the seq2seq model. In order to train the seq2seq model with the proposed technique, we suggest using one of two training modes: alternation and interpolation. In the alternation mode, the seq2seq model is trained alternately with the CE loss and the moment matching loss; more specifically, the model is initially trained with the CE loss for some iterations, then switches to the moment matching loss, and vice versa. In the interpolation mode, the model is trained with a single objective interpolating the two losses, with an additional hyperparameter balancing them. The general technique is summarised in Algorithm 1.
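The two training modes can be sketched as below; `switch_every` and `alpha` are illustrative hyperparameter names of our own choosing, standing in for whatever schedule and balance the algorithm actually uses:

```python
# Sketch of the two training modes for combining CE and MM losses.

def training_mode(step, switch_every=1000):
    """Alternation mode: train with CE for a block of steps, then MM,
    and so on, switching every `switch_every` steps."""
    return "mm" if (step // switch_every) % 2 == 1 else "ce"

def interpolated_loss(ce_loss, mm_loss, alpha=0.5):
    """Interpolation mode: a single objective balancing the two losses
    with hyperparameter alpha."""
    return alpha * ce_loss + (1.0 - alpha) * mm_loss
```

In the alternation mode the optimiser sees one loss at a time; in the interpolation mode every update mixes both, with `alpha` controlling how strongly the moment constraints regularise the CE objective.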
After some iterations of the algorithm, we can approximate the moment matching loss over the development data (or sampled training data) through:
(12) 
We expect this loss to decrease over iterations, potentially improving the explicit evaluation measure(s), e.g., BLEU (Papineni et al., 2002) in NMT.
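A sketch of this monitoring computation, in the spirit of Equation 12: for each held-out pair, the squared gap between the model average feature (over J samples) and the reference feature is averaged over the dataset. `sample_fn` and `feature` are placeholder names for the model sampler and the feature function:

```python
# Approximate MM loss over a dataset of (source, reference) pairs.

def mm_loss_over_data(pairs, sample_fn, feature, J=5):
    """pairs: iterable of (x, y); sample_fn(x, J) draws J model samples for x;
    feature(x, s) is a real-valued whole-sequence feature."""
    total = 0.0
    for x, y in pairs:
        samples = sample_fn(x, J)
        model_avg = sum(feature(x, s) for s in samples) / J
        total += (model_avg - feature(x, y)) ** 2   # squared moment gap
    return total / len(pairs)
```

Tracking this quantity on development data over training iterations gives a direct measure of how well the moment constraints are being satisfied.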
3 Connections to Previous Work
Maximum Mean Discrepancies (MMD).
Our MM approach is related to the technique of Maximum Mean Discrepancies, which has been successfully applied in computer vision, e.g., as an alternative to learning generative adversarial networks (Li et al., 2015, 2017). MMD is a way of measuring the discrepancy between two distributions (for example, the empirical distribution and the model distribution) based on kernel-based similarities. The use of such kernels could potentially be useful in the long term to extend our approach, which can be seen as using a simple linear kernel over our predefined features, but in the specific context of seq2seq models, and in tandem with an autoregressive generative process.

The Method of Moments.
Recently, Ravuri et al. (2018) proposed a moment matching technique for situations where maximum likelihood is difficult to apply. A strong difference with the way we use MM is that they define feature functions parameterised by some parameters, which are learned along with the model parameters. In effect, they apply the method of moments to situations in which ML (maximum likelihood, or CE) is not applicable, but where MM can find the correct model distribution on its own; hence their focus on having (and learning) a large number of features, because only many features allow approximating the actual distribution. In our case, we are not relying on MM to model the target distribution on its own. Doing so with a small number of features would be doomed (e.g., the length-ratio feature alone would only guarantee that a translation has a reasonable length, irrespective of its lexical content). We use MM to complement ML, in such a way that task-related important features are attended to, even if that means a (slightly) worse likelihood (or perplexity) on the training set. One can in fact see our use of MM as a form of regularisation complementing MLE training, and this is an important aspect of our proposed approach.
4 Preliminary Experiments
4.1 Prior Features for NMT
In order to validate the proposed technique, we re-applied two prior features used for training NMT in (Zhang et al., 2017): the source/target length ratio, and lexical bilingual features. Zhang et al. (2017) showed in their experiments that these two are the most effective features for improving NMT systems.
The first feature is straightforward: it measures the ratio between source and target lengths. It aims at forcing the model to produce translations with a consistent length ratio between source and target sentences, so that too-short or too-long translations are avoided.
Given respective source and target sequences, we define the source/target length-ratio feature function as follows:
(13) 
where the length factor is an additional hyperparameter, normally set empirically based on prior knowledge about the source and target languages. In this case, the feature function is real-valued.
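Since Equation 13 is not reproduced above, the following is only a guess at its spirit: a real-valued feature relating target length to source length through a language-pair factor (called `beta` here; both names are our own):

```python
# Hypothetical sketch of a source/target length-ratio feature: the ratio of
# the target length to the expected target length beta * |source|, where beta
# encodes prior knowledge about the language pair (e.g., typical expansion).

def length_ratio_feature(source_tokens, target_tokens, beta=1.0):
    """Returns 1.0 when the target has exactly the expected length."""
    return len(target_tokens) / (beta * len(source_tokens))
```

Matching the model's average value of such a feature to its empirical average then discourages systematically over- or under-long translations.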
The second feature we used is based on a word-to-word lexical translation dictionary produced by an off-the-shelf SMT system such as Moses (https://github.com/moses-smt/mosesdecoder). The goal of this feature is to make the model take external lexical translations into consideration. This feature is potentially useful for translating rare words, and in low-resource settings where parallel data is scarce (NMT has been empirically found to be less robust in such settings than SMT). Following Zhang et al. (2017), we defined sparse feature functions
based on whether dictionary translations of source words appear in the target,
where the dictionary is a lexical translation dictionary produced by Moses.
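A sketch of such sparse lexical features, in the spirit of Zhang et al. (2017); the exact feature definition is not reproduced above, so the names and the 0/1 firing rule here are our own illustrative choices:

```python
# Hypothetical sparse bilingual lexical features: for each source word that
# has a dictionary entry, fire 1.0 if any of its licensed translations
# appears in the target sequence, else 0.0.

def lexical_features(source_tokens, target_tokens, lex_dict):
    """lex_dict maps a source word to a collection of candidate translations
    (e.g., extracted from Moses lexical tables)."""
    target_set = set(target_tokens)
    feats = {}
    for s in source_tokens:
        if s in lex_dict:
            feats[s] = 1.0 if target_set & set(lex_dict[s]) else 0.0
    return feats
```

Under moment matching, the model's average firing rate for these features is pushed toward the empirical rate observed in the references, encouraging the model to respect the external dictionary.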
4.2 Datasets and Baseline
We proceed to validate the proposed technique with small-scale experiments. We used the IWSLT'15 English-Vietnamese dataset. This dataset is relatively small, containing approximately 133K sentence pairs for training, 1.5K for development, and 1.3K for testing. We re-implemented the transformer architecture (Vaswani et al., 2017) for training our NMT model in our open-source toolkit (https://github.com/duyvuleo/Transformer-DyNet), with the following hyperparameters: 4 encoder and 4 decoder layers, hidden dimension 512, and dropout probability 0.1 throughout the network. For the sampling process, we generated 5 samples for each moment matching training step. We used the interpolation training mode with a balancing hyperparameter of 0.5; changing this hyperparameter only slightly affects the overall result. For the source/target length-ratio feature, the length factor was set based on prior knowledge of the language pair. For the bilingual lexical dictionary feature, we extracted the dictionary with Moses's training scripts, filtering out bad entries whose word alignment probability was below a threshold of 0.5, following Zhang et al. (2017).
4.3 Results and Analysis
Table 2:

Method | BLEU | MM Loss
tensor2tensor (Vaswani et al., 2017) | 27.69 | -
base (our reimplementation, Transformer-DyNet) | 28.53 | 0.0094808
base+mm | 29.17 | 0.0068775
Table 3:

Method | BLEU | MM Loss
tensor2tensor (Vaswani et al., 2017) | 27.69 | -
base (our reimplementation, Transformer-DyNet) | 28.53 | 0.7384
base+mm | 29.11 | 0.7128
Our results are shown in Tables 2 and 3. As can be seen from the tables, as the model reduced the moment matching loss, the BLEU scores (Papineni et al., 2002) improved with statistical significance (Koehn, 2004). This was consistent across both experiments, an encouraging validation of our proposed moment matching training technique.
5 Conclusion
We have shown some appealing mathematical properties of the proposed moment matching training technique (in particular, the unbiasedness of its gradient estimate) and believe it is promising. Our initial experiments indicate its potential for improving existing NMT systems using simple prior features. Future work includes exploiting more advanced features for improving NMT and evaluating the proposed technique on larger-scale datasets.
Acknowledgments
Cong Duy Vu Hoang would like to thank NAVER Labs Europe for supporting his internship; and Reza Haffari and Trevor Cohn for their insightful discussions. Marc Dymetman wishes to thank Eric Gaussier and Shubham Agarwal for early discussions on the topic of moment matching.
References
 Agarwal and Dymetman (2017) Shubham Agarwal and Marc Dymetman. 2017. A surprisingly effective out-of-the-box char2char model on the E2E NLG Challenge dataset. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. Association for Computational Linguistics, pages 158–163.
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proc. of 3rd International Conference on Learning Representations (ICLR2015).
 He et al. (2017) Di He, Hanqing Lu, Yingce Xia, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2017. Decoding with value networks for neural machine translation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pages 178–187.
 He et al. (2016) Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual Learning for Machine Translation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, Curran Associates, Inc., pages 820–828. http://papers.nips.cc/paper/6469-dual-learning-for-machine-translation.pdf.

 Koehn (2004) Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Dekang Lin and Dekai Wu, editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Barcelona, Spain, pages 388–395.
 Li et al. (2017) Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos. 2017. MMD GAN: Towards deeper understanding of moment matching network. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pages 2203–2213.

 Li et al. (2015) Yujia Li, Kevin Swersky, and Richard Zemel. 2015. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning. JMLR.org, ICML'15, pages 1718–1727.
 Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. http://www.aclweb.org/anthology/W04-1013.
 Lipton et al. (2015) Z. C. Lipton, J. Berkowitz, and C. Elkan. 2015. A Critical Review of Recurrent Neural Networks for Sequence Learning. ArXiv e-prints.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, ACL '02, pages 311–318. https://doi.org/10.3115/1073083.1073135.
 Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing Neural Networks by Penalizing Confident Output Distributions. CoRR abs/1701.06548. http://arxiv.org/abs/1701.06548.
 Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR abs/1511.06732. http://arxiv.org/abs/1511.06732.
 Ravuri et al. (2018) Suman V. Ravuri, Shakir Mohamed, Mihaela Rosca, and Oriol Vinyals. 2018. Learning implicit generative models with the method of learned moments. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, pages 4311–4320. http://proceedings.mlr.press/v80/ravuri18a.html.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems. MIT Press, Cambridge, MA, USA, NIPS’14, pages 3104–3112.
 Sutton and Barto (1998) Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement learning: An introduction. IEEE Transactions on Neural Networks 16:285–286.
 Sutton et al. (2000) Richard S Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In S. A. Solla, T. K. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12, MIT Press, pages 1057–1063.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pages 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
 Wiseman and Rush (2016) Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-Sequence Learning as Beam Search Optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 1296–1306. https://aclweb.org/anthology/D16-1137.
 Zhang et al. (2017) Jiacheng Zhang, Yang Liu, Huanbo Luan, Jingfang Xu, and Maosong Sun. 2017. Prior knowledge integration for neural machine translation using posterior regularization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 1514–1523. https://doi.org/10.18653/v1/P17-1139.
Appendix: Proof of Unbiasedness
Proof.
Let us define:
(14)  
For a given value of the index, we have:
(15)  
Finally, by collecting the results over the values of the index, we obtain:
(16)  
∎