1 Introduction and Motivation
Standard training of neural sequence-to-sequence (seq2seq) models requires the construction of a cross-entropy loss (Sutskever et al., 2014; Lipton et al., 2015). This loss normally operates at the level of individual tokens in the target sequence and hence potentially suffers from label or observation bias (Wiseman and Rush, 2016; Pereyra et al., 2017). Thus, it might be difficult for neural seq2seq models to capture semantics at the sequence level. This may be detrimental when the generated sequence lacks some desired properties, for example: avoiding repetitions; preserving a consistent source-to-target length ratio; favouring outputs that score well on external evaluation measures such as ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002) in summarisation and translation tasks, respectively; or avoiding omissions and additions of semantic material in natural language generation. Sequence properties, on the other hand, may be associated with prior knowledge about the sequence the model aims to generate.
In fact, the cross-entropy loss with sequence constraints is intractable. In order to inject such prior knowledge into seq2seq models, methods from reinforcement learning (RL) (Sutton and Barto, 1998) emerge as reasonable choices. In principle, RL is a general-purpose framework for sequential decision-making processes. In RL, an agent interacts with an environment over a certain number of discrete timesteps (Sutton and Barto, 1998). The ultimate goal of the agent is to select actions according to a policy $\pi$ that maximises a future cumulative reward. This reward is the objective function of RL guided by the policy $\pi$, and is defined specifically for the application task. Considering seq2seq models in our case, the action of choosing the next word is guided by a stochastic policy and receives a task-specific, real-valued reward $r_t$. The agent tries to maximise the expected reward over $T$ timesteps, e.g., $\mathcal{R} = \sum_{t=1}^{T} r_t$. The idea of RL has recently been applied to a variety of neural seq2seq tasks. For instance, Ranzato et al. (2015) applied this idea to abstractive summarisation with neural seq2seq models, using the ROUGE evaluation measure (Lin, 2004) as a reward. Similarly, some success has also been achieved in neural machine translation (Ranzato et al., 2015; He et al., 2016, 2017). Ranzato et al. (2015) and He et al. (2017) used the BLEU score (Papineni et al., 2002) as a reward function in their RL setups, whereas He et al. (2016) used a reward interpolating the probabilistic scores from reverse translation and language models.
1.1 Why Moment Matching?
The main motivation of moment matching (MM) is to inject prior knowledge into the model in a way that takes the properties of whole sequences into consideration. We aim to develop a generic method that is applicable to any seq2seq model.
Inspired by the method of moments in statistics (https://en.wikipedia.org/wiki/Method_of_moments_(statistics)), we propose the following moment matching approach. The underlying idea of moment matching is to seek optimal parameters reconciling two distributions: one from the samples generated by the model and another from the empirical data. Those distributions are used to evaluate the generated sequences as a whole, via feature functions or constraints that one would like to behave similarly under the two distributions, based on an encoding of the prior knowledge about sequences. It is worth noting that the proposed moment matching technique is not stand-alone, but is to be used in alternation or combination with standard cross-entropy training. This is similar to the way RL is typically applied in seq2seq models (Ranzato et al., 2015).
Here, we discuss some important differences from RL; the details of how the MM technique works are presented in the next sections.
The first difference is that RL assumes that one has defined some reward function $R$, which is done quite independently of what the training data tells us. By contrast, MM only assumes that one has defined certain features that are deemed important for the task, but one then relies on the actual training data to tell us how to use these features. One could say that the “arbitrariness” in MM is just in the choice of the features to focus on, while the arbitrariness in RL is that we want the model to get a good reward, even if that reward is not connected to the training data at all.
Suppose that we are in the context of NLG and are trying to reconcile several objectives at the same time, such as (1) avoiding omissions of semantic material, (2) avoiding additions of semantic material, and (3) avoiding repetitions (Agarwal and Dymetman, 2017). In general, in order to address this kind of problem in an RL framework, we need to “invent” a reward function based on certain computable features of the model outputs, which in particular means inventing a formula for combining the different objectives we have in mind into a single real number. This can be a rather arbitrary process, and it does not guarantee any fit with the actual training data. The point of MM is that the only arbitrariness is in choosing the features to focus on; after that, it is the actual training data that tells us what should be done.
The second difference is that RL tries to maximize a reward, and is only sensitive to the rewards of individual instances, while MM tries to maximize the fit of the model distribution with that of the empirical distribution, where the fit is on specific features.
This difference is especially clear in the case of language modelling, where RL will try to find a model that is strongly peaked on the sequence which has the strongest reward (assuming no ties in the rewards), while MM will try to find a distribution over sequences which has certain properties in common with the empirical distribution, e.g., for generating diverse outputs. For language modelling, RL is a strange method, because language modelling requires the model to be able to produce different outputs. For MT, the situation is a bit less clear, in case one wanted to argue that for each source sentence there is a single best translation; but in principle, the observation also holds for MT, which is a conditional language model.
2 Proposed Model
In this section, we will describe our formulation of moment matching for seq2seq modeling in detail.
2.1 Moment Matching for Sequence to Sequence Models
Recall the sequence-to-sequence problem, whose goal is to generate an output sequence given an input sequence. In the context of neural machine translation, which is our main focus here, the input sequence is a source language sentence and the output sequence is a target language sentence.
Suppose that we are modeling the target sequence $\mathbf{y} = y_1, \ldots, y_{|\mathbf{y}|}$ given a source sequence $\mathbf{x} = x_1, \ldots, x_{|\mathbf{x}|}$, using a sequential process $p_{\Theta}(\mathbf{y}|\mathbf{x})$. This sequential process can be implemented via a neural mechanism, e.g., recurrent neural networks within an (attentional) encoder-decoder framework (Bahdanau et al., 2015) or a transformer framework (Vaswani et al., 2017). Regardless of its implementation, such a neural mechanism depends on model parameters $\Theta$.
Our proposal is that we would like our sequential process to satisfy some moment constraints. Such moment constraints can be modeled based on features that encode prior (or external) knowledge or semantics about the generated target sentence. Mathematically, features can be represented through vectors, e.g., $\Phi(\mathbf{y}|\mathbf{x}) \equiv \left(\phi_1(\mathbf{y}|\mathbf{x}), \ldots, \phi_m(\mathbf{y}|\mathbf{x})\right)$, where $\phi_j(\mathbf{y}|\mathbf{x})$ is the conditional feature function of a target sequence $\mathbf{y}$ given a source sequence $\mathbf{x}$, and $m$ is the number of features or moment constraints. As a simple example, a moment feature for controlling the length of a target sequence would just return the number of elements in that target sequence.
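To make the notion of a feature vector concrete, the following is a minimal Python sketch rather than the authors' implementation. The function names and the three example features (length, length ratio, repetition count) are illustrative assumptions, and sequences are assumed to be lists of tokens.

```python
from typing import List, Sequence

def length_feature(target: Sequence[str], source: Sequence[str]) -> float:
    """Number of tokens in the target sequence (the source is unused here)."""
    return float(len(target))

def length_ratio_feature(target: Sequence[str], source: Sequence[str]) -> float:
    """Target/source length ratio; defined as 0.0 for an empty source."""
    return len(target) / len(source) if source else 0.0

def repetition_feature(target: Sequence[str], source: Sequence[str]) -> float:
    """Number of tokens that repeat an earlier token of the target."""
    return float(len(target) - len(set(target)))

# Hypothetical choice of m = 3 features for illustration.
FEATURES = (length_feature, length_ratio_feature, repetition_feature)

def feature_vector(target: Sequence[str], source: Sequence[str]) -> List[float]:
    """Phi(y|x): stack the m feature values into one vector."""
    return [phi(target, source) for phi in FEATURES]

print(feature_vector("the cat the mat".split(), "le chat".split()))
```

Each feature maps a whole (source, target) pair to a real number, so a constraint such as "outputs should not be systematically too long" becomes a statement about the average of one coordinate of this vector.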
2.2 Formulation of the MM Objective Function
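In order to incorporate such constraints into the seq2seq learning process, we introduce a new objective function, namely the moment matching loss $\mathcal{J}_{MM}$. Generally speaking, given a vector of features $\Phi(\mathbf{y}|\mathbf{x})$, the goal of the moment matching loss is to encourage the model average estimate,
$$\hat{\Phi}_n(\Theta) \equiv \mathbb{E}_{\mathbf{y} \sim p_{\Theta}(\cdot|\mathbf{x}_n)}\left[\Phi(\mathbf{y}|\mathbf{x}_n)\right]$$
to coincide with the empirical average estimate,
$$\bar{\Phi}_n \equiv \mathbb{E}_{\mathbf{y} \sim p_{\mathcal{D}}(\cdot|\mathbf{x}_n)}\left[\Phi(\mathbf{y}|\mathbf{x}_n)\right]$$
where $\mathcal{D}$ is the training data; $\mathbf{x}$ and $\mathbf{y}$ are source and target sequences in $\mathcal{D}$, respectively; and $n$ is the data index in $\mathcal{D}$, with $N = |\mathcal{D}|$. This can be formulated as minimising a squared distance between the feature averages under the two distributions with respect to model parameters $\Theta$:
$$\mathcal{J}_{MM}(\Theta) \equiv \frac{1}{N}\sum_{n=1}^{N} \left\| \hat{\Phi}_n(\Theta) - \bar{\Phi}_n \right\|_2^2 \tag{2}$$
To be more elaborate, $\hat{\Phi}_n(\Theta)$ is the model average estimate over the samples $\mathbf{y}$ which are drawn i.i.d. from the model distribution $p_{\Theta}(\cdot|\mathbf{x}_n)$ given the source sequence $\mathbf{x}_n$, and $\bar{\Phi}_n$ is the empirical average estimate given the $n$-th training instance, where our data are drawn i.i.d. from the empirical distribution $p_{\mathcal{D}}(\cdot|\mathbf{x}_n)$.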
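As an illustration of this loss, the sketch below evaluates the squared distance between model and empirical feature averages on a toy problem in which the model distribution over a handful of candidate outputs is listed explicitly, so the expectation can be computed exactly. The averaging over training instances, the helper names, and the toy data are assumptions made for this example; the feature function here takes only the target, with the source left implicit.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Seq = Tuple[str, ...]
Vector = List[float]

def model_average(dist: Dict[Seq, float], phi: Callable[[Seq], Vector]) -> Vector:
    """Exact expectation of phi under an explicitly listed distribution."""
    avg: Vector = []
    for y, prob in dist.items():
        values = phi(y)
        if not avg:
            avg = [0.0] * len(values)
        for j, value in enumerate(values):
            avg[j] += prob * value
    return avg

def empirical_average(refs: Sequence[Seq], phi: Callable[[Seq], Vector]) -> Vector:
    """Mean of phi over the reference outputs observed for one source."""
    rows = [phi(y) for y in refs]
    return [sum(col) / len(rows) for col in zip(*rows)]

def mm_loss(dists, refs_per_source, phi) -> float:
    """Average over sources of ||model average - empirical average||^2."""
    total = 0.0
    for dist, refs in zip(dists, refs_per_source):
        hat = model_average(dist, phi)
        bar = empirical_average(refs, phi)
        total += sum((h - b) ** 2 for h, b in zip(hat, bar))
    return total / len(dists)

# Toy data: a single length feature and two source sentences.
length = lambda y: [float(len(y))]
dists = [
    {("a",): 0.5, ("a", "b", "c"): 0.5},  # expected length 2.0
    {("a",): 1.0},                          # expected length 1.0
]
refs = [[("a", "b")], [("a", "b")]]         # both references have length 2
print(mm_loss(dists, refs, length))
```

For the first source the loss term is zero even though neither candidate equals the reference: only the average length has to match, which is the distribution-level behaviour the objective asks for.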
2.3 Derivation of the Moment Matching Gradient
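We now show how to compute the gradient of $\mathcal{J}_{MM}$ in Equation 2, denoted as $\nabla_{\Theta}\mathcal{J}_{MM}$, which will be required in optimisation. We first define:
$$\Gamma_{\Theta,n} \equiv \nabla_{\Theta} \left\| \hat{\Phi}_n(\Theta) - \bar{\Phi}_n \right\|_2^2$$
where $n \in \{1, \ldots, N\}$, then the gradient can be computed as:
$$\nabla_{\Theta}\mathcal{J}_{MM}(\Theta) = \frac{1}{N}\sum_{n=1}^{N} \Gamma_{\Theta,n}$$
Next, we need to proceed with the computation of $\Gamma_{\Theta,n}$. By derivation, we have the following:
$$\Gamma_{\Theta,n} = 2\, \mathbb{E}_{\mathbf{y} \sim p_{\Theta}(\cdot|\mathbf{x}_n)}\left[ \left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(\mathbf{y}|\mathbf{x}_n) - \bar{\Phi}_n \right\rangle \nabla_{\Theta} \log p_{\Theta}(\mathbf{y}|\mathbf{x}_n) \right] \tag{4}$$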
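Mathematically, we can say that $\Gamma_{\Theta,n}$ is the gradient of the composition $F \circ G$ of two functions, $F(\cdot) = \|\cdot\|_2^2$ and $G(\Theta) = \hat{\Phi}_n(\Theta) - \bar{\Phi}_n$.
Noting that the gradient $\nabla_{\Theta}(F \circ G)$ is equal to the Jacobian $J_{F \circ G}[\Theta]$, and applying the chain rule for Jacobians, we have:
$$J_{F \circ G}[\Theta] = J_F[G(\Theta)] \cdot J_G[\Theta] \tag{5}$$
Next, we need the computation for $J_F[G(\Theta)]$ and $J_G[\Theta]$ in Equation 5. First, we have:
$$J_F[G(\Theta)] = 2 \left( \hat{\Phi}_n(\Theta) - \bar{\Phi}_n \right)^{\top}$$
where $\hat{\Phi}_n(\Theta)$ and $\bar{\Phi}_n$ are vectors of size $m$. And we also have:
$$J_G[\Theta] = \left[ \frac{\partial \hat{\Phi}_{n,j}(\Theta)}{\partial \Theta_i} \right]_{j = 1, \ldots, m;\; i = 1, \ldots, |\Theta|} \tag{7}$$
A key part of these identities in Equation 7 is the value of $\partial \hat{\Phi}_{n,j}(\Theta) / \partial \Theta_i$, which can be expressed as:
$$\frac{\partial \hat{\Phi}_{n,j}(\Theta)}{\partial \Theta_i} = \sum_{\mathbf{y}} \phi_j(\mathbf{y}|\mathbf{x}_n)\, \frac{\partial p_{\Theta}(\mathbf{y}|\mathbf{x}_n)}{\partial \Theta_i}$$
Next, using the well-known “log-derivative trick”:
$$\frac{\partial p_{\Theta}(\mathbf{y}|\mathbf{x}_n)}{\partial \Theta_i} = p_{\Theta}(\mathbf{y}|\mathbf{x}_n)\, \frac{\partial \log p_{\Theta}(\mathbf{y}|\mathbf{x}_n)}{\partial \Theta_i}$$
so in turn we obtain the computation of $J_G[\Theta]$, whose entries are
$$\frac{\partial \hat{\Phi}_{n,j}(\Theta)}{\partial \Theta_i} = \mathbb{E}_{\mathbf{y} \sim p_{\Theta}(\cdot|\mathbf{x}_n)}\left[ \phi_j(\mathbf{y}|\mathbf{x}_n)\, \frac{\partial \log p_{\Theta}(\mathbf{y}|\mathbf{x}_n)}{\partial \Theta_i} \right]$$
Note that the expectation $\mathbb{E}_{\mathbf{y} \sim p_{\Theta}(\cdot|\mathbf{x}_n)}$ is easy to sample and the gradient $\nabla_{\Theta} \log p_{\Theta}(\mathbf{y}|\mathbf{x}_n)$ is easy to evaluate as well.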
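Since we already have the computations of $J_F[G(\Theta)]$ and $J_G[\Theta]$, we can finalise the gradient computation as follows:
$$\begin{aligned}
\Gamma_{\Theta,n} &= 2\, \mathbb{E}_{\mathbf{y} \sim p_{\Theta}(\cdot|\mathbf{x}_n)}\left[ \left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(\mathbf{y}|\mathbf{x}_n) \right\rangle \nabla_{\Theta} \log p_{\Theta}(\mathbf{y}|\mathbf{x}_n) \right] \\
&= 2\, \mathbb{E}_{\mathbf{y} \sim p_{\Theta}(\cdot|\mathbf{x}_n)}\left[ \left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(\mathbf{y}|\mathbf{x}_n) - \bar{\Phi}_n \right\rangle \nabla_{\Theta} \log p_{\Theta}(\mathbf{y}|\mathbf{x}_n) \right]
\end{aligned}$$
The second equality holds because $\mathbb{E}_{\mathbf{y} \sim p_{\Theta}(\cdot|\mathbf{x}_n)}\left[\nabla_{\Theta} \log p_{\Theta}(\mathbf{y}|\mathbf{x}_n)\right] = \nabla_{\Theta} \sum_{\mathbf{y}} p_{\Theta}(\mathbf{y}|\mathbf{x}_n) = 0$, so subtracting the constant $\bar{\Phi}_n$ does not change the expectation.
By the reasoning just made, we obtain the computation of $\Gamma_{\Theta,n}$, which is the central formula of the proposed moment matching technique. ∎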
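The gradient identity can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the authors' code: the "model" is a softmax over three candidate outputs parameterised directly by its logits (a stand-in for $\Theta$), the expectation is summed exactly instead of sampled, and all function names and numbers are hypothetical.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def model_average(logits, feats):
    """Expected feature vector under softmax(logits)."""
    probs = softmax(logits)
    return [sum(p * f[j] for p, f in zip(probs, feats)) for j in range(len(feats[0]))]

def squared_distance(logits, feats, emp_avg):
    avg = model_average(logits, feats)
    return sum((a - e) ** 2 for a, e in zip(avg, emp_avg))

def mm_gradient(logits, feats, emp_avg):
    """2 * E_y[<avg - emp, Phi(y) - emp> * grad log p(y)], summed exactly over y."""
    probs = softmax(logits)
    avg = model_average(logits, feats)
    diff = [a - e for a, e in zip(avg, emp_avg)]
    grad = [0.0] * len(logits)
    for y, (p_y, f_y) in enumerate(zip(probs, feats)):
        score = sum(d * (f - e) for d, f, e in zip(diff, f_y, emp_avg))
        for i in range(len(logits)):
            # For a softmax, d log p(y) / d logit_i = 1[y == i] - p_i.
            dlogp = (1.0 if i == y else 0.0) - probs[i]
            grad[i] += 2.0 * p_y * score * dlogp
    return grad

def finite_difference(logits, feats, emp_avg, eps=1e-6):
    """Central-difference gradient of the squared distance, for comparison."""
    grad = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += eps
        down[i] -= eps
        grad.append((squared_distance(up, feats, emp_avg)
                     - squared_distance(down, feats, emp_avg)) / (2 * eps))
    return grad

logits = [0.2, -0.5, 1.0]
feats = [[1.0, 0.0], [3.0, 1.0], [2.0, 2.0]]  # Phi(y) for each candidate y
emp_avg = [2.0, 1.0]
print(mm_gradient(logits, feats, emp_avg))
print(finite_difference(logits, feats, emp_avg))
```

The two printed vectors should agree up to finite-difference error, which is a quick sanity check to run before replacing the exact sum with sampling.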
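Table 1: Comparison of CE, RL w/ PG and MM training for neural seq2seq models in the unconditional and conditional cases.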
2.4 MM training vs CE training vs RL training with Policy Gradient
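Based on Equation 4, and ignoring the constant factor, we can use as our gradient update, for each pair $(\mathbf{x}_n, \mathbf{y}_j)$ with $\mathbf{y}_j \sim p_{\Theta}(\cdot|\mathbf{x}_n)$, the value
$$\left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(\mathbf{y}_j|\mathbf{x}_n) - \bar{\Phi}_n \right\rangle \nabla_{\Theta} \log p_{\Theta}(\mathbf{y}_j|\mathbf{x}_n)$$
where $\bar{\Phi}_n$, the empirical average of $\Phi(\cdot|\mathbf{x}_n)$, can be estimated through the observed value $\mathbf{y}_n$, i.e. $\bar{\Phi}_n \approx \Phi(\mathbf{y}_n|\mathbf{x}_n)$.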
Note that the above gradient update is closely connected to RL with the policy gradient method (Sutton et al., 2000), where the “multiplication score” $\left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(\mathbf{y}_j|\mathbf{x}_n) - \bar{\Phi}_n \right\rangle$ plays a similar role to the reward $R(\mathbf{y}_j)$. However, unlike RL training with a predefined reward $R$, MM’s multiplication score does depend on the model parameters $\Theta$ and looks at what the empirical data tells the model through explicit prior features. Table 1 compares the three methods, namely CE, RL with Policy Gradient (PG) and our proposed MM, for neural seq2seq models in both unconditional (e.g., language modeling) and conditional (e.g., NMT, summarisation) cases.
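The dependence of the multiplication score on the current model can be illustrated with a single length feature. The sketch below is hypothetical throughout: the function names, the hand-made length reward, and the numbers are assumptions for illustration, and the score is assumed to weight a gradient-descent step on the MM loss.

```python
from typing import Sequence

def mm_score(model_avg: Sequence[float], emp_avg: Sequence[float],
             sample_feats: Sequence[float]) -> float:
    """<model_avg - emp_avg, sample_feats - emp_avg>: the weight on grad log p(y_j|x_n)."""
    return sum((a - e) * (s - e) for a, e, s in zip(model_avg, emp_avg, sample_feats))

def length_reward(sample_len: float, ref_len: float) -> float:
    """A hypothetical hand-made RL reward: negative absolute length error."""
    return -abs(sample_len - ref_len)

ref_len, sample_len = 10.0, 14.0
for avg_len in (13.0, 8.0):  # model currently too long vs. too short on average
    score = mm_score([avg_len], [ref_len], [sample_len])
    print(avg_len, score, length_reward(sample_len, ref_len))
```

When the model is too long on average, the over-long sample gets a positive score and a descent step lowers its probability; when the model is too short on average, the same sample gets a negative score and is encouraged, since it moves the average towards the reference. The hand-made reward is the same in both cases because it never looks at the model.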
2.5 Computing the Moment Matching Gradient
We have derived the gradient of the moment matching loss as shown in Equation 4. In order to compute it, we still need to evaluate two estimates, namely the model average estimate $\hat{\Phi}_n(\Theta)$ and the empirical average estimate $\bar{\Phi}_n$.
Empirical Average Estimate.
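First, we need to estimate the empirical average $\bar{\Phi}_n$. In the general case, given a source sequence $\mathbf{x}$, suppose there are multiple target sequences $\mathbf{y} \in \mathcal{Y}$ associated with $\mathbf{x}$, then $\bar{\Phi}_n = \frac{1}{|\mathcal{Y}|} \sum_{\mathbf{y} \in \mathcal{Y}} \Phi(\mathbf{y}|\mathbf{x})$. Specifically, when we have only one reference sequence $\mathbf{y}$ per source sequence $\mathbf{x}$, then $\bar{\Phi}_n = \Phi(\mathbf{y}|\mathbf{x})$, which is the standard case in the context of neural machine translation training.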
Model Average Estimate.
In practice, it is impossible to compute $\hat{\Phi}_n(\Theta)$ exactly, because the space of possible $\mathbf{y}$ is intractable to search. Therefore, we resort to estimating $\hat{\Phi}_n(\Theta)$ by a sampling process. There are several possible options for doing this.
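The simplistic approach would be to:
First, estimate the model average $\hat{\Phi}_n(\Theta)$ by sampling $\mathbf{y}_1, \ldots, \mathbf{y}_K$ and then estimating:
$$\hat{\Phi}_n(\Theta) \approx \frac{1}{K} \sum_{k=1}^{K} \Phi(\mathbf{y}_k|\mathbf{x}_n)$$
Next, estimate the expectation in Equation 4 by independently sampling $J$ second values of $\mathbf{y}$, and then estimate:
$$\Gamma_{\Theta,n} \approx \frac{2}{J} \sum_{j=1}^{J} \left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(\mathbf{y}_j|\mathbf{x}_n) - \bar{\Phi}_n \right\rangle \nabla_{\Theta} \log p_{\Theta}(\mathbf{y}_j|\mathbf{x}_n)$$
Note that the two sample sets are separate and drawn independently. This would provide an unbiased estimate of $\Gamma_{\Theta,n}$, but at the cost of producing two independent sample sets of sizes $J$ and $K$, used for two different purposes, which would be computationally wasteful.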
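The two-sample-set scheme can be sketched as follows. This is a toy illustration under assumptions, not the authors' implementation: the model is a softmax over a few candidate outputs with the logits standing in for $\Theta$, the constant factor is dropped, and the names `two_set_gradient`, `K` and `J` are chosen for this example.

```python
import math
import random
from typing import List

def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs: List[float], rng: random.Random) -> int:
    """Draw one candidate index from the model distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def two_set_gradient(logits, feats, emp_avg, K, J, rng):
    """Monte Carlo estimate of the MM gradient (constant factor dropped)."""
    probs = softmax(logits)
    m = len(emp_avg)
    # Sample set 1 (size K): estimate of the model average.
    first = [sample_index(probs, rng) for _ in range(K)]
    avg = [sum(feats[y][j] for y in first) / K for j in range(m)]
    diff = [a - e for a, e in zip(avg, emp_avg)]
    # Sample set 2 (size J), drawn independently: estimate of the outer expectation.
    grad = [0.0] * len(logits)
    for _ in range(J):
        y = sample_index(probs, rng)
        score = sum(d * (f - e) for d, f, e in zip(diff, feats[y], emp_avg))
        for i in range(len(logits)):
            # For a softmax, d log p(y) / d logit_i = 1[y == i] - p_i.
            dlogp = (1.0 if i == y else 0.0) - probs[i]
            grad[i] += score * dlogp / J
    return grad

rng = random.Random(0)
feats = [[1.0, 0.0], [3.0, 1.0], [2.0, 2.0]]  # Phi(y) for each candidate y
print(two_set_gradient([0.2, -0.5, 1.0], feats, [2.0, 1.0], K=50, J=50, rng=rng))
```

Each gradient estimate costs K + J model samples, which is the overhead that motivates looking for a scheme that reuses a single sample set.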
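A more economical approach might consist in using the same sample set of size $J$ for both purposes. However, this would produce a biased estimate of $\Gamma_{\Theta,n}$. This can be illustrated by considering the case with $J = 1$. In this case, the dot product $\left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(\mathbf{y}_1|\mathbf{x}_n) - \bar{\Phi}_n \right\rangle$ in $\Gamma_{\Theta,n}$ is non-negative, and strictly positive unless $\Phi(\mathbf{y}_1|\mathbf{x}_n) = \bar{\Phi}_n$, since $\hat{\Phi}_n(\Theta)$ is estimated as $\Phi(\mathbf{y}_1|\mathbf{x}_n)$; hence, in this case, the current sample $\mathbf{y}_1$ would be systematically discouraged by the model.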
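The single-sample case can be written out explicitly. The short derivation below assumes that the one sample $\mathbf{y}_1$ is reused as the model-average estimate; the symbol $s_1$ for the resulting multiplication score is introduced here only for illustration.

```latex
\begin{align*}
\hat{\Phi}_n(\Theta) &\approx \Phi(\mathbf{y}_1|\mathbf{x}_n), \\
s_1 &= \left\langle \Phi(\mathbf{y}_1|\mathbf{x}_n) - \bar{\Phi}_n,\;
        \Phi(\mathbf{y}_1|\mathbf{x}_n) - \bar{\Phi}_n \right\rangle
     = \left\| \Phi(\mathbf{y}_1|\mathbf{x}_n) - \bar{\Phi}_n \right\|_2^2 \;\geq\; 0 .
\end{align*}
```

A gradient-descent step along $s_1 \nabla_{\Theta} \log p_{\Theta}(\mathbf{y}_1|\mathbf{x}_n)$ therefore lowers the probability of $\mathbf{y}_1$ whenever its features differ from the empirical average, regardless of whether the true model average lies above or below $\bar{\Phi}_n$.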
Here, we propose a better approach, resulting in an unbiased estimate of $\Gamma_{\Theta,n}$, formulated as follows:
First, we sample $J$ values of $\mathbf{y}$: $\mathbf{y}_1, \ldots, \mathbf{y}_J$ with $J \geq 2$, then: