Attention-based NMT with Coverage and Context Gate
The attention mechanism has enhanced state-of-the-art Neural Machine Translation (NMT) by jointly learning to align and translate. It tends to ignore past alignment information, however, which often leads to over-translation and under-translation. To address this problem, we propose coverage-based NMT in this paper. We maintain a coverage vector to keep track of the attention history. The coverage vector is fed to the attention model to help adjust future attention, which lets the NMT system pay more attention to untranslated source words. Experiments show that the proposed approach significantly improves both translation quality and alignment quality over standard attention-based NMT.
The past several years have witnessed the rapid progress of end-to-end Neural Machine Translation (NMT) [Sutskever et al.2014, Bahdanau et al.2015]. Unlike conventional Statistical Machine Translation (SMT) [Koehn et al.2003, Chiang2007], NMT uses a single, large neural network to model the entire translation process. It enjoys the following advantages. First, the use of distributed representations of words can alleviate the curse of dimensionality [Bengio et al.2003]. Second, there is no need to explicitly design features to capture translation regularities, which is quite difficult in SMT. Instead, NMT is capable of learning representations directly from the training data. Third, Long Short-Term Memory [Hochreiter and Schmidhuber1997] enables NMT to capture long-distance reordering, which is a significant challenge in SMT.
NMT has a serious problem, however, namely lack of coverage. In phrase-based SMT [Koehn et al.2003], a decoder maintains a coverage vector to indicate whether a source word is translated or not. This is important for ensuring that each source word is translated in decoding. The decoding process is completed when all source words are “covered” or translated. In NMT, there is no such coverage vector and the decoding process ends only when the end-of-sentence mark is produced. We believe that lacking coverage might result in the following problems in conventional NMT:
Over-translation: some words are unnecessarily translated multiple times;
Under-translation: some words are mistakenly untranslated.
Specifically, in the state-of-the-art attention-based NMT model [Bahdanau et al.2015], generating a target word heavily depends on the relevant parts of the source sentence, and a source word is involved in the generation of all target words. As a result, over-translation and under-translation inevitably happen because the model ignores the “coverage” of source words (i.e., the number of times a source word is translated to a target word). Figure 1(a) shows an example: the Chinese word “guānbì” is over-translated to “close(d)” twice, while “bèipò” (meaning “be forced to”) is mistakenly left untranslated.
In this work, we propose a coverage mechanism for NMT (NMT-Coverage) to alleviate the over-translation and under-translation problems. Basically, we append a coverage vector to the intermediate representations of an NMT model; the vector is sequentially updated after each attentive read during the decoding process to keep track of the attention history. The coverage vector, when fed into the attention model, can help adjust the future attention and significantly improve the overall alignment between the source and target sentences. This design contains many particular cases for coverage modeling with contrasting characteristics, which all share a clear linguistic intuition and yet can be trained in a data-driven fashion. Notably, we achieve significant improvement even by simply using the sum of previous alignment probabilities as coverage for each word, a successful example of incorporating linguistic knowledge into neural network based NLP models.
Experiments show that NMT-Coverage significantly outperforms conventional attention-based NMT on both translation and alignment tasks. Figure 1(b) shows an example, in which NMT-Coverage alleviates the over-translation and under-translation problems that NMT without coverage suffers from.
Our work is built on attention-based NMT [Bahdanau et al.2015], which simultaneously conducts dynamic alignment and generation of the target sentence, as illustrated in Figure 2. It produces the translation by generating one target word $y_i$ at each time step. Given an input sentence $\mathbf{x} = \{x_1, \dots, x_J\}$ and previously generated words $\{y_1, \dots, y_{i-1}\}$, the probability of generating the next word $y_i$ is

$$P(y_i \mid y_{<i}, \mathbf{x}) = \mathrm{softmax}\big(g(y_{i-1}, t_i, s_i)\big) \tag{1}$$

where $g$ is a non-linear function, and $t_i$ is a decoding state for time step $i$, computed by

$$t_i = f(t_{i-1}, y_{i-1}, s_i) \tag{2}$$

Here the activation function $f(\cdot)$ is a Gated Recurrent Unit (GRU) [Cho et al.2014b], and $s_i$ is a distinct source representation for time $i$, calculated as a weighted sum of the source annotations:

$$s_i = \sum_{j=1}^{J} \alpha_{i,j} \cdot h_j \tag{3}$$

where $h_j$ is the annotation of $x_j$ from a bi-directional Recurrent Neural Network (RNN) [Schuster and Paliwal1997], and its weight $\alpha_{i,j}$ is computed by

$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{J} \exp(e_{i,k})} \tag{4}$$

and

$$e_{i,j} = a(t_{i-1}, h_j) = v_a^{\top} \tanh(W_a t_{i-1} + U_a h_j) \tag{5}$$

is an attention model that scores how well $y_i$ and $h_j$ match. The attention model avoids the need to represent the entire source sentence with a single vector. Instead, the decoder selects parts of the source sentence to pay attention to, and thus exploits an expected annotation $s_i$ over possible alignments $\alpha_{i,j}$ for each time step $i$.
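The attentive read described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the additive (tanh) scoring form is the one commonly associated with [Bahdanau et al.2015], and the names `attention_step`, `W_a`, `U_a`, `v_a` as well as all dimensions and random values are hypothetical.

```python
import numpy as np

def attention_step(t_prev, H, W_a, U_a, v_a):
    """One attentive read: score each annotation, normalize, and mix.

    t_prev: previous decoder state, shape (n,)
    H:      source annotations, one row per source word, shape (J, m)
    """
    # Assumed additive scoring: e_j = v_a^T tanh(W_a t_prev + U_a h_j)
    scores = np.tanh(t_prev @ W_a.T + H @ U_a.T) @ v_a   # shape (J,)
    # Softmax over source positions (shifted for numerical stability)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: expected annotation under the alignment weights
    context = weights @ H                                 # shape (m,)
    return weights, context

# Toy sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
J, n, m, k = 5, 4, 6, 3
H = rng.normal(size=(J, m))
t_prev = rng.normal(size=n)
W_a = rng.normal(size=(k, n))
U_a = rng.normal(size=(k, m))
v_a = rng.normal(size=k)

alpha, s = attention_step(t_prev, H, W_a, U_a, v_a)
```

Note that nothing in this step looks at the weights produced at earlier time steps, which is the gap the coverage model is meant to fill.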
However, the attention model fails to take advantage of past alignment information, which is found useful to avoid over-translation and under-translation problems in conventional SMT [Koehn et al.2003]. For example, if a source word is translated in the past, it is less likely to be translated again and should be assigned a lower alignment probability.
In SMT, a coverage set is maintained to keep track of which source words have been translated (“covered”) in the past. Let us take $\mathbf{x} = \{x_1, x_2, x_3, x_4\}$ as an example input sentence. The initial coverage set is $C = \{0, 0, 0, 0\}$, which denotes that no source word is yet translated. When a translation rule covering $x_2 x_3$ is applied, we produce one hypothesis labelled with coverage $C = \{0, 1, 1, 0\}$. It means that the second and third source words are translated. The goal is to generate a translation with full coverage $C = \{1, 1, 1, 1\}$. A source word is translated when it is covered by one translation rule, and it is not allowed to be translated again in the future (i.e., hard coverage). In this way, each source word is guaranteed to be translated, and to be translated only once. As shown, coverage is essential for SMT since it avoids gaps and overlaps in the translation of source words.
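The hard-coverage bookkeeping of an SMT decoder can be illustrated with a small sketch. The four-word sentence and the helper name `apply_rule` are assumptions made for the example; a real phrase-based decoder tracks coverage per hypothesis inside its search.

```python
def apply_rule(coverage, span):
    """Return a new coverage list after translating the source span.

    span is (start, end) with 0-based start and exclusive end.
    Hard coverage: a word that is already covered may not be covered again.
    """
    start, end = span
    if any(coverage[start:end]):
        raise ValueError("source word already translated")
    return coverage[:start] + [1] * (end - start) + coverage[end:]

coverage = [0, 0, 0, 0]                   # nothing translated yet
coverage = apply_rule(coverage, (1, 3))   # rule covers the 2nd and 3rd words
partial = list(coverage)                  # [0, 1, 1, 0]

coverage = apply_rule(coverage, (0, 1))
coverage = apply_rule(coverage, (3, 4))
done = all(coverage)                      # decoding may stop at full coverage
```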
Modeling coverage is also important for attention-based NMT models, since they generally lack a mechanism to indicate whether a certain source word has been translated, and are therefore prone to “coverage” mistakes: some parts of the source sentence are translated more than once or not translated at all. For NMT models, directly modeling coverage is less straightforward, but the problem can be significantly alleviated by keeping track of the attention signal during the decoding process. The most natural way of doing that is to append a coverage vector to the annotation of each source word (i.e., $h_j$), which is initialized as a zero vector but updated after every attentive read of the corresponding annotation. The coverage vector is fed to the attention model to help adjust future attention, which lets the NMT system pay more attention to untranslated source words, as illustrated in Figure 3.
Since the coverage vector summarizes the attention record for $h_j$ (and therefore for a small neighborhood centered on the $j$-th source word), it will discourage further attention to it if it has been heavily attended, and implicitly push the attention to the less attended segments of the source sentence, since the attention weights are normalized to one. This can potentially solve both coverage mistakes mentioned above, when modeled and learned properly.
Formally, the coverage model is given by

$$C_{i,j} = g_{update}\big(C_{i-1,j}, \alpha_{i,j}, \Phi(h_j), \Psi\big) \tag{6}$$

where
$g_{update}(\cdot)$ is the function that updates $C_{i,j}$ after the new attention $\alpha_{i,j}$ at time step $i$ in the decoding process;
$C_{i,j}$ is a $d$-dimensional coverage vector summarizing the history of attention till time step $i$ on $h_j$;
$\Phi(h_j)$ is a word-specific feature with its own parameters;
$\Psi$ are auxiliary inputs exploited in different sorts of coverage models.
Equation 6 gives a rather general model, which could take different function forms for $g_{update}(\cdot)$ and $\Phi(\cdot)$, and different auxiliary inputs $\Psi$ (e.g., the previous decoding state $t_{i-1}$). In the rest of this section, we will give a number of representative implementations of the coverage model, which either leverage more linguistic information (Section 3.1.1) or resort to the flexibility of neural network approximation (Section 3.1.2).
We first consider a linguistically inspired model which has a small number of parameters, as well as a clear interpretation. While the linguistically inspired coverage in NMT is similar to that in SMT, there is one key difference: it indicates what percentage of source words have been translated (i.e., soft coverage). In NMT, each target word $y_i$ is generated from all source words with probability $\alpha_{i,j}$ for source word $x_j$. In other words, the source word $x_j$ is involved in generating all target words and the probability of generating target word $y_i$ at time step $i$ is $\alpha_{i,j}$. Note that unlike in SMT, in which each source word is fully translated at one decoding step, the source word $x_j$ is partially translated at each decoding step in NMT. Therefore, the coverage at time step $i$ denotes the ratio to which each source word has been translated.
We use a scalar ($d = 1$) to represent linguistic coverage for each source word and employ an accumulate operation for $g_{update}$. The initial value of linguistic coverage is zero, which denotes that the corresponding source word is not translated yet. We iteratively construct linguistic coverages through accumulation of alignment probabilities generated by the attention model, each of which is normalized by a distinct context-dependent weight. The coverage of source word $x_j$ at time step $i$ is computed by

$$C_{i,j} = C_{i-1,j} + \frac{1}{\Phi_j}\,\alpha_{i,j} = \frac{1}{\Phi_j} \sum_{k=1}^{i} \alpha_{k,j}$$

where $\Phi_j$ is a pre-defined weight which indicates the number of target words $x_j$ is expected to generate. The simplest way is to follow Xu et al. [2015] in image-to-caption translation and fix $\Phi = 1$ for all source words, which means that we directly use the sum of previous alignment probabilities without normalization as coverage for each word, as done in [Cohn et al.2016].
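With the weight fixed to one, the coverage of a source word is just a running sum of the attention it has received so far. The short sketch below makes this concrete; the attention matrix is made up for the example and does not come from the experiments.

```python
import numpy as np

# Hypothetical attention weights: one row per decoding step,
# one column per source word (each row sums to one).
alpha = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
])

# Coverage after each step is the cumulative sum down the time axis.
coverage = np.cumsum(alpha, axis=0)

# After three steps the third source word has received little attention
# in total, which marks it as a likely untranslated word.
final_coverage = coverage[-1]
least_covered = int(np.argmin(final_coverage))
```

Because each row of the attention matrix sums to one, the coverages over all source words sum to the number of decoding steps taken so far.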
However, in machine translation, different types of source words may contribute differently to the generation of the target sentence. Let us take the sentence pair in Figure 1 as an example. The noun in the source sentence “jīchǎng” is translated into one target word “airports”, while the adjective “bèipò” is translated into three words “were forced to”. Therefore, we need to assign a distinct $\Phi_j$ for each source word. Ideally, we expect $\Phi_j = \sum_{i=1}^{I} \alpha_{i,j}$ with $I$ being the total number of time steps in decoding. However, such a desired value is not available before decoding, and is thus not suitable in this scenario.
To predict $\Phi_j$, we introduce the concept of fertility, which was first proposed in word-level SMT [Brown et al.1993]. The fertility of source word $x_j$ tells how many target words $x_j$ produces. In SMT, the fertility is a random variable, whose distribution is determined by the parameters of word alignment models (e.g., IBM models). In this work, we simplify and adapt fertility from the original model and compute the fertility $\Phi_j$ by

$$\Phi_j = \mathcal{N}(x_j \mid \mathbf{x}) = N \cdot \sigma(U_f h_j)$$

where $N$ is a predefined constant to denote the maximum number of target words one source word can produce, $\sigma(\cdot)$ is a logistic sigmoid function, and $U_f$ is the weight matrix. Here we use $h_j$ to denote $(x_j \mid \mathbf{x})$ since $h_j$ contains information about the whole input sentence with a strong focus on the parts surrounding $x_j$ [Bahdanau et al.2015]. Since $\Phi_j$ does not depend on $i$, we can pre-compute it before decoding to minimize the computational cost. Note that fertility in SMT is a random variable with a set of fertility probabilities that depend on the fertilities of previous source words. To simplify the calculation and adapt it to the attention model in NMT, we define the fertility in NMT as a constant number, which is independent of previous fertilities.
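A minimal sketch of the fertility computation and the fertility-normalized coverage follows, assuming the fertility is the maximum-fertility constant times a sigmoid of a linear map of the annotation, as the description above suggests. The function names, the default value `N=2.0`, and the random annotations are illustrative assumptions.

```python
import numpy as np

def fertility(H, U_f, N=2.0):
    """Per-word fertility: N * sigmoid(U_f . h_j), a value in (0, N).

    H:   source annotations, shape (J, m)
    U_f: weight vector, shape (m,)
    N:   assumed upper bound on target words per source word
    """
    return N / (1.0 + np.exp(-(H @ U_f)))

def linguistic_coverage(alpha, phi):
    """Accumulated attention per source word, divided by its fertility."""
    return np.cumsum(alpha, axis=0) / phi

rng = np.random.default_rng(0)
J, m = 3, 4
H = rng.normal(size=(J, m))
U_f = rng.normal(size=m)

# Fertility depends only on the annotations, so it is computed once
# before decoding starts.
phi = fertility(H, U_f)

alpha = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
])
coverage = linguistic_coverage(alpha, phi)
```

A word with a large predicted fertility accumulates coverage more slowly, so the attention model can keep returning to it for several target words.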
We next consider a Neural Network (NN) based coverage model. When $C_{i,j}$ is a vector ($d > 1$) and $g_{update}(\cdot)$ is a neural network, we actually have an RNN model for coverage, as illustrated in Figure 4. In this work, we take the following form:

$$C_{i,j} = f(C_{i-1,j}, \alpha_{i,j}, h_j, t_{i-1})$$

where $f(\cdot)$ is a nonlinear activation function and $t_{i-1}$ is the auxiliary input that encodes past translation information. Note that we leave out the word-specific feature function $\Phi(\cdot)$ and only take the input annotation $h_j$ as the input to the coverage RNN. It is important to emphasize that the NN-based coverage model can be fed with arbitrary inputs, such as the previous attentional context $s_{i-1}$. Here we only employ $C_{i-1,j}$ for past alignment information, $t_{i-1}$ for past translation information, and $h_j$ for word-specific bias. In our preliminary experiments, considering more inputs (e.g., current and previous attentional contexts, unnormalized attention weights $e_{i,j}$) does not always lead to better translation quality. Possible reasons include: 1) the inputs contain duplicate information, and 2) more inputs introduce more back-propagation paths and therefore make the model difficult to train. In our experience, one principle is to only feed the coverage model inputs that contain distinct information and are complementary to each other.
The neural function can be either a simple activation function or a gating function that proves useful to capture long-distance dependencies. In this work, we adopt GRU for the gating activation since it is simple yet powerful [Chung et al.2014]. Please refer to [Cho et al.2014b] for more details about GRU.
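One possible reading of the gated coverage update is a standard GRU cell whose hidden state is the coverage vector of a single source word, and whose input is the concatenation of the current attention weight, the annotation, and the previous decoder state. The sketch below follows that reading; the parameter names, the initialisation, and the dimensions are assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(d, in_dim, rng):
    """Random weights for one GRU cell (hypothetical initialisation)."""
    params = {k: rng.normal(scale=0.1, size=(d, in_dim))
              for k in ("W_z", "W_r", "W_c")}       # act on the input
    params.update({k: rng.normal(scale=0.1, size=(d, d))
                   for k in ("U_z", "U_r", "U_c")})  # act on old coverage
    return params

def gru_coverage_update(C_prev, alpha_j, h_j, t_prev, p):
    """Update the d-dimensional coverage of one source word."""
    x = np.concatenate(([alpha_j], h_j, t_prev))
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ C_prev)            # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ C_prev)            # reset gate
    cand = np.tanh(p["W_c"] @ x + p["U_c"] @ (r * C_prev))   # candidate
    return (1.0 - z) * C_prev + z * cand

rng = np.random.default_rng(0)
d, m, n = 10, 6, 4                      # coverage, annotation, state sizes
params = init_params(d, 1 + m + n, rng)
h_j, t_prev = rng.normal(size=m), rng.normal(size=n)

C = np.zeros(d)                          # coverage starts as a zero vector
for alpha_j in (0.7, 0.2, 0.05):         # attention received at three steps
    C = gru_coverage_update(C, alpha_j, h_j, t_prev, params)
```

In a full decoder the previous state would change at every step; it is held fixed here only to keep the example short.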
Intuitively, the two types of models summarize coverage information in “different languages”. Linguistic models summarize coverage information in human language, which has a clear interpretation to humans. Neural models encode coverage information in “neural language”, which can be “understood” by neural networks, letting them decide how to make use of the encoded coverage information.
Although the attention-based model is capable of jointly aligning and translating, it does not take translation history into consideration. Specifically, a source word that has significantly contributed to the generation of target words in the past should be assigned lower alignment probabilities, which may not be the case in attention-based NMT. To address this problem, we propose to calculate the alignment probabilities by incorporating past alignment information embedded in the coverage model.
Intuitively, at each time step $i$ in the decoding phase, coverage from time step $i-1$ ($C_{i-1,j}$) serves as an additional input to the attention model, which provides complementary information about how likely the source words are to have been translated in the past. We expect the coverage information to guide the attention model to focus more on untranslated source words (i.e., assign them higher alignment probabilities). In practice, we find that the coverage model does fulfill this expectation (see Section 5). The translated ratios of source words from linguistic coverages correlate negatively with the corresponding alignment probabilities.
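More formally, we rewrite the attention model in Equation 5 as

$$e_{i,j} = a(t_{i-1}, h_j, C_{i-1,j}) = v_a^{\top} \tanh(W_a t_{i-1} + U_a h_j + V_a C_{i-1,j})$$

where $C_{i-1,j}$ is the coverage of source word $x_j$ before time $i$. $V_a \in \mathbb{R}^{n \times d}$ is the weight matrix for coverage with $n$ and $d$ being the numbers of hidden units and coverage units, respectively.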
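To see how a coverage term inside the scoring function can steer attention away from heavily attended words, consider the sketch below. It assumes an additive (tanh) attention with one extra linear term for coverage. The weights are hand-picked so that only the coverage term is active; in a trained model they would be learned, and the names are hypothetical.

```python
import numpy as np

def coverage_attention(t_prev, H, C_prev, W_a, U_a, V_a, v_a):
    """Attention weights with coverage as an additional input.

    t_prev: previous decoder state, shape (n,)
    H:      source annotations, shape (J, m)
    C_prev: coverage before this step, shape (J, d)
    """
    pre = t_prev @ W_a.T + H @ U_a.T + C_prev @ V_a.T   # shape (J, k)
    scores = np.tanh(pre) @ v_a                          # shape (J,)
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

J, n, m, k, d = 3, 2, 2, 1, 1
t_prev = np.ones(n)
H = np.ones((J, m))
W_a = np.zeros((k, n))            # state and annotation terms switched off
U_a = np.zeros((k, m))            # so the effect of coverage is isolated
V_a = np.array([[-1.0]])          # higher coverage lowers the score
v_a = np.array([1.0])

C_prev = np.array([[0.0], [1.0], [2.0]])   # word 0 untouched, word 2 heavily attended
alpha = coverage_attention(t_prev, H, C_prev, W_a, U_a, V_a, v_a)
```

With these weights the untouched first word receives the largest share of attention and the heavily attended third word the smallest.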
We take end-to-end learning for the NMT-Coverage model, which learns not only the parameters for the “original” NMT (i.e., for the encoding RNN, decoding RNN, and attention model) but also the parameters for coverage modeling (i.e., for annotation and guidance of attention). More specifically, we choose to maximize the likelihood of reference sentences, as most other NMT models do (see, however, [Shen et al.2016]):

$$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \log P(\mathbf{y}_n \mid \mathbf{x}_n; \theta) \tag{9}$$
For the coverage model with a clearer linguistic interpretation (Section 3.1.1), it is possible to inject an auxiliary objective function on some intermediate representation. More specifically, we may have the following objective:

$$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \Big\{ \log P(\mathbf{y}_n \mid \mathbf{x}_n; \theta) - \lambda \sum_{j=1}^{J} \Big( \Phi_j - \sum_{i=1}^{I} \alpha_{i,j} \Big)^{2} \Big\}$$

where the term $\sum_{j} (\Phi_j - \sum_{i} \alpha_{i,j})^{2}$ penalizes the discrepancy between the sum of alignment probabilities and the expected fertility for linguistic coverage. This is similar to the more explicit training for fertility as in Xu et al. [2015], which encourages the model to pay equal attention to every part of the image (i.e., $\Phi_j = 1$). However, our empirical study shows that the combined objective consistently worsens the translation quality while slightly improving the alignment quality.
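The auxiliary term can be sketched as follows. The squared-difference form is an assumption made for illustration, since the text only states that the term penalizes the discrepancy between the summed alignment probabilities and the expected fertility. The function name and the toy values are hypothetical.

```python
import numpy as np

def fertility_penalty(alpha, phi):
    """Sum over source words of (fertility - total attention received)^2."""
    total_attention = alpha.sum(axis=0)      # one value per source word
    return float(np.sum((phi - total_attention) ** 2))

alpha = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
])

# If the fertilities match the attention each word received, the penalty is zero.
matched = fertility_penalty(alpha, np.array([1.4, 1.3, 0.3]))

# Forcing every fertility to one penalizes uneven attention:
# (1 - 1.4)^2 + (1 - 1.3)^2 + (1 - 0.3)^2 = 0.16 + 0.09 + 0.49
uniform = fertility_penalty(alpha, np.ones(3))
```

In the combined objective this value would be scaled by a weight and subtracted from the log-likelihood of each training pair.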
Our training strategy poses fewer constraints on the dependency between $\Phi_j$ and the attention than the more explicit strategy taken in [Xu et al.2015]. We let the objective associated with the translation quality (i.e., the likelihood) drive the training, as in Equation 9. This strategy is arguably advantageous, since the attention weight on a hidden state $h_j$ cannot be interpreted as the proportion of the corresponding word being translated in the target sentence. For one thing, the hidden state $h_j$, after the transformation from the encoding RNN, bears the contextual information from other parts of the source sentence, and thus loses the rigid correspondence with the corresponding word. Therefore, penalizing the discrepancy between the sum of alignment probabilities and the expected fertility is not well justified in this scenario.
|#||System||#Params||MT05||MT06||MT08||Avg.|
|3||+ Linguistic coverage w/o fertility||+1K||31.26||32.16||24.84||29.42|
|4||+ Linguistic coverage w/ fertility||+3K||32.36||32.31||24.91||29.86|
|5||+ NN-based coverage w/o gating ($d = 1$)||+4K||31.94||32.11||23.31||29.12|
|6||+ NN-based coverage w/ gating ($d = 1$)||+10K||31.94||32.16||24.67||29.59|
|7||+ NN-based coverage w/ gating ($d = 10$)||+100K||32.73||32.47||25.23||30.14|
We carry out experiments on a Chinese-English translation task. Our training data for the translation task consists of 1.25M sentence pairs extracted from LDC corpora (LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06), with 27.9M Chinese words and 34.5M English words respectively. We choose the NIST 2002 dataset as our development set, and the NIST 2005, 2006 and 2008 datasets as our test sets. We carry out experiments on the alignment task using the evaluation dataset from [Liu and Sun2015], which contains 900 manually aligned Chinese-English sentence pairs. We use the case-insensitive 4-gram NIST BLEU score [Papineni et al.2002] for the translation task, and the alignment error rate (AER) [Och and Ney2003] for the alignment task. To better estimate the quality of the soft alignment probabilities generated by NMT, we propose a variant of AER, named SAER:

$$\mathrm{SAER} = 1 - \frac{|M_A \times M_S| + |M_A \times M_P|}{|M_A| + |M_S|}$$

where $A$ is a candidate alignment, and $S$ and $P$ are the sets of sure and possible links in the reference alignment respectively ($S \subseteq P$). $M$ denotes an alignment matrix, and for both $M_S$ and $M_P$ we assign probability $1$ to the elements that correspond to the existing links in $S$ and $P$, and probability $0$ to the other elements. In this way, we are able to better evaluate the quality of the soft alignments produced by attention-based NMT. We use the sign-test [Collins et al.2005] for statistical significance testing.
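A soft variant of AER can be sketched by replacing set intersections with element-wise products of alignment matrices and set sizes with matrix sums. The exact form used below is an assumption modelled on the standard AER definition; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def saer(M_A, M_S, M_P):
    """Soft alignment error rate.

    M_A: soft alignment probabilities from the model, shape (I, J)
    M_S: 0/1 matrix of sure reference links
    M_P: 0/1 matrix of possible reference links (includes all sure links)
    """
    overlap = (M_A * M_S).sum() + (M_A * M_P).sum()
    return 1.0 - overlap / (M_A.sum() + M_S.sum())

# Reference: a two-word sentence pair aligned along the diagonal.
M_S = np.eye(2)
M_P = np.eye(2)

# A model that reproduces the reference exactly has zero error.
perfect = saer(np.eye(2), M_S, M_P)

# A model that spreads its attention uniformly puts half of its mass
# on correct links: 1 - (1 + 1) / (2 + 2) = 0.5
uniform = saer(np.full((2, 2), 0.5), M_S, M_P)
```

Unlike AER, this score does not require thresholding the attention weights into hard links before evaluation.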
For efficient training of the neural networks, we limit the source and target vocabularies to the most frequent 30K words in Chinese and English, covering approximately 97.7% and 99.3% of the two corpora respectively. All the out-of-vocabulary words are mapped to a special token UNK. We set $N = 2$ for the fertility model in the linguistic coverages. We train each model with the sentences of length up to 80 words in the training data. The word embedding dimension is 620 and the size of a hidden layer is 1000. All the other settings are the same as in [Bahdanau et al.2015].
We compare our method with two state-of-the-art models of SMT and NMT. (There has been recent progress on aggregating multiple models or enlarging the vocabulary, e.g., in [Jean et al.2015], but here we focus on the generic models.)
Table 1 shows the translation performances measured in BLEU score. Clearly the proposed NMT-Coverage significantly improves the translation quality in all cases, although there are still considerable differences among different variants.
The coverage model introduces few parameters. The baseline model (i.e., GroundHog) has 84.3M parameters. The linguistic coverage using fertility introduces 3K parameters (2K for the fertility model), and the NN-based coverage with gating introduces $10\mathrm{K} \times d$ parameters ($6\mathrm{K} \times d$ for gating), where $d$ is the dimension of the coverage vector. In this work, the most complex coverage model only introduces 0.1M additional parameters, which is quite small compared to the number of parameters in the existing model (i.e., 84.3M).
Introducing the coverage model slows down the training speed, but not significantly. When running on a single GPU device Tesla K80, the speed of the baseline model is 960 target words per second. System 4 (“+Linguistic coverage with fertility”) has a speed of 870 words per second, while System 7 (“+NN-based coverage (d=10)”) achieves a speed of 800 words per second.
Linguistic coverages (Rows 3 and 4): Two observations can be made. First, the simplest linguistic coverage (Row 3) already significantly improves translation performance by 1.1 BLEU points, indicating that coverage information is very important to the attention model. Second, incorporating the fertility model boosts the performance by better estimating the covered ratios of source words.
NN-based coverages (Rows 5-7): (1) Gating (Rows 5 and 6): Both variants of NN-based coverage outperform GroundHog with averaged gains of 0.8 and 1.3 BLEU points, respectively. Introducing the gating activation function improves the performance of coverage models, which is consistent with the results in other tasks [Chung et al.2014]. (2) Coverage dimensions (Rows 6 and 7): Increasing the dimension of coverage models further improves the translation performance by 0.6 BLEU points, at the cost of introducing more parameters (e.g., from 10K to 100K). In a pilot study, further increasing the coverage dimension only slightly improved the translation performance. One possible reason is that encoding the relatively simple coverage information does not require too many dimensions.
|System||Adequacy||Fluency||Under-translation||Over-translation|
|+ NN cov. w/ gating ($d = 10$)||3.28||3.73||16.7%||2.7%|
We also conduct a subjective evaluation to validate the benefit of incorporating coverage. Two human evaluators are asked to evaluate the translations of 200 source sentences randomly sampled from the test sets, without knowing which system produced each translation. Table 2 shows the results of subjective evaluation on translation adequacy and fluency. (Fluency measures whether the translation is fluent, while adequacy measures whether the translation is faithful to the original sentence [Snover et al.2009].) GroundHog has a low adequacy since 25.0% of the source words are under-translated. This is mainly due to the serious under-translation problems on long sentences that consist of several sub-sentences, in which some sub-sentences are completely ignored. Incorporating coverage significantly alleviates these problems, reducing under-translation and over-translation errors by 33.2% and 40.0% respectively. Benefiting from this, the coverage model improves both translation adequacy and fluency by around 0.2 points.
|System||SAER||AER|
|+ Ling. cov. w/o fertility||66.75||53.55|
|+ Ling. cov. w/ fertility||64.85||52.13|
|+ NN cov. w/o gating ($d = 1$)||67.10||54.46|
|+ NN cov. w/ gating ($d = 1$)||66.30||53.51|
|+ NN cov. w/ gating ($d = 10$)||64.25||50.50|
Table 3 lists the alignment performances. We find that coverage information improves the attention model as expected by maintaining an annotation summarizing the attention history on each source word. More specifically, linguistic coverage with fertility significantly reduces alignment errors under both metrics, in which fertility plays an important role. NN-based coverages, however, do not significantly reduce alignment errors until the coverage dimension is increased from 1 to 10. This indicates that NN-based models need slightly more dimensions to encode the coverage information.
Figure 5 shows an example. The coverage mechanism does meet the expectation: the alignments are more concentrated and, most importantly, translated source words are less likely to be involved in generating subsequent target words. For example, the first four Chinese words are assigned lower alignment probabilities (i.e., darker color) after the corresponding translation “romania reinforces old buildings” is produced.
Following Bahdanau et al. [2015], we group sentences of similar lengths together and compute the BLEU score and averaged length of translation for each group, as shown in Figure 6. Cho et al. [2014a] show that the performance of GroundHog drops rapidly when the length of the input sentence increases. Our results confirm these findings. One main reason is that GroundHog produces much shorter translations on longer sentences (see the right panel in Figure 6), and thus faces a serious under-translation problem. NMT-Coverage alleviates this problem by incorporating coverage information into the attention model, which in general pushes the attention to untranslated parts of the source sentence and implicitly discourages early stopping of decoding. It is worth emphasizing that both NN-based coverages (with gating, $d = 10$) and linguistic coverages (with fertility) achieve similar performances on long sentences, reconfirming our claim that the two variants improve the attention model in their own ways.
As an example, consider this source sentence in the test set:
qiáodān běn sàijì píngjūn défēn 24.3fēn , tā zài sān zhōu qián jiēshòu shǒushù , qiúduì zài cǐ qījiān 4 shèng 8 fù .
GroundHog translates this sentence into:
jordan achieved an average score of eight weeks ahead with a surgical operation three weeks ago .
in which the sub-sentence “, qiúduì zài cǐ qījiān 4 shèng 8 fù” is under-translated. With the (NN-based) coverage mechanism, NMT-Coverage translates it into:
jordan ’s average score points to UNK this year . he received surgery before three weeks , with a team in the period of 4 to 8 .
in which the under-translation is rectified.
The quantitative and qualitative results show that the coverage models indeed help to alleviate under-translation, especially for long sentences consisting of several sub-sentences.
Our work is inspired by recent works on improving attention-based NMT with techniques that have been successfully applied to SMT. Following the success of Minimum Risk Training (MRT) in SMT [Och2003], Shen et al. [2016] proposed MRT for end-to-end NMT to optimize model parameters directly with respect to evaluation metrics. Based on the observation that attention-based NMT only captures partial aspects of attentional regularities, Cheng et al. [2016] proposed agreement-based learning [Liang et al.2006] to encourage bidirectional attention models to agree on parameterized alignment matrices. Along the same direction, inspired by the coverage mechanism in SMT, we propose a coverage-based approach to NMT to alleviate the over-translation and under-translation problems.
Independently of our work, Cohn et al. [2016] and Feng et al. [2016] made use of the concept of “fertility” for the attention model, which is similar in spirit to our method for building the linguistically inspired coverage with fertility. Cohn et al. [2016] introduced a feature-based fertility that includes the total alignment scores for the surrounding source words. In contrast, we predict fertility before decoding, which works as a normalizer to better estimate the coverage ratio of each source word. Feng et al. [2016] used the previous attentional context to represent implicit fertility and passed it to the attention model, which is in essence similar to the input-feed method proposed in [Luong et al.2015]. Comparatively, we predict explicit fertility for each source word based on its encoding annotation, and incorporate it into the linguistically inspired coverage for the attention model.
We have presented an approach for enhancing NMT, which maintains and utilizes a coverage vector to indicate whether each source word is translated or not. By encouraging NMT to pay less attention to translated words and more attention to untranslated words, our approach alleviates the serious over-translation and under-translation problems that traditional attention-based NMT suffers from. We propose two variants of coverage models: linguistic coverage, which leverages more linguistic information, and NN-based coverage, which resorts to the flexibility of neural network approximation. Experimental results show that both variants achieve significant improvements in terms of translation quality and alignment quality over NMT without coverage.
This work is supported by China National 973 project 2014CB340301. Yang Liu is supported by the National Natural Science Foundation of China (No. 61522204) and the 863 Program (2015AA011808). We thank the anonymous reviewers for their insightful comments.