Mask-Align: Self-Supervised Neural Word Alignment

12/13/2020 ∙ by Chi Chen, et al.

Neural word alignment methods have received increasing attention recently. These methods usually extract word alignment from a machine translation model. However, there is a gap between translation and alignment tasks, since the target future context is available in the latter. In this paper, we propose Mask-Align, a self-supervised model specifically designed for the word alignment task. Our model masks and predicts each target token in parallel, and extracts high-quality alignments without any supervised loss. In addition, we introduce leaky attention to alleviate the problem of unexpected high attention weights on special tokens. Experiments on four language pairs show that our model significantly outperforms all existing unsupervised neural baselines and obtains new state-of-the-art results.


1 Introduction

Figure 1: Inducing alignments from two different models. (a) Translation: translations diverge at one decoding step, possibly yielding a wrong alignment (blue line). (b) Mask-Align: only one translation is possible as the future context is observed, leading to a correct alignment.

Word alignment is the task of finding the corresponding words in a sentence pair Brown et al. (1993) and used to be a key component of statistical machine translation (SMT; Koehn et al. (2003)). Although word alignment is no longer explicitly modeled in neural machine translation (NMT; Bahdanau et al. (2015)), it is often leveraged to interpret and analyze NMT models Ding et al. (2017); Tu et al. (2016). Word alignment is also used in many other scenarios, such as imposing lexical constraints on the decoding process Arthur et al. (2016); Hasler et al. (2018), improving automatic post-editing Pal et al. (2017), and providing guidance for translators in computer-aided translation Dagan et al. (1993).

Recently, unsupervised neural alignment methods have been studied and have outperformed GIZA++ Och and Ney (2003) on many alignment datasets Garg et al. (2019); Zenkel et al. (2020); Chen et al. (2020). However, these methods are trained with a translation objective, which computes the probability of each target token conditioned on the source tokens and the previous target tokens. This brings noisy alignments when the prediction is ambiguous (Figure 1(a)). To alleviate this problem, previous studies modify the Transformer Vaswani et al. (2017) by adding alignment modules that re-predict the target token Zenkel et al. (2019, 2020), or by computing an additional alignment loss on the full target sequence Garg et al. (2019). Moreover, Chen et al. (2020) propose an extraction method that induces alignments when the to-be-aligned target token is the decoder input.

Although these methods have demonstrated their effectiveness, they have two drawbacks. First, they retain the translation objective, which is not tailored for word alignment. Consider the example in Figure 1(a): when predicting the target token “Tokyo”, the translation model may wrongly generate “1968” because it only considers the previous context, which results in an incorrect alignment link (“1968”, “Tokyo”). Better modeling is needed to obtain more accurate alignments. Second, they need an additional guided alignment loss Chen et al. (2016) to outperform GIZA++, which requires inducing alignments for the entire training corpus.

In this paper, we propose Mask-Align, a self-supervised model specifically designed for the word alignment task. Our model masks each target token and recovers it from the source and the rest of the target tokens. For example, as shown in Figure 1(b), the target token “Tokyo” is masked and re-predicted. During this process, our model can identify that only the source token “Tokio” has not been translated yet, so the to-be-predicted target token “Tokyo” is aligned to “Tokio”. Compared with a translation model, this masked modeling is much more closely tied to word alignment, and it allows our model to produce more accurate predictions and alignments.

To summarize, the main contributions of our work are listed as follows:

  • We propose a novel model for the word alignment task that masks and recovers each target token in parallel. Compared with NMT-based alignment models, our model can leverage more context and find more accurate alignments for the to-be-predicted token.

  • We introduce a variant of attention called leaky attention that is more suitable for alignment extraction. Our leaky attention can reduce the unexpected high attention weights on special tokens.

  • By considering agreement between two directional models, we consistently outperform the state-of-the-art on four language pairs without any guided alignment loss.

2 Background

2.1 NMT-based Alignment Models

An NMT model can be utilized to measure the correspondence between a source token $x_i$ and a target token $y_j$, providing an alignment score matrix $S$ in which each element $S_{ij}$ represents the relevance between $x_i$ and $y_j$. We can then extract the alignment matrix $A$ accordingly:

$A_{ij} = \mathbb{1}\left[S_{ij} = \max_{i'} S_{i'j}\right],$   (1)

where $A_{ij} = 1$ indicates that $y_j$ is aligned to $x_i$.
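To make the extraction rule in Eq. (1) concrete, here is a minimal NumPy sketch; the function name and the toy score matrix are ours, purely for illustration:

```python
import numpy as np

def extract_alignment(S: np.ndarray) -> np.ndarray:
    """Eq. (1): align each target token y_j to the source token x_i with the
    highest score S[i, j]. S has shape (src_len, tgt_len)."""
    A = np.zeros_like(S, dtype=int)
    best_src = S.argmax(axis=0)              # best source index for every target position
    A[best_src, np.arange(S.shape[1])] = 1
    return A

# Toy example with 3 source tokens and 2 target tokens.
S = np.array([[0.7, 0.1],
              [0.2, 0.1],
              [0.1, 0.8]])
print(extract_alignment(S))  # aligns y_1 to x_1 and y_2 to x_3
```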

There are two types of methods to obtain $S$. The first type evaluates the importance of $x_i$ to the prediction of $y_j$ through feature importance measures such as prediction difference Li et al. (2019), gradient-based saliency Li et al. (2016); Ding et al. (2019), or norm-based measurement Kobayashi et al. (2020). While these methods provide ways to extract alignments from an NMT model without any parameter update or architectural modification, their results do not surpass statistical methods such as GIZA++.

The second type of methods refers to the cross-attention weights $W$ between the encoder and decoder (for multi-head attention, $W$ is usually defined as the average over all attention heads). There are two different kinds of methods for extracting alignments from attention weights. The first kind regards $W_{ij}$ as the relevance between the source token $x_i$ and the output target token $y_j$, and extracts alignments from the top decoder layers near the network output Garg et al. (2019); Zenkel et al. (2019). The second kind takes $W_{ij}$ as the alignment probability between $x_i$ and the input target token of the decoder, i.e., $y_{j-1}$ Chen et al. (2020); Kobayashi et al. (2020).

2.2 Disentangled Context Masked Language Model

A conditional masked language model (CMLM) predicts a set of target tokens given a source text and part of the target text Ghazvininejad et al. (2019). The original CMLM randomly selects a subset of the target tokens and only predicts that subset in a forward pass. Kasai et al. (2020) extend it with an alternative objective called the Disentangled Context (DisCo) objective, which predicts every target token $y_j$ given an arbitrary subset of the other tokens, denoted as $\mathbf{y}_{\mathrm{obs}(j)}$. As directly computing $p(y_j \mid \mathbf{y}_{\mathrm{obs}(j)}, \mathbf{x})$ with a vanilla Transformer requires sequential time, they modify the Transformer decoder to predict all target tokens in parallel. For the target token $y_j$, they separate the query input $q_j$ from the key input $k_j$ and value input $v_j$, and only update $q_j$ across decoder layers:

$q_j^{(1)} = p_j,$   (2)
$k_j = v_j = e(y_j) + p_j,$   (3)

where $q_j^{(l)}$ represents the query input in the $l$-th layer, and $e(y_j)$ and $p_j$ denote the word and position embedding for $y_j$, respectively. The attention output $a_j^{(l)}$ in the $l$-th layer is computed given the contexts corresponding to the observed tokens $\mathrm{obs}(j)$:

$K_{\mathrm{obs}(j)} = [\,k_{j'}\,]_{j' \in \mathrm{obs}(j)},$   (4)
$V_{\mathrm{obs}(j)} = [\,v_{j'}\,]_{j' \in \mathrm{obs}(j)},$   (5)
$a_j^{(l)} = \mathrm{Attention}\big(q_j^{(l)} W_Q,\; K_{\mathrm{obs}(j)} W_K,\; V_{\mathrm{obs}(j)} W_V\big),$   (6)

where $\mathrm{obs}(j) \subseteq \{1, \dots, J\} \setminus \{j\}$ and $J$ is the target length. The DisCo Transformer is efficient in modeling target tokens conditioned on both past and future context, and has succeeded in non-autoregressive machine translation.

3 Method

We introduce Mask-Align, a self-supervised neural alignment model (see Figure 2). Different from NMT-based alignment models, our model masks each target token in parallel and recovers it given the source and the remaining target tokens. In this process, the most helpful source tokens are identified as the alignment of the target token to be predicted.

Figure 2: The architecture of Mask-Align.

3.1 Modeling

We propose to model the target token $y_j$ conditioned on the rest of the target tokens $\mathbf{y}_{\setminus j}$ and the source sentence $\mathbf{x}$, and the probability of the target sentence $\mathbf{y}$ given $\mathbf{x}$ as:

$p(\mathbf{y} \mid \mathbf{x}) = \prod_{j=1}^{J} p(y_j \mid \mathbf{y}_{\setminus j}, \mathbf{x}).$   (7)

We argue that this kind of modeling is more effective for word alignment extraction. For one thing, our method has higher prediction accuracy because future context is considered, which has proved to be helpful for alignment extraction Zenkel et al. (2020); Chen et al. (2020). For another, we hypothesize that in this way, our model can better identify the aligned source tokens since only their information is missing in the target.

Directly computing $p(y_j \mid \mathbf{y}_{\setminus j}, \mathbf{x})$ with a vanilla Transformer requires a separate forward pass for each target token, which is unacceptable. Inspired by Kasai et al. (2020), we modify the Transformer decoder to perform these forward passes concurrently. This is done by separating the query inputs from the key and value inputs in the decoder self-attention layers. In each layer, we update the query input at each position by attending to the keys and values at the other positions. To avoid the model simply copying the representations from the inputs, we set the query inputs in the first decoder layer to be the position embeddings and keep the key and value inputs unchanged as the sum of position and token embeddings. We name this variant of attention static-KV attention.

Another modification is that we remove the cross-attention in all but the last decoder layer, so that the information flow from source to target happens only in the last layer. Our experiments demonstrate that this modification reduces model parameters and improves alignment results.

Formally, given the word embedding $e(y_j)$ and position embedding $p_j$ for the target token $y_j$, we compute the output of the $l$-th decoder self-attention layer for $y_j$ as:

$q_j^{(1)} = p_j,$   (8)
$k_j = v_j = e(y_j) + p_j,$   (9)
$K_{\setminus j} = [\,k_{j'}\,]_{j' \neq j}, \quad V_{\setminus j} = [\,v_{j'}\,]_{j' \neq j},$   (10)
$a_j^{(l)} = \mathrm{Attention}\big(q_j^{(l)} W_Q,\; K_{\setminus j} W_K,\; V_{\setminus j} W_V\big),$   (11)
$q_j^{(l+1)} = \mathrm{FFN}\big(a_j^{(l)}\big),$   (12)

where $e(y_j) + p_j$ is the key and value input that stays fixed across layers, and layer normalization and residual connections are omitted for brevity. The output of the last self-attention layer is used to compute cross-attention with the encoder outputs, and we use the cross-attention weights to induce alignments.
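The following PyTorch snippet sketches one static-KV self-attention layer as described above, assuming a single attention head and omitting batching, multi-head projections, the feed-forward sublayer, layer normalization, and residual connections; names and shapes are illustrative and not taken from the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticKVSelfAttention(nn.Module):
    """Each target position j queries all *other* positions. Keys and values
    come from token + position embeddings and stay fixed across layers; only
    the query stream is updated."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, query: torch.Tensor, static_kv: torch.Tensor) -> torch.Tensor:
        # query:     (tgt_len, d_model) -- the first layer receives the position embeddings
        # static_kv: (tgt_len, d_model) -- token + position embeddings, never updated
        q = self.w_q(query)
        k = self.w_k(static_kv)
        v = self.w_v(static_kv)
        scores = q @ k.transpose(0, 1) * self.scale            # (tgt_len, tgt_len)
        # Mask the diagonal so y_j cannot attend to its own embedding.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        scores = scores.masked_fill(mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v                   # updated query stream
```

Stacking several such layers, feeding each layer's output back as the next layer's query while static_kv stays fixed, yields the parallel masked prediction described above.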

3.2 Leaky Attention

We found that extracting alignments from vanilla cross-attention suffers from the unexpected high attention weights on some specific source tokens such as periods, [EOS] or other high frequency tokens. Hereinafter, we will refer to these tokens as the attractor tokens. As a result, if we compute the alignments according to the cross-attention weights, many target tokens are wrongly aligned to the attractor tokens.

This phenomenon has been studied in previous work Clark et al. (2019); Kobayashi et al. (2020). Kobayashi et al. (2020) also show that the norms of the transformed vectors of the attractor tokens are usually small, so their influence on the attention output is actually limited. We believe this happens because vanilla attention does not account for untranslatable tokens, which are often aligned to a special NULL token in statistical alignment models Brown et al. (1993). As a result, attractor tokens are implicitly treated as the NULL token.

We propose to explicitly model the NULL token with a modified attention, namely leaky attention. Leaky attention provides an extra position, in addition to the attention memory, for the target tokens to attend to. To be specific, we parameterize the key and value vectors of this leaky position in the cross-attention as $k_{\mathrm{NULL}}$ and $v_{\mathrm{NULL}}$, and concatenate them with the transformed vectors of the encoder outputs. The attention output is then computed as follows:

$\mathrm{Attention}(q, K, V) = \mathrm{softmax}\!\left(\frac{q W_Q \, [\,k_{\mathrm{NULL}}; K\,]^{\top}}{\sqrt{d_k}}\right)[\,v_{\mathrm{NULL}}; V\,],$   (13)

where $W_Q$ is the query projection matrix, $K$ and $V$ are the projected keys and values of the encoder outputs, and $d_k$ is the key dimension. We initialize $k_{\mathrm{NULL}}$ and $v_{\mathrm{NULL}}$ from a zero-mean normal distribution. When extracting alignments, we only consider the attention matrix without the leaky position.
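A minimal single-head sketch of leaky cross-attention under the same simplifications is given below; the initialization scale of 0.02 is a placeholder, since we do not reproduce the standard deviation used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakyCrossAttention(nn.Module):
    """Cross-attention with one extra, learned 'leaky' (NULL) slot prepended
    to the encoder keys and values (single-head sketch)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        # Learnable key/value for the NULL position (zero-mean normal init;
        # the 0.02 scale is illustrative).
        self.k_null = nn.Parameter(torch.randn(1, d_model) * 0.02)
        self.v_null = nn.Parameter(torch.randn(1, d_model) * 0.02)
        self.scale = d_model ** -0.5

    def forward(self, dec_state: torch.Tensor, enc_out: torch.Tensor):
        # dec_state: (tgt_len, d_model), enc_out: (src_len, d_model)
        q = self.w_q(dec_state)
        k = torch.cat([self.k_null, self.w_k(enc_out)], dim=0)   # (src_len + 1, d_model)
        v = torch.cat([self.v_null, self.w_v(enc_out)], dim=0)
        attn = F.softmax(q @ k.transpose(0, 1) * self.scale, dim=-1)  # (tgt_len, src_len + 1)
        out = attn @ v
        # Alignments are induced from attn[:, 1:], i.e. without the leaky slot.
        return out, attn[:, 1:]
```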

Note that leaky attention is different from adding a special token to the source sequence, which would only act as another attractor token and share the high weights with the existing one instead of removing it Vig and Belinkov (2019). Our parameterized method is also more flexible than Leaky-Softmax Sabour et al. (2017), which adds an extra dimension with a value of zero to the routing logits.

With leaky attention, our model can capture more accurate alignment scores between source and target. In Section 3.3, we will show that this kind of attention is also helpful for agreement training.

3.3 Agreement

To better utilize the attention weights from the models in the two directions, we apply an agreement loss during training to improve the symmetry of our model, which has proved effective in statistical alignment models Liang et al. (2006). Given a parallel sentence pair $(\mathbf{x}, \mathbf{y})$, we can obtain the attention weights from the two directions, denoted as $W^{x\rightarrow y}$ and $W^{y\rightarrow x}$. As alignment is bijective, $W^{x\rightarrow y}$ is supposed to be equal to the transpose of $W^{y\rightarrow x}$. We encourage this kind of symmetry through an agreement loss:

$\mathcal{L}_{\mathrm{agree}} = \mathrm{MSE}\big(W^{x\rightarrow y}, (W^{y\rightarrow x})^{\top}\big),$   (14)

where $\mathrm{MSE}(\cdot, \cdot)$ represents the mean squared error.

For vanilla attention, this loss can hardly be small because of the normalization constraint. Suppose we have $\mathbf{x}$ = “Falsch”, $\mathbf{y}$ = “Not true”, and a gold alignment in which both target tokens are aligned to “Falsch”. Because each attention distribution must sum to one, the optimal attention weights are $(1, 1)^{\top}$ in one direction but $(0.5, 0.5)$ in the other, resulting in a nonzero minimal $\mathcal{L}_{\mathrm{agree}}$. Leaky attention is able to achieve a lower agreement loss because the weights on the source tokens no longer have to sum exactly to one. We assume this kind of flexibility is helpful for agreement training.

However, since we relax the constraints in the cross-attention, our model may converge to a degenerate case of zero agreement loss in which the attention weights are all zero except for the leaky position. We avoid this by introducing an entropy loss on the attention weights:

$\mathcal{L}_{\mathrm{entropy}}^{x\rightarrow y} = -\sum_{j}\sum_{i} \hat{W}^{x\rightarrow y}_{ij} \log \hat{W}^{x\rightarrow y}_{ij},$   (15)
$\hat{W}^{x\rightarrow y}_{ij} = \frac{\big(W^{x\rightarrow y}_{ij}\big)^{1/\tau}}{\sum_{i'} \big(W^{x\rightarrow y}_{i'j}\big)^{1/\tau}},$   (16)

where $\hat{W}^{x\rightarrow y}$ denotes the attention weights renormalized over the source positions (excluding the leaky position) and $\tau$ is a hyperparameter. Similarly, we obtain $\mathcal{L}_{\mathrm{entropy}}^{y\rightarrow x}$ for the inverse direction.

We jointly train the two directional models with the standard NLL losses $\mathcal{L}_{\mathrm{NLL}}^{x\rightarrow y}$ and $\mathcal{L}_{\mathrm{NLL}}^{y\rightarrow x}$, the agreement loss $\mathcal{L}_{\mathrm{agree}}$, and the entropy losses $\mathcal{L}_{\mathrm{entropy}}^{x\rightarrow y}$ and $\mathcal{L}_{\mathrm{entropy}}^{y\rightarrow x}$. The overall loss is:

$\mathcal{L} = \mathcal{L}_{\mathrm{NLL}}^{x\rightarrow y} + \mathcal{L}_{\mathrm{NLL}}^{y\rightarrow x} + \lambda_1 \mathcal{L}_{\mathrm{agree}} + \lambda_2 \big(\mathcal{L}_{\mathrm{entropy}}^{x\rightarrow y} + \mathcal{L}_{\mathrm{entropy}}^{y\rightarrow x}\big),$   (17)

where $\lambda_1$ and $\lambda_2$ are hyperparameters.
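A compact sketch of the loss terms in Eqs. (14)-(17) is shown below, assuming the attention matrices are shaped (target length, source length) and already exclude the leaky position; the temperature and loss weights are placeholders rather than the tuned values:

```python
import torch

def agreement_loss(w_xy: torch.Tensor, w_yx: torch.Tensor) -> torch.Tensor:
    """Eq. (14): MSE between W_x->y (tgt_len, src_len) and W_y->x transposed."""
    return torch.mean((w_xy - w_yx.transpose(0, 1)) ** 2)

def entropy_loss(w: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Eqs. (15)-(16): renormalize each target token's weights over the source
    positions with temperature tau, then penalize high entropy."""
    w_hat = w.clamp_min(1e-9) ** (1.0 / tau)
    w_hat = w_hat / w_hat.sum(dim=-1, keepdim=True)
    return -(w_hat * w_hat.log()).sum()

def total_loss(nll_xy, nll_yx, w_xy, w_yx, lambda_1=1.0, lambda_2=1.0):
    """Eq. (17): NLL of both directions plus weighted agreement and entropy terms."""
    return (nll_xy + nll_yx
            + lambda_1 * agreement_loss(w_xy, w_yx)
            + lambda_2 * (entropy_loss(w_xy) + entropy_loss(w_yx)))
```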

When extracting alignments, we first compute the alignment score $S_{ij}$ for $x_i$ and $y_j$ from the attention weights $W^{x\rightarrow y}$ and $W^{y\rightarrow x}$ of the two directional models:

$S_{ij} = W^{x\rightarrow y}_{ij} \cdot W^{y\rightarrow x}_{ji}.$   (18)

We then extract the alignment links whose scores $S_{ij}$ exceed a threshold $\gamma$.
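A corresponding sketch of the symmetrized extraction in Eq. (18); the threshold value is illustrative:

```python
import torch

def extract_bidirectional(w_xy: torch.Tensor, w_yx: torch.Tensor, gamma: float = 0.2):
    """Eq. (18): combine the two directional attention matrices into alignment
    scores and keep the links whose score exceeds the threshold gamma.
    w_xy: (tgt_len, src_len), w_yx: (src_len, tgt_len)."""
    S = w_xy * w_yx.transpose(0, 1)             # S[j, i] = W_xy[j, i] * W_yx[i, j]
    return (S > gamma).nonzero(as_tuple=False)  # rows of (target_index, source_index)
```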

4 Experiments

4.1 Datasets

We conduct our experiments on four public datasets: German-English, English-French, Romanian-English and Chinese-English. The Chinese-English training set is from the LDC corpus and comprises 1.2M sentence pairs. For validation and test, we use the Chinese-English alignment dataset from Liu et al. (2005) (http://nlp.csai.tsinghua.edu.cn/~ly/systems/TsinghuaAligner/TsinghuaAligner.html), which contains 450 sentence pairs for validation and 450 for test. For the other three language pairs, we follow the experimental setup described in Zenkel et al. (2019, 2020) and use the preprocessing scripts from Zenkel et al. (2019) (https://github.com/lilt/alignment-scripts). Following Ding et al. (2019), we take the last 1000 sentences of the training data for each dataset as the validation set. We learn a joint source and target Byte-Pair-Encoding (BPE, Sennrich et al. (2016)) with 40k merge operations. During training, we filter out sentences of length 1 to ensure the validity of the masking process.

4.2 Settings

We implement our model based on the Transformer architecture Vaswani et al. (2017). The encoder consists of 6 standard Transformer encoder layers, and the decoder is composed of 6 layers. All decoder layers contain static-KV self-attention, while only the last layer computes leaky cross-attention. We use an embedding size of 512, a hidden size of 1024, and 4 attention heads, and share the input and output embeddings of the decoder.

We train the models with a batch size of 36K tokens and perform early stopping based on the prediction accuracy on the validation data. All models are trained in two directions without any alignment supervision. We tuned the hyperparameters via grid search on the Chinese-English validation set, as it contains gold word alignments, and use the same values of $\tau$ (Eq. 16), $\lambda_1$ and $\lambda_2$ (Eq. 17), and the extraction threshold $\gamma$ in all of our experiments. We evaluate alignment quality with the Alignment Error Rate (AER, Och and Ney (2000)).
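For reference, AER can be computed from a hypothesis alignment and the sure/possible gold links as in the following sketch (the example links are made up):

```python
def aer(hyp: set, sure: set, possible: set) -> float:
    """Alignment Error Rate (Och and Ney, 2000). All arguments are sets of
    (source_index, target_index) links; sure links are assumed to be a subset
    of possible links."""
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))

# Two hypothesis links, both allowed by the gold annotation: AER = 0.
print(aer({(0, 0), (1, 2)}, sure={(0, 0)}, possible={(0, 0), (1, 2)}))
```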

Method DeEn EnFr RoEn ZhEn
fast-align Dyer et al. (2013) 25.7 12.1 31.8 -
GIZA++ Och and Ney (2003) 17.8    6.1 26.0 21.9
Attention (all) 34.0 18.5 32.9 29.7
Attention (last) 28.4 17.7 32.4 26.4
AddSGD Zenkel et al. (2019) 21.2 10.0 27.6 -
Mtl-Fullc Garg et al. (2019) 20.2    7.7 26.0 -
BAO Zenkel et al. (2020) 17.9    8.4 24.1 -
Shift-Att Chen et al. (2020) 17.9    6.6 23.9 16.8
Mask-Align 14.5    4.4 19.5 13.9
Table 1: Alignment Error Rate (AER) on four datasets for different alignment methods. The lower the AER, the better. The results are symmetrized. We mark the best results for each language pair with boldface.

4.3 Baselines

As described in Section 2.1, neural alignment models induce alignments either from attention weights or through feature importance measures. We compare our method with the attention-based methods because (1) we also extract alignments from attention weights and (2) these methods achieve the best alignment results. We thus introduce the following neural baselines in addition to the two statistical baselines fast-align and GIZA++:


  • Attention (all): the method that induces alignments from the attention weights of the best (usually the penultimate) decoder layer in a vanilla Transformer.

  • Attention (last): same as Attention (all) except that only the last layer performs cross-attention.

  • AddSGD Zenkel et al. (2019): the method that adds an extra alignment layer to repredict the to-be-aligned target token.

  • Mtl-Fullc Garg et al. (2019): the method that supervises a single attention head conditioned on full target context with symmetrized Attention (all) alignments in a multi-task manner.

  • BAO Zenkel et al. (2020): an improved version of AddSGD that first extracts alignments with bidirectional attention optimization and then uses them to retrain the alignment layer with guided alignment loss.

  • Shift-Att Chen et al. (2020): the method that induces alignments when the to-be-aligned target token is the decoder input instead of the output.

For convenience, we will use Masked to represent the method that only uses the masked modeling described in Section 3.1, and Mask-Align to denote the one that also uses leaky attention and agreement training.

4.4 Main Results

Table 1 compares Mask-Align with all baselines on the four datasets. Our approach significantly outperforms all baselines on all datasets. Specifically, it improves over GIZA++ by 1.7-8.0 AER points across different language pairs without using any guided alignment loss, making it a good substitute for this commonly used statistical alignment tool. Compared to Attention (all), we achieve a gain of 13.4-17.5 AER points with fewer parameters (as we remove some cross-attention sublayers) and no additional modules, showing the effectiveness of our method. Compared with the state-of-the-art neural baselines, Mask-Align consistently outperforms BAO, the best method that extracts alignments for output target tokens, by 2.2-4.4 AER points, demonstrating that our modeling is more suitable for word alignment than translation.

(a) Vanilla Attention
(b) Leaky Attention
Figure 3: Attention weights from vanilla and leaky attention. “MR” is short for “menschenrechte”, which means “human rights” in English. We use [NULL] to denote the leaky position.
Source Sent. [NULL] MR in der welt 1995 1996
vanilla attn. - 21.1 11.7 5.2 15.0 21.2 17.7 21.8
leaky attn. 1.9 28.5 17.2 18.1 20.2 24.2 21.4 23.8
Table 2: Norms of the transformed value vectors of different source tokens in Figure 3. We mark the minimum norm for each variant of attention with boldface.

4.5 Comparison with Guided Training Results

Previous work outperforms GIZA++ by incorporating a guided alignment loss during training. We compare three additional baselines with guided training: (1) Mtl-Fullc-GIZA++ Garg et al. (2019) which replaces the alignment labels in Mtl-Fullc with GIZA++ results; (2) BAO-Guided Zenkel et al. (2020) which uses alignments from BAO for guided alignment training; (3) Shift-AET Chen et al. (2020) which trains an additional alignment module with supervision from symmetrized Shift-Att alignments.

Method DeEn EnFr RoEn
Mtl-Fullc-GIZA++ Garg et al. (2019) 16.4 4.6 23.1
BAO-Guided Zenkel et al. (2020) 16.3 5.0 23.4
Shift-AET Chen et al. (2020) 15.4 4.7 21.2
Mask-Align 14.5 4.4 19.5
Table 3: Comparison of Mask-Align with other methods using guided alignment loss.

Table 3 shows the performance of Mask-Align and the baselines that use guided training. Mask-Align performs better than all of them. Note that our method is also simpler and faster: to compute the guided alignment loss, these methods have to induce alignments for the entire training set first, which is computationally expensive, whereas our method needs only one training pass. We also tried guided training on our approach and observed no further improvements. We attribute this to the use of the agreement loss and leave it for future research.

4.6 Leaky Attention

Figure 3 shows the attention weights from vanilla and leaky attention, and Table 2 presents the norms of the transformed value vectors of each source token for the two types of attention. For vanilla attention, we see large weights on the high-frequency token “der” and a small norm of its transformed value vector. As a result, the target token “in” will be wrongly aligned to “der”. For leaky attention, we observe a similar phenomenon on the leaky position “[NULL]”, and “in” will not be aligned to any source token since the weights on all source tokens are small. This confirms our hypothesis that attractor tokens arise because of untranslatable tokens. For the target token “in”, there is no corresponding translation in the source. However, vanilla attention cannot make all attention weights for “in” small given the normalization constraint. Instead, the attractor tokens receive high weights because they have little impact on the attention output due to their small norms. In contrast, our leaky attention handles untranslatable tokens better as it explicitly models the NULL token, making it more suitable for word alignment.

4.7 Ablation Studies

Method Cross Layers AER
Attention (all) 6 34.0
Attention (last) 1 28.4
Masked (all) 6 32.0
Masked (last) 1 25.4
+ leaky attention 1 17.5
++ agreement 1 14.5
Table 4: Ablation study on the German-English dataset. The second column lists the number of decoder layers that perform cross-attention.

Table 4 shows the ablation results on the German-English dataset. All results are symmetrized. We first compare our masked modeling (Masked) with the vanilla Transformer in two settings, in which either all decoder layers or only the last one contains a cross-attention sublayer. The results show that limiting the interaction between encoder and decoder to a single layer improves alignment quality, and that our masked modeling is better than vanilla translation modeling by 2.0-3.0 AER points. Leaky attention brings an additional gain of 7.9 AER points, and agreement training on top of that yields the best result of 14.5 AER, which shows the effectiveness of these two techniques.

4.8 Analysis

Prediction & Alignment We analyze the relation between the correctness of word-level prediction and that of alignment. Specifically, we consider four cases: correct prediction and correct alignment (cPcA), correct prediction and wrong alignment (cPwA), and analogously wPcA and wPwA. We regard a word as correctly predicted if any of its subwords is correct. For words with more than one possible alignment, we consider a word correctly aligned if any possible alignment is matched. The results are shown in Figure 4. Our masked method has higher prediction accuracy and significantly reduces the alignment errors caused by wrong predictions.

Method w/ punc. w/o punc.
Masked (last) 25.4 17.7
+ leaky attention 17.5 17.6
Table 5: Comparison of AER with and without considering the attention weights on end punctuation.

Removing End Punctuation To further investigate the performance of leaky attention, we test an extraction method that excludes some of the influence of attractor tokens. To be specific, we remove the attention weights on the end punctuation of a source sentence. In our preliminary experiments, when the source sentence contains an end punctuation mark, it is treated as the attractor token in most cases, so removing it alleviates the impact of attractor tokens to a certain extent. Table 5 shows the comparison. With leaky attention, removing the end punctuation brings no improvement in alignment quality. Without leaky attention, however, removing it yields a gain of 7.7 AER points. This suggests that leaky attention effectively avoids the problem of attractor tokens.

Figure 4: Numbers of four different kinds of target tokens: correct prediction and correct alignment (cPcA), correct prediction and wrong alignment (cPwA), and analogously wPcA and wPwA. Our masked method predicts more accurately and significantly reduces alignment errors caused by wrong predictions (wPwA).
(a) Reference
(b) Attention (all)
(c) Chen et al. (2020)
(d) Masked
(e) Mask-Align
Figure 5: Attention weights from different models for the example in Figure 1. Gold alignment is shown in (a). For target token “1968”, the NMT-based methods (b) and (c) assign high weights to the wrongly aligned source token “tokio”, while our masked methods only focus on the correct source token “1968”.

Case Study Figure 5 shows the attention weights from four different models for the example in Figure 1. As discussed in Section 1, the NMT-based methods may be confused when predicting the target token “1968” in this example. From the attention weights, we can see that (b) and (c) indeed wrongly put high weights on “tokio” in the source sentence. Another observation is that in the attention map of (c), “[EOS]” acts the same as the period, which supports our claim in Section 3.2 that simply adding a special token to the source sequence as the NULL token does not work. For our methods (d) and (e), the attention weights are highly consistent with the gold alignment for this example, and they are sparse and accurate. Comparing (d) and (e), we notice that (e) eliminates some small noise present in (d). We attribute this to the bidirectional agreement training.

5 Related Work

Neural Alignment Model Some neural alignment models use gold-standard alignment data. Stengel-Eskin et al. (2019) introduce a discriminative model that uses the dot-product distance between source and target representations to predict alignment labels. Nagata et al. (2020) first transform word alignment into a question answering task and then use multilingual BERT to solve it. This line of research suffers from the lack of human-annotated alignment data. Therefore, many studies focus on alignment extraction without gold data Tamura et al. (2014); Legrand et al. (2016). Alkhouli et al. (2016) present neural translation and alignment models trained with silver-standard alignments obtained from GIZA++. Peter et al. (2017) propose a target foresight approach and use silver-standard alignments to perform guided alignment training Chen et al. (2016). These methods are not satisfactory in terms of alignment results. Recently, many studies have induced alignments from an NMT model. Garg et al. (2019) apply the guided alignment loss to a single attention head with silver-standard alignments from GIZA++. Zenkel et al. (2019, 2020) introduce an additional alignment module on top of the NMT model and also use guided training. Chen et al. (2020) come up with an extraction method that induces alignments when the to-be-aligned target token is the decoder input. However, all previous methods adopt a translation objective in the training process, and they outperform GIZA++ only with guided training, which requires inducing alignments for the entire training set. Our method is fully self-supervised with a masked modeling objective and exceeds all these unsupervised methods.

Masked Language Model Pre-trained masked language models (MLMs, Devlin et al. (2019)) have been successfully applied to many NLP tasks such as natural language understanding Wang et al. (2018)

and text generation

Lewis et al. (2020). Its idea has also been adopted in many advanced NLP models. Ghazvininejad et al. (2019) introduce a conditional masked language model (CMLM) to perform parallel decoding for non-autoregressive machine translation. The CMLM can leverage both previous and future context on the target side for sequence-to-sequence tasks with the masking mechanism. Kasai et al. (2020) extend it with a disentangled context Transformer that predicts every target token instead of a subset conditioned on arbitrary context. Our masked modeling method is inspired by CMLMs, as such a masking and predicting process is highly related to word alignment. To the best of our knowledge, this is the first work that incorporates a CMLM objective into alignment models.

6 Conclusion

In this paper, we propose a self-supervised neural alignment model Mask-Align. Different from the NMT-based methods, our model adopts a novel masked modeling objective that is more suitable for word alignment tasks. Moreover, Mask-Align can alleviate the problem of high attention weights on special tokens by introducing leaky attention. Experiments show that Mask-Align achieves new state-of-the-art results without guided alignment loss. We leave it for future work to extend our model in a semi-supervised setting.

References

  • T. Alkhouli, G. Bretschner, J. Peter, M. Hethnawi, A. Guta, and H. Ney (2016) Alignment-based neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers, Berlin, Germany, pp. 54–65. External Links: Link, Document Cited by: §5.
  • P. Arthur, G. Neubig, and S. Nakamura (2016) Incorporating discrete translation lexicons into neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1557–1567. External Links: Link, Document Cited by: §1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §1.
  • P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer (1993) The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19 (2), pp. 263–311. Cited by: §1, §3.2.
  • W. Chen, E. Matusov, S. Khadivi, and J. Peter (2016) Guided alignment training for topic-aware neural machine translation. Association for Machine Translation in the Americas, pp. 121. Cited by: §1, §5.
  • Y. Chen, Y. Liu, G. Chen, X. Jiang, and Q. Liu (2020) Accurate word alignment induction from neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing and the 10th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Cited by: §1, §2.1, §3.1, 5(c), 6th item, §4.5, Table 1, Table 3, §5.
  • K. Clark, U. Khandelwal, O. Levy, and C. D. Manning (2019) What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 276–286. Cited by: §3.2.
  • I. Dagan, K. Church, and W. Gale (1993) Robust bilingual word alignment for machine aided translation. In Very Large Corpora: Academic and Industrial Perspectives, Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §5.
  • S. Ding, H. Xu, and P. Koehn (2019) Saliency-driven word alignment interpretation for neural machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pp. 1–12. Cited by: §2.1, §4.1.
  • Y. Ding, Y. Liu, H. Luan, and M. Sun (2017) Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1150–1159. External Links: Link, Document Cited by: §1.
  • C. Dyer, V. Chahuneau, and N. A. Smith (2013) A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 644–648. Cited by: Table 1.
  • S. Garg, S. Peitz, U. Nallasamy, and M. Paulik (2019) Jointly learning to align and translate with transformer models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4443–4452. Cited by: §1, §2.1, 4th item, §4.5, Table 1, Table 3, §5.
  • M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer (2019) Mask-predict: parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 6112–6121. External Links: Link, Document Cited by: §2.2, §5.
  • E. Hasler, A. de Gispert, G. Iglesias, and B. Byrne (2018) Neural machine translation decoding with terminology constraints. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 506–512. External Links: Link, Document Cited by: §1.
  • J. Kasai, J. Cross, M. Ghazvininejad, and J. Gu (2020) Non-autoregressive machine translation with disentangled context transformer. In Proc. of ICML, External Links: Link Cited by: §2.2, §3.1, §5.
  • G. Kobayashi, T. Kuribayashi, S. Yokoi, and K. Inui (2020) Attention module is not only a weight: analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing and the 10th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Cited by: §2.1, §2.1, §3.2.
  • P. Koehn, F. J. Och, and D. Marcu (2003) Statistical phrase-based translation. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 127–133. External Links: Link Cited by: §1.
  • J. Legrand, M. Auli, and R. Collobert (2016) Neural network-based word alignment through score aggregation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers, Berlin, Germany, pp. 66–73. External Links: Link, Document Cited by: §5.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880. External Links: Link, Document Cited by: §5.
  • J. Li, X. Chen, E. H. Hovy, and D. Jurafsky (2016) Visualizing and understanding neural models in nlp. In HLT-NAACL, Cited by: §2.1.
  • X. Li, G. Li, L. Liu, M. Meng, and S. Shi (2019) On the word alignment from neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1293–1303. Cited by: §2.1.
  • P. Liang, B. Taskar, and D. Klein (2006) Alignment by agreement. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pp. 104–111. Cited by: §3.3.
  • Y. Liu, Q. Liu, and S. Lin (2005) Log-linear models for word alignment. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, Michigan, pp. 459–466. External Links: Link, Document Cited by: §4.1.
  • M. Nagata, K. Chousa, and M. Nishino (2020) A supervised word alignment method based on cross-language span prediction using multilingual BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 555–565. External Links: Link, Document Cited by: §5.
  • F. J. Och and H. Ney (2000) Improved statistical alignment models. In Proceedings of the 38th annual meeting of the association for computational linguistics, pp. 440–447. Cited by: §4.2.
  • F. J. Och and H. Ney (2003) A systematic comparison of various statistical alignment models. Computational Linguistics 29 (1), pp. 19–51. Cited by: §1, Table 1.
  • S. Pal, S. K. Naskar, M. Vela, Q. Liu, and J. van Genabith (2017) Neural automatic post-editing using prior alignment and reranking. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 349–355. Cited by: §1.
  • J. Peter, A. Nix, and H. Ney (2017) Generating alignments using target foresight in attention-based neural machine translation. The Prague Bulletin of Mathematical Linguistics 108 (1), pp. 27–36. Cited by: §5.
  • S. Sabour, N. Frosst, and G. E. Hinton (2017) Dynamic routing between capsules. In Advances in neural information processing systems, pp. 3856–3866. Cited by: §3.2.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: §4.1.
  • E. Stengel-Eskin, T. Su, M. Post, and B. Van Durme (2019) A discriminative neural model for cross-lingual word alignment. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 909–919. Cited by: §5.
  • A. Tamura, T. Watanabe, and E. Sumita (2014) Recurrent neural networks for word alignment model. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 1470–1480. External Links: Link, Document Cited by: §5.
  • Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li (2016) Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 76–85. External Links: Link, Document Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. 5998–6008. External Links: Link Cited by: §1, §4.2.
  • J. Vig and Y. Belinkov (2019) Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, pp. 63–76. External Links: Link, Document Cited by: §3.2.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355. External Links: Link, Document Cited by: §5.
  • T. Zenkel, J. Wuebker, and J. DeNero (2019) Adding interpretable attention to neural translation models improves word alignment. arXiv preprint arXiv:1901.11359. Cited by: §1, §2.1, 3rd item, §4.1, Table 1, §5.
  • T. Zenkel, J. Wuebker, and J. DeNero (2020) End-to-end neural word alignment outperforms GIZA++. ACL. Cited by: §1, §3.1, 5th item, §4.1, §4.5, Table 1, Table 3, §5.