Self-attention-based sequence-to-sequence models (Vaswani et al., 2017) have recently emerged as the state of the art in neural machine translation. The decoder of the network typically consists of multiple layers, each with several attention heads. This makes it hard to interpret the attention activations and to extract meaningful word alignments. As a result, the most widely used tools to obtain word alignments are still based on the statistical IBM word alignment models introduced more than two decades ago. In this work we describe a simple modeling extension as well as a novel inference procedure that are capable of extracting alignments with a quality comparable to Giza++ from the self-attentional Transformer architecture.
Word alignments extracted as a by-product of machine translation have a number of applications. For example, they can be used to inject an external lexicon into the inference process to improve the translations of low-frequency content words (Arthur et al., 2016). Another use of word alignments is to project annotations from the source sentence to the target sentence. For example, if part of the source sentence is highlighted in the source document, the corresponding part of the target should be highlighted as well. In localization, all formatting and annotation is stored as tags over spans of the source, and word alignments can serve to place those tags automatically into the target.
The most widely used tools for word alignment are Giza++ (Och and Ney, 2003), which uses IBM Model 4 in its default setting, and FastAlign (Dyer et al., 2013), which relies on a re-parameterized version of IBM Model 2. Previous work on neural models for word alignment either requires complicated training procedures (Tamura et al., 2014) or relies on supervised training, with training data generated by one of the tools mentioned above (Alkhouli et al., 2018; Peter et al., 2017).
This work extends the Transformer architecture with a separate alignment layer on top of the decoder sub-network. This layer contains no skip connections around its encoder-attention module, so it is trained to predict the next target word from a linear combination of the encoder information alone. This encourages the alignment layer to focus its attention activations on the source words relevant to a given target word. As a result we have two separate output layers, each of which defines a probability distribution over the next target word. During inference the add-on output layer is ignored; only its attention activations are computed, and these are interpreted as a distribution over word alignments.
However, the attention mechanism still ignores all future target information, including the target token for which it computes the attention activations. In order to query the model w.r.t. the aligned target word to further improve the resulting alignments, we directly optimize the attention activations to maximize the likelihood of the target word using stochastic gradient descent (SGD).
Our approach has a number of desirable properties:
The model can be trained on the same data as the translation model in an unsupervised fashion.
Both the alignment and the translation model are incorporated into a single network.
Training is done by fine-tuning an existing translation model, significantly reducing overall training cost.
The extension is straightforward and easy to implement.
We validate our approach on three hand-aligned, publicly available data sets and compare the alignment error rate (AER) to a naïve baseline, FastAlign and Giza++. On the French-English task our method improves the alignment quality by a factor of three over the naïve baseline. With the application of the grow-diagonal heuristic for bidirectional alignment merging (Koehn et al., 2005), we achieve results that are comparable to Giza++ on two of the three data sets.
2 Machine Translation Model
All neural machine translation (NMT) models in this work are based on the Transformer architecture introduced by Vaswani et al. (2017). It follows the encoder-decoder paradigm (Sutskever et al., 2014) and is composed of two sub-networks. The encoder network transforms the source sentence into a high-dimensional continuous space representation of the sentence. The decoder network uses the output of the encoder to compute a probability distribution over target language sentences. In contrast to the previously most widely used recurrent networks, it relies entirely on the attention mechanism to incorporate context. The attention function (Bahdanau et al., 2015) is a major factor in the recent success of NMT and provides a mechanism to create a fixed-length context vector as a weighted sum of a variable number of input vectors.
In the Transformer architecture, each layer of the encoder network consists of two sub-layers, namely a self-attention module and a feed-forward module. Decoder layers are made up of a self-attention module, an encoder-attention module and a feed-forward module. In the decoder the self-attention sub-network is masked so that only left-hand context is incorporated. In contrast to Vaswani et al. (2017), we mask the attention to not attend to the end-of-sentence token. The encoder-attention always uses the output of the final encoder layer as input.
The attention function applied in the Transformer is a multi-head scaled dot-product variant. Given a query $q \in \mathbb{R}^{d}$ and a set of key-value pairs $(k_1, v_1), \dots, (k_n, v_n)$ with $k_i \in \mathbb{R}^{d}$ and $v_i \in \mathbb{R}^{d}$, we calculate a single attention vector $a$ as follows:

$$a = \mathrm{softmax}\!\left(\frac{q K^{\top}}{\sqrt{d}}\right) \tag{1}$$

With the resulting attention activations $a$ we calculate the context vector $c$ as the weighted sum of the values:

$$c = \sum_{i=1}^{n} a_i v_i$$

Multi-head attention with $H$ heads is computed from the concatenation of $H$ context vectors, each calculated with a lower-dimensional attention function, and the parameter matrices $W^{Q}_h$, $W^{K}_h$, $W^{V}_h$ and $W^{O}$:

$$\mathrm{multihead}(q, K, V) = W^{O} \, [c_1; \dots; c_H], \qquad c_h = \mathrm{attention}(W^{Q}_h q, \; W^{K}_h K, \; W^{V}_h V) \tag{2}$$

That is, for each head the query and the set of key-value pairs are linearly projected before being fed to the individual attention heads. The resulting outputs of the individual heads are concatenated, linearly projected and fed into the next layer.
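As a concrete illustration, the following is a minimal NumPy sketch of the scaled dot-product and multi-head attention described above for a single query; the function names and the per-head weight-matrix layout are our own choices, not taken from the original model code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, K, V):
    """Scaled dot-product attention for a single query.
    q: (d,), K: (n, d), V: (n, d_v) -> context (d_v,), activations (n,)."""
    d = q.shape[-1]
    a = softmax(K @ q / np.sqrt(d))   # a_i proportional to exp(q . k_i / sqrt(d))
    return a @ V, a                   # weighted sum of the values

def multi_head(q, K, V, Wq, Wk, Wv, Wo):
    """Project query/keys/values per head, attend, concatenate, project.
    Wq/Wk/Wv: lists of (d_h, d) matrices, Wo: (d_out, H*d_h)."""
    heads = [attention(Wq[h] @ q, K @ Wk[h].T, V @ Wv[h].T)[0]
             for h in range(len(Wq))]
    return Wo @ np.concatenate(heads)
```

Note how the per-head projections reduce the dimensionality so that the total cost of all heads stays comparable to a single full-dimensional attention.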
We use an encoder with six layers and a decoder with three layers. The decoder sub-layers are simplified versions of those described by Vaswani et al. (2017): the filter sub-layers perform only a single linear transformation, and layer normalization is applied only once per decoder layer, after the filter sub-layer.
2.1 Average Transformer Attentions
The straightforward method to extract alignments from the Transformer architecture is to use its attention activations. As our baseline, we average all attention matrices into a single matrix $A \in \mathbb{R}^{J \times I}$ to extract a soft attention between subword units, where $I$ denotes the source length and $J$ the target length. To extract hard alignments between subword units, we select the source unit with the maximal attention value for each target unit. We align a source word to a target word if an alignment between any of their subwords exists. We argue that this procedure is not the best approach, as nothing encourages the soft attentions to correspond to useful alignments.
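The extraction procedure above can be sketched as follows; `src_subword_to_word` and `tgt_subword_to_word` are hypothetical lookup tables (not named in the text) mapping each subword position to the index of the word it belongs to:

```python
import numpy as np

def hard_alignments(attn_matrices, src_subword_to_word, tgt_subword_to_word):
    """Average soft-attention matrices (each target x source), pick the
    argmax source unit per target unit, and link a source word to a target
    word if any of their subword units are aligned."""
    A = np.mean(attn_matrices, axis=0)   # (J, I) averaged attentions
    links = set()
    for j in range(A.shape[0]):
        i = int(np.argmax(A[j]))         # strongest source unit for target j
        links.add((src_subword_to_word[i], tgt_subword_to_word[j]))
    return links
```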
3 Related Work
3.1 Statistical Models
The most commonly used statistical alignment models build directly on the lexical translation models of Brown et al. (1993), also referred to as the IBM models. A popular tool is FastAlign (Dyer et al., 2013), a reparameterization of IBM Model 2 known for its usability and speed. Giza++ (Och and Ney, 2003), which is based on IBM Model 4, provides a solid benchmark in terms of AER results. In our experiments we run MGIZA++ (Gao and Vogel, 2008), a parallel implementation of Giza++, and FastAlign with default parameters as baselines.
The IBM models are composed of an alignment and a lexical model component. The alignment component is unlexicalized. The lexical component models the likelihood for each source word based on a single target word, i.e. it is conditionally independent of the source and target context. We argue that this assumption is a disadvantage over neural approaches, which commonly encode the content and the context of each word in a continuous representation. In this work we make use of both the source and the target context to infer word alignments.
3.2 Neural Models
Neural approaches that affect the attention of the model, and thereby influence the alignments, can be categorized into two groups depending on their goal: improving the translation quality of the machine translation system, or focusing solely on the generation of alignments.
Nguyen and Chiang (2018) add a linear combination of source embeddings to the decoder output to improve the prediction of the next target word in an attention-based translation model. This encourages the model to attend to a useful source word and prevents the resulting alignments from being shifted by one word relative to human judgement. Alkhouli et al. (2018) train a single alignment head of the Transformer with supervised data generated by Giza++, so that its attention directly corresponds to alignments. This improved attention is then used during inference to separate the translation objective into a lexical and an alignment component and to improve dictionary-guided translations. Arthur et al. (2016) add a lexical probability vector, generated from the information of a discrete lexicon, and use the attention vector to decide which source word's lexical probabilities the model should focus on.
Tamura et al. (2014) directly predict alignments with a recurrent neural network conditioned on both the source and the target sequence. By using noise-contrastive estimation and tying the weights of a forward and a backward model during training, they are able to train this network while relying only on IBM Model 1 to generate negative examples. Peter et al. (2017) build on an attention-based neural network to extract alignments. They achieve their best results by bootstrapping the attention with Giza++ alignments and by using target foresight, a technique that uses the target word during training to improve its attention.
4 Alignment Layer
In order to train our alignment component in an unsupervised fashion, i.e. without word-aligned training data, we want to design it with the following property: A source token should be aligned to a target token if we can predict the target token based on a continuous representation of the corresponding source token.
We achieve this by adding an alignment layer to the top of the decoder. As depicted in Figure 1 the complete model predicts the next target word twice: once with the original decoder, once based on the alignment layer.
The alignment layer uses a single multi-head attention sub-module as in Equation 2 and focuses its attention on the encoder. As the query we use the decoder output; for the key-value pairs we use the same encoder-side input, i.e. $K = V$.
For now, let us denote the vectors based on the hidden representation of the encoder, which we use as both keys and values, as $\hat{V}$. We calculate the probability vector $p$ of the next word as follows:

$$p = \mathrm{softmax}(W c)$$

with $W$ as the output projection matrix and $c$ being the output of the multi-head attention of Equation 2:

$$c = \mathrm{multihead}(q, \hat{V}, \hat{V})$$

where the query $q$ is the decoder output.
In contrast to a decoder layer of the Transformer, the alignment layer does not use a self-attention sub-layer and we do not apply any skip connections. Thus the target word prediction of the alignment layer is forced to rely solely on the context vector $c$, a linear combination of the encoder-side representations.
Figure 1 summarizes the whole architecture. Note that the decoder output is masked, so it encodes only the left-hand context.
For the encoder representation $\hat{V}$ we want to encode both the content of the source words and the context in which they appear. Therefore, we experiment with the following options: directly using the word embeddings, using the encoder output, and using the average of the word embeddings and the encoder output. Table 1 summarizes these options.
| Avg | average over all attentions |
| Add | (word emb. + enc. out.) / 2.0 |
| Add+SGD | initialize with forward pass |
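A minimal sketch of the alignment layer's prediction step, reduced to a single head for clarity (the actual layer uses the multi-head formulation of Equation 2; all names here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def alignment_layer_predict(decoder_out, enc_repr, W_out):
    """Single-head variant of the alignment layer: attend from the masked
    decoder output over the encoder-side representations (keys = values) and
    predict the next word from the context vector alone, i.e. without a skip
    connection or self-attention."""
    d = decoder_out.shape[-1]
    a = softmax(enc_repr @ decoder_out / np.sqrt(d))  # soft alignment to sources
    context = a @ enc_repr                            # linear combination of sources
    return softmax(W_out @ context), a                # word distribution, alignment
```

Because no information can bypass the attention, a confident word prediction implies that the attention mass sits on the responsible source positions.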
We train the alignment layer by fine-tuning a fully trained translation model, keeping the parameters of the underlying Transformer network fixed. For the alignment layer we use multi-head attention with a single head throughout this paper (we verified that a single head leads to slightly better results than two attention heads, while four heads performed considerably worse).
5 Attention Optimization based on Target Word
Using the attention activations of the forward pass means that the word alignment to the $i$-th target word does not depend on the identity of that word $t_i$. However, it can be argued that this word is the most relevant piece of information for selecting the correct alignment. When performing the task of word alignment given both the source and the target sentence, we already know the target word $t_i$. If we instead produce the alignment as a by-product of translation inference, it is likely that the prediction of the alignment layer and the actual target sentence differ.
We hypothesize that we can improve the alignment if we find attention activations that lead to a correct prediction of the target sentence. Given attention activations $A$ and the linearly transformed values $\hat{V}$, we can rephrase the equations of Section 4:

$$p = \mathrm{softmax}(W (A \hat{V}))$$
Therefore, the probability distribution depends only on $\hat{V}$, which we extract from the forward pass, and on the attention activations $A$. We can optimize $A$ while evaluating the attention-optimization sub-network of Figure 1 with its only input $\hat{V}$. We treat $A$ as a weight matrix for the remainder of this section and refer to it as the attention weights.
The entry of the probability vector $p$ that corresponds to the target word $t_i$ denotes its probability. Therefore, we can formulate our objective of maximizing the probability of the target word with respect to the attention weights $A$:

$$\hat{A} = \operatorname*{argmax}_{A} \; \log p_{t_i}$$

We optimize the attention weights for each word of the target sequence while keeping all other parameters of the alignment layer fixed. By applying gradient descent, we iteratively update the attention weights towards the goal of maximizing the probability of the correct target word. This can be done in parallel for all words of the target sentence using the cross-entropy loss.
During optimization we relax the constraint that the attention weights form valid probability distributions, i.e. sum to 1. While we experimented with using the softmax function during optimization, we found that applying only the rectified linear unit (ReLU) to guarantee non-negative activations is easier to optimize. This procedure is related to feature visualization in computer vision (Erhan et al., 2009; Olah et al., 2017), which maximizes the output of a neuron with respect to the input image.
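Under the simplification of a single target word and a single head, the optimization described above can be sketched as follows, with manual gradients and the ReLU relaxation; the function name and the learning-rate value in the test are illustrative, not from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def optimize_attention(a0, V_hat, W, target, steps=3, lr=1.0):
    """Refine attention activations by gradient descent so that the alignment
    layer assigns high probability to the known target word.
    a0: (n,) forward-pass activations, V_hat: (n, d) transformed values,
    W: (vocab, d) output projection, target: index of the reference word."""
    a = a0.copy()
    for _ in range(steps):
        p = softmax(W @ (a @ V_hat))          # current word distribution
        d_logits = p.copy()
        d_logits[target] -= 1.0               # grad of cross-entropy wrt logits
        grad_a = V_hat @ (W.T @ d_logits)     # chain rule back to the activations
        a = np.maximum(a - lr * grad_a, 0.0)  # ReLU keeps activations non-negative
    return a
```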
5.1 Initialization of Attention Weights
The question remains how to initialize the attention weights $A$. A straightforward option is to initialize them randomly; while it would arguably be better to start from valid probability vectors, as our first option we simply draw each weight uniformly from the interval [0, 1].

However, intuitively it seems more reasonable to start from attention weights that correspond to a good alignment and might therefore lie closer to a good local optimum. It is possible to convert alignments between subwords directly into attention weights, and therefore to start from the hypothesis of an arbitrary alignment model (by setting the weights that represent alignments between a source and a target word to 1.0 and all other weights to 0.0). We nevertheless restrict our experiments to improving the attention weights of the forward pass: we run a forward pass of the complete Transformer network, extract the attention weights of the alignment layer and start the optimization process from these weights.
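The conversion of hard alignments into initial attention weights described in the parenthesis can be sketched as (the function name is our own):

```python
import numpy as np

def alignments_to_weights(links, n_src, n_tgt):
    """Convert hard subword alignments to initial attention weights: aligned
    pairs get weight 1.0, all other weights 0.0 (one row per target unit)."""
    A = np.zeros((n_tgt, n_src))
    for i, j in links:   # (source index, target index)
        A[j, i] = 1.0
    return A
```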
While tuning the parameters on the validation set, we found that applying three gradient descent steps with a learning rate of 1 leads to surprisingly good results, both in terms of predicting the correct word and the quality of the resulting alignments.
6 Experimental Setup
Our goal for the evaluation is to compare statistical alignment methods, namely FastAlign and Giza++, with the neural approach introduced in this paper. We aim for a fair comparison by using the same training data for all methods and standardizing the pre-processing. We use publicly available training and test data for the following language pairs: German-English, Romanian-English and French-English. All approaches are evaluated in both the forward and the reverse direction, as well as after combining the two with the grow-diagonal heuristic (Koehn et al., 2005), i.e. without using the finalize step. All hyper-parameters are tuned on the German-English task. We open-source our preprocessing pipeline and the baseline experiments using FastAlign and MGIZA++ (https://github.com/lilt/alignment-scripts).
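For reference, a sketch of the grow-diagonal symmetrization heuristic without the finalize step, following the commonly used Moses-style formulation (the exact visiting order of candidate links is an assumption):

```python
def grow_diag(forward, backward):
    """Symmetrize two directional word alignments: start from the
    intersection and keep adding union links that neighbor an existing
    link, as long as their source or target position is still unaligned."""
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                 (0, 1), (1, -1), (1, 0), (1, 1)]
    union = forward | backward
    alignment = forward & backward
    added = True
    while added:
        added = False
        for i, j in sorted(union - alignment):
            src_free = all(i != i2 for i2, _ in alignment)
            tgt_free = all(j != j2 for _, j2 in alignment)
            near = any((i + di, j + dj) in alignment for di, dj in neighbors)
            if near and (src_free or tgt_free):
                alignment.add((i, j))
                added = True
    return alignment
```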
6.1 Training Data
The training data we use has between 0.4 and 1.9 million parallel sentences. This makes it possible to train both statistical and neural methods in a reasonable amount of time. For German-English, we use the Europarl v8 corpus. For Romanian-English and English-French we follow Mihalcea and Pedersen (2003). As the only exception we additionally use Europarl data for Romanian-English to increase the training data from 49k to 0.4M parallel sentences.
We preprocess the training data with the tokenizer from the Moses toolkit (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) and consistently lowercase all training data. For all neural approaches we apply byte pair encoding (Sennrich et al., 2016) with 10k merges; see Table 2 for an example. For FastAlign and Giza++ we concatenate the training and test data. Table 3 summarizes the training data.
| Source | Wir glauben nicht , daß wir nur Rosinen herauspicken sollten . |
| Source BPE | wir _glauben _nicht _, _daß _wir _nur _ro s inen _her au sp ick en _sollten _. |
| Target | We do not believe that we should cherry-pick . |
| Target BPE | we _do _not _believe _that _we _should _ch er ry - p ick _. |
| # Words Source | 50.5M | 20.0M | 11.7M |
| # Words Target | 53.2M | 23.6M | 11.8M |
6.2 Test Data
As test sets we use hand-aligned parallel sentences. For some of these data sets annotators were instructed to align target words that do not have a corresponding source word to a special null token. We always use the version without null tokens, i.e. a target word may not be aligned at all. The data sets are publicly available for German-English (https://www-i6.informatik.rwth-aachen.de/goldAlignment/) and for both Romanian-English and English-French (http://web.eecs.umich.edu/~mihalcea/wpt/index.html). Additionally, annotators distinguished between probable and sure alignments. We evaluate the outputs of the various systems with the alignment error rate (AER) introduced by Och and Ney (2000).
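The AER of Och and Ney (2000) can be computed from the sure and probable link sets as follows (a sketch assuming a set-based representation of alignments):

```python
def aer(hypothesis, sure, probable):
    """Alignment error rate (Och and Ney, 2000). `sure` is a subset of
    `probable`; all arguments are sets of (source, target) index pairs.
    AER = 1 - (|A & S| + |A & P|) / (|A| + |S|)."""
    a_s = len(hypothesis & sure)
    a_p = len(hypothesis & probable)
    return 1.0 - (a_s + a_p) / (len(hypothesis) + len(sure))
```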
7 Results

7.1 German-English

We tune the hyperparameters of the approach presented in this paper on the German-English data set, which we choose as our development data (see Table 4). The naïve approach of averaging all the attention activations of the Transformer network yields sub-optimal results: the unidirectional models produce alignment error rates well above 50%, and combining the two directions still leads to 50.9%. The additional alignment layer always improves results. While using the word embeddings (31.4%) or the encoder output (28.6%) as keys and values yields good improvements, a combination of both works best (27.1%). We speculate that the vectors representing the source tokens should both contain context and retain a strong relation to the original word embeddings. This procedure leads to results roughly as good as FastAlign (27.0%).
Optimizing the attention matrix from a random initialization to predict the reference translation with the SGD settings described in Section 5.1 proves unsuccessful in terms of alignment quality. However, using the attention activations of the forward pass noticeably improves the quality of the symmetrized alignments, yielding AER results similar to Giza++.
7.2 Qualitative Analysis
We will analyse the resulting attention activations and alignments based on the example presented in Table 2. This parallel sentence is the first sentence of our development set and highlights the following challenges: In both source (“wir”) and target (“we”) a word appears twice. Some words are not very common (“cherry-pick”) and get split into multiple subword units. The translation is non-literal, as “Rosinen herauspicken” (“pick raisins”) is translated as “cherry-pick”.
We plot multiple averaged attention activations for this example in Figure 2, using the visualization tool provided by Rikters et al. (2017). The Transformer mainly focuses its attention on the punctuation mark at the end of the sentence (we never attend to the end-of-sentence symbol of the source sequence, because we mask the attention to it consistently during training and inference). In contrast, the alignment layer attends to more meaningful source words.
Figure 3 shows alignments before and after applying SGD. When starting with a random initialization, no meaningful alignments can be generated. Note that different random initializations do not converge to similar alignments. However, when initializing with the attention activations of the forward pass, the resulting alignments improve in most of the cases.
7.3 English-French and Romanian-English
We now test our approach on the English-French and Romanian-English test sets of Mihalcea and Pedersen (2003). Similar to the German-English experiments, AER is consistently improved by adding the alignment layer and with optimization of the attention activations.
Interestingly, the neural approaches in this paper seem to profit more from symmetrizing both directions compared to the statistical approaches. The neural alignment models always use the full source context, but not the full target context, i.e. when generating an alignment we do not look at future target words. We speculate that this might be a contributing factor to the strong reduction in error rates by combining two unidirectional models.
We argue that the superior results of Giza++ on the English-French test set are mainly due to the large portion of probable alignment links (13,400 out of 17,438). This is favourable for Giza++, as it predicts the smallest number of alignments (20,200), while Add+SGD predicts considerably more (26,430). In contrast to the English-French test set, the Romanian-English set does not contain any probable alignments in its reference.
| Add + SGD | 23.8% | 20.5% | 10.0% |
| Add + SGD | 32.3% | 34.8% | 27.6% |
8 Conclusion

This paper addresses the problem of extracting meaningful word alignments from the self-attentional Transformer neural machine translation model. We extend the network with an alignment layer that contains no skip connections around the encoder-attention sub-layer and is thus encouraged to attend to the source words that correspond to the current target word. We further introduce a novel inference procedure to query the model with a given target word. By symmetrizing alignments extracted from models for both translation directions we achieve an alignment quality comparable to IBM Model 4 as implemented in Giza++ on two of the three tasks. In contrast to previous work, our model is trained in an unsupervised fashion and does not require injecting external knowledge from the IBM models into the training pipeline.
- Alkhouli et al. (2018) Tamer Alkhouli, Gabriel Bretschner, and Hermann Ney. 2018. On the alignment problem in multi-head attention-based neural machine translation. In Proceedings of the Third Conference on Machine Translation.
- Arthur et al. (2016) Philip Arthur, Graham Neubig, and Satoshi Nakamura. 2016. Incorporating discrete translation lexicons into neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations.
- Brown et al. (1993) Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics, 19(2):263–311.
- Dyer et al. (2013) Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648.
- Erhan et al. (2009) Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2009. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1.
- Gao and Vogel (2008) Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software engineering, testing, and quality assurance for natural language processing, pages 49–57. Association for Computational Linguistics.
- Koehn et al. (2005) Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 iwslt speech translation evaluation. In International Workshop on Spoken Language Translation (IWSLT) 2005.
- Mihalcea and Pedersen (2003) Rada Mihalcea and Ted Pedersen. 2003. An evaluation exercise for word alignment. In Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts data driven machine translation and beyond -. Association for Computational Linguistics.
- Nguyen and Chiang (2018) Toan Nguyen and David Chiang. 2018. Improving lexical choice in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics.
- Och and Ney (2000) Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics.
- Och and Ney (2003) Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
- Olah et al. (2017) Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. 2017. Feature visualization. Distill. https://distill.pub/2017/feature-visualization.
- Peter et al. (2017) Jan-Thorsten Peter, Arne Nix, and Hermann Ney. 2017. Generating alignments using target foresight in attention-based neural machine translation. The Prague Bulletin of Mathematical Linguistics, 108(1):27–36.
- Rikters et al. (2017) Matiss Rikters, Mark Fishel, and Ondřej Bojar. 2017. Visualizing Neural Machine Translation Attention and Confidence. The Prague Bulletin of Mathematical Linguistics, 109:1–12.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
- Tamura et al. (2014) Akihiro Tamura, Taro Watanabe, and Eiichiro Sumita. 2014. Recurrent neural networks for word alignment model. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.