Improving Neural Machine Translation with Parent-Scaled Self-Attention

by Emanuele Bugliarello, et al.

Most neural machine translation (NMT) models operate on source and target sentences, treating them as sequences of words and neglecting their syntactic structure. Recent studies have shown that embedding the syntax information of a source sentence in recurrent neural networks can improve their translation accuracy, especially for low-resource language pairs. However, state-of-the-art NMT models are based on self-attention networks (e.g., Transformer), in which it is still not clear how to best embed syntactic information. In this work, we explore different approaches to make such models syntactically aware. Moreover, we propose a novel method to incorporate syntactic information in the self-attention mechanism of the Transformer encoder by introducing attention heads that can attend to the dependency parent of each token. The proposed model is simple yet effective, requiring no additional parameter and improving the translation quality of the Transformer model especially for long sentences and low-resource scenarios. We show the efficacy of the proposed approach on NC11 English-German, WMT16 and WMT17 English-German, WMT18 English-Turkish, and WAT English-Japanese translation tasks.






Neural machine translation (NMT) models [Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio] have recently become the dominant paradigm in machine translation, obtaining outstanding empirical results by solving the translation task with simple end-to-end architectures that do not require the tedious feature engineering typical of previous approaches. Most NMT models are trained only on corpora of parallel sentence pairs, disregarding any prior linguistic knowledge on the assumption that it can be learned automatically by an attention mechanism [Luong, Pham, and Manning 2015].

However, despite the impressive results achieved by such attention-based NMT models, Shi et al. (2016) found that these models still fail to capture deep structural details. Incorporating grammatical knowledge of a language is a promising way to improve NMT models, as statistical machine translation (SMT) models [Brown et al. 1990; Koehn, Och, and Marcu 2003] have obtained relevant gains in translation accuracy from it in the past [Yamada and Knight 2001; Liu, Liu, and Lin 2006; Chiang 2007].

Figure 1: A dependency tree for the input sentence “The monkey eats a banana”. Arrows point from heads to their dependents, while labels indicate relationships between words. Typically, the main verb of the sentence is designated as the root node and its head (or parent) is a special <ROOT> token.

Over the past few years, a number of studies have shown that syntactic information has the potential to also improve NMT models [Luong et al. 2015; Sennrich and Haddow 2016; Li et al. 2017; Eriguchi, Tsuruoka, and Cho 2017; Chen et al. 2018]. However, the majority of recent syntax-aware NMT models [Bisk and Tran 2018; Bastings et al. 2019] are based on recurrent neural networks (RNNs) [Elman 1990], a class of attentional encoder-decoder models [Bahdanau, Cho, and Bengio] that process an input sentence sequentially, one token at a time. Vaswani et al. (2017) recently introduced the Transformer model, an encoder-decoder architecture based solely on self-attention [Cheng, Dong, and Lapata 2016; Lin et al. 2017]. Transformer models can achieve state-of-the-art results on various translation tasks with much shorter training times.

The key factor behind the Transformer's superior performance is its self-attention mechanism, which gives efficient access to any token in a sequence by directly attending to each pair of tokens. Nevertheless, recent studies have shown that self-attention networks (SANs) benefit from modeling local contexts, reducing the dispersion of the attention distribution by restricting it to neighboring representations [Shaw, Uszkoreit, and Vaswani 2018; Yang et al. 2018; Yang et al. 2019]. Moreover, while SANs can focus on the entire sequence, recent work [Tran, Bisazza, and Monz 2018; Tang et al. 2018] suggests that, especially in low-resource scenarios, they might not capture the inherent syntactic structure of languages as well as recurrent models do. Capturing this structure could help the models reduce ambiguity and preserve agreement when translating.

In response, we propose to enhance the self-attention mechanism by incorporating source-side syntactic information to further improve the performance of SANs in machine translation without compromising their flexibility. Specifically, we introduce parent-scaled self-attention (Pascal): a novel, parameter-free local attention mechanism that lets the model focus on the dependency parent of each token when encoding the source sentence. Our method is simple yet effective, resulting in better translation quality with no additional parameter to be learned or computational overhead.

To demonstrate the effectiveness and generality of our approach, we run extensive experiments on the popular large-scale WMT16 and WMT17 English-German (En-De) and WAT English-Japanese (En-Ja) translation tasks, as well as on News Commentary v11 English-German (En-De, De-En) and WMT18 English-Turkish (En-Tr) as low-resource scenarios. Our results show that the proposed approach consistently exhibits significant improvements in translation quality, especially for long sentences, not only over previous NMT approaches using dependency information and the Transformer baseline, but also against other strong syntax-aware variants of this model. To the best of our knowledge, this is the first work that investigates and exploits the core properties of the Transformer architecture to incorporate source-side syntax and further improve its translation quality.

Related Work


In this paper, vectors are column vectors represented by bold lowercase letters (e.g., $\mathbf{a}$). Matrices and tensors are denoted by bold uppercase letters (e.g., $\mathbf{A}$). For a given matrix $\mathbf{A}$, $\mathbf{a}_i$ denotes its $i$-th row as a column vector and $a_{ij}$ is its entry in row $i$ and column $j$.

Neural Machine Translation

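Neural Machine Translation (NMT) systems typically learn a probabilistic mapping $p(\mathbf{y} \mid \mathbf{x})$ from a source language sentence $\mathbf{x} = (x_1, \dots, x_T)$ to a target sentence $\mathbf{y} = (y_1, \dots, y_{T'})$, where $x_i$ and $y_j$ denote the $i$-th and $j$-th tokens in $\mathbf{x}$ and $\mathbf{y}$, respectively. Most NMT systems are based on an encoder-decoder framework. In this setting, the encoder reads the source sentence and generates internal representations (context vectors) $\mathbf{h}_1, \dots, \mathbf{h}_T$ that are then used by the decoder to compute the conditional probability of each target token given its preceding tokens $\mathbf{y}_{<j}$ and the source sentence, $p(y_j \mid \mathbf{y}_{<j}, \mathbf{x})$, as a function of $\mathbf{s}_j$ and $\mathbf{c}_j$. Here, $\mathbf{s}_j$ denotes the decoder hidden state at time $j$ and $\mathbf{c}_j$ is the contextual information used in generating $y_j$ from the encoder hidden states. Nowadays, context vectors are computed as a weighted average of the encoder hidden states $\mathbf{h}_i$, with weights given by an attention mechanism that assigns an alignment score to the pair of input at position $i$ and output at position $j$. Let $\theta$ be the parameters of the neural network and $\mathcal{D}$ the source-target sentence pairs in the training data. The learning objective of the model is:

$$\theta^{*} = \arg\max_{\theta} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log p(\mathbf{y} \mid \mathbf{x}; \theta).$$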

Syntax-Aware Neural Machine Translation

Several approaches have been investigated in the literature in order to incorporate dependency syntax in NMT models.

For instance, Eriguchi et al. (2016) integrate dependency trees into NMT models through a tree-based encoder that follows the phrase structure of a sentence. To alleviate the low efficiency of processing trees, P17-1064 linearize constituency parse trees into sequences of symbols mixed with words and syntactic tags. D17-1012 instead propose to combine head information and sequential words as inputs to the source encoder. D17-1209 exploit graph-convolutional networks to produce syntax-aware representations of words.

The majority of these studies focused on recurrent networks. Wu et al. (2018) were the first to evaluate an approach to embed syntactic information in NMT with a Transformer model. In their work, the authors first pair the source sentence encoder with two additional encoders that embed source dependency trees acquired by pre-order and post-order traversals. Then, the resulting context vectors are combined via a feed-forward layer and passed to two decoders, one modeling the target word sequence and the other modeling the parsing action sequence for the target dependency tree. The full model is trained to maximize the joint probability of target translations and their parse trees.

Concurrently to our work, two other studies have proposed methodologies to introduce syntactic knowledge into Transformer-based models. First, Zhang et al. (2019) integrate source-side syntax into NMT by concatenating the intermediate representations of a dependency parser to ordinary word embeddings. The authors rely on the hidden representations of the encoder of a dependency parser model in order to alleviate the error propagation problem from 1-best tree outputs by syntax parsers. This approach, however, does not allow learning sub-word units on the source side, requiring a larger vocabulary to minimize the number of out-of-vocabulary (OOV) words. In contrast, we explicitly account for sub-word units in our approach and also propose a regularization technique to make our attention mechanism robust to noisy parses. Second, Currey and Heafield (2019) propose two simple data augmentation techniques to incorporate source-side syntax. Their first method is a multi-task approach to parse and translate source sentences: special tags are prepended and appended to each sentence, and the model is trained to either translate the source sentence or output its linearized constituency parse. In their second approach, they train a Transformer model to translate both unparsed and parsed source sentences into unparsed target sentences. While these studies enhance the performance of the Transformer model, they treat it as a black box. Conversely, we explicitly enhance its self-attention mechanism (a core component of this architecture) to include syntactic information and further improve its translation accuracy.

Figure 2: Parent-Scaled Self-Attention (Pascal) head for the input sequence “The monkey eats a banana”.


In order to design a neural network that is efficient to train and that exploits syntactic information while producing high-quality translations, we base our model on the Transformer architecture [Vaswani et al. 2017] and upgrade one layer of its encoder with parent-scaled self-attention (Pascal) heads. Pascal is a novel, guided attention mechanism that enforces contextualization from the syntactic dependencies of each source token. In practice, we replace standard self-attention heads with Pascal ones in the first layer: the inputs to this layer are word embeddings, which lack contextual information, so conditioning the representation of a given word on its parent readily augments it before it is propagated to upper layers. Our multi-head Pascal sub-layer has the same number of attention heads as other layers.

Transformer encoder.

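The Transformer architecture follows the standard encoder-decoder paradigm, making use of positional embeddings to overcome the sequential nature of recurrent networks, which precludes them from parallelization among samples. The Transformer encoder consists of a stack of $N$ identical layers, each having two sub-layers coupled with layer normalization and residual connections. The first sub-layer computes attention weights for each token in a sequence via a multi-head attention mechanism. Specifically, $H$ heads are used to compute distributions for each token in a sequence, which are then concatenated together to let the model attend to different representations:

$$\mathrm{MultiHead}(\mathbf{X}) = \left[\mathrm{head}^{1}; \dots; \mathrm{head}^{H}\right] \mathbf{W}^{O}, \qquad \mathrm{head}^{h} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}^{h} {\mathbf{K}^{h}}^{\top}}{\sqrt{d}}\right) \mathbf{V}^{h},$$

where $\mathbf{Q}^{h} = \mathbf{X}\mathbf{W}^{Q}_{h}$, $\mathbf{K}^{h} = \mathbf{X}\mathbf{W}^{K}_{h}$, $\mathbf{V}^{h} = \mathbf{X}\mathbf{W}^{V}_{h}$, and $\mathbf{W}^{Q}_{h}$, $\mathbf{W}^{K}_{h}$, $\mathbf{W}^{V}_{h}$, $\mathbf{W}^{O}$ are parameter matrices, $\mathbf{X}$ is the representation of the input sequence from the previous layer, and $d$ is a constant that depends on the size of the model.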
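The multi-head computation described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function names, the random weights, and the toy sizes (`T`, `d_model`, `H`) are assumptions made for the example, and layer normalization, residual connections, and masking are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """X: (T, d_model); W_q, W_k, W_v: (H, d_model, d); W_o: (H * d, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv           # per-head projections, each (T, d)
        scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T, T) scaled dot products
        heads.append(softmax(scores) @ V)          # (T, d) head output
    # Concatenate the heads and mix them with the output projection.
    return np.concatenate(heads, axis=-1) @ W_o    # (T, d_model)


# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
T, d_model, H = 5, 16, 4
d = d_model // H
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(H, d_model, d)) for _ in range(3))
W_o = rng.normal(size=(H * d, d_model))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)
```

Each head sees the same input but has its own projections, which is what lets different heads attend to different representations.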

The second sub-layer is a two-layer feed-forward network with a ReLU activation function between the two layers. The Transformer decoder has a similar architecture; we refer the reader to Vaswani et al. (2017) for more details.

Dependency position

Similarly to previous work, instead of just providing sequences of tokens, we supply the encoder with dependency relations given by an external parser. Our approach explicitly exploits sub-word units, which enable open-vocabulary translation: after generating sub-word units, we compute the middle position of each word in terms of number of tokens. For instance, if a word is split into three tokens occupying positions $i$, $i+1$, and $i+2$, its middle position is $i+1$. We finally map each sub-word unit of a given word to the middle position of its parent. For the root word, we define its parent to be itself, resulting in a parse that is a directed graph. The input to our encoder is then a sequence of tokens and the absolute positions of their parents.
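The mapping from sub-word tokens to parent positions can be sketched as follows. This is a minimal illustration under stated assumptions: positions are 0-indexed, the middle position of a word is taken as the mean of its first and last token positions (so a word split into an even number of tokens gets a fractional middle position), and the function name and the segmentation in the example are hypothetical.

```python
def parent_middle_positions(subword_counts, parent_word):
    """Map every sub-word token to the middle position of its parent word.

    subword_counts[w] -- number of sub-word tokens that word w was split into
    parent_word[w]    -- index of the dependency parent of word w
                         (the root word points to itself)
    Positions are 0-indexed token positions in the sub-word sequence.
    """
    # First token position of each word in the sub-word sequence.
    starts = []
    offset = 0
    for count in subword_counts:
        starts.append(offset)
        offset += count

    # Assumption: the middle of a word is the mean of its first and last positions.
    middles = [start + (count - 1) / 2 for start, count in zip(starts, subword_counts)]

    # Every sub-word token of word w inherits the middle position of w's parent.
    positions = []
    for w, count in enumerate(subword_counts):
        positions.extend([middles[parent_word[w]]] * count)
    return positions


# "The monkey eats a banana": "eats" is the root and is its own parent.
# Hypothetical segmentation: "monkey" -> 2 tokens, "banana" -> 3 tokens.
counts = [1, 2, 1, 1, 3]
parents = [1, 2, 2, 4, 2]  # The->monkey, monkey->eats, eats->eats, a->banana, banana->eats
print(parent_middle_positions(counts, parents))
# [1.5, 3.0, 3.0, 3.0, 6.0, 3.0, 3.0, 3.0]
```

In this toy example, "The" points to position 1.5 because its parent "monkey" occupies token positions 1 and 2, while both tokens of "monkey" point to the single token of "eats" at position 3.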

Parent-Scaled Self-Attention

Figure 2 shows our parent-scaled self-attention sub-layer. Here, for a sequence of length $T$, the inputs to each head are a matrix $\mathbf{X} \in \mathbb{R}^{T \times d_{model}}$ of token embeddings and a vector $\mathbf{p} \in \mathbb{R}^{T}$ whose $t$-th entry $p_t$ is the middle position of the $t$-th token's dependency parent. Following Vaswani et al. (2017), in each attention head $h$, we compute three vectors (called query, key and value) for each token, resulting in the three matrices $\mathbf{K}^{h}$, $\mathbf{Q}^{h}$, and $\mathbf{V}^{h} \in \mathbb{R}^{T \times d}$ for the whole sequence, where $d = d_{model}/H$ and $H$ is the number of heads. We then compute dot products between each query and all the keys, giving scores of how much focus to place on other parts of the input when encoding a token at a given position. The scores are divided by $\sqrt{d}$ to alleviate the vanishing gradient problem arising if dot products are large:

$$\mathbf{S}^{h} = \frac{\mathbf{Q}^{h} {\mathbf{K}^{h}}^{\top}}{\sqrt{d}}.$$

The main contribution of our work consists of weighing the scores of the token at position $t$, $\mathbf{s}^{h}_{t}$, by the distance of each token from the position of $t$'s dependency parent:

$$n^{h}_{tj} = s^{h}_{tj} \, d^{p}_{tj}, \qquad j = 1, \dots, T,$$

where $\mathbf{n}^{h}_{t}$ is the $t$-th row of the matrix $\mathbf{N}^{h} \in \mathbb{R}^{T \times T}$ representing scores normalized by the proximity to $t$'s parent. $d^{p}_{tj}$ is the $(t, j)$ entry of the matrix $\mathbf{D}^{p} \in \mathbb{R}^{T \times T}$ containing, for each row $t$, the distances of every token $j$ from the middle position of token $t$'s dependency parent $p_t$. In this paper, we compute this distance as the probability density of a normal distribution centered at $p_t$ and with variance $\sigma^{2}$, $\mathcal{N}(p_t, \sigma^{2})$:

$$d^{p}_{tj} = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\!\left(-\frac{(j - p_t)^{2}}{2\sigma^{2}}\right).$$

Finally, we apply a softmax function to $\mathbf{N}^{h}$ to yield a distribution of weights for each token over all the tokens in the sentence, and multiply the resulting matrix with the value matrix $\mathbf{V}^{h}$, obtaining the final representations $\mathbf{M}^{h}$ for Pascal head $h$.

One of the major strengths of our proposal is that it is parameter-free: no additional parameter is required to train a Pascal sub-layer, as the matrix of parent distances can be computed by a distance function that depends only on the vector of the tokens' parent positions and can be evaluated using fast matrix operations on GPUs.
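A single parent-scaled head, as described in this section, might be sketched as follows: scaled dot-product scores are multiplied element-wise by a Gaussian density centered at each token's parent position, and a softmax is applied afterwards. This is an illustrative sketch rather than the authors' code; the function names, 0-indexed positions, random inputs, and toy sizes are assumptions.

```python
import numpy as np


def parent_distance_matrix(parent_pos, variance=1.0):
    """D[t, j]: Gaussian density at position j, centered at the parent position of token t."""
    parent_pos = np.asarray(parent_pos, dtype=float)
    positions = np.arange(len(parent_pos), dtype=float)
    diff = positions[None, :] - parent_pos[:, None]             # (T, T): j - p_t
    return np.exp(-diff ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)


def pascal_head(Q, K, V, parent_pos, variance=1.0):
    """Q, K, V: (T, d) projections of one head; parent_pos: (T,) parent middle positions."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                      # scaled dot products
    scaled = scores * parent_distance_matrix(parent_pos, variance)  # parent scaling, no new parameters
    scaled = scaled - scaled.max(axis=-1, keepdims=True)         # stabilise the softmax
    weights = np.exp(scaled)
    weights /= weights.sum(axis=-1, keepdims=True)               # each row is a distribution
    return weights @ V, weights


# Toy inputs for a five-token sentence whose parents sit at positions 1, 2, 2, 4, 2.
rng = np.random.default_rng(0)
T, d = 5, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out, weights = pascal_head(Q, K, V, parent_pos=[1, 2, 2, 4, 2])
print(out.shape, weights.sum(axis=-1))
```

The distance matrix depends only on the parent positions, so it can be computed once per sentence and shared by all Pascal heads.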

Parent ignoring.

Due to the lack of parallel corpora with gold-standard parses, we rely on noisy annotations from an external parser. However, the performance of syntactic parsers drops abruptly when evaluated on out-of-domain data [Dredze et al. 2007]. To prevent our model from overfitting to noisy dependencies, we introduce a regularization technique for our Pascal sub-layer: parent ignoring. In a similar vein to dropout [Srivastava et al. 2014], we disregard information during the training phase. Here, we ignore the position of the parent of a given token by randomly setting each row of the distance matrix $\mathbf{D}^{p}$ to $\mathbf{1}$ (a vector of ones, which leaves the corresponding scores unscaled) with some probability $q$.
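Parent ignoring can be sketched as a row-wise mask over the matrix of parent distances. The sketch below assumes that an ignored row is filled with ones, so that the scores of that token are left unscaled, and that the technique is disabled at inference time, as with dropout; the function name and signature are hypothetical.

```python
import numpy as np


def apply_parent_ignoring(D, q, rng, training=True):
    """Replace each row of the (T, T) distance matrix D with ones with probability q."""
    if not training or q <= 0.0:
        return D                                   # no regularization outside training
    ignore = rng.random(D.shape[0]) < q            # one Bernoulli draw per row (token)
    return np.where(ignore[:, None], 1.0, D)       # ignored rows leave scores unscaled


rng = np.random.default_rng(0)
D = np.full((4, 4), 0.25)                          # placeholder distance matrix
print(apply_parent_ignoring(D, q=0.3, rng=rng))
```

Because whole rows are replaced, a token either uses its parent information or falls back to standard self-attention for that training step.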

Gaussian weighing function.

The choice of weighing each score by a Gaussian probability density is motivated by two of its properties: its bell-shaped curve and its non-zero values. First, it concentrates most of the probability density at the mean of the distribution, which we set to the middle position of the sub-word units of the dependency parent of each token. In our experiments, we find that most words in the vocabularies are not split into sub-words, hence allowing Pascal to mostly focus on the actual parent. In addition, non-negligible weights are placed on the neighbors of the parent token, allowing the attention mechanism to also attend to them. This could be useful, for instance, to learn idiomatic expressions such as prepositional verbs in English. Second, we exploit the support of Gaussian-like distributions: while most of the weight is placed in a small window of tokens around the mean of the distribution, all the values in the sequence are actually multiplied by non-zero factors. This allows a token $j$ farther away from the position $p_t$ of the dependency parent of token $t$ to still play a role in the representation of $t$ if its attention score is high.
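Both properties can be checked numerically. The short sketch below evaluates the Gaussian density at increasing distances from the parent position; the unit variance is used purely as an illustrative value, and the function name is an assumption.

```python
import math


def gaussian_weight(distance, variance=1.0):
    """Density of a zero-mean normal distribution evaluated at `distance`."""
    return math.exp(-distance ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


# Weight applied to a token located `distance` positions away from the parent.
for distance in range(5):
    print(distance, round(gaussian_weight(distance), 4))
```

With unit variance, the factor is about 0.40 at the parent, 0.24 one position away, 0.05 two positions away, and below 0.005 from three positions on: small, but never exactly zero.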

Our attention mechanism can be seen as an extension of the local attention introduced by Luong, Pham, and Manning (2015), with the alignment now guided by syntactic information. Yang et al. (2018) proposed a similar method that learns a Gaussian bias that is added to, instead of multiplied by, the original attention distribution. As we will see in the next section, our model significantly outperforms it.

| Corpus | Train | Filtered Train | Validation | Test |
|---|---|---|---|---|
| NC11 En-De | 238,843 | 233,483 | 2,169 | 2,999 |
| WMT18 En-Tr | 207,373 | - | 3,000 | 3,007 |
| WMT16 En-De | 4,500,962 | 4,281,379 | 2,169 | 2,999 |
| WMT17 En-De | 5,852,458 | - | 2,999 | 3,004 |
| WAT En-Ja | 3,008,500 | - | 1,790 | 1,812 |

Table 1: Number of sentences in our evaluation datasets.


Experimental Setup

We evaluate the efficacy of the proposed approach on standard, large-scale benchmarks as well as on low-resource scenarios, where Transformer models were shown to induce poorer syntactic relationships than on high-resource ones.


Unless otherwise specified, we follow the same pre-processing steps as Vaswani et al. (2017). We use Stanford CoreNLP [Manning et al. 2014] to generate syntactic information in our experiments, and jointly learn byte-pair encodings (BPE) [Sennrich, Haddow, and Birch 2016] for source and target languages in each parallel corpus.


We report previous results in syntax-aware NMT for completeness and implement the following four Transformer-based approaches as stronger baselines:

  • Transformer: We train a Transformer model as a strong, standard baseline for our experiments, using the hyperparameters of the latest version of Google's Tensor2Tensor.

  • +S&H: Following Sennrich and Haddow (2016), we introduce syntactic information in the form of dependency labels in the embedding matrix of the Transformer encoder. More specifically, each token is associated with its dependency label, which is first embedded into a low-dimensional vector representation and then used to replace the same number of trailing dimensions of the token embedding, ensuring a final size that matches the original one.

  • +SISA: Syntactically-Informed Self-Attention [Strubell et al. 2018]. In one attention head $h$, the query and key matrices are computed through a feed-forward layer, and the key-query dot product used to obtain attention weights is replaced by a bi-affine operator. These attention weights are further supervised to attend to each token's parent by interpreting each row $t$ as the distribution over possible parents for token $t$. Here, we extend the authors' approach to BPE by defining the parent of a given token as its first sub-word unit (root of a word). The model is trained to maximize the joint probability of translations and parent positions.

  • +C&H: The multi-task approach of Currey and Heafield (2019), which uses a standard Transformer model to learn to both parse and translate source sentences. Each source sentence is first duplicated and associated with its linearized parse as target sequence. To distinguish between the two tasks, a special tag indicating the desired task is prepended and appended to each source sentence. Finally, parsing and translation training data are shuffled together.
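The data construction behind the multi-task baseline in the last bullet can be sketched as follows. This is a rough illustration only: the tag strings, the function name, and the linearized parse in the example are hypothetical placeholders, not the format used in the cited work.

```python
import random


def build_multitask_corpus(sources, translations, parses,
                           translate_tag="<2TRANSLATE>", parse_tag="<2PARSE>", seed=0):
    """Return shuffled (source, target) pairs for joint translation and parsing.

    The tag strings are hypothetical placeholders for the task-specific tags.
    """
    examples = []
    for src, tgt, parse in zip(sources, translations, parses):
        # The task tag is both prepended and appended to the source sentence.
        examples.append((f"{translate_tag} {src} {translate_tag}", tgt))
        examples.append((f"{parse_tag} {src} {parse_tag}", parse))
    random.Random(seed).shuffle(examples)          # mix parsing and translation data
    return examples


corpus = build_multitask_corpus(
    ["The monkey eats a banana"],
    ["Der Affe isst eine Banane"],
    ["(S (NP The monkey) (VP eats (NP a banana)))"],  # made-up linearized parse
)
for source, target in corpus:
    print(source, "=>", target)
```

A single standard Transformer is then trained on the combined corpus, with the tags telling it which output to produce.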


Following D17-1209, we train on the News Commentary v11 (NC11) dataset for the English-German (En-De) and German-English (De-En) tasks, so as to simulate low-resource cases and to evaluate the performance of our models for different source languages. We also train on the full WMT16 dataset for En-De, using newstest2015 and newstest2016 as validation and test sets, respectively, in each of these experiments. Moreover, we notice that these datasets contain sentences in different languages and use langdetect to remove sentence pairs whose main language does not match the source and target ones. The sizes of the final datasets are listed in Table 1, and we train our models on the filtered versions.
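The language filtering step can be sketched as a simple predicate over sentence pairs. In the sketch below the detector is passed in as a callable so that the example runs without external dependencies; with the langdetect package installed, its `detect` function could be supplied instead. The function name, the toy detector, and the sample pairs are assumptions made for illustration.

```python
def filter_by_language(pairs, detect, src_lang, tgt_lang):
    """Keep only sentence pairs whose detected languages match the translation task."""
    kept = []
    for src, tgt in pairs:
        try:
            ok = detect(src) == src_lang and detect(tgt) == tgt_lang
        except Exception:
            ok = False  # a detector may fail, e.g. on empty or symbol-only strings
        if ok:
            kept.append((src, tgt))
    return kept


def toy_detect(text):
    # Stand-in detector for this example only: flags a few German function words.
    german_markers = {"der", "die", "das", "und", "ist"}
    return "de" if german_markers & set(text.lower().split()) else "en"


pairs = [
    ("The cat is here", "Die Katze ist hier"),
    ("Hello world", "Hello world"),  # target side is not German: dropped
]
print(filter_by_language(pairs, toy_detect, "en", "de"))
```

A statistical detector makes mistakes on short or noisy sentences, so a filter of this kind can also discard valid pairs.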

We also train our models on WMT18 English-Turkish (En-Tr) as a standard low-resource scenario. Models are validated on newstest2016 and tested on newstest2017.

Previous studies on syntax-aware NMT have commonly been conducted on the WMT16 En-De and WAT English-Japanese (En-Ja) tasks, while concurrent approaches are evaluated on the WMT17 En-De task. In order to provide a generic and comprehensive evaluation of our proposed approach on large-scale data, we also train our models on the latter tasks. We follow the WAT18 pre-processing steps for experiments on En-Ja but use Cabocha to tokenize target sentences. On WMT17, we use newstest2016 and newstest2017 as validation and test sets.

For each translation task, we jointly learn byte-pair encodings, using a different number of merge operations for low-resource and large-scale experiments.

Training details.

| Corpus | Max. learning rate | Adam decay rates | Pascal heads | Parent ignoring probability |
|---|---|---|---|---|
| NC11 En-De | 0.0007 | (0.9, 0.997) | 2 | 0.4 |
| NC11 De-En | 0.0007 | (0.9, 0.997) | 8 | 0.0 |
| WMT18 En-Tr | 0.0007 | (0.9, 0.98) | 7 | 0.3 |
| WMT16 En-De | 0.0007 | (0.9, 0.98) | 5 | 0.0 |
| WMT17 En-De | 0.0007 | (0.9, 0.997) | 7 | 0.3 |
| WAT En-Ja | 0.0007 | (0.9, 0.997) | 7 | 0.0 |

Table 2: Hyperparameters for the reported models: the maximum learning rate, Adam's exponential decay rates, the number of parent-scaled self-attention heads, and the parent ignoring probability.

We implement our models in PyTorch on top of the Fairseq toolkit, which provides a re-implementation of the Transformer. All experiments are based on the base Transformer architecture and optimized following the learning schedule of Vaswani et al. (2017), including its warm-up phase. Similarly, we use label smoothing during training and employ beam search with a length penalty at inference time. We use token-based batches and run experiments on a cluster of machines equipped with Nvidia P100 GPUs.
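A warm-up learning schedule of the kind referred to above can be sketched as a linear increase up to a maximum learning rate followed by inverse square-root decay. The maximum of 0.0007 is the value listed in Table 2; the number of warm-up steps is not recoverable from this text, so the default of 4000 below is a placeholder assumption, as are the function and parameter names.

```python
def learning_rate(step, max_lr=0.0007, warmup_steps=4000):
    """Linear warm-up to max_lr, then decay proportional to 1 / sqrt(step).

    warmup_steps=4000 is a placeholder value chosen for this sketch.
    """
    step = max(step, 1)                            # avoid division by zero at step 0
    return max_lr * min(step / warmup_steps, (warmup_steps / step) ** 0.5)


for step in (1, 2000, 4000, 16000, 100000):
    print(step, learning_rate(step))
```

The two branches meet at `warmup_steps`, where the schedule reaches its maximum learning rate.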

For each model, we run a small grid search over the hyperparameters and select the ones giving the highest BLEU scores on the validation sets. Table 2 lists the hyperparameters of the proposed model, including the number of Pascal heads. In our datasets, we observe that most words are not split after BPE and that the vast majority of the remaining ones are split into only a few tokens; this holds, for instance, for the English words in our WMT16 training data. Hence, a small window is most suitable for Pascal to attend to dependency parents. This can be achieved by setting a small variance for the Gaussian weighing function, which gives non-negligible weights only to tokens within a few positions of the mean.

We use the sacreBLEU tool (signature: BLEU+c.mixed+#.1+s.exp+tok.13a+v.1.2.12) to compute case-sensitive BLEU [Papineni et al. 2002] scores. Statistical significance against the Transformer baseline is assessed via bootstrap re-sampling [Koehn 2004]. When evaluating En-Ja translations, we follow the procedure at WAT by computing BLEU scores after tokenizing target sentences using KyTea. In addition, we also report RIBES [Isozaki et al. 2010] scores on this translation task.
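Significance testing by bootstrap re-sampling can be sketched as follows: test sets are repeatedly re-sampled with replacement, and the fraction of re-samples on which the system fails to beat the baseline serves as an estimate of the p-value. The function name, the number of samples, and the exact-match metric used in the example are assumptions; in practice `metric` would be a corpus-level BLEU function.

```python
import random


def paired_bootstrap(system, baseline, references, metric, n_samples=1000, seed=0):
    """Fraction of re-sampled test sets on which `system` does not beat `baseline`."""
    rng = random.Random(seed)
    n = len(references)
    losses = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample sentences with replacement
        refs = [references[i] for i in idx]
        sys_score = metric([system[i] for i in idx], refs)
        base_score = metric([baseline[i] for i in idx], refs)
        if sys_score <= base_score:
            losses += 1
    return losses / n_samples


def exact_match(hypotheses, references):
    # Stand-in corpus-level metric for this example; not BLEU.
    return sum(h == r for h, r in zip(hypotheses, references)) / len(references)


refs = ["a b", "c d", "e f", "g h"]
system = ["a b", "c d", "e f", "g h"]
baseline = ["a x", "c x", "e x", "g x"]
print(paired_bootstrap(system, baseline, refs, exact_match))
```

The same re-sampled indices are used for both systems, which is what makes the comparison paired.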

Following Vaswani et al. (2017), we train Transformer-based models for a fixed number of steps on large-scale data. On small-scale data, we adjust the number of training steps and the dropout probability, as doing so lets the Transformer baseline achieve higher performance on data of this size. For instance, on WMT18 En-Tr, our baseline reaches a higher BLEU score than the one reported by Currey and Heafield (2019).

Main Results

Figure 3: Analysis by sentence length: difference in BLEU between the proposed approach and the Transformer baseline.
| Method | NC11 En-De | NC11 De-En | WMT18 En-Tr |
|---|---|---|---|
| D17-1209 | 16.1 | - | - |
| Transformer | 25.0 | 26.6 | 13.1 |
| Transformer +S&H | 25.5 | 26.8 | 13.0 |
| Transformer +SISA | 25.5 | 27.2 | 13.6 |
| Transformer +C&H | 24.8 | 26.7 | 14.0 |
| Proposed approach | 25.9 | 27.4 | 14.0 |

Table 3: Evaluation results on small-scale data.
| Method | WMT16 | WMT17 |
|---|---|---|
| Sennrich and Haddow (2016) | 28.4 | - |
| D17-1209 | 23.9 | - |
| tran2018inducing | 30.3 | 24.3 |
| SE+SD-Transformer [Wu et al. 2018] | - | 26.2 |
| Mixed Enc. [Currey and Heafield 2019] | - | 26.0 |
| Transformer | 33.0 | 25.5 |
| Transformer +S&H | 31.9 | 25.1 |
| Transformer +SISA | 33.6 | 25.9 |
| Transformer +C&H | 32.4 | 24.6 |
| Proposed approach | 33.9 | 26.1 |

Test BLEU scores, En-De, WMT16 and WMT17.

| Method | BLEU | RIBES |
|---|---|---|
| Eriguchi et al. (2016) | 34.9 | 81.58 |
| LGP-NMT+ [Hashimoto and Tsuruoka 2017] | 39.4 | 82.83 |
| SE+SD-NMT [Wu et al. 2018] | 36.4 | 81.83 |
| Transformer | 43.1 | 83.46 |
| Transformer +S&H | 42.8 | 83.88 |
| Transformer +SISA | 43.2 | 83.51 |
| Transformer +C&H | 43.1 | 84.87 |
| Proposed approach | 44.0 | 85.21 |

Test BLEU and RIBES scores, En-Ja, WAT.

Table 4: Evaluation results on large-scale data.

Low-resource experiments.

Table 3 presents the BLEU scores on the test data of the small-scale datasets introduced above. Clearly, the Transformer model vastly outperforms a previous syntax-aware RNN-based approach, proving it to be a strong baseline in our experiments. The proposed approach outperforms all the baselines, with consistent gains over the base Transformer regardless of the source language. Specifically, it leads to +0.9 BLEU points on NC11 En-De and WMT18 En-Tr, and +0.8 on NC11 De-En.

The table also shows that the simple dependency-aware approach of Sennrich and Haddow (2016) does not lead to notable advantages when applied to the embeddings of the Transformer model. On the other hand, the SISA mechanism leads to modest but consistent improvements across all tasks, confirming that it can also be used to improve NMT. Finally, we observe that the multi-task approach of Currey and Heafield (2019) benefits from a better parameterization of the Transformer, leading to a higher BLEU score on the En-Tr task than in the original results. While its performance matches that of our model on this task, it only attains performance comparable to the base Transformer on the NC11 tasks.

| Component | NC11 En-De | NC11 De-En | WMT18 En-Tr | WMT16 En-De | WMT17 En-De | WAT En-Ja |
|---|---|---|---|---|---|---|
| Transformer | 22.6 | 23.8 | 12.6 | 29.0 | 31.5 | 42.2 |
| + data filtering | 22.8 (+0.2) | 24.0 (+0.2) | - | 28.7 (-0.3) | - | - |
| + Pascal | 23.0 (+0.2) | 24.6 (+0.6) | 13.6 (+1.0) | 29.2 (+0.5) | 31.6 (+0.1) | 43.5 (+1.3) |
| + parent ignoring | 23.2 (+0.2) | - | 13.7 (+0.1) | - | 32.1 (+0.6) | - |

Table 5: Validation BLEU when incrementally adding each component used by our best-performing models on each task.
| System | Sentence |
|---|---|
| SRC | In a cooling experiment , only a tendency agreed |
| BASE | 冷却 実験 で は ,わずか な 傾向 が 一致 し た |
| OURS | 冷却 実験 で は 傾向 のみ 一致 し た |

Table 6: Example of incorrect translation by the baseline.
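(a)

| Layer | En-De | De-En |
|---|---|---|
| 1 | 23.2 | 24.6 |
| 2 | 22.5 | 20.1 |
| 3 | 22.5 | 23.8 |
| 4 | 22.6 | 23.8 |
| 5 | 22.9 | 23.8 |
| 6 | 22.4 | 23.9 |

(b)

| Variance | En-De | De-En |
|---|---|---|
| Parent-only | 22.5 | 22.4 |
| 1 | 23.2 | 24.6 |
| 4 | 22.7 | 24.3 |
| 9 | 22.8 | 24.3 |
| 16 | 22.7 | 24.4 |
| 25 | 22.8 | 24.1 |

Table 7: Validation BLEU as a function of the Pascal layer (a) and the Gaussian variance (b) on the NC11 En-De and De-En tasks.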

Large-scale experiments.

Table 4 lists the evaluation results on our large-scale datasets. As in the small-scale experiments, our Transformer baseline outperforms previous RNN-based approaches. Here, adding syntactic information to the embeddings of the Transformer encoder leads to slightly lower BLEU scores than the baseline. This is also observed for our re-implementation of the multi-task approach of Currey and Heafield (2019), which, however, again achieves a higher score than the one reported by the authors on WMT17. SISA, on the other hand, is consistent across small-scale and large-scale experiments but still gives only modest improvements over the baseline. Finally, the proposed approach achieves the highest performance on the WMT16 and WAT tasks, improving the baseline's BLEU scores by 0.9 points on both. Moreover, our model also achieves a significant boost over the baseline (+1.75) in terms of RIBES, a metric with stronger correlation with human judgments than BLEU in En-Ja translations. On WMT17, our slim model compares favorably to previous approaches, achieving the highest performance across all syntax-aware approaches that make use of source-side syntax. In fact, our method is only outperformed by the approach of Wu et al. (2018), which makes use of both source-side and target-side syntactic information. Not only does this limit the application of their method to low-resource target languages, but their model is also much more complex than ours, requiring three encoders and two decoders. Lastly, note that the modest improvements on our WMT17 task should not be surprising: the data consists of more parallel sentences than WMT16 (Table 1), and Raganato and Tiedemann (2018) showed that the Transformer model can already learn better syntactic relationships on larger datasets.


These results show that previous approaches for syntax-aware NMT might give lower gains in the Transformer model. We also find, on average, stronger and more consistent results across data sizes when embedding syntax in the attention mechanism, and that more syntax-aware heads further improve the translation accuracy of the Transformer. Overall, our model achieves substantial gains given the grammatically rigorous structure of English and German. We expect performance gains to further increase on less rigorous sources and with better parses [Zhang et al. 2019].


In this section, we further investigate the performance of the proposed approach, ground our design choices and show the performance of each component through an ablation study.

Performance by sentence length.

As shown in Figure 3, our model is particularly useful when translating long sentences, obtaining BLEU gains over the baseline on long sentences in all low-resource experiments and on the large En-Ja task. However, only a small fraction of the sentences in the evaluation datasets are long.

Qualitative performance.

Table 6 presents an example where our model correctly translated the source sentence while the Transformer baseline made a syntactic error, misinterpreting the adverb “only” as an adjective of “tendency”.

Pascal layer.

When we introduced our model, we motivated our design choice of placing Pascal heads in the first layer as a way to enrich the representations of words from their isolated embeddings by introducing contextualization from their parents. We ran an ablation study on the NC11 data in order to verify our hypothesis. As shown in Table 7(a), the performance of our model on the validation sets is lower when placing Pascal heads in upper layers, a trend that we also observed with the SISA mechanism. These results corroborate the findings of Raganato and Tiedemann (2018), who noticed that, in the first layer, more attention heads focus solely on the word to be translated itself rather than on its context. We can then deduce that enforcing syntactic dependencies in the first layer effectively leads to better word representations, which further enhance the translation accuracy of the Transformer model. Investigating the performance of multiple syntax-aware layers is left as future work.

Gaussian variance.

Another design choice we made was the variance of the Gaussian weighting function. We set it to in our experiments motivated by the statistics of our datasets, where the vast majority of words are split into at most a few tokens after applying BPE. Table 7(b) corroborates our choice, showing higher BLEU scores on the NC11 validation sets when the variance is equal to . Here, “parent-only” is the case where weights are only placed on parents.
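To make the role of the variance concrete, the following is a minimal sketch of a Gaussian weighting centred on each token's parent position, together with the “parent-only” limiting case. It is an illustration under stated assumptions rather than the released implementation: the function names, the convention that the root token is its own parent, the normalization constant, and the choice to scale the raw scores element-wise before the softmax are assumptions, and the variance passed in the usage lines is an arbitrary illustrative value.

```python
import math


def gaussian_parent_weights(parents, variance):
    """Return weights[i][j]: a Gaussian over positions j centred on parents[i].

    `parents[i]` is the (assumed 0-based) position of the dependency parent
    of token i; the root is assumed to point to itself.
    """
    n = len(parents)
    norm = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return [
        [norm * math.exp(-((j - p) ** 2) / (2.0 * variance)) for j in range(n)]
        for p in parents
    ]


def parent_only_weights(parents):
    """Limiting case: all weight is placed on the parent position itself."""
    n = len(parents)
    return [[1.0 if j == p else 0.0 for j in range(n)] for p in parents]


def scale_and_normalize(scores, weights):
    """Scale raw attention scores element-wise, then apply a row-wise softmax.

    Assumption: the scaling happens before the softmax.
    """
    attention = []
    for score_row, weight_row in zip(scores, weights):
        scaled = [s * w for s, w in zip(score_row, weight_row)]
        top = max(scaled)  # subtract the maximum for numerical stability
        exps = [math.exp(x - top) for x in scaled]
        total = sum(exps)
        attention.append([e / total for e in exps])
    return attention


# Hypothetical 4-token sentence whose root is token 1.
parents = [1, 1, 1, 2]
scores = [[1.0, 2.0, 0.5, 0.1] for _ in parents]  # made-up raw scores
weights = gaussian_parent_weights(parents, variance=1.0)  # illustrative value
for row in scale_and_normalize(scores, weights):
    print([round(a, 3) for a in row])
```

A larger variance spreads the weight over more neighbouring positions, which matters when a parent word is split into several BPE tokens; the parent-only variant concentrates all weight on a single position.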


Table 5 lists the contribution of each proposed technique (data filtering, Pascal and parent ignoring), in an incremental fashion, whenever they were used by a model.

While removing sentences whose languages do not match the translation task can lead to better performance (NC11), the precision of the detection tool assumes a major role at large scale. In WMT16, langdetect removes more than sentences and leads to performance losses. It would also drop sentences on the clean WAT En-Ja dataset.
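The filtering step discussed above can be sketched as follows. This is a hedged illustration, not the authors' pipeline: `filter_parallel_corpus` and `toy_detect` are hypothetical names, the detector is passed in as a plain callable (in practice, something like the `detect` function of the langdetect package could be supplied), and dropping a pair when detection fails is an assumption.

```python
def filter_parallel_corpus(pairs, detect, src_lang, tgt_lang):
    """Split sentence pairs into (kept, dropped) by detected language.

    `detect` is any callable mapping a string to a language code.
    A pair is kept only if both sides match the expected languages.
    """
    kept, dropped = [], []
    for src, tgt in pairs:
        try:
            ok = detect(src) == src_lang and detect(tgt) == tgt_lang
        except Exception:
            # Assumption: a failed detection counts as a mismatch.
            ok = False
        (kept if ok else dropped).append((src, tgt))
    return kept, dropped


def toy_detect(text):
    """Stand-in detector for illustration only; not a real language identifier."""
    german_markers = {"der", "die", "das", "und", "ist"}
    words = set(text.lower().split())
    return "de" if words & german_markers else "en"


pairs = [
    ("the house is small", "das Haus ist klein"),
    ("the cat sleeps", "the cat sleeps"),  # target side left untranslated
]
kept, dropped = filter_parallel_corpus(pairs, toy_detect, "en", "de")
print(len(kept), "kept;", len(dropped), "dropped")
```

Returning the dropped pairs alongside the kept ones makes it easy to inspect what the detector removes, which is where a low-precision detector becomes costly on large corpora.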

The proposed Pascal mechanism is the component that most improves the performance of the models, achieving up to and BLEU on En-Tr and En-Ja, respectively.

With the exception of NC11 En-De, we find parent ignoring useful on the noisier WMT18 En-Tr and WMT17 En-De datasets. In the former, low-resource case, the benefits of parent ignoring are minimal, but it proves fundamental on the large-scale WMT17 data, where it leads to significant improvements when paired with the Pascal mechanism.

Finally, looking at the number of Pascal heads in Table 2, we notice that most models rely on a large number of syntax-aware heads. Raganato and Tiedemann (2018) found that only a few attention heads per layer encoded a significant amount of syntactic dependencies. Our study shows that the performance of the Transformer model can be improved by having more attention heads learn syntactic dependencies.


This study provides a comprehensive investigation of approaches to induce syntactic knowledge into self-attention networks. We also introduce Pascal: a novel, parameter-free self-attention mechanism that enforces syntactic dependencies in the Transformer encoder with negligible computational overhead. Through extensive evaluations on a variety of translation tasks, we find that the proposed model leads to higher improvements in translation quality than other approaches. Moreover, we see that exploiting the core components of the Transformer model to embed linguistic knowledge leads to higher and more robust gains than treating it as a black box, and that multiple syntax-aware attention heads lead to superior performance. Our results show that the quality of the Transformer model can be improved by syntactic knowledge, motivating future research in this direction.


The research results have been achieved by “Research and Development of Deep Learning Technology for Advanced Multilingual Speech Translation”, the Commissioned Research of National Institute of Information and Communications Technology (NICT), Japan.


  • [Bahdanau, Cho, and Bengio2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
  • [Bastings et al.2017] Bastings, J.; Titov, I.; Aziz, W.; Marcheggiani, D.; and Sima’an, K. 2017. Graph convolutional encoders for syntax-aware neural machine translation. In EMNLP.
  • [Bastings et al.2019] Bastings, J.; Aziz, W.; Titov, I.; and Sima’an, K. 2019. Modeling latent sentence structure in neural machine translation. arXiv preprint arXiv:1901.06436.
  • [Bisk and Tran2018] Bisk, Y., and Tran, K. 2018. Inducing grammars with and for neural machine translation. In WNMT.
  • [Brown et al.1990] Brown, P. F.; Cocke, J.; Della Pietra, S. A.; Della Pietra, V. J.; Jelinek, F.; Lafferty, J. D.; Mercer, R. L.; and Roossin, P. S. 1990. A statistical approach to machine translation. Computational linguistics 16(2).
  • [Chen et al.2018] Chen, K.; Wang, R.; Utiyama, M.; Sumita, E.; and Zhao, T. 2018. Syntax-directed attention for neural machine translation. In AAAI.
  • [Cheng, Dong, and Lapata2016] Cheng, J.; Dong, L.; and Lapata, M. 2016. Long short-term memory-networks for machine reading. In EMNLP.
  • [Chiang2007] Chiang, D. 2007. Hierarchical phrase-based translation. Computational linguistics 33(2).
  • [Currey and Heafield2019] Currey, A., and Heafield, K. 2019. Incorporating source syntax into transformer-based neural machine translation. In WMT.
  • [Dredze et al.2007] Dredze, M.; Blitzer, J.; Talukdar, P. P.; Ganchev, K.; Graca, J.; and Pereira, F. 2007. Frustratingly hard domain adaptation for dependency parsing. In EMNLP-CoNLL.
  • [Elman1990] Elman, J. L. 1990. Finding structure in time. Cognitive Science 14(2).
  • [Eriguchi, Hashimoto, and Tsuruoka2016] Eriguchi, A.; Hashimoto, K.; and Tsuruoka, Y. 2016. Tree-to-sequence attentional neural machine translation. In ACL.
  • [Eriguchi, Tsuruoka, and Cho2017] Eriguchi, A.; Tsuruoka, Y.; and Cho, K. 2017. Learning to parse and translate improves neural machine translation. In ACL.
  • [Hashimoto and Tsuruoka2017] Hashimoto, K., and Tsuruoka, Y. 2017. Neural machine translation with source-side latent graph parsing. In EMNLP.
  • [Isozaki et al.2010] Isozaki, H.; Hirao, T.; Duh, K.; Sudoh, K.; and Tsukada, H. 2010. Automatic evaluation of translation quality for distant language pairs. In EMNLP.
  • [Koehn, Och, and Marcu2003] Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In NAACL.
  • [Koehn2004] Koehn, P. 2004. Statistical significance tests for machine translation evaluation. In EMNLP.
  • [Li et al.2017] Li, J.; Xiong, D.; Tu, Z.; Zhu, M.; Zhang, M.; and Zhou, G. 2017. Modeling source syntax for neural machine translation. In ACL.
  • [Lin et al.2017] Lin, Z.; Feng, M.; Santos, C. N. d.; Yu, M.; Xiang, B.; Zhou, B.; and Bengio, Y. 2017. A structured self-attentive sentence embedding. In ICLR.
  • [Liu, Liu, and Lin2006] Liu, Y.; Liu, Q.; and Lin, S. 2006. Tree-to-string alignment template for statistical machine translation. In COLING-ACL.
  • [Luong et al.2015] Luong, M.-T.; Le, Q. V.; Sutskever, I.; Vinyals, O.; and Kaiser, L. 2015. Multi-task sequence to sequence learning. In ICLR.
  • [Luong, Pham, and Manning2015] Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.
  • [Manning et al.2014] Manning, C. D.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S. J.; and McClosky, D. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL.
  • [Papineni et al.2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: A method for automatic evaluation of machine translation. In ACL.
  • [Raganato and Tiedemann2018] Raganato, A., and Tiedemann, J. 2018. An analysis of encoder representations in transformer-based machine translation. In BlackboxNLP.
  • [Sennrich and Haddow2016] Sennrich, R., and Haddow, B. 2016. Linguistic input features improve neural machine translation. In WMT.
  • [Sennrich, Haddow, and Birch2016] Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units. In ACL.
  • [Shaw, Uszkoreit, and Vaswani2018] Shaw, P.; Uszkoreit, J.; and Vaswani, A. 2018. Self-attention with relative position representations. In NAACL.
  • [Shi, Padhi, and Knight2016] Shi, X.; Padhi, I.; and Knight, K. 2016. Does string-based neural MT learn source syntax? In EMNLP.
  • [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR 15.
  • [Strubell et al.2018] Strubell, E.; Verga, P.; Andor, D.; Weiss, D.; and McCallum, A. 2018. Linguistically-informed self-attention for semantic role labeling. In EMNLP.
  • [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS.
  • [Tang et al.2018] Tang, G.; Müller, M.; Rios, A.; and Sennrich, R. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In EMNLP.
  • [Tran, Bisazza, and Monz2018] Tran, K.; Bisazza, A.; and Monz, C. 2018. The importance of being recurrent for modeling hierarchical structure. In EMNLP.
  • [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS.
  • [Wu et al.2018] Wu, S.; Zhang, D.; Zhang, Z.; Yang, N.; Li, M.; and Zhou, M. 2018. Dependency-to-dependency neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
  • [Yamada and Knight2001] Yamada, K., and Knight, K. 2001. A syntax-based statistical translation model. In ACL.
  • [Yang et al.2018] Yang, B.; Tu, Z.; Wong, D. F.; Meng, F.; Chao, L. S.; and Zhang, T. 2018. Modeling localness for self-attention networks. In EMNLP.
  • [Yang et al.2019] Yang, B.; Li, J.; Wong, D. F.; Chao, L. S.; Wang, X.; and Tu, Z. 2019. Context-aware self-attention networks. In AAAI.
  • [Zhang et al.2019] Zhang, M.; Li, Z.; Fu, G.; and Zhang, M. 2019. Syntax-enhanced neural machine translation with syntax-aware word representations. In NAACL.