Source Dependency-Aware Transformer with Supervised Self-Attention

by   Chengyi Wang, et al.
Nankai University

Recently, the Transformer has achieved state-of-the-art performance on many machine translation tasks. However, without syntax knowledge explicitly considered in the encoder, incorrect context information that violates the syntactic structure may be integrated into the source hidden states, leading to erroneous translations. In this paper, we propose a novel method to incorporate source dependencies into the Transformer. Specifically, we adopt the source dependency tree and define two matrices to represent the dependency relations. Based on the matrices, two heads in the multi-head self-attention module are trained in a supervised manner and two extra cross-entropy losses are introduced into the training objective function. Under this training objective, the model learns the source dependency relations directly. Without requiring pre-parsed input during inference, our model can generate better translations with dependency-aware context information. Experiments on bidirectional Chinese-English, English-to-Japanese and English-to-German translation tasks show that our proposed method can significantly improve the Transformer baseline.





1 Introduction

The past few years have witnessed the rapid development of neural machine translation (NMT) [1, 4, 5, 8]. In particular, the Transformer has refreshed the state-of-the-art performance on many translation tasks. Different from recurrence- and convolution-based network structures, the Transformer relies solely on the multi-head self-attention mechanism, in which different heads implicitly model the inputs from different aspects [8].

(a) translation example
(b) source side dependency tree
Figure 1: (a) A translation example from the Chinese-to-English task. The text highlighted in the rectangle is the incorrect part of the translation. (b) The dependency tree of the source sentence. The highlighted phrase in the rectangle reflects the Transformer’s misunderstanding of the source sentence.

Although effective, all multi-head self-attentions are trained in an unsupervised manner without any explicit modeling of syntactic knowledge, which leads to incorrect translations that violate the syntactic constraints of source sentences. Figure 1(a) shows an example from the Chinese-to-English task. Though the translation is well formed and grammatical, its meaning is inconsistent with the source sentence. This error is caused by a misunderstanding of the subtle source syntactic dependency. As shown in Figure 1(b), the word “hángzhōu (Hangzhou)” is a modifier of “yǎnyì (present)” rather than “yinyuèjiā (musicians)”. Intuitively, such information can be effectively modeled by syntax structure such as dependency trees. Recent advances show that adding source syntactic information to RNN-based NMT systems can improve translation quality. For example, Eriguchi et al. (2016) and Chen et al. (2017) [2] construct a tree-LSTM encoder on top of the standard sequential encoder; Bastings et al. (2017) introduce an extra graph convolutional network (GCN) to encode dependency trees.

Though remarkable progress has been achieved, several issues still remain in this research line:

(1) Most existing methods introduce extra modules in addition to the sequential encoder, such as a tree-LSTM, a GCN or an additional RNN encoder, which makes the model heavy.

(2) Existing methods require a stand-alone parser to pre-generate syntactic trees as input during inference, since they are incapable of constructing syntactic structures automatically.

(3) Previous methods are designed for RNN-based models and are hard to apply to the highly parallelized Transformer.

To address these issues, we propose a novel framework that enables the Transformer to model source dependencies explicitly by supervising part of the encoder self-attention heads. The self-attention heads are divided into two types: unsupervised heads and supervised heads. The unsupervised heads implicitly generalize patterns from raw data as in the original Transformer, while the supervised ones learn to align to child and parent dependency words, guided by the parse tree from a pre-trained dependency parser.*

*To make the description clear, in the rest of this paper we use “parent” to denote the head/parent node of the dependency tree and “head” to denote the attention head.

To implement the supervision, two regularization terms are introduced into the original cross-entropy loss function. With this method, the dependency information is naturally modeled by several heads without introducing extra modules, and at inference time the supervised heads can predict the dependency relations without parsed input. Experiments conducted on bidirectional Chinese-English, English-to-Japanese and English-to-German translation show that the proposed method improves translation quality significantly over the strong Transformer baseline and can construct reasonable source dependency structures as well.

Our main contributions are summarized as follows:

  • We propose a novel method to incorporate source dependency knowledge into the Transformer without introducing additional modules. Our approach takes advantage of the numerous attention heads and trains some of them in a supervised manner.

  • With the supervised framework, our method can build reasonable source dependency structures via self-attention heads, and no extra parser is required during inference. Decoding efficiency is therefore guaranteed.

  • Our proposed method significantly improves the Transformer baseline and outperforms several related methods in four translation tasks.

Figure 2: The architecture of our syntax-aware Transformer. $A^c$ and $A^p$ are the child attentional adjacency matrix and the parent attentional adjacency matrix corresponding to the tree in Figure 1(b), which are used as supervisions of the attention score matrices. Each element shows the attention weight upon the $j$-th word based on the $i$-th word.

2 Background

2.1 Transformer


The Transformer encoder is composed of a stack of $N$ identical layers, where each layer contains a self-attention module and a feed-forward network.

The self-attention module is designed as a multi-head attention mechanism. The input consists of a query $Q$, a key $K$ and a value $V$. Each element of the input is modeled as the word and position embedding representation for the first layer, and as the output of the previous layer for the other layers. Multi-head attention linearly projects $Q$, $K$ and $V$ $h$ times with different learned projections to $d_k$, $d_k$ and $d_v$ dimensions respectively. Then the scaled dot-product attention function is performed on them and yields $h$ different $d_v$-dimensional representations, which are then concatenated and projected again to generate the final values:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O \quad (1)$$
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \quad (2)$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \quad (3)$$

where $W_i^Q, W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$ are learned projection matrices. Note that $d_k = d_v = d_{model}/h$, where $h$ denotes the number of heads ($h = 8$ in the base setting used here).
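The multi-head computation above can be sketched in a few lines of NumPy. This is an illustrative sketch only: parameter names and the random initialization are ours, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n) attention weight matrix
    return scores @ V, scores

def multi_head(X, Wq, Wk, Wv, Wo):
    # Wq/Wk/Wv hold one (d_model, d_k)-projection per head; Wo is (h*d_v, d_model).
    heads = [attention(X @ wq, X @ wk, X @ wv)[0] for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo  # concat heads, then project

n, d_model, h = 5, 64, 8
d_k = d_model // h
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
Wo = rng.normal(size=(h * d_k, d_model))
out = multi_head(X, Wq, Wk, Wv, Wo)
```

Each row of the per-head score matrix is a probability distribution over source positions; the supervised heads introduced later constrain exactly these rows.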

The feed-forward network (FFN) is formed as two linear transformations with a ReLU activation in between.

Layer normalization and residual connection are used after each sub-layer.


Aside from the self-attention and feed-forward modules, the decoder inserts an inter multi-head attention sub-layer which attends over the encoder output. Specifically, the output of the self-attention sub-layer is regarded as $Q$ and linearly projected to $d_k$-dimensional queries. The encoder output is regarded as $K$ and $V$, which are projected to $d_k$ and $d_v$ dimensions respectively.

2.2 Dependency Tree

A dependency tree directly models syntactic structures of arbitrary distance, where each word has a parent word that it depends on, except for the root word. The verb is taken to be the structural center and all other syntactic units are either directly or indirectly connected to it.

Figure 1(b) shows an example of a dependency tree. Without any constituent labels, the dependency tree is simple in form yet effectively characterizes word relations. Hence, it is usually regarded as a desirable linguistic model of a sentence.

3 Proposed Method

Figure 2 sketches the overall architecture of our proposed method. We adopt the source dependency tree and define two attentional adjacency square matrices, $A^c$ and $A^p$, to capture child and parent dependencies. The two matrices are used to supervise two self-attention heads in the Transformer encoder. In this section, we begin by explaining how the source syntax is represented and then give details on how the model is trained based on the learned representations.

3.1 Syntax Representation

Given a source sentence $x = (x_1, x_2, \dots, x_n)$, where $n$ is the sentence length, and its dependency parse tree, we define a child attentional adjacency matrix $A^c \in \mathbb{R}^{n \times n}$ and a parent attentional adjacency matrix $A^p \in \mathbb{R}^{n \times n}$, representing child and parent dependencies respectively.

Equation (4) gives the definition of matrix $A^c$. Assuming that $x_i$ is a possible parent, the element $A^c_{ij}$ is non-zero when $x_j$ is a child of $x_i$, and 0 otherwise. All leaf nodes in the tree are aligned to themselves. In cases where a parent node has multiple child nodes, we average the weight among all its child nodes. In this manner, each word is informed of its modifiers directly.

$$A^c_{ij} = \begin{cases} 1/c_i & \text{if } x_j \text{ is a child of } x_i \\ 1 & \text{if } x_i \text{ is a leaf node and } j = i \\ 0 & \text{otherwise} \end{cases} \quad (4)$$

where $c_i$ is the number of child nodes of $x_i$.

Similarly, in $A^p$, each word is encouraged to attend to its parent node directly and the attention score of the parent node is 1. The root node is aligned to itself, as shown in Equation (5).

$$A^p_{ij} = \begin{cases} 1 & \text{if } x_j \text{ is the parent of } x_i \\ 1 & \text{if } x_i \text{ is the root and } j = i \\ 0 & \text{otherwise} \end{cases} \quad (5)$$


Either of the two matrices is sufficient to reconstruct the dependency tree, and they sketch the tree from different views. Figure 2 gives an example of the attentional adjacency matrices ($A^c$ and $A^p$), which correspond to the dependency tree in Figure 1(b). In $A^c$, the three child nodes of the root word “yǎnyì” (the 5th row) receive equal attention scores, whereas the others are given no attention. The parent word “yinyuèhuì” (the 7th row) has only one child word, “xinnián” (the 6th column), which therefore receives an attention score of 1. As a leaf node, the word “xinnián” (the 6th row) scores itself as 1. In $A^p$, every word puts its whole attention score on its parent node, except for the root word (the 5th row), which attends to itself.
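As an illustration, the two matrices can be built from a parent-index array with a short routine. This is a sketch under our own conventions: `heads[i]` gives the parent of word `i`, and the root points to itself.

```python
import numpy as np

def adjacency_matrices(heads):
    """Build the child (A_c) and parent (A_p) attentional adjacency matrices.

    heads[i] is the index of the parent of word i; the root has heads[i] == i.
    """
    n = len(heads)
    A_c = np.zeros((n, n))
    A_p = np.zeros((n, n))
    children = [[] for _ in range(n)]
    for i, p in enumerate(heads):
        if p != i:                           # skip the root's self-loop
            children[p].append(i)
    for i in range(n):
        if children[i]:                      # parent: spread weight over children
            for j in children[i]:
                A_c[i, j] = 1.0 / len(children[i])
        else:                                # leaf: attends to itself
            A_c[i, i] = 1.0
        A_p[i, heads[i]] = 1.0               # child attends to parent; root to itself
    return A_c, A_p
```

Each row of both matrices sums to 1, so they are valid targets for the softmax-normalized attention rows of the supervised heads.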

3.2 Syntax-Aware Transformer

Inspired by the head selection idea for dependency parsing [9], we propose a supervised framework in which two self-attention heads in the encoder are supervised by the two attentional adjacency matrices. The supervised heads are expected to model the dependency knowledge from different views. As shown in Figure 2, the encoder encodes the source sentence $x$ and generates a hidden representation $H$. The decoder predicts the target sentence based on $H$. In the training phase, the objective function is divided into three parts: the standard maximum likelihood of the training data and two regularization terms that encourage the self-attention to learn from the adjacency matrices.

The Transformer encoder contains $N \times h$ self-attention heads for each word in total, where $N$ is the number of layers and $h$ is the number of heads per layer. Though these heads can model the source sentence from different aspects, whether they can learn accurate syntactic knowledge in an unsupervised manner is indecipherable. Thus, we use the attentional adjacency matrices to guide two of the attention heads, explicitly guaranteeing that they learn syntactic knowledge.

Specifically, for each source word $x_i$, we expect the self-attention function to focus more on its child words and parent word. Thus, we select two self-attention heads from the same layer as supervised attention heads and denote them as the child supervised attention head (CSH) and the parent supervised attention head (PSH). They are trained under the guidance of $A^c$ and $A^p$. The attention score/probability matrices produced by CSH and PSH are denoted as $S^c$ and $S^p$ respectively, which are computed by softmax functions as in Equation (3). Two additional objective terms, namely $\Omega_c$ and $\Omega_p$, are introduced to minimize the divergence between them and the attentional adjacency matrices $A^c$, $A^p$:

$$\Omega_c = -\sum_{i=1}^{n}\sum_{j=1}^{n} A^c_{ij} \log S^c_{ij} \quad (6)$$
$$\Omega_p = -\sum_{i=1}^{n}\sum_{j=1}^{n} A^p_{ij} \log S^p_{ij} \quad (7)$$

The two regularization objectives encourage the CSH and PSH to generate intermediate attention weight matrices similar to the attentional adjacency matrices $A^c$ and $A^p$. The other attention heads are still trained in an unsupervised manner, as in the original Transformer.

The original training objective on source sentence $x$ and target sentence $y$ is defined as:

$$\mathcal{L}(\theta) = \sum_{(x, y) \in D} \log P(y \mid x; \theta) \quad (8)$$

After introducing the regularization terms, the new objective function is formulated as:

$$\mathcal{J}(\theta) = \sum_{(x, y) \in D} \left( \log P(y \mid x; \theta) - \lambda_c \Omega_c - \lambda_p \Omega_p \right) \quad (9)$$

where $\lambda_c$ and $\lambda_p$ are hyper-parameters for the regularization terms, and $D$ is the training corpus.
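The two attention cross-entropy losses and their combination with the translation log-likelihood can be sketched as follows. The function names and the framework-free NumPy formulation are ours; in a real system these terms would be computed per batch inside the training loop.

```python
import numpy as np

def attention_ce(S, A, eps=1e-9):
    # Cross-entropy between a supervised head's attention rows S (softmax output)
    # and its attentional adjacency target A. Smaller means closer to the target.
    return -np.sum(A * np.log(S + eps))

def objective(log_likelihood, S_c, S_p, A_c, A_p, lam_c=1.0, lam_p=1.0):
    # Maximize translation log-likelihood while minimizing the two attention losses.
    return (log_likelihood
            - lam_c * attention_ce(S_c, A_c)
            - lam_p * attention_ce(S_p, A_p))
```

Since both terms are subtracted, maximizing the combined objective pushes the CSH and PSH score matrices toward the adjacency targets while still fitting the translation data.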

4 Experiments

System NIST2005 NIST2008 NIST2012 Average
RNNsearch 38.41 30.01 28.48 32.30
Tree2Seq [2] 39.44 31.03 29.22 33.23
SE-NMT (Wu et al. 2017) 40.01 31.44 29.45 33.63
Transformer 43.89 34.83 32.59 37.10
+CSH 44.21 36.63 33.57 38.14
+PSH 44.24 36.17 33.86 38.09
+CSH+PSH 44.87 36.73 34.28 38.63
Table 1: Case-insensitive BLEU scores (%) for Chinese-to-English translation on the NIST datasets. “+CSH” denotes the model trained only under the supervision of the child attentional adjacency matrix ($\lambda_p = 0$). “+PSH” denotes the model trained only under the supervision of the parent attentional adjacency matrix ($\lambda_c = 0$). “+CSH+PSH” is trained under the supervision of both.

4.1 Setup


For the NIST OpenMT Chinese-to-English translation task, we leverage a subset of the LDC corpus (LDC2003E14, LDC2005T10, LDC2005E83, LDC2006E26, LDC2003E07, LDC2005T06, LDC2004T07, LDC2004T08, LDC2006E34, LDC2006E85, LDC2006E92, LDC2002E18) as bilingual training data, which contains 2.6M sentence pairs. NIST 2005, 2008 and 2012 are used as test sets. All English words are lowercased. We keep the top 30K most frequent words on both sides; the rest are replaced with <unk> and post-processed following Luong et al. (2015).

In the WAT2016 English-to-Japanese translation task, the top 1.5M sentence pairs of the ASPEC corpus [6] are used as training data. We follow the official pre-processing steps provided by WAT2016.

For the WMT2017 bidirectional Chinese-English translation tasks, we use the CWMT corpus, which consists of 9M sentence pairs. Newstest2017 is used as the test set. For pre-processing, we segment Chinese sentences with our in-house tool and segment English sentences with the Moses scripts. We use a vocabulary of 50K subword tokens based on Byte Pair Encoding (BPE) [7] for both sides.

For the English-to-German task, we use the WMT2014 corpus, which contains 4.5M sentence pairs. Newstest2014 is used as the test set. We use vocabularies of 50K subword tokens based on BPE for both sides.

Given that no golden annotations of source dependency trees exist in these corpora, we use pseudo parsing results from in-house arc-eager dependency parsers implemented following Zhang and Nivre (2011). The English parser is trained on the Penn Treebank and the Chinese parser on the Chinese Treebank corpus; their unlabeled attachment scores (UAS) are 92.3% and 83.7% respectively. For tasks using BPE, we modify the pseudo-golden dependency trees with one rule: all pieces from one word are linked to its first piece.
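The BPE adaptation rule above can be sketched as a small function. The input conventions are ours (the root points to itself), and we additionally assume the first piece of a word attaches to the first piece of its parent word, which keeps the structure a tree:

```python
def adapt_tree_to_bpe(word_heads, pieces_per_word):
    """Lift a word-level dependency tree to BPE pieces.

    word_heads[i]: parent word of word i (root: word_heads[i] == i).
    pieces_per_word[i]: number of BPE pieces word i was split into.
    Returns piece_heads, the parent index of every piece.
    """
    first = []                 # index of the first piece of each word
    idx = 0
    for n in pieces_per_word:
        first.append(idx)
        idx += n
    piece_heads = [0] * idx
    for w, n in enumerate(pieces_per_word):
        # First piece attaches to the parent word's first piece.
        piece_heads[first[w]] = first[word_heads[w]]
        # All remaining pieces of the word link to the word's first piece.
        for k in range(1, n):
            piece_heads[first[w] + k] = first[w]
    return piece_heads
```

For example, a two-word sentence where word 0 splits into two pieces and depends on word 1 (the root) yields piece-level heads `[2, 0, 2]`.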

We compare our proposed method with the Transformer [8]. Results are reported with IBM BLEU-4. The English-to-Japanese task is evaluated following the official procedure, with both BLEU and RIBES.

Model and Implementation Details:

The Transformer baseline and our proposed method follow the base setting of Vaswani et al. (2017) [8]. CSH and PSH are selected from the top layer. Our off-line experiments show that the model performs best when the supervision is applied to the top layer. This may be because syntax information captured by lower layers is weakened as the encoder goes deeper, whereas supervision on the top layer keeps this information stronger and more effective. The hyper-parameters $\lambda_c$ and $\lambda_p$ are selected based on the validation set. All experiments are conducted on a single GPU.

System BLEU RIBES
RNNsearch 34.83 80.92
Eriguchi et al. (2016) 34.91 81.66
Transformer 36.24 81.90
+CSH 36.83 82.15
+PSH 36.75 82.09
+CSH+PSH 37.22 82.37
Table 2: Evaluation results on the English-to-Japanese translation task.

4.2 Evaluation on NIST Chinese-to-English Translation

In this experiment, aside from the Transformer, we also compare our model with the RNN-based NMT baseline RNNsearch and several existing syntax-aware NMT methods that use source consistency/dependency trees. These methods are described as follows:

  • RNNsearch: A re-implementation of the conventional RNN-based NMT model [1].

  • Tree2Seq: Chen et al. (2017) [2] propose a tree-to-sequence NMT model by leveraging source constituency trees with tree-based coverage.

  • SE-NMT: Wu et al. (2017) extract extra sequences by traversing the dependency tree and encode them using two RNNs. We re-implement their model, named SE-NMT.

Table 1 shows the evaluation results on all test sets. We report case-insensitive BLEU here since English words are lowercased. From the table we can see that the syntax-aware RNN models consistently outperform the RNNsearch baseline. However, the performance of the Transformer is much higher than that of all RNN-based methods. In Transformer+CSH, we use only the child attentional adjacency matrix to guide the encoder. Being aware of child dependencies, Transformer+CSH gains a 1.0 BLEU point improvement over the Transformer baseline on average. Transformer+PSH, in which only the parent attentional adjacency matrix is used as supervision, also achieves about a 1.0 BLEU point improvement. Combining both, Transformer+CSH+PSH further improves BLEU by about 0.5 point on average and significantly outperforms the baseline and the other source syntax-based methods on all test sets. This demonstrates that both child dependencies and parent dependencies benefit the Transformer model and that their effects are cumulative.

4.3 Evaluation on English-to-Japanese task

We conduct experiments on the WAT2016 English-to-Japanese translation task in this section. Our baseline systems include RNNsearch, the tree2seq attentional NMT model using a tree-LSTM proposed by Eriguchi et al. (2016), and the Transformer. Table 2 shows the results. According to the table, our Transformer+CSH and Transformer+PSH outperform the Transformer and the other existing NMT models in terms of both BLEU and RIBES. Similar to the results in Section 4.2, Transformer+CSH+PSH achieves the highest performance.

System Zh-En En-Zh(CBLEU) En-Zh(WBLEU) En-De
Transformer 21.29 32.12 19.14 25.71
+CSH 21.60 32.46 19.54 26.01
+PSH 21.67 32.37 19.53 25.87
+CSH+PSH 22.15 33.03 20.19 26.31
Table 3: BLEU scores (%) for Chinese-to-English (Zh-En), English-to-Chinese (En-Zh) translation on WMT2017 datasets and English-to-German (En-De) task. Both char-level BLEU (CBLEU) and word-level BLEU (WBLEU) are used as metrics for the En-Zh task.

4.4 Evaluation on the WMT Tasks

To verify the effect of syntax knowledge on large-scale translation tasks, we further conduct experiments on the WMT2017 bidirectional Chinese-English tasks and the WMT2014 English-to-German task. The results are listed in Table 3. For Chinese-to-English, our proposed method outperforms the baseline by 0.86 BLEU. For English-to-Chinese, Transformer+CSH+PSH gains improvements of 0.91 char-level BLEU and 1.05 word-level BLEU. For the En-De task, the improvement is 0.6, smaller than on the other two tasks. We speculate that, as the grammars of English and German are very similar, the original model can already capture the syntactic knowledge well. Even so, the improvement still illustrates the effectiveness of our method.

Figure 3: (a) A translation example from the NIST Chinese-to-English task, in which the incorrectly translated part is highlighted with a wavy line. (b) Alignments of attention heads in the top layer. The first two are extracted from CSH and PSH respectively; the others are from unsupervised attention heads. Each pixel shows the attention weight (0: white, 1: blue). (c) The dependency tree constructed from the alignment result of PSH in (b).

4.5 Quality Estimation of Source Dependency Tree Construction

As CSH and PSH learn distributions over possible child and parent nodes, we can use them to construct source-side dependency trees. In this section, we estimate the quality of these trees. We only take advantage of the alignment result of PSH, since the number of child nodes is uncertain for each word. Specifically, for each word, we take the node with the highest attention weight as the prediction of its parent. Non-tree outputs are adjusted with a maximum spanning tree algorithm [3]. Because golden references are unavailable, we estimate the consistency between the predicted trees and the parsing results of our stand-alone dependency parser; the higher the consistency, the closer the two performances. These experiments are performed on the NIST Chinese-to-English task because it has sufficient test sets. To reduce the influence of ill-formed data as much as possible, we build the evaluation dataset by heuristically selecting 1000 source sentences from all NIST Chinese-to-English test sets that do not contain <unk> and have a length of 20-30 words. The parsing results from our stand-alone Chinese parser are then used as references. We obtain a UAS of 83.25%, which demonstrates that the predicted dependency trees are highly similar to the parsing results of the stand-alone parser (whose own UAS is 83.7%).
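The parent prediction and UAS computation described above reduce to a few lines; the MST repair for non-tree outputs (e.g. Chu-Liu/Edmonds [3]) is omitted from this sketch:

```python
import numpy as np

def predict_parents(S_p):
    # For each word (row), take the column with the highest PSH attention
    # weight as the predicted parent index.
    return np.argmax(S_p, axis=1)

def uas(predicted, reference):
    # Unlabeled attachment score: fraction of words whose predicted parent
    # matches the reference parser's parent.
    predicted, reference = np.asarray(predicted), np.asarray(reference)
    return float((predicted == reference).mean())
```

With a reasonably peaked PSH attention matrix, a single argmax per row already recovers most attachments; the MST step only repairs rows whose argmax would create cycles or multiple roots.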

4.6 Case Study

In this section, we give a case study to explain how our method works. Figure 3(a) provides a translation example from the NIST Chinese-to-English test set. In this example, the Chinese word “shǒucì (the first time)” (the 3rd word) should be a modifier of “tígōng (provides)” (the 4th word). However, the Transformer misunderstands the source syntax structure and thus generates an incorrect translation, while by modeling the source syntax, our proposed model produces a high-quality translation. To further investigate the translation behavior, we visualize the attention weights of different attention heads in Figure 3(b). The first two alignments are extracted from CSH and PSH, while the others are from unsupervised heads in the same layer. Different from the general attention heads, CSH and PSH generate more interpretable alignments, based on which we construct the source dependency tree shown in Figure 3(c). From the tree we can see that the dependency of the word “shǒucì (the first time)” is correctly modeled.

5 Related Work

A large body of work has been dedicated to incorporating source syntactic knowledge into RNN-based NMT models. Sennrich and Haddow (2016) generalize the embedding layer to incorporate morphological features, part-of-speech tags and syntactic dependency labels, leading to improvements on several language pairs. Eriguchi et al. (2016) propose a tree-to-sequence attentional NMT model in which a tree-LSTM is used to encode the source-side parse tree. Bastings et al. (2017) rely on a graph convolutional network (GCN) to incorporate source syntactic structure. However, these approaches all specify extra features or introduce extra complicated modules in addition to the original sequential encoder. Instead, we focus on the dependency structure and let the model learn from the tree automatically. Other works use linearized representations of parses. For example, Li et al. (2017) linearize the parse tree of the source sentence and use three encoders to incorporate source syntax. Wu et al. (2017) propose a syntax-aware encoder to enrich each source state with global dependency structure. Though linearized parses can inject syntactic information into the model without significant changes to the architecture, they usually lead to much longer inputs and require additional encoders.

All these methods are designed for RNN models and are difficult to apply to the highly parallelized Transformer. Besides, the extra modules used to model the source syntax are heavy and fail when the input is not parsed.

6 Conclusion and Future Work

In this paper, we propose a novel supervised approach to explicitly leverage the source dependency tree in the Transformer. Our method is simple and efficient: no extra module is needed and no parser is required during inference. Experiments on several translation tasks show that our method yields improvements over the state-of-the-art Transformer model and outperforms other syntax-aware models.

In future work, we expect to shed more light on utilizing source linguistic features, e.g., dependency labels or part-of-speech tags. Besides, we would like to explore whether incorporating target syntax into the Transformer has the potential to improve translation quality.


  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
  • [2] H. Chen, S. Huang, D. Chiang, and J. Chen (2017) Improved neural machine translation with a syntax-aware encoder and decoder. In Proceedings of ACL 2017, pp. 1936–1945.
  • [3] Y. Chu (1965) On the shortest arborescence of a directed graph. Scientia Sinica 14, pp. 1396–1400.
  • [4] J. Gehring, M. Auli, D. Grangier, and Y. Dauphin (2017) A convolutional encoder model for neural machine translation. In Proceedings of ACL 2017, pp. 123–135.
  • [5] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proceedings of ICML 2017, pp. 1243–1252.
  • [6] T. Nakazawa, M. Yaguchi, K. Uchimoto, M. Utiyama, E. Sumita, S. Kurohashi, and H. Isahara (2016) ASPEC: Asian Scientific Paper Excerpt Corpus. In Proceedings of LREC 2016.
  • [7] R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of ACL 2016.
  • [8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 6000–6010.
  • [9] X. Zhang, J. Cheng, and M. Lapata (2017) Dependency parsing as head selection. In Proceedings of EACL 2017, Volume 1, pp. 665–676.