Self-Attention with Structural Position Representations

09/01/2019 ∙ by Xing Wang, et al. ∙ Tencent

Although self-attention networks (SANs) have advanced the state of the art on various NLP tasks, one criticism of SANs is their limited ability to encode the positions of input words (Shaw et al., 2018). In this work, we propose to augment SANs with structural position representations to model the latent structure of the input sentence, which is complementary to the standard sequential positional representations. Specifically, we use a dependency tree to represent the grammatical structure of a sentence, and propose two strategies to encode the positional relationships among words in the dependency tree. Experimental results on NIST Chinese-to-English and WMT14 English-to-German translation tasks show that the proposed approach consistently boosts performance over both the absolute and relative sequential position representations.




1 Introduction

In recent years, self-attention networks (SANs, parikh2016decomposable; lin2017structured) have achieved state-of-the-art results on a variety of NLP tasks Vaswani:2017:NIPS; strubell2018linguistically; Devlin:2019:NAACL. SANs perform the attention operation under the position-unaware “bag-of-words” assumption, in which the positions of the input words are ignored. Therefore, absolute position Vaswani:2017:NIPS or relative position Shaw:2018:NAACL representations are generally used to capture the sequential order of words in the sentence. However, several studies reveal that the sequential structure may not be sufficient for NLP tasks tai2015improved; kim2016structured; shen2018ordered, since sentences inherently have hierarchical structures chomsky2014aspects; bever1970cognitive.

In response to this problem, we propose to augment SANs with structural position representations to capture the hierarchical structure of the input sentence. The starting point for our approach is a recent finding: the latent structure of a sentence can be captured by structural depths and distances hewitt2019structural. Accordingly, we propose an absolute structural position to encode the depth of each word in a parse tree, and a relative structural position to encode the distance of each word pair in the tree.

We implement our structural encoding strategies on top of Transformer Vaswani:2017:NIPS and conduct experiments on both NIST Chinese-to-English and WMT14 English-to-German translation tasks. Experimental results show that exploiting structural position encoding strategies consistently boosts performance over both the absolute and relative sequential position representations across language pairs. Linguistic analyses P18-1198 reveal that the proposed structural position representation improves translation performance by capturing richer syntactic information. Our main contributions are:

  • Our study demonstrates the necessity and effectiveness of exploiting structural position encoding for SANs, which benefit from modeling syntactic depth and distance under the latent structure of the sentence.

  • We propose structural position representations for SANs to encode the latent structure of the input sentence, which are complementary to their sequential counterparts.

Figure 1: Illustration of (a) the standard sequential position encoding Vaswani:2017:NIPS; Shaw:2018:NAACL, and (b) the proposed structural position encoding. The relative position in the example is for the word “talk”.

2 Background


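SANs produce representations by applying attention to each pair of elements from the input sequence, regardless of their distance. Given an input sequence $\mathbf{x} = \{x_1, \dots, x_I\}$, the model first transforms it into queries $\mathbf{Q}$, keys $\mathbf{K}$, and values $\mathbf{V}$:

$$\mathbf{Q}, \mathbf{K}, \mathbf{V} = \mathbf{x}\mathbf{W}_Q,\ \mathbf{x}\mathbf{W}_K,\ \mathbf{x}\mathbf{W}_V \in \mathbb{R}^{I \times d} \qquad (1)$$

where $\{\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V\} \in \mathbb{R}^{d \times d}$ are trainable parameters and $d$ indicates the hidden size. The output sequence is calculated as

$$\mathbf{O} = \mathrm{ATT}(\mathbf{Q}, \mathbf{K})\,\mathbf{V} \in \mathbb{R}^{I \times d} \qquad (2)$$

where $\mathrm{ATT}(\cdot)$ is a dot-product attention model.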
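The computation above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not the authors' implementation: the helper names (`matmul`, `softmax`, `self_attention`) are hypothetical, and the scaling by the square root of the hidden size is an assumption borrowed from the standard Transformer formulation rather than something stated in this section.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: project x to Q, K, V, then O = ATT(Q, K) V.

    The 1/sqrt(d) scaling is an assumption (standard scaled dot-product).
    Returns the output rows and the attention weights.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of K
    scores = matmul(q, k_t)                       # pairwise dot products
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input: two "words" with hidden size 2 and identity projections.
x = [[1.0, 0.0], [0.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(x, eye, eye, eye)
out_reversed, _ = self_attention(x[::-1], eye, eye, eye)
```

Reversing the input order only reverses the output rows (`out_reversed` matches `out[::-1]`), which illustrates why attention alone carries no information about word order and why position representations are needed.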

Sequential Position Encoding

To make use of the order of the sequence, information about the absolute or relative position of the elements in the sequence is injected into SANs:

  • Absolute Sequential PE Vaswani:2017:NIPS is defined as

    $$\mathrm{ABSPE}(abs) = f\left(abs / 10000^{2i/d}\right) \qquad (3)$$

    where $abs$ is the absolute position in the sequence and $i$ is the dimension of position representations. $f(\cdot)$ is $\sin(\cdot)$ for the even dimension, and $\cos(\cdot)$ for the odd dimension.

    Vaswani:2017:NIPS propose to combine the fixed sequential position representation with the word embedding by element-wise addition and to feed the combined representation to the SANs.

  • Relative Sequential PE Shaw:2018:NAACL is calculated as

    $$\mathrm{RELPE}(rel) = \mathbf{R}[rel] \qquad (4)$$

    where $rel$ is the relative position to the queried word, which is used to index a learnable matrix $\mathbf{R}$ that represents relative position embeddings.

    Shaw:2018:NAACL propose relation-aware SANs and take the relative sequential encoding as the additional key and value (Eq.2) in the attention computation.

Figure 1(a) shows an example of absolute (i.e., $abs$) and relative (i.e., $rel$) sequential positions.
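The two sequential encodings can be illustrated as follows. This is a hedged sketch under stated assumptions: `abs_pe` follows the usual sinusoidal recipe (sine on even dimensions, cosine on odd dimensions, with each sine/cosine pair sharing one frequency), and `rel_index` shows one common way to turn a clipped relative position into a row index of a learnable embedding table. Both function names and the table layout are hypothetical.

```python
import math

def abs_pe(pos, d):
    """Sinusoidal encoding of one absolute position as a d-dimensional vector.

    Even dimensions use sin, odd dimensions use cos; dimensions k and k+1
    (k even) share the same frequency, as in the standard Transformer.
    """
    vec = []
    for k in range(d):
        angle = pos / (10000 ** (2 * (k // 2) / d))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec

def rel_index(j, i, clip):
    """Relative position of word j with respect to the queried word i.

    The offset j - i is clipped to [-clip, clip] and shifted by +clip so it
    can index a table with 2 * clip + 1 rows (an assumed layout).
    """
    rel = max(-clip, min(clip, j - i))
    return rel + clip

first_position = abs_pe(0, 4)        # position 0: sin(0) and cos(0) alternate
three_to_the_left = rel_index(0, 3, 16)
far_to_the_right = rel_index(40, 0, 16)
```

Position 0 encodes to alternating zeros and ones, and any offset beyond the clipping distance collapses onto the first or last row of the table.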

3 Approach

3.1 Structural Position Representations

In this study, we choose the dependency tree to represent sentence structure because it models the syntactic relationships among input words in a simple way. Figure 1 shows an example to illustrate the idea of the proposed approach. From the perspective of the relationship path between words, sequential PE measures the sequential distance between the words. As shown in Figure 1 (a), for each word, the absolute sequential position represents the sequential distance to the beginning of the sentence, while the relative sequential position measures the relative distance to the queried word (“talk” in the example).

The latent structure can be interpreted in various ways, from syntactic tree structures, e.g., constituency tree collins2003head or dependency tree kubler2009dependency, to semantic graph structures, e.g., abstract meaning representation graph banarescu2013abstract. In this work, the dependency path, which is induced from the dependency tree, is adopted to provide a new perspective on modelling pairwise relationships.

Figure 1 shows the difference between the sequential path and the dependency path. The sequential distance between the two words “held” and “talk” is 2, while their structural distance is only 1, as the word “talk” is the dependent of the head “held” nivre2005dependency.

Absolute Structural Position

We exploit the tree depth of the word in the dependency tree as its absolute structural position. Specifically, we treat the main verb tapanainen1997non of the sentence as the origin and use the distance of the dependency path from the target word to the origin as the absolute structural position

$$abs_{stru}(x_i) = \mathrm{distance}_{tree}(x_i, origin) \qquad (5)$$

where $x_i$ is the target word, $tree$ is the given dependency structure and the origin is the main verb of the $tree$.

In the field of NMT, BPE sub-words and the end-of-sentence symbol should be carefully handled, as they do not appear in the conventional dependency tree. In this work, BPE sub-words share the absolute structural position of the original word, and the end-of-sentence symbol is assigned the smallest integer larger than the maximum absolute structural position in the dependency tree.
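A minimal sketch of how these absolute structural positions could be computed, assuming the dependency tree is given as a list of head indices with -1 marking the main verb (the origin). The function names, the head-array format, the six-word example sentence, and its parse are illustrative assumptions; only the “held”/“talk” relation comes from the text above.

```python
def abs_structural_positions(heads):
    """Depth of every word: number of dependency edges up to the origin.

    heads[i] is the index of the head of word i; the origin has head -1.
    Assumes the input is a well-formed tree (no cycles).
    """
    depths = []
    for i in range(len(heads)):
        depth, node = 0, i
        while heads[node] != -1:
            node = heads[node]
            depth += 1
        depths.append(depth)
    return depths

def expand_to_subwords(depths, subword_counts):
    """Copy each word's depth to all of its BPE pieces, then append the
    end-of-sentence position: one more than the maximum depth in the tree."""
    out = []
    for depth, n in zip(depths, subword_counts):
        out.extend([depth] * n)
    out.append(max(depths) + 1)
    return out

# Hypothetical parse of "Bush held a talk with Sharon" (indices 0..5):
# "held" is the origin, "talk" depends on "held".
heads = [1, -1, 3, 1, 3, 4]
depths = abs_structural_positions(heads)
# Suppose the last word is split into two BPE pieces.
positions = expand_to_subwords(depths, [1, 1, 1, 1, 1, 2])
```

With this parse the main verb gets position 0, its direct dependents get 1, and the end-of-sentence symbol receives a position one larger than the deepest word.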

Relative Structural Position

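For the relative structural position $rel_{stru}(x_i, x_j)$, we calculate in the dependency tree following two hierarchical rules:

1. if $x_i$ and $x_j$ are at the same dependency edge, $rel_{stru}(x_i, x_j) = abs_{stru}(x_i) - abs_{stru}(x_j)$.

2. if $x_i$ and $x_j$ are at different dependency edges, $rel_{stru}(x_i, x_j) = f_{stru}(i - j) \cdot (abs_{stru}(x_i) + abs_{stru}(x_j))$, where

$$f_{stru}(x) = \begin{cases} 1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0 \end{cases} \qquad (6)$$

Following Shaw:2018:NAACL, we use a clipping distance to limit the maximum relative position.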
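The two rules can be sketched as below. This is an illustration under explicit assumptions, not the authors' code: two words are treated as being "at the same dependency edge" when one lies on the other's path to the origin, in which case the relative position is the difference of their depths; otherwise it is the sum of their depths, signed by their sequential order. The head-array format, the function names, and the example parse are hypothetical.

```python
def depths_and_ancestors(heads):
    """For each word, return its depth and the set of words on its path to the origin."""
    depths, ancestors = [], []
    for i in range(len(heads)):
        chain, node = set(), i
        while heads[node] != -1:
            node = heads[node]
            chain.add(node)
        depths.append(len(chain))
        ancestors.append(chain)
    return depths, ancestors

def sign(x):
    """Return 1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

def rel_structural(i, j, heads, clip):
    """Relative structural position of word i with respect to word j.

    Rule 1 (assumed reading): same root path -> difference of depths.
    Rule 2 (assumed reading): different paths -> signed sum of depths.
    The result is clipped to [-clip, clip].
    """
    depths, anc = depths_and_ancestors(heads)
    if i == j or j in anc[i] or i in anc[j]:
        rel = depths[i] - depths[j]
    else:
        rel = sign(i - j) * (depths[i] + depths[j])
    return max(-clip, min(clip, rel))

# Hypothetical parse of "Bush held a talk with Sharon"; "held" (index 1) is the origin.
heads = [1, -1, 3, 1, 3, 4]
talk_vs_held = rel_structural(3, 1, heads, clip=16)   # head and dependent
bush_vs_talk = rel_structural(0, 3, heads, clip=16)   # different branches
```

Under this reading, "talk" and "held" are one step apart, matching the structural distance of 1 described earlier, while two words on different branches are related through the origin.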

3.2 Integrating Structural PE into SANs

We inherit the position encoding functions from the sequential approaches (Eq.3 and Eq.4) to implement the structural position encoding strategies. Since structural position representations capture position information that is complementary to their sequential counterparts, we also integrate the structural position encoding into SANs together with the sequential approaches Vaswani:2017:NIPS; Shaw:2018:NAACL.

For the absolute position, we use a nonlinear function to fuse the sequential and structural position representations:

$$abs(x_i) = f_{abs}\big(\mathrm{ABSPE}(abs_{seq}),\ \mathrm{ABSPE}(abs_{stru})\big) \qquad (7)$$

where $f_{abs}$ is the nonlinear function, and $\mathrm{ABSPE}(abs_{seq})$ and $\mathrm{ABSPE}(abs_{stru})$ are the absolute sequential and structural position embeddings in Eq.3 and Eq.5 respectively. (Footnote 1: We also use a parameter-free element-wise addition method to combine the two absolute position embeddings and get a 0.28 BLEU point improvement on the development set of NIST Chinese-to-English over the baseline model that only uses absolute sequential encoding.)
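The two ways of combining absolute position embeddings mentioned here can be sketched as follows. The element-wise addition is the parameter-free variant described in the footnote. The exact form of the nonlinear function is not given in the text, so `fuse_nonlinear` uses a tanh over a linear projection of the concatenated embeddings purely as a hypothetical stand-in; the function names, `weight`, and `bias` are assumptions.

```python
import math

def fuse_add(seq_emb, stru_emb):
    """Parameter-free fusion: element-wise sum of the two position embeddings."""
    return [a + b for a, b in zip(seq_emb, stru_emb)]

def fuse_nonlinear(seq_emb, stru_emb, weight, bias):
    """Hypothetical nonlinear fusion: tanh(W [seq; stru] + b).

    weight has one row per output dimension and 2 * d columns;
    bias has one entry per output dimension.
    """
    concat = seq_emb + stru_emb  # list concatenation, length 2 * d
    return [
        math.tanh(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weight, bias)
    ]

seq_emb = [1.0, 2.0]
stru_emb = [0.5, -2.0]
added = fuse_add(seq_emb, stru_emb)
# Toy parameters: each output dimension averages the matching input dimensions.
weight = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
fused = fuse_nonlinear(seq_emb, stru_emb, weight, bias=[0.0, 0.0])
```

The learned variant lets the model weight sequential and structural information differently per dimension, at the cost of extra parameters.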

For the relative position, we follow Shaw:2018:NAACL to extend the self-attention computation to consider the pairwise relationships, and project the relative structural position as described in Eq.(3) and Eq.(4) of Shaw:2018:NAACL. (Footnote 2: Due to space limitations we do not show these functions. Please refer to Shaw:2018:NAACL for more detail.)

4 Related Work

There has been growing interest in improving the representation power of SANs Dou:2018:EMNLP; Dou:2019:AAAI; Yang:2018:EMNLP; WANG:2018:COLING; Wu:2018:ACL; Yang:2019:AAAI; Yang:2019:NAACL; sukhbaatar:2019:ACL. Among these approaches, a straightforward strategy is to augment the SANs with position representations Shaw:2018:NAACL; Ma:2019:NAACL; bello:2019:attention; Yang:2019:ICASSP, as the position representations involve element-wise attention computation. In this work, we propose to augment SANs with structural position representations to model the latent structure of the input sentence.

Our work is also related to structure modeling for SANs, as the proposed model utilizes the dependency tree to generate structural representations. Recently, Hao:2019:NAACL and Hao:2019:EMNLPb integrate recurrence into the SANs and empirically demonstrate that the hybrid models achieve better performance by modeling the structure of sentences. Hao:2019:EMNLPa further make use of multi-head attention to form multi-granularity self-attention, which captures phrases of different granularity in source sentences. The difference is that we treat the position representation as a medium to transfer the structure information from the dependency tree into the SANs.

#   Sequential    Structural    Spd.   BLEU
    Abs.  Rel.    Abs.  Rel.
1   ×     ×       ×     ×       2.81   28.33
2   ×     ×       ✓     ×       2.53   35.43
3   ×     ×       ×     ✓       2.65   34.23
4   ✓     ×       ×     ×       3.23   44.31
5   ✓     ×       ✓     ×       2.65   44.84
6   ✓     ×       ✓     ✓       2.52   45.10
7   ✓     ✓       ×     ×       3.18   45.02
8   ✓     ✓       ✓     ×       2.64   45.43
9   ✓     ✓       ✓     ✓       2.48   45.67

Table 1: Impact of the position encoding components on the Chinese-to-English NIST02 development dataset using the Transformer-Base model. “Abs.” and “Rel.” denote absolute and relative position encoding, respectively. “Spd.” denotes the decoding speed (sentences/second) on a Tesla M40; the speed of the structural position encoding strategies includes the dependency parsing step.

Model Architecture              Zh⇒En                                  En⇒De
                                MT03   MT04   MT05   MT06   Avg     WMT14
Hao:2019:NAACL                  -      -      -      -      -       28.98
Transformer-Big                 45.30  46.49  45.21  44.87  45.47   28.58
    + Structural PE             45.62  47.12  45.84  45.64  46.06   28.88
    + Relative Sequential PE    45.45  47.01  45.65  45.87  46.00   28.90
        + Structural PE         45.85  47.37  46.20  46.18  46.40   29.19

Table 2: Evaluation of translation performance on NIST Zh⇒En and WMT14 En⇒De test sets. Hao:2019:NAACL is a Transformer-Big model that adopts an additional recurrence encoder with an attentive recurrent network to model syntactic structure. Significance over Transformer-Big is tested by bootstrap resampling koehn-2004-statistical.

Model             Surface               Syntactic                     Semantic
                  SeLen  WC     Avg     TrDep  ToCo   BShif  Avg      Tense  SubN   ObjN   SoMo   CoIn   Avg
Base              92.20  63.00  77.60   44.74  79.02  71.24  65.00    89.24  84.69  84.53  52.13  62.47  74.61
 + Rel. Seq. PE   89.82  63.17  76.50   45.09  78.45  71.40  64.98    88.74  87.00  85.53  51.68  62.21  75.03
   + Stru. PE     89.54  62.90  76.22   46.12  79.12  72.36  65.87    89.30  85.47  84.94  52.90  62.99  75.12

Table 3: Performance on linguistic probing tasks. The probing tasks evaluate the linguistic knowledge embedded in the Transformer-Base encoder outputs. “Base”, “+ Rel. Seq. PE”, and “+ Stru. PE” denote Transformer-Base, Transformer-Base with relative sequential PE, and Transformer-Base with both relative sequential PE and structural PE, respectively.

5 Experiment

We conduct experiments on the widely used NIST Chinese-to-English and WMT14 English-to-German data, and report the 4-gram BLEU score papineni2002bleu.


For Chinese-to-English, we use a training dataset consisting of about million sentence pairs. The NIST 2002 (MT02) dataset is used as the development set, and the NIST 2003, 2004, 2005, and 2006 datasets are used as test sets. We use the byte-pair encoding (BPE) toolkit with 32K merge operations to alleviate the out-of-vocabulary problem.


For English-to-German, we use the dataset consisting of about million sentence pairs as the training set. The newstest2013 and newstest2014 sets are used as the development set and the test set, respectively. We also apply BPE with 32K merge operations to obtain subword units.

We evaluate the proposed position encoding strategies on Transformer Vaswani:2017:NIPS and implement them on top of THUMT zhang2017thumt. We use the Stanford parser klein2003accurate to parse the sentences and obtain the structural absolute and relative positions as described in Section 3. When using relative structural position encoding, we use a clipping distance of 16. To make a fair comparison, we validate the different position encoding strategies on the encoder and keep the Transformer decoder unchanged. We study the variations of the Base model on the Chinese-to-English task, and evaluate the overall performance with the Big model on both translation tasks.

5.1 Model Variations

We evaluate the importance of the proposed absolute and relative structural position encoding strategies and study the variations with the Transformer-Base model on Chinese-to-English data. The experimental results on the development set are shown in Table 1.

Effect of Position Encoding

We first remove the sequential encoding from the Transformer encoder (Model #1) and observe that the translation performance degrades dramatically (from 44.31 to 28.33 BLEU, Model #4 vs. Model #1), which demonstrates the necessity of the position encoding strategies.

Effect of Structural Position Encoding

Then we validate our proposed structural position encoding strategies over the position-unaware model (Models #2-3). We find that the absolute and relative structural position encoding strategies improve the translation performance by 7.10 and 5.90 BLEU points respectively, which shows that introducing either of the proposed structural positions improves translation performance.

Combination of Sequential and Structural Position Encoding Strategies

We integrate the absolute and relative structural position encoding strategies into the Base model equipped with absolute sequential position encoding (Models #4-6). We observe that the two proposed approaches achieve improvements over the Base model with a marginal decrease in decoding speed.

Finally, we validate the proposed structural position encoding over the Base model equipped with absolute and relative sequential position encoding (Models #7-9). We find that sequential relative encoding obtains a 0.71 BLEU point improvement (Model #7 vs. Model #4) and structural position encoding achieves a further improvement of 0.65 BLEU points (Model #9 vs. Model #7), demonstrating the effectiveness of the proposed structural position encoding strategies.

5.2 Main Results

We validate the proposed structural encoding strategies over the Transformer-Big model on Chinese-to-English and English-to-German data, and list the results in Table 2.

For Chinese-to-English, structural position encoding (+ Structural PE) outperforms Transformer-Big by 0.59 BLEU points on average over the four NIST test sets. The sequential relative encoding approach (+ Relative Sequential PE) outperforms Transformer-Big by 0.53 BLEU points, and adding structural position encoding on top of it (+ Structural PE) achieves a further improvement of 0.40 BLEU points, outperforming Transformer-Big by 0.93 BLEU points. For English-to-German, a similar phenomenon is observed, which reveals that the proposed structural position encoding strategy can consistently boost translation performance over both the absolute and relative sequential position representations.
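The averages and gains quoted above can be re-derived from the per-test-set scores in Table 2. The short check below is only an arithmetic illustration; the dictionary keys are hypothetical labels, and `Decimal` with half-up rounding is used so that values such as 46.055 round to 46.06 as in the table.

```python
from decimal import Decimal, ROUND_HALF_UP

# BLEU on MT03, MT04, MT05, MT06, copied from Table 2.
scores = {
    "big": ["45.30", "46.49", "45.21", "44.87"],
    "big+stru": ["45.62", "47.12", "45.84", "45.64"],
    "big+rel": ["45.45", "47.01", "45.65", "45.87"],
    "big+rel+stru": ["45.85", "47.37", "46.20", "46.18"],
}

def average(values):
    """Mean of the scores, rounded half-up to two decimals."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

avg = {name: average(vals) for name, vals in scores.items()}

gain_stru = avg["big+stru"] - avg["big"]             # structural PE alone
gain_rel = avg["big+rel"] - avg["big"]               # relative sequential PE alone
gain_further = avg["big+rel+stru"] - avg["big+rel"]  # structural PE on top of relative
gain_total = avg["big+rel+stru"] - avg["big"]        # both combined
```

The four differences reproduce the 0.59, 0.53, 0.40, and 0.93 BLEU point gains reported in the text.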

5.3 Linguistic Probing Evaluation

We conduct probing tasks P18-1198 to evaluate the structural knowledge embedded in the encoder output of the variations of the Base model that are trained on the En⇒De translation task.

We follow Wang:2019:ACL to set the model configurations. The experimental results on the probing tasks are shown in Table 3, and the BLEU scores of “Base”, “+ Rel. Seq. PE”, and “+ Stru. PE” are 27.31, 27.99 and 28.30, respectively. From the table, we can see that 1) adding the relative sequential positional embedding achieves an improvement over the baseline on semantic tasks (75.03 vs. 74.61), which may indicate that the model benefits more from semantic modeling; and 2) with the structural positional embedding, the model obtains an improvement on syntactic tasks (65.87 vs. 64.98), which indicates that the representations preserve more syntactic knowledge.

6 Conclusion

In this paper, we have presented a novel structural position encoding strategy to augment SANs by considering the latent structure of the input sentence. We extract structural absolute and relative positions from the dependency tree and integrate them into SANs. Experimental results on Chinese-to-English and English-to-German translation tasks have demonstrated that the proposed approach consistently improves translation performance over both the absolute and relative sequential position representations.

Future directions include inferring the structure representations from AMR song-etal-2019-semantic or from external SMT knowledge AAAI1714451. Furthermore, the structural position encoding can also be applied to the decoder with RNN Grammars dyer2016recurrent; eriguchi2017learning, which we leave for future work.