Self-Attention with Structural Position Representations

by Xing Wang et al.

Although self-attention networks (SANs) have advanced the state-of-the-art on various NLP tasks, one criticism of SANs is their limited ability to encode the positions of input words (Shaw et al., 2018). In this work, we propose to augment SANs with structural position representations to model the latent structure of the input sentence, which is complementary to the standard sequential position representations. Specifically, we use the dependency tree to represent the grammatical structure of a sentence, and propose two strategies to encode the positional relationships among words in the dependency tree. Experimental results on NIST Chinese-to-English and WMT14 English-to-German translation tasks show that the proposed approach consistently boosts performance over both the absolute and relative sequential position representations.





1 Introduction

In recent years, self-attention networks (SANs; Parikh et al., 2016; Lin et al., 2017) have achieved state-of-the-art results on a variety of NLP tasks (Vaswani et al., 2017; Strubell et al., 2018; Devlin et al., 2019). SANs perform the attention operation under a position-unaware "bag-of-words" assumption, in which the positions of the input words are ignored. Therefore, absolute position (Vaswani et al., 2017) or relative position (Shaw et al., 2018) representations are generally used to capture the sequential order of words in the sentence. However, several studies reveal that the sequential structure alone may not be sufficient for NLP tasks (Tai et al., 2015; Kim et al., 2016; Shen et al., 2018), since sentences inherently have hierarchical structures (Chomsky, 2014; Bever, 1970).

In response to this problem, we propose to augment SANs with structural position representations to capture the hierarchical structure of the input sentence. The starting point for our approach is a recent finding: the latent structure of a sentence can be captured by structural depths and distances (Hewitt and Manning, 2019). Accordingly, we propose an absolute structural position to encode the depth of each word in a parsing tree, and a relative structural position to encode the distance of each word pair in the tree.

We implement our structural encoding strategies on top of the Transformer (Vaswani et al., 2017) and conduct experiments on both NIST Chinese⇒English and WMT14 English⇒German translation tasks. Experimental results show that exploiting the structural position encoding strategies consistently boosts performance over both the absolute and relative sequential position representations across language pairs. Linguistic analyses (Conneau et al., 2018) reveal that the proposed structural position representation improves translation performance through richer syntactic information. Our main contributions are:

  • Our study demonstrates the necessity and effectiveness of exploiting structural position encoding for SANs, which benefits from modeling syntactic depth and distance under the latent structure of the sentence.

  • We propose structural position representations for SANs to encode the latent structure of the input sentence, which are complementary to their sequential counterparts.

Figure 1: Illustration of (a) the standard sequential position encoding (Vaswani et al., 2017; Shaw et al., 2018), and (b) the proposed structural position encoding. The relative position in the example is for the word "talk".

2 Background


SANs produce representations by applying attention to each pair of elements from the input sequence, regardless of their distance. Given an input sequence $\mathbf{X} = (x_1, \dots, x_n)$, the model first transforms it into queries $\mathbf{Q}$, keys $\mathbf{K}$, and values $\mathbf{V}$:

$$\mathbf{Q}, \mathbf{K}, \mathbf{V} = \mathbf{X}\mathbf{W}_Q, \ \mathbf{X}\mathbf{W}_K, \ \mathbf{X}\mathbf{W}_V \quad (1)$$

where $\{\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V\} \in \mathbb{R}^{d \times d}$ are trainable parameters and $d$ indicates the hidden size. The output sequence $\mathbf{O}$ is calculated as

$$\mathbf{O} = \mathrm{Att}(\mathbf{Q}, \mathbf{K})\,\mathbf{V} \quad (2)$$

where $\mathrm{Att}(\mathbf{Q}, \mathbf{K}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\right)$ is a dot-product attention model.
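As a concrete illustration, the dot-product attention computation can be sketched in a few lines of NumPy (a single-head toy example; the randomly initialized `W_q`, `W_k`, `W_v` are stand-ins for the trained projection parameters):

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: O = softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) pairwise logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d) output sequence

rng = np.random.default_rng(0)
n, d = 5, 8                                          # sentence length, hidden size
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
O = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(O.shape)  # (5, 8)
```

Note that position information enters this computation only through what is added to the input representations or to the keys and values, which is exactly where the sequential and structural encodings discussed below plug in.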

Sequential Position Encoding

To make use of the order of the sequence, information about the absolute or relative positions of the elements in the sequence is injected into the SANs:

  • Absolute Sequential PE (Vaswani et al., 2017) is defined as

    $$\mathrm{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad \mathrm{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) \quad (3)$$

    where $pos$ is the absolute position in the sequence and $i$ indexes the dimension of the position representation; the sine is used for the even dimensions and the cosine for the odd dimensions.

    Vaswani et al. (2017) propose to combine the fixed sequential position representation with the word embedding by element-wise addition, and to feed the combined representation to the SANs.

  • Relative Sequential PE (Shaw et al., 2018) is calculated as

    $$\mathrm{PE}(x_i, x_j) = \mathbf{a}_{j-i} \quad (4)$$

    where $j - i$ is the relative position of word $x_j$ to the queried word $x_i$, which is used to index a learnable matrix $\mathbf{a}$ of relative position embeddings.

    Shaw et al. (2018) propose relation-aware SANs that take the relative sequential encoding as an additional key and value (Eq. 2) in the attention computation.

Figure 1(a) shows an example of absolute and relative sequential positions.
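The absolute sinusoidal encoding of Eq. 3 can be computed directly in NumPy (a minimal sketch assuming an even model dimension `d_model`):

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1) absolute positions
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = sinusoidal_position_encoding(max_len=50, d_model=16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0: [0. 1. 0. 1.]
```

The same lookup can encode any integer index, which is what makes it reusable for the structural positions introduced in Section 3.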

3 Approach

3.1 Structural Position Representations

In this study, we choose the dependency tree to represent sentence structure for its simplicity in modeling syntactic relationships among input words. Figure 1 shows an example illustrating the idea of the proposed approach. From the perspective of the relationship path between words, sequential PE measures the sequential distance between words: as shown in Figure 1(a), for each word, the absolute sequential position represents the sequential distance to the beginning of the sentence, while the relative sequential position measures the relative distance to the queried word ("talk" in the example).

The latent structure can be interpreted in various ways, from syntactic tree structures, e.g., the constituency tree (Collins, 2003) or the dependency tree (Kübler et al., 2009), to semantic graph structures, e.g., the abstract meaning representation graph (Banarescu et al., 2013). In this work, the dependency path, which is induced from the dependency tree, is adopted to provide a new perspective on modelling pairwise relationships.

Figure 1 shows the difference between the sequential path and the dependency path. The sequential distance between the two words "held" and "talk" is 2, while their structural distance is only 1, as the word "talk" is a dependent of the head "held" (Nivre, 2005).

Absolute Structural Position

We exploit the tree depth of a word in the dependency tree as its absolute structural position. Specifically, we treat the main verb (Tapanainen and Järvinen, 1997) of the sentence as the origin and use the length of the dependency path from the target word to the origin as the absolute structural position:

$$\mathrm{Abs}(x_i) = \mathrm{distance}(x_i, \mathrm{origin} \mid T) \quad (5)$$

where $x_i$ is the target word, $T$ is the given dependency structure, and the origin is the main verb of $T$.

In the field of NMT, BPE sub-words and the end-of-sentence symbol must be handled carefully, as they do not appear in the conventional dependency tree. In this work, we let BPE sub-words share the absolute structural position of their original word, and set the absolute structural position of the end-of-sentence symbol to the first integer larger than the maximum absolute structural position in the dependency tree.
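A minimal sketch of the absolute structural position computation, assuming the dependency parse is given as a list of head indices; the helper names and the example parse of "Bush held a talk with Sharon" are illustrative, not the paper's exact figure:

```python
def absolute_structural_positions(heads):
    """Absolute structural position = depth in the dependency tree.
    heads[i] is the index of word i's head; the main verb (the origin)
    has head -1 and receives position 0."""
    def depth(i):
        return 0 if heads[i] == -1 else 1 + depth(heads[i])
    return [depth(i) for i in range(len(heads))]

def expand_to_subwords(positions, subword_to_word):
    """BPE sub-words share the structural position of their original word;
    the end-of-sentence symbol gets the first integer larger than the
    maximum structural position in the tree."""
    expanded = [positions[w] for w in subword_to_word]
    expanded.append(max(positions) + 1)  # <eos>
    return expanded

# One plausible parse of "Bush held a talk with Sharon": "held" is the
# main verb; "Bush", "talk", and "Sharon" depend on it.
heads = [1, -1, 3, 1, 5, 1]
print(absolute_structural_positions(heads))  # [1, 0, 2, 1, 2, 1]
# If "Sharon" were split into two BPE sub-words, both would share position 1:
print(expand_to_subwords([1, 0, 2, 1, 2, 1], [0, 1, 2, 3, 4, 5, 5]))
# [1, 0, 2, 1, 2, 1, 1, 3]
```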

Relative Structural Position

For the relative structural position $\mathrm{Rel}(x_i, x_j)$, we calculate it in the dependency tree following two hierarchical rules:

1. if $x_i$ and $x_j$ are on the same dependency edge, then $\mathrm{Rel}(x_i, x_j) = \mathrm{Abs}(x_j) - \mathrm{Abs}(x_i)$;

2. if $x_i$ and $x_j$ are on different dependency edges, then $\mathrm{Rel}(x_i, x_j) = \big(\mathrm{Abs}(x_i) + \mathrm{Abs}(x_j)\big) \times f(x_i, x_j)$, where the sign factor $f(x_i, x_j)$ is $1$ if $x_i$ precedes $x_j$ in the sentence and $-1$ otherwise.

Following Shaw et al. (2018), we use a clipping distance $k$ to limit the maximum relative position.
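The two rules above can be sketched as follows. The concrete sign function `f` is a reconstruction (assumed to follow sequential word order), and "same dependency edge" is implemented as a direct head-dependent link; both are assumptions where the text leaves details open:

```python
import numpy as np

def relative_structural_positions(heads, clip=16):
    """Pairwise relative structural positions from a dependency parse.
    heads[i] is the head index of word i; the main verb has head -1.
    Rule 1: words on the same edge -> difference of depths.
    Rule 2: otherwise -> sum of depths, signed by sequential order (assumed)."""
    n = len(heads)
    def depth(i):
        return 0 if heads[i] == -1 else 1 + depth(heads[i])
    d = [depth(i) for i in range(n)]
    rel = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if heads[i] == j or heads[j] == i:    # same dependency edge
                rel[i, j] = d[j] - d[i]
            elif i != j:                          # different edges
                f = 1 if j > i else -1            # assumed sign convention
                rel[i, j] = (d[i] + d[j]) * f
            # i == j: relative position stays 0
    # clipping distance limits the maximum relative position
    return np.clip(rel, -clip, clip)

heads = [1, -1, 3, 1, 5, 1]          # hypothetical parse with "held" as root
rel = relative_structural_positions(heads)
print(rel[1, 3], rel[3, 1])          # 1 -1  (head "held" vs. dependent "talk")
```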

3.2 Integrating Structural PE into SANs

We inherit the position encoding functions from the sequential approaches (Eq. 3 and Eq. 4) to implement the structural position encoding strategies. Since the structural position representations capture position information complementary to their sequential counterparts, we also integrate the structural position encodings into SANs together with the sequential approaches (Vaswani et al., 2017; Shaw et al., 2018).

For the absolute position, we use a nonlinear function to fuse the sequential and structural position representations:

$$\mathrm{PE}(x_i) = \mathrm{FFN}\big(\mathrm{PE}_{seq}(x_i), \mathrm{PE}_{stru}(x_i)\big) \quad (6)$$

where $\mathrm{FFN}(\cdot)$ is the nonlinear function, and $\mathrm{PE}_{seq}(x_i)$ and $\mathrm{PE}_{stru}(x_i)$ are the absolute sequential and structural position embeddings in Eq. 3 and Eq. 5, respectively. (We also tried a parameter-free element-wise addition to combine the two absolute position embeddings, which yields a 0.28 BLEU point improvement on the NIST Chinese⇒English development set over the baseline model that only uses absolute sequential encoding.)

For the relative position, we follow Shaw et al. (2018) to extend the self-attention computation to consider pairwise relationships, projecting the relative structural position as described in Eq. (3) and Eq. (4) of Shaw et al. (2018). (Due to space limitations we do not show these functions; please refer to Shaw et al. (2018) for more details.)
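A sketch of the absolute fusion step: the concrete nonlinear function (here a `tanh` over a learned projection of the concatenation, with hypothetical parameters `W` and `b`) is an assumption, since the text only specifies that a nonlinear function fuses the two embeddings:

```python
import numpy as np

def fuse_absolute_positions(pe_seq, pe_stru, W, b):
    """Nonlinear fusion of sequential and structural absolute position
    embeddings: PE = tanh([PE_seq; PE_stru] W + b). The tanh-over-projection
    form is an assumed instantiation of the FFN in Eq. 6."""
    combined = np.concatenate([pe_seq, pe_stru], axis=-1)  # (n, 2d)
    return np.tanh(combined @ W + b)                       # (n, d)

rng = np.random.default_rng(0)
n, d = 6, 8
pe_seq = rng.normal(size=(n, d))       # stand-in for sinusoidal PE (Eq. 3)
pe_stru = rng.normal(size=(n, d))      # stand-in for structural PE (Eq. 5)
W, b = rng.normal(size=(2 * d, d)), np.zeros(d)
fused = fuse_absolute_positions(pe_seq, pe_stru, W, b)
print(fused.shape)  # (6, 8)
```

The fused embedding keeps the model dimension unchanged, so it can be added to the word embeddings exactly as the standard absolute PE is.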

4 Related Work

There has been growing interest in improving the representation power of SANs (Dou et al., 2018, 2019; Yang et al., 2018, 2019; Wang et al., 2018; Wu et al., 2018; Sukhbaatar et al., 2019). Among these approaches, a straightforward strategy is to augment the SANs with position representations (Shaw et al., 2018; Ma et al., 2019; Bello et al., 2019), since the position representations are involved in the element-wise attention computation. In this work, we propose to augment SANs with structural position representations to model the latent structure of the input sentence.

Our work is also related to structure modeling for SANs, as the proposed model utilizes the dependency tree to generate structural representations. Recently, Hao et al. (2019) integrated recurrence into SANs and empirically demonstrated that the hybrid models achieve better performance by modeling the structure of sentences; they further made use of multi-head attention to form multi-granularity self-attention, capturing phrases of different granularities in the source sentences. The difference is that we treat the position representation as a medium to transfer structural information from the dependency tree into the SANs.

 # | Sequential  | Structural  | Spd. | BLEU
   | Abs. | Rel. | Abs. | Rel. |      |
 1 |  ×   |  ×   |  ×   |  ×   | 2.81 | 28.33
 2 |  ×   |  ×   |  ✓   |  ×   | 2.53 | 35.43
 3 |  ×   |  ×   |  ×   |  ✓   | 2.65 | 34.23
 4 |  ✓   |  ×   |  ×   |  ×   | 3.23 | 44.31
 5 |  ✓   |  ×   |  ✓   |  ×   | 2.65 | 44.84
 6 |  ✓   |  ×   |  ✓   |  ✓   | 2.52 | 45.10
 7 |  ✓   |  ✓   |  ×   |  ×   | 3.18 | 45.02
 8 |  ✓   |  ✓   |  ✓   |  ×   | 2.64 | 45.43
 9 |  ✓   |  ✓   |  ✓   |  ✓   | 2.48 | 45.67
Table 1: Impact of the position encoding components on the Chinese⇒English NIST02 development set using the Transformer-Base model. "Abs." and "Rel." denote absolute and relative position encoding, respectively; "✓" and "×" denote that a component is enabled or disabled. "Spd." denotes the decoding speed (sentences/second) on a Tesla M40; the speed of the structural position encoding strategies includes the dependency parsing step.
Model Architecture           |          Zh⇒En                  | En⇒De
                             | MT03  MT04  MT05  MT06  | Avg   | WMT14
Hao et al. (2019)            |   -     -     -     -   |   -   | 28.98
Transformer-Big              | 45.30 46.49 45.21 44.87 | 45.47 | 28.58
  + Structural PE            | 45.62 47.12 45.84 45.64 | 46.06 | 28.88
  + Relative Sequential PE   | 45.45 47.01 45.65 45.87 | 46.00 | 28.90
    + Structural PE          | 45.85 47.37 46.20 46.18 | 46.40 | 29.19
Table 2: Evaluation of translation performance on the NIST Zh⇒En and WMT14 En⇒De test sets. Hao et al. (2019) is a Transformer-Big model that adopts an additional recurrence encoder with an attentive recurrent network to model syntactic structure. Significance over the Transformer-Big is tested by bootstrap resampling (Koehn, 2004).
Model           | Surface            | Syntactic                 | Semantic
                | SeLen  WC    Avg   | TrDep  ToCo  BShif  Avg   | Tense  SubN   ObjN   SoMo   CoIn   Avg
Base            | 92.20  63.00 77.60 | 44.74  79.02 71.24  65.00 | 89.24  84.69  84.53  52.13  62.47  74.61
 + Rel. Seq. PE | 89.82  63.17 76.50 | 45.09  78.45 71.40  64.98 | 88.74  87.00  85.53  51.68  62.21  75.03
   + Stru. PE   | 89.54  62.90 76.22 | 46.12  79.12 72.36  65.87 | 89.30  85.47  84.94  52.90  62.99  75.12
Table 3: Performance on linguistic probing tasks, evaluated on the linguistic information embedded in the Transformer-Base encoder outputs. "Base", "+ Rel. Seq. PE", and "+ Stru. PE" denote the Transformer-Base model, Transformer-Base with relative sequential PE, and Transformer-Base with both relative sequential PE and structural PE, respectively.

5 Experiment

We conduct experiments on the widely used NIST Chinese⇒English and WMT14 English⇒German data, and report the 4-gram BLEU score (Papineni et al., 2002).


The training dataset consists of about million sentence pairs. The NIST 2002 (MT02) dataset is used as the development set, and the NIST 2003, 2004, 2005, and 2006 datasets are used as test sets. We use the byte-pair encoding (BPE) toolkit with 32K merge operations to alleviate the out-of-vocabulary problem.


The training set consists of about million sentence pairs. newstest2013 and newstest2014 are used as the development set and the test set, respectively. We also apply BPE with 32K merge operations to obtain subword units.

We evaluate the proposed position encoding strategies on the Transformer (Vaswani et al., 2017) and implement them on top of THUMT (Zhang et al., 2017). We use the Stanford parser (Klein and Manning, 2003) to parse the sentences and obtain the absolute and relative structural positions as described in Section 3. When using relative structural position encoding, we set the clipping distance $k = 16$. To make a fair comparison, we validate the different position encoding strategies on the encoder and keep the Transformer decoder unchanged. We study the variations of the Base model on the Chinese⇒English task, and evaluate the overall performance with the Big model on both translation tasks.

5.1 Model Variations

We evaluate the importance of the proposed absolute and relative structural position encoding strategies and study the model variations with the Transformer-Base model on the Chinese⇒English data. The experimental results on the development set are shown in Table 1.

Effect of Position Encoding

We first remove the sequential position encoding from the Transformer encoder (Model #1) and observe that the translation performance degrades dramatically (28.33 vs. 44.31 BLEU for the Base model, #4), which demonstrates the necessity of the position encoding strategies.

Effect of Structural Position Encoding

Then we validate the proposed structural position encoding strategies on the position-unaware model (Models #2-3). We find that the absolute and relative structural position encoding strategies improve the translation performance by 7.10 and 5.90 BLEU points, respectively, showing that the proposed structural positions are beneficial even in the absence of sequential position information.

Combination of Sequential and Structural Position Encoding Strategies

We integrate the absolute and relative structural position encoding strategies into the Base model equipped with absolute sequential position encoding (Models #4-6). We observe that the two proposed approaches achieve improvements over the Base model with only a marginal decrease in decoding speed.

Finally, we validate the proposed structural position encoding on the Base model equipped with both absolute and relative sequential position encoding (Models #7-9). We find that relative sequential encoding obtains a 0.71 BLEU point improvement (Model #7 vs. Model #4) and that structural position encoding achieves a further improvement of 0.65 BLEU points (Model #9 vs. Model #7), demonstrating the effectiveness of the proposed structural position encoding strategies.

5.2 Main Results

We validate the proposed structural encoding strategies with the Transformer-Big model on the Chinese⇒English and English⇒German data, and list the results in Table 2.

For Chinese⇒English, structural position encoding (+ Structural PE) outperforms the Transformer-Big by 0.59 BLEU points on average over the four NIST test sets. The relative sequential encoding approach (+ Relative Sequential PE) outperforms the Transformer-Big by 0.53 BLEU points, and structural position encoding (+ Structural PE) achieves a further improvement of 0.40 BLEU points, outperforming the Transformer-Big by 0.93 BLEU points in total. For English⇒German, a similar phenomenon is observed, which reveals that the proposed structural position encoding strategy consistently boosts translation performance over both the absolute and relative sequential position representations.

5.3 Linguistic Probing Evaluation

We conduct probing tasks (Conneau et al., 2018) to evaluate the structural knowledge embedded in the encoder outputs of the Base model variants trained on the En⇒De translation task.

We follow Wang et al. (2019) to set the model configurations. The experimental results on the probing tasks are shown in Table 3; the BLEU scores of "Base", "+ Rel. Seq. PE", and "+ Stru. PE" are 27.31, 27.99, and 28.30, respectively. From the table we can see that: 1) adding the relative sequential position embedding achieves an improvement over the baseline on the semantic tasks (75.03 vs. 74.61), which may indicate that the model benefits more from semantic modeling; and 2) with the structural position embedding, the model obtains an improvement on the syntactic tasks (65.87 vs. 64.98), which indicates that the representations preserve more syntactic knowledge.

6 Conclusion

In this paper, we have presented a novel structural position encoding strategy to augment SANs by considering the latent structure of the input sentence. We extract absolute and relative structural positions from the dependency tree and integrate them into SANs. Experimental results on Chinese⇒English and English⇒German translation tasks have demonstrated that the proposed approach consistently improves translation performance over both the absolute and relative sequential position representations.

Future directions include inferring the structure representations from AMR (Song et al., 2019) or from external SMT knowledge (Wang et al., 2017). Furthermore, the structural position encoding can also be applied to the decoder together with RNN Grammars (Dyer et al., 2016; Eriguchi et al., 2017), which we leave for future work.