Recently, much research has been dedicated to either supplementing RNN and CNN models with attention mechanisms or replacing them with attention-only approaches, as proposed by [Vaswani et al.2017] and [Shen et al.2018]. However, most of this research concentrates on Neural Machine Translation (NMT). Until recently, self-attention-only mechanisms have not been widely used for other NLP tasks.
There are, however, approaches to slot filling that use attention mechanisms to improve the performance of an LSTM layer. An example is the position-aware attention on top of an LSTM proposed by [Zhang et al.2017]. In this paper, we combine the self-attention encoder formulated by [Vaswani et al.2017] with the position-aware attention layer of [Zhang et al.2017]. Additionally, we propose various modifications to both approaches, most notably an attention weighting scheme that models pair-wise interactions between all tokens in the input sentence, taking into account their relative positions to each other.
The task of relation classification can be paraphrased in the following way: Decide which relations (of a fixed set of given relations) hold between two selected entities in a sentence.
The TAC KBP evaluations (https://tac.nist.gov/) provide a set of 42 frequent (pre-defined) relations for persons and organizations, and annotations as to whether those relations hold for selected entities in a sentence. Some examples are:
per:employee_of: Does (did) person X work for company Y?
org:city_of_headquarters: Is (was) company/organization X based in city Y?
per:countries_of_residence: Does (did) person X live in country Y?
per:title: Does (did) person X have the job title Y?
The TACRED dataset of [Zhang et al.2017] provides the TAC KBP data of the years 2009 to 2014 in a format that can be processed as a multiclass input-output mapping, which assigns each sentence (with relational arguments marked as Subject and Object) one of the relations of interest (or the special label no_relation). An example instance is:
input: The last remaining assets of bankrupt Russian oil company Yukos - including its headquarters in Moscow - were sold at auction for nearly 3.9 billion U.S. dollars on Friday .
While RNN-based architectures already capture the relative and absolute relations between words due to their sequential nature, in slot filling we need not only to take into account the sequence of words from start to end, but also to learn how the words relate to the query and the object in the sentence. The position-aware approach by [Zhang et al.2017] already models the interactions relative to the subject and object positions. However, interactions between all other words are dealt with only by the LSTM layer.
In our approach, we substitute the LSTM layer with the self-attention encoder, a mechanism that models all pair-wise interactions in an input sentence. The self-attention approach itself does not model the sequential order of the input. However, information about this order can be provided by embeddings of the (absolute) positions in the sentence, and previous work indicates that including relative positional representations in self-attention models improves performance for Neural Machine Translation [Shaw et al.2018]. Before we describe our approach to relative positional encodings in the self-attention encoder and show how to combine the encoder with the position-aware attention layer, we provide more background on how the original implementations of these approaches work.
2.1 Self-Attention Encoder (Google Transformer)
The Google Transformer model created by [Vaswani et al.2017] is the first model that uses self-attention without any RNN or CNN based components. It is used for the task of Neural Machine Translation and has an encoder-decoder structure with multiple stacked layers. In the transformer model, the input representation for each position is used as a query to compute attention scores for all positions in the sequence. Those scores are then used to compute the weighted average of the input representations.
The attention is regarded as a mapping of a query and key/value pairs to an output, each of which is represented by a vector. More specifically, a self-attention layer provides an encoding for each position $i$ in the sequence by taking the word representation at that position as the query (a matrix $Q$ holds the queries for all positions) and computing a compatibility score with the representations at all other positions (the values, represented by a matrix $V$). The compatibility scores w.r.t. position $i$ are turned into an attention distribution over the entire sequence by the softmax function, and this distribution is used to form a weighted average of the representations at all positions, which is the resulting output representation for position $i$.
In multi-headed self-attention, input representations are first linearly mapped to lower-dimensional spaces, and the output vectors of several attention mechanisms (called heads) are concatenated to form the output of one multi-headed self-attention layer. An encoder layer consists of a self-attention layer followed by a fully connected position-wise feed-forward layer.
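For one attention head in the first self-attention layer, we obtain the vector $o_i$ for position $i$:
$$o_i = \mathrm{Attention}(x_i W^Q,\; X W^K,\; X W^V) \qquad (1)$$
where $x_i$ is the input representation at position $i$, $X$ stacks the input representations at all positions, and $W^Q$, $W^K$, $W^V$ are linear transformations (matrices) that map the input representation into a lower-dimensional space. The function computing the resulting vector (from $q_i = x_i W^Q$, $K = X W^K$ and $V = X W^V$) is defined by:
$$\mathrm{Attention}(q_i, K, V) = \mathrm{softmax}\left(\frac{q_i K^\top}{\sqrt{d_k}}\right) V \qquad (2)$$
where $d_k$ is the dimensionality of the transformed vectors.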
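As an illustration of the single-head computation, the following minimal sketch implements scaled dot-product attention for one query position in plain Python, following the formulation of [Vaswani et al.2017]. The function names `softmax` and `attend` are chosen for this example only, and the linear transformations are assumed to have been applied to the inputs already.

```python
import math


def softmax(scores):
    """Turn a list of unnormalized scores into a probability distribution."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    query:  list of d_k floats (already linearly transformed)
    keys:   list of n vectors with d_k floats each
    values: list of n vectors with d_v floats each
    Returns the output vector and the attention weights.
    """
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    d_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return output, weights
```

When all keys are identical the weights are uniform, so the output reduces to the plain average of the value vectors.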
The self-attention architecture encodes positional information by adding sinusoids of various wave-lengths [Vaswani et al.2017] to the word representations.
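A minimal sketch of these sinusoidal encodings, assuming the formulation of [Vaswani et al.2017] in which even dimensions use a sine and odd dimensions a cosine of the same frequency, with the base 10000 from that paper. The function name `sinusoidal_encoding` is hypothetical.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Absolute positional encoding vector for one position.

    Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model);
    the even dimension takes the sine, the odd one the cosine.
    """
    encoding = []
    for k in range(d_model):
        pair_index = (k // 2) * 2  # 2i for both members of a sine/cosine pair
        angle = position / (10000 ** (pair_index / d_model))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding
```

The resulting vector has the same size as the word representation and is added to it element-wise.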
2.2 Argument Extraction with Self-attention
To the best of our knowledge, the transformer model has not yet been applied to relation classification as defined above (as selecting a relation for two given entities in context). It has however been applied as an encoding layer in the related setting of argument extraction [Roth et al.2018], which is similar to question answering (where the question is a pre-defined relation).
[Roth et al.2018] apply various modifications to the original self-attention model by [Vaswani et al.2017], namely:
The residual connection goes from the beginning of the self-attention block to the last normalization step after the feed-forward layer. In the original implementation, there are two residual connections within each layer.
In our experiments we observed improvements on the development data using this version rather than the original implementation by [Vaswani et al.2017]. A more detailed overview of the results is given in Subsection 4.3.
2.3 Position-aware Attention for Slot Filling
The position-aware attention approach to slot filling by [Zhang et al.2017] uses an LSTM to encode the input and incorporates distributed representations of how words are positioned relative to the subject and the object in the sentence. This position-aware attention mechanism is used on top of a two-layer unidirectional LSTM network. In this implementation, the relative position encoding vectors are computed relative to both the subject and the object. To illustrate, if the sentence length is six words and the subject is at position two, the position encoding vector holds one signed offset per word, counted from the subject, so that position 0 indicates the subject. Each position is then mapped to an embedding by an embedding layer. The same applies to a separate vector denoting object positions, so that effectively two sequences of position embeddings are produced, $p^s$ for the subject and $p^o$ for the object, both sharing one position embedding matrix $P$ [Zhang et al.2017].
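The relative position vector can be sketched as follows. This is a minimal illustration that assumes 0-based token indices and an entity given as an inclusive span; the function name `relative_positions` is hypothetical.

```python
def relative_positions(length, start, end):
    """Signed offset of every token from an entity span [start, end].

    Tokens inside the span get 0, tokens to the left get negative
    offsets, and tokens to the right get positive offsets.
    """
    positions = []
    for i in range(length):
        if i < start:
            positions.append(i - start)
        elif i > end:
            positions.append(i - end)
        else:
            positions.append(0)
    return positions


# A six-word sentence whose subject is the second word (index 1).
subject_offsets = relative_positions(6, 1, 1)
```

Under these assumptions `subject_offsets` is `[-1, 0, 1, 2, 3, 4]`. Before an embedding lookup, the offsets would typically be shifted by a constant so that all indices are non-negative.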
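The final computation of the model uses the LSTM’s output state, a summary vector $q$, the LSTM’s output vector of hidden states $h = [h_1, \dots, h_n]$, and the embeddings $p^s_i$ and $p^o_i$ for the subject and object related positional vectors. For each hidden state $h_i$ an attention weight $a_i$ is calculated using the following two equations:
$$u_i = v^\top \tanh(W_h h_i + W_q q + W_s p^s_i + W_o p^o_i) \qquad (3)$$
$$a_i = \frac{\exp(u_i)}{\sum_{j=1}^{n} \exp(u_j)} \qquad (4)$$
where the weights $W_h$ and $W_q$ are applied to the LSTM outputs, while the weights $W_s$ and $W_o$ are applied to the positional encoding embeddings; all of them, together with $v$, are learned parameters. Afterwards, $a_i$ determines how much each word contributes to the final sentence representation $z$:
$$z = \sum_{i=1}^{n} a_i h_i \qquad (5)$$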
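The attention computation of this layer can be sketched in plain Python as below. The sketch assumes the additive (tanh) scoring form of [Zhang et al.2017]; the parameter names `W_h`, `W_q`, `W_s`, `W_o` and `v`, and the helper `matvec`, are naming choices of this example.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def position_aware_attention(hidden, summary, subj_pos, obj_pos,
                             W_h, W_q, W_s, W_o, v):
    """Summarize hidden states with position-aware attention.

    hidden:   n hidden vectors, one per token
    summary:  summary vector of the whole sentence
    subj_pos: n position embeddings relative to the subject
    obj_pos:  n position embeddings relative to the object
    Returns the sentence representation and the attention weights.
    """
    scores = []
    for h_i, ps_i, po_i in zip(hidden, subj_pos, obj_pos):
        parts = [matvec(W_h, h_i), matvec(W_q, summary),
                 matvec(W_s, ps_i), matvec(W_o, po_i)]
        pre_activation = [sum(column) for column in zip(*parts)]
        scores.append(sum(v_k * math.tanh(x) for v_k, x in zip(v, pre_activation)))
    highest = max(scores)  # numerically stable softmax over the sentence
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden[0])
    sentence = [sum(a * h[j] for a, h in zip(weights, hidden)) for j in range(dim)]
    return sentence, weights
```

With all weight matrices set to zero every token receives the same attention weight, and the sentence representation is the mean of the hidden states.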
3 Proposed Approach
The approach by [Zhang et al.2017] uses attention to summarize the encoded input instance. There, the input is encoded using an LSTM, resulting in a hidden vector at each position in the sentence. “Traditional attention” is used, meaning that there is one sequence of weights, used for a weighted average of the hidden vectors.
We propose two main changes to this approach:
We replace the LSTM by a self-attention encoder that computes hidden vectors considering all pairwise interactions in the sentence (rather than sequential recurrences). Following equations 3, 4 and 5 from Subsection 2.3, instead of calculating the hidden states $h_i$ and the summary vector $q$ using an LSTM, we extract them using the self-attention encoder. The vector of values $h$ is a direct output of the encoder, while the summary $q$ is extracted by a one-dimensional max pooling layer applied to the output vector.
We augment the self-attention mechanism with relative position encodings, which facilitate taking into account different effects that are dependent on the relative position of two tokens w.r.t. each other.
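For the first change, extracting the summary vector by max pooling over the encoder outputs can be sketched as follows. This is a minimal illustration that assumes the encoder returns one vector per token; the function name `max_pool_summary` is hypothetical.

```python
def max_pool_summary(encoder_outputs):
    """Max pooling over the sequence dimension.

    encoder_outputs: list of n vectors (one per token), all of equal size.
    Returns one vector holding, per dimension, the maximum over all tokens.
    """
    return [max(column) for column in zip(*encoder_outputs)]


summary = max_pool_summary([[1, 5], [3, 2], [0, 4]])
```

Here `summary` is `[3, 5]`: the first dimension is taken from the second token and the second dimension from the first token.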
3.1 Changes to Self-attention Encoder
This subsection describes what aspects of the self-attention encoder we have changed, namely, a different training strategy, the structural changes and a different approach to positional encodings.
3.1.1 Changes to Positional Encodings
The self-attention layer proposed by [Vaswani et al.2017] does not directly model the sequential ordering of positions in the input sequence; rather, this ordering is modeled indirectly, using absolute positional encodings that represent each position with sine and cosine functions of different wavelengths. Assuming that words in a text interact according to their relative positions (the negation “not” negates a verb in its vicinity to the right) rather than according to their absolute positions (the negation “not” negates a verb at position 12), modeling absolute positional information burdens the model with the additional task of figuring out relative interactions from the absolute encodings.
Research by [Shaw et al.2018] shows in the context of machine translation that using relative positional encodings can improve model performance. Here, we describe our approach to making positional encodings relative, and its application to relation classification.
Recall from Section 2.1 that one self-attention head computes a representation $o_i$ for a position $i$ from the weighted (and linearly transformed) input representations at all positions:
$$o_i = \mathrm{softmax}(e_i)\, V$$
where $V$ is a matrix of the (linearly transformed) input vectors, and $e_i$ contains the unnormalized weights for all positions. In the original self-attention model, $e_i$ simply contains the pairwise interactions of the input representations:
$$e_i = \frac{q_i K^\top}{\sqrt{d_k}}$$
where $K$ is a matrix of the (transformed) input vectors, $q_i$ is the (transformed) input representation at position $i$ (for which one wants to compute $o_i$), and $d_k$ is the dimensionality of the transformed vectors. For relative position weights with respect to position $i$, we compute a second score $f_i$, that interacts with relative position embeddings $m_j$, stacked to form the matrix $M_i$:
$$M_i = [m_{1-i};\ \dots;\ m_{-1};\ m_0;\ m_1;\ \dots;\ m_{n-i}]$$
where $n$ is the length of the input sequence and the vectors $m_j$ are parameters of the model (different $m_j$ are learned independently for each attention head). The matrix $M_i$ arranges the relative position vectors exactly such that $m_0$ is at position $i$, and all other $m_j$ are ordered relative to that position. A query vector $r_i$ is computed analogously to $q_i$ from the input $x_i$ at $i$: $r_i = x_i W^R$. The position score results from the interaction of $r_i$ with the relative position vectors in $M_i$:
$$f_i = \frac{r_i M_i^\top}{\sqrt{d_k}}$$
Our final model uses both the pairwise interaction scores and the relative position scores by summing them together before normalization:
$$o_i = \mathrm{softmax}(e_i + f_i)\, V$$
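The combination of content scores and relative position scores can be sketched as below. This is a minimal illustration under stated assumptions: positions are 0-based, the relative position vectors are stored in a dictionary keyed by offset, and both scores are divided by the square root of the vector dimension, which is carried over from scaled dot-product attention rather than taken from the description above. All function and argument names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def relative_attention_weights(i, queries, keys, rel_queries, rel_embeddings):
    """Attention weights for position i with relative position scores.

    queries, keys:  n transformed input vectors each
    rel_queries:    n position query vectors (one per token)
    rel_embeddings: dict mapping a relative offset j - i to its vector
    """
    n = len(keys)
    scale = math.sqrt(len(queries[i]))  # assumed scaling factor
    # Arrange the relative vectors so that offset 0 sits at position i.
    relative_matrix = [rel_embeddings[j - i] for j in range(n)]
    content_scores = [dot(queries[i], key) / scale for key in keys]
    position_scores = [dot(rel_queries[i], m) / scale for m in relative_matrix]
    # Sum both scores before the softmax normalization.
    combined = [e + f for e, f in zip(content_scores, position_scores)]
    highest = max(combined)
    exps = [math.exp(s - highest) for s in combined]
    total = sum(exps)
    return [x / total for x in exps]
```

Because the position scores depend only on the offset between two tokens, the same learned vector is reused for every pair of tokens that are equally far apart.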
3.1.2 Low-level Design Choices and Training Setting
Self-attention Encoder Layer. We adopt the changes proposed by [Roth et al.2018], which we described in Subsection 2.2; namely, we use batch normalization and only one residual connection. Furthermore, instead of initializing weights using Xavier initialization [Glorot and Bengio2010], we use Kaiming weight initialization [He et al.2015]. Also, instead of using ReLU [Nair and Hinton2010] as the activation function, we use RReLU [Xu et al.2015a].
Training. In the implementation by [Vaswani et al.2017], the Adam optimizer [Kingma and Ba2014] with learning rate warm-up is used. In our approach we follow the learning strategy proposed by [Zhang et al.2017] with the following hyper-parameters: we use Stochastic Gradient Descent with a learning rate of 0.1; after epoch 15, the learning rate is decreased with a decay rate of 0.9 and a patience of one epoch if the F1-score on the development set does not increase. All model variations are trained for 60 epochs.
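The learning rate schedule can be sketched as follows. The sketch assumes that "patience of one epoch" means comparing the development score of each epoch with that of the immediately preceding epoch; comparing against the best score so far would be an equally plausible reading. The function name `learning_rates` is hypothetical.

```python
def learning_rates(dev_scores, initial_lr=0.1, decay=0.9, start_epoch=15):
    """Learning rate used in each epoch, given the dev score after each epoch.

    After `start_epoch`, the rate is multiplied by `decay` whenever the
    development score did not increase over the previous epoch.
    """
    lr = initial_lr
    previous = None
    rates = []
    for epoch, score in enumerate(dev_scores, start=1):
        rates.append(lr)  # rate in effect during this epoch
        if epoch > start_epoch and previous is not None and score <= previous:
            lr *= decay
        previous = score
    return rates


# A development score that never improves over 17 epochs.
rates = learning_rates([0.5] * 17)
```

In this example the rate stays at 0.1 through epoch 16 and is first reduced for epoch 17, since the first check happens after epoch 16.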
3.2 Changes to the Position-aware Attention Layer
The attention-based position-aware relation classification layer encodes the relative positions w.r.t. the object and subject. We make it easier for the model to capture this kind of information by binning positions that are far away from the subject or object: the further away a word is from the subject or the object, the larger the index of the bin into which it falls. For instance, if the sentence is 10 words long and the subject is at index 1, a regular positional vector assigns each word its exact offset from the subject, whereas after introducing the relative position bins, distant words that previously had distinct offsets share the same bin index.
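The binning idea can be illustrated with the sketch below. The bin boundaries used here are an assumption made purely for illustration (exact offsets up to a distance of 2, then bins that double in width); the boundaries of the actual model are not specified above. The function name `bin_position` is hypothetical.

```python
def bin_position(offset):
    """Map a signed relative offset to a signed bin index.

    Hypothetical scheme: distances 0..2 keep their own bin, larger
    distances fall into bins whose width doubles (3, 4-7, 8-15, ...).
    """
    distance = abs(offset)
    if distance <= 2:
        binned = distance
    else:
        # bit_length() - 1 equals floor(log2(distance)) for positive integers
        binned = distance.bit_length() + 1
    return -binned if offset < 0 else binned


# Offsets for a 10-word sentence whose subject is at index 1.
regular = list(range(-1, 9))
binned = [bin_position(offset) for offset in regular]
```

Under this hypothetical scheme the regular vector `[-1, 0, 1, 2, 3, 4, 5, 6, 7, 8]` becomes `[-1, 0, 1, 2, 3, 4, 4, 4, 4, 5]`, so fewer distinct position embeddings have to be learned for distant words.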
4.1 Experimental Setup
In addition to introducing various structural changes to self-attention and position-aware attention, we also use a different set of hyper-parameters than those reported by [Vaswani et al.2017] and [Zhang et al.2017].
Instead of training word embeddings with a dimension of 512 as in the original implementation, we use pre-trained GloVe word embeddings [Pennington et al.2014] with an embedding size of 300. Additionally, following the implementation of [Zhang et al.2017], we append an embedding vector of size 30 for the named entity tags and an embedding vector of the same size for the part-of-speech tags, amounting to a final embedding vector size of 360. Moreover, we see an improvement in performance when adding object position embeddings to the word embeddings, which is done before the relative positional embeddings discussed in Subsection 3.1.1 are applied in the self-attention encoder layer.
In the original self-attention encoder the implementation of the position-wise fully connected feed-forward layer uses the hidden size that is double the word embedding size. In our experiments, we see no direct improvement in either doubling or increasing the hidden size even more. However, lowering the hidden size contributes to a slightly better performance than when doubling it. In our implementation, the hidden size is half the size of the embedding vector, namely 130.
In the self-attention encoder, instead of using a stack of six identical encoder layers, we use only one layer. Similarly to [Roth et al.2018], where only two layers are used, we see no performance gain when using more than one layer. In fact, in the case of slot filling, we observe a decrease in performance when using more than one encoder layer.
Additionally, using 3 heads in the Multi-Head Attention instead of 8 yields the best performance. Using more than 3 heads gradually degrades performance with each additional head.
We change our dropout usage compared to the one used by [Vaswani et al.2017], where a dropout of 0.1 is used throughout the whole model. In our implementation, we use a dropout of 0.4, apart from the Scaled Dot-Product Attention part of the self-attention encoder, where we apply a dropout of 0.1.
As described before, we train the model using Stochastic Gradient Descent with a learning rate of 0.1 and decay it using decay rate of 0.9 and epoch patience of one after epoch 15 if the performance on the development set does not improve. All models are trained for 60 epochs, and use a mini-batch size of 50.
4.2 TACRED Evaluation
The TACRED dataset [Zhang et al.2017] used to evaluate the model consists of 106,264 hand-annotated sentences denoting a query, an object, and the relation between them. In addition, the dataset already includes part-of-speech tags as well as named entity tags for all words. The sentences that serve as samples are very long compared to the ones available in similar datasets, for instance Semeval-2010 Task 8 [Hendrickx et al.2010], with an average sentence length of 36.2 words.
Furthermore, there often are multiple objects and relation types identified for each query within one sentence. Each query-argument relation example, however, is saved as a separate sentence sample.
Moreover, 79.5% of all samples in the dataset are query-argument pairs that do not have any relation between them and are labeled with the no_relation relation type. Overall, the dataset includes samples for 42 relation classes, out of which 25 are relations of type person:x (e.g., person:date_of_birth), 16 of type organization:x (e.g., organization:headquarters), and one is the no_relation class.
The dataset is already pre-partitioned into train (68124 samples), development (22631 samples), and test (15509 samples) sets. The dataset also comes with an evaluation script, which we use to run the subsequent evaluation.
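The kind of scores reported below can be sketched as follows. This is not the official evaluation script; it is a minimal illustration that assumes micro-averaged precision, recall and F1 with no_relation treated as the negative class, and it returns 0.0 whenever a denominator is zero, which is a choice made for this example. The function name `micro_scores` is hypothetical.

```python
def micro_scores(gold, predicted, negative="no_relation"):
    """Micro-averaged precision, recall and F1 over all positive relations."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p and g != negative)
    guessed = sum(1 for p in predicted if p != negative)
    actual = sum(1 for g in gold if g != negative)
    precision = correct / guessed if guessed else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = ["per:title", "no_relation", "org:founded", "per:title"]
predicted = ["per:title", "per:title", "no_relation", "org:founded"]
precision, recall, f1 = micro_scores(gold, predicted)
```

In this toy example one of three predicted relations and one of three gold relations is correct, so precision, recall and F1 are all one third. Predicting no_relation for a gold no_relation sample counts towards neither score.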
Table 1 shows the LSTM baseline results and the best model results reported by [Zhang et al.2017], as well as our best model results for comparison. Our model exhibits better performance overall, with a 1.4% higher F1-score than the state-of-the-art performance reported by [Zhang et al.2017]. While our model achieves lower precision, the recall is considerably higher, with a 4.1% difference.
In addition to testing the single model results, we also follow the ensembling strategy applied by [Zhang et al.2017], where five models are trained with different random seeds and a relation class is then selected for each sample by ensemble majority vote. The comparison of the results is shown in Table 2. Our ensembled model reaches a slightly higher F1-score than that of [Zhang et al.2017], namely 67.3%. However, there are significant differences regarding precision and recall. Their ensembled model achieves a relatively high precision of 70.1%, while our model reaches a high recall of 69.7%.
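The majority vote over several models can be sketched as below. How ties are resolved is not stated above; this sketch breaks ties in favor of the model listed first, which is an assumption. The function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-sample predictions of several models by majority vote.

    predictions_per_model: one list of predicted labels per model,
    all of the same length and in the same sample order.
    """
    ensembled = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        best = max(counts.values())
        # Assumed tie-breaking: the earliest model's label among the winners.
        ensembled.append(next(label for label in votes if counts[label] == best))
    return ensembled


ensemble = majority_vote([
    ["per:title", "no_relation"],
    ["per:title", "org:founded"],
    ["no_relation", "org:founded"],
])
```

Here `ensemble` is `["per:title", "org:founded"]`, since each of these labels receives two of the three votes for its sample.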
4.3 Model Variations
|Model variation|P|R|F1|
|---|---|---|---|
|Default residual conn.|61.7|69.7|65.4|
|ReLU instead of RReLU|64.5|68.0|66.2|
|LSTM with self-attention|65.2|62.7|64.0|
|Absolute pos. encodings|65.9|66.7|66.3|
|Kaiming instead of Xavier|64.3|68.8|66.5|
In addition to the final model described in Subsection 4.1, we try several variations and modifications, the results of which we report in this subsection. Table 3 shows the results for the following variations:
Lemmas instead of raw words: instead of using raw words we extract their lemmatized representations using the spaCy NLP toolkit (https://github.com/explosion/spaCy). Using lemmas yields a small increase in precision but a lower recall. Overall, this approach achieves an F1-score of 65.6%.
Original residual connection in self-attention: the model uses the default residual connections as described by [Vaswani et al.2017], namely, one residual connection is passed from before the multi-head attention into the normalization layer after it, and the second one goes from before the feed-forward part to the next normalization layer. In this case, we see an overall high recall score of 69.7% with a relatively low precision of 61.7%.
Layer normalization instead of batch normalization: In the original self-attention implementation by [Vaswani et al.2017], layer normalization [Ba et al.2016] is used. Here we show how the model’s performance changes when using it instead of batch normalization [Ioffe and Szegedy2015]. By using layer normalization we can achieve a relatively high recall of 73.1%, although the precision of 53.6% is one of the lowest throughout all of our model variations, with the exception of a variation using the self-attention encoder without the position-aware layer.
Combining self-attention encoder with position-aware attention and LSTM: In this model variation we use the LSTM hidden states to compute the $h_i$ in equation 5 from Subsection 2.3, while using the self-attention encoder for the calculation of $a_i$, as well as the underlying $h_i$ and $q$ in equations 3 and 4. The final result does not yield any significant performance increase compared to other model variants.
Self-attention encoder without the position-aware attention layer: We also test our model performance by only using the self-attention encoder without the position-aware attention layer. This is a particularly interesting experiment, since the model reaches the highest recall value of 85.4% throughout all of our experiments, although at the same time achieving only 26.7% precision.
Self-attention encoder without the relative positional encodings: Using the absolute positional encodings from the original self-attention encoder implementation also yields relatively good results compared to all of the other model variations. Overall, however, this approach shows a 0.2% lower F1-score than when using the relative positional encodings.
Using Kaiming weight initialization instead of Xavier: The original self-attention implementation uses the Xavier [Glorot and Bengio2010] weight initialization approach to initialize the weights for the query, key and value matrices. In this comparison run, using Xavier over Kaiming weight initialization [He et al.2015] exhibits the same F1-score but increases the precision a bit. Ultimately, there is a very small difference between these initialization techniques, although each has a slightly different effect on precision and recall.
5 Related Work
While self-attention was originally used for the task of Neural Machine Translation, it has recently been applied to other NLP tasks unrelated to NMT.
[Kitaev and Klein2018] use the self-attention encoder instead of an LSTM to improve a discriminative constituency parser and achieve state-of-the-art performance with their approach. [Liu et al.2018] successfully use the self-attention decoder for the task of Neural Abstractive Summarization. There is also ongoing research by OpenAI (https://blog.openai.com/language-unsupervised/) on using self-attention to pre-train a task-agnostic language model.
Nevertheless, to the best of our knowledge, self-attention was not previously applied to the task of relation extraction. Apart from the position-aware attention for LSTMs by [Zhang et al.2017], various other approaches exist. [Angeli et al.2014] use a pattern-based extractor and a supervised logistic regression classifier for relation extraction. [Nguyen and Grishman2015] as well as [Adel et al.2016] use Convolutional Neural Networks. [Xu et al.2015b] use a modified LSTM architecture called SDP-LSTM.
In this work, we show that the self-attention architecture can be effectively applied to relation classification, resulting in a model that is purely based on attention mechanisms and does not depend on other encoding mechanisms such as LSTMs. In our experiments, using the self-attention encoder and combining it with a position-aware attention layer achieves better results on the TACRED dataset than previously reported by [Zhang et al.2017].
Additionally, we examine several changes to both of the approaches to make them more effective on the task of relation classification. The main change to the self-attention is that instead of using absolute positional encodings we successfully use relative positional encodings that increase the final performance. We also modify how the relative encodings in the position-aware layer are represented by grouping word positions into bigger bins the further away they are from the subject or the object.
By trying out various model variations, we can see that using self-attention encoder alone leads to a high recall value but a very low precision. As a result, using the position-aware layer with the self-attention encoder helps achieve stable results with precision and recall being very close to each other.
As future work, we propose to investigate further variations of the self-attention encoder, and to do more research on why using multiple encoding layers and a higher number of heads does not improve the performance of the model. Moreover, since using only the self-attention encoder yields a relatively high recall value of 85.4%, it is worth exploring other approaches to improving precision without compromising the high recall in this model variation.
- [Adel et al.2016] Heike Adel, Benjamin Roth, and Hinrich Schütze. 2016. Comparing convolutional neural networks to traditional models for slot filling. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12 - June 17, 2016.
- [Angeli et al.2014] Gabor Angeli, Julie Tibshirani, Jean Y. Wu, and Christopher D. Manning. 2014. Combining distant and partial supervision for relation extraction. In Proceedings of EMNLP, pages 1556–1567.
- [Ba et al.2016] Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.
- [Glorot and Bengio2010] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May. PMLR.
- [He et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pages 1026–1034, Washington, DC, USA. IEEE Computer Society.
- [Hendrickx et al.2010] Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38, Uppsala, Sweden, July. Association for Computational Linguistics.
- [Ioffe and Szegedy2015] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul. PMLR.
- [Kingma and Ba2014] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
- [Kitaev and Klein2018] Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, July. Association for Computational Linguistics.
- [Liu et al.2018] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations.
- [Nair and Hinton2010] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, pages 807–814, USA. Omnipress.
- [Nguyen and Grishman2015] Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In VS@HLT-NAACL, pages 39–48. The Association for Computational Linguistics.
- [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- [Roth et al.2018] Benjamin Roth, Costanza Conforti, Nina Pörner, Sanjeev Karn, and Hinrich Schütze. 2018. Neural architectures for open-type relation argument extraction. CoRR, abs/1803.01707.
- [Shaw et al.2018] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. CoRR, abs/1803.02155.
- [Shen et al.2018] Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018. Disan: Directional self-attention network for rnn/cnn-free language understanding. In AAAI Conference on Artificial Intelligence.
- [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- [Xu et al.2015a] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. 2015a. Empirical evaluation of rectified activations in convolutional network. CoRR, abs/1505.00853.
- [Xu et al.2015b] Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015b. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of Conference on Empirical Methods in Natural Language Processing.
- [Zhang et al.2017] Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45. Association for Computational Linguistics.