Deep Semantic Role Labeling with Self-Attention
Semantic Role Labeling (SRL) is believed to be a crucial step towards natural language understanding and has been widely studied. In recent years, end-to-end SRL with recurrent neural networks (RNNs) has gained increasing attention. However, it remains a major challenge for RNNs to handle structural information and long-range dependencies. In this paper, we present a simple and effective architecture for SRL which aims to address these problems. Our model is based on self-attention, which can directly capture the relationship between two tokens regardless of their distance. Our single model achieves F1 = 83.4 on the CoNLL-2005 shared task dataset and F1 = 82.7 on the CoNLL-2012 shared task dataset, outperforming the previous state-of-the-art results by 1.8 and 1.0 F1 respectively. In addition, our model is computationally efficient, with a parsing speed of 50K tokens per second on a single Titan X GPU.
Semantic Role Labeling is a shallow semantic parsing task whose goal is to determine essentially “who did what to whom”, “when” and “where”. Semantic roles indicate the basic event properties and relations among relevant entities in the sentence and provide an intermediate level of semantic representation, thus benefiting many NLP applications, such as Information Extraction [Bastianelli et al. 2013], Question Answering [Surdeanu et al. 2003; Moschitti, Morarescu, and Harabagiu 2003; Dan and Lapata 2007], Machine Translation [Knight and Luk 1994; Ueffing, Haffari, and Sarkar 2007; Wu and Fung 2009] and Multi-document Abstractive Summarization [Genest and Lapalme 2011].
Semantic roles are closely related to syntax. Therefore, traditional SRL approaches rely heavily on the syntactic structure of a sentence, which brings intrinsic complexity and restricts these systems to specific domains. Recently, end-to-end models for SRL without syntactic inputs have achieved promising results on this task [Zhou and Xu 2015; Marcheggiani, Frolov, and Titov 2017; He et al. 2017]. As the pioneering work, Zhou and Xu [2015] introduced a stacked long short-term memory network (LSTM) and achieved state-of-the-art results. He et al. [2017] reported further improvements by using deep highway bidirectional LSTMs with constrained decoding. These successes of end-to-end models reveal the potential of LSTMs for handling the underlying syntactic structure of sentences.
Despite recent successes, these RNN-based models have limitations. RNNs treat each sentence as a sequence of words and recursively compose each word with its previous hidden state. The recurrent connections make RNNs applicable to sequential prediction tasks of arbitrary length; however, several challenges remain in practice. The first is the memory compression problem [Cheng, Dong, and Lapata 2016]. As the entire history is encoded into a single fixed-size vector, the model requires larger memory capacity to store information for longer sentences. This unbalanced way of dealing with sequential information leads the network to perform poorly on long sentences while wasting memory on shorter ones. The second concerns the inherent structure of sentences. RNNs lack a way to handle the tree structure of the inputs. Because the inputs are processed sequentially, the network remains deep in time, and the number of nonlinearities depends on the number of time steps.
To address these problems, we present a deep attentional neural network (DeepAtt) for the task of SRL. (Our source code is available at https://github.com/XMUNLP/Tagger.) Our models rely on the self-attention mechanism, which directly draws the global dependencies of the inputs. In contrast to RNNs, a major advantage of self-attention is that it creates direct connections between two arbitrary tokens in a sentence. Therefore, distant elements can interact with each other through shorter paths, which allows unimpeded information flow through the network. Self-attention also provides a more flexible way to select, represent and synthesize the information of the inputs and is complementary to RNN-based models. Along with self-attention, DeepAtt comes in three variants, which use recurrent (RNN), convolutional (CNN) and feed-forward (FFN) neural networks to further enhance the representations.
Although DeepAtt is fairly simple, it gives remarkable empirical results. Our single model outperforms the previous state-of-the-art systems on the CoNLL-2005 shared task dataset and the CoNLL-2012 shared task dataset by 1.8 and 1.0 F1 respectively. It is also worth mentioning that on the out-of-domain dataset, we achieve an improvement of 2.0 F1 over the previous end-to-end approach [He et al. 2017]. The feed-forward variant of DeepAtt allows significantly more parallelization, and its parsing speed is 50K tokens per second on a single Titan X GPU.
Given a sentence, the goal of SRL is to identify and classify the arguments of each target verb into semantic roles. For example, for the sentence “Marry borrowed a book from John last week.” and the target verb borrowed, SRL yields the following outputs:
[ARG0 Marry] [V borrowed] [ARG1 a book] [ARG2 from John] [AM-TMP last week]
Here ARG0 represents the borrower, ARG1 represents the thing borrowed, ARG2 represents the entity borrowed from, AM-TMP is an adjunct indicating the timing of the action and V represents the verb.
Generally, semantic role labeling consists of two steps: identifying and classifying arguments. The former step decides, for a given predicate, whether each candidate is a semantic argument or a non-argument, while the latter assigns a specific semantic role to each identified argument. It is also common to prune obvious non-candidates before the first step and to apply a post-processing procedure to fix inconsistent predictions after the second step. Finally, a dynamic programming algorithm is often applied at the inference stage to find the globally optimal solution for this typical sequence labeling problem.
In this paper, we treat SRL as a BIO tagging problem. Our approach is extremely simple. As illustrated in Figure 1, the original utterances and the corresponding predicate masks are first projected into real-valued vectors, namely embeddings, which are fed to the next layer. After that, we design a deep attentional neural network which takes the embeddings as inputs to capture the nested structures of the sentence and the latent dependency relationships among the labels. At the inference stage, only the topmost outputs of the attention sub-layer are passed to a logistic regression layer to make the final decision. (In case of BIO violations, we simply treat the argument of the B tag as the argument of the whole span.)
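To make the BIO formulation concrete, here is a minimal sketch that converts labelled argument spans into per-token BIO tags and back. The function names, the `(start, end, label)` span format and the handling of an `I-` tag with no open span are illustrative assumptions, not taken from the authors' code. The decoder implements one plausible reading of the BIO-violation rule: an `I-` tag is absorbed into the span opened by the preceding `B-` tag and the span keeps the `B-` label.

```python
def spans_to_bio(num_tokens, spans):
    """Encode (start, end_inclusive, label) spans as one BIO tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + label
    return tags


def bio_to_spans(tags):
    """Decode BIO tags into spans; the label of the B tag wins on violations."""
    spans = []
    current = None  # [start, end, label] of the span that is currently open
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and current is not None:
            current[1] = i  # extend the open span, ignoring the I tag's own label
            continue
        if current is not None:
            spans.append(tuple(current))
            current = None
        if tag != "O":
            # A B tag opens a span. An I tag with no open span is also treated
            # as opening one (an assumption; the text does not cover this case).
            current = [i, i, tag[2:]]
    if current is not None:
        spans.append(tuple(current))
    return spans


tokens = "Marry borrowed a book from John last week".split()
gold = [(0, 0, "ARG0"), (1, 1, "V"), (2, 3, "ARG1"), (4, 5, "ARG2"), (6, 7, "AM-TMP")]
tags = spans_to_bio(len(tokens), gold)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Decoding the tags produced for the example sentence recovers the original five spans, and a sequence such as `B-ARG1 I-ARG2 O` decodes to a single two-token ARG1 span.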
In this section, we describe DeepAtt in detail. The main component of our deep network consists of $N$ identical layers. Each layer contains a nonlinear sub-layer followed by an attentional sub-layer. The topmost layer is the softmax classification layer.
Self-attention, or intra-attention, is a special case of the attention mechanism that requires only a single sequence to compute its representation. Self-attention has been successfully applied to many tasks, including reading comprehension, abstractive summarization, textual entailment, learning task-independent sentence representations, machine translation and language understanding [Cheng, Dong, and Lapata 2016; Parikh et al. 2016; Lin et al. 2017; Paulus, Xiong, and Socher 2017; Vaswani et al. 2017; Shen et al. 2017].
In this paper, we adopt the multi-head attention formulation of Vaswani et al. [2017]. Figure 2 depicts the computation graph of the multi-head attention mechanism. The center of the graph is the scaled dot-product attention, which is a variant of dot-product (multiplicative) attention [Luong, Pham, and Manning 2015]. Compared with the standard additive attention mechanism [Bahdanau, Cho, and Bengio 2014], which is implemented using a one-layer feed-forward neural network, dot-product attention relies on matrix multiplication, which allows faster computation. Given a matrix of $n$ query vectors $Q \in \mathbb{R}^{n \times d}$, keys $K \in \mathbb{R}^{n \times d}$ and values $V \in \mathbb{R}^{n \times d}$, the scaled dot-product attention computes the attention scores based on the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$$
where $d$ is the number of hidden units of our network.
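As an illustration of scaled dot-product attention in the formulation of Vaswani et al., softmax(QK^T / sqrt(d)) V, the following pure-Python sketch works on small nested lists. The helper names and the toy matrices are assumptions made for this example, and a practical implementation would use a tensor library instead.

```python
import math


def matmul(a, b):
    """Multiply an n x m matrix by an m x p matrix (both nested lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v, d):
    """Return (outputs, weights) for softmax(q k^T / sqrt(d)) v."""
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy inputs: two queries attending over three key/value pairs.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = scaled_dot_product_attention(q, k, v, d=2)
print(weights)
print(out)
```

Each row of `weights` is a probability distribution over all positions, so every output vector is a weighted average of the value vectors no matter how far apart the tokens are.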
The multi-head attention mechanism first maps the matrix of input vectors $X \in \mathbb{R}^{n \times d}$ to query, key and value matrices by using different linear projections. Then $h$ parallel heads are employed to focus on different parts of the channels of the value vectors. Formally, for the $i$-th head, we denote the learned linear maps by $W_i^Q \in \mathbb{R}^{d \times d/h}$, $W_i^K \in \mathbb{R}^{d \times d/h}$ and $W_i^V \in \mathbb{R}^{d \times d/h}$, which correspond to queries, keys and values respectively. Then the scaled dot-product attention is used to compute the relevance between queries and keys, and to output mixed representations. The mathematical formulation is shown below:
$$M_i = \mathrm{Attention}(X W_i^Q, X W_i^K, X W_i^V)$$
Finally, all the vectors produced by the parallel heads are concatenated to form a single vector. Again, a linear map is used to mix the channels from different heads:
$$M = \mathrm{Concat}(M_1, \ldots, M_h), \qquad Y = M W$$
where $M \in \mathbb{R}^{n \times d}$ and $W \in \mathbb{R}^{d \times d}$.
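The multi-head computation can be sketched as follows: project the inputs once per head, run scaled dot-product attention in each head, concatenate the head outputs and mix them with a final linear map. This is a hedged pure-Python illustration. The function names, the random initialization and the small sizes are assumptions, and the scores are scaled by the square root of `d` as the text states (Vaswani et al. scale by the per-head key size instead).

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, d):
    """Scaled dot-product attention, softmax(q k^T / sqrt(d)) v."""
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    """Hypothetical initializer used only to get example weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, w_q, w_k, w_v, w_o, d):
    """x: n x d inputs; w_q, w_k, w_v: one d x (d/h) map per head; w_o: d x d."""
    heads = [
        attention(matmul(x, wq_i), matmul(x, wk_i), matmul(x, wv_i), d)
        for wq_i, wk_i, wv_i in zip(w_q, w_k, w_v)
    ]
    # Concatenate the h head outputs of each position along the feature axis.
    concat = [[value for head in heads for value in head[t]] for t in range(len(x))]
    return matmul(concat, w_o)


rng = random.Random(0)
n, d, h = 5, 8, 2  # toy sizes, much smaller than the paper's settings
x = random_matrix(n, d, rng)
w_q = [random_matrix(d, d // h, rng) for _ in range(h)]
w_k = [random_matrix(d, d // h, rng) for _ in range(h)]
w_v = [random_matrix(d, d // h, rng) for _ in range(h)]
w_o = random_matrix(d, d, rng)
y = multi_head_attention(x, w_q, w_k, w_v, w_o, d)
print(len(y), len(y[0]))
```

Because every head works on a `d/h`-dimensional projection, the concatenated output has the same width as the input, which lets the layer be stacked with residual connections.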
The self-attention mechanism has many appealing aspects compared with RNNs or CNNs. Firstly, the distance between any input and output positions is 1, whereas in RNNs it can be as large as the sequence length $n$. Unlike CNNs, self-attention is not limited to fixed window sizes. Secondly, the attention mechanism uses a weighted sum to produce output vectors. As a result, gradients propagate much more easily than in RNNs or CNNs. Finally, dot-product attention is highly parallel. In contrast, RNNs are hard to parallelize owing to their recurrent computation.
The success of neural networks is rooted in their highly flexible nonlinear transformations. Since the attention mechanism uses a weighted sum to generate output vectors, its representational power is limited. To further increase the expressive power of our attentional network, we employ a nonlinear sub-layer to transform the inputs from the bottom layers. In this paper, we explore three kinds of nonlinear sub-layers, namely recurrent, convolutional and feed-forward sub-layers.
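We use bidirectional LSTMs to build our recurrent sub-layer. Given a sequence of input vectors $\{x_t\}$, two LSTMs process the inputs in opposite directions. To keep the input and output dimensions the same, we use the sum operation to combine the two representations:
$$\overrightarrow{h}_t = \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \mathrm{LSTM}(x_t, \overleftarrow{h}_{t+1}), \qquad y_t = \overrightarrow{h}_t + \overleftarrow{h}_t$$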
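For the convolutional sub-layer, we use the Gated Linear Unit (GLU) proposed by Dauphin et al. [2016]. Compared with the standard convolutional neural network, GLU is much easier to learn and achieves impressive results on both language modeling and machine translation tasks [Dauphin et al. 2016; Gehring et al. 2017]. Given two filters $W \in \mathbb{R}^{k \times d \times d}$ and $V \in \mathbb{R}^{k \times d \times d}$, the output activations of GLU are computed as follows:
$$\mathrm{GLU}(X) = (X * W) \odot \sigma(X * V)$$
where $*$ denotes convolution, $\odot$ element-wise multiplication and $\sigma$ the sigmoid function. The filter width $k$ is set to 3 in all our experiments.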
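A gated linear unit over a token sequence can be sketched as two width-3 convolutions along the time axis, one producing candidate activations and the other a sigmoid gate, multiplied element-wise. The zero padding that keeps the sequence length unchanged, the function names and the toy filters are assumptions for illustration and are not specified in the text.

```python
import math


def conv1d_same(x, w):
    """Convolve an n x d_in sequence with a k x d_in x d_out filter.

    Zero padding keeps the output length equal to n (an assumption).
    """
    n, d_in = len(x), len(x[0])
    k, d_out = len(w), len(w[0][0])
    half = k // 2
    out = []
    for t in range(n):
        row = [0.0] * d_out
        for j in range(k):
            s = t + j - half  # input position seen by filter tap j
            if 0 <= s < n:
                for a in range(d_in):
                    for b in range(d_out):
                        row[b] += x[s][a] * w[j][a][b]
        out.append(row)
    return out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def glu(x, w, v):
    """Gated linear unit: (x * w) multiplied element-wise by sigmoid(x * v)."""
    candidates = conv1d_same(x, w)
    gates = conv1d_same(x, v)
    return [
        [c * sigmoid(g) for c, g in zip(c_row, g_row)]
        for c_row, g_row in zip(candidates, gates)
    ]


zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [zeros, identity, zeros]  # width-3 filter that copies the centre token
v = [zeros, zeros, zeros]     # all-zero gate filter, so every gate is 0.5
print(glu(x, w, v))
```

With the identity filter and a zero gate filter, every gate equals sigmoid(0) = 0.5, so the output is half the input. This makes the gating behaviour easy to check by hand.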
In this work, we use the residual connections proposed by He et al. [2016] to ease the training of our deep attentional neural network. Specifically, the output $Y$ of each sub-layer is computed by the following equation:
$$Y = X + \mathrm{SubLayer}(X)$$
We then apply layer normalization [Ba, Kiros, and Hinton 2016] after the residual connection to stabilize the activations of the deep neural network.
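The residual connection followed by layer normalization can be sketched as a small wrapper around any sub-layer. In this illustration the learned gain and bias of layer normalization are omitted for brevity, and the function names, the `eps` value and the stand-in sub-layer are assumptions rather than details from the text.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no gain or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def residual_block(x, sub_layer):
    """Compute LayerNorm(x + sub_layer(x)) position by position."""
    y = sub_layer(x)
    return [
        layer_norm([a + b for a, b in zip(x_row, y_row)])
        for x_row, y_row in zip(x, y)
    ]


def double(x):
    """Stand-in for an attention or nonlinear sub-layer (hypothetical)."""
    return [[2.0 * v for v in row] for row in x]


x = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, -1.5]]
out = residual_block(x, double)
print(out)
```

Because the sub-layer output is added to its input, the sub-layer must preserve the feature dimension, which is consistent with the recurrent sub-layer summing its two directions instead of concatenating them.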
The attention mechanism itself cannot distinguish between different positions, so it is crucial to encode the position of each input word. There are various ways to encode positions, and the simplest one is to use an additional position embedding. In this work, we try the timing signal approach proposed by Vaswani et al. [2017], which is formulated as follows:
$$\mathrm{timing}(t, 2i) = \sin\left(t / 10000^{2i/d}\right), \qquad \mathrm{timing}(t, 2i+1) = \cos\left(t / 10000^{2i/d}\right)$$
The timing signals are simply added to the input embeddings. Unlike the position embedding approach, this approach does not introduce additional parameters.
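The sinusoidal timing signal of Vaswani et al. can be generated without any learned parameters, as the sketch below shows. Positions are counted from 0 and sine and cosine channels are interleaved. Both choices, along with the function names, are assumptions of this example, and implementations differ on them.

```python
import math


def timing_signal(length, d, base=10000.0):
    """Return a length x d table with sin on even and cos on odd channels."""
    signal = []
    for t in range(length):
        row = [0.0] * d
        for i in range(0, d, 2):
            angle = t / (base ** (i / d))  # i is the even channel index 2i'
            row[i] = math.sin(angle)
            if i + 1 < d:
                row[i + 1] = math.cos(angle)
        signal.append(row)
    return signal


def add_timing(embeddings):
    """Add the timing signal to a sequence of input embeddings."""
    signal = timing_signal(len(embeddings), len(embeddings[0]))
    return [[e + s for e, s in zip(e_row, s_row)] for e_row, s_row in zip(embeddings, signal)]


table = timing_signal(length=3, d=4)
for row in table:
    print([round(v, 4) for v in row])
```

Since the signal depends only on the position and the channel index, the same function covers sentences of any length.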
The first step of using neural networks to process symbolic data is to represent them by distributed vectors, also called embeddings [Bengio et al. 2003]. We take the original utterances and the corresponding predicate masks $m$ as the input features. $m_t$ is set to 1 if the corresponding word is a predicate, or 0 if not.
Formally, in the SRL task, we have a word vocabulary $\mathcal{V}$ and a mask vocabulary $\mathcal{C} = \{0, 1\}$. Given a word sequence $\{w_1, \ldots, w_T\}$ and a mask sequence $\{m_1, \ldots, m_T\}$, each word $w_t \in \mathcal{V}$ and its corresponding predicate mask $m_t \in \mathcal{C}$ are projected into real-valued vectors $e(w_t)$ and $e(m_t)$ through the corresponding lookup table layer, respectively. The two embeddings are then concatenated to form the output feature maps of the lookup table layers. Formally speaking, we have $x_t = [e(w_t), e(m_t)]$.
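The lookup-and-concatenate step can be sketched with two small embedding tables, one for words and one for the binary predicate mask. The table construction, the random initialization and the function names below are illustrative assumptions. The dimension of 100 per table follows the experimental settings reported later in the paper.

```python
import random


def build_table(symbols, dim, rng):
    """Hypothetical lookup table: one random dim-sized vector per symbol."""
    return {sym: [rng.gauss(0.0, 0.1) for _ in range(dim)] for sym in symbols}


def embed(words, predicate_index, word_table, mask_table):
    """Concatenate the word embedding and the predicate-mask embedding per token."""
    features = []
    for t, word in enumerate(words):
        mask = 1 if t == predicate_index else 0
        features.append(word_table[word] + mask_table[mask])  # list concatenation
    return features


rng = random.Random(0)
words = "Marry borrowed a book from John last week".split()
word_table = build_table(sorted(set(words)), 100, rng)
mask_table = build_table([0, 1], 100, rng)
features = embed(words, predicate_index=1, word_table=word_table, mask_table=mask_table)
print(len(features), len(features[0]))
```

A sentence with several predicates would be embedded once per predicate, with only the mask changing between passes.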
We then build our deep attentional neural network to learn the sequential and structural information of a given sentence based on the feature maps from the lookup table layer. Finally, we take the outputs of the topmost attention sub-layer as inputs to make the final predictions.
Since there are dependencies between semantic labels, most previous neural network models introduced a transition model for measuring the probability of jumping between the labels. Different from these works, we perform SRL as a typical classification problem. Latent dependency information is embedded in the topmost attention sub-layer learned by our deep models. This approach is simpler and easier to implement compared to previous works.
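Formally, given an input sequence $\mathbf{x} = \{x_1, \ldots, x_n\}$, the log-likelihood of the corresponding correct label sequence $\mathbf{y} = \{y_1, \ldots, y_n\}$ is
$$\log p(\mathbf{y} \mid \mathbf{x}; \theta) = \sum_{t=1}^{n} \log p(y_t \mid \mathbf{x}; \theta)$$
Our model predicts the corresponding label $y_t$ based on the representation $h_t$ produced by the topmost attention sub-layer of DeepAtt:
$$p(y_t \mid \mathbf{x}; \theta) = p(y_t \mid h_t; \theta) = \mathrm{softmax}(W_o h_t)^{\top} \delta_{y_t}$$
where $W_o$ is the softmax matrix and $\delta_{y_t}$ is the Kronecker delta with a dimension for each output symbol, so $\mathrm{softmax}(W_o h_t)^{\top} \delta_{y_t}$ is exactly the $y_t$-th element of the distribution defined by the softmax. Our training objective is to maximize the log probabilities of the correct output labels given the input sequence over the entire training set.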
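Because each label is predicted independently from its top-layer representation, the sequence log-likelihood is a sum of per-token log-softmax terms and decoding reduces to a per-token argmax. The sketch below illustrates this with a toy softmax matrix. All names and values are assumptions made for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax distribution."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


def label_log_probs(h_t, w_o):
    """Log-probabilities over labels for one position, from softmax(w_o h_t)."""
    logits = [sum(w * h for w, h in zip(row, h_t)) for row in w_o]
    return log_softmax(logits)


def sequence_log_likelihood(hidden, gold, w_o):
    """Sum over positions of the log-probability of the gold label."""
    return sum(label_log_probs(h_t, w_o)[y_t] for h_t, y_t in zip(hidden, gold))


def argmax_decode(hidden, w_o):
    """Pick the most probable label at every position independently."""
    decoded = []
    for h_t in hidden:
        scores = label_log_probs(h_t, w_o)
        decoded.append(max(range(len(scores)), key=scores.__getitem__))
    return decoded


w_o = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # 3 labels, 2 hidden units (toy)
hidden = [[2.0, 0.0], [0.0, 3.0]]
print(argmax_decode(hidden, w_o))
print(sequence_log_likelihood(hidden, [0, 1], w_o))
```

No transition scores between adjacent labels appear anywhere in this computation, which is what makes inference a simple classification step.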
We report our empirical studies of DeepAtt on the two commonly used datasets from the CoNLL-2005 shared task and the CoNLL-2012 shared task.
The CoNLL-2005 dataset takes sections 2-21 of the Wall Street Journal (WSJ) corpus as the training set and section 24 as the development set. The test set consists of section 23 of the WSJ corpus as well as 3 sections from the Brown corpus [Carreras and Màrquez 2005]. The CoNLL-2012 dataset is extracted from the OntoNotes v5.0 corpus. The description and the split into training, development and test sets can be found in Pradhan et al. [2013].
We initialize the weights of all sub-layers as random orthogonal matrices. For other parameters, we initialize them by sampling each element from a Gaussian distribution with mean and variance. The embedding layer can be initialized randomly or using pre-trained word embeddings. We will discuss the impact of pre-training in the analysis subsection. (To be strictly comparable to previous work, we use the same vocabularies and pre-trained embeddings as He et al. [2017].)
The settings of our models are described as follows. The dimension of word embeddings and predicate mask embeddings is set to 100 and the number of hidden layers is set to 10. We set the number of hidden units $d$ to 200. The number of heads $h$ is set to 8. We apply dropout [Srivastava et al. 2014] to prevent the networks from over-fitting. Dropout layers are added before residual connections with a keep probability of 0.8. Dropout is also applied before the attention softmax layer and the feed-forward ReLU hidden layer, with keep probabilities set to 0.9. We also employ the label smoothing technique [Szegedy et al. 2016] with a smoothing value of 0.1 during training.
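The two regularizers mentioned here can be sketched in a few lines. Label smoothing is shown in the form of Szegedy et al., where the one-hot target is mixed with a uniform distribution over all labels. Dropout is shown in its "inverted" form, which rescales the kept units at training time. The inverted variant, the function names and the toy sizes are assumptions, since the text gives only the keep probabilities and the smoothing value.

```python
import random


def smooth_labels(gold, num_labels, smoothing=0.1):
    """Mix a one-hot target with the uniform distribution over all labels."""
    target = [smoothing / num_labels] * num_labels
    target[gold] += 1.0 - smoothing
    return target


def cross_entropy(log_probs, target):
    """Cross-entropy between a target distribution and predicted log-probabilities."""
    return -sum(q * lp for q, lp in zip(target, log_probs))


def dropout(vec, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob, rescale the rest."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in vec]


target = smooth_labels(gold=2, num_labels=5, smoothing=0.1)
print(target)

rng = random.Random(0)
print(dropout([1.0] * 10, keep_prob=0.8, rng=rng))
```

With a smoothing value of 0.1 the gold label keeps most of the probability mass while every other label receives a small share, which discourages over-confident predictions.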
Parameter optimization is performed using stochastic gradient descent. We adopt Adadelta [Zeiler 2012] ($\epsilon = 10^{-6}$ and $\rho = 0.95$) as the optimizer. To avoid the exploding gradients problem, we clip the norm of gradients with a predefined threshold [Pascanu et al. 2013]. Each SGD step uses a mini-batch of approximately 4096 tokens for the CoNLL-2005 dataset and 8192 tokens for the CoNLL-2012 dataset. The learning rate is initialized to 1.0. After training for 400K steps, we halve the learning rate every 100K steps. We train all models for 600K steps. For DeepAtt with FFN sub-layers, the whole training stage takes about two days to finish on a single Titan X GPU, which is 2.5 times faster than the previous approach [He et al. 2017].
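The learning-rate schedule described above can be written as a small step function. The text does not say whether the first halving takes effect at step 400K or at step 500K. This sketch assumes the former, and the function and parameter names are illustrative.

```python
def learning_rate(step, base_lr=1.0, hold_steps=400_000, decay_every=100_000):
    """Constant rate for hold_steps, then halved once per decay_every steps.

    Assumption: the first halving takes effect at step == hold_steps.
    """
    if step < hold_steps:
        return base_lr
    halvings = (step - hold_steps) // decay_every + 1
    return base_lr * 0.5 ** halvings


for step in (0, 399_999, 400_000, 500_000, 599_999):
    print(step, learning_rate(step))
```

Under this reading a 600K-step run spends its last 200K steps at reduced learning rates: 0.5 from step 400K and 0.25 from step 500K.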
Table 1: Comparison with previous methods on the CoNLL-2005 dataset.
|Model||Dev P||Dev R||Dev F1||Dev Comp.||WSJ P||WSJ R||WSJ F1||WSJ Comp.||Brown P||Brown R||Brown F1||Brown Comp.||Combined F1|
|He et al. [2017] (Ensemble)||83.1||82.4||82.7||64.1||85.0||84.3||84.6||66.5||74.9||72.4||73.6||46.5||83.2|
|He et al. [2017] (Single)||81.6||81.6||81.6||62.3||83.1||83.0||83.1||64.3||72.8||71.4||72.1||44.8||81.6|
|Zhou and Xu [2015]||79.7||79.4||79.6||-||82.9||82.8||82.8||-||70.7||68.2||69.4||-||81.1|
|FitzGerald et al. [2015] (Struct., Ensemble)||81.2||76.7||78.9||55.1||82.5||78.2||80.3||57.3||74.5||70.0||72.2||41.3||-|
|Täckström et al. [2015] (Struct.)||81.2||76.2||78.6||54.4||82.3||77.6||79.9||56.0||74.3||68.6||71.3||39.8||-|
|Toutanova et al. [2008] (Ensemble)||-||-||78.6||58.7||81.9||78.8||80.3||60.1||-||-||68.8||40.8||-|
|Punyakanok et al. [2008] (Ensemble)||80.1||74.8||77.4||50.7||82.3||76.8||79.4||53.8||73.4||62.9||67.8||32.3||77.9|
|DeepAtt (FFN, Ensemble)||84.3||84.9||84.6||67.3||85.9||86.3||86.1||69.0||74.6||75.0||74.8||48.6||84.6|

Table 2: Comparison with previous methods on the CoNLL-2012 dataset.
|Model||Dev P||Dev R||Dev F1||Dev Comp.||Test P||Test R||Test F1||Test Comp.|
|He et al. [2017] (Ensemble)||83.5||83.2||83.4||67.5||83.5||83.3||83.4||68.5|
|He et al. [2017] (Single)||81.7||81.4||81.5||64.6||81.8||81.6||81.7||66.0|
|Zhou and Xu [2015]||-||-||81.1||-||-||-||81.3||-|
|FitzGerald et al. [2015] (Struct., Ensemble)||81.0||78.5||79.7||60.9||81.2||79.0||80.1||62.6|
|Täckström et al. [2015] (Struct., Ensemble)||80.5||77.8||79.1||60.1||80.6||78.2||79.4||61.8|
|Pradhan et al. [2013] (Revised)||-||-||-||-||78.5||76.6||77.5||55.8|
|DeepAtt (FFN, Ensemble)||83.6||84.7||84.1||68.7||83.3||84.5||83.9||69.3|
In Tables 1 and 2, we compare DeepAtt with previous approaches. On the CoNLL-2005 dataset, we evaluate single DeepAtt models with RNN, CNN and FFN nonlinear sub-layers. The FFN variant achieves an F1 score of 83.4, outperforming the previous best performance by 1.8 F1. Remarkably, we get 74.1 F1 on the out-of-domain dataset, which outperforms the previous state-of-the-art system by 2.0 F1. On the CoNLL-2012 dataset, the single model of the FFN variant also outperforms the previous state-of-the-art by 1.0 F1. When ensembling 5 models with FFN nonlinear sub-layers, our approach achieves an F1 score of 84.6 and 83.9 on the two datasets respectively, an absolute improvement of 1.4 and 0.5 over the previous state-of-the-art. These results are consistent with our intuition that the self-attention layers are helpful in capturing structural information and long-distance dependencies.
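As a quick sanity check on the reported scores, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall of the DeepAtt (FFN, Ensemble) row on CoNLL-2005. This is only an approximation of the official evaluation, which computes F1 from raw counts. The function name is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as percentages)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for DeepAtt (FFN, Ensemble) on CoNLL-2005.
rows = {
    "Development": (84.3, 84.9, 84.6),
    "WSJ Test": (85.9, 86.3, 86.1),
    "Brown Test": (74.6, 75.0, 74.8),
}
for name, (p, r, reported) in rows.items():
    print(name, round(f1_score(p, r), 1), reported)
```

The recomputed values agree with the reported F1 to one decimal place on all three evaluation sets.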
In this subsection, we discuss the main factors that influence our results. We analyze the experimental results on the development set of the CoNLL-2005 dataset.
Previous works [Zhou and Xu 2015; He et al. 2017] show that model depth is the key to the success of end-to-end SRL approaches. Our observations coincide with these works. Rows 1-5 of Table 3 show the effects of different numbers of layers. With 4 layers, DeepAtt achieves only 79.9 F1. Increasing depth consistently improves the performance on the development set, and our best model consists of 10 layers. With 12 layers, we observe a slight performance drop of 0.1 F1.
We also conduct experiments with different model widths. We increase the number of hidden units from 200 to 400 and to 600, as listed in rows 1, 6 and 7 of Table 3, and the corresponding hidden size of the FFN sub-layers is increased to 1600 and 2400 respectively. Increasing model width improves the F1 slightly, and the model with 600 hidden units achieves an F1 of 83.4. However, training and parsing are slower as a result of the larger parameter counts.
Previous works found that performance can be improved by pre-training the word embeddings on large unlabeled data [Collobert et al. 2011; Zhou and Xu 2015]. We use the GloVe [Pennington, Socher, and Manning 2014] embeddings pre-trained on Wikipedia and Gigaword. The embeddings are used to initialize our networks, but are not fixed during training. Rows 1 and 8 of Table 3 show the effects of additional pre-trained embeddings. When using pre-trained GloVe embeddings, the F1 score increases from 79.6 to 83.1.
From rows 1, 9 and 10 of Table 3 we can see that position encoding plays an important role in the success of DeepAtt. Without position encoding, DeepAtt with FFN sub-layers performs poorly on the CoNLL-2005 development set. Using the position embedding approach boosts the F1 score, and the timing approach is surprisingly effective, outperforming the position embedding approach.
DeepAtt requires nonlinear sub-layers to enhance its expressive power. Row 11 of Table 3 shows the performance of DeepAtt without nonlinear sub-layers. We can see that the 10-layer DeepAtt without nonlinear sub-layers only matches the 4-layer DeepAtt with FFN sub-layers, which indicates that the nonlinear sub-layers are essential components of our attentional networks.
Table 4 shows the effects of constrained decoding [He et al. 2017] on top of DeepAtt with FFN sub-layers. We observe a slight performance drop when using constrained decoding. Moreover, adding constrained decoding slows down decoding significantly. DeepAtt itself is powerful enough to capture the relationships among labels.
|Label||He et al. [2017]||DeepAtt|
|He et al. [2017]||91.87||87.10|
We list the detailed performance on frequent labels in Table 5. The results of the previous state-of-the-art [He et al. 2017] are also shown for comparison. Compared with He et al. [2017], our model shows improvement on all labels except AM-PNC, where He’s model performs better. Table 6 shows the results of identifying and classifying semantic roles. Our model improves on the previous state-of-the-art both in identifying correct spans and in correctly classifying them into semantic roles. However, the majority of improvements come from classifying semantic roles. This indicates that finding the right constituents remains a bottleneck of our model.
Table 7 shows a confusion matrix of our model for the most frequent labels. We only consider predicted arguments that match gold span boundaries. Compared with the previous work [He et al. 2017], our model still confuses ARG2 with AM-DIR, AM-LOC and AM-MNR, but to a lesser extent. This indicates that our model has some advantages on such difficult adjunct distinctions [Kingsbury, Palmer, and Marcus 2002].
Gildea and Jurafsky [2002] developed the first automatic semantic role labeling system based on FrameNet. Since then the task has received a tremendous amount of attention. The focus of traditional approaches is devising appropriate feature templates to describe the latent structure of utterances. Pradhan et al. [2005], Surdeanu et al. [2003] and Palmer, Gildea, and Xue [2010] explored syntactic features for capturing the overall sentence structure. Combining different syntactic parsers was also proposed to reduce prediction risk [Surdeanu et al. 2003; Koomen et al. 2005; Pradhan et al. 2013].
Beyond these traditional methods, Collobert et al. [2011] proposed a convolutional neural network for SRL to reduce feature engineering. The pioneering work on building an end-to-end system was proposed by Zhou and Xu [2015], who applied an 8-layer LSTM model which outperformed the previous state-of-the-art system. He et al. [2017] improved further with highway LSTMs and constrained decoding. They used simplified input and output layers compared with Zhou and Xu [2015]. Marcheggiani, Frolov, and Titov [2017] also proposed a bidirectional LSTM-based model. Without using any syntactic information, their approach achieved the state-of-the-art result on the CoNLL-2009 dataset.
Our method differs from them significantly. We choose self-attention as the key component in our architecture instead of LSTMs. Like He et al. [2017], our system takes the original utterances and predicate masks as inputs without context windows. At the inference stage, we apply argmax decoding on top of a simple logistic regression, while Zhou and Xu [2015] chose a CRF approach and He et al. [2017] chose constrained decoding. This approach is much simpler and faster than the previous approaches.
Self-attention has been successfully used in several tasks. Cheng, Dong, and Lapata [2016] used LSTMs and self-attention to facilitate the task of machine reading. Parikh et al. [2016] applied self-attention to the task of natural language inference. Lin et al. [2017] proposed self-attentive sentence embeddings and applied them to author profiling, sentiment analysis and textual entailment. Paulus, Xiong, and Socher [2017] combined reinforcement learning and self-attention to capture the long-distance dependencies inherent in abstractive summarization. Vaswani et al. [2017] applied self-attention to neural machine translation and achieved state-of-the-art results. Very recently, Shen et al. [2017] applied self-attention to language understanding tasks and achieved the state-of-the-art on various datasets. Our work follows this line and applies self-attention to learning long-distance dependencies. Our experiments also show the effectiveness of the self-attention mechanism on the sequence labeling task.
We proposed a deep attentional neural network for the task of semantic role labeling. We trained our SRL models with a depth of 10 and evaluated them on the CoNLL-2005 shared task dataset and the CoNLL-2012 shared task dataset. Our experimental results indicate that our models substantially improve SRL performance, leading to a new state-of-the-art.
This work was done during the first author’s internship at Tencent Technology. This work is supported by the Natural Science Foundation of China (Grant No. 61573294, 61303082, 61672440), the Ph.D. Programs Foundation of Ministry of Education of China (Grant No. 20130121110040), the Foundation of the State Language Commission of China (Grant No. WT135-10) and the Natural Science Foundation of Fujian Province (Grant No. 2016J05161). We also thank the anonymous reviewers for their valuable suggestions.
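[Bengio et al. 2003] Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3:1137–1155.
[Genest and Lapalme 2011] Genest, P.-E., and Lapalme, G. 2011. Framework for abstractive summarization using text-to-text generation. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, 64–73. Association for Computational Linguistics.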