ResCNN_RelationExtraction
Deep Residual Learning for Weakly-Supervised Relation Extraction: https://arxiv.org/abs/1707.08866
Deep residual learning (ResNet) is a new method for training very deep neural networks using identity mappings for shortcut connections. ResNet won the ImageNet ILSVRC 2015 classification task, and has achieved state-of-the-art performance in many computer vision tasks. However, the effect of residual learning on noisy natural language processing tasks is still not well understood. In this paper, we design a novel convolutional neural network (CNN) with residual learning, and investigate its impact on the task of distantly supervised noisy relation extraction. Contrary to the popular belief that ResNet only works well for very deep networks, we found that even with 9 layers of CNNs, using identity mapping could significantly improve the performance for distantly supervised relation extraction.
Relation extraction is the task of predicting attributes and relations for entities in a sentence (Zelenko et al., 2003; Bunescu and Mooney, 2005; GuoDong et al., 2005; Yu et al., 2017). For example, given the sentence "Barack Obama was born in Honolulu, Hawaii.", a relation classifier aims at predicting the relation "bornInCity". Relation extraction is the key component for building relation knowledge graphs, and it is of crucial significance to natural language processing applications such as structured search, sentiment analysis, question answering, and summarization.
A major issue for relation extraction is the lack of labeled training data. In recent years, distant supervision (Mintz et al., 2009; Hoffmann et al., 2011; Surdeanu et al., 2012) has emerged as the most popular method for relation extraction: it uses knowledge base facts to select a set of noisy instances from unlabeled data. Among all the machine learning approaches for distant supervision, the recently proposed Convolutional Neural Network (CNN) model (Zeng et al., 2014) achieved the state-of-the-art performance. Following this success, Zeng et al. (2015) proposed a piece-wise max-pooling strategy to improve the CNNs. Various attention strategies for CNNs (Lin et al., 2016; Shen and Huang, 2016) have also been proposed, obtaining impressive results. However, most of these neural relation extraction models are relatively shallow CNNs: typically only one convolutional layer and one fully connected layer are involved, and it was not clear whether deeper models could better distill signals from noisy inputs in this task.

In this paper, we investigate the effects of training deeper CNNs for distantly supervised relation extraction. More specifically, we design a convolutional neural network based on residual learning (He et al., 2016): we show how one can incorporate word embeddings and position embeddings into a deep residual network, while feeding identity feedback to convolutional layers for this noisy relation prediction task. Empirically, we evaluate on the NYT-Freebase dataset (Riedel et al., 2010), and demonstrate state-of-the-art performance using deep CNNs with identity mapping and shortcuts. In contrast to the popular belief in vision that deep residual networks only work for very deep CNNs, we show that even with moderately deep CNNs, there are substantial improvements over vanilla CNNs for relation extraction. Our contributions are three-fold:
We are the first to consider deeper convolutional neural networks for weakly-supervised relation extraction using residual learning;
We show that our deep residual network model outperforms CNNs by a large margin empirically, obtaining state-of-the-art performances;
Our identity mapping with shortcut feedback approach is easily applicable to any variant of CNNs for relation extraction.
In this section, we describe a novel deep residual learning architecture for distantly supervised relation extraction. Figure 1 depicts the architecture of our model.
Let $w_i$ be the $i$-th word in the sentence and $e_1$, $e_2$ be the two corresponding entities. Each word accesses two embedding look-up tables to get the word embedding $\mathbf{WF}_i$ and the position embedding $\mathbf{PF}_i$. Then, we concatenate the two embeddings and denote each word as a vector $\mathbf{x}_i = [\mathbf{WF}_i, \mathbf{PF}_i]$.

Each representation $\mathbf{x}_i$ corresponding to $w_i$ is a real-valued vector. All of the word vectors are encoded in an embedding matrix $\mathbf{V}_w \in \mathbb{R}^{d_w \times |V|}$, where $V$ is a fixed-sized vocabulary.
In relation classification, we focus on finding a relation for an entity pair. Following (Zeng et al., 2014), a position feature (PF) is the combination of the relative distances of the current word to the first entity $e_1$ and the second entity $e_2$. For instance, in the sentence "Steve_Jobs is the founder of Apple.", the relative distances from founder to $e_1$ (Steve_Jobs) and $e_2$ (Apple) are 3 and -2, respectively. We then transform the relative distances into real-valued vectors by looking up a randomly initialized position embedding matrix $\mathbf{V}_p \in \mathbb{R}^{d_p \times |P|}$, where $P$ is a fixed-sized set of distances. It should be noted that if a word is too far from the entities, it may not be related to the relation. Therefore, we clamp the relative distance to a maximum value and a minimum value.
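The relative-distance computation above can be sketched as follows (a minimal illustration; the helper name and clamping convention are ours, with the bounds from Table 1):

```python
def relative_positions(tokens, e1_idx, e2_idx, d_min=-30, d_max=30):
    # For each token, the relative distance to each entity, clamped to [d_min, d_max]
    clamp = lambda d: max(d_min, min(d_max, d))
    return [(clamp(i - e1_idx), clamp(i - e2_idx)) for i in range(len(tokens))]

tokens = "Steve_Jobs is the founder of Apple .".split()
pf = relative_positions(tokens, e1_idx=0, e2_idx=5)
# pf[3] is the distance pair for "founder": (3, -2)
```

Each pair in `pf` indexes the two position embedding look-up tables.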
In the example shown in Figure 1, it is assumed that $d_w$ is 4 and $d_p$ is 1. There are two position embeddings: one for $e_1$, the other for $e_2$. Finally, we concatenate the word embeddings and position embeddings of all words, and denote a sentence of length $n$ (padded where necessary) as a vector

$$\mathbf{x}_{1:n} = \mathbf{x}_1 \oplus \mathbf{x}_2 \oplus \cdots \oplus \mathbf{x}_n,$$

where $\oplus$ is the concatenation operator and $\mathbf{x}_i \in \mathbb{R}^d$ ($d = d_w + 2d_p$).
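The lookup-and-concatenate step can be sketched in NumPy as follows (a minimal illustration using the dimensions from Table 1; the matrix names and vocabulary sizes are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_p = 50, 5                          # word and position embedding sizes
vocab, n_dist = 1000, 61                  # 61 distinct distances in [-30, 30]
W_word = rng.normal(size=(vocab, d_w))
W_pos1 = rng.normal(size=(n_dist, d_p))   # distances to entity 1
W_pos2 = rng.normal(size=(n_dist, d_p))   # distances to entity 2

def embed(word_ids, d1, d2):
    # Shift distances from [-30, 30] to valid row indices [0, 60], then concatenate
    return np.concatenate(
        [W_word[word_ids], W_pos1[np.asarray(d1) + 30], W_pos2[np.asarray(d2) + 30]],
        axis=1,
    )

x = embed([4, 8, 15], [-1, 0, 1], [-3, -2, -1])  # a toy 3-word sentence
```

Each row of `x` has dimension $d = d_w + 2d_p = 60$.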
Let $\mathbf{x}_{i:i+j}$ refer to the concatenation of words $\mathbf{x}_i, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_{i+j}$. A convolution operation involves a filter $\mathbf{w} \in \mathbb{R}^{hd}$, which is applied to a window of $h$ words to produce a new feature. A feature $c_i$ is generated from a window of words $\mathbf{x}_{i:i+h-1}$ by

$$c_i = f(\mathbf{w} \cdot \mathbf{x}_{i:i+h-1} + b).$$

Here $b \in \mathbb{R}$ is a bias term and $f$ is a non-linear function. This filter is applied to each possible window of words, from $\mathbf{x}_{1:h}$ to $\mathbf{x}_{n-h+1:n}$, to produce a feature map $\mathbf{c} = [c_1, c_2, \ldots, c_{n-h+1}]$ with $\mathbf{c} \in \mathbb{R}^{n-h+1}$.
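The sliding-window convolution can be sketched as follows (our minimal NumPy version; we use tanh as the non-linearity $f$, which is an assumption):

```python
import numpy as np

def conv_feature_map(X, w, b, h):
    # X: (n, d) sentence matrix; w: filter of length h*d
    # Returns the feature map c of length n - h + 1
    n, d = X.shape
    return np.array([np.tanh(w @ X[i:i + h].reshape(-1) + b)
                     for i in range(n - h + 1)])

X = np.zeros((5, 4))                       # n = 5 words, d = 4
c = conv_feature_map(X, np.ones(12), 0.0, h=3)
```

With $n = 5$ and $h = 3$ the feature map has $n - h + 1 = 3$ entries.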
Residual learning connects low-level and high-level representations directly, and tackles the vanishing gradient problem in deep networks. In our model, we design the residual convolutional block by applying shortcut connections. Each residual convolutional block is a sequence of two convolutional layers, each followed by a ReLU activation. The kernel size of all convolutions is $h$, with padding such that the new feature map has the same size as the original one. Here we have two convolutional filters $\mathbf{w}_1, \mathbf{w}_2 \in \mathbb{R}^{h \times 1}$. For the first convolutional layer:

$$\tilde{c}_i = f(\mathbf{w}_1 \cdot \mathbf{c}_{i:i+h-1} + b_1).$$

For the second convolutional layer:

$$\dot{c}_i = f(\mathbf{w}_2 \cdot \tilde{\mathbf{c}}_{i:i+h-1} + b_2).$$

Here $b_1$, $b_2$ are bias terms. The residual learning operation is:

$$\mathbf{c} := \mathbf{c} + \dot{\mathbf{c}},$$

where, conveniently, the notation $\mathbf{c}$ on the left-hand side is redefined as the output vector of the block. This operation is performed by a shortcut connection and element-wise addition. This block is stacked repeatedly in our architecture.
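One residual block under these definitions can be sketched as follows (a NumPy illustration with ReLU activations and same-padding; the helper names are ours):

```python
import numpy as np

def conv1d_same(c, w, b):
    # Same-padded 1-D convolution over a feature vector, followed by ReLU,
    # so the output has the same length as the input
    h = len(w)
    cp = np.pad(c, h // 2)
    out = np.array([w @ cp[i:i + h] + b for i in range(len(c))])
    return np.maximum(out, 0.0)

def residual_block(c, w1, b1, w2, b2):
    # Two convolutions, then element-wise addition with the shortcut input
    return c + conv1d_same(conv1d_same(c, w1, b1), w2, b2)

c = np.arange(6, dtype=float)
out = residual_block(c, np.zeros(3), 0.0, np.zeros(3), 0.0)
# With zero filters the block reduces to the identity mapping: out == c
```

The zero-filter case makes the shortcut explicit: the block can always fall back to the identity, which is what eases optimization.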
We then apply a max-pooling operation over the feature map and take the maximum value $\hat{c} = \max\{\mathbf{c}\}$. We have described the process by which one feature is extracted from one filter. Collecting all features yields one high-level extracted feature vector $\mathbf{z} = [\hat{c}_1, \ldots, \hat{c}_m]$ (note that here we have $m$ filters). Then, these features are passed to a fully connected softmax layer whose output is the probability distribution over relations. Instead of using $y = \mathbf{w} \cdot \mathbf{z} + b$ for output unit $y$ in forward propagation, dropout uses $y = \mathbf{w} \cdot (\mathbf{z} \circ \mathbf{r}) + b$, where $\circ$ is the element-wise multiplication operation and $\mathbf{r} \in \mathbb{R}^m$ is a 'masking' vector of Bernoulli random variables with probability $p$ of being 1. At test time, the learned weight vectors are scaled by $p$ such that $\hat{\mathbf{w}} = p\mathbf{w}$, and used (without dropout) to score unseen instances.

In this paper, we use the word embeddings released by (Lin et al., 2016), which are trained on the NYT-Freebase corpus (Riedel et al., 2010). We fine-tune our model using validation on the training data. The word embedding is of size 50. The input text is padded to a fixed size of 100. Training is performed with the TensorFlow Adam optimizer, using a mini-batch of size 64 and an initial learning rate of 0.001. We initialize our convolutional layers following (Glorot and Bengio, 2010). The implementation is done using TensorFlow 0.11. All experiments are performed on a single NVidia Titan X (Pascal) GPU. Table 1 lists all parameters used in the experiments.

Parameter | Value |
---|---|
Window size h | 3 |
Word dimension d_w | 50 |
Position dimension d_p | 5 |
Position maximum distance | 30 |
Position minimum distance | -30 |
Number of filters m | 128 |
Batch size B | 64 |
Learning rate | 0.001 |
Dropout probability p | 0.5 |
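The max-over-time pooling and dropout steps described above can be sketched as follows (our NumPy illustration; the test-time weight scaling follows the description in the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def max_pool(c):
    # Max-over-time pooling: one scalar per feature map
    return np.max(c)

def dropout_train(z, p):
    # r is a Bernoulli 'masking' vector with probability p of being 1
    r = (rng.random(z.shape) < p).astype(float)
    return z * r

def dropout_test(w, p):
    # At test time, scale the learned weights by p and use them without dropout
    return p * w

z = np.array([0.2, 1.5, -0.3, 0.9])
masked = dropout_train(z, p=0.5)
w_test = dropout_test(np.ones(4), p=0.5)
```

Each entry of `masked` is either the original value or zero, and the test-time weights are uniformly scaled by $p$.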
We experiment with several state-of-the-art baselines and variants of our model.
CNN-B: Our implementation of the CNN baseline (Zeng et al., 2014) which contains one convolutional layer, and one fully connected layer.
CNN+ATT: CNN-B with attention over instance learning (Lin et al., 2016).
PCNN+ATT: Piecewise CNN-B with attention over instance learning (Lin et al., 2016).
CNN: Our CNN model which includes one convolutional layer and three fully connected layers.
CNN-x: Deeper CNN model with x convolutional layers. For example, CNN-9 is a model constructed with 9 convolutional layers (1 + 4 residual CNN blocks, but without identity shortcuts) and three fully connected layers.
ResCNN-x: Our proposed CNN-x model with residual identity shortcuts.
We evaluate our models on the widely used NYT-Freebase dataset (Riedel et al., 2010). Note that the ImageNet dataset used by the original ResNet paper (He et al., 2016) has 1.28 million training instances. The NYT-Freebase dataset includes 522,611 training sentences and 172,448 testing sentences; it is the largest dataset in relation extraction, and the only one large enough to train deeper CNNs. Similar to previous work (Zeng et al., 2015; Lin et al., 2016), we evaluate our model using held-out evaluation. We report both aggregate precision/recall curves and Precision@N (P@N).
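The P@N metric used in the held-out evaluation can be computed as follows (a minimal sketch; the `(score, is_correct)` convention is ours):

```python
def precision_at_n(scored, n):
    # scored: list of (score, is_correct) predictions over the test set.
    # P@N is the fraction of correct predictions among the top-n by score.
    top = sorted(scored, key=lambda t: -t[0])[:n]
    return sum(correct for _, correct in top) / n

preds = [(0.95, 1), (0.90, 0), (0.80, 1), (0.40, 0), (0.10, 1)]
```

For example, `precision_at_n(preds, 2)` looks only at the two highest-scoring predictions.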
In Figure 2, we compare the proposed ResCNN model with various CNNs. First, CNNs with multiple fully connected layers obtained very good results, which is a novel finding. Second, the results also suggest that deeper CNNs with residual learning help extract signals from noisy distant supervision data. We observed overfitting when we tried to add more layers: the performance of CNN-9 is much worse than CNN. We find that ResNet solves this problem, and ResCNN-9 obtains better performance than CNN-B and CNN, dominating the precision/recall curve overall.
We show the effect of depth in residual networks in Figure 3. We observe that ResCNN-5 is worse than CNN-5 because ResNet does not work well for shallow CNNs, which is consistent with the original ResNet paper. As we increase the network depth, CNN-9 overfits the training data. With residual learning, both ResCNN-9 and ResCNN-13 provide significant improvements over the CNN-5 and ResCNN-5 models. Contrary to the popular belief that ResNet only works well for very deep networks, we found that even with 9 layers of CNNs, using identity mapping significantly improves performance in this noisy input setting.
P@N(%) | 100 | 200 | 300 | Mean |
---|---|---|---|---|
CNN+ATT | 76.2 | 68.6 | 59.8 | 68.2 |
PCNN+ATT | 76.2 | 73.1 | 67.4 | 72.2 |
CNN-B | 41.0 | 40.0 | 41.0 | 40.7 |
CNN | 64.0 | 61.0 | 55.3 | 60.1 |
CNN-5 | 64.0 | 58.5 | 54.3 | 58.9 |
ResCNN-5 | 57.0 | 57.0 | 54.3 | 56.1 |
CNN-9 | 56.0 | 54.0 | 49.7 | 53.2 |
ResCNN-9 | 79.0 | 69.0 | 61.0 | 69.7 |
ResCNN-13 | 76.0 | 65.0 | 60.3 | 67.1 |
Our intuition for why ResNet helps this task is two-fold. First, if the lower, middle, and higher levels learn hidden lexical, syntactic, and semantic representations respectively, it sometimes helps to bypass the syntax and connect the lexical and semantic spaces directly. Second, ResNet tackles the vanishing gradient problem, which decreases the effect of noise in the distant supervision data.
In Table 2, we compare the performance of our models to state-of-the-art baselines. We show that our ResCNN-9 outperforms all models that do not select training instances. Even without piecewise max-pooling and instance-based attention, our model is on par with the PCNN+ATT model.
For a more practical evaluation, we compare the results for Precision@N where N is small (1, 5, 10, 20, 50) in Table 3. We observe that our ResCNN-9 model dominates when predicting relations in the high-probability range. ResNet helps CNNs focus on the most probable candidates and mitigates the noise effect of distant supervision. We believe that residual connections can be seen as a form of renormalizing the gradients, which prevents the model from overfitting to the noisy distant supervision data.
P@N | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
PCNN+ATT | 1 | 0.8 | 0.9 | 0.75 | 0.7 |
ResCNN-9 | 1 | 1 | 0.9 | 0.9 | 0.88 |
In our distantly supervised relation extraction experiments, we have two important observations: (1) we obtain significant improvements by adding multiple fully connected layers to CNNs; (2) residual learning significantly improves the performance of deeper CNNs.
In this paper, we introduce a deep residual learning method for distantly supervised relation extraction. We show that deeper convolutional models help distill signals from noisy inputs. With shortcut connections and identity mapping, performance is significantly improved. These results align with a recent study (Conneau et al., 2017), suggesting that deeper CNNs do have positive effects on noisy NLP problems.
Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249-256.

He, K., Zhang, X., Ren, S., and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.