1 Introduction
We study the relation extraction (RE) problem, one of the important problems in information extraction and natural language processing (NLP). Given two entity mentions in a sentence (a relation mention), we need to identify the semantic relationship (if any) between the two entity mentions. One example is the recognition of the Located relation between “He” and “Texas” in the sentence “He lives in Texas”.
The two methods dominating RE research in the last decade are the feature-based method [Kambhatla2004, Boschee et al.2005, Zhou et al.2005, Grishman et al.2005, Jiang and Zhai2007, Chan and Roth2010, Sun et al.2011] and the kernel-based method [Zelenko et al.2003, Culotta and Sorensen2004, Bunescu and Mooney2005a, Bunescu and Mooney2005b, Zhang et al.2006, Zhou et al.2007, Qian et al.2008, Nguyen et al.2009, Plank and Moschitti2013]. This research extensively studies the use of linguistic analysis and knowledge resources to construct feature representations, involving combinations of discrete properties such as lexicon, syntax, and gazetteers. Although these approaches are able to exploit the symbolic (discrete) structures within relation mentions, they have difficulty generalizing to unseen words [Gormley et al.2015], motivating some very recent work on employing continuous representations of words (word embeddings) for RE. The most popular method involves neural networks (NNs) that effectively learn hidden structures of relation mentions from such word embeddings, thus achieving the top performance for RE [Zeng et al.2014, dos Santos et al.2015, Xu et al.2015].
The NN research for relation extraction and classification has centered around two main network architectures: convolutional neural networks (CNNs) [dos Santos et al.2015, Zeng et al.2015] and recursive/recurrent neural networks [Socher et al.2012, Xu et al.2015]. The distinction between convolutional neural networks and recurrent neural networks (RNNs) for RE is that the former aim to generalize the local and consecutive context (i.e., the $k$-grams) of the relation mentions [Nguyen and Grishman2015a] while the latter adaptively accumulate the context information in the whole sentence via the memory units, thereby encoding the global and possibly non-consecutive patterns for RE [Hochreiter and Schmidhuber1997, Cho et al.2014]. Consequently, the traditional feature-based method (i.e., the log-linear or MaxEnt model with hand-crafted features), the CNNs and the RNNs tend to focus on different angles for RE. Guided by this intuition, in this work, we propose to combine the three models to further improve the performance of RE.
While the architecture design of CNNs for RE is quite established due to the extensive studies in the last couple of years, the application of RNNs to RE is only very recent and the optimal design of RNNs for RE is still an open research question. In this work, we first perform a systematic exploration of various network architectures to seek the best RNN model for RE. In the next step, we extensively study different methods to assemble the log-linear model, CNNs and RNNs for RE, leading to combined models that yield the state-of-the-art performance on the ACE 2005 and SemEval datasets. To the best of our knowledge, this is the first work to systematically examine the RNN architectures as well as combine them with CNNs and the traditional feature-based approach for RE.
2 Models
Relation mentions consist of sentences marked with two entity mentions of interest. In this paper, we examine two different representations for the sentences in RE: (i) the standard representation, called SEQ, that takes all the words in the sentences into account and (ii) the dependency representation, called DEP, that only considers the words along the dependency paths between the two entity mention heads of the sentences. In the following, unless indicated otherwise, all the statements about the sentences hold for both representations SEQ and DEP.
Throughout this paper, for convenience, we assume that the input sentences of the relation mentions have the same fixed length $n$. This can be achieved by setting $n$ to the length of the longest input sentence and padding the shorter sentences with a special token. Let $x = x_1, x_2, \ldots, x_n$ be the input sentence of some relation mention, where $x_i$ is the $i$-th word in the sentence. Also, let $x_{i_1}$ and $x_{i_2}$ be the two heads of the two entity mentions of interest. In order to prepare the relation mention for neural networks, we first transform each word $x_i$ into a real-valued vector $v_i$ using the concatenation of the following seven vectors, motivated by the previous research on neural networks and feature analysis for RE [Zhou et al.2005, Sun et al.2011, Gormley et al.2015]:
- The real-valued word embedding vector $e_i$ of $x_i$, obtained by looking up the word embedding table $E$.
- The real-valued distance embedding vectors $d^1_i$, $d^2_i$ to encode the relative distances $i - i_1$ and $i - i_2$ of $x_i$ to the two entity heads of interest $x_{i_1}$ and $x_{i_2}$: $d^1_i = D[i - i_1]$, $d^2_i = D[i - i_2]$, where $D$ is the distance embedding table (initialized randomly). The objective is to inform the networks of the positions of the two entity mentions for relation prediction.
- The real-valued embedding vectors for entity types $t_i$ and chunks $q_i$ to embed the entity type and chunking information for $x_i$. These vectors are generated by looking up the entity type and chunk embedding tables (also initialized randomly), i.e., $T$ and $Q$ respectively, for the entity type $\mathrm{ent}_i$ and chunking label $\mathrm{chunk}_i$ of $x_i$: $t_i = T[\mathrm{ent}_i]$, $q_i = Q[\mathrm{chunk}_i]$.
- The binary vector $p_i$ with one dimension to indicate whether the word $x_i$ is on the dependency path between $x_{i_1}$ and $x_{i_2}$ or not.
- The binary vector $g_i$ whose dimensions correspond to the possible relations between words in the dependency trees. The value at a dimension of $g_i$ is only set to 1 if there exists one edge of the corresponding relation connected to $x_i$ in the dependency tree.
The transformation from the word $x_i$ to the vector $v_i = [e_i, d^1_i, d^2_i, t_i, q_i, p_i, g_i]$ essentially converts the relation mention with the input sentence $x$ into a real-valued matrix $X = [v_1, v_2, \ldots, v_n]$, to be used by the neural networks presented below.
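The construction of the input matrix can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the table sizes, the dictionary layout of `sentence`, the helper names `word_vector` and `input_matrix`, and the choice to shift distance offsets into non-negative row indices are all hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only (not the paper's settings).
N = 6                                   # fixed, padded sentence length n
VOCAB, WORD_DIM, FEAT_DIM = 20, 8, 4
N_TYPES, N_CHUNKS, N_DEP_RELS = 5, 7, 10

E = rng.normal(size=(VOCAB, WORD_DIM))       # word embedding table
D = rng.normal(size=(2 * N - 1, FEAT_DIM))   # distance table for offsets -(n-1)..(n-1)
T = rng.normal(size=(N_TYPES, FEAT_DIM))     # entity type embedding table
Q = rng.normal(size=(N_CHUNKS, FEAT_DIM))    # chunk embedding table


def word_vector(i, sent, i1, i2):
    """Concatenate the seven component vectors for the word at position i."""
    e = E[sent["words"][i]]
    d1 = D[i - i1 + N - 1]               # shift the offset to a non-negative row index
    d2 = D[i - i2 + N - 1]
    t = T[sent["types"][i]]
    q = Q[sent["chunks"][i]]
    p = np.array([1.0 if i in sent["dep_path"] else 0.0])
    g = np.zeros(N_DEP_RELS)
    for rel in sent["dep_rels"][i]:      # dependency relations touching word i
        g[rel] = 1.0
    return np.concatenate([e, d1, d2, t, q, p, g])


def input_matrix(sent, i1, i2):
    """Stack the word vectors as columns: shape (vector_dim, n)."""
    return np.stack([word_vector(i, sent, i1, i2) for i in range(N)], axis=1)


sentence = {
    "words": [3, 7, 1, 12, 0, 0],        # word ids; 0 plays the padding token here
    "types": [1, 0, 0, 2, 0, 0],
    "chunks": [1, 3, 5, 1, 0, 0],
    "dep_path": {0, 1, 3},
    "dep_rels": [{2}, {2, 4}, {6}, {4, 6}, set(), set()],
}
X = input_matrix(sentence, i1=0, i2=3)
print(X.shape)  # (35, 6)
```

Each column of `X` holds one word vector, so the networks below can slide over or recur along the second axis.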
2.1 The Separate Models
We describe two typical NN architectures for RE underlying the combined models in this work.
2.1.1 The Convolutional Neural Networks
In CNNs [Kalchbrenner et al.2014, Kim2014], given a window size of $k$, we have a set of $m$ feature maps (filters). Each feature map $f$ is a weight matrix $f = [f_1, f_2, \ldots, f_k]$, where each $f_j$ is a vector to be learnt during training as a model parameter. The core of CNNs is the application of the convolutional operator on the input matrix $X$ and the filter matrix $f$ to produce a score sequence $s^f = [s^f_1, s^f_2, \ldots, s^f_{n-k+1}]$, interpreted as a more abstract representation of the input matrix $X$:
$$s^f_i = g\Big(\sum_{j=0}^{k-1} f_{j+1}^{\top} v_{i+j} + b\Big)$$
where $b$ is a bias term and $g$ is a nonlinear activation function.
In the next step, we further abstract the scores in $s^f$ by aggregating them via the $\max$ function to obtain the max-pooling score $s^f_{\max} = \max\{s^f_1, s^f_2, \ldots, s^f_{n-k+1}\}$. We then repeat this process for all the $m$ feature maps with different window sizes $k$ to generate a vector of the max-pooling scores. In the final step, we pass this vector into some standard multilayer neural network, followed by a softmax layer to produce the probabilistic distribution $P_C(y \mid X)$
over the possible relation classes $y$ in the prediction task.
2.1.2 The Recurrent Neural Networks
In RNNs, we consider the input matrix $X = [v_1, v_2, \ldots, v_n]$ as a sequence of column vectors indexed from 1 to $n$. At each step $i$, we compute the hidden vector $h_i$ from the current input vector $v_i$ and the previous hidden vector $h_{i-1}$ using the nonlinear transformation function $\Phi$: $h_i = \Phi(v_i, h_{i-1})$. This recurrent computation can be done via three different directional mechanisms: (i) the forward mechanism that recurs from 1 to $n$ and generates the forward hidden vector sequence $\overrightarrow{h}_1, \ldots, \overrightarrow{h}_n$, (ii) the backward mechanism that runs RNNs from $n$ to 1 and results in the backward hidden vector sequence $\overleftarrow{h}_1, \ldots, \overleftarrow{h}_n$ (the initial hidden vectors are set to the zero vector), and (iii) the bidirectional mechanism that performs RNNs in both directions to produce the forward and backward hidden vector sequences, and then concatenates them at each position to generate the new hidden vector sequence $h_1, \ldots, h_n$: $h_i = [\overrightarrow{h}_i, \overleftarrow{h}_i]$.
Given the hidden vector sequence $h_1, \ldots, h_n$ obtained from one of the three mechanisms above, we study the following two strategies to generate the representation vector $v^R$ for the initial relation mention. Note that this representation vector can again be fed into some standard multilayer neural network with a softmax layer at the end, resulting in the distribution $P_R(y \mid X)$ for the RNN models:
- The HEAD strategy: In this strategy, $v^R$ is the concatenation of the hidden vectors at the positions of the two entity mention heads of interest: $v^R = [h_{i_1}, h_{i_2}]$. This is motivated by the importance of the two mention heads in RE [Sun et al.2011, Nguyen and Grishman2014].
- The MAX strategy: This strategy is similar to our max-pooling mechanism in CNNs. In particular, $v^R$ is obtained by taking the maximum along each dimension of the hidden vectors $h_1, \ldots, h_n$. The idea is to further abstract the hidden vectors by retaining only the most important feature in each dimension.
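The three directional mechanisms and the two strategies can be made concrete with a short NumPy sketch. It is a minimal illustration rather than the authors' implementation: the transition is a plain one-layer sigmoid network picked only for brevity, and the function names (`run_rnn`, `bidirectional`, `head_representation`, `max_representation`) and toy sizes are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def run_rnn(X, U, V, reverse=False):
    """Run one recurrent pass over the columns of X.

    Returns the hidden vectors aligned with sentence positions, shape (hidden, n).
    """
    hidden, n = U.shape[0], X.shape[1]
    H = np.zeros((hidden, n))
    h = np.zeros(hidden)                       # initial hidden vector is the zero vector
    order = range(n - 1, -1, -1) if reverse else range(n)
    for i in order:
        h = sigmoid(U @ X[:, i] + V @ h)       # assumed one-layer transition function
        H[:, i] = h
    return H


def bidirectional(X, fwd_params, bwd_params):
    """Concatenate forward and backward hidden vectors at each position."""
    forward = run_rnn(X, *fwd_params)
    backward = run_rnn(X, *bwd_params, reverse=True)
    return np.concatenate([forward, backward], axis=0)


def head_representation(H, i1, i2):
    """HEAD: concatenate the hidden vectors at the two entity head positions."""
    return np.concatenate([H[:, i1], H[:, i2]])


def max_representation(H):
    """MAX: take the maximum along each hidden dimension over all positions."""
    return H.max(axis=1)


rng = np.random.default_rng(1)
X = rng.normal(size=(5, 6))                    # toy input: 5-dim word vectors, n = 6
fwd = (rng.normal(size=(3, 5)), rng.normal(size=(3, 3)))
bwd = (rng.normal(size=(3, 5)), rng.normal(size=(3, 3)))

H = bidirectional(X, fwd, bwd)
print(H.shape, head_representation(H, 0, 3).shape, max_representation(H).shape)
```

With a forward-only pass, the hidden vector at an early head position has seen none of the later words, which is one way to read why the HEAD strategy benefits from the bidirectional mechanism.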
Regarding the nonlinear function, the simplest form of $\Phi$ in the literature considers it as a one-layer feed-forward neural network, called FF: $\mathrm{FF}(v_i, h_{i-1}) = \phi(U v_i + V h_{i-1})$, where $U$ and $V$ are weight matrices and $\phi$ is the sigmoid function. Unfortunately, the application of FF causes the so-called “vanishing/exploding gradient” problems [Bengio et al.1994], making it challenging to train RNNs properly [Pascanu et al.2012]. These problems are overcome by the long short-term memory units (LSTM) [Hochreiter and Schmidhuber1997, Graves et al.2009]. In this work, we apply a variant of the memory units: the Gated Recurrent Units from Cho et al. (2014), called GRU. GRU is shown to be much simpler than LSTM in terms of computation while still achieving comparable performance [Cho et al.2014].
2.2 The Combined Models
We first present the three different methods to assemble CNNs and RNNs that are investigated in this work: ensembling, stacking and voting. The combination of the neural networks with the log-linear model is discussed in Section 2.3.
2.2.1 Ensembling
In this method, we first run some CNN and RNN from Section 2.1 over the input matrix $X$ to gather the corresponding distributions $P_C(y \mid X)$ and $P_R(y \mid X)$. We then combine the CNN and RNN by multiplying their distributions (element-wise): $P_{\mathrm{ensemble}}(y \mid X) = \frac{1}{Z} P_C(y \mid X) P_R(y \mid X)$ ($Z$ is a normalization constant).
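As a small worked example of the element-wise product, the sketch below multiplies two class distributions and renormalizes. The function name `ensemble` and the toy probabilities are hypothetical and serve only to illustrate the formula.

```python
import numpy as np


def ensemble(p_cnn, p_rnn):
    """Multiply two class distributions element-wise and renormalize."""
    product = np.asarray(p_cnn, dtype=float) * np.asarray(p_rnn, dtype=float)
    z = product.sum()                    # normalization constant Z
    return product / z


p_cnn = [0.6, 0.3, 0.1]                  # toy CNN distribution over three classes
p_rnn = [0.2, 0.5, 0.3]                  # toy RNN distribution over the same classes
p = ensemble(p_cnn, p_rnn)
print(p, p.argmax())
```

In this toy case the CNN alone prefers class 0 and the RNN prefers class 1; the product favors class 1 because the RNN assigns class 0 a low probability.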
2.2.2 Stacking
The overall architecture of the stacking method is to use one of the two network architectures (i.e., CNNs and RNNs) to generalize the hidden vectors of the other architecture. The expectation is that we can learn more effective features for RE via such a deeper architecture by alternating between the local and global representations provided by CNNs and RNNs.
We examine two variants for this method. The first variant, called RNN-CNN, applies the CNN model in Section 2.1.1 on the hidden vector sequence generated by some RNN in Section 2.1.2 to perform RE. The second variant, called CNN-RNN, on the other hand, utilizes the CNN model to acquire the hidden vector sequence, which is, in turn, fed as the input into some RNN for RE. For the second variant, as the length of the hidden vector $s^f$ in the CNN model depends on the window size $k$ specified for the feature map $f$, we need to pad the input matrix $X$ with zero column vectors on both sides to ensure the same fixed length $n$ for all the hidden vectors. Besides, we need to rearrange the scores in the hidden vectors from different feature maps of the CNN so they are grouped according to the positions in the sentence, thus being compatible with the input requirement of RNNs.
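The padding and regrouping steps of the CNN-RNN variant can be sketched as follows. This is an illustration under assumptions that the text does not fix: `tanh` stands in for the activation function, the $k - 1$ zero columns are split as evenly as possible between the two sides, and the names `conv_scores` and `cnn_hidden_sequence` are hypothetical.

```python
import numpy as np


def conv_scores(X, F, b):
    """Score sequence of one feature map over a zero-padded input.

    X has shape (dim, n); F has shape (k, dim), one weight vector per window slot.
    Padding k - 1 zero columns in total keeps the output length equal to n.
    """
    k = F.shape[0]
    n = X.shape[1]
    left = (k - 1) // 2                  # assumed split of the padding
    right = k - 1 - left
    Xp = np.pad(X, ((0, 0), (left, right)))          # zero column vectors on both sides
    scores = [np.sum(F.T * Xp[:, i:i + k]) + b for i in range(n)]
    return np.tanh(np.array(scores))                 # tanh is an assumed activation


def cnn_hidden_sequence(X, filters):
    """Group the scores of all feature maps by sentence position.

    Returns shape (num_feature_maps, n): column i is the hidden vector at position i,
    which is the column-sequence layout an RNN expects as input.
    """
    return np.stack([conv_scores(X, F, b) for F, b in filters], axis=0)


rng = np.random.default_rng(2)
X = rng.normal(size=(4, 7))                          # toy input: dim 4, n = 7
filters = [(rng.normal(size=(k, 4)), 0.1) for k in (2, 3, 4, 5)]
H = cnn_hidden_sequence(X, filters)
print(H.shape)  # (4, 7)
```

Because every feature map now yields exactly $n$ scores, the resulting matrix can be passed column by column to any of the RNN mechanisms of Section 2.1.2.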
2.2.3 Voting
Instead of integrating CNNs and RNNs at the model level as in the two previous methods, the voting method makes the decision for a relation mention $X$ by voting over the individual decisions of the different models. While there are several voting schemes in the literature, in this work we employ the simplest scheme of majority voting. If more than one relation class receives the highest number of votes, the relation class that is returned by a model and has the highest probability is chosen.
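The voting rule, including its tie-break, can be written down compactly. The sketch below is one possible reading of the rule: each model votes for its most probable class, and on a tie the winner is the tied class that some model predicted with the highest probability. The function name `majority_vote` and the toy distributions are assumptions.

```python
from collections import Counter

import numpy as np


def majority_vote(distributions):
    """Return the class index chosen by majority voting over model distributions."""
    dists = [np.asarray(p, dtype=float) for p in distributions]
    votes = [int(p.argmax()) for p in dists]         # each model's own decision
    counts = Counter(votes)
    top = max(counts.values())
    tied = [c for c, v in counts.items() if v == top]
    if len(tied) == 1:
        return tied[0]
    # Tie: keep the tied class that a model returned with the highest probability.
    best_class, best_prob = None, -1.0
    for p, c in zip(dists, votes):
        if c in tied and p[c] > best_prob:
            best_class, best_prob = c, float(p[c])
    return best_class


# Two models that disagree: the more confident model decides.
print(majority_vote([[0.7, 0.3], [0.4, 0.6]]))                     # 0
# Three models: class 1 has two votes.
print(majority_vote([[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.3, 0.7, 0.0]]))  # 1
```

With only two voters, every disagreement is a tie, so the tie-break effectively selects the more confident model.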
2.3 The Hybrid Models
In order to further improve the RE performance of the models above, we investigate the integration of these neural network models with the traditional log-linear model that relies on various linguistic features from the past research on RE [Zhou et al.2005, Sun et al.2011, Gormley et al.2015]. Specifically, in such integration models (called the hybrid models), the relation class distribution is obtained from the element-wise multiplication between the distributions of the neural network models and the log-linear model. Let us take the ensembling model in Section 2.2.1 as an example. The corresponding hybrid model in this case would be: $P_{\mathrm{hybrid}}(y \mid X) = \frac{1}{Z} P_C(y \mid X) P_R(y \mid X) P_L(y \mid X)$, where $P_L(y \mid X)$ is the distribution of the log-linear model and $Z$ is the normalization constant. The parameters of the log-linear model are learnt jointly with the parameters of the neural networks.
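One way to read this product is in log space. Writing $P_C$, $P_R$ and $P_L$ for the CNN, RNN and log-linear distributions and $Z$ for the normalization constant, and assuming all three distributions are strictly positive (as softmax outputs are), the hybrid distribution is a softmax over the summed log-probabilities:

```latex
P_{\mathrm{hybrid}}(y \mid X)
  = \frac{1}{Z}\, P_C(y \mid X)\, P_R(y \mid X)\, P_L(y \mid X)
  = \frac{\exp\big(\log P_C(y \mid X) + \log P_R(y \mid X) + \log P_L(y \mid X)\big)}
         {\sum_{y'} \exp\big(\log P_C(y' \mid X) + \log P_R(y' \mid X) + \log P_L(y' \mid X)\big)}
```

Under this reading, a class that any single component considers very unlikely receives a large negative contribution to its score, so each component can act as a soft veto.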
Hypothesis: Let $A$ be the set of relation mentions correctly predicted by some neural network model in some dataset (the coverage set). The introduction of the log-linear model into this neural network model essentially changes the coverage set of the network, resulting in the new coverage set $B$ that might or might not subsume the original set $A$. In this work, we hypothesize that although $A$ and $B$ overlap, there are still some relation mentions that only belong to either set. Consequently, we propose to implement a majority voting system (called the hybrid-voting system) on the outputs of the network and its corresponding hybrid model to enhance both models.
Note that the voting models in Section 2.2.3 involve voting over two models (i.e., CNN and RNN). In order to integrate the log-linear model into such voting models, we first augment the separate CNN and RNN models with the log-linear model before we perform the voting procedure on the resulting models. Finally, the corresponding hybrid-voting systems involve voting over four models (CNN, hybrid CNN, RNN and hybrid RNN).
2.4 Training
We train the models by minimizing the negative log-likelihood function using the stochastic gradient descent algorithm with shuffled mini-batches and the AdaDelta update rule [Zeiler2012, Kim2014]. The gradients are computed via back-propagation while regularization is executed by a dropout on the hidden vectors before the multilayer neural networks [Hinton et al.2012]. During training, besides the weight matrices, we also optimize the embedding tables to achieve the optimal state. Finally, we rescale the weights whose $\ell_2$-norms exceed a hyperparameter [Kim2014, Nguyen and Grishman2015a].
3 Experiments
3.1 Resources and Parameters
For all the experiments below, we utilize the pre-trained word2vec word embeddings with 300 dimensions from Mikolov et al. (2013) to initialize the word embedding table $E$. The parameters for CNNs and for training the networks are inherited from the previous studies [Kim2014, Nguyen and Grishman2015a]: the window size set for feature maps, 150 feature maps for each window size, 50 dimensions for all the embedding tables (except the word embedding table $E$), the dropout rate, the mini-batch size, and a threshold of 3 for the norms. Regarding RNNs, we employ 300 units in the hidden layers.
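The norm threshold of 3 refers to the weight rescaling step of Section 2.4. A minimal sketch of such a step is given below; applying it row by row (one weight vector per row), using the $\ell_2$-norm, and the function name `rescale_rows` are assumptions, since the text does not specify the granularity.

```python
import numpy as np


def rescale_rows(W, max_norm=3.0):
    """Rescale every row of W whose l2-norm exceeds max_norm down to max_norm.

    Rows that are already within the threshold are left unchanged.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))  # avoid division by zero
    return W * scale


W = np.array([[3.0, 4.0],      # norm 5.0, above the threshold
              [0.3, 0.4]])     # norm 0.5, within the threshold
R = rescale_rows(W)
print(R)
```

Such a step would typically be applied after each gradient update, so the constraint holds throughout training.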
3.2 Dataset
We evaluate our models on two datasets: the ACE 2005 dataset for relation extraction and the SemEval-2010 Task 8 dataset [Hendrickx et al.2010] for relation classification.
The ACE 2005 corpus comes with 6 different domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and weblogs (wl). Following the common practice of domain adaptation research on this dataset [Plank and Moschitti2013, Nguyen and Grishman2014, Nguyen et al.2015c, Gormley et al.2015], we use news (the union of bn and nw) as the training data, a half of bc as the development set and the remainder (cts, wl and the other half of bc) as the test data. Note that we are using the data prepared by Gormley et al. (2015), thus utilizing the same data split on bc as well as the same data processing and NLP toolkits. The total number of relations in the training set is 43,497. (The figure of 43,518 total relations in the training set reported in Gormley et al. (2015) was an error, which the authors acknowledged.) We employ the BIO annotation scheme to capture the chunking information for words in the sentences and only mark the entity types of the two entity mention heads (obtained from human annotation) for this dataset.
The SemEval dataset concerns the relation classification task that aims to determine the relation type (or no relation) between two entities in sentences. In order to make it compatible with the previous research [Socher et al.2012, Gormley et al.2015], for this dataset, besides the word embeddings and the distance embeddings, we apply the name tagging, part-of-speech tagging and WordNet features (inherited from Socher et al. (2012) and encoded by real-valued vectors for each word). The other settings are also adopted from the past studies [Socher et al.2012, Xu et al.2015].
3.3 RNN Architectures
This section evaluates the performance of various RNN architectures for RE on the development set. In particular, we compare different design combinations of the following four factors: (i) sentence representations (i.e., SEQ or DEP), (ii) transformation functions (i.e., FF or GRU), (iii) the strategies to employ the hidden vector sequence for RE (i.e., HEAD or MAX), and (iv) the directions to run RNNs (i.e., forward, backward or bidirectional). Table 1 presents the results.
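Table 1: F1 scores of the RNN architectures on the development set.
| Function | Strategy | Direction | DEP | SEQ |
|---|---|---|---|---|
| FF | HEAD | bidirectional | 60.78 | 63.22 |
| FF | HEAD | forward | 55.55 | 60.05 |
| FF | HEAD | backward | 57.69 | 58.54 |
| FF | MAX | bidirectional | 50.00 | 51.22 |
| FF | MAX | forward | 52.08 | 53.96 |
| FF | MAX | backward | 45.07 | 33.50 |
| GRU | HEAD | bidirectional | 63.32 | 63.23 |
| GRU | HEAD | forward | 63.69 | 62.77 |
| GRU | HEAD | backward | 61.57 | 62.55 |
| GRU | MAX | bidirectional | 60.96 | 64.24 |
| GRU | MAX | forward | 61.97 | 64.59 |
| GRU | MAX | backward | 61.56 | 64.30 |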
The main conclusions include:
(i) Assuming the same choices for the other three corresponding factors, GRU is more effective than FF, SEQ is better than DEP most of the time and HEAD outperforms MAX (except the case where SEQ and GRU are applied) for RE with RNNs.
(ii) Regarding the direction mechanisms, the bidirectional mechanism achieves the best performance for the HEAD strategy while the forward direction is the best mechanism for the MAX strategy. This can be partly explained by the lack of past or future context information in the HEAD strategy when we follow the backward or forward direction respectively.
The best performance corresponds to the combination of the SEQ representation, the GRU function and the MAX strategy, which is used in all the RNN models below. We call such RNN models with the forward, backward and bidirectional mechanisms FORWARD, BACKWARD and BIDIRECT respectively. We also apply the SEQ representation for the CNN model (called CNN) in the following experiments for consistency.
3.4 Evaluating the Combined Models
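Table 2: Performance of the separate and combined models on the development set.
| Model | P | R | F1 |
|---|---|---|---|
| BIDIRECT | 69.16 | 59.97 | 64.24 |
| FORWARD | 69.33 | 60.45 | 64.59 |
| BACKWARD | 65.60 | 63.05 | 64.30 |
| CNN | 68.35 | 59.16 | 63.42 |
| Ensembling | | | |
| CNN-BIDIRECT | 71.22 | 54.13 | 61.51 |
| CNN-FORWARD | 66.19 | 59.64 | 62.75 |
| CNN-BACKWARD | 65.09 | 60.13 | 62.51 |
| Stacking | | | |
| CNN-BIDIRECT | 66.55 | 59.97 | 63.09 |
| CNN-FORWARD | 69.46 | 63.05 | 66.10 |
| CNN-BACKWARD | 72.58 | 58.35 | 64.69 |
| BIDIRECT-CNN | 65.63 | 61.59 | 63.55 |
| FORWARD-CNN | 73.13 | 58.67 | 65.11 |
| BACKWARD-CNN | 67.60 | 58.51 | 62.73 |
| Voting | | | |
| CNN-BIDIRECT | 71.08 | 60.94 | 65.62 |
| CNN-FORWARD | 70.38 | 59.32 | 64.38 |
| CNN-BACKWARD | 69.78 | 61.75 | 65.52 |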
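Table 3: Performance of the neural networks, the hybrid models and the hybrid-voting models on the development set.
| Model | Neural Networks P | Neural Networks R | Neural Networks F1 | Hybrid Models P | Hybrid Models R | Hybrid Models F1 | Hybrid-Voting Models P | Hybrid-Voting Models R | Hybrid-Voting Models F1 |
|---|---|---|---|---|---|---|---|---|---|
| CNN | 68.35 | 59.16 | 63.42 | 66.44 | 64.51 | 65.46 | 69.07 | 63.70 | 66.27 |
| BIDIRECT | 69.16 | 59.97 | 64.24 | 68.04 | 59.00 | 63.19 | 71.13 | 60.29 | 65.26 |
| FORWARD | 69.33 | 60.45 | 64.59 | 66.11 | 63.86 | 64.96 | 72.69 | 61.26 | 66.49 |
| BACKWARD | 65.60 | 63.05 | 64.30 | 66.03 | 62.07 | 63.99 | 71.56 | 63.21 | 67.13 |
| Combined Models | | | | | | | | | |
| VOTE-BIDIRECT | 71.08 | 60.94 | 65.62 | 69.24 | 62.40 | 65.64 | 71.30 | 62.40 | 66.55 |
| STACK-FORWARD | 69.46 | 63.05 | 66.10 | 65.93 | 68.07 | 66.99 | 69.32 | 66.29 | 67.77 |
| VOTE-BACKWARD | 69.78 | 61.75 | 65.52 | 67.30 | 63.05 | 65.10 | 70.79 | 64.02 | 67.23 |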
We evaluate the combination methods for CNNs and RNNs presented in Section 2.2. In particular, for each method, we examine three models that are combined from one of the three RNN models FORWARD, BACKWARD, BIDIRECT and the CNN model. For instance, in the stacking method, the three combined models corresponding to the RNN-CNN variant are FORWARD-CNN, BACKWARD-CNN, BIDIRECT-CNN while the three combined models corresponding to the CNN-RNN variant are CNN-FORWARD, CNN-BACKWARD, CNN-BIDIRECT. The notations for the other methods are self-explanatory. The model performance on the development set is given in Table 2, which also includes the performance of the separate models (i.e., CNN, FORWARD, BACKWARD, BIDIRECT) for convenient comparison.
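The F1 column in the result tables is consistent with the usual harmonic mean of precision (P) and recall (R). The short check below recomputes it for the four separate models from their reported P and R values; the function name `f1` is an assumption.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) for the separate models on the development set.
reported = {
    "BIDIRECT": (69.16, 59.97, 64.24),
    "FORWARD": (69.33, 60.45, 64.59),
    "BACKWARD": (65.60, 63.05, 64.30),
    "CNN": (68.35, 59.16, 63.42),
}
for name, (p, r, f) in reported.items():
    print(name, round(f1(p, r), 2), f)
```

This also makes the trade-offs visible: BACKWARD has the lowest precision of the four but the highest recall, and ends up with an F1 close to the others.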
The first observation is that the ensembling method is not an effective way to combine CNNs and RNNs as its performance is worse than that of the separate models. Second, regarding the stacking method, the best way to combine CNNs and RNNs in this framework is to assemble the CNN model and the FORWARD model. In fact, the combination of the CNN and FORWARD models helps to improve the performance of the separate models in both variants of this method (referring to the models CNN-FORWARD and FORWARD-CNN). Finally, the voting method is also helpful as it outperforms the separate models with the CNN-BIDIRECT and CNN-BACKWARD combinations.
For the following experiments, we only focus on the three best combined models in this section, i.e., the CNN-FORWARD model in the stacking method (called STACK-FORWARD) and the CNN-BIDIRECT and CNN-BACKWARD models in the voting method (called VOTE-BIDIRECT and VOTE-BACKWARD respectively).
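Table 4: Comparison with the state-of-the-art systems on the test domains bc, cts and wl.
| System | bc P | bc R | bc F1 | cts P | cts R | cts F1 | wl P | wl R | wl F1 | Ave |
|---|---|---|---|---|---|---|---|---|---|---|
| FCM | 66.56 | 57.86 | 61.9 | 65.62 | 44.35 | 52.93 | 57.80 | 44.62 | 50.36 | 55.06 |
| Hybrid FCM | 74.39 | 55.35 | 63.48 | 74.53 | 45.01 | 56.12 | 65.63 | 47.59 | 55.17 | 58.26 |
| Separate Systems | | | | | | | | | | |
| Log-Linear | 68.44 | 50.07 | 57.83 | 73.62 | 41.57 | 53.14 | 60.40 | 47.31 | 53.06 | 54.68 |
| CNN | 65.62 | 61.06 | 63.26 | 65.92 | 48.12 | 55.63 | 54.14 | 53.68 | 53.91 | 57.60 |
| BIDIRECT | 65.23 | 61.06 | 63.07 | 66.15 | 49.26 | 56.47 | 55.91 | 51.56 | 53.65 | 57.73 |
| FORWARD | 63.64 | 59.39 | 61.44 | 60.12 | 50.57 | 54.93 | 55.54 | 54.67 | 55.10 | 57.16 |
| BACKWARD | 60.44 | 61.2 | 60.82 | 58.20 | 54.01 | 56.03 | 51.03 | 52.55 | 51.78 | 56.21 |
| Hybrid-Voting Systems | | | | | | | | | | |
| VOTE-BIDIRECT | 70.40 | 63.84 | 66.96† | 66.74 | 49.92 | 57.12† | 59.24 | 54.96 | 57.02† | 60.37 |
| STACK-FORWARD | 65.75 | 66.48 | 66.11† | 63.58 | 51.72 | 57.04† | 56.35 | 57.22 | 56.78† | 59.98 |
| VOTE-BACKWARD | 69.57 | 63.28 | 66.28† | 65.91 | 52.21 | 58.26† | 58.81 | 55.81 | 57.27† | 60.60 |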
3.5 Evaluating the Hybrid Models
This section investigates the hybrid and hybrid-voting models (Section 2.3) to see if they can further improve the performance of the neural network models. In particular, we evaluate the separate models (CNN, BIDIRECT, FORWARD, BACKWARD) and the combined models (STACK-FORWARD, VOTE-BIDIRECT and VOTE-BACKWARD) when they are augmented with the traditional log-linear model (the hybrid models). Besides, in order to verify the hypothesis in Section 2.3, we also test the corresponding hybrid-voting models. The experimental results are shown in Table 3. There are three main conclusions:
(i) For all the models in columns “Neural Networks”, “Hybrid Models” and “Hybrid-Voting Models”, we see that the combined models outperform their corresponding separate models (with the only exception of the hybrid model of VOTE-BACKWARD), thereby further confirming the benefits of the combined models.
(ii) Comparing columns “Neural Networks” and “Hybrid Models”, we find that the traditional log-linear model significantly helps the CNN model. The effects on the other models are not clear.
(iii) More interestingly, for all the neural networks being examined (either separate or combined), the corresponding hybrid-voting systems substantially improve both the neural network models and the corresponding hybrid models, supporting the hypothesis about the hybrid-voting approach in Section 2.3. Note that the simpler voting systems on three models, i.e., the log-linear model, the CNN model and some RNN model (either BIDIRECT, FORWARD or BACKWARD), perform worse than the hybrid-voting methods (the respective performance is 66.13%, 65.27%, and 65.96%).
3.6 Comparing to the State-of-the-art
The state-of-the-art systems on ACE 2005 for the unseen domains have been the feature-rich compositional embedding model (FCM) and the hybrid FCM model from Gormley et al. (2015). In this section, we compare the proposed hybrid-voting systems with these state-of-the-art systems on the test domains bc, cts, wl. Table 4 reports the results. For completeness, we also include the performance of the log-linear model and the separate models CNN, BIDIRECT, FORWARD, BACKWARD, serving as the other baselines for this work.
From the table, we see that although the separate neural networks outperform the FCM model across domains, they are still worse than the hybrid FCM model due to the introduction of the log-linear model into FCM. However, when the networks are combined and integrated with the log-linear model, they (the hybrid-voting systems) become significantly better than the FCM models across all domains (up to 2% improvement on the average absolute F score), yielding the state-of-the-art performance for the unseen domains in this dataset.
3.7 Relation Classification Experiments
We further evaluate the proposed systems for the relation classification task on the SemEval dataset. Table 5 presents the performance of the separate models, the proposed systems as well as the other representative systems on this task. The most important observation is that the hybrid-voting systems VOTE-BIDIRECT and VOTE-BACKWARD achieve the state-of-the-art performance for this dataset, further highlighting their benefit for relation classification. The hybrid-voting STACK-FORWARD system performs less effectively in this case, possibly because the SemEval dataset is too small to train such a deep model.
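Table 5: Performance of relation classification systems on the SemEval dataset.
| Classifier | F |
|---|---|
| SVM [Hendrickx et al.2010] | 82.2 |
| RNN [Socher et al.2012] | 77.6 |
| MV-RNN [Socher et al.2012] | 82.4 |
| CNN [Zeng et al.2014] | 82.7 |
| CR-CNN [dos Santos et al.2015] | 84.1† |
| FCM [Gormley et al.2015] | 83.0 |
| Hybrid FCM [Gormley et al.2015] | 83.4 |
| DepNN [Liu et al.2015] | 83.6 |
| SDP-LSTM [Xu et al.2015] | 83.7 |
| CNN | 83.5 |
| BIDIRECT | 81.8 |
| FORWARD | 81.9 |
| BACKWARD | 82.4 |
| VOTE-BIDIRECT | 84.1 |
| STACK-FORWARD | 83.4 |
| VOTE-BACKWARD | 84.1 |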
3.8 Analysis
In order to better understand why the combination of CNNs and RNNs outperforms the individual networks, we evaluate the performance breakdown per relation for the CNN and BIDIRECT models. The results on the development set of the ACE 2005 dataset are provided in Table 6.
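Table 6: Performance breakdown per relation for CNN and BIDIRECT on the development set.
| Relation Class | CNN P | CNN R | CNN F1 | BIDIRECT P | BIDIRECT R | BIDIRECT F1 |
|---|---|---|---|---|---|---|
| PHYS | 66.7 | 34.7 | 45.7 | 57.4 | 50.9 | 54.0 |
| PART-WHOLE | 68.6 | 67.8 | 68.2 | 74.4 | 70.1 | 72.2 |
| ART | 64.2 | 51.2 | 57.0 | 68.6 | 41.7 | 51.9 |
| ORG-AFF | 70.2 | 83.0 | 76.0 | 79.3 | 76.1 | 77.7 |
| PER-SOC | 71.1 | 59.3 | 64.6 | 69.6 | 59.3 | 64.0 |
| GEN-AFF | 65.9 | 55.1 | 60.0 | 59.0 | 46.9 | 52.3 |
| all | 68.4 | 59.2 | 63.4 | 69.2 | 60.0 | 64.2 |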
One of the main insights is that although CNN and BIDIRECT have comparable overall performance, their recalls on individual relations diverge considerably. In particular, BIDIRECT has much better recall for the PHYS relation while the recalls of CNN are significantly better for the ART, ORG-AFF and GEN-AFF relations. A closer investigation reveals two facts: (i) the PHYS relation mentions that are only correctly predicted by BIDIRECT involve long distances between the two entity mentions, such as the PHYS relation between “Some” (a person entity) and “desert” (a location entity) in the following sentence: “Some of the 40,000 British troops are kicking up a lot of dust in the Iraqi desert making sure that nothing is left behind them that could hurt them.”, and (ii) the ART, ORG-AFF, GEN-AFF relation mentions only correctly predicted by CNN contain patterns between the two entity mentions that are short but meaningful enough to decide the relation classes, such as “The Iraqi unit in possession of those guns” (the ART relation between “unit” and “guns”), or “the al Qaeda chief operations officer” (the ORG-AFF relation between “al Qaeda” and “officer”). The failure of CNN on the PHYS relation mentions with long distances originates from its mechanism of modeling short and consecutive $k$-grams (up to length 5 in our case), which makes it difficult to capture long and/or non-consecutive patterns. BIDIRECT, on the other hand, fails to predict the short (but expressive enough) patterns for ART, ORG-AFF, GEN-AFF because it involves hidden vectors that only model the context words outside the short patterns, potentially introducing unnecessary and noisy information into the max-pooling scores for prediction. Eventually, the combination of RNNs and CNNs helps to compensate for the drawbacks of each model.
4 Related Work
Starting from the invention of the distributed representations for words [Bengio et al.2003, Mnih and Hinton2008, Collobert and Weston2008, Turian et al.2010, Mikolov et al.2013], CNNs and RNNs have achieved significant success on various NLP tasks, including sequential labeling [Collobert et al.2011], sentence modeling and classification [Kalchbrenner et al.2014, Kim2014], paraphrase identification [Yin and Schütze2015], event extraction [Nguyen and Grishman2015b, Chen et al.2015] for CNNs and machine translation [Cho et al.2014, Bahdanau et al.2015] for RNNs, to name a few.
For relation extraction/classification, most work on neural networks has focused on the relation classification task. In particular, Socher et al. (2012) and Ebrahimi and Dou (2015) study the recursive NNs that recur over the tree structures while Xu et al. (2015) and Zhang and Wang (2015) investigate recurrent NNs. Regarding CNNs, Zeng et al. (2014) examine CNNs via the sequential representation of sentences, dos Santos et al. (2015) explore a ranking loss function with data cleaning while Zeng et al. (2015) propose dynamic pooling and multi-instance learning. For RE, Yu et al. (2015) and Gormley et al. (2015) work on the feature-rich compositional embedding models. Finally, the only work that combines NN architectures is due to Liu et al. (2015) but it only focuses on the stacking of the recursive NNs and CNNs for relation classification.
5 Conclusion
We investigate different methods to combine CNNs and RNNs, as well as hybrid models that integrate the log-linear model into the NNs. The experimental results demonstrate that the simple majority voting between CNNs, RNNs and their corresponding hybrid models is the best combination method. We achieve the state-of-the-art performance for both relation extraction and relation classification. In the future, we plan to further evaluate the proposed methods on other tasks such as event extraction and slot filling in the KBP evaluation.
Acknowledgment
We would like to thank Matthew Gormley and Mo Yu for providing the dataset. Thank you to Kyunghyun Cho and Yifan He for valuable suggestions.
References
 [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

[Bengio et al.1994]
Yoshua Bengio, Patrice Simard, and Paolo Frasconi.
1994.
Learning longterm dependencies with gradient descent is difficult.
In
Journal of Machine Learning Research 3
.  [Bengio et al.2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. In Journal of Machine Learning Research 3.
 [Boschee et al.2005] Elizabeth Boschee, Ralph Weischedel, , and Alex Zamanian. 2005. Automatic information extraction. In Proceedings of the International Conference on Intelligence Analysis.
 [Bunescu and Mooney2005a] Razvan Bunescu and Raymond Mooney. 2005a. A shortest path dependency kernel for relation extraction. In HLTEMNLP.
 [Bunescu and Mooney2005b] Razvan Bunescu and Raymond J. Mooney. 2005b. Subsequence kernels for relation extraction. In NIPS.
 [Chan and Roth2010] Yee S. Chan and Dan Roth. 2010. Exploiting background knowledge for relation extraction. In COLING.
 [Chen et al.2015] Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In ACL-IJCNLP.
 [Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP.
 [Collobert and Weston2008] Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML.
 [Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. In CoRR.
 [Culotta and Sorensen2004] Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree kernels for relation extraction. In ACL.
 [dos Santos et al.2015] Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In ACL-IJCNLP.
 [Ebrahimi and Dou2015] Javid Ebrahimi and Dejing Dou. 2015. Chain based RNN for relation classification. In NAACL.
 [Gormley et al.2015] Matthew R. Gormley, Mo Yu, and Mark Dredze. 2015. Improved relation extraction with feature-rich compositional embedding models. In EMNLP.
 [Graves et al.2009] A. Graves, Marcus Eichenberger-Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. 2009. A novel connectionist system for unconstrained handwriting recognition. In IEEE Transactions on Pattern Analysis and Machine Intelligence.
 [Grishman et al.2005] Ralph Grishman, David Westbrook, and Adam Meyers. 2005. NYU's English ACE 2005 system description. In ACE 2005 Evaluation Workshop.
 [Hendrickx et al.2010] Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In SemEval.
 [Hinton et al.2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. In CoRR, abs/1207.0580.
 [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. In Neural Computation.
 [Jiang and Zhai2007] Jing Jiang and ChengXiang Zhai. 2007. A systematic exploration of the feature space for relation extraction. In NAACL-HLT.
 [Kalchbrenner et al.2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In ACL.
 [Kambhatla2004] Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction. In ACL.
 [Kim2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.
 [Liu et al.2015] Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015. A dependency-based neural network for relation classification. In ACL-IJCNLP.
 [Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In ICLR.
 [Mnih and Hinton2008] Andriy Mnih and Geoffrey Hinton. 2008. A scalable hierarchical distributed language model. In NIPS.
 [Nguyen and Grishman2014] Thien Huu Nguyen and Ralph Grishman. 2014. Employing word representations and regularization for domain adaptation of relation extraction. In ACL.
 [Nguyen and Grishman2015a] Thien Huu Nguyen and Ralph Grishman. 2015a. Relation extraction: Perspective from convolutional neural networks. In The NAACL Workshop on Vector Space Modeling for NLP (VSM).
 [Nguyen and Grishman2015b] Thien Huu Nguyen and Ralph Grishman. 2015b. Event detection and domain adaptation with convolutional neural networks. In ACL-IJCNLP.
 [Nguyen et al.2009] Truc-Vien T. Nguyen, Alessandro Moschitti, and Giuseppe Riccardi. 2009. Convolution kernels on constituent, dependency and sequential structures for relation extraction. In EMNLP.
 [Nguyen et al.2015c] Thien Huu Nguyen, Barbara Plank, and Ralph Grishman. 2015c. Semantic representations for domain adaptation: A case study on the tree kernel-based method for relation extraction. In ACL-IJCNLP.
 [Pascanu et al.2012] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. In arXiv preprint arXiv:1211.5063.
 [Plank and Moschitti2013] Barbara Plank and Alessandro Moschitti. 2013. Embedding semantic similarity in tree kernels for domain adaptation of relation extraction. In ACL.
 [Qian et al.2008] Longhua Qian, Guodong Zhou, Fang Kong, Qiaoming Zhu, and Peide Qian. 2008. Exploiting constituent dependencies for tree kernel-based semantic relation extraction. In COLING.
 [Socher et al.2012] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrixvector spaces. In EMNLP.
 [Sun et al.2011] Ang Sun, Ralph Grishman, and Satoshi Sekine. 2011. Semi-supervised relation extraction with large-scale word clustering. In ACL.
 [Turian et al.2010] Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In ACL.
 [Xu et al.2015] Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In EMNLP.
 [Yin and Schütze2015] Wenpeng Yin and Hinrich Schütze. 2015. Convolutional neural network for paraphrase identification. In NAACL.
 [Yu et al.2015] Mo Yu, Matthew R. Gormley, and Mark Dredze. 2015. Combining word embeddings and feature embeddings for fine-grained relation extraction. In NAACL.
 [Zeiler2012] Matthew D. Zeiler. 2012. Adadelta: An adaptive learning rate method. In CoRR, abs/1212.5701.
 [Zelenko et al.2003] Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. In Journal of Machine Learning Research.
 [Zeng et al.2014] Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In COLING.
 [Zeng et al.2015] Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP.
 [Zhang and Wang2015] Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. In arXiv:1503.00185.
 [Zhang et al.2006] Min Zhang, Jie Zhang, Jian Su, and GuoDong Zhou. 2006. A composite kernel to extract relations between entities with both flat and structured features. In COLING-ACL.
 [Zhou et al.2005] GuoDong Zhou, Jian Su, Jie Zhang, and Min Zhang. 2005. Exploring various knowledge in relation extraction. In ACL.
 [Zhou et al.2007] GuoDong Zhou, Min Zhang, DongHong Ji, and QiaoMing Zhu. 2007. Tree kernel-based relation extraction with context-sensitive structured parse tree information. In EMNLP-CoNLL.