Information extraction (IE) systems are fundamental to the automatic construction of knowledge bases and ontologies from unstructured text. While important, in and of themselves, these resulting resources can be harnessed to advance other important language understanding applications including knowledge discovery and question answering systems. Among IE tasks are named entity recognition (NER) and binary relation extraction (RE) which involve identifying named entities and relations among them, respectively, where the latter is typically a set of triplets identifying pairs of related entities and their relation types.
We present Figure 1 as an example of the NER and RE problem given the input sentence “Mrs. Tsuruyama is from Yatsushiro in Kumamoto Prefecture in southern Japan.” First, we extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is of type PERSON and the rest are of type LOCATION. Thus, NER consists of identifying both the bounds and type of entities mentioned in the sentence. Once entities are identified, the next step is to extract relation triplets of the form (subject,predicate,object), if any, based on the context; for example, (Mrs. Tsuruyama, LIVE_IN, Yatsushiro) is a relation triple that may be extracted from the example sentence as output of an RE system. Given this, it is clear that RE is a complex problem given the sparse nature of the output space; for a sentence of length with possible relation types, the output is a variable-length set of relations each drawn from possible relation combinations.
NER and RE have been traditionally treated as independent problems to be solved separately and later combined in an ad-hoc manner as part of a pipeline system. End-to-end RE is a relatively new research direction that seeks to model NER and RE jointly in a unified architecture. As these tasks are closely intertwined, joint models that simultaneously extract entities and their relations in a single framework have the capacity to exploit inter-task correlations and dependencies leading to potential performance gains. Moreover, joint approaches, like our method, are better equipped to handle datasets where entity annotations are non-exhaustive (that is, only entities involved in a relation are annotated), since standalone NER systems are not designed to handle incomplete annotations.
Recent advancements in deep learning for end-to-end RE are broadly divided into two categories. (1) The first category involves applying deep learning to the table structure first introduced by Miwa and Sasaki (2014), including Gupta, Schütze, and Andrassy (2016), Pawar, Bhattacharyya, and Palshikar (2017), and Zhang, Zhang, and Fu (2017), where end-to-end RE is reduced to some variant of the table-filling problem such that the $(i,j)$-th cell is assigned a label that represents the relation between tokens at positions $i$ and $j$ in the sentence. We further describe the table-filling problem in Section 3.1. Recent approaches based on the table structure operate on the idea that cell labels are dependent on features or predictions derived from preceding or adjacent cells; hence, the table is filled incrementally, leading to potential efficiency issues. Also, these methods typically require an additional expensive decoding step, involving beam search, to obtain a globally optimal table-wide label assignment. (2) The second category includes models where NER and RE are modeled jointly with shared components or parameters without the table structure. Even state-of-the-art methods not utilizing the table structure rely on conditional random fields (CRFs) as an integral component of the NER subsystem, where the Viterbi algorithm is used to decode the best label assignment at test time Bekoulis et al. (2018b, a).
Our model utilizes the table formulation by embedding features along the third dimension. We overcome efficiency issues by utilizing a more efficient and effective approach for deep feature aggregation such that local metric, dependency, and position based features are simultaneously pooled (in a 3 × 3 cellular window) over many applications of the 2D convolution. Intuitively, preliminary decisions are made at earlier layers and corroborated at later layers. Final label assignments for both NER and RE are made simultaneously via a simple softmax layer. Thus, computationally, our model is expected to improve over earlier efforts without a costly decoding step. We validate our proposed method on the CoNLL04 dataset Roth and Yih (2004) and the ADE dataset Gurulingappa et al. (2012), which correspond to the general English and the biomedical domain respectively, and show that our method improves over prior state-of-the-art in RE. We also show that our approach leads to training and testing times that are seven to ten times faster, where the latter can be critical for time-sensitive end-user applications. Lastly, we perform extensive error analyses and show that our network is visually interpretable by examining the activity of hidden pooling layers (corresponding to intermediate decisions). To our knowledge, our study is the first to perform this type of visual analysis of a deep neural architecture for end-to-end relation extraction. Our code is included as supplementary material and will be made publicly available on GitHub.
2 Related Work
In this section, we provide an overview of three main types of relation extraction methods in the literature: studies that are limited to relation classification, early RE methods that assume known entity bounds, and recent efforts on RE that perform full entity recognition and relation extraction in an end-to-end fashion.
2.1 Relation Classification
The majority of past and current efforts in relation extraction treat the problem as a simpler relation classification problem where pairs of entities are known during test time; the goal is to classify the pair of entities, given the context, as being either positive or negative for a particular type of relation. Many works on relation classification preprocess the input as a dependency parse tree Bunescu and Mooney (2005); Qian et al. (2008) and exploit features corresponding to the shortest dependency path between candidate entities; this general approach has also been successfully applied in the biomedical domain Airola et al. (2008); Fundel, Küffner, and Zimmer (2007); Li et al. (2008); Özgür et al. (2008), where such methods typically involve a graph kernel based Support Vector Machine (SVM) classifier Li et al. (2008); Rink, Harabagiu, and Roberts (2011). The concept of network centrality has also been explored Özgür et al. (2008) to extract gene-disease relations. Other studies, such as the effort by Frunza, Inkpen, and Tran (2011), apply the more traditional bag-of-words approach focusing on syntactic and lexical features while exploring a wide variety of classification algorithms including decision trees, SVMs, and Naïve Bayes.
More recently, innovations in relation extraction have centered around designing meaningful deep learning architectures. Liu et al. (2016) proposed a dependency-based convolutional neural network (CNN) architecture wherein the convolution is applied over words adjacent according to the shortest path connecting the entities in the dependency tree, rather than words adjacent with respect to the order expressed, to detect drug-drug interactions (DDIs). In Kavuluru, Rios, and Tran (2017), ensembling of both character-level and word-level recurrent neural networks (RNNs) is further proposed for improved performance in DDI extraction. Raj, Sahu, and Anand (2017) proposed a deep learning architecture such that word representations are first processed by a bidirectional RNN followed by a standard CNN, with an optional attention mechanism towards the output layer. Zhang, Qi, and Manning (2018) showed that relation extraction performance can be improved by applying graph convolutions over a pruned version of the dependency tree.
2.2 End-to-End Relation Extraction with Known Entity Bounds
Early efforts in end-to-end RE, as covered in this section, assume that entity bounds are known during test time. Hence, the NER aspect of these methods is limited to classifying entity type (e.g., is "President Kennedy" a person, place, or organization?). In a seminal work, Roth and Yih (2004) proposed an integer linear programming (LP) approach to tackle the end-to-end problem. They discovered that the LP component was effective in enhancing classifier results by reducing semantic inconsistencies in the predictions compared to a traditional pipeline wherein the outputs of an NER component are passed as features into the RE component. Their results indicate that there are mutual inter-dependencies between NER and RE as subtasks which can be exploited. The LP technique has also been successfully applied in jointly modeling entities and relations with respect to opinion recognition by Choi, Breck, and Cardie (2006). Kate and Mooney (2010) proposed a similar approach but presented a global inference mechanism induced by building a graph resembling a card-pyramid structure. A dynamic programming algorithm, similar to CYK Jurafsky and Martin (2008) parsing, called card-pyramid parsing is applied along with beam search to identify the most probable joint assignment of entities and their relations based on outputs of local classifiers. Other efforts to this end involve the use of probabilistic graphical models Yu and Lam (2010); Singh et al. (2013).
2.3 End-to-End Relation Extraction
Li and Ji (2014) proposed one of the first truly joint models wherein entities, including entity mention bounds, and their relations are predicted. Structured perceptrons Collins (2002), as a learning framework, are used to estimate feature weights while beam search is used to explore partial solutions to incrementally arrive at the most probable structure. Miwa and Sasaki (2014) proposed the idea of using a table representation which simplifies the task into a table-filling problem such that NER and relation labels are assigned to cells of the table; the aim was to predict the most probable label assignment to the table, out of all possible assignments, using beam search. While the representation is in table form, beam search is performed sequentially, one cell-assignment per step. The table-filling problem for end-to-end RE has since been successfully transferred to the deep neural network setting Gupta, Schütze, and Andrassy (2016); Pawar, Bhattacharyya, and Palshikar (2017); Zhang, Zhang, and Fu (2017).
Other recent approaches not utilizing a table structure involve modeling the entity and relation extraction task jointly with shared parameters Miwa and Bansal (2016); Li et al. (2016); Zheng et al. (2017a); Li et al. (2017); Katiyar and Cardie (2017); Bekoulis et al. (2018b); Zeng et al. (2018). Katiyar and Cardie (2017) and Bekoulis et al. (2018b) specifically use attention mechanisms for the RE component without the need for dependency parse features. Zheng et al. (2017a) operate by reducing the problem to a sequence-labeling task that relies on a novel tagging scheme. Zeng et al. (2018) use an encoder-decoder network such that the input sentence is encoded as a fixed-length vector and decoded to relation triples directly. Most recently, Bekoulis et al. (2018a) found that adversarial training (AT) is an effective regularization approach for RE performance.
We present our version of the table-filling problem, a novel neural network architecture to fill the table, and details of the training process. Here, Greek letter symbols are used to distinguish hyper-parameters from variables that are learned during training.
3.1 The Table-Filling Problem
Given a sentence of length $n$, we use an $n \times n$ table to represent a set of semantic relations such that the $(i,j)$-th cell represents the relationship (or non-relation) between tokens $i$ and $j$. In practice, we assign a tag for each cell in the table such that entity tags are encoded along the diagonal while relation tags are encoded at non-diagonal cells. For entity recognition, we use the BILOU tagging scheme Ratinov and Roth (2009). In the BILOU scheme, B, I, and L tags are used to indicate the beginning, inside, and last token of a multi-token entity respectively. The O tag indicates that the token is outside of an entity span, and U is used for unit-length entities.
In tabular form, entity and relation tags are drawn from a unified list $Z$ serving as the label space; that is, each cell in the table is assigned exactly one tag from $Z$. For simplicity, the O tag is also used to indicate a null relation when occurring outside of the diagonal. As each entity type requires a BILOU variant, a problem with $n_{\text{ent}}$ entity types and $n_{\text{rel}}$ relation types has $|Z| = 4 n_{\text{ent}} + n_{\text{rel}} + 1$, where the last term accounts for the O tag. Our conception of the table-filling problem differs from Miwa and Sasaki (2014) in that we utilize the entire table as opposed to only the lower triangle; this allows us to model directed relations without the need for additional inverse-relation tags. Moreover, we assign relation tags to cells where entity spans intersect instead of where head words intersect; thus encoded relations manifest as rectangular blocks in the proposed table representation. We present a visualization of our table representation in Figure 2. At test time, entities are first extracted, and relations are subsequently extracted by averaging the output probability estimates of the blocks where entities intersect. We describe the exact procedure for extracting relations from these blocks at test time in Section 3.3.
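The table encoding described in this section can be illustrated with a minimal sketch for the example sentence from the introduction. The function names, the tag spelling (for example `B-PERSON`), and the convention that rows index the subject span and columns index the object span are assumptions made for illustration; the text does not fix them.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]    # (start token, end token inclusive, entity type)
Relation = Tuple[int, str, int]  # (subject entity index, relation type, object entity index)


def build_label_space(entity_types: List[str], relation_types: List[str]) -> List[str]:
    """Unified label space: O, then B/I/L/U variants per entity type, then relation tags."""
    tags = ["O"]
    for etype in entity_types:
        tags.extend(f"{prefix}-{etype}" for prefix in "BILU")
    return tags + list(relation_types)


def bilou_tags(start: int, end: int, etype: str) -> List[str]:
    """BILOU tags for an entity covering tokens start..end (inclusive)."""
    if start == end:
        return [f"U-{etype}"]
    return [f"B-{etype}"] + [f"I-{etype}"] * (end - start - 1) + [f"L-{etype}"]


def build_table(n: int, entities: List[Entity], relations: List[Relation]) -> List[List[str]]:
    """Entity tags go on the diagonal; each relation fills the block where two spans intersect."""
    table = [["O"] * n for _ in range(n)]
    for start, end, etype in entities:
        for offset, tag in enumerate(bilou_tags(start, end, etype)):
            table[start + offset][start + offset] = tag
    for subj, rel, obj in relations:
        s_start, s_end, _ = entities[subj]
        o_start, o_end, _ = entities[obj]
        # Assumed convention: rows index the subject span, columns the object span.
        for i in range(s_start, s_end + 1):
            for j in range(o_start, o_end + 1):
                table[i][j] = rel
    return table


tokens = "Mrs. Tsuruyama is from Yatsushiro in Kumamoto Prefecture in southern Japan .".split()
entities = [(0, 1, "PERSON"), (4, 4, "LOCATION"), (6, 7, "LOCATION"), (10, 10, "LOCATION")]
relations = [(0, "LIVE_IN", 1)]
table = build_table(len(tokens), entities, relations)
label_space = build_label_space(
    ["Person", "Location", "Organization", "Other"],
    ["Live_In", "Located_In", "OrgBased_In", "Work_For", "Kill"],
)
```

Because the full table is used, the mirrored block (rows of "Yatsushiro", columns of "Mrs. Tsuruyama") stays tagged O, which is how direction is encoded without inverse-relation tags. With four entity types and five relation types, the sketch yields a label space of 22 tags.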
3.2 Our Model: Relation-Metric Network
We propose a novel neural architecture, which we call the relation-metric network, combining the ideas of metric learning and convolutional neural networks (CNNs) for table filling. The schematic of the network is shown in Figure 3, whose components will be detailed in this section.
3.2.1 Context Embeddings Layer
In addition to word embeddings, we employ character-CNN based representations as commonly observed in recent neural NER models Chiu and Nichols (2016) and RE models Li et al. (2017). Character-based features can capture morphological features and help generalize to out-of-vocabulary words. For the proposed model, such representations are composed by convolving over character embeddings using a window of size 3 to produce a set of feature maps; the feature maps are then max-pooled to produce a fixed-length feature representation for each word. As our approach is standard, we refer readers to Chiu and Nichols (2016) for full details. This portion of the network is illustrated in step of Figure 3.
Suppose the input is a sentence of length $n$ represented by a sequence of word indices $w_1, \ldots, w_n$ into the vocabulary $V^{w}$. Each word is mapped to an embedding vector via an embedding matrix $E^{w} \in \mathbb{R}^{|V^{w}| \times \delta}$, where $\delta$ is a hyperparameter that determines the size of word embeddings. Next, let $C[i]$ be the character-based representation for the $i$-th word. An input sentence is represented by matrix $S$ wherein rows are words mapped to their corresponding embedding vectors; or concretely,
$$S = \begin{pmatrix} E^{w}[w_1] \,\Vert\, C[1] \\ \vdots \\ E^{w}[w_n] \,\Vert\, C[n] \end{pmatrix}$$
where $\Vert$ is the vector concatenation operator and $E^{w}[i]$ is the $i$-th row of $E^{w}$.
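Next, we compose context embedding vectors (CVs), which embed each word of the sentence with additional contextual features. Suppose $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ represent a long short-term memory (LSTM) network composition in the forward and backward direction, respectively, and let $\rho$ be a hyperparameter that determines context embedding size. We feed $S$ to a Bi-LSTM layer of hidden unit size $\rho/2$ such that
$$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(S[i]), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(S[i]), \qquad h_i = \overrightarrow{h}_i \,\Vert\, \overleftarrow{h}_i, \qquad i = 1, \ldots, n,$$
where $S[i]$ is the $i$-th row of $S$ and $h_i$ represents the context centered at the $i$-th word. The output of the Bi-LSTM can be represented as a matrix $H = (h_1, \ldots, h_n)^\top$ such that $H \in \mathbb{R}^{n \times \rho}$. This concludes step of Figure 3.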
3.2.2 Relation-Metric Learning
Our goal is to design a network such that any two CVs can be compared via some “relatedness” measure; that is, we wish to learn a relatedness measure (as a parameterized function) that is able to capture correlative features indicating semantic relationships. A common approach in metric learning to parameterize a relatedness function is to model it in bilinear form. Here, for input vectors $x, z \in \mathbb{R}^{\rho}$ of the same size $\rho$ as the CVs, a similarity function in bilinear form is formally defined as
$$s_R(x, z) = x^\top R \, z \qquad (1)$$
where $R \in \mathbb{R}^{\rho \times \rho}$ is a parameter of the relatedness function, dubbed a relation-metric embedding matrix, that is learned during the training process.
In machine learning research, Eq. 1 is also associated with a type of attention mechanism commonly referred to as “multiplicative” attention Luong, Pham, and Manning (2015). However, we apply Eq. 1 with the classical goal of learning a variety of metric-based features. Our aim is to compute $s_R(h_i, h_j)$ for all pairs of CVs $h_i, h_j$ in the sentence. Concretely, we can compute a “relational-metric table” $H R H^\top \in \mathbb{R}^{n \times n}$ over all pairs of CVs in the sentence, where the rows of $H$ are the CVs, such that its $(i,j)$-th entry is $s_R(h_i, h_j)$. In fact, we can learn a collection of $\kappa$ similarity functions corresponding to $\kappa$ relation-metric tables; for our purposes, this is analogous to learning a diverse set of convolution filters in the context of CNNs. Thus we have the 3-dimensional tensor
$$G[:, :, k] = H R_k H^\top, \qquad k = 1, \ldots, \kappa \qquad (2)$$
with $G \in \mathbb{R}^{n \times n \times \kappa}$ where the first and second dimension correspond to word position indices while the third dimension embeds metric-based features. This constitutes step of Figure 3. We show how $G$ is consumed by the rest of the network in Section 3.2.6. However, as a prerequisite, we first describe how dependency parse and relative position information is prepared in Section 3.2.3 and Section 3.2.4 respectively and define the 2D convolution in Section 3.2.5.
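As a rough illustration of the relation-metric tensor, the following pure-Python sketch evaluates the bilinear form for every pair of context vectors and stacks one table per metric matrix. The function names and the toy values are hypothetical; a practical implementation would use batched matrix products in a tensor library.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def bilinear(x: Vector, R: Matrix, z: Vector) -> float:
    """Bilinear similarity x^T R z for vectors x, z and a square matrix R."""
    return sum(x[a] * R[a][b] * z[b] for a in range(len(x)) for b in range(len(z)))


def relation_metric_tensor(H: Matrix, metrics: List[Matrix]) -> List[List[List[float]]]:
    """Return G with G[i][j][k] = bilinear(H[i], metrics[k], H[j])."""
    n = len(H)
    return [[[bilinear(H[i], R, H[j]) for R in metrics] for j in range(n)] for i in range(n)]


# Toy context vectors (n = 3 words, context size 2) and two metric matrices.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]    # reduces to a plain dot product
asymmetric = [[0.0, 1.0], [0.0, 0.0]]  # scores (i, j) and (j, i) differently
G = relation_metric_tensor(H, [identity, asymmetric])
```

The asymmetric matrix shows why a bilinear form suits directed relations: the score for the pair (0, 1) differs from the score for (1, 0), which a symmetric distance could not express.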
3.2.3 Dependency Embeddings Table
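Let $V^{\text{dep}}$ be the vocabulary of syntactic dependency tags (e.g., nsubj, dobj). For an input sentence, let $\mathcal{D} = \{(a_1, b_1, z_1), \ldots, (a_d, b_d, z_d)\}$ be the set of dependency relations where $z_1, \ldots, z_d$ are mappings to tags in $V^{\text{dep}}$ that express the dependency-based relations between pairs of words at positions $(a_1, b_1), \ldots, (a_d, b_d)$, respectively. We define the dependency embedding matrix as $F^{\text{dep}} \in \mathbb{R}^{|V^{\text{dep}}| \times \beta}$, where each unique dependency tag is a $\beta$-dimensional embedding. We compose the dependency representation tensor $D \in \mathbb{R}^{n \times n \times \beta}$ for $\mathcal{D}$ as
$$D[i, j] = \begin{cases} F^{\text{dep}}[z] & \text{if } (i, j, z) \in \mathcal{D} \text{ or } (j, i, z) \in \mathcal{D}, \\ F^{\text{dep}}[\text{null}] & \text{otherwise,} \end{cases}$$
for $i, j = 1, \ldots, n$, where $F^{\text{dep}}[\text{null}]$ is a trainable embedding vector representing the null dependency relation. As shown in the above equation for $D$, we embed the dependency parse tree simply as an undirected graph.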
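A small sketch of how such a dependency table might be assembled is given below. The arcs are written by hand rather than produced by a parser, and the function name, the `NULL` key, and the embedding values are hypothetical placeholders for trainable parameters.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, int, str]  # (head position, dependent position, dependency tag)


def dependency_tensor(n: int, arcs: List[Arc], embeddings: Dict[str, List[float]],
                      null_tag: str = "NULL") -> List[List[List[float]]]:
    """D[i][j] holds the embedding of the tag linking tokens i and j, in either direction."""
    D = [[list(embeddings[null_tag]) for _ in range(n)] for _ in range(n)]
    for head, dependent, tag in arcs:
        # The parse tree is treated as an undirected graph, so both cells are filled.
        D[head][dependent] = list(embeddings[tag])
        D[dependent][head] = list(embeddings[tag])
    return D


# "She lives here": hand-written arcs for illustration only.
arcs = [(1, 0, "nsubj"), (1, 2, "advmod")]
embeddings = {"NULL": [0.0, 0.0], "nsubj": [0.3, -0.1], "advmod": [-0.2, 0.5]}
D = dependency_tensor(3, arcs, embeddings)
```

Every cell without an arc, including the diagonal, receives the null embedding, so the tensor has a defined value at all positions before it is concatenated with the other feature tensors.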
3.2.4 Position Embeddings Table
First proposed by Zeng et al. (2014), so-called position vectors have been shown to be effective in neural models for relation classification. Position vectors are designed to encode the relative offset between a word and the two candidate entities (for RE) as fixed-length embeddings. We bring this idea to the tabular setting by proposing a position embeddings table , which is composed the same way as the dependencies table; however, instead of dependency tags, we simply encode the distance between two candidate CVs as discrete labels mapped to fixed-length embeddings (of size , a hyperparameter). It is straightforward to see there will be distinct position offset labels where is the maximum length of a sentence in the training data. Specifically, given a position vocabulary , associated position embedding matrix , the position embeddings tensor is for . As an implementation detail, we set to where is the maximum sentence length over all training examples. Both dependency and position embedding tensors are concatenated to the metric tensor (Eq. (2)) along the third dimension prior to every convolution operation. Hence they are shown in steps and of Figure 3 for the network with two convolutional layers.
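One plausible way to realize the position embeddings table is sketched below. It assumes that the encoded distance is the signed offset between the column and row positions, which gives 2m - 1 labels for a maximum sentence length m; the text only states that distances are encoded as discrete labels, so the signed encoding, the function names, and the toy embedding values are all assumptions.

```python
from typing import List


def position_label(i: int, j: int, max_len: int) -> int:
    """Map the signed offset j - i, in [-(max_len - 1), max_len - 1], to a label in [0, 2 * max_len - 2]."""
    return (j - i) + (max_len - 1)


def position_tensor(n: int, max_len: int, embeddings: List[List[float]]) -> List[List[List[float]]]:
    """P[i][j] is the embedding of the offset label for the cell (i, j)."""
    return [[list(embeddings[position_label(i, j, max_len)]) for j in range(n)] for i in range(n)]


max_len = 4                        # hypothetical maximum training sentence length
num_labels = 2 * max_len - 1       # 7 distinct signed offsets under this assumption
embeddings = [[float(k)] for k in range(num_labels)]  # toy 1-dimensional embeddings
P = position_tensor(3, max_len, embeddings)
```

Under this encoding every cell on the same diagonal shares one embedding, so the table tells the convolution how far a cell lies from the entity diagonal and on which side.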
3.2.5 2D Convolution Operation
Unlike the standard 2D convolution typically used in NLP tasks, which takes 2D input, our 2D convolution operates on 3D input commonly seen in computer vision tasks where colored image data has height, width, and an additional dimension for color channel. The goal of the 2D convolution is to pool information within a $3 \times 3$ window along the first two dimensions such that metric features and dependency/positional information of adjacent cells are pooled locally over several layers. However, it is necessary to perform a padded convolution to ensure that dimensions corresponding to word positions are not altered by the convolution. We denote this padding transformation using the hat accent. That is, for some tensor input $X \in \mathbb{R}^{n \times n \times c}$, the padded version is $\hat{X} \in \mathbb{R}^{(n+2) \times (n+2) \times c}$ and the zero-padding exists at the beginning and at the end of the first and second dimensions. Next, we define the 2D convolution operation via the $\circledast$ operator which corresponds to an element-wise product of two tensors followed by summation over the products; formally, for two input tensors $A$ and $B$ of the same shape, $A \circledast B = \sum_{i,j,k} A[i,j,k] \, B[i,j,k]$.
Now our 2D convolution step is a tensor map $f : \mathbb{R}^{n \times n \times c} \to \mathbb{R}^{n \times n \times c'}$ with $c'$ filters of size $3 \times 3 \times c$, defined as
$$f(X)[i, j, k] = W^{k} \circledast \hat{X}[i{:}i{+}2, \; j{:}j{+}2, \; 1{:}c] + b^{k}$$
for $i, j = 1, \ldots, n$, where $W^{k} \in \mathbb{R}^{3 \times 3 \times c}$ and $b^{k} \in \mathbb{R}$, for $k = 1, \ldots, c'$, are filter variables and bias terms respectively, and $\hat{X}[i{:}i{+}2, \; j{:}j{+}2, \; 1{:}c]$ is a window of $\hat{X}$ from $i$ to $i+2$ along the first dimension, $j$ to $j+2$ along the second dimension, and $1$ to $c$ along the final dimension. We show how $f$ is used to repeatedly pool contextual information in Section 3.2.6. Instead of a $3 \times 3$ window, the convolution operation can be over any $t \times t$ window for some odd $t$, where large values lead to larger parameter spaces and more multiplication operations. The 2D convolution is illustrated in Figure 4 and manifests in steps and of Figure 3.
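The padded convolution can be written out explicitly as a naive loop. This sketch is meant only to make the windowing and zero-padding concrete; the function names are hypothetical and a real model would rely on an optimized convolution routine from a deep learning library.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # indexed as [row][column][channel]


def zero_pad(X: Tensor3, pad: int) -> Tensor3:
    """Surround the first two dimensions of X with `pad` layers of zeros."""
    rows, cols, channels = len(X), len(X[0]), len(X[0][0])
    padded = [[[0.0] * channels for _ in range(cols + 2 * pad)] for _ in range(rows + 2 * pad)]
    for i in range(rows):
        for j in range(cols):
            padded[i + pad][j + pad] = list(X[i][j])
    return padded


def conv2d(X: Tensor3, filters: List[Tensor3], biases: List[float]) -> Tensor3:
    """Apply each t x t x c filter at every cell; the output keeps the first two dimensions of X."""
    t = len(filters[0])            # odd window size, e.g. 3
    Xp = zero_pad(X, t // 2)
    rows, cols, channels = len(X), len(X[0]), len(X[0][0])
    out = [[[0.0] * len(filters) for _ in range(cols)] for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k, (W, b) in enumerate(zip(filters, biases)):
                total = b
                # Element-wise product of the filter and the window, then summation.
                for di in range(t):
                    for dj in range(t):
                        for ch in range(channels):
                            total += W[di][dj][ch] * Xp[i + di][j + dj][ch]
                out[i][j][k] = total
    return out


X = [[[1.0] for _ in range(3)] for _ in range(3)]             # 3 x 3 input, one channel
ones_filter = [[[1.0] for _ in range(3)] for _ in range(3)]   # sums the 3 x 3 window
Y = conv2d(X, [ones_filter], [0.0])
```

With an all-ones filter the output counts how many real (unpadded) cells fall inside each window: 9 in the center, 6 on an edge, and 4 in a corner, while the table size stays 3 x 3.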
3.2.6 Pooling Mechanism
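Central to our architecture is the iterative pooling mechanism designed so that preliminary decisions are made in early iterations and further corroborated in subsequent iterations. It also facilitates the propagation of local metric and dependency/positional features to neighboring cells. Let $Z$ be the set of tags for the target task. We denote hyper-parameters $\kappa$ and $\lambda$ as the number of channels and the number of CNN layers respectively, where $\kappa$ is the same hyperparameter previously defined to represent the size of metric-based features. The pooling layers are defined recursively with base case $L^{(0)} = G$ and
$$L^{(\ell)} = \begin{cases} \mathrm{relu}\big(f^{(\ell)}(L^{(\ell-1)} \,\Vert\, D \,\Vert\, P)\big) & \text{for } \ell = 1, \ldots, \lambda - 1, \\ f^{(\ell)}(L^{(\ell-1)} \,\Vert\, D \,\Vert\, P) & \text{for } \ell = \lambda, \end{cases}$$
where $G$ is the metric tensor, $D$ and $P$ are the dependency and position embedding tensors concatenated along the third dimension, $f^{(\ell)}$ is the 2D convolution of the $\ell$-th layer (with $\kappa$ output channels for intermediate layers and $|Z|$ output channels for the last layer), and relu is the linear rectifier activation function. Here, $\kappa$ and $\lambda$ determine the breadth and depth of the architecture. A higher $\lambda$ corresponds to a larger receptive field when making final predictions. For example at $\lambda = 1$, the decision at some cell is informed by its immediate neighbors with a receptive field of $3 \times 3$. However, at $\lambda = 2$, decisions are informed by all adjacent neighbors in a $5 \times 5$ window. The last layer, $L^{(\lambda)}$, is the output layer immediately prior to application of the softmax function. Given the architecture in Figure 3 with two convolutional layers, the convolve-and-pool operation is applied twice, indicated as steps and in the figure.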
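The growth of the receptive field with depth follows from simple arithmetic, sketched here for stacked stride-1 convolutions with an odd window size. The function name is hypothetical.

```python
def receptive_field(num_layers: int, window: int = 3) -> int:
    """Side length of the table region that informs one output cell after stacked stride-1 convolutions."""
    if window % 2 == 0:
        raise ValueError("window size is assumed to be odd")
    # Each layer extends the field by (window - 1) / 2 cells on every side.
    return num_layers * (window - 1) + 1


one_layer = receptive_field(1)   # 3, i.e. a 3 x 3 neighborhood
two_layers = receptive_field(2)  # 5, i.e. a 5 x 5 neighborhood
```

This is why depth, rather than window size, is the economical way to widen context: each extra 3 x 3 layer adds two cells to the side length while reusing the same small filter shape.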
3.2.7 Softmax Output Layer
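Given the output layer $L^{(\lambda)}$, we apply the softmax function along the third dimension to obtain a categorical distribution tensor $\hat{Y} \in \mathbb{R}^{n \times n \times |Z|}$ over output tags for each word position pair such that
$$\hat{Y}[i, j, k] = \frac{\exp\big(L^{(\lambda)}[i, j, k]\big)}{\sum_{k'=1}^{|Z|} \exp\big(L^{(\lambda)}[i, j, k']\big)},$$
where $\hat{Y}[i, j, k]$ is the probability estimate of the pair of words at position $i$ and $j$ being assigned the $k$-th tag. This constitutes the final step of the network (Figure 3). Suppose $Y$ represents the corresponding one-hot encoded ground truth along the third dimension such that $Y \in \{0, 1\}^{n \times n \times |Z|}$. Then the example-based loss is obtained by summing the categorical cross-entropy loss over each cell in the table, normalized by the number of words in the sentence; that is,
$$\mathcal{L}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{|Z|} Y[i, j, k] \, \log \hat{Y}[i, j, k]$$
where $\theta$ is the network parameter set. During training, the loss is computed per example and averaged along the mini-batch dimension.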
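The per-example loss described above can be sketched in a few lines. Gold tags are given as indices rather than one-hot vectors, which is equivalent for cross-entropy; the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one cell's tag scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def table_loss(logits: List[List[List[float]]], gold: List[List[int]]) -> float:
    """Sum the cross-entropy over all n x n cells, then divide by the sentence length n."""
    n = len(logits)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            probs = softmax(logits[i][j])
            loss -= math.log(probs[gold[i][j]])
    return loss / n


# Uniform scores over 4 tags for a 2-word sentence: every cell contributes log(4).
logits = [[[0.0] * 4 for _ in range(2)] for _ in range(2)]
gold = [[0, 1], [2, 3]]
loss = table_loss(logits, gold)
```

Dividing by n rather than by the n squared cells means that the loss of a sentence still grows linearly with its length, so longer sentences contribute more to each update.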
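While we learn concrete tags during training, the process for extracting predictions is slightly more nuanced. Entity spans are straightforwardly extracted by decoding BILOU tags along the diagonal. However, RE is based on “ensembling” the cellular outputs of the table where entity spans intersect. For entities $a$ and $b$ represented by their starting and ending offsets, $(a_{\text{start}}, a_{\text{end}})$ and $(b_{\text{start}}, b_{\text{end}})$, the relation between them is the label computed as
$$\arg\max_{k} \; \frac{1}{(a_{\text{end}} - a_{\text{start}} + 1)(b_{\text{end}} - b_{\text{start}} + 1)} \sum_{i = a_{\text{start}}}^{a_{\text{end}}} \; \sum_{j = b_{\text{start}}}^{b_{\text{end}}} \hat{Y}[i, j, k]$$
which indexes a tag in the label space $Z$, where $\hat{Y}[i, j, k]$ is the probability estimate of the $k$-th tag at cell $(i, j)$.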
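The block-averaging step for relation decoding might look as follows. The sketch takes the argmax over the whole label space, which is an assumption (an implementation could restrict it to relation tags and O), and the function name and the probabilities are made up for illustration.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start token, end token), both inclusive


def decode_relation(probs: List[List[List[float]]], subject: Span, obj: Span,
                    tags: Sequence[str]) -> str:
    """Average the tag distributions over the block where two spans intersect, then take the argmax."""
    s_start, s_end = subject
    o_start, o_end = obj
    block = [probs[i][j]
             for i in range(s_start, s_end + 1)
             for j in range(o_start, o_end + 1)]
    averaged = [sum(cell[k] for cell in block) / len(block) for k in range(len(tags))]
    best = max(range(len(tags)), key=lambda k: averaged[k])
    return tags[best]


tags = ["O", "LIVE_IN"]
probs = [[[0.5, 0.5] for _ in range(3)] for _ in range(3)]
probs[0][2] = [0.4, 0.6]   # a two-token subject span (tokens 0-1) against object token 2
probs[1][2] = [0.3, 0.7]
probs[2][0] = [0.9, 0.1]   # the mirrored block encodes the reverse direction
probs[2][1] = [0.8, 0.2]
forward = decode_relation(probs, (0, 1), (2, 2), tags)
backward = decode_relation(probs, (2, 2), (0, 1), tags)
```

Averaging over the block lets every cell of a multi-token entity pair vote, so a single noisy cell is less likely to flip the predicted relation.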
4 Experimental Setup
In this section, we describe the established evaluation method, the datasets used for training and testing, and the configuration of our model. We note that the computing hardware is controlled across experiments given we report training and testing run times. Specifically, we used the Amazon AWS EC2 p2.xlarge instance which supports the NVIDIA Tesla K80 GPU with 12 GB memory.
4.1 Evaluation Metrics
We use the well-known F1 measure (along with precision and recall) to evaluate NER and RE subtasks as in prior work. For NER, a predicted entity is treated as a true positive if it is exactly matched to an entity in the ground truth based on both character offsets and entity type. For RE, a predicted relation is treated as a true positive if it is exactly matched to a relation in the ground truth based on subject/object entities and relation type. As relation extraction performance directly subsumes NER performance, we focus purely on relation extraction performance as the primary evaluation metric of this study.
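Exact-match scoring of relations can be sketched with set intersection. Each relation is represented here as a tuple of subject entity, relation type, and object entity, with entities given by hypothetical token offsets and type; this representation and the function name are assumptions, and how counts are pooled across sentences is not specified here.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(predicted: Iterable[Hashable],
                        gold: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between predicted and gold items."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


gold = [
    ((0, 1, "Person"), "Live_In", (4, 4, "Location")),
    ((4, 4, "Location"), "Located_In", (6, 7, "Location")),
]
predicted = [
    ((0, 1, "Person"), "Live_In", (4, 4, "Location")),
    ((4, 4, "Location"), "Located_In", (6, 6, "Location")),  # wrong object bounds
]
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

The second prediction has the right relation type but a truncated object span, so it counts as both a false positive and a missed gold relation. This is the sense in which the relation metric subsumes NER performance.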
4.2 Datasets
We use the dataset originally released by Roth and Yih (2004) with 1441 examples consisting of news articles from outlets such as WSJ and AP. The dataset has four entity types (Person, Location, Organization, and Other) and five relation types (Live_In, Located_In, OrgBased_In, Work_For, and Kill). We report results based on training/testing on the same train-test split as established by Gupta, Schütze, and Andrassy (2016), Adel and Schütze (2017), and Bekoulis et al. (2018b, a), which consists of 910 training, 243 development, and 288 testing instances.
We also validate our method on the Adverse Drug Events (ADE) dataset from Gurulingappa et al. (2012) for extracting drug-related adverse effects from medical text. Here, the only entity types are Drug and Disease and the relation extraction task is strictly binary (i.e., Yes/No w.r.t. the ADE relation). The examples come from 1644 PubMed abstracts and are divided into two partitions: the first partition of 6821 sentences contains at least one drug/disease pair while the second partition of 16695 sentences contains no drug/disease pairs. As with prior work Li et al. (2016, 2017); Bekoulis et al. (2018b, a), we only use examples from the first partition, from which 120 relations with nested entity annotations (such as “lithium intoxication” where lithium and lithium intoxication are the drug/disease pair) are removed. Since sentences are duplicated for each pair of drug/disease mentions in the original dataset, when collapsed on unique sentences, the final dataset used in our experiments consists of 4271 sentences in total. Given there are no official train-test splits, we report results based on 10-fold cross-validation, where results are based on averaging performance across the ten folds, as in prior work.
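A fixed 10-fold partition of the 4271 ADE sentences could be produced as sketched below. The shuffling seed, the function name, and the assignment of sentences to folds are hypothetical; the actual folds used in the experiments are not described in this section.

```python
import random
from typing import List


def k_fold_indices(num_examples: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle example indices once with a fixed seed and deal them into k near-equal folds."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


folds = k_fold_indices(4271, k=10, seed=0)
fold_sizes = [len(fold) for fold in folds]
# For each round, one fold is held out for testing and the remaining nine are used for training.
held_out = folds[0]
training = [index for fold in folds[1:] for index in fold]
```

Fixing the seed keeps the folds identical across models, which is what makes per-fold results comparable between systems.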
4.3 Model Configuration
We tuned our model on the CoNLL04 development set; the corresponding configuration of our model (including hyperparameter values) used in our main experiments is shown in Table 1. For the ADE dataset, we used Word2Vec embeddings pretrained on the corpus of PubMed abstracts Pyysalo et al. (2013). For the CoNLL04 dataset, we used GloVe embeddings pretrained on Wikipedia and Gigaword Pennington, Socher, and Manning (2014). All other variables are initialized using values drawn from a normal distribution and further tuned during training. Words were tokenized on both spaces and punctuation; punctuation tokens were kept, as is common practice for NER systems. For part-of-speech and dependency parsing, we use the well-known tool spaCy (https://spacy.io/). For both datasets, we used projective dependency parses produced from the default pretrained English models. We found that using models pretrained on biomedical text (namely, the GENIA Kim et al. (2003) corpus) did not improve performance on the ADE dataset.
Early experiments showed that applying exponential decay to the learning rate in conjunction with batch normalization Ioffe and Szegedy (2015) is essential for stable/effective learning for this particular architecture. We apply exponential decay to the learning rate such that it is roughly halved every 10 epochs; concretely, where is the base learning rate and is the rate at the th epoch. We apply dropout Srivastava et al. (2014) on for as regularization at the earlier layers. However, dropout had a detrimental impact when applied to later layers. We instead apply batch normalization as a form of regularization on representations and for . We optimize the objective loss using RMSProp Tieleman and Hinton (2012) with a relatively high initial learning rate of 0.005 given exponential decay is used.
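A learning-rate schedule of this kind can be sketched as follows. The exact decay constant is not recoverable from the text, so this version assumes the rate is halved exactly every 10 epochs; the function name and the `half_life` parameter are hypothetical.

```python
def decayed_learning_rate(base_rate: float, epoch: int, half_life: float = 10.0) -> float:
    """Exponential decay that halves the base rate every `half_life` epochs (an assumed constant)."""
    return base_rate * 0.5 ** (epoch / half_life)


base_rate = 0.005                        # initial learning rate stated in the text
schedule = [decayed_learning_rate(base_rate, epoch) for epoch in range(0, 31, 10)]
```

Starting from 0.005, this assumed schedule gives 0.0025 after 10 epochs and 0.00125 after 20, which is why a relatively high initial rate remains workable.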
5 Results and Discussion
| Model | NER P (%) | NER R (%) | NER F1 (%) | RE P (%) | RE R (%) | RE F1 (%) | Avg. Epoch Train Time | Avg. Test Time |
|---|---|---|---|---|---|---|---|---|
| Table Representation Miwa and Sasaki (2014) | 81.20 | 80.20 | 80.70 | 76.00 | 50.90 | 61.00 | - | - |
| Multihead Bekoulis et al. (2018b) | 83.75 | 84.06 | 83.90 | 63.75 | 60.43 | 62.04 | - | - |
| Multihead with AT Bekoulis et al. (2018a) | - | - | 83.61 | - | - | 61.95 | - | - |
| Replicating Multihead with AT Bekoulis et al. (2018a) | 84.36 | 85.80 | 85.07 ± 0.26 | 65.81 | 57.59 | 61.38 ± 0.50 | 614 sec | 34 sec |
| Relation-Metric (Ours) | 84.46 | 84.67 | 84.57 ± 0.29 | 67.97 | 58.18 | 62.68 ± 0.46 | 101 sec | 4.5 sec |

Table 2: Results comparing to other methods on the CoNLL04 dataset. We report 95% confidence intervals around the mean F1 over 30 runs for models in the last two rows. Our model was tuned on the CoNLL04 development set corresponding to the configuration from Table 1. Results in the last two rows are directly comparable given the same train-test splits, pretrained word embeddings, and computing hardware. Average test time is per test set of 288 examples; dependency parsing accounts for approximately 0.5 second of our reported test time.
| Model | NER P (%) | NER R (%) | NER F1 (%) | RE P (%) | RE R (%) | RE F1 (%) | Avg. Epoch Train Time | Avg. Test Time |
|---|---|---|---|---|---|---|---|---|
| Neural Joint Model Li et al. (2016) | 79.50 | 79.60 | 79.50 | 64.00 | 62.90 | 63.40 | - | - |
| Neural Joint Model Li et al. (2017) | 82.70 | 86.70 | 84.60 | 67.50 | 75.80 | 71.40 | - | - |
| Multihead Bekoulis et al. (2018b) | 84.72 | 88.16 | 86.40 | 72.10 | 77.24 | 74.58 | - | - |
| Multihead with AT Bekoulis et al. (2018a) | - | - | 86.73 | - | - | 75.52 | - | - |
| Replicating Multihead with AT Bekoulis et al. (2018a) | 85.76 | 88.17 | 86.95 | 74.43 | 78.45 | 76.36 | 1567 sec | 40 sec |
| Relation-Metric (Ours) | 86.16 | 88.08 | 87.11 | 77.36 | 77.25 | 77.29 | 134 sec | 4.5 sec |

Results in the last two rows are directly comparable given the same fixed 10-fold splits, pretrained word embeddings, and computing hardware. Average test time is per test set of 427 examples; dependency parsing accounts for approximately 0.5 second of our reported test time.
We report our main results in Tables 2 and 3 for the CoNLL04 and ADE datasets respectively. As a baseline, we replicate the prior best models Bekoulis et al. (2018a) for both datasets based on publicly available source code (https://github.com/bekou/multihead_joint_entity_relation_extraction). Unlike prior work, which reports performance based on a single run, we report the 95% confidence interval around the mean F1 based on 30 runs with differing seed values for the CoNLL04 dataset. For the ADE dataset, we instead report the mean performance over 10-fold cross-validation so that results are comparable to established work. These experiments were performed using the same splits, pretrained embeddings, and computing hardware; hence, results are directly comparable.
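A 95% confidence interval around a mean over 30 runs can be computed as sketched below, assuming the interval is based on the Student's t distribution (the section does not state which interval construction was used). The critical value 2.045 is the two-sided 95% value for 29 degrees of freedom, and the scores are synthetic placeholders rather than results from any experiment.

```python
import math
import statistics
from typing import Sequence, Tuple

T_CRITICAL_95_DF29 = 2.045  # two-sided 95% Student's t critical value for 30 runs


def mean_confidence_interval(scores: Sequence[float], t_critical: float) -> Tuple[float, float]:
    """Return the sample mean and the half-width of its confidence interval."""
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, t_critical * standard_error


# Synthetic F1 scores for 30 hypothetical runs, for illustration only.
scores = [62.0 + 0.1 * (run % 5) for run in range(30)]
mean, half_width = mean_confidence_interval(scores, T_CRITICAL_95_DF29)
report = f"{mean:.2f} ± {half_width:.2f}"
```

The critical value depends on the number of runs, so a different run count would need a different constant.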
We make the following observations based on our results from Table 2. Both our model and the model from Bekoulis et al. (2018a) tend to skew heavily towards precision. However, our method improves on both precision and recall, and by over 1% F1 on relation extraction where improvements are statistically significant based on the two-tailed Student’s t-test. We note that our model performs slightly worse when evaluated purely on NER. We contend this is a worthwhile trade-off given our model is tuned purely on relation extraction and the relation extraction metric, being end-to-end, indirectly accounts for NER performance. Based on Table 3, when tested on the ADE dataset, our method improves over prior best results by approximately 1% F1 for RE on average. While the prior best skews toward recall in this case, our method exhibits better balance of precision and recall. Based on run time results, we contend that our method is more computationally efficient given training and testing times are nearly seven times lower on the CoNLL04 and ten times lower on the ADE set when compared to prior efforts. We note that dependency parsing accounts for approximately one-half second of our testing time. While training time may not be crucial in most settings, we argue that fast and efficient predictions are important for many end-user applications.
As an auxiliary experiment, we tested the potential for integrating adversarial training (AT) with our model; however, there were no performance gains even with extensive tuning. On the CoNLL04 dataset, our method with AT performs at 62.26% F1, compared to 62.68% without AT. On the ADE dataset, our method performs at 76.83% F1 with AT, compared to 77.29% without AT. Given this, we have elected not to include AT evaluations as part of our main results.
Comparison with More Prior Efforts
Gupta, Schütze, and Andrassy (2016), Adel and Schütze (2017), and Zhang, Zhang, and Fu (2017) also experimented with the CoNLL04 dataset; however, Gupta, Schütze, and Andrassy (2016) evaluate with a more relaxed metric for matching entity bounds, while Adel and Schütze (2017) assume entity bounds are known at test time, thus treating the NER aspect as a simpler entity classification problem. Of the three studies, results from Zhang, Zhang, and Fu (2017) are most comparable given that they consider entity bounds in their evaluations; however, their results are based on a random 80%–20% split of the train and test set. As we use established splits based on prior work, the two results are not directly comparable.
5.1 Ablation Analysis
|Model||CoNLL04 (Relation) P (%)||CoNLL04 (Relation) R (%)||CoNLL04 (Relation) F (%)||ADE (Relation) P (%)||ADE (Relation) R (%)||ADE (Relation) F (%)|
|– Character-based Input||67.30||52.69||59.09||76.73||76.44||76.58|
|– Dependency Embeddings||66.56||57.69||61.78||75.79||77.16||76.45|
|– Position Embeddings||68.57||57.34||62.43||75.94||76.62||76.27|
|– Pretrained Word Embeddings||62.33||46.09||52.96||72.50||71.41||71.91|
We report ablation analysis results in Table 4 using our best model as the baseline. We note that the model hyperparameters were tuned on the CoNLL04 development set. Character-based and dependency-based features both had a notable impact on performance for either dataset. On the other hand, while position embeddings had a positive effect on the ADE dataset, performance gains were negligible when testing on CoNLL04. For the CoNLL04 dataset, we find that character-based features had little effect on precision while improving recall substantially.
Unsurprisingly, pretrained word embeddings had the greatest impact on performance in terms of both precision and recall. Early experiments showed that, unlike models from prior work that used static word embeddings (Li et al., 2017; Bekoulis et al., 2018a), our model benefits from trainable word embeddings as shown in Figure 5. Here, trainable word embeddings with downscaled gradients refers to reducing the gradient of word embeddings by a factor of 10 at each training step.
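The idea of downscaled gradients can be illustrated with a small sketch. It assumes plain stochastic gradient descent and toy parameter values; the function name and numbers are hypothetical, and with an adaptive optimizer the effect of scaling a gradient would differ from what is shown here.

```python
def sgd_update(values, grads, learning_rate, grad_scale=1.0):
    """One plain SGD step; grad_scale < 1 shrinks the effective gradient."""
    return [v - learning_rate * grad_scale * g for v, g in zip(values, grads)]


# Hypothetical parameters and gradients for one training step.
word_embedding = [0.50, -0.20, 0.10]
other_weights = [0.50, -0.20, 0.10]
grads = [1.0, 1.0, -2.0]

# Word embeddings receive gradients reduced by a factor of 10,
# while all other parameters are updated with the full gradient.
word_embedding = sgd_update(word_embedding, grads, learning_rate=0.1, grad_scale=0.1)
other_weights = sgd_update(other_weights, grads, learning_rate=0.1)

print(word_embedding)
print(other_weights)
```

Under this assumption, the pretrained vectors still adapt to the task, but each step moves them one-tenth as far as the other parameters.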
5.2 Error Analyses
In this section, we first perform a class based analysis where performance variations for different classes of examples are examined. Then, a more in-depth error analysis is performed for interesting example cases. The class based analyses entail partitioning examples by length, entity distance, and relation type and are covered in Section 5.2.1. The more in-depth example based analysis is discussed in Section 5.2.2.
5.2.1 Class based analyses
Long sentences are a natural source of difficulty for relation extraction models given the potential for long-term dependencies. In this section, we perform a straightforward analysis by conducting experiments to assess model performance with respect to increasing sentence length. For this experiment, we train a single model using 80% of the dataset with 20% held out for testing. For a given sentence length limit, we evaluate on the subset of the overall test set that includes only examples with a sentence length less than or equal to that limit.
Results from these experiments are plotted in Figures 6 and 7, for the CoNLL04 and ADE datasets respectively, such that the length limit is varied along the horizontal axis. The top graph displays performance, while the bottom graph plots the number of examples with sentence length less than or equal to the limit that are used for evaluation. As shown, performance for both NER and RE tends to decline as longer sentences are added to the evaluation set. Unsurprisingly, relation extraction is more susceptible to long sentences compared to entity recognition. While there is a decline in both relation extraction precision and recall, we note that recall drops at a faster rate with respect to maximum sentence length, and this phenomenon is apparent for both datasets.
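The length-limited evaluation can be sketched as follows. This is an illustrative outline rather than our evaluation code: the example records, the field names, and the choice to measure length in tokens are all assumptions.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate_up_to_length(examples, max_len):
    """Micro-averaged P/R/F1 over examples whose length is at most max_len."""
    subset = [ex for ex in examples if ex["length"] <= max_len]
    tp = sum(len(ex["gold"] & ex["pred"]) for ex in subset)
    fp = sum(len(ex["pred"] - ex["gold"]) for ex in subset)
    fn = sum(len(ex["gold"] - ex["pred"]) for ex in subset)
    return len(subset), prf(tp, fp, fn)


# Hypothetical test examples; relations are (subject, type, object) triples.
examples = [
    {"length": 10, "gold": {("A", "Kill", "B")}, "pred": {("A", "Kill", "B")}},
    {"length": 30,
     "gold": {("C", "Live_In", "D"), ("E", "Work_For", "F")},
     "pred": {("C", "Live_In", "D"), ("E", "Live_In", "F")}},
]

for limit in (10, 20, 30):
    count, (p, r, f) = evaluate_up_to_length(examples, limit)
    print(limit, count, round(p, 3), round(r, 3), round(f, 3))
```

Because each limit includes every shorter sentence, the evaluation subsets are nested and the example count is non-decreasing in the limit.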
|Entity Distance||CoNLL04 (Relation) # of Examples||CoNLL04 (Relation) P (%)||CoNLL04 (Relation) R (%)||CoNLL04 (Relation) F (%)||ADE (Relation) # of Examples||ADE (Relation) P (%)||ADE (Relation) R (%)||ADE (Relation) F (%)|
|0-20||207||83.70||43.80||57.51||447||88.50||42.02||56.98|
|20-40||51||59.09||24.07||34.21||265||77.17||35.51||48.64|
|40-60||43||80.00||18.60||30.19||181||78.72||37.00||50.34|
|60-80||22||100.00||25.93||41.18||125||82.35||29.58||43.52|
|80-100||13||100.00||15.38||26.67||91||85.00||34.00||48.57|
In addition to length-based analysis, we also conducted experiments to study the variation in relation extraction performance with respect to the distance between subject and object entities as shown in Table 5. We measure distance by computing the absolute character offset between the last character of the first occurring entity and first character of the second occurring entity, which is henceforth simply referred to as “entity distance.” Our results show that, at least on the CoNLL04 dataset, notable performance differences occur at the boundary cases; i.e., very short range relations (0-20 entity distance) tend to be easier and very long range relations (80-100 entity distance) tend to be harder (mostly due to changes in recall). For the ADE dataset, performance is similar across all partitions of entity distances. This is surprising, as sentence length appears to have a more notable impact on relation extraction performance than entity distance for this particular architecture.
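The entity distance defined above can be computed as in the following sketch. The span representation (character offsets with an exclusive end), the half-open bucket boundaries, and the example sentence are assumptions chosen for illustration.

```python
def entity_distance(span_a, span_b):
    """Absolute character offset between the last character of the first
    occurring entity and the first character of the second occurring entity.

    Spans are (start, end) character offsets with end exclusive (an assumption).
    """
    first, second = sorted([span_a, span_b])
    last_char_of_first = first[1] - 1
    first_char_of_second = second[0]
    return abs(first_char_of_second - last_char_of_first)


def distance_bucket(distance, width=20):
    """Map a distance to a half-open bucket [low, low + width)."""
    low = (distance // width) * width
    return (low, low + width)


def find_span(sentence, mention):
    """Locate the first occurrence of a mention (hypothetical helper)."""
    start = sentence.index(mention)
    return (start, start + len(mention))


sentence = "Jack Ruby murdered Lee Harvey Oswald."
subject = find_span(sentence, "Jack Ruby")
obj = find_span(sentence, "Lee Harvey Oswald")

dist = entity_distance(subject, obj)
print(dist, distance_bucket(dist))
```

The function is symmetric in its arguments, so it does not matter whether the subject or the object occurs first in the sentence.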
|Relation Type||# of Examples||Avg. Entity Distance||P (%)||R (%)||F (%)|
Table 6 shows variance in performance when examined by relation type. Here, we see that performance depends heavily on the type of relation being extracted; our model exhibits much higher accuracy on the Kill relation at 80% F1, with Located_In and Work_For being the most difficult with performance below 60% F1. These results further corroborate our analysis based on Table 5 that entity distance does not correlate with example difficulty given that the Kill relation, being the easiest relation to extract, occurs with the highest average entity distance.
5.2.2 Example based analysis
A common source of difficulty is ambiguity with respect to expression of the Live_In and Work_For relation types. For example, consider the sentence “After buying the shawl for $1,600, Darryl Breniser of Blue Ball, said the approximately 2-by-5 foot shawl was worth the money.” The ground truth relation is (Darryl Breniser, Live_In, Blue Ball), which indicates that “Blue Ball” is in fact a location. However, it is difficult to assess whether “Blue Ball” is a location or a company based on the context alone and without broader geographical knowledge (even for humans). Our model predicted (Darryl Breniser, Work_For, Blue Ball) in this case. We observe a similar pattern in the following case: “Santa Monica artist Tom Van Sant said Monday after the 23-foot-tall statue was found crushed and broken in pieces.”; here, we see the same phenomenon where our model mistakes (Tom Van Sant, Live_In, Santa Monica) for (Tom Van Sant, Work_For, Santa Monica). Finally, we present the most interesting example of this type of ambiguity in the sentence: “‘Temperatures didn’t get too low, but the wind chill was bad’, said Bingham County Sheriff’s Lt. Bill Gordon.” Here, the ground truth indicates that the only relation to be extracted is (Bill Gordon, Live_In, Bingham County); however, our model extracts (Bill Gordon, Work_For, Bingham County Sheriff), which is also technically a valid relation. Such cases present ambiguities that are also difficult for human annotators; here, imbuing the NER component with external knowledge or learning based on a broader level of context may alleviate these types of errors.
Inconsistencies in the way entities are annotated can also cause issues when it comes to demarcating names that are accompanied by honorifics or titles. For example, some ground truth annotations will include the title, such as “President Park Chung-hee” or “Sen. Bob Dole”, and other cases will leave out the title, such as “Kennedy” instead of “President Kennedy.” These ground truth annotations are inconsistent and present a source of difficulty for the model during training and testing. For example, “Navy spokeswoman Lt. Nettie Johnson was unable to say immediately whether the aircraft had experienced problems from faulty check and drain valves.” Here, our model extracted (Lt. Nettie Johnson, Work_For, Navy), while the ground truth is (Nettie Johnson, Work_For, Navy); while both are technically correct, the extremely precise nature of the evaluation metric causes this prediction to be considered a false positive.
We also see such issues with annotation at the relation extraction stage; for example, consider the sentence “In 1964, a jury in Dallas found Jack Ruby guilty of murdering Lee Harvey Oswald, the accused assassin of President Kennedy.” Figure 8 shows the internal activity of the network as it attempts to extract entities and relations from this particular example. Here, the ground truth annotation includes (Lee Harvey Oswald, Kills, President Kennedy), which our model fails to recognize; we instead obtain the prediction (Jack Ruby, Kills, Lee Harvey Oswald), which is a valid relation missed by the ground truth. In fact, it can be argued that the latter relation is a stronger manifestation of the “Kill” relation based on the linguistic context, as evidenced by the trigger phrase “found [..] guilty of murdering”. We note that our model is able to detect (Lee Harvey Oswald, Kills, President Kennedy) as shown in the center-bottom heatmap of Figure 8; however, signals were not strong enough to warrant a concrete extraction of the relation.
In the ADE dataset, we mostly observe issues with entity recognition where boundaries of noun phrases are not properly recognized. Modifier phrases are sometimes not predicted as part of the named entity; for example, “protracted neuromuscular block” is instead extracted as “neuromuscular block”, and “generalized mite infestation” is instead extracted as simply “mite infestation.” The nature of the data results in especially long named entities that are often entire noun or verb phrases, which can be difficult to delimit. For example, consider the following case: “DISCUSSION: Central nervous system (CNS) toxicity has been described with ifosfamide, with most cases reported in the pediatric population.” Here, instead of extracting (Central nervous system (CNS) toxicity, ifosfamide) as the relation pair, our model predicts (Central nervous system, ifosfamide) and (CNS, ifosfamide). Essentially, long entity phrases are often not recognized in their entirety and are instead broken down into segments where each segment is independently involved in a relation. In this particular case, this error in prediction led to one false negative and two false positives. This phenomenon occurs frequently with coordinated noun phrases, which present a nontrivial challenge. For example, “Growth and adrenal suppression in asthmatic children treated with high-dose fluticasone propionate.” is annotated with “Growth and adrenal suppression” as a singular entity, while our model falsely recognizes it as two entities “Growth” and “adrenal suppression.” We see similar outcomes for the sentence: “Generalized maculopapular and papular purpuric eruptions are perhaps the most common thionamide-induced reactions.” Such cases occur frequently, and we suspect they are a major source of hampered precision given the increased number of false positives for each predictive mistake.
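The effect of strict matching on the error counts can be reproduced with a short sketch. It represents each relation as a tuple of surface strings, which is a simplifying assumption (an actual evaluation would typically compare entity offsets and types), and the function name is hypothetical.

```python
def strict_counts(gold, pred):
    """Count true positives, false positives, and false negatives under
    exact matching: a prediction is correct only if it equals a gold
    relation in every field."""
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    return tp, fp, fn


# The ADE example from the text: one long gold entity split into two predictions.
gold = {("Central nervous system (CNS) toxicity", "ifosfamide")}
pred = {("Central nervous system", "ifosfamide"), ("CNS", "ifosfamide")}
print(strict_counts(gold, pred))

# The CoNLL04 example from the text: an honorific changes the entity bounds.
gold_title = {("Nettie Johnson", "Work_For", "Navy")}
pred_title = {("Lt. Nettie Johnson", "Work_For", "Navy")}
print(strict_counts(gold_title, pred_title))
```

In the first case a single boundary mistake yields two false positives and one false negative, which is why segmented entities are costly under this metric.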
In this study, we introduced a novel neural architecture that combines the ideas of metric learning and convolutional neural networks to tackle the highly challenging problem of end-to-end relation extraction. Our method is able to simultaneously and efficiently recognize entity boundaries, the type of each entity, and the relationships among them. It achieves this by learning intermediate table representations by pooling local metric, dependency, and position information via repeated application of the 2D convolution. For end-to-end relation extraction, this approach improves over the state-of-the-art across two datasets from different domains with statistically significant results based on examining average performance of repeated runs. Moreover, the proposed architecture operates at substantially reduced training and testing times, with testing times that are seven to ten times lower, which is important for many end-user applications. We also perform extensive error analysis and show that our network can be visually analyzed by observing the hidden pooling activity leading to preliminary or intermediate decisions. Currently, the architecture is designed for extracting relations that involve two entities and occur within sentence bounds; handling n-ary relations and exploring document-level extraction involving cross-sentence relations will be the focus of future work.
- Adel and Schütze (2017) Adel, Heike and Hinrich Schütze. 2017. Global normalization of convolutional neural networks for joint entity and relation classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1723–1729.
- Airola et al. (2008) Airola, Antti, Sampo Pyysalo, Jari Björne, Tapio Pahikkala, Filip Ginter, and Tapio Salakoski. 2008. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC bioinformatics, 9(11):S2.
- Bekoulis et al. (2018a) Bekoulis, Giannis, Johannes Deleu, Thomas Demeester, and Chris Develder. 2018a. Adversarial training for multi-context joint entity and relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2830–2836.
- Bekoulis et al. (2018b) Bekoulis, Giannis, Johannes Deleu, Thomas Demeester, and Chris Develder. 2018b. Joint entity recognition and relation extraction as a multi-head selection problem. Expert Systems with Applications, 114:34–45.
- Bunescu and Mooney (2005) Bunescu, Razvan C and Raymond J Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of the conference on human language technology and empirical methods in natural language processing, pages 724–731, Association for Computational Linguistics.
- Chiu and Nichols (2016) Chiu, Jason PC and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.
- Choi, Breck, and Cardie (2006) Choi, Yejin, Eric Breck, and Claire Cardie. 2006. Joint extraction of entities and relations for opinion recognition. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 431–439.
- Collins (2002) Collins, Michael. 2002. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pages 1–8.
- Frunza, Inkpen, and Tran (2011) Frunza, Oana, Diana Inkpen, and Thomas Tran. 2011. A machine learning approach for identifying disease-treatment relations in short texts. IEEE Transactions on Knowledge and Data Engineering, 23(6):801–814.
- Fundel, Küffner, and Zimmer (2007) Fundel, Katrin, Robert Küffner, and Ralf Zimmer. 2007. RelEx - relation extraction using dependency parse trees. Bioinformatics, 23(3):365–371.
- Gupta, Schütze, and Andrassy (2016) Gupta, Pankaj, Hinrich Schütze, and Bernt Andrassy. 2016. Table filling multi-task recurrent neural network for joint entity and relation extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2537–2547.
- Gurulingappa et al. (2012) Gurulingappa, Harsha, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of biomedical informatics, 45(5):885–892.
- Ioffe and Szegedy (2015) Ioffe, Sergey and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 448–456.
- Jurafsky and Martin (2008) Jurafsky, Daniel and James H Martin. 2008. Speech and language processing (Prentice Hall series in artificial intelligence).
- Kate and Mooney (2010) Kate, Rohit J and Raymond J Mooney. 2010. Joint entity and relation extraction using card-pyramid parsing. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning (CoNLL 2010), pages 203–212.
- Katiyar and Cardie (2017) Katiyar, Arzoo and Claire Cardie. 2017. Going out on a limb: Joint extraction of entity mentions and relations without dependency trees. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 917–928.
- Kavuluru, Rios, and Tran (2017) Kavuluru, Ramakanth, Anthony Rios, and Tung Tran. 2017. Extracting drug-drug interactions with word and character-level recurrent neural networks. In Fifth IEEE International Conference on Healthcare Informatics (ICHI), pages 5–12, IEEE.
- Kim et al. (2003) Kim, J-D, Tomoko Ohta, Yuka Tateisi, and Jun’ichi Tsujii. 2003. Genia corpus—a semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl_1):i180–i182.
- Li et al. (2017) Li, Fei, Meishan Zhang, Guohong Fu, and Donghong Ji. 2017. A neural joint model for entity and relation extraction from biomedical text. BMC bioinformatics, 18(1):198.
- Li et al. (2016) Li, Fei, Yue Zhang, Meishan Zhang, and Donghong Ji. 2016. Joint models for extracting adverse drug events from biomedical text. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016), volume 2016, pages 2838–2844.
- Li et al. (2008) Li, Jiexun, Zhu Zhang, Xin Li, and Hsinchun Chen. 2008. Kernel-based learning for biomedical relation extraction. Journal of the Association for Information Science and Technology, 59(5):756–769.
- Li and Ji (2014) Li, Qi and Heng Ji. 2014. Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 402–412.
- Liu et al. (2016) Liu, Shengyu, Kai Chen, Qingcai Chen, and Buzhou Tang. 2016. Dependency-based convolutional neural network for drug-drug interaction extraction. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1074–1080, IEEE.
- Luong, Pham, and Manning (2015) Luong, Minh-Thang, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421.
- Miwa and Bansal (2016) Miwa, Makoto and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1105–1116.
- Miwa and Sasaki (2014) Miwa, Makoto and Yutaka Sasaki. 2014. Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1858–1869.
- Özgür et al. (2008) Özgür, Arzucan, Thuy Vu, Güneş Erkan, and Dragomir R Radev. 2008. Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics, 24(13):i277–i285.
- Pawar, Bhattacharyya, and Palshikar (2017) Pawar, Sachin, Pushpak Bhattacharyya, and Girish Palshikar. 2017. End-to-end relation extraction using neural networks and markov logic networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, volume 1, pages 818–827.
- Pennington, Socher, and Manning (2014) Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
- Pyysalo et al. (2013) Pyysalo, Sampo, Filip Ginter, Hans Moen, Tapio Salakoski, and Sophia Ananiadou. 2013. Distributional semantics resources for biomedical text processing. In Proceedings of 5th International Symposium on Languages in Biology and Medicine, pages 39–44.
- Qian et al. (2008) Qian, Longhua, Guodong Zhou, Fang Kong, Qiaoming Zhu, and Peide Qian. 2008. Exploiting constituent dependencies for tree kernel-based semantic relation extraction. In Proceedings of the 22nd International Conference on Computational Linguistics, volume 1, pages 697–704, Association for Computational Linguistics.
- Raj, Sahu, and Anand (2017) Raj, Desh, Sunil Sahu, and Ashish Anand. 2017. Learning local and global contexts using a convolutional recurrent network model for relation classification in biomedical text. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 311–321.
- Ratinov and Roth (2009) Ratinov, Lev and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155, Association for Computational Linguistics.
- Rink, Harabagiu, and Roberts (2011) Rink, Bryan, Sanda Harabagiu, and Kirk Roberts. 2011. Automatic extraction of relations between medical concepts in clinical texts. Journal of the American Medical Informatics Association, 18(5):594–600.
- Roth and Yih (2004) Roth, Dan and Wen-tau Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Proceedings of the Annual Conference on Computational Natural Language Learning (CoNLL), pages 1–8.
- Singh et al. (2013) Singh, Sameer, Sebastian Riedel, Brian Martin, Jiaping Zheng, and Andrew McCallum. 2013. Joint inference of entities, relations, and coreference. In Proceedings of the 2013 workshop on Automated knowledge base construction, pages 1–6.
- Srivastava et al. (2014) Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
- Tieleman and Hinton (2012) Tieleman, Tijmen and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2).
- Yu and Lam (2010) Yu, Xiaofeng and Wai Lam. 2010. Jointly identifying entities and extracting relations in encyclopedia text via a graphical model approach. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 1399–1407.
- Zeng et al. (2014) Zeng, Daojian, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao, et al. 2014. Relation classification via convolutional deep neural network. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers (COLING 2014), pages 2335–2344.
- Zeng et al. (2018) Zeng, Xiangrong, Daojian Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2018. Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 506–514.
- Zhang, Zhang, and Fu (2017) Zhang, Meishan, Yue Zhang, and Guohong Fu. 2017. End-to-end neural relation extraction with global optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1730–1740.
- Zhang, Qi, and Manning (2018) Zhang, Yuhao, Peng Qi, and Christopher D Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
- Zheng et al. (2017a) Zheng, Suncong, Yuexing Hao, Dongyuan Lu, Hongyun Bao, Jiaming Xu, Hongwei Hao, and Bo Xu. 2017a. Joint entity and relation extraction based on a hybrid neural network. Neurocomputing, 257:59–66.
- Zheng et al. (2017b) Zheng, Suncong, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. 2017b. Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1227–1236.