GraphIE: A Graph-Based Framework for Information Extraction

by Yujie Qian, et al.

Most modern Information Extraction (IE) systems are implemented as sequential taggers and focus on modelling local dependencies. Non-local and non-sequential context is, however, a valuable source of information to improve predictions. In this paper, we introduce GraphIE, a framework that operates over a graph representing both local and non-local dependencies between textual units (i.e. words or sentences). The algorithm propagates information between connected nodes through graph convolutions and exploits the richer representation to improve word level predictions. The framework is evaluated on three different tasks, namely social media, textual and visual information extraction. Results show that GraphIE outperforms a competitive baseline (BiLSTM+CRF) in all tasks by a significant margin.





Introduction

Most modern Information Extraction (IE) systems are implemented as sequential taggers. While such models effectively capture relations in the local context, they have limited capacity to capture non-local and non-sequential dependencies. In many applications, however, such dependencies can greatly reduce tagging ambiguity, thereby improving overall extraction performance. For instance, when extracting entities from a document, various types of non-local contextual information such as co-references and identical mentions may provide valuable cues. Figure 1 shows how such non-local relations are crucial to discriminate the entity type of the second mention of Washington, which would otherwise be ambiguous (i.e. Person or Location) if taken in isolation.

Figure 1: Example of the entity extraction task with ambiguous entity mentions. Aside from the sentential forward and backward edges (green) which aggregate local contextual information, non-local relations such as the co-referent edges (red) between mentions of Washington, the identical-mention edges (orange) and the subject connections (blue) across sentences provide additional valuable information to reduce tagging ambiguity.

Most of the prior work incorporates these non-local dependencies by constraining the output space in a structured prediction framework (Finkel, Grenager, and Manning, 2005; Reichart and Barzilay, 2012; Hu et al., 2016). Such approaches, however, have limited capacity to capture structural relations in the input space. For instance, the co-referent dependencies shown in Figure 1 cannot be readily exploited for entity extraction by constraining the output space, since the co-referent mentions are not necessarily labeled as entities. One way to capture such dependencies is by defining a graph that outlines the input structure and engineering features to describe it, as proposed by Quirk and Poon (2017). However, designing effective features is challenging, especially when the underlying structure is complex. Moreover, this approach has limited ability to capture node interactions informed by the graph structure.

In this paper, we propose a graph-based approach that captures structural relations underlying the input space. This method can readily handle non-local and non-sequential dependencies, utilizing them as context for word-level predictions. Our approach is built on the basis of the Graph Convolutional Network (GCN) (Kipf and Welling, 2016). To adapt this architecture for information extraction, we need to answer two questions. The first question relates to the design of graph topology that can effectively encode input dependencies relevant to the target task. The second question concerns modeling the interaction between the non-local contextual information learned by the graph module and the local tagging modules.

To address these questions, we introduce a framework named GraphIE. It operates over a graph, where nodes correspond to textual units (i.e. words or sentences) and edges encode their relations, which describe task-specific structural constraints. The algorithm iteratively propagates information between neighboring nodes using a graph convolutional network, thereby constructing the non-local context representation. This contextual information is then projected back to the nodes, supporting tagging at the word level.

We evaluate GraphIE on three IE tasks, namely social media, textual, and visual (Aumann et al., 2006) information extraction. Experimental results on multiple benchmark datasets show that GraphIE consistently outperforms a strong and commonly adopted sequential model (SeqIE), which consists of a bi-directional long short-term memory network (BiLSTM) and a conditional random field (CRF). For instance, in the social media IE task, GraphIE improves over SeqIE by 3.7 F1 points (Table 4) in extracting the Education attribute from Twitter users. In the textual IE task, we obtain an improvement of about 0.5 F1 points over the baseline on the CoNLL03 dataset (Tjong, Erik, and De Meulder, 2003), and about 1.4 F1 points on chemical entity extraction (Krallinger et al., 2015) (Table 5).

Related Work

The problem of incorporating non-local and non-sequential context to improve information extraction has been extensively studied in the literature. The majority of methods have focused on enforcing constraints in the output space during inference, through various mechanisms such as posterior regularization or generalized expectations (Finkel, Grenager, and Manning, 2005; Mann and McCallum, 2010; Reichart and Barzilay, 2012; Li, Ji, and Huang, 2013; Hu et al., 2016).

Research capturing non-local dependencies in the input space has mostly relied on feature-based approaches. Roberts, Gaizauskas, and Hepple (2008), Hirano et al. (2010) and Swampillai and Stevenson (2011) have designed intra- and inter-sentential features based on discourse and syntactic dependencies (e.g., shortest paths) to improve relation extraction. Quirk and Poon (2017) used document graphs to flexibly represent multiple types of relations between words (e.g., syntactic, adjacency and discourse relations).

Graph-based representations can also be learned with neural networks. In this respect, the most related work to ours is the graph convolutional network by Kipf and Welling (2016), which was developed to encode graph structures and perform node classification. In our framework, we adapt GCN as an intermediate module that learns non-local context, which, instead of being used directly for classification, is projected to the decoder to enrich local information and perform sequence tagging.

A handful of other information extraction approaches have used graph-based neural networks. Miwa and Bansal (2016) applied TreeLSTM (Tai, Socher, and Manning, 2015) to jointly represent sequences and dependency trees for entity and relation extraction. On the same line of work, Peng et al. (2017) introduced GraphLSTM, which extended the traditional LSTM to graphs by enabling a varied number of incoming edges at each memory cell. A parallel work to ours (Zhang, Qi, and Manning, 2018) has exploited graph convolutions to pool information over pruned dependency trees, outperforming existing sequence and dependency-based neural models in a relation extraction task. These studies differ from ours in several respects. First, they can only model word-level graphs, whereas our framework can learn non-local context either from word- or sentence-level graphs, using it to reduce ambiguity during tagging at the word level. Second, all these studies achieved improvements only when using dependency trees. We extend the graph-based approach to validate the benefits of using other types of relations in a broader range of tasks, such as co-reference in named entity recognition, followed-by links in social media, and layout structure in visual information extraction.

Figure 2: GraphIE architecture: (a) an overview of the framework; (b) architecture for sentence-level graph, where each sentence is encoded to a node vector by a BiLSTM and fed into the graph module, and the output of the graph module is used as the initial state of the BiLSTM in the decoder; (c) architecture for word-level graph, where the hidden state for each word of the BiLSTM encoder is taken as the input node vector of the graph module, and then the output is fed into the BiLSTM+CRF decoder.

Problem Definition

We formalize information extraction as a sequence tagging problem. Rather than simply modeling inputs as sequences, we assume there exists a graph structure in the data that can be exploited to capture non-local and non-sequential dependencies between textual units, namely words or sentences.

We consider the input to be a set of sentences $S = \{s_1, \dots, s_N\}$ and an auxiliary graph $G = (V, E)$, where $V = \{v_1, \dots, v_M\}$ is the node set and $E \subseteq V \times V$ is the edge set. Each sentence is a sequence of words. We consider two different designs of the graph:

  1. sentence-level graph, where each node is a sentence (i.e. $M = N$), and the edges encode sentence dependencies;

  2. word-level graph, where each node is a word (i.e. $M$ is the number of words in the input), and the edges connect pairs of words, such as co-referent tokens.

The edges in the graph can be either directed or undirected. Multiple edge types can also be defined to capture different structural factors underlying the task-specific input data.

We use the BIO (Begin, Inside, Outside) tagging scheme in this paper. For each sentence $s_i = (w^{(i)}_1, \dots, w^{(i)}_k)$, we sequentially tag each word as $y_i = (y^{(i)}_1, \dots, y^{(i)}_k)$. (Footnote 2: While sentences may have different lengths, for notational simplicity we use a single variable $k$.)
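As a small illustration of the BIO scheme, the sketch below converts labeled spans into per-word tags. The function name `spans_to_bio`, the span format (start, exclusive end, label) and the example annotation are assumptions made for this illustration, not part of the paper.

```python
def spans_to_bio(num_words, spans):
    """Convert labeled spans to BIO tags.

    spans: list of (start, end, label) with `end` exclusive. Spans are
    assumed not to overlap (an assumption of this sketch).
    """
    tags = ["O"] * num_words
    for start, end, label in spans:
        tags[start] = "B-" + label          # first word of the mention
        for t in range(start + 1, end):
            tags[t] = "I-" + label          # remaining words of the mention
    return tags


words = ["George", "Washington", "visited", "Washington", "."]
# Hypothetical annotation: a two-word person and a one-word location.
tags = spans_to_bio(len(words), [(0, 2, "PER"), (3, 4, "LOC")])
print(list(zip(words, tags)))
```

Every word receives exactly one tag, so the tagger's output has the same length as the sentence.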


GraphIE Framework

GraphIE jointly learns local and non-local dependencies by iteratively propagating information between node representations. Our model has three components:

  • an encoder, which generates local context-aware hidden representations for the textual unit (i.e. word or sentence, depending on the task) with a recurrent neural network;

  • a graph module, which captures the graph structure and learns non-local and non-sequential dependencies between textual units;

  • a decoder, which exploits the contextual information generated by the graph module to perform labelling at the word level.

Figure 2 illustrates the overview of GraphIE and the model architectures for both sentence- and word-level graphs. In the following sections, we first introduce the case of the sentence-level graph, and then we explain how to adapt the model for the word-level graph.


Encoder

In GraphIE, we first use an encoder to generate text representations. Given a sentence $s_i = (w^{(i)}_1, \dots, w^{(i)}_k)$ of length $k$, we encode it with a recurrent neural network (RNN). Each word $w^{(i)}_t$ is represented by a vector $x^{(i)}_t$, which is the concatenation of its word embedding and a feature vector learned with a character-level convolutional neural network (CharCNN; Kim et al. (2016)). We choose a bi-directional LSTM (Hochreiter and Schmidhuber, 1997) to process the sentence, defining it as

$$h^{(i)}_{1:k} = \mathrm{BiLSTM}\big(x^{(i)}_{1:k};\, 0,\, \Theta_{enc}\big),$$

where $h^{(i)}_{1:k}$ denotes the hidden states of the words, $\Theta_{enc}$ represents the encoder parameters, and $0$ indicates the initial hidden states of the BiLSTM encoder.

We obtain the sentence representation for $s_i$ by averaging the hidden states of its words, i.e. $\mathrm{Enc}(s_i) = \frac{1}{k} \sum_{t=1}^{k} h^{(i)}_t$.
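The averaging step can be sketched in a few lines. The hidden states below are made-up toy vectors, and `mean_pool` is a hypothetical helper name; a real encoder would produce the vectors with a BiLSTM.

```python
def mean_pool(hidden_states):
    """Average a list of equal-length word vectors into one sentence vector."""
    k = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / k for d in range(dim)]


# Toy hidden states for a three-word sentence (illustrative values only).
h = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
sentence_vec = mean_pool(h)
print(sentence_vec)  # [2.0, 2.0]
```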

Graph Module

The graph module is designed to learn the non-local and non-sequential information from the graph. We adapt the graph convolutional network (GCN) to model the graph context for information extraction.

Given the sentence-level graph $G = (V, E)$, where each node $v_i$ (i.e. sentence $s_i$) has the encoding $\mathrm{Enc}(s_i)$ capturing its local information, the graph module enriches such representation with neighbor information derived from the graph structure.

Our graph module is a GCN which takes as input the sentence representation, i.e. $g^{(0)}_i = \mathrm{Enc}(s_i)$, and conducts graph convolution on every node, propagating information between its neighbors, and integrating such information into a new hidden representation. Specifically, each layer of GCN has two parts. The first gets the information of each node from the previous layer, i.e.

$$\alpha^{(l)}_i = W^{(l)}_v\, g^{(l-1)}_i, \qquad (1)$$

where $W^{(l)}_v$ is the weight to be learned. The second aggregates information from the neighbors of each node, i.e. for node $v_i$, we have

$$\beta^{(l)}_i = \frac{1}{d(v_i)} \sum_{e_{i,j} \in E} W^{(l)}_e\, g^{(l-1)}_j, \qquad (2)$$

where $d(v_i)$ is the degree of node $v_i$ (i.e. the number of edges connected to $v_i$) and is used to normalize $\beta^{(l)}_i$, ensuring that nodes with different degrees have representations of the same scale. (Footnote 3: We choose this simple normalization strategy instead of the two-sided normalization in (Kipf and Welling, 2016), as it performs better in the experiments. The same strategy is also adopted by Zhang, Qi, and Manning (2018).) In the simplest case, where the edges in the graph are undirected and have the same type, we use the same weight $W^{(l)}_e$ for all of them. In a more general case, where multiple edge types exist, we expect them to have different impacts on the aggregation. Thus, we model these edge types with different weights $W^{(l)}_e$ in Eq. 2, similar to the relational GCN proposed by Schlichtkrull et al. (2018). When edges are directed, i.e. edge $e_{i,j}$ is different from $e_{j,i}$, the propagation mechanism should mirror such difference. In this case, we consider directed edges as two types of edges (forward and backward), and use different weights for them.

Finally, $\alpha^{(l)}_i$ and $\beta^{(l)}_i$ are combined to obtain the representation at the $l$-th layer,

$$g^{(l)}_i = \sigma\big(\alpha^{(l)}_i + \beta^{(l)}_i + b^{(l)}\big), \qquad (3)$$

where $\sigma(\cdot)$ is the non-linear activation function, and $b^{(l)}$ is a bias parameter.
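The layer just described (a self term, a degree-normalized neighbor term with one weight matrix per edge type, and a non-linearity) can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the names `matvec` and `gcn_layer`, the edge-list format, and the choice of ReLU as the activation are all assumptions of the sketch.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gcn_layer(g, edges, W_v, W_e, b):
    """One graph-convolution layer with one-sided degree normalization.

    g:     list of node vectors from the previous layer
    edges: list of (i, j, edge_type); node i receives information from node j
    W_v:   weight matrix applied to the node's own vector
    W_e:   dict mapping edge_type -> weight matrix for neighbor messages
    b:     bias vector
    """
    n, dim_out = len(g), len(b)
    agg = [[0.0] * dim_out for _ in range(n)]
    degree = [0] * n
    for i, j, etype in edges:
        msg = matvec(W_e[etype], g[j])
        agg[i] = [a + m for a, m in zip(agg[i], msg)]
        degree[i] += 1
    out = []
    for i in range(n):
        alpha = matvec(W_v, g[i])                          # self term
        beta = [a / degree[i] for a in agg[i]] if degree[i] else agg[i]
        # ReLU is an assumption here; the text only says "non-linear activation".
        out.append([max(0.0, a + bt + bi) for a, bt, bi in zip(alpha, beta, b)])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
g0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Undirected path 0 - 1 - 2 with a single edge type, stored in both directions.
edges = [(0, 1, "nbr"), (1, 0, "nbr"), (1, 2, "nbr"), (2, 1, "nbr")]
out = gcn_layer(g0, edges, identity, {"nbr": identity}, [0.0, 0.0])
print(out)  # [[1.0, 1.0], [1.0, 1.5], [1.0, 2.0]]
```

With identity weights, node 1 ends up with its own vector plus the average of its two neighbors, which makes the effect of the degree normalization easy to check by hand. Directed graphs would simply use two edge types (for example "forward" and "backward") with separate matrices in `W_e`.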

Because each layer only propagates information between directly connected nodes, we can stack multiple graph convolutional layers to get a larger receptive field, i.e. each node can be aware of more distant neighbors. After $L$ layers, for each node $v_i$ we obtain a contextual representation, $\mathrm{GCN}(s_i) = g^{(L)}_i$ (the node's representation at the last layer), that captures both local and non-local information.
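The growth of the receptive field with depth can be illustrated without any learned weights: after each additional layer, a node can be influenced by nodes one hop further away. The helper `receptive_field` and the chain graph below are hypothetical and only meant to make that claim concrete.

```python
def receptive_field(num_nodes, edges, node, num_layers):
    """Nodes whose input can influence `node` after `num_layers` graph layers."""
    neighbors = {i: set() for i in range(num_nodes)}
    for i, j in edges:                 # edges are treated as undirected
        neighbors[i].add(j)
        neighbors[j].add(i)
    reached = {node}
    for _ in range(num_layers):
        # One more layer lets information travel one more hop.
        reached |= {n for r in reached for n in neighbors[r]}
    return reached


chain = [(0, 1), (1, 2), (2, 3)]       # 0 - 1 - 2 - 3
for layers in (1, 2, 3):
    print(layers, sorted(receptive_field(4, chain, 0, layers)))
```

On this chain, node 0 sees node 3 only once three layers are stacked.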


Decoder

To support tagging, the learned representation is propagated to the decoder.

In our work, the decoder is instantiated as a BiLSTM+CRF tagger (Lample et al., 2016). The output representation of the graph module, $\mathrm{GCN}(s_i)$, is split into two vectors of the same length, which are used as the initial hidden states for the forward and backward LSTMs, respectively. In this way, the graph contextual information is propagated to each word through the LSTM. Specifically, we have

$$z^{(i)}_{1:k} = \mathrm{BiLSTM}\big(h^{(i)}_{1:k};\, \mathrm{GCN}(s_i),\, \Theta_{dec}\big),$$

where $h^{(i)}_{1:k}$ are the output hidden states of the encoder, $\mathrm{GCN}(s_i)$ represents the initial state, and $\Theta_{dec}$ denotes the decoder parameters. Finally, we use a CRF layer (Lafferty, McCallum, and Pereira, 2001) on top of the BiLSTM to perform tagging,

$$\hat{y}_i = \arg\max_{y \in Y_k} p\big(y \mid z^{(i)}_{1:k};\, \Theta_{crf}\big),$$

where $Y_k$ is the set of all possible tag sequences of length $k$, and $\Theta_{crf}$ represents the CRF parameters, i.e. transition scores of tags. The CRF combines the local predictions of the BiLSTM and the transition scores to model the joint probability of the tag sequence. (Footnote 4: In GraphIE, the graph module models the input space structure, i.e. the dependencies between textual units (sentences or words), and the final CRF layer models the sequential connections of the output tags. Even though loops may exist in the input graph, the CRF operates sequentially, so inference is tractable.)
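Because the CRF is a linear chain, the best tag sequence can be found exactly with Viterbi decoding. The sketch below shows that standard algorithm on toy scores; the function name, the tag inventory and all numbers are illustrative assumptions, and the paper does not spell out its decoding code.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence as a list of tag indices.

    emissions[t][y]:   local score of tag y at position t
    transitions[p][y]: score of moving from tag p to tag y
    """
    num_tags = len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for y in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda p: score[p] + transitions[p][y])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
        score = new_score
        backptr.append(ptr)
    best = max(range(num_tags), key=lambda y: score[y])
    path = [best]
    for ptr in reversed(backptr):      # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return path[::-1]


TAGS = ["O", "B", "I"]
# Toy scores: taken alone, word 0 prefers "O" and word 1 prefers "I".
emissions = [[1.0, 0.8, 0.0],
             [0.0, 0.0, 2.0]]
# The only non-zero transition penalizes the invalid move O -> I.
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
print([TAGS[y] for y in viterbi(emissions, transitions)])  # ['B', 'I']
```

Picking the best tag per word independently would yield the invalid sequence O, I; the transition scores steer the joint decision to B, I instead.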

Adaptation to Word-level Graphs

GraphIE can be easily adapted to model word-level graphs. In this case, the nodes represent words in the input, i.e. the number of nodes equals the total number of words in the sentences. Each word’s hidden state in the encoder can then be used as the input node vector of the graph module. The GCN conducts graph convolution on the word-level graph and generates graph-contextualized representations for the words. Finally, the decoder directly operates on the GCN’s outputs, i.e. we change the BiLSTM decoder to

$$z^{(i)}_{1:k} = \mathrm{BiLSTM}\big(\mathrm{GCN}(w^{(i)}_{1:k});\, 0,\, \Theta_{dec}\big),$$

where $\mathrm{GCN}(w^{(i)}_t)$ is the GCN output for word $w^{(i)}_t$. In this case, the BiLSTM initial states are set to the default zero vectors. The CRF layer remains unchanged.

As can be seen in Figure 2(c), the word-level graph module differs from the sentence-level one because it directly takes the word representations from the encoder and feeds its output to the decoder. In the sentence-level graph, the GCN operates on sentence representations, which are then used as the initial states of the decoder BiLSTM.

Evaluation Task | Graph Type     | Node (Textual Unit) | Edge (Relation)
----------------|----------------|---------------------|----------------
Social Media IE | sentence-level | user’s tweets       | followed-by
Textual IE      | word-level     | word                | 1. non-local consistency (identical mentions); 2. local sentential forward and backward
Visual IE       | sentence-level | text box            | spatial layout (horizontal and vertical)

Table 1: Comparisons of graph structure in the three IE tasks used for evaluation.

Experimental Setup

We evaluate the model on three tasks, including two traditional IE tasks, namely social media information extraction and textual information extraction, and an under-explored task — visual information extraction. Table 1 summarizes the characteristics of the three tasks.

Task 1: Social Media Information Extraction

Social media information extraction refers to the task of extracting information from users’ posts in online social networks (Benson, Haghighi, and Barzilay, 2011; Li, Ritter, and Hovy, 2014). In this paper, we aim at extracting education and job information from users’ tweets. Given a set of tweets posted by a user, the goal is to extract mentions of the organizations to which they belong. The fact that the tweets are short, highly contextualized and show special linguistic features makes this task particularly challenging.


Dataset

We construct two datasets, Education and Job, from the Twitter corpus released by Li, Ritter, and Hovy (2014). The original corpus contains millions of tweets generated by thousands of users, where the education and job mentions are annotated using distant supervision (Mintz et al., 2009). We sample the tweets from each user, maintaining the ratio between positive and negative posts. (Footnote 5: Positive and negative refer here to whether or not the education or job mention is present in the tweet. In (Li, Ritter, and Hovy, 2014), sampling was not necessary because they processed the tweets and extracted engineered features beforehand, and then learned the model.) The obtained Education dataset consists of 443,476 tweets generated by 7,208 users, and the Job dataset contains 176,043 tweets generated by 1,772 users. Dataset statistics are reported in Table 2.

Both datasets are split into 60% for training, 20% for development, and 20% for testing. We perform 5 different random splits and report the average results.
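A repeated 60/20/20 split of this kind can be sketched as follows. Splitting at the user level, the seed values and the helper name `random_split` are assumptions of this illustration; the text does not say at which granularity the split is made.

```python
import random


def random_split(items, seed, train_pct=60, dev_pct=20):
    """Shuffle `items` with a fixed seed and cut them into train/dev/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_dev = len(shuffled) * dev_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])


users = list(range(100))               # hypothetical user ids
splits = [random_split(users, seed) for seed in range(5)]
for train, dev, test in splits:
    print(len(train), len(dev), len(test))
```

A model would be trained and evaluated once per split, and the five test scores averaged.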

Number          | Education | Job
----------------|-----------|--------
Users           | 7,208     | 1,772
Edges           | 11,167    | 3,498
Positive Tweets | 49,793    | 3,694
Negative Tweets | 393,683   | 172,349

Table 2: Statistics of the Education and Job datasets (Task 1).

Graph Construction

We construct the graph as ego-networks (Leskovec and Mcauley, 2012), i.e. when we extract information about one user, we consider the subgraph formed by the user and their direct neighbors. Each node corresponds to a Twitter user, who is represented by the set of posted tweets. (Footnote 6: As each node is a set of tweets posted by the user, we encode every tweet with the encoder, and then average them to obtain the node representation. In the decoding phase, the graph module’s output is fed to the decoder for each tweet.) Edges are defined by the followed-by link, under the assumption that connected users are more likely to come from the same university or company. An example of an Education social media graph is reported in the Supplemental Material.
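Extracting an ego-network from a list of followed-by links can be sketched as below. The user names are made up, links are treated as undirected, and keeping the links between two neighbors (the induced subgraph) is an assumption of this sketch rather than something the text specifies.

```python
def ego_network(user, followed_by):
    """Return (nodes, edges) of the subgraph around `user`.

    followed_by: iterable of (a, b) pairs, treated here as undirected links.
    Only the user and their direct neighbors are kept.
    """
    links = list(followed_by)
    nodes = {user}
    for a, b in links:
        if a == user:
            nodes.add(b)
        elif b == user:
            nodes.add(a)
    # Keep every link whose two endpoints both survived.
    sub_edges = [(a, b) for a, b in links if a in nodes and b in nodes]
    return nodes, sub_edges


links = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "cat")]
nodes, edges = ego_network("ann", links)
print(sorted(nodes), edges)
```

Here "dan" is two hops away from "ann" and is therefore dropped together with the link that reaches him.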

Task 2: Textual Information Extraction

In this task, we focus on named entity recognition at discourse level (DiscNER). In contrast to traditional sentence-level NER (SentNER), where sentences are processed independently, in DiscNER, long-range dependencies and constraints across sentences have a crucial role in the tagging process. For instance, multiple mentions of the same entity are expected to be tagged consistently in the same discourse. Here we propose to utilize this (soft) consistency constraint to improve entity extraction.


Dataset

We conduct experiments on two NER datasets: the CoNLL-2003 shared task dataset (CoNLL03) (Tjong, Erik, and De Meulder, 2003) and the Chemdner dataset for chemical entity extraction (Krallinger et al., 2015). We follow the standard split of each corpus. Statistics are shown in Table 3.

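Dataset  | Count | Train  | Dev    | Test
---------|-------|--------|--------|-------
CoNLL03  | #doc  | 946    | 216    | 231
CoNLL03  | #sent | 14,987 | 3,466  | 3,684
Chemdner | #doc  | 3,500  | 3,500  | 3,000
Chemdner | #sent | 30,739 | 30,796 | 26,399

Table 3: Statistics of the CoNLL03 NER dataset and the Chemdner dataset (Task 2).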

Graph Construction

In this task, nodes represent words. We create two types of edges for each given document:

  • Local edges: forward and backward edges are created between neighboring words in each sentence, allowing local contextual information to be utilized.

  • Non-local edges: identical mentions are connected, so that information can be propagated between them, which encourages globally consistent tagging.

To build the non-local edges, for Chemdner, we exploit a named entity dictionary (Hettne et al., 2009), which contains a noisy collection of 1.5M terms. For CoNLL03, we consider all reoccurring words as potential mentions without introducing any additional resources to ensure a fair comparison with previous work. (Footnote 7: Note that other non-local relations such as co-references and subject connections (cf. the example in Figure 1) are desired for further improvement. However, these relations require additional resources or modules to obtain, and we leave them to future work.)
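The two edge types can be built with a single pass over the document. The sketch below follows the CoNLL03 setting, where every reoccurring word is linked to its other occurrences; exact, case-sensitive string matching, the edge-type names and the (receiver, sender, type) convention are assumptions of this illustration.

```python
from collections import defaultdict


def build_word_graph(sentences):
    """Build typed edges over all words of a document.

    Returns (tokens, edges), where each edge is (i, j, type) and means that
    node i receives information from node j.
    """
    tokens, edges = [], []
    positions = defaultdict(list)
    for sent in sentences:
        start = len(tokens)
        for offset, word in enumerate(sent):
            idx = start + offset
            tokens.append(word)
            positions[word].append(idx)
            if offset > 0:             # local edges never cross sentences
                edges.append((idx, idx - 1, "forward"))
                edges.append((idx - 1, idx, "backward"))
    for idxs in positions.values():    # non-local edges between repeats
        for i in idxs:
            for j in idxs:
                if i != j:
                    edges.append((i, j, "identical"))
    return tokens, edges


doc = [["Washington", "was", "born"],
       ["Washington", "is", "a", "state"]]
tokens, edges = build_word_graph(doc)
print(len(tokens), len(edges))  # 7 12
```

A dictionary-based variant, as used for Chemdner, would restrict the last loop to words or phrases found in the dictionary.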

Task 3: Visual Information Extraction

Visual information extraction refers to the extraction of attribute values from documents formatted in various layouts. Examples include invoices and forms, whose format can be exploited to infer valuable information to support extraction.


Dataset

The corpus consists of 25,200 Adverse Event Case Reports (AECR) recording drug-related side effects. Each case contains an average of 9 pages. Since these documents are produced by multiple organizations, they exhibit large variability in the layout and presentation styles (e.g. text, table, etc.). (Footnote 8: This dataset cannot be shared for patient privacy and proprietary issues.) The collection is provided with a separate human-extracted ground truth database that is used as a source of distant supervision.

Our goal is to extract eight attributes related to the patient, the event, the drug and the reporter (cf. Table 6 for the full list). Attribute types include dates, words and phrases — which can be directly extracted from the document.

The dataset is split in 50% cases for training, 10% for development, and 40% for testing.

Graph Construction

We first turn the PDFs to text using PDFMiner, which provides words along with their positions in the page (i.e. bounding-box coordinates). Consecutive words are then geometrically joined into text boxes. Each text box is considered as a “sentence” in this task, and corresponds to a node in the graph.

Because the page layout is the major structural factor in these documents, we work on a page-by-page basis, i.e. each page corresponds to a graph. The edges are defined to horizontally or vertically connect nodes (text boxes) that are close to each other (i.e. when the overlap of their bounding boxes, in either the vertical or horizontal direction, is over 50%). Four types of edges are considered: left-to-right, right-to-left, up-to-down, and down-to-up. When multiple nodes are aligned, only the closest ones are connected. An example of a visual document graph is reported in the Supplemental Material.
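One possible reading of this construction is sketched below. Several details are assumptions of the sketch because the text leaves them open: the overlap is measured relative to the shorter of the two box sides, the y coordinate grows downward, "closest" is decided by the neighbor's starting coordinate, and edges are written as (source, target, type).

```python
def overlap_ratio(a0, a1, b0, b1):
    """Overlap of intervals [a0, a1] and [b0, b1] relative to the shorter one."""
    inter = min(a1, b1) - max(a0, b0)
    shorter = min(a1 - a0, b1 - b0)
    return inter / shorter if inter > 0 and shorter > 0 else 0.0


def layout_edges(boxes, threshold=0.5):
    """boxes: list of (x0, y0, x1, y1) text boxes on one page.

    Links each box to its nearest aligned neighbor on the right and below,
    and adds the corresponding reverse edges.
    """
    edges = []
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        right, below = None, None
        for j, (u0, v0, u1, v1) in enumerate(boxes):
            if i == j:
                continue
            # Same row: vertical extents overlap and j lies to the right of i.
            if overlap_ratio(y0, y1, v0, v1) > threshold and u0 >= x1:
                if right is None or u0 < boxes[right][0]:
                    right = j
            # Same column: horizontal extents overlap and j lies below i.
            if overlap_ratio(x0, x1, u0, u1) > threshold and v0 >= y1:
                if below is None or v0 < boxes[below][1]:
                    below = j
        if right is not None:
            edges.append((i, right, "left-to-right"))
            edges.append((right, i, "right-to-left"))
        if below is not None:
            edges.append((i, below, "up-to-down"))
            edges.append((below, i, "down-to-up"))
    return edges


# Hypothetical page: three boxes on one row and one box under the first.
boxes = [(0, 0, 10, 5), (12, 0, 20, 5), (25, 0, 30, 5), (0, 8, 10, 13)]
for edge in layout_edges(boxes):
    print(edge)
```

Box 0 is linked to box 1 but not to box 2, which reflects the rule that only the closest aligned boxes are connected.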

Baseline and Our Method

We implement a two-layer BiLSTM with a CRF tagger as the sequential baseline (SeqIE). This architecture and its variants have been extensively studied and demonstrated to be successful in previous work on information extraction (Lample et al., 2016; Ma and Hovy, 2016). It shares the same encoder and decoder architecture with GraphIE, but removes the graph module. (Footnote 10: In the visual IE task (Task 3), in order to increase the capacity of the baseline, we sequentially concatenate the horizontally aligned text boxes. This obtains much better performance than processing text boxes independently, as it models the horizontal edges of the graph.)

In Task 1 and Task 3, we apply GraphIE with sentence-level graph module (cf. Figure 2(b)), and in Task 2, we apply GraphIE with word-level graph module (cf. Figure 2(c)).

Implementation Details

The models are trained with Adam (Kingma and Ba, 2014) to minimize the CRF objective. For regularization, we choose dropout with a ratio of 0.1 on both the input word representation and the hidden layer of the decoder. The learning rate is set to 0.001. We use the development set for early-stopping and the selection of the best performing hyperparameters. For CharCNN, we use 64-dimensional character embeddings and 64 filters of width 2 to 4 (Kim et al., 2016). The 100-dimensional pretrained GloVe word embeddings (Pennington, Socher, and Manning, 2014) are used in Task 1 and 2, and 64-dimensional randomly initialized word embeddings are used in Task 3. We use a one-layer GCN in Task 1 and Task 3, and a two-layer GCN in Task 2. The encoder and decoder BiLSTMs have the same dimension as the graph convolution layer. In Task 3, we concatenate a positional encoding to each text box’s representation by transforming its bounding box coordinates to a vector of length 32, and then applying an activation function.


Results

Task 1: Social Media Information Extraction

Table 4 shows the results for the social media information extraction task. GraphIE outperforms SeqIE in both the Education and Job datasets, and the improvements are more significant for the Education dataset (3.7 versus 0.3 F1 points). The reason for this difference is the variance in the affinity scores (Mislove et al., 2010) between the two datasets. Li, Ritter, and Hovy (2014) underline that the affinity value for Education is substantially higher than for Job, which means that in the datasets neighbors are more likely to have studied in the same university than to have worked in the same company. We can therefore expect that a model like GraphIE, which exploits neighbors’ information, obtains larger advantages in a dataset characterized by higher affinity.

Dataset   | SeqIE P | SeqIE R | SeqIE F1 | GraphIE P | GraphIE R | GraphIE F1
----------|---------|---------|----------|-----------|-----------|-----------
Education | 85.2    | 93.6    | 89.2     | 92.9      | 92.8      | 92.9
Job       | 66.2    | 66.7    | 66.2     | 67.1      | 66.1      | 66.5

Table 4: Extraction accuracy on the Education and Job datasets (Task 1). Scores are the average of 5 runs. * indicates the improvement over SeqIE is statistically significant (Welch’s t-test).

Task 2: Textual Information Extraction

Table 5 describes the NER accuracy on the CoNLL03 (Tjong, Erik, and De Meulder, 2003) and the Chemdner (Krallinger et al., 2015) datasets.

GraphIE significantly outperforms SeqIE on both datasets, and also outperforms SeqIE+dict on the Chemdner dataset, demonstrating that (1) non-local information is beneficial for discourse-level entity extraction and that (2) GraphIE offers a coherent framework for effectively exploiting such non-local dependencies.

Dataset  | Model   | F1
---------|---------|------
CoNLL03  | SeqIE   | 91.20
CoNLL03  | GraphIE | 91.69
Chemdner | SeqIE   | 88.28
Chemdner | GraphIE | 89.71

Table 5: NER accuracy on the CoNLL03 and the Chemdner dataset (Task 2). Scores are the average of 3 runs. * indicates statistical significance.

Task 3: Visual Information Extraction

Table 6 shows the results in the visual information extraction task. GraphIE outperforms the SeqIE baseline in most attributes, and achieves a 1.2-point improvement in the micro average F1 score. This confirms the benefits of using the layout graph structure in visual information extraction.

Attribute         | SeqIE P | SeqIE R | SeqIE F1 | GraphIE P | GraphIE R | GraphIE F1
------------------|---------|---------|----------|-----------|-----------|-----------
Pt. initials      | 93.5    | 92.4    | 92.9     | 93.6      | 91.9      | 92.8
Pt. age           | 94.0    | 91.6    | 92.8     | 94.8      | 91.1      | 92.9
Pt. date of birth | 96.6    | 96.0    | 96.3     | 96.9      | 94.7      | 95.8
Drug name         | 71.2    | 51.2    | 59.4     | 78.5      | 50.4      | 61.4
Event             | 62.6    | 65.2    | 63.9     | 64.1      | 68.7      | 66.3
Rp. first name    | 78.3    | 95.7    | 86.1     | 79.5      | 95.9      | 86.9
Rp. last name     | 84.5    | 68.4    | 75.6     | 85.6      | 68.2      | 75.9
Rp. city          | 88.9    | 65.4    | 75.4     | 92.1      | 66.3      | 77.1
Avg. (macro)      | 83.7    | 78.2    | 80.3     | 85.7      | 78.4      | 81.1
Avg. (micro)      | 78.5    | 73.8    | 76.1     | 80.3      | 74.6      | 77.3

Table 6: Extraction accuracy on the eight attributes of the AECR dataset (Task 3). “Pt.” stands for “Patient” and “Rp.” stands for “Reporter”. Scores are the average of 5 runs. * indicates statistical significance.

The extraction performance varies across the attributes, ranging from 61.4 F1 for Drug name to 95.8 F1 for Pt. date of birth (similar variations are visible in the baseline). Similarly, the gap between GraphIE and SeqIE varies in relation to the attributes, ranging between -0.5 F1 in Pt. date of birth and +2.4 F1 in Event.
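The macro averages in Table 6 can be re-derived from the per-attribute F1 scores, which is a quick sanity check on the reported numbers. The sketch below does this and also recomputes one F1 value from the Education row of Table 4. The helper names are hypothetical; because the reported scores are averages over several runs, recomputed values may in general differ in the last digit, and the micro average cannot be reproduced without the raw counts.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_average(values):
    """Unweighted mean over attributes."""
    return sum(values) / len(values)


# Per-attribute F1 scores copied from Table 6, in row order.
seqie_f1 = [92.9, 92.8, 96.3, 59.4, 63.9, 86.1, 75.6, 75.4]
graphie_f1 = [92.8, 92.9, 95.8, 61.4, 66.3, 86.9, 75.9, 77.1]

print(round(macro_average(seqie_f1), 1))    # 80.3
print(round(macro_average(graphie_f1), 1))  # 81.1

# SeqIE on Education (Table 4): P = 85.2, R = 93.6.
print(round(f1(85.2, 93.6), 1))             # 89.2
```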

In the ablation test described in Table 7, we can see the contribution, in development-set F1 points, of: using separate weights for different edge types (+0.8), horizontal edges (+3.1), vertical edges (+5.4), and CRF (+5.7).

Model              | Dev F1
-------------------|------------
GraphIE            | 77.8
– Edge types       | 77.0 (-0.8)
– Horizontal edges | 74.7 (-3.1)
– Vertical edges   | 72.4 (-5.4)
– CRF              | 72.1 (-5.7)

Table 7: Ablation study on AECR dataset (Task 3). Scores are micro average F1 on the development set. “–” means removing the element from GraphIE.

Templates        | SeqIE | GraphIE
-----------------|-------|--------
Seen Templates   | 80.3  | 83.1
Unseen Templates | 13.4  | 33.7

Table 8: Micro average F1 scores tested on seen and unseen templates (Task 3).


We also assess GraphIE’s capacity of dealing with unseen layouts through an extra analysis. From our dataset, we sample reports containing the three most frequent templates, and train the models on this subset. Then we test all models in two settings: 1) seen templates, consisting of additional reports in the same templates used for training; and 2) unseen templates, consisting of reports in two new template types.

The performance of GraphIE and SeqIE is reported in Table 8. Both models achieve good results on seen templates, with GraphIE still scoring higher than SeqIE. The gap becomes even larger when our model and the sequential one are tested on unseen templates (i.e. 33.7 versus 13.4 F1), demonstrating that by explicitly modeling the richer structural relations, GraphIE achieves better generalizability.


Conclusions

We introduced GraphIE, a framework for learning local and non-local contextual representations from graph structures. The system operates over a task-specific graph topology, jointly modeling the representations of textual units (i.e. words or sentences) and their dependencies. Graph convolutions propagate information between neighboring nodes to support the decoder during tagging at the word level.

We evaluated our framework on three IE tasks, namely social media, textual and visual information extraction. Our results show that supporting predictions with non-local context consistently enhances accuracy, outperforming the competitive SeqIE baseline based on BiLSTM+CRF.


References

  • Aumann et al. (2006) Aumann, Y.; Feldman, R.; Liberzon, Y.; Rosenfeld, B.; and Schler, J. 2006. Visual information extraction. Knowl. Inf. Syst. 10(1):1–15.
  • Benson, Haghighi, and Barzilay (2011) Benson, E.; Haghighi, A.; and Barzilay, R. 2011. Event discovery in social media feeds. In Proceedings of ACL, 389–398. ACL.
  • Finkel, Grenager, and Manning (2005) Finkel, J. R.; Grenager, T.; and Manning, C. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of ACL, 363–370. ACL.
  • Hettne et al. (2009) Hettne, K. M.; Stierum, R. H.; Schuemie, M. J.; Hendriksen, P. J.; Schijvenaars, B. J.; Mulligen, E. M. v.; Kleinjans, J.; and Kors, J. A. 2009. A dictionary to identify small molecules and drugs in free text. Bioinformatics 25(22):2983–2991.
  • Hirano et al. (2010) Hirano, T.; Asano, H.; Matsuo, Y.; and Kikui, G. 2010. Recognizing relation expression between named entities based on inherent and context-dependent features of relational words. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, 409–417. ACL.
  • Hochreiter and Schmidhuber (1997) Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • Hu et al. (2016) Hu, Z.; Ma, X.; Liu, Z.; Hovy, E.; and Xing, E. 2016. Harnessing deep neural networks with logic rules. In Proceedings of ACL, volume 1, 2410–2420.
  • Kim et al. (2016) Kim, Y.; Jernite, Y.; Sontag, D.; and Rush, A. M. 2016. Character-aware neural language models. In Proceedings of AAAI, 2741–2749. AAAI Press.
  • Kingma and Ba (2014) Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kipf and Welling (2016) Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • Krallinger et al. (2015) Krallinger, M.; Rabal, O.; Leitner, F.; Vazquez, M.; Salgado, D.; Lu, Z.; Leaman, R.; Lu, Y.; Ji, D.; Lowe, D. M.; et al. 2015. The chemdner corpus of chemicals and drugs and its annotation principles. Journal of cheminformatics 7(1):S2.
  • Lafferty, McCallum, and Pereira (2001) Lafferty, J. D.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, 282–289.
  • Lample et al. (2016) Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, 260–270.
  • Leskovec and Mcauley (2012) Leskovec, J., and Mcauley, J. J. 2012. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems, 539–547.
  • Li, Ji, and Huang (2013) Li, Q.; Ji, H.; and Huang, L. 2013. Joint event extraction via structured prediction with global features. In Proceedings of ACL (Volume 1: Long Papers), volume 1, 73–82.
  • Li, Ritter, and Hovy (2014) Li, J.; Ritter, A.; and Hovy, E. 2014. Weakly supervised user profile extraction from twitter. In Proceedings of ACL, volume 1, 165–174.
  • Ma and Hovy (2016) Ma, X., and Hovy, E. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of ACL (Volume 1: Long Papers), 1064–1074. Berlin, Germany: ACL.
  • Mann and McCallum (2010) Mann, G. S., and McCallum, A. 2010.

    Generalized expectation criteria for semi-supervised learning with weakly labeled data.

    Journal of Machine Learning Research

  • Mintz et al. (2009) Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL, 1003–1011. ACL.
  • Mislove et al. (2010) Mislove, A.; Viswanath, B.; Gummadi, K. P.; and Druschel, P. 2010. You are who you know: inferring user profiles in online social networks. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, 251–260. ACM.
  • Miwa and Bansal (2016) Miwa, M., and Bansal, M. 2016. End-to-end relation extraction using lstms on sequences and tree structures. arXiv preprint arXiv:1601.00770.
  • Peng et al. (2017) Peng, N.; Poon, H.; Quirk, C.; Toutanova, K.; and Yih, W.-t. 2017. Cross-sentence n-ary relation extraction with graph lstms. TACL 5:101–115.
  • Pennington, Socher, and Manning (2014) Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP, 1532–1543.
  • Quirk and Poon (2017) Quirk, C., and Poon, H. 2017. Distant supervision for relation extraction beyond the sentence boundary. In Proceedings of ACL, volume 1, 1171–1182.
  • Reichart and Barzilay (2012) Reichart, R., and Barzilay, R. 2012. Multi event extraction guided by global constraints. In Proceedings of NAACL-HLT, 70–79. ACL.
  • Roberts, Gaizauskas, and Hepple (2008) Roberts, A.; Gaizauskas, R.; and Hepple, M. 2008. Extracting clinical relationships from patient narratives. In

    Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing

    , 10–18.
  • Schlichtkrull et al. (2018) Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; van den Berg, R.; Titov, I.; and Welling, M. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, 593–607. Springer.
  • Swampillai and Stevenson (2011) Swampillai, K., and Stevenson, M. 2011. Extracting relations within and across sentences. In Proceedings of the International Conference Recent Advances in Natural Language Processing, 25–32.
  • Tai, Socher, and Manning (2015) Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.
  • Tjong, Erik, and De Meulder (2003) Tjong, K. S.; Erik, F.; and De Meulder, F. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of NAACL-HLT-Volume 4, 142–147. ACL.
  • Zhang, Qi, and Manning (2018) Zhang, Y.; Qi, P.; and Manning, C. D. 2018. Graph convolution over pruned dependency trees improves relation extraction. 2018.