1.1 Graph problems in NLP
There are plenty of graph problems in the natural language processing area, involving semantic graphs, dependency graphs, knowledge graphs and so on. Figure 1.1 shows examples of several types of graphs. A semantic graph (Figure 1.1(a)) visualizes the underlying meaning (such as who does what to whom) of a given sentence by abstracting the sentence into several concepts (such as “describe-01” and “person”) and their relations (such as “:ARG0” and “:ARG1”). On the other hand, a dependency graph (Figure 1.1(b)) simply captures word-to-word dependencies, such as “Bell” being the subject (subj) of “makes”. Finally, a knowledge graph (Figure 1.1(c)) represents real-world knowledge by entities (such as “U.S. Government” and “Barack_Obama”) and their relations (such as “belong_to” and “born_in”). Since they aim to capture world-level knowledge, knowledge graphs (such as Freebase (https://developers.google.com/freebase/) and DBPedia (https://wiki.dbpedia.org/)) are typically very large.
1.2 Previous approaches for modeling graphs
How to properly model these graphs has been a long-standing and important topic, as it directly contributes to natural language understanding, one of the key problems in NLP. Previously, statistical or rule-based approaches were introduced to model graphs. For instance, synchronous grammar-based methods have been proposed to model semantic graphs (Jones et al., 2012a; Flanigan et al., 2016b; Song et al., 2017) and dependency graphs (Xie et al., 2011; Meng et al., 2013) for machine translation and text generation. For modeling knowledge graphs, very different approaches have been adopted, probably because their scale is too large for grammar-based methods. One popular method is random walk, which has been investigated for knowledge base completion (Lao et al., 2011) and entity linking (Han et al., 2011; Xue et al., 2019).
Recently, research on analyzing graphs with deep learning models has been receiving more and more attention. This is because such models have demonstrated strong learning power and other superior properties, such as requiring no feature engineering and benefiting from large-scale data. To date, several types of neural networks have been studied.
Early efforts adopt convolutional neural networks (CNNs) (LeCun et al., 1995) for encoding graphs. As shown in Figure 1.2, these models adopt multiple convolution layers, each capturing the local correspondences within $n$-gram windows. By stacking the layers, more global correspondences can be captured. One drawback is the large amount of computation: CNNs calculate features by enumerating all $n$-gram windows, and there can be a very large number of such windows for a dense graph. In particular, each node in a complete graph of $N$ nodes has $N-1$ left and $N-1$ right neighbors, so there are $O(N^2)$ bigram, $O(N^3)$ trigram and $O(N^4)$ 4-gram windows, respectively. On the other hand, previous attempts at modeling text with CNNs do not suffer from this problem, as a sentence with $N$ words only has $O(N)$ windows for each $n$-gram size. This is because a sentence can be viewed as a chain graph, where each node has only one left neighbor and one right neighbor.
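To make the computation gap concrete, the following sketch (hypothetical helper names, plain Python) enumerates $n$-gram windows as simple paths of $n$ nodes, comparing a chain graph (a sentence) with a complete graph of the same size:

```python
def count_ngram_windows(edges, num_nodes, n):
    """Count the n-gram windows (simple paths of n nodes) a CNN would
    have to enumerate when convolving over an undirected graph."""
    adj = {v: set() for v in range(num_nodes)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def extend(path):
        if len(path) == n:
            return 1
        return sum(extend(path + [u]) for u in adj[path[-1]] if u not in path)

    return sum(extend([v]) for v in range(num_nodes))

chain = [(i, i + 1) for i in range(5)]                          # a 6-word sentence
complete = [(a, b) for a in range(6) for b in range(a + 1, 6)]  # a dense 6-node graph
print(count_ngram_windows(chain, 6, 3))     # 8: linear in sentence length
print(count_ngram_windows(complete, 6, 3))  # 120: polynomial blow-up
```

Even at 6 nodes, the complete graph already has 15 times as many trigram windows as the chain, and the gap widens polynomially with graph size.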
Another direction applies RNNs to linearized graphs (Konstas et al., 2017; Li et al., 2017), based on depth-first traversal algorithms. Usually a bidirectional RNN is adopted to capture global dependencies within the whole graph. Compared with CNNs, the computation of an RNN is only linear in the graph scale. However, one drawback of this direction is that some structural information is lost after linearization. Figure 1.3 shows a linearization result, where nodes “A” and “E” are far apart, although they are directly connected in the original graph. To alleviate this problem, previous methods insert brackets into their linearization results to indicate the original structure, hoping the RNN can recover the structure with the aid of the brackets. But it is still uncertain whether this method can recover all the lost information. Later work (Song et al., 2018d, 2019) shows large improvements by directly modeling the original graphs without linearization, which indicates that simply inserting brackets does not handle this problem very well.
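The linearization-with-brackets scheme can be sketched as follows (a hypothetical toy graph; real systems also emit edge labels). Note how a node directly connected to the root ends up far away in the token sequence:

```python
def linearize(children, node):
    """Depth-first linearization of a rooted acyclic graph, inserting
    brackets to mark the original structure."""
    tokens = [node]
    for child in children.get(node, []):
        tokens += ["("] + linearize(children, child) + [")"]
    return tokens

# "A" is directly connected to both "B" and "E" in the graph ...
graph = {"A": ["B", "E"], "B": ["C"], "C": ["D"]}
print(" ".join(linearize(graph, "A")))  # A ( B ( C ( D ) ) ) ( E )
# ... but "E" ends up separated from "A" by the whole bracketed subtree,
# so an RNN must bridge that distance to recover the direct edge.
```

With deeper subtrees the separation grows without bound, which is exactly the information loss described above.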
1.3 Motivation and overview of our model
In this dissertation, we explore better alternatives for encoding graphs, which are general enough to be applied to arbitrary graphs without destroying the original structures. We introduce graph recurrent networks (GRNs) and show that they are successful in handling a variety of graph problems in the NLP area. Note that there are other types of graph neural networks, such as the graph convolutional network (GCN) and the gated graph neural network (GGNN); I will give a comprehensive comparison in Section 2.2.
Given an input graph, GRN adopts a hidden state for each graph node. In order to capture non-local interactions between nodes, it allows information exchange between neighboring nodes through time. At each time step, each node propagates its information to every directly connected node, so that every node absorbs more and more global information through time. Each resulting node representation contains information from a large surrounding context, so the final representations can be very expressive for solving downstream graph problems, such as graph-to-sequence learning or graph classification. Note that our GRN model only requires local neighboring information, rather than the absolute topological order over all graph nodes required by RNNs. As a result, GRN can work on arbitrary graph structures, with or without cycles. From the global view, GRN takes the collection of all node states as the state of the entire graph, and the neighboring information exchange through time can be considered as a graph state transition process. The graph state transition is a recurrent process, where the state is recurrently updated through time (this is the reason we name it “graph recurrent network”). Comparatively, a regular RNN absorbs a new token to update its state at each step, while our GRN lets node states exchange information to update the graph state through time.
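The hop-per-step information flow can be illustrated with a deliberately simplified sketch. A real GRN applies gated (LSTM-style) updates; plain addition is used here only to show how information spreads one hop per transition:

```python
def grn_step(states, adj):
    """One graph state transition: every node adds the summed states of
    its direct neighbors to its own state (toy, ungated version)."""
    return [states[v] + sum(states[u] for u in adj[v]) for v in range(len(states))]

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # a 4-node chain graph
states = [1.0, 0.0, 0.0, 0.0]                  # information starts at node 0
states = grn_step(states, adj)                 # after 1 step, node 1 sees it
states = grn_step(states, adj)                 # after 2 steps, node 2 sees it
print(states)                                  # node 3 is still untouched
```

After $t$ transitions, each node's state reflects its $t$-hop neighborhood, which is why the number of transitions controls how global the final representations are.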
In this thesis, I will introduce the application of GRN to three types of popular graphs in the NLP area: dependency graphs, semantic graphs, and evidence graphs, which are constructed from textual input to model entities and their relations. The evidence graphs are created from documents to represent entities (such as “Time Square”, “New York City” and “United States”) and their relations for QA-oriented reasoning.
1.4 Thesis Outline
The remainder of the thesis is organized as follows.
Chapter 2: Background In this chapter, we briefly introduce previous deep learning models for encoding graphs. We first discuss applying RNNs and DAG networks on graphs, including their shortcomings. As a next step, we describe several types of graph neural networks (GNNs) in more detail, then systematically compare GNNs with RNNs. Finally, I point out one drawback of GNNs when encoding large-scale graphs, before showing some existing solutions.
Chapter 3: Graph Recurrent Network for Multi-hop Reading Comprehension In this chapter, I will first describe the multi-hop reading comprehension task, then propose a graph representation for each input document. To encode the graphs for global reasoning, we introduce three models for this task: one RNN baseline, one baseline with a DAG network, and our GRN-based model. For a fair comparison, all three models share the same framework, with the only difference being how they encode the graph representations. Finally, our comprehensive experiments show the superiority of our GRN.
Chapter 4: Graph Recurrent Network for N-ary Relation Extraction In this chapter, we extend our GRN from undirected and edge-unlabeled graphs (as in Chapter 3) to dependency graphs for solving a medical relation extraction (classification) problem. The goal is to determine whether a given medicine is effective on cancers caused by a certain type of mutation on a certain gene. Previous work has shown the effectiveness of incorporating rich syntactic and discourse information. The previous state of the art proposes DAG networks after splitting each dependency graph into two DAGs. Arguing that important information is lost by splitting the original graphs, we adapt GRN to the dependency graphs without destroying any graph structure.
Chapter 5: Graph Recurrent Network for AMR-to-text Generation In this chapter, we propose a graph-to-sequence model by extending GRN with an attention-based LSTM, and evaluate our model on AMR-to-text generation. AMR is a semantic formalism based on directed and edge-labelled graphs, and the task of AMR-to-text generation aims at recovering the original sentence of a given AMR graph. In our extensive experiments, our model shows consistently better performance than a sequence-to-sequence baseline with a Bi-LSTM encoder under the same configuration, demonstrating the superiority of our GRN over sequential encoders.
Chapter 6: Graph Recurrent Network for Semantic NMT using AMR
In this chapter, we further adapt our GRN-based graph-to-sequence model to AMR-based semantic neural machine translation. In particular, our model is extended with another Bi-LSTM encoder for modeling source sentences, as they are crucial for translation. The AMRs are automatically obtained by parsing the source sentences. Experiments show that using AMR outperforms other common syntactic and semantic representations, such as dependency trees and semantic roles. We also show that the GRN-based encoder is better than a Bi-LSTM encoder over linearized AMRs, which is consistent with the results of Chapter 5.
Chapter 7: Conclusion Finally, I summarize the main contributions of this thesis, and propose some future research directions for further improving our graph recurrent network.
2.1 Encoding graphs with RNN or DAG network
Since the first great breakthrough of recurrent neural networks (RNNs) on machine translation (Cho et al., 2014; Bahdanau et al., 2015), people have investigated the usefulness of RNNs and several of their extensions for solving graph problems in the NLP area. Below we take abstract meaning representation (AMR) (Banarescu et al., 2013) as an example to demonstrate several existing ways of encoding graphs with RNNs. As shown in Figure 2.1, AMRs are rooted and directed graphs, where the graph nodes (such as “want-01” and “boy”) represent concepts and the edges (such as “ARG0” and “ARG1”) represent relations between nodes.
Encoding with RNNs
To encode AMRs, one line of approaches (Konstas et al., 2017) first linearizes the input with depth-first traversal, before feeding the linearization result into a multi-layer LSTM encoder. The linearization causes loss of structural information. For instance, originally closely-located graph nodes (such as parents and children) can end up very far apart, especially when the graph is large. In addition, there is no specific order among the children of a graph node, resulting in multiple possible linearizations; this increases data variation and further introduces ambiguity. Despite the drawbacks mentioned above, these approaches achieved great success with the aid of large-scale training. For example, Konstas et al. (2017) leveraged 20 million sentences paired with automatically parsed AMR graphs using a multi-layer LSTM encoder, demonstrating dramatic improvements (5.0+ BLEU points) over existing statistical models (Pourdamghani et al., 2016; Flanigan et al., 2016b; Song et al., 2016b, 2017). This demonstrates the strong learning power of RNNs, but there is still room for improvement due to the drawbacks above.
Encoding with DAG networks
A better alternative to RNNs is DAG networks (Zhu et al., 2016; Su et al., 2017; Zhang and Yang, 2018), which extend RNNs to directed acyclic graphs (DAGs). Compared with sentences, where each word has exactly one preceding word and one succeeding word, a node in a DAG can have multiple preceding and succeeding nodes. To adapt RNNs to DAGs, the hidden states of multiple preceding nodes are first merged before being used to calculate the hidden state of the current node. One popular way of merging preceding hidden states is called “child-sum” (Tai et al., 2015), which simply sums up the states to produce one vector. Compared with RNNs, DAG networks have the advantage of preserving the original graph structures. Recently, Takase et al. (2016) applied a DAG network to encoding AMRs for headline generation, but no comparisons were made to contrast their model with any RNN-based models. Still, DAG networks are intuitively more suitable for encoding graphs than RNNs.
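The child-sum merge itself is a one-liner; the sketch below (with made-up toy values) shows the element-wise sum that produces the single vector consumed by the DAG LSTM cell:

```python
def child_sum(child_states):
    """Child-sum merge: element-wise sum of the preceding nodes' hidden
    states, producing one vector for the DAG LSTM cell to consume."""
    return [sum(dims) for dims in zip(*child_states)]

# A node with three preceding nodes in the DAG:
merged = child_sum([[1.0, 2.0], [0.5, 0.5], [0.0, 1.0]])
print(merged)  # [1.5, 3.5]
```

Because summation is order-insensitive, the merge works for any number of preceding nodes, which is precisely what a chain RNN cannot handle.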
However, DAG networks suffer from two major problems. First, they fail on cyclic graphs, as they require an exact and finite node order to be executed on. Second, sibling nodes cannot incorporate each other's information, as the encoding procedure can either be bottom-up or top-down, but not both at the same time. For the first problem, previous work introduces two solutions to adapt DAG networks to cyclic graphs, but to my knowledge, no solution has been available for the second problem. Still, both solutions to the first problem have their own drawbacks, which I discuss below.
One solution (Peng et al., 2017) first splits a cyclic graph into two DAGs, which are then encoded with separate DAG networks. Note that legal splits always exist: one can first decide an order over graph nodes, then separate left-to-right edges from right-to-left ones to make a split. An obvious drawback is that structural information is lost by splitting a graph into two DAGs. Chapter 4 mainly studies this problem and gives our solution. The other solution (Liang et al., 2016) is to leverage another model to pick a node order from an undirected and unrooted graph. Since there is an exponential number of node orders, more ambiguity is introduced, even though a model is used to pick the order. Also, preceding nodes cannot incorporate the information from their succeeding nodes.
2.2 Encoding graphs with Graph Neural Network
Since being introduced, graph neural networks (GNNs) (Scarselli et al., 2009) were long neglected, until a recent surge of interest (Li et al., 2016; Kipf and Welling, 2017; Zhang et al., 2018). To update node states within a graph, GNNs rely on a message passing mechanism that iteratively updates all node states in parallel. During an iteration, a message is first aggregated for each node from its neighbors, then the message is applied to update the node state. More specifically, updating the hidden state $h_v^{(t)}$ for node $v$ at iteration $t$ can be formalized as the following equations:
where $h_{N(v)}^{(t-1)}$ and $e_{N(v)}$ represent the hidden states and embeddings of the neighbors of $v$, and $N(v)$ corresponds to the set of neighbors of $v$. We can see that each node gradually absorbs a larger context through this message passing framework. This framework is remotely related to loopy belief propagation (LBP) (Murphy et al., 1999), but the main difference is that LBP propagates probabilities, while GNNs propagate hidden-state units. Besides, the message passing process of a GNN is only executed a fixed number of times, while LBP is usually executed until convergence. The reason is that GNNs are optimized for end-to-end task performance, not for a joint probability.
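The aggregate-then-update loop can be sketched in a few lines. The scalar states and the averaging update rule below are toy stand-ins (not the actual Equations 2.1 and 2.2), chosen only to show the two-phase structure of one iteration:

```python
def message_passing(h, emb, adj, steps):
    """Generic GNN message passing: per iteration, aggregate a message
    for every node from its neighbors' states and embeddings, then
    update all node states in parallel."""
    for _ in range(steps):
        msgs = {v: sum(h[u] + emb[u] for u in adj[v]) for v in adj}  # aggregate
        h = {v: 0.5 * h[v] + 0.5 * msgs[v] for v in adj}             # update (toy rule)
    return h

adj = {0: [1], 1: [0, 2], 2: [1]}
emb = {0: 1.0, 1: 0.0, 2: 0.0}                 # input information sits at node 0
final = message_passing({v: 0.0 for v in adj}, emb, adj, steps=2)
print(final)                                   # node 2 is nonzero after 2 steps
```

Because all node states are read before any is written, the update is embarrassingly parallel within an iteration, which is the property exploited on GPUs.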
2.2.1 Main difference between GNNs and RNNs
The main difference is that GNNs do not require a node order over the input, such as a left-to-right order for sentences or a bottom-up order for trees. In contrast, having an order is crucial for RNNs and their DAG extensions. In fact, GNNs only require local neighborhood information, thus they are agnostic to the input structure and are general enough to be applied to any type of graph, tree or even sentence.
This is a fundamental difference that leads to many superior properties of GNNs compared with RNNs and DAG networks. First, GNNs update node states in parallel within an iteration, thus they can be much faster than RNNs. We will give more discussion and analysis in Chapters 4 and 5. Second, with GNNs, sibling nodes can easily incorporate each other's information. This can be achieved by simply executing a GNN for two iterations, so that the information of one sibling can go up and then down to reach the other.
2.2.2 Different types of GNNs
So far there have been several types of GNNs, and their main differences lie in the way node states are updated from aggregated messages (Equation 2.2). We list several existing approaches in Table 2.1 and give a detailed introduction below:
The first type is named graph convolutional network (GCN) (Kipf and Welling, 2017). It borrows the idea of the convolutional neural network (CNN) (Krizhevsky et al., 2012), which gathers larger contextual information through each convolution operation. To deal with the problem that a graph node can have an arbitrary number of neighbors, GCN and its later variants first sum up the hidden states of all neighbors, before applying the summation result as the message to update the graph node state:
Note that some GCN variants try to distinguish different types of neighbors before the summation:
But they actually choose the transformation matrices to be identical, and thus this is equivalent to Equation 2.3. The underlying reason is that making the matrices different would introduce a lot of parameters, especially when the number of neighbor types is large. The summation operation can be considered as the message aggregation process first mentioned in Equation 2.1. In fact, most existing GNNs use summation to aggregate messages. There are also other ways, which I will introduce later in this chapter.
Another type, the graph attention network (GAT), aggregates messages with multi-head self-attention over each node's neighbors, where $\mathrm{ATT}^{(k)}$ corresponds to the $k$-th multi-head self-attention layer. For the next step, GAT directly uses the newly calculated message as the new node state: $h_v^{(t)} = m_v^{(t)}$.
[Table 2.1: comparison of GNN types. Columns: Type | Model | Ways for applying messages]
The message propagation method based on linear transformations in GCN may suffer from the long-range dependency problem when dealing with very large and complex graphs. To alleviate this problem, recent work has proposed gating mechanisms for processing messages. In particular, graph recurrent networks (GRNs) (Zhang et al., 2018; Song et al., 2018d) leverage the gated operations of an LSTM (Hochreiter and Schmidhuber, 1997) step to apply messages for node state updates, while gated graph neural networks (GGNNs) (Li et al., 2016; Beck et al., 2018) adopt a GRU (Cho et al., 2014) step to conduct the update. The message propagation mechanism for both models is shown in the last group of Table 2.1. To generate messages, they also simply sum up the hidden states of all neighbors, the same as GCNs.
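A gated update can be sketched as follows. This is a scalar, GGNN-flavored GRU step with made-up weight names; real models operate on vectors with weight matrices, but the gating logic is the same: the gates decide how much of the old node state to keep versus overwrite with the incoming message:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ggnn_update(h, message, w):
    """Apply an aggregated message to a node state via a (scalar) GRU step."""
    z = sigmoid(w["z_m"] * message + w["z_h"] * h)              # update gate
    r = sigmoid(w["r_m"] * message + w["r_h"] * h)              # reset gate
    h_tilde = math.tanh(w["c_m"] * message + w["c_h"] * (r * h))  # candidate state
    return (1.0 - z) * h + z * h_tilde                          # gated interpolation

w = {"z_m": 1.0, "z_h": -1.0, "r_m": 1.0, "r_h": 1.0, "c_m": 1.0, "c_h": 1.0}
new_h = ggnn_update(h=0.2, message=0.7, w=w)
print(new_h)
```

The interpolation through the update gate is what lets gradients flow across many propagation steps, easing the long-range dependency problem that plain linear updates suffer from.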
Discussion on message aggregation
The models mentioned above use either summation or an attention mechanism to calculate messages. As a result, these models share the same property: they are invariant to permutations of their inputs, i.e. to different orders of neighbors. This property is also called “symmetric”, and Hamilton et al. (2017) introduce several other “symmetric” and “asymmetric” message aggregators. In addition to summation and attention mechanisms, mean pooling and max pooling operations are also “symmetric”. This is intuitive, as both pooling operations are obviously invariant to input order. On the other hand, they mention an LSTM aggregator, which generates messages by applying an LSTM to a random permutation of a node's neighbors. The LSTM aggregator is not permutation invariant, and thus is “asymmetric”.
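The symmetric/asymmetric distinction is easy to demonstrate. Below, mean pooling is order-invariant, while an order-sensitive recurrence (a deliberately simplified stand-in for an LSTM aggregator, not the real cell) gives different results for permuted neighbor lists:

```python
def mean_agg(states):
    return sum(states) / len(states)

def lstm_like_agg(states):
    """Stand-in for an LSTM aggregator: the running state makes the
    result depend on the order in which neighbors are consumed."""
    h = 0.0
    for s in states:
        h = 0.5 * h + s
    return h

neighbors = [3.0, 1.0, 2.0]
permuted = [1.0, 2.0, 3.0]
print(mean_agg(neighbors) == mean_agg(permuted))            # True  ("symmetric")
print(lstm_like_agg(neighbors) == lstm_like_agg(permuted))  # False ("asymmetric")
```

An asymmetric aggregator therefore injects arbitrary order information into the messages, which is why Hamilton et al. (2017) pair it with random permutations of the neighbors.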
2.2.3 Discussion on the memory usage of GNNs
So far we have discussed several advantages of GNNs. Compared with RNNs and DAG networks, GNNs are more flexible for handling any type of graph. Besides, they allow better parallelization, and thus are more efficient on GPUs. However, they also have limitations, the most severe being their large memory usage.
As mentioned above, GNNs update every graph node state within an iteration, so all node states are updated $T$ times if a GNN executes $T$ message passing steps. As a result, the increased computation causes more memory usage. In general, the amount of memory usage is highly related to the density and scale of the input graph.
To reduce this cost, recent work introduces a parameterized and trainable sampler that performs layerwise sampling conditioned on the former layer. This adaptive sampler can find optimal sampling importance and reduce variance simultaneously.
Recent years have witnessed a growing interest in the task of machine reading comprehension. Most existing work (Hermann et al., 2015; Wang and Jiang, 2017; Seo et al., 2016; Wang et al., 2016; Weissenborn et al., 2017; Dhingra et al., 2017a; Shen et al., 2017; Xiong et al., 2016) focuses on a factoid scenario where the questions can be answered by considering very local information, such as one or two sentences. For example, to correctly answer the question “What causes precipitation to fall?”, a QA system only needs to refer to one sentence in a passage: “… In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. …”, and the final answer “gravity” is indicated by the key words “precipitation” and “falls”.
A more challenging yet practical extension is multi-hop reading comprehension (MHRC) (Welbl et al., 2018), where a system needs to properly integrate multiple pieces of evidence to correctly answer a question. Figure 3.1 shows an example, which contains three associated passages, a question and several candidate choices. In order to correctly answer the question, a system has to integrate the facts “The Hanging Gardens are in Mumbai” and “Mumbai is a city in India”. There are also some irrelevant facts, such as “The Hanging Gardens provide sunset views over the Arabian Sea” and “The Arabian Sea is bounded by Pakistan and Iran”, which make the task more challenging, as an MHRC model has to distinguish the relevant facts from the irrelevant ones.
[The Hanging Gardens], in [Mumbai], also known as Pherozeshah Mehta Gardens, are terraced gardens … [They] provide sunset views over the [Arabian Sea] …
[Mumbai] (also known as Bombay, the official name until 1995) is the capital city of the Indian state of Maharashtra. [It] is the most populous city in [India] …
The [Arabian Sea] is a region of the northern Indian Ocean bounded on the north by [Pakistan] and [Iran], on the west by northeastern [Somalia] and the Arabian Peninsula, and on the east by …
Q: (The Hanging gardens, country, ?)
Candidate answers: Iran, India, Pakistan, Somalia, …
Despite being a practical task, so far MHRC has received little research attention. One notable method, Coref-GRU (Dhingra et al., 2018), uses coreference information to gather richer context for each candidate. However, one main disadvantage of Coref-GRU is that the coreferences it considers are usually local to a sentence, neglecting other useful global information. In addition, the resulting DAGs are usually very sparse, thus few new facts can be inferred. The top part of Figure 3.2 shows a directed acyclic graph (DAG) with only coreference edges. In particular, the two coreference edges infer two facts: “The Hanging Gardens provide views over the Arabian Sea” and “Mumbai is a city in India”, from which we cannot infer the ultimate fact, “The Hanging Gardens are in India”, for correctly answering this instance.
We propose a general graph scheme for evidence integration, which allows information exchange beyond coreference nodes by allowing arbitrary degrees of connectivity in the reference graphs. In general, we want the resulting graphs to be more densely connected so that more useful facts can be inferred. For example, each edge can connect two related entity mentions, while unrelated mentions, such as “the Arabian Sea” and “India”, may not be connected. In this chapter, we consider three types of relations, as shown in the bottom part of Figure 3.2.
The first type of edge connects mentions of the same entity that appear in different passages or far apart in the same passage. As shown in Figure 3.2, one instance connects the two “Mumbai” mentions across the two passages. Intuitively, same-typed edges help to integrate global evidence related to the same entity that is not covered by pronouns. The second type of edge connects mentions of two different entities within a context window. These edges help to pass useful evidence further across entities. For example, in the bottom graph of Figure 3.2, the window-typed edges 1⃝ and 6⃝ both help to pass evidence from “The Hanging Gardens” to “India”, the answer of this instance. Besides, window-typed edges enhance the relations between local mentions that can be missed by the sequential encoding baseline. Finally, coreference-typed edges are complementary to the previous two types, and thus we also include them.
Our generated graphs are complex and can have cycles, making it difficult to directly apply a DAG network (e.g. the structure of Coref-GRU). So we adopt graph recurrent network (GRN), as it has been shown successful on encoding various types of graphs, including semantic graphs (Song et al., 2018d), dependency graphs (Song et al., 2018e) and even chain graphs created by raw texts (Zhang et al., 2018).
Given an instance containing several passages and a list of candidates, we first use NER and coreference resolution tools to obtain entity mentions, and then create a graph out of the mentions and relevant pronouns. As the next step, evidence integration is executed on the graph by adopting a graph neural network on top of a sequential layer. The sequential layer learns local representation for each mention, while the graph network learns a global representation. The answer is decided by matching the representations of the mentions against the question representation.
Experiments on WikiHop (Welbl et al., 2018) show that our created graphs are highly useful for MHRC. On the hold-out testset, our model achieves an accuracy of 65.4%, which was highly competitive on the leaderboard (http://qangaroo.cs.ucl.ac.uk/leaderboard.html) as of the paper submission time. In addition, our experiments show that the questions and answers are dramatically better connected on our graphs than on the coreference DAGs, when we map questions onto the graphs using the question subject. Our experiments also show a positive relation between graph connectivity and end-to-end accuracy.
On the testset of ComplexWebQuestions (Talmor and Berant, 2018), our method also achieves better results than all published numbers. To our knowledge, we are among the first to investigate graph neural networks on reading comprehension.
As shown in Figure 3.3, we introduce two baselines, which are inspired by Dhingra et al. (2018). The first baseline, Local, uses a standard BiLSTM layer (shown in the green dotted box), where inputs are first encoded with a BiLSTM layer, and then the representation vectors for the mentions in the passages are extracted, before being matched against the question for selecting an answer. The second baseline, Coref LSTM, differs from Local by replacing the BiLSTM layer with a DAG LSTM layer (shown in the orange dotted box) for encoding additional coreference information, as proposed by Dhingra et al. (2018).
3.2.1 Local: BiLSTM encoding
Given a list of relevant passages, we first concatenate them into one large passage $p = [w_1, \dots, w_n]$, where each $w_i$ is a passage word and $x_i$ is its embedding. The Local baseline adopts a Bi-LSTM to encode the passage:
Each hidden state contains the information of its local context. Similarly, the question words are first converted into embeddings before being encoded by another BiLSTM:
3.2.2 Coref LSTM: DAG LSTM encoding with coreference
Taking the passage word embeddings and coreference information as input, the DAG LSTM layer encodes each input word embedding (such as $x_i$) with the following gated operations (only the forward direction is shown for space considerations):
where $P(i)$ represents all preceding words of $w_i$ in the DAG, and $\mathbf{i}_i$, $\mathbf{o}_i$ and $\mathbf{f}_i$ are the input, output and forget gates, respectively. $W^x$, $U^x$ and $b^x$ ($x \in \{i, o, f, u\}$) are model parameters.
3.2.3 Representation extraction
After encoding both the passage and the question, we obtain a representation vector $r_j$ for each entity mention $m_j \in M$ (where $M$ represents all entities), spanning from position $s_j$ to $t_j$, by concatenating the hidden states of its start and end positions, before correlating them with a fully connected layer:
where $W_m$ and $b_m$ are model parameters for compressing the concatenated vector. Note that current multi-hop reading comprehension datasets all focus on the situation where the answer is a named entity. Similarly, the representation vector $r_q$ for the question is generated by concatenating the hidden states of its first and last positions:
where $W_q$ and $b_q$ are also model parameters.
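The extraction step can be sketched in plain Python (list-based vectors; the weight values and names are made up for illustration): concatenate the boundary hidden states, then compress with a fully connected layer:

```python
def mention_repr(hidden, start, end, W, b):
    """Concatenate the hidden states at a mention's start and end
    positions, then compress with a fully connected layer (a plain
    matrix-vector product in this sketch)."""
    concat = hidden[start] + hidden[end]          # list concatenation = vector concat
    return [sum(w * x for w, x in zip(row, concat)) + bj
            for row, bj in zip(W, b)]

hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]     # BiLSTM states for 3 words
W = [[1.0, 0.0, 0.0, 0.0],                        # compress 4 dims down to 2
     [0.0, 0.0, 0.0, 1.0]]
b = [0.0, 0.0]
print(mention_repr(hidden, start=0, end=2, W=W, b=b))
```

Using the boundary states rather than, say, averaging over the span keeps the representation sensitive to the mention's exact extent.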
3.2.4 Attention-based matching
Given the representation vectors for the question and the entity mentions in the passages, an additive attention model (Bahdanau et al., 2015) is adopted, treating the entity mention representations as the memory and the question representation as the query. (We adopt a standard matching method, as our focus is the effectiveness of evidence integration; we leave investigating other approaches (Luong et al., 2015; Wang et al., 2017) as future work.) In particular, the probability of a candidate $c$ being the answer is calculated by summing over all the occurrences of $c$ across the input passages (all candidates form a subset of all entities):
where $O(c)$ and $O(C)$ represent all occurrences of the candidate $c$ and all occurrences of all candidates, respectively. Previous work (Wang et al., 2018a) shows that summing the probabilities over all occurrences of the same entity mention is important in the multi-passage scenario. $\alpha_j$ is the attention score for entity mention $m_j$, calculated by the additive attention model shown below:
where $W_a$, $U_a$ and $v_a$ are model parameters.
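The occurrence-summing step can be sketched as follows (toy scores; candidate names taken from the running example). A softmax over all mention occurrences is followed by summing each candidate's probability mass over its occurrences:

```python
import math

def candidate_probs(scores, occurrences):
    """Softmax over the attention scores of all mention occurrences,
    then sum each candidate's probability mass over its occurrences."""
    z = sum(math.exp(s) for s in scores)
    p = [math.exp(s) / z for s in scores]
    return {c: sum(p[i] for i in idx) for c, idx in occurrences.items()}

scores = [2.0, 1.0, 2.0, 0.0]                     # one score per mention occurrence
occurrences = {"India": [0, 2], "Iran": [1], "Somalia": [3]}
probs = candidate_probs(scores, occurrences)
print(probs["India"] > probs["Iran"])             # repeated evidence accumulates
```

A candidate mentioned in several passages accumulates probability from every occurrence, which is exactly why the summation matters in the multi-passage scenario.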
Comparison with Dhingra et al. (2018)
The Coref-GRU model (Dhingra et al., 2018) is based on the gated-attention reader (GA reader) (Dhingra et al., 2017a). GA reader is designed for the cloze-style reading comprehension task (Hermann et al., 2015), where one token is selected from the input passages as the answer for each instance. To adapt their model to the WikiHop benchmark, where an answer candidate can contain multiple tokens, they first generate a probability distribution over the passage tokens with GA reader, and then compute the probability for each candidate by aggregating the probabilities of all passage tokens that appear in that candidate and renormalizing over the candidates.
In addition to using LSTM instead of GRU (model architectures are selected according to dev results), the main difference between our two baselines and Dhingra et al. (2018) is that our baselines consider each candidate as a whole unit, no matter whether it contains multiple tokens or not. This makes our models more effective on datasets containing phrasal answer candidates.
3.3 Evidence integration with GRN encoding
Over the representation vectors for a question and the corresponding entity mentions, we build an evidence integration graph by connecting relevant mentions with edges, and then integrate relevant information for each graph node (entity mention) with a graph recurrent network (GRN) (Zhang et al., 2018; Song et al., 2018d). Figure 3.4 shows the overall procedure of our approach.
3.3.1 Evidence graph construction
As a first step, we create an evidence graph from a list of input passages to represent interrelations among entities within the passages. The entity mentions within the passages are taken as the graph nodes. They are automatically generated by NER and coreference annotators, so that each graph node is either an entity mention or a pronoun representing an entity. We then create a graph by ensuring that edges between two nodes follow the situations below:
The two nodes are occurrences of the same entity, either across passages or within the same passage with a distance larger than a threshold (a same-typed edge).
One node is an entity mention and the other is coreferent with it (a coref-typed edge). Our coreference information is automatically generated by a coreference annotator.
The two nodes are mentions of different entities in the same passage within a window threshold (a window-typed edge).
For every two nodes satisfying one of the situations above, we make two edges in opposite directions. As a result, each generated graph can also be considered as an undirected graph.
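A minimal sketch of this construction (illustrative only: the mention format, function names, and the default thresholds of 200 and 20 from Section 3.5 are assumptions about one possible implementation):

```python
def build_evidence_graph(mentions, coref_pairs, same_thresh=200, window=20):
    """mentions: list of (id, entity, passage_id, token_position) tuples.
    coref_pairs: (id, id) pairs from a coreference annotator.
    Returns a set of labeled edges (src, tgt, label), added in both directions."""
    edges = set()

    def connect(a, b, label):  # two edges in opposite directions
        edges.add((a, b, label))
        edges.add((b, a, label))

    for i, (id1, ent1, p1, pos1) in enumerate(mentions):
        for id2, ent2, p2, pos2 in mentions[i + 1:]:
            same_passage = p1 == p2
            dist = abs(pos1 - pos2)
            if ent1 == ent2 and (not same_passage or dist > same_thresh):
                connect(id1, id2, "same")     # same entity, far apart or cross-passage
            elif ent1 != ent2 and same_passage and dist <= window:
                connect(id1, id2, "window")   # different entities, nearby
    for a, b in coref_pairs:
        connect(a, b, "coref")
    return edges
```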
3.3.2 Evidence integration with graph encoding
Tackling multi-hop reading comprehension requires reasoning over global context. As the next step, we merge related information through the three types of edges just created by applying a GRN to our graphs.
Figure 3.5 shows the overall structure of our graph encoder. Formally, given a graph $G=(V,E)$, a hidden state vector $h_j$ is created to represent each entity mention $v_j \in V$. The state of the graph can thus be represented as $g = \{h_j\}|_{v_j \in V}$.
In order to integrate non-local evidence among nodes, information exchange between neighboring nodes is performed through recurrent state transitions, leading to a sequence of graph states $g^0, g^1, \dots, g^T$, where $g^t = \{h_j^t\}|_{v_j \in V}$ and $T$ is a hyperparameter representing the number of graph state transitions, decided by a development experiment. Each initial state is computed as $h_j^0 = W[x_j; q] + b$,
where $x_j$ is the corresponding representation vector of entity mention $v_j$, calculated by Equation 3.6, $q$ is the question representation, and $W$ and $b$ are model parameters.
A gated recurrent neural network is used to model the state transition process. In particular, the transition from $g^{t-1}$ to $g^t$ consists of a hidden state transition for each node, as shown in Figure 3.5. At each step $t$, direct information exchange is conducted between a node and all its neighbors. To avoid gradient vanishing or explosion, an LSTM-style update (Hochreiter and Schmidhuber, 1997) is adopted, where a cell vector $c_j^t$ is taken to record memory for hidden state $h_j^t$:
where $c_j^t$ is the cell vector recording memory for $h_j^t$, and $i_j^t$, $o_j^t$ and $f_j^t$ are the input, output and forget gates, respectively; $W_x$, $U_x$ and $b_x$ ($x \in \{i, o, f, u\}$) are model parameters. In the remainder of this thesis, I will use the symbol LSTM to represent Equation 3.13. $m_j^t$ is the sum of the neighboring hidden states for node $v_j$ (we tried distinguishing neighbors by different types of edges, but it did not improve the performance):
$N(v_j)$ represents the set of all neighbors of $v_j$.
Using the above state transition mechanism, information from each node propagates to all its neighboring nodes after each step. Therefore, in the worst case, where the input graph is a chain of nodes, the maximum number of steps necessary for information from one arbitrary node to reach another is equal to the size of the graph. We experiment with different numbers of transition steps to study the effectiveness of global encoding.
Note that unlike the sequence LSTM encoder, our graph encoder allows parallelization in node-state updates, and thus can be highly efficient using a GPU. It is general and can be potentially applied to other tasks, including sequences, syntactic trees and cyclic structures.
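As a concrete illustration, one graph state transition can be sketched in NumPy (a simplified, hypothetical implementation: the stacked-gate parameter layout, shapes, and function names are assumptions, not the thesis' exact formulation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grn_step(H, C, neighbors, W, U, b):
    """One graph state transition. H, C: (num_nodes, d) hidden/cell states.
    neighbors: list of neighbor-index lists. W, U: (4d, d) and b: (4d,)
    hold the stacked input/forget/output/update gate parameters."""
    d = H.shape[1]
    H_new, C_new = np.zeros_like(H), np.zeros_like(C)
    for j in range(H.shape[0]):
        # message: sum of neighboring hidden states
        m = H[neighbors[j]].sum(axis=0) if neighbors[j] else np.zeros(d)
        z = W @ m + U @ H[j] + b          # all gates in one matmul
        i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
        u = np.tanh(z[3*d:])
        C_new[j] = f * C[j] + i * u       # update cell memory
        H_new[j] = o * np.tanh(C_new[j])
    return H_new, C_new
```

All node updates inside one step are independent, which is why the real model can parallelize them on a GPU.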
3.3.3 Matching and combination
After evidence integration, we match the hidden states at each graph encoding step with the question representation, using the same additive attention mechanism introduced in the Baseline section. In particular, for each entity mention $v_j$, the matching results for the baseline and for each graph encoding step are first generated, before being combined using a weighted sum to obtain the overall matching result:
where $m_j^0$ is the baseline matching result for $v_j$, $m_j^t$ is the matching result after $t$ steps, and $T$ is the total number of graph encoding steps; the remaining symbols are model parameters. In addition, a probability distribution is calculated from the overall matching results using softmax, similar to Equation 3.10. Finally, the probabilities of mentions belonging to the same candidate are merged to obtain the final distribution, as shown in Equation 3.8.
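The softmax-and-merge step just described can be sketched as follows (hypothetical names; a minimal illustration rather than the exact implementation):

```python
import math

def final_distribution(mention_scores, mention_candidates):
    """Softmax over mention-level matching scores, then merge the
    probabilities of mentions referring to the same candidate."""
    exps = [math.exp(s) for s in mention_scores]
    z = sum(exps)
    dist = {}
    for e, cand in zip(exps, mention_candidates):
        dist[cand] = dist.get(cand, 0.0) + e / z   # sum over mentions of a candidate
    return dist
```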
We train both the baseline and our models using the cross-entropy loss:
where $y^*$ is the ground-truth answer, and $X$ and $\theta$ are the input and model parameters, respectively. Adam (Kingma and Ba, 2014) with a learning rate of 0.001 is used as the optimizer. Dropout with rate 0.1 and $\ell_2$ regularization are used during training.
3.5 Experiments on WikiHop
In this section, we study the effectiveness of the rich edge types and the graph encoders using the WikiHop (Welbl et al., 2018) dataset. It is designed for multi-evidence reasoning, as its construction process ensures that multiple pieces of evidence are required to induce the answer for each instance.
The dataset contains around 51K instances, including 44K for training, 5K for development and 2.5K for held-out testing. Each instance consists of a question, a list of associated passages, a list of candidate answers and a correct answer. One example is shown in Figure 3.1. On average, each instance has around 19 candidates, all of which belong to the same category. For example, if the answer is a country, all other candidates are also countries. We use Stanford CoreNLP (Manning et al., 2014) to obtain coreference and NER annotations. Then the entity mentions, pronoun coreferences and the provided candidates are taken as graph nodes to create an evidence graph. The distance thresholds (Section 3.3.1) for making same- and window-typed edges are set to 200 and 20, respectively.
We study the model behavior on the WikiHop devset, choosing the best hyperparameters for online system evaluation on the final holdout testset. Our word embeddings are initialized from the 300-dimensional GloVe word embeddings (Pennington et al., 2014) pretrained on Common Crawl, and are not updated during training.
For model hyperparameters, we set the number of graph state transitions to 3 according to development experiments. Each node takes information from at most 200 neighbors, where same- and coref-typed neighbors are kept first. The hidden vector sizes for both the bidirectional LSTM and GRN layers are set to 300.
3.5.3 Development experiments
Figure 3.6 shows the devset performance of our GRN-based model with different numbers of transition steps; step 0 corresponds to the baseline. Accuracy goes up as the transition step increases to 3. Further increasing the transition step leads to a slight performance decrease; one reason can be that executing more transition steps may also introduce noise through richly connected edges. We set the transition step to 3 for all remaining experiments.
Table 3.1 (excerpt):

System                                     Dev     Test
GA w/ GRU (Dhingra et al., 2018)           54.9    –
GA w/ Coref-GRU (Dhingra et al., 2018)     56.0    59.3
Leaderboard 1st [anonymized]               –       70.6
Leaderboard 2nd [anonymized]               –       67.6
3rd, Cao et al. (2018)                     –       67.6
3.5.4 Main results
Table 3.1 shows the main comparison with existing work. (At paper-writing time, we observed a recent short arXiv paper (Cao et al., 2018) and two anonymous papers submitted to ICLR showing better results with ELMo (Peters et al., 2018). Our main contribution is studying an evidence integration approach, which is orthogonal to the contribution of ELMo on large-scale training; we will investigate ELMo in a future version.) GA w/ GRU and GA w/ Coref-GRU correspond to Dhingra et al. (2018), and their reported numbers are copied. The former is their baseline, a gated-attention reader (Dhingra et al., 2017a), and the latter is their proposed method.
For our baselines, Local and Local-2L encode passages with a BiLSTM and a 2-layer BiLSTM, respectively; both capture only local information for each mention. We introduce Local-2L for a better comparison, as our models have more parameters than Local. Coref LSTM is another baseline, which encodes passages with coreference annotations using a DAG LSTM (Section 3.2.2); it is a reimplementation of Dhingra et al. (2018) based on our framework. Coref GRN is a further baseline that encodes coreferences with GRN, introduced for contrasting coreference DAGs with our evidence integration graphs. MHQA-GRN corresponds to our evidence integration approach, adopting GRN for graph encoding.
First, even our Local baseline shows much higher accuracies than GA w/ GRU and GA w/ Coref-GRU. This is because our models are more compatible with the evaluated dataset. In particular, GA w/ GRU and GA w/ Coref-GRU calculate the probability for each candidate by summing up the probabilities of all tokens within the candidate. As a result, they cannot handle phrasal candidates very well, especially overlapping candidates such as “New York” and “New York City”. On the other hand, we consider each candidate answer as a single unit, so our models do not suffer from this issue. As a reimplementation of their idea, Coref LSTM shows a gain of only 0.4 points over Local, a stronger baseline than GA w/ GRU. In contrast, MHQA-GRN is 1.8 points more accurate than Local.
The following comparisons further pinpoint the advantage of our approach: MHQA-GRN is 1.4 points better than Coref GRN, while Coref GRN gives comparable performance to Coref LSTM. Both comparisons show that our evidence graphs are the main reason for the 1.8-point improvement, mainly because our evidence graphs are better connected than coreference DAGs and are more suitable for integrating relevant evidence. Local-2L is not significantly better than Local, meaning that simply introducing more parameters does not help.
In addition to the systems above, we introduce Fully-Connect-GRN to demonstrate the effectiveness of our evidence graph creation approach. Fully-Connect-GRN creates fully connected graphs over the entity mentions, before encoding them with GRN. Within each fully connected graph, the question is directly connected with the answer. However, fully connected graphs are brute-force connections and are not well suited to integrating related evidence. MHQA-GRN is 1.5 points better than Fully-Connect-GRN, even though questions and answers are more directly connected (with distance 1 in all cases) by Fully-Connect-GRN. The main reason can be that our evidence graphs connect only related entity mentions, making it easier for our models to learn how to integrate evidence; in contrast, there are hardly any learnable patterns within fully connected graphs. More analyses on the relation between graph connectivity and end-to-end performance are given in later paragraphs.
Effectiveness of edge types
Table 3.2 shows an ablation study of the different edge types introduced for evidence integration. The first group shows the situations where one type of edge is removed. In general, there is a large performance drop from removing any type of edge. The reason can be that the connectivity of the resulting graphs is reduced, so fewer facts can be inferred. Among all the types, removing window-typed edges causes the smallest performance drop. One possible reason is that some of the information they capture has already been captured well by sequential encoding. However, window-typed edges are still useful, as they help pass evidence on to more distant nodes. Taking Figure 3.2 as an example, two window-typed edges help to pass information from “The Hanging Gardens” to “India”. The other two types of edges are slightly more important than window-typed edges. Intuitively, they help to gather more global information than window-typed edges, thus learning better representations for entities by integrating contexts from their occurrences and coreferences.
The second group of Table 3.2 shows the model performances when only one type of edge is used. None of the single-edge-type performances are significantly better than the Local baseline, whereas the combination of all edge types achieves a much better accuracy (by 1.8 points) than Local. This indicates the importance of evidence integration over better-connected graphs; we show more detailed quantitative analyses later. The numbers generally demonstrate the same patterns as the first group. In addition, only same is slightly better than only coref, likely because some coreference information can also be captured by sequential encoding.
Distance Figure 3.7 shows the percentage distribution of distances between a question and its closest answer when either all types of edges are adopted or only coreference edges are used. The subject of each question (as shown in Figure 3.1, each question has a subject and a relation, and asks for the object) is used to locate the question on the corresponding graph.
When all types of edges are adopted, the questions and answers of more than 90% of the development instances are connected, and the question-to-answer distances of more than 70% are within 3. On the other hand, the instances with distances longer than 4 account for only 10%. This can be the reason why performance does not increase when more than 3 transition steps are performed in our model. The advantage of our approach can be shown by contrasting the distance distributions over graphs generated by the baseline and by our approach.
We further evaluate both approaches on a subset of the development instances where the question-to-answer distance is at most 3 in our graph. The accuracies of Coref LSTM and MHQA-GRN on this subset are 61.1 and 63.8, respectively. Compared with the performances on the whole devset (61.4 vs 62.8), the performance gap on this subset increases by 1.3 points. This indicates that our approach better handles these “relatively easy” reasoning tasks. However, as shown in Figure 3.6, instances that require many reasoning steps are still challenging for our approach.
3.6 Experiments on ComplexWebQuestions
In this section, we conduct experiments on the newly released ComplexWebQuestions version 1.1 (Talmor and Berant, 2018) to better evaluate our approach. Compared with WikiHop, where the complexity is implicitly specified in the passages, the complexity of this dataset is explicitly specified on the question side. One example question is “What city is the birthplace of the author of ‘Without end’”. Two-step reasoning is involved: the first step is “the author of ‘Without end’”, and the second is “the birthplace of” the answer to the first step.
Table 3.3 (excerpt):

System                                Dev     Test
MHQA-GRN w/ only same                 32.2    –
SplitQA w/ additional labeled data    35.6    34.2
In this dataset, web snippets (instead of passages, as in WikiHop) are used for extracting answers. The baseline of Talmor and Berant (2018) (SimpQA) only uses the full question to query the web for relevant snippets, while their model (SplitQA) obtains snippets for both the full question and its sub-questions. With all the snippets, SplitQA models the QA process based on a computation tree of the full question (a computation tree is a special type of semantic parse with two levels: the first level contains sub-questions and the second level is a composition operation). In particular, they first obtain the answers to the sub-questions, and then integrate those answers based on the computation tree. In contrast, our approach creates a graph from all the snippets, so the subsequent evidence integration process can join all associated evidence.
Main results As shown in Table 3.3, similar to the observations on WikiHop, MHQA-GRN achieves large improvements over Local. Both the baselines and our models use all web snippets, but MHQA-GRN further considers the structural relations among entity mentions. SplitQA achieves a 0.5% improvement over SimpQA (at submission time, the authors of ComplexWebQuestions had not reported test results for the two methods; to make a fair comparison, we compare devset accuracy). Our Local baseline is comparable with SplitQA, and our graph-based models contribute a further 2% improvement over Local. This indicates that considering structural information over passages is important for this dataset.
Analysis To deal with complex questions that require evidence from multiple passages, previous work (Wang et al., 2018b; Lin et al., 2018; Wang et al., 2018c) collects evidence from occurrences of an entity in different passages. These methods correspond to a special case of our method, i.e. MHQA with only same-typed edges. As Table 3.3 shows, our method gives a 1-point increase over MHQA-GRN w/ only same, and an even larger increase on WikiHop (comparing all types with only same in Table 3.2). Both results indicate that our method captures more useful information for multi-hop QA tasks than the methods developed for previous multi-passage QA tasks. This is likely because our method integrates not only the evidence for an entity but also that for other related entities.
The leaderboard reports SplitQA with additional sub-question annotations and gold answers for the sub-questions, which are used as additional training data for SplitQA. This approach relies on annotations of ground-truth answers for sub-questions and of semantic parses, and thus is not practically useful in general. However, the results still have value, as they can be viewed as an upper bound for SplitQA. Note that the gap between this upper bound and our MHQA-GRN is small, which further suggests that larger improvements can be achieved by introducing structural connections on the passage side to facilitate evidence integration.
3.7 Related Work
Question answering with multi-hop reasoning Multi-hop reasoning is an important ability for dealing with difficult cases in question answering (Rajpurkar et al., 2016; Boratko et al., 2018). Most existing work on multi-hop QA focuses on hopping over knowledge bases or tables (Jain, 2016; Neelakantan et al., 2016; Yin et al., 2016), so the problem is reduced to deduction over a readily defined structure with known relations. In contrast, we study multi-hop QA on textual data, introducing an effective approach for creating evidence integration graph structures over the textual input. Previous work (Hill et al., 2015; Shen et al., 2017) studying multi-hop QA on text does not create reference structures. In addition, they only evaluate their models on a simple task (Weston et al., 2015) with a very limited vocabulary and passage length. Our work is fundamentally different by modeling structures over the input, and we evaluate our models on more challenging tasks.
Recent work has started to exploit ways of creating structures from inputs. Talmor and Berant (2018) build a two-level computation tree over each question, where the first-level nodes are sub-questions and the second-level node is a composition operation. The answers to the sub-questions are first generated, and then combined with the composition operation. They predefine two composition operations, which limits the generality of the approach for other QA problems. Dhingra et al. (2018) create DAGs over passages with coreference; the DAGs are then encoded using a DAG recurrent network. Our work follows the second direction by creating reasoning graphs on the passage side. However, we consider more types of relations than coreference, making a thorough study of evidence integration. Besides, we also investigate a recent graph neural network (namely GRN) for this problem.
Question answering over multiple passages
Recent efforts in open-domain QA have started to generate answers from multiple passages instead of a single passage. However, most existing work on multi-passage QA selects the most relevant passage for answering the given question, thus reducing the problem to single-passage reading comprehension (Chen et al., 2017a; Dunn et al., 2017; Dhingra et al., 2017b; Wang et al., 2018a; Clark and Gardner, 2018). Our method is fundamentally different in that it truly leverages multiple passages.
A few multi-passage QA approaches merge evidence from multiple passages before selecting an answer (Wang et al., 2018b; Lin et al., 2018; Wang et al., 2018c). Similar to our work, they combine evidence from multiple passages, thus fully utilizing the input passages. The key difference is that their approaches focus on how the contexts of a single answer candidate in different passages can cover different aspects of a complex question, while our approach studies how to properly integrate the related evidence of an answer candidate, some of which comes from the contexts of other entity mentions. This increases the difficulty, since those contexts co-occur with neither the candidate answer nor the question; when a piece of evidence does not co-occur with the answer candidate, it is usually difficult for these methods to integrate it. This is also demonstrated by our empirical comparison, where our approach performs much better than combining only the evidence of the same entity mentions.
We have introduced a new approach for tackling multi-hop reading comprehension (MHRC) with a graph-based evidence integration process. Given a question and a list of passages, we first connect related evidence in the reference passages into a graph, and then adopt recent graph neural networks to encode the resulting graphs for evidence integration. Results show that the three types of edges are useful for combining global evidence and that graph neural networks are effective at encoding the complex graphs produced by the first step. Our approach shows highly competitive performance on two standard MHRC datasets.
As a central task in natural language processing, relation extraction has been investigated on news, web text and biomedical domains. It has been shown to be useful for detecting explicit facts, such as cause-effect relations Hendrickx et al. (2009), and for predicting the effectiveness of a medicine on a cancer caused by mutation of a certain gene in the biomedical domain Quirk and Poon (2017); Peng et al. (2017). While most existing work extracts relations within a sentence Zelenko et al. (2003); Palmer et al. (2005); Zhao and Grishman (2005); Jiang and Zhai (2007); Plank and Moschitti (2013); Li and Ji (2014); Gormley et al. (2015); Miwa and Bansal (2016); Zhang et al. (2017), the task of cross-sentence relation extraction has received increasing attention Gerber and Chai (2010); Yoshikawa et al. (2011). Recently, Peng et al. (2017) extended cross-sentence relation extraction by further detecting relations among several entity mentions ($n$-ary relations). Table 4.1 shows an example, which conveys the fact that cancers caused by the 858E mutation on the EGFR gene can respond to the gefitinib medicine. The three entity mentions form a ternary relation yet appear in distinct sentences.
“The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the 858E point mutation on exon-21 was noted in 10.”
“All patients were treated with gefitinib and showed a partial response.”
Peng et al. (2017) proposed a graph-structured LSTM for $n$-ary relation extraction. As shown in Figure 4.1 (a), graphs are constructed from input sentences with dependency edges, links between adjacent words, and inter-sentence relations, so that syntactic and discourse information can be used for relation extraction. To calculate a hidden state encoding for each word, Peng et al. (2017) first split the input graph into two directed acyclic graphs (DAGs) by separating left-to-right edges from right-to-left edges (Figure 4.1 (b)). Then, two separate gated recurrent neural networks, which extend the tree LSTM of Tai et al. (2015), were adopted, one for each single-directional DAG. Finally, for each word, the hidden states of both directions are concatenated as the final state. The bidirectional DAG LSTM model showed superior performance over several strong baselines, such as the tree-structured LSTM of Miwa and Bansal (2016), on a biomedical-domain benchmark.
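The DAG split just described can be sketched as follows (an illustrative sketch; the edge-triple format follows Section 4.3, but the function name is an assumption):

```python
def split_into_dags(edges):
    """Split a word graph into two DAGs by edge direction relative to
    word order: left-to-right edges vs right-to-left edges.
    edges: iterable of (src_index, tgt_index, label) triples."""
    forward = [(i, j, l) for (i, j, l) in edges if i < j]
    backward = [(i, j, l) for (i, j, l) in edges if i > j]
    return forward, backward
```

Any dependency path that mixes the two directions is cut by this split, which is exactly the kind of information loss discussed next.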
However, the bidirectional DAG LSTM model suffers from several limitations. First, important information can be lost when converting a graph into two separate DAGs. For the example in Figure 4.1, the conversion breaks the inner structure of “exon-19 of EGFR gene”: the relation between “exon-19” and “EGFR” via the dependency path through “gene” is lost from the original subgraph. Second, with LSTMs applied to the two DAGs, only information from ancestors and descendants can be incorporated for each word. Sibling information, which may also be important, is not included.
A potential solution to the problems above is to model a graph as a whole, learning its representation without breaking it into two DAGs. Due to the existence of cycles, a naive extension of tree LSTMs cannot serve this goal. Recently, graph convolutional networks (GCNs) (Kipf and Welling, 2017; Marcheggiani and Titov, 2017; Bastings et al., 2017) and graph recurrent networks (GRNs) Zhang et al. (2018); Song et al. (2018d) have been proposed for representing graph structures for NLP tasks. Such methods encode a given graph by hierarchically learning representations of neighboring nodes via their connecting edges. While GCNs use CNNs for information exchange, GRNs take gated recurrent steps to this end. For a fair comparison with DAG LSTMs, we build a graph model by extending Song et al. (2018d), strictly following the configurations of Peng et al. (2017), such as the source of features and the hyperparameter settings. In particular, the full input graph is modeled as a single state, with the words in the graph being its substates. State transitions are performed on the graph recurrently, allowing word-level states to exchange information through dependency and discourse edges. At each recurrent step, each word advances its current state by receiving information from the current states of its adjacent words. Thus, with an increasing number of recurrent steps, each word receives information from a larger context. Figure 4.2 shows the recurrent transition steps, in which all nodes update simultaneously within each transition step.
Compared with the bidirectional DAG LSTM, our method has several advantages. First, it keeps the original graph structure, so no information is lost. Second, sibling information can be easily incorporated by passing information up and then down through a parent. Third, the information exchange allows more parallelization, and thus can be very efficient in computation.
Results show that our model outperforms a bidirectional DAG LSTM baseline by 5.9% in accuracy, overtaking the state-of-the-art system of Peng et al. (2017) by 1.2%. Our code is available at https://github.com/freesunshine0316/nary-grn.
Our contributions are summarized as follows.
We empirically compare our GRN with a DAG LSTM for $n$-ary relation extraction tasks, showing that the former is better through more effective use of structural information;
To our knowledge, we are the first to investigate a graph recurrent network for modeling dependency and discourse relations.
4.2 Task Definition
Formally, the input for cross-sentence $n$-ary relation extraction can be represented as a pair $(\mathcal{E}, T)$, where $\mathcal{E} = (e_1, \dots, e_n)$ is the set of entity mentions and $T$ is a text consisting of multiple sentences. Each entity mention belongs to one sentence in $T$. There is a predefined relation set $\mathcal{R}$, which includes None, representing that no relation holds for the entities. This task can be formulated as a binary classification problem of determining whether $e_1, \dots, e_n$ together form a relation Peng et al. (2017), or a multi-class classification problem of detecting which relation holds for the entity mentions. Take Table 4.1 as an example. The binary classification task is to determine whether gefitinib would have an effect on this type of cancer, given a cancer patient with the 858E mutation on gene EGFR. The multi-class classification task is to detect the exact drug effect: response, resistance, sensitivity, etc.
4.3 Baseline: Bi-directional DAG LSTM
Peng et al. (2017) formulate the task as a graph-structured problem in order to adopt rich dependency and discourse features. In particular, the Stanford parser Manning et al. (2014) is used to assign syntactic structure to the input sentences, and the heads of two consecutive sentences are connected to represent discourse information, resulting in a graph structure. For each input graph, the nodes are the words of the input sentences, and each edge connects two words that either have a relation or are adjacent to each other. Each edge is denoted as a triple $(i, j, l)$, where $i$ and $j$ are the indices of the source and target words, respectively, and the edge label $l$ indicates either a dependency or discourse relation (such as “nsubj”) or a relative position (such as “next_tok” or “prev_tok”). Throughout this chapter, we use $E_{in}(j)$ and $E_{out}(j)$ to denote the sets of incoming and outgoing edges for word $w_j$.
For our bidirectional DAG LSTM baseline, we follow Peng et al. (2017), splitting each input graph into two separate DAGs by separating left-to-right edges from right-to-left edges (Figure 4.1). Each DAG is encoded using a DAG LSTM (Section 4.3.2), which takes both source words and edge labels as inputs (Section 4.3.1). Finally, the hidden states of the entity mentions are used for classification,
where $h_{e_i}$ is the hidden state of entity $e_i$, and $W$ and $b$ are model parameters.
4.3.1 Input Representation
Both nodes and edge labels are useful for modeling a syntactic graph. As the input to our DAG LSTM, we first calculate the representation for each edge $(i, j, l)$ by:
where $W$ and $b$ are model parameters, $e_{w_i}$ is the embedding of the source word indexed by $i$, and $e_l$ is the embedding of the edge label $l$.
4.3.2 Encoding process
The baseline LSTM model learns DAG representations sequentially, following the word order. Taking the edge representations as input, gated state transition operations are executed on both the forward and backward DAGs. For each word $w_j$, the representations of its incoming edges are summed up into one vector:
Similarly, for each word $w_j$, the states of all incoming nodes are summed into a single vector before being passed to the gated operations:
Finally, the gated state transition operation for the hidden state of the $j$-th word can be defined as:
where $i_j$, $o_j$ and $f_j$ are the input, output and forget gates, respectively, and $W_x$, $U_x$ and $b_x$ ($x \in \{i, o, f, u\}$) are model parameters.
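The encoding process of Sections 4.3.1 and 4.3.2 can be sketched as follows (illustrative only: the gated LSTM update is abstracted into a `cell_step` callback, and all names and shapes are assumptions rather than the baseline's exact code):

```python
import numpy as np

def dag_lstm_encode(num_words, edges, edge_repr, cell_step, d):
    """Encode one DAG in topological (left-to-right word) order.
    edges: (src, tgt, label) triples with src < tgt; edge_repr maps an
    edge to its input vector; cell_step is a gated (LSTM-style) update
    taking (x_j, h_in, c_in) and returning (h_j, c_j)."""
    H = np.zeros((num_words, d))
    C = np.zeros((num_words, d))
    for j in range(num_words):
        in_edges = [e for e in edges if e[1] == j]
        # sum of incoming edge representations (the word's input)
        x_j = sum((edge_repr(e) for e in in_edges), np.zeros(d))
        # sums of incoming hidden and cell states (the recurrent input)
        h_in = sum((H[e[0]] for e in in_edges), np.zeros(d))
        c_in = sum((C[e[0]] for e in in_edges), np.zeros(d))
        H[j], C[j] = cell_step(x_j, h_in, c_in)
    return H
```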
4.3.3 Comparison with Peng et al. (2017)
Our baseline is computationally similar to Peng et al. (2017), but differs in how edge labels are utilized in the gated network. In particular, Peng et al. (2017) make the model parameters specific to edge labels. They consider two model variations, namely Full Parametrization (FULL) and Edge-Type Embedding (EMBED). FULL assigns distinct weight matrices (in Equation 4.5) to different edge types, so that each edge label is associated with a 2D weight matrix to be tuned in training. On the other hand, EMBED assigns each edge label an embedding vector, but complicates the gated operations by changing the weight matrices into 3D tensors (for more details, refer to Section 3.3 of Peng et al. (2017)).
In contrast, we take the edge labels as part of the input to the gated network. In general, the edge labels are first represented as embeddings, before being concatenated with the node representation vectors (Equation 4.2). We choose this setting for both the baseline and our GRN in Section 4.4, since it requires fewer parameters than FULL and EMBED, and is thus less exposed to overfitting on small-scale data.
4.4 Encoding with Graph Recurrent Network
Our input graph formulation strictly follows Section 4.3. In particular, our model adopts the same methods as the baseline for calculating the input representation (Section 4.3.1) and performing classification. However, different from the baseline bidirectional DAG LSTM model, we leverage GRN to model the input graph directly, without splitting it into two DAGs. Compared with the evidence graphs of Chapter 3, the dependency graphs here are directed and contain edge labels that provide important information; we therefore adapt GRN to further incorporate this information.
Figure 4.2 shows an overview of the GRN encoder for dependency graphs. Formally, given an input graph $G = (V, E)$, we define a state vector $h_j$ for each word $w_j$. The state of the graph consists of all word states, and thus can be represented as $g = \{h_j\}|_{w_j \in V}$.
Same as in Chapter 3, the GRN-based encoder performs information exchange between neighboring words through a recurrent state transition process, resulting in a sequence of graph states $g^0, g^1, \dots, g^T$, where $g^t = \{h_j^t\}_{w_j \in V}$, and the initial graph state $g^0$ consists of a set of initial word states $h_j^0 = \mathbf{h_0}$, where $\mathbf{h_0}$ is a zero vector. The main change is in message aggregation, where we further distinguish incoming neighbors from outgoing neighbors, and edge labels are also incorporated.
For each time step $t$, the message to a word $w_j$ includes the representations of the edges that are connected to $w_j$, where $w_j$ can be either the source or the target of the edge. Similar to Section 4.3.1, we define each edge as a triple $(i, j, l)$, where $i$ and $j$ are the indices of the source and target words, respectively, and $l$ is the edge label. $x_{i,j}^l$ is the representation of edge $(i, j, l)$. The inputs for $w_j$ are distinguished by incoming and outgoing directions, where:

$x_j^{in} = \sum_{(i,j,l) \in E_{in}(j)} x_{i,j}^l \qquad x_j^{out} = \sum_{(j,k,l) \in E_{out}(j)} x_{j,k}^l$

Here $E_{in}(j)$ and $E_{out}(j)$ denote the sets of incoming and outgoing edges of $w_j$, respectively.
In addition to edge inputs, the message also contains the hidden states of the word's incoming and outgoing neighbors during a state transition. In particular, the states of all incoming words and all outgoing words are summed up, respectively:

$m_j^{in} = \sum_{(i,j,l) \in E_{in}(j)} h_i^{t-1} \qquad m_j^{out} = \sum_{(j,k,l) \in E_{out}(j)} h_k^{t-1}$
Based on the above definitions of $x_j^{in}$, $x_j^{out}$, $m_j^{in}$ and $m_j^{out}$, the message $\xi_j^t$ for word $w_j$ is aggregated by their concatenation:

$\xi_j^t = [x_j^{in}; x_j^{out}; m_j^{in}; m_j^{out}]$

before an LSTM step (defined in Equation 3.13) is applied to update the node hidden state $h_j^t$:

$h_j^t, c_j^t = \mathrm{LSTM}(\xi_j^t, h_j^{t-1}, c_j^{t-1})$

where $c_j^t$ is the cell memory for hidden state $h_j^t$.
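The state transition above can be sketched as follows. This is a minimal pure-Python illustration on a toy three-word graph: edge representations and neighbor states are summed separately for the incoming and outgoing directions, concatenated into a message, and passed through a gated update. The full LSTM step of Equation 3.13 is simplified here to a single tanh layer, and all dimensions and weights are toy values.

```python
import math
import random

random.seed(0)
DIM = 4
NUM_NODES = 3
edges = [(0, 1, "subj"), (1, 2, "obj")]          # (source, target, label)
x_edge = {e: [random.gauss(0, 1) for _ in range(DIM)] for e in edges}
W = [[random.gauss(0, 0.1) for _ in range(4 * DIM)] for _ in range(DIM)]

def vec_sum(vectors):
    total = [0.0] * DIM
    for v in vectors:
        total = [a + b for a, b in zip(total, v)]
    return total

def transition(h):
    """One step: every node aggregates from all its neighbors simultaneously."""
    new_h = []
    for j in range(NUM_NODES):
        e_in = vec_sum(x_edge[e] for e in edges if e[1] == j)    # x_in
        e_out = vec_sum(x_edge[e] for e in edges if e[0] == j)   # x_out
        m_in = vec_sum(h[i] for i, t, _ in edges if t == j)      # m_in
        m_out = vec_sum(h[t] for i, t, _ in edges if i == j)     # m_out
        msg = e_in + e_out + m_in + m_out        # concatenation (4 * DIM)
        new_h.append([math.tanh(sum(w * m for w, m in zip(row, msg)))
                      for row in W])             # stand-in for the LSTM step
    return new_h

h = [[0.0] * DIM for _ in range(NUM_NODES)]      # initial states are zeros
for _ in range(5):                               # 5 transitions (Section 4.6.3)
    h = transition(h)
```

Note how, unlike a DAG LSTM, all nodes update in the same step and could be computed in parallel; the loop over `j` carries no sequential dependency within a transition.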
GRN vs bidirectional DAG LSTM
A contrast between the baseline DAG LSTM and our graph LSTM can be made from the perspective of information flow. For the baseline, information flow follows the natural word order of the input sentence, with the two DAG components propagating information from left to right and from right to left, respectively. In contrast, information flow in our GRN is more concentrated at individual words, with each word exchanging information with all its graph neighbors simultaneously at each state transition. As a result, holistic contextual information can be leveraged when extracting features for each word, in contrast to the separate handling of the two information-flow directions in the DAG LSTM. In addition, arbitrary structures, including cyclic graphs, can be handled.
From an initial state with isolated words, information of each word propagates to its graph neighbors after each step. Information exchange between non-neighboring words can be achieved through multiple transition steps. We experiment with different transition step numbers to study the effectiveness of global encoding. Unlike the baseline DAG LSTM encoder, our model allows parallelization in node-state updates, and thus can be highly efficient using a GPU.
We train our models with a cross-entropy loss over a set of gold-standard data:

$\ell = -\sum_{(X, y)} \log p(y \mid X; \theta)$

where $X$ is an input graph, $y$ is the gold class label of $X$, and $\theta$ is the set of model parameters. Adam (Kingma and Ba, 2014) with a learning rate of 0.001 is used as the optimizer, and the model that yields the best devset performance is selected for evaluation on the test set. Dropout with rate 0.3 is used during training. Both training and evaluation are conducted on a Tesla K20X GPU.
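The objective can be illustrated with a toy instance: the loss is the average negative log-probability the model assigns to the gold label of each training graph. The probabilities below are made-up numbers, not model outputs.

```python
import math

# Cross-entropy over gold labels: average negative log-probability of the
# gold class under the model's predicted distribution for each instance.
def cross_entropy(probs, gold):
    return -sum(math.log(p[g]) for p, g in zip(probs, gold)) / len(gold)

probs = [[0.7, 0.2, 0.1],   # instance 1: distribution over 3 classes
         [0.1, 0.8, 0.1]]   # instance 2
gold = [0, 1]               # gold class indices
loss = cross_entropy(probs, gold)   # -(ln 0.7 + ln 0.8) / 2 ≈ 0.2899
```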
We conduct experiments for the binary relation detection task and the multi-class relation extraction task discussed in Section 4.2.
[Table 4.2: Dataset statistics — average tokens (Avg. Tok.), average sentences (Avg. Sent.), and percentage of cross-sentence instances (Cross %).]
We use the dataset of Peng et al. (2017), a biomedical-domain dataset focusing on drug-gene-mutation ternary relations extracted from PubMed.[2] It contains 6987 ternary instances about drug-gene-mutation relations, and 6087 binary instances about drug-mutation sub-relations. Table 4.2 shows statistics of the dataset. Most ternary instances span multiple sentences, and the average number of sentences is around 2. There are five classification labels: “resistance or non-response”, “sensitivity”, “response”, “resistance” and “None”. We follow Peng et al. (2017) and binarize the multi-class labels by grouping all relation classes as “Yes” and treating “None” as “No”.

[2] The dataset is available at http://hanover.azurewebsites.net.
Following Peng et al. (2017), five-fold cross-validation is used for evaluating the models,[3] and the final test accuracy is calculated by averaging the test accuracies over all five folds. For each fold, we randomly hold out 200 instances from the training set for development. The batch size is set to 8 for all experiments. Word embeddings are initialized with the 100-dimensional GloVe (Pennington et al., 2014) vectors, pretrained on 6 billion words from Wikipedia and web text. The edge label embeddings are 3-dimensional and randomly initialized. Pretrained word embeddings are not updated during training. The dimension of the hidden vectors in the LSTM units is set to 150.

[3] The released data has been separated into 5 portions, and we follow the exact split.
4.6.3 Development Experiments
We first analyze our model on the drug-gene-mutation ternary relation dataset, using the first of the 5 cross-validation folds as our development setting. Figure 4.3 shows the devset accuracies for different numbers of state transitions, where forward and backward execute our graph state model only on the forward or backward DAG, respectively, concat concatenates the hidden states of forward and backward, and all executes our graph state model on the original graphs.
The performance of forward and backward lags behind concat, which is consistent with the intuition that both forward and backward relations are useful (Peng et al., 2017). In addition, all gives better accuracies than concat, demonstrating the advantage of simultaneously considering forward and backward relations during representation learning. For all the models, more state transition steps result in better accuracies, as larger contexts can be integrated into the graph representations. The performance of all starts to converge after 4 to 5 state transitions, so we set the number of state transitions to 5 for the remaining experiments.
4.6.4 Final results
| Model | Single | Cross |
| Quirk and Poon (2017) | 74.7 | 77.7 |
| Peng et al. (2017) - EMBED | 76.5 | 80.6 |
| Peng et al. (2017) - FULL | 77.9 | 80.7 |
| Bidir DAG LSTM | 75.6 | 77.3 |
Table 4.3 compares our model with the bidirectional DAG baseline and the state-of-the-art results on this dataset, where EMBED and FULL have been briefly introduced in Section 4.3.3. +multi-task applies joint training of both ternary (drug-gene-mutation) relations and their binary (drug-mutation) sub-relations. Quirk and Poon (2017) use a statistical method with a logistic regression classifier and features derived from shortest paths between all entity pairs. Bidir DAG LSTM is our bidirectional DAG LSTM baseline, and GRN is our GRN-based model.
Using all instances (the Cross column in Table 4.3), our graph model shows the highest test accuracy among all methods, 5.9% higher than our baseline.[4] The accuracy of our baseline is lower than EMBED and FULL of Peng et al. (2017), which is likely due to the differences mentioned in Section 4.3.3. Our final results are better than those of Peng et al. (2017), despite the fact that we do not use multi-task learning.

[4] Significant under a t-test. For the remainder of this thesis, we use the same measure for statistical significance.
We also report accuracies only on instances within single sentences (column Single in Table 4.3), which exhibit similar contrasts. Note that all systems show performance drops when evaluated only on single-sentence relations, which are actually more challenging. One reason may be that some single sentences cannot provide sufficient context for disambiguation, making it necessary to study cross-sentence context. Another reason may be overfitting caused by the relatively fewer training instances in this setting, as only 30% of the instances lie within a single sentence. One interesting observation is that our baseline shows the smallest performance drop of 1.7 points, in contrast to drops of up to 4.1 points for the other neural systems. This can be taken as supporting evidence for overfitting, as our baseline has fewer parameters than at least FULL and EMBED.
| Model | Train | Decode |
| Bidir DAG LSTM | 281s | 27.3s |

Table 4.4: Average times (seconds) for training one epoch and decoding, over five folds, on the drug-gene-mutation ternary cross-sentence setting.
Table 4.4 shows the training and decoding times of both the baseline and our model. Our model is 8 to 10 times faster than the baseline in training and decoding, respectively. Revisiting Table 4.2, the average number of tokens in the ternary-relation data is 74, which means the baseline model has to execute 74 sequential recurrent transition steps to calculate a hidden state for each input word. In contrast, our model performs only 5 state transitions, and the calculations for all nodes within one transition are parallelizable. This accounts for the better efficiency of our model.
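The sequential-depth argument above can be made explicit with a back-of-envelope calculation; the observed 8-10x speedup is somewhat below this bound, plausibly due to per-step overheads and imperfect parallelism.

```python
# Sequential-depth comparison: the DAG LSTM performs one recurrent step per
# token in sequence, while the GRN performs a fixed number of transitions,
# each fully parallel over all nodes.
avg_tokens = 74            # average tokens per ternary instance (Table 4.2)
grn_transitions = 5        # state transitions used in our experiments

depth_ratio = avg_tokens / grn_transitions   # 14.8x fewer sequential steps
```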
Accuracy against sentence length
Figure 4.4 (a) shows the test accuracies at different sentence lengths. Both GRN and Bidir DAG LSTM show performance increases with longer inputs, likely because longer contexts provide richer information for relation disambiguation. GRN is consistently better than Bidir DAG LSTM, and the gap is larger on shorter instances, demonstrating that GRN is more effective at utilizing a smaller context for disambiguation.
Accuracy against the maximal number of neighbors
Figure 4.4 (b) shows the test accuracies against the maximal number of neighbors. Intuitively, it is easier to model graphs containing nodes with more neighbors, because such nodes can serve as “supernodes” that allow more efficient information exchange. The performances of both GRN and Bidir DAG LSTM increase with the maximal number of neighbors, which coincides with this intuition. In addition, GRN shows a larger advantage over Bidir DAG LSTM on inputs with a lower maximal number of neighbors, which further demonstrates the superiority of GRN in utilizing contextual information.
Figure 4.5 visualizes the merits of GRN over Bidir DAG LSTM using two examples. GRN makes the correct predictions in both cases, while Bidir DAG LSTM fails to. The first case generally mentions that Gefitinib does not have an effect on the T790M mutation of the EGFR gene. Note that both “However” and “was not” serve as indicators; thus incorporating them into the contextual vectors of these entity mentions is important for making a correct prediction. However, both indicators are leaves of the dependency tree, making it impossible for Bidir DAG LSTM to pass them up the tree into the contextual vectors of the entity mentions through dependency edges.[5] For GRN, on the other hand, this is easier: for instance, “was not” can be incorporated into “Gefitinib” through the path “suppressed - treatment - Gefitinib”.

[5] As shown in Figure 4.1, a directional DAG LSTM propagates information according to the edge directions.
| Model | Single | Cross |
| Quirk and Poon (2017) | 73.9 | 75.2 |
| Miwa and Bansal (2016) | 75.9 | 75.9 |
| Peng et al. (2017) - EMBED | 74.3 | 76.5 |
| Peng et al. (2017) - FULL | 75.6 | 76.7 |
| Bidir DAG LSTM | 76.9 | 76.4 |
The second case involves detecting a relation among “cetuximab” (drug), “EGFR” (gene) and “S492R” (mutation), which does not exist. The context introduces further ambiguity by mentioning another drug, “Panitumumab”, which does have a relation with “EGFR” and “S492R”. Being a sibling node of “cetuximab” in the dependency tree, “can not” is an indicator for the (absent) relation of “cetuximab”. GRN predicts correctly, because “can not” can easily be included in the contextual vector of “cetuximab” in two steps via the path “bind - cetuximab”.
4.6.6 Results on Binary Sub-relations
Following previous work, we also evaluate our model on drug-mutation binary relations. Table 4.5 shows the results, where Miwa and Bansal (2016) is a state-of-the-art model using sequential and tree-structured LSTMs to jointly capture linear and dependency contexts for relation extraction. Other models have been introduced in Section 4.6.4.
Similar to the ternary relation extraction experiments, GRN outperforms all other systems by a large margin, which shows that the message-passing graph LSTM is better at encoding the rich linguistic knowledge within the input graphs. Binary relations being easier, both GRN and Bidir DAG LSTM show increased or similar performance compared with the ternary relation experiments. On this set, our bidirectional DAG LSTM model is comparable to FULL using all instances (“Cross”) and slightly better than FULL using only single-sentence instances (“Single”).
4.6.7 Fine-grained Classification
Our dataset contains five classes as mentioned in Section 4.6.1. However, previous work only investigates binary relation detection. Here we also study the multi-class classification task, which can be more informative for applications.
Table 4.6 shows accuracies on multi-class relation extraction, which is more ambiguous than binary relation extraction. The results show trends similar to the binary relation extraction results. However, the performance gaps between GRN and Bidir DAG LSTM increase dramatically, showing the superiority of GRN over Bidir DAG LSTM in utilizing contextual information.
4.7 Related Work
N-ary relation extraction
N-ary relation extraction can be traced back to MUC-7 (Chinchor, 1998), which focuses on entity-attribution relations. It has also been studied in the biomedical domain (McDonald et al., 2005b), but only instances within a single sentence were considered. Previous work on cross-sentence relation extraction relies on either explicit co-reference annotation (Gerber and Chai, 2010; Yoshikawa et al., 2011), or the assumption that the whole document refers to a single coherent event (Wick et al., 2006; Swampillai and Stevenson, 2011). Both assumptions simplify the problem and reduce the need for learning better contextual representations of entity mentions. A notable exception is Quirk and Poon (2017), who adopt distant supervision and integrate contextual evidence of diverse types without relying on these assumptions. However, they only study binary relations. We follow Peng et al. (2017) in studying ternary cross-sentence relations.
| Model | Single | Cross |
| Bidir DAG LSTM | 51.7 | 50.7 |
Graph encoder. Liang et al. (2016) build a graph LSTM model for semantic object parsing, which aims to segment objects within an image into more fine-grained, semantically meaningful parts. The nodes of the input graph come from image superpixels, and the edges are created by connecting spatially neighboring nodes. Their model is similar to Peng et al. (2017) in calculating node states sequentially: for each input graph, a start node and a node sequence are chosen, which determines the order of recurrent state updates. In contrast, our graph LSTM does not need any ordering of graph nodes, and is highly parallelizable.
We explored a graph recurrent network for cross-sentence n-ary relation extraction, which uses a recurrent state transition process to incrementally refine a neural graph state representation capturing graph-structured context. Compared with a bidirectional DAG LSTM baseline, our model has several advantages. First, it does not change the input graph structure, so no information is lost. For example, it can easily incorporate sibling information when calculating the contextual vector of a node. Second, it is better parallelizable. Experiments show significant improvements over the previously reported numbers, including those of the bidirectional graph LSTM model.
Abstract Meaning Representation (AMR) Banarescu et al. (2013) is a semantic formalism that encodes the meaning of a sentence as a rooted, directed graph. Figure 5.1 shows an AMR graph in which the nodes (such as “describe-01” and “person”) represent the concepts, and edges (such as “:ARG0” and “:name”) represent the relations between concepts they connect. AMR has been proven helpful on other NLP tasks, such as machine translation Jones et al. (2012a); Tamchyna et al. (2015), question answering Mitra and Baral (2015), summarization Takase et al. (2016) and event detection Li et al. (2015).
The task of AMR-to-text generation is to produce a text conveying the same meaning as a given input AMR graph. The task is challenging, as word tenses and function words are abstracted away when constructing AMR graphs from texts, and the translation from AMR nodes to text phrases can be far from literal. For example, as shown in Figure 5.1, “Ryan” is represented as “(p / person :name (n / name :op1 “Ryan”))”, and “description of” is represented as “(d / describe-01 :ARG1 )”.
While initial work used statistical approaches (Flanigan et al., 2016b; Pourdamghani et al., 2016; Song et al., 2017; Lampouras and Vlachos, 2017; Mille et al., 2017; Gruzitis et al., 2017), recent research has demonstrated the success of deep learning, and in particular of the sequence-to-sequence model (Sutskever et al., 2014), which has achieved state-of-the-art results on AMR-to-text generation (Konstas et al., 2017). One limitation of sequence-to-sequence models, however, is that they require serialization of input AMR graphs, which adds to the challenge of representing graph structure information, especially when the graph is large. In particular, closely related nodes, such as parents, children and siblings, can be far apart after serialization. It can be difficult for a linear recurrent neural network to automatically induce their original connections from bracketed string forms.
To address this issue, we introduce a novel graph-to-sequence model, where a graph recurrent network (GRN) is used to encode AMR structures directly. To capture non-local information, the encoder performs graph state transitions by information exchange between connected nodes, with a graph state consisting of all node states. Multiple recurrent transition steps are taken so that information can propagate non-locally, and an LSTM (Hochreiter and Schmidhuber, 1997) is used to avoid vanishing and exploding gradients in the recurrent process. The decoder is an attention-based LSTM model with a copy mechanism (Gu et al., 2016; Gulcehre et al., 2016), which helps copy sparse tokens (such as numbers and named entities) from the input.
Trained on a standard dataset (LDC2015E86), our model surpasses a strong sequence-to-sequence baseline by 2.3 BLEU points, demonstrating the advantage of graph-to-sequence models for AMR-to-text generation. Our final model achieves a BLEU score of 23.3 on the test set, which is 1.3 points higher than the existing state of the art (Konstas et al., 2017) trained on the same dataset. When using Gigaword sentences as additional training data, our model is consistently better than Konstas et al. (2017) using the same amount of Gigaword data, showing the effectiveness of our model on large-scale training data. We release our code and models at https://github.com/freesunshine0316/neural-graph-to-seq-mp.
5.2 Baseline: a seq-to-seq model
Our baseline is a sequence-to-sequence model, which follows the encoder-decoder framework of Konstas et al. (2017).
5.2.1 Input representation
Given an AMR graph $G = (V, E)$, where $V$ and $E$ denote the sets of nodes and edges, respectively, we use the depth-first traversal of Konstas et al. (2017) to linearize it into a sequence of tokens $v_1, v_2, \dots, v_N$, where $N$ is the number of tokens. For example, the AMR graph in Figure 5.1 is serialized as “describe :arg0 ( person :name ( name :op1 ryan ) ) :arg1 person :arg2 genius”. We can see that the distance between “describe” and “genius”, which are directly connected in the original AMR, becomes 14 in the serialized result.
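The depth-first linearization can be sketched as below on a hand-made stand-in for the Figure 5.1 graph (the node name “person2” replaces the second “person” node only because dictionary keys must be unique; labels and structure otherwise follow the serialization shown above).

```python
# Toy AMR-like graph: node -> list of (edge label, child) pairs.
graph = {
    "describe": [(":arg0", "person"), (":arg1", "person2"), (":arg2", "genius")],
    "person": [(":name", "name")],
    "name": [(":op1", "ryan")],
    "person2": [],
    "genius": [],
    "ryan": [],
}

def linearize(node):
    """Depth-first serialization, bracketing subgraphs that have children."""
    tokens = [node]
    for label, child in graph[node]:
        tokens.append(label)
        if graph[child]:
            tokens += ["("] + linearize(child) + [")"]
        else:
            tokens.append(child)
    return tokens

seq = linearize("describe")
# "describe" and "genius" are adjacent in the graph but 14 tokens apart here.
distance = seq.index("genius") - seq.index("describe")
```

This reproduces the distance-14 observation from the text, illustrating how serialization stretches direct graph connections into long sequence spans.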
A simple way to calculate the representation $x_j$ for each token $v_j$ is to use its word embedding $e_{v_j}$:

$x_j = W_1 e_{v_j} + b_1$

where $W_1$ and $b_1$ are model parameters for compressing the input vector size. To alleviate the data sparsity problem and obtain better word representations as input, we also run a forward LSTM over the characters of the token, and concatenate its last hidden state $h_{v_j}^c$ with the word embedding:

$x_j = W_2 [e_{v_j}; h_{v_j}^c] + b_2$
5.2.2 The encoder
The encoder is a bi-directional LSTM applied to the linearized graph obtained by depth-first traversal, as in Konstas et al. (2017). At each step $j$, the current states $\overrightarrow{h}_j$ and $\overleftarrow{h}_j$ are generated given the previous states $\overrightarrow{h}_{j-1}$ and $\overleftarrow{h}_{j+1}$ and the current input $x_j$:

$\overrightarrow{h}_j = \mathrm{LSTM}(\overrightarrow{h}_{j-1}, x_j) \qquad \overleftarrow{h}_j = \mathrm{LSTM}(\overleftarrow{h}_{j+1}, x_j)$
5.2.3 The decoder
We use an attention-based LSTM decoder (Bahdanau et al., 2015), where the attention memory $A$ is the concatenation of the attention vectors of all input tokens. Each attention vector $a_j$ is the concatenation of the encoder states of an input token in both directions ($\overrightarrow{h}_j$ and $\overleftarrow{h}_j$) and its input vector ($x_j$):

$a_j = [\overrightarrow{h}_j; \overleftarrow{h}_j; x_j] \qquad A = [a_1; a_2; \dots; a_N]$

where $N$ is the number of input tokens.
The decoder yields an output sequence $y_1, y_2, \dots, y_M$ by calculating a sequence of hidden states $s_1, s_2, \dots, s_M$ recurrently. While generating the $t$-th word, the decoder considers five factors: (1) the attention memory $A$; (2) the previous hidden state of the LSTM decoder $s_{t-1}$; (3) the embedding of the current input word (the previously generated word) $e_{y_{t-1}}$; (4) the previous context vector $\mu_{t-1}$, which is calculated by an attention mechanism (shown in the next paragraph) over $A$; and (5) the previous coverage vector $\gamma_{t-1}$, which is the accumulation of all attention distributions so far (Tu et al., 2016). When $t = 1$, we initialize $\mu_0$ and $\gamma_0$ as zero vectors, set $e_{y_0}$ to the embedding of the start token “⟨s⟩”, and calculate $s_0$ by averaging all encoder states.
For each time step $t$, the decoder feeds the concatenation of the embedding of the current input $e_{y_{t-1}}$ and the previous context vector $\mu_{t-1}$ into the LSTM model to update its hidden state. Then the attention probability $\alpha_{t,i}$ on the attention vector $a_i \in A$ for the time step is calculated as:

$\epsilon_{t,i} = v_2^T \tanh(W_a a_i + W_s s_t + W_\gamma \gamma_{t-1} + b_2) \qquad \alpha_{t,i} = \frac{\exp(\epsilon_{t,i})}{\sum_{i'=1}^{N} \exp(\epsilon_{t,i'})}$

where $W_a$, $W_s$, $W_\gamma$, $v_2$ and $b_2$ are model parameters. The coverage vector is updated by $\gamma_t = \gamma_{t-1} + \alpha_t$, and the new context vector is calculated via $\mu_t = \sum_{i=1}^{N} \alpha_{t,i} a_i$. The output probability distribution over the vocabulary at the current state is calculated by:

$P_{vocab} = \mathrm{softmax}(V [s_t; \mu_t] + b)$

where $V$ and $b$ are model parameters, and the number of rows in $V$ represents the number of words in the vocabulary.
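One attention step with coverage can be sketched as follows. This is a simplified pure-Python illustration: the dot-product score with a subtractive coverage penalty is a toy stand-in for the parameterized score above, and all dimensions and values are made up.

```python
import math
import random

random.seed(1)
N, DIM = 4, 6
A = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]  # memory
s = [random.gauss(0, 1) for _ in range(DIM)]                      # state

def attend(s, coverage):
    """Softmax-normalized scores, weighted-sum context, accumulated coverage."""
    scores = [sum(a * b for a, b in zip(A[i], s)) - coverage[i]
              for i in range(N)]
    m = max(scores)
    exp = [math.exp(x - m) for x in scores]
    z = sum(exp)
    alpha = [x / z for x in exp]                       # attention distribution
    context = [sum(alpha[i] * A[i][d] for i in range(N))
               for d in range(DIM)]                    # context vector
    new_cov = [c + a for c, a in zip(coverage, alpha)]
    return alpha, context, new_cov

alpha, context, coverage = attend(s, [0.0] * N)
```

The coverage penalty discourages attending repeatedly to the same tokens across decoding steps, which is the role the accumulated coverage vector plays in the equations above.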
5.3 The graph-to-sequence model
Unlike the baseline sequence-to-sequence model, we leverage our graph recurrent network (GRN) to represent each input AMR, directly modeling the graph structure without serialization.
5.3.1 The graph encoder
Figure 5.2 shows the overall structure of our graph encoder. Formally, given an AMR graph $G = (V, E)$, we use a hidden state vector $h_j$ to represent each node $v_j$. The state of the graph can thus be represented as:

$g = \{h_j\}_{v_j \in V}$
Same as in Chapter 4, our GRN-based graph encoder performs information exchange between nodes through a sequence of state transitions, leading to a sequence of states $g^0, g^1, \dots, g^T$, where $g^t = \{h_j^t\}_{v_j \in V}$. The initial state $g^0$ consists of a set of all-zero node states.
AMR graphs are similar to the dependency graphs described in Chapter 4 in that both are directed and contain edge labels, so we simply adopt the GRN of Chapter 4 as our AMR graph encoder. In particular, for a node $v_j$, the inputs include the representations of the edges connected to it, where $v_j$ can be either the source or the target of the edge. We follow Chapter 4 in defining each edge as a triple $(i, j, l)$, where $i$ and $j$ are the indices of the source and target nodes, respectively, and $l$ is the edge label. $x_{i,j}^l$ is the representation of edge $(i, j, l)$, detailed in Section 5.3.2. The inputs for $v_j$ are distinguished by incoming and outgoing edges, before being summed up:

$x_j^{in} = \sum_{(i,j,l) \in E_{in}(j)} x_{i,j}^l \qquad x_j^{out} = \sum_{(j,k,l) \in E_{out}(j)} x_{j,k}^l$

where $E_{in}(j)$ and $E_{out}(j)$ denote the sets of incoming and outgoing edges of $v_j$, respectively. In addition to edge inputs, the encoder also considers the hidden states of the incoming and outgoing nodes during a state transition. In particular, the states of all incoming nodes and all outgoing nodes are summed up before being passed to the cell and gate nodes:

$m_j^{in} = \sum_{(i,j,l) \in E_{in}(j)} h_i^{t-1} \qquad m_j^{out} = \sum_{(j,k,l) \in E_{out}(j)} h_k^{t-1}$
As the next step, the message $\xi_j^t$ is aggregated by concatenation:

$\xi_j^t = [x_j^{in}; x_j^{out}; m_j^{in}; m_j^{out}]$

Then an LSTM step (detailed in Equation 3.13) is applied to update the node hidden state $h_j^t$:

$h_j^t, c_j^t = \mathrm{LSTM}(\xi_j^t, h_j^{t-1}, c_j^{t-1})$

where $c_j^t$ is the cell memory for hidden state $h_j^t$.
5.3.2 Input Representation
Different from sequences, the edges of an AMR graph contain labels, which represent relations between the nodes they connect and are thus important for modeling the graphs. Similar to Section 5.2.1, we adopt two different ways of calculating the representation $x_{i,j}^l$ for each edge $(i, j, l)$:

$x_{i,j}^l = W_3 [e_l; e_{v_i}] + b_3 \qquad x_{i,j}^l = W_4 [e_l; e_{v_i}; h_{v_i}^c] + b_4$

where $e_l$ and $e_{v_i}$ are the embeddings of edge label $l$ and source node $v_i$, $h_{v_i}^c$ denotes the last hidden state of the character LSTM over $v_i$, and $W_3$, $b_3$, $W_4$ and $b_4$ are trainable parameters. The two equations correspond to Equations 5.1 and 5.2 in Section 5.2.1, respectively.
5.3.3 The decoder
As shown in Figure 5.3, we adopt the attention-based LSTM decoder described in Section 5.2.3. Since our graph encoder generates a sequence of graph states, only the last graph state is used by the decoder. In particular, we make the following changes to the decoder. First, each attention vector becomes $a_j = [h_j^T; x_j]$, where $h_j^T$ is the last state for node $v_j$. Second, the decoder initial state $s_0$ is the average of the last states of all nodes.
5.3.4 Integrating the copy mechanism
Open-class tokens, such as dates, numbers and named entities, account for a large portion of the AMR corpus. Most appear only a few times, resulting in a data sparsity problem. To address this issue, Konstas et al. (2017) adopt anonymization. In particular, they first replace the subgraphs that represent dates, numbers and named entities (such as “(q / quantity :quant 3)” and “(p / person :name (n / name :op1 “Ryan”))”) with predefined placeholders (such as “num_0” and “person_name_0”) before decoding, and then recover the corresponding surface tokens (such as “3” and “Ryan”) after decoding. This method involves hand-crafted rules, which can be costly.
Instead, we adopt a copy mechanism (Gu et al., 2016; Gulcehre et al., 2016) to solve this problem. The mechanism works on top of an attention-based RNN decoder by integrating the attention distribution into the final vocabulary distribution. The final probability distribution is defined as an interpolation between two distributions:

$P_{final}(w) = \theta_t P_{vocab}(w) + (1 - \theta_t) P_{copy}(w)$
where $\theta_t \in [0, 1]$ is a switch controlling whether to generate a word from the vocabulary or to directly copy it from the input graph. $P_{vocab}(w)$ is the probability of directly generating the word, as defined in Equation 5.9, and $P_{copy}(w)$ is calculated from the attention distribution $\alpha_t$ by summing the probabilities of the graph nodes that contain the identical concept. Intuitively, $\theta_t$ is relevant to the current decoder input $e_{y_{t-1}}$, the state $s_t$, and the context vector $\mu_t$. Therefore, we define it as:

$\theta_t = \sigma(w_s^T s_t + w_\mu^T \mu_t + w_e^T e_{y_{t-1}} + b_5)$

where the vectors $w_s$, $w_\mu$, $w_e$ and the scalar $b_5$ are model parameters. The copy mechanism favors generating words that appear in the input. For AMR-to-text generation, it facilitates the generation of dates, numbers, and named entities that appear in AMR graphs.
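The interpolation can be illustrated with a toy example. The tiny vocabulary, attention weights, and switch value below are all made up; in a real system, input tokens outside the closed vocabulary would extend the output vocabulary rather than being dropped.

```python
# Toy copy-mechanism interpolation: mix the generation distribution with a
# copy distribution built by summing attention mass of matching input nodes.
vocab = ["<unk>", "the", "describes", "ryan"]
p_vocab = [0.10, 0.40, 0.45, 0.05]              # generation distribution
input_tokens = ["describe", "ryan", "genius"]   # input graph nodes
alpha = [0.2, 0.7, 0.1]                         # attention over input nodes
theta = 0.3                                     # switch: generate vs copy

p_copy = [0.0] * len(vocab)
for tok, a in zip(input_tokens, alpha):
    if tok in vocab:                            # sum probs of matching nodes
        p_copy[vocab.index(tok)] += a

p_final = [theta * pv + (1 - theta) * pc for pv, pc in zip(p_vocab, p_copy)]
best = vocab[max(range(len(vocab)), key=p_final.__getitem__)]
# copying wins: "ryan" gets 0.3 * 0.05 + 0.7 * 0.7 = 0.505
```

Although “describes” has the highest generation probability, the heavily attended input node “ryan” dominates after interpolation, which is exactly how the mechanism favors sparse tokens present in the input.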
Copying vs anonymization
Both copying and anonymization alleviate the data sparsity problem by handling the open-class tokens. However, the copy mechanism has the following advantages over anonymization: (1) anonymization requires significant manual work to define the placeholders and heuristic rules both from subgraphs to placeholders and from placeholders to the surface tokens, (2) the copy mechanism automatically learns what to copy, while anonymization relies on hard rules to cover all types of the open-class tokens, and (3) the copy mechanism is easier to adapt to new domains and languages than anonymization.
5.4 Training and decoding
We train our models using the cross-entropy loss over each gold-standard output sequence