To discover the mystery of consciousness, several competing theories [2, 3, 4, 5] have been proposed by neuroscientists. Despite their contradictory claims, they share a common notion that consciousness is a cognitive state of experiencing one’s own existence, i.e. the state of awareness. Here, we do not refer to those elusive and mysterious meanings attributed to the word "consciousness". Instead, we focus on the basic idea, awareness or attentive awareness, to derive a neural network-based attentive computation framework on graphs, attempting to mimic the phenomenon of consciousness to some extent.
The first work to bring the idea of attentive awareness into deep learning models, as far as we know, is Yoshua Bengio’s consciousness prior. He points out the process of disentangling higher-level abstract factors from the full underlying representation and forming a low-dimensional combination of a few selected factors or concepts to constitute a conscious thought. Bengio emphasizes the role of the attention mechanism in expressing awareness, which helps focus on a few elements of the state representation at a given moment and combine them to make a statement, an action, or a policy. Two recurrent neural networks (RNNs), the representation RNN and the consciousness RNN, are used to summarize the current and recent past information and encode two types of state: the unconscious state, denoted by a full high-dimensional vector before applying attention, and the conscious state, denoted by a derived low-dimensional vector after applying attention.
Inspired by the consciousness prior, we develop an attentive message passing mechanism. We model query-dependent states as motivation to drive iterative sparse access to an underlying large graph and navigate information flow via a few nodes to reach a target. Instead of using RNNs, we use two GNNs [6, 7] with node state representations. Nodes sense nearby topological structures by exchanging messages with neighbors, and then use aggregated information to update their states. However, the standard message passing runs globally and uniformly. Messages gathered by a node can come from possibly everywhere and get further entangled by aggregation operations. Therefore, we need to draw a query-dependent or context-aware local subgraph to guide message passing. Nodes within such a subgraph are densely connected, forming a community to further exchange and share information, reaching some resonance, and making subsequent decisions collectively to expand the subgraph and navigate information flow. To support such attentive information flow, we design an attention flow layer above two GNNs. One GNN uses the standard message passing over a full graph, called unconsciousness flow layer, while the other GNN runs on a subgraph built by attention flow, called consciousness flow layer. These three flow layers constitute our attentive computation framework.
We realize the connection between attentive awareness and reasoning. A reasoning process is understood as a sequence of obvious or interpretable steps, either deductive, inductive, or abductive, to derive a less obvious conclusion. From the aspect of awareness, reasoning requires computation to be self-attentive or self-aware during processing in a way different from fitting by a black box. Therefore, interpretability must be one of the properties of reasoning. Taking KBC tasks as an example, many embedding-based models [8, 9, 10, 11, 12, 13] can do a really good job in link prediction, but lacking interpretation makes it hard to argue for their reasoning ability. People who aim at knowledge graph reasoning mainly focus on the path-based models using RL [14, 15, 16, 17] or logic-like methods [18, 19] to explicitly model a reasoning process to provide interpretations beyond predictions. Here, instead, we apply a flow-based attention mechanism, proposed in , as an alternative to RL for learning composition structure. In a manner of flowing, attention can propagate to cover a broader scope and increase the chance to hit a target. It maintains an end-to-end differentiable style, contrary to the way RL agents learn to choose a discrete action.
Other crucial properties of reasoning include relational inductive biases and iterative processing. Therefore, GNNs [6, 7] are a better choice than RNNs for encoding structured knowledge explicitly. Whereas the majority of previous GNN literature focuses on the computation side, making neural architectures more composable and complex, we bring a cognitive insight to it under the notion of attentive awareness. Specifically, we design an attention flow layer to chain attention operations directly with transition matrices, parallel to the message-passing pipeline so as to get less entangled with representation computation. This gives our model the ability to select edges step by step during computation and attend to a query-dependent subgraph, making a sharper prediction due to the disentanglement. These extracted subgraphs can greatly reduce the computation cost. In practice, we find our model can be applied to very large graphs, such as the YAGO3-10 dataset with millions of edges, even running on a single laptop.
Our contributions are three-fold: (1) We propose an attentive computation framework on graphs, combining GNNs’ representation power with explicit reasoning pattern, motivated by the cognitive notion of attentive awareness. (2) We exploit query-dependent subgraph structure, extracted by an attention flow mechanism, to address two shortcomings of most GNN implementations: the complexity and the non-context-aware aggregation schema. (3) We design a specific architecture for KBC tasks and demonstrate our model’s strong reasoning capability compared to the state of the art, showing that a compact query-dependent subgraph is better than a path as a reasoning pattern.
2 Related Work
KBC and knowledge graph reasoning. Early work for KBC, including TransE and its analogues [21, 22, 23], DistMult, ConvE, and ComplEx, focuses on learning embeddings of entities and relations. Some recent work in this line [12, 13] achieves high accuracy, yet is unable to explicitly deal with compositional relationships that are crucial for reasoning. Another line aims to learn inference paths [14, 24, 25, 26, 27, 28] for knowledge graph reasoning, such as DeepPath, MINERVA, and M-Walk, using RL to learn multi-hop relational paths over a graph towards a target given a query. However, these approaches, based on policy gradients or Monte Carlo tree search, often suffer from low sample efficiency and sparse rewards, requiring a large number of rollouts or simulations as well as sophisticated reward function design. Other efforts include learning soft logical rules [18, 19] or compositional programs to reason over knowledge graphs.
Relational reasoning by GNNs and attention mechanisms. Relational reasoning is regarded as the key component of humans’ capacity for combinatorial generalization, taking the form of entity- and relation-centric organization to reason about the composition structure of the world [30, 31, 32, 33, 34]. A multitude of recent implementations encode relational inductive biases into neural networks to exploit graph-structured representation, including graph convolution networks (GCNs) [35, 36, 37, 38, 39, 40, 41, 42] and graph neural networks [6, 43, 44, 45, 46], and overcome the difficulty that traditional deep learning models have in achieving relational reasoning. These approaches have been widely applied to accomplishing real-world reasoning tasks (such as physical reasoning [45, 47, 48, 49, 50, 51], visual reasoning [44, 51, 52, 53, 54], textual reasoning [44, 55, 56], knowledge graph reasoning [41, 57, 58], multiagent relationship reasoning [59, 60], and chemical reasoning), solving algorithmic problems (such as program verification [43, 61]62, 63, 64], state transitions, and Boolean satisfiability), or facilitating reinforcement learning with the structured reasoning or planning ability [67, 68, 49, 50, 69, 70, 71]. Variants of GNN architectures have been developed with different focuses. Relation networks use a simple but effective neural module to equip deep learning models with the relational reasoning ability, and their recurrent versions [55, 56] do multi-step relational inference for long periods; interaction networks provide a general-purpose learnable physics engine, and two of their variants are visual interaction networks, learning directly from raw visual data, and vertex attention interaction networks, with an attention mechanism; message passing neural networks unify various GCNs and GNNs into a general message passing formalism by analogy to the one in graphical models.
Despite the strong representation power of GNNs, recent work points out drawbacks that limit their capability. The vanilla message passing or neighborhood aggregation schema cannot adapt to strongly diverse local subgraph structure, causing performance degeneration when applying a deeper version or running more iterations, since a walk of more steps might drift away from the local neighborhood with information washed out via averaging. It is suggested that covariance rather than invariance to permutations of nodes and edges is preferable, since being fully invariant by summing or averaging messages may worsen the representation power, lacking steerability. In this context, our model expresses permutation invariance under a constrained compositional transformation according to the group of possible permutations within each extracted query-dependent subgraph rather than the underlying full graph. Another drawback is the heavy computation complexity. GNNs are notorious for their poor scalability due to their quadratic complexity in the number of nodes when graphs are fully connected. Even scaling linearly with the number of edges by exploiting structure sparsity can still cause trouble on very large graphs, making selective or attentive computation on graphs highly desirable.
Neighborhood attention operation can alleviate some limitation on GNNs’ representation power by specifying different weights to different nodes or nodes’ features [74, 60, 53, 75]. These approaches often use multi-head self-attention to focus on specific interactions with neighbors when aggregating messages, inspired by [76, 77, 78] originally for capturing long range dependencies. We notice that most graph-based attention mechanisms attend over neighborhood in a single-hop fashion, and  claims that the multi-hop architecture does not help in experiments, though they expect multiple hops to offer the potential to model high-order interaction. However, a flow-based design of attention in  shows a promising way to characterize long distance dependencies over graphs, breaking the isolation of attention operations and stringing them in chronological order by transition matrices, like the spread of a random walk, parallel to the message-passing pipeline.
It is natural to extend relational reasoning to graph structure inference or graph generation, such as reasoning about a latent interaction graph explicitly to acquire knowledge of observed dynamics, or learning generative models of graphs [79, 80, 81, 82]. Soft plus hard attention mechanisms may be a better alternative to probabilistic models, which are hard to train with latent discrete variables or might degrade multi-step predictions due to the inaccuracy (biased gradients) of back-propagation.
3 NeuCFlow Model
3.1 Attentive computation framework
We extend Bengio’s consciousness prior to graph-structured representation. Conscious thoughts are modeled by a few selected nodes and their edges, forming a context-aware subgraph, cohesive with sharper semantics, disentangled from the full graph. The underlying full graph forms the initial representation, entangled but rich, to help shape potential high-level subgraphs. We use attention flow to navigate conscious thoughts, capturing a step-by-step reasoning pattern. The attentive computation framework, as illustrated in Figure 1, consists of: (1) an unconsciousness flow (U-Flow) layer, (2) a consciousness flow (C-Flow) layer, and (3) an attention flow (A-Flow) layer, with four guidelines to design a specific implementation as follows:
U-Flow corresponds to a low-level computation graph for full state representation learning.
C-Flow contains high-level disentangled subgraphs for context-aware representation learning.
A-Flow is conditioned by both U-Flow and C-Flow, and also motivates C-Flow but not U-Flow.
Information can be accessed by C-Flow from U-Flow with the help of A-Flow.
3.2 Model architecture design for knowledge graph reasoning
We choose KBC tasks to do KG reasoning. We let $\mathcal{G} = \langle \mathcal{V}, \mathcal{E} \rangle$ denote a KG, where $\mathcal{V}$ is a set of nodes (or entities) and $\mathcal{E}$ is a set of edges (or relations). A KG is viewed as a directed graph with each edge represented by a triple $\langle h, r, t \rangle$, where $h$ is the head entity, $t$ is the tail entity, and $r$ is their relation type. The aim of a KBC task is to predict potential unknown links, i.e., which entity is likely to be the tail given a query $\langle h, r, ? \rangle$ with the head and the relation type specified.
The model architecture has three core components as shown in Figure 2. We here use the term "component" instead of "layer" to distinguish our flow layers from layers in the usual neural-network sense, as each flow layer is more like a block containing many neural network layers.
U-Flow component. We implement this component over the full graph using the standard message passing mechanism. If the graph has an extremely large number of edges, we sample a subset of edges, , randomly each step when running message passing. For each batch of input queries, we let the representation computed by the U-Flow component be shared across these different queries, which means U-Flow is query-independent, with its state representation tensors containing no batch dimension, so that its complexity does not scale with the batch size and the saved computation resources can be allocated to sampling more edges. In U-Flow, each node has a learnable embedding and a dynamical state for step , called unconscious node states, where the initial for all . Each edge type also has a learnable embedding , and edge can produce a message, denoted by , at step . The U-Flow component includes:
Message function: , where .
Message aggregation: , where .
Node state update function: , where .
We compute messages only for the sampled edges, , each step. Functions and are implemented by a two-layer MLP (using for the first layer and for the second layer) with input arguments concatenated respectively. Messages are aggregated by dividing the sum by the square root of the number of sampled neighbors that send messages, preserving the scale of variance. We use a residual addition to update each node state instead of a GRU or an LSTM. After running U-Flow for steps, we return a pooling result or simply the last, , to feed into downstream components.
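The U-Flow step described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions rather than the actual TensorFlow implementation: the two-layer MLPs are replaced by simple stand-in functions (`message_fn`, `update_fn`), and all function names, argument names, and the handling of nodes that receive no message are hypothetical. It shows edge sampling, aggregation that divides the summed messages by the square root of the number of sampled senders, and the residual state update.

```python
import math
import random


def message_fn(h_src, e_rel, h_dst):
    # Stand-in for the two-layer MLP over concatenated inputs (assumption).
    return [math.tanh(a + b + c) for a, b, c in zip(h_src, e_rel, h_dst)]


def update_fn(h, m, e_node):
    # Stand-in for the update MLP; its output is added residually (assumption).
    return [math.tanh(a + b + c) for a, b, c in zip(h, m, e_node)]


def u_flow_step(states, node_emb, rel_emb, edges, num_sampled, rng):
    """Run one query-independent message-passing step on sampled edges.

    states, node_emb: dict node -> list of floats (same dimension)
    rel_emb:          dict relation -> list of floats
    edges:            list of (head, relation, tail) triples
    """
    sampled = rng.sample(edges, min(num_sampled, len(edges)))
    dim = len(next(iter(states.values())))
    sums = {v: [0.0] * dim for v in states}
    senders = {v: 0 for v in states}
    for head, rel, tail in sampled:
        msg = message_fn(states[head], rel_emb[rel], states[tail])
        sums[tail] = [s + x for s, x in zip(sums[tail], msg)]
        senders[tail] += 1
    new_states = {}
    for v, h in states.items():
        # Divide the sum by sqrt(#senders); max(.., 1) guards isolated nodes.
        scale = 1.0 / math.sqrt(max(senders[v], 1))
        agg = [s * scale for s in sums[v]]
        delta = update_fn(h, agg, node_emb[v])
        new_states[v] = [a + d for a, d in zip(h, delta)]  # residual update
    return new_states
```

Because the states carry no batch dimension, one call serves every query in a batch, which is the property that lets the saved budget go to sampling more edges.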
C-Flow component. C-Flow is query-dependent, which means that conscious node states, denoted by , have a batch dimension representing different input queries, making the complexity scale with the batch size. However, as C-Flow uses attentive message passing, running on small local subgraphs each conditioned by a query, we leverage the sparsity to record only for the visited nodes . For example, when , for query , we start from node , with being a singleton, and thus record only. When computing messages, denoted by , in C-Flow, we use a sampling-attending procedure, explained in Section 3.3, to further control the number of computed edges. The C-Flow component has:
Message function: , where , and .
Message aggregation: , where .
Node state attending function: , where and .
Node state update function: , where .
C-Flow and U-Flow share the embeddings . A query is represented by its head and relation embeddings, and , participating in computing messages and updating node states. We here select a subset of edges, , rather than sampling, according to edges between the attended nodes at step and the seen nodes at step , defined in Section 3.3, as shown in Figure 3. We introduce the node state attending function to pass an unconscious state to C-Flow adjusted by a scalar attention and a learnable matrix . We initialize for , treating the rest as zero states.
Attention flow is represented by a series of probability distributions changing across steps, denoted as. The initial distribution is a one-hot vector with . To spread attention, we need to compute transition matrices each step. Given that A-Flow is conditioned by both U-Flow and C-Flow, we model the transition from to by two types of interaction: conscious-to-conscious, , and conscious-to-unconscious, . The former favors previously visited nodes, while the latter is useful to attend to unseen nodes.
where and , and and are two learnable matrices. Each MLP uses one single layer with the activation. To reduce the complexity for computing , we select attended nodes, , which is the set of nodes with the k-largest attention, and then sample from neighbors as next nodes. Then, we compute a sparse according to edges . Due to the fact that the attended nodes may not carry all attention, a small amount of attention can be lost during transition, causing the total amount to decrease. Therefore, we use a renormalized version, . We use the final attention on the tail as the probability for prediction to compute the training objective, as shown in Figure 2.
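One attention-flow transition with top-k selection and renormalization can be illustrated with the sketch below. It is a simplified pure-Python example, not the model's actual computation: `edge_score` is a hypothetical stand-in for the learned conscious-to-conscious and conscious-to-unconscious interaction scores, the row-wise softmax over outgoing edges is an assumed normalization, and neighbor sampling is omitted for brevity.

```python
import heapq
import math


def attention_transition(attention, out_neighbors, edge_score, k):
    """Spread attention one step from the k most-attended nodes.

    attention:     dict node -> probability
    out_neighbors: dict node -> list of neighbor nodes
    edge_score:    callable (src, dst) -> float, a hypothetical stand-in
                   for the learned interaction scores
    """
    attended = heapq.nlargest(k, attention, key=attention.get)
    spread = {}
    for src in attended:
        nbrs = out_neighbors.get(src, [])
        if not nbrs:
            continue
        # Row-wise softmax over outgoing edges (assumed normalization).
        weights = [math.exp(edge_score(src, dst)) for dst in nbrs]
        total = sum(weights)
        for dst, w in zip(nbrs, weights):
            spread[dst] = spread.get(dst, 0.0) + attention[src] * w / total
    # Attention held by non-attended nodes is lost, so renormalize.
    mass = sum(spread.values())
    if mass == 0.0:
        return {}
    return {node: a / mass for node, a in spread.items()}
```

In this sketch the self-loop edges play the role of letting attention stay on a node, so a target reached early can keep its attention until the final step.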
3.3 Complexity reduction by iterative sampling and attending
Previously, we used edge sampling, in a globally and uniformly random manner, to address the complexity issue in U-Flow, where we are not concerned about the batch size. Here, we need to confront the complexity that scales with the batch size in C-Flow. Suppose that we run a normal message passing for steps on a KG with nodes and edges for a batch of queries. Then, the complexity is where represents the number of representation dimensions. The complexity can be reduced to by using edge sampling. is a small positive integer, often less than . is normally between and , and being too small for would lead to underfitting. In U-Flow, we have , while in C-Flow, let us say . Then, to maintain the same complexity as U-Flow, we have to reduce the sampling rate by a factor of on each query. However, U-Flow’s edge sampling procedure is for the full graph, and it is inappropriate to apply it to C-Flow on each query due to the reduced sampling rate. Also, when becomes as large as , we also need to consider decreasing .
Good news is that C-Flow deals with a local subgraph for each query so that we only record a few selected nodes, called visited nodes, denoted by . We can see that is much less than . The initial , when , contains only one node , and then is enlarged each step by adding new nodes during spreading. When propagating messages, we only care about the one-step neighborhood each step. However, the spreading goes so rapidly that after only a few steps it covers almost all nodes, causing the number of computed edges to increase dramatically. The key to address the problem is that we need to constrain the scope of nodes we jump from each step, i.e., the core nodes that determine where we can go based on where we depart from. We call them attended nodes, which are in charge of the attending-from horizon, selected by based on the current attention . Given the set of attended nodes, we still need edge sampling over their neighborhoods in case of a hub node of extremely high degree. Here, we face a tricky problem that is to make a trade-off between the coverage and the complexity when sampling over the neighborhoods. Also, we need to well maintain these coherent context-aware node states and avoid possible noises or drifting away caused by sampling neighbors randomly. Therefore, we introduce an attending-to horizon inside the sampling horizon. We compute A-Flow over the sampling horizon with a smaller dimension to compute the attention, exchanged for sampling more neighbors to increase the coverage. Based on the newly computed attention , we select a smaller subset of nodes, , to receive messages in C-Flow, called seen nodes, in charge of the attending-to horizon. The next attending-from horizon is chosen by , a sub-horizon of the current attending-to horizon. All seen and attended nodes are stored as visited nodes along steps. We illustrate this sampling-attending procedure in Figure 3.
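The sampling-attending procedure above can be sketched as a single step over three nested horizons. This is an illustrative pure-Python sketch under stated assumptions: `uniform_spread` is a hypothetical stand-in for A-Flow's learned attention, and the parameter names `max_attended`, `max_edges`, and `max_seen` are chosen for the example to mirror the three caps described in the text.

```python
import heapq
import random


def top_k(scores, k):
    return heapq.nlargest(k, scores, key=scores.get)


def uniform_spread(attention, edges):
    # Hypothetical stand-in for A-Flow: split each node's attention evenly.
    out_deg = {}
    for src, _ in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
    new_att = {}
    for src, dst in edges:
        new_att[dst] = new_att.get(dst, 0.0) + attention[src] / out_deg[src]
    total = sum(new_att.values())
    return {n: a / total for n, a in new_att.items()} if total > 0 else {}


def sampling_attending_step(attention, neighbors, visited,
                            max_attended, max_edges, max_seen, rng):
    # Attending-from horizon: nodes with the largest current attention.
    attended = top_k(attention, max_attended)
    # Sampling horizon: cap the edges drawn per attended node (hub nodes).
    sampled = []
    for src in attended:
        nbrs = neighbors.get(src, [])
        for dst in rng.sample(nbrs, min(max_edges, len(nbrs))):
            sampled.append((src, dst))
    # New attention is computed over the whole sampling horizon.
    new_attention = uniform_spread(attention, sampled)
    # Attending-to horizon: only the top "seen" nodes receive messages.
    seen = set(top_k(new_attention, max_seen))
    message_edges = [(s, d) for s, d in sampled if d in seen]
    visited = visited | set(attended) | seen
    return new_attention, seen, message_edges, visited
```

Calling the function again with `new_attention` picks the next attended nodes from within the current seen nodes whenever `max_attended` does not exceed `max_seen`, which matches the sub-horizon relation described above.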
To compute our reduced complexity, we let be the maximum number of sampled edges per attended node per step, the maximum number of seen nodes per step, and the maximum number of attended nodes per step. We also denote the dimension number used in A-Flow as . For one batch, the complexity of C-Flow is for the worst case, where attended and seen nodes are fully connected, and in most cases, where is a small constant. The complexity of A-Flow is where is much smaller than .
4.1 Datasets and experimental settings
|Dataset||#Entities||#Rels||#Train||#Valid||#Test||PME (tr)||PME (te)||AvgD (te)|
Table 2: Comparison results on FB15K-237 and WN18RR. Some collected results only have a metric score, while some, including ours, take the form of "mean (standard deviation)".
|Method||FB15K-237 H@1||FB15K-237 H@3||FB15K-237 H@10||FB15K-237 MRR||WN18RR H@1||WN18RR H@3||WN18RR H@10||WN18RR MRR|
|DistMult||20.6 (.4)||31.8 (.2)||-||29.0 (.2)||38.4 (.4)||42.4 (.3)||-||41.3 (.3)|
|ComplEx||20.8 (.2)||32.6 (.5)||-||29.6 (.2)||38.5 (.3)||43.9 (.3)||-||42.2 (.2)|
|ConvE||23.3 (.4)||33.8 (.3)||-||30.8 (.2)||39.6 (.3)||44.7 (.2)||-||43.3 (.2)|
|NeuralLP||18.2 (.6)||27.2 (.3)||-||24.9 (.2)||37.2 (.1)||43.4 (.1)||-||43.5 (.1)|
|MINERVA||14.1 (.2)||23.2 (.4)||-||20.5 (.3)||35.1 (.1)||44.5 (.4)||-||40.9 (.1)|
|M-Walk||16.5 (.3)||24.3 (.2)||-||23.2 (.2)||41.4 (.1)||44.5 (.2)||-||43.7 (.1)|
|NeuCFlow||28.6 (.1)||40.3 (.1)||53.0 (.3)||36.9 (.1)||44.4 (.4)||49.7 (.8)||55.8 (.5)||48.2 (.5)|
Datasets. We evaluate our model using six large KG datasets (https://github.com/netpaladinx/NeuCFlow/tree/master/data): FB15K, FB15K-237, WN18, WN18RR, NELL995, and YAGO3-10. FB15K-237 is sampled from FB15K with redundant relations removed, and WN18RR is a subset of WN18 with the triples that cause test leakage removed. Thus, they are both considered more challenging. NELL995 has separate datasets for 12 query relations, each corresponding to a single-query-relation KBC task. YAGO3-10 contains the largest KG, with millions of edges. Their statistics are shown in Table 1. We find some statistical differences between train and test. In a KG with all training triples as its edges, a triple is considered a multi-edge triple if the KG contains other triples that also connect its head and tail, ignoring the direction. We notice that FB15K-237 is a special case compared with the others, as there are no edges in its KG directly linking any head-tail pair in test. Therefore, when using training triples as queries to train our model, given a batch, for FB15K-237, we cut off from the KG all triples connecting the head-tail pairs in the given batch, ignoring relation types and edge directions, forcing the model to learn a composite reasoning pattern rather than a single-hop pattern, and for the remaining datasets, we only remove the triples of this batch and their inverses from the KG before training on this batch.
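The two per-batch edge-removal rules described above can be sketched as follows. This is a hedged illustration, not the released preprocessing code: the function names and the `"_inv"` suffix used to name inverse relations are assumptions made for the example.

```python
def inverse(rel):
    # Hypothetical naming scheme for inverse relations (assumption).
    return rel[:-4] if rel.endswith("_inv") else rel + "_inv"


def prune_for_batch(kg_triples, batch, cut_all_between_pairs):
    """Return the KG edges that remain visible while training on `batch`."""
    if cut_all_between_pairs:
        # FB15K-237 mode: drop every edge between a query's head and tail,
        # ignoring relation types and edge directions.
        pairs = {frozenset((h, t)) for h, _, t in batch}
        return [(h, r, t) for h, r, t in kg_triples
                if frozenset((h, t)) not in pairs]
    # Default mode: drop only the batch triples and their inverses.
    blocked = set()
    for h, r, t in batch:
        blocked.add((h, r, t))
        blocked.add((t, inverse(r), h))
    return [tr for tr in kg_triples if tr not in blocked]
```

The stricter mode leaves no single-hop shortcut between a query's head and tail, so the answer can only be reached through a multi-hop composition.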
Experimental settings. We create a KG, a directed graph, consisting of all train triples and their inverses for each dataset except NELL995, since it already includes reciprocal relations. Besides, every node in the KGs has a self-loop edge to itself. We also add inverse relations into the validation and test sets to evaluate the two directions. For evaluation metrics, we use HITS@1,3,10 and the mean reciprocal rank (MRR) in the filtered setting for FB15K-237, WN18RR, FB15K, WN18, and YAGO3-10, and use the mean average precision (MAP) for NELL995’s single-query-relation KBC tasks. For NELL995, we follow the same evaluation procedure as in [15, 16, 17], ranking the answer entities against the negative examples given in their experiments. We run our experiments using a 12G-memory GPU, TITAN X (Pascal), with Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz. Our code is written in Python based on TensorFlow 2.0 and NumPy 1.16.
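For reference, the filtered-setting metrics can be computed as in the sketch below. It is a generic illustration of filtered ranking, MRR, and HITS@k rather than the evaluation script used in the experiments; the function names are hypothetical, and resolving score ties in the target's favor is an assumption.

```python
def filtered_rank(scores, target, known_answers):
    """Rank of `target` after filtering out other known correct answers.

    scores:        dict entity -> predicted score (e.g. final attention)
    known_answers: set of entities that are also correct for this query
    """
    rank = 1
    for entity, score in scores.items():
        if entity == target or entity in known_answers:
            continue  # filtered setting: skip other true answers
        if score > scores[target]:
            rank += 1  # ties resolved in the target's favor (assumption)
    return rank


def mrr_and_hits(ranks, ks=(1, 3, 10)):
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, hits
```

With the final attention distribution used as `scores`, entities that never receive attention can be treated as having score zero.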
4.2 Baselines and comparison results
|NeuCFlow||72.6 (.4)||78.4 (.4)||83.4 (.5)||76.4 (.4)||91.6 (.8)||93.6 (.4)||94.9 (.4)||92.8 (.6)|
|AthletePlaysForTeam||83.9 (0.5)||84.7 (1.3)||82.7 (0.8)||72.1 (1.2)||62.7||67.3|
|AthletePlaysInLeague||97.5 (0.1)||97.8 (0.2)||95.2 (0.8)||92.7 (5.3)||77.3||91.2|
|AthleteHomeStadium||93.6 (0.1)||91.9 (0.1)||92.8 (0.1)||84.6 (0.8)||71.8||72.2|
|AthletePlaysSport||98.6 (0.0)||98.3 (0.1)||98.6 (0.1)||91.7 (4.1)||87.6||96.3|
|TeamPlayssport||90.4 (0.4)||88.4 (1.8)||87.5 (0.5)||69.6 (6.7)||76.1||81.4|
|OrgHeadQuarteredInCity||94.7 (0.3)||95.0 (0.7)||94.5 (0.3)||79.0 (0.0)||62.0||65.7|
|WorksFor||86.8 (0.0)||84.2 (0.6)||82.7 (0.5)||69.9 (0.3)||67.7||69.2|
|PersonBornInLocation||84.1 (0.5)||81.2 (0.0)||78.2 (0.0)||75.5 (0.5)||71.2||81.2|
|PersonLeadsOrg||88.4 (0.1)||88.8 (0.5)||83.0 (2.6)||79.0 (1.0)||75.1||77.2|
|OrgHiredPerson||84.7 (0.8)||88.8 (0.6)||87.0 (0.3)||73.8 (1.9)||71.9||73.7|
Baselines. We compare our model against embedding-based approaches, including TransE, TransR, DistMult, ConvE, ComplEx, HolE, RotatE, and ComplEx-N3; path-based approaches that use RL methods, including DeepPath, MINERVA, and M-Walk; and an approach that uses learned neural logic, NeuralLP. For all the baselines, we quote the results from the corresponding papers instead of rerunning them. For our method, we run the experiments three times in each hyperparameter setting on each dataset to report the means and standard deviations of the results. We put the details of our hyperparameter settings in the appendix.
Comparison results and analysis. We first report the comparison on FB15K-237 and WN18RR in Table 2. NeuCFlow has a surprisingly good result, significantly outperforming all the compared methods in HITS@1,3 and MRR on both datasets. Compared to the best baseline, RotatE, published very recently, we only lose a few points in HITS@10 but gain a lot in HITS@1,3 and MRR. Based on the observation that NeuCFlow gains a larger advantage when k in HITS@k gets smaller, we speculate that the reasoning ability acquired by NeuCFlow is to make a sharper prediction by exploiting graph-structured composition locally and conditionally, in contrast to embedding-based methods, which rely entirely on vectorized representation. When a target becomes too vague to predict, reasoning may lose its great advantage, though it remains very competitive. However, path-based baselines, with a certain ability to do KG reasoning, perform worse than we expected. We argue that it is inappropriate to think of reasoning, a sequential decision process, as a sequence of nodes, i.e., a path, in KGs. The average length of the shortest paths between heads and tails in the test set in a KG, as shown in Table 1, suggests an extremely short path, making the motivation for using a path pattern almost pointless. The iterative reasoning pattern should be characterized in the form of dynamically varying local graph-structured patterns, holding a bunch of nodes resonating with each other to produce a decision collectively. Then, we run our model on larger KGs, including FB15K, WN18, and YAGO3-10, and summarize the comparison in Tables 3 and 4, where NeuCFlow beats most well-known baselines and achieves a very competitive position against the best state-of-the-art methods. Moreover, we summarize the comparison on NELL995’s tasks in Table 5. NeuCFlow performs the best on five tasks, also being very competitive against M-Walk, the best path-based method as far as we know, on the rest.
We find no reporting on the last two tasks from the corresponding papers.
4.3 Experimental analysis
Convergence analysis. During training we find that NeuCFlow converges surprisingly fast. We may use half of the training examples to get the model well trained and generalize it to the test set, sometimes producing an even better metric score than training for a full epoch, as shown in Figure 4(A). Compared with the less expensive computation of embedding-based models, although our model takes a large number of edges to compute for each input query, consuming more time on one batch, it does not need a second epoch, or even to take all training triples as queries in one epoch, thus saving a lot of training time. The reason may be that all queries come directly from the KG’s edge set and some of them have probably been exploited to construct subgraphs many times during the training of other queries, so that we might not have to train the model on each query explicitly as long as we have other ways to exploit them.
Component analysis. If we do not run U-Flow, then the unconscious state is just the initial embedding of node , and we can still run C-Flow as usual. We want to know whether the U-Flow component is actually useful. Considering that long-distance message passing might bring in less informative features, we compare running U-Flow for two steps against totally shutting it down. The result in Figure 4(B) shows that U-Flow brings a small gain in each metric on WN18RR.
Horizon analysis. The sampling, attending and searching horizons determine how large an area the flow can spread over. They impact the computation complexity as well as the performance of the model to different degrees depending on the properties of a dataset. Intuitively, enlarging the probe scope by sampling more, attending more, or searching longer may increase the chance of hitting a target. However, the experimental results in Figure 4(C)(D) show that this is not always the case. In Figure 4(E), we can see that increasing the maximum number of attending-from nodes, i.e., attended nodes, per step is more important, but our GPU does not allow a larger number, as it cannot accommodate the additional intermediate data produced during computation and raises a ResourceExhaustedError. Figure 4(F) shows that the number of C-Flow steps cannot be as small as two.
Attention flow analysis. If attention flow can really capture the way we reason about the world, its process should be conducted in a diverging-converging thinking pattern. Intuitively, first, for the diverging thinking, we search and collect ideas as much as we can; then, for the converging thinking, we try to concentrate our thoughts on one point. To check whether the attention flow has such a pattern, we measure the average entropy of attention distributions varying along steps and also the proportion of attention concentrated at the top-1,3,5 attended nodes. As we expect, attention indeed is more focused at the final step as well as at the beginning.
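The two diagnostics used in this analysis, the entropy of an attention distribution and the share of attention held by the top-k nodes, can be computed as in the short sketch below. The function names are hypothetical, and the natural logarithm is an assumed choice of base.

```python
import math


def attention_entropy(attention):
    # Shannon entropy (in nats) of a distribution given as node -> probability.
    return -sum(p * math.log(p) for p in attention.values() if p > 0)


def top_k_mass(attention, k):
    # Proportion of attention concentrated on the k most-attended nodes.
    return sum(sorted(attention.values(), reverse=True)[:k])
```

Averaging these quantities over queries at each step gives the curves: low entropy and high top-1 mass indicate focused attention, while high entropy indicates the diverging phase.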
Time cost analysis. The time cost is affected not only by the scale of a dataset but also by the horizon setting. For each dataset, we list the training time for one epoch corresponding to the standard hyperparameter settings in the appendix. Note that there is always a trade-off between the complexity and the performance. We thus study whether we can reduce the time cost substantially at the price of sacrificing a little performance. We plot the one-epoch training time in Figure 6(A)-(D), using the same settings as in the horizon analysis. We can see that Max-attended-nodes-per-step and #Steps-of-C-Flow affect the training time significantly, while Max-sampled-edges-per-node and Max-seen-nodes-per-step affect it only slightly. Therefore, we can use smaller Max-sampled-edges-per-node and Max-seen-nodes-per-step in order to gain a larger batch size, making the computation more efficient, as shown in Figure 6(E).
To further demonstrate the reasoning ability acquired by our model, we show some visualization results of the extracted subgraphs on NELL995’s test data for 12 separate tasks. We avoid using the training data in order to show the generalization of our model’s learned reasoning ability on knowledge graphs. Here, we show the visualization result for the AthletePlaysForTeam task. The rest can be found in the appendix.
For the AthletePlaysForTeam task
In the above case, the query is (concept_personnorthamerica_michael_turner, concept:athleteplaysforteam, ?) and the desired answer is concept_sportsteam_falcons. From Figure 7, we can see that our model learns that (concept_personnorthamerica_michael_turner, concept:athletehomestadium, concept_stadiumoreventvenue_georgia_dome) and (concept_stadiumoreventvenue_georgia_dome, concept:teamhomestadium_inv, concept_sportsteam_falcons) are two important facts supporting the answer concept_sportsteam_falcons. In addition, other facts, such as (concept_athlete_joey_harrington, concept:athletehomestadium, concept_stadiumoreventvenue_georgia_dome) and (concept_athlete_joey_harrington, concept:athleteplaysforteam, concept_sportsteam_falcons), provide a vivid example that an athlete whose home stadium is concept_stadiumoreventvenue_georgia_dome might play for the team concept_sportsteam_falcons. There is more than one such example, including concept_athlete_roddy_white and concept_athlete_quarterback_matt_ryan. The entity concept_sportsleague_nfl cannot help us differentiate the true answer from other NFL teams, but it can at least exclude non-NFL teams. In short, our subgraph-structured representation captures the relational and compositional reasoning pattern well.
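To make the compositional pattern in this case concrete, the sketch below stores the four facts quoted above as triples and enumerates the relation paths that connect a head entity to the answer. It is a brute-force illustration only: the helper names build_index and paths_to are hypothetical, and the enumeration stands in for, and is not, the learned attention flow of our model.

```python
from collections import defaultdict

# Facts quoted from the case study, as (head, relation, tail) triples.
FACTS = [
    ("concept_personnorthamerica_michael_turner", "concept:athletehomestadium",
     "concept_stadiumoreventvenue_georgia_dome"),
    ("concept_stadiumoreventvenue_georgia_dome", "concept:teamhomestadium_inv",
     "concept_sportsteam_falcons"),
    ("concept_athlete_joey_harrington", "concept:athletehomestadium",
     "concept_stadiumoreventvenue_georgia_dome"),
    ("concept_athlete_joey_harrington", "concept:athleteplaysforteam",
     "concept_sportsteam_falcons"),
]

def build_index(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    out_edges = defaultdict(list)
    for head, relation, tail in triples:
        out_edges[head].append((relation, tail))
    return out_edges

def paths_to(out_edges, source, target, max_hops=2):
    """Enumerate paths from source to target that use at most max_hops edges."""
    results = []
    frontier = [(source, [])]
    for _ in range(max_hops):
        next_frontier = []
        for node, path in frontier:
            for relation, tail in out_edges.get(node, []):
                new_path = path + [(node, relation, tail)]
                if tail == target:
                    results.append(new_path)      # reached the answer
                else:
                    next_frontier.append((tail, new_path))
        frontier = next_frontier
    return results

if __name__ == "__main__":
    index = build_index(FACTS)
    for path in paths_to(index, "concept_personnorthamerica_michael_turner",
                         "concept_sportsteam_falcons"):
        print(" -> ".join(relation for _, relation, _ in path))
```

For the query entity, the only supporting path composes concept:athletehomestadium with concept:teamhomestadium_inv, whereas concept_athlete_joey_harrington reaches the same team both directly and through the same two-hop composition, which is the kind of evidence discussed above.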
We introduce an attentive message passing mechanism on graphs under the notion of attentive awareness, inspired by the phenomenon of consciousness, to model the iterative compositional reasoning pattern by forming a compact query-dependent subgraph. We propose an attentive computation framework with three flow-based layers to combine GNNs’ representation power with an explicit reasoning process, and to further reduce the complexity of applying GNNs to large-scale graphs. It is worth mentioning that our framework is not limited to knowledge graph reasoning, but applies more widely to large-scale graph-based computation in which only a few input-dependent nodes and edges are involved each time.
-  Yoshua Bengio. The consciousness prior. CoRR, abs/1709.08568, 2017.
-  Stanislas Dehaene, Michel Kerszberg, and Jean Pierre Changeux. A neuronal model of a global workspace in effortful cognitive tasks. Proceedings of the National Academy of Sciences of the United States of America, 95 24:14529–34, 1998.
-  Giulio Tononi, Mélanie Boly, Marcello Massimini, and Christof Koch. Integrated information theory: from consciousness to its physical substrate. Nature Reviews Neuroscience, 17:450–461, 2016.
-  David Rosenthal and Josh Weisberg. Higher-order theories of consciousness. Scholarpedia, 3:4407, 2008.
-  Robert Van Gulick. Higher-order global states (hogs): an alternative higher-order model. Higher-order theories of consciousness, pages 67–93, 2004.
-  Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20:61–80, 2009.
-  Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018.
-  Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, 2013.
-  Bishan Yang, Wen tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. CoRR, abs/1412.6575, 2015.
-  Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. In AAAI, 2018.
-  Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, 2016.
-  Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. Rotate: Knowledge graph embedding by relational rotation in complex space. CoRR, abs/1902.10197, 2018.
-  Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. In ICML, 2018.
-  Ni Lao, Tom Michael Mitchell, and William W. Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, 2011.
-  Wenhan Xiong, Thien Hoang, and William Yang Wang. Deeppath: A reinforcement learning method for knowledge graph reasoning. In EMNLP, 2017.
-  Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alexander J. Smola, and Andrew McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. CoRR, abs/1711.05851, 2018.
-  Yelong Shen, Jianshu Chen, Pu Huang, Yuqing Guo, and Jianfeng Gao. M-walk: Learning to walk over graphs using monte carlo tree search. In NeurIPS, 2018.
-  William W. Cohen. Tensorlog: A differentiable deductive database. CoRR, abs/1605.06523, 2016.
-  Fan Yang, Zhilin Yang, and William W. Cohen. Differentiable learning of logical rules for knowledge base reasoning. In NIPS, 2017.
-  Xiaoran Xu, Songpeng Zu, Chengliang Gao, Yuan Zhang, and Wei Feng. Modeling attention flow on graphs. CoRR, abs/1811.00497, 2018.
-  Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, 2014.
-  Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, 2015.
-  Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jian Zhao. Knowledge graph embedding via dynamic mapping matrix. In ACL, 2015.
-  Matt Gardner, Partha Pratim Talukdar, Jayant Krishnamurthy, and Tom Michael Mitchell. Incorporating vector space similarity in random walk inference over knowledge bases. In EMNLP, 2014.
-  Kelvin Guu, John Miller, and Percy S. Liang. Traversing knowledge graphs in vector space. In EMNLP, 2015.
-  Yankai Lin, Zhiyuan Liu, and Maosong Sun. Modeling relation paths for representation learning of knowledge bases. In EMNLP, 2015.
-  Kristina Toutanova, Victoria Lin, Wen tau Yih, Hoifung Poon, and Chris Quirk. Compositional learning of embeddings for relation paths in knowledge base and text. In ACL, 2016.
-  Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. Chains of reasoning over entities, relations, and text using recurrent neural networks. In EACL, 2017.
-  Chen Liang, Jonathan Berant, Quoc V. Le, Kenneth D. Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. In ACL, 2016.
-  Kenneth H. Craik. The nature of explanation. 1952.
-  John R. Anderson. Acquisition of cognitive skill. 1982.
-  Dedre Gentner and Arthur B. Markman. Structure mapping in analogy and similarity. 1997.
-  John E. Hummel and Keith J. Holyoak. A symbolic-connectionist theory of relational inference and generalization. Psychological review, 110 2:220–64, 2003.
-  Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. The Behavioral and brain sciences, 40:e253, 2017.
-  Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2014.
-  Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. CoRR, abs/1506.05163, 2015.
-  David K. Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
-  Steven M. Kearnes, Kevin McCloskey, Marc Berndl, Vijay S. Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30 8:595–608, 2016.
-  Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
-  Mathias Niepert, Mohammed Hassan Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML, 2016.
-  Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2017.
-  Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34:18–42, 2017.
-  Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks. CoRR, abs/1511.05493, 2016.
-  Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Timothy P. Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017.
-  Peter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In NIPS, 2016.
-  Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In ICML, 2017.
-  Michael Chang, Tomer Ullman, Antonio Torralba, and Joshua B. Tenenbaum. A compositional object-based approach to learning physical dynamics. CoRR, abs/1612.00341, 2017.
-  Thomas N. Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard S. Zemel. Neural relational inference for interacting systems. In ICML, 2018.
-  Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin A. Riedmiller, Raia Hadsell, and Peter W. Battaglia. Graph networks as learnable physics engines for inference and control. In ICML, 2018.
-  Jessica B. Hamrick, Kelsey R. Allen, Victor Bapst, Tina Zhu, Kevin R. McKee, Joshua B. Tenenbaum, and Peter W. Battaglia. Relational inductive bias for physical construction in humans and machines. CoRR, abs/1806.01203, 2018.
-  Nicholas Watters, Daniel Zoran, Théophane Weber, Peter W. Battaglia, Razvan Pascanu, and Andrea Tacchetti. Visual interaction networks: Learning a physics simulator from video. In NIPS, 2017.
-  David Raposo, Adam Santoro, David G. T. Barrett, Razvan Pascanu, Timothy P. Lillicrap, and Peter W. Battaglia. Discovering objects and their relations from entangled scene representations. CoRR, abs/1702.05068, 2017.
-  Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
-  Xinlei Chen, Li-Jia Li, Li Fei-Fei, and Abhinav Gupta. Iterative visual reasoning beyond convolutions. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7239–7248, 2018.
-  Adam Santoro, Ryan Faulkner, David Raposo, Jack W. Rae, Mike Chrzanowski, Théophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy P. Lillicrap. Relational recurrent neural networks. In NeurIPS, 2018.
-  Rasmus Berg Palm, Ulrich Paquet, and Ole Winther. Recurrent relational networks. In NeurIPS, 2018.
-  Daniel Oñoro-Rubio, Mathias Niepert, Alberto García-Durán, Roberto Gonzalez, and Roberto Javier López-Sastre. Representation learning for visual-relational knowledge graphs. CoRR, abs/1709.02314, 2017.
-  Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach. 2017.
-  Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In NIPS, 2016.
-  Yedid Hoshen. Vain: Attentional multi-agent predictive modeling. In NIPS, 2017.
-  Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. CoRR, abs/1711.00740, 2018.
-  Irwan Bello, Hieu Quang Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. CoRR, abs/1611.09940, 2017.
-  Alex Nowak, Soledad Villar, Afonso S. Bandeira, and Joan Bruna. A note on learning algorithms for quadratic assignment with graph neural networks. CoRR, abs/1706.07450, 2017.
-  Elias Boutros Khalil, Hanjun Dai, Yuyu Zhang, Bistra N. Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In NIPS, 2017.
-  Daniel D. Johnson. Learning graphical state transitions. In ICLR, 2017.
-  Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy S. Liang, Leonardo de Moura, and David L. Dill. Learning a sat solver from single-bit supervision. CoRR, abs/1802.03685, 2018.
-  Jessica B. Hamrick, Andrew J. Ballard, Razvan Pascanu, Oriol Vinyals, Nicolas Heess, and Peter W. Battaglia. Metacontrol for adaptive imagination-based optimization. CoRR, abs/1705.02670, 2017.
-  Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, Lars Buesing, Sébastien Racanière, David P. Reichert, Théophane Weber, Daan Wierstra, and Peter W. Battaglia. Learning model-based planning from scratch. CoRR, abs/1707.06170, 2017.
-  Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. Nervenet: Learning structured policy with graph neural networks. In ICLR, 2018.
-  Vinícius Flores Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David P. Reichert, Timothy P. Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter W. Battaglia. Relational deep reinforcement learning. CoRR, abs/1806.01830, 2018.
-  Sam Toyer, Felipe W. Trevizan, Sylvie Thiébaux, and Lexing Xie. Action schema networks: Generalised policies with deep learning. In AAAI, 2018.
-  Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.
-  Risi Kondor, Hy Truong Son, Horace Pan, Brandon M. Anderson, and Shubhendu Trivedi. Covariant compositional networks for learning graphs. CoRR, abs/1801.02144, 2018.
-  Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Alejandro Romero, Pietro Lió, and Yoshua Bengio. Graph attention networks. CoRR, abs/1710.10903, 2018.
-  Wouter Kool. Attention solves your tsp, approximately. 2018.
-  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2015.
-  Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. CoRR, abs/1703.03130, 2017.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
-  Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter W. Battaglia. Learning deep generative models of graphs. CoRR, abs/1803.03324, 2018.
-  Nicola De Cao and Thomas Kipf. Molgan: An implicit generative model for small molecular graphs. CoRR, abs/1805.11973, 2018.
-  Jiaxuan You, Zhitao Ying, Xiang Ren, William L. Hamilton, and Jure Leskovec. Graphrnn: Generating realistic graphs with deep auto-regressive models. In ICML, 2018.
-  Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. Netgan: Generating graphs via random walks. In ICML, 2018.
-  Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Q. Phung. A novel embedding model for knowledge base completion based on convolutional neural network. In NAACL-HLT, 2018.
-  Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference. 2015.
-  Farzaneh Mahdisoltani, Joanna Asia Biega, and Fabian M. Suchanek. Yago3: A knowledge base from multilingual wikipedias. In CIDR, 2014.
-  Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. Holographic embeddings of knowledge graphs. In AAAI, 2016.
6.1 Hyperparameter settings
| Training time per epoch (h) | 25.7 | 63.7 | 4.3 | 8.5 | 185.0 | 0.12 |
Our hyperparameters can be categorized into three groups:
The normal hyperparameters, including batch_size, n_dims_att, n_dims, learning_rate, grad_clipnorm, and n_epochs. Here, we set a smaller dimension, n_dims_att, for the attention flow computation, because it uses more edges than the message passing in the consciousness flow layer does, and because, intuitively, it does not need to propagate high-dimensional messages but only computes a scalar score for each of the sampled neighbor nodes, in concert with the idea in the key-value mechanism. We set n_epochs = 1 in most cases, indicating that our model needs to be trained for only one epoch due to its fast convergence.
The hyperparameters that control the sampling-attending horizon. These include max_sampled_edges_per_step, the maximum number of edges sampled per step per query for message passing in the unconsciousness flow layer, and three hyperparameters for the consciousness flow layer: max_sampled_edges_per_node, the maximum number of edges sampled per current node per step per query; max_attended_nodes_per_step, the maximum number of current nodes to attend from per step per query; and max_seen_nodes_per_step, the maximum number of neighbor nodes to attend to per step per query.
The hyperparameters that control the searching horizon, including n_steps_of_u_flow, the number of steps to run the unconsciousness flow, and n_steps_of_c_flow, the number of steps to run the consciousness flow.
Note that we choose these hyperparameters based not only on their performance but also on the computation resources available to us. In some cases, to deal with a very large knowledge graph under limited resources, we need to make a trade-off between efficiency and effectiveness. For example, each of NELL995’s single-query-relation tasks has a small training set, though still a large graph, so we can reduce the batch size in favor of larger dimensions and a larger sampling-attending horizon without worrying about waiting too long to finish one epoch.
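The three groups can be laid out as a configuration object, as in the sketch below. The field names follow the hyperparameter names above, but the class names, the check helper, and every numeric value except n_epochs = 1 are placeholders chosen for illustration, not the settings used in our experiments.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class NormalParams:
    batch_size: int = 64            # placeholder value
    n_dims: int = 100               # placeholder value
    n_dims_att: int = 50            # placeholder; kept smaller than n_dims
    learning_rate: float = 1e-3     # placeholder value
    grad_clipnorm: float = 1.0      # placeholder value
    n_epochs: int = 1               # one epoch in most cases

@dataclass
class SamplingAttendingHorizon:
    max_sampled_edges_per_step: int = 10000   # unconsciousness flow layer (placeholder)
    max_sampled_edges_per_node: int = 200     # consciousness flow layer (placeholder)
    max_attended_nodes_per_step: int = 20     # placeholder value
    max_seen_nodes_per_step: int = 200        # placeholder value

@dataclass
class SearchingHorizon:
    n_steps_of_u_flow: int = 2      # placeholder value
    n_steps_of_c_flow: int = 6      # placeholder value

@dataclass
class Config:
    normal: NormalParams = field(default_factory=NormalParams)
    sampling: SamplingAttendingHorizon = field(default_factory=SamplingAttendingHorizon)
    searching: SearchingHorizon = field(default_factory=SearchingHorizon)

def check(cfg: Config) -> None:
    """Basic sanity checks implied by the description of the groups."""
    if cfg.normal.n_dims_att > cfg.normal.n_dims:
        raise ValueError("n_dims_att is expected to be no larger than n_dims")
    for group in asdict(cfg).values():
        for name, value in group.items():
            if value <= 0:
                raise ValueError(f"{name} must be positive")

if __name__ == "__main__":
    # Trade-off for a small training set: smaller batch, larger dimensions.
    small_task = Config(normal=NormalParams(batch_size=16, n_dims=200, n_dims_att=100))
    check(small_task)
    print(asdict(small_task))
```

Grouping the fields this way keeps the horizon settings, which drive the time cost, separate from the ordinary training settings, so the two can be tuned independently.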
6.2 Other experimental analysis
6.3 Other visualization
For the AthletePlaysInLeague task
For the AthleteHomeStadium task
For the AthletePlaysSport task
For the TeamPlaysSport task
For the OrganizationHeadQuarteredInCity task
For the WorksFor task
For the PersonBornInLocation task
For the PersonLeadsOrganization task
For the OrganizationHiredPerson task
For the AgentBelongsToOrganization task
For the TeamPlaysInLeague task