1 Introduction
To unravel the mystery of consciousness, neuroscientists have proposed several competing theories [2, 3, 4, 5]. Despite their contradictory claims, they share a common notion that consciousness is a cognitive state of experiencing one’s own existence, i.e., the state of awareness. Here, we do not refer to the elusive and mysterious meanings attributed to the word "consciousness". Instead, we focus on the basic idea, awareness or attentive awareness, to derive a neural network-based attentive computation framework on graphs, attempting to mimic the phenomenon of consciousness to some extent.
The first work to bring the idea of attentive awareness into deep learning models, as far as we know, is Yoshua Bengio’s consciousness prior [1]. He describes the process of disentangling higher-level abstract factors from the full underlying representation and forming a low-dimensional combination of a few selected factors or concepts to constitute a conscious thought. Bengio emphasizes the role of the attention mechanism in expressing awareness, which helps focus on a few elements of the state representation at a given moment and combine them to make a statement, an action, or a policy. Two recurrent neural networks (RNNs), the representation RNN and the consciousness RNN, are used to summarize the current and recent past information and encode two types of state: the unconscious state, denoted by a full high-dimensional vector before applying attention, and the conscious state, denoted by a derived low-dimensional vector after applying attention.
Inspired by the consciousness prior, we develop an attentive message passing mechanism. We model query-dependent states as motivation to drive iterative sparse access to an underlying large graph and navigate information flow via a few nodes to reach a target. Instead of using RNNs, we use two GNNs [6, 7] with node state representations. Nodes sense nearby topological structures by exchanging messages with neighbors, and then use aggregated information to update their states. However, standard message passing runs globally and uniformly. Messages gathered by a node can come from possibly everywhere and get further entangled by aggregation operations. Therefore, we need to draw a query-dependent or context-aware local subgraph to guide message passing. Nodes within such a subgraph are densely connected, forming a community to further exchange and share information, reaching some resonance, and making subsequent decisions collectively to expand the subgraph and navigate information flow. To support such attentive information flow, we design an attention flow layer above two GNNs. One GNN uses standard message passing over the full graph, called the unconsciousness flow layer, while the other GNN runs on a subgraph built by attention flow, called the consciousness flow layer. These three flow layers constitute our attentive computation framework.
We recognize the connection between attentive awareness and reasoning. A reasoning process is understood as a sequence of obvious or interpretable steps, either deductive, inductive, or abductive, to derive a less obvious conclusion. From the aspect of awareness, reasoning requires computation to be self-attentive or self-aware during processing, in a way different from fitting by a black box. Therefore, interpretability must be one of the properties of reasoning. Taking knowledge base completion (KBC) tasks as an example, many embedding-based models [8, 9, 10, 11, 12, 13] do a really good job in link prediction, but their lack of interpretation makes it hard to argue for their reasoning ability. People who aim at knowledge graph reasoning mainly focus on path-based models using RL [14, 15, 16, 17] or logic-like methods [18, 19] to explicitly model a reasoning process and provide interpretations beyond predictions. Here, instead, we apply a flow-based attention mechanism, proposed in [20], as an alternative to RL for learning composition structure. By flowing, attention can propagate to cover a broader scope and increase the chance to hit a target. It maintains an end-to-end differentiable style, in contrast to the way RL agents learn to choose a discrete action.
Other crucial properties of reasoning include relational inductive biases and iterative processing. Therefore, GNNs [6, 7] are a better choice than RNNs for encoding structured knowledge explicitly. Compared with the majority of previous GNN literature, which focuses on the computation side and makes neural-based architectures more composable and complex, we bring a cognitive insight to it under the notion of attentive awareness. Specifically, we design an attention flow layer to chain attention operations directly with transition matrices, parallel to the message-passing pipeline, so that it gets less entangled with representation computation. This gives our model the ability to select edges step by step during computation and attend to a query-dependent subgraph, making a sharper prediction due to the disentanglement. These extracted subgraphs can reduce the computation cost greatly. In practice, we find our model can be applied to very large graphs with millions of nodes, such as the YAGO3-10 dataset, even running on a single laptop.
Our contributions are threefold: (1) We propose an attentive computation framework on graphs, combining GNNs’ representation power with an explicit reasoning pattern, motivated by the cognitive notion of attentive awareness. (2) We exploit query-dependent subgraph structure, extracted by an attention flow mechanism, to address two shortcomings of most GNN implementations: the complexity and the non-context-aware aggregation schema. (3) We design a specific architecture for KBC tasks and demonstrate our model’s strong reasoning capability compared to the state of the art, showing that a compact query-dependent subgraph is better than a path as a reasoning pattern.
2 Related Work
KBC and knowledge graph reasoning. Early work for KBC, including TransE [8] and its analogues [21, 22, 23], DistMult [9], ConvE [10] and ComplEx [11], focuses on learning embeddings of entities and relations. Some recent work in this line [12, 13] achieves high accuracy, yet is unable to explicitly deal with compositional relationships that are crucial for reasoning. Another line aims to learn inference paths [14, 24, 25, 26, 27, 28] for knowledge graph reasoning, such as DeepPath [15], MINERVA [16], and M-Walk [17], using RL to learn multi-hop relational paths over a graph towards a target given a query. However, these approaches, based on policy gradients or Monte Carlo tree search, often suffer from low sample efficiency and sparse rewards, requiring a large number of rollouts or simulations as well as sophisticated reward function design. Other efforts include learning soft logical rules [18, 19] or compositional programs [29] to reason over knowledge graphs.
Relational reasoning by GNNs and attention mechanisms. Relational reasoning is regarded as the key component of humans’ capacity for combinatorial generalization, taking the form of entity- and relation-centric organization to reason about the composition structure of the world [30, 31, 32, 33, 34]. A multitude of recent implementations [7] encode relational inductive biases into neural networks to exploit graph-structured representation, including graph convolution networks (GCNs) [35, 36, 37, 38, 39, 40, 41, 42] and graph neural networks [6, 43, 44, 45, 46], and overcome the difficulty that traditional deep learning models have in achieving relational reasoning. These approaches have been widely applied to accomplishing real-world reasoning tasks (such as physical reasoning [45, 47, 48, 49, 50, 51], visual reasoning [44, 51, 52, 53, 54], textual reasoning [44, 55, 56], knowledge graph reasoning [41, 57, 58], multi-agent relationship reasoning [59, 60], and chemical reasoning [46]), solving algorithmic problems (such as program verification [43, 61]
[62, 63, 64], state transitions [65], and Boolean satisfiability [66]), or facilitating reinforcement learning with the structured reasoning or planning ability [67, 68, 49, 50, 69, 70, 71]. Variants of GNN architectures have been developed with different focuses. Relation networks [44] use a simple but effective neural module to equip deep learning models with the relational reasoning ability, and their recurrent versions [55, 56] do multi-step relational inference for long periods; interaction networks [45] provide a general-purpose learnable physics engine, and two of their variants are visual interaction networks [51], learning directly from raw visual data, and vertex attention interaction networks [60], with an attention mechanism; message passing neural networks [46] unify various GCNs and GNNs into a general message passing formalism by analogy to the one in graphical models.
Despite the strong representation power of GNNs, recent work points out drawbacks that limit their capability. The vanilla message passing or neighborhood aggregation schema cannot adapt to strongly diverse local subgraph structure, causing performance degeneration when applying a deeper version or running more iterations [72], since a walk of more steps might drift away from the local neighborhood with information washed out via averaging. It is suggested that covariance rather than invariance to permutations of nodes and edges is preferable [73], since being fully invariant by summing or averaging messages may worsen the representation power, lacking steerability. In this context, our model expresses permutation invariance under a constrained compositional transformation according to the group of possible permutations within each extracted query-dependent subgraph rather than the underlying full graph. Another drawback is the heavy computation complexity. GNNs are notorious for their poor scalability due to their quadratic complexity in the number of nodes when graphs are fully connected. Even scaling linearly with the number of edges by exploiting structure sparsity can still cause trouble on very large graphs, making selective or attentive computation on graphs desirable.
Neighborhood attention operations can alleviate some limitations on GNNs’ representation power by assigning different weights to different nodes or nodes’ features [74, 60, 53, 75]. These approaches often use multi-head self-attention to focus on specific interactions with neighbors when aggregating messages, inspired by [76, 77, 78], originally designed for capturing long-range dependencies. We notice that most graph-based attention mechanisms attend over the neighborhood in a single-hop fashion, and [60] claims that the multi-hop architecture does not help in experiments, though they expect multiple hops to offer the potential to model high-order interaction. However, a flow-based design of attention in [20] shows a promising way to characterize long-distance dependencies over graphs, breaking the isolation of attention operations and stringing them in chronological order by transition matrices, like the spread of a random walk, parallel to the message-passing pipeline.
It is natural to extend relational reasoning to graph structure inference or graph generation, such as reasoning about a latent interaction graph explicitly to acquire knowledge of observed dynamics [48], or learning generative models of graphs [79, 80, 81, 82]. Soft plus hard attention mechanisms may be a better alternative to probabilistic models, which are hard to train with latent discrete variables or might degrade multi-step predictions due to the inaccuracy (biased gradients) of backpropagation.
3 NeuCFlow Model
3.1 Attentive computation framework
We extend Bengio’s consciousness prior to graph-structured representation. Conscious thoughts are modeled by a few selected nodes and their edges, forming a context-aware subgraph, cohesive with sharper semantics, disentangled from the full graph. The underlying full graph forms the initial representation, entangled but rich, to help shape potential high-level subgraphs. We use attention flow to navigate conscious thoughts, capturing a step-by-step reasoning pattern. The attentive computation framework, as illustrated in Figure 1, consists of: (1) an unconsciousness flow (UFlow) layer, (2) a consciousness flow (CFlow) layer, and (3) an attention flow (AFlow) layer, with four guidelines to design a specific implementation as follows:

UFlow corresponds to a low-level computation graph for full state representation learning.

CFlow contains high-level disentangled subgraphs for context-aware representation learning.

AFlow is conditioned by both UFlow and CFlow, and also motivates CFlow but not UFlow.

Information can be accessed by CFlow from UFlow with the help of AFlow.
3.2 Model architecture design for knowledge graph reasoning
We choose KBC tasks to do KG reasoning. We let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote a KG, where $\mathcal{V}$ is a set of nodes (or entities) and $\mathcal{E}$ is a set of edges (or relations). A KG is viewed as a directed graph with each edge represented by a triple $(h, r, t)$, where $h$ is the head entity, $t$ is the tail entity, and $r$ is their relation type. The aim of a KBC task is to predict potential unknown links, i.e., which entity is likely to be the tail given a query $(h, r, ?)$ with the head and the relation type specified.
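As a minimal sketch of this setup, a KG can be held as a list of (head, relation, tail) triples indexed by head entity. The entity and relation names below are invented for illustration, and the helper names `build_kg` and `known_tails` are hypothetical, not taken from the authors' code.

```python
from collections import defaultdict

def build_kg(triples):
    """Index directed edges (head, relation, tail) by their head entity."""
    out_edges = defaultdict(list)
    for head, rel, tail in triples:
        out_edges[head].append((rel, tail))
    return out_edges

def known_tails(out_edges, head, rel):
    """Tails already linked to `head` by `rel` in the KG.

    A KBC model has to rank candidate tails for the query (head, rel, ?),
    including tails that are not yet stored as edges.
    """
    return sorted(tail for r, tail in out_edges.get(head, []) if r == rel)

# Toy triples, invented for illustration only.
triples = [
    ("alice", "works_for", "acme"),
    ("alice", "lives_in", "paris"),
    ("acme", "located_in", "paris"),
]
kg = build_kg(triples)
```

Here `known_tails(kg, "alice", "works_for")` returns the stored answers for one query; link prediction asks which further entities are plausible tails.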
The model architecture has three core components as shown in Figure 2. We here use the term "component" instead of "layer" to differentiate our flow layers from the layers normally referred to in neural networks, as each flow layer is more like a block containing many neural network layers.
UFlow component. We implement this component over the full graph using the standard message passing mechanism [46]. If the graph has an extremely large number of edges, we sample a subset of edges, , randomly each step when running message passing. For each batch of input queries, we let the representation computed by the UFlow component be shared across these different queries, which means UFlow is query-independent, with its state representation tensors containing no batch dimension, so that its complexity does not scale with the batch size and the saved computation resources can be allocated to sampling more edges. In UFlow, each node has a learnable embedding and a dynamical state for step , called unconscious node states, where the initial for all . Each edge type also has a learnable embedding , and edge can produce a message, denoted by , at step . The UFlow component includes:
Message function: , where .

Message aggregation: , where .

Node state update function: , where .
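The three functions listed above can be sketched in plain Python as one message-passing step. This is an illustration under stated assumptions, not the authors' implementation: the message and update functions are toy stand-ins for the MLPs, node embeddings are omitted, the sum of messages is divided by the square root of the number of incoming sampled messages, and the state is updated by residual addition. The names `uflow_step`, `toy_message`, and `toy_update` are hypothetical.

```python
import math
import random

def uflow_step(states, rel_emb, edges, message_fn, update_fn, max_edges=None, seed=0):
    """Run one message-passing step and return the new node states.

    states:  dict node -> list[float] (node states before the step)
    rel_emb: dict relation -> list[float] (relation embeddings)
    edges:   list of (src, rel, dst) triples
    """
    if max_edges is not None and len(edges) > max_edges:
        # Uniform random edge sampling for very large graphs.
        edges = random.Random(seed).sample(edges, max_edges)

    dim = len(next(iter(states.values())))
    summed = {v: [0.0] * dim for v in states}
    n_msgs = {v: 0 for v in states}
    for src, rel, dst in edges:
        msg = message_fn(states[src], rel_emb[rel], states[dst])
        summed[dst] = [a + b for a, b in zip(summed[dst], msg)]
        n_msgs[dst] += 1

    new_states = {}
    for v, h in states.items():
        # Divide the sum by sqrt(#messages) rather than by #messages.
        scale = math.sqrt(n_msgs[v]) if n_msgs[v] else 1.0
        agg = [x / scale for x in summed[v]]
        delta = update_fn(h, agg)
        new_states[v] = [a + b for a, b in zip(h, delta)]  # residual update
    return new_states


# Toy stand-ins for the MLPs (assumptions, chosen to keep the demo readable).
def toy_message(h_src, e_rel, h_dst):
    return [a + b for a, b in zip(h_src, e_rel)]

def toy_update(h, agg):
    return agg

states = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
rel_emb = {"r": [0.0, 0.0]}
edges = [("a", "r", "c"), ("b", "r", "c")]
new_states = uflow_step(states, rel_emb, edges, toy_message, toy_update)
```

In this toy run node `c` receives two messages, so its aggregated message is the sum scaled by 1/sqrt(2), while nodes without incoming messages keep their states.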
We compute messages only for the sampled edges, , each step. Functions and are implemented by a two-layer MLP (using for the first layer and for the second layer) with input arguments concatenated respectively. Messages are aggregated by dividing the sum by the square root of the number of sampled neighbors that send messages, preserving the scale of variance. We use a residual addition to update each node state instead of a GRU or an LSTM. After running UFlow for steps, we return a pooling result or simply the last, , to feed into downstream components.
CFlow component. CFlow is query-dependent, which means that conscious node states, denoted by , have a batch dimension representing different input queries, making the complexity scale with the batch size. However, as CFlow uses attentive message passing, running on small local subgraphs each conditioned by a query, we leverage the sparsity to record only for the visited nodes . For example, when , for query , we start from node , with being a singleton, and thus record only. When computing messages, denoted by , in CFlow, we use a sampling-attending procedure, explained in Section 3.3, to further control the number of computed edges. The CFlow component has:

Message function: , where , and .

Message aggregation: , where .

Node state attending function: , where and .

Node state update function: , where .
CFlow and UFlow share the embeddings . A query is represented by its head and relation embeddings, and , participating in computing messages and updating node states. We here select a subset of edges, , rather than sampling, according to edges between the attended nodes at step and the seen nodes at step , defined in Section 3.3, as shown in Figure 3. We introduce the node state attending function to pass an unconscious state to CFlow adjusted by a scalar attention and a learnable matrix . We initialize for , treating the rest as zero states.
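The sparse bookkeeping described for CFlow, where states are recorded only for visited nodes of each query and all other nodes are treated as zero states, can be sketched with a dictionary keyed by (query, node). The class name `SparseNodeStates` and the numeric values are hypothetical and serve only as an illustration.

```python
class SparseNodeStates:
    """Per-query node states stored only for visited nodes; others read as zeros."""

    def __init__(self, dim):
        self.dim = dim
        self._states = {}  # (query_id, node) -> list[float]

    def get(self, query_id, node):
        # Unvisited nodes are treated as zero states and take no storage.
        return self._states.get((query_id, node), [0.0] * self.dim)

    def set(self, query_id, node, state):
        if len(state) != self.dim:
            raise ValueError("state has wrong dimension")
        self._states[(query_id, node)] = list(state)

    def visited(self, query_id):
        return {n for (q, n) in self._states if q == query_id}


store = SparseNodeStates(dim=3)
# Each query starts from its own head node (illustrative values).
store.set(0, "head_of_query_0", [0.1, 0.2, 0.3])
store.set(1, "head_of_query_1", [0.5, 0.5, 0.5])
```

Memory then grows with the number of visited nodes per query rather than with the batch size times the number of nodes in the full graph.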
AFlow component. Attention flow is represented by a series of probability distributions changing across steps, denoted as . The initial distribution is a one-hot vector with . To spread attention, we need to compute transition matrices each step. Given that AFlow is conditioned by both UFlow and CFlow, we model the transition from to by two types of interaction: conscious-to-conscious, , and conscious-to-unconscious, . The former favors previously visited nodes, while the latter is useful to attend to unseen nodes. where and , and and are two learnable matrices. Each MLP uses one single layer with the activation. To reduce the complexity for computing , we select attended nodes, , which is the set of nodes with the k-largest attention, and then sample from neighbors as next nodes. Then, we compute a sparse according to edges . Because the attended nodes may not carry all attention, a small amount of attention can be lost during transition, causing the total amount to decrease. Therefore, we use a renormalized version, . We use the final attention on the tail as the probability for prediction to compute the training objective, as shown in Figure 2.
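One attention-flow step can be illustrated as follows: keep the top-k attended nodes, push their attention along outgoing edges with a sparse transition, and renormalize the result because attention left on non-attended nodes is lost. The sketch assumes a softmax over each attended node's outgoing edges as the transition and takes an arbitrary `score` function in place of the learned interaction terms; both are assumptions for illustration.

```python
import math

def top_k(attention, k):
    """Nodes with the k largest attention values (ties broken by node name)."""
    ranked = sorted(attention.items(), key=lambda kv: (-kv[1], kv[0]))
    return [node for node, _ in ranked[:k]]

def attention_step(attention, out_edges, score, k):
    """Move attention one step from the top-k attended nodes, then renormalize.

    attention: dict node -> probability
    out_edges: dict node -> list of next nodes
    score:     function (src, dst) -> float logit (stand-in for the learned terms)
    """
    new_attention = {}
    for src in top_k(attention, k):
        nexts = out_edges.get(src, [])
        if not nexts:
            continue
        weights = [math.exp(score(src, dst)) for dst in nexts]
        total = sum(weights)
        for dst, w in zip(nexts, weights):
            # Assumed transition: softmax over the out-edges of src.
            new_attention[dst] = new_attention.get(dst, 0.0) + attention[src] * w / total
    mass = sum(new_attention.values())
    # Renormalize so the attention is again a probability distribution.
    return {n: a / mass for n, a in new_attention.items()} if mass > 0 else {}


out_edges = {"a": ["x"], "b": ["x", "y"], "c": ["z"]}
attention = {"a": 0.5, "b": 0.3, "c": 0.2}
flat_score = lambda src, dst: 0.0  # uniform transition, for the demo only
next_attention = attention_step(attention, out_edges, flat_score, k=2)
```

With k = 2, node `c` is dropped and 20% of the attention is lost before renormalization, which is why the result is rescaled to sum to one.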
3.3 Complexity reduction by iterative sampling and attending
Previously, we used edge sampling, in a globally and uniformly random manner, to address the complexity issue in UFlow, where we are not concerned about the batch size. Here, we need to confront the complexity that scales with the batch size in CFlow. Suppose that we run a normal message passing for steps on a KG with nodes and edges for a batch of queries. Then, the complexity is where represents the number of representation dimensions. The complexity can be reduced to by using edge sampling. is a small positive integer, often less than . is normally between and , and being too small for would lead to underfitting. In UFlow, we have , while in CFlow, let us say . Then, to maintain the same complexity as UFlow, we have to reduce the sampling rate by a factor of on each query. However, UFlow’s edge sampling procedure is for the full graph, and it is inappropriate to apply it to CFlow on each query due to the reduced sampling rate. Also, when becomes as large as , we need to consider decreasing .
The good news is that CFlow deals with a local subgraph for each query so that we only record a few selected nodes, called visited nodes, denoted by . We can see that is much less than . The initial , when , contains only one node , and then is enlarged each step by adding new nodes during spreading. When propagating messages, we only care about the one-step neighborhood each step. However, the spreading goes so rapidly that after only a few steps it covers almost all nodes, causing the number of computed edges to increase dramatically. The key to addressing the problem is to constrain the scope of nodes we jump from each step, i.e., the core nodes that determine where we can go based on where we depart from. We call them attended nodes, which are in charge of the attending-from horizon, selected by based on the current attention . Given the set of attended nodes, we still need edge sampling over their neighborhoods in case of a hub node of extremely high degree. Here, we face the tricky problem of making a trade-off between the coverage and the complexity when sampling over the neighborhoods. Also, we need to maintain coherent context-aware node states and avoid possible noise or drifting away caused by sampling neighbors randomly. Therefore, we introduce an attending-to horizon inside the sampling horizon. We compute AFlow over the sampling horizon with a smaller dimension to compute the attention, in exchange for sampling more neighbors to increase the coverage. Based on the newly computed attention , we select a smaller subset of nodes, , to receive messages in CFlow, called seen nodes, in charge of the attending-to horizon. The next attending-from horizon is chosen by , a sub-horizon of the current attending-to horizon. All seen and attended nodes are stored as visited nodes along steps. We illustrate this sampling-attending procedure in Figure 3.
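The bookkeeping of the three horizons can be sketched as below. It is a simplified illustration, not the authors' code: attention is spread uniformly over sampled edges as a stand-in for AFlow, CFlow edges are taken to be the sampled edges that end in a seen node, and the attention handed to the next step is restricted to the seen nodes and renormalized. All function and parameter names are hypothetical.

```python
import random

def top_k(attention, k):
    """Nodes with the k largest attention values (ties broken by node name)."""
    ranked = sorted(attention.items(), key=lambda kv: (-kv[1], kv[0]))
    return [node for node, _ in ranked[:k]]

def uniform_spread(attention, edges):
    """Stand-in for AFlow: split each source's attention evenly over its edges."""
    out_deg = {}
    for src, _ in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
    spread = {}
    for src, dst in edges:
        spread[dst] = spread.get(dst, 0.0) + attention[src] / out_deg[src]
    return spread

def sampling_attending_step(attention, out_edges, visited, max_attended,
                            max_sampled_per_node, max_seen, rng):
    # Attending-from horizon: nodes we are allowed to jump from.
    attended = top_k(attention, max_attended)
    # Sampling horizon: cap the neighbors sampled per attended node.
    sampled = []
    for src in attended:
        nbrs = out_edges.get(src, [])
        if len(nbrs) > max_sampled_per_node:
            nbrs = rng.sample(nbrs, max_sampled_per_node)
        sampled.extend((src, dst) for dst in nbrs)
    spread = uniform_spread(attention, sampled)
    # Attending-to horizon: seen nodes that will receive messages in CFlow.
    seen = top_k(spread, max_seen)
    seen_set = set(seen)
    cflow_edges = [(s, d) for s, d in sampled if d in seen_set]
    visited.update(attended)
    visited.update(seen)
    mass = sum(spread[n] for n in seen)
    next_attention = {n: spread[n] / mass for n in seen} if mass > 0 else {}
    return next_attention, cflow_edges


out_edges = {"h": ["a", "b", "c"], "a": ["d"], "b": ["d", "e"]}
visited = set()
att, cflow_edges = sampling_attending_step(
    {"h": 1.0}, out_edges, visited,
    max_attended=1, max_sampled_per_node=3, max_seen=2, rng=random.Random(0))
```

Because the next attention lives only on seen nodes, the next set of attended nodes is automatically a sub-horizon of the current attending-to horizon.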
To compute our reduced complexity, we let be the maximum number of sampled edges per attended node per step, the maximum number of seen nodes per step, and the maximum number of attended nodes per step. We also denote the dimension number used in AFlow as . For one batch, the complexity of CFlow is for the worst case, where attended and seen nodes are fully connected, and in most cases, where is a small constant. The complexity of AFlow is where is much smaller than .
4 Experiments
4.1 Datasets and experimental settings


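Dataset  #Entities  #Rels  #Train  #Valid  #Test  PME (tr)  PME (te)  AvgD (te)
FB15K  14,951  1,345  483,142  50,000  59,071  81.2%  80.9%  1.22
FB15K-237  14,541  237  272,115  17,535  20,466  38.0%  0%  2.25
WN18  40,943  18  141,442  5,000  5,000  93.1%  94.0%  1.18
WN18RR  40,943  11  86,835  3,034  3,134  34.5%  35.0%  2.87
NELL-995  74,536  200  149,678  543  2,818  100%  41.0%  2.06
YAGO3-10  123,188  37  1,079,040  5,000  5,000  56.4%  56.0%  1.75

Table 1: Statistics of the six KG datasets. PME (tr) and PME (te) are the proportions of multi-edge triples in train and test; AvgD (te) is the average length of the shortest paths between heads and tails in test.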




Metric (%)  FB15K-237 H@1  FB15K-237 H@3  FB15K-237 H@10  FB15K-237 MRR  WN18RR H@1  WN18RR H@3  WN18RR H@10  WN18RR MRR
TransE []  -  -  46.5  29.4  -  -  50.1  22.6
DistMult []  15.5  26.3  41.9  24.1  39  44  49  43
DistMult []  20.6 (.4)  31.8 (.2)  -  29.0 (.2)  38.4 (.4)  42.4 (.3)  -  41.3 (.3)
ComplEx []  15.8  27.5  42.8  24.7  41  46  51  44
ComplEx []  20.8 (.2)  32.6 (.5)  -  29.6 (.2)  38.5 (.3)  43.9 (.3)  -  42.2 (.2)
ConvE []  23.7  35.6  50.1  32.5  40  44  52  43
ConvE []  23.3 (.4)  33.8 (.3)  -  30.8 (.2)  39.6 (.3)  44.7 (.2)  -  43.3 (.2)
RotatE []  24.1  37.5  53.3  33.8  42.8  49.2  57.1  47.6
NeuralLP []  18.2 (.6)  27.2 (.3)  -  24.9 (.2)  37.2 (.1)  43.4 (.1)  -  43.5 (.1)
MINERVA []  14.1 (.2)  23.2 (.4)  -  20.5 (.3)  35.1 (.1)  44.5 (.4)  -  40.9 (.1)
MINERVA []  -  -  45.6  -  41.3  45.6  51.3  -
M-Walk []  16.5 (.3)  24.3 (.2)  -  23.2 (.2)  41.4 (.1)  44.5 (.2)  -  43.7 (.1)
NeuCFlow  28.6 (.1)  40.3 (.1)  53.0 (.3)  36.9 (.1)  44.4 (.4)  49.7 (.8)  55.8 (.5)  48.2 (.5)

Table 2: Comparison results on FB15K-237 and WN18RR. Some collected results only have a metric score while some including ours take the form of "mean (standard deviation)". A dash marks a result that was not reported.
Datasets. We evaluate our model using six large KG datasets (https://github.com/netpaladinx/NeuCFlow/tree/master/data): FB15K, FB15K-237, WN18, WN18RR, NELL-995, and YAGO3-10. FB15K-237 [84] is sampled from FB15K [8] with redundant relations removed, and WN18RR [10] is a subset of WN18 [8] with the triples that cause test leakage removed. Thus, they are both considered more challenging. NELL-995 [15] has separate datasets for 12 query relations, each corresponding to a single-query-relation KBC task. YAGO3-10 [85] contains the largest KG, with millions of edges. Their statistics are shown in Table 1. We find some statistical differences between train and test. In a KG with all training triples as its edges, a triple is considered a multi-edge triple if the KG contains other triples that also connect its head and tail, ignoring the direction. We notice that FB15K-237 is a special case compared with the others, as no edges in its KG directly link the head and tail of any test triple. Therefore, when using training triples as queries to train our model on FB15K-237, we cut off from the KG all triples connecting the head-tail pairs in the given batch, ignoring relation types and edge directions, forcing the model to learn a composite reasoning pattern rather than a single-hop pattern. For the remaining datasets, we only remove the triples of the batch and their inverses from the KG before training on this batch.
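The two batch-dependent filtering rules can be sketched as a single helper. This is a hedged illustration: the function names are hypothetical, and the `_inv` suffix used to name inverse relations is an assumption made for the example.

```python
def inverse(rel):
    """Name of the inverse relation (the "_inv" suffix is an assumed convention)."""
    return rel[:-4] if rel.endswith("_inv") else rel + "_inv"

def edges_for_batch(kg_triples, batch, cut_all_head_tail_links=False):
    """Return the KG edges that may be used while training on `batch`.

    kg_triples, batch: lists of (head, relation, tail) triples.
    """
    if cut_all_head_tail_links:
        # FB15K-237-style rule: drop every edge between a batch head-tail
        # pair, whatever its relation type or direction.
        pairs = {frozenset((h, t)) for h, _, t in batch}
        return [(h, r, t) for h, r, t in kg_triples
                if frozenset((h, t)) not in pairs]
    # Default rule: drop only the batch triples and their inverses.
    banned = set(batch) | {(t, inverse(r), h) for h, r, t in batch}
    return [triple for triple in kg_triples if triple not in banned]


kg_triples = [
    ("a", "r1", "b"), ("b", "r1_inv", "a"),
    ("a", "r2", "b"), ("b", "r3", "c"),
]
batch = [("a", "r1", "b")]
default_edges = edges_for_batch(kg_triples, batch)
strict_edges = edges_for_batch(kg_triples, batch, cut_all_head_tail_links=True)
```

Under the default rule the parallel edge `("a", "r2", "b")` survives and offers a single-hop shortcut; the stricter rule removes it, so the answer must be reached through other nodes.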
Experimental settings. We use the same data split protocol as in many papers [10, 15, 16]. We create a KG, a directed graph, consisting of all train triples and their inverses for each dataset except NELL-995, since it already includes reciprocal relations. In addition, every node in the KG has a self-loop edge. We also add inverse relations to the validation and test sets to evaluate both directions. For evaluation metrics, we use HITS@1,3,10 and the mean reciprocal rank (MRR) in the filtered setting for FB15K-237, WN18RR, FB15K, WN18, and YAGO3-10, and use the mean average precision (MAP) for NELL-995’s single-query-relation KBC tasks. For NELL-995, we follow the same evaluation procedure as in [15, 16, 17], ranking the answer entities against the negative examples given in their experiments. We run our experiments using a TITAN X (Pascal) GPU with 12 GB of memory and an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz. Our code is written in Python based on TensorFlow 2.0 and NumPy 1.16.
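For reference, the filtered ranking metrics can be computed as in the sketch below. It assumes that a higher score is better and that ties are resolved in favor of the answer; both choices, as well as the function names, are assumptions of this illustration rather than details taken from the paper.

```python
def filtered_rank(scores, answer, known_answers):
    """Rank of `answer` after filtering out other known true answers.

    scores:        dict entity -> score (higher is better)
    known_answers: set of entities known to be correct for this query
    """
    target = scores[answer]
    better = sum(1 for entity, s in scores.items()
                 if s > target and entity != answer and entity not in known_answers)
    return better + 1

def hits_at_k(ranks, k):
    """Fraction of queries whose answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)


scores = {"x": 0.9, "y": 0.8, "z": 0.7, "w": 0.1}
rank_filtered = filtered_rank(scores, "z", known_answers={"x", "z"})
rank_raw = filtered_rank(scores, "z", known_answers=set())
ranks = [1, 2, 4]
```

In the toy example, entity `x` outranks the answer `z` but is itself a known correct answer, so the filtered rank is better than the raw rank.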
4.2 Baselines and comparison results



Metric (%)  FB15K H@1  FB15K H@3  FB15K H@10  FB15K MRR  WN18 H@1  WN18 H@3  WN18 H@10  WN18 MRR
TransE []  29.7  57.8  74.9  46.3  11.3  88.8  94.3  49.5
HolE []  40.2  61.3  73.9  52.4  93.0  94.5  94.9  93.8
DistMult []  54.6  73.3  82.4  65.4  72.8  91.4  93.6  82.2
ComplEx []  59.9  75.9  84.0  69.2  93.6  93.6  94.7  94.1
ConvE []  55.8  72.3  83.1  65.7  93.5  94.6  95.6  94.3
RotatE []  74.6  83.0  88.4  79.7  94.4  95.2  95.9  94.9
NeuralLP []  -  -  83.7  76  -  -  94.5  94
NeuCFlow  72.6 (.4)  78.4 (.4)  83.4 (.5)  76.4 (.4)  91.6 (.8)  93.6 (.4)  94.9 (.4)  92.8 (.6)

Table 3: Comparison results on FB15K and WN18. A dash marks a result that was not reported.




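Metric (%)  YAGO3-10 H@1  YAGO3-10 H@3  YAGO3-10 H@10  YAGO3-10 MRR
DistMult []  24  38  54  34
ComplEx []  26  40  55  36
ConvE []  35  49  62  44
ComplEx-N3 []  -  -  71  58
NeuCFlow  48.4  59.5  67.9  55.3

Table 4: Comparison results on YAGO3-10. A dash marks a result that was not reported.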




Tasks  NeuCFlow  M-Walk  MINERVA  DeepPath  TransE  TransR
AthletePlaysForTeam  83.9 (0.5)  84.7 (1.3)  82.7 (0.8)  72.1 (1.2)  62.7  67.3
AthletePlaysInLeague  97.5 (0.1)  97.8 (0.2)  95.2 (0.8)  92.7 (5.3)  77.3  91.2
AthleteHomeStadium  93.6 (0.1)  91.9 (0.1)  92.8 (0.1)  84.6 (0.8)  71.8  72.2
AthletePlaysSport  98.6 (0.0)  98.3 (0.1)  98.6 (0.1)  91.7 (4.1)  87.6  96.3
TeamPlaysSport  90.4 (0.4)  88.4 (1.8)  87.5 (0.5)  69.6 (6.7)  76.1  81.4
OrgHeadQuarteredInCity  94.7 (0.3)  95.0 (0.7)  94.5 (0.3)  79.0 (0.0)  62.0  65.7
WorksFor  86.8 (0.0)  84.2 (0.6)  82.7 (0.5)  69.9 (0.3)  67.7  69.2
PersonBornInLocation  84.1 (0.5)  81.2 (0.0)  78.2 (0.0)  75.5 (0.5)  71.2  81.2
PersonLeadsOrg  88.4 (0.1)  88.8 (0.5)  83.0 (2.6)  79.0 (1.0)  75.1  77.2
OrgHiredPerson  84.7 (0.8)  88.8 (0.6)  87.0 (0.3)  73.8 (1.9)  71.9  73.7
AgentBelongsToOrg  89.3 (1.2)  -  -  -  -  -
TeamPlaysInLeague  97.2 (0.3)  -  -  -  -  -

Table 5: Comparison results of MAP scores (%) on NELL-995’s single-query-relation KBC tasks. A dash marks a result that was not reported.

Baselines. We compare our model against embedding-based approaches, including TransE [8], TransR [22], DistMult [9], ConvE [10], ComplEx [11], HolE [86], RotatE [12], and ComplEx-N3 [13]; path-based approaches that use RL methods, including DeepPath [15], MINERVA [16], and M-Walk [17]; and NeuralLP [19], which uses learned neural logic. For all the baselines, we quote the results from the corresponding papers instead of rerunning them. For our method, we run the experiments three times in each hyperparameter setting on each dataset to report the means and standard deviations of the results. We put the details of our hyperparameter settings in the appendix.
Comparison results and analysis. We first report the comparison on FB15K-237 and WN18RR in Table 2. NeuCFlow has a surprisingly good result, outperforming all the compared methods in HITS@1,3 and MRR on both datasets. Compared to the best baseline, RotatE, published very recently, we only lose a few points in HITS@10 but gain a lot in HITS@1,3 and MRR. Based on the observation that NeuCFlow gains a larger advantage when k in HITS@k gets smaller, we speculate that the reasoning ability acquired by NeuCFlow is to make a sharper prediction by exploiting graph-structured composition locally and conditionally, in contrast to embedding-based methods, which rely entirely on vectorized representation. When a target becomes too vague to predict, reasoning may lose its great advantage, though it remains very competitive. However, path-based baselines, with a certain ability to do KG reasoning, perform worse than we expect. We argue that it is inappropriate to think of reasoning, a sequential decision process, as a sequence of nodes, i.e., a path, in KGs. The average length of the shortest paths between heads and tails in the test set in a KG, as shown in Table 1, suggests an extremely short path, making the motivation for using a path pattern almost pointless. The iterative reasoning pattern should be characterized in the form of dynamically varying local graph-structured patterns, holding a bunch of nodes resonating with each other to produce a decision collectively. Then, we run our model on larger KGs, including FB15K, WN18, and YAGO3-10, and summarize the comparison in Tables 3 and 4, where NeuCFlow beats most well-known baselines and achieves a very competitive position against the best state-of-the-art methods. Moreover, we summarize the comparison on NELL-995’s tasks in Table 5. NeuCFlow performs the best on five tasks, and is also very competitive on the rest against M-Walk, the best path-based method as far as we know.
We find no reporting on the last two tasks from the corresponding papers.
4.3 Experimental analysis
Convergence analysis. During training we find that NeuCFlow converges surprisingly fast. We may use half of the training examples to get the model well trained and generalize it to the test set, sometimes producing an even better metric score than training for a full epoch, as shown in Figure 4(A). Compared with the less expensive computation of embedding-based models, our model takes a large number of edges to compute for each input query and consumes more time on one batch, but it does not need a second epoch or even to take all training triples as queries in one epoch, thus saving a lot of training time. The reason may be that all queries come directly from the KG’s edge set, and some of them have probably been exploited many times to construct subgraphs during the training of other queries, so that we might not have to train the model on each query explicitly as long as we have other ways to exploit them.
Component analysis. If we do not run UFlow, then the unconscious state is just the initial embedding of node , and we can still run CFlow as usual. We want to know whether the UFlow component is actually useful. Considering that long-distance message passing might bring in less informative features, we compare running UFlow for two steps against shutting it down entirely. The result in Figure 4(B) shows that UFlow brings a small gain in each metric on WN18RR.
Horizon analysis. The sampling, attending and searching horizons determine how large an area the flow can spread over. They impact the computation complexity as well as the performance of the model to different degrees, depending on the properties of a dataset. Intuitively, enlarging the probe scope by sampling more, attending more, or searching longer may increase the chance to hit a target. However, the experimental results in Figure 4(C)(D) show that this is not always the case. In Figure 4(E), we can see that increasing the maximum number of attending-from nodes, i.e., attended nodes, per step is more important, but our GPU cannot accommodate the additional intermediate data produced during computation for a larger number, which raises a ResourceExhaustedError. Figure 4(F) shows that the number of CFlow steps cannot be as small as two.
Attention flow analysis. If attention flow can really capture the way we reason about the world, its process should follow a diverging-converging thinking pattern. Intuitively, first, for the diverging thinking, we search and collect as many ideas as we can; then, for the converging thinking, we try to concentrate our thoughts on one point. To check whether the attention flow has such a pattern, we measure the average entropy of attention distributions along steps and also the proportion of attention concentrated at the top-1, 3, 5 attended nodes. As we expect, attention is indeed more focused at the final step as well as at the beginning.
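The two diagnostics used in this analysis, the entropy of an attention distribution and the attention mass held by the top-k nodes, can be computed as in the short sketch below. The function names are hypothetical and the natural logarithm is an assumed choice of base.

```python
import math

def attention_entropy(attention):
    """Shannon entropy (in nats) of a distribution given as node -> probability."""
    return -sum(p * math.log(p) for p in attention.values() if p > 0)

def top_k_mass(attention, k):
    """Total attention held by the k most attended nodes."""
    return sum(sorted(attention.values(), reverse=True)[:k])


one_hot = {"h": 1.0}                              # fully focused attention
uniform = {n: 0.25 for n in ("a", "b", "c", "d")}  # fully spread attention
skewed = {"a": 0.5, "b": 0.3, "c": 0.2}
```

A diverging-converging pattern shows up as low entropy at the first step, higher entropy in the middle steps, and low entropy again at the final step, with the top-k mass moving in the opposite direction.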
Time cost analysis. The time cost is affected not only by the scale of a dataset but also by the horizon setting. For each dataset, we list the training time for one epoch under the standard hyperparameter settings in the appendix. Note that there is always a trade-off between complexity and performance. We thus study whether we can substantially reduce the time cost at the price of a small loss in performance. We plot the one-epoch training time in Figure 6(A)-(D), using the same settings as in the horizon analysis. We can see that Max-attended-nodes-per-step and #Steps-of-CFlow affect the training time significantly, while Max-sampled-edges-per-node and Max-seen-nodes-per-step affect it only slightly. Therefore, we can use smaller values of Max-sampled-edges-per-node and Max-seen-nodes-per-step in order to afford a larger batch size, making the computation more efficient, as shown in Figure 6(E).
4.4 Visualization
To further demonstrate the reasoning ability acquired by our model, we show some visualization results of the extracted subgraphs on NELL995’s test data for 12 separate tasks. We avoid using the training data in order to show the generalization of our model’s learned reasoning ability on knowledge graphs. Here, we show the visualization result for the AthletePlaysForTeam task. The rest can be found in the appendix.
For the AthletePlaysForTeam task
In the above case, the query is (concept_personnorthamerica_michael_turner, concept:athleteplaysforteam, ?) and the desired answer is concept_sportsteam_falcons. From Figure 7, we can see that our model learns that (concept_personnorthamerica_michael_turner, concept:athletehomestadium, concept_stadiumoreventvenue_georgia_dome) and (concept_stadiumoreventvenue_georgia_dome, concept:teamhomestadium_inv, concept_sportsteam_falcons) are two important facts supporting the answer concept_sportsteam_falcons. In addition, other facts, such as (concept_athlete_joey_harrington, concept:athletehomestadium, concept_stadiumoreventvenue_georgia_dome) and (concept_athlete_joey_harrington, concept:athleteplaysforteam, concept_sportsteam_falcons), provide a vivid example that an athlete whose home stadium is concept_stadiumoreventvenue_georgia_dome might play for the team concept_sportsteam_falcons. There is more than one such example, including concept_athlete_roddy_white and concept_athlete_quarterback_matt_ryan. The entity concept_sportsleague_nfl cannot help us differentiate the true answer from other NFL teams, but it can at least exclude non-NFL teams. In short, our subgraph-structured representation captures the relational and compositional reasoning pattern well.
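The compositional pattern in this case can be reproduced on the four triples quoted above. The sketch below enumerates short paths explicitly with a breadth-first search; this is only an illustration of the supporting two-hop evidence, not how our model scores answers, and the function find_paths is a hypothetical helper.

```python
from collections import defaultdict

# The four facts quoted in the text, as (head, relation, tail) triples.
TRIPLES = [
    ("concept_personnorthamerica_michael_turner", "concept:athletehomestadium",
     "concept_stadiumoreventvenue_georgia_dome"),
    ("concept_stadiumoreventvenue_georgia_dome", "concept:teamhomestadium_inv",
     "concept_sportsteam_falcons"),
    ("concept_athlete_joey_harrington", "concept:athletehomestadium",
     "concept_stadiumoreventvenue_georgia_dome"),
    ("concept_athlete_joey_harrington", "concept:athleteplaysforteam",
     "concept_sportsteam_falcons"),
]


def find_paths(triples, head, tail, max_hops):
    """Enumerate directed paths from head to tail with at most max_hops edges."""
    out = defaultdict(list)
    for h, r, t in triples:
        out[h].append((r, t))
    paths = []
    frontier = [(head, [])]
    for _ in range(max_hops):
        next_frontier = []
        for node, path in frontier:
            for r, t in out[node]:
                new_path = path + [(node, r, t)]
                if t == tail:
                    paths.append(new_path)
                else:
                    next_frontier.append((t, new_path))
        frontier = next_frontier
    return paths


paths = find_paths(
    TRIPLES,
    head="concept_personnorthamerica_michael_turner",
    tail="concept_sportsteam_falcons",
    max_hops=2,
)
for path in paths:
    print(" -> ".join(relation for _, relation, _ in path))
```

On these triples the only path found composes concept:athletehomestadium with concept:teamhomestadium_inv, which is the same two-fact chain highlighted in the visualization.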
5 Conclusion
We introduce an attentive message passing mechanism on graphs under the notion of attentive awareness, inspired by the phenomenon of consciousness, to model the iterative compositional reasoning pattern by forming a compact query-dependent subgraph. We propose an attentive computation framework with three flow-based layers that combines the representation power of GNNs with an explicit reasoning process, and further reduces the complexity of applying GNNs to large-scale graphs. It is worth mentioning that our framework is not limited to knowledge graph reasoning, but applies more widely to large-scale graph-based computation in which only a few input-dependent nodes and edges are involved each time.
References
 [1] Yoshua Bengio. The consciousness prior. CoRR, abs/1709.08568, 2017.
 [2] Stanislas Dehaene, Michel Kerszberg, and Jean-Pierre Changeux. A neuronal model of a global workspace in effortful cognitive tasks. Proceedings of the National Academy of Sciences of the United States of America, 95(24):14529–34, 1998.
 [3] Giulio Tononi, Mélanie Boly, Marcello Massimini, and Christof Koch. Integrated information theory: from consciousness to its physical substrate. Nature Reviews Neuroscience, 17:450–461, 2016.
 [4] David Rosenthal and Josh Weisberg. Higher-order theories of consciousness. Scholarpedia, 3:4407, 2008.
 [5] Robert Van Gulick. Higher-order global states (HOGS): an alternative higher-order model. Higher-order theories of consciousness, pages 67–93, 2004.
 [6] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20:61–80, 2009.
 [7] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018.
 [8] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, 2013.
 [9] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. CoRR, abs/1412.6575, 2015.
 [10] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. In AAAI, 2018.
 [11] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, 2016.
 [12] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. Rotate: Knowledge graph embedding by relational rotation in complex space. CoRR, abs/1902.10197, 2018.
 [13] Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. In ICML, 2018.
 [14] Ni Lao, Tom Michael Mitchell, and William W. Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, 2011.
 [15] Wenhan Xiong, Thien Hoang, and William Yang Wang. Deeppath: A reinforcement learning method for knowledge graph reasoning. In EMNLP, 2017.
 [16] Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alexander J. Smola, and Andrew McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. CoRR, abs/1711.05851, 2018.
 [17] Yelong Shen, Jianshu Chen, Pu Huang, Yuqing Guo, and Jianfeng Gao. M-walk: Learning to walk over graphs using monte carlo tree search. In NeurIPS, 2018.
 [18] William W. Cohen. Tensorlog: A differentiable deductive database. CoRR, abs/1605.06523, 2016.
 [19] Fan Yang, Zhilin Yang, and William W. Cohen. Differentiable learning of logical rules for knowledge base reasoning. In NIPS, 2017.
 [20] Xiaoran Xu, Songpeng Zu, Chengliang Gao, Yuan Zhang, and Wei Feng. Modeling attention flow on graphs. CoRR, abs/1811.00497, 2018.
 [21] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, 2014.
 [22] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, 2015.
 [23] Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jian Zhao. Knowledge graph embedding via dynamic mapping matrix. In ACL, 2015.
 [24] Matt Gardner, Partha Pratim Talukdar, Jayant Krishnamurthy, and Tom Michael Mitchell. Incorporating vector space similarity in random walk inference over knowledge bases. In EMNLP, 2014.
 [25] Kelvin Guu, John Miller, and Percy S. Liang. Traversing knowledge graphs in vector space. In EMNLP, 2015.
 [26] Yankai Lin, Zhiyuan Liu, and Maosong Sun. Modeling relation paths for representation learning of knowledge bases. In EMNLP, 2015.
 [27] Kristina Toutanova, Victoria Lin, Wen-tau Yih, Hoifung Poon, and Chris Quirk. Compositional learning of embeddings for relation paths in knowledge base and text. In ACL, 2016.
 [28] Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. Chains of reasoning over entities, relations, and text using recurrent neural networks. In EACL, 2017.
 [29] Chen Liang, Jonathan Berant, Quoc V. Le, Kenneth D. Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. In ACL, 2016.
 [30] Kenneth H. Craik. The nature of explanation. 1952.
 [31] John R. Anderson. Acquisition of cognitive skill. 1982.
 [32] Dedre Gentner and Arthur B. Markman. Structure mapping in analogy and similarity. 1997.
 [33] John E. Hummel and Keith J. Holyoak. A symbolic-connectionist theory of relational inference and generalization. Psychological review, 110(2):220–64, 2003.
 [34] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. The Behavioral and brain sciences, 40:e253, 2017.
 [35] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2014.
 [36] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. CoRR, abs/1506.05163, 2015.
 [37] David K. Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
 [38] Steven M. Kearnes, Kevin McCloskey, Marc Berndl, Vijay S. Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.
 [39] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
 [40] Mathias Niepert, Mohammed Hassan Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML, 2016.
 [41] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2017.
 [42] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34:18–42, 2017.
 [43] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks. CoRR, abs/1511.05493, 2016.
 [44] Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Timothy P. Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017.
 [45] Peter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In NIPS, 2016.
 [46] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In ICML, 2017.
 [47] Michael Chang, Tomer Ullman, Antonio Torralba, and Joshua B. Tenenbaum. A compositional object-based approach to learning physical dynamics. CoRR, abs/1612.00341, 2017.
 [48] Thomas N. Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard S. Zemel. Neural relational inference for interacting systems. In ICML, 2018.
 [49] Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin A. Riedmiller, Raia Hadsell, and Peter W. Battaglia. Graph networks as learnable physics engines for inference and control. In ICML, 2018.
 [50] Jessica B. Hamrick, Kelsey R. Allen, Victor Bapst, Tina Zhu, Kevin R. McKee, Joshua B. Tenenbaum, and Peter W. Battaglia. Relational inductive bias for physical construction in humans and machines. CoRR, abs/1806.01203, 2018.
 [51] Nicholas Watters, Daniel Zoran, Théophane Weber, Peter W. Battaglia, Razvan Pascanu, and Andrea Tacchetti. Visual interaction networks: Learning a physics simulator from video. In NIPS, 2017.
 [52] David Raposo, Adam Santoro, David G. T. Barrett, Razvan Pascanu, Timothy P. Lillicrap, and Peter W. Battaglia. Discovering objects and their relations from entangled scene representations. CoRR, abs/1702.05068, 2017.
 [53] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
 [54] Xinlei Chen, Li-Jia Li, Li Fei-Fei, and Abhinav Gupta. Iterative visual reasoning beyond convolutions. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7239–7248, 2018.
 [55] Adam Santoro, Ryan Faulkner, David Raposo, Jack W. Rae, Mike Chrzanowski, Théophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy P. Lillicrap. Relational recurrent neural networks. In NeurIPS, 2018.
 [56] Rasmus Berg Palm, Ulrich Paquet, and Ole Winther. Recurrent relational networks. In NeurIPS, 2018.
 [57] Daniel Oñoro-Rubio, Mathias Niepert, Alberto García-Durán, Roberto Gonzalez, and Roberto Javier López-Sastre. Representation learning for visual-relational knowledge graphs. CoRR, abs/1709.02314, 2017.
 [58] Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach. 2017.
 [59] Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In NIPS, 2016.
 [60] Yedid Hoshen. Vain: Attentional multiagent predictive modeling. In NIPS, 2017.
 [61] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. CoRR, abs/1711.00740, 2018.
 [62] Irwan Bello, Hieu Quang Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. CoRR, abs/1611.09940, 2017.
 [63] Alex Nowak, Soledad Villar, Afonso S. Bandeira, and Joan Bruna. A note on learning algorithms for quadratic assignment with graph neural networks. CoRR, abs/1706.07450, 2017.
 [64] Elias Boutros Khalil, Hanjun Dai, Yuyu Zhang, Bistra N. Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In NIPS, 2017.
 [65] Daniel D. Johnson. Learning graphical state transitions. In ICLR, 2017.
 [66] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy S. Liang, Leonardo de Moura, and David L. Dill. Learning a sat solver from single-bit supervision. CoRR, abs/1802.03685, 2018.
 [67] Jessica B. Hamrick, Andrew J. Ballard, Razvan Pascanu, Oriol Vinyals, Nicolas Heess, and Peter W. Battaglia. Metacontrol for adaptive imagination-based optimization. CoRR, abs/1705.02670, 2017.
 [68] Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, Lars Buesing, Sébastien Racanière, David P. Reichert, Théophane Weber, Daan Wierstra, and Peter W. Battaglia. Learning model-based planning from scratch. CoRR, abs/1707.06170, 2017.
 [69] Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. Nervenet: Learning structured policy with graph neural networks. In ICLR, 2018.
 [70] Vinícius Flores Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David P. Reichert, Timothy P. Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter W. Battaglia. Relational deep reinforcement learning. CoRR, abs/1806.01830, 2018.
 [71] Sam Toyer, Felipe W. Trevizan, Sylvie Thiébaux, and Lexing Xie. Action schema networks: Generalised policies with deep learning. In AAAI, 2018.
 [72] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.
 [73] Risi Kondor, Hy Truong Son, Horace Pan, Brandon M. Anderson, and Shubhendu Trivedi. Covariant compositional networks for learning graphs. CoRR, abs/1801.02144, 2018.
 [74] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Alejandro Romero, Pietro Lió, and Yoshua Bengio. Graph attention networks. CoRR, abs/1710.10903, 2018.
 [75] Wouter Kool. Attention solves your TSP, approximately. 2018.
 [76] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2015.
 [77] Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. CoRR, abs/1703.03130, 2017.
 [78] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
 [79] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter W. Battaglia. Learning deep generative models of graphs. CoRR, abs/1803.03324, 2018.
 [80] Nicola De Cao and Thomas Kipf. Molgan: An implicit generative model for small molecular graphs. CoRR, abs/1805.11973, 2018.
 [81] Jiaxuan You, Zhitao Ying, Xiang Ren, William L. Hamilton, and Jure Leskovec. Graphrnn: Generating realistic graphs with deep autoregressive models. In ICML, 2018.
 [82] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. Netgan: Generating graphs via random walks. In ICML, 2018.
 [83] Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Q. Phung. A novel embedding model for knowledge base completion based on convolutional neural network. In NAACL-HLT, 2018.
 [84] Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference. 2015.
 [85] Farzaneh Mahdisoltani, Joanna Asia Biega, and Fabian M. Suchanek. Yago3: A knowledge base from multilingual wikipedias. In CIDR, 2014.
 [86] Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. Holographic embeddings of knowledge graphs. In AAAI, 2016.
6 Appendix
6.1 Hyperparameter settings


Hyperparameter  FB15K237  FB15K  WN18RR  WN18  YAGO310  NELL995 
batch_size  80  80  100  100  100  10 
n_dims_att  50  50  50  50  50  200 
n_dims  100  100  100  100  100  200 
max_sampled_edges_per_step  10000  10000  10000  10000  10000  10000 
max_attended_nodes_per_step  20  20  20  20  20  100 
max_sampled_edges_per_node  200  200  200  200  200  1000 
max_seen_nodes_per_step  200  200  200  200  200  1000 
n_steps_of_u_flow  2  1  2  1  1  1 
n_steps_of_c_flow  6  6  8  8  6  5 
learning_rate  0.001  0.001  0.001  0.001  0.0001  0.001 
optimizer  Adam  Adam  Adam  Adam  Adam  Adam 
grad_clipnorm  1  1  1  1  1  1 
n_epochs  1  1  1  1  1  3 
Training time per epoch (h)  25.7  63.7  4.3  8.5  185.0  0.12 

Our hyperparameters can be categorized into three groups:

The normal hyperparameters, including batch_size, n_dims_att, n_dims, learning_rate, grad_clipnorm, and n_epochs. Here, we set a smaller dimension, n_dims_att, for the attention flow computation, because it uses more edges than the message passing in the consciousness flow layer does. Intuitively, it also does not need to propagate high-dimensional messages but only to compute a scalar score for each of the sampled neighbor nodes, in line with the idea behind the key-value mechanism [1]. We set n_epochs = 1 in most cases, since our model needs to be trained for only one epoch due to its fast convergence.

The hyperparameters that control the sampling-attending horizon. In the unconsciousness flow layer, max_sampled_edges_per_step sets the maximum number of edges sampled per step per query for message passing. In the consciousness flow layer, max_sampled_edges_per_node sets the maximum number of edges sampled from each current node per step per query, max_attended_nodes_per_step sets the maximum number of current nodes to attend from per step per query, and max_seen_nodes_per_step sets the maximum number of neighbor nodes to attend to per step per query.

The hyperparameters that control the searching horizon, including n_steps_of_u_flow, the number of steps to run the unconsciousness flow, and n_steps_of_c_flow, the number of steps to run the consciousness flow.
Note that we choose these hyperparameters based not only on their performance but also on the computational resources available to us. In some cases, to deal with a very large knowledge graph with limited resources, we need to make a trade-off between efficiency and effectiveness. For example, each of NELL995’s single-query-relation tasks has a small training set, though still with a large graph, so we can reduce the batch size in favor of larger dimensions and a larger sampling-attending horizon without worrying about waiting too long to finish one epoch.
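The three groups can be written down as a small configuration structure. The sketch below is only one way to organize them: the class names NormalParams, HorizonParams, SearchParams and Config are hypothetical, while the field names and values are copied from the FB15K237 and NELL995 columns of the table above (the optimizer row is omitted because the text does not assign it to a group).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalParams:
    batch_size: int
    n_dims_att: int      # smaller dimension used by the attention flow
    n_dims: int
    learning_rate: float
    grad_clipnorm: float
    n_epochs: int


@dataclass(frozen=True)
class HorizonParams:
    max_sampled_edges_per_step: int    # unconsciousness flow layer
    max_sampled_edges_per_node: int    # consciousness flow layer
    max_attended_nodes_per_step: int   # consciousness flow layer
    max_seen_nodes_per_step: int       # consciousness flow layer


@dataclass(frozen=True)
class SearchParams:
    n_steps_of_u_flow: int
    n_steps_of_c_flow: int


@dataclass(frozen=True)
class Config:
    normal: NormalParams
    horizon: HorizonParams
    search: SearchParams


FB15K237_CONFIG = Config(
    normal=NormalParams(80, 50, 100, 0.001, 1.0, 1),
    horizon=HorizonParams(10000, 200, 20, 200),
    search=SearchParams(2, 6),
)

NELL995_CONFIG = Config(
    normal=NormalParams(10, 200, 200, 0.001, 1.0, 3),
    horizon=HorizonParams(10000, 1000, 100, 1000),
    search=SearchParams(1, 5),
)

print(FB15K237_CONFIG.horizon.max_attended_nodes_per_step,
      NELL995_CONFIG.horizon.max_attended_nodes_per_step)
```

Comparing the two instances shows the trade-off described above: the NELL995 setting uses a smaller batch size but larger dimensions and a larger sampling-attending horizon.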
6.2 Other experimental analysis
6.3 Other visualization
For the AthletePlaysInLeague task
For the AthleteHomeStadium task
For the AthletePlaysSport task
For the TeamPlaysSport task
For the OrganizationHeadQuarteredInCity task
For the WorksFor task
For the PersonBornInLocation task
For the PersonLeadsOrganization task
For the OrganizationHiredPerson task
For the AgentBelongsToOrganization task
For the TeamPlaysInLeague task