Networks model interactions between entities such as humans (Leskovec et al., 2014), genes (Wang et al., 2006), and publications (Tang et al., 2008). Networks with node or edge content information are known as attributed networks. For example, in an attributed web network, nodes are attributed with full website content and edges are attributed with mention contexts (the sentences encompassing the website mentions). A variety of graph mining tasks on attributed networks have emerged as popular research topics, such as graph embedding (Gao and Huang, 2018; Yang et al., 2015; Huang et al., 2017; Liao et al., 2018), community detection and clustering (Perozzi et al., 2014a; Falih et al., 2018), classification (Yang et al., 2016; Kipf and Welling, 2016; Velickovic et al., 2017), and NLP (Göttler, 1982). In this paper, we focus on the problem of semi-supervised node classification on attributed graphs with both node and edge contents.
Definition 1.1.
Semi-supervised Node Classification: Given an attributed graph $G = (V, E)$, where the node set $V$ contains a small subset of labeled nodes and the remaining nodes are unlabeled, and where $X$ and $Z$ denote the attributes of nodes and edges in $G$, respectively, the goal is to infer the labels of the unlabeled nodes based on the available but limited node labels, learning from both the graph content and the structure information.
The main solutions to this problem fall into two modes: unsupervised embedding + classifier, and semi-supervised learning on graphs. The approaches in the first branch apply a classifier to embeddings of graph nodes learned using methods like Node2Vec (Grover and Leskovec, 2016), DeepWalk (Perozzi et al., 2014b), or TADW (Yang et al., 2015). The algorithms in the second branch learn directly from the graphs, e.g., methods for non-attributed graphs (label propagation (Zhu and Ghahramani, 2002) and label spreading (Zhou et al., 2004)) and attributed graph embedding (GCN (Kipf and Welling, 2016), Planetoid (Yang et al., 2016), DGM (Akujuobi et al., 2018), and (Velickovic et al., 2017; Thekumparampil et al., 2018)). The core ideas behind these approaches are to 1) jointly learn from the graph structure and the node attributes (most of them are not designed to include edge contents); and 2) aggregate the content of neighboring nodes at different levels of relevance, from immediate neighbors to neighbors $k$ hops away. One limitation of these approaches is the performance degradation caused by the noisy information from an exponentially increasing number of expanded neighborhood members (Zhou et al., 2018), even though considering high-order structures in graphs can be beneficial for some graph-based problems (Lee et al., 2018b; Rossi et al., 2018b, a). Another issue is the high computational cost, especially in memory, caused by the exponentially increasing number of expanded neighbors.
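As a minimal sketch of the first mode (unsupervised embedding + classifier), the toy example below fits a nearest-centroid classifier on precomputed node embeddings. The 2-D embeddings and labels are hypothetical stand-ins for the output of a method like Node2Vec or DeepWalk, and the centroid classifier is an illustrative stand-in for the SVM used in our experiments.

```python
import numpy as np

def nearest_centroid_fit(embeddings, labels):
    """Compute one centroid per class from labeled node embeddings."""
    classes = sorted(set(labels))
    return {c: embeddings[[i for i, y in enumerate(labels) if y == c]].mean(axis=0)
            for c in classes}

def nearest_centroid_predict(centroids, x):
    """Assign the class whose centroid is closest to embedding x."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Hypothetical 2-D embeddings for six labeled nodes (stand-in for Node2Vec output).
emb = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
lab = [0, 0, 0, 1, 1, 1]
cents = nearest_centroid_fit(emb, lab)
print(nearest_centroid_predict(cents, np.array([0.05, 0.05])))  # falls in class-0 region
```

Note that the embedding is learned without label supervision; only this second, separate classifier sees the labels, which is exactly the decoupling the two-step mode implies.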
Furthermore, most of the previously proposed semi-supervised methods are transductive, and thus cannot handle situations where new nodes are observed and inserted into the graph. However, deriving embeddings and conducting classification inductively for new, unseen nodes is in high demand in real-world settings, e.g., classifying a newly published paper or website. Inductive approaches also facilitate generalization across attributed graphs with similar feature spaces (Hamilton et al., 2017; Yang et al., 2016). It is thus desirable to design approaches that are flexible for both transductive and inductive settings.
To reduce the scope of neighbors to be evaluated in the semi-supervised node classification problem while maintaining the inductive property, we propose a recurrent attention framework that learns to explore neighborhoods. In this way we guide neighborhood exploration to better serve the goal of node classification, compared to a purely random walk. We pose the learning-to-walk task as a partially observable Markov decision process (POMDP) problem and attack it with reinforcement learning.
To summarize, we address the node classification problem by letting an agent make recurrent decisions on the next nodes to visit in its walk on the graph. This process can be considered a recurrent attention-based walk. Therefore, we call our proposed model Recurrent Attention Walk (RAW). Compared with other popular semi-supervised graph-based node classification approaches, RAW has the following advantages:
RAW uses a recurrent-attention strategy, while attention-based node classification approaches like GAT (Velickovic et al., 2017) and AGNN (Thekumparampil et al., 2018) are based on a self-attention strategy, which accesses high-order neighbors by iteratively aggregating one-hop neighbors. By contrast, our recurrent-attention strategy learns how to walk and thus can find a walk path well tuned for classifying the target nodes, thereby minimizing the noisy information obtained.
RAW is thus more efficient than GCN (Kipf and Welling, 2016) and GAT-like approaches in memory cost, because RAW reduces the number of nodes to aggregate per hop.
RAW is usable in both transductive and inductive settings. We perform extensive experiments on large real-world datasets. The results show that RAW achieves superior performance, especially in the inductive setting.
The walking path generated by RAW can be used to interpret the decision making process and infer class label dependency, as shown in our case studies.
2. Previous Work
In general, solutions for the studied problem (as defined in the previous section) target minimizing a loss of the form

$$\mathcal{L} = \mathcal{L}_{sup} + \lambda\, \mathcal{L}_{reg},$$

where $\mathcal{L}_{sup}$ is the supervised loss function and $\mathcal{L}_{reg}$ is the regularizer. The regularizer penalizes a model for assigning different labels to similar nodes $u$ and $v$, i.e., nodes that are close on the graph and have similar content.
Zhu and Ghahramani (Zhu and Ghahramani, 2002) proposed a transductive label propagation model following the theoretical framework of Gaussian Random Fields to classify nodes in a nearest-neighbor graph of a semi-supervised data set. Some other works follow a two-step solution by first learning node embeddings with unsupervised methods (Perozzi et al., 2014b; Grover and Leskovec, 2016; Tang et al., 2015), and then building classifiers on the learned node embeddings to infer the unknown labels. Since the embedding is learned in an unsupervised way, it is general enough to be deployed across different tasks (e.g., clustering and link prediction). However, it is not tailored to node classification.[1]
[1] We are aware of a large body of work related to our study in graph embedding. In this section, we focus on the most relevant work solving node classification in semi-supervised learning. A comprehensive discussion of other unsupervised graph embedding methods for both plain and attributed graphs can be found in (Cai et al., 2018).
Recent years have witnessed a new trend of research on node classification that focuses on conducting semi-supervised learning on graphs. Yang et al. (Yang et al., 2016) proposed a node embedding method to jointly predict the neighborhood context and labels of graph nodes. Kipf and Welling (Kipf and Welling, 2016) proposed the use of graph convolutional networks (GCN) for graph-based semi-supervised learning. Zhuang and Ma (Zhuang and Ma, 2018) extended the idea of GCN by considering global and local consistency. Akujuobi et al. (Akujuobi et al., 2018) studied the use of deep generative models for graph-based semi-supervised learning. Hamilton et al. (Hamilton et al., 2017) proposed GraphSAGE, an inductive method that computes a node representation by applying an aggregation function over a fixed-size sample of node neighbors.
In general, few of the above-discussed approaches attentively select the relevant neighboring nodes. The relevance of all neighboring nodes may be implicitly encoded in the aggregation procedure. However, acting on all neighboring nodes without prior preference introduces noisy information, due to the exponentially increasing number of nodes as the exploration range of the neighborhood extends. To suppress the potential impact of noisy information when aggregating node neighbors, we propose an attention-based reinforcement learning method for node classification. Next, we survey the use of attention mechanisms and of reinforcement learning on graph-based problems.
2.1. Attention-based Node Classification
We can consider selecting the relevant neighboring nodes to visit from the perspective of attention mechanisms. Introducing an attention mechanism allows models to focus on the relevant areas of graphs for a given learning task, such as node classification (Lee et al., 2018a). Abu-El-Haija et al. (Abu-El-Haija et al., 2017) extend DeepWalk by using attention to guide the random walk. Thekumparampil et al. (Thekumparampil et al., 2018) introduced attention to the GCN propagation layers to assign more weight to relevant neighbors of each node. Velickovic et al. (Velickovic et al., 2017) extend the idea of GraphSAGE by introducing attention in the node neighbor sampling. Note that the attention neighborhood per node in the papers mentioned above consists of the nodes one hop away from a given node. Our model removes this restriction and thus can achieve better graph exploration. Also, most of these methods do not scale well on large graphs with non-sparse feature vectors as node attributes (i.e., continuous vectors). Furthermore, all these attention models share a self-attention strategy: hidden states of each node are computed by attending to its neighbors. Thus, by stacking more layers (i.e., $L$ layers), the nodes aggregate information from neighbors up to $L$ hops away. We consider a recurrent-attention strategy, where hidden states of each node are computed by enforcing attention on a recurrent walk on the graph. This strategy reduces the number of nodes to be considered per hop and thereby minimizes the noisy information obtained. It also enables us to evaluate which nodes are more useful based on the information already gathered from previous hops, and which areas of the graph to explore.
2.2. Reinforcement on Graph-Structured Data
Several works have studied the application of reinforcement learning to graph-structured data. Hoshen (Hoshen, 2017) applied soft attention on the matrix of pair-wise interactions between game agents to select information from relevant agents. Jiang et al. (Jiang et al., 2018) introduced a graph convolutional reinforcement learning method to learn multi-agent cooperation. Xiong et al. (Xiong et al., 2017) proposed a model for finding multi-hop relation paths in knowledge graphs. None of these models are designed to select the optimal movement trajectory (path) for node classification.
The GAM (Graph Attention Model) proposed by Lee et al. (Lee et al., 2018c) is an RNN model for graph classification (not node classification) through attention on the graph structural composition. Graph classification differs from node classification in its prediction goal. The GAM model therefore cannot be applied to node classification: the embedding learned for graph classification is based on recurrent attention on nodes with random starting nodes, and it is not designed to encode a linear combination of the node embeddings. Secondly, the GAM method evaluates the graph label prediction iteratively at each step, which is not feasible for node classification in large graphs. GAM also assumes that the types (labels) of all nodes are known, which does not hold in the setting of semi-supervised node classification.
3.1. Model Description
We model the sequential decision making of which node to visit next with Recurrent Neural Networks (RNNs) (Cho et al., 2014) to capture the recurrent dependency in the walk path on the graph. Sequential decision making describes a situation where the decision maker takes actions upon successive observations, and the choice of action depends on the expected benefit that can potentially be gained in the future. Given this setting, the Markov Decision Process (MDP) provides an appropriate formulation of the sequential decision making problem. Nevertheless, exploration of the walk path in an attributed graph violates the Markov property, which requires the observations of the agent at each step to be rich enough to distinguish the states of the agent from one another. In the walk over the graph, observing only the attributes and the neighbors of the current node is not enough to capture all topological information. Therefore, the neighborhood exploration task becomes a Partially Observable MDP (POMDP) problem. To attack this issue, we encode the past history of walk paths with an RNN to augment the state representation of the agent, which facilitates the process of policy learning.
As illustrated in Figure 1, the proposed RAW is composed of three networks: the core network, the score network, and the classification network. With the small example of an attributed graph in Figure 1(a), the whole process can be explained as follows. At the current time $t$, the agent is at node $v_t$ and decides which node to visit at time $t+1$. In the left of Figure 1(b), the score network takes as input the previous history $h_{t-1}$, the current node attribute $x_t$, and the attributes of the current neighborhood observation of the agent, which include the attributes of the immediate node neighbors and edges. The job of the score network is to generate a score for each neighboring node. The generated score, in the range [0, 1], denotes the relevance of a neighbor to the given node. After the relevance scores are normalized, the next node to visit is sampled from the neighbors in proportion to their relevance. The core network takes over after the relevance scores are generated. By selectively aggregating the embeddings of neighboring nodes based on the score network, an immediate neighborhood representation $\bar{x}_t$ is formed (see Section 3.1.2 for more details). The core network takes as input the neighborhood aggregation $\bar{x}_t$, the previous history $h_{t-1}$, and the current node embedding $x_t$, and outputs the current walk history $h_t$. This process leads the agent to $v_{t+1}$ at time $t+1$, and repeats, moving to $v_{t+2}$ at time $t+2$, and so on. After a fixed number of steps $T$, the final vector $h_T$, summarizing the information obtained from the graph walk, is passed to the classification network for the label prediction of the starting node. See Algorithm 1 for more details.
RAW is also applicable in the inductive setting, where the walk policy is learned based on the nodes available in the graph. Given a new node added to the graph, the agent initiates a walk from the new unlabeled node guided by the learned policy, and finally uses the resulting walk history $h_T$ for classification.
3.1.1. Information Flow
The information flow in RAW has been described above as a sequential decision process, formulated as a POMDP. At time $t$, the agent, which can only observe the one-hop neighbors of the current node, cannot capture the complete topological information of the large graph. Formulating the task as a POMDP allows for a careful treatment of this incomplete-observation problem, which is necessary in our case.
To address the uncertainty of observation, we augment the observation by integrating the information from the previous walk path. This information is encoded recurrently by the RNN and updated as the agent traverses the graph.
At each step, the agent takes an action based on its observation, including the previous history $h_{t-1}$, the current node attribute $x_t$, and the attributes of its immediate node and edge neighbors, transiting to the next node $v_{t+1}$. The history acts as a summary of the previous observations in the graph walk; combined with the current observation, the history is updated by the core network, which has a GRU at its core and is formulated as:
$$z_t = \sigma\big(W_z\,[h_{t-1};\, x_t;\, \bar{x}_t] + b_z\big)$$
$$r_t = \sigma\big(W_r\,[h_{t-1};\, x_t;\, \bar{x}_t] + b_r\big)$$
$$\tilde{h}_t = \tanh\big(W_h\,[r_t \odot h_{t-1};\, x_t;\, \bar{x}_t] + b_h\big)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $\odot$ and $[\,;\,]$ denote element-wise multiplication and vector concatenation, respectively. The variable $z_t$ is the update gate, which determines the amount of past information to overwrite; $r_t$ is the reset gate, which decides the amount of past information used to compute the new memory content; $\tilde{h}_t$ is the current memory content; and $h_t$ is the output vector containing information from the current unit and previous units. The matrices $W_z$, $W_r$, and $W_h$ are the weights; $x_t$ is the attribute of the current node; $\bar{x}_t$ is the aggregated attribute of the relevant current-node neighbors (see Section 3.1.2); and $b_z$, $b_r$, and $b_h$ are the bias vectors.
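The gate equations above can be sketched as a single core-network step, assuming a standard GRU over the concatenated history, node attribute, and aggregated neighborhood attribute; the dimensions and random weights below are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_core(h_prev, x_t, x_bar_t, W_z, W_r, W_h, b_z, b_r, b_h):
    """One step of the core network: a GRU over [h_{t-1}; x_t; x_bar_t]."""
    inp = np.concatenate([x_t, x_bar_t])
    z = sigmoid(W_z @ np.concatenate([h_prev, inp]) + b_z)            # update gate
    r = sigmoid(W_r @ np.concatenate([h_prev, inp]) + b_r)            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, inp]) + b_h)  # candidate memory
    return (1.0 - z) * h_prev + z * h_tilde                           # new history h_t

rng = np.random.default_rng(1)
D, F = 3, 4  # hidden size and attribute size (hypothetical)
params = [rng.normal(size=(D, D + 2 * F)) * 0.1 for _ in range(3)]
biases = [np.zeros(D) for _ in range(3)]
h1 = gru_core(np.zeros(D), rng.normal(size=F), rng.normal(size=F),
              *params, *biases)
print(h1.shape)
```

In the real model these weights are learned jointly with the score and classification networks; the sketch only shows how one history update is wired.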
At the end of the walk ($t = T$), the core network produces $h_T$, the embedding of the full trajectory started from the target node. To classify the target node, $h_T$ is given to the classification network, modeled as a 2-layer neural network, to predict the class label.
The agent is expected to take actions that choose the most relevant nodes to visit, and finally collect sufficient information for classifying the target node. Therefore, we determine the next node to select as an action based on the output $s_t$ of the score network. This output is a measure of relevance between node $v_t$ and its neighbors, and thus is useful for deciding which of the neighboring nodes are relevant to the current node $v_t$. The score $s_t$ is used for the next-node selection, and also serves the history aggregation update.
The score network is modeled using a sigmoid activation function, so the values in $s_t$ lie between 0 and 1 for each neighboring node. For the sake of better exploration, a stochastic policy is adopted: the next node to visit is sampled from the categorical distribution obtained by normalizing $s_t$:

$$P(v_{t+1} = i) = \frac{s_{t,i}}{\sum_{j} s_{t,j}}$$
Then the aggregation of relevant neighboring nodes is conducted as:

$$\bar{x}_t = \sum_{i \in N(v_t)} \mathbb{1}\left(s_{t,i}\right)\, s_{t,i}\, x_i$$

where $N(v_t)$ is the set of nodes in the one-hop neighborhood of the current node $v_t$, $x_i$ is the node attribute of node $i$ in the set, and $s_{t,i}$ is the relevance score of $i$. The indicator function $\mathbb{1}(\cdot)$ outputs 1 when its argument is positive and 0 otherwise.
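The indicator-masked aggregation can be sketched as follows; the score values and two-dimensional neighbor attributes are hypothetical, and the weighting follows the description above (neighbors with zero relevance contribute nothing).

```python
import numpy as np

def aggregate(scores, nbr_attrs):
    """Relevance-weighted aggregation over one-hop neighbors.

    Neighbors with a positive relevance score contribute their attributes
    weighted by that score; the rest are masked out by the indicator.
    """
    mask = (scores > 0).astype(float)   # indicator: 1 if score is positive
    return (mask * scores) @ nbr_attrs  # sum_i 1(s_i) * s_i * x_i

scores = np.array([0.9, 0.0, 0.4])
nbr_attrs = np.array([[1.0, 0.0],
                      [5.0, 5.0],      # masked out: zero relevance
                      [0.0, 1.0]])
agg = aggregate(scores, nbr_attrs)
print(agg)  # -> [0.9 0.4]
```

The masked neighbor's large attribute vector is ignored entirely, which is the point of scoring before aggregating: noisy neighbors do not leak into $\bar{x}_t$.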
In our model, the performance of a graph walk path (trajectory) is measured at the end, like evaluating whether a student passes or fails a course in the final exam after a semester of study. Specifically, the agent receives an immediate reward only at the last step $T$: the reward is positive if the label prediction at the end ($t = T$) is correct and non-positive otherwise. The goal of the agent is to take actions with a large reward-to-go. This reward encourages the agent to explore nodes on the graph that improve the final predictive performance.
The setting of the walk length $T$ is application dependent. A large $T$ allows for long-run exploration but increases the computational cost, while a small $T$ limits the knowledge to aggregate. We provide a sensitivity analysis of $T$ in the experimental section.
The final target of our model is to classify an unknown node. Given a trained model, the agent starts from the unknown node, follows the policy to traverse the graph, and assigns a label to the given node via the classification network at the end of the graph walk. To fulfill this goal, the model must learn a good walk policy and classification network. We conduct the training process in a semi-supervised manner, integrating both labeled and unlabeled nodes efficiently.
We augmented the observation to tackle the partial-observation problem. To reflect the POMDP property, we use $o_{1:t}$ to represent the partial observations along the path until time $t$; in our study, the augmented observation $h_t$ acts as a summary of $o_{1:t}$. We would like to train the policy to learn the mapping from the observation space to the action space. Since the policy takes the history from previous transitions as part of its input, training the policy will in fact yield an improved core network that provides better history embeddings, and a score network with more accurate score generation. Therefore, we train the parameters of these networks together for the policy. The policy objective is the expected future reward over graph walk paths following the current policy, $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right]$.
However, computing the objective function is difficult in practice: the expectation over the joint probability distribution of walk paths is hard to measure. Therefore, adopting the log-derivative trick to change the gradient of the expectation into the expectation of the gradient, the REINFORCE algorithm for POMDPs in (Williams, 1992) takes gradients of the objective as follows:

$$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{t=1}^{T-1} \gamma^{\,T-1-t}\; R^m\; \nabla_\theta \log \pi_\theta\big(v_{t+1}^m \mid h_t^m\big)$$

The $\tau^m$'s are the roll-out sequences obtained from running the agent for $M$ episodes, and $\gamma$ is a discount factor that gives more preference to actions performed closer to the time the final prediction is made (i.e., $t = T$). $R^m$ is the reward-to-go of episode $m$. We only adjust the log-probabilities for steps $t < T$, since there is no choice of next node to visit at the final time $T$.
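The gradient estimate can be sketched as below, assuming the per-step log-probability gradients and terminal rewards-to-go have already been collected; all values here are hypothetical, and later actions (closer to the final prediction) are discounted less, matching the discounting described above.

```python
import numpy as np

def reinforce_gradient(log_prob_grads, rewards_to_go, gamma=0.99):
    """REINFORCE estimate averaged over M episodes.

    log_prob_grads[m][t] is the gradient of log pi(v_{t+1} | h_t) at decision
    step t of episode m; rewards_to_go[m] is the terminal reward-to-go R^m.
    """
    M = len(log_prob_grads)
    total = None
    for m in range(M):
        n = len(log_prob_grads[m])
        for t, g in enumerate(log_prob_grads[m]):
            # Discount so that steps nearer the final prediction weigh more.
            term = (gamma ** (n - 1 - t)) * rewards_to_go[m] * g
            total = term if total is None else total + term
    return total / M

# Two hypothetical episodes, three decision steps each, two parameters.
grads = [[np.array([1.0, 0.0])] * 3, [np.array([0.0, 1.0])] * 3]
R = [1.0, -1.0]  # correct vs. incorrect final prediction
g_est = reinforce_gradient(grads, R, gamma=1.0)
print(g_est)  # -> [ 1.5 -1.5]
```

With a positive terminal reward the log-probabilities of the taken actions are pushed up, and with a negative one they are pushed down, which is exactly how the classifier's reward signal steers the walk policy.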
The expected reward of a roll-out depends only on the classification at the end of the walk. Therefore, with roll-outs starting at labeled nodes but traversing unlabeled nodes, the training proceeds in a semi-supervised manner that uses the unlabeled nodes as transitional nodes. It effectively integrates labeled and unlabeled nodes to utilize their information to the maximal extent. However, high variance from sampling still exists, even though the estimate is unbiased. The reward setting alleviates this problem in sampled trajectories to some degree by removing the reward collected at the intermediate steps of roll-outs.
For the classification network, we define the loss as the classification error (cross-entropy) plus L2 regularization. The classification network is trained in a supervised way via gradient descent by itself and provides reward signals to the agent, while the score network is trained using REINFORCE. The whole model is trained end-to-end.
4. Experiments
In this section, we present and discuss the extensive experiments and the results obtained. We first introduce the four datasets used, the comparison methods, the implementation details, and the parameters used for all the models. Finally, we report the results obtained and present a case study.
The evaluation datasets are citation networks constructed from the Cora, DBLP, and Delve data. For each of the resulting papers in the citation networks, we extract the title (and abstract when available). We also extract the citation contexts (sentences encompassing the citation) of the references from the papers when available. The statistics of the datasets are shown in Table 1.
CoraL1: The CoraL1 dataset is extracted from the original Cora data (https://people.cs.umass.edu/mccallum/data.html). We excluded papers with missing titles and papers with no citations and references (isolated papers). We use the top-level labels provided in the dataset.
CoraIDA: The CoraIDA dataset is constructed in the same way as CoraL1; however, we only train and test on the papers under Artificial Intelligence, Databases, and Information Retrieval.
DBLP: The DBLP dataset was extracted from the DBLP dump (https://dblp.uni-trier.de/xml/), which comprises the full DBLP data at the time of download. We extracted papers published in preselected conferences and journals with a focus on predefined topics. Thus, if a paper is published in one of the database-focused conferences or journals, the paper is assigned the label "database". We constructed a citation network by selecting the neighbors (one hop away) of each paper. For each of the resulting papers, we extract the title (and abstract when available). This dataset has no edge attributes, since DBLP has no full-text content information.
Delve: The Delve dataset is extracted from the Delve website (http://adatahub.com). As with the DBLP data, we extracted papers published in preselected conferences/journals targeting predefined topics. The citation graph and paper labeling were constructed in the same way as for DBLP.
Table 1. Dataset statistics: # nodes, # edges, # labels, and # labeled nodes per dataset.
4.2. Experimental setup
The experiments were conducted on a Linux system using Python. Our method is implemented using the TensorFlow library. Each GPU-based experiment was conducted on an Nvidia 1080TI GPU. When the abstract is available, a paper (node) attribute is given as the concatenation of the title and abstract; otherwise only the title is used. Each citation relationship (edge) attribute is given as the concatenation of all its citation contexts (i.e., sentences where the reference is mentioned in the citing paper). The paper and citation attributes are then converted to vectors by applying the latent semantic analysis (LSI) method to the document-term matrix features, resulting in 300-dimensional feature vectors. We complete the missing citation attributes with zero vectors and assume no missing paper attributes. In all the experiments, the attribute vectors are normalized to unit norm.
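The LSI step can be sketched with a truncated SVD on a tiny hypothetical document-term matrix; the real pipeline projects to 300 dimensions, while this sketch uses 2, and the unit-norm step matches the normalization described above.

```python
import numpy as np

def lsi_features(doc_term, k):
    """Project a document-term matrix to k latent dimensions via truncated SVD
    (the core of LSI), then L2-normalize each document vector."""
    U, S, Vt = np.linalg.svd(doc_term, full_matrices=False)
    feats = U[:, :k] * S[:k]  # document coordinates in the LSI space
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.where(norms == 0, 1.0, norms)  # unit-norm attributes

# Tiny hypothetical document-term counts (rows: papers, cols: terms).
dt = np.array([[2.0, 1.0, 0.0],
               [1.0, 2.0, 0.0],
               [0.0, 0.0, 3.0]])
F = lsi_features(dt, k=2)
print(F.shape)  # each paper becomes a 2-dim unit vector
```

In practice one would fit the SVD on the training corpus and reuse the learned projection for new papers, which is what makes the same featurization usable in the inductive setting.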
For our proposed model, we performed a grid search over the length of the walk $T$ and the number of walks per node. For each neural-network-based model, we performed a grid search over the learning rate and hidden layer dimension. We performed the parameter grid search by training on the CoraL1 dataset with 10% labeled samples. The best parameters per model from the grid search are then used in all experiments. The RAW models are trained for 30 epochs, the GCN and FastGCN models for 200 epochs, the GraphSAGE and GAT models for 20 and 100 epochs respectively, and the Planetoid models for 5000 epochs, each with its best parameter set from the grid search. We used the Scikit-Learn implementation of linear SVM with default settings for the embedding-based evaluations. All experimental results reported in this paper are averaged over five runs on random samples of each dataset. For each experiment, we hold out 30% of the labeled data for testing. We then vary the number of labeled training samples, with the remaining labeled samples assumed to be unlabeled (included in the set of unlabeled samples). In all our experiments, we treat the graph as undirected.
4.3. Comparison Methods
To evaluate the performance of our model, we compare RAW with several state-of-the-art semi-supervised graph-based methods, using classification accuracy as the performance metric. We selected the most competitive baselines that are also publicly available online, to avoid unfair evaluations due to faulty reimplementation. The baseline methods fall into the following groups:
Unsupervised embedding + classifier: we generate embeddings using several unsupervised embedding methods, which we then give as input to the Linear SVM model for training and classification. The embedding methods include: Node2Vec (Grover and Leskovec, 2016), DeepWalk (Perozzi et al., 2014b), Latent Semantic Analysis (Deerwester et al., 1990), and TADW (Yang et al., 2015).
Supervised learning on graph: in the inductive setting, we evaluate against several variants of FastGCN (Chen et al., 2018) and GraphSAGE (Hamilton et al., 2017) which are supervised learning models for inductive node classification. Note, however, that our proposed method works in a semi-supervised manner in both the inductive and transductive settings.
Table 3 shows the classification performance of our proposed model and other state-of-the-art models. Our proposed model exhibits performance similar to GCN in the transductive setting and outperforms all other baseline methods in all settings. For the GAT models, we use the sparse version (SpGAT), as the original implementation gave an out-of-memory (OOM) error even on CoraL1 with 10% labeled samples. We could only evaluate GCN and GAT on the Cora datasets, as we got out-of-memory errors when applying them to the other, larger datasets due to the dense LSI vectors. We evaluate the scalability of these methods and show the memory usage analysis in Section 4.4.4. In summary, RAW is usable on large-scale graphs and produces the best node classification results, with no significant difference from GCN but more efficiency than GCN.
Table 3 also shows the comparison of RAW with other inductive models. RAW outperforms all the baseline methods in all settings. In the inductive setting, the testing nodes are removed from the training graph and thus are not seen during training. The agent learns the optimal walk policy during training, which then generalizes to unseen nodes. The test nodes are only added to the graph during testing; the agent, guided by the learned policy, starts a walk from each added node to learn the embedding for the new node. We compare RAW against GraphSAGE, FastGCN, and the Planetoid inductive model. GraphSAGE and FastGCN are supervised learning algorithms and thus do not use the unlabeled and test nodes during training. The superior performance of RAW shows that walks starting from the new nodes, guided by the learned policy, aggregate the most useful information for classifying the starting node (the target to classify).
4.4.3. Trajectory Analysis
Furthermore, we analyze the walk trajectories returned by RAW. This study is performed on the CoraIDA dataset. We extract the trajectories learned for nodes in each class, and then compute the distribution of labels over all nodes visited on these trajectories. The 23 columns in the heatmap plot correspond to the 23 class labels given in Table 3. From Figure 2, we can see that the walk sequences for each class mostly visit nodes in the same class as the target class (the light squares on the diagonal). This verifies that the RAW agent tends to walk to nodes in the same class to accomplish the classification task. It is worth mentioning that the RAW agent has no label information when walking: neither the target label (the label of the starting node) nor the labels of neighboring nodes.
(Excerpt of the class-label index table: DB/Deductive = 6; AI/Games and Search = 17; AI/Vision and Pattern Recognition.)
More importantly, we observe in Figure 2 the relationships between the classes (note again that the RAW agent moves without any label information). For instance, we can observe that papers under some topics in a research field tend to visit other papers in the same research field more often. Database papers (with labels 0-6) form a block in the bottom-left corner. The other two blocks, less obvious but observable, correspond to information retrieval and artificial intelligence. Figure 2 also highlights the important topics. We can notice the influence of the Machine Learning class on the Artificial Intelligence and Information Retrieval communities. This influence is shown by the ratio of times the walk sequences of nodes in each class under AI visit the machine learning nodes. It is interpretable, as an individual usually needs to read some machine learning papers/books to understand these topics better. We also notice the varying versatility of the classes. The walk sequences of the class Theorem Proving mostly visit nodes in its own class. This result suggests that this research area is quite narrow, while Machine Learning and Knowledge Representation are broader and therefore more versatile topics.
To further analyze the performance of RAW, we compare the paths made by RAW and by a random walk. We obtain a set of trajectories returned by RAW, and another set of trajectories by random walks starting from the same nodes. For each trajectory, we calculate the path label diversity at each walking step $t$:

$$d_t = \frac{1}{t} \sum_{i=1}^{t} \mathbb{1}\left(y_i \neq y_0\right)$$

where $y_0$ is the label of the starting node, $y_i$ is the label of the $i$-th node on the path, and $\mathbb{1}$ is the indicator function. The value is low (close to 0) when nodes on the path have the same label as the starting node, indicating that the agent learned to explore neighboring nodes with the same label as the target label. Note that the agent has no label knowledge during the walk. Figure 5 shows the mean and variance of the path label diversity when starting at two different selected nodes. We can see that the RAW agent walks with much lower diversity than the random walk.
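The per-step diversity can be computed as sketched below; the label sequence is hypothetical.

```python
def path_label_diversity(path_labels):
    """Fraction of nodes on the walk (after the start node) whose label
    differs from the starting node's label, computed at each step t."""
    y0 = path_labels[0]
    diversities = []
    mismatches = 0
    for t, y in enumerate(path_labels[1:], start=1):
        mismatches += int(y != y0)
        diversities.append(mismatches / t)
    return diversities

# A walk visiting labels: start = A, then A, B, A, B.
d = path_label_diversity(["A", "A", "B", "A", "B"])
print(d)  # step-wise diversity: 0.0, 0.5, ~0.33, 0.5
```

Averaging these per-step values over many trajectories gives the mean/variance curves compared against random walks in Figure 5.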
4.4.4. Parameter and Memory Analysis
We study the effect of the walk length $T$ on the performance of the model. We train the model on the CoraL1 dataset with 10% training samples. In Figure 4, it can be observed that the model already performs well after ten steps, as there is little improvement from further increasing the walk length, while longer walks incur higher time cost.
Figure 3 shows the GPU memory utilization of our proposed model and several state-of-the-art semi-supervised transductive models. We randomly generated Erdős-Rényi graphs with 100, 1K, 5K, 10K, 50K, 100K, and 500K nodes, and set the number of edges in each graph to be ten times the number of nodes. We randomly generate 300-dimensional attributes for the nodes and edges. We then measure the GPU memory consumption on each graph using the nvidia-smi Linux command and compare RAW with GCN, GAT, and SpGAT. GCN and GAT do not scale with the number of nodes and edges (shown as zero bars in Figure 3 once the graph grows larger than 50K nodes).
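The benchmark graphs above can be generated as follows. This is a stdlib-only sketch under our own naming (`er_benchmark_graph` is not from the paper): n nodes, m = 10n distinct undirected edges, and random 300-dimensional attributes on both nodes and edges, mirroring the memory-scaling setup.

```python
import random

def er_benchmark_graph(n, seed=0):
    """Erdős-Rényi-style G(n, m) benchmark graph with m = 10 * n
    distinct undirected edges and random 300-dim attributes."""
    rng = random.Random(seed)
    m = 10 * n
    edges = set()
    while len(edges) < m:
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v:                      # no self-loops
            edges.add((min(u, v), max(u, v)))
    # Random 300-dimensional attributes for nodes and edges.
    node_attrs = {u: [rng.random() for _ in range(300)] for u in range(n)}
    edge_attrs = {e: [rng.random() for _ in range(300)] for e in edges}
    return edges, node_attrs, edge_attrs

edges, na, ea = er_benchmark_graph(100)
print(len(edges), len(na), len(na[0]))  # 1000 100 300
```

Memory is then sampled externally (e.g., polling `nvidia-smi` while each model trains on the generated graph), rather than from inside the training script.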
5. Case Study
In this section, we train the RAW model on the CoraIDA dataset with a fixed trajectory length and present a case study of a sampled walk trajectory of a paper. Figure 6 shows the walk sequence extracted from a paper entitled “A Critique of Structure from motion Algorithm”, classified as “Vision and Pattern Recognition”, on a subgraph of the CoraIDA graph. The thickness of an edge signifies the ratio of times the edge was traversed during the walk. The color of a node signifies the relationship of the node's class to the target class: blue signifies that a node has the same label as the target label, red signifies that a node has a label different from the target label, and black signifies the start node. Note that the target class is the class of the start node.
We see from Figure 6 that the agent can selectively make decisions to visit nodes with the same label as that of the start node. The agent also visits unlabeled nodes (white nodes in the bottom-right corner). We observe that even though their labels are unknown, the visited unlabeled nodes address topics similar to those of the start paper, e.g., papers entitled “new statistical models for randoms-precorrected pet scans”, “fast monotonic algorithms for transmission tomography”, etc.
6. Conclusion
In this paper, we propose to address the semi-supervised node classification problem in attributed networks by letting an agent choose the most relevant nodes to visit in a recurrent walk framework. The decision of where to visit next is made by considering the previous visiting history, the current node's content, the content of the current node's one-hop neighbors, and the content of the edges between the current node and its linked neighbors. The information accumulated from the nodes in the sequence is finally used for classification. We show through several experiments and analyses that the proposed model outperforms several state-of-the-art methods in both transductive and inductive settings. The analysis of the obtained walk sequences also confirms that our model selects the most relevant nodes to visit, which leads to higher classification accuracy than other methods.
Acknowledgments
This work is supported by the King Abdullah University of Science and Technology (KAUST), Saudi Arabia.
References
- Watch your step: learning graph embeddings through attention. arXiv preprint arXiv:1710.09599. Cited by: §2.1.
- Mining top-k popular datasets via a deep generative model. In 2018 IEEE International Conference on Big Data (Big Data), pp. 584–593. Cited by: §1, §2.
- A comprehensive survey of graph embedding: problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30 (9), pp. 1616–1637. Cited by: footnote 1.
- Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247. Cited by: §4.3.
- Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §3.1.
- Indexing by latent semantic analysis. Journal of the American society for information science 41 (6), pp. 391–407. Cited by: §4.3.
- Community detection in attributed network. In Companion Proceedings of The Web Conference 2018, pp. 1299–1306. Cited by: §1.
- Deep attributed network embedding.. In International Joint Conference on Artificial Intelligence, pp. 3364–3370. Cited by: §1.
- Attributed graph grammars for graphics. In International Workshop on Graph Grammars and Their Application to Computer Science, pp. 130–142. Cited by: §1.
- Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §1, §2, §4.3.
- Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §1, §2, §4.3.
- Vain: attentional multi-agent predictive modeling. In Advances in Neural Information Processing Systems, pp. 2701–2711. Cited by: §2.2.
- Label informed attributed network embedding. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 731–739. Cited by: §1.
- Graph convolutional reinforcement learning for multi-agent cooperation. arXiv preprint arXiv:1810.09202. Cited by: §2.2.
- Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: 2nd item, §1, §1, §2, §4.3.
- Attention models in graphs: a survey. arXiv preprint arXiv:1807.07984. Cited by: §2.1.
- Higher-order graph convolutional networks. arXiv preprint arXiv:1809.07697. Cited by: §1.
- Graph classification using structural attention. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1666–1674. Cited by: §2.2.
- Stanford large network dataset collection, 2014. URL: http://snap.stanford.edu/data/index.html. Cited by: §1.
- Attributed social network embedding. IEEE Transactions on Knowledge and Data Engineering 30 (12), pp. 2257–2270. Cited by: §1.
- Focused clustering and outlier detection in large attributed graphs. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1346–1355. Cited by: §1.
- Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §1, §2, §4.3.
- Higher-order network representation learning. In Companion Proceedings of The Web Conference 2018, pp. 3–4. Cited by: §1.
- Estimation of graphlet counts in massive networks. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1.
- Line: large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, pp. 1067–1077. Cited by: §2.
- Arnetminer: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 990–998. Cited by: §1.
- Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735. Cited by: 1st item, §1, §2.1, §4.3.
- Graph attention networks. arXiv preprint arXiv:1710.10903 1 (2). Cited by: 1st item, §1, §1, §2.1, §4.3.
- A protein interaction network for pluripotency of embryonic stem cells. Nature 444 (7117), pp. 364. Cited by: §1.
- Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §3.2.
- Deeppath: a reinforcement learning method for knowledge graph reasoning. arXiv preprint arXiv:1707.06690. Cited by: §2.2.
- Network representation learning with rich text information. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §1, §1, §4.3.
- Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861. Cited by: §1, §1, §1, §2, §4.3.
- Learning with local and global consistency. In Advances in neural information processing systems, pp. 321–328. Cited by: §1.
- Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434. Cited by: §1.
- Learning from labeled and unlabeled data with label propagation. Cited by: §1, §2.
- Dual graph convolutional networks for graph-based semi-supervised classification. In Proceedings of the 2018 World Wide Web Conference, pp. 499–508. Cited by: §2.