Machine learning on graph-structured network data has proliferated in a number of important applications. To name a few, it shows great potential in chemical prediction (Gilmer et al. (2017)), in understanding protein functions, and in particle physics experiments (Henrion et al. (2017); Choma et al. (2018)). Learning a representation of the structural information of a graph amounts to discovering a mapping that embeds nodes (or sub-graphs) as points in a low-dimensional vector space. Graph neural network algorithms based on neighborhood aggregation address this problem by leveraging each node's attributes (Kipf & Welling (2016); Hamilton et al. (2017); Pham et al. (2017)). The GraphSAGE algorithm (Hamilton et al. (2017)
) recursively subsamples a fixed number of nodes uniformly at random from local neighborhoods over multiple hops, and learns a set of aggregator models that combine the hidden features of the subsampled nodes by backtracking toward the target node. The subsampling keeps the computational footprint of each mini-batch fixed in parallel computing. However, despite the comprehensive features of GraphSAGE, unbiased random sampling with a uniform distribution causes high variance in training and testing, which leads to suboptimal accuracy. In the present work, we propose a novel method that replaces the subsampling algorithm in GraphSAGE with a data-driven sampling algorithm trained with Reinforcement Learning.
2 Preliminaries: GraphSAGE
GraphSAGE (Hamilton et al. (2017)) performs local neighborhood sampling and then aggregation to generate the embeddings of the sampled nodes. The sampling step provides the benefit that the computational and memory complexity is constant with respect to the size of the graph. Once the target node, $v$, is determined, a fixed-size set of neighborhoods, $\mathcal{N}^k(v)$ for hops $k = 1, \dots, K$, is sampled as follows:
$$\mathcal{N}^k(v) \;=\; \bigcup_{u' \in \mathcal{N}^{k-1}(v)} \mathrm{Uniform}\big(\mathcal{N}(u'),\, S_k\big), \qquad \mathcal{N}^0(v) = \{v\} \qquad (1)$$
where $\mathcal{N}(u')$ is the set of neighboring nodes of $u'$ and $S_k$ is the sample size at depth $k$. $\mathrm{Uniform}(\cdot,\, S_k)$ is a sampler that draws $S_k$ nodes from a uniform distribution, the default setting. This way the receptive field of a single node grows with the number of layers, $K$, so the size of the sampled set is $\prod_{k=1}^{K} S_k$. After the sampling, we aggregate the embeddings of the nodes in the sampled set toward the original node $v$.
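As an illustration, the uniform multi-hop subsampling above can be sketched in Python as follows (a minimal sketch; the function and variable names are hypothetical, and sampling with replacement is assumed whenever a neighborhood is smaller than $S_k$):

```python
import random

def sample_neighborhood(adj, target, sample_sizes, seed=0):
    """Recursively subsample a fixed number of neighbors per hop, as in
    GraphSAGE's default uniform sampler. `adj` maps a node id to its
    neighbor list; `sample_sizes[k-1]` is the sample size S_k at hop k."""
    random.seed(seed)
    frontier = [target]
    sampled = {0: [target]}            # hop 0 is the target node itself
    for k, s_k in enumerate(sample_sizes, start=1):
        next_frontier = []
        for node in frontier:
            # uniform sampling with replacement, so small neighborhoods
            # still yield exactly s_k picks
            next_frontier.extend(random.choices(adj[node], k=s_k))
        sampled[k] = next_frontier
        frontier = next_frontier
    return sampled                     # hop-k set has prod_{i<=k} S_i nodes
```

With `sample_sizes = [S_1, S_2]`, the hop-2 set contains $S_1 \cdot S_2$ nodes, matching the receptive-field growth noted above.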
The initial node embeddings, $h_u^0 = x_u$ for each node $u$ in a sampled set, are the input node attributes (features) with a dimension of $d$.
The mean_concat aggregator averages the embeddings, $h_u^{k-1}$, of the neighboring nodes, $u \in \mathcal{N}^k(v)$, of a sampled node $v$. That aggregated neighbor embedding is then combined, by concatenation, with the embedding of the node $v$ to assign a new embedding to the node. If the concatenation is changed into an addition, it becomes the mean_add aggregator.
$$h_v^k \;=\; \sigma\Big(W^k \cdot \big[\, h_v^{k-1} \,\big\|\, \mathrm{mean}\big(\{h_u^{k-1} : u \in \mathcal{N}^k(v)\}\big) \big]\Big) \qquad (3)$$
where $W^k$, with a size of $d' \times 2d$ at the first layer and $d' \times 2d'$ at the remaining layers, are weight matrices that are shared among the nodes in network layer $k$. $d'$ is the hidden feature dimension, and $\sigma$ is a non-linear function, such as a rectified linear unit, defined as $\sigma(x) = \max(0, x)$. The operator $\|$ indicates the concatenation of two vectors. Afterward, the new embedding, $h_v^k$, is normalized. After finishing $K$-layer processing, the final embedding vector, $z_v = h_v^K$, is generated. This goes to a classifying layer to predict $C$ classes. The GraphSAGE model is trained to minimize the classification cross-entropy loss.
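A single layer of the mean_concat aggregator can be sketched in NumPy as follows (an illustrative sketch under the notation of this section, not the authors' implementation; the weight matrix `W` is assumed to have shape $d' \times 2d$):

```python
import numpy as np

def mean_concat_layer(h_self, h_neigh, W):
    """One mean_concat step: average the sampled neighbors' embeddings,
    concatenate with the node's own embedding, apply the shared weight
    matrix W (shape d' x 2d) and a ReLU, then l2-normalize the result."""
    neigh_mean = h_neigh.mean(axis=0)                # mean over sampled neighbors
    combined = np.concatenate([h_self, neigh_mean])  # [h_v || mean(h_u)]
    h_new = np.maximum(0.0, W @ combined)            # ReLU non-linearity
    norm = np.linalg.norm(h_new)
    return h_new / norm if norm > 0 else h_new       # normalization step
```

Replacing the concatenation with `h_self + neigh_mean` (and a `W` of shape $d' \times d$) would give the mean_add variant.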
3.1 Value Function-based Reinforcement Learning for Node Sampling
To replace the previous uniform sampler, we consider a Reinforcement Learning approach, which helps learn how to quickly find a good sampling distribution on a new dataset. A per-step reward, $r^k(v)$, is the negative value of the cross-entropy loss computed at a node $v$ given a $k$-hop uniformly subsampled neighborhood as well as the directly connected 1-hop neighborhood, $\mathcal{N}(v)$. Note that the per-step reward is a node-wise value; no summation is applied over a mini-batch of target nodes, $B$:
where $f_{\mathrm{agg}}$ is the aggregator of GraphSAGE and takes as input a target node $v$ and the $k$-hop subsampled neighborhood, $\mathcal{N}^k(v)$. A per-step visit count, $n^k(v, u)$, records how many times a neighboring node $u$ is indexed.
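For concreteness, the per-step reward can be sketched as the negative cross-entropy of the auxiliary classifier's softmax outputs, kept per target node (a hedged sketch; one-hot labels and the function name are assumptions):

```python
import numpy as np

def per_step_rewards(probs, labels, eps=1e-12):
    """Node-wise reward r^k(v): the negative cross-entropy between the
    auxiliary classifier's predicted class probabilities and the one-hot
    labels, with no summation over the mini-batch of target nodes."""
    # cross-entropy per node is -sum_c y_c * log p_c; the reward negates it
    return np.sum(labels * np.log(probs + eps), axis=1)
```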
The layer depth of the aggregator is equal to the number of hops ($k$), as seen in the iteration count of the outer loop surrounding the aggregator (equation 3). To produce per-step rewards, GraphSAGE predicts the classes, $\hat{y}^k$, at all the intermediate layers. To do so, we add auxiliary classifying layers at every intermediate layer besides the final layer. We consider a return consisting of the discounted sum of per-step rewards propagated from the first hop to the final $K$-th hop:
$$R \;=\; \sum_{k=1}^{K} \gamma^{\,k-1}\, r^k \qquad (7)$$
where $\gamma \in [0, 1]$ is a discount factor that discounts the contribution of future rewards. In other words, with a lower $\gamma$, we impose that a neighborhood at a closer distance has more influence on the return $R$. In order to avoid the overhead of computing all the per-step rewards, we explore an approximation scheme where we set $r^k$ to zero if $k < K$. Equation 7 can then be replaced with a last-hop learning version that approximates all-hop learning: $R \approx \gamma^{K-1} r^K$. A visit count, $n(v, u)$, sums all the per-step visit counts.
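The discounted return and its last-hop approximation can be sketched as follows (illustrative; the reward list is assumed to be ordered from hop 1 to hop $K$):

```python
def discounted_return(per_step_rewards, gamma):
    """All-hop return: R = sum_{k=1..K} gamma^(k-1) * r^k."""
    return sum(gamma ** (k - 1) * r
               for k, r in enumerate(per_step_rewards, start=1))

def last_hop_return(per_step_rewards, gamma):
    """Last-hop approximation: r^k is zeroed for k < K,
    leaving only the gamma^(K-1) * r^K term."""
    return gamma ** (len(per_step_rewards) - 1) * per_step_rewards[-1]
```

The approximation avoids evaluating the auxiliary classifiers at the intermediate hops, at the cost of discarding the closer-range reward terms.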
This return is optimized with respect to a policy using Reinforcement Learning. The inputs of the policy are a target node and the candidates for its neighborhood, and the output action space is either $1$ or $0$, indicating whether a candidate is selected as a subsample or not. The value function associated with this policy is denoted by $Q(v, u)$; we recall that it is the expected return under the policy from a target node $v$ to a neighboring node $u$, obtained by dividing the cumulative return by the visit count. The relationship between the value function and the neighboring node connected to the target node is defined as follows:
$$Q(v, u) \;=\; \frac{R(v, u)}{n(v, u)} \qquad (9)$$
3.2 Nonlinear Regressor to Model the Value Function
A possible state is not confined to the finite set of nodes observed in training, because the graph is assumed to be evolving; that is, unseen nodes can be observed during testing. Thus, we consider a function approximation to the value function using a non-linear combination of the attributes at a state.
where $\hat{Q}(v, u; \mathbf{w}) = f(x_v, x_u; \mathbf{w})$, and $x_v$ and $x_u$ are $d$-dimensional input vectors (attributes) of a node $v$ and each member $u$ of its neighborhood, $\mathcal{N}(v)$, respectively. $\mathbf{w}$ denotes the weights of a differentiable non-linear regressor function, $f$. A weight matrix with a size of $1 \times 2d$ and a bias are the parameters of the single perceptron layer to be learned. This model is trained to minimize the $\ell_2$-norm between the true value function, $Q(v, u)$, obtained in equation 9, and the output, $\hat{Q}(v, u; \mathbf{w})$, using mini-batch gradient descent optimization. The learned weights are shared in sampling neighborhoods at all depths.
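Training of the single-perceptron value regressor can be sketched as follows (a minimal NumPy sketch; `tanh` is a stand-in for the unspecified non-linearity, and all names are illustrative):

```python
import numpy as np

def predict(W, b, x_v, x_u):
    """Q_hat(v, u; w): a single perceptron layer on the concatenated
    attributes [x_v || x_u], with weight vector W (length 2d) and bias b."""
    return np.tanh(W @ np.concatenate([x_v, x_u]) + b)

def train_step(W, b, batch, targets, lr=0.01):
    """One mini-batch gradient-descent step on the squared (l2) error
    between the empirical value Q and the regressor output Q_hat."""
    gW, gb = np.zeros_like(W), 0.0
    for (x_v, x_u), q in zip(batch, targets):
        x = np.concatenate([x_v, x_u])
        pred = np.tanh(W @ x + b)
        grad_z = (pred - q) * (1.0 - pred ** 2)  # chain rule through tanh
        gW += grad_z * x
        gb += grad_z
    n = len(batch)
    return W - lr * gW / n, b - lr * gb / n
```

Because the regressor only consumes node attributes, it can score unseen nodes at test time, which is the inductive property motivated above.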
3.3 Node Sampling and Acceleration
To subsample a set of neighborhoods of a set of nodes by reinforcement learning, we redefine the neighborhood sampling function in equation 1 to include the non-linear regressor trained in subsection 3.2.
where $\mathcal{N}(v)$ is the set of neighboring nodes of $v$, $f$ is the non-linear regressor, and $S_k$ is the subsample size at the $k$-th hop. Based on the estimated value functions over the neighborhood, sorting the neighboring nodes in descending order and selecting the top $S_k$ decreases the computational efficiency. To alleviate this complexity and obtain the benefits of parallelism, all immediate neighbors are instead partitioned into $S_k$ groups. Then, the $\arg\max$ operation is executed to find the neighbor with the maximal predicted return in each group, in parallel across groups. This scheme reduces the complexity to $O(|\mathcal{N}(v)|)$ in sequential mode or $O(|\mathcal{N}(v)|/S_k)$ in parallel mode, where the groups are formed by evenly partitioning the neighborhood.
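The partitioned arg max selection can be sketched as follows (an illustrative NumPy sketch; `values` holds the regressor's predicted returns for the corresponding neighbors):

```python
import numpy as np

def rl_sample(neighbors, values, s_k):
    """Select s_k neighbors without a full sort: partition the immediate
    neighbors into s_k evenly sized groups and take the neighbor with the
    maximal predicted return in each group (one argmax per group, which
    can run in parallel across groups)."""
    neighbors, values = np.asarray(neighbors), np.asarray(values)
    groups = np.array_split(np.arange(len(neighbors)), s_k)
    return [int(neighbors[g[np.argmax(values[g])]]) for g in groups]
```

Each group's argmax is independent, so the sequential cost is $O(|\mathcal{N}(v)|)$ and the parallel depth is $O(|\mathcal{N}(v)|/S_k)$, matching the complexities stated above.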
4.1 Experimental Setup
For a supervised classification task on a large-scale graph in an inductive setting, we used the protein-protein interaction (PPI) (Zitnik & Leskovec (2017)), Reddit (Grover & Leskovec (2016)), and PubMed (Kipf & Welling (2016)) datasets. The classification accuracy metric is the micro F1 score, which combines recall and precision and is commonly used in this benchmark task. We tested the mean_concat aggregator in equation 3 with 2 or 3 layers ($K = 2, 3$). The default hidden feature dimension is 512 in all hidden layers. The neighborhood sample size is set to 30 at all hops. We use the Adam optimizer (Kingma & Ba (2014)
) and ran 10 epochs with a batch size of 32 and a learning rate of 0.01. When optimizing the non-linear regressor, we ran 50 epochs with a batch size of 512 and a learning rate of 0.001.
In Table 1, the RL-based training showed a relative improvement over the baseline method of 12.0% (two-layer) and 8.5% (three-layer) on the PPI dataset. The all-hop reward training exhibited a slight advantage over the last-hop reward, but the difference was not as large as the gap over the baseline. This supports the use of the last-hop approximation, which is computationally more efficient. The effect of RL-based sampling varied with the distribution type and range of the observed value function: it was close to a Gaussian distribution spanning a high, wide range for the PPI dataset, while it was closely characterized by a Rayleigh distribution concentrated in a very low, narrow range for the Reddit and PubMed datasets. We can infer from the high concentration of near-zero values that the graph nodes are distributed over a relatively regular space, which may explain the marginal advantage of RL-based sampling over uniform sampling on those datasets.
In Table 2, GraphSAGE with the RL-based sampling (*) achieved the runner-up accuracy on the PPI dataset and the best accuracy on the Reddit and PubMed datasets. Training for longer epochs helped improve the accuracy (#3 vs. #5, #4 vs. #6). Besides the default GraphSAGE configuration of a mean_concat aggregator and a sample size of [30, 30], a better compute-optimized network consisting of a mean_add aggregator and a smaller, hop-wise decreasing sample size of [25, 10] (suggested by Hamilton et al. (2017)) was also evaluated (#7, #8). The parameter size of the mean_add aggregator was approximately two-thirds smaller than that of mean_concat (refer to Par (MB)). Nevertheless, the accuracy of mean_add was similar to or higher than that of mean_concat when our proposed sampling method was applied (#6 vs. #8). From the perspectives of high-ranked accuracy and memory and computing efficiency, the proposed method proves practical and useful among these cutting-edge methods.
[Table 1 column headers: Uniform | All-hop RL | First-hop RL | Last-hop RL]
We introduced a novel data-driven neighborhood sampling approach, learned by Reinforcement Learning, that replaces the uniform random sampling in GraphSAGE (Hamilton et al. (2017)). In order to embed the nodes of a large-scale graph using limited computing and memory resources, it is crucial to sample a small set of neighboring nodes of high importance. For the supervised classification task in an inductive setting, we empirically showed that the proposed sampling method improves the node classification accuracy over uniform-sampling-based GraphSAGE.
The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. The authors collaborated with the Center for Data Science, New York University, New York, NY, USA, and were funded by Samsung Research, Samsung Electronics Co., Seoul, Republic of Korea. We express special thanks to Dr. Daehyun Kim, Dr. Myungsun Kim, and Yongwoo Lee at Samsung Research for their substantial help in supporting this collaboration.
- Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. ICLR, pp. 1–15, 2018. URL https://openreview.net/pdf?id=rytstxWAW.
- Choma et al. (2018) Nicholas Choma, Federico Monti, Lisa Gerhardt, Tomasz Palczewski, Zahra Ronaghi, Prabhat Prabhat, Wahid Bhimji, Michael Bronstein, Spencer Klein, and Joan Bruna. Graph neural networks for icecube signal classification. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 386–391. IEEE, 2018.
- Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. 2017. URL http://arxiv.org/abs/1704.01212.
- Grover & Leskovec (2016) Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (KDD '16), pp. 855–864, 2016. doi: 10.1145/2939672.2939754. URL http://dl.acm.org/citation.cfm?doid=2939672.2939754.
- Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.
- Henrion et al. (2017) Isaac Henrion, Johann Brehmer, Joan Bruna, Kyunghyun Cho, Kyle Cranmer, Gilles Louppe, and Gaspar Rochette. Neural message passing for jet physics. 2017.
- Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kipf & Welling (2016) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- Pham et al. (2017) Trang Pham, Truyen Tran, Dinh Q Phung, and Svetha Venkatesh. Column networks for collective classification. In AAAI, pp. 2485–2491, 2017.
- Velickovic et al. (2017) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 1(2), 2017.
- Zitnik & Leskovec (2017) Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.