Several sampling algorithms with variance reduction have been proposed for accelerating the training of Graph Convolution Networks (GCNs). However, due to the intractable computation of the optimal sampling distribution, these sampling algorithms are suboptimal for GCNs and are not applicable to more general graph neural networks (GNNs) where the message aggregator contains learned weights rather than fixed weights, such as Graph Attention Networks (GAT). The fundamental reason is that the embeddings of the neighbors and the learned weights involved in the optimal sampling distribution change during training and are not known a priori, but are only partially observed when sampled, thus making the derivation of an optimal variance-reduced sampler non-trivial. In this paper, we formulate the optimization of the sampling variance as an adversarial bandit problem, where the rewards are related to the node embeddings and learned weights and can vary constantly. Thus a good sampler needs to acquire variance information about more neighbors (exploration) while at the same time optimizing the immediate sampling variance (exploitation). We theoretically show that our algorithm asymptotically approaches the optimal variance within a factor of 3. We show the efficiency and effectiveness of our approach on multiple datasets.
Graph neural networks Kipf and Welling (2016); Hamilton et al. (2017) have emerged as a powerful tool for representation learning on graph data in irregular or non-Euclidean domains (Battaglia et al., 2018; Wu et al., 2019). For instance, graph neural networks have demonstrated state-of-the-art performance on learning tasks such as node classification and link and graph property prediction, with applications ranging from drug design Dai et al. (2016), social networks Hamilton et al. (2017), transaction networks Liu et al. (2018), and gene expression networks Fout et al. (2017), to knowledge graphs Schlichtkrull et al. (2018).
One major challenge of training GNNs comes from the heavy floating point operations and large memory footprints required by the recursive expansion over neighborhoods. For a minibatch with even a single vertex $v$, to compute its embedding at the $l$-th layer we have to expand its neighborhood layer by layer down to the input, i.e. $v$'s $l$-hop neighbors. This quickly covers a large portion of the graph, particularly if the graph is dense. One basic idea for alleviating this “neighbor explosion” problem is to sample neighbors in a top-down manner, i.e. recursively sample neighbors in the $(l-1)$-th layer given the nodes in the $l$-th layer.
Several layer sampling approaches Hamilton et al. (2017); Chen et al. (2018); Huang et al. (2018); Zou et al. (2019) have been proposed to alleviate the above “neighbor explosion” problem and improve the convergence of training GCNs, e.g. with importance sampling. However, for a vertex $v$, the optimal sampler Huang et al. (2018) that minimizes the variance of the estimator involves the hidden embeddings of all of $v$'s neighbors, which is infeasible to compute because we can only observe them partially while sampling. Existing approaches Chen et al. (2018); Huang et al. (2018); Zou et al. (2019) typically compromise the optimal sampling distribution via approximations, which may impede convergence. Moreover, such approaches are not applicable to more general cases where the weights or kernels $\alpha_{vu}$'s are not known a priori, but are learned weights parameterized by attention functions Veličković et al. (2017). That is, both the hidden embeddings and learned weights involved in the optimal sampler constantly vary during the training process, and only part of the unnormalized attention values or hidden embeddings can be observed while sampling.
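To make the intractability concrete, here is a toy sketch (our own illustration with made-up weights and embeddings, not the paper's code): the variance-optimal importance-sampling distribution for a vertex weights each neighbor $u$ proportionally to $\alpha_{vu}\|h_u\|$, so even writing it down requires every neighbor's embedding norm up front.

```python
import numpy as np

# Hypothetical toy setup: a vertex with 5 neighbors, fixed aggregation
# weights alpha_i and (normally unobserved) neighbor embeddings h_i.
rng = np.random.default_rng(0)
alpha = np.array([0.4, 0.3, 0.15, 0.1, 0.05])   # aggregation weights
h = rng.normal(size=(5, 8))                      # neighbor embeddings
target = alpha @ h                               # exact aggregation sum_i alpha_i h_i

def is_variance(q, n_trials=200_000):
    """Empirical total variance of the one-sample importance-sampling
    estimator (alpha_i / q_i) * h_i under sampling distribution q."""
    idx = rng.choice(5, size=n_trials, p=q)
    est = (alpha[idx] / q[idx])[:, None] * h[idx]
    return est.var(axis=0).sum()

uniform = np.full(5, 0.2)
# The variance-optimal distribution needs ALL neighbor embedding norms --
# exactly the information a sampler cannot observe in advance.
optimal = alpha * np.linalg.norm(h, axis=1)
optimal /= optimal.sum()
```

Both samplers yield unbiased estimates of `target`, but the optimal one attains a strictly lower variance, which is what the bandit samplers below chase online without ever seeing all of `h`.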
Present work. We derive novel variance-reduced samplers for the training of GCNs and attentive GNNs from a fundamentally different perspective. That is, different from existing approaches that need to compute the sampling distribution immediately, we maintain nonparametric estimates of the sampler instead, and update the sampler towards the optimal variance as we acquire partial knowledge about the neighbors being sampled, as the algorithm iterates.
To this end, we formulate the optimization of the samplers as a bandit problem, where the regret is the gap between the expected loss (negative reward) under the current policy (sampler) and the expected loss under the optimal policy. We define the reward with respect to each action, i.e. the choice of a set of neighbors with sample size $k$, as the derivatives of the sampling variance, and show that the variance of our samplers asymptotically approaches the optimal variance within a factor of 3. Under this problem formulation, we propose two bandit algorithms. The first, based on multi-armed bandits (MAB), chooses one arm (neighbor) at a time, repeated $k$ times; the second, based on MAB with multiple plays, chooses a combinatorial set of $k$ neighbors at once.
To summarize: (1) We recast the sampler for GNNs as a bandit problem from a fundamentally different perspective; it works for both GCNs and attentive GNNs, while existing approaches apply only to GCNs. (2) We theoretically show that the variance of our estimators asymptotically approaches that of the optimal sampler within a factor of 3, while no existing approach optimizes the sampler at all. (3) We empirically show that our approaches are highly competitive in terms of convergence and sampling variance, compared with state-of-the-art approaches on multiple public datasets.
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote the graph with $N$ nodes $v_i \in \mathcal{V}$ and edges $(v_i, v_j) \in \mathcal{E}$. Let $A \in \mathbb{R}^{N \times N}$ denote the adjacency matrix, and $X \in \mathbb{R}^{N \times d}$ the feature matrix, with $x_i$ denoting the $d$-dimensional feature of node $v_i$. We focus on the following simple but general form of GNNs:
where $h_v^{(l)}$ is the hidden embedding of node $v$ at the $l$-th layer, $\alpha(v,u)$ is a kernel or weight, $W^{(l)}$ is the transform parameter of the $l$-th layer, and $\sigma(\cdot)$ is the activation function. The weight $\alpha(v,u)$, or $\alpha_{vu}$ for simplicity, is non-zero only if $u$ is in the 1-hop neighborhood $\mathcal{N}(v)$ of $v$, and it varies with the aggregation function Battaglia et al. (2018); Wu et al. (2019). For example: (1) GCNs Dai et al. (2016); Kipf and Welling (2016) define fixed weights as $\tilde{D}^{-1}\tilde{A}$ or $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ respectively, where $\tilde{A} = A + I$ and $\tilde{D}$ is the diagonal node degree matrix of $\tilde{A}$. (2) Attentive GNNs Veličković et al. (2017); Liu et al. (2019) define learned weights by attention functions: $\alpha_{vu} = \tilde{\alpha}(v,u) / \sum_{u' \in \mathcal{N}(v)} \tilde{\alpha}(v,u')$, where the unnormalized attentions $\tilde{\alpha}(v,u)$ are parameterized by $W$. Different from GCNs, a learned weight can be evaluated only given all the unnormalized weights in the neighborhood.
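To ground the two weight families, here is a minimal sketch on a hypothetical 4-node graph (illustration only; the masked-softmax scores stand in for a parameterized attention function):

```python
import numpy as np

# Hypothetical 4-node graph; A is its binary adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)          # add self-loops: tilde-A = A + I
deg = A_tilde.sum(axis=1)        # diagonal of tilde-D

# (1) Fixed GCN weights: D^{-1}(A+I) (each row sums to 1), or the
#     symmetric D^{-1/2}(A+I)D^{-1/2} normalization.
rw = A_tilde / deg[:, None]
sym = A_tilde / np.sqrt(np.outer(deg, deg))

# (2) Attentive GNNs: alpha_{vu} is a softmax of unnormalized scores over
#     v's whole neighborhood -- evaluating one alpha needs ALL of v's scores.
rng = np.random.default_rng(0)
scores = np.where(A_tilde > 0, rng.normal(size=(4, 4)), -np.inf)
att = np.exp(scores - scores.max(axis=1, keepdims=True))
att /= att.sum(axis=1, keepdims=True)
```

The masking with `-np.inf` zeroes out non-neighbors after the exponential, which is why a sampled neighborhood alone cannot reproduce the true normalizer.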
where $p(u|v) = \alpha_{vu} / \sum_{u' \in \mathcal{N}(v)} \alpha_{vu'}$ defines a sampling distribution over the neighborhood of $v$. Hence we can evaluate each node at the $(l+1)$-th layer using a Monte Carlo estimator with $k$ sampled neighbors at the $l$-th layer. Without loss of generality, in the rest of this paper we assume $\alpha_{vu} \ge 0$ and $\sum_{u \in \mathcal{N}(v)} \alpha_{vu} = 1$, which meets the setting of attentive GNNs. To further reduce the variance, let us consider the following importance sampling
where we fold the transform parameter $W^{(l)}$ into the function $h(\cdot)$ for conciseness. As such, one can find an alternative sampling distribution $q_v$ to reduce the variance of an estimator, e.g. a Monte Carlo estimator $\hat{\mu}_v = \frac{1}{k} \sum_{i=1}^{k} \frac{p(u_i|v)}{q(u_i|v)}\, h(u_i)$, where $u_i \sim q(\cdot|v)$.
Taking the expectation over $q_v$, we define the variance of $\hat{\mu}_v$ at step $t$ and the $l$-th layer to be:
Note that the embeddings $h_u$'s and weights $\alpha_{vu}$'s that are inferred during training may vary over the steps $t$. We will explicitly include the step $t$ and layer $l$ only when necessary. By expanding Eq. (4), one can write the variance as the difference of two terms. The first, which is a function of $q_v$, we refer to as the effective variance:
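In standard importance-sampling notation, with $q_{vu}$ the alternative probability of sampling neighbor $u$ (our own symbols), the decomposition behind Eqs. (4)–(6) reads:

```latex
\mathbb{V}(q_v)
  = \mathbb{E}_{u \sim q_v}\!\left[\Big\|\tfrac{\alpha_{vu}}{q_{vu}}\, h_u\Big\|^2\right] - \|\mu_v\|^2
  = \underbrace{\sum_{u \in \mathcal{N}(v)} \frac{\alpha_{vu}^2 \,\|h_u\|^2}{q_{vu}}}_{\text{effective variance } \mathbb{V}_e(q_v)}
    \;-\; \|\mu_v\|^2,
\qquad \mu_v = \sum_{u \in \mathcal{N}(v)} \alpha_{vu}\, h_u .
```

Only the first term depends on $q_v$; minimizing it (by Cauchy–Schwarz) gives the optimal sampler $q^*_{vu} \propto \alpha_{vu}\|h_u\|$, whose normalizer sums over all neighbors' embedding norms — the quantity that sampling never fully reveals.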
However, evaluating this sampling distribution is infeasible, because we cannot have full knowledge of the neighbors' embeddings in the denominator of Eq. (6). Moreover, the $\alpha_{vu}$'s in attentive GNNs also vary during training. Existing layer sampling approaches based on importance sampling simply ignore the effect of the embedding norms and assume the $\alpha_{vu}$'s are fixed during training. As a result, their sampling distributions are suboptimal and only applicable to GCNs, where the weights are indeed fixed. Note that our derivation above follows the setting of node-wise sampling approaches Hamilton et al. (2017), but the claim also holds for layer-wise sampling approaches Chen et al. (2018); Huang et al. (2018); Zou et al. (2019).
We summarize three lines of work on training graph neural networks.
First, several “layer sampling” approaches Hamilton et al. (2017); Chen et al. (2018); Huang et al. (2018); Zou et al. (2019) have been proposed to alleviate the “neighbor explosion” problem. Given a minibatch of labeled vertices at each iteration, such approaches sample neighbors layer by layer in a top-down manner. In particular, node-wise samplers Hamilton et al. (2017) randomly sample neighbors in the lower layer given each node in the upper layer, while layer-wise samplers Chen et al. (2018); Huang et al. (2018); Zou et al. (2019) leverage importance sampling to sample neighbors in the lower layer given all the nodes in the upper layer, with the sample size of each layer independent of the others. Empirically, layer-wise samplers can work even worse than node-wise samplers Chen et al. (2017), and one can set an appropriate sample size per layer to alleviate the growth issue of node-wise samplers. In this paper, we focus on optimizing the variance in the vein of layer sampling approaches. Though the derivation of our bandit samplers follows the node-wise samplers, it can be extended to the layer-wise setting; we leave this extension as future work.
Second, Chen et al. (2017) proposed a variance-reduced estimator that maintains historical embeddings of each vertex, based on the assumption that the embeddings of a single vertex stay close to their history. This estimator uses a simple random sampler and works efficiently in practice, at the expense of extra storage linear in the number of nodes.
Third, two “graph sampling” approaches Chiang et al. (2019); Zeng et al. (2019) first cut the graph into partitions Chiang et al. (2019) or sample subgraphs Zeng et al. (2019), then train models on those partitions or subgraphs in batch mode Kipf and Welling (2016). They show that the training time of each epoch can be much faster compared with “layer sampling” approaches. Their drawbacks are as follows. First, the partition of the original graph could be sensitive to the training problem. Second, these approaches assume that all vertices in the graph have labels, whereas in practice only a fraction of the vertices may be labeled Hu et al. (2019); Liu et al. (2018).
We formulate the optimization of the sampling variance as a bandit problem. Our basic idea is that, instead of explicitly calculating the intractable optimal sampling distribution in Eq. (6) at each iteration, we aim to optimize a sampler or policy for each vertex $v$ over the horizon of $T$ steps, and make the variance of the estimator following this sampler asymptotically approach the optimum $\mathbb{V}_e(q_v^*)$, such that $\mathbb{V}_e(q_v^T) \le C\,\mathbb{V}_e(q_v^*)$ for some constant $C$. Each action of the policy is a choice of a subset of neighbors $S \subseteq \mathcal{N}(v)$ with $|S| = k$. We denote $q_v^t(S)$ as the probability that the policy chooses $S$ at step $t$. The gap to be minimized between the effective variance and the oracle is:
Note that the effective variance is convex w.r.t. $q_v$, hence for the policies $q_v^t$ and $q_v^*$ we have the upper bound derived on the right-hand side of Eq. (7). We define this upper bound as the regret at step $t$, i.e. the expected loss (negative reward) under policy $q_v^t$ minus the expected loss under the optimal policy $q_v^*$. Hence the reward w.r.t. choosing $S$ at step $t$ is the negative derivative of the effective variance.
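In our reconstructed notation, the convexity step is one line: since $\mathbb{V}_e$ is convex in $q_v$,

```latex
\mathbb{V}_e(q_v^t) - \mathbb{V}_e(q_v^*)
  \;\le\; \big\langle \nabla \mathbb{V}_e(q_v^t),\; q_v^t - q_v^* \big\rangle
  \;=\; \big\langle r_t,\; q_v^* - q_v^t \big\rangle,
\qquad r_t \;=\; -\nabla \mathbb{V}_e(q_v^t),
```

so bounding the cumulative reward gap on the right — exactly what adversarial bandit algorithms control — bounds the variance gap on the left.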
In the following, we study this bandit problem in the adversarial setting Auer et al. (2002), because the rewards vary as training proceeds and do not follow an a priori fixed distribution Burtini et al. (2015). We leave the study of other bandit settings as future work. We show in Section 6 that with this regret the variances of our estimators asymptotically approach the optimal variance within a factor of 3.
Our samplers draw a 1-element subset of neighbors $k$ times, or a $k$-element subset of neighbors once, from the alternative sampling distribution of each vertex $v$. We instantiate the above framework under two bandit settings. (1) In the adversarial MAB setting Auer et al. (2002), the sampler samples exactly one arm (neighbor) at a time from $q_v$; in this case the chosen set is a 1-element subset. To reach a sample size of $k$ neighbors, we repeat this action $k$ times. After we have collected the $k$ rewards, we update $q_v$ by EXP3 Auer et al. (2002). (2) In the adversarial MAB with multiple plays setting Uchiya et al. (2010), we use an efficient $k$-combination sampler (DepRound Gandhi et al. (2006)) to sample a $k$-element subset $S$ that satisfies $P(u \in S) = q_{vu}$, where $q_{vu}$ is the alternative probability of sampling neighbor $u$. As such, it allows us to select one of the distinct $k$-subsets of the $|\mathcal{N}(v)|$ arms at once, and the selection can be done in $O(|\mathcal{N}(v)|)$. After we have collected the rewards, we update $q_v$ by EXP3.M Uchiya et al. (2010).
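A minimal sketch of DepRound (our own illustration of Gandhi et al.'s dependent rounding, not the authors' code): given marginals that sum to an integer $k$, it repeatedly pairs two fractional entries and shifts probability mass between them so that the marginals are preserved in expectation and exactly $k$ entries end at 1.

```python
import random

def depround(q, eps=1e-12):
    """DepRound sketch: given marginals q summing to an integer k,
    return a k-subset S of indices with P(i in S) = q[i]."""
    p = list(q)
    frac = [i for i, v in enumerate(p) if eps < v < 1 - eps]
    while len(frac) >= 2:
        i, j = frac[0], frac[1]
        beta = min(1 - p[i], p[j])
        gamma = min(p[i], 1 - p[j])
        # choose the branch so that E[p_i], E[p_j] stay unchanged
        if random.random() < gamma / (beta + gamma):
            p[i] += beta; p[j] -= beta
        else:
            p[i] -= gamma; p[j] += gamma
        frac = [t for t in frac if eps < p[t] < 1 - eps]
    return {i for i, v in enumerate(p) if v > 1 - eps}
```

Each iteration fixes at least one entry to 0 or 1, hence the linear-time selection mentioned above.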
Discussions. We have to select a sample size of $k$ neighbors in GNNs. Note that in the rigorous bandit setting, exactly one action should be made before the policy is updated. In the adversarial MAB setting we make the selection $k$ times and then update the policy, hence strictly speaking applying MAB to our problem is not rigorous. Applying MAB with multiple plays is rigorous, because it allows us to sample $k$ neighbors at once and update with their rewards together. For readers interested in EXP3, EXP3.M and DepRound, please refer to Appendix A.
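For intuition, a minimal EXP3 sketch (our simplification of Auer et al. (2002); class and parameter names are ours, and rewards are assumed rescaled into $[0,1]$ before being fed back): the update importance-weights the observed reward by the probability of the arm it came from.

```python
import math
import random

class EXP3:
    """Minimal EXP3 sketch for one vertex's neighbors (arms)."""
    def __init__(self, n_arms, gamma=0.1):
        self.g = gamma
        self.w = [1.0] * n_arms

    def probs(self):
        total = sum(self.w)
        # mix the weight distribution with uniform exploration
        return [(1 - self.g) * wi / total + self.g / len(self.w)
                for wi in self.w]

    def draw(self):
        return random.choices(range(len(self.w)), weights=self.probs())[0]

    def update(self, arm, reward):
        # importance-weighted reward estimate keeps the update unbiased
        p = self.probs()[arm]
        self.w[arm] *= math.exp(self.g * (reward / p) / len(self.w))
```

Feeding a high reward to one arm repeatedly concentrates the sampling probability on it, which is exactly how the sampler drifts towards low-variance neighbors.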
The framework of our algorithm is: (1) use a sampler to sample $k$ arms from the alternative sampling distribution of any vertex $v$; (2) establish the unbiased estimator; (3) do the feedforward and backpropagation; and finally (4) calculate the rewards and update the alternative sampling distribution with a proper bandit algorithm. We show this framework in Algorithm 1. Note that the variance w.r.t. $q_v$ in Eq. (4) is defined per layer, hence we should maintain a separate distribution $q_v$ at each layer. In practice, we find that maintaining a single $q_v$ and updating it only with rewards from the 1st layer works well enough. The time complexity of our algorithm is the same as that of any node-wise approach Hamilton et al. (2017). In addition, it requires extra storage, one probability per edge, to maintain the nonparametric estimates $q_v$'s.
It remains to instantiate the estimators, variances and rewards of our two bandit settings. We name the first algorithm GNN-BS, under the adversarial MAB setting, and the second GNN-BS.M, under the adversarial MAB with multiple plays setting. We first assume the weights $\alpha_{vu}$'s are fixed, then extend to attentive GNNs where they change.
Given a vertex $v$, an important property of DepRound is that it satisfies $\sum_{S: u \in S} q_v(S) = q_{vu}$, where $S$ is any subset of size $k$. We have the following unbiased estimator.
$\hat{\mu}_v = \sum_{u \in S_v} \frac{\alpha_{vu}}{q_{vu}} h_u$ is an unbiased estimator of $\mu_v$, given that $S_v$ is sampled from $q_v$ using the DepRound sampler, where $S_v$ is the selected $k$-subset of neighbors of vertex $v$.
The effective variance of this estimator is $\sum_{S} q_v(S) \big\| \sum_{u \in S} \frac{\alpha_{vu}}{q_{vu}} h_u \big\|^2$. Since the derivative of this effective variance w.r.t. $q_{vu}$ does not factorize, we instead use the following approximated effective variance obtained via Jensen's inequality.
The effective variance can be approximated by $k \sum_{u \in \mathcal{N}(v)} \frac{\alpha_{vu}^2 \|h_u\|^2}{q_{vu}}$.
The negative derivative of the approximated effective variance w.r.t. $q_{vu}$, i.e. the reward of choosing neighbor $u$ at step $t$, is $r_t(v,u) = k\, \alpha_{vu}^2 \|h_u\|^2 / q_{vu}^2$.
Following EXP3.M, we feed this reward back for each sampled arm $u \in S_v$. Our proofs rely on the property of DepRound introduced above.
In this section, we extend our algorithms to attentive GNNs. The remaining issue is that the attention value $\alpha_{vu}$ cannot be evaluated from the sampled neighborhood alone; instead, we can only compute the unnormalized attentions $\tilde{\alpha}(v,u)$. We define the adjusted feedback attention values as follows:
where the $\tilde{\alpha}(v,u)$'s are the unnormalized attention values, which can readily be evaluated once we have sampled $S_v$. We use the normalizer over the sampled subset as a surrogate of the full-neighborhood normalizer, so that we can approximate the true attention values by our adjusted attention values $\hat{\alpha}_{vu}$.
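As a hedged sketch of the surrogate idea (our simplified version; the paper's exact adjustment may include additional correction terms), the observable unnormalized scores are renormalized over the sampled subset only:

```python
import numpy as np

def adjusted_attention(e_sampled):
    """Surrogate attention over a SAMPLED neighborhood: a softmax over
    only the k observed unnormalized scores e_{v,u}, since the
    full-neighborhood normalizer is unobservable under sampling."""
    e = np.asarray(e_sampled, dtype=float)
    z = np.exp(e - e.max())   # shift for numerical stability
    return z / z.sum()
```

The output preserves the relative ordering of the true attentions on the sampled neighbors while summing to one over the subset.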
As described in Section 4, the regret is defined as the gap in Eq. (7). By choosing the reward as the negative derivative of the effective variance, we have the following theorem: our bandit sampling algorithms asymptotically approximate the optimal variance within a factor of 3.
Using Algorithm 1 with properly chosen step-size parameters to minimize the effective variance with respect to $q_v$, the average effective variance over $T$ steps is bounded by three times the optimal effective variance, plus a regret term that is sublinear in $T$.
Our proof follows Salehi et al. (2017) by upper and lower bounding a potential function. The upper and lower bounds are functions of the alternative sampling probabilities and the rewards, respectively. By multiplying the two bounds by the optimal sampling probability and using the reward definition in (9), we obtain the upper bound on the effective variance. The growth of this regret is sublinear in $T$, and the regret decreases polynomially as the sample size $k$ grows. Note that the number of neighbors is always well bounded in practical graphs, and can be considered a moderate constant. Compared with existing layer sampling approaches Hamilton et al. (2017); Chen et al. (2018); Zou et al. (2019), which have a fixed variance given their specific estimators, this is the first work to optimize the sampling variance of GNNs towards the optimum. We empirically examine the sampling variances in the experiments.
| Dataset | |V| | |E| | Degree | # Classes | # Features | # train | # validation | # test |
In this section, we conduct extensive experiments compared with state-of-the-art approaches to show the advantage of our training approaches. We use the following rule to name our approaches: GNN architecture plus bandit sampler. For example, GCN-BS, GAT-BS and GP-BS denote the training approaches for GCN, GAT Veličković et al. (2017) and GeniePath Liu et al. (2019) respectively.
The major purpose of this paper is to compare the effects of our samplers with those of existing training algorithms, so we compare them by training the same GNN architecture. We use the following architectures unless otherwise stated. We fix the number of layers at 2, as in Kipf and Welling (2016), for all comparison algorithms. We set the dimension of hidden embeddings for Cora and Pubmed, and for PPI, Reddit and Flickr. For a fair comparison, we do not use the normalization layer Ba et al. (2016) used in some works Chen et al. (2017); Zeng et al. (2019). For attentive GNNs, we use the attention layer proposed in GAT. We set the number of multi-heads for simplicity.
We report results on benchmark data that include Cora Sen et al. (2008), Pubmed Sen et al. (2008), PPI Hamilton et al. (2017), Reddit Hamilton et al. (2017), and Flickr Zeng et al. (2019). We follow the standard data splits, and summarize the statistics in Table 1.
We summarize the comparison algorithms as follows. (1) GraphSAGE Hamilton et al. (2017) is a node-wise layer sampling approach with a random sampler. (2) FastGCN Chen et al. (2018), LADIES Zou et al. (2019), and AS-GCN Huang et al. (2018) are layer sampling approaches based on importance sampling. (3) S-GCN Chen et al. (2017) can be viewed as an optimization solver for the training of GCNs based on a simple random sampler. (4) ClusterGCN Chiang et al. (2019) and GraphSAINT Zeng et al. (2019) are “graph sampling” techniques that first partition or sample the graph into small subgraphs, then train each subgraph using the batch algorithm Kipf and Welling (2016). (5) Among the open-source algorithms, AS-GCN and GraphSAINT support the training of attentive GNNs; we denote these variants as AS-GAT and GraphSAINT-GAT.
We do grid search over the following hyperparameters of each algorithm: the learning rate, the penalty weight on the $\ell_2$-norm regularizers, and the dropout rate. Following the existing implementations (see https://github.com/matenure/FastGCN and https://github.com/huangwb/AS-GCN), we save the model with the best validation results, and restore it to report results on the testing data in Section 7.1. For the sample size in GraphSAGE, S-GCN and our algorithms, we set for Cora and Pubmed, for Flickr, for PPI and Reddit. We set the sample sizes of the first and second layers for FastGCN and AS-GCN/AS-GAT as 256 and 256 for Cora and Pubmed, and for PPI, and for Flickr, and for Reddit. We set the batch size of all the layer sampling approaches and S-GCN to the same value for all datasets. For ClusterGCN, we set the partitions according to the suggestions of Chiang et al. (2019) for PPI and Reddit, and set the number of partitions for Cora and Pubmed as 10, and for Flickr as 200, by grid search. We set the architecture of GraphSAINT as “0-1-1” (see https://github.com/GraphSAINT/ for more details), which means an MLP layer followed by two graph convolution layers. We use the “rw” sampling strategy, reported as best in the original paper, to perform the graph sampling procedure, and set the number of roots and the walk length as suggested in the paper.
We run each algorithm multiple times and report the mean and standard deviation. The results on the two-layer GCN architecture show that our GCN-BS performs best on most datasets. The results on the two-layer attentive GNN architecture show the superiority of our algorithms in training more complex GNN architectures. GraphSAINT and AS-GAT cannot compute the softmax of the learned weights, but simply use the unnormalized weights to perform the aggregation. As a result, most results of AS-GAT and GraphSAINT-GAT in Table 3 are worse than their results in Table 2. Thanks to the power of attentive structures in GNNs, our algorithms achieve better results on PPI and Reddit compared with GCN-BS, and significantly outperform the results of AS-GAT and GraphSAINT-GAT.
In this section, we analyze the convergence of the comparison algorithms on the two-layer GCN and attentive GNN architectures in Figure 1 in terms of epochs. We run all the algorithms multiple times and show the mean and standard deviation. Our approaches consistently converge to better results with faster rates and lower variances on most datasets, such as Pubmed, PPI, Reddit and Flickr, compared with the state-of-the-art approaches. GNN-BS performs very similarly to GNN-BS.M, even though strictly speaking GNN-BS does not follow the rigorous MAB setting. Furthermore, we show a huge improvement in the training of attentive GNN architectures compared with GraphSAINT-GAT and AS-GAT. The convergence on validation in terms of wall-clock time (seconds), compared with layer sampling approaches in Appendix C.1, shows similar results. We further discuss the timing of layer sampling and graph sampling approaches in Appendix C.2.
We analyze the sampling variances and accuracy as the sample size varies on the PPI data. Note that existing layer sampling approaches do not optimize the variance once the sampler is specified. As a result, their variances are simply fixed Zou et al. (2019), while our approaches asymptotically approach the optimum. For comparison, we train our models until convergence, then compute the average sampling variances. We show the results in Figure 2, grouped into two categories: results for GCNs and for attentive GNNs. The sampling variances of our approaches are smaller within each group, and even smaller than those of S-GCN, which leverages a variance reduction solver. This explains the performance of our approaches on the test Micro-F1 scores. We also find that the overall sampling variances of node-wise approaches are much lower than those of layer-wise approaches.
In this paper, we show that the optimal layer samplers based on importance sampling for training general graph neural networks are computationally intractable, since they need all the neighbors' hidden embeddings or learned weights. Instead, we reformulate the sampling problem as a bandit problem that requires only partial knowledge from the neighbors being sampled. We propose two algorithms based on the multi-armed bandit and MAB with multiple plays, and show that the variance of our bandit samplers asymptotically approaches the optimum within a factor of 3. Furthermore, our algorithms are applicable not only to GCNs but also to more general architectures such as attentive GNNs. We empirically show that our algorithms converge to better results with faster rates and lower variances compared with state-of-the-art approaches.
$\hat{\mu}_v = \sum_{u \in S_v} \frac{\alpha_{vu}}{q_{vu}} h_u$ is an unbiased estimator of $\mu_v$, given that $S_v$ is sampled from $q_v$ using the DepRound sampler, where $S_v$ is the selected $k$-subset of neighbors of vertex $v$.
Let us denote $q_v(S)$ as the probability of vertex $v$ choosing a $k$-element subset $S$ from the $n$-element set $\mathcal{N}(v)$ using the DepRound sampler. This sampler follows the alternative sampling distribution $q_v$, where $q_{vu}$ denotes the alternative probability of sampling neighbor $u$. The sampler is guaranteed to satisfy $\sum_{S: u \in S} q_v(S) = q_{vu}$, i.e. the sum of the probabilities of all subsets containing element $u$ equals the probability $q_{vu}$.
The effective variance can be approximated by $k \sum_{u \in \mathcal{N}(v)} \frac{\alpha_{vu}^2 \|h_u\|^2}{q_{vu}}$.
The variance is
Therefore the effective variance has the following upper bound: