1 Introduction
Deep Learning, especially Convolutional Neural Networks (CNNs), has revolutionized various machine learning tasks with grid-like input data, such as image classification [1] and machine translation [2]. By making use of local connections and weight sharing, CNNs are able to pursue translational invariance of the data. In many other contexts, however, the input data lie on irregular or non-Euclidean domains, such as graphs that encode pairwise relationships. Examples include social networks [3], protein interfaces [4], and 3D meshes [5]. How to define convolutional operations on graphs is still an ongoing research topic.
There have been several attempts in the literature to develop neural networks that handle arbitrarily structured graphs. Whereas learning the graph embedding is already an important topic [6, 7, 8], this paper mainly focuses on learning the representations of graph vertices by aggregating their features/attributes. The closest work in this vein is the Graph Convolution Network (GCN) [9], which applies connections between vertices as convolution filters to perform neighborhood aggregation. As demonstrated in [9], GCNs have achieved state-of-the-art performance on node classification.
An obvious challenge in applying current graph networks is scalability. Calculating convolutions requires the recursive expansion of neighborhoods across layers, which is computationally prohibitive and demands a hefty memory footprint. Even for a single node, the layer-by-layer neighborhood expansion quickly covers a large portion of the graph, particularly if the graph is dense or follows a power law. Conventional mini-batch training is unable to speed up the convolution computations, since every batch involves a large number of vertices even when the batch size is small.
To avoid the over-expansion issue, we accelerate the training of GCNs by controlling the size of the sampled neighborhoods in each layer (see Figure 5). Our method builds up the network layer by layer in a top-down way, where the nodes in the lower layer (here, lower layers denote the ones closer to the input) are sampled conditionally on those of the upper layer. Such layer-wise sampling is efficient in two technical aspects. First, we can reuse the information of the sampled neighborhoods, since the nodes in the lower layer are visible to and shared by their different parents in the upper layer. Second, it is easy to fix the size of each layer to avoid over-expansion of the neighborhoods, as the nodes of the lower layer are sampled as a whole.
The core of our method is to define an appropriate sampler for the layer-wise sampling. A common objective in designing the sampler is to minimize the resulting variance. Unfortunately, the optimal sampler to minimize the variance is uncomputable due to the inconsistency between the top-down sampling and the bottom-up propagation in our network (see § 4.2 for details). To tackle this issue, we approximate the optimal sampler by replacing the uncomputable part with a self-dependent function, and then add the variance to the loss function. As a result, the variance is explicitly reduced by training the network parameters and the sampler.
Moreover, we explore how to enable efficient message passing across distant nodes. Current methods [6, 10] resort to random walks to generate neighborhoods of various steps, and then integrate the multi-hop neighborhoods. Instead, this paper proposes a novel mechanism that adds a skip connection between the $(l-1)$-th and $(l+1)$-th layers. This shortcut connection reuses the nodes in the $(l-1)$-th layer as the 2-hop neighborhoods of the $(l+1)$-th layer, so it naturally maintains the second-order proximity without incurring extra computations.
To sum up, we make the following contributions in this paper: I. We develop a novel layer-wise sampling method to speed up the GCN model, where the between-layer information is shared and the number of sampled nodes is controllable. II. The sampler for the layer-wise sampling is adaptive and determined by explicit variance reduction in the training phase. III. We propose a simple yet efficient approach to preserve the second-order proximity by formulating a skip connection across two layers. We evaluate the performance of our method on four popular benchmarks for node classification, including Cora, Citeseer, Pubmed [11] and Reddit [3]. Extensive experiments verify the effectiveness of our method regarding classification accuracy and convergence speed.
2 Related Work
While graph structures are central tools for various learning tasks (e.g., [12, 9]), how to design efficient graph convolution networks has become a popular research topic. Graph convolutional approaches are often categorized into spectral and non-spectral classes [13]. The spectral approach first proposed by [14] defines the convolution operation in the Fourier domain. Later, [15] enables localized filtering by applying efficient spectral filters, and [16] employs a Chebyshev expansion of the graph Laplacian to avoid the eigendecomposition. Recently, GCN was proposed in [9] to simplify previous methods with a first-order expansion and a re-parameterization trick. Non-spectral approaches define convolution on graphs by using the spatial connections directly. For instance, [17] learns a weight matrix for each node degree, the work by [18] defines multiple-hop neighborhoods by using the power series of a transition matrix, and other authors [19] extract normalized neighborhoods that contain a fixed number of nodes.
A recent line of research is to generalize convolutions by making use of the patch operation [20] and self-attention [13]. As opposed to GCNs, these methods implicitly assign different importance weights to nodes of the same neighborhood, thus enabling a leap in model capacity. In particular, Monti et al. [20] present mixture model CNNs to build CNN architectures on graphs using the patch operation, while the graph attention networks [13] compute the hidden representation of each node by attending over its neighbors following a self-attention strategy.
More recently, two kinds of sampling-based methods, GraphSAGE [3] and FastGCN [21], were developed for fast representation learning on graphs. To be specific, GraphSAGE computes node representations by sampling the neighborhood of each node and then applying a specific aggregator for information fusion. The FastGCN model interprets graph convolutions as integral transforms of embedding functions and samples the nodes in each layer independently. While our method is closely related to these methods, we develop a different sampling strategy in this paper. Compared to GraphSAGE, which is node-wise, our method is based on layer-wise sampling, as all neighborhoods are sampled altogether, and thus allows neighborhood sharing as illustrated in Figure 5. In contrast to FastGCN, which constructs each layer independently, our model is capable of capturing the between-layer connections, as the lower layer is sampled conditionally on the top one. We detail the comparisons in § 6. Another related work is the control-variate-based method by [22]. However, the sampling process of this method is node-wise, and the historical activations of nodes are required.
3 Notations and Preliminaries
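Notations. This paper mainly focuses on undirected graphs. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote the undirected graph with nodes $v_i \in \mathcal{V}$ and edges $(v_i, v_j) \in \mathcal{E}$, and let $N$ denote the number of nodes. The adjacency matrix $A \in \mathbb{R}^{N \times N}$ stores the weight associated with edge $(v_i, v_j)$ in its element $A_{ij}$. We also have a feature matrix $X \in \mathbb{R}^{N \times D}$, with $x_i$ denoting the $D$-dimensional feature of node $v_i$.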
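GCN. The GCN model developed by Kipf and Welling [9] is one of the most successful convolutional networks for graph representation learning. If we define $h^{(l)}(v_i)$ as the hidden feature of the $l$-th layer for node $v_i$, the feed-forward propagation becomes
$$h^{(l+1)}(v_i) = \sigma\Big(\sum_{j=1}^{N} \hat{a}(v_i, u_j)\, h^{(l)}(u_j)\, W^{(l)}\Big), \quad i = 1, \cdots, N, \tag{1}$$
where $\hat{A} = (\hat{a}(v_i, u_j)) \in \mathbb{R}^{N \times N}$ is the re-normalization of the adjacency matrix; $\sigma(\cdot)$ is a nonlinear function; $W^{(l)} \in \mathbb{R}^{D^{(l)} \times D^{(l-1)}}$ is the filter matrix in the $l$-th layer; and we denote the nodes in the $l$-th layer as $u_j$ to distinguish them from those in the $(l+1)$-th layer.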
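To make the full-neighborhood cost of this update concrete, the following is a minimal sketch of one GCN layer on a toy graph. It assumes the symmetric re-normalization $\tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ of Kipf and Welling [9] and a ReLU nonlinearity; the helper names (`renormalize`, `matmul`, `gcn_layer`) and the toy data are illustrative choices, not part of the original work.

```python
import math

def renormalize(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(a_hat, h, w):
    """One full-batch layer: ReLU(A_hat @ H @ W), touching every neighbour."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

# Toy path graph 0 - 1 - 2 with 2-dimensional input features (assumed data).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_hat = renormalize(adj)
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0 = [[1.0], [0.5]]
h1 = gcn_layer(a_hat, h0, w0)
```

Every output row depends on a full row of `a_hat`, which is the cost that the sampling schemes in the next section try to avoid.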
4 Adaptive Sampling
Eq. (1) indicates that GCNs require the full expansion of neighborhoods for the feed-forward computation of each node. This makes learning computationally intensive and memory-consuming on large-scale graphs containing more than hundreds of thousands of nodes. To circumvent this issue, this paper speeds up the feed-forward propagation by adaptive sampling. The proposed sampler is adaptable and applicable for variance reduction.
We first reformulate the GCN update in expectation form and introduce node-wise sampling accordingly. Then, we generalize node-wise sampling to a more efficient framework termed layer-wise sampling. To minimize the resulting variance, we further propose to learn the layer-wise sampler by performing variance reduction explicitly. Lastly, we introduce the concept of the skip connection, and apply it to enable the second-order proximity for the feed-forward propagation.
4.1 From Node-Wise Sampling to Layer-Wise Sampling
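Node-Wise Sampling. We first observe that Eq. (1) can be rewritten in expectation form, namely,
$$h^{(l+1)}(v_i) = \sigma_{W^{(l)}}\Big(N(v_i)\, \mathbb{E}_{p(u_j|v_i)}\big[h^{(l)}(u_j)\big]\Big), \tag{2}$$
where we have included the weight matrix $W^{(l)}$ into the function $\sigma(\cdot)$ for concision; $p(u_j|v_i) = \hat{a}(v_i, u_j)/N(v_i)$ defines the probability of sampling $u_j$ given $v_i$, with $N(v_i) = \sum_{j=1}^{N} \hat{a}(v_i, u_j)$.
A natural idea to speed up Eq. (2) is to approximate the expectation by Monte-Carlo sampling. To be specific, we estimate the expectation $\mu_p(v_i) = \mathbb{E}_{p(u_j|v_i)}[h^{(l)}(u_j)]$ with $\hat{\mu}_p(v_i)$ given by
$$\hat{\mu}_p(v_i) = \frac{1}{n}\sum_{j=1}^{n} h^{(l)}(\hat{u}_j), \quad \hat{u}_j \sim p(u_j|v_i). \tag{3}$$
By setting $n \ll N$, the Monte-Carlo estimation can reduce the complexity of Eq. (1) from $O(|\mathcal{E}| D^{(l)} D^{(l-1)})$ ($|\mathcal{E}|$ denotes the number of edges) to $O(n^2 D^{(l)} D^{(l-1)})$ if the numbers of the sampling points for the $(l+1)$-th and $l$-th layers are both $n$.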
By applying Eq. (3) in a multi-layer network, we construct the network structure in a top-down manner: sampling the neighbours of each node in the current layer recursively (see Figure 5 (a)). However, such node-wise sampling is still computationally expensive for deep networks, because the number of nodes to be sampled grows exponentially with the number of layers. Taking a network with depth $d$ for example, the number of sampling nodes in the input layer will increase to $O(n^d)$, leading to a significant computational burden for large $d$. (One can reduce the complexity of the node-wise sampling by removing the repeated nodes. Even so, for dense graphs, the sampled nodes will still quickly fill up the whole graph as the depth grows.)
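The node-wise Monte-Carlo step can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the authors' implementation: hidden features are scalars, and the names `nodewise_estimate`, `a_hat_row` and `h` are hypothetical. The sketch draws neighbours of a single parent with probability proportional to its re-normalized adjacency weights and rescales the sample mean by the row sum.

```python
import random

def nodewise_estimate(a_hat_row, h, n, rng):
    """Estimate sum_j a_hat(v_i, u_j) * h(u_j) from n neighbours sampled for v_i."""
    total = sum(a_hat_row)                      # N(v_i), the row sum
    probs = [a / total for a in a_hat_row]      # p(u_j | v_i)
    picks = rng.choices(range(len(h)), weights=probs, k=n)
    return total * sum(h[j] for j in picks) / n

# Assumed toy data: one row of the re-normalized adjacency and scalar features.
a_hat_row = [0.5, 0.3, 0.0, 0.2]
h = [1.0, 2.0, 5.0, 3.0]
exact = sum(a * x for a, x in zip(a_hat_row, h))
approx = nodewise_estimate(a_hat_row, h, 20000, random.Random(0))
```

With many samples `approx` approaches `exact`; in practice $n$ is small, and the estimate is repeated independently for every parent node, which is what drives the exponential growth with depth.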
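Layer-Wise Sampling. We equivalently transform Eq. (2) to the following form by applying importance sampling, i.e.,
$$h^{(l+1)}(v_i) = \sigma_{W^{(l)}}\Big(N(v_i)\, \mathbb{E}_{q(u_j|v_1,\cdots,v_n)}\Big[\frac{p(u_j|v_i)}{q(u_j|v_1,\cdots,v_n)}\, h^{(l)}(u_j)\Big]\Big), \tag{4}$$
where $q(u_j|v_1,\cdots,v_n)$ is defined as the probability of sampling $u_j$ given all the nodes of the current layer (i.e., $v_1,\cdots,v_n$). Similarly, we can speed up Eq. (4) by approximating the expectation with the Monte-Carlo mean, namely, computing $h^{(l+1)}(v_i) = \sigma_{W^{(l)}}(N(v_i)\,\hat{\mu}_q(v_i))$ with
$$\hat{\mu}_q(v_i) = \frac{1}{n}\sum_{j=1}^{n} \frac{p(\hat{u}_j|v_i)}{q(\hat{u}_j|v_1,\cdots,v_n)}\, h^{(l)}(\hat{u}_j), \quad \hat{u}_j \sim q(\hat{u}_j|v_1,\cdots,v_n). \tag{5}$$
We term the sampling in Eq. (5) the layer-wise sampling strategy. As opposed to the node-wise method in Eq. (3), where the nodes $\{\hat{u}_j\}_{j=1}^{n}$ are generated for each parent $v_i$ independently, the sampling in Eq. (5) is required to be performed only once. Besides, in node-wise sampling the neighborhoods of each node are not visible to other parents, while in layer-wise sampling all sampled nodes $\{\hat{u}_j\}_{j=1}^{n}$ are shared by all nodes of the current layer. This sharing property enhances message passing as much as possible. More importantly, the size of each layer is fixed to $n$, and the total number of sampled nodes only grows linearly with the network depth.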
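A minimal sketch of the layer-wise estimator follows, assuming scalar features and a uniform proposal $q$ purely for illustration (the learned sampler of § 4.2 is not modelled here). The function name `layerwise_estimate` and the toy data are hypothetical. The point is that a single set of samples is drawn once and reused by every parent row, with the ratio $p/q$ correcting for the shared proposal.

```python
import random

def layerwise_estimate(a_hat_rows, h, q, n, rng):
    """Estimate (A_hat @ h)[i] for every parent i from ONE shared sample set.

    Since p(u_j | v_i) = a_hat(v_i, u_j) / N(v_i), the product
    N(v_i) * p / q simplifies to a_hat(v_i, u_j) / q(u_j).
    q must be positive wherever a_hat is non-zero.
    """
    picks = rng.choices(range(len(h)), weights=q, k=n)   # drawn once per layer
    return [sum(row[j] / q[j] * h[j] for j in picks) / n for row in a_hat_rows]

# Assumed toy data: two parents sharing the same four candidate nodes.
a_hat_rows = [[0.5, 0.3, 0.0, 0.2],
              [0.1, 0.4, 0.3, 0.2]]
h = [1.0, 2.0, 5.0, 3.0]
q = [0.25, 0.25, 0.25, 0.25]
exact = [sum(a * x for a, x in zip(row, h)) for row in a_hat_rows]
approx = layerwise_estimate(a_hat_rows, h, q, 20000, random.Random(0))
```

The estimate is unbiased for any valid `q`, but its variance depends strongly on the choice of `q`, which motivates the variance-reduction step discussed next.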
4.2 Explicit Variance Reduction
The remaining question for the layer-wise sampling is how to define the exact form of the sampler $q(u_j|v_1,\cdots,v_n)$. Indeed, a good estimator should reduce the variance caused by the sampling process, since high variance is likely to impede efficient training. For simplicity, we concisely denote the distribution $q(u_j|v_1,\cdots,v_n)$ as $q(u_j)$ below.
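According to the derivations of importance sampling in [23], we immediately conclude that
Proposition 1. The variance of the estimator $\hat{\mu}_q(v_i)$ in Eq. (5) is given by
$$\mathrm{Var}_q(\hat{\mu}_q(v_i)) = \frac{1}{n}\, \mathbb{E}_{q(u_j)}\Big[\frac{\big(p(u_j|v_i)\,|h^{(l)}(u_j)| - \mu_q(v_i)\, q(u_j)\big)^2}{q^2(u_j)}\Big], \tag{6}$$
where $\mu_q(v_i)$ denotes the expectation estimated by $\hat{\mu}_q(v_i)$. The optimal sampler to minimize the variance in Eq. (6) is given by
$$q^*(u_j) = \frac{p(u_j|v_i)\,|h^{(l)}(u_j)|}{\sum_{j=1}^{N} p(u_j|v_i)\,|h^{(l)}(u_j)|}. \tag{7}$$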
Unfortunately, it is infeasible to compute the optimal sampler in our case. By its definition, the sampler $q^*(u_j)$ is computed based on the hidden feature $h^{(l)}(u_j)$, which is aggregated from its neighborhoods in previous layers. However, under our top-down sampling framework, the neural units of lower layers are unknown until the network has been completely constructed by the sampling.
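To alleviate this chicken-and-egg dilemma, we learn a self-dependent function of each node to determine its importance for the sampling. Let $g(x(u_j))$ be the self-dependent function computed based on the node feature $x(u_j)$. Replacing the hidden function in Eq. (7) with $g(x(u_j))$ arrives at
$$q^*(u_j) = \frac{p(u_j|v_i)\,|g(x(u_j))|}{\sum_{j=1}^{N} p(u_j|v_i)\,|g(x(u_j))|}. \tag{8}$$
The sampler by Eq. (8) is node-wise and varies for different $v_i$. To make it applicable for the layer-wise sampling, we summarize the computations over all nodes $\{v_i\}_{i=1}^{n}$, and thus attain
$$q^*(u_j) = \frac{\sum_{i=1}^{n} p(u_j|v_i)\,|g(x(u_j))|}{\sum_{j=1}^{N}\sum_{i=1}^{n} p(u_j|v_i)\,|g(x(u_j))|}. \tag{9}$$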
In this paper, we define $g(x(u_j))$ as a linear function, i.e., $g(x(u_j)) = W_g\, x(u_j)$, parameterized by the matrix $W_g \in \mathbb{R}^{1 \times D}$. Computing the sampler in Eq. (9) is efficient, since computing $p(u_j|v_i)$ (i.e., the adjacency value) and the self-dependent function $g(x(u_j))$ is fast.
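As a rough sketch of how the layer-wise sampler in Eq. (9) could be evaluated with a linear self-dependent function, consider the snippet below. The function name `layerwise_sampler`, the argument layout, and the toy inputs are assumptions made for illustration; the code also assumes every parent row has a non-zero sum.

```python
def layerwise_sampler(a_hat_rows, x, w_g):
    """q(u_j) proportional to sum_i p(u_j | v_i) * |g(x(u_j))|, with g(x) = w_g . x."""
    n_nodes = len(x)
    # |g(x(u_j))| for every candidate node, using a linear g.
    g = [abs(sum(w * f for w, f in zip(w_g, feat))) for feat in x]
    scores = [0.0] * n_nodes
    for row in a_hat_rows:                 # one row per upper-layer node v_i
        total = sum(row)                   # N(v_i)
        for j in range(n_nodes):
            scores[j] += row[j] / total * g[j]
    z = sum(scores)
    return [s / z for s in scores]

# Assumed toy data: two upper-layer nodes, three candidates, 2-d features.
a_hat_rows = [[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]]
x = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
w_g = [1.0, 1.0]
q = layerwise_sampler(a_hat_rows, x, w_g)
```

Only adjacency weights and raw input features are needed, so the distribution can be computed before any hidden feature of the lower layer exists.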
Note that applying the sampler given by Eq. (9) does not necessarily result in a minimal variance. To achieve variance reduction, we add the variance to the loss function and explicitly minimize it during model training. Suppose we have a mini-batch of data pairs $\{(v_i, y_i)\}_{i=1}^{n}$, where $v_i$ is the target node and $y_i$ is the corresponding ground-truth label. By the layer-wise sampling (Eq. (9)), the nodes of the previous layer are sampled given $\{v_i\}_{i=1}^{n}$, and this process is called recursively layer by layer until we reach the input domain. Then we perform a bottom-up propagation to compute the hidden features and obtain the estimated activation for node $v_i$, i.e., $\hat{\mu}_q(v_i)$. Certain nonlinear and softmax functions are further applied to $\hat{\mu}_q(v_i)$ to produce the prediction $\bar{y}(\hat{\mu}_q(v_i))$. By taking the classification loss and the variance (Eq. (6)) into account, we formulate a hybrid loss as
$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}_c\big(y_i, \bar{y}(\hat{\mu}_q(v_i))\big) + \lambda\, \mathrm{Var}_q(\hat{\mu}_q(v_i)), \tag{10}$$
where $\mathcal{L}_c$ is the classification loss (e.g., the cross entropy); $\lambda$ is the trade-off parameter, fixed as 0.5 in our experiments. Note that the activations of other hidden layers are also stochastic, and the resulting variances should be reduced as well. In Eq. (10) we only penalize the variance of the top layer for efficient computation, and find it sufficient to deliver promising performance in our experiments.
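The structure of the hybrid loss can be sketched as follows. This is a simplified stand-in rather than the authors' formulation: the variance term here is the plain empirical variance of the mean of the importance-weighted terms, which is an assumption of this sketch, and the names `estimator_variance` and `hybrid_loss` are hypothetical.

```python
import math

def estimator_variance(weighted_terms):
    """Empirical variance of the mean of i.i.d. terms (p / q) * h (assumed stand-in)."""
    n = len(weighted_terms)
    mean = sum(weighted_terms) / n
    return sum((t - mean) ** 2 for t in weighted_terms) / (n * n)

def hybrid_loss(pred_probs, labels, variance, lam=0.5):
    """Mean cross entropy plus lam times the variance penalty."""
    ce = sum(-math.log(p[y]) for p, y in zip(pred_probs, labels)) / len(labels)
    return ce + lam * variance

# Assumed toy values: one sample with a 50/50 prediction, two weighted terms.
var = estimator_variance([1.0, 3.0])
loss = hybrid_loss([[0.5, 0.5]], [0], var)
```

Setting `lam=0.0` recovers the pure classification loss, which corresponds to training without the variance penalty.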
To minimize the hybrid loss in Eq. (10), gradient calculations are required. For the network parameters, e.g., $W^{(l)}$ in Eq. (2), the gradient calculation is straightforward and can be easily derived by an automatic-differentiation platform, e.g., TensorFlow [24]. For the parameters of the sampler, e.g., $W_g$ in Eq. (9), calculating the gradient is nontrivial, as the sampling process (Eq. (5)) is non-differentiable. Fortunately, we prove that the gradient of the classification loss with respect to the sampler is zero. We also derive the gradient of the variance term regarding the sampler, and detail the gradient calculation in the supplementary material.
5 Preserving Second-Order Proximities by Skip Connections
The GCN update in Eq. (1) only aggregates messages passed from 1-hop neighborhoods. To allow the network to better utilize information across distant nodes, we can sample the multi-hop neighborhoods for the GCN update in a similar way to the random walk [6, 10]. However, the random walk requires extra sampling to obtain distant nodes, which is computationally expensive for dense graphs. In this paper, we propose to propagate the information over distant nodes via skip connections.
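The key idea of the skip connection is to reuse the nodes of the $(l-1)$-th layer to preserve the second-order proximity (see the definition in [7]). For the $(l+1)$-th layer, the nodes of the $(l-1)$-th layer are actually the 2-hop neighborhoods. If we further add a skip connection from the $(l-1)$-th to the $(l+1)$-th layer, as illustrated in Figure 5 (c), the aggregation will involve both the 1-hop and 2-hop neighborhoods. The calculations along the skip connection are formulated as
$$h^{(l+1)}_{skip}(v_i) = \sum_{j=1}^{n} \hat{a}_{skip}(v_i, s_j)\, h^{(l-1)}(s_j)\, W^{(l-1)}_{skip}, \quad i = 1, \cdots, n, \tag{11}$$
where $s = \{s_j\}_{j=1}^{n}$ denote the nodes in the $(l-1)$-th layer. Due to the 2-hop distance between $v_i$ and $s_j$, the weight $\hat{a}_{skip}(v_i, s_j)$ is supposed to be the element of $\hat{A}^2$. Here, to avoid the full computation of $\hat{A}^2$, we estimate the weight with the sampled nodes of the $l$-th layer, i.e.,
$$\hat{a}_{skip}(v_i, s_j) \approx \sum_{k=1}^{n} \hat{a}(v_i, u_k)\, \hat{a}(u_k, s_j). \tag{12}$$
Instead of learning a free $W^{(l-1)}_{skip}$ in Eq. (11), we decompose it to be
$$W^{(l-1)}_{skip} = W^{(l-1)}\, W^{(l)}, \tag{13}$$
where $W^{(l)}$ and $W^{(l-1)}$ are the filters of the $l$-th and $(l-1)$-th layers in the original network, respectively. The output of the skip connection is added to the output of the GCN layer (Eq. (1)) before the nonlinearity.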
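The skip-connection branch can be illustrated with a short sketch. It assumes dense sub-matrices of the re-normalized adjacency restricted to the sampled nodes and right-multiplication of features by the filters; the function name `skip_branch`, the argument names and the toy numbers are hypothetical.

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def skip_branch(a_upper, a_lower, h_prev, w_prev, w_cur):
    """Skip-connection output before the nonlinearity.

    a_upper: weights a_hat(v_i, u_k) between the top and middle sampled nodes.
    a_lower: weights a_hat(u_k, s_j) between the middle and bottom sampled nodes.
    """
    a_skip = matmul(a_upper, a_lower)   # 2-hop weights via the sampled middle layer
    w_skip = matmul(w_prev, w_cur)      # reuse existing filters, no new parameters
    return matmul(matmul(a_skip, h_prev), w_skip)

# Assumed toy data: 1 top node, 2 middle nodes, 3 bottom nodes, scalar features.
a_upper = [[0.5, 0.5]]
a_lower = [[1.0, 0.0, 0.0],
           [0.0, 0.5, 0.5]]
h_prev = [[1.0], [2.0], [4.0]]
w_prev, w_cur = [[2.0]], [[3.0]]
out = skip_branch(a_upper, a_lower, h_prev, w_prev, w_cur)
```

Because the middle-layer nodes are already sampled, the 2-hop weights cost one small matrix product instead of a full $\hat{A}^2$.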
With the skip connection, the second-order proximity is maintained without extra 2-hop sampling. Besides, the skip connection allows information to pass between two distant layers, thus enabling more efficient back-propagation and model training.
While the designs are similar, our motivation for applying the skip connection differs from that of the residual function in ResNets [1]. The purpose of employing the skip connection in [1] is to gain accuracy by increasing the network depth. Here, we apply it to preserve the second-order proximity. In contrast to the identity mappings used in ResNets, the calculation along the skip connection in our model has to be derived specifically (see Eq. (12) and Eq. (13)).
6 Discussions and Extensions
Relation to other sampling methods. We contrast our approach with GraphSAGE [3] and FastGCN [21] regarding the following aspects:
1. The proposed layer-wise sampling method is novel. GraphSAGE randomly samples a fixed-size neighborhood of each node, while FastGCN constructs each layer independently according to an identical distribution. In our layer-wise approach, the nodes in lower layers are sampled conditioned on the upper ones, which makes it capable of capturing the between-layer correlations.
2. Our framework is general. Both GraphSAGE and FastGCN can be categorized as specific variants of our framework. Specifically, the GraphSAGE model is regarded as a node-wise sampler in Eq. (3) if $p(u_j|v_i)$ is defined as the uniform distribution; FastGCN can be considered a special layer-wise method obtained by applying in Eq. (5) a sampler $q(u_j)$ that is independent of the nodes $\{v_i\}_{i=1}^{n}$.
3. Our sampler is parameterized and trainable for explicit variance reduction. The sampler of GraphSAGE or FastGCN involves no parameters and is not adaptive for minimizing variance. In contrast, our sampler modifies the optimal importance sampling distribution with a self-dependent function. The resulting variance is explicitly reduced by fine-tuning the network and the sampler.
Taking the attention into account. The GAT model [13] applies the idea of self-attention to graph representation learning. Concisely, it replaces the re-normalization of the adjacency matrix in Eq. (1) with specific attention values, i.e., $h^{(l+1)}(v_i) = \sigma\big(\sum_{j=1}^{N} a(h^{(l)}(v_i), h^{(l)}(u_j))\, h^{(l)}(u_j)\, W^{(l)}\big)$, where $a(h^{(l)}(v_i), h^{(l)}(u_j))$ measures the attention value between the hidden features of $v_i$ and $u_j$, which is derived as $a(h^{(l)}(v_i), h^{(l)}(u_j)) = \mathrm{SoftMax}(\mathrm{LeakyReLU}(W_1 h^{(l)}(v_i), W_2 h^{(l)}(u_j)))$ by using the LeakyReLU nonlinearity and SoftMax normalization with parameters $W_1$ and $W_2$.
It is impracticable to apply the GAT-like attention mechanism directly in our framework, as the probability $p(u_j|v_i)$ in Eq. (9) would become related to the attention value $a(h^{(l)}(v_i), h^{(l)}(u_j))$, which is determined by the hidden features of the $l$-th layer. As discussed in § 4.2, computing the hidden features of lower layers is impossible unless the network is already built after sampling. To solve this issue, we develop a novel attention mechanism by applying the self-dependent function, similar to Eq. (9). The attention is computed as
(14) 
where and are the learnable parameters.
7 Experiments
We evaluate the performance of our methods on the following benchmarks: (1) categorizing academic papers in the citation network datasets Cora, Citeseer and Pubmed [11]; (2) predicting which community different posts belong to in Reddit [3]. These graphs vary in size from small to large. In particular, the numbers of nodes in Cora and Citeseer are of scale $O(10^3)$, while Pubmed and Reddit contain more than $10^4$ and $10^5$ vertices, respectively. Following the supervised learning scenario in FastGCN [21], we use all labels of the training examples for training. More details of the benchmark datasets and more experimental evaluations are presented in the supplementary material.
Our sampling framework is inductive in the sense that it clearly separates out test data from training. In contrast to the transductive learning where all vertices should be provided, our approach aggregates the information from each node’s neighborhoods to learn structural properties that can be generalized to unseen nodes. For testing, the embedding of a new node may be either computed by using the full GCN architecture or approximated through sampling as is done in model training. Here we use the full architecture as it is more straightforward and easier to implement. For all datasets, we employ the network with two hidden layers as usual. The hidden dimensions for the citation network datasets (i.e., Cora, Citeseer and Pubmed) are set to be 16. For the Reddit dataset, the hidden dimensions are selected to be 256 as suggested by [3]. The numbers of the sampling nodes for all layers excluding the top one are set to for Cora and Citeseer, for Pubmed and for Reddit. The sizes of the top layer (i.e. the stochastic minibatch size) are chosen to be 256 for all datasets. We train all models using early stopping with a window size of 30, as suggested by [9]. Further details on the network architectures and training settings are contained in the supplementary material.
The accuracy curves of test data on Cora, Citeseer and Reddit. Here, one training epoch means a complete pass of all training samples.
7.1 Ablation Studies on the Adaptive Sampling
Baselines. The codes of GraphSAGE [3] and FastGCN [21] provided by the authors are implemented inconsistently; here we re-implement them based on our framework to make the comparisons fairer. (We also perform experimental comparisons using the public code of FastGCN in the supplementary material.) In detail, we implement the GraphSAGE method by applying the node-wise strategy with a uniform sampler in Eq. (3), where the number of sampled neighbors for each node is set to 5. For FastGCN, we adopt the Independent-Identical-Distribution (IID) sampler proposed by [21] in Eq. (5), where the number of sampled nodes for each layer is the same as in our method. For consistency, the re-implementations of GraphSAGE and FastGCN are named Node-Wise and IID in our experiments. We also implement the Full GCN architecture as a strong baseline. All compared methods share the same network structure and training settings for fair comparison. We have also applied the attention mechanism introduced in § 6 to all methods.
Comparisons with other sampling methods. The random seeds are fixed and no early stopping is used for the experiments here. Figure 5 reports the convergence behavior of all compared methods during training on Cora, Citeseer and Reddit. (The results on Pubmed are provided in the supplementary material.) It demonstrates that our method, denoted as Adapt, converges faster than the other sampling counterparts on all three datasets. Interestingly, our method even outperforms the Full model on Cora and Reddit. Similar to our method, the IID sampling is also layer-wise, but it constructs each layer independently. Thanks to the conditional sampling, our method achieves a more stable convergence curve than the IID method, as Figure 5 shows. It turns out that considering the between-layer information helps both stability and accuracy.
Moreover, we plot the training time in Figure 3 (a). Clearly, all sampling methods run faster than the Full model. Compared to the Node-Wise method, our approach exhibits a higher training speed due to its more compact architecture. To show this, suppose the number of nodes in the top layer is $n$; then for the Node-Wise method the input, hidden and top layers are of sizes $25n$, $5n$ and $n$, respectively, while the number of nodes in every layer is $n$ for our model. Even with fewer sampled nodes, our model still surpasses the Node-Wise method according to the results in Figure 5.
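The layer sizes of the two schemes follow from simple arithmetic, sketched below. The sketch assumes a top layer of 256 nodes (the mini-batch size used in the experiments), the fan-out of 5 used for the Node-Wise baseline, and, for illustration only, the same size for every layer of the layer-wise model; the function names are hypothetical.

```python
def nodewise_layer_sizes(top, fanout, depth):
    """Upper bound on nodes per layer (top to input) when each node samples `fanout` neighbours."""
    return [top * fanout ** k for k in range(depth + 1)]

def layerwise_layer_sizes(top, n, depth):
    """Nodes per layer (top to input) when each lower layer is sampled as a whole with n nodes."""
    return [top] + [n] * depth

nodewise = nodewise_layer_sizes(top=256, fanout=5, depth=2)
layerwise = layerwise_layer_sizes(top=256, n=256, depth=2)
```

Under these assumptions the node-wise scheme touches up to `sum(nodewise)` nodes per batch, against `sum(layerwise)` for the layer-wise scheme, and the gap widens geometrically with depth.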
How important is the variance reduction? To justify the importance of the variance reduction, we implement a variant of our model by setting the trade-off parameter to $\lambda = 0$ in Eq. (10). In this variant, the parameters of the self-dependent function are randomly initialized and never trained. Figure 5 shows that removing the variance loss does decrease the accuracy of our method on Cora and Reddit. For Citeseer, the effect of removing the variance reduction is less significant. We conjecture that this is because the average degree of Citeseer (i.e., 1.4) is smaller than that of Cora (i.e., 2.0) and Reddit (i.e., 492), so penalizing the variance matters less given the limited diversity of neighborhoods.
Comparisons with other state-of-the-art methods. We contrast the performance of our methods with the graph kernel method KLED [25] and the Diffusion Convolutional Network (DCN) [18]. We use the results of KLED and DCN on Cora and Pubmed reported in [18]. We also summarize the results of GraphSAGE and FastGCN obtained with their original implementations. For GraphSAGE, we report the results of the mean aggregator with the default parameters. For FastGCN, we directly use the results provided by [21]. For the baselines and our approach, we run the experiments with random seeds over 20 trials and record the mean accuracies and the standard variances. All results are organized in Table 1. As expected, our method achieves the best performance on all datasets, which is consistent with the results in Figure 5. It is also observed that removing the variance reduction decreases the performance of our method, especially on Cora and Reddit.
Methods  Cora  Citeseer  Pubmed  Reddit
KLED [25]  0.8229  -  0.8228  -
2-hop DCNN [18]  0.8677  -  0.8976  -
FastGCN [21]  0.8500  0.7760  0.8800  0.9370
GraphSAGE [3]  0.8220  0.7140  0.8710  0.9432
Full
IID
Node-Wise
Adapt (no vr)
Adapt
7.2 Evaluations of the Skip Connection
We evaluate the effectiveness of the skip connection on Cora. For the experiments on other datasets, we present the details in the supplementary material. The original network has two hidden layers. We further add a skip connection between the input and top layers, using the computations in Eq. (12) and Eq. (13). Figure 5 displays the convergence curves of the original Adapt method and its variant with the skip connection, where the random seeds are shared and no early stopping is adopted. Although the improvement in final accuracy from our skip connection is not large, it does speed up the convergence significantly. This can be observed from Figure 3 (b), where adding the skip connection reduces the number of epochs required to converge from around 150 to 100.
Adapt  Adapt+sc  Adapt+2-hop

We run experiments with different random seeds over 20 trials and report the mean results obtained by early stopping in Table 2. It is observed that the skip connection slightly improves the performance. Besides, we explicitly involve the 2-hop neighborhood sampling in our method by replacing the re-normalization matrix $\hat{A}$ with its 2-order power expansion, i.e., $\hat{A} + \hat{A}^2$. As displayed in Table 2, the explicit 2-hop sampling further boosts the classification accuracy. Although the skip-connection method is slightly inferior to the explicit 2-hop sampling, it avoids the computation of the second-order term (i.e., $\hat{A}^2$) and is therefore computationally more attractive for large and dense graphs.
8 Conclusion
We present a framework to accelerate the training of GCNs through a sampling method that constructs the network layer by layer. The developed layer-wise sampler is adaptive for variance reduction. Our method outperforms the other sampling-based counterparts, GraphSAGE and FastGCN, in effectiveness and accuracy in extensive experiments. We also explore how to preserve the second-order proximity by using the skip connection. The experimental evaluations demonstrate that the skip connection further enhances our method in terms of convergence speed and eventual classification accuracy.
References

 [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
 [3] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1025–1035, 2017.
 [4] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pages 6533–6542, 2017.
 [5] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
 [6] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
 [7] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
 [8] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
 [9] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 [10] Felipe Petroski Such, Shagan Sah, Miguel Alexander Dominguez, Suhas Pillai, Chao Zhang, Andrew Michael, Nathan D Cahill, and Raymond Ptucha. Robust spatial filtering with graph convolutional neural networks. IEEE Journal of Selected Topics in Signal Processing, 11(6):884–896, 2017.
 [11] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93, 2008.
 [12] Wei Liu, Jun Wang, and Shih-Fu Chang. Robust and scalable graph-based semi-supervised learning. Proceedings of the IEEE, 100(9):2624–2638, 2012.
 [13] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
 [14] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
 [15] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
 [16] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
 [17] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
 [18] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.
 [19] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International conference on machine learning, pages 2014–2023, 2016.
 [20] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proc. CVPR, volume 1, page 3, 2017.
 [21] Jie Chen, Tengfei Ma, and Cao Xiao. Fastgcn: Fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247, 2018.
 [22] Jianfei Chen, Jun Zhu, and Le Song. Stochastic training of graph convolutional networks with variance reduction. In International conference on machine learning, 2018.
 [23] Art B. Owen. Monte Carlo theory, methods and examples. 2013.
 [24] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 [25] François Fouss, Kevin Françoisse, Luh Yen, Alain Pirotte, and Marco Saerens. An experimental investigation of graph kernels on a collaborative recommendation task. In Proceedings of the 6th International Conference on Data Mining (ICDM 2006), pages 863–868, 2006.
Appendix
This supplementary material provides the gradient calculation of the loss function (Eq. (10)) with respect to the sampler. It also contains further experimental settings and additional results.
9 Gradient Calculation
We prove that the gradient of the expectation in Eq. (5) with respect to the sampler is equal to zero. To demonstrate this, we decompose the gradient as
(15)  
Hence, the gradient of the classification loss in Eq. (10) with respect to the sampler is equal to zero. To compute the gradient of the variance term, we first estimate the variance from the sampled instances by
(16) 
whose gradient is given by
(17) 
where the samples are generated independently from the sampler.
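The zero-gradient claim for the expectation can be sketched in generic importance-sampling notation. The symbols below are assumptions introduced for illustration and need not match those of Eq. (5): $q_\theta(u)$ denotes a sampler with parameters $\theta$, $p(u)$ the target distribution, and $h(u)$ a quantity that does not depend on $\theta$. Under the additional assumption that $q_\theta(u) > 0$ wherever $p(u)\,h(u) \neq 0$, the sampler cancels inside the integral:

```latex
\begin{aligned}
\nabla_\theta \, \mathbb{E}_{u \sim q_\theta}\!\left[ \frac{p(u)}{q_\theta(u)}\, h(u) \right]
  &= \nabla_\theta \int q_\theta(u)\, \frac{p(u)}{q_\theta(u)}\, h(u)\, \mathrm{d}u \\
  &= \nabla_\theta \int p(u)\, h(u)\, \mathrm{d}u \\
  &= 0 .
\end{aligned}
```

The variance of the importance-weighted term, by contrast, does depend on $q_\theta$, which is consistent with only the variance term supplying a gradient signal to the sampler.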
10 More Experimental Evaluations
Datasets. The Cora, Citeseer and Pubmed datasets are downloaded from https://github.com/tkipf/gcn. We follow the setting of [21], keeping the validation and test indices unchanged but using all remaining samples for training. The Reddit dataset is from http://snap.stanford.edu/graphsage/. The statistics of the four datasets are summarized in Table 3.
Datasets  Nodes    Edges       Classes  Features  Training/Validation/Testing
Cora      2,708    5,429       7        1,433     1,208/500/1,000
Citeseer  3,327    4,732       6        3,703     1,812/500/1,000
Pubmed    19,717   44,338      3        500       18,217/500/1,000
Reddit    232,965  11,606,919  41       602       152,410/23,699/55,334
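The split rule described above (keep the validation and test indices, train on everything else) can be sketched as a set complement. This is a minimal illustration, not the code used in the experiments: the function name `remaining_train_indices` and the index ranges below are hypothetical placeholders, chosen only so that the counts match the table.

```python
def remaining_train_indices(num_nodes, val_idx, test_idx):
    """Return every node index that is in neither the validation nor the test set."""
    held_out = set(val_idx) | set(test_idx)
    return [i for i in range(num_nodes) if i not in held_out]


# Hypothetical index placement: only the set sizes (500 and 1,000) come from the table.
cora_val = range(140, 640)       # 500 validation nodes
cora_test = range(1708, 2708)    # 1,000 test nodes
cora_train = remaining_train_indices(2708, cora_val, cora_test)

pubmed_train = remaining_train_indices(19717, range(0, 500), range(500, 1500))
```

With these sizes the complement contains 1,208 nodes for Cora and 18,217 for Pubmed, matching the table. The Citeseer training count in the table (1,812) is 15 lower than the plain complement 3,327 − 1,500; this sketch does not model that difference.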
Further implementation details. The initial learning rates for the Adam optimizer are set to 0.001 for Cora, Citeseer and Pubmed, and 0.01 for Reddit. The weight decay is set to 0.0004 for all datasets. We use ReLU as the activation function and apply no dropout in our experiments. As presented in the paper, all models are implemented as networks with 2 hidden layers. For the Reddit dataset, we follow the suggestion of [21] to fix the weight of the bottom layer and precompute the product given the input features for efficiency. All experiments are conducted on a single Tesla P40 GPU. We apply early stopping during training with a window size of 30 and use the model that achieves the best validation accuracy for testing.
More results on the variance reduction. As shown in Table 1, reducing the variance of the top layer alone is sufficient to boost the performance. It is nevertheless straightforward to reduce the variances of all layers in our method, e.g., by adding them all to the loss. To show this, we conduct an experiment on Cora by minimizing the variances of both the first and top hidden layers, where the experimental settings are the same as in Table 1. The result is , which slightly outperforms the original accuracy in Table 1 (i.e. ).
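The training configuration given in the implementation details can be sketched as follows. The hyperparameter values are taken from the text; everything else is an assumption made for illustration. In particular, the "window size of 30" is interpreted here as a patience rule (stop once the validation accuracy has not improved for 30 consecutive epochs), which may differ from the authors' exact criterion, and the names `HYPERPARAMS`, `EarlyStopper` and `run` are hypothetical.

```python
from dataclasses import dataclass

# Values from the text; the dictionary layout itself is an illustrative choice.
HYPERPARAMS = {
    "cora": {"lr": 0.001, "weight_decay": 0.0004},
    "citeseer": {"lr": 0.001, "weight_decay": 0.0004},
    "pubmed": {"lr": 0.001, "weight_decay": 0.0004},
    "reddit": {"lr": 0.01, "weight_decay": 0.0004},
}


@dataclass
class EarlyStopper:
    """Patience-style early stopping (assumed reading of the 'window size')."""
    window: int = 30
    best_acc: float = float("-inf")
    best_epoch: int = -1

    def update(self, epoch: int, val_acc: float) -> bool:
        """Record val_acc and return True when training should stop."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.best_epoch = epoch  # the checkpoint from this epoch would be kept
        return epoch - self.best_epoch >= self.window


def run(val_accs, window=30):
    """Replay a sequence of validation accuracies; return (best_epoch, stop_epoch)."""
    stopper = EarlyStopper(window=window)
    stop_epoch = None
    for epoch, acc in enumerate(val_accs):
        if stopper.update(epoch, acc):
            stop_epoch = epoch
            break
    return stopper.best_epoch, stop_epoch
```

In a real training loop, the model weights would be saved whenever `best_epoch` is updated, and that saved model would then be evaluated on the test set.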
Comparisons with FastGCN using the official code. We use the public code to rerun the experiments of FastGCN in Figure 2 and Table 1. The average accuracies of FastGCN for four datasets are , , and . The running curves of Figure 2 in the paper are updated by Figure 5 here. Our method still clearly outperforms FastGCN. We have observed inconsistencies between the official implementations of GraphSAGE and FastGCN, including the adjacency matrix construction, hidden dimensions, minibatch sizes, maximal training epochs and other engineering tricks not mentioned in their papers. For fair comparison, we reimplement them and use the same experimental settings as our method in the main text.
More results on Pubmed. In the paper, Figure 2 displays the accuracy curves of test data on Cora, Citeseer and Reddit, where the random seeds are fixed. For those on Pubmed, we provide results in Figure 5. Our method consistently outperforms the IID and Node-Wise counterparts. The Full model achieves its best accuracy around the 30th epoch, but its accuracy drops after the 60th epoch, probably due to overfitting. In contrast, our performance is more stable and gives even better results in the end. Performing the variance reduction on this dataset is helpful only during the early stage and contributes little once the model converges.
Table 3 (b) reports the accuracy curve of the model with the skip connection on Cora. Here, we evaluate the effectiveness of the skip connection on Citeseer and Pubmed in Figure 6. It demonstrates that the skip connection helps to speed up convergence on Citeseer. On the Pubmed dataset, however, adding the skip connection boosts the performance only during the early training epochs. For the Reddit dataset, we cannot apply the skip connection in the network since the bottom layer is fixed and the output features are precomputed.