Introduction
In this paper, we study the representation learning task involving data that lie in an irregular domain, i.e. graphs. Many data have the form of a graph, e.g., social networks [Perozzi, AlRfou, and Skiena2014], citation networks [Sen et al.2008], biological networks [Zitnik and Leskovec2017], and transaction networks [Liu et al.2017]. We are interested in graphs with permutation invariant properties, i.e., the ordering of the neighbors for each node is irrelevant to the learning tasks. This is in opposition to temporal graphs [Kostakos2009].
Convolutional Neural Networks (CNN) have proven successful in a diverse range of applications involving images [He et al.2016a] and sequences [Gehring et al.2016]. Recently, interest has emerged in generalizing convolutions to graphs [Hammond, Vandergheynst, and Gribonval2011, Defferrard, Bresson, and Vandergheynst2016, Kipf and Welling2016, Hamilton, Ying, and Leskovec2017a], which also brings new challenges.
Unlike image and sequence data that lie in regular domains, graph data are irregular in nature, making the receptive field of each neuron different for different nodes in the graph. Assume a graph $G = (V, E)$ with $N$ nodes $i \in V$, $|E|$ edges $(i, j) \in E$, the sparse adjacency matrix $A \in \mathbb{R}^{N \times N}$, the diagonal node degree matrix $D$ ($D_{ii} = \sum_{j} A_{ij}$), and a matrix of node features $X \in \mathbb{R}^{N \times P}$. Consider the following calculation in typical graph convolutional networks: $H^{(t+1)} = \sigma(\phi(A) H^{(t)} W^{(t)})$ at the $t$-th layer parameterized by $W^{(t)} \in \mathbb{R}^{K \times K}$, where $H^{(t)} \in \mathbb{R}^{N \times K}$ denotes the intermediate embeddings of the $N$ nodes at the $t$-th layer. In the case where $\phi(A) = D^{-1} A$ and we ignore the activation function $\sigma$, after $T$ iterations it yields $H^{(T)} = \phi(A)^T H^{(0)} W$ (here we collapse $\prod_{t=0}^{T} W^{(t)}$ as $W$, and $H^{(0)} = X$), with the $T$-th order transition matrix $\phi(A)^T$ [Gagniuc2017] as the predefined receptive field. That is, the depth of layers $T$ determines the extent of neighbors to exploit, and any node $j$ which satisfies $d(i, j) \le T$, where $d$ is the shortest path distance, contributes to node $i$'s embedding, with the importance weight predefined as $[\phi(A)^T]_{ij}$. Essentially, the receptive field of one target node in the graph domain is equivalent to the subgraph that consists of nodes along paths to the target node.
Is there a specific path in the graph contributing mostly to the representation? Is there an adaptive and automated way of choosing the receptive fields or paths of a graph convolutional network? It seems that the current literature has not provided such a solution. For instance, graph convolutional neural networks which lie in the spectral domain [Bruna et al.2013] rely heavily on the graph Laplacian matrix [Chung1997] to define the importance of the neighbors (and hence the receptive field) for each node. Approaches that lie in the spatial domain define convolutions directly on the graph, with the receptive field more or less hand-designed. For instance, GraphSage [Hamilton, Ying, and Leskovec2017b] used the mean or max of a fixed-size neighborhood of each node, or an LSTM aggregator which needs a pre-selected order of neighboring nodes. These predefined receptive fields, either based on the graph Laplacian in the spectral domain, or based on uniform operators like mean and max in the spatial domain, thus prevent us from discovering meaningful receptive fields from graph data adaptively. For example, the performance of GCN [Kipf and Welling2016] based on the graph Laplacian could deteriorate severely if we simply stack more and more layers to explore deeper and wider receptive fields (or paths), even though the situation could be alleviated, to some extent, by adding residual nets as extra add-ons [Hamilton, Ying, and Leskovec2017b, Veličković et al.2017].
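To make the notion of a predefined receptive field concrete, the following minimal sketch computes the $T$-step transition matrix of a row-normalized adjacency matrix on a toy path graph. The graph, the helper names, and the added self-loops are illustrative assumptions (self-loops ensure that every node within $T$ hops receives a non-zero weight); none of them are taken from the experiments in this paper.

```python
def row_normalize(adj):
    """Return D^{-1} A for a dense adjacency matrix stored as nested lists."""
    return [[a / sum(row) for a in row] for row in adj]


def matmul(x, y):
    """Multiply two dense matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transition_power(adj, steps):
    """Return (D^{-1} A)^steps, the fixed receptive field after `steps` layers."""
    n = len(adj)
    p = row_normalize(adj)
    out = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        out = matmul(out, p)
    return out


# Toy path graph 0 - 1 - 2 - 3; the self-loops are an assumption of this sketch.
adj = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
weights = transition_power(adj, steps=2)

# Nodes that contribute to node 0 after two layers, with their fixed weights.
contributors = {j: round(w, 4) for j, w in enumerate(weights[0]) if w > 0}
print(contributors)
```

In this toy case node 3, which is three hops away from node 0, receives zero weight after two layers, and the weights of the remaining nodes are determined entirely by the graph structure rather than learned from data.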
To address the above challenges, (1) we first formulate the space of functionals that any eligible aggregator function should satisfy for permutation invariant graph data; (2) we propose an adaptive path layer with two complementary components, adaptive breadth and depth functions, where the adaptive breadth function adaptively selects a set of significantly important one-hop neighbors, and the adaptive depth function extracts useful signals and filters noisy ones up to long-order distances; (3) experiments on several datasets empirically show that our approaches are quite competitive, and yield state-of-the-art results on large graphs. Another remarkable result is that our approach is less sensitive to the depth of propagation layers.
Intuitively, our proposed adaptive path layer guides the breadth and depth exploration of the receptive fields. As such, we name such adaptively learned receptive fields receptive paths.
Graph Convolutional Networks
Generalizing convolutions to graphs aims to encode the nodes with signals that lie in their receptive fields. The output embeddings can be further used in end-to-end supervised learning [Kipf and Welling2016] or unsupervised learning tasks [Perozzi, AlRfou, and Skiena2014, Grover and Leskovec2016].
The approaches that lie in the spectral domain rely heavily on the graph Laplacian operator $L = I - D^{-1/2} A D^{-1/2}$ [Chung1997]. This real symmetric positive semidefinite matrix can be decomposed into a set of orthonormal eigenvectors which form the graph Fourier basis $U \in \mathbb{R}^{N \times N}$ such that $L = U \Lambda U^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N) \in \mathbb{R}^{N \times N}$ is the diagonal matrix with main diagonal entries as ordered nonnegative eigenvalues. As a result, the convolution operator in the spatial domain can be expressed through the element-wise Hadamard product in the Fourier domain [Bruna et al.2013]. The receptive fields in this case depend on the kernel $U$.
Kipf and Welling [Kipf and Welling2016] further propose GCN to design the following approximated localized 1-order spectral convolutional layer:
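$$H^{(t+1)} = \sigma\left(\tilde{A} H^{(t)} W^{(t)}\right), \tag{1}$$
where $\tilde{A}$ is a symmetric normalization of $A$ with self-loops, i.e. $\tilde{A} = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$, $\hat{A} = A + I$, $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$, $H^{(t)}$ denotes the $t$-th hidden layer with $H^{(0)} = X$, $W^{(t)}$ is the layer-specific parameters, and $\sigma$ denotes the activation functions.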
GCN requires the graph Laplacian to normalize the neighborhoods in advance. This limits the use of such models in inductive settings. One trivial solution (GCN-mean) is instead to average the neighborhoods:
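$$H^{(t+1)} = \sigma\left(\tilde{A} H^{(t)} W^{(t)}\right), \tag{2}$$
where $\tilde{A} = \hat{D}^{-1} \hat{A}$ is the row-normalized adjacency matrix.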
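The difference between the two normalizations, the symmetric form $\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$ of Eq. (1) and the row-normalized (averaging) form $\hat{D}^{-1} \hat{A}$ of Eq. (2), can be sketched on a toy graph. This is only an illustrative sketch: the star graph, the identity weight matrix, the use of ReLU as $\sigma$, and keeping self-loops in the averaging variant are assumptions made for the example.

```python
import math


def add_self_loops(adj):
    """Return A + I for a dense adjacency matrix stored as nested lists."""
    return [[a + (1.0 if i == j else 0.0) for j, a in enumerate(row)]
            for i, row in enumerate(adj)]


def symmetric_normalize(adj_hat):
    """Return D^{-1/2} A_hat D^{-1/2}, the normalization of Eq. (1)."""
    deg = [sum(row) for row in adj_hat]
    return [[a / math.sqrt(deg[i] * deg[j]) for j, a in enumerate(row)]
            for i, row in enumerate(adj_hat)]


def row_normalize(adj_hat):
    """Return D^{-1} A_hat, the neighborhood average of Eq. (2)."""
    return [[a / sum(row) for a in row] for row in adj_hat]


def matmul(x, y):
    """Multiply two dense matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def propagate(a_norm, h, w):
    """One layer sigma(A_norm H W); ReLU as sigma is an assumption."""
    return [[max(0.0, v) for v in row] for row in matmul(matmul(a_norm, h), w)]


# Toy star graph: node 0 is connected to nodes 1 and 2.
adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
adj_hat = add_self_loops(adj)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

h_sym = propagate(symmetric_normalize(adj_hat), features, identity)
h_mean = propagate(row_normalize(adj_hat), features, identity)
print(h_sym[0], h_mean[0])
```

With the row-normalized form, the output for node 0 is exactly the average of the features of node 0 and its two neighbors, whereas the symmetric form additionally rescales each neighbor by its own degree.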
More recently, Hamilton et al. [Hamilton, Ying, and Leskovec2017a] proposed GraphSAGE, a method for computing node representations in inductive settings. They define a series of aggregator functions in the following framework:
$$H^{(t+1)} = \sigma\left(\mathrm{CONCAT}\left(\phi(A, H^{(t)}), H^{(t)}\right) W^{(t)}\right), \tag{3}$$
where $\phi(\cdot)$ is an aggregator function over the graph. For example, their mean aggregator is nearly equivalent to GCN-mean (Eq. 2), but with additional residual architectures [He et al.2016a]. They also propose max pooling and LSTM (long short-term memory) [Hochreiter and Schmidhuber1997] based aggregators. However, the LSTM aggregator operates on a random permutation of each node's neighbors and is therefore not a permutation invariant function with respect to the ordering of neighborhoods, which makes the operator hard to interpret.
A recent work [Veličković et al.2017] proposed Graph Attention Networks (GAT), which use attention mechanisms to parameterize the aggregator function $\phi(\cdot)$. This method is restricted to learning a direct-neighborhood receptive field, which is less general than our proposed approach, which can explore the receptive field in both breadth and depth directions.
Note that this paper focuses mainly on works related to graph convolutional networks, and may omit some other types of graph neural networks.
Discussions
To summarize, the major effort in this area has been to design effective aggregator functions that propagate signals around the $T$-th order neighborhood of each node. Few of them try to learn the meaningful paths that direct the propagation.
Why is learning receptive paths important? The reasons are twofold: (1) graphs could be noisy, and (2) different nodes play different roles and thus yield different receptive paths. For instance, in Figure 1 we show the graph patterns from real fraud detection data. This is a bipartite graph with two types of nodes: accounts and devices (e.g. IP proxy). Malicious accounts (red nodes) tend to aggregate together, as the graph shows. However, in reality, normal and malicious accounts could connect to the same IP proxy. As a result, we cannot tell from the graph patterns alone that the account labeled in “green” is definitely a malicious one. It is important to verify whether this account behaves in patterns similar to other malicious accounts, i.e. according to the features of each node. Thus, node features could be additional signals to refine the importance of neighbors and paths. In addition, we should pay attention to the number of hops each node can propagate. Clearly, the nodes at the frontier should not aggregate signals from a hub node, otherwise everything would become “overspread”.
We demonstrate the idea of learning receptive paths of graph neural networks in Figure 2. Instead of aggregating all the 2-hop neighbors to calculate the embedding of the target node (black), we aim to learn meaningful receptive paths (shaded region) that contribute most to the target node. The meaningful paths can be viewed as a subgraph associated with the target node. What we need is to do breadth/depth exploration and filter useful/noisy signals. The breadth exploration determines which neighbors are important, i.e. leads the direction to explore, while the depth exploration determines how many hops away neighbors are still useful. Such signal filtering on the subgraphs essentially learns receptive paths.
Proposed Approaches
In this section, we first discuss the space of functionals satisfying the permutation invariant requirement for graph data. Then we design neural network layers which satisfy this requirement while also being able to learn “receptive paths” on graphs.
Permutation Invariant
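We aim to learn a function $f$ that maps $\mathcal{G}$ and $\mathcal{H}$, i.e. the graph and associated (latent) feature spaces, into the range $\mathcal{Y}$. We have a node $i$, the 1-order neighborhood associated with $i$, i.e. $\mathcal{N}(i)$, and a set of (latent) features in the vector space $\mathbb{R}^K$ that belong to the neighborhood, i.e. $\{h_j \mid j \in \mathcal{N}(i)\}$. In the case of permutation invariant graphs, we assume the learning task is independent of the order of neighbors.
Now we require that the (aggregator) function $f$ acting on the neighbors be invariant to the order of neighbors under any random permutation $\sigma$, i.e.
$$f\left(\{h_1, \ldots, h_j, \ldots \mid j \in \mathcal{N}(i)\}\right) = f\left(\{h_{\sigma(1)}, \ldots, h_{\sigma(j)}, \ldots \mid j \in \mathcal{N}(i)\}\right). \tag{4}$$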
Theorem 1 (Permutation Invariant)
A function $f$ operating on the neighborhood of $i$ can be a valid function, i.e. with the assumption of permutation invariance to the neighbors of $i$, if and only if $f$ can be decomposed into the form $\rho\left(\sum_{j \in \mathcal{N}(i)} \phi(h_j)\right)$ with mappings $\phi$ and $\rho$.
Proof sketch. It is trivial to show that the sufficiency condition holds. The necessity condition holds due to the Fundamental Theorem of Symmetric Functions [Zaheer et al.2017].
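The sufficiency direction of Theorem 1 can be checked numerically. The sketch below builds an aggregator of the form $\rho(\sum_j \phi(h_j))$ and verifies the condition of Eq. (4) over every ordering of a small neighborhood; for contrast it also runs an order-dependent recurrence. The particular choices of `phi`, `rho`, the toy recurrence (a simplified stand-in for a sequence model, not an LSTM), and the feature values are hypothetical.

```python
import itertools
import math


def phi(h):
    """Per-neighbor mapping (hypothetical choice)."""
    return [math.tanh(x) for x in h]


def rho(pooled):
    """Mapping applied after pooling (hypothetical choice)."""
    return [2.0 * x for x in pooled]


def sum_decomposition(neighbors):
    """Aggregate neighbor features as rho(sum_j phi(h_j))."""
    mapped = [phi(h) for h in neighbors]
    pooled = [sum(column) for column in zip(*mapped)]
    return rho(pooled)


def recurrent_aggregate(neighbors):
    """Order-dependent toy recurrence over the neighbor sequence."""
    state = [0.0] * len(neighbors[0])
    for h in neighbors:
        state = [math.tanh(0.5 * s + x) for s, x in zip(state, h)]
    return state


def is_permutation_invariant(aggregator, neighbors, tol=1e-9):
    """Check the condition of Eq. (4) over every ordering of the neighbors."""
    reference = aggregator(neighbors)
    for perm in itertools.permutations(neighbors):
        out = aggregator(list(perm))
        if any(abs(a - b) > tol for a, b in zip(out, reference)):
            return False
    return True


neighbors = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.5]]
sum_is_invariant = is_permutation_invariant(sum_decomposition, neighbors)
recurrence_is_invariant = is_permutation_invariant(recurrent_aggregate, neighbors)
print(sum_is_invariant, recurrence_is_invariant)
```

The tolerance only absorbs floating-point rounding from summing in different orders; the recurrence fails the check because its output depends on which neighbor is visited last.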
Remark 2 (Associative Property)
If $f$ is permutation invariant with respect to a neighborhood $\mathcal{N}(\cdot)$, $g \circ f$ is still permutation invariant with respect to the neighborhood $\mathcal{N}(\cdot)$ if $g$ is independent of the order of $\mathcal{N}(\cdot)$. This allows us to stack the functions in a cascading manner.
In the following, we will use the permutation invariant requirement and the associative property to design propagation layers. It is trivial to check that the LSTM aggregator in GraphSAGE is not a valid function under this condition, even though it could possibly be an appropriate aggregator function in temporal graphs.
Adaptive Path Layer
Our goal is to learn receptive paths while propagating signals along the learned paths rather than the predefined paths. The problem is equivalent to determining a proper subgraph through breadth (which one-hop neighbor is important) and depth (the importance of neighbors at the $t$-th hop away) expansions for each node.
To explore the breadth of the receptive paths, we learn an adaptive breadth function $\phi(A, H^{(t)}; \Theta)$ parameterized by $\Theta$, to aggregate the signals by adaptively assigning different importances to different one-hop neighbors. To explore the depth of the receptive paths, we learn an adaptive depth function $\varphi(h_i^{(t)} \mid h_i^{(t-1)}, h_i^{(t-2)}, \ldots; \Phi)$ parameterized by $\Phi$ (shared among all nodes $i \in V$) that could further extract and filter the aggregated signals at the $t$-th order distance away from the anchor node $i$ by modeling the dependencies among aggregated signals at various depths. We summarize the overall “GeniePath” algorithm in Algorithm 1.
The parameterized functions in Algorithm 1 are optimized by user-defined loss functions. These may include supervised learning tasks (e.g. multi-class, multi-label), or unsupervised tasks like the objective functions defined in [Perozzi, AlRfou, and Skiena2014].
Now it remains to specify the adaptive breadth and depth functions, i.e. $\phi(\cdot)$ and $\varphi(\cdot)$. We make the adaptive path layer more concrete as follows:
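$$h_i^{(tmp)} = \tanh\left({W^{(t)}}^\top \sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha\left(h_i^{(t)}, h_j^{(t)}\right) \cdot h_j^{(t)}\right), \tag{5}$$
then
$$i_i = \sigma\left({W_i^{(t)}}^\top h_i^{(tmp)}\right), \quad f_i = \sigma\left({W_f^{(t)}}^\top h_i^{(tmp)}\right), \quad o_i = \sigma\left({W_o^{(t)}}^\top h_i^{(tmp)}\right),$$
$$\tilde{C} = \tanh\left({W_c^{(t)}}^\top h_i^{(tmp)}\right), \quad C_i^{(t+1)} = f_i \odot C_i^{(t)} + i_i \odot \tilde{C},$$
and finally,
$$h_i^{(t+1)} = o_i \odot \tanh\left(C_i^{(t+1)}\right). \tag{6}$$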
The first equation (Eq. (5)) corresponds to $\phi(\cdot)$ and the remaining gated units correspond to $\varphi(\cdot)$. We maintain for each node $i$ a memory $C_i$ (initialized as $C_i^{(0)} \leftarrow \vec{0}$), which gets updated as the receptive paths are explored for $t = 0, 1, \ldots, T$. At the $(t+1)$-th layer, i.e. while we are exploring the neighborhood $\{j \mid d(i, j) \le t+1\}$, we have the following two complementary functions.
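Adaptive breadth function. Eq. (5) assigns the importance of any one-hop neighbor's embedding $h_j^{(t)}$ by the parameterized generalized linear attention operator as follows
$$\alpha(x, y) = \mathrm{softmax}_y\left(v^\top \tanh\left(W_s^\top x + W_d^\top y\right)\right), \tag{7}$$
where $\mathrm{softmax}_y f(\cdot, y) = \frac{\exp f(\cdot, y)}{\sum_{y'} \exp f(\cdot, y')}$.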
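As a minimal sketch of the generalized linear attention in Eq. (7), the code below scores each candidate neighbor with $v^\top \tanh(W_s^\top h_i + W_d^\top h_j)$, normalizes the scores with a softmax over the candidates, and forms the attention-weighted sum that appears inside Eq. (5). The outer ${W^{(t)}}^\top$ and $\tanh$ of Eq. (5) are omitted for brevity, and all numeric values, function names, and the list-of-lists matrix layout are hypothetical.

```python
import math


def matvec_t(w, x):
    """Return W^T x, with W stored as nested lists of shape len(x) x K."""
    return [sum(w[r][c] * x[r] for r in range(len(x))) for c in range(len(w[0]))]


def attention_weights(h_i, candidates, w_s, w_d, v):
    """Softmax over v^T tanh(W_s^T h_i + W_d^T h_j) for each h_j in candidates."""
    src = matvec_t(w_s, h_i)
    scores = []
    for h_j in candidates:
        dst = matvec_t(w_d, h_j)
        scores.append(sum(vk * math.tanh(a + b) for vk, a, b in zip(v, src, dst)))
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(alpha, candidates):
    """Return sum_j alpha_j * h_j, the inner sum of Eq. (5)."""
    dim = len(candidates[0])
    return [sum(a * h[k] for a, h in zip(alpha, candidates)) for k in range(dim)]


# Hypothetical toy values; `candidates` plays the role of N(i) plus i itself.
h_i = [1.0, -1.0]
candidates = [h_i, [0.5, 0.5], [-2.0, 1.0]]
w_s = [[0.3, -0.2], [0.1, 0.4]]
w_d = [[-0.5, 0.2], [0.7, 0.1]]
v = [1.0, -1.0]

alpha = attention_weights(h_i, candidates, w_s, w_d, v)
uniform = attention_weights(h_i, candidates, w_s, w_d, [0.0, 0.0])
pooled = weighted_sum(alpha, candidates)
print(alpha, pooled)
```

Setting $v$ to zero makes every score equal, so the operator degenerates to a plain average over the neighborhood; a non-zero $v$ lets the weights differ from neighbor to neighbor.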
Adaptive depth function. The gated unit $i_i$ with sigmoid output (in the range $(0, 1)$) is used to extract newly useful signals from $\tilde{C}$, which are added to the memory as $i_i \odot \tilde{C}$. The gated unit $f_i$ is used to filter useless signals from the old memory given the newly observed neighborhood by $f_i \odot C_i^{(t)}$. As such, we are able to filter the memory for each node $i$ as $C_i^{(t+1)}$ while we extend the depth of the neighborhood. Finally, with the output gated unit $o_i$ and the latest memory $C_i^{(t+1)}$, we can output node $i$'s embedding at the $(t+1)$-th layer as $h_i^{(t+1)}$.
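The gated update described above can be sketched for a single node as follows: three sigmoid gates and a tanh candidate are computed from the aggregated signal, the memory is updated as $f \odot C + i \odot \tilde{C}$, and the output is $o \odot \tanh(C)$. This is a sketch under the assumption that each gate is a bias-free linear map of the aggregated signal; the function names and the all-zero weights in the usage example are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec_t(w, x):
    """Return W^T x, with W stored as nested lists of shape len(x) x K."""
    return [sum(w[r][c] * x[r] for r in range(len(x))) for c in range(len(w[0]))]


def depth_step(h_tmp, memory, w_i, w_f, w_o, w_c):
    """Apply the gated update to one node; returns (h_next, new_memory)."""
    i_gate = [sigmoid(z) for z in matvec_t(w_i, h_tmp)]       # what to extract
    f_gate = [sigmoid(z) for z in matvec_t(w_f, h_tmp)]       # what to keep
    o_gate = [sigmoid(z) for z in matvec_t(w_o, h_tmp)]       # what to expose
    candidate = [math.tanh(z) for z in matvec_t(w_c, h_tmp)]  # new signal
    new_memory = [f * c + i * g
                  for f, c, i, g in zip(f_gate, memory, i_gate, candidate)]
    h_next = [o * math.tanh(c) for o, c in zip(o_gate, new_memory)]
    return h_next, new_memory


# With all-zero weights every gate equals 0.5 and the candidate is 0,
# so one step simply halves the old memory.
zeros = [[0.0, 0.0], [0.0, 0.0]]
h_next, memory = depth_step([0.3, -0.7], [1.0, -2.0], zeros, zeros, zeros, zeros)
print(h_next, memory)
```

Because the forget gate multiplies the old memory at every step, signals gathered at earlier depths can be attenuated or preserved per dimension as deeper neighborhoods are explored.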
The overall architecture of the adaptive path layer is illustrated in Figure 3. We let $h_i^{(0)} = W_x^\top X_i$ with $X_i \in \mathbb{R}^P$ as node $i$'s feature. The parameters to be optimized are the weight matrix $W_x \in \mathbb{R}^{P \times K}$, $\Theta = \{W, W_s, W_d \in \mathbb{R}^{K \times K}, v \in \mathbb{R}^K\}$, and $\Phi = \{W_i, W_f, W_o, W_c \in \mathbb{R}^{K \times K}\}$, which are extremely compact (at the same scale as other graph neural networks).
Note that the adaptive path layer with only the adaptive breadth function reduces to the proposal of GAT, but with stronger nonlinear representation capacities. Our generalized linear attention can assign symmetric importance with the constraint $W_s = W_d$. We do not limit $\varphi(\cdot)$ to an LSTM-like function, but will report this architecture in experiments.
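A variant. Next, we propose a variant called “GeniePath-lazy” that postpones the evaluations of the adaptive depth function $\varphi(\cdot)$ at each layer. We instead propagate signals up to the $T$-th order distance by merely stacking adaptive breadth functions. Given those hidden units $\{h_i^{(0)}, \ldots, h_i^{(t)}, \ldots, h_i^{(T)}\}$, we add the adaptive depth function $\varphi(\cdot)$ on top of them to further extract and filter the signals at various depths. We initialize $\mu_i^{(0)} = W_x^\top X_i$, and feed $\mu_i^{(T)}$ to the final loss functions. Formally, given $h_i^{(0)} = W_x^\top X_i$ we have:
$$h_i^{(t+1)} = \tanh\left({W^{(t)}}^\top \sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha\left(h_i^{(t)}, h_j^{(t)}\right) \cdot h_j^{(t)}\right), \tag{8}$$
$$i_i = \sigma\left({W_i^{(t)}}^\top \mathrm{CONCAT}\left(h_i^{(t)}, \mu_i^{(t)}\right)\right), \quad f_i = \sigma\left({W_f^{(t)}}^\top \mathrm{CONCAT}\left(h_i^{(t)}, \mu_i^{(t)}\right)\right), \quad o_i = \sigma\left({W_o^{(t)}}^\top \mathrm{CONCAT}\left(h_i^{(t)}, \mu_i^{(t)}\right)\right),$$
$$\tilde{C} = \tanh\left({W_c^{(t)}}^\top \mathrm{CONCAT}\left(h_i^{(t)}, \mu_i^{(t)}\right)\right), \quad C_i^{(t+1)} = f_i \odot C_i^{(t)} + i_i \odot \tilde{C}, \quad \mu_i^{(t+1)} = o_i \odot \tanh\left(C_i^{(t+1)}\right).$$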
Permutation invariant. Note that the adaptive breadth function $\phi(\cdot)$ in our case operates on a linear transformation of neighbors, and thus satisfies the permutation invariant requirement stated in Theorem 1. The function $\varphi(\cdot)$ operating on layers at each depth is independent of the order of any 1-step neighborhood, thus the composition $\varphi \circ \phi$ is still permutation invariant.
Efficient Numerical Computation
Our algorithms can be easily formulated in terms of linear algebra, and thus can leverage the power of Intel MKL [Intel2007] on CPUs or cuBLAS [Nvidia2008] on GPUs. The major challenge is to calculate Eq. (5) in a numerically efficient way. For example, GAT [Veličković et al.2017] in its first version performs attention over all nodes, with masking applied by way of an additive bias of $-\infty$ to the masked entries' coefficients before applying the softmax, which results in $O(N^2)$ computation and storage.
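Dataset | V | E | # Classes | # Features | Label rate (train / test)
--- | --- | --- | --- | --- | ---
Pubmed | | | | | 0.3% / 5.07%
BlogCatalog$^1$ | | | | | 50% / 40%
BlogCatalog$^2$ | | | | | 50% / 40%
PPI | | | 121 | | 78.8% / 9.7%
Alipay | | | 2 | | 20.5% / 3.2%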
Instead, all we really need is the attention entries at the same scale as the adjacency matrix $A$, i.e. $|E|$. Our trick is to build two auxiliary sparse matrices $L \in \{0, 1\}^{|E| \times N}$ and $R \in \{0, 1\}^{|E| \times N}$, both with $|E|$ entries. For the $i$-th row of $L$ and $R$, we have $L_{ii'} = 1$, $R_{ij'} = 1$, where $(i', j')$ corresponds to an edge of the graph. After that, we can compute $L H W_s + R H W_d$ for the transformations on all the edges in the case of the generalized linear form in Eq. (7). As a result, our algorithm's complexity and storage are still linear, $O(|E|)$, as in other types of GNNs.
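The edge-wise trick can be illustrated with a small sketch that computes the per-edge inputs to the attention score in two ways: once with explicit one-hot selector matrices playing the roles of $L$ and $R$, and once with plain row lookups, which is what a sparse implementation effectively performs. The selectors are stored densely here only for readability; the toy graph, the weights, and the function names are hypothetical.

```python
def matmul(x, y):
    """Multiply two dense matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def selector(edges, num_nodes, side):
    """Return the |E| x N one-hot matrix picking endpoint `side` of each edge."""
    return [[1.0 if col == edge[side] else 0.0 for col in range(num_nodes)]
            for edge in edges]


def edge_inputs_with_selectors(edges, h, w_s, w_d):
    """Compute L H W_s + R H W_d with explicit selector matrices."""
    left = matmul(selector(edges, len(h), 0), matmul(h, w_s))
    right = matmul(selector(edges, len(h), 1), matmul(h, w_d))
    return [[a + b for a, b in zip(l_row, r_row)]
            for l_row, r_row in zip(left, right)]


def edge_inputs_with_gather(edges, h, w_s, w_d):
    """Same result via row lookups: one K-vector per edge, no N x N array."""
    hs, hd = matmul(h, w_s), matmul(h, w_d)
    return [[a + b for a, b in zip(hs[i], hd[j])] for i, j in edges]


# Hypothetical toy graph with three nodes and four directed edges (i', j').
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
h = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
w_s = [[0.5, 0.0], [0.0, 0.5]]
w_d = [[1.0, -1.0], [2.0, 0.0]]

via_selectors = edge_inputs_with_selectors(edges, h, w_s, w_d)
via_gather = edge_inputs_with_gather(edges, h, w_s, w_d)
print(via_gather)
```

Both routes produce one $K$-dimensional vector per edge, so the intermediate storage grows with the number of edges rather than with the square of the number of nodes.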
Experiments
In this section, we first discuss the experimental results of our approach evaluated on various types of graphs compared with strong baselines. We then study its abilities of learning adaptive depths. Finally we give qualitative analyses on the learned paths compared with graph Laplacian.
Datasets
Transductive setting.
Pubmed [Sen et al.2008] is a citation network, where nodes correspond to documents and edges to undirected citations. The classes are exclusive, so we treat the problem as a multi-class classification problem. We use the exact preprocessed data from Kipf and Welling [Kipf and Welling2016].
BlogCatalog$^1$ [Zafarani and Liu2009] is a social network, where nodes correspond to bloggers listed on the BlogCatalog website, and edges to the social relationships of those bloggers. We treat the problem as a multi-label classification problem. Unlike the other datasets, BlogCatalog$^1$ has no explicit features available; as a result, we encode node ids as one-hot features for each node, so the feature dimension equals the number of nodes. We further build the dataset BlogCatalog$^2$, whose features are obtained by SVD on the adjacency matrix $A$.
The Alipay dataset [Liu et al.2017] is an Account-Device Network, built for detecting malicious accounts in the online cashless payment system at Alipay. The nodes correspond to users' accounts and the devices logged in by those accounts. The edges correspond to the login relationships between accounts and devices during a time period. Node features are counts of login behaviors discretized into hours, together with account profiles. There are two classes in this dataset, i.e. malicious accounts and normal accounts. The Alipay dataset is randomly sampled during one week. The dataset consists of tens of thousands of disjoint subgraphs.
Inductive setting.
PPI [Hamilton, Ying, and Leskovec2017a] is a protein-protein interaction network, which consists of 24 subgraphs, each corresponding to a human tissue [Zitnik and Leskovec2017]. The node features are extracted from positional gene sets, motif gene sets and immunological signatures. There are 121 classes for each node from gene ontology. Each node could have multiple labels, which results in a multi-label classification problem. We use the exact preprocessed data provided by Hamilton et al. [Hamilton, Ying, and Leskovec2017a]. There are 20 graphs for training, 2 for validation and 2 for testing.
We summarize the statistics of all the datasets in Table 1.
Transductive
Methods | Pubmed | BlogCatalog$^1$ | BlogCatalog$^2$ | Alipay
--- | --- | --- | --- | ---
MLP | 71.4% | - | 0.134 | 0.741
node2vec | 65.3% | 0.136 | 0.136 | -
Chebyshev | 74.4% | 0.160 | 0.166 | 0.784
GCN | 79.0% | 0.171 | 0.174 | 0.796
GraphSAGE | 78.8% | 0.175 | 0.175 | 0.798
GAT | 78.5% | 0.201 | 0.197 | 0.811
GeniePath | 78.5% | 0.195 | 0.202 | 0.826
Inductive
Methods | PPI
--- | ---
MLP | 0.422
GCN-mean | 0.71
GraphSAGE | 0.768
GAT | 0.81
GeniePath | 0.979
Methods | PPI
--- | ---
GAT | 0.81
GAT-residual | 0.914
GeniePath | 0.952
GeniePath-lazy | 0.979
GeniePath-lazy-residual | 0.985
Experimental Settings
Comparison Approaches
We compare our methods with several strong baselines.
(1) MLP (multi-layer perceptron), which utilizes only the node features but not the structure of the graph.
(2) node2vec [Grover and Leskovec2016]. Note that this type of method, built on top of lookup embeddings, cannot work on problems with multiple graphs, i.e. on the datasets Alipay and PPI, because without any further constraints the entire embedding space cannot be guaranteed to be consistent during training [Hamilton, Ying, and Leskovec2017a].
(3) Chebyshev [Defferrard, Bresson, and Vandergheynst2016], which approximates the graph spectral convolutions by a truncated expansion in terms of Chebyshev polynomials up to the $K$-th order.
(4) GCN [Kipf and Welling2016], which is defined in Eq. (1). Like Chebyshev, it works only in the transductive setting. However, if we just use the normalization formulated in Eq. (2), it can work in an inductive setting. We denote this variant of GCN as GCN-mean.
(5) GraphSAGE [Hamilton, Ying, and Leskovec2017a], which consists of a group of pooling operators and skip connection architectures as discussed in section Graph Convolutional Networks. We report the best results of GraphSAGE with different pooling strategies as GraphSAGE.
(6) Graph Attention Networks (GAT) [Veličković et al.2017], which is similar to a reduced version of our approach with only the adaptive breadth function. This can help us understand the usefulness of the adaptive breadth and depth functions.
We pick the better result from GeniePath or GeniePath-lazy as GeniePath. We found that the residual architecture (skip connections) [He et al.2016b] is useful for stacking deeper layers for various approaches, and we report the approaches with the suffix “residual” to distinguish the contributions of graph convolution operators and skip connection architecture.
Experimental Setups
In our experiments, we implement our algorithms in TensorFlow [Abadi et al.2016] with the Adam optimizer [Kingma and Ba2014]. For all the graph convolutional network-style approaches, we set the hyperparameters, including the dimension of embeddings or hidden units, the depth of hidden layers, and the learning rate, to be the same. For node2vec, we tune the return parameter $p$ and in-out parameter $q$ by grid search. Note that setting $p = q = 1$ is equivalent to DeepWalk [Perozzi, AlRfou, and Skiena2014]. We sample 10 walk-paths with walk-length 80 for each node in the graph. Additionally, we tune the penalty of regularizers for different approaches.
Transductive setting. In transductive settings, we allow all the algorithms to access the whole graph, i.e. all the edges among the nodes, and all of the node features.
For Pubmed, we set the number of hidden units as 16 with 2 hidden layers. For BlogCatalog$^1$ and BlogCatalog$^2$, we set the number of hidden units as 128 with 3 hidden layers. For Alipay, we set the number of hidden units as 16 with 7 hidden layers.
Inductive setting. In inductive settings, all the nodes in the test set are completely unobserved during the training procedure. For PPI, we set the number of hidden units as 256 with 3 hidden layers.
Figure 5: Graph Laplacian (left) v.s. estimated receptive paths (right) with respect to the black node on the PPI dataset: we retain all the paths to the black node within 2 hops that are involved in the propagation, with edge thickness denoting the importance of the edge estimated in the first adaptive path layer. We also discretize the importance of each edge into 3 levels: green, blue, and red.
Classification
We report the comparison results of transductive settings in Table 3. For Pubmed, there are only 60 labels available for training. We found that GCN works best when stacking exactly 2 hidden convolutional layers; stacking 2 more convolutional layers deteriorates the performance severely. Our method is still quite competitive on this data compared with other methods. This dataset is so small that it limits the capacity of our methods.
On BlogCatalog$^1$ and BlogCatalog$^2$, we found that both GAT and GeniePath work best compared with other methods. Another surprisingly interesting result is that graph convolutional network-style approaches can perform well even without explicit features. This works only in the transductive setting, where the lookup embeddings (in this particular case) of testing nodes can be propagated to nodes in training.
The graph in Alipay [Liu et al.2017] is relatively sparse. We found that GeniePath performs well on this large graph. Since the dataset consists of tens of thousands of subgraphs, node2vec is not applicable in this case.
We report the comparison results of inductive settings on PPI in Table 3. We use “GCN-mean”, which averages the neighborhoods, instead of GCN for comparison. GeniePath achieves very promising results on this large graph, which shows that the adaptive depth function plays an important role compared with GAT, which has only the adaptive breadth function.
To further study the contributions of skip connection architectures, we compare GAT, GeniePath, and GeniePath-lazy with additional residual architecture (“skip connections”) in Table 4. The results on PPI show that GeniePath is less sensitive to the additional “skip connections”. The significant improvement of GAT on this dataset relies heavily on the residual structure.
In most cases, we found that GeniePath-lazy converges faster than GeniePath and other GCN-style models.
Depths of Propagation
We show our results on classification measures with respect to the depths of propagation layers in Figure 4. As we stack more graph convolutional layers, i.e. with deeper and broader receptive fields, GCN, GAT and even GraphSAGE with residual architectures can hardly maintain consistently reasonable results. In contrast, GeniePath with adaptive path layers can adaptively learn the receptive paths and achieves consistent results.
Qualitative Analysis
We show a qualitative analysis of the receptive paths with respect to a sampled node learned by GeniePath in Figure 5. As can be seen, the graph Laplacian assigns importance to neighbors at nearly the same scale, i.e. it results in very dense paths (every neighbor is important). However, GeniePath can help select significantly important paths (red ones) to propagate along while ignoring the rest, i.e. it results in much sparser paths. Such “neighbor selection” processes essentially lead the direction of the receptive paths and improve the effectiveness of propagation.
Conclusion
In this paper, we studied the problem of identifying meaningful receptive paths in graph convolutional networks. We proposed adaptive path layers with adaptive breadth and depth functions to guide the receptive paths. Our experiments on large benchmark data show that GeniePath significantly outperforms state-of-the-art approaches, and is less sensitive to the depth of the stacked layers, or the extent of the neighborhood set by hand. The success of GeniePath shows that selecting appropriate receptive paths for different nodes is important. In the future, we expect to further study the sparsification of receptive paths, and help explain the potential propagations behind graph neural networks in specific applications. In addition, studying temporal graphs, in which the ordering of neighbors matters, could be another challenging problem.
References

 [Abadi et al.2016] Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; and Isard, M. 2016. Tensorflow: a system for large-scale machine learning.
 [Bruna et al.2013] Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
 [Chung1997] Chung, F. R. 1997. Spectral graph theory. Number 92. American Mathematical Soc.
 [Defferrard, Bresson, and Vandergheynst2016] Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, 3844–3852.
 [Gagniuc2017] Gagniuc, P. A. 2017. Markov Chains: From Theory to Implementation and Experimentation. John Wiley & Sons.
 [Gehring et al.2016] Gehring, J.; Auli, M.; Grangier, D.; and Dauphin, Y. N. 2016. A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344.
 [Grover and Leskovec2016] Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 855–864. ACM.
 [Hamilton, Ying, and Leskovec2017a] Hamilton, W. L.; Ying, R.; and Leskovec, J. 2017a. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216.
 [Hamilton, Ying, and Leskovec2017b] Hamilton, W. L.; Ying, R.; and Leskovec, J. 2017b. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584.
 [Hammond, Vandergheynst, and Gribonval2011] Hammond, D. K.; Vandergheynst, P.; and Gribonval, R. 2011. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30(2):129–150.

 [He et al.2016a] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016a. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
 [He et al.2016b] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016b. Identity mappings in deep residual networks. In European Conference on Computer Vision, 630–645. Springer.
 [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
 [Intel2007] Intel, M. 2007. Intel math kernel library.
 [Kingma and Ba2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [Kipf and Welling2016] Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
 [Kostakos2009] Kostakos, V. 2009. Temporal graphs. Physica A: Statistical Mechanics and its Applications 388(6):1007–1023.
 [Liu et al.2017] Liu, Z.; Chen, C.; Zhou, J.; Li, X.; Xu, F.; Chen, T.; and Song, L. 2017. Poster: Neural network-based graph embedding for malicious accounts detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, 2543–2545. New York, NY, USA: ACM.
 [Nvidia2008] Nvidia, C. 2008. Cublas library. NVIDIA Corporation, Santa Clara, California 15(27):31.
 [Perozzi, AlRfou, and Skiena2014] Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 701–710. ACM.
 [Sen et al.2008] Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI magazine 29(3):93.
 [Veličković et al.2017] Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.
 [Zafarani and Liu2009] Zafarani, R., and Liu, H. 2009. Social computing data repository at asu.
 [Zaheer et al.2017] Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R.; and Smola, A. 2017. Deep sets. arXiv preprint arXiv:1703.06114.
 [Zitnik and Leskovec2017] Zitnik, M., and Leskovec, J. 2017. Predicting multicellular function through multilayer tissue networks. Bioinformatics 33(14):i190–i198.