1 Introduction
Application of convolutional neural networks to the analysis of data with an underlying graph structure has been an active area of research in recent years. Early works towards the development of Graph Convolutional Neural Networks (GCNNs) include (Bruna et al., 2013; Henaff et al., 2015; Duvenaud et al., 2015). Defferrard et al. (2016) introduced an approach based on spectral filtering, which was adapted in subsequent works (Levie et al., 2019; Chen et al., 2018b; Kipf and Welling, 2017). On the other hand, spatial filtering or aggregation strategies are considered in (Atwood and Towsley, 2016; Hamilton et al., 2017). Monti et al. (2017) present a general framework for applying neural networks on graphs and manifolds, which encompasses many existing approaches.
Several modifications have been proposed in the literature to improve the performance of GCNNs. These include incorporating attention mechanisms (Veličković et al., 2018), gates (Li et al., 2016b; Bresson and Laurent, 2017), edge conditioning and skip connections (Sukhbaatar et al., 2016; Simonovsky and Komodakis, 2017). Other approaches consider an ensemble of graphs (Anirudh and Thiagarajan, 2017), multiple adjacency matrices (P. Such et al., 2017), the dual graph (Monti et al., 2018) and random perturbation of the graph (Sun et al., 2019). Scalable training for large networks can be achieved through neighbour sampling (Hamilton et al., 2017), importance sampling (Chen et al., 2018b) or control variate based stochastic approximation (Chen et al., 2018a).
Most existing approaches process the graph as if it represents the true relationship between nodes. However, in many cases the graphs employed in applications are themselves derived from noisy data or inaccurate modelling assumptions. The presence of spurious edges or the absence of edges between nodes with very strong relationships in these noisy graphs can affect learning adversely. This can be addressed to some extent by attention mechanisms (Veličković et al., 2018) or generating an ensemble of multiple graphs by erasing some edges (Anirudh and Thiagarajan, 2017), but these approaches do not consider creating any edges that were not present in the observed graph.
In order to account for the uncertainty in the graph structure, Zhang et al. (2019) present a Bayesian framework where the observed graph is viewed as a random sample from a collection described by a parametric random graph model. This permits joint inference of the graph and the GCNN weights, and significantly outperforms the state-of-the-art algorithms when only a limited amount of training labels is available. While the approach is effective, choosing an appropriate random graph model is crucial, and the correct choice can vary greatly across problems and datasets. Another significant drawback of the technique is that the posterior inference of the graph is conditioned solely on the observed graph. As a result, any information provided by the node features and the training labels is disregarded. This is highly undesirable in scenarios where the features and labels are strongly correlated with the true graph connectivity.
In this paper, we propose an alternative approach that formulates the posterior inference of the graph in a non-parametric fashion, conditioned on the observed graph, the features and the training labels. Experimental results show that our approach performs strongly on the semi-supervised node classification task when the number of training labels is limited.
2 Graph convolutional neural networks (GCNNs)
Although graph convolutional neural networks have been applied in a variety of inference tasks, here we consider the node classification problem for conciseness. In this setting, we have access to an observed graph $\mathcal{G}_{obs} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ denotes the set of edges. For every node $i \in \mathcal{V}$ we observe a feature vector $\mathbf{x}_i$, but the label $y_i$ is known only for the nodes in a subset $\mathcal{L} \subset \mathcal{V}$. The goal is to infer the labels of the remaining nodes using the information provided by the observed graph $\mathcal{G}_{obs}$, the feature matrix $\mathbf{X}$ and the training labels $\mathbf{Y}_\mathcal{L}$. A GCNN performs graph convolution operations within a neural network architecture to address this task. Although there are many different versions of the graph convolution operation, the layer-wise propagation rule of the simpler architectures (Defferrard et al., 2016; Kipf and Welling, 2017) can be expressed as:
(1) $\mathbf{H}^{(1)} = \sigma\big(\mathbf{A}_N \mathbf{X} \mathbf{W}^{(0)}\big)$
(2) $\mathbf{H}^{(l+1)} = \sigma\big(\mathbf{A}_N \mathbf{H}^{(l)} \mathbf{W}^{(l)}\big)$
Here $\mathbf{H}^{(l)}$ are the output features from layer $l$, and $\sigma$ is a point-wise non-linear activation function. The normalized adjacency operator $\mathbf{A}_N$, which is derived from the observed graph, determines the mixing of the output features across the graph at each layer. $\mathbf{W}^{(l)}$ denotes the weights of the neural network at layer $l$; we use $\mathbf{W}$ to denote the collection of all GCNN weights. For an $L$-layer network, the final output is $\mathbf{Z} = \mathbf{H}^{(L)}$. Learning of the weights of the neural network is carried out by backpropagation, with the objective of minimizing an error metric between the training labels $\mathbf{Y}_\mathcal{L}$ and the network predictions $\mathbf{Z}_\mathcal{L}$ at the nodes in the training set.
3 Methodology
We consider a Bayesian approach, constructing a joint posterior distribution of the graph, the weights of the GCNN and the node labels. Our goal is to compute the marginal posterior probability of the node labels, which is expressed as follows:
(3) $p(\mathbf{Z} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \mathcal{G}_{obs}) = \int p(\mathbf{Z} \mid \mathbf{W}, \mathcal{G}, \mathbf{X})\, p(\mathbf{W} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \mathcal{G})\, p(\mathcal{G} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \mathcal{G}_{obs})\, d\mathbf{W}\, d\mathcal{G}$
Here $\mathbf{W}$ denotes the random weights of a Bayesian GCNN over graph $\mathcal{G}$. In a node classification setting, the term $p(\mathbf{Z} \mid \mathbf{W}, \mathcal{G}, \mathbf{X})$ is modelled using a categorical distribution by applying a softmax function to the output of the GCNN. As the integral in equation 3 cannot be computed analytically, a Monte Carlo approximation is formed as follows:
(4) $p(\mathbf{Z} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \mathcal{G}_{obs}) \approx \dfrac{1}{N_G S} \sum_{i=1}^{N_G} \sum_{s=1}^{S} p(\mathbf{Z} \mid \mathbf{W}_{s,i}, \mathcal{G}_i, \mathbf{X})$
In this approximation, $N_G$ graphs $\mathcal{G}_i$ are sampled from $p(\mathcal{G} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \mathcal{G}_{obs})$, and $S$ weight samples $\mathbf{W}_{s,i}$ are drawn from $p(\mathbf{W} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \mathcal{G}_i)$ by training a Bayesian GCNN corresponding to the graph $\mathcal{G}_i$.
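The Monte Carlo estimate in equation 4 is simply an average of categorical (softmax) outputs over sampled graphs and sampled weights. The sketch below illustrates the mechanics on a toy problem; the random edge flips and Gaussian weight perturbations are illustrative stand-ins for the actual samplers of $p(\mathcal{G} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \mathcal{G}_{obs})$ and $p(\mathbf{W} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \mathcal{G}_i)$ (in this paper the weight samples come from MC dropout), and the forward pass is a single graph-convolution layer in the style of equation 1:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def normalized_adjacency(A):
    """Symmetric normalization of the adjacency matrix with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * np.outer(d_inv_sqrt, d_inv_sqrt)

def gcnn_logits(A, X, W):
    """Single graph-convolution layer: A_N X W (equation 1, before the softmax)."""
    return normalized_adjacency(A) @ X @ W

rng = np.random.default_rng(1)
N, F, K = 5, 3, 2                       # nodes, features, classes
X = rng.standard_normal((N, F))
U = np.triu((rng.random((N, N)) < 0.4).astype(float), 1)
A_obs = U + U.T                         # random symmetric observed graph
W_mean = rng.standard_normal((F, K))

N_G, S = 4, 10                          # graph samples, weight samples per graph
p_Z = np.zeros((N, K))
for _ in range(N_G):
    # Stand-in for a graph sample from p(G | Y_L, X, G_obs): flip a few edges on.
    flip = np.triu((rng.random((N, N)) < 0.1).astype(float), 1)
    G_i = np.clip(A_obs + flip + flip.T, 0.0, 1.0)
    for _ in range(S):
        # Stand-in for a posterior weight sample (the paper uses MC dropout).
        W_s = W_mean + 0.1 * rng.standard_normal((F, K))
        p_Z += softmax(gcnn_logits(G_i, X, W_s))
p_Z /= N_G * S                          # estimate of p(Z | Y_L, X, G_obs); rows sum to 1
```

Each row of `p_Z` is the estimated marginal class distribution of one node; hard predictions are its row-wise argmax.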
Zhang et al. (2019) assume that $\mathcal{G}_{obs}$ is a sample from a collection of graphs associated with a parametric random graph model, and their approach targets inference of $p(\mathcal{G} \mid \mathcal{G}_{obs})$ via marginalization of the random graph parameters, ignoring any possible dependence of the graph on the features $\mathbf{X}$ and the labels $\mathbf{Y}_\mathcal{L}$. By contrast, we consider a non-parametric posterior distribution of the graph, $p(\mathcal{G} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \mathcal{G}_{obs})$. This allows us to incorporate the information provided by the features and the labels in the graph inference process.
We denote by $\mathbf{A}_G$ the symmetric adjacency matrix, with non-negative entries, of the random undirected graph $\mathcal{G}$. The prior distribution for $\mathbf{A}_G$ is defined as
(5) $p(\mathbf{A}_G) \propto \exp\big(\alpha\, \mathbf{1}^T \log(\mathbf{A}_G \mathbf{1}) - \beta\, \|\mathbf{A}_G\|_F^2\big)$
for allowable graphs, i.e., $\mathbf{A}_G$ in the set $\mathcal{A}$ of symmetric matrices with non-negative entries and zero diagonal. The first term in the log prior prevents any isolated node in $\mathcal{G}$, and the second encourages low weights for the links. $\alpha$ and $\beta$ are hyperparameters which control the scale and sparsity of $\mathbf{A}_G$. The joint likelihood of $\mathbf{Y}_\mathcal{L}$, $\mathbf{X}$ and $\mathcal{G}_{obs}$ conditioned on $\mathcal{G}$ is:
(6) $p(\mathbf{Y}_\mathcal{L}, \mathbf{X}, \mathcal{G}_{obs} \mid \mathcal{G}) \propto \exp\big(-\|\mathbf{A}_G \circ \mathbf{Z}\|_{1,1}\big)$
where $\mathbf{Z}$ is a symmetric pairwise distance matrix between the nodes, the symbol $\circ$ denotes the Hadamard product, and $\|\cdot\|_{1,1}$ denotes the element-wise $\ell_1$ norm. We propose to use
(7) $\mathbf{Z} = \mathbf{Z}^{x,\mathcal{G}_{obs}} + \delta\, \mathbf{Z}^{y}$
where the $(i,j)$'th entries of $\mathbf{Z}^{x,\mathcal{G}_{obs}}$ and $\mathbf{Z}^{y}$ are defined as follows:
(8) $z^{x,\mathcal{G}_{obs}}_{ij} = \|\mathbf{e}_i - \mathbf{e}_j\|_2^2$
(9) $z^{y}_{ij} = \dfrac{1}{|\mathcal{N}_i|\,|\mathcal{N}_j|} \sum_{u \in \mathcal{N}_i} \sum_{v \in \mathcal{N}_j} \mathbb{1}(\hat{y}_u \neq \hat{y}_v)$
Here, $\mathbf{e}_i$ is any suitable embedding of node $i$, and $\hat{y}_i$ is the label obtained at node $i$ by a base classification algorithm. $\mathbf{Z}^{x,\mathcal{G}_{obs}}$ summarizes the pairwise distances in terms of the observed topology and features, and $\mathbf{Z}^{y}$ encodes the dissimilarity in node labels robustly, by considering the obtained labels of the neighbours in the observed graph. In this paper, we choose the Graph Variational Auto-Encoder (GVAE) algorithm (Kipf and Welling, 2016) as the node embedding method to obtain the vectors $\mathbf{e}_i$, and use the GCNN proposed by Kipf and Welling (2017) as the base classifier to obtain the $\hat{y}_i$ values. The neighbourhood of node $i$ is defined as $\mathcal{N}_i = \{i\} \cup \{j : (i,j) \in \mathcal{E}\}$. The hyperparameter $\delta$ controls the importance of $\mathbf{Z}^{y}$ relative to $\mathbf{Z}^{x,\mathcal{G}_{obs}}$.
In order to use the approximation in equation 4, we need to obtain samples from the posterior of $\mathcal{G}$. However, the design of a suitable MCMC sampler in this high-dimensional space ($\mathcal{O}(N^2)$ variables, where $N$ is the number of nodes) is extremely challenging and computationally expensive. Instead, we replace the integral over $\mathcal{G}$ in equation 3 by the maximum a posteriori (MAP) estimate of $\mathcal{G}$, following the approach of (MacKay, 1996). We solve the following optimization problem
(10) $\hat{\mathcal{G}} = \operatorname{argmax}_{\mathcal{G}}\; p(\mathcal{G} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \mathcal{G}_{obs})$
and approximate the integral in equation 3 as follows:
(11) $p(\mathbf{Z} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \mathcal{G}_{obs}) \approx \dfrac{1}{S} \sum_{s=1}^{S} p(\mathbf{Z} \mid \mathbf{W}_s, \hat{\mathcal{G}}, \mathbf{X})$
Here, weight samples $\mathbf{W}_s$ are drawn from $p(\mathbf{W} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \hat{\mathcal{G}})$. The MAP inference in equation 10 is equivalent to learning a symmetric adjacency matrix $\hat{\mathbf{A}}_G$ of $\hat{\mathcal{G}}$:
(12) $\hat{\mathbf{A}}_G = \operatorname{argmin}_{\mathbf{A}_G \in \mathcal{A}}\; \|\mathbf{A}_G \circ \mathbf{Z}\|_{1,1} - \alpha\, \mathbf{1}^T \log(\mathbf{A}_G \mathbf{1}) + \beta\, \|\mathbf{A}_G\|_F^2$
The optimization problem in equation 12 has been studied in the context of graph learning from smooth signals. In (Kalofolias, 2016), a primal-dual optimization algorithm is employed to solve this problem. However, the complexity of this approach scales as $\mathcal{O}(N^2)$, which can be prohibitive for large graphs. We instead use the more scalable approximate algorithm in (Kalofolias and Perraudin, 2017). This formulation allows us to use a single hyperparameter in place of $\alpha$ and $\beta$ to control the sparsity of the solution, and provides a useful heuristic for choosing a suitable value.
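To illustrate the objective in equation 12, here is a toy projected-gradient solver. This is a pedagogical sketch, not the primal-dual or large-scale algorithms cited above: it symmetrizes and clips after each gradient step, a simple heuristic treatment of the constraint set, and the two-cluster distance matrix is synthetic:

```python
import numpy as np

def graph_map_objective(A, Z, alpha, beta):
    """Negative log-posterior up to constants: ||A∘Z||_{1,1} - alpha 1^T log(A1) + beta ||A||_F^2."""
    d = A.sum(axis=1)
    return (A * Z).sum() - alpha * np.log(d).sum() + beta * (A ** 2).sum()

def learn_graph(Z, alpha=1.0, beta=0.5, lr=0.01, iters=2000):
    """Toy projected gradient descent for equation 12 over symmetric,
    non-negative adjacency matrices with zero diagonal."""
    N = Z.shape[0]
    A = np.ones((N, N)) - np.eye(N)              # feasible start: every degree positive
    for _ in range(iters):
        d = A.sum(axis=1)
        grad = Z - alpha / d[:, None] + 2.0 * beta * A
        A = A - lr * grad
        A = np.maximum((A + A.T) / 2.0, 0.0)     # project back: symmetrize, clip at zero
        np.fill_diagonal(A, 0.0)
    return A

# Two clusters of three nodes: small pairwise distances within, large across.
Z = np.block([[0.1 * np.ones((3, 3)), 2.0 * np.ones((3, 3))],
              [2.0 * np.ones((3, 3)), 0.1 * np.ones((3, 3))]])
np.fill_diagonal(Z, 0.0)
A_hat = learn_graph(Z)
obj = graph_map_objective(A_hat, Z, alpha=1.0, beta=0.5)
# A_hat recovers the two clusters: within-cluster weights stay positive,
# while cross-cluster weights are driven to zero by the large distances.
```

The log-barrier term keeps every node's degree positive (no isolated nodes), while the distance term prunes edges between dissimilar nodes, which is exactly the behaviour the prior and likelihood above are designed to produce.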
Various techniques such as expectation propagation (Hernández-Lobato and Adams, 2015), variational inference (Gal and Ghahramani, 2016; Sun et al., 2017; Louizos and Welling, 2017), and Markov chain Monte Carlo methods (Neal, 1993; Korattikara et al., 2015; Li et al., 2016a) can be employed for the posterior inference of the GCNN weights. Following the approach in (Zhang et al., 2019), we train a GCNN over the inferred graph $\hat{\mathcal{G}}$ and use Monte Carlo dropout (Gal and Ghahramani, 2016) to sample from a variational approximation of $p(\mathbf{W} \mid \mathbf{Y}_\mathcal{L}, \mathbf{X}, \hat{\mathcal{G}})$. The resulting procedure is summarized in Algorithm 1.
4 Experimental Results
We investigate the performance of the proposed Bayesian GCNN on three citation datasets (Sen et al., 2008): Cora, CiteSeer and Pubmed. In these datasets, each node corresponds to a document and the undirected edges are citation links. Each node has a sparse bag-of-words feature vector associated with it, and the node label represents the topic of the document. We address a semi-supervised node classification task in which we have access to the labels of only a few nodes per class, and the goal is to infer the labels of the others. We consider three experimental settings, with 5, 10 and 20 labels per class in the training set. The partitioning of the data in the 20-labels-per-class case is the same as in (Yang et al., 2016); in the other two cases, we construct the training set by including the first 5 or 10 labels from the previous partition. The hyperparameters of the GCNN are borrowed from (Kipf and Welling, 2017) and are used for the BGCN algorithms as well.
We compare the proposed BGCN in this paper with ChebyNet (Defferrard et al., 2016), GCNN (Kipf and Welling, 2017), GAT (Veličković et al., 2018) and the BGCN in (Zhang et al., 2019). Table 1 shows the summary of results based on 50 runs with random weight initializations.
Dataset      | Cora     |          |          | Citeseer |          |          | Pubmed   |          |
Labels/class | 5        | 10       | 20       | 5        | 10       | 20       | 5        | 10       | 20
ChebyNet     | 67.9±3.1 | 72.7±2.4 | 80.4±0.7 | 53.0±1.9 | 67.7±1.2 | 70.2±0.9 | 68.1±2.5 | 69.4±1.6 | 76.0±1.2
GCNN         | 74.4±0.8 | 74.9±0.7 | 81.6±0.5 | 55.4±1.1 | 65.8±1.1 | 70.8±0.7 | 69.7±0.5 | 72.8±0.5 | 78.9±0.3
GAT          | 73.5±2.2 | 74.5±1.3 | 81.6±0.9 | 55.4±2.6 | 66.1±1.7 | 70.8±1.0 | 70.0±0.6 | 71.6±0.9 | 76.9±0.5
BGCN         | 75.3±0.8 | 76.6±0.8 | 81.2±0.8 | 57.3±0.8 | 70.8±0.6 | 72.2±0.6 | 70.9±0.8 | 72.3±0.8 | 76.6±0.7
BGCN (ours)  | 76.0±1.1 | 76.8±0.9 | 80.3±0.6 | 59.0±1.5 | 71.7±0.8 | 72.6±0.6 | 73.3±0.7 | 73.9±0.9 | 79.2±0.5
The results in Table 1 show that the proposed algorithm yields higher classification accuracy than the other algorithms in most cases. Figure 1 demonstrates that, for the Cora and Citeseer datasets, the proposed BGCN algorithm corrects more errors of the GCNN base classifier on low-degree nodes in most cases. The same trend is observed for the Pubmed dataset. In Figure 2, the adjacency matrix $\hat{\mathbf{A}}_G$ of the MAP estimate graph is shown along with the observed adjacency matrix $\mathbf{A}_{obs}$ for the Cora dataset. We observe that, compared to $\mathbf{A}_{obs}$, $\hat{\mathbf{A}}_G$ shows denser connectivity among nodes with the same label.
5 Conclusion
In this paper, we present a Bayesian GCNN based on a non-parametric graph inference technique. The proposed algorithm achieves superior performance when the number of labels available during training is limited. Future work will investigate extending the methodology to other graph-based learning tasks, incorporating other generative models for graphs, and developing scalable techniques to perform effective inference for those models.
References
Anirudh and Thiagarajan (2017). Bootstrapping graph convolutional neural networks for autism spectrum disorder classification. arXiv:1704.07487.
Atwood and Towsley (2016). Diffusion-convolutional neural networks. In Proc. Adv. Neural Inf. Proc. Systems.
Bresson and Laurent (2017). Residual gated graph ConvNets. arXiv:1711.07553.
Bruna et al. (2013). Spectral networks and locally connected networks on graphs. In Proc. Int. Conf. Learning Representations, Scottsdale, AZ, USA.
Chen et al. (2018a). Stochastic training of graph convolutional networks with variance reduction. In Proc. Int. Conf. Machine Learning.
Chen et al. (2018b). FastGCN: fast learning with graph convolutional networks via importance sampling. In Proc. Int. Conf. Learning Representations.
Defferrard et al. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. In Proc. Adv. Neural Inf. Proc. Systems.
Duvenaud et al. (2015). Convolutional networks on graphs for learning molecular fingerprints. In Proc. Adv. Neural Inf. Proc. Systems.
Gal and Ghahramani (2016). Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proc. Int. Conf. Machine Learning.
Hamilton et al. (2017). Inductive representation learning on large graphs. In Proc. Adv. Neural Inf. Proc. Systems.
Henaff et al. (2015). Deep convolutional networks on graph-structured data. arXiv:1506.05163.
Hernández-Lobato and Adams (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proc. Int. Conf. Machine Learning.
Kalofolias and Perraudin (2017). Large scale graph learning from smooth signals. arXiv:1710.05654.
Kalofolias (2016). How to learn a graph from smooth signals. In Proc. Artificial Intelligence and Statistics.
Kipf and Welling (2016). Variational graph auto-encoders. arXiv:1611.07308.
Kipf and Welling (2017). Semi-supervised classification with graph convolutional networks. In Proc. Int. Conf. Learning Representations.
Korattikara et al. (2015). Bayesian dark knowledge. In Proc. Adv. Neural Inf. Proc. Systems.
Levie et al. (2019). CayleyNets: graph convolutional neural networks with complex rational spectral filters. IEEE Trans. Signal Processing 67(1).
Li et al. (2016a). Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In Proc. AAAI Conf. Artificial Intelligence.
Li et al. (2016b). Gated graph sequence neural networks. In Proc. Int. Conf. Learning Representations.
Louizos and Welling (2017). Multiplicative normalizing flows for variational Bayesian neural networks. arXiv:1703.01961.
MacKay (1996). In Maximum Entropy and Bayesian Methods, pp. 43–59.
Monti et al. (2017). Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proc. IEEE Conf. Comp. Vision and Pattern Recognition.
Monti et al. (2018). Dual-primal graph convolutional networks. arXiv:1806.00770.
Neal (1993). Bayesian learning via stochastic dynamics. In Proc. Adv. Neural Inf. Proc. Systems, pp. 475–482.
P. Such et al. (2017). Robust spatial filtering with graph convolutional neural networks. IEEE J. Sel. Topics Signal Proc. 11(6), pp. 884–896.
Sen et al. (2008). Collective classification in network data. AI Magazine 29(3), p. 93.
Simonovsky and Komodakis (2017). Dynamic edge-conditioned filters in convolutional neural networks on graphs. arXiv:1704.02901.
Sukhbaatar et al. (2016). Learning multiagent communication with backpropagation. In Proc. Adv. Neural Inf. Proc. Systems.
Sun et al. (2019). Fisher-Bures adversary graph convolutional networks. arXiv:1903.04154.
Sun et al. (2017). Learning structured weight uncertainty in Bayesian neural networks. In Proc. Artificial Intelligence and Statistics.
Veličković et al. (2018). Graph attention networks. In Proc. Int. Conf. Learning Representations, Vancouver, Canada.
Yang et al. (2016). Revisiting semi-supervised learning with graph embeddings. arXiv:1603.08861.
Zhang et al. (2019). Bayesian graph convolutional neural networks for semi-supervised classification. In Proc. AAAI Conf. Artificial Intelligence.