I Introduction
Learning on graphs is an important task in machine learning and data science, with applications ranging from social science (e.g. social network analysis) and biology (e.g. protein structure prediction and learning of molecular fingerprints) to computer science (e.g. knowledge graph analysis). The key difference between learning on graph data and conventional machine learning on images and natural language is that, in addition to complex features on each element, there are also relational features encoded by the edges of the graph.
Consider a simple and classic example: classifying the nodes of a citation network into several groups. In addition to the edges between nodes, which represent citations between documents, each node has features, such as keywords of the document, which reveal information about the group labels. We consider the problem in which the group memberships of a small subset of nodes are available as training data, and the task is to predict the group memberships of the other nodes using the training data together with the features and the edges of the graph. This problem is known as semi-supervised classification in graphs, and it has drawn much attention in both the network science and machine learning communities. On this problem we have witnessed a burst in the development of graph convolution networks (GCNs), which have given ground-breaking performance in recent years Kipf and Welling (2017).
Deep convolutional neural networks Goodfellow et al. (2016) have achieved great success in machine learning and artificial intelligence. Since in many applications the data are represented as a graph rather than on a regular grid, a lot of effort has been devoted to extending convolution networks from grid data to graph data, focusing on constructing linear convolution kernels that extract local features in graphs and on learning effective representations of graph objects. During the past several years, many different GCNs have been proposed, using different kinds of convolution kernels as well as different neural network architectures Klicpera et al. (2019); Veličković et al. (2018); Kipf and Welling (2017); Gilmer et al. (2017). In recent years, GCNs have quickly come to dominate the performance on various tasks on graph data, such as object classification, link prediction, and graph-level classification, and are also regarded as having great potential for relational reasoning Battaglia et al. (2018). For object classification, without depending on a specific model, GCNs learn effective object representations with non-linear neural architectures, and the whole framework is trained in an end-to-end fashion.
Although GCNs have achieved state-of-the-art performance on semi-supervised classification, so far there is very little theoretical understanding of the mathematical principles behind graph convolutions, and of the extent to which they can work on a particular problem instance. The main difficulty is that GCNs are usually tested only on a few classic real-world benchmark datasets, and modeling real-world datasets accurately is hard. We therefore lack controllable datasets with continuously tunable parameters for studying the strengths and limitations of GCNs.
In this article we propose to model the semi-supervised classification problem by combining the celebrated stochastic block model Holland et al. (1983) (in the semi-supervised fashion), which generates the relational edges, with the bipartite stochastic block model Hric et al. (2016); Florescu and Perkins (2016), which generates the features of each node. We call the resulting model the Joint Stochastic Block Model (JSBM).
The generated graph is a mixture of a uni-partite graph and a bipartite graph, both of which carry information about the group labels. To the best of our knowledge, the JSBM was proposed in Hric et al. (2016) for studying missing-link and missing-node prediction using both edges and labels, with MCMC-based methods. In this article we focus on semi-supervised learning in the JSBM, its theoretical properties, and optimal algorithms.
On graphs generated by the JSBM, the clustering and classification problems can be translated into a Bayesian inference problem, which we aim to solve in an asymptotically exact way using approaches from statistical physics. This leads to a message-passing algorithm, known in computer science as the belief propagation (BP) algorithm Yedidia et al. (2003), that is asymptotically exact on large sparse random networks generated by the JSBM. Without semi-supervision, by analyzing the stability of the fixed points of the belief propagation algorithm, we find a phase transition, the detectability transition, beyond which no algorithm is able to do classification without supervision. It generalizes the celebrated detectability phase transition Decelle et al. (2011) of the stochastic block model Holland et al. (1983), and puts fundamental limits on the accuracy of classification in networks with both pairwise relational edges and node features in the JSBM. In the semi-supervised classification setting, where a small fraction of node labels are known as training data, our approach still gives Bayes optimal results by pinning the labeled nodes as introduced in Zhang et al. (2014). The parameters can be learnt using truncated belief propagation together with the back-propagation algorithm. This naturally extends the belief propagation algorithm to a graph convolution network, which greatly outperforms existing graph convolution networks on synthetic graphs.
In addition to the theoretical analysis and algorithmic applications, the JSBM can produce well-controlled benchmark graphs with continuously tunable parameters, for benchmarking graph convolution networks, and for understanding their strengths and weaknesses under certain properties of graphs. For example, we discover that the existing state-of-the-art graph convolution networks have a sparsity issue and overfitting issue when applied on synthetic benchmarks. Based on the observation, we give explanations of the weakness of the GCNs based on spectral properties of the graph convolution kernels used by GCNs, and further suggestions on how to overcome the issues.
The paper is organized as follows. In Sec. II we introduce the joint stochastic block model. In Sec. III we translate the classification and clustering problems in the joint stochastic block model into a Bayesian inference problem, and derive an asymptotically optimal algorithm, belief propagation. In Sec. IV we study the detectability phase transitions of the JSBM using a stability analysis of the belief propagation algorithm. In Sec. V we convert the belief propagation algorithm into a graph convolution neural network, the BPGCN. In Sec. VI we evaluate the performance of the BPGCN and compare it with state-of-the-art graph convolution neural networks on both synthetic and real-world datasets.
II Joint stochastic block model
The purpose of the JSBM is to generate synthetic random graphs whose nodes carry both edges and node features. Each graph node has a pre-determined hidden label (i.e. its ground-truth group membership), chosen randomly and independently according to a prior probability for each group. Then each pair of graph nodes is connected by an edge with a probability that depends only on the labels of the two nodes, given by the corresponding element of an affinity matrix, and is left unconnected otherwise. In the sense of statistical inference, the generating process is a measurement of the signal (the ground-truth labels), which produces edges as measurement results that encode information about the ground-truth labels. The properties of the connections, i.e. of the adjacency matrix of the generated graph, are controlled by the affinity matrix. If its diagonal elements are much larger than its off-diagonal elements, there are more edges connecting nodes in the same group than edges connecting nodes in different groups. Up to this point the process of generating edges is the same as in the stochastic block model. An intuitive picture of the generated network is a citation network, where the nodes represent research papers, each of which belongs to a certain research area denoted by its label, such as “physics” or “history”. Articles belonging to the same category cite each other more frequently than articles belonging to different categories.
In addition to edges, we consider features, which are treated as another, distinct type of node that we call feature nodes. In the citation network example, in addition to the citations that reveal research field information, each article also has a set of keywords, which also reveal the label of the node. Each feature node also has a ground-truth label, analogous to that of a graph node, chosen randomly from the set of labels according to a prior probability. We then connect a feature node and a graph node with a probability that depends only on their two labels, given by a second affinity matrix, and leave them unconnected otherwise. The connections between graph nodes and feature nodes thus form a bipartite graph, with a generation process analogous to that of the bipartite stochastic block model Florescu and Perkins (2016). Altogether, the JSBM generates a graph that combines a uni-partite graph and a bipartite graph, similar to the graph that appears in semi-restricted Boltzmann machines Osindero and Hinton (2008). In this picture, a feature node plays the role of a hidden variable, or of a functional node carrying a multi-variable interaction among the graph nodes connected to it, so we will also refer to the edges connecting a feature node to its neighboring graph nodes as hyper edges. Algebraically, the graph generated by the JSBM can be represented by two matrices: a node-node adjacency matrix, whose nonzero entries denote edges between pairs of graph nodes, and a feature-node adjacency matrix, whose nonzero entries denote connections between a graph node and a feature node. We note that the JSBM was proposed in Hric et al. (2016) for studying missing-link and missing-node prediction using both edges and labels (annotations) of each node.
III Bayesian inference of the Joint Stochastic Block Model
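As a concrete reference point for the inference problem, the generative process of the JSBM described in Sec. II can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the function name `generate_jsbm`, its argument names, and the dense sampling of all node pairs are hypothetical choices made here, not the authors' implementation, and the label prior defaults to uniform.

```python
import numpy as np

def generate_jsbm(n, m, q, p_edge, p_feat, rng, prior=None):
    """Sample one JSBM instance (illustrative sketch, hypothetical names).

    n, m   : number of graph nodes and of feature nodes
    q      : number of groups
    p_edge : (q, q) symmetric matrix of edge probabilities between graph nodes
    p_feat : (q, q) matrix of connection probabilities, indexed by
             (graph-node label, feature-node label)
    prior  : length-q label prior; uniform if None (assumption)
    """
    t = rng.choice(q, size=n, p=prior)  # hidden labels of graph nodes
    s = rng.choice(q, size=m, p=prior)  # hidden labels of feature nodes

    # Uni-partite part: one independent Bernoulli trial per unordered pair i < j.
    pair_prob = p_edge[t[:, None], t[None, :]]
    upper = np.triu(rng.random((n, n)) < pair_prob, k=1)
    A = (upper | upper.T).astype(np.int8)  # symmetric, zero diagonal

    # Bipartite part: one Bernoulli trial per (graph node, feature node) pair.
    F = (rng.random((n, m)) < p_feat[t[:, None], s[None, :]]).astype(np.int8)
    return t, s, A, F

rng = np.random.default_rng(0)
p_edge = np.array([[0.05, 0.01], [0.01, 0.05]])
p_feat = np.array([[0.08, 0.02], [0.02, 0.08]])
t, s, A, F = generate_jsbm(200, 100, 2, p_edge, p_feat, rng)
```

The dense sampling above costs time and memory quadratic in the number of nodes, so it is only suitable for small illustrative graphs; large sparse instances would need a sparse sampler.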
The clustering problem defined on a graph generated by the JSBM is to recover the ground-truth labels as accurately as possible using the edges and features, while the semi-supervised classification problem asks for the same recovery using the edges and features together with a small number of training labels, given for the nodes belonging to a training set. If the parameters used in generating the graph are known, both tasks can be translated into an inference problem: reconstructing the hidden group labels given the observed edges and features. There are basically two kinds of inference for this task: maximum a posteriori (MAP) estimation (with maximum likelihood inference as the special case of flat priors), and Bayesian inference, which amounts to computing the posterior distribution. In this work we consider Bayesian inference, which is Bayes optimal. Using Bayes' rule, the posterior is written as
(1) |
where represents prior information we have on the node labels, is the likelihood of group labels , which is product of probabilities of generating all edges and hyper edges.
(2) |
where denotes the set of edges between graph nodes, and denotes the set of edges between graph nodes and feature nodes. The clustering problem corresponds to putting a flat prior, i.e. , and the semi-supervised classification problem corresponds to putting a strong prior on the nodes in the training set to pin the marginals of the nodes in the direction of training labels, as proposed in Zhang et al. (2014).
It is well known that computing the normalization of the posterior distribution is a #P-hard problem in general, so no polynomial-time algorithm for it is expected to exist. Thus we need efficient and accurate approximations for the inference. In the language of statistical physics, the posterior distribution is a Boltzmann distribution at unit temperature: the negative log-likelihood represents the energy; the evidence, i.e. the normalization constant of the posterior, is the partition function; and any prior information plays the role of external fields acting on each graph node. For the clustering problem, where we do not have any training labels, the external fields are set to zero; for the semi-supervised classification problem, the fields on the nodes belonging to the training set are set to infinity, pinning those nodes to their training labels Zhang et al. (2014).
For random sparse graphs the inference can be studied theoretically in the thermodynamic limit using the cavity method from statistical physics Mezard and Montanari (2009); Yedidia et al. (2003). If the parameters used in generating the graph are known in advance, the system is on the Nishimori line Nishimori (1980); Iba (1999) and no spin glass phase can appear. On a single instance, the replica-symmetric cavity method naturally translates into the belief propagation (BP) algorithm, which passes messages along the directed edges of the factor graph. The factor graph is illustrated in Fig. 1, where each edge represents a two-body factor and each feature represents a multi-body factor.
When the messages converge, they can be used to compute posterior marginals and the Bethe free energy.
From the algorithm point of view, the belief propagation adopts the Bethe approximation Bethe (1935); Yedidia et al. (2003)
(3) |
as a variational distribution, and adjusts the variational parameters (the two-point marginals and the single-point marginals) to minimize the Bethe free energy. The final form of belief propagation is written as iterative equations for three kinds of messages, which are listed below. The detailed derivations, as well as the approximation applied to non-edges, are deferred to the Appendices.
(4) |
Here, is the cavity marginals passing through graph node to graph node , representing probability of node taking label when node is removed from the graph. is similar, but represents probability of node taking label when a feature is removed from the graph; analogously, represents the probability of a feature taking label when graph node is removed from the graph. , , and are normalizing factors ensuring a normalized probability; denotes the set of neighbors of node in the graph. In the equations, variables and are adaptive fields contributed by the non-edges of the graph, they are derived in the Appendices, and can be formulated as
(5) |
Once above iterative equations converge (i.e. messages do not change significantly), we can estimate posterior marginals using
(6) |
Based on the marginals obtained using Eq. (6), one can estimate the label of each graph node as the one that maximizes its marginal:
(7) |
In Bayesian inference, the estimate of Eq. (7) maximizes the posterior marginal of each node, and gives the optimal result with the minimum mean square error (MMSE) Iba (1999). The essential approximation applied in belief propagation is the conditional independence assumption, which is exact on trees. Thus BP gives exact posterior marginals if the given graph is a tree, reflecting the fact that the Bethe approximation (3) is always correct on a tree for describing any joint probability distribution. Empirically, BP also gives a good approximation to the true posterior marginals if the graph is sparse and has a locally tree-like structure, and hence it is widely applied to inference problems in sparse systems Mezard and Montanari (2009).
IV Detectability transitions of JSBM
If the graph is generated by the JSBM, and the parameters are known, the cavity method provides asymptotically exact analysis; the belief propagation algorithm almost always converges using correct parameters, without encountering the spin glass phase due to the Nishimori line property Nishimori (1980); Iba (1999). Thus, asymptotically exact properties such as phase diagram of the JSBM can be studied directly at the thermodynamic limit by analysing the messages of the belief propagation.
Observe that the BP equations (4) always have a trivial fixed point
(8) |
This fixed point states that every node in the graph has equal probability of belonging to every group, so it is known as the paramagnetic fixed point or liquid fixed point. It indicates that the marginals do not provide any information about the ground-truth node labels, but only reflect the permutation symmetry of the system. When the paramagnetic fixed point is stable, the system is in the paramagnetic state, and we conjecture that no algorithm can find information about the ground-truth labels with a success better than a random guess. For the stochastic block model this regime is known as the non-detectable phase, and its existence has been mathematically proved in Mossel et al. (2018). Here we extend the conjecture to the JSBM, with both edges and node features. The undetectable phase in the JSBM is conceptually analogous to the paramagnetic phase of the ferromagnetic Ising model, where the underlying ground-truth labels are the all-one configuration, and to that of the Hopfield model, where the underlying ground-truth labels are the stored patterns. From the viewpoint of statistical inference, edges and hyper edges are observations of the signal (i.e. the ground-truth labels), so the paramagnetic phase indicates that the observations are too few to reveal any information about the signal. An extreme example is the case with no edge or hyper edge at all: every group assignment of the graph node labels then has equal probability, hence no algorithm can recover any information about the ground-truth labels.
As the number of observations increases, the paramagnetic fixed point eventually becomes unstable, and belief propagation (4) converges instead to a non-trivial fixed point that is correlated with the ground-truth group labels, which can then be extracted using Eq. (7). The point at which the paramagnetic fixed point of belief propagation becomes unstable is the detectability transition of the JSBM. It is algorithm independent, and puts a fundamental limit on the ability of any algorithm to detect information about the ground-truth labels.
This transition can be determined by analyzing the stability of the paramagnetic fixed point (8) of BP under random perturbations. Assume that in the graph generated by the JSBM each graph node is connected on average to a finite number of graph nodes and of feature nodes, and that each feature node is connected on average to a finite number of graph nodes. Consider putting random noise with zero mean and unit variance on every node of the graph. Using the locally tree-like property of the graph, after one iteration of the BP equations (4) the noise propagates to neighboring graph nodes both through edges and through features (i.e. hyper edges), with multiplicities given by these average degrees. The process of noise propagation on the tree is depicted in Fig. 2. If all leaves of the tree carry random noise with zero mean and unit variance, then after a large number of iterations of the BP equations (4) the total variance of the noise on the root node can be computed as
(9) |
Here the two degree parameters appearing in Eq. (9) are the average degrees of graph nodes and of factor nodes respectively, which also equal the average excess degrees when the degree distribution is Poisson, as in the JSBM. The details of the derivation can be found in the Appendix. The two eigenvalue parameters are the largest eigenvalues of, respectively, the Jacobian matrix of the messages passed from a graph node to another graph node along an edge, and the Jacobian matrix of the messages passed from a feature node to a graph node. The elements of the Jacobian matrices (together with those of a third matrix), evaluated at the paramagnetic fixed point, are written as
(10) |
The paramagnetic fixed point is unstable under random perturbations whenever the variance of Eq. (9) grows with the number of iterations, which indicates that the detectability phase transition is located at
(11) |
This kind of stability condition is also known in the spin glass literature as the de Almeida-Thouless local stability condition De Almeida and Thouless (1978), and in computer science as the Kesten-Stigum bound on reconstruction on trees Kesten and Stigum (1966, 1967) and as the robust reconstruction threshold Janson et al. (2004).
To illustrate the phase transitions and phase diagrams clearly, we use a simple case in which each of the two affinity matrices takes only two distinct values, one on the diagonal and one on the off-diagonal elements. Note that the matrix elements satisfy the constraints
(12) |
which means that there is only one free parameter for each matrix. We therefore introduce one parameter per matrix for the parametrization. In this setting, the eigenvalues have a simpler expression:
(13) |
To verify our theoretical results on the detectability transitions, we carry out numerical experiments on large synthetic graphs generated by the JSBM, and compare the results obtained by belief propagation with the theoretical predictions. The accuracy of belief propagation is evaluated using the overlap between the obtained partition and the ground-truth partition
(14) |
which is computed by maximizing over all permutations of the groups. We can see from the definition in Eq. (14) that if the inferred labels are chosen randomly and have nothing to do with the ground-truth ones, the overlap equals the accuracy of a random guess; if instead the inferred labels are correlated with the ground truth, the overlap is larger than this value, and it is upper-bounded by 1, which indicates an exact match of the two labelings. The overlap is a commonly chosen quantity for estimating the similarity between two group assignments with a small number of groups. When the number of groups is large, or when the groups have different sizes, other measures should be used, such as the normalized mutual information and its variants Danon et al. (2005); Zhang (2015).
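The overlap computation described above can be sketched as follows. This sketch assumes that Eq. (14) is the fraction of nodes whose inferred label matches the ground truth, maximized over relabelings of the groups; that reading is consistent with the statements that a random guess in two groups gives one half and that an exact match gives 1, but the exact normalization used in the paper is an assumption here. The function name `overlap` is hypothetical.

```python
from itertools import permutations

import numpy as np

def overlap(true_labels, inferred_labels, q):
    """Fraction of matching labels, maximized over all permutations of q groups.

    Assumes labels are integers in {0, ..., q-1}. The loop visits q! permutations,
    so this is only practical for a small number of groups.
    """
    true_labels = np.asarray(true_labels)
    inferred_labels = np.asarray(inferred_labels)
    best = 0.0
    for perm in permutations(range(q)):
        relabeled = np.asarray(perm)[inferred_labels]  # apply the relabeling
        best = max(best, float(np.mean(relabeled == true_labels)))
    return best

# A labeling that is correct up to swapping the two group names is an exact match.
perfect = overlap([0, 0, 1, 1], [1, 1, 0, 0], q=2)
```

The factorial cost is the reason the text recommends other measures, such as normalized mutual information, when the number of groups is large.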
The BP overlaps are shown in Fig. 3. In the left panel of the figure we fix , and plot the overlap obtained by belief propagation with varying . In the middle panel, is fixed to and varies. We can see from the figure that the overlap is always quite close to with a small values indicating that the reconstruction of ground-truth is almost perfect. With increases, the accuracy of inference decreases, and eventually goes to . The point that the overlap decays to coincides very well with the prediction of the detectability transition (11) which is indicated by the dashed lines. With and beyond the detectability transition, system is in the paramagnetic phase, and BP overlap is always which is identical to the accuracy of random guess in two groups. We also claim that the overlap obtained by BP, those are the lines indicated in the figure, are optimal among all possible algorithms in the thermodynamic limit. In the right panel of Fig. 3 we report accuracy of belief propagation on the - plane where the overlap decays from a large value to non-informative (indicated by the colors). The magenta dashed line represents theoretical predictions of the phase transition given by (11), which matches very well with the numerical experiments.
V From belief propagation to graph convolution network
When applying the previously introduced BP algorithm to real-world graphs, or to synthetic graphs whose parameters are unknown, an essential problem is how to learn the parameters. A classic approach is Expectation Maximization (EM) Dempster et al. (1977), which updates the parameters by maximizing the total log-likelihood of the data, that is, by minimizing the total free energy in the language of physics. However, in practice EM for minimizing the free energy is prone to overfitting Decelle et al. (2011) and tends to get trapped in local minima of the free energy. Note that in the setting of semi-supervised learning we have a small number of labels, so the parameters can be learnt in a supervised fashion by matching the predicted labels to the training labels. We propose to first unroll the belief propagation equations and truncate them to a finite number of time steps as a forward pass, and then learn the JSBM parameters using back-propagation. We call the algorithm Belief Propagation Graph Convolution Network (BPGCN), as the whole procedure is analogous to the canonical graph convolution networks Kipf and Welling (2017), but with a different convolution kernel and with non-linear activation functions that come from the mathematically principled message-passing equations for optimal JSBM inference. In this sense, the parameters of the JSBM
and become weight matrices of the neural network to be learnt. The input of the network is a randomly initialized cavity messages (at -th layer) , and , and marginal matrix and , where and are number of edges in uni-partite graph and bi-partite graph. The forward pass of the network naturally comes from the iterative equations of BP, and the propagation equation from the -th layer to the -th layer are formulated as(15) |
where the activation function is inherited from BP, which requires normalizing each marginal probability vector over its components, one per group. With only two groups, the activation function reduces to the sigmoid function. The propagation matrices are non-backtracking matrices Krzakala et al. (2013), which encode the adjacency information of the cavity messages. Detailed explanations of the forward propagation of the BPGCN can be found in the Appendices.
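Two ingredients named in this paragraph can be illustrated in isolation. The sketch below is not the BPGCN forward pass: it only shows, under stated assumptions, that a normalization over two components is a sigmoid of the difference of its inputs, and how a non-backtracking matrix can be built from an edge list using the convention of Krzakala et al. (2013), in which the entry for the pair of directed edges (u→v, w→x) is 1 when v = w and u ≠ x. The function names and the dense construction are hypothetical; the exact indexing used by the BPGCN is given in the paper's Appendices, not here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Normalize exponentiated inputs so that the components sum to one."""
    shifted = x - x.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def non_backtracking_matrix(edges):
    """Dense non-backtracking matrix of an undirected graph given as (i, j) pairs.

    Rows and columns are indexed by directed edges; B[e, f] = 1 when
    e = (u -> v), f = (v -> x) and x != u, i.e. f continues e without returning.
    """
    directed = [(i, j) for i, j in edges] + [(j, i) for i, j in edges]
    index = {e: k for k, e in enumerate(directed)}
    B = np.zeros((len(directed), len(directed)), dtype=np.int8)
    for (u, v), row in index.items():
        for (w, x), col in index.items():
            if w == v and x != u:
                B[row, col] = 1
    return B, directed

# With two groups, the first softmax component equals sigmoid(a - b).
a, b = 0.3, -1.2
two_group = softmax(np.array([a, b]))

# Path graph 0 - 1 - 2: only 0->1->2 and 2->1->0 are non-backtracking steps.
B, directed = non_backtracking_matrix([(0, 1), (1, 2)])
```

The double loop is quadratic in the number of directed edges and is meant only to make the definition explicit; a practical implementation would use a sparse construction.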
Assume that the depth of the BPGCN is , then the marginals of graph nodes
is the output of BPGCN. After that, we choose a loss function on the training labels (a small fraction of known labels in the semi-supervised learning). A common choice of loss function for classifications is the cross entropy, which is defined as
(16) |
where denotes the (training) set of graph nodes, stands for the labels for node in the training set, with
being the one-hot vector.
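The loss evaluation on the training set can be sketched as below. This assumes the standard cross entropy between one-hot training labels and the output marginals, summed over the training nodes; whether Eq. (16) uses a sum or an average is an assumption here, and the names `cross_entropy`, `marginals`, and `train_idx` are hypothetical.

```python
import numpy as np

def cross_entropy(marginals, labels, train_idx, eps=1e-12):
    """Cross entropy between one-hot training labels and predicted marginals.

    marginals : (n, q) array, each row a normalized probability vector
    labels    : (n,) integer labels; only the entries in train_idx are used
    train_idx : indices of the training nodes
    eps       : small constant guarding against log(0)
    """
    marginals = np.asarray(marginals, dtype=float)
    labels = np.asarray(labels)
    train_idx = np.asarray(train_idx)
    # With one-hot targets, only the probability of the true label contributes.
    picked = marginals[train_idx, labels[train_idx]]
    return float(-np.sum(np.log(picked + eps)))

marginals = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])
labels = np.array([0, 0, 1])
loss = cross_entropy(marginals, labels, train_idx=[1, 2])
```

Only the training nodes enter the loss; the remaining rows of `marginals` are the predictions that are later scored on the test set.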
The training of the BPGCN is the same as for other graph convolution networks: in each epoch of training we first do a forward pass and evaluate the cross entropy loss function on the training set, then use the back-propagation algorithm Goodfellow et al. (2016) to compute the gradients of the loss function with respect to the elements of the two affinity matrices, and then apply (stochastic) gradient descent or one of its variants (such as ADAM Kingma and Ba (2015)) to update the parameters. We note that the final evaluation of the algorithm is the accuracy of the predicted labels on the nodes with unknown labels in the test set.
The crucial difference between BPGCN and belief propagation (BP) (including the semi-supervised version Zhang et al. (2014)) is that BP minimizes the Bethe free energy, while BPGCN minimizes the loss function evaluated on the training data. On large synthetic graphs generated by the JSBM, the free energy is the best loss function to minimize. However, when applied to graphs that were not generated by the JSBM, minimizing the Bethe free energy or the energy is prone to overfitting Decelle et al. (2011); Zhang and Moore (2014). Moreover, in BPGCN the adaptive fields of Eq. (5), which are contributed by the non-edges in BP, are not necessary, because the training labels automatically balance the group sizes.
There are two main differences between BPGCN and standard graph neural networks. First, the activation function in BPGCN is determined by the BP message-passing equations rather than being chosen manually. This is in contrast with the usual graph convolution networks, where many activation functions such as ReLU, PReLU and Tanh are available to choose from. Second, there are only a few trainable parameters in BPGCN, namely the elements of the two affinity matrices, and they are fully shared across all layers of the neural network. This certainly limits the overall representation power of the BPGCN, but could in principle greatly enhance the generalization power of the model. Indeed, in semi-supervised classification the amount of training data is much smaller than in the usual supervised setting, and the number of observations is proportional to the number of edges (and hyper edges), so the fully shared parameters are very helpful in preventing overfitting to the training data.
VI Comparing BPGCN with other graph convolution networks
In this section we compare the performance of our BP and BPGCN with several state-of-the-art graph convolution networks on various datasets. The algorithms we compare with include
. Graph Convolution Network (GCN) Kipf and Welling (2017)
GCN is probably the best-known graph convolution network; it drastically outperformed all non-neural-network algorithms when it was proposed.
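The layer-wise forward propagation rule of the GCN is formulated, in the notation of Kipf and Welling (2017), as
$$H^{(l+1)} = \sigma\left(\hat{A}\, H^{(l)}\, W^{(l)}\right), \tag{17}$$
where $H^{(l)}$ and $W^{(l)}$ are the states of the hidden variables and the weight matrix at the $l$-th layer respectively, $\sigma$ is the activation function, and $\hat{A}$ is the graph convolution kernel, which is defined as
$$\hat{A} = \tilde{D}^{-1/2}\,(A + I)\,\tilde{D}^{-1/2}, \tag{18}$$
with $\tilde{D}$ being a diagonal degree matrix with $\tilde{D}_{ii} = \sum_j (A + I)_{ij}$. The propagation rule of the GCN was motivated by a first-order approximation of localized spectral filters, which we discuss in detail in the Appendices. A standard two-layer GCN can therefore be written as
$$Y = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A}\, X\, W^{(0)}\right) W^{(1)}\right), \tag{19}$$
where $X$ is the matrix of node features and $Y$ is the matrix of predictions.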
. Approximate Personalized Propagation of Neural Predictions (APPNP) Klicpera et al. (2019)
Similar to the GCN, APPNP first maps the feature information (encoded in the node feature matrix) to hidden neuron states using a multi-layer perceptron (MLP), then propagates the hidden states via a personalized PageRank scheme to produce predictions of the node labels:
(20) |
In recent benchmarking studies on the performance of popular GCNs, APPNP performs the best across a variety of datasets Fey and Lenssen (2019).
. Graph attention network (GAT) Veličković et al. (2018)
The GAT adopts the popular, and more complex, attention mechanism Vaswani et al. (2017) to learn attention coefficients (weights) between pairs of connected nodes. It can be seen as a convolution based on the adjacency matrix, but with learned, adjustable weights on the edges.
. Simplified Graph Convolution Networks (SGCN) Wu et al. (2019)
The SGCN removes redundant and unnecessary computations from popular graph convolution networks, resulting in a simple low-pass filter followed by a linear classifier. The SGCN thus takes a simple propagation rule that uses a power of the same normalized adjacency matrix as in the GCN:
(21) |
The empirical results show that the simplification captures the essential reasons for success of GCNs, giving positive impact on accuracy as well as speed up over GCNs.
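The propagation rules of three of the baselines listed above can be sketched in NumPy as follows. This is an illustrative sketch based on the published forms of GCN (Kipf and Welling, 2017), APPNP (Klicpera et al., 2019) and SGC (Wu et al., 2019), not on the implementations benchmarked in this paper: the function names, the argument names (`X` for node features, `H` for the MLP output, `alpha` for the teleport probability, `K` for the number of propagation steps) and the dense matrices are assumptions made here, and training is omitted entirely.

```python
import numpy as np

def normalized_kernel(A):
    """Symmetrically normalized adjacency matrix with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # degrees >= 1 due to self-loops
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def softmax_rows(Z):
    """Row-wise softmax, turning each row into a probability vector."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def gcn_two_layer(A_hat, X, W0, W1):
    """Two-layer GCN forward pass with a ReLU hidden layer."""
    H = np.maximum(A_hat @ X @ W0, 0.0)
    return softmax_rows(A_hat @ H @ W1)

def appnp(A_hat, H, alpha=0.1, K=10):
    """Approximate personalized PageRank propagation of the hidden states H."""
    Z = H
    for _ in range(K):
        Z = (1.0 - alpha) * (A_hat @ Z) + alpha * H  # propagate, then teleport to H
    return softmax_rows(Z)

def sgc(A_hat, X, W, K=2):
    """Simplified graph convolution: K-th power of the kernel, then a linear map."""
    return softmax_rows(np.linalg.matrix_power(A_hat, K) @ X @ W)

# Tiny usage example on a two-node graph with random (untrained) weights.
rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [1.0, 0.0]])
A_hat = normalized_kernel(A)
X = rng.normal(size=(2, 3))
Y_gcn = gcn_two_layer(A_hat, X, rng.normal(size=(3, 4)), rng.normal(size=(4, 2)))
Y_sgc = sgc(A_hat, X, rng.normal(size=(3, 2)), K=2)
```

In all three rules the same kernel multiplies the hidden states directly, which is the structural property that the later discussion of the sparsity and overfitting issues relies on.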
VI.1 Comparisons on synthetic datasets
We first compare the algorithms on synthetic networks generated by the JSBM. In the experiments we reveal the labels of a randomly chosen fraction of nodes as supervision. Also, as in the usual experimental setup in classification studies, we randomly choose further nodes with known labels as a validation set, used for tuning the network hyper-parameters and for early stopping, in order to get better generalization on the test set. The classification accuracy is evaluated using the overlap (14) on the labels of the remaining nodes (the test set). Here we choose a very straightforward (although not optimal) initial affinity matrix for the edges in BPGCN, which contains two kinds of elements, one value on the diagonal and another on the off-diagonal, at a fixed ratio. The affinity matrix for the features is initialized in a similar way. We use the same initial conditions for BPGCN on all synthetic networks. Note that the initial matrices could certainly be tuned with the validation set on every synthetic network to obtain better generalization on the test set, but we have not done so in this study.
We have carried out extensive numerical experiments on large synthetic graphs with varying average degrees and varying signal-to-noise ratio. Most of the results are plotted in the Appendices, and representative results are shown in Fig. 4. The first message of our results is that the BPGCN performs very close to the asymptotically optimal BP results, whereas all the other GCNs do not. In the left panel of the figure, is fixed to , and increases from to . We can see that GCN, APPNP, GAT, and SGCN work much worse than BPGCN, even when is quite large. In the middle panel of the figure, we fix to and vary . The figure demonstrates that when is small, GCN, APPNP, GAT, and SGCN perform poorly, which is consistent with what the left panel shows. This observation gives clear evidence that existing popular graph convolution networks are significantly more affected by sparsity in the graph structure. We call this issue of conventional GCNs the sparsity issue.
In the right panel of Fig. 4, we compare the performance of GCNs with fixed to , average degrees fixed to , and , and varying . We can see from the figure that while BPGCN works perfectly in the whole range of , conventional GCNs fail surprisingly, even when BPGCN gives almost exact results with overlap close to . This observation indicates that the conventional GCNs we have tested have difficulties in extracting the information about group labels contained in the features when the edges of the graph are noisy. A further check of the output of the conventional GCNs shows that they all have good training overlap but bad test overlap, which is a clear sign of overfitting to the training labels and to the noise in the graph edges. We call this the overfitting issue. In what follows we analyze these two issues and suggest how to overcome them, based on the features of BPGCN, which is immune to both.
Sparsity issue: Properties of the forward propagation of GCNs are closely related to the linear convolution kernels they use, so the reason for the sparsity issue can be understood by studying the spectrum of the linear convolution kernel used in graph convolutions. In GCN, SGCN and APPNP, the linear convolution kernel is a variant of the normalized adjacency matrix (18). It has been established, e.g. in Krzakala et al. (2013); Zhang (2016), that this kind of linear operator has localization problems in large sparse graphs, with leading eigenvectors encoding only local structures rather than global information about the labels. In the appendices we give a detailed analysis of the spectral localization problem of conventional convolution filters. In contrast, our approach, BPGCN, is immune to the sparsity issue, because the linear convolution kernel of the BPGCN is the non-backtracking matrix, which naturally works well and overcomes the localization problems in large sparse networks Krzakala et al. (2013). Thus, a straightforward way of overcoming the sparsity issue in classic GCNs might be to use a linear kernel that does not have localization problems in sparse graphs, such as the non-backtracking matrix or the X-Laplacian Zhang (2016).
Overfitting issue: From Eq. (17), Eq. (VI) and Eq. (21) we observe that the linear filters always operate directly on the weight matrix or on the hidden states. This reflects a straightforward assumption in conventional GCNs: the relational data, i.e. the edges or adjacency matrix of the graph, must contain information about the labels. This is a natural assumption, but it is not always true. In contrast, the BPGCN overcomes this defect by learning an affinity matrix which essentially stores the learned signal-to-noise ratio corresponding to the edges, and which can therefore identify that there is no information encoded in the graph edges. Based on this observation, a simple solution would be to apply an affinity matrix to the convolution kernel in conventional GCNs, just as NBGCN does.
VI.2 Comparisons on real-world datasets
Next we apply the BPGCN to several commonly used real-world networks with and without node features. The Karate Club network Zachary (1977a) and the Political Blogs network Adamic and Glance (2005a) are classic networks containing community structures. These two networks have no node features, so canonical GCNs usually use the identity matrix as the feature matrix Kipf and Welling (2017). In the Karate Club network we use labels per group as training labels, and labels per group as validation labels. In the Political Blogs network we use labels per group as training labels and labels per group as the validation set. The Citeseer, Cora and Pubmed networks are standard datasets for semi-supervised classification, used in almost all studies related to graph neural networks. On these graphs we follow exactly the same split into training, validation and test sets as in Kipf and Welling (2017). In these networks, the ground-truth label of each node comes from expert division (in Karate Club and Political Blogs) or from human analysis of the content of each article (such as the research area of articles in citation networks). Labels of both the training and validation sets are visible to the GCN algorithms, whose task is to predict the labels of the unseen test set. The performance of the algorithms is evaluated using the overlap between the predictions and the ground-truth labels on the test set.
In the BPGCN algorithm, we use a tunable external field as a hyperparameter to adjust the strength with which a training label acts on its node. We also did a simple search over hyperparameters, including the external field strength and the number of network layers, using the validation set. We again choose a very straightforward (although not optimal) initial matrix in BPGCN which contains two kinds of elements, on the diagonal and on the off-diagonal, with ratio . The matrix is initialized similarly with . In contrast to the synthetic networks, we did a simple search for and using the validation set. The details of the parameters and hyperparameters can be found in the appendices.
Our comparison results are listed in Table 1. We can see from the table that on the two networks without node features, the Karate Club network and the Political Blogs network, all GCNs perform quite well, indicating that all of them succeed in extracting label information from the edges of the network. Our algorithm, BPGCN, outperforms the other three GCNs on the Political Blogs network. We think the good performance comes from the non-backtracking convolution kernel, which is known to have good spectral properties on large sparse graphs such as the Political Blogs network Krzakala et al. (2013).
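The per-group label selection and the identity feature matrix used for the two featureless networks can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the helper names `per_group_split` and `identity_features` are hypothetical, and the numbers of training and validation labels per group are left as parameters.

```python
import random
from collections import defaultdict


def per_group_split(labels, n_train, n_val, seed=0):
    """Pick n_train training and n_val validation nodes from every group.

    All remaining nodes form the test set.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for node, group in enumerate(labels):
        by_group[group].append(node)
    train, val, test = [], [], []
    for group in sorted(by_group):
        members = list(by_group[group])
        if len(members) < n_train + n_val:
            raise ValueError("group %r is too small for this split" % (group,))
        rng.shuffle(members)
        train.extend(members[:n_train])
        val.extend(members[n_train:n_train + n_val])
        test.extend(members[n_train + n_val:])
    return train, val, test


def identity_features(n_nodes):
    """One-hot features for a graph without node attributes.

    With X equal to the identity, X @ W equals W, so the first layer
    simply learns one free embedding vector per node.
    """
    return [[1.0 if i == j else 0.0 for j in range(n_nodes)]
            for i in range(n_nodes)]
```

Sampling per group rather than uniformly guarantees that every group is represented in the training set even when only a handful of labels are revealed.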
On the three networks with node features, Citeseer, Cora, and Pubmed, we see qualitatively different behaviors. On Citeseer and Cora, the BPGCN performs comparably to other GCNs. In particular, on the Citeseer network our algorithm outperforms the Graph Attention Network, and on the Cora network BPGCN outperforms the Graph Convolution Network. However, the BPGCN works poorly on the Pubmed network, giving the worst performance among all GCNs. The reason is that the Pubmed network contains 19717 nodes, but only 500 features. This means that the features are densely connected to the nodes, so the bipartite part of the graph deviates significantly from the sparse random graphs on which the JSBM and our BPGCN algorithm rely. As a consequence, our algorithm is not able to extract the information from the features very well. Indeed, we have verified that by completely ignoring the features, our algorithm gives even better classification accuracy using only the edges of the graph. We also noticed that the multilayer perceptron (MLP), which uses only the information in the features, already achieves a high accuracy. So a simple remedy for the BPGCN, in situations where the features (the bipartite part of the graph) are far from a locally tree-like structure, is to take the results given by the MLP, instead of the bipartite graph, as an external field acting on the BPGCN. We confirm that this strategy lifts the accuracy of BPGCN on the Pubmed network to , which is as good as the state-of-the-art results given by the APPNP.
Dataset | Karate | Polblogs | Citeseer | Cora | Pubmed |
# nodes | 34 | 1490 | 3327 | 2708 | 19717 |
# features | 0 | 0 | 3703 | 1433 | 500 |
Average degree | 4.6 | 22.4 | 2.78 | 3.89 | 4.49 |
MLP Kipf and Welling (2017) | — | — | 58.4 | 52.2 | 72.7 |
GCN Kipf and Welling (2017) | 86.2 | 71.1 | 81.5 | 79.0 | |
GAT Veličković et al. (2018) | 87.9 | 88.7 | 70.8 | 83.1 | 78.5 |
SGCN Wu et al. (2019) | 91.6 | 81.3 | 71.3 | 81.7 | 78.9 |
APPNP Klicpera et al. (2019) | 96.3 | 87.5 | | | |
BPGCN | 95.8 | 71.1 | 82.1 | 70.0 |
VII Conclusions and discussions
We have proposed to model semi-supervised classification in graphs with both pairwise relational features and node features using the joint stochastic block model. We gave a theoretical analysis of the detectability transition and the phase diagram, using the cavity method of statistical physics and the belief propagation algorithm, which we claim to be asymptotically exact in the thermodynamic limit. The JSBM can be used to generate benchmark networks with continuously tunable parameters and with known asymptotically optimal algorithms and accuracy. The benchmarks are particularly useful for evaluating graph neural networks with different designs. In particular, we found that the state-of-the-art graph convolution networks we have tested all perform poorly on sparse graphs, which we can understand using spectral properties of the convolution kernels in the GCNs. We also observe that popular GCNs tend to overfit the training labels when the edges in the graph carry little information about group labels.
Our algorithm naturally translates to a graph convolution network, BPGCN. In contrast to the popular graph convolution networks, its convolution kernel and activation function are determined mathematically from the Bayesian optimal inference algorithm of the JSBM. We show that on synthetic networks our algorithm greatly outperforms existing graph convolution networks, and gives high classification accuracy in the parameter regime where conventional GCNs fail to work. On the real-world networks our algorithm also displays performance comparable to the state-of-the-art GCNs in most cases.
The BPGCN has the unique feature that it is quite powerful in extracting label information from the edges of the graph, and on this point it significantly outperforms the existing GCNs we have tested. This power comes from the non-backtracking convolution kernel inherited from the belief propagation algorithm. The weakness of the BPGCN is illustrated by the Pubmed dataset, in which the features are too dense to be approximated by a sparse random bipartite graph. We give a simple remedy, which replaces the bipartite graph by an external field given by a multilayer perceptron. Based on the fact that the BPGCN is immune to the sparsity issue and the overfitting issue of conventional GCNs, we have discussed how to improve current GCN techniques. It would be interesting to combine the strengths of BPGCN and state-of-the-art GCNs more deeply, which may inspire a new architecture of graph neural networks. We leave this for future work.
References
- Kipf and Welling (2017) T. N. Kipf and M. Welling, in International Conference on Learning Representations (2017).
- Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, Cambridge, 2016).
- Klicpera et al. (2019) J. Klicpera, A. Bojchevski, and S. Günnemann, in International Conference on Learning Representations (2019).
- Veličković et al. (2018) P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, in International Conference on Learning Representations (2018).
- Gilmer et al. (2017) J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, in Proceedings of the 34th International Conference on Machine Learning-Volume 70 (2017) pp. 1263–1272.
- Battaglia et al. (2018) P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al., arXiv preprint arXiv:1806.01261 (2018).
- Holland et al. (1983) P. W. Holland, K. B. Laskey, and S. Leinhardt, Social networks 5, 109 (1983).
- Hric et al. (2016) D. Hric, T. P. Peixoto, and S. Fortunato, Physical Review X 6, 031038 (2016).
- Florescu and Perkins (2016) L. Florescu and W. Perkins, in Conference on Learning Theory (2016) pp. 943–959.
- Yedidia et al. (2003) J. S. Yedidia, W. T. Freeman, and Y. Weiss, Exploring artificial intelligence in the new millennium 8, 236 (2003).
- Decelle et al. (2011) A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Physical Review E 84, 066106 (2011).
- Zhang et al. (2014) P. Zhang, C. Moore, and L. Zdeborová, Physical Review E 90, 052802 (2014).
- Osindero and Hinton (2008) S. Osindero and G. E. Hinton, in Advances in neural information processing systems (2008) pp. 1121–1128.
- Mezard and Montanari (2009) M. Mezard and A. Montanari, Information, Physics and Computation (Oxford University Press, 2009).
- Nishimori (1980) H. Nishimori, Journal of Physics C: Solid State Physics 13, 4071 (1980).
- Iba (1999) Y. Iba, Journal of Physics A: Mathematical and General 32, 3875 (1999).
- Bethe (1935) H. Bethe, Proc. R. Soc. London A 150, 552 (1935).
- Mossel et al. (2018) E. Mossel, J. Neeman, and A. Sly, Combinatorica 38, 665 (2018).
- De Almeida and Thouless (1978) J. De Almeida and D. J. Thouless, Journal of Physics A: Mathematical and General 11, 983 (1978).
- Kesten and Stigum (1966) H. Kesten and B. P. Stigum, The Annals of Mathematical Statistics 37, 1463 (1966).
- Kesten and Stigum (1967) H. Kesten and B. P. Stigum, Journal of Mathematical Analysis and Applications 17, 309 (1967).
- Janson et al. (2004) S. Janson, E. Mossel, et al., The Annals of Probability 32, 2630 (2004).
- Danon et al. (2005) L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas, Journal of Statistical Mechanics: Theory and Experiment 2005, P09008 (2005).
- Zhang (2015) P. Zhang, Journal of Statistical Mechanics: Theory and Experiment 2015, P11006 (2015).
- Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin, Journal of the Royal Statistical Society: Series B (Methodological) 39, 1 (1977).
- Krzakala et al. (2013) F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang, Proceedings of the National Academy of Sciences 110, 20935 (2013).
- Kingma and Ba (2015) D. P. Kingma and J. Ba, in International Conference on Learning Representations (2015).
- Zhang and Moore (2014) P. Zhang and C. Moore, Proceedings of the National Academy of Sciences 111, 18144 (2014).
- Fey and Lenssen (2019) M. Fey and J. E. Lenssen, in ICLR Workshop on Representation Learning on Graphs and Manifolds (2019).
- Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, in Advances in neural information processing systems (2017) pp. 5998–6008.
- Wu et al. (2019) F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, in Proceedings of the 36th International Conference on Machine Learning, Vol. 97 (PMLR, 2019) pp. 6861–6871.
- Zhang (2016) P. Zhang, in Advances In Neural Information Processing Systems (Curran Associates, Inc., 2016) pp. 541–549.
- Zachary (1977a) W. W. Zachary, Journal of anthropological research 33, 452 (1977a).
- Adamic and Glance (2005a) L. A. Adamic and N. Glance, in Proceedings of the 3rd international workshop on Link discovery (ACM, 2005) pp. 36–43.
- Zachary (1977b) W. W. Zachary, Journal of anthropological research 33, 452 (1977b).
- Adamic and Glance (2005b) L. A. Adamic and N. Glance, in Proceedings of the 3rd international workshop on Link discovery (ACM, 2005) pp. 36–43.
- LeCun et al. (1995) Y. LeCun, Y. Bengio, et al., The handbook of brain theory and neural networks 3361, 1995 (1995).
- Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton, in Advances in neural information processing systems (2012) pp. 1097–1105.
- Bruna et al. (2014) J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, in 2nd International Conference on Learning Representations (2014).
- Hammond et al. (2011) D. K. Hammond, P. Vandergheynst, and R. Gribonval, Applied and Computational Harmonic Analysis 30, 129 (2011).
- Weston et al. (2012) J. Weston, F. Ratle, H. Mobahi, and R. Collobert, in Neural Networks: Tricks of the Trade (Springer, 2012) pp. 639–655.
- Yang et al. (2016) Z. Yang, W. W. Cohen, and R. Salakhutdinov, in Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48 (JMLR. org, 2016) pp. 40–48.
- Zhu et al. (2003) X. Zhu, Z. Ghahramani, and J. D. Lafferty, in Proceedings of the 20th International conference on Machine learning (ICML-03) (2003) pp. 912–919.
- Perozzi et al. (2014) B. Perozzi, R. Al-Rfou, and S. Skiena, in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (ACM, 2014) pp. 701–710.
- Lu and Getoor (2003) Q. Lu and L. Getoor, in Proceedings of the 20th International Conference on Machine Learning (ICML-03) (2003) pp. 496–503.
Appendix A Deriving belief propagations
Starting from the Boltzmann distribution Eq. (1) and the definition of the likelihood function Eq. (III), using a variational distribution of the form Eq. (3), and minimizing the Bethe free energy subject to the normalization constraints on the marginals, we arrive at the standard form of belief propagation Yedidia et al. (2003) for the JSBM:
(22) |
where , , and are cavity probabilities as we have stated in the main text. The marginal probabilities are estimated as a function of cavity probabilities as
(23) |
Since there are nonzero interactions between every pair of nodes, Eq. A involves messages in total. However, this gives an algorithm where even a single update takes time, making it suitable only for networks of up to a few thousand nodes. Happily, for large sparse networks, i.e., when , is large and , we can neglect terms of sub-leading order in the equations. In that case we can assume that or sends the same message to all its non-neighbors , and treat these messages as external fields. Thus we only need to keep track of messages, where is the number of all edges (edges and hyperedges), and each update step takes time.
Suppose that , we have :
(24) |
Similarly, , hence the messages on non-edges do not depend, to leading order, on the target node. For nodes with , we have
(25) |
where we have neglected terms that contribute and to , and defined an auxiliary external field
(26) |
Applying the same approximation to the external fields acting on feature nodes, we can finally arrive at (III).
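The bookkeeping implied by this sparse approximation can be sketched as follows: one message is stored per directed edge, and the update of the message from i to j reads only the messages arriving at i from neighbors other than j. The helper names `directed_edges` and `message_dependencies` are hypothetical, the sketch covers only the index structure (not the update equations themselves), and it assumes a simple graph without self-loops or repeated edges. Edges between graph nodes and feature nodes could be appended to the same edge list.

```python
def directed_edges(edges):
    """Return both orientations of every undirected edge."""
    darts = []
    for i, j in edges:
        darts.append((i, j))
        darts.append((j, i))
    return darts


def message_dependencies(edges):
    """For each message i -> j, list the messages k -> i with k != j.

    The returned lists give, up to the orientation convention, the
    sparsity pattern of the non-backtracking operator on directed edges.
    """
    darts = directed_edges(edges)
    index = {dart: a for a, dart in enumerate(darts)}
    senders = {}
    for k, i in darts:
        senders.setdefault(i, []).append(k)
    deps = [
        [index[(k, i)] for k in senders[i] if k != j]
        for i, j in darts
    ]
    return darts, deps
```

For a graph with M undirected edges this stores 2M messages, and one sweep touches each message once per entry of its dependency list, instead of once per pair of nodes.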
Appendix B Deriving the detectability transition using noise perturbations
On graphs generated by the JSBM, the probability that a graph node has neighboring graph nodes follows a Poisson distribution with average degree . And the probability of having features follows a Poisson distribution with average degree . Similarly, the probability that a feature node has neighboring nodes follows a Poisson distribution with average degree . Considering a branching process on an infinite graph generated by the JSBM, the average branching ratio of the process is related to the excess degree, which is the average number of additional neighbors of a node that is reached by following an edge. The average excess degree is computed as
(27) |
Similarly, the excess degree of feature nodes is
(28) |
and similarly, . Let us consider a noise propagation process on a tree as depicted in Fig. 2, with depth . In the tree, odd layers contain solely graph nodes, and even layers contain solely feature nodes (hyperedges, blue boxes) and edges (green boxes), which connect graph nodes in odd layers. Assume that on the leaves of the tree (nodes on the layer) the paramagnetic fixed point is perturbed as
(29) |
where represents a graph node on the layer, is the label of , and corresponds to the root node in Fig. 2. Now let us investigate the influence of a perturbation of the message on any leaf on the message at the root of the tree. For simplicity we first choose a path containing only graph nodes (i.e., nodes that are not connected through feature nodes) and later generalize to paths containing feature nodes or hyperedges. We define the Jacobian matrix with respect to messages passing from a graph node to another graph node along an edge , and the Jacobian matrix corresponding to messages passing from a feature node to a graph node , together with the third matrix . The elements of the Jacobian matrices are computed as
(30) |
These matrices represent the propagation strength (III) between any two messages in the vicinity of the paramagnetic fixed point. We can see that all these matrices are independent of the node indices and depend only on the types of the two nodes between which the message propagates. If the path contains only edges, i.e. no hyperedges, the perturbation of the root node induced by the perturbation can be written as
(31) |
or in vector form, . Now consider a path that contains both edges and hyperedges: every step passing through a hyperedge, and transmits to , therefore the total weight acting on the path is
where