1 Introduction
Growing interest in inference tasks involving networks has prompted the need for learning architectures adapted to graph-structured data. As a result, numerous models have been proposed for addressing various graph-based learning tasks such as classification, link prediction, and recommendation. These approaches process the observed graph as if it depicts the true relationship among the nodes. In practice, observed graphs are formed from imperfect observations and incorrect modelling assumptions: spurious edges may be introduced and important links may be deleted. The vast majority of existing algorithms cannot take this uncertainty in the graph structure into account during training, as there is no mechanism for removing spurious edges from, or adding informative edges to, the observed graph.
Several algorithms that do address this uncertainty by incorporating a graph learning component have been proposed recently [44, 20, 36, 11]. These methods have limitations, either involving parametric graph models that restrict their applicability or being focused on the task of node classification.
In this work, we propose a nonparametric graph inference technique which is incorporated in a Bayesian framework to tackle node- and/or edge-level learning tasks. Our approach has three key benefits. First, it generalizes the applicability of Bayesian techniques beyond the realm of parametric modelling. Second, it achieves flexible, task-specific graph learning; it makes effective use of the outputs of existing graph-learning techniques to improve upon them. Third, the graph learning procedure scales well to large graphs, in contrast to the increased difficulty of parametric approaches.
We conduct extensive experiments to demonstrate the usefulness of our model on three different graph-related tasks. In a node classification setting, we observe increased accuracy when the amount of labeled data is very limited. In an unsupervised setting, we show that incorporating a graph learning step when performing variational modelling of the graph structure with autoencoder models leads to better link prediction. Finally, a Bayesian approach based on our proposed model improves recall for existing state-of-the-art graph-based recommender system architectures.
2 Related Work
Topology uncertainty in graph neural networks:
The most closely related work to our proposed approach is a group of recent techniques that jointly perform inference of the graph while addressing a learning task such as node classification. The recently proposed Bayesian GCN [44] provides a general, principled framework to deal with the issue of uncertainty on graphs. Similar ideas are considered in [20], where variational inference is used to learn the graph structure. This formulation allows consideration of additional data such as features and labels when performing graph inference, but the technique is still tied to a parametric model. In [36], the authors take a nonparametric approach, but their probabilistic formulation is focused on improving only very noisy graphs. In [11], simultaneous optimization of the graph structure along with the learning task is considered. In all of these works, only the node classification task has been explored. Our methodology extends the applicability of these methods by combining the Bayesian framework with a more flexible nonparametric graph model.
Graph learning:
Multiple algorithms have been proposed that focus exclusively on learning graph connectivity from observed data [3, 13]. These works differ from ours in that their end goal is topology inference. These algorithms typically appeal to a smoothness criterion for signals on the graph. Although these methods provide useful graphs, they have $\mathcal{O}(n^2)$ complexity, so many do not scale well to large graphs. Approximate nearest neighbour (ANN) graph learning [22] has $\mathcal{O}(n \log n)$ complexity, which is more suitable for large-scale applications, but the learned graph generally has poor quality compared to the exact kNN graph. A more recent method [12] introduces an approximate graph learning algorithm which provides an efficient trade-off between runtime and the quality of the solution. We build on this method for our inference procedure, but our graph model is tailored to the specific learning task we address.
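As a concrete illustration of the classical baseline that ANN methods approximate, a brute-force kNN graph can be built from pairwise distances in a few lines (a NumPy sketch, not the ANN algorithm of [22]; the quadratic-cost distance computation is exactly what approximate methods avoid):

```python
import numpy as np

def knn_graph(X, k=3):
    """Symmetric 0/1 adjacency connecting each point to its k nearest neighbours."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances (the O(n^2) bottleneck).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)            # exclude self-loops
    nn = np.argsort(d2, axis=1)[:, :k]      # k nearest neighbours per node
    A = np.zeros((n, n))
    for i in range(n):
        A[i, nn[i]] = 1.0
    return np.maximum(A, A.T)               # symmetrise: edge kept if either endpoint selects it

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
A = knn_graph(X, k=3)
```

The symmetrisation step means every node ends with at least k neighbours, which is the usual convention for undirected kNN graphs.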
Deep learning based graph generative models:
There is a large body of existing work on deep learning based graph generative models. Various algorithms for graph generation using VAEs, RNNs, and normalizing flows are developed in [17, 31, 43, 18, 19]. These approaches are evaluated based on the likelihood of sampled graphs and by comparing graph characteristics. Moreover, these algorithms do not preserve node identities, so sampled (inferred) graphs cannot be directly used for node- or edge-level inference. Generative adversarial network (GAN) based approaches [39, 1] are more successful in sampling graphs similar to the observed one. However, these models have prohibitively high computational complexity and their performance is heavily dependent on hyperparameter tuning.
Node classification:
A central learning task on graphs is semi-supervised node classification. The most common approach is to incorporate graph filters within deep learning algorithms. Early works [4, 2] based their models on theory from the graph signal processing community. This approach led to more sophisticated graph convolution architectures [15, 38, 8]; more recent models include [46, 41]. In [35], a learnable graph kernel based on a data-driven similarity metric is considered for node classification. Our graph learning framework can be combined with these algorithms to augment performance, particularly when there is a very limited amount of labelled data.
Link prediction:
Several algorithms based on autoencoders have been shown to perform extremely well for the link prediction task [14, 27, 7, 23]. These techniques learn node embeddings in a (variational) autoencoder framework and model the probability of the existence of an edge based on the closeness of the embeddings. We show how our method can be combined with these strategies to deliver a small but consistent improvement for the link prediction task.
Recommender systems:
Recommender systems have become a key component for meeting users' diverse and personalized needs on online consumption platforms. The most common approach is collaborative filtering (CF). Recent works have incorporated graphs and GNNs to better model the user–item interactions [37, 42, 40, 32, 24, 45].
Although GNN-based recommendation models have achieved impressive performance, existing methods regard the provided user–item interaction records as ground truth. In many practical settings, the user–item interaction graph has spurious edges due to noisy information; on the other hand, some potential positive user–item interactions are missing because the item is never presented to the user, and these are falsely recorded as negative interactions. It is thus important to capture the uncertainty in the observed user–item interaction graph. In the following methodology section, we elaborate on how our graph learning approach can alleviate this problem.
3 Methodology
3.1 Non-Parametric Graph Learning
In many learning tasks, an observed graph $\mathcal{G}_{obs}$ provides additional structure for the given data $\mathcal{D}$. The data $\mathcal{D}$ can include feature vectors, labels, and other information, depending on the task at hand. If $\mathcal{G}_{obs}$ is not readily available, it is often built from the data and possibly other side-information. In many cases, $\mathcal{G}_{obs}$ does not represent the true relationship among the nodes, as it is often formed using inaccurate modelling assumptions and/or is constructed from noisy data. In several recent works [44, 20, 36], it has been shown that building a posterior model for the ‘true’ graph $\mathcal{G}$ and incorporating it in the learning task is beneficial. We propose a nonparametric generative model for the adjacency matrix $A_{\mathcal{G}}$ of the random undirected graph $\mathcal{G}$. $A_{\mathcal{G}}$ is assumed to be a symmetric matrix with non-negative entries. We emphasize that our model retains the identities of the nodes and disallows permutations of nodes (permutations of adjacency matrices are not equivalent graphs when node identities are preserved). This characteristic is essential for its use in node- and edge-level inference tasks. We define the prior distribution for $A_{\mathcal{G}}$ as
$$p(A_{\mathcal{G}}) \propto \exp\left( \alpha \mathbf{1}^T \log(A_{\mathcal{G}} \mathbf{1}) - \beta \|A_{\mathcal{G}}\|_F^2 \right) . \qquad (1)$$
The first term in the log prior is a logarithmic barrier on the node degrees $A_{\mathcal{G}} \mathbf{1}$, which prevents any isolated node in $\mathcal{G}$. The second term is a regularizer based on the Frobenius norm which encourages low weights for the links. $\alpha$ and $\beta$ are hyperparameters which control the scale and sparsity of $A_{\mathcal{G}}$. In our model, the joint likelihood of $\mathcal{D}$ and $\mathcal{G}_{obs}$ conditioned on $A_{\mathcal{G}}$ is:
$$p(\mathcal{D}, \mathcal{G}_{obs} \mid A_{\mathcal{G}}) \propto \exp\left( -\|A_{\mathcal{G}} \circ Z\|_{1,1} \right) , \qquad (2)$$
where $Z$ is a symmetric pairwise distance matrix which encodes the dissimilarity between the nodes. The symbol $\circ$ denotes the Hadamard product and $\|\cdot\|_{1,1}$ denotes the elementwise $\ell_1$ norm. The likelihood encourages higher edge weights for node pairs with lower pairwise distances, and vice versa.
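The roles of the prior and likelihood terms can be made concrete with a small numerical sketch (NumPy; the hyperparameter values and toy distance matrix are illustrative assumptions). It evaluates the unnormalized log-posterior implied by (1) and (2), and shows that a graph placing its edge weight on low-distance pairs scores higher than one placing the same weights on high-distance pairs:

```python
import numpy as np

def log_posterior(A, Z, alpha=1.0, beta=0.5):
    """Unnormalised log-posterior: degree log-barrier + Frobenius regulariser
    (the prior) and a distance-weighted sparsity term (the likelihood)."""
    deg = A.sum(axis=1)
    barrier = alpha * np.log(deg).sum()   # prevents isolated nodes
    frob = -beta * (A ** 2).sum()         # discourages large edge weights
    lik = -(A * Z).sum()                  # favours edges between similar nodes
    return barrier + frob + lik

# Nodes 0 and 1 are similar (distance 0.1); node 2 is dissimilar (distance 5).
Z = np.array([[0.0, 0.1, 5.0],
              [0.1, 0.0, 5.0],
              [5.0, 5.0, 0.0]])
A_good = np.array([[0.0, 1.0, 0.1],   # heavy weight on the low-distance pair
                   [1.0, 0.0, 0.1],
                   [0.1, 0.1, 0.0]])
A_bad = np.array([[0.0, 0.1, 1.0],    # same weights, but on high-distance pairs
                  [0.1, 0.0, 1.0],
                  [1.0, 1.0, 0.0]])
```

With identical edge-weight multisets, the two graphs pay the same Frobenius penalty, so the likelihood term dominates and favours `A_good`.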
Bayesian inference of the graph involves sampling from its posterior distribution. The space of symmetric adjacency matrices is high dimensional ($\mathcal{O}(n^2)$ entries, where $n$ is the number of nodes). Designing a suitable sampling scheme (e.g., Markov Chain Monte Carlo) in such a high dimensional space is extremely challenging and computationally demanding for large graphs. Instead, we pursue maximum a posteriori estimation, which is equivalent to approximating the posterior by a point mass at its mode
[21]. We solve the following optimization problem:
$$\hat{A}_{\mathcal{G}} = \underset{A_{\mathcal{G}}}{\arg\max} \; \log p(A_{\mathcal{G}}) + \log p(\mathcal{D}, \mathcal{G}_{obs} \mid A_{\mathcal{G}}) \,, \qquad (3)$$
which is equivalent to learning an $n \times n$ symmetric adjacency matrix of $\mathcal{G}$:
$$\hat{A}_{\mathcal{G}} = \underset{A_{\mathcal{G}} \geq 0,\; A_{\mathcal{G}} = A_{\mathcal{G}}^T}{\arg\min} \; \|A_{\mathcal{G}} \circ Z\|_{1,1} - \alpha \mathbf{1}^T \log(A_{\mathcal{G}} \mathbf{1}) + \beta \|A_{\mathcal{G}}\|_F^2 \,. \qquad (4)$$
The optimization problem in (4) has been studied in the context of graph learning from smooth signals. [13] adopts a primal-dual optimization technique to solve this problem. However, the complexity of this approach scales as $\mathcal{O}(n^2)$, which can be prohibitive for large graphs. In this paper, we employ the scalable, approximate algorithm of [12], which has several advantages. First, it can use existing approximate nearest neighbour techniques, as in [22], to reduce the dimensionality of the optimization problem. Second, the graph learning has a computational complexity of $\mathcal{O}(n \log n)$ (the same as approximate nearest neighbour algorithms), while the quality of the learned graph is comparable to the state-of-the-art. Third, if we are not concerned about the scale of the learned graph (which is typical in many of the learning tasks we consider, since a normalized version of the adjacency or Laplacian matrix is used), the approximate algorithm allows us to use effectively only one hyperparameter instead of $\alpha$ and $\beta$ to control the sparsity of the solution, and it provides a useful heuristic for automatically selecting a suitable value based on the desired edge density of the solution.
In our work, we use this approximate algorithm for inference of the graph $\mathcal{G}$, which is subsequently used in various learning tasks. Since we have freedom in choosing a functional form for the distance matrix $Z$, we can design suitable distance metrics in a task-specific manner. This flexibility allows us to incorporate the graph learning step in diverse tasks. In the next three subsections, we present how the graph learning step can be applied to develop Bayesian algorithms for node classification, link prediction, and recommendation systems.
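To make the MAP estimation step concrete, the following sketch minimizes the objective in (4) with naive projected gradient descent (a NumPy illustration only; the paper uses the scalable approximate solver of [12], and the step size, hyperparameters, iteration count, and the small floor on edge weights are all assumptions of this sketch):

```python
import numpy as np

def learn_graph(Z, alpha=1.0, beta=0.5, lr=0.01, iters=2000, floor=1e-3):
    """Projected gradient descent on
    min_A ||A o Z||_1,1 - alpha * 1^T log(A 1) + beta * ||A||_F^2
    over symmetric non-negative A with zero diagonal."""
    n = Z.shape[0]
    A = np.ones((n, n)) - np.eye(n)          # start from the fully connected graph
    for _ in range(iters):
        deg = A.sum(axis=1)
        # Gradient of each term w.r.t. A_ij: Z_ij, the degree barrier, and 2*beta*A_ij.
        grad = Z - alpha * (1.0 / deg[:, None] + 1.0 / deg[None, :]) + 2.0 * beta * A
        A = A - lr * grad
        A = np.maximum((A + A.T) / 2.0, floor)  # project: symmetric, strictly positive
        np.fill_diagonal(A, 0.0)                # no self-loops
    return A

# Same toy distances as before: nodes 0 and 1 are similar, node 2 is not.
Z = np.array([[0.0, 0.1, 5.0],
              [0.1, 0.0, 5.0],
              [5.0, 5.0, 0.0]])
A_hat = learn_graph(Z)
```

The learned graph concentrates its weight on the low-distance pair while the log-barrier keeps the dissimilar node weakly connected rather than isolated.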
3.2 Node Classification
Problem Statement:
We consider a semi-supervised node classification problem for the nodes in $\mathcal{G}_{obs}$. In this setting, we also have access to the node attributes $X$ and the labels $Y_L$ of the nodes in the training set $L$, so $\mathcal{D} = \{X, Y_L\}$. The task is to predict the labels $Y_U$ of the remaining nodes $U$, where $U$ is the complement of $L$ in the node set.
Bayesian GCN – nonparametric model:
[44] derive a Bayesian learning methodology for GCNs by building a posterior model for the graph. Their approach assumes that $\mathcal{G}_{obs}$ is sampled from a parametric graph model, and the graph model parameters are marginalized to target inference of the graph posterior. Although this approach is effective, it has several drawbacks. The methodology lacks flexibility, since a particular parametric model might not fit different types of graph. Bayesian inference of the model parameters is often challenging for large graphs. Finally, parametric modelling of graphs cannot use the information provided by the node features and training labels for inference of $\mathcal{G}$. Here, we propose to incorporate a nonparametric model for inference of $\mathcal{G}$ in the BGCN framework. We aim to compute the marginal posterior probability of the node labels, which is obtained via marginalization with respect to the graph $\mathcal{G}$ and the GCN weights $W$:
$$p(Y_U \mid X, Y_L, \mathcal{G}_{obs}) = \int p(Y_U \mid X, W, \mathcal{G}) \, p(W \mid X, Y_L, \mathcal{G}) \, p(\mathcal{G} \mid \mathcal{D}, \mathcal{G}_{obs}) \, dW \, d\mathcal{G} \,. \qquad (5)$$
The categorical distribution of the node labels is modelled by applying a softmax function to the output of the last layer of the GCN. The integral in (5) cannot be computed in closed form, so we employ a Monte Carlo approximation:
$$p(Y_U \mid X, Y_L, \mathcal{G}_{obs}) \approx \frac{1}{S} \sum_{s=1}^{S} p(Y_U \mid X, W_s, \hat{\mathcal{G}}) \,. \qquad (6)$$
Here, we learn the maximum a posteriori (MAP) estimate $\hat{\mathcal{G}}$ and subsequently sample weight matrices $W_s$ from $p(W \mid X, Y_L, \hat{\mathcal{G}})$ by training a Bayesian GCN using the graph $\hat{\mathcal{G}}$.
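The Monte Carlo averaging of (6) can be sketched with a stand-in single-layer "classifier head" whose weights are perturbed by test-time dropout masks (NumPy; the one-layer model, dropout rate, and sample count are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mc_predict(H, W, n_samples=100, drop=0.5):
    """Average the softmax output over dropout-perturbed weights: each random
    mask plays the role of one weight sample W_s in the Monte Carlo sum."""
    probs = np.zeros((H.shape[0], W.shape[1]))
    for _ in range(n_samples):
        mask = (rng.random(W.shape) > drop).astype(float)
        probs += softmax(H @ (W * mask) / (1.0 - drop))   # inverted-dropout scaling
    return probs / n_samples

H = rng.normal(size=(5, 8))   # stand-in for the last hidden representation
W = rng.normal(size=(8, 3))   # final-layer weights, 3 classes
p = mc_predict(H, W)
```

Averaging the per-sample class distributions (rather than the logits) is what makes this a Monte Carlo estimate of the marginal predictive distribution.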
In order to perform the graph learning step, we need to define a pairwise distance matrix $Z$. For this application, we propose to combine the output of a node embedding algorithm and a base classifier to form $Z$:
$$Z = Z^{(1)} + \delta Z^{(2)} \,. \qquad (7)$$
Here, $\delta$ is a hyperparameter which controls the importance of $Z^{(2)}$ relative to $Z^{(1)}$. The $(i,j)$'th entries of $Z^{(1)}$ and $Z^{(2)}$ are defined as follows:
$$Z^{(1)}_{ij} = \|z_i - z_j\|_2^2 \,, \qquad (8)$$
$$Z^{(2)}_{ij} = \frac{1}{|\mathcal{N}_i|\,|\mathcal{N}_j|} \sum_{u \in \mathcal{N}_i} \sum_{v \in \mathcal{N}_j} \mathbb{1}(\hat{y}_u \neq \hat{y}_v) \,. \qquad (9)$$
Here, $z_i$ is any suitable embedding of node $i$, and $\hat{y}_i$ is the predicted label at node $i$ obtained from the base classification algorithm. $Z^{(1)}$ measures pairwise dissimilarity in terms of the observed topology and features, and $Z^{(2)}$ summarizes the discrepancy of the node labels in the neighbourhoods. For the experiments, we choose the Variational Graph Auto-Encoder (VGAE) algorithm [14] as the node embedding method to obtain the vectors $z_i$, and use the GCN proposed by [15] as the base classifier to obtain the $\hat{y}_i$ values. The neighbourhood of the $i$'th node is defined as $\mathcal{N}_i = \{i\} \cup \{j : (i,j) \in \mathcal{E}\}$, where $\mathcal{E}$ is the set of edges in $\mathcal{G}_{obs}$. With regard to the choice of the hyperparameter $\delta$, we observe that a fixed default value works well in our experiments, although it can be tuned via cross-validation if a validation set is available.
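A simplified version of this distance construction can be sketched as follows (NumPy; the squared-Euclidean embedding term and a per-pair label-disagreement indicator stand in for the embedding and neighbourhood-discrepancy terms, which in the paper aggregate over neighbourhoods):

```python
import numpy as np

def distance_matrix(Z_e, y_hat, delta=1.0):
    """Task-specific distance: embedding dissimilarity plus a penalty for
    disagreeing base-classifier predictions (a simplified stand-in)."""
    # Squared Euclidean distance between node embeddings.
    d_emb = ((Z_e[:, None, :] - Z_e[None, :, :]) ** 2).sum(axis=-1)
    # 1 where the predicted labels of the two nodes disagree, 0 otherwise.
    d_lab = (y_hat[:, None] != y_hat[None, :]).astype(float)
    return d_emb + delta * d_lab

# Nodes 0 and 1 share an embedding and a predicted label; node 2 differs in both.
Z_e = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
y_hat = np.array([0, 0, 1])
D = distance_matrix(Z_e, y_hat)
```

Feeding such a matrix into the graph learning step makes same-class, nearby-embedding pairs the cheapest edges to keep.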
For the inference of the GCN weights $W$, many existing algorithms, such as expectation propagation [9], variational inference [5, 33], and Markov Chain Monte Carlo methods [26, 16], can be employed. As in [44], we train a GCN on the inferred graph and use Monte Carlo dropout [5]. This is equivalent to sampling from a particular variational approximation of the posterior of $W$. The resulting algorithm is provided in the supplementary material.
3.3 Link Prediction
Problem statement:
In this setting, some of the links in $\mathcal{G}_{obs}$ are hidden or unobserved. The task is to predict the unseen links based on knowledge of the (partially) observed graph $\mathcal{G}_{obs}$ and the node features $X$. Thus, in this case, the additional data beyond the graph is $\mathcal{D} = \{X\}$.
Background:
In existing works, the link prediction problem is addressed by building deep learning based generative models for graphs. In particular, various architectures of graph variational autoencoders [14, 7, 23] aim to learn the posterior distribution of the node embeddings conditioned on the observed graph $\mathcal{G}_{obs}$ and the node features $X$. The inference model (encoder) often uses simplifying assumptions (e.g., a mean-field approximation over nodes, or diagonal covariance structures) for the parametric form of the approximate variational posterior distribution. Deep learning architectures are used to learn the parameters of the model. The decoder is another deep learning model which explains how the graph is generated from the embeddings, i.e., it parameterizes the likelihood of the graph given the embeddings. Typically, the probability of a link in these models depends on the similarity of the embeddings of the two incident nodes. Assuming a suitable prior on the embeddings, the encoder and decoder are trained jointly to minimize the KL divergence between the approximate and the true posterior. The learned embeddings are evaluated on an amortized link prediction task for the unseen portion of the graph.
Proposed methodology – Bayesian VGAE:
We consider a Bayesian formulation, in which we conduct Bayesian inference of the graph in the encoder. Let us introduce a function $g$ that returns the graph $\mathcal{G}' = g(\mathcal{G}_{obs}, \hat{\mathcal{G}})$ such that the unobserved entries of the adjacency matrix of $\mathcal{G}_{obs}$ are replaced by the corresponding entries of $\hat{\mathcal{G}}$, where $\hat{\mathcal{G}}$ is the MAP estimate from the nonparametric model. We then condition the inference (encoder) distribution on $\mathcal{G}'$ instead of $\mathcal{G}_{obs}$. The intuitive idea behind this modelling is that, if the nonparametric inference provides a reasonable approximation of the unobserved adjacency matrix entries, then an autoencoder trained on a graph that incorporates these approximate entries should learn better embeddings. For the graph learning step, we form the distance matrix using the output of an autoencoder as follows:
$$Z_{ij} = \|z_i - z_j\|_2^2 \,, \qquad (10)$$
where $z_i$ is the embedding of node $i$ produced by the autoencoder.
The resulting algorithm is summarized in the supplementary material.
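The graph-combination function described above reduces to a masked merge of the two adjacency matrices; a minimal sketch (NumPy, with an explicit boolean mask of observed entries as an assumption about the bookkeeping):

```python
import numpy as np

def combine(A_obs, A_map, observed_mask):
    """Keep the observed entries of A_obs; fill the unobserved entries
    from the MAP estimate A_map."""
    return np.where(observed_mask, A_obs, A_map)

A_obs = np.array([[0.0, 1.0], [1.0, 0.0]])
A_map = np.array([[0.0, 0.2], [0.2, 0.0]])
# True where the corresponding entry of A_obs was actually observed.
mask = np.array([[True, True], [False, True]])
A_new = combine(A_obs, A_map, mask)
```

Training the autoencoder on `A_new` rather than `A_obs` is what injects the nonparametric estimate into the encoder.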
3.4 Recommendation Systems
Problem statement:
In this section, we address a personalized item recommendation task based on historical interaction data. We denote the sets of users and items by $\mathcal{U}$ and $\mathcal{I}$ respectively. The interaction between any user and item is encoded as a link in a bipartite graph $\mathcal{G}_{obs}$. The task is to infer the unobserved interactions (and to use these as predictions of future interactions). Viewed in this light, the recommendation task is a link prediction problem. However, in many cases, predicting a personalized ranking of the items is important [29].
For each user $u$, if there is an observed interaction with item $i$ and an unobserved interaction with item $j$, we write $i >_u j$ in the training set. The introduced relation $>_u$ implies that user $u$ prefers item $i$ to item $j$. This interaction training data leads to a set of rankings for each user over the training set of triples $\{(u, i, j) : i >_u j\}$. We denote these rankings for all users as $R_{train}$. This training data is used to learn a model parameterized by $W$. The generalization capability is tested by ranking, for each user $u$, all pairs of items $(i, j)$ such that both $(u, i)$ and $(u, j)$ are outside the training interactions. We denote the rankings for a specific user in this test set as $R^u_{test}$, and the collection of all such rankings for all users as $R_{test}$. In this paper, we propose to incorporate Bayesian inference of the graph in the Bayesian Personalized Ranking (BPR) loss formulation [29]. A brief review of the BPR loss is provided for completeness.
Background – BPR loss:
Many existing graph-based deep learning recommender systems [32, 40, 42] learn an embedding $e_u$ for user $u$ and $e_i$ for item $i$, and model the probability that user $u$ prefers item $i$ to item $j$ as
$$p(i >_u j) = \sigma\left( \langle e_u, e_i \rangle - \langle e_u, e_j \rangle \right) .$$
Here, $\sigma(\cdot)$ is the sigmoid function and $\langle \cdot, \cdot \rangle$ is the inner product. Our goal is to compute:
$$p(R_{test} \mid R_{train}) = \int p(R_{test} \mid W) \, p(W \mid R_{train}) \, dW \,, \qquad (11)$$
but this integral is not tractable. In practice, we assume a prior $p(W)$ for the model parameters and model the preferences of different users as independent. We can then consider a MAP estimate of $W$:
$$\hat{W} = \underset{W}{\arg\max} \; \log p(R_{train} \mid W) + \log p(W) \,.$$
This is equivalent to minimizing the BPR loss, where the positive pool and negative pool are created according to $\mathcal{G}_{obs}$. Once the MAP estimate $\hat{W}$ has been obtained, we assess the performance by ranking the test set triples using $p(R_{test} \mid \hat{W})$.
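For reference, the BPR objective being minimized can be written compactly (a NumPy sketch; the L2 term stands in for the Gaussian log-prior on the embeddings, and the embedding dimensions, regularization strength, and data layout are illustrative):

```python
import numpy as np

def bpr_loss(U, V, triples, lam=0.01):
    """BPR loss: -mean log sigmoid(<e_u, e_i> - <e_u, e_j>) over training
    triples (u, i, j) with i preferred to j, plus L2 on the embeddings."""
    u, i, j = triples[:, 0], triples[:, 1], triples[:, 2]
    diff = (U[u] * V[i]).sum(axis=-1) - (U[u] * V[j]).sum(axis=-1)
    nll = np.logaddexp(0.0, -diff).mean()   # -log sigmoid(diff), numerically stable
    return nll + lam * ((U ** 2).sum() + (V ** 2).sum())

U = np.array([[1.0, 0.0]])                  # one user embedding
V = np.array([[1.0, 0.0], [-1.0, 0.0]])     # two item embeddings
good = np.array([[0, 0, 1]])                # preference agrees with the embeddings
bad = np.array([[0, 1, 0]])                 # the reversed preference
```

Minimizing this loss pushes each user's positive items above the sampled negatives in inner-product score.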
Nonparametric model – Bayesian graph recommender system:
In the Bayesian setting, ranking is conducted by considering an expectation with respect to the posterior distribution of the graph under the nonparametric model. We need to evaluate the posterior probability of the rankings in the test set. Let us introduce the graph $\mathcal{G}'$, which is obtained via a function that combines the information in $\mathcal{G}_{obs}$ and $\mathcal{G}$; we specify the function that we employ in our methodology more precisely below. We can then write the posterior probability of the ranking of the test set as follows:
$$p(R_{test} \mid R_{train}, \mathcal{G}_{obs}) = \int p(R_{test} \mid W) \, p(W \mid R_{train}, \mathcal{G}') \, p(\mathcal{G} \mid \mathcal{G}_{obs}) \, dW \, d\mathcal{G} \,. \qquad (12)$$
We approximate the integrals with respect to the posteriors of $\mathcal{G}$ and $W$ by their MAP estimates to obtain:
$$p(R_{test} \mid R_{train}, \mathcal{G}_{obs}) \approx p(R_{test} \mid \hat{W}) \,. \qquad (13)$$
To calculate this approximation, we first perform the nonparametric graph learning to obtain $\hat{\mathcal{G}}$, then compute the new graph $\mathcal{G}'$ and minimize the BPR loss to form the estimate of the weights
$$\hat{W} = \underset{W}{\arg\max} \; \log p(R_{train} \mid W, \mathcal{G}') + \log p(W) \,, \qquad (14)$$
according to the positive and negative pools defined by this new graph $\mathcal{G}'$.
Since the dot product measures the similarity between the embeddings in the proposed recommender system architecture, we use the pairwise cosine distance between the learned embeddings of a base node embedding algorithm for learning the bipartite graph:
$$Z_{ui} = 1 - \frac{\langle e_u, e_i \rangle}{\|e_u\|_2 \, \|e_i\|_2} \,. \qquad (15)$$
Here, the $e_u$'s and $e_i$'s are obtained from the node embedding algorithm. Since none of the test set user–item interactions are present in $\mathcal{G}_{obs}$, they are all included in the negative pool. We use the estimated graph to remove potentially positive interactions in the test set from the negative pool. This is achieved by constructing $\mathcal{G}'$ as follows: we identify the fraction of links with the highest edge weights in $\hat{\mathcal{G}}$ and subsequently remove them from the negative pool of interactions for the Bayesian approach. The number of links to be removed is decided by examining performance on a validation set. The resulting algorithm is summarized in the supplementary material.
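The negative-pool pruning step can be sketched as follows (NumPy; the list-of-pairs layout of the candidate pool and the fraction hyperparameter are assumptions for illustration):

```python
import numpy as np

def prune_negative_pool(neg_pairs, A_map, frac=0.05):
    """Drop from the negative pool the fraction of candidate user-item pairs
    with the highest weights in the inferred graph (likely false negatives)."""
    w = np.array([A_map[u, i] for u, i in neg_pairs])
    k = int(len(neg_pairs) * frac)
    if k == 0:
        return list(neg_pairs)
    keep = np.argsort(w)[:-k]             # indices of all but the k highest-weight pairs
    return [neg_pairs[t] for t in keep]

# Inferred user-item edge weights for 2 users x 2 items; (0, 1) looks positive.
A_map = np.array([[0.1, 0.9],
                  [0.2, 0.05]])
pool = [(0, 0), (0, 1), (1, 0), (1, 1)]
pruned = prune_negative_pool(pool, A_map, frac=0.25)
```

The BPR negatives are then sampled from `pruned` instead of the full pool, so likely false negatives no longer generate contradictory gradients.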
4 Experiments
4.1 Node Classification
We consider a semi-supervised node classification task on three benchmark citation networks: Cora, Citeseer [30], and Pubmed [25]. Details of the datasets are included in the supplementary material. The attribute vector at a node is a sparse bag-of-words extracted from the keywords in the article, and the label denotes the research topic addressed in the article. We consider three different experimental settings, in which we have 5, 10, and 20 labeled nodes per class in the training set. In each setting, we conduct 50 trials based on random splits of the data and random initializations of the learnable weights. We compare the proposed BGCN with the ChebyNet [2], the GCN [15], the GAT [38], the DFNET [41] (for only Cora and Citeseer, due to runtime considerations), the SBM-GCN [20], and the BGCN in [44]. The hyperparameters of the GCN are set to those reported in [15]
and the same values are used for the BGCNs. We report the average classification accuracies along with their standard errors in Table 1. For each setting, we conduct a Wilcoxon signed rank test to determine whether the best performing algorithm is significantly better than the second-best. Results in bold font indicate statistical significance at the 5% level.

Table 1: Classification accuracy (%, mean ± standard error over 50 trials).

Dataset   | Algorithm   | 5 labels  | 10 labels | 20 labels
----------|-------------|-----------|-----------|----------
Cora      | ChebyNet    | 61.7±6.8  | 72.5±3.4  | 78.8±1.6
Cora      | GCN         | 70.0±3.7  | 76.0±2.2  | 79.8±1.8
Cora      | GAT         | 70.4±3.7  | 76.6±2.8  | 79.9±1.8
Cora      | DFNET-ATT   | 72.3±2.9  | 75.8±1.7  | 79.3±1.8
Cora      | SBM-GCN     | 46.0±19   | 74.4±10   | 82.6±0.2
Cora      | BGCN        | 74.6±2.8  | 77.5±2.6  | 80.2±1.5
Cora      | BGCN (ours) | 74.2±2.8  | 76.9±2.2  | 78.8±1.7
Citeseer  | ChebyNet    | 58.5±4.8  | 65.8±2.8  | 67.5±1.9
Citeseer  | GCN         | 58.5±4.7  | 65.4±2.6  | 67.8±2.3
Citeseer  | GAT         | 56.7±5.1  | 64.1±3.3  | 67.6±2.3
Citeseer  | DFNET-ATT   | 60.5±1.2  | 63.2±2.9  | 66.3±1.7
Citeseer  | SBM-GCN     | 24.5±7.3  | 43.3±12   | 66.1±5.7
Citeseer  | BGCN        | 63.0±4.8  | 69.9±2.3  | 71.1±1.8
Citeseer  | BGCN (ours) | 64.9±4.6  | 70.1±1.9  | 71.4±1.6
Pubmed    | ChebyNet    | 62.7±6.9  | 68.6±5.0  | 74.3±3.0
Pubmed    | GCN         | 69.7±4.5  | 73.9±3.4  | 77.5±2.5
Pubmed    | GAT         | 68.0±4.8  | 72.6±3.6  | 76.4±3.0
Pubmed    | SBM-GCN     | 59.0±10   | 67.8±6.9  | 74.6±4.5
Pubmed    | BGCN        | 70.2±4.5  | 73.3±3.1  | 76.0±2.6
Pubmed    | BGCN (ours) | 71.1±4.4  | 74.6±3.6  | 77.6±2.9
The results in Table 1 show that the proposed BGCN with nonparametric modelling of the graph achieves higher or competitive accuracy in most cases. The relative improvement compared to the GCN is more significant when labelled data is scarce. Comparison with the BGCN approach based on parametric modelling in [44] demonstrates that better or comparable accuracies can be achieved with this model, even though we do not explicitly target modelling the community structure of the graph. From Figure 1, we observe that, in most cases, for the Cora and Citeseer datasets the proposed BGCN algorithm corrects more misclassifications of the GCN for low degree nodes. The same trend is observed for the Pubmed dataset. The empirical success of the GCN is primarily due to aggregating information from neighbours. As low degree nodes have less opportunity to aggregate, performance is worse at these nodes. The proposed BGCN approach generates many additional links between similar nodes (Fig. 2). This improves learning, particularly at low degree nodes.
In Figure 2, we compare the adjacency matrix of the MAP estimate graph with the observed adjacency matrix for the Cora dataset. This reveals that, compared to the observed graph, the MAP estimate has denser connectivity among the nodes with the same label. This provides a rationale for why the proposed BGCN outperforms the GCN in most cases.
4.2 Link Prediction
We consider a link prediction task to demonstrate the usefulness of the embeddings learned with the Bayesian approach. We split the links into 85/5/10% for training, validation, and testing respectively. The validation and test sets contain the same number of non-links as links. During model training, the links in the validation and test sets are hidden, while the node features are unaltered. We compare the Bayesian approach with the GAE and VGAE [14], the Graphite-AE and -VAE [7], and the DGLFRM [23] models. The hyperparameters of these baseline algorithms are selected according to the corresponding papers. Other common baselines, e.g., spectral clustering [34], DeepWalk [28], and node2vec [6], are not included, since it has been demonstrated that the baselines we include significantly outperform them. We incorporate the nonparametric graph inference technique in the existing autoencoders to build a Bayesian version of these algorithms. The Area Under the ROC Curve (AUC) and the Average Precision (AP) score are used as performance metrics. Table 2 shows the mean AUC and AP, together with standard errors, based on 50 trials. Each trial corresponds to a random split of the graph and a random initialization of the learnable parameters. We conduct a Wilcoxon signed rank test to determine the statistical significance of the improvement compared to the corresponding base model. Results in bold font indicate settings where the test declares significance at the 5% level.

Table 2: Link prediction performance (mean ± standard error over 50 trials).

Metric | Algorithm      | Cora     | Citeseer | Pubmed
-------|----------------|----------|----------|---------
AUC    | GAE            | 91.5±0.9 | 89.4±1.5 | 96.2±0.2
AUC    | B-GAE          | 91.8±0.8 | 89.6±1.6 | 96.2±0.2
AUC    | VGAE           | 91.8±0.9 | 90.7±1.0 | 94.5±0.7
AUC    | B-VGAE         | 92.2±0.8 | 91.2±1.0 | 94.4±0.7
AUC    | Graphite-AE    | 92.0±0.9 | 90.8±1.1 | 96.0±0.4
AUC    | B-Graphite-AE  | 92.4±0.9 | 91.1±1.1 | 96.0±0.4
AUC    | Graphite-VAE   | 92.3±0.8 | 90.9±1.1 | 95.2±0.4
AUC    | B-Graphite-VAE | 92.7±0.8 | 91.4±1.1 | 95.2±0.4
AUC    | DGLFRM         | 93.1±0.6 | 93.9±0.7 | 95.9±0.1
AUC    | B-DGLFRM       | 93.2±0.6 | 94.1±0.7 | 95.9±0.2
AP     | GAE            | 92.6±0.9 | 90.0±1.7 | 96.3±0.3
AP     | B-GAE          | 92.8±0.9 | 90.2±1.7 | 96.3±0.2
AP     | VGAE           | 92.9±0.7 | 92.0±1.0 | 94.7±0.6
AP     | B-VGAE         | 93.3±0.7 | 92.5±1.0 | 94.6±0.6
AP     | Graphite-AE    | 92.8±0.9 | 91.6±1.1 | 96.0±0.4
AP     | B-Graphite-AE  | 93.1±0.9 | 92.0±1.1 | 96.0±0.4
AP     | Graphite-VAE   | 93.3±0.7 | 92.1±1.0 | 95.3±0.4
AP     | B-Graphite-VAE | 93.7±0.7 | 92.6±1.0 | 95.3±0.4
AP     | DGLFRM         | 93.8±0.6 | 94.5±0.7 | 96.4±0.1
AP     | B-DGLFRM       | 93.9±0.6 | 94.7±0.7 | 96.3±0.1
From the results in Table 2, we observe that the proposed approach improves link prediction performance on the Cora and Citeseer datasets compared to the baseline autoencoder models. The improvement is small but consistent over almost all of the random trials. No improvement is observed for Pubmed. To examine this further, we conducted an experiment where the ground truth for the test set was provided to the autoencoders. The performance did not change from the reported values; this suggests that the models have reached accuracy limits for the Pubmed dataset.
4.3 Recommendation Systems
We investigate the performance of the proposed Bayesian method on four real-world, publicly available datasets: ML100K, Amazon-Books, Amazon-CDs, and Yelp2018. For each dataset, we conduct preprocessing to ensure that each node has sufficient interactions: we choose two threshold values and filter out the users and the items with fewer than the respective threshold number of interactions. For each user, we split the existing interaction records into training, validation, and test sets with a ratio of 70/10/20. We evaluate model performance using Recall@k and NDCG@k, which measure the coverage of true items in the top-k recommendations and the quality of the recommendation ranking, respectively. Statistics of each dataset after the preprocessing step and the definitions of the evaluation metrics are included in the supplementary material.
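For reference, standard binary-relevance definitions of these two metrics can be sketched as follows (a common formulation; the exact variants used in the experiments are specified in the supplementary material):

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of a user's true test items appearing in the top-k recommendations."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Normalised discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / np.log2(r + 2) for r, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / np.log2(r + 2) for r in range(min(len(relevant), k)))
    return dcg / idcg

ranked = [1, 2, 3, 4]   # items sorted by predicted score for one user
relevant = {1, 3}       # the user's held-out test items
```

Recall@k ignores the order of the hits within the top k, while NDCG@k rewards placing relevant items earlier in the list.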
Table 3: Recommendation performance of the base models and their Bayesian versions.

Dataset      | Algorithm | R@10  | R@20  | N@10  | N@20
-------------|-----------|-------|-------|-------|------
Amazon-CDs   | MGCCF     | 10.1% | 16.1% | 13.1% | 16.9%
Amazon-CDs   | B-MGCCF   | 10.6% | 17.0% | 13.4% | 17.3%
Amazon-CDs   | NGCF      | 8.1%  | 13.5% | 11.4% | 13.8%
Amazon-CDs   | B-NGCF    | 9.9%  | 16.2% | 12.8% | 16.6%
Yelp2018     | MGCCF     | 7.5%  | 12.7% | 13.0% | 17.4%
Yelp2018     | B-MGCCF   | 7.6%  | 13.0% | 13.2% | 17.7%
Yelp2018     | NGCF      | 6.6%  | 11.3% | 11.5% | 15.3%
Yelp2018     | B-NGCF    | 6.7%  | 11.4% | 11.5% | 15.5%
Amazon-Books | MGCCF     | 10.3% | 16.6% | 15.0% | 19.4%
Amazon-Books | B-MGCCF   | 10.3% | 16.4% | 14.8% | 19.3%
Amazon-Books | NGCF      | 8.7%  | 14.5% | 13.6% | 17.8%
Amazon-Books | B-NGCF    | 10.2% | 16.8% | 15.2% | 19.6%
ML100K       | MGCCF     | 18.3% | 29.4% | 25.6% | 30.9%
ML100K       | B-MGCCF   | 18.4% | 29.5% | 25.9% | 31.4%
ML100K       | NGCF      | 17.7% | 29.0% | 25.3% | 30.3%
ML100K       | B-NGCF    | 17.7% | 28.9% | 25.2% | 30.1%
We apply our proposed Bayesian graph-based recommendation formulation to two recent graph-based recommendation models: the MGCCF [32] and the NGCF [40].
We first train the two algorithms with an early stopping patience of 50 epochs to obtain the embedding vectors for users and items. These are used to calculate the pairwise cosine distance matrix for our proposed graph optimizer. We refer to these original algorithms as "base models". We build our proposed models (B-MGCCF and B-NGCF) on top of the base models via the following procedure. We first apply edge dropping with a threshold to shrink each dataset's negative edge candidate set. We then further train the base models with this optimized negative edge pool with an early stopping patience of 100 epochs. For a fair comparison, to obtain the baseline performance, we also conduct further training of the base models with the original negative edge pool under the same early stopping setting. We use grid search to determine the percentage of the inferred links with the highest edge weights to be removed from the negative pool; a suitable value is chosen for each dataset from {1, 2, 5, 10, 20}%. We report the Recall@k and the NDCG@k of the proposed Bayesian models (B-MGCCF and B-NGCF), along with those of the base models MGCCF and NGCF, for the four datasets in Table 3. We conduct a Wilcoxon signed rank test in each case to determine the significance of the result obtained by the Bayesian model over the corresponding base model. Bold numbers indicate a statistically significant difference at the 5% level between the base model and the Bayesian version of the algorithm. The advantages of our proposed Bayesian framework can be observed for both base models and across both evaluation metrics. For the much denser ML100K dataset, the procedure is less effective (and in some cases ineffective). With many more edges in the observed graph, the graph-based recommender system algorithms already have considerable information. Although the inferred graph does remove many incorrect edges from the negative pool, this has only a minor impact on the learned embeddings.
The learning curves for training the original model and the Bayesian version are compared in Figure 3 for the Amazon-CDs dataset. We observe that, with our proposed solution, training converges much faster. The Bayesian training framework also allows us to avoid overfitting in this case.
Conventional recommendation training procedures, especially in the implicit-feedback setting, treat all of the unobserved user–item interactions as negative feedback (indicating a lack of interest). Our proposed approach aims to learn which of these unobserved interactions are most likely to be false negatives. We analyze the overlap between the edges that we remove from the negative candidate set and the edges in the validation and test sets. As shown in Table 4, our proposed Bayesian formulation is able to remove a significant percentage of test and validation edges from the negative sample pool.
Table 4: Percentage of validation and test edges removed from the negative sample pool.

Algorithm | Am. CDs | Am. Books | Yelp2018 | ML100K
----------|---------|-----------|----------|-------
B-MGCCF   | 20.6%   | 17.9%     | 13.6%    | 12.3%
B-NGCF    | 23.4%   | 30.0%     | 13.3%    | 62.1%
5 Conclusion
In this paper, we propose the use of nonparametric modelling and inference of graphs for various learning tasks. In the proposed model, a higher edge weight between two nodes is more likely if the nodes are close in terms of a distance metric. An appropriate distance metric can be chosen depending on the learning task, which allows flexible, task-specific design of learning algorithms. The proposed model is incorporated in a Bayesian learning framework which aims to account for graph uncertainty. Experimental results demonstrate that the model can learn useful graphs that improve performance significantly over baseline algorithms for node classification, link prediction, and recommendation.
References
 [1] (2018) NetGAN: generating graphs via random walks. In Proc. Int. Conf. Machine Learning, pp. 609–618.
 [2] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Proc. Adv. Neural Info. Proc. Systems, pp. 3844–3852.
 [3] (2016) Learning Laplacian matrix in smooth graph signal representations. IEEE Trans. Sig. Proc. 64 (23), pp. 6160–6173.
 [4] (2015) Convolutional networks on graphs for learning molecular fingerprints. In Proc. Adv. Neural Info. Proc. Systems, pp. 2224–2232.
 [5] (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proc. Int. Conf. Machine Learning, pp. 1050–1059.
 [6] (2016) node2vec: scalable feature learning for networks. In Proc. ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining, pp. 855–864.
 [7] (2019) Graphite: iterative generative modeling of graphs. In Proc. Int. Conf. Machine Learning, pp. 2434–2444.
 [8] (2017) Inductive representation learning on large graphs. In Proc. Adv. Neural Info. Proc. Systems, pp. 1024–1034.
 [9] (2015) Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proc. Int. Conf. Machine Learning, pp. 1861–1869.
 [10] (2000) IR evaluation methods for retrieving highly relevant documents. In Proc. Int. ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 41–48.
 [11] (2019) Graph optimized convolutional networks. arXiv preprint arXiv:1904.11883.
 [12] (2019) Large scale graph learning from smooth signals. In Proc. Int. Conf. Learning Representations.
 [13] (2016) How to learn a graph from smooth signals. In Proc. Artificial Intell. and Statist., pp. 920–929.
 [14] (2016) Variational graph auto-encoders. In Proc. Bayesian Deep Learning Workshop, Adv. Neural Info. Proc. Systems.
 [15] (2017) Semi-supervised classification with graph convolutional networks. In Proc. Int. Conf. Learning Representations.
 [16] (2016) Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In Proc. AAAI Conf. Artificial Intell., pp. 1788–1794.
 [17] (2018) Learning deep generative models of graphs. In Proc. Int. Conf. Machine Learning.
 [18] (2019) Efficient graph generation with graph recurrent attention networks. In Proc. Adv. Neural Info. Proc. Systems, pp. 4255–4265.
 [19] (2019) Graph normalizing flows. In Proc. Adv. Neural Info. Proc. Systems, pp. 13578–13588.
 [20] (2019) A flexible generative framework for graph-based semi-supervised learning. In Proc. Adv. Neural Info. Proc. Systems, pp. 3276–3285.
 [21] (1996) Maximum entropy and Bayesian methods. pp. 43–59.
 [22] (2020) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Analysis and Machine Intell. 42 (4), pp. 824–836.
 [23] (2019) Stochastic blockmodels meet graph neural networks. In Proc. Int. Conf. Machine Learning, pp. 4466–4474.
 [24] (2017) Geometric matrix completion with recurrent multi-graph neural networks. In Proc. Adv. Neural Info. Proc. Systems.
 [25] (2012) Query-driven active surveying for collective classification. In Proc. Workshop on Mining and Learning with Graphs, Int. Conf. Machine Learning.
 [26] (1992) Bayesian learning via stochastic dynamics. In Proc. Adv. Neural Info. Proc. Systems, pp. 475–482.
 [27] (2018) Adversarially regularized graph autoencoder for graph embedding. In Proc. Int. Joint Conf. Artificial Intell., pp. 2609–2615.
 [28] (2014) DeepWalk: online learning of social representations. In Proc. ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining, pp. 701–710.
 [29] (2009) BPR: Bayesian personalized ranking from implicit feedback. In Proc. Conf. Uncertainty in Artificial Intell., pp. 452–461.
 [30] (2008) Collective classification in network data. AI Magazine 29 (3), p. 93.
 [31] (2018) GraphVAE: towards generation of small graphs using variational autoencoders. In Proc. Int. Conf. Artificial Neural Networks, pp. 412–422.
 [32] (2019) Multi-graph convolution collaborative filtering. In Proc. IEEE Int. Conf. Data Mining, pp. 1306–1311.
 [33] (2017) Learning structured weight uncertainty in Bayesian neural networks. In Proc. Artificial Intell. and Statist., pp. 1283–1292.
 [34] (2011) Leveraging social media networks for classification. Data Mining and Knowledge Discovery, pp. 447–478.
 [35] (2019) Rethinking kernel methods for node representation learning on graphs. In Proc. Adv. Neural Info. Proc. Systems, pp. 2088–2099.
 [36] (2019) Variational spectral graph convolutional networks. In Proc. Graph Representation Learning Workshop, Adv. Neural Info. Proc. Systems.
 [37] (2018) Graph convolutional matrix completion. In Proc. ACM Int. Conf. Knowl. Disc. Data Mining.
 [38] (2018) Graph attention networks. In Proc. Int. Conf. Learning Representations.
 [39] (2017) GraphGAN: graph representation learning with generative adversarial nets. In Proc. AAAI Conf. Artificial Intell., pp. 2508–2515.
 [40] (2019) Neural graph collaborative filtering. In Proc. Int. ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 165–174.
 [41] (2019) DFNets: spectral CNNs for graphs with feedback-looped filters. In Proc. Adv. Neural Info. Proc. Systems, pp. 6009–6020.
 [42] (2018) Graph convolutional neural networks for web-scale recommender systems. In Proc. ACM Int. Conf. Knowl. Disc. Data Mining, pp. 974–983.
 [43] (2018) GraphRNN: generating realistic graphs with deep autoregressive models. In Proc. Int. Conf. Machine Learning, pp. 5708–5717.
 [44] (2019) Bayesian graph convolutional neural networks for semi-supervised classification. In Proc. AAAI Conf. Artificial Intell., pp. 5829–5836.
 [45] (2018) Spectral collaborative filtering. In Proc. ACM Conf. Recommender Systems.
 [46] (2018) Dual graph convolutional networks for graph-based semi-supervised classification. In Proc. World Wide Web Conf., pp. 499–508.
6 Supplementary Material
6.1 Description of the Datasets
For the semi-supervised node classification and link prediction tasks, we conduct experiments on benchmark citation network datasets (Cora [30], Citeseer [30], and Pubmed [25]). In these datasets, each node represents a research article and the undirected edges are formed according to citation links. Each node has a sparse bag-of-words feature vector derived from the keywords of the document. The node labels indicate the primary research topics addressed in the articles. The statistics of the citation datasets are summarized in Table 5.
Dataset  # Classes  # Features  # Nodes  # Edges  Edge Density

Cora  7  1,433  2,485  5,069  0.04%
Citeseer  6  3,703  2,110  3,668  0.04%
Pubmed  3  500  19,717  44,324  0.01%
For the experiments on recommendation systems, we use four real-world, publicly available datasets: ML100K, Amazon-Books, Amazon-CDs and Yelp2018. For each dataset, we apply a preprocessing step to ensure that each user/item node has sufficient interactions: we choose one threshold for users and one for items, and filter out those users and those items with fewer interactions than the respective threshold. The thresholds, together with the numbers of users and items remaining after the preprocessing step, are provided in Table 6.
Dataset  Thresholds (user, item)  (# Users, # Items)  # Edges  Edge Density

ML100K  (10, 10)  (897, 823)  36,330  0.0716
CDs  (50, 20)  (2,217, 2,034)  68,293  0.0151
Books  (250, 22)  (1,608, 2,270)  69,782  0.0191
Yelp2018  (22, 35)  (2,799, 1,673)  119,514  0.0255
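A sketch of this preprocessing step, assuming the user and item thresholds must hold jointly (removing one node can push another below its threshold, so the filter is repeated until stable; this iterative interpretation is an assumption):

```python
from collections import Counter

def filter_interactions(edges, user_min, item_min):
    """Repeatedly drop users/items with too few interactions until the
    thresholds hold for every remaining node."""
    edges = list(edges)
    while True:
        u_cnt = Counter(u for u, _ in edges)  # interactions per user
        i_cnt = Counter(i for _, i in edges)  # interactions per item
        kept = [(u, i) for u, i in edges
                if u_cnt[u] >= user_min and i_cnt[i] >= item_min]
        if len(kept) == len(edges):           # nothing removed: done
            return kept
        edges = kept
```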
6.2 Graph Inference in the Non-Parametric Model
For completeness, we provide a brief summary of the graph inference task considered in the main paper. We solve for the maximizer of the posterior distribution of the ‘true’ graph $\mathcal{G}$ as follows:

$\hat{\mathcal{G}} = \operatorname{arg\,max}_{\mathcal{G}} \; p(\mathcal{G} \mid \mathcal{G}_{obs}, \mathcal{O})$.   (16)

Here, $\mathcal{G}_{obs}$ denotes the observed graph and $\mathcal{O}$ is additional information. In our non-parametric model, this is equivalent to solving the following optimization problem for the adjacency matrix $A$ of $\mathcal{G}$:

$\hat{A} = \operatorname{arg\,min}_{A \in \mathcal{A}} \; \lVert A \circ D \rVert_{1,1} - \alpha\, \mathbf{1}^{\top} \log(A \mathbf{1}) + \beta\, \lVert A \rVert_F^2$,   (17)

where $\mathcal{A}$ is the set of symmetric, non-negative adjacency matrices with zero diagonal. We recall that $D$ is a symmetric pairwise distance matrix which measures the dissimilarity between the nodes, and $\alpha$ and $\beta$ are the hyperparameters of the prior distribution of the random graph $\mathcal{G}$.
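As an illustration, the graph inference step can be approximated with plain projected gradient descent, assuming the smooth-signal objective of [13] (minimize $\lVert A \circ D\rVert_1 - \alpha \mathbf{1}^\top \log(A\mathbf{1}) + \beta \lVert A\rVert_F^2$ over symmetric, non-negative $A$); this is only a sketch, whereas the main paper relies on the scalable solver of [12]:

```python
import numpy as np

def learn_graph(D, alpha=1.0, beta=1.0, lr=0.01, iters=500):
    """Projected-gradient sketch for
        min_A  ||A o D||_1 - alpha * 1^T log(A 1) + beta * ||A||_F^2
    over symmetric, non-negative A with zero diagonal."""
    n = D.shape[0]
    A = np.ones((n, n)) - np.eye(n)          # feasible starting point
    for _ in range(iters):
        deg = A.sum(axis=1)                  # node degree vector A 1
        grad = D - alpha / (deg[:, None] + 1e-12) + 2.0 * beta * A
        A = A - lr * (grad + grad.T) / 2.0   # symmetrized gradient step
        A = np.maximum(A, 0.0)               # project onto A >= 0
        np.fill_diagonal(A, 0.0)             # no self-loops
    return A

# toy distances: nodes 0 and 1 are similar, node 2 is far from both
D = np.array([[0.0, 0.1, 5.0],
              [0.1, 0.0, 5.0],
              [5.0, 5.0, 0.0]])
A = learn_graph(D)  # strong 0-1 edge, weak edges to node 2
```

The log-barrier term keeps every node connected, so even the distant node retains small positive edge weights rather than being isolated.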
6.3 Bayesian Graph Neural Network
This section summarizes the novel Bayesian Graph Neural Network algorithm and provides some additional results for the node classification task.
6.3.1 BGCN Algorithm
We recall the notation from the main paper: $X$ is the feature matrix of the nodes in the observed graph $\mathcal{G}_{obs}$, and $Y_L$ is the set of known training labels. $W_s$ denotes the $s$-th sample of GCN weights trained on the inferred graph $\hat{\mathcal{G}}$ from the non-parametric model. $z_i$ represents any suitable embedding of node $i$, and $\hat{y}_i$ is the label obtained for the $i$-th node from a baseline node classification algorithm. $\mathcal{N}(i)$ is the neighbourhood of the $i$-th node in $\mathcal{G}_{obs}$. The distance matrix is defined as follows:

$D^{(e)}_{i,j} = \lVert z_i - z_j \rVert_2^2$,   (18)

$D^{(y)}_{i,j} = \mathbb{1}(\hat{y}_i \neq \hat{y}_j)$,   (19)

$D_{i,j} = D^{(e)}_{i,j} + \delta\, D^{(y)}_{i,j}$.   (20)

Here, $\delta$ is a hyperparameter which controls the importance of the label-based distance $D^{(y)}$ relative to the embedding-based distance $D^{(e)}$. We need to compute the Monte Carlo approximation of the posterior distribution of the labels $\mathbf{Z}$, which is given as:

$p(\mathbf{Z} \mid Y_L, X, \mathcal{G}_{obs}) \approx \frac{1}{S} \sum_{s=1}^{S} p(\mathbf{Z} \mid W_s, \hat{\mathcal{G}}, X)$.   (21)
Pseudocode for the proposed BGCN algorithm is provided in Algorithm 1.
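A minimal sketch of the two computations above, assuming the distance combines squared embedding distances with a $\delta$-weighted pseudo-label disagreement penalty, and that each weight sample yields a per-node predictive distribution to be averaged:

```python
import numpy as np

def distance_matrix(emb, pseudo_labels, delta=1.0):
    """Pairwise distances: squared embedding distance plus a
    delta-weighted penalty whenever the baseline pseudo-labels of
    the two nodes disagree (the combination rule is an assumption)."""
    diff = emb[:, None, :] - emb[None, :, :]
    d_emb = (diff ** 2).sum(axis=-1)
    d_lab = (pseudo_labels[:, None] != pseudo_labels[None, :]).astype(float)
    return d_emb + delta * d_lab

def mc_posterior(prob_samples):
    """Monte Carlo averaging step: mean of the predictive label
    distributions obtained from the S sampled GCN weight sets."""
    return np.mean(prob_samples, axis=0)
```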
6.3.2 Results on the fixed training-test split from [44]
Fixed split  5 labels  10 labels  20 labels

ChebyNet  67.9 ± 3.1  72.7 ± 2.4  80.4 ± 0.7
GCN  74.4 ± 0.8  74.9 ± 0.7  81.6 ± 0.5
GAT  73.5 ± 2.2  74.5 ± 1.3  81.6 ± 0.9
BGCN  75.3 ± 0.8  76.6 ± 0.8  81.2 ± 0.8
SBM-GCN  59.3 ± 1.3  77.3 ± 1.2  82.2 ± 0.8
BGCN (ours)  76.0 ± 1.1  76.8 ± 0.9  80.3 ± 0.6

Fixed split  5 labels  10 labels  20 labels

ChebyNet  53.0 ± 1.9  67.7 ± 1.2  70.2 ± 0.9
GCN  55.4 ± 1.1  65.8 ± 1.1  70.8 ± 0.7
GAT  55.4 ± 2.6  66.1 ± 1.7  70.8 ± 1.0
BGCN  57.3 ± 0.8  70.8 ± 0.6  72.2 ± 0.6
SBM-GCN  20.8 ± 2.0  66.3 ± 0.6  71.7 ± 0.1
BGCN (ours)  59.0 ± 1.5  71.7 ± 0.8  72.6 ± 0.6

Fixed split  5 labels  10 labels  20 labels

ChebyNet  68.1 ± 2.5  69.4 ± 1.6  76.0 ± 1.2
GCN  69.7 ± 0.5  72.8 ± 0.5  78.9 ± 0.3
GAT  70.0 ± 0.6  71.6 ± 0.9  76.9 ± 0.5
BGCN  70.9 ± 0.8  72.3 ± 0.8  76.6 ± 0.7
SBM-GCN  64.8 ± 0.8  71.7 ± 0.7  80.6 ± 0.4
BGCN (ours)  73.3 ± 0.7  73.9 ± 0.9  79.2 ± 0.5
In the main paper, we consider random partitioning of the nodes into training and test sets and report the average accuracies across different splits. We conduct another experiment where the same fixed training-test split of [44] is used for 50 random initializations of the GCN weights. The classification results for this setting are provided in Tables 7, 8 and 9.
6.4 Bayesian VGAE Algorithm for Link Prediction
We recall that the graph combination function returns a graph whose adjacency matrix agrees with that of $\mathcal{G}_{obs}$ on the observed entries, while the unobserved entries are replaced by the corresponding entries of the inferred graph. The distance matrix for the non-parametric graph inference is defined as:

$D_{i,j} = \mathbb{E}_{q}\!\left[ \lVert z_i - z_j \rVert_2^2 \right]$,   (22)

where $q(\cdot)$ is the approximate posterior distribution of the unsupervised node representations from a Variational Graph Auto-Encoder (VGAE) model. In the proposed Bayesian VGAE, the inference distribution is modelled as follows:

$q(\mathbf{Z} \mid X, \mathcal{G}_{obs}, \hat{\mathcal{G}}) = \prod_{i} q(z_i \mid X, \mathcal{G}_{comb})$,   (23)

where $\mathcal{G}_{comb}$ is the combined graph described above and $\hat{\mathcal{G}}$ is the inferred graph from the non-parametric model. The resulting algorithm is summarized in Algorithm 2.
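The combination step can be sketched as follows; `observed_mask` is a hypothetical encoding of which adjacency entries were actually observed:

```python
import numpy as np

def combine_graphs(adj_obs, adj_inferred, observed_mask):
    """Keep the observed entries of the adjacency matrix and fill the
    unobserved entries from the inferred graph."""
    return np.where(observed_mask, adj_obs, adj_inferred)

adj_obs = np.array([[0, 1], [1, 0]])
adj_inf = np.array([[0, 0], [1, 1]])
mask = np.array([[True, True], [True, False]])  # bottom-right entry unobserved
combined = combine_graphs(adj_obs, adj_inf, mask)
```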
6.5 Recommendation Systems
6.5.1 Algorithm
We recall that $\mathcal{R}$ is the set of training rankings in the observed graph $\mathcal{G}_{obs}$. The distance between the $u$-th user and the $i$-th item is defined as:

$D_{u,i} = \lVert \mathbf{e}_u - \mathbf{e}_i \rVert_2^2$,   (24)

where $\mathbf{e}_u$ and $\mathbf{e}_i$ are the representations from the base node embedding algorithm for the $u$-th user and the $i$-th item, respectively. We define $\hat{\mathcal{G}}$ by removing a fraction of the links with the highest edge weights in the inferred graph from the negative pool of interactions according to $D$. In the Bayesian version of the recommendation system, we need to compute the posterior of the unobserved rankings $\mathbf{r}$:

$p(\mathbf{r} \mid \mathcal{R}, \mathcal{G}_{obs}) = \int p(\mathbf{r} \mid W, \mathcal{G})\, p(W \mid \mathcal{R}, \mathcal{G})\, p(\mathcal{G} \mid \mathcal{G}_{obs})\, dW\, d\mathcal{G}$.   (25)

We perform non-parametric graph inference to obtain $\hat{\mathcal{G}}$, then compute and minimize the BPR loss to form the estimate of the weights:

$\hat{W} = \operatorname{arg\,min}_{W} \; -\!\sum_{(u,i,j)} \ln \sigma\!\left( \hat{x}_{u,i}(W) - \hat{x}_{u,j}(W) \right) + \lambda \lVert W \rVert_2^2$,   (26)

where each triplet $(u, i, j)$ pairs an observed user-item interaction $(u, i)$ with a negative item $j$ sampled from the reduced pool.
The resulting algorithm is summarized in Algorithm 3.
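A sketch of the BPR objective minimized in the final step, with negatives drawn from the reduced pool; the dot-product scoring and the L2 weight are assumptions:

```python
import numpy as np

def bpr_loss(user_emb, item_emb, triplets, reg=1e-4):
    """BPR loss sketch. Each row of 'triplets' is (user u, positive item i,
    negative item j), with j sampled from the reduced negative pool."""
    u, i, j = triplets.T
    x_ui = (user_emb[u] * item_emb[i]).sum(axis=-1)   # score of positive item
    x_uj = (user_emb[u] * item_emb[j]).sum(axis=-1)   # score of negative item
    nll = -np.log(1.0 / (1.0 + np.exp(-(x_ui - x_uj)))).sum()
    l2 = reg * ((user_emb ** 2).sum() + (item_emb ** 2).sum())
    return nll + l2
```

Minimizing this pushes each observed item's score above the scores of the (cleaned) negative candidates for the same user.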
6.5.2 Definitions of the performance evaluation metrics

Recall@k denotes the proportion of the true (preferred) items that appear in the top-$k$ recommendation. For a user $u$, the algorithm recommends an ordered set of top-$k$ items $R_k(u)$. There is a set of true preferred items $T(u)$ for user $u$, and the number of true positives is $|R_k(u) \cap T(u)|$. The recall for user $u$ is defined as follows:

$\mathrm{Recall@}k(u) = \dfrac{|R_k(u) \cap T(u)|}{|T(u)|}$.

NDCG@k: Normalized Discounted Cumulative Gain (NDCG) [10] computes a score for $R_k(u)$ which emphasizes higher-ranked true positives. The discounted cumulative gain accounts for a relevancy score $rel_j$ at each position $j$ of the ranked list:

$\mathrm{DCG@}k(u) = \sum_{j=1}^{k} \frac{rel_j}{\log_2(j+1)}, \qquad \mathrm{NDCG@}k(u) = \frac{\mathrm{DCG@}k(u)}{\mathrm{IDCG@}k(u)}$,

where $\mathrm{IDCG@}k(u)$ is the DCG of an ideal ranking which places all items of $T(u)$ first. We consider binary responses, so we use a binary relevance score: $rel_j = 1$ if the $j$-th recommended item is in $T(u)$ and 0 otherwise.
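The two per-user metrics can be sketched as follows (binary relevance, with illustrative argument names):

```python
import numpy as np

def recall_at_k(ranked_items, true_items, k):
    """Recall@k: fraction of the user's true items found in the top-k list."""
    hits = len(set(ranked_items[:k]) & set(true_items))
    return hits / len(true_items)

def ndcg_at_k(ranked_items, true_items, k):
    """NDCG@k with binary relevance: DCG of the ranked list divided by the
    DCG of an ideal ranking that places all true items first."""
    rel = [1.0 if item in true_items else 0.0 for item in ranked_items[:k]]
    dcg = sum(r / np.log2(pos + 2) for pos, r in enumerate(rel))
    ideal = sum(1.0 / np.log2(pos + 2)
                for pos in range(min(len(true_items), k)))
    return dcg / ideal
```

Both are averaged over all test users to produce the numbers reported in Table 3.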