Non-Parametric Graph Learning for Bayesian Graph Neural Networks

06/23/2020 · Soumyasundar Pal et al. · HUAWEI Technologies Co., Ltd., Université de Montréal, McGill University

Graphs are ubiquitous in modelling relational structures. Recent endeavours in machine learning for graph-structured data have led to many architectures and learning algorithms. However, the graph used by these algorithms is often constructed based on inaccurate modelling assumptions and/or noisy data. As a result, it fails to represent the true relationships between nodes. A Bayesian framework which targets posterior inference of the graph by considering it as a random quantity can be beneficial. In this paper, we propose a novel non-parametric graph model for constructing the posterior distribution of graph adjacency matrices. The proposed model is flexible in the sense that it can effectively take into account the output of graph-based learning algorithms that target specific tasks. In addition, model inference scales well to large graphs. We demonstrate the advantages of this model in three different problem settings: node classification, link prediction and recommendation.


1 Introduction

Growing interest in inference tasks involving networks has prompted the need for learning architectures adapted to graph-structured data. As a result, numerous models have been proposed for addressing various graph based learning tasks such as classification, link prediction, and recommendation. These approaches process the observed graph as if it depicts the true relationship among the nodes. In practice, the observed graphs are formed based on imperfect observations and incorrect modelling assumptions. Spurious edges might be formed and important links might be deleted. The vast majority of existing algorithms cannot take the uncertainty of the graph structure into account during training as there is no mechanism for removing spurious edges and/or adding informative edges in the observed graph.

Several algorithms that do address this uncertainty by incorporating a graph learning component have been proposed recently [44, 20, 36, 11]. These methods have limitations, either involving parametric graph models that restrict their applicability or being focused on the task of node classification.

In this work, we propose a non-parametric graph inference technique which is incorporated in a Bayesian framework to tackle node and/or edge level learning tasks. Our approach has the following key benefits. First, it generalizes the applicability of Bayesian techniques beyond the realm of parametric modelling. Second, flexible, task-specific graph learning can be achieved; this makes effective use of the outputs of existing graph-learning techniques to improve upon them. Third, the graph learning procedure scales well to large graphs, in contrast to the increased difficulty of parametric approaches.

We conduct extensive experiments to demonstrate the usefulness of our model for three different graph-related tasks. For node classification, we observe increased accuracy when the amount of labelled data is very limited. For unsupervised learning, we show that incorporating a graph learning step when performing variational modelling of the graph structure with auto-encoder models leads to better link prediction. Finally, a Bayesian approach based on our proposed model improves recall for existing state-of-the-art graph-based recommender system architectures.

2 Related Work

Topology uncertainty in graph neural networks:

The most closely related work to our proposed approach is a group of recent techniques that jointly perform inference of the graph while addressing a learning task such as node classification. The recently proposed Bayesian GCN [44] provides a general, principled framework to deal with the issue of uncertainty on graphs. Similar ideas are considered in [20], where variational inference is used to learn the graph structure. This formulation allows consideration of additional data such as features and labels when performing graph inference, but the technique is still tied to a parametric model. In [36], the authors take a non-parametric approach, but their probabilistic formulation is focused on improving only very noisy graphs. In [11], simultaneous optimization of the graph structure along with the learning task is considered. In all of these works, only the node classification task has been explored. Our methodology extends the applicability of these methods by combining the Bayesian framework with a more flexible non-parametric graph model.

Graph learning:

Multiple algorithms have been proposed that focus exclusively on learning graph connectivity based on observed data [3, 13]. These works differ from ours in that their end goal is topology inference. These algorithms typically appeal to a smoothness criterion for the graph. Although these methods provide useful graphs, they have $\mathcal{O}(N^2)$ complexity in the number of nodes $N$. As a result, many do not scale well to large graphs. Approximate nearest neighbour (A-NN) graph learning [22] has $\mathcal{O}(N \log N)$ complexity, which is more suitable for large scale applications, but the learned graph generally has poor quality compared to an exact k-NN graph. A more recent method [12] introduces an approximate graph learning algorithm which provides an efficient trade-off between runtime and the quality of the solution. We build on this method for our inference procedure, but our graph model is tailored to the specific learning task we address.
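To make the scalability point concrete, the following sketch shows how a (possibly approximate) nearest-neighbour search can restrict the support of the graph-learning variables to $k$ candidate neighbours per node. The function name and the use of scikit-learn are our illustration, not the implementation of [12] or [22].

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_support(features: np.ndarray, k: int = 10):
    """Return index pairs (i, j) and squared distances of a k-NN graph.

    Restricting the graph-learning variables to these O(N*k) candidate pairs,
    rather than all N^2 node pairs, is what allows approximate solvers to
    scale to large graphs.
    """
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(features)
    dist, idx = nbrs.kneighbors(features)       # first neighbour is the node itself
    n = features.shape[0]
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].reshape(-1)               # drop the self-match
    sq_dist = dist[:, 1:].reshape(-1) ** 2
    return rows, cols, sq_dist
```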

Deep learning based graph generative models:

There is a large body of existing work on deep learning based graph generative models. Various algorithms for graph generation using VAEs, RNNs, and normalizing flows are developed in [17, 31, 43, 18, 19]. These approaches are evaluated based on the likelihood of sampled graphs and by comparing graph characteristics. Moreover, these algorithms do not preserve node identities, so sampled (inferred) graphs cannot be directly used for node or edge level inference. Generative adversarial network (GAN) based approaches [39, 1] are more successful in sampling graphs similar to the observed one. However, these models have prohibitively high computational complexity and their performance is heavily dependent on hyperparameter tuning.

Node classification:

A central learning task on graphs is semi-supervised node classification. In general, the most common approach is to incorporate graph filters within deep learning algorithms. Early works [4, 2] based their models on theory from the graph signal processing community. This approach led to more sophisticated graph convolution architectures [15, 38, 8]. More recent models include [46, 41]. In [35], a learnable graph kernel based on a data-driven similarity metric is considered for node classification. Our graph learning framework can be combined with these algorithms to augment performance, particularly when there is a very limited amount of labelled data.

Link prediction:

Several algorithms based on autoencoders have been shown to perform extremely well for the link prediction task [14, 27, 7, 23]. These techniques learn node embeddings in a (variational) autoencoder framework and model the probability of the existence of an edge based on the closeness of the embeddings. We show how our method can be combined with these strategies to deliver a small but consistent improvement for the link prediction task.

Recommender systems:

Recommender systems have become a key factor to meet users’ diverse and personalized needs for online consumption platforms. The most common approach is collaborative filtering (CF). Recent works have incorporated graphs and GNNs to better model the user-item interactions [37, 42, 40, 32, 24, 45].

Although GNN-based recommendation models have achieved impressive performance, existing methods regard the provided user-item interaction records as ground truth. In many practical settings, the user-item interaction graph has spurious edges due to noisy information; on the other hand, some potentially positive user-item interactions are missing because the item is never presented to the user, and these are then falsely treated as negative interactions. It is therefore important to capture the uncertainty in the observed user-item interaction graph. In the following methodology section, we elaborate on how our graph learning approach can alleviate this problem.

3 Methodology

3.1 Non-Parametric Graph Learning

In many learning tasks, an observed graph $\mathcal{G}_{obs}$ provides additional structure to the given data $\mathbf{x}$. The data $\mathbf{x}$ can include feature vectors, labels, and other information, depending on the task at hand. If $\mathcal{G}_{obs}$ is not readily available, it is often built from the data and possibly other side-information. In many cases, $\mathcal{G}_{obs}$ does not represent the true relationship of the nodes, as it is often formed using inaccurate modelling assumptions and/or is constructed from noisy data. In several recent works [44, 20, 36], it has been shown that building a posterior model for the 'true' graph $\mathcal{G}$ and incorporating it in the learning task is beneficial.

We propose a non-parametric generative model for the adjacency matrix $A_{\mathcal{G}}$ of the random undirected graph $\mathcal{G}$. $A_{\mathcal{G}}$ is assumed to be a symmetric matrix with non-negative entries. We emphasize that our model retains the identities of the nodes and disallows permutations of nodes (permuted adjacency matrices do not represent equivalent graphs when node identities are preserved). This characteristic is essential for its use in node and edge level inference tasks. We define the prior distribution for $\mathcal{G}$ as

$p(\mathcal{G}) \propto \exp\big( \alpha\, \mathbf{1}^T \log(A_{\mathcal{G}} \mathbf{1}) - \beta\, \|A_{\mathcal{G}}\|_F^2 \big)$    (1)

The first term in the log prior is a logarithmic barrier on the node degrees $A_{\mathcal{G}} \mathbf{1}$, which prevents isolated nodes in $\mathcal{G}$. The second term is a regularizer based on the Frobenius norm which encourages low weights for the links. $\alpha$ and $\beta$ are hyperparameters which control the scale and sparsity of $A_{\mathcal{G}}$. In our model, the joint likelihood of $\mathbf{x}$ and $\mathcal{G}_{obs}$ conditioned on $\mathcal{G}$ is:

$p(\mathbf{x}, \mathcal{G}_{obs} \mid \mathcal{G}) \propto \exp\big( - \|A_{\mathcal{G}} \circ Z(\mathbf{x}, \mathcal{G}_{obs})\|_{1,1} \big)$    (2)

where $Z(\mathbf{x}, \mathcal{G}_{obs})$ is a symmetric pairwise distance matrix which encodes the dissimilarity between the nodes. The symbol $\circ$ denotes the Hadamard product and $\|\cdot\|_{1,1}$ denotes the elementwise $\ell_1$ norm. The likelihood encourages higher edge weights for node pairs with lower pairwise distances and vice versa.

Bayesian inference of the graph involves sampling from its posterior distribution. The space of symmetric, non-negative adjacency matrices is high dimensional ($\mathcal{O}(N^2)$ dimensions, where $N$ is the number of nodes). Designing a suitable sampling scheme (e.g., Markov chain Monte Carlo) in such a high dimensional space is extremely challenging and computationally demanding for large graphs. Instead we pursue maximum a posteriori estimation, which is equivalent to approximating the posterior by a point mass at its mode [21]. We solve the following optimization problem:

$\hat{\mathcal{G}} = \operatorname*{arg\,max}_{\mathcal{G}} \; \log p(\mathbf{x}, \mathcal{G}_{obs} \mid \mathcal{G}) + \log p(\mathcal{G})$    (3)

which is equivalent to learning an $N \times N$ symmetric adjacency matrix $\hat{A}_{\mathcal{G}}$ of $\hat{\mathcal{G}}$:

$\hat{A}_{\mathcal{G}} = \operatorname*{arg\,min}_{A \geq 0,\, A = A^T} \; \|A \circ Z\|_{1,1} - \alpha\, \mathbf{1}^T \log(A \mathbf{1}) + \beta\, \|A\|_F^2$    (4)

The optimization problem in (4) has been studied in the context of graph learning from smooth signals. [13] adopts a primal-dual optimization technique to solve this problem. However, the complexity of this approach scales as $\mathcal{O}(N^2)$, which can be prohibitive for large graphs. In this paper, we employ the scalable, approximate algorithm in [12], which has several advantages. First, it can use existing approximate nearest neighbour techniques, as in [22], to reduce the dimensionality of the optimization problem. Second, the graph learning has a computational complexity of $\mathcal{O}(N \log N)$ (the same as approximate nearest neighbour algorithms), while the quality of the learned graph is comparable to the state-of-the-art. Third, if we are not concerned about the scale of the learned graph (which is typical in many of the learning tasks we consider, since a normalized version of the adjacency or Laplacian matrix is used), the approximate algorithm allows us to use a single hyperparameter in place of $\alpha$ and $\beta$ to control the sparsity of the solution, and it provides a useful heuristic for automatically selecting a suitable value based on the desired edge density of the solution.

In our work, we use this approximate algorithm for inference of the graph $\hat{\mathcal{G}}$, which is subsequently used in various learning tasks. Since we have freedom in choosing a functional form for $Z(\mathbf{x}, \mathcal{G}_{obs})$, we can design suitable distance metrics in a task-specific manner. This flexibility allows us to incorporate the graph learning step in diverse tasks. In the next three subsections, we present how the graph learning step can be applied to develop Bayesian algorithms for node classification, link prediction and recommendation systems.
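As a rough illustration of the MAP problem in (4), the sketch below runs projected gradient descent on a dense adjacency matrix; it is not the primal-dual solver of [13] nor the approximate scheme of [12] that we actually use, and all names are ours.

```python
import numpy as np

def learn_graph_map(Z: np.ndarray, alpha: float = 1.0, beta: float = 1.0,
                    lr: float = 1e-2, n_iter: int = 500, eps: float = 1e-8):
    """Projected-gradient sketch of
        min_A  ||A o Z||_1 - alpha * 1^T log(A 1) + beta * ||A||_F^2
    over symmetric, non-negative A with zero diagonal (cf. eq. (4))."""
    n = Z.shape[0]
    A = np.ones((n, n)) / n
    np.fill_diagonal(A, 0.0)
    for _ in range(n_iter):
        deg = A.sum(axis=1) + eps                  # node degrees A 1
        grad = Z - alpha / deg[:, None] + 2.0 * beta * A
        A = np.maximum(A - lr * grad, 0.0)         # gradient step + projection onto A >= 0
        A = 0.5 * (A + A.T)                        # keep A symmetric
        np.fill_diagonal(A, 0.0)
    return A
```

With $Z$ built from task-specific pairwise distances, the returned matrix plays the role of $\hat{A}_{\mathcal{G}}$; in practice, the scalable solver of [12] restricted to an approximate nearest-neighbour support replaces this dense loop.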

3.2 Node Classification

Problem Statement:

We consider a semi-supervised node classification problem for the nodes in $\mathcal{G}_{obs}$. In this setting we also have access to the node features $\mathbf{X}$ and the labels $\mathbf{Y}_L$ of the nodes in the training set $L$, so $\mathbf{x} = \{\mathbf{X}, \mathbf{Y}_L\}$. The task is to predict the labels $\mathbf{Y}_U$ of the remaining nodes $U$, where $U$ is the complement of $L$ in the node set.

Bayesian GCN – non-parametric model:

[44] derive a Bayesian learning methodology for GCNs by building a posterior model for $\mathcal{G}$. Their approach assumes that $\mathcal{G}_{obs}$ is sampled from a parametric graph model. The graph model parameters are marginalized to target inference of the graph posterior. Although this approach is effective, it has several drawbacks. The methodology lacks flexibility, since a particular parametric model might not fit different types of graph. Bayesian inference of the model parameters is often challenging for large graphs. Finally, parametric modelling of graphs cannot use the information provided by the node features and training labels for inference of $\mathcal{G}$. Here, we propose to incorporate a non-parametric model for inference of $\mathcal{G}$ in the BGCN framework. We aim to compute the marginal posterior probability of the node labels $\mathbf{Y}$, which is obtained via marginalization with respect to the graph $\mathcal{G}$ and the GCN weights $W$:

$p(\mathbf{Y} \mid \mathbf{Y}_L, \mathbf{X}, \mathcal{G}_{obs}) = \int p(\mathbf{Y} \mid W, \mathcal{G}, \mathbf{X})\, p(W \mid \mathbf{Y}_L, \mathbf{X}, \mathcal{G})\, p(\mathcal{G} \mid \mathbf{Y}_L, \mathbf{X}, \mathcal{G}_{obs})\, dW\, d\mathcal{G}$    (5)

The categorical distribution of the node labels $p(\mathbf{Y} \mid W, \mathcal{G}, \mathbf{X})$ is modelled by applying a softmax function to the output of the last layer of the GCN. The integral in (5) cannot be computed in closed form, so we employ a Monte Carlo approximation:

$p(\mathbf{Y} \mid \mathbf{Y}_L, \mathbf{X}, \mathcal{G}_{obs}) \approx \frac{1}{S} \sum_{s=1}^{S} p(\mathbf{Y} \mid W_s, \hat{\mathcal{G}}, \mathbf{X})$    (6)

Here, we learn the maximum a posteriori (MAP) estimate $\hat{\mathcal{G}}$ and subsequently sample $S$ weight matrices $\{W_s\}_{s=1}^{S}$ from $p(W \mid \mathbf{Y}_L, \mathbf{X}, \hat{\mathcal{G}})$ by training a Bayesian GCN using the graph $\hat{\mathcal{G}}$.
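A minimal sketch of the Monte Carlo approximation in (6), assuming a torch GCN-style module with dropout layers that takes (features, adjacency) as input; keeping the module in train() mode at prediction time keeps dropout active, so each forward pass corresponds to one weight sample $W_s$ [5].

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_predict(model, features, adj_hat, num_samples: int = 20):
    """Average the softmax outputs of `num_samples` stochastic forward passes
    over the MAP graph `adj_hat`, approximating eq. (6)."""
    model.train()                                    # keep dropout switched on
    samples = [F.softmax(model(features, adj_hat), dim=-1)
               for _ in range(num_samples)]
    return torch.stack(samples).mean(dim=0)          # approximate class posteriors
```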

In order to perform the graph learning step, we need to define a pairwise distance matrix $Z$. For this application, we propose to combine the output of a node embedding algorithm and a base classifier to form $Z$:

$Z = Z^{emb} + \delta\, Z^{lab}$    (7)

Here $\delta$ is a hyperparameter which controls the importance of $Z^{lab}$ relative to $Z^{emb}$. The $(i,j)$'th entries of $Z^{emb}$ and $Z^{lab}$ are defined as follows:

$Z^{emb}_{i,j} = \|\mathbf{z}_i - \mathbf{z}_j\|_2^2$    (8)
$Z^{lab}_{i,j} = \frac{1}{|\mathcal{N}(i)|\,|\mathcal{N}(j)|} \sum_{l \in \mathcal{N}(i)} \sum_{m \in \mathcal{N}(j)} \mathbb{1}(\hat{c}_l \neq \hat{c}_m)$    (9)

Here, $\mathbf{z}_i$ is any suitable embedding of node $i$ and $\hat{c}_i$ is the predicted label at node $i$ obtained from the base classification algorithm. $Z^{emb}$ measures pairwise dissimilarity in terms of the observed topology and features, and $Z^{lab}$ summarizes the discrepancy of the predicted node labels in the neighbourhoods. For the experiments, we choose the Variational Graph Auto-Encoder (VGAE) algorithm [14] as the node embedding method to obtain the vectors $\mathbf{z}_i$ and use the GCN proposed by [15] as the base classifier to obtain the labels $\hat{c}_i$. The neighbourhood of the $i$-th node is defined as $\mathcal{N}(i) = \{j : (i,j) \in \mathcal{E}_{obs}\}$, where $\mathcal{E}_{obs}$ is the set of edges in $\mathcal{G}_{obs}$. With regard to the choice of the hyperparameter $\delta$, we observe that a fixed value works well across our experiments, although it can be tuned via cross-validation if a validation set is available.
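The sketch below shows one plausible way to assemble the distance matrix of (7)-(9): the squared Euclidean embedding distance for $Z^{emb}$ and an average predicted-label disagreement between neighbourhoods for $Z^{lab}$. It is written for clarity rather than efficiency, and the exact form of the label term follows our reading of (9).

```python
import numpy as np

def node_classification_distance(embeddings, pred_labels, neighbours, delta=1.0):
    """Build Z = Z_emb + delta * Z_lab for the node classification task.

    embeddings : (N, d) array of node embeddings (e.g. from a VGAE)
    pred_labels: (N,) array of labels predicted by the base classifier
    neighbours : list of index lists; neighbours[i] is N(i) and is assumed
                 non-empty (it may include node i itself)
    """
    sq = (embeddings ** 2).sum(axis=1)
    z_emb = sq[:, None] + sq[None, :] - 2.0 * embeddings @ embeddings.T

    n = embeddings.shape[0]
    z_lab = np.zeros((n, n))
    for i in range(n):
        li = pred_labels[neighbours[i]]
        for j in range(i + 1, n):
            lj = pred_labels[neighbours[j]]
            # fraction of disagreeing predicted-label pairs between N(i) and N(j)
            z_lab[i, j] = z_lab[j, i] = (li[:, None] != lj[None, :]).mean()
    return z_emb + delta * z_lab
```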

For the inference of the GCN weights $W$, many existing algorithms such as expectation propagation [9], variational inference [5, 33], and Markov chain Monte Carlo methods [26, 16] can be employed. As in [44], we train a GCN on the inferred graph $\hat{\mathcal{G}}$ and use Monte Carlo dropout [5]. This is equivalent to sampling $W_s$ from a particular variational approximation of $p(W \mid \mathbf{Y}_L, \mathbf{X}, \hat{\mathcal{G}})$. The resulting algorithm is provided in the supplementary material.

3.3 Link Prediction

Problem statement:

In this setting, some of the links in $\mathcal{G}_{obs}$ are hidden or unobserved. The task is to predict the unseen links based on knowledge of the (partially) observed $\mathcal{G}_{obs}$ and the node features $\mathbf{X}$. Thus in this case, the additional data beyond the graph is $\mathbf{x} = \{\mathbf{X}\}$.

Background:

In existing works, the link prediction problem is addressed by building deep learning based generative models for graphs. In particular, various architectures of graph variational auto-encoders [14, 7, 23] aim to learn the posterior distribution of the node embeddings $\mathbf{Z}_e$ conditioned on the observed graph $\mathcal{G}_{obs}$ and the node features $\mathbf{X}$. The inference model (encoder) often uses simplifying assumptions (e.g. a mean-field approximation over nodes or diagonal covariance structures) for the parametric form of the approximate variational posterior distribution $q(\mathbf{Z}_e \mid \mathbf{X}, \mathcal{G}_{obs})$. Deep learning architectures are used to learn the parameters of this model. The decoder is another deep learning model which explains how the graph is generated from the embeddings, i.e., it parameterizes $p(\mathcal{G}_{obs} \mid \mathbf{Z}_e)$. Typically, the probability of a link in these models depends on the similarity of the embeddings of the two incident nodes. Assuming a suitable prior $p(\mathbf{Z}_e)$, the encoder and decoder are trained jointly to minimize the KL divergence between $q(\mathbf{Z}_e \mid \mathbf{X}, \mathcal{G}_{obs})$ and the true posterior. The learned embeddings are evaluated on an amortized link prediction task for the unseen portion of the graph.

Proposed methodology – Bayesian VGAE:

We consider a Bayesian formulation, where we conduct Bayesian inference of the graph in the encoder. Let us introduce a function $\mathcal{M}$ that returns a graph $\mathcal{M}(\mathcal{G}_{obs}, \hat{\mathcal{G}})$ such that the unobserved entries of the adjacency matrix of $\mathcal{G}_{obs}$ are replaced by the corresponding entries of $\hat{A}_{\mathcal{G}}$. We then model the inference distribution as follows:

$q(\mathbf{Z}_e \mid \mathbf{X}, \mathcal{G}_{obs}) = q\big(\mathbf{Z}_e \mid \mathbf{X}, \mathcal{M}(\mathcal{G}_{obs}, \hat{\mathcal{G}})\big)\,,$

where $\hat{\mathcal{G}}$ is the MAP estimate from the non-parametric model. The intuitive idea behind this modelling is that if the non-parametric inference provides a reasonable approximation of the unobserved adjacency matrix entries, then an auto-encoder trained on a graph that incorporates these approximate entries should learn better embeddings. For the graph learning step, we form the distance matrix $Z$ using the output of an auto-encoder trained on $\mathcal{G}_{obs}$ as follows:

$Z_{i,j} = \big\| \mathbb{E}_{q}[\mathbf{z}_i] - \mathbb{E}_{q}[\mathbf{z}_j] \big\|_2^2$    (10)

The resulting algorithm is summarized in the supplementary material.
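A small sketch of the combination step used by the Bayesian VGAE: keep the observed adjacency entries and fill the held-out (unobserved) entries with the MAP graph. The mask convention is our own; the rest of the algorithm simply retrains the auto-encoder on the combined graph.

```python
import numpy as np

def combine_graphs(adj_obs, adj_map, observed_mask):
    """The role of M(G_obs, G_hat): observed entries are taken from adj_obs,
    unobserved entries (observed_mask == False) from the MAP adjacency adj_map."""
    adj_new = np.where(observed_mask, adj_obs, adj_map)
    return 0.5 * (adj_new + adj_new.T)        # keep the result symmetric
```

The auto-encoder is then retrained on the combined graph, so the encoder conditions on $\mathcal{M}(\mathcal{G}_{obs}, \hat{\mathcal{G}})$ rather than on $\mathcal{G}_{obs}$ alone.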

3.4 Recommendation Systems

Problem statement:

In this section we address a personalized item recommendation task based on historical interaction data. We denote the set of users by $\mathcal{U}$ and the set of items by $\mathcal{I}$. The interaction between any user $u \in \mathcal{U}$ and item $i \in \mathcal{I}$ is encoded as a link in a bipartite graph $\mathcal{G}_{obs}$. The task is to infer the unobserved interactions (and to use these as predictions of future interactions). Viewed in this light, the recommendation task is a link prediction problem. However, in many cases, predicting a personalized ranking over the items is important [29].

For each user $u$, if there is an observed interaction with item $i$ and an unobserved interaction with item $j$, we include the triple $(u, i, j)$ in the training set. The introduced relation $i \succ_u j$ implies that user $u$ prefers item $i$ to item $j$. This interaction data leads to a set of rankings for each user over the training set of triples $\mathcal{D}_{train}$. We denote these rankings for all users in $\mathcal{U}$ as $\succ_{train}$. This training data is used to learn a model parameterized by $\Theta$. The generalization capability is tested by ranking, for each user $u$, all item pairs $(i, j)$ such that $(u, i)$ is a held-out (test) interaction and $(u, j)$ is unobserved in both the training and test sets. We denote the rankings for a specific user $u$ in this test set as $\succ_u^{test}$. The collection of all such rankings for all users is denoted $\succ_{test}$. In this paper, we propose to incorporate Bayesian inference of the graph $\mathcal{G}$ in the Bayesian Personalized Ranking (BPR) loss formulation [29]. A brief review of the BPR loss is provided for completeness.

Background – BPR loss:

Many existing graph based deep learning recommender systems [32, 40, 42] learn an embedding $\mathbf{e}_u$ for user $u$ and $\mathbf{e}_i$ for item $i$, and model the probability that user $u$ prefers item $i$ to item $j$ as follows:

$p(i \succ_u j \mid \Theta) = \sigma(\mathbf{e}_u^T \mathbf{e}_i - \mathbf{e}_u^T \mathbf{e}_j)\,.$

Here $\sigma(\cdot)$ is the sigmoid function, $\mathbf{e}_u^T \mathbf{e}_i$ is the inner product between the user and item embeddings, and $\Theta$ collects all learnable embeddings. Our goal is to compute:

$p(\succ_{test} \mid \succ_{train}) = \int p(\succ_{test} \mid \Theta)\, p(\Theta \mid \succ_{train})\, d\Theta$    (11)

but this integral is not tractable. In practice, we assume a prior $p(\Theta)$ for the embeddings and model the preferences of different users as independent. We can then consider a MAP estimate of $\Theta$:

$\hat{\Theta} = \operatorname*{arg\,max}_{\Theta} \sum_{(u,i,j) \in \mathcal{D}_{train}} \log \sigma(\mathbf{e}_u^T \mathbf{e}_i - \mathbf{e}_u^T \mathbf{e}_j) + \log p(\Theta)\,.$

This is equivalent to minimizing the BPR loss, where the positive pool and negative pool are created according to $\mathcal{G}_{obs}$. Once the MAP estimate $\hat{\Theta}$ has been obtained, we assess the performance by ranking the test set triples using $p(i \succ_u j \mid \hat{\Theta})$.
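For reference, a standard BPR loss over a batch of $(u, i, j)$ triples reads as follows in PyTorch; the L2 term plays the role of the log-prior $\log p(\Theta)$ in the MAP formulation above, and the tensor layout is our assumption rather than the one used by the MGCCF/NGCF implementations.

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_emb, item_emb, users, pos_items, neg_items, reg=1e-4):
    """-log sigma(e_u . e_i - e_u . e_j) averaged over the batch, plus an
    L2 penalty on the embeddings appearing in the batch."""
    e_u = user_emb[users]          # (B, d) user embeddings
    e_i = item_emb[pos_items]      # (B, d) positive items
    e_j = item_emb[neg_items]      # (B, d) sampled negative items
    score_diff = (e_u * e_i).sum(-1) - (e_u * e_j).sum(-1)
    ranking_loss = -F.logsigmoid(score_diff).mean()
    l2 = (e_u.pow(2).sum() + e_i.pow(2).sum() + e_j.pow(2).sum()) / users.shape[0]
    return ranking_loss + reg * l2
```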

Non-parametric model – Bayesian graph recommender system:

In the Bayesian setting, ranking is conducted by considering an expectation with respect to the posterior distribution $p(\mathcal{G} \mid \mathcal{G}_{obs})$ of the graph from the non-parametric model. We need to evaluate the posterior probability of the rankings in the test set. Let us introduce the graph $\mathcal{G}' = \mathcal{M}'(\mathcal{G}_{obs}, \mathcal{G})$, which is obtained via a function $\mathcal{M}'$ that combines the information in $\mathcal{G}_{obs}$ and $\mathcal{G}$. We specify the function that we employ in our methodology more precisely below. We can then write the posterior probability of the ranking of the test set as follows:

$p(\succ_{test} \mid \mathcal{G}_{obs}) = \int p(\succ_{test} \mid \Theta)\, p(\Theta \mid \mathcal{G}')\, p(\mathcal{G} \mid \mathcal{G}_{obs})\, d\Theta\, d\mathcal{G}$    (12)

We approximate the integrals with respect to the posteriors of $\mathcal{G}$ and $\Theta$ by their MAP estimates to obtain:

$p(\succ_{test} \mid \mathcal{G}_{obs}) \approx p(\succ_{test} \mid \hat{\Theta})$    (13)

To calculate this approximation we first perform the non-parametric graph learning to obtain $\hat{\mathcal{G}}$, then compute the new graph $\hat{\mathcal{G}}' = \mathcal{M}'(\mathcal{G}_{obs}, \hat{\mathcal{G}})$ and minimize the BPR loss to form the estimate of the weights

$\hat{\Theta} = \operatorname*{arg\,max}_{\Theta} \sum_{(u,i,j) \in \mathcal{D}'_{train}} \log \sigma(\mathbf{e}_u^T \mathbf{e}_i - \mathbf{e}_u^T \mathbf{e}_j) + \log p(\Theta)$    (14)

according to the positive and negative pool $\mathcal{D}'_{train}$ defined by this new graph $\hat{\mathcal{G}}'$.

Since the dot product measures the similarity between the embeddings in the proposed recommender system architectures, we use the pairwise cosine distance between the embeddings learned by a base node embedding algorithm for learning the bipartite graph:

$Z_{u,i} = 1 - \dfrac{\mathbf{e}_u^T \mathbf{e}_i}{\|\mathbf{e}_u\|_2\, \|\mathbf{e}_i\|_2}$    (15)

Here, the $\mathbf{e}_u$'s and $\mathbf{e}_i$'s are obtained from the node embedding algorithm. Since none of the test set user-item interactions are present in $\mathcal{G}_{obs}$, they are all included in the negative pool. We use the estimated graph $\hat{\mathcal{G}}$ to remove potentially positive interactions in the test set from the negative pool. This is achieved by constructing $\hat{\mathcal{G}}'$ as follows. We identify a fraction of the links with the highest edge weights in $\hat{A}_{\mathcal{G}}$ and subsequently remove them from the negative pool of interactions for the Bayesian approach. The number of links to be removed is decided by examining the performance on a validation set. The resulting algorithm is summarized in the supplementary material.
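A sketch of the negative-pool pruning described above: the candidate negative user-item pairs that receive the largest weights in the inferred graph are treated as likely false negatives and removed. The array-based interface is our illustration.

```python
import numpy as np

def prune_negative_pool(neg_pairs, adj_map, remove_frac=0.05):
    """neg_pairs: (M, 2) array of (user_index, item_index) candidate negatives.
    adj_map:   inferred (bipartite) adjacency matrix with learned edge weights.
    Drops the remove_frac fraction of pairs with the highest inferred weight."""
    weights = adj_map[neg_pairs[:, 0], neg_pairs[:, 1]]
    n_keep = len(neg_pairs) - int(remove_frac * len(neg_pairs))
    keep_idx = np.argsort(weights)[:n_keep]      # lowest-weight pairs are kept
    return neg_pairs[keep_idx]
```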

4 Experiments

4.1 Node Classification

We consider a semi-supervised node classification task on three benchmark citation networks: Cora, Citeseer [30] and Pubmed [25]. The details of the datasets are included in the supplementary material. The attribute vector at a node is a sparse bag-of-words extracted from the keywords in the article and the label denotes the research topic addressed in the article. We consider three different experimental settings, with 5, 10 and 20 labeled nodes per class in the training set. In each setting, we conduct 50 trials based on random splitting of the data and random initialization of the learnable weights. We compare the proposed BGCN with the ChebyNet [2], the GCN [15], the GAT [38], the DFNET [41] (only for Cora and Citeseer due to runtime considerations), the SBM-GCN [20] and the BGCN in [44]. The hyperparameters for the GCN are set to those reported in [15] and the same values are used for the BGCNs. We report the average classification accuracies along with their standard errors in Table 1. For each setting, we conduct a Wilcoxon signed rank test to determine whether the best performing algorithm is significantly better than the second-best. Results in bold font indicate statistical significance at the 5% level.

Algorithms 5 labels 10 labels 20 labels
Cora ChebyNet 61.7±6.8 72.5±3.4 78.8±1.6
GCN 70.0±3.7 76.0±2.2 79.8±1.8
GAT 70.4±3.7 76.6±2.8 79.9±1.8
DFNET-ATT 72.3±2.9 75.8±1.7 79.3±1.8
SBM-GCN 46.0±19 74.4±10 82.6±0.2
BGCN 74.6±2.8 77.5±2.6 80.2±1.5
BGCN (ours) 74.2±2.8 76.9±2.2 78.8±1.7
Citeseer ChebyNet 58.5±4.8 65.8±2.8 67.5±1.9
GCN 58.5±4.7 65.4±2.6 67.8±2.3
GAT 56.7±5.1 64.1±3.3 67.6±2.3
DFNET-ATT 60.5±1.2 63.2±2.9 66.3±1.7
SBM-GCN 24.5±7.3 43.3±12 66.1±5.7
BGCN 63.0±4.8 69.9±2.3 71.1±1.8
BGCN (ours) 64.9±4.6 70.1±1.9 71.4±1.6
Pubmed ChebyNet 62.7±6.9 68.6±5.0 74.3±3.0
GCN 69.7±4.5 73.9±3.4 77.5±2.5
GAT 68.0±4.8 72.6±3.6 76.4±3.0
SBM-GCN 59.0±10 67.8±6.9 74.6±4.5
BGCN 70.2±4.5 73.3±3.1 76.0±2.6
BGCN (ours) 71.1±4.4 74.6±3.6 77.6±2.9
Table 1: Accuracy (in %, mean ± standard error over 50 trials) of semi-supervised node classification.

The results in Table 1 show that the proposed BGCN with non-parametric modelling of the graph achieves either higher or competitive accuracies in most cases. The relative improvement compared to the GCN is more significant if the labelled data is scarce. Comparison with the BGCN approach based on parametric modelling in [44] demonstrates that better or comparable accuracies can be achieved from this model, even if we do not target modelling the community structure of the graph explicitly. From Figure 1, we observe that in most cases, for the Cora and the Citeseer datasets, the proposed BGCN algorithm corrects more misclassifications of the GCN for low degree nodes. The same trend is observed for the Pubmed dataset. The empirical success of the GCN is primarily due to aggregating information with neighbors. As the low degree nodes have less opportunity to aggregate, performance is worse at these nodes. The proposed BGCN approach generates many additional links between similar nodes (Fig. 2). This improves learning, particularly at low degree nodes.

In Figure 2, we compare the adjacency matrix ($\hat{A}_{\mathcal{G}}$) of the MAP estimate graph with the observed adjacency matrix ($A_{obs}$) for the Cora dataset. This reveals that, compared to $A_{obs}$, $\hat{A}_{\mathcal{G}}$ has denser connectivity among the nodes with the same label. This provides a rationale for why the proposed BGCN outperforms the GCN in most cases.

Figure 1: Boxplots for different categories of nodes in the Cora and Citeseer datasets, based on the classification results of the GCN and the proposed BGCN algorithms. The two groups are formed by thresholding the degree of the nodes in the test set at the median value.
Figure 2: (a) The observed adjacency matrix ($A_{obs}$) and (b) the MAP estimate of the adjacency matrix ($\hat{A}_{\mathcal{G}}$) from the non-parametric model for the Cora dataset. The nodes are reordered based on labels. The red lines show the class boundaries.

4.2 Link Prediction

We consider a link prediction task to demonstrate the usefulness of the learned embeddings from the Bayesian approach. We split the links into 85/5/10% for training, validation and testing respectively. The validation and test sets contain the same number of non-links as links. During model training, the links in the validation and test sets are hidden while the node features are unaltered. We compare the Bayesian approach with the GAE and VGAE [14], the Graphite-AE and Graphite-VAE [7], and the DGLFRM [23] models. The hyperparameters of these baseline algorithms are selected according to the corresponding papers. Other common baselines, e.g. spectral clustering [34], DeepWalk [28] and node2vec [6], are not included since it has been demonstrated that the baselines we include significantly outperform them. We incorporate the non-parametric graph inference technique in the existing auto-encoders to build a Bayesian version of these algorithms. The Area Under the ROC Curve (AUC) and the Average Precision (AP) score are used as performance metrics. Table 2 shows the mean AUC and AP, together with standard errors, based on 50 trials. Each trial corresponds to a random split of the graph and a random initialization of the learnable parameters. We conduct a Wilcoxon signed rank test to determine the statistical significance of the improvement compared to the corresponding base model. Results in bold font indicate settings where the test declares significance at the 5% level.

Algorithm Cora Citeseer Pubmed
AUC
GAE 91.5±0.9 89.4±1.5 96.2±0.2
BGAE 91.8±0.8 89.6±1.6 96.2±0.2
VGAE 91.8±0.9 90.7±1.0 94.5±0.7
BVGAE 92.2±0.8 91.2±1.0 94.4±0.7
Graphite-AE 92.0±0.9 90.8±1.1 96.0±0.4
BGraphite-AE 92.4±0.9 91.1±1.1 96.0±0.4
Graphite-VAE 92.3±0.8 90.9±1.1 95.2±0.4
BGraphite-VAE 92.7±0.8 91.4±1.1 95.2±0.4
DGLFRM 93.1±0.6 93.9±0.7 95.9±0.1
BDGLFRM 93.2±0.6 94.1±0.7 95.9±0.2
AP
GAE 92.6±0.9 90.0±1.7 96.3±0.3
BGAE 92.8±0.9 90.2±1.7 96.3±0.2
VGAE 92.9±0.7 92.0±1.0 94.7±0.6
BVGAE 93.3±0.7 92.5±1.0 94.6±0.6
Graphite-AE 92.8±0.9 91.6±1.1 96.0±0.4
BGraphite-AE 93.1±0.9 92.0±1.1 96.0±0.4
Graphite-VAE 93.3±0.7 92.1±1.0 95.3±0.4
BGraphite-VAE 93.7±0.7 92.6±1.0 95.3±0.4
DGLFRM 93.8±0.6 94.5±0.7 96.4±0.1
BDGLFRM 93.9±0.6 94.7±0.7 96.3±0.1
Table 2: Area Under the ROC Curve (AUC) and Average Precision (AP) score for link prediction (in %, mean ± standard error over 50 trials).

From the results in Table 2, we observe the proposed approach improves link prediction performance for the Cora and Citeseer datasets compared to the baseline auto-encoder models. The improvement is small but consistent over almost all of the random trials. No improvement is observed for Pubmed. To examine this further, we conducted an experiment where the ground-truth for the test set was provided to the autoencoders. The performance does not change from the reported values; this suggests that the models have reached accuracy limits for the Pubmed dataset.

4.3 Recommendation Systems

We investigate the performance of the proposed Bayesian method on four real-world, publicly available datasets: ML100K, Amazon-Books, Amazon-CDs and Yelp2018. For each dataset, we conduct pre-processing to ensure that each node in the dataset has sufficient interactions: we consider two threshold values $s_u$ and $s_i$, and filter out the users with fewer than $s_u$ interactions and the items with fewer than $s_i$ interactions. For each user, we split the existing interaction records into training, validation and test sets with a ratio of 70/10/20. We evaluate the model performance using Recall@k and NDCG@k, which measure the coverage of true items in the top-k recommendations and the quality of the recommendation ranking, respectively. The statistics of each dataset after the preprocessing step and the definitions of the evaluation metrics are included in the supplementary material.

Amazon-CDs R@10 R@20 N@10 N@20 Yelp2018 R@10 R@20 N@10 N@20
MGCCF 10.1% 16.1% 13.1% 16.9% MGCCF 7.5% 12.7% 13.0% 17.4%
BMGCCF 10.6% 17.0% 13.4 % 17.3% BMGCCF 7.6% 13.0% 13.2% 17.7%
NGCF 8.1% 13.5% 11.4% 13.8% NGCF 6.6% 11.3% 11.5% 15.3%
BNGCF 9.9% 16.2% 12.8% 16.6% BNGCF 6.7% 11.4% 11.5% 15.5%
Amazon-Books R@10 R@20 N@10 N@20 ML100K R@10 R@20 N@10 N@20
MGCCF 10.3 % 16.6% 15.0 % 19.4% MGCCF 18.3% 29.4% 25.6% 30.9%
BMGCCF 10.3% 16.4% 14.8% 19.3% BMGCCF 18.4% 29.5% 25.9% 31.4%
NGCF 8.7% 14.5% 13.6% 17.8% NGCF 17.7% 29.0% 25.3% 30.3%
BNGCF 10.2% 16.8% 15.2% 19.6% BNGCF 17.7% 28.9% 25.2% 30.1%
Table 3: Recall@10 (R@10), Recall@20 (R@20), NDCG@10 (N@10) and NDCG@20 (N@20) for the four datasets.

We apply our proposed Bayesian graph-based recommendation formulation to two recent graph-based recommendation models: the MGCCF [32] and the NGCF [40].

We first train the two algorithms with an early stopping patience of 50 epochs to obtain the embedding vectors for users and items. These are used to calculate the pairwise cosine distance matrix for our proposed graph optimizer. We refer to these original algorithms as "base models". We build our proposed models (BMGCCF and BNGCF) on top of the base models via the following procedure. We first apply edge dropping with a threshold to shrink each dataset's negative edge candidate set. We further train the base models with this optimized negative edge pool with an early stopping patience of 100 epochs. For a fair comparison, to obtain the baseline performance, we also conduct further training of the base models with the original negative edge pool and the same early stopping patience. We use grid search to determine the percentage of the inferred links with the highest edge weights to be removed from the negative pool. A suitable value is chosen for each dataset from {1, 2, 5, 10, 20}%.

We report the Recall@k and the NDCG@k of the proposed Bayesian models (BMGCCF and BNGCF) along with those of the base models MGCCF and NGCF for the four datasets in Table 3. We conduct a Wilcoxon signed rank test in each case to determine the significance of the improvement of the Bayesian model over the corresponding base model. Bold numbers indicate a statistically significant difference at the 5% level between the base model and the Bayesian version of the algorithm. The advantages of our proposed Bayesian framework can be observed for both base models and across both evaluation metrics. For the much denser ML100K dataset, the procedure is less effective (and in some cases ineffective). With many more edges in the observed graph, the graph-based recommender system algorithms already have considerable information. Although the inferred graph does remove many incorrect edges from the negative pool, this has only a minor impact on the learned embeddings.

Figure 3: Training curves for MGCCF vs. BMGCCF (Amazon-CDs).

The learning curves for training the original model and the Bayesian version of the model are shown in Figure 3 for the Amazon-CDs dataset. We observe that with our proposed solution, training converges much faster. The Bayesian training framework also helps to avoid overfitting in this case.

Conventional recommendation training procedures, especially in the implicit feedback setting, treat all of the unobserved user-item interactions as negative feedback (indicating a lack of interest). Our proposed approach aims to learn which of these unobserved interactions are most likely to be false negatives. We analyze the overlap between the edges that we remove from the negative candidate set and the edges in the validation and test sets. As shown in Table 4, our proposed Bayesian formulation removes a significant percentage of test and validation edges from the negative sample pool.

Am. CDs Am. Books Yelp2018 ML100K
BMGCCF 20.6% 17.9% 13.6% 12.3%
BNGCF 23.4% 30.0% 13.3% 62.1%
Table 4: Edge overlap of the inferred graph with the test set.

5 Conclusion

In this paper, we propose the use of non-parametric modelling and inference of graphs for various learning tasks. In the proposed model, a higher edge weight between two nodes is more likely if the nodes are close in terms of a distance metric. An appropriate distance metric can be chosen depending on the learning task which results in flexible, task-specific design of learning algorithms. The proposed model is adapted to a Bayesian learning framework which aims to account for graph uncertainty. Experimental results demonstrate that the model can learn useful graphs that improve performance significantly over baseline algorithms for node classification, link prediction, and recommendation.

References

  • [1] A. Bojchevski, O. Shchur, D. Zügner, and S. Günnemann (2018-Jul.) NetGAN: generating graphs via random walks. In Proc. Int. Conf. Machine Learning, pp. 609–618. Cited by: §2.
  • [2] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Proc. Adv. Neural Inf. Proc. Systems, pp. 3844–3852. Cited by: §2, §4.1.
  • [3] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst (2016) Learning laplacian matrix in smooth graph signal representations. IEEE Trans. Sig. Proc. 64 (23), pp. 6160–6173. Cited by: §2.
  • [4] D. Duvenaud, D. Maclaurin, et al. (2015) Convolutional networks on graphs for learning molecular fingerprints. In Proc. Adv. Neural Info. Proc. Systems, pp. 2224–2232. Cited by: §2.
  • [5] Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proc. Int. Conf. Machine Learning, pp. 1050–1059. Cited by: §3.2.
  • [6] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proc. ACM Int. Conf. Knowl. Disc. Data Mining, pp. 855–864. Cited by: §4.2.
  • [7] A. Grover, A. Zweig, and S. Ermon (2019-Jul.) Graphite: iterative generative modeling of graphs. In Proc. Int. Conf. Machine Learning, pp. 2434–2444. Cited by: §2, §3.3, §4.2.
  • [8] W. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Proc. Adv. Neural Inf. Proc. Systems, pp. 1024–1034. Cited by: §2.
  • [9] J. M. Hernández-Lobato and R. Adams (2015) Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proc. Int. Conf. Machine Learning, pp. 1861–1869. Cited by: §3.2.
  • [10] K. Järvelin and J. Kekäläinen (2000) IR evaluation methods for retrieving highly relevant documents. In Proc. Int. ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 41–48. Cited by: 2nd item.
  • [11] B. Jiang, Z. Zhang, J. Tang, and B. Luo (2019-04) Graph optimized convolutional networks. arXiv e-prints : arXiv 1904.11883. Cited by: §1, §2.
  • [12] V. Kalofolias and N. Perraudin (2019) Large scale graph learning from smooth signals. In Proc. Int. Conf. Learning Representations, Cited by: §2, §3.1.
  • [13] V. Kalofolias (2016) How to learn a graph from smooth signals. In Proc. Artificial Intell. and Statist., pp. 920–929. Cited by: §2, §3.1.
  • [14] T. Kipf and M. Welling (2016-Nov.) Variational graph auto-encoders. In Proc. Bayesian Deep Learning Workshop, Adv. Neural Info. Process. Systems, Cited by: §2, §3.2, §3.3, §4.2.
  • [15] T. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In Proc. Int. Conf. Learning Representations, Cited by: §2, §3.2, §4.1.
  • [16] C. Li, C. Chen, D. Carlson, and L. Carin (2016) Pre-conditioned stochastic gradient Langevin dynamics for deep neural networks. In Proc. AAAI Conf. Artificial Intell., pp. 1788–1794. Cited by: §3.2.
  • [17] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia (2018-Mar.) Learning deep generative models of graphs. In Proc. Int. Conf. Machine Learning, Cited by: §2.
  • [18] R. Liao, Y. Li, Y. Song, S. Wang, C. Nash, W. L. Hamilton, D. Duvenaud, R. Urtasun, and R. S. Zemel (2019) Efficient graph generation with graph recurrent attention networks. In Proc. Adv. Neural Info. Proc. Systems, pp. 4255–4265. Cited by: §2.
  • [19] J. Liu, A. Kumar, J. Ba, J. Kiros, and K. J. Swersky (2019) Graph normalizing flows. In Proc. Adv. Neural Info. Proc. Systems, pp. 13578–13588. Cited by: §2.
  • [20] J. Ma, W. Tang, J. Zhu, and Q. Mei (2019) A flexible generative framework for graph-based semi-supervised learning. In Proc. Adv. Neural Info. Proc. Systems, pp. 3276–3285. Cited by: §1, §2, §3.1, §4.1.
  • [21] D. J. C. MacKay (1996) Maximum entropy and Bayesian methods. pp. 43–59. Cited by: §3.1.
  • [22] Y. A. Malkov and D. A. Yashunin (2020) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Analysis and Machine Intell. 42 (4), pp. 824–836. Cited by: §2, §3.1.
  • [23] N. Mehta, L. C. Duke, and P. Rai (2019) Stochastic blockmodels meet graph neural networks. In Proc. Int. Conf. Machine Learning, pp. 4466–4474. Cited by: §2, §3.3, §4.2.
  • [24] F. Monti, M. Bronstein, and X. Bresson (2017) Geometric matrix completion with recurrent multi-graph neural networks. In Proc. Adv. Neural Info. Proc. Systems, Cited by: §2.
  • [25] G. Namata, B. London, L. Getoor, and B. Huang (2012) Query-driven active surveying for collective classification. In Proc. Workshop on Mining and Learning with Graphs, Int. Conf. Machine Learning, Cited by: §4.1, §6.1.
  • [26] R. M. Neal (1992) Bayesian learning via stochastic dynamics. In Proc. Adv. Neural Inf. Proc. Systems, pp. 475–482. Cited by: §3.2.
  • [27] S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, and C. Zhang (2018) Adversarially regularized graph autoencoder for graph embedding.. In Proc. Int. Joint Conf. Artificial Intell., pp. 2609–2615. Cited by: §2.
  • [28] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proc. ACM SIGKDD Int. Conf. Knowl. Dis. Data Mining, pp. 701–710. Cited by: §4.2.
  • [29] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme (2009) BPR: bayesian personalized ranking from implicit feedback. Proc. Conf. Uncertainty in Artificial Intell., pp. 452–461. Cited by: §3.4, §3.4.
  • [30] P. Sen, G. Namata, et al. (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93. Cited by: §4.1, §6.1.
  • [31] M. Simonovsky and N. Komodakis (2018) GraphVAE: towards generation of small graphs using variational autoencoders. In Proc. Int. Conf. Artificial Neural Networks, pp. 412–422. Cited by: §2.
  • [32] J. Sun, Y. Zhang, C. Ma, M. Coates, H. Guo, R. Tang, and X. He (2019) Multi-graph convolution collaborative filtering. In Proc. IEEE Int. Conf. Data Mining, pp. 1306–1311. Cited by: §2, §3.4, §4.3.
  • [33] S. Sun, C. Chen, and L. Carin (2017) Learning structured weight uncertainty in Bayesian neural networks. In Proc. Artificial Intell. and Statist., pp. 1283–1292. Cited by: §3.2.
  • [34] L. Tang and H. Liu (2011) Leveraging social media networks for classification. Data Mining and Knowledge Discovery, pp. 447–478. Cited by: §4.2.
  • [35] Y. Tian, L. Zhao, X. Peng, and D. Metaxas (2019-Dec.) Rethinking kernel methods for node representation learning on graphs. In Proc. Adv. Neural Info. Proc. Systems, pp. 2088–2099. Cited by: §2.
  • [36] L. C. Tiao, P. Elinas, H. Nguyen, and E. V. Bonilla (2019) Variational spectral graph convolutional networks. In Proc. Graph Representation Learning Workshop, Adv. Neural Info. Proc. Systems, Cited by: §1, §2, §3.1.
  • [37] R. van den Berg, T. N. Kipf, and M. Welling (2018) Graph convolutional matrix completion. In Proc. ACM Int. Conf. Knowl. Disc. Data Mining, Cited by: §2.
  • [38] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In Proc. Int. Conf. Learning Representations, Cited by: §2, §4.1.
  • [39] H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang, F. Zhang, X. Xie, and M. Guo (2017) GraphGAN: graph representation learning with generative adversarial nets. In Proc. AAAI Conf. Artificial Intell., pp. 2508–2515. Cited by: §2.
  • [40] X. Wang, X. He, M. Wang, F. Feng, and T. Chua (2019) Neural graph collaborative filtering. In Proc. Int. ACM SIGIR Conf. Research and Development Info. Retrieval, pp. 165–174. Cited by: §2, §3.4, §4.3.
  • [41] W. O. K. A. S. Wijesinghe and Q. Wang (2019) DFNets: spectral CNNs for graphs with feedback-looped filters. In Proc. Adv. Neural Info. Proc. Systems, pp. 6009–6020. Cited by: §2, §4.1.
  • [42] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. In Proc. ACM Int. Conf. Knowl. Disc. Data Mining, pp. 974–983. Cited by: §2, §3.4.
  • [43] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec (2018) GraphRNN: generating realistic graphs with deep auto-regressive models. In Proc. Int. Conf. Machine Learning, pp. 5708–5717. Cited by: §2.
  • [44] Y. Zhang, S. Pal, M. Coates, and D. Üstebay (2019) Bayesian graph convolutional neural networks for semi-supervised classification. In Proc. AAAI Conf. Artificial Intell., pp. 5829–5836. Cited by: §1, §2, §3.1, §3.2, §3.2, §4.1, §4.1, §6.3.2, §6.3.2.
  • [45] L. Zheng, C. Lu, F. Jiang, J. Zhang, and P. S. Yu (2018) Spectral collaborative filtering. In Proc. ACM Conf. Recommender Systems, Cited by: §2.
  • [46] C. Zhuang and Q. Ma (2018) Dual graph convolutional networks for graph-based semi-supervised classification. In Proc. World Wide Web Conf., pp. 499–508. Cited by: §2.

6 Supplementary Material

6.1 Description of the Datasets

For the semi-supervised node classification and link prediction tasks, we conduct experiments on benchmark citation network datasets (Cora [30], Citeseer [30], and Pubmed [25]). In these datasets each node represents a research article and the undirected edges are formed according to citation links. Each node has a sparse bag-of-words feature vector derived from the keywords of the document. The node labels indicate the primary research topics addressed in the articles. The statistics of the citation datasets are summarized in Table 5.

Dataset # Classes # Features # Nodes # Edges Edge Density
Cora 7 1,433 2,485 5,069 0.04%
Citeseer 6 3,703 2,110 3,668 0.04%
Pubmed 3 500 19,717 44,324 0.01%
Table 5: Statistics of evaluation datasets for node classification and generative graph models.

For the experiments on the recommendation systems, we use four real-world, publicly available datasets: ML100K, Amazon-Books, Amazon-CDs and Yelp2018. For each dataset, we apply a pre-processing step to ensure that each user/item node in the dataset has sufficient interactions. We consider two threshold values $s_u$ and $s_i$, and filter out the users with fewer than $s_u$ interactions and the items with fewer than $s_i$ interactions. The numbers of users and items after the preprocessing step are denoted by $N_u$ and $N_i$ respectively. The statistics of the preprocessed datasets are provided in Table 6.

Dataset ($s_u$, $s_i$) ($N_u$, $N_i$) # Edges Edge Density
ML100K (10, 10) (897, 823) 36330 0.0716
CDs (50, 20) (2217, 2034) 68293 0.0151
Books (250, 22) (1608, 2270) 69782 0.0191
Yelp2018 (22, 35) (2799, 1673) 119514 0.0255
Table 6: Statistics of evaluation datasets for recommender systems.
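A sketch of the interaction-count filtering described above, assuming the interactions are held in a pandas DataFrame with 'user' and 'item' columns (our naming); it iterates because dropping an item can push a user below the threshold and vice versa.

```python
import pandas as pd

def filter_interactions(df: pd.DataFrame, min_user: int, min_item: int) -> pd.DataFrame:
    """Repeatedly drop users with fewer than `min_user` interactions and items
    with fewer than `min_item` interactions until both conditions hold."""
    while True:
        user_counts = df["user"].value_counts()
        item_counts = df["item"].value_counts()
        keep = (df["user"].map(user_counts) >= min_user) & \
               (df["item"].map(item_counts) >= min_item)
        if keep.all():
            return df
        df = df[keep]
```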

6.2 Graph Inference in Non-Parametric Model

For completeness, we provide a brief summary of the graph inference task considered in the main paper. We solve for the maximizer of the posterior distribution of the 'true' graph $\mathcal{G}$ as follows:

$\hat{\mathcal{G}} = \operatorname*{arg\,max}_{\mathcal{G}} \; \log p(\mathbf{x}, \mathcal{G}_{obs} \mid \mathcal{G}) + \log p(\mathcal{G})$    (16)

Here, $\mathcal{G}_{obs}$ denotes the observed graph and $\mathbf{x}$ is additional information (e.g. features and labels). In our non-parametric model, this is equivalent to solving the following optimization problem for the adjacency matrix $\hat{A}_{\mathcal{G}}$ of $\hat{\mathcal{G}}$:

$\hat{A}_{\mathcal{G}} = \operatorname*{arg\,min}_{A \geq 0,\, A = A^T} \; \|A \circ Z\|_{1,1} - \alpha\, \mathbf{1}^T \log(A \mathbf{1}) + \beta\, \|A\|_F^2$    (17)

We recall that $Z$ is a symmetric pairwise distance matrix which measures the dissimilarity between the nodes, and that $\alpha$ and $\beta$ are the hyperparameters of the prior distribution of the random graph $\mathcal{G}$.

6.3 Bayesian Graph Neural Network

This section summarizes the novel Bayesian Graph Neural Network algorithm and provides some additional results for the node classification task.

6.3.1 BGCN Algorithm

1:  Input: $\mathcal{G}_{obs}$, $\mathbf{X}$, $\mathbf{Y}_L$
2:  Output: $p(\mathbf{Y} \mid \mathbf{Y}_L, \mathbf{X}, \mathcal{G}_{obs})$
3:  Train a node embedding algorithm using $\mathcal{G}_{obs}$ and $\mathbf{X}$ to obtain $\mathbf{z}_i$ for $1 \leq i \leq N$. Compute $Z^{emb}$ using (19).
4:  Train a base classifier using $\mathcal{G}_{obs}$, $\mathbf{X}$ and $\mathbf{Y}_L$ to obtain $\hat{c}_i$ for $1 \leq i \leq N$. Compute $Z^{lab}$ using (20).
5:  Compute $Z$ using (18).
6:  Solve the optimization problem in (17) to obtain $\hat{A}_{\mathcal{G}}$ (equivalently, $\hat{\mathcal{G}}$).
7:  for $s = 1$ to $S$ do
8:     Sample weights $W_s$ using MC dropout by training a GCN over the graph $\hat{\mathcal{G}}$.
9:  end for
10:  Approximate $p(\mathbf{Y} \mid \mathbf{Y}_L, \mathbf{X}, \mathcal{G}_{obs})$ using (21).
Algorithm 1 Bayesian GCN using non-parametric graph learning

We recall the notation from the main paper: $\mathbf{X}$ is the feature matrix of the nodes in the observed graph $\mathcal{G}_{obs}$ and $\mathbf{Y}_L$ is the set of known training labels. $W_s$ denotes the $s$-th sample of GCN weights trained on the inferred graph $\hat{\mathcal{G}}$ from the non-parametric model. $\mathbf{z}_i$ represents any suitable embedding of node $i$ and $\hat{c}_i$ is the label of the $i$-th node obtained from a baseline node classification algorithm. $\mathcal{N}(i)$ is the neighbourhood of the $i$-th node in $\mathcal{G}_{obs}$. The distance matrix $Z$ is defined as follows:

$Z = Z^{emb} + \delta\, Z^{lab}$    (18)
$Z^{emb}_{i,j} = \|\mathbf{z}_i - \mathbf{z}_j\|_2^2$    (19)
$Z^{lab}_{i,j} = \frac{1}{|\mathcal{N}(i)|\,|\mathcal{N}(j)|} \sum_{l \in \mathcal{N}(i)} \sum_{m \in \mathcal{N}(j)} \mathbb{1}(\hat{c}_l \neq \hat{c}_m)$    (20)

Here, $\delta$ is a hyperparameter which controls the importance of $Z^{lab}$ relative to $Z^{emb}$. We need to compute the Monte Carlo approximation of the posterior distribution of the labels, which is given as:

$p(\mathbf{Y} \mid \mathbf{Y}_L, \mathbf{X}, \mathcal{G}_{obs}) \approx \frac{1}{S} \sum_{s=1}^{S} p(\mathbf{Y} \mid W_s, \hat{\mathcal{G}}, \mathbf{X})$    (21)

Pseudocode for the proposed BGCN algorithm is provided in Algorithm 1.

6.3.2 Results on fixed training-test split from [44]

Fixed split 5 labels 10 labels 20 labels
ChebyNet 67.9±3.1 72.7±2.4 80.4±0.7
GCN 74.4±0.8 74.9±0.7 81.6±0.5
GAT 73.5±2.2 74.5±1.3 81.6±0.9
BGCN 75.3±0.8 76.6±0.8 81.2±0.8
SBM-GCN 59.3±1.3 77.3±1.2 82.2±0.8
BGCN (ours) 76.0±1.1 76.8±0.9 80.3±0.6
Table 7: Classification accuracy (in %) for the Cora dataset.
Fixed split 5 labels 10 labels 20 labels
ChebyNet 53.0±1.9 67.7±1.2 70.2±0.9
GCN 55.4±1.1 65.8±1.1 70.8±0.7
GAT 55.4±2.6 66.1±1.7 70.8±1.0
BGCN 57.3±0.8 70.8±0.6 72.2±0.6
SBM-GCN 20.8±2.0 66.3±0.6 71.7±0.1
BGCN (ours) 59.0±1.5 71.7±0.8 72.6±0.6
Table 8: Classification accuracy (in %) for the Citeseer dataset.
Fixed split 5 labels 10 labels 20 labels
ChebyNet 68.1±2.5 69.4±1.6 76.0±1.2
GCN 69.7±0.5 72.8±0.5 78.9±0.3
GAT 70.0±0.6 71.6±0.9 76.9±0.5
BGCN 70.9±0.8 72.3±0.8 76.6±0.7
SBM-GCN 64.8±0.8 71.7±0.7 80.6±0.4
BGCN (ours) 73.3±0.7 73.9±0.9 79.2±0.5
Table 9: Classification accuracy (in %) for the Pubmed dataset.

In the main paper, we consider random partitioning of the nodes into training and test sets and report the average accuracies across different splits. We conduct another experiment where the fixed training-test split of [44] is used for 50 random initializations of the GCN weights. The classification results for this setting are provided in Tables 7, 8 and 9.

6.4 Bayesian VGAE Algorithm for Link Prediction

We recall that the function $\mathcal{M}$ returns a graph $\mathcal{M}(\mathcal{G}_{obs}, \hat{\mathcal{G}})$ such that the unobserved entries of the adjacency matrix of $\mathcal{G}_{obs}$ are replaced by the corresponding entries of $\hat{A}_{\mathcal{G}}$. The distance matrix for the non-parametric graph inference is defined as:

$Z_{i,j} = \big\| \mathbb{E}_{q}[\mathbf{z}_i] - \mathbb{E}_{q}[\mathbf{z}_j] \big\|_2^2$    (22)

where $q(\mathbf{Z}_e \mid \mathbf{X}, \mathcal{G}_{obs})$ is the approximate posterior distribution of the unsupervised node representations from a Variational Graph Auto-Encoder (VGAE) model. In the proposed Bayesian VGAE, the inference distribution is modelled as follows:

$q(\mathbf{Z}_e \mid \mathbf{X}, \mathcal{G}_{obs}) = q\big(\mathbf{Z}_e \mid \mathbf{X}, \mathcal{M}(\mathcal{G}_{obs}, \hat{\mathcal{G}})\big)$    (23)

Here $\hat{\mathcal{G}}$ is the inferred graph from the non-parametric model. The resulting algorithm is summarized in Algorithm 2.

1:  Input: $\mathcal{G}_{obs}$, $\mathbf{X}$
2:  Output: $q(\mathbf{Z}_e \mid \mathbf{X}, \mathcal{G}_{obs})$
3:  Train a node embedding algorithm using $\mathcal{G}_{obs}$ and $\mathbf{X}$ to obtain $\mathbb{E}_{q}[\mathbf{z}_i]$ for $1 \leq i \leq N$.
4:  Compute $Z$ using eq. (22).
5:  Solve the optimization problem in (17) to obtain $\hat{A}_{\mathcal{G}}$ (equivalently, $\hat{\mathcal{G}}$).
6:  Build the new graph $\mathcal{M}(\mathcal{G}_{obs}, \hat{\mathcal{G}})$ and train the auto-encoder on it to obtain $q(\mathbf{Z}_e \mid \mathbf{X}, \mathcal{M}(\mathcal{G}_{obs}, \hat{\mathcal{G}}))$ (eq. (23)).
Algorithm 2 Bayesian VGAE

6.5 Recommendation Systems

6.5.1 Algorithm

We recall that $\mathcal{D}_{train}$ is the set of training rankings derived from the observed graph $\mathcal{G}_{obs}$. The distance between the $u$-th user and the $i$-th item is defined as:

$Z_{u,i} = 1 - \dfrac{\mathbf{e}_u^T \mathbf{e}_i}{\|\mathbf{e}_u\|_2\, \|\mathbf{e}_i\|_2}$    (24)

Here, $\mathbf{e}_u$ and $\mathbf{e}_i$ are the representations from the base node embedding algorithm for the $u$-th user and the $i$-th item, respectively. We define $\hat{\mathcal{G}}' = \mathcal{M}'(\mathcal{G}_{obs}, \hat{\mathcal{G}})$ by removing a fraction of the links with the highest edge weights in $\hat{A}_{\mathcal{G}}$ from the negative pool of interactions defined according to $\mathcal{G}_{obs}$. In the Bayesian version of the recommendation system, we need to compute:

$p(\succ_{test} \mid \mathcal{G}_{obs}) \approx p(\succ_{test} \mid \hat{\Theta})$    (25)

We perform non-parametric graph inference to obtain $\hat{\mathcal{G}}$, then compute $\hat{\mathcal{G}}'$ and minimize the BPR loss to form the estimate of the weights:

$\hat{\Theta} = \operatorname*{arg\,max}_{\Theta} \sum_{(u,i,j) \in \mathcal{D}'_{train}} \log \sigma(\mathbf{e}_u^T \mathbf{e}_i - \mathbf{e}_u^T \mathbf{e}_j) + \log p(\Theta)$    (26)

The resulting algorithm is summarized in Algorithm 3.

1:  Input: $\mathcal{G}_{obs}$, $\mathcal{D}_{train}$
2:  Output: $\succ_{test}$
3:  Train a base node embedding algorithm using $\mathcal{G}_{obs}$ and $\mathcal{D}_{train}$ to obtain embeddings $\mathbf{e}_u$ and $\mathbf{e}_i$ for all users and items.
4:  Compute $Z$ as the pairwise cosine distance between the embeddings (eq. (24)).
5:  Solve the optimization problem in (17) using $Z$ to obtain $\hat{\mathcal{G}}$.
6:  Form the modified graph $\hat{\mathcal{G}}' = \mathcal{M}'(\mathcal{G}_{obs}, \hat{\mathcal{G}})$.
7:  Minimize the BPR loss with the negative pool defined according to $\hat{\mathcal{G}}'$ to obtain $\hat{\Theta}$ (eq. (26)).
8:  Obtain the test set rankings using the embeddings in $\hat{\Theta}$ to form $\succ_{test}$ (eq. (25)).
Algorithm 3 BPR loss minimization with non-parametric graph learning

6.5.2 Definitions of the performance evaluation metrics

  • Recall@k denotes the proportion of the true (preferred) items that appear in the top-k recommendation. For a user $u$, the algorithm recommends an ordered set of top-k items $R_k(u)$. There is a set of true preferred items $T(u)$ for user $u$ in the test set, and the number of true positives is $|R_k(u) \cap T(u)|$. The recall for user $u$ is defined as follows: Recall@k$(u) = |R_k(u) \cap T(u)| \,/\, |T(u)|$.

  • NDCG@k: The Normalized Discounted Cumulative Gain (NDCG) [10] computes a score for $R_k(u)$ which emphasizes higher-ranked true positives. It accounts for a relevancy score $rel_i$ at each rank position $i$. We consider binary responses, so we use a binary relevance score: $rel_i = 1$ if the item at rank $i$ is in $T(u)$ and 0 otherwise. A reference implementation of both metrics is sketched after this list.
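A minimal reference implementation of the two metrics as defined above, for a single user; `ranked_items` is the recommendation list ordered by score and `true_items` is the set of held-out positive items (our variable names).

```python
import numpy as np

def recall_at_k(ranked_items, true_items, k):
    """Fraction of the user's true items that appear in the top-k list."""
    hits = len(set(ranked_items[:k]) & set(true_items))
    return hits / max(len(true_items), 1)

def ndcg_at_k(ranked_items, true_items, k):
    """NDCG@k with binary relevance: rel = 1 if the ranked item is a true item."""
    true_set = set(true_items)
    rel = np.array([1.0 if it in true_set else 0.0 for it in ranked_items[:k]])
    dcg = (rel / np.log2(np.arange(2, rel.size + 2))).sum()
    ideal = np.ones(min(len(true_items), k))
    idcg = (ideal / np.log2(np.arange(2, ideal.size + 2))).sum()
    return dcg / idcg if idcg > 0 else 0.0
```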