Graph Neural Processes: Towards Bayesian Graph Neural Networks

02/26/2019 · by Andrew N. Carr, et al.

We introduce Graph Neural Processes (GNP), inspired by recent work in conditional and latent neural processes. A Graph Neural Process is defined as a Conditional Neural Process that operates on arbitrary graph data. It takes features of sparsely observed context points as input and outputs a distribution over target points. We demonstrate Graph Neural Processes on edge imputation and discuss benefits and drawbacks of the method for other application areas. One major benefit of GNPs is the ability to quantify uncertainty in deep learning on graph structures. An additional benefit of this method is the ability to extend graph neural networks to inputs of dynamically sized graphs.


1 Introduction

In this work, we consider the problem of imputing the value of an edge on a graph. This problem arises when an edge is known to exist but, due to a noisy signal or a poor data acquisition process, its value is unknown. We solve this problem using a proposed method called Graph Neural Processes.

In recent years, deep learning techniques have been applied, with much success, to a variety of problems. Many of these neural architectures (e.g., CNNs), however, rely on the underlying assumption that the data is Euclidean. Since graphs do not lie on regular lattices, many of the concepts and underlying operations typically used in deep learning need to be extended to non-Euclidean data. Deep learning on graph-structured data has received much attention over the past decade [Gori et al.2005] [Scarselli et al.2005] [Li et al.2016] [Zhou et al.2019] [Battaglia et al.2018] and has shown significant promise in many applied fields [Battaglia et al.2018]. These ideas were recently formalized as geometric deep learning [Bronstein et al.2017], which categorizes and expounds on a variety of methods and techniques. These techniques are mathematically designed to deal with graph-structured data and extend deep learning into new research areas.

Similarly, but somewhat orthogonally, progress has been made in applying Bayesian methods to deep learning [Auld et al.2007]. In Bayesian neural networks, uncertainty estimates are used at the weight level or at the output of the network. These extensions to typical deep learning give insight into what the model is learning and where it may encounter failure modes. In this work, we use some of this progress to impute the value distribution on an edge in graph-structured data.

Figure 1: Graph with imputed edge distributions

In this work we propose a novel architecture and training mechanism, which we call Graph Neural Processes (GNP). This architecture is based on the ideas first formulated in [Garnelo et al.2018], in that it synthesizes global information into a fixed-length representation that is then used for probability estimation. Our contribution is to extend those ideas to graph-structured data and show that the resulting methods perform favorably.

Specifically, we use features typically used in Graph Neural Networks as a replacement for the convolution operation from traditional deep learning. These features, when used in conjunction with the traditional CNP architecture, offer local representations of the graph around edges and assist in the learning of high-level abstractions across classes of graphs. Graph Neural Processes learn a high-level representation across a family of graphs, in part by utilizing these instructive features.

2 Background

2.1 Graph Structured Data

We define a graph $G = (u, V, E)$, where $u$ is some global attribute (typically a real number; our method does not utilize this global attribute), $V = \{v_i\}$ is the set of nodes, with $v_i$ being the node's attribute, and $E = \{(e_k, v_{k_1}, v_{k_2})\}$ is the set of edges, where $e_k$ is the attribute on the edge and $v_{k_1}, v_{k_2}$ are the two nodes connected by the edge. In this work, we focus on undirected graphs, but the principles could be extended to many other types of graphs.
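To make this notation concrete, a minimal Python sketch of such a graph might look like the following; the field names are our own illustrative choice, not part of any particular library.

```python
from typing import List, NamedTuple, Tuple

class Graph(NamedTuple):
    u: float                              # global attribute (not used by our method)
    node_attrs: List[float]               # v_i: the attribute of node i
    edges: List[Tuple[int, int, int]]     # (e_k, v_1, v_2): edge attribute and its two endpoints

# A toy undirected triangle with categorical edge labels 0/1
g = Graph(
    u=0.0,
    node_attrs=[1.0, 2.0, 3.0],
    edges=[(0, 0, 1), (1, 1, 2), (0, 0, 2)],
)
```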

2.2 Conditional Neural Processes

First introduced in [Garnelo et al.2018], Conditional Neural Processes (CNPs) are loosely based on traditional Gaussian Processes. A CNP maps an input $x_i$ to an output $y_i$. It does this by defining a collection of conditional distributions that can be realized by conditioning on an arbitrary number of context points $x_C$ and their associated outputs $y_C$. An arbitrary number of target points $x_T$, with their outputs $y_T$, can then be modeled using the conditional distribution. This modeling is invariant to the ordering of context points and to the ordering of targets. This invariance lends itself well to graph-structured data and allows arbitrary sampling of edges for learning and imputation. It is important to note that, while the model is defined for arbitrary context and target sets, it is common practice (which we follow) to take the context points to be a subset of the target points.

The context point data is aggregated using a commutative operation $\oplus$ (work has been done [Kim et al.2019] to explore other aggregation operations, such as attention) that takes elements in some $\mathbb{R}^d$ and maps them into a single element in the same space. In the literature, the result $r$ is referred to as the context vector. We see that $r$ summarizes the information present in the observed context points. Formally, the CNP is learning the following conditional distribution:

$p(y_T \mid x_T, x_C, y_C)$    (1)

In practice, this is done by first passing the context points through a neural network $h_\theta$ to obtain a fixed-length embedding of each individual context point. These context point representation vectors are aggregated with $\oplus$ to form $r$. The target points are then decoded, conditioned on $r$, to obtain the parameters $\phi$ of the desired output distribution that models the target outputs $y_T$.

More formally, this process can be defined as follows.

$r_i = h_\theta(x_i, y_i), \quad i \in C$    (2)
$r = r_1 \oplus r_2 \oplus \dots \oplus r_{|C|}$    (3)
$\phi_j = g_\theta(x_j, r), \quad j \in T$    (4)
Figure 2: Conditional Neural Process Architecture
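As a rough sketch of equations (2)-(4), the following NumPy code shows how context points are embedded, aggregated with a commutative mean, and used to condition predictions at target points. The random linear maps stand in for the encoder $h_\theta$ and decoder $g_\theta$; all sizes are arbitrary placeholders of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_r, d_out = 3, 8, 4                   # feature, representation, and output sizes (arbitrary)
W_h = rng.normal(size=(d_in + 1, d_r))       # stand-in for the encoder h_theta
W_g = rng.normal(size=(d_r + d_in, d_out))   # stand-in for the decoder g_theta

def cnp_forward(x_context, y_context, x_target):
    # Eq. (2): embed each (x_i, y_i) context pair individually
    r_i = np.tanh(np.concatenate([x_context, y_context[:, None]], axis=1) @ W_h)
    # Eq. (3): commutative aggregation (mean), so the order of context points does not matter
    r = r_i.mean(axis=0)
    # Eq. (4): decode each target conditioned on r into distribution parameters phi_j
    phi = np.concatenate([np.tile(r, (len(x_target), 1)), x_target], axis=1) @ W_g
    return phi

x_c, y_c = rng.normal(size=(5, d_in)), rng.normal(size=5)
phi = cnp_forward(x_c, y_c, rng.normal(size=(7, d_in)))
print(phi.shape)  # (7, 4): one parameter vector per target point
```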

Traditionally, maximum likelihood is used in cases where the output distribution is continuous. In the examples explored in this work, we are dealing with categorical data, and so we use alternative training schemes, such as cross entropy, that are better suited to such outputs.

2.3 Graph Neural Networks

Of all the inductive biases introduced by standard deep learning architectures, the most common is the assumption that we are working with Euclidean data. This fact is exploited by convolutional neural networks (CNNs), where spatially local features are learned for downstream tasks. If, however, our data is non-Euclidean (e.g., graphs, manifolds), then many of the operations used in standard deep learning (e.g., convolutions) no longer produce the desired results. This is because the intrinsic measures and mathematical structure on these surfaces violate assumptions made by traditional deep learning operations. There has been much work done recently on Graph Neural Networks (GNNs) that operate on graph-structured inputs [Zhou et al.2019]. There are two main building blocks of geometric methods in GNNs: spectral [Bronstein et al.2017] and spatial [Duvenaud et al.2015] methods. These methods are unified into Message Passing Neural Networks and Non-local Neural Networks; a more thorough discussion of these topics can be found in [Zhou et al.2019]. Typically, when using these GNN methods, the input graphs need to have the same number of nodes/edges, because of the technicalities involved in defining a spectral convolution. In the case of Graph Neural Processes, however, local graph features are used, which allows one to learn conditional distributions over arbitrarily sized graph-structured data. One important concept utilized in spectral GNN methods is the graph Laplacian. Typically, the graph Laplacian is defined as the adjacency matrix subtracted from the degree matrix, $L = D - A$. This formulation, common in multi-agent systems, hampers the flow of information because it is potentially non-symmetric and unnormalized. As such, the normalized symmetric graph Laplacian is given as follows:

$L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}$    (5)

This object can be thought of as the difference between the average value of a function around a point and the actual value of the function at the point. This, therefore, encodes local structural information about the graph itself.

In spectral GNN methods, the eigenvalues and eigenvectors of the graph Laplacian are used to define convolution operations on graphs. In this work, however, we use the local structural information encoded in the Laplacian as an input feature for the context points we wish to encode.
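A small NumPy sketch of equation (5) and the spectral decomposition it feeds into; the 4-node path graph is just an illustrative input.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized graph Laplacian, L_sym = I - D^{-1/2} A D^{-1/2} (eq. 5)."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Adjacency matrix of a 4-node path graph, used only as an example
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = normalized_laplacian(A)
eigenvalues, Phi = np.linalg.eigh(L)   # columns of Phi are the eigenvectors of L
```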

3 Related Work

3.1 Edge Imputation

In many applications, the existence of an edge is known, but the value of the edge is unknown (e.g., traffic prediction, social networks). Traditional edge imputation involves generating a point estimate for the value on an edge, which can be done through mean filling, regression, or classification techniques [Huisman2009]. These traditional methods, especially mean filling, can fail to maintain the variance and other significant properties of the edge values. [Huisman2009] shows that "bias in variances and covariances can be greatly reduced by using a conditional distribution and replacing missing values with draws from this distribution." This fact, coupled with the neural nature of the conditional estimation, supports the hypothesis that Graph Neural Processes preserve important properties of edge values and are effective in the imputation process.
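A toy illustration of the point made in [Huisman2009], on synthetic edge values of our own making: mean filling collapses the variance of the imputed values, while drawing from a (here, empirical) conditional distribution roughly preserves it.

```python
import numpy as np

rng = np.random.default_rng(0)
observed = rng.gamma(shape=2.0, scale=1.5, size=1000)   # synthetic "known" edge values

mean_filled = np.full(500, observed.mean())             # point-estimate (mean) imputation
sampled = rng.choice(observed, size=500)                # draws from the empirical distribution

print(observed.var(), mean_filled.var(), sampled.var())
# mean filling collapses the variance to zero; sampling roughly preserves it
```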

3.2 Bayesian Deep Learning

In Bayesian neural networks, the goal is often to learn a function $y = f(x)$ given some training inputs $X = \{x_1, \dots, x_N\}$ and corresponding outputs $Y = \{y_1, \dots, y_N\}$. This function is often approximated using a fixed neural network that provides a likely correlational explanation for the relationship between $x$ and $y$. There has been good work done in this area [Zhang et al.2018], and there are two main types of Bayesian deep learning (this is not an exhaustive list of all methods, but a broad overview of two). In the first, instead of using point estimates for the weights $W$ of the neural network hidden layers, a distribution over values is used. In other words, the weights are modeled as a random variable with an imposed prior distribution, which encodes a priori uncertainty information about the neural transformation. In the second, since the weight values are not deterministic, the output of the neural network can also be modeled as a random variable: a generative model is learned based on the structure of the neural network and the loss function being used. Predictions from these networks can be obtained by integrating with respect to the posterior distribution of $W$:

$p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, W)\, p(W \mid X, Y)\, dW$    (6)

This integral is often intractable in practice, and a number of techniques have been proposed in the literature to overcome this problem [Zhang et al.2018]. Intuitively, a Bayesian neural network encodes information about the distribution over output values given certain inputs. This is a very valuable property of Bayesian deep learning that GNPs help capture. In the case of Graph Neural Processes, we model the output of the process as a random variable and learn a conditional distribution over that variable. This fits into the second class of Bayesian neural networks, since the weights are not modeled as random variables in this work.
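A common way around the intractability of equation (6) is a Monte Carlo average over weights drawn from (an approximation to) the posterior. The sketch below is schematic: `sample_weights` and `likelihood` are hypothetical callables standing in for a concrete posterior sampler and a concrete model.

```python
import numpy as np

def predictive_distribution(x_star, sample_weights, likelihood, n_samples=100):
    """Monte Carlo estimate of eq. (6): average p(y* | x*, W) over posterior samples of W."""
    return np.mean([likelihood(x_star, sample_weights()) for _ in range(n_samples)], axis=0)
```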

4 Model and Training

While the encoder $h_\theta$ and decoder $g_\theta$ could be implemented arbitrarily, we use fully connected layers that operate on informative features from the graph. In GNPs, we use local spectral features derived from the global spectral information of the graph. Typical graph networks that use graph convolutions require a fixed-size Laplacian to encode global spectral information; for GNPs we instead use an arbitrary, fixed-size neighborhood of the Laplacian around each edge.

To be precise, with the Laplacian as defined in equation (5), one can compute the spectrum of $L$ and the corresponding eigenvector matrix $\Phi$. In $\Phi$, each column is an eigenvector of the graph Laplacian. To get an arbitrary, fixed-size neighborhood around each node/edge, we define the restriction of $\Phi$ to an edge $(v_1, v_2)$ as

$\Phi\big|_{(v_1, v_2)} = \Phi[v_1 - k : v_1 + k,\; v_2 - k : v_2 + k]$    (7)

where $k$ is some arbitrary fixed constant based on the input space dimension. We call this restriction the local structural eigenfeatures of an edge. These eigenfeatures are often used in conjunction with other, more standard, graph-based features. For example, we found that the node values and node degrees for each node attached to an edge serve as informative structural features for GNPs. In other words, to describe each edge we use the local structural eigenfeatures and a 4-tuple $(v_1, v_2, d(v_1), d(v_2))$, where $d(\cdot)$ is a function that returns the degree of a node.
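A sketch of the per-edge feature construction described above, reusing the `Phi` computed earlier. For simplicity this takes the first $k$ eigenvector components at each endpoint rather than the exact restriction in equation (7); the slicing choice here is ours, purely for illustration.

```python
import numpy as np

def edge_features(Phi, node_attrs, degrees, v1, v2, k=3):
    """Local structural eigenfeatures plus the (v_1, v_2, d(v_1), d(v_2)) tuple for one edge."""
    # Simplified stand-in for the restriction in eq. (7): first k eigenvector
    # components at each endpoint of the edge.
    eigenfeatures = np.concatenate([Phi[v1, :k], Phi[v2, :k]])
    node_tuple = np.array([node_attrs[v1], node_attrs[v2], degrees[v1], degrees[v2]])
    return np.concatenate([eigenfeatures, node_tuple])
```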

Formally, we define the encoder $h_\theta$ to be a 4-layer fully connected neural network with ReLU non-linearities. It takes the GNP features from each context point as input and outputs a fixed-length (e.g., 256) vector. This vector $r$ is the aggregation of information across the arbitrarily sampled context points. The decoder $g_\theta$ is also a 4-layer fully connected neural network with ReLU non-linearities. The decoder takes as input a concatenation of the vector $r$ from the encoder and a list of features for all the edges, sans attribute. Then, for each edge, the concatenation of $r$ and the feature vector is passed through the layers of the decoder until an output distribution is reached, where the size of the output distribution is the number of unique values assigned to an edge.
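A minimal PyTorch sketch of this encoder/decoder pair, under our own assumptions: a hidden width of 256 (the fixed-length vector mentioned above), mean aggregation, and the raw edge label appended to the context features as a single float (a simplification of ours).

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out, width=256, depth=4):
    """A depth-layer fully connected network with ReLU non-linearities."""
    layers, d = [], d_in
    for _ in range(depth - 1):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, d_out))
    return nn.Sequential(*layers)

class GNPModel(nn.Module):
    def __init__(self, d_feat, n_classes):
        super().__init__()
        self.encoder = mlp(d_feat + 1, 256)          # context edge features + edge label
        self.decoder = mlp(256 + d_feat, n_classes)  # aggregated r + target edge features

    def forward(self, context_feats, context_labels, target_feats):
        r_i = self.encoder(torch.cat([context_feats, context_labels.unsqueeze(1)], dim=1))
        r = r_i.mean(dim=0, keepdim=True)            # commutative aggregation over context points
        r = r.expand(target_feats.size(0), -1)
        return self.decoder(torch.cat([r, target_feats], dim=1))  # logits over edge labels
```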

Additionally, we obtain the context points by first defining a lower bound $l$ and an upper bound $u$ for a uniform probability distribution. The number of context points is then determined by a fraction $p \sim U(l, u)$ of the graph's edges, with $0 \le l < u \le 1$. The context points all come from a single graph, and training is done over a family of graphs. The GNP is designed to learn a representation over a family of graphs and impute edge values on a new member of that family.

The loss function is problem specific. In the CNP work, as mentioned above, maximum likelihood is used to encourage the output distribution to match the true target values. Alternatively, one could minimize the Kullback–Leibler divergence, the Jensen–Shannon divergence, or the Earth Mover's distance between the output distribution and the true distribution, if that distribution is known. In this work, since the values found on the edges are categorical, we use the multiclass cross entropy:

$\hat{y}_e = g_\theta(x_e, r)$    (8)
$\mathcal{L}(\theta) = -\sum_{e \in E} \sum_{c=1}^{C} \mathbb{1}[y_e = c] \log \hat{y}_{e,c}$    (9)
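With categorical edge labels, equations (8)-(9) reduce to the standard multiclass cross entropy over the decoder's logits; a minimal PyTorch illustration with arbitrary shapes:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(10, 4)             # decoder output for 10 edges, 4 possible labels
labels = torch.randint(0, 4, (10,))     # true edge labels
loss = F.cross_entropy(logits, labels)  # eq. (9), averaged over edges
```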

We train our method using the Adam optimizer [Kingma and Ba2014] with standard hyper-parameters. To clarify the connection and differences between CNPs and GNPs, we present the training procedure for Graph Neural Processes in Algorithm 1.

Require: $l$, lower bound on the percentage of edges to sample as context points; $u$, the corresponding upper bound; $k$, size of the slice (neighborhood) of local structural eigenfeatures.
Require: $\theta_h$, initial encoder parameters; $\theta_g$, initial decoder parameters.

1:  Let $\mathcal{G} = \{G_1, \dots, G_N\}$ be the input graphs
2:  for each training epoch do
3:     for $G_i$ in $\mathcal{G}$ do
4:        Sample $p \sim U(l, u)$
5:        Assign $n_c = \lceil p \cdot |E_i| \rceil$
6:        Sparsely sample $n_c$ context edges $E_c \subseteq E_i$
7:        Compute degree and adjacency matrices $D$, $A$ for graph $G_i$
8:        Compute $L_{\mathrm{sym}} = I - D^{-1/2} A D^{-1/2}$ and its eigenvector matrix $\Phi$
9:        Define $F_c$ as an empty feature matrix for the context points $E_c$
10:       Define $F_t$ as an empty feature matrix for the full graph
11:       for edge $e = (v_1, v_2)$ in $E_i$ do
12:          Extract eigenfeatures from $\Phi$, see eq (7)
13:          Concatenate with $(v_1, v_2, d(v_1), d(v_2))$, where $v_1, v_2$ are the attribute values at the nodes and $d(\cdot)$ the degree at each node
14:          if edge $e \in E_c$ then
15:             Append features for context point to $F_c$
16:          end if
17:          Append features for all edges to $F_t$
18:       end for
19:       Encode and aggregate $r = \bigoplus h_{\theta_h}(F_c)$
20:       Decode $\hat{y} = g_{\theta_g}(F_t, r)$
21:       Calculate loss, eq (9)
22:       Step optimizer
23:    end for
24: end for
Algorithm 1 Graph Neural Processes. All experiments use the default Adam optimizer parameters.
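A condensed Python sketch of Algorithm 1, reusing the hypothetical helpers sketched earlier (`normalized_laplacian`, `edge_features`, `GNPModel`). The bounds `l`, `u`, the slice size `k`, and the Bernoulli-style context sampling (rather than drawing exactly $n_c$ edges) are placeholders of ours, not the settings used in our experiments.

```python
import numpy as np
import torch
import torch.nn.functional as F

def train_gnp(model, graphs, epochs=10, l=0.1, u=0.5, k=3, lr=1e-3):
    """Condensed sketch of Algorithm 1; `graphs` yields (A, node_attrs, edges, labels) tuples."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for A, node_attrs, edges, labels in graphs:
            degrees = A.sum(axis=1)
            _, Phi = np.linalg.eigh(normalized_laplacian(A))          # spectral decomposition
            feats = torch.from_numpy(np.stack([
                edge_features(Phi, node_attrs, degrees, v1, v2, k)    # per-edge GNP features
                for v1, v2 in edges])).float()
            labels = torch.as_tensor(labels)
            # Sample roughly a p ~ U(l, u) fraction of edges as context points
            ctx = torch.from_numpy(np.random.rand(len(edges)) < np.random.uniform(l, u))
            if ctx.sum() == 0:                                        # need at least one context edge
                continue
            logits = model(feats[ctx], labels[ctx].float(), feats)    # encode contexts, decode all edges
            loss = F.cross_entropy(logits, labels)                    # eq. (9)
            opt.zero_grad(); loss.backward(); opt.step()

# Example usage: model = GNPModel(d_feat=2 * 3 + 4, n_classes=4); train_gnp(model, graphs)
```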

5 Applications

To show the efficacy of GNPs on real world examples, we conduct a series of experiments on a collection of 16 graph benchmark datasets [Kersting et al.2016]. These datasets span a variety of application areas, and have a diversity of sizes; features of the explored datasets are summarized in Table 2. While the benchmark collection has more than 16 datasets, we selected only those that have both node and edge labels that are fully known; these were then artificially sparsified to create graphs with known edges, but unknown edge labels.

We use the model described in Section 4 and the algorithm presented in Algorithm 1, and train a GNP that observes a sparse, uniformly sampled fraction of each graph's edges as context points (between the lower and upper bounds $l$ and $u$ described in Section 4). We compare GNPs with naive baselines and a strong random forest model for the task of edge imputation. The baselines are described in detail in [Huisman2009]:

  • Random: a random edge label is imputed for each unknown edge

  • Common: the most common edge label is imputed for each unknown edge

  • Random forest: from scikit-learn; default hyperparameters

For each algorithm on each dataset, we calculate weighted precision, weighted recall, and weighted F1-score (with weights derived from class proportions), averaging each algorithm over multiple runs. Statistical significance was assessed with a two-tailed t-test.
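The baselines and weighted metrics can be reproduced with scikit-learn; a sketch under our own assumptions (integer edge labels, feature/label arrays prepared elsewhere):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support

def baseline_scores(X_train, y_train, X_test, y_test, seed=0):
    """Weighted precision/recall/F1 for the Random (RV), Common (CV), and Random Forest baselines."""
    rng = np.random.default_rng(seed)
    preds = {
        "RV": rng.choice(np.unique(y_train), size=len(y_test)),     # random edge label
        "CV": np.full(len(y_test), np.bincount(y_train).argmax()),  # most common edge label
        "RF": RandomForestClassifier().fit(X_train, y_train).predict(X_test),
    }
    return {name: precision_recall_fscore_support(y_test, y_hat, average="weighted",
                                                  zero_division=0)[:3]
            for name, y_hat in preds.items()}
```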

5.1 Results

The results are pictured in Figures 3-5; all of the data is additionally summarized in Table 1. We first note several high-level results, and then look more deeply into the results on different subsets of the overall benchmark collection.

First, we note that the GNP provides the best F1-score on 14 out of 16 datasets, and the best recall on 14 out of 16 datasets (in this case, recall is equivalent to classification accuracy). By learning a high-level abstract representation of the data, along with a conditional distribution over edge values, the Graph Neural Process performs well at this edge imputation task, beating both naive and strong baselines. The GNP is able to do this on datasets with as few as about 300 graphs, or as many as about 9,000. We also note that the GNP is able to overcome class imbalance.

AIDS. The AIDS Antiviral Screen dataset [Riesen and Bunke2008] comes from screens checking tens of thousands of compounds for evidence of anti-HIV activity; the available screen results are graph-structured representations of these various chemical compounds. On this moderately sized dataset, the GNP beats the RF by 7%.

bzr,cox2,dhfr,er. These are chemical compound datasets BZR, COX2, DHFR and ER which come with 3D coordinates, and were used by [Mahé et al.2006] to study the pharmacophore kernel. Results are mixed between algorithms on these datasets; these are the datasets that have an order of magnitude more edges (on average) than the other datasets. These datasets have a large class imbalance: predicting the most common edge label yields around 90% accuracy. For example, the bzr dataset has 61,594 in class 1, and 7,273 in the other 4 classes combined. Even so, the GNP yields best F1 and recall on 2/4 of these; random forest gives the best precision on 3/4. It may simply be that there is so much data to work with, and the classes are so imbalanced, that it is hard to find much additional signal.

mutagenicity, MUTAG. The MUTAG dataset [Debnath et al.1991] consists of 188 chemical compounds divided into two classes according to their mutagenic effect on a bacterium, while the larger mutagenicity dataset [Kazius et al.2005] is a collection of molecules annotated with in vitro interaction information.

Here, GNP beats the random forest by several percent; both GNP and RF are vastly superior to naive baselines.

PTC_*. The various Predictive Toxicology Challenge datasets [Helma et al.2001] consist of several hundred organic molecules marked according to their carcinogenicity in male and female mice and rats. On the PTC family of graphs, GNP bests random forests by 10-15% precision and 3-10% F1-score; both strongly beat the naive baselines.

Tox21_*. This data consists of 10,000 chemical compounds run against human nuclear receptor signaling and stress pathways; it was originally designed to look for structure-activity relationships and to improve overall human health [for Advancing Translation Services2014]. On the Tox family of graphs, the GNP strongly outperforms all other models, by about 20% precision, about 12% F1, and about 10% recall.

Figure 3: Experimental precision compared with baselines; our method achieves higher precision on average.
Figure 4: Experimental recall graph compared with baselines
Figure 5: Experimental F1-score graph compared with baselines.

6 Areas for Further Exploration

This introduction of GNPs is meant as a proof of concept to encourage further research into Bayesian graph methods. As part of this work, we list a number of problems where GNPs could be applied.

Visual scene understanding [Raposo et al.2017] [Santoro et al.2017], where a graph is formed through some non-deterministic process and a collection of the inputs may be corrupted or inaccurate. A GNP could be applied to infer edge or node values in the scene to improve downstream accuracy.

Few-shot learning [Satorras and Estrach2018], where there is hidden structural information. A method like [Kemp and Tenenbaum2008] could be used to discover the form, and a GNP could then be leveraged to impute other graph attributes.

Learning dynamics of physical systems [Battaglia et al.2016] [Chang et al.2016] [Watters et al.2017] [van Steenkiste et al.2018] [Sanchez-Gonzalez et al.2018] with gaps in observations over time, where the GNP could infer values involved in state transitions.

Traffic prediction on roads or waterways [Li et al.2017] [Cui et al.2018], or throughput in cities. The GNP would learn a conditional distribution over traffic between cities.

Multi-agent systems [Sukhbaatar et al.2016] [Hoshen2017] [Kipf et al.2018], where one wants to infer the internal state of competing or cooperating agents. GNP inference could run in conjunction with other multi-agent methods and provide additional information from graph interactions.

Natural language processing, in the construction of knowledge graphs [Bordes et al.2013] [Oñoro-Rubio et al.2017] [Hamaguchi et al.2017], by relational infilling or reasoning about connections on a knowledge graph. Alternatively, GNPs could be used to perform semi-supervised text classification [Kipf and Welling2016] by imputing relations between words and sentences.

Dataset RF RV CV GNP
P R F1 P R F1 P R F1 P R F1
AIDS 0.69 0.74 0.68 0.60 0.25 0.34 0.53 0.73 0.61 0.76 0.79 0.75
bzr_md 0.81 0.88 0.84 0.81 0.25 0.36 0.80 0.89 0.84 0.79 0.89 0.83
cox2_md 0.85 0.90 0.87 0.84 0.25 0.36 0.84 0.91 0.87 0.84 0.92 0.88
dhfr_md 0.83 0.90 0.86 0.83 0.25 0.36 0.82 0.91 0.86 0.81 0.90 0.85
er_md 0.82 0.90 0.85 0.82 0.25 0.36 0.81 0.90 0.86 0.79 0.89 0.84
mutagenicity 0.79 0.83 0.80 0.72 0.25 0.35 0.69 0.83 0.75 0.81 0.85 0.81
mutag 0.73 0.74 0.73 0.48 0.25 0.31 0.40 0.63 0.49 0.75 0.79 0.76
PTC_FM 0.64 0.64 0.62 0.43 0.25 0.31 0.25 0.50 0.33 0.76 0.72 0.72
PTC_FR 0.63 0.63 0.61 0.43 0.25 0.31 0.26 0.51 0.34 0.77 0.68 0.69
PTC_MM 0.62 0.62 0.61 0.43 0.25 0.31 0.25 0.50 0.34 0.78 0.64 0.64
tox21_ahr 0.61 0.61 0.60 0.47 0.25 0.31 0.36 0.60 0.45 0.81 0.71 0.72
Tox21_ARE 0.61 0.61 0.60 0.46 0.25 0.31 0.34 0.58 0.43 0.81 0.71 0.73
Tox21_AR-LBD 0.61 0.61 0.61 0.47 0.25 0.31 0.36 0.60 0.45 0.82 0.71 0.73
Tox21_aromatase 0.61 0.61 0.61 0.47 0.25 0.31 0.37 0.61 0.46 0.81 0.70 0.72
Tox21_ATAD5 0.61 0.61 0.61 0.46 0.25 0.31 0.35 0.59 0.44 0.81 0.71 0.73
Tox21_ER 0.62 0.60 0.60 0.47 0.25 0.31 0.36 0.60 0.45 0.80 0.71 0.72
Table 1: Experimental Results. RF=Random Forest; RV=Random edge label; CV=most common edge label; GNP=Graph Neural Process. P=precision; R=recall; F1=F1 score. Statistically significant bests are in bold with non-significant ties bolded across methods.
Dataset # Graphs Avg. Nodes Avg. Edges # Edge Labels
AIDS 2000 15.69 16.20 3
BZR_MD 306 21.30 225.06 5
COX2_MD 303 26.28 335.12 5
DHFR_MD 393 23.87 283.01 5
ER_MD 446 21.33 234.85 5
Mutagenicity 4337 30.32 30.77 3
MUTAG 188 17.93 19.79 4
PTC_FM 349 14.11 14.48 4
PTC_FR 351 14.56 15.00 4
PTC_MM 336 13.97 14.32 4
Tox21_AHR 8169 18.09 18.50 4
Tox21_ARE 7167 16.28 16.52 4
Tox21_aromatase 7226 17.50 17.79 4
Tox21_ARLBD 8753 18.06 18.47 4
Tox21_ATAD5 9091 17.89 18.30 4
Tox21_ER 7697 17.58 17.94 4
Table 2: Features of the explored data sets

There are a number of computer vision applications where graphs and GNPs could be extremely valuable. For example, one could improve 3D mesh and point cloud construction [Wang et al.2018] using lidar or radar during adverse weather conditions. The values in the meshes or point clouds could be imputed directly from conditional draws from the distribution learned by the GNP.

7 Conclusions

In this work we have introduced Graph Neural Processes, a model that learns a conditional distribution while operating on graph-structured data. This model has the capacity to generate uncertainty estimates over outputs and to encode prior knowledge about input data distributions. We have demonstrated the GNP's ability on edge imputation and outlined potential areas for future exploration.

While we note that the encoder and decoder architectures can be extended significantly by incorporating modern deep learning architectural design, this work is a step towards building Bayesian neural networks on arbitrary graph-structured inputs. Additionally, it encourages the learning of abstractions about these structures. In the future, we wish to explore the use of GNPs to inform high-level reasoning and abstraction about fundamentally relational data.

References