Graphite: Iterative Generative Modeling of Graphs

03/28/2018 ∙ by Aditya Grover, et al. ∙ Stanford University

Graphs are a fundamental abstraction for modeling relational data. However, graphs are discrete and combinatorial in nature, and learning representations suitable for machine learning tasks poses statistical and computational challenges. In this work, we propose Graphite, an algorithmic framework for unsupervised learning of representations over nodes in a graph using deep latent variable generative models. Our model is based on variational autoencoders (VAE), and differs from existing VAE frameworks for data modalities such as images, speech, and text in its use of graph neural networks to parameterize both the generative model (i.e., decoder) and the inference model (i.e., encoder). The use of graph neural networks directly incorporates inductive biases arising from the spatial, local structure of graphs into the generative model. Moreover, we draw novel connections between graph neural networks and approximate inference via kernel embeddings of distributions. We demonstrate empirically that Graphite outperforms state-of-the-art approaches for the tasks of density estimation, link prediction, and node classification on synthetic and benchmark datasets.


1 Introduction

Latent variable generative modeling is an effective approach for unsupervised representation learning of high-dimensional data (loehlin1998latent). In recent years, representations learned by latent variable models parameterized by deep neural networks have shown impressive performance on many tasks such as semi-supervised learning and structured prediction (kingma2014semi; sohn2015learning). However, these successes have been largely restricted to specific data modalities such as images and speech. In particular, it is challenging to apply current deep generative models to large-scale graph-structured data, which arises in a wide variety of domains across the physical, information, and social sciences.

To effectively model the relational structure of large graphs for deep learning, prior works have proposed to use graph neural networks (gori2005new; scarselli2009graph; bruna2013spectral). A graph neural network learns node-level representations by parameterizing an iterative message passing procedure between nodes and their neighbors. The tasks which have benefited from graph neural networks, including semi-supervised learning (kipf2016semi) and few-shot learning (garcia2017few), involve encoding an input graph into a final output representation (such as the labels associated with the nodes). The inverse problem of learning to decode a hidden representation into a graph, as in the case of a latent variable generative model, is to the best of our knowledge largely an open question that we address in this work.

We propose Graphite, a framework for learning latent variable generative models of graphs based on variational autoencoding (kingma2013auto). Specifically, we learn a directed model expressing a joint distribution over the adjacency matrix of a graph and latent feature vectors for every node. Our framework uses graph neural networks for inference (encoding) and generation (decoding). While the encoding is straightforward, the decoding is done using a multi-layer iterative procedure that alternates between message passing and graph refinement. In contrast to recent concurrent works, Graphite is especially designed to scale to large graphs with feature attributes.

From a theoretical standpoint, we highlight novel connections between special cases of our framework and the existing literature on approximate inference via kernel embeddings (smola2007hilbert). Finally, we use Graphite as the central building block for several inference tasks over entire graphs, nodes, and edges. In empirical evaluations on synthetic and benchmark datasets for density estimation, link prediction, and semi-supervised node classification, we demonstrate that our general-purpose framework outperforms competing approaches for graph representation learning.

2 Preliminaries

Throughout this work, we use upper-case symbols to denote probability distributions and assume they admit absolutely continuous densities with respect to a suitable reference measure, denoted by the corresponding lower-case notation. Consider an undirected graph G = (V, E) where V and E denote index sets of nodes and edges respectively. We represent the graph structure using a symmetric adjacency matrix A ∈ R^{n×n}, where n = |V| and the entries A_{ij} denote the weight of the edge between nodes i and j. Additionally, we denote the feature matrix associated with the graph as X ∈ R^{n×m} for an m-dimensional signal associated with each node; for example, these could refer to user attributes in a social network. If there are no explicit node features, we set X = I_n (the identity matrix).

2.1 Weisfeiler-Lehman algorithm

The k-dim Weisfeiler-Lehman (WL) algorithm (weisfeiler1968reduction; douglas2011weisfeiler) is a heuristic test of graph isomorphism between any two graphs G_1 and G_2. The algorithm proceeds in iterations. For brevity, we present the algorithm for k = 1. Before the first iteration, we label every node in G_1 and G_2 with a scalar isomorphism-invariant initialization (e.g., node degrees). That is, if G_1 and G_2 are assumed to be isomorphic, then the matching nodes establishing the isomorphism in G_1 and G_2 have the same labels (a.k.a. messages) for an isomorphism-invariant initialization. Let l^{(k)} denote the vector of messages for the nodes of a graph at iteration k. At every iteration k > 0, we perform a relabelling of the nodes in G_1 and G_2 based on a message passing update rule:

l^{(k)} ← hash(A l^{(k-1)})     (1)

where A denotes the adjacency matrix of the corresponding graph and hash(·) is applied elementwise. Hence, the message for every node is computed as a hashed sum of the messages from the neighboring nodes (since A_{ij} ≠ 0 only if i and j are neighbors). We repeat the process for a specified number of iterations, or until convergence. If the multisets of labels for the nodes in G_1 and G_2 are equal (which can be checked using sorting in O(|V| log |V|) time), then the algorithm declares the two graphs G_1 and G_2 to be isomorphic.

The k-dim WL algorithm simultaneously passes messages of length k (each initialized with some isomorphism-invariant scheme), and a positive test for isomorphism requires equality in all k dimensions for the nodes in G_1 and G_2 after the termination of message passing. This algorithmic test is a heuristic which guarantees no false negatives but can give false positives. Empirically, the test has been shown to fail on some regular graphs, but it gives excellent performance on real-world graphs (shervashidze2011weisfeiler).
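
To make the update concrete, the following is a minimal NumPy sketch of the 1-dim WL test described above. Treating a lookup over the pooled aggregated values as the hash function is an implementation convenience we assume here; it is not prescribed by the algorithm.

```python
import numpy as np

def wl_test(A1, A2, num_iters=3):
    # Heuristic 1-dim WL isomorphism test based on Eq. (1): at every iteration,
    # each node's new message is a (shared) hash of the sum of its neighbors'
    # messages; here the "hash" maps pooled values to small integer labels.
    labels = [A.sum(axis=1).astype(int) for A in (A1, A2)]   # degree initialization
    n1 = A1.shape[0]
    for _ in range(num_iters):
        # message passing: l^(k) = hash(A l^(k-1)) for each graph
        agg = [A.astype(int) @ l for A, l in zip((A1, A2), labels)]
        _, inverse = np.unique(np.concatenate(agg), return_inverse=True)
        labels = [inverse[:n1], inverse[n1:]]
    # positive test: the sorted label multisets of the two graphs agree
    return np.array_equal(np.sort(labels[0]), np.sort(labels[1]))
```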

2.2 Graph neural networks

The message passing procedure in the WL algorithm encodes messages that are most sensitive to structural information. Graph neural networks (GNN) build on this observation and parameterize an unfolding of the iterative message passing procedure which we describe next.

A GNN consists of many layers, indexed by l ∈ N, with each layer associated with an activation η_l and a dimensionality d_l. In addition to the input graph A, every layer l of the GNN takes as input the activations from the previous layer H^{(l-1)} ∈ R^{n×d_{l-1}}, a family of linear transformations F_l : R^{n×n} → R^{n×n}, and a matrix of learnable weight parameters W_l ∈ R^{d_{l-1}×d_l} together with optional bias parameters B_l ∈ R^{n×d_l}. Recursively, the layer-wise propagation rule in a GNN is given by:

H^{(l)} = η_l ( Σ_{f ∈ F_l} f(A) H^{(l-1)} W_l + B_l )     (2)

with the base cases H^{(0)} = X and d_0 = m. Several variants of graph neural networks have been proposed in prior work. For instance, graph convolutional networks (GCN) (kipf2016semi) instantiate graph neural networks with the following propagation rule:

H^{(l)} = η_l ( Ã H^{(l-1)} W_l + B_l )     (3)

where Ã = D^{-1/2} A D^{-1/2} is the symmetric normalization of A given the diagonal degree matrix D (i.e., D_{ii} = Σ_j A_{ij}), with the same base cases as before. Comparing the above with the WL update rule in Eq. (1), we can see that the activations for every layer in a GCN are computed via parameterized, scaled activations (messages) of the previous layer being passed over Ã, with the hash function implicitly specified by the activation function η_l.

Our framework is agnostic to the instantiation of the message passing rule of a graph neural network in Eq. (2), and we use graph convolutional networks for experimental validation. For brevity, we denote the output of the final layer of a multi-layer graph neural network with input adjacency matrix A, node feature matrix X, and parameters θ as GNN_θ(A, X), with appropriate activation functions and linear transformations applied at each hidden layer of the network.
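
For concreteness, the GCN propagation rule in Eq. (3) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation; the activation choice and the linear final layer are our assumptions.

```python
import numpy as np

def normalize_adj(A):
    # symmetric normalization A_tilde = D^{-1/2} A D^{-1/2}
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    return (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def gcn_layer(A_tilde, H, W, b=None, activation=np.tanh):
    # one GCN layer: H^(l) = eta_l(A_tilde H^(l-1) W_l + B_l), cf. Eq. (3)
    out = A_tilde @ H @ W
    if b is not None:
        out = out + b
    return activation(out)

def gnn(A, X, weights, activation=np.tanh):
    # multi-layer GNN, written GNN_theta(A, X) in the text; the final layer
    # is left linear so it can parameterize e.g. Gaussian means.
    A_tilde = normalize_adj(A)
    H = X
    for i, W in enumerate(weights):
        act = activation if i < len(weights) - 1 else (lambda z: z)
        H = gcn_layer(A_tilde, H, W, activation=act)
    return H
```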

2.3 Kernel embeddings

A kernel k defines a notion of similarity between pairs of objects, such as a pair of graphs (scholkopf2002learning; shawe2004kernel). In order to do so, we first consider a mapping of objects (graphs) into a potentially infinite-dimensional feature space H. These mappings need not be defined explicitly, and typically we only require the kernel function k : X × X → R defined over a space of graphs X.

Kernel methods can also learn feature mappings for distributions of graphs and other objects (smola2007hilbert; gretton2007kernel). Formally, we denote these functional mappings as T : P → H, where P specifies the space of all distributions on X. These mappings, referred to as kernel embeddings of distributions, are defined as:

μ_P := E_{x∼P}[φ(x)]

for some φ : X → H. We are particularly interested in injective embeddings, i.e., for any pair of distributions P and Q, we have μ_P ≠ μ_Q if P ≠ Q. For injective embeddings, all statistical features of the distribution are preserved by the embedding. Crucially, this implies that we can compute functionals of any distribution by directly applying a corresponding function on its kernel embedding. Formally, for every operator O defined on P, there exists a corresponding operator Õ defined on H such that:

O(P) = Õ(μ_P)     (4)

if μ_P is an injective embedding of P. In Section 4, we will use the above property of injective embeddings to interpret the role of graph neural networks in Graphite.
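
To make kernel embeddings of distributions concrete, the sketch below estimates empirical mean embeddings μ_P = E_{x∼P}[φ(x)] using an explicit random Fourier feature map φ that approximates an RBF kernel; the choice of feature map and the sample-based estimate are illustrative assumptions, not part of Graphite.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_FEAT, GAMMA = 2, 256, 1.0
# one fixed random Fourier feature map phi(x), approximating the RBF kernel
# k(x, y) = exp(-GAMMA * ||x - y||^2); drawn once and shared across inputs
W = rng.normal(scale=np.sqrt(2 * GAMMA), size=(D_IN, D_FEAT))
b = rng.uniform(0, 2 * np.pi, size=D_FEAT)

def phi(X):
    return np.sqrt(2.0 / D_FEAT) * np.cos(X @ W + b)

def mean_embedding(X):
    # empirical estimate of mu_P = E_{x~P}[phi(x)] from samples of P
    return phi(X).mean(axis=0)

# Distinct distributions map to distinct embeddings (injectivity holds for
# characteristic kernels such as the RBF kernel); their distance is the MMD.
P = rng.normal(0.0, 1.0, size=(2000, 2))
Q = rng.normal(0.5, 1.0, size=(2000, 2))
print(np.linalg.norm(mean_embedding(P) - mean_embedding(Q)))  # clearly nonzero
```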

3 Generative Modeling with Graphite

For generative modeling of graphs, we are interested in learning a parameterized distribution over adjacency matrices, p_θ(A). (In this work, we restrict ourselves to modeling the graph structure only; any additional information in the form of node signals X is incorporated as evidence.)

Figure 1: Latent variable model for Graphite. Observed evidence variables in gray.

In Graphite, we adopt a latent variable approach for modeling the generative process. That is, we introduce a latent variable vector Z_i ∈ R^k and an evidence feature vector X_i ∈ R^m for each node i, along with an observed variable A_{ij} for each pair of nodes (i, j). Unless necessary, we use the succinct matrix representations Z ∈ R^{n×k}, X ∈ R^{n×m}, and A ∈ R^{n×n} for these variables henceforth. The conditional independencies between the variables can be summarized in the directed graphical model (using plate notation) in Figure 1. We can learn the model parameters θ by maximizing the marginal likelihood of the observed adjacency matrix:

max_θ log p_θ(A | X) = log ∫ p_θ(A, Z | X) dZ     (5)

If we have multiple observed adjacency matrices in our data, we maximize the expected log-likelihood over all these matrices. We can obtain a tractable, stochastic evidence lower bound (ELBO) to the above objective by introducing a variational posterior q_φ(Z | A, X) with parameters φ:

log p_θ(A | X) ≥ E_{q_φ(Z|A,X)} [ log p_θ(A, Z | X) − log q_φ(Z | A, X) ]     (6)

The lower bound is tight when the variational posterior q_φ(Z | A, X) matches the true posterior p_θ(Z | A, X), and hence maximizing the above objective optimizes for the variational parameters that define the best approximation to the true posterior within the variational family. We now discuss parameterizations for specifying q_φ(Z | A, X) (i.e., the encoder) and p_θ(A | Z, X) (i.e., the decoder).

Encoding using forward message passing.

Typically, we use the mean field approximation for defining the variational family, and hence:

q_φ(Z | A, X) = ∏_{i=1}^{n} q_φ(Z_i | A, X)     (7)

Additionally, we typically make distributional assumptions on each variational marginal density q_φ(Z_i | A, X) such that it is reparameterizable and easy to sample, thereby permitting efficient learning. In Graphite, we assume isotropic Gaussian variational marginals with diagonal covariance. The parameters of the variational marginals are specified using a graph neural network:

μ, σ = GNN_φ(A, X)     (8)

where μ and σ denote the vectors of means and standard deviations, i.e., the sufficient statistics for the variational marginals {q_φ(Z_i | A, X)}_{i=1}^{n}.
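
A minimal sketch of the encoder in Eq. (8), reusing the gnn and normalize_adj helpers from the sketch in Section 2.2; the two linear GCN heads and the reparameterized sampling are standard VAE choices we assume here, not necessarily the exact architecture used in the experiments.

```python
import numpy as np

def encode(A, X, trunk_weights, W_mu, W_logstd, rng=np.random.default_rng(0)):
    # q_phi(Z | A, X), Eq. (8): a shared GNN trunk followed by two linear GCN
    # heads producing the per-node means and log standard deviations of the
    # Gaussian variational marginals.
    H = gnn(A, X, trunk_weights)
    A_tilde = normalize_adj(A)
    mu = A_tilde @ H @ W_mu
    log_std = A_tilde @ H @ W_logstd
    # reparameterization trick: Z = mu + sigma * eps with eps ~ N(0, I)
    Z = mu + np.exp(log_std) * rng.standard_normal(mu.shape)
    return Z, mu, log_std
```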

Decoding using reverse message passing.

For specifying the observation model p_θ(A | Z, X), we cannot directly use a graph neural network since we do not have an input graph for message passing. To sidestep this issue, we propose an iterative two-step approach that alternates between defining an intermediate graph and then gradually refining this graph through message passing. Formally, given a latent matrix Z and an input feature matrix X, we iterate over the following sequence of operations:

Â = Z Z^T / ‖Z‖^2 + 1 1^T     (9)
Z* = GNN_θ(Â, [Z | X])     (10)

where the second argument to the GNN is a concatenation of Z and X. The first step constructs an intermediate weighted graph Â by applying a (normalized) inner product of Z with itself; adding a constant of 1 to every element guarantees that the intermediate graph is non-negative. The second step performs a pass through a parameterized graph neural network. We can repeat the above sequence to gradually refine the feature matrix Z*. The final distribution over the graph is obtained using an inner product step on Z*, akin to Eq. (9). For ease of sampling, we assume the observation model factorizes over edges:

p_θ(A | Z, X) = ∏_{i=1}^{n} ∏_{j=1}^{n} p_θ(A_{ij} | Z, X)     (11)
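
The iterative decoding procedure of Eqs. (9)-(11) can be sketched as follows (again reusing the gnn helper from Section 2.2). The sigmoid link for the edge probabilities and the reuse of symmetric normalization inside the GNN are assumptions consistent with the description above, not necessarily the exact released parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(Z, X, round_weights):
    # p_theta(A | Z, X): alternate between forming an intermediate graph from
    # Z (Eq. (9)) and refining Z by message passing over it (Eq. (10)).
    for weights in round_weights:                       # one entry per refinement round
        A_hat = Z @ Z.T / np.square(Z).sum() + 1.0      # non-negative intermediate graph
        Z = gnn(A_hat, np.concatenate([Z, X], axis=1), weights)
    # final inner-product step: independent Bernoulli edge probabilities (Eq. (11))
    return sigmoid(Z @ Z.T)
```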

4 Interpreting Graph Neural Networks via Kernels

Figure 2: Interpreting graph neural networks via kernel embeddings. (a) An example input graph over three nodes with its edge set E. (b) A latent variable model satisfying Property 1 for this graph.

Locality preference for representation learning is a key inductive bias for graphs. We formulate this using an (undirected) graphical model over A, X, and Z to specify the conditional independence structure of the posterior p(Z | A, X). We are interested in models that satisfy the following property.

Property 1.

The edge set E defined by the adjacency matrix A is an I-map for the posterior distribution p(Z | A, X).

In words, the above property implies that, according to the posterior distribution over Z, any individual Z_i is independent of all other latent variables when conditioned on A, X, and the neighboring latent variables of node i as determined by the edge set E. See Figure 2 for an illustration.

Consider a mean field approximation of this posterior:

p(Z | A, X) ≈ ∏_{i=1}^{n} q(Z_i | A, X; λ_i)     (12)

where λ = {λ_i}_{i=1}^{n} denotes the full set of parameters for the variational posterior. Using standard variational arguments (wainwright2008graphical), we know that the optimal variational marginals assume the following functional form:

λ_i = f(λ_{N(i)}, A, X)     (13)

where N(i) denotes the neighbors of Z_i in the graphical model and f is a function determined by the fixed point equations, which depends on the actual potentials. Importantly, the above functional form suggests that the parameters of the optimal marginals in mean field inference are only a function of the parameters of the neighboring marginals.

We will sidestep deriving f, and instead use the kernel embeddings of the variational marginals to directly reason in the embedding space. That is, we assume we have an injective embedding for each marginal q(Z_i | A, X; λ_i), given by μ_i = E_q[φ(Z_i)] for some feature map φ, and directly use the equivalence established in Eq. (4) iteratively. This gives us the following recursive expression for the embeddings at iteration k:

μ_i^{(k)} = f̃(μ_{N(i)}^{(k-1)}, A, X)     (14)

with an appropriate base case for the embeddings μ_i^{(0)}. We then have the following result:

Theorem 2.

Given A and X, there exists a choice of the activations η_l, the family of transformations F_l, and the parameters W_l and B_l such that the GNN propagation rule in Eq. (2) is computationally equivalent to one iteration of variational message passing on a first-order approximation to Eq. (14) for any graphical model satisfying Property 1.

Proof.

See Appendix A. ∎

While the activations η_l are fixed beforehand, the parameters W_l and B_l are directly learned from data. Hence we have shown that a GNN is a good model for computation with respect to latent variable models that attempt to capture inductive biases relevant to graphs, i.e., ones where the latent feature vector Z_i for every node i is conditionally independent from everything else given the feature vectors of its neighbors (and A, X). Note that such a graphical model would satisfy Property 1 but is in general different from the posterior specified by the model in Figure 1. However, if the true (but unknown) posterior on the latent variables for the model proposed in Figure 1 could be expressed as an equivalent model satisfying the desired property, then Theorem 2 indeed suggests the use of GNNs for parameterizing variational posteriors, as we do in the case of Graphite.

5 Experimental Evaluation

Erdos-Renyi | Ego | Regular | Geometric | Power Law | Barabasi-Albert
GAE: -221.79 ± 7.58 | -197.3 ± 1.99 | -198.5 ± 4.78 | -514.26 ± 41.58 | -519.44 ± 36.30 | -236.29 ± 15.13
Graphite-AE: -195.56 ± 1.49 | -182.79 ± 1.45 | -191.41 ± 1.99 | -181.14 ± 4.48 | -201.22 ± 2.42 | -192.38 ± 1.61
VGAE: -273.82 ± 0.07 | -273.76 ± 0.06 | -275.29 ± 0.08 | -274.09 ± 0.06 | -278.86 ± 0.12 | -274.4 ± 0.08
Graphite-VAE: -270.22 ± 0.15 | -270.70 ± 0.32 | -266.54 ± 0.12 | -269.71 ± 0.08 | -263.92 ± 0.14 | -268.73 ± 0.09
Table 1: Mean reconstruction errors and negative log-likelihood estimates (in nats) for autoencoders and variational autoencoders respectively on test instances from six different generative families.

We evaluate Graphite on tasks involving entire graphs, nodes, and edges. We consider two variants of our proposed framework: Graphite-VAE, which corresponds to the directed latent variable model described in Section 3, and Graphite-AE, which corresponds to an autoencoder trained to minimize the error in reconstructing an input adjacency matrix. For unweighted graphs (i.e., A ∈ {0, 1}^{n×n}), the reconstruction terms in the objectives for both Graphite-VAE and Graphite-AE minimize the negative cross entropy between the input and reconstructed adjacency matrices. For weighted graphs, we use the mean squared error. Hyperparameter details beyond this section are described in Appendix B.

5.1 Reconstruction and density estimation

In the first set of tasks, we evaluate learning in Graphite based on held-out reconstruction losses and log-likelihood estimates for the learned Graphite-AE and Graphite-VAE models respectively. As a benchmark comparison, we compare against the Graph Autoencoder/Variational Graph Autoencoder (GAE/VGAE) (kipf2016variational). The GAE/VGAE models consist of an encoding procedure similar to Graphite; however, the decoder has no learnable parameters and reconstruction is done solely through an inner product operation (such as the one in Eq. (9)).

We create datasets from six graph families with fixed, known generative processes: Erdos-Renyi graphs, ego networks, random regular graphs, random geometric graphs, random power law trees, and Barabasi-Albert graphs. For each family, 300 graph instances were sampled, each with between 10 and 20 nodes, and evenly split into train/validation/test instances. The results on the held-out test instances are shown in Table 1. Both Graphite-AE and Graphite-VAE outperform GAE and VGAE significantly with respect to the evaluation metrics. These results indicate the usefulness of learned decoders in Graphite.

5.2 Link prediction

The task of link prediction is to predict whether an edge exists between a pair of nodes (loehlin1998latent). Even though Graphite learns a distribution over graphs, it can be used for predictive tasks within a single graph. In order to do so, we learn a model for a random, connected training subgraph of the true graph. For validation and testing, we add a balanced set of positive and negative (false) edges to the original graph and evaluate the model performance based on the reconstruction probabilities assigned to the validation and test edges (similar to denoising of the input graph). In our experiments, we held out a set of edges for validation, a disjoint set of edges for testing, and trained all models on the remaining subgraph. Additionally, the validation and test sets each contain an equal number of non-edges.
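
The evaluation protocol above amounts to scoring the held-out node pairs with the model's reconstruction probabilities and computing ranking metrics; a minimal sketch using scikit-learn (the helper name and input format are ours) is shown below.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_link_prediction(edge_probs, pos_edges, neg_edges):
    # edge_probs: n x n matrix of reconstruction probabilities p(A_ij = 1)
    # pos_edges / neg_edges: integer arrays of held-out (i, j) pairs, shape (m, 2)
    scores = np.concatenate([edge_probs[pos_edges[:, 0], pos_edges[:, 1]],
                             edge_probs[neg_edges[:, 0], neg_edges[:, 1]]])
    labels = np.concatenate([np.ones(len(pos_edges)), np.zeros(len(neg_edges))])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)
```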

We evaluate performance based on the Area Under the ROC Curve (AUC) and Average Precision (AP) metrics. We compare across standard benchmark citation network datasets: Cora, Citeseer, and Pubmed, with papers as nodes and citations as edges (sen2008networks). For these networks, the text in the papers can be synthesized into optional node-level features. We evaluated Graphite-VAE and Graphite-AE against the following baselines: Spectral Clustering (SC) (tang2011leveraging), DeepWalk (perozzi2014deepwalk), node2vec (grover2016node2vec), and GAE/VGAE (kipf2016variational). SC, DeepWalk, and node2vec do not provide the ability to incorporate node features while learning embeddings, and hence we evaluate them only on the featureless datasets.

Cora | Citeseer | Pubmed | Cora* | Citeseer* | Pubmed*
SC: 89.9 ± 0.20 | 91.5 ± 0.17 | 94.9 ± 0.04 | - | - | -
DeepWalk: 85.0 ± 0.17 | 88.6 ± 0.15 | 91.5 ± 0.04 | - | - | -
node2vec: 85.6 ± 0.15 | 89.4 ± 0.14 | 91.9 ± 0.04 | - | - | -
GAE: 90.2 ± 0.16 | 92.0 ± 0.14 | 92.5 ± 0.06 | 93.9 ± 0.11 | 94.9 ± 0.13 | 96.8 ± 0.04
VGAE: 90.1 ± 0.15 | 92.0 ± 0.17 | 92.3 ± 0.06 | 94.1 ± 0.11 | 96.7 ± 0.08 | 95.5 ± 0.13
Graphite-AE: 91.0 ± 0.15 | 92.6 ± 0.16 | 94.5 ± 0.05 | 94.2 ± 0.13 | 96.2 ± 0.10 | 97.8 ± 0.03
Graphite-VAE: 91.5 ± 0.15 | 93.5 ± 0.13 | 94.6 ± 0.04 | 94.7 ± 0.11 | 97.3 ± 0.06 | 97.4 ± 0.04
Table 2: Area Under the ROC Curve (AUC) for link prediction (* denotes dataset with features).

Figure 3: t-SNE embeddings of the latent feature vectors for the Cora dataset. Colors denote labels. (a) Graphite-AE; (b) Graphite-VAE.

The AUC results (along with standard errors) are shown in Table 2 (AP results are in Table 6 in the Appendix), averaged over 50 random train/validation/test splits. On both metrics, Graphite-VAE gives the best performance overall. Graphite-AE also gives good results, generally outperforming its closest competitor GAE. We visualize the embeddings learned by Graphite via a 2D t-SNE projection (maaten2008visualizing) of the latent feature vectors (given as the rows of Z) on the Cora dataset in Figure 3. Even without any access to label information for the nodes during training, the models are able to cluster the nodes (papers) according to their labels (paper categories).

5.3 Semi-supervised node classification

Cora* | Citeseer* | Pubmed*
SemiEmb: 59.0 | 59.6 | 71.1
DeepWalk: 67.2 | 43.2 | 65.3
ICA: 75.1 | 69.1 | 73.9
Planetoid: 75.7 | 64.7 | 77.2
GCN: 81.5 | 70.3 | 79.0
Graphite: 82.1 ± 0.06 | 71.0 ± 0.07 | 79.3 ± 0.03
GAT: 83.0 ± 0.07 | 72.5 ± 0.07 | 79.0 ± 0.03
Table 3: Classification accuracies (* denotes dataset with features). Baseline numbers from (kipf2016semi).

Given the labels for a few nodes in an underlying graph, the goal of this task is to predict the labels for the remaining nodes. We consider a transductive setting, where we have access to the test nodes (but not their labels) during training. The closest approach to Graphite for this task is a graph convolutional network (GCN) trained end-to-end. We consider an extension of this baseline, wherein we augment the GCN objective with the Graphite objective and a hyperparameter to control the relative importance of the two terms in the combined objective; a sketch of this combination follows below. The parameters for the encoder are shared across the two objectives.
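
A minimal sketch of one natural instantiation of this hybrid objective is shown below; the function signature, the weighting scheme, and the assumption that the ELBO term is computed separately are ours.

```python
import numpy as np

def hybrid_loss(logits, labels, labeled_idx, elbo, lam=0.5):
    # supervised term: softmax cross-entropy over the labelled nodes only
    z = logits[labeled_idx]
    z = z - z.max(axis=1, keepdims=True)                       # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labeled_idx)), labels[labeled_idx]].mean()
    # unsupervised term: negative ELBO of the Graphite objective (Eq. (6)),
    # computed on the full graph with the shared encoder parameters
    return ce + lam * (-elbo)
```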

The classification accuracies of the semi-supervised models are given in Table 3. We find that the Graphite-hybrid model outperforms nearly all competing models on all datasets and is competitive with the Graph Attention Networks recently proposed in (velickovic2018graph). In the future, we would like to explore the use of Graph Attention Networks for parameterizing the encoder and decoder in Graphite.

6 Discussion and related work

Our framework effectively marries probabilistic modeling and representation learning on graphs. We review some of the dominant prior works in these fields below.

Probabilistic modeling of graphs.

The earliest probabilistic models of graphs proposed to generate graphs by creating an edge between any pair of nodes with a constant probability (erdos1959random). Several alternatives have been proposed since; for example, the small-world model generates graphs with small diameter, and the generated graphs exhibit strong local clustering (watts1998collective), while the Barabasi-Albert model captures preferential attachment, whereby high-degree nodes are likely to form edges with newly added nodes (barabasi1999random). We direct the interested reader to prominent surveys on this topic (newman2003structure; mitzenmacher2004brief; chakrabarti2006graph).

Representation learning on graphs.

For representation learning on graphs, we can characterize the majority of prior work into three kinds of approaches: matrix factorization, random walk based approaches, and graph neural networks. We include a brief discussion of the first two in the Appendix and refer the reader to (hamilton2017representation) for a recent survey.

Graph neural networks, a collective term for networks that operate over graphs using message passing, have shown success on several downstream applications; see, for instance, (duvenaud2015molecular; li2015gated; kearnes2016molecular; kipf2016semi; hamilton2017inductive) and the references therein. (gilmer2017neural) provides a comprehensive characterization of these networks in the message passing setup. We used graph convolutional networks, partly to provide a direct comparison with GAE/VGAE, and leave the exploration of other GNN variants for future work.

Latent variable models for graphs.

Hierarchical Bayesian models parameterized by deep neural networks have recently been proposed for graphs (hu2017deep; wang2017relational). Besides making strong assumptions about the model structure and being restricted to single graphs, these models either have expensive inference procedures requiring Markov chains (hu2017deep) or are task-specific (wang2017relational). (johnson2017transitions) and (kipf2018neural) generate graphs as latent representations learned directly from data; their frameworks do not have an explicit probabilistic interpretation for modeling graph densities. Finally, there has been a fair share of recent work on the generation of special kinds of graphs, such as parse trees of source code (maddison2014structured) and SMILES representations of molecules (olivecrona2017denovo).

6.1 Scalable learning and inference in Graphite

Several deep generative models for graphs concurrent with this work have recently been proposed. Amongst adversarial generation approaches, (wang2017graphgan) and (bojchevski2018netgan) model local graph neighborhoods and random walks on graphs respectively. (li2018deepgen) and (you2018graphrnn) model graphs as sequences and generate graphs via autoregressive procedures. Finally, closest to our framework is the GAE/VGAE approach (kipf2016variational) discussed in Section 5, which builds a generative model of graphs based on variational principles. While adversarial and autoregressive approaches have shown success in the generation of small to medium graphs, they lack the powerful inference capabilities of the variational approaches.

Real-world graphs commonly have thousands of nodes, and hence we want Graphite to effectively scale to large graphs. On the surface, the decoding step in Graphite (as well as GAE/VGAE) involves inner products of potentially dense matrices Z or Z*, which is an O(n^2) operation. For any intermediate decoding step as in Eq. (9), we can offset this difficulty by exploiting the associativity of matrix multiplication for the message passing step in Eq. (10). For notational brevity, consider the simplified graph propagation rule for a GNN:

H^{(l)} = η_l ( Â H^{(l-1)} )

where Â is defined in Eq. (9). If d_l and d_{l-1} denote the sizes of the layers H^{(l)} and H^{(l-1)} respectively, then the time complexity of propagation based on right multiplication, i.e., computing Z (Z^T H^{(l-1)}) instead of (Z Z^T) H^{(l-1)}, is O(n k d_{l-1}), where k is the dimension of the latent node representations used to define Â.
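
The ordering trick is easy to state in code: never materialize the n × n intermediate graph, and instead right-multiply first. A minimal sketch (handling the +1 offset through the rank-one all-ones term) is given below.

```python
import numpy as np

def propagate_dense(Z, H):
    # naive: materializes the n x n intermediate graph, O(n^2) time and memory
    A_hat = Z @ Z.T / np.square(Z).sum() + 1.0
    return A_hat @ H

def propagate_factored(Z, H):
    # same result via associativity: Z (Z^T H) plus the rank-one ones term,
    # roughly O(n k d) time, never forming an n x n matrix
    scale = 1.0 / np.square(Z).sum()
    ones_term = np.broadcast_to(H.sum(axis=0, keepdims=True), H.shape)
    return scale * (Z @ (Z.T @ H)) + ones_term

rng = np.random.default_rng(0)
Z, H = rng.normal(size=(1000, 16)), rng.normal(size=(1000, 32))
assert np.allclose(propagate_dense(Z, H), propagate_factored(Z, H))
```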

The above trick sidesteps the quadratic complexity of Graphite decoding in the intermediate layers without any loss in statistical accuracy. The final layer, however, still involves an inner product between potentially dense matrices for evaluating the ELBO objective. However, since the edges are generated independently, we can approximate the loss objective by performing a Monte Carlo evaluation over a subsample of the reconstructed adjacency matrix parameters in Eq. (11).
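
The Monte Carlo approximation can likewise be sketched directly: rather than evaluating the reconstruction term over all n^2 entries, sample a small set of (i, j) pairs each iteration and average the per-entry losses; the uniform sampling scheme and the inner-product/sigmoid parameterization below are simple choices we assume.

```python
import numpy as np

def mc_reconstruction_loss(Z, A, num_samples=4096, rng=np.random.default_rng(0)):
    # sample random entries (i, j) of the adjacency matrix
    n = A.shape[0]
    i = rng.integers(0, n, size=num_samples)
    j = rng.integers(0, n, size=num_samples)
    # Bernoulli logits from the inner-product decoder, computed only for the
    # sampled pairs: O(num_samples * k) rather than O(n^2 k)
    logits = (Z[i] * Z[j]).sum(axis=1)
    targets = A[i, j]
    # numerically stable binary cross-entropy with logits
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()
```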

By adaptively choosing the number of entries used for the Monte Carlo approximation, we can trade off statistical accuracy for computational efficiency. We experimented with learning VGAE and Graphite models by subsampling random entries for the Monte Carlo evaluation of the objective at each iteration. The corresponding AUC scores are shown in Table 4. The impact of subsampling edges for a varying subsampling coefficient in both models is given in the appendix. The results suggest that Graphite can effectively scale to large graphs without significant loss in accuracy.

Cora | Citeseer | Pubmed
VGAE: 89.6 | 92.2 | 92.3
Graphite: 90.5 | 92.5 | 93.1
Table 4: AUC scores for link prediction with Monte Carlo subsampling during training.

7 Conclusion

We proposed Graphite, a scalable framework for deep generative modeling and representation learning in graphs based on variational autoencoding. The encoders and decoders in our generative model are parameterized by graph neural networks that propagate information locally on a graph. To motivate our choice beyond empirical evidence, we highlighted novel connections of graph neural networks to first-order approximations of embedded mean-field inference in related latent variable models for structured data. Finally, our empirical evaluation demonstrated that Graphite consistently outperforms competing approaches for density estimation, link prediction, and node classification.

An interesting direction of future work is to explore the robustness of Graphite to permutation-invariance across graphs by incorporating robust graph representations (verma2017hunt). In the future, we would also like to extend Graphite to richer graphs such as heterogeneous graphs, inference tasks such as community detection, and generative design and synthesis applications.

Acknowledgements

We would like to thank Daniel Levy for helpful comments on early drafts. This research has been supported by a Microsoft Research PhD fellowship in machine learning for the first author, Siemens, a Future of Life Institute grant, and NSF grants #1651565, #1522054, #1733686.

References

  • [1] A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
  • [2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 2002.
  • [3] A. Bojchevski, O. Shchur, D. Zügner, and S. Günnemann. Netgan: Generating graphs via random walks. arXiv preprint arXiv:1803.00816, 2018.
  • [4] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations, 2013.
  • [5] D. Chakrabarti and C. Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Computing Surveys, 38(1):2, 2006.
  • [6] B. L. Douglas. The Weisfeiler-Lehman method and graph isomorphism testing. arXiv preprint arXiv:1101.5211, 2011.
  • [7] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2015.
  • [8] P. Erdös and A. Rényi. On random graphs. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
  • [9] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
  • [10] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, 2017.
  • [11] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In International Joint Conference on Neural Networks, 2005.
  • [12] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, 2007.
  • [13] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In International Conference on Knowledge Discovery and Data Mining, 2016.
  • [14] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.
  • [15] W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
  • [16] C. Hu, P. Rai, and L. Carin. Deep generative models for relational data with side information. In International Conference on Machine Learning, 2017.
  • [17] D. D. Johnson. Learning graphical state transitions. In International Conference on Learning Representations, 2017.
  • [18] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.
  • [19] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 2014.
  • [20] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
  • [21] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel. Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687, 2018.
  • [22] T. N. Kipf and M. Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
  • [23] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
  • [24] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations, 2016.
  • [25] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia. Learning deep generative models of graphs. 2018.
  • [26] J. C. Loehlin. Latent variable models: An introduction to factor, path, and structural analysis. Lawrence Erlbaum Associates Publishers, 1998.
  • [27] Q. Lu and L. Getoor. Link-based classification. In International Conference on Machine Learning, 2003.
  • [28] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • [29] C. Maddison and D. Tarlow. Structured generative models of natural source code. In International Conference on Machine Learning, 2014.
  • [30] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013.
  • [31] M. Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet mathematics, 1(2):226–251, 2004.
  • [32] M. E. Newman. The structure and function of complex networks. SIAM review, 45(2):167–256, 2003.
  • [33] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics, 9(1):48, 2017.
  • [34] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
  • [35] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In International Conference on Knowledge Discovery and Data Mining, 2014.
  • [36] A. Saxena, A. Gupta, and A. Mukerjee. Non-linear dimensionality reduction by locally linear isomaps. In International Conference on Neural Information Processing, 2004.
  • [37] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
  • [38] B. Schölkopf and A. J. Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
  • [39] P. Sen, G. M. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.
  • [40] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge university press, 2004.
  • [41] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
  • [42] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, 2007.
  • [43] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, 2015.
  • [44] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information network embedding. In International World Wide Web Conference, 2015.
  • [45] L. Tang and H. Liu. Leveraging social media networks for classification. Data Mining and Knowledge Discovery, 23(3):447–478, 2011.
  • [46] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph Attention Networks. International Conference on Learning Representations, 2018. accepted as poster.
  • [47] S. Verma and Z.-L. Zhang. Hunt for the unique, stable, sparse and fast feature learning on graphs. In Advances in Neural Information Processing Systems, 2017.
  • [48] M. J. Wainwright, M. I. Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
  • [49] H. Wang, X. Shi, and D.-Y. Yeung. Relational deep learning: A deep latent variable model for link prediction. In AAAI Conference on Artificial Intelligence, 2017.
  • [50] H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang, F. Zhang, X. Xie, and M. Guo. Graphgan: Graph representation learning with generative adversarial nets. In AAAI Conference on Artificial Intelligence, 2018.
  • [51] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’networks. Nature, 393(6684):440, 1998.
  • [52] B. Weisfeiler and A. Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.
  • [53] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. In International Conference on Machine Learning, 2008.
  • [54] Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.
  • [55] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec. Graphrnn: A deep generative model for graphs. arXiv preprint arXiv:1802.08773, 2018.

Appendices

Appendix A Proof of Theorem 2

Proof.

For simplicity, we state the proof for a single variational marginal density q(Z_i | A, X; λ_i) and consider the case where the embeddings μ_i^{(k)} are single dimensional for all i and k.

Let us denote by μ̄_{N(i)}^{(k)} ∈ R^n the vector of neighboring kernel embeddings at iteration k, such that the j-th entry of μ̄_{N(i)}^{(k)} corresponds to μ_j^{(k)} if j ∈ N(i) and is zero otherwise. Hence, we can rewrite Eq. (14) as:

μ_i^{(k)} = f̃(μ̄_{N(i)}^{(k-1)}, A, X)     (15)

where we have overloaded f̃ to now denote a function that takes as argument an n-dimensional vector but is evaluated only with respect to the embeddings of the neighbors of node i.

Assuming that the function f̃ is differentiable, a first-order Taylor expansion of Eq. (15) around the origin is given by:

μ_i^{(k)} ≈ f̃(0, A, X) + ∇f̃(0, A, X)^T μ̄_{N(i)}^{(k-1)}     (16)

Again, for simplicity, let us assume a GNN with a single activation per node in every layer, i.e., d_l = 1 for all l. This also implies that the biases and weights can be expressed as an n-dimensional vector and a scalar respectively, i.e., B_l ∈ R^n and W_l ∈ R. For a single entry of the layerwise activation vector, we can specify Eq. (2) component-wise as:

H_i^{(l)} = η_l ( Σ_{f ∈ F_l} [f(A)]_i H^{(l-1)} W_l + [B_l]_i )     (17)

where [f(A)]_i denotes the i-th row of f(A) and is non-zero only for entries corresponding to the neighbors of node i.

Now, consider the following instantiation of Eq. (17):

  • A family of transformations F_l consisting of a single transformation f where, for every node i, the i-th row [f(A)]_i matches the first-order coefficients of Eq. (16) restricted to the neighbors of node i (together with W_l = 1 and [B_l]_i = f̃(0, A, X))

  • η_l set to the identity function

With the above substitutions, we can equate the first-order approximation in Eq. (16) to the GNN message passing rule in Eq. (17). With vectorized notation, the derivation above also applies to entire vectors of variational marginal embeddings with arbitrary dimensions, thus completing the proof. ∎

Appendix B Experiment Specifications

Nodes | Edges | Node Features | Label Classes
Cora: 2708 | 5429 | 1433 | 7
Citeseer: 3327 | 4732 | 3703 | 6
Pubmed: 19717 | 44338 | 500 | 3
Table 5: Citation network statistics

B.1 Dataset details

Table 5 characterizes the citation networks used in our experiments.

B.2 Link Prediction

Cora | Citeseer | Pubmed | Cora* | Citeseer* | Pubmed*
SC: 92.8 ± 0.12 | 94.4 ± 0.11 | 96.0 ± 0.03 | - | - | -
DeepWalk: 86.6 ± 0.17 | 90.3 ± 0.12 | 91.9 ± 0.05 | - | - | -
node2vec: 87.5 ± 0.14 | 91.3 ± 0.13 | 92.3 ± 0.05 | - | - | -
GAE: 92.4 ± 0.12 | 94.0 ± 0.12 | 94.3 ± 0.5 | 94.3 ± 0.12 | 94.8 ± 0.15 | 96.8 ± 0.04
VGAE: 92.3 ± 0.12 | 94.2 ± 0.12 | 94.2 ± 0.04 | 94.6 ± 0.11 | 97.0 ± 0.08 | 95.5 ± 0.12
Graphite-AE: 92.8 ± 0.13 | 94.1 ± 0.14 | 95.7 ± 0.06 | 94.5 ± 0.14 | 96.1 ± 0.12 | 97.7 ± 0.03
Graphite-VAE: 93.2 ± 0.13 | 95.0 ± 0.10 | 96.0 ± 0.03 | 94.9 ± 0.13 | 97.4 ± 0.06 | 97.4 ± 0.04
Table 6: Average Precision (AP) scores for link prediction (* denotes dataset with features). Higher is better.

We used the SC implementation from [34] and the public implementations made available by the authors for the other baselines. For SC, we used a fixed embedding dimension, and for DeepWalk and node2vec, which use a skipgram-like objective on random walks from the graph, we used the same embedding dimension and the default random walk settings (number of walks per node, walk length, and context size) from [35] and [13] respectively. For node2vec, we additionally searched over the random walk bias parameters using a grid search as prescribed in the original work. For GAE and VGAE, we used the same architecture as in the original VGAE work and the Adam optimizer.

For Graphite-AE and Graphite-VAE, we used an architecture of 32-32 units for the encoder and 16-32-16 units for the decoder, trained using the Adam optimizer [20]. The dropout rate (for edges) and the skip-connection weight λ (described below) were tuned as hyperparameters on the validation set to optimize the AUC, whereas traditional dropout was set to 0 for all datasets. Additionally, we trained every model for a fixed number of iterations and used the model checkpoint with the best validation loss for testing. Scores are reported as an average over 50 runs with different train/validation/test splits (with the requirement that the training graph necessarily be connected).

For Graphite, we observed that using a form of skip connection to define a linear combination of the initial embedding Z and the final induced embedding Z* is particularly useful. The skip connection involves a tunable hyperparameter λ controlling the relative weights of the embeddings. We consider two functions to aggregate them into a final embedding: (1 − λ)Z + λZ* and Z + λZ*, which correspond to a convex combination of the two embeddings and an incremental update to the initial embedding in a given direction, respectively. Note that in either case, GAE and VGAE reduce to a special case of Graphite, using only a single inner-product decoder (i.e., λ = 0). On Cora and Pubmed the final embeddings were derived through the convex combination, on Citeseer through the incremental update.
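
For reference, a small sketch of our reading of the two aggregation functions described above, with lam the tunable skip-connection weight:

```python
def final_embedding(Z0, Z_star, lam, mode="convex"):
    # combine the initial embedding Z0 with the last induced embedding Z_star
    if mode == "convex":          # used for Cora and Pubmed
        return (1.0 - lam) * Z0 + lam * Z_star
    if mode == "incremental":     # used for Citeseer
        return Z0 + lam * Z_star
    raise ValueError("unknown mode: " + mode)
```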

The AP results are shown in Table 6. The AUC results trained with edge subsampling are shown in Table 4.

Figure 4: AUC score of VGAE and Graphite with subsampled edges on the Cora dataset.

B.3 Semi-supervised Classification

We report the baseline results for SemiEmb [53], DeepWalk [35], ICA [27] and Planetoid [54] as specified in [23]. The GCN baseline uses a 32-16 architecture with ReLU activations and early stopping after a fixed number of epochs without an increase in validation accuracy. The Graphite-hybrid model uses the same architecture as in link prediction (with no edge dropout). The parameters of the posterior distributions are concatenated with the node features to predict the final output. The Graphite-gen model contains hidden layers with 16 units for all parameterized learned distributions. The parameters for both models are learned using the Adam optimizer [20]. All accuracies are taken as an average of 100 runs.

B.4 Density Estimation

To accommodate input graphs of different sizes, we learn a model architecture specified for the maximum possible number of nodes (20 in this case). While feeding in smaller graphs, we simply add dummy nodes disconnected from the rest of the graph. The dummy nodes have no influence on the gradient updates for the parameters affecting the latent or observed variables involving nodes in the true graph. For the experiments on density estimation, we pick a graph family, then train and validate on graphs sampled exclusively from that family. We consider graphs with between 10 and 20 nodes belonging to the following graph families (a sampling sketch follows the list):

  • Erdos-Renyi [8]: each edge is sampled independently with a fixed probability p

  • Ego Network: a random Erdos-Renyi graph in which all nodes are neighbors of one randomly chosen node

  • Random Regular: uniformly random regular graph with a fixed degree d

  • Random Geometric: graph induced by uniformly random points in the unit square, with edges between points at Euclidean distance less than a fixed radius r

  • Random Power Law Tree: tree generated by randomly swapping elements from a degree distribution to satisfy a power law distribution with a fixed exponent

  • Barabasi-Albert [1]: preferential attachment graph generation with a fixed attachment edge count m
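
For reference, the graph families above can be sampled with standard NetworkX generators; the specific parameter values below (edge probability, degree, radius, power-law exponent, attachment count) are illustrative placeholders, since the exact values are not recoverable from this text.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def sample_graph(family, n):
    if family == "erdos_renyi":
        return nx.erdos_renyi_graph(n, p=0.3)
    if family == "ego":
        # Erdos-Renyi graph plus edges from one randomly chosen hub to all nodes
        g = nx.erdos_renyi_graph(n, p=0.3)
        hub = int(rng.integers(n))
        g.add_edges_from((hub, v) for v in g.nodes if v != hub)
        return g
    if family == "regular":
        return nx.random_regular_graph(d=4, n=n)
    if family == "geometric":
        return nx.random_geometric_graph(n, radius=0.5)
    if family == "power_law_tree":
        return nx.random_powerlaw_tree(n, gamma=3, tries=10000)
    if family == "barabasi_albert":
        return nx.barabasi_albert_graph(n, m=4)
    raise ValueError(family)

# e.g., 300 training instances with 10-20 nodes each
graphs = [sample_graph("erdos_renyi", int(rng.integers(10, 21))) for _ in range(300)]
adjacency = [nx.to_numpy_array(g) for g in graphs]
```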

We use convex combinations over three successively induced embeddings. Scores are reported as an average over 50 runs. Additionally, a two-layer neural network is applied to the initially sampled embedding Z before it is fed to the inner product decoder for GAE and VGAE, or to the iterations of Eqs. (9) and (10) for Graphite and Graphite-AE.

B.5 Further discussion on related work

Factorization-based approaches operate on a matrix representation of the graph, such as the adjacency matrix or the graph Laplacian. These approaches are closely related to dimensionality reduction and can be computationally expensive. Popular approaches include Laplacian Eigenmaps [2] and IsoMaps [36].

Random-walk methods are based on variations of the skip-gram objective [30] and learn representations by linearizing the graph through random walks. These methods, in particular DeepWalk [35], LINE [44], and node2vec [13], learn general-purpose unsupervised representations that have been shown to give excellent performance for semi-supervised node classification and link prediction. Planetoid [54] learns representations based on a similar objective, specifically for semi-supervised node classification, by explicitly accounting for the available label information during learning.