Graph Classification with Recurrent Variational Neural Networks

02/07/2019
by   Edouard Pineau, et al.

We address the problem of graph classification based only on structural information. Most standard methods require either pairwise comparisons of all graphs in the dataset or the extraction of ad-hoc features to perform classification. These methods respectively raise scalability issues when the number of samples in the dataset is large, and flexibility issues when the discriminative information is characterized by exotic features. Recent advances in neural network architectures offer new possibilities for graph analysis in terms of scalability and feature learning. In this paper, we propose a new sequential approach using recurrent neural networks (RNN). Our model sequentially embeds graph information, which is then used to model the final class membership probabilities. We also propose a regularization based on variational node prediction that leads to better learning and generalization. We experimentally show that our model reaches state-of-the-art classification results on several common molecular datasets. Finally, we perform a qualitative analysis and give some insights into how the joint node prediction helps the model to better classify graphs.



1 Introduction

Many natural or synthetic systems admit a natural graph representation in which entities are described through their mutual connections: chemical compounds and social or biological networks, for example. Therefore, automatic mining of such structures is useful in a variety of applications.

Graphs can be studied either individually, considering the nodes as samples, or collectively, each sample of the dataset being a graph object. Here, we consider the latter case, applied to a classification task. This setting raises several difficulties when leveraging standard machine learning algorithms. Indeed, most of these algorithms take vectors of fixed size as inputs. In the case of graphs, usual representations such as the edge list or the adjacency matrix do not match this constraint. The size of these representations is graph dependent (number of edges in the first case, number of nodes squared in the second) and they are index dependent, i.e., up to a reindexing of its nodes, the same graph admits several equivalent representations. In a classification task, the label of a graph is independent of the indices of its nodes, so the model used for prediction should be invariant to node ordering as well. On the other hand, the problem of variable-size inputs is well known in the field of natural language processing (NLP), as sentences have variable lengths. This is why we adapt NLP techniques to our problem in order to overcome this specific difficulty.

In this paper, we propose a method to sequentially embed graph information in order to perform classification. By construction, our model overcomes the common difficulties listed above: in particular, the sequential modelling handles the graph-dependent size of the input. Although recurrent neural networks have the capacity to deal with large datasets, they remain time-consuming to train. We propose a regularization that leads to more efficient learning and better generalization, offering additional scalability to our model.

To address the problem, we use neither node attributes nor edge attributes. In this way, we aim to reveal the intrinsic capacity of recurrent neural embeddings for pattern recognition: adding node information would obscure the origin of our model's performance.

The rest of this paper is organized as follows: section 2 presents the state-of-the-art methods for graph classification without node features, section 3 introduces our sequential model, section 4 describes our experimental procedure and results, and section 5 draws some conclusions.

2 Related work

Graph classification methods can schematically be divided into three categories: graph kernels, sequential methods and embedding methods. In this section, we briefly present these different approaches.

2.1 Kernel methods

Kernel methods (nikolentzos2017kernel; nikolentzos2017matching; nikolentzos2018degeneracy; neumann2016propagation) perform pairwise comparisons between the graphs of the dataset and apply a classifier, usually a support vector machine (SVM), to the similarity matrix. In order to keep the number of comparisons tractable when the number of graphs is large, they often use the Nyström algorithm (williams2001using) to compute a low-rank approximation of the similarity matrix. The key is to construct an efficient kernel that can be applied to graphs of varying sizes and captures useful features for the downstream classification. The Weisfeiler-Lehman subtree kernel (shervashidze2011weisfeiler) has proven very efficient for such tasks (yanardag2015deep), but it requires graphs with labeled nodes and is therefore not applicable in our unlabeled-graph setting.
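To make this pipeline concrete, here is a minimal sketch (our own illustration, not the cited authors' code) of a Nyström low-rank approximation of a precomputed graph similarity matrix followed by a linear SVM; the kernel matrix K, the labels y and the landmark indices are assumed to be provided by some graph kernel and dataset.

# Nystrom low-rank approximation of a precomputed graph kernel matrix K,
# followed by a linear SVM on the resulting feature vectors (sketch only).
import numpy as np
from sklearn.svm import LinearSVC

def nystrom_features(K, landmarks, eps=1e-8):
    """Map each graph to an m-dimensional feature vector using m landmark graphs."""
    C = K[:, landmarks]                      # (n, m) similarities to the landmarks
    W = K[np.ix_(landmarks, landmarks)]      # (m, m) landmark block
    vals, vecs = np.linalg.eigh(W)
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T
    return C @ inv_sqrt                      # (n, m) approximate kernel features

# Usage (K is an (n, n) kernel matrix, y the graph labels):
# landmarks = np.random.choice(len(y), size=100, replace=False)
# X = nystrom_features(K, landmarks)
# clf = LinearSVC().fit(X, y)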

2.2 Sequential methods

Some random walk models are used to address node classification (callut2008classification) or graph classification (xu2012protein) problems. The idea is to sequentially walk on a graph, one node at a time in a random fashion, and agglomerate information. The graph is represented by a discrete-time Markov chain where each node is associated with a state of the chain, and each transition probability is proportional to the corresponding adjacency entry. More recently, (jin2018learning) and (you2018graphrnn) transform a graph into a sequence of fixed-size vectors. Each of these vectors is an embedding of one node of the graph. The sequence of embeddings is then fed to a recurrent neural network (RNN). The two main challenges in this kind of approach are the design of the embedding function for the nodes and the order in which the embeddings are given to the recurrent neural network.
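As a small illustration of the Markov-chain view (our own sketch, not taken from the cited works), the transition matrix of such a random walk is simply the row-normalized adjacency matrix:

# Random-walk transition matrix: P[i, j] is the probability of walking
# from node i to node j, proportional to the adjacency entry A[i, j].
import numpy as np

def transition_matrix(A):
    """Row-normalize a boolean adjacency matrix into random-walk probabilities."""
    deg = A.sum(axis=1, keepdims=True).astype(float)
    deg[deg == 0] = 1.0               # avoid division by zero; isolated nodes keep an all-zero row
    return A / deg

# A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
# P = transition_matrix(A)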

2.3 Embedding methods

Embedding methods (barnett2016feature; DBLP:journals/corr/NarayananCVCLJ17) derive a fixed number of features for each graph, which are used as vector representations for classification. Some algorithms consider features based on the dynamics of random walks on the graph (gomez2017dynamics) while others are graphlet based (dutta2017high). While deriving a good set of features is often a difficult task, this approach has the merit of being compatible with any standard classifier (SVM, random forest, multilayer perceptron) in a plug-and-play fashion.

3 Model

We propose to use a sequential approach to embed graphs with a variable number of nodes and edges into a vector space of a chosen dimension. This latent representation is then used for classification. Node index invariance is approximated through specific pre-processing and aggregation.

Let $G = (V, E)$ be an undirected and unweighted graph with a set of nodes $V$ and a set of edges $E$. The graph can be represented, modulo any permutation over its nodes, by its boolean adjacency matrix $A$, such that $A_{ij} = 1$ if the nodes indexed by $i$ and $j$ are connected in the graph and $A_{ij} = 0$ otherwise. We use this adjacency matrix as a raw representation of the graph.
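As a minimal sketch under our own conventions (0-indexed nodes, undirected edge list), this raw representation can be built as follows:

# Boolean adjacency matrix built from an undirected edge list (illustrative sketch).
import numpy as np

def adjacency_matrix(n_nodes, edges):
    A = np.zeros((n_nodes, n_nodes), dtype=np.uint8)
    for i, j in edges:
        A[i, j] = 1
        A[j, i] = 1          # undirected graph: symmetric matrix
    return A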

Our model is a recurrent variational neural network classifier (RVNC), composed of three main parts: node ordering and embedding, classification, and regularization with variational auto-regression (VAR); see figure 1 for an illustration. Each of these parts is detailed in subsections 3.1, 3.2 and 3.3 respectively.

Figure 1: Macroscopic description of RVNC. The three main blocks are described respectively in sections 3.1, 3.2 and 3.3.

3.1 Node ordering and embedding

Before being processed by the neural network, the adjacency matrix of a graph is transformed on-the-fly (you2018graphrnn). First, a node is selected at random and used as the root of a breadth-first search (BFS) over the graph. The rows and columns of the adjacency matrix are then reordered according to the sequence of nodes returned by the BFS. Next, each row (corresponding to one node in the BFS ordering) is truncated to keep only the connections of that node with the nodes that preceded it in the BFS. This way, each node is represented by an $M$-dimensional binary vector, where $M$ is a dataset-dependent truncation size (see section 4.2), and each truncated row is zero-padded to dimension $M$. Throughout the rest of the paper, we write $x_i$ for the $M$-dimensional vector of the $i$-th node in the BFS ordering.
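The following sketch illustrates this preprocessing under our own assumptions (pure NumPy, connected graphs); it is not the authors' exact implementation, and M plays the role of the dataset-dependent input size of section 4.2.

# Random-root BFS reordering, truncation to the M previous nodes, and zero-padding.
import numpy as np
from collections import deque

def bfs_order(A, root):
    """Return the node indices of A in BFS order starting from `root` (assumes a connected graph)."""
    seen, order, queue = {root}, [root], deque([root])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(A[u]):
            if v not in seen:
                seen.add(v); order.append(v); queue.append(v)
    return order

def to_sequence(A, M, rng=np.random):
    """Turn an adjacency matrix into an (n, M) sequence of truncated binary rows."""
    order = bfs_order(A, rng.randint(A.shape[0]))   # random BFS root
    A = A[np.ix_(order, order)]                     # reorder rows and columns
    n = A.shape[0]
    X = np.zeros((n, M), dtype=np.float32)
    for i in range(n):
        prev = A[i, max(0, i - M):i]                # connections to the M previous nodes
        X[i, :len(prev)] = prev                     # zero-padded to length M
    return X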

After node ordering and pre-embedding, each graph is processed as a sequence of $M$-dimensional node vectors by a gated recurrent unit (GRU) neural network (cho2014learning). The GRU is a special RNN able to learn long-term dependencies by mitigating the vanishing-gradient effect [1].

[1] The choice of GRU over Long Short-Term Memory networks is arbitrary, as they have equivalent long-term modelling power (chung2014empirical).

In order to ease the training of the recurrent network, we add a simple fully connected network between the pre-embedding and the recurrent embedding. Therefore, each node is presented to the GRU as a continuous vector instead of a binary adjacency vector.

The GRU sequentially embeds each node $x_i$ by using the information contained in $x_i$ and in the memory cell $h_{i-1}$ with the recurrent process

$z_i = \sigma(W_z x_i + U_z h_{i-1})$
$r_i = \sigma(W_r x_i + U_r h_{i-1})$
$\tilde{h}_i = \tanh(W_h x_i + U_h (r_i \odot h_{i-1}))$
$h_i = (1 - z_i) \odot h_{i-1} + z_i \odot \tilde{h}_i$

where $W_z$, $W_r$, $W_h$, $U_z$, $U_r$ and $U_h$ are GRU parameters, $\odot$ denotes element-wise vector multiplication and $\sigma$ is the sigmoid function.

The embedded node sequence $(h_1, \ldots, h_n)$ feeds both the VAR and the classifier, as discussed in the subsequent sections.

The embedding part is illustrated in the top line of figure 2.
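A hedged PyTorch sketch of this embedding block is given below; the layer sizes and the ReLU after the pre-embedding are illustrative assumptions rather than the paper's exact configuration.

# Fully connected pre-embedding followed by a 2-layer GRU (sketch only).
import torch
import torch.nn as nn

class NodeEmbedder(nn.Module):
    def __init__(self, M, pre_dim=64, hidden_dim=128):
        super().__init__()
        self.pre = nn.Linear(M, pre_dim)            # binary adjacency rows -> continuous vectors
        self.gru = nn.GRU(pre_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, x):                           # x: (batch, n_nodes, M)
        h, _ = self.gru(torch.relu(self.pre(x)))    # h: (batch, n_nodes, hidden_dim)
        return h                                    # one embedding h_i per node

# Usage: x = torch.from_numpy(to_sequence(A, M)).unsqueeze(0); h = NodeEmbedder(M)(x)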

3.2 Classification

After the embedding step, we use an additional GRU dedicated to classification that takes the embedded node sequence $(h_1, \ldots, h_n)$ as input. Its last memory cell feeds a softmax multilayer perceptron (MLP) which performs the class prediction.

Formally, let $y$ be the class index; the classifier is trained by minimizing the cross-entropy loss

$\mathcal{L}_{\mathrm{class}} = -\log \hat{p}_y$

where $\hat{p}$ is the softmax class membership probability vector for a given graph whose nodes have been sorted by a BFS rooted at a randomly selected node.

As discriminating patterns might be spread across the whole graph, the network is required to model long-term dependencies. By construction, GRUs have this ability.

The classification part is illustrated in the middle line of figure 2.
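The sketch below illustrates one possible implementation of this classification head in PyTorch; the hidden sizes are assumptions, while the layer counts and dropout follow table 2.

# Classification head: a second GRU over the node embeddings, whose last memory
# cell feeds a softmax MLP trained with the cross-entropy loss (sketch only).
import torch
import torch.nn as nn

class GraphClassifier(nn.Module):
    def __init__(self, hidden_dim=128, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(hidden_dim, hidden_dim, num_layers=2,
                          batch_first=True, dropout=0.25)
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, n_classes))   # logits; softmax is inside the loss

    def forward(self, h):                    # h: (batch, n_nodes, hidden_dim)
        _, last = self.gru(h)                # last: (num_layers, batch, hidden_dim)
        return self.mlp(last[-1])            # class logits

# loss = nn.CrossEntropyLoss()(GraphClassifier()(h), labels)   # cross-entropy objective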

3.3 Regularization with variational auto-regression

As the structure of a graph is the concatenation of the interactions between all nodes and their respective neighbors, learning a good representation without using node attributes requires the model to capture the structure of the graph while classifying. In order to do so, we add an auto-regression block to our model: at each node, the network makes a prediction for the next node adjacency. This task is carried out separately from the recurrent classifier.

We use a variational auto-encoder (VAE) (kingma2013auto) to learn to predict the next node adjacency vector $x_{i+1}$ given the embedding $h_i$. This constitutes the VAR. Such a representation for sequence classification has already been used for sentiment analysis (latif2017variational; xu2017variational). It is the natural language processing equivalent of predicting the $(i+1)$-th word of a sentence given an aggregated representation of this sentence up to word $i$.

For each graph with embedded nodes $h_1, \ldots, h_n$ (see 3.1), the fully connected variational auto-encoder takes each $h_i$ as input. Let $z_i$ be the latent random variable associated with step $i$ of the model. Training is done by minimizing the loss

$\mathcal{L}_{\mathrm{VAR}} = \sum_{i} \Big( -\mathbb{E}_{q_\phi(z_i \mid h_i)}\big[\log p_\theta(x_{i+1} \mid z_i)\big] + \mathrm{KL}\big(q_\phi(z_i \mid h_i) \,\|\, p(z_i)\big) \Big),$

averaged over the empirical distribution of the graphs; this loss is an upper bound of the negative marginal log-likelihood (its opposite is the usual evidence lower bound). $q_\phi$ and $p_\theta$ are the respective densities of the encoder and the decoder, whose distributions are parameterized by $\phi$ and $\theta$ respectively. KL denotes the Kullback-Leibler divergence and $p$ is the density of the prior distribution of the latent variables $z_i$. We use the standard VAE prior distribution $p(z_i) = \mathcal{N}(0, I)$.

In practice, $q_\phi$ and $p_\theta$ are modelled by neural networks parametrized by $\phi$ and $\theta$, which require differentiable functions for training. However, $p_\theta$ models a binary adjacency vector representing the connections between node $i+1$ and the previously visited nodes. Therefore, we use a continuous relaxation of discrete sampling, the Gumbel trick (jang2016categorical), to train our neural-network-based model.

The regularization part is illustrated in the bottom line of figure 2.

In the end, the model is trained by minimizing the total loss

$\mathcal{L} = \mathcal{L}_{\mathrm{class}} + \lambda \, \mathcal{L}_{\mathrm{VAR}},$

where $\lambda$ is a hyper-parameter.
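The sketch below illustrates one possible PyTorch implementation of the VAR and of the lambda-weighted total loss; the layer sizes, the single-sample ELBO estimate and the use of torch.distributions.RelaxedBernoulli for the Gumbel-sigmoid relaxation are our own assumptions, not the paper's exact code.

# Variational auto-regressor: encode h_i into a Gaussian latent z_i, decode z_i
# into logits for the next adjacency row x_{i+1}; a relaxed Bernoulli keeps
# binary sampling differentiable (sketch only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAR(nn.Module):
    def __init__(self, hidden_dim=128, latent_dim=32, M=25, temperature=0.5):
        super().__init__()
        self.enc = nn.Linear(hidden_dim, 2 * latent_dim)          # -> (mu, log variance)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, M))        # -> adjacency logits
        self.temperature = temperature

    def loss(self, h, x_next):                                    # h_i and target x_{i+1}
        mu, logvar = self.enc(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        recon = F.binary_cross_entropy_with_logits(self.dec(z), x_next, reduction='sum')
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl                                         # negative ELBO

    def sample_next(self, h):                                     # relaxed binary prediction
        mu, logvar = self.enc(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return torch.distributions.RelaxedBernoulli(
            self.temperature, logits=self.dec(z)).rsample()       # Gumbel-sigmoid relaxation

# total_loss = class_loss + lam * var.loss(h[:, :-1], x[:, 1:])   # lambda-weighted sum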

Figure 2: Architecture for RVNC. Top: node ordering and embedding (section 3.1). Middle: classification (section 3.2). Bottom: regularization with variational auto-regression (section 3.3) plus final aggregation (section 3.4).

3.4 Aggregation of the results for testing

The node ordering step introduces randomness into our model. On the one hand, it helps learn more general graph representations during the training phase, but on the other hand, it might produce different outputs for the same graph during the testing phase, depending on the root of the BFS. In order to counter this side effect, we add the following aggregation step for the testing phase. Each graph is processed $K$ times by the model with different roots for the BFS ordering. The $K$ class membership probability vectors are extracted and averaged element-wise into a single score vector

$\bar{p} = \frac{1}{K} \sum_{k=1}^{K} \hat{p}^{(k)}.$

This soft vote is repeated $R$ times, resulting in $R$ averaged probability vectors for each graph. The final class attributed to a graph corresponds to the highest probability found among these $R$ vectors. This second, hard vote selects the batch of votes for which the model is the most confident.
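A sketch of this two-stage vote is given below; `model` and `preprocess` are hypothetical placeholders for the trained network and the BFS preprocessing of section 3.1, and the values of K and R are illustrative.

# Test-time aggregation: K differently rooted BFS passes are averaged into one
# probability vector (soft vote), repeated R times; the class of the single most
# confident entry across the R vectors is kept (hard vote). Sketch only.
import torch

def predict(model, A, M, K=10, R=5):
    votes = []
    for _ in range(R):
        probs = torch.stack([torch.softmax(model(preprocess(A, M)), dim=-1)
                             for _ in range(K)])             # one random BFS root per pass
        votes.append(probs.mean(dim=0))                      # soft vote: average the K vectors
    votes = torch.stack(votes)                               # (R, n_classes)
    return int(votes.flatten().argmax() % votes.shape[-1])   # hard vote: most confident entry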

Figure 2 provides a detailed illustration of our model.

4 Experiments

4.1 Datasets

We evaluated our model on four standard datasets from biology: Mutag (MT), Enzymes (EZ), Proteins Full (PF) and National Cancer Institute (NCI1) (KKMMN2016). All graphs represent chemical compounds: nodes are molecular substructures (typically atoms) and edges represent connections between these substructures (chemical bonds or spatial proximity). In MT, the compounds are either mutagenic or not. EZ contains tertiary structures of proteins from the 6 Enzyme Commission top-level classes; it is the only multiclass dataset of this paper. PF is a subset of the Dobson and Doig dataset representing secondary structures of proteins that are either enzymes or non-enzymes. In NCI1, compounds either have an anti-cancer activity or do not. Statistics about the graphs are presented in table 1.

MT EZ PF NCI1
graphs 188 600 1113 4110
classes 2 6 2 2
bias 0.66 0.17 0.60 0.5
avg. |V| 18 33 39 29.9
avg. |E| 39 124 146 64.6
Table 1: Basic characteristics of the datasets. Bias indicates the proportion of the largest class.

4.2 Experimental setup

MT, EZ, PF and NCI1 are respectively divided into 3, 10, 10 and 10 folds such that the class proportions are preserved in each fold for all datasets [2]. These folds are then used for cross-validation, i.e., one fold serves as the testing set while the other ones compose the training set. Results are averaged over all testing sets.

[2] MT counts only 188 graphs, therefore a 10-fold cross-validation does not give any assurance of having representative samples at test time.
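As an illustration of this protocol (our own sketch, not the authors' script), scikit-learn's StratifiedKFold preserves class proportions in each fold; `train_and_eval` is a hypothetical callback that trains the model and returns the test accuracy for one fold.

# Stratified k-fold cross-validation with accuracy averaged over all test folds.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(labels, train_and_eval, n_folds=10, seed=0):
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(np.zeros(len(labels)), labels):
        scores.append(train_and_eval(train_idx, test_idx))   # user-supplied training routine
    return float(np.mean(scores))                            # accuracy averaged over folds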

Our model is implemented in PyTorch (paszke2017pytorch) and trained with the Adam stochastic optimization method (kingma2014adam) on an NVIDIA TitanXp GPU. Table 2 details the architecture of the model.

Step                  Architecture
BFS / pre-embedding   1-layer FC
Embedding             2-layer GRU
VAR                   Encoder: 1-layer FC + Gaussian sampling
                      Predictor: 2-layer ReLU FC + Gumbel sigmoid sampling
Classifier            2-layer GRU + DP(0.25)
                      2-layer ReLU FC + SF
Table 2: Generic architecture used in our experiments. ReLU FC stands for a fully-connected network with ReLU activation. DP stands for dropout. SF stands for softmax.

The input size $M$ of the recurrent neural network is chosen for each dataset according to the algorithm described in (you2018graphrnn), namely 11 for MT, 25 for EZ, 80 for PF and 11 for NCI1. For training, the batch size is set to 5 and the learning rate is decreased twice during training. We use the same hyper-parameters, including the regularization weight $\lambda$, for every dataset in order to avoid over-fitting and to unveil the model's capacities.

4.3 Results

We compare our results to those obtained by Earth Mover’s Distance (nikolentzos2017matching) (EMD), Pyramid Match (nikolentzos2017matching) (PM), Feature-Based (barnett2016feature) (FB), Dynamic-Based Features (gomez2017dynamics) (DyF) and Stochastic Graphlet Embedding (dutta2017high) (SGE). All values are directly taken from the aforementioned papers as they use a setup similar to ours, except for the number of folds for MT. For algorithms presenting results with and without node features, we reported the results without node features. For algorithms presenting results with several sets of hyper-parameters, we have reported the results for the set of parameters that performed best on the largest number of datasets. Results are reported in table 3. We obtain state-of-the-art results on three out of the four datasets used for this paper and the second best result on the fourth one.

       MT              EZ    PF    NCI1
EMD    86.1            36.8  -     72.7
PM     85.6            28.2  -     69.7
FB     84.7            29.0  70.0  62.9
DyF    86.3            26.6  73.1  66.6
SGE    87.3            40.7  71.9  -
RVNC   84.7 / 88.3 [3] 48.4  74.8  80.7
Table 3: Experimental results of different models plus our own on four standard molecular datasets.
[3] Respectively with 3-fold and 10-fold cross-validation; see the experimental setup and footnote 2.

4.4 Analysis

Node indexing invariance

Our model is designed to be independent of the node ordering of the graph with respect to different BFS roots. Inputs representing the same graph (up to node ordering) should be close to one another in the latent embedding space. As the preprocessing is performed on each graph at each epoch, the same graph is processed many times by the model during training with different embeddings. This creates a natural regularization for the network. Indeed, as illustrated in figure 3, the projections corresponding to the same graph form a tight cluster in the low-dimensional representation of the latent space.

Figure 3: t-SNE projection of the latent state preceding classification for five graphs of EZ, each processed with 20 different BFS roots. Colors and markers represent the respective classes of the graphs.
Contribution of the VAR to classification

In order to demonstrate the interest of training our model to both auto-regress and classify each sample, we ran experiments removing the VAR regularization (setting $\lambda = 0$) on a 90/10 train-test split of the EZ dataset (the same proportions as in the main experiments). We observe in figure 4 a more efficient training procedure and faster generalization when $\lambda$ is positive. This allows the model to converge in less than one day on every dataset we used.

Figure 4: Classification accuracy as a function of the number of training epochs with and without the variational auto-regressor ($\lambda > 0$ vs. $\lambda = 0$) for EZ with a 90/10 train-test split. Top: training set. Bottom: testing set.
Auto-regression quality

In order to provide some intuition about the node prediction capacities of our network, we propose in figure 5 an illustration of some graphs and their auto-regressed counterparts.

Figure 5: Left: representation of the same graph after two differently rooted BFS orderings and truncation. Right: the corresponding auto-regressions.

5 Conclusion

In this work, we introduced a recurrent neural network based embedding method for graphs. We applied our model to graph classification without node or edge attributes. As each graph can be processed individually, there is no scalability issue with respect to the number of samples in the dataset. Features are neither ad-hoc nor handcrafted but learned from the data. In the end, by jointly training the classification and prediction models, we obtain state-of-the-art results on standard benchmark datasets.

Moreover, this model can easily be adapted to incorporate exogenous information such as node or edge attributes. This could be addressed in future work.

Acknowledgments

We would like to thank NVIDIA and its GPU Grant Program for providing the hardware we used in our experiments.

References