Many natural or synthetic systems have a natural graph representation where entities are described through their mutual connections: chemical compounds, social or biological networks, for example. Therefore, automatic mining of such structures is useful in a variety of applications.
Graphs can be studied either individually, considering the nodes as samples, or collectively, each sample of the dataset being a graph object. Here, we consider the latter case, applied to a classification task. This setting raises several difficulties when leveraging standard machine learning algorithms. Indeed, most of these algorithms take vectors of fixed size as inputs. In the case of graphs, usual representations such as the edge list or the adjacency matrix do not match this constraint. The size of the representation is graph dependent (number of edges in the first case, number of nodes squared in the second) and these representations are index dependent, i.e., up to the indexing of its nodes, the same graph admits several equivalent representations. In a classification task, the label of a graph is independent of the indices of its nodes, so the model used for prediction should be invariant to node ordering as well. On the other hand, the problem of variable-size inputs is well known in the field of natural language processing (NLP), as sentences have variable lengths. This is why we adapt NLP techniques to our problem in order to overcome this specific difficulty.
In this paper, we propose a method to sequentially embed graph information in order to perform classification. By construction, our model overcomes the common difficulties listed above. In fact, the sequential modelling handles the graph-dependent size of the input. Although recurrent neural networks have the capacity to deal with large datasets, their learning phase remains time-consuming. We propose a regularization that leads to more efficient learning and better generalization, offering additional scalability to our model.
To address the problem, we use neither node attributes nor edge attributes. This way, we want to reveal the intrinsic capacities of recurrent neural embedding for pattern recognition. Adding node information would make it unclear where the performance of our model comes from.
2 Related work
Graph classification methods can schematically be divided into three categories: graph kernels, sequential methods and embedding methods. In this section, we briefly present these different approaches.
2.1 Kernel methods
Kernel methods (nikolentzos2017kernel; nikolentzos2017matching; nikolentzos2018degeneracy; neumann2016propagation) perform pairwise comparisons between the graphs of the dataset and apply a classifier, usually a support vector machine (SVM), on the similarity matrix. In order to keep the number of comparisons tractable when the number of graphs is large, they often use the Nyström algorithm (williams2001using) to compute a low-rank approximation of the similarity matrix. The key is to construct an efficient kernel that can be applied to graphs of varying sizes and captures useful features for the downstream classification. The Weisfeiler-Lehman subtree kernel (shervashidze2011weisfeiler) has proven to be very efficient for such tasks (yanardag2015deep), but it requires graphs with labeled nodes, and is therefore not applicable in our unlabeled graph case study.
2.2 Sequential methods
Some random walk models are used to address node classification (callut2008classification) or graph classification (xu2012protein) problems. The idea is to sequentially walk on a graph, one node at a time in a random fashion, and agglomerate information. The graph is represented by a discrete-time Markov chain where each node is associated with a state of the chain, and each transition probability is proportional to the corresponding adjacency entry. More recently, jin2018learning and you2018graphrnn transform a graph into a sequence of fixed-size vectors. Each of these vectors is an embedding of one node of the graph. The sequence of embeddings is then fed to a recurrent neural network (RNN). The two main challenges in this kind of approach are the design of the embedding function for the nodes and the order in which the embeddings are given to the recurrent neural network.
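As a minimal sketch of the Markov-chain view described above (not the cited authors' code), the transition matrix of a simple random walk can be obtained by dividing each row of the adjacency matrix by the node degree. The function name `transition_matrix` and the toy path graph are assumptions made for illustration.

```python
def transition_matrix(adj):
    """Row-normalize a 0/1 adjacency matrix into random-walk transition probabilities."""
    probs = []
    for row in adj:
        degree = sum(row)
        if degree == 0:
            # Isolated node: no outgoing transition is defined, keep a zero row.
            probs.append([0.0] * len(row))
        else:
            probs.append([a / degree for a in row])
    return probs


# Toy path graph 0 - 1 - 2 (hypothetical example).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
P = transition_matrix(adj)
```

Each row of `P` is the distribution of the next state given the current node, so a walk can be simulated by repeatedly sampling from the row of the current node.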
2.3 Embedding methods
Embedding methods (barnett2016feature; DBLP:journals/corr/NarayananCVCLJ17) derive a fixed number of features for each graph, which are used as a vector representation for classification. Some algorithms consider features based on the dynamics of random walks on the graph (gomez2017dynamics) while others are graphlet based (dutta2017high).
We propose to use a sequential approach to embed graphs with a variable number of nodes and edges into a vector space of a chosen dimension. This latent representation is then used for classification. Node index invariance is approximated through specific pre-processing and aggregation.
Let $G = (V, E)$ be an undirected and unweighted graph with a set of nodes $V$ and a set of edges $E$, and let $n = |V|$. The graph can be represented, modulo any permutation over its nodes, by its boolean adjacency matrix $A \in \{0, 1\}^{n \times n}$ such that $A_{ij} = 1$ if nodes indexed by $i$ and $j$ are connected in the graph and $A_{ij} = 0$ otherwise. We use this adjacency matrix as a raw representation of the graph.
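The dependence of the adjacency matrix on node indexing can be illustrated with a small sketch. The helpers `adjacency_matrix` and `relabel` and the toy path graph are hypothetical names chosen for this example; they only show that two different matrices can describe the same graph.

```python
def adjacency_matrix(n, edges):
    """Build the symmetric 0/1 adjacency matrix of an undirected graph."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


def relabel(adj, perm):
    """Re-index nodes: new node i corresponds to old node perm[i]."""
    n = len(adj)
    return [[adj[perm[i]][perm[j]] for j in range(n)] for i in range(n)]


# Path graph 0 - 1 - 2 and the same graph with nodes 0 and 1 swapped.
A = adjacency_matrix(3, [(0, 1), (1, 2)])
B = relabel(A, [1, 0, 2])

# The matrices differ, but index-free statistics such as the degree
# multiset are identical because the underlying graph is the same.
degrees_A = sorted(sum(row) for row in A)
degrees_B = sorted(sum(row) for row in B)
```

A classifier fed `A` or `B` directly could produce different outputs for the same graph, which is the invariance issue the pre-processing and aggregation steps aim to reduce.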
Our model is a recurrent variational neural network classifier (RVNC), composed of three main parts: node ordering and embedding, classification and regularization with variational auto-regression (VAR), see figure 1 for an illustration. Each of these parts will be detailed respectively in subsections 3.1, 3.2 and 3.3.
3.1 Node ordering and embedding
Before being processed by the neural network, the adjacency matrix of a graph is transformed on-the-fly (you2018graphrnn). First, a node is selected at random and used as root for a breadth-first search (BFS) over the graph. The rows and columns of the adjacency matrix are then reordered according to the sequence of nodes returned by the BFS. Next, each row $i$ (corresponding to the $i$-th node in the BFS ordering) is truncated to keep only the connections of node $i$ with the $d$ nodes that preceded it in the BFS. This way, each node is $d$-dimensional, and each truncated matrix is zero-padded in order to have $n \times d$ dimensions, where $n$ is the number of nodes. Throughout the rest of the paper, we use the notation for .
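The pre-processing described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: the function names, the choice to store the most recent predecessor first, and the restart of the BFS on unvisited nodes for disconnected graphs are all assumptions.

```python
from collections import deque


def bfs_order(adj, root):
    """Return the nodes in BFS order starting from root."""
    n = len(adj)
    visited = [False] * n
    order = []
    # Assumption: restart from unvisited nodes so disconnected graphs are covered.
    for start in [root] + [v for v in range(n) if v != root]:
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in range(n):
                if adj[u][v] and not visited[v]:
                    visited[v] = True
                    queue.append(v)
    return order


def pre_embed(adj, root, d):
    """Reorder by BFS, keep links to the d preceding nodes, zero-pad to width d."""
    order = bfs_order(adj, root)
    rows = []
    for i, node in enumerate(order):
        row = [0] * d  # zero-padding for nodes with fewer than d predecessors
        for k in range(min(i, d)):
            # Entry k holds the link to the (k+1)-th most recent predecessor.
            row[k] = adj[node][order[i - 1 - k]]
        rows.append(row)
    return order, rows


# 4-cycle 0 - 1 - 2 - 3 - 0, rooted at node 0, truncated to d = 2.
adj = [[0, 1, 0, 1],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [1, 0, 1, 0]]
order, rows = pre_embed(adj, root=0, d=2)
```

In training, the root would be drawn at random for each pass (for instance with `random.randrange(len(adj))`), so the same graph yields different sequences across epochs.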
After node ordering and pre-embedding, each graph is processed as a sequence of $d$-dimensional nodes ($d$ being the number of preceding nodes kept by the truncation) by a gated recurrent unit (GRU) neural network (cho2014learning). The GRU is a special RNN able to learn long-term dependencies by mitigating the vanishing gradient effect (footnote 1).
Footnote 1: The choice of GRU over Long Short-Term Memory networks is arbitrary, as they have equivalent long-term modeling power (chung2014empirical).
To help the training of the recurrent network, we add a simple fully connected network between the pre-embedding and the recurrent embedding. Therefore, each node is presented to the GRU as a continuous vector instead of a binary adjacency vector.
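The GRU sequentially embeds each node $i$ by using the information contained in its continuous input vector $x_i$ and in the previous memory cell $h_{i-1}$ with the recurrent process
$$
\begin{aligned}
u_i &= \sigma(W_u x_i + U_u h_{i-1}), \\
r_i &= \sigma(W_r x_i + U_r h_{i-1}), \\
\tilde{h}_i &= \tanh\big(W_h x_i + U_h (r_i \odot h_{i-1})\big), \\
h_i &= (1 - u_i) \odot h_{i-1} + u_i \odot \tilde{h}_i,
\end{aligned}
$$
where $W_u$, $U_u$, $W_r$, $U_r$, $W_h$ and $U_h$ are GRU parameters, $\odot$ denotes element-wise vector multiplication and $\sigma$ is the logistic sigmoid function.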
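A single step of a bias-free GRU can be sketched in plain Python as below. This is a generic textbook formulation under stated assumptions, not the paper's code: the gate names `u` (update) and `r` (reset), the parameter tuple layout and the convention `h = (1 - u) * h_prev + u * h_tilde` are choices made for this example.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, params):
    """One GRU update; params = (W_u, U_u, W_r, U_r, W_h, U_h), no biases."""
    W_u, U_u, W_r, U_r, W_h, U_h = params
    # Update gate: how much of the candidate state replaces the old state.
    u = [sigmoid(a + b) for a, b in zip(matvec(W_u, x), matvec(U_u, h_prev))]
    # Reset gate: how much of the old state is used to build the candidate.
    r = [sigmoid(a + b) for a, b in zip(matvec(W_r, x), matvec(U_r, h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b)
               for a, b in zip(matvec(W_h, x), matvec(U_h, gated))]
    return [(1 - ui) * hp + ui * ht for ui, hp, ht in zip(u, h_prev, h_tilde)]


# With all-zero parameters both gates equal 0.5 and the candidate is 0,
# so the new state is half of the previous one.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = (zeros, zeros, zeros, zeros, zeros, zeros)
h = gru_step([1.0, 0.0], [0.4, -0.2], params)
```

Applying `gru_step` node after node along the BFS sequence, starting from a zero state, yields the sequence of memory cells that the rest of the model consumes.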
The embedded node sequence feeds both the VAR and the classifier as discussed in subsequent sections.
The embedding part is illustrated in the top line of figure 2.
3.2 Classification
After the embedding step, we use an additional GRU dedicated to classification that takes the embedded node sequence $(h_1, \dots, h_n)$ as input. Its last memory cell, denoted $h^{c}_n$, feeds a softmax multilayer perceptron (MLP) which performs class prediction.
Formally, let $c$ be the class index of a graph $G$; the classifier is trained by minimizing the cross-entropy loss
$$\mathcal{L}_{\mathrm{clf}}(G, v) = -\log p_c(G, v),$$
where $p(G, v)$ is the softmax class membership probability vector for a given graph $G$ that has been sorted by a BFS rooted at node $v$.
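For a single graph, the cross-entropy of a softmax output reduces to the negative log-probability assigned to the true class. The following sketch uses hypothetical function names and made-up logits to show the computation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of the true class index."""
    return -math.log(softmax(logits)[label])


# Two classes with equal scores: each gets probability 0.5.
probs = softmax([0.0, 0.0])
loss = cross_entropy([0.0, 0.0], label=0)
```

The loss decreases toward zero as the probability of the true class approaches one, and it is averaged over the graphs of a batch during training.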
As discriminating patterns might be spread across the whole graph, the network is required to model long-term dependencies. By construction, GRUs have such ability.
The classification part is illustrated in the middle line of figure 2.
3.3 Regularization with variational auto-regression
As the structure of a graph is the concatenation of the interactions between all nodes and their respective neighbors, learning a good representation without using node attributes requires the model to capture the structure of the graph while classifying. In order to do so, we add an auto-regression block to our model: at each node, the network makes a prediction for the adjacency of the next node. This task is performed separately from the recurrent classifier.
We use a variational auto-encoder (VAE) (kingma2013auto) to learn the representation of each node $i+1$ given $h_i$, the embedding of the sequence up to node $i$. This constitutes the VAR. Such a representation for sequence classification has already been used for sentiment analysis (latif2017variational; xu2017variational). It is the natural language processing equivalent of predicting the word $i+1$ of a sentence, given an aggregated representation of this sentence up to word $i$.
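For each graph with embedded nodes $h_1, \dots, h_n$ (see 3.1), the fully connected variational auto-encoder takes $h_i$ as input. Let $z_i$ be the latent random variables for the model and $a_{i+1}$ the adjacency vector of the next node.
Training is done by minimizing the loss:
$$\mathcal{L}_{\mathrm{VAR}} = \mathbb{E}_{h_i \sim \hat{p}} \Big[ \mathrm{KL}\big(q_\phi(z_i \mid h_i) \,\|\, p(z_i)\big) - \mathbb{E}_{z_i \sim q_\phi(\cdot \mid h_i)} \big[\log p_\theta(a_{i+1} \mid z_i)\big] \Big],$$
which is an upper bound of the negative marginal log-likelihood $-\mathbb{E}\big[\log p_\theta(a_{i+1})\big]$. $q_\phi$ and $p_\theta$ are the respective densities of $z_i \mid h_i$ and $a_{i+1} \mid z_i$, whose distributions are parameterized by $\phi$ and $\theta$ respectively. KL denotes the Kullback-Leibler divergence, $\hat{p}$ is the empirical distribution of $h_i$ and $p$ is the density of the prior distribution of the latent variables $z_i$. We use the standard VAE prior distribution $z_i \sim \mathcal{N}(0, I)$.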
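As a worked step, and assuming (as is common for VAEs but not stated above) that the encoder outputs a diagonal Gaussian over a $J$-dimensional latent space, the KL term against the standard normal prior has a closed form. The symbols $\mu$, $\sigma$ and $J$ are notation introduced here for illustration only.

```latex
\mathrm{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
  = \frac{1}{2} \sum_{j=1}^{J} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)
```

Each summand is non-negative and vanishes only when $\mu_j = 0$ and $\sigma_j = 1$, so this term pulls the approximate posterior toward the prior while the reconstruction term pulls it toward informative codes.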
In practice, the encoder density $q_\phi$ and the decoder density $p_\theta$ are modelled by neural networks parametrized by $\phi$ and $\theta$, which require differentiable functions for training. However, $p_\theta$ models a binary adjacency vector representing the connections between node $i+1$ and the previously visited nodes. Therefore, we use a continuous relaxation of discrete sampling, the Gumbel trick (jang2016categorical), to train our neural-network-based model.
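For binary variables, the Gumbel relaxation can be written with logistic noise, since the difference of two Gumbel samples follows a logistic distribution. The sketch below is a generic version of this binary relaxation; the function names, the temperature value and the clamping constant are assumptions, not settings taken from the paper.

```python
import math
import random


def stable_sigmoid(x):
    """Sigmoid that avoids overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gumbel_sigmoid(logits, temperature, rng):
    """Relaxed Bernoulli samples in [0, 1], one per logit."""
    samples = []
    for logit in logits:
        # Clamp to keep both logarithms finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # Logistic noise, i.e. the difference of two Gumbel samples.
        noise = math.log(u) - math.log(1.0 - u)
        samples.append(stable_sigmoid((logit + noise) / temperature))
    return samples


rng = random.Random(0)
relaxed = gumbel_sigmoid([2.0, -2.0, 0.0], temperature=0.5, rng=rng)
```

Thresholding a relaxed sample at 0.5 gives an exact Bernoulli draw with probability `sigmoid(logit)`, while the relaxed value itself stays differentiable with respect to the logit; lower temperatures push samples closer to 0 or 1.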
The regularization part is illustrated in the bottom line of figure 2.
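In the end, the model is trained by minimizing the total loss
$$\mathcal{L} = \mathcal{L}_{\mathrm{clf}} + \alpha \, \mathcal{L}_{\mathrm{VAR}},$$
where $\mathcal{L}_{\mathrm{clf}}$ is the classification cross-entropy loss, $\mathcal{L}_{\mathrm{VAR}}$ is the VAR loss and $\alpha$ is a hyper-parameter.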
3.4 Aggregation of the results for testing
The node ordering step introduces randomness into our model. On the one hand, it helps learn more general graph representations during the training phase; on the other hand, it might produce different outputs for the same graph during the testing phase, depending on the root of the BFS. In order to counter this side effect, we add the following aggregation step for the testing phase. Each graph $G$ is processed $K$ times by the model with different roots $v_1, \dots, v_K$ for the BFS ordering. The class membership probability vectors $p(G, v_k)$ are extracted and averaged. The average score vector is denoted $\bar{p}(G)$ and computed as follows:
$$\bar{p}(G) = \frac{1}{K} \sum_{k=1}^{K} p(G, v_k),$$
with $\sum$ an element-wise sum.
This soft vote is repeated $M$ times, resulting in $M$ probability vectors $\bar{p}^{(1)}(G), \dots, \bar{p}^{(M)}(G)$ for each graph $G$. The final class attributed to a graph corresponds to the highest probability among the $M$ vectors.
This second hard vote selects the batch of votes for which the model is the most confident.
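The two-level vote can be sketched as follows, assuming the per-root probability vectors have already been produced by the model. The function names and the toy probabilities are hypothetical.

```python
def soft_vote(prob_vectors):
    """Element-wise average of class probability vectors."""
    count = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    return [sum(p[c] for p in prob_vectors) / count for c in range(n_classes)]


def hard_vote(averaged_vectors):
    """Return the class holding the single highest averaged probability."""
    best_class, best_prob = None, -1.0
    for vector in averaged_vectors:
        for c, p in enumerate(vector):
            if p > best_prob:
                best_class, best_prob = c, p
    return best_class


# Two soft votes, each averaging the outputs for two BFS roots.
batch_1 = [[0.6, 0.4], [0.8, 0.2]]
batch_2 = [[0.1, 0.9], [0.3, 0.7]]
averages = [soft_vote(batch_1), soft_vote(batch_2)]
predicted = hard_vote(averages)
```

Here the first batch favors class 0 with an average of about 0.7 and the second favors class 1 with about 0.8, so the hard vote keeps the more confident second batch and predicts class 1.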
Figure 2 provides a detailed illustration of our model.
We evaluated our model on four standard datasets from biology: Mutag (MT), Enzymes (EZ), Proteins Full (PF) and National Cancer Institute (NCI1) (KKMMN2016). All graphs represent chemical compounds, nodes are molecular substructures (typically atoms) and edges represent connections between these substructures (chemical bond or spatial proximity). In MT, the compounds are either mutagenic or not mutagenic. EZ contains tertiary structures of proteins from the 6 Enzyme Commission top level classes; it is the only multiclass dataset of this paper. PF is a subset of the Dobson and Doig dataset representing secondary structures of proteins being either enzyme or not enzyme. In NCI1, compounds either have an anti-cancer activity or do not. Statistics about the graphs are presented in table 1.
4.2 Experimental setup
MT, EZ, PF and NCI1 are respectively divided into 3, 10, 10 and 10 folds such that the class proportions are preserved in each fold for all datasets (footnote 2). These folds are then used for cross-validation, i.e., one fold serves as the testing set while the other ones compose the training set. Results are averaged over all testing sets.
Footnote 2: MT counts only 188 graphs, therefore a 10-fold cross-validation does not give any assurance of having representative samples at test time.
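A class-preserving split can be sketched by shuffling the indices of each class and dealing them to the folds in turn. This is one simple way to build such folds, not necessarily the procedure used for the experiments; the function name, the seed and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Split sample indices into folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of this class to the folds in round-robin order.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Toy dataset: six graphs of class 0 and three of class 1, split in 3 folds.
labels = [0] * 6 + [1] * 3
folds = stratified_folds(labels, n_folds=3)
```

Each fold then serves once as the testing set while the remaining folds are concatenated into the training set, and the test accuracies are averaged.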
Our model is implemented in PyTorch (paszke2017pytorch) and trained with the Adam stochastic optimization method (kingma2014adam) on an NVIDIA Titan Xp GPU. Table 2 features the architecture of the model.
||2-layer ReLU FC.|
||Gumbel sigmoid sampling|
|Classifier|2-layer GRU + DP(0.25)|
||2-layer ReLU FC. + SF|
The input size of the recurrent neural network is chosen for each dataset according to the algorithm described in (you2018graphrnn), namely 11 for MT, 25 for EZ, 80 for PF and 11 for NCI1. is set to . For training, batch size is set to 5, and the learning rate to , decreased by at iterations and . We use the same hyper-parameters for every dataset in order to avoid over-fitting and unveil the model capacities.
We compare our results to those obtained by Earth Mover’s Distance (nikolentzos2017matching) (EMD), Pyramid Match (nikolentzos2017matching) (PM), Feature-Based (barnett2016feature) (FB), Dynamic-Based Features (gomez2017dynamics) (DyF) and Stochastic Graphlet Embedding (dutta2017high) (SGE). All values are directly taken from the aforementioned papers as they use a setup similar to ours, except for the number of folds for MT. For algorithms presenting results with and without node features, we reported the results without node features. For algorithms presenting results with several sets of hyper-parameters, we have reported the results for the set of parameters that performed best on the largest number of datasets. Results are reported in table 3. We obtain state-of-the-art results on three out of the four datasets used for this paper and the second best result on the fourth one.
|Model|MT|EZ|PF|NCI1|
|RVNC|84.7 / 88.3 (footnote 3)|48.4|74.8|80.7|
Footnote 3: Respectively with 3-fold and 10-fold cross-validation. See experimental setup and footnote 2.
Node indexing invariance
Our model is designed to be independent of the node ordering of the graph with respect to different BFS roots. Inputs representing the same graph (up to node ordering) should be close to one another in the latent embedding space. As the preprocessing is performed on each graph at each epoch, the same graph is processed many times by the model during training with different embeddings. This creates a natural regularization for the network. Indeed, as illustrated in figure 3, the projections corresponding to the same graph form a cluster in the low-dimensional representation of the latent space.
Contribution of the VAR to classification
In order to demonstrate the benefit of training our model to both auto-regress and classify each sample, we ran experiments without the VAR regularization (VAR loss weight $\alpha = 0$) on a 90/10 train-test split of the EZ dataset (same proportions as in the experiments). We observe in figure 4 a more efficient training procedure and a faster generalization when $\alpha$ is positive. This allows for convergence of the model in less than one day for every dataset we used.
In order to provide some intuition about the node prediction capacities of our network, we propose in figure 5 an illustration of some graphs and their auto-regressed counterparts.
In this work, we introduced a recurrent neural network based embedding method for graphs. We applied our model to graph classification without node or edge attributes. As each graph can be processed individually, there is no scalability issue with respect to the number of samples in the dataset. Features are neither ad hoc nor handcrafted but learned from the data. In the end, by joint training of classification and prediction models, we obtain state-of-the-art results on standard benchmark datasets.
Moreover, this model can easily be adapted to incorporate exogenous information such as node or edge attributes. This could be addressed in future work.
We would like to thank NVIDIA and its GPU Grant Program for providing the hardware we used in our experiments.