A Simple Baseline Algorithm for Graph Classification

10/22/2018
by   Nathan de Lara, et al.
Télécom Paris
Safran

Graph classification has recently received a lot of attention from various fields of machine learning, e.g., kernel methods, sequential modeling and graph embedding. All these approaches offer promising results with different respective strengths and weaknesses. However, most of them rely on complex mathematics and require heavy computational power to achieve their best performance. We propose a simple and fast algorithm, based on the spectral decomposition of the graph Laplacian, to perform graph classification and to provide a first reference score for a dataset. We show that this method obtains results competitive with state-of-the-art algorithms.


1 Introduction

Graph classification methods can schematically be divided into three categories: graph kernels, sequential methods and embedding methods. In this section, we briefly present these different approaches, focusing on methods that only use the structure of the graph and no exogenous information, such as node features, to perform classification, as we only want to compare the ability of the algorithms to capture structural information.

Kernel methods

Kernel methods [16, 17, 15, 14] perform pairwise comparisons between the graphs of the dataset and apply a classifier, usually a support vector machine (SVM), to the resulting similarity matrix. To keep the number of comparisons tractable when the number of graphs is large, they often rely on the Nyström algorithm [22] to compute a low-rank approximation of the similarity matrix. The key is to construct an efficient kernel that can be applied to graphs of varying sizes and captures features useful for the downstream classification.
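As a rough illustration of this pipeline, the sketch below applies the scikit-learn Nyström approximation followed by a linear SVM to placeholder vector representations of graphs; the data, kernel and parameter values are arbitrary choices of ours, not part of any of the cited methods.

```python
# Minimal sketch of the Nystroem + SVM machinery on placeholder data.
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))      # placeholder per-graph representations
y = rng.integers(0, 2, size=500)    # placeholder binary labels

# Nystroem builds a low-rank approximation of the (RBF) similarity matrix,
# which keeps the number of pairwise comparisons tractable.
feature_map = Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0)
X_low_rank = feature_map.fit_transform(X)

clf = LinearSVC().fit(X_low_rank, y)
print(clf.score(X_low_rank, y))
```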

Sequential methods

Some methods tackle the varying sizes of graphs by processing them as sequences of nodes. The earliest models used random-walk-based representations [4, 23]. More recently, [8] and [24] transform a graph into a sequence of fixed-size vectors, corresponding to its nodes, which is fed to a recurrent neural network. The two main challenges of this approach are the design of the embedding function for the nodes and the order in which the embeddings are given to the recurrent neural network.

Embedding methods

Embedding methods [7, 1, 6, 13] derive a fixed number of features for each graph, which is used as a vector representation for classification. Even though deriving a good set of features is often a difficult task, this approach has the benefit of being compatible with any standard classifier in a plug-and-play fashion (SVM, random forest, multilayer perceptron, etc.). Our model belongs to this class of methods, as we rely on spectral features of the graph.

2 Model

Let G be an undirected, unweighted graph and A its Boolean adjacency matrix with respect to an arbitrary indexing of the nodes. G is assumed to be connected; otherwise, we extract its largest connected component. Let D be the diagonal matrix of node degrees. The normalized Laplacian of G is defined as

L = I - D^{-1/2} A D^{-1/2}.    (1)

We use the k smallest positive eigenvalues of L, in ascending order, as input of the classifier: (λ_1, ..., λ_k). If the graph has fewer than k + 1 nodes, we use right zero padding to get a vector of the appropriate dimension: (λ_1, ..., λ_{n-1}, 0, ..., 0). We denote this embedding as spectral features (SF).
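The SF embedding can be computed in a few lines; the sketch below uses dense linear algebra for readability (function names and the numerical tolerance are our own, and large graphs would call for sparse eigensolvers instead).

```python
# Minimal sketch of the spectral features (SF) embedding described above.
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import connected_components

def spectral_features(adjacency: np.ndarray, k: int, tol: float = 1e-8) -> np.ndarray:
    """k smallest positive eigenvalues of the normalized Laplacian of the
    largest connected component, in ascending order, right zero-padded."""
    # Keep the largest connected component only.
    n_comp, labels = connected_components(adjacency, directed=False)
    if n_comp > 1:
        largest = np.argmax(np.bincount(labels))
        keep = labels == largest
        adjacency = adjacency[np.ix_(keep, keep)]

    degrees = adjacency.sum(axis=1)          # assumes degrees >= 1 on the component
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    # Normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}  (equation 1).
    laplacian = np.eye(adjacency.shape[0]) - d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]

    eigenvalues = eigh(laplacian, eigvals_only=True)   # ascending order
    positive = eigenvalues[eigenvalues > tol][:k]      # drop the (near-)zero eigenvalue
    return np.pad(positive, (0, k - positive.size))    # right zero padding

# Toy usage: a 4-cycle.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
print(spectral_features(A, k=5))
```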

The normalized Laplacian matrix of a graph is a well-known object in spectral learning [2, 9]. However, for node clustering or classification, most of the attention is usually directed to its eigenvectors rather than its spectrum. A major benefit of the ordered spectrum representation for graph classification is that it does not depend on the indexing of the nodes.

Some properties of Laplacian eigenvalues

The eigenvalues of the normalized Laplacian matrix lie between 0 and 2. Such a property is very convenient for the downstream use of a standard classifier without heavy rescaling or preprocessing. The multiplicity of the eigenvalue 0 corresponds to the number of connected components in the graph, hence its omission from our representation, as we only consider the largest connected component. Other values are also known to denote the presence of specific structures in the graph [5]. For example, an eigenvalue equal to 2 denotes a bipartite structure.
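These properties are easy to check numerically; the toy examples below (our own, using networkx) verify the multiplicity of the eigenvalue 0 and the range [0, 2], with the value 2 reached on a bipartite graph.

```python
# Quick numerical check of the spectral properties above on toy graphs.
import networkx as nx
import numpy as np

# Two disconnected triangles: the eigenvalue 0 has multiplicity 2.
G = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))
eigs = np.linalg.eigvalsh(nx.normalized_laplacian_matrix(G).toarray())
print(int(np.sum(np.isclose(eigs, 0.0))))   # -> 2 connected components

# A bipartite graph (6-cycle): the spectrum stays in [0, 2] and reaches 2.
H = nx.cycle_graph(6)
eigs = np.linalg.eigvalsh(nx.normalized_laplacian_matrix(H).toarray())
print(round(eigs.min(), 6), round(eigs.max(), 6))   # -> 0.0, 2.0
```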

Physical interpretations

In [3], each eigenvalue of the Laplacian corresponds to the energy level of a stable configuration of the nodes in the embedding space: the lower the energy, the more stable the configuration. In [20], these eigenvalues correspond to the frequencies associated with a Fourier decomposition of any signal living on the vertices of the graph; truncating this decomposition thus acts as a low-pass filter on the signal. Characterizing a graph by the smallest eigenvalues of its normalized Laplacian is therefore comparable to characterizing a melody by its lowest fundamental frequencies.
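The following sketch illustrates this low-pass reading on a toy signal of our own making: the signal is expanded in the eigenbasis of the normalized Laplacian and only the components on the lowest frequencies are kept.

```python
# Rough illustration of graph Fourier low-pass filtering on a toy path graph.
import networkx as nx
import numpy as np

G = nx.path_graph(20)
L = nx.normalized_laplacian_matrix(G).toarray()
frequencies, modes = np.linalg.eigh(L)          # graph Fourier basis (ascending)

signal = np.sin(np.linspace(0, 3, 20)) + 0.3 * np.random.default_rng(0).normal(size=20)
coefficients = modes.T @ signal                 # graph Fourier transform

k = 5                                           # keep only the k lowest frequencies
coefficients[k:] = 0.0
smoothed = modes @ coefficients                 # low-pass reconstruction
print(np.round(smoothed, 2))
```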

Finally, there have been some attempts to connect spectral decomposition to graph isomorphism [21, 11]; however, to the best of our knowledge, this is still an open problem.

The choice of the classifier is left to the discretion of the user. In our experiments, we chose a random forest classifier (RFC), which offers a good trade-off between computational speed and accuracy. Results with several other common classifiers are displayed in appendix A.

An illustration of the model is shown in figure 1.

Figure 1: Schematic view of our model. L denotes the Laplacian as in equation 1 and ŷ the predicted class.

3 Experiments

Datasets

We evaluated our model on some standard datasets from biology: Mutag (MT), Predictive Toxicology Challenge (PTC), Enzymes (EZ), Proteins Full (PF), Dobson and Doig (DD) and National Cancer Institute (NCI1) [10]. All graphs represent chemical compounds: nodes are molecular substructures (typically atoms) and edges represent connections between these substructures (chemical bonds or spatial proximity). In MT, the compounds are either mutagenic or not, while in PTC, they are either carcinogens or not. EZ contains tertiary structures of proteins from the 6 Enzyme Commission top-level classes. In DD, graphs represent secondary structures of proteins, which are either enzymes or not. PF is a subset of DD from which the largest graphs have been removed. In NCI1, compounds either have an anti-cancer activity or not. Statistics about the graphs are presented in table 1.

MT PTC EZ PF DD NCI1
graphs 188 344 600 1113 1178 4110
classes 2 2 6 2 2 2
bias (%) 66.5 55.8 16.7 59.6 58.7 50.0
avg. |V| 18 14 33 39 284 30
avg. |E| 39 15 124 146 1431 65
Table 1: Basic characteristics of the datasets. Bias indicates the proportion of the dominant class.

Experimental setup

Each dataset is divided into 10 folds such that the class proportions are preserved in each fold. These folds are then used for cross-validation, i.e., one fold serves as the test set while the other ones compose the training set. Results are averaged over all test sets. We built the folds using the scikit-learn [18] StratifiedKFold function with a fixed random seed in order to get reproducible results.

The embedding dimension k is set to the average number of nodes for each dataset (see appendix B for additional experiments), and a single set of hyper-parameters for the classifier is used for all datasets. We used the random forest classifier from scikit-learn with class_weight set to balanced. The other non-default hyper-parameters were selected by randomized cross-validation over the different datasets (see table 5 for more details). We also conducted experiments to assess the robustness of our model with respect to some of its hyper-parameters; see appendix C for more details. All experiments were run on a laptop equipped with an Intel Core i7 vPro processor and 16 GB of RAM.
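A minimal sketch of this evaluation protocol is given below; the feature matrix and hyper-parameter values are placeholders, not the ones used in the paper.

```python
# Sketch of the protocol: stratified 10-fold cross-validation with a single
# random forest configuration shared across datasets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 25))                  # placeholder SF embeddings (k = 25)
y = rng.integers(0, 2, size=200)                # placeholder labels

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)

scores = cross_val_score(clf, X, y, cv=folds)   # accuracy per test fold
print(scores.mean())
```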

Results

We compare our results (SF + RFC) to those obtained by Earth Mover's Distance [17] (EMD), Pyramid Match [17] (PM), Feature-Based [1] (FB), Dynamic-Based Features [7] (DyF) and Stochastic Graphlet Embedding [6] (SGE). All values are taken directly from the aforementioned papers, as they used a setup similar to ours. For algorithms reporting results with and without node features, we report the results without node features. For algorithms reporting results with several sets of hyper-parameters, we report the results for the set of parameters that gave the best performance on the largest number of datasets. Results are reported in table 2.

MT PTC EZ PF DD NCI1
EMD 86.1 57.7 36.8 - - 72.7
PM 85.6 59.4 28.2 - 75.6 69.7
FB 84.7 55.6 29.0 70.0 - 62.9
DyF 86.3 56.2 26.6 73.1 - 66.6
SGE 87.2 60.0 40.7 - 76.6 -
SF + RFC 88.4 62.8 43.7 73.6 75.4 75.2

Table 2: Experimental accuracy (%) of different models plus ours over standard molecular datasets.

We see that our model achieves good performance compared to the state of the art. It gives the best result on five out of the six datasets (MT, PTC, EZ, PF, NCI1). Besides, it did not require any intensive per-dataset hyper-parameter tuning, as we used the same random forest for all datasets.

Computation analysis

The results were obtained extremely quickly (some kernel methods cannot run within one day on DD, for example [6]). Embedding all graphs took a matter of minutes (most of that time dedicated to DD, which has the largest graphs and the largest embedding dimension), while training and testing the random forest on all folds took less than a minute. Hence, the total time to run all described experiments was also a matter of minutes.

Conclusion

We experimentally showed the interest of normalized Laplacian eigenvalues for graph classification. This feature is easy to extract and can be combined with any other graph representation in order to improve model performance. We hope it will inspire new approaches to graph classification. Experimenting with permutation-invariant classifiers [12, 19] could be a natural continuation of this work, in order to properly include information from the eigenvectors of L, which are node-indexing dependent.

Acknowledgments

We would like to thank Thomas Bonald, Sebastien Razakarivony and all the anonymous reviewers for their comments and help. This work is supported by the company Safran through the CIFRE convention 2017/1317.

References

Appendix A Results for different classifiers

Besides RFC, we experimented with different standard classifiers combined with our spectral embedding, namely: the k-nearest neighbors classifier (k-NNC), a 2-layer perceptron with ReLU non-linearity (MLP), a support vector machine with one-versus-one classification (SVM) and a ridge regression classifier (RRC). Results are reported in table 3.

MT PTC EZ PF DD NCI1
SF + RFC 88.4 62.8 43.7 73.6 75.4 75.2
SF + 1-NNC 86.8 59.3 37.3 65.6 69.6 68.3
SF + 15-NNC 85.7 61.9 33.7 70.4 75.0 69.6
SF + MLP 86.3 60.5 31.8 71.6 75.6 62.3
SF + SVM 85.3 60.8 31.3 73.0 75.0 63.9
SF + RRC 84.2 59.6 26.7 71.5 75.0 62.2

Table 3: Accuracy (%) of different classifiers combined with the spectral features embedding.

As we can see, RFC provides the best results on all datasets except DD, where MLP reaches an accuracy of 75.6 against 75.4. Our intuition to explain these good results is that the decision tree classifier, which is at the core of RFC, is an algorithm based on level thresholding. As explained in section 2, our embedding represents a sequence of energy levels; being above or below a certain level is thus likely to be meaningful for classification.

Appendix B Results for different embedding dimensions

We experimented with different embedding dimensions for RFC: k ∈ {1, 5, 10, 25, 50}. The hyper-parameters are the same as in section 3. Results are reported in table 4.

k MT PTC EZ PF DD NCI1
1 76.2 56.1 23.8 64.0 57.2 58.2
5 86.8 62.5 39.0 69.6 73.9 72.5
10 86.8 61.4 42.8 71.7 75.5 75.5
25 88.4 62.8 42.7 72.8 75.7 75.2
50 88.4 62.8 43.7 73.6 75.1 75.2

Table 4: Accuracy (%) of RFC combined with the spectral features embedding for different dimensions k.

We see that even the first energy level is sufficient to obtain a non-trivial classification. k = 10 already provides results competitive with the state of the art, while k = 25 provides results relatively similar to k = 50. We did not experiment with larger values of k, as this would mostly result in additional zero padding for most graphs. Note that, for the smaller values of k, embedding all graphs took less than a minute in our experimental setting.

Appendix C Hyper parameters search and robustness analysis

In order to confirm the intrinsic quality of our spectral graph representation, we performed a robustness analysis of our model with respect to the classifier. To do so, we measured the marginal variation of accuracy with respect to each hyper-parameter, the others being fixed. To ensure that we only capture parameter sensitivity, we fixed the seed of the random forest for all experiments. See table 5 for the parameter grid and figure 2 for the results.

We see that our method is very robust to variations of the RFC hyper-parameters. The outliers in the box plots are all due to highly improper parameter values.

RFC hyperparameters Hyperparameters grid
1, 10, 50, 100, 250, 500, 750, 1000
1, 2, 3, 4, 5, 6
1, 5, 10, 50, 100, 250, 500, 750, 1000
True, False
Table 5: Parameter grid for RFC. Bold values correspond to the parameters used in the experimental setup and serve as reference values for the robustness analysis.
Figure 2: Box plots representing the quartiles (box), the confidence interval (whiskers) and outliers (isolated points) of the empirical distribution of the classification accuracy with respect to four hyper-parameters of the RFC.
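As an indication, a marginal-variation sweep of the kind described above could be sketched as follows; the data, grid values and fixed parameters are placeholders of ours, not the exact grid of table 5.

```python
# Sketch of the marginal-variation protocol: vary one RFC hyper-parameter
# over a grid while the others stay fixed, and collect cross-validated
# accuracies (one distribution per grid value, suitable for a box plot).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 25))                  # placeholder SF embeddings
y = rng.integers(0, 2, size=200)                # placeholder labels
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

accuracies = {}
for n_estimators in [1, 10, 50, 100, 250, 500]:
    clf = RandomForestClassifier(n_estimators=n_estimators,
                                 class_weight="balanced",
                                 random_state=0)          # fixed seed, as in the text
    accuracies[n_estimators] = cross_val_score(clf, X, y, cv=folds)

for value, scores in accuracies.items():
    print(value, round(scores.mean(), 3))
```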