M-Evolve: Structural-Mapping-Based Data Augmentation for Graph Classification

07/11/2020 ∙ by Jiajun Zhou, et al. ∙ City University of Hong Kong 2

Graph classification, which aims to identify the category labels of graphs, plays a significant role in drug classification, toxicity detection, protein analysis, etc. However, the limited scale of the benchmark datasets makes it easy for graph classification models to fall into over-fitting and undergeneralization. To address this, we introduce data augmentation on graphs (i.e., graph augmentation) and present four methods: random mapping, vertex-similarity mapping, motif-random mapping and motif-similarity mapping, to generate more weakly labeled data for small-scale benchmark datasets via heuristic transformation of graph structures. Furthermore, we propose a generic model evolution framework, named M-Evolve, which combines graph augmentation, data filtration and model retraining to optimize pre-trained graph classifiers. Experiments on six benchmark datasets demonstrate that the proposed framework helps existing graph classification models alleviate over-fitting and undergeneralization when trained on small-scale benchmark datasets, yielding an average accuracy improvement of 3-13% on graph classification tasks.


1 Introduction

Graph classification, or network classification, which aims to identify the category labels of graphs in a dataset, has recently attracted considerable attention from different fields like bioinformatics [1] and chemoinformatics [2]. For instance, in bioinformatics, proteins or enzymes can be represented as labeled graphs, in which vertices are atoms and edges represent the chemical bonds that connect atoms. The task of graph classification is to classify these molecular graphs according to their chemical properties like carcinogenicity, mutagenicity and toxicity.

However, in bioinformatics and chemoinformatics, the scale of the known benchmark graph datasets is generally in the range of tens to thousands, which is far smaller than that of the available real-world bibliography datasets like DBLP Graph Stream [3]. Despite the advances of various graph classification methods, from graph kernels and graph embedding to graph neural networks, the limited data scale makes them prone to over-fitting and undergeneralization. Over-fitting refers to a modeling error that occurs when a model learns a function with high variance to perfectly fit the limited set of data. A natural idea to address over-fitting at the data level is data augmentation, which is widely applied in computer vision. Data augmentation encompasses a number of techniques that enhance both the scale and the quality of training data so that higher-performing models can be learnt. In computer vision, image augmentation methods include geometric transformation, color depth adjustment, neural style transfer, adversarial training, etc.

Fig. 1: An illustration of data augmentation application in graph (Graph Augmentation).

However, unlike image data, which have a clear grid structure, graphs have irregular topological structures, making it hard to generalize some basic augmentation operations to graphs.

To solve the above problem, we study data augmentation on graphs, as visualized in Fig. 1, and develop four graph augmentation methods, called random mapping, vertex-similarity mapping, motif-random mapping and motif-similarity mapping, respectively. The idea is to generate more virtual data for small datasets via heuristic modification and transformation of graph structures. Since the generated graphs are artificial and treated as weakly labeled data, their usability remains to be verified. Therefore, we introduce the concept of label reliability, which reflects the matching degree between examples and their labels as judged by the classifier, to select reliable augmented examples from the generated data. Furthermore, we introduce a model evolution framework, named M-Evolve, which combines graph augmentation, data filtration and model retraining to optimize classifiers. We demonstrate that the new framework achieves a significant improvement of performance on graph classification.

The main contributions of this work are summarized as follows:

  • We effectively utilize the technique of data augmentation on graph classification, and develop several methods to generate effective weakly labeled data for graph benchmark datasets. To the best of our knowledge, this is the first work that uses data augmentation in graph mining.

  • We establish a generic model evolution framework named M-Evolve for enhancing graph classification, which can be easily combined with existing graph classification models to optimize their performance.

  • We conduct experiments on six benchmark datasets. Experimental results demonstrate the superiority of our M-Evolve framework in helping five graph classification algorithms achieve significant performance improvements.

The rest of the paper is organized as follows. First, in Sec. 2, we review the related works on graph classification. Then, in Sec. 3, we introduce and analyze several graph augmentation methods and a new model evolution framework. Thereafter, we present extensive experiments in Sec. 4, with detailed discussions. Finally, we conclude the paper and outline future work in Sec. 5.

2 Related work

2.1 Graph Classification

2.1.1 Graph Kernel Methods

Graph kernels perform graph comparison by recursively decomposing pairwise graphs from the dataset into atomic substructures and using a similarity function among these substructures. Intuitively, graph kernels can be understood as functions measuring the similarity of pairwise graphs. Generally, the kernels can be designed by considering various structural properties like the similarity of local neighborhood structures (WL kernel [4], propagation kernel [5]), the occurrence of certain graphlets or subgraphs (graphlet kernel [6]), the number of walks in common (random walk kernel [7, 8, 9, 10]), and the attributes and lengths of the shortest paths (SP kernel [11]).

2.1.2 Embedding Methods

Graph embedding methods [12, 13, 14, 15] capture the graph topology and derive a fixed number of features, ultimately achieving vector representations for the whole graph. For prediction on the graph level, this approach is compatible with any standard machine learning classifier such as support vector machine (SVM), $k$-nearest neighbors and random forest. Widely used embedding methods include graph2vec [16], structure2vec [17], subgraph2vec [18], etc.

2.1.3 Deep Learning Methods

Recently, increasing attention has been drawn to the application of deep learning to graph mining, and a wide variety of graph neural network (GNN) frameworks have been proposed for graph classification, including methods inspired by CNN, RNN, etc. One typical approach is to obtain a representation of the entire graph by aggregating the vertex embeddings that are the output of GNNs [2, 19]. Some sequential methods [20, 21, 22] handle graphs of varying sizes by transforming them into sequences of fixed-length vectors and then feeding them to an RNN. In addition, some hierarchical clustering methods [23, 24, 25] learn hierarchical graph representations by combining GNNs with clustering algorithms. Notably, some recent works design universal graph pooling modules, which can learn the hierarchical representations of graphs and are compatible with various GNN architectures; e.g., DiffPool [26] learns a differentiable soft cluster assignment for vertices and then maps them to a coarsened graph layer by layer, and EigenPool [27] compresses the vertex features and local structures into coarsened signals via graph Fourier transform.

Symbol                                   Definition
$G$, $G'$                                Original/augmented graph
$V$, $E$                                 Sets of vertices/edges in graph
$m$, $n$                                 Number of edges/vertices
$A$                                      Adjacency matrix of graph
$l$                                      Length of path
$\mathrm{Path}^{l}_{ij}$                 Length-$l$ path between vertices $v_i$ and $v_j$
$D$, $D_{train}$/$D_{test}$/$D_{val}$    Dataset, training/testing/validation set
$D'_{train}$                             Augmented set
$\beta$                                  Budget of edge modification
$f$                                      Mapping
$E^{c}_{add}$/$E^{c}_{del}$              Candidate edges set of addition/deletion
$E_{add}$/$E_{del}$                      Edges set of addition/deletion
$s_{ij}$                                 Similarity score of $(v_i, v_j)$
$w_{ij}$                                 Sampling weight of $(v_i, v_j)$
$S$                                      Set of similarity scores
$W_{add}$/$W_{del}$                      Addition/deletion weights
$\mathbf{p}_i$                           Prediction probability vector of example $G_i$
$\mathbf{q}_k$                           Average probability vector of class $k$
$Q$                                      Probability confusion matrix
$r$                                      Label reliability
$\theta$                                 Label reliability threshold
$T$                                      Number of iterations
TABLE I: Terms and notations used in this paper.

3 Methodology

In this section, we first formulate the problem of data augmentation on graphs, and then present several graph augmentation strategies, which are heuristic and especially suited to the graph classification task. The notations used in this paper are listed in TABLE I.

3.1 Problem Formulation

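Let $G = (V, E)$ be an undirected and unweighted graph, which consists of a vertex set $V = \{v_i \mid i = 1, \dots, n\}$ and an edge set $E = \{e_i \mid i = 1, \dots, m\}$. The topological structure of graph $G$ is represented by an $n \times n$ adjacency matrix $A$ with $A_{ij} = 1$ if $(v_i, v_j) \in E$ and $A_{ij} = 0$ otherwise. Given pairwise vertices $(v_i, v_j)$, the length-$l$ path between them is represented as an ordered edge sequence, i.e., $\mathrm{Path}^{l}_{ij} = \{(v_i, v_{k_1}), (v_{k_1}, v_{k_2}), \dots, (v_{k_{l-1}}, v_j)\}$. A dataset that contains a series of graphs is denoted as $D = \{(G_i, y_i) \mid i = 1, \dots, t\}$, where $y_i$ is the label of graph $G_i$. For $D$, an upfront split will be applied to yield disjoint training, validation and testing sets, denoted as $D_{train}$, $D_{val}$ and $D_{test}$, respectively. The original classifier $C$ will be pre-trained using $D_{train}$ and $D_{val}$.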

We further explore the data augmentation technique for graph classification from a heuristic approach and consider optimizing graph classifiers. Fig. 1 demonstrates the application of data augmentation to graph structured data, which consists of two phases: graph augmentation and data filtration. Specifically, we aim to update a classifier with augmented data, which are first generated via graph augmentation and then filtered in terms of their label reliability. During graph augmentation, our purpose is to map the graph $G \in D_{train}$ to a new graph $G'$ with the formal format $f: (G, y) \mapsto (G', y)$, where $y$ is the label of $G$. We treat the generated graphs as weakly labeled data and classify them into two groups via a label reliability threshold $\theta$ learnt from $D_{val}$. Then, the augmented set $D'_{train}$ filtered from the generated graph pool $D_{pool}$ will be merged with $D_{train}$ to produce the training set:

$$D^{new}_{train} = D_{train} \cup D'_{train}. \tag{1}$$

Finally, we finetune or retrain the classifier with $D^{new}_{train}$, and evaluate it on the testing set $D_{test}$.

3.2 Graph Augmentation

Graph augmentation aims to expand training data by artificially creating more reasonable virtual data from a limited set of graphs. In this paper, we consider augmentation as a topological mapping, which is conducted via heuristic modification or transformation of graph structures. To keep the generated virtual data approximately reasonable, our graph augmentation follows two principles: 1) edge modification, where $G'$ is a partially modified graph with some edges added to/removed from $G$; 2) structure property preservation, where the augmentation operation keeps the graph connectivity and the number of edges constant.

During edge modification, the edges removed from the graph are sampled from the candidate edge set $E^{c}_{del}$, while the edges added to the graph are sampled from the candidate pairwise vertices set $E^{c}_{add}$. The construction of candidate sets varies for different methods, as further discussed below.

3.2.1 Random Mapping

Here, consider random mapping as a simple baseline method. For a given graph $G = (V, E)$, one can randomly remove some edges from $E$ and then link the same number of currently unlinked vertex pairs. In this random scenario, the candidate sets are denoted as follows:

$$E^{c}_{del} = E, \qquad E^{c}_{add} = \{(v_i, v_j) \mid A_{ij} = 0;\ i \neq j\}. \tag{2}$$
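Input: Target network $G$, proportion of modification $\beta$.
Output: Augmented graph $G'$.
  Get $E^{c}_{add}$ and $E^{c}_{del}$ via Eq. (2);
  Compute the addition weights $W_{add}$ via Eq. (5);
  Sample $E_{add}$ from $E^{c}_{add}$ by weighted random sampling with $W_{add}$;
  Compute the deletion weights $W_{del}$ via Eq. (6);
  Sample $E_{del}$ from $E^{c}_{del}$ by weighted random sampling with $W_{del}$;
  Get augmented graph $G'$ via Eq. (4);
  return $G'$;
Algorithm 1 Vertex-Similarity mapping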

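Notably, in Eq. (2), $E^{c}_{del}$ is actually the edge set of the graph, and $E^{c}_{add}$ is the set of virtual edges, which consists of unlinked pairwise vertices. Then, one can get the sets of edges added to/removed from $G$ via sampling from the candidate sets randomly:

$$E_{add} = \{e_i \mid i = 1, \dots, \lceil m \cdot \beta \rceil\} \subset E^{c}_{add}, \qquad E_{del} = \{e_i \mid i = 1, \dots, \lceil m \cdot \beta \rceil\} \subset E^{c}_{del}, \tag{3}$$

where $\beta$ is the budget of edge modification and $\lceil x \rceil = \mathrm{ceil}(x)$. Finally, based on the random mapping, the connectivity structure of the original graph is modified to generate a virtual graph:

$$G' = (V, E'), \qquad E' = (E \cup E_{add}) \setminus E_{del}. \tag{4}$$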
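The random mapping described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `random_mapping`, the 0..n-1 vertex labeling and the edge-list representation are assumptions made here, as is the reading that the number of sampled edges equals the edge count times the budget, rounded up. The sketch does not check graph connectivity.

```python
import math
import random
from itertools import combinations


def random_mapping(n, edges, beta, seed=None):
    """Sketch of random mapping on a graph with vertices 0..n-1.

    edges is an iterable of undirected edges (u, v); beta is the budget of
    edge modification as a fraction of the edge count. Returns the edge set
    of the augmented graph.
    """
    rng = random.Random(seed)
    edge_set = {tuple(sorted(e)) for e in edges}
    # Eq. (2): existing edges are deletion candidates, unlinked vertex
    # pairs are addition candidates.
    cand_del = sorted(edge_set)
    cand_add = [p for p in combinations(range(n), 2) if p not in edge_set]
    # Eq. (3): sample ceil(m * beta) entries from each candidate set
    # (clamped so that sampling never exceeds the candidates available).
    budget = min(math.ceil(len(edge_set) * beta), len(cand_add), len(cand_del))
    e_del = set(rng.sample(cand_del, budget))
    e_add = set(rng.sample(cand_add, budget))
    # Eq. (4): add the sampled virtual edges and drop the sampled real ones.
    return (edge_set | e_add) - e_del
```

Because the same number of edges is added and removed, the edge count of the augmented graph matches that of the original one.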

3.2.2 Vertex-Similarity Mapping

Vertex similarity, which reflects the number of common features shared by a pair of vertices, has been widely applied to graph mining tasks such as link prediction and community detection. Empirically, vertices in a graph tend to link with each other when they are highly similar. Therefore, a similarity index is used to link vertices of high similarity for graph augmentation.

Fig. 2: An example for edge swapping in motif-random mapping. Left : Common graph motifs such as open-triad and open-quad. Right : Augmented motifs obtained via edge swapping. The dashed lines represent the virtual edges in graph.
Fig. 3: An example for motif-similarity mapping. Red lines, both dashed and solid, represent the candidates. Black lines, both dashed and solid, represent the modified edges.

Vertex-similarity mapping shares the same candidate sets with random mapping but conducts edge selection with weighted random sampling. In other words, random mapping makes selections with equal probability, while vertex-similarity mapping assigns all entries in $E^{c}_{add}$ and $E^{c}_{del}$ relative weights that are associated with the vertex similarity scores. Specifically, before sampling, the similarity scores are computed over all entries in $E^{c}_{add}$ using the Resource Allocation (RA) index [28]. For each entry $(v_i, v_j)$ in $E^{c}_{add}$, the RA score $s_{ij}$ and addition weight $w^{add}_{ij}$ are computed as follows:

$$s_{ij} = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{d_z}, \qquad w^{add}_{ij} = \frac{s_{ij}}{\sum_{s \in S} s}, \tag{5}$$

where $S$ is the set of similarity scores of the candidate entries, $\Gamma(i)$ denotes the one-hop neighbors of $v_i$ and $d_z$ denotes the degree of vertex $z$. Weighted random sampling means that the probability for an entry in $E^{c}_{add}$ to be selected is proportional to its addition weight $w^{add}_{ij}$. It is worth noting that many other similarity indices can also be applied in this scheme. Similarly, during edge deletion, the probability of an edge being sampled from $E^{c}_{del}$ is proportional to the deletion weight as follows:

$$w^{del}_{ij} = 1 - \frac{s_{ij}}{\sum_{s \in S} s}, \tag{6}$$

which means that edges with smaller RA scores have more chances to be removed.
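As a rough illustration of the weighted sampling described above, the sketch below computes RA scores and uses them to bias edge addition and deletion. The helper names (`ra_score`, `similarity_weights`, `weighted_sample`, `vertex_similarity_mapping`) are hypothetical, and two choices are assumptions of this sketch rather than details taken from the paper: scores are normalized within each candidate set, and sampling falls back to uniform weights when every score is zero.

```python
import math
import random
from itertools import combinations


def ra_score(adj, u, v):
    """Resource Allocation index: sum of 1/degree over common neighbors."""
    return sum(1.0 / len(adj[z]) for z in adj[u] & adj[v])


def similarity_weights(adj, pairs, deletion=False):
    """Weights in the style of Eq. (5) (addition) or Eq. (6) (deletion)."""
    scores = [ra_score(adj, u, v) for u, v in pairs]
    total = sum(scores)
    if total == 0:
        return [1.0] * len(pairs)  # assumption: uniform fallback
    if deletion:
        return [1.0 - s / total for s in scores]
    return [s / total for s in scores]


def weighted_sample(items, weights, k, rng):
    """Draw up to k distinct items with probability proportional to weight."""
    items, weights = list(items), list(weights)
    chosen = []
    for _ in range(min(k, len(items))):
        if sum(weights) <= 0:
            weights = [1.0] * len(items)  # assumption: uniform fallback
        idx = rng.choices(range(len(items)), weights=weights)[0]
        chosen.append(items.pop(idx))
        weights.pop(idx)
    return chosen


def vertex_similarity_mapping(n, edges, beta, seed=None):
    """Sketch of vertex-similarity mapping on vertices 0..n-1."""
    rng = random.Random(seed)
    edge_set = {tuple(sorted(e)) for e in edges}
    adj = {v: set() for v in range(n)}
    for u, v in edge_set:
        adj[u].add(v)
        adj[v].add(u)
    cand_del = sorted(edge_set)
    cand_add = [p for p in combinations(range(n), 2) if p not in edge_set]
    budget = math.ceil(len(edge_set) * beta)
    e_add = weighted_sample(cand_add, similarity_weights(adj, cand_add),
                            budget, rng)
    e_del = weighted_sample(cand_del,
                            similarity_weights(adj, cand_del, deletion=True),
                            budget, rng)
    return (edge_set | set(e_add)) - set(e_del)
```

On a path graph 0-1-2-3, for example, the unlinked pairs (0, 2) and (1, 3) each share one neighbor of degree 2 and get an RA score of 0.5, while (0, 3) has no common neighbor and is never chosen for addition while a higher-scoring pair remains.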

3.2.3 Motif-Random Mapping

Graph motifs are sub-graphs that repeat themselves in a specific graph or even among various graphs. Each of these sub-graphs, defined by a particular pattern of interactions between vertices, may describe a framework in which particular functions are accomplished efficiently. In this paper, only open motifs with a chain structure are considered, as shown in Fig. 2, so that the motif discovery process can be replaced by path search. For instance, a common motif such as the open-triad is equivalent to a length-2 path emanating from the head vertex $v_i$ that induces a triangle, which gives the following equivalence relation:

$$\Lambda^{l}_{ij} \equiv \mathrm{Path}^{l}_{ij}, \tag{7}$$

where $l$ is the length of the motif $\Lambda^{l}_{ij}$, whose head and tail vertices are $v_i$ and $v_j$.

The motif-random mapping aims to finetune these motifs into approximately equivalent ones via edge swapping. As shown in Fig. 2, during edge swapping, the edge addition operation takes effect between the head and the tail vertices of the motif, while the edge deletion operation randomly removes an edge in the motif. For all length-$l$ motifs, the candidate set of edge addition is:

$$E^{c}_{add} = \{(v_i, v_j) \mid A_{ij} = 0,\ (A^{l})_{ij} \neq 0;\ i \neq j\}, \tag{8}$$
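Input: Target network $G$, length of motif $l$, proportion of modification $\beta$.
Output: Augmented graph $G'$.
  Get $E^{c}_{add}$ via Eq. (8);
  Compute the addition weights $W_{add}$ via Eq. (5);
  Sample $E_{add}$ from $E^{c}_{add}$ by weighted random sampling with $W_{add}$;
  Initialize $E_{del} = \emptyset$;
  for each $(v_i, v_j) \in E_{add}$ do
      Get the length-$l$ motif via path search: $\Lambda^{l}_{ij} \equiv \mathrm{Path}^{l}_{ij}$;
      Compute the deletion weights $W_{del}$ via Eq. (9);
      Sample one edge $e_{del}$ from $\Lambda^{l}_{ij}$ by weighted random sampling with $W_{del}$;
      Add $e_{del}$ to $E_{del}$;
  end
  Get augmented graph $G'$ via Eq. (4);
  return $G'$;
Algorithm 2 Motif-Similarity mapping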

where $A^{l}$ is $A$ to the power of $l$. Then, one can get $E_{add}$, the set of edges added to $G$, via sampling from $E^{c}_{add}$ randomly. For each pair of vertices $(v_i, v_j)$ in $E_{add}$, there exists a length-$l$ motif $\Lambda^{l}_{ij}$, which has head vertex $v_i$ and tail vertex $v_j$. Next, we randomly sample one edge in $\Lambda^{l}_{ij}$ to remove, and all of these removed edges constitute $E_{del}$. Finally, the virtual graph can be obtained via Eq. (4).
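The edge-swapping procedure above can be sketched for the open-triad case (motif length 2), where an unlinked pair with a nonzero entry in the squared adjacency matrix is simply a pair with at least one common neighbor. The function name `motif_random_mapping` and the graph representation are assumptions of this sketch. It also skips a swap when both motif edges are already scheduled for deletion, which is a design choice made here to keep the edge count constant, not a detail stated in the text.

```python
import math
import random
from itertools import combinations


def motif_random_mapping(n, edges, beta, seed=None):
    """Sketch of motif-random mapping with open triads on vertices 0..n-1."""
    rng = random.Random(seed)
    edge_set = {tuple(sorted(e)) for e in edges}
    adj = {v: set() for v in range(n)}
    for u, v in edge_set:
        adj[u].add(v)
        adj[v].add(u)
    # Eq. (8) with length 2: unlinked pairs joined by a length-2 path,
    # i.e. pairs that share at least one neighbor.
    cand_add = [(i, j) for i, j in combinations(range(n), 2)
                if (i, j) not in edge_set and adj[i] & adj[j]]
    budget = min(math.ceil(len(edge_set) * beta), len(cand_add))
    e_add, e_del = set(), set()
    for i, j in rng.sample(cand_add, budget):
        # Path search: pick one open triad i - k - j.
        k = rng.choice(sorted(adj[i] & adj[j]))
        motif = [tuple(sorted((i, k))), tuple(sorted((k, j)))]
        # Assumption: ignore motif edges already scheduled for deletion.
        free = [e for e in motif if e not in e_del]
        if not free:
            continue
        e_del.add(rng.choice(free))  # remove one motif edge at random
        e_add.add((i, j))            # link head and tail of the motif
    return (edge_set | e_add) - e_del
```

Each swap touches only the three vertices of one triad, which is why this kind of mapping disturbs the degree distribution less than sampling edges across the whole graph.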

3.2.4 Motif-Similarity Mapping

Motif-similarity mapping inherits from motif-random mapping and conducts edge sampling in terms of vertex similarity. Specifically, for the candidate set of edge addition $E^{c}_{add}$ given in Eq. (8), the probability for an arbitrary entry to be selected is proportional to its addition weight according to Eq. (5). Similarly, during edge deletion, the probability of an edge being sampled from the motif $\Lambda^{l}_{ij}$ is proportional to the deletion weight as follows:

$$w^{del}_{uv} = 1 - \frac{s_{uv}}{\sum_{s \in S} s}, \quad \forall (v_u, v_v) \in \Lambda^{l}_{ij}, \tag{9}$$

where $S$ here collects the RA scores of the edges in the motif.

The whole process of motif-similarity mapping is illustrated in Fig. 3.
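The deletion step of motif-similarity mapping can be illustrated with a small helper that picks which edge of a motif to remove. The names `ra_score` and `pick_motif_edge_to_delete` are hypothetical, and the sketch assumes that RA scores are normalized over the edges of the motif and that a uniform choice is used when all scores are zero.

```python
import random


def ra_score(adj, u, v):
    """Resource Allocation index: sum of 1/degree over common neighbors."""
    return sum(1.0 / len(adj[z]) for z in adj[u] & adj[v])


def pick_motif_edge_to_delete(adj, motif_edges, rng):
    """Choose one motif edge, favoring edges with a low RA score."""
    scores = [ra_score(adj, u, v) for u, v in motif_edges]
    total = sum(scores)
    if total == 0:
        weights = [1.0] * len(motif_edges)  # assumption: uniform fallback
    else:
        # Deletion weight: one minus the normalized similarity score.
        weights = [1.0 - s / total for s in scores]
    if sum(weights) <= 0:
        weights = [1.0] * len(motif_edges)  # single-edge corner case
    return rng.choices(motif_edges, weights=weights)[0]


# Graph with edges (0,1), (1,2), (1,3), (2,3); open triad 0 - 1 - 2.
adj = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
edge = pick_motif_edge_to_delete(adj, [(0, 1), (1, 2)], random.Random(0))
print(edge)  # prints (0, 1)
```

In this example edge (1, 2) is supported by the common neighbor 3, so all of the deletion weight falls on edge (0, 1), whose endpoints share no neighbor.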

3.3 Data Filtration

Fig. 4: The architecture of the model evolution. The complete workflow proceeds as follows: 1) pre-train graph classifier using training set; 2) apply graph augmentation to generate data pool; 3) compute the label reliability threshold using validation set; 4) compute the label reliability of examples sampled from graph pool; 5) filter data and obtain augmented set using threshold; 6) retrain graph classifier using the union of training set and augmented set.

In computer vision, the technique of data augmentation is widely used to generate augmented examples, which can be directly treated as new training data. For instance, after an image of a cat undergoes simple data augmentation such as geometric transformation or color depth adjustment, the resulting new image retains the explicit semantics and is still an image of a cat in human vision. However, due to the difference between image data and graph structured data, the examples generated via graph augmentation may lose their original semantics. When the label of the original graph is assigned directly to the generated graph during graph augmentation, one cannot determine whether the assigned label is reliable. Therefore, the concept of label reliability is employed here to measure the matching degree between examples and labels as judged by the classifier.

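Each graph $G_i$ in $D_{val}$ will be fed into classifier $C$ to obtain the prediction vector $\mathbf{p}_i \in \mathbb{R}^{|Y|}$, which represents the probability distribution of how likely an input example belongs to each possible class, and $|Y|$ is the number of classes for labels. Then, a probability confusion matrix $Q \in \mathbb{R}^{|Y| \times |Y|}$, in which the entry $q_{ij}$ represents the average probability that the classifier classifies the graphs of the $i$-th class into the $j$-th class, is computed as follows:

$$\mathbf{q}_k = \frac{1}{\Omega_k} \sum_{y_i = k} \mathbf{p}_i, \qquad Q = [\mathbf{q}_1, \mathbf{q}_2, \dots, \mathbf{q}_{|Y|}]^{\top}, \tag{10}$$

where $\Omega_k$ is the number of graphs belonging to the $k$-th class in $D_{val}$ and $\mathbf{q}_k$ is the average probability distribution of the $k$-th class.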

The label reliability of an example $(G_i, y_i)$ is defined as the product of the example probability distribution $\mathbf{p}_i$ and the class probability distribution $\mathbf{q}_{y_i}$ as follows:

$$r_i = \mathbf{p}_i^{\top} \mathbf{q}_{y_i}. \tag{11}$$

Notably, the definition indicates that an example which the classifier predicts correctly with a greater probability tends to have higher label reliability.

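A threshold $\theta$ used to filter the generated data is defined as:

$$\theta = \arg\min_{\theta} \sum_{(G_i, y_i) \in D_{val}} \Phi\left[(\theta - r_i) \cdot g(G_i, y_i)\right], \tag{12}$$

where $g(G_i, y_i) = 1$ if $C(G_i) = y_i$ and $-1$ otherwise, and $\Phi(x) = 1$ if $x > 0$ and $0$ otherwise. The process of threshold optimization tends to ensure that correctly classified examples have greater label reliability than incorrectly classified examples.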
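The filtration quantities above (the probability confusion matrix, the label reliability and the threshold) can be sketched with plain Python lists. The function names are hypothetical. The threshold search assumes that the objective counts correctly classified validation examples below the threshold plus misclassified ones above it, and that it is enough to try the observed reliability values as candidates, since such a count only changes at those values.

```python
def confusion_matrix(probs, labels, num_classes):
    """Eq. (10): row k is the mean prediction vector of class-k examples."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for p, y in zip(probs, labels):
        counts[y] += 1
        for j, pj in enumerate(p):
            sums[y][j] += pj
    return [[s / counts[k] if counts[k] else 0.0 for s in sums[k]]
            for k in range(num_classes)]


def label_reliability(p, y, q):
    """Eq. (11): inner product of the prediction vector with row y of Q."""
    return sum(pi * qi for pi, qi in zip(p, q[y]))


def reliability_threshold(probs, labels, q):
    """Eq. (12): threshold with the fewest misordered validation examples."""
    r = [label_reliability(p, y, q) for p, y in zip(probs, labels)]
    correct = [max(range(len(p)), key=p.__getitem__) == y
               for p, y in zip(probs, labels)]

    def loss(theta):
        # Correct examples below theta and incorrect ones above it.
        return sum((theta > ri) if ok else (theta < ri)
                   for ri, ok in zip(r, correct))

    # Assumption: the smallest minimizer among observed values is returned.
    return min(sorted(set(r)), key=loss)


# Toy validation predictions for a two-class problem.
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
labels = [0, 0, 1, 1]
q = confusion_matrix(probs, labels, 2)
theta = reliability_threshold(probs, labels, q)
```

In this toy case the last example is misclassified and has the lowest reliability (about 0.49), so a threshold at that value separates it from the three correctly classified examples.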

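Input: Training set $D_{train}$, validation set $D_{val}$, graph augmentation $f$, number of iterations $T$.
Output: Evolutive model $C'$.
  Pre-train classifier $C$ using $D_{train}$ and $D_{val}$;
  Initialize $iter = 0$;
  for $iter < T$ do
      Graph augmentation: $D_{pool} = f(D_{train})$;
      For all graphs in $D_{val}$ classified by $C$, get $\mathbf{p}_i$;
      Get probability confusion matrix $Q$ via Eq. (10);
      For all graphs in $D_{val}$ classified by $C$, get $r_i$ via Eq. (11);
      Get the label reliability threshold $\theta$ via Eq. (12);
      For all samples in $D_{pool}$ classified by $C$, compute $r_i$; if $r_i > \theta$, add $(G_i, y_i)$ to $D'_{train}$;
      Get evolutive classifier: $C' = \mathrm{retrain}(C, D_{train} \cup D'_{train})$;
      $iter = iter + 1$;
      $C = C'$, $D_{train} = D_{train} \cup D'_{train}$;
  end
  return $C'$;
Algorithm 3 M-Evolve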
Collection          Dataset    #Graphs  #Classes  Avg. #vertices  Avg. #edges  Bias (%)
Chemical Compounds  MUTAG      188      2         17.93           19.79        66.5
Chemical Compounds  PTC-MR     344      2         14.29           14.69        55.8
Chemical Compounds  ENZYMES    600      6         32.63           62.14        16.7
Brain               KKI        83       2         26.96           48.42        55.4
Brain               Peking-1   85       2         39.31           77.35        57.6
Brain               OHSU       79       2         82.01           199.66       55.7
TABLE II: Dataset properties: the number of graphs and classes in each dataset, the average numbers of vertices and edges per graph, and the bias, i.e., the proportion of the dominant class.

3.4 Model Evolution Framework

Model evolution aims to optimize classifiers via graph augmentation, data filtration and model retraining iteratively, and ultimately improve the performance on graph classification. Fig. 4 and Algorithm 3 demonstrate the workflow and the procedure of M-Evolve, respectively. Here, a variable, the number of iterations $T$, is introduced for repeating the above workflow to continuously augment the dataset and optimize the classifier.
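The iterative workflow can be sketched as a short loop over user-supplied callables. Everything named here (`m_evolve`, `augment`, `fit`, `predict_proba` and the helper functions) is a hypothetical interface chosen for illustration, not the authors' code. The sketch assumes that augmented examples are kept when their reliability is strictly above the threshold and that the enlarged training set is carried into the next iteration.

```python
def class_mean_probs(probs, labels, num_classes):
    """Eq. (10): row k holds the mean prediction vector of class k."""
    rows = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for p, y in zip(probs, labels):
        counts[y] += 1
        rows[y] = [a + b for a, b in zip(rows[y], p)]
    return [[v / max(c, 1) for v in row] for row, c in zip(rows, counts)]


def reliabilities(probs, labels, q):
    """Eq. (11): inner product of each prediction with its class row."""
    return [sum(a * b for a, b in zip(p, q[y])) for p, y in zip(probs, labels)]


def threshold(probs, labels, r):
    """Eq. (12): threshold with the fewest misordered validation examples."""
    ok = [max(range(len(p)), key=p.__getitem__) == y
          for p, y in zip(probs, labels)]

    def loss(t):
        return sum((t > ri) if c else (t < ri) for ri, c in zip(r, ok))

    return min(sorted(set(r)), key=loss)


def m_evolve(train, val, augment, fit, predict_proba, num_classes, iterations):
    """train/val are lists of (graph, label); returns (model, final train set).

    augment(graph) -> graph, fit(examples) -> model and
    predict_proba(model, graph) -> list of class probabilities are supplied
    by the caller.
    """
    model = fit(train)                                  # pre-training
    val_labels = [y for _, y in val]
    for _ in range(iterations):
        pool = [(augment(g), y) for g, y in train]      # graph augmentation
        val_probs = [predict_proba(model, g) for g, _ in val]
        q = class_mean_probs(val_probs, val_labels, num_classes)
        theta = threshold(val_probs, val_labels,
                          reliabilities(val_probs, val_labels, q))
        pool_probs = [predict_proba(model, g) for g, _ in pool]
        r_pool = reliabilities(pool_probs, [y for _, y in pool], q)
        # Data filtration: keep augmented examples with reliable labels.
        kept = [ex for ex, r in zip(pool, r_pool) if r > theta]
        train = train + kept
        model = fit(train)                              # retraining
    return model, train
```

Passing the classifier in as plain callables keeps the loop independent of any particular embedding, kernel or GNN model, which mirrors the framework's claim of being combinable with existing classifiers.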

4 Experiments

4.1 Datasets

We evaluate the proposed methods on six benchmark datasets from bioinformatics and chemoinformatics: MUTAG [29], PTC-MR [30], ENZYMES [1], KKI, Peking-1 and OHSU [31]. The first three are graph collections of chemical compounds in which vertices correspond to molecular structures and edges indicate chemical bonds between them. The last three are constructed from brain networks, where vertices correspond to Regions of Interest (ROIs) and edges represent the correlations between two ROIs. The specifications of these datasets are given in TABLE II.

Dataset Mapping | SF: SVM Log KNN RF | NetLSD: SVM Log KNN RF | Graph2vec: SVM Log KNN RF | Gl2vec: SVM Log KNN RF | Diffpool | Avg RIMP
MUTAG original 0.822 0.824 0.824 0.846 0.823 0.829 0.828 0.836 0.737 0.820 0.784 0.820 0.746 0.830 0.800 0.817 0.801
random 0.843 0.845 0.846 0.878 0.855 0.851 0.860 0.886 0.756 0.844 0.793 0.847 0.748 0.851 0.820 0.841 0.810 2.78%
vertex-similarity 0.848 0.855 0.840 0.870 0.843 0.845 0.856 0.874 0.750 0.850 0.801 0.861 0.748 0.845 0.823 0.850 0.825 2.86%
motif-random 0.860 0.852 0.846 0.887 0.861 0.862 0.856 0.882 0.761 0.850 0.803 0.859 0.752 0.856 0.829 0.845 0.807 3.63%
motif-similarity 0.863 0.855 0.850 0.890 0.861 0.864 0.861 0.892 0.759 0.851 0.809 0.852 0.762 0.863 0.833 0.846 0.831 4.00%
PTC-MR original 0.551 0.566 0.577 0.587 0.543 0.578 0.548 0.576 0.571 0.518 0.509 0.549 0.572 0.538 0.507 0.550 0.609
random 0.611 0.590 0.605 0.618 0.579 0.580 0.590 0.607 0.580 0.572 0.547 0.592 0.587 0.571 0.528 0.594 0.637 5.77%
vertex-similarity 0.595 0.594 0.601 0.622 0.577 0.580 0.578 0.601 0.592 0.571 0.548 0.599 0.601 0.575 0.554 0.605 0.636 6.22%
motif-random 0.615 0.595 0.609 0.630 0.595 0.583 0.582 0.612 0.587 0.570 0.551 0.592 0.587 0.578 0.535 0.596 0.627 6.37%
motif-similarity 0.616 0.595 0.610 0.624 0.581 0.583 0.597 0.620 0.596 0.579 0.553 0.593 0.588 0.579 0.545 0.602 0.639 6.97%
ENZYMES original 0.309 0.237 0.287 0.397 0.337 0.287 0.304 0.349 0.361 0.253 0.283 0.337 0.348 0.268 0.238 0.318 0.487
random 0.347 0.412 0.302 0.412 0.353 0.287 0.327 0.369 0.336 0.269 0.290 0.346 0.286 0.273 0.259 0.350 0.500 7.25%
vertex-similarity 0.368 0.416 0.333 0.431 0.352 0.295 0.329 0.359 0.372 0.279 0.290 0.369 0.345 0.266 0.274 0.357 0.489 11.12%
motif-random 0.364 0.418 0.317 0.418 0.365 0.294 0.340 0.367 0.376 0.268 0.299 0.355 0.298 0.273 0.264 0.356 0.508 10.20%
motif-similarity 0.363 0.414 0.317 0.415 0.375 0.291 0.335 0.376 0.352 0.270 0.289 0.352 0.291 0.280 0.260 0.358 0.506 9.55%
KKI original 0.550 0.500 0.520 0.517 0.548 0.524 0.512 0.496 0.549 0.527 0.524 0.552 0.538 0.502 0.526 0.502 0.523
random 0.600 0.544 0.554 0.622 0.599 0.535 0.553 0.562 0.580 0.568 0.594 0.574 0.556 0.508 0.544 0.544 0.580 7.97%
vertex-similarity 0.601 0.556 0.559 0.654 0.618 0.568 0.568 0.585 0.603 0.570 0.573 0.655 0.578 0.597 0.582 0.588 0.597 12.87%
motif-random 0.598 0.560 0.578 0.647 0.603 0.563 0.558 0.574 0.586 0.606 0.592 0.661 0.567 0.593 0.596 0.604 0.586 13.11%
motif-similarity 0.607 0.560 0.561 0.649 0.619 0.558 0.565 0.582 0.587 0.606 0.603 0.634 0.581 0.592 0.597 0.582 0.612 13.36%
Peking-1 original 0.578 0.548 0.541 0.558 0.605 0.612 0.589 0.591 0.572 0.522 0.474 0.522 0.555 0.522 0.521 0.521 0.586
random 0.660 0.562 0.603 0.627 0.652 0.631 0.662 0.654 0.579 0.547 0.546 0.597 0.584 0.555 0.559 0.607 0.650 9.20%
vertex-similarity 0.672 0.571 0.615 0.623 0.652 0.644 0.622 0.666 0.638 0.557 0.562 0.627 0.619 0.561 0.566 0.632 0.657 11.47%
motif-random 0.681 0.583 0.619 0.644 0.668 0.636 0.666 0.689 0.579 0.553 0.593 0.612 0.605 0.581 0.567 0.633 0.654 12.34%
motif-similarity 0.670 0.587 0.624 0.663 0.694 0.648 0.671 0.699 0.581 0.565 0.564 0.630 0.607 0.563 0.572 0.635 0.632 12.72%
OHSU original 0.610 0.595 0.610 0.667 0.547 0.489 0.549 0.581 0.557 0.577 0.585 0.567 0.557 0.541 0.544 0.557 0.543
random 0.643 0.635 0.645 0.697 0.595 0.534 0.582 0.641 0.557 0.640 0.628 0.645 0.564 0.595 0.570 0.642 0.637 8.08%
vertex-similarity 0.653 0.641 0.643 0.722 0.641 0.546 0.614 0.661 0.557 0.658 0.623 0.673 0.557 0.625 0.625 0.632 0.640 10.82%
motif-random 0.650 0.638 0.648 0.728 0.613 0.546 0.584 0.641 0.557 0.653 0.633 0.686 0.564 0.602 0.625 0.645 0.627 10.04%
motif-similarity 0.656 0.641 0.650 0.726 0.638 0.541 0.587 0.641 0.557 0.678 0.635 0.650 0.572 0.605 0.625 0.652 0.615 10.33%
TABLE III: Graph classification results of original and evolutive models. The best results are marked in bold. The far-right column gives the average relative improvement rate (Avg RIMP) in accuracy.

4.2 Graph Classification Methods

We consider the following five graph classification methods in our experiments: the first two are graph embedding models, the middle two are kernel models, and the last one is a GNN model.

  • SF [32]. It is an embedding method and performs graph classification by spectral decomposition of the graph Laplacian, i.e., it relies on spectral features of the graph.

  • Graph2vec [16]. It extends the document embedding methods to graph classification and learns a distributed representation of the entire graph via document embedding neural networks.

  • NetLSD [33]. It is a kernel method and performs graph classification by extracting compact graph signatures that inherit the formal properties of the Laplacian spectrum.

  • Gl2vec [34]. It constructs vectors for feature representations by comparing static and temporal network graphlet distributions to random graphs generated from different null models.

  • Diffpool [26]. It is a differentiable graph pooling module that can generate hierarchical representations of graphs by learning a differentiable soft cluster assignment for vertices and maps them to a coarsened graph layer by layer.

For all graph kernel and embedding methods, we implement graph classification by using the following machine learning classifiers: SVM based on radial basis kernel (SVM), logistic regression classifier (Log), $k$-nearest neighbors classifier (KNN) and random forest classifier (RF). In total, there are 17 available combinations of graph classification.

4.3 Experiment Setup

Each dataset is split into training, validation and testing sets with a proportion of 7:1:2. In this work, the validation set serves two purposes: 1) finetuning the hyper-parameters of classifiers in combination with grid search; 2) computing the label reliability threshold and confusion matrix of classifiers during model evolution. We repeat 5-fold cross validation 10 times and report the average accuracy across all trials.

For all kernel and embedding methods, we set the embedding dimension to 128. We then set the budget of edge modification $\beta$ to 0.15. During motif-random and motif-similarity mapping, we fix $l = 2$, which means that we only consider the open-triad motif. Furthermore, during model evolution, we set the number of iterations $T$ to 5.

4.4 Evaluation

We evaluate the benefit of the proposed graph augmentation methods and M-Evolve framework, answering the following research questions:

  • RQ1: Can M-Evolve improve the performance of graph classification when being combined with existing graph classification models?

  • RQ2: What are the roles that similarity and motif mechanisms play in improving the effect of graph augmentation?

  • RQ3: Is the data filtration necessary and how does it help M-Evolve achieve performance improvement in graph classification?

  • RQ4: How does M-Evolve achieve interpretable enhancement of graph classification?

We combine the proposed model evolution framework with all graph classification models to show a crosswise comparison. Specifically, the four graph augmentation methods are combined with 17 graph classification combinations, totaling 68 available experimental combinations. For instance, a practical combination could be {vertex-similarity mapping + Graph2vec + KNN} or {random mapping + Diffpool}.

4.4.1 Enhancement for Graph Classification

TABLE III reports the results of performance comparison between the evolutive models and the original models, from which one can observe a significant boost in classification performance across all six datasets. Overall, the models combined with the proposed M-Evolve framework obtain higher average classification accuracy in most cases, and M-Evolve achieves a 96.81% success rate on the enhancement of graph classification. Here, the success rate refers to the percentage of evolutive models with accuracy higher than that of the corresponding original models in TABLE III, i.e., 395/408 = 96.81%. These results provide a positive answer to RQ1, indicating that M-Evolve significantly improves the performance of the 17 graph classification combinations. We speculate that the original models trained with limited training data are over-fitting, whereas M-Evolve enlarges the training data via graph augmentation and optimizes graph classifiers via iterative retraining, which can improve generalization and alleviate over-fitting to a certain extent.

Now, we define the relative improvement rate (RIMP) in accuracy as follows:

$$\mathrm{RIMP} = \frac{Acc_{evo} - Acc_{ori}}{Acc_{ori}}, \tag{13}$$

where $Acc_{evo}$ and $Acc_{ori}$ refer to the accuracy of the evolutive and original models, respectively. Comparing all the graph augmentation methods, we count the numbers of experiments in which they obtain the highest RIMP, which are 1, 32, 24 and 57 for random, vertex-similarity, motif-random and motif-similarity mapping, respectively. In TABLE III, the far-right column gives the average relative improvement rate (Avg RIMP) in accuracy, from which one can see that M-Evolve combined with similarity-based mappings obtains the best results overall. In particular, motif-similarity mapping, which combines similarity and motif mechanisms, outperforms the others in half of the cases. These results indicate that both similarity and motif mechanisms play positive roles in enhancing graph classification, answering RQ2. As a reasonable explanation, the similarity mechanism tends to link vertices with higher similarity and is capable of optimizing the topological structure legitimately, while the motif mechanism achieves edge modification via local edge swapping, which has less effect on the degree distribution of the graph.
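As a small worked example of the relative improvement rate, the snippet below applies the definition to one cell of TABLE III (MUTAG, SF + SVM, random mapping versus the original model). The function name `rimp` is chosen here for illustration.

```python
def rimp(acc_evolved, acc_original):
    """Relative improvement rate in accuracy, Eq. (13)."""
    return (acc_evolved - acc_original) / acc_original


# MUTAG, SF + SVM: 0.822 for the original model, 0.843 with random mapping.
print(f"{rimp(0.843, 0.822):.2%}")  # prints 2.55%
```

Averaging this quantity over the 17 classifier combinations of a row gives the kind of figure reported in the Avg RIMP column.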

Fig. 5: The impact of data filtration on classification performance in terms of Avg RIMP.
Fig. 6: Visualization of training data distribution and decision boundaries of different graph classifiers on MUTAG dataset. Points with different colors represent training data with different labels and regions with different colors belong to different classes.

4.4.2 Impact of Data Filtration in Model Evolution

Thanks to the outstanding performance of M-Evolve, we further investigate the impact of the data filtration on the enhancement of graph classification. Specifically, we conduct contrastive experiments in which the data filtration operation is removed from M-Evolve and the performance differences between the two cases with and without data filtration are shown in Fig. 5. From the comparison results, we observe that there is a more significant improvement in classification performance when the M-Evolve framework is combined with data filtration in most cases, positively answering RQ3.

Notably, without data filtration, the results of M-Evolve with different mappings vary considerably in overall quality, and this mechanism may even have a negative effect in certain cases. On the contrary, with data filtration, M-Evolve has more significant and consistent effects on enhancing graph classification among different mappings. A reasonable explanation for this effect is that data filtration is capable of retaining examples that are conducive to the model's decision, so that the augmented sets obtained via various mappings tend to have similar feature distributions. As a result, data filtration narrows the quality gap between the data generated by random mapping and those generated by the other three mappings, achieving consistent performance. As an exception, the vertex-similarity mapping (v-s) shows particularly strong performance without data filtration on ENZYMES. One possible explanation is that accepting more augmented examples may be more favorable for optimizing relatively complex multiclass classification.

Fig. 7: Visualization of the decision boundaries of the graph classification combination (SF + SVM) on the MUTAG dataset. Points with different colors represent testing data with different labels.

4.4.3 Explanatory Visualization of Data and Models

Next, we apply visualization techniques to investigate how graph augmentation enriches the data distribution and how the M-Evolve framework optimizes the performances of different models. Since higher-dimensional data are difficult to visualize, we set the embedding dimension of both graph kernel and embedding models to 2.

First, we compare the training data distribution before and after model evolution, as visualized in Fig. 6, to demonstrate the effectiveness of the proposed graph augmentation. Specifically, the top and bottom rows show the decision regions of the original and evolutive models respectively, and points with different colors represent training data with different labels. Panels (a)–(d) are based on the same data split and graph augmentation, but on different graph classification combinations (vertex-similarity mapping + SF + {SVM, Log, KNN, RF}). After evolution, the training set is visibly larger and the boundaries between the distributions of differently labeled data are clearer, indicating that graph augmentation enriches the training data and that the new data distribution is more conducive to training classifiers.
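The decision regions described above can be reproduced in spirit by evaluating a classifier at every cell of a 2-D mesh over the embedding space. The following is a minimal sketch, assuming a hand-written k-nearest-neighbor classifier and made-up 2-D points that stand in for graph embeddings; the function names `knn_predict` and `decision_grid` are hypothetical and this is not the plotting code used for Fig. 6.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points nearest to `query`."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: (train_points[i][0] - query[0]) ** 2
        + (train_points[i][1] - query[1]) ** 2,
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def decision_grid(train_points, train_labels, x_range, y_range, steps=20, k=3):
    """Predict a label for every cell of a steps x steps mesh.

    Rows follow the y axis and columns follow the x axis, so the result
    can be passed directly to an image-style plotting routine.
    """
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    xs = [x_min + (x_max - x_min) * c / (steps - 1) for c in range(steps)]
    ys = [y_min + (y_max - y_min) * r / (steps - 1) for r in range(steps)]
    return [
        [knn_predict(train_points, train_labels, (x, y), k) for x in xs]
        for y in ys
    ]


# Made-up 2-D embeddings for two classes (illustrative values only).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (0.9, 0.8), (0.8, 1.0)]
labels = [0, 0, 0, 1, 1, 1]

grid = decision_grid(points, labels, (0.0, 1.0), (0.0, 1.0), steps=5, k=3)
for row in grid:
    print(row)
```

Coloring each grid cell by its predicted label and overlaying the training points gives a picture comparable to one panel of Fig. 6; running it once on the original training set and once on the augmented set gives the before and after views.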

Fig. 8: Parameter sensitivity evaluation of the M-Evolve framework on graph classification.

Furthermore, we visualize the decision boundaries in Fig. 7 to highlight the difference between the original and evolutive models. Specifically, panels (a)–(e) are based on the same combination (vertex-similarity mapping + SF + SVM) but on different data splits, i.e., different folds of the dataset are used for testing. In the original models, the decision regions of the non-dominant class are fragmented and scattered. During model evolution, the scattered regions tend to merge and the original decision boundaries become smoother. These observations answer RQ4.
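One way to put a number on the fragmentation seen in such plots is to count the connected decision regions of each class on a discretized grid of predicted labels. The sketch below is an illustration under our own assumptions (4-connectivity, a hypothetical `count_regions` helper, and hand-made toy grids); it is not a measurement reported in this paper.

```python
from collections import deque


def count_regions(grid):
    """Return {label: number of 4-connected regions} for a 2-D grid of labels."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    counts = {}
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            # An unvisited cell starts a new region of its label.
            label = grid[r][c]
            counts[label] = counts.get(label, 0) + 1
            seen[r][c] = True
            queue = deque([(r, c)])
            # Breadth-first flood fill over same-label neighbors.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (
                        0 <= ny < rows
                        and 0 <= nx < cols
                        and not seen[ny][nx]
                        and grid[ny][nx] == label
                    ):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return counts


# Toy label grids: class 1 is scattered in the first and merged in the second.
fragmented = [
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
merged = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

print(count_regions(fragmented))  # {0: 1, 1: 3}
print(count_regions(merged))      # {0: 1, 1: 1}
```

A drop in the region count of the non-dominant class between the original and evolutive models would correspond to the merging of scattered regions described above.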

In summary, graph augmentation efficiently increases the data scale and thereby enriches the data distribution, while the M-Evolve framework as a whole optimizes the decision boundaries of the classifiers and ultimately improves their generalization performance.

Fig. 9: The impact of data filtration on the parameter sensitivity.

4.4.4 Parameter Sensitivity

In this subsection, we further analyze the impact of key parameters on the performance of the M-Evolve framework. Specifically, we vary the budget of edge modification over five values, one per data point on each curve in Fig. 8. Due to space limitations, we only present the evaluation results for the combinations ({SF, NetLSD, Graph2vec, Gl2vec} + SVM & Diffpool) in Fig. 8, which cover the graph kernel and embedding methods with SVM as well as the GNN-based Diffpool method. The results show that the M-Evolve framework is not particularly sensitive to the parameter setting for graph kernel and embedding models. For the Diffpool model, on the other hand, the sensitivity curves follow a consistent trend across all datasets, indicating that perturbations that are too large or too small are not conducive to graph augmentation.

Furthermore, we repeat the parameter sensitivity experiments without data filtration, as shown in Fig. 9, to verify our conjecture that data filtration helps M-Evolve reduce parameter sensitivity. Specifically, we use the variance of the five data points on each curve to measure the corresponding parameter sensitivity. Fig. 9(a) shows the individual variance (Ind Var) of all the experimental combinations involved in Fig. 8, and Fig. 9(b) presents the average variance for each dataset. The comparison shows that the results with data filtration fluctuate less and are more stable under different parameter settings, which supports our conjecture that data filtration improves the robustness of M-Evolve against parameter variations. As for the more pronounced fluctuations on the Diffpool curves, we speculate that end-to-end deep learning models are more capable of capturing slight changes in data features than machine learning models like SVM.
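The sensitivity measure described above reduces to a variance per curve and a mean of those variances per dataset. The following is a minimal sketch, assuming population variance (the text does not say whether population or sample variance is used) and made-up accuracy values; the helper names and the numbers are hypothetical.

```python
from statistics import mean, pvariance

# Made-up accuracy curves (in %): dataset -> combination -> five data points,
# one per parameter setting. These numbers are illustrative only.
curves = {
    "MUTAG": {
        "SF + SVM": [86.0, 87.0, 88.0, 87.0, 86.0],
        "Diffpool": [80.0, 84.0, 86.0, 83.0, 79.0],
    },
}


def individual_variance(curves):
    """Variance of the data points on each curve (Ind Var)."""
    return {
        (dataset, combo): pvariance(values)
        for dataset, combos in curves.items()
        for combo, values in combos.items()
    }


def average_variance(curves):
    """Mean of the per-curve variances for each dataset."""
    return {
        dataset: mean(pvariance(values) for values in combos.values())
        for dataset, combos in curves.items()
    }


print(individual_variance(curves))
print(average_variance(curves))
```

Computing both quantities once for the curves obtained with data filtration and once for those obtained without it gives the two sets of values compared in Fig. 9; a lower variance means a flatter curve and hence lower parameter sensitivity.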

In summary, data filtration not only narrows the performance gap among different graph augmentation methods, but also reduces the sensitivity of M-Evolve to different parameter settings, implying that the M-Evolve framework is robust to parameter settings to a certain extent.

5 Conclusion

In this paper, we introduce data augmentation for graph classification and present four heuristic algorithms that generate weakly labeled data for small-scale benchmark datasets by transforming the graph structure. Furthermore, we propose a generic model evolution framework named M-Evolve, which combines graph augmentation, data filtration and model retraining to optimize pre-trained graph classifiers. Experiments conducted on six benchmark datasets demonstrate that the proposed framework helps existing graph classification models alleviate over-fitting when training on small-scale benchmark datasets and achieves a significant improvement in classification performance. For future work, we will design effective graph augmentation methods for large-scale graphs and extend the current framework to real-world datasets such as social networks and transaction networks.

Acknowledgments

The authors would like to thank all the members in the IVSN Research Group, Zhejiang University of Technology for the valuable discussions about the ideas and technical details presented in this paper.

References

  • [1] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel, “Protein function prediction via graph kernels,” Bioinformatics, vol. 21, no. suppl_1, pp. i47–i56, 2005.
  • [2] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in Neural Information Processing Systems, 2015, pp. 2224–2232.
  • [3] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “Arnetminer: extraction and mining of academic social networks,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 990–998.
  • [4] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-lehman graph kernels,” Journal of Machine Learning Research, vol. 12, no. 9, 2011.
  • [5] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting, “Propagation kernels: efficient graph kernels from propagated information,” Machine Learning, vol. 102, no. 2, pp. 209–245, 2016.
  • [6] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in Artificial Intelligence and Statistics, 2009, pp. 488–495.
  • [7] T. Gärtner, P. Flach, and S. Wrobel, “On graph kernels: Hardness results and efficient alternatives,” in Learning Theory and Kernel Machines.   Springer, 2003, pp. 129–143.
  • [8] H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized kernels between labeled graphs,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 321–328.
  • [9] P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert, “Extensions of marginalized graph kernels,” in Proceedings of the twenty-first International Conference on Machine Learning, 2004, p. 70.
  • [10] M. Sugiyama and K. Borgwardt, “Halting in random walk kernels,” in Advances in Neural Information Processing Systems, 2015, pp. 1639–1647.
  • [11] K. M. Borgwardt and H.-P. Kriegel, “Shortest-path kernels on graphs,” in Fifth IEEE International Conference on Data Mining (ICDM’05).   IEEE, 2005, pp. 8–pp.
  • [12] H. Cai, V. W. Zheng, and K. C.-C. Chang, “A comprehensive survey of graph embedding: Problems, techniques, and applications,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 9, pp. 1616–1637, 2018.
  • [13] F. Chen, Y.-C. Wang, B. Wang, and C.-C. J. Kuo, “Graph representation learning: a survey,” APSIPA Transactions on Signal and Information Processing, vol. 9, 2020.
  • [14] C. Fu, Y. Zheng, Y. Liu, Q. Xuan, and G. Chen, “Nes-tl: Network embedding similarity-based transfer learning,” IEEE Transactions on Network Science and Engineering, 2019.
  • [15] W. Guo, Y. Shi, S. Wang, and N. N. Xiong, “An unsupervised embedding learning feature representation scheme for network big data analysis,” IEEE Transactions on Network Science and Engineering, vol. 7, no. 1, pp. 115–126, 2019.
  • [16] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal, “graph2vec: Learning distributed representations of graphs,” arXiv preprint arXiv:1707.05005, 2017.
  • [17] H. Dai, B. Dai, and L. Song, “Discriminative embeddings of latent variable models for structured data,” in International Conference on Machine Learning, 2016, pp. 2702–2711.
  • [18] A. Narayanan, M. Chandramohan, L. Chen, Y. Liu, and S. Saminathan, “subgraph2vec: Learning distributed representations of rooted sub-graphs from large graphs,” arXiv preprint arXiv:1606.08928, 2016.
  • [19] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” arXiv preprint arXiv:1704.01212, 2017.
  • [20] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
  • [21] Y. Jin and J. F. JaJa, “Learning graph-level representations with recurrent neural networks,” arXiv preprint arXiv:1805.07683, 2018.
  • [22] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec, “Graphrnn: Generating realistic graphs with deep auto-regressive models,” arXiv preprint arXiv:1802.08773, 2018.
  • [23] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
  • [24] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters in convolutional neural networks on graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3693–3702.
  • [25] M. Fey, J. Eric Lenssen, F. Weichert, and H. Müller, “Splinecnn: Fast geometric deep learning with continuous b-spline kernels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 869–877.
  • [26] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec, “Hierarchical graph representation learning with differentiable pooling,” in Advances in Neural Information Processing Systems, 2018, pp. 4800–4810.
  • [27] Y. Ma, S. Wang, C. C. Aggarwal, and J. Tang, “Graph convolutional networks with eigenpooling,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019, pp. 723–731.
  • [28] T. Zhou, L. Lü, and Y.-C. Zhang, “Predicting missing links via local information,” The European Physical Journal B, vol. 71, no. 4, pp. 623–630, 2009.
  • [29] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch, “Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity,” Journal of Medicinal Chemistry, vol. 34, no. 2, pp. 786–797, 1991.
  • [30] C. Helma, R. D. King, S. Kramer, and A. Srinivasan, “The predictive toxicology challenge 2000–2001,” Bioinformatics, vol. 17, no. 1, pp. 107–108, 2001.
  • [31] S. Pan, J. Wu, X. Zhu, G. Long, and C. Zhang, “Task sensitive feature exploration and learning for multitask graph classification,” IEEE Transactions on Cybernetics, vol. 47, no. 3, pp. 744–758, 2016.
  • [32] N. de Lara and E. Pineau, “A simple baseline algorithm for graph classification,” arXiv preprint arXiv:1810.09155, 2018.
  • [33] A. Tsitsulin, D. Mottin, P. Karras, A. Bronstein, and E. Müller, “Netlsd: hearing the shape of a graph,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 2347–2356.
  • [34] K. Tu, J. Li, D. Towsley, D. Braines, and L. D. Turner, “gl2vec: Learning feature representation using graphlets for directed networks,” in Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2019, pp. 216–221.