1 Introduction
Graph classification, or network classification, which aims to identify the category labels of graphs in a dataset, has recently attracted considerable attention from different fields like bioinformatics [1] and chemoinformatics [2]. For instance, in bioinformatics, proteins or enzymes can be represented as labeled graphs, in which vertices are atoms and edges represent chemical bonds that connect atoms. The task of graph classification is to classify these molecular graphs according to their chemical properties like carcinogenicity, mutagenicity and toxicity.
However, in bioinformatics and chemoinformatics, the scale of the known benchmark graph datasets is generally in the range of tens to thousands, which is far smaller than that of the available real-world bibliography datasets like DBLP Graph Stream [3]. Despite the advances of various graph classification methods, from graph kernels and graph embedding to graph neural networks, the limited data scale makes them prone to overfitting and poor generalization. Overfitting refers to a modeling error that occurs when a model learns a function with high variance to perfectly fit a limited set of data. A natural idea to address overfitting at the data level is data augmentation, which is widely applied in computer vision. Data augmentation encompasses a number of techniques that enhance both the scale and the quality of training data so that better-performing models can be learnt. In computer vision, image augmentation methods include geometric transformation, color depth adjustment, neural style transfer, adversarial training, etc.
However, unlike image data, which have a clear grid structure, graphs have irregular topological structures, making it hard to generalize some basic augmentation operations to graphs.
To solve the above problem, we study data augmentation on graphs, as visualized in Fig. 1, and develop four graph augmentation methods, called random mapping, vertex-similarity mapping, motif-random mapping and motif-similarity mapping, respectively. The idea is to generate more virtual data for small datasets via heuristic modification and transformation of graph structures. Since the generated graphs are artificial and treated as weakly labeled data, their validity remains to be verified. Therefore, we introduce the concept of label reliability, which reflects the degree to which an example matches its label according to the classifier, to filter high-quality augmented examples from the generated data. Furthermore, we introduce a model evolution framework, named MEvolve, which combines graph augmentation, data filtration and model retraining to optimize classifiers. We demonstrate that the new framework achieves a significant improvement of performance on graph classification.
The main contributions of this work are summarized as follows:

We effectively utilize the technique of data augmentation on graph classification, and develop several methods to generate effective weakly labeled data for graph benchmark datasets. To the best of our knowledge, this is the first work that uses data augmentation in graph mining.

We establish a generic model evolution framework named MEvolve for enhancing graph classification, which can be easily combined with existing graph classification models to optimize their performance.

We conduct experiments on six benchmark datasets. Experimental results demonstrate the superiority of our MEvolve framework in helping five graph classification algorithms to achieve significant improvement of performances.
The rest of the paper is organized as follows. First, in Sec. 2, we review the related works on graph classification. Then, in Sec. 3, we introduce and analyze several graph augmentation methods and a new model evolution framework. Thereafter, we present extensive experiments in Sec. 4, with detailed discussions. Finally, we conclude the paper and outline future work in Sec. 5.
2 Related work
2.1 Graph Classification
2.1.1 Graph Kernel Methods
Graph kernels perform graph comparison by recursively decomposing pairwise graphs from the dataset into atomic substructures and using a similarity function among these substructures. Intuitively, graph kernels can be understood as functions measuring the similarity of pairwise graphs. Generally, the kernels can be designed by considering various structural properties like the similarity of local neighborhood structures (WL kernel [4], propagation kernel [5]), the occurrence of certain graphlets or subgraphs (graphlet kernel [6]), the number of walks in common (random walk kernel [7, 8, 9, 10]), and the attributes and lengths of the shortest paths (SP kernel [11]).
2.1.2 Embedding Methods
Graph embedding methods [12, 13, 14, 15] capture the graph topology and derive a fixed number of features, ultimately achieving vector representations for the whole graph. For prediction on the graph level, this approach is compatible with any standard machine learning classifier such as support vector machine (SVM), $k$-nearest neighbors and random forest. Widely used embedding methods include graph2vec [16], structure2vec [17], subgraph2vec [18], etc.
2.1.3 Deep Learning Methods
Recently, increasing attention has been drawn to the application of deep learning to graph mining, and a wide variety of graph neural network (GNN) frameworks have been proposed for graph classification, including methods inspired by CNN, RNN, etc. One typical approach is to obtain a representation of the entire graph by aggregating the vertex embeddings that are the output of GNNs [2, 19]. Some sequential methods [20, 21, 22] handle graphs of varying sizes by transforming them into sequences of fixed-length vectors and then feeding them to an RNN. In addition, some hierarchical clustering methods [23, 24, 25] learn hierarchical graph representations by combining GNNs with clustering algorithms. Notably, some recent works design universal graph pooling modules, which can learn the hierarchical representations of graphs and are compatible with various GNN architectures. For example, DiffPool [26] learns a differentiable soft cluster assignment for vertices and then maps them to a coarsened graph layer by layer, and EigenPool [27] compresses the vertex features and local structures into coarsened signals via graph Fourier transform.
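TABLE I: Notations used in this paper.
Symbol  Definition
$G$, $G'$  Original/augmented graph
$V$, $E$  Sets of vertices/edges in graph $G$
$m$, $n$  Number of edges/vertices
$A$  Adjacency matrix of graph $G$
$l$  Length of path
$P^{l}_{ij}$  Length-$l$ path between vertices $v_i$ and $v_j$
$\mathcal{D}$, $\mathcal{D}_{train}$/$\mathcal{D}_{test}$/$\mathcal{D}_{val}$  Dataset, training/testing/validation set
$\mathcal{D}'_{train}$  Augmented set
$\beta$  Budget of edge modification
$f$  Mapping
$E^{c}_{add}$, $E^{c}_{del}$  Candidate edge sets of addition/deletion
$E_{add}$, $E_{del}$  Edge sets of addition/deletion
$s_{ij}$  Similarity score of $(v_i, v_j)$
$w_{ij}$  Sampling weight of $(v_i, v_j)$
$S$  Set of similarity scores
$w^{add}$, $w^{del}$  Addition/deletion weights
$\mathbf{p}_i$  Prediction probability vector of example $G_i$
$\mathbf{q}_k$  Average probability vector of class $k$
$Q$  Probability confusion matrix
$r_i$  Label reliability
$\theta$  Label reliability threshold
$T$  Number of iterations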
3 Methodology
In this section, we first formulate the problem of data augmentation on graphs, and then present several graph augmentation strategies, which are heuristic and especially suited to the graph classification task. The notations used in this paper are listed in TABLE I.
3.1 Problem Formulation
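Let $G = (V, E)$ be an undirected and unweighted graph, which consists of a vertex set $V = \{v_i \mid i = 1, \ldots, n\}$ and an edge set $E = \{e_i \mid i = 1, \ldots, m\}$. The topological structure of graph $G$ is represented by an $n \times n$ adjacency matrix $A$ with $A_{ij} = 1$ if $(v_i, v_j) \in E$ and $A_{ij} = 0$ otherwise. Given pairwise vertices $v_i, v_j$, the length-$l$ path between them is represented as an ordered edge sequence, i.e., $P^{l}_{ij} = \{(v_i, v_{k_1}), (v_{k_1}, v_{k_2}), \ldots, (v_{k_{l-1}}, v_j)\}$. A dataset that contains a series of graphs is denoted as $\mathcal{D} = \{(G_i, y_i) \mid i = 1, \ldots, t\}$, where $y_i$ is the label of graph $G_i$. For $\mathcal{D}$, an upfront split will be applied to yield disjoint training, validation and testing sets, denoted as $\mathcal{D}_{train}$, $\mathcal{D}_{val}$ and $\mathcal{D}_{test}$, respectively. The original classifier $C$ will be pretrained using $\mathcal{D}_{train}$ and $\mathcal{D}_{val}$.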
We further explore the data augmentation technique for graph classification from a heuristic approach and consider optimizing graph classifiers. Fig. 1 demonstrates the application of data augmentation to graph structured data, which consists of two phases: graph augmentation and data filtration. Specifically, we aim to update a classifier with augmented data, which are first generated via graph augmentation and then filtered in terms of their label reliability. During graph augmentation, our purpose is to map the graph $G \in \mathcal{D}_{train}$ to a new graph $G'$ with the formal format $f: (G, y) \mapsto (G', y)$, where $y$ is the label of $G$. We treat the generated graphs as weakly labeled data and classify them into two groups via a label reliability threshold $\theta$ learnt from $\mathcal{D}_{val}$. Then, the augmented set $\mathcal{D}'_{train}$ filtered from the generated graph pool $\mathcal{D}_{pool}$ will be merged with $\mathcal{D}_{train}$ to produce the training set:
$$\mathcal{D}^{new}_{train} = \mathcal{D}_{train} \cup \mathcal{D}'_{train}. \tag{1}$$
Finally, we fine-tune or retrain the classifier with $\mathcal{D}^{new}_{train}$, and evaluate it on the testing set $\mathcal{D}_{test}$.
3.2 Graph Augmentation
Graph augmentation aims to expand training data by artificially creating more reasonable virtual data from a limited set of graphs. In this paper, we consider augmentation as a topological mapping, which is conducted via heuristic modification or transformation of graph structures. In order to ensure the approximate reasonability of the generated virtual data, our graph augmentation follows two principles: 1) edge modification, where $G'$ is a partially modified graph with some of the edges added/removed from $G$; 2) structure property preservation, where the augmentation operation keeps the graph connectivity and the number of edges constant.
During edge modification, the edges removed from the graph are sampled from the candidate edge set $E^{c}_{del}$, while the edges added to the graph are sampled from the candidate pairwise vertices set $E^{c}_{add}$. The construction of candidate sets varies for different methods, as further discussed below.
3.2.1 Random Mapping
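Here, we consider random mapping as a simple baseline method. For a given graph $G$, one can randomly remove some edges from $E$ and then link the same number of vertex pairs that are unlinked in $G$. In this random scenario, the candidate sets are denoted as follows:
$$E^{c}_{del} = E, \qquad E^{c}_{add} = \{(v_i, v_j) \mid A_{ij} = 0;\ i \neq j\}. \tag{2}$$
Notably, in Eq. (2), $E^{c}_{del}$ is actually the edge set of the graph, and $E^{c}_{add}$ is the set of virtual edges which consists of unlinked pairwise vertices. Then, one can get the sets of edges added/removed from $G$ via sampling in the candidate sets randomly:
$$E_{add} = \{e_i \mid i = 1, \ldots, \lceil m \cdot \beta \rceil\} \subset E^{c}_{add}, \qquad E_{del} = \{e_i \mid i = 1, \ldots, \lceil m \cdot \beta \rceil\} \subset E^{c}_{del}, \tag{3}$$
where $\beta$ is the budget of edge modification and $\lceil \cdot \rceil$ denotes rounding up to the nearest integer. Finally, based on the random mapping, the connectivity structure of the original graph is modified to generate a virtual graph:
$$G' = (V,\ (E \cup E_{add}) \setminus E_{del}). \tag{4}$$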
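The random mapping described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `random_mapping`, the edge-list representation and the rounding-up of the number of swapped edges are choices made here, and the connectivity check required by the structure-preservation principle is omitted for brevity.

```python
import itertools
import math
import random


def random_mapping(n, edges, beta, rng=None):
    """Return a new edge set after random edge deletion and addition.

    n     -- number of vertices, labelled 0..n-1
    edges -- iterable of undirected edges (u, v)
    beta  -- budget of edge modification, as a fraction of the edge count
    """
    rng = rng or random.Random()
    # Normalise every edge to a sorted tuple so that (u, v) == (v, u).
    edge_set = {tuple(sorted(e)) for e in edges}
    # Candidate deletions: the existing edges.
    cand_del = sorted(edge_set)
    # Candidate additions: all unlinked vertex pairs.
    cand_add = [p for p in itertools.combinations(range(n), 2)
                if p not in edge_set]
    # Number of edges to swap; rounding up is an assumption of this sketch.
    k = min(math.ceil(len(edge_set) * beta), len(cand_del), len(cand_add))
    e_del = set(rng.sample(cand_del, k))
    e_add = set(rng.sample(cand_add, k))
    # Add the sampled virtual edges, then drop the sampled real edges.
    return (edge_set | e_add) - e_del


ring = [(i, (i + 1) % 6) for i in range(6)]
new_edges = random_mapping(6, ring, beta=0.15, rng=random.Random(0))
print(sorted(new_edges))
```

Because the added and removed sets have the same size and are disjoint, the number of edges stays constant; a practical implementation would additionally reject samples that disconnect the graph.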
3.2.2 VertexSimilarity Mapping
Vertex similarity, which reflects the number of common features shared by a pair of vertices, has been widely applied to graph mining tasks such as link prediction and community detection. Empirically, vertices in a graph tend to connect with each other due to their high similarity. Therefore, this similarity index is used to link those vertices of high similarity for graph augmentation.
Vertex-similarity mapping shares the same candidate sets with random mapping but conducts edge selection with weighted random sampling. In other words, random mapping makes selections with equal probability, while vertex-similarity mapping assigns all entries in $E^{c}_{add}$ and $E^{c}_{del}$ relative weights that are associated with the vertex similarity scores. Specifically, before sampling, the similarity scores are computed over all entries in the candidate sets using the Resource Allocation (RA) index [28]. For each entry $(v_i, v_j)$ in $E^{c}_{add}$, the RA score $s_{ij}$ and addition weight $w^{add}_{ij}$ are computed as follows:
$$s_{ij} = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{d_z}, \qquad S = \{s_{ij} \mid \forall (v_i, v_j) \in E^{c}_{add}\}, \qquad w^{add}_{ij} = \frac{s_{ij}}{\sum_{s \in S} s}, \tag{5}$$
where $\Gamma(i)$ denotes the one-hop neighbors of $v_i$ and $d_z$ denotes the degree of vertex $z$. Weighted random sampling means that the probability for an entry $(v_i, v_j)$ in $E^{c}_{add}$ to be selected is proportional to its addition weight $w^{add}_{ij}$. It is worth noting that many other similarity indices can also be applied in this scheme. Similarly, during edge deletion, the probability of edge $(v_i, v_j)$ being sampled from $E^{c}_{del}$ is proportional to the deletion weight $w^{del}_{ij}$ as follows:
$$w^{del}_{ij} = 1 - \frac{s_{ij}}{\sum_{s \in S} s}, \qquad S = \{s_{ij} \mid \forall (v_i, v_j) \in E^{c}_{del}\}, \tag{6}$$
which means that the edges with smaller RA scores have more chances to be removed.
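To make the weighted sampling concrete, the following sketch computes Resource Allocation scores and turns them into addition and deletion weights. The helper names (`neighbours`, `ra_score`, `addition_weights`, `deletion_weights`, `weighted_sample`) are hypothetical, the deletion weight is assumed to be one minus the normalised score, and the uniform fallback for all-zero scores is a choice made here rather than something stated in the text.

```python
import random


def neighbours(n, edges):
    """Build the one-hop neighbour sets of an undirected graph."""
    nbrs = {v: set() for v in range(n)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def ra_score(nbrs, i, j):
    """Resource Allocation index: sum of 1/degree over common neighbours."""
    return sum(1.0 / len(nbrs[z]) for z in nbrs[i] & nbrs[j])


def addition_weights(nbrs, pairs):
    """Weights proportional to the RA score (uniform if all scores are zero)."""
    if not pairs:
        return []
    scores = [ra_score(nbrs, i, j) for i, j in pairs]
    total = sum(scores)
    if total == 0:
        return [1.0 / len(pairs)] * len(pairs)
    return [s / total for s in scores]


def deletion_weights(nbrs, pairs):
    """Weights that grow as the RA score shrinks (assumed form: 1 - s / sum)."""
    return [1.0 - w for w in addition_weights(nbrs, pairs)]


def weighted_sample(items, weights, k, rng):
    """Draw k distinct items, each draw proportional to the remaining weights."""
    items, weights = list(items), list(weights)
    chosen = []
    for _ in range(min(k, len(items))):
        if sum(weights) <= 0:  # nothing left with positive weight
            idx = rng.randrange(len(items))
        else:
            idx = rng.choices(range(len(items)), weights=weights)[0]
        chosen.append(items.pop(idx))
        weights.pop(idx)
    return chosen


# Star graph: vertex 0 is linked to 1, 2 and 3.
nbrs = neighbours(4, [(0, 1), (0, 2), (0, 3)])
unlinked = [(1, 2), (1, 3), (2, 3)]
weights = addition_weights(nbrs, unlinked)
print(weighted_sample(unlinked, weights, 1, random.Random(0)))
```

In the star graph every pair of leaves shares the single hub of degree 3, so each unlinked pair has an RA score of 1/3 and the three addition weights are equal.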
3.2.3 MotifRandom Mapping
Graph motifs are subgraphs that repeat themselves in a specific graph or even among various graphs. Each of these subgraphs, defined by a particular pattern of interactions between vertices, may describe a framework in which particular functions are accomplished efficiently. In this paper, only open motifs with a chain structure are considered, as shown in Fig. 2, so that the motif discovery process can be replaced by path search. For instance, a common motif such as the open-triad $\wedge_{ij}$ with head vertex $v_i$ and tail vertex $v_j$ is equivalent to a length-2 path emanating from the head vertex that would induce a triangle if closed, which gives the following equivalence relation:
$$\wedge_{ij} \equiv P^{l=2}_{ij}, \qquad A_{ij} = 0, \tag{7}$$
where $l$ is the length of the motif.
The motif-random mapping aims to fine-tune these motifs to approximately equivalent ones via edge swapping. As shown in Fig. 2, during edge swapping, the edge addition operation takes effect between the head and the tail vertices of the motif, while the edge deletion operation randomly removes an edge in the motif. For all length-$l$ motifs, the candidate set of edge addition is:
$$E^{c}_{add} = \{(v_i, v_j) \mid A_{ij} = 0,\ (A^{l})_{ij} \neq 0;\ i \neq j\}, \tag{8}$$
where $A^{l}$ is $A$ to the power of $l$. Then, one can get $E_{add}$, the set of edges added to $G$, via sampling from $E^{c}_{add}$ randomly. For each pair of vertices $(v_i, v_j)$ in $E_{add}$, there exists a length-$l$ motif $P^{l}_{ij}$, which has head vertex $v_i$ and tail vertex $v_j$. Next, we randomly sample one edge in $P^{l}_{ij}$ to remove, and all of these removed edges constitute $E_{del}$. Finally, the virtual graph can be obtained via Eq. (4).
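For the open-triad case (length-2 motifs), the edge swap described above can be sketched as follows. The names `open_triads` and `motif_random_mapping` are hypothetical, the number of swaps is assumed to be the edge count times the budget rounded up, and this sketch does not guard against two motifs choosing the same edge for removal or against disconnecting the graph.

```python
import math
import random


def open_triads(n, edge_set):
    """Map each unlinked pair (i, j) to the middle vertices z of paths i-z-j."""
    nbrs = {v: set() for v in range(n)}
    for u, v in edge_set:
        nbrs[u].add(v)
        nbrs[v].add(u)
    triads = {}
    for z in range(n):
        for i in nbrs[z]:
            for j in nbrs[z]:
                # Keep each unordered pair once, and only if it is unlinked.
                if i < j and j not in nbrs[i]:
                    triads.setdefault((i, j), []).append(z)
    return triads


def motif_random_mapping(n, edges, beta, rng=None):
    """Close sampled open triads and remove one edge from each of them."""
    rng = rng or random.Random()
    edge_set = {tuple(sorted(e)) for e in edges}
    triads = open_triads(n, edge_set)
    cand_add = sorted(triads)  # head/tail pairs of all open triads
    k = min(math.ceil(len(edge_set) * beta), len(cand_add))
    e_add, e_del = set(), set()
    for i, j in rng.sample(cand_add, k):
        z = rng.choice(triads[(i, j)])
        # Remove one of the two motif edges, (i, z) or (z, j), at random.
        e_del.add(tuple(sorted(rng.choice([(i, z), (z, j)]))))
        e_add.add((i, j))
    return (edge_set | e_add) - e_del


path = [(0, 1), (1, 2), (2, 3)]
print(sorted(motif_random_mapping(4, path, beta=0.15, rng=random.Random(0))))
```

On the path 0-1-2-3 the only open triads are 0-1-2 and 1-2-3, so the added edge is either (0, 2) or (1, 3) and the removed edge always belongs to the closed triad.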
3.2.4 MotifSimilarity Mapping
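Motif-similarity mapping inherits from motif-random mapping and conducts edge sampling in terms of vertex similarity. Specifically, for the candidate set of edge addition $E^{c}_{add}$ given in Eq. (8), the probability for an arbitrary entry $(v_i, v_j)$ to be selected is proportional to its addition weight $w^{add}_{ij}$ according to Eq. (5). Similarly, during edge deletion, the probability of an edge $(v_x, v_y)$ being sampled from the motif $P^{l}_{ij}$ is proportional to the deletion weight $w^{del}_{xy}$ as follows:
$$w^{del}_{xy} = 1 - \frac{s_{xy}}{\sum_{s \in S} s}, \qquad S = \{s_{xy} \mid \forall (v_x, v_y) \in P^{l}_{ij}\}. \tag{9}$$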
The whole process for motifsimilarity mapping with and is illustrated in Fig. 3.
3.3 Data Filtration
In computer vision, the technique of data augmentation is widely used to generate augmented examples, which can be directly treated as new training data. For instance, after an image of a cat undergoes simple data augmentation such as geometric transformation or color depth adjustment, the resulting new image retains the explicit semantics and is still an image of a cat in human vision. However, due to the difference between image data and graph structured data, the examples generated via graph augmentation may lose the original semantics. When the label of the original graph is assigned to the generated graph directly during graph augmentation, one cannot determine whether the assigned label is reliable. Therefore, the concept of label reliability is employed here to measure the degree to which an example matches its label according to the classifier.
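Each graph $G_i$ in $\mathcal{D}_{val}$ will be fed into classifier $C$ to obtain the prediction vector $\mathbf{p}_i \in \mathbb{R}^{|Y|}$, which represents the probability distribution of how likely an input example belongs to each possible class, where $|Y|$ is the number of classes. Then, a probability confusion matrix $Q \in \mathbb{R}^{|Y| \times |Y|}$, in which the entry $q_{ij}$ represents the average probability that the classifier classifies the graphs of the $i$th class into the $j$th class, is computed as follows:
$$\mathbf{q}_k = \frac{1}{\Omega_k} \sum_{y_i = k} \mathbf{p}_i, \qquad Q = [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_{|Y|}]^{\top}, \tag{10}$$
where $\Omega_k$ is the number of graphs belonging to the $k$th class in $\mathcal{D}_{val}$ and $\mathbf{q}_k$ is the average probability distribution of the $k$th class.
The label reliability of an example $(G_i, y_i)$ is defined as the product of the example probability distribution $\mathbf{p}_i$ and the class probability distribution $\mathbf{q}_{y_i}$ as follows:
$$r_i = \mathbf{p}_i^{\top} \mathbf{q}_{y_i}. \tag{11}$$
Notably, the definition indicates that an example which can be predicted correctly by the classifier with a greater probability tends to have higher label reliability.
A threshold $\theta$ used to filter the generated data is defined as:
$$\theta = \arg\min_{\theta} \sum_{(G_i, y_i) \in \mathcal{D}_{val}} \Phi\big[(\theta - r_i) \cdot g(G_i, y_i)\big], \tag{12}$$
where $g(G_i, y_i) = 1$ if $C(G_i) = y_i$ and $-1$ otherwise, and $\Phi(x) = 1$ if $x > 0$ and $0$ otherwise. The process of threshold optimization tends to ensure that correctly classified examples have greater label reliability than incorrectly classified examples.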
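The data filtration step can be illustrated with a small sketch that computes class-average probability vectors, label reliability as a dot product, and a threshold chosen on validation data. All function names are hypothetical. The threshold search assumes a cost that counts correctly classified examples below the threshold plus misclassified examples above it, searches only over the observed reliability values, and keeps generated examples whose reliability is strictly above the threshold; these are assumptions of the sketch.

```python
def class_averages(probs, labels, num_classes):
    """Average prediction vector of each true class (rows of the matrix Q)."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for p, y in zip(probs, labels):
        counts[y] += 1
        for c, value in enumerate(p):
            sums[y][c] += value
    return [[v / max(counts[k], 1) for v in sums[k]]
            for k in range(num_classes)]


def label_reliability(p, y, q):
    """Dot product of an example's prediction vector and its class average."""
    return sum(a * b for a, b in zip(p, q[y]))


def fit_threshold(probs, labels, q):
    """Pick the observed reliability value that misplaces the fewest examples."""
    scored = []
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=p.__getitem__)
        scored.append((label_reliability(p, y, q), predicted == y))

    def cost(theta):
        # Correct examples should lie above theta, wrong ones below it.
        return sum((r < theta) if correct else (r > theta)
                   for r, correct in scored)

    return min(sorted(r for r, _ in scored), key=cost)


def filter_pool(pool_probs, pool_labels, q, theta):
    """Indices of generated examples whose reliability exceeds theta."""
    return [i for i, (p, y) in enumerate(zip(pool_probs, pool_labels))
            if label_reliability(p, y, q) > theta]


# Hypothetical validation predictions for a two-class problem.
val_probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
val_labels = [0, 0, 1, 1, 1]
q = class_averages(val_probs, val_labels, 2)
theta = fit_threshold(val_probs, val_labels, q)
print(filter_pool([[0.95, 0.05], [0.1, 0.9]], [0, 0], q, theta))
```

In this toy example the third validation graph is misclassified and has the lowest reliability, so the threshold settles at its score; the second generated example, whose prediction contradicts its inherited label, is filtered out.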
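TABLE II: Specifications of the datasets.
Collections  Dataset  #Graphs  #Classes  Avg. #vertices  Avg. #edges  bias (%)
Chemical compounds  MUTAG  188  2  17.93  19.79  66.5
Chemical compounds  PTCMR  344  2  14.29  14.69  55.8
Chemical compounds  ENZYMES  600  6  32.63  62.14  16.7
Brain  KKI  83  2  26.96  48.42  55.4
Brain  Peking1  85  2  39.31  77.35  57.6
Brain  OHSU  79  2  82.01  199.66  55.7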
3.4 Model Evolution Framework
Model evolution aims to optimize classifiers via graph augmentation, data filtration and model retraining iteratively, and ultimately improve the performance on graph classification. Fig. 4 and Algorithm 3 demonstrate the workflow and the procedure of MEvolve, respectively. Here, a variable, the number of iterations $T$, is introduced for repeating the above workflow to continuously augment the dataset and optimize the classifier.
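The iterative workflow can be summarised by the following sketch. It is not the paper's algorithm listing: the function `m_evolve` and all of its callable arguments are placeholders supplied by the caller, and the choices to augment the accumulated training set in every iteration and to always keep the retrained classifier are assumptions made here.

```python
def m_evolve(train_set, val_set, fit, augment, reliability, threshold,
             iterations):
    """Iterate graph augmentation, data filtration and model retraining.

    All callables are placeholders supplied by the caller:
      fit(train_set)                         -> classifier
      augment(graph)                         -> new graph
      reliability(classifier, val_set, g, y) -> float
      threshold(classifier, val_set)         -> float
    """
    classifier = fit(train_set)  # pretraining on the original data
    for _ in range(iterations):
        # Graph augmentation: every generated graph inherits its source label.
        pool = [(augment(g), y) for g, y in train_set]
        # Data filtration: keep examples whose label reliability is high enough.
        theta = threshold(classifier, val_set)
        kept = [(g, y) for g, y in pool
                if reliability(classifier, val_set, g, y) > theta]
        # Model retraining on the merged training set.
        train_set = train_set + kept
        classifier = fit(train_set)
    return classifier, train_set


# Toy run with stand-in callables (graphs are plain integers here).
clf, data = m_evolve(
    train_set=[(1, 0), (2, 1)],
    val_set=[(3, 0)],
    fit=lambda d: len(d),
    augment=lambda g: g + 10,
    reliability=lambda c, v, g, y: 1.0,
    threshold=lambda c, v: 0.5,
    iterations=2,
)
print(len(data))
```

With the stand-in callables every generated example passes the filter, so the training set doubles in each of the two iterations.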
4 Experiments
4.1 Datasets
We evaluate the proposed methods against six benchmark datasets from bioinformatics and chemoinformatics: MUTAG [29], PTCMR [30], ENZYMES [1], KKI, Peking1 and OHSU [31]. The first three represent the graph collections of chemical compounds, in which vertices correspond to molecular structures and edges indicate chemical bonds between them. The last three are constructed from brain networks, where vertices correspond to the Regions of Interest (ROI) and edges represent the correlations between two ROIs. The specifications of these datasets are given in TABLE II.
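TABLE III: Graph classification accuracy of the original and evolutive models.
Dataset  Mapping  SF+SVM  SF+Log  SF+KNN  SF+RF  NetLSD+SVM  NetLSD+Log  NetLSD+KNN  NetLSD+RF  Graph2vec+SVM  Graph2vec+Log  Graph2vec+KNN  Graph2vec+RF  Gl2vec+SVM  Gl2vec+Log  Gl2vec+KNN  Gl2vec+RF  Diffpool  Avg RIMP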
MUTAG  original  0.822  0.824  0.824  0.846  0.823  0.829  0.828  0.836  0.737  0.820  0.784  0.820  0.746  0.830  0.800  0.817  0.801  – 
random  0.843  0.845  0.846  0.878  0.855  0.851  0.860  0.886  0.756  0.844  0.793  0.847  0.748  0.851  0.820  0.841  0.810  2.78%  
vertexsimilarity  0.848  0.855  0.840  0.870  0.843  0.845  0.856  0.874  0.750  0.850  0.801  0.861  0.748  0.845  0.823  0.850  0.825  2.86%  
motifrandom  0.860  0.852  0.846  0.887  0.861  0.862  0.856  0.882  0.761  0.850  0.803  0.859  0.752  0.856  0.829  0.845  0.807  3.63%  
motifsimilarity  0.863  0.855  0.850  0.890  0.861  0.864  0.861  0.892  0.759  0.851  0.809  0.852  0.762  0.863  0.833  0.846  0.831  4.00%  
PTCMR  original  0.551  0.566  0.577  0.587  0.543  0.578  0.548  0.576  0.571  0.518  0.509  0.549  0.572  0.538  0.507  0.550  0.609  – 
random  0.611  0.590  0.605  0.618  0.579  0.580  0.590  0.607  0.580  0.572  0.547  0.592  0.587  0.571  0.528  0.594  0.637  5.77%  
vertexsimilarity  0.595  0.594  0.601  0.622  0.577  0.580  0.578  0.601  0.592  0.571  0.548  0.599  0.601  0.575  0.554  0.605  0.636  6.22%  
motifrandom  0.615  0.595  0.609  0.630  0.595  0.583  0.582  0.612  0.587  0.570  0.551  0.592  0.587  0.578  0.535  0.596  0.627  6.37%  
motifsimilarity  0.616  0.595  0.610  0.624  0.581  0.583  0.597  0.620  0.596  0.579  0.553  0.593  0.588  0.579  0.545  0.602  0.639  6.97%  
ENZYMES  original  0.309  0.237  0.287  0.397  0.337  0.287  0.304  0.349  0.361  0.253  0.283  0.337  0.348  0.268  0.238  0.318  0.487  – 
random  0.347  0.412  0.302  0.412  0.353  0.287  0.327  0.369  0.336  0.269  0.290  0.346  0.286  0.273  0.259  0.350  0.500  7.25%  
vertexsimilarity  0.368  0.416  0.333  0.431  0.352  0.295  0.329  0.359  0.372  0.279  0.290  0.369  0.345  0.266  0.274  0.357  0.489  11.12%  
motifrandom  0.364  0.418  0.317  0.418  0.365  0.294  0.340  0.367  0.376  0.268  0.299  0.355  0.298  0.273  0.264  0.356  0.508  10.20%  
motifsimilarity  0.363  0.414  0.317  0.415  0.375  0.291  0.335  0.376  0.352  0.270  0.289  0.352  0.291  0.280  0.260  0.358  0.506  9.55%  
KKI  original  0.550  0.500  0.520  0.517  0.548  0.524  0.512  0.496  0.549  0.527  0.524  0.552  0.538  0.502  0.526  0.502  0.523  – 
random  0.600  0.544  0.554  0.622  0.599  0.535  0.553  0.562  0.580  0.568  0.594  0.574  0.556  0.508  0.544  0.544  0.580  7.97%  
vertexsimilarity  0.601  0.556  0.559  0.654  0.618  0.568  0.568  0.585  0.603  0.570  0.573  0.655  0.578  0.597  0.582  0.588  0.597  12.87%  
motifrandom  0.598  0.560  0.578  0.647  0.603  0.563  0.558  0.574  0.586  0.606  0.592  0.661  0.567  0.593  0.596  0.604  0.586  13.11%  
motifsimilarity  0.607  0.560  0.561  0.649  0.619  0.558  0.565  0.582  0.587  0.606  0.603  0.634  0.581  0.592  0.597  0.582  0.612  13.36%  
Peking1  original  0.578  0.548  0.541  0.558  0.605  0.612  0.589  0.591  0.572  0.522  0.474  0.522  0.555  0.522  0.521  0.521  0.586  – 
random  0.660  0.562  0.603  0.627  0.652  0.631  0.662  0.654  0.579  0.547  0.546  0.597  0.584  0.555  0.559  0.607  0.650  9.20%  
vertexsimilarity  0.672  0.571  0.615  0.623  0.652  0.644  0.622  0.666  0.638  0.557  0.562  0.627  0.619  0.561  0.566  0.632  0.657  11.47%  
motifrandom  0.681  0.583  0.619  0.644  0.668  0.636  0.666  0.689  0.579  0.553  0.593  0.612  0.605  0.581  0.567  0.633  0.654  12.34%  
motifsimilarity  0.670  0.587  0.624  0.663  0.694  0.648  0.671  0.699  0.581  0.565  0.564  0.630  0.607  0.563  0.572  0.635  0.632  12.72%  
OHSU  original  0.610  0.595  0.610  0.667  0.547  0.489  0.549  0.581  0.557  0.577  0.585  0.567  0.557  0.541  0.544  0.557  0.543  – 
random  0.643  0.635  0.645  0.697  0.595  0.534  0.582  0.641  0.557  0.640  0.628  0.645  0.564  0.595  0.570  0.642  0.637  8.08%  
vertexsimilarity  0.653  0.641  0.643  0.722  0.641  0.546  0.614  0.661  0.557  0.658  0.623  0.673  0.557  0.625  0.625  0.632  0.640  10.82%  
motifrandom  0.650  0.638  0.648  0.728  0.613  0.546  0.584  0.641  0.557  0.653  0.633  0.686  0.564  0.602  0.625  0.645  0.627  10.04%  
motifsimilarity  0.656  0.641  0.650  0.726  0.638  0.541  0.587  0.641  0.557  0.678  0.635  0.650  0.572  0.605  0.625  0.652  0.615  10.33% 
4.2 Graph Classification Methods
We consider the following five graph classification methods in our experiments: two graph embedding methods (SF and Graph2vec), two kernel models (NetLSD and Gl2vec), and one GNN model (Diffpool).

SF [32]. It is an embedding method and performs graph classification by spectral decomposition of the graph Laplacian, i.e., it relies on spectral features of the graph.

Graph2vec [16]. It extends the document embedding methods to graph classification and learns a distributed representation of the entire graph via document embedding neural networks.

NetLSD [33]. It is a kernel method and performs graph classification by extracting compact graph signatures that inherit the formal properties of the Laplacian spectrum.

Gl2vec [34]. It constructs vectors for feature representations by comparing static and temporal network graphlet distributions to random graphs generated from different null models.

Diffpool [26]. It is a differentiable graph pooling module that can generate hierarchical representations of graphs by learning a differentiable soft cluster assignment for vertices and maps them to a coarsened graph layer by layer.
For all graph kernel and embedding methods, we implement graph classification by using the following machine learning classifiers: SVM based on radial basis kernel (SVM), logistic regression classifier (Log), $k$-nearest neighbors classifier (KNN) and random forest classifier (RF). In total, there are 17 available combinations of graph classification.
4.3 Experiment Setup
Each dataset is split into training, validation and testing sets with a proportion of 7:1:2. In this work, the validation set serves two purposes: 1) fine-tuning the hyperparameters of classifiers in combination with grid search; 2) computing the label reliability threshold and confusion matrix of classifiers during model evolution. We repeat 5-fold cross-validation 10 times and report the average accuracy across all trials.
For all kernel and embedding methods, we set the embedding dimension to 128. We then set the budget of edge modification $\beta$ to 0.15. During motif-random and motif-similarity mapping, we fix $l = 2$, which means that we only consider the open-triad motif. Furthermore, during model evolution, we set the number of iterations $T$ to 5.
4.4 Evaluation
We evaluate the benefit of the proposed graph augmentation methods and MEvolve framework, answering the following research questions:

RQ1: Can MEvolve improve the performance of graph classification when being combined with existing graph classification models?

RQ2: What are the roles that similarity and motif mechanisms play in improving the effect of graph augmentation?

RQ3: Is the data filtration necessary and how does it help MEvolve achieve performance improvement in graph classification?

RQ4: How does MEvolve achieve interpretable enhancement of graph classification?
We combine the proposed model evolution framework with all graph classification models to show a crosswise comparison. Specifically, the four graph augmentation methods are combined with 17 graph classification combinations, totaling 68 available experimental combinations. For instance, a practical combination could be {vertex-similarity mapping + Graph2vec + KNN} or {random mapping + Diffpool}.
4.4.1 Enhancement for Graph Classification
TABLE III reports the results of performance comparison between the evolutive models and the original models, from which one can observe that there is a significant boost in classification performance across all six datasets. Overall, the models combined with the proposed MEvolve framework obtain higher average classification accuracy in most cases, and MEvolve achieves a 96.81% success rate on the enhancement of graph classification (the success rate refers to the percentage of evolutive models with accuracy higher than that of the corresponding original models in TABLE III, i.e., 395/408 = 96.81%). These results provide a positive answer to RQ1, indicating that MEvolve significantly improves the performance of the 17 graph classification combinations. We speculate that the original models trained with limited training data are overfitting, whereas MEvolve enriches the scale of training data via graph augmentation and optimizes graph classifiers via iterative retraining, which can improve generalization and avoid overfitting to a certain extent.
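The counts behind the success rate can be checked by enumerating the experimental grid. The sketch below only uses the method, classifier, mapping and dataset names given in the text; the variable names are illustrative, and the figure of 395 improved cases is taken from the text rather than recomputed.

```python
from itertools import product

features = ["SF", "NetLSD", "Graph2vec", "Gl2vec"]
classifiers = ["SVM", "Log", "KNN", "RF"]
mappings = ["random", "vertex-similarity", "motif-random", "motif-similarity"]
datasets = ["MUTAG", "PTCMR", "ENZYMES", "KKI", "Peking1", "OHSU"]

# 4 x 4 feature/classifier pairs plus the end-to-end Diffpool model.
models = [f"{f} + {c}" for f, c in product(features, classifiers)] + ["Diffpool"]
# One experiment per (dataset, mapping, model) triple.
experiments = list(product(datasets, mappings, models))

print(len(models), len(mappings) * len(models), len(experiments))  # 17 68 408
print(f"{395 / len(experiments):.2%}")  # 96.81%
```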
Now, we define the relative improvement rate (RIMP) in accuracy as follows:
$$\mathrm{RIMP} = \frac{\mathrm{Acc}_{evo} - \mathrm{Acc}_{ori}}{\mathrm{Acc}_{ori}}, \tag{13}$$
where $\mathrm{Acc}_{evo}$ and $\mathrm{Acc}_{ori}$ refer to the accuracy of the evolutive and original models, respectively. Comparing all the graph augmentation methods, we count the numbers of experiments where they obtain the highest RIMP, which are 1, 32, 24 and 57 for random, vertex-similarity, motif-random and motif-similarity, respectively. In TABLE III, the far-right column gives the average relative improvement rate (Avg RIMP) in accuracy, from which one can see that MEvolve combined with similarity-based mappings obtains the best results overall. In particular, motif-similarity mapping, which combines similarity and motif mechanisms, outperforms the others in half of the cases. These results indicate that both similarity and motif mechanisms play positive roles in enhancing graph classification, answering RQ2. As a reasonable explanation, the similarity mechanism tends to link vertices with higher similarity and is capable of optimizing the topological structure legitimately, while the motif mechanism achieves edge modification via local edge swapping, which has less effect on the degree distribution of the graph.
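As a quick worked example of the relative improvement rate, the snippet below applies the definition to one cell pair from TABLE III (MUTAG, SF + SVM, original versus motif-similarity mapping). The function name `rimp` is illustrative.

```python
def rimp(acc_evolved, acc_original):
    """Relative improvement of the evolved model over the original one."""
    return (acc_evolved - acc_original) / acc_original


# MUTAG, SF + SVM: 0.822 (original) vs 0.863 (motif-similarity mapping).
print(f"{rimp(0.863, 0.822):.2%}")  # 4.99%
```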
4.4.2 Impact of Data Filtration in Model Evolution
Thanks to the outstanding performance of MEvolve, we further investigate the impact of the data filtration on the enhancement of graph classification. Specifically, we conduct contrastive experiments in which the data filtration operation is removed from MEvolve and the performance differences between the two cases with and without data filtration are shown in Fig. 5. From the comparison results, we observe that there is a more significant improvement in classification performance when the MEvolve framework is combined with data filtration in most cases, positively answering RQ3.
Notably, without data filtration, the results of MEvolve with different mappings vary considerably in overall quality, and augmentation may even have a negative effect in certain cases. In contrast, with data filtration, MEvolve has more significant and consistent effects on enhancing graph classification among different mappings. A reasonable explanation for this effect is that data filtration is capable of retaining examples that are conducive to the model's decision, so that the augmented sets obtained via various mappings tend to have similar feature distributions. As a result, data filtration narrows the quality gap between the data generated by random mapping and those generated by the other three mappings, achieving consistent performance. As an exception, the vertex-similarity mapping (vs) shows particularly strong performance without data filtration on ENZYMES. One possible explanation is that accepting more augmented examples may be more favorable for optimizing relatively complex multi-class classification.
4.4.3 Explanatory Visualization of Data and Models
Next, we apply visualization techniques to investigate how graph augmentation enriches the data distribution and how the MEvolve framework optimizes the performance of different models. Since higher-dimensional data are difficult to visualize, we set the embedding dimension of both graph kernel and embedding models to 2.
Firstly, we compare the training data distribution before and after model evolution, as visualized in Fig. 6, to demonstrate the effectiveness of the proposed graph augmentation. Specifically, the top and bottom rows show the decision regions of the original and evolutive models, respectively, and the points with different colors represent training data with different labels. Panels (a)-(d) are based on the same data split and graph augmentation, but different graph classification combinations (vertex-similarity mapping + SF + {SVM, Log, KNN, RF}). There is a significant boost in the scale of training data, and the distribution boundaries of data with different labels become more distinct, indicating that graph augmentation effectively enriches the training data and that the new data distribution is more conducive to the training of classifiers.
Furthermore, we visualize the decision boundaries in Fig. 7 to highlight the difference between the original and evolutive models. Specifically, panels (a)-(e) are based on the same combination (vertex-similarity mapping + SF + SVM), but different data split schemes, i.e., testing on different folds of the dataset. As one can see, the decision regions of the non-dominant class are fragmented and scattered in the original models. During model evolution, scattered regions tend to merge, and the original decision boundaries are optimized to smoother ones. These phenomena answer RQ4.
In summary, graph augmentation can efficiently increase the data scale, indicating its ability in enriching data distribution. And the entire MEvolve framework is capable of optimizing the decision boundaries of the classifiers and ultimately improving their generalization performances.
4.4.4 Parameter Sensitivity
In this subsection, we further analyze the impact of key parameters on the performance of the MEvolve framework. Specifically, we vary the budget of edge modification $\beta$ across five settings. Due to space limits, we only present the evaluation results of graph classification based on the combinations ({SF, NetLSD, Graph2vec, Gl2vec} + SVM & Diffpool) in Fig. 8, involving the graph kernel and embedding methods with SVM and the GNN-based Diffpool method. From the results, one can see that the MEvolve framework is not strictly sensitive to different parameter settings when it comes to graph kernel and embedding models. On the other hand, for the Diffpool model, there are consistent tendencies in the sensitivity curves among all datasets, indicating that too large or too small perturbations are not conducive to graph augmentation.
Furthermore, we supplement the contrastive experiments of parameter sensitivity without data filtration, as shown in Fig. 9, to verify our conjecture that data filtration can actually help MEvolve reduce parameter sensitivity. Specifically, we use the variance of the five data points on each curve to measure the corresponding parameter sensitivity. Fig. 9 (a) shows the individual variance (Ind Var) of all the experimental combinations involved in Fig. 8, and (b) presents the average variance for each dataset. From the comparison results, one can observe that the results with data filtration have less fluctuation and better stability under different parameter settings, which supports our conjecture that data filtration can improve the robustness of MEvolve against parameter variations. As for the more obvious fluctuations on the curves of Diffpool, we speculate that end-to-end deep learning models are more capable of capturing slight changes in data features when compared to machine learning models like SVM.
In summary, data filtration not only narrows the performance gap among different graph augmentation methods, but also reduces the sensitivity of MEvolve to different parameter settings, implying that the MEvolve framework is robust to parameter settings to a certain extent.
5 Conclusion
In this paper, we introduce data augmentation for graph classification and present four heuristic algorithms that generate weakly labeled data for small-scale benchmark datasets via heuristic transformation of the graph structure. Furthermore, we propose a generic model evolution framework named MEvolve, which combines graph augmentation, data filtration and model retraining to optimize pre-trained graph classifiers. Experiments conducted on six benchmark datasets demonstrate that the proposed framework performs well: it helps existing graph classification models alleviate overfitting when trained on small-scale benchmark datasets and achieves a significant improvement in classification performance. For future work, we will design effective graph augmentation methods for large-scale graphs and extend the current framework to real-world datasets such as social networks and transaction networks.
Acknowledgments
The authors would like to thank all the members of the IVSN Research Group, Zhejiang University of Technology, for the valuable discussions about the ideas and technical details presented in this paper.
References
 [1] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel, “Protein function prediction via graph kernels,” Bioinformatics, vol. 21, no. suppl_1, pp. i47–i56, 2005.
 [2] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in Neural Information Processing Systems, 2015, pp. 2224–2232.
 [3] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “Arnetminer: extraction and mining of academic social networks,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 990–998.
 [4] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-Lehman graph kernels,” Journal of Machine Learning Research, vol. 12, no. 9, 2011.
 [5] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting, “Propagation kernels: efficient graph kernels from propagated information,” Machine Learning, vol. 102, no. 2, pp. 209–245, 2016.
 [6] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in Artificial Intelligence and Statistics, 2009, pp. 488–495.
 [7] T. Gärtner, P. Flach, and S. Wrobel, “On graph kernels: Hardness results and efficient alternatives,” in Learning Theory and Kernel Machines. Springer, 2003, pp. 129–143.
 [8] H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized kernels between labeled graphs,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 321–328.
 [9] P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert, “Extensions of marginalized graph kernels,” in Proceedings of the Twenty-First International Conference on Machine Learning, 2004, p. 70.
 [10] M. Sugiyama and K. Borgwardt, “Halting in random walk kernels,” in Advances in Neural Information Processing Systems, 2015, pp. 1639–1647.
 [11] K. M. Borgwardt and H.-P. Kriegel, “Shortest-path kernels on graphs,” in Fifth IEEE International Conference on Data Mining (ICDM’05). IEEE, 2005, pp. 8–pp.
 [12] H. Cai, V. W. Zheng, and K. C.-C. Chang, “A comprehensive survey of graph embedding: Problems, techniques, and applications,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 9, pp. 1616–1637, 2018.
 [13] F. Chen, Y.-C. Wang, B. Wang, and C.-C. J. Kuo, “Graph representation learning: a survey,” APSIPA Transactions on Signal and Information Processing, vol. 9, 2020.
 [14] C. Fu, Y. Zheng, Y. Liu, Q. Xuan, and G. Chen, “Nestl: Network embedding similarity-based transfer learning,” IEEE Transactions on Network Science and Engineering, 2019.
 [15] W. Guo, Y. Shi, S. Wang, and N. N. Xiong, “An unsupervised embedding learning feature representation scheme for network big data analysis,” IEEE Transactions on Network Science and Engineering, vol. 7, no. 1, pp. 115–126, 2019.
 [16] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal, “graph2vec: Learning distributed representations of graphs,” arXiv preprint arXiv:1707.05005, 2017.
 [17] H. Dai, B. Dai, and L. Song, “Discriminative embeddings of latent variable models for structured data,” in International Conference on Machine Learning, 2016, pp. 2702–2711.
 [18] A. Narayanan, M. Chandramohan, L. Chen, Y. Liu, and S. Saminathan, “subgraph2vec: Learning distributed representations of rooted subgraphs from large graphs,” arXiv preprint arXiv:1606.08928, 2016.
 [19] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” arXiv preprint arXiv:1704.01212, 2017.
 [20] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
 [21] Y. Jin and J. F. JaJa, “Learning graph-level representations with recurrent neural networks,” arXiv preprint arXiv:1805.07683, 2018.
 [22] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec, “Graphrnn: Generating realistic graphs with deep autoregressive models,” arXiv preprint arXiv:1802.08773, 2018.

 [23] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
 [24] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters in convolutional neural networks on graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3693–3702.
 [25] M. Fey, J. E. Lenssen, F. Weichert, and H. Müller, “Splinecnn: Fast geometric deep learning with continuous B-spline kernels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 869–877.
 [26] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec, “Hierarchical graph representation learning with differentiable pooling,” in Advances in Neural Information Processing Systems, 2018, pp. 4800–4810.
 [27] Y. Ma, S. Wang, C. C. Aggarwal, and J. Tang, “Graph convolutional networks with eigenpooling,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019, pp. 723–731.
 [28] T. Zhou, L. Lü, and Y.-C. Zhang, “Predicting missing links via local information,” The European Physical Journal B, vol. 71, no. 4, pp. 623–630, 2009.
 [29] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch, “Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity,” Journal of Medicinal Chemistry, vol. 34, no. 2, pp. 786–797, 1991.
 [30] C. Helma, R. D. King, S. Kramer, and A. Srinivasan, “The predictive toxicology challenge 2000–2001,” Bioinformatics, vol. 17, no. 1, pp. 107–108, 2001.
 [31] S. Pan, J. Wu, X. Zhu, G. Long, and C. Zhang, “Task sensitive feature exploration and learning for multitask graph classification,” IEEE Transactions on Cybernetics, vol. 47, no. 3, pp. 744–758, 2016.
 [32] N. de Lara and E. Pineau, “A simple baseline algorithm for graph classification,” arXiv preprint arXiv:1810.09155, 2018.
 [33] A. Tsitsulin, D. Mottin, P. Karras, A. Bronstein, and E. Müller, “Netlsd: hearing the shape of a graph,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 2347–2356.
 [34] K. Tu, J. Li, D. Towsley, D. Braines, and L. D. Turner, “gl2vec: Learning feature representation using graphlets for directed networks,” in Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2019, pp. 216–221.