GraphSMOTE: Imbalanced Node Classification on Graphs with Graph Neural Networks

03/16/2021
by   Tianxiang Zhao, et al.
Penn State University

Node classification is an important research topic in graph learning. Graph neural networks (GNNs) have achieved state-of-the-art performance on node classification. However, existing GNNs assume that the node samples of different classes are balanced, while in many real-world scenarios some classes have far fewer instances than others. Directly training a GNN classifier in this case would under-represent samples from those minority classes and result in sub-optimal performance. Therefore, it is important to develop GNNs for imbalanced node classification, yet the work on this is rather limited. Hence, we seek to extend previous imbalanced learning techniques for i.i.d. data to the imbalanced node classification task to facilitate GNN classifiers. In particular, we adopt synthetic minority over-sampling algorithms, as they are found to be the most effective and stable. This task is non-trivial, as previous synthetic minority over-sampling algorithms fail to provide relation information for newly synthesized samples, which is vital for learning on graphs. Moreover, node attributes are high-dimensional; directly over-sampling in the original input domain could generate out-of-domain samples, which may impair the accuracy of the classifier. We propose a novel framework, GraphSMOTE, in which an embedding space is constructed to encode the similarity among nodes. New samples are synthesized in this space to assure genuineness. In addition, an edge generator is trained simultaneously to model the relation information and provide it for these new samples. This framework is general and can easily be extended into different variations. The proposed framework is evaluated on three different datasets, and it outperforms all baselines by a large margin.


1. Introduction

Recent years have witnessed great improvements in learning from graphs with the development of graph neural networks (GNNs) (Kipf and Welling, 2017; Hamilton et al., 2017; Xu et al., 2018). One typical task is semi-supervised node classification (Yang et al., 2016), where we have a large graph with a small ratio of nodes labeled. A classifier can be trained on those supervised nodes and used to classify other nodes during testing. GNNs have obtained state-of-the-art performance on this task and are developing rapidly. For example, GCN (Kipf and Welling, 2017) exploits features in the spectral domain efficiently by using a simplified first-order approximation; GraphSage (Hamilton et al., 2017) utilizes features in the spatial domain and is better at adapting to diverse graph topology. Despite all this progress, existing work mainly focuses on the setting where node classes are balanced.

In many real-world applications, node classes can be imbalanced in graphs, i.e., some classes have significantly fewer samples for training than other classes. For example, in fake account detection (Mohammadrezaei et al., 2018; Zhao et al., 2009), the majority of users on a social network platform are benign users, while only a small portion of them are bots. Similarly, topic classification for website pages (Wang et al., 2020) can also suffer from this problem, as the materials for some topics are scarce compared to those for on-trend topics. Thus, we are often faced with the imbalanced node classification problem. An example is shown in Figure 1(a). Each blue node refers to a real user, each red node refers to a fake user, and the edges denote friendship. The task is to predict whether the unlabeled users (shown in dashed outlines) are real or fake. The classes are imbalanced in nature, as fake users often constitute only a small fraction of all users. The semi-supervised setting further magnifies the class imbalance issue, as we are only given limited labeled data, which makes the number of labeled minority samples extremely small.

(a) Bot detection task
(b) After over-sampling
Figure 1. An example of bot detection on a social network, and the idea of over-sampling. Note that the over-sampling is in the latent space.

The imbalanced node classification brings challenges to existing GNNs because the majority classes could dominate the loss function of GNNs, which makes the trained GNNs over-classify those majority classes and become unable to predict accurately for samples from minority classes. This issue impedes the adoption of GNNs for many real-world applications with imbalanced class distribution such as malicious account detection. Therefore, it is important to develop GNNs for class imbalanced node classification.

In the machine learning domain, the traditional class imbalance problem has been extensively studied. Algorithms can be summarized into three groups: data-level approaches, algorithm-level approaches, and hybrid approaches. Data-level approaches seek to make the class distribution more balanced, using over-sampling or down-sampling techniques (More, 2016; Chawla et al., 2002); algorithm-level approaches typically introduce different mis-classification penalties or prior probabilities for different classes (Elkan, 2001; Ling and Sheng, 2008; Zhou and Liu, 2005); and hybrid approaches (Chawla et al., 2003; Liu et al., 2008) combine both of them. However, directly applying them to graphs may give sub-optimal results. Relations are the key information to be exploited in graph data, and under-representation of minority samples would impair not only their embedding quality, but also the knowledge exchange across neighboring nodes. Previous algorithms fail to address this because of their i.i.d. assumption, which treats each sample as independent.

Therefore, in this work, we study a novel problem of exploring synthetic minority over-sampling for imbalanced node classification with GNNs (code available at https://github.com/TianxiangZhao/GraphSmote). The idea is shown in Figure 1(b). Previous algorithms are not readily applicable to graphs for two reasons. First, it is difficult to generate relation information for synthesized new samples. Mainstream over-sampling techniques (More, 2016) use interpolation between a target example and its nearest neighbor to generate new training examples. However, interpolation is improper for edges, as they are usually discrete and sparse; interpolation could break down the topology structure. Second, the synthesized new samples could be of low quality. Node attributes are high-dimensional, and directly interpolating on them would easily generate out-of-domain examples, which are not beneficial for training the classifier.

To address these two problems, we extend previous over-sampling algorithms into a new framework, GraphSMOTE, in order to cope with graphs. The modifications are mainly in two places. First, we propose to obtain new edges between generated samples and existing samples with an edge predictor. This predictor can learn the genuine distribution of edges, and hence can be used to produce reliable relation information among samples. Second, we propose to perform interpolation in the intermediate embedding space of a GNN, inspired by (Ando and Huang, 2017). In this intermediate embedding space, the dimensionality is much lower and the distribution of samples from the same class is denser. As intra-class similarity as well as inter-class differences have already been captured by previous layers, interpolation can be better trusted to generate in-domain samples. Concretely, we propose a new framework in which the graph auto-encoding task and the node classification task are combined. These two tasks share the same feature extractor, and over-sampling is performed in the output space of that module, as shown in Figure 2. The main contributions are:

  • We propose to study a novel problem, the node class imbalance problem for learning on graphs. It has many real-world applications, and to the best of our knowledge this paper is the first work focusing on this task.

  • We design a new framework which extends previous over-sampling algorithms to work for graph data. It addresses the deficiencies of previous methods, by generating more natural nodes as well as relation information. Besides, it is general and easy to extend.

  • Experiments are performed on three datasets, and GraphSMOTE outperforms all baselines by a large margin. Extensive analysis of our model's behavior as well as recommended settings are also presented.

The rest of the paper is organized as follows. In Sec. 2, we review related work. In Sec. 3, we formally define the problem. In Sec. 4, we give the details of GraphSMOTE. In Sec. 5, we conduct experiments to evaluate the effectiveness of GraphSMOTE. In Sec. 6, we conclude with future work.

2. Related Work

In this section, we briefly review related work on graph neural networks and the class imbalance problem.

2.1. Class Imbalance Problem

Class imbalance is common in real-world applications and has long been a classical research direction in the machine learning domain. Plenty of tasks suffer from this problem, like medical diagnosis (Mac Namee et al., 2002; Grzymala-Busse et al., 2004) or fine-grained image classification (Peng et al., 2017; Van Horn et al., 2017). Classes with a larger number of instances are usually called majority classes, and those with fewer instances are usually called minority classes. The countermeasures against this problem can generally be classified into three groups, i.e., data-level, algorithm-level, and hybrid.

The first group of methods are data-level, seeking to directly adjust class sizes through over- or under-sampling. The vanilla form of over-sampling is replicating existing samples. It reduces the imbalance, but can lead to over-fitting as no extra information is introduced. SMOTE (Chawla et al., 2002) addresses this problem by generating new samples, performing interpolation between samples in minority classes and their nearest neighbors. SMOTE is the most popular over-sampling approach, and many extensions have been proposed on top of it to make the interpolation process more effective. For example, Borderline-SMOTE (Han et al., 2005) limits over-sampling to samples near the class borderline, which are believed to be more informative. Safe-Level-SMOTE (Bunkhumpornpat et al., 2009) computes a safe level for each interpolation direction using majority-class neighbors, in order to make the generated samples safer. Cluster-based over-sampling (Jo and Japkowicz, 2004) first clusters samples into different groups and then over-samples each group separately, considering that small disjuncts often exist in the input space. Besides, (Ando and Huang, 2017) extends over-sampling to work with CNNs, through interpolation in an embedding space. Under-sampling discards some samples from majority classes, which can also make classes balanced, but at the price of losing some information. To overcome this deficiency, many extensions have been proposed to remove only redundant samples, like (Kubat et al., 1997; Barandela et al., 2004). The second group of methods are algorithm-level. Cost-sensitive learning (Zhou and Liu, 2005; Ling and Sheng, 2008) generally constructs a cost matrix to assign different mis-classification penalties to different classes; its effect is similar to vanilla over-sampling. (Parambath et al., 2014) proposes an approximation to the F-measure which can be directly optimized by gradient propagation. Threshold moving (Lawrence et al., 1998) modifies the inference process after the classifier is trained, by introducing a prior probability for each class. Through these approaches, the importance of minority classes can be increased. The last group are hybrid approaches, which combine multiple algorithms from one or both of the aforementioned categories. (Liu et al., 2008) uses a group of classifiers, each trained on a subset of the majority classes together with the minority classes. (Chawla et al., 2003) combines boosting with SMOTE, and (He and Garcia, 2009) combines over-sampling with cost-sensitive learning. (Sun et al., 2007) introduces three cost-sensitive boosting approaches, which iteratively update the impact of each class together with the AdaBoost parameters.

Systematic analyses have found that synthetic minority over-sampling techniques such as SMOTE are among the most popular and effective approaches for addressing class imbalance (Buda et al., 2018; Johnson and Khoshgoftaar, 2019). However, existing work is overwhelmingly dedicated to i.i.d. data. These techniques cannot be directly applied to graph-structured data because: (i) synthetic node generation in the raw feature space cannot take the graph information into consideration; and (ii) the generated nodes do not have links to the graph, which cannot facilitate graph-based classifiers such as GNNs. Hence, in this work, we focus on extending SMOTE to the graph domain for GNNs.

2.2. Graph Neural Network

In recent years, with the increasing need to learn on non-Euclidean spaces and to model rich relation information among samples, graph neural networks (GNNs) have received increasing attention and are developing rapidly. GNNs generalize convolutional neural networks to graph-structured data and have shown great ability in modeling such data. Current GNNs follow a message-passing framework, which is composed of pattern extraction and interaction modeling within each layer (Gilmer et al., 2017). Generally, existing GNN frameworks can be categorized into two categories, i.e., spectral-based (Bruna et al., 2013; Tang et al., 2019; Kipf and Welling, 2017; Hamilton et al., 2017) and spatial-based (Duvenaud et al., 2015; Atwood and Towsley, 2016).

Spectral-based GNNs define the convolution operation in the Fourier domain by computing the eigendecomposition of the graph Laplacian. Early work (Bruna et al., 2013) in this domain involves extensive computation and is time-consuming. To accelerate, (Tang et al., 2019) adopts Chebyshev polynomials to approximate spectral kernels and enforces locality constraints by truncating to the top-k terms. GCN (Kipf and Welling, 2017) takes a further step by preserving only the top-2 terms, obtaining a more simplified form; it is one of the most widely used GNNs currently. However, all spectral-based GNNs suffer from a generalization problem, as they depend on the Laplacian eigenbasis (Zhou et al., 2018). Hence, they are usually applied in the transductive setting, training and testing on the same graph structure. Spatial-based GNNs are more flexible and have stronger generalization ability. They implement convolutions based on the neighborhoods of each node. As each node could have a different number of neighbors, Duvenaud et al. (Duvenaud et al., 2015) use multiple weight matrices, one for each degree. (Atwood and Towsley, 2016) proposes a diffusion convolutional neural network, and (Niepert et al., 2016) adopts a fixed number of neighbors for each sample. A more popular model is GraphSage (Hamilton et al., 2017), which samples and aggregates embeddings from the local neighbors of each sample. More recently, (Xu et al., 2018) extends the expressive power of GNNs to that of the WL test, and (You et al., 2019) introduces a new GNN layer that can encode node positions.

Despite the success of various GNNs, existing work does not consider the class imbalance problem, which widely exists in real-world applications and can significantly reduce the performance of GNNs. Thus, we study a novel problem of synthetic minority over-sampling on graphs, to facilitate the adoption of GNNs for class-imbalanced node classification.

3. Problem Definition

In this work, we focus on the semi-supervised node classification task on graphs, in the transductive setting. As shown in Figure 1, we have a large network of entities, some of which are labeled for training. Both training and testing are performed on this same graph. Each entity belongs to one class, and the distribution of class sizes is imbalanced. This problem has many practical applications: for example, homophily in social networks, which results in the under-representation of minority groups; malicious behavior or fake user accounts on social networks, which are outnumbered by normal ones; and linked web pages in knowledge bases, where materials for some topics are limited.

Throughout this paper, we use G = {V, A, F} to denote an attributed network, where V is a set of n nodes. A is the adjacency matrix of G, and F denotes the node attribute matrix, where F[j, :] is the node attributes of node j and d is the dimension of the node attributes. Y is the class information for nodes in G. During training, only a subset of Y, Y_L, is available, containing the labels for the node subset V_L. There are m classes in total, {C_1, ..., C_m}. |C_i| is the size of the i-th class, referring to the number of samples belonging to that class. We use the imbalance ratio, min_i(|C_i|) / max_i(|C_i|), to measure the extent of class imbalance. In the imbalanced setting, the imbalance ratio of Y_L is small.

Given G whose node class set is imbalanced, and labels Y_L for a subset of nodes V_L, we aim to learn a node classifier f that can work well for both majority and minority classes, i.e.,

f(V, A, F) → Y    (1)
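As a concrete illustration of this measure, the following minimal sketch (ours, not part of the original paper) computes the imbalance ratio from the labels of the training nodes:

```python
import numpy as np

def imbalance_ratio(labels):
    """Ratio between the smallest and the largest class size among the
    labeled nodes: close to 1 means balanced, small means heavily imbalanced."""
    _, counts = np.unique(labels, return_counts=True)
    return counts.min() / counts.max()

# Example: three classes with 50, 45, and 5 labeled nodes -> ratio 0.1
y_train = np.array([0] * 50 + [1] * 45 + [2] * 5)
print(imbalance_ratio(y_train))  # 0.1
```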

4. Methodology

In this section, we give the details of the proposed framework GraphSMOTE. The main idea of GraphSMOTE is to generate synthetic minority nodes through interpolation in an expressive embedding space acquired by a GNN-based feature extractor, and to use an edge generator to predict the links for the synthetic nodes, which forms an augmented, balanced graph to facilitate node classification by GNNs. An illustration of the proposed framework is shown in Figure 2. GraphSMOTE is composed of four components: (i) a GNN-based feature extractor (encoder) which learns node representations that preserve node attributes and graph topology to facilitate synthetic node generation; (ii) a synthetic node generator which generates synthetic minority nodes in the latent space; (iii) an edge generator which generates links for the synthetic nodes to form an augmented graph with balanced classes; and (iv) a GNN-based classifier which performs node classification on the augmented graph. Next, we give the details of each component.

Figure 2. Overview of the framework

4.1. Feature Extractor

One way to generate synthetic minority nodes is to directly apply SMOTE on the raw node feature space. However, this causes several problems: (i) the raw feature space could be sparse and high-dimensional, which makes it difficult to find two similar nodes of the same class for interpolation; and (ii) it does not consider the graph structure, which can result in sub-optimal synthetic nodes. Thus, instead of directly performing synthetic minority over-sampling in the raw feature space, we introduce a feature extractor to learn node representations that simultaneously capture node properties and graph topology. Generally, the node representations should reflect the inter-class and intra-class relations of samples: similar samples should be closer to each other, and dissimilar samples more distant. In this way, when performing interpolation on a minority node with its nearest neighbor, the obtained embedding has a higher probability of representing a new sample belonging to the same minority class. In graphs, the similarity of nodes needs to consider node attributes, node labels, as well as local graph structures. Hence, we implement the feature extractor with a GNN and train it on two down-stream tasks, edge prediction and node classification.

The feature extractor can be implemented using any kind of GNNs. In this work, we choose GraphSage as the backbone model structure because it is effective in learning from various types of local topology, and generalizes well to new structures. It has been observed that too deep GNNs often lead to sub-optimal performance, as a result of over-smoothing and over-fitting. Therefore, we adopt only one GraphSage block as the feature extractor. Inside this block, the message passing and fusing process can be written as:

h_v^1 = σ(W^1 · CONCAT(F[v, :], F · A[:, v]))    (2)

F is the input node attribute matrix and F[v, :] is the attribute vector of node v. A[:, v] is the v-th column of the adjacency matrix, and h_v^1 is the obtained embedding of node v. W^1 is the weight parameter, and σ refers to the activation function such as ReLU.
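To make Equation (2) concrete, below is a minimal PyTorch sketch of such a one-block GraphSage-style feature extractor. It is our own simplification (dense adjacency, mean aggregation of neighbor attributes, and our own module name), not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SageEncoder(nn.Module):
    """One GraphSage-style block: concatenate each node's own attributes
    with aggregated neighbor attributes, then apply a linear map and a
    non-linearity, as in Equation (2)."""

    def __init__(self, in_dim, emb_dim):
        super().__init__()
        self.weight = nn.Linear(2 * in_dim, emb_dim)  # W^1
        self.act = nn.ReLU()                          # sigma

    def forward(self, features, adj):
        # features: (n, d) attribute matrix F;  adj: (n, n) adjacency A
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = (adj @ features) / deg                # aggregated neighbor attributes
        h1 = self.act(self.weight(torch.cat([features, neigh], dim=1)))
        return h1                                     # (n, emb_dim) embeddings H^1
```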

4.2. Synthetic Node Generation

After obtaining the representation of each node in the embedding space constructed by the feature extractor, we can perform over-sampling on top of it. We seek to generate the expected representations for new samples from the minority classes. To perform over-sampling, we adopt the widely used SMOTE algorithm, which augments vanilla over-sampling by replacing repetition with interpolation. We choose it due to its popularity, but our framework can cope with other over-sampling approaches as well. The basic idea of SMOTE is to interpolate samples from the target minority class with their nearest neighbors in the embedding space that belong to the same class. Let v be a labeled minority node with label Y_v. The first step is to find the closest labeled node of the same class as v, i.e.,

nn(v) = argmin_u ‖h_u^1 − h_v^1‖,  s.t.  Y_u = Y_v    (3)

nn(v) refers to the nearest neighbor of v from the same class, measured using Euclidean distance in the embedding space. With the nearest neighbor, we can generate a synthetic node as

h_{v'}^1 = (1 − δ) · h_v^1 + δ · h_{nn(v)}^1    (4)

where δ is a random variable, following a uniform distribution in the range [0, 1]. Since h_v^1 and h_{nn(v)}^1 belong to the same class and are very close to each other, the generated synthetic node h_{v'}^1 should also belong to that class. In this way, we obtain labeled synthetic nodes.

For each minority class, we can apply SMOTE to generate synthetic nodes. We use a hyper-parameter, the over-sampling scale, to control the number of samples to be generated for each class. Through this generation process, we can make the distribution of class sizes more balanced, and hence make the trained classifier perform better on the initially under-represented classes.
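The following sketch illustrates Equations (3) and (4) in code. It is a simplified version written by us (dense tensors, operating only on the labeled nodes passed in), not the authors' implementation.

```python
import torch

def smote_in_embedding_space(emb, labels, minority_class, n_new):
    """Generate n_new synthetic embeddings for minority_class by interpolating
    sampled nodes with their same-class nearest neighbors (Equations (3), (4)).

    emb:    (k, emb_dim) embeddings of labeled nodes
    labels: (k,) labels aligned with emb
    """
    idx = (labels == minority_class).nonzero(as_tuple=True)[0]
    chosen = idx[torch.randint(len(idx), (n_new,))]

    # Same-class nearest neighbor, excluding the node itself (Eq. (3)).
    dist = torch.cdist(emb[chosen], emb[idx])
    dist[chosen.unsqueeze(1) == idx.unsqueeze(0)] = float('inf')
    nn_idx = idx[dist.argmin(dim=1)]

    # Random interpolation between a node and its neighbor (Eq. (4)).
    delta = torch.rand(n_new, 1)
    new_emb = (1 - delta) * emb[chosen] + delta * emb[nn_idx]
    new_labels = torch.full((n_new,), minority_class, dtype=labels.dtype)
    return new_emb, new_labels
```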

4.3. Edge Generator

Now we have generated synthetic nodes to balance the class distribution. However, these nodes are isolated from the raw graph, as they have no links. Thus, we introduce an edge generator to model the existence of edges among nodes. As GNNs need to learn how to extract and propagate features simultaneously, this edge generator provides relation information for the synthesized samples, and hence facilitates the training of the GNN-based classifier. The generator is trained on real nodes and existing edges, and is used to predict neighbor information for the synthetic nodes. These new nodes and edges are added to the initial adjacency matrix A, and serve as input to the GNN-based classifier.

In order to maintain the model's simplicity and make the analysis easier, we adopt a vanilla design, a weighted inner product, to implement this edge generator:

E_{v,u} = σ(h_v^1 · S · (h_u^1)^T)    (5)

where E_{v,u} refers to the predicted relation information between nodes v and u, and S is the parameter matrix capturing the interaction between nodes. The loss function for training the edge generator is

L_edge = ‖E − A‖_F^2    (6)

where E refers to the predicted connections between nodes in V, i.e., involving no synthetic nodes. Since we learn an edge generator that is good at reconstructing the adjacency matrix from the node representations, it should also give good link predictions for synthetic nodes.

With the edge generator, we attempt two strategies to put the predicted edges for synthetic nodes into the augmented adjacency matrix. In the first strategy, the generator is optimized using only edge reconstruction, and the edges for a synthetic node v' are generated by setting a threshold η:

Ã[v', u] = 1 if E_{v',u} > η, and 0 otherwise    (7)

where Ã is the adjacency matrix after over-sampling, obtained by inserting new nodes and edges into A, and is sent to the classifier.

In the second strategy, for synthetic node v', we use soft edges instead of binary ones:

Ã[v', u] = E_{v',u}    (8)

In this case, gradients on Ã can be propagated from the classifier, and hence the generator can be optimized using both the edge prediction loss and the node classification loss, which will be introduced later. Both strategies are implemented, and their performance is compared in the experiment section.
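Below is a minimal sketch of the weighted inner-product edge generator and the two edge-insertion strategies (Equations (5)-(8)). Module and function names are our own, and the dense-matrix treatment is a simplification rather than the authors' code.

```python
import torch
import torch.nn as nn

class EdgeGenerator(nn.Module):
    """Weighted inner-product edge generator (Equation (5)):
    E[v, u] = sigma(h_v . S . h_u), with S a learnable interaction matrix."""

    def __init__(self, emb_dim):
        super().__init__()
        self.S = nn.Parameter(torch.eye(emb_dim))

    def forward(self, emb):
        return torch.sigmoid(emb @ self.S @ emb.t())     # predicted adjacency E

    def reconstruction_loss(self, emb_real, adj):
        # Equation (6): reconstruct the original adjacency using real nodes only.
        E = self.forward(emb_real)
        return ((E - adj) ** 2).mean()

def augmented_adjacency(E, adj, n_real, soft=False, eta=0.5):
    """Build the over-sampled adjacency: entries among real nodes are copied
    from adj; entries touching synthetic nodes come from E, either thresholded
    (Equation (7)) or kept continuous (Equation (8))."""
    new_adj = E.clone() if soft else (E > eta).float()
    new_adj[:n_real, :n_real] = adj
    return new_adj
```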

4.4. GNN Classifier

Let the augmented node representation set be obtained by concatenating H^1 (the embeddings of real nodes) with the embeddings of the synthetic nodes, and the augmented labeled set be obtained by incorporating the synthetic nodes into V_L and Y_L. Now we have an augmented graph with adjacency matrix Ã and a labeled node set whose class sizes are balanced, so an unbiased GNN classifier can be trained on it. Specifically, we adopt another GraphSage block, appended by a linear layer, for node classification on the augmented graph:

h_v^2 = σ(W^2 · CONCAT(h_v^1, H^1 · Ã[:, v]))    (9)
P_v = softmax(σ(W^c · CONCAT(h_v^2, H^2 · Ã[:, v])))    (10)

where H^2 represents the node representation matrix of the second GraphSage block, and W^c refers to the weight parameters of the linear layer. P_v is the probability distribution over class labels for node v. The classifier module is optimized using the cross-entropy loss over the augmented labeled set:

L_node = − Σ_u Σ_c 1(Y_u = c) · log(P_u[c])    (11)

During testing, the predicted class for node v, Y'_v, is set as the class with the highest probability,

Y'_v = argmax_c P_v[c]    (12)
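As a rough sketch of this classification head (again a simplification by us: the second aggregation in Equation (10) is folded into a single linear layer, and names are our own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SageClassifier(nn.Module):
    """Second GraphSage-style block plus a linear layer, mapping the augmented
    embeddings and adjacency to class probabilities (simplified Eqs. (9)-(10))."""

    def __init__(self, emb_dim, n_classes):
        super().__init__()
        self.block = nn.Linear(2 * emb_dim, emb_dim)   # W^2
        self.cls = nn.Linear(emb_dim, n_classes)       # W^c

    def forward(self, emb, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        neigh = (adj @ emb) / deg
        h2 = torch.relu(self.block(torch.cat([emb, neigh], dim=1)))
        return F.log_softmax(self.cls(h2), dim=1)      # log P_v

def node_loss(log_probs, labels, labeled_idx):
    # Equation (11): cross-entropy over the augmented labeled set.
    return F.nll_loss(log_probs[labeled_idx], labels[labeled_idx])

# Prediction at test time (Equation (12)): the most probable class.
# predictions = log_probs.argmax(dim=1)
```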

4.5. Optimization Objective

Putting the feature extractor, synthetic node generator, edge generator, and GNN classifier together, the final objective function of GraphSMOTE can be written as:

min_{θ, φ, ψ}  L_node + λ · L_edge    (13)

where θ, φ, and ψ are the parameters of the feature extractor, edge generator, and node classifier respectively, and λ weights the edge reconstruction loss. As the model's performance depends on the quality of the embedding space and the generated edges, to make the training phase more stable, we also tried pre-training the feature extractor and edge generator using L_edge.

The design of GraphSMOTE has several advantages: (i) the synthetic minority over-sampling process is easy to implement: by uniting interpolated node embeddings with predicted edges, new samples can be generated; (ii) the feature extractor is optimized using training signals from both the node classification task and the edge prediction task, so rich intra-class and inter-class relation information is encoded in the embedding space, making the interpolation more robust; and (iii) it is a general framework: it can cope with different structural choices for each component, and different regularization terms can be enforced to provide prior knowledge.

4.6. Training Algorithm

The full pipeline of our framework is summarized in Algorithm 1. Inside each optimization step, we first obtain node representations using the feature extractor in line 6. Then, from line 7 to line 11, we perform over-sampling in the embedding space to make the node classes balanced. After predicting edges for the generated new samples in line 12, the node classifier can be trained on top of the over-sampled graph. The full framework is trained jointly with the edge prediction loss and the node classification loss, as shown in line 13.

0:  Input: the graph G with labels Y_L for the training node subset V_L
0:  Output: a trained feature extractor, edge generator, and node classifier
1:  Randomly initialize the feature extractor, edge generator and node classifier;
2:  if Require pre-train then
3:     Fix other parts, train the feature extractor and edge generator modules until convergence, based on the loss L_edge;
4:  end if
5:  while Not Converged do
6:     Input G to the feature extractor, obtaining the embeddings H^1;
7:     for class c in minority classes do
8:        for i in  do
9:           Generate a new sample in class c, following Equations (3) and (4);
10:        end for
11:     end for
12:     Generate Ã using the edge generator, based on Equation (7) or (8);
13:     Update the model using the loss in Equation (13);
14:  end while
15:  return  Trained feature extractor, edge predictor, and node classifier module.
Algorithm 1 Full Training Algorithm
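To make Algorithm 1 concrete, the following is a rough PyTorch sketch of one joint optimization step, assuming the component sketches given earlier (SageEncoder, smote_in_embedding_space, EdgeGenerator, augmented_adjacency, SageClassifier, node_loss). The default values such as lam are placeholders to be tuned, not values from the paper.

```python
import torch

def train_step(encoder, edge_gen, classifier, optimizer,
               features, adj, labels, train_idx, minority_classes,
               scale=1.0, lam=1e-6, soft_edges=True):
    """One optimization step of Algorithm 1 (lines 6-13)."""
    optimizer.zero_grad()

    # Line 6: embed all nodes with the feature extractor.
    emb = encoder(features, adj)
    n_real = emb.size(0)

    # Lines 7-11: over-sample every minority class in the embedding space.
    aug_emb, aug_labels, aug_idx = emb, labels, train_idx
    for c in minority_classes:
        n_new = int(scale * (labels[train_idx] == c).sum())
        new_emb, new_lab = smote_in_embedding_space(
            emb[train_idx], labels[train_idx], c, n_new)
        new_idx = torch.arange(aug_emb.size(0), aug_emb.size(0) + n_new)
        aug_emb = torch.cat([aug_emb, new_emb])
        aug_labels = torch.cat([aug_labels, new_lab])
        aug_idx = torch.cat([aug_idx, new_idx])

    # Line 12: predict edges and build the augmented adjacency matrix.
    E = edge_gen(aug_emb)
    aug_adj = augmented_adjacency(E, adj, n_real, soft=soft_edges)

    # Line 13: joint loss L_node + lambda * L_edge (Equation (13)).
    loss = node_loss(classifier(aug_emb, aug_adj), aug_labels, aug_idx) \
           + lam * edge_gen.reconstruction_loss(emb, adj)
    loss.backward()
    optimizer.step()
    return loss.item()
```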

5. Experiments

In this section, we conduct experiments to evaluate the benefits of GraphSMOTE for the node classification task when classes are imbalanced. Both artificial and genuine imbalanced datasets are used, and different configurations are adopted to test its generalization ability. Particularly, we want to answer the following questions:

  • How effective is GraphSMOTE in the imbalanced node classification task?

  • How do different choices of the over-sampling scale affect the performance of GraphSMOTE?

  • Can GraphSMOTE generalize well to different imbalance ratios, or different base model structures?

We begin by introducing the experimental settings, including datasets, baselines, and evaluation metrics. We then conduct experiments to answer these questions.

5.1. Experimental Settings

5.1.1. Datasets

We conduct experiments on two widely used publicly available datasets for node classification, Cora (Sen et al., 2008) and BlogCatalog (Tang and Liu, 2009), and one fake account detection dataset, Twitter (Mohammadrezaei et al., 2018). The details of these three datasets are given as follows:

  • Cora: Cora is a citation network dataset for the transductive learning setting. It contains one single large graph with 2,708 papers from 7 areas. Each node has a 1,433-dimensional attribute vector, and a total of 5,429 citation links exist in the graph. In this dataset, the class distribution is relatively balanced, so we use an imitative imbalanced setting: three random classes are selected as minority classes and down-sampled, so that each majority class keeps a fixed number of labeled training nodes while each minority class keeps only a fraction of that number, determined by the imbalance ratio. We vary the imbalance ratio to analyze the performance of GraphSMOTE under various imbalanced scenarios.

  • BlogCatalog: This is a social network dataset crawled from BlogCatalog (http://www.blogcatalog.com), with 10,312 bloggers and 333,983 friendship edges. The dataset does not contain node attributes. Following (Perozzi et al., 2014), we attribute each node with an embedding vector obtained from DeepWalk. Classes in this dataset follow a genuine imbalanced distribution, with the smallest classes containing far fewer bloggers than the largest. For this dataset, we use a fixed proportion of the samples of each class for training and 25% for validation, with the remainder for testing.

  • Twitter: This dataset was crawled by (Mohammadrezaei et al., 2018) with a dedicated API crawler from Twitter (https://twitter.com) to study the bot infestation problem. Only a small portion of its users are bots. In this work, we split off a connected sub-graph containing both genuine users and bots. Node embeddings are obtained through DeepWalk, appended with node degrees. This dataset is used for binary classification, and the class distribution is heavily imbalanced. We randomly select a portion of the total samples for training, 25% for validation, and the remainder for testing.

5.1.2. Baselines

We compare GraphSMOTE with representative and state-of-the-art approaches for handling imbalanced class distribution problem, which includes:

  • Over-sampling: A classical approach for the imbalanced learning problem, which repeats samples from minority classes. We implement it in the raw input space, by duplicating minority nodes together with their edges, so that in each training iteration the labeled set is over-sampled until the classes are balanced.

  • Re-weight (Yuan and Ma, 2012): This is a cost-sensitive approach which assigns class-specific loss weights. In particular, it assigns higher loss weights to samples from minority classes so as to alleviate the issue of majority classes dominating the loss function.

  • SMOTE (Chawla et al., 2002): Synthetic minority over-sampling generates synthetic minority samples by interpolating a minority sample with its nearest neighbors of the same class. For a newly generated node, its edges are set to be the same as those of the target node.

  • Embed-SMOTE (Ando and Huang, 2017): An extension of SMOTE for the deep learning scenario, which performs over-sampling in an intermediate embedding layer instead of the input domain. We set it to the output of the last GNN layer, so that there is no need to generate edges.

Based on the strategy for training the edge generator and setting edges, four implementations of GraphSMOTE are tested:

  • Binary edges, trained separately: The edge generator is trained using the loss from the edge prediction task only. The predicted edges are set to binary values with a threshold before being sent to the GNN-based classifier;

  • Soft edges, trained jointly: Predicted edges are kept continuous so that gradients can be computed and propagated from the GNN-based classifier. The edge generator is trained together with the other components, using training signals from both the edge generation task and the node classification task;

  • Pre-trained, binary edges: An extension of the first variant, in which the feature extractor and edge generator are pre-trained on the edge prediction task before fine-tuning on Equation (13). During fine-tuning, the edge generator is optimized using only L_edge;

  • Pre-trained, soft edges: An extension of the second variant, in which the same pre-training process is conducted before fine-tuning.

In the experiments, all these methods are implemented and tested on the same GNN-based network for a fair comparison.

Table 1. Comparison of different approaches for imbalanced node classification on Cora, BlogCatalog, and Twitter, in terms of ACC, AUC-ROC, and F-score (rows: Origin, Over-sampling, Re-weight, SMOTE, Embed-SMOTE, and the GraphSMOTE variants).

5.1.3. Evaluation Metrics

Following existing works on evaluating imbalanced classification (Rout et al., 2018; Johnson and Khoshgoftaar, 2019), we adopt three criteria: classification accuracy (ACC), mean AUC-ROC score (Bradley, 1997), and mean F-measure. ACC is computed on all testing examples at once, and therefore may underweight under-represented classes. The AUC-ROC score illustrates the probability that the correct class is ranked higher than the other classes, and the F-measure gives the harmonic mean of precision and recall for each class. Both the AUC-ROC score and the F-measure are calculated separately for each class and then averaged without weighting, and can therefore better reflect the performance on minority classes.
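For reference, these criteria could be computed with scikit-learn roughly as follows (a sketch by us; macro averaging stands in for the non-weighted class average described above):

```python
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

def imbalance_metrics(y_true, y_prob):
    """y_true: (n,) integer labels; y_prob: (n, n_classes) predicted class
    probabilities. Returns ACC, macro AUC-ROC, and macro F-measure."""
    y_pred = y_prob.argmax(axis=1)
    acc = accuracy_score(y_true, y_pred)
    # One-vs-rest AUC, averaged over classes without class-size weighting.
    auc = roc_auc_score(y_true, y_prob, multi_class='ovr', average='macro')
    f1 = f1_score(y_true, y_pred, average='macro')
    return acc, auc, f1
```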

5.1.4. Configurations

All experiments are conducted on a 64-bit machine with an Nvidia Tesla V100 GPU (1246 MHz, 16 GB memory), and the ADAM optimization algorithm is used to train all models.

For all methods, the same initial learning rate and weight decay are used. λ is set to a small value, since L_edge is not normalized and is much larger than L_node. On the Cora dataset, unless otherwise specified, a fixed imbalance ratio and over-sampling scale are used. For the BlogCatalog and Twitter datasets, the imbalance ratio is not involved, and the over-sampling scale is set class-wise for each minority class, so as to make the class sizes balanced. All models are trained until convergence, with a fixed maximum number of training epochs.

5.2. Imbalanced Classification Performance

To answer the first question, we compare the imbalanced node classification performance of GraphSMOTE with the baselines on the three datasets described above. Each experiment is conducted 3 times to alleviate randomness. The average results with standard deviations are reported in Table 1. From the table, we can make the following observations:

  • All four variants of GraphSMOTE show significant improvements on the imbalanced node classification task, compared to the “Origin” setting, in which no special algorithm is adopted. They also outperform almost all baselines on all datasets and all evaluation metrics. These results validate the effectiveness of the proposed framework.

  • The improvements brought by GraphSMOTE are much larger than those obtained by directly applying previous over-sampling algorithms. For example, GraphSMOTE shows a clear improvement in AUC-ROC score compared with Over-sampling, and a further improvement compared with Embed-SMOTE. This result validates the advantages of GraphSMOTE over previous algorithms in constructing an embedding space for interpolation and providing relation information.

  • Among the different variants of GraphSMOTE, the pre-trained implementations show much stronger performance than the non-pre-trained ones. This result implies the importance of a better embedding space in which the similarities among samples are well encoded.

To summarize, these results demonstrate the advantages of introducing an over-sampling algorithm for the imbalanced node classification task. They also validate that GraphSMOTE can generate more realistic samples, and show the importance of providing relation information.

Figure 3. Effects of the over-sampling scale.

5.3. Influence of Over-sampling Scale

In this subsection, we analyze how the performance of different algorithms changes with the over-sampling scale, in pursuit of answering the second question. To conduct experiments in a constrained setting, we use the Cora dataset with a fixed imbalance ratio and vary the over-sampling scale. Every experiment is conducted 3 times and the average results are presented in Figure 3. From the figure, we make the following observations:

  • When the over-sampling scale is small, generating more samples for minority classes, i.e., making the classes more balanced, helps the classifier achieve better performance. This is as expected, because these synthetic nodes not only balance the dataset but also introduce new supervision for training a better GNN classifier.

  • When the over-sampling scale becomes larger, continuing to increase it may have the opposite effect. It can be observed that the performance remains similar, or degrades a little, once the over-sampling scale grows past the point where the classes are balanced. This is because when too many synthetic nodes are generated, some of them contain similar or redundant information which cannot further help learn a better GNN.

  • Based on these observations, setting the over-sampling scale to a value that makes the classes balanced is generally a good choice, which is consistent with existing work on synthetic minority over-sampling (Buda et al., 2018).

5.4. Influence of Imbalance Ratio

In this subsection, we analyze the performance of different algorithms with respect to different imbalance ratios, to evaluate their robustness. The experiment is also conducted in a well-constrained setting on Cora, by fixing the over-sampling scale and varying the imbalance ratio. Each experiment is conducted 3 times and the average results are shown in Table 2. From the table, we make the following observations:

  • The proposed framework GraphSMOTE generalizes well to different imbalance ratios. It achieves the best performance across all the settings, which shows the effectiveness of the proposed framework under various scenarios.

  • The improvement of GraphSMOTE is more significant when the imbalance is more extreme. For example, at the most extreme imbalance ratio tested, GraphSMOTE outperforms Re-weight by a clear margin, and the gap narrows as the imbalance ratio becomes milder. This is because when the dataset is not that imbalanced, minority over-sampling is less important, which makes the improvement of the proposed algorithm over others less significant.

  • Pre-training is important when the imbalance ratio is extreme. At the most extreme imbalance ratio tested, the pre-trained implementation shows a clear improvement over its non-pre-trained counterpart, and the gap narrows as the imbalance ratio becomes milder.

Table 2. Node classification performance in terms of AUC on Cora under various imbalance ratios (rows: Origin, Over-sampling, Re-weight, SMOTE, Embed-SMOTE, and the GraphSMOTE variants).

5.5. Influence of Base Model

In this subsection, we test the generalization ability of the proposed algorithm by applying it to another widely-used graph neural network: GCN. The comparison between it and the baselines is presented in Table 3. All methods are implemented on the same network. Experiments are performed on Cora, with a fixed imbalance ratio and over-sampling scale, and are run three times, with both averaged results and standard deviations reported. From the results, it can be observed that:

  • Generally, GraphSMOTE adapts well to the GCN-based model. All four variants work well and achieve the best performance, as shown in Table 3.

  • Compared with using GraphSage as base model, a main difference is that pre-training seems to be less necessary in this case. We think it may be caused by the fact that GCN is less powerful than GraphSage in representation ability. GraphSage is more flexible and can model more complex relation information, and hence is more difficult to train. Therefore, it can benefit more from obtaining a well-trained embedding space in advance.

Table 3. Performance of different algorithms on Cora when GCN is used as the base model, in terms of ACC, AUC-ROC, and F-score (rows: Origin, Over-sampling, Re-weight, SMOTE, Embed-SMOTE, and the GraphSMOTE variants).

5.6. Parameter Sensitivity Analysis

In this part, the hyper-parameter λ is varied to test GraphSMOTE's sensitivity towards it. To keep things simple, we adopt two of the variants as base models and vary λ over a range of values. Each experiment is conducted on Cora with a fixed imbalance ratio and over-sampling scale. The results are shown in Figure 4. From the figure, we can observe that:

  • Generally, as λ increases, the performance first increases and then decreases. The performance drops significantly if λ is too large, and a smaller λ generally works better. The reason could be the difference in scale between the two losses.

  • Pre-training makes GraphSMOTE more stable with respect to λ.

(a) AUC-ROC Score
(b) F-measure
Figure 4. Effects of the hyper-parameter λ.

6. Conclusion and Future Work

The class imbalance problem of nodes in graphs widely exists in real-world tasks, like fake user detection, web page classification, malicious machine detection, etc. This problem can significantly influence a classifier's performance on minority classes, but was left unconsidered in previous works. Thus, in this work, we investigate the imbalanced node classification task. Specifically, we propose a novel framework, GraphSMOTE, which extends previous over-sampling algorithms for i.i.d. data to the graph setting. Concretely, GraphSMOTE constructs an intermediate embedding space with a feature extractor, and trains an edge generator and a GNN-based node classifier simultaneously on top of that space. Experiments on one artificial dataset and two real-world datasets demonstrate its effectiveness, outperforming all baselines by a large margin. Ablation studies are performed to understand how GraphSMOTE performs under various scenarios, and a parameter sensitivity analysis is conducted to understand the sensitivity of GraphSMOTE to its hyper-parameters.

There are several interesting directions that need further investigation. First, besides node classification, other tasks like edge type prediction or node representation learning may also suffer from the under-representation of nodes in minority classes, and sometimes node classes might not be provided explicitly. Therefore, we will extend GraphSMOTE to handle other types of imbalanced learning problems on graphs. Second, in this paper, we mainly conduct experiments on citation networks and social media networks. There are many other real-world applications which can be treated as imbalanced node classification problems. Therefore, we would like to extend our framework to more application domains, such as document analysis on the web.

7. Acknowledgement

This project was partially supported by NSF projects IIS-1707548, CBET-1638320, IIS-1909702, IIS1955851, and the Global Research Outreach program of Samsung Advanced Institute of Technology under grant #225003.

References

  • S. Ando and C. Y. Huang (2017) Deep over-sampling framework for classifying imbalanced data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 770–785. Cited by: §1, §2.1, 4th item.
  • J. Atwood and D. Towsley (2016) Diffusion-convolutional neural networks. In Advances in neural information processing systems, pp. 1993–2001. Cited by: §2.2, §2.2.
  • R. Barandela, R. M. Valdovinos, J. S. Sánchez, and F. J. Ferri (2004) The imbalanced training sample problem: under or over sampling?. In Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR), pp. 806–814. Cited by: §2.1.
  • A. P. Bradley (1997) The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition 30 (7), pp. 1145–1159. Cited by: §5.1.3.
  • J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §2.2, §2.2.
  • M. Buda, A. Maki, and M. A. Mazurowski (2018) A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks 106, pp. 249–259. Cited by: §2.1, 3rd item.
  • C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap (2009) Safe-level-smote: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia conference on knowledge discovery and data mining, pp. 475–482. Cited by: §2.1.
  • N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research 16, pp. 321–357. Cited by: §1, §2.1, 3rd item.
  • N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer (2003) SMOTEBoost: improving prediction of the minority class in boosting. In European conference on principles of data mining and knowledge discovery, pp. 107–119. Cited by: §1, §2.1.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §2.2, §2.2.
  • C. Elkan (2001) The foundations of cost-sensitive learning. In International joint conference on artificial intelligence, Vol. 17, pp. 973–978. Cited by: §1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML, Cited by: §2.2.
  • J. W. Grzymala-Busse, L. K. Goodwin, W. J. Grzymala-Busse, and X. Zheng (2004) An approach to imbalanced data sets based on changing rule strength. In Rough-neural computing, pp. 543–553. Cited by: §2.1.
  • W. L. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NIPS, Cited by: §1, §2.2, §2.2.
  • H. Han, W. Wang, and B. Mao (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing, pp. 878–887. Cited by: §2.1.
  • H. He and E. A. Garcia (2009) Learning from imbalanced data. IEEE Transactions on knowledge and data engineering 21 (9), pp. 1263–1284. Cited by: §2.1.
  • T. Jo and N. Japkowicz (2004) Class imbalances versus small disjuncts. ACM Sigkdd Explorations Newsletter 6 (1), pp. 40–49. Cited by: §2.1.
  • J. M. Johnson and T. M. Khoshgoftaar (2019) Survey on deep learning with class imbalance. Journal of Big Data 6 (1), pp. 27. Cited by: §2.1, §5.1.3.
  • T. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. ArXiv abs/1609.02907. Cited by: §1, §2.2, §2.2.
  • M. Kubat, S. Matwin, et al. (1997) Addressing the curse of imbalanced training sets: one-sided selection. In Icml, Vol. 97, pp. 179–186. Cited by: §2.1.
  • S. Lawrence, I. Burns, A. Back, A. C. Tsoi, and C. L. Giles (1998) Neural network classification and prior class probabilities. In Neural networks: tricks of the trade, pp. 299–313. Cited by: §2.1.
  • C. X. Ling and V. S. Sheng (2008) Cost-sensitive learning and the class imbalance problem. Vol. 2011, Citeseer. Cited by: §1, §2.1.
  • X. Liu, J. Wu, and Z. Zhou (2008) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39 (2), pp. 539–550. Cited by: §1, §2.1.
  • B. Mac Namee, P. Cunningham, S. Byrne, and O. I. Corrigan (2002) The problem of bias in training data in regression problems in medical decision support. Artificial intelligence in medicine 24 (1), pp. 51–70. Cited by: §2.1.
  • M. Mohammadrezaei, M. E. Shiri, and A. M. Rahmani (2018) Identifying fake accounts on social networks based on graph analysis and classification algorithms. Security and Communication Networks 2018. Cited by: §1, 3rd item, §5.1.1.
  • A. More (2016) Survey of resampling techniques for improving classification performance in unbalanced datasets. arXiv preprint arXiv:1608.06048. Cited by: §1, §1.
  • M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023. Cited by: §2.2.
  • S. P. Parambath, N. Usunier, and Y. Grandvalet (2014) Optimizing f-measures by cost-sensitive classification. In Advances in Neural Information Processing Systems, pp. 2123–2131. Cited by: §2.1.
  • Y. Peng, X. He, and J. Zhao (2017) Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing 27 (3), pp. 1487–1500. Cited by: §2.1.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) DeepWalk: online learning of social representations. In KDD ’14, Cited by: 2nd item.
  • N. Rout, D. Mishra, and M. K. Mallick (2018) Handling imbalanced data: a survey. In International Proceedings on Advances in Soft Computing, Intelligent Systems and Applications, pp. 431–443. Cited by: §5.1.3.
  • P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad (2008) Collective classification in network data. AI Magazine 29, pp. 93–106. Cited by: §5.1.1.
  • Y. Sun, M. S. Kamel, A. K. Wong, and Y. Wang (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40 (12), pp. 3358–3378. Cited by: §2.1.
  • L. Tang and H. Liu (2009) Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 817–826. Cited by: §5.1.1.
  • S. Tang, B. Li, and H. Yu (2019) ChebNet: efficient and stable constructions of deep neural networks with rectified power units using chebyshev approximations. ArXiv abs/1911.05467. Cited by: §2.2, §2.2.
  • G. Van Horn, O. Mac Aodha, Y. Song, A. Shepard, H. Adam, P. Perona, and S. Belongie (2017) The inaturalist challenge 2017 dataset. arXiv preprint arXiv:1707.06642 1 (2), pp. 4. Cited by: §2.1.
  • Z. Wang, X. Ye, C. Wang, J. Cui, and P. Yu (2020) Network embedding with completely-imbalanced labels. IEEE Transactions on Knowledge and Data Engineering. Cited by: §1.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. In International Conference on Learning Representations, Cited by: §1, §2.2.
  • Z. Yang, W. Cohen, and R. Salakhudinov (2016) Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, pp. 40–48. Cited by: §1.
  • J. You, R. Ying, and J. Leskovec (2019) Position-aware graph neural networks. In ICML, Cited by: §2.2.
  • B. Yuan and X. Ma (2012) Sampling + reweighting: boosting the performance of adaboost on imbalanced datasets. The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1–6. Cited by: 2nd item.
  • Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, and E. Gillum (2009) BotGraph: large scale spamming botnet detection.. In NSDI, Vol. 9, pp. 321–334. Cited by: §1.
  • J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun (2018) Graph neural networks: a review of methods and applications. ArXiv abs/1812.08434. Cited by: §2.2.
  • Z. Zhou and X. Liu (2005) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on knowledge and data engineering 18 (1), pp. 63–77. Cited by: §1, §2.1.