The development of information technology has brought an explosion of data into our daily life. However, many real applications generate highly imbalanced datasets for their key classification tasks. For instance, online advertising services produce large volumes of data, consisting of user views or clicks on ads, for the task of click-through rate prediction [graepel2010ctr]. Commonly, user clicks constitute only a small fraction of user behaviors. As another example, credit fraud detection [dal2018creditfraud] relies on datasets containing massive numbers of real credit card transactions, of which only a small proportion are frauds. Similar situations also exist in medical diagnosis, record linkage, network intrusion detection, etc. [gamberger1999medical, sariyar2011record-linkage, haixiang2017overview]. In addition, real-world datasets are likely to contain other difficulty factors, including noise and missing values. Such highly imbalanced, large-scale and noisy data poses serious challenges to downstream classification tasks.
Traditional classification algorithms (e.g., C4.5, SVM or Neural Networks [quinlan1986dt, cortes1995svm, haykin2009neuralnetworks]) demonstrate unsatisfactory performance on imbalanced datasets. The situation can be even worse when the dataset is large-scale and noisy at the same time. Because these algorithms implicitly assume a relatively balanced distribution between positive and negative samples, the minority class is usually ignored due to the overwhelming number of majority instances. Yet the minority class usually carries the concepts of greater interest than the majority class [he2008overview, he2013overview].
To overcome this issue, a series of methods has been proposed, which can be classified into three categories:
Data-level methods modify the collection of examples to balance the class distribution and/or remove difficult samples. They may be inapplicable to datasets with categorical features or missing values due to their distance-based design (e.g., NearMiss, Tomeklink [mani2003nearmiss, tomek1976tomeklink]). Besides, they suffer from large computational cost (e.g., SMOTE, ADASYN [chawla2002smote, he2008adasyn]) when applied to large-scale data.
Algorithm-level methods directly modify existing learning algorithms to alleviate the bias towards majority objects. However, they require assistance from domain experts beforehand (e.g., setting the cost matrix in cost-sensitive learning [elkan2001cost-sensitive, liu2006cost-sensitive-imbalance]). They may also fail when combined with batch-training classifiers such as neural networks, since they do not balance the class distribution of the training data.
Ensemble methods combine one of the previous approaches with an ensemble learning algorithm to form an ensemble classifier. Some of them suffer from large training cost and poor applicability (e.g., SMOTEBagging [wang2009smotebagging]) on realistic tasks. Others potentially lead to underfitting or overfitting (e.g., EasyEnsemble, BalanceCascade [liu2009ee-bc]) when the dataset is highly noisy.
For the above reasons and more, none of the prevailing methods handles the highly imbalanced, large-scale and noisy classification task well, although it is a common problem in real-world applications. The main reason behind the failure of existing methods on such tasks is that they ignore difficulties embedded in the nature of imbalance learning. Besides the class imbalance itself, other factors such as the presence of noise samples [napierala2010learn-from-noisy-data] and overlapping underlying distributions between the classes [garcia2007overlap, prati2004overlap-small-disjuncts] also significantly deteriorate classification performance. Their influence can be further amplified by a high imbalance ratio. Moreover, different models show different sensitivity to these factors. All these factors therefore need to be considered to achieve more accurate classification.
We introduce the concept of “classification hardness” to integrate the aforementioned difficulties. Intuitively, hardness represents the difficulty of correctly classifying a sample for a specific classifier. Thus the distribution of classification hardness implicitly contains information about task difficulty. For example, noise samples are likely to have large hardness values, and the proportion of high-hardness samples reflects the level of class overlap. Moreover, the hardness distribution naturally adapts to different models since it is defined with respect to a given classifier. Such a hardness distribution can be used to guide the re-sampling strategy to achieve better performance.
Based on the classification hardness, we propose a novel learning framework called Self-paced Ensemble (abbreviated as SPE). Instead of simply balancing the positive/negative data or directly assigning instance weights, we consider the distribution of classification hardness over the dataset, and iteratively select the most informative majority data samples according to the hardness distribution. The under-sampling strategy is controlled by a self-paced procedure. This self-paced procedure enables our framework to gradually focus on the harder data samples while still keeping the knowledge of the easy sample distribution in order to prevent overfitting. Fig. 1 shows the pipeline of self-paced ensemble.
In summary, the contributions of this paper are as follows:
We demonstrate why conventional imbalance learning methods fail on real-world massive imbalanced classification tasks. We conduct comprehensive experiments with analysis and visualization that can be valuable for other similar classification systems.
We propose Self-paced Ensemble (SPE), a learning framework for massive imbalanced data classification. SPE can be used to boost any canonical classifier’s performance (e.g., C4.5, SVM, GBDT, and Neural Network) on real-world highly imbalanced tasks while being very computationally efficient. Compared with existing methods, SPE is accurate, fast, robust, and adaptive.
We introduce the concept of classification hardness. By considering the distribution of classification hardness over the dataset, the learning procedure of our proposed framework SPE is automatically optimized in a model-specific way. Unlike prevailing methods, our learning framework does not require any pre-defined distance metric, which is usually unavailable in real-world scenarios.
II Problem Definition
In this section, we first describe the class imbalance problem considered in this paper. Then we give the necessary symbol definitions and present the evaluation criteria that are commonly used in imbalanced scenarios.
Class imbalance: A dataset is said to be imbalanced whenever the numbers of instances from the different classes are not nearly the same. Class imbalance exists in the nature of various real-world applications, like medicine (sick vs. healthy), fraud detection (normal vs. fraud), or click-through-rate prediction (clicked vs. ignored). The uneven distribution makes it difficult to apply canonical learning algorithms to imbalanced datasets, as they will be biased towards the majority group due to their accuracy-oriented design. Although this problem has been extensively studied, in real applications class imbalance often co-exists with other difficulty factors, such as enormous data scale, noise, and missing values. Therefore, the performance of existing methods is still unsatisfactory.
Symbol definition: In this paper, we only consider the binary situation that exists widely in practical applications [dal2018creditfraud, he2008overview, he2013overview]. In binary imbalanced classification, only two classes are considered: the minority class with fewer samples and the majority class with relatively more samples. For simplicity, we always let the minority class be the positive class and the majority class be the negative class. We use $\mathcal{D}$ to denote the collection of all training samples $(x, y)$. The minority class set $\mathcal{P}$ and majority class set $\mathcal{N}$ are then defined as:
$$\mathcal{P} = \{(x, y) \mid y = 1\}, \qquad \mathcal{N} = \{(x, y) \mid y = 0\}$$
Therefore, we have $|\mathcal{N}| \gg |\mathcal{P}|$ for (highly) imbalanced problems. In order to uniformly describe the level of class imbalance in different datasets, we consider the Imbalance Ratio (IR), which is defined as the number of majority class examples divided by the number of minority class examples:
$$\mathrm{IR} = \frac{n_{\text{majority}}}{n_{\text{minority}}} = \frac{|\mathcal{N}|}{|\mathcal{P}|}$$
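As a quick worked check of this definition, the following minimal sketch computes the imbalance ratio from class counts. The function name `imbalance_ratio` is hypothetical; the counts are those reported for the Payment Simulation dataset in the experiments section.

```python
def imbalance_ratio(n_total, n_minority):
    """Return IR = (number of majority samples) / (number of minority samples)."""
    if n_minority <= 0 or n_minority > n_total:
        raise ValueError("n_minority must be in 1..n_total")
    n_majority = n_total - n_minority
    return n_majority / n_minority


# Payment Simulation: 8,213 frauds out of 6,362,620 transactions.
ir = imbalance_ratio(6_362_620, 8_213)
print(f"{ir:.2f}:1")  # 773.70:1
```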
Since accuracy does not reflect model performance well on imbalanced data, we usually adopt other evaluation criteria based on the numbers of true / false positive / negative predictions. Under the binary scenario, the correctly and incorrectly recognized examples of each class can be recorded in a confusion matrix. Table I shows the confusion matrix for binary classification.
|Label||Predicted Positive||Predicted Negative|
|Positive||True Positive (TP)||False Negative (FN)|
|Negative||False Positive (FP)||True Negative (TN)|
AUCPRC = Area Under Precision-Recall Curve
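To illustrate why accuracy is a poor criterion here, the sketch below derives a few confusion-matrix-based scores using their standard textbook definitions (precision, recall, F1-score and the Matthews correlation coefficient, MCC). The helper name `confusion_metrics` and the toy counts are assumptions for illustration only; AUCPRC is omitted because it needs ranked prediction scores rather than a single confusion matrix.

```python
import math


def confusion_metrics(tp, fn, fp, tn):
    """Standard scores computed from the four cells of a binary confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# A classifier that predicts "negative" for every sample of a 99:1 dataset.
always_negative = confusion_metrics(tp=0, fn=10, fp=0, tn=990)
print(always_negative["accuracy"])                    # 0.99
print(always_negative["f1"], always_negative["mcc"])  # 0.0 0.0
```

The accuracy of 0.99 hides the fact that no minority sample is recognized, while F1 and MCC both drop to zero.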
III Limitations of Existing Methods
In this section, we give a brief introduction to existing imbalance learning solutions and discuss why they obtain unsatisfactory performance on real-world industrial tasks. To solve the class imbalance problem, researchers have proposed a variety of methods. This research field is also known as imbalance learning. As mentioned in the introduction, we categorize existing imbalance learning methods into three groups: Data-level, Algorithm-level and Ensemble.
Data-level Methods: This group of methods concentrates on modifying the training set to make it suitable for a standard learning algorithm. With respect to balancing distributions, data-level methods can be categorized into three groups:
Under-sampling approaches that remove samples from the majority class (e.g., [kubat1997oss, laurikkala2001ncr, tomek1976tomeklink]).
Over-sampling approaches that generate new objects for the minority class (e.g., [chawla2002smote, han2005borderline-smote, he2008adasyn]).
Hybrid-sampling approaches that combine two methods above (e.g., [batista2003smotetomek, batista2004smoteenn]).
Standard random re-sampling methods often lead to the removal of important samples or the introduction of meaningless new objects. Therefore, more advanced methods were proposed that try to maintain the structure of groups and/or generate new data according to the underlying distribution. They apply the k-Nearest Neighbors (k-NN) algorithm [altman1992knn] to extract the underlying distribution in the feature space, and use that information to guide their re-sampling.
However, the k-NN algorithm requires a pre-defined distance metric, which is usually unavailable for real-world datasets since they may contain categorical features and/or missing values. The k-NN algorithm is also easily disturbed by noise and is thus unable to reveal the underlying distribution for re-sampling methods when the dataset is noisy. Moreover, the computational cost of applying k-NN grows quadratically with the size of the dataset. Thus running such distance-based re-sampling methods on large-scale datasets can be extremely slow.
Algorithm-level Methods: This group of methods concentrates on modifying existing learners to alleviate their bias towards majority groups. The most popular algorithm-level method is cost-sensitive learning [elkan2001cost-sensitive, liu2006cost-sensitive-imbalance]. By assigning large cost to minority instances and small cost to majority instances, it boosts minority importance during the learning process to alleviate the bias towards majority class.
It must be noted that the cost matrix for a specific task has to be given by domain experts beforehand, which is usually infeasible in many real-world problems. Even if one has the domain knowledge required for setting the costs, such a cost matrix is usually designed for a specific task and does not generalize across different classification tasks. On the other hand, for batch-trained models such as neural networks, the positive (minority) samples are contained in only a few batches. Even if we apply cost-sensitive learning in the training process, the model still quickly gets stuck in local minima.
Ensemble Methods: This group of methods concentrates on merging one of the data-level or algorithm-level solutions with an ensemble learning method. Most of them are based on a canonical ensemble learning algorithm with an imbalance learning algorithm embedded in the pipeline, e.g., SMOTE [chawla2002smote] + Adaptive Boosting [freund1997adaboost] = SMOTEBoost [chawla2003smoteboost]. Some other ensemble methods introduce another ensemble classifier as their base learner, e.g., EasyEnsemble [liu2009ee-bc] trains multiple AdaBoost classifiers to form its final ensemble.
However, these ensemble-based methods suffer from low efficiency, poor applicability and high sensitivity to noise when applied to realistic imbalanced tasks, since they still contain those data-level or algorithm-level imbalance learning methods in their pipelines. A few methods have carried out preliminary exploration of using training feedback information to perform dynamic re-sampling on imbalanced datasets. However, such methods do not take full account of the data distribution. For instance, BalanceCascade iteratively discards majority samples that are well classified by the current classifier. This may result in overweighting outliers in late iterations and finally deteriorate the ensemble.
IV Classification Hardness Distribution
Before we describe our algorithm, we introduce the concept of “classification hardness” in this section. We explain the benefits of incorporating the hardness distribution into an imbalance learning framework. We also present an intuitive visualization in Fig. 2 to help understand the relationship between hardness, imbalance ratio, class overlapping and model capacity.
Definition: We use the symbol $H$ to denote the classification hardness function, where $H$ can be any “decomposable” error function, i.e., the overall error is calculated by the summation of individual sample errors. Examples include Absolute Error, Squared Error (Brier score) and Cross Entropy. Suppose $F$ is a trained classifier; we use $F(x)$ to denote the classifier’s output probability of $x$ being a positive instance. Then the hardness of a sample $(x, y)$ with respect to $F$ is given by the function $H(x, y, F)$.
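The three example error functions can be sketched as per-sample hardness values as follows. This is a minimal illustration, assuming NumPy arrays of predicted positive-class probabilities and binary labels; the function name `hardness` and its `kind` argument are hypothetical.

```python
import numpy as np


def hardness(prob_pos, y, kind="absolute"):
    """Per-sample hardness from predicted P(y=1) and true labels in {0, 1}."""
    p = np.asarray(prob_pos, dtype=float)
    y = np.asarray(y, dtype=float)
    if kind == "absolute":
        return np.abs(p - y)
    if kind == "squared":
        return (p - y) ** 2
    if kind == "cross_entropy":
        # Clip to avoid log(0) for over-confident predictions.
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))
    raise ValueError(f"unknown hardness kind: {kind}")


probs = [0.9, 0.2, 0.6]   # classifier outputs
labels = [1, 0, 0]        # the third sample is a confidently wrong negative
print(hardness(probs, labels))  # absolute error per sample
```

Each function is decomposable in the sense used above: the overall error is the sum of the returned per-sample values.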
Advantages: The concept of the classification hardness has two advantages under the imbalance classification scenario:
First, it fills the gap between the imbalance ratio and the task difficulty. As mentioned in the introduction, even with the same imbalance ratio, different tasks can demonstrate extremely different difficulties. We show a detailed example in Fig. 2. In Fig. 2(a), the dataset is generated with two disjoint Gaussian components, and the growth of the imbalance ratio has little effect on the task hardness. In Fig. 2(d), in contrast, the dataset is generated by several overlapping Gaussian components. As the imbalance ratio grows, it varies from an easy classification task to an extremely hard one. The imbalance ratio alone does not reflect this task hardness well. Instead, we show the classification hardness of those two datasets based on different classifiers. As the imbalance ratio grows, the quantity of hard samples increases sharply in Fig. 2(e) and Fig. 2(f), while it stays constant in Fig. 2(b) and Fig. 2(c). Thus, the classification hardness carries more information about the underlying data structure and better indicates the current task hardness.
Second, the classification hardness also fills the gap between the data sampling strategy and the classifier’s capacity. Most existing sampling methods ignore the capacity of the base classifier entirely. However, different classifiers usually demonstrate very different performance on imbalanced data classification. For example, in Fig. 2, KNN and AdaBoost show very different hardness distributions for the same dataset. It is therefore beneficial to consider the model capacity when performing under-sampling. Using the classification hardness, our framework is able to optimize any kind of classifier’s final performance in a model-specific way.
Types of Data Samples: Intuitively, we distinguish three kinds of data samples, i.e., trivial, noise and borderline samples according to their corresponding hardness values:
Most of the data samples are trivial samples that can be well classified by the current model, e.g., the blue samples in Fig. 2(e) and Fig. 2(f). Each trivial sample contributes only a tiny hardness. However, their overall contribution is non-negligible due to their large population. For this kind of sample, we only need to keep a small proportion to represent the “skeleton” of the corresponding distribution in order to prevent overfitting, and can drop most of the rest since they have already been well learned.
On the contrary, there are also several noise samples, e.g., the dark red samples in Fig. 2. Despite their small population, each of them contributes a large hardness value, so their total contribution can be very large. We stress that these noise samples are usually caused by indistinguishable overlapping or outliers, since they persist even when the model has converged. Forcing the model to learn such samples could lead to serious overfitting.
We simply classify the remaining samples as borderline samples. The borderline samples are the most informative data samples during training. For example, in Fig. 2 the light red points are very close to the decision boundary of the current model. Enlarging the weights of those borderline samples is usually helpful for further improving the model performance.
The above discussion provides us with an intuition to distinguish different data samples. However, since it is hard to make such an explicit distinction in practice, we alternatively categorize the data samples in a “soft” way, as described in the next section.
V Self-paced Ensemble
We now describe Self-paced Ensemble (SPE), our framework for massive imbalance classification. Code is available at https://github.com/ZhiningLiu1998/self-paced-ensemble. First, we present the ideas of hardness harmonize and the self-paced factor. After that, we summarize the SPE procedure in Algorithm 1.
V-A Self-paced Under-sampling
Motivated by the previous observations, we aim to design an under-sampling mechanism that reduces the effect of trivial and noise samples while enlarging the importance of the borderline samples. Therefore, we introduce the concept of “hardness harmonize” and a self-paced training procedure to achieve this goal.
V-A1 Hardness Harmonize
We split all the majority samples into $k$ bins according to their hardness values, where $k$ is a hyper-parameter. Each bin indicates a specific hardness level. We then under-sample the majority instances into a balanced dataset by keeping the total hardness contribution of each bin the same. This approach is called “harmonize” in the gradient-based optimization literature [li2018ghm], where the gradient contribution is harmonized in the batch training of neural networks. In our case, we adopt a similar idea to harmonize the hardness in the first iteration.
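One way to read the harmonize step is sketched below: if the number of samples kept from a bin is inversely proportional to the average hardness of that bin, every non-empty bin ends up with the same expected total hardness. The sketch assumes hardness values in [0, 1] and equal-width bins; the function name `harmonized_bin_quotas` and the small floor on the average hardness are illustrative assumptions, not part of the original description.

```python
import numpy as np


def harmonized_bin_quotas(hardness, k, n_keep):
    """Expected number of majority samples to keep from each of k hardness bins."""
    h = np.asarray(hardness, dtype=float)
    # Equal-width bins on [0, 1]; a hardness of exactly 1.0 falls in the last bin.
    bin_idx = np.minimum((h * k).astype(int), k - 1)
    counts = np.bincount(bin_idx, minlength=k)
    sums = np.bincount(bin_idx, weights=h, minlength=k)
    avg = np.divide(sums, counts, out=np.zeros(k), where=counts > 0)
    # Weight of a non-empty bin is 1 / (average hardness); empty bins get 0.
    weights = np.zeros(k)
    nonempty = counts > 0
    weights[nonempty] = 1.0 / np.maximum(avg[nonempty], 1e-12)
    quotas = n_keep * weights / weights.sum()
    return counts, avg, quotas


# 90 trivial, 9 borderline and 1 very hard majority sample.
h = np.array([0.05] * 90 + [0.55] * 9 + [0.95] * 1)
counts, avg, quotas = harmonized_bin_quotas(h, k=10, n_keep=10)
kept = counts > 0
print(quotas[kept] * avg[kept])  # equal expected hardness contribution per bin
```

A quota can exceed the number of samples actually available in a bin, so a practical implementation would cap it at the bin size.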
However, we do not simply use the hardness harmonize in all the iterations. The main reason is that the population of trivial samples grows during the training process since the ensemble classifier will gradually fit the training set. Hence, simply harmonizing the hardness contribution still leaves a lot of trivial samples (Fig. 3(b)). Those samples greatly slow down the learning procedure in the later iterations since they are less informative. Instead, we introduce the “self-pace factor” to perform self-paced harmonize under-sampling.
V-A2 Self-paced Factor
Specifically, starting from harmonizing the hardness contribution of each bin, we gradually decrease the sampling probability of those bins with a large population. The level of decrease is controlled by a self-paced factor $\alpha$. When $\alpha$ grows large, we focus more on the harder samples instead of simply harmonizing the hardness contribution. In the first few iterations, our framework mainly focuses on the informative borderline samples, so outliers and noise have little effect on the generalization ability of our model. In the later iterations, where $\alpha$ is very large, our framework still keeps a reasonable fraction of trivial (high confidence) samples as the “skeleton”, which effectively prevents our framework from overfitting. Fig. 3 shows the self-paced under-sampling process on a real-world large-scale dataset (the Payment Simulation dataset; its statistics can be found in Table III).
V-B Algorithm Formalization
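Finally, in this subsection, we describe our algorithm formally. Recall that in Section II we use $\mathcal{D}$ to denote the collection of all training samples $(x, y)$, and that $\mathcal{N}$ / $\mathcal{P}$ is the majority / minority set in $\mathcal{D}$. We use $\mathcal{D}_{dev}$ to denote the validation set, which is used to measure the generalization ability of the ensemble model. Note that $\mathcal{D}_{dev}$ keeps the original imbalanced distribution with no re-sampling. Moreover, we use $B_\ell$ to denote the $\ell$-th bin, where $B_\ell$ is defined as
$$B_\ell = \left\{ (x, y) \;\middle|\; \frac{\ell - 1}{k} \le H(x, y, F) < \frac{\ell}{k} \right\}, \quad \text{w.l.o.g. } H \in [0, 1].$$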
The details are shown in Algorithm 1. Notice that we update the hardness values in each iteration (lines 4-5) in order to select the data samples that are most beneficial for the current ensemble. A growth function (line 7) controls the self-paced factor $\alpha$, so that $\alpha$ is zero in the first iteration and grows very large by the last iteration.
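A single self-paced under-sampling step, as described in this section, could be sketched as follows. Several details are assumptions made for illustration rather than statements of Algorithm 1: the bin weight is taken as 1 / (average bin hardness + alpha), the growth of alpha follows a tangent-shaped schedule, hardness values lie in [0, 1], and all function names are hypothetical.

```python
import numpy as np


def alpha_schedule(i, n):
    """Assumed schedule: 0 at iteration i = 0, growing rapidly as i approaches n."""
    return np.tan(i * np.pi / (2 * n))


def self_paced_under_sample(hardness, n_keep, k, alpha, rng):
    """Return indices of majority samples selected for the next base learner."""
    h = np.asarray(hardness, dtype=float)
    bin_idx = np.minimum((h * k).astype(int), k - 1)
    counts = np.bincount(bin_idx, minlength=k)
    sums = np.bincount(bin_idx, weights=h, minlength=k)
    avg = np.divide(sums, counts, out=np.zeros(k), where=counts > 0)
    # alpha = 0 reduces to plain harmonizing; a large alpha flattens the
    # weights so that populous (trivial) bins lose their advantage.
    weights = np.where(counts > 0, 1.0 / (avg + alpha + 1e-12), 0.0)
    quotas = np.round(n_keep * weights / weights.sum()).astype(int)
    chosen = []
    for b in range(k):
        members = np.flatnonzero(bin_idx == b)
        take = min(quotas[b], members.size)  # cannot take more than the bin holds
        if take > 0:
            chosen.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(chosen) if chosen else np.empty(0, dtype=int)


rng = np.random.default_rng(0)
h = np.array([0.02] * 900 + [0.5] * 90 + [0.9] * 10)
early = self_paced_under_sample(h, n_keep=100, k=10, alpha=alpha_schedule(0, 10), rng=rng)
late = self_paced_under_sample(h, n_keep=100, k=10, alpha=10.0, rng=rng)
print((h[early] > 0.4).sum(), (h[late] > 0.4).sum())  # more hard samples later
```

With these toy values the early step is dominated by trivial samples, while the late step spends most of its budget on the harder bins and still keeps a share of trivial samples as the skeleton.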
VI Experiments & Analysis
In this section, we present the results of our experimental study on one synthetic and five real-world extremely imbalanced datasets. We test the applicability of our proposed algorithm in combination with different kinds of base classifiers. We also show some visualizations to help understand the difference between our proposed method and other imbalance learning methods. We evaluate the experimental results with multiple criteria and demonstrate the strength of our proposed framework.
VI-A Synthetic Dataset
To provide more insights of our framework, we first show the experimental results on the synthetic dataset. We create a checkerboard dataset to validate our method. The dataset contains Gaussian components. All Gaussian components share the same covariance matrix of . We set the number of minority samples as , and the number of majority as . The training set , validation set and test set were independently sampled from same original distribution. See Fig. 4 for an example.
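A checkerboard dataset of this kind could be generated roughly as follows. This is only a sketch: the grid size, covariance scale and sample counts used below are placeholder assumptions chosen for illustration, not the values used in the experiments, and the function name `make_checkerboard` is hypothetical.

```python
import numpy as np


def make_checkerboard(n_minority, n_majority, grid=4, cov_scale=0.1, rng=None):
    """Sample 2-D points from Gaussian components laid out like a checkerboard."""
    rng = np.random.default_rng() if rng is None else rng
    cells = [(i, j) for i in range(grid) for j in range(grid)]
    # Alternate the class of neighbouring cells, as on a checkerboard.
    minority_centers = np.array([c for c in cells if sum(c) % 2 == 0], dtype=float)
    majority_centers = np.array([c for c in cells if sum(c) % 2 == 1], dtype=float)

    def sample(centers, n):
        # Pick a component uniformly, then add isotropic Gaussian noise
        # with covariance cov_scale * I.
        which = rng.integers(len(centers), size=n)
        noise = rng.normal(scale=np.sqrt(cov_scale), size=(n, 2))
        return centers[which] + noise

    X = np.vstack([sample(minority_centers, n_minority),
                   sample(majority_centers, n_majority)])
    y = np.concatenate([np.ones(n_minority, dtype=int),
                        np.zeros(n_majority, dtype=int)])
    return X, y


X, y = make_checkerboard(100, 1000, rng=np.random.default_rng(0))
print(X.shape, int(y.sum()))  # (1100, 2) 100
```

Calling the function three times with independent generators gives training, validation and test sets drawn from the same distribution.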
VI-A1 Setup Details
In our implementation of SPE, the number of bins $k$ is fixed and absolute error is used as the classification hardness, i.e., $H(x, y, F) = |F(x) - y|$, unless otherwise stated. We compared our proposed method SPE with the following imbalance learning approaches:
RandUnder (Random Under-sampling) randomly under-samples the majority class to get a subset $\mathcal{N}'$ such that $|\mathcal{N}'| = |\mathcal{P}|$. The set $\mathcal{N}' \cup \mathcal{P}$ is then used for training.
Clean (Neighbourhood Cleaning Rule based under-sampling) [laurikkala2001ncr] removes a majority instance if most of its neighbors come from another class.
SMOTE (Synthetic Minority Over-sampling TechniquE) [chawla2002smote] generates synthetic minority instances between existing minority samples until the dataset is balanced.
Easy (EasyEnsemble) [liu2009ee-bc] utilizes RandUnder to train multiple AdaBoost [freund1997adaboost] models and combine their outputs.
Cascade (BalanceCascade) [liu2009ee-bc] extends Easy by iteratively dropping majority examples that are already well classified by the current base classifier.
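As a point of reference for the baselines above, the simplest of them, random under-sampling to a balanced set, can be sketched in a few lines. The function name `random_under_sample` is hypothetical, and the sketch assumes binary labels with 1 as the minority class and at least as many majority as minority samples.

```python
import numpy as np


def random_under_sample(X, y, rng):
    """Keep all minority samples and an equally sized random majority subset."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    kept_majority = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([minority, kept_majority])
    rng.shuffle(idx)  # avoid a class-ordered training set
    return X[idx], y[idx]


X = np.arange(40).reshape(20, 2)
y = np.array([1] * 3 + [0] * 17)
X_bal, y_bal = random_under_sample(X, y, np.random.default_rng(0))
print(len(y_bal), int(y_bal.sum()))  # 6 3
```

Fourteen of the seventeen majority samples are discarded here regardless of how informative they are, which is the information loss discussed for the under-sampling baselines.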
In addition, according to our aforementioned discussion in the Classification Hardness section, by considering the hardness distribution our proposed framework SPE is able to work with any kind of classifiers and optimize the final performance in a model-specific way. Hence, we introduce 8 canonical classifiers in order to test the effectiveness and applicability of different imbalance learning methods:
K-Nearest Neighbors (KNN) [altman1992knn]
Decision Tree (DT) [quinlan1986dt]
Support Vector Machine (SVM) [cortes1995svm]
Multi-Layer Perceptron (MLP) [haykin2009neuralnetworks]
Adaptive Boosting (AdaBoost) [freund1997adaboost]
Bootstrap aggregating (Bagging) [breiman1996bagging]
Random Forest (RandForest) [liaw2002rf]
Gradient Boosting Decision Tree (GBDT) [friedman2002gbdt]
We use the imbalanced-learn [guillaume2017imblearn] package to implement the aforementioned imbalance learning methods, and the scikit-learn [pedregosa2011sklearn], LightGBM [ke2017lightgbm], and PyTorch [paszke2017pytorch] packages to implement the canonical classifiers. We use subscripts to denote the number of base models in an ensemble classifier, e.g., Easy$_{10}$ indicates Easy with 10 base models. Due to space limitations, we only present the results in terms of AUCPRC in this experiment; other metrics are used in the following extensive experiments on real-world datasets.
VI-A2 Results on Synthetic Dataset
Table II lists the results on the checkerboard task. To reduce the effect of randomness, we show the mean and standard deviation of 10 independent runs. We also list the hyper-parameters used for each base classifier. From Table II we can observe that:
SPE consistently outperforms other methods on the checkerboard dataset using 8 different classifiers.
Distance-based re-sampling leads to poor results when combined with specific classifiers, e.g., SMOTE+KNN and Clean+RandForest. We argue that ignoring the difference in model capacity is the main reason these re-sampling methods become ineffective.
Compared with other methods, the ensemble methods Easy and Cascade obtain better and more robust performance, but they are still worse than our proposed ensemble framework SPE.
VI-A3 Robustness under Class Overlapping
Furthermore, we test the robustness of SPE, when the Gaussian components have different levels of overlapping. We control the components overlapping by replacing the original covariance matrix from to and . The distribution is less overlapped when the covariance factor in covariance matrix is smaller, and more overlapped when it is bigger. We keep the size and imbalance ratio to be the same, and sample three different checkerboard datasets respectively. Fig. 5 shows how the AUCPRC (on test set) changes within training process:
The level of distribution overlapping significantly influences the classification performance, even though the size and imbalance ratio of all datasets are exactly the same.
As the overlap increases, the performance of Cascade shows a more obvious downward trend in later iterations. The reason is that Cascade tends to overfit the noise samples, while SPE can alleviate this issue by keeping a reasonable proportion of trivial and borderline samples.
|Dataset||#Attribute||#Sample||Feature Format||Imbalance Ratio||Model|
|Credit Fraud||31||284,807||Numerical||578.88:1||KNN, DT, MLP|
|KDDCUP (DOS vs. PRB)||42||3,924,472||Integer & Categorical||94.48:1||AdaBoost|
|KDDCUP (DOS vs. R2L)||42||3,884,496||Integer & Categorical||3448.82:1||AdaBoost|
|Record Linkage||12||5,749,132||Numerical & Categorical||273.67:1||GBDT|
|Payment Simulation||11||6,362,620||Numerical & Categorical||773.70:1||GBDT|
|Dataset||Model||Metric||RandUnder||Clean||SMOTE||Easy||Cascade||SPE|
|KDDCUP (DOS vs. PRB)||AdaBoost||AUCPRC||0.930±0.012||- - -||- - -||0.995±0.002||1.000±0.000||1.000±0.000|
|KDDCUP (DOS vs. PRB)||AdaBoost||F1||0.962±0.001||- - -||- - -||0.997±0.000||0.999±0.000||0.999±0.000|
|KDDCUP (DOS vs. PRB)||AdaBoost||GM||0.964±0.001||- - -||- - -||0.997±0.001||0.998±0.000||0.999±0.000|
|KDDCUP (DOS vs. PRB)||AdaBoost||MCC||0.956±0.004||- - -||- - -||0.992±0.001||0.993±0.003||0.999±0.000|
|KDDCUP (DOS vs. R2L)||AdaBoost||AUCPRC||0.034±0.005||- - -||- - -||0.108±0.011||0.945±0.005||0.999±0.001|
|KDDCUP (DOS vs. R2L)||AdaBoost||F1||0.050±0.005||- - -||- - -||0.259±0.058||0.965±0.005||0.991±0.003|
|KDDCUP (DOS vs. R2L)||AdaBoost||GM||0.164±0.011||- - -||- - -||0.329±0.015||0.967±0.008||0.988±0.004|
|KDDCUP (DOS vs. R2L)||AdaBoost||MCC||0.175±0.016||- - -||- - -||0.214±0.004||0.905±0.056||0.986±0.004|
|Record Linkage||GBDT||AUCPRC||0.988±0.011||- - -||- - -||0.999±0.000||1.000±0.000||1.000±0.000|
|Record Linkage||GBDT||F1||0.995±0.000||- - -||- - -||0.996±0.000||0.998±0.000||0.998±0.000|
|Record Linkage||GBDT||GM||0.994±0.002||- - -||- - -||0.996±0.000||0.998±0.000||0.998±0.000|
|Record Linkage||GBDT||MCC||0.780±0.000||- - -||- - -||0.884±0.000||0.940±0.000||0.998±0.000|
|Payment Simulation||GBDT||AUCPRC||0.278±0.030||- - -||- - -||0.676±0.058||0.776±0.004||0.944±0.001|
|Payment Simulation||GBDT||F1||0.446±0.030||- - -||- - -||0.709±0.021||0.851±0.003||0.885±0.001|
|Payment Simulation||GBDT||GM||0.530±0.020||- - -||- - -||0.735±0.011||0.851±0.001||0.885±0.001|
|Payment Simulation||GBDT||MCC||0.290±0.023||- - -||- - -||0.722±0.015||0.856±0.002||0.876±0.001|
VI-A4 Intuitive Visualization
We give a visualization in Fig. 6 to show how the aforementioned imbalance learning approaches train / predict on the checkerboard dataset.
As we can see, Clean tries to clean up the majority outliers that are surrounded by minority data points. However, it retains all the trivial samples, so the learning model cannot focus on more informative data. SMOTE over-generalizes the minority class due to indistinguishable overlapping. Easy performs simple random under-sampling, so part of the majority samples are dropped, which causes information loss. Cascade keeps many outliers in late iterations, and those outliers finally lead to bad generalization. By contrast, SPE obtains much more accurate and robust results by considering the classification hardness distribution over the dataset.
VI-B Real-world Datasets
We choose several real-life datasets with highly skewed class distribution to assess the effectiveness of our proposed learning framework on realistic tasks.
Credit Fraud contains transactions made by credit cards in September 2013 by European card-holders [dal2018creditfraud]. The task is to detect frauds from credit card transaction records. It is a highly imbalanced dataset with only 492 frauds out of 284,807 transactions, which brings a high imbalance ratio of 578.88:1.
Payment Simulation is a large-scale dataset with about 6.36 million instances. It simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. Similarly, it has 8,213 frauds out of 6,362,620 transactions and a high imbalance ratio of 773.70:1.
Record Linkage is a dataset of element-wise comparisons of records with personal data from a record linkage setting. The task requires us to decide whether the underlying records belong to one person. The underlying records stem from the epidemiological cancer registry of the German state of North Rhine-Westphalia; the dataset has 5,749,132 record pairs, of which 20,931 are matches.
KDDCUP-99 contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections. It is a multi-class task with 4 main categories of attacks: DOS, R2L, U2R and probing (PRB). We formed 2 two-class imbalanced problems by pairing the majority class (i.e., DOS) with a minority class (i.e., PRB or R2L), namely KDDCUP (DOS vs. PRB) and KDDCUP (DOS vs. R2L).
Table III lists the statistics of each dataset.
|No re-sampling||ORG||0.587±0.001||0.721±0.000||0.632±0.011||0.663±0.026||0.803±0.001||170885||- - -|
|Under-sampling + Ensemble||0.755±0.003||0.767±0.001||0.783±0.015||0.808±0.004||0.849±0.002||63210||0.116|
VI-B1 Setup Details
For each real-world task, we use 60% of the full data as the training set and 20% as the validation set (some classifiers like GBDT need a validation set for early stopping); the remaining 20% is used as the test set. All results in this section were evaluated on the test set in order to measure the classifier’s generalization performance.
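The 60% / 20% / 20% partition can be sketched as a plain random split over sample indices. Whether the original split was stratified by class is not stated, so the sketch below does not stratify, which is an assumption; the function name `split_60_20_20` is hypothetical.

```python
import numpy as np


def split_60_20_20(n_samples, rng):
    """Return disjoint index arrays for the training, validation and test sets."""
    perm = rng.permutation(n_samples)
    n_train = (6 * n_samples) // 10
    n_val = (2 * n_samples) // 10
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]  # the remainder, roughly 20%
    return train, val, test


train, val, test = split_60_20_20(1000, np.random.default_rng(0))
print(len(train), len(val), len(test))  # 600 200 200
```

For highly imbalanced data a non-stratified split can leave very few minority samples in the validation or test set, so a stratified variant may be preferable in practice.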
VI-B2 Results on Real-world Datasets
We first extend the previous experiment on synthetic data to the real-world datasets mentioned above. Table IV lists the experimental results of applying 6 different imbalance learning approaches (i.e., RandUnder, Clean, SMOTE, Easy, Cascade, and SPE) combined with 5 different canonical classification algorithms (i.e., KNN, DT, MLP, AdaBoost, and GBDT) on 5 real-world classification tasks. Due to space limitations, the table lists only the most representative results; additional experimental results on more datasets and classifiers are provided separately. The performance was evaluated by 4 criteria (i.e., AUCPRC, F1-score, G-mean, and MCC) on the test set. To reduce the effect of randomness, we show the mean and standard deviation of 10 independent runs:
SPE demonstrates the best performance on all tested real-world tasks using 5 classifiers over 4 evaluation criteria.
Clean + MLP performs poorly on the Credit Fraud task since Clean only cleans up noise and does not guarantee a balanced dataset. As described above, batch training methods fail when the class distribution is skewed.
Methods based on randomized under-sampling (e.g., RandUnder) train on only a small majority subset. They suffer from severe information loss and potentially high variance when applied to highly imbalanced datasets.
Some results of Clean and SMOTE are missing in Table IV due to the lack of an appropriate distance metric or unacceptable computational cost. Taking the KDDCUP (DOS vs. PRB) dataset as an example, in our experiment Clean needs more than 8 hours to calculate the distances between data samples. Similarly, SMOTE generates millions of synthetic samples that further enlarge the scale of the training set.
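To make three of the evaluation criteria concrete, the sketch below computes F1-score, G-mean, and MCC from a binary confusion matrix (AUCPRC is omitted because it needs ranked scores rather than hard predictions). The function names are illustrative. Two common G-mean conventions exist, so both are shown; which one applies here should be checked against the definition given earlier in the paper.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 is the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_div(a, b):
    """Division that returns 0.0 when the denominator is zero."""
    return a / b if b else 0.0

def summary_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    specificity = safe_div(tn, tn + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, len(y_true)),
        "f1": safe_div(2 * precision * recall, precision + recall),
        "g_mean_recall_precision": math.sqrt(recall * precision),
        "g_mean_recall_specificity": math.sqrt(recall * specificity),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }

# Toy example: 1000 negatives, 5 positives, and a model that always predicts 0.
y_true = [0] * 1000 + [1] * 5
always_negative = [0] * 1005
print(summary_metrics(y_true, always_negative))
```

The always-negative model reaches an accuracy above 99% while F1, G-mean, and MCC are all zero, which is why accuracy is not used as a criterion on such data.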
VI-C Extensive Experiments on Real-world Datasets
We further introduce some other widely used re-sampling and ensemble-based imbalance learning methods for a more comprehensive comparison. By showing supplementary information, e.g., the number of samples used for training and the processing time, we demonstrate the efficiency of different methods on real-world highly imbalanced tasks.
VI-C1 Comparison with Re-sampling Methods
10 more re-sampling based imbalance learning methods were introduced, including 5 under-sampling methods, 3 over-sampling methods and 2 hybrid-sampling methods (see Table V):
NearMiss [mani2003nearmiss] selects samples from the majority class for which the average distance of the k nearest samples of the minority class is the smallest.
ENN (Edited Nearest Neighbor) [wilson1972enn] removes noisy samples from the majority class, i.e., samples whose class differs from that of their nearest neighbors.
TomekLink [tomek1976tomeklink] removes majority samples by detecting “TomekLinks”. A TomekLink exists if two samples of different classes are the nearest neighbors of each other.
AllKNN [tomek1976allknn] extends ENN by repeating the algorithm multiple times; the number of neighbors used by the internal nearest-neighbor algorithm is increased at each iteration.
OSS (One Side Sampling) [kubat1997oss] makes use of TomekLink by running it multiple times to iteratively decide if a sample should be kept in a dataset or not.
RandOver (Random Over-sampling) randomly repeats some minority samples to balance the class distribution.
ADASYN (ADAptive SYNthetic over-sampling) [he2008adasyn] focuses on generating samples next to the original samples which are wrongly classified using a k-nearest neighbors classifier.
BorderSMOTE (Borderline Synthetic Minority Over-sampling TechniquE) [han2005borderline-smote] offers a variant of the SMOTE algorithm, where only the borderline examples could be seeds for over-sampling.
SMOTEENN (SMOTE with Edited Nearest Neighbours cleaning) [batista2004smoteenn] utilizes ENN as the cleaning method after applying SMOTE over-sampling to obtain a cleaner space.
SMOTETomek (SMOTE with Tomek links cleaning) [batista2003smotetomek] uses TomekLink instead of ENN as the cleaning method.
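As an illustration of the distance-based design shared by the methods above, the following sketch detects TomekLinks by brute force, following the definition given in the list (two samples of different classes that are each other's nearest neighbor). It is not the imbalanced-learn implementation; the function names and the toy data are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    candidates = (j for j in range(len(points)) if j != i)
    return min(candidates, key=lambda j: squared_distance(points[i], points[j]))

def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        if i < j and labels[i] != labels[j] and nearest_neighbor(j, points) == i:
            links.append((i, j))
    return links

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
labels = [0, 1, 0, 0, 1]  # 0 = majority, 1 = minority
links = tomek_links(points, labels)
# Under-sampling with TomekLinks drops the majority member of each link.
majority_to_drop = sorted({i if labels[i] == 0 else j for i, j in links})
print(links, majority_to_drop)
```

The brute-force neighbor search compares every pair of samples, so its cost grows quadratically with the dataset size, and it presupposes a meaningful numerical distance between samples.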
As mentioned before, running some of these re-sampling methods on large-scale datasets can be extremely slow. It is also hard to define an appropriate distance metric on a dataset with non-numerical features. With these considerations, we apply all methods on the Credit Fraud dataset. This dataset has 284,807 samples, and only contains normalized numerical features, which enables all distance-based re-sampling methods to achieve their maximum potential. Thus we can fairly compare the pros and cons of different methods.
We combine 5 different classifiers, i.e., Logistic Regression (LR), KNN, DT, AdaBoost, and GBDT, with: ORG, which trains the classifier on the original training set with no re-sampling; 12 re-sampling methods, which train the classifier on the re-sampled training set; and SPE, which uses our proposed method to build an ensemble of the given classifier. We also list the number of examples used for training and the time taken to perform re-sampling for each method. All aforementioned re-sampling methods were implemented using the imbalanced-learn Python package 0.4.3 [guillaume2017imblearn] with Python 3.7.1, and run on an Intel Core i7-7700K CPU with 16 GB RAM. Experimental results are shown in Table V:
SPE significantly boosts the performance of various canonical classification algorithms on this highly imbalanced dataset. Compared with other re-sampling methods, it requires very little training data and a short processing time to achieve this effect.
Most methods can only obtain reasonable results (better than ORG) when combined with specific classifiers. For instance, TomekLink works well with LR, DT, and GBDT but fails to boost the performance of KNN and AdaBoost. The reason is that they perform model-agnostic re-sampling without considering the classifier’s capacity.
On a dataset with such a high imbalance ratio (IR=578.88:1), the minority class is often poorly represented and lacks a clear structure. Therefore, straightforward application of re-sampling, especially over-sampling that relies on relations between minority objects, can actually deteriorate the classification performance; e.g., the advanced over-sampling method SMOTE performs even worse than RandOver and ORG.
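The reliance of over-sampling on relations between minority samples can be seen in a simplified SMOTE-style generator: each synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbors. This is a sketch of the interpolation idea only, not the reference SMOTE implementation; the function name, the default `k`, and the toy points are assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen sample (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Three clustered minority points and one isolated (possibly noisy) point.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
new_points = smote_like_samples(minority_points, n_new=6)
print(new_points)
```

When the isolated point is chosen as the base, the synthetic sample falls somewhere between it and the cluster, a region that may well belong to the majority class. With a poorly structured minority class, such samples can mislead the classifier.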
VI-C2 Comparison with Ensemble Methods
In this experiment, we introduce four other ensemble based imbalance learning approaches for comparison:
RUSBoost [seiffert2010rusboost], which applies RandUnder within each iteration of Adaptive Boosting (AdaBoost) pipeline.
SMOTEBoost [chawla2003smoteboost], which applies SMOTE to generate new synthetic minority samples within each iteration of AdaBoost pipeline.
UnderBagging [barandela2003underbagging], which applies RandUnder to get each bag for Bagging [breiman1996bagging]. Note that the only difference between UnderBagging and Easy is that Easy uses AdaBoost as its base classifier.
SMOTEBagging [wang2009smotebagging], which applies SMOTE to get each bag for Bagging [breiman1996bagging], where each bag’s sample quantity varies.
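The under-sampling side of these ensembles can be sketched by building one balanced bag per base model. This simplified version keeps every minority sample and draws an equally sized random majority subset for each bag; whether the minority class is also bootstrapped varies between implementations, so that choice, like the function name, is an assumption.

```python
import random

def balanced_bags(labels, n_bags, seed=0):
    """Return n_bags index lists, each with all minority and as many random majority samples."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    bags = []
    for _ in range(n_bags):
        sampled_majority = rng.sample(majority, len(minority))
        bags.append(minority + sampled_majority)
    return bags

# Toy dataset: 3 minority samples followed by 30 majority samples.
labels = [1] * 3 + [0] * 30
bags = balanced_bags(labels, n_bags=10)
total_training_samples = sum(len(bag) for bag in bags)
print(len(bags[0]), total_training_samples)
```

Under this scheme the total number of training samples is the number of bags times twice the minority count, regardless of how large the majority class is. Over-sampling variants instead grow each bag toward the size of the majority class.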
Our proposed method was then compared with the 4 methods above and with Cascade, which we used before. These are considered among the most effective and widely used imbalance learning methods in recent reviews [albert02018experiment, haixiang2017overview]. Considering that most of the previous approaches were proposed in combination with C4.5 [barandela2003underbagging, wang2009smotebagging, seiffert2010rusboost], for a fair comparison we also use the C4.5 classifier as the base model in this experiment. Easy was not included here since it is equivalent to UnderBagging when combined with the C4.5 classifier.
Because the number of base models significantly influences the performance of ensemble methods, we test each method with 10, 20 and 50 base models in its ensemble. Note that this comparison is not entirely fair, since over-sampling methods need far more data and resources to train each base model. In consideration of computational cost (SMOTEBoost and SMOTEBagging generate a huge number of synthetic samples on large-scale, highly imbalanced datasets, see Table VI), all ensemble methods were applied to the Credit Fraud dataset, with AUCPRC, F1-score, G-mean, and MCC used for evaluation. For each method, we also list the total number of data samples (# Samples) used for training all base models in the ensemble. Table VI shows the experimental results of the 5 ensemble methods and our proposed method:
Compared with the other 3 under-sampling based ensemble methods, SPE uses the same amount of training data but significantly outperforms them on all 4 evaluation criteria.
Compared with the 2 over-sampling based ensemble methods, SPE demonstrates competitive performance using far less (around 1/300) training data.
Over-sampling based methods are woefully sample-inefficient. They generate a substantial number of synthetic samples under a high imbalance ratio. As a result, they further enlarge the training set and thus need far more computing resources to train each base model. A higher imbalance ratio and a larger dataset make this situation even worse.
We conduct more detailed experiments on Credit Fraud and Payment Simulation datasets, as shown in Fig. 7. We can see that although SPE uses little data for training, it can still obtain a desirable result which is even better than over-sampling based methods. Moreover, on both tasks SPE shows consistent performance in multiple independent runs. Compared to SPE, other methods are less stable and have greater randomness.
VI-C3 Robustness under Missing Values
Finally, we test the robustness of different ensemble methods when there are missing values in the dataset, a common problem in real-world applications. To simulate missing values, we randomly select values from all features in both the training and test datasets and replace them with a meaningless 0. We tested all methods on the Credit Fraud dataset, where 0% / 25% / 50% / 75% of the values are missing.
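The missing-value simulation described above can be sketched as follows: a fixed fraction of all feature cells is chosen uniformly at random and overwritten with 0. The function name and the list-of-rows data layout are assumptions made for illustration.

```python
import random

def mask_with_zeros(rows, missing_ratio, seed=0):
    """Return a copy of rows in which a random fraction of cells is replaced by 0.0."""
    rng = random.Random(seed)
    n_rows, n_cols = len(rows), len(rows[0])
    n_cells = n_rows * n_cols
    n_missing = round(n_cells * missing_ratio)
    chosen_cells = rng.sample(range(n_cells), n_missing)
    masked = [list(row) for row in rows]  # copy so the input stays untouched
    for cell in chosen_cells:
        masked[cell // n_cols][cell % n_cols] = 0.0
    return masked

# Toy feature matrix with 4 rows and 5 features, all equal to 1.0.
features = [[1.0] * 5 for _ in range(4)]
for ratio in (0.0, 0.25, 0.5, 0.75):
    masked = mask_with_zeros(features, ratio)
    n_zeros = sum(value == 0.0 for row in masked for value in row)
    print(ratio, n_zeros)
```

Because the placeholder is an ordinary number, a classifier cannot distinguish a masked cell from a genuine zero, which is part of what makes this setting difficult.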
Results are reported in Table VII. We observe that SPE demonstrates robust performance under different levels of missing values, while other methods perform poorly when the missing ratio is high. We also notice that the tested methods show different sensitivity to missing values. For example, SMOTEBagging obtains better results than SMOTEBoost on the original dataset, but this situation is reversed when the missing ratio is greater than 50%.
VI-C4 Sensitivity to Hyper-parameters
SPE has 3 key hyper-parameters: the number of base classifiers $n$, the number of bins $k$, and the hardness function $\mathcal{H}$. The previous discussion demonstrated the influence of the number of base classifiers ($n$). We now conduct experiments to verify the impact of the number of bins ($k$) and of different choices of the hardness function ($\mathcal{H}$). Specifically, we test on two real-world tasks with $k$ ranging from 1 to 50, in combination with 3 different hardness functions: Absolute Error (AE), Squared Error (SE) and Cross Entropy (CE).
The results in Fig. 8 show that our method is robust to different selections of $k$ and $\mathcal{H}$. Note that $k$ determines how detailed our hardness distribution approximation is; thus setting $k$ too small may lead to poor performance.
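The three hardness functions and the role of the number of bins can be illustrated with a short sketch. The formulas below are the standard definitions of absolute error, squared error, and binary cross entropy between a predicted positive-class probability and a 0/1 label. The equal-width binning over the observed hardness range and all function names are assumptions made for illustration, not necessarily the exact binning used by SPE.

```python
import math

def absolute_error(prob, label):
    return abs(prob - label)

def squared_error(prob, label):
    return (prob - label) ** 2

def cross_entropy(prob, label, eps=1e-12):
    prob = min(max(prob, eps), 1 - eps)  # clip to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def bin_counts(hardness, n_bins):
    """Histogram of hardness values over n_bins equal-width bins spanning [min, max]."""
    low, high = min(hardness), max(hardness)
    width = (high - low) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for value in hardness:
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Predicted positive-class probabilities for six majority samples (label 0).
majority_probs = [0.01, 0.02, 0.05, 0.1, 0.4, 0.9]
hardness = [absolute_error(p, 0) for p in majority_probs]
print(bin_counts(hardness, 1))
print(bin_counts(hardness, 3))
```

With a single bin, every majority sample falls into the same bucket, so easy and hard samples cannot be told apart. This matches the remark above that a very small number of bins gives too coarse an approximation of the hardness distribution.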
VII Related Work
Imbalanced data classification has been a fundamental problem in machine learning [he2008overview, he2013overview]. Many research works have been proposed to solve this problem, and the research field is also known as Imbalance Learning. Recently, Guo et al. provided a systematic review of existing methods and real-world applications in the field of imbalance learning [haixiang2017overview].
Most of the proposed works employ distance-based methods to obtain re-sampled data for training canonical classifiers [laurikkala2001ncr, tomek1976tomeklink, chawla2002smote, he2008adasyn]. Building on them, many works combine re-sampling with ensemble learning [seiffert2010rusboost, chawla2003smoteboost, barandela2003underbagging, wang2009smotebagging]. Such strategies have proven to be very effective [albert02018experiment]. However, distance-based methods have several deficiencies. First, it is hard to define a distance on a real-world dataset, especially when it contains categorical features or missing values. Second, the cost of computing distances between samples can be huge on large-scale datasets. Moreover, even though distance-based methods have been successfully used for re-sampling, they do not guarantee desirable performance for different classifiers due to their model-agnostic designs.
Some other methods assign different weights to samples rather than re-sampling the whole dataset [elkan2001cost-sensitive, liu2006cost-sensitive-imbalance]. They require assistance from domain experts and may fail when combined with batch training methods (e.g., neural networks). We prefer not to include such methods in this paper because previous experiments [liu2006cost-sensitive-imbalance] have shown that setting arbitrary costs without domain knowledge does not allow them to achieve their maximum potential.
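For readers unfamiliar with sample weighting, the sketch below shows one common domain-agnostic heuristic: weighting each class inversely to its frequency. It is not the method of the works cited above, whose costs are meant to come from domain knowledge; the function name and toy labels are assumptions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy label vector: 990 majority samples and 10 minority samples.
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
sample_weights = [weights[y] for y in labels]
print(weights)
```

Under this heuristic both classes contribute the same total weight to the loss. Whether such an arbitrary cost setting reflects the true cost of a misclassification is exactly the concern raised in the paragraph above.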
There are some works in other domains (e.g., Active Learning [settles2009active-learning], Self-paced Learning [kumar2010spl]) that adopt the idea of selecting “informative” samples but focus on completely different problems. Specifically, an active learner interactively queries the user to obtain the labels of new data points, while a self-paced learning algorithm tries to present the training data in a meaningful order that facilitates learning. However, they perform the sampling without considering the overall data distribution, so their fine-tuning process can be easily disturbed when the training set is imbalanced. In comparison, SPE applies an under-sampling + ensemble strategy to balance the dataset, making it applicable to any canonical classifier. By considering the dynamic hardness distribution over the whole dataset, SPE performs adaptive and robust under-sampling rather than blindly selecting “informative” data samples.
To summarize, traditional distance-based re-sampling methods ignore differences in model capacity and thus may lead to poor performance when combined with specific classifiers. They also require additional computation to calculate distances between samples, making them computationally inefficient, especially on large datasets. Moreover, it is often difficult to determine a clear distance metric in practice, as real-world datasets may contain categorical features and missing values. Most ensemble-based methods integrate such distance-based re-sampling into their pipelines and are thus still negatively affected by the above factors. Compared with existing works, SPE does not require any pre-defined distance metric or distance computation, making it easier to apply and more computationally efficient. By harmonizing the hardness distribution in a self-paced manner w.r.t. the given classifier, SPE is adaptive to different models and robust to noise and missing values.
In this paper we have described the problem of highly imbalanced, large-scale and noisy data classification that widely exists in real-world applications. Under such a scenario, we have demonstrated that canonical machine learning / imbalance learning approaches suffer from unsatisfactory results and low computational efficiency.
Self-paced Ensemble, a novel learning framework for massive imbalance classification, has been proposed in this paper. We argue that all of the difficulties (high imbalance ratio, overlap between classes, and presence of noise) are critical for massive imbalance classification. Hence, we have introduced the concept of classification hardness distribution to integrate the information of these difficulties into our learning framework. We conducted extensive experiments on a variety of challenging real-world tasks. Compared with other methods, our framework has better performance, wider applicability, and higher computational efficiency. Overall, we believe that we have provided a new paradigm for integrating task difficulties into the imbalance classification system. Various real-world applications can benefit from our framework.