1 Introduction
We are interested in the common setting in which one has a set of training data $\mathcal{D}_n=\{(X_i,Y_i)\}_{i=1}^n$, with $X_i\in\Omega\subset\mathbb{R}^d$ a vector of features and $Y_i\in\{0,\ldots,K-1\}$ a class label. The goal is to estimate a classifier $f:\Omega\to\{0,\ldots,K-1\}$, which outputs a class label given an input feature vector. This involves partitioning the feature space $\Omega$ into subsets having different class labels, as illustrated in Figure 1. Classification trees are particularly popular due to their combined flexibility and interpretability. Competitors like deep neural networks (DNN; Schmidhuber 2015) often have higher accuracy in prediction, but are essentially uninterpretable black boxes.
Our focus is on the case in which $n_j=\sum_{i=1}^n \mathbb{1}(Y_i=j)$ is small, for one or more $j\in\{0,\ldots,K-1\}$. For simplicity in exposition, we assume $K=2$, so that there are only two classes. Without loss of generality, suppose that $j=0$ is the majority class, $j=1$ is the minority class, and call the set $\{x\in\Omega: f(x)=1\}$ the decision set (of the minority class). Hence, $n_1$ is relatively small compared to $n_0$ by this convention.
To illustrate the problems that can arise from this imbalance, we consider a toy example. Let the sample space of be . Let the conditional distribution of given be , . We generate a training data set with minority samples and majority samples. The estimated decision sets of unpruned CART and optimally pruned CART (Breiman et al., 1984) are shown in Figure (a) and (b), respectively. Minority samples are given weight 32 while majority samples are given weight 1 to naively address the imbalance. Both decision sets are inaccurate, with unpruned CART overfitting and pruned CART also poor.
Imbalanced data problems have drawn substantial interest; see Haixiang et al. (2017), He and Garcia (2008), Krawczyk (2016) and Fernández et al. (2017) for reviews. Some early work relied on random under- or oversampling, which is essentially equivalent to modifying the weights and cannot address the key problem. Chawla et al. (2002) proposed SMOTE, which instead creates synthetic samples: for each minority class sample, synthetic samples are generated along the line segments that join it with its $k$ nearest neighbors in the minority class. Building on this idea, many other synthetic sampling methods have been proposed, including ADASYN (He et al., 2008), Borderline-SMOTE (Han et al., 2005), SPIDER (Stefanowski and Wilk, 2008), Safe-Level-SMOTE (Bunkhumpornpat et al., 2009) and WGAN-based sampling (Wang et al., 2019). These synthetic sampling methods have been demonstrated to be relatively effective.
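The interpolation step of SMOTE described above can be sketched as follows. This is only a minimal illustration, assuming NumPy, Euclidean distance and at least two minority samples; the function name `smote_sketch` and its parameters are hypothetical and are not taken from Chawla et al. (2002) or from any library.

```python
import numpy as np

def smote_sketch(minority, k=5, n_new_per_sample=1, seed=0):
    """Create synthetic points on segments joining each minority sample
    to one of its k nearest minority-class neighbors (illustrative only)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # cannot use more neighbors than exist
    # Pairwise squared Euclidean distances within the minority class.
    diffs = minority[:, None, :] - minority[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist2, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist2, axis=1)[:, :k]
    synthetic = []
    for i in range(n):
        for _ in range(n_new_per_sample):
            j = rng.choice(neighbors[i])   # pick one of the k neighbors
            u = rng.random()               # position along the segment
            synthetic.append(minority[i] + u * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two existing minority samples, the number of synthetic points needed to cover a neighborhood grows with the dimension of the feature space.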
However, current understanding of synthetic sampling is inadequate. Chawla et al. (2002) motivate SMOTE as designed to “create large and less specific decision regions”, “rather than smaller and more specific regions”. Later papers fail to improve upon this heuristic justification. Practically, the advantage of synthetic sampling over random oversampling diminishes as the dimension $d$ of the feature space increases. In general, for each minority sample, we require at least $d$
synthetic samples to fully describe its neighborhood. This is often infeasible due to the sample size of the majority class and to computational complexity. Hence, it is typical to fix the number of synthetic samples regardless of the dimension of the feature space (Chawla et al., 2002), which may fail to “create large and less specific decision regions” when the dimension is high.
Motivated by these issues, we propose to directly penalize the Surface-to-Volume Ratio (SVR) of the decision set. A primary issue with imbalanced data is estimating a decision set consisting of small neighborhoods around each minority class sample. By penalizing SVR we favor regularly shaped decision sets that are much less subject to such overfitting. With this motivation, we propose a new class of SVR-Tree algorithms.
The rest of the paper is organized as follows. Section 2 describes our methodology. Section 3 provides theory on estimation and feature selection consistency. Section 4 presents numerical studies for real datasets. Section 5 contains a discussion, and proofs are included in an Appendix.
2 Methodology
We first introduce the definition of surface-to-volume ratio (SVR) and tree impurity, and then define SVR-Tree as the minimizer of a weighted average of tree impurity and SVR. We then state the algorithm to estimate this tree from training data. We assume readers have basic knowledge of tree-based classifiers like CART (Breiman et al., 1984) and C4.5 (Quinlan, 2014). In the rest of the paper, the word “tree” refers specifically to classification trees that specify a class label associated with each leaf node.
2.1 Notations
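Training data are denoted as $\mathcal{D}_n=\{(X_i,Y_i)\}_{i=1}^n$, with $X_i\in\Omega\subset\mathbb{R}^d$ a vector of features and $Y_i\in\{0,1\}$ a class label. Uppercase letters $X,Y$ denote random variables, while lowercase $x,y$ denote specific values. We denote the $j$th feature of $X$ as $X[j]$. The true distribution of $(X,Y)$ is denoted by $\mathbb{P}$, while the empirical distribution, which assigns mass $1/n$ to each training data point, is denoted by $\mathbb{P}_n$. For a constant $\alpha\geq 1$, let $\mathbb{P}_\alpha$ modify $\mathbb{P}$ to up-weight minority class samples by $\alpha$. That is, for any subset $A\subset\Omega$,
$$\mathbb{P}_\alpha(A\times\{1\})=\frac{\alpha\,\mathbb{P}(A\times\{1\})}{\alpha\,\mathbb{P}(\Omega\times\{1\})+\mathbb{P}(\Omega\times\{0\})},\qquad \mathbb{P}_\alpha(A\times\{0\})=\frac{\mathbb{P}(A\times\{0\})}{\alpha\,\mathbb{P}(\Omega\times\{1\})+\mathbb{P}(\Omega\times\{0\})}.$$
Similarly, let $\mathbb{P}_{n,\alpha}$ be the weighted version of $\mathbb{P}_n$. To avoid complex subscripts, with some abuse of notation, we also use $\mathbb{P},\mathbb{P}_n,\mathbb{P}_\alpha,\mathbb{P}_{n,\alpha}$ to denote the corresponding marginal probability measures on $\Omega$. For example, $\mathbb{P}(A)=\mathbb{P}(A\times\{0,1\})$ for $A\subset\Omega$. Whether they refer to the joint distribution of $(X,Y)$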
or the marginal distribution of $X$ should be clear from the context. When discussing the probability of certain events that include random variables , we simply use to represent the probability measure in the product space. For example, for , denotes the probability of the event . We use for expectations over and for expectations over .
2.2 Surface-to-Volume Ratio
For all $d\in\mathbb{N}$, define a $d$-dimensional measure space as $(\mathbb{R}^d,\mathcal{B}_d,\mu_d)$, where $\mathcal{B}_d$ is the collection of Borel sets, and $\mu_d$ is Lebesgue measure. For any measurable closed set $A\subset\mathbb{R}^d$, we define its volume as the Lebesgue measure of $A$: $V(A)=\mu_d(A)$. We define the surface as the $(d-1)$-dimensional Lebesgue measure of the boundary of $A$: $S(A)=\mu_{d-1}(\partial A)$. For any set $A$ with $0<V(A)<\infty$, the surface-to-volume ratio (SVR) can be obtained as $r(A)=S(A)/V(A)$. For sets with the same volume, the $d$-dimensional ball has the smallest SVR, while sets having multiple disconnected subsets and/or irregular boundaries have relatively high SVR.
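To make the last claim concrete, the following sketch compares the SVR of a ball, a single cube, and two disjoint cubes of the same total volume, using the standard closed-form expressions for volume and surface. The function names are illustrative only and do not come from the paper.

```python
import math

def svr_ball(volume, d):
    """SVR of a d-dimensional ball with the given volume (equals d / radius)."""
    unit_ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    radius = (volume / unit_ball_volume) ** (1 / d)
    return d / radius

def svr_cube(volume, d):
    """SVR of a d-dimensional cube: surface 2*d*s^(d-1) over volume s^d."""
    side = volume ** (1 / d)
    return 2 * d / side

def svr_two_cubes(volume, d):
    """SVR of two disjoint cubes that each hold half of the total volume."""
    side = (volume / 2) ** (1 / d)
    surface = 2 * (2 * d * side ** (d - 1))
    return surface / volume

for d in (2, 3, 10):
    print(d, svr_ball(1.0, d), svr_cube(1.0, d), svr_two_cubes(1.0, d))
```

For a fixed volume the ball has the smallest ratio, and splitting the same volume into disconnected pieces raises it, which is the behavior the penalty is meant to discourage.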
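Surface-to-Volume Ratio of a Classification Tree
For training data $\mathcal{D}_n$, we define a closed bounded sample space $\Omega$ such that the support of $X$ is a subset of $\Omega$. A classification tree $T$ divides the sample space $\Omega$ into two disjoint subsets $\Omega_0$, $\Omega_1$, where $\Omega_0\cup\Omega_1=\Omega$. The tree predicts that a new sample $x$ belongs to class 0 if $x\in\Omega_0$, and to class 1 if $x\in\Omega_1$. The surface-to-volume ratio of a classification tree is defined as the surface-to-volume ratio of the set $\Omega_1$: $r(T)=S(\Omega_1)/V(\Omega_1)$, with the surface $S$ and volume $V$ as defined above.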
2.3 Impurity Function and Tree Impurity
A classification tree partitions the sample space into multiple leaf nodes, assigning one class label to each leaf node. The tree should be built to maximize homogeneity of the training sample class labels within nodes. This motivates the following definition.
Definition 1 (Impurity, Definition 2.5 of Breiman et al. 1984).
An impurity function is defined on the set of pairs satisfying , with the properties (i) achieves its maximum only at ; (ii) achieves its minimum only at ; (iii) is symmetrical in , i.e., .
Let and represent the probabilities of belonging to the majority and minority class, respectively, within some branch of the tree. Ideally, splits of the tree are chosen so that, after splitting, and move closer to 0 or 1 and further from 1/2. Many different tree building algorithms use impurity to measure the quality of a split; for example, CART uses Gini impurity and C4.5 uses entropy, which is effectively a type of impurity measure.
When data are imbalanced, it is important to modify the definition of impurity to account for the fact that is much smaller than . With this in mind, we propose the following weighted impurity function.
Definition 2 (Weighted Impurity).
Letting be an impurity function, a weighted impurity function with weight for the minority class is defined as
In the remainder of the paper, the term ‘impurity function’ refers to in Definition 2 with and the function refers to the Gini impurity. That is,
Let be the leaf nodes of a classification tree and let be the predictive class label for node , for . Let be a probability measure over . Then the impurity of leaf node is . The impurity of node measures the class homogeneity in , but does not depend on the predictive class label . Let denote the dominant class label in under weight . We define a signed impurity taking into account as
If the predictive class label matches the dominant class label in node , the signed impurity of node is equal to the impurity of node and is no greater than . Otherwise, an extra penalty is applied. Taking a weighted average of the signed impurities across the leaf nodes, one obtains the tree impurity and signed tree impurity.
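As a small illustration of a weighted impurity, the sketch below applies the two-class Gini impurity after up-weighting the minority proportion. The specific weighting used here (multiply the minority proportion by `alpha`, then renormalize) is an assumption made for illustration, since the displayed formula of Definition 2 is not reproduced in this text; the function names are hypothetical.

```python
def gini(p0, p1):
    """Two-class Gini impurity for proportions with p0 + p1 = 1."""
    return 1.0 - p0 ** 2 - p1 ** 2

def weighted_gini(p0, p1, alpha):
    """Gini impurity after up-weighting the minority proportion p1 by alpha.

    Assumption: weights are applied to class proportions and renormalized.
    """
    total = p0 + alpha * p1
    return gini(p0 / total, alpha * p1 / total)

# A node with 80% majority samples looks fairly pure without weighting,
# but is maximally impure once the minority class is up-weighted by 4.
print(gini(0.8, 0.2), weighted_gini(0.8, 0.2, alpha=4))
```

The Gini function satisfies the three properties of Definition 1: it is maximal only at equal proportions, minimal only when one proportion is 1, and symmetric in its arguments.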
Definition 3 (Tree Impurity).
Let be a tree and be all the leaf nodes of this tree. Upweighting the minority class by , the tree impurity of is
where is the weighted version of defined in section 2.1.
Definition 4 (Signed Tree Impurity).
Under the notation of Definition 3, the signed tree impurity is
2.4 SVR-Tree Classifiers
The SVR-Tree classifier is the minimizer of the weighted average of signed tree impurity and surface-to-volume ratio. Letting be the collection of possible trees, the SVR-Tree classifier is formally defined as
(1) 
where is a penalty. The unknown probability measure is replaced with the empirical measure that assigns mass to each training sample . Unfortunately, without restricting the space of trees , optimization problem (1) is intractable. In the following subsection, we introduce an iterative greedy search algorithm that limits the size of in each step to efficiently obtain a tree having provably good performance.
2.5 The SVR-Tree Algorithm
The SVR-Tree algorithm is designed to find a nearly optimal SVR-Tree classifier and proceeds in a greedy manner. We begin with the root node. At each step, we operate on one leaf node of the current tree, partitioning it into two new leaf nodes by finding the solution of (1). The node to partition at each step is uniquely specified by a breadth-first searching order. After partitioning, the tree is updated and the node to partition in the next step is specified. The process stops when either further splitting of leaf nodes does not improve the loss or a prespecified maximum number of leaf nodes is reached.
We first describe how to split a current leaf node to improve the solution to (1). Suppose the current tree is and the node to partition is , with training samples. For each feature , sort all samples in by increasing order of the th feature as . We only allow partitions to occur at , the midpoint of two adjacent values of each feature. The total number of different partitions is no more than . After each such split of , we keep all other leaf nodes unchanged while allowing all different class label assignments at the two new daughter nodes of . The current set of trees to choose from in optimizing (1) include the initial and all the partitioned trees described above. The cardinality of is no more than , a linear order of . We compute the risk for all to find the minimizer. If the initial is the risk minimizer, we do not make any partition in this step and mark the node as “complete”. Any node marked as complete will no longer be partitioned in subsequent steps.
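The candidate split locations described above can be enumerated as in the following sketch. It assumes NumPy and treats tied feature values as a single value, which is a simplification chosen here; the function name `candidate_splits` is hypothetical.

```python
import numpy as np

def candidate_splits(X):
    """Map each feature index to the midpoints between adjacent sorted values
    of that feature among the samples in the node (rows of X)."""
    X = np.asarray(X, dtype=float)
    splits = {}
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])  # sorted, with duplicates removed
        splits[j] = (values[:-1] + values[1:]) / 2.0
    return splits

node_samples = [[0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
print(candidate_splits(node_samples))
```

With m samples and d features there are at most d(m - 1) midpoints, so the number of candidate partitions of a node grows linearly in the number of samples it contains.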
It remains to specify the ‘breadth-first’ searching order determining which node to partition in each step. Let the depth of a node be the number of edges to the root node, which has depth 0. Breadth-first algorithms explore all nodes at the current depth before moving on (Cormen et al. 2009, chapter 22.2). To keep track of changes in the tree, we use a queue. (A queue is a dynamic set in which the elements are kept in order and the principal operations on the set are the insertion of elements at the tail, known as enqueue, and deletion of elements from the head, known as dequeue; see chapter 10.1 of Cormen et al. (2009) for details.) We begin our algorithm with a queue whose only element is the root node. At each step, we remove the node at the front of the queue and partition this node as described in the previous paragraph. If a partitioned tree is accepted as the risk minimizer over the current set of candidate trees, we enqueue the two new leaf nodes; otherwise, the unpartitioned tree is the risk minimizer, so we do not enqueue any new node. The nodes at the front of the queue have the lowest depth. Therefore, our algorithm naturally follows a breadth-first searching order. We preset a maximal number of leaf nodes. The process is stopped when either the queue is empty, in which case all the leaf nodes are marked as complete, or the number of leaf nodes reaches this maximum.
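The queue-driven control flow described above can be sketched as follows. This is not Algorithm 1 itself: the risk minimization over candidate trees is hidden behind a hypothetical callback `try_split`, which is assumed to return a pair of child nodes when some split lowers the risk and `None` when leaving the node unsplit is best.

```python
from collections import deque

def grow_breadth_first(root, try_split, max_leaves):
    """Grow a tree breadth-first and return its list of leaf nodes.

    try_split(node) is a hypothetical callback: it returns (left, right)
    if a split is accepted, or None if the node should stay a leaf.
    """
    queue = deque([root])
    leaves = [root]
    while queue and len(leaves) < max_leaves:
        node = queue.popleft()        # dequeue from the head: shallowest node
        children = try_split(node)
        if children is None:
            continue                  # node is "complete" and never revisited
        left, right = children
        leaves.remove(node)
        leaves.extend([left, right])
        queue.extend([left, right])   # enqueue the new leaves at the tail
    return leaves

# Toy usage: nodes are intervals, split in half while wider than 0.25.
def halve(node):
    lo, hi = node
    if hi - lo <= 0.25:
        return None
    mid = (lo + hi) / 2
    return (lo, mid), (mid, hi)

print(grow_breadth_first((0.0, 1.0), halve, max_leaves=10))
```

Because children are appended at the tail, every node at one depth is processed before any node at the next depth, which gives the coarse-to-fine behavior of the procedure.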
Our SVR-Tree algorithm has a coarse-to-fine tree building style, tending to first partition the sample space into larger pieces belonging to two different classes, followed by modifications to the surface of the decision set to decrease tree impurity and SVR. The steps are sketched in Algorithm 1, where feature selection steps are marked as optional. A more detailed and rigorous version is in the last section of the supplemental material.
Optional Step for Feature Selection
It is likely that some features will not help to predict class labels. These features should be excluded from our estimated tree. Under some mild conditions, a split on a redundant feature has minimal impact on the tree impurity compared to a partition in a nonredundant feature. Thus feature selection can be achieved by thresholding. Suppose we are partitioning node into two new leaf nodes . Then the (unsigned) tree impurity decrease after this partition is defined as:
Let be the indices of features that have been partitioned in previous tree building steps. Given that we are partitioning on node , let be the maximal tree impurity decrease over all partitions in feature . Then a partition in a new feature , with tree impurity decrease , is accepted if
(2) 
where is a constant independent of the training data. By equation (2), a partition on a new feature is accepted if its tree impurity decrease is greater than , the tree impurity decrease over all previously partitioned features, plus a threshold term . Theoretical support for this thresholding approach is provided in Section 3.
2.6 Computational Complexity
Recall the number of training samples is and the number of features is . Denote the depth of the estimated tree as and let the maximal number of leaf nodes be .^{2}^{2}2In our experiments, we set .
The storage complexity is trivial and is the same as for usual decision trees. We analyze the time complexity of Algorithm 1 in this section. We first introduce the approach to compute the surface-to-volume ratio, then discuss the time complexity of building the tree.
Computing SVR
Suppose in some intermediate state of building the SVR Tree, the current tree has leaf nodes . We already know the surface and volume of the current tree. Now we need to perform a partition at node , which has samples. Suppose we partition at and obtain two child nodes. The volume of the tree after this partition can be computed by adding the volume of the minority class child node(s), which takes time. The major concern lies in the computation of the surface area. If both child nodes are in the majority class or the minority class, the surface after partitioning is the surface that node is labeled as the majority class or the minority class. It takes time to compute the surface of , and time to compute all the overlapping surface between and . Therefore, the time complexity to compute the surface of the tree after partitioning is .
If one child node belongs to the minority class and the other belongs to the majority class, the problem becomes more complicated. Let be the surface of the partitioned tree if is partitioned at and the left child is labeled as and the right child is labeled as . is similarly defined. Both and are piecewise constant functions whose change points can only exist at borders of . Therefore, we first compute the analytical forms of and for all . This requires us to find all the borders of , to compute all the overlapping surface between and and to compute the surface of itself. The process of computing the analytical forms of and for all , takes time. The cost of evaluating and at a specific value can takes as much as time; but if the samples are presorted for all features, it only takes time to evaluate and , , at all the possible partition locations of . Therefore, it takes time to compute surface area for all the possible partition locations and class label assignments. Similarly, the costs of computing volume for all possible partition locations is . Thus for all the possible partition locations and class label assignments at , the total cost of computing SVR is .
Time Complexity of Algorithm 1
Before working on any partitions, we first need to sort the whole dataset for all features, taking time. For a node with samples and leaf nodes in the current tree, there are possible partition locations. It takes time to compute signed impurity for all possible class label assignments and all partition locations; time to compute surfacetovolume ratio for all possible class label assignments and all partition locations; time to find the best partition when all the signed impurities and SVR are already computed. The overall time complexity of finding the best partition at this node is . Let be the number of leaf nodes of the tree output by Algorithm 1, with . The time complexity is
where we use the fact that . This shows the efficiency of Algorithm 1.
3 Theoretical Properties
In this section, we discuss the consistency of the classification tree obtained from Algorithm 1 in two respects. The first is estimation consistency: as the sample size goes to infinity, a generalized distance between SVR-Tree and the oracle classifier converges to zero. The second is feature selection consistency: for redundant features that are conditionally independent of the class label given the other features, the probability of SVR-Tree excluding these features converges to one.
3.1 Estimation Consistency
We define a generalized metric on the space of all nonrandom classifiers, introduce classifier risk, and define our notions of classifier consistency.
Definition 5.
For any weight and classifier , we define the risk as
where the expectation is taken over probability measure for random variables .
Let be the collection of all measurable functions from to . Then the oracle classifier and minimal risk are defined as and respectively. Without loss of generality, we assume . Then the oracle classifier is unique almost surely. We now define the consistency of a sequence of classifiers based on L1 distance from the oracle classifier.
Definition 6.
A sequence of classifiers is consistent if
Denote the tree obtained from Algorithm 1 as and the classifier associated with as . Theorem 1 shows is consistent under mild conditions.
Theorem 1 (Estimation consistency).
Let be absolutely continuous with respect to Lebesgue measure on . Assume and Let be a constant no smaller than one. Then the classifier obtained from Algorithm 1 is consistent.
3.2 Feature Selection Consistency
Let denote the set of all features except . We say is redundant if conditionally on , is independent of . We denote as the collection of all nonredundant features, and as the collection of all redundant features. Under two conditions on the distribution of and , we can show if goes to zero slower than , the probability of excluding all redundant features goes to one. We assume there are nonredundant features denoted as . The redundant features are denoted as .
Before stating these two conditions, we need to discuss how a partition can decrease the tree impurity. Suppose we are partitioning on node at feature , resulting in two new leaf nodes: and . Then the weighted conditional expectation of in node is
(3) 
Thus the impurity of is and Similarly, denoting the weighted conditional expectation of in node as , the impurity of is . The impurity decrease on node is
(4) 
Noting is determined by , can be considered as a function of . Denoting , then the maximal impurity decrease at feature is
(5) 
The quality of in reducing the impurity of node is measured relative to the oracle impurity decrease at node . Suppose we partition node into measurable sets and satisfying for all and where and are not required to be hyperrectangles. By definition, the set corresponds to the proportion of having the smallest values, while the set corresponds to the proportion of having the largest values. Similar to equations (3) and (4), we can define and the impurity decrease . Given , the impurity decrease is unique, which is also the maximal impurity decrease for all measurable satisfying . Thus we can denote the impurity decrease as . The oracle impurity decrease is the maximal value of :
In general, larger impurity decreases tend to correspond to nonredundant features; this is particularly the case under Conditions 1 and 2.
Condition 1.
There exists , such that for all ,
Condition 1 relates the impurity decrease in nonredundant features to the oracle impurity decrease. The strength of the condition is dependent on the value of . If , the condition does not impose any restrictions; if , partitions in nonredundant features can fully explain the oracle impurity decrease. Here we give some examples of models with different values.
Example 1 (Generalized Linear Models).
Let and the marginal distribution of
be the uniform distribution on
. Suppose where is a monotonically increasing function, and . Let be measurable sets having and , which achieve the oracle impurity decrease , and let . Then the constant in Condition 1 satisfies if and if , where is the cumulative distribution function for the Irwin-Hall distribution with parameter .
Example 2.
Let and be the uniform distribution on . Let and if and if Further assume . Then .
Condition 2.
The weighted probability measure has density in . Moreover, where for all , with a constant.
Condition 2 asserts that the joint density of can be decomposed into an independent component plus a dependent component, where the dependent component is dominated by the independent component up to a constant. This condition controls the dependence between and . Since given , is independent of , this condition will also control the dependence between and . The strength of the condition depends on the constant . When , the condition imposes no restrictions; when , the condition asserts and are completely independent.
Theorem 2 (Feature selection consistency).
In Theorem 2 conditions 1 and 2 are complementary. If is smaller (i.e., condition 2 is stronger), can be smaller (i.e., condition 1 is weaker). The opposite also holds. The following two Corollaries cover special cases.
Corollary 1.
If the optional steps in Algorithm 1 are enabled, condition 1 is satisfied with (that is, in each hyperrectangle, the maximal impurity decrease at nonredundant features is equal to the oracle impurity decrease) and for some constant , , we have
Corollary 2.
If the optional steps in Algorithm 1 are enabled, and are independent and for some constant , , we have
4 Numerical Studies
We compare SVR-Tree with popular imbalanced classification methods on real datasets, adding redundant features to these datasets in evaluating feature selection. A confusion matrix (Table 1) is often used to assess classification performance. A common criterion for accuracy is $\text{Accuracy}=(TP+TN)/(TP+TN+FP+FN)$. When 1s are relatively rare, it is often important to give higher priority to true positives, which is accomplished using the true positive rate (recall) $\text{TPR}=TP/(TP+FN)$ and precision $\text{Precision}=TP/(TP+FP)$. To combine these, the F-measure is often used: $\text{F-measure}=2\cdot\text{Precision}\cdot\text{TPR}/(\text{Precision}+\text{TPR})$.
Table 1: Confusion matrix.
Predicted Label  True Label 1  True Label 0
1  True Positive (TP)  False Positive (FP)
0  False Negative (FN)  True Negative (TN)
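The four measures named above follow directly from the entries of the confusion matrix. The sketch below uses the standard definitions and assumes all denominators are nonzero; the function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, TPR (recall), precision and F-measure from confusion counts.

    Assumes tp + fn > 0 and tp + fp > 0 so that no division by zero occurs.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    tpr = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {"accuracy": accuracy, "tpr": tpr,
            "precision": precision, "f_measure": f_measure}

# Counts may be averages over repeated runs, so floats are accepted.
print(classification_metrics(tp=8, fp=2, fn=2, tn=88))
```

With a rare positive class, accuracy can stay high even when many positives are missed, which is why TPR, precision and the F-measure are reported alongside it.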
4.1 Datasets
We test our method on 5 datasets from the UCI Machine Learning Repository (Dua and Graff, 2017), varying in size, number of features and level of imbalance.
Glass dataset
https://archive.ics.uci.edu/ml/datasets/Glass+Identification consists of 213 samples and 9 features. The objective is to classify samples into one of seven types of glass. We choose class 7 (headlamps) as the minority class and classes 1-6 as the majority class, yielding 29 minority class samples.
Vehicle dataset
(Siebert, 1987) https://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes) consists of 846 samples and 18 features. The aim is to classify a silhouette into one of four types of vehicles. As in He et al. (2008), we choose class “Van” as the minority class and the other three types of vehicles as the majority class, resulting in 199 minority class samples.
Abalone dataset
https://archive.ics.uci.edu/ml/datasets/Abalone aims to predict the age of abalone by physical measurements. As in He et al. (2008), we let class “18” be the minority class and class “9” be the majority class. This yields 731 samples in total, among which 42 samples belong to the minority class. We also remove the discrete feature “sex”, which gives us 9 features.
Satimage dataset
https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite) consists of a training set and a test set; combined, they contain 6435 samples with 36 features. As in Chawla et al. (2002), we choose class “4” as the minority class and collapse all other classes into a single majority class, resulting in 626 minority class samples.
Wine dataset
(Cortez et al., 2009) https://archive.ics.uci.edu/ml/datasets/Wine+Quality collects information on wine quality. We focus on the red wine subset, which has 1599 samples and 11 features. We let the minority class be samples having quality $\geq 7$, while the majority class has quality $<7$. This generates 217 minority class samples.
4.2 Experimental Setup
We test the performance of SVR-Tree, SVR-Tree with feature selection, CART (Breiman et al., 1984) with duplicated oversampling, CART with SMOTE (Chawla et al., 2002), CART with Borderline-SMOTE (Han et al., 2005) and CART with ADASYN (He et al., 2008) on all five datasets. Features are linearly transformed so that samples lie in $[0,1]^d$.
For each method and dataset, we run the algorithm 50 times. For each run, we randomly choose 2/3 of the samples for training and the remaining 1/3 for testing. The average values of TP, TN, FP, FN on the testing sets are used to compute the accuracy, TPR, precision and F-measure. The specific settings for each method are discussed below.
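The repeated random-split protocol described above can be sketched as follows. The classifier is abstracted as a hypothetical callback `fit_predict(X_train, y_train, X_test)` returning 0/1 predictions; the function name, the seed handling and the rounding of the training-set size are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def repeated_holdout(X, y, fit_predict, n_runs=50, train_frac=2 / 3, seed=0):
    """Average TP, FP, FN, TN on the test part over repeated random splits."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    n_train = int(round(train_frac * n))
    totals = np.zeros(4)  # running sums of TP, FP, FN, TN
    for _ in range(n_runs):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        pred = np.asarray(fit_predict(X[train], y[train], X[test]))
        truth = y[test]
        totals += [np.sum((pred == 1) & (truth == 1)),
                   np.sum((pred == 1) & (truth == 0)),
                   np.sum((pred == 0) & (truth == 1)),
                   np.sum((pred == 0) & (truth == 0))]
    tp, fp, fn, tn = totals / n_runs
    return tp, fp, fn, tn
```

Averaging the confusion counts first and computing the measures afterwards, as the text describes, avoids undefined precision values in individual runs where no positives are predicted.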
SVR-Tree with and without feature selection are described in Algorithm 1. The weight for the minority class, , is set to be the largest integer that makes the total weight of the minority class no greater than the total weight of the majority class; if is greater than , it is truncated to be . The maximal number of leaf nodes is . The penalty parameter for SVR is chosen from a geometric sequence in ; the parameter with the highest F-measure over the 50 runs is selected. For SVR-Tree with feature selection, the constant in equation (2) is fixed to ; in practice, the results are insensitive to .
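The rule for the minority weight stated above amounts to an integer division followed by truncation. In the sketch below, `cap` is a hypothetical parameter standing in for the truncation value, which is not reproduced in this text, and the lower bound of 1 is an added safeguard; the class counts in the usage line are the Glass counts from Section 4.1.

```python
def minority_weight(n_majority, n_minority, cap=None):
    """Largest integer weight w such that w * n_minority <= n_majority,
    optionally truncated at `cap` (hypothetical truncation value)."""
    weight = max(1, n_majority // n_minority)  # never down-weight the minority
    if cap is not None:
        weight = min(weight, cap)
    return weight

# Glass: 213 samples in total, 29 of them in the minority class.
print(minority_weight(213 - 29, 29))
```

Choosing the weight this way makes the two classes contribute roughly equal total weight without letting the minority class outweigh the majority class.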
For the other methods, we first oversample the minority class, such that the number of minority samples is multiplied by after oversampling. We then build the CART tree on the oversampled dataset and prune it. The pruning parameter of CART is selected to maximize the F-measure. By the algorithm proposed by Breiman et al. (1984), the range from which to choose the pruning parameter will be available after we build the tree and does not need to be specified ahead of time.
For duplicated oversampling, we sample each minority class sample times. For SMOTE, the number of nearest neighbors is set as . For Borderline-SMOTE, we adopt the Borderline-SMOTE1 of Han et al. (2005), with the number of nearest neighbors . For both SMOTE and Borderline-SMOTE, if , some nearest neighbors may be used multiple times to generate synthetic samples. For ADASYN, denote the number of majority class samples as and the number of minority class samples as ; then the parameter is set to be .
4.3 Results
The average values of accuracy, precision, TPR, F-measure and number of selected features across the 50 runs are shown in Table 2. In the column “Method”, SVR = SVR-Tree, SVRSelect = SVR-Tree with feature selection, Duplicate = CART with duplicated oversampling, SMOTE = CART with SMOTE, BSMOTE = CART with Borderline-SMOTE, and ADASYN = CART with ADASYN. For each dataset and evaluation measure, the method that ranks first is highlighted in bold. The number of wins for each performance measure is also reported, with a tie counted as half a win for each tied method. The SVR methods perform best overall: SVRSelect attains the highest F-measure on four of the five datasets, and SVR and SVRSelect together attain the highest accuracy and precision on four of the five. Furthermore, SVRSelect selects the fewest features, or ties for the fewest, on every dataset, so it offers a good balance of accuracy and parsimony.
Data set  Method  Accuracy  Precision  TPR  F-measure  # Features
Glass  SVR  0.9583  0.8161  0.8956  0.8540  4.98 
SVRSelect  0.9683  0.8668  0.9067  0.8863  1.0  
Duplicate  0.9646  0.8489  0.9000  0.8737  1.0  
SMOTE  0.9624  0.8295  0.9111  0.8684  1.02  
BSMOTE  0.9602  0.8160  0.9133  0.8619  1.0  
ADASYN  0.9498  0.7713  0.8978  0.8297  5.72  
Vehicle  SVR  0.9368  0.8659  0.8652  0.8655  14.76 
SVRSelect  0.9355  0.8544  0.8748  0.8645  5.7  
Duplicate  0.9362  0.8377  0.9039  0.8696  11.64  
SMOTE  0.9317  0.8385  0.8791  0.8583  10.98  
BSMOTE  0.9309  0.8417  0.8697  0.8554  13.96  
ADASYN  0.9341  0.8423  0.8855  0.8634  11.58  
Abalone  SVR  0.9212  0.3251  0.3457  0.3351  6.88 
SVRSelect  0.9244  0.3431  0.3457  0.3444  5.6  
Duplicate  0.9184  0.2956  0.3043  0.2999  6.94  
SMOTE  0.8974  0.2578  0.4186  0.3191  6.94  
BSMOTE  0.8960  0.2479  0.3986  0.3057  6.96  
ADASYN  0.8937  0.2480  0.4186  0.3114  6.92  
Satimage  SVR  0.9036  0.5032  0.6969  0.5844  34.5 
SVRSelect  0.9029  0.5008  0.7020  0.5845  28.12  
Duplicate  0.9032  0.5017  0.6553  0.5683  34.58  
SMOTE  0.8895  0.4578  0.7364  0.5646  29.2  
BSMOTE  0.8945  0.4720  0.71  0.5671  31.38  
ADASYN  0.8946  0.4711  0.6831  0.5576  34.72  
Wine  SVR  0.8476  0.4564  0.6433  0.5340  10.96 
SVRSelect  0.8481  0.4578  0.6475  0.5363  10.54  
Duplicate  0.8553  0.4744  0.6103  0.5338  11.0  
SMOTE  0.8513  0.4647  0.6311  0.5353  11.0  
BSMOTE  0.8503  0.4608  0.6047  0.5231  11.0  
ADASYN  0.8477  0.4554  0.6228  0.5261  11.0  
Total # Wins  SVR  2  2  0  0  
SVRSelect  2  2  1  4  
Duplicate  1  1  1  1  
SMOTE  0  0  1.5  0  
BSMOTE  0  0  1  0  
ADASYN  0  0  0.5  0 
4.4 Additional Experiments for Feature Selection
For the Wine and Abalone datasets, we generate 10 additional uninformative features independently from . We rerun the analyses as above; results are shown in Table 3, whose last two columns report the average number of original features and the average number of artificially generated features selected by each method, respectively.
SVRSelect performs well when there are a considerable number of redundant features. It selects far more original features than artificially generated features (3.96 versus 0.26 on Wine and 2.42 versus 0.70 on Abalone), suggesting effectiveness in feature selection. For the other methods, the relative difference between the number of original features and the number of artificially generated features is generally much smaller. In addition, nearly all methods select fewer of the original features when compared with the results in Table 2.
Data set  Method  Accuracy  Precision  TPR  F-measure  # Original features  # Generated features
Wine  SVR  0.8277  0.4242  0.5486  0.4785  10.72  8.58 
SVRSelect  0.8115  0.3910  0.6975  0.5011  3.96  0.26  
Duplicate  0.8186  0.3963  0.6439  0.4906  9.58  7.32  
SMOTE  0.8085  0.3838  0.6789  0.4904  8.8  5.74  
BSMOTE  0.8136  0.3883  0.6492  0.4859  7.6  4.24  
ADASYN  0.8073  0.3816  0.6767  0.4880  8.6  6.44  
Abalone  SVR  0.8779  0.2173  0.4329  0.2894  5.34  8.0 
SVRSelect  0.9106  0.2788  0.3500  0.3104  2.42  0.7  
Duplicate  0.7730  0.1472  0.6157  0.2376  1.0  0.0  
SMOTE  0.8471  0.1706  0.43  0.2443  3.32  4.14  
BSMOTE  0.8761  0.2028  0.3943  0.2678  3.7  3.7  
ADASYN  0.8383  0.1640  0.4429  0.2393  3.48  4.12 
5 Discussion
A major challenge in analyzing imbalanced data is small sample size in the minority class leading to overfitting. It is natural to consider using regularization to address this problem. Regularization of classification trees is an old idea; for example, Breiman et al. (1984) proposed to penalize the number of leaf notes in the tree. Other classification trees like C4.5 (Quinlan, 2014) and Ripper (Cohen, 1995) also prune the overgrown tree. However, the number of leaf nodes may not be a good measure of complexity of a classification tree. Recently, Hahn et al. (2020) adds a Bayesian prior to an ensemble of trees, which functions as indirect regularization. Following the idea of directly regularizing the shape of the decision set and complexity of the decision boundary, we instead penalize the surfacetovolume ratio. To our knowledge, this is a new idea in the statistical literature on classification.
SVRTree can be trivially generalized to the multiclass case and to balanced datasets. For multiple classes, we can apply SVR to one or more minority classes and take the sum of these ratios as the regularization term. For balanced datasets, we can either compute the SVR of all classes, or we can simply regularize the surface of the decision boundary. The principle behind these generalizations is to regularize the complexity of the decision boundaries and the shapes of the decision sets.
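As a rough illustration of the multiclass generalization, the sketch below approximates the surface-to-volume ratio of a decision set on a regular grid and sums it over a chosen set of minority classes. This is not the algorithm used by SVRTree: the grid discretization, the function names `surface_to_volume_ratio` and `svr_penalty`, and the convention that faces on the boundary of the sample space count toward the surface are all assumptions made for illustration.

```python
import numpy as np


def surface_to_volume_ratio(mask, cell=1.0):
    """Grid approximation of the surface-to-volume ratio of a decision set.

    mask is a boolean array with one entry per grid cell (True = cell belongs
    to the decision set); cell is the side length of a grid cell.
    Assumption: faces on the edge of the grid count as surface.
    """
    mask = np.asarray(mask, dtype=bool)
    d = mask.ndim
    volume = mask.sum() * cell ** d
    if volume == 0:
        raise ValueError("the decision set is empty")
    # Pad with False so that cells on the edge of the grid expose their faces.
    padded = np.pad(mask, 1, constant_values=False).astype(np.int8)
    # Each change between neighboring cells along an axis is one exposed face.
    faces = sum(np.count_nonzero(np.diff(padded, axis=axis)) for axis in range(d))
    surface = faces * cell ** (d - 1)
    return surface / volume


def svr_penalty(labels, minority_classes, cell=1.0):
    """Sum of the ratios over the selected minority classes."""
    labels = np.asarray(labels)
    return sum(surface_to_volume_ratio(labels == k, cell) for k in minority_classes)


# A compact 4 x 4 block versus 16 isolated cells with the same total volume.
compact = np.zeros((10, 10), dtype=bool)
compact[3:7, 3:7] = True
scattered = np.zeros((10, 10), dtype=bool)
scattered[1:9:2, 1:9:2] = True

# Three classes: 0 is the majority; 1 and 2 are minority classes.
labels = np.zeros((10, 10), dtype=int)
labels[3:7, 3:7] = 1
labels[0, 0] = 2

print(surface_to_volume_ratio(compact), surface_to_volume_ratio(scattered))
print(svr_penalty(labels, minority_classes=[1, 2]))
```

The compact block and the scattered cells have equal volume, but the scattered set has four times the surface, so a penalty of this kind favors the compact decision set.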
SUPPLEMENTARY MATERIAL
Proofs and a Detailed Algorithm: Supplemental Material for “Classification Trees for Imbalanced and Sparse Data: Surface-to-Volume Regularization”.
Codes: https://github.com/YichenZhuDuke/ClassificationTreewith
SurfacetoVolumeratioRegularization.git.
References
 Classification and regression trees. Wadsworth. Cited by: §1, §2, §4.2, §4.2, §5, Definition 1.
 Safe-level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 475–482. Cited by: §1.

 SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357. Cited by: §1, §1, §4.1, §4.2.
 Fast effective rule induction. In Machine Learning Proceedings, pp. 115–123. Cited by: §5.
 Introduction to algorithms. MIT press. Cited by: §2.5, footnote 1.
 Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47 (4), pp. 547–553. Cited by: §4.1.
 UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §4.1.
 An insight into imbalanced big data classification: outcomes and challenges. Complex & Intelligent Systems 3 (2), pp. 105–120. Cited by: §1.
 A distribution-free theory of nonparametric regression. Springer Science & Business Media. Cited by: §A.1.
 Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Analysis. Cited by: §5.
 Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications 73, pp. 220–239. Cited by: §1.
 Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pp. 878–887. Cited by: §1, §4.2, §4.2.
 ADASYN: adaptive synthetic sampling approach for imbalanced learning. In IEEE International Joint Conference on Neural Networks, pp. 1322–1328. Cited by: §1, §4.1, §4.1, §4.2.
 Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering (9), pp. 1263–1284. Cited by: §1.
 Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence 5 (4), pp. 221–232. Cited by: §1.
 Histogram regression estimation using data-dependent partitions. Annals of Statistics 24 (3), pp. 1084–1105. Cited by: §A.1.
 C4.5: programs for machine learning. Elsevier. Cited by: §2, §5.
 Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.

 Consistency of random forests. Annals of Statistics 43 (4), pp. 1716–1741. Cited by: §A.1, §A.2.
 Vehicle recognition using rule based methods. Cited by: §4.1.
 Selective preprocessing of imbalanced data for improving classification performance. In International Conference on Data Warehousing and Knowledge Discovery, pp. 283–292. Cited by: §1.
 Optimal aggregation of classifiers in statistical learning. Annals of Statistics 32 (1), pp. 135–166. Cited by: §A.1.
 WGAN-based synthetic minority over-sampling technique: improving semantic fine-grained classification for lung nodules in CT images. IEEE Access 7, pp. 18450–18463. Cited by: §1.
Appendix A Proofs
We prove Theorems 1 and 2 here. Proofs for Corollary 1, the bounds in Examples 1–2, and the lemmas in this appendix are in the supplemental material. Without loss of generality, we assume in this section.
A.1 Proof of Theorem 1
The proof builds on Nobel (1996), Györfi et al. (2006), Tsybakov (2004) and Scornet et al. (2015). We first establish a sufficient condition for consistency, showing that a classification tree whose signed impurity converges to an oracle bound is consistent. We then break the difference between signed impurity of and the oracle bound into two parts. The first is the estimation error, which goes to zero if the number of leaves increases slower than ; the second is the approximation error, which goes to zero if goes to a constant within each leaf node and the penalty goes to zero.
Denote , and its weighted version . Define the oracle lower bound for signed impurity as The following lemma shows if the signed impurity of a classification tree converges to as , the classifier associated with the tree is consistent.
Lemma 1.
Let be a sequence of classification trees, let be the classifier associated with . is consistent if in probability.
We then decompose the difference between signed impurity of and the oracle bound into estimation error and approximation error.
Lemma 2.
Let be a classification tree trained from data , be all the leaf nodes of . Define the set of classifiers as:
We have
(6) 
The first term on the right hand side of equation (6) is the “estimation error”, which measures the difference between functions evaluated under the empirical and true distributions. The second term is “approximation error”, which measures the ability of to approximate the oracle prediction function. The next two lemmas show both terms go to zero in probability.
Lemma 3.
If we have in probability.
Lemma 4.
As , if and , in probability.
A.2 Proof of Theorem 2
The proof of Theorem 2 consists of two main parts. The first works on the true distribution , proving that under , the partition with the highest impurity decrease is always in nonredundant features; the second works on the randomness brought by