Classification Trees for Imbalanced and Sparse Data: Surface-to-Volume Regularization

Yichen Zhu, et al. (April 26, 2020)

Classification algorithms face difficulties when one or more classes have limited training data. We are particularly interested in classification trees, due to their interpretability and flexibility. When data are limited in one or more of the classes, the estimated decision boundaries are often irregularly shaped due to the limited sample size, leading to poor generalization error. We propose a novel approach that penalizes the Surface-to-Volume Ratio (SVR) of the decision set, obtaining a new class of SVR-Tree algorithms. We develop a simple and computationally efficient implementation while proving estimation and feature selection consistency for SVR-Tree. SVR-Tree is compared with multiple algorithms designed to deal with imbalance in applications to real data.


1 Introduction

We are interested in the common setting in which one has a set of training data $\{(x_i, y_i)\}_{i=1}^n$, with $x_i$ a vector of features and $y_i$ a class label. The goal is to estimate a classifier $f$, which outputs a class label given an input feature vector. This involves partitioning the feature space into subsets having different class labels, as illustrated in Figure 1. Classification trees are particularly popular due to their combined flexibility and interpretability. Competitors like deep neural networks (DNN; Schmidhuber 2015) often have higher accuracy in prediction, but are essentially uninterpretable black boxes.

Our focus is on the case in which the sample size is small for one or more of the classes. For simplicity in exposition, we assume there are only two classes, labeled 0 and 1. Without loss of generality, suppose that class 0 is the majority class and class 1 is the minority class, and call the set of feature values that the classifier assigns to class 1 the decision set (of the minority class). Hence, the minority class sample size is relatively small compared to that of the majority class by this convention.

To illustrate the problems that can arise from this imbalance, we consider a toy example in two dimensions, in which the minority class is supported on a small sub-region of the sample space (the dashed rectangles in Figure 1) and is heavily outnumbered in the training set. The estimated decision sets of unpruned CART and optimally pruned CART (Breiman et al., 1984) are shown in Figure 1(a) and 1(b), respectively. Minority samples are given weight 32 and majority samples weight 1 to naively address the imbalance. Both decision sets are inaccurate: unpruned CART badly overfits, and pruned CART remains poor.
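The exact distributions and sample sizes of this toy example are not reproduced here, so the following sketch uses arbitrary illustrative choices (a uniform majority class, a concentrated minority class, a 32:1 imbalance) and scikit-learn's DecisionTreeClassifier with naive class weights rather than the authors' code; it is only meant to show how weighting alone yields a fragmented minority decision set.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_maj, n_min = 3200, 100                                   # illustrative sizes, not the paper's
X_maj = rng.uniform(0.0, 1.0, size=(n_maj, 2))             # majority class spread over the unit square
X_min = np.clip(0.5 + 0.1 * rng.standard_normal((n_min, 2)), 0.0, 1.0)  # minority concentrated near the center
X = np.vstack([X_maj, X_min])
y = np.concatenate([np.zeros(n_maj, dtype=int), np.ones(n_min, dtype=int)])

# Naive reweighting of the minority class (weight 32), as in the toy example above
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 32}, random_state=0).fit(X, y)
print(clf.get_n_leaves())   # typically many leaves: small boxes carved around individual minority points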

Imbalanced data problems have drawn substantial interest; see Haixiang et al. (2017), He and Garcia (2008), Krawczyk (2016) and Fernández et al. (2017) for reviews. Some early work relied on random under- or over-sampling, which is essentially equivalent to modifying the weights and cannot address the key problem. Chawla et al. (2002) proposed SMOTE, which instead creates synthetic samples: for each minority class sample, synthetic samples are generated along the line segments joining it to its k nearest neighbors in the minority class. Building on this idea, many other synthetic sampling methods have been proposed, including ADASYN (He et al., 2008), Borderline-SMOTE (Han et al., 2005), SPIDER (Stefanowski and Wilk, 2008), safe-level-SMOTE (Bunkhumpornpat et al., 2009) and WGAN-based sampling (Wang et al., 2019). These synthetic sampling methods have been demonstrated to be relatively effective.
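As a concrete illustration of the SMOTE interpolation step described above, the following sketch generates synthetic minority samples on the segments joining a minority point to one of its k nearest minority-class neighbors; the function name and parameters are ours, not those of any reference implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: each point is its own nearest neighbor
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))                      # pick a minority sample at random
        j = idx[i][rng.integers(1, k + 1)]                # one of its k nearest minority neighbors
        u = rng.uniform()                                 # random position along the joining segment
        synthetic.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.array(synthetic)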

However, current understanding of synthetic sampling is inadequate. Chawla et al. (2002) motivate SMOTE as designed to “create large and less specific decision regions”, “rather than smaller and more specific regions”. Later papers fail to improve upon this heuristic justification. Practically, the advantage of synthetic sampling over random over-sampling diminishes as the dimension of the feature space increases. In general, the number of synthetic samples needed to fully describe the neighborhood of each minority sample grows rapidly with the dimension, which is often infeasible due to the sample size of the majority class and to computational complexity. Hence, it is typical to fix the number of synthetic samples regardless of the dimension of the feature space (Chawla et al., 2002), which may fail to “create large and less specific decision regions” when the dimension is high.

Motivated by these issues, we propose to directly penalize the Surface-to-Volume Ratio (SVR) of the decision set. A primary failure mode with imbalanced data is an estimated decision set consisting of many small neighborhoods around individual minority class samples. By penalizing SVR we favor regularly shaped decision sets that are much less subject to such over-fitting. With this motivation, we propose a new class of SVR-Tree algorithms.

The rest of the paper is organized as follows. Section 2 describes our methodology. Section 3 provides theory on estimation and feature selection consistency. Section 4 presents numerical studies for real datasets. Section 5 contains a discussion, and proofs are included in an Appendix.

(a) Unpruned CART.
(b) Optimally pruned CART.
(c) SVR-Tree.
Figure 1: Decision sets for different methods. Red crosses denote minority class samples, while blue points denote majority class samples. Rectangles with dashed frames denote the support of minority class samples, while rectangles filled with green color denote the minority class decision set.

2 Methodology

We first introduce the definition of surface-to-volume ratio (SVR) and tree impurity, and then define SVR-Tree as the minimizer of a weighted average of tree impurity and SVR. We then state the algorithm used to estimate this tree from training data. We assume readers have basic knowledge of tree-based classifiers like CART (Breiman et al., 1984) and C4.5 (Quinlan, 2014). In the rest of the paper, the word “tree” refers specifically to classification trees that specify a class label associated with each leaf node.

2.1 Notations

Training data are denoted as $\{(x_i, y_i)\}_{i=1}^n$, with $x_i$ a vector of features and $y_i$ a class label. Uppercase letters denote random variables, while lowercase letters denote specific values; the $j$th feature of $x$ is written $x_j$. The true distribution of $(X, Y)$ is denoted by $P$, while the empirical distribution, which assigns mass $1/n$ to each training data point, is denoted by $P_n$. For a constant $w \geq 1$, let $\tilde{P}$ modify $P$ so that minority class samples are up-weighted by the factor $w$; similarly, let $\tilde{P}_n$ be the weighted version of $P_n$. To avoid complex subscripts, with some abuse of notation, we also use the same symbols to denote the corresponding marginal probability measures on the feature space; whether they refer to the joint distribution of $(X, Y)$ or the marginal distribution of $X$ should be clear from the context. When discussing the probability of events involving the random training samples, we use $\mathbb{P}$ to represent the probability measure on the corresponding product space, and expectations are taken with respect to whichever of these measures is indicated by the context.

2.2 Surface-to-Volume Ratio

For each dimension $d \ge 1$, consider the measure space $(\mathbb{R}^d, \mathcal{B}, \mu)$, where $\mathcal{B}$ is the collection of Borel sets and $\mu$ is Lebesgue measure. For any measurable closed set $S$, we define its volume as the Lebesgue measure of $S$, and its surface as the $(d-1)$-dimensional Lebesgue measure of the boundary of $S$. For any set $S$ with positive volume, the surface-to-volume ratio (SVR) is the ratio of its surface to its volume. Among sets with the same volume, the $d$-dimensional ball has the smallest SVR, while sets having multiple disconnected subsets and/or irregular boundaries have relatively high SVR.
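For intuition, consider axis-aligned boxes, the building blocks of tree decision sets. The short computation below (with our own function name) gives the SVR of a single box and shows how fragmenting a region of fixed volume into many small pieces inflates the SVR.

import numpy as np

def box_svr(sides):
    # For a box with side lengths s_1, ..., s_d: volume = prod_j s_j and
    # surface = 2 * sum_j prod_{k != j} s_k, so SVR simplifies to 2 * sum_j (1 / s_j).
    sides = np.asarray(sides, dtype=float)
    return 2.0 * float(np.sum(1.0 / sides))

print(box_svr([0.5, 0.5]))   # 8.0  : a single square of area 0.25
print(box_svr([0.1, 0.1]))   # 40.0 : 25 disjoint squares of side 0.1 cover the same total
                             #        area 0.25 but have five times the SVR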

Surface-to-Volume Ratio of a Classification Tree

For training data $\{(x_i, y_i)\}_{i=1}^n$, we define a closed and bounded sample space containing the support of $X$. A classification tree $T$ divides the sample space into two disjoint subsets $S_0$ and $S_1$ whose union is the whole sample space; the tree predicts that a new sample belongs to class 0 if it falls in $S_0$ and to class 1 if it falls in $S_1$. The surface-to-volume ratio of a classification tree is defined as the surface-to-volume ratio of the minority class decision set $S_1$.

2.3 Impurity Function and Tree Impurity

A classification tree partitions the sample space into multiple leaf nodes, assigning one class label to each leaf node. The tree should be built to maximize homogeneity of the training sample class labels within nodes. This motivates the following definition.

Definition 1 (Impurity, Definition 2.5 of Breiman et al. 1984).

An impurity function $I(p_0, p_1)$ is defined on the set of pairs $(p_0, p_1)$ satisfying $p_0, p_1 \ge 0$ and $p_0 + p_1 = 1$, with the properties (i) $I$ achieves its maximum only at $(1/2, 1/2)$; (ii) $I$ achieves its minimum only at $(1, 0)$ and $(0, 1)$; (iii) $I$ is symmetric in $(p_0, p_1)$, i.e., $I(p_0, p_1) = I(p_1, p_0)$.

Let $p_0$ and $p_1$ represent the probabilities of belonging to the majority and minority class, respectively, within some branch of the tree. Ideally, splits of the tree are chosen so that, after splitting, $p_0$ and $p_1$ move closer to 0 or 1 and further from 1/2. Many different tree building algorithms use impurity to measure the quality of a split; for example, CART uses Gini impurity and C4.5 uses entropy, which is effectively a type of impurity measure.

When data are imbalanced, it is important to modify the definition of impurity to account for the fact that the minority class is much smaller than the majority class. With this in mind, we propose the following weighted impurity function.

Definition 2 (Weighted Impurity).

Letting $I$ be an impurity function, a weighted impurity function with weight $w \ge 1$ for the minority class is defined by evaluating $I$ at the class proportions obtained after the minority class probability is multiplied by $w$ and the pair is renormalized to sum to one.

In the remainder of the paper, the term ‘impurity function’ refers to the weighted impurity in Definition 2, with the base function $I$ taken to be the Gini impurity; for two classes, $I(p_0, p_1) = 1 - p_0^2 - p_1^2 = 2 p_0 p_1$.

Let $A_1, \ldots, A_m$ be the leaf nodes of a classification tree and let $c_k$ be the predictive class label for node $A_k$, for $k = 1, \ldots, m$. Given a probability measure over the sample space, the impurity of leaf node $A_k$ is the weighted impurity evaluated at the class probabilities within $A_k$. The impurity of node $A_k$ measures the class homogeneity in $A_k$, but does not depend on the predictive class label $c_k$. Let $\tilde{c}_k$ denote the dominant class label in $A_k$ under weight $w$. We define a signed impurity that also takes $c_k$ into account.

If the predictive class label $c_k$ matches the dominant class label $\tilde{c}_k$ in node $A_k$, the signed impurity of node $A_k$ is equal to the impurity of node $A_k$; otherwise, an extra penalty is applied. Taking a weighted average of the signed impurities across the leaf nodes, one obtains the tree impurity and signed tree impurity.
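To make the node-level quantities concrete, the sketch below computes a weighted Gini impurity and a signed impurity for a single node; the renormalized weighting and the particular mismatch penalty (the weighted margin between the classes) are our reading of the definitions above rather than a verbatim transcription of the paper's formulas.

def weighted_gini(n0, n1, w):
    # Gini impurity after multiplying the minority-class count n1 by w and renormalizing
    p1 = w * n1 / (n0 + w * n1)
    return 2.0 * p1 * (1.0 - p1)

def signed_impurity(n0, n1, w, predicted_label):
    # equals the weighted impurity when the predicted label matches the weighted-dominant
    # label; otherwise an extra penalty (here: the weighted class margin) is added
    p1 = w * n1 / (n0 + w * n1)
    dominant = 1 if p1 >= 0.5 else 0
    penalty = 0.0 if predicted_label == dominant else abs(2.0 * p1 - 1.0)
    return 2.0 * p1 * (1.0 - p1) + penalty

# A node with 40 majority and 10 minority samples, minority weight w = 8:
print(weighted_gini(40, 10, 8))        # ~0.444
print(signed_impurity(40, 10, 8, 0))   # ~0.778: labeling the node as majority incurs a penalty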

Definition 3 (Tree Impurity).

Let $T$ be a tree and $A_1, \ldots, A_m$ be all the leaf nodes of this tree. Up-weighting the minority class by $w$, the tree impurity of $T$ is the weighted average of the impurities of its leaf nodes, with node $A_k$ weighted by $\tilde{P}(A_k)$, where $\tilde{P}$ is the weighted version of $P$ defined in Section 2.1.

Definition 4 (Signed Tree Impurity).

Under the notation of Definition 3, the signed tree impurity is defined analogously, with the signed impurity of each leaf node in place of its impurity.

2.4 SVR-Tree Classifiers

The SVR-Tree classifier is the minimizer of the weighted average of signed tree impurity and surface-to-volume ratio. Letting $\mathcal{T}$ be the collection of possible trees, the SVR-Tree classifier is formally defined as

$\hat{T} \;=\; \arg\min_{T \in \mathcal{T}} \;\bigl\{ \text{signed tree impurity of } T \;+\; \lambda \cdot \mathrm{SVR}(T) \bigr\}, \qquad (1)$

where $\lambda > 0$ is a penalty parameter. The unknown probability measure is replaced with the empirical measure that assigns mass $1/n$ to each training sample. Unfortunately, without restricting the space of trees $\mathcal{T}$, optimization problem (1) is intractable. In the following subsection, we introduce an iterative greedy search algorithm that limits the size of $\mathcal{T}$ in each step to efficiently obtain a tree having provably good performance.

2.5 The SVR-Tree Algorithm

The SVR-Tree Algorithm is designed to find a nearly optimal SVR-Tree classifier. SVR-Tree proceeds in a greedy manner. We begin with the root node. At each step, we operate on one leaf node of the current tree, partitioning it into two new leaf nodes by finding the solution of (1). The node to partition at each step is uniquely specified by a breadth-first searching order. After partitioning, the tree is updated and the node to partition in the next step is specified. The process stops when further splitting of leaf nodes does not improve the loss, or when a prespecified maximum number of leaf nodes is reached.

We first describe how to split a current leaf node to improve the solution to (1). Suppose the current tree is $T$ and the node to partition is $A$, containing $m$ training samples. For each feature, sort the samples in $A$ in increasing order of that feature. We only allow partitions to occur at the midpoints of two adjacent values of each feature, so the total number of candidate partitions is less than the number of samples in $A$ times the number of features. After each such split of $A$, we keep all other leaf nodes unchanged while allowing all possible class label assignments at the two new daughter nodes of $A$. The current set of trees $\mathcal{T}$ to choose from in optimizing (1) includes the initial $T$ and all the partitioned trees described above; its cardinality is of linear order in $m$. We compute the risk for all trees in $\mathcal{T}$ to find the minimizer. If the initial $T$ is the risk minimizer, we do not make any partition in this step and mark the node as “complete”. Any node marked as complete will no longer be partitioned in subsequent steps.
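For example, the candidate split locations for one feature within a node can be enumerated as follows (the midpoint rule stated above, with our own function name):

import numpy as np

def candidate_splits(x_feature):
    v = np.unique(x_feature)        # sorted unique values of this feature among the node's samples
    return (v[:-1] + v[1:]) / 2.0   # midpoints of adjacent values: at most len(v) - 1 candidate cuts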

It remains to specify the ‘breadth-first’ searching order determining which node to partition in each step. Let the depth of a node be the number of edges to the root node, which has depth 0. Breadth-first algorithms explore all nodes at the current depth before moving on (Cormen et al. 2009, chapter 22.2). To keep track of changes in the tree, we use a queue (a dynamic set in which elements are kept in order and the principal operations are insertion of elements at the tail, known as enqueue, and deletion of elements from the head, known as dequeue; see chapter 10.1 of Cormen et al. 2009 for details). We begin our algorithm with a queue whose only entry is the root node. At each step, we remove the node at the front of the queue and partition it as described in the previous paragraph. If a partitioned tree is accepted as the risk minimizer over the current set $\mathcal{T}$, we enqueue the two new leaf nodes; otherwise, the unpartitioned tree is the risk minimizer over the current set $\mathcal{T}$, so we do not enqueue any new node. The nodes at the front of the queue have the lowest depth, so our algorithm naturally follows a breadth-first searching order. We preset a maximal number of leaf nodes. The process is stopped when either the queue is empty, in which case all the leaf nodes are marked as complete, or the maximal number of leaf nodes is reached.

Our SVR-Tree algorithm has a coarse-to-fine tree building style, tending to first partition the sample space into larger pieces belonging to the two different classes, followed by modifications to the surface of the decision set to decrease tree impurity and SVR. The steps are sketched in Algorithm 1, where feature selection steps are marked as optional. A more detailed and rigorous version is in the last section of the supplemental material.

Result: Output the fitted tree
Input: training data, impurity function, weight $w$ for the minority class, SVR penalty parameter $\lambda$, and maximal number of leaf nodes. Let $Q$ be a queue containing only the root node.
for each feature do
      Sort all the samples by the values of that feature;
end for
while $Q$ is not empty and the number of leaf nodes is below the maximum do
      Dequeue the first node in $Q$, denoting it as $A$;
      for all possible partitions of $A$ do
            (optional) Check whether the current partition satisfies the feature selection condition; if not, reject the current partition and continue;
            Compute the signed tree impurity and SVR of the current partition;
      end for
      Find the partition minimizing the penalized signed tree impurity in (1);
      if this minimum improves on the unpartitioned tree then
            Accept the partition and enqueue the two child nodes of $A$ into $Q$;
      else
            Reject the partition and mark $A$ as complete;
      end if
end while
Algorithm 1: Outline of steps of SVR-Tree
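The following is a minimal, self-contained Python sketch of the greedy breadth-first loop above, under several simplifying assumptions of our own rather than the paper's: features are rescaled to the unit cube, each child is assigned its weighted-dominant label instead of searching over label assignments, the surface of the minority decision set is approximated by summing the surfaces of minority-labeled leaf boxes (shared faces are not removed), and the loss uses a renormalized weighted Gini impurity. The incremental surface and volume updates of Section 2.6 are omitted, so each candidate split re-evaluates the loss from scratch.

from collections import deque
import numpy as np

def node_gini(y_node, w):
    # two-class Gini impurity after up-weighting the minority class (label 1) by w
    n1 = float(np.sum(y_node == 1))
    n0 = float(len(y_node)) - n1
    m = n0 + w * n1
    if m == 0.0:
        return 0.0
    p1 = w * n1 / m
    return 2.0 * p1 * (1.0 - p1)

def box_surface_volume(lo, hi):
    # surface and volume of an axis-aligned box with corner vectors lo, hi
    sides = hi - lo
    vol = float(np.prod(sides))
    surf = 2.0 * float(sum(np.prod(np.delete(sides, j)) for j in range(len(sides))))
    return surf, vol

def svr_tree_sketch(X, y, w=10.0, lam=0.01, max_leaves=16):
    n, p = X.shape

    def make_leaf(idx, lo, hi):
        n1 = float(np.sum(y[idx] == 1))
        n0 = float(len(idx)) - n1
        return {"idx": idx, "lo": lo, "hi": hi,
                "label": int(w * n1 >= n0),              # weighted-dominant label
                "mass": n0 + w * n1}

    def loss(leaf_list):
        # weighted tree impurity plus lam times the SVR of the minority decision set;
        # the surface of the union is approximated by summing minority leaf-box surfaces
        total = sum(l["mass"] for l in leaf_list)
        imp = sum(l["mass"] * node_gini(y[l["idx"]], w) for l in leaf_list) / total
        surf = vol = 0.0
        for l in leaf_list:
            if l["label"] == 1:
                s, v = box_surface_volume(l["lo"], l["hi"])
                surf, vol = surf + s, vol + v
        return imp + lam * (surf / vol if vol > 0 else 0.0)

    root = make_leaf(np.arange(n), np.zeros(p), np.ones(p))
    leaves, queue = [root], deque([root])
    while queue and len(leaves) < max_leaves:
        node = queue.popleft()
        others = [l for l in leaves if l is not node]
        best_loss, best_children = loss(leaves), None
        for j in range(p):
            vals = np.unique(X[node["idx"], j])
            for cut in (vals[:-1] + vals[1:]) / 2.0:     # midpoints of adjacent feature values
                mask = X[node["idx"], j] <= cut
                hi_l, lo_r = node["hi"].copy(), node["lo"].copy()
                hi_l[j], lo_r[j] = cut, cut
                left = make_leaf(node["idx"][mask], node["lo"].copy(), hi_l)
                right = make_leaf(node["idx"][~mask], lo_r, node["hi"].copy())
                cand_loss = loss(others + [left, right])
                if cand_loss < best_loss:
                    best_loss, best_children = cand_loss, (left, right)
        if best_children is not None:
            leaves = others + list(best_children)
            queue.extend(best_children)                  # breadth-first: children join the tail
        # otherwise the node is "complete" and is never revisited
    return leaves

With lam = 0 this reduces to a plain weighted greedy tree, while increasing lam discourages fragmented, high-SVR minority decision sets, which is the qualitative behavior the penalty is designed to produce.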

Optional Step for Feature Selection

It is likely that some features will not help to predict class labels. These features should be excluded from our estimated tree. Under some mild conditions, a split on a redundant feature yields a minimal tree impurity decrease compared to a partition in a non-redundant feature, so feature selection can be achieved by thresholding. Suppose we are partitioning node $A$ into two new leaf nodes. Then the (unsigned) tree impurity decrease after this partition is defined as the tree impurity before the partition minus the tree impurity after the partition.

Let $S$ be the indices of features that have been partitioned in previous tree building steps. Given that we are partitioning node $A$, let $\Delta_j$ be the maximal tree impurity decrease over all partitions in feature $j$. Then a partition in a new feature $j' \notin S$, with tree impurity decrease $\Delta_{j'}$, is accepted if

$\Delta_{j'} \;\ge\; \max_{j \in S} \Delta_j \;+\; (\text{threshold term}), \qquad (2)$

where the threshold term involves a constant independent of the training data. By equation (2), a partition on a new feature is accepted only if its tree impurity decrease exceeds the maximal tree impurity decrease over all previously partitioned features by at least the threshold term. Theoretical support for this thresholding approach is provided in Section 3.
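A tiny sketch of the acceptance rule in (2) (the exact threshold sequence is not reproduced here, so it is left as a generic argument):

def accept_new_feature(delta_new, deltas_used, threshold):
    # delta_new: best impurity decrease from splitting on a feature not used so far
    # deltas_used: best impurity decreases for splits on each previously used feature
    best_used = max(deltas_used) if deltas_used else 0.0
    return delta_new > best_used + threshold

# e.g. accept_new_feature(0.12, [0.05, 0.08], threshold=0.02) returns True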

2.6 Computational Complexity

Recall that the number of training samples is $n$ and the number of features is $p$. Denote the depth of the estimated tree as $d_T$ and the maximal number of leaf nodes as $N_{\max}$ (fixed in advance in our experiments). The storage complexity is the same as for usual decision trees. We analyze the time complexity of Algorithm 1 in this section, first introducing the approach to compute the surface-to-volume ratio and then discussing the time complexity of building the tree.

Computing SVR

Suppose that in some intermediate state of building the SVR-Tree, the current tree has leaf nodes $A_1, \ldots, A_m$, and that we already know the surface and volume of the current tree. We now need to perform a partition at a node $A$. Suppose we partition $A$ at some location and obtain two child nodes. The volume of the tree after this partition can be computed by adding the volume of the minority class child node(s), which is inexpensive. The major concern lies in the computation of the surface area. If both child nodes are assigned to the majority class, or both to the minority class, the surface after partitioning equals the surface of the tree in which node $A$ itself is labeled as the majority class or the minority class, respectively. Computing the surface of $A$ takes time proportional to the number of features, and computing all of the overlapping surface between $A$ and the other leaf nodes takes time proportional to the number of leaf nodes times the number of features; the surface of the tree after partitioning then follows at the same cost.

If one child node belongs to the minority class and the other belongs to the majority class, the problem becomes more complicated. For a candidate split location $a$ of node $A$, let $s_{01}(a)$ be the surface of the partitioned tree when the left child is labeled as the majority class and the right child as the minority class, and let $s_{10}(a)$ be defined with the labels reversed. Both $s_{01}$ and $s_{10}$ are piecewise constant functions whose change points can only occur at borders of the other leaf nodes. Therefore, we first compute the analytical forms of $s_{01}$ and $s_{10}$; this requires finding all the borders of the other leaf nodes, computing all of the overlapping surface between $A$ and those nodes, and computing the surface of $A$ itself. Evaluating $s_{01}$ and $s_{10}$ at a single split location from scratch can be relatively costly, but if the samples are pre-sorted for all features, they can be evaluated at all the possible partition locations of $A$ in a single pass. The cost of computing volumes for all possible partition locations is of the same order. Thus, for all the possible partition locations and class label assignments at $A$, the total cost of computing SVR grows with the number of samples in $A$, the number of leaf nodes and the number of features.

Time Complexity of Algorithm 1

Before working on any partitions, we first need to sort the whole dataset along all features, which takes $O(pn\log n)$ time. For a node with $m$ samples, there are at most $p(m-1)$ possible partition locations. With pre-sorted samples, computing the signed impurity and the surface-to-volume ratio for all possible class label assignments and all partition locations, and then finding the best partition, takes time growing linearly in $m$, in the number of leaf nodes of the current tree and in $p$. Let $L$ be the number of leaf nodes of the tree output by Algorithm 1, with $L \le N_{\max}$. Summing these per-node costs over all the splits made by the algorithm yields an overall time complexity that is polynomial in $n$, $p$ and $N_{\max}$. This shows the efficiency of Algorithm 1.

3 Theoretical Properties

In this section, we discuss the consistency of the classification tree obtained from Algorithm 1 in two respects. The first is estimation consistency: as the sample size goes to infinity, a generalized distance between SVR-Tree and the oracle classifier converges to zero. The second is feature selection consistency: for redundant features that are conditionally independent of the class label given the other features, the probability that SVR-Tree excludes these features converges to one.

3.1 Estimation Consistency

We define a generalized metric on the space of all nonrandom classifiers, introduce classifier risk, and define our notions of classifier consistency.

Definition 5.

For any weight $w \ge 1$ and classifier $f$, we define the risk $R(f)$ as the expected loss of $f$ under the weighted distribution, where the expectation is taken over the probability measure of the random variables $(X, Y)$.

Let $\mathcal{F}$ be the collection of all measurable functions from the sample space to $\{0, 1\}$. The oracle classifier $f^*$ and the minimal risk $R^*$ are defined as the minimizer and the minimum of $R(f)$ over $\mathcal{F}$, respectively. Without loss of generality, we assume ties occur with probability zero, so that the oracle classifier is unique almost surely. We now define the consistency of a sequence of classifiers based on the $L_1$ distance from the oracle classifier.
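For concreteness, if the risk in Definition 5 is taken to be the usual weighted misclassification error (an assumption about the display not reproduced above), then writing $\eta(x) = P(Y = 1 \mid X = x)$,

$$ R(f) = \mathbb{E}\bigl[ \mathbf{1}\{f(X) \neq Y\}\,\bigl( w\,\mathbf{1}\{Y = 1\} + \mathbf{1}\{Y = 0\} \bigr) \bigr], \qquad f^*(x) = \mathbf{1}\{ w\,\eta(x) \ge 1 - \eta(x) \} = \mathbf{1}\bigl\{ \eta(x) \ge \tfrac{1}{1 + w} \bigr\}, $$

so up-weighting the minority class lowers the threshold on $\eta$ at which a point is assigned to the minority class.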

Definition 6.

A sequence of classifiers $\{f_n\}$ is consistent if the $L_1$ distance between $f_n$ and the oracle classifier $f^*$ converges to zero in probability as $n \to \infty$.

Denote the tree obtained from Algorithm 1 as $\hat{T}_n$ and the classifier associated with $\hat{T}_n$ as $\hat{f}_n$. Theorem 1 shows $\hat{f}_n$ is consistent under mild conditions.

Theorem 1 (Estimation consistency).

Let the distribution of $X$ be absolutely continuous with respect to Lebesgue measure on the sample space. Assume the maximal number of leaf nodes grows sufficiently slowly with $n$ and the SVR penalty parameter goes to zero as $n \to \infty$, and let the minority class weight be a constant no smaller than one. Then the classifier obtained from Algorithm 1 is consistent.

3.2 Feature Selection Consistency

For a feature $j$, let $X^{(-j)}$ denote all features except the $j$th. We say feature $j$ is redundant if, conditionally on $X^{(-j)}$, the class label $Y$ is independent of $X^{(j)}$. We denote the collection of all non-redundant features by $S^*$ and the collection of all redundant features by its complement. Under two conditions on the distribution of $X$ and $Y$, we can show that if the threshold term in (2) goes to zero sufficiently slowly, the probability of excluding all redundant features goes to one. Without loss of generality, we index the non-redundant features first, followed by the redundant features.

Before stating these two conditions, we need to discuss how a partition can decrease the tree impurity. Suppose we are partitioning node $A$ at feature $j$ and split location $a$, resulting in two new leaf nodes $A_L$ and $A_R$. The weighted conditional expectation of $Y$ in node $A_L$ is the weighted probability of the minority class within $A_L$ (equation (3)), and the impurity of $A_L$ is the impurity function evaluated at this quantity. Similarly, denoting the weighted conditional expectation of $Y$ in node $A_R$ analogously, the impurity of $A_R$ is defined in the same way. The impurity decrease on node $A$ is the impurity of $A$ minus the weighted average of the impurities of $A_L$ and $A_R$ (equation (4)). Noting that the pair $(A_L, A_R)$ is determined by the split location $a$, the impurity decrease can be viewed as a function of $a$; the maximal impurity decrease at feature $j$ is its supremum over all split locations (equation (5)).

The quality of a feature in reducing the impurity of node $A$ is measured relative to the oracle impurity decrease at node $A$. Suppose we partition node $A$ into measurable sets $B_L$ and $B_R$, not required to be hyper-rectangles, such that the weighted conditional expectation of $Y$ is no larger anywhere on $B_L$ than anywhere on $B_R$; that is, $B_L$ corresponds to the proportion of $A$ with the smallest values of this conditional expectation, while $B_R$ corresponds to the remaining proportion with the largest values. Similar to equations (3) and (4), we can define the corresponding conditional expectations and the impurity decrease. Given the proportion of $A$ assigned to $B_L$, this impurity decrease is unique, and it is also the maximal impurity decrease over all measurable partitions with that proportion. The oracle impurity decrease at $A$ is the maximal value of this impurity decrease over all proportions. In general, larger impurity decreases will tend to correspond to non-redundant features; this will particularly be the case under Conditions 1-2.

Condition 1.

There exists $\gamma \in [0, 1]$ such that, for every node under consideration, the maximal impurity decrease over partitions in non-redundant features is at least $\gamma$ times the oracle impurity decrease at that node.

Condition 1 relates the impurity decrease in non-redundant features to the oracle impurity decrease. The strength of the condition depends on the value of $\gamma$. If $\gamma = 0$, the condition does not impose any restrictions; if $\gamma = 1$, partitions in non-redundant features can fully explain the oracle impurity decrease. Here we give some examples of models with different values of $\gamma$.

Example 1 (Generalized Linear Models).

Let the marginal distribution of $X$ be the uniform distribution on the unit cube, and suppose the conditional probability of the minority class is a monotonically increasing function of a linear combination of the non-redundant features. Let $B_L$ and $B_R$ be measurable sets achieving the oracle impurity decrease. Then the constant $\gamma$ in Condition 1 admits an explicit lower bound expressed through the cumulative distribution function of the Irwin-Hall distribution, with the bound depending on the number of features entering the linear combination.

Example 2.

Let the marginal distribution of $X$ be the uniform distribution on the unit cube, and let the conditional probability of the minority class be a piecewise constant function of a single non-redundant feature, under the weighting convention above. Then the constant $\gamma$ in Condition 1 can be computed explicitly.

Condition 2.

The weighted probability measure of $X$ has a density on the sample space. Moreover, this density can be decomposed as the sum of a component under which the redundant features are independent of the remaining features and a dependent component whose magnitude is bounded by a constant $\kappa \ge 0$ times the independent component.

Condition 2 asserts that the joint density of $X$ can be decomposed into an independent component plus a dependent component, where the dependent component is dominated by the independent component up to the constant $\kappa$. This condition controls the dependence between the redundant and non-redundant features. Since, given the other features, $Y$ is independent of a redundant feature, this condition also controls the dependence between the redundant features and $Y$. The strength of the condition depends on the constant $\kappa$: as $\kappa \to \infty$ the condition imposes essentially no restrictions, while $\kappa = 0$ asserts that the redundant and non-redundant features are completely independent.

Using Conditions 1-2, we establish feature selection consistency in Theorem 2.

Theorem 2 (Feature selection consistency).

If the optional steps in Algorithm 1 are enabled, Conditions 1 and 2 are satisfied with constants $\gamma$ and $\kappa$ in a suitable relationship, and the threshold term in (2) decays at an appropriate rate for some constant, then the probability that the tree output by Algorithm 1 excludes all redundant features converges to one as $n \to \infty$.

In Theorem 2, Conditions 1 and 2 are complementary. If $\kappa$ is smaller (i.e., Condition 2 is stronger), then $\gamma$ can be smaller (i.e., Condition 1 is weaker), and the opposite also holds. The following two corollaries cover special cases.

Corollary 1.

If the optional steps in Algorithm 1 are enabled, Condition 1 is satisfied with $\gamma = 1$ (that is, in each hyper-rectangle, the maximal impurity decrease at non-redundant features is equal to the oracle impurity decrease), and the threshold term in (2) decays at an appropriate rate for some constant, then the probability of excluding all redundant features converges to one.

Corollary 2.

If the optional steps in Algorithm 1 are enabled, the redundant features are independent of the remaining features, and the threshold term in (2) decays at an appropriate rate for some constant, then the probability of excluding all redundant features converges to one.

Corollary 2 is a direct result of Theorem 2 with $\kappa = 0$ and an arbitrary value of $\gamma$.

4 Numerical Studies

We compare SVR-Tree with popular imbalanced classification methods on real datasets, adding redundant features to these datasets to evaluate feature selection. A confusion matrix (Table 1) is often used to assess classification performance. A common criterion is accuracy, $(TP + TN)/(TP + TN + FP + FN)$. When positives are relatively rare, it is often important to give higher priority to true positives, which is accomplished using the true positive rate (recall), $TPR = TP/(TP + FN)$, and precision, $TP/(TP + FP)$. To combine these, the F-measure is often used: $F = 2 \cdot \text{precision} \cdot TPR / (\text{precision} + TPR)$.

                              True Label
                        1                       0
Predicted   1   True Positive (TP)     False Positive (FP)
Label       0   False Negative (FN)    True Negative (TN)
Table 1: Confusion matrix for two-class classification.
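For reference, the evaluation measures reported in Tables 2-3 can be computed from the confusion-matrix counts as follows (standard definitions; the function name is ours):

def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    tpr = tp / (tp + fn) if tp + fn > 0 else 0.0                           # recall / true positive rate
    f_measure = 2 * precision * tpr / (precision + tpr) if precision + tpr > 0 else 0.0
    return accuracy, precision, tpr, f_measure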

4.1 Datasets

We test our method on 5 datasets from the UCI Machine Learning Repository (Dua and Graff, 2017), varying in size, number of features and level of imbalance.

Glass dataset

(https://archive.ics.uci.edu/ml/datasets/Glass+Identification) This dataset consists of 213 samples and 9 features. The objective is to classify samples into one of seven types of glass. We choose class 7 (headlamps) as the minority class and classes 1-6 as the majority class, yielding 29 minority class samples.

Vehicle dataset

(Siebert, 1987; https://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes)) This dataset consists of 846 samples and 18 features. The aim is to classify a silhouette into one of four types of vehicles. As in He et al. (2008), we choose class “Van” as the minority class and the other three types of vehicles as the majority class, resulting in 199 minority class samples.

Abalone dataset

(https://archive.ics.uci.edu/ml/datasets/Abalone) The aim is to predict the age of abalone from physical measurements. As in He et al. (2008), we let class “18” be the minority class and class “9” be the majority class. This yields 731 samples in total, among which 42 belong to the minority class. We also remove the discrete feature “sex”, which leaves 9 features.

Satimage dataset

(https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)) This dataset consists of a training set and a test set, together containing 6435 samples and 36 features. As in Chawla et al. (2002), we choose class “4” as the minority class and collapse all other classes into a single majority class, resulting in 626 minority class samples.

Wine dataset

(Cortez et al., 2009; https://archive.ics.uci.edu/ml/datasets/Wine+Quality) This dataset collects information on wine quality. We focus on the red wine subset, which has 1599 samples and 11 features. We let the minority class be samples having quality at least 7, while the majority class has quality below 7. This generates 217 minority class samples.

4.2 Experimental Setup

We test the performance of SVR-Tree, SVR-Tree with feature selection, CART (Breiman et al., 1984) with duplicated oversampling, CART with SMOTE (Chawla et al., 2002), CART with Borderline-SMOTE (Han et al., 2005) and CART with ADASYN (He et al., 2008) on all five datasets. Features are linearly transformed so that all samples lie in the unit hypercube.

For each method and dataset, we run the algorithm 50 times. For each run, we randomly choose 2/3 of the samples for training and 1/3 for testing. The average values of TP, TN, FP and FN on the testing sets are used to compute the accuracy, TPR, precision and F-measure. The specific settings for each method are discussed below.

SVR-Tree with and without feature selection are described in Algorithm 1. The weight for the minority class, $w$, is set to the largest integer that keeps the total weight of the minority class no greater than the total weight of the majority class, truncated at a prespecified maximum if necessary. The maximal number of leaf nodes is fixed in advance. The penalty parameter for SVR is chosen from a geometric sequence of candidate values; the value with the highest F-measure over the 50 runs is selected. For SVR-Tree with feature selection, the constant in equation (2) is fixed across datasets; in practice, the results are insensitive to this constant.

For the other methods, we first oversample the minority class so that the number of minority samples is multiplied by a fixed factor after oversampling. We then build the CART tree on the oversampled dataset and prune it. The pruning parameter of CART is selected to maximize the F-measure. With the algorithm proposed by Breiman et al. (1984), the range from which to choose the pruning parameter becomes available after the tree is built and does not need to be specified ahead of time.

For duplicated oversampling, each minority class sample is replicated a fixed number of times. For SMOTE, the number of nearest neighbors is set to a fixed value, and for Borderline-SMOTE we adopt Borderline-SMOTE1 of Han et al. (2005) with the same choice of the number of nearest neighbors. For both SMOTE and Borderline-SMOTE, if the required number of synthetic samples exceeds the number of neighbors, some nearest neighbors may be used multiple times to generate synthetic samples. For ADASYN, the parameter governing the amount of oversampling is set as a function of the numbers of majority and minority class samples.

4.3 Results

The average values of accuracy, precision, TPR, F-measure and number of selected features across the 50 runs are shown in Table 2. In the column “Method”, SVR = SVR-Tree, SVR-Select = SVR-Tree with feature selection, Duplicate = CART with duplicated oversampling, SMOTE = CART with SMOTE, BSMOTE = CART with Borderline-SMOTE and ADASYN = CART with ADASYN. For each dataset and evaluation measure, the method that ranks first is highlighted in bold. The number of wins for each performance measure is also calculated. The SVR methods perform the best overall, with SVR-Select having particularly good performance. Furthermore, SVR-Select chooses the fewest features for each dataset, so it has a good balance of accuracy and parsimony.

Data set Method Accuracy Precision TPR F-measure # Features
Glass SVR 0.9583 0.8161 0.8956 0.8540 4.98
SVR-Select 0.9683 0.8668 0.9067 0.8863 1.0
Duplicate 0.9646 0.8489 0.9000 0.8737 1.0
SMOTE 0.9624 0.8295 0.9111 0.8684 1.02
BSMOTE 0.9602 0.8160 0.9133 0.8619 1.0
ADASYN 0.9498 0.7713 0.8978 0.8297 5.72
Vehicle SVR 0.9368 0.8659 0.8652 0.8655 14.76
SVR-Select 0.9355 0.8544 0.8748 0.8645 5.7
Duplicate 0.9362 0.8377 0.9039 0.8696 11.64
SMOTE 0.9317 0.8385 0.8791 0.8583 10.98
BSMOTE 0.9309 0.8417 0.8697 0.8554 13.96
ADASYN 0.9341 0.8423 0.8855 0.8634 11.58
Abalone SVR 0.9212 0.3251 0.3457 0.3351 6.88
SVR-Select 0.9244 0.3431 0.3457 0.3444 5.6
Duplicate 0.9184 0.2956 0.3043 0.2999 6.94
SMOTE 0.8974 0.2578 0.4186 0.3191 6.94
BSMOTE 0.8960 0.2479 0.3986 0.3057 6.96
ADASYN 0.8937 0.2480 0.4186 0.3114 6.92
Satimage SVR 0.9036 0.5032 0.6969 0.5844 34.5
SVR-Select 0.9029 0.5008 0.7020 0.5845 28.12
Duplicate 0.9032 0.5017 0.6553 0.5683 34.58
SMOTE 0.8895 0.4578 0.7364 0.5646 29.2
BSMOTE 0.8945 0.4720 0.71 0.5671 31.38
ADASYN 0.8946 0.4711 0.6831 0.5576 34.72
Wine SVR 0.8476 0.4564 0.6433 0.5340 10.96
SVR-Select 0.8481 0.4578 0.6475 0.5363 10.54
Duplicate 0.8553 0.4744 0.6103 0.5338 11.0
SMOTE 0.8513 0.4647 0.6311 0.5353 11.0
BSMOTE 0.8503 0.4608 0.6047 0.5231 11.0
ADASYN 0.8477 0.4554 0.6228 0.5261 11.0
Total # Wins SVR 2 2 0 0
SVR-Select 2 2 1 4
Duplicate 1 1 1 1
SMOTE 0 0 1.5 0
BSMOTE 0 0 1 0
ADASYN 0 0 0.5 0
Table 2: Results of Numerical Studies on five real world datasets.

4.4 Additional Experiments for Feature Selection

For the Wine and Abalone datasets, we generate 10 additional uninformative features, drawn independently of the class label and of the original features. We rerun the analyses as above; the results are shown in Table 3, where the final two columns give the average number of original features and the average number of artificially generated features selected by each method.

SVR-Select performs well when there are a considerable number of redundant features. SVR-Select selects significantly more original features than artificially generated features, suggesting effectiveness in feature selection. For all other methods, the relative difference between number of original features and number of artificially generated features is much smaller. In addition, nearly all methods select fewer of the original features when compared with results in Table 2.

Data set Method Accuracy Precision TPR F-measure # Original # Artificial
Wine SVR 0.8277 0.4242 0.5486 0.4785 10.72 8.58
SVR-Select 0.8115 0.3910 0.6975 0.5011 3.96 0.26
Duplicate 0.8186 0.3963 0.6439 0.4906 9.58 7.32
SMOTE 0.8085 0.3838 0.6789 0.4904 8.8 5.74
BSMOTE 0.8136 0.3883 0.6492 0.4859 7.6 4.24
ADASYN 0.8073 0.3816 0.6767 0.4880 8.6 6.44
Abalone SVR 0.8779 0.2173 0.4329 0.2894 5.34 8.0
SVR-Select 0.9106 0.2788 0.3500 0.3104 2.42 0.7
Duplicate 0.7730 0.1472 0.6157 0.2376 1.0 0.0
SMOTE 0.8471 0.1706 0.43 0.2443 3.32 4.14
BSMOTE 0.8761 0.2028 0.3943 0.2678 3.7 3.7
ADASYN 0.8383 0.1640 0.4429 0.2393 3.48 4.12
Table 3: Additional Numerical Study for Feature Selection.

5 Discussion

A major challenge in analyzing imbalanced data is the small sample size of the minority class, which leads to overfitting. It is natural to consider using regularization to address this problem. Regularization of classification trees is an old idea; for example, Breiman et al. (1984) proposed to penalize the number of leaf nodes in the tree. Other classification trees like C4.5 (Quinlan, 2014) and Ripper (Cohen, 1995) also prune the overgrown tree. However, the number of leaf nodes may not be a good measure of the complexity of a classification tree. More recently, Hahn et al. (2020) added a Bayesian prior to an ensemble of trees, which functions as indirect regularization. Following the idea of directly regularizing the shape of the decision set and the complexity of the decision boundary, we instead penalize the surface-to-volume ratio. To our knowledge, this is a new idea in the statistical literature on classification.

SVR-Tree can be trivially generalized to the multi-class case and to balanced datasets. For multiple classes, we can apply SVR to one or more minority classes and take the sum of these ratios as the regularization term. For balanced datasets, we can either compute the SVR of all classes or simply regularize the surface of the decision boundary. The principle behind these generalizations is to regularize the complexity of the decision boundaries and the shapes of the decision sets.


SUPPLEMENTARY MATERIAL

Proofs and a Detailed Algorithm: Supplemental Material for “Classification Trees for Imbalanced and Sparse Data: Surface-to-Volume Regularization”.
Codes: https://github.com/YichenZhuDuke/Classification-Tree-with-Surface-to-Volume-ratio-Regularization.git.

References

  • L. Breiman, J. Friedman, R. Olshen, and C. Stone (1984) Classification and regression trees. Wadsworth. Cited by: §1, §2, §4.2, §4.2, §5, Definition 1.
  • C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap (2009) Safe-level-smote: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 475–482. Cited by: §1.
  • N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357. Cited by: §1, §1, §4.1, §4.2.
  • W. W. Cohen (1995) Fast effective rule induction. In Machine Learning Proceedings, pp. 115–123. Cited by: §5.
  • T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein (2009) Introduction to algorithms. MIT press. Cited by: §2.5, footnote 1.
  • P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis (2009) Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47 (4), pp. 547–553. Cited by: §4.1.
  • D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §4.1.
  • A. Fernández, S. del Río, N. V. Chawla, and F. Herrera (2017) An insight into imbalanced big data classification: outcomes and challenges. Complex & Intelligent Systems 3 (2), pp. 105–120. Cited by: §1.
  • L. Györfi, M. Kohler, A. Krzyzak, and H. Walk (2006) A distribution-free theory of nonparametric regression. Springer Science & Business Media. Cited by: §A.1.
  • P. R. Hahn, J. S. Murray, and C. M. Carvalho (2020) Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Analysis. Cited by: §5.
  • G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing (2017) Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications 73, pp. 220–239. Cited by: §1.
  • H. Han, W. Wang, and B. Mao (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pp. 878–887. Cited by: §1, §4.2, §4.2.
  • H. He, Y. Bai, E. A. Garcia, and S. Li (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In IEEE International Joint Conference on Neural Networks, pp. 1322–1328. Cited by: §1, §4.1, §4.1, §4.2.
  • H. He and E. A. Garcia (2008) Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering (9), pp. 1263–1284. Cited by: §1.
  • B. Krawczyk (2016) Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence 5 (4), pp. 221–232. Cited by: §1.
  • A. Nobel (1996) Histogram regression estimation using data-dependent partitions. Annals of Statistics 24 (3), pp. 1084–1105. Cited by: §A.1.
  • J. R. Quinlan (2014) C4.5: programs for machine learning. Elsevier. Cited by: §2, §5.
  • J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.
  • E. Scornet, G. Biau, and J. Vert (2015) Consistency of random forests. Annals of Statistics 43 (4), pp. 1716–1741. Cited by: §A.1, §A.2.
  • J. P. Siebert (1987) Vehicle recognition using rule based methods. Cited by: §4.1.
  • J. Stefanowski and S. Wilk (2008) Selective pre-processing of imbalanced data for improving classification performance. In International Conference on Data Warehousing and Knowledge Discovery, pp. 283–292. Cited by: §1.
  • A. B. Tsybakov (2004) Optimal aggregation of classifiers in statistical learning. Annals of Statistics 32 (1), pp. 135–166. Cited by: §A.1.
  • Q. Wang, X. Zhou, C. Wang, Z. Liu, J. Huang, Y. Zhou, C. Li, H. Zhuang, and J. Cheng (2019) WGAN-based synthetic minority over-sampling technique: improving semantic fine-grained classification for lung nodules in ct images. IEEE Access 7, pp. 18450–18463. Cited by: §1.

Appendix A Proofs

We prove Theorems 1 and 2 here. Proofs for Corollary 1, the bounds in Examples 1-2 and the lemmas in the Appendix are given in the supplemental material. Without loss of generality, we adopt a convenient normalization throughout this section.

A.1 Proof of Theorem 1

The proof builds on Nobel (1996), Györfi et al. (2006), Tsybakov (2004) and Scornet et al. (2015). We first establish a sufficient condition for consistency, showing that a classification tree whose signed impurity converges to an oracle bound is consistent. We then break the difference between the signed impurity of the fitted tree and the oracle bound into two parts. The first is estimation error, which goes to zero if the number of leaves increases sufficiently slowly relative to the sample size; the second is approximation error, which goes to zero if the conditional class probability becomes approximately constant within each leaf node and the penalty parameter goes to zero.

Denote the conditional probability of the minority class given the features, together with its weighted version, as in Section 2.1, and define the oracle lower bound for the signed impurity as the expected impurity evaluated at these conditional probabilities. The following lemma shows that if the signed impurity of a sequence of classification trees converges to this oracle bound as $n \to \infty$, the classifiers associated with the trees are consistent.

Lemma 1.

Let $\{T_n\}$ be a sequence of classification trees and let $f_n$ be the classifier associated with $T_n$. Then $\{f_n\}$ is consistent if the signed tree impurity of $T_n$ converges to the oracle lower bound in probability.

We then decompose the difference between signed impurity of and the oracle bound into estimation error and approximation error.

Lemma 2.

Let $T_n$ be a classification tree trained from the data and let $A_1, \ldots, A_m$ be all the leaf nodes of $T_n$. Define the set of classifiers that are constant on each leaf node of $T_n$. Then the difference between the signed impurity of $T_n$ and the oracle lower bound is bounded by the sum of an estimation error term and an approximation error term, as stated in equation (6).

The first term on the right hand side of equation (6) is the “estimation error”, which measures the difference between functions evaluated under the empirical and true distributions. The second term is “approximation error”, which measures the ability of to approximate the oracle prediction function. The next two lemmas show both terms go to zero in probability.

Lemma 3.

If the number of leaf nodes grows sufficiently slowly relative to the sample size, the estimation error term converges to zero in probability.

Lemma 4.

As $n \to \infty$, if the leaf nodes shrink appropriately and the SVR penalty parameter goes to zero, the approximation error term converges to zero in probability.

Proof of Theorem 1.

Combining Lemmas 1, 2, 3 and 4 completes the proof.

A.2 Proof of Theorem 2

The proof of Theorem 2 consists of two main parts. The first works with the true distribution, proving that under it, the partition with the highest impurity decrease is always in a non-redundant feature; the second works with the randomness brought by