Instance-Based Classification through Hypothesis Testing

01/03/2019 ∙ by Zengyou He, et al. ∙ Dalian University of Technology ∙ Harbin Institute of Technology

Classification is a fundamental problem in machine learning and data mining. During the past decades, numerous classification methods have been presented based on different principles. However, most existing classifiers cast the classification problem as an optimization problem and do not address the issue of statistical significance. In this paper, we formulate the binary classification problem as a two-sample testing problem. More precisely, our classification model is a generic framework that is composed of two steps. In the first step, the distance between the test instance and each training instance is calculated to derive two distance sets. In the second step, the two-sample test is performed under the null hypothesis that the two sets of distances are drawn from the same cumulative distribution. After these two steps, we have two p-values for each test instance and the test instance is assigned to the class associated with the smaller p-value. Essentially, the presented classification method can be regarded as an instance-based classifier based on hypothesis testing. The experimental results on 40 real data sets show that our method is able to achieve the same level of performance as the state-of-the-art classifiers and has significantly better performance than existing testing-based classifiers. Furthermore, we can handle outlying instances and control the false discovery rate of test instances assigned to each class under the same framework.


1 Introduction

Classification is a fundamental data analysis procedure, which is ubiquitously used across different fields. Thousands of classification algorithms (classifiers) have been developed during the past decades [1]. These classifiers range from simple models such as k-nearest neighbors (k-NN) [2] to more sophisticated models such as support vector machine (SVM) [3] and random forests (RF) [4].

Despite the advances in the development of new classifiers, no single classification algorithm achieves the best performance on all data sets [1]. This indicates that different classifiers complement each other in different contexts. Therefore, it is still necessary to develop new and alternative classifiers based on principles that remain unexplored.

The motivation behind this research is based on the following observations. First, existing non-lazy classifiers typically formulate the classification problem as an optimization problem. Such optimization-based learning strategies can always generate the target classifiers, regardless of the statistical significance of learnt models. Second, classifiers such as logistic regression are able to provide probability values for categorizing an unknown test instance. However, it is not an easy task to determine a universal probability threshold to ensure that the classification of the test instance into the corresponding class is statistically significant. Last but not least, existing classifiers cannot control the number of misclassified test instances in terms of metrics such as false discovery rate (FDR). Such capability is quite important in the scenario of biological data analysis, in which the prediction results will be further validated by wet-lab experiments that can be costly and time-consuming [5]. Thus, we need to add some notion of statistical significance to classifiers.

In fact, the classification problem has already been formulated as a hypothesis testing issue in [6]. More recently, several research efforts [7], [8] have further extended the initial formulation in [6] in different respects. However, the following observations motivate this research. First of all, existing testing-based classification methods suffer from certain theoretical drawbacks, as discussed and summarized in Section 2. Second, only simulation data sets and several small real data sets have been empirically tested, making it difficult to convince practitioners of the practical usefulness of such a testing-based formulation. Third, the connection between this new formulation and existing classification methods has never been discussed. Finally, the potential benefit of the testing-based classification model remains unexplored.

Based on the above observations, we present a new testing-based classification formulation, in which the null hypothesis is that, informally, the test instance does not belong to any class. To precisely define the null hypothesis, we focus on the classification problem in a two-class setting. First, we calculate the distance between the test instance and each training instance in the training data set. In this way, we generate two sets of distances for one test instance that needs to be classified. Then, the hypothesis testing issue can be cast as a two-sample testing problem [9], in which each sample corresponds to a set of distances. In this formulation, the null hypothesis is that the two sets of distances are drawn from the same cumulative distribution.

Two-sample testing is a fundamental problem in statistics. We employ the classical Wilcoxon-Mann-Whitney (WMW) test for quantifying the statistical significance in terms of p-values. To alleviate the effect of outlying and irrelevant training instances, we further apply the WMW test to two distance sets that are generated from k-NNs of the test instance.

The testing-based classification formulation has several salient features. First of all, it can provide p-values for each test instance to quantify the statistical significance of classifying this instance to certain classes. Accordingly, we can detect outlying test instances that do not belong to any class if the p-values with respect to all classes are larger than the significance level threshold. Second, we can control the FDR of test instances that are assigned to each class based on their p-values.

We evaluate our method on forty data sets from the UCI [10] repository and the KEEL-dataset repository [11] with respect to the standard classification task. The experimental results show that our method is able to achieve the same level of performance as the state-of-the-art classifiers. Meanwhile, it can handle outlying test instances and control the FDR of test instances assigned to each class in a natural manner.

The main contributions of this paper can be summarized as follows.

(1) The binary classification issue is formulated as a two-sample testing problem. Since two-sample testing is a fundamental problem in statistics and many well-known tests are available in the literature, it can be expected that we may introduce many effective testing-based classifiers in the near future.

(2) The classification model that integrates hypothesis testing and the k-NN method is presented. This formulation can alleviate the effect of outlying and irrelevant training instances to improve the classification accuracy significantly.

(3) A comprehensive performance comparison over 40 real data sets is conducted. The experimental results demonstrate that the testing-based classifier is able to achieve the same level of performance as standard classifiers such as SVM and decision tree.

(4) Some interesting connections between our testing-based classifiers and existing classification methods are presented.

(5) The advantage of the testing-based classification model on handling outliers and controlling the Type I error rate in terms of FDR is empirically investigated.

The rest of this paper is organized as follows. Section 2 discusses some previous works that are related to our method. Section 3 presents the details of our method. Section 4 reports experimental results on 40 real data sets. Section 5 discusses the relationship between our method and other approaches. Finally, Section 6 concludes this paper.

2 Related Work

2.1 Instance-based learning

Instance-based learning is a lazy learning scheme in which the training instances are simply stored. When a new instance is encountered, a set of similar training instances are retrieved to classify the unknown testing instance. The most basic instance-based method is the k-nearest neighbor algorithm (k-NN) [2] [12], which assigns a new instance to the most common class among its k-NNs in training instances.

Essentially, our method can be considered as an instance-based learning approach since the two-sample test is conducted on the distance sets generated from all training instances or k-NNs. This indicates that it is feasible to apply techniques developed for instance-based learning during the past decades (e.g. [13], [14], [15]) to further improve our method.

2.2 Classification based on hypothesis testing

Liao & Akritas [6] introduce a classification method based on hypothesis testing, which is abbreviated to TBC. Suppose there are two classes (positive vs. negative) in the training set, i.e., a binary classification problem; the task is to allocate a new instance $z$ to one of the two classes. The basic idea of TBC is that, if $z$ is placed into the wrong class, then the difference between the two samples will be blurred. To implement this idea, two tests with respect to the equality of the means of two samples are conducted, in which $z$ is placed into the set of positive instances and the set of negative instances, respectively. Accordingly, we will obtain two p-values $p_+$ and $p_-$, where $p_+$ ($p_-$) is generated from the test in which $z$ is assumed to belong to the positive (negative) class. If $p_+ < p_-$, then $z$ is classified as a positive instance. Otherwise, $z$ will be classified as a negative instance. This method works well when the theoretical p-values can be computed and compared. However, TBC has two problems. First, when the number of features of the data set is larger than the sample size of one class, the p-values cannot be computed at all because of the singularity of the sample covariance matrix. Second, when the instances from the two classes are well separated, both p-values will be equal to zero.

Ghimire & Wang [7] improve the TBC method by introducing a minimum distance into the method and come up with a new classifier for image pixels. Their new method works well in the context of image pixel classification.

Modarres [16], [17], [18] studies the properties of squared Euclidean interpoint distances (IPDs) between different samples which are taken from multivariate Bernoulli, multivariate Poisson and multinomial distributions. He also discusses some applications based on IPDs within one sample and across two samples from different distributions.

Afterwards, Guo & Modarres [8] develop a classification method based on hypothesis testing, which is abbreviated to IDC. It is capable of classifying high dimensional instances by employing testing methods based on the IPDs between different instances. Several different test statistics based on IPDs have been discussed in [8] and we will take the Baringhaus and Franz (BF) statistic as the example. Given two sets of training instances, i.e., one positive set $D^+$ and one negative set $D^-$, IDC first computes the average IPDs within $D^+$, within $D^-$ and between $D^+$ and $D^-$. Then, it calculates the BF statistic from these three averages. Similarly, two further BF values can be obtained by placing the test instance $z$ into $D^+$ and $D^-$, respectively. The resulting two quantities measure the change in the value of BF when $z$ is assigned to $D^+$ ($D^-$). Therefore, $z$ is labelled as a positive or a negative instance by comparing these two changes.

2.3 Asymmetric classification error control

In binary classification, most classifiers are constructed to minimize the overall classification error, which is a weighted sum of type I error (misclassifying a negative instance as a positive one) and type II error (misclassifying a positive instance as a negative one). However, in many realistic applications, different types of errors are often asymmetric, which have different costs and need to be treated with different weights.

The cost-sensitive classification (CSC) method [19], [20] can solve this problem to some extent. It takes the misclassification costs into consideration and aims to minimize the total cost of both errors. Another method is the Neyman-Pearson (NP) classification [21], which is inspired by classical NP hypothesis testing. It is a novel statistical framework for handling asymmetric type I/II error priorities and can seek a classifier that minimizes the type II error while maintaining the type I error below a user-specified level [22], [23]. CSC and NP classification are fundamentally different approaches that have their own pros and cons [21]. A main advantage of the NP classification is that it is a general framework that allows users to control the type I classification error under the user-specified level with high probability.

It is very easy to control the type I error in terms of FDR in our formulation since the p-values of each test instance with respect to different classes will be generated in the classification phase. In other words, such testing-based classification formulation provides a unified framework for controlling the asymmetric classification error in a natural way.

3 Methods

3.1 Two-sample testing

Given two independent random samples $X = \{X_1, \ldots, X_n\}$ and $Y = \{Y_1, \ldots, Y_m\}$, where $X$ is drawn from the $X$ population and $Y$ is drawn from the $Y$ population, the general two-sample testing problem is concerned with the null hypothesis that the two samples are drawn from identical populations [9]:

$$H_0: F_X(u) = F_Y(u) \quad \text{for all } u,$$

where $F_X$ and $F_Y$ are the cumulative distribution functions for the $X$ population and the $Y$ population, respectively.

3.2 Problem formulation

We consider the binary classification problem, in which the training set $D$ is composed of two disjoint sets $D^+$ and $D^-$. $D^+$ and $D^-$ are called the positive training set and the negative training set, respectively. Given a test instance $z$, the classification task is to decide its class label (positive vs. negative).

We formulate the binary classification problem as a two-sample testing problem. In this formulation, the first sample $X$ is a set of $n$ observations, where the $i$th observation is the distance between the test instance and the $i$th training instance $a_i$ in $D^+$, i.e. $X_i = d(z, a_i)$. Similarly, each observation in the second sample $Y$ is the distance between the test instance and a training instance $b_j$ in $D^-$, i.e. $Y_j = d(z, b_j)$ for $j = 1, \ldots, m$.

To conduct the standard classification task, we may test the null hypothesis $H_0: F_X(u) = F_Y(u)$ against two alternative hypotheses $H_1: F_X(u) \ge F_Y(u)$ (the distances to $D^+$ tend to be smaller) and $H_2: F_X(u) \le F_Y(u)$ (the distances to $D^-$ tend to be smaller) to obtain two one-sided p-values ($p_+$ and $p_-$). If $p_+ < p_-$, we will label $z$ as a positive instance. Otherwise, we will classify $z$ as a negative instance.

To handle the multi-class classification problem with $Q$ classes ($Q > 2$), we can use the one-vs-rest strategy by regarding the set of instances from one class as the positive training set and using the set of instances from the remaining classes as the negative training set. For each of the $Q$ binary classification problems, we first conduct the two-sample test to generate a one-sided p-value for the corresponding class. Then, we can assign the test instance to the class that has the smallest p-value.
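The two-step formulation above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the Euclidean distance, the function names (`euclidean`, `midranks`, `wmw_one_sided`, `classify_binary`, `classify_one_vs_rest`) and the use of the WMW rank-sum normal approximation without tie or continuity correction are all choices made here for illustration.

```python
import math
from statistics import NormalDist


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def midranks(values):
    """1-based ranks of values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average
        start = end + 1
    return ranks


def wmw_one_sided(x, y):
    """p-value for the alternative 'x tends to be smaller than y'.

    Rank-sum statistic with a plain normal approximation
    (no tie or continuity correction; an assumption of this sketch).
    """
    n, m = len(x), len(y)
    w = sum(midranks(list(x) + list(y))[:n])
    mean = n * (n + m + 1) / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    return NormalDist().cdf((w - mean) / sd)


def classify_binary(z, positives, negatives):
    """Step 1: build the two distance sets. Step 2: compare the one-sided p-values."""
    x = [euclidean(z, a) for a in positives]
    y = [euclidean(z, b) for b in negatives]
    p_pos = wmw_one_sided(x, y)  # small when z is close to the positive set
    p_neg = wmw_one_sided(y, x)  # small when z is close to the negative set
    return ("positive" if p_pos < p_neg else "negative"), p_pos, p_neg


def classify_one_vs_rest(z, train_by_class):
    """train_by_class maps a class label to its list of training instances."""
    pvalues = {}
    for label, members in train_by_class.items():
        rest = [inst for other, insts in train_by_class.items()
                if other != label for inst in insts]
        pvalues[label] = wmw_one_sided([euclidean(z, a) for a in members],
                                       [euclidean(z, b) for b in rest])
    return min(pvalues, key=pvalues.get), pvalues
```

For example, `classify_binary((0.2, 0.1), positives, negatives)` with a positive cluster around the origin and a negative cluster around (5, 5) returns the label `"positive"` together with both p-values, which can then be reused for outlier handling or FDR control.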

3.3 K-NN variants

In the above problem formulation, the distances to all training instances are utilized in the hypothesis testing. However, the existence of outlying and irrelevant training instances may decrease the classification accuracy. To alleviate this issue, we can conduct the hypothesis testing on two samples that are derived from the k-NNs of the test instance.

Two natural k-NN variants can be formulated. Similar to the k-NN classifier, the first variant directly takes the k-NNs of the test instance to generate two samples: the distances from the test instance to these k nearest training instances are divided into two groups according to the class label, where each group corresponds to one sample in our scenario. The second variant takes the $k_1$ nearest instances from $D^+$ and the $k_2$ nearest instances from $D^-$ to generate two distance sets, where $k_1$ and $k_2$ are proportional to the sizes of $D^+$ and $D^-$. The rationale behind the second variant is that, if the null hypothesis is true, then the number of k-NNs from each class is proportional to the number of training instances in that class. Since $k_1 = k_2$ when the two training sets have the same size, we can take the same number of k-NNs from each class in this case.
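The two k-NN variants described above differ only in how the two distance samples are selected. A minimal sketch, assuming the distances from the test instance to each class have already been computed; the function names `knn_direct` and `knn_separate` are hypothetical, and the per-class neighbor counts of the second variant are passed in explicitly because the exact allocation rule is not recoverable from the text here.

```python
def knn_direct(dist_pos, dist_neg, k):
    """Variant 1: take the k smallest distances overall, then split them by class."""
    tagged = [(d, "+") for d in dist_pos] + [(d, "-") for d in dist_neg]
    nearest = sorted(tagged, key=lambda pair: pair[0])[:k]
    x = [d for d, cls in nearest if cls == "+"]
    y = [d for d, cls in nearest if cls == "-"]
    return x, y


def knn_separate(dist_pos, dist_neg, k_pos, k_neg):
    """Variant 2: take the nearest neighbors from each class separately.

    k_pos and k_neg are supplied by the caller (hypothetical parameters).
    """
    return sorted(dist_pos)[:k_pos], sorted(dist_neg)[:k_neg]
```

Note that in the first variant one of the two samples can be empty when all k neighbors come from a single class; how that degenerate case is resolved is not specified in the text, so any handling of it would be an additional assumption.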

3.4 The choice of testing methods

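The testing method for two-sample differences has been extensively investigated in the literature. One widely used test for this issue is the WMW test, which is also called the Mann-Whitney U test or Wilcoxon rank-sum test [24]. To obtain the test statistic in the WMW test, the two samples $X$ and $Y$ (of sizes $n$ and $m$) are merged to form a combined sample $S = X \cup Y$ of size $N = n + m$. Then, the observations in $S$ are ordered:

$$S_{(1)} \le S_{(2)} \le \cdots \le S_{(N)}.$$

According to the ordered list, $R_i$ is defined as the rank of $X_i$ in $S$ and $W = \sum_{i=1}^{n} R_i$. If the null hypothesis is true, then

$$\frac{W - \mu_W}{\sigma_W} \sim N(0, 1) \quad \text{approximately},$$

where

$$\mu_W = \frac{n(N+1)}{2}, \qquad \sigma_W^2 = \frac{nm(N+1)}{12}.$$

Based on the above normal approximation, we can calculate the one-sided p-value to test $H_0$ against $H_1: F_X(u) \ge F_Y(u)$ ($H_2: F_X(u) \le F_Y(u)$) for all $u$, with strict inequality for some $u$.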
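As a small worked example of the normal approximation, consider two hypothetical distance samples with no ties. The numbers below are made up for illustration, and the rank-sum form of the statistic with its standard null mean and variance is assumed.

```latex
\begin{align*}
X &= \{1, 2, 4\}, \quad Y = \{3, 5, 6, 7\}, \quad n = 3,\ m = 4,\ N = n + m = 7 \\
W &= 1 + 2 + 4 = 7 \quad \text{(ranks of the $X$ values in the combined sample)} \\
\mu_W &= \frac{n(N+1)}{2} = 12, \qquad \sigma_W^2 = \frac{nm(N+1)}{12} = 8 \\
\frac{W - \mu_W}{\sigma_W} &= \frac{7 - 12}{\sqrt{8}} \approx -1.77, \qquad
p = \Phi(-1.77) \approx 0.038
\end{align*}
```

The small one-sided p-value indicates that the first set of distances tends to be smaller than the second, so under the decision rule of Section 3.2 the test instance would be assigned to the class that generated the first sample.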

In our classification model, the choice of testing method is very flexible since the samples to be tested are unidimensional. That is, we can use any univariate two-sample testing method in our classifier. Therefore, we can also employ the testing methods such as pooled t-test, two-sample Kolmogorov-Smirnov test [25] and precedence test instead of the WMW test. In Section 5, we will further show that the use of different testing methods will establish the connection between our formulation and existing classification models.

3.5 Handling outliers and FDR control

As we have argued, the testing-based classification model has the advantage of controlling the FDR of classified test instances and handling outlying instances under the same framework. In general, we will assign the test instance to the class that has the smallest p-value among Q p-values, where Q is the number of classes. However, it is inappropriate to do so when none of the Q p-values is significant. Fortunately, we can use FDR control [26] to tackle this problem. We can obtain Q sets of p-values from all test instances because our method returns Q p-values to classify every test instance. Every p-value set is first sorted in non-descending order: $p_{(1)} \le p_{(2)} \le \cdots \le p_{(T)}$, where $T$ is the number of all test instances. Given a significance level $\alpha$, let $i^*$ be the largest index for which

$$p_{(i^*)} \le \frac{i^*}{T}\,\alpha.$$

If $i \le i^*$, then the test instance corresponding to $p_{(i)}$ will be assigned to the current class. After conducting FDR control on all Q p-value sets, we can label the test instances that are not classified to any class as outliers.
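The sorting-and-threshold rule described above can be sketched as follows. The sketch assumes that the FDR procedure of [26] is the usual Benjamini-Hochberg style step-up rule (largest rank i with sorted p-value at most i/T times the significance level); the function names `fdr_accept` and `assign_with_outliers` are hypothetical, and returning every accepted class for an instance is a choice of this sketch since the text does not say how multiple acceptances are resolved.

```python
def fdr_accept(pvalues, alpha=0.05):
    """Return a boolean per p-value: True if it passes the step-up FDR rule."""
    total = len(pvalues)
    order = sorted(range(total), key=lambda i: pvalues[i])
    cutoff = 0  # largest 1-based rank whose sorted p-value is <= rank / total * alpha
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / total * alpha:
            cutoff = rank
    accepted = [False] * total
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            accepted[idx] = True
    return accepted


def assign_with_outliers(pvalue_sets, alpha=0.05):
    """pvalue_sets maps a class label to the list of p-values of all test instances.

    Returns, for each test instance, the list of classes that accepted it;
    an empty list marks the instance as an outlier.
    """
    count = len(next(iter(pvalue_sets.values())))
    result = [[] for _ in range(count)]
    for label, pvalues in pvalue_sets.items():
        for j, ok in enumerate(fdr_accept(pvalues, alpha)):
            if ok:
                result[j].append(label)
    return result
```

Because the cutoff is the largest qualifying rank, a p-value that fails its own threshold can still be accepted when a larger p-value further down the sorted list passes, which is what distinguishes this step-up rule from a fixed per-instance threshold.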

4 Experiments

4.1 Data sets and experimental settings

We have conducted experiments on 40 data sets from the UCI [10] repository and the KEEL-dataset repository [11]. Among these data sets, the number of instances ranges from 80 to 10092 and the number of features varies from 2 to 90. Most data sets have fewer than 10 classes and only six of them have more than 10 classes. The detailed characteristics of these data sets are given in Appendix A. Moreover, the instances with missing values are discarded and the numeric feature values are normalized into a fixed interval in the pre-processing step.

In the experiment, we perform 10-fold cross-validation (CV) and count the number of instances which have been correctly classified to compute a classification accuracy value. For every data set, we repeat the 10-fold CV experiment 10 times and record the average and standard deviation of 10 accuracy values as the final results.
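The evaluation protocol above (10 repetitions of 10-fold CV, one accuracy value per repetition) can be sketched as follows. This is an illustrative sketch only: the function name `repeated_cv_accuracy`, the `fit_predict` callback signature, the unstratified random folds and the use of the sample standard deviation are assumptions, since the text does not specify these details.

```python
import random
import statistics


def repeated_cv_accuracy(data, labels, fit_predict, folds=10, repeats=10, seed=0):
    """Mean and standard deviation of the accuracy over repeated k-fold CV.

    fit_predict(train_x, train_y, test_x) must return one predicted label
    per test instance (hypothetical callback interface).
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        indices = list(range(len(data)))
        rng.shuffle(indices)
        correct = 0
        for fold in range(folds):
            test_idx = indices[fold::folds]  # every index lands in exactly one fold
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            predictions = fit_predict([data[i] for i in train_idx],
                                      [labels[i] for i in train_idx],
                                      [data[i] for i in test_idx])
            correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(data))  # one accuracy value per repetition
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Any classifier, including the testing-based one, can be plugged in through `fit_predict`, so the same loop yields comparable accuracy figures across methods.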

Methods      Avg accuracy
IBT-U        0.6795
IBT-U-K-D    0.8027
IBT-U-K-S    0.7906
TABLE I: The average accuracy over forty data sets for IBT-U and IBT-U-K variants (k=3).

Methods      k=3      k=5      k=7      k=9
IBT-U-K-D    0.8027   0.7835   0.7677   0.7547
IBT-U-K-S    0.7906   0.7829   0.7742   0.7703
TABLE II: The average accuracy over forty data sets for two IBT-U-K variants.

4.2 All instances vs. k-NNs

In the first experiment, we compare several variants of our formulation to check which one is better in practice. Since our method is a classifier that combines instance-based learning and hypothesis testing, we will use the abbreviation IBT to denote such a classification model. To distinguish different variants, IBT-U is used to denote the classification model when the Mann-Whitney U test is applied to the distance sets derived from all training instances. Similarly, IBT-U-K is used to denote the classification model in which the distance sets are generated according to k-NNs of the test instance. Furthermore, two k-NN variants are denoted by IBT-U-K-D (k-NNs are obtained Directly without considering the class label) and IBT-U-K-S (k-NNs are obtained Separately from different classes), respectively.

Additionally, the parameter k for the two k-NN variants is set to 3, 5, 7 and 9, respectively. The detailed experimental results on these three variants are given in Appendices B, C and D and their average accuracies are summarized in Table 1 and Table 2.

As shown in Table 1, the performance of IBT-U is much worse than that of two k-NN variants. This indicates that it is plausible to explore the k-NN strategy in the testing-based classification model. As shown in Table 2, the average classification accuracies of two k-NN variants are quite similar when k is varied from 3 to 9. In the forthcoming sections, we will use IBT-U-K-D (k=3) as a representative of our classifiers in the performance comparison.

4.3 Our method vs. Other testing-based classifiers

In the second experiment, we compare our method with two previous methods, TBC [6] and IDC [8], which also use hypothesis testing to solve a classification problem. The detailed experimental results are given in Appendix E and their average accuracies are presented in Table 3.

In the implementation of TBC, we employ Hotelling's $T^2$ test as the testing method, which has been utilized in [6]. We use the Hotelling's $T^2$ statistics instead of p-values in the classification since the generated p-values are often zero. In the implementation of IDC, we use the Baringhaus and Franz (BF) statistic as the test statistic and assume equal prior probabilities in spite of unequal sample sizes.

For TBC, the classification accuracies on five data sets (Cleveland, Dermatology, Hepatitis, Movement_libras and Winequality-red) are 0 because the number of features of these data sets is larger than the sample size of one class, so we only use the remaining 35 data sets to compute the average classification accuracy. IDC can be applied to all data sets, so we simply compute the average of 40 accuracy values. According to the comparison results, our method performs significantly better than TBC and IDC.

Among these three methods, our method achieves the best performance for the following reasons. First, our method only considers the k-NNs of the test instance, while TBC and IDC utilize all training instances without considering the existence of outlying and irrelevant ones. Second, our method employs a hypothesis testing strategy that is totally different from that used in TBC and IDC.

Methods      Avg accuracy
TBC          0.5901
IDC          0.6859
Our method   0.8027
TABLE III: The average accuracy for three testing-based classification methods: TBC, IDC and our method (IBT-U-K-D, k=3).

Methods      Avg accuracy
k-NN         0.8058
SVM          0.7928
DT           0.8003
Our method   0.8027
TABLE IV: The average accuracy for three classic classifiers: k-NN (k=3), SVM, decision tree (DT) and our method (IBT-U-K-D, k=3).

4.4 Our method vs. Classic classifiers

In the third experiment, we compare our method with three classic classifiers: k-NN, support vector machine (SVM) and decision tree (DT). The detailed experimental results are given in Appendix F and G and their average accuracies are presented in Table 4.

For SVM, k-NN and DT, we use the functions fitcecoc, fitcknn and fitctree with their default parameter settings in Matlab 2018b, respectively. The reason for using fitcecoc function is that it can generate a multi-class model for SVM.

As shown in Table 4, our method is able to achieve the same level of performance as these classic classifiers. Concretely, among the 40 data sets, there are 13, 19 and 18 data sets on which our method produces higher classification accuracy than k-NN, SVM and DT, respectively. In short, our method is competitive with these classic classifiers with respect to the overall performance.

4.5 Handling outliers through FDR control

In the last experiment, we investigate the potential of our method on outlier detection and FDR control. The balance data set from UCI is used as an example, which has 625 instances and three classes (L, B and R). There are 288, 49 and 288 instances in the three classes respectively, as shown in Table 5. If we take a subset of the 576 (288+288) instances from classes L and R as training instances and use the 49 instances from class B as test instances, then it is obvious that all test instances should be considered as outliers.

We randomly take 80 percent of the instances from classes L and R to compose the training set. In order to obtain the average performance, 10 different random training sets are generated. We use IBT-U as the classifier and the significance level for FDR is set to 0.05. The experimental results show that 48 of 49 test instances are labelled as outliers on average. Specifically, there are at most 2 test instances which cannot be labelled as outliers, and they are usually different when the training set is different. Therefore, our method is able to recognize outliers and control the FDR of classification results at the same time.

5 Relationship to Other Approaches

Our classification method is a two-phase approach: two distance sets are first generated and then the two-sample test is conducted. As we have discussed, we may use different significance testing methods in the second phase. In this section, we will show that the use of different testing methods leads to different classifiers that are closely related to existing classification models.

5.1 Connection to Nearest Centroid Classifier

The nearest centroid (mean) classifier is one of the most widely used instance-based classification models [27]. In the training phase, only the centroid for each class is calculated and stored. In the classification phase, the distance between one unknown instance and each centroid is calculated to find the nearest centroid. Then, this new test instance is assigned to the class of its nearest centroid.

If the pooled t-test is employed as the significance testing procedure in our model, then we can reveal some interesting connections between our method and the nearest centroid classifier. To simplify the analysis, we first consider the scenario of a univariate data set and then discuss the case of a multivariate data set.

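Given two one-dimensional sets $D^+ = \{a_1, \ldots, a_n\}$ and $D^- = \{b_1, \ldots, b_m\}$, their centroids (means) can be easily computed by $c_+ = \frac{1}{n}\sum_{i=1}^{n} a_i$ and $c_- = \frac{1}{m}\sum_{j=1}^{m} b_j$. Given an unknown instance $z$, the distances between $z$ and these two centroids can be measured by $|z - c_+|$ and $|z - c_-|$. The nearest centroid classification method will assign $z$ to the positive or the negative class according to whether $|z - c_+| < |z - c_-|$.

In our method, two samples $X = \{|z - a_i|\}_{i=1}^{n}$ and $Y = \{|z - b_j|\}_{j=1}^{m}$ are obtained and their means are denoted by $\bar{X}$ and $\bar{Y}$. Then, we test the null hypothesis $H_0: \mu_X = \mu_Y$ against two alternative hypotheses $H_1: \mu_X < \mu_Y$ and $H_2: \mu_X > \mu_Y$ on the two samples to obtain two one-sided p-values ($p_+$ and $p_-$). At last, our method will assign $z$ to the positive (negative) class if $p_+ < p_-$ ($p_+ \ge p_-$).

Note that when the pooled t-test is employed in our method, we will obtain two t statistics ($t_1$ and $t_2$). We can get

$$t_1 = \frac{\bar{X} - \bar{Y}}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}},$$

where $s_p$ is the pooled standard deviation of the two samples. Similarly, we can also get $t_2 = -t_1$. Therefore, our method will assign $z$ to the positive class if $\bar{X} < \bar{Y}$. Otherwise, we will label $z$ as a negative instance.

According to the triangle inequality, we can get

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} |z - a_i| \ge \left| z - \frac{1}{n}\sum_{i=1}^{n} a_i \right| = |z - c_+|,$$

in which the equality holds if and only if $z \ge \max_i a_i$ or $z \le \min_i a_i$. Similarly, we can get $\bar{Y} \ge |z - c_-|$, in which the equality holds if and only if $z \ge \max_j b_j$ or $z \le \min_j b_j$.

When $\bar{X} = |z - c_+|$ and $\bar{Y} = |z - c_-|$, our method will assign the test instance to the same class label as the nearest centroid classification method. Obviously, the above analysis establishes the equivalence between our method and the nearest centroid classifier under very strict constraints: (1) one-dimensional data set, (2) the test instance is no less (more) than all training instances in each class.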

For the multivariate case, it is very difficult to analyze their relationship in a quantitative manner. One naive connection is that if , then our method and the nearest centroid classification method will produce the same classification result.

5.2 Connection to k-NN Classifier

The k-NN classifier is one of the most popular classification methods in the literature [28]. In our formulation, if the precedence test [9] is employed as the significance testing method, then we may uncover some interesting connections between our method and the k-NN classifier.

We still consider the binary classification problem in which the training data is composed of positive instances from $D^+$ and negative instances from $D^-$. Given an unknown instance $z$, the k-NN classification method finds its k nearest neighbors (k-NNs) to conduct the classification. These k-NNs can be divided into two groups: $k_1$ positive instances from $D^+$ and $k_2$ negative instances from $D^-$, where $k_1 + k_2 = k$. If $k_1 > k_2$, then $z$ will be classified as a positive instance. Otherwise, $z$ is assigned to the negative class.

The precedence test is a two-sample test based on the order of early failures [29]. Given two independent samples, $X$ and $Y$, let $X_{(1)} \le \cdots \le X_{(n)}$ and $Y_{(1)} \le \cdots \le Y_{(m)}$ denote their order statistics. The precedence test is based on the number of observations from one sample which exceed (precede) some threshold specified by the other sample. More precisely, the test statistic $P_1$ is the number of observations in $X$ that precede the r-th order statistic $Y_{(r)}$ from $Y$. Alternatively, one can use the number of observations in $Y$ that exceed the s-th order statistic $X_{(s)}$ from $X$ as the test statistic $P_2$. Large values of these two test statistics will lead to the rejection of the null hypothesis that the two distributions are equal.
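The two precedence statistics just described amount to simple counts against an order statistic of the other sample. A minimal sketch, with hypothetical function names and 1-based indices for the order statistics; it computes only the counts and makes no claim about how the thresholds or p-values are chosen in the classifier.

```python
def precede_count(x, y, r):
    """Number of observations in x that precede the r-th smallest value of y (r is 1-based)."""
    threshold = sorted(y)[r - 1]
    return sum(1 for value in x if value < threshold)


def exceed_count(x, y, s):
    """Number of observations in y that exceed the s-th smallest value of x (s is 1-based)."""
    threshold = sorted(x)[s - 1]
    return sum(1 for value in y if value > threshold)
```

With distance samples, a large `precede_count` means that many distances to one class are smaller than an early order statistic of the distances to the other class, which is the same kind of evidence a k-NN vote counts.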

In our problem formulation, () is the distance set between and the instances in (). Then, will be the k distance values between and its -NNs. If we use the precedence test as the significance testing method and suppose that , we can set to obtain the corresponding test statistic for testing the null hypothesis against the alternative hypothesis (). Alternatively, if we let , we can obtain another test statistic for testing the null hypothesis against the alternative hypothesis (). And we can also get two p-values, and . At last, will be assigned to the positive (negative) class if the former (latter) is smaller.

If we further assume that the positive training set and the negative training set have the same size, then the two p-values will be totally determined by the two test statistics. Therefore, our method and the k-NN classifier will generate the same classification result under the above assumptions. From this aspect, we may regard our method equipped with the precedence test as a generalized "statistical" k-NN classifier.

6 Conclusion

Due to the importance of the classification problem, many effective classification algorithms have been proposed by different research communities. However, most work on classification does not address the issue of statistical significance. Towards this direction, several initial research efforts have investigated the feasibility of constructing a classifier through significance testing. Unfortunately, this interesting idea has not received much attention during the past 10 years. This is mainly for the following reasons: (1) there are still no such testing-based classifiers that can achieve the same level of performance as the state-of-the-art methods on real data sets; (2) the potential benefit of deploying such testing-based classifiers is still not clear.

Based on the above observations, this paper takes one step further towards this direction by formulating the classification problem as a two-sample testing problem. This new formulation enables us to generate several testing-based classifiers that have comparable performance with standard classifiers such as SVM. In addition, we show that it is quite easy to handle outlying test instances and control the FDR of classification results based on the p-values associated with each test instance.

We believe this paper will contribute to the development of testing-based classification models, which may become a promising new family of classifiers. As the study of testing-based classification models is still in its infancy, many research issues remain unexplored and should be investigated in future work. For example, since all the existing testing-based classifiers are based on the idea of instance-based learning, how to build a non-lazy testing-based classifier is an interesting and challenging issue.

Appendix A

The detailed characteristics of the forty data sets are given in Table 5.

ID Names Instances Features Classes Class Distribution Download Links
1 Appendicitis 106 7 2 85/21 KEEL
2 Balance 625 4 3 288/49/288 UCI, KEEL
3 Banana 5300 2 2 2924/2376 KEEL
4 Bands 365(539) 19 2 230/135 UCI, KEEL
5 Bupa 345 6 2 145/200 UCI, KEEL
6 Cleveland 297(303) 13 5 160/54/35/35/13 UCI, KEEL
7 Dermatology 358(366) 34 6 111/60/71/48/48/20 UCI, KEEL
8 Haberman 306 3 2 225/81 UCI, KEEL
9 Hayes-roth 160 4 3 65/64/31 UCI, KEEL
10 Heart 270 13 2 150/120 UCI, KEEL
11 Hepatitis 80(155) 19 2 13/67 UCI, KEEL
12 Ionosphere 351 34 2 225/126 UCI, KEEL
13 Iris 150 4 3 50/50/50 UCI, KEEL
14 Led7digit 500 7 10 45/37/51/57/52/52/47/57/53/49 UCI, KEEL
15 Mammographic 830(961) 5 2 427/403 UCI, KEEL
16 Marketing 6876(8993) 13 9 1255/529/505/618/527/846/784/1069/743 KEEL
17 Monks-2 432 7 2 290/142 UCI, KEEL
18 Movement_libras 360 90 15 24/24/24/24/24/24/24/24/24/24/24/24/24/24/24 UCI, KEEL
19 Newthyroid 215 5 3 150/35/30 UCI, KEEL
20 Page-blocks 5473 10 5 4913/329/28/88/115 UCI, KEEL
21 Penbased 10092 16 10 1143/1143/1144/1055/1144/1055/1056/1142/1055/1055 UCI, KEEL
22 Phoneme 5404 5 2 3818/1586 UCL, KEEL
23 Pima 768 8 2 500/268 UCI, KEEL
24 Ring 7400 20 2 3664/3736 TORONTO, KEEL
25 Satimage 6435 36 7 1533/703/1358/626/707/0/1508 UCI, KEEL
26 Segment 2310 19 7 330/330/330/330/330/330/330 UCI, KEEL
27 Sonar 208 60 2 97/111 UCI, KEEL
28 Spambase 4597(4601) 57 2 2788/1813 UCI, KEEL
29 Spectfheart 267 44 2 55/212 UCI, KEEL
30 Tae 151 5 3 49/50/52 UCI, KEEL
31 Texture 5500 40 11 500/500/500/500/500/500/500/500/500/500/500 UCL, KEEL
32 Thyroid 7200 21 3 166/368/6666 UCI, KEEL
33 Titanic 2201 3 2 1490/711 TORONTO, KEEL
34 Twonorm 7400 20 2 3703/3697 TORONTO, KEEL
35 Vehicle 846 18 4 212/218/199/217 UCI, KEEL
36 Vowel 990 13 11 90/90/90/90/90/90/90/90/90/90/90 UCI, KEEL
37 Wdbc 569 30 2 357/212 UCI, KEEL
38 Wine 178 13 3 59/71/48 UCI, KEEL
39 Winequality-red 1599 11 6 10/53/681/638/199/18 UCI, KEEL
40 Wisconsin 683(699) 9 2 444/239 UCI, KEEL
TABLE V: The detailed characteristics of the forty data sets. For each data set, the Instances column gives the number of instances without missing values, followed in parentheses by the total number including instances with missing values where the two differ. The Class Distribution column gives the number of instances in every class. The last column lists the repositories from which the corresponding data set can be downloaded.

Appendix B

The detailed experimental results of IBT-U are given in Table 6.

ID Names Avg Std
1 Appendicitis 0.8557 0.0046
2 Balance 0.8800 0.0039
3 Banana 0.5998 0.0017
4 Bands 0.6405 0.0128
5 Bupa 0.5574 0.0170
6 Cleveland 0.5505 0.0048
7 Dermatology 0.8944 0.0041
8 Haberman 0.7144 0.0166
9 Hayes-roth 0.5581 0.0221
10 Heart 0.8241 0.0047
11 Hepatitis 0.8088 0.0084
12 Ionosphere 0.6638 0.0033
13 Iris 0.9567 0.0047
14 Led7digit 0.7206 0.0076
15 Mammographic 0.7952 0.0000
16 Marketing 0.2995 0.0015
17 Monks-2 0.5185 0.0149
18 Movement_libras 0.3883 0.0146
19 Newthyroid 0.8581 0.0025
20 Page-blocks 0.9043 0.0005
21 Penbased 0.5566 0.0005
22 Phoneme 0.7172 0.0008
23 Pima 0.7233 0.0032
24 Ring 0.5049 0.0000
25 Satimage 0.7262 0.0005
26 Segment 0.7923 0.0013
27 Sonar 0.6861 0.0204
28 Spambase 0.8241 0.0008
29 Spectfheart 0.4097 0.0054
30 Tae 0.3861 0.0125
31 Texture 0.7414 0.0009
32 Thyroid 0.3158 0.0015
33 Titanic 0.7760 0.0000
34 Twonorm 0.9770 0.0003
35 Vehicle 0.4375 0.0086
36 Vowel 0.2748 0.0060
37 Wdbc 0.9404 0.0010
38 Wine 0.9416 0.0039
39 Winequality-red 0.5131 0.0035
40 Wisconsin 0.9458 0.0000
Avg 0.6795 0.0055
TABLE VI: The detailed experimental results of IBT-U.

Appendix C

The detailed experimental results of IBT-U-K-D are given in Table 7.

ID Names k=3 k=5 k=7 k=9
Avg Std Avg Std Avg Std Avg Std
1 Appendicitis 0.8283 0.0116 0.7764 0.0141 0.7642 0.0252 0.7170 0.0209
2 Balance 0.7782 0.0030 0.7528 0.0039 0.7184 0.0065 0.6834 0.0078
3 Banana 0.8642 0.0016 0.8500 0.0020 0.8338 0.0024 0.8238 0.0013
4 Bands 0.6978 0.0132 0.6726 0.0147 0.6564 0.0121 0.6452 0.0226
5 Bupa 0.5986 0.0087 0.5948 0.0116 0.5797 0.0196 0.5713 0.0119
6 Cleveland 0.5380 0.0074 0.5091 0.0191 0.4707 0.0122 0.4609 0.0150
7 Dermatology 0.9402 0.0046 0.9349 0.0076 0.9179 0.0106 0.9101 0.0072
8 Haberman 0.6585 0.0118 0.6585 0.0180 0.6261 0.0135 0.5971 0.0096
9 Hayes-roth 0.7500 0.0189 0.7256 0.0163 0.7038 0.0232 0.6969 0.0221
10 Heart 0.7552 0.0088 0.6856 0.0126 0.6722 0.0142 0.6652 0.0160
11 Hepatitis 0.8150 0.0115 0.7850 0.0287 0.7425 0.0251 0.7363 0.0161
12 Ionosphere 0.8556 0.0052 0.8575 0.0059 0.8558 0.0043 0.8541 0.0078
13 Iris 0.9600 0.0054 0.9420 0.0077 0.9053 0.0129 0.9127 0.0097
14 Led7digit 0.5770 0.0091 0.5230 0.0162 0.4604 0.0089 0.4286 0.0072
15 Mammographic 0.7171 0.0061 0.7045 0.0084 0.6745 0.0069 0.6508 0.0055
16 Marketing 0.2573 0.0016 0.2567 0.0025 0.2553 0.0024 0.2480 0.0027
17 Monks-2 0.7704 0.0124 0.7683 0.0174 0.7745 0.0182 0.7745 0.0170
18 Movement_libras 0.8181 0.0086 0.8036 0.0113 0.7978 0.0084 0.7875 0.0155
19 Newthyroid 0.9614 0.0058 0.9581 0.0062 0.9470 0.0070 0.9474 0.0054
20 Page-blocks 0.9534 0.0013 0.9466 0.0015 0.9405 0.0013 0.9361 0.0016
21 Penbased 0.9931 0.0002 0.9915 0.0005 0.9896 0.0004 0.9876 0.0005
22 Phoneme 0.8900 0.0014 0.8675 0.0022 0.8516 0.0020 0.8415 0.0033
23 Pima 0.6915 0.0089 0.6634 0.0096 0.6406 0.0135 0.6319 0.0134
24 Ring 0.7894 0.0013 0.7948 0.0016 0.8003 0.0020 0.8041 0.0018
25 Satimage 0.8949 0.0012 0.8827 0.0027 0.8706 0.0022 0.8634 0.0022
26 Segment 0.9640 0.0017 0.9572 0.0017 0.9513 0.0017 0.9396 0.0027
27 Sonar 0.8630 0.0089 0.8452 0.0115 0.8260 0.0084 0.7957 0.0109
28 Spambase 0.8978 0.0017 0.8704 0.0021 0.8458 0.0026 0.8306 0.0017
29 Spectfheart 0.6835 0.0149 0.6408 0.0129 0.6431 0.0188 0.6015 0.0131
30 Tae 0.5874 0.0125 0.5139 0.0285 0.5192 0.0319 0.5099 0.0207
31 Texture 0.9889 0.0005 0.9845 0.0010 0.9814 0.0009 0.9766 0.0008
32 Thyroid 0.9038 0.0012 0.8834 0.0020 0.8663 0.0016 0.8457 0.0019
33 Titanic 0.7897 0.0009 0.7899 0.0013 0.7717 0.0049 0.7564 0.0010
34 Twonorm 0.9381 0.0014 0.9194 0.0019 0.9006 0.0018 0.8880 0.0018
35 Vehicle 0.6833 0.0060 0.6619 0.0102 0.6426 0.0078 0.6344 0.0093
36 Vowel 0.9862 0.0027 0.9767 0.0024 0.9743 0.0023 0.9618 0.0028
37 Wdbc 0.9499 0.0037 0.9387 0.0062 0.9250 0.0060 0.9178 0.0057
38 Wine 0.9506 0.0069 0.9365 0.0046 0.9298 0.0130 0.9022 0.0113
39 Winequality-red 0.6196 0.0052 0.5790 0.0080 0.5444 0.0032 0.5225 0.0061
40 Wisconsin 0.9492 0.0022 0.9384 0.0031 0.9388 0.0041 0.9290 0.0053
Avg 0.8027 0.0060 0.7835 0.0085 0.7677 0.0091 0.7547 0.0085
TABLE VII: The detailed experimental results of IBT-U-K-D.

Appendix D

The detailed experimental results of IBT-U-K-S are given in Table 8.

ID Names k=3 k=5 k=7 k=9
Avg Std Avg Std Avg Std Avg Std
1 Appendicitis 0.7585 0.0119 0.7594 0.0156 0.7896 0.0214 0.8047 0.0100
2 Balance 0.7282 0.0047 0.7490 0.0070 0.7494 0.0069 0.7878 0.0083
3 Banana 0.8826 0.0011 0.8885 0.0014 0.8941 0.0015 0.8965 0.0010
4 Bands 0.6915 0.0097 0.6734 0.0129 0.6770 0.0100 0.6575 0.0129
5 Bupa 0.6232 0.0189 0.6188 0.0140 0.6101 0.0101 0.6168 0.0118
6 Cleveland 0.4879 0.0131 0.4845 0.0092 0.4916 0.0091 0.4889 0.0081
7 Dermatology 0.9567 0.0020 0.9536 0.0024 0.9489 0.0042 0.9464 0.0018
8 Haberman 0.6010 0.0104 0.6173 0.0117 0.6281 0.0111 0.6212 0.0147
9 Hayes-roth 0.7325 0.0218 0.6038 0.0341 0.4988 0.0206 0.4850 0.0236
10 Heart 0.7833 0.0066 0.7985 0.0063 0.8037 0.0086 0.8026 0.0065
11 Hepatitis 0.7950 0.0087 0.8150 0.0053 0.8000 0.0118 0.8063 0.0106
12 Ionosphere 0.8698 0.0030 0.8695 0.0042 0.8678 0.0047 0.8667 0.0032
13 Iris 0.9587 0.0042 0.9600 0.0054 0.9593 0.0073 0.9587 0.0061
14 Led7digit 0.7088 0.0049 0.7242 0.0075 0.7336 0.0075 0.7324 0.0065
15 Mammographic 0.7760 0.0028 0.8037 0.0024 0.8060 0.0045 0.8083 0.0036
16 Marketing 0.2922 0.0027 0.2996 0.0028 0.3052 0.0018 0.3084 0.0027
17 Monks-2 0.7752 0.0084 0.7426 0.0109 0.7384 0.0096 0.7153 0.0095
18 Movement_libras 0.7839 0.0083 0.7106 0.0095 0.6264 0.0079 0.5964 0.0113
19 Newthyroid 0.9577 0.0056 0.9507 0.0050 0.9577 0.0056 0.9535 0.0066
20 Page-blocks 0.8574 0.0012 0.8441 0.0013 0.8377 0.0016 0.8427 0.0010
21 Penbased 0.9934 0.0002 0.9919 0.0003 0.9902 0.0002 0.9889 0.0003
22 Phoneme 0.8736 0.0018 0.8656 0.0010 0.8568 0.0016 0.8506 0.0017
23 Pima 0.7250 0.0045 0.7354 0.0069 0.7316 0.0052 0.7311 0.0067
24 Ring 0.7155 0.0021 0.6885 0.0012 0.6687 0.0013 0.6539 0.0016
25 Satimage 0.9024 0.0016 0.9021 0.0013 0.8988 0.0013 0.8959 0.0009
26 Segment 0.9601 0.0014 0.9515 0.0015 0.9506 0.0018 0.9487 0.0016
27 Sonar 0.8375 0.0101 0.8341 0.0129 0.7947 0.0072 0.7683 0.0136
28 Spambase 0.9047 0.0018 0.9031 0.0011 0.9048 0.0013 0.9026 0.0017
29 Spectfheart 0.6296 0.0101 0.5906 0.0092 0.5918 0.0092 0.5809 0.0076
30 Tae 0.5318 0.0179 0.5252 0.0269 0.5152 0.0112 0.5099 0.0259
31 Texture 0.9868 0.0004 0.9835 0.0007 0.9811 0.0005 0.9785 0.0007
32 Thyroid 0.7707 0.0016 0.7826 0.0024 0.7568 0.0016 0.7504 0.0021
33 Titanic 0.7601 0.0000 0.7607 0.0013 0.7883 0.0008 0.7892 0.0001
34 Twonorm 0.9667 0.0008 0.9710 0.0005 0.9726 0.0005 0.9732 0.0007
35 Vehicle 0.7116 0.0067 0.7047 0.0119 0.6974 0.0067 0.6918 0.0070
36 Vowel 0.9606 0.0048 0.8629 0.0095 0.7551 0.0113 0.6969 0.0093
37 Wdbc 0.9645 0.0026 0.9664 0.0021 0.9680 0.0031 0.9685 0.0029
38 Wine 0.9534 0.0075 0.9528 0.0054 0.9517 0.0060 0.9573 0.0039
39 Winequality-red 0.4826 0.0070 0.4994 0.0057 0.4946 0.0066 0.5063 0.0078
40 Wisconsin 0.9750 0.0034 0.9755 0.0029 0.9739 0.0013 0.9735 0.0016
Avg 0.7906 0.0059 0.7829 0.0068 0.7742 0.0061 0.7703 0.0064
TABLE VIII: The detailed experimental results of IBT-U-K-S.

Appendix E

The detailed experimental results of TBC and IDC are given in Table 9.

ID Names TBC IDC
Avg Std Avg Std
1 Appendicitis 0.8613 0.0064 0.8075 0.0101
2 Balance 0.8654 0.0050 0.7618 0.0065
3 Banana 0.5568 0.0013 0.7313 0.0019
4 Bands 0.6088 0.0115 0.5841 0.0141
5 Bupa 0.6275 0.0088 0.5803 0.0086
6 Cleveland 0 0 0.4892 0.0126
7 Dermatology 0 0 0.8746 0.0066
8 Haberman 0.7310 0.0064 0.6876 0.0222
9 Hayes-roth 0.5288 0.0053 0.4744 0.0238
10 Heart 0.8396 0.0072 0.8170 0.0040
11 Hepatitis 0 0 0.8475 0.0211
12 Ionosphere 0.8695 0.0057 0.7513 0.0043
13 Iris 0.6667 0.0000 0.9060 0.0021
14 Led7digit 0.2622 0.0109 0.4736 0.0080
15 Mammographic 0.8088 0.0016 0.7982 0.0017
16 Marketing 0.2652 0.0029 0.1284 0.0018
17 Monks-2 0.5294 0.0206 0.6391 0.0140
18 Movement_libras 0 0 0.2642 0.0169
19 Newthyroid 0.3023 0.0000 0.8377 0.0056
20 Page-blocks 0.0750 0.0006 0.8892 0.0007
21 Penbased 0.1998 0.0000 0.6636 0.0004
22 Phoneme 0.7595 0.0005 0.7684 0.0010
23 Pima 0.7615 0.0041 0.7177 0.0028
24 Ring 0.7621 0.0008 0.9603 0.0004
25 Satimage 0.3448 0.0002 0.6317 0.0010
26 Segment 0.2857 0.0000 0.6624 0.0022
27 Sonar 0.7447 0.0159 0.7226 0.0116
28 Spambase 0.9064 0.0011 0.8376 0.0006
29 Spectfheart 0.6105 0.0122 0.7528 0.0138
30 Tae 0.4974 0.0096 0.4020 0.0153
31 Texture 0.3163 0.0013 0.5477 0.0022
32 Thyroid 0.0713 0.0001 0.9239 0.0005
33 Titanic 0.7807 0.0008 0.6900 0.0015
34 Twonorm 0.9781 0.0003 0.9771 0.0001
35 Vehicle 0.5324 0.0196 0.3018 0.0041
36 Vowel 0.1818 0.0000 0.2899 0.0060
37 Wdbc 0.9617 0.0023 0.9374 0.0017
38 Wine 0.6011 0.0000 0.9472 0.0060
39 Winequality-red 0 0 0.4021 0.0026
40 Wisconsin 0.9581 0.0012 0.9555 0.0016
Avg 0.5901 0.0047 0.6859 0.0066
TABLE IX: The detailed experimental results of TBC and IDC.

Appendix F

The detailed experimental results of k-NN are given in Table 10.

ID Names k=3 k=5 k=7 k=9
Avg Std Avg Std Avg Std Avg Std
1 Appendicitis 0.8406 0.0094 0.8642 0.0119 0.8764 0.0030 0.8708 0.0100
2 Balance 0.8485 0.0065 0.8661 0.0059 0.8813 0.0048 0.8928 0.0048
3 Banana 0.8841 0.0014 0.8896 0.0012 0.8942 0.0021 0.8978 0.0012
4 Bands 0.7093 0.0122 0.6942 0.0122 0.6797 0.0083 0.6712 0.0098
5 Bupa 0.6371 0.0113 0.6078 0.0130 0.6238 0.0121 0.6293 0.0134
6 Cleveland 0.5545 0.0152 0.5545 0.0057 0.5663 0.0115 0.5626 0.0117
7 Dermatology 0.9623 0.0033 0.9592 0.0027 0.9575 0.0039 0.9517 0.0040
8 Haberman 0.6954 0.0109 0.6944 0.0082 0.7111 0.0054 0.7186 0.0070
9 Hayes-roth 0.6350 0.0187 0.5575 0.0255 0.4344 0.0215 0.3581 0.0228
10 Heart 0.7778 0.0089 0.8033 0.0066 0.8126 0.0068 0.8115 0.0069
11 Hepatitis 0.8288 0.0145 0.8525 0.0255 0.8800 0.0134 0.8563 0.0169
12 Ionosphere 0.8570 0.0044 0.8501 0.0054 0.8393 0.0041 0.8425 0.0043
13 Iris 0.9507 0.0034 0.9560 0.0034 0.9673 0.0066 0.9527 0.0049
14 Led7digit 0.6598 0.0077 0.7116 0.0047 0.7090 0.0058 0.7234 0.0041
15 Mammographic 0.7678 0.0055 0.7981 0.0067 0.7999 0.0051 0.8027 0.0050
16 Marketing 0.2872 0.0030 0.2942 0.0015 0.2990 0.0025 0.3050 0.0020
17 Monks-2 0.7972 0.0072 0.8000 0.0054 0.7914 0.0127 0.7644 0.0074
18 Movement_libras 0.8075 0.0049 0.7417 0.0103 0.7181 0.0090 0.6739 0.0218
19 Newthyroid 0.9409 0.0044 0.9381 0.0058 0.9316 0.0054 0.9237 0.0050
20 Page-blocks 0.9596 0.0012 0.9583 0.0009 0.9545 0.0009 0.9536 0.0006
21 Penbased 0.9935 0.0004 0.9926 0.0004 0.9919 0.0003 0.9905 0.0003
22 Phoneme 0.8878 0.0021 0.8808 0.0028 0.8752 0.0017 0.8701 0.0023
23 Pima 0.7396 0.0055 0.7367 0.0072 0.7449 0.0055 0.7357 0.0046
24 Ring 0.7186 0.0014 0.6922 0.0010 0.6747 0.0012 0.6608 0.0017
25 Satimage 0.9096 0.0012 0.9078 0.0011 0.9065 0.0015 0.9049 0.0019
26 Segment 0.9613 0.0020 0.9532 0.0014 0.9502 0.0015 0.9481 0.0015
27 Sonar 0.8303 0.0072 0.8135 0.0115 0.7880 0.0135 0.7457 0.0175
28 Spambase 0.9019 0.0021 0.9030 0.0015 0.8995 0.0013 0.8959 0.0023
29 Spectfheart 0.7150 0.0134 0.7390 0.0149 0.7629 0.0142 0.7547 0.0124
30 Tae 0.5119 0.0153 0.5219 0.0184 0.5086 0.0253 0.4927 0.0263
31 Texture 0.9878 0.0005 0.9853 0.0005 0.9828 0.0007 0.9809 0.0007
32 Thyroid 0.9391 0.0008 0.9407 0.0005 0.9401 0.0005 0.9400 0.0002
33 Titanic 0.6109 0.0107 0.7796 0.0118 0.7819 0.0013 0.7816 0.0034
34 Twonorm 0.9650 0.0010 0.9697 0.0007 0.9705 0.0008 0.9714 0.0006
35 Vehicle 0.7033 0.0051 0.7025 0.0054 0.7039 0.0055 0.6941 0.0096
36 Vowel 0.9706 0.0025 0.9387 0.0057 0.8871 0.0071 0.7972 0.0108
37 Wdbc 0.9692 0.0017 0.9678 0.0024 0.9705 0.0027 0.9692 0.0028
38 Wine 0.9640 0.0039 0.9573 0.0089 0.9596 0.0052 0.9567 0.0088
39 Winequality-red 0.5839 0.0062 0.5902 0.0069 0.5797 0.0040 0.5803 0.0042
40 Wisconsin 0.9691 0.0022 0.9742 0.0024 0.9728 0.0019 0.9706 0.0021
Avg 0.8058 0.0060 0.8085 0.0067 0.8045 0.0060 0.7951 0.0069
TABLE X: The detailed experimental results of k-NN.

Appendix G

The detailed experimental results of SVM and DT are given in Table 11.

ID Names SVM DT
Avg Std Avg Std
1 Appendicitis 0.8736 0.0049 0.8358 0.0135
2 Balance 0.8698 0.0060 0.7894 0.0080
3 Banana 0.5517 0.0000 0.8799 0.0027
4 Bands 0.6877 0.0107 0.6285 0.0272
5 Bupa 0.5791 0.0018 0.6571 0.0183
6 Cleveland 0.5859 0.0104 0.5091 0.0079
7 Dermatology 0.9673 0.0019 0.9374 0.0058
8 Haberman 0.7340 0.0017 0.6935 0.0139
9 Hayes-roth 0.5144 0.0198 0.8181 0.0192
10 Heart 0.8374 0.0041 0.7581 0.0196
11 Hepatitis 0.8575 0.0278 0.8350 0.0269
12 Ionosphere 0.8821 0.0054 0.8806 0.0101
13 Iris 0.9613 0.0061 0.9487 0.0045
14 Led7digit 0.7392 0.0075 0.7114 0.0075
15 Mammographic 0.7959 0.0026 0.7988 0.0065
16 Marketing 0.3210 0.0014 0.2970 0.0032
17 Monks-2 0.6713 0.0000 0.9067 0.0130
18 Movement_libras 0.7197 0.0117 0.6572 0.0265
19 Newthyroid 0.8944 0.0062 0.9298 0.0060
20 Page-blocks 0.9342 0.0005 0.9649 0.0010
21 Penbased 0.9784 0.0004 0.9582 0.0010
22 Phoneme 0.7731 0.0008 0.8650 0.0032
23 Pima 0.7699 0.0032 0.7078 0.0105
24 Ring 0.7651 0.0008 0.8858 0.0028
25 Satimage 0.8646 0.0008 0.8608 0.0039
26 Segment 0.9303 0.0012 0.9568 0.0039
27 Sonar 0.7736 0.0169 0.7221 0.0185
28 Spambase 0.9031 0.0009 0.9190 0.0028
29 Spectfheart 0.7951 0.0018 0.7401 0.0155
30 Tae 0.5364 0.0219 0.5444 0.0168
31 Texture 0.9873 0.0003 0.9220 0.0030
32 Thyroid 0.9371 0.0001 0.9960 0.0004
33 Titanic 0.7760 0.0000 0.7898 0.0013
34 Twonorm 0.9783 0.0003 0.8431 0.0048
35 Vehicle 0.7356 0.0039 0.7139 0.0115
36 Vowel 0.7129 0.0062 0.7666 0.0111
37 Wdbc 0.9773 0.0027 0.9185 0.0040
38 Wine 0.9860 0.0040 0.9096 0.0107
39 Winequality-red 0.5841 0.0027 0.6077 0.0102
40 Wisconsin 0.9687 0.0021 0.9492 0.0042
Avg 0.7928 0.0050 0.8003 0.0095
TABLE XI: The detailed experimental results of SVM and DT.

Acknowledgments

This work was partially supported by the Natural Science Foundation of China (Nos. 61572094, 61771331) and the Fundamental Research Funds for the Central Universities (No. DUT2017TB02).

References

  • [1] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. G. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?” Journal of Machine Learning Research, vol. 15, no. 1, pp. 3133–3181, 2014.
  • [2] T. M. Cover and P. E. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
  • [3] C. Cortes and V. Vapnik, “Support vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
  • [4] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32, 2001.
  • [5] O. Wagih, J. Reimand, and G. D. Bader, “MIMP: Predicting the impact of mutations on kinase-substrate phosphorylation,” Nature Methods, vol. 12, no. 6, pp. 531–533, 2015.
  • [6] S.-M. Liao and M. Akritas, “Test-based classification: A linkage between classification and statistical testing,” Statistics & Probability Letters, vol. 77, no. 12, pp. 1269–1281, 2007.
  • [7] S. Ghimire and H. Wang, “Classification of image pixels based on minimum distance and hypothesis testing,” Computational Statistics & Data Analysis, vol. 56, no. 7, pp. 2273–2287, 2012.
  • [8] L. Guo and R. Modarres, “Interpoint distance classification of high dimensional discrete observations,” International Statistical Review, 2018.
  • [9] J. D. Gibbons and S. Chakraborti, Nonparametric statistical inference, 5th ed.   CRC Press, 2011.
  • [10] D. Dheeru and E. Karra Taniskidou, “UCI machine learning repository.” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
  • [11] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, and S. García, “Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic & Soft Computing, vol. 17, pp. 255–287, 2011.
  • [12] T. M. Mitchell, Machine Learning, 1st ed.   New York, NY, USA: McGraw-Hill, Inc., 1997.
  • [13] D. R. Wilson and T. R. Martinez, “Reduction techniques for instance-based learning algorithms,” Machine Learning, vol. 38, no. 3, pp. 257–286, 2000.
  • [14] S. Garcia, J. Derrac, J. Cano, and F. Herrera, “Prototype selection for nearest neighbor classification: Taxonomy and empirical study,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 417–435, 2012.
  • [15] J. Derrac, S. García, and F. Herrera, “Fuzzy nearest neighbor algorithms: Taxonomy, experimental analysis and prospects,” Information Sciences, vol. 260, pp. 98–119, 2014.
  • [16] R. Modarres, “On the interpoint distances of Bernoulli vectors,” Statistics & Probability Letters, vol. 84, pp. 215–222, 2014.
  • [17] ——, “Multivariate Poisson interpoint distances,” Statistics & Probability Letters, vol. 112, pp. 113–123, 2016.
  • [18] ——, “Multinomial interpoint distances,” Statistical Papers, vol. 59, no. 1, pp. 341–360, 2018.
  • [19] C. Elkan, “The foundations of cost-sensitive learning,” in International Joint Conference on Artificial Intelligence, vol. 17, no. 1.   Lawrence Erlbaum Associates Ltd, 2001, pp. 973–978.
  • [20] B. Zadrozny, J. Langford, and N. Abe, “Cost-sensitive learning by cost-proportionate example weighting,” in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on.   IEEE, 2003, pp. 435–442.
  • [21] C. Scott and R. Nowak, “A Neyman-Pearson approach to statistical learning,” IEEE Transactions on Information Theory, vol. 51, no. 11, pp. 3806–3819, 2005.
  • [22] X. Tong, Y. Feng, and A. Zhao, “A survey on Neyman-Pearson classification and suggestions for future research,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 8, no. 2, pp. 64–81, 2016.
  • [23] X. Tong, Y. Feng, and J. J. Li, “Neyman-Pearson classification algorithms and NP receiver operating characteristics,” Science Advances, vol. 4, no. 2, 2018. [Online]. Available: http://advances.sciencemag.org/content/4/2/eaao1659
  • [24] H. B. Mann and D. R. Whitney, “On a test of whether one of two random variables is stochastically larger than the other,” Annals of Mathematical Statistics, vol. 18, no. 1, pp. 50–60, 1947.
  • [25] J. Wang, W. W. Tsang, and G. Marsaglia, “Evaluating Kolmogorov’s distribution,” Journal of Statistical Software, vol. 8, no. 18, 2003.
  • [26] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995.
  • [27] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning.   Springer series in statistics New York, NY, USA, 2001.
  • [28] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu et al., “Top 10 algorithms in data mining,” Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.
  • [29] N. Balakrishnan and H. T. Ng, Precedence-type tests and applications.   John Wiley & Sons, Hoboken, NJ, 2006.