Abstaining Classification When Error Costs are Unequal and Unknown

06/09/2018 ∙ by Hongjiao Guan, et al. ∙ NetEase, Inc

Abstaining classification aims to reject classifying easily misclassified examples, so it is an effective approach to increasing classification reliability and reducing misclassification risk in cost-sensitive applications. In such applications, different types of errors (false positives or false negatives) usually have unequal costs, and the error costs, which depend on the specific application, are usually unknown. However, current abstaining classification methods either do not distinguish the error types, or they need the cost information of misclassification and rejection, which is realized in the framework of cost-sensitive learning. In this paper, we propose a bounded-abstention method with two constraints on the reject rates (BA2), which performs abstaining classification when error costs are unequal and unknown. BA2 aims to obtain the optimal area under the ROC curve (AUC) by constraining the reject rates of the positive and negative classes, respectively. Specifically, we construct the receiver operating characteristic (ROC) curve and stepwise search the optimal reject thresholds from both ends of the curve, until the two constraints are satisfied. Experimental results show that BA2 obtains higher AUC and lower total cost than state-of-the-art abstaining classification methods. Meanwhile, BA2 achieves controllable reject rates for the positive and negative classes.


1 Introduction

Pattern classification techniques have been widely applied to practical problems such as face/text recognition, fault detection, and medical diagnosis. A large number of approaches have been proposed to increase the classification accuracy over all examples. However, they neglect the classification reliability of each individual example, which matters especially in risk-associated fields. In such fields, the wrong classification of a specific example leads to serious consequences, such as enormous economic loss or irreversible harm. Abstaining classification [14], or classification with a reject option [10, 21], helps to improve reliability and reduce risk by abstaining on uncertain examples with low membership degree, since these examples are easily misclassified.

There are two reject rules in abstaining classification: Chow's rule and the ROC-based rule. Chow [3] initially presented the optimal reject rule in the framework of Bayesian theory. In Chow's rule, the classification of an example is rejected if its maximum posterior probability is less than a given threshold. Without distinction of error types and correct recognition types, the threshold is related to the ratio (c_e - c_r)/(c_e - c_c), where c_r, c_c, and c_e are the costs of rejecting, correctly classifying, or misclassifying an example, respectively. Chow's optimal reject rule is obtained only if exact knowledge of the posterior probabilities is available, which is impossible in practice. The alternative is to use estimated posterior probabilities, such as the probabilistic outputs of neural networks [8]. Also, the outputs of non-probabilistic classifiers can be converted into posterior probabilities using the sigmoid transformation function [16], isotonic regression [23], the histogram method [11], or a bootstrap sampling method [22]. However, the accuracy of the estimation significantly influences the performance of reject classification. Other methods, such as estimating the data distribution [4], the probability density function [9], or a confidence interval [5], have been proposed for classification with rejection.

The ROC-based rule has been proved theoretically equivalent to Chow's rule under a general cost term that distinguishes the wrong and correct recognition types [17], and research has shown that the ROC-based rule performs better than Chow's rule on real-world datasets [22, 13]. The ROC-based rule was initially proposed in [18], where two points on the ROC curve corresponding to two reject thresholds are determined by minimizing the total cost. This method is implemented using a support vector machine (SVM) in [19]. Pietraszek [14] proposes an ROC-based bounded-abstention model (BA), which minimizes a cost-weighted function while keeping the overall reject rate below a given value. A twin SVM with reject option (RO-TWSVM) is proposed in [12], which improves the previous SVM with reject option (RO-SVM) [19] by using twin SVM instead of SVM. These abstaining classification methods are cost-sensitive: they rely on cost information that is usually unknown in practical applications.

In this paper, we propose an ROC-based abstaining classification method, bounded abstention with two constraints on the reject rates (BA2), to overcome the limitations of using posterior probabilities and cost information. Note that in this paper we only consider binary classification. We aim to maximize the area under the ROC curve (AUC) under two constraints: the reject rates of the positive class and the negative class must not exceed two given bounds, respectively. This is realized by stepwise searching two points on the ROC curve starting from its two endpoints. When the two constraints are satisfied, the searching process stops, and the final two points correspond to the reject thresholds.

The proposed method has several advantages. BA2 is developed to obtain the maximum AUC value rather than to minimize the total cost, which avoids setting the unknown cost term. Furthermore, BA2 distinguishes the error types and constrains the reject rates of the positive class and the negative class separately. This is beneficial for controlling the respective performance of the two classes when the error costs are unequal.

2 Proposed Method

The ROC curve depicts different operating points using the false positive rate (fpr) as the x-axis and the true positive rate (tpr) as the y-axis, which visualizes the performance of a binary classifier. Note that fpr = FP/N, i.e., the number of misclassified negative examples (FP) divided by the number of all negative examples (N); tpr = TP/P, i.e., the number of positive examples that are correctly classified (TP) divided by the number of all positive examples (P). The ROC curve is insensitive to unequal misclassification costs [7], which makes it an effective tool to analyze classifiers' behavior.
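The construction of these operating points from classifier scores can be sketched as follows (a minimal Python sketch; the function name `roc_points` and the example data are our own illustration, not from the paper):

```python
# Hypothetical sketch: building the empirical ROC points (fpr, tpr)
# from classifier scores and binary labels.

def roc_points(scores, labels):
    """Return ROC operating points (fpr, tpr) from (0, 0) to (1, 1).

    scores: likeliness of belonging to the positive class (higher = more positive)
    labels: 1 for positive, 0 for negative
    """
    P = sum(labels)          # number of positive examples
    N = len(labels) - P      # number of negative examples
    # Sweep the decision threshold over the scores, from high to low.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i in order:
        if labels[i] == 1:
            tp += 1          # one more correctly classified positive
        else:
            fp += 1          # one more misclassified negative
        points.append((fp / N, tp / P))
    return points

pts = roc_points([0.9, 0.8, 0.7, 0.4, 0.3], [1, 1, 0, 1, 0])
```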

Assume that in the two-class classification problem, an example x obtains a score s(x) measuring the likeliness of belonging to the positive class. We aim to obtain two reject thresholds t1 and t2 (t1 < t2), and the corresponding reject rule is as follows:

    assign x to the positive class,  if s(x) > t2;
    assign x to the negative class,  if s(x) < t1;
    reject x,                        if t1 <= s(x) <= t2.    (1)
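The two-threshold reject rule can be sketched as a small function (an illustrative sketch; the name `classify_with_reject` and the string labels are our own):

```python
# Minimal sketch of a two-threshold reject rule: scores between the two
# thresholds are abstained on, scores outside are classified.

def classify_with_reject(score, t1, t2):
    """Apply the reject rule with thresholds t1 < t2."""
    if score > t2:
        return "positive"
    if score < t1:
        return "negative"
    return "reject"   # t1 <= score <= t2: abstain
```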

The idea of the proposed BA2 method is to maximize the AUC under two constraints on the two classes' reject rates. The problem is formalized as:

    maximize AUC, subject to rej_n <= k_n and rej_p <= k_p.    (2)

The two constraints are rej_n <= k_n and rej_p <= k_p. rej_n is defined as the ratio of the number of rejected negative examples (RN) to the number of all negative examples (N); rej_n = RN/N. rej_p is defined as the ratio of the number of rejected positive examples (RP) to the number of all positive examples (P); rej_p = RP/P. We use k_n and k_p to denote the maximum reject rates that rej_n and rej_p should not exceed, respectively.

Input: the ROC curve described by multiple points (fpr, tpr); k_n and k_p, the maximum reject rates that rej_n and rej_p should not exceed;
Output: t1 and t2, the two reject thresholds
1 Initialization: x1 = 0, x2 = 1;
2 step = 0.01;
3 while x1 < x2 do
4       Compute rej_n and rej_p according to equations (3) and (4), respectively;
5       if rej_n > k_n and rej_p > k_p then
6             if k_n = k_p then
7                  if rej_n > rej_p then
8                        x2 = x2 - step;
9                  else
10                        x1 = x1 + step;
11                   end if
12
13            else
14                  if rej_n / k_n > rej_p / k_p then
15                        x2 = x2 - step;
16                  else
17                        x1 = x1 + step;
18                   end if
19
20             end if
21            continue;
22
23       end if
24      if rej_n <= k_n and rej_p > k_p then
25            x1 = x1 + step; continue;
26       end if
27      if rej_n > k_n and rej_p <= k_p then
28            x2 = x2 - step; continue;
29       end if
30      break;
31 end while
32 Calculate the score thresholds corresponding to the locations of (x1, f(x1)) and (x2, f(x2)), i.e., t1 and t2.
Algorithm 1: Constructing the bounded-abstention classifier with two constraints (BA2)

We regard the ROC curve as a function f in the two-dimensional ROC space, so we can denote tpr = f(fpr). We denote the two false positive rates as x1 and x2; (x1, f(x1)) and (x2, f(x2)) are the two points on the ROC curve (x1 <= x2). According to the definitions, rej_n and rej_p can be expressed as [20]:

    rej_n = x2 - x1,            (3)
    rej_p = f(x2) - f(x1).      (4)
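Equations (3) and (4) can be sketched in Python, treating the ROC curve as a piecewise-linear function through its points (helper names `interp_roc` and `reject_rates` are our own assumptions):

```python
# Sketch of the reject rates induced by two operating points x1 <= x2 on a
# piecewise-linear ROC curve: rej_n = x2 - x1 and rej_p = f(x2) - f(x1).

def interp_roc(points, x):
    """Evaluate the piecewise-linear ROC curve tpr = f(fpr) at fpr = x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            if x1 == x0:          # vertical segment: take the upper point
                return max(y0, y1)
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside [0, 1]")

def reject_rates(points, x1, x2):
    rej_n = x2 - x1                                          # equation (3)
    rej_p = interp_roc(points, x2) - interp_roc(points, x1)  # equation (4)
    return rej_n, rej_p
```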

Let (x2, f(x2)) and (x1, f(x1)) start from the points (1, 1) and (0, 0), respectively, and stepwise compute rej_n and rej_p with step 0.01, until rej_n <= k_n and rej_p <= k_p. Algorithm 1 is the pseudo-code for constructing BA2. Note that, since the ROC curve obtained on real datasets has concavities, the convex hull of the points in the ROC plane (ROCCH) is usually constructed [6]. In this paper, when we mention the ROC curve, it refers to the corresponding ROCCH curve.
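Constructing the ROCCH amounts to taking the concave upper envelope of the ROC points. A minimal sketch follows (the function name `rocch` is ours; this is a standard monotone-chain upper hull, not code from the paper):

```python
# Sketch: the ROC convex hull (ROCCH) is the concave upper envelope of the
# ROC points, from (0, 0) to (1, 1).

def rocch(points):
    """Return the concave upper envelope of ROC points (sorted by fpr)."""
    pts = sorted(set(points))
    hull = []
    for p in pts:
        # Pop points that fall on or below the chord from hull[-2] to p,
        # i.e., points that create a concavity in the envelope.
        while len(hull) >= 2:
            (ax, ay), (bx, by) = hull[-2], hull[-1]
            if (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull
```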

Lines 5-26 in Algorithm 1 constitute the process of searching for the two optimal points (x1, f(x1)) and (x2, f(x2)) on the ROC curve. When the constraints are not satisfied, we increase the value of x1 or decrease the value of x2 with step 0.01. Below we analyze the searching process according to whether k_n and k_p are equal.

Figure 1: The illustration of the searching process

When k_n = k_p, k_n/k_p = 1 and lines 6-18 in Algorithm 1 can be simplified as:

    if rej_n > rej_p then x2 = x2 - 0.01; else x1 = x1 + 0.01.    (5)

This statement means that if the rejected negative rate is larger than the rejected positive rate, we decrease x2; otherwise, we increase x1. [20] indicates that if rej_n equals rej_p, the AUC of the abstention ROC (the ROC curve obtained by the abstaining classifier) is always larger than that of the original ROC, since in this case the abstention ROC dominates the original ROC. Hence, to obtain the maximum AUC of the abstaining classifier, when rej_n > rej_p we should decrease x2, i.e., move (x2, f(x2)) to the left. This is illustrated in Fig. 1: near (0, 0) the slope of the ROC curve is larger than 1, so moving (x1, f(x1)) to the right would decrease rej_p faster than rej_n, and the difference between the two reject rates would become larger, which violates the intention that rej_n and rej_p should be as close as possible. Likewise, when rej_n < rej_p, we should move (x1, f(x1)) to the right. If one of the two constraints is satisfied, lines 21-26 are enforced; in this situation, their enforcement is equivalent to the statement in (5). For example, if rej_n <= k_n and rej_p > k_p, then rej_n <= k_n = k_p < rej_p, i.e., rej_n < rej_p.

When k_n < k_p, i.e., k_n/k_p < 1, to make the reject rates controllable, rej_n/k_n and rej_p/k_p should be kept close. Hence, when rej_n/k_n > rej_p/k_p, we move (x2, f(x2)) to the left; otherwise, we move (x1, f(x1)) to the right. The process continues until rej_n <= k_n. At this moment, the constraint rej_p <= k_p is usually not yet satisfied, and line 25 of Algorithm 1 is performed. When k_n > k_p, i.e., k_n/k_p > 1, the opposite operation is carried out.

Once the constraints rej_n <= k_n and rej_p <= k_p are satisfied, the searching process terminates. Then we calculate the reject thresholds t1 and t2. The BA2 classifier based on t1 and t2 has the optimal AUC. If t2 becomes larger or t1 becomes smaller, the performance constraints are no longer satisfied. If t2 becomes smaller or t1 becomes larger, the rejection interval [t1, t2] becomes smaller; as a consequence, possible errors may not be rejected, and the AUC decreases.
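Putting the pieces together, the threshold search of Algorithm 1 can be sketched as follows. This is our reading of the algorithm, with the k_n = k_p and k_n ≠ k_p branches merged into one normalized comparison; names such as `ba2_search` are our own:

```python
# Sketch of the BA2 search: x1 and x2 start at 0 and 1 and move inward in
# steps of 0.01 until rej_n <= k_n and rej_p <= k_p. The ROC curve f is
# given as a Python function tpr = f(fpr).

def ba2_search(f, k_n, k_p, step=0.01):
    """Return the two operating points (x1, x2) on the ROC curve."""
    x1, x2 = 0.0, 1.0
    while x1 < x2:
        rej_n = x2 - x1              # equation (3)
        rej_p = f(x2) - f(x1)        # equation (4)
        if rej_n <= k_n and rej_p <= k_p:
            break                    # both constraints satisfied
        if rej_n > k_n and rej_p > k_p:
            # Shrink the side whose normalized reject rate is larger.
            if rej_n / k_n > rej_p / k_p:
                x2 -= step           # move (x2, f(x2)) to the left
            else:
                x1 += step           # move (x1, f(x1)) to the right
        elif rej_p > k_p:
            x1 += step               # only the positive-class bound fails
        else:
            x2 -= step               # only the negative-class bound fails
    return x1, x2
```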

3 Experimental Framework

3.1 Experimental Datasets and Setups

We perform the experiments using four real-world datasets, which are available from the UCI repository [2]. The characteristics of the four datasets [12] are listed in Table 1. For each dataset, we perform a stratified ten-fold cross-validation, in which nine folds are used to determine the reject thresholds t1 and t2, and the remaining fold is used to measure the performance of the compared methods. In the process of determining the thresholds, we utilize a nine-fold cross-validation to generate a smoother ROC curve: among the nine folds, we use eight folds as the training set and the remaining fold as the test set to generate an ROC curve. The resulting nine ROC curves are then averaged using the threshold averaging method [6]. The entire ten-fold cross-validation is repeated ten times, and we report the averaged performance metrics.
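The stratified splitting used above can be sketched with the standard library only (with scikit-learn one would typically use `StratifiedKFold` instead; the helper name `stratified_folds` is ours):

```python
# Sketch of a stratified k-fold split: indices of each class are shuffled
# and dealt round-robin into k folds, preserving the class ratio per fold.
import random

def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k folds with the class ratio preserved."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)   # deal indices round-robin per class
    return folds

labels = [1] * 30 + [0] * 70         # 30% positives, as an example
folds = stratified_folds(labels, k=10)
```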

dataset # Pos. # Neg. # Attr.
German credit 300 700 20
hepatitis 32 123 19
cmc 333 1140 9
abalone 335 3842 8
Table 1: Characteristics of the real-world datasets

In this paper, two groups of experiments are conducted to evaluate the BA2 method: a comparison with the bounded-abstaining classifier (BA) [15] in Section 3.2 and a comparison with the twin SVM with reject option (RO-TWSVM) [12] in Section 3.3.

3.2 Comparison of BA2 and BA

BA is selected for comparison with BA2 since the ideas of the two methods are similar: both search for the two reject thresholds using ROC curves while restricting the maximum ratios of rejected examples. The difference is that BA minimizes the misclassification cost while restricting the total reject rate to be less than a given value, whereas BA2 maximizes the AUC value while controlling the proportions of the rejected positive and negative examples separately. We use k-NN as the scoring classifier to build the ROC curve [20]; considering the small size of the datasets, we set k = 3. We use the area under the ROC curve (AUC), the sensitivity (sen), the rejected positive rate (rej_p), and the rejected negative rate (rej_n) as evaluation metrics.

                 0.1-reject        0.2-reject        0.3-reject
                 BA(1)    BA2      BA(1)    BA2      BA(1)    BA2
German credit
  AUC            0.7352   0.7352   0.7469   0.7509   0.7453   0.7632
  sen            0.4651   0.7159   0.4645   0.7390   0.4484   0.7595
  rej_p          0.1367   0.0937   0.2727   0.1917   0.3893   0.2953
  rej_n          0.0920   0.0906   0.1759   0.1984   0.2627   0.2964
hepatitis
  AUC            0.8807   0.9245   0.8859   0.9266   0.9246   0.9523
  sen            0.5476   0.7356   0.5690   0.7615   0.5994   0.8468
  rej_p          0.0717   0.0317   0.0857   0.0486   0.1155   0.0859
  rej_n          0.0692   0.0851   0.1705   0.1958   0.3106   0.2878
cmc
  AUC            0.6412   0.6581   0.6098   0.6619   0.5759   0.6649
  sen            0.2563   0.4993   0.2085   0.5171   0.1442   0.5204
  rej_p          0.1513   0.0901   0.3040   0.2018   0.4216   0.3010
  rej_n          0.0790   0.0885   0.1608   0.1975   0.2523   0.2910
abalone
  AUC            0.8469   0.8656   0.8195   0.8783   0.7843   0.8934
  sen            0.4635   0.7821   0.3877   0.8033   0.2573   0.8291
  rej_p          0.2970   0.0901   0.5509   0.1862   0.7294   0.2817
  rej_n          0.0842   0.0962   0.1719   0.2007   0.2658   0.2945
Table 2: Results of BA(1) (CR = 1) and BA2 at 0.1/0.2/0.3-reject.
                 BA(0.5)                              BA2
                 0.1-reject  0.2-reject  0.3-reject   (0.1, 0.2)  (0.1, 0.3)  (0.2, 0.3)
German credit
  AUC            0.7331      0.7353      0.7554       0.7200      0.6947      0.7413
  sen            0.8297      0.8380      0.8266       0.9192      0.9853      0.8934
  rej_p          0.0797      0.1603      0.2660       0.1040      0.1007      0.1980
  rej_n          0.1076      0.2164      0.3149       0.2127      0.2847      0.3019
hepatitis
  AUC            0.8975      0.9416      0.9579       0.9376      0.9293      0.9529
  sen            0.7938      0.7739      0.7871       0.9367      0.9789      0.9500
  rej_p          0.0277      0.0683      0.0865       0.0570      0.0564      0.0643
  rej_n          0.1147      0.2208      0.3018       0.1966      0.2792      0.2938
cmc
  AUC            0.6617      0.6656      0.6703       0.6503      0.6569      0.6543
  sen            0.6802      0.6872      0.6519       0.9128      0.9251      0.7927
  rej_p          0.0822      0.1751      0.2637       0.0161      0.0048      0.2110
  rej_n          0.1054      0.2114      0.3245       0.0355      0.0090      0.3375
abalone
  AUC            0.8656      0.8805      0.8884       0.8664      0.8598      0.8836
  sen            0.7780      0.7504      0.7280       0.9060      0.9769      0.8865
  rej_p          0.1080      0.2566      0.4225       0.0924      0.1056      0.1850
  rej_n          0.1034      0.1994      0.2924       0.1979      0.2965      0.3011
Table 3: Results of BA(0.5) (CR = 0.5) and BA2 with different values of k_p and k_n

Since BA only restricts the overall reject rate, to ensure comparability, we first set the same value for k_p and k_n in BA2, where k_p and k_n are the preset upper bounds that rej_p and rej_n should not exceed. We set the upper bounds of the reject rates to 0.1, 0.2, and 0.3, denoted as 0.1/0.2/0.3-reject. In BA, the cost is set by defining the cost ratio CR = c_FP/c_FN, where c_FP is the cost of misclassifying a negative example as positive, and c_FN is the cost of misclassifying a positive example as negative. We set CR = 1, which assumes that the error costs of the two classes are equal; this setting is denoted as BA(1). The results of AUC, sen, rej_p, and rej_n using 0.1/0.2/0.3-reject are shown in Table 2.

We can observe that BA2 has larger AUC and sensitivity values than BA(1) on the four real-world datasets. When the preset maximum reject rates increase, the AUC and sensitivity values increase in BA2, whereas in BA(1) they decrease on the datasets cmc and abalone. Also, in BA(1), the values of rej_p and rej_n are not controllable, and the values of rej_p are much higher than the preset bounds on the last two datasets. By contrast, the values of rej_p and rej_n in BA2 can be controlled via the parameters k_p and k_n. Controllable reject rates for the two classes are of great significance in practical applications, since acceptable k_p and k_n can be set according to the actual human and financial resources available.

Considering unequal error costs, we also compare BA and BA2 by setting CR ≠ 1 in BA and setting different values for k_p and k_n in BA2. Usually, in risk-related fields, the positive class has the higher error cost and fewer examples, and in the four datasets used in this paper, the number of positive examples is indeed smaller than that of negative examples. Therefore, we set CR = 0.5, following the setup in [15]. Likewise, because of the higher cost of the positive class, we set k_p < k_n. The results of BA with CR = 0.5 and BA2 using different values of k_p and k_n are shown in Table 3. BA(0.5) denotes the setting CR = 0.5 in BA, and (0.1, 0.2) means that the values of k_p and k_n are 0.1 and 0.2 in BA2, respectively.

Although the comparability between BA(0.5) and BA2 with different k_p and k_n is limited, an obvious observation is that when the AUC values of BA(0.5) and BA2 are similar, the sensitivity values of BA2 are much higher than those of BA(0.5). Moreover, in BA2, almost all the values of rej_p and rej_n remain controllable when k_p and k_n are set to different values. This is very important when error costs are unequal and the cost of a false negative is higher than that of a false positive: in this case, BA2 can be run with a small k_p and a large k_n, so that high sensitivity is obtained while good AUC results are achieved. In addition, in BA, the values of rej_p with CR = 0.5 decrease compared with those with CR = 1.

3.3 Comparison of BA2 and RO-TWSVM

RO-TWSVM improves the previous SVM with reject option (RO-SVM) [19] by using twin SVM instead of SVM. In RO-TWSVM, the ROC curve is built according to the scores obtained by TWSVM, and the reject thresholds are determined by minimizing the total cost. The total cost is defined as

    Cost = p (c_FN * fnr + c_TP * tpr + c_R * rej_p) + n (c_FP * fpr + c_TN * tnr + c_R * rej_n),    (6)

where p and n are the ratios of the positive examples and the negative examples in the training set, respectively; c_FP, c_FN, and c_R are the costs of false positive errors, false negative errors, and rejection, respectively; and c_TP and c_TN are the costs of true positives and true negatives, respectively. We adopt three of the cost models used in [12], which are shown in Table 4. We do not use the cost model CM2: in CM2, the mean cost of c_FN (Unif[0, 50]) is lower than the mean cost of c_FP (Unif[0, 100]), whereas in our experiments we assume that the misclassification cost of the positive class is higher than that of the negative class. To compare with RO-TWSVM, we also use TWSVM to obtain the example scores in BA2 (BA2-TWSVM), and we use the same four datasets (Table 5) as [12], which are available in the KEEL dataset repository [1].
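Under our reconstruction of equation (6), the total cost weights each outcome rate by its cost and by the class prior. A sketch follows (the dictionary keys and the function name `total_cost` are our own assumptions):

```python
# Sketch of the total cost in (6): each outcome rate (tpr, fnr, rej_p for
# positives; tnr, fpr, rej_n for negatives) is weighted by its cost and by
# the corresponding class prior.

def total_cost(p, n, rates, costs):
    """p, n: class priors; rates/costs: dicts keyed by outcome."""
    return (p * (costs["fn"] * rates["fnr"]
                 + costs["tp"] * rates["tpr"]
                 + costs["rej"] * rates["rej_p"])
            + n * (costs["fp"] * rates["fpr"]
                   + costs["tn"] * rates["tnr"]
                   + costs["rej"] * rates["rej_n"]))
```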

       c_TP          c_FP         c_TN          c_FN           c_R
CM1    Unif[-10,0]   Unif[0,50]   Unif[-10,0]   Unif[0,50]     1
CM3    Unif[-10,0]   Unif[0,50]   Unif[-10,0]   Unif[0,100]    1
CM4    Unif[-10,0]   Unif[0,50]   Unif[-10,0]   Unif[0,50]     Unif[0,30]
Table 4: The cost models used in [12]
Dataset # Pos. # Neg. # Attr.
Pima 268 500 8
German credit (GC) 300 700 20
Breast cancer Wisconsin (WBC) 239 460 9
Heart disease Cleveland (CHD) 83 214 13
Table 5: Characteristics of the KEEL datasets

We conduct the Wilcoxon rank sum test [12] on the four real-world datasets to compare RO-TWSVM and BA2-TWSVM. In the test, for each cost model in Table 4, 1000 groups of the cost terms c_FP, c_FN, c_R, c_TP, and c_TN are generated. For each cost group, the total costs of the two compared methods are computed according to equation (6), and the AUC values of the two abstaining classifiers are also calculated. Finally, we count the number of cases in which the value for BA2-TWSVM is higher than, lower than, or identical to that for RO-TWSVM in terms of total cost and AUC. The identical case covers two conditions: the costs or AUC values of the compared methods are equal, or the cost group makes the reject option inapplicable in RO-TWSVM. The details of the Wilcoxon rank sum test can be found in [12]. Since BA2 needs its reject-rate parameters to be set, we first run RO-TWSVM and use the average values of its rejected positive rate and rejected negative rate over each cost group as k_p and k_n in BA2-TWSVM, respectively. The linear kernel function is used in TWSVM. The comparison results for cost and AUC are shown in Tables 6 and 7, respectively, where in each scenario the three numbers are the counts of BA2-TWSVM having a higher, lower, or identical value compared with RO-TWSVM.
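The counting procedure above can be sketched as follows, drawing CM1-style cost groups (assumed uniform ranges) and comparing two cost functions; all names are our own, and the two lambdas stand in for the methods' actual total-cost computations:

```python
# Sketch: draw n_groups random cost groups and count on how many draws
# method A's total cost is higher than, lower than, or identical to B's.
import random

def compare_costs(cost_fn_a, cost_fn_b, n_groups=1000, seed=0):
    rng = random.Random(seed)
    higher = lower = identical = 0
    for _ in range(n_groups):
        costs = {                      # CM1-style draws (assumed ranges)
            "tp": rng.uniform(-10, 0), "fp": rng.uniform(0, 50),
            "tn": rng.uniform(-10, 0), "fn": rng.uniform(0, 50),
            "rej": 1.0,
        }
        a, b = cost_fn_a(costs), cost_fn_b(costs)
        if a > b:
            higher += 1
        elif a < b:
            lower += 1
        else:
            identical += 1
    return higher, lower, identical
```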

                  Pima   GC    WBC   CHD
CM1   higher      104    305   311   217
      lower       782    597   591   685
      identical    98     98    98    98
CM3   higher      233    312   267   162
      lower       704    625   670   775
      identical    63     63    63    63
CM4   higher      112    136   143   161
      lower       452    428   421   403
      identical   436    436   436   436
Table 6: Cost results of RO-TWSVM and BA2-TWSVM for the linear kernel based on the Wilcoxon rank sum test
                  Pima   GC    WBC   CHD
CM1   higher      803    829    54   615
      lower        82     73   840   287
      identical    99     98   106    98
CM3   higher      694    903    44   637
      lower       243     34   887   300
      identical    63     63    69    63
CM4   higher      286    310   194   229
      lower       238    203   248   335
      identical   476    487   558   436
Table 7: AUC results of RO-TWSVM and BA2-TWSVM for the linear kernel based on the Wilcoxon rank sum test

In Table 6, for CM1 and CM3, the number of cases in which BA2-TWSVM produces lower costs than RO-TWSVM is much larger than the number of higher or identical cases. For CM4, due to the variable reject cost c_R, the cases in which the cost group is not suitable for rejection become more frequent; nevertheless, among the remaining cases, BA2-TWSVM's cost is lower than RO-TWSVM's more often than the opposite. In Table 7, except on the dataset WBC, the number of cases in which BA2-TWSVM has higher AUC values than RO-TWSVM is significantly larger than the number of the other two cases for CM1 and CM3. On WBC, the TWSVM classifier obtains a high AUC value (larger than 0.99), so the examples are easily discriminated and the reject option is not essential. For the other, harder datasets, BA2-TWSVM performs better than RO-TWSVM in terms of both total cost and AUC.

4 Conclusions

In this paper, we propose the BA2 method for abstaining classification, which avoids introducing cost information. The BA2 method achieves higher AUC than BA and lower cost than RO-TWSVM. Moreover, BA2 can control the respective reject rates of the positive and negative classes, which is essential in risk-associated fields such as medical diagnosis: acceptable upper bounds on the reject rates can be set according to the actual human and financial resources available. In the future, we would like to apply the method to specific risk-associated fields and to imbalanced big data.

References

  • [1] Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing 17 (2011)
  • [2] Asuncion, A., Newman, D.: UCI machine learning repository (2007)
  • [3] Chow, C.: On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory 16(1), 41–46 (1970)
  • [4] Devarakota, P.R., Mirbach, B., Ottersten, B.: Confidence estimation in classification decision: A method for detecting unseen patterns. In: Advances in Pattern Recognition, pp. 290–294. World Scientific (2007)
  • [5] Devarakota, P.R.R., Mirbach, B., Ottersten, B.: Reliability estimation of a statistical classifier. Pattern Recognition Letters 29(3), 243–253 (2008)
  • [6] Fawcett, T.: ROC graphs: Notes and practical considerations for researchers. Machine Learning 31(1), 1–38 (2004)
  • [7] Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8), 861–874 (2006)
  • [8] Giusti, N., Sperduti, A.: Theoretical and experimental analysis of a two-stage system for classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 893–904 (2002)
  • [9] Ishidera, E., Nishiwaki, D., Sato, A.: A confidence value estimation method for handwritten kanji character recognition and its application to candidate reduction. Document Analysis and Recognition 6(4), 263–270 (2003)
  • [10] Kamiran, F., Mansha, S., Karim, A., Zhang, X.: Exploiting reject option in classification for social discrimination control. Information Sciences 425 (2017)
  • [11] Li, M., Sethi, I.K.: Confidence-based classifier design. Pattern Recognition 39(7), 1230–1240 (2006)
  • [12] Lin, D., Sun, L., Toh, K.A., Zhang, J.B., Lin, Z.: Twin SVM with a reject option through ROC curve. Journal of the Franklin Institute (2017)
  • [13] Marrocco, C., Molinara, M., Tortorella, F.: An empirical comparison of ideal and empirical ROC-based reject rules. In: International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 47–60. Springer (2007)
  • [14] Pietraszek, T.: Optimizing abstaining classifiers using ROC analysis. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 665–672. ACM (2005)
  • [15] Pietraszek, T.: On the use of ROC analysis for the optimization of abstaining classifiers. Machine Learning 68(2), 137–169 (2007)
  • [16] Platt, J., et al.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 10(3), 61–74 (1999)
  • [17] Santos-Pereira, C.M., Pires, A.M.: On optimal reject rules and ROC curves. Pattern Recognition Letters 26(7), 943–952 (2005)
  • [18] Tortorella, F.: An optimal reject rule for binary classifiers. In: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 611–620. Springer (2000)
  • [19] Tortorella, F.: Reducing the classification cost of support vector classifiers through an ROC-based reject rule. Pattern Analysis and Applications 7(2), 128–143 (2004)
  • [20] Vanderlooy, S., Sprinkhuizen-Kuyper, I.G., Smirnov, E.N., van den Herik, H.J.: The ROC isometrics approach to construct reliable classifiers. Intelligent Data Analysis 13(1), 3–37 (2009)
  • [21] Wang, Z., Wang, Z., He, S., Gu, X., Yan, Z.F.: Fault detection and diagnosis of chillers using Bayesian network merged distance rejection and multi-source non-sensor information. Applied Energy 188, 200–214 (2017)
  • [22] Xie, J., Qiu, Z., Wu, J.: Bootstrap methods for reject rules of Fisher LDA. In: 18th International Conference on Pattern Recognition (ICPR 2006), vol. 3, pp. 425–428. IEEE (2006)
  • [23] Zadrozny, B., Elkan, C.: Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 694–699. ACM (2002)