A Model-Agnostic Algorithm for Bayes Error Determination in Binary Classification

07/24/2021 ∙ by Umberto Michelucci, et al.

This paper presents the intrinsic limit determination algorithm (ILD Algorithm), a novel technique to determine the best possible performance, measured in terms of the AUC (area under the ROC curve) and accuracy, that can be obtained from a specific dataset in a binary classification problem with categorical features regardless of the model used. This limit, namely the Bayes error, is completely independent of any model used and describes an intrinsic property of the dataset. The ILD algorithm thus provides important information regarding the prediction limits of any binary classification algorithm when applied to the considered dataset. In this paper the algorithm is described in detail, its entire mathematical framework is presented and the pseudocode is given to facilitate its implementation. Finally, an example with a real dataset is given.

1 Introduction

The majority of machine learning projects tend to follow the same pattern. Namely, many different machine learning model types (such as decision trees, logistic regression, random forests, neural networks, etc.) are first trained from data to predict specific outcomes, and then tested and compared to find the one that gives the best prediction performance on validation data. Many techniques to compare models have been developed and are commonly used in several settings raschka2018model; arlot2010survey. For some specific model types, such as neural networks, it is difficult to know when to stop the search Michelucci2017. There is always the hope that a different set of hyperparameters, such as the number of layers, or a better optimizer will give a better performance Michelucci2017; yu2020hyper. This makes the model comparison laborious and time-consuming.

Many reasons may lead to a bad accuracy, the most important being overlapping class densities garcia2008k; yuan2021novel, noise affecting the data schlimmer1986incremental; angluin1988learning; raychev2016learning, deficiencies in the classifier, and limitations in the training data tumer2003bayes. Classifier deficiencies could be addressed by building better models, of course, but other types of errors linked with the data (for example mislabeled patterns or missing relevant features) will lead to an error that cannot be reduced by any optimisation in the model training, regardless of the effort invested. This error is known in the literature as the Bayes error (BE) Michelucci2017; gareth2013introduction. The BE can, in theory, be obtained from the Bayes theorem if one knew all probability densities exactly. However, this is impossible in all real-life scenarios, and thus the BE cannot be computed directly from the data in all non-trivial cases. The Naïve Bayes classifier gareth2013introduction tries to approximately address this problem, but it is based on the assumption of the conditional independence of the features, which is rarely satisfied in practice tumer1998mutual; tumer2003bayes; gareth2013introduction. The methods to estimate the BE developed in the past decade tend to follow the same strategy: reduce the error linked to the classifier as much as possible, thus being left with only the BE. Ensemble strategies and meta learners tumer1998mutual; tumer2003bayes; ghosh2002multiclassifier have been widely used to address this problem. The idea is to exploit the multiple predictions to provide an indication of the limits to the performance for a given dataset tumer2003bayes. This approach has been widely used with neural networks, given their universal approximator nature richard1991neural; shoemakerleast.

In any supervised learning task, knowing the BE linked to a given dataset would be of extreme importance. Such a value would help practitioners decide whether or not it is worthwhile to spend time and computing resources on improving the developed classifiers or acquiring additional training data. Even more importantly, knowing the BE would let practitioners assess whether the available features in a dataset are useful for a specific goal. Suppose for example that a set of clinical exams is available for a large number of patients. If such a feature set gives a BE of 30% (so an accuracy of 70%) in predicting an outcome and a BE smaller than 20% is the desired target, it is useless to spend time developing models; time would be better spent acquiring additional features. The problem of determining the BE intrinsic to a given dataset is addressed and solved in this work from a theoretical point of view.

The contribution of this paper is twofold. Firstly, a new algorithm, called the Intrinsic Limit Determination algorithm (ILD algorithm), is presented. The ILD algorithm allows computing the maximum performance in a binary classification problem, expressed both as the largest area under the ROC curve (AUC) and as the accuracy that can be achieved with any given dataset with categorical features. This is by far the largest contribution of this paper, also with respect to previous methods, since the ILD algorithm for the first time allows evaluating the BE for a given dataset exactly. This paper demonstrates that the BE is a limit that does not depend on any chosen model but is an inherent property of the dataset itself. Thus, the ILD algorithm gives the upper limit of the prediction possibilities of any possible model when applied to a given dataset, with the only restrictions that the features must be categorical and that the target variable must be binary. Secondly, the mathematical framework on which the ILD algorithm is based is discussed and a mathematical proof of the algorithm's validity is given. The algorithm's computational complexity is also discussed.

The paper is organized as follows. The necessary notation and dataset restructuring for the ILD algorithm are discussed in Section 2. In Section 3 the complete mathematical formalism necessary for the ILD algorithm is discussed in detail. In Section 4 the ILD algorithm is described, and the fundamental ILD Theorem is stated and proven. Application of the ILD algorithm to a real dataset is provided in Section 5. Finally, in Section 6 conclusions and promising research developments are discussed.

2 Mathematical Notation and Dataset Aggregation

Let us consider a dataset with $N$ categorical features, i.e., each feature can only assume a finite set of values. Let us suppose that the $i$-th feature, denoted as $F_i$, takes $n_i$ possible values. For notational simplicity, it is assumed that the categorical feature $F_i$ is encoded in such a way that its possible values are integers from $1$ to $n_i$, with $i = 1, \dots, N$ (note that each $F_i$ can assume a different number of integer values). Each possible combination of the feature values is called here a bucket. The idea is that the observations will be aggregated in buckets depending on their features. The number of observations present in the dataset is indicated with $M$. All the observations with the same set of feature values are said to be in the same bucket.

The problem is a binary classification one, aiming at predicting an event that can have only two possible outcomes, indicated here with class 0 and class 1. In general, in the $j$-th bucket, there will be a certain number of observations (that we will indicate with $m_0^{[j]}$) with a class of 0, and a certain number of observations (that we will indicate with $m_1^{[j]}$) with a class of 1.

The feature vector for each observation, denoted as $x^{[i]}$ (with $i = 1, \dots, M$), is thus defined by an $N$-dimensional vector $x^{[i]} = (x_1^{[i]}, x_2^{[i]}, \dots, x_N^{[i]})$, where $x_k^{[i]}$ denotes the value of the $k$-th feature of the $i$-th sample.

The following useful definitions are now introduced.

Definition 2.1.

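A feature bucket $B^{[j]}$ is a possible combination of the values of the $N$ features, i.e.,

$$B^{[j]} = \left(F_1 = v_1^{[j]},\; F_2 = v_2^{[j]},\; \dots,\; F_N = v_N^{[j]}\right), \qquad v_i^{[j]} \in \{1, \dots, n_i\}. \tag{1}$$

In the rest of the paper, the feature bucket is indicated as

$B^{[j]}(v_1^{[j]}, \dots, v_N^{[j]})$ to explicitly mention the feature values characterizing the bucket $B^{[j]}$. The total number of feature buckets is thus equal to: $N_B = \prod_{i=1}^{N} n_i$.

As an example, in the case of two binary features $F_1$ and $F_2$ (encoded here as 0 and 1), four possible feature buckets can be constructed, namely: $B^{[1]}(0,0)$, $B^{[2]}(0,1)$, $B^{[3]}(1,0)$ and $B^{[4]}(1,1)$.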

Definition 2.2.

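The set of the indices of observations belonging to the $j$-th feature bucket $B^{[j]}$ is defined as

$$\mathcal{I}^{[j]} = \left\{\, i \in \{1, \dots, M\} \;:\; x_k^{[i]} = v_k^{[j]} \ \text{for all } k = 1, \dots, N \,\right\}. \tag{2}$$

The cardinality of the set $\mathcal{I}^{[j]}$ will be denoted as $m^{[j]}$. In a binary classification problem, the observations belonging to the feature bucket $B^{[j]}$, denoted as

$$X^{[j]} = \left\{\, x^{[i]} \;:\; i \in \mathcal{I}^{[j]} \,\right\}, \tag{3}$$

will contain $m_0^{[j]}$ observations with a target variable equal to $0$ and $m_1^{[j]}$ observations with a target variable equal to $1$. Note that by definition

$$m_0^{[j]} + m_1^{[j]} = m^{[j]} \tag{4}$$

and

$$\sum_{j=1}^{N_B} m^{[j]} = M. \tag{5}$$

Based on the definitions above, the original dataset can be equivalently represented by an aggregated dataset $\mathcal{B}$ of $N_B$ buckets, each of which contains a certain number of samples $m^{[j]} = m_0^{[j]} + m_1^{[j]}$. A visual representation of the previously described re-arrangement of the original dataset is reported in Figure 1 for a dataset with two binary features.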

Figure 1: An intuitive representation of the dataset aggregation step for a dataset with two binary features $F_1$ and $F_2$. Observations with, for example, $F_1 = 0$ and $F_2 = 0$ will be in bucket 1 in the aggregated dataset $\mathcal{B}$. Observations with $F_1 = 0$ and $F_2 = 1$ will be in bucket 2, and so on.

In the aggregated dataset $\mathcal{B}$ each record is thus a feature bucket $B^{[j]}$, characterized by the number of observations of class 0 ($m_0^{[j]}$) and the number of observations of class 1 ($m_1^{[j]}$). In the previous example of a dataset with only two binary features, $\mathcal{B}$ would look like the one in Table 1. In this simple example a dataset with any number of observations would be reduced to one with only 4 records, i.e., the number of possible buckets.

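| Bucket | Feature 1 | Feature 2 | Class 0     | Class 1     |
|--------|-----------|-----------|-------------|-------------|
| 1      | 0         | 0         | $m_0^{[1]}$ | $m_1^{[1]}$ |
| 2      | 0         | 1         | $m_0^{[2]}$ | $m_1^{[2]}$ |
| 3      | 1         | 0         | $m_0^{[3]}$ | $m_1^{[3]}$ |
| 4      | 1         | 1         | $m_0^{[4]}$ | $m_1^{[4]}$ |

Table 1: An example of an aggregated dataset with only two binary features.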
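The aggregation step can be illustrated with a minimal Python sketch. It assumes the observations are available as tuples of already-encoded categorical values with a 0/1 target; the function name `aggregate_into_buckets` and the toy data are hypothetical and only serve as an illustration. Only buckets that actually occur in the data are listed.

```python
from collections import defaultdict

def aggregate_into_buckets(features, labels):
    """Group observations by feature tuple and count class 0 / class 1 per bucket."""
    counts = defaultdict(lambda: [0, 0])      # bucket -> [m0, m1]
    for x, y in zip(features, labels):
        counts[tuple(x)][y] += 1              # y is assumed to be 0 or 1
    # Sort by feature values so the bucket order is reproducible
    return {bucket: tuple(c) for bucket, c in sorted(counts.items())}

# Hypothetical toy dataset: two binary features, binary target
X = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1), (1, 1)]
y = [0, 1, 0, 1, 1, 0, 1, 1]

buckets = aggregate_into_buckets(X, y)
for bucket, (m0, m1) in buckets.items():
    print(bucket, "class 0:", m0, "class 1:", m1)
```

Eight observations collapse into four records, one per occupied bucket, which is the representation the rest of the analysis works with.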

With this new representation of the dataset, generated by aggregating all observations sharing the same feature values in buckets, the proposed ILD algorithm allows computing the best possible ROC curve considering all possible predictions.

3 ILD Algorithm Mathematical Framework

Since the output of any machine learning model is a function of the feature values, and since a bucket is a collection of observations all with the same feature values, any possible deterministic model will associate to the $j$-th bucket of features only one possible class prediction $p^{[j]}$ that can be either 0 or 1. More generally, to each model can be associated a prediction vector $\mathbf{p} = (p^{[1]}, p^{[2]}, \dots, p^{[N_B]})$. In the next sections important quantities (such as TPR and FPR) evaluated for the aggregated dataset $\mathcal{B}$ as functions of $m_0^{[j]}$, $m_1^{[j]}$ and $\mathbf{p}$ are derived.

3.1 True Positive, True Negative, False Positive, False Negative

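In the feature bucket $B^{[j]}$, if $p^{[j]} = 0$ then only the $m_0^{[j]}$ observations of class 0 would be correctly classified. On the other hand, if $p^{[j]} = 1$ only the $m_1^{[j]}$ observations of class 1 would be correctly classified. For each bucket the true positives $TP^{[j]}$ can be written as

$$TP^{[j]} = m_1^{[j]}\, p^{[j]}. \tag{6}$$

In fact, if $p^{[j]} = 0$, then $TP^{[j]} = 0$, and if $p^{[j]} = 1$ then $TP^{[j]} = m_1^{[j]}$. Considering the entire dataset, the true positives ($TP$), true negatives ($TN$), false positives ($FP$) and false negatives ($FN$) are given by:

$$TP = \sum_{j=1}^{N_B} m_1^{[j]}\, p^{[j]} \tag{7}$$
$$TN = \sum_{j=1}^{N_B} m_0^{[j]} \left(1 - p^{[j]}\right) \tag{8}$$
$$FP = \sum_{j=1}^{N_B} m_0^{[j]}\, p^{[j]} \tag{9}$$
$$FN = \sum_{j=1}^{N_B} m_1^{[j]} \left(1 - p^{[j]}\right) \tag{10}$$

where the sums are performed over all the $N_B$ buckets.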
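As an illustration, the confusion counts of an aggregated dataset can be computed directly from the per-bucket class counts and a 0/1 prediction per bucket. The sketch below assumes the standard bucket-level expressions (true positives are the class-1 observations in buckets predicted as 1, and so on); the names `confusion_counts`, `m0`, `m1` and `p` are illustrative, and the numbers are a hypothetical toy example.

```python
def confusion_counts(m0, m1, p):
    """Return (TP, TN, FP, FN) for per-bucket class counts m0, m1 and predictions p."""
    tp = sum(b * q for b, q in zip(m1, p))          # class 1 in buckets predicted 1
    tn = sum(a * (1 - q) for a, q in zip(m0, p))    # class 0 in buckets predicted 0
    fp = sum(a * q for a, q in zip(m0, p))          # class 0 in buckets predicted 1
    fn = sum(b * (1 - q) for b, q in zip(m1, p))    # class 1 in buckets predicted 0
    return tp, tn, fp, fn

# Hypothetical aggregated dataset with four buckets
m0 = [1, 1, 0, 1]   # class-0 counts per bucket
m1 = [1, 0, 2, 2]   # class-1 counts per bucket
p = [0, 0, 1, 1]    # one prediction per bucket

tp, tn, fp, fn = confusion_counts(m0, m1, p)
print(tp, tn, fp, fn)
```

The four counts always add up to the total number of observations, which is a quick sanity check on any implementation.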

3.2 Accuracy

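In a binary classification problem, the accuracy $a$ is defined as

$$a = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{M}. \tag{11}$$

Using equations (7) and (8) the accuracy can be rewritten as

$$a = \frac{1}{M} \sum_{j=1}^{N_B} \left[ m_1^{[j]}\, p^{[j]} + m_0^{[j]} \left(1 - p^{[j]}\right) \right]. \tag{12}$$

The maximum value of the accuracy is obtained if the model predicts $p^{[j]} = 1$ as soon as $m_1^{[j]} > m_0^{[j]}$. This can be stated as

Theorem 3.1.

The accuracy for an aggregated categorical dataset $\mathcal{B}$, expressed as Equation (12), is maximised by choosing $p^{[j]} = 1$ when $m_1^{[j]} > m_0^{[j]}$ and $p^{[j]} = 0$ when $m_1^{[j]} \le m_0^{[j]}$.

Proof.

The proof is given by considering each bucket separately. Let us consider a bucket $B^{[j]}$ that has $m_1^{[j]} > m_0^{[j]}$. In this case, there are two possibilities for its contribution to the sum in Equation (12):

$$\begin{cases} p^{[j]} = 0 & \Rightarrow\ \text{contribution } m_0^{[j]}/M, \\ p^{[j]} = 1 & \Rightarrow\ \text{contribution } m_1^{[j]}/M > m_0^{[j]}/M. \end{cases} \tag{13}$$

Therefore, the contribution to the accuracy in Equation (12) is maximised by choosing $p^{[j]} = 1$ for those buckets where $m_1^{[j]} > m_0^{[j]}$. With similar reasoning, the contribution to the accuracy for those buckets where $m_1^{[j]} \le m_0^{[j]}$ is maximised by choosing $p^{[j]} = 0$ (when $m_1^{[j]} = m_0^{[j]}$ both choices contribute equally). This concludes the proof. ∎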
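The majority-vote rule of Theorem 3.1 can be checked numerically on a small example. The sketch below assumes the accuracy is the fraction of correctly classified observations summed bucket by bucket; it compares the rule against a brute-force search over all prediction vectors. The function names and the toy counts are hypothetical.

```python
from itertools import product

def accuracy(m0, m1, p):
    """Fraction of observations classified correctly by the per-bucket predictions p."""
    total = sum(m0) + sum(m1)
    correct = sum(b * q + a * (1 - q) for a, b, q in zip(m0, m1, p))
    return correct / total

def best_prediction_vector(m0, m1):
    """Predict 1 where class 1 is the strict majority in the bucket, else 0."""
    return [1 if b > a else 0 for a, b in zip(m0, m1)]

# Hypothetical aggregated dataset with four buckets
m0 = [1, 1, 0, 1]
m1 = [1, 0, 2, 2]

p_best = best_prediction_vector(m0, m1)
acc_best = accuracy(m0, m1, p_best)

# Brute force over all 2**4 prediction vectors (feasible only for tiny examples)
acc_brute = max(accuracy(m0, m1, p) for p in product([0, 1], repeat=len(m0)))
print(p_best, acc_best, acc_brute)
```

The brute-force search is exponential in the number of buckets and is included only to confirm the rule on a toy case; the rule itself needs a single pass over the buckets.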

3.3 Sensitivity and specificity

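The sensitivity or true positive rate ($TPR$) is the ability to correctly predict the positive cases. Considering the entire dataset, the $TPR$ can be expressed using Equations (7) and (10) as

$$TPR = \frac{TP}{TP + FN} = \frac{\sum_{j=1}^{N_B} m_1^{[j]}\, p^{[j]}}{\sum_{j=1}^{N_B} m_1^{[j]}}. \tag{14}$$

Analogously, the specificity or true negative rate ($TNR$), which is the ability to correctly reject the negative cases, can be written using Equations (8) and (9) as

$$TNR = \frac{TN}{TN + FP} = \frac{\sum_{j=1}^{N_B} m_0^{[j]} \left(1 - p^{[j]}\right)}{\sum_{j=1}^{N_B} m_0^{[j]}}. \tag{15}$$

3.4 ROC Curve

The receiver operating characteristic (ROC) curve is built by plotting the true positive rate $TPR$ on the $y$-axis, and the false positive rate ($FPR$) on the $x$-axis. For completeness, the $FPR$ is

$$FPR = 1 - TNR = \frac{FP}{FP + TN} = \frac{\sum_{j=1}^{N_B} m_0^{[j]}\, p^{[j]}}{\sum_{j=1}^{N_B} m_0^{[j]}}. \tag{16}$$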

3.5 Perfect Bucket and Perfect Dataset

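Sometimes a bucket may contain only observations that are all in class 0 or all in class 1. Such a bucket is called in this paper a perfect bucket and is defined as follows.

Definition 3.1.

A feature bucket $B^{[j]}$ is called perfect if one of the following is satisfied

$$m_0^{[j]} = 0 \ \text{ and } \ m_1^{[j]} > 0 \tag{17}$$

or

$$m_1^{[j]} = 0 \ \text{ and } \ m_0^{[j]} > 0. \tag{18}$$

It is also useful to define the set of all perfect buckets $\mathcal{P}$.

Definition 3.2.

The set of all perfect buckets, indicated with $\mathcal{P}$, is defined by

$$\mathcal{P} = \left\{\, B^{[j]} \;:\; j = 1, \dots, N_B,\ B^{[j]} \text{ is perfect} \,\right\}. \tag{19}$$

Note that by definition, the set $\mathcal{B} \setminus \mathcal{P}$ of non-empty buckets that are not perfect contains only imperfect buckets, namely buckets where $m_0^{[j]} > 0$ and $m_1^{[j]} > 0$.

An important special case is that of a perfect dataset, one in which all buckets are perfect. Indicating with $\mathcal{B}$ the set containing all buckets, we have $\mathcal{B} = \mathcal{P}$. It is easy to see that we can create a prediction vector that will predict all cases perfectly, by simply choosing, for feature bucket $B^{[j]}$,

$$p^{[j]} = \begin{cases} 0 & \text{if } m_1^{[j]} = 0, \\ 1 & \text{if } m_0^{[j]} = 0. \end{cases} \tag{20}$$

Remember that all feature buckets are perfect, and in a perfect feature bucket $m_0^{[j]}$ and $m_1^{[j]}$ cannot be greater than zero at the same time. To summarise our definitions we can define:

Definition 3.3.

A dataset $\mathcal{B}$ (where $\mathcal{B}$ is the dataset containing all feature buckets) is called perfect if $\mathcal{B} = \mathcal{P}$.

and

Definition 3.4.

A dataset $\mathcal{B}$ (where $\mathcal{B}$ is the dataset containing all feature buckets) is called imperfect if $\mathcal{B} \setminus \mathcal{P} \neq \emptyset$.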

4 The Intrinsic Limit Determination Algorithm

Let us introduce the predictor vector $\mathbf{p}_0 = (1, 1, \dots, 1)$, for which it clearly holds

$$TPR = 1, \qquad FPR = 1, \tag{21}$$

and indicate with $TPR_0$ and $FPR_0$ the $TPR$ and $FPR$ evaluated for $\mathbf{p}_0$, respectively.

Let us indicate as a flip the change of a component of a prediction vector from the value of $1$ to the value of $0$. Any possible prediction vector can thus be obtained by a finite series of flips starting from $\mathbf{p}_0$, where a flip is done only on components with a value equal to $1$. Let us denote with $\mathbf{p}_1$ the prediction vector obtained after the first flip, $\mathbf{p}_2$ after the second, and so on. After $N_B$ flips, the prediction vector will be $\mathbf{p}_{N_B} = (0, 0, \dots, 0)$. The $TPR$ and $FPR$ evaluated for a prediction vector $\mathbf{p}_k$ (with $k = 0, \dots, N_B$) are indicated as $TPR_k$ and $FPR_k$. The set of tuples of points' coordinates $P_k = (FPR_k, TPR_k)$ is indicated with $\mathcal{S}$:

$$\mathcal{S} = \left\{\, P_k = (FPR_k, TPR_k) \;:\; k = 0, \dots, N_B \,\right\}. \tag{22}$$

A curve $\mathcal{C}$ can be constructed by joining the points contained in $\mathcal{S}$ in ordered segments, where ordered segments means that the point $P_0$ will be connected to $P_1$; $P_1$ with $P_2$; and so on. The segment that joins the point $P_0$ with $P_1$ is denoted as $s^{[1]}$; the one that joins the points $P_1$ and $P_2$ is denoted as $s^{[2]}$, and so on. In Figure 2 a curve obtained by joining the tuples given by the respective prediction vectors obtained with three flips is visualised to give an intuitive understanding of the process.

Figure 2: Example of a curve constructed by joining 3 segments obtained after 3 flips.

The ILD algorithm provides a constructive process to select the order of the components to be flipped to obtain the curve characterized by the theoretical maximum AUC that can be obtained from the considered dataset, regardless of the predictive model used.

To be able to describe the ILD algorithm effectively and prove its validity, some additional concepts are needed and described in the following paragraphs.

4.1 Effect of one single flip

Let us consider what happens if one single component, say the $j$-th component, of $\mathbf{p}_0$ is changed from $1$ to $0$. The $TPR$ and $FPR$ values clearly change. By denoting with $\mathbf{p}_1$ the prediction vector in which the $j$-th component was changed, the following equations hold:

$$TPR_1 = TPR_0 - \frac{m_1^{[j]}}{\sum_{k=1}^{N_B} m_1^{[k]}}, \qquad FPR_1 = FPR_0 - \frac{m_0^{[j]}}{\sum_{k=1}^{N_B} m_0^{[k]}}. \tag{23}$$

Therefore, $TPR$ and $FPR$ will be reduced by an amount equal to the ratio $m_1^{[j]} / \sum_k m_1^{[k]}$ and $m_0^{[j]} / \sum_k m_0^{[k]}$, respectively. As an example, the effect of multiple single flips on $TPR$ and $FPR$ is illustrated in Figure 3, which shows the ROC curves resulting from random flipping starting from $\mathbf{p}_0$ for a real-life dataset, namely, the Framingham dataset (mahmood2014framingham) (see Section 5). As expected, flipping components randomly results in a curve that lies close to the diagonal. Since the diagonal corresponds to randomly assigning classes to the observations, random flipping does not lead to the best prediction possible with the given dataset.

Figure 3: Two examples of ROC curves obtained from random flipping applied to the Framingham dataset (mahmood2014framingham).

By ordering the points in $\mathcal{S}$ in ascending order based on the ratio $m_1^{[j]} / m_0^{[j]}$ of the flipped bucket, a new set of points $\tilde{\mathcal{S}}$ is constructed. It can happen that in a given dataset, multiple buckets have $m_0^{[j]} = 0$. In this case, this ratio cannot be calculated. If this happens, all the points with $m_0^{[j]} = 0$ can be placed at the end of the list of points. The order between those points is irrelevant. It is interesting to note that a flip for perfect buckets for which $m_1^{[j]} = 0$ will leave the $TPR$ unchanged, and a flip for perfect buckets for which $m_0^{[j]} = 0$ will leave the $FPR$ unchanged.

With the set of ordered points $\tilde{\mathcal{S}}$, a curve $\tilde{\mathcal{C}}$ can be constructed by joining the points in $\tilde{\mathcal{S}}$ as described in the previous paragraph. Note that the relative order of all points with $m_1^{[j]} = 0$ is also irrelevant, in the sense that this order does not affect the AUC of $\tilde{\mathcal{C}}$.
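The ordered-flip construction can be sketched in a few lines of Python. The sketch assumes the ordering criterion is the per-bucket ratio of class-1 to class-0 counts (ascending, with buckets that have no class-0 observations flipped last), starts from the all-ones prediction vector at the point (1, 1), and computes the AUC with the trapezoidal rule. Exact rational arithmetic is used so that the result can be compared against a brute-force search over all flip orders. All names and the toy counts are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def ild_order(m0, m1):
    """Flip order: ascending m1/m0; buckets with m0 == 0 are placed last."""
    def key(j):
        return (1, 0) if m0[j] == 0 else (0, Fraction(m1[j], m0[j]))
    return sorted(range(len(m0)), key=key)

def roc_points(m0, m1, order):
    """(FPR, TPR) points obtained by flipping buckets 1 -> 0 in the given order."""
    total0, total1 = sum(m0), sum(m1)       # assumes both classes are present
    fpr, tpr = Fraction(1), Fraction(1)
    points = [(fpr, tpr)]
    for j in order:
        fpr -= Fraction(m0[j], total0)
        tpr -= Fraction(m1[j], total1)
        points.append((fpr, tpr))
    return points

def auc(points):
    """Trapezoidal area under a curve traversed from (1, 1) to (0, 0)."""
    return sum((x0 - x1) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical aggregated dataset with four buckets
m0 = [1, 1, 0, 1]
m1 = [1, 0, 2, 2]

order = ild_order(m0, m1)
best_auc = auc(roc_points(m0, m1, order))

# Brute force over all flip orders (feasible only for tiny examples)
brute_auc = max(auc(roc_points(m0, m1, perm))
                for perm in permutations(range(len(m0))))
print(order, best_auc, brute_auc)
```

On this toy example the ratio-ordered sequence attains the same AUC as the exhaustive search, which is what the ILD Theorem in the next subsection asserts in general.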

4.2 ILD Theorem

The ILD theorem can now be formulated.

Theorem 4.1 (ILD Theorem).

Among all possible curves that can be constructed by generating a set of points $\mathcal{S}$ by flipping all components of $\mathbf{p}_0$ in any order one at a time, the curve $\tilde{\mathcal{C}}$ has the maximum AUC.

Proof.

The Theorem is proven by giving a construction algorithm. It starts with one set of points $\mathcal{S}$ generated by flipping components of $\mathbf{p}_0$ in a random order. Let us consider two adjacent segments generated with $\mathcal{S}$: $s^{[j]}$ and $s^{[j+1]}$. In Figure 4 panel (A) the two segments are plotted in the case where $\alpha^{[j]} > \alpha^{[j+1]}$. The angles $\alpha^{[j]}$ and $\alpha^{[j+1]}$ indicate the angles of the segments with the horizontal direction and $\beta^{[j,j+1]}$ the angle between the two segments $s^{[j]}$ and $s^{[j+1]}$. The area between the two segments and any horizontal line that lies below them can be increased by simply switching the order of the two flips, as depicted in Figure 4 panel (B). Switching the order simply means flipping first the component that was originally flipped at step $j+1$ and then the component that was originally flipped at step $j$.

Figure 4: Visual explanation for the ILD Theorem. Panel (A): two consecutive segments after flipping the components in the original order (step $j$, then step $j+1$); Panel (B): two consecutive segments after flipping the same two components in the switched order; Panel (C): parallelogram representing the difference between the area under the segments in panel (A) and the area under the segments in panel (B).

It is important to note that in Figure 4 panel (A) $\alpha^{[j]} > \alpha^{[j+1]}$ while in panel (B) $\alpha^{[j]} < \alpha^{[j+1]}$. It is evident that the area under the two segments in panel (B) is greater than the one in panel (A). The parallelogram in Figure 4 panel (C) depicts the area difference.
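The area difference can also be verified with a short calculation. The following is a sketch under stated assumptions: the curve is traversed from $(1,1)$ towards $(0,0)$, $y$ is the TPR at the start of the two segments, and $(\Delta x_a, \Delta y_a)$, $(\Delta x_b, \Delta y_b)$ are the decreases in FPR and TPR caused by two consecutive flips $a$ and $b$. These symbols are introduced here only for illustration. $A_{ab}$ is the area under the two segments when $a$ is flipped first, $A_{ba}$ when $b$ is flipped first.

```latex
\begin{align*}
A_{ab} &= \Delta x_a \left( y - \tfrac{1}{2}\Delta y_a \right)
        + \Delta x_b \left( y - \Delta y_a - \tfrac{1}{2}\Delta y_b \right), \\
A_{ba} &= \Delta x_b \left( y - \tfrac{1}{2}\Delta y_b \right)
        + \Delta x_a \left( y - \Delta y_b - \tfrac{1}{2}\Delta y_a \right), \\
A_{ab} - A_{ba} &= \Delta x_a \,\Delta y_b - \Delta x_b \,\Delta y_a .
\end{align*}
```

The difference is, up to sign, the area of the parallelogram spanned by the two segments. It is positive exactly when $\Delta x_a \Delta y_b > \Delta x_b \Delta y_a$, that is, when the flatter segment $a$ is traversed first, so placing the segment with the smaller angle first never decreases the area.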

The proof is based on repeating the previous step until $\alpha^{[j]} \le \alpha^{[j+1]}$ for all $j = 1, \dots, N_B - 1$. This is described in pseudo-code in Algorithm 1.

Generate a set of points $\mathcal{S}$ by flipping the components of $\mathbf{p}_0$ in a random sequence;
while true do
       $c = 0$;
       for $j = 1, \dots, N_B - 1$ do
             if $\alpha^{[j]} > \alpha^{[j+1]}$ then
                   Switch the $j$-th and $(j+1)$-th flips (segments $s^{[j]}$ and $s^{[j+1]}$) in $\mathcal{S}$;
                   $c = c + 1$;
             end if
       end for
       if $c = 0$ then
             Exit While loop and end the Algorithm
       end if
end while
The final set of points obtained in the loop above will generate the curve $\tilde{\mathcal{C}}$.
Algorithm 1 Algorithm to construct the curve $\tilde{\mathcal{C}}$.

The area under the curve obtained with Algorithm 1 cannot be made larger with any further switch of points in $\mathcal{S}$. ∎

Note that Algorithm 1 will end after a finite number of steps. This can be shown by noting that the described algorithm is nothing else than the bubble sort algorithm nocedal2006numerical applied to the set of angles $\alpha^{[1]}$, $\alpha^{[2]}$, …, $\alpha^{[N_B]}$. So this algorithm has a worst-case and average complexity of $O(N_B^2)$.
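A bubble-sort style pass of this kind can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: it assumes each flip is identified by its bucket index and that buckets are non-empty, and instead of computing angles explicitly it compares slopes by cross-multiplication, which gives the same ordering for angles between 0 and 90 degrees and avoids division by zero for buckets without class-0 observations. The names `steeper` and `bubble_sort_flips` are hypothetical.

```python
def steeper(a, b, m0, m1):
    """True if the segment of bucket a makes a larger angle with the horizontal than b's.

    slope_a > slope_b  <=>  m1[a] * m0[b] > m1[b] * m0[a]
    (the common normalisation by the class totals cancels out).
    """
    return m1[a] * m0[b] > m1[b] * m0[a]

def bubble_sort_flips(m0, m1, order):
    """Swap adjacent flips until no steeper segment precedes a flatter one."""
    order = list(order)
    while True:
        swaps = 0
        for k in range(len(order) - 1):
            if steeper(order[k], order[k + 1], m0, m1):
                order[k], order[k + 1] = order[k + 1], order[k]
                swaps += 1
        if swaps == 0:          # a full pass without swaps: the order is final
            return order

# Hypothetical aggregated dataset with four buckets
m0 = [1, 1, 0, 1]
m1 = [1, 0, 2, 2]

start = [2, 3, 0, 1]            # an arbitrary (here worst-case) initial flip order
final = bubble_sort_flips(m0, m1, start)
print(final)
```

In practice a single call to a general-purpose sort with the ratio of class-1 to class-0 counts as key gives the same order in $O(N_B \log N_B)$ time; the adjacent-swap version is shown because it mirrors the structure of the proof.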

4.3 Handling missing values

Missing values can be handled by imputing them with a value that does not appear in any feature. All observations that have missing values in a specific feature will be assigned to the same bucket and considered similar, since we have no way of knowing better.
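A minimal sketch of this imputation, assuming missing entries are represented as `None` and that the sentinel value -1 (the value used for the Framingham example in Section 5) does not occur as a legitimate encoded category. The helper name `impute_missing` and the toy rows are hypothetical.

```python
MISSING = -1   # sentinel; assumed not to collide with any real encoded category

def impute_missing(row, sentinel=MISSING):
    """Replace missing entries (None) with the sentinel so the row can act as a bucket key."""
    return tuple(sentinel if value is None else value for value in row)

# Hypothetical observations with three categorical features
rows = [(1, None, 0), (1, None, 0), (1, 0, 0)]
keys = [impute_missing(r) for r in rows]
print(keys)
```

The first two observations end up with the same key and therefore in the same bucket, while the third, whose second feature is known, stays in a separate one.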

5 Application of the ILD algorithm to the Framingham Heart Study Dataset

The power of the ILD algorithm is best demonstrated by applying it to a real dataset, here the medical dataset named Framingham (mahmood2014framingham), which is publicly available on the Kaggle website (kaggle). This dataset comes from an ongoing cardiovascular risk study of residents of the town of Framingham (Massachusetts, US). Different versions of the cardiovascular risk score have been developed over the years (wilson1998prediction), the most recent of which is the 2008 study by d2008general, against which the ILD algorithm results are also compared.

The classification goal is to predict, given a series of risk factors, a patient's 10-year risk of future coronary heart disease. This is a high-impact task, since 17.9 million deaths occur worldwide every year due to heart diseases cardiodiseases and their early prognosis may be of crucial importance for a correct and successful treatment. The dataset used in our study contains 4238 patients and 7 features: gender (0: female, 1: male); smoker (0: no, 1: yes); diabetes (0: no, 1: yes); hypertension treatment (0: no, 1: yes); age; total cholesterol; and systolic blood pressure (SBP). The last three features are continuous variables and are discretized as follows:

  • age: ;

  • total cholesterol: ;

  • SBP: .

The outcome variable is binary (0: no coronary disease, 1: coronary disease).

To correctly interpret the comparison with existing results, it is important to remark that in the dataset used in our study, the high-density lipoprotein (HDL) cholesterol variable is missing with respect to the original Framingham dataset employed in (d2008general). Finally, to create the buckets all missing values are substituted by a feature value of -1.

The application of the algorithm starts with the population of the buckets as described in the previous sections. A total of 177 buckets is generated. The dataset is split into two parts: a training set (80% of the data) and a validation set (20% of the data). For comparison, the Naïve Bayes classifier is also trained on the training set and validated on the validation set. The performance of the ILD algorithm and the Naïve Bayes classifier is shown through the ROC curves in Figure 5. The AUC for the ILD algorithm is 0.78, clearly higher than that for the Naïve Bayes classifier, namely, 0.68.

Figure 5: Comparison of the performance of the ILD algorithm (ILDA, red) and Naïve Bayes classifier (NB, blue) implemented on categorical features based on one single training and validation split.

To further test the performance, the split and training are repeated for 100 different dataset splits. Each time both the ILD algorithm and the Naïve Bayes classifier are applied to the validation set and the resulting AUCs are plotted in Figure 6 (top panel). For clarity, the difference between the AUC provided by the two algorithms is shown in Figure 6 (bottom panel).

Figure 6: Comparison between the performance of the ILD algorithm (red) and Naïve Bayes classifier (blue) implemented on categorical features based on 100 different training and validation splits. Top panel: AUC; bottom panel: difference between the AUC provided by the ILD algorithm and the Naïve Bayes classifier.

This example shows that the application of the ILD algorithm allows the comparison of the prediction performance of a model, here the Naïve Bayes classifier, with the maximum obtainable for a given dataset. The maximum accuracy over the validation set is 85% for the Naïve Bayes classifier (calculated for a specific threshold, i.e., 61%, which optimizes the accuracy over the training set) and 86% for the ILD algorithm (calculated applying Theorem 3.1). The reported accuracies refer to the ROC curves shown in Figure 5. The two values are similar and have been reported for completeness, even if this result may be misleading. The reason lies in the strong dataset imbalance, since only 15% of the patients experienced a cardiovascular event. In particular, both the Naïve Bayes classifier and the ILD algorithm obtain a true positive rate and a false positive rate near zero at the point of the ROC curve that maximises the accuracy over the validation set, with a high misclassification rate among positive patients, which however cannot be noticed from the accuracy result. As is well known, the accuracy is not a good metric for unbalanced datasets, and the AUC is a much more widely used metric that does not suffer from the problem described above.

6 Conclusions

This work presents a new algorithm, the ILD algorithm, which determines the best possible ROC curve that can be obtained from a dataset with categorical features and binary outcome, regardless of the predictive model.

The ILD algorithm is of fundamental importance to practitioners because it allows them:

  • to determine the prediction power (namely, the BE) of a specific set of categorical features;

  • to decide when to stop searching for better models;

  • to decide if it is necessary to enrich the dataset.

The ILD algorithm thus has the potential to change how binary prediction problems are approached, allowing practitioners to save a considerable amount of effort, time, and money (considering that, for example, computing time is expensive, especially in cloud environments).

The ILD algorithm has two major limitations. The first is the requirement for the features to be categorical. The generalization of this approach to continuous features is the natural next step and will open new ways of understanding datasets with continuous features. Secondly, the ILD algorithm works well only when the different buckets are populated with enough observations. The ILD algorithm would not give any useful information on a dataset with just one observation in each bucket (since it would be a perfect dataset). Consider the example of gray-level images. Even if pixel values could be considered categorical (the gray level of a pixel is an integer that can assume values from 0 to 255), two major problems would arise if the ILD algorithm were applied to such a case: the number of buckets would be extremely large, and each bucket would contain only one image, therefore making the ILD algorithm completely useless, as only perfect buckets would be constructed.

An important further research direction is the expansion of the ILD algorithm to detect the best performing models that do not overfit the data. In the example of images, it is clear that, the dataset being perfect, one could theoretically construct a perfect predictor, therefore giving a maximum accuracy of 1. The interesting question is how to determine the maximum accuracy or the best AUC only in cases in which no overfitting is occurring. This is a nontrivial problem that is currently under investigation by the authors. To address this problem, at least partially, the authors have defined a perfection index ($I_P$) that can help in this regard. $I_P$ is discussed in Appendix A.

To conclude, although more research is needed to generalize the ILD algorithm, it is, to the best knowledge of the authors, the first algorithm that is able to determine the exact BE from a generic dataset with categorical features, regardless of the predictive models.

Appendix A Perfection Index

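It is very useful to give a measure of how perfect a dataset is with a single number. To achieve this, a perfection index ($I_P$) can be defined.

Definition A.1.

The Perfection Index $I_P$ is defined as:

$$I_P = \frac{1}{M} \sum_{j=1}^{N_B} \left| m_1^{[j]} - m_0^{[j]} \right|. \tag{24}$$

Note that if a bucket is perfect then either $m_0^{[j]}$ or $m_1^{[j]}$ is zero. In a perfect dataset $\mathcal{B} = \mathcal{P}$. In this case, from Eq. (24) it is easy to see that $I_P = 1$ since

$$I_P = \frac{1}{M} \left( \sum_{B^{[j]} \in \mathcal{P}_0} m_0^{[j]} + \sum_{B^{[j]} \in \mathcal{P}_1} m_1^{[j]} \right) = \frac{1}{M} \sum_{j=1}^{N_B} m^{[j]} = 1 \tag{25}$$

where

$$\mathcal{P}_0 = \left\{\, B^{[j]} \in \mathcal{P} \;:\; m_1^{[j]} = 0 \,\right\} \tag{26}$$

and

$$\mathcal{P}_1 = \left\{\, B^{[j]} \in \mathcal{P} \;:\; m_0^{[j]} = 0 \,\right\}. \tag{27}$$

For an imperfect dataset, it is easy to see that $I_P$ will be less than 1. In fact

$$I_P = \frac{1}{M} \left( \sum_{B^{[j]} \in \mathcal{P}} m^{[j]} + \sum_{B^{[j]} \in \mathcal{B} \setminus \mathcal{P}} \left| m_1^{[j]} - m_0^{[j]} \right| \right) \tag{28}$$

where

$$\left| m_1^{[j]} - m_0^{[j]} \right| < m_1^{[j]} + m_0^{[j]} = m^{[j]} \quad \text{for every } B^{[j]} \in \mathcal{B} \setminus \mathcal{P}, \tag{29}$$

so that $I_P$ cannot be one as long as $\mathcal{B} \setminus \mathcal{P}$ is not empty. There is a special interesting case when the dataset is completely imperfect, meaning $\mathcal{P} = \emptyset$, and for every bucket $B^{[j]}$ with $j = 1, \dots, N_B$ it is true that $m_0^{[j]} = m_1^{[j]}$. In this case, regardless of the predictions a model may make, the accuracy will always be 0.5. In this case, $I_P = 0$. In fact, one can see that in this case

$$I_P = \frac{1}{M} \sum_{j=1}^{N_B} \left| m_1^{[j]} - m_0^{[j]} \right| = 0 \tag{30}$$

since $m_0^{[j]} = m_1^{[j]}$, and that

$$a = \frac{1}{M} \sum_{j=1}^{N_B} \left[ m_1^{[j]}\, p^{[j]} + m_0^{[j]} \left(1 - p^{[j]}\right) \right] = \frac{1}{M} \sum_{j=1}^{N_B} m_0^{[j]} = \frac{1}{2}. \tag{31}$$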

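This index is particularly useful since the following theorem can be proved.

Theorem A.1.

The perfection index satisfies the relationship

$$I_P = \max_{\mathbf{p} \in \Pi} a - \min_{\mathbf{p} \in \Pi} a \tag{32}$$

where $\Pi$ is the set of all possible prediction vectors for the bucket set $\mathcal{B}$. The perfection index thus measures the range of possible values that the accuracy ($a$) can have.

Proof.

To start the proof, the formulas for the maximum ($a_{\max}$) and minimum ($a_{\min}$) possible accuracy must be derived. Starting from Equation (12) and choosing $p^{[j]} = 1$ (to get the maximum accuracy) for $m_1^{[j]} > m_0^{[j]}$ and $p^{[j]} = 0$ for $m_1^{[j]} \le m_0^{[j]}$ the result is

$$a_{\max} = \frac{1}{M} \left( \sum_{j:\, m_1^{[j]} > m_0^{[j]}} m_1^{[j]} + \sum_{j:\, m_1^{[j]} \le m_0^{[j]}} m_0^{[j]} \right). \tag{33}$$

At the same time, by choosing $p^{[j]} = 0$ for $m_1^{[j]} > m_0^{[j]}$ and $p^{[j]} = 1$ for $m_1^{[j]} \le m_0^{[j]}$, $a_{\min}$ can be written as

$$a_{\min} = \frac{1}{M} \left( \sum_{j:\, m_1^{[j]} > m_0^{[j]}} m_0^{[j]} + \sum_{j:\, m_1^{[j]} \le m_0^{[j]}} m_1^{[j]} \right). \tag{34}$$

To prove the theorem, let us rewrite the maximum accuracy from Equation (33) as

$$a_{\max} = \frac{1}{M} \left( \sum_{j:\, m_1^{[j]} > m_0^{[j]}} m_0^{[j]} + \sum_{j:\, m_1^{[j]} \le m_0^{[j]}} m_1^{[j]} \right) + \frac{1}{M} \sum_{j=1}^{N_B} \left| m_1^{[j]} - m_0^{[j]} \right| \tag{35}$$

and using Equations (24) and (34) the previous equation can be re-written as

$$a_{\max} = a_{\min} + I_P. \tag{36}$$

This concludes the proof. ∎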
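The relationship between the perfection index and the accuracy range can be checked numerically on a small example. The sketch below assumes the perfection index is the sum over buckets of the absolute difference between class-1 and class-0 counts, divided by the total number of observations, which is the form implied by Theorem A.1. It uses exact fractions and a brute-force enumeration of all prediction vectors; the function names and toy counts are hypothetical.

```python
from fractions import Fraction
from itertools import product

def accuracy(m0, m1, p):
    """Exact accuracy of the per-bucket predictions p."""
    total = sum(m0) + sum(m1)
    correct = sum(b * q + a * (1 - q) for a, b, q in zip(m0, m1, p))
    return Fraction(correct, total)

def perfection_index(m0, m1):
    """Assumed form: sum of |m1 - m0| over buckets, divided by the number of observations."""
    total = sum(m0) + sum(m1)
    return Fraction(sum(abs(b - a) for a, b in zip(m0, m1)), total)

# Hypothetical aggregated dataset with four buckets
m0 = [1, 1, 0, 1]
m1 = [1, 0, 2, 2]

accs = [accuracy(m0, m1, p) for p in product([0, 1], repeat=len(m0))]
ip = perfection_index(m0, m1)
print(ip, max(accs), min(accs))

# Two limiting cases: all buckets perfect, and all buckets perfectly balanced
ip_perfect = perfection_index([3, 0], [0, 2])
ip_balanced = perfection_index([2, 5], [2, 5])
```

In the toy example the index equals the gap between the best and worst achievable accuracy, and the two limiting cases give the extreme values of the index.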

A.1 Interpretation of the Perfection Index

If $I_P$ is small, the ILD algorithm gives very useful information: the dataset is "imperfect" enough that an analysis as described is very useful. The closer the value of $I_P$ gets to one, the less helpful the analysis, as it is formulated here, becomes, since in the case of a perfect dataset a perfect prediction model can be easily built and therefore the question of what is the best possible model loses significance. Note that there is a big assumption made, namely, that a feature bucket with a few observations contains the same information as one with one thousand observations in it. In real life, a feature bucket with just one (or a few) observations will probably be due to the lack of observations collected with the given set of features, and therefore its importance should be doubted.

References