In the recent flood of papers analyzing the inner workings of classifiers [22, 11, 8, 20], attention is typically focused on a single classifier. We might want to know how a black-box classifier arrives at its predictions [11, 20], where the classifier predicts well or badly, and which input attributes influence the output predictions and to what degree. Important as that might be, we propose that more can be learned by investigating the collective behavior of a set of classifiers. Let us illustrate this with a practical example.
Suppose that we work at a bank, and we have to decide whether or not to grant a mortgage to each of a series of customers. We have a rule-based system in place to make this decision. Since the economic tide ebbs and flows over time, we may need to adapt the rule-based system periodically to achieve appropriate results. At every point in time, the system can predict for every customer whether the person gets the mortgage or not. It would be interesting to find out when and why the rule-based system changes its mind: if a subsequent iteration of the system suddenly grants a loan to a previously rejected customer, or vice versa, the era of responsible data science compels us to properly motivate why. Ideally, we would not just identify single customers for which this holds, but coherent groups of customers that come with a concise description: it would be interesting to know if the system has changed its mind about granting mortgages to people under the age of thirty with at least two kids, for example. Such descriptions give us more information on whether the behavior displayed by the system is, in fact, desirable.
In this paper we introduce a few variants of the following problem: given a dataset and a collection of relevant classifiers, identify and name the regions of the domain for which there is high disagreement. We describe an algorithm based on the Exceptional Model Mining framework [16, 7], and provide quality measures that address a few possible motivations and preferences. We evaluate the usefulness of the algorithm on publicly available datasets and report qualitative and quantitative findings.
II Related Work
Given a classifier and a relevant dataset, investigating the interactions between the model and the data is often referred to as model debugging, providing model transparency, or model interpretability. Some interpretability mechanisms treat a model as a black box [11, 12, 20, 1], while others employ methods that are tailored to specific classification techniques. One existing algorithm enables the investigation of a single soft classifier against a dataset. It requires that the ground truth is provided, and also that the model outputs probabilities (it is a soft classifier). The method investigates the degree to which the ranking of the model in a specific subgroup is in agreement with the ground truth. This is done by counting the obvious errors (when a negative is ranked before a positive). Regions for which the rate of obvious errors is significantly higher or significantly lower than the same measure for the whole dataset are then reported.
Black-box auditing offers a discrimination-aware approach. GoldenEye/GoldenEye++ [11, 12] highlights feature importance and feature interaction by shuffling the values in the columns of specific features within a predicted label, and measuring the label changes. EXPLAIN [22, 23] checks the effect of blinding the model with respect to the values of a specific attribute. SHAP divides the contribution to a classification among the features of a case. LIME attempts to describe the model in the locality of a case under scrutiny using an interpretable proxy. Other interesting approaches to interpretability explore regions of uncertainty, the attention given to a case under test, or cognitive psychology traits of the model. Some works extract rules, or provide a simplified model [17, 15, 10].
If one manages to compress the model and the data, for example by following the Minimum Description Length (MDL) framework, then one captures, in a sense, the essence of the model and the data. One describes a model and then the exceptions in the data that do not follow the model, and finds the point at which the overall description is minimal (the model is described in sufficient detail and the leftover exceptions are few).
Assume that we are given a dataset from a domain , consisting of cases (or records) of the form . We refer to the final element of each case as the target, or the true label. The target is nominal, with the set of possible values , thus . All other elements of each case are referred to as the attributes, which can be of either a numeric or a nominal type. While the domain of each individual attribute is left free, we denote the collective domain of the attributes by . This notation allows us to formally define what a classifier is:
Definition 1 (Classifier).
Given a domain with collective attribute domain and the nominal type of the target , a classifier is a function , assigning a label to every possible input value from .
The main goal of a classifier, as it is generally understood in machine learning, is to predict: assigning labels to cases whose real target value we do not know. To arrive at a formal definition of such predictions, we need to introduce some more notation. Let be a dataset, where the true labels, in the general case, are not known. We denote by superscript the th case of the dataset or elements thereof. Hence, the first case is denoted by , the target value of the seventh case, whether it is known or not, by , and the value for the fourth attribute in the eighth case by .
Definition 2 (Predictions).
Given a dataset consisting of cases, and a classifier , we define the predictions of on to be the vector , where .
Hence, the vector of predictions collects the outputs of the classifier function on all cases in the dataset .
The main goal of this paper is to find regions, or subgroups of cases, of high controversy across a set of classifiers. Hence, we assume as given a set of classifiers . For the purposes of this paper, it is irrelevant exactly how any of these classifiers arrive at their predictions: we are agnostic of the internal workings of a classifier function. Instead, we merely analyze them in terms of their predictions:
Definition 3 (Prediction matrix).
Given a dataset consisting of cases, and a set of classifiers , the prediction matrix is the -matrix with entries from defined by:
Hence, the first row of the prediction matrix collects the predictions of all classifiers for the first case in the original matrix , etcetera.
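Definitions 2 and 3 can be illustrated with a minimal sketch. It assumes that a classifier is any function from a tuple of attribute values to a label; the function names, the toy cases, and the three rules below are hypothetical and serve only as an illustration.

```python
from typing import Callable, Hashable, List, Sequence

Case = Sequence[Hashable]                 # attribute values of one case (no target)
Classifier = Callable[[Case], Hashable]   # maps attribute values to a label


def predictions(classifier: Classifier, cases: List[Case]) -> List[Hashable]:
    """Vector of predictions of one classifier on all cases (Definition 2)."""
    return [classifier(case) for case in cases]


def prediction_matrix(classifiers: List[Classifier], cases: List[Case]) -> List[List[Hashable]]:
    """One row per case, one column per classifier (Definition 3)."""
    return [[classifier(case) for classifier in classifiers] for case in cases]


# Hypothetical toy data: each case is (age, number of kids).
cases = [(25, 2), (45, 0), (31, 3)]

# Hypothetical rule-based classifiers deciding on a mortgage.
always_yes = lambda case: "yes"
age_rule = lambda case: "yes" if case[0] >= 30 else "no"
kids_rule = lambda case: "no" if case[1] >= 2 else "yes"

matrix = prediction_matrix([always_yes, age_rule, kids_rule], cases)
for row in matrix:
    print(row)
```

Each column of `matrix` equals the prediction vector of the corresponding classifier, so the matrix is simply the per-classifier vectors placed side by side.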
III-A Local Pattern Mining
We would like to identify one or more subgroups of the cases, for example , such that on average for cases , and the classifiers , there is high controversy among the relevant entries . The ground truth, , for the classification problem, referred to above as the target values or the true labels, is not required for the basic form of the problem that we describe next; when it is present, however, new options and questions can be investigated.
When selecting a subset of the cases in , we restrict ourselves to regions that can be identified by a description that belongs to a description language . Thus, for example, if is a valid description in , then the matching subgroup of cases , those for which the description evaluates to true, is a valid candidate subgroup. This is often the approach in Subgroup Discovery and in Exceptional Model Mining [16, 7]. In Subgroup Discovery (SD), one can identify the most interesting subgroups w.r.t. a single target. With Exceptional Model Mining (EMM), one can address multiple target attributes when evaluating how exceptional a subgroup is. Both frameworks require that one declares a set of attributes that can be part of the description of a subgroup, and hence of the identification of the region. Also required is a single attribute or, for EMM, a set of attributes, used when evaluating the exceptionality of the region. Formally, both SD and EMM require a declaration of a subset of the attributes of the dataset, , where are used to describe subgroups, and are used to evaluate subgroups. Thus the description language is based on and the relevant domains. Given a dataset , a description, , is interchangeable with the subgroup that corresponds to the cases for which is true. For evaluating the subgroups, as mentioned above, are used. For SD, , for EMM, . Of course, there are many ways to evaluate the exceptionality of a subgroup, for EMM in particular, but also for SD. Therefore a specific quality measure (for EMM, based on a model class) must be chosen to evaluate the quality of the region in terms of exceptionality. Hence, assigns a value to the description based on the attributes of the relevant entries in . A reasonable choice for realizing the search involved in SD or EMM is the Beam Search algorithm.
IV The Controversy Rules Model Class for EMM
Our prerequisites and the standard EMM terminology can be naturally mapped onto one another, as follows. The descriptors from EMM will be of the dataset, and the targets of EMM will be the predictions from . In some situations we augment , where available and relevant, with the ground truth label .
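We illustrate the core concept of Controversy Rules by a single case, or row, compared to another row . If the set of predictions over has higher entropy than the set over , we would claim that is more interesting than . We use here the base-2 Shannon entropy, $H$, of the distribution of predicted labels within a row:
$$H = -\sum_{v} p_v \log_2 p_v,$$
where $p_v$ is the fraction of the classifiers that predict label $v$ for that row, and the sum runs over the labels that are predicted at least once.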
Following this definition, if we have ten classifiers and a binary target, then a row where five classifiers predict the one label and five the other is more interesting than a row where the votes are six versus four. If we have classes, then a tally of is as interesting as of and both are more interesting than a tally of . This holds, of course, if we look for regions with disagreement. If we seek regions with high agreement, we prefer the lower entropy.
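The comparison of vote tallies can be checked numerically. The following is a minimal sketch of the base-2 Shannon entropy of a single row of predictions; the function name and the toy rows of ten votes are illustrative assumptions.

```python
import math
from collections import Counter


def row_entropy(row):
    """Base-2 Shannon entropy of the label distribution within one row."""
    n = len(row)
    # p * log2(1/p) with p = count / n, summed over the labels that occur.
    return sum((count / n) * math.log2(n / count) for count in Counter(row).values())


five_five = ["yes"] * 5 + ["no"] * 5   # ten classifiers, evenly split
six_four = ["yes"] * 6 + ["no"] * 4    # ten classifiers, six versus four
unanimous = ["yes"] * 10               # ten classifiers, full agreement

print(row_entropy(five_five))            # 1.0
print(round(row_entropy(six_four), 3))   # 0.971
print(row_entropy(unanimous))            # 0.0
```

The even split attains the maximum of one bit for a binary target, the six-versus-four split scores slightly lower, and a unanimous row scores zero.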
Now consider subgroups of the cases, or collections of rows. To help the reader follow the intuition, we give two matrices for two toy datasets and their respective classifiers, in Table I and Table II. For simplicity, assume that a description exists for each subset of both toy datasets. Therefore one can name any of those subsets and evaluate their quality measures. Below we refer, for example, to the subgroup containing rows 1 and 3 as subgroup .
IV-A Row Controversy
In the first scenario, we seek regions with high per row controversy across the classifiers. We measure this by mean per row entropy over the cases in a subgroup. Therefore the quality measure that we use here is:
Note that we ignore the identity of the classifiers and the actual predictions; we just evaluate the mean per-row entropy for the subgroup. We set a minimum threshold on the number of rows, so that the reported subgroups are actionable; other than that, a smaller subgroup with higher mean entropy is still ranked before bigger subgroups with smaller mean entropy. The use case for this scenario is when we are interested in subgroups of the domain for which different classifiers predict differently or even completely at random. The rationale, described here in binary classification terms for simplicity, is that we are less concerned by a big subgroup in which, at any given row, one classifier gets it wrong (or only one gets it right) while the others get it right (wrong), than by a smaller subgroup in which half of the classifiers always get the cases wrong. The subgroup on which half of the classifiers get the cases wrong should be ranked higher. In Table II, toy dataset and its relevant classifiers, we would like to discover first the subgroup .
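The ranking behavior described above can be sketched as follows, assuming the quality of a subgroup is the mean per-row entropy over its rows. The function names, the `min_rows` parameter, the choice of returning negative infinity for undersized subgroups, and the toy matrix are assumptions made for illustration.

```python
import math
from collections import Counter


def row_entropy(row):
    """Base-2 Shannon entropy of the label distribution within one row."""
    n = len(row)
    return sum((count / n) * math.log2(n / count) for count in Counter(row).values())


def row_controversy(matrix, rows, min_rows=2):
    """Mean per-row entropy over the rows of a subgroup.

    Subgroups below the minimum size get -inf so that they are never reported
    (a hypothetical way of enforcing the minimum threshold).
    """
    if len(rows) < min_rows:
        return float("-inf")
    return sum(row_entropy(matrix[i]) for i in rows) / len(rows)


# Toy prediction matrix with four classifiers (columns).
matrix = [
    ["+", "+", "+", "-"],  # rows 0-3: exactly one dissenting classifier per row
    ["+", "+", "-", "+"],
    ["-", "-", "-", "+"],
    ["+", "-", "+", "+"],
    ["+", "+", "-", "-"],  # rows 4-5: the classifiers are split half and half
    ["-", "+", "-", "+"],
]

big = row_controversy(matrix, [0, 1, 2, 3])
small = row_controversy(matrix, [4, 5])
print(round(big, 4), small)
```

The smaller subgroup with the even split scores 1.0 and is ranked before the bigger subgroup with a single dissenter per row, which scores about 0.81.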
IV-B Consistent Classification
In the next scenario, we consider the following objective. We are interested in controversy of a less random nature: a scenario in which a few classifiers consistently differ from the other classifiers. We assume here that the classifiers are consistent in the regions (low entropy per classifier). In the example from Table II, toy dataset , we would like to discover first subgroup or subgroup . This is because all four classifiers are each internally consistent in those regions, while there is disagreement across the four. Notice that, following this intuition, we direct the search to a region in which the per-classifier entropy is low but the mean per-row entropy is high. To this end, we define the following quality measure:
For example, (rounded), , , , and . The use case for this scenario is to identify regions in which a few classifiers behave differently, while limiting the search to regions in which each classifier is consistent.
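A minimal sketch of such a measure follows. It assumes that the quality of a subgroup is the mean of the row-wise entropies minus the mean of the classifier-wise (column-wise) entropies, both computed within the subgroup; the function names and the toy matrix are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Base-2 Shannon entropy of the label distribution in a sequence."""
    n = len(values)
    return sum((count / n) * math.log2(n / count) for count in Counter(values).values())


def consistent_controversy(matrix, rows):
    """Mean row-wise entropy minus mean classifier-wise entropy in a subgroup."""
    sub = [matrix[i] for i in rows]
    columns = list(zip(*sub))  # one tuple of predictions per classifier
    mean_row = sum(entropy(row) for row in sub) / len(sub)
    mean_column = sum(entropy(column) for column in columns) / len(columns)
    return mean_row - mean_column


# Toy prediction matrix with four classifiers (columns).
matrix = [
    ["a", "a", "b", "b"],  # rows 0-1: disagreement, each classifier is consistent
    ["a", "a", "b", "b"],
    ["a", "b", "a", "b"],  # rows 2-3: same row entropy, every classifier flips
    ["b", "a", "b", "a"],
    ["a", "a", "a", "a"],  # rows 4-5: total agreement
    ["a", "a", "a", "a"],
]

print(consistent_controversy(matrix, [0, 1]))  # 1.0
print(consistent_controversy(matrix, [2, 3]))  # 0.0
print(consistent_controversy(matrix, [4, 5]))  # 0.0
```

Under this assumption, a region of total agreement and a region of inconsistent disagreement both score zero, while consistent disagreement scores highest.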
IV-C Consistent Accordance
We next identify controversy of consistent nature, while overcoming the rigidity of , where different predictions over different cases result in high classifier-wise entropy, thus lower rank for the relevant subgroup. Achieving this goal allows us to identify regions where a few classifiers are the negation of the majority. We cannot normally achieve this with , unless the same regions indeed contain the greatest per-row entropy on average. To allow for different predictions per-classifier we move from the prediction space to the accordance space. Thus we first identify the top predicted class per row (most frequently predicted), and then compare it to the prediction. In case of a tie, we choose one of the classes. Thus for every row, , , and then for every classifier , . We next search for interesting regions based on , using the following quality measure:
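The move from the prediction space to the accordance space can be sketched as follows. The sketch assumes that ties are broken in favor of the label seen first in the row, and that the quality measure is the mean row-wise entropy minus the mean classifier-wise entropy; function names and the toy matrix are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Base-2 Shannon entropy of the value distribution in a sequence."""
    n = len(values)
    return sum((count / n) * math.log2(n / count) for count in Counter(values).values())


def accordance_matrix(matrix):
    """True where a classifier agrees with the most frequently predicted label of its row."""
    result = []
    for row in matrix:
        # Ties go to the label encountered first in the row (an assumption).
        top_label = Counter(row).most_common(1)[0][0]
        result.append([prediction == top_label for prediction in row])
    return result


def quality(matrix):
    """Mean row-wise entropy minus mean column-wise entropy of the whole matrix."""
    columns = list(zip(*matrix))
    mean_row = sum(entropy(row) for row in matrix) / len(matrix)
    mean_column = sum(entropy(column) for column in columns) / len(columns)
    return mean_row - mean_column


# The fourth classifier is the negation of the majority in both rows.
matrix = [
    ["a", "a", "a", "b"],
    ["b", "b", "b", "a"],
]
accordance = accordance_matrix(matrix)

print(round(quality(matrix), 4))      # negative: every classifier changes its label
print(round(quality(accordance), 4))  # positive: every classifier is consistent
```

In the prediction space every classifier looks inconsistent because the predicted labels change between the rows; in the accordance space the dissenting classifier is consistently the dissenter, so the same region scores higher.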
IV-D Consistent Correctness
In this scenario and all subsequent ones, we assume the availability of the ground truth, . The availability of the ground truth enables us to attempt to identify hard-to-classify regions, on which few models actually succeed, or the other way around: easy regions, on which a few models consistently fail. We start by collecting the correctness of the predictions, hence for every case and for every classifier , . We then evaluate using the mean of row-wise entropies minus the mean of classifier-wise entropies. Note that also here, once we switch from the output space to the correctness space, the per classifier consistency is of a different nature. Hence, classifiers that are the negation of other classifiers may result in higher ranking for the relevant regions. We use the following quality measure:
Differences between and are possible when there are cases for which the majority classification differs from the true label.
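The difference between the correctness space and the accordance space can be made concrete with a small sketch. The function names and the toy data are hypothetical, and ties in the majority vote are assumed to go to the label seen first.

```python
from collections import Counter


def correctness_matrix(matrix, truth):
    """True where a classifier's prediction equals the true label of the row."""
    return [[prediction == label for prediction in row]
            for row, label in zip(matrix, truth)]


def accordance_matrix(matrix):
    """True where a classifier agrees with the most frequently predicted label of its row."""
    result = []
    for row in matrix:
        top_label = Counter(row).most_common(1)[0][0]  # ties: first label seen (assumption)
        result.append([prediction == top_label for prediction in row])
    return result


matrix = [
    ["+", "+", "+", "-"],  # the majority is right
    ["-", "-", "-", "+"],  # the majority is wrong
]
truth = ["+", "+"]

print(correctness_matrix(matrix, truth))
print(accordance_matrix(matrix))
```

The two matrices coincide on the first row and are each other's negation on the second row, where the majority prediction differs from the true label.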
IV-E Ground Truth as Yet Another Classifier
If we treat the ground truth as yet another classifier, we can evaluate the mean per-row entropy as is done for . What is the effect of adding as an additional classifier? Rows for which most of the classifiers predict correctly now have a lower entropy. Rows for which only a minority of the classifiers predict correctly have a higher entropy. The search for regions for which the mean row-wise entropy is highest therefore finds regions that are hard to predict correctly. We add as an additional classifier, as described above, and also, in another experiment, add as additional classifiers.
Note that is expected to be similar to yet puts some additional emphasis on regions with errors. should take this aspect even further: by matching each classifier’s prediction with a copy of the ground truth, the weight of mistakes, as is reflected in the ranking of the subgroups, should be even higher.
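The effect of appending the ground truth to the prediction matrix can be sketched as follows. The function names and the `copies` parameter are hypothetical; the second variant is assumed to append one copy of the ground truth per classifier, following the description of matching each prediction with a copy.

```python
import math
from collections import Counter


def entropy(values):
    """Base-2 Shannon entropy of the label distribution in a sequence."""
    n = len(values)
    return sum((count / n) * math.log2(n / count) for count in Counter(values).values())


def with_ground_truth(matrix, truth, copies=1):
    """Append the true label of each row as `copies` extra prediction columns."""
    return [list(row) + [label] * copies for row, label in zip(matrix, truth)]


row = ["a", "a", "a", "b"]  # four classifiers, three predict "a"
base = entropy(row)
majority_right = entropy(with_ground_truth([row], ["a"])[0])
majority_wrong = entropy(with_ground_truth([row], ["b"])[0])
print(round(majority_right, 3), round(base, 3), round(majority_wrong, 3))

# With one copy of the truth per classifier, a row on which all four
# classifiers are wrong becomes an even split, the maximum for two labels.
all_wrong = entropy(with_ground_truth([["a", "a", "a", "a"]], ["b"], copies=4)[0])
print(all_wrong)
```

With a single extra column, the row entropy drops when the majority is right and rises when the majority is wrong. With one copy per classifier, a unanimous but wrong row moves from entropy zero to the maximum, which illustrates the heavier weight placed on mistakes.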
IV-F Relative Average Subranking Loss
The last scenario in this paper is applicable to binary classification only. We adapt the existing SCaPE model class for EMM to identify regions that are exceptionally hard or easy to predict. By examining , we calculate the empirical probability of predicting the positive class per row. Thus for every row ,
We obtain therefore a soft classifier, , to be contrasted with the ground truth , gauged with the quality measure used in SCaPE.
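The construction of the soft classifier can be sketched as follows, assuming the empirical probability of a row is the fraction of classifiers that predict the positive class. The pair count at the end is a simplified stand-in for a ranking-based comparison with the ground truth and is not the SCaPE quality measure itself; all names and the toy data are hypothetical.

```python
def positive_rate(matrix, positive="+"):
    """Per row, the fraction of classifiers that predict the positive class."""
    return [sum(prediction == positive for prediction in row) / len(row)
            for row in matrix]


def misranked_pairs(scores, truth, positive="+"):
    """Number of (positive, negative) pairs where the negative case scores strictly higher."""
    positives = [s for s, label in zip(scores, truth) if label == positive]
    negatives = [s for s, label in zip(scores, truth) if label != positive]
    return sum(1 for p in positives for n in negatives if n > p)


matrix = [
    ["+", "+", "+", "+"],
    ["+", "+", "-", "-"],
    ["+", "-", "-", "-"],
    ["-", "-", "-", "-"],
]
truth = ["+", "-", "+", "-"]

scores = positive_rate(matrix)
print(scores)                          # [1.0, 0.5, 0.25, 0.0]
print(misranked_pairs(scores, truth))  # 1
```

Here the second row (a true negative with score 0.5) is ranked above the third row (a true positive with score 0.25), which is the single misranked pair.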
We illustrate the workings of the Controversy Rules model class for EMM by experimenting on the following classifiers: Decision Tree, Naïve Bayes, 3-Nearest Neighbors [13], Random Forest, and Support Vector Machine with linear kernel. The choice of classifiers is purely for illustrative purposes and should not be confused with the core contribution of this paper: we provide a method to find regions of controversy between classifiers, which we illustrate with this selection of well-known classifiers (which should not be taken as an endorsement of the classifiers themselves). We obtain the predictions by running 10-fold cross-validation for each of the model classes. Hence, technically, each prediction column is created by 10 different classifiers; the perceived classifiers are virtual, and have never existed. We mention this for the benefit of reproducibility; how the predictions were obtained is not fundamental to the core contribution of this paper.
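How one prediction column arises from several fold-specific models can be sketched in a few lines. The sketch assumes a simple interleaved assignment of cases to folds and uses a trivial majority-label learner as a placeholder for a real model class; `out_of_fold_predictions`, `train_majority`, and the toy data are hypothetical.

```python
from collections import Counter


def train_majority(train_cases, train_labels):
    """Placeholder learner: always predicts the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda case: majority


def out_of_fold_predictions(cases, labels, train, k=10):
    """Predict every case with a model that was trained without that case's fold."""
    n = len(cases)
    predictions = [None] * n
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = train([cases[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = model(cases[i])
    return predictions


cases = list(range(20))            # toy cases; the attribute values are irrelevant here
labels = ["a"] * 12 + ["b"] * 8    # toy ground truth
column = out_of_fold_predictions(cases, labels, train_majority, k=10)
print(column)
```

The returned list is one column of the prediction matrix. Ten different models contributed to it, one per fold, which is why the classifier behind the column is a virtual one.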
We run the experiments on the eight datasets listed in Table III. Most are taken from the UCI ML repository. The Titanic dataset is taken from Kaggle (https://www.kaggle.com/c/titanic/data), and Pima-indians (which is no longer available in the UCI ML repository) can also be accessed there (https://www.kaggle.com/uciml/pima-indians-diabetes-database/data). The YearPredictionMSD dataset comes with a natural built-in regression task (predicting the year in which a song was released). We define our own classification task on this dataset, converting the year into decades (the floor of the year divided by 10 is taken as the true label). Some of the datasets are suitable for binary classification tasks (Mushroom, Titanic, Adult, Pima-indians), while others contain more than two classes, although these are sometimes ordinal in nature (Balance-scale, Car, YearPredictionMSD).
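The conversion of release years into decade labels amounts to an integer division. A minimal sketch, with a hypothetical function name and example years:

```python
def decade_label(year: int) -> int:
    """Floor of the year divided by 10, used as the class label."""
    return year // 10


for year in (1987, 1990, 2009):
    print(year, "->", decade_label(year))
```

All years from 1980 to 1989 thus share the label 198, and the resulting classes are ordinal.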
To discover subgroups, we employ the Beam Search algorithm for Exceptional Model Mining, as described in [7, Algorithm 1]. The parameters are set as follows: beam width , search depth . To avoid tiny subgroups, we require a minimum support of of the cases.
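The search can be sketched as a generic level-wise beam search. This is a simplified illustration and not Algorithm 1 of [7]: it only handles equality conditions on nominal attributes, and the function names, parameter defaults, toy records, and per-row scores are all assumptions.

```python
from statistics import mean


def cover(description, records):
    """Indices of the records that satisfy every (attribute, value) condition."""
    return [i for i, record in enumerate(records)
            if all(record[attribute] == value for attribute, value in description)]


def beam_search(records, quality, beam_width=2, depth=2, min_support=2):
    """Refine the best descriptions of each level by one extra condition."""
    conditions = sorted({(a, v) for record in records for a, v in record.items()})
    beam, scored = [()], {}
    for _ in range(depth):
        candidates = {}
        for description in beam:
            used = {attribute for attribute, _ in description}
            for condition in conditions:
                if condition[0] in used:
                    continue  # at most one condition per attribute
                refined = tuple(sorted(description + (condition,)))
                if refined in scored or refined in candidates:
                    continue
                rows = cover(refined, records)
                if len(rows) >= min_support:  # skip tiny subgroups
                    candidates[refined] = quality(rows)
        scored.update(candidates)
        beam = sorted(candidates, key=candidates.get, reverse=True)[:beam_width]
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Toy data: descriptive attributes per case, and a per-row controversy score.
records = [
    {"sex": "m", "cls": "3"}, {"sex": "m", "cls": "3"}, {"sex": "m", "cls": "1"},
    {"sex": "f", "cls": "3"}, {"sex": "f", "cls": "1"}, {"sex": "f", "cls": "1"},
]
row_scores = [1.0, 1.0, 0.8, 0.0, 0.0, 0.0]

ranked = beam_search(records, lambda rows: mean(row_scores[i] for i in rows))
best_description, best_quality = ranked[0]
print(best_description, best_quality)
```

On this toy input the top description is the conjunction of `cls = 3` and `sex = m`, which covers the two rows with the highest scores. Any of the quality measures discussed above could be passed as the `quality` argument, since the search only needs a function from covered rows to a number.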
On the Mushroom dataset, four out of five classifiers predicted almost all test cases correctly (Naïve Bayes has 216 false positives and 3 false negatives, k-Nearest Neighbors has 2 false negatives, and the other three classifiers do not make errors). The prediction matrix is displayed in Figure 1. The classifiers are ordered from left to right as they are listed for the experiment. The cases, or rows, are ordered by the predictions of the classifiers, lexicographically from left to right. As can be seen, one classifier (Naïve Bayes in this case) predicts differently from the rest for numerous cases, while the others almost always agree. The same descriptions, in the same order, are also reported by and by . This is expected, as errors and disagreements here occur in the same cases.
did not find anything interesting, as the errors made (by the Naïve Bayes classifier) consistently mistake a negative for a positive in a few of the cases; hence the probability for those cases indeed lies between the 0 for most of the negative cases and the 1 for most of the positive cases.
Table V lists the descriptions reported by the quality measure. The top subgroup is the same as the one found with , but subsequent subgroups differ. To illustrate the difference, Figure 2 displays two prediction matrices: one (Figure 2(a)) for the top subgroup for both measures, and one (Figure 2(b)) for the new description . Comparing those two subgroups, we note that, because one classifier, the Naïve Bayes, is more consistent when there are fewer negative cases, the relevant descriptions are ranked higher with .
Table VI lists the subgroups found with . The top description is illustrated in Figure 3. We see that some more positive cases are included. The internal accordance of the classifiers is intact by adding those cases, and this subgroup is more interesting than the top one reported by , if taking into account also the consistent accordance. The descriptions reported by are the same as those reported by . This is not surprising, since the majority of the classifiers get all the cases correct.
The task for the Balance-scale dataset is classification with three possible classes: L for left, B for balanced, and R for right. The dataset represents a scale, where on both the left and the right side a single weight is placed at a single spot. For both the weight and the distance from the spot to the center of the balance, integer unit values between one and five can be chosen. Hence, there are $5^4 = 625$ configurations. The underlying physical law states that the scale is in balance if and only if the product of weight and distance on the left side equals the product of weight and distance on the right side.
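The configuration space is small enough to enumerate. The sketch below assumes the label is determined by comparing the product of weight and distance on each side, with L when the left product is larger; the function and variable names are hypothetical.

```python
from collections import Counter
from itertools import product


def scale_label(left_weight, left_distance, right_weight, right_distance):
    """L, B, or R, depending on which side has the larger weight * distance."""
    left = left_weight * left_distance
    right = right_weight * right_distance
    if left > right:
        return "L"
    if left < right:
        return "R"
    return "B"


# All combinations of four variables, each taking integer values from 1 to 5.
configurations = list(product(range(1, 6), repeat=4))
counts = Counter(scale_label(*configuration) for configuration in configurations)

print(len(configurations))                    # 625
print(counts["L"], counts["B"], counts["R"])  # 288 49 288
```

Only 49 of the 625 configurations are balanced, so B is a small minority class that sits exactly on the boundary between L and R.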
One can intuitively understand that, assuming the classifiers do not have access to the exact mechanism, higher confusion can be found near the decision boundaries and around the balanced state. Naïve Bayes is expected to have some difficulties, as its assumption of independence among the conditional probabilities conflicts with the underlying multiplicative physical law (as just outlined). Naïve Bayes resolves this problem by simply ignoring all the B cases. This is surprisingly effective, compared with the other classifiers. A view of the predictions is displayed as a parallel coordinates plot in Figure 4, where the true labels are also included (far right).
Table VII lists the subgroups found with . The top description restricts three of the four variables: a small weight on the left side, and a large weight at a small distance on the right side. This is indeed a volatile situation, where a small change in any of the remaining choices will cause the scales to tip over. Hence, it makes sense that classifiers disagree.
Top subgroups for and are listed in Tables VIII and IX, respectively. This stepwise increase in the importance of the true label can be expected to affect the ranking of the top subgroups. Indeed we see the description appearing in the fourth place in , where it is not reported by , and then it climbs to the top in . In Figure 5 we contrast the top subgroup for (a) and the top subgroup for (b).
reports descriptions for which the quality measure is 0. This value is calculated from , that is, 0 for the mean per-row entropies and 0 for the per-classifier entropy; an example is , which always results in the R true label, as can be seen in Figure 6. The quality measure score for the whole dataset is . These descriptions correspond to total agreement among the classifiers, which is not exactly what we are looking for, yet is also interesting. To have descriptions for which the measure is bigger than 0, the mean per-row entropies should be higher than the mean per-classifier entropies. Similar reports are given by and by .
is not applicable here as we have more than two labels.
The task for the Titanic dataset is binary classification (life or death), based on attributes known about the passengers of the Titanic. The famous ship picked up passengers at three ports. Some of the passengers traveled alone, while others traveled with family members. There were three classes of cabins with different price levels. In general, once the ship hit the iceberg, children and women were offered a place in a lifeboat before the other passengers. Unfortunately, there were not enough boats for everyone.
Tables X, XI, and XII list the top subgroups found with , , and , respectively. As can be seen in Figure 7, the top description for picks out a region where the Naïve Bayes classifier predicts mostly Survived. This is also the case for the top description for , which is ranked in the second place for . By contrast, ranks other descriptions first.
Tables XIII and XIV list subgroups found when maximizing and minimizing, respectively, . The top description from maximizing is . SVM has the highest accuracy in this region by predicting all those 385 cases as Died, while 44 passengers did survive (cf. Figure 8). All attempts by the other classifiers to identify the survivors result in many false positives. The top description when minimizing , , refers to cases for which the age is unknown. The Decision Tree and the Random Forest classifiers achieve the top accuracy of in this region, by correctly predicting 39 Survived cases out of 52. Other models also correctly predict most of the positive cases. We have maximized and found a region for which the SVM classifier achieves the highest accuracy of , and we have minimized and found a region for which the DT and RF classifiers achieve the highest accuracy, this time only , which is lower than . We therefore note that, with this setting of classifiers’ predictions, is not discriminating subgroups based on individual classifiers’ accuracy, but rather based on whether the collection of classifiers, if used as a voting ensemble, correctly ranks the cases. As a voting ensemble, the classifiers thus do a better job for than for .
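The counts quoted for the first region allow a quick arithmetic check. The sketch assumes that accuracy is the number of correct predictions divided by the number of cases; the variable names are hypothetical. For the second region only the survivor counts are quoted, so only the fraction of survivors that is correctly identified can be derived, not the overall accuracy.

```python
# First region: all 385 cases are predicted as Died, while 44 passengers survived.
region_size = 385
survivors = 44
svm_accuracy = (region_size - survivors) / region_size
print(round(svm_accuracy, 4))  # 0.8857

# Second region: 39 of the 52 Survived cases are predicted correctly.
survivor_recall = 39 / 52
print(survivor_recall)  # 0.75
```

Predicting the majority class everywhere already yields roughly 89% accuracy in the first region, which is why attempts to single out survivors can lower accuracy there.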