A classification problem consists of predicting the value of a qualitative response variable for one or more individuals, making use of the values we know of certain variables (features) of such individuals. Those predictions are based on the knowledge obtained through a training sample of individuals whose values of the features and of the response variable are known.
Classification problems can be addressed by using machine learning techniques. Numerous classifiers have been proposed and analysed in the machine learning literature (see, for example, FernandezDelgado2014). Many of them, in addition to classifying, allow the assessment of the importance that the various features have had in the classification of a particular individual. In Kononenko2010 a general assessment procedure is introduced; it is general in the sense that it is model-agnostic, i.e., valid for any classifier. Strumbelj and Kononenko’s procedure is based on the Shapley value for cooperative games (Shapley1953). The Shapley value is a rule for distributing the profits generated by a collection of cooperating agents. This value has multiple applications in very diverse fields (see, for example, Moretti2008), including data science; an application of the Shapley value in data science can be seen in GarciaJurado2004.
In machine learning, explaining a model and understanding its behaviour is one of the most important tasks. Works such as Kononenko2010 provide local interpretability of the classifiers involved, that is, explanations are made only for a specific instance. Nevertheless, in recent years, globally interpretable machine learning models have been developed, allowing users to understand the process for getting a particular prediction (see, for example, Wang2015).
In the machine learning literature there are other papers that study the influence of features on classification. Several recent works use other approaches or target specific classification techniques, and applications to health problems are often of great interest. Ghaddar2018 introduce a new iterative approach to address the problem of feature selection within support vector machine classification and apply it to a case of medical diagnosis of tumors. Agor2019 propose a two-level programming approach to a feature selection problem for classification and develop a solution based on a new genetic algorithm. They implement the proposed framework in a case study where they distinguish between good and poor quality colposcopy images. Anaraki2019 is the first work to report on using perturbation theory in feature selection. The authors prove that perturbation theory can detect correlations between features. Benitez2019 propose a feature selection procedure based on mathematical optimisation integrated into the support vector machine classification procedure, which accommodates asymmetric misclassification costs. Cura2020 proposes a method that jointly optimises the classification and the selection of features. The study uses multiple local search algorithms that run in parallel and share information. This collection of related works shows how the problem of feature selection in classification is often approached by means of techniques taken from operational research, which is the discipline where cooperative game theory can also be placed.
However, probably the closest study to the subject of our research is Datta2015. In that paper, the authors study how influential the various features are in a classification problem when the response variable is binary and the classifier returns, for each individual, the value of the response variable. In this paper we consider the more general problem in which the response variable may not be binary and, in addition, the predictor returns for each individual the probability that the response variable takes each of its possible values. Our approach is significantly more generic and therefore more applicable. In any case, the methodology developed here is completely different from that of Datta2015, even though it also makes use of cooperative game theory.
The purpose of this paper is also related to Strumbelj and Kononenko’s procedure, but it is essentially different because it is not locally oriented. We do not attempt to evaluate the influence of each feature on the classification of a particular individual, but rather to evaluate the influence of each feature on the value taken by the response variable. Specifically, we introduce an influence measure that assesses the influence of each feature in the classification of all individuals whose prediction of the response variable turns out to be a pre-set one. Moreover, an axiomatic characterisation of the introduced measure is provided, by using properties of efficiency and balanced contributions. This last property is based on the principle of reciprocity, as introduced by Myerson1980, which is often used in the literature on the Shapley value. Myerson’s property of balanced contributions asserts that, for any two players, the gain or loss to each player when the other “leaves” the game should be equal.
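For reference, the Shapley value and the two properties mentioned above can be written out for a generic cooperative game $(N, v)$ with $v(\emptyset) = 0$. The symbols $N$, $v$ and $\varphi$ are generic textbook notation assumed here for illustration; they are not necessarily the notation used in Section 2.

```latex
% Shapley value of player i in the game (N, v)
\varphi_i(N, v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,(|N| - |S| - 1)!}{|N|!}\,
  \bigl( v(S \cup \{i\}) - v(S) \bigr)

% Efficiency: the whole worth of the grand coalition is distributed
\sum_{i \in N} \varphi_i(N, v) \;=\; v(N)

% Balanced contributions, for all i, j \in N with i \neq j,
% where (N \setminus \{j\}, v) is the restriction of v to subsets of N \setminus \{j\}
\varphi_i(N, v) - \varphi_i(N \setminus \{j\}, v)
  \;=\; \varphi_j(N, v) - \varphi_j(N \setminus \{i\}, v)
```

Read this way, balanced contributions says that what player $i$ gains or loses when $j$ leaves equals what $j$ gains or loses when $i$ leaves.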
We illustrate our purpose with an example taken from the analysis of the evolution of COVID-19 patients. In the fight against a pandemic it is very important to be able to detect groups at risk in the face of various eventualities. For instance, knowing the characteristics of individuals that significantly increase their probability of decease in the event of infection can be useful for health authorities to make appropriate decisions. Therefore, once various features that can influence the mortality of a patient are selected, it would be valuable to know how to evaluate the influence of each of them on such mortality. In addition to mortality there are other eventualities that may be relevant, such as the need for hospitalisation, the need for admission to intensive care units, the need for access to certain health infrastructures, etc.
The organization of this paper is as follows. Section 2 presents the measure of influence and provides its axiomatic characterisation. In Section 3, various experiments are carried out in order to check the performance of the methodology introduced. Finally, in Section 4, based on the data from a sample of COVID-19 patients, we detect features which affect the mortality or hospitalisation of a patient and evaluate their influence.
2 Assessing Influence in Classification
In a classification problem we have a vector of features and a qualitative response variable . denotes the set of indices of the features. Each feature takes values in a finite set and takes values in a finite set . We also have a training sample , where and are the observed values of the features and the response variable corresponding to individual . A classification problem is thus characterised by a triplet .
A classifier trained with sample is a map that assigns to every (an observation of ) a probability distribution over , i.e., with , for all , and . Each is the estimated probability that an individual whose observed values of the features are given by belongs to group of the response variable . Now we provide the main definition in this section.
An influence measure for is a map that assigns to every , , and a vector . The vector provides an evaluation of the influence that each feature , with , has on whether the response takes the value when takes the value and we only take into account the features .
Section 4 illustrates the interest of having such a measure. For instance, we use it to assess various risk features on the decease of COVID-19 patients.
We aim to introduce a sensible influence measure. Let us see some properties that are considered desirable. The first property is related to rationality. It states that, if there is only one feature, then the measure of the influence that the feature has on whether the response variable takes a certain value is the probability that the classifier we use assigns to that value. Formally, the following rationality property is established.
-Rationality. An influence measure satisfies -Rationality if, for every with , every , and , it holds that
Notice that we define the influence measure for but we allow the possibility of ignoring some features, i.e., the measure evaluates each with possibly being different from . A natural extension of the -Rationality is the following property that states that, when we ignore all features except one, then the measure of the influence that this feature has on whether the response variable takes a certain value is the expected probability that the classifier we use assigns to that value. Formally, an influence measure should satisfy the following property.
General -Rationality. An influence measure satisfies the property General -Rationality if, for every and every , and , it holds that
where denotes the subsample of formed by the observations with and , and denotes the size of the subsample . Moreover, , , and denote the restrictions of , , and to the variables of , respectively (for all ).
Again, a natural extension of the General -Rationality results when we substitute by in expression (1). This gives rise to the following efficiency property, that establishes that the total influence of a set of features is the expected prediction provided by the classifier when, for every , we fix the values of the features in according to .
-Efficiency. An influence measure satisfies -Efficiency if, for every , every , , and (), it holds that
The last property considered is a fairness property that treats all features in a balanced way. Informally, it states that, given two features, the effect that ignoring one has on the measure of the influence of the other is identical for both features.
Balanced Contributions. An influence measure satisfies Balanced Contributions if, for every , every , , (), and with ,
Now we state and prove the main result from this section. It provides a characterisation and a formal expression of an influence measure that satisfies all the properties introduced above.
There is only one influence measure for which satisfies the properties of -Efficiency and Balanced Contributions. For all , , , and , it is given by
where denotes the Shapley value, denotes the game with set of players given by
for all , and denotes the restriction of the game to the subsets of .
For each instance and each coalition , the characteristic function in displays the prediction provided by the classifier when the values of the features in are set to those in . In Kononenko2010, a cooperative game of difference of predictions is defined, whose characteristic function assigns to each coalition the value of the characteristic function in minus the expected prediction if no feature values are set. The mathematical properties of our game and that of Strumbelj and Kononenko are essentially the same.
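The construction can be illustrated with a minimal Python sketch, under stated assumptions. The function names (`make_game`, `shapley`, `toy_predict`), the toy classifier and the toy sample are hypothetical; averaging over the whole sample and setting the worth of the empty coalition to 0 are simplifying assumptions of this sketch, and the paper's exact subsample restriction is not reproduced here.

```python
from itertools import combinations, product
from math import factorial


def make_game(predict, sample, a, b):
    """Return v(S): mean predicted probability of class b over the sample
    when the features in S are fixed to the values in a.
    Assumption of this sketch: v(empty coalition) = 0."""
    def v(S):
        if not S:
            return 0.0
        total = 0.0
        for x in sample:
            # Features in S take the values of a; the rest keep the observed values.
            z = tuple(a[j] if j in S else x[j] for j in range(len(x)))
            total += predict(z)[b]
        return total / len(sample)
    return v


def shapley(players, v):
    """Exact Shapley value of every player in the game (players, v)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [j for j in players if j != i]
        value = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                S = frozenset(S)
                value += weight * (v(S | {i}) - v(S))
        phi[i] = value
    return phi


def toy_predict(z):
    """Hypothetical classifier: feature 0 matters most, feature 2 not at all."""
    p1 = 0.1 + 0.6 * z[0] + 0.2 * z[1]
    return {0: 1.0 - p1, 1: p1}


sample = list(product((0, 1), repeat=3))   # toy sample: all binary triples
T = (0, 1, 2)                              # features taken into account
v = make_game(toy_predict, sample, a=(1, 1, 0), b=1)
phi = shapley(T, v)                        # share of v(T) attributed to each feature
```

In this toy run the shares add up to `v(frozenset(T))`, which is the efficiency property, and removing one feature changes the share of another by the same amount in both directions, which is balanced contributions. Exact enumeration visits every coalition, so it is only practical for a small number of features.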
Proof of Theorem 2.
Existence. To show that satisfies -Efficiency, let , , and . Shapley1953 proves that the Shapley value of cooperative games satisfies an efficiency property. In our case, this property implies that
Applying this result we obtain that:
To show that satisfies Balanced Contributions, let , , , and with . Myerson1980 proves that the Shapley value of cooperative games satisfies a property of balanced contributions. In our case, this property implies that
Applying this result we obtain that:
Uniqueness. We show uniqueness by induction on the size of . Suppose that and are two influence measures satisfying -Efficiency and Balanced Contributions. If , by -Efficiency,
Assume now that for all with . Then by Balanced Contributions, for all , ,
From the property of -Efficiency of it immediately follows that the amount is
Notice that belongs to and it can be interpreted as an estimate of the probability that the corresponding response of an individual with characteristic falls in group . Also is the part that corresponds to the feature of the distribution between the features of the quantity given by the number . In this way, the evolutions of the numbers and are very illustrative of the influence of the various values of on taking the value . For example, if for and we observe that is close to and that the latter is close to , we can conclude that individuals with feature equal to have a high probability of being classified in and that this is mainly due to feature .
3 Empirical results
In this section we show the performance of the proposed influence measure (2) by means of a computational study. Two different experiments have been carried out using the software R. The objective of these simulations is to corroborate that the results obtained by the methodology introduced in the current work are in accordance with the expected ones. The classifier used in this paper is Breiman’s random forest classifier (Breiman2001), implemented in Weka (http://www.cs.waikato.ac.nz/ml/weka) and used through RWeka (https://cran.r-project.org/web/packages/RWeka/index.html). This choice is motivated by the excellent results of random forest type classifiers (see, for example, FernandezDelgado2014). The code was run on a quad-core Intel i7-8665U CPU with 16GB of RAM.
The procedure adopted in both experiments is as follows. We start from a sample of individuals whose attributes and response are known, . This sample is then used to train a previously chosen classifier, obtaining . The purpose of this work is not to classify new instances, but to study the influence of features on classification. In accordance with Remark 4, to evaluate the influence of feature on the response taking the value , the quantities and are computed and analysed for all .
For the first experiment, a sample of instances with four binary features was generated. Such attributes take the values and with probability (hence, ). In half of the instances, the class of the response variable is determined by the value of the attribute (i.e., ); while in the remaining instances, it is feature that defines the class (i.e., ); note that . The next step is to select those observations whose assigned class was . Afterwards, for each attribute , and each of its possible values, we study the influence that such feature had on the response when it took such value. Since the procedure by which the class has been generated is known, it is evident that the influence of attributes and should be independent of their values. Furthermore, the value 1 for features and should have a stronger influence in the classification than the value 0. Table 3.1 and Figures 3.1, 3.2, 3.3 and 3.4 present the results obtained for this simulation, whose runtime was minutes.
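The data-generating scheme of this first experiment can be sketched as follows. This is an illustration in Python rather than the R/RWeka setup used in the paper, and several specifics are assumptions because they are not legible above: the sample size (4000 here), the probability 0.5 for each feature value, and the choice of the first two features (indices 0 and 1) as the ones that determine the class. The check at the end only compares raw class-1 rates per feature value; it is not the influence measure itself.

```python
import random


def generate_sample(n, seed=0):
    """Generate n instances with four binary features.
    Assumptions of this sketch: each feature is 0 or 1 with probability 0.5;
    the class copies feature 0 in the first half of the instances
    and feature 1 in the second half."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        x = tuple(rng.randint(0, 1) for _ in range(4))
        y = x[0] if i < n // 2 else x[1]
        rows.append((x, y))
    return rows


def class_one_rate(rows, feature, value):
    """Fraction of instances with class 1 among those where feature == value."""
    ys = [y for x, y in rows if x[feature] == value]
    return sum(ys) / len(ys)


rows = generate_sample(4000)
rates = {(j, val): class_one_rate(rows, j, val)
         for j in range(4) for val in (0, 1)}
```

Under these assumptions, the class-1 rate should be clearly higher when one of the two driving features equals 1 than when it equals 0, and roughly the same for both values of the two remaining features, which is the qualitative pattern the experiment is designed to recover.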
Indeed, it can be seen that for attributes and the value is closer to when , and that the former is near . However, such a difference cannot be observed for features and , implying that new instances have the same probability of being classified as or whatever their values for and are.
The second experiment differs from the previous one in the procedure to assign the class to the instances. The response is now generated as a binary vector which takes the values 0 and 1 with probability 0.5, independently of the attributes. The goal of this simulation is to show that the influence of the features in the classification of the instances with response does not depend on the features’ values. Table 3.2 and Figures 3.5, 3.6, 3.7 and 3.8 present the results obtained for this simulation. The computational time was minutes.
Again, the outcomes are as expected: for each feature, there are barely any differences in the values and when changes.
In view of the previous results, our methodology seems to be appropriate to study the influence that the different features’ values have on the classification of individuals. Since the experiments are satisfactory, this analytic tool is ready to be applied to real-life problems. Consequently, the procedure has been applied to a real dataset concerning COVID-19 patients, and the results are presented in the next section.
4 Application to the classification of COVID-19 patients
This section analyses a database of 10,454 patients from Galicia (a region in the northwest of Spain) infected with COVID-19 from March 6, 2020 to May 7, 2020. The objective is to study the influence of various patients’ characteristics on three binary response variables of special interest: the need for hospitalisation, the need for ICU admission, and the eventual decease. The emphasis is not on the predictive classification of new patients, but on analysing which characteristics influenced a positive response in the indicated binary variables among the patients whose complete history is known.
The features or attributes (categorical variables) which have been considered in this study are the following:
Age: 0 (0-49 y/o), 1 (50-64 y/o), 2 (65-79 y/o), 3 (80 y/o and over).
Sex: 0 (woman), 1 (man).
Cardiovascular Diseases: 0 (without diseases), 1 (mild diseases), 2 (severe diseases: ischemia with angina, infarction, stroke).
Respiratory Diseases: 0 (no diseases), 1 (mild diseases), 2 (severe diseases: malignancy, COPD, pneumonia).
Metabolic Diseases: 0 (no diseases), 1 (mild diseases), 2 (severe diseases: malignancy, insulin-dependent diabetes).
Urinary Diseases: 0 (none or mild diseases), 1 (severe diseases: malignancy, kidney failure).
The binary response variables considered in this application are:
Decease (exitus): 0 (no), 1 (yes).
ICU admission: 0 (no), 1 (yes).
Need for Hospitalisation: 0 (no), 1 (yes).
It is interesting to note that the methodology developed in Section 2 can be used to evaluate the influence not only of the features chosen, but also of other characteristics present in the database, on the aforementioned response variables or on others that may be considered of interest. Some of these specific attributes could be asthma, COPD, hypertension, obesity or diabetes.
Given the satisfactory results obtained in Section 3 for the experiments described, we will make use of the previous procedure to estimate the influence of the features in the classification with respect to a binary response variable (which takes values in ). For instance, the interest would reside in selecting those individuals who resulted in decease (that is, ) when our purpose is to know which attributes are the most influential for the exitus. Note that to estimate the influence of feature on , we use the influence that has in the classification of the elements of the sample using an excellent classifier, since it is precisely trained with the sample . As in the previous section, we use the random forest classifier introduced by Breiman2001 and implemented in R through the RWeka library.
In order to facilitate the reading of this document, a detailed study of the influence of the feature age on the three classification problems is presented. The results obtained for the remaining attributes are left to Appendix A. Let be the set of the features. Table 4.1 shows, for each category of the feature age, the frequency of such category in the sample (in parentheses), the estimated probability that an individual with age in the considered category will die (in the Classification row) and the evaluation of the influence that age really has on that prediction (in the Influence row). Formally, the Classification row contains the values (see Remark 4) and the Influence row contains the values (see Definition 1 and Theorem 2). Note that for each age category (that is, when ), the influence is the share that corresponds to age after distributing the classification among the features. Thus, the closer the influence is to the classification for a certain value of the feature considered, the higher the relative influence of such feature on the classification. On the other hand, the higher the classification for a certain value of the considered feature, the higher the probability of a decease taking place if an individual presents that feature’s value.
Table 4.1: age - decease
| 0 (6) | 1 (35) | 2 (132) | 3 (371) |
In Figure 4.1 the numbers from Table 4.1 are represented for a better visualisation of their joint evolutions. In this case, it can be observed that the influence of age on a patient’s decease grows as we move forward in the age ranges but, in any case, such influence only seems relevant for the third age bracket (from y/o onwards); the influence is negative for those individuals under years old. It is observed that the estimated probability of decease is slightly lower for age range 2 than for age group 1, contrary to what might be expected; however, taking into consideration that the influence of age on decease in these two age groups is very low, the difference in such estimated probability is likely due to a greater presence of other risk features in this specific sample for individuals in range 1 than for individuals in range 2. In any case, what this study suggests is that a clear influence towards a greater probability of decease occurs in individuals over y/o and not in other age groups.
For the classification problem of ICU admission, Table 4.2 and Figure 4.2 show an evident drop in the Classification and Influence of age between the individuals of ranges 2 and 3. In fact, the estimated probability of being admitted to the ICU is the lowest for patients older than 80 years old. This could be related to the fact that those individuals are the ones who have a higher probability of passing away, as we have previously studied, even before their admission to the ICU.
Table 4.2: age - ICU
| 0 (29) | 1 (75) | 2 (170) | 3 (10) |
In the case of need for hospitalisation, the Influence and Classification increase as we advance through the first three age groups. However, there are barely any differences between the last two age ranges, as presented in Table 4.3 and Figure 4.3. The influence of age on being hospitalised is slightly lower for those over 80, even though the estimated probability is somewhat higher for these patients.
Table 4.3: age - hospitalisation
| 0 (292) | 1 (485) | 2 (923) | 3 (793) |
The results obtained for the feature age have been analysed in detail in order to offer readers guidelines for interpretation. The outcomes for the rest of the attributes are presented in Appendix A, to avoid overloading this document.
The authors are grateful to Ricardo Cao Abad and to the Dirección Xeral de Saúde Pública of the Xunta de Galicia in Spain. This work has been supported by the ERDF, the Government of Spain/AEI [grants MTM2017-87197-C3-1-P and MTM2017-87197-C3-3-P]; the Xunta de Galicia [Grupos de Referencia Competitiva ED431C-2016-015 and ED431C-2017/38, and Centro Singular de Investigación de Galicia ED431G/01]; and by the collaborative research project of the IMAT “Mathematical, statistical and dynamic study of the epidemic COVID-19”, subsidized by the Vice-Rector’s Office for Research and Innovation at the University of Santiago de Compostela, Spain. The research of Laura Davila-Pena has been funded by the Government of Spain [grant FPU17/02126].