Is AI Model Interpretable to Combat with COVID? An Empirical Study on Severity Prediction Task

by   Han Wu, et al.

The black-box nature of many high-accuracy models hinders their deployment in medical diagnosis. Putting one's life in the hands of models that medical researchers do not trust is irresponsible. However, when it comes to understanding the mechanism of a new virus such as COVID-19, machine learning models may catch important symptoms that medical practitioners overlook amid the surge of infected patients caused by a pandemic. In this work, the interpretation of machine learning models reveals that a high CRP corresponds to severe infection and that severe patients usually suffer a cardiac injury, which is consistent with medical knowledge. Additionally, through the interpretation of machine learning models, we find that phlegm and diarrhea are two important symptoms whose absence indicates a high risk of turning severe. These two symptoms were not recognized at the early stage of the outbreak, but our findings were later corroborated by autopsies of COVID-19 patients. We also find that patients with a high NTproBNP have a significantly increased risk of death, which did not receive much attention initially but was confirmed by a follow-up study. Thus, we suggest that interpreting machine learning models can help in understanding a new virus at the early stage of an outbreak.




I Introduction

The sudden outbreak of COVID-19, a communicable disease with strong infectivity, brings an unprecedented impact worldwide. With more than 18 million confirmed cases as of mid-August, the pandemic is still accelerating globally. The disease is transmitted by inhalation or contact with infected droplets and the incubation period ranges from 2 to 14 days [1], making it more infectious and difficult to contain and mitigate.

The rapid transmission of COVID-19 has strained medical resources in many countries. To relieve the pressure on healthcare workers, various diagnostic and predictive models have been developed. For instance, a deep learning model for diagnosis using chest CT images detects abnormalities and extracts key features of the altered lung parenchyma, and transfer learning is employed to train the model because COVID-19 datasets are scarce. However, deep neural networks are not interpretable due to their complexity, which has prevented many high-performance models from being fielded in healthcare. There are also intelligible models for prediction and readmission that use generalized additive models with pairwise interactions [3]. Such a model is intelligible and modular, so patterns that contradict medical knowledge can be recognized and removed. The method scales to large datasets containing hundreds of thousands of patients and thousands of attributes while remaining intelligible and providing accuracy comparable to the best (unintelligible) machine learning methods, but it is not well suited to more complex problems. To maintain both interpretability and complexity, DeepCOVIDNet is proposed for predictive surveillance; it identifies the most influential features for predicting the growth of the pandemic [4]. This is achieved by combining two modules, an embedding module and a DeepFM [5] module. The embedding module takes various heterogeneous feature groups as input and outputs an equidimensional embedding for each feature group. These embeddings are then passed to the DeepFM module, which computes second- and higher-order interactions between them.

Employing machine learning models enables fast diagnosis and a more reasonable distribution of medical resources according to the predicted severity in different areas. However, models with high accuracy may not provide explanations for their outputs because of the trade-off between accuracy and interpretability: more accurate models usually provide fewer interpretations [6]. In healthcare, being able to understand and validate a model's output is important for the model to be trusted as safe, non-discriminatory, and robust to adversarial attacks [7]. For application in healthcare, the Multi-tree XGBoost algorithm is proposed to identify the most important indicators in COVID-19 diagnosis. This method exploits the recursive tree-based decision system of the model to achieve great interpretability, and identifies LDH, hs-CRP, and lymphocytes as three important indicators for COVID-19 prognostic prediction, which is consistent with our research. Another model, a ResNet-50-based convolutional neural network (CNN), discriminates COVID-19 from non-COVID-19 cases using chest CT [9]. Its interpretability is achieved by applying gradient-weighted class activation mapping to produce a heat map that shows where in the image the CNN is looking, making it possible to verify visually that the model is performing correctly.

In addition, many model-agnostic methods have been proposed to peek into black-box models, such as Partial Dependence Plot (PDP) [10], Individual Conditional Expectation (ICE) [11], Accumulated Local Effects (ALE) [12], Permutation Feature Importance [13], Local Interpretable Model-agnostic Explanations (LIME) [14], Shapley Additive Explanation (SHAP) [15], and Anchors [16]. Most model-agnostic methods are justified qualitatively through illustrative figures and human experience. To compare different methods, several metrics have been proposed to quantitatively measure their interpretations, such as faithfulness [17] and monotonicity [18]. These interpretation methods and evaluation metrics are introduced in detail in the next section.

In this paper, instead of targeting a high-accuracy model, we use different methods to interpret models that are trained to predict the severity of COVID-19 using a dataset of 92 patients with 74 features. Besides checking whether their predictions are based on reasonable medical knowledge, we try to find clues that are neglected by medical practitioners in COVID-19 diagnosis. To quantitatively evaluate different interpretation methods, we use faithfulness and monotonicity to assess their feature-importance interpretations.

II Preliminary of AI Interpretability

In this section, we summarize frequently used interpretation methods (PDP, ICE, ALE, and Permutation Feature Importance) and evaluation metrics (Faithfulness and Monotonicity).

II-A Model-Agnostic Methods

Restricting machine learning to interpretable models in healthcare is often a severe limitation. Separating explanations from the model has several flexibilities: model flexibility, explanation flexibility, and representation flexibility [19].

Model flexibility: Most real-world applications require accurate models. However, the behavior of high-accuracy models is usually too complex for humans to fully comprehend, so the trade-off between accuracy and interpretability restricts the choice of models in many applications. Model-agnostic methods separate interpretability from the model, which frees the model to be as flexible as necessary for the task and enables the use of any machine learning approach.

Explanation flexibility: Interpretable models are limited to certain forms of explanation. Depending on the task, a linear formula may be useful in some cases, while a graphic of feature importances may be preferable in others. With a model-agnostic approach, the same model can be explained with different types of explanations, and with different degrees of interpretability for each type of explanation [19].

Representation flexibility: Many deep learning models represent input data with features that are not perceivable to humans, such as word embeddings in natural language processing (NLP) [20]. An explanation system tied to the model cannot use a feature representation different from that of the model being explained, whereas model-agnostic approaches can generate explanations using features different from those used by the original model [19], and are thus more flexible.

Due to these flexibilities, many model-agnostic methods are proposed to give explanations without knowing model details.

Partial Dependence Plot: Partial Dependence Plots (PDP) reveal the dependence between the target function and a small set of target features. The partial function is estimated by averaging over the training data, which is a Monte Carlo method. After setting up a grid for the features we are interested in (target features), we set the target features of every instance in the training set to the value of a grid point, make predictions, and average them at each grid point. The drawback of PDP is that one feature produces a 2D plot and two features produce a 3D plot, so it is hard to interpret more than two features at once, because plots in higher dimensions are difficult for humans to understand.
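The averaging procedure described above can be sketched in a few lines. This is a minimal illustration and not the implementation used in the experiments; `toy_predict`, the feature names, and the data are hypothetical.

```python
from statistics import mean

def partial_dependence(predict, rows, feature, grid):
    """Monte Carlo estimate of the partial dependence of `predict` on one feature."""
    curve = []
    for value in grid:
        # Overwrite the target feature with the grid value in every training row,
        # keep all other features as observed, then average the predictions.
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(mean(preds))
    return curve

# Hypothetical toy model and data, for illustration only.
def toy_predict(row):
    return 0.01 * row["crp"] + 0.005 * row["age"]

rows = [{"crp": 5.0, "age": 30}, {"crp": 40.0, "age": 70}]
pdp = partial_dependence(toy_predict, rows, "crp", grid=[0.0, 10.0, 20.0])
print(pdp)
```

Each grid point costs one prediction per training instance, so the cost grows with the grid size times the dataset size.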

Individual Conditional Expectation: Individual Conditional Expectation (ICE) is similar to PDP. The difference is that PDP averages over the marginal distribution while ICE keeps every individual prediction, so ICE draws one line per instance. Without averaging, ICE uncovers heterogeneous relationships, but it is limited to a single target feature, because two features result in overlaid surfaces that the human eye cannot tell apart [21].
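The relation between ICE and PDP can be illustrated with a small sketch, assuming a hypothetical model `toy_predict` in which the effect of CRP depends on age. The individual curves differ, and their pointwise average is the partial dependence curve.

```python
def ice_curves(predict, rows, feature, grid):
    """One curve per instance: predictions as `feature` sweeps the grid."""
    return [[predict({**row, feature: v}) for v in grid] for row in rows]

def pdp_from_ice(curves):
    """Averaging the ICE curves pointwise gives the partial dependence curve."""
    return [sum(col) / len(col) for col in zip(*curves)]

# Hypothetical toy model with an interaction: CRP only matters for older patients.
def toy_predict(row):
    slope = 0.02 if row["age"] >= 60 else 0.0
    return slope * row["crp"]

rows = [{"crp": 5.0, "age": 30}, {"crp": 40.0, "age": 70}]
grid = [0.0, 10.0, 20.0]
curves = ice_curves(toy_predict, rows, "crp", grid)
pdp = pdp_from_ice(curves)
print(curves, pdp)
```

In this toy case the flat curve of the younger patient and the rising curve of the older patient would be hidden behind a single averaged line in a PDP.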

Accumulated Local Effects: Accumulated Local Effects (ALE) averages the changes in the predictions and accumulates them over a local grid. It differs from PDP in that the value at each point of the ALE curve is the difference from the mean prediction, and this value is calculated within a small window rather than over the whole grid, which reduces the effect of correlated features [21]. Calculating within a small window makes ALE more suitable for healthcare, because it is usually unreasonable to assume that young people have physical characteristics within the range of the elderly.
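A simplified first-order ALE computation for a single feature might look as follows. The bin edges, the toy model, and the data are assumptions made for illustration; practical implementations usually choose the edges from quantiles of the feature.

```python
def ale_curve(predict, rows, feature, edges):
    """First-order ALE of one feature over bins given by sorted `edges`."""
    effects, counts = [], []
    for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
        # Only instances whose observed value falls in the bin are moved to its edges,
        # so predictions stay close to realistic feature combinations.
        in_bin = [r for r in rows
                  if (lo <= r[feature] if i == 0 else lo < r[feature]) and r[feature] <= hi]
        diffs = [predict({**r, feature: hi}) - predict({**r, feature: lo}) for r in in_bin]
        effects.append(sum(diffs) / len(diffs) if diffs else 0.0)
        counts.append(len(in_bin))
    # Accumulate the local effects bin by bin.
    accumulated, total = [], 0.0
    for e in effects:
        total += e
        accumulated.append(total)
    # Center the curve so that its count-weighted mean is zero.
    center = sum(a * c for a, c in zip(accumulated, counts)) / sum(counts)
    return [a - center for a in accumulated]

# Hypothetical toy model and data, for illustration only.
def toy_predict(row):
    return 0.01 * row["crp"] + 0.005 * row["age"]

rows = [{"crp": 5.0, "age": 30}, {"crp": 15.0, "age": 40},
        {"crp": 25.0, "age": 60}, {"crp": 35.0, "age": 70}]
ale = ale_curve(toy_predict, rows, "crp", edges=[0.0, 10.0, 20.0, 30.0, 40.0])
print(ale)
```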

Permutation Feature Importance: The idea behind Permutation Feature Importance is intuitive. A feature is very important for the model if the model's prediction error increases greatly after the feature's values are permuted. A feature is less important if the prediction error remains nearly unchanged after shuffling.
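As a rough sketch of this idea, the importance of a feature can be estimated as the average increase in error over several random shuffles. The toy classifier, the `noise` feature, and the number of repeats are hypothetical choices, not the settings used in the experiments.

```python
import random

def permutation_importance(predict, rows, labels, feature, error, n_repeats=20, seed=0):
    """Average increase in error after shuffling one feature's column."""
    rng = random.Random(seed)
    baseline = error(labels, [predict(r) for r in rows])
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
        increases.append(error(labels, [predict(r) for r in permuted]) - baseline)
    return sum(increases) / n_repeats

def error_rate(y_true, y_pred):
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy classifier that uses only CRP and ignores the `noise` feature.
def toy_predict(row):
    return 1 if row["crp"] > 20 else 0

crp = [5, 10, 30, 50, 8, 45]
rows = [{"crp": c, "noise": i % 2} for i, c in enumerate(crp)]
labels = [toy_predict(r) for r in rows]
imp_crp = permutation_importance(toy_predict, rows, labels, "crp", error_rate)
imp_noise = permutation_importance(toy_predict, rows, labels, "noise", error_rate)
print(imp_crp, imp_noise)
```

Shuffling the ignored feature leaves the error unchanged, so its importance is exactly zero, while shuffling CRP breaks the link between feature and label.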

Local Interpretable Model-agnostic Explanations: Local Interpretable Model-agnostic Explanations (LIME) uses interpretable models to approximate the predictions of the original black-box model in specific regions for individual predictions. LIME works for tabular data, text and images, but the explanations may not be stable enough for medical applications.

Shapley Additive Explanation: SHapley Additive exPlanation (SHAP) borrows the idea of the Shapley value from game theory [22], which represents the contribution of each player in a game. Calculating Shapley values is computationally expensive when there are hundreds of features, so Lundberg and Lee proposed a fast implementation for tree-based models [15]. SHAP has a solid theoretical foundation, but it is still computationally slow when many instances must be explained.
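To make the cost concrete, the following sketch computes exact Shapley values by enumerating all coalitions, which is feasible only for a handful of features. It assumes that a feature outside the coalition is replaced by a fixed baseline value; this is one common convention and is not the fast tree-based algorithm of [15]. The model and values are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values by brute-force enumeration of coalitions."""
    features = list(instance)
    n = len(features)

    def value(coalition):
        # Features outside the coalition are replaced by their baseline value.
        x = {f: (instance[f] if f in coalition else baseline[f]) for f in features}
        return predict(x)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(set(subset) | {f}) - value(set(subset)))
        phi[f] = total
    return phi

# Hypothetical linear toy model, for illustration only.
def toy_predict(row):
    return 0.01 * row["crp"] + 0.005 * row["age"]

instance = {"crp": 40.0, "age": 70.0}
baseline = {"crp": 10.0, "age": 50.0}
phi = shapley_values(toy_predict, instance, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the number of coalitions doubles with every added feature.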

To summarize, PDP, ICE, and ALE only use graphs to visualize the impact of different features, while Permutation Feature Importance, LIME, and SHAP provide numerical feature importance, which means they quantitatively rank the importance of different features.

Different methods may understand the same model differently. In healthcare applications, we can sometimes use prior medical knowledge to distinguish the reasonable interpretations, but it can be cumbersome to sort out the most influential features among hundreds. It is therefore important to find quantitative metrics that evaluate different methods without human intervention.

II-B Metrics for Interpretability Evaluation

Different interpretation methods try to find the most important features in order to explain the output. But Doshi-Velez and Kim ask, “Are all models in all defined-to-be-interpretable model classes equally interpretable?” [6] And how can we measure the quality of different interpretation methods?

To determine whether those features are correctly ranked, there are two types of indicators for the assessment and comparison of explanations: qualitative and quantitative indicators [23]. Faithfulness is a qualitative indicator, and monotonicity is a quantitative indicator.


Faithfulness: Faithfulness incrementally removes each of the attributes deemed important by the interpretation method and evaluates the effect on model performance. It then returns the correlation between the attribute importance weights and the corresponding effect on the classifier.
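A minimal sketch of such a score for a single instance is given below. It assumes that "removing" an attribute means replacing it with a baseline value, and it negates the correlation so that a higher score means that more important attributes cause larger drops in the predicted probability. The helper names, the toy model, and the data are hypothetical; this is not the toolbox implementation.

```python
from math import sqrt

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

def faithfulness(predict_proba, instance, importances, baseline):
    """Negated correlation between importances and probabilities after removal."""
    features = list(importances)
    # Remove one attribute at a time by setting it to its baseline value.
    probs = [predict_proba({**instance, f: baseline[f]}) for f in features]
    return -pearson([importances[f] for f in features], probs)

# Hypothetical toy model, for illustration only.
def toy_proba(row):
    return 0.1 + 0.01 * row["crp"] + 0.002 * row["age"] + 0.0005 * row["ldh"]

instance = {"crp": 40.0, "age": 70.0, "ldh": 220.0}
baseline = {"crp": 0.0, "age": 0.0, "ldh": 0.0}
good = faithfulness(toy_proba, instance, {"crp": 0.40, "age": 0.14, "ldh": 0.11}, baseline)
bad = faithfulness(toy_proba, instance, {"crp": 0.11, "age": 0.14, "ldh": 0.40}, baseline)
print(good, bad)
```

An importance ranking that matches the true effect of each attribute scores close to 1, while a reversed ranking scores below 0.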


Monotonicity: Monotonicity incrementally adds each attribute in order of increasing importance. As each feature is added, the performance of the model should increase correspondingly, resulting in monotonically increasing model performance. The metric returns True or False [18].

In our experiments, we use these two metrics to evaluate different interpretation methods. It is important to note, however, that evaluation metrics are still an area of active research. Evaluation metrics may be biased: the way a metric is calculated can favor some methods and give low scores to interpretation methods that produce plausible explanations. As a result, different evaluation metrics can be used as references, but not as ground truth.
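The check can be sketched as follows for a single instance, assuming that attributes start at a baseline value and are restored one by one in order of increasing importance. The toy model and the importance values are hypothetical, and this is not the toolbox implementation.

```python
def monotonicity(predict_proba, instance, importances, baseline):
    """True if the probability never decreases as attributes are added back."""
    x = dict(baseline)
    probs = []
    # Add attributes from least to most important, recording the prediction each time.
    for f in sorted(importances, key=importances.get):
        x[f] = instance[f]
        probs.append(predict_proba(x))
    return all(later >= earlier for earlier, later in zip(probs, probs[1:]))

# Hypothetical toy model in which lymphocyte count lowers the predicted risk.
def toy_proba(row):
    return 0.3 + 0.01 * row["crp"] - 0.1 * row["lym"]

instance = {"crp": 40.0, "lym": 1.5}
baseline = {"crp": 0.0, "lym": 0.0}
ok = monotonicity(toy_proba, instance, {"crp": 0.40, "lym": 0.15}, baseline)
not_ok = monotonicity(toy_proba, instance, {"crp": 0.10, "lym": 0.90}, baseline)
print(ok, not_ok)
```

Because the result is a single Boolean, one non-monotone step is enough to return False, which makes the metric strict.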

III Empirical Study on COVID

In this section, we introduce our raw dataset and the data preprocessing procedure, then use the data to train four different models: decision tree, random forest, gradient boosted trees, and neural networks. Through interpretation of the four models, we try to understand why they give similar or different predictions, and whether their predictions are consistent with medical knowledge. Finally, we focus on an individual patient for whom our models fail to make the correct decision, and try to explain and evaluate the interpretation.

III-A Dataset and Preprocessing

The raw dataset comes from hospitals in China and includes 92 patients who contracted COVID-19.

Our Research Ethics Committee waived written informed consent for this retrospective study that evaluated de-identified data and involved no potential risk to patients. All of the data of patients have been anonymized before analysis.

The 74 features consist of Body Mass Index (BMI), Complete Blood Count (CBC), Blood Biochemical Examination, inflammatory markers, symptoms, anamneses, and so on. Because some tests are not compulsory for the diagnosis of COVID-19, features such as LVEF remain unfilled for patients who did not take the Color Doppler Ultrasound Test.

After pruning features with few entries and patients with incomplete records, 86 records with 55 features remain. Of those, 77 records are used for training and cross-validation, and 9 are reserved for testing.

Medical Tests | Features
Basic Information | Sex, Age, AgeG1, Height, Weight, BMI, Temp
Cardiac Troponin | cTnITimes, cTnI, cTnICKMBOrdinal1, cTnICKMBOrdinal2
Complete Blood Count | WBC1, NEU1, LYM1, N2L1, WBC2, NEU2, LYM2, N2L2
Blood Biochemical Examination | AST, LDH, CK, CKMB, HiCKMB, HBDH, NTproBNP, Cr, ALB1, ALB2
Inflammatory Markers | CRP1, PCT2, CRP2, PCT1
Symptoms | Sympton, Fever, Cough, Phlegm, Hemoptysis, SoreThroat, Catarrh, Headache, ChestPain, Fatigue, SoreMuscle, Stomachache, Diarrhea, PoorAppetite, NauseaNVomit
Anamneses | Hypertention, Hyperlipedia, DM, Lung, CAD, Arrythmia, Cancer
TABLE I: 55 Features in the Dataset

The table above lists all the features in our dataset, and more details can be found in the Appendix. The indicator used for severity classification is Severity01 which indicates normal with 0, and severe with 1.

It is important to perform feature engineering before training and interpreting our models, as some features may not provide valuable information or provide redundant information.

First, we use some naive methods to remove unnecessary features. Two features, PCT2 and Stomachache, have the same value for all patients, which means they provide no information for distinguishing normal from severe patients, so we remove them both. The two features CAD (Coronary Artery Disease) and Arrythmia are duplicates, since all patients with CAD also have an arrhythmia, so we remove one of them.

Second, we try to remove correlated features. The table below lists all correlated features using Pearson’s correlation coefficient. We try to understand the strong correlation between two features and remove one of them if necessary.

Feature 1 | Feature 2 | Correlation
cTnICKMBOrdinal1 | cTnICKMBOrdinal2 | 0.853741
LDH | HBDH | 0.911419
NEU2 | WBC2 | 0.911419
LYM2 | LYM1 | 0.842688
NTproBNP | N2L2 | 0.808767
BMI | Weight | 0.842409
NEU1 | WBC1 | 0.90352
TABLE II: Feature Correlation

The strong correlation between cTnICKMBOrdinal1 and cTnICKMBOrdinal2 arises because they are the same test taken within a short period of time and therefore remain almost the same, so we can remove one of them. The same holds for LYM1 and LYM2. LDH and HBDH levels are significantly correlated with heart disease, and the HBDH/LDH ratio can be calculated to differentiate between liver and heart disease. As for the correlations between NEU1 and WBC1 and between NEU2 and WBC2, all of these features relate to the immune system; in fact, most of the white blood cells that lead the immune system's response are neutrophils. Since there is not much information about N2L2 and it is correlated with NTproBNP, we simply keep NTproBNP. Finally, the correlation between BMI and Weight is straightforward, because Body Mass Index (BMI) is a person's weight in kilograms divided by the square of their height in meters.
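A correlation screen of this kind can be sketched as below. The threshold of 0.8 is an assumption (every pair listed in Table II lies above it), and the column values are made up for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

def correlated_pairs(columns, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation reaches the threshold."""
    pairs = []
    for f1, f2 in combinations(columns, 2):
        r = pearson(columns[f1], columns[f2])
        if abs(r) >= threshold:
            pairs.append((f1, f2, r))
    return pairs

# Hypothetical columns: BMI is derived from Weight with a fixed height of 1.70 m.
weights = [50.0, 60.0, 70.0, 80.0]
columns = {
    "Weight": weights,
    "BMI": [w / 1.70 ** 2 for w in weights],
    "Temp": [36.5, 37.2, 36.4, 36.9],
}
pairs = correlated_pairs(columns)
print(pairs)
```

Whether to drop one feature of a flagged pair remains a judgment call that depends on what the two features measure.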

Third, several statistical methods are employed to remove features with redundant information.

Statistical Methods | Removed Features
Mutual Information | Height, CK, HiCKMB, Cr, WBC1, Hemoptysis
Univariate | Weight, AST, CKMB, PCT1, WBC2
TABLE III: Features Removed

Mutual information is calculated using (1), which measures how similar the joint distribution p(X,Y) is to the product of the individual distributions p(X)p(Y).

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \tag{1}$$

The univariate test measures the dependence between two variables, and a high p-value for this test means weaker evidence of dependence between X and Y.
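For two discrete variables, mutual information can be estimated directly from counts. The sketch below uses the natural logarithm and made-up data; continuous features would first need binning or a different estimator, which is not covered here.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Hypothetical examples: a constant feature and a feature identical to the label.
severity = [0, 0, 1, 1]
mi_constant = mutual_information([1, 1, 1, 1], severity)
mi_identical = mutual_information([0, 0, 1, 1], severity)
print(mi_constant, mi_identical)
```

A constant feature shares no information with the label, while a feature identical to a balanced binary label reaches log 2, the entropy of that label.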

After these steps, 36 features remain for training and testing.

III-B Training Models

Machine learning models outperform humans in terms of accuracy in many areas. Interpretable models such as decision trees are easy to understand but not suitable for large-scale applications, while complex models achieve high accuracy but give less explanation. We use four different models to extract information from our dataset: Decision Trees, Random Forests, Gradient Boosted Trees, and Neural Networks.

Decision Trees: Decision Tree (DT) is a widely adopted method for both classification and regression. It is a non-parametric supervised learning method that infers decision rules from data features. Decision trees try to find decision rules that make the best split as measured by Gini impurity or entropy. More importantly, the generated decision trees can be visualized and are thus easy to understand and interpret [24].

Random Forest: Random Forests (RF) are an ensemble learning method [25] that employs a bagging strategy. Multiple decision trees are trained using the same learning algorithm, and the predictions of the individual trees are then aggregated. Random forests produce good results most of the time even without much hyperparameter tuning, and they have been widely adopted for their simplicity and good performance. However, it is rather difficult for humans to interpret hundreds of decision trees, so the model itself is less interpretable than a single decision tree.

Gradient Boosted Trees: Gradient Boosted Trees are another ensemble learning method, one that employs a boosting strategy [26]. Decision trees are added sequentially, one at a time, and their results are combined along the way. With fine-tuned parameters, gradient boosting can perform better than random forests. Still, it is hard for humans to interpret a sequence of decision trees, so gradient boosted trees are considered black-box models.

Neural Networks: Neural Networks may be the most promising models for achieving high accuracy, and they even outperform humans in medical imaging [27]. Though the whole network is difficult to understand, deep neural networks are stacks of simple layers and can thus be partially understood by visualizing the outputs of intermediate layers [28].

For healthcare, both accuracy and interpretability are required. Simple interpretable models cannot achieve satisfactory accuracy in image applications, so many model-agnostic methods are employed to interpret complex black-box models.

All these methods are implemented using scikit-learn [29], Keras, and Python 3.6.

No hyperparameters are tuned for the decision tree, and for the random forest we use 100 trees. The hyperparameters for gradient boosted trees are selected according to prior experience. The structure of the neural network is listed below.

Layer (type) | Output Shape | Param
Dense | (None, 10) | 370
Dropout | (None, 10) | 0
Dense | (None, 15) | 165
Dropout | (None, 15) | 0
Dense | (None, 5) | 80
Dropout | (None, 5) | 0
Dense | (None, 1) | 6
TABLE IV: Neural Networks
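The parameter counts in Table IV can be checked with the usual formula for a fully connected layer, inputs × units + units (weights plus biases). The sketch assumes an input of 36 features, which matches the number of features left after preprocessing; dropout layers add no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Assumed layer widths: 36 input features, then the Dense layers of Table IV.
layer_sizes = [36, 10, 15, 5, 1]
params = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(params)
print(params, total)
```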

After training, gradient boosted trees and neural networks achieve the highest precision on the test set, while random forest performs worst. Of the 9 patients in our test set, four are severe: Decision Tree fails to find two of them and Random Forest misses three, while Gradient Boosted Trees and Neural Networks find all of the severe patients.

Classifier | CV F1 | Test Set Precision | Test Set Recall | Test Set F1 | 95% confidence interval
Decision Tree | 0.56 | 0.67 | 0.50 | 0.57 | 0.307
Random Forest | 0.64 | 0.56 | 0.25 | 0.33 | 0.324
Gradient Boosted Trees | 0.62 | 0.78 | 1.00 | 0.80 | 0.271
Neural Networks | 0.53 | 0.78 | 1.00 | 0.80 | 0.271
TABLE V: Binary Classification Results
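The text does not say how the confidence intervals in Table V are derived. As a hedged check, the reported values appear consistent with the half-width of a normal-approximation interval, 1.96 · sqrt(p(1 − p)/n), applied to the score in the Precision column with n = 9 test patients. This is an inference from the numbers, not a description of the authors' procedure.

```python
from math import sqrt

def wald_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion p over n samples."""
    return z * sqrt(p * (1 - p) / n)

# Scores as printed in Table V (rounded), with n = 9 test patients.
reported = {"Decision Tree": 0.67, "Random Forest": 0.56,
            "Gradient Boosted Trees": 0.78, "Neural Networks": 0.78}
half_widths = {name: round(wald_half_width(p, 9), 3) for name, p in reported.items()}
print(half_widths)
```

With only nine test patients the half-widths are around 0.3, so differences between the classifiers on the test set should be read with caution.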

In this paper, we do not try to obtain the most accurate model. Instead, we focus on understanding different models and on interpreting why they make the right prediction and why they err. In the next section, we therefore interpret these models and try to understand why they make different decisions.

III-C Interpreting Models

First, we use Permutation Feature Importance to find the most important features in different models.

Model | Most Important Features
Decision Tree | NTproBNP, CRP2, LYM1, ALB1, ALB2
Random Forest | CRP2, cTnI, NTproBNP, ALB2
Gradient Boosted Trees | CRP2, cTnITimes, Phlegm, NTproBNP, cTnI
Neural Networks | NTproBNP, CRP2, LDH, Age, CRP1
TABLE VI: Most Important Features

The table shows that both CRP2 and NTproBNP are recognized as important features by all four models. From the perspective of medical research, CRP refers to C-Reactive Protein, which increases when there is inflammation or infection in the body. CRP levels are positively correlated with lung lesions and could reflect disease severity [30]. NTproBNP refers to N-Terminal prohormone of Brain Natriuretic Peptide, which is released in response to changes in pressure inside the heart. Patients who contract COVID-19 show a rise in CRP due to the viral infection, and patients with a higher NT-proBNP level (above 88.64 pg/mL) had a higher risk of in-hospital death [31]. The result implies that the two important features recognized by all four models are consistent with medical knowledge.

In order to visualize the relationship between CRP and NTproBNP inside the model, we use PDP, ICE and ALE to see how they impact on prediction.

Fig. 1: Partial Dependence Plot. Panels: (a) Decision Tree, (b) Random Forest, (c) Gradient Boosted Trees, (d) Neural Networks

In the PDPs, all four models indicate a higher risk of severe illness as NTproBNP and CRP increase, which is consistent with retrospective studies on COVID-19. Thus, both interpretable models and black-box models successfully catch features that are deemed important by medical researchers.

The difference is that the models have different tolerances for, and dependence on, NTproBNP and CRP. On average, decision trees and gradient boosted trees are less tolerant of a high level of NTproBNP (>2000 ng/ml), and gradient boosted trees give a much higher probability of death as CRP increases. Since PDPs calculate an average over all instances, we use ICEs to identify heterogeneity.

Fig. 2: Individual Conditional Expectation. Panels: (a) Decision Tree, (b) Random Forest, (c) Gradient Boosted Trees, (d) Neural Networks

ICE reveals individual differences. Though all of the models predict a higher risk of severe illness as NTproBNP and CRP increase, some patients have a much higher initial probability, which indicates that other features also affect the overall prediction. For example, elderly people have higher NTproBNP than young people and a higher risk of turning severe.

Sometimes it is not appropriate to use PDP and ICE in medical applications, because both sample on a grid and make predictions at its points, while some points in the grid may never appear in the real world. For example, it is very rare for young people to have an NTproBNP as high as that of the elderly. ALE only samples within a small range, which makes it more realistic.

Fig. 3: Accumulated Local Effects. Panels: (a) Decision Tree, (b) Random Forest, (c) Gradient Boosted Trees, (d) Neural Networks

In the ALEs, all four models show a positive contribution to being severe as NTproBNP and CRP get higher, which coincides with medical knowledge.

III-C1 Explaining Misclassified Instances

Even though the four models successfully find important features in the diagnosis of COVID-19, some models fail to recognize severe patients. Both Gradient Boosted Trees and Neural Networks recognize all severe patients and yield a recall of 1.00, while Decision Tree misses two of them and Random Forest fails to recognize three. To find out the reason, we use LIME and SHAP to explain misclassified instances.

Take Random Forest as an example. The three severe patients it fails to predict are listed below.

No | Predicted Probability of Severe
1 | 0.52
5 | 0.70
7 | 0.52
TABLE VII: Severe patients Random Forest fails to recognize

Among the three misclassified patients, patients No. 1 and No. 7 receive a predicted probability of 0.52, which is close to the decision boundary, while for patient No. 5 the model gives a high prediction of 0.7, which is a serious mistake. We therefore focus on patient No. 5 and use LIME and SHAP to understand why the mistake occurs.

Feature | Value
Sex | 1.00
Age | 63.00
AgeG1 | 1.00
Temp | 36.40
cTnITimes | 7.00
cTnI | 0.01
cTnICKMBOrdinal1 | 0.00
LDH | 220.00
NTproBNP | 433.00
LYM1 | 1.53
N2L1 | 3.13
CRP1 | 22.69
ALB1 | 39.20
CRP2 | 22.69
ALB2 | 36.50
Symptoms | Hypertention
TABLE VIII: Misclassified Patient No.5

III-C2 Interpreting Random Forest

Table VIII shows the medical conditions of patient No. 5, a severe patient whom the random forest classifies as normal.

Fig. 4: LIME Explanation for patient No.5 of Random Forest
Fig. 5: LIME Explanation. Panels: (a) LIME Explanation for patient No.5 of Gradient Boosted Tree, (b) LIME Explanation for patient No.5 of Neural Networks
Fig. 6: SHAP Explanation. Panels: (a) SHAP Explanation for patient No.5 of Random Forest, (b) SHAP Explanation for patient No.5 of Gradient Boosted Trees, (c) SHAP Explanation for patient No.5 of Neural Networks

From the LIME explanation, we see that even though this patient has high NTproBNP, CRP, and LDH, which reflect a severe infection and heart injury, the model takes him as normal because he has low cTnI and cTnICKMBOrdinal and does not have any critical symptom. This explanation is reasonable to some extent: a person without any symptoms is unlikely to be a severe patient even if infected. But this is only the explanation from LIME, and it is not guaranteed to be a fully correct interpretation, so we use SHAP for comparison.

Features pushing the prediction higher (normal) are shown in red, and those pushing the prediction lower (severe) are shown in blue. From the SHAP explanation, we see that high CRP, LDH, and NTproBNP push the prediction toward severe, which is consistent with the explanation from LIME, while the absence of symptoms and low cTnI, LYM, and temperature lead the model to take him as normal.

If we think of different models as different doctors, then random forest, as a doctor, makes the wrong diagnosis. Human doctors consider this patient severe because he actually needs a respirator to survive, so there are things that random forest does not notice. But why do gradient boosted trees and neural networks make the correct diagnosis?

III-C3 Interpreting Gradient Boosted Tree

Let us try to figure out why the gradient boosted tree, as a doctor, makes the correct decision. From the interpretations of both LIME and SHAP, we see that the main difference between gradient boosted tree and random forest is that gradient boosted tree gives a higher weight to NTproBNP and CRP. Thus, even though the patient has a low temperature and no critical symptoms, the overall diagnosis remains severe.

III-C4 Interpreting Neural Networks

Similar to gradient boosted tree, Neural Networks give more weight to a high NTproBNP as a sign of being severe. Besides, from the SHAP interpretation, we notice that Neural Networks take Age as an important factor: patient No. 5 is diagnosed as severe partly because the patient is old, which is consistent with the judgement of human doctors.

In our dataset, there is an extra feature that indicates how severe a patient is, ranging from level 0 to level 3. If we calculate the average age for each severity level, we notice that as people grow older, their condition is more likely to deteriorate.

As a result, taking the patient's old age into account, neural networks predict severe because of the high NTproBNP and CRP, even without critical symptoms.

Severity Level | Average Age
0 | 36.83
1 | 47.45
2 | 54.31
3 | 69.40
TABLE IX: Average Age by Severity Level

Iii-C5 Other Interpretations

There are other interesting features that models rely on to make predictions.

In the LIME explanation of Gradient Boosted Trees, Phlegm being False is considered an important sign of turning severe, which means patients who do not produce phlegm are likely to turn severe. This seems odd at first glance, but it is corroborated by autopsies of COVID-19 patients [32]. Patients who do not cough up phlegm from the throat or lungs are more likely to undergo progressive respiratory failure due to sputum plugs.

Similarly, in the LIME explanation of Neural Networks, Diarrhea being False is considered an important sign of turning severe, which means patients who have diarrhea are more likely to recover. Initially, no one associated diarrhea with COVID-19, which usually causes alveolar damage. But as more and more people who went to hospital because of diarrhea were diagnosed with COVID-19, medical practitioners gradually realized that diarrhea is the first sign of illness for some COVID-19 patients.

These two findings indicate that machine learning models are capable of catching important clues about a new disease that human doctors may neglect. Thus, even though it is inappropriate to deploy black-box models for clinical diagnosis, we may use them to unveil the mechanism behind a new virus through model interpretation.
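The kind of statement discussed in this subsection ("Phlegm being False pushes the prediction toward severe") comes from a local surrogate model. The following is a simplified LIME-style sketch for binary symptoms, not the `lime` library itself: it flips symptoms at random around one patient, weights each neighbour by an exponential kernel on the fraction of flipped symptoms, and fits a weighted linear model with NumPy. The black-box `predict_severe`, its weights, and the kernel width are assumptions made for illustration.

```python
import numpy as np

FEATURES = ["Phlegm", "Diarrhea", "Fever"]
W = np.array([-1.5, -1.0, 0.8])  # hypothetical weights, NOT from the paper
B = 0.5


def predict_severe(rows):
    """Toy black box: probability of 'severe' for each row of binary symptoms."""
    return 1.0 / (1.0 + np.exp(-(B + rows @ W)))


def lime_binary(predict, x, n_samples=2000, kernel_width=0.75, seed=0):
    """Fit a kernel-weighted linear surrogate around the binary instance x."""
    rng = np.random.default_rng(seed)
    d = x.size
    # keep[i, j] == 1 keeps symptom j at the patient's value, 0 flips it.
    keep = rng.integers(0, 2, size=(n_samples, d))
    keep[0] = 1  # include the patient itself
    neighbours = np.where(keep == 1, x, 1 - x)
    y = predict(neighbours)
    # Squared distance = fraction of flipped symptoms; closer neighbours weigh more.
    dist2 = (1 - keep).sum(axis=1) / d
    sqrt_w = np.sqrt(np.exp(-dist2 / kernel_width ** 2))
    design = np.hstack([np.ones((n_samples, 1)), keep])
    # Weighted least squares via row scaling by the square root of the weights.
    coef, *_ = np.linalg.lstsq(design * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return dict(zip(FEATURES, coef[1:]))


# Hypothetical patient: no phlegm, no diarrhea, fever present.
patient = np.array([0, 0, 1])
explanation = lime_binary(predict_severe, patient)
```

A positive coefficient means that keeping the patient's actual value of that symptom raises the predicted severity compared with flipping it. For this hypothetical patient, the coefficients for "Phlegm" and "Diarrhea" therefore describe the effect of those symptoms being absent.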

III-D Evaluating Interpretation

Even though we do find some critical symptoms of COVID-19 through model interpretation, they are confirmed as credible only because these interpretations are corroborated by later studies. If we use interpretation to understand a new virus at a very early stage of an outbreak, there will not be enough evidence to support our interpretation initially. Thus we use Monotonicity and Faithfulness to evaluate different interpretations using the IBM AIX 360 toolbox [33].

Method Faithfulness Monotonicity
Permutation -0.526 False
LIME -0.376 False
SHAP 0.729 False
TABLE X: Evaluation for Different Interpretations

Faithfulness measures the correlation between the importance assigned by the interpretability algorithm and the effect of each attribute on the performance of the predictive model. Though only SHAP receives a high faithfulness, this does not mean that the interpretations from LIME and Permutation are unreasonable, because they are supported by medical knowledge in our experiment. SHAP receives the highest faithfulness because the Shapley value is calculated by removing the effect of specific features, which is similar to how faithfulness is computed; thus SHAP is inherently more aligned with faithfulness.
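As an illustration of the removal-based computation described above, the sketch below scores a set of attributions by replacing one feature at a time with a baseline value and correlating the attributions with the remaining prediction. It follows our understanding of the general idea rather than reproducing the toolbox code; the model, its weights, the instance, and the all-zero baseline are hypothetical.

```python
from math import exp, sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def faithfulness(predict, x, attributions, baseline):
    """Negative correlation between attributions and the prediction that
    remains after each feature is individually replaced by its baseline."""
    remaining = []
    for i in range(len(x)):
        x_removed = list(x)
        x_removed[i] = baseline[i]
        remaining.append(predict(x_removed))
    return -pearson(attributions, remaining)


W = [1.2, 0.9, 0.4]  # hypothetical weights, NOT from the paper


def predict_severe(x):
    """Toy model: probability of 'severe'."""
    return 1.0 / (1.0 + exp(-(-1.5 + sum(w * v for w, v in zip(W, x)))))


x = [2.0, 1.5, 1.0]
base = [0.0, 0.0, 0.0]
good = [w * v for w, v in zip(W, x)]  # attributions that match the toy model
bad = [-a for a in good]              # deliberately reversed attributions
score_good = faithfulness(predict_severe, x, good, base)
score_bad = faithfulness(predict_severe, x, bad, base)
```

In this toy case, attributions that agree with the model score close to 1 and reversed attributions score close to -1. Because the score is built from single-feature removals, attribution methods that also reason by removing features start with an advantage.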

As for monotonicity, all three interpretation methods receive a False, though we do find some valuable conclusions from these interpretations. The reason is that LIME trains a local surrogate model to imitate the behavior of the original one, but does not guarantee the same behavior globally. Permutation randomly shuffles the features and thus does not guarantee monotonicity. As for SHAP, it is calculated by removing features rather than adding features.
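To show what an addition-based check looks like, here is a minimal sketch under the assumption that monotonicity is tested by starting from a baseline, restoring features one at a time from least to most important, and requiring that the prediction never decreases. The toy model, weights, and instances are hypothetical, and the function is our own illustration rather than the toolbox implementation.

```python
from math import exp


def monotonicity(predict, x, attributions, baseline):
    """True if predictions never decrease while features are restored one at
    a time, from least to most important, starting at the baseline."""
    order = sorted(range(len(x)), key=lambda i: attributions[i])
    current = list(baseline)
    preds = []
    for i in order:
        current[i] = x[i]
        preds.append(predict(current))
    return all(later >= earlier for earlier, later in zip(preds, preds[1:]))


W = [1.2, 0.9, -0.4, -0.7]  # hypothetical weights, NOT from the paper


def predict_severe(x):
    """Toy model: probability of 'severe'."""
    return 1.0 / (1.0 + exp(-(-1.5 + sum(w * v for w, v in zip(W, x)))))


def contributions(x):
    """Attributions that match the toy model exactly (weight times value)."""
    return [w * v for w, v in zip(W, x)]


base = [0.0, 0.0, 0.0, 0.0]
x_raising = [2.0, 1.5, 0.0, 0.0]  # only risk-raising features are active
x_mixed = [2.0, 1.5, 1.0, 1.0]    # two features push the prediction down
mono_raising = monotonicity(predict_severe, x_raising, contributions(x_raising), base)
mono_mixed = monotonicity(predict_severe, x_mixed, contributions(x_mixed), base)
```

In this toy setting the check passes when every active feature raises the prediction, but it returns False as soon as two features lower it, even though the attributions match the model exactly. A boolean verdict of this kind is therefore a strict criterion.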

From these results, we can see that the score a metric gives depends heavily on how similar its calculation is to that of the interpretation method, so these metrics can be biased. As a result, given the current state of research, it may be more reliable to manually check the fidelity of different interpretations by comparing them with prior medical knowledge.

In conclusion, evaluation metrics are still under active research. They may offer some help in evaluating different interpretation methods, but it is still hard to find a metric that is non-biased for all methods. A non-biased, classifier-agnostic metric is left for future research.

IV Conclusion

In this paper, we use Permutation Feature Importance to interpret four different models that predict COVID-19 severity, and use PDP, ALE, and ICE to visualize the results. The interpretation reveals that all four models successfully identify NTproBNP and CRP as two important indicators for COVID-19, which is consistent with medical research.

Though both low-accuracy and high-accuracy models find NTproBNP and CRP to be two important indicators, they err on some patients. We use LIME and SHAP to understand the mistakes of different models. Gradient Boosted Trees and Neural Networks achieve better performance than Random Forests and Decision Trees because they are less tolerant of a high NTproBNP and CRP, and they take patients’ age into consideration, which is another important factor in COVID-19 clinical treatment.

Besides, high-accuracy models reveal that phlegm and diarrhea are the two most indicative symptoms: without them, the patient is likely to turn severe. These findings are consistent with autopsies of COVID-19 patients, and both symptoms are recognized as important signs of illness for some COVID-19 patients.

Finally, we use monotonicity and faithfulness to evaluate different interpretation methods, and find that faithfulness is more akin to SHAP due to the way it is calculated. Evaluation metrics are thus still under active research, and it remains hard to find a non-biased metric for different interpretation methods.

In conclusion, through interpreting different machine learning models trained with patient data from hospitals, we find several important indicators for COVID-19 severity diagnosis, and these findings are corroborated by autopsies and medical research. Thus it is possible to use machine learning model interpretations to reveal mechanisms of a new virus at the early stage of an outbreak.


Feature Medical Meaning Type
Sex Man (1), Woman (0) Personal Info
Age - Personal Info
AgeG1 - Personal Info
Height - Personal Info
Weight - Personal Info
BMI Body Mass Index Personal Info
Temp Temperature -
Severity01 Severe (1), Normal (0) Severity
cTnITimes When was cTnI tested -
cTnI Cardiac Troponin I -
cTnlCKMBOrdinal1 The value when hospitalized -
cTnlCKMBOrdinal2 The maximum value when hospitalized -
AST Aspartate aminotransferase Biochemical Examination
LDH Lactate Dehydrogenase Biochemical Examination
CK Creatine Kinase Biochemical Examination
CKMB The amount of an isoenzyme of creatine kinase (CK) Biochemical Examination
HBDH Alpha-Hydroxybutyrate Dehydrogenase Biochemical Examination
HiCKMB Highest CKMB Biochemical Examination
NTproBNP N-terminal Prohormone of Brain Natriuretic Peptide -
Cr Serum Creatinine Biochemical Examination
PCT1/PCT2 Procalcitonin Inflammatory Markers
WBC1/WBC2 White Blood Cell Complete Blood Count
NEU1/NEU2 Neutrophil Count Complete Blood Count
LYM1/LYM2 Lymphocyte Count Complete Blood Count
N2L1/N2L2 - Complete Blood Count
CRP1/CRP2 C-Reactive Protein Inflammatory Markers
ALB1/ALB2 Albumin Count Biochemical Examination
Sympton - Symptoms
Fever - Symptoms
Cough - Symptoms
Phlegm - Symptoms
Hemoptysis - Symptoms
SoreThroat - Symptoms
Catarrh - Symptoms
Headache - Symptoms
ChestPain - Symptoms
Fatigue - Symptoms
SoreMuscle - Symptoms
Stomachache - Symptoms
Diarrhea - Symptoms
PoorAppetie - Symptoms
NauseaNVomit - Symptoms
Hypertention - Anamneses
Hyperlipedia - Anamneses
DM Diabetes Mellitus Anamneses
Lung Lung Disease Anamneses
CAD Coronary Heart Disease Anamneses
Arrythmia - Anamneses
Cancer - Anamneses
TABLE XI: All Features in our dataset


  • [1] T. Singhal, “A review of coronavirus disease-2019 (covid-19),” The Indian Journal of Pediatrics, pp. 1–6, 2020.
  • [2] S. Basu, S. Mitra, and N. Saha, “Deep learning for screening covid-19 using chest x-ray images,” medRxiv, 2020. [Online]. Available:
  • [3] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, “Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission,” in KDD ’15, 2015.
  • [4] A. Ramchandani, C. Fan, and A. Mostafavi, “Deepcovidnet: An interpretable deep learning model for predictive surveillance of covid-19 using heterogeneous features and their interactions,” 2020.
  • [5] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, “Deepfm: A factorization-machine based neural network for ctr prediction,” 2017.
  • [6] F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,” 2017.
  • [7] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” 2014.
  • [8] L. Yan, H.-T. Zhang, J. Goncalves, Y. Xiao, M. Wang, Y. Guo, C. Sun, X. Tang, L. Jing, M. Zhang, X. Huang, Y. Xiao, H. Cao, Y. Chen, T. Ren, F. Wang, Y. Xiao, S. Huang, X. Tan, N. Huang, B. Jiao, C. Cheng, Y. Zhang, A. Luo, L. Mombaerts, J. Jin, Z. Cao, S. Li, H. Xu, and Y. Yuan, “An interpretable mortality prediction model for covid-19 patients,” Nature Machine Intelligence, vol. 2, no. 5, pp. 283–288, May 2020. [Online]. Available:
  • [9] E. Matsuyama et al., “A deep learning interpretable model for novel coronavirus disease (covid-19) screening with chest ct images,” Journal of Biomedical Science and Engineering, vol. 13, no. 07, p. 140, 2020.
  • [10] J. H. Friedman, “Greedy function approximation: A gradient boosting machine.” Ann. Statist., vol. 29, no. 5, pp. 1189–1232, 10 2001. [Online]. Available:
  • [11] A. Goldstein, A. Kapelner, J. Bleich, and E. Pitkin, “Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation,” 2013.
  • [12] D. W. Apley and J. Zhu, “Visualizing the effects of predictor variables in black box supervised learning models,” 2016.
  • [13] A. Fisher, C. Rudin, and F. Dominici, “All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously,” 2018.
  • [14] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” 2016.
  • [15] S. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” 2017.
  • [16] M. T. Ribeiro, S. Singh, and C. Guestrin, “Anchors: High-precision model-agnostic explanations,” in AAAI, 2018.
  • [17] D. Alvarez-Melis and T. S. Jaakkola, “Towards robust interpretability with self-explaining neural networks,” 2018.
  • [18] R. Luss, P.-Y. Chen, A. Dhurandhar, P. Sattigeri, Y. Zhang, K. Shanmugam, and C.-C. Tu, “Generating contrastive explanations with monotonic attribute functions,” 2019.
  • [19] M. T. Ribeiro, S. Singh, and C. Guestrin, “Model-agnostic interpretability of machine learning,” 2016.
  • [20] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” 2013.
  • [21] C. Molnar, Interpretable Machine Learning, 2019,
  • [22] L. S. Shapley, “17. a value for n-person games,” Contributions to the Theory of Games (AM-28), Volume II, p. 307–318, 1953.
  • [23] D. V. Carvalho, E. M. Pereira, and J. S. Cardoso, “Machine learning interpretability: A survey on methods and metrics,” Electronics, vol. 8, no. 8, p. 832, Jul 2019. [Online]. Available:
  • [24] L. Breiman, J. Friedman, R. Olshen, and C. Stone, “Classification and regression trees. belmont, ca: Wadsworth international group.” Encyclopedia of Ecology, vol. 57, no. 1, pp. 582–588, 2015.
  • [25] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
  • [26] S. Schaal and C. C. Atkeson, “From isolation to cooperation: An alternative view of a system of experts,” in Advances in Neural Information Processing Systems 8.   MIT Press, 1996, pp. 605–611.
  • [27] A. Maier, C. Syben, T. Lasser, and C. Riess, “A gentle introduction to deep learning in medical image processing,” 2018.
  • [28] G. Montavon, W. Samek, and K.-R. Müller, “Methods for interpreting and understanding deep neural networks,” Digital Signal Processing, vol. 73, p. 1–15, Feb 2018. [Online]. Available:
  • [29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [30] L. Wang, “C-reactive protein levels in the early stage of covid-19,” Médecine et Maladies Infectieuses, vol. 50, no. 4, pp. 332 – 334, 2020. [Online]. Available:
  • [31] L. Gao, D. Jiang, X. Wen, X. Cheng, M. Sun, B. He, L.-n. You, P. Lei, X.-w. Tan, S. Qin, G. Cai, and D. Zhang, “Prognostic value of nt-probnp in patients with severe covid-19,” medRxiv, 2020. [Online]. Available:
  • [32] Z. Xu et al., “Pathological findings of covid-19 associated with acute respiratory distress syndrome.” [Online]. Available:
  • [33] V. Arya, R. K. E. Bellamy, P.-Y. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilović, S. Mourad, P. Pedemonte, R. Raghavendra, J. Richards, P. Sattigeri, K. Shanmugam, M. Singh, K. R. Varshney, D. Wei, and Y. Zhang, “One explanation does not fit all: A toolkit and taxonomy of ai explainability techniques,” Sep. 2019. [Online]. Available: