Fitting a Logistic Regression Model
We used the training dataset of 351 patients with no missing values for the three important features (i.e., LDH, lymphocyte, and hs-CRP) to fit a step-wise logistic regression model with all the second-order interaction items. To determine which interaction items to include, we used the adjusted coefficient of determination, adjusted $R^2$ (hastie2009elements), which indicates the proportion of variance in the dependent variable (i.e., patient outcome) explained by the independent variables. The model with the largest adjusted $R^2$ contained all three variables, i.e., LDH, lymphocyte, and hs-CRP, and two interaction items, i.e., LDH:lymphocyte and LDH:hs-CRP.
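As a brief illustration of the selection criterion, the standard adjusted $R^2$ formula can be sketched as below. The text does not state which $R^2$ variant was computed for the logistic model, so the function name `adjusted_r2` and the example value 0.80 are assumptions for illustration only.

```python
def adjusted_r2(r2: float, n_samples: int, n_predictors: int) -> float:
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Hypothetical R^2 of 0.80 on 351 patients (placeholder, not a reported value).
# With R^2 held fixed, each extra term lowers the adjusted score, so a new
# interaction item is only worth keeping if it raises R^2 enough to compensate.
for n_terms in (3, 5, 6):
    print(n_terms, round(adjusted_r2(0.80, 351, n_terms), 4))
```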
Table 1 shows the fatality prediction performance of logistic regression, in terms of area under the curve (AUC) (walter2005partial) scores, using 100 rounds of five-fold cross-validation. The logistic regression models accurately predicted the fatality outcomes of COVID-19 patients using the three important biomarkers. Performance improved as the number of features increased, until the last interaction item (i.e., lymphocyte:hs-CRP) was added. This was consistent with the step-wise fitting results above. The model with optimal performance had two (product) interaction items, i.e., LDH:lymphocyte and LDH:hs-CRP.
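For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen deceased patient receives a higher predicted score than a randomly chosen survivor. The sketch below computes it by direct pairwise comparison; the function name `auc_score` and the toy labels and scores are hypothetical and do not come from the study data.

```python
def auc_score(labels, scores):
    """AUC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = death, 0 = survival; scores are predicted death probabilities.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.3, 0.4, 0.1]
print(auc_score(toy_labels, toy_scores))
```

The quadratic loop is fine for a few hundred patients; a rank-based formulation would be preferable for large data sets.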
An Explainable Logistic Regression Model
Based on the results in Table 1, we developed an explainable logistic regression model with two interaction items to predict the fatality rate of COVID-19 patients. Prediction with a logistic regression model is transparent, and the model produces fatality probabilities between 0 and 1 rather than a binary value. Using $y = 1$ and $y = 0$ to indicate death and survival, respectively, we can formulate the logistic regression model in the following way (hastie2009elements):
$$P(y = 1 \mid x_1, \ldots, x_5) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \sum_{i=1}^{5} \beta_i x_i\right)\right)} \qquad (1)$$
where $\beta_0$ is a constant and $\beta_i$, $i = 1, \ldots, 5$, are the coefficients of $x_1$ (LDH), $x_2$ (lymphocyte), $x_3$ (hs-CRP), $x_4$ (LDH:lymphocyte), and $x_5$ (LDH:hs-CRP). We ran 100 rounds of five-fold cross-validation with random search to identify the optimal value of the regularization term between 0.0001 and 1000 with two types of penalty, i.e., $\ell_1$ and $\ell_2$. This process resulted in 500 sets of coefficients, and we used the median values as the coefficient vector of the final model. We also optimized the prediction performance of the model by adjusting the threshold of the death probability, both for the external test data and for the multi-day ahead forecasting below. We found that the optimal threshold was 0.8. In other words, the model performed best when it predicted death for patients whose fatality probability was larger than 0.8.
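The model described above (three biomarkers, two product interaction items, median coefficients over the cross-validation fits, and a 0.8 decision threshold) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the coefficient values are placeholders, the inputs are assumed to be already scaled, and all function names are hypothetical.

```python
import math
from statistics import median


def features(ldh, lymphocyte, hs_crp):
    """Three biomarkers plus the two product interaction items."""
    return [ldh, lymphocyte, hs_crp, ldh * lymphocyte, ldh * hs_crp]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def death_probability(x, intercept, coefs):
    """Fatality probability from the linear predictor."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return sigmoid(z)


def median_coefficients(coef_sets):
    """Element-wise median across coefficient sets (one set per fitted model)."""
    return [median(column) for column in zip(*coef_sets)]


def predict_death(ldh, lymphocyte, hs_crp, intercept, coefs, threshold=0.8):
    """Predict death only when the probability exceeds the threshold."""
    prob = death_probability(features(ldh, lymphocyte, hs_crp), intercept, coefs)
    return prob > threshold


# Placeholder coefficient sets standing in for the 500 cross-validation fits.
coef_sets = [
    [0.9, -1.1, 0.4, -0.2, 0.1],
    [1.1, -0.9, 0.6, -0.3, 0.2],
    [1.0, -1.0, 0.5, -0.1, 0.3],
]
intercepts = [-0.5, -0.4, -0.6]

coefs = median_coefficients(coef_sets)
intercept = median(intercepts)

# Hypothetical, already-scaled biomarker values for one patient.
print(predict_death(2.0, -1.5, 1.0, intercept, coefs))
```

Taking the element-wise median rather than the mean keeps a few unstable folds from dominating the final coefficients, which matches the stated goal of good generalizability.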
The model in Eq. (1) with the identified threshold was then used to predict the outcomes of the 110 patients in the external test set, which was not used to build the model. Although the data set was rather unbalanced (13 deaths and 97 survivals), the performance of the proposed logistic regression model was promising, with 96 true negatives and 1 false positive among the survivors, and 12 true positives and 1 false negative among the deaths. We evaluated accuracy, f1-score, and AUC; the AUC was 0.996.
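The counts stated above are enough to recompute the threshold-based metrics by hand. The sketch below does this arithmetic; the results are derived here from the stated confusion-matrix counts and are not quoted from the paper, whose reported figures may differ depending on the averaging scheme used for the f1-score.

```python
# Confusion-matrix counts for the external test set, as stated in the text
# (death is the positive class).
tn, fp, fn, tp = 96, 1, 1, 12

total = tn + fp + fn + tp             # 110 patients
accuracy = (tp + tn) / total          # 108 / 110

precision = tp / (tp + fp)            # 12 / 13
recall = tp / (tp + fn)               # 12 / 13
f1_death = 2 * precision * recall / (precision + recall)

# For single-label classification, micro-averaging f1 over both classes
# pools all decisions and therefore coincides with accuracy.
f1_micro = accuracy

print(f"accuracy={accuracy:.4f}  f1(death)={f1_death:.4f}  f1(micro)={f1_micro:.4f}")
```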
Multi-Day Ahead Forecasting
Since there were multiple records of the three biomarkers for each patient, the model was also used to forecast patient outcomes multiple days in advance. The samples were obtained by examining the records within each day for each patient. A total of 909 records was obtained for all 485 patients, with 251 records for the 110 patients in the external test set. For multi-day ahead forecasting, we aimed to obtain the maximum number of days in advance. For example, if there are $n$ records in total for a patient and the model was able to predict $d_k$ days ahead, the predicted outcomes and the ground truth need to be the same for the maximum run of consecutive records from the $k$-th record through the last one. Here $d_k$/$d_1$ are the days between the dates of the $k$-th/first records and the date of the final outcome, respectively. For all the records, the model was able to predict 11.30 days (maximum = 34.91) ahead on average with a cumulative f1-score of 93.76% (see Fig. 1 (a) and (b)) and a cumulative accuracy of 93.92%. For the 251 records in the external test set, the model was able to predict 12.76 days (maximum = 33.15) ahead on average with a cumulative f1-score of 95.73% (see Fig. 1 (c) and (d)) and a cumulative accuracy of 96.47%. Thus, the proposed model can potentially give doctors the time needed to treat patients accordingly.
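One way to read the "maximum days in advance" criterion is: walk backwards from a patient's last record and keep extending the horizon while every prediction matches the final outcome. The sketch below implements that reading; it is an interpretation of the description above rather than the authors' code, and the function name `max_days_ahead` and the example records are hypothetical.

```python
def max_days_ahead(records, true_outcome):
    """Return the largest days-to-outcome from which all later predictions are correct.

    records: iterable of (days_to_outcome, predicted_outcome) pairs for one patient.
    Returns None if the prediction on the last record is already wrong.
    """
    # Sort so the record closest to the final outcome comes first.
    ordered = sorted(records, key=lambda r: r[0])
    best = None
    for days, predicted in ordered:
        if predicted != true_outcome:
            break  # an earlier record was mispredicted; stop extending
        best = days
    return best


# Hypothetical patient who died (outcome 1): the earliest record is mispredicted,
# so the forecast horizon is bounded by the next record, 7.0 days before the outcome.
patient_records = [(10.5, 0), (7.0, 1), (3.2, 1), (0.5, 1)]
print(max_days_ahead(patient_records, true_outcome=1))
```

Averaging this value over patients would give a mean forecasting horizon comparable in spirit to the averages reported above.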
Building on (yan2020interpretable), we proposed an explainable, intuitive, and yet accurate prediction model using logistic regression by incorporating two interaction items among the three most important biomarkers, i.e., LDH, lymphocyte (%), and hs-CRP, which can be easily measured in hospitals. Our model used no extra input information compared to the decision tree model in (yan2020interpretable). Unlike the binary decisions produced by the decision tree, the logistic regression model produced a probability of fatality for each patient, which is more consistent with human-friendly explanations of machine learning models (molnar2020interpretable). For example, one of the rules in the decision tree in (yan2020interpretable) was a cutoff rule on LDH alone that led directly to a prediction of death. Such a binary prediction may not be very intuitive without conveying the likelihood of death, since one biomarker value might result in a significantly different fatality probability from another.
Our model, on the other hand, always gave a probability of death and also identified a threshold probability of 0.8, above which the model predicted that the patient would die. Furthermore, our model outperformed the decision tree model in terms of the average maximum days to the outcome and the cumulative f1-score and accuracy in Fig. 1. Such a model can give clinicians time to identify high-risk patients before they become critical. Thus, treatment strategies for COVID-19 patients can be chosen according to their likelihood of death, with medical resources allocated accordingly. This can potentially alleviate the shortages of critical medical resources in hospitals in the current situation.
However, the model was built on a relatively small sample size. More research is needed to further test and optimize the model, taking both explainability and prediction performance into account.
Materials and Methods
The data were originally from (yan2020interpretable). The model construction was based on the data of 375 patients (174 died) collected between January 10, 2020 and February 18, 2020 from Tongji Hospital, Wuhan, China. Of these, 24 patients had missing values in the three biomarkers and were thus excluded from the analysis. The external test data set was collected from another 110 patients (13 died) between February 19, 2020 and February 24, 2020 from the same hospital. We reported the performance of the model using AUC (walter2005partial), micro-averaged f1-score (zhou2011affect), and accuracy (zhou2020driver).
Interaction items in logistic regression can substantially improve the performance of the model (levy2019don). Hence, we first fitted a logistic regression model using the three most important biomarkers identified in (yan2020interpretable), i.e., LDH, lymphocyte (%), and hs-CRP, and identified two interaction items that could improve the prediction model. Then, we added one item at a time to the logistic regression model, using five-fold cross-validation for 100 rounds, and verified that the two identified interaction items did improve its prediction performance, as shown in Table 1. To obtain a model with good generalizability, we used the median values of the coefficients from the 500 models fitted when producing the results in Table 1. Finally, this model was used to predict outcomes on the external test set and for multi-day ahead forecasting.
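To make the count of 500 models concrete, the sketch below generates the train/test index splits for 100 rounds of five-fold cross-validation over the 351 training patients. It is a plain (non-stratified) illustration; the text does not say whether the folds were stratified by outcome, and the function name `repeated_kfold` and the seed are assumptions.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_rounds=100, seed=0):
    """Yield (train_indices, test_indices) for each fold of each round."""
    rng = random.Random(seed)
    for _ in range(n_rounds):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition every round
        folds = [indices[k::n_folds] for k in range(n_folds)]
        for k in range(n_folds):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


splits = list(repeated_kfold(351))
print(len(splits))  # one fitted model per split
```

Each split would be used to fit one model, and the resulting coefficient sets could then be aggregated, for example with an element-wise median.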
The proposed logistic regression model can effectively predict the outcomes of COVID-19 patients in the form of fatality probabilities. The model is accurate, intuitive, and explainable, with only three blood biomarkers and two of their interaction items as input. It can potentially help doctors determine the best treatment route for high-risk COVID-19 patients and optimize logistical planning in medical systems around the world amid the COVID-19 pandemic.
We thank the authors of (yan2020interpretable) for providing their data sets.