## 1 Introduction

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and had resulted in more than 1.3 million confirmed cases and about 75k deaths worldwide as of April 7, 2020 JHU (2020). Owing to its rapid spread, the number of infections keeps increasing, with a high fatality rate (e.g., up to 5.7%). Statistical models and machine learning methods have been developed to analyze the transmission dynamics and conduct early diagnosis of COVID-19 del Rio and Malani (2020); Li et al. (2020b). For example, Wu et al. employed the susceptible-exposed-infectious-recovered model to estimate the number of COVID-19 cases in Wuhan based on the number of cases exported from Wuhan to other cities in China Wu et al. (2020).

The increasing number of confirmed COVID-19 cases has led to a shortage of clinicians and an increase in their workloads. Clinicians have used many laboratory techniques to confirm suspected COVID-19 cases Jung et al. (2020), including real-time reverse transcription polymerase chain reaction (RT-PCR) Corman et al. (2020); Ai et al. (2020), non-PCR tests (e.g., isothermal nucleic acid amplification technology Craw and Balachandran (2012)), non-contrast chest computed tomography (CT) and radiographs Lee et al. (2020), and so on. It is well known that manual detection is time-consuming and increases the infection risk of clinicians Kong and Agarwal (2020). Moreover, laboratory tests are usually not feasible for all suspected cases due to the limited supply of test kits Kong and Agarwal (2020); Ng et al. (2020). Also, RT-PCR has been widely used to confirm COVID-19, but it often suffers from low sensitivity Chaganti et al. (2020).

As a good alternative, artificial intelligence techniques applied to the available laboratory data have been playing important roles in the confirmation and follow-up of COVID-19 cases. For example, Alom et al. employed the inception residual recurrent convolutional neural network (CNN) and transfer learning on X-ray and CT scan images to detect COVID-19 and segment the infected regions Alom et al. (2020). Ozkaya et al. first applied a CNN to fuse and rank deep features for the early detection of COVID-19, and then used a support vector machine (SVM) to conduct binary classification using the obtained deep features Ozkaya et al. (2020).

While imaging data plays an important role in the diagnosis of all kinds of pneumonia, including COVID-19 Shi et al. (2020a), CT has been applied to help monitor imaging changes and measure the disease severity Zhao et al. (2020); Chaganti et al. (2020). For example, Chaganti et al. designed an automated system to quantify the abnormal tomographic patterns that appear in COVID-19 Chaganti et al. (2020). Li et al. employed CNNs with the imaging features of radiographs and CT images to identify COVID-19 Li et al. (2020a).

In this work, we investigate a new early diagnosis method to predict whether mild confirmed cases (i.e., non-severe cases) of COVID-19 will later develop severe symptoms, and to estimate the time interval until they do. However, this is challenging due to several issues, such as small infected lesions in the chest CT scan at the early stage, appearances similar to other types of pneumonia, data sets with high-dimensional features and few samples, and an imbalanced group distribution.

First, the infected lesions in the chest CT scan at the early stage are usually small, and their appearances are quite similar to those of other types of pneumonia. Given the minor imaging signs at the early stage of COVID-19, it is difficult to predict the future progression status. Conventional severity assessment methods can easily distinguish severe imaging signs from mild ones, since the changes in CT data are correlated with disease severity, e.g., lung involvement and abnormalities increase as the symptoms become severe. However, the infected volume of non-severe COVID-19 cases is usually small. For example, Guan et al. showed that 84.4% of non-severe patients had mild symptoms and more than 95% of severe cases had severe symptoms on CT changes Guan et al. (2020). On the other hand, clinicians have little prior knowledge about whether or when non-severe cases convert to severe cases, so early diagnosis and conversion time prediction could reduce clinicians' workloads or even save patients' lives.

Second, the collected data set usually has a small number of samples (i.e., small-sized samples) and high-dimensional features. For a variety of reasons, such as data protection, data security, and the acute nature of the infectious disease, only a small number of subjects are available for early diagnosis of COVID-19. It is difficult to build an effective artificial intelligence model from such limited samples. Moreover, high-dimensional features are often extracted from each image in order to capture the comprehensive changes of the disease. Together, these two conditions often lead to over-fitting and the curse of dimensionality Hu et al. (2019); Zhu et al. (2014).

Third, the class (or group) distribution of the data set is generally imbalanced. In particular, the number of severe cases is much smaller than the number of non-severe cases, e.g., about 20% as reported in https://www.webmd.com/lung/news/20200324/the-other-side-of-covid-19-milder-cases-recovery#1 and https://www.businessinsider.com/coronavirus-80-percent-cases-are-mild-2020-2. Such a scenario poses a challenge for most classification methods because they were designed under the assumption of an equal number of samples for each class Adeli et al. (2019). As a result, previous classification techniques yield poor predictive performance, especially for the minority class Adeli et al. (2019); Zhu et al. (2019c).

In this work, we propose a novel joint regression and classification method to identify the severe COVID-19 cases among the non-severe cases and to predict the conversion time from a non-severe case to a severe case in a unified framework. Specifically, we employ logistic regression for the binary classification task and linear regression for the regression task. Moreover, we employ an $\ell_{2,1}$-norm regularization term on both the classification coefficient matrix and the regression coefficient matrix to consider the correlation between the two tasks as well as to select the useful features for disease diagnosis and conversion time prediction. We further design a novel method to learn the weights of the samples, i.e., automatically learning the weight of each sample so that the important samples have large weights and the unimportant samples have small or even zero weights. Moreover, the samples with zero weights in both the majority class and the minority class are excluded from the construction of the joint classification and regression model, so that the problem of imbalance classification can be solved.

Compared with previous literature, the contributions of our proposed method are as follows.

First, our method considers imbalance classification, feature selection, and sample weight in the same framework. Moreover, our method takes into account the correlation between the classification task and the regression task. In the literature, few studies have explored these issues simultaneously. For example, some studies separately conduct feature selection Zhu et al. (2014) or sample weighting Hu et al. (2019); Zhu et al. (2019b), and Zhu et al. conduct joint classification and regression Zhu et al. (2014). Recently, a few studies have simultaneously conducted feature selection and sample selection Adeli et al. (2019, 2018); Hu et al. (2019).

Second, in the literature, a few machine learning methods have been proposed for the diagnosis of COVID-19. For example, Tang et al. employed random forest to detect the severe cases among the confirmed cases based on CT scan data Tang et al. (2020). Shi et al. conducted the same task with a two-step strategy, i.e., automatically categorizing all subjects into groups, followed by random forests in each group for classification Shi et al. (2020a). However, previous literature did not take into account any of the above issues. Recently, deep learning techniques Alom et al. (2020); Ozkaya et al. (2020); Li et al. (2020a) have been employed to conduct early diagnosis of COVID-19, but they lack interpretability. To our knowledge, this is the first study that simultaneously detects the severe cases and predicts the conversion time. This has wide applications because severe cases could endanger patients' lives, and correctly predicting the conversion time allows clinicians to care for the patients early or even save their lives.

Table 1: Demographic information of the subjects.

 | Severe cases (86) | Non-severe cases (322) |
---|---|---|
Female/male | 35/51 | 160/162 |
Age | 55.43 ± 16.35 | 49.30 ± 15.70 |

## 2 Materials and image preprocessing

This study investigated the chest CT images of 408 confirmed COVID-19 patients. The demographic information is summarized in Table 1. If a patient had multiple scans over time, the first scan was used. All CT images were provided by Shanghai Public Health Clinical Center and Sichuan University West China Hospital. Informed consent was waived and all private information of patients was anonymized. Moreover, the ethics committees of these two institutes approved the protocol of this study.

All patients were confirmed by the national centers for disease control (CDC) based on the positive new Coronavirus nucleic acid antibody. Moreover, patients with large motion artifacts or pre-existing lung cancer conditions on the CT scans were excluded from this study.

### 2.1 Image acquisition parameters

All patients underwent thin-section CT scans on scanners including SCENARIA 64 from Hitachi, Brilliance 64 from Philips, and uCT 528 from United Imaging. The CT protocol was as follows: kV: 120, slice thickness: 1-1.5 mm, and breath hold at full inspiration. More details about both the image acquisition and the image pre-processing can be found in Shan et al. (2020); Shi et al. (2020a). Moreover, we used the mediastinal window (with window width 350 Hounsfield units (HU) and window level 40 HU) and the lung window (with window width 1200 HU and window level -600 HU) for reading analysis.

### 2.2 Image pre-processing

We utilized the disease characteristics, i.e., infection locations and spreading patterns, to extract handcrafted features of each COVID-19 chest CT image. To do this, we used the COVID-19 chest CT analysis tool developed by Shanghai United Imaging Intelligence Co. Ltd., and followed the literature Shan et al. (2020) to calculate the quantitative features.

First, the COVID-19 chest CT analysis tool designed a deep learning method named VB-net to automatically segment infected lung regions and lung fields bilaterally. The infected lung regions were mainly related to manifestations of pneumonia, such as mosaic sign, ground glass opacification (GGO), lesion-related signs, and interlobular septal thickening.

Second, after the segmentation process, the lung fields include the left lung and the right lung, 5 lung lobes, and 18 pulmonary segments, as shown in Figure 1. Specifically, the left lung includes the superior lobe and the inferior lobe, while the right lung includes the superior lobe, the middle lobe, and the inferior lobe. Moreover, the left lung has 8 pulmonary segments and the right lung has 10 pulmonary segments. As a result, we had 26 regions of interest (ROIs) for each CT image.

Third, we partitioned each ROI into five parts based on the HU ranges, i.e., $[-\infty, -700]$, $[-700, -500]$, $[-500, -200]$, $[-200, 50]$, and $[50, \infty]$. Specifically, the first part contains the voxels with HU values between $-\infty$ and -700, the second those between -700 and -500, the third those between -500 and -200, the fourth those between -200 and 50, and the fifth those between 50 and $\infty$. As a result, each CT image was partitioned into 130 parts (i.e., $26 \times 5$).

In this study, we extracted three kinds of handcrafted features from each part, i.e., the density feature, the volume feature, and the mass feature. Specifically, we obtained the volume feature as the total volume of the infected region and the density feature as the average HU value within the infected region. We further followed Song et al. (2014) to define the mass feature, which simultaneously reflects the volume and density of a subsolid nodule, because the mass feature has been demonstrated to have potentially superior reproducibility to 3D volumetry.

Finally, each CT image is represented by 390D handcrafted features in this study.
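The feature construction described above can be illustrated with a minimal sketch. It assumes that each ROI is available as a flat list of HU values of its infected voxels; the function names, the half-open treatment of the HU bins, and in particular the mass formula `volume * (mean_HU + 1000) / 1000` (a commonly used definition of nodule mass) are assumptions made for illustration and are not confirmed by the text.

```python
import math

# HU bin edges taken from the five ranges described in the text; treating them
# as half-open intervals [lo, hi) is an assumption of this sketch.
HU_BINS = [(-math.inf, -700), (-700, -500), (-500, -200), (-200, 50), (50, math.inf)]

def roi_features(hu_values, voxel_volume_mm3=1.0):
    """Return [volume, density, mass] for each of the five HU parts of one ROI."""
    features = []
    for lo, hi in HU_BINS:
        part = [v for v in hu_values if lo <= v < hi]
        volume = len(part) * voxel_volume_mm3
        density = sum(part) / len(part) if part else 0.0
        # Hypothetical mass definition (assumption, not given in the text).
        mass = volume * (density + 1000.0) / 1000.0
        features.extend([volume, density, mass])
    return features

def image_features(rois, voxel_volume_mm3=1.0):
    """Concatenate the per-ROI features of one CT image into a single vector."""
    vector = []
    for hu_values in rois:
        vector.extend(roi_features(hu_values, voxel_volume_mm3))
    return vector

# Toy image: 26 ROIs x 5 HU parts x 3 feature types gives 390 features.
toy_rois = [[-800, -650, -600, -300, 20, 60] for _ in range(26)]
feature_vector = image_features(toy_rois)
print(len(feature_vector))
```

The count 26 × 5 × 3 matches the 390-dimensional representation stated in the text; everything else in the sketch is a placeholder for the actual analysis tool.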

## 3 Method

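In this paper, we denote matrices, vectors, and scalars as boldface uppercase letters, boldface lowercase letters, and normal italic letters, respectively.
Specifically, we denote a matrix as $\mathbf{X} = [x_{ij}]$. The *i*-th row and *j*-th column of $\mathbf{X}$ are denoted as $\mathbf{x}^i$ and $\mathbf{x}_j$, respectively. We further denote the Frobenius norm and the $\ell_{2,1}$-norm of a matrix $\mathbf{X}$ as $\|\mathbf{X}\|_F = \sqrt{\sum_i \|\mathbf{x}^i\|_2^2}$ and $\|\mathbf{X}\|_{2,1} = \sum_i \|\mathbf{x}^i\|_2$, respectively. We also denote the transpose operator, the trace operator, and the inverse of a matrix $\mathbf{X}$ as $\mathbf{X}^T$, $tr(\mathbf{X})$, and $\mathbf{X}^{-1}$, respectively.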

### 3.1 Sparse logistic regression

In the classification problem, given the feature matrix $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ including *n* samples ($\mathbf{x}_i \in \mathbb{R}^{d}$, $i = 1, \ldots, n$) and their corresponding labels $\mathbf{y} = [y_1, \ldots, y_n]^T$, logistic regression is employed to distinguish the severe cases (i.e., $y_i = 1$) from the other confirmed COVID-19 cases (i.e., $y_i = -1$).
Specifically, by denoting $\mathbf{w} \in \mathbb{R}^{d}$ as the coefficient vector, the logistic loss function is defined as

$$\min_{\mathbf{w}} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right) + \lambda \|\mathbf{w}\|_2^2 \tag{1}$$

where $\lambda$ is a tuning parameter, and the $\ell_2$-norm regularization term on the coefficient vector is used to control the complexity of the logistic regression. Eq. (1) conducts the classification task without taking into account issues such as feature selection, sample weight, and imbalance classification.

First, in real applications, clinicians have prior knowledge of the regions of the CT scan that are possibly related to the disease, but we cannot extract the features from these regions only, because they may interact with other regions to influence the disease. As a result, we extract the features from all imaging data to obtain high-dimensional data, which captures the comprehensive changes of confirmed COVID-19 cases but increases the storage and computation costs and easily results in the curse of dimensionality Zhu et al. (2019c). To address this issue, we design machine learning models that automatically recognize the features related to the disease by taking into account the correlation of the features. Specifically, we replace the $\ell_2$-norm on the coefficient vector (i.e., $\|\mathbf{w}\|_2^2$) by the $\ell_1$-norm on the coefficient vector (i.e., $\|\mathbf{w}\|_1 = \sum_{j=1}^{d} |w_j|$), which outputs sparse elements so that the corresponding features (i.e., the rows in $\mathbf{X}$) are not involved in the classification task, i.e.,

$$\min_{\mathbf{w}} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right) + \lambda \|\mathbf{w}\|_1 \tag{2}$$

where $\lambda$ is a tuning parameter.
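As an illustration of how an $\ell_1$-regularized logistic loss produces sparse coefficients, the following minimal sketch minimizes such a loss with proximal gradient descent (soft-thresholding). The solver choice, the label coding in {-1, +1}, the function names, and the toy data are all assumptions of this sketch; they are not the optimization procedure used in this paper.

```python
import math

def soft_threshold(value, threshold):
    """Proximal operator of the l1-norm for a single coefficient."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def sparse_logistic_regression(samples, labels, lam=0.1, step=0.1, iters=500):
    """Minimize sum_i log(1 + exp(-y_i * w.x_i)) + lam * ||w||_1 (labels in {-1, +1})."""
    dim = len(samples[0])
    w = [0.0] * dim
    for _ in range(iters):
        grad = [0.0] * dim
        for x, y in zip(samples, labels):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # d/dw log(1 + exp(-margin)) = -y * x / (1 + exp(margin))
            coeff = -y / (1.0 + math.exp(margin))
            for j in range(dim):
                grad[j] += coeff * x[j]
        # Gradient step on the smooth loss, then soft-thresholding for the l1 term.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is uninformative.
toy_samples = [[2.0, 0.1], [1.5, -0.1], [-2.0, 0.1], [-1.5, -0.1]]
toy_labels = [1, 1, -1, -1]
w_sparse = sparse_logistic_regression(toy_samples, toy_labels)
print(w_sparse)
```

On this toy problem the coefficient of the uninformative feature is driven to exactly zero, which is the feature selection effect that the $\ell_1$-norm is meant to provide.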

### 3.2 Balanced and sparse logistic regression

In binary classification, class imbalance easily biases the classification results towards the majority class, i.e., it produces many false negatives. In the literature, both re-sampling methods and cost-sensitive learning methods Zhu et al. (2019b) have been used to solve the issue of imbalance classification.

Recently, robust loss functions have been widely designed to reduce the influence of outliers by taking into account the sample weight in robust statistics Zhu et al. (2019c). Specifically, robust loss functions use a weight vector to automatically assign small weights to the samples with large estimation errors and large weights to the samples with small estimation errors. As a consequence, the samples with large estimation errors are regarded as outliers and their influence is reduced. In the literature, a number of robust loss functions have been developed, including the Cauchy function and the Geman–McClure estimator, among others Hu et al. (2019); Zhu et al. (2019a). However, these robust loss functions were not designed to address the issue of imbalance classification.

In this paper, motivated by self-paced learning, which assigns weights to the samples, we propose a new method to assign a weight to each sample as well as to solve the problem of imbalance classification. Considering that different samples contribute differently to the construction of the classification model, our method is expected to assign large weights to the important samples and small weights to the unimportant samples. Moreover, to address the problem of imbalance classification, our method is expected to set different numbers of zero weights for different classes so that the numbers of samples in the positive class and the negative class are balanced. To do this, we employ an $\ell_0$-norm constraint on the weight vector $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_n]^T$ to have the following loss function:

$$\min_{\mathbf{w}, \boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i \log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right) + \lambda \|\mathbf{w}\|_1, \quad s.t.\ \|\boldsymbol{\alpha}_{neg}\|_0 = k_{neg},\ \|\boldsymbol{\alpha}_{pos}\|_0 = k_{pos} \tag{3}$$

where $\boldsymbol{\alpha}_{neg}$ indicates the weight set of all negative samples and $\boldsymbol{\alpha}_{pos}$ indicates the weight set of all positive samples. The constraint '$\|\boldsymbol{\alpha}_{neg}\|_0 = k_{neg}$' indicates that the number of non-zero elements in the negative class is $k_{neg}$. Specifically, after receiving the estimation value for each sample, i.e., $\log(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i))$, we first sort the estimation values of all samples in the same class in increasing order. We then keep the original weights for the $k_{neg}$ negative samples with the smallest estimation values and set the weights of the remaining negative samples to 0; the positive samples are handled in the same way with $k_{pos}$. In this way, neither the negative samples nor the positive samples with zero weights are involved in the construction of the classification model.

Our method in Eq. (3) has at least two advantages, i.e., automatically selecting important samples (i.e., reducing the influence of the outliers) to learn the classification model, and adjusting the number of selected samples for each class by tuning the values of $k_{neg}$ and $k_{pos}$ (e.g., $k_{neg} = k_{pos}$) to solve the imbalance classification problem. In particular, our method in Eq. (3) employs the $\ell_0$-norm constraint for each class to output exactly the predefined number of non-zero elements. On the contrary, self-paced learning uses the $\ell_1$-norm constraint for all samples, or other robust loss functions Zhu et al. (2019c); Hu et al. (2019), to estimate the sample weight without guaranteeing the exact number of non-zero elements. As a result, compared to self-paced learning, which only considers the sample weight to reduce the influence of outliers, our method uses the sample weight both to exclude outliers from the model construction and to address the problem of imbalance classification.
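The per-class selection step described above (sort the estimation values within each class, keep the weights of the samples with the smallest values, and zero the rest) can be sketched as follows. The function name, the label coding in {-1, +1}, and the parameter names `k_neg` and `k_pos` are hypothetical and chosen only for illustration.

```python
def select_sample_weights(losses, labels, weights, k_neg, k_pos):
    """Keep the weights of the k smallest-loss samples per class; zero all others."""
    new_weights = [0.0] * len(losses)
    for cls, k in ((-1, k_neg), (1, k_pos)):
        # Indices of the samples in this class, sorted by increasing loss.
        idx = [i for i, y in enumerate(labels) if y == cls]
        idx.sort(key=lambda i: losses[i])
        for i in idx[:k]:
            new_weights[i] = weights[i]
    return new_weights

# Toy example (hypothetical values): four negative and two positive samples.
losses = [0.2, 1.5, 0.1, 0.9, 0.3, 2.0]
labels = [-1, -1, -1, -1, 1, 1]
weights = [1.0] * 6
balanced = select_sample_weights(losses, labels, weights, k_neg=2, k_pos=2)
print(balanced)
```

With `k_neg` equal to `k_pos`, the same number of samples from each class keeps a non-zero weight, which is how the selection balances the two classes in this sketch.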

### 3.3 Joint logistic regression and linear regression

Besides distinguishing the severe cases from the non-severe cases, predicting the time at which a non-severe case converts to a severe case is also important because it may affect the patients' lives. To do this, a naive solution is to separately conduct a classification task to diagnose the severe cases and a regression task to predict the conversion time. Obviously, this separate strategy ignores the correlation between the two tasks. In this paper, by regarding the prediction of the conversion time as a regression task, we define a ridge regression to linearly characterize the correlation between the feature matrix $\mathbf{X}$ and the vector of conversion time $\mathbf{z} = [z_1, \ldots, z_n]^T$ by

$$\min_{\mathbf{v}} \sum_{i=1}^{n} (z_i - \mathbf{v}^T \mathbf{x}_i)^2 + \lambda \|\mathbf{v}\|_2^2 \tag{4}$$

where $\mathbf{v}$ is the coefficient vector for the regression task and $\lambda$ is a tuning parameter.

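Similar to the classification task in Eq. (3), the regression task in Eq. (4) still needs to consider issues such as feature selection, sample weight, and imbalance classification. Moreover, in this study, we conduct joint classification and regression (i.e., multi-task learning) by simultaneously considering a classification task and a regression task in the same framework. We expect that each task can obtain information from the other task so that the model effectiveness of each of them is improved by the shared information. Specifically, we employ the $\ell_{2,1}$-norm regularization term with respect to both the variable $\mathbf{w}$ and the variable $\mathbf{v}$ to obtain the following objective function:

$$\min_{\mathbf{W}, \boldsymbol{\alpha}, \boldsymbol{\beta}} \sum_{i=1}^{n} \alpha_i \log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right) + \gamma \sum_{i=1}^{n} \beta_i (z_i - \mathbf{v}^T \mathbf{x}_i)^2 + \lambda \|\mathbf{W}\|_{2,1}, \quad s.t.\ \|\boldsymbol{\alpha}_{neg}\|_0 = \|\boldsymbol{\beta}_{neg}\|_0 = k_{neg},\ \|\boldsymbol{\alpha}_{pos}\|_0 = \|\boldsymbol{\beta}_{pos}\|_0 = k_{pos} \tag{5}$$

where $\boldsymbol{\beta}$ is the sample weight vector for the regression task and $\mathbf{W} = [\mathbf{w}, \mathbf{v}] \in \mathbb{R}^{d \times 2}$. $\gamma$ and $\lambda$ are tuning parameters. $\|\mathbf{W}\|_{2,1}$ indicates that the selected features are obtained jointly by the classification and regression models. Moreover, the selected features are the shared or common information that benefits each of them Evgeniou and Pontil (2004).

Eq. (5) needs a tuning parameter $\gamma$ to trade off the magnitude or importance of the two tasks. However, tuning this parameter is time-consuming and requires prior knowledge. In this work, we apply a square root operator to the second term of Eq. (5) to obtain the task weights automatically. It is noteworthy that we keep the parameter $\lambda$ to be tuned because it controls the sparsity of the term $\|\mathbf{W}\|_{2,1}$, and the sparsity changes with the data distribution Evgeniou and Pontil (2004); Zhu et al. (2017). Hence, the final objective function of our proposed joint classification and regression method is:

$$\min_{\mathbf{W}, \boldsymbol{\alpha}, \boldsymbol{\beta}} \sum_{i=1}^{n} \alpha_i \log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right) + \sqrt{\sum_{i=1}^{n} \beta_i (z_i - \mathbf{v}^T \mathbf{x}_i)^2} + \lambda \|\mathbf{W}\|_{2,1}, \quad s.t.\ \|\boldsymbol{\alpha}_{neg}\|_0 = \|\boldsymbol{\beta}_{neg}\|_0 = k_{neg},\ \|\boldsymbol{\alpha}_{pos}\|_0 = \|\boldsymbol{\beta}_{pos}\|_0 = k_{pos} \tag{6}$$

To solve the optimization problem in Eq. (6), i.e., optimizing the variables $\mathbf{W}$, $\boldsymbol{\alpha}$, and $\boldsymbol{\beta}$, we compute the derivative of the square root term in Eq. (6) and obtain the following formulation

$$\min_{\mathbf{W}, \boldsymbol{\alpha}, \boldsymbol{\beta}} \sum_{i=1}^{n} \alpha_i \log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right) + \gamma \sum_{i=1}^{n} \beta_i (z_i - \mathbf{v}^T \mathbf{x}_i)^2 + \lambda \|\mathbf{W}\|_{2,1}, \quad s.t.\ \|\boldsymbol{\alpha}_{neg}\|_0 = \|\boldsymbol{\beta}_{neg}\|_0 = k_{neg},\ \|\boldsymbol{\alpha}_{pos}\|_0 = \|\boldsymbol{\beta}_{pos}\|_0 = k_{pos} \tag{7a}$$

$$\gamma = \frac{1}{2\sqrt{\sum_{i=1}^{n} \beta_i (z_i - \mathbf{v}^T \mathbf{x}_i)^2}} \tag{7b}$$

The value of $\gamma$ in Eq. (7b) is obtained automatically without any tuning process and can be regarded as the weight of the tasks. Specifically, if the prediction error is small, the value of $\gamma$ is large, i.e., the regression task is more important than the classification task. Hence, the optimization of $\gamma$ automatically balances the contributions of the two tasks. As a result, the optimization of Eq. (6) is changed to the optimization of Eq. (7a) with $\gamma$ given by Eq. (7b).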
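The role of the automatically obtained task weight can be checked numerically. The sketch below assumes that the regression term is the square root of a weighted squared error $f(\mathbf{v}) = \sum_i \beta_i (z_i - \mathbf{v}^T \mathbf{x}_i)^2$; under that assumption, the chain rule gives $\nabla \sqrt{f} = \gamma \nabla f$ with $\gamma = 1 / (2\sqrt{f})$, so a small prediction error yields a large weight. All function names, symbols, and toy values are assumptions of this sketch.

```python
import math

def squared_error(v, samples, targets, beta):
    """Weighted squared regression error f(v) = sum_i beta_i * (z_i - v.x_i)^2."""
    return sum(b * (z - sum(vj * xj for vj, xj in zip(v, x))) ** 2
               for x, z, b in zip(samples, targets, beta))

def grad_squared_error(v, samples, targets, beta):
    """Gradient of f with respect to v."""
    grad = [0.0] * len(v)
    for x, z, b in zip(samples, targets, beta):
        residual = z - sum(vj * xj for vj, xj in zip(v, x))
        for j, xj in enumerate(x):
            grad[j] += -2.0 * b * residual * xj
    return grad

def task_weight(v, samples, targets, beta):
    """gamma = 1 / (2 * sqrt(f(v))), the factor produced by differentiating sqrt(f)."""
    return 1.0 / (2.0 * math.sqrt(squared_error(v, samples, targets, beta)))

# Toy data (hypothetical); the third sample has zero weight and is ignored.
samples = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
targets = [3.0, 1.0, 4.0]
beta = [1.0, 1.0, 0.0]
v = [0.5, 0.5]

gamma = task_weight(v, samples, targets, beta)
analytic = [gamma * g for g in grad_squared_error(v, samples, targets, beta)]

# Central finite differences of sqrt(f) as an independent check.
eps = 1e-6
numeric = []
for j in range(len(v)):
    v_plus, v_minus = list(v), list(v)
    v_plus[j] += eps
    v_minus[j] -= eps
    numeric.append((math.sqrt(squared_error(v_plus, samples, targets, beta))
                    - math.sqrt(squared_error(v_minus, samples, targets, beta))) / (2 * eps))
print(analytic, numeric)
```

The two gradients agree up to finite-difference error, which is why the square-root objective and the reweighted objective share the same stationary points when the weight is recomputed from the current error.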

### 3.4 Optimization

In this paper, we employ the alternating optimization strategy Bezdek and Hathaway (2003) to optimize the variables $\mathbf{w}$, $\mathbf{v}$, $\boldsymbol{\alpha}$, and $\boldsymbol{\beta}$ in Eq. (7a). We list the pseudocode of our optimization method in Algorithm 1 and report the details as follows.

(i) Update by fixing , and

After other variables are fixed, the objective function with respect to in Eq. (7a) becomes

(8) |

where is a diagonal matrix and the value of its -th diagonal element is . To solve the imbalance classification problem, we set to obtain samples for the training process. Hence, Eq. (8) becomes

(9) |

By setting and , Eq. (9) becomes

(10) |

In this paper, we employ Newton's method Liu and Nocedal (1989) to minimize Eq. (10) with the following update rules

(11) |

where and are defined as

(12) |

(ii) Update by fixing , and

The objective function with respect to in Eq. (7a) is

(13) |

By letting and , we have

(14) |

Eq. (14) has a closed-form solution, i.e.,

(15) |

(iii) Update and by fixing and

By denoting the estimation value of the *i*-th sample as (), we sort the values () with an increase order for each class to denote the weight set of negative samples with the smallest estimation values as (where if ) and the weight set of positive samples with the smallest estimation values as (where if ), and then we have

(16) |

where is the index of the original order before the sorting and is the index of the increase order after the sorting. indicates that the *j*-th index in the original order is matched with the [*i*]-th index in the increase order.

By denoting the estimation value of the *i*-th sample as (), we sort the values () with an increase order for each class to denote the weight set of negative samples with the smallest estimation values as (where if ) and the weight set of positive samples with the smallest estimation values as (where if ), and then we have

(17) |

### 3.5 Convergence analysis

Algorithm 1 involves five variables (i.e., , , , , and ). By denoting , , , , and , respectively, as the -th iteration results of , , , , and , we denote the objective function value of the -th iteration of Eq. (6) as .

By fixing , , , and , we employ the Newton’s method to optimize , so we have

(18) |

The optimizations of the variables (i.e., , , , and ) have closed-form solutions, so we have

(19) |

(20) |

(21) |

(22) |

According to Eq. (23), the objective function values in Eq. (6) gradually decrease with the increase of iterations until Algorithm 1 converges. Hence, the convergence proof of Algorithm 1 to optimize Eq. (6) is completed.

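Methods | FS | SW | IMB | CLASS | REG |
---|---|---|---|---|---|
SVM Chang and Lin (2011) | | | | ✓ | |
L1SVM Chang and Lin (2011) | ✓ | | | ✓ | |
Random forest Liaw et al. (2002) | | | ✓ | ✓ | |
SFS Adeli et al. (2019) | ✓ | ✓ | ✓ | ✓ | |
Ridge regression | | | | | ✓ |
L1SVR Chang and Lin (2011) | ✓ | | | | ✓ |
Lasso Tibshirani (1996) | ✓ | | | | ✓ |
Random forest Liaw et al. (2002) | | | ✓ | | ✓ |
MSFS Zhu et al. (2014) | ✓ | | | ✓ | ✓ |
Proposed | ✓ | ✓ | ✓ | ✓ | ✓ |

FS: feature selection; SW: sample weight; IMB: imbalance classification; CLASS: classification; REG: regression.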

## 4 Experiments

We experimentally evaluated our method, compared to state-of-the-art classification and regression methods, on a real COVID-19 data set with chest CT scan data, in terms of binary classification performance and regression performance.

### 4.1 Experimental setting

We selected SVM and ridge regression as the baseline methods for the classification task and the regression task, respectively. Other comparison methods include $\ell_1$-SVM (L1SVM) Chang and Lin (2011), random forest Liaw et al. (2002), sample-feature selection (SFS) Adeli et al. (2019), $\ell_1$-SVR (L1SVR) Chang and Lin (2011), least absolute shrinkage and selection operator (Lasso) Tibshirani (1996), and matrix-similarity feature selection (MSFS) Zhu et al. (2014). We summarize the details of all comparison methods in Table 2. It is noteworthy that random forest can be used for feature selection, sample selection, and imbalance classification. However, in this study, we only used random forest to address the problem of imbalance classification.

In our experiments, we repeated the 5-fold cross-validation scheme 20 times for all methods to report the average results as the final results. In the model selection, we set in Eq. (6), and fixed for solving the problem of imbalance classification for our method. We followed the literature Chang and Lin (2011); Liaw et al. (2002); Adeli et al. (2019); Tibshirani (1996); Zhu et al. (2014) to set the parameters of the comparison methods so that they outputted the best results.

The evaluation metrics include accuracy, specificity, sensitivity, and area under the ROC curve (AUC) for the classification task, as well as correlation coefficient (CC) and root mean square error (RMSE) for the regression task.
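For reference, most of the listed metrics can be computed as in the following minimal sketch (AUC is omitted for brevity). The function names and the label coding, with the severe class as the positive label `1`, are assumptions of this sketch rather than details given in the text.

```python
import math

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall of positives), and specificity (recall of negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def correlation_coefficient(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def rmse(a, b):
    """Root mean square error between predictions and targets."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Toy labels (hypothetical): 3 severe (1) and 5 non-severe (-1) cases.
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, -1, -1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In an imbalanced setting such as this one, accuracy can stay high while sensitivity drops, which is why sensitivity and specificity are reported alongside accuracy.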

Methods | SFS Adeli et al. (2019) | Proposed w/o Regression | Proposed |
---|---|---|---|
Accuracy | 78.18 ± 3.71 | 83.25 ± 2.44 | 85.69 ± 2.20 |
Sensitivity | 50.65 ± 6.33 | 70.73 ± 3.36 | 76.97 ± 3.36 |
Specificity | 86.31 ± 2.69 | 86.60 ± 3.45 | 88.02 ± 1.45 |
AUC | 73.88 ± 6.66 | 81.74 ± 3.30 | 85.91 ± 2.27 |

### 4.2 Classification result

We report the classification performance of all methods in Figure 2. We also report the classification performance of our proposed method using single-task learning and multi-task learning in Table 3 and the Receiver Operating Characteristic (ROC) curves of all methods in Figure 2. Based on the results, we conclude our observations as follows.

First, the proposed method achieves the best classification performance, followed by SFS, MSFS, random forest, L1SVM, and SVM. Specifically, our proposed method improves on average by 32.80% and 11.13%, respectively, compared to the worst comparison method (i.e., SVM) and the best comparison method (i.e., SFS), in terms of all four evaluation metrics. The reason is that our method addresses several issues in the same framework: feature selection removes the redundant features, sample weight reduces the influence of the outliers and solves the problem of imbalance classification to reduce the number of false negatives, and joint classification and regression utilizes the shared information between the two tasks to improve the model effectiveness of each of them.

Second, it is important to conduct feature selection when analyzing high-dimensional data, which easily suffers from the curse of dimensionality. In the literature, many studies showed that classification models built on high-dimensional data yield low performance Zhu et al. (2014); Adeli et al. (2018). Figure 2 verifies this statement. In our experiments, only SVM does not consider the issue of high-dimensional data, and it achieves the worst classification performance. More specifically, the best comparison method (i.e., SFS) improves by 21.09% and the worst comparison method (i.e., L1SVM) improves by 4.73%, over all evaluation metrics, compared to the baseline SVM.

Third, it is useful to use a joint classification and regression framework for detecting the severe cases among the mild confirmed cases. As shown in Table 3 and Figure 2, our proposed method, which conducts joint regression and classification, achieves better classification performance than the single-task classification methods, e.g., random forest, L1SVM, SFS, and SVM. Moreover, both MSFS and our method are joint models. However, our method outperforms MSFS since our proposed method takes into account one more constraint, i.e., imbalance classification. In particular, we conducted single-task classification using Eq. (3), i.e., our proposed method without considering the regression task, denoted as Proposed w/o Regression in Table 3. Proposed w/o Regression outperforms SFS, although both of them take into account the same three constraints, i.e., feature selection, sample weight, and imbalance classification.

Methods | CC | RMSE |
---|---|---|
Ridge regression | 0.329 ± 0.158 | 20.02 ± 9.724 |
L1SVR Chang and Lin (2011) | 0.351 ± 0.085 | 10.49 ± 2.072 |
Lasso Tibshirani (1996) | 0.354 ± 0.165 | 9.92 ± 9.571 |
Random forest Liaw et al. (2002) | 0.406 ± 0.188 | 13.22 ± 6.762 |
MSFS Zhu et al. (2014) | 0.408 ± 0.092 | 9.29 ± 1.104 |
Proposed | 0.462 ± 0.056 | 7.35 ± 1.087 |

### 4.3 Regression results

We evaluated the regression performance through the prediction of conversion time from the non-severe case to the severe case. We report the results of correlation coefficients (CCs) and RMSEs of all methods in Table 4.

First, the regression performance of the methods without feature selection (e.g., ridge regression) is worse than that of the methods with feature selection, e.g., Lasso, L1SVR, MSFS, and ours. Moreover, our method outperforms all comparison methods, achieving the best correlation coefficient (i.e., 0.462) and the best RMSE (i.e., 7.35).

Second, similar to the results of the classification task, the results of the regression task show the advantages of considering feature selection, sample weight, imbalance classification, and joint classification and regression. In particular, our proposed method, which considers all four aspects, improves by 0.054 and 1.94 in terms of correlation coefficient and RMSE, respectively, compared to MSFS, which takes only two of them into account, i.e., feature selection and joint classification and regression.

HU ranges | Left lung (6) | Right lung (16) |
---|---|---|
$[-\infty, -700]$ | 0 | 2 |
$[-700, -500]$ | 1 | 8 |
$[-500, -200]$ | 2 | 5 |
$[-200, 50]$ | 1 | 0 |
$[50, \infty]$ | 2 | 1 |

## 5 Discussion

### 5.1 Imbalance classification

In the classification task, our method addresses four issues, i.e., feature selection, sample weight, imbalance classification, and joint classification and regression. As a result, our method outperforms all comparison methods, each of which focuses on only part of the four issues. Moreover, our solution for each issue is shown to be reasonable and feasible. An interesting question is which issue dominates the COVID-19 analysis with chest CT scan data. There is no theoretical answer. However, based on our experimental results, the problem of imbalance classification should be the first issue to be considered, for the following reasons.

First, it is necessary to take into account the problem of imbalance classification. In our experiments, random forest outperforms L1SVM (e.g., by 6.84% over all evaluation metrics), where random forest considers the problem of imbalance classification and L1SVM considers the issue of high-dimensional data. Moreover, the only difference between SFS and MSFS is that SFS considers the problem of imbalance classification whereas MSFS conducts joint classification and regression. As a result, SFS slightly outperforms MSFS, i.e., a 1.26% improvement over all evaluation metrics.

Second, in Figure 2, the sensitivities of some methods (e.g., SVM, L1SVM, and MSFS) are low, e.g., 23.86%, 26.73%, and 47.45%, respectively. The reason is that their classifiers could directly assign the label of the majority class to the samples of the minority class and still output high accuracy, e.g., 67.35%, 75.26%, and 86.31%, respectively. By contrast, the methods that consider the issue of imbalance classification (e.g., random forest, SFS, and our Proposed w/o Regression) output high sensitivities, e.g., 49.61%, 50.65%, and 70.73%, respectively.

### 5.2 Top selected regions

In this paper, we did not employ deep learning methods because of interpretability concerns and the issue of small-sized sample. In this section, we list the top selected features (i.e., the chest regions) in Table 5, which could help clinicians improve the efficiency and the effectiveness of the disease diagnosis. To do this, we first counted the total number of times each feature was selected across 100 experiments, i.e., repeating the 5-fold cross validation scheme 20 times, and then reported the top 22 selected features (i.e., regions), each of which was selected at least 90 out of 100 times. We list our observations as follows.
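The counting procedure described above can be sketched as follows. This is a minimal illustration that assumes each of the 100 runs (20 repetitions of 5-fold cross validation) returns a list of selected feature names; the feature names and the toy selection pattern are hypothetical and do not reproduce the study's results.

```python
from collections import Counter


def selection_counts(selected_per_run):
    """Count in how many runs each feature was selected."""
    counts = Counter()
    for selected in selected_per_run:
        counts.update(set(selected))  # count a feature at most once per run
    return counts


def stable_features(selected_per_run, min_count):
    """Features selected in at least min_count runs, most frequent first."""
    counts = selection_counts(selected_per_run)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, c) for name, c in ranked if c >= min_count]


# Toy stand-in for the feature selector: 20 repetitions x 5 folds = 100 runs.
n_repeats, n_folds = 20, 5
runs = []
for rep in range(n_repeats):
    for fold in range(n_folds):
        selected = ["density_R3"]         # hypothetical feature, chosen in all runs
        if fold != 0:
            selected.append("mass_L1")    # hypothetical, chosen in 80 of 100 runs
        if rep < 19:
            selected.append("volume_R2")  # hypothetical, chosen in 95 of 100 runs
        runs.append(selected)

top = stable_features(runs, min_count=90)
print(top)
```

Only features that pass the threshold of 90 out of 100 runs are kept, so a feature selected in 80 runs is dropped even though it is chosen in most folds.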

First, most of the selected features (i.e., 17 out of 22) are in the HU range of , corresponding to the regions of ground glass opacity, which has been demonstrated to be related to the severity of COVID-19 Tang et al. (2020). Second, the number of selected regions in the right lung is larger than that in the left lung, i.e., 16 vs. 6. A possible reason is that the virus might more easily infect the regions in the right lung Shi et al. (2020b). Third, we extracted 3 kinds of handcrafted features, i.e., density, mass, and volume, from each part. Moreover, the mass feature is related to both the density feature and the volume feature. Based on the results, our method selected 4 and 7 density features from the left lung and the right lung, respectively, and 2 and 6 mass features from the left lung and the right lung, respectively. However, our method selected only 3 volume features, all from the right lung. Hence, we conclude that the density feature is the most important in our experiments, followed by the mass feature and the volume feature.

### 5.3 Importance of prediction and time estimation of severe cases

To our knowledge, this study is the first work to simultaneously predict whether COVID-19 cases will develop severe symptoms and estimate the conversion time using chest CT scan data.

First, our method obtains a higher sensitivity, i.e., 76.97%, than Tang et al. (2020), i.e., 74.5%. That is, our method classifies the severe cases more accurately than Tang et al. (2020). This could be attributed to two factors: 1) our method designs a novel solution for the problem of imbalance classification, and 2) the regression information in our proposed joint model improves the classification performance.

Second, as shown in Figure 3, the correlation coefficient (i.e., 0.524) between our predictions and the corresponding ground truths for the severe cases is larger than the value (i.e., 0.462) in Table 4, which measures the correlation between our predictions and the corresponding ground truths for all subjects. Moreover, our proposed method yields the averaged conversion time from all non-severe cases to the severe stage (i.e., 4.59 ± 0.223 days, which differs by 0.55 days from the ground truth of the conversion time, i.e., days) with the least estimation error (i.e., 6.01 ± 1.22) compared to all comparison methods. A possible reason is the proposed joint classification and regression model, where the classification information could improve the effectiveness of the regression task. These advantages imply that our proposed method is good at predicting the conversion time from the non-severe stage to the severe stage.

These two observations indicate that our proposed method is suitable for predicting the severe cases. In real applications, correctly classifying severe cases is more important than correctly classifying non-severe cases because the former could reduce the clinicians’ workloads. In particular, the correct prediction of the conversion time could help clinicians design effective treatment plans for potential severe cases in time, or even save the patients’ lives.

### 5.4 Limitations

This study yielded an accuracy of 85.91%, which seems lower than that reported in previous severity assessment work. However, the task that this paper tries to solve is quite different, as we predict whether the patient would develop severe symptoms at a later time. This results in the problem of imbalance classification, since only a small portion of patients would convert to severe cases based on the prevalence rate. First, the class distribution of our data set is biased, i.e., 86 severe cases vs. 322 non-severe cases. This makes it difficult to construct effective classification models. Second, the difference in infected volumes between the severe cases and the non-severe cases is small, as shown in Figure 4, while the corresponding difference is distinct in Tang et al. (2020), so the latter can easily conduct classification. With the increase of available data on severe cases, the accuracy of our method could be further improved. In our future work, we plan to generate new samples for the minority class to lessen the problem of imbalance classification, as well as design new deep transfer learning methods using other data sources (e.g., X-ray data) to solve the issue of small-sized sample and high-dimensional features.
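As a rough illustration of what generating new samples for the minority class could look like, the sketch below creates synthetic samples by interpolating between random pairs of minority samples. This is only one possible approach, in the spirit of SMOTE but simplified (SMOTE interpolates toward nearest neighbours rather than random pairs); it is an assumption for illustration and not necessarily the technique planned for the future work. All names and toy values are hypothetical.

```python
import random


def oversample_minority(minority, n_new, rng):
    """Create n_new synthetic samples by linear interpolation between
    randomly chosen pairs of minority-class samples."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct minority samples
        lam = rng.random()              # interpolation weight in [0, 1)
        synthetic.append([x + lam * (y - x) for x, y in zip(a, b)])
    return synthetic


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Toy minority class with two features per sample (not data from the study).
minority = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0]]
n_majority = 10  # hypothetical size of the majority class

new_samples = oversample_minority(minority, n_majority - len(minority), rng)
balanced_minority = minority + new_samples
print(len(balanced_minority), "minority samples after oversampling")
```

Every synthetic sample lies on a line segment between two real minority samples, so it stays within the range of the observed feature values.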

This study only focused on binary classification, i.e., severe cases vs. non-severe cases. In our future work, we plan to conduct multi-class classification on four types of COVID-19 diagnosis, i.e., mild, common, severe, and critical.

## 6 Conclusion

In this paper, we proposed a new method to jointly conduct disease identification and conversion time prediction, by taking into account four issues, i.e., high-dimensional data, small-sized sample, outlier influence, and imbalance classification. To do this, we designed a sparsity regularization term to conduct feature selection and learn the shared information between the two tasks, and proposed a new method to take into account the sample weight and the issue of imbalance classification. Finally, experimental results on CT data from a real data set showed that, compared to the comparison methods, our proposed method achieved the best performance both for distinguishing severe cases from non-severe cases and for predicting the conversion time from the mild confirmed case to the severe case.

## References

- Logistic regression confined by cardinality-constrained sample and feature selection. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §1, Table 2, §4.1, §4.1, Table 3.
- Semi-supervised discriminative classification robust to sample-outliers and feature-noises. IEEE transactions on pattern analysis and machine intelligence 41 (2), pp. 515–522. Cited by: §1, §4.2.
- Correlation of chest ct and rt-pcr testing in coronavirus disease 2019 (covid-19) in china: a report of 1014 cases. Radiology, pp. 200642. Cited by: §1.
- COVID_MTNet: covid-19 detection with multi-task deep learning approaches. External Links: 2004.03747 Cited by: §1, §1.
- Convergence of alternating optimization. Neural, Parallel & Scientific Computations 11 (4), pp. 351–368. Cited by: §3.4.
- Quantification of tomographic patterns associated with covid-19 from chest ct. External Links: 2004.01279 Cited by: §1, §1.
- LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST) 2 (3), pp. 1–27. Cited by: Table 2, §4.1, §4.1, Table 4.
- Diagnostic detection of 2019-ncov by real-time rt-pcr. World Health Organization, Jan 17. Cited by: §1.
- Isothermal nucleic acid amplification technologies for point-of-care diagnostics: a critical review. Lab on a Chip 12 (14), pp. 2469–2486. Cited by: §1.
- COVID-19—new insights on a rapidly changing epidemic. Jama. Cited by: §1.
- Regularized multi–task learning. In SIGKDD, pp. 109–117. Cited by: §3.3, §3.3.
- Clinical characteristics of coronavirus disease 2019 in china. New England Journal of Medicine. Cited by: §1.
- Robust svm with adaptive graph learning. World Wide Web, DOI: 10.1007/s11280-019-00766-x. Cited by: §1, §1, §3.2, §3.2.
- Coronavirus covid-19 global cases by the center for systems science and engineering (csse) at johns hopkins university. https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6. Cited by: §1.
- Real-time estimation of the risk of death from novel coronavirus (covid-19) infection: inference using exported cases. Journal of clinical medicine 9 (2), pp. 523. Cited by: §1.
- Chest imaging appearance of covid-19 infection. Radiology: Cardiothoracic Imaging 2 (1), pp. e200028. Cited by: §1.
- COVID-19 pneumonia: what has ct taught us?. The Lancet Infectious Diseases 20 (4), pp. 384–385. Cited by: §1.
- Artificial intelligence distinguishes covid-19 from community acquired pneumonia on chest ct. Radiology, pp. 200905. Cited by: §1, §1.
- Early transmission dynamics in wuhan, china, of novel coronavirus–infected pneumonia. New England Journal of Medicine. Cited by: §1.
- Classification and regression by randomforest. R news 2 (3), pp. 18–22. Cited by: Table 2, §4.1, §4.1, Table 4.
- On the limited memory bfgs method for large scale optimization. Mathematical programming 45 (1-3), pp. 503–528. Cited by: §3.4.
- Imaging profile of the covid-19 infection: radiologic findings and literature review. Radiology: Cardiothoracic Imaging 2 (1), pp. e200034. Cited by: §1.
- Coronavirus (covid-19) classification using deep features fusion and ranking technique. External Links: 2004.03698 Cited by: §1, §1.
- Lung infection quantification of covid-19 in ct images with deep learning. arXiv preprint arXiv:2003.04655. Cited by: §2.1, §2.2.
- Large-scale screening of covid-19 from community acquired pneumonia using infection size-aware classification. arXiv preprint arXiv:2003.09860. Cited by: §1, §1, §2.1.
- Radiological findings from 81 patients with covid-19 pneumonia in wuhan, china: a descriptive study. The Lancet Infectious Diseases. Cited by: §5.2.
- Volume and mass doubling times of persistent pulmonary subsolid nodules detected in patients without known malignancy. Radiology 273 (1), pp. 276–284. Cited by: §2.2.
- Severity assessment of coronavirus disease 2019 (covid-19) using quantitative features from chest ct images. arXiv preprint arXiv:2003.11988. Cited by: §1, §5.2, §5.3, §5.4.
- Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: Table 2, §4.1, §4.1, Table 4.
- Nowcasting and forecasting the potential domestic and international spread of the 2019-ncov outbreak originating in wuhan, china: a modelling study. The Lancet 395 (10225), pp. 689–697. Cited by: §1.
- Relation between chest ct findings and clinical conditions of coronavirus disease (covid-19) pneumonia: a multicenter study. American Journal of Roentgenology, pp. 1–6. Cited by: §1.
- Spectral clustering via half-quadratic optimization. World Wide Web, DOI: 10.1007/s11280-019-00731-8. Cited by: §3.2.
- Graph pca hashing for similarity search. IEEE Transactions on Multimedia 19 (9), pp. 2033–2044. Cited by: §3.3.
- A novel matrix-similarity based loss function for joint regression and classification in ad diagnosis. NeuroImage 100, pp. 91–105. Cited by: §1, §1, Table 2, §4.1, §4.1, §4.2, Table 4.
- Efficient utilization of missing data in cost-sensitive learning. IEEE Transactions on Knowledge and Data Engineering, DOI: 10.1109/TKDE.2019.2956530. Cited by: §1, §3.2.
- Spectral rotation for deep one-step clustering. Pattern Recognition, DOI: 10.1016/j.patcog.2019.107175. Cited by: §1, §3.1, §3.2, §3.2.