Towards computer-aided severity assessment: training and validation of deep neural networks for geographic extent and opacity extent scoring of chest X-rays for SARS-CoV-2 lung disease severity

05/26/2020 ∙ by Alexander Wong, et al.

Background: A critical step in effective care and treatment planning for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the assessment of the severity of disease progression. Chest x-rays (CXRs) are often used to assess SARS-CoV-2 severity, with two important assessment metrics being extent of lung involvement and degree of opacity. In this proof-of-concept study, we assess the feasibility of computer-aided scoring of SARS-CoV-2 lung disease severity on CXRs using a deep learning system. Materials and Methods: Data consisted of 130 CXRs from SARS-CoV-2 positive patient cases from the Cohen study. Geographic extent and opacity extent were scored by two board-certified expert chest radiologists (with 20+ years of experience) and a 2nd-year radiology resident. The deep neural networks used in this study are based on a COVID-Net network architecture. 100 versions of the network were independently learned (50 to perform geographic extent scoring and 50 to perform opacity extent scoring) using random subsets of CXRs from the Cohen study, and the networks were evaluated using stratified Monte Carlo cross-validation experiments. Findings: The deep neural networks yielded R^2 of 0.673 ± 0.004 and 0.636 ± 0.002 between predicted scores and radiologist scores for geographic extent and opacity extent, respectively, in stratified Monte Carlo cross-validation experiments. The best performing networks achieved R^2 of 0.865 and 0.746 between predicted scores and radiologist scores for geographic extent and opacity extent, respectively. Interpretation: The results are promising and suggest that the use of deep neural networks on CXRs could be an effective tool for computer-aided assessment of SARS-CoV-2 lung disease severity, although additional studies are needed before adoption for routine clinical use.


Introduction

As the COVID-19 pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), continues around the world, radiology has seen growing importance in providing clinical insights for aiding the diagnosis, treatment, and management of the disease. Much of the early literature has focused on imaging features presented in computed tomography (CT) scans of SARS-CoV-2 positive patients, given the use of CT in China during the earlier stages of the global pandemic [Zhou, Chung, Ai, Fang, Shi]; however, the low availability of CT scanners in many parts of the world owing to their high cost, the high risk of SARS-CoV-2 transmission during patient transport to/from CT imaging suites, and long decontamination times between scans have limited the use of CT scans for SARS-CoV-2 diagnosis and treatment planning. A number of recent studies have illustrated the growing interest in and usage of chest x-ray (CXR) imaging around the world [RSNA, Jacobi, Wong, Warren, Toussie, Huang, Guan], with some studies foreseeing a greater reliance on portable CXR [Jacobi] and highlighting the high value of portable CXR for critically ill patients [Wu]. Compared to CT scanners, CXR imaging systems are widely available around the world due to their relatively low cost, and have comparatively faster decontamination times; in addition, the existence of portable CXR units means that imaging can occur within an isolation room and thus greatly reduce transmission risk [Jacobi, RSNA, CSTR]. Furthermore, CXR imaging is frequently performed for patients with respiratory complaints as part of standard procedure [BSTI], and has been shown to give valuable insights into disease progression [Wong]. In the context of detecting SARS-CoV-2, CXR imaging can also be useful when patients with an initially negative reverse transcription–polymerase chain reaction (RT-PCR) result (the current gold standard for viral testing) revisit the emergency department with worsening symptoms [CSTR].

Several studies have investigated imaging features presented in CXR images of SARS-CoV-2 positive patients [Huang, Guan, Kong], with commonly found features being bilateral abnormalities, ground-glass opacity, and interstitial abnormalities. By leveraging the presence of these imaging features, together with the ability to observe their progression and extent over the course of disease onset, CXR assessment can play an important role in disease treatment and management by helping to determine the severity of a patient's condition. As such, a number of recent studies have focused on severity scoring [Wong, Warren, Toussie], where the goal is to quantify SARS-CoV-2 lung disease severity. Disease severity scoring can help determine the best course of treatment and management for a given SARS-CoV-2 case (e.g., at-home quarantine, oxygen therapy, ventilation, etc.), allowing for the individualized treatment of each patient.

We hypothesise that deep learning could be a valuable tool for enabling computer-aided severity scoring of SARS-CoV-2 lung disease using CXRs of SARS-CoV-2 positive patients. Using CXR training data acquired from a global pool of SARS-CoV-2 positive patients, deep neural networks can learn to identify the important imaging features within a CXR image indicative of SARS-CoV-2, and output scores quantifying the severity of a patient's disease progression. In this study, we assess the feasibility of computer-aided severity scoring of SARS-CoV-2 lung disease using deep learning by developing, training, and validating 100 versions of a deep neural network (50 for geographic extent scoring and 50 for opacity extent scoring) using stratified Monte Carlo cross-validation experiments on data consisting of 130 CXRs from positive patient cases. Two board-certified chest radiologists and a radiology resident assess the results achieved by the deep neural networks.

Materials and Methods

Data preparation and radiological scoring

The primary goal of this study is to assess the feasibility of computer-aided severity scoring of SARS-CoV-2 using deep learning. To this end, we develop and evaluate deep neural networks that can score CXRs of patients with SARS-CoV-2. We collected CXR data from the Cohen study [Cohen] pertaining to SARS-CoV-2 positive cases; ethics approval for the Cohen study data collection was granted by the University of Montreal's Ethics Committee. The 130 CXRs used here represent a population of 85 patients, aged 12 to 87 years, from around the world. The CXR data were acquired using a range of X-ray imaging equipment types and acquisition protocols that are representative of routine imaging practice (including supine and upright, posterior-anterior and anterior-posterior).

Radiological scoring was performed by two board-certified chest radiologists (each with 20+ years of experience) (A.A. and M.H.) and a 2nd-year radiology resident (B.S.) to stage SARS-CoV-2 disease severity using a scoring system adapted from Wong et al. [Wong]. Two assessment metrics were scored: geographic extent and opacity extent. For geographic extent, the extent of lung involvement by ground glass opacity or consolidation was scored for each lung (with the right and left lung scored separately) as: 0 = no involvement; 1 = <25%; 2 = 25-50%; 3 = 50-75%; 4 = >75% involvement. The two scores were then added together, so the total geographic extent score ranges from 0 to 8 (right + left lung). For opacity extent, the degree of opacity was similarly scored for the right and left lung separately as: 0 = no opacity; 1 = ground glass opacity; 2 = consolidation; 3 = white-out. These scores were likewise added together, so the total opacity extent score ranges from 0 to 6 (right + left lung). Fleiss' Kappa [Fleiss] for inter-rater agreement was 0.45 for opacity extent and 0.71 for geographic extent. The mean scores across the raters were then calculated.
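The per-lung scoring and aggregation described above can be sketched as follows. This is a minimal illustration (function names are ours); the assignment of boundary values (exactly 25/50/75%) to the lower category is an assumption, since the published ranges leave the boundaries ambiguous.

```python
def geographic_extent(right_pct, left_pct):
    """Total geographic extent score (0-8): per-lung involvement scores summed.

    Boundary values (exactly 25/50/75%) are assigned to the lower
    category here; the paper's ranges leave this ambiguous."""
    def per_lung(pct):
        if pct == 0:
            return 0          # no involvement
        if pct < 25:
            return 1
        if pct <= 50:
            return 2
        if pct <= 75:
            return 3
        return 4              # >75% involvement
    return per_lung(right_pct) + per_lung(left_pct)


def opacity_extent(right_category, left_category):
    """Total opacity extent score (0-6): per-lung categories summed
    (0 = no opacity, 1 = ground glass, 2 = consolidation, 3 = white-out)."""
    assert right_category in range(4) and left_category in range(4)
    return right_category + left_category


def consensus_score(rater_scores):
    """Mean score across raters, as used as the reference score."""
    return sum(rater_scores) / len(rater_scores)
```

For example, a patient with 30% involvement in the right lung and 10% in the left would receive a geographic extent score of 2 + 1 = 3.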

After radiological scoring, all CXR data used in this study underwent processing to facilitate the training of the deep neural networks. To discourage the networks from learning irrelevant visual cues when making severity scoring predictions, the boundaries of each CXR were cropped to remove boundary artifacts and embedded metadata outside of the patient region of interest. All CXR data were then resized to the same dimensions to enable training of the deep neural networks. Finally, the geographic extent scores (with a dynamic range of 0 to 8) and opacity extent scores (with a dynamic range of 0 to 6) were re-mapped to a unified dynamic range of 0 to 1.
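A minimal sketch of this preprocessing, using NumPy only. The border fraction, output size, and nearest-neighbour resize are illustrative stand-ins, not the paper's actual values (a real pipeline would typically use cv2.resize, in line with the OpenCV dependency mentioned later):

```python
import numpy as np


def preprocess_cxr(cxr, border_frac=0.08, out_hw=(224, 224)):
    """Crop a fixed border fraction, then resize to a common shape.

    border_frac and out_hw are illustrative choices, not the paper's values."""
    h, w = cxr.shape
    dy, dx = int(h * border_frac), int(w * border_frac)
    cropped = cxr[dy:h - dy, dx:w - dx]
    # nearest-neighbour resize via index sampling; cv2.resize in practice
    ch, cw = cropped.shape
    rows = np.arange(out_hw[0]) * ch // out_hw[0]
    cols = np.arange(out_hw[1]) * cw // out_hw[1]
    return cropped[rows[:, None], cols]


def normalize_score(score, max_score):
    """Re-map a raw score (0..max_score) to the unified 0-1 range."""
    return score / max_score


def denormalize_score(y, max_score):
    """Map a network output in 0-1 back to the original score range."""
    return y * max_score
```

With max_score = 8 for geographic extent and 6 for opacity extent, the two score types share a single 0-1 regression target.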

Model development

Figure 1: Flowchart of the overall architecture of the deep neural networks for predicting severity scores.

The development of the deep neural network architecture for computer-aided severity scoring is important as it dictates the sequence of mathematical operations that maps the input CXR data to the predicted severity scores (e.g., geographic extent score and opacity extent score). Specifically, the architecture of the deep neural network affects the efficiency and effectiveness with which it is able to learn the underlying parameters and operations in this complex, hierarchical mapping. In this study, the architecture of the deep neural networks used to evaluate the feasibility of computer-aided severity scoring of SARS-CoV-2 lung disease severity is based on the COVID-Net deep neural network architecture [COVIDNet], which was found to achieve state-of-the-art performance in SARS-CoV-2 detection. The last layers of the COVID-Net architecture are replaced with a set of new layers that predict severity scores within the unified dynamic range of 0 to 1; these scores can then be mapped back to the original dynamic ranges of the geographic extent and opacity extent scores used during radiological scoring. Figure 1 presents an overview of this network architecture. The network architecture consists of projection-expansion-projection design patterns for high representational capacity while maintaining computational efficiency, selective long-range connectivity to improve learning efficiency, and high architectural diversity.

To improve the performance of the deep neural networks, transfer learning [Pan] was used to initialize the deep neural network parameters in this study using the parameters from deep neural networks trained on COVIDx, a dataset introduced in the Wang study [COVIDNet] containing 13,975 CXR images across 13,870 patient cases, consisting of healthy patients and patients with different forms of pneumonia (e.g., viral, bacterial, etc.). Statistical distribution details of COVIDx can be found in the Wang study [COVIDNet]. We also leveraged data augmentation [Perez] to improve the performance of the deep neural networks, which consists of synthesizing new training samples by applying randomly generated translations, rotations, horizontal flips, zooms, and intensity shifts to the CXR data in the training set to increase data diversity and allow the deep neural networks to learn improved robustness. All model development was conducted using Python, OpenCV, and the Keras deep learning library with a TensorFlow backend.
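The augmentation step can be sketched with NumPy alone. The magnitudes here (±0.1 intensity shift, roughly ±5% translation) are illustrative assumptions, not the study's settings; the actual work used Keras-style augmentation, and rotations/zooms are omitted for brevity:

```python
import numpy as np


def augment(img, rng):
    """Random horizontal flip, intensity shift, and translation for one CXR.

    img: 2-D float array with values in [0, 1]. Magnitudes are illustrative."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                                   # horizontal flip
    img = np.clip(img + rng.uniform(-0.1, 0.1), 0.0, 1.0)    # intensity shift
    h, w = img.shape
    shift = (int(rng.integers(-h // 20, h // 20 + 1)),       # ~±5% translation
             int(rng.integers(-w // 20, w // 20 + 1)))
    return np.roll(img, shift, axis=(0, 1))  # wraps at edges; real pipelines pad


# usage: synthesize several augmented variants of a training image
rng = np.random.default_rng(0)
batch = [augment(np.random.rand(64, 64), rng) for _ in range(4)]
```

Each call produces a differently perturbed copy of the input, which is the mechanism by which augmentation increases the effective diversity of a small training set.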

Cross-validation and Performance Evaluation

To evaluate the efficacy of the deep neural networks developed for computer-aided severity scoring of SARS-CoV-2 lung disease severity, stratified Monte Carlo cross-validation [Xu] was conducted. 100 different deep neural networks (50 for geographic extent scoring and 50 for opacity extent scoring) were learned using 100 different random subsets of CXR data from the Cohen study. Each deep neural network was then tested on the subset of CXR data held out from its learning process. For each trial, a random subset consisting of 90% of the CXR data was used to learn a deep neural network, with the remaining 10% held out for testing.
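The repeated random 90/10 splitting can be sketched as follows. Note that the paper uses *stratified* Monte Carlo cross-validation; the stratification step (sampling within severity-score strata) is omitted here for brevity, so this is a plain Monte Carlo variant:

```python
import numpy as np


def monte_carlo_splits(n_samples, n_trials=50, test_frac=0.10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random 90/10 splits.

    The paper's stratification by severity score is omitted in this sketch."""
    rng = np.random.default_rng(seed)
    n_test = max(1, round(test_frac * n_samples))
    for _ in range(n_trials):
        idx = rng.permutation(n_samples)
        yield idx[n_test:], idx[:n_test]


# usage: 50 independent trials over the 130-image dataset
splits = list(monte_carlo_splits(130, n_trials=50))
```

Unlike k-fold cross-validation, the test subsets of different trials may overlap; each trial is an independent random partition of the data.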

To quantify the performance of the deep neural networks learned in this study, we computed the coefficient of determination, R^2, between the scores predicted by the deep neural networks and the scores of the expert radiologists, for both geographic extent and opacity extent, on the held-out test subset of CXR data for each trial. To present a quantitative summary of the cross-validation results, R^2 was averaged over the trials for geographic extent and opacity extent independently, yielding means and standard deviations across the cross-validation results.
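Computing R^2 per trial and summarizing across trials can be sketched as:

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def summarize(per_trial_r2):
    """Mean and standard deviation of R^2 across cross-validation trials."""
    r2 = np.asarray(per_trial_r2, dtype=float)
    return r2.mean(), r2.std()
```

R^2 = 1 indicates perfect agreement with the radiologist scores; predicting the constant mean of the true scores yields R^2 = 0, so values well above 0 indicate that the network explains a substantial fraction of the variance in the radiologist scores.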

Results

Table 1 summarizes the demographic variables and imaging protocol variables of the CXR data from the Cohen study used in this study. Note that the majority of the patient cases are from Europe and Asia, reflecting the earlier rise of the COVID-19 pandemic on those two continents. In addition, the majority of the cases are above the age of 50 (mean age 56.6), consistent with the greater effect of SARS-CoV-2 on the older population.

Age
  <20               1  (1.2%)
  20-29             2  (2.4%)
  30-39             6  (7.1%)
  40-49            10 (11.8%)
  50-59            18 (21.2%)
  60-69            15 (17.6%)
  70-79            15 (17.6%)
  80-89             3  (3.5%)
  90+               0  (0.0%)
  Unknown          15 (19.2%)
Sex
  Male             47 (55.3%)
  Female           28 (32.9%)
  Unknown          10 (11.8%)
Geographic location
  Asia             20 (23.5%)
  North America     2  (2.4%)
  Europe           45 (52.9%)
  Australia         1  (1.2%)
  Unknown          17 (20.0%)
Imaging view
  PA               98 (75.4%)
  AP               32 (24.6%)
Imaging position
  Supine           17 (13.1%)
  Upright         113 (86.9%)

Table 1: Summary of demographic variables and imaging protocol variables of the CXR data used in this study. Age, sex, and geographic location statistics are expressed at the patient level, while imaging view and imaging position statistics are expressed at the image level.

Studying the R^2 between predicted scores from the deep neural networks and the radiologist scores for the 100 experiments (50 deep neural networks for geographic extent scoring and 50 deep neural networks for opacity extent scoring) led to a number of observations. First, the deep neural networks yielded R^2 of 0.673 ± 0.004 and 0.636 ± 0.002 for geographic extent and opacity extent, respectively, in the stratified Monte Carlo cross-validation experiments. Second, the best performing networks achieved R^2 of 0.865 and 0.746 between predicted scores and radiologist scores for geographic extent and opacity extent, respectively. Third, the results show that the mean R^2 between predicted scores and radiologist scores for geographic extent is higher than that for opacity extent.

Discussion

In this study, we hypothesised that computer-aided deep learning algorithms can accurately predict lung disease severity on CXRs associated with SARS-CoV-2 infection against expert chest radiologist ground truths, and the experimental results of this study support this hypothesis. Results from the stratified Monte Carlo cross-validation experiments showed that the learned deep neural networks could achieve a mean R^2 between predicted scores and radiologist scores greater than 0.5 for both geographic extent and opacity extent when evaluated on 100 different subsets of CXR data from the Cohen study (50 for geographic extent scoring and 50 for opacity extent scoring).

Severity scoring for SARS-CoV-2 has gained recent attention due to the rise and continued prevalence of the COVID-19 pandemic across the globe, and assessing the severity of a SARS-CoV-2 positive patient is crucial for determining the best course of action regarding treatment and care. Several severity scoring mechanisms have recently been proposed for the severity assessment of SARS-CoV-2. Wong et al. [Wong] introduced a scoring scheme for severity quantification of SARS-CoV-2 by adapting and simplifying the Radiographic Assessment of Lung Edema (RALE) score introduced by Warren et al. [Warren]. Toussie et al. [Toussie] introduced a scoring scheme where each lung was divided into three zones (for a total of six zones) and each zone was assigned a binary score based on opacity, with the final severity score being the aggregate of the scores from the different zones. Borghesi and Maroldi [Borghesi] introduced a scoring scheme where, similar to Toussie et al., each lung was divided into three zones, but each zone was instead assigned a score from 0 to 3 based on interstitial and alveolar infiltrates. Considering the large number of patients being screened during the COVID-19 pandemic and the need for expert radiologists to assess the severity of each patient, the use of artificial intelligence for computer-aided severity scoring has strong potential to improve clinical workflow efficiency.

This study has a few limitations. First, the data were obtained from various sources and could exhibit bias. Second, disease severity is based on radiologist ground truths, and functional outcomes such as measures of lung function or mortality were not available. Third, the image quality of the CXRs can vary; although some CXRs have lower resolution, they were observed to be of acceptable diagnostic quality. Fourth and finally, future studies should investigate longitudinal changes in disease severity.

In conclusion, our results support the hypothesis that deep neural networks applied to CXRs can be an effective tool for computer-aided assessment of lung disease severity, although additional studies are needed before adoption for routine clinical use. Such a tool may be helpful in emergency room and intensive care settings for triaging patients into general admission or the ICU, as well as for determining which SARS-CoV-2 patients to place on mechanical ventilation, when to do so, and when to extubate.

Acknowledgements

We would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Research Chairs program, DarwinAI Corp., Nvidia Corp., and Hewlett Packard Enterprise Co.

Author contributions statement

Z.Q.L., L.W., A.G.C., A.W., and T.Q.D. conceived the experiment; B.S., A.A., M.H., and T.Q.D. collected the data; B.S., A.A., and M.H. performed the radiological scoring; Z.Q.L., L.W., A.G.C., and A.W. developed the model and prepared the learning procedure; and all authors analysed the results. All authors reviewed the manuscript.

Declaration of interests

L.W., Z.Q.L., A.G.C. and A.W. are affiliated with DarwinAI Corp.

References