Pristine annotations-based multi-modal trained artificial intelligence solution to triage chest X-ray for COVID-19

by   Tao Tan, et al.
General Electric

The COVID-19 pandemic continues to spread and impact the well-being of the global population. The front-line modalities including computed tomography (CT) and X-ray play an important role for triaging COVID patients. Considering the limited access of resources (both hardware and trained personnel) and decontamination considerations, CT may not be ideal for triaging suspected subjects. Artificial intelligence (AI) assisted X-ray based applications for triaging and monitoring require experienced radiologists to identify COVID patients in a timely manner and to further delineate the disease region boundary are seen as a promising solution. Our proposed solution differs from existing solutions by industry and academic communities, and demonstrates a functional AI model to triage by inferencing using a single x-ray image, while the deep-learning model is trained using both X-ray and CT data. We report on how such a multi-modal training improves the solution compared to X-ray only training. The multi-modal solution increases the AUC (area under the receiver operating characteristic curve) from 0.89 to 0.93 and also positively impacts the Dice coefficient (0.59 to 0.62) for localizing the pathology. To the best our knowledge, it is the first X-ray solution by leveraging multi-modal information for the development.


page 5

page 7

page 8


Calibrated Bagging Deep Learning for Image Semantic Segmentation: A Case Study on COVID-19 Chest X-ray Image

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes coro...

Deep-XFCT: Deep learning 3D-mineral liberation analysis with micro X-ray fluorescence and computed tomography

The rapid development of X-ray micro-computed tomography (micro-CT) open...

COVID-19 Classification of X-ray Images Using Deep Neural Networks

In the midst of the coronavirus disease 2019 (COVID-19) outbreak, chest ...

Exploration of Interpretability Techniques for Deep COVID-19 Classification using Chest X-ray Images

The outbreak of COVID-19 has shocked the entire world with its fairly ra...

The entire network structure of Crossmodal Transformer

Since the mapping relationship between definitized intra-interventional ...

Multi-modal volumetric concept activation to explain detection and classification of metastatic prostate cancer on PSMA-PET/CT

Explainable artificial intelligence (XAI) is increasingly used to analyz...

Using Deep Learning-based Features Extracted from CT scans to Predict Outcomes in COVID-19 Patients

The COVID-19 pandemic has had a considerable impact on day-to-day life. ...

1 Introduction

Coronavirus disease 2019 (COVID-19) is extremely contagious and has become a pandemic [al2020death, shaker2020covid]. It has spread inter-continentally in the first wave [leung2020first] and suspected to be currently entering the second wave [ali2020covid, aleta2020modeling, evenett2020preparing] in various countries, having infected more than 30 million people and caused nearly 1M deaths till September 2020 [JHU]. The mortality rate of this disease differs from country to country ranging from 2.5% to 7% compared with 1% from influenza [baud2020real, rajgor2020many, jung2020real]. Considering different age groups, elder people and patients with comorbidities are most vulnerable and more likely to progress to a life-threatening condition[liu2020clinical]. To prevent the spread of the disease, different governments have implemented strict containment measures [tian2020investigation, gatto2020spread] which have aimed to minimize transmission. Because of the strong infection rate of COVID-19, rapid and accurate diagnostic methods are urgently required to identify, isolate and treat the patients specially considering that effective vaccines are still under development.

The diagnosis of COVID-19 relies on reverse-transcriptase-polymerase chain reaction (PCR) test [lanciotti1992rapid, kim2020diagnostic, kucirka2020variation]. However it has several drawbacks. The PCR tests often require 5 to 6 hours to yield results. Sensitivity of PCR depends on the stage of the infection [kucirka2020variation] and can be as low as 71%[fang2020sensitivity]. Therefore care must be taken on interpreting PCR tests. More importantly, the cost of PCR prevents large population from being tested in the developing and highly populated economies.

From the imaging domain, chest CT may be considered as a primary tool for the current COVID-19 detection [ai2020correlation, kim2020diagnostic, udugama2020diagnosing] and the sensitivity of chest CT can be greater than that of PCR (98% vs 71%) [fang2020sensitivity], but the cost and time including for system decontamination can be prohibitive. Portable X-ray (XR) units are cost and time effective, and thus is the primary imaging modality used in diagnosis and management. Since it has limited capability to provide detailed 3D structure of anatomy or pathology of chest cavity and therefore not regarded as an optimum tool for quantitative analysis [blanchon2007baseline]. Also due to the imaging apparatus and nature of the x-ray projection, it is challenging to the radiologists to identify relevant disease regions and creates difficulty in accurate interpretation [jacobi2020portable, pereira2020covid].

To alleviate the lack of experienced radiologists and minimize human effort in managing an exponentially growing pandemic and the impending task to triage suspected COVID-19 subjects, the academic and industry communities have proposed various systems for diagnosing COVID-19 patients using X-ray imaging [castiglioni2020artificial, murphy2020covid, shi2020review, neri2020use]. The performance of some AI systems to detect COVID-19 pneumonia was comparable to radiologists to identify presence or absence of COVID-19 infection[murphy2020covid]. The common outputs of these diagnostic solutions or triage systems are classification likelihoods for COVID19 and non-COVID-19 accompanied by heatmaps localizing the suspected pathological region or areas of attention.

Although the performance of some systems approaches the level of radiologists on X-rays in terms of classification, but to our best knowledge, no studies have verified the detection and segmentation of the disease regions against human annotation on X-rays. Segmentation is critical for (a) severity assessment of the disease; and for (b) follow-up for treatment monitoring or progression of patient condition. Although human annotations can be obtained for COVID-19 regions on X-rays, the certainty of regions is weak compared to annotations derived from CT. The development in the field suffers from the lack of availability of COVID-19 X-ray images with corresponding region annotations. The contribution of our work builds on establishing a multi-modal protocol for our analysis and downstream classification of X-ray images. Our deep learning model trains using data from CT, through generation of synthetic X-ray from available CT scans and including original X-ray images as well. The inference pipleline is exclusively based on X-ray images, not requiring CT data. First we established a synthetic X-ray generation scheme to generate as many as possible realistic synthetic X-ray to significantly augment X-ray images to expand our training pool. Second, we use synthetic X-ray as a bridge to transfer the ground-truth from CT to the original X-ray geometry.

1.1 Method

Figure 1: The training scheme design

1.1.1 Solution Design Overview

The key idea is to learn the disease patterns jointly using CT and XR data but to inference the solution on X-ray images. In this study, we are not particularly focusing on network design, but rather on improving data sufficiency and enhancing ground-truth quality to improve the visibility of anomolous tissue on a lower dimensional modality. In order to leverage available unparied and paired X-ray and CT images, we have designed a pipeline as Fig.1 shows. Disease regions in X-ray and CT are annotated independently by trained staff with varying levels of experience. Using CT, we generate multiple synthetic X-ray image (SXR) per CT volume together with corresponding 2D masks of the diseased region(see 1.1.2). For patients where X-rays and CTs (we refer them as paired CT and XR data) are acquired within a small time-window (e.g. 48 hours), we automatically transfer the masks from the SXR to the corresponding XR. The transferred annotations on XR are further adjusted manually reviewed by trained staff to build our pristine mask ground-truth/annotations (see 1.1.3

). With all available XR and SXR together with corresponding disease masks, we train a deep-learning convolutional neural network for diagnosing and segmenting COVID-19 disease regions on X-ray.

1.1.2 Synthetic X-ray Generation

Original X-ray images are generated by shining an X-ray beam with initial intensity on the subject and measuring the intensity () of the beam having passed through an attenuating medium at different positions using a 2D X-ray detector array. The attenuation of X-ray beams in matter follows the BeerLambert law, stating that the decrease in the beam’s intensity is proportional to the intensity itself () and the linear attenuation coefficient value() of the material being traversed. (1).


From this we can derive the beam intensity function function over distance to be an exponential decay as the following:


In medical X-ray images the values saved to file are proportional to the expression seen in 3, which is actually a summation of the attenuation values along the beam. If the subject is made up of smaller homogeneous cells, this could be formulated as a sum.


3D CT images are constructed from many 2D X-ray images of the subject taken at different angles and positions (helical or other movement of the source-detector pair) by using a tomographic reconstruction algorithms such as iterative reconstruction or the inverse radon transform. The voxel values of the constructed CT image are the linear attenuation coefficient values specific to the used X-ray beam’s energy spectrum. These values are converted to Hounsfield unit (HU) values as seen in Eqn 4, where and are the linear attenuation coefficient for water and air respectively for the given X-ray source. Using this unit of measure transforms images taken with different energy X-ray beams to have similar intensity values for the same tissues, scaling water to 0 HU, air to -1000 HU.


An inverse process can help create X-ray images from CT images by projecting the constructed 3D volume back into a 2D plane along virtual rays originating from a virtual source. Since the X-ray image (Eqn. 3) is an integral/summation of attenuation values along the projection rays, established methods create projection images by converting the CT voxel values from HU to linear attenuation coefficient and simply summing the pixel values along the virtual rays (weighted by the traveled path-length by the ray withing each pixel). To achieve realistic pseudo X-ray we used a point source for the projection which accurately models the source of the X-ray, like in actual X-ray imaging apparatus. For the projection itself we used the ASTRA-toolbox[ASTRA] , which allowed parameterization of the beam’s angle (vertical and horizontal) as well as the distances between the virtual point source, the origin of the subject and the virtual detector (see Fig. 2). As X-ray images have the same pixel spacing in both dimensions, while CT images usually have larger pixel spacing in the z direction, we re-sampled the 3D volumes to have isotropic (0.4 mm) spacing prior to computing the projection. In clinical X-ray images, the radiation source is behind and the detector is in front of the patient - also known as posterior-anterior (PA) view the images generated appear to be flipped horizontally. To conform with this protocol, we also flipped PA projections.

It is important to point out that in case of small field-of-view CT images, where the patient doesn’t fit into the reconstruction circle in the horizontal plane, creating coronal or sagittal projection will not result in lifelike X-rays image as body parts are missing from the virtual ray’s path.

During our work we also projected the segmentation ground truth masks for CT images to 2D format. The same transformations were applied to these binary masks as to the corresponding CT volumes excluding the ones aiming to scale the voxel/pixel values. In our workflow we use both a binary format of the 2D projected masks - created by setting all non-zero values to 1 - and a depth-mask format which retains information of the depth of the original 3D mask. The latter was scaled so the values have the physical meaning of depth.

Figure 2: The illustration of synthetic X-ray and its mask generation

1.1.3 Pristine Annotation Generation

For paired XR and CT from same patients, our aim is to transfer the pixel-wise ground-truth from CT to SXR and from SXR to XR. SXR serves as a bridge between CT and XR.

For each CT volume, we generate a number of SXRs and corresponding disease masks by varying imaging parameters as mentioned in 1.1.2. The goal is to register each SXR candidate to XR so that the disease region between SXR and XR have the best alignment so that we can apply the same transformation on the disease mask from SXR and generated registered disease mask for the paired XR. As for the same CT, multiple SXRs are generated resulting different registered masks and we simply select the one with the best mutual information between registered SXR and XR. Another approach to select optimal SXR involves comparison of the lung mask region. The pair with the most overlapping mask was used to select the ideal pair.

To register SXR to XR, one challenge is that the fields of body view are different between the two and SXR is imaged usually with hands and arms up while for XR the hands are down, normal position of the body. To alleviate the problem of the during image registration, instead of using original images, we create lung ROI image by applying lung segmentation on synthetic X-ray and X-ray. Fig. 3 shows one example how we transfer the CT ground-truth from CT to synthetic X-ray and from synthetic X-ray to real X-ray. We observe that there are moderate differences between annotations from original X-ray and those annotations derived from transformed and transferred CT data implying that the visibility of COVID-19 related pneumonia on X-ray may not be ideal and most comprehensive.

Figure 3: An example of transfering synthetic X-ray mask to X-ray mask. Top: a representative synthetic X-ray generated from CT, the corresponding lung image and disease mask; Middle: paired X-ray, the corresponding lung image and direct disease annotation from X-ray; bottom: X-ray with transferred annotations from CT shown as red contour; registered lung image from synthetic X-ray and transferred disease annotations from synthetic X-ray

Although we have an approximate spatial match between SXR and XR, we cannot directly use transferred groundtruth(GT) from 3D CT onto 2D even with the best matching for training the deep learning models. The disease can rapidly change even within 48 hours. In this case the transferred groundtruth from CT can only be used as a directional guidance for a human expert to annotate on the 2D X-rays. Therefore for paired cases, we perform manual adjustment on the automated transferred ground-truth.

Figure 4: The schematic overview of our proposed classification and segmentation model

1.1.4 COVID-19 Modeling and Evaluation Strategy

For this extra-ordinary COVID-19 pandemic, AI systems have been investigated to identify anomalies in the lungs and assist in the detection, triage, quantification and stratification (e.g. mild, moderate and severe) of COVID-19 stages. To help radiologists do the triage, our deep-learning model takes frontal (anterial-posterior view or posterior-antierial view) X-ray as input and output two types of information (see Fig. 4

): location of the disease regions and classification. For location, the network generates a segmentation mask to identify disease pixels or regions on X-ray related to COVID-19 or regular pneumonia. For classification, the fully connected neural network (FCN) outputs the likelihood of COVID-19 pneumonia, other pneumonia or a negative finding for each given input AP/PA X-ray image.

Many publications have released their algorithms in open online forums or are marketing the same as additional pneumonia indicators. However, the robustness of the algorithms and their clinical value is somewhat unproven. A few studies have characterized systems for COVID-19 prediction with stand-alone performance that approaches that of human experts. However, all the existing works have either no established pixelwise ground-truth or are evaluated using pixel-wise ground-truth from purely X-ray annotations with uncertainties from annotators

Our deep learning model was trained with and without using data from CT cases. We evaluated our approach in three seperate aspects: first, AI predictions were compared with the image-level classification labels; second, the segmentation of disease regions for the COVID class was evaluated against direct X-ray pixel-wise annotations. Third, the segmentation of disease regions for the COVID-19 class was evaluated against pristine pixel-wise annotations.

1.2 Data and Groundtruth

In this study, we formed a large experiment dataset consisting of real X-ray images and synthetic X-ray images originating from CT volumes. The data was sourced from in-house/internal collections as well as publicly available data sources including Kaggle Pneumonia RSNA [KagglePneumonia], Kaggle Chest Dataset [KaggleChest], PadChest Dataset [bustos2019padchest], IEEE github dataset[IEEE-Github], NIH dataset [wang2017chestx]. Any image databases with limiting non-commercial use licenses were excluded from our train/test cohorts. Representative paired and unpaired CT and XR datasets from US, Africa, and European population were also leveraged and sourced through our data partnerships. Outcomes were derived from information aggregated from radiological and laboratory reports. A summary of the database and selected categories used for our experiments is summarized in Table 1 where the in-house test dataset is a subset of test dataset. As the trained model will be inferenced on real X-ray images, we remove synthetic X-rays for validation and testing.

Dataset X-rays (# XMA , # PMA) Synthetic X-rays (# SMA)
train COVID-19 974 (247, 77) 21487 (8322)
train pneumonia 10175(6108, 17) 11312 (5380)
train negative 17068 (NA,NA) 12542 (NA)
val COVID-19 113 (37,2) NA
val pneumonia 531 (473,8) NA
val negative 3731 (NA, NA) NA
test COVID-19 307 (68, 52) NA
test pneumonia 1006 (345,33) NA
test negative 2271 (NA,NA) NA
in-house test COVID-19 268 (68,52) NA
in-house test pneumonia 119 (45,33) NA
in-house test negative 34 (NA,NA) NA
Table 1: Dataset breakdown for our experiments

Two levels of groundtruth are associated with each image: image-level groundtruth and pixel-level (segmentation) groundtruth.

For image-level groundtruth, each image is assigned with a label of COVID-19, pneumonia or negative. All in-house X-rays and CTs and the COVID-19 images from the public data sources were confirmed by PCR tests. The labels of pneumonia and normal images from public data sources are given by radiologists.

For pixel-wise/segmentation groundtruth, we have four types of annotations. X-ray manual mask annotations (XMA) are made by annotators purely based on X-rays without any information from CTs. Synthetic mask annotations (SMA) are generated by the projection algorithms based on CT annotations for synthetic X-rays. Transferred CT mask annotations (TMA) are automatically generated by registration algorithms which transfer annotations from CT to X-ray using SXR as a bridge. Pristine mask annotations (PMA) which are adjusted annotations by annotators from TMA.

1.3 Experiment Settings

To show the benefits of multi-modal training for developing COVID-19 model, we have conducted training with 4 different training datasets summarized in Table 2.

training set Description
T1 X-ray images with XMA
T2 T1 + synthetic X-ray images with SMA
T3 T1 + X-ray images with PMA
T4 T1 + synthetic X-ray images with SMA + X-ray images with PMA
Table 2: Different training sets

1.4 Evaluation Metrics

To evaluate the performance of our inferred classification of the test subjects, we use the area under the receiver operating characteristic curve (AUC) between different combinations of positive negative classes including COVID-19 vs pneumonia, COVID-19 vs pneumonia+negative and COVID19 vs negative for different usage scenarios.

coefficient is used to evaluate the exactness of our pathology localization,

1.5 Results

1.6 Pristine annotation creation

With three different pixel-wise annotations on X-ray, we evaluated overlap between XMA, PMA and TMA.

XMA vs TMA 0.28
PMA vs TMA 0.47
XMA vs PMA 0.50
Table 3: Area overlapping between different annotations

We can observe (see Table 3) that after using TMA (transferred CT annotations), the consistency of human annotations to CT annotations is largely improved from 0.28 to 0.47 in terms of coefficient. The coefficient between XMA (X-ray manual mask annotations) and PMA (pristine mask annotations) are also moderate which means PMA is somewhere between X-ray direct annotations and CT transferred annotations. Fig.5 shows a number of examples with TMA, XMA and PMA. The X-ray annotations show large inconsistency with automated CT transferred annotations.

Figure 5: Examples with TMA as red contour, XMA as blue regions and PMA as green regions

1.7 Model Evaluation

The evaluation of the model was performed in terms of both classification and segmentation of COVID-19 disease regions. To test the effect of domain shift, we show the evaluation results on both all X-ray test dataset and in-house test dataset where COVID19 regular pneumonia and normal cases are all available.

Regarding classification, Table 5 shows AUC based on different combination for positive and negative classes. By adding synthetic X-rays, the AUC increases for COVID-19 vs pneumonia + negative increase from 0.89 to 0.93. The addition of adding pristine groundtruth does not further increase the AUC. The same increase is observed if AUC is computed using COVID-19 as positive and pneumonia as negative class.

We measure the Dice coefficients to estimate the segmentation accuracy. As PMA are obtained, we have measured Dice on all testing images, in-house test images and testing images with PMA only shown by Table

4. Adding PMA can largely improve the measures across different test settings up to 0.70. Fig.6 shows examples of AI detection and segmentation of COVID-19 regions with manual annotations as overlay as well.

Train dataset vs test Dice XR with XMA in-house XR with XMA in-house XR with PMA
T1 0.58 0.59 0.59
T2 0.57 0.56 0.58
T3 0.60 0.62 0.70
T4 0.57 0.58 0.62
Table 4: Segmentation results: dice measures of different training schemes on different datasets
Training vs test AUC
AUC C vs P+N
on XR
AUC C vs P
on XR
AUC C vs P+N
on in-house XR
AUC C vs P
on in-house XR
T1 0.98 0.98 0.89 0.87
T2 0.99 0.98 0.93 0.92
T3 0.99 0.98 0.91 0.90
T4 0.99 0.99 0.93 0.92

Table 5: Classification results: AUC measures of different training schemes on different datasets with different positive and negative compositions
Figure 6: Segmentation examples where images on the left are original X-ray images, in the middle are PMA and on the right are AI segmentations

1.8 Conclusion and Discussion

In this study, from multi-modal perspective, we have developed an artificial intelligence system which learns from high dimensional modality CT but inferences on low dimensional mono-modality X-ray for COVID-19 diagnosis and segmentation. The system classifies a given image into three categories: COVID-19, pneumonia and negative. We show that by learning from CT, the performance of the AI system is improved in terms of both classification and segmentation. Our AI system achieves a classification AUC of 0.99 and 0.93 between COVID-19 and pneumonia plus negative on full testing dataset and the subset in-house dataset, respectively. The Dice of 0.57 and 0.58 are obtained for COVID-19 disease regions on full testing dataset and the subset in-house dataset using X-ray direct annotations, respectively. The Dice is increased to 0.62 when pristine ground-truth transferred from CT is used for training and testing. Accurate classification can aid physician in triaging patients and make appropriate clinical decisions. Accurate segmentation enhanced confidence in the triage, and helps in quantitative reporting of disease for reporting and monitoring progression.

Learning from a second modality (CT) in our multi-modal approach has two main implications. One impact is to add synthetic X-ray and corresponding disease masks with different projection parameters to significantly augment training image pool to ensure data diversity. Another benefit related to additional pathology evidence from a higher sensitivity modality when we transfer the CT annotations from synthetic X-ray, and from synthetic X-ray to original X-ray, and allowing manual fine tuning and the annotations used for training. The first addition contributes mostly towards gain in the classification accuracy and the second addition contributes substantially to the gain in disease localization. The quality of the synthetic X-ray may play an important role and is worth further investigation, perhaps using generative networks to make the synthetic X-ray more realistic.

Our study shows that there exists substantial inconsistency between X-ray direct annotations and automated transferred CT annotations. When using transferred annotations as hints, the second version of X-ray annotations (pristine groundtruth) are more consistent to automated transferred CT annotations. The Dice between direct annotations and pristine groundtruth is 0.50 which might be a good reference indicating how good segmentation the AI should achieve. The pristine groundtruth is something between X-ray annotations and CT annotations. It should be noted that the automated transferred annotations can not be directly used as errors might be induced because of registration. Also COVID-19 is a fast changing disease, as we are pairing X-ray and CT acquired within a time-window of 48 hours, the disease status in CT might be quite different compared to the disease status when X-ray is taken. Therefore manual adjustment is an important consideration.

In the research and industry community, efforts have been made to apply AI into imaging-based pipeline of for the COVID-19 applications. However, many existing AI studies for segmentation and diagnosis are based on small samples and based on single data source, which may lead to the overfitting of results. To make the results clinically applicable, a large amount of data from different sources shall be collected for evaluation. Moreover, many studies only provide classification prediction without providing segmentation or heatmap which makes AI systems lack explainability. By providing the segmentation, we aim to fill this void, enhancing the promotion of AI in clinical practice. On the other hand, the imaging-based diagnosis has limitations and clinicians make the diagnosis considering clinical symptoms also. An AI system can be largely enhanced with incorporation of patient clinical parameters [mei2020artificial] such as blood oxygen level, body temperature, to further enhanced the capability to accurately diagnose pathological conditions.

WHO has recommended[who2020-chest] a few scenarios where chest-imaging can play an important role in care delivery. From the triage perspective, WHO suggests using chest imaging for the diagnostic workup of COVID-19 when PCR testing is not available (timely) or is negative while patients have relevant symptoms. In this case, the classification support from our AI can aid radiologists to identify COVID-19 patients. From monitoring or temporal perspective, for patients with suspected or confirmed COVID-19, WHO suggests using chest imaging in addition to clinical and laboratory assessment to decide on hospital admission versus home discharge, to decide on regular admission versus intensive care unit (ICU) admissions, to inform the therapeutic management. In these scenarios, accurate segmentation of disease regions is essential for the evaluations.

The COVID-19 disease continues to spread around the whole world. Medical imaging and corresponding artificial intelligence applications together with clinical indicators provides solutions for triage, risk analysis and temporal analysis. This study provides a solution from multi-modal perspective to leverage the CT information but to inference on X-ray to avoid the necessity of taking CT imaging because of limited accessibility, dose and decontamination concern. Future work focusses on leveraging paired CT information to estimate severity and other higher dimensional measures. We believe that our study introduces a new trend of combinating multi-modal training and single-modality inference. Although this study tries to leverage CT information as much as possible to aid the data-driven AI solution on X-rays, it should be noted that the extraction of relevant information is limited by the nature of X-ray imaging because of 2D projection as well as impaired visibility with the presence of the non-lung thick tissue. In addition different vendors may apply varying post-processing algorithms to suppress information on those thick tissue regions. To avoid information loss, our AI may be deployed directly on X-ray hardware with direct access to X-ray raw images for ideal translation to the clinic.