A major challenge in rolling out machine-learned models to a broad user base is the variability of data encountered in the real world. Models can only be expected to work well on data drawn from a distribution similar to the training data; yet, ubiquitously, differences in the image acquisition setup hinder the applicability of a once-developed model in novel settings. A recent example of the negative effects of such failure to adapt between domains occurred at the start of the COVID-19 pandemic:
As of 2nd February 2021, this disease had caused over million infections worldwide and over million deaths according to the World Health Organisation (WHO) [who_reports]. To alleviate this, rapid diagnosis of COVID-19 cases has proven effective for decelerating the spread of the disease [horry2020covid]. According to [horry2020covid, chen2020sars], reverse transcriptase quantitative polymerase chain reaction (RT-qPCR) tests are accepted as the gold standard for the identification of positive cases. However, this type of test was not available in sufficient numbers at the beginning of the pandemic. Furthermore, beyond being time-consuming, it relies on both human effort and expert knowledge. Thus, there arose a need for automatic diagnostic methods that can assist experts and reduce human effort by targeting the automatic identification of COVID-19-positive cases. The literature has shown promising efforts in the automatic identification of COVID-19 cases from lung computed tomography (CT) scans using computer vision methods [mei2020nature, harmon2020nature, lin2020rnas, wang2020tmi]. Lessmann et al. successfully addressed cross-vendor analysis (between different CT scanners such as Varian, Siemens, GE Healthcare, Philips, and Canon) for 3D CT scans [lessmann2021automated]. However, a considerable drop in cross-dataset performance has been demonstrated for the diagnosis of 2D CT scans acquired with different devices. Thus, the previously mentioned within-dataset variability has the potential to discourage the community from merging and annotating data from multiple sources. As a result, combining datasets is a challenge posed not only for COVID-19 detection but also for other applications in diagnosis and segmentation.
In this paper, we address domain adaptation of medical image analysis methods by proposing a deep convolutional neural network (CNN) that preprocesses 2D CT scans and is trained to fool a classifier discriminating between various CT datasets, thus aiming to remove the within-dataset variability. We evaluate the performance of the suggested method on the exemplary use case of predicting COVID-19-positive cases, due to the global variability in the respective datasets and the many available opportunities for comparison. It should be noted that the methodology is inspired by generative adversarial learning [DBLP:conf/nips/GoodfellowPMXWOCB14, schmidhuber2020generative]. Our contribution is twofold: (i) we propose a novel trainable preprocessing CNN architecture with a dual training objective that is capable of equalizing the variability of different CT-scanner technologies in the image domain as a pre-processor (PrepNet); (ii) we validate this model by showing the transferability of its diagnostic capabilities between different CT data sources based on common public benchmarks. We conduct experiments on the SARS-CoV-2 CT-scan dataset [DVN/SZDUQX_2020] and the UCSD COVID-CT dataset [zhao2020covid] as well as the MosMed dataset [morozov2020mosmeddata]. Our results show that our PrepNet model improves cross-dataset COVID-19 diagnosis performance (i.e., training on one dataset and testing on another) by percentage points (pp) by creating a unified representation of multi-dataset CT scans.
II Related Work
With the emergence of COVID-19, many studies and datasets have been proposed in the literature, showing an increase in data diversity over time and a growing range of computer vision methods to deal with it [cohen2020covid, gunraj2021covid]. Horry et al. [horry2020covid] utilize a transfer learning scheme to build various COVID-19 classifiers based on several off-the-shelf CNN models such as VGG16/19 [simonyan2014very], ResNet50 [he2016deep], InceptionV3 [szegedy2016rethinking], Xception [chollet2017xception], and InceptionResNet [szegedy2017inception]. They compare the generalization capability across various image sources such as X-ray, CT, and ultrasound images and develop a pre-processing scheme for X-ray images that reduces noise in non-lung areas in order to decrease the effect of quality imbalance among the employed images. A VGG19 [simonyan2014very] coupled with ultrasound images is found to yield the best validation accuracy of %, while % has been achieved using CT scans [he2020sample].
He et al. [he2020sample] propose a sample-efficient learning concept called “Self-Trans” that synergistically combines transfer learning and contrastive self-supervised learning. They seek intrinsic visual patterns in CT scans without relying on labels created with human effort. In addition, they open-sourced their CT dataset involving COVID-19-positive patients and COVID-19 negatives [zhao2020covid]. They achieve an accuracy of % through unbiased feature representations together with a reduction of overfitting.
Mobiny et al. [mobiny2020radiologist] propose the DECAPS approach with the following contributions: (i) inverted dynamic routing [sabour2017dynamic] to avoid seeking visual features in unrelated regions, (ii) training with a two-stage patch crop-and-drop strategy to encourage the network to focus on the useful areas, and (iii) employing conditional generative adversarial networks for data augmentation. Experiments result in % precision and % recall along with % accuracy. They additionally report results for the conventional deep classifiers DenseNet121 [gao2017densely] and ResNet50 [he2016deep], yielding % and % accuracy, respectively. In contrast to this study, Pham [pham2020comprehensive] points out the negative impact of data augmentation in the context of CT-based COVID-19 image classification. In his study, the author fine-tunes various well-known pre-trained CNN models ranging from AlexNet [krizhevsky2012imagenet] to NasNet-Large [zoph2018learning]. Experiments conducted on the already introduced CT dataset [zhao2020covid] credit a DenseNet-201 with the best accuracy of %. However, data augmentation using random vertical/horizontal flips (p=), vertical/horizontal translation ( pixels), and scaling (%) yields a % accuracy drop on average.
Chaganti et al. [chaganti2020quantification] suggest a deep-reinforcement-learning-based scheme that seeks suspicious lung areas on CT scans to localize abnormal portions. In a recent study [gunraj2021covid], a novel architecture called “COVID-Net-CT-2” is proposed, which utilizes machine-driven design exploration based on iterative constrained optimization [wong2019netscore]. The authors point out that one of the subtle problems of earlier studies is the limited number of patients and the poor diversity of CT scans in terms of multi-nationality. Therefore, they introduce two large-scale COVID-19 CT datasets called “COVIDx CT-2A” and “COVIDx CT-2B”, gathered from patients from at least countries and comprising and images, respectively. Experiments show that the architecture achieves a sensitivity of % and an accuracy of %, which competes with radiologist-level decision-making capability. That study deals with variability in the patients’ ethnicity; however, CT scans generated by different vendors’ devices exhibit visual differences, artifacts, and variable intensities that have not been addressed so far. Thus, independent of the reported success of a given deep learning architecture, a drop in prediction accuracy is likely during inference when a test image is acquired with a different device than was used for training. Motivated by this issue, we propose to employ a pre-processing network (PrepNet) that standardizes CT images with respect to the visual differences among datasets prior to training any final diagnosis model, relying on generative architectures since they have shown very promising results for similar tasks [mobiny2020radiologist]. An advantage of this approach is that PrepNet can be combined with any downstream diagnosis model, thus leveraging future progress there without additional cost while improving cross-dataset performance.
Two research papers closely related to the domain adaptation goal of this study are by Lessmann et al., addressing cross-vendor diagnosis [lessmann2021automated], and Amyar et al., using auto-encoders in multi-task learning [amyar2020multi]. However, Lessmann et al. did not face a considerable cross-vendor performance drop because they used a richer source of information (3D scans), as explained in [de2020improving]. Amyar et al. leveraged multi-task learning and trained an auto-encoder alongside a segmentation and a classification model for COVID-19 diagnosis. However, they did not aim at removing the cross-dataset variability of the scans. This study focuses on homogenizing 2D CT scans by reducing cross-dataset information.
In this section, we give details of our PrepNet model in terms of network architecture, core modules, and loss functions. The architecture of our proposed model is presented in Figure 1. For a group of input CT scans coming from different CT vendors’ devices, our model extracts multi-scale discriminative feature maps through an auto-encoder and reconstructs the original CT scans. The reconstructed CT scans are then fed into a dataset/technology classification branch that acts as a pseudo-label classifier and is responsible for discriminating among the different CT datasets. Once this model has been trained end-to-end in an adversarial fashion, the reconstructed CT scans are fed into a COVID-19 classifier that is trained directly on them. The COVID-19 classification branch is responsible for the classification of healthy vs. non-healthy patients. The complete network model with its main modules is described in more detail below.
III-A Model Architecture
Auto-Encoder Module: We feed a CT scan image x into our auto-encoder, consisting of an encoder E and a decoder D, and obtain a reconstructed version given by x̂ = D(E(x)). The encoder E is based on the standard classification network VGG-Net [simonyan2014very], whilst the decoder D is a convolutional network with the same number of layers as the encoder. We add skip-connections from E to D to recover the spatial information lost during the down-sampling operations.
Dataset Classifier Module: The CT dataset classifier receives the reconstructed CT scan x̂ from the auto-encoder as input and feeds it into an encoder branch that classifies the CT dataset/technology. In our experiments, this classifier relies on the VGG-Net architecture as well.
COVID-19 Classifier Module: The COVID-19 classifier also supports several backbone architectures. Given a reconstructed CT scan x̂, it outputs COVID vs. non-COVID predictions.
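The three modules above can be sketched in a few lines of PyTorch. This is a minimal illustration of the layout only: the channel counts, network depth, and classifier heads are placeholders, not the paper's actual VGG-based configuration.

```python
import torch
import torch.nn as nn

class PrepNetSketch(nn.Module):
    """Sketch of the PrepNet layout: auto-encoder (E, D) with a skip
    connection, plus a dataset-classifier head and a COVID head that
    both consume the reconstructed image."""
    def __init__(self, n_datasets=3):
        super().__init__()
        # Encoder E: two down-sampling stages (placeholder for VGG-Net)
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # Decoder D: mirrors the encoder's depth
        self.dec2 = nn.Sequential(nn.Upsample(scale_factor=2),
                                  nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.Upsample(scale_factor=2),
                                  nn.Conv2d(16, 1, 3, padding=1))
        # Classifier heads (placeholders for the VGG-based branches)
        self.ds_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1, n_datasets))
        self.cov_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1, 2))

    def forward(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(f1)
        d2 = self.dec2(f2)
        x_hat = self.dec1(d2 + f1)  # skip connection recovers spatial detail
        return x_hat, self.ds_head(x_hat), self.cov_head(x_hat)
```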
III-B Loss Functions and Evaluation Metric
The complete loss function of PrepNet is based on the various terms presented in Figure 1. It comprises a reconstruction loss L_rec and two classification losses, L_cov and L_ds:
Given the labeled dataset comprising the CT scans x_i together with their binary COVID labels y_i and the CT-dataset pseudo labels d_i, the auto-encoder reconstruction loss is denoted L_rec; the COVID-19 binary classification loss is denoted L_cov; and the CT-dataset pseudo-label classification loss is denoted L_ds.
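Taking, for instance, the mean squared error for reconstruction and cross-entropy for both classification branches (the standard choices for this setup; the notation n for the mini-batch size and θ, φ for the two classifier heads is ours), the three terms can be written as:

```latex
\mathcal{L}_{\mathrm{rec}} = \frac{1}{n}\sum_{i=1}^{n}\lVert x_i - \hat{x}_i\rVert_2^2, \quad
\mathcal{L}_{\mathrm{cov}} = -\frac{1}{n}\sum_{i=1}^{n}\log p_{\theta}(y_i \mid \hat{x}_i), \quad
\mathcal{L}_{\mathrm{ds}} = -\frac{1}{n}\sum_{i=1}^{n}\log p_{\phi}(d_i \mid \hat{x}_i)
```

In the adversarial phase, the auto-encoder is updated to keep L_rec low while preventing the dataset head from minimizing L_ds, whereas the dataset head is updated to minimize L_ds.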
To measure the COVID-19 detection performance and to minimize the effect of class imbalance in the datasets, we use the balanced accuracy metric (BA) [brodersen2010balanced], BA = (TP/P + TN/N)/2, where P and N are the number of positive and negative samples and TP and TN denote the number of true positive and true negative predictions, respectively. In addition, we also use specificity, sensitivity, and the area under the curve to evaluate the COVID-19 performance results.
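For concreteness, the metric can be computed directly from the confusion-matrix counts:

```python
def balanced_accuracy(tp, tn, p, n):
    """Balanced accuracy [brodersen2010balanced]: the mean of sensitivity
    (TP/P) and specificity (TN/N), insensitive to the class ratio."""
    return 0.5 * (tp / p + tn / n)
```

Unlike plain accuracy, a classifier that always predicts the majority class scores only 0.5 under this metric, regardless of how imbalanced the dataset is.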
We use three public datasets to validate our approach experimentally. The SARS-CoV-2 CT-scan dataset [DVN/SZDUQX_2020] comprises a total of CT images of real patients from the Public Hospital of the Government Employees of Sao Paulo (HSPM) and the Metropolitan Hospital of Lapa, both in Sao Paulo, Brazil ( positive/infected and healthy patients). Moreover, CT scans belong to patients who have other pulmonary diseases. The CT image annotations (positive vs. negative) were performed by three different clinicians. Note that during our visual inspection we found two erroneous images (i.e., unrelated to the problem domain) and excluded them from the dataset. In addition, we also excluded the patients with other pulmonary diseases.
The UCSD COVID-CT dataset [zhao2020covid] was collected at the Tongji Hospital in Wuhan, China during the outbreak of COVID-19 between January 2020 and April 2020. This dataset contains CT images from infected patients and from non-infected patients. All images were annotated by a senior radiologist of the same hospital. As reported by [mobiny2020radiologist], the heights of the images in this dataset range between and pixels with an average of pixels, whereas the widths vary between and pixels (average of pixels). For partitioning, we follow the splitting guideline provided by the authors of the dataset. Table I summarizes the train, validation, and test splits for each dataset.
The MosMed dataset [morozov2020mosmeddata] was collected by the Moscow Health Care Department from different municipal hospitals in Russia between March 2020 and April 2020. The dataset contains axial CT images from patients with different levels of COVID-19 severity, ranging from mild to critical cases, as well as healthy patients. Some image samples of each dataset are provided in Figure 2.
|Dataset||Type||Scanner||Country||Train||Validation||Test|
|SARS-COV-2 [DVN/SZDUQX_2020]||2D CT||Various||Brazil||(%)||(%)||(%)|
|UCSD COVID-CT [zhao2020covid]||2D CT||Various||China||(%)||(%)||(%)|
|MosMed Dataset [morozov2020mosmeddata]||3D CT||Various||Russia||images for unseen test dataset|
IV-B Implementation Details
We run all our experiments using the publicly available PyTorch 1.5.0 library and an NVIDIA V100 GPU ( GB of VRAM). During network training, each image is first resized according to the input size of the classifiers’ backbones; we use histogram equalization as a fixed preprocessing step and then normalize with the ImageNet mean and standard deviation expected by the pretrained models. We train PrepNet using the AdamW optimizer [loshchilov2017decoupled]. We perform a hour hyperparameter search with six parallel runs, using the Bayesian search strategy with Hyperband for early stopping, on one GPU [tuggener2019automated]. The hyperparameter search improves the chance of avoiding poor local minima and of presenting near-optimal results for every configuration. The best model is selected based on validation performance. During training, we first train the auto-encoder for epochs and warm up the dataset classification branch for epochs before we start the adversarial training. Once the adversarial training is finished, we train the COVID classification branch independently of the other two branches, using the output of the auto-encoder/PrepNet.
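The training schedule above can be sketched as four sequential phases. The per-phase step callables and epoch counts are placeholders for the actual training loops, which the paper does not spell out in code:

```python
def train_prepnet(ae_step, ds_step, adv_step, cov_step,
                  ae_epochs, warmup_epochs, adv_epochs, cov_epochs):
    """Four-phase schedule: (1) pre-train the auto-encoder,
    (2) warm up the dataset classifier, (3) adversarial training in
    which the auto-encoder tries to fool the dataset classifier,
    (4) train the COVID head on the frozen PrepNet output."""
    for _ in range(ae_epochs):
        ae_step()       # minimize reconstruction loss L_rec
    for _ in range(warmup_epochs):
        ds_step()       # minimize dataset-classification loss L_ds
    for _ in range(adv_epochs):
        adv_step()      # alternate: classifier minimizes L_ds, AE counteracts it
    for _ in range(cov_epochs):
        cov_step()      # minimize COVID loss L_cov on reconstructed scans
```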
IV-C Experimental Results
The within- and cross-dataset performance of the proposed preprocessing schemes is presented in Table II. In order to monitor possible overfitting, we report the hold-out test set performance on each dataset. The cross-dataset performance is evaluated by measuring the balanced accuracy (minimizing the effect of class imbalance) of models trained on one dataset and tested on the other. We report results using the balanced accuracy of the models trained on the SARS-COV-2 and UCSD COVID-CT datasets. Further metrics include sensitivity (Sens), specificity (Spec), and area under the curve (AUC). The rows indicate the datasets used during training. Furthermore, we group the results by model. The first group of results relates to the COVID classifier (a pre-trained VGG-19 model) trained and evaluated on the original CT scans. The second group relates to the auto-encoder alone, trained on both datasets in a self-supervised manner to minimize the reconstruction loss. The third group relates to full PrepNet preprocessing before training the classifiers.
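This evaluation protocol amounts to filling a train-dataset × test-dataset grid. A small sketch with hypothetical model and metric interfaces (the real pipeline would plug in the trained classifiers and balanced accuracy):

```python
def accuracy(y_true, y_pred):
    """Plain accuracy; swap in balanced accuracy to match the paper's metric."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_dataset_table(models, datasets, metric=accuracy):
    """Score every model (keyed by its training dataset) on every test set.
    Diagonal entries are within-dataset results; off-diagonal entries are
    the cross-dataset results."""
    return {(m_name, d_name): metric(y, [predict(x) for x in X])
            for m_name, predict in models.items()
            for d_name, (X, y) in datasets.items()}
```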
|Test dataset||SARS-COV-2||UCSD COVID-CT||Within Test||Cross-Dataset||Pre-trained|
The results in Table II show that the average cross-dataset performance (over all dataset splits) of models trained on original data increases by pp when using the pure auto-encoder model, and by pp with PrepNet. However, the average within-dataset test accuracy declines by pp and pp after applying the baseline auto-encoder and PrepNet, respectively. A discussion of this effect is presented in the next section.
In our experiments, we use VGG19 [simonyan2014very] as the baseline model because it is comparatively straightforward to train and has shown good generalization properties on 2D medical images in previous practical experiments (https://stanfordmlgroup.github.io/competitions/mura/). Besides that, the VGG architecture has also been successfully applied to COVID-19 identification [horry2020covid, he2020sample].
As part of our ablation study, we also evaluate how different backbones affect the COVID-19 diagnosis accuracy of PrepNet. More precisely, we replicate the experiments for each dataset (SARS-COV-2 and UCSD COVID-CT) and evaluate different CNN architectures as part of our COVID-classifier module (see Section III-A for more information). The CNN architectures include ResNet18 [he2016deep], Inception [szegedy_inception], and EfficientNet-B0 [tan2019efficientnet]. We report the results in Table III. The experimental results show that for almost all backbones, the average cross-dataset performance increases at the cost of a small decrease in within-dataset accuracy.
|Test dataset||SARS-COV-2||UCSD COVID-CT||Within Test||Cross-Dataset||Pre-trained|
Finally, in order to evaluate the generalization capabilities of PrepNet and our baselines, we evaluate how our trained models perform on an unseen dataset, i.e., the MosMed dataset [morozov2020mosmeddata]. The results in Table IV show that our auto-encoder and PrepNet models improve BA and sensitivity, albeit with a decrease in specificity and AUC when compared with the COVID-19 classifier. Despite the decrease in specificity, we argue that especially for medical diagnosis and screening, low specificity is less harmful than reduced sensitivity, as false positive cases can be discarded by additional examinations. In contrast, high sensitivity is important, as false negatives should be rare.
The baseline and proposed pre-processing approaches introduce performance drops when applied before within-dataset classification: they usually reduce test accuracy when trained and evaluated on the same dataset using the corresponding dataset splits. We therefore further investigate the intermediate results of the baseline auto-encoder and PrepNet on a case-by-case basis. Severe cases of artifacts generated during reconstruction by the baseline auto-encoder and PrepNet are presented in Figure 3. We conjecture that the drop in within-dataset test performance is caused by occasional artifacts such as these. These quality drops are clearly visible in the reconstruction loss; however, correcting them is not straightforward. We could eventually overcome this by investigating different data augmentation strategies and by improving the network architecture of our auto-encoder. Additionally, we depict sample images for which the models failed to make a correct decision after auto-encoder or PrepNet preprocessing (see Fig. 4). The limited amount of training data and the noisy labels of the public datasets are further factors contributing to low classification accuracy. One possible way to tackle this limitation is to rely on weakly supervised learning methods to improve the COVID-19 classification accuracy, following the methodology summarized in [simmler2021survey].
|Dataset||pre-processed||initial reproduction||PrepNet reproduction|
V Conclusions and Future Work
In this paper, we introduced a novel approach to unify several CT scan datasets with respect to varying acquisition circumstances such as CT scanner technology by training an adaptive pre-processing network that removes such specificities from the images themselves. Additionally, we presented initial results demonstrating the applicability of the method on three publicly available benchmark datasets. This way, it becomes possible to shift the focus of model training from merely optimizing hold-out test set performance on the same data distribution (which likely does not transfer to any other environment) towards cross-dataset detection accuracy. The proposed PrepNet improves the cross-dataset balanced accuracy by a margin of percentage points (SARS-CoV-2 CT-scan dataset [DVN/SZDUQX_2020]) at the expense of a decline in within-dataset test performance of ca. pp (UCSD COVID-CT database [zhao2020covid]). These results suggest that the trainable preprocessing network erases some of the information necessary for diagnosis, due to artifacts. This information could be partially retained by propagating the gradients of the COVID-19 classifier network through the preprocessing model, and generated artifacts could be detected automatically by monitoring the reconstruction loss of the auto-encoder module. This, together with further investigations of the applicability and generality of the proposed approach to combining multiple datasets, is an intriguing theme for future research.
This research was financially supported by the ZHAW Digital Futures Fund under contracts “SDMCT—Standardized Data and Modeling for AI-based CoVID-19 Diagnosis Support on CT Scans” as well as “Synthetic data generation of CoVID-19 CT/X-rays images for enabling fast triage of healthy vs. unhealthy patients”.