Towards Efficient COVID-19 CT Annotation: A Benchmark for Lung and Infection Segmentation

04/27/2020, by Jun Ma, et al.

Accurate segmentation of lung and infection in COVID-19 CT scans plays an important role in the quantitative management of patients. Most existing studies are based on large, private annotated datasets that are impractical to obtain from a single institution, especially when radiologists are busy fighting the coronavirus disease. Furthermore, it is hard to compare current COVID-19 CT segmentation methods because they are developed on different datasets, trained in different settings, and evaluated with different metrics. In this paper, we created a COVID-19 3D CT dataset with 20 cases containing 1800+ annotated slices and made it publicly available. To promote the development of annotation-efficient deep learning methods, we built three benchmarks for lung and infection segmentation that cover current main research interests, i.e., few-shot learning, domain generalization, and knowledge transfer. For a fair comparison among different segmentation methods, we also provide unified training, validation, and testing splits, evaluation metrics, and the corresponding code. In addition, we provide more than 40 pre-trained baseline models for the benchmarks, which not only serve as out-of-the-box segmentation tools but also save computational time for researchers interested in COVID-19 lung and infection segmentation. To the best of our knowledge, this work presents the largest public annotated COVID-19 CT volume dataset, the first segmentation benchmark, and the largest number of pre-trained models to date. We hope these resources (https://gitee.com/junma11/COVID-19-CT-Seg-Benchmark) can advance the development of deep learning methods for COVID-19 CT segmentation with limited data.


I Introduction

COVID-19 (novel coronavirus 2019) has spread all over the world during the past few months and caused over 170,000 deaths as of 23 April 2020 [who_report_94]. It is projected to be a seasonal disease after the current outbreak [neher2020potential, kissler2020projecting].

Computed tomography (CT) is playing an important role in the fight against COVID-19 [RSNA_Expert_Consensus, RSNA_Expert_Consensus2, RSNA_Expert_Consensus3]. CT has been shown to be more sensitive in the early diagnosis of COVID-19 infection than Reverse Transcription-Polymerase Chain Reaction (RT-PCR) tests [fang2020sensitivity]. Wang et al. trained a deep learning model on 325 COVID-19 CT images and 740 typical pneumonia images; their model identified 46 COVID-19 cases that had previously been missed by the RT-PCR test. Further, quantitative information from CT images, such as lung burden, percentage of high opacity, and lung severity score, can be used to monitor disease progression and help us understand the course of COVID-19 [COVID-19-Quant, chaganti2020quantification].

Artificial Intelligence (AI) methods, especially deep learning-based methods, have been widely applied in medical image analysis to combat COVID-19 [shen-review]. For example, AI can be used to build a contactless imaging workflow that prevents transmission from patients to health care providers [wang2020precise]. In addition, most screening and segmentation algorithms for COVID-19 are developed with deep learning models [fang2020sensitivity, zheng2020deep, li2020artificial], and most automatic diagnosis and infection quantification systems in COVID-19 studies rely on segmentations generated by deep neural networks [zheng2020deep, cao2020AILongAss, huang2020QuanCOVID, cell2020QuanCOVID].

Although numerous studies show that deep learning methods have potential for providing accurate and quantitative assessment of COVID-19 infection in CT images [shen-review], the solutions mainly rely on large, private datasets. Due to patient privacy and intellectual property issues, these datasets and solutions may not be publicly available. Nevertheless, researchers have called on authors to provide their datasets, entire source code, and trained models [RSNAletterAIDis].

Existing studies demonstrate that the classical U-Net (or V-Net) can achieve promising segmentation performance if hundreds of well-labelled training cases are available [huang2020QuanCOVID, shan20quant]. Dice similarity coefficients from 83.1% to 91.6% have been reported for various U-Net-based approaches. Shan et al. developed a neural network based on V-Net and the bottleneck structure to segment and quantify infection regions [shan20quant, he2016deep]. With their human-in-the-loop strategy, they achieved 85.1%, 91.0%, and 91.6% Dice with 36, 114, and 249 labelled CT scans, respectively. Huang et al. [huang2020QuanCOVID] employed U-Net [ronneberger20152DUNet] for lung and infection segmentation using an annotated dataset of 774 cases, and demonstrated that the trained model could be used to quantify the disease burden and monitor disease progression or treatment response. In general, we observe a trend that training deep learning models with more annotations decreases the time needed for contouring the infection and increases segmentation accuracy, which is consistent with Shan et al.'s findings [shan20quant]. However, annotations for 3D CT volumes are expensive to acquire because they not only rely on the professional diagnostic knowledge of radiologists but also take much time and labor, especially in the current situation. Thus, a critical question would be:

how can we annotate COVID-19 CT scans efficiently?

Towards this question, three basic but important problems remain unsolved:

  • There is no publicly available, well-labelled COVID-19 3D CT dataset. Data collection is the first and essential step in developing deep learning methods for COVID-19 segmentation, yet most existing studies rely on large, private datasets with hundreds of annotated CT scans.

  • There are no public benchmarks to evaluate different deep learning-based solutions, and different studies use various evaluation metrics. Similarly, training, validation, and testing splits vary across studies, which makes it hard for readers to compare methods. For example, although the MedSeg dataset (http://medicalsegmentation.com/covid19/) with 100 annotated slices was used in [MedSeg100fan2020InfNet], [MedSeg100AttentionUNet], [MedSeg100MiniSeg], [MedSeg100MultiTask] to develop COVID-19 segmentation methods, it was split in different ways and the developed methods were evaluated with different metrics.

  • There are no publicly available baseline models trained for COVID-19 segmentation. U-Net, a well-known segmentation architecture, is commonly used as a baseline network in many studies. However, due to different implementations, the reported performance varies across studies [MedSeg100AttentionUNet, MedSeg1002020ResAtten], even when the experiments are conducted on the same dataset.

In this paper, we focus on annotation-efficient deep learning solutions and aim to alleviate the above problems by providing a well-labelled COVID-19 CT dataset and a benchmark. In particular, we first annotated the left lung, right lung, and infections in 20 COVID-19 3D CT scans, and then established three tasks to benchmark different deep learning strategies with limited training cases. Finally, we built baselines for each task based on U-Net.

Our tasks target three popular research fields in the medical imaging community:

  • Few-shot learning: building high-performance models from very few samples. Although most existing approaches focus on natural images, few-shot learning has received growing attention in medical imaging because generating labels for medical images is much more difficult [Zhao_2019_CVPR, roy2020MIAsqueeze].

  • Domain generalization: learning from known domains and applying the model to an unknown target domain with a different data distribution [li2019episodic]. Adversarial training is usually adopted to achieve this in recent studies [Tsai_2018_CVPR, Lee_2018_ECCV].

  • Knowledge transfer: reusing existing annotations to boost the training/fine-tuning on a new dataset. Transferring knowledge from large public datasets is a well-established practice in medical imaging analysis [huynh2016digital, shin2016deep]. Using generative adversarial networks to transfer knowledge from publicly available annotated datasets to new datasets for segmentation has also received significant attention [PKUMICCAI19CardiacVessel, miccai_simu2019cell].

Our contributions can be summarized as follows:

  • 20 well-labelled COVID-19 CT volumes are publicly released to the community. To the best of our knowledge, this is the largest public COVID-19 3D CT segmentation dataset.

  • 3 benchmark tasks are set up to promote studies on annotation-efficient deep learning segmentation for COVID-19 CT scans. Specifically, we focus on few-shot learning, domain generalization, and knowledge transfer.

  • 40+ trained models are publicly available that can serve as baselines and out-of-the-box tools for COVID-19 CT lung and infection segmentation.

II Datasets

Annotations of COVID-19 CT scans are scarce, but several lung CT annotations with other diseases are publicly available. Thus, one of the main goals for our benchmark is to explore these existing annotations to assist COVID-19 CT segmentation. This section introduces the public datasets used in our segmentation benchmarks. Figure 1 presents some examples from each dataset.

Fig. 1: Examples from four lung CT datasets. The 1st and 2nd rows show original CT images and the corresponding ground truth of lungs and lesions, respectively. The 3rd row shows 3D renderings of the ground truth. The MSD Lung Tumor dataset (3rd column) does not provide lung masks.

II-A Existing lung CT segmentation datasets

II-A1 StructSeg lung organ segmentation

50 CT scans of lung cancer patients are accessible, all from one medical center. This dataset served as a segmentation challenge during MICCAI 2019 (StructSeg: https://structseg2019.grand-challenge.org). Six organs are annotated: left lung, right lung, spinal cord, esophagus, heart, and trachea. In this paper, we only use the left and right lung annotations.

II-A2 NSCLC left and right lung segmentation

This dataset consists of left and right thoracic volume segmentations delineated on 402 CT scans from The Cancer Imaging Archive NSCLC Radiomics [NSCLC1, NSCLC2, TCIA].

II-B Existing lung lesion CT segmentation datasets

II-B1 MSD Lung tumor segmentation

This dataset comprises patients with non-small cell lung cancer from Stanford University (Palo Alto, CA, USA), made publicly available through TCIA. The dataset served as a segmentation challenge during MICCAI 2018 (MSD: http://medicaldecathlon.com/). The tumors were annotated by an expert thoracic radiologist, and 63 labelled CT scans are available.

II-B2 StructSeg Gross Target Volume segmentation of lung cancer

The same 50 lung cancer patient CT scans as in the StructSeg lung organ segmentation dataset above are provided, and the gross target volume of the tumor is annotated in each case.

II-B3 NSCLC Pleural Effusion (PE) segmentation

The CT scans in this dataset are the same as those in NSCLC left and right lung segmentation dataset, while pleural effusion is delineated for 78 cases [NSCLC1, NSCLC2, TCIA].

II-C Our COVID-19-CT-Seg dataset

We collected 20 public COVID-19 CT scans from the Coronacases Initiative and Radiopaedia, which can be freely downloaded (https://github.com/ieee8023/covid-chestxray-dataset) under a CC BY-NC-SA license. The left lung, right lung, and infection were first delineated by junior annotators, then refined by two radiologists with five years of experience, and finally all annotations were verified and refined by a senior radiologist with more than ten years of experience in chest radiology. On average, it takes about minutes to delineate one CT scan with 250 slices. We have made all the annotations publicly available [COVID-19-CT-Seg] at https://zenodo.org/record/3757476 under a CC BY-NC-SA license.

III Benchmark settings for COVID-19 CT lung and infection segmentation

As mentioned in Section I, there is a need for innovative strategies that enable annotation-efficient methods for COVID-19 CT segmentation. Thus, we set up three tasks to evaluate potential annotation-efficient strategies. In particular, we focus on learning to segment the left lung, right lung, and infection in COVID-19 CT scans using

  • pure but limited COVID-19 CT scans;

  • existing annotated lung CT scans from other non-COVID-19 lung diseases;

  • heterogeneous datasets including both COVID-19 and non-COVID-19 CT scans.

III-A Task 1: Learning with limited annotations

Task 1 is designed to address the problem of few-shot learning, where only a few annotations are available for training. This task is based on the COVID-19-CT-Seg dataset only. It contains three subtasks aiming to segment lung, infection, and both, respectively. For each subtask, 5-fold cross validation results (based on a pre-defined dataset split file) are reported. In each fold, 4 cases are used for training and the remaining 16 cases are used for testing (Table I).

Seg. Task                             Training and Testing
Lung, Infection, Lung and infection   5-fold cross validation; in each fold,
                                      4 cases (20%) for training and
                                      16 cases (80%) for testing
TABLE I: Experimental settings of Task 1 (Learning with limited annotations) for lung and infection segmentation in COVID-19 CT scans. All experiments are based on the COVID-19-CT-Seg dataset.
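The Task 1 protocol can be sketched with a small split generator. This is only an illustration: the benchmark ships a pre-defined split file, and the function name, random seed, and case-ID patterns below are hypothetical.

```python
import random

def make_task1_folds(case_ids, n_folds=5, train_per_fold=4, seed=2020):
    """Split case IDs into folds where each fold trains on a small
    subset (4 cases) and tests on the remaining 16 cases, mirroring
    the annotation-efficient setting of Task 1."""
    ids = sorted(case_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    folds = []
    for k in range(n_folds):
        train = ids[k * train_per_fold:(k + 1) * train_per_fold]
        test = [c for c in ids if c not in train]
        folds.append({"train": train, "test": test})
    return folds

# Hypothetical case IDs for the 20 scans (10 Coronacases + 10 Radiopaedia).
cases = [f"coronacases_{i:03d}" for i in range(1, 11)] + \
        [f"radiopaedia_{i:03d}" for i in range(1, 11)]
folds = make_task1_folds(cases)
```

Note the inversion of the usual cross-validation convention: the small partition (20%) is the training set, so each model sees only four annotated scans.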

III-B Task 2: Learning to segment COVID-19 CT scans from non-COVID-19 CT scans

Task 2 is designed to address the problem of domain generalization, where only out-of-domain data are available for training. Specifically, in the first subtask, the StructSeg Lung dataset and the NSCLC Lung dataset are used to train lung segmentation models. In the second subtask, the MSD Lung Tumor, StructSeg Gross Target, and NSCLC Pleural Effusion datasets are used. For both subtasks, 80% of the data are randomly selected for training and the remaining 20% are held out for validation. All labelled COVID-19-CT-Seg CT scans are kept for testing (Table II).

Seg. Task   Training                       Validation                     (Unseen) Testing
Lung        StructSeg Lung (40)            StructSeg Lung (10)            COVID-19-CT-Seg Lung (20)
            NSCLC Lung (322)               NSCLC Lung (80)
Infection   MSD Lung Tumor (51)            MSD Lung Tumor (12)            COVID-19-CT-Seg Infection (20)
            StructSeg Gross Target (40)    StructSeg Gross Target (10)
            NSCLC Pleural Effusion (62)    NSCLC Pleural Effusion (16)
TABLE II: Experimental settings of Task 2 (Learning to segment COVID-19 CT scans from non-COVID-19 CT scans) for lung and infection segmentation in COVID-19 CT scans. Each training/validation row corresponds to a separately trained model.
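The random 80%/20% holdout described above can be reproduced with a simple seeded shuffle. This is an illustrative sketch (the function name and seed are ours, not the benchmark's official split code):

```python
import random

def holdout_split(case_ids, train_frac=0.8, seed=2020):
    """Randomly hold out (1 - train_frac) of the non-COVID-19 cases for
    validation; the COVID-19-CT-Seg cases all stay in the test set and
    never pass through this function."""
    ids = sorted(case_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = round(len(ids) * train_frac)
    return ids[:n_train], ids[n_train:]

# E.g. the 50 StructSeg Lung cases -> 40 for training, 10 for validation.
train, val = holdout_split([f"structseg_{i:02d}" for i in range(50)])
```

Applied to the 402 NSCLC Lung cases, the same ratio yields the 322/80 split listed in Table II.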

III-C Task 3: Learning with both COVID-19 and non-COVID-19 CT scans

Task 3 is designed to address the problem of knowledge transfer with heterogeneous annotations, where out-of-domain data are used to boost the annotation-efficient training on limited in-domain data. Specifically, in both subtasks (lung segmentation and lung infection segmentation), 80% and 20% of out-of-domain data are used for training and validation, while 20% and 80% of in-domain data are used for training and testing, respectively (Table III).

Seg. Task   Training                            Validation                     Testing
Lung        StructSeg Lung (40)                 StructSeg Lung (10)            COVID-19-CT-Seg Lung (16)
              + COVID-19-CT-Seg Lung (4)
            NSCLC Lung (322)                    NSCLC Lung (80)
              + COVID-19-CT-Seg Lung (4)
Infection   MSD Lung Tumor (51)                 MSD Lung Tumor (12)            COVID-19-CT-Seg Infection (16)
              + COVID-19-CT-Seg Infection (4)
            StructSeg Gross Target (40)         StructSeg Gross Target (10)
              + COVID-19-CT-Seg Infection (4)
            NSCLC Pleural Effusion (62)         NSCLC Pleural Effusion (16)
              + COVID-19-CT-Seg Infection (4)
TABLE III: Experimental settings of Task 3 (Learning with both COVID-19 and non-COVID-19 CT scans) for lung and infection segmentation in COVID-19 CT scans.

III-D Evaluation metrics

Motivated by the evaluation methods of the well-known Medical Segmentation Decathlon (http://medicaldecathlon.com/files/MSD-Ranking-scheme.pdf), we employ two complementary metrics to evaluate segmentation performance. The Dice similarity coefficient (DSC), a region-based measure, is used to evaluate region overlap. The normalized surface Dice (NSD) [nikolov2018SDice], a boundary-based measure, is used to evaluate how close the segmentation and ground truth surfaces are to each other at a specified tolerance τ. For both metrics, higher scores indicate better segmentation performance, and a value of 100% means perfect segmentation. Let G and S denote the ground truth and the segmentation result, respectively. The two measures are defined as follows:

III-D1 Region-based measure

DSC(G, S) = 2 |G ∩ S| / (|G| + |S|)

III-D2 Boundary-based measure

NSD(G, S) = ( |∂G ∩ B_∂S^(τ)| + |∂S ∩ B_∂G^(τ)| ) / ( |∂G| + |∂S| )

where B_∂G^(τ) and B_∂S^(τ) denote the border regions of the ground truth and segmentation surfaces at tolerance τ, defined as B_∂G^(τ) = {x ∈ R³ | ∃ y ∈ ∂G, ‖x − y‖ ≤ τ} and B_∂S^(τ) = {x ∈ R³ | ∃ y ∈ ∂S, ‖x − y‖ ≤ τ}, respectively. We set the tolerance τ separately for lung segmentation and infection segmentation. Each tolerance is computed by measuring the inter-rater segmentation variation between two different radiologists, which is also in accordance with another independent study [nikolov2018SDice].

The main benefit of introducing the surface Dice is that it ignores small boundary deviations, because small inter-observer errors are unavoidable and often not clinically relevant when radiologists segment these objects.
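Both metrics can be prototyped directly from their set definitions. The sketch below is a minimal brute-force version operating on voxel and surface coordinate sets; the function names are ours, not from the benchmark's released evaluation code, which would typically use distance transforms rather than pairwise distances.

```python
from math import dist

def dice(g, s):
    """Dice similarity coefficient between two voxel coordinate sets."""
    g, s = set(g), set(s)
    if not g and not s:
        return 1.0  # both empty: treat as perfect agreement
    return 2 * len(g & s) / (len(g) + len(s))

def surface_dice(surf_g, surf_s, tol):
    """Normalized surface Dice at tolerance tol: the fraction of surface
    points of each mask lying within tol of the other mask's surface."""
    def near(p, surf):
        return any(dist(p, q) <= tol for q in surf)
    hits = sum(near(p, surf_s) for p in surf_g) \
         + sum(near(p, surf_g) for p in surf_s)
    return hits / (len(surf_g) + len(surf_s))
```

Any boundary deviation within the tolerance τ counts as a hit, which is exactly how the NSD forgives clinically irrelevant inter-observer differences while DSC does not.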

III-E U-Net baselines: oldies but goldies

U-Net [ronneberger20152DUNet, ronneberger20163DUNet] was proposed five years ago, and many variants have been introduced to improve it. However, a recent study [nnUNet2020] demonstrates that a basic U-Net is still difficult to surpass if the corresponding pipeline is designed adequately. In particular, nnU-Net (no-new-U-Net) [nnUNet2020] automatically adapts preprocessing strategies and network architectures (i.e., the number of pooling operations, convolutional kernel sizes, and stride sizes) to a given 3D medical dataset. Without manual tuning, nnU-Net achieved better performance than most specialised deep learning pipelines across 19 public international segmentation competitions and set a new state of the art in the majority of the 49 tasks. The source code is publicly available at https://github.com/MIC-DKFZ/nnUNet.

U-Net is often used as a baseline model in existing COVID-19 CT segmentation studies. However, reported results vary widely even on the same dataset, which makes it hard to compare different studies. To standardize the U-Net performance, we build our baselines on nnU-Net, which is the most powerful U-Net implementation to the best of our knowledge. To make results comparable across tasks, we manually adjust the patch sizes and network architectures in Tasks 2 and 3 to match Task 1. Figure 2 shows details of the U-Net architecture.
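As a rough intuition for the architecture adaptation mentioned above, the number of downsampling steps per axis is bounded by how often the patch edge can be halved before feature maps become too small. The following is a deliberately simplified illustration of that idea, not nnU-Net's actual heuristic; `min_edge` and `max_pools` are assumed values.

```python
def n_pool_ops(patch_size, min_edge=4, max_pools=5):
    """Count, per axis, how many times the patch edge can be halved
    before dropping below min_edge voxels (capped at max_pools) -- a
    toy version of adapting pooling depth to the input patch size."""
    counts = []
    for edge in patch_size:
        n = 0
        while edge // 2 >= min_edge and n < max_pools:
            edge //= 2
            n += 1
        counts.append(n)
    return counts

# Anisotropic CT patch: fewer pooling steps along the short z axis.
print(n_pool_ops((56, 160, 160)))  # -> [3, 5, 5]
```

This is why anisotropic CT patches lead to asymmetric pooling configurations: the axis with fewer slices runs out of resolution first.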

Fig. 2: Details of the 3D U-Net architecture that is used in this work.

IV Results and Discussion

IV-A Results of Task 1: Learning with limited annotations

Table IV presents the average DSC and NSD results for lung and infection segmentation in each subtask of Task 1. It can be found that

  • the average DSC and NSD values vary greatly across folds, which demonstrates that reporting 5-fold cross validation results is necessary for an objective evaluation, because single-fold results may be biased.

  • promising results for left and right lung segmentation in COVID-19 CT scans can be achieved with as few as four training cases. Models trained to segment only the lungs obtained significantly better results than those trained to segment lung and infection simultaneously.

  • there is still large room for improving infection segmentation with limited annotations. Statistical tests are needed to determine whether training the model to segment lung and infection at the same time yields an improvement.

Subtask  Lung                      Infection        Lung and Infection
         Left Lung    Right Lung  Infection        Left Lung    Right Lung   Infection
         DSC    NSD   DSC    NSD  DSC    NSD       DSC    NSD   DSC    NSD   DSC    NSD
Fold-0 84.9±8.2 68.7±13.3 85.2±13.0 70.6±15.8 68.1±20.5 70.9±21.3 53.8±28.4 39.1±18.3 65.5±19.4 47.4±14.3 65.4±23.9 68.2±23.2
Fold-1 80.3±14.5 61.8±15.1 83.9±9.6 68.3±9.0 71.3±20.5 71.8±23.0 40.3±18.7 27.5±12.0 60.1±11.1 41.7±9.9 64.7±21.8 60.6±25.1
Fold-2 87.1±12.1 74.3±16.0 90.3±8.2 78.5±12.0 66.2±21.7 71.7±24.2 80.3±18.8 66.8±18.8 85.2±12.4 68.6±15.1 60.7±27.6 62.5±28.9
Fold-3 88.4±7.0 75.2±8.8 89.9±6.3 78.5±8.0 68.1±23.1 70.8±27.1 79.7±13.6 65.4±14.4 84.0±9.8 67.7±13.0 62.0±27.9 65.3±28.9
Fold-4 88.3±7.6 75.8±11.0 90.2±7.0 78.3±10.2 62.7±26.9 64.9±28.2 72.4±21.1 58.6±20.8 80.9±13.4 63.4±15.9 51.4±30.2 51.9±31.0
Avg 85.8±10.5 71.2±13.8 87.9±9.3 74.8±11.9 67.3±22.3 70.0±24.4 65.4±25.6 51.6±22.9 75.3±16.8 57.9±17.5 60.8±26.3 61.6±27.5
TABLE IV: Quantitative results of 5-fold cross validation of Task 1: Learning with limited annotations. For each fold, average DSC and NSD values are reported. The last row shows the average results over all 80 testing predictions (5 folds × 16 testing cases per fold).
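Because every fold tests on the same number of cases (16), the means in the Avg row can be recovered as the unweighted mean of the per-fold means; the standard deviations, by contrast, are pooled over all 80 cases and cannot be recovered this way. A quick check on the left-lung DSC column:

```python
def overall_mean(fold_means):
    """Unweighted mean of fold means; equals the case-level mean when
    all test folds have the same size (here, 16 cases each)."""
    return sum(fold_means) / len(fold_means)

# Left-lung DSC means of Fold-0..Fold-4 from Table IV
left_lung_dsc = [84.9, 80.3, 87.1, 88.4, 88.3]
print(overall_mean(left_lung_dsc))
```

The result matches the 85.8 reported in the Avg row.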

IV-B Results of Task 2: Learning to segment COVID-19 CT scans from non-COVID-19 CT scans

This task is quite challenging because the model does not see any cases from the target domain during training; in other words, the trained models are expected to generalize to the unseen domain (COVID-19 CT). Table V shows left and right lung segmentation results in terms of mean and standard deviation of DSC and NSD. It can be found that

  • 3D U-Net achieves excellent performance in terms of DSC on the validation set that is from the same source domain. Average NSD values are lower than DSC values, implying that most of the errors come from the boundary.

  • the performance on the testing set drops significantly for both subtasks. The model trained on the NSCLC Lung dataset performs worse than the model trained on StructSeg Lung. We suppose this is because the distribution of lung appearance in StructSeg is closer to that in COVID-19-CT-Seg.

Subtask          Validation Set                           (Unseen) Testing Set
                 Left Lung          Right Lung            Left Lung           Right Lung
                 DSC      NSD       DSC       NSD         DSC       NSD       DSC      NSD
StructSeg Lung   96.4±1.4 74.6±9.1  97.3±0.3  74.3±7.2    92.2±19.7 82.0±15.7 95.5±7.2 84.2±11.6
NSCLC Lung       95.3±4.9 80.2±8.3  95.4±10.9 80.7±10.7   57.5±21.5 46.9±16.9 72.2±15.3 51.7±16.8
TABLE V: Quantitative results (mean ± standard deviation) of lung segmentation in Task 2.

Table VI shows quantitative infection segmentation results in terms of average and standard deviation values of DSC and NSD. It can be found that

  • on the validation set, the performance of lesion segmentation is not as good as that of lung segmentation (Table V), which indicates that tumor segmentation remains a challenging problem. This observation is in line with recent results in MICCAI tumor segmentation challenges, e.g., liver tumor segmentation [bilic2019lits] and kidney tumor segmentation [heller2019kits].

  • the models almost completely fail to predict COVID-19 infections on the testing set, which highlights that lesion appearance differs significantly among lung cancer, pleural effusion, and COVID-19 infection in CT scans.

Subtask          Validation Set          (Unseen) Testing Set
                 DSC        NSD          DSC        NSD
MSD Lung Tumor   67.2±27.1  77.1±31.4    25.2±27.4  26.0±28.5
StructSeg Tumor  71.3±29.6  70.3±29.5    6.0±12.7   5.5±10.7
NSCLC-PE         64.4±45.5  73.7±12.9    0.4±0.8    3.7±4.8
TABLE VI: Quantitative results (mean ± standard deviation) of infection segmentation in Task 2.

IV-C Limitation

One possible limitation is that the number of cases in our dataset is relatively small. On the one hand, we focus on learning from a few training cases (Task 1) and on efficiently using existing annotations, so the number of training cases is acceptable for our benchmark tasks. On the other hand, we have 16 or 20 testing cases in each task, which is comparable with recent MICCAI 2020 segmentation challenges. For example, MyoPS 2020 (Myocardial Pathology Segmentation combining multi-sequence CMR) has 15 testing cases [MyoPS], StructSeg (Automatic Structure Segmentation for Radiotherapy Planning Challenge 2020) has 10 testing cases [StructSeg2020], and ASOCA (Automated Segmentation Of Coronary Arteries) has 20 testing cases [ASOCA2020].

Another possible limitation is that only quantitative results of 3D U-Net are reported. We will also provide 2D U-Net baselines for all tasks in the near future.

V Conclusion

In this paper, we created a COVID-19 CT dataset, established three segmentation benchmark tasks towards annotation-efficient methods, and provided 40+ baseline models based on state-of-the-art segmentation architectures. All related results are publicly available at https://gitee.com/junma11/COVID-19-CT-Seg-Benchmark. We hope this work can help the community accelerate the design and development of AI-based COVID-19 CT analysis tools with limited data.

Acknowledgment

This project is supported by the National Natural Science Foundation of China (No. 91630311, No. 11971229). We would like to thank all the organizers of the MICCAI 2018 Medical Segmentation Decathlon, the MICCAI 2019 Automatic Structure Segmentation for Radiotherapy Planning Challenge, the Coronacases Initiative, and Radiopaedia for the publicly available lung CT datasets. We thank Joseph Paul Cohen for providing convenient download links for the 20 COVID-19 CT scans. We also thank all the contributors of the NSCLC and COVID-19-CT-Seg datasets for providing the annotations of lung, pleural effusion, and COVID-19 infection, and the organizers of the TMI Special Issue on Annotation-Efficient Deep Learning for Medical Imaging, whose call for papers gave us many insights when designing these segmentation tasks. We further thank the contributors of two great COVID-19-related resources, the COVID19 imaging AI paper list (https://github.com/HzFu/COVID19_imaging_AI_paper_list) and MedSeg (http://medicalsegmentation.com/covid19/), for their timely updates on COVID-19 publications and datasets. Last but not least, we thank Chen Chen, Xin Yang, and Yao Zhang for their important and valuable feedback on this benchmark.

References