Interactive Segmentation for COVID-19 Infection Quantification on Longitudinal CT scans

by   Michelle Xiao-Lin Foo, et al.
Kyung Hee University

Consistent segmentation of COVID-19 patients' CT scans across multiple time points is essential to assess disease progression and response to therapy accurately. Existing automatic and interactive segmentation models for medical images only use data from a single time point (static). As a result, valuable segmentation information from previous time points is often not used to aid the segmentation of a patient's follow-up scans. Also, fully automatic segmentation techniques frequently produce results that would need further editing for clinical use. In this work, we propose a new single-network model for interactive segmentation that fully utilizes all available past information to refine the segmentation of follow-up scans. In the first segmentation round, our model takes 3D volumes of medical images from two time points (target and reference) as concatenated slices, with the reference time point segmentation as an additional guide to segment the target scan. In subsequent segmentation refinement rounds, user feedback in the form of scribbles that correct the segmentation and the target's previous segmentation results are additionally fed into the model. This ensures that the segmentation information from previous refinement rounds is retained. Experimental results on our in-house multiclass longitudinal COVID-19 dataset show that the proposed model outperforms its static version and can assist in localizing COVID-19 infections in patients' follow-up scans.




1 Introduction

Over the last few years, there has been immense progress in the medical image analysis field due to the introduction of deep learning networks. In particular, deep learning-based automatic segmentation methods are now widely adopted in data annotation software.

In December 2019, the first cases of a new coronavirus disease, COVID-19, a severe acute respiratory illness, emerged in Wuhan, China [29]. This highly infectious respiratory virus rapidly spread worldwide and threw the world into a global pandemic. According to the numbers on the COVID-19 monitoring site by Johns Hopkins University, as of August 30th, 2021, more than 216 million people have been infected and 4.498 million people have died from complications caused by the virus.

During the initial outbreak of COVID-19, there was an urgent need for fast annotation of medical scans in order to further understand the disease. Numerous researchers utilized deep learning-based methods for this task [23, 20, 7]. Computed tomography (CT) scans provide crucial diagnostic information in the assessment and treatment of COVID-19 patients [1, 13]. However, medical reports indicate that the imaging features of COVID-19 are mixed and diverse among patients, and that the change in radiological patterns during the course of the disease is inconsistent [28, 21, 15]. The subtle anatomical boundaries and the variations in size, density, location, and texture of the disease pose a challenge for automatic segmentation techniques.

Human interactions coupled with deep learning models can offer a way to overcome the challenges faced by automatic segmentation models and to improve segmentation results as shown by [18, 22, 27]. However, previous works on interactive segmentation only used single time point data for segmentation. The readily available segmentation information from previous time points has not been exploited to segment a patient’s follow-up scans.

To address the limitations of automatic static segmentation models, in this paper, we propose an interactive segmentation method that segments COVID-19 infection on longitudinal CT scans. The proposed method is designed to leverage available longitudinal information and user feedback to improve segmentation quality. The main contributions of this paper can be summarized as follows:

  • We propose a new segmentation approach that utilizes information from the previous time point, past segmentation refinement rounds, and user feedback. To the best of our knowledge, it is the first work to design an interactive segmentation method for longitudinal CT scans. Our method is also the first interactive segmentation of COVID-19 infection using longitudinal COVID-19 CT scans.

  • Our method can be used to extend existing static models for longitudinal interactive segmentation with minimal effort.

  • We conduct an extensive study using an in-house longitudinal COVID-19 dataset to showcase the improved performance of the longitudinal interactive segmentation model over the static interactive segmentation model.

2 Related Works

2.1 COVID-19 Infection CT Segmentation

After the outbreak of COVID-19, several studies were presented in order to understand the disease and support radiologists. For the task of learning from noisy labels to segment COVID-19 pneumonia lesions from lung CT scans, Wang et al. [23] proposed a noise-robust training method. Shan et al. [20] presented a modified 3D convolutional neural network called VB-Net, which combines V-Net, a fully convolutional neural network for volumetric medical image segmentation, with a bottle-neck deep residual learning framework for quantitative COVID-19 infection assessment. Fan et al. proposed a semi-supervised learning approach for segmenting diverse radiological patterns such as ground-glass opacity and consolidation from lung CT scans. However, due to the subtle anatomical boundaries, pleural-based location, and high variations in infection characteristics, it is still challenging to automatically identify and quantify CT image findings related to COVID-19 [28, 21].

2.2 Interactive Segmentation

In interactive segmentation, feedback from users is used to enhance the predictions of machine learning models. This human-in-the-loop method shows potential in improving segmentation results [18]. However, interactive segmentation needs to be accurate and efficient in order to be helpful in a clinical setting. Xu et al. [25] introduced a deep interactive object selection method where user-provided clicks are transformed into Euclidean distance maps. However, Euclidean distance does not exploit image context information. In contrast, the geodesic distance transform by [5] further encodes spatial regularization and contrast-sensitivity information but is sensitive to noise. The deep learning-based interactive segmentation framework by Wang et al. [22] incorporated user-provided bounding boxes and scribbles (lines drawn on wrongly segmented areas). With the inclusion of this user feedback, they demonstrated an increase in segmentation performance. Zhou et al. [27] showed that with a small number of user interactions, segmentation accuracy can be substantially improved. Kitrungrotsakul et al. [12] proposed a segmentation refinement module that can be appended to automatic segmentation networks and utilized a skip connection attention module to improve important features for both segmentation and refinement tasks. However, the methods introduced above are all designed for single time point data.

2.3 Longitudinal Image Segmentation

To effectively study how a disease progresses, consistent segmentation of the affected regions based on scans from multiple time points can provide important information. Birenbaum et al. [4] suggest the use of multiple longitudinal networks to process longitudinal patches from different views, where the model concatenates the outputs of the encoders to produce Multiple Sclerosis (MS) lesion segmentation. Their method shows that the inclusion of information from multiple time points is beneficial to the model. To improve the segmentation of MS lesions by taking advantage of the longitudinal information, Denner et al. [6] propose a longitudinal network with an early fusion of scans from two time points to encode the structural differences implicitly. However, due to the minute structural differences in MS lesions across different time points, it is still a challenge for the network to achieve high accuracy. Kim et al. [11] present a framework that leverages spatio-temporal cues between longitudinal scans to improve quantitative assessment of the progression of COVID-19 infection in chest CT scans. However, the implementations above do not utilize the available segmentation mask of the previous time point scan to segment a patient’s follow-up scans.

3 Proposed Method

3.1 Interactive Segmentation Network

The proposed method extends the baseline longitudinal network by Denner et al. [6] to fully exploit longitudinal information and user feedback for interactive segmentation. The baseline longitudinal model is referred to as a 2.5D model because it retains the global context of the CT volume by combining the per-slice (2D) predictions from three anatomical planes (coronal, sagittal, and axial views) to produce the segmentation for each voxel. Each slice is processed by FC-DenseNet56 [10], which is a fully convolutional dense network for 2D segmentation. Previous work [8, 26, 3, 2] has shown that state-of-the-art results on various medical image segmentation problems are achievable through 2.5D approaches. This is because fully 3D approaches incur a high computational cost, while patch-based 3D approaches lose the global structural information in the slice. Therefore, our baseline model is built as a 2.5D model, which uses two time points consisting of stacked 2D slices from three anatomical views. This method can preserve global information along two axes and local information from the third axis [26].
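As an illustration of the 2.5D idea described above, the following minimal sketch fuses per-view class probabilities into one label per voxel. The paper does not state the exact fusion rule, so averaging the three views is an assumption made here, and the function name `fuse_three_views` is hypothetical.

```python
import numpy as np

def fuse_three_views(prob_axial, prob_coronal, prob_sagittal):
    """Fuse per-view class probabilities into one label per voxel.

    Each argument has shape (classes, D, H, W) and is assumed to be
    resampled back into the same volume orientation. Averaging the
    three views is an assumption, not the confirmed fusion rule.
    """
    stacked = np.stack([prob_axial, prob_coronal, prob_sagittal], axis=0)
    mean_prob = stacked.mean(axis=0)   # (classes, D, H, W)
    return mean_prob.argmax(axis=0)    # (D, H, W), one class index per voxel
```

With this rule, a class that is weakly preferred in two views can outvote a class that is strongly preferred in only one view.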

The model operates on two CT volumes: the follow-up target volume and the reference volume from the previous time point. Both volumes share the same height, width, and number of slices, and they are processed as individual slices. In this study, we assume that the segmentation masks for the reference previous volume are available. The number of foreground classes is set to 2, since we are targeting the segmentation of two foreground classes (i.e., ground-glass opacity and consolidation). In addition, editing masks are drawn on the target segmentation during user feedback. Note that the segmentation masks for the reference previous volume and the editing masks are concatenated to the input.

3.1.1 Training

To train the model to adapt to the different combinations of input information it encounters during inference, each training iteration uses one of two input configurations, chosen at random, as shown in Figure 1.

Figure 1: Training flow of the proposed interactive longitudinal segmentation model. Training alternates between two training inputs that represent different scenarios: Input 1 for the initial segmentation round and Input 2 for the interactive segmentation rounds. Note that the reference segmentation masks and the editing masks each have two channels in our case, as there are two foreground classes. Accordingly, if the editing mask is not available, as is the case in the scenario of Input 1, it is a pair of empty masks.

Input 1 represents the scenario where the user passes the data to the model at the beginning of interactive segmentation to produce the first segmentation for the target slice, whereas Input 2 represents the input data in the subsequent editing rounds. In both cases, scans from the two time points are concatenated with the additional data along the channel dimension, so that the model can use the structural changes that are evident between them to improve its segmentation performance. Empty masks are used in place of information that is not available in the first segmentation round, such as user feedback and target prediction.


For each segmentation round, the input tensor consists of the following components, as shown in Figure 1:

  • the reference previous CT slice,

  • the segmentation masks on the reference previous CT slice,

  • the target CT slice,

  • the highest class probability per pixel on the target slice from the previous segmentation round,

  • the predicted class on the target slice from the previous segmentation round (0 for background, 1 for ground-glass opacity, and 2 for consolidation), and

  • the editing masks.

For Input 1, empty masks are used for the components that are not yet available, namely the previous-round class probability, the previous-round predicted class, and the editing masks. The interactive segmentation refinement network (ISR) takes this input tensor and outputs the per-slice segmentation of the target image.
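The channel-wise concatenation of these inputs could be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_input`, the channel order (taken from the order in which the components are listed), and the array layout `(channels, H, W)` are all assumptions.

```python
import numpy as np

NUM_FOREGROUND_CLASSES = 2  # ground-glass opacity and consolidation

def build_input(ref_slice, ref_masks, target_slice,
                prev_prob=None, prev_class=None, edit_masks=None):
    """Stack all per-slice inputs along the channel axis.

    Components that are missing in the first round (Input 1) are
    replaced by empty (all-zero) masks. Channel order is an assumption.
    """
    h, w = target_slice.shape
    if prev_prob is None:
        prev_prob = np.zeros((h, w), dtype=np.float32)
    if prev_class is None:
        prev_class = np.zeros((h, w), dtype=np.float32)
    if edit_masks is None:
        edit_masks = np.zeros((NUM_FOREGROUND_CLASSES, h, w), dtype=np.float32)
    channels = [
        ref_slice[None],     # 1 channel: reference CT slice
        ref_masks,           # 2 channels: reference segmentation masks
        target_slice[None],  # 1 channel: target CT slice
        prev_prob[None],     # 1 channel: previous-round max class probability
        prev_class[None],    # 1 channel: previous-round predicted class
        edit_masks,          # 2 channels: editing masks, one per class
    ]
    return np.concatenate(channels, axis=0).astype(np.float32)
```

Under these assumptions the network always sees eight input channels, so the same weights can serve both the initial round and the refinement rounds.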


To produce the simulated edits for training with Input 2, the model is first set to evaluation mode (without model weight updates), and an initial segmentation of the target is generated. This is then used to produce the simulated edits. The edit simulation process is described in Section 3.2.

During training, the segmentation of the slices is treated as a 2D segmentation problem. During inference with real user feedback, the predictions on slices from the three anatomical orientations are combined to produce the segmentation output for every voxel. The pseudocode of the training process of the proposed interactive longitudinal segmentation model is presented in Algorithm 1.

for each epoch do
    for each training batch do
        Get 1st predictions for input batch
        Generate random number
        if the random number selects Input 2 then
            Generate simulated edits for the predictions
            Append simulated edits & outputs from 1st prediction round to inputs
            Get 2nd predictions
            Calculate loss using 2nd predictions
        else
            Calculate loss using 1st predictions
        end if
        Backpropagate & update model weights
    end for
end for
Algorithm 1: Training of the proposed interactive longitudinal segmentation model
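The alternating scheme of Algorithm 1 could be sketched in Python as below. This is a framework-agnostic illustration under stated assumptions: `model`, `loss_fn`, `simulate_edits`, and `build_round2_input` are hypothetical callables supplied by the caller, and the probability `p_interactive=0.5` of choosing Input 2 is an assumption, since the threshold is not given here.

```python
import random

def train_step(model, loss_fn, simulate_edits, build_round2_input,
               inputs, targets, p_interactive=0.5, rng=random):
    """Run one training iteration and return the loss to backpropagate.

    All callables are hypothetical placeholders. In a deep learning
    framework, the first pass of the interactive branch would typically
    run without gradient tracking, as described in the text.
    """
    first_pred = model(inputs)                    # Input 1: no edits yet
    if rng.random() < p_interactive:              # interactive round chosen
        edits = simulate_edits(first_pred, targets)
        second_inputs = build_round2_input(inputs, first_pred, edits)
        second_pred = model(second_inputs)        # Input 2: with edits
        return loss_fn(second_pred, targets)
    return loss_fn(first_pred, targets)           # initial round only
```

The caller would then backpropagate the returned loss and update the weights, which corresponds to the last step inside the inner loop of Algorithm 1.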

3.2 Edit Simulation during Training

During training, simulated user edits for wrongly segmented regions are automatically generated. Incorrectly segmented regions are areas that are under- or over-segmented. The segmentation output from the model is compared with the ground truth to choose the slice regions for simulating the user edits. Lines are automatically drawn on the selected regions as simulated feedback. As mentioned before, the edit information is concatenated to the CT scans as additional channels, one for each class, with foreground interactions having a value of 1 and background interactions a value of -1. Because the model input is 2.5D instead of 3D, edits are simulated in the axial, coronal, and sagittal slices, as opposed to only the axial slices as in [27]. Zhou et al. [27] simulate edits only for the largest incorrectly segmented 2D slice region. However, due to the scattered nature of the COVID-19 infections in our case, the five largest wrongly segmented regions in each slice are used for edit simulation. The total number of generated edits, independent of the different classes, is limited to prevent the model from overfitting and to avoid a considerable slow-down of the training process caused by the long edit simulation time needed when large numbers of incorrectly segmented regions are detected.
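A simplified version of this edit simulation could look like the sketch below. It is an illustration only: the 4-connectivity, the single horizontal stroke through the middle row of each region, and the names `connected_regions` and `simulate_edits` are assumptions, and the overall cap on the number of edits is represented only by `top_k`.

```python
import numpy as np
from collections import deque

def connected_regions(mask):
    """Return 4-connected regions of a 2D boolean mask as lists of (row, col)."""
    seen = np.zeros(mask.shape, dtype=bool)
    regions = []
    for r, c in zip(*np.nonzero(mask)):
        if seen[r, c]:
            continue
        queue, region = deque([(r, c)]), []
        seen[r, c] = True
        while queue:
            y, x = queue.popleft()
            region.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not seen[ny, nx]):
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        regions.append(region)
    return regions

def simulate_edits(pred, truth, cls, top_k=5):
    """Build one class's edit mask: +1 on missed areas, -1 on false areas."""
    edit = np.zeros(pred.shape, dtype=np.float32)
    under = (truth == cls) & (pred != cls)   # under-segmented pixels
    over = (pred == cls) & (truth != cls)    # over-segmented pixels
    candidates = [(reg, 1.0) for reg in connected_regions(under)]
    candidates += [(reg, -1.0) for reg in connected_regions(over)]
    candidates.sort(key=lambda item: len(item[0]), reverse=True)
    for region, value in candidates[:top_k]:  # keep only the largest regions
        rows = sorted({y for y, _ in region})
        mid = rows[len(rows) // 2]
        for y, x in region:
            if y == mid:                      # stroke along the middle row
                edit[y, x] = value
    return edit
```

Calling `simulate_edits` once per foreground class yields the per-class edit channels that are concatenated to the CT slices.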

Figure 2: Overview of the interactive segmentation flow with the GUI. Input to the segmentation model prior to the first user interaction resembles Input 1 that is used during training. Accordingly, input of subsequent editing rounds resembles Input 2 as in Figure 1 which enables assisted segmentation refinement via the GUI.

Figure 3: Our GUI for COVID-19 lung infection interactive segmentation. The spin box in the side bar shows the class for the current brush input, here 1, i.e., the brush for ground-glass opacity (GGO).

3.3 Inference with GUI for User Feedback

Figure 2 shows the interactive segmentation refinement stage. An editing graphical user interface (GUI), as shown in Figure 3, is implemented using Qt Designer. The GUI automatically loads the trained segmentation refinement model during start-up. Then, the user can load the CT volume to be segmented. After the initial segmentation round, the predicted segmentation is overlaid on the scans for inspection. The user can then use the brush to edit wrongly segmented areas and run the data through the model again. This process can be repeated as many times as necessary.

For each segmentation refinement round, the user feedback on incorrectly segmented regions from the current and previous rounds is summed up in each slice’s edit mask. The current round's user feedback has higher priority, so it is multiplied by two before being added to the previous user edits. The values of the mask are then clipped to [-1, 1], the range of the edit values defined in Section 3.2. This is done so that the previous edit information is not lost. The mask is then concatenated to its corresponding slice image and fed into the segmentation refinement model.
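The accumulation rule described above can be illustrated with a short sketch. It assumes that edit masks hold 1 for foreground strokes, -1 for background strokes, and 0 elsewhere, and that the clipping range is [-1, 1] to match those values; the function name `accumulate_edits` is hypothetical.

```python
import numpy as np

def accumulate_edits(previous_edits, current_edits):
    """Merge edit masks so that the current round overrides older strokes.

    Doubling the current edits guarantees that their sign wins wherever
    they conflict with an older edit; clipping restores the value range.
    The clipping range [-1, 1] is an assumption.
    """
    combined = previous_edits + 2.0 * current_edits
    return np.clip(combined, -1.0, 1.0)
```

For example, a pixel previously marked as background (-1) and now marked as foreground (1) sums to 1 and stays foreground, while pixels that are untouched in the current round keep their earlier value.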

3.4 Implementation

The baseline longitudinal segmentation model from [6] is modified and used in this study. It is an end-to-end 2.5D segmentation network based on FC-DenseNet56 [10] and implemented in PyTorch 1.4 [17]. Mean Squared Error (MSE) loss, the Adam optimizer with AMSGrad [19], and a learning rate of 0.0001 are used for training. Inference on a COVID-19 patient’s 2.5D data with a size of 3 × (150 × 150 × 150) takes 15 seconds on an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory.

4 Experiments

4.1 COVID-19 Segmentation Dataset and Preprocessing

An in-house clinical dataset collected from the Radiology Department of the Technical University of Munich during the first wave of COVID-19 (March-June 2020) is used for training and evaluation. It consists of 30 longitudinal low-dose native CT scans from patients aged between 46 and 82 years with a positive polymerase chain reaction (PCR) test for COVID-19. The time gap between the follow-up scans and the previous scans is 17 ± 10 days; the scans were taken during admission and hospitalization (33 ± 21 days, range 0–71 days). The scans were performed using two different CT imaging devices (IQon Spectral CT and iCT 256, Philips Healthcare, Best, the Netherlands) with the same parameters (X-ray current 140-210 mA, voltage 120 kV peak, slice thickness 0.9 mm) and covered the entire lung. The data was collected with the approval of the institutional review board of TUM (ethics approval 111/20 S-KH).

An expert rater (radiologist with four years of experience) annotated the dataset at voxel level with the ImFusion Labels software (ImFusion GmbH, Munich, Germany). Lung masks (lung parenchyma vs. other tissues) and pathology masks for four classes are generated: healthy lung (HL), ground-glass opacity (GGO), consolidation (CONS), and pleural effusion (PLEFF). Due to the large variations in intensity range, size, and alignment, the raw CT volumes have to be preprocessed before they are used for training. The volumes are cropped to the lung regions using manually annotated lung masks. Intensity values outside the range (-1024, 600) are clipped, and min-max normalization is performed on the volumes before they are resized to 150 × 150 × 150 voxels. Slices whose voxel values vary by less than 0.001% between their minimum and maximum are considered empty and removed.
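The intensity preprocessing steps could be sketched as follows. This is a minimal illustration under assumptions: the function name `preprocess` is hypothetical, the resizing to 150 × 150 × 150 is omitted because it would need an interpolation library, and the 0.001% emptiness threshold is interpreted as a spread of 1e-5 on the normalized [0, 1] scale.

```python
import numpy as np

def preprocess(volume, lung_mask, hu_min=-1024.0, hu_max=600.0, empty_tol=1e-5):
    """Crop to the lung bounding box, clip, normalize, and drop empty slices.

    `volume` and `lung_mask` are 3D arrays of equal shape. Resizing is
    intentionally left out of this sketch.
    """
    coords = np.argwhere(lung_mask)                 # voxels inside the lungs
    lo, hi = coords.min(axis=0), coords.max(axis=0) + 1
    cropped = volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]].astype(np.float32)
    clipped = np.clip(cropped, hu_min, hu_max)      # limit the intensity range
    v_min, v_max = clipped.min(), clipped.max()
    normalized = (clipped - v_min) / max(v_max - v_min, 1e-8)  # min-max to [0, 1]
    # A slice is treated as empty when its values barely vary.
    spread = normalized.max(axis=(1, 2)) - normalized.min(axis=(1, 2))
    return normalized[spread >= empty_tol]
```

Here slices are taken along the first axis; for the 2.5D setting the same emptiness check would be applied along each of the three anatomical axes.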

Similar to [11], we use the deformable registration algorithm by [14] to register the reference scan to the follow-up scan and to resolve the misalignment between scans. In this algorithm, the image is deformed through a B-Spline transform that uses a sparse set of grid points overlaid onto the fixed domain of the image. Registration is performed on the lung masks to avoid registration errors that may arise due to the pathological changes in the lung parenchyma. An example of aligned CT scans from different time points is shown in Figure 4.

Figure 4: Deformable registration example of reference scan to target scan. Left: Reference scan Right: Target scan.

According to [9], GGO is the most common finding in COVID-19 patients’ CT scans, followed by CONS. Figure 5 shows examples of GGO and CONS from our dataset. For our experiments, we only segment GGO and CONS, due to the low occurrence of PLEFF in the patient cohort of the dataset.

Figure 5: Examples of GGO and CONS on axial slices of COVID-19 patients’ lung CTs. Arrows point to the infection areas. Left: GGO Right: CONS.

For the 30 patients, the reference and follow-up CT scans of each patient after registration have an average structural similarity index (SSIM) [24] of 29.71%. This shows that the scans taken at different time points broadly differ from one another perceptually. Besides that, the average change in the percentage of GGO and CONS in the patients’ lung CTs from different time points is 13.68% and 6.59%, respectively. This indicates the noticeable difference in the disease progression over time in the dataset. Table 1 shows the percentage of GGO and CONS in the lungs of the patients at each timestep.

Radiomic   T-1 (n=30)            T-2 (n=30)            T-3 (n=5)
           Average   Std. Dev.   Average   Std. Dev.   Average   Std. Dev.
GGO        15.23%    14.76%      20.17%    18.07%      15.02%    11.91%
CONS       6.52%     7.07%       8.15%     11.51%      11.28%    11.07%
Table 1: Percentage of GGO and CONS in the lungs of patients. T refers to the timestep of the scans; n is the number of patient volumes at each timestep.

The training set is made up of 16 patients (37 volumes), split into training (n=12) and validation (n=4) subsets. Our model is tested on an independent test set consisting of 14 patients (28 volumes).

4.2 Experimental Settings

4.2.1 Evaluation Metrics

The segmentation performance of the models is evaluated using the following metrics.

  • Dice Similarity Coefficient (DSC) is a statistical measure of the similarity between two segmentations.

  • Positive Predictive Value (PPV, or precision) displays the fraction of correctly segmented regions over all predicted segmentations.

  • True Positive Rate (TPR, recall or sensitivity) shows the proportion of correct segmentation outputs with respect to the ground truth.

  • Volume Difference (VD) is calculated as the absolute difference in the predicted lesion segmentation volume and ground truth lesion segmentation volume over the ground truth lesion segmentation volume.
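The four metrics above can be computed per class from binary masks, as in the following sketch. The function name `segmentation_metrics` and the small constant `eps` that guards against division by zero are assumptions of this illustration; the Dice formula used is the standard 2|A∩B| / (|A| + |B|).

```python
import numpy as np

def segmentation_metrics(pred, truth, eps=1e-8):
    """Compute Dice, PPV, TPR, and VD for one class from binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()   # correctly segmented voxels
    pred_vol, true_vol = int(pred.sum()), int(truth.sum())
    return {
        "dice": 2.0 * tp / (pred_vol + true_vol + eps),
        "ppv": tp / (pred_vol + eps),         # precision
        "tpr": tp / (true_vol + eps),         # recall / sensitivity
        "vd": abs(pred_vol - true_vol) / (true_vol + eps),
    }
```

Higher values are better for Dice, PPV, and TPR, whereas a lower VD indicates a predicted lesion volume closer to the ground truth.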


4.3 Ablation Study

In order to study how the additional information concatenated to the inputs influences the model’s segmentation performance, an ablation study was carried out using our longitudinal COVID-19 dataset. In addition to the longitudinal baseline network [6], we also implemented a static version of the network for comparison. long_edit+ref_seg is the baseline long.+ref_seg extended for interactive segmentation. The proposed model additionally incorporates past prediction outputs of the target as additional information to guide the following segmentation, and static_edit is the version of the proposed model without reference information. Table 2 summarizes all tested models and their inputs. The reference manual segmentation in the table refers to ground truth masks of the reference images, whereas the edit masks contain the user feedback on the target segmentation.

                          Model input
Model name                Target   Reference   Ref. manual     Edit   Target previous
                          image    image       segmentation    mask   segmentation
Baseline static network   yes      -           -               -      -
Baseline long. network    yes      yes         -               -      -
Baseline long.+ref_seg    yes      yes         yes             -      -
Table 2: Input for different models tested in ablation study

As can be seen from Table 3, the baseline models performed better than the interactive segmentation models in the first segmentation round without edits, and the baseline longitudinal models have higher Dice scores than the baseline static model. Among the longitudinal baseline models, concatenating the reference segmentation to the input CTs further improves the GGO Dice by 1.44 percentage points and the CONS Dice by 0.76 percentage points. Comparing the interactive segmentation models, the longitudinal interactive segmentation models output better initial segmentations than the static model. The proposed model’s GGO Dice is 8.45 percentage points higher than that of the static model, whereas the long_edit+ref_seg model improves the CONS Dice by 15.7 percentage points over the static model.

Model                    Dice (%)                        PPV (%)                         TPR (%)                         VD (%)
                         GGO             CONS            GGO             CONS            GGO             CONS            GGO             CONS
Non interactive methods
Baseline static network  44.15 ± 3.33    19.75 ± 4.89    62.12 ± 2.89    42.86 ± 8.53    38.98 ± 5.09    14.96 ± 3.60    49.98 ± 6.07    102.0 ± 28.84
Baseline long. network   45.42 ± 3.03    27.63 ± 5.61    67.34 ± 3.03    41.35 ± 8.01    37.96 ± 4.46    23.54 ± 4.32    48.56 ± 6.09    93.18 ± 38.04
Baseline long.+ref_seg   46.86 ± 3.12    28.39 ± 6.39    62.61 ± 3.86    42.12 ± 8.53    41.63 ± 4.16    24.36 ± 5.61    43.31 ± 6.53    90.36 ± 39.43
Interactive methods
static_edit              35.97 ± 4.20    13.99 ± 3.51    66.82 ± 3.01    41.20 ± 8.45    29.03 ± 4.99    9.28 ± 2.27     59.12 ± 7.20    98.11 ± 22.21
long_edit+ref_seg        37.70 ± 3.89    29.69 ± 5.86    64.66 ± 3.03    37.85 ± 7.60    30.39 ± 4.46    26.80 ± 5.35    55.46 ± 6.75    64.88 ± 27.62
Proposed                 44.42 ± 3.61    23.94 ± 6.12    62.65 ± 3.49    41.89 ± 8.86    40.38 ± 5.42    19.62 ± 5.92    51.84 ± 7.40    49.71 ± 7.94
Table 3: Evaluation results on the test set before user interactions. Values displayed are the mean ± standard error. Empty masks are used in place of the edit mask and the target image previous segmentation mask for static_edit, long_edit+ref_seg, and the proposed model.
Model                    Dice (%)                        PPV (%)                         TPR (%)                         VD (%)
                         GGO             CONS            GGO             CONS            GGO             CONS            GGO             CONS
Initial segmentation with user edits
static_edit              40.86 ± 3.16    24.33 ± 3.14    67.80 ± 3.01    54.37 ± 7.45    32.77 ± 4.33    17.16 ± 1.97    54.15 ± 6.31    87.99 ± 20.73
long_edit+ref_seg        44.34 ± 2.34    36.71 ± 5.57    67.03 ± 3.30    44.38 ± 6.84    35.72 ± 3.39    33.91 ± 5.05    48.76 ± 5.41    54.67 ± 20.78
Proposed                 49.22 ± 2.58    36.33 ± 5.19    64.34 ± 3.88    50.50 ± 7.10    44.64 ± 4.50    30.91 ± 5.13    46.63 ± 6.54    40.08 ± 6.03
Output segmentation after one round of segmentation refinement by model
static_edit              53.59 ± 2.33    55.48 ± 4.48    71.63 ± 4.12    71.34 ± 4.34    44.62 ± 2.98    47.01 ± 4.68    40.21 ± 3.76    33.68 ± 5.66
long_edit+ref_seg        54.79 ± 1.95    54.01 ± 5.03    66.70 ± 3.62    51.14 ± 5.16    48.54 ± 2.54    64.44 ± 5.67    34.74 ± 4.55    74.43 ± 36.10
Proposed                 59.86 ± 2.57    58.81 ± 4.36    62.06 ± 4.48    58.18 ± 5.15    61.86 ± 3.00    62.39 ± 4.06    36.39 ± 11.28   34.11 ± 12.75
Table 4: Evaluation results of the interactive methods on the test set with real user interactions. Values displayed are the mean ± standard error.

The evaluation results in Table 3 further reveal that the CONS TPR of the static models is considerably lower than that of the longitudinal models. For the non-interactive methods, the difference between the baseline static network and the baseline long. network is 8.58 percentage points. For the interactive methods, the CONS TPR of the static_edit model is 10.34 percentage points lower than that of the longitudinal interactive model with the lower CONS TPR (the proposed model). These results suggest that longitudinal models can improve the segmentation of a more complicated class such as CONS.

4.4 Segmentation Results and Discussions

The following evaluations are carried out using the GUI to obtain real user feedback on the segmentation output.

4.4.1 Quantitative Results

The aim of the interactive segmentation model is to assist and reduce the user’s workload during the segmentation of new data. Thus, its desired function is to take in rough user interactions and improve the segmentation output. To further determine whether the segmentation refinement model serves this purpose, the initial segmentation that is manually corrected and fed into the segmentation refinement model is compared with the model refined segmentation output. The results are shown in Table 4.

As can be seen from the results in Table 4, the interactive segmentation refinement model was able to use the previous segmentation results and user feedback to improve the segmentation of the target scans. Out of the three models, the proposed model has the highest Dice scores for GGO and CONS after one round of segmentation refinement. However, by comparing the change in Dice of the edited initial segmentation by the user and refined segmentation by the model, the static interactive segmentation model showed the most considerable improvement in its segmentation output, with an increase of 12.73% in GGO Dice and 31.15% in CONS Dice, whereas the proposed model showed an increase of 10.64% in GGO Dice and 22.48% in CONS Dice.

The proposed model and the static_edit model are further evaluated with another round of segmentation refinement. Figure 6 shows how the Dice of the 14 test patients changes after each segmentation refinement round with real user edits. The best baseline model, baseline long.+ref_seg, is added for comparison.

Figure 6: Average change in Dice vs Number of Refinement Rounds for 14 test patients. Left: class GGO and Right: class CONS.

Experimental results showed that the proposed model improves the Dice considerably after just two rounds of segmentation refinement with real user feedback. The total average increase of the segmentation Dice after two rounds of segmentation refinement is 20.44% for GGO and 40.33% for CONS, with an average increase of 4.99% for GGO and 5.46% for CONS between the first and second segmentation refinement rounds. As for the static_edit model, after two rounds of segmentation refinement, its GGO and CONS Dice are 9.03% and 12.98% lower than those of the proposed model, respectively.

From the plots in Figure 6, it is visible that the proposed model is superior in terms of segmentation refinement performance compared to the static interactive segmentation model. The static_edit model is shown to be unable to further refine the segmentation correctly at the second refinement step; the GGO segmentation Dice improved, whereas the CONS segmentation Dice dropped. Closer inspection of the segmentation output shows that the static_edit model is more likely to wrongly classify regions, which leads to the Dice decrease. As for the proposed model, the inclusion of reference scan information improved its segmentation classification.

Since the degree of infection severity varies among the test patients, which influences the output segmentation, the number of rough user edit strokes needed to edit a patient’s initial segmentation ranges from 1 to 10 per edited slice for the proposed model. In most cases, the increase in Dice after refinement with the model is larger for CONS. Qualitative results of the initial segmentation showed that CONS is frequently under-segmented due to its high similarity to the background class (blood vessels and walls of airways have a very similar Hounsfield unit histogram), making it difficult for the model to segment the region correctly without user guidance.

4.4.2 Qualitative Results

Qualitative results from the proposed model are presented in this section.
Comparison of Edits
Figure 7 shows the different segmentation refinement outputs given different types of user edits. As can be observed in the lower segmentation output, it is sufficient to draw the outer outline of the infected regions to separate them from the background. The model is able to further segment the unedited areas, but these areas are often incorrectly classified. However, as shown in Figure 8, the wrong segmentation can be corrected in the following refinement round.

Figure 7: Example of how the refined segmentation differs when different types of user edits (top and lower rows) are drawn to refine the segmentation. (Red: GGO ground truth, Green: CONS ground truth, Pink: predicted GGO, Dark green: predicted CONS, Magenta: foreground edit for GGO, Neon green: foreground edit for CONS)

A possible explanation for the wrong segmentation of GGO as CONS is their feature similarity in the CT scans. One of the main characteristics that differentiates GGO from CONS is the location of the region. CONS is often located at the bottom of the axial lung CT. In severe cases, however, it can be found higher up, and its borders are often GGO, as shown in the second row of Figure 8.

Figure 8: Examples of two rounds of segmentation refinement, resulting in increasingly better results.

Figure 9: Example of how edits on one slice are automatically propagated to other slices. After the initial prediction, edits are drawn on one slice. Red arrows point to false edit locations. Interestingly, the model learns that false edits should not be propagated further than one slice.

Robustness Testing
A robustness test is carried out to evaluate how well the 2.5D model that uses stacked slices from three anatomical views as input is able to propagate edits on one slice to another.
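As an illustration of the kind of input such a 2.5D model consumes, the sketch below extracts a stack of neighbouring slices around a given index along each of the three volume axes. The helper name `stacked_slices`, the window of one neighbour on each side, the clamping at the volume border, and the mapping of array axes to anatomical views are all assumptions made for this example, not details taken from the implementation.

```python
import numpy as np

def stacked_slices(volume, index, axis, context=1):
    """Return 2*context+1 neighbouring slices around `index` along `axis`,
    stacked as channels. Indices are clamped at the volume borders."""
    n = volume.shape[axis]
    ids = np.clip(np.arange(index - context, index + context + 1), 0, n - 1)
    slab = np.take(volume, ids, axis=axis)   # neighbours still lie on `axis`
    return np.moveaxis(slab, axis, 0)        # move them to the channel axis

# Toy volume; axes are assumed to be (axial, coronal, sagittal) positions.
volume = np.arange(4 * 5 * 6, dtype=np.float32).reshape(4, 5, 6)

axial = stacked_slices(volume, index=0, axis=0)     # shape (3, 5, 6)
coronal = stacked_slices(volume, index=2, axis=1)   # shape (3, 4, 6)
sagittal = stacked_slices(volume, index=5, axis=2)  # shape (3, 4, 5)
```

Under these assumptions, a scribble drawn on one slice only enters the stacks whose window contains that slice, which may be one reason why edits reach only nearby slices.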

Figure 9 displays the limited automatic propagation of edits from one slice to other slices. The further a slice is from the edited slice, the more of its segmentation is wrongly classified, even though the correct regions are segmented, and some regions are under-segmented. However, false edits are not propagated further than one slice away by the model. This shows the model’s ability to detect incorrect edits by considering the features of the CT scans.

Figure 10: Example of how the target’s initial prediction is influenced by the reference image in a case where there is a misalignment between the reference and target image. The red arrows point to regions where the CT scans differ the most. The white arrows point to the regions of interest.

Influence of Reference Data on Target Segmentation
In some cases where a patient’s reference scan is not correctly deformed to align with the follow-up scan, the initial segmentation of the follow-up scan for more difficult regions can be negatively affected by the reference CT. Figure 10 shows some abnormalities in the initially predicted segmentation of the target image. The infected lower part of the target CT that is harder to discern from the background class is segmented according to the reference image. As observed in the target’s initial predicted segmentation, its left outline looks similar to the reference image.

In another example, shown in Figure 11, the reference image is correctly aligned with the target image. Here the reference image can serve as a guide for the model to segment the difficult regions mentioned previously, such as areas that look similar to the background class. In this example, the model produces a more accurate initial segmentation of the target scan.

Figure 11: Example of a case where the reference image and target image are correctly aligned. The model segmented the difficult areas better compared to Figure 10. The red arrows point to regions where the CT scans differ the most. The white arrows point to the regions of interest.

4.5 Limitations and Future Work

One of the limitations that we faced is the lack of a larger longitudinal COVID-19 dataset. Also, label noise is probably present in our data, since in severe cases even expert radiologists struggle to separate CONS in particular from the background class and PLEFF. In future work, the potential of this method for improving segmentation results on other longitudinal medical images and in different clinical contexts can be further explored. The problem of wrongly classified regions in multiclass segmentation can additionally be examined by using an ensemble of binary interactive segmentation models, one for each foreground class. Furthermore, it would be interesting to test the proposed method with other state-of-the-art model architectures. Finally, this paper also reveals a limitation of a 2.5D model in propagating edits from one slice to slices further away. A 3D implementation of the interactive segmentation method, which is outside the scope of this paper, might be able to counter this issue at the cost of requiring larger training datasets.
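One way the suggested ensemble of binary models could be combined into a multiclass result is sketched below. It assumes each binary model outputs a foreground probability map for its own class; the function name `fuse_binary_predictions`, the 0.5 threshold, and the argmax tie-breaking rule are hypothetical choices for illustration and have not been evaluated in this work.

```python
import numpy as np

def fuse_binary_predictions(probabilities, threshold=0.5):
    """Fuse per-class foreground probabilities of shape (n_classes, H, W)
    into one label map: 0 = background, k + 1 = foreground class k."""
    probabilities = np.asarray(probabilities, dtype=float)
    best_class = probabilities.argmax(axis=0)           # most confident model
    confident = probabilities.max(axis=0) >= threshold  # else background
    return np.where(confident, best_class + 1, 0)

# Toy outputs of two hypothetical binary models (GGO first, CONS second).
ggo_prob = np.array([[0.9, 0.2],
                     [0.6, 0.1]])
cons_prob = np.array([[0.1, 0.8],
                      [0.7, 0.3]])

labels = fuse_binary_predictions(np.stack([ggo_prob, cons_prob]))
```

With this rule, a pixel that both models claim is assigned to the more confident one, so a disagreement between classes becomes an explicit comparison of two probabilities instead of a single multiclass decision.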

5 Conclusion

In this paper, an interactive segmentation method with a 2.5D longitudinal network is proposed. By concatenating the previous time-point reference segmentation mask, the segmentation output of the target image from past segmentation rounds, and the user interaction mask to the target and reference images, all past information is used as input to the model to improve segmentation results. Experiments on our in-house longitudinal COVID-19 dataset show that a large improvement in the Dice of both classes is obtained after one round of interactive segmentation refinement. In addition, the segmentation performance of the proposed longitudinal interactive refinement model is superior to that of the static version of the interactive model. We conclude that, with the availability of longitudinal data, existing segmentation models can be easily adapted through our method and trained end-to-end for interactive segmentation refinement.
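The input construction summarized above can be sketched as a channel-wise concatenation. This is a minimal illustration, not the actual implementation: the function name `build_input`, the channel counts, and the use of all-zero channels in the first round (when no previous segmentation or scribbles exist yet) are assumptions made so that a single network sees a fixed number of input channels.

```python
import numpy as np

def build_input(target, reference, reference_seg, previous_seg=None, scribbles=None):
    """Concatenate all available inputs along the channel axis (axis 0).
    Inputs that do not exist yet are replaced by zero channels (assumed)."""
    n_seg = reference_seg.shape[0]
    spatial = target.shape[1:]

    def or_zeros(array):
        if array is None:
            return np.zeros((n_seg,) + spatial, dtype=np.float32)
        return array

    parts = [target, reference, reference_seg,
             or_zeros(previous_seg), or_zeros(scribbles)]
    return np.concatenate([p.astype(np.float32) for p in parts], axis=0)

# Toy shapes: 3 stacked slices per scan, 2 foreground classes, 8x8 pixels.
target = np.random.rand(3, 8, 8)
reference = np.random.rand(3, 8, 8)
reference_seg = np.ones((2, 8, 8))

first_round = build_input(target, reference, reference_seg)

previous_seg = np.ones((2, 8, 8))   # stand-in for the round-one output
scribbles = np.zeros((2, 8, 8))
scribbles[0, 4, 2:6] = 1.0          # one GGO stroke drawn by the user
refinement_round = build_input(target, reference, reference_seg,
                               previous_seg, scribbles)
```

Both rounds yield an array of shape (12, 8, 8) here, so the same network can be used for the initial prediction and for every refinement round.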


  • [1] T. Ai, Z. Yang, H. Hou, C. Zhan, C. Chen, W. Lv, Q. Tao, Z. Sun, and L. Xia (2020-08) Correlation of chest CT and RT-PCR testing for coronavirus disease 2019 (COVID-19) in china: a report of 1014 cases. Radiology 296 (2), pp. E32–E40. External Links: Document, Link Cited by: §1.
  • [2] R. Alkadi, A. El-Baz, Dr. F. Taher, and N. Werghi (2019-01) A 2.5d deep learning-based approach for prostate cancer detection on t2-weighted magnetic resonance imaging: munich, germany, september 8-14, 2018, proceedings, part iv. In Computer Vision – ECCV 2018 Workshops, pp. 734–739. Cited by: §3.1.
  • [3] S. Aslani, M. Dayan, L. Storelli, M. Filippi, V. Murino, M. A. Rocca, and D. Sona (2019-08) Multi-branch convolutional neural network for multiple sclerosis lesion segmentation. NeuroImage 196, pp. 1–15. Cited by: §3.1.
  • [4] A. Birenbaum and H. Greenspan (2017-10) Multi-view longitudinal cnn for multiple sclerosis lesion segmentation. Eng. Appl. Artif. Intell. 65 (C), pp. 111–118. External Links: ISSN 0952-1976 Cited by: §2.3.
  • [5] A. Criminisi, T. Sharp, and A. Blake (2008) GeoS: geodesic image segmentation. In Lecture Notes in Computer Science, pp. 99–112. Cited by: §2.2.
  • [6] S. Denner, A. Khakzar, M. Sajid, M. Saleh, Z. Spiclin, S. T. Kim, and N. Navab (2020) Spatio-temporal learning from longitudinal data for multiple sclerosis lesion segmentation. arXiv preprint arXiv:2004.03675. Cited by: §2.3, §3.1, §3.4, §4.3.
  • [7] D. Fan, T. Zhou, G. Ji, Y. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao (2020-08) Inf-net: automatic COVID-19 lung infection segmentation from CT images. IEEE Transactions on Medical Imaging 39 (8), pp. 2626–2637. Cited by: §1, §2.1.
  • [8] A. Guha Roy, S. Conjeti, N. Navab, and C. Wachinger (2018-11) QuickNAT: a fully convolutional network for quick and accurate segmentation of neuroanatomy. NeuroImage 186, pp. . Cited by: §3.1.
  • [9] M. M. Hefeda (2020-11) CT chest findings in patients infected with COVID-19: review of literature. Egyptian Journal of Radiology and Nuclear Medicine 51 (1). Cited by: §4.1.
  • [10] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio (2017) The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 11–19. Cited by: §3.1, §3.4.
  • [11] S. T. Kim, L. Goli, M. Paschali, A. Khakzar, M. Keicher, T. Czempiel, E. Burian, R. Braren, N. Navab, and T. Wendler (2021) Longitudinal quantitative assessment of covid-19 infection progression from chest cts. International Conference on Medical Image Computing and Computer Assisted Intervention. Cited by: §2.3, §4.1.
  • [12] T. Kitrungrotsakul, Q. Chen, H. Wu, Y. Iwamoto, H. Hu, W. Zhu, C. Chen, F. Xu, Y. Zhou, L. Lin, et al. (2021) Attention-refnet: interactive attention refinement network for infected area segmentation of covid-19. IEEE Journal of Biomedical and Health Informatics. Cited by: §2.2.
  • [13] Y. Li and L. Xia (2020-06) Coronavirus disease 2019 (COVID-19): role of chest CT in diagnosis and management. American Journal of Roentgenology 214 (6), pp. 1280–1286. External Links: Document, Link Cited by: §1.
  • [14] B. Lowekamp, D. Chen, L. Ibanez, and D. Blezek (2013-12) The design of simpleitk. Frontiers in neuroinformatics 7, pp. 45. Cited by: §4.1.
  • [15] J. B. Mendel, J. T. Lee, and D. Rosman (2020) Current concepts imaging in covid-19 and the challenges for low and middle income countries. Journal of Global Radiology 6 (1), pp. 3. Cited by: §1.
  • [16] F. Milletari, N. Navab, and S. Ahmadi (2016-06) V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. arXiv e-prints, pp. arXiv:1606.04797. External Links: 1606.04797 Cited by: §2.1.
  • [17] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §3.4.
  • [18] H. Ramadan, C. Lachqar, and H. Tairi (2020-08) A survey of recent interactive image segmentation methods. Computational Visual Media 6 (4), pp. 355–384. Cited by: §1, §2.2.
  • [19] S. Reddi, S. Kale, and S. Kumar (2018) On the convergence of adam and beyond. In International Conference on Learning Representations, Cited by: §3.4.
  • [20] F. Shan, Y. Gao, J. Wang, W. Shi, N. Shi, M. Han, Z. Xue, D. Shen, and Y. Shi (2021-03) Abnormal lung quantification in chest CT images of COVID-19 patients with deep learning and its application to severity prediction. Medical Physics 48 (4), pp. 1633–1645. Cited by: §1, §2.1.
  • [21] H. Shi, X. Han, N. Jiang, Y. Cao, O. Alwalid, J. Gu, Y. Fan, and C. Zheng (2020) Radiological findings from 81 patients with covid-19 pneumonia in wuhan, china: a descriptive study. The Lancet infectious diseases 20 (4), pp. 425–434. Cited by: §1, §2.1.
  • [22] G. Wang, W. Li, M. A. Zuluaga, R. Pratt, P. A. Patel, M. Aertsen, T. Doel, A. L. David, J. Deprest, S. Ourselin, and T. Vercauteren (2018) Interactive medical image segmentation using deep learning with image-specific fine tuning. IEEE Transactions on Medical Imaging 37 (7), pp. 1562–1573. Cited by: §1, §2.2.
  • [23] G. Wang, X. Liu, C. Li, Z. Xu, J. Ruan, H. Zhu, T. Meng, K. Li, N. Huang, and S. Zhang (2020-08) A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images. IEEE Transactions on Medical Imaging 39 (8), pp. 2653–2663. Cited by: §1, §2.1.
  • [24] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §4.1.
  • [25] N. Xu, B. Price, S. Cohen, J. Yang, and T. Huang (2016-06) Deep interactive object selection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [26] H. Zhang, A. M. Valcarcel, R. Bakshi, R. Chu, F. Bagnato, R. T. Shinohara, K. Hett, and I. Oguz (2019) Multiple sclerosis lesion segmentation with tiramisu and 2.5 d stacked slices. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 338–346. Cited by: §3.1.
  • [27] B. Zhou, L. Chen, and Z. Wang (2019) Interactive deep editing framework for medical image segmentation. International Conference on Medical Image Computing and Computer Assisted Intervention, pp. 329–337. Cited by: §1, §2.2, §3.2.
  • [28] S. Zhou, Y. Wang, T. Zhu, and L. Xia (2020-06) CT features of coronavirus disease 2019 (COVID-19) pneumonia in 62 patients in wuhan, china. American Journal of Roentgenology 214 (6), pp. 1287–1294. Cited by: §1, §2.1.
  • [29] S. Zhou, Y. Wang, T. Zhu, and L. Xia (2020) CT features of coronavirus disease 2019 (covid-19) pneumonia in 62 patients in wuhan, china. American Journal of Roentgenology 214 (6), pp. 1287–1294. Cited by: §1.