
ECONet: Efficient Convolutional Online Likelihood Network for Scribble-based Interactive Segmentation

Automatic segmentation of lung lesions associated with COVID-19 in CT images requires a large amount of annotated volumes. Annotations mandate expert knowledge and are time-intensive to obtain through fully manual segmentation methods. Additionally, lung lesions have large inter-patient variations, with some pathologies having a visual appearance similar to that of healthy lung tissue. This poses a challenge when applying existing semi-automatic interactive segmentation techniques for data labelling. To address these challenges, we propose an efficient convolutional neural network (CNN) that can be learned online while the annotator provides scribble-based interaction. To accelerate learning from only the samples labelled through user-interactions, a patch-based approach is used for training the network. Moreover, we use a weighted cross-entropy loss to address the class imbalance that may result from user-interactions. During online inference, the learned network is applied to the whole input volume using a fully convolutional approach. We compare our proposed method with the state-of-the-art and show that it outperforms existing methods on the task of annotating lung lesions associated with COVID-19, achieving a 16% higher Dice score in 3× less online training and inference time while requiring approximately 9000 fewer scribbles-based labelled voxels. Due to the online learning aspect, our approach adapts quickly to user input, resulting in high-quality segmentation labels. Source code will be made available upon acceptance.



1 Introduction

COVID-19 causes pneumonia-like symptoms, adversely affecting respiratory systems in some patients. In their response to the disease, clinicians have used Computed Tomography (CT) imaging to assess the amount of lung damage and disease progression by localizing lung lesions [roth2021rapid, revel2021study, rubin2020role]. This has been essential in providing relevant treatment for COVID-19 patients with severe conditions and has resulted in the acquisition of a large number of CT volumes from COVID-19 patients [roth2021rapid, tsai2021rsna, wang2020noise, revel2021study]. Deep learning-based automatic lung lesion segmentation methods may ease the burden on clinicians; however, these methods require large amounts of manually labelled data [wang2020noise, gonzalez2021detecting, tilborghs2020comparative, chassagnon2020ai]. Labelling CT volumes for lung lesions is a time-intensive task which requires expert knowledge, putting further strain on clinicians' workload. In addition, future variants of novel coronaviruses may result in variations in lesion pathologies [mclaren2020bullseye]. In such cases, automatic segmentation methods that are trained on existing datasets may fail. To address this, rapid labelling of relevant data is needed to augment existing datasets with new labelled volumes.

Figure 1: Comparison of appearance of lung lesions vs non-lesions in CT volumes from COVID-19 dataset [wang2020noise]. Top row shows a selected slice, and bottom row shows the distribution of HU intensity values for each bounding box. The large overlap between background and lesion distributions indicates that the appearance of some parts of lesion may look similar to background, leading to ambiguity in likelihood models learned from appearance-based features alone.

Related work.

Due to their quick adaptability and efficiency, a number of existing online likelihood methods have been applied as semi-automatic methods for interactively segmenting objects in images [boykovjolly2001interactive, criminisi2008geos, rother2004grabcut, barinova2012online, wang2016dynamically]. One of the first approaches for interactive segmentation used a histogram of intensity values for generating the likelihood [boykovjolly2001interactive], which was then regularized using a conditional random field formulation solved using a max-flow algorithm. Similarly, [criminisi2008geos] also used a histogram-based likelihood for interactively segmenting objects, using geodesic symmetric filtering for regularization. In [rother2004grabcut], a set of Gaussian Mixture Models (GMMs) was employed to model class-specific intensity distributions.

While the intensity-based methods provided a significant advancement in terms of interactively segmenting an object, they failed to model ambiguous cases, e.g., where the object intensity is similar to that of the background. To bypass this limitation, hand-crafted features were employed to build online likelihood models in [barinova2012online, wang2016dynamically]. Barinova et al. [barinova2012online] proposed Online Random Forests (ORF), which were trained by re-sampling the training data through fixed class weights. Dynamically Balanced Online Random Forests (DybaORF) [wang2016dynamically] proposed to utilize dynamically changing class weights based on the distribution of classes after each user-interaction. Both ORF and DybaORF used hand-crafted features, with DybaORF forming the state-of-the-art by outperforming all existing online likelihood methods.

Existing online likelihood methods either directly depend on intensity values [boykovjolly2001interactive, criminisi2008geos, rother2004grabcut] or utilize hand-crafted features [barinova2012online, wang2016dynamically] to build the corresponding likelihood models. While these methods work well for cases where the appearance/features of object and background differ sufficiently, they fail in cases where this assumption breaks. As shown in Fig. 1, the appearance of lung lesions in COVID-19 patients may be ambiguous, with the distribution of their HU intensity appearing similar to that of background regions.

A number of deep learning-based interactive segmentation methods exist that provide AI-assisted annotation [luo2021mideepseg, wang2018deepigeos, wang2018interactive, rajchl2016deepcut]. DeepCut [rajchl2016deepcut] uses a bounding box provided by the user to train CNNs for fetal brain and lung segmentation from MRI. DeepIGeoS [wang2018deepigeos] provides interactive segmentation by combining CNNs with user-provided scribble interactions in a two-stage CNN-based approach, where the first stage infers an initial segmentation and the second refines it using user-scribbles. BIFSeg [wang2018interactive] utilizes bounding box interactions with image-specific fine-tuning of a CNN to segment unseen objects. MIDeepSeg [luo2021mideepseg] incorporates user-clicks with the input image using an exponential geodesic distance transform to propose an improved interactive CNN. Deep learning-based interactive segmentation methods consist of large networks that require offline pre-training on large labelled datasets. Additionally, due to their number of parameters, these networks do not adapt quickly in an online setting to changes in different unseen examples. Some methods, such as BIFSeg, propose to use image-specific fine-tuning; however, this has limited application in online on-the-fly learning due to its extensive computational requirements.


To address the challenge of learning a distinctive likelihood model in an online and data-light manner, we propose a method which we refer to as Efficient Convolutional Online likelihood Network (ECONet). To the best of our knowledge, ECONet is the first online likelihood method that enables joint and efficient on-the-fly learning of both features and classifier using only scribbles-based labels. The proposed model is lightweight, using only a single convolutional feature layer and three hidden fully-connected layers, and can be learned online, while the user provides labels interactively, without the need for any pre-training. We propose an efficient online training technique, where only the patches extracted from scribble-labelled voxels are used. Efficient inference from ECONet is achieved through fully convolutional application of the network on the whole input volume [long2015fully]. We evaluate ECONet on the problem of labelling lung lesions in CT volumes from COVID-19 patients, with comparison against high-quality segmentation labels from expert annotators. We show that the proposed ECONet outperforms existing state-of-the-art online likelihood methods, achieving a 16% higher Dice score in 3× lower online training and inference time and requiring approximately 9000 fewer interactively labelled voxels.

2 Method

2.1 Problem Formulation

Let $I$ represent an image volume that is to be labelled, where $i$ is the index of a given voxel. Given $I$, the user provides scribble-based interaction indicating class labels for a subset of voxels of the image $I$. Let $S = S^f \cup S^b$ represent the set of scribbles, where $S^f$ and $S^b$ denote the foreground (lung lesion) and background scribbles, respectively, and $S^f \cap S^b = \emptyset$. For a given voxel $i$, the provided scribble label is $s_i = 1$ if $i \in S^f$ and $s_i = 0$ if $i \in S^b$. The scribbles in $S$ and image patches centered at each scribble in $S$ are used for online training of a given model with parameters $\theta$.

Figure 2: ECONet online training and inference, showing (a) patch-based online training of ECONet, where a patch of size K×K×K, extracted around a scribble voxel, is used. The scribbles-balanced cross-entropy loss (Section 2.3) is used along with the label from the scribble to learn the model parameters. (b) Online likelihood inference using ECONet as a fully convolutional network on the full image volume.

2.2 Online Training and Inference using ECONet

The proposed Efficient Convolutional Online Likelihood Network (ECONet) is a lightweight fully convolutional neural network designed to be trained and applied in an online learning setting. ECONet consists of one convolution layer used for learning relevant features, which is followed by three fully-connected layers that enable learning the classifier for a given voxel. Each convolutional and fully-connected layer is followed by a batch normalization layer and ReLU activation. To train and apply ECONet in an online setting without any pre-training and using only scribbles-based labels provided by the user, we propose to use a training and inference strategy that maximizes the efficiency of both tasks. An overview of the proposed online training and inference method is shown in Fig. 2.

Scribbles provided by an annotator at a given stage only label a small subset of voxels within a given image volume. Based on this observation, we minimize the computational budget required to perform training passes on ECONet by extracting and learning only from patches of K×K×K dimensions, each centered around a voxel with a user-scribble (Fig. 2(a)). Once the parameters of ECONet have been learned, online inference is done by applying it to the whole input CT volume. ECONet is converted to a fully convolutional network for inference (Fig. 2(b)), where appropriate padding is used in the first conv3d layer and the fully-connected layers are converted to 1×1×1 conv3d layers [long2015fully]. This enables ECONet to efficiently infer a volume with a likelihood for each voxel within the image.
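The split between patch-based training and dense inference can be illustrated with a deliberately naive, pure-Python sketch. It is not the ECONet implementation: the helper names (`extract_patch`, `training_patches`, `dense_inference`), the zero padding at the borders, and the stand-in `mean_model` are all assumptions made for illustration. A fully convolutional network computes the same per-voxel outputs as `dense_inference`, but shares computation between overlapping patches instead of re-evaluating each one.

```python
from typing import Callable, List, Sequence, Tuple

Volume = List[List[List[float]]]  # indexed as volume[z][y][x]
Voxel = Tuple[int, int, int]


def extract_patch(volume: Volume, centre: Voxel, k: int) -> List[float]:
    """Return the flattened k*k*k patch centred on `centre`, zero-padded at borders."""
    assert k % 2 == 1, "an odd patch size keeps the patch centred on the voxel"
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    r = k // 2
    cz, cy, cx = centre
    patch = []
    for z in range(cz - r, cz + r + 1):
        for y in range(cy - r, cy + r + 1):
            for x in range(cx - r, cx + r + 1):
                inside = 0 <= z < depth and 0 <= y < height and 0 <= x < width
                patch.append(volume[z][y][x] if inside else 0.0)
    return patch


def training_patches(volume: Volume, scribbles: Sequence[Voxel], k: int) -> List[List[float]]:
    """Training set: one patch per scribbled voxel and nothing else."""
    return [extract_patch(volume, voxel, k) for voxel in scribbles]


def dense_inference(volume: Volume, model: Callable[[List[float]], float], k: int) -> Volume:
    """Evaluate the per-patch model at every voxel, giving one output per voxel."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    return [[[model(extract_patch(volume, (z, y, x), k)) for x in range(width)]
             for y in range(height)]
            for z in range(depth)]


# Toy 4x4x4 volume; the "model" is a hypothetical stand-in (mean patch intensity).
volume = [[[float(z + y + x) for x in range(4)] for y in range(4)] for z in range(4)]
scribbles = [(1, 1, 1), (2, 2, 2), (0, 0, 0)]
patches = training_patches(volume, scribbles, k=3)


def mean_model(patch: List[float]) -> float:
    return sum(patch) / len(patch)


dense_output = dense_inference(volume, mean_model, k=3)
print(len(patches), len(patches[0]))  # 3 patches of 27 values each
```

In this toy case training touches 3 patches while inference evaluates all 64 voxels, which is the asymmetry the patch-based strategy exploits.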

2.3 Scribbles-balanced Cross-Entropy Loss

User-scribbles suffer from a class imbalance problem, resulting from the user-interactions being biased towards the object of interest. In addition, during the course of an interactive session, the user may focus on labelling different segments, which results in a dynamically changing class imbalance in the scribble set [barinova2012online, wang2016dynamically]. To address this, we utilize a scribbles-balanced cross-entropy loss, with class weights computed from the scribble distribution.

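Given a model with parameters $\theta$, the foreground likelihood from this model for a voxel $i$ is defined as $p_i = P(s_i = 1 \mid I, \theta)$. Then, the scribbles-balanced cross-entropy loss is:

$$\mathcal{L} = -\left[ w^f \sum_{i \in S^f} \log p_i + w^b \sum_{i \in S^b} \log (1 - p_i) \right]$$

where $S^f$ and $S^b$ are the sets of foreground and background scribble voxels with $S = S^f \cup S^b$, and $w^f$ and $w^b$ are scribble-based class weights for foreground and background, respectively, defined as $w^f = |S| / |S^f|$ and $w^b = |S| / |S^b|$.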
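A minimal sketch of a scribbles-balanced cross-entropy in plain Python follows. The inverse-frequency form of the class weights (total scribbles divided by per-class scribbles), the unnormalized sum over voxels, and the clamping constant `eps` are assumptions of this sketch, and the function name `scribbles_balanced_ce` is hypothetical.

```python
import math
from typing import Sequence


def scribbles_balanced_ce(probs: Sequence[float], labels: Sequence[int],
                          eps: float = 1e-7) -> float:
    """Weighted cross-entropy over scribbled voxels only.

    probs[i]  : predicted foreground likelihood of scribbled voxel i
    labels[i] : 1 for a foreground scribble, 0 for a background scribble
    """
    n_fg = sum(labels)
    n_bg = len(labels) - n_fg
    if n_fg == 0 or n_bg == 0:
        raise ValueError("need at least one scribble of each class")
    # Assumed inverse-frequency weights: the rarer class gets the larger weight.
    w_fg = len(labels) / n_fg
    w_bg = len(labels) / n_bg
    loss = 0.0
    for p, s in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= w_fg * math.log(p) if s == 1 else w_bg * math.log(1.0 - p)
    return loss


# Three foreground scribbles and one background scribble, all predicted at 0.5.
example_loss = scribbles_balanced_ce([0.5, 0.5, 0.5, 0.5], [1, 1, 1, 0])
print(round(example_loss, 4))  # 5.5452, i.e. 8 * ln(2)
```

With these weights the three foreground voxels together contribute 4·ln 2 and the single background voxel also contributes 4·ln 2, so neither class dominates the loss despite the 3:1 imbalance.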

3 Experimental Validation

We compare our proposed ECONet with existing state-of-the-art methods in online likelihood inference, which are referred to as Histogram [boykovjolly2001interactive], Gaussian Mixture Model (GMM) [rother2004grabcut] and DybaORF-Haar-Like [wang2016dynamically]. In addition, to show the effectiveness of learning features in ECONet, we define ECONet-Haar-Like, which replaces the first learnable conv3d layer of ECONet with hand-crafted Haar-like features [jung2013generichaar] and learns the three fully-connected layers. Both DybaORF-Haar-Like and ECONet-Haar-Like utilize a GPU-based implementation of 3D Haar-like features. A GPU-based implementation of GMM is used [MONAI_Consortium_MONAI_Medical_Open_2020]. DybaORF was implemented using the CPU-based Random Forest implementation from [pedregosa2011scikit]. All experiments were performed on a Tesla V100 GPU with 32 GB of memory.


We use the UESTC-COVID-19 dataset for experimental validation and comparison of ECONet with existing methods [wang2020noise]. This dataset contains a total of 120 CT volumes with lung lesion labels, of which 50 are by expert annotators. In order to compare robustness of our proposed ECONet against expert annotators, we use these 50 CT volumes for all our experiments. In our validation, the ground-truth labels are only used for generating simulated interactions within a synthetic scribbler and to compute comparison metrics from likelihood predictions.

Training Parameters.

The Adam optimizer [kingma2014adam] with 200 epochs and an initial learning rate of 0.01, dropped to 0.001 at the 140th epoch, is used for training of ECONet-based methods. A dropout probability of 0.3 is used during training for all fully-connected layers. The size of each layer in ECONet is selected through line search ablation experiments (see Appendix A), and is as follows: (i) the input conv3d kernel size is 7×7×7, (ii) the number of filters in the input conv3d is 128 and (iii) the fully-connected layer sizes are 32×16×2. The best performing configuration from [wang2016dynamically] is used for DybaORF, which is 50 trees with a maximum tree depth of 20 and minimum samples for split equal to 6. The GMM-based method uses 20 Gaussians for each GMM, whereas in the Histogram-based method 128 bins were used to build each histogram. Similar to [wang2016dynamically], the likelihood from ECONet (and all comparison methods) is spatially regularized by applying GraphCut using the max-flow/min-cut algorithm [boykovjolly2001interactive]. The parameters of this GraphCut-based spatial regularization follow [luo2021mideepseg].
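To give a feel for how small the selected configuration is, the arithmetic below counts learnable weights, reading the layer sizes above as a 7×7×7 kernel with 128 filters followed by fully-connected layers of sizes 32, 16 and 2. It is a back-of-the-envelope sketch under stated assumptions: a single-channel CT input, a patch size equal to the kernel size (so the convolution output is a 128-vector), bias terms on every layer, and batch normalization parameters left out. The helper names are hypothetical.

```python
def conv3d_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus biases of a 3D convolution with a cubic kernel."""
    return in_channels * out_channels * kernel ** 3 + out_channels


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


KERNEL, FILTERS, FC_SIZES = 7, 128, (32, 16, 2)

conv_total = conv3d_params(1, FILTERS, KERNEL)  # assumes one input channel
fc_total = 0
n_in = FILTERS  # assumes the conv output per patch is a FILTERS-long vector
for n_out in FC_SIZES:
    fc_total += linear_params(n_in, n_out)
    n_in = n_out

total = conv_total + fc_total
print(conv_total, fc_total, total)  # 44032 4690 48722
```

Under these assumptions the network has on the order of 50 thousand parameters, about 90% of them in the single feature-learning convolution.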

Evaluation Metrics.

Segmentation results from each method are compared against ground truth labels from expert annotators from the UESTC-COVID-19 dataset using Dice similarity (DICE) and average symmetric surface distance (ASSD) metrics [luo2021mideepseg]. In addition to DICE and ASSD, we also evaluate comparison methods on their online training and inference execution time (referred to as Time for brevity) as well as the number of voxels with scribbles needed for achieving a given DICE and ASSD score.
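For reference, the Dice similarity between a binary prediction and a binary ground truth is twice the overlap divided by the sum of the two foreground sizes. The sketch below computes it on flattened binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions of this example, not details taken from the evaluation code.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice similarity of two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    size_sum = sum(pred) + sum(truth)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * overlap / size_sum


pred = [1, 1, 0, 0, 1]
truth = [1, 0, 0, 1, 1]
example_dice = dice_score(pred, truth)
print(round(100 * example_dice, 2))  # 66.67 (%)
```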

| Method | DICE (%) | ASSD | Time (s) | Synthetic Scribbles Voxels Required |
|---|---|---|---|---|
| ECONet (proposed) | 82.81 ± 8.77 | 7.57 ± 14.65 | 2.03 ± 1.79 | 2605 ± 2929 |
| ECONet-Haar-Like | 71.61 ± 12.43 | 20.28 ± 36.24 | 0.59 ± 0.09 | 3737 ± 2471 |
| DybaORF-Haar-Like [wang2016dynamically] | 66.81 ± 14.92 | 40.81 ± 46.40 | 6.33 ± 1.63 | 11699 ± 7383 |
| GMM [rother2004grabcut] | 50.96 ± 5.35 | 77.76 ± 38.77 | 0.12 ± 0.06 | 13502 ± 2209 |
| Histogram [boykovjolly2001interactive] | 49.63 ± 0.37 | 82.09 ± 31.60 | 0.21 ± 0.06 | 18862 ± 2928 |

Table 1: Quantitative comparison of different online likelihood generation methods using the synthetic scribbler from Section 3.1. Mean and standard deviation of DICE (%), ASSD, Time (s) and Synthetic Scribbles Voxels Required are reported for all methods.

3.1 Quantitative Comparison using Synthetic Scribbler

We employ a synthetic scribbling method based on the method used for training in [wang2018deepigeos]. The proposed synthetic scribbler first compares the inferred segmentation label against the ground truth to identify each mis-segmented region. For the first interaction, where the network is randomly initialized, the ground truth is used as the mis-segmented region. Given the volume of an under-segmented or over-segmented region, the synthetic scribbler labels a number of voxels randomly within that region; this number takes one of two values, selected by a condition on the region volume and chosen through empirical experiments. A likelihood-based segmentation label is then inferred using a comparison method with these synthetic scribbles. This synthetic interaction process is repeated 10 times and the metrics corresponding to the final interaction are reported. Since the number of synthetically scribbled voxels depends directly on the volume of a given under/over-segmented region, the number of voxels required by each method relates directly to how well that method performs. An ideal method needs the fewest synthetic interactions to achieve the best accuracy.
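One round of such a synthetic scribbler might be sketched as below. This is a simplified illustration, not the scribbler used in the experiments: it treats all under-segmented voxels as one region and all over-segmented voxels as another rather than handling each region separately, and the `fraction` parameter controlling how many voxels are labelled is a hypothetical stand-in for the volume-dependent rule described above.

```python
import random
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def synthetic_scribbles(pred: Set[Voxel], truth: Set[Voxel], fraction: float,
                        rng: random.Random) -> Tuple[Set[Voxel], Set[Voxel]]:
    """Return (foreground, background) scribbles sampled from mis-segmented voxels."""
    under = truth - pred  # lesion voxels the prediction missed
    over = pred - truth   # background voxels wrongly predicted as lesion

    def sample(region: Set[Voxel]) -> Set[Voxel]:
        if not region:
            return set()
        # Hypothetical rule: label a fixed fraction of the region, at least one voxel.
        n = min(len(region), max(1, round(fraction * len(region))))
        return set(rng.sample(sorted(region), n))

    return sample(under), sample(over)


truth = {(0, 0, x) for x in range(10)}
pred = {(0, 0, x) for x in range(5, 15)}
fg, bg = synthetic_scribbles(pred, truth, fraction=0.4, rng=random.Random(0))
print(len(fg), len(bg))  # 2 2

# First interaction: with no prediction yet, the whole ground truth is "missed".
first_fg, first_bg = synthetic_scribbles(set(), truth, fraction=0.4, rng=random.Random(0))
```

Because the number of sampled voxels scales with the size of the mis-segmented regions, a method that segments well after few interactions automatically receives fewer scribbled voxels in later rounds.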


Figure 3: Quantitative analysis using the synthetic scribbler on the UESTC-COVID-19 dataset, showing (a) the percentage of dataset samples that are below a given DICE score, and (b) the number of synthetically scribbled voxels needed to achieve the corresponding DICE score for each method. Plateaus in (b) indicate that a method does not require further interactively labelled voxels to improve accuracy.

Table 1 shows a quantitative comparison of the methods using the proposed synthetic scribbler. It can be observed that ECONet outperforms all existing state-of-the-art methods in terms of accuracy, while requiring the fewest synthetically scribbled voxels. In terms of efficiency, online training and inference of the proposed ECONet takes around 2 seconds combined, which is significantly faster than the 6 seconds for DybaORF; however, it is slower than methods that do not learn a classifier (i.e., GMM and Histogram).

To further analyze the quantitative results, we visualize the percentage of dataset samples below a given DICE score for all methods in Fig. 3(a). It can be observed that 70% of the dataset achieves above 80% DICE using ECONet. As compared to this, ECONet-Haar-Like has 50% and DybaORF-Haar-Like has 15% of samples above 80% DICE. It can also be observed that both the GMM and Histogram methods failed in most cases, achieving 50% DICE for most of the samples, which corresponds to labelling all voxels with the same label.

Fig. 3(b) presents an analysis of the number of synthetic scribble voxels required to achieve a given DICE for all comparison methods. It can be observed that for ECONet, an average of 1600 labelled voxels achieves a DICE of 80%. Similarly, ECONet-Haar-Like requires 1650 labelled voxels to achieve a DICE of 70%. Unlike the ECONet-based methods, DybaORF-Haar-Like requires a significantly greater number of labelled voxels (5000) and only achieves a DICE of 65%. Both GMM and Histogram fail, with additional labelled voxels having no effect on Histogram. Interestingly, for GMM, increasing the number of labelled voxels adversely affects the accuracy, resulting in a drop in DICE. We believe this is due to the limited representation capability of a GMM learning from voxel intensity alone, which is insufficient to model the additional ambiguous variations.

3.2 Qualitative Comparison using Scribbles from Non-expert Annotator

A non-expert annotator provided scribble-based interaction for labelling CT volumes from the UESTC-COVID-19 dataset. The provided scribbles were used for annotation using the learned likelihood methods, i.e., ECONet, ECONet-Haar-Like and DybaORF-Haar-Like. Fig. 4 shows the qualitative results from this experiment. As can be observed, ECONet is able to provide segmentation labels close to the ground truth, which is due to the use of learned features that enable the network to better differentiate lung lesions from the background.

Figure 4: Qualitative comparison of online likelihood inference methods using scribbles from a non-expert annotator. Segmentation labels are shown in red, while foreground and background scribbles are in green and blue, respectively. Under/over-segmented regions are marked for each comparison method.

4 Conclusion and Future Work

We proposed the Efficient Convolutional Online Likelihood Network (ECONet) for scribble-based interactive segmentation of lung lesions in CT volumes from COVID-19 patients. The lightweight architecture of ECONet enabled online training and inference using the scribble-interactive annotations. ECONet was learned online, without the need for any pre-training, from interactive labels for a given CT volume. A method for efficient online learning of ECONet was proposed, which consisted of extracting and using only the patches with user-provided scribble labels. For inference, the network was applied to the full volume using a fully convolutional approach. Experimental validation showed that the proposed ECONet significantly outperformed the existing state-of-the-art for online likelihood learning on the task of labelling lung lesions in COVID-19 patients. All ECONet-based methods outperformed the state-of-the-art DybaORF-Haar-Like method in terms of accuracy as well as online learning efficiency. ECONet achieved a 16% higher DICE score in 3× less time while requiring around 9000 fewer scribble-labelled voxels than DybaORF-Haar-Like.

In our future work, we envision using ECONet in our interactive segmentation pipelines, where it can assist in quick online adaptation and learning based on user-scribbles and input volume data. An extension of this work may also look into extending ECONet to multi-class online likelihood learning segmentation problems.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101016131 (icovid project). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement TRABIT No 765148. This work was also supported by core and project funding from the Wellcome/EPSRC [WT203148/Z/16/Z; NS/A000049/1; WT101957; NS/A000027/1]. This project utilized scribbles-based interactive segmentation tools from the open-source project MONAI Label [MONAI_Consortium_MONAI_Medical_Open_2020].


Appendix A Experiments for Searching Optimal Layer Sizes for ECONet


Figure 5: Searching for the optimal layer sizes for ECONet. Shows ablation experiments with varying (a) input conv3d kernel size, (b) number of filters in the input conv3d, and (c) size of the fully-connected layers against accuracy DICE (%) and Time (s) for both online inference and training using the corresponding ECONet. Optimal sizes selected using these experiments are: input conv3d kernel size 7×7×7 with 128 filters and 32×16 fully-connected layers. The largest impact on DICE comes from the number of filters in (b), which directly corresponds to our observation on the requirement of learned features. Increasing different layer sizes in ECONet significantly increases the online training and inference time, as evident in these experiments. All ablation experiments are performed on a single Tesla V100 GPU with 32 GB of memory.