Log In Sign Up

Contextual Two-Stage U-Nets for Robust Pulmonary Lobe Segmentation in CT Scans of COVID-19 and COPD Patients

by   Weiyi Xie, et al.

Pulmonary lobe segmentation in computed tomography scans is essential for regional assessment of pulmonary diseases. Automated segmentation is still an open problem, especially for scans with substantial abnormalities, such as in COVID-19 infection. Recent works used Convolutional Neural Networks for automatic pulmonary lobe segmentation. Convolution kernels in these networks only respond to local information within the scope of their effective receptive field, and this may be insufficient to capture all necessary contextual information. We argue that contextual information is critically important for accurate delineation of pulmonary lobes, especially when the lungs are severely affected by diseases such as COVID-19 or COPD. In this paper, we propose a contextual two-stage U-net (CTSU-Net) that leverages global context by introducing a first stage in which the receptive field encompasses the entire scan and by using a novel non-local neural network module. The proposed module computes the filter response at one position as a weighted sum of feature responses at all positions, where geometric and visual correlations between features determine weights. With a limited amount of training data available from COVID-19 subjects, we initially train and validate CTSU-Net on a cohort of 5000 subjects from the COPDGene study (4000 for training and 1000 for evaluation). Using models pretrained COPDGene, we apply transfer learning to retrain and evaluate CTSU-Net on 204 COVID-19 subjects (104 for retraining and 100 for evaluation). Experimental results show that CTSU-Net outperforms state-of-the-art baselines and performs robustly on cases with incomplete fissures and severe lung infection due to COVID-19.


page 1

page 4

page 7

page 9


Inf-Net: Automatic COVID-19 Lung Infection Segmentation from CT Scans

Coronavirus Disease 2019 (COVID-19) spread globally in early 2020, causi...

Dynamic deformable attention (DDANet) for semantic segmentation

Deep learning based medical image segmentation is an important step with...

Is the U-Net Directional-Relationship Aware?

CNNs are often assumed to be capable of using contextual information abo...

PiaNet: A pyramid input augmented convolutional neural network for GGO detection in 3D lung CT scans

This paper proposes a new convolutional neural network with multiscale p...

COVID-CT-Mask-Net: Prediction of COVID-19 from CT Scans Using Regional Features

We present COVID-CT-Mask-Net model that predicts COVID-19 from CT scans....

I Introduction

The human lungs consist of five disjoint pulmonary lobes. The right lung is composed of an upper, middle, and lower lobe, while the left lung only has an upper and a lower lobe. The lobes are separated by the pulmonary fissures, a double-fold of visceral pleura visible as a thin line on CT images. The lobes are functionally independent units because each has its own vascular and bronchial supply. As a result, the extent of disease often varies substantially across lobes, and lobe-wise assessment of pulmonary disorders is of great clinical importance. Computed Tomography (CT) is the best way to image the lungs in vivo.

In chronic obstructive pulmonary disease (COPD), one of the major causes of mortality and morbidity worldwide, measuring destruction due to emphysema at a lobar level, provides critical information for selecting target lobes for therapeutic intervention [grabenhorst2015radiologic]. In COVID-19, the pandemic disease caused by the SARS-Cov2 virus that is straining healthcare systems worldwide, the severity of the disease are summarized in a CT severity score where each lobe is scored visually by radiologists on a scale from 0 to 5. These scores are then summed to quantify lung involvement on a scale from 0 to 25. The score provides a tool to assess disease severity and progression over time and is used for clinical decision making. To automate the CT severity score, lobe segmentation in COVID-19 scans is needed. CT scans of COVID-19 patients are affected by extensive patchy ground-glass region and consolidations and may even show lobes or complete lungs filled with pleural fluid. Automated lobe segmentation is highly challenging in scans with such extensive pathological changes.

Many automatic lobe segmentation approaches focused on finding visible fissures. Assuming that once the fissures are detected, the lobe segmentation can be derived by interpolation. Both early hand-crafted fissure enhancement filters

[kubo1999extraction, kubo2000extraction, wang2006pulmonary, pu2008computational]

and more robust supervised learning methods

[van2007supervised, gerard2018fissurenet] relied heavily on local features, neglecting the global context. As a result, they often produced unsatisfactory results when pathological deformation, anatomical variation, or high noise levels are present. Moreover, because incomplete fissures are very common [raasch1982radiographic], interpolation of boundaries based on visible fissures may not suffice to find the lobe borders reliably. Therefore, contextual features were introduced [kuhnigk2005new, van2010automatic, lassen2012automatic, bragman2017pulmonary] in the form of anatomical relations between lobes and nearby airways, vessels, and the lung borders. Most recently, LobeNet [gerard2019pulmonary] proposed a fully convolutional neural network (FCN) for segmenting the lobes that uses segmentations of fissures and lung obtained from their previously proposed FCN methods. These approaches show that incorporating anatomical relations improves the robustness for the lobe segmentation. However, these approaches require prior segmentation of the lung or other structures in the lung, thus requiring a multi-stage complex process. Each stage is then optimized separately, and therefore the final solution may be sub-optimal.

Based on fully convolution neural networks (FCNs), end-to-end lobe segmentation methods have been proposed [ferreira2018end, terzopoulosautomatic]. In these works, it was assumed that the local and contextual features for segmenting pulmonary lobes could be learned directly from the CT images without explicitly injecting contextual information. However, fully convolutional neural networks (FCN) may not be sufficient to harvest the global context. First, the effective receptive field may be much smaller than the theoretical receptive field due to the nature of the convolution [luo2016understanding]. Second, input CT images are mostly cropped before feeding into the networks to reduce GPU memory requirements, thus missing crucial scan-level semantics during the training. Third, in FCN-based methods, the context-dependency is often underused, especially long-range dependencies between objects and object parts, because intrinsically, the prediction for each voxel in an FCN is made independently of voxels outside its receptive field. However, contextual dependencies such as anatomical relations are essential in lobe segmentation.

Therefore, this paper focuses on capturing the global context in two ways. First, we introduce a two-stage U-net in which the first stage encapsulates the entire scan, and both stages are trained simultaneously in an end-to-end fashion. Second, we introduce a non-local neural network module that can be plugged into the two-stage U-Net. This module computes a feature response at one location using both appearance and geometric features from all other positions at the scan-level. We call this approach a contextual two-stage U-net, or CTSU-Net, for short.

The main contributions of this paper are as follows:

  • We propose a lobe segmentation framework that uses a novel two-stage U-net architecture that can be trained end-to-end and contains a non-local neural network module for capturing the global context of a large 3D CT scan.

  • The proposed CTSU-Net is robust and produces accurate lobe segmentations even for scans with severe pathology, and can be easily extended to other medical image analysis problems.

  • CTSU-Net is fast and memory-efficient. It requires only a standard GPU with 12GB memory to train, and inference takes around 30 seconds to produce lung and lobe segmentations for a full thoracic CT scan.

I-a Related Work

Fully convolutional network (FCN) based methods have become the standard in medical image semantic segmentation [cciccek20163d, milletari2016v]. The FCN approach can be thought of as sliding a classification network around a region in an input image, and each sliding region is then processed independently. Consequently, FCN are essentially local and do not take the full global context of an image into account. Atrous spatial pyramid pooling [chen2017rethinking, chang2018pyramid] was later adopted to embed contextual information at different scales with dilated convolutions filters but suffering from the gridding artifacts [yu2017dilated]. Fully connected layers [nikolov2018deep] were used to capture the global context in the segmentation of head and neck anatomy for radiotherapy. However, in this work, cropped CT scans were used as the input because of the large memory footprints of fully-connected layers. For aggregating contextual dependencies, [poudel2016recurrent]

introduced recurrent neural networks to aggregate contextual information on the axial slices for cardiac segmentation in multi-slice MRI. A known issue with recurrent network networks is that they suffer from vanishing gradients

[pascanu2013difficulty] and therefore are hard to train. The global context could also be explicitly defined using Graph Models such as dense conditional random fields (CRF) [kamnitsas2017efficient]

. However, due to their heavy computational demands, dense CRFs are often only used as the post-processing steps and optimized separately on a heuristic basis, making it hard for this approach to scale well.

Attention is widely used for various tasks such as machine translation, visual question answering, image and video classification, and semantic segmentation in natural images. Self-attention methods [bahdanau2014neural, vaswani2017attention] calculate the context encoding at one position by a weighted summation of embedding at all positions in sentences, which is a natural way to capture sentence context. As one of the self-attention applications, a non-local neural network was proposed for semantic segmentation [wang2018non] by computing a self-attention map for each feature based on all the other features in an input CNN feature map. The attention weights were determined by predefined similarity measurements between features in a linear-projected subspace. There are many recent extensions of this non-local method in semantic segmentation.

DANet [fu2019dual] applies both spatial and channel attention to gather contextual information across channels of a feature map. Although non-local neural networks show superior performance on many benchmarks, their high computational costs limited their application to volumetric medical data. To reduce the computational intensity, the CCNet [huang2018ccnet] was proposed. This network employed a simple criss-cross trick which reduces the space and time complexity of the non-local module from to in two-dimensional images. Our approach is motivated by the success of using self-attention in the above works for semantic segmentation. Our self-attention module uses the criss-cross trick, while also introducing geometric features.

Ii Data

(a) COPD set GOLD stages
GOLD stages # subjects for training #subjects for testing
GOLD0 1709 433
GOLD1 319 80
GOLD2 734 184
GOLD3 441 110
GOLD4 226 57
Non Spirometry 30 2
Non Smoking 45 11
PRISm 496 123
Total 4000 1000
(b) COVID-19 set CO-RADS scores
CORADS #subjects for training #subjects for testing
1 24 23
2 10 9
3 20 20
4 17 16
5 24 24
6 9 8
Total 104 100
TABLE I: Characteristics of the two data sets used in this study. (a) lists the distribution of GOLD stages and other classes, see [regan2011COPDGene] in the COPD data set. (b) gives the distribution of CO-RADS scores across the training and test sets. CO-RADS score 1-6 indicates the level of suspicion for COVID-19 positive disease, ranging from very low, low, equivocal, high, very high, and confirmed PCR positive, respectively.

CT scans used in this study were obtained from two sources. We refer to the first set as the COPD set, and to the second set as the COVID-19 set.

A large set of scans from subjects with COPD, ranging from mild to very severe, was obtained from the COPDGene study [regan2011COPDGene]. This is a clinical trial with data from 21 imaging centers in the United States. In total, COPDGene enrolled 10,000 subjects. Each subject underwent both inspiration and expiration chest CT. Image reconstruction uses sub-millimeter slice thickness and in-plane resolution, with edge-enhancing and smooth algorithms. The CT protocols are detailed elsewhere [regan2011COPDGene]. Data from COPDGene is publicly available and can be retrieved after submitting an ancillary study proposal (ANC-398 was used for this work).

We randomly selected 5000 subjects and used only Phase I inspiration CT scans (one scan per subject). Subjects were randomly grouped into a training set ( = 4000) and a test set ( = 1000). Slice thickness ranged from 0.625-0.9mm and pixel spacing from 0.478-1.0mm. Scans were performed using 200mAs and a tube voltage of 120kVP.

The other data set was obtained from Radboud University Medical Center, Nijmegen, the Netherlands. On March 18, 2020, this institution implemented a low-dose non-contrast CT protocol and all patients who arrived at the hospital with suspicion of COVID-19 disease and inpatients for whom COVID-19 was considered a possibility underwent CT. To date, 518 CT scans have been made. We only included scans from subjects who did not object to the use of their scans for research purposes. All scans were anonymized and permission for research use was obtained from our review board (file number CMO 2016-3045, Project 20027). It is unclear at this stage if we will be allowed to publicly share these scans. We will file a request for this later.

We randomly selected 204 subjects and use one scan per subject by selecting the CT scan of the smallest slice thickness in a study. Scans have a sub-millimeter pixel spacing and a slice thickness of 0.5 mm. 104 of these scans were used for training and the other 100 for testing.

See Table I for the distribution of GOLD stages in the training and the test set for the COPD set and the distribution of CO-RAD scores from the COVID-19 set. The CO-RAD scores defines the level of suspicion COVID-19 and was reported by in the radiology reports. Note that PCR status was not yet available for most subjects.

From the two training data sets, we selected 100 scans as the validation set for the COPD set, and we selected 24 scans for validation from the COVID-19 set. set for retraining all the models. Train, validation and test partitioning ensures the distribution of either GOLD stages or CO-RAD scores distributed similarly across partitions.

Ii-a Reference Standard

Lobe segmentation references were obtained from Thirona, a company that specializes in chest CT analysis. First, automated segmentation of the left and right lung was generated using a commercialized software (LungQ, Thirona, Nijmegen, NL), followed by manual refinement if needed. Second, automatic algorithms [van2007supervised, van2009automatic, van2010automatic] were used to extract the lobar boundary with possible interpolation for incomplete fissures using nearby airways and vasculature information. Next, the automatically found lobar boundaries were manually corrected separately for the left and the right lung, by trained analysts with at least one year experience in annotating the pulmonary structures on CT. Analysts are medical students who received training in lung anatomy and CT imaging and were instructed to draw a complete lobar boundary if needed. Once the manual modifications are made, the lobar boundaries are updated, possibly followed by final corrections.

Iii Methods

We define the lobe segmentation problem as a voxel-wise classification problem. Given a scan , the goal is to predict the voxel label for every spatial location , where the label set representing the background, left upper, left lower, right upper, right lower, and the right middle lobe, respectively.

Fig. 1: The overview of our proposed framework with two cascaded Contextualized U-Nets. The output from the first stage is concatenated with the cropped 3D patches as the second stage input.

This paper proposes a lobe segmentation framework consisting of two cascaded CNNs and is depicted in Fig. 1. Each CNN follows the design of the 3D U-Net [cciccek20163d], but we propose several novelties in this work that enable the CNN to capture more context effectively. We refer to the architecture of these CNNs as the contextualized U-Net (CU-Net), and the details are explained later in this section.

Iii-a Cascading two CNNs

The first CU-Net reads an input scan at a down-sampled resolution to segment coarse lobes and lobe borders. The resolution of these coarse outputs is subsequently upsampled to a higher resolution. Then the high-resolution input scan and the output of the first CU-Net are concatenated and cropped into 3D patches to be fed to the second CU-Net. The purpose of this concatenation is to introduce extra context captured at the first CU-Net as the second CU-Net uses patch-based inputs at a higher resolution to segment lobes and lobe borders. The cascading of two U-Nets is trained end-to-end, allowing both local details in 3D patches and scan-level context to be learned in the same optimization process. Furthermore, we use the errors found in the predictions of the first CU-Net to sample 3D patches for training the second stage, which encourages the second CU-Net to focus on the regions where the first CU-Net fails. This technique can be seen as a form of online hard example mining [shrivastava2016training].

Iii-B Contextualized U-Net

The contextualized U-Net architecture (CU-Net) is a 3D U-Net architecture [cciccek20163d]

with less number of convolution filters and an additional non-local module. The CU-Net has three down-sampling layers in the encoding path, and each layer consists of two convolutions and a max-pooling operation. Following the down-sampling path, two more convolutions are used to double the number of convolution filters. We then place the non-local module before up-sampling. In the up-sampling path, three layers are used to reconstruct the resolution, and each contains one tri-linear interpolation, followed by two convolutions to reduce the interpolation artifacts. In the end, features are reshaped via a single 1x1x1 convolution in two parallel output branches, and each corresponds to a different learning objective; one produces 6-channel softmax probabilities for segmenting the background and the five lobes. The other provides a single channel probability map by sigmoid function for predicting the lobe border. Features from 3x3x3 convolutions are batch normalized and activated via a rectifier linear unit (ReLU). No dropout is used.

The first CU-Net uses padded convolutions, whereas the second uses valid convolutions. The details regarding CU-Net network architecture on both stages are provided in Table

II, where the names of the down-sampling layers are prefixed with ’Down’, and the name of up-sampling layers are prefixed with ’Up’. The numbers listed are based on the execution order.

Layer CU-Net I CU-Net II
Down1 3x3x3,1-16 3x3x3,16-24 3x3x3,8-24 3x3x3,24-48

2x2x2 max pool, stride 2

Down2 3x3x3,24-24 3x3x3,24-48 3x3x3,48-48 3x3x3,48-96
2x2x2 max pool, stride 2
Down3 3x3x3,48-64 3x3x3,64-128 3x3x3,96-96 3x3x3,96-192
2x2x2 max pool, stride 2
Bridge 3x3x3,128-128 3x3x3,128-256 3x3x3,192-192 3x3x3,192-384
Non-local , , , ,
Up1 3x3x3,384-128 3x3x3,128-128 3x3x3,576-192 3x3x3,192-192
trilinear interpolation x2
Up2 3x3x3,176-48 3x3x3,48-48 3x3x3,288-96 3x3x3,96-96
trilinear interpolation x2
Up3 3x3x3,72-24 3x3x3,24-24 3x3x3,144-48 3x3x3,48-48
trilinear interpolation x2
Output 1x1x1,6 1x1x1,1 1x1x1,6 1x1x1,1
MAC 5.71 G 8.79 G
#Parameter 3.85M 9.24 M
TABLE II: Architectures for the first and the second stage of the contextualized U-Nets. The convolution filters are named by the kernel sizes and number of filters as (stride 1 for all). Non-local linear embedding parameters are defined in Eqs. (3) and (5). denotes the operation performed in dual paths.

Iii-C The Non-Local module

The original non-local neural network [wang2018non] for semantic segmentation computes the feature response at a position as a weighted sum of the features at all locations in the input feature maps as


where at location is computed as a weighted sum using the feature correspondence between the feature at the location and all features indexed by in the input feature map . The feature correspondence between feature and is also called the self-attention in this context, computed by the pairwise function , which is used to weigh the feature embedding before normalizing by . For simplicity, is set to a linear projection: , and the pairwise function can be the embedded Gaussian function using linear embeddings defined as . We set the normalizing factor as . Then becomes the softmax computation along the dimension written, in matrix multiplication form, as . To make the input and output of the non-local module the same size, the is reshaped to have the same dimensions as the input by applying the linear reconstruction function . Therefore, the non-local response at location can be written as .

The feature response automatically achieves a global receptive field with respect to the input. The computed self-attention map captures the contextual dependencies, as relevant features would have higher attention responses.

However, the original non-local module disregards the spatial locations of features which may provide critical information for the lobe segmentation. Hence, we propose to compute non-local responses with a geometric term considering the relative locations between features. Here, we denote as geometric features for the position and . is extracted by computing the center of the receptive field of the feature at position with respect to the input image and normalized by the size of the input image. Noting that if the feature map is produced from a cropped input, the center of receptive field is then shifted according to the offset to the original input. The normalized geometric features are then shifted by 0.5 to have zero mean. is the pairwise function for measuring correlations. Then, the non-local response with geometric terms is defined as:


A similar reparameterization can be applied using the softmax function row-wise under linear projections to reformulate Equation 2 into matrix multiplications:



is parameterized as a dot product in a subspace projected using the linear transformation matrix

and . Similarly, and are linear transformations that are used to project the geometric features into a subspace where their correspondence is measured by the pairwise kernel function , . Such correspondence is then trimmed at 0, to restrict geometric relations within a certain threshold.

The Equation 3 however, has high computational cost because the self-attention map requires computing and on all pairs of locations. Each term has a complexity in time and space of where is the dimension of linear projected subspace and denotes the width, height, and depth of a 3D feature map. To reduce computational complexity, we adopt the criss-cross trick [huang2018ccnet], which has a time and space complexity of . In CCNet the Equation 2 is modified to:


where indicates the neighboring voxels with respect to under criss-cross connectivity. Such sparse connectivity requires having three recurrent criss-cross modules to cover all spatial locations in computation.

Given the input feature , the non-local response for a feature location at each -th recurrent criss-cross module can be written as follows:


At each recurrent step, the non-local response is used as the input feature for computing the non-local response for the next recurrent step. For the size of scans used in this work, full global context can be achieved with three recurrent steps for a 3D input feature map. Therefore, we set .

Iii-D Online Hard Example Mining

As shown in Fig. 1 using the red dashed lines, we compute the mean square errors (MSE) between the lobe-wise softmax probabilities of the first CU-Net and the lobe reference standard. We then go through all sliding window 3D patches, and find patches with the highest integral of MSE and use them to train the second CU-Net.

is set to 1.0 such that all patches are used to train at the beginning and continuously reduced until it reaches a coverage of only approximately 20% of the scan volume at the end of the training process. The proposed online hard example mining does not introduce extra forward and backward passes on the network, therefore the additional computational cost is trivial.

Iii-E Learning Objectives

There are two learning objectives for each CU-Net: lobe segmentation and lobe border segmentation. Therefore, the final loss function is a summation of four and each is the generalized Dice loss

[sudre2017generalised]. The lobe border reference is pre-computed from the lobe reference by detecting object boundaries.

Let be the segmentation reference with -th voxel values for the class label and be the predicted probabilistic map for the label over -th image voxel, then the generalized Dice loss is defined as:

with , where the in total number of voxels for the class label in the segmentation reference.

is to re-balance learning against the variance in object volumes.

Iv Experiments

As the COVID-19 pandemic emerged only recently, it was not possible to obtain a large amount of CT scans with annotations of COVID-19 patients. Therefore, we used a transfer learning approach in our experiments. For training of the models on the COVID-19 data, the models were initialized with the trained weights from our models developed on the COPD data set.

Iv-a Training details

Training, validation, and testing of each experiment were carried out on a machine with a NVidia TitanX GPU with 12 GB memory. The methods were implemented using Python 3.6, Pytorch 1.1.0 library

[paszke2017automatic]. The trainable parameters of each method were initialized using Kaiming He initialization when training from scratch [he2015delving]

and are optimized using stochastic gradient descent with a momentum of 0.9, and the initial learning rate is set to 10e-6. The initial models were trained using CT scans from the COPD data set. Therefore, the visual patterns in COVID-19 scans are not familiar to these models. For efficiently training on new visual patterns, all models were retrained using a combined loss between the generalized Dice loss (as we used to train initial models) and top-

cross-entropy loss where is set to 30% of all voxels in the input. The top- cross-entropy loss was implemented simply as the voxel-wise cross-entropy loss but selecting only voxels with largest cross entropy to back-propagate.

Iv-B Comparison with previous work

We compared our approach with two state-of-the-art baselines.

Iv-B1 3D U-Net

We implemented 3D U-Net following the original paper [cciccek20163d]. The input is a mini-batch of two 3D patches randomly cropped from the pre-processed scan (refer to IV-D). As a result of using valid convolutions, the output of this network is voxels. During test time, the softmax probabilities of all 3D patches are tiled together by sliding over the entire scan in the output without overlaps to build up a scan-level probability map.

Iv-B2 FRV-Net

We compare the proposed method with an existing end-to-end lobe segmentation method called FRV-Net [ferreira2018end] which follows the design of the V-Net [milletari2016v] and extensively uses the idea of deep supervision at almost all scales in the up-sampling pathways. Similar to our approach, FRV-Net introduced multi-tasking into the training process by using both lobe segmentation and prediction lobe borders simultaneously. We implemented the FRV-Net architecture following the paper at our best efforts. Note that this work uses specific pre-processing and post-processing steps where the input scan is resized into a fixed size of and intensities are clipped into the range HU.

Iv-C Ablation studies

To assess the contribution of the different novel components that are introduced in CTSU-Net, we performed several ablation studies. During these experiments, the models are retrained from scratch using the COPD training set and performance is measured on the COPD test set of 1000 cases. The performance of our proposed model is assessed without the geometric features in the non-local module, and without the non-local module in the two 3D U-Net CNNs, thereby reducing the architecture to a two-stage cascaded 3D U-Net.

Iv-D Pre-processing and post-processing

All training and test scans were standardized by clamping intensity values to the range before re-scaling into . Then all scans were down-sampled to have a in-plane resolution while z-spacing is adjusted to make the scan isotropic.

The input size of the second CNN for our proposed method consists of two sized 3D patches. The pre-processed scan is down-sampled by a factor of 2 as the input for the first stage. The predictions of all 3D patches at the second stage are tiled together by sliding over the entire scan to produce a scan-level probability map, which is used to generate the final prediction.

As a post-processing step, the predictions were then up-sampled by nearest neighbor interpolation to match the original resolution of the scans. All evaluations are performed by using predictions and reference segmentations at the original resolution.

Iv-E Evaluation Metrics

The Intersection over Union (IOU), also known as Jaccard index, and 95% percentile in Hausdorff distance (HD95) between predictions and segmentation references were used for quantitative evaluation of segmentation performance. The IOU between two binary masks

is defined as:

The Hausdorff distance (HD) measures the surface distance between two binary masks. Denote two surfaces as , from the masks , and coordinate indices on the surface as ,. HD is expressed as:

with being the Euclidean distance. We use the 95% percentile of the HD distance, and refer to this as HD95.

The overall performance of the method was evaluated by computing the average of the per-lobe metrics. A Wilcoxon signed-rank test was employed to assess whether the performance difference was statistically significant (p <0.01 with Bonferroni correction).

In addition, we computed the number of Multi-Adds operations (MAC) and the number of parameters to assess computational efficiency. We also provide a comparison with independent human readers for a subset of the COPD data.

Fig. 2: Box and whisker plots of IOU per-lobe for different methods on the COPD data set (top) and the COVID-19 data set (bottom).

V Results

V-a Quantitative results

(a) COPD results Method MAC #Param Metric Overall LUL LLL RUL RLL RML 3DU-Net [cciccek20163d] 10.5G 16.32M IOU HD95 0.9150.037 7.0207.361 0.9440.033 3.8016.173 0.9370.007 5.84813.87 0.9180.043 7.7839.227 0.9370.032 5.37411.73 0.8400.032 12.3413.47 FRV-Net [ferreira2018end]  7.2G 15.5M IOU HD95 0.9180.038 8.6729.239 0.9500.038 4.58612.73 0.9320.050 5.8479.048 0.9170.050 10.4420.02 0.9420.033 5.57013.65 0.8480.103 16.9021.51 CTSU-Net (ours) 14.5G 13.1M IOU HD95 0.9490.026 2.7723.342 0.9620.020 2.0164.955 0.9590.023 2.0163.347 0.9520.030 3.1026.388 0.9600.010 2.4896.729 0.9120.080 4.2345.981 (b) COVID-19 results Method MAC #Param Metric Overall LUL LLL RUL RLL RML 3DU-Net [cciccek20163d] 10.5G 16.3M IOU HD95 0.8990.060 6.3388.894 0.9340.038 4.4548.894 0.8860.104 6.0997.166 0.9020.081 7.61412.29 0.9050.091 6.07312.29 0.8680.109 7.45210.73 FRV-Net [ferreira2018end]  9.3G 15.5M IOU HD95 0.9020.053 6.5866.883 0.9360.029 4.1736.527 0.9060.073 4.7857.820 0.9080.065 9.06412.49 0.9100.089 5.87612.53 0.8490.119 9.03111.61 CTSU-Net (ours) 13.3G 14.1M IOU HD95 0.9120.044 6.4479.052 0.9350.028 8.36324.09 0.9080.054 6.63514.43 0.9190.058 6.35111.25 0.9190.067 3.7628.019 0.8770.091 7.1279.723 RUL, RML, RLL, LUL, LLL: Right upper, Right middle, Right lower, Left upper, Left lower lobes. Overall: per-lobe mean.
TABLE III: Quantitative results on the COPD and COVID-19 test sets. IOU and HD95 (in mm) metrics are given in mean standard deviation. Boldface denotes the best result for each column.

Table III reports the quantitative results on both data sets. The quantitative results show that our proposed method significantly outperformed the previously published methods in segmenting all lobes both when training from scratch on the COPD data set and retraining on the COVID-19 data set ( with Bonferroni correction). We see that using the two cascaded networks boosts the IOU performance from 0.915 to 0.940 for the COPD set and from 0.899 to 0.912 for COVID-19 data. The non-local module helps to further improve the IOU to 0.949 and substantially reduces the HD95 from 8.2 to 2.8mm on the COPD data.

Box and whisker plots are provided in Fig. 2

. These plots show that for both the COVID-19 and the COPD cases, the right middle lobe is the most difficult to segment, which is not surprising given that it can have incomplete fissures at both its boundaries with the upper and lower right lung lobes. Especially for the COPD data, CTSU-Net clearly outperforms the two baselines for the right middle lobe. For the COVID-19 cases, the differences are less pronounced, but it can be observed that there are less outliers with low IOU, indicating CTSU-Net exhibits more robust performance.

In terms of computational efficiency, the proposed method consumes even less memory than the baseline approach, with only a slight increase in the Multi-Adds operations (MAC). Hence, we conclude that the proposed method outperforms the other methods without introducing substantial computational overhead. The proposed method processes a single scan at test time in 20 seconds on average.

V-B Ablation study

Table IV shows the results of the ablation study, in which two of our novel components, the non-local module and the geometric features, are removed from CTSU-Net. These models were trained on the COPD training set and reported metrics are the performance on the COPD test set. The results demonstrate the added value of the non-local module and show that the introduction of the geometric features increases the performance over the non-local module alone. This effect is most pronounced for the 95% Hausdorff Distance metric.

Method Two-stage Non-local Geo-metric HD95 IOU only two-stage 8.12713.312 0.9400.031 w/o geometric 5.52310.72 0.9420.031 CTSU-Net 2.7723.342 0.9490.026
TABLE IV: Ablation study on the COPD data set for the non-local module (Non-local) and the geometric features (Geometric) into the two-stage cascading framework.

V-C Effect of the Non-Local module

Before Non-Local After Non-Local
Fig. 3: Effective Receptive Field (ERF) before the non-local module (left) and after (right). The green area indicates non-zero gradients (with respect to the input scan) of a feature at a location in the input scan corresponding to the red square.

To measure the effective receptive field (ERF) size before and after the non-local operation, we computed the gradients of the feature at the location in the feature map with respect to the input image . For simplicity, only the feature maps at the first CU-Net were studied. The ERF of the features at the same corresponding location before and after non-local operation are visualized in Fig. 3 for one axial slice. The figure renders non-zero gradients in green and indicates the center of the ERF with a red square. The center is a mapped coordinate from the chosen feature in the feature map to the input image via up-sampling, thus a slight shift may occur. The left image shows the ERF before the non-local operation is contained in a square due to the nature of stacked convolutions. However, the ERF after non-local on the right side shows a non-square distribution, reaching the other side of the lung. We therefore conclude that the non-local module can enlarge effective receptive field dramatically and in theory can achieve a global receptive field.

Fig. 4: Qualitative comparison of segmentation results for six representative test cases. The left three columns show COVID-19 cases, the right three columns show COPD cases. From top to bottom: input image, 3DU-Net baseline, CTSU-Net method w/o the non-local module, the proposed CTSU-Net, and the segmentation reference.   right upper,   right middle,   right lower,   left upper,   left lower lobes.

V-D Qualitative Results

Fig. 4 shows the qualitative performance of the proposed method and includes the 3D U-Net baseline, the proposed method without the non-local module, and the reference segmentations for comparison. We selected three COPD and three COVID-19 cases with various levels of pathological and anatomical variations. We observe that all methods usually do not produce oversegmentation of the lungs. By adding contextual dependencies using the non-local module with the appearance and geometric terms, we see that the proposed method generates generally smoother lobe borders and is able to infer about lobe shapes from the global context.

V-E Comparison with human readers

To evaluate human performance, we asked two independent human readers to manually segment the lobes from scratch given a segmentation of the lung. Their results are evaluated on a random set of 100 scans from the COPD test set. The human readers achieved IOU and HD95 on average, while the proposed method achieved IOU and HD95. The human readers are both significantly better than the other methods, whereas there is no significant difference between the readers ().

Vi Discussion and Conclusion

We have presented a novel method using contextualized two-stage neural networks for segmenting pulmonary lobes in CT images. The proposed method is capable of capturing global context by introducing a non-local module. This proposed non-local module takes geometric features into account when computing the self-attention. We show in our results that introducing global context improves the lobe segmentation performance significantly on the COPD and the COVID-19 data set. The Hausdorff distance metric in the ablation study shows that using geometric features is effective for generating finer object boundaries. This smoothing effect can also be observed from the qualitative results, where the lobe boundaries from the proposed method are more consistent with the reference lobe shapes. Our proposed architecture is a general framework and can be applied or extended for other medical segmentation problems. In terms of computational efficiency, our method has the same level of Multi-Adds operations (MAC). It requires even fewer trainable parameters compared to the standard 3D U-Net. Our method can be trained and tested on a consumer-level GPU with 12 GB memory, and inference time is around 30 seconds for a full resolution CT scan.

Segmentation of lobes in scans of patients with a severe pneumonia due to COVID-19 is not an easy task. In this work, we used only 104 COVID-19 CT scans for training. Thanks to pre-training on 4000 COPD scans, we still obtained good results with such a small training set and we were able to provide lobe segmentations robust to the presence of ground-glass, consolidations and crazy paving. Lobe segmentation is an important prerequisite for accurate quantification of lung damage in COVID-19 CT scans. Fig. 4 shows that the standard 3D U-Net may miss areas of consolidation (third row), but even CTSU-Net misses part of a lobe when this lobe is completely filled with pleural fluid (first column). We hypothesize that a larger training set would further improve performance on cases with gross pathological changes that are not yet well represented in the 104 scans we had available for training at the time of this research. Nevertheless, the results presented here are sufficient for further analysis and we believe that they will prove useful in automated per-lobe severity scoring. This is a topic for future research.

We freely share our segmentation algorithm with the research community on