Automatic airway segmentation from Computed Tomography using robust and efficient 3-D convolutional neural networks

03/30/2021
by A. Garcia-Uceda, et al.

This paper presents a fully automatic and end-to-end optimised airway segmentation method for thoracic computed tomography, based on the U-Net architecture. We use a simple and low-memory 3D U-Net as backbone, which allows the method to process large 3D image patches, often comprising full lungs, in a single pass through the network. This makes the method simple, robust and efficient. We validated the proposed method on three datasets with very different characteristics and various airway abnormalities: i) a dataset of pediatric patients including subjects with cystic fibrosis, ii) a subset of the Danish Lung Cancer Screening Trial, including subjects with chronic obstructive pulmonary disease, and iii) the EXACT'09 public dataset. We compared our method with other state-of-the-art airway segmentation methods, including relevant learning-based methods in the literature evaluated on the EXACT'09 data. We show that our method can extract highly complete airway trees with few false positive errors, on scans from both healthy and diseased subjects, and also that the method generalizes well across different datasets. On the EXACT'09 test set, our method achieved the second highest sensitivity score among all methods that reported good specificity.


1 Introduction

Segmentation of the airway tree from thoracic Computed Tomography (CT) is a useful procedure to assess pulmonary diseases characterized by structural abnormalities of the airways, such as bronchiectasis (widening of the airways) and thickening of the airway wall. To quantitatively assess these conditions on the CT scan, clinicians are interested in having individual airway measurements, including airway diameter, wall thickness and tapering [Kuo2017, Kuo2020]. The bronchial tree is a complex 3D structure, with many branches of varying sizes and different orientations. Segmenting airways by manual or semi-automatic methods is extremely tedious and time-consuming, taking more than 15 hours per scan manually [Kuo2017], or up to 2.5 hours per scan semi-automatically [Tschirren2009]. Thus, a fully automatic airway segmentation method is important to provide the accurate, effortless and user-bias-free segmentation of the airway tree needed to quantify airway abnormalities.

In CT scans, airways appear as tubular structures with an interior of typically low intensity (the lumen) enclosed by a structure of higher intensity (the airway wall). Moreover, airways are surrounded by a background which can be of low intensity (the lung parenchyma) or higher intensity (typically the vessels). The earliest airway segmentation methods used the region growing algorithm [Mori1996, Sonka1996] to segment the airway lumen, relying on the intensity difference between the airway lumen and wall. Region growing can accurately capture the central bronchi, but tends to systematically fail at extracting the smaller peripheral bronchi, missing a large portion of the airway tree. Also, when segmenting the smaller airways it often results in segmentations leaking into the lung parenchyma. This is due to the reduced intensity difference and reduced signal-to-noise ratio between the airway lumen and wall, caused by partial volume effects near the smaller airways. Many airway segmentation methods build upon the region growing algorithm, using it as an initial step to segment the larger bronchi and then applying additional operators to detect smaller airways while preventing leakage [Fetita2004, Fabijanska2009, Graham2010, Lo2010, Lo2009]. Extensions to region growing based methods have been widely reviewed [Pu2012, Rikxoort2013]. In the airway extraction challenge EXACT'09 [LoExact2012], 15 algorithms were evaluated and compared, out of which 11 were region growing based methods. The results showed that all participating methods missed a significant number of the smaller branches, and many methods had a large number of false positive errors.

Several airway segmentation methods use machine learning classifiers, either for voxelwise airway classification [Lo2010, Bian2018] or to remove false positive airway candidates from a leaky baseline segmentation [Inoue2013, Meng2016]. These classifiers (kNN [Lo2010], AdaBoost [Inoue2013], support vector machines [Meng2016] or random forest [Bian2018]) operate on a set of predefined image features: multiscale Gaussian derivatives [Lo2010], multiscale Hessian-based features [Inoue2013, Meng2016, Bian2018] or image texture features with local binary patterns [Bian2018]. These methods can achieve more complete airway tree predictions than previous purely intensity-based methods, with fewer false positives. However, they are highly dependent on the image features used to train the classifier, and may be time-consuming to apply due to the time needed to compute the image features [Lo2010, Meng2016, Bian2018].

Recently, many state-of-the-art methods for medical image segmentation tasks have used deep learning [Litjens2017], and in particular convolutional neural networks (CNNs) [Long2015]. The main advantage of deep CNN methods over classical learning-based techniques is that the extraction of relevant image features is done automatically from data in an end-to-end optimised setting. Several CNN-based methods have been applied for airway segmentation [Charbonnier2017, Yun2019, Jin2017, Meng2017, GarciaUceda2018, Qin2019, Qin2019-2, Zhao2019, Wang2019, Qin2020, Qin2021, Zheng2020, Zhou2021, Nadeem2021, GarciaUceda2019]. Charbonnier et al. [Charbonnier2017] applied CNNs to detect and remove leakage voxels from a leaked region growing based segmentation. Yun et al. [Yun2019] applied the so-called 2.5D CNN, a pseudo-3D CNN which processes the three perpendicular 2D slices around each voxel, to perform voxelwise airway classification. The methods [Jin2017, Meng2017, GarciaUceda2018, GarciaUceda2019, Qin2019, Qin2019-2, Zhao2019, Wang2019, Qin2020, Qin2021, Zheng2020, Zhou2021, Nadeem2021] are based on the U-Net architecture [Ronneberger2015]. This model and its 3D extension [Cycek2016, Milletari2016] are highly successful for medical image segmentation. The main advantage of the U-Net over voxelwise CNN models is that it can process entire images in one forward pass through its encoding / decoding structure, producing a segmentation map directly. Jin et al. [Jin2017] and Meng et al. [Meng2017] applied the U-Net on local volumes of interest around airways, guided by the centerlines from a baseline segmentation [Jin2017] or by tracking the airways extracted from the U-Net [Meng2017]. Garcia-Uceda et al. [GarciaUceda2018] applied the U-Net on large image patches extracted from the 3D scans, using various data augmentation techniques. Garcia-Uceda et al. [GarciaUceda2019] proposed a joint approach combining both a 3D U-Net and graph neural networks (GNNs) [Selvan2020]. Qin et al. [Qin2019, Qin2019-2] applied the U-Net to predict the 26-neighbour connectivities of airway voxels, and segmented airways by grouping voxels with at least one mutually predicted connectivity with a neighbour voxel. Zhao et al. [Zhao2019] combined both 2D and 3D U-Nets with linear programming-based tracking to link disconnected components of the segmented airway tree. Wang et al. [Wang2019] used the U-Net with spatial recurrent convolutional layers and a radial distance loss to better segment tubular structures. Qin et al. [Qin2020, Qin2021] extended the U-Net with feature recalibration and attention distillation modules that leverage the knowledge from intermediate feature maps of the network. Zheng et al. [Zheng2020] proposed a General Union loss to alleviate the intra-class imbalance between large and small airways. Zhou et al. [Zhou2021] extended the U-Net with a multi-scale feature aggregation module based on dilated convolutions, to include more contextual information from a larger receptive field. Nadeem et al. [Nadeem2021] used the U-Net followed by a freeze-and-grow propagation algorithm to iteratively increase the completeness of the segmented airway tree.

In this paper, we present a fully automatic and end-to-end optimised method to perform airway segmentation from chest CT, based on the U-Net architecture. A preliminary version of this work was presented in [GarciaUceda2018]. The proposed method processes large 3D image patches, often covering an entire lung, in a single pass through the network. We achieve this by using a simple U-Net backbone with a low memory footprint, together with efficient operations to feed image data to the network. This makes our method simple, robust and efficient. We performed a thorough validation of the proposed method on three datasets with very different characteristics, including both pediatric and adult subjects with diverse airway abnormalities, and including the EXACT'09 public dataset [LoExact2012]. We compared the method with other state-of-the-art segmentation methods tested on the same data. On the EXACT'09 data, we compared with all the recent methods in the literature evaluated on these data, in terms of the average performance measures reported by the EXACT'09 evaluation. On the other datasets, we compared with the two methods previously evaluated on the same data [Lo2010, Lo2009]. Moreover, we evaluated the accuracy of our method with respect to the presence of lung disease and airway abnormalities in the CT scans from the three datasets.

This article is organized as follows: in Section 2 we explain the proposed methodology for airway segmentation, in Section 3 we present the data used for the experiments in this paper, in Section 4 we explain the experiments performed to validate the method, in Section 5 we present the results obtained, in Section 6 we discuss these results and explain the advantages and limitations of the proposed method, and in Section 7 we give the conclusions for this paper.

2 Methods

2.1 Network architecture

The network consists of an encoder (downsampling) path followed by a decoder (upsampling) path, at different levels of resolution. Each level in the downsampling / upsampling path is composed of two convolutional layers, each followed by a rectified linear unit (ReLU) activation, and a pooling / upsample layer, respectively. After each pooling / upsample layer, the number of feature channels is doubled / halved, respectively. There are skip connections between opposing convolutional layers prior to pooling and after upsampling, at the same resolution level. The final layer is a convolutional layer followed by a sigmoid activation function. This network is schematically shown in Fig. 1.

For the analysis of 3D chest scans, we found that the main bottleneck is the memory footprint of the network. To alleviate this, we use non-padded convolutions in the first 3 resolution levels of the U-Net, where the outermost layers of voxels in the feature maps are progressively removed after each convolution. We still use zero-padding in the remaining levels after the third, to prevent an excessive reduction of the output size of the network. This allows the memory footprint to be reduced substantially compared to a model with zero-padding in all layers. Moreover, we do not use dropout or batch normalization layers, as these require extra memory to store the feature maps after the operation.
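The architecture described above can be sketched in PyTorch. This is a minimal illustration, not the authors' released implementation: it uses the stated 5 resolution levels and 16 first-layer feature maps, but assumes 3x3x3 kernels and, for brevity, uses zero-padding in all levels (the paper removes padding in the first 3 levels, which additionally requires center-cropping the skip connections).

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3-D convolutions, each followed by a ReLU activation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class UNet3D(nn.Module):
    """3-D U-Net with 5 resolution levels, 16 first-layer feature maps,
    channel doubling/halving after each pooling/upsampling, skip
    connections, and a final 1x1x1 convolution with sigmoid activation."""
    def __init__(self, in_ch=1, n_feats=16, n_levels=5):
        super().__init__()
        chans = [n_feats * 2 ** i for i in range(n_levels)]  # 16,32,64,128,256
        self.encoders = nn.ModuleList(
            ConvBlock(c_in, c_out)
            for c_in, c_out in zip([in_ch] + chans[:-2], chans[:-1]))
        self.pool = nn.MaxPool3d(2)
        self.bottom = ConvBlock(chans[-2], chans[-1])
        self.ups = nn.ModuleList(
            nn.ConvTranspose3d(c * 2, c, kernel_size=2, stride=2)
            for c in reversed(chans[:-1]))
        self.decoders = nn.ModuleList(
            ConvBlock(c * 2, c) for c in reversed(chans[:-1]))
        self.head = nn.Conv3d(chans[0], 1, kernel_size=1)

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)     # skip connection taken before pooling
            x = self.pool(x)
        x = self.bottom(x)
        for up, dec, skip in zip(self.ups, self.decoders, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return torch.sigmoid(self.head(x))
```

With zero-padding everywhere, the output probability map has the same spatial size as the input patch; input sides must be divisible by 16 so the four pooling steps are exact.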

(a) Training time
(b) Testing time
Figure 1: Schematics of the proposed airway segmentation method. (a) Overview at training time, showing the extraction of 3D random patches from the CT scan, to feed as input to the U-Net. (b) Overview at testing time, showing the extraction of 3D patches through sliding-window, and the generation of the airway tree segmentation from the patch-wise output of the U-Net.

2.2 Training of networks

The network is trained in an end-to-end supervised fashion, using the CT scans as input and evaluating the network output with respect to the ground truth reference airway segmentations, using the soft Dice loss [SoftDice2017],

$$\mathcal{L} = 1 - \frac{2 \sum_{i \in L} p_i \, g_i}{\sum_{i \in L} p_i + \sum_{i \in L} g_i} \qquad (1)$$

where $p$ and $g$ are the predicted voxelwise airway probability map and the airway ground truth, respectively. In the loss computation, the ground truth is masked to the region of interest (ROI): the lungs, indicated by the sub-index $L$, so that only voxels within this region contribute to the loss. This mask removes the trachea and part of the main bronchi from the ground truth, so that the training of the network focuses on the smaller branches. The lung segmentation needed for this masking operation is easily computed by a region growing method [Lo2010].
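The ROI-masked soft Dice loss can be sketched in PyTorch as follows. This is an illustrative implementation assuming the common formulation with linear terms in the denominator; voxels outside the lung mask contribute to neither the numerator nor the denominator.

```python
import torch

def soft_dice_loss(prob, target, lung_mask, eps=1e-7):
    """Soft Dice loss restricted to the lung region of interest.
    prob: predicted voxelwise airway probabilities, target: binary
    ground truth, lung_mask: binary ROI mask (all same shape)."""
    p = prob * lung_mask      # predictions inside the ROI
    g = target * lung_mask    # ground truth inside the ROI
    intersection = (p * g).sum()
    # eps guards against division by zero when the ROI is empty
    return 1.0 - 2.0 * intersection / (p.sum() + g.sum() + eps)
```

In training, `prob` would be the sigmoid output of the network and `lung_mask` the region-grown lung segmentation with trachea and main bronchi excluded.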

To train the network, the Adam optimizer [AdamOptime2017] is used with a learning rate of , which was chosen as large as possible while ensuring convergence of the training and validation losses.

2.3 Implementation of networks

When building the U-Net, the degrees of freedom (or hyperparameters) that determine the capacity of the network are i) the number of resolution levels in the U-Net, ii) the number of feature maps in the first layer, iii) the input image size, and iv) the training batch size. We chose these hyperparameters to build a network that can process the largest input images possible while fitting on an NVIDIA Quadro P6000 GPU with 24 GB of memory. This results in a U-Net with 5 levels of resolution, 16 feature maps in the first layer, and training batches containing only one image. The largest image size that we can fit (and that is compatible with the series of non-padded convolutions in the U-Net) can typically cover a full lung. This U-Net construction is used for all the experiments undertaken in this paper. The network is implemented using the PyTorch framework [Pytorch2019].

2.4 Generation of images during training

The 3D chest scans analysed in this paper are much larger than the input size of the network. Thus, we extract patches from the full-size scans and ground truth to train the network. Moreover, the training datasets used for the experiments in this paper are small, containing only 15-25 scans. Thus, we apply data augmentation to the extracted image patches to artificially increase the amount of training data available and improve generalization. To generate input images to the network from the chest scans and ground truth, we apply the following series of operations:

  1. Crop the full-size scans to a bounding-box around the region of interest: the pre-segmented lung fields. This operation removes the areas outside the lungs, which are irrelevant for airway segmentation. The limits of the bounding-box are defined as 30 voxels from the outer voxel of the lung mask, in each direction. The extra 30-voxel buffer prevents boundary effects, introduced by the use of zero-padding in the last layers of the network, from affecting the prediction of peripheral airways closer to the lung surface. Moreover, we mask the ground truth to the mask of the lung fields to remove the trachea and part of the main bronchi.

  2. Extract image patches from the input image by cropping it to random bounding-boxes. We generate a total of 8 random patches per scan per epoch during training. To perform this operation, we randomly sample the coordinates of the first corner of the bounding-box, $(x_0, y_0, z_0)$, from uniform distributions in the ranges $[0, D_x - d_x]$, $[0, D_y - d_y]$ and $[0, D_z - d_z]$, respectively, where $(d_x, d_y, d_z)$ is the size of the extracted patch and $(D_x, D_y, D_z)$ the size of the cropped input image. Then, the random bounding-box is generated with the limits $[x_0, x_0 + d_x] \times [y_0, y_0 + d_y] \times [z_0, z_0 + d_z]$.

  3. Apply random rigid transformations to the input image patches as data augmentation. These transformations include i) random flipping in the three directions, ii) random small 3D rotations up to 10 degrees, and iii) random scaling with a scale factor in the range (0.75, 1.25). The same transformations are applied to the ground truth using nearest-neighbour interpolation.

This series of operations is schematically shown in Fig. 1. We implemented operations 2) and 3) as an efficient on-the-fly image generation pipeline that feeds data to the network, to reduce the computational overhead of the batch data loading part of the training algorithm. Operation 1) is applied only once to the chest scans and ground truth, prior to training.
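Operations 2) and 3) can be sketched with NumPy as follows. This is an illustrative sketch: only the random-corner patch sampling and the flip augmentation are shown (rotations and scaling are omitted), and all names are ours, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

def sample_random_patch(image, label, patch_size):
    """Operation 2: crop a random patch whose lower corner is drawn
    uniformly in [0, D - d] along each axis (D: image size, d: patch size)."""
    corner = [int(rng.integers(0, D - d + 1))
              for D, d in zip(image.shape, patch_size)]
    window = tuple(slice(c, c + d) for c, d in zip(corner, patch_size))
    return image[window], label[window]

def random_flip(patch, label):
    """Part of operation 3: random flipping along the three directions.
    The same transformation is applied to image and ground truth."""
    for axis in range(3):
        if rng.random() < 0.5:
            patch, label = np.flip(patch, axis), np.flip(label, axis)
    return patch.copy(), label.copy()
```

An on-the-fly pipeline would call these two functions 8 times per scan per epoch, yielding each patch directly to the training loop rather than storing it on disk.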

2.5 Airway extraction

To segment new scans, we extract image patches from the full-size scans in a sliding-window fashion in the three dimensions, with an overlap between the patches in each direction. Each image patch is processed by the trained network, producing a set of patch-wise airway probability maps. To obtain the full-size binary segmentation of the airway tree from these output patches, we apply the following series of operations:

  1. Aggregate the patch-wise airway probability maps output by the network by stitching them together (i.e. reversing the sliding-window operation used to extract the input patches). To account for the overlap between patches, we divide the aggregated voxelwise airway probabilities by the number of patches containing the given voxel.

  2. Mask the full-size airway probability map to the mask of the lung fields. This discards noisy predictions outside the lung regions, as only these regions are included in the training of the networks via equation (1).

  3. Threshold the airway probability maps to obtain the binary airway segmentation, using a threshold of 0.5 by default. This output does not contain the trachea and the part of the main bronchi that were outside the mask of the lung fields. To include these, we merge the airway segmentation with a mask of the trachea and main bronchi, which is easily computed by a region growing method [Lo2010] and then masking its output with the mask of the lung fields.

  4. Apply connected component analysis [ConnCompon1996] to the airway segmentation and retrieve the final airway tree as the largest 26-neighbour-connected component.
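The patch aggregation in step 1 (averaging overlapping probability maps) can be sketched with NumPy as follows; the function and argument names are illustrative, not the authors' API.

```python
import numpy as np

def stitch_patches(patches, corners, full_shape):
    """Reverse the sliding-window extraction: accumulate patch-wise
    probabilities into a full-size volume and divide each voxel by the
    number of patches that cover it."""
    prob = np.zeros(full_shape, dtype=np.float32)
    count = np.zeros(full_shape, dtype=np.float32)
    for patch, corner in zip(patches, corners):
        window = tuple(slice(c, c + s) for c, s in zip(corner, patch.shape))
        prob[window] += patch
        count[window] += 1.0
    return prob / np.maximum(count, 1.0)  # avoid division by zero outside coverage
```

The stitched map would then be masked to the lung fields and thresholded (steps 2 and 3), e.g. `seg = (stitched * lung_mask) >= 0.5`, before the connected component analysis of step 4.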

3 Data

To conduct the experiments, we used CT scans and reference airway segmentations from three different datasets:

  1. CF-CT: This dataset consists of 24 inspiratory low-dose chest CT scans from pediatric patients at Erasmus MC-Sophia, 12 controls and 12 diseased [Kuo2017]. The 12 controls were patients with normal lung assessment on CT, as reported by two different radiologists, who received CT scanning for diagnostic purposes. The 12 diseased were 11 patients with cystic fibrosis (CF) lung disease and 1 with Common Variable Immune Deficiency (CVID), who had structural lung damage with airway abnormalities visible on the CT scan. The two groups were gender- and age-matched, with an age range from 6 to 17 years old and 5 females in each group. Scanning was undertaken using spirometry guidance on a Siemens SOMATOM Definition Flash scanner. Each CT scan has an in-plane voxel size in the range mm, with slice thickness in the range mm, and slice spacing from 0.3 to 1.0 mm.

  2. DLCST: This dataset consists of 32 inspiratory low-dose chest CT scans from the Danish Lung Cancer Screening Trial [DLCST2009]. Participants in the trial were subjects between 50 and 70 years old, with a history of at least 20 pack-years of smoking, and without lung cancer related symptoms. Roughly half of the CT scans are from subjects with Chronic Obstructive Pulmonary Disease (COPD), as determined from spirometry measures. Scanning was undertaken using an MDCT scanner (16-row Philips Mx 8000, Philips Medical Systems, Eindhoven, The Netherlands). All CT scans have a voxel resolution of mm.

  3. EXACT'09: This is the multi-center, public dataset of the EXACT'09 airway extraction challenge [LoExact2012]. For evaluation purposes we used the EXACT'09 test set, consisting of 20 chest CT scans. The conditions of the scanned patients vary widely, ranging from healthy volunteers to subjects with severe lung disease. The data includes both inspiratory and expiratory scans, ranging from clinical to low-dose. The CT scans were acquired with several different CT scanner brands and models, using a variety of scanning protocols and reconstruction parameters. Each CT scan has an in-plane voxel resolution in the range mm, with varying slice thickness in the range mm.

3.1 Generation of reference segmentations

Networks were trained and evaluated using airway lumen segmentations that were obtained by a combination of manual interaction and automated surface detection. For the CF-CT data, we have centerline annotations manually drawn by trained experts. For the DLCST data, we have airway extractions obtained from the union of the results of the methods [Lo2010] and [Lo2009], corrected by a trained observer. The visual assessment was done similarly to the EXACT'09 challenge [LoExact2012], where the obtained airway trees were divided into branches using a wave front propagation algorithm that detects bifurcations, and spurious branches were removed manually. "Partly wrong" branches, whose segmentation covered the airway lumen but leaked outside the airway wall, were also removed. To obtain the ground truth airway segmentations, we applied the optimal surface graph segmentation method [Petersen2014] on top of these initial airway references in order to i) determine accurate lumen contours, and ii) homogenize the ground truth for the two datasets. To evaluate the networks, we use these two ground truth segmentations as well as (via submission to the EXACT'09 challenge) the reference segmentations from the EXACT'09 test set. To build the EXACT'09 reference, the airway trees predicted by all 15 methods participating in the challenge were evaluated by independent expert observers, and the correctly scored branches were merged together.

4 Experiments

We performed three experiments to assess the performance of the proposed U-Net-based method in the different datasets and in comparison with previously published approaches:

  1. CF-CT: We compared the performance of our method with that of a previously published airway segmentation method on these data [PerezRovira2016]. These results were obtained by a kNN-classifier together with a vessel similarity measure [Lo2010], and refined with an optimal surface graph method [Petersen2014] to obtain an accurate lumen segmentation. We denote this method by kNN-VS. We trained and evaluated our U-Net model using the same 6-fold cross-validation setting as in [PerezRovira2016]. The training and testing data in each fold contain an equal number of scans from control and diseased subjects.

  2. DLCST: We compared the performance of our method with that of two previously published airway segmentation methods on these data [Lo2010] and [Lo2009]. The method [Lo2010] is a kNN-classifier together with a vessel similarity measure, while the method [Lo2009] extends the airways iteratively from an incomplete segmentation with locally optimal paths. We refined these results [Lo2010, Lo2009] with an optimal surface graph method [Petersen2014] to obtain a more accurate lumen segmentation, and to homogenize them with the training data used to train our model, for a fair comparison. We denote these methods by kNN-VS and LOP, respectively. We trained and evaluated our U-Net model using the same 2-fold cross-validation setting as in [Lo2010] and [Lo2009]. The training and testing data in each fold have a similar number of scans from healthy and diseased subjects.

  3. EXACT’09: We compared the performance of our method with that of the 15 methods participating in the EXACT’09 challenge, based on the results reported in [LoExact2012]; 6 post-challenge methods evaluated on the EXACT’09 data and reported in [EXACTwebsite]; and 4 recent airway segmentation methods evaluated on these data [Charbonnier2017, Yun2019, Gil2019, Qin2021]. The methods of Charbonnier et al. [Charbonnier2017], Yun et al. [Yun2019] and Qin et al. [Qin2021] were previously described in the Introduction. The method of Gil et al. [Gil2019], named PICASSO, uses locally adaptive optimal thresholds learnt from a graph-encoded measure of the airway tree branching. We denote these methods by CNN-Leak, 2.5D CNN, U-Net-FRAD and PICASSO, respectively. The method of Zheng et al. [Zheng2020] is also evaluated on EXACT’09 data, but we did not include it in our comparison as the authors did not report the same specificity metric (false positive rate) as the EXACT’09 evaluation. To our knowledge, these are all the recent methods in the literature that are evaluated on the EXACT’09 data. We trained our U-Net model using half of the CT scans from the CF-CT and DLCST data (28 scans in total). This training data has a similar number of scans from healthy and diseased subjects. We did not include the EXACT’09 training set in our training data as there are no reference segmentations available for it.

For each cross-validation fold in the experiments, we used roughly 80% of the data to train the models, and the remaining 20% to monitor the validation loss. We trained the models until convergence was reached, and retrieved the model with the overall minimum validation loss for testing. As convergence criterion, we monitored the moving average of the validation loss over 50 epochs, and stopped training when its value i) increased by more than 5%, or ii) did not decrease by more than 0.1%, over 20 epochs. Training time was approximately 1-2 days on an NVIDIA Quadro P6000 GPU, while test-time inference takes less than 1 minute per scan, including all post-processing steps to obtain the binary airway tree.
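The convergence criterion described above can be sketched as a simple check run once per epoch; the exact bookkeeping in the authors' code may differ, and the function name is ours.

```python
def should_stop(val_losses, window=50, patience=20,
                rise_tol=0.05, improve_tol=0.001):
    """Stop when the moving average of the validation loss over `window`
    epochs, compared with its value `patience` epochs earlier, has
    i) risen by more than 5%, or ii) failed to drop by more than 0.1%."""
    if len(val_losses) < window + patience:
        return False  # not enough history to evaluate the criterion yet
    moving_avg = lambda end: sum(val_losses[end - window:end]) / window
    now = moving_avg(len(val_losses))
    before = moving_avg(len(val_losses) - patience)
    if now > before * (1.0 + rise_tol):        # criterion i): loss rising
        return True
    return now > before * (1.0 - improve_tol)  # criterion ii): no improvement
```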

To compute the airway predictions on the EXACT’09 data, we thresholded the airway probability maps with a value of 0.1, and retrieved the final airway tree as the largest 6-neighbour-connected component. The lower threshold was to compensate for the reduction of completeness in our results when computing a single 6-connected structure, required by the submission guidelines for the evaluation on the EXACT’09 dataset [LoExact2012]. In contrast, for the predictions on the CF-CT and DLCST data, we used the default threshold of 0.5 and 26-neighbour-connected component analysis.

4.1 Evaluation

To compare the different methods on the CF-CT and DLCST datasets, we computed i) tree length detected, ii) centerline leakage and iii) Dice coefficient, with respect to the ground truth segmentations. For the results on the EXACT’09 data, we evaluated i) tree length detected and ii) false positive rate, computed by the EXACT’09 challenge organizers [LoExact2012]. Moreover, we assessed the segmentation accuracy of our U-Net-based method with respect to the presence of lung disease in the scans from each dataset. For this, we computed i) tree length detected, ii) centerline leakage (or false positive rate on the EXACT’09 data) and iii) total tree length, and compared the measures from both healthy and diseased subjects.

The metrics are defined as:

  1. Tree length (TL): The number of ground truth centerline voxels that are inside the predictions, as a percentage of the full length of the ground truth centerline.

  2. Centerline leakage (CL): The number of predicted centerline voxels outside the ground truth, as a percentage of the full length of the ground truth centerline.

  3. False positive rate (FPR): The number of false positive voxels outside the ground truth, as a percentage of the total number of ground truth voxels.

  4. Dice coefficient (DSC): The voxelwise overlap between the predictions and the ground truth according to equation (2), with $P$ and $G$ the binary prediction and ground truth masks, respectively:

    $$DSC = \frac{2\,|P \cap G|}{|P| + |G|} \qquad (2)$$

  5. Total tree length: The sum of ground truth centerline voxels that are inside the predictions, multiplied by a factor that represents the voxel resolution (we used the geometric mean of the voxel sizes in the three dimensions).

The trachea and main bronchi are removed in these measures from both the predictions and ground truth, similar to [LoExact2012]. The centerlines are obtained by applying skeletonization [Skeletonize1994] to the prediction and ground truth masks.
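Given binary prediction, ground truth and centerline masks (with trachea and main bronchi already removed), these metrics can be sketched with NumPy as follows. This is illustrative: lengths are approximated by centerline voxel counts rather than physical lengths.

```python
import numpy as np

def airway_metrics(pred, gt, pred_cl, gt_cl):
    """TL, CL, FPR and DSC from boolean masks; `pred_cl` and `gt_cl`
    are the skeletonized centerlines of prediction and ground truth."""
    tl = 100.0 * np.sum(gt_cl & pred) / np.sum(gt_cl)    # tree length detected
    cl = 100.0 * np.sum(pred_cl & ~gt) / np.sum(gt_cl)   # centerline leakage
    fpr = 100.0 * np.sum(pred & ~gt) / np.sum(gt)        # false positive rate
    dsc = 2.0 * np.sum(pred & gt) / (np.sum(pred) + np.sum(gt))  # eq. (2)
    return tl, cl, fpr, dsc
```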

To compare the performance of the different methods, we used the paired two-sided Student’s T-test on the performance measures over the test set. To compare the accuracy of our U-Net-based method between healthy and diseased subjects, we used the unpaired two-sided Student’s T-test on the measures from the two groups. We consider that a p-value lower than 0.01 indicates that the two sets of measures compared are significantly different.

5 Results

5.1 Evaluation on CF-CT data

We show in Fig. 2 the evaluation on the CF-CT data of our U-Net-based method and the kNN-VS method [PerezRovira2016]. Comparing the U-Net with the kNN-VS results, the former shows higher TL, higher DSC and a similar CL. This indicates that our U-Net-based method predicts more complete airway trees than the kNN-VS method [PerezRovira2016], with more and/or longer peripheral branches, and with a similar count of false positive errors. We show in Fig. 4 visualizations of airway trees obtained by the two methods. We can see that the airways predicted by our U-Net-based method better match the reference segmentations, with more and/or longer peripheral branches detected, while the kNN-VS method misses many of these peripheral branches.

Figure 2: Boxplots showing i) tree length detected, ii) centerline leakage and iii) Dice coefficient on the CF-CT data, for the results obtained by our U-Net-based method and the kNN-VS method [PerezRovira2016]

. For each boxplot, the box shows the quartiles of the data (defined by the median, 25% and 75% percentiles), the whiskers extend to include the data within 1.5 times the interquartile range from the box limits, and the markers show all the data points.

Figure 3: Boxplots showing i) tree length detected, ii) centerline leakage and iii) Dice coefficient on the DLCST data, for the results obtained by our U-Net-based method and the kNN-VS [Lo2010] and LOP [Lo2009] methods. For each boxplot, the box shows the quartiles of the data (defined by the median, 25% and 75% percentiles), the whiskers extend to include the data within 1.5 times the interquartile range from the box limits, and the markers show all the data points.
Figure 4: Visualization of predicted airway trees on the CF-CT data, obtained by our U-Net-based method and the kNN-VS method [PerezRovira2016]. True positives are displayed in yellow, centerline false negatives in blue and centerline false positives in red. These false negatives and positives are the errors counted in the tree length detected and centerline leakage measures reported in Fig. 2, respectively. The cases displayed correspond to, from left to right, the two U-Net results with Dice score closest to the median of the test Dice measures, and the U-Net result with the lowest Dice score.

Comparing the measures from our U-Net-based method between control subjects and subjects with CF and airway abnormalities, we found no significant differences in TL ( compared to , ) or in CL ( compared to , ), but the differences were significant in total tree length ( compared to , ). This indicates that our method has similar accuracy with respect to the manual annotations on scans from both control subjects and subjects with CF, while segmenting more airway branches in the diseased cases. We show these grouped measures in Fig. 10 in the Appendix.

5.2 Evaluation on DLCST data

We show in Fig. 3 the evaluation on the DLCST data of our U-Net-based method and the kNN-VS [Lo2010] and LOP [Lo2009] methods. Comparing the U-Net with the kNN-VS results, the former shows lower CL ( compared to , ), higher DSC ( compared to , ) and a similar TL ( compared to , ). Comparing the U-Net with the LOP results, the former shows lower CL ( compared to , ), higher DSC ( compared to , ), but lower TL ( compared to , ). This indicates that our U-Net-based method predicts an airway tree as complete as the kNN-VS method [Lo2010], and less complete than the LOP method [Lo2009], but with significantly fewer false positive errors than these two methods. It should be noted that this comparison is biased towards the kNN-VS and LOP methods, which were used to build the reference segmentations; this partly explains the very high completeness of the results of the LOP method. We show in Fig. 5 visualizations of airway trees obtained by the three methods. The airways predicted by the LOP method have more peripheral branches detected (with almost no false negative errors), but also more false positive errors. The results from our U-Net-based method have fewer peripheral branches detected, but also far fewer false positives.

Figure 5: Visualization of predicted airway trees on the DLCST data, obtained by our U-Net-based method and the kNN-VS [Lo2010] and LOP [Lo2009] methods. True positives are displayed in yellow, centerline false negatives in blue and centerline false positives in red. These false negatives and positives are the errors counted in the tree length detected and centerline leakage measures reported in Fig. 3, respectively. The cases displayed correspond to, from left to right, the two U-Net results with Dice score closest to the median of the test Dice measures, and the U-Net result with the lowest Dice score.

Through visual inspection of the results, we observed that several false positive errors correspond to real airways on the CT scan that were missing in the reference segmentations. This is because this reference was built in a conservative way from automatic airway extractions, where branches not extracted by either method [Lo2010, Lo2009] were not included, and branches evaluated as "partly wrong" were removed, as explained in Section 3.1. Thus, the reference segmentations are somewhat incomplete. For the segmentation results obtained by the three methods, we inspected all false positives that were of tubular shape and long enough to be clearly perceived as possible airways. We found 41 such false positives that were actual airways when overlaid and analyzed on the CT scan. Out of these 41 misclassified branches, 20 were segmented only by our U-Net-based method, 14 were segmented by our U-Net-based method and either the kNN-VS or LOP method, and 7 were segmented by either the kNN-VS or LOP method but not by the U-Net-based method. All cases segmented by our U-Net-based method were free of leakage. In contrast, all cases segmented by the kNN-VS and LOP methods had errors: they were branches longer than the real airway, or had leakage outside the airway wall. We show in Fig. 6 some examples of these misclassified false positive branches, where the overlay of the airway mask with the CT scan shows that they are real airways. Because of this, the CL values reported for the three methods in Fig. 3 are presumably overestimated, to a larger extent for our U-Net-based method. Interestingly, most of the 34 misclassified but correctly segmented branches from our U-Net-based method were on CT scans from subjects with COPD: 26 out of 34 (76%).

Figure 6: Visualization of examples of predicted false positive branches on the DLCST data, obtained by our U-Net-based method and missed by the kNN-VS and LOP methods, which are actually real airways missing in the reference segmentations. The last earmarked branch is detected by the LOP method, but with errors: it has leakage outside the airway wall. In the 3D visualizations, true positives are displayed in yellow, centerline false negatives in blue and centerline false positives in red. We also show the overlay of the predicted airway centerline masks with the CT scan, showing that the earmarked branches are real airways. The centerlines predicted by our U-Net method are displayed in green, those by the LOP method in blue, and the reference segmentations in yellow.

Comparing the measures from our U-Net-based method between subjects with normal lung function and subjects with COPD, we found no significant differences in TL ( compared to , ), but the differences were significant in CL ( compared to , ) and in total tree length ( compared to , ). This indicates that our method has slightly more false positive errors on scans from subjects with COPD. However, this is partly explained by the higher number of misclassified but correctly segmented branches that we found in scans with COPD for the U-Net results. We show these grouped measures in Fig. 11 in the Appendix.

5.3 Evaluation on EXACT’09 data

We show in Fig. 7 the evaluation on the EXACT’09 data of our U-Net-based method, the 15 methods from the EXACT’09 challenge [LoExact2012], the 6 post-challenge methods [EXACTwebsite] and the 4 recent methods [Charbonnier2017, Yun2019, Gil2019, Qin2021] evaluated on the EXACT’09 data. The reported results are the average TL and FPR measures obtained over the EXACT’09 test set. Our U-Net-based method shows an overall TL of and FPR of . When compared to the EXACT’09 methods and the 6 post-challenge methods [EXACTwebsite], our U-Net-based method achieves a much higher TL (by at least ) than all methods except two, while the reported FPR is only slightly higher () than the average score among these methods. This indicates that our U-Net-based method predicts more complete airway trees on average, with limited false positive errors. The two methods with higher TL, () and (), also have a much higher FPR, () and () respectively. Out of the 15 algorithms from EXACT’09, the kNN-VS method [Lo2010] (also compared to on the other two datasets) reported an overall TL of and FPR of . When compared to the CNN-Leak method [Charbonnier2017], which reported an overall TL of and FPR of , our U-Net-based method has slightly higher completeness but also slightly more false positive errors. When compared to the U-Net-FRAD method [Qin2021], which reported an overall TL of and FPR of , our U-Net-based method has slightly lower completeness but also slightly fewer false positive errors. Together with the CNN-Leak [Charbonnier2017] and U-Net-FRAD [Qin2021] methods, our U-Net-based method appears to have the best trade-off between the TL and FPR scores among all tested methods in Fig. 7. However, the authors of [Charbonnier2017] did not follow the independent evaluation of the EXACT’09 benchmark, but instead re-implemented the evaluation algorithm using the reference standard from EXACT’09 [LoExact2012]. Their evaluation of one of the 15 submissions to EXACT’09 resulted in a slightly higher tree length than the one originally reported in [LoExact2012]; thus, the TL score reported for this method could be overestimated. When compared to the 2.5D CNN method [Yun2019], which reported an overall TL of and FPR of , and the PICASSO method [Gil2019], which reported an overall TL of and FPR of , our U-Net-based method shows better performance measures, with higher completeness and fewer false positive errors. Regarding computational efficiency, our U-Net-based method requires less than 1 minute per scan for test-time inference.

Figure 7: Average tree length detected versus average false positive rate over EXACT’09 test set, for our U-Net-based method, the 15 methods from the challenge EXACT’09 [LoExact2012], the 6 post-challenge methods [EXACTwebsite] and the 4 recent methods [Charbonnier2017, Yun2019, Gil2019, Qin2021] evaluated on the EXACT’09 data. Results are computed by the EXACT’09 challenge organizers except for [Charbonnier2017], who did the evaluation slightly differently, leading to slightly better performance, as reported in their paper.
Figure 8: Visualization of predicted airway trees on the EXACT’09 data obtained by our U-Net-based method. True positives are displayed in green and false positives in red. The cases displayed correspond to 3 out of the 5 U-Net results with the highest false positive rate reported in the EXACT’09 evaluation, where we found through visual inspection that many of the reported false positives are real airways on the CT scan, which were presumably missing in the EXACT’09 reference.

Through visual inspection of the results, we observed that for some CT scans a large number of airways predicted by our U-Net-based method that were reported as false positives in the EXACT’09 evaluation are real airways on the CT scan. We show in Fig. 8 visualizations of airway trees with true and false positives indicated according to the EXACT’09 evaluation, for which most of the reported false positive errors are real airways. This is because the reference standard from EXACT’09, constructed from all 15 methods participating in the challenge, is somewhat incomplete: it was estimated that on average approximately more branches were visible in the scans than those included in the reference [LoExact2012]. Thus, the false positive rate reported for our U-Net-based method in Fig. 7 may be overestimated, as may those of the other methods [Charbonnier2017, Yun2019, Gil2019, Qin2021, EXACTwebsite] we compared with, which used the same reference for evaluation.

We compared the measures from our U-Net-based method between a group of scans without any reported anomalies on the CT scan (cases 21, 22, 23, 28, 29, 35, and 37) and a group of scans with reported bronchiectasis (cases 33, 34, 36, and 39) [LoExact2012]. We found no significant differences in TL ( compared to , ), in FPR ( compared to , ) or in total tree length ( compared to , ). This indicates that our method has similar accuracy with respect to the reference segmentations on both scans without anomalies and scans with bronchiectasis. We show these grouped measures in Fig. 12 in the Appendix.

6 Discussion

6.1 Performance compared to other methods

The proposed U-Net-based method has shown good performance on three different datasets, achieving highly complete airway segmentations with few false positive errors. The three datasets have very different characteristics: subjects of a wide age range (from children in the CF-CT data to older adults in the DLCST data) and airway abnormalities from diverse lung diseases, including CF and COPD. On the CF-CT and DLCST data, our U-Net-based method obtained more accurate airway segmentations than the other tested methods [PerezRovira2016, Lo2010, Lo2009], reflected in higher completeness and Dice scores together with fewer false positive errors. On the EXACT’09 data, our U-Net-based method achieved performance measures similar to the best performing previous methods, with the second highest sensitivity score among all methods that reported good specificity. Notably, our method achieved this performance on the highly heterogeneous EXACT’09 data with models trained on different and more homogeneous datasets (CF-CT and DLCST). This shows the robustness of the proposed method and its capacity to generalize across different data. Furthermore, our U-Net-based method obtained similar performance measures on scans from both healthy and diseased subjects on the CF-CT and EXACT’09 data. On the DLCST data, our method had slightly more false positive errors on scans from subjects with COPD, where airway detection can be more challenging due to emphysema. However, on scans with COPD we found a higher number of false positives in the U-Net results that were real airways on the CT scan, which partly explains the lower specificity in the diseased cases. On the CF-CT data, the method segmented more airway branches on scans from subjects with CF, probably because CF-related bronchiectasis widens the peripheral airways, making them easier to detect on the CT scan.

When compared to other CNN-based methods evaluated on the EXACT’09 data, our U-Net-based method shows a performance similar to that of the CNN-Leak [Charbonnier2017] and U-Net-FRAD [Qin2021] methods. These three methods seem to have the best trade-off between sensitivity and specificity among all tested methods in Fig. 7. However, the U-Net-FRAD method [Qin2021] was trained on data that included the EXACT’09 training set, using their own manually drawn reference segmentations on these data. Since the scans from the EXACT’09 training and test sets have similar characteristics and scanning parameters [LoExact2012], this gives the U-Net-FRAD method [Qin2021] an advantage over our U-Net-based method, which was trained on different data. On the other hand, the CNN-Leak method [Charbonnier2017] requires a coarse airway segmentation as initialization for the CNN-based leakage removal algorithm, and thus the completeness of its results is limited by that of the initial segmentation. In contrast, our U-Net-based method segments the full airway tree directly. Moreover, [Charbonnier2017] applied leakage removal to a series of 15 runs of the region-growing-based algorithm [Rikxoort2009] with different parameters controlling the extent of leakage, and merged the results. In fact, applying the leakage removal only to the baseline segmentation [Rikxoort2009] yielded a much lower sensitivity (TL in contrast to ) and a slightly higher specificity (FPR in contrast to ). Similarly, an ensemble of U-Net results with different settings would likely perform slightly better, but this is more time consuming. When compared with the 2.5D CNN method [Yun2019], our U-Net-based method shows better performance, with both higher sensitivity and specificity. The higher accuracy of our method may be because the 2.5D CNN method processes three perpendicular 2D slices around each voxel, whereas the 3D U-Net can better capture the 3D morphology of airways.

6.2 Advantages of the proposed method

The proposed U-Net-based method processes large 3D patches extracted from the CT scan in a single pass through the network. This makes our method simple, robust, and efficient at inference time, as only a few large patches are processed to segment a full scan. In contrast, other U-Net-based methods in the literature either i) apply a small U-Net locally around detected candidate airway locations, processing many image patches per scan [Jin2017, Meng2017]; or ii) apply a model with a larger memory footprint to smaller image patches in a sliding-window fashion [Qin2019, Qin2019-2, Zhao2019, Wang2019, Qin2020, Qin2021, Zheng2020, Zhou2021, Nadeem2021]. With our simple U-Net-based method, and without further processing of the network output (except for extracting the largest connected component), we obtained highly complete and accurate airway segmentations on the three datasets tested. On the EXACT’09 data, our method achieved a performance similar to that of the more complex U-Net-FRAD method [Qin2021], which used manually annotated EXACT’09 data for model training. An additional advantage of processing a few large patches is that fewer edge artifacts are introduced when tiling together the predicted output patches of the network. These artifacts typically occur where the tiled predicted patches meet in the full-size output (or where the patches overlap, if they do), because the predictions from each patch can differ slightly due to border effects when using non-valid (padded) convolutions, and can cause discontinuities in the predicted airway mask.
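The tiled inference described above can be sketched as follows. This is a simplified illustration assuming non-overlapping tiling along the axial (z) axis only; `model_fn` is a placeholder for the trained network, and the actual patch placement in the method may differ.

```python
import numpy as np

def predict_tiled(volume, model_fn, patch_depth):
    """Segment a CT volume by tiling non-overlapping patches along the
    axial (z) axis and stitching the per-patch outputs back together.

    With large patches only a few seams are introduced, reducing the
    edge artifacts that occur where tiled predictions meet.
    """
    output = np.zeros(volume.shape, dtype=np.float32)
    for z0 in range(0, volume.shape[0], patch_depth):
        patch = volume[z0:z0 + patch_depth]           # last patch may be thinner
        output[z0:z0 + patch_depth] = model_fn(patch)  # one forward pass per patch
    return output
```

The fewer the iterations of this loop (i.e. the larger `patch_depth` relative to the scan), the fewer seams appear in the stitched output.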

6.3 Limitations

A limitation of our validation of the proposed method is that we did not evaluate it on CT scans with severe airway disease: the CF-CT dataset includes subjects with moderate CF disease, the DLCST dataset includes subjects with moderate COPD, and the EXACT’09 test data has various airway abnormalities but only one scan with reported "extensive bronchiectasis" [LoExact2012]. Testing the method on severe cases would be important to assess its generalizability to challenging airways with abnormally deformed shapes due to severe disease. Moreover, our evaluation of segmentation performance with respect to the presence of disease on the DLCST dataset may not show the whole picture, because the reference segmentations on these data were built in a conservative way from automatic airway extractions [Lo2010, Lo2009], and could be less complete for scans with severe emphysema.

A limitation of our U-Net-based method is that the airway tree predicted by the U-Net is not guaranteed to form a connected tree structure. This could complicate the automatic extraction of airway biomarkers based on these segmentations, as some methods assume a fully connected tree as input [Petersen2014-2, PerezRovira2016, Kuo2020]. The airway predictions obtained for this paper typically had some segmented peripheral airways disconnected from the main airway tree, and these were discarded when computing the largest connected component, which reduced the completeness of our airway segmentations. Alternatively, airway measurement techniques that do not rely on fully connected trees, such as [Estepar2012], could be used. Several methods have been proposed to extract more complete, connected tree structures. The voxel-connectivity U-Net formulation [Qin2019] aims to improve connectivity in the airway prediction, but results in a model significantly more complex than the U-Net we used. The linear programming-based tracking module on top of a U-Net [Zhao2019] attempts to link disconnected components of airways in the U-Net output. The mean-field networks and graph neural networks (GNNs) [Selvan2020] emphasize the prediction of connected tree-like structures by phrasing tree extraction as graph refinement starting from an over-connected input graph. The joint U-Net-GNN method [GarciaUceda2019] attempts to integrate this tree-like modelling in the U-Net prediction. However, none of these methods can guarantee fully connected airway tree predictions.
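The largest-connected-component step applied to the thresholded U-Net output can be sketched as a simple flood fill. This is a minimal pure-NumPy version using 6-connectivity; the actual implementation (and choice of connectivity) may differ, and production pipelines would typically use `scipy.ndimage.label` instead.

```python
from collections import deque
import numpy as np

def largest_connected_component(mask):
    """Return the largest 6-connected component of a binary 3D mask.

    Disconnected peripheral branches are discarded by this step,
    which reduces the completeness of the final segmentation.
    """
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros(mask.shape, dtype=bool)
    best, best_size = np.zeros(mask.shape, dtype=bool), 0
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
               (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for seed in zip(*np.nonzero(mask)):
        if visited[seed]:
            continue
        # BFS flood fill from this unvisited foreground voxel
        component = [seed]
        visited[seed] = True
        queue = deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in offsets:
                n = (x + dx, y + dy, z + dz)
                if (all(0 <= n[i] < mask.shape[i] for i in range(3))
                        and mask[n] and not visited[n]):
                    visited[n] = True
                    component.append(n)
                    queue.append(n)
        if len(component) > best_size:
            best_size = len(component)
            best = np.zeros(mask.shape, dtype=bool)
            for voxel in component:
                best[voxel] = True
    return best
```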

Finally, the memory footprint of our U-Net-based method could be further reduced, which would allow us to fit even larger images to the network, and possibly the entire CT scan. It may be possible to reduce the number of feature maps, especially in the decoding path of the U-Net, without greatly decreasing performance. Also, using the partially reversible U-Net formulation [Brugger2019] in our method could largely reduce its memory footprint, by lowering the number of activation maps stored in memory. However, this may increase training time (the authors of [Brugger2019] reported a 50% increase for their tested 5-level U-Net model, similar to ours).
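To see why the number of stored feature maps dominates the memory footprint, one can estimate the size of a single float32 activation tensor. The patch shape and channel count below are hypothetical examples, not the paper's configuration.

```python
def activation_memory_mb(patch_shape, n_channels, bytes_per_value=4):
    """Rough size in MiB of one stored activation tensor of a 3D conv layer
    (float32 by default). Total training memory is roughly this summed over
    all activation maps kept for backpropagation."""
    d, h, w = patch_shape
    return d * h * w * n_channels * bytes_per_value / 2**20

# Example: a 176 x 352 x 240 patch with 16 feature maps at full resolution
# already needs 907.5 MiB for a single activation tensor.
print(activation_memory_mb((176, 352, 240), 16))
```

This makes clear why halving the feature maps in the decoder, or storing fewer activation maps via a reversible formulation, translates directly into larger feasible patch sizes.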

7 Conclusions

In this paper, we present a fully automatic and end-to-end optimised method to segment the airways from thoracic CT, based on the U-Net architecture. In contrast to previous U-Net methods for airway segmentation, the proposed method processes large 3D image patches, often covering entire lungs, using a simple and low-memory 3D U-Net as backbone. This makes the method robust and efficient, which is important for deployment in clinical software. Our method obtained highly complete and accurate airway segmentations on three very different datasets including CT scans with various airway and lung abnormalities. On the EXACT’09 test set, our method achieved the second highest sensitivity score among all methods that reported good specificity; and it outperformed previous methods on the other datasets.

References

Acknowledgements

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement n. 115721 resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in kind contribution.

Author contributions

A.G-U.J. did the literature research, conducted the experiments, analysed the results, and wrote the manuscript; R.S. contributed to the analysis of results and writing of the manuscript; Z.S. and H.T. collected data; M.B. supervised the literature research, experiments and analysis of results, and contributed to the writing of the manuscript. All authors reviewed the manuscript.

Additional information

Code availability

The source code for our implementation of the proposed method, and the trained model for the results obtained on the EXACT’09 test set, are available here: https://github.com/antonioguj/bronchinet

Competing interests

The author(s) declare no competing interests.

Appendix

Learning curves of the proposed method

We computed the learning curve for the proposed U-Net-based method trained on the CF-CT and DLCST data together. To do this, we trained several models with different sizes of the training data. The maximum training size uses half of the CT scans from the CF-CT and DLCST data (28 scans in total), which is the same data we used to train the model evaluated on the EXACT’09 data in this paper. We kept the remaining 28 scans for testing the trained models. For each training set used, the ratio between CF-CT and DLCST scans is the same as in the full dataset. We ran three experiments for each training size, with randomly assigned training images (except for the largest run with 28 scans). To compute the airway predictions on the test data, we did not extract the largest connected component from the thresholded output of the U-Net, as we did for the other experiments in this paper; this accounts for the full prediction of the U-Net when assessing the method accuracy for all training sizes. To compare the results for different training sizes, we applied the paired, two-sided Student’s t-test to the average of the measures from the three experiments for a given size, and considered a p-value lower than 0.01 to indicate a significant difference. We show in Fig. 9 the computed learning curves, with the different performance metrics obtained for each run and training set size. The measures of tree length detected increase progressively with the training size. The difference between the scores for sizes of 18 and 28 images is still significant (), so adding more training images could still slightly improve the results. The measures of centerline leakage and Dice coefficient are more similar between sizes of 9 and 18 images ( and , respectively) and between sizes of 18 and 28 images ( and , respectively).
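The significance test used for the learning curves can be reproduced with SciPy's paired t-test. The per-scan values below are hypothetical illustrations, not the paper's measurements.

```python
import numpy as np
from scipy import stats

# Hypothetical per-scan tree length detected for the same test scans,
# from models trained with 18 vs 28 scans (averaged over the three
# experiments per training size).
tl_18 = np.array([0.70, 0.74, 0.68, 0.72, 0.71, 0.69])
tl_28 = np.array([0.75, 0.78, 0.73, 0.77, 0.76, 0.74])

# Paired, two-sided Student's t-test, as used to compare training sizes.
t_stat, p_value = stats.ttest_rel(tl_18, tl_28)

# Using the paper's criterion: p < 0.01 indicates a significant difference.
significant = p_value < 0.01
```

The pairing matters: each test scan yields one measurement per training size, so `ttest_rel` (rather than an unpaired test) correctly accounts for the per-scan correlation.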

Figure 9: Learning curves for our U-Net-based method trained on the CF-CT and DLCST data together. Boxplots showing i) tree length detected, ii) centerline leakage and iii) Dice coefficient on both datasets together, for each experiment and training size. For each boxplot, the box shows the quartiles of the data (defined by the median, 25% and 75% percentiles), the whiskers extend to include the data within 1.5 times the interquartile range from the box limits, and the markers show the data outliers.

Results of proposed method grouped by the presence of lung disease

Figure 10: Boxplots showing i) tree length detected, ii) centerline leakage and iii) total tree length on the CF-CT data, grouped by the presence of CF disease in the CT scans, for the results obtained with our U-Net-based method. For each boxplot, the box shows the quartiles of the data (defined by the median, 25% and 75% percentiles), the whiskers extend to include the data within 1.5 times the interquartile range from the box limits, and the markers show the data outliers.
Figure 11: Boxplots showing i) tree length detected, ii) centerline leakage and iii) total tree length on the DLCST data, grouped by the presence of COPD disease in the CT scans, for the results obtained with our U-Net-based method. For each boxplot, the box shows the quartiles of the data (defined by the median, 25% and 75% percentiles), the whiskers extend to include the data within 1.5 times the interquartile range from the box limits, and the markers show the data outliers.
Figure 12: Boxplots showing i) tree length detected, ii) false positive rate and iii) total tree length on the EXACT’09 test data, grouped by the presence of bronchiectasis disease in the CT scans, for the results obtained with our U-Net-based method. The healthy group contains 7 scans without any reported anomalies on the CT scan, while the diseased group contains 4 scans with reported bronchiectasis [LoExact2012]. For each boxplot, the box shows the quartiles of the data (defined by the median, 25% and 75% percentiles), the whiskers extend to include the data within 1.5 times the interquartile range from the box limits, and the markers show the data outliers.