The translation of machine-learning approaches developed on specific data sets to the variety of routine clinical data is of increasing importance. As methodology matures across different fields, means to render algorithms robust for the transition from the lab to the clinic become critical. Here, we investigate the impact of model choice and training data diversity in the context of segmentation.
The detection and accurate segmentation of organs is a crucial step for a wide range of medical image processing applications in research and computer-aided diagnosis (CAD). Segmentation enables the quantitative description of anatomical structures with metrics such as volume, location, or statistics of intensity values, and is paramount for the detection and quantification of pathologies. For machine learning, segmentation is highly relevant for discarding confounders outside the relevant organ. At the same time, the segmentation algorithm itself can act as a source of bias. For example, a segmentation algorithm can hamper subsequent analysis by systematically excluding dense areas in the lung field.
While lung segmentation is regarded as a mostly solved problem in the methodology community, experiments on routine imaging data demonstrate that algorithms performing well on publicly available data sets do not yield accurate and reliable segmentation in routine chest CTs of patients who suffer from severe disease. Thus, the methodological advance does not translate to applicability in the clinical routine. Here, we show that this is primarily an issue of training data diversity, and not of further methodological advances. Furthermore, annotation of training data in public data sets is often biased by specific target applications, and thus leads to evaluation results that cannot be compared.
Automated lung segmentation algorithms are typically developed and tested on limited data sets, covering a limited spectrum of visual variability by predominantly containing cases without severe pathology or cases with a single class of disease. Such specific cohort datasets are highly relevant in their respective domain, but lead to specialized methods and machine-learning models that struggle to generalize to unseen cohorts when utilized for the task of segmentation. Due to these limitations, in practice, medical image processing studies, especially when dealing with routine data, still rely on semiautomatic segmentations or human inspection of automated organ masks [22, 32]. However, for large-scale data analysis based on thousands of cases, human inspection or any other human interaction with individual data items is not an option.
I-A Related Work
A diverse range of lung segmentation techniques has been proposed. They can be categorized into rule-based, atlas-based, machine-learning-based, and hybrid approaches. Rule-based methods rely on thresholding, edge detection, region-growing, morphological operations, and other "classical" image processing techniques [16, 11, 5, 24]. Atlas-based methods rely on non-rigid image registration between the unlabeled image and one or more labeled atlases [28, 12, 18]. Machine-learning-based approaches rely on large datasets to learn active shape models [27, 2, 4], locations of landmarks, hand-crafted features, or Convolutional Neural Networks (CNN) for end-to-end learning. Hybrid approaches combine various techniques, such as thresholding and texture classification [17, 34, 15], landmarks and shape models, and other combinations [30, 33, 21].
The lung appears as a high-contrast region in X-ray-based imaging modalities, such as CT, so that thresholding and atlas segmentation methods lead to good results in many cases [16, 11, 5]. However, disease-associated lung patterns, such as effusion, atelectasis, consolidation, fibrosis, or pneumonia, lead to dense areas in the lung field that impede such approaches. Multi-atlas registration aims to deal with these high-density abnormalities by incorporating additional atlases, shape models, and other post-processing steps. However, such highly complex pipelines are not reproducible without extensive effort, especially if the source code and the underlying set of atlases are not shared. An additional drawback is that these algorithms are usually optimized for chest CT scans, neglecting scans with a larger or smaller field of view. Furthermore, run-time does not scale well when incorporating additional atlases, as registrations tend to be computationally expensive. With respect to these drawbacks, trained machine-learning models have the advantage that they can be easily shared without giving access to the training data, are fast at inference time, and scale well when additional training data are available. Harrison et al. showed that deep-learning-based segmentation outperforms a specialized approach in cases with interstitial lung diseases and provide trained models, inference code, and model specifications for non-commercial use upon request. However, with some exceptions, trained models for lung segmentation are rarely shared publicly, hampering advances in medical imaging research. At the same time, machine-learning methods are limited by the available training data, its size, and the quality of the ground-truth annotations.
Benchmark datasets for training and evaluation are paramount to establish comparability between different methods. However, publicly available datasets with manually annotated organs for development and testing of lung segmentation algorithms are scarce. The VISCERAL Anatomy3 dataset provides a total of 40 segmented lungs (left and right separated) in scans that show the whole body or the whole trunk. The Lung CT Segmentation Challenge 2017 (LCTSC) provides 36 training and 24 test scans with segmented lungs (left and right separated) from cancer patients of three different institutions. The VESsel SEgmentation in the Lung 2012 (VESSEL12) challenge provides 20 lungs with segmentations. In VESSEL12, the left and right lung are not separated and have a single label. The Lung Tissue Research Consortium (LTRC; https://ltrcpublic.com) provides an extensive database of cases with interstitial fibrotic lung disease and chronic obstructive pulmonary disease (COPD). The dataset provides lung masks with left-right split and the individual lung lobes. The LObe and Lung Analysis 2011 (LOLA11) challenge (https://lola11.grand-challenge.org) published 55 test cases from 55 subjects. These images include normal scans and scans with severe pathologies acquired with different scanners. The ground-truth labels (left and right lung) are known only to the challenge organizers.
In semantic segmentation of natural images, large publicly available datasets have fostered the development and sharing of high-quality segmentation algorithms for downstream applications [6, 19]. In contrast, the aforementioned publicly available datasets that are frequently used for training of lung segmentation models were not created for this purpose and are strongly biased to either inconspicuous cases (Anatomy3) or very specific diseases (LCTSC, VESSEL12, LTRC). The creation of the lung masks in these datasets involved automated algorithms with a varying and uncertain degree of human inspection and manual correction. While the Anatomy3 dataset underwent a thorough quality assessment, the organizers of the VESSEL12 challenge provide the lung masks as-is without guarantees about quality. In fact, the segmentations are merely provided as a courtesy supplement for the task of vessel segmentation. None of these datasets consistently includes high-density areas, such as tumor mass or pleural effusion, as part of the pathological lung. Within the LCTSC dataset, "tumor is excluded in most data" and "collapsed lung may be excluded in some scans." At the same time, pleural fluids and other consolidations have texture and intensity properties similar to surrounding tissue, rendering their annotation quite tedious. Furthermore, anatomical sub-division is not consistent across datasets [full lung only (VESSEL12), left/right split (Anatomy3, LCTSC), lung lobes (LTRC)]. To the best of our knowledge, there is no publicly available dataset with severe pathologies, such as consolidations, effusions, air pockets, and tumors, as part of the annotated lung field, nor is there a publicly available segmentation model trained on such a dataset.
We investigated to what extent the accuracy and reliability of lung segmentation in CT scans, in the presence of severe abnormalities in the routine population, is driven by the diversity of the training data or by methodological advances. Specifically, we addressed the following questions: (1) what is the influence of training data diversity on lung segmentation performance, (2) how do inconsistencies in ground-truth annotations across data contribute to the bias in automatic segmentation or its evaluation in severely diseased cases, and (3) can a generic deep learning algorithm perform competitively with specialized approaches on a wide range of data, once diverse training data is available? To this end, we collected and annotated a diverse data set of 266 CT scans from routine data without restriction on disease or disease pattern. We trained four different segmentation models on different training data (public and routine) and evaluated their accuracy on public data-sets, and on a diverse routine data-set with more than six different disease patterns. Furthermore, we compared the performance between two publicly available (trained) lung segmentation models and our segmentation model trained on our diverse data set and submitted results to the LOLA11 challenge for independent evaluation.
II Materials and Methods
Both the training dataset and the architecture used for a machine-learning task affect the performance of a trained model. In order to study these effects in the context of lung segmentation, we trained four generic semantic segmentation models (Sec. II-A) from scratch on three different public training sets (Anatomy3, LCTSC, LTRC) and one training set collected from the clinical routine (Sec. II-C). We evaluated these models on public test sets (LCTSC, LTRC, and VESSEL12) and routine data, including cases showing severe pathologies. Furthermore, we compared models trained on a diverse routine training set to two published automatic lung segmentation systems, which we did not train, but used as provided.
We refrained from developing specialized methodology, but utilized generic "vanilla" state-of-the-art deep-learning semantic segmentation architectures without modifications, which we trained from scratch. We considered the following four models:
U-net: Ronneberger et al. proposed the U-net for the segmentation of anatomic structures in microscopy images. Since then, the U-net architecture has been used for a wide range of applications and various modified versions have been studied. We utilized the U-net as proposed by Ronneberger et al. with the only adaptation being batch-normalization after each layer.
ResU-net: Residual connections have been proposed to facilitate the learning of deeper networks [31, 9] and are widely used in deep learning architectures. Here, the ResU-net model includes residual connections at every down- and up-sampling block as a second adaptation to the U-net, in addition to batch-normalization.
Dilated Residual Network (DRN): Yu et al. proposed dilated convolutions for semantic image segmentation. They showed that dilated convolutions can extend the receptive field in higher layers, thereby facilitating context aggregation. Later, they adapted deep residual networks with dilated convolutions (DRN) to perform high-quality semantic segmentations on natural images. Here, we utilized the DRN-D-22 model, as proposed by Yu et al.
Deeplab v3+: Chen et al. proposed a series of semantic segmentation algorithms (Deeplab v1, Deeplab v2, and Deeplab v3(+)), with a particular focus on speed. Deeplab v3 combines dilated convolutions, multi-scale image representations, and fully-connected conditional random fields as a post-processing step. Deeplab v3+ includes an additional decoder module to refine the segmentation. Here, we utilized the Deeplab v3+ model as proposed by Chen et al.
II-B Implementation Details
We aimed to achieve maximum flexibility with respect to the field of view (from a partially visible organ to whole-body) and to enable lung segmentation without prior localization of the organ. To this end, we performed segmentation on the slice level (2D). That is, for volumetric scans, each slice was processed individually. We segmented the left and right lung (individually labelled), excluded the trachea, and specifically included high-density anomalies such as tumors and pleural effusions. During training and inference, the images were cropped to the body region using thresholding and morphological operations, and rescaled to a fixed pixel resolution. Prior to processing, Hounsfield units were mapped to a fixed intensity window and normalized to the 0-1 range. During training, the images were augmented by random rotation, non-linear deformation, and Gaussian noise. We used stratified mini-batches of size 14 holding 7 slices showing the lung and 7 slices that do not show the lung. For optimization, we used Stochastic Gradient Descent (SGD) with momentum.
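The intensity mapping and the stratified mini-batch composition described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the window bounds `HU_MIN` and `HU_MAX` are assumed values chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

# Assumed window bounds for illustration only.
HU_MIN, HU_MAX = -1024.0, 600.0


def window_and_normalize(slice_hu, hu_min=HU_MIN, hu_max=HU_MAX):
    """Clip Hounsfield units to [hu_min, hu_max] and rescale to [0, 1]."""
    clipped = np.clip(np.asarray(slice_hu, dtype=np.float32), hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)


def stratified_batch_indices(shows_lung, rng, n_lung=7, n_other=7):
    """Draw indices for one mini-batch: n_lung slices showing the lung
    and n_other slices that do not, in shuffled order."""
    shows_lung = np.asarray(shows_lung, dtype=bool)
    lung_idx = np.flatnonzero(shows_lung)
    other_idx = np.flatnonzero(~shows_lung)
    batch = np.concatenate([
        rng.choice(lung_idx, size=n_lung, replace=False),
        rng.choice(other_idx, size=n_other, replace=False),
    ])
    rng.shuffle(batch)
    return batch
```

With the default arguments, each call to `stratified_batch_indices` returns 14 indices, half of them pointing to slices that show the lung, which mirrors the batch composition stated in the text.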
II-C Routine data collection and sampling
We collected representative training and evaluation datasets from the picture archiving and communication system (PACS) of a university hospital radiology department. We included in- and outpatients who underwent a chest CT examination during a period of 2.5 years, with no restriction on age, sex, or indication. However, we applied minimal inclusion criteria with regard to imaging parameters: axial slicing of the reconstruction, the number of slices in a series, and a series description that included one of the terms lung, chest, or thorax. If multiple series of a study fulfilled these criteria, the series with the highest number of slices was used. In total, we collected more than 5300 patients (examined during the 2.5-year period), each represented by a single CT series. From this database, we carefully selected a representative training dataset using three sampling strategies: (1) random sampling of cases (N=57); (2) sampling from image phenotypes (N=71) (the exact methodology for phenotype identification was not in the scope of this work); and (3) manual selection of edge cases with severe pathologies, such as fibrosis (N=28), trauma (N=20), and other pathologies (N=55). In total, we collected 231 cases from routine data for training (hereafter referred to as R231). While the dataset collected from the clinical routine showed a high variability in lung appearance, cases that showed the head or the abdominal area were scarce. To mitigate this bias toward slices that showed the lung, we augmented the number of non-lung slices in R231 by including all slices that did not show the lung from the Anatomy3 dataset.
For testing, we randomly sampled 20 cases from the database that were not part of the training set (hereafter referred to as RRT) and selected 15 cases with specific anomalies [atelectasis (2), emphysema (2), fibrosis (4), mass (2), pneumothorax (2), and trauma (3)]. Ground-truth labeling was bootstrapped by training a lung segmentation algorithm (U-net) on the Anatomy3 dataset. The preliminary masks were iteratively corrected in an active-learning fashion. Specifically, the model for the intermediate masks was retrained each time 20-30 new manual corrections had been performed with the ITK-SNAP software. In total, we performed ground-truth annotation on 266 chest CT scans from 266 individual patients.
II-D Evaluation metrics
Automatic segmentations were compared to the ground truth for all test datasets using the following evaluation metrics, as implemented by the DeepMind surface-distance Python module (https://github.com/deepmind/surface-distance). While segmentation was performed on 2D slices, evaluation was performed on the 3D volumes. If not reported differently, the metrics were calculated for the right and left lung separately and then averaged.
II-D1 Dice coefficient (DSC)
The Dice coefficient or Dice score is a measure of overlap:
$$\mathrm{DSC}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|},$$
where $X$ and $Y$ are two alternative labelings, such as predicted and ground-truth lung masks.
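As a minimal sketch of the overlap measure (twice the size of the intersection of two labelings divided by the sum of their sizes), the following NumPy function computes the Dice coefficient for two binary masks. The function name is hypothetical, and returning 1.0 for two empty masks is an assumption of this sketch, not something stated in the text.

```python
import numpy as np


def dice_coefficient(mask_a, mask_b):
    """Dice overlap of two binary masks of identical shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denominator = a.sum() + b.sum()
    if denominator == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / denominator
```

For example, two masks of two voxels each that share exactly one voxel yield a Dice coefficient of 2·1/(2+2) = 0.5.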
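II-D2 Robust Hausdorff distance (HD95)
The directed Hausdorff distance is the maximum distance over all distances from points in surface $X_s$ to their closest point in surface $Y_s$. In mathematical terms, the directed robust Hausdorff distance is given as:
$$\overrightarrow{H}_r(X_s, Y_s) = P_r\Big\{\min_{y \in Y_s} d(x, y) \,:\, x \in X_s\Big\},$$
where $P_r$ denotes the $r$-th percentile of the distances ($r = 95$ for HD95). Here, we used the symmetric adaptation:
$$H_r(X_s, Y_s) = \max\Big(\overrightarrow{H}_r(X_s, Y_s),\ \overrightarrow{H}_r(Y_s, X_s)\Big).$$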
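II-D3 Mean surface distance (MSD)
The mean surface distance is the average distance of all points in surface $X_s$ to their closest corresponding point in surface $Y_s$:
$$\overrightarrow{\mathrm{MSD}}(X_s, Y_s) = \frac{1}{|X_s|} \sum_{x \in X_s} \min_{y \in Y_s} d(x, y).$$
Here, we used the symmetric adaptation:
$$\mathrm{MSD}(X_s, Y_s) = \frac{1}{2}\Big(\overrightarrow{\mathrm{MSD}}(X_s, Y_s) + \overrightarrow{\mathrm{MSD}}(Y_s, X_s)\Big).$$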
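The two surface metrics can be illustrated on explicit point sets. The sketch below is a brute-force NumPy version, not the surface-distance module cited above (which may, for instance, weight surface elements differently). It assumes that each surface is given as an array of point coordinates of shape (N, dim), and that the symmetric variants are the maximum of the two directed percentiles (HD95) and the average of the two directed means (MSD). All function names are hypothetical.

```python
import numpy as np


def nearest_distances(points_a, points_b):
    """For each point in points_a, the Euclidean distance to its closest
    point in points_b (brute force, O(Na * Nb) memory)."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    diff = a[:, None, :] - b[None, :, :]  # shape (Na, Nb, dim)
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def robust_hausdorff(points_a, points_b, percentile=95.0):
    """Symmetric robust Hausdorff distance: maximum of the two directed
    percentile distances."""
    d_ab = nearest_distances(points_a, points_b)
    d_ba = nearest_distances(points_b, points_a)
    return max(np.percentile(d_ab, percentile), np.percentile(d_ba, percentile))


def mean_surface_distance(points_a, points_b):
    """Symmetric mean surface distance: average of the two directed means."""
    d_ab = nearest_distances(points_a, points_b)
    d_ba = nearest_distances(points_b, points_a)
    return 0.5 * (d_ab.mean() + d_ba.mean())
```

The brute-force distance matrix keeps the sketch short; for full-resolution lung surfaces, a distance transform or a spatial index would be the more practical choice.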
Study the impact of training data variability: We determined the influence of training data variability (especially public datasets vs. routine) on the generalizability to other public test datasets, and, specifically, to cases with a variety of pathologies. To this end, we trained the four different models on an equal number of patients (N=36) and slices (N=3393) from each training dataset [Routine Random (RR36), VISCERAL Anatomy3 (VISC36), LTRC (LTRC36), and LCTSC (LCTSC36)] individually. The RR36 dataset is a subset of the 57 randomly sampled cases that we collected for training, and therefore is also a subset of the full routine dataset R231. Table I gives an overview of the training datasets created. The number of volumes and slices was limited to match the smallest dataset (LCTSC), with 36 volumes and 3393 slices. During this experiment, we considered only slices that showed the lung (during training and testing) to prevent a bias induced by the field of view. For example, images in VISCERAL Anatomy3 showed either the whole body or the trunk, including the abdomen, while other datasets, such as LTRC, LCTSC, or VESSEL12, contained only images limited to the chest. Also, trauma patients are scanned with a larger field of view compared to patients in whom only the lung is examined.
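Table I: Training datasets.
| Abbreviation | Name | Volumes | Slices showing the lung | Slices in total |
| R231 | Routine 231 Cases | 231 | 62224 | 108248 |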
Comparison between generic models trained on a diverse dataset and publicly available lung segmentation systems: We compared the models trained on our dataset (R231) to publicly available lung segmentation algorithms on the full volumes. As a post-processing step, we removed small connected components (below a fixed size threshold) from the segmentations. Note that the reference methods performed comparable post-processing. We compared our trained models to the reference systems provided by the Chest Imaging Platform (CIP; https://chestimagingplatform.org) and the Progressive Holistically Nested Networks (P-HNN) model provided by Harrison et al. The CIP algorithm features a threshold-based approach, while P-HNN is a CNN architecture specifically developed for lung segmentation that was trained on cases from the LTRC dataset and other cases with interstitial lung diseases. The CIP algorithm was shown to be sensitive to image noise. Thus, if the CIP algorithm failed, we pre-processed the volumes with a Gaussian filter kernel. If the algorithm still failed, the case was excluded from the comparison. The trained P-HNN model does not distinguish between left and right lung. Thus, evaluation metrics were computed on the full lung for masks created by P-HNN. In addition to the evaluation on publicly available datasets and methods, we performed an independent evaluation of our lung segmentation model by submitting solutions to the LOLA11 challenge, for which 55 CT scans are published but ground-truth masks are available only to the challenge organizers.
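The connected-component post-processing mentioned above can be sketched as follows. This is an illustrative pure-Python/NumPy flood fill, not the authors' code: the size threshold `min_voxels` is a hypothetical parameter expressed in voxels, and 6-connectivity in 3D is an assumption of the sketch.

```python
from collections import deque

import numpy as np


def remove_small_components(mask, min_voxels):
    """Keep only 6-connected foreground components of a 3D binary mask
    that contain at least min_voxels voxels."""
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros(mask.shape, dtype=bool)
    cleaned = np.zeros(mask.shape, dtype=bool)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for seed in zip(*np.nonzero(mask)):
        if visited[seed]:
            continue
        # Breadth-first flood fill collecting one connected component.
        visited[seed] = True
        component = [seed]
        queue = deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in offsets:
                n = (z + dz, y + dy, x + dx)
                inside = all(0 <= c < s for c, s in zip(n, mask.shape))
                if inside and mask[n] and not visited[n]:
                    visited[n] = True
                    component.append(n)
                    queue.append(n)
        if len(component) >= min_voxels:
            for voxel in component:
                cleaned[voxel] = True
    return cleaned
```

A labeling routine from an image-processing library would be considerably faster on full CT volumes; the explicit flood fill is used here only to keep the example self-contained.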
Quantitative assessment of the models' ability to cover tumor areas: Studies on lung segmentation usually use overlap and surface metrics to assess the automatically generated lung mask against the ground truth. However, segmentation metrics on the full lung can only marginally quantify the capability of a method to cover pathological areas in the lung, as pathologies may be relatively small compared to the lung volume. Carcinomas are an example of high-density areas that are at risk of being excluded by threshold- or registration-based methods when they are close to the lung border. We utilized the publicly available, previously published Lung1 dataset to quantify the models' ability to cover tumor areas within the lung. The collection contains scans of 318 non-small-cell lung cancer patients before treatment, with a manual delineation of the tumors. However, no lung masks were available. In this experiment, we evaluated the proportion of tumor volume covered by the lung mask:
$$O(T, L) = \frac{|T \cap L|}{|T|},$$
where $T$ denotes the manually delineated tumor volume and $L$ the automatically generated lung mask.
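The tumor overlap proportion can be sketched in a few lines of NumPy: it is the number of tumor voxels that fall inside the lung mask divided by the total number of tumor voxels. The function name is hypothetical, and raising an error for an empty tumor mask is a design choice of this sketch.

```python
import numpy as np


def tumor_coverage(tumor_mask, lung_mask):
    """Fraction of tumor voxels that lie inside the lung mask."""
    tumor = np.asarray(tumor_mask, dtype=bool)
    lung = np.asarray(lung_mask, dtype=bool)
    n_tumor = tumor.sum()
    if n_tumor == 0:
        raise ValueError("tumor mask is empty; coverage is undefined")
    return np.logical_and(tumor, lung).sum() / n_tumor
```

Unlike the Dice coefficient, this measure is not penalized by lung voxels outside the tumor, so it isolates how much of the annotated pathology the lung mask includes.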
Table II gives an overview of the test datasets used. LTRC, LCTSC, and VESS12 are publicly available. RRT is a set of 20 randomly sampled lung CT scans from the routine database. The category mass held two cases from the Lung1 dataset for which ground-truth masks were created by us, and the category normal held four cases with a large field of view from the VISCERAL Anatomy3 dataset and two cases from the routine database. In total, we collected 191 test cases not used for training.
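Table II: Test datasets.
| Abbreviation | Name | Volumes | Slices showing the lung | Slices in total |
| RRT | Routine Random Test | 20 | 5788 | 7969 |
| Normal** | Normal (Large field of view) | 7 | 1180 | 5301 |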
III-A Models trained on routine data outperformed models trained on publicly available study data
U-net, ResU-net, and Deeplab v3+ models, when trained on routine data (RR36 models), yielded the best evaluation scores over all test cases. The largest differences in performance were observed on routine test data (RRT), where a U-net trained on VISCERAL data (VISC36) yielded an average DSC of 0.84, compared to 0.92 when trained on RR36. This advantage of routine data for training is also reflected in the overall results and in the other evaluation metrics, with the U-net yielding better DSC, HD95, and MSD scores when trained on RR36 than when trained on VISC36. Table III lists the evaluation results in detail. We report the averaged DSC for the individual test sets and for all test cases combined (All(L), N=191). In addition, we report all test cases combined without the LTRC and LCTSC data (All, N=62). The rationale is that the LTRC test dataset contains 105 volumes and dominates the averaged scores, and that the LCTSC dataset contains multiple cases with tumors and effusions that are not included in the ground-truth masks. Thus, an automated segmentation that includes these areas yields a lower score, distorting and misrepresenting the averaged results. Models trained on routine data yielded the highest DSC on all subcategories apart from LCTSC and VESS12. In fact, the models trained on RR36 yielded the lowest performance in terms of DSC on the LCTSC test dataset. However, the lower scores for these models on LCTSC and VESS12 can be attributed to the lack of very dense pathologies in the ground-truth masks, as mentioned above (see qualitative results in Fig. 2).
III-B Various deep-learning-based semantic segmentation architectures perform comparably for lung segmentation
We determined that the influence of model architecture is marginal compared to the influence of training data. Considering the models trained on the datasets comprising 36 scans (RR36, LTRC36, LCTSC36, and VISC36), we observed that the different architectures performed comparably, with the DSC not varying by more than 0.02 when the same combination of training and test set was applied (Table III). The same conclusion holds when models were trained on the full dataset R231 and evaluated on the full volumes, with the U-net(R231), resunet(R231), and deeplab(R231) achieving an identical DSC of 0.98±0.03 over all test cases. The drn(R231) model yielded a slightly lower DSC over all cases of 0.97±0.06. Detailed results are listed in Table IV.
III-C All training sets generalized well to cases without severe pathologies
Results show that moderately sized training sets generalize well to the test cases of the same dataset, but also to different datasets without severe pathologies. For example, training the U-net on 36 cases of the LTRC dataset yielded a DSC of 0.99 on the LTRC test set of 105 cases while still generalizing well to the VESS12 dataset (DSC of 0.99) and our selected routine cases with emphysema (DSC of 0.99), mass (DSC of 0.98), or pneumothorax (DSC of 0.98). In general, we observed that, independent of training set and architecture, cases without severe pathologies were accurately segmented by the models. Specifically, test cases in the public LTRC and VESS12 datasets received an averaged DSC of at least 0.96 (up to 0.99), depending on architecture and training data. The same conclusion can be drawn for the emphysema, mass, pneumothorax, and normal test cases, for which results vary only slightly (DSC within ±0.01) across the different models. Detailed results are listed in Table III.
III-D Generic models trained on routine data outperformed specialized publicly available systems
Table IV compares the average DSC of the U-net(R231) and P-HNN over all 191 test cases, together with the DSC, HD95, and MSD scores averaged without the LTRC and LCTSC datasets. Note that the P-HNN results were calculated on the full lung, whereas the U-net results were calculated on the left and right lung separately, which gives P-HNN an advantage in achieving better scores. For the comparison with the CIP algorithm, only volumes for which the algorithm did not fail were considered. The CIP algorithm tends to fail on challenging cases with dense pathologies and on volumes with a large field of view. In total, the CIP algorithm was able to process 160 of the 191 test volumes. Both algorithms yielded comparable DSC when averaged over all test cases. The average DSC, HD95, and MSD scores of the U-net(R231) and CIP without LTRC and LCTSC considered are also given in Table IV. Figure 1 shows qualitative results for cases from the routine test sets.
In addition to publicly available datasets and our routine data, we created segmentations for the 55 cases of the LOLA11 challenge with the U-net(R231) model. While masks for evaluation are available only to the challenge organizers, prior research and earlier submissions suggest inconsistencies in the creation of the ground truth, especially with respect to pleural effusions. We specifically included effusions in the lung masks of our R231 training dataset. To account for this discrepancy and improve comparability, we submitted two solutions: first, the masks as yielded by the U-net(R231), and second, the same masks with dense areas subsequently removed. The automatic exclusion of dense areas was performed by simple intensity thresholding and morphological operations. The submission with dense areas removed achieved the second-highest average overlap score among all competitors at the time of submission. The scores of the first-ranked method and of a human reference segmentation are listed on the challenge results page (https://lola11.grand-challenge.org/evaluation/results/).
III-E A generic model trained on routine data covers more tumor area than specialized publicly available systems
Table V and Figure 4 show the results for average tumor overlap on the 318 volumes of the Lung1 dataset. U-net(R231) covered more tumor volume (mean/median: 60/69%) than P-HNN (50/44%) and CIP (34/13%). Figures 3a and 3b show qualitative results for tumor cases for U-net(R231) and P-HNN. We found that 23 cases of the Lung1 dataset had corrupted ground-truth annotations of the tumors (Figure 3c). Figure 3d shows cases with little or no tumor overlap achieved by U-net(R231).
Batchwise inference of the U-net model requires 12.5 ms per slice with a batch size of 20 (4.4 seconds for a volume of 350 slices) on a GeForce GTX 1080Ti GPU with Python 3.6, PyTorch 1.2, and NumPy 1.17.2 and an Intel(R) Xeon(R) W-2125 CPU. Inference on the CPU takes 0.8 s per slice or 281 seconds for a volume of 350 slices. In addition, creating the body mask and cropping the slice takes 22 ms per slice or 7.7 s for the whole volume of 350 slices.
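As a quick consistency check, the per-volume figures follow from the per-slice timings for a volume of 350 slices. The per-slice values are rounded, which presumably explains the small difference between the product below and the reported 281 seconds on the CPU:

```latex
\begin{align*}
t_{\text{GPU}}  &= 350 \times 12.5\,\text{ms} = 4.375\,\text{s} \approx 4.4\,\text{s},\\
t_{\text{CPU}}  &\approx 350 \times 0.8\,\text{s} = 280\,\text{s},\\
t_{\text{crop}} &= 350 \times 22\,\text{ms} = 7.7\,\text{s}.
\end{align*}
```

On the GPU, the body-mask creation and cropping step (7.7 s) therefore takes longer than the network inference itself (about 4.4 s).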
Lung segmentation is a pivotal pre-processing step for many image analysis tasks, such as classification, detection, and quantification of lung pathologies. Despite its fundamental importance, publicly available algorithms for pathological lung segmentation are scarce, which impedes research on lung pathologies with medical imaging. This can be attributed not only to authors being reluctant to make their implementations publicly available, but also to the fact that many algorithms are complex and hard to reproduce. For example, the method with the highest score in the LOLA11 challenge consists of a sophisticated pipeline with multiple processing steps and relies on a private database of reference atlases, rendering it impossible to reproduce the results without access to the data. In contrast to complex processing pipelines, trained machine-learning models are easy to share and use. However, we showed that public datasets do not hold sufficient variability to generalize to a wide spectrum of routine pathologies. This situation is aggravated by the fact that many publicly available datasets do not include dense pathologies, such as tumors or effusions, in the lung mask.
The inclusion or exclusion of lung pathologies such as effusions into lung segmentations is a matter of definition and application. While pleural effusions (and pneumothorax) are technically outside the lung, they are assessed as part of lung assessment, and have a substantial impact on lung parenchyma appearance through compression artefacts. Neglecting such abnormalities would hamper automated lung assessment, as they are closely linked to lung function. In addition, lung masks that include pleural effusions greatly alleviate the task of effusion detection and quantification, thus making it possible to remove effusions from the lung segmentation as a post-processing step.
A large number of segmentation methods are proposed every year, often based on architectural modifications of established models. Isensee et al. showed that such modified design concepts do not improve (and occasionally even worsen) the performance of a well-designed baseline. They achieved state-of-the-art performance on multiple publicly available segmentation challenges relying only on U-nets. This corresponds to our finding that architectural choice had a subordinate effect on performance and that, given a diverse and large set of training data, a standard semantic segmentation architecture (U-net) can generate high-quality lung segmentations.
Volume-based segmentation metrics are valid tools with which to assess the segmentation quality on the organ level. However, they are a superficial means by which to assess the quality in pathological lung segmentation. Area-based metrics, such as the Dice score, but also surface-based metrics, such as HD95 and MSD, tend to worsen only marginally in the presence of a missed abnormality if the abnormality is small compared to the whole organ. We showed that datasets without lung masks, but with annotated pathologies, such as Lung1, can be a means with which to quantify algorithms with respect to their ability to cover abnormal lung areas. We showed that the U-net trained on routine data covered more tumor area than the reference methods.
We showed that accurate and fast lung segmentation does not require complex methodology and that a proven deep-learning-based segmentation architecture can outperform state-of-the-art methods once diverse (but not necessarily larger) training data are available. By comparing various datasets for training of the models, we illustrated the importance of diverse training data and showed that data from the clinical routine generalize well to unseen cohorts, achieving the second-highest score among all submissions to the LOLA11 challenge. Given these findings, we draw the following conclusions: (1) translating machine-learning approaches from the lab to routine data can require the collection of diverse training data rather than methodological modifications; (2) current, publicly available study datasets do not meet these diversity requirements; and (3) generic semantic segmentation algorithms are adequate for the task of lung segmentation. A reliable, universal tool for lung segmentation is fundamentally important to foster research on severe lung diseases and to study routine clinical datasets. Thus, the trained model and inference code are made publicly available under the GPL3 license at https://github.com/JoHof/lungmask to serve as an open science tool for research and development and as a publicly available baseline for lung segmentation.
This research was supported by the Austrian Science Fund FWF (I2714-B31), Siemens Healthineers Digital Health (https://www.siemens-healthineers.com/digital-health-solutions), and an unrestricted grant from Boehringer Ingelheim.
- (2014) Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun 5, pp. 4006.
- (2017-11) Automated segmentation of lung field in HRCT images using active shape model. In IEEE Region 10 Annual International Conference, Proceedings/TENCON, Vol. 2017-Decem, pp. 2516–2520.
- (2004-09) Automated lung segmentation for thoracic CT: Impact on computer-aided diagnosis. Academic Radiology 11 (9), pp. 1011–1021.
- Automatic Pathological Lung Segmentation in Low-Dose CT Image Using Eigenspace Sparse Shape Composition. IEEE Transactions on Medical Imaging 38 (7), pp. 1736–1749.
- (2011) Automatic Lung Segmentation in HRCT Images. In International Conference on Image and Vision Computing, pp. 293–298.
- The Cityscapes Dataset for Semantic Urban Scene Understanding. In , pp. 3213–3223.
- (2015-05) Overview of the VISCERAL Challenge at ISBI. In Proceedings of the VISCERAL Challenge at ISBI, New York, NY.
- (2017-09) Progressive and multi-path holistically nested neural networks for pathological lung segmentation from CT images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Vol. 10435 LNCS, pp. 621–629.
- (2016-12) Deep Residual Learning for Image Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- (2016) Unsupervised identification of clinically relevant clusters in routine imaging data. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Vol. 9900 LNCS, pp. 192–200.
- (2001-06) Automatic lung segmentation for accurate quantitation of volumetric X-ray CT images. IEEE Transactions on Medical Imaging 20 (6), pp. 490–498.
- (2015-08) Multi-atlas segmentation of biomedical images: A survey. Medical image analysis 24 (1), pp. 205–19.
- (2015-06) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning, pp. 448–456.
- (2019-04) nnU-Net: Breaking the Spell on Successful Medical Image Segmentation. arXiv preprint arXiv:1809.10486.
- (2010) Interactive lung segmentation in CT scans with severe abnormalities. In IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pp. 564–567.
- (2007-12) Combining 2D wavelet edge highlighting and 3D thresholding for lung segmentation in thin-slice CT. The British Journal of Radiology 80 (960), pp. 996–1004.
- (2008-11) Texture classification-based segmentation of lung affected by interstitial pneumonia in high-resolution CT. Medical Physics 35 (12), pp. 5290–5302.
- (2006) Atlas-driven lung lobe segmentation in volumetric X-ray CT images. IEEE Transactions on Medical Imaging 25 (1), pp. 1–16.
- (2019) Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 82–92.
- (2015-07) Segmentation and Image Analysis of Abnormal Lungs at CT: Current Approaches, Challenges, and Future Trends. RadioGraphics 35 (4), pp. 1056–1076.
- (2014-12) A Generic Approach to Pathological Lung Segmentation. IEEE Transactions on Medical Imaging 33 (12), pp. 2293.
- (2017) Precision Radiology: Predicting longevity using feature engineering and deep learning methods in a radiomics framework. Scientific Reports 7 (1), pp. 1648.
- (1964-01) Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4 (5), pp. 1–17.
- (2016-08) Automated Lung Segmentation from HRCT Scans with Diffuse Parenchymal Lung Diseases. Journal of Digital Imaging 29 (4), pp. 507–519.
- (2015) U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, Vol. 9351, pp. 234–241.
- (2014-10) Comparing algorithms for automated vessel segmentation in computed tomography scans of the lung: the VESSEL12 study. Medical Image Analysis 18 (7), pp. 1217–1232.
- (2012-02) Automated 3-D Segmentation of Lungs With Lung Cancer in CT Data Using a Novel Robust Active Shape Model Approach. IEEE Transactions on Medical Imaging 31 (2), pp. 449–460.
- (2005-08) Toward automated segmentation of the pathological lung in CT. IEEE Transactions on Medical Imaging 24 (8), pp. 1025–1038.
- (2011) Multi-stage learning for robust lung segmentation in challenging CT volumes. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Vol. 6893 LNCS, pp. 667–674.
- (2017-01) Accurate Lungs Segmentation on CT Chest Images by Adaptive Appearance-Guided Shape Modeling. IEEE Transactions on Medical Imaging 36 (1), pp. 263–276.
- (2015) Training Very Deep Networks. In Advances in neural information processing systems, pp. 2377–2385.
- (2016-12) Quantitative CT characterization of pediatric lung development using routine clinical imaging. Pediatric Radiology 46 (13), pp. 1804–1812.
- (2009-06) Automatic lung segmentation from thoracic computed tomography scans using a hybrid approach with error detection. Medical Physics 36 (7), pp. 2934–2947.
- (2009-09) Automated segmentation of lungs with severe interstitial lung disease in CT. Medical Physics 36 (10), pp. 4592–4599.
- (2018-10) Autosegmentation for thoracic radiation treatment planning: A grand challenge at AAPM 2017. Medical Physics 45 (10), pp. 4568–4581.
- (2015-11) Multi-Scale Context Aggregation by Dilated Convolutions.
- (2017) Dilated Residual Networks. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 472–480.
- (2006-07) User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability. NeuroImage 31 (3), pp. 1116–1128.