DeepStationing: Thoracic Lymph Node Station Parsing in CT Scans using Anatomical Context Encoding and Key Organ Auto-Search

by   Dazhou Guo, et al.
Zhejiang University

Lymph node station (LNS) delineation from computed tomography (CT) scans is an indispensable step in the radiation oncology workflow. High inter-user variability across oncologists and prohibitive labor costs motivate an automated approach. Previous works exploit anatomical priors to infer LNS based on predefined ad-hoc margins. However, without voxel-level supervision, their performance is severely limited. Because LNS is highly context-dependent, with boundaries constrained by anatomical organs, we formulate LNS delineation as a deep spatial and contextual parsing problem via encoded anatomical organs. This permits the deep network to better learn from both CT appearance and organ context. We develop a stratified referencing organ segmentation protocol that divides the organs into anchor and non-anchor categories and uses the former's predictions to guide the segmentation of the latter. We further develop an auto-search module to identify the key organs that yield the optimal LNS parsing performance. Extensive four-fold cross-validation experiments are conducted on a dataset of 98 esophageal cancer patients (with the most comprehensive set of 12 LNSs + 22 organs in the thoracic region to date). Our LNS parsing model produces significant performance improvements, with an average Dice score of 81.1%, which exceeds both the pure CT-based deep model and the previous representative approach.


1 Introduction

Cancers in the thoracic region are the most common cancers worldwide [sung2021global], and a significant proportion of patients are diagnosed at late stages involving lymph node (LN) metastasis. The treatment protocol is a sophisticated combination of surgical resection and chemotherapy and/or radiotherapy [hirsch2017lung]. Assessing the involved LNs [zhu2020lymph, chao2020lymph] and accurately labeling their corresponding stations are essential for treatment selection and planning. For example, in radiation therapy, the delineation accuracies of the gross tumor volume (GTV) and the clinical target volume (CTV) are the two most critical factors affecting the patient outcome. For CTV delineation, areas containing metastatic LNs should be included to sufficiently cover the sub-clinical disease regions [chapet2005ct]. One strategy to outline the sub-clinical disease region is to include the LNS that contains the metastasized LN [pignon1992meta, yuan2019lymph]. Thoracic LNS is determined according to the text definitions of IASLC [rusch2009iaslc]. The delineation of LNS in the current clinical workflow is predominantly a manual process using CT images. Visual assessment and manual delineation are challenging and time-consuming even for experienced physicians, since converting the text definitions of IASLC into precise 3D voxel-wise annotations is error prone, leading to large intra- and inter-user variability [chapet2005ct].

Figure 1: An illustration of LNS and key referencing organs. The top row illustrates the auto-searched top-6 key referencing organs; the bottom row depicts the 12 LNS.

Deep convolutional neural networks (CNNs) have made remarkable progress in segmenting organs and tumors in medical imaging [tang2019clinically, zhang2020robust, jin2019accurate, jin2019deep, guo2020organ, jin2020deeptarget]. Only a handful of non-deep learning studies have tackled automated LNS segmentation [feuerstein2012mediastinal, matsumoto2014automatic, sarrut2014learning, liu2016mediastinal]. An LNS atlas was established using deformable registration [feuerstein2012mediastinal]. Predefined margins from manually selected organs, such as the aorta, trachea, and vessels, were applied to infer LNS [liu2016mediastinal], an approach that could not accurately adapt to individual subjects. Other methods [matsumoto2014automatic, sarrut2014learning] built fuzzy models to directly parse the LNS or learn the relative positions between LNS and some referencing organs. Average location errors (in mm) were reported using 22 test cases in [matsumoto2014automatic], while an average Dice score (DSC) for LNSs in 5 patients was reported in [sarrut2014learning].

In this work, we propose DeepStationing, an anatomical context encoded deep LNS parsing framework with key organ auto-search. We first segment a comprehensive set of 22 chest organs related to the description of LNS according to the IASLC guideline. Inspired by [guo2020organ], the 22 organs are stratified into anchor and non-anchor categories. The predictions of the former category are exploited to guide and boost the segmentation performance of the latter category. Next, the CT image and the referencing organ predictions are combined as different input channels to the LNS parsing module. The 22 referencing organs are identified by human experts. However, a CNN may differ from the human process and require a particular set of referencing organs (key organs) to reach optimal performance. Therefore, we automatically search for the key organs by applying a channel-weighting to the input organ prediction channels based on differentiable neural search [liu2018darts]. The auto-searched final top-6 key organs, i.e., esophagus, aortic arch, ascending aorta, heart, spine, and sternum (shown in Fig. 1), enable our DeepStationing method to achieve high LNS parsing accuracy. We adopt 3D nnU-Net [isensee2020nnu] as our segmentation and parsing backbone. Extensive 4-fold cross-validation is conducted using a dataset of 98 CT images with 12 LNS + 22 organ labels each, the first of its kind to date. Experimental results demonstrate that a deep model encoded with the spatial context of auto-searched key organs significantly improves the LNS parsing performance, resulting in an average DSC of 81.1 ± 6.1%, which is higher than both the pure CT-based deep model and the most recent relevant work [liu2016mediastinal] (from our re-implementation).

2 Method

Fig. 2 depicts the overview of our DeepStationing framework, consisting of two major modularized components: (1) stratified chest organ segmentation; (2) context encoded LNS parsing with key organ auto-search.

Figure 2: Overall workflow of our DeepStationing, which consists of stratified chest organ segmentation and anatomical context encoded LNS parsing with key organ auto-search.

2.1 Stratified Chest Organ Segmentation

To provide the spatial context for LNS parsing, we first segment a comprehensive set of 22 chest organs related to the description of LNS. Simultaneously segmenting a large number of organs increases the optimization difficulty, leading to sub-optimal performance. Motivated by [guo2020organ], we stratify the 22 chest organs into anchor and non-anchor categories. Anchor organs have high contrast; hence, it is relatively easy and robust to segment them directly using deep appearance features. Anchor organs are segmented first, and their results serve as ideal candidates to support the segmentation of the more difficult non-anchor organs. We use two CNN branches, one per category: the anchor branch takes only the CT image, and the non-anchor branch takes the CT image with the predicted anchor organs as additional input. Assuming $N$ data instances, we denote the training data as $\mathbb{S} = \{X_n, Y_n^{A}, Y_n^{\neg A}, Y_n^{L}\}_{n=1}^{N}$, where $X_n$, $Y_n^{A}$, $Y_n^{\neg A}$, and $Y_n^{L}$ denote the input CT and the ground-truth masks for the anchor organs, non-anchor organs, and LNS, respectively. Assuming there are $C_A$ and $C_{\neg A}$ classes for anchor and non-anchor organs and dropping $n$ for clarity, our organ segmentation module generates the anchor and non-anchor organ predictions at every voxel location, $j$, and every output class, $c$:

$$\hat{Y}^{A}_{c}(j) = p^{A}\left(Y^{A}(j) = c \mid X; \mathbf{W}^{A}\right), \qquad \hat{Y}^{\neg A}_{c}(j) = p^{\neg A}\left(Y^{\neg A}(j) = c \mid X, \hat{Y}^{A}; \mathbf{W}^{\neg A}\right), \tag{1}$$

where $p^{(*)}(\cdot)$ denotes the CNN functions and $\hat{Y}^{(*)}_{c}$ the output segmentation maps. Here, we combine both anchor and non-anchor organ predictions into an overall prediction map $\hat{Y}^{\hat{A}} = \hat{Y}^{A} \cup \hat{Y}^{\neg A}$. Predictions are vector-valued 3D masks, as they provide a pseudo-probability for every class. $\mathbf{W}^{(*)}$ represents the corresponding CNN parameters.
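The two-branch data flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the volumes are flattened toy lists, `toy_anchor_net` and `toy_non_anchor_net` are hypothetical threshold-based stand-ins for the two CNN branches, and the assignment of "lung" to the anchor category and "esophagus" to the non-anchor category is an assumption made only for this example.

```python
from typing import Callable, Dict, List

Volume = List[float]          # flattened CT intensities (toy stand-in for a 3D array)
Masks = Dict[str, List[int]]  # organ name -> flattened binary mask


def stratified_segment(
    ct: Volume,
    anchor_net: Callable[[Volume], Masks],
    non_anchor_net: Callable[[Volume, Masks], Masks],
) -> Masks:
    """Segment anchor organs first, then non-anchor organs conditioned on them."""
    anchor_pred = anchor_net(ct)                       # branch 1: CT only
    non_anchor_pred = non_anchor_net(ct, anchor_pred)  # branch 2: CT + anchor predictions
    overlap = set(anchor_pred) & set(non_anchor_pred)
    if overlap:
        raise ValueError(f"organ predicted by both branches: {sorted(overlap)}")
    # Overall prediction map: union of anchor and non-anchor predictions.
    return {**anchor_pred, **non_anchor_pred}


# Hypothetical stand-ins for the two CNN branches (simple thresholds, not real models).
def toy_anchor_net(ct: Volume) -> Masks:
    return {"lung": [1 if v < -500 else 0 for v in ct]}


def toy_non_anchor_net(ct: Volume, anchors: Masks) -> Masks:
    lung = anchors["lung"]
    return {
        "esophagus": [
            1 if (0 <= v < 100 and not in_lung) else 0
            for v, in_lung in zip(ct, lung)
        ]
    }


ct = [-800.0, -700.0, 40.0, 60.0, 300.0]
organs = stratified_segment(ct, toy_anchor_net, toy_non_anchor_net)
```

The point of the sketch is only the ordering: the second branch never runs without the first branch's output, which mirrors how the non-anchor prediction in Eq. (1) is conditioned on the anchor prediction.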

2.2 Anatomical Context Encoded LNS Parsing

Segmenting LNS from CT appearance alone can be error prone, since LNS relies heavily on the spatial context of adjacent anatomical structures. Emulating the clinical practice of the IASLC guidelines, we incorporate the referencing organs into the training process of LNS parsing. Given $C_L$ classes of LNS, as illustrated in Fig. 2, we combine the above organ predictions with the CT image to create a multi-channel input $[X, \hat{Y}^{A}, \hat{Y}^{\neg A}]$, where $X$ is the CT image and $\hat{Y}^{A}$ and $\hat{Y}^{\neg A}$ are the anchor and non-anchor organ predictions:

$$\hat{Y}^{L}_{c}(j) = p^{L}\left(Y^{L}(j) = c \mid X, \hat{Y}^{A}, \hat{Y}^{\neg A}; \mathbf{W}^{L}\right). \tag{2}$$

Thereupon, the LNS parsing module leverages both the CT appearance and the predicted anatomical structures, implicitly encoding the spatial distributions of the referencing organs during training. Similar to Eq. (1), we have the LNS prediction in its vector-valued form as $\hat{Y}^{L}$.
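To make the multi-channel input concrete, the sketch below stacks a CT volume with one binary channel per organ. It assumes, for illustration only, that the organ predictions are available as a single flattened label map with 0 as background; the function names `one_hot_channels` and `build_lns_input` are hypothetical and not part of the paper or of nnU-Net.

```python
from typing import List


def one_hot_channels(label_map: List[int], num_classes: int) -> List[List[int]]:
    """Turn a flattened organ label map (0 = background) into one binary channel per organ."""
    return [[1 if v == c else 0 for v in label_map] for c in range(1, num_classes + 1)]


def build_lns_input(
    ct: List[float], organ_labels: List[int], num_organs: int
) -> List[List[float]]:
    """Stack the CT channel with the organ channels: [CT, organ_1, ..., organ_K]."""
    if len(ct) != len(organ_labels):
        raise ValueError("CT and organ label map must cover the same voxels")
    channels = [list(ct)]
    for channel in one_hot_channels(organ_labels, num_organs):
        channels.append([float(x) for x in channel])
    return channels


# Toy example with 4 voxels and 2 organ classes.
ct = [-800.0, 40.0, 60.0, 300.0]
organ_labels = [1, 2, 2, 0]
lns_input = build_lns_input(ct, organ_labels, num_organs=2)
```

With all 22 referencing organs, the same construction would give 23 input channels (one CT channel plus 22 organ channels).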

2.2.1 Key Organ Auto-search

The 22 referencing organs are selected beforehand according to the IASLC guideline. Nevertheless, for deep learning based LNS model training, those manually selected organs might not lead to the optimal performance. Considering the potential variations in organ location and size distributions, and the differences in automated organ segmentation accuracy, we hypothesize that the deep LNS parsing model would benefit from an automated referencing organ selection process that is tailored to this purpose. Hence, we use differentiable neural search [guo2020organ] to search for the key organs by applying a channel-weighting strategy to the input organ masks. We make the search space continuous by relaxing the selection of the referencing organs to a Softmax function over the channel weights of the one-hot organ predictions $\hat{Y}^{\hat{A}}$ (the combined anchor and non-anchor organ prediction map). For $C_{\hat{A}}$ organ classes, we define a set of $C_{\hat{A}}$ learnable logits, one per channel, denoted as $\alpha_c, \forall c \in [1 \ldots C_{\hat{A}}]$. The channel weight $\phi_c$ for a referencing organ is defined as:

$$\phi_c = \frac{\exp(\alpha_c)}{\sum_{m=1}^{C_{\hat{A}}} \exp(\alpha_m)}, \qquad \Phi = [\phi_1, \ldots, \phi_{C_{\hat{A}}}], \tag{3}$$

$$F(\hat{Y}^{\hat{A}}_{c}, \phi_c) = \phi_c \cdot \hat{Y}^{\hat{A}}_{c}, \tag{4}$$

where $\Phi$ denotes the set of channel weights and $F(\hat{Y}^{\hat{A}}_{c}, \phi_c)$ denotes the channel-wise multiplication between the scalar $\phi_c$ and the organ prediction $\hat{Y}^{\hat{A}}_{c}$. The input of the LNS parsing model becomes $[X, F(\hat{Y}^{\hat{A}}, \Phi)]$. As the result of the key organ auto-search, we select the organs with the top-$n$ weights to be the searched $n$ key organs. In this paper, we heuristically select $n = 6$ based on the experimental results. Last, we train the LNS parsing model using the combination of the original CT images and the auto-selected top-$n$ key organs' segmentation predictions.
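The channel-weighting step can be illustrated without any deep learning framework. The sketch below computes a softmax over per-organ logits, scales each organ channel by its weight, and picks the top-n organs. The logit values are invented for illustration (in the method they would be learned jointly with the network), and the function names are hypothetical.

```python
import math
from typing import Dict, List


def channel_weights(logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax over per-organ logits (shifted by the max for numerical stability)."""
    peak = max(logits.values())
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def weight_channels(
    organ_channels: Dict[str, List[float]], weights: Dict[str, float]
) -> Dict[str, List[float]]:
    """Channel-wise multiplication of each organ mask by its scalar weight."""
    return {
        name: [weights[name] * v for v in channel]
        for name, channel in organ_channels.items()
    }


def top_n_organs(weights: Dict[str, float], n: int) -> List[str]:
    """Organs with the n largest channel weights, highest first."""
    return sorted(weights, key=weights.get, reverse=True)[:n]


# Hypothetical logits for four organs; these numbers are NOT from the paper.
hypothetical_logits = {"esophagus": 2.0, "heart": 1.0, "trachea": -1.0, "lung": -0.5}
weights = channel_weights(hypothetical_logits)
scaled = weight_channels({"heart": [1.0, 0.0, 1.0]}, weights)
key_organs = top_n_organs(weights, n=2)
```

Because the softmax is differentiable, the logits can be optimized by gradient descent during the search phase; the hard top-n selection is applied only afterwards, when the final LNS parsing model is trained on the selected organs.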

3 Experimental Results

Dataset. We collected 98 contrast-enhanced venous-phase CT images of patients with esophageal cancer who underwent surgery and/or radiotherapy treatments. A board-certified radiation oncologist with 15 years of experience annotated each patient with 3D masks of the 12 LNS, the involved LNs (if any), and the 22 referencing organs related to LNS according to the IASLC guideline. The 12 annotated LN stations are: S1 (left + right), S2 (left + right), S3 (anterior + posterior), S4 (left + right), S5, S6, S7, S8. The average CT image size is voxels with an average resolution of mm. Extensive four-fold cross-validation (CV), separated at the patient level, was conducted. We report the segmentation performance using the Dice score (DSC) in percent, and the Hausdorff distance (HD) and average surface distance (ASD) in mm.
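As a small illustration of the evaluation setup, the sketch below computes a Dice score on flattened binary masks and assigns patients to four folds so that no patient appears in more than one fold. Two points are assumptions of this sketch rather than statements from the paper: the convention of returning 100% when both masks are empty, and the deterministic round-robin fold assignment. The patient IDs are placeholders.

```python
from typing import List, Sequence


def dice_score(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Dice score in percent between two flattened binary masks."""
    if len(pred) != len(gt):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, g in zip(pred, gt) if p and g)
    denominator = sum(1 for p in pred if p) + sum(1 for g in gt if g)
    if denominator == 0:
        return 100.0  # assumed convention: two empty masks agree perfectly
    return 100.0 * 2 * intersection / denominator


def patient_folds(patient_ids: Sequence[str], k: int = 4) -> List[List[str]]:
    """Round-robin split of unique patient IDs into k disjoint folds."""
    unique_ids = sorted(set(patient_ids))
    return [unique_ids[i::k] for i in range(k)]


example_dice = dice_score([1, 1, 0, 0], [1, 0, 1, 0])

# Placeholder IDs for a 98-patient cohort.
folds = patient_folds([f"P{i:03d}" for i in range(98)], k=4)
fold_sizes = [len(fold) for fold in folds]
```

Splitting on patient IDs rather than on individual scans is what "separated at the patient level" guards against: data from the same patient never ends up in both training and validation.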

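| LNS | CT only | CT + 22 organ GT | CT + 22 organ pred | CT + 6 searched organ pred |
|---|---|---|---|---|
| DSC (%) | | | | |
| S1 Left | 78.1 ± 6.8 | 84.3 ± 4.5 | 82.3 ± 4.6 | **85.1 ± 4.0** |
| S1 Right | 76.8 ± 5.0 | 84.3 ± 3.4 | 82.2 ± 3.4 | **85.0 ± 4.1** |
| S2 Left | 66.9 ± 11.4 | 75.8 ± 9.0 | 73.7 ± 8.9 | **76.1 ± 8.2** |
| S2 Right | 70.7 ± 8.5 | 74.8 ± 7.6 | 72.8 ± 7.6 | **77.5 ± 6.4** |
| S3 Anterior | 77.4 ± 4.9 | 79.8 ± 5.6 | 79.7 ± 5.6 | **81.5 ± 4.9** |
| S3 Posterior | 84.6 ± 3.1 | 87.9 ± 2.8 | 87.8 ± 2.9 | **88.6 ± 2.7** |
| S4 Left | 74.1 ± 8.2 | 77.0 ± 8.9 | 76.9 ± 8.9 | **77.9 ± 9.4** |
| S4 Right | 73.8 ± 8.9 | 74.9 ± 9.3 | 74.9 ± 9.4 | **76.7 ± 8.3** |
| S5 | 72.6 ± 6.7 | 73.2 ± 7.4 | 73.2 ± 7.4 | **77.9 ± 8.0** |
| S6 | 72.4 ± 5.7 | 74.9 ± 4.4 | 74.8 ± 4.5 | **75.7 ± 4.3** |
| S7 | 85.0 ± 5.1 | 86.6 ± 5.8 | 86.6 ± 5.8 | **88.0 ± 6.1** |
| S8 | 80.9 ± 6.1 | 84.0 ± 5.9 | 82.0 ± 5.9 | **84.3 ± 6.3** |
| Average | 76.1 ± 6.7 | 79.8 ± 6.2 | 78.9 ± 6.3 | **81.1 ± 6.1** |
| HD (mm) | | | | |
| S1 Left | 11.9 ± 3.2 | 12.3 ± 6.0 | 27.6 ± 38.8 | **10.3 ± 4.1** |
| S1 Right | 18.0 ± 29.3 | 10.6 ± 2.6 | 61.1 ± 97.6 | **9.7 ± 1.8** |
| S2 Left | 13.3 ± 9.2 | 9.7 ± 3.1 | 35.6 ± 76.9 | **9.2 ± 3.1** |
| S2 Right | 36.3 ± 61.7 | 10.8 ± 3.0 | 10.8 ± 3.0 | **9.5 ± 3.2** |
| S3 Anterior | 41.7 ± 62.4 | 13.5 ± 4.9 | 50.4 ± 79.1 | **12.2 ± 4.3** |
| S3 Posterior | 9.1 ± 3.3 | 8.0 ± 2.0 | 18.0 ± 30.9 | **7.6 ± 1.9** |
| S4 Left | 11.5 ± 4.9 | 14.7 ± 22.2 | 14.5 ± 22.2 | **9.8 ± 3.8** |
| S4 Right | 32.8 ± 69.7 | **9.8 ± 3.5** | 16.2 ± 21.5 | **9.8 ± 3.6** |
| S5 | 36.4 ± 56.4 | 20.5 ± 35.2 | 38.1 ± 60.3 | **10.9 ± 4.0** |
| S6 | 19.2 ± 30.6 | 8.6 ± 2.5 | 52.5 ± 85.3 | **8.5 ± 2.7** |
| S7 | 26.3 ± 42.6 | 9.6 ± 3.7 | 9.6 ± 3.7 | **9.5 ± 3.5** |
| S8 | 14.5 ± 6.0 | 13.6 ± 5.7 | 13.1 ± 5.8 | **12.2 ± 6.2** |
| Average | 22.6 ± 31.6 | 11.8 ± 7.9 | 28.9 ± 43.8 | **9.9 ± 3.5** |
| ASD (mm) | | | | |
| S1 Left | 1.6 ± 0.8 | 1.3 ± 0.6 | 1.4 ± 1.0 | **0.9 ± 0.5** |
| S1 Right | 1.8 ± 0.8 | 1.2 ± 0.5 | 1.6 ± 1.1 | **0.9 ± 0.5** |
| S2 Left | 1.4 ± 0.8 | 1.0 ± 0.6 | 1.3 ± 0.8 | **0.8 ± 0.6** |
| S2 Right | 1.5 ± 0.8 | 1.3 ± 0.7 | 1.3 ± 0.7 | **1.0 ± 0.7** |
| S3 Anterior | 1.0 ± 0.8 | 0.7 ± 0.4 | 0.9 ± 0.9 | **0.6 ± 0.4** |
| S3 Posterior | 0.9 ± 0.5 | **0.6 ± 0.3** | 0.8 ± 1.1 | **0.6 ± 0.4** |
| S4 Left | 1.0 ± 0.6 | 1.4 ± 2.7 | 1.2 ± 1.6 | **0.8 ± 0.6** |
| S4 Right | 1.5 ± 1.0 | 1.4 ± 1.0 | 1.5 ± 1.0 | **1.3 ± 1.0** |
| S5 | 1.3 ± 0.6 | 1.9 ± 3.4 | 1.6 ± 1.8 | **1.0 ± 0.5** |
| S6 | 0.8 ± 0.4 | 0.7 ± 0.3 | 1.0 ± 1.1 | **0.6 ± 0.3** |
| S7 | 0.9 ± 0.7 | 0.8 ± 0.6 | 0.8 ± 0.6 | **0.7 ± 0.6** |
| S8 | 1.7 ± 1.2 | 1.6 ± 1.1 | 1.6 ± 1.1 | **1.3 ± 1.3** |
| Average | 1.3 ± 0.7 | 1.1 ± 1.0 | 1.3 ± 1.1 | **0.9 ± 0.6** |

Table 1: Mean DSCs, HDs, and ASDs, and their standard deviations, of LNS parsing performance using: (1) only CT appearance; (2) CT + all 22 referencing organ ground-truth masks; (3) CT + all 22 referencing organ predicted masks; (4) CT + auto-searched 6 referencing organ predicted masks. The best performance scores are shown in bold.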

Implementation details. We adopt nnU-Net [isensee2020nnu] with DSC+CE losses as our backbone for all experiments due to its high accuracy on many medical image segmentation tasks. nnU-Net was proposed to automatically adapt different preprocessing strategies (i.e., the training image patch size, resolution, and learning rate) to a given 3D medical imaging dataset. We use the default nnU-Net settings for our model training. The total number of training epochs is 1000. For the organ auto-search parameters (the learnable channel logits), we first keep them fixed for an initial set of epochs and then alternately update them and the network weights for a further set of epochs. The remaining settings are the same as the default nnU-Net setup. We implemented our DeepStationing method in PyTorch, and an NVIDIA Quadro RTX 8000 was used for training. The average training time is 2.5 GPU days, and the average inference time is 3 minutes.

Quantitative Results. We first evaluate the performance of our stratified referencing organ segmentation in terms of the average DSC, HD, and ASD for the anchor and non-anchor organs. We also train a model that segments all organs using only one nnU-Net. Its average DSCs for the anchor, non-anchor, and all organs are each lower than those of the stratified version. The stratified organ segmentation demonstrates high accuracy, which provides robust organ predictions for the subsequent LNS parsing model.

Figure 3: (a) Examples of LNS parsing results using different setups. For better comparison, red arrows are used to depict visual improvements. (b) The bottom charts demonstrate the performance using different numbers of searched referencing organs.

Table 1 outlines the quantitative comparisons of the different deep LNS parsing setups. Columns 1 to 3 show the results using: 1) only CT images, 2) CT + all ground-truth organ masks, and 3) CT + all predicted organ masks. Using only CT images, LNS parsing exhibits the lowest DSC, averaging 76.1%, with an average HD of 22.6 mm. For example, distant false predictions are observed in the first image row of Fig. 3, and a false-positive S3 posterior (in pink) is predicted between S1 and S2. When adding ground-truth organ masks as spatial context, both DSC and HD show marked improvements: from 76.1% to 79.8% in DSC and from 22.6 mm to 11.8 mm in HD. This verifies the importance and effectiveness of referencing organs in inferring LNS boundaries. However, when the predicted masks of the 22 organs are used (the real testing condition), HD increases significantly, from 11.8 mm to 28.9 mm, compared to using ground-truth organ masks. This shows the necessity of selecting the key organs suited to the deep parsing model. Finally, using the top-6 auto-searched referencing organs, our DeepStationing model achieves the best performance, reaching 81.1 ± 6.1% DSC, 9.9 ± 3.5 mm HD, and 0.9 ± 0.6 mm ASD. Qualitative examples illustrating these performance improvements are shown in Fig. 3.
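The differences between setups can be read directly off the "Average" rows of Table 1. The short sketch below copies those mean values and computes the differences; only the dictionary keys and the helper name `delta` are introduced here for illustration.

```python
# Mean values copied from the "Average" rows of Table 1 (DSC in %, HD and ASD in mm).
table1_avg = {
    "CT only": {"DSC": 76.1, "HD": 22.6, "ASD": 1.3},
    "+22 organ GT": {"DSC": 79.8, "HD": 11.8, "ASD": 1.1},
    "+22 organ pred": {"DSC": 78.9, "HD": 28.9, "ASD": 1.3},
    "+6 searched pred": {"DSC": 81.1, "HD": 9.9, "ASD": 0.9},
}


def delta(setup: str, baseline: str, metric: str) -> float:
    """Difference setup - baseline for one metric, rounded to one decimal place."""
    return round(table1_avg[setup][metric] - table1_avg[baseline][metric], 1)


dsc_gain_over_ct = delta("+6 searched pred", "CT only", "DSC")
hd_change_over_ct = delta("+6 searched pred", "CT only", "HD")
hd_penalty_pred_vs_gt = delta("+22 organ pred", "+22 organ GT", "HD")
```

Read this way, the searched 6-organ setup gains 5.0 DSC points and reduces HD by 12.7 mm relative to CT only, while swapping ground-truth masks for predicted masks of all 22 organs raises HD by 17.1 mm.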

We auto-search for the organs that are tailored to optimize the LNS parsing performance. Using an interval of 3, we train 7 additional LNS parsing models by including the top-3 up to the top-21 organs. The auto-searched ranking of the 22 organs is as follows: esophagus, aortic arch, ascending aorta, heart, spine, sternum, V.BCV (R+L), V.pulmonary, descending aorta, V.IJV (R+L), A.CCA (R+L), V.SVC, A.pulmonary, V.azygos, bronchus (R+L), lung (R+L), trachea, where 'A' and 'V' denote artery and vein. The quantitative LNS parsing results for the top-n organs are illustrated in the bottom charts of Fig. 3. As more organs are gradually included, the DSC first improves, then drops slightly once more than the top-6 organs are included. The performance drops sharply after more than the top-9 organs are included, then becomes steady beyond the top-15 organs. This demonstrates that the deep LNS parsing model does not need a complete set of referencing organs to capture the LNS boundaries. We choose the top-6 as our final key organs based on these experimental results. We notice that the trachea, lungs, and bronchi are surprisingly ranked in the bottom 5 of the auto-search, although previous works [lu2011automatic, liu2016mediastinal] manually selected them for LNS parsing. A likely reason is that these organs are usually filled with air and have clear boundaries, while LNS does not include air or air-filled organs. With the help of the other key organs found, it is relatively straightforward for the LNS parsing CNN to distinguish them and reject the false positives located in those air-filled organs.

We further include 6 ablation studies and segment LNS using: (1) 6 randomly selected organs; (2) the 6 organs with the best organ segmentation accuracy; (3) the anchor organs; (4) 6 organs recommended by senior oncologists; (5) predictions of the searched 6 organs from a less accurate non-stratified organ segmentor; (6) ground-truth masks of the searched 6 organs. The 6 randomly selected organs are: V.BCV (L), V.pulmonary, V.IJV (R), heart, spine, trachea. The 6 organs with the best segmentation accuracy are: lungs (R+L), descending aorta, heart, trachea, spine. The 6 organs recommended by oncologists are: trachea, aortic arch, spine, lungs (R+L), descending aorta. The DSCs for setups (1-6) are 77.2%, 78.2%, 78.6%, 79.0%, 80.2%, and 81.7%; the HDs are 19.3 mm, 11.8 mm, 12.4 mm, 11.0 mm, 10.1 mm, and 8.6 mm, respectively. In comparison to the LNS predictions using only CT images, the ablation studies demonstrate that using the referencing organs for LNS segmentation is the key contributor to the performance gain, and that the selection and the quality of the supporting organs are the main factors behind the performance boost. For example, comparing our main results with setups (5) and (6) shows that better delineation of the searched organs leads to superior LNS segmentation performance.
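The top-n sweep described above (top-3 up to top-21 at an interval of 3) can be written down explicitly. The sketch below takes the ranking as listed in the text and builds the organ subset for each n. One detail is an assumption of this sketch: paired "(R+L)" entries are expanded into a right and a left organ in that order, since the text does not state how the two sides are ranked relative to each other.

```python
from typing import Dict, List

# Auto-searched ranking as listed in the text; "(R+L)" marks paired organs.
RANKING = [
    "esophagus", "aortic arch", "ascending aorta", "heart", "spine", "sternum",
    "V.BCV (R+L)", "V.pulmonary", "descending aorta", "V.IJV (R+L)", "A.CCA (R+L)",
    "V.SVC", "A.pulmonary", "V.azygos", "bronchus (R+L)", "lung (R+L)", "trachea",
]

PAIR_SUFFIX = " (R+L)"


def expand_pairs(ranking: List[str]) -> List[str]:
    """Expand each paired entry into two organs (R before L is an assumption)."""
    organs: List[str] = []
    for name in ranking:
        if name.endswith(PAIR_SUFFIX):
            base = name[: -len(PAIR_SUFFIX)]
            organs.extend([f"{base} (R)", f"{base} (L)"])
        else:
            organs.append(name)
    return organs


def top_n_subsets(
    organs: List[str], start: int = 3, stop: int = 21, step: int = 3
) -> Dict[int, List[str]]:
    """Organ subsets for n = start, start + step, ..., stop."""
    return {n: organs[:n] for n in range(start, stop + 1, step)}


all_organs = expand_pairs(RANKING)
subsets = top_n_subsets(all_organs)
```

The expansion yields 22 organs, and the sweep yields 7 subsets (n = 3, 6, 9, 12, 15, 18, 21), with the n = 6 subset matching the final key organs.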

Comparison to previous work. We compare DeepStationing to the most relevant previous approach [liu2016mediastinal], which exploits heuristically pre-defined spatial margins for LNS inference. DeepStationing outperforms [liu2016mediastinal] in DSC, HD, and ASD. For ease of comparison, similar to [liu2016mediastinal], we also merge our LNS into four LN zones, i.e., supraclavicular (S1), superior (S2, S3, and S4), aortic (S5 and S6), and inferior (S7 and S8) zones, and calculate the accuracy of LN instances that are correctly located in the predicted zones. DeepStationing achieves a higher average LN instance counting accuracy than [liu2016mediastinal]. We additionally tested 2 backbone networks: 3D PHNN (a 3D U-Net with a lightweight decoding path) and 2D U-Net. The DSCs of 3D PHNN and 2D U-Net are 79.5% and 78.8%, respectively. A likely reason for the performance drop is the loss of boundary precision and 3D information, respectively.
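The zone-level evaluation can be sketched as a lookup from station to zone followed by a simple accuracy count. The station-to-zone mapping below is taken from the text; how each LN instance is assigned a predicted station (for example, by looking up the predicted LNS label at the LN location) is not specified there, so the per-instance station labels in the example are hypothetical inputs.

```python
from typing import Sequence

# Station-to-zone mapping as given in the text.
STATION_TO_ZONE = {
    "S1": "supraclavicular",
    "S2": "superior", "S3": "superior", "S4": "superior",
    "S5": "aortic", "S6": "aortic",
    "S7": "inferior", "S8": "inferior",
}


def zone_of(station: str) -> str:
    """Map a station label such as 'S3 Anterior' or 'S5' to its LN zone."""
    return STATION_TO_ZONE[station.split()[0]]


def zone_accuracy(true_stations: Sequence[str], predicted_stations: Sequence[str]) -> float:
    """Percentage of LN instances whose predicted station falls in the correct zone."""
    if len(true_stations) != len(predicted_stations) or not true_stations:
        raise ValueError("need two non-empty station lists of equal length")
    hits = sum(
        zone_of(t) == zone_of(p) for t, p in zip(true_stations, predicted_stations)
    )
    return 100.0 * hits / len(true_stations)


# Hypothetical LN instances (true station vs. station of the predicted LNS).
true_stations = ["S2 Left", "S4 Right", "S5", "S7"]
predicted_stations = ["S3 Anterior", "S4 Left", "S7", "S8"]
accuracy = zone_accuracy(true_stations, predicted_stations)
```

Merging stations into zones makes the metric tolerant to confusions within a zone (such as S2 versus S3), which is why a station-level error can still count as a zone-level hit.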

4 Conclusion

In this paper, we propose DeepStationing, a novel framework that performs key organ auto-search based LNS parsing on contrast-enhanced CT images. Emulating clinical practice, we segment the referencing organs in the thoracic region and use the segmentation results to guide LNS parsing. Rather than directly employing the key organs suggested by oncologists, we search for the key organs automatically, treating the selection as a neural architecture search problem aimed at optimal performance. Evaluated on the most comprehensive LNS dataset to date, DeepStationing outperforms the most relevant previous approach by a significant margin in DSC and is consistent with clinical explanation. This work is an important step towards reliable and automated LNS segmentation.