Organ at Risk Segmentation for Head and Neck Cancer using Stratified Learning and Neural Architecture Search

04/17/2020 ∙ by Dazhou Guo, et al. ∙ 6

Organ-at-risk (OAR) segmentation is a critical step in radiotherapy of head and neck (H&N) cancer, where inconsistencies across radiation oncologists and prohibitive labor costs motivate automated approaches. However, leading methods use standard fully convolutional network workflows, which are challenged when the number of OARs becomes large, e.g. > 40. For such scenarios, insights can be gained from the stratification approaches seen in manual clinical OAR delineation. This is the goal of our work, where we introduce stratified organ at risk segmentation (SOARS), an approach that stratifies OARs into anchor, mid-level, and small & hard (S&H) categories. SOARS stratifies across two dimensions. The first dimension is that distinct processing pipelines are used for each OAR category. In particular, inspired by clinical practices, anchor OARs are used to guide the mid-level and S&H categories. The second dimension is that distinct network architectures are used to manage the significant contrast, size, and anatomy variations between different OARs. We use differentiable neural architecture search (NAS), allowing the network to choose among 2D, 3D or Pseudo-3D convolutions. Extensive 4-fold cross-validation on 142 H&N cancer patients with 42 manually labeled OARs, the most comprehensive OAR dataset to date, demonstrates that both pipeline- and NAS-stratification significantly improve quantitative performance over the state-of-the-art (from 69.52 [...]), providing a principled means to manage the highly complex segmentation space of OARs.




1 Introduction

Figure 1: Illustration of 42 OARs in 3D demonstrating their various contrasts, sizes, and shapes in RTCT.

Head and neck (H&N) cancer is one of the most common cancers worldwide [jemal2011global]. High-precision radiation therapy, e.g. intensity-modulated radiotherapy, has been widely used for H&N cancer treatment because of its ability to deliver highly conformal doses. In this process, the radiation dose to normal anatomical structures, i.e. organs at risk (OARs), should be controlled to minimize post-treatment complications [harari2010emphasizing]. This requires accurate delineation of tumors and OARs in RTCT images [nikolov2018deep, cardenas2018auto, jin2019accurate, jin2019deep, lin2019deep, tang2019clinically]. Clinically, OAR segmentation is predominantly carried out manually by radiation oncologists. Manual delineation is not only time consuming, e.g. 2 hrs for 9 OARs, but also suffers from large inter-practitioner variability [harari2010emphasizing]. Unsurprisingly, as more OARs are included, time requirements increase significantly, limiting the number of patients who may receive timely radiotherapy [mikeljevic2004trends]. These issues have spurred efforts toward automatic OAR segmentation in H&N cancer [raudaschl2017evaluation]. Despite this progress, performance gaps remain, calling for approaches better tailored to this distinct and challenging problem. This is the goal of our work.

By their nature, H&N OARs are 1) complex in anatomical shape, 2) dense in spatial distribution, 3) large in size variation, and 4) low in RTCT image contrast. Currently, deep convolutional neural networks (CNNs) are a dominant approach [tong2018hierarchical, wang2017hierarchical, ibragimov2017segmentation, tong2018fully, gao2019focusnet, nikolov2018deep, tang2019clinically, zhu2019anatomynet, agn2019modality]. However, existing methods either perform whole volume segmentation [nikolov2018deep, zhu2019anatomynet] or segmentation-by-detection [gao2019focusnet, tang2019clinically]. In both cases, model optimization becomes increasingly difficult as greater numbers of OARs need to be segmented. Leveraging insights from clinical practices can help ease the corresponding difficulties.

Within the clinic, radiation oncologists typically use easy OARs, e.g. the eyes, brain stem, and mandible, as anchors when delineating harder OARs, such as different types of soft-tissue H&N glands [tamboli2011computed]. Figure 1 visually illustrates this stratification. This process suggests that automated solutions could also benefit from stratifying OARs, both to create anchors and to create tailor-made analysis workflows for each stratification. Indeed, Gao et al. [gao2019focusnet] showed that exploiting two branches for OAR segmentation boosts overall performance. However, large OARs did not serve as support to small OARs in that work. Moreover, the network architecture was manually crafted and fixed across OAR stratifications. Yet, given their highly distinct natures, different OARs likely require different network architectures for optimal performance. It is difficult to see how regular CNNs can meet these needs.

Our work fills this gap by introducing SOARS, a novel stratified learning framework to segment OARs. SOARS divides OARs into three levels, i.e. anchor, mid-level, and small & hard (S&H). Emulating clinical practice, each is processed using a tailored workflow. Anchor OARs are high in intensity contrast and low in inter- and intra-reader variability. Thus, these can be segmented first to provide informative location references to the harder categories. Mid-level OARs are low in contrast, but not inordinately small. We provide anchor-level predictions as additional input for mid-level segmentation as guidance and reference-based grounding. S&H OARs are very poor in contrast and very small. Similar to mid-level OARs, we use anchor OARs to guide S&H segmentation. However, we use a detection followed by segmentation strategy [gao2019focusnet] to better manage the extremely unbalanced class distributions across the entire volume. While this workflow provides specialized frameworks for each OAR category, data processing could be even better tailored, as it is unlikely that the same network architecture suits each stratification equally. Thus, we deploy an additional dimension of stratification, using neural architecture search (NAS) to automatically search the optimal architecture for each category. Concretely, we formulate the structure learning as a differentiable NAS [liu2018darts, liu2019auto, zhu2019v], allowing for an automatic selection across 2D, 3D or pseudo-3D (P3D) convolutions with kernel sizes of 3 or 5 at each convolutional block.

Using four-fold cross-validation, we evaluate SOARS on 142 RTCT images with 42 annotated OARs, the most comprehensive H&N OAR dataset to date. We demonstrate that both dimensions of our stratification, i.e. category-specific processing and NAS, significantly impact performance. We achieve an average Dice score (DSC) and Hausdorff distance (HD) of 75.14% and 6.98 mm, respectively (Table 3), a clear improvement over the non-stratified P-HNN baseline (67.62% and 9.39 mm). Compared to the state-of-the-art, a 3D Mask R-CNN based UaNet method [tang2019clinically], we produce improvements of 4.70% and 2.22 mm in DSC and HD, respectively. Validation on a public dataset (the MICCAI 2015 OAR Challenge [raudaschl2017evaluation]) further confirms these performance improvements. In summary, the contributions and novelty of this paper are threefold:

  • Segmenting a comprehensive set of OARs is essential for radiotherapy treatment planning in head and neck cancer. We work on the most clinically complete and desirable set of 42 OARs as compared to previous state-of-the-art work.

  • Our main methodological contribution is a framework that stratifies organs into different OAR categories, each of which is handled by a tailored segmentor (obtained via NAS). The method is a well-calibrated integration of organ stratification, multi-stage segmentation, and NAS, which work in synergy.

  • Our idea of stratifying the 42 OARs into three levels combines an emulation of oncologists' manual OAR contouring knowledge with the OARs' size distributions. To the best of our knowledge, this simple yet effective organ stratification scheme has not been studied in previous work for a segmentation and parsing task as complex as ours.

2 Related Works

OAR Segmentation. There is a large body of work on OAR segmentation. Atlas-based approaches [han2008atlas, isambert2008evaluation, saito2016joint, schreibmann2014multiatlas, sims2009pre, voet2011does] enjoy a prominent history [raudaschl2017evaluation, kosmin2019rapid]. Their main disadvantage is a reliance on accurate and efficient image registration [zhu2019anatomynet], which is challenged by shape variations, normal tissue removal, abnormal tissue growth, and image acquisition differences [wu2019aar]. Registration also often takes many minutes or even hours to complete. Another common approach is statistical shape or appearance models [cootes1995active, cootes2001active, rueckert2003automatic]. These have shown promise, but a prominent issue is that they can be limited to specific shapes described by the statistical model, which makes them less flexible when the number of OARs is large [fritscher2014automatic]. Of note is Tong et al. [tong2018hierarchical], who applied intensity and texture-based fuzzy models using a hierarchical stratification.

Figure 2: (a) SOARS stratifies OAR segmentation across two dimensions: distinct processing frameworks and distinct architectures. We execute the latter using differentiable NAS. (b) illustrates the backbone network (P-HNN) with NAS, which allows for an automatic selection across 2D, 3D, and P3D convolutions. (c) shows the NAS search space setting.

Recently, deep CNN based approaches have proven capable of delivering substantially better performance. Apart from early efforts [ibragimov2017segmentation], fully convolutional networks (FCNs) have quickly become the mainstream method [zhu2019anatomynet, nikolov2018deep, tong2018fully, jin2019accurate]. To address data imbalance issues when faced with S&H OARs, FocusNet [gao2019focusnet] and UaNet [tang2019clinically] adopt a segmentation-by-detection strategy to achieve better segmentation accuracy. However, neither approach stratifies OARs, and hence neither can use easier OARs as support for more difficult ones. Moreover, when the number of OARs is large, e.g. > 40, optimization becomes more difficult. Finally, their network architectures remain manually fixed, which is less optimal for the distinct OAR categories.

Stratified Learning. Stratification is an effective strategy to decouple a complicated task into easier sub-tasks. Computer vision has a long history of using this strategy. Several contextual learning models have been used to assist general object detection [heitz2008learning, rabinovich2007objects] within the conditional random field framework [lafferty2001conditional]. Instance learning, i.e. instance localization, segmentation and categorization, often stratifies the problem into multiple sub-tasks [dai2016instance, he2017mask, chen2019hybrid]. Within medical imaging, stratified statistical learning has also been used to recognize whether a candidate nodule connects to any other major lung anatomies [wu2010stratified]. Yet, the use of stratified learning for semantic segmentation, particularly in the deep-learning era, is still relatively understudied in medical imaging. Within OAR segmentation, Tong et al. [tong2018hierarchical] have applied a hierarchical stratification, but this used a non-deep fuzzy-connectedness model. We are the first to execute stratified learning for deep OAR segmentation.

Neural Architecture Search. This is the process of automatically discovering better network architectures. Many NAS methods exploit reinforcement learning or evolutionary algorithms [real2019regularized]. However, both strategies are extremely computationally demanding. Differentiable NAS methods [liu2018darts, liu2019auto, zhu2019v] realize all candidate architectures simultaneously during optimization, which limits the allowable or feasible search spaces. Nonetheless, these approaches are a highly practical means to tailor architectures. In this work, we follow the differentiable NAS formulation [liu2018darts, zhu2019v] to search the architectures for each of the three OAR stratifications. We explore the optimal kernel size and combination of 2D, 3D, and P3D configurations. As such, we are the first to apply NAS to OAR segmentation.

3 Methods

Figure 2 depicts the SOARS framework, which uses three processing branches to stratify anchor, mid-level and S&H OAR segmentation. A first stratification dimension is distinct processing frameworks. SOARS first segments the anchor OARs. Then, with the help of predicted anchors, mid-level and S&H OARs are segmented. For the most difficult category of S&H, we first detect center locations and then zoom in to segment the small OARs. The deeply-supervised 3D progressive holistically nested network (P-HNN) [harrison2017progressive] is adopted as the backbone for all three branches; it uses deep supervision to progressively propagate lower-level features to higher-level ones using a parameter-less pathway. We opt for this backbone due to its good reported performance in other RTCT works [jin2019accurate, jin2019deep]. A second dimension of stratification uses differentiable NAS to search distinct P-HNN convolutional blocks for each OAR category.
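The branch ordering described above can be sketched as plain control flow. This is only an illustrative outline under stated assumptions: the function name `soars_inference`, the callables passed in, and the toy flattened `Volume` type are all hypothetical stand-ins, not the authors' implementation, and real inputs would be 3D tensors processed by trained networks.

```python
from typing import Callable, Dict, List

Volume = List[float]  # toy flattened volume; a real pipeline would use 3D tensors


def soars_inference(
    ct: Volume,
    anchor_net: Callable[[List[Volume]], List[Volume]],
    mid_net: Callable[[List[Volume]], List[Volume]],
    detector: Callable[[List[Volume]], List[Volume]],
    crop: Callable[[Volume, Volume], Volume],
    sh_net: Callable[[List[Volume]], Volume],
) -> Dict[str, object]:
    """Run the three branches in the order described in the text (hypothetical API)."""
    # 1) Anchor branch: sees the CT only.
    anchor_maps = anchor_net([ct])

    # 2) Mid-level branch: CT concatenated channel-wise with anchor predictions.
    guided_input = [ct] + anchor_maps
    mid_maps = mid_net(guided_input)

    # 3) S&H branch, step one: regress one heat map per small organ
    #    from the same anchor-guided input.
    heat_maps = detector(guided_input)

    # 4) S&H branch, step two: crop a VOI around each detection and
    #    segment inside it from the cropped CT plus the cropped heat map.
    sh_maps = []
    for heat in heat_maps:
        ct_voi = crop(ct, heat)
        heat_voi = crop(heat, heat)
        sh_maps.append(sh_net([ct_voi, heat_voi]))

    return {"anchor": anchor_maps, "mid": mid_maps, "sh": sh_maps}
```

Passing the networks in as callables keeps the sketch independent of any particular backbone; in the paper all three branches share a P-HNN backbone whose blocks are searched per branch.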

3.1 Processing Stratification

As mentioned, SOARS segments OARs using three distinct frameworks, where OARs are divided according to clinician recommendations (the details for our OAR dataset are reported in Sec. 4.1). We denote the training data of $N$ data instances as $\mathbb{S} = \{X_n, Y_n^{A}, Y_n^{M}, Y_n^{S}\}_{n=1}^{N}$, where $X_n$, $Y_n^{A}$, $Y_n^{M}$, and $Y_n^{S}$ denote the input RTCT and ground-truth masks for anchor, mid-level, and S&H OARs, respectively. Here, we drop $n$, when appropriate, for clarity. Throughout, we will abuse matrix/vector notation, using boldface to denote vector-valued volumes and using vector concatenation as an operation across all voxel locations.

Anchor branch: Assuming we have $C$ classes, SOARS first uses the anchor branch to generate OAR prediction maps for every voxel location, $j$, and every output class, $c$:

$$\hat{Y}_c^{A}(j) = p^{A}\left(Y^{A}(j) = c \mid X; \mathbf{W}^{A}\right), \tag{1}$$

$$\hat{\mathbf{Y}}^{A} = \left[\hat{Y}_1^{A}, \ldots, \hat{Y}_C^{A}\right], \tag{2}$$

where $p^{A}(\cdot)$ and $\hat{Y}_c^{A}$ denote the CNN function and output segmentation maps, respectively. Here, predictions are vector-valued 3D masks, as they provide a pseudo-probability for every class. $\mathbf{W}^{(\cdot)}$ represents the corresponding CNN parameters.

Anchor OARs have high contrast compared to surrounding tissue or are in easy-to-locate regions; hence, it is relatively easy to segment them directly and robustly based on pure appearance and context features. Consequently, they are ideal candidates to support the segmentation of other OARs.

Mid-level branch: Most mid-level OARs are primarily soft tissue, which has low contrast and can be easily confused with other structures of similar intensity and shape. Direct segmentation can lead to false positives or over/under-segmentation. This can be addressed by using processing stratification to directly incorporate anchor predictions into mid-level learning, since the anchor predictions are robust and provide highly informative location and semantically-based cues. As demonstrated in Figure 2, we combine the anchor predictions, $\hat{\mathbf{Y}}^{A}$, with the RTCT, $X$, to create a multi-channel input, $[X, \hat{\mathbf{Y}}^{A}]$:

$$\hat{Y}_c^{M}(j) = p^{M}\left(Y^{M}(j) = c \mid X, \hat{\mathbf{Y}}^{A}; \mathbf{W}^{M}\right). \tag{3}$$

In this way, the mid-level branch leverages both the CT intensities and the anchor OAR guidance, which can be particularly helpful in managing regions with otherwise similar CT appearance. Like (2), we can collect mid-level predictions into a vector-valued entity $\hat{\mathbf{Y}}^{M}$.

Small & hard branch: In this branch, we further decouple segmentation into a detection followed by segmentation process. Directly segmenting the fine boundaries of S&H OARs from CT is very challenging due to the poor contrast and the extremely imbalanced foreground and background distributions when considering the entire volume. In contrast, detecting the center regions of S&H OARs is a much easier problem, since the H&N region has a relatively stable anatomical spatial distribution. This means that the rough locations of S&H OARs can be inferred from the CT context with confidence. Once the center location is detected, a localized region can be cropped out to focus on segmenting the fine boundaries in a zoom-in fashion. This has similarities to Gao et al.’s [gao2019focusnet] approach to segment small organs.

For detecting S&H OAR centers, we adopt a simple yet effective heat map regression method [wei2016convolutional, xu2018less], where the heat map labels are created at each organ center using a 3D Gaussian kernel. Similar to the mid-level branch, to increase detection robustness and accuracy we also combine the anchor branch predictions, $\hat{\mathbf{Y}}^{A}$, with the RTCT, $X$, as the detection input channels:

$$\hat{\mathbf{H}} = f\left(X, \hat{\mathbf{Y}}^{A}; \mathbf{W}^{D}\right), \tag{4}$$

where $\hat{\mathbf{H}}$ denotes the predicted heat maps for every S&H OAR. Like the segmentation networks, we use the same P-HNN backbone for $f(\cdot)$. Given the resulting regressed heat map, we choose the voxel location corresponding to the highest value, and crop a volume of interest (VOI) using three times the extent of the maximum size of the OAR of interest. With the VOI cropped, SOARS can then segment the fine boundaries of the S&H OARs. As illustrated in Figure 2, we concatenate the output from Eq. (4) together with the cropped RTCT image as the input to the S&H OAR segmentation network:

$$\hat{Y}_c^{S}(j) = p^{S}\left(Y^{S}(j) = c \mid X, \hat{\mathbf{H}}; \mathbf{W}^{S}\right), \tag{5}$$

where it is understood that (5) operates only on the cropped region.
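To make the detect-then-crop step concrete, the following is a minimal sketch in pure Python. It builds a Gaussian heat-map label, picks the voxel with the highest value, and derives VOI bounds of three times an organ's maximum extent. The function names, the half-open bounds convention, the clipping at the volume border, and the use of voxel units for `sigma` are assumptions made for illustration; the paper specifies the standard deviation in millimetres and does not describe border handling.

```python
import math
from typing import List, Tuple

Point = Tuple[int, int, int]


def gaussian_heat_map(shape: Point, center: Point, sigma: float) -> List[List[List[float]]]:
    """Heat-map label: a 3D Gaussian peaking at the organ center (voxel units)."""
    zs, ys, xs = shape
    cz, cy, cx = center
    return [
        [
            [
                math.exp(-((z - cz) ** 2 + (y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
                for x in range(xs)
            ]
            for y in range(ys)
        ]
        for z in range(zs)
    ]


def argmax_voxel(heat: List[List[List[float]]]) -> Point:
    """Return the (z, y, x) index of the highest heat-map value."""
    best, best_pos = float("-inf"), (0, 0, 0)
    for z, plane in enumerate(heat):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value > best:
                    best, best_pos = value, (z, y, x)
    return best_pos


def voi_bounds(center: Point, max_extent: Point, shape: Point, scale: int = 3):
    """Half-open [start, stop) bounds per axis for a VOI of roughly
    scale * max_extent voxels, clipped to the volume (clipping is an assumption)."""
    bounds = []
    for c, extent, size in zip(center, max_extent, shape):
        half = (scale * extent) // 2
        bounds.append((max(0, c - half), min(size, c + half + 1)))
    return bounds
```

For example, a detection at `(3, 4, 5)` in an 8×8×8 volume with a maximum organ extent of 2 voxels per axis yields a VOI that is clipped where it would otherwise leave the volume.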

3.2 Architectural Stratification

While stratifying OARs into different processing frameworks with distinct inputs and philosophies is key to pushing performance, more can be done. Namely, considering the significant variations in OAR appearance, shape, and size, it is likely that each OAR type would benefit from segmentation branch architectures tailored to its needs. To do this, SOARS automatically searches network architectures for each branch, adding an additional dimension to the stratification. Throughout, we use P-HNN [harrison2017progressive] as the base backbone. The whole network structure is illustrated in Figure 2, in which the architecture is learned in a differentiable way [liu2018darts].

Let $\phi(\cdot\,; \omega_{x \times y \times z})$ denote a composite function of the following consecutive operations: batch normalization, a rectified linear unit, and a convolution with an $x \times y \times z$ dimension kernel. If one of the dimensions of the kernel is set to 1, it reduces to a 2D kernel. As shown in Eq. (6), we search a set of possible architectures that includes 2D convolutions, 3D convolutions, or pseudo-3D convolutions, each with a kernel size of either 3 or 5:

$$\begin{aligned}
\phi_{2D_3} &= \phi(\cdot\,; \omega_{3 \times 3 \times 1}), & \phi_{2D_5} &= \phi(\cdot\,; \omega_{5 \times 5 \times 1}), \\
\phi_{3D_3} &= \phi(\cdot\,; \omega_{3 \times 3 \times 3}), & \phi_{3D_5} &= \phi(\cdot\,; \omega_{5 \times 5 \times 5}), \\
\phi_{P3D_3} &= \phi(\phi(\cdot\,; \omega_{3 \times 3 \times 1}); \omega_{1 \times 1 \times 3}), & \phi_{P3D_5} &= \phi(\phi(\cdot\,; \omega_{5 \times 5 \times 1}); \omega_{1 \times 1 \times 5}), \\
\Phi &= \{\phi_{2D_3}, \phi_{2D_5}, \phi_{3D_3}, \phi_{3D_5}, \phi_{P3D_3}, \phi_{P3D_5}\},
\end{aligned} \tag{6}$$

where $\Phi$ denotes the search space of possible architectures. For simplicity, instead of a layer-by-layer architecture search, we use only one type of convolutional kernel to build each P-HNN convolutional block.

Similar to [liu2018darts, zhu2019v], we make the search space continuous by relaxing the categorical choice of a particular operation to a softmax over all 6 possible operations. More formally, if we index each possibility in (6) by $k$, then we can define a set of 6 learnable logits, one for each, denoted $\alpha_k$. A softmax can then be used to aggregate all possible architectures into one combined output, $\phi'$:

$$\gamma_k = \frac{\exp(\alpha_k)}{\sum_{m} \exp(\alpha_m)}, \tag{7}$$

$$\phi' = \sum_{k} \gamma_k \phi_k, \tag{8}$$

where we have dropped dependence on the input images for convenience. This creates a sort of super network that comprises all possible manifestations of (6), and it can be optimized in the same manner as standard networks. As Zhu et al. demonstrated [zhu2019v], this type of NAS scheme can produce significant gains within medical image segmentation. At the end of the NAS, the chosen network architecture of each block, $\tilde{\phi}$, can be determined by selecting the $\phi_k$ corresponding to the largest $\alpha_k$ value. If the index to this maximum is denoted $\tilde{k}$, then $\tilde{\phi} = \phi_{\tilde{k}}$. If we have $b$ blocks, then based on (8), the searched network can be represented as $\tilde{p}(\cdot\,; \tilde{\mathbf{W}}) = \tilde{\phi}^{b}(\tilde{\phi}^{b-1}(\cdots \tilde{\phi}^{1}(\cdot\,; \tilde{\omega}^{1}); \tilde{\omega}^{b-1}); \tilde{\omega}^{b})$, where $\tilde{p}$ denotes the searched network architecture. For consistency, we use the same strategy to search the network architecture for each branch of SOARS.
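The softmax relaxation over six candidate operations, and the final choice of the operation with the largest logit, can be illustrated with a small scalar toy. In this sketch the candidate "operations" are plain Python functions on a float rather than convolution blocks, and the names `CANDIDATES`, `mixed_output`, and `select_operation` are hypothetical; only the weighting and selection logic mirrors the description above.

```python
import math
from typing import Callable, List, Sequence

# Labels for the six candidates in the search space (naming is illustrative).
CANDIDATES = ["2D-k3", "2D-k5", "3D-k3", "3D-k5", "P3D-k3", "P3D-k5"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the architecture logits."""
    peak = max(logits)
    exps = [math.exp(a - peak) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_output(x: float, ops: Sequence[Callable[[float], float]],
                 logits: Sequence[float]) -> float:
    """Super-network output: softmax-weighted sum of every candidate's output."""
    weights = softmax(logits)
    return sum(w * op(x) for w, op in zip(weights, ops))


def select_operation(logits: Sequence[float]) -> str:
    """After the search, keep only the candidate with the largest logit."""
    k_star = max(range(len(logits)), key=lambda k: logits[k])
    return CANDIDATES[k_star]
```

With all logits equal, every candidate contributes equally to `mixed_output`; as training sharpens the logits, the mixture approaches the single operation that `select_operation` finally retains for that block.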

Anchor OARs           DSC     HD    ASD
Baseline            84.02   5.98   0.82
CT Only             84.14   5.25   0.79
CT+NAS              85.73   4.77   0.77

Mid-level OARs        DSC     HD    ASD
Baseline            63.68  12.97   3.48
CT Only             67.31  12.03   3.97
CT+Anchor           70.73  10.34   1.67
CT+Anchor+NAS       72.55   9.05   1.31

S&H OARs              DSC     HD    ASD
Baseline            60.97   4.86   0.98
CT Only             62.09   4.19   1.06
CT+Heat map         71.75   2.93   0.52
CT+Heat map+NAS     72.57   2.94   0.49

Table 1: Quantitative results of the ablation studies of the proposed method using 1 fold of the dataset. The baseline network is a 3D P-HNN. For S&H OARs, all methods except the baseline segment on the predicted VOI. Performance is measured by the Dice score (DSC, unit: %), Hausdorff distance (HD, unit: mm), and average surface distance (ASD, unit: mm).

Input        Dist (mm)
CT Only      3.25 ± 2.34
CT+Anchor    2.91 ± 1.74

Table 2: S&H OAR detection results measuring the average distance between regressed and true center points.

Figure 3: Qualitative mid-level OAR segmentation using different setups. The seven columns are seven representative axial slices in the RTCT image. For better comparison, we use red arrows to indicate the improvements. The rows show the RTCT image with a radiation oncologist's OAR delineations, the impact of using anchor OARs, which can help the segmentation of soft-tissue mid-level OARs, and the impact of NAS, indicating the necessity of adapting network architectures for different OARs.

4 Experiments

4.1 Datasets and Preprocessing

To evaluate performance, we collected 142 anonymized non-contrast RTCT images from H&N cancer patients, where 42 OARs are delineated during the target contouring process for radiotherapy (hereafter denoted as the H&N 42 dataset). Extensive 4-fold cross-validation, split at the patient level, was conducted on the H&N 42 dataset to report results. We compare against other state-of-the-art methods including P-HNN [harrison2017progressive], UNet [cciccek20163d], and UaNet [tang2019clinically]. To evaluate the effectiveness of SOARS, we conducted two ablation studies using 1 fold of the dataset. Furthermore, we examined our performance using the public MICCAI 2015 head and neck auto-segmentation challenge data (referred to hereafter as MICCAI2015). This external testing set contains 9 OARs with 15 test cases. Evaluation metrics: We report the segmentation performance using DSC in percentage, and HD and ASD in mm. Note that we use the HD metric instead of the HD95 reported in some previous works.
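As a reference for the two headline metrics, here is a minimal sketch of a Dice score in percent and a symmetric (maximum, not 95th-percentile) Hausdorff distance in millimetres, computed on sets of voxel coordinates. The function names and the `spacing` argument are illustrative assumptions. The paper does not describe its metric implementation, and in practice the Hausdorff distance is usually evaluated on surface voxels, so callers of this sketch would pass surface points for that purpose.

```python
import math
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def dice_percent(pred: Set[Voxel], truth: Set[Voxel]) -> float:
    """Dice score in percent between two binary masks given as voxel sets."""
    if not pred and not truth:
        return 100.0  # convention chosen here: two empty masks agree perfectly
    return 100.0 * 2.0 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff_mm(a: Set[Voxel], b: Set[Voxel],
                 spacing: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Symmetric Hausdorff distance between two non-empty point sets.

    `spacing` converts index offsets along each axis to millimetres.
    """
    def dist(p: Voxel, q: Voxel) -> float:
        return math.sqrt(sum(((pi - qi) * s) ** 2 for pi, qi, s in zip(p, q, spacing)))

    def directed(src: Set[Voxel], dst: Set[Voxel]) -> float:
        # Worst case, over src, of the distance to the nearest point in dst.
        return max(min(dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))
```

Because it takes the maximum rather than a percentile, this distance is sensitive to a single outlying voxel, which is why HD values are typically larger than the HD95 values reported elsewhere.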

H&N 42 OAR dataset: Each CT scan is accompanied by 42 OAR 3D masks annotated by an experienced oncologist. The average CT size is [...] voxels with an average resolution of [...] mm. The specific OAR stratification is as follows. Anchor OARs: brain stem, cerebellum, eye (left and right), mandible (left and right), spinal cord, and temporomandibular joint (left and right). Mid-level OARs: brachial plexus (left and right), basal ganglia (left and right), constrictor muscle (inferior, middle and superior), epiglottis, esophagus, hippocampus (left and right), larynx core, oral cavity, parotid (left and right), submandibular gland (left and right), temporal lobe (left and right), and thyroid (left and right). S&H OARs: cochlea (left and right), hypothalamus, inner ear (left and right), lacrimal gland (left and right), optic nerve (left and right), optic chiasm, pineal gland, and pituitary.

MICCAI2015 dataset: This dataset has been extensively used by researchers to evaluate atlas and deep learning based H&N OAR segmentation. It contains 33 training cases and 15 test cases with 9 OARs annotated. The 9 OARs are the brain stem, mandible, optic chiasm, optic nerve (left and right), parotid (left and right), and submandibular gland (left and right).

Image preprocessing: We apply a windowing of [-500, 1000] HU to every CT scan, covering the intensity range of our target OARs, from which we extract [...] VOIs as training samples for the anchor and mid-level branches as well as the detection module in the S&H branch. The heat map labels in the detection module are 3D Gaussian distributions with a standard deviation of 8 mm. The training VOIs are sampled in two manners: (1) we randomly extract VOIs centered within each of the OARs to ensure sufficient positive samples; (2) we randomly sample an additional 15 VOIs from the whole volume to obtain sufficient negative examples. This results in, on average, 70 VOIs per CT scan. We further augment the training data by applying random scaling between [...]. In testing, 3D sliding windows with sub-volumes of [...] and strides of [...] voxels are used. The probability maps of the sub-volumes are aggregated to obtain the whole volume prediction, taking on average [...] s to process one input volume using a single GPU.
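The windowing and sliding-window steps can be sketched along a single axis as follows. The [-500, 1000] HU window comes from the text; everything else is an assumption for illustration, including the function names, the rule that the last window is placed flush with the volume edge, and the use of simple averaging to aggregate overlapping probability maps (the text says only that they are "aggregated"). Window size and stride are left as parameters.

```python
from typing import Callable, List


def window_hu(value: float, low: float = -500.0, high: float = 1000.0) -> float:
    """Clip a CT intensity to the [-500, 1000] HU window."""
    return min(max(value, low), high)


def window_starts(length: int, size: int, stride: int) -> List[int]:
    """Start indices of sliding windows along one axis, covering every voxel."""
    if stride <= 0 or stride > size:
        raise ValueError("stride must be in [1, size] so windows leave no gaps")
    if size >= length:
        return [0]
    starts = list(range(0, length - size + 1, stride))
    if starts[-1] != length - size:
        starts.append(length - size)  # final window flush with the volume edge
    return starts


def aggregate_1d(length: int, size: int, stride: int,
                 predict: Callable[[int, int], List[float]]) -> List[float]:
    """Average overlapping window predictions (averaging is an assumption).

    `predict(start, stop)` must return one probability per voxel in [start, stop).
    """
    sums = [0.0] * length
    counts = [0] * length
    for start in window_starts(length, size, stride):
        stop = min(start + size, length)
        for offset, prob in enumerate(predict(start, stop)):
            sums[start + offset] += prob
            counts[start + offset] += 1
    return [total / count for total, count in zip(sums, counts)]
```

A 3D version would apply `window_starts` independently along each axis and accumulate sums and counts over the resulting grid of sub-volumes.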

4.2 Implementation Details

We implemented SOARS in PyTorch and trained it on an NVIDIA Quadro RTX 8000. The RAdam solver [liu2019variance] is used to optimize all models with a momentum of [...] and a weight decay of [...]. The DSC loss is used for training the segmentation task. The S&H detection branch is trained using an L2 loss with a [...] learning rate.
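For readers unfamiliar with the two losses named above, the following is a minimal sketch over flattened voxel lists. The exact Dice loss variant used in the paper is not specified, so the soft formulation, the smoothing constant `eps`, and the mean-reduced L2 loss shown here are common choices assumed for illustration, and the function names are hypothetical.

```python
from typing import Sequence


def soft_dice_loss(probs: Sequence[float], targets: Sequence[float],
                   eps: float = 1e-6) -> float:
    """1 - soft Dice between predicted probabilities and a binary target mask."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def l2_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, e.g. between regressed and Gaussian-label heat maps."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```

Unlike voxel-wise cross-entropy, the Dice formulation is normalized by the foreground size, which is one reason it is popular when foreground voxels are a small fraction of the volume.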

We exploit NAS to search the optimal network architecture for each branch. For the NAS parameters, i.e. the learnable architecture logits, we first keep them fixed for [...] epochs. Then we update them together with the network weights for an additional [...] epochs. The batch size for NAS training is set to [...]. Note that we use only the validation set for updating the architecture logits. The ratio between the training set and the validation set is 2:1. The initial learning rate is set to [...] for the anchor and mid-level branches, and [...] for the S&H branch.

After NAS is completed, we retrain the searched network from scratch with a batch size of [...]. The initial learning rate is set to [...] for the anchor and mid-level branches, and [...] for the S&H branch. The detailed training strategy is as follows: 1) we train the anchor branch for [...] epochs; 2) we fix the parameters of the anchor branch and concatenate its output to the original RTCT, followed by further training of the mid-level and S&H branches for [...] epochs; 3) finally, we fine-tune the whole framework in an end-to-end manner for [...] epochs.

4.3 Processing Stratification

We first evaluate the effectiveness of the processing stratification of SOARS. The ablation results for segmenting the anchor, mid-level and S&H OARs are shown in Table 1. The baseline comparison is the 3D P-HNN model trained on all 42 OARs together. When anchor OARs are stratified to train only on themselves, there is a slight improvement as compared to the baseline model, consistent with the observation that anchor OARs generally have good contrast and are easy to optimize. However, when focusing on mid-level OARs, there is a marked DSC improvement (3.63%, from 63.68% to 67.31%) when training only on mid-level OARs instead of training on all. This demonstrates the difficulty in segmenting a large number of organs together without considering their differences. When further adding anchor OAR predictions as support, both the DSC and the ASD experience large improvements, i.e. from 67.31% to 70.73% in DSC and from 3.97 to 1.67 mm in ASD. These significant error reductions indicate that anchor OARs serve as effective references to better delineate the hard-to-discern boundaries of mid-level organs (most of which are soft tissue). Figure 3 depicts qualitative examples of segmenting mid-level OARs. As can be seen, our method achieves much better visual results.

For the S&H branch, we first report the accuracy of the regressed center points using the detection-by-segmentation network. As Table 2 demonstrates, the center points of S&H OARs can be detected with high robustness. Moreover, when using the anchor OARs as support, the distance errors between regressed and true center points are further reduced. In our experiments, no S&H OAR was missed by our detection-by-segmentation strategy, demonstrating the robustness of our approach. Now focusing on the segmentation results of Table 1, cropping the VOI using the detection module yields a remarkable improvement in segmenting the S&H OARs, moving DSC from 62.09% to 71.75%, as compared against directly segmenting from the CT. This further demonstrates the value of our processing-based stratification method, which tailors the treatment of OAR categories with different characteristics. As the examples of Figure 4 demonstrate, the benefits of processing stratification for S&H OARs are clearly visible in the optic chiasm, hypothalamus, and pineal gland, which are insufficiently segmented or missed when using only the RTCT for prediction.

                                 Anchor OARs           Mid-level OARs        S&H OARs              All OARs
Method                             DSC     HD   ASD      DSC     HD   ASD      DSC     HD   ASD      DSC     HD   ASD
UNet [cciccek20163d]             82.97   8.90  1.06    63.61  11.06  1.92    59.64   6.38  1.31    66.62   9.26  1.86
P-HNN [harrison2017progressive]  84.26   6.12  1.18    65.19  13.15  2.97    59.42   5.23  0.82    67.62   9.39  2.23
UaNet [tang2019clinically]       84.30   8.89  1.72    69.40  11.57  2.06    61.85   5.28  1.53    70.44   9.20  1.83
SOARS                            85.04   5.08  0.98    72.75  10.10  1.66    71.90   2.93  0.53    75.14   6.98  1.12

Table 3: Quantitative results of different approaches on segmenting the 42 H&N OARs using 4-fold cross-validation (DSC in %, HD and ASD in mm). Our proposed SOARS achieves the best performance in all metrics.

OAR               Ren et al.   Wang et al.  AnatomyNet   FocusNet     UaNet        SOARS
Brain Stem        -            90.0±4.0     86.7±2.0     87.5±2.6     87.5±2.5     87.6±2.8
Mandible          -            94.0±1.0     92.5±2.0     93.5±1.9     95.0±0.8     95.1±1.1
Optic Chiasm      58.0±17.0    -            53.2±15.0    59.6±18.1    61.5±10.2    64.9±8.8
Optic Nerve Lt    72.0±8.0     -            72.1±6.0     73.5±9.6     74.8±7.1     75.3±7.1
Optic Nerve Rt    70.0±9.0     -            70.6±10.0    74.4±7.2     72.3±5.9     74.6±5.2
Parotid Lt        -            83.0±6.0     88.1±2.0     86.3±3.6     88.7±1.9     88.2±3.2
Parotid Rt        -            83.0±6.0     87.3±4.0     87.9±3.1     87.5±5.0     88.2±5.2
SMG Lt            -            -            81.4±4.0     79.8±8.1     82.3±5.2     84.2±7.3
SMG Rt            -            -            81.3±4.0     80.1±6.1     81.5±4.5     83.8±6.9
All OARs          -            -            79.2         80.3         81.2         82.4

Methods: Ren et al. [ren2018interleaved], Wang et al. [wang2017hierarchical], AnatomyNet [zhu2019anatomynet], FocusNet [gao2019focusnet], UaNet [tang2019clinically]. SMG: submandibular gland.

Table 4: For the MICCAI 2015 9-OAR segmentation challenge, the proposed method achieves the best performance on 7 OARs and the second-best performance on the remaining 2 (brain stem and left parotid).

Figure 4: Examples of S&H OAR segmentation using different setups. For visualization purposes, the dashed rectangles are enlarged to highlight improvements. As indicated by the red arrows, the proposed method achieves visually better optic chiasm, hypothalamus, and pineal gland segmentation.

4.4 Architectural Stratification

Table 1 also outlines the performance improvements provided by NAS. As can be seen, all three branches trained with NAS consistently produce more accurate segmentation results than those trained with the baseline 3D P-HNN network. This validates the effectiveness of NAS on complicated segmentation tasks. Among the three branches, the anchor and mid-level branches show considerable DSC improvements, from 84.14% to 85.73% and from 70.73% to 72.55%, respectively, while the S&H branch shows a marginal improvement (0.82% in DSC). For segmenting the S&H OARs, the strong priors of the detected heat maps may have already made the segmentation task much easier. Nonetheless, considering the dramatic improvements already provided by the stratified approach in Sec. 4.3, the fact that NAS is able to boost performance even further attests to its benefits. Some qualitative examples demonstrating the effectiveness of NAS are shown in Figure 3 and Figure 4.

The searched network architectures for the anchor branch are 2D-kernel3, 2D-kernel5, 2D-kernel3 and 3D-kernel5 for the four convolution blocks, while for the mid-level branch they are 2D-kernel3, 2.5D-kernel5, 2D-kernel3 and 2.5D-kernel5. This is an interesting result, as it indicates that 3D kernels may not always be the best choice for segmenting objects of reasonable size, since mixed 2D or P3D kernels dominate both branches. Consequently, it is possible that much of the computation and memory used by 3D networks could be avoided with an appropriately designed 2D or P3D architecture. For the S&H branch, the searched architecture is 2D-kernel3, 3D-kernel5, 2D-kernel3 and 3D-kernel5 for the four convolution blocks. As can be seen, more 3D kernels are used, consistent with the intuition that small objects with low contrast rely more on 3D spatial information for better segmentation.
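The remark about saved computation can be made tangible by counting convolution weights per candidate. The sketch below counts weights for one input-channel/output-channel pair only, ignores biases and normalization parameters, and assumes that a P3D block factorizes a k×k×k kernel into a k×k×1 convolution followed by a 1×1×k convolution with an unchanged channel count. The function name is hypothetical, and real savings also depend on channel widths and feature-map sizes.

```python
def kernel_weights(kind: str, k: int) -> int:
    """Convolution weights per input/output channel pair for one candidate."""
    if kind == "2D":
        return k * k          # k x k x 1
    if kind == "3D":
        return k ** 3         # k x k x k
    if kind == "P3D":
        return k * k + k      # k x k x 1 followed by 1 x 1 x k (assumed factorization)
    raise ValueError(f"unknown kernel kind: {kind}")


# Tabulate the six candidates in the search space.
WEIGHT_TABLE = {
    f"{kind}-kernel{k}": kernel_weights(kind, k)
    for kind in ("2D", "3D", "P3D")
    for k in (3, 5)
}
```

Under these assumptions a 3D kernel of size 5 carries 125 weights per channel pair, against 30 for the P3D counterpart and 25 for the 2D one, which is the kind of gap the paragraph above alludes to.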

Intuitively, it would be interesting to let the network also search the OAR levels. However, NAS then becomes computationally unaffordable, since automatically stratifying the anchor OARs alone would make the search far more expensive.

4.5 Comparison to State-of-the-art

Table 3 compares SOARS against 3 state-of-the-art (SOTA) OAR segmentation methods, i.e. UNet [cciccek20163d], P-HNN [harrison2017progressive], and UaNet [tang2019clinically], using 4-fold cross-validation on the H&N 42 OARs dataset. We also tested AnatomyNet [zhu2019anatomynet], but it consistently missed very small organs, so we do not report its results. Although P-HNN [harrison2017progressive] achieves performance comparable to UaNet [tang2019clinically] on anchor and S&H OAR segmentation, it has decreased performance for mid-level OARs. UaNet is a modified version of 3D Mask R-CNN [he2017mask], which conducts object segmentation within the detected boxes. Hence, it decouples the whole complicated task into detection followed by segmentation, possibly accounting for its better segmentation accuracy on the mid-level OARs as compared to P-HNN [harrison2017progressive]. Nonetheless, despite being much simpler, P-HNN is still able to match or beat UaNet on several S&H OAR metrics, demonstrating its effectiveness as a baseline and backbone method for SOARS. When considering SOARS, consistent improvements can be observed in all metrics as compared to all competitors, with an absolute DSC increase of 4.70% and an HD error reduction of 2.22 mm as compared to UaNet [tang2019clinically].
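The headline margins can be checked directly against the "All OARs" columns of Table 3. The short sketch below simply subtracts the tabulated values; the dictionary and function names are illustrative, and the numbers are copied from the table rather than recomputed from data.

```python
from typing import Dict, Tuple

# "All OARs" columns of Table 3: (DSC in %, HD in mm, ASD in mm).
TABLE3_ALL: Dict[str, Tuple[float, float, float]] = {
    "UNet": (66.62, 9.26, 1.86),
    "P-HNN": (67.62, 9.39, 2.23),
    "UaNet": (70.44, 9.20, 1.83),
    "SOARS": (75.14, 6.98, 1.12),
}


def gains_over(method: str, reference: str = "SOARS") -> Tuple[float, float, float]:
    """Return (DSC gain, HD reduction, ASD reduction) of `reference` over `method`."""
    dsc_ref, hd_ref, asd_ref = TABLE3_ALL[reference]
    dsc_m, hd_m, asd_m = TABLE3_ALL[method]
    # Higher DSC is better; lower HD and ASD are better.
    return (
        round(dsc_ref - dsc_m, 2),
        round(hd_m - hd_ref, 2),
        round(asd_m - asd_ref, 2),
    )
```

Calling `gains_over("UaNet")` reproduces the DSC and HD margins quoted against UaNet, and `gains_over("P-HNN")` gives the corresponding margins over the non-stratified backbone.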

4.6 MICCAI2015 Challenge

We use the MICCAI2015 dataset as an external dataset to further demonstrate the generalizability of SOARS. Similar to the other comparison methods, we trained our framework from scratch using the MICCAI2015 training set. Our average DSC improves on both [tang2019clinically] and [gao2019focusnet]. Compared to competitor methods, we achieve the best performance on 7 and the second-best performance on 2 of the 9 OARs. The gain is most notable on the most difficult OAR, the optic chiasm, where we improve the DSC over the best previous result, achieved by UaNet [tang2019clinically]. These results on the MICCAI2015 dataset further validate the effectiveness and consistency of our method.

5 Conclusion

This work presented SOARS, a novel framework that stratifies H&N OAR segmentation across two dimensions. Inspired by clinical practices, we stratify OARs into three categories (anchor, mid-level and S&H), providing a customized processing framework for each. Importantly, the mid-level and S&H branches build on the anchor branch's more reliable predictions. Additionally, we stratify network architectures, executing an effective NAS for each category. We test on the most comprehensive H&N dataset to date, which comprises 42 different OARs. Compared to SOTA methods, the improvements are most significant for the mid-level and S&H OARs. With this, we demonstrate that our proposed SOARS can outperform all state-of-the-art baseline networks, including the most recent representative work UaNet [tang2019clinically], by substantial margins in DSC. Thus, our work represents an important step toward reliable and automated H&N OAR segmentation.


Supplementary Material

Performance of OAR segmentation

In Table 5, we report the category-by-category DSC of the proposed SOARS against UNet [ronneberger2015u], P-HNN [harrison2017progressive], and UaNet [tang2019clinically]. In Table 6, we report the category-by-category HD of the proposed SOARS against UNet, P-HNN, and UaNet. For both metrics, SOARS achieved the best performance on 30 of the 42 OARs. SOARS performed slightly worse than UaNet on the temporal lobe and temporomandibular joint segmentations in terms of DSC, yet the differences are relatively small. We show some qualitative comparisons against UaNet in Figure 5, where the improvements are indicated using red arrows.

Organ UNet P-HNN UaNet SOARS
Basal Ganglia Lt 64.0±12.4 63.5±16.6 63.6±13.7 63.8±13.7
Basal Ganglia Rt 64.7±13.9 63.5±14.2 67.4±15.0 63.6±11.6
Brachial Lt 59.8±13.7 48.8±11.8 49.9±10.3 66.8±17.1
Brachial Rt 58.8±13.7 49.4±7.0 53.5±8.0 65.5±14.2
Brainstem 81.7±5.4 80.1±6.8 80.6±6.3 81.0±5.7
Cerebellum 83.2±2.7 88.8±2.8 90.1±2.8 90.2±2.3
Cochlea Lt 64.0±17.6 67.2±10.4 66.5±12.6 72.3±12.2
Cochlea Rt 64.2±10.0 67.2±10.4 68.2±12.6 69.5±12.4
Const. inf 63.4±17.1 61.8±14.9 73.6±10.6 65.0±18.3
Const. mid 64.9±15.4 63.1±14.5 66.1±11.3 66.9±15.1
Const. sup 64.0±10.2 64.1±10.0 62.3±11.3 67.4±9.2
Epiglottis 65.5±8.6 65.5±11.0 65.4±13.1 67.3±8.2
Esophagus 66.3±23.2 61.6±12.0 69.1±12.9 67.0±14.0
Eye Lt 83.4±7.4 86.4±3.4 85.7±7.4 86.4±3.3
Eye Rt 82.7±6.3 85.9±3.3 86.7±4.3 86.6±4.0
Hippocampus Lt 62.4±12.5 46.2±17.3 50.0±17.3 67.4±16.0
Hippocampus Rt 62.2±14.3 45.2±12.1 52.2±17.6 67.9±18.9
Hypothalamus 63.6±17.3 39.2±16.8 28.7±22.9 72.6±17.1
Innerear Lt 62.4±12.1 58.4±10.6 68.8±10.9 78.8±8.1
Innerear Rt 63.2±16.8 60.1±10.3 73.0±12.2 76.9±9.1
Lacrimalgland Lt 59.2±10.5 54.7±11.5 64.1±16.0 70.7±8.0
Lacrimalgland Rt 58.7±10.5 54.7±11.5 52.1±14.3 70.6±11.0
Larynx core 57.9±17.1 53.9±17.1 56.9±20.1 69.7±20.8
Mandible Lt 87.4±2.9 90.2±2.0 88.2±12.1 91.7±1.8
Mandible Rt 89.1±2.3 90.8±1.8 88.0±6.0 91.1±2.5
Optic Chiasm 49.9±15.4 50.9±13.6 60.4±22.1 72.9±9.2
Optic Nerve Lt 61.7±11.1 67.6±11.0 69.9±9.3 74.3±7.8
Optic Nerve Rt 62.0±12.2 67.6±10.2 69.9±11.0 72.3±8.7
Oralcavity 64.0±5.1 76.3±5.1 77.8±10.2 82.6±5.3
Parotid Lt 64.7±5.8 78.2±5.1 82.8±6.2 84.5±4.2
Parotid Rt 64.7±6.1 78.8±6.5 82.3±6.6 84.1±5.0
Pineal Gland 46.4±29.3 60.2±16.5 63.6±26.4 70.4±14.7
Pituitary 60.4±11.0 65.2±11.0 57.0±14.8 61.5±18.4
Spinalcord 83.5±6.2 83.7±3.6 82.7±7.4 84.6±2.4
SMG Lt 64.2±16.8 71.3±8.8 77.3±9.1 76.9±9.8
SMG Rt 63.2±16.8 69.5±11.7 75.2±9.4 76.1±9.0
Temporal Lobe Lt 66.7±3.6 80.9±3.7 82.6±6.4 81.0±5.2
Temporal Lobe Rt 65.1±5.1 73.6±17.4 82.4±5.7 80.5±4.0
Thyroid Lt 64.9±18.9 76.7±7.7 81.2±6.1 81.6±5.0
Thyroid Rt 64.4±17.7 77.0±6.0 80.5±10.5 82.2±5.1
TMjoint Lt 79.2±6.5 77.2±6.5 79.3±12.8 77.6±7.0
TMjoint Rt 76.5±8.8 75.2±9.3 77.4±9.6 76.2±7.1
Average 66.6 67.6 70.4 75.1
Table 5: Dice score comparison on the H&N 42 OAR dataset (unit: %, mean±standard deviation). Lt is short for left and Rt is short for right. Const. is short for constrictor muscle, SMG is short for submandibular gland, and TMjoint is short for temporomandibular joint. The proposed SOARS achieved the best performance in 30 (in bold) out of 42 OARs.
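For readers unfamiliar with the metric in Table 5, the following minimal sketch computes the Dice score between a predicted and a ground-truth mask, each given as a set of voxel coordinates. The function name, the toy masks, and the convention of returning 100% when both masks are empty are assumptions made for illustration, not details of the evaluation code used here.

```python
def dice_score(pred_voxels, true_voxels):
    """Dice similarity coefficient in percent between two voxel sets."""
    pred, truth = set(pred_voxels), set(true_voxels)
    if not pred and not truth:
        # assumed convention: two empty masks agree perfectly
        return 100.0
    overlap = len(pred & truth)
    return 100.0 * 2 * overlap / (len(pred) + len(truth))


# Toy example: 3 predicted voxels, 5 true voxels, 2 in common.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
truth = {(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 1, 1)}
print(dice_score(pred, truth))  # 2 * 2 / (3 + 5) = 0.5, i.e. 50.0
```

Because the score normalizes by the combined mask size, a one-voxel error costs far more on a small organ such as the optic chiasm than on a large one such as the cerebellum, which helps explain the wider spread of scores among the small OARs.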

Performance of S&H OAR detection

In Table 7, we report the category-by-category detection accuracy of the regressed center points using the detection-by-segmentation network. Moreover, we binarize both the regressed and ground-truth heat maps by keeping the 1000 voxels with the largest intensities, and report the HD between them. Note that, as the cochlea is spatially enclosed by the inner ear, we use a single heat map, i.e. ear, for the detection of both OARs. As shown in Table 7, we achieve an average HD reduction of 12.7 mm (from 18.9 mm to 6.2 mm) as compared to detection using only RTCT images. The HD is reduced for all OARs, especially for the lacrimal gland, optic chiasm, and pineal gland. These significant HD reductions indicate that the anchor OARs serve as effective references for better detecting the S&H OAR locations.
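The evaluation described above can be sketched in a few lines. This is an illustrative version only: heat maps are represented as dictionaries from voxel coordinates to intensities, distances are in voxel units rather than millimetres, and the centroid of the kept voxels stands in for the regressed center point. All names are hypothetical, and the toy example keeps the top 2 voxels instead of the top 1000.

```python
import math


def top_k_voxels(heat_map, k):
    """Binarize a heat map by keeping its k largest-intensity voxels."""
    ranked = sorted(heat_map, key=heat_map.get, reverse=True)
    return set(ranked[:k])


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(p, q):
        # largest distance from a point in p to its nearest point in q
        return max(min(math.dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))


def centroid(points):
    """Mean coordinate of a non-empty point set."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


# Toy heat maps along one axis; the regressed peak is shifted by one voxel.
true_map = {(0, 0, 0): 0.1, (1, 0, 0): 0.9, (2, 0, 0): 0.8, (3, 0, 0): 0.2}
pred_map = {(0, 0, 0): 0.1, (1, 0, 0): 0.3, (2, 0, 0): 0.7, (3, 0, 0): 0.9}

true_mask = top_k_voxels(true_map, 2)
pred_mask = top_k_voxels(pred_map, 2)
hd = hausdorff(true_mask, pred_mask)
center_dist = math.dist(centroid(true_mask), centroid(pred_mask))
print(hd, center_dist)
```

The brute-force Hausdorff computation is quadratic in the number of kept voxels, which is acceptable for 1000 voxels per map. A real evaluation would also scale the coordinates by the voxel spacing to obtain distances in millimetres.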

Organ UNet P-HNN UaNet SOARS
Basal Ganglia Lt 10.0±2.8 9.8±3.2 10.5±4.0 9.3±3.2
Basal Ganglia Rt 9.3±3.8 10.2±3.3 10.5±3.8 11.1±3.4
Brachial Lt 14.9±6.2 15.1±9.6 14.2±11.7 17.3±10.9
Brachial Rt 17.9±8.2 11.4±5.0 16.2±9.6 14.0±7.3
Brainstem 8.4±2.9 8.8±2.9 10.3±3.8 8.1±2.2
Cerebellum 8.9±3.8 9.4±4.7 14.1±9.8 7.7±3.1
Cochlea Lt 3.6±9.0 1.8±0.5 2.3±0.8 1.6±0.4
Cochlea Rt 2.1±0.8 2.0±1.0 2.4±0.9 1.9±0.6
Const. inf 5.7±2.6 8.5±3.9 7.5±4.9 5.4±2.4
Const. mid 7.4±2.8 8.7±3.1 14.7±10.1 7.4±3.3
Const. sup 7.4±3.0 8.0±3.6 12.7±8.2 7.0±3.6
Epiglottis 6.7±2.3 6.9±3.6 9.9±8.5 6.9±2.5
Esophagus 25.1±26.4 21.9±13.7 24.0±15.0 21.1±15.8
Eye Lt 2.8±0.8 3.0±1.8 4.0±5.4 3.3±1.1
Eye Rt 3.1±0.9 3.4±0.9 3.1±0.7 3.0±1.0
Hippocampus Lt 11.0±6.7 16.9±8.6 15.9±8.9 12.2±7.7
Hippocampus Rt 10.7±6.1 12.7±5.8 13.3±6.6 12.5±8.2
Hypothalamus 16.9±8.6 9.3±4.3 10.3±3.7 2.5±1.3
Innerear Lt 12.7±5.8 11.9±33.7 4.0±1.4 2.6±0.7
Innerear Rt 9.3±4.3 4.1±1.3 4.7±2.8 2.9±0.8
Lacrimal Gland Lt 4.3±1.0 4.3±1.3 4.6±1.6 2.9±1.1
Lacrimal Gland Rt 4.1±1.2 5.5±1.5 5.1±2.2 2.9±0.9
Larynx core 12.4±7.3 10.4±7.3 9.2±7.2 9.0±7.1
Mandible Lt 7.9±2.9 6.7±2.8 10.3±24.4 5.3±2.3
Mandible Rt 7.0±2.6 5.6±2.3 12.2±15.8 5.5±1.6
Optic Chiasm 8.0±3.9 8.4±5.3 11.4±7.8 5.3±4.2
Optic Nerve Lt 4.2±3.6 4.6±3.5 5.2±3.1 3.4±1.9
Optic Nerve Rt 4.1±2.3 3.9±1.7 4.9±4.2 3.3±1.4
Oralcavity 16.4±5.0 18.4±5.0 7.6±10.3 13.8±6.2
Parotid Lt 9.0±3.4 10.0±2.8 8.0±5.8 7.0±2.5
Parotid Rt 8.9±7.8 8.3±2.0 9.7±4.2 6.8±1.6
Pineal Gland 3.4±1.8 2.5±1.1 4.0±1.9 1.7±0.6
Pituitary 3.9±1.4 4.4±1.6 4.4±1.3 4.2±2.2
Spinalcord 34.9±13.9 10.2±18.1 17.3±27.2 5.7±2.2
SMG Lt 7.3±4.0 18.6±30.3 6.1±5.4 6.5±3.1
SMG Rt 7.3±4.0 11.1±8.3 7.0±4.9 6.1±2.3
Temporal Lobe Lt 14.3±21.4 16.0±6.8 16.5±6.7 14.6±6.9
Temporal Lobe Rt 12.8±3.6 38.6±85.2 15.0±5.0 13.5±5.9
Thyroid Lt 9.0±2.9 6.9±3.2 7.4±4.8 5.1±2.5
Thyroid Rt 8.7±10.4 7.9±3.3 7.1±4.0 5.5±2.3
TMjoint Lt 3.5±1.2 3.9±1.4 4.4±2.4 3.6±1.7
TMjoint Rt 3.6±1.7 4.6±1.1 4.3±2.9 3.5±1.3
Anchor OARs 9.3 9.4 9.2 7.0
Table 6: Average Hausdorff distance comparison on the H&N 42 OAR dataset (unit: mm, mean±standard deviation). Lt is short for left and Rt is short for right. Const. is short for constrictor muscle, SMG is short for submandibular gland, and TMjoint is short for temporomandibular joint. The proposed SOARS achieved the best performance in 30 (in bold) out of 42 OARs.
Figure 5: Qualitative illustration of the mid-level (left-hand side) and S&H (right-hand side) OAR segmentation using UaNet and the proposed SOARS. The seven columns are seven representative axial slices in the RTCT image. The OAR labels from a radiation oncologist are shown alongside the segmentation results predicted by UaNet and the proposed SOARS. For better comparison, we use red arrows to indicate the improvements. For visualization purposes, the dashed rectangles are enlarged to highlight improvements on the S&H OAR segmentation.
Organ Dist(CT Only) Dist(CT+Anchor) HD(CT Only) HD(CT+Anchor)
Ear Lt 3.9±2.5 3.9±2.6 6.7±3.3 5.7±2.1
Ear Rt 1.9±1.4 1.6±1.0 4.4±1.8 3.4±1.3
Hypothalamus 2.6±1.7 2.3±1.5 4.0±2.0 3.6±1.5
Lacrimal Gland Lt 5.6±5.7 4.6±3.1 28.0±76.8 14.7±20.7
Lacrimal Gland Rt 3.3±1.9 3.0±1.7 47.4±112.0 4.7±1.4
Optic Chiasm 3.9±2.5 3.4±1.9 26.6±71.8 10.6±25.6
Optic Nerve Lt 2.5±1.6 2.6±1.5 4.6±1.8 4.5±1.2
Optic Nerve Rt 3.0±1.2 3.1±1.6 21.9±61.0 4.9±1.6
Pineal Gland 2.5±2.5 1.8±0.7 27.7±72.2 3.9±1.3
Average 3.3 2.9 18.9 6.2
Table 7: The detailed S&H detection results (unit: mm, mean±standard deviation), measuring the average distances (Dist) between regressed and true center points, as well as the Hausdorff distances (HD) between the binarized regressed and binarized true heat maps. Lt is short for left and Rt is short for right. The best performance is highlighted in bold.