## 1 Introduction

Esophageal cancer ranks sixth in global cancer mortality [bray2018global]. As it is usually diagnosed at a rather late stage [zhang2013epidemiology], radiotherapy (RT) is a cornerstone of treatment. Delineating the 3D clinical target volume (CTV) on an RT computed tomography (RTCT) scan is a key challenge in RT planning. As Fig. 1 illustrates, the CTV should spatially encompass, with a mixture of predefined and judgment-based margins, the primary tumor(s), i.e., the gross tumor volume (GTV), regional lymph nodes (LN), and sub-clinical disease regions, while simultaneously limiting radiation exposure to organs at risk (OAR) [burnet2004defining].

Esophageal CTV delineation is uniquely challenging because tumors may spread along the entire esophagus and metastasize up to the neck or down to the upper abdominal LN. Current clinical protocols rely on manual CTV delineation, which is time-consuming and labor-intensive and is subject to high inter- and intra-observer variability [louie2010inter]. This motivates automated approaches to CTV delineation.

Deep convolutional neural networks (CNN) have achieved notable successes in segmenting semantic objects, such as organs and tumors, in medical imaging [Cicek2016, jin20173d, harrison2017progressive, jin2018ct, heinrich2019obelisk, jin2019gtv]. However, to the best of our knowledge, no prior work, CNN-based or not, has addressed esophageal cancer CTV segmentation. Works on CTV segmentation for other cancer types mostly operate on the RTCT appearance alone [men2017automatic, men2018fully]. As shown in Fig. 1, CTV delineation depends on the radiation oncologist’s visual judgment of both the appearance and the spatial configuration of the GTV, LN, and OAR, suggesting that considering only the RTCT makes the problem ill-posed. Supporting this, Cardenas et al. recently showed that considering the GTV and LN binary masks together with the RTCT can boost oropharyngeal CTV delineation performance [cardenas2018auto]. However, the OAR were not considered in their work. Moreover, binary masks do not explicitly provide distances to the model. Yet CTV delineation is highly driven by distance-based margins to other anatomical structures of interest, and it is difficult to see how a regular CNN could capture these precise distance relationships from binary masks alone.

Our work fills this gap by introducing a spatial-context encoded deep CTV delineation framework. Instead of expecting the CNN to learn distance-based margins from the GTV, LN, and OAR binary masks, we provide the CTV delineation network with the 3D signed distance transform (SDT) [Sethian1996Fast] of these structures. Specifically, we include the SDT of the GTV, LN, lung, heart, and spinal canal with the original RTCT volume as inputs to the network. From a clinical perspective, this allows the CNN to emulate the oncologist’s manual delineation, which uses the distances of the GTV and LN to the OAR as a key constraint in determining CTV boundaries. To improve robustness, we randomly choose between manually and automatically generated OAR SDT during training, while augmenting the GTV and LN SDT with domain-specific jittering. We adopt a 3D progressive holistically nested network (PHNN) [harrison2017progressive] as our delineation model, which enjoys the benefits of strong abstraction capacity and multi-scale feature fusion with a lightweight decoding path. We extensively evaluate our approach using a 3-fold cross-validated dataset of esophageal cancer patients. Since we are the first to tackle automated esophageal cancer CTV delineation, we compare against previous CTV delineation methods for other cancers [men2018fully, cardenas2018auto], using the 3D PHNN as the delineation model. When comparing against pure appearance-based [men2018fully] and binary-mask-based [cardenas2018auto] solutions, we show that our approach provides improvements of and in Dice score, respectively, with analogous improvements in Hausdorff distance (HD) and average surface distance (ASD). Moreover, we also show that PHNN is responsible for providing improvements of in Dice score and reduction in ASD over a 3D U-Net model [Cicek2016].

## 2 Methods

CTV delineation in RT planning is essentially a margin expansion process, starting from observable tumorous regions (GTV and regional LN) and extending into the neighboring regions by considering the possible tumor spread margins and distances to nearby healthy OAR. Fig. 2 depicts an overview of our method, which consists of four major modularized components: (1) segmentation of prerequisite regions; (2) SDT computation; (3) domain-specific data augmentation; and (4) a 3D PHNN to execute the CTV delineation.

### 2.1 Prerequisite Region Segmentation

To provide spatial context/distance of the anatomical structures of interest, we must first know their boundaries. We assume that manual segmentations for the esophageal GTV and regional LN are available. However, we do not make this assumption for the OAR. Indeed, missing OAR segmentations () are common in our dataset. For the OAR, we consider three major organs: the lung, heart, and spinal canal, since most esophageal CTV are closely integrated with these organs. Using the available organ labels, we trained a 2D PHNN [harrison2017progressive] to segment the OAR, considering its robust performance in pathological lung segmentation and its computational efficiency. Examples of automatic OAR segmentation are illustrated in the first row of Fig. 2, and the validation Dice scores for the lung, heart and spinal canal were , and , respectively, in our dataset.

### 2.2 SDT Computation

To encode the spatial context with respect to the GTV, regional LN, and OAR, we compute SDT for each. The SDT is generated from a binary image, where the value in each voxel measures the distance to the closest object boundary. Voxels inside and outside the boundary have positive and negative values, respectively. More formally, let $O_i$ denote a binary mask, where $i \in \{\text{GTV+LN}, \text{lung}, \text{heart}, \text{spinal canal}\}$, and let $\Gamma(\cdot)$ be a function that computes boundary voxels of a binary image. The SDT value at a voxel $p$ with respect to $O_i$ is computed as

$$
\mathrm{SDT}_{\Gamma(O_i)}(p) =
\begin{cases}
\min_{q \in \Gamma(O_i)} d(p, q) & \text{if } p \in O_i, \\
-\min_{q \in \Gamma(O_i)} d(p, q) & \text{if } p \notin O_i,
\end{cases}
\tag{1}
$$

where $d(p, q)$ is a distance measure from $p$ to $q$. We choose to use Euclidean distance in our work and use Maurer et al.’s efficient algorithm [maurer2003linear] to compute the SDT. The bottom row in Fig. 2 depicts example SDT for the combined GTV and LN and the other 3 OAR. Note that we compute SDT separately for each of the three OAR, meaning we can capture each organ’s influence on the CTV. Providing the SDT of the GTV, LN, and OAR to the deep CNN allows it to more easily infer the distance-based margins to these anatomical structures, better emulating the oncologist’s CTV inference process.
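As a rough illustration of the signed distance described above, the following sketch computes a boundary-based signed Euclidean distance map and stacks one map per structure with the CT volume. It is not the implementation used in this work: it substitutes SciPy's exact Euclidean distance transform for the Maurer et al. algorithm, and the function names `signed_distance_transform` and `build_network_input`, the `spacing` argument, and the channel ordering are all assumptions made for the example.

```python
import numpy as np
from scipy import ndimage


def signed_distance_transform(mask, spacing=(1.0, 1.0, 1.0)):
    """Signed Euclidean distance to the mask boundary.

    Positive inside the object, negative outside, zero on boundary voxels.
    `spacing` is the voxel size along each axis (an assumed parameter).
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask must contain at least one foreground voxel")
    # Boundary voxels: foreground voxels removed by a one-voxel erosion.
    boundary = mask & ~ndimage.binary_erosion(mask)
    # distance_transform_edt returns the distance to the nearest zero voxel,
    # so the boundary is passed in as the zero set.
    dist = ndimage.distance_transform_edt(~boundary, sampling=spacing)
    return np.where(mask, dist, -dist)


def build_network_input(ct, masks, spacing=(1.0, 1.0, 1.0)):
    """Stack the CT volume and one SDT per mask along a leading channel axis."""
    channels = [np.asarray(ct, dtype=np.float32)]
    channels += [
        signed_distance_transform(m, spacing).astype(np.float32) for m in masks
    ]
    return np.stack(channels, axis=0)
```

Under these assumptions, passing the combined GTV/LN mask and the three OAR masks would yield a five-channel input: the CT plus four distance maps.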

### 2.3 Domain-Specific Data Augmentation

We adopt specialized data augmentations to increase the robustness of the training and harden our network to noise in the prerequisite segmentations. Specifically, two types of data augmentation are carried out. (1) We calculate the GTV and LN SDT from both the manual annotations and spatially jittered versions of those annotations. We jitter each GTV and LN component by a random shift within , mimicking that in practice average distance error represents the state-of-the-art performance in esophageal GTV segmentation [yousefi2018esophageal, jin2019gtv]. (2) We calculate SDT of the OAR using both the manual annotations and the automatic segmentations from §2.1. Combined, these augmentations lead to four possible combinations, which we randomly choose between during every training epoch. This increases model robustness and also allows the system to be effectively deployed in practice by using SDT of the automatically segmented OAR, helping to alleviate the labor involved.
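The two augmentations can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code of this work: the jitter range `max_shift_vox` is a hypothetical parameter expressed in voxels, shifts are drawn uniformly per connected component, and each of the two choices is made with probability 0.5. All function names are invented for the example.

```python
import numpy as np
from scipy import ndimage


def shift_mask(mask, shift):
    """Translate a binary mask by an integer voxel offset, padding with zeros."""
    out = np.zeros_like(mask)
    src, dst = [], []
    for s, n in zip(shift, mask.shape):
        s = int(s)
        src.append(slice(max(0, -s), min(n, n - s)))
        dst.append(slice(max(0, s), min(n, n + s)))
    out[tuple(dst)] = mask[tuple(src)]
    return out


def jitter_components(mask, max_shift_vox, rng):
    """Shift every connected component by its own random offset."""
    mask = np.asarray(mask, dtype=bool)
    labels, n_components = ndimage.label(mask)
    out = np.zeros(mask.shape, dtype=bool)
    for k in range(1, n_components + 1):
        # Assumed: uniform integer shift in [-max_shift_vox, max_shift_vox].
        shift = rng.integers(-max_shift_vox, max_shift_vox + 1, size=mask.ndim)
        out |= shift_mask(labels == k, shift)
    return out


def sample_training_masks(gtv_ln_manual, oar_manual, oar_auto, max_shift_vox, rng):
    """Pick one of the four (GTV/LN, OAR) combinations at random."""
    if rng.random() < 0.5:  # assumed 50/50 choice
        gtv_ln = jitter_components(gtv_ln_manual, max_shift_vox, rng)
    else:
        gtv_ln = gtv_ln_manual
    oar = oar_auto if rng.random() < 0.5 else oar_manual
    return gtv_ln, oar
```

The SDT would then be recomputed from whichever masks are returned, so the network sees distance maps from both clean and perturbed inputs.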

### 2.4 CTV Delineation Network

To use 3D CNN in medical imaging, one has to balance an input size large enough to cover sufficient context against the available GPU memory. Symmetric encoder-decoder segmentation networks, e.g., 3D U-Net [Cicek2016], are computationally heavy and memory-consuming since half of their computation is spent on the decoding path, which may not be needed for all 3D segmentation tasks. To alleviate the computational/memory burden, we adopt a 3D version of PHNN [harrison2017progressive] as our CTV delineation network, which is able to fuse different levels of features using parameter-less deep supervision. We keep the first 4 convolutional blocks and adapt them to 3D as our network structure. As we demonstrate in the experiments, the 3D PHNN not only achieves reasonable improvement over the 3D U-Net but also requires roughly 3 times less GPU memory.

## 3 Experiments and Results

To evaluate the performance of our esophageal CTV delineation framework, we collected from anonymized RTCT of esophageal cancer patients undergoing RT. Each RTCT is accompanied by a CTV mask annotated by an experienced oncologist, based on a previously segmented GTV, regional LN, and OAR. The average RTCT size is voxels with the average resolution of mm.

Training data sampling: We first resample all the CT and SDT images to a fixed resolution of mm, from which we extract training VOI patches in two manners: (1) To ensure enough VOI with positive CTV content, we randomly extract VOI centered within the CTV mask. (2) To obtain sufficient negative examples, we randomly sample VOI from the whole volume. This results in on average VOI per patient. We further augment the training data by applying random rotations of degrees in the x-y plane.
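The two sampling modes can be illustrated with the sketch below. It assumes a channel-first volume and treats the VOI size and the numbers of positive and negative samples as free parameters, since their values are not given here. Patches that would cross the volume border are shifted inward, which is a choice made for this example rather than a detail taken from the text.

```python
import numpy as np


def sample_voi_centers(ctv_mask, n_pos, n_neg, rng):
    """Return (n_pos + n_neg, 3) voxel centers: inside the CTV, then anywhere."""
    pos_candidates = np.argwhere(ctv_mask)
    pos = pos_candidates[rng.integers(0, len(pos_candidates), size=n_pos)]
    neg = np.stack(
        [rng.integers(0, dim, size=n_neg) for dim in ctv_mask.shape], axis=1
    )
    return np.concatenate([pos, neg], axis=0)


def extract_voi(volume, center, size):
    """Crop a patch of `size` around `center` from a (C, Z, Y, X) volume.

    The start index is clamped so the patch always lies inside the volume;
    the volume is assumed to be at least as large as the patch.
    """
    spatial_shape = volume.shape[-3:]
    start = [
        int(np.clip(c - s // 2, 0, d - s))
        for c, s, d in zip(center, size, spatial_shape)
    ]
    window = tuple(slice(st, st + s) for st, s in zip(start, size))
    return volume[(Ellipsis,) + window]
```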

Implementation details: The Adam solver [kingma2014adam] is used to optimize all segmentation models with a momentum of and a weight decay of for epochs. We use the Dice loss for training. For testing, we use 3D sliding windows with sub-volumes of and strides of voxels. The probability maps of sub-volumes are aggregated to obtain the whole volume prediction taking on average to process one input volume using a Titan-V GPU.

Comparison setup and metrics: We use 3-fold cross-validation, separated at the patient level, to evaluate performance of our approach and the competitor methods. We compare against setups using only the CT appearance information [men2017automatic, men2018fully] and setups using the CT with binary GTV/LN masks [cardenas2018auto]. Finally, we also compare against setups using the CT + GTV/LN SDT, which does not consider the OAR. We compare these setups using the 3D PHNN. For the 3D U-Net [Cicek2016], we compared against the setup using the CT appearance information. We evaluate the performance using the metrics of Dice score, ASD and HD.
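For readers unfamiliar with the three metrics, the sketch below shows one common way to compute them from binary masks. The exact definitions used in the evaluation are not spelled out here, so this is an assumption: surfaces are taken as foreground voxels removed by a one-voxel erosion, ASD is the symmetric mean over both surfaces, and HD is the symmetric maximum. Both masks are assumed to be non-empty.

```python
import numpy as np
from scipy import ndimage


def dice_score(pred, gt):
    """Dice overlap between two binary masks."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * np.logical_and(pred, gt).sum() / denom


def _surface(mask):
    """Foreground voxels that touch the background."""
    return mask & ~ndimage.binary_erosion(mask)


def asd_and_hd(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Symmetric average surface distance and Hausdorff distance."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    surf_pred, surf_gt = _surface(pred), _surface(gt)
    # Distance from every voxel to the nearest surface voxel of the other mask.
    pred_to_gt = ndimage.distance_transform_edt(~surf_gt, sampling=spacing)[surf_pred]
    gt_to_pred = ndimage.distance_transform_edt(~surf_pred, sampling=spacing)[surf_gt]
    asd = (pred_to_gt.sum() + gt_to_pred.sum()) / (pred_to_gt.size + gt_to_pred.size)
    hd = max(pred_to_gt.max(), gt_to_pred.max())
    return float(asd), float(hd)
```

With `spacing` set to the voxel size in millimetres, both distances come out in millimetres, matching the units reported in Table 1.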

| Models | Setups | Dice | HD (mm) | ASD (mm) |
|---|---|---|---|---|
| U-Net | CT | 0.739 ± 0.126 | 69.5 ± 42.7 | 10.1 ± 9.4 |
| | CT + GTV/LN/OAR SDT | 0.829 ± 0.061 | 36.9 ± 23.8 | 4.6 ± 3.0 |
| PHNN | CT | 0.739 ± 0.117 | 68.5 ± 43.8 | 10.6 ± 9.2 |
| | CT + GTV/LN masks | 0.801 ± 0.075 | 56.3 ± 35.4 | 6.6 ± 5.3 |
| | CT + GTV/LN SDT | 0.816 ± 0.067 | 44.7 ± 25.1 | 5.4 ± 4.1 |
| | CT + GTV/LN/OAR SDT | 0.839 ± 0.054 | 35.4 ± 23.7 | 4.2 ± 2.7 |
| | CT + GTV/LN/OAR SDT* | 0.823 ± 0.059 | 43.6 ± 26.4 | 5.1 ± 3.3 |

\* OAR SDT computed from the automatically segmented OAR at test time.

Results: Table 1 outlines the quantitative comparisons of the different model setups and choices. As can be seen, methods based on pure CT appearance, as in prior art [men2017automatic, men2018fully], exhibit the worst performance. This is because inferring distance-based margins from appearance alone is too hard a task for CNN. Focusing on the PHNN performance, adding the binary GTV and LN masks as contextual information [cardenas2018auto] increases the Dice score considerably, from 0.739 to 0.801. Using the SDT-encoded spatial context of the GTV/LN instead further improves the Dice score to 0.816 and reduces the ASD from 6.6 mm to 5.4 mm, confirming the value of distance information for esophageal CTV delineation. Finally, when the OAR SDT are included, i.e., our proposed framework, PHNN achieves the best performance, reaching a Dice score of 0.839 and an ASD of 4.2 mm, with the HD falling to 35.4 mm from 44.7 mm for the CT + GTV/LN SDT setup. Fig. 4 depicts cumulative histograms of the Dice score and ASD, visually illustrating the distribution of improvements in the CTV delineation performance. Fig. 3 shows some qualitative examples illustrating these performance improvements. Interestingly, as the last row of Table 1 shows, when using SDT computed from the automatically segmented OAR for testing, the performance stays close to that of the best configuration (0.823 vs. 0.839 in Dice score) and exceeds that of every PHNN configuration that omits the OAR SDT. This indicates that our method remains robust to noise within the OAR SDT and that our approach is not reliant on manual OAR masks for good performance, increasing its practical value.

We also compare the 3D PHNN network performance with that of 3D U-Net [Cicek2016] when using the CT appearance based setup and the proposed whole framework. As Table 1 demonstrates, when using the whole pipeline PHNN outperforms U-Net in Dice score (0.839 vs. 0.829). Although PHNN performs similarly to U-Net when using only the CT appearance information, its GPU memory consumption is roughly 3 times less than that of the U-Net. These results indicate that for esophageal CTV delineation, a CNN equipped with strong encoding capacity and a lightweight decoding path can be as good as (or even superior to) a heavier network with a symmetric decoding path.

## 4 Conclusion

We introduced a spatial-context encoded deep esophageal CTV delineation framework designed to produce superior margin-based CTV boundaries. Our system encodes spatial context by computing the SDT of the GTV, LN and OAR and feeds them together with the RTCT image into a 3D deep CNN. Analogous to clinical practice, this allows the system to consider both appearance and distance-based information for delineation. We also developed domain-specific data augmentation and adopted a 3D PHNN to further improve robustness. Using extensive three-fold cross-validation, we demonstrated that our spatial-context encoded approach can outperform state-of-the-art CTV alternatives by wide margins in Dice score, HD, and ASD. As we are the first to address automated esophageal CTV delineation, our method represents a step forward for this important problem.
