Esophageal cancer ranks sixth in mortality amongst all cancers worldwide, accounting for in cancer deaths [bray2018global]. Because this disease is typically diagnosed at late stages, the primary treatment is a combination of chemotherapy and RT. One of the most critical tasks in RT treatment planning is delineating the GTV, which serves as the basis for further contouring the clinical target volume [jin2019ctv]. Yet, manual segmentation consumes great amounts of time and effort from oncologists and is subject to inconsistencies [tai1998variability]. Thus, there is great impetus to develop effective tools for automated GTV segmentation.
Deep CNNs have made remarkable progress in the field of medical image segmentation [Cicek2016, harrison2017progressive, jin20173d, jin2018ct, RothLLHFSS18]. Yet, only a handful of studies have addressed automated esophageal GTV segmentation [Hao2017Esophagus, yousefi2018esophageal], all of which rely on only the RTCT images. The assessment of GTV by CT has been shown to be error prone, due to the low contrast between the GTV and surrounding tissues [muijs2009consequences]. Within the clinic, these shortfalls are often addressed by correlating with the patient’s PET/CT scan, when available. These PET/CT scans are acquired on an earlier occasion to help stage the cancer and decide treatment protocols. Despite misalignments between the PET/CT and RTCT, PET still provides highly useful information to help manually delineate the GTV on the RTCT, due to its high-contrast highlighting of malignant regions [leong2006prospective]. As shown in Fig. 1, CT and PET can each be crucial for accurate GTV delineation, due to their complementary strengths and weaknesses. While recent work has explored co-segmentation of tumors using PET and CT [zhong2018simultaneous, zhao2018tumor], these works only consider the PET/CT image. In contrast, leveraging the diagnostic PET to help perform GTV segmentation on an RTCT image requires contending with the unavoidable misalignments between the two scans acquired at different times.
To address this gap, we propose a new approach, depicted in Fig. 2, that uses a two-stream chained pipeline to incorporate the joint RTCT and PET information for accurate esophageal GTV segmentation. First, we manage the misalignment between the RTCT and PET/CT by registering them via an anatomy-based initialization. Next, we introduce a two-stream chained pipeline that combines and merges predictions from two independent sub-networks, one trained using only the RTCT and one trained using both RTCT and registered PET images. The former exploits the anatomical contextual information in CT, while the latter takes advantage of PET’s sensitive, but sometimes spurious and overpoweringly strong contrast. The predictions of these two streams are then deeply fused together with the original RTCT to provide a final robust GTV prediction. Furthermore, we introduce a simple yet surprisingly powerful PSNN model, which incorporates the strengths of both UNet [Cicek2016] and P-HNN [harrison2017progressive] by using deep supervision to progressively propagate high-level semantic features to lower-level, but higher-resolution features. Using 5-fold cross-validation, we evaluate the proposed approach on patients with RTCT and PET, which is more than two times larger than the previously largest reported dataset for esophageal GTV segmentation [yousefi2018esophageal]. Experiments demonstrate that our two-stream chained pipeline and the PSNN each provide significant performance improvements, resulting in an average DSC of , which is higher than the previous state-of-the-art method using DenseUNet [yousefi2018esophageal].
Fig. 2 depicts an overview of our proposed two-stream chained esophageal GTV segmentation pipeline, which uses early and late 3D deep network fusions of CT and PET scans. Not shown is the registration step, which is detailed in §2.1.
2.1 PET to RTCT Registration
To generate aligned PET/RTCT pairs, we register the former to the latter. This is made possible by the diagnostic CT accompanying the PET. To do this, we apply the cubic B-spline based deformable registration algorithm in a coarse to fine multi-scale deformation process [rueckert1999nonrigid]. We choose this option due to its good capacity for shape modeling and efficiency in capturing local non-rigid motions. However, to perform well, the registration algorithm must have a robust rigid initialization to manage patient pose and respiratory differences in the two CT scans. To accomplish this, we use the lung mass centers from the two CT scans as the initial matching positions. We compute mass centers from masks produced by the P-HNN model [harrison2017progressive], which can robustly segment the lung field even in severely pathological cases. This leads to a reliable initial matching for the chest and upper abdominal regions, helping the success of the registration. The resulting deformation field is then applied to the diagnostic PET to align it to the RTCT at the planning stage. One registration example is illustrated in Fig. 3.
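As an illustration of the rigid initialization described above, the following is a minimal sketch, assuming binary lung masks are already available for both CT scans and that the two volumes share the same axis order and image origin (real scans would also need their origin and orientation metadata). The function names and the toy masks are hypothetical; the deformable B-spline stage is not shown.

```python
import numpy as np


def mass_center_mm(mask, spacing_mm):
    """Center of mass of a binary mask, in physical (mm) coordinates."""
    idx = np.argwhere(mask > 0)  # (num_voxels, 3) voxel indices
    if idx.size == 0:
        raise ValueError("empty lung mask")
    return idx.mean(axis=0) * np.asarray(spacing_mm, dtype=float)


def initial_translation_mm(lung_rtct, spacing_rtct, lung_diag, spacing_diag):
    """Translation that moves the diagnostic-CT lung center onto the RTCT one."""
    return (mass_center_mm(lung_rtct, spacing_rtct)
            - mass_center_mm(lung_diag, spacing_diag))


# Toy masks standing in for lung segmentations (hypothetical data).
rtct_lung = np.zeros((8, 8, 8), dtype=np.uint8)
rtct_lung[2:4, 2:4, 2:4] = 1
diag_lung = np.zeros((8, 8, 8), dtype=np.uint8)
diag_lung[4:6, 3:5, 2:4] = 1

shift = initial_translation_mm(rtct_lung, (1.0, 1.0, 1.0),
                               diag_lung, (1.0, 1.0, 1.0))
```

The resulting translation would seed the registration before the coarse-to-fine deformable stage refines local non-rigid motion.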
2.2 Two-Stream Chained Deep Fusion
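As mentioned, we aim to effectively exploit the complementary information within the PET and CT imaging spaces. To do this, we design a two-stream chained 3D deep network fusion pipeline. Assuming $N$ data instances, we denote the training data as $S = \{(X_n^{\mathrm{CT}}, X_n^{\mathrm{PET}}, Y_n)\}_{n=1}^{N}$, where $X_n^{\mathrm{CT}}$, $X_n^{\mathrm{PET}}$, and $Y_n$ represent the input CT, registered PET, and binary ground truth GTV segmentation images, respectively. For simplicity and consistency, the same 3D segmentation backbone network (described in Sec. 2.3) is adopted. Dropping $n$ for clarity, we first use two separate streams to generate segmentation maps using $X^{\mathrm{CT}}$ and [$X^{\mathrm{CT}}$, $X^{\mathrm{PET}}$] as network input channels:

$$\hat{y}_j^{\mathrm{CT}} = p_j^{\mathrm{CT}}\left(y_j = 1 \mid X^{\mathrm{CT}}; W^{\mathrm{CT}}\right), \tag{1}$$

$$\hat{y}_j^{\mathrm{EF}} = p_j^{\mathrm{EF}}\left(y_j = 1 \mid X^{\mathrm{CT}}, X^{\mathrm{PET}}; W^{\mathrm{EF}}\right), \tag{2}$$

where $p_j^{(\cdot)}(\cdot)$ and $\hat{y}_j^{(\cdot)}$ denote the CNN functions and output segmentation maps, respectively, $W^{(\cdot)}$ represents the corresponding CNN parameters, and $y_j$ indicates the ground truth GTV tumor mask values. We denote Eq. (2) as EF, as the stream can be seen as an early fusion (EF) of CT and PET, enjoying the high spatial resolution and high tumor-intake contrast properties from the CT and PET, respectively. On the other hand, the stream in Eq. (1) provides predictions based on CT intensity alone, which can be particularly helpful in circumventing the biased influence from noisy non-malignant high uptake regions, which are not uncommon in PET.

Next, the predictions of the two streams, $\hat{Y}^{\mathrm{CT}}$ and $\hat{Y}^{\mathrm{EF}}$, are combined with the original RTCT as input channels to a third network:

$$\hat{y}_j^{\mathrm{LF}} = p_j^{\mathrm{LF}}\left(y_j = 1 \mid X^{\mathrm{CT}}, \hat{Y}^{\mathrm{CT}}, \hat{Y}^{\mathrm{EF}}; W^{\mathrm{LF}}\right). \tag{3}$$

In this way, the formulation of Eq. (3) can be seen as a late fusion (LF) of the aforementioned two streams of the CT and EF models. We use the DSC loss for all three sub-networks, training each in isolation.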
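To make the data flow of Eqs. (1)-(3) concrete, the sketch below shows one way the input channels of the three sub-networks could be assembled, together with a soft DSC loss. It is only an illustration under stated assumptions: the three "networks" are trivial stand-in functions rather than trained 3D CNNs, the smoothing constant `eps` is our own choice, and all names are hypothetical.

```python
import numpy as np


def soft_dice_loss(prob, target, eps=1e-6):
    """1 - Dice overlap between a probability map and a binary mask."""
    prob = np.asarray(prob, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)


def chained_forward(ct, pet, net_ct, net_ef, net_lf):
    """Run the CT stream, the EF stream, then the LF network on their outputs."""
    y_ct = net_ct(ct[None])                    # Eq. (1): CT only, 1 channel
    y_ef = net_ef(np.stack([ct, pet]))         # Eq. (2): CT + PET, 2 channels
    y_lf = net_lf(np.stack([ct, y_ct, y_ef]))  # Eq. (3): CT + both predictions
    return y_ct, y_ef, y_lf


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Stand-in "networks" (assumptions for illustration, not CNNs).
def net_ct(x):
    return sigmoid(x[0])


def net_ef(x):
    return sigmoid(x.mean(axis=0))


def net_lf(x):
    return x[1:].mean(axis=0)  # simply averages the two stream predictions


rng = np.random.default_rng(0)
ct, pet = rng.normal(size=(2, 4, 8, 8))  # toy CT and registered PET volumes
gtv = (rng.random((4, 8, 8)) > 0.5).astype(np.uint8)

y_ct, y_ef, y_lf = chained_forward(ct, pet, net_ct, net_ef, net_lf)
losses = [soft_dice_loss(y, gtv) for y in (y_ct, y_ef, y_lf)]
```

Because each sub-network is trained in isolation, the LF network here consumes the stream outputs as fixed inputs rather than back-propagating through them.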
2.3 PSNN Model
In esophageal GTV segmentation, the GTV target region often exhibits low contrast in CT, and the physician’s manual delineation relies heavily upon high-level semantic information to disambiguate boundaries. In certain respects, this aligns with the intuition behind UNet, which decodes high-level features into lower-level space. Nonetheless, the decoding path in UNet consumes a great many parameters, adding to its complexity. On the other hand, models like P-HNN [harrison2017progressive] use deep supervision to connect lower and higher-level features together using parameter-less pathways. However, unlike UNet, P-HNN propagates lower-level features down to high-level layers. Instead, a natural and simple means to combine the strengths of both P-HNN and UNet is to use essentially the same parameter blocks as P-HNN, but reverse the direction of the deeply-supervised pathways, allowing high-level information to propagate up to lower-level space. We denote such an approach as PSNN.
As shown in Fig. 2(b), a set of 3D convolutional layers is used to collapse the feature map after each convolutional block into a logit image, i.e., $\tilde{f}_j^{(\ell)}$, where $j$ indexes the pixel locations. This is then combined with the previous higher-level segmentation logit image to create an aggregated segmentation map, i.e., $f_j^{(\ell)}$, for the $\ell$-th feature block by element-wise summation:

$$f_j^{(m)} = \tilde{f}_j^{(m)}, \tag{4}$$

$$f_j^{(\ell)} = \tilde{f}_j^{(\ell)} + g\left(f_j^{(\ell+1)}\right), \quad \forall \ell \in \{m-1, \ldots, 1\}, \tag{5}$$

where $m$ denotes the total number of predicted feature maps and $g(\cdot)$ denotes an upsampling, i.e., bilinear upsampling. The PSNN model is trained using four deeply-supervised auxiliary losses at each convolutional block. As our experiments will demonstrate, PSNN can provide significant performance gains for GTV segmentation over both a densely connected version of UNet [yousefi2018esophageal] and P-HNN [harrison2017progressive].
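The high-to-low aggregation described above can be sketched as follows: the deepest logit map is kept as-is, and each shallower logit map is summed with the upsampled aggregate from the level below it. This is a minimal illustration, not the trained model. It assumes a factor of two in resolution between consecutive blocks and, for brevity, swaps in nearest-neighbor upsampling where the text specifies bilinear upsampling; the per-block logit maps are toy constants.

```python
import numpy as np


def upsample2x(vol):
    """Nearest-neighbor 2x upsampling along every axis (stand-in for the
    bilinear upsampling used in the text)."""
    for axis in range(vol.ndim):
        vol = np.repeat(vol, 2, axis=axis)
    return vol


def psnn_aggregate(logits):
    """Aggregate per-block logit maps from the deepest block upward.

    `logits` is ordered from the shallowest (highest-resolution) block to
    the deepest one; the result uses the same order.
    """
    agg = [None] * len(logits)
    agg[-1] = logits[-1]  # deepest block: no higher level to add
    for level in range(len(logits) - 2, -1, -1):
        agg[level] = logits[level] + upsample2x(agg[level + 1])
    return agg


# Toy logits: four blocks, each at half the resolution of the previous one.
logits = [np.full((16 // 2**l,) * 3, float(l + 1)) for l in range(4)]
agg = psnn_aggregate(logits)
```

Each entry of `agg` could then receive its own auxiliary loss, with the full-resolution entry `agg[0]` serving as the final prediction.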
3 Experiments and Results
We extensively evaluate our approach using a dataset of esophageal cancer patients, all diagnosed at stage II or later and undergoing RT treatments. Each patient has a diagnostic PET/CT pair and a treatment RTCT scan. To the best of our knowledge, this is the largest dataset collected for esophageal cancer GTV segmentation. All 3D GTV ground truth masks are delineated by two experienced radiation oncologists during routine clinical workflow. We first resample all imaging scans of registered PET and RTCT to a fixed resolution of mm. To generate positive training instances, we randomly sample sub-volumes centered inside the ground truth GTV mask. Negative examples are extracted by randomly sampling from the whole 3D volume. This results, on average, in training sub-volumes per patient. We further apply random rotations in the x-y plane within degrees to augment the training data.
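The sub-volume sampling described above might be sketched as below. The patch size, the number of samples, and the policy of shifting a crop so that it stays inside the volume are all assumptions made for illustration; the rotation augmentation is not shown.

```python
import numpy as np


def crop_centered(volume, center, size):
    """Crop a sub-volume of `size` around `center`, shifted where needed so
    that it stays inside the volume (assumes size <= volume.shape)."""
    slices = []
    for c, s, dim in zip(center, size, volume.shape):
        start = int(np.clip(c - s // 2, 0, dim - s))
        slices.append(slice(start, start + s))
    return volume[tuple(slices)]


def sample_centers(gtv_mask, n_pos, n_neg, rng):
    """Positive centers lie inside the GTV mask; negative centers are drawn
    uniformly from the whole volume."""
    foreground = np.argwhere(gtv_mask > 0)
    pos = foreground[rng.integers(0, len(foreground), size=n_pos)]
    neg = np.stack(
        [rng.integers(0, dim, size=n_neg) for dim in gtv_mask.shape], axis=1
    )
    return pos, neg


rng = np.random.default_rng(0)
ct = rng.normal(size=(16, 16, 16))           # toy RTCT volume
gtv = np.zeros((16, 16, 16), dtype=np.uint8)
gtv[6:10, 6:10, 6:10] = 1                    # toy GTV mask
patch = (8, 8, 8)                            # hypothetical sub-volume size

pos, neg = sample_centers(gtv, n_pos=4, n_neg=4, rng=rng)
pos_patches = [crop_centered(ct, c, patch) for c in pos]
neg_patches = [crop_centered(ct, c, patch) for c in neg]
```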
Table 1: Mean DSC, HD, and ASD, and their standard deviations, of GTV segmentation performance using: (1) Contextual model using only CT images (CT); (2) Early fusion model (EF) using both CT and PET images; (3) The proposed two-stream chained early and late fusion model (EF+LF). The 3D DenseUNet model using CT, shown in the first row, is equivalent to the previous state-of-the-art work [yousefi2018esophageal]. The best performance scores are shown in bold.
Implementation details: The Adam solver [kingma2014adam] is used to optimize all the 3D segmentation models with a momentum of and a weight decay of for epochs. For testing, we use 3D sliding windows with sub-volumes of and strides of voxels. The probability maps of sub-volumes are aggregated to obtain the whole volume prediction.
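The sliding-window inference described above can be illustrated with the following sketch. It assumes that overlapping probability maps are aggregated by averaging, which the text does not specify, and the window and stride values in the toy example are hypothetical. `predict` stands for any function mapping a sub-volume to a probability map of the same shape.

```python
import numpy as np


def window_starts(size, window, stride):
    """Start indices covering [0, size) along one axis; the last window is
    placed flush with the end of the axis."""
    if size <= window:
        return [0]
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)
    return starts


def sliding_window_predict(volume, predict, window, stride):
    """Average overlapping sub-volume probability maps into one volume."""
    if any(t > w for w, t in zip(window, stride)):
        raise ValueError("stride must not exceed the window size")
    prob_sum = np.zeros(volume.shape, dtype=float)
    count = np.zeros(volume.shape, dtype=float)
    grids = [window_starts(s, w, t)
             for s, w, t in zip(volume.shape, window, stride)]
    for z in grids[0]:
        for y in grids[1]:
            for x in grids[2]:
                sl = (slice(z, z + window[0]),
                      slice(y, y + window[1]),
                      slice(x, x + window[2]))
                prob_sum[sl] += predict(volume[sl])
                count[sl] += 1.0
    return prob_sum / count


# Toy check: with an identity "model", averaging must return the input.
rng = np.random.default_rng(0)
volume = rng.random((10, 12, 14))
out = sliding_window_predict(volume, lambda v: v,
                             window=(4, 6, 6), stride=(3, 4, 4))
```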
We employ a five-fold cross-validation protocol split at the patient level. Extensive comparisons of our PSNN model versus the P-HNN [harrison2017progressive] and DenseUNet [yousefi2018esophageal] methods are reported, with the latter arguably representing the current state-of-the-art GTV segmentation approach using CT. Three quantitative metrics are utilized to evaluate the GTV segmentation performance: DSC, HD in “mm”, and ASD in “mm”.
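A patient-level split means that all sub-volumes from one patient fall in the same fold, so no patient contributes to both training and testing. A minimal sketch is given below; the patient identifiers, the count of 23 patients, and the random seed are hypothetical.

```python
import random


def patient_level_folds(patient_ids, k=5, seed=0):
    """Split unique patient IDs into k disjoint folds of near-equal size."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def train_test_split(folds, test_index):
    """Use one fold for testing and the remaining folds for training."""
    test = list(folds[test_index])
    train = [pid for i, fold in enumerate(folds) if i != test_index
             for pid in fold]
    return train, test


patients = [f"patient_{i:03d}" for i in range(23)]  # hypothetical IDs
folds = patient_level_folds(patients)
train_ids, test_ids = train_test_split(folds, test_index=0)
```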
Results: Our quantitative results and comparisons are tabulated in Table 1. When all models are trained and evaluated using only RTCT, i.e., Eq. (1), our proposed PSNN clearly outperforms the previous best esophageal GTV segmentation method, i.e., DenseUNet [yousefi2018esophageal], which straightforwardly combines DenseNet [DBLP:conf/cvpr/HuangLMW17] and 3D UNet [Cicek2016]. As can be seen, PSNN consistently improves upon [yousefi2018esophageal] in all metrics, with an absolute increase of in DSC (from to ) and a significant drop in the HD metric, despite being a simpler architecture. PSNN also outperforms the 3D version of P-HNN [harrison2017progressive], which indicates that the semantically-nested high- to low-level information flow provides key performance increases.
Table 1 also outlines the performances of three deep models under different imaging configurations. Several conclusions can be drawn. First, all three networks trained using the EF of Eq. (2) consistently produce more accurate segmentation results than those trained with only RTCT, i.e., Eq. (1). This validates the effectiveness of utilizing PET to complement RTCT for GTV segmentation. Second, the full two-stream chained fusion pipeline of Eq. (3) provides further performance improvements. Importantly, the performance boosts can be observed across all three deep CNNs, validating that the two-stream combination of EF and LF can universally improve upon different backbone segmentation models. Last, the best performing results are those of the PSNN model combined with chained EF+LF, demonstrating that each component of the system contributes to our final performance. When compared to the previous state-of-the-art work of GTV segmentation, which uses DenseUNet applied to RTCT images [yousefi2018esophageal], our best performing model exceeds it in all metrics of DSC, HD, and ASD by , and marked margins (refer to Table 1), representing tangible and significant improvements. Fig. 4 shows several qualitative examples visually underscoring the improvements that our two-stage PSNN approach provides.
This work has presented and validated a two-stream chained 3D deep network fusion pipeline to segment esophageal GTV using both RTCT and PET+RTCT imaging channels. Diagnostic PET and RTCT are first longitudinally registered using semantically-based lung-mass center initialization to achieve robustness. We next employ the PSNN model as a new 3D segmentation architecture, which uses a simple, parameter-less, and deeply-supervised CNN decoding stream. The PSNN model is then used in a cascaded EF and LF scheme to segment the GTV. Extensive tests on the largest esophageal dataset to date demonstrate that our PSNN model can outperform the state-of-the-art P-HNN and DenseUNet networks by marked margins. Additionally, we show that our two-stream chained fusion pipeline produces further important improvements, providing an effective means to exploit the complementary information seen within PET and CT. Thus, our work represents a step forward toward accurate and automated esophageal GTV segmentation.