## 1 Introduction

Assessing lymph node (LN) status is an indispensable step in oncology clinical workflows for precise cancer diagnosis and treatment planning, e.g., radiation therapy or surgical resection. Under the revised RECIST guideline [schwartz2009evaluation], an LN is classed as enlarged if its short axis is more than mm in computed tomography (CT). In radiotherapy treatment, both the primary tumor and all metastasis-suspicious LNs must be sufficiently treated within the clinical target volume with the proper doses [jin2019deep]. We refer to these LNs as lymph node gross tumor volume or GTV, which includes enlarged LNs as well as smaller ones that are associated with a high positron emission tomography (PET) signal or any metastasis signs in CT [scatarige1983low]. Accurately identifying and delineating GTV, so that it is spatially included in the treatment area, is essential for a desirable cancer treatment outcome [National2020].

Identifying GTV is an extremely challenging and time-consuming task, even for experienced radiation oncologists. It requires sophisticated, high-level clinical reasoning guidelines, which leads to uncertainty and subjectivity with high inter-observer variability [goel2017clinical]. It is arguably more difficult than detecting the more general enlarged LNs, for three reasons. (1) Finding GTV is often performed using radiotherapy CT (RTCT) that, unlike diagnostic CT, is not contrast-enhanced. Hence the metastasis signs for identifying GTV are subtler. (2) GTV itself has poor contrast. Because of its ambiguous shape and appearance, it can be easily confused with vessels or muscles. (3) The size and shape of GTV vary considerably, with a large number of smaller instances that are harder to detect. Refer to Fig. 1 (top row) for an illustration of GTV. While many previous works attempt to detect enlarged LNs using contrast-enhanced CT [barbu2011automatic, bouget2019semantic, feulner2013lymph, nogues2016automatic, roth2015improving, roth2014new, yan2018deeplesion], no work, as of yet, has studied GTV detection in non-contrast RTCT scans. Given the evident differences between enlarged LNs and GTV, further innovations are required for robust GTV detection and segmentation.

Valuable insights from physicians’ clinical diagnosis and analysis process can be leveraged to tackle this problem. As one of the primary cues, human observers condition the analysis of GTV on the LNs’ distance to the corresponding primary tumor location. For LNs proximal to the tumor, physicians more readily identify them as GTV in radiotherapy treatment. For LNs distal to the tumor, however, they apply stricter criteria and include them only if there are clear signs of metastasis, e.g., enlarged size, increased PET signals, and/or other CT-based evidence [scatarige1983low]. Hence, the distance measure relative to the primary tumor plays a key role in the physician’s decision making. Besides the distance, the PET modality is also of high importance. Although it is a noisy imaging channel, it has been shown to be helpful in increasing the GTV detection sensitivity [goel2017clinical]. As demonstrated in Fig. 1 (bottom row), PET provides critically distinct information, yet it also exhibits false positives (FPs) and false negatives (FNs).

In this paper, we imitate the physician’s diagnosis process to tackle the problem of GTV detection and segmentation. (1) We introduce a distance-based gating strategy in a multi-task framework to divide the underlying GTV distributions into “tumor-proximal” and “tumor-distal” categories and solve them accordingly. Specifically, we propose a multi-branch network with a shared encoder and two separate decoders that detect and segment the “tumor-proximal” and “tumor-distal” GTV, respectively. A distance-based gating function is designed to generate the corresponding GTV sample weights for each branch. By applying the gating function at the outputs of the decoders, each branch is specialized to learn the “tumor-proximal” or “tumor-distal” GTV features, emulating the physician’s diagnosis process. (2) We use the early fusion (EF) of three modalities as input to our model, i.e., RTCT, PET and the 3D tumor distance map (Fig. 1 (bottom row)). RTCT depicts anatomical structures, capturing intensity, appearance and contextual information, while PET provides metastasis functional activities. Meanwhile, the tumor distance map encodes the critical distance information in the network. Fusing these three modalities can effectively boost the GTV identification performance. (3) We evaluate on a dataset comprising voxel-wise labeled GTV instances in esophageal cancer patients, as the largest GTV dataset to date for chest and abdominal radiotherapy. Our method significantly improves the detection mean recall from to , compared with the previous state-of-the-art lesion detection method [yan2019mulan]. The highest achieved recall of is also clinically relevant and valuable. As reported in [goel2017clinical], human observers tend to have relatively low GTV sensitivities, e.g., by even very experienced radiation oncologists. This demonstrates the clinical value of our work.

## 2 Method

Fig. 2 shows the framework of our proposed multi-branch GTV detection-by-segmentation method. Similar to [zhu2020segmentation, zhu2019multi], which are designed for pancreatic tumors, we detect GTV by segmenting it. We first compute the 3D tumor distance transformation map (Sec. 2.1), based on which any GTV is assigned to the tumor-proximal or tumor-distal subcategory. Next, a multi-branch detection-by-segmentation network is designed in which each branch focuses on segmenting one GTV subgroup (Sec. 2.2). This is achieved by applying a binary or soft distance-gating function to the penalty function at the output of the two branches (Sec. 2.3). Hence, each branch can learn specific parameters to specialize in segmenting and detecting the tumor-proximal and tumor-distal GTV, respectively.

### 2.1 3D Tumor Distance Transformation

To stratify GTV into tumor-proximal and tumor-distal subgroups, we first compute the 3D tumor distance transformation map, denoted as $X^{D}$, from the primary tumor $\mathcal{O}$. The value at each voxel represents the shortest distance between this voxel and the mask of the primary tumor. Let $B(\mathcal{O})$ be a set that includes the boundary voxels of the tumor. The distance transformation value at a voxel $p$ is computed as

$$
X^{D}(p) =
\begin{cases}
\min_{q \in B(\mathcal{O})} d(p, q) & \text{if } p \notin \mathcal{O} \\
0 & \text{if } p \in \mathcal{O}
\end{cases}
\tag{1}
$$

where $d(p, q)$ is the Euclidean distance from $p$ to $q$. $X^{D}$ can be efficiently computed using algorithms such as the one proposed in [maurer2003linear]. Based on $X^{D}$, GTV can be divided into tumor-proximal and tumor-distal subgroups using either a binary or a soft distance-gating function, as explained in detail in Sec. 2.3.
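As a minimal sketch of the distance transformation in Eq. 1, the snippet below uses SciPy's exact Euclidean distance transform rather than the algorithm of [maurer2003linear]; this substitution is an assumption made for illustration. The function name `tumor_distance_map`, the `spacing_mm` parameter and the toy volume are hypothetical and do not come from the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def tumor_distance_map(tumor_mask, spacing_mm=(1.0, 1.0, 1.0)):
    """Distance (in mm) from every voxel to the nearest tumor voxel; 0 inside the tumor."""
    tumor_mask = np.asarray(tumor_mask, dtype=bool)
    if not tumor_mask.any():
        raise ValueError("tumor_mask contains no tumor voxels")
    # distance_transform_edt measures the distance from each non-zero voxel to the
    # nearest zero voxel, so the mask is inverted: tumor voxels become the zeros.
    return distance_transform_edt(~tumor_mask, sampling=spacing_mm)


# Toy volume (hypothetical): a single tumor voxel at the centre of a 5x5x5 grid,
# with anisotropic voxel spacing of 2 mm along the first axis.
mask = np.zeros((5, 5, 5), dtype=bool)
mask[2, 2, 2] = True
dist = tumor_distance_map(mask, spacing_mm=(2.0, 1.0, 1.0))
```

Passing the voxel spacing through `sampling` keeps the map in physical units, which matters because the gating thresholds in Sec. 2.3 are specified as physical distances.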

### 2.2 Multi-branch Detection-by-Segmentation via Distance Gating

In the diagnosis process of physicians, GTV identification is implicitly associated with the distance of each LN to the primary tumor. Hence, we divide GTV into tumor-proximal and tumor-distal subgroups and conduct detection accordingly. To do this, we design a multi-branch detection-by-segmentation network with each branch focusing on segmenting one GTV subgroup. Each branch is implemented by an independent decoder to learn and extract the subgroup-specific information, while all branches share a single encoder that extracts the common GTV image features. Assuming there are $N$ data samples, we denote a dataset as $S = \{(X^{CT}_n, X^{PET}_n, X^{D}_n, Y_n)\}_{n=1}^{N}$, where $X^{CT}_n$, $X^{PET}_n$, $X^{D}_n$ and $Y_n$ represent the non-contrast RTCT, registered PET, tumor distance transformation map, and ground truth GTV segmentation mask, respectively. Without loss of generality, we drop $n$ for conciseness in the rest of this paper. The total number of branches is denoted as $M$, where $M = 2$ in our case. A CNN segmentation model is denoted as a mapping function $P = f(\mathcal{X}; \Theta)$, where $\mathcal{X}$ is a set of inputs, which consists of a single modality or a concatenation of multiple modalities, $\Theta$ indicates the model parameters, and $P$ is the predicted probability volume. Given that $p(y_i \mid x_i; \Theta_m)$ represents the predicted probability of a voxel $x_i \in \mathcal{X}$ being the labeled class from the $m$th branch, the overall negative log-likelihood loss aggregated across branches can be formulated as:

$$
\mathcal{L} = \sum_{m} \mathcal{L}_m(\mathcal{X}; \Theta_m, G_m) = -\sum_{i} \sum_{m} g_{m,i} \log p(y_i \mid x_i; \Theta_m)
\tag{2}
$$

where $G = \{G_m\}_{m=1}^{M}$ is introduced as a set of volumes containing the transformed gating weights at each voxel based on its distance to the primary tumor. At every voxel $x_i$, the gating weights satisfy $\sum_{m} g_{m,i} = 1$.
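To make the gated loss of Eq. 2 concrete, here is a small NumPy sketch that evaluates it for two branches on a toy pair of voxels. It assumes a binary foreground/background labeling and per-branch foreground probabilities; the function name `gated_nll`, the `eps` clipping constant and the toy numbers are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def gated_nll(probs, labels, gates, eps=1e-7):
    """Gated negative log-likelihood summed over branches and voxels.

    probs  : (M, ...) per-branch predicted foreground probabilities
    labels : (...)    binary ground-truth mask
    gates  : (M, ...) gating weights that sum to 1 over the branch axis
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=bool)
    gates = np.asarray(gates, dtype=float)
    if not np.allclose(gates.sum(axis=0), 1.0):
        raise ValueError("gating weights must sum to 1 at every voxel")
    # Probability that each branch assigns to the labeled class of every voxel.
    p_label = np.where(labels[None], probs, 1.0 - probs)
    return float(-(gates * np.log(p_label)).sum())


# Toy example (hypothetical values): two voxels, two branches.
labels = np.array([1, 0])
probs = np.array([[0.9, 0.2],    # branch 0 (e.g. tumor-proximal decoder)
                  [0.5, 0.5]])   # branch 1 (e.g. tumor-distal decoder)
gates = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
loss = gated_nll(probs, labels, gates)
```

With these hard gates, the first voxel is penalized only through branch 0 and the second only through branch 1, which is how each decoder ends up seeing gradients from its own subgroup.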

### 2.3 Distance-based Gating Module

Based on the tumor distance map $X^{D}$, our gating functions can be designed to generate appropriate GTV sample weights for different branches so that each branch specializes in learning the subgroup-specific features. In our case, we explore two options: (1) binary distance gating and (2) soft distance gating.

Binary Distance Gating (BG). Based on the tumor distance map $X^{D}$, we divide image voxels into two groups, $\mathcal{X}_{prox}$ and $\mathcal{X}_{dis}$, to be tumor-proximal and tumor-distal, respectively, where $prox = \{i \mid x^{D}_i \le d_0,\ x^{D}_i \in X^{D}\}$ and $dis = \{i \mid x^{D}_i > d_0,\ x^{D}_i \in X^{D}\}$. Therefore the gating transformations for the two decoders are defined as $G_{prox} = \mathbf{1}[x^{D}_i \le d_0]$ and $G_{dis} = 1 - G_{prox}$, where $\mathbf{1}[\cdot]$ is an indicator function which equals one if its argument is true and zero otherwise. In this way, we divide the GTV strictly into two disjoint categories, and each branch focuses on decoding and learning from one category.
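The binary gating described above amounts to thresholding the distance map. A minimal sketch follows, assuming distances within the threshold count as proximal; the function name `binary_gates` and the threshold of 3.0 are hypothetical illustration values, not the paper's setting.

```python
import numpy as np


def binary_gates(dist_map, d0):
    """Return (g_prox, g_dis): hard 0/1 gating weights that sum to 1 at every voxel."""
    dist_map = np.asarray(dist_map, dtype=float)
    # Indicator function: 1 where the voxel lies within d0 of the tumor, else 0.
    g_prox = (dist_map <= d0).astype(float)
    return g_prox, 1.0 - g_prox


# Hypothetical distances (same unit as d0) for four voxels.
bg_prox, bg_dis = binary_gates([0.0, 3.0, 3.1, 10.0], d0=3.0)
```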

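Soft Distance Gating (SG). We further explore a soft gating method that linearly changes the penalty weights of GTV samples as they move closer to or farther from the tumor. This avoids a sudden change of weight values when samples are near the boundary between the proximal and distal categories. As recommended by our physician, we formulate a soft gating module based on $X^{D}$ as follows:

$$
G_{prox}(x_i) =
\begin{cases}
1 - \dfrac{x^{D}_i - d_{prox}}{d_{dis} - d_{prox}} & \text{if } d_{prox} < x^{D}_i \le d_{dis} \\
1 & \text{if } x^{D}_i \le d_{prox} \\
0 & \text{if } x^{D}_i > d_{dis}
\end{cases}
\tag{3}
$$

and $G_{dis}(x_i) = 1 - G_{prox}(x_i)$ accordingly.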
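The soft gating can be sketched as a clipped linear ramp between two distance thresholds, so that the proximal weight falls from 1 to 0 while the distal weight rises correspondingly. The function name `soft_gates` and the thresholds 2.0 and 6.0 below are hypothetical illustration values, not the thresholds used in the experiments.

```python
import numpy as np


def soft_gates(dist_map, d_prox, d_dis):
    """Return (g_prox, g_dis): weights that vary linearly between d_prox and d_dis."""
    if not d_prox < d_dis:
        raise ValueError("d_prox must be smaller than d_dis")
    dist_map = np.asarray(dist_map, dtype=float)
    # Fraction of the way from the proximal to the distal threshold, clipped to [0, 1].
    ramp = np.clip((dist_map - d_prox) / (d_dis - d_prox), 0.0, 1.0)
    g_prox = 1.0 - ramp
    return g_prox, 1.0 - g_prox


# Hypothetical distances (same unit as the thresholds) for five voxels.
sg_prox, sg_dis = soft_gates([0.0, 2.0, 4.0, 6.0, 9.0], d_prox=2.0, d_dis=6.0)
```

A voxel halfway between the two thresholds receives weight 0.5 in both branches, so neither decoder sees an abrupt change in supervision near the category boundary.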

## 3 Experimental Results

### 3.1 Dataset and Preprocessing

Dataset. We collected non-contrast RTCTs of esophageal cancer patients, with all undergoing radiotherapy treatments. Radiation oncologists labeled 3D segmentation masks of the primary tumor and all GTV. For each patient, we have a non-contrast RTCT and a pair of PET/CT scans. There is a total of GTV with voxel-wise annotations in the mediastinum or upper abdomen regions, as the largest annotated GTV dataset to-date. We randomly split patients into , , for training, validation and testing, respectively.

Implementation Details. In our experiments, the PET scan is registered to RTCT using a method similar to that described in [jin2019accurate]. Then all coupled pairs of RTCT and registered PET images are resampled to have a consistent spatial resolution of mm. To generate the 3D training samples, we crop sub-volumes of from the RTCT, registered PET and the tumor distance map around each GTV as well as randomly from the background. For the distance-gating related parameters, we set cm as the binary gating threshold, and cm and cm as the soft gating thresholds, respectively, as suggested by our clinical collaborator. We further apply random rotations in the x-y plane within degrees to augment the training data.

Detection-by-segmentation models are trained on two NVIDIA Quadro RTX 6000 GPUs with a batch size of for epochs. The RAdam [liu2019variance] optimizer with a learning rate of is used with a momentum of and a weight decay of . For inference, 3D sliding windows with a sub-volume of and a stride of voxels are processed. For each sub-volume, predictions from the two decoders are weighted and aggregated according to the gating transformation to obtain the final GTV segmentation results.

Evaluation Metrics. We first describe the hit criterion, i.e., what counts as a correct detection, for our detection-by-segmentation method. For a GTV prediction, if it overlaps with any ground-truth GTV, we treat it as a hit provided that its estimated radius is similar to the radius of the ground-truth GTV within the range of . The performance is assessed using the mean and max recall (mRecall and Recall) at a precision range of with interval, and the mean free response operating characteristic (FROC) at FPs per patient. These operating points were chosen after confirming with our physician.

Comparison Setups. Using the binary and soft distance-based gating functions, our multi-branch GTV detection-by-segmentation method is denoted as multi-branch BG and multi-branch SG, respectively. We compare against the following setups: (1) a single 3D UNet [cciccek20163d] trained using RTCT alone or the early fusion (EF) of multi-modalities (denoted as the single-net method); (2) two separate UNets trained with the corresponding tumor-proximal and tumor-distal GTV samples, with results spatially fused together (our preliminary work [zhu2020detecting], denoted as multi-net BG); and (3) MULAN [yan2019mulan], a state-of-the-art (SOTA) general lesion detection method on DeepLesion [yan2018deeplesion] that contains more than enlarged LNs.
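The inference-time aggregation mentioned under Implementation Details, where the two decoder outputs are weighted by the gating transformation, can be sketched as a per-voxel convex combination. This is one plausible reading of that sentence rather than a confirmed implementation; the function name `fuse_branch_predictions` and the toy probabilities are assumptions for illustration.

```python
import numpy as np


def fuse_branch_predictions(p_prox, p_dis, g_prox):
    """Blend the two decoders' probability maps using the proximal gating weight."""
    p_prox = np.asarray(p_prox, dtype=float)
    p_dis = np.asarray(p_dis, dtype=float)
    g_prox = np.asarray(g_prox, dtype=float)
    # The distal weight is 1 - g_prox, so the weights sum to 1 at every voxel.
    return g_prox * p_prox + (1.0 - g_prox) * p_dis


# Hypothetical probabilities for three voxels: near, intermediate, far from the tumor.
fused = fuse_branch_predictions(
    p_prox=[0.8, 0.8, 0.8],
    p_dis=[0.2, 0.2, 0.2],
    g_prox=[1.0, 0.5, 0.0],
)
```

With binary gates this reduces to selecting the proximal decoder's output near the tumor and the distal decoder's output elsewhere; with soft gates the two outputs are blended in the transition zone.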

### 3.2 Quantitative Results & Discussion

Our quantitative results and comparisons are given in Table 1. Several observations can be drawn regarding the effectiveness of our proposed methods. (1) The multi-modality input, i.e., early fusion (EF) of RTCT, PET and tumor distance map, is of great benefit for detecting GTV. There are drastic performance improvements of absolute and in mRecall and mFROC when EF is adopted as compared to using RTCT alone. These results validate that the input channels of PET functional imaging and the 3D tumor distance transform map are valuable for identifying GTV. (2) The distance-based gating strategies are evidently effective, as the options of multi-net BG, multi-branch BG and multi-branch SG consistently increase the performance. For example, the multi-net BG model achieves mRecall and mFROC, which is a and improvement over the best single-net model (where no distance-based stratification is used). The performance is further boosted by the multi-branch BG and multi-branch SG models, with the highest scores of mRecall and mFROC achieved by multi-branch SG.

| Method | CT | EF | mRecall | Recall | mFROC | FROC@4 | FROC@6 |
|---|---|---|---|---|---|---|---|
| single-net | ✓ | | | | | | |
| single-net | | ✓ | | | | | |
| multi-net BG [zhu2020detecting] | | ✓ | | | | | |
| multi-branch BG (Ours) | | ✓ | | | | | |
| multi-branch SG (Ours) | | ✓ | | | | | |
| MULAN [yan2019mulan] | ✓ | | | | | | |
| MULAN [yan2019mulan] | | ✓ | | | | | |

Multi-branch versus Multi-net. Using the distance-based gating strategy, our proposed multi-branch methods perform considerably better than the multi-net BG model. Even with our second best model, multi-branch BG, the mean and maximal recalls are improved by (from to ) and (from to ) over the multi-net BG model. When the multi-branch framework is equipped with soft gating, marked improvements of absolute and in both mRecall and mFROC are observed compared to the multi-net BG model. This validates the effectiveness of our jointly trained multi-branch framework design, and our intuition that gradually changing the GTV weights for the proximal and distal branches is more natural and effective. Recall that the multi-net baseline directly trains two separate 3D UNets [cciccek20163d], each targeted at segmenting one GTV subgroup. Considering the limited GTV training data (a few hundred patients), splitting it into even smaller patient subgroups makes this baseline prone to overfitting.

Table 1 also compares our method with the SOTA universal lesion detection method, i.e., MULAN [yan2019mulan] on DeepLesion [yan2018deeplesion, yan2019holistic]. We retrained the MULAN models using both CT and EF inputs, but even the best results, i.e., using EF, show a large gap ( vs. mRecall) to our distance-gating networks, which further supports that the tumor distance transformation cue plays a key role in GTV identification.

Fig. 3 visualizes the results of our method compared to other baselines. For the enlarged GTV (top row), most methods detect it correctly. However, as the GTV becomes smaller and the contrast poorer, our method can still successfully detect it while the others struggle.

## 4 Conclusion

In this work, we propose an effective distance-based gating approach in a multi-task deep learning framework to segment GTV, emulating the oncologists’ high-level diagnosis protocols. GTV is divided into two subgroups, “tumor-proximal” and “tumor-distal”, by means of binary or soft distance gating. A novel multi-branch detection-by-segmentation network is trained with each branch specializing in learning the features of one subgroup. We evaluate our method on a dataset of esophageal cancer patients. Our results demonstrate significant performance improvements on the mean recall from to , as compared to previous state-of-the-art work. The highest achieved GTV recall of at the precision level is clinically relevant and valuable.