Radiation therapy (RT) is one of the most effective cancer treatments. Approximately half of all cancer patients undergo RT. Maximizing the radiation into the target tumors while minimizing the radiation in non-tumor tissues is the major step in image-based RT applications. This requires delineation of tumor regions as well as identification of organ at risk (OAR) in a precise manner. The exact delineation of all OARs is vital as it prevent from the adverse effects on healthy surrounding organs. Conventionally, expert RT planners manually define OARs from computed tomography (CT) scans, which is tedious and the quality of OARs depends on the expert skills [Gerhard2021OrganAR]. Recent advances in deep learning (DL
) have made significant strides in natural and medical image segmentation, however, OAR segmentation remains a challenging task due to heterogeneity of the organ appearance, size, shapes, low contrast of the CT scans, and scanner-related differences[ibragimov2017segmentation]. Better, more robust, and generalized segmentation algorithms are urgently needed for RT applications.
Convolutional Neural Networks (CNN)s have served as the defacto architecture for medical image segmentation for the past decade including UNet [ronneberger2015u]
which follows an encoder-decoder based architecture utilizing skip-connections to retain multi-scale features. Since then, extensive research has been done to leverage this design and propose mechanisms which can capture global context. Recent advent of vision transformers has released a wave of methodologies which has further pushed the envelope of performance on computer vision tasks[dosovitskiy2020image]. Medical image segmentation has also seen an influx of transformer based methods which leverage global context achieved by the self-attention mechanism [hatamizadeh2022swin, hatamizadeh2022unetr]. Such methods have recently achieved state-of-the-art (SOTA) performances on array of medical imaging tasks including the segmentation of multi-organs[chen2021transunet], cardiac [chen2020deep], and polyp [srivastava2021gmsrf].
Due to the varying size of the organ of interest for segmentation, multi-scale feature fusion has gained popularity [gu2022multi, srivastava2021gmsrf, 9662196]. However, translation of these models to perform 3D medical image segmentation results in a heavy computation overhead and might result in poor overall performance. To this extent, we design OARFocalFuseNet which combines local-global token interactions and multi-resolution feature fusion. We perform experiments on two publicly available datasets with diverse region-of-interest that vary substantially in size and shape. The main contributions of this work are summarized as follows:
We propose a novel DL segmentation architecture, OARFocalFuseNet, which upon performing multi-scale fusion utilizes a focal modulation scheme to aggregate multi-scale context in a particular resolution stream. The resultant diverse feature maps can be aided by global-local context and allow OARFocalFuseNet to serve as a benchmark for 3D medical image segmentation.
To ensure a fair comparison with traditional approaches, we also propose a 3D-MSF (3D-Multi Scale Fusion Network), which fuses multi-scale features in a densely connected mechanism. We use depth-wise convolutions which serve as a parametrically cheaper substitute as compared to its 2D predecessors.
Let, be the input CT scan where . Here, , , are axial, coronal, & sagittal axis. Each level/block of the encoder comprises of two successive convolutional layers of kernel size and a stride of , followed by a pooling layer which halves the dimensions of the feature maps across each axis. We use consecutive encoder blocks to generate which are sets of feature maps with distinct resolutions (see Figure 1(a)). An additional encoder block is used as the bottleneck layer to generate .
Ii-A2 Focal Fuse blocks
The multi-scale feature fusion is performed by concatenating features from each resolution scale. While propagating feature maps from higher-to-lower resolution streams, we use a single depth-wise convolution operation with a stride of 2 to downsample features, if an additional dimensional reduction is required, a pooling operation is used. Similarly, while transmitting features from lower-to-higher resolution streams, depth-wise transposed convolution operation with stride 2 and if necessary, bicubic interpolation is used. Forfocal levels, feature maps generated by the ’th multi-scale focal layer are calculated as:
Here, and represents a depth-wise convolutional layer and a standard convolutional layer, respectively. Additionally, denotes the multi-scale focal level, which is followed by a activation layer. Each layer of a particular resolution stream receives input from the previous layers of each resolution stream, which are then fused by a convolutional layer with a stride and kernel size of . The depthwise convolutional layer incrementally increases the effective receptive field by after each layer . Hereafter, a global average pooling layer is applied upon the output of the to obtain the global context. Thus, for each resolution scale
we are able to estimate the local and global context and after each multi-scale focal layer, we communicate information between all the resolution scales (see Figure1(c)). This exchange in features allows each stream to boost the diversity of feature maps by leveraging cross-scale long/short context modelling while maintaining the spatial precision of the feature maps in the high resolution scales. Additionally, we use a linear layer for constraining irrelevant features generated by each multi-scale focal layer (see equation 3).
Here, is an element-wise multiplication operator. We use an additional linear layer to exchange information amongst channels. The final focally modulated features generated by each scale are calculated as,
|Method||Mean DSC||Mean HD||Mean ASD||Brainstem||Spinal cord||Right parotid||Left parotid||Mandible|
|Framework||Mean DSC||Aorta||Gallbladder||Kidney (L)||Kidney (R)||Liver||Pancreas||Spleen||Stomach|
The first decoder block of OARFocalFuseNet initially receives features from the bottleneck layer and upscales the feature maps by a factor of . Subsequent decoder blocks upscales feature maps as,
Here, and is the output of the first decoder block. is a transposed convolutional layer with kernel size and stride . , and
denotes instance normalization, ReLU activation layer, and concatenation operation, respectively. The final segmentation head uses a convolutional layer which receives the input fromand has an output number of channels equal to the number of classes.
Ii-B 3D Multi-scale Fusion Network
The architecture of 3D-MSF consists of the same encoder described in Section II-A1. Multi-scale fusion is performed in a densely connected block [srivastava2021gmsrf], where inputs to each layer inside a dense block in a particular resolution stream acquires feature maps from all preceding layers within the same stream and the last layer from other resolution streams. For each resolution scale, multi-resolution features are fused as described in equation 7.
Here, denotes the resolution scale and denotes the layer inside the dense blocks. The upscaling/downscaling to align spatial resolution of received features from different resolution scales is done identically to the upscaling/downscaling recipe used in OARFocalFuseNet. The same decoder described in Section II-A3 is used for combining upscaled features from lower decoder layers and features propagated by the skip connections.
Iii-a Datasets and Implementation Details
We evaluate our proposed method on two datasets: OpenKBP [babier2021openkbp] and the Synapse multi-organ segmentation dataset111https://www.synapse.org/#!Synapse:syn3193805/wiki/217789. OpenKBP contains data from 340 patients who underwent treatment for head-and-neck cancer while the Synapse multi-organ segmentation dataset contains CT scans and corresponding labels for 13 different organs. The second data set was used to demonstrate that our proposed algorithm is generalizable to other segmentation tasks. We follow the training and testing protocol used in [isler2022enhancing], where 188 patients were selected. The selected patients had corresponding labels for five OARs namely: brainstem, spinal cord, right parotid, left parotid, and mandible. Experiments on Synapse multi-organ segmentation dataset followed the same training-testing splits as used in [chen2021transunet] and evaluated the results on eight classes, namely: aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas, spleen, and stomach. We set the number of output channels for the first encoder block as for both OARFocalFuseNet and 3D-MSF. Subsequent encoder blocks increased the number of output channels by a factor of . Adam optimizer was used for optimization. We used a cyclic learning rate scheduler where the maximum learning rate and minimum learning rate were set to and , respectively. We use an equally weighted combination of dice loss and cross-entropy loss for optimization.
Iv Results and Discussion
We provide comparison of our proposed method against other SOTA methodologies on Open-KBP [babier2021openkbp] dataset in Table I. We can observe that our proposed OARFocalFuseNet has achieved the highest DSC of 0.7995 which outperforms the previous SOTA method, SwinUNETR [hatamizadeh2022swin]. Additionally, we observed the lowest ASD of and highest organ-wise DSC on segmentation of right parotid, left parotid and mandible. The qualitative comparison of our proposed OARFocalFuseNet with other SOTA methods is shown in Figure 2. Table II summarizes the performance of our proposed method against other SOTA methodologies on the Synapse multi organ segmentation dataset. Apart from obtaining the highest DSC of , we achieved the highest organ-wise DSC of , , , , on segmenting aorta, left kidney, right kidney, liver, and pancreas respectively. Even though convolutional-based multi-scale fusion methods have enjoyed tremendous success in 2-D medical image segmentation, our OARFocalFuseNet outperforms 3D-MSF. This can be attributed to the focal fuse blocks which can maintain high resolution representations while exchanging global and local context across all resolution scales, thereby allowing OARFocalFuseNet to outperform 3D-MSF and other SOTA methodologies.
In this work, we propose the OARFocalFuseNet architecture for Organs-at-risk (OAR) segmentation, which utilizes a novel focal fuse block for fusing multi-scale features used for capturing long and short range context. We validated our proposed architecture, 3D-MSF and OARFocalFuseNet on two multi-class medical image segmentation datasets, where we showed that the proposed OARFocalFuseNet outperforms other SOTA methods in identifying OARs in head-and-neck and abdominal CT scans. In future work, we will examine the failure cases and further modify the architecture to address challenging scenarios in head, neck and abdominal cases.
This project is supported by the NIH funding: R01-CA246704 and R01-CA240639, and Florida Department of Health (FDOH): 20K04.