An Efficient Multi-Scale Fusion Network for 3D Organ at Risk (OAR) Segmentation

by Abhishek Srivastava, et al.

Accurate segmentation of organs-at-risk (OARs) is a precursor for optimizing radiation therapy planning. Existing deep learning-based multi-scale fusion architectures have demonstrated a tremendous capacity for 2D medical image segmentation. The key to their success is aggregating global context while maintaining high-resolution representations. However, when translated to 3D segmentation problems, existing multi-scale fusion architectures might underperform due to their heavy computational overhead and substantial data requirements. To address this issue, we propose a new OAR segmentation framework, called OARFocalFuseNet, which fuses multi-scale features and employs focal modulation to capture global-local context across multiple scales. Each resolution stream is enriched with features from different resolution scales, and multi-scale information is aggregated to model diverse contextual ranges. As a result, feature representations are further boosted. Comprehensive comparisons on OAR segmentation as well as multi-organ segmentation show that our proposed OARFocalFuseNet outperforms recent state-of-the-art methods on the publicly available OpenKBP and Synapse multi-organ segmentation datasets. Both of the proposed methods (3D-MSF and OARFocalFuseNet) showed promising performance in terms of standard evaluation metrics. Our best performing method (OARFocalFuseNet) obtained a Dice coefficient of 0.7995 and a Hausdorff distance of 5.1435 on the OpenKBP dataset and a Dice coefficient of 0.8137 on the Synapse multi-organ segmentation dataset.





I Introduction

Radiation therapy (RT) is one of the most effective cancer treatments, and approximately half of all cancer patients undergo RT. Maximizing the radiation delivered to target tumors while minimizing the radiation received by non-tumor tissues is the central step in image-based RT applications. This requires precise delineation of tumor regions as well as identification of organs at risk (OARs). Exact delineation of all OARs is vital because it prevents adverse effects on healthy surrounding organs. Conventionally, expert RT planners manually delineate OARs on computed tomography (CT) scans, which is tedious, and the quality of the delineations depends on the expert's skill [Gerhard2021OrganAR]. Recent advances in deep learning (DL) have made significant strides in natural and medical image segmentation; however, OAR segmentation remains a challenging task due to the heterogeneity of organ appearance, size, and shape, the low contrast of CT scans, and scanner-related differences [ibragimov2017segmentation]. Better, more robust, and generalized segmentation algorithms are urgently needed for RT applications.

Convolutional neural networks (CNNs) have served as the de facto architecture for medical image segmentation for the past decade, including UNet [ronneberger2015u], which follows an encoder-decoder architecture and uses skip connections to retain multi-scale features. Since then, extensive research has leveraged this design and proposed mechanisms that capture global context. The recent advent of vision transformers has released a wave of methods that have further pushed the envelope of performance on computer vision tasks [dosovitskiy2020image]. Medical image segmentation has also seen an influx of transformer-based methods that leverage the global context provided by the self-attention mechanism [hatamizadeh2022swin, hatamizadeh2022unetr]. Such methods have recently achieved state-of-the-art (SOTA) performance on an array of medical imaging tasks, including multi-organ [chen2021transunet], cardiac [chen2020deep], and polyp [srivastava2021gmsrf] segmentation.

Because the organs of interest vary in size, multi-scale feature fusion has gained popularity [gu2022multi, srivastava2021gmsrf, 9662196]. However, translating these models to 3D medical image segmentation incurs a heavy computational overhead and might result in poor overall performance. To this end, we design OARFocalFuseNet, which combines local-global token interactions and multi-resolution feature fusion. We perform experiments on two publicly available datasets with diverse regions of interest that vary substantially in size and shape. The main contributions of this work are summarized as follows:

  1. We propose a novel DL segmentation architecture, OARFocalFuseNet, which upon performing multi-scale fusion utilizes a focal modulation scheme to aggregate multi-scale context in a particular resolution stream. The resultant diverse feature maps can be aided by global-local context and allow OARFocalFuseNet to serve as a benchmark for 3D medical image segmentation.

  2. To ensure a fair comparison with traditional approaches, we also propose 3D-MSF (3D Multi-Scale Fusion Network), which fuses multi-scale features in a densely connected mechanism. We use depth-wise convolutions, which serve as a parametrically cheaper substitute compared to its 2D predecessors.

Fig. 1: The proposed OARFocalFuseNet architecture. a) The complete skeleton of OARFocalFuseNet. b) The Focal Fuse block, which aggregates multi-scale global-local context. c) The Multi-Scale Context Aggregation block, which gathers multi-scale features, performs depth-wise convolutions to gather features with diverse context ranges, and performs spatial and channel-wise gating to prune irrelevant features.

II Method

In this section, we describe the architecture of our two proposed baselines. Section II-A describes the architecture of OARFocalFuseNet and Section II-B briefly describes the architecture of 3D-MSF.

II-A OARFocalFuseNet

II-A1 Encoder

Let, be the input CT scan where . Here, , , are the axial, coronal, and sagittal axes. Each level/block of the encoder comprises two successive convolutional layers of kernel size and a stride of , followed by a pooling layer which halves the dimensions of the feature maps along each axis. We use consecutive encoder blocks to generate which are sets of feature maps with distinct resolutions (see Figure 1(a)). An additional encoder block is used as the bottleneck layer to generate .
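
The effect of the encoder on feature-map sizes can be sketched with simple shape arithmetic. The snippet below is only an illustration: the input size, the number of blocks, the base channel count, and the channel growth factor are hypothetical values chosen for the example, and the convolutions are assumed to preserve spatial size so that only the pooling layer changes it.

```python
def encoder_shapes(input_shape, num_blocks, base_channels, channel_factor=2):
    """Return (channels, spatial_shape) at the output of each encoder block.

    Assumes shape-preserving convolutions followed by a pooling layer that
    halves every spatial axis (floor division). All arguments are
    illustrative; the paper's actual settings are not reproduced here.
    """
    outputs = []
    spatial = tuple(input_shape)
    channels = base_channels
    for _ in range(num_blocks):
        # Pooling halves the axial, coronal, and sagittal dimensions.
        spatial = tuple(s // 2 for s in spatial)
        outputs.append((channels, spatial))
        # The next block widens the channel dimension.
        channels *= channel_factor
    return outputs


# Hypothetical configuration: 128^3 input, 4 blocks, 16 base channels.
shapes = encoder_shapes((128, 128, 128), num_blocks=4, base_channels=16)
for level, (c, s) in enumerate(shapes, start=1):
    print(f"block {level}: channels={c}, spatial={s}")
```

Each additional block reduces the voxel count by a factor of eight, which is why the lower-resolution streams are comparatively cheap to process.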

II-A2 Focal Fuse blocks

Each of the feature maps is converted to a new feature space using a linear layer (see Equation 1 and Figure 1(c)).


The multi-scale feature fusion is performed by concatenating features from each resolution scale. While propagating feature maps from higher-to-lower resolution streams, we use a single depth-wise convolution with a stride of 2 to downsample features; if an additional dimensional reduction is required, a pooling operation is used. Similarly, while transmitting features from lower-to-higher resolution streams, a depth-wise transposed convolution with a stride of 2 is used, followed by bicubic interpolation if necessary. For focal levels, feature maps generated by the ’th multi-scale focal layer are calculated as:


Here, and represent a depth-wise convolutional layer and a standard convolutional layer, respectively. Additionally, denotes the multi-scale focal level, which is followed by a activation layer. Each layer of a particular resolution stream receives input from the previous layers of each resolution stream, which are then fused by a convolutional layer with a stride and kernel size of . The depth-wise convolutional layer incrementally increases the effective receptive field by after each layer . Thereafter, a global average pooling layer is applied to the output of the to obtain the global context. Thus, for each resolution scale we are able to estimate the local and global context, and after each multi-scale focal layer we communicate information between all the resolution scales (see Figure 1(c)). This exchange of features allows each stream to boost the diversity of its feature maps by leveraging cross-scale long/short context modelling while maintaining the spatial precision of the feature maps in the high-resolution scales. Additionally, we use a linear layer for constraining irrelevant features generated by each multi-scale focal layer (see Equation 3).


where . The multi-scale focal modulator is generated by adding the context information accumulated by each multi-scale focal layer (see equation 4 and Figure 1(c)).


Here, is an element-wise multiplication operator. We use an additional linear layer to exchange information amongst channels. The final focally modulated features generated by each scale are calculated as,

| Method | Mean DSC | Mean HD | Mean ASD | Brainstem | Spinal cord | Right parotid | Left parotid | Mandible |
|---|---|---|---|---|---|---|---|---|
| UNet-3D [ronneberger2015u] | 0.7781 | 5.0662 | 0.6839 | 0.7941 | 0.7444 | 0.7416 | 0.7601 | 0.8503 |
| AttUNet [oktay2018attention] | 0.7811 | 4.9981 | 0.6024 | 0.7919 | 0.7439 | 0.7394 | 0.7691 | 0.8611 |
| DynUNet [isensee2021nnu] | 0.7931 | 6.3316 | 0.6460 | 0.7958 | 0.7521 | 0.7696 | 0.7731 | 0.8747 |
| UNETR [hatamizadeh2022unetr] | 0.7810 | 9.5582 | 0.8527 | 0.7791 | 0.7339 | 0.7610 | 0.7692 | 0.8616 |
| SwinUNETR [hatamizadeh2022swin] | 0.7986 | 6.7520 | 0.6409 | 0.8085 | 0.7604 | 0.7706 | 0.7723 | 0.8813 |
| 3D-MSF (Ours) | 0.7870 | 5.5505 | 0.6152 | 0.7903 | 0.7478 | 0.7498 | 0.7816 | 0.8655 |
| OARFocalFuseNet (Ours) | 0.7995 | 5.1435 | 0.5743 | 0.8031 | 0.7402 | 0.7725 | 0.7987 | 0.8832 |

TABLE I: Comparison of results on the OpenKBP dataset. We report the mean DSC, mean HD, mean ASD, and DSC for each organ.

| Framework (encoder) | Framework (decoder) | Mean DSC | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|---|
| V-Net [milletari2016v] | | 0.6881 | 0.7534 | 0.5187 | 0.7710 | 0.8075 | 0.8784 | 0.4005 | 0.8056 | 0.5698 |
| DARR [fu2020domain] | | 0.6977 | 0.7474 | 0.5377 | 0.7231 | 0.7324 | 0.9408 | 0.5418 | 0.8990 | 0.4596 |
| R50 | U-Net [ronneberger2015u] | 0.7468 | 0.8418 | 0.6284 | 0.7919 | 0.7129 | 0.9335 | 0.4823 | 0.8441 | 0.7392 |
| R50 | AttnUNet [oktay2018attention] | 0.7557 | 0.5592 | 0.6391 | 0.7920 | 0.7271 | 0.9356 | 0.4937 | 0.8719 | 0.7495 |
| ViT [dosovitskiy2020image] | None | 0.6150 | 0.4438 | 0.3959 | 0.6746 | 0.6294 | 0.8921 | 0.4314 | 0.7545 | 0.6978 |
| ViT [dosovitskiy2020image] | CUP | 0.6786 | 0.7019 | 0.4510 | 0.7470 | 0.6740 | 0.9132 | 0.4200 | 0.8175 | 0.7044 |
| R50-ViT [dosovitskiy2020image] | CUP | 0.7129 | 0.7373 | 0.5513 | 0.7580 | 0.7220 | 0.9151 | 0.4599 | 0.8199 | 0.7395 |
| TransUNet [chen2021transunet] | | 0.7748 | 0.8723 | 0.6313 | 0.8187 | 0.7702 | 0.9408 | 0.5586 | 0.8508 | 0.7562 |
| 3D-MSF (Ours) | | 0.8084 | 0.8883 | 0.6968 | 0.8382 | 0.8204 | 0.9343 | 0.6460 | 0.8705 | 0.7723 |
| OARFocalFuseNet (Ours) | | 0.8137 | 0.9085 | 0.6752 | 0.8424 | 0.8237 | 0.9496 | 0.6808 | 0.8698 | 0.7595 |

TABLE II: Comparison of results on the Synapse multi-organ CT dataset (average Dice score and Dice score for each organ).

Fig. 2: Qualitative comparison of OARFocalFuseNet with other recent benchmarking methods on OpenKBP dataset.
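
To make the focal fuse computation of Section II-A2 more concrete, the following is a minimal NumPy sketch of focal modulation within a single resolution stream, written for a 1D sequence of tokens rather than a 3D volume. It follows the general focal modulation recipe (a query projection, a stack of depth-wise convolutions with growing kernels, a globally pooled context, per-level gates, and a final channel-mixing linear layer) and is not the authors' implementation. The GELU activation, the kernel sizes 3, 5, 7, the random weights, and all function and parameter names are assumptions made for illustration; the cross-scale exchange between resolution streams is omitted.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU (the activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def depthwise_conv1d(z, kernels):
    # z: (N, C); kernels: (C, k). Each channel is filtered independently.
    return np.stack(
        [np.convolve(z[:, c], kernels[c], mode="same") for c in range(z.shape[1])],
        axis=1,
    )


def init_params(channels, num_levels, rng):
    # Hypothetical random weights; kernel size grows with the focal level.
    return {
        "w_q": rng.normal(size=(channels, channels)),        # query projection
        "w_z": rng.normal(size=(channels, channels)),        # context projection
        "w_g": rng.normal(size=(channels, num_levels + 1)),  # one gate per level + global
        "w_h": rng.normal(size=(channels, channels)),        # channel-mixing layer
        "dw": [rng.normal(size=(channels, 3 + 2 * l)) for l in range(num_levels)],
    }


def focal_modulation(x, params, num_levels):
    q = x @ params["w_q"]        # (N, C) query features
    z = x @ params["w_z"]        # (N, C) initial context features
    gates = x @ params["w_g"]    # (N, L + 1) per-position gates
    modulator = np.zeros_like(q)
    for level in range(num_levels):
        # Each level enlarges the receptive field of the context features.
        z = gelu(depthwise_conv1d(z, params["dw"][level]))
        modulator += gates[:, level:level + 1] * z
    # Global context: average pooling over all positions of the last level.
    global_ctx = gelu(z.mean(axis=0, keepdims=True))         # (1, C)
    modulator += gates[:, num_levels:] * global_ctx
    # Linear layer exchanges information among channels.
    modulator = modulator @ params["w_h"]
    # Element-wise multiplication of query and modulator.
    return q * modulator


rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))                 # 10 tokens, 4 channels
y = focal_modulation(x, init_params(4, 3, rng), num_levels=3)
print(y.shape)
```

In the full architecture this computation would run in every resolution stream, with the context features of the other streams resampled and fused after each focal level.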

II-A3 Decoder

The first decoder block of OARFocalFuseNet receives features from the bottleneck layer and upscales the feature maps by a factor of . Subsequent decoder blocks upscale feature maps as,


Here, and is the output of the first decoder block. is a transposed convolutional layer with kernel size and stride . , and denote instance normalization, a ReLU activation layer, and the concatenation operation, respectively. The final segmentation head uses a convolutional layer which receives its input from and has a number of output channels equal to the number of classes.
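
The spatial size produced by each decoder block follows from the standard output-size formula for a transposed convolution. The sketch below applies that formula repeatedly; the kernel size of 2 and stride of 2 are assumptions chosen so that each block doubles the resolution, and the bottleneck size is a hypothetical value.

```python
def transposed_conv_out(size, kernel, stride, padding=0, output_padding=0, dilation=1):
    """Output length along one axis of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1


def decoder_spatial_sizes(bottleneck_size, num_blocks, kernel=2, stride=2):
    """Sizes along one axis after each decoder block (assumed kernel/stride)."""
    sizes = []
    size = bottleneck_size
    for _ in range(num_blocks):
        size = transposed_conv_out(size, kernel, stride)
        sizes.append(size)
    return sizes


# Hypothetical bottleneck of 4 voxels per axis, upsampled by four blocks.
print(decoder_spatial_sizes(4, 4))
```

With these settings every block exactly doubles the size, so the upsampled features line up with the skip-connection features of the matching encoder level before concatenation.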

II-B 3D Multi-scale Fusion Network

The architecture of 3D-MSF uses the same encoder described in Section II-A1. Multi-scale fusion is performed in a densely connected block [srivastava2021gmsrf], where each layer inside a dense block in a particular resolution stream receives feature maps from all preceding layers within the same stream and from the last layer of the other resolution streams. For each resolution scale, multi-resolution features are fused as described in Equation 7.


Here, denotes the resolution scale and denotes the layer inside the dense blocks. The upscaling/downscaling to align spatial resolution of received features from different resolution scales is done identically to the upscaling/downscaling recipe used in OARFocalFuseNet. The same decoder described in Section II-A3 is used for combining upscaled features from lower decoder layers and features propagated by the skip connections.
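
The connectivity pattern of the dense fusion block can be illustrated by enumerating, for a given layer, which feature maps it receives and how each one must be resampled. This is a sketch of the wiring only, under the assumptions that stream 0 has the highest resolution, adjacent streams differ by a factor of 2, layer index 0 denotes the input of the dense block, and "last layer" means the most recent layer of the other streams; the function name and return format are hypothetical.

```python
def dense_fusion_inputs(stream, layer, num_streams):
    """List the sources feeding `layer` of resolution stream `stream`."""
    if layer < 1:
        raise ValueError("layer indices start at 1; index 0 denotes the block input")
    # All preceding layers of the same stream (dense connectivity).
    sources = [{"stream": stream, "layer": l, "resample": None} for l in range(layer)]
    # The most recent layer of every other resolution stream.
    for other in range(num_streams):
        if other == stream:
            continue
        # Higher-resolution sources are downsampled, lower-resolution ones upsampled.
        direction = "down" if other < stream else "up"
        factor = 2 ** abs(other - stream)
        sources.append({"stream": other, "layer": layer - 1, "resample": (direction, factor)})
    return sources


for src in dense_fusion_inputs(stream=1, layer=3, num_streams=4):
    print(src)
```

The number of inputs to a layer grows linearly with its depth in the block and with the number of streams, which is one reason inexpensive depth-wise convolutions are attractive for the resampling steps.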

III Experiments

III-A Datasets and Implementation Details

We evaluate our proposed method on two datasets: OpenKBP [babier2021openkbp] and the Synapse multi-organ segmentation dataset (footnote 1: !Synapse:syn3193805/wiki/217789). OpenKBP contains data from 340 patients who underwent treatment for head-and-neck cancer, while the Synapse multi-organ segmentation dataset contains CT scans and corresponding labels for 13 different organs. The second dataset was used to demonstrate that our proposed algorithm generalizes to other segmentation tasks. For OpenKBP, we follow the training and testing protocol used in [isler2022enhancing], where 188 patients were selected. The selected patients had corresponding labels for five OARs, namely: brainstem, spinal cord, right parotid, left parotid, and mandible. Experiments on the Synapse multi-organ segmentation dataset followed the same training-testing splits as used in [chen2021transunet] and evaluated the results on eight classes, namely: aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach. We set the number of output channels for the first encoder block as for both OARFocalFuseNet and 3D-MSF. Subsequent encoder blocks increased the number of output channels by a factor of . The Adam optimizer was used for optimization. We used a cyclic learning rate scheduler where the maximum learning rate and minimum learning rate were set to and , respectively. We use an equally weighted combination of Dice loss and cross-entropy loss for optimization.
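
The training objective and the learning rate schedule described above can be sketched in plain Python. This is an illustrative sketch rather than the training code: the soft Dice formulation, the unit weights on both loss terms, the triangular shape of the cyclic schedule, and the learning rate bounds in the usage example are all assumptions.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss averaged over classes.

    probs, targets: lists (one per class) of per-voxel values.
    """
    losses = []
    for p_c, t_c in zip(probs, targets):
        inter = sum(p * t for p, t in zip(p_c, t_c))
        denom = sum(p_c) + sum(t_c)
        losses.append(1.0 - (2.0 * inter + eps) / (denom + eps))
    return sum(losses) / len(losses)


def cross_entropy(probs, targets, eps=1e-12):
    """Mean per-voxel cross-entropy for one-hot targets."""
    n_voxels = len(probs[0])
    total = 0.0
    for p_c, t_c in zip(probs, targets):
        total -= sum(t * math.log(p + eps) for p, t in zip(p_c, t_c))
    return total / n_voxels


def combined_loss(probs, targets):
    # Equal weighting of the two terms (unit weights assumed).
    return dice_loss(probs, targets) + cross_entropy(probs, targets)


def triangular_lr(step, min_lr, max_lr, half_cycle):
    """Triangular cyclic schedule: min -> max -> min every 2 * half_cycle steps."""
    pos = step % (2 * half_cycle)
    frac = pos / half_cycle if pos <= half_cycle else 2.0 - pos / half_cycle
    return min_lr + (max_lr - min_lr) * frac


# Two classes, three voxels; the prediction is confident but imperfect.
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
probs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.9]]
print(combined_loss(probs, targets))
# Hypothetical learning rate bounds for illustration only.
print([triangular_lr(s, 1e-4, 1e-2, 10) for s in (0, 5, 10, 15, 20)])
```

The Dice term counteracts class imbalance between small organs and background, while the cross-entropy term provides smooth per-voxel gradients.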

IV Results and Discussion

We provide a comparison of our proposed method against other SOTA methodologies on the OpenKBP [babier2021openkbp] dataset in Table I. Our proposed OARFocalFuseNet achieves the highest mean DSC of 0.7995, outperforming the previous SOTA method, SwinUNETR [hatamizadeh2022swin]. Additionally, it obtains the lowest ASD of 0.5743 and the highest organ-wise DSC on the right parotid, left parotid, and mandible. The qualitative comparison of our proposed OARFocalFuseNet with other SOTA methods is shown in Figure 2. Table II summarizes the performance of our proposed method against other SOTA methodologies on the Synapse multi-organ segmentation dataset. Apart from obtaining the highest mean DSC of 0.8137, we achieved the highest organ-wise DSC of 0.9085, 0.8424, 0.8237, 0.9496, and 0.6808 on segmenting the aorta, left kidney, right kidney, liver, and pancreas, respectively. Even though convolution-based multi-scale fusion methods have enjoyed tremendous success in 2D medical image segmentation, our OARFocalFuseNet outperforms 3D-MSF. This can be attributed to the focal fuse blocks, which maintain high-resolution representations while exchanging global and local context across all resolution scales, thereby allowing OARFocalFuseNet to outperform 3D-MSF and other SOTA methodologies.
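
For readers unfamiliar with the reported metrics, the following sketch computes the Dice similarity coefficient (DSC) and a Hausdorff distance (HD) on toy voxel sets. It is a simplified illustration and not the evaluation code: the HD here is the maximum symmetric distance over all labelled voxels in voxel units, whereas an evaluation pipeline may instead use surface voxels, physical spacing, or a percentile variant. The last lines check that averaging the per-organ DSC values of OARFocalFuseNet in Table I reproduces the reported mean.

```python
import math


def dice_coefficient(pred, truth):
    """DSC between two sets of voxel coordinates."""
    if not pred and not truth:
        return 1.0
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff_distance(pred, truth):
    """Symmetric Hausdorff distance between two non-empty voxel sets."""
    def directed(a, b):
        # Largest distance from a point in `a` to its nearest point in `b`.
        return max(min(math.dist(p, q) for q in b) for p in a)
    return max(directed(pred, truth), directed(truth, pred))


# Toy example: two 3-voxel masks sharing two voxels.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
truth = {(0, 0, 0), (0, 0, 1), (1, 0, 0)}
print(dice_coefficient(pred, truth))
print(hausdorff_distance(pred, truth))

# Per-organ DSC of OARFocalFuseNet from Table I.
organ_dsc = [0.8031, 0.7402, 0.7725, 0.7987, 0.8832]
mean_dsc = sum(organ_dsc) / len(organ_dsc)
print(round(mean_dsc, 4))
```

DSC rewards volumetric overlap, while HD penalizes the single worst boundary error, which explains why a method can rank first on one metric and not on the other.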

V Conclusion

In this work, we propose the OARFocalFuseNet architecture for organs-at-risk (OAR) segmentation, which uses a novel focal fuse block to fuse multi-scale features and capture long- and short-range context. We validated our proposed architectures, 3D-MSF and OARFocalFuseNet, on two multi-class medical image segmentation datasets, where we showed that the proposed OARFocalFuseNet outperforms other SOTA methods in identifying OARs in head-and-neck and abdominal CT scans. In future work, we will examine the failure cases and further modify the architecture to address challenging scenarios in head-and-neck and abdominal cases.


This project is supported by NIH grants R01-CA246704 and R01-CA240639, and by the Florida Department of Health (FDOH) grant 20K04.