Multi-scale guided attention for medical image segmentation

06/07/2019 ∙ by Ashish Sinha, et al.

Even though convolutional neural networks (CNNs) are driving progress in medical image segmentation, standard models still have some drawbacks. First, the use of multi-scale approaches, i.e., encoder-decoder architectures, leads to a redundant use of information, where similar low-level features are extracted multiple times at multiple scales. Second, long-range feature dependencies are not efficiently modeled, resulting in non-optimal discriminative feature representations associated with each semantic class. In this paper we attempt to overcome these limitations with the proposed architecture, by capturing richer contextual dependencies based on the use of guided self-attention mechanisms. This approach is able to integrate local features with their corresponding global dependencies, as well as highlight interdependent channel maps in an adaptive manner. Further, the additional loss between different modules guides the attention mechanisms to remove the noise and focus on more discriminant regions of the image by emphasizing relevant feature associations. We evaluate the proposed model in the context of abdominal organ segmentation on magnetic resonance imaging (MRI). A series of ablation experiments support the importance of these attention modules in the proposed architecture. In addition, compared to other state-of-the-art segmentation networks our model yields better segmentation performance, increasing the accuracy of the predictions while reducing the standard deviation. This demonstrates the efficiency of our approach to generate precise and reliable automatic segmentations of medical images. Our code and the trained model are made publicly available at:




I Introduction

Semantic segmentation of medical images is a crucial step in the diagnosis, treatment and follow-up of many diseases. Although the automation of this task has been widely studied, manual annotation is still typically used in clinical practice, a process that is time-consuming and prone to inter- and intra-observer variability. Thus, there is a high demand for accurate and reliable automatic segmentation methods that improve workflow efficiency in clinical scenarios, alleviating the workload of radiologists and other medical experts.

Recently, convolutional neural networks (CNNs) have achieved state-of-the-art performance in a breadth of visual recognition tasks, becoming very popular due to their powerful, nonlinear feature extraction capabilities. These deep models dominate the literature in medical image segmentation [1] and have achieved outstanding performance in a broad span of applications, including brain [2] and cardiac [3] imaging, becoming the de facto solution for these problems. In this scenario, fully convolutional neural networks [4] or encoder-decoder architectures [5, 6] are typically the standard choice. These architectures are commonly composed of a contracting path, which collapses an input image into a set of high-level features, and an expanding path, where high-level features are used to reconstruct a pixel-wise segmentation mask in a single [4] or in multiple [5, 6] upsampling steps. Nevertheless, despite their strong representation power, these multi-scale approaches lead to a redundant use of information, e.g., similar low-level features are extracted multiple times at different levels within the network. Furthermore, the discriminative power of the learned feature representations for pixel-wise recognition may be insufficient for some challenging tasks, such as medical image segmentation.

Recent works to improve the discriminative ability of feature representations include the use of multi-scale context fusion [7, 8, 9, 10]. Zhao et al. [8] proposed a pyramid network that exploited global information at different scales by aggregating feature maps generated by multiple dilated convolutional blocks. Aggregation of contextual multi-scale information can also be achieved through pooling operations [11]. Even though these strategies may help to capture objects at different scales, contextual dependencies for all image regions are homogeneous and non-adaptive, ignoring the difference between local representation and contextual dependencies for different categories. Further, these multi-context representations are manually designed and therefore lack flexibility. As a result, long-range object relationships across the whole image cannot be fully leveraged in these approaches, even though they are of pivotal importance in many medical image segmentation problems.

Alternatively, attention mechanisms have been widely studied in deep CNNs for many computer vision tasks in order to efficiently integrate local and global features, including human pose estimation [12], emotion recognition [13], text detection [14], object detection [15] and classification [16]. Unlike standard multi-scale feature fusion approaches, which compress an entire image into a static representation, attention allows the network to focus on the most relevant features without additional supervision, avoiding the use of multiple similar feature maps and highlighting salient features that are useful for a given task. Semantic segmentation networks have also benefited from attention modules, which has resulted in enhanced models for pixel-wise recognition tasks [17, 18, 19, 20, 21, 22]. For example, Chen et al. [17] proposed an attention mechanism to weight multi-scale features extracted at different scales in the context of natural scene segmentation. This method improved the segmentation performance over classical average and max-pooling techniques for merging multi-scale feature predictions.

Despite the growing interest in integrating attention mechanisms in image segmentation networks for natural scenes, their adoption in medical images remains scarce [23, 24, 25, 26], being limited to simple attention models. Thus, in this work, we explore more complex attention mechanisms that can boost the performance of standard deep networks for the task of medical image segmentation. Specifically, we propose a multi-scale guided attention network for medical image segmentation. First, the multi-scale approach generates stacks at different resolutions containing different semantics. While lower-level stacks focus on local appearance, higher-level stacks encode global representations. This multi-scale strategy encourages the attention maps generated at different resolutions to encode different semantic information. Then, at each scale, a stack of attention modules gradually removes noisy areas and emphasizes those regions that are more relevant to the semantic descriptions of the targets. Each attention module contains two independent self-attention mechanisms, which focus on modelling position and channel feature dependencies, respectively. This pair makes it possible to model wider and richer contextual representations and to improve dependencies between channel maps, resulting in enhanced feature representations. We validate our method in the task of multi-organ segmentation on magnetic resonance imaging (MRI), employing the publicly available CHAOS dataset. Results show that the proposed architecture improves the segmentation performance by successfully modeling rich contextual dependencies over local features.

II Related work

II-A Medical image segmentation

Even though segmentation of medical images has been widely studied in the past [27, 28], it is undeniable that CNNs are driving progress in this field, leading to outstanding performance in many applications. Most available medical image segmentation architectures are inspired by the well-known fully convolutional neural network (FCN) [4] or UNet [5]. In FCN, the fully connected layers of standard classification CNNs are replaced by convolutional layers to achieve dense pixel prediction in one forward pass. To recover the original resolution of the input image, the prediction is upsampled in a single step. Further, to improve the prediction capabilities, skip connections are included in the network by employing the intermediate feature maps. On the other hand, UNet contains contractive and expansive paths created by combining convolutional layers with pooling and upsampling layers. Skip connections are used to concatenate the features from the contractive and expansive path layers. Many extensions of these networks have been proposed to solve pixel-wise segmentation problems in a wide range of applications [29, 30, 31, 32, 33, 34, 35, 36].

II-B Deep attention

Attention mechanisms aim at emphasizing important local regions captured in local features and filtering irrelevant information transferred by global features, improving the modeling of long-range dependencies. These modules have therefore become an essential part of models that need to capture global dependencies. The integration of these attention modules has proven very successful in many vision problems, such as image captioning [37], image question-answering [38], classification [39] or detection [40], among many others. Self-attention [41, 42, 43, 44] has recently attracted the interest of researchers, as it exhibits a good ability to model long-range dependencies while maintaining computational and statistical efficiency. In these modules, the response at each position is calculated by attending to all positions and taking their weighted average in an embedding space. For vision problems, [18, 19, 43] integrated self-attention to model the relation of local features with their corresponding global dependencies. For instance, the point-wise spatial attention network (PSANet) proposed in [18] allows a flexible and dynamic aggregation of long-range contextual information by connecting each position in the feature map with all the others through self-adaptive attention maps.

Recent works have indicated that attention features generated in a single step may still contain noise introduced from regions that are irrelevant for a given class, leading to sub-optimal results [38, 45]. To overcome this issue, some works have investigated the use of multiple progressive attention layers in the context of visual question answering [38] or zero-shot learning [45]. This strategy gradually filters undesired noise and emphasizes the regions highly relevant for the class semantic representations. To the best of our knowledge, the application of stacked attention modules remains unexplored in semantic segmentation.

II-C Medical image segmentation with deep attention

Even though attention mechanisms are becoming popular in many vision problems, the literature on medical image segmentation with attention remains scarce and is limited to simple attention modules [23, 24, 25, 26]. Wang et al. [23] employed attention modules at multiple resolutions to combine local deep attention features (DAF) with global context for prostate segmentation on ultrasound images. To model long-range dependencies, local and global features were combined in a simple attention module, which contains three convolutional layers followed by a softmax function to create the attention map. A similar attention module, composed of two convolutional layers followed by a softmax, was used in a hierarchical aggregation framework integrated in UNet for left atrial segmentation [24]. More recently, additive attention gate modules were integrated in the skip connections of the decoding path of UNet with the goal of better modeling complementary information from the encoder [25].

III Methods

III-A Overview

Target structures in medical imaging typically present intra- and inter-class diversity in size, shape and texture, particularly if images are processed in 2D. Traditional CNNs for segmentation have a local receptive field, which results in the generation of local feature representations. As long-range contextual information is not properly encoded, local feature representations may lead to differences between features corresponding to pixels with the same label [19]. This may introduce intra-class inconsistency that can ultimately impact the recognition performance [46]. To tackle this problem, we investigate attention mechanisms to build associations between features. First, global context is captured by employing a multi-scale strategy. Then, learned features at multiple scales are fed into the guided attention modules, which are composed of a stack of spatial and channel self-attention modules. While the spatial and channel self-attention modules help to adaptively integrate local features with their global dependencies, the stack of attention modules helps to gradually filter out noise and emphasize relevant information. The overview of the proposed framework is depicted in Figure 1.

Fig. 1: Overview of the proposed multi-scale guided attention network. We resort to ResNet-101 to extract dense local features.

III-B Multi-scale attention maps

Multi-scale features were known to be useful in computer vision problems even before the deep learning era [47]. In the context of deep segmentation networks, the integration of multi-scale features has demonstrated strong performance [17, 48, 49]. Inspired by these works, we make use of learned features at multiple scales, which help to encode both global and local context. Specifically, we follow the multi-scale strategy recently proposed in [23], which is illustrated in Fig. 1. In this setting, features at multiple scales are denoted as $F_s$, where $s$ indicates the level in the architecture. Since features come at different resolutions for each level $s$, they are upsampled to a common resolution by employing a linear interpolation, leading to enlarged feature maps $F'_s$. Then, the maps $F'_s$ from all the scales are concatenated, forming a tensor that is convolved to create a common multi-scale feature map, $F_{MS}$. This new multi-scale feature map is combined with each of the feature maps at different scales and fed into the guided attention modules to generate the attention features $A_s$:

$$A_s = \mathrm{AttMod}_s\big(\mathrm{conv}([F'_s, F_{MS}])\big) \tag{1}$$

where $\mathrm{AttMod}_s$ represents each guided attention module.

III-C Spatial and channel self-attention modules

As introduced earlier, receptive fields in traditional deep segmentation models are reduced to a local vicinity. This limits their capability to model wider and richer contextual representations. On the other hand, channel maps can be considered as class-specific responses, where different semantic responses are associated with each other. Thus, another strategy to enhance the feature representation of specific semantics is to improve the dependencies between channel maps [50]. To address these limitations of standard CNNs, we employ the position and channel attention modules recently proposed in [19], which are depicted in Figure 2.

Position attention module (PAM)

Let $F \in \mathbb{R}^{C \times W \times H}$ denote an input feature map to the attention module, where $C$, $W$ and $H$ represent the channel, width and height dimensions, respectively. In the upper branch, $F$ is passed through a convolutional block, resulting in a feature map $F^p_0 \in \mathbb{R}^{C' \times W \times H}$, where $C'$ is equal to $C/8$. (We use the superscript $p$ to indicate that the feature map belongs to the position attention module. Similarly, we employ the superscript $c$ for the channel attention module features.) Then, $F^p_0$ is reshaped to a feature map of shape $(W \times H) \times C'$. In the second branch, the input feature map $F$ follows the same operations and is then transposed, resulting in $F^p_1 \in \mathbb{R}^{C' \times (W \times H)}$. Both maps are multiplied and a softmax is applied on the resulting matrix to generate the spatial attention map $S^p \in \mathbb{R}^{(W \times H) \times (W \times H)}$:

$$s^p_{i,j} = \frac{\exp\left(F^p_{0,i} \cdot F^p_{1,j}\right)}{\sum_{i=1}^{W \times H} \exp\left(F^p_{0,i} \cdot F^p_{1,j}\right)} \tag{2}$$

where $s^p_{i,j}$ evaluates the impact of the $i$-th position on the $j$-th position. The input $F$ is fed into a different convolutional block in the third branch, resulting in $F^p_2 \in \mathbb{R}^{C \times W \times H}$, which has the same shape as $F$. As in the other branches, $F^p_2$ is reshaped, becoming $F^p_2 \in \mathbb{R}^{C \times (W \times H)}$. Then it is multiplied by a permuted version of the spatial attention map $S^p$, and the output is reshaped to $\mathbb{R}^{C \times W \times H}$. The attention feature map corresponding to the position attention module, i.e., $F_{PAM}$, can therefore be formulated as follows:

$$F_{PAM,j} = \lambda_p \sum_{i=1}^{W \times H} s^p_{i,j} F^p_{2,i} + F_j \tag{3}$$

As in [19], the value of $\lambda_p$ is initialized to 0 and it is gradually learned to give more importance to the spatial attention map. Thus, the position attention module selectively aggregates global context to the learned features, guided by the spatial attention map.
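
The position attention computation can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the three convolutional branches are left out, so `query`, `key` and `value` are hypothetical names for their already-flattened outputs, and `lam` stands for the learnable scalar that weights the attention term.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def position_attention(query, key, value, feats, lam):
    """Position attention on flattened feature maps.

    query, key : C' x N nested lists (assumed outputs of the first two branches)
    value      : C  x N nested list  (assumed output of the third branch)
    feats      : C  x N nested list  (input of the module)
    lam        : scalar weight of the attention term (0 at initialisation)
    """
    n = len(feats[0])
    attn = []
    for j in range(n):
        # Affinity of every source position i with target position j.
        energy = [sum(q[i] * k[j] for q, k in zip(query, key)) for i in range(n)]
        attn.append(softmax(energy))  # normalise over source positions i
    # Weighted sum of value features plus the residual input.
    return [
        [lam * sum(attn[j][i] * v_row[i] for i in range(n)) + f_row[j]
         for j in range(n)]
        for v_row, f_row in zip(value, feats)
    ]
```

With `lam = 0` the function returns the input unchanged, which mirrors the initialisation described above: the network starts from the plain features and learns how much global context to mix in.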

Fig. 2: Details of the position and channel attention modules inspired by [19].
Channel attention module (CAM)

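The pipeline of the channel attention module is depicted at the bottom of Figure 2. The input $F \in \mathbb{R}^{C \times W \times H}$ is reshaped in the first two branches of the CAM, and permuted in the second branch, leading to $F^c_0 \in \mathbb{R}^{(W \times H) \times C}$ and $F^c_1 \in \mathbb{R}^{C \times (W \times H)}$, respectively. Then, we perform a matrix multiplication between $F^c_0$ and $F^c_1$, and obtain the channel attention map $S^c \in \mathbb{R}^{C \times C}$ as:

$$s^c_{i,j} = \frac{\exp\left(F^c_{0,i} \cdot F^c_{1,j}\right)}{\sum_{i=1}^{C} \exp\left(F^c_{0,i} \cdot F^c_{1,j}\right)} \tag{4}$$

where the impact of the $i$-th channel on the $j$-th is given by $s^c_{i,j}$. This is then multiplied by a transposed version of the input $F$, i.e., $F^c_2$, and the result is reshaped to $\mathbb{R}^{C \times W \times H}$. Similarly to the PAM, the final channel attention map is obtained as:

$$F_{CAM,j} = \lambda_c \sum_{i=1}^{C} s^c_{i,j} F^c_{2,i} + F_j \tag{5}$$

where $\lambda_c$ controls the importance of the channel attention map over the input feature map $F$. Similarly to $\lambda_p$, $\lambda_c$ is initially set to 0 and gradually learned. This formulation aggregates weighted versions of the features of all the channels into the original features, highlighting class-dependent feature maps and increasing feature discriminability between classes.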
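
The channel attention step can be illustrated in the same spirit. The sketch below is an assumption-laden toy version in plain Python, not the authors' code: the feature map is given as a `C x N` nested list with the spatial dimensions already flattened, channel affinities are plain dot products followed by a softmax, and `lam` is the learnable scalar weight.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(feats, lam):
    """Channel attention on a C x N nested list of flattened feature maps."""
    c = len(feats)
    n = len(feats[0])
    attn = []
    for j in range(c):
        # Dot-product similarity between channel j and every channel i.
        energy = [sum(a * b for a, b in zip(feats[i], feats[j])) for i in range(c)]
        attn.append(softmax(energy))  # normalise over source channels i
    # Each output channel is a weighted mix of all channels plus the residual.
    return [
        [lam * sum(attn[j][i] * feats[i][p] for i in range(c)) + feats[j][p]
         for p in range(n)]
        for j in range(c)
    ]
```

Unlike the position branch, no convolution precedes the affinity computation here, so the attention map has size `C x C` regardless of the image resolution.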

At the end of both attention modules, the newly generated features are fed into a convolutional layer before performing an element-wise sum operation to generate the position-channel attention features.

III-D Guiding attention

Given the feature map $F$ at the input of the guided attention module at scale $s$ (generated by concatenating the upsampled features $F'_s$ and the multi-scale feature map $F_{MS}$), the module generates attention features $A_s$ via a multi-step refinement. In the first step, $F$ is used by the position and channel attention modules to generate self-attention features. In parallel, we integrate an encoder-decoder network that compresses the input features into a compact representation in the latent space. The objective is that the class information can be embedded in the second position-channel attention module by forcing the semantic representations of both encoder-decoders to be close, which is formulated as:

$$\mathcal{L}_G = \left\| E_1(F) - E_2(F_A) \right\|_2^2 \tag{6}$$

where $E_1(F)$ and $E_2(F_A)$ are the encoded representations of the first and second encoder-decoder networks, respectively, and $F_A$ are the attention features generated after the first dual attention module. Specifically, the feature maps reconstructed in the first encoder-decoder ($\hat{F}$) are combined with the self-attention features generated by the first attention module through a matrix-multiplication operation to generate $F_A$. In addition, to ensure that the reconstructed features correspond to the features at the input of the position-channel attention modules, the outputs of the encoder-decoders are forced to be close to their inputs:

$$\mathcal{L}_{Rec} = \mathcal{L}_{Rec_1} + \mathcal{L}_{Rec_2} = \left\| F - \hat{F} \right\|_2^2 + \left\| F_A - \hat{F}_A \right\|_2^2 \tag{7}$$

where $\hat{F}$ and $\hat{F}_A$ are the reconstructed feature maps, i.e., $D_1(E_1(F))$ and $D_2(E_2(F_A))$, of the first and second encoder-decoder networks, with $D_1$ and $D_2$ denoting the corresponding decoders.

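Fig. 3: An illustration of the semantic guided attention module for a given scale $s$.

As the guided attention module is applied at multiple scales, the combined guided loss for all the modules will be:

$$\mathcal{L}_{G_{Total}} = \sum_{s} \mathcal{L}^{s}_{G} \tag{8}$$

where $\mathcal{L}^{s}_{G}$ is the guided loss of the module at scale $s$. Similarly, the total reconstruction loss becomes:

$$\mathcal{L}_{Rec_{Total}} = \sum_{s} \left( \mathcal{L}^{s}_{Rec_1} + \mathcal{L}^{s}_{Rec_2} \right) \tag{9}$$

where $\mathcal{L}^{s}_{Rec_1}$ and $\mathcal{L}^{s}_{Rec_2}$ are the reconstruction losses for the encoder-decoder architectures in the first and second block of the guided attention module at scale $s$.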
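
As a rough illustration of how the guided and reconstruction terms fit together, the following sketch computes both for one module and then sums them over scales. It assumes a squared Euclidean distance as the closeness measure and works on flattened vectors; all function and argument names are hypothetical and chosen only for this example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def guided_module_losses(f, f_hat, code_1, f_a, f_a_hat, code_2):
    """Guided and reconstruction losses for one guided attention module.

    f, f_a         : inputs of the first and second encoder-decoder
    f_hat, f_a_hat : their reconstructions
    code_1, code_2 : their latent codes
    """
    guided = squared_l2(code_1, code_2)  # pull the two latent codes together
    recon = squared_l2(f, f_hat) + squared_l2(f_a, f_a_hat)  # both blocks
    return guided, recon


def total_over_scales(per_scale):
    """Sum (guided, recon) pairs, one pair per scale."""
    guided_total = sum(pair[0] for pair in per_scale)
    recon_total = sum(pair[1] for pair in per_scale)
    return guided_total, recon_total
```

Keeping the two totals separate until the end makes it straightforward to weight them differently in the final objective.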

III-E Deep supervision

While the attention modules do not require auxiliary objective functions, we found that the use of extra supervision at each scale [51] encouraged the intermediate feature maps to be semantically discriminative at each image scale, which is in line with similar works in the literature [17, 23, 25]. The total segmentation loss is therefore:

$$\mathcal{L}_{Seg_{Total}} = \sum_{s} \mathcal{L}^{s}_{Seg_F} + \sum_{s} \mathcal{L}^{s}_{Seg_A} \tag{10}$$

where the first term refers to the segmentation results at the raw features and the second term evaluates the segmentation result provided by the attention features. In all cases, the multi-class cross-entropy between the network prediction and the ground truth labels is employed as segmentation loss. Taking into account all the losses, the final objective function to optimize becomes:

$$\mathcal{L}_{Total} = \alpha \mathcal{L}_{Seg_{Total}} + \beta \mathcal{L}_{G_{Total}} + \gamma \mathcal{L}_{Rec_{Total}} \tag{11}$$

where $\alpha$, $\beta$ and $\gamma$ control the importance of each term in the main loss function.
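
The weighting of the three loss families can be written down as a small helper. This is only a sketch: the default weights 1, 0.25 and 0.1 are the values reported in the training details, their assignment to the segmentation, guided and reconstruction terms (in that order) is an assumption, and the per-scale loss values are taken as plain floats.

```python
def total_loss(seg_raw, seg_att, guided, recon, alpha=1.0, beta=0.25, gamma=0.1):
    """Weighted sum of the per-scale losses.

    seg_raw : segmentation losses on the raw features, one value per scale
    seg_att : segmentation losses on the attention features, one per scale
    guided  : guided losses, one per scale
    recon   : reconstruction losses, one per scale
    """
    seg_total = sum(seg_raw) + sum(seg_att)  # deep supervision at every scale
    return alpha * seg_total + beta * sum(guided) + gamma * sum(recon)
```

For example, `total_loss([1.0, 1.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0])` combines a segmentation total of 4 with guided and reconstruction totals of 4 and 10.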

IV Experiments

IV-A Experimental setting

In this section we present the common setting for all the experiments, including the dataset, network architectures, training details and evaluation metrics.

IV-A1 Dataset

The abdominal MRI dataset from the Combined Healthy Abdominal Organ Segmentation (CHAOS) Challenge [52, 53, 54] is employed to evaluate our method. In particular, among the five tasks we focus on the segmentation of abdominal organs on MRI (T1-DUAL in phase). This dataset includes scans from 20 subjects for training, with their corresponding ground truth annotations, and 20 for testing without annotations. Scans were acquired by a 1.5T Philips MRI, producing 12-bit DICOM images with a resolution of 256×256 pixels per slice and between 26 and 50 slices. Since testing labels are not provided within the dataset, we employed the training dataset for our experiments. Specifically, we split the dataset into subsets of 13, 2 and 5 subjects that were used for training, validation and testing. We repeated the process 3 times, selecting different subjects for validation and testing, and report the average results over the three folds. To increase the variability of the data, we randomly rotated, flipped and mirrored the images, but without augmenting the dataset size.
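
A 13/2/5 split repeated three times can be generated as in the sketch below. How subjects were actually assigned to folds is not stated, so the scheme here is an assumption: subjects are shuffled once, the three test sets are disjoint blocks of five, and the validation pair is rotated across folds. The function name, the seed and the subject identifiers are hypothetical.

```python
import random


def make_folds(subjects, n_folds=3, n_val=2, n_test=5, seed=0):
    """Build train/val/test splits with disjoint test sets across folds."""
    order = list(subjects)
    random.Random(seed).shuffle(order)  # fixed seed for reproducibility
    folds = []
    for k in range(n_folds):
        test = order[k * n_test:(k + 1) * n_test]
        remaining = [s for s in order if s not in test]
        # Rotate the starting point so each fold gets a different validation pair.
        start = (k * n_val) % len(remaining)
        val = [remaining[(start + i) % len(remaining)] for i in range(n_val)]
        train = [s for s in remaining if s not in val]
        folds.append({"train": train, "val": val, "test": test})
    return folds
```

With 20 subjects this yields 13 training, 2 validation and 5 test subjects per fold, and 15 distinct test subjects overall.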

IV-A2 Network architectures

The multi-scale strategy in the proposed network is based on the recent work in [23], which uses ResNet101 [55] as backbone architecture. Therefore, this architecture is considered as the lower baseline in our experiments. In the first part of the experiments, we perform an ablation study on the different proposed modules to evaluate the impact of each choice on the segmentation performance. The first two networks, i.e., Proposed (PAM) and Proposed (CAM), extend the baseline by replacing its attention module with either the spatial or the channel self-attention module (Fig. 2), respectively. Then, both modules are combined, leading to the Proposed (DualNet) model. In the next model, i.e., Proposed (MS-DualNet), the attention features generated by the dual attention module are refined in a multi-step process, where a second dual attention module is included. Last, the proposed architecture, referred to as Proposed (MS-DualNet-Guided), extends the Proposed (MS-DualNet) model by incorporating the semantic guidance (Fig. 3). Furthermore, we compared the performance of the proposed network to other state-of-the-art architectures, most of them integrating attention: UNet [5], Attention UNet [25], DANet [19] and Pyramidal Attention Network (PAN) [20].

IV-A3 Training and implementation details

We train all the networks using the Adam optimizer with a mini-batch size of 8, and with $\beta_1$ and $\beta_2$ set to 0.9 and 0.99, respectively. While most of the networks converged during the first 250 epochs, we found that PAN [20] and DANet [19] needed around 400 epochs to achieve the best results. The learning rate is initially set to 0.001 and multiplied by 0.5 after 50 epochs without improvement on the validation set. As a segmentation objective function, we employ the cross-entropy error at each pixel over all the categories for all the networks. Furthermore, as introduced in Section III, we use the objective function in eq. (11) in the proposed architecture, with $\alpha$, $\beta$ and $\gamma$ set empirically to 1, 0.25 and 0.1, respectively. As input of the networks we employed 2D axial images of size 256 × 256. Experiments were performed on a server equipped with a Titan V. The code of our model, as well as the trained model, are made publicly available at .
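
The learning-rate rule described above (start at 0.001, multiply by 0.5 after 50 epochs without validation improvement) can be sketched as a small framework-independent class. The class name is hypothetical, and the choice of a higher-is-better validation score (for example the mean DSC) and of resetting the counter after each decay are assumptions of this example.

```python
class HalveOnPlateau:
    """Multiply the learning rate by `factor` after `patience` stale epochs."""

    def __init__(self, lr=0.001, factor=0.5, patience=50):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = None  # best validation score seen so far
        self.stale = 0    # epochs since the last improvement

    def step(self, val_score):
        """Update with this epoch's validation score (higher is better)."""
        if self.best is None or val_score > self.best:
            self.best = val_score
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0  # start counting again after a decay
        return self.lr
```

Deep learning frameworks ship comparable schedulers (PyTorch's `ReduceLROnPlateau`, for instance, exposes `factor` and `patience` arguments), so in practice a hand-written class like this is rarely needed.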

IV-A4 Evaluation

Similarity between ground truth and CNN segmentations is assessed by employing several comparison metrics. First, we resort to the widely used Dice similarity coefficient (DSC) to compare volumes based on their overlap. Given two volumes $A$ and $B$, their DSC can be defined as:

$$DSC(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \tag{12}$$

Further, we also assess the segmentation performance based on the volume similarity, which is formulated as:

$$VS(A, B) = 1 - \frac{\big|\,|A| - |B|\,\big|}{|A| + |B|} \tag{13}$$

Dice similarity coefficient (DSC)
| Method | Liver | Kidney R | Kidney L | Spleen | Mean |
|---|---|---|---|---|---|
| Baseline (DAF [23]) | 91.66 (2.99) | 79.28 (18.68) | 83.63 (7.56) | 75.35 (20.41) | 82.48 (6.06) |
| Proposed (PAM) | 91.89 (4.29) | 85.47 (7.04) | 86.84 (6.53) | 73.65 (22.62) | 84.46 (6.68) |
| Proposed (CAM) | 92.58 (2.65) | 84.52 (9.34) | 86.38 (6.27) | 76.84 (20.56) | 85.08 (5.62) |
| Proposed (DualNet) | 92.60 (3.20) | 85.29 (7.96) | 87.74 (6.37) | 76.44 (22.17) | 85.52 (5.86) |
| Proposed (MS-Dual) | 92.62 (3.08) | 86.29 (5.98) | 88.82 (4.84) | 76.96 (19.87) | 86.17 (5.78) |
| Proposed (MS-Dual-Guided) | 92.46 (2.82) | 87.96 (6.46) | 88.01 (6.16) | 78.61 (18.69) | 86.75 (5.05) |

Volume similarity (VS)
| Method | Liver | Kidney R | Kidney L | Spleen | Mean |
|---|---|---|---|---|---|
| Baseline (DAF [23]) | 96.69 (3.21) | 86.75 (16.41) | 90.29 (8.39) | 84.98 (14.42) | 89.68 (4.48) |
| Proposed (PAM) | 96.62 (4.62) | 92.83 (7.43) | 93.96 (6.46) | 83.93 (20.54) | 91.84 (4.77) |
| Proposed (CAM) | 97.25 (2.95) | 93.78 (6.04) | 93.98 (5.48) | 83.72 (20.97) | 92.18 (5.07) |
| Proposed (DualNet) | 97.04 (3.03) | 94.50 (5.96) | 93.43 (7.03) | 83.30 (22.53) | 92.07 (5.23) |
| Proposed (MS-Dual) | 97.47 (3.07) | 93.30 (4.11) | 95.27 (4.89) | 84.90 (16.86) | 92.74 (4.76) |
| Proposed (MS-Dual-Guided) | 96.44 (3.15) | 96.14 (3.15) | 94.95 (4.48) | 87.87 (15.23) | 93.85 (3.50) |

Mean surface distance (MSD)
| Method | Liver | Kidney R | Kidney L | Spleen | Mean |
|---|---|---|---|---|---|
| Baseline (DAF [23]) | 0.64 (0.29) | 0.97 (1.08) | 0.63 (0.25) | 1.45 (2.04) | 0.92 (0.33) |
| Proposed (PAM) | 0.55 (0.19) | 0.56 (0.23) | 0.55 (0.21) | 1.54 (2.40) | 0.80 (0.43) |
| Proposed (CAM) | 0.58 (0.22) | 0.57 (0.24) | 0.52 (0.20) | 1.29 (1.64) | 0.74 (0.32) |
| Proposed (DualNet) | 0.54 (0.19) | 0.56 (0.19) | 0.50 (0.18) | 1.49 (2.29) | 0.77 (0.41) |
| Proposed (MS-Dual) | 0.53 (0.18) | 0.51 (0.14) | 0.46 (0.14) | 1.19 (1.42) | 0.67 (0.30) |
| Proposed (MS-Dual-Guided) | 0.54 (0.16) | 0.48 (0.18) | 0.48 (0.14) | 1.13 (1.24) | 0.66 (0.27) |

TABLE I: Ablation study on the different proposed attention modules on the CHAOS dataset (multi-organ segmentation on MRI task). The values show the results averaged over the 3 folds. Best results are represented in red bold, while blue is used to highlight the second best performance.

However, volume-based metrics generally lack sensitivity to the segmentation outline, and segmentations showing a high degree of spatial overlap might present clinically relevant differences between their contours. Thus, distance-based metrics, such as the mean surface distance (MSD), were also considered in our evaluation. The MSD between contours $A$ and $B$ is defined as follows:

$$MSD(A, B) = \frac{1}{|A| + |B|} \left( \sum_{a \in A} d(a, B) + \sum_{b \in B} d(b, A) \right) \tag{14}$$

where $d(a, B)$ is the distance between a point $a$ on the surface $A$ and the surface $B$, which is given by the minimum of the Euclidean norm:

$$d(a, B) = \min_{b \in B} \left\| a - b \right\|_2 \tag{15}$$

Since inter-slice distances and x-y spacing for each individual scan are not provided, we report these results in voxels.
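
The three evaluation metrics can be sketched in plain Python as follows. This is an illustrative version under stated assumptions rather than the evaluation code used in the experiments: volumes are sets of voxel coordinates, surfaces are lists of points in voxel units, the volume similarity uses the common definition based on the absolute volume difference, and the surface distance is the symmetric average of nearest-point distances. Returning 1.0 for two empty masks is a convention chosen for this example.

```python
import math


def dice(a, b):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    if not a and not b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))


def volume_similarity(a, b):
    """Volume similarity, assumed as 1 - ||A| - |B|| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 1.0 - abs(len(a) - len(b)) / (len(a) + len(b))


def point_to_surface(point, surface):
    """Smallest Euclidean distance from one point to any point of a surface."""
    return min(math.dist(point, other) for other in surface)


def mean_surface_distance(surf_a, surf_b):
    """Symmetric mean of nearest-point distances between two surfaces."""
    total = sum(point_to_surface(p, surf_b) for p in surf_a)
    total += sum(point_to_surface(q, surf_a) for q in surf_b)
    return total / (len(surf_a) + len(surf_b))
```

The brute-force nearest-point search is quadratic in the number of surface points; for full-resolution volumes a distance transform or a spatial index would be the usual replacement.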

IV-B Results

IV-B1 Ablation study on the proposed attention modules

To validate the individual contribution of the different components to the segmentation performance, we perform an ablation experiment under different settings. Table I reports the results of the different attention modules. Compared to the baseline, we observe that by integrating either a spatial (PAM) or a channel (CAM) attention module at each scale in the baseline architecture, the performance improves by 2-3 percentage points in terms of overlap and volume similarity, and by 0.12-0.18 voxels in terms of surface distance, on average. On the other hand, having both modules in parallel, i.e., Proposed (DualNet), brings slightly better results in terms of DSC, but achieves lower performance when employing the surface distance metric. However, despite the lower average performance on the MSD, the proposed DualNet model still achieves better results in 3 out of 4 structures compared to the channel attention model. This trend is repeated on the DSC metric, where DualNet surpasses the proposed CAM architecture in the same 3 structures: liver and both left and right kidneys. This suggests that, even though both spatial and channel attention bring an improvement in performance, the channel attention module contributes more than the spatial attention when they are combined. If features generated by the proposed DualNet model are refined in a second step, in the network referred to as Proposed (MS-Dual), the average results are further improved by nearly 0.7 percentage points and 0.10 voxels in volume and distance-based metrics, respectively. Last, the introduction of the semantic-guided loss, i.e., Proposed (MS-Dual-Guided), results in an additional boost in performance, yielding the best values in the three metrics: 86.75% (DSC), 93.85% (VS) and 0.66 voxels (MSD). These results represent an improvement of 4.3 and 4.2 percentage points in DSC and VS, and of 0.26 voxels in MSD, compared to the baseline in [23], showing the efficiency of the proposed attention network compared to individual attention components.
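
The gains of the full model over the baseline can be recomputed from the mean values in Table I. The short sketch below reports both the absolute difference and the relative change, since the two are easy to confuse when a metric is itself expressed in percent; the helper name is hypothetical.

```python
# Mean values taken from Table I.
baseline = {"DSC": 82.48, "VS": 89.68, "MSD": 0.92}
proposed = {"DSC": 86.75, "VS": 93.85, "MSD": 0.66}


def gains(base, new, lower_is_better=("MSD",)):
    """Return {metric: (absolute gain, relative gain in %)}."""
    result = {}
    for metric, base_value in base.items():
        if metric in lower_is_better:
            diff = base_value - new[metric]  # a smaller distance is better
        else:
            diff = new[metric] - base_value
        result[metric] = (round(diff, 2), round(100.0 * diff / base_value, 1))
    return result


for metric, (absolute, relative) in gains(baseline, proposed).items():
    print(f"{metric}: absolute gain {absolute}, relative gain {relative}%")
```

For DSC and VS the absolute gain is in percentage points, while for MSD it is in voxels.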

Dice similarity coefficient (DSC)
| Method | Liver | Kidney R | Kidney L | Spleen | Mean |
|---|---|---|---|---|---|
| UNet [5] | 90.94 (4.01) | 79.14 (15.23) | 82.51 (7.48) | 71.95 (21.61) | 81.14 (7.88) |
| DANet [19] | 91.69 (4.07) | 83.85 (9.40) | 84.49 (8.60) | 75.54 (16.08) | 83.89 (9.54) |
| PAN (ResNet34) [20] | 91.99 (2.98) | 81.51 (9.03) | 83.62 (6.21) | 73.70 (19.97) | 82.70 (6.51) |
| PAN (ResNet101) [20] | 92.13 (3.51) | 85.02 (5.16) | 85.36 (4.87) | 74.84 (21.23) | 84.34 (6.17) |
| DAF [23] | 91.66 (2.99) | 79.28 (18.68) | 83.63 (7.56) | 75.35 (20.41) | 82.48 (6.06) |
| UNet Attention [25] | 92.02 (1.93) | 84.33 (5.91) | 85.57 (4.09) | 77.18 (15.95) | 84.77 (5.27) |
| Proposed (MS-Dual-Guided) | 92.46 (2.82) | 87.96 (6.46) | 88.01 (6.16) | 78.61 (18.69) | 86.75 (5.05) |

Volume similarity (VS)
| Method | Liver | Kidney R | Kidney L | Spleen | Mean |
|---|---|---|---|---|---|
| UNet [5] | 95.54 (4.43) | 87.68 (5.77) | 89.55 (4.68) | 83.28 (14.78) | 89.01 (4.82) |
| DANet [19] | 96.90 (4.18) | 92.88 (5.12) | 91.52 (6.73) | 84.37 (16.15) | 91.42 (4.52) |
| PAN (ResNet34) [20] | 96.56 (3.55) | 90.89 (5.64) | 91.83 (7.75) | 81.98 (20.67) | 90.32 (5.27) |
| PAN (ResNet101) [20] | 96.99 (3.64) | 93.77 (4.63) | 92.69 (6.88) | 84.24 (17.37) | 91.93 (4.71) |
| DAF [23] | 96.69 (3.21) | 86.75 (16.41) | 90.29 (8.39) | 84.98 (14.42) | 89.68 (4.48) |
| UNet Attention [25] | 96.95 (1.89) | 92.29 (6.41) | 91.79 (3.53) | 85.94 (11.88) | 91.74 (3.91) |
| Proposed (MS-Dual-Guided) | 96.44 (3.15) | 96.14 (3.15) | 94.95 (4.48) | 87.87 (15.23) | 93.85 (3.50) |

Mean surface distance (MSD)
| Method | Liver | Kidney R | Kidney L | Spleen | Mean |
|---|---|---|---|---|---|
| UNet [5] | 0.59 (0.18) | 0.69 (0.38) | 0.61 (0.19) | 1.76 (2.57) | 0.91 (0.49) |
| DANet [19] | 0.61 (0.27) | 0.65 (0.31) | 0.67 (0.30) | 1.17 (0.94) | 0.78 (0.23) |
| PAN (ResNet34) [20] | 0.62 (0.25) | 0.75 (0.31) | 0.69 (0.21) | 1.37 (1.43) | 0.86 (0.29) |
| PAN (ResNet101) [20] | 0.57 (0.22) | 0.61 (0.19) | 0.64 (0.15) | 1.30 (1.47) | 0.78 (0.31) |
| DAF [23] | 0.64 (0.29) | 0.97 (1.08) | 0.63 (0.25) | 1.45 (2.04) | 0.92 (0.33) |
| UNet Attention [25] | 0.57 (0.25) | 0.61 (0.23) | 0.56 (0.18) | 1.15 (1.01) | 0.72 (0.24) |
| Proposed (MS-Dual-Guided) | 0.54 (0.16) | 0.48 (0.18) | 0.48 (0.14) | 1.13 (1.24) | 0.66 (0.27) |

TABLE II: Comparison of the proposed network to other state-of-the-art architectures on the CHAOS dataset. The values show the average result of the experiments on the 3 folds.

Iv-B2 Comparison to state-of-the-art

The experimental results obtained by several state-of-the-art segmentation networks are reported in Table II. Compared to other networks that were proposed in the context of medical image segmentation –i.e., UNet [5], Attention UNet [25] and DAF [23]– our network achieves a mean improvement of 5.6%, 4.3% and 2.0% (in terms of DSC), 4.9%, 4.2% and 2.1% (on VS) and 25%, 26% and 6% (on MSD), respectively. This difference in performance could be explained by the fact that the attention modules integrated in [23] and [25] are much simpler than those proposed in our architecture. On the other hand, attention modules on general computer vision tasks have attracted more attention, resulting in more elaborated strategies which typically achieve better segmentation results. Among these architectures, the PAN model [20] with ResNet101 as backbone –the same as ours– achieved 84.34%, 91.93% and 0.78 voxels, as average, for DSC, VS and MSD, respectively, which represent the best results for segmentation networks proposed for natural scenes. Despite these competitive results, the proposed model still outperforms the PAN architecture by 2.4%, 1.9% and 12% in DSC, VS and MSD. As PAN [20] also employed a multi-scale architecture, these differences suggest that the use of dual self-attention and a guided refinement module can actually improve the modelling of contextual dependencies, resulting in an increased segmentation performance.

In addition to the values reported in Tables I and II, we also depict the distribution of DSC, VS and MSD values on the 15 subjects used for evaluation for all the models (Fig. 4). In these plots, we can first observe the impact of the different attention modules on the segmentation performance of the proposed model. As we progressively include the proposed attention modules in the baseline network, the segmentation performance improves, which is reflected in a better distribution of segmentation accuracy values with a smaller variance. This difference in the distribution of results is more prominent when comparing the proposed network with other state-of-the-art networks, which are represented by bluish box plots. We can also observe that this pattern is consistent across organs and metrics, suggesting that the proposed attention network achieves better and more robust segmentation results than current state-of-the-art architectures.
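The "mean (std)" entries in the tables and the spread shown in the box plots are both summaries of per-subject scores. A minimal sketch of that summary is given below; the per-subject values are hypothetical placeholders, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-subject metric values."""
    if len(scores) < 2:
        raise ValueError("need at least two subjects to estimate the spread")
    return statistics.mean(scores), statistics.stdev(scores)


def format_mean_std(scores):
    """Format a summary in the 'mean (std)' style used in the tables."""
    mean, std = summarize(scores)
    return f"{mean:.2f} ({std:.2f})"


# Hypothetical per-subject MSD values (in voxels) for two models.
model_a = [0.8, 0.9, 1.0]
model_b = [0.5, 0.9, 1.3]
print(format_mean_std(model_a))  # 0.90 (0.10)
print(format_mean_std(model_b))  # 0.90 (0.40)
```

Both hypothetical models have the same mean, but the second has a much larger spread, which is the kind of difference in robustness that the box plots are meant to expose.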

(a) Dice Similarity coefficient (%)
(b) Volume similarity (%)
(c) Average surface distance (voxels)
Fig. 4: These plots depict the distributions of the different evaluation metrics for the four organs segmented. Bluish colors represent the results obtained by other state-of-the-art networks, whereas the results obtained by our proposed models are displayed with brownish box plots.

IV-B3 Convergence

We have also compared the different architectures in terms of convergence; the results are depicted in Fig. 5. In particular, the mean DSC value over the four structures on one of the validation folds is shown for each of the networks. It can be observed that, even though most of the networks achieve results which may be considered similar to some extent, their convergence behaviour is very different. While three networks (UNet, DANet and DAF) have similar convergence curves, PAN needs more iterations to converge, ultimately performing better than these networks after nearly 400 epochs. On the other hand, we found that Attention UNet and the proposed network presented the fastest convergence, achieving their best results at epochs 48 and 73, respectively.
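The epoch at which a network reaches its best result can be read directly from its validation curve. The sketch below illustrates this under the assumption that the mean validation DSC is logged once per epoch; the curve values and the function name are hypothetical.

```python
def best_epoch(val_dsc):
    """Return (1-based epoch, value) of the highest validation DSC.

    On ties, the earliest epoch is returned.
    """
    if not val_dsc:
        raise ValueError("validation curve is empty")
    best_idx = max(range(len(val_dsc)), key=lambda i: val_dsc[i])
    return best_idx + 1, val_dsc[best_idx]


# Hypothetical validation curve (mean DSC over the four organs, one value per epoch).
curve = [0.50, 0.70, 0.90, 0.85, 0.90]
print(best_epoch(curve))  # (3, 0.9)
```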

Fig. 5: Evolution of the mean validation DSC over time.

IV-B4 Qualitative evaluation

Fig. 6: Results on several subjects on the CHAOS Challenge dataset. The proposed multi-scale guided attention network achieves qualitatively better results than other state-of-the-art networks that also integrate attention modules.

To visualize the impact of the different attention modules, Fig. 6 displays the segmentation results of the different networks on several subjects. Although the quantitative results reported in Table II show that several architectures have similar performance, the qualitative results reveal interesting findings. The first thing that we can observe is that UNet, which is the only network not integrating attention, typically over-segments certain organs and gets confused easily. For example, in the first and third rows it fails to properly segment the liver (in green) and the spleen (in blue), respectively, including many regions that do not belong to the target. In the third row in particular, it confuses the small bowel with the spleen, even though the spleen is not present in that slice. Integrating attention can overcome the limitations shown by UNet and improve the segmentation performance by focusing on relevant areas. This can be observed in the results obtained by the other networks, which, to some extent, reduce the false positives in the prediction. In particular, the PAN model (with ResNet101 as backbone) seems to avoid misclassifications in these ambiguous regions. Nevertheless, it produces smoother segmentations, which results in a loss of fine-grained details. This effect can be observed, for example, in the liver segmentation results (in green) in the first two rows. An interesting result is the segmentation shown in the last row. In this particular case, all the models except the proposed network fail to segment the left kidney correctly. While the DANet and PAN models confuse the left kidney with the right one, DAF is not able to detect any relevant region related to the kidneys in that area. In addition, both UNet and UNet with attention generate segmentations of the left kidney that contain three organs, i.e., left and right kidneys and spleen, which is not anatomically plausible. Unlike all these models, the proposed architecture does not get distracted by ambiguous regions, and some structures misclassified by the other models are correctly classified.

These visual results indicate that our approach can successfully recover finer segmentation details, while avoiding getting distracted in ambiguous regions. The selective integration of information across spatial positions and across channel maps, followed by a guided attention module, helps to capture context information. This suggests that the proposed multi-scale guided attention model can efficiently encode complementary information to accurately segment medical images.

V Conclusion

In this work, we introduced a novel attention architecture for the task of medical image segmentation. This model incorporates a multi-scale strategy to combine semantic information at different levels and self-attention modules to progressively aggregate relevant contextual features. Finally, a guided refinement module filters noisy regions and helps the network to focus on relevant class-specific regions in the image. To validate our approach we conducted experiments on MRI scans (T1-DUAL) from the Combined Healthy Abdominal Organ Segmentation (CHAOS) Challenge. We provided extensive experiments to evaluate the impact of the individual components of the proposed architecture. In addition, we compared our model to existing approaches that integrate attention, which have been recently proposed for natural scene [19, 20] and medical image [5, 23, 25] segmentation. Experimental results showed that the proposed model outperformed all previous approaches both quantitatively and qualitatively, which may be explained by the enhanced ability to model rich contextual dependencies over local features. This demonstrates the efficiency of our approach to provide precise and reliable automatic segmentations of medical images.


We wish to thank NVIDIA for its kind donation of the Titan V GPU used in this work.


  • [1] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
  • [2] J. Dolz, K. Gopinath, J. Yuan, H. Lombaert, C. Desrosiers, and I. Ben Ayed, “HyperDense-Net: A hyper-densely connected CNN for multi-modal image segmentation,” IEEE transactions on medical imaging, 2018.
  • [3] O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P.-A. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester et al., “Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?” IEEE transactions on medical imaging, vol. 37, no. 11, pp. 2514–2525, 2018.
  • [4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  • [5] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [6] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1925–1934.
  • [7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018.
  • [8] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
  • [9] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
  • [10] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.
  • [11] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” arXiv preprint arXiv:1506.04579, 2015.
  • [12] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, “Multi-context attention for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1831–1840.
  • [13] A. Gupta, D. Agrawal, H. Chauhan, J. Dolz, and M. Pedersoli, “An attention model for group-level emotion recognition,” in Proceedings of the 2018 on International Conference on Multimodal Interaction.   ACM, 2018, pp. 611–615.
  • [14] Z. Huang, Z. Zhong, L. Sun, and Q. Huo, “Mask R-CNN with pyramid attention network for scene text detection,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).   IEEE, 2019, pp. 764–772.
  • [15] S. Chen, X. Tan, B. Wang, and X. Hu, “Reverse attention for salient object detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 234–250.
  • [16] K. Li, Z. Wu, K.-C. Peng, J. Ernst, and Y. Fu, “Tell me where to look: Guided attention inference network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9215–9223.
  • [17] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3640–3649.
  • [18] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia, “PSANet: Point-wise spatial attention network for scene parsing,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 267–283.
  • [19] J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [20] H. Li, P. Xiong, J. An, and L. Wang, “Pyramid attention network for semantic segmentation,” in BMVC, 2018.
  • [21] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 325–341.
  • [22] P. Zhang, W. Liu, H. Wang, Y. Lei, and H. Lu, “Deep gated attention networks for large-scale street-level scene segmentation,” Pattern Recognition, vol. 88, pp. 702–714, 2019.
  • [23] Y. Wang, Z. Deng, X. Hu, L. Zhu, X. Yang, X. Xu, P.-A. Heng, and D. Ni, “Deep attentional features for prostate segmentation in ultrasound,” in MICCAI, 2018.
  • [24] C. Li, Q. Tong, X. Liao, W. Si, Y. Sun, Q. Wang, and P.-A. Heng, “Attention based hierarchical aggregation network for 3D left atrial segmentation,” in International Workshop on Statistical Atlases and Computational Models of the Heart.   Springer, 2018, pp. 255–264.
  • [25] J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert, “Attention gated networks: Learning to leverage salient regions in medical images,” Medical image analysis, vol. 53, pp. 197–207, 2019.
  • [26] D. Nie, Y. Gao, L. Wang, and D. Shen, “ASDNet: Attention based semi-supervised deep networks for medical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2018, pp. 370–378.
  • [27] T. Heimann and H.-P. Meinzer, “Statistical shape models for 3D medical image segmentation: a review,” Medical image analysis, vol. 13, no. 4, pp. 543–563, 2009.
  • [28] J. Dolz, L. Massoptier, and M. Vermandel, “Segmentation algorithms of subcortical brain structures on MRI for radiotherapy and radiosurgery: a survey,” IRBM, vol. 36, no. 4, pp. 200–212, 2015.
  • [29] P. F. Christ, M. E. A. Elshaer, F. Ettlinger, S. Tatavarty, M. Bickel, P. Bilic, M. Rempfler, M. Armbruster, F. Hofmann, M. D’Anastasi et al., “Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3d conditional random fields,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016, pp. 415–423.
  • [30] T. Fechter, S. Adebahr, D. Baltas, I. Ben Ayed, C. Desrosiers, and J. Dolz, “Esophagus segmentation in CT via 3D fully convolutional neural network and random walk,” Medical physics, vol. 44, no. 12, pp. 6341–6352, 2017.
  • [31] X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, “H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes,” IEEE transactions on medical imaging, vol. 37, no. 12, pp. 2663–2674, 2018.
  • [32] Y. Man, Y. Huang, J. F. X. Li, and F. Wu, “Deep Q learning driven CT pancreas segmentation with geometry-aware U-Net,” IEEE transactions on medical imaging, 2019.
  • [33] J. Dolz, C. Desrosiers, and I. Ben Ayed, “3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study,” NeuroImage, vol. 170, pp. 456–470, 2018.
  • [34] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, “Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation,” Medical image analysis, vol. 36, pp. 61–78, 2017.
  • [35] A. Carass, J. L. Cuzzocreo, S. Han, C. R. Hernandez-Castillo, P. E. Rasser, M. Ganz, V. Beliveau et al., “Comparing fully automated state-of-the-art cerebellum parcellation from magnetic resonance images,” NeuroImage, 2018.
  • [36] J. Dolz, X. Xu, J. Rony, J. Yuan, Y. Liu, E. Granger, C. Desrosiers, X. Zhang, I. Ben Ayed, and H. Lu, “Multiregion segmentation of bladder cancer structures in MRI with progressive dilated convolutional networks,” Medical physics, vol. 45, no. 12, pp. 5482–5493, 2018.
  • [37] M. Pedersoli, T. Lucas, C. Schmid, and J. Verbeek, “Areas of attention for image captioning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1242–1250.
  • [38] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 21–29.
  • [39] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3156–3164.
  • [40] H. Li, Y. Liu, W. Ouyang, and X. Wang, “Zoom out-and-in network with map attention decision for region proposal and object detection,” International Journal of Computer Vision, vol. 127, no. 3, pp. 225–238, 2019.
  • [41] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit, “A decomposable attention model for natural language inference,” in EMNLP, 2016.
  • [42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [43] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” arXiv preprint arXiv:1805.08318, 2018.
  • [44] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
  • [45] Z. Ji, Y. Fu, J. Guo, Y. Pang, Z. M. Zhang et al., “Stacked semantics-guided attention model for fine-grained zero-shot learning,” in Advances in Neural Information Processing Systems, 2018, pp. 5995–6004.
  • [46] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters–improve semantic segmentation by global convolutional network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4353–4361.
  • [47] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 5, pp. 898–916, 2010.
  • [48] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 447–456.
  • [49] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, “Feedforward semantic segmentation with zoom-out features,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3376–3385.
  • [50] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, “SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5659–5667.
  • [51] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Artificial Intelligence and Statistics, 2015, pp. 562–570.
  • [52] M. A. Selver, “Exploring brushlet based 3D textures in transfer function specification for direct volume rendering of abdominal organs,” IEEE transactions on visualization and computer graphics, vol. 21, no. 2, pp. 174–187, 2014.
  • [53] E. Selvi, M. A. Selver, A. E. Kavur, C. Guzelis, and O. Dicle, “Segmentation of abdominal organs from MR images using multi-level hierarchical classification,” Journal of the Faculty of Engineering and Architecture of Gazi University, vol. 30, no. 3, pp. 533–546, 2015.
  • [54] M. A. Selver, “Segmentation of abdominal organs from CT using a multi-level, hierarchical neural network strategy,” Computer methods and programs in biomedicine, vol. 113, no. 3, pp. 830–852, 2014.
  • [55] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision.   Springer, 2016, pp. 630–645.