Hierarchical Attention Networks for Medical Image Segmentation

11/20/2019 · Fei Ding, et al.

Medical images are characterized by inter-class indistinction, high variability, and noise, which make pixel-wise recognition challenging. Unlike previous self-attention based methods that capture context information from a single level, we reformulate the self-attention mechanism from the view of the high-order graph and propose a novel method, namely the Hierarchical Attention Network (HANet), for medical image segmentation. Concretely, an HA module embedded in HANet captures context information from neighbors of multiple levels, where these neighbors are extracted from the high-order graph. In the high-order graph, there is an edge between two nodes only if the correlation between them is high enough, which naturally reduces the noisy attention information caused by inter-class indistinction. The proposed HA module is robust to the variance of the input and can be flexibly inserted into existing convolutional neural networks. We conduct experiments on three medical image segmentation tasks: optic disc/cup segmentation, blood vessel segmentation, and lung segmentation. Extensive results show that our method is more effective and robust than existing state-of-the-art methods.


1 Introduction

Semantic segmentation plays a critical role in various computer vision tasks such as image editing, automatic driving, and medical diagnosis. It has been handled well by powerful methods driven by Convolutional Neural Networks (CNNs). CNNs have achieved promising results in image recognition, where a key operation is global pooling, which makes the image-level feature representation contain global context information. Pixel features, however, only contain local context information, which makes the recognition of individual pixels challenging. Distinguishing confusing pixels is even harder in medical images than in other kinds of images because of inter-class indistinction, high variability, and noise.

Figure 1: Diagrams of the original self-attention and our two-level hierarchical attention. In the original self-attention (a), the center node (orange circle) is influenced by all of its neighbors, including nodes from a different class (triangles). In our hierarchical attention (b), we search the N-degree neighbors of the center node in the sparse graph (dotted lines are edges), where there is an edge between two nodes only if the correlation between them is high enough. In the left part of (b), the center node is influenced by its immediate neighbors; in the right part of (b), it is influenced by its neighbors' neighbors, and the information of the two levels is fused to enrich the context information.

Various pioneering fully convolutional network (FCN) approaches have taken context into account to improve the performance of semantic segmentation. The U-shaped networks [26, 3] achieve promising results on medical image segmentation by enabling the use of rich context information. Other methods exploit context information through dilated convolutions [6, 7, 42, 11] or multi-scale pooling [47, 20]. These methods have revealed that CNNs with large and multi-scale receptive fields are more likely to learn translation-invariant features. However, using the same feature-detector weights at all locations cannot satisfy the requirement that different pixels need different contextual dependencies. Therefore, many self-attention based approaches [35, 31] have been proposed, which focus on aggregating context information in an adaptive manner. In self-attention, the feature at a certain position is updated by aggregating the features at all other positions in a weighted-sum manner, where the weights are decided by the similarity between the two corresponding features. However, two issues reduce the performance of self-attention when it is applied to medical image segmentation. First, the weighted-sum features contain information about other categories (see Figure 1 (a)), which means that the attention map contains noise, and this kind of noise increases dramatically in medical images with inter-class indistinction; this view is confirmed by our experiments (see Figure 5). Second, CNNs are powerful because of their ability to generate a hierarchical object representation [16], which is related to the human visual system; existing self-attention methods, however, only aggregate context information of one level, which we term one-level attention, and cannot learn a hierarchical attention representation. Given the importance of semantic segmentation in medical image analysis, introducing a hierarchical attention mechanism into the medical image domain to generate more powerful pixel-wise feature representations promises a general advantage.

Based on the above observations, we reformulate the original self-attention mechanism from the view of the high-order graph and propose a novel encoder-decoder structure for medical image segmentation, called the Hierarchical Attention Network (HANet). Instead of one-level information mixing, HANet realizes an effective aggregation of hierarchical context information. Specifically, we first compute an initial attention map that represents the similarities between corresponding features. The initial attention map is then used to extract a sparse graph, in which there is an edge between two nodes only if the correlation between them is high enough. Further, we search the N-degree neighbors of each node in the sparse graph (see Figure 1). Finally, the nodes are updated by mixing the information of neighbors at various distances. Our hierarchical attention mechanism embedded in HANet focuses only on neighbor relations with high confidence and aggregates the latent information of hierarchical neighbors. It naturally reduces the noisy attention information caused by the inter-class indistinction of medical images and can generate a wide class of feature representations.

The main contributions of this paper are listed as follows:

  • We reformulate the self-attention mechanism in a high-order graph manner, which aggregates hierarchical context information to enhance the discriminative ability of feature representations. To the best of our knowledge, this paper is the first to introduce the high-order graph into the self-attention mechanism.

  • We build the proposed hierarchical attention mechanism as a flexible module for neural networks, yielding powerful compact attention architectures for medical image segmentation.

  • Extensive experiments on four datasets, including the REFUGE dataset (https://refuge.grand-challenge.org/), the Drishti-GS1 dataset [29], the DRIVE dataset [30], and the LUNA dataset (https://www.kaggle.com/kmader/finding-lungs-in-ct-data/data/), demonstrate the superiority of our approach over existing state-of-the-art methods. The code will be released.

2 Related works

Semantic segmentation. Fully convolutional network (FCN) based methods [21] have made great progress in semantic segmentation, attempting to predict pixel-level semantic labels for a given image. Recently, several model variants have been proposed to enhance context aggregation. Atrous spatial pyramid pooling (ASPP) based methods [5, 6, 7, 42] aggregate context information via parallel dilated convolutions [28, 44, 24], while other methods [47, 20] collect contextual information at different scales via multi-scale pooling. These methods reveal the advantage of large and multi-scale receptive fields for semantic segmentation. Moreover, the U-shaped networks [26, 3, 46, 14, 32] enable multi-level feature extraction and successive feature aggregation, and have achieved promising results on medical image segmentation; they work with few training samples and capture rich context information. However, these methods use the same feature-detector weights at all locations and cannot robustly handle the variance of medical images.

Figure 2: The overall structure of our proposed Hierarchical Attention Network. An input image is passed through an encoder and a bottleneck layer to produce a feature map $F$. Then $F$ is fed into the Hierarchical Attention module, where the feature representation is reinforced by our proposed Dense Similarity block, Attention Propagation block, and Information Aggregation block. Finally, the reinforced feature map $F'$ is transformed by a bottleneck layer and a decoder to generate the final segmentation results. The corresponding low-level and high-level features are connected by skip connections.

Self-attention. Self-attention has been widely used in natural language processing [4, 31] and computer vision [35]. Self-attention based methods aggregate global context information with dynamic weights, representing the context as a weighted summation of the information at all positions. DANet [10] models attention in the spatial and channel dimensions respectively. CCNet [13] obtains context information via an efficient criss-cross attention module. EMANet [17] reformulates the self-attention mechanism as an expectation-maximization iteration. Besides, Graph Convolutional Networks [15] perform message passing in a manner similar to self-attention, and Curve-GCN [19] represents an object as a graph and fuses information with a Graph Convolutional Network [15]. We are also inspired by recently proposed high-order graph convolution methods [2, 1, 23], which can learn a general class of neighborhood mixing relationships.

Hierarchical attention. Hierarchical attention has shown advantages in many tasks, such as document classification [43], response generation [40], sentiment classification [18], action recognition [41], and reading comprehension [48]. These methods differ from ours, which obtains hierarchical attention via the high-order graph.

3 Hierarchical Attention Network

In the self-attention method, the updated features contain much information about other categories because of the inter-class indistinction, high variability, and noise in medical images. To address this issue, we explore a novel attention mechanism from the view of the high-order graph, which reduces the noise in the attention map and adaptively aggregates multi-level global context information.

As illustrated in Figure 2, we propose a Hierarchical Attention Network (HANet) for medical image segmentation. Constructed with an encoder-decoder architecture, HANet contains a hierarchical attention (HA) module that captures global context information over local features. First, a medical image is encoded by an encoder to form a feature map with spatial size $H \times W$. Then, after being transformed by a channel reduction layer, the resulting feature map $F$ is fed into the HA module, which reinforces the feature representation via our hierarchical attention strategy. In the module, $F$ is transformed by two branches: 1) the Dense Similarity block and the Attention Propagation block transform $F$ in order to generate the initial attention map $A$ and the hierarchical attention maps $\{A_k\}$; 2) $F$ is transformed by a bottleneck layer to produce the feature map $V$. The feature map $V$ and the hierarchical attention maps $\{A_k\}$ are then used to mix context information of multiple levels and produce a new feature map $F'$ via the Information Aggregation block. Finally, $F'$ is transformed by a bottleneck layer and a decoder to generate the final segmentation results. Moreover, the low-level and high-level features are connected by skip connections [12], which have been widely shown to recover segmentation details. Coupled with the HA module, HANet can therefore capture richer context information and enhance the discriminative ability of feature representations. We introduce the hierarchical attention module in detail below.

3.1 Dense Similarity

The dense similarity block computes the initial attention map $A$, which represents the similarity between every pair of features. As shown in Figure 3 (a), we calculate $A$ from $F$ in a dot-product manner [35, 31]. Given the feature map $F \in \mathbb{R}^{C \times H \times W}$, we feed it to two parallel convolutions to generate two new feature maps with shape $C' \times H \times W$. Both of them are then reshaped to $\mathbb{R}^{N \times C'}$ with $N = H \times W$, namely $Q$ and $K$. The initial attention map $A \in \mathbb{R}^{N \times N}$ is calculated by matrix multiplication of $Q$ and $K^{\top}$:

$A = \dfrac{Q K^{\top}}{\tau}$,   (1)

where $\tau$ is a scaling factor that counteracts numerical explosion. Following previous work [31], we set $\tau = \sqrt{C'}$, where $C'$ is the channel number of $Q$ and $K$.

Computing the similarity between features in a dot-product manner is fast and space-efficient in practice. The initial attention map $A$ is further used to form the hierarchical attention maps in the next sections.
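As an illustration, the following is a minimal PyTorch sketch of this dense similarity computation; the layer names and shapes are our own and the paper's exact implementation may differ.

```python
import torch
import torch.nn as nn

class DenseSimilarity(nn.Module):
    """Sketch of the dense similarity block: scaled dot-product affinities
    between every pair of spatial positions (illustrative names)."""
    def __init__(self, in_channels, key_channels):
        super().__init__()
        self.query_conv = nn.Conv2d(in_channels, key_channels, kernel_size=1)
        self.key_conv = nn.Conv2d(in_channels, key_channels, kernel_size=1)
        self.scale = key_channels ** 0.5          # tau = sqrt(C') as in Eq. (1)

    def forward(self, f):                          # f: (B, C, H, W)
        q = self.query_conv(f).flatten(2).transpose(1, 2)  # (B, N, C'), N = H*W
        k = self.key_conv(f).flatten(2)                     # (B, C', N)
        return torch.bmm(q, k) / self.scale                 # (B, N, N), initial attention A
```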

Figure 3: The details of our Dense Similarity block, Attention Propagation block, and Information Aggregation block.

3.2 Attention Propagation

Our attention propagation block, which produces the hierarchical attention maps from $A$, is based on basic graph theory. A graph is given in the form of an adjacency matrix $M$, whose $k$-th power entry $M^{k}_{ij}$ is a positive integer if vertex $j$ can be reached from vertex $i$ in $k$ hops and zero otherwise. $M^{k}$ is computed by multiplying the adjacency matrix by itself $k$ times:

$M^{k} = \underbrace{M \cdot M \cdots M}_{k}$.   (2)

If $M^{k}$ is normalized into a Boolean accessibility matrix, where zero maps to false and any other value to true, then as $k$ increases, $M^{k}$ tends to become equal to $M^{k+1}$. At this point, $M^{k}$ is the transitive closure (https://en.wikipedia.org/wiki/Transitive_closure) of the graph, in which there is a direct edge between vertex $i$ and vertex $j$ whenever $j$ is reachable from $i$. Moreover, if we initialize an edge between two vertices only if they have the same label, the vertices with the same label form a complete graph (https://en.wikipedia.org/wiki/Complete_graph). The transitive closure is the best case for the attention mechanism, where the updated feature of a node mixes the latent information of all features that share its label.
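As a toy illustration of this reachability argument (not from the paper), repeatedly multiplying a Boolean adjacency matrix and re-binarizing the result converges to the transitive closure:

```python
import numpy as np

# Small directed graph: 0 -> 1 -> 2; node 3 is isolated.
adj = np.array([[0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]], dtype=bool)

reach = adj.copy()
while True:
    nxt = (reach.astype(int) @ adj.astype(int) > 0) | reach  # paths one hop longer
    if (nxt == reach).all():                                  # fixed point = transitive closure
        break
    reach = nxt
print(reach.astype(int))   # node 0 now has a direct edge to every node it can reach
```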

Based on the above graph theory, we propose a novel way to realize the propagation of attention. As presented previously, the element $A_{ij}$ of $A$ represents the correlation between two corresponding features. We regard $A$ as the adjacency matrix of a graph, where an edge weight indicates the degree to which two nodes belong to the same category. As shown in the upper part of Figure 3, given $A$, we erase low-confidence edges by applying a hard threshold to produce the down-sampled graph $\hat{A}$. This operation is related to previous works [37, 45] that operate on activation maps. We expect each node to be connected only to nodes with the same label, which reduces the noise in the attention map. We obtain the down-sampled graph $\hat{A}$ as follows:

$\hat{A}_{ij} = \begin{cases} A_{ij}, & A_{ij} \ge \lambda \\ 0, & \text{otherwise}, \end{cases}$   (3)

where $\lambda$ is a hard threshold. The high-order graph $\hat{A}^{k}$, which indicates the $k$-degree neighbors of each node, can then be obtained with Eq. 2. Finally, the hierarchical attention maps are calculated as follows:

$A_{k} = \hat{A}^{k}, \quad k = 1, \ldots, K$,   (4)

where $k$ is an integer adjacency power indicating the number of steps of attention propagation. The attention information of different levels is thus decoupled into different attention maps $\{A_{1}, \ldots, A_{K}\}$. The produced hierarchical attention maps are used to aggregate hierarchical context information in Section 3.3.

In our attention propagation method, the high-order graph $\hat{A}^{k}$ is important for building the hierarchical attention maps because it reduces the noise in the initial attention map. The hierarchical attention maps could, however, also be obtained without $\hat{A}$:

$A_{k} = A^{k}$,   (5)

where $A$ is the initial attention map. To a certain extent, this operation is related to the high-order graph convolution methods [2, 1, 23]. In this form, however, the case of transitive closure mentioned above is difficult to reach, and the obtained hierarchical attention maps yield unsuitable features that mix a large amount of context information from other categories, which is inconsistent with our goal of reducing noise in the attention map. Furthermore, the meaning of $A^{k}$ is complex and hard to interpret in practice. We therefore propose the attention propagation method that produces hierarchical attention maps via the high-order graph $\hat{A}^{k}$, which reduces computational complexity and is easy to interpret.
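The thresholding and propagation steps of Eqs. (3) and (4) can be sketched as follows; this is a simplified illustration in which `threshold` and `levels` stand in for the hyper-parameters $\lambda$ and $K$.

```python
import torch

def attention_propagation(a, threshold=0.5, levels=2):
    """Sketch (hypothetical helper): threshold the initial attention map into a
    sparse graph, then take matrix powers to reach k-degree neighbors."""
    # a: (B, N, N) initial attention, assumed rescaled to [0, 1]
    sparse = torch.where(a >= threshold, a, torch.zeros_like(a))  # Eq. (3)
    maps, power = [], sparse
    for _ in range(levels):
        maps.append(power)                 # A_k for k = 1..K, Eq. (4)
        power = torch.bmm(power, sparse)   # next adjacency power, Eq. (2)
    return maps
```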

3.3 Information Aggregation

As shown in the lower part of Figure 3 (b), we aggregate information of multiple levels to generate $F'$ in a weighted-sum manner, performing matrix multiplication between the feature map $V$ and each normalized hierarchical attention map $\tilde{A}_{k}$:

$F' = W_{2}\big( \Vert_{k=1}^{K}\, \tilde{A}_{k} W_{1}(V) \big)$,   (6)

where $\tilde{A}_{k}$ is the normalized $A_{k}$, $\Vert$ denotes channel-wise concatenation, $K$ is the highest level of attention maps, and $W_{1}$ and $W_{2}$ are convolutions. Finally, $F'$ is transformed by a bottleneck layer and enriched by the decoder to generate accurate segmentation results.

Generally, our Hierarchical Attention module can be flexibly embedded in existing fully convolutional networks; because it only mixes the context information of highly correlated features, it enhances the discriminative ability of feature representations.
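A minimal sketch of the aggregation in Eq. (6), assuming row-normalization of each attention map and a 1x1 fusion convolution; the layer names are illustrative rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationAggregation(nn.Module):
    """Sketch: each hierarchical attention map re-weights the value features,
    the per-level results are concatenated and fused by a 1x1 convolution."""
    def __init__(self, channels, levels=2):
        super().__init__()
        self.value_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.fuse_conv = nn.Conv2d(channels * levels, channels, kernel_size=1)

    def forward(self, f, attention_maps):       # f: (B, C, H, W); maps: list of (B, N, N)
        b, c, h, w = f.shape
        v = self.value_conv(f).flatten(2)        # (B, C, N)
        outs = []
        for a_k in attention_maps:
            a_k = F.normalize(a_k, p=1, dim=-1)  # row-normalize each attention map
            outs.append(torch.bmm(v, a_k.transpose(1, 2)).view(b, c, h, w))
        return self.fuse_conv(torch.cat(outs, dim=1))   # fused feature map F'
```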

4 Experiments

To evaluate the proposed method, we carry out comprehensive experiments on three medical image segmentation tasks: optic disc/cup segmentation, retinal blood vessel segmentation, and lung segmentation. These tasks correspond to three representative characteristics of medical images, namely inter-class indistinction, high variability, and noise. For optic disc/cup segmentation, we conduct experiments on the REFUGE and Drishti-GS1 [29] datasets. For blood vessel segmentation and lung segmentation, we conduct experiments on the DRIVE [30] and LUNA datasets, respectively.

4.1 Datasets

REFUGE. The dataset is arranged for the segmentation of the optic disc and cup and consists of 400 training images and 400 validation images; the testing set is not available. The training images are captured with a Zeiss Visucam 500 fundus camera at a resolution of 2124×2056 pixels, and the validation images with a Canon CR-2 fundus camera at a resolution of 1634×1634 pixels. Pixel-wise gray-scale annotations of the disc and cup are provided.

Drishti-GS1. It contains 50 training images and 51 testing images for optic disc/cup segmentation. All images are centered on the optic disc, taken with a 30-degree field of view, and have dimensions of 2896×1944 pixels. The annotations are provided in the form of average boundaries.

DRIVE. The dataset is arranged for blood vessel segmentation. It includes 40 color retina images of 565×584 pixels, of which 20 are used for training and the remaining 20 for testing. Manual annotations are provided by two experts, and the annotations of the first expert are used as the gold standard.

LUNA. It contains 2D CT images of dimensions 224×224 pixels from the Lung Nodule Analysis (LUNA) competition, which can be freely downloaded from the Kaggle website listed above. Following the previous work [11], we split the 267 images into training and testing sets.

4.2 Implementation Details

We build our networks with PyTorch and train them on a single TITAN XP GPU. We choose the ImageNet [27] pre-trained ResNet-101 as our encoder, designed in a fully convolutional fashion [21] that replaces the convolutions within the last blocks with dilated convolutions [28, 24, 44]. Our encoder can adopt three different output strides (16, 8, or 4) for feature extraction. The setting of dilated convolutions is the same as in [6] when the output stride is 16 or 8, and we change the stride of the first convolution of ResNet-101 from 2 to 1 when the output stride is 4. In the decoder, the input is first bilinearly upsampled and then fused with the corresponding low-level features. We adopt a simple yet effective decoder module as in [7] that only takes the output of the first block of ResNet-101 [12] as low-level features. Finally, the output of the decoder is bilinearly upsampled to the size of the input image, and the loss is computed with the cross-entropy loss function. After initializing the two hyper-parameters $\lambda$ and $K$, our HANet is trained end-to-end. In addition, we implement DeepLabv3+ [7] and DANet [10] for better comparison.

We use Stochastic Gradient Descent with mini-batches for training. The initial learning rate is 0.01, the momentum is 0.9, and the weight decay is 5e-4. For optic disc/cup segmentation and lung segmentation, we train for 100 epochs and employ a Reduce-LR-On-Plateau policy in which the learning rate is multiplied by 0.1 if the performance on the validation set has not improved within the previous five epochs; the input spatial resolution is 513×513 and the output stride is 16. For blood vessel segmentation, we train for 20 epochs and the learning rate is multiplied by 0.5 at the 10th epoch; the input spatial resolution is 224×224 and the output stride is 8. During training we apply photometric distortion, rotation, random scale cropping, left-right flipping, and Gaussian blurring for data augmentation.
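For concreteness, the reported optimization setup roughly corresponds to the following PyTorch configuration; this is a sketch in which the model constructor and the monitored metric are placeholders.

```python
import torch

model = torch.nn.Conv2d(3, 3, kernel_size=1)          # placeholder for the actual HANet
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# Optic disc/cup and lung segmentation: learning rate multiplied by 0.1 if the
# validation score has not improved within the previous five epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1, patience=5)     # mode='max' assumes a score metric
criterion = torch.nn.CrossEntropyLoss()

# Per epoch, after validation: scheduler.step(validation_score)
```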

4.3 Results on Optic Disc/Cup Segmentation

Optic disc/cup segmentation is very useful in clinical practice for the diagnosis of glaucoma, which is usually characterized by a larger cup-to-disc ratio (CDR). The segmentation is challenging because of the high similarity among the cup, the disc, and the background. We first conduct ablation experiments with different attention module settings on the REFUGE dataset. Then, the proposed method is compared with existing state-of-the-art segmentation methods. We also evaluate the robustness of HANet for domain adaptation, where the model trained only on the REFUGE training set is tested on the Drishti-GS1 dataset. For these experiments, we first localize the disc following existing methods [8, 46, 34] and then feed the cropped images into our network. Because the official testing set is not available, we use 50 images of the training set to select the best model and then evaluate it on the official validation set. We do not perform any post-processing, and the results are reported with the dice coefficient (Dice) and the mean absolute error of the cup-to-disc ratio (E), where Dice_disc and Dice_cup denote the dice coefficients of the optic disc and cup, respectively.
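For reference, the dice coefficient on a binary mask can be computed as below; this is an illustrative helper, not taken from the paper's code.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2 * |P intersect G| / (|P| + |G|) for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# mDice averages the disc and cup dice scores (consistent with Table 1);
# E is the absolute error between predicted and ground-truth cup-to-disc ratios.
```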

Figure 4: The influence of the threshold $\lambda$ and the adjacency power $k$; the results are reported as mDice on the REFUGE validation set. $K$ is the highest level of attention, with $1 \le k \le K$.

4.3.1 Ablation Study for Attention Modules

We study the influence of the threshold $\lambda$ and the adjacency power $k$ on our hierarchical attention module. After normalizing the initial attention map between 0 and 1, we use the threshold $\lambda$ to obtain the down-sampled graph as in Eq. 3, and $k$ is used in Eq. 2. As shown in Figure 4, HANet is not highly sensitive to the parameter configuration. Specifically, the line of $K = 1$ denotes a single level of attention; it first increases gradually with the threshold and then shows a downward trend at large thresholds, meaning that part of the noise in the attention map is erased by a small threshold, whereas important attention information is discarded by a large one. HANet performs better when $K > 1$, i.e., when there is more than one level of attention map. At large thresholds, larger values of $K$ outperform smaller ones, revealing that the attention propagation module with a larger adjacency power can infer attention to more positions from the existing attention, so HANet still performs well even if most of the attention information is discarded by a large threshold. These two parameters enable our model to learn a wider range of feature representations and to adapt robustly to many tasks.

Method              mDice    E        Dice_disc  Dice_cup
AIML                0.9250   0.0376   0.9583     0.8916
BUCT                0.9188   0.0395   0.9518     0.8857
CUMED               0.9185   0.0425   0.9522     0.8848
VRT                 0.9161   0.0455   0.9472     0.8849
CUHKMED             0.9116   0.0440   0.9487     0.8745
U-Net[26]           0.8926   -        0.9308     0.8544
POSAL[34]           0.9105   0.0510   0.9460     0.8750
M-Net[8]            0.9120   0.0480   0.9540     0.8700
Ellipse[36]         0.9125   0.0470   0.9530     0.8720
Task-DS[25]         0.9129   -        -          -
ET-Net[46]          0.9221   -        0.9529     0.8912
DeepLabv3+[7]       0.9215   0.0403   0.9575     0.8854
DANet[10]           0.9192   0.0423   0.9572     0.8813
HANet (one-level)   0.9239   0.0355   0.9544     0.8934
HANet               0.9302   0.0347   0.9599     0.9005
Table 1: Optic disc/cup segmentation results on the REFUGE validation set. Dice_disc and Dice_cup are the dice coefficients of the optic disc and cup, and E is the mean absolute error of the cup-to-disc ratio. The first five rows are entries from the REFUGE challenge leaderboard. HANet (one-level) denotes the configuration of HANet that recovers one-level attention; HANet denotes the full hierarchical model.
Figure 5: Visualization of the attention maps of the pixels marked by “+” and segmentation results on the REFUGE validation set. HANet (one-level) indicates the configuration of our HANet that recovers the original self-attention method. For our HANet, we set $K = 2$, which generates two levels of attention maps (i.e., $A_1$ and $A_2$). The second to fifth columns show the attention maps generated by different models, and the sixth to ninth columns show the segmentation results. There is clearly much noise in the attention maps of the second and third columns, which is consistent with our assumption that the attention maps produced by the original self-attention based method contain noise. Best viewed in color.

4.3.2 Comparing with State-of-the-art

We compare our HANet with existing methods on the REFUGE validation set. Our HANet is compared with two baselines, DeepLabv3+ [7] and DANet [10]: DeepLabv3+ [7] has an encoder-decoder structure similar to our HANet, and DANet [10] models attention in the spatial (one-level self-attention) and channel dimensions respectively. To compare multi-level and one-level attention fairly, we also compare HANet with HANet (one-level), which recovers the original self-attention method. HANet is also compared with the methods leading the REFUGE challenge (https://refuge.grand-challenge.org/Results-ValidationSet_Online/) held in conjunction with MICCAI 2018 (e.g. AIML, BUCT).

As shown in Table 1, our HANet achieves the best performance among the competitive published benchmarks. In particular, HANet outperforms DeepLabv3+ [7] and DANet [10] by 0.87% and 1.10% on mDice respectively, and it also outperforms the previous state-of-the-art optic disc/cup segmentation method ET-Net [46]. Moreover, our HANet outperforms AIML, which took first place for the optic disc and cup segmentation tasks in the REFUGE challenge. Notably, our model achieves impressive results for optic cup segmentation, an especially difficult task because of the high similarity between the optic cup and disc.

Figure 5 shows the attention maps of the pixels marked by “+”. In self-attention based methods, the feature of the “+” marked pixel is updated via a weighted summation of the features at other locations, where the weights are shown in red in the attention maps. The attention maps in the second and third columns clearly contain noise, which may cause the weighted-sum features to contain much information about other categories. As shown in the fourth and fifth columns of Figure 5, our HANet, which aggregates context information with hierarchical attention, focuses only on the positive context information; our approach thus naturally reduces the noise in the attention map.

Method              mDice    E        Dice_disc  Dice_cup
M-Net[8]            0.8515   0.1660   0.9370     0.7660
Ellipse[36]         0.8520   0.1590   0.9270     0.7770
DeepLabv3+[7]       0.8643   0.1781   0.9663     0.7623
DANet[10]           0.8924   0.1131   0.9660     0.8187
HANet (one-level)   0.8930   0.1346   0.9629     0.8232
HANet               0.9117   0.1091   0.9721     0.8513
Table 2: Comparison of the robustness of different models on the Drishti-GS1 dataset. We train the models on the REFUGE training set and test them on the whole Drishti-GS1 dataset.

4.3.3 Robustness

We report the results of HANet on the Drishti-GS1 dataset [29] to evaluate its robustness for domain adaptation, where the model is only trained on the REFUGE training set. Domain adaptation is a great challenge for medical image segmentation because the diversity of imaging equipment dramatically affects image quality. As shown in Table 2, the proposed HANet is more robust than the other state-of-the-art segmentation methods. In particular, our HANet, which aggregates context information with dynamic weights, outperforms DeepLabv3+ [7], which aggregates context information with fixed weights, by 4.74% on mDice. Moreover, our HANet, which aggregates context information with hierarchical attention, outperforms HANet (one-level) and DANet [10], which aggregate context information with one-level attention, by 1.87% and 1.93% on mDice, respectively. In particular, our HANet outperforms the state-of-the-art Ellipse [36] by 7.43% on Dice_cup, which illustrates that our hierarchical attention has great advantages in the recognition of confusing categories. The above results demonstrate that our method is more robust to the variance of the input than existing methods.

Method ACC F1 Se Sp
DeepVessel[9] 0.9523 0.7900 0.7603 -
U-net[26] 0.9554 0.8175 0.7849 0.9802
R2U-net[3] 0.9556 0.8171 0.7792 0.9813
LadderNet[49] 0.9561 0.8202 0.7856 0.9810
Multi-scale[39] 0.9567 - 0.7844 0.9819
DRIU[22] - 0.8221 0.8264 -
Vessel-Net[38] 0.9578 - 0.8038 0.9802
DUNet[14] 0.9566 0.8237 0.7963 0.9800
DEU-Net[32] 0.9567 0.8270 0.7940 0.9816
CSAR[33] - 0.8353 0.8419 -
DeepLabv3+[7] 0.9679 0.8222 0.7881 0.9814
DANet[10] 0.9693 0.8205 0.7827 0.9804
HANet (one-level) 0.9706 0.8251 0.8158 0.9832
HANet 0.9712 0.8300 0.8297 0.9843
Table 3: Blood vessel segmentation results on DRIVE testing set.
Figure 6: Visualization results on the DRIVE testing set. Local patches are highlighted for a more detailed comparison.

4.4 Results on Retinal Blood Vessel Segmentation

Retinal blood vessels are typical objects with curvilinear structure, and their segmentation is challenging because the vessels are small and diverse in shape. Our method is compared with DeepLabv3+ [7], DANet [10], and other state-of-the-art vessel segmentation methods. Following existing methods [38, 49], we extract patches in order with a stride of 8 along both the horizontal and vertical directions for training, and we recompose the entire image from the probability maps of partly overlapping patches in the testing phase. We do not perform any post-processing, and the results are reported as accuracy (ACC), F1-score (F1), sensitivity (Se), and specificity (Sp).
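The ordered patch extraction can be sketched as below; the patch size is left as a parameter because its exact value is not specified here.

```python
import numpy as np

def extract_patches(image, patch_size, stride=8):
    """Extract patches in order along both axes with a fixed stride
    (illustrative helper; patch_size is an unspecified hyper-parameter)."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return np.stack(patches)
```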

Extensive results on the DRIVE testing set are shown in Table 3. HANet attains the highest values on ACC and Sp, while the values of the other two metrics remain competitive with the Context-aware Spatio-recurrent (CSAR) method [33]. CSAR [33] only handles curvilinear structure segmentation, whereas our method can segment many kinds of medical images. The results also show that HANet outperforms DeepLabv3+ [7] and the one-level attention based methods (i.e., HANet (one-level) and DANet [10]). Qualitative results are shown in Figure 6, which demonstrate the effectiveness of our HANet in detecting small and highly variable objects.

Method IoU ACC F1 Se
U-Net [26] 0.9130 0.9750 - 0.9380
ResU-Net[3] - 0.9849 0.9690 0.9555
RU-Net[3] - 0.9836 0.9638 0.9734
R2U-Net[3] - 0.9918 0.9823 0.9832
CE-Net[11] 0.9620 0.9900 - 0.9800
DeepLabv3+[7] 0.9674 0.9923 0.9834 0.9851
DANet[10] 0.9593 0.9903 0.9792 0.9832
HANet (one-level) 0.9662 0.9920 0.9828 0.9842
HANet 0.9768 0.9945 0.9883 0.9879
Table 4: Experimental results on LUNA testing set.
Figure 7: Visualization results on LUNA testing set.

4.5 Results on Lung Segmentation

We apply HANet to segment the lung structure in 2D CT images, where noise is closely related to image quality. The segmentation results are reported as accuracy (ACC), Intersection over Union (IoU), F1-score (F1), and sensitivity (Se). As shown in Table 4, HANet achieves state-of-the-art performance on all metrics. Two example segmentation results are shown in Figure 7. These results show that our hierarchical attention module substantially boosts segmentation performance on medical images that are easily affected by noise.

5 Conclusion

In this paper, we experimentally find that the attention map in the original self-attention method contains a lot of noise, which causes the updated features to mix much context information from other categories. This noise has a serious impact on the feature representation of medical images with inter-class indistinction. We therefore propose a novel Hierarchical Attention Network (HANet) for medical image segmentation, which adaptively captures multi-level global context information in a high-order graph manner. In particular, the hierarchical attention module embedded in HANet can be flexibly inserted into existing CNNs. Extensive experiments demonstrate that our hierarchical attention module naturally reduces the noisy attention information caused by inter-class indistinction and is more robust to the variance of the input than the original self-attention based methods. Our HANet achieves outstanding performance consistently on four benchmark datasets.

References

  • [1] S. Abu-El-Haija, B. Perozzi, A. Kapoor, H. Harutyunyan, N. Alipourfard, K. Lerman, G. V. Steeg, and A. Galstyan (2019) Mixhop: higher-order graph convolution architectures via sparsified neighborhood mixing. ICML. Cited by: §2, §3.2.
  • [2] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, and H. Harutyunyan (2018) A higher-order graph convolutional layer. NeurIPS. Cited by: §2, §3.2.
  • [3] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari (2018) Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation. arXiv. Cited by: §1, §2, Table 3, Table 4.
  • [4] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv. Cited by: §2.
  • [5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §2.
  • [6] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv. Cited by: §1, §2, §4.2.
  • [7] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pp. 801–818. Cited by: §1, §2, §4.2, §4.3.2, §4.3.2, §4.3.3, §4.4, §4.4, Table 1, Table 2, Table 3, Table 4.
  • [8] H. Fu, J. Cheng, Y. Xu, D. W. K. Wong, J. Liu, and X. Cao (2018) Joint optic disc and cup segmentation based on multi-label deep network and polar transformation. TMI 37 (7), pp. 1597–1605. Cited by: §4.3, Table 1, Table 2.
  • [9] H. Fu, Y. Xu, S. Lin, D. W. K. Wong, and J. Liu (2016) Deepvessel: retinal vessel segmentation via deep learning and conditional random field. In MICCAI, pp. 132–139. Cited by: Table 3.
  • [10] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In CVPR, pp. 3146–3154. Cited by: §2, §4.2, §4.3.2, §4.3.2, §4.3.3, §4.4, §4.4, Table 1, Table 2, Table 3, Table 4.
  • [11] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, and J. Liu (2019) CE-net: context encoder network for 2d medical image segmentation. TMI. Cited by: §1, §4.1, Table 4.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3, §4.2.
  • [13] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019) Ccnet: criss-cross attention for semantic segmentation. In ICCV, pp. 603–612. Cited by: §2.
  • [14] Q. Jin, Z. Meng, T. D. Pham, Q. Chen, L. Wei, and R. Su (2019) DUNet: a deformable network for retinal vessel segmentation. Knowledge-Based Systems 178, pp. 149–162. Cited by: §2, Table 3.
  • [15] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §2.
  • [16] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989) Backpropagation applied to handwritten zip code recognition. Neural computation 1 (4), pp. 541–551. Cited by: §1.
  • [17] X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu (2019) Expectation-maximization attention networks for semantic segmentation. In ICCV, Cited by: §2.
  • [18] Z. Li, Y. Wei, Y. Zhang, and Q. Yang (2018) Hierarchical attention transfer network for cross-domain sentiment classification. In AAAI, Cited by: §2.
  • [19] H. Ling, J. Gao, A. Kar, W. Chen, and S. Fidler (2019) Fast interactive object annotation with curve-gcn. In CVPR, pp. 5257–5266. Cited by: §2.
  • [20] J. Liu, Q. Hou, M. Cheng, J. Feng, and J. Jiang (2019) A simple pooling-based design for real-time salient object detection. In CVPR, pp. 3917–3926. Cited by: §1, §2.
  • [21] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440. Cited by: §2, §4.2.
  • [22] K. Maninis, J. Pont-Tuset, P. Arbeláez, and L. Van Gool (2016) Deep retinal image understanding. In MICCAI, pp. 140–148. Cited by: Table 3.
  • [23] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe (2019) Weisfeiler and leman go neural: higher-order graph neural networks. In AAAI, Vol. 33, pp. 4602–4609. Cited by: §2, §3.2.
  • [24] G. Papandreou, I. Kokkinos, and P. Savalle (2015) Modeling local and global deformations in deep learning: epitomic convolution, multiple instance learning, and sliding window detection. In CVPR, pp. 390–399. Cited by: §2, §4.2.
  • [25] X. Ren, L. Zhang, S. Ahmad, D. Nie, F. Yang, L. Xiang, Q. Wang, and D. Shen (2019) Task decomposition and synchronization for semantic biomedical image segmentation. arXiv. Cited by: Table 1.
  • [26] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §1, §2, Table 1, Table 3, Table 4.
  • [27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §4.2.
  • [28] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun (2014) Overfeat: integrated recognition, localization and detection using convolutional networks. ICLR. Cited by: §2, §4.2.
  • [29] J. Sivaswamy, S. Krishnadas, G. D. Joshi, M. Jain, and A. U. S. Tabish (2014) Drishti-gs: retinal image dataset for optic nerve head (onh) segmentation. In ISBI, pp. 53–56. Cited by: 3rd item, §4.3.3, §4.
  • [30] J. Staal, M. D. Abràmoff, M. Niemeijer, M. A. Viergever, and B. Van Ginneken (2004) Ridge-based vessel segmentation in color images of the retina. TMI 23 (4), pp. 501–509. Cited by: 3rd item, §4.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §1, §2, §3.1.
  • [32] B. Wang, S. Qiu, and H. He (2019) Dual encoding u-net for retinal vessel segmentation. In MICCAI, pp. 84–92. Cited by: §2, Table 3.
  • [33] F. Wang, Y. Gu, W. Liu, Y. Yu, S. He, and J. Pan (2019) Context-aware spatio-recurrent curvilinear structure segmentation. In CVPR, pp. 12648–12657. Cited by: §4.4, Table 3.
  • [34] S. Wang, L. Yu, X. Yang, C. Fu, and P. Heng (2019) Patch-based output space adversarial learning for joint optic disc and cup segmentation. TMI. Cited by: §4.3, Table 1.
  • [35] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, pp. 7794–7803. Cited by: §1, §2, §3.1.
  • [36] Z. Wang, N. Dong, S. D. Rosario, M. Xu, P. Xie, and E. P. Xing (2019) Ellipse detection of optic disc-and-cup boundary in fundus images. In ISBI, pp. 601–604. Cited by: §4.3.3, Table 1, Table 2.
  • [37] Y. Wei, J. Feng, X. Liang, M. Cheng, Y. Zhao, and S. Yan (2017) Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In CVPR, pp. 1568–1576. Cited by: §3.2.
  • [38] Y. Wu, Y. Xia, Y. Song, D. Zhang, D. Liu, C. Zhang, and W. Cai (2019) Vessel-net: retinal vessel segmentation under multi-path supervision. In MICCAI, pp. 264–272. Cited by: §4.4, Table 3.
  • [39] Y. Wu, Y. Xia, Y. Song, Y. Zhang, and W. Cai (2018) Multiscale network followed network model for retinal vessel segmentation. In MICCAI, pp. 119–126. Cited by: Table 3.
  • [40] C. Xing, Y. Wu, W. Wu, Y. Huang, and M. Zhou (2018) Hierarchical recurrent attention network for response generation. In AAAI, Cited by: §2.
  • [41] S. Yan, J. S. Smith, W. Lu, and B. Zhang (2018) Hierarchical multi-scale attention networks for action recognition. Signal Processing: Image Communication 61, pp. 73–84. Cited by: §2.
  • [42] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang (2018) Denseaspp for semantic segmentation in street scenes. In CVPR, pp. 3684–3692. Cited by: §1, §2.
  • [43] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy (2016) Hierarchical attention networks for document classification. In NAACL - HLT, pp. 1480–1489. Cited by: §2.
  • [44] F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. ICLR. Cited by: §2, §4.2.
  • [45] X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. S. Huang (2018) Adversarial complementary learning for weakly supervised object localization. In CVPR, pp. 1325–1334. Cited by: §3.2.
  • [46] Z. Zhang, H. Fu, H. Dai, J. Shen, Y. Pang, and L. Shao (2019) ET-net: a generic edge-attention guidance network for medical image segmentation. MICCAI. Cited by: §2, §4.3.2, §4.3, Table 1.
  • [47] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, pp. 2881–2890. Cited by: §1, §2.
  • [48] H. Zhu, F. Wei, B. Qin, and T. Liu (2018) Hierarchical attention flow for multiple-choice reading comprehension. In AAAI, Cited by: §2.
  • [49] J. Zhuang (2018) LadderNet: multi-path networks based on u-net for medical image segmentation. arXiv. Cited by: §4.4, Table 3.