HMANet: Hybrid Multiple Attention Network for Semantic Segmentation in Aerial Images

01/09/2020 ∙ by Ruigang Niu, et al.

Semantic segmentation in very high resolution (VHR) aerial images is one of the most challenging tasks in remote sensing image understanding. Most current approaches are based on deep convolutional neural networks (DCNNs) for their remarkable ability to learn feature representations. In particular, attention-based methods can effectively capture long-range dependencies and reconstruct the feature maps for better representation. However, limited to the perspectives of spatial and channel attention and burdened by the heavy computational complexity of the self-attention mechanism, these methods struggle to model effective semantic interdependencies between every pixel pair. In this work, we propose a novel attention-based framework named Hybrid Multiple Attention Network (HMANet), which adaptively captures global correlations from the perspectives of space, channel and category in a more effective and efficient manner. Concretely, a class augmented attention (CAA) module embedded with a class channel attention (CCA) module is used to compute category-based correlation and recalibrate the class-level information. Additionally, we introduce a simple yet effective region shuffle attention (RSA) module to reduce feature redundancy and improve the efficiency of the self-attention mechanism via region-wise representations. Extensive experimental results on the ISPRS Vaihingen and Potsdam benchmarks demonstrate the effectiveness and efficiency of our HMANet over other state-of-the-art methods.


1 Introduction

Semantic segmentation, also known as semantic labeling, is one of the fundamental and challenging tasks in remote sensing image understanding, whose goal is to assign a semantic class label to each pixel of a given image. In particular, semantic segmentation in very high resolution (VHR) aerial images is of increasing significance because of its widespread applications, such as road extraction, urban planning and land cover classification.

In recent years, Deep Convolutional Neural Networks (DCNNs) have demonstrated a powerful capacity for feature extraction and object representation compared with traditional machine learning methods, such as Random Forests (RF), Support Vector Machines (SVM) and Conditional Random Fields (CRFs). In particular, state-of-the-art methods based on the Fully Convolutional Network (FCN) Long et al. (2015) have made great progress. However, due to their fixed geometric structures, they are inherently limited to local receptive fields and short-range contextual information, so this task remains very challenging.

To capture long-range dependencies, Chen et al. Chen et al. (2017b) proposed atrous spatial pyramid pooling (ASPP) with multi-scale dilation rates to aggregate contextual information. Zhao et al. Zhao et al. (2017) further introduced the pyramid pooling module (PPM) to represent the feature map via multiple regions of different sizes. ScasNet Liu et al. (2018) aggregates multi-scale contexts in a self-cascaded manner. Nevertheless, these context aggregation methods are still unable to capture dense global information.

To generate dense, pixel-wise global correlations, Non-local Neural Networks Wang et al. (2018) utilize a self-attention mechanism, which enables a single feature at any position to perceive the features of all other positions. It can be seen as a form of feature reconstruction: the feature representation of each position is a weighted sum of all other counterparts. DANet Fu et al. (2019) introduces spatial-wise and channel-wise attention modules to enrich the feature representations. Besides, several works Huang et al. (2019b); Li et al. (2019); Huang et al. (2019a) improve the efficiency of the self-attention mechanism to some extent. However, pixel-wise attention approaches still need to generate a dense attention map to measure the relationship between each pixel pair, which has a high computational complexity and occupies a huge amount of GPU memory. Recent works Chen et al. (2018b); Li et al. (2019) have shown that information redundancy is not conducive to good feature representations. Moreover, existing attention-based methods are restricted to the perspectives of space and channel.
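For reference, the pixel-wise self-attention operation described above can be summarized by the following minimal PyTorch sketch; the 1×1-convolution transforms, the channel-reduction factor and the zero-initialized residual weight follow the common non-local formulation rather than the exact configuration of any specific method cited here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelwiseSelfAttention(nn.Module):
    """Standard non-local / self-attention block as commonly formulated.

    The dense (HW x HW) affinity matrix built in `forward` is exactly the
    memory- and compute-heavy part that the methods discussed here try to avoid.
    """
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 8
        self.query = nn.Conv2d(channels, reduced, 1)
        self.key = nn.Conv2d(channels, reduced, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # residual weight, starts at 0

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, HW, c')
        k = self.key(x).flatten(2)                      # (b, c', HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)       # (b, HW, HW) dense affinity
        v = self.value(x).flatten(2)                    # (b, c, HW)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return x + self.gamma * out                     # reconstructed feature
```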

In previous works Long et al. (2015); Chen et al. (2017a, b); Zhao et al. (2017); Mi and Chen (2020), category-based information is only reflected in the last convolutional layer, that is, the score map representing the probability that each pixel belongs to each category. The lack of class-level information in the middle stages of the network leads to poor object classification capabilities. Different from the methods above, we argue that exploiting class-level information is also vital for the semantic segmentation task. Thus, we first propose a so-called category-based correlation, which models the class-level representation of each pixel and further calculates the relationships between categories and the corresponding channels of the feature cube. As shown in Fig. 1(a), category-based correlation mainly focuses on exploiting contextual information from a categorical perspective, which pays more attention to the pixels of the same category during the feature reconstruction illustrated above.

On the other hand, aiming at the feature redundancy and heavy computational complexity of the pixel-wise attention mechanism, we propose a simple yet efficient scheme to address the issue. For example, for a single feature belonging to ‘car’ in Fig. 1(b), the pixel-wise attention method usually extracts features of all other positions, among which we actually do not need to focus on ‘building’ and ‘impervious surface’. But it is hard for the network to learn such precise relations because different categories can look similar under some circumstances (such as shadow or overlap). In other words, the propagation of invalid redundant information is harmful to the feature representations, and it also wastes GPU memory. Therefore, we employ a more robust region-wise attention mechanism to exploit a wider range of correlations. Empirically, region-wise representation can extract long-range contextual information between pixels in a more efficient manner.

Towards the above two issues and our corresponding solutions, we propose a novel framework named Hybrid Multiple Attention Network (HMANet). The overall structure is shown in Fig. 2. More specifically, HMANet mainly consists of two parallel branches, the upper one being the Class Augmented Attention (CAA) module embedded with the Class Channel Attention (CCA) module. Given the input feature, the CAA module first calculates category-based correlation and further generates the class-weighted representation via a dense class affinity map, while the CCA module is added to adaptively recalibrate the class-level information through two linear scaling transformation functions, which efficiently helps to enhance the discriminative ability for each class with few extra parameters. The lower branch of our network is the Region Shuffle Attention (RSA) module, which aims to capture region-wise global information with a shuffling operator and obtain more robust correlations between objects. Besides, compared with pixel-wise attention methods, the grouped region-wise representation greatly reduces the time and space complexity: the dense pixel-pair affinity matrix is replaced by two much smaller region-level affinity matrices computed over $N_h \times N_w$ partitions along the height and width dimensions, each containing $G_h \times G_w$ positions. Finally, we concatenate the output features from each branch with the local representation and feed them into a classifier to generate the fine segmentation map.

Our contributions can be summarized as follows:

  • We present a Class Augmented Attention (CAA) module to exploit category-based correlation between pixels and enhance the discriminative ability for each class, within which a Class Channel Attention (CCA) module is embedded to adaptively recalibrate the class-level information for better representations.

  • The Region Shuffle Attention (RSA) module is proposed to capture region-wise global information and obtain more robust relationships between objects in a more efficient and effective manner.

  • We propose a novel Hybrid Multiple Attention Network (HMANet) by taking advantage of the three attention modules above, which comprehensively captures feature correlations from the perspective of space, channel and category.

  • Extensive experiments on two challenging semantic segmentation benchmarks, the ISPRS 2D Semantic Labeling Challenge datasets for Vaihingen and Potsdam, demonstrate the superiority of our HMANet over other state-of-the-art methods.

The remainder of this paper is arranged as follows. Related work is briefly reviewed in Section 2. Section 3 presents the details of our proposed method, including the three attention modules. Experimental evaluations comparing our HMANet with state-of-the-art methods, as well as ablation studies on the Vaihingen dataset, are provided in Section 4. Finally, the conclusion is given in Section 5.

2 Related Work

Semantic Segmentation. Semantic segmentation is one of the fundamental tasks of image understanding. Fully Convolutional Network (FCN) based methods have made great progress in semantic segmentation by leveraging the powerful representation abilities of classification networks He et al. (2016); Huang et al. (2017) pretrained on large-scale data Russakovsky et al. (2015). Several model variants have been proposed to aggregate the multi-scale contextual information that is vital for object perception. Concretely, DeepLabv2 Chen et al. (2017a) and DeepLabv3 Chen et al. (2017b) employ atrous spatial pyramid pooling (ASPP), which consists of parallel convolutions with different dilation rates, to embed contextual representations. PSPNet Zhao et al. (2017) proposes a pyramid pooling module (PPM) to extract contextual information at different scales, each of which can be considered a global representation. UNet Ronneberger et al. (2015), RefineNet Lin et al. (2017), DFN Yu et al. (2018b), SegNet Badrinarayanan et al. (2017), DeepLabv3+ Chen et al. (2018a) and SPGNet Cheng et al. (2019) adopt encoder-decoder structures to carefully recover location information while retaining high-level semantic features. GCN Peng et al. (2017) utilizes a global convolutional module and global pooling to harvest context for global representations. In addition, BiSeNet Yu et al. (2018a) adopts efficient spatial and context paths to achieve real-time semantic segmentation.

Semantic Segmentation of Aerial Imagery.

Semantic segmentation in VHR aerial images benefits greatly from deep learning methods in computer vision. For example, Mou et al. Mou et al. (2019) propose two network units, a spatial relation module and a channel relation module, to learn relationships between any two positions. TreeUNet Yue et al. (2019) adopts a Tree-CNN block to transmit feature maps via concatenating connections and further fuse multi-scale representations. ScasNet Liu et al. (2018) proposes an end-to-end self-cascaded network to improve labeling coherence with sequential global-to-local context aggregation. SDNF Mi and Chen (2020) combines DCNNs and the traditional decision forest algorithm in an end-to-end manner to achieve better classification accuracy. Marmanis et al. Marmanis et al. (2018) focus on semantic edge detection to restore high-frequency details and obtain fine object boundaries. DSMFNet Cao et al. (2019) proposes a lightweight DSM fusion module to effectively aggregate depth information, within which Cao et al. investigate four fusion strategies corresponding to different scenarios.

Attention model. Attention is widely used for various tasks, such as machine translation Bahdanau et al. (2014); Vaswani et al. (2017), scene classification and semantic segmentation. Squeeze-and-Excitation Networks Hu et al. (2018) recalibrate the feature representations by modeling the dependencies between channels. Non-local Neural Networks Wang et al. (2018) first adopt the self-attention mechanism as a submodule for computer vision tasks, e.g., video classification, object detection and instance segmentation. CCNet Huang et al. (2019b) harvests the contextual information of all positions by stacking two serial criss-cross attention modules. DANet Fu et al. (2019) adopts similar spatial and channel attention modules to gather information from all pixels, which costs even more computation and GPU memory than the non-local operator. A²-Nets Chen et al. (2018b) and Expectation-Maximization Attention Networks Li et al. (2019) sample sparse global descriptors to reconstruct the feature maps in a self-attention manner. ACFNet Zhang et al. (2019) proposes a coarse-to-fine segmentation network based on an attentional class feature module, which can be embedded in any base network. Huang et al. Huang et al. (2019a), Yuan et al. Yuan et al. (2019) and Zhu et al. Zhu et al. (2019) further improve the efficiency of the self-attention mechanism for semantic segmentation.

Motivated by the success of the attention-based methods above, we rethink the attention mechanism from different perspectives and in terms of computation cost. Different from previous works, we propose three types of attention modules that capture global correlations from the perspectives of space, channel and category, respectively, for better feature representations. Moreover, benefiting from the multi-level attention mechanism and region-wise representations, HMANet is more efficient and effective than other attention-based methods. Comprehensive empirical results verify the superiority of our proposed method.

Figure 2: The pipeline of the proposed Hybrid Multiple Attention Network (HMANet). The key components are the two parallel branches, the Class Augmented Attention (CAA) module embedded with the Class Channel Attention (CCA) module and the Region Shuffle Attention (RSA) module, which obtain the category-based correlation and the region-wise contextual dependencies, respectively. Empirically, we concatenate the two output feature maps and the local representation to generate the final segmentation map (Best viewed in color).

3 Approach

3.1 Overview

As shown in Fig. 2, the network architecture mainly consists of three attention modules, the Class Augmented Attention (CAA) module, the Class Channel Attention (CCA) module and the Region Shuffle Attention (RSA) module, among which the CAA and CCA modules are embedded together as the upper branch of the network. The proposed CAA module aims to extract category-based correlation from the feature map, while the CCA module improves the process of feature reconstruction via class-channel weighting for better contextual representation. The lower branch of the network is the RSA module, which greatly decreases the computational consumption and memory footprint of computing long-range dependencies in contrast to the original non-local block.

Concretely, given an input image, we first feed it into a convolutional neural network (CNN), designed in a fully convolutional manner Long et al. (2015), to adaptively extract features. We take ResNet-101 pretrained on the ImageNet dataset as our backbone following the majority of previous works Huang et al. (2019b); Fu et al. (2019); Yuan et al. (2019); Zhang et al. (2019); Li et al. (2019). In particular, we remove the last two down-sampling operations and use dilated convolutions in stage-3 and stage-4 (a multi-grid strategy for the latter), thereby retaining more spatial information and enlarging the output feature map to 1/8 of the input image without adding extra parameters. The feature from stage-4 of the backbone is then fed into two parallel attention branches. The upper branch is the CAA module embedded with the CCA module: the CAA module is designed to model the dependencies between each specific category and the corresponding features after dimension reduction, while the CCA module can be viewed as an adaptive reconstruction of the class-channel information. It is worth mentioning that the CCA module takes the class affinity matrix and the class attention map, both generated by the CAA module, as its inputs and outputs an adaptively weighted class affinity matrix. Ideally, given the input feature map $X \in \mathbb{R}^{C \times H \times W}$, in which $C$, $H$ and $W$ denote the number of channels, the height and the width of the feature map respectively, the CAA module embedded with the CCA module can effectively extract the class-channel correlation and adaptively aggregate long-range contextual information from a categorical view, eventually outputting a feature map of the same size following the self-attention scheme Wang et al. (2018). The lower branch of the network, the RSA module, is proposed with the intuition of decomposing the dense point-wise affinity matrix into two sparse region-based counterparts, either of which can efficiently capture the global context in a sparser way via adaptive average pooling. With the combination of the two affinity matrices, the RSA module captures abundant spatial contextual information of the local input feature $X$ and then outputs the feature $Y$. Finally, we concatenate the output features of the two branches and the local feature representation to obtain better feature representations; the fused features are then fed into a classifier to generate the fine segmentation map.
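The overall data flow described in this overview can be summarized with a minimal PyTorch-style sketch; the fusion head, channel widths and upsampling choice below are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HMANetSketch(nn.Module):
    """Minimal sketch of the data flow described above (not the authors' code).

    `backbone` is assumed to be a dilated ResNet-101 whose stage-4 output has
    stride 8; `caa_cca` and `rsa` stand in for the CAA (with CCA) and RSA branches.
    """
    def __init__(self, backbone, caa_cca, rsa, channels=2048, num_classes=6):
        super().__init__()
        self.backbone = backbone          # fully convolutional feature extractor
        self.caa_cca = caa_cca            # upper branch: category-based attention
        self.rsa = rsa                    # lower branch: region shuffle attention
        self.classifier = nn.Sequential(  # fuses local feature + two branch outputs
            nn.Conv2d(3 * channels, 512, 3, padding=1, bias=False),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, num_classes, 1),
        )

    def forward(self, image):
        x = self.backbone(image)                      # local feature X at 1/8 resolution
        y_cls = self.caa_cca(x)                       # category-augmented feature
        y_reg = self.rsa(x)                           # region-wise contextual feature
        fused = torch.cat([x, y_cls, y_reg], dim=1)   # concatenation with local feature
        logits = self.classifier(fused)
        # upsample the score map back to the input resolution
        return F.interpolate(logits, size=image.shape[2:],
                             mode='bilinear', align_corners=False)
```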

3.2 Class Augmented Attention

Figure 3: The details of class augmented attention module (Best viewed in color).

The self-attention mechanism is essentially a matrix multiplication in which the two dimensions are the number of channels and the product of the height and width of the input feature map. A standard channel affinity matrix of size $C \times C$ can be obtained by contracting over the $HW$ dimension, thereby generating a channel attention map, as in the channel attention module of DANet Fu et al. (2019). Intuitively, the definition of the non-local operation constrains such channel attention modules to operate purely on channel scaling through the query, key and value operations. Nevertheless, category information can be introduced if one of the operands is replaced by the channels corresponding to a segmentation map supervised by the ground truth, while retaining the query, key and value transformation functions.

The intuition of the proposed class augmented attention is to capture long-range contextual information from the perspective of category information, that is, to explicitly model the relationships between each category in the dataset and each channel of the input feature cube. Next, we will elaborate the process to capture class contextual dependencies.

As shown in Fig. 3, given a local feature $X \in \mathbb{R}^{C \times H \times W}$ output from stage-4 of ResNet in our implementation, the class augmented attention module first applies two convolutional layers to generate two feature maps $X' \in \mathbb{R}^{C' \times H \times W}$ and $P \in \mathbb{R}^{N \times H \times W}$, respectively, where $C'$ is the reduced channel number of the local feature for lower computational cost and $P$ is the class attention map supervised by the ground-truth segmentation. For each channel $i$ of $P$, $P_i$ represents the confidence that the pixels at all positions belong to class $i$, where $N$ is the number of categories, and $X'_j$ denotes the $j$-th plane of $X'$ along the channel dimension. Then, we can generate the class affinity map by aggregating over all positions in the spatial dimension of $P$ and $X'$ after a softmax layer. The class affinity operation is defined as follows:

$$a_{ij} = \sum_{k=1}^{H \times W} P_{i,k}\, X'_{j,k} \qquad\qquad (1)$$

where $a_{ij}$ denotes the explicit class correlation between $P_i$ and $X'_j$, $i \in \{1, \dots, N\}$, $j \in \{1, \dots, C'\}$, and $k$ indexes the spatial positions. Then, we apply a softmax operation along the class dimension to generate the class affinity map $\tilde{A} \in \mathbb{R}^{N \times C'}$.

The final class augmented object representation can be formulated as below:

$$Y_j = \rho\Big(\sum_{i=1}^{N} \tilde{a}_{ij}\, \psi(P)_i\Big) + X_j \qquad\qquad (2)$$

in which $Y_j$ denotes the $j$-th feature plane of the output feature map $Y$, $\tilde{a}_{ij}$ is a scalar value of $\tilde{A}$ after the softmax layer, and $\psi$ and $\rho$ are transformation functions implemented by $1 \times 1$ convolutions. The original local feature $X$ is added to enhance the adaptive feature representation. Eq. (2) indicates that the final representation of each channel is a category-based weighted sum of all channels of the class attention map, which models the category-based semantic dependencies between feature maps. That is to say, the proposed CAA module improves the perception and discriminability of class-level information in a straightforward manner.
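A rough PyTorch sketch of the CAA computation described above is given below; the exact placement of the softmax layers and the shapes of the 1×1 projections ψ and ρ are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassAugmentedAttention(nn.Module):
    """Hedged sketch of the CAA module; the exact transforms are assumptions.

    C: input channels, Cr: reduced channels, N: number of classes. The class
    attention map P is also returned so it can be supervised by the
    ground-truth segmentation with an auxiliary loss.
    """
    def __init__(self, in_channels=2048, reduced=512, num_classes=6):
        super().__init__()
        self.phi = nn.Conv2d(in_channels, reduced, 1)            # channel reduction -> X'
        self.to_class = nn.Conv2d(in_channels, num_classes, 1)   # class attention map P
        self.psi = nn.Conv2d(reduced, in_channels, 1)            # project back to C channels

    def forward(self, x):
        b, _, h, w = x.shape
        xr = self.phi(x).flatten(2)                # (b, Cr, HW)
        p = self.to_class(x)                       # (b, N, H, W), supervised by GT
        p_flat = F.softmax(p.flatten(2), dim=-1)   # normalise over positions
        affinity = torch.bmm(p_flat, xr.transpose(1, 2))   # (b, N, Cr) class-channel correlation
        affinity = F.softmax(affinity, dim=1)               # softmax along the class dimension
        # each reduced channel becomes a class-weighted sum of the class attention map
        y = torch.bmm(affinity.transpose(1, 2), p.flatten(2))  # (b, Cr, HW)
        y = self.psi(y.view(b, -1, h, w))                       # back to C channels
        return x + y, p, affinity                               # residual connection
```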

3.3 Class Channel Attention

The high-level semantics of CNNs are generally considered to be embedded in the channel dimension, where each channel map of the deep features can be regarded as a class-related response. Additionally, recent works Hu et al. (2018); Wang et al. (2019) have demonstrated the effectiveness of modeling channel correlation in classification and segmentation tasks. Therefore, we propose a class channel attention (CCA) module to exploit class-channel dependencies and generate a new class affinity map with rich and adaptive contextual information, which is efficiently embedded in the CAA module with few extra parameters.

The main structure of the class channel attention module is illustrated in Fig. 4. Given the class attention map $P$ and the class affinity map $\tilde{A}$ output from the CAA module above, the adaptive class-channel statistical representation can be formulated as follows:

$$z = \sigma\big(g(P)\big) \qquad\qquad (3)$$

where $g(\cdot)$ is the channel-wise global average pooling used to generate class-related statistics and $\sigma$ is the Sigmoid activation. Given $z \in \mathbb{R}^{N}$, the key adaptive feature recalibration function is defined as:

$$w = W_2\,\delta(W_1 z) \qquad\qquad (4)$$

in which $W_1$ and $W_2$ are two linear transformations, namely a dimensionality-ascending layer with ratio $r$ (this parameter is discussed in Section 4.4.2) that augments the feature representation and a layer that squeezes it back, respectively, and $\delta$ denotes the ReLU function.

The final output of the CCA module is obtained by recalibrating the original class affinity map with the weighting factor $w$:

$$\tilde{A}' = \gamma\,(w \odot \tilde{A}) + \tilde{A} \qquad\qquad (5)$$

where $\odot$ denotes class-wise scaling and $\gamma$ is a learnable parameter initialized to 0. The residual connection retains the original representation; thus the module can be integrated into the standard CAA module above without breaking its initial behavior, which efficiently helps the adaptive recalibration of class-level information.
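The CCA recalibration can be sketched as follows; interpreting the ascending ratio as a multiplier of the class dimension, taking r = 150 from the ablation in Section 4.4.2 as the default, and the exact activation placement are assumptions.

```python
import torch
import torch.nn as nn

class ClassChannelAttention(nn.Module):
    """Hedged sketch of the CCA module (SE-style recalibration over classes).

    Takes the class attention map P (b, N, H, W) and the class affinity map
    A (b, N, Cr) produced by CAA, and returns a recalibrated affinity map.
    """
    def __init__(self, num_classes=6, ratio=150):
        super().__init__()
        hidden = num_classes * ratio                 # dimensionality-ascending layer
        self.fc1 = nn.Linear(num_classes, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)    # squeeze back to N classes
        self.gamma = nn.Parameter(torch.zeros(1))    # learnable residual weight

    def forward(self, p, affinity):
        # channel-wise global average pooling -> class-related statistics
        z = torch.sigmoid(p.mean(dim=(2, 3)))                   # (b, N)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))    # (b, N) recalibration weights
        recalibrated = affinity * w.unsqueeze(-1)    # scale each class row of the affinity map
        return affinity + self.gamma * recalibrated  # residual keeps the initial behaviour
```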

Figure 4: Diagram of class channel attention module.

3.4 Region Shuffle Attention

Attention-based neural networks built on spatial point-wise correlation aim to capture long-range contextual dependencies through the self-attention mechanism or its variants, eventually generating a dense affinity matrix. In contrast, the key point of the proposed region shuffle attention is to harvest region-wise dependencies, as well as their counterparts after recombination, in a sparse and efficient manner. We illustrate our approach via a simple schematic in Fig. 5.

Region Representations. We partition the input feature maps into regions via a permutation operation, each of which is fed into an adaptive global average pooling layer to obtain the region representations afterwards. Then, we merge the point-wise representations of the regions to generate a sparse representation of the whole input feature. Therefore, the self-attention on the original input features can be effectively replaced with the same attention towards the merged counterparts for convenience.

Shuffle Attention Representations. Although self-attention on the merged features can empirically capture long-range contextual information from all positions, the pixel-to-pixel connections remain ambiguous. In order to exploit more explicit contextual dependencies from a regional perspective, we apply a shuffle attention that alternately pools the corresponding sub-regions and computes their self-attention representations, achieving a complementary representation of spatial information. Further experiments show that cascading the attention-weighted representations of the two sub-region groupings effectively enhances the contextual dependencies and is superior to the pixel-wise non-local operator.

As illustrated in Fig. 5, we first divide the input feature $X$ into $N_h \times N_w$ partitions, each containing $G_h \times G_w$ positions, where each partition is a subset of $X$. Then, we merge the point statistics after global average pooling to obtain the sparse representation $\bar{X}$. We apply self-attention on $\bar{X}$ following the non-local operation Wang et al. (2018) as below:

$$\bar{A} = \delta\big(\theta(\bar{X})^{\top}\phi(\bar{X})\big) \qquad\qquad (6)$$
$$\bar{Y} = \gamma\,\bar{X}\bar{A} + \bar{X} \qquad\qquad (7)$$

where $\bar{A}$ is a sparse affinity matrix based on global information and $\bar{Y}$ is the weighted output feature. Here, $\theta$ and $\phi$ are transformation functions implemented by $1 \times 1$ convolutions, $\delta$ represents the softmax function, and $\gamma$ is a learnable parameter initialized to 0.

The regional weighted representation can be obtained by region-wise multiplication of $\bar{A}$ and $X$. We apply another permutation to regroup the representations; the regrouped feature is then fed into the same region-wise attention block to generate the final representation $Y$.

In general, the proposed region shuffle attention module makes up for the main deficiency of the non-local block, namely its huge memory footprint. Additionally, it can be plugged into any existing architecture at any stage without degrading its original performance, and it can be optimized in an end-to-end manner.
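The following sketch illustrates the region-wise attention idea under simplifying assumptions: each region is pooled to a single descriptor, the region-level output is broadcast back to pixels by nearest-neighbor upsampling, and the shuffle step is approximated by a spatial roll; the actual permutation and broadcasting schemes of the RSA module may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionShuffleAttention(nn.Module):
    """Hedged sketch of the RSA module; the pixel broadcast and the shuffle
    pattern are assumptions made for illustration.

    `parts` is the number of partitions per spatial dimension (N_h = N_w).
    """
    def __init__(self, channels=2048, parts=8):
        super().__init__()
        self.parts = parts
        self.theta = nn.Conv2d(channels, channels // 8, 1)   # query transform
        self.phi = nn.Conv2d(channels, channels // 8, 1)     # key transform
        self.gamma = nn.Parameter(torch.zeros(1))             # residual weight

    def region_attention(self, x):
        b, c, h, w = x.shape
        # pool each of the parts x parts regions to a single descriptor
        xp = F.adaptive_avg_pool2d(x, self.parts)             # (b, c, P, P) sparse representation
        q = self.theta(xp).flatten(2)                         # (b, c/8, P*P)
        k = self.phi(xp).flatten(2)                           # (b, c/8, P*P)
        affinity = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)   # (b, P*P, P*P)
        v = xp.flatten(2)                                     # (b, c, P*P)
        y = torch.bmm(v, affinity.transpose(1, 2)).view(b, c, self.parts, self.parts)
        y = F.interpolate(y, size=(h, w), mode='nearest')     # broadcast region output to pixels
        return x + self.gamma * y

    def forward(self, x):
        y = self.region_attention(x)
        # shuffle: a second pass over a shifted grouping of regions (approximated by a roll)
        g = x.shape[2] // self.parts                          # positions per partition side
        y = torch.roll(y, shifts=(g // 2, g // 2), dims=(2, 3))
        y = self.region_attention(y)
        return torch.roll(y, shifts=(-(g // 2), -(g // 2)), dims=(2, 3))
```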

Figure 5: An example of region shuffle attention in which the number of partitions and the number of positions within each partition are equal.

3.5 Hybrid Multiple Attention Network

Integration of Attention Modules. In order to take full advantage of the three proposed attention modules, we aggregate the CAA module (embedded with the CCA module) and the RSA module in either a cascading or a parallel manner, and in both cases the attention outputs are concatenated with the local feature. Eventually, the concatenated feature is fed into a classifier to generate the final segmentation map.

Loss Function. Besides the conventional multi-class cross-entropy loss $\mathcal{L}_{seg}$, we use auxiliary supervision after stage-3 to improve the performance and ease optimization, following PSPNet Zhao et al. (2017). The class attention loss $\mathcal{L}_{class}$ from the CAA module is also employed as an extra auxiliary supervision. Finally, we use three parameters to balance these losses as follows:

$$\mathcal{L} = \lambda_{1}\,\mathcal{L}_{seg} + \lambda_{2}\,\mathcal{L}_{aux} + \lambda_{3}\,\mathcal{L}_{class} \qquad\qquad (8)$$

where $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are weights chosen to make the ranges of these loss values comparable.
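A hedged sketch of the joint objective in Eq. (8) is shown below; the balancing weights and the ignore label are placeholders, since their values are not given in this copy.

```python
import torch.nn as nn

def hmanet_loss(main_logits, aux_logits, class_attn_logits, target,
                lambda_seg=1.0, lambda_aux=0.4, lambda_cls=1.0):
    """Hedged sketch of the joint objective in Eq. (8).

    `aux_logits` comes from the auxiliary head after stage-3 and
    `class_attn_logits` is the class attention map P from the CAA module.
    All logits are assumed to be upsampled to the label resolution.
    """
    ce = nn.CrossEntropyLoss(ignore_index=255)   # ignore label is an assumption
    return (lambda_seg * ce(main_logits, target)
            + lambda_aux * ce(aux_logits, target)
            + lambda_cls * ce(class_attn_logits, target))
```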

4 Experiments

To validate the effectiveness of our proposed method, we conduct extensive experiments on two aerial image semantic segmentation benchmarks, i.e., the ISPRS 2D Semantic Labeling Challenge datasets for Vaihingen and Potsdam. In this section, we first introduce the datasets and implementation details, then we perform extensive ablation experiments on the ISPRS Vaihingen dataset. Finally, we report our results on the two datasets.

4.1 Datasets

Vaihingen. The Vaihingen dataset contains 33 orthorectified true orthophoto (TOP) image tiles with three spectral bands (red, green, near-infrared), plus a normalized digital surface model (DSM) of the same resolution. The dataset has a spatial resolution of 9 cm, with an average tile size of roughly 2494 × 2064 pixels, and involves five foreground object classes and one background class. We select 16 images for training and 17 for testing following previous works Mou et al. (2019); Maggiori et al. (2017); Volpi and Tuia (2016); Sherrah (2016); Marcos et al. (2018). Note that we do not use the DSM in our experiments.

Potsdam. The Potsdam 2D semantic labeling dataset is composed of 38 high-resolution images of size 6000 × 6000 pixels, with a spatial resolution of 5 cm. The dataset offers NIR-R-G-B channels together with the DSM and normalized DSM. There are 24 images in the training set and 16 images in the test set, covering the same object classes as the Vaihingen benchmark.

4.2 Evaluation Metrics

To evaluate the performance of the proposed network, we calculate the F1 score for the foreground object classes with the following formula:

$$F_{\beta} = (1+\beta^{2})\cdot\frac{\mathrm{precision}\cdot\mathrm{recall}}{\beta^{2}\cdot\mathrm{precision}+\mathrm{recall}} \qquad\qquad (9)$$

where $\beta$ is set as 1. Intersection over union (IoU) and overall accuracy (OA) are defined as:

$$\mathrm{IoU} = \frac{TP}{TP+FP+FN}\,,\qquad \mathrm{OA} = \frac{TP}{N} \qquad\qquad (10)$$

in which $TP$, $FP$ and $FN$ are the numbers of true positives, false positives and false negatives, respectively, and $N$ is the total number of pixels.

Notably, overall accuracy is computed for all categories including background for a comprehensive comparison with different models. Additionally, the evaluation is carried out using ground truth with eroded boundaries provided in the datasets following previous studies.
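The metrics above can be computed from per-class counts as in the following sketch; the boundary-eroded ground truth of the ISPRS protocol is assumed to be prepared beforehand.

```python
import numpy as np

def segmentation_scores(pred, gt, num_classes=6):
    """Hedged sketch of the per-class F1 / IoU and overall accuracy used here.

    `pred` and `gt` are integer label maps of the same shape.
    """
    f1, iou = np.zeros(num_classes), np.zeros(num_classes)
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        precision = tp / (tp + fp + 1e-10)
        recall = tp / (tp + fn + 1e-10)
        f1[c] = 2 * precision * recall / (precision + recall + 1e-10)   # F1 with beta = 1
        iou[c] = tp / (tp + fp + fn + 1e-10)
    oa = np.sum(pred == gt) / gt.size      # overall accuracy over all categories
    return f1, iou, oa
```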

4.3 Implementation Details

We use ResNet-101 He et al. (2016) pretrained on ImageNet as our backbone and employ a poly learning rate policy, where the initial learning rate is multiplied by $(1-\frac{iter}{iter_{max}})^{power}$ after each iteration, following prior works Chen et al. (2017a); Li et al. (2019); Huang et al. (2019b). The initial learning rate is set to 0.01 for all datasets. Momentum and weight decay coefficients are set to 0.9 and 0.0005, respectively. We replace the standard BatchNorm with InPlace-ABN Rota Bulò et al. (2018) to synchronize the mean and standard deviation of BatchNorm across multiple GPUs. For data augmentation, we apply random horizontal flipping, random scaling (from 0.5 to 2.0) and random cropping over all the training images, with the same input crop size for all datasets. We train on NVIDIA Tesla P100 GPUs. For semantic segmentation, we choose FCN Long et al. (2015) as our baseline.
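The poly learning-rate policy mentioned above corresponds to the following schedule; the exponent 0.9 is the value commonly used with this policy and is an assumption here, while the base rate 0.01 follows the paper.

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly learning-rate policy: base_lr * (1 - iter/max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# e.g. the learning rate halfway through training:
# poly_lr(0.01, 40000, 80000)  ->  about 0.0054
```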

4.4 Experiments on Vaihingen Dataset

4.4.1 Ablation Study for Attention Modules

In the proposed HMANet, the three attention modules are employed on top of the dilated network to exploit global contextual representations from the perspectives of category, channel and region-wise position. To verify the contribution of each attention module, we conduct extensive experiments with different settings, reported in Table 1. Besides, we further investigate two integration patterns, i.e., the parallel and cascading fashions, to adaptively accomplish information propagation.

As illustrated in Table 1, the proposed attention modules bring remarkable improvements compared with the baseline FCN. We observe that using only the class augmented attention module yields 90.77% in overall accuracy and 82.45% in mean IoU, which brings 4.26% and 9.76% improvements in OA and mIoU, respectively. Meanwhile, employing region shuffle attention alone outperforms the baseline by 4.28% in OA and 9.8% in mIoU. Furthermore, when we integrate two of the attention modules, the performance of our network is further boosted. Finally, integrating all three attention modules behaves best, improving the segmentation performance over the baseline by 4.47% in OA and 10.18% in mIoU. In summary, our approach brings great benefit to object segmentation by exploiting global context from different perspectives.

We further investigate the effect of different aggregation schemes for HMANet. As shown in Table 2, ResNet-101 + Parallel-C-R, corresponding to the schematic diagram in Fig. 2, achieves the best performance, i.e., 90.98% in overall accuracy and 82.87% in mean IoU, while the two cascading integration patterns, “+Cascade-C-R” and “+Cascade-R-C”, achieve 90.88% and 90.76% in overall accuracy, respectively. The results suggest that the region-wise attention representation hinders the extraction of category information only in the case of a direct serial connection.

Method CAA CCA RSA OA(%) mIoU(%)
Baseline - - - 86.51 72.69
HMANet ✓ - - 90.77 82.45
HMANet - - ✓ 90.79 82.49
HMANet ✓ ✓ - 90.85 82.54
HMANet ✓ - ✓ 90.87 82.61
HMANet ✓ ✓ ✓ 90.98 82.87
Table 1: Ablation study for attention modules on the Vaihingen test set. CAA represents the class augmented attention module, CCA the class channel attention module, and RSA the region shuffle attention module.
Method OA(%) mIoU(%)
ResNet-101 Baseline 86.51 72.69
ResNet-101 + Cascade-C-R 90.88 82.62
ResNet-101 + Cascade-R-C 90.76 82.45
ResNet-101 + Parallel-C-R 90.98 82.87
Table 2: Comparison between different integration patterns. Cascade-C-R indicates that the CAA module (embedded with CCA) is followed by the RSA module, and vice versa for Cascade-R-C. Parallel-C-R indicates that CAA (embedded with CCA) and RSA are appended on top of ResNet-101 in parallel.
Ratio OA(%) mIoU(%)
50 90.78 82.48
75 90.80 82.49
100 90.82 82.52
125 90.84 82.53
150 90.85 82.54
175 90.83 82.52
200 90.81 82.50
Table 3: Performance on Vaihingen test set for different ascending ratio in CCA module.
Method $N_h$ $N_w$ OA(%) mIoU(%)
Baseline - - 86.51 72.69
RSA 16 16 90.70 82.35
16 8 90.75 82.44
8 16 90.77 82.47
8 8 90.79 82.49
8 4 90.78 82.47
4 8 86.76 82.46
4 4 90.75 82.44
Table 4: Effect of the partition numbers $N_h$ and $N_w$ within the region shuffle attention module.
Method Imp. surf. Building Low veg. Tree Car mean OA(%) mIoU(%)
FCN Long et al. (2015) 88.67 92.83 76.32 86.67 74.21 83.74 86.51 72.69
UZ_1 Volpi and Tuia (2016) 89.20 92.50 81.60 86.90 57.30 81.50 87.30 -
RoteEqNet Marcos et al. (2018) 89.50 94.80 77.50 86.50 72.60 84.18 87.50 -
S-RA-FCN Mou et al. (2019) 91.47 94.97 80.63 88.57 87.05 88.54 89.23 79.76
DANet Fu et al. (2019) 91.63 95.02 83.25 88.87 87.16 89.19 89.85 80.53
V-FuseNet Audebert et al. (2018) 92.00 94.40 84.50 89.90 86.30 89.42 90.00 -
DLR_9 Marmanis et al. (2018) 92.40 95.20 83.90 89.90 81.20 88.52 90.30 -
TreeUNet Yue et al. (2019) 92.50 94.90 83.60 89.60 85.90 89.30 90.40 -
DeepLabV3+ Chen et al. (2017b) 92.38 95.17 84.29 89.52 86.47 89.57 90.56 81.47
PSPNet Zhao et al. (2017) 92.79 95.46 84.51 89.94 88.61 90.26 90.85 82.58
ACFNet Zhang et al. (2019) 92.93 95.27 84.46 90.05 88.64 90.27 90.90 82.68
BKHN11 92.90 96.00 84.60 89.90 88.60 90.40 91.00 -
CASIA2 Liu et al. (2018) 93.20 96.00 84.70 89.90 86.70 90.10 91.10 -
CCNet Huang et al. (2019b) 93.29 95.53 85.06 90.34 88.70 90.58 91.11 82.76
HMANet (Ours) 93.50 95.86 85.41 90.40 89.63 90.96 91.44 83.49
Table 5: Comparisons with state-of-the-arts on Vaihingen test set.

4.4.2 Ablation Study for Sub-parameters and Efficiency

Ascending ratio. The ascending ratio $r$ introduced in Section 3.3 is a hyper-parameter that controls the scale of the feature transformation. As the choice of ascending ratio does not have much effect on the computational cost, we only investigate the performance over a range of values. As shown in Table 3, our approach consistently outperforms the baseline under different choices of this hyper-parameter, among which $r = 150$ achieves slightly better results than the others. All the experiments above take ResNet-101 as the backbone and use the same training and testing settings.

Effect of the Partition Numbers. We further investigate the effect of different partition numbers in the proposed region shuffle attention module, i.e., $N_h$ and $N_w$. We conduct extensive experiments with various choices of $N_h$ and $N_w$ and present the corresponding results in Table 4. Note that the partition numbers and the numbers of positions per partition are mutually constrained, so we only need to determine $N_h$ and $N_w$. We can see that the performance is robust over a range of partition numbers, among which the choice $N_h = N_w = 8$ achieves the best results of 90.79% in overall accuracy and 82.49% in mean IoU. Empirically, the output stride of the backbone is set to 8, that is, the height/width of the input feature is 64 pixels in our experiments, thus a moderate choice of grouping is more conducive to the self-attention weighted representation of each region. In practice, a single partition number may not be optimal for every setup (due to the distinct roles played by different base FCNs and different training settings, e.g., output stride and input size), so further improvements may be achievable by tuning the partition numbers to meet the needs of the given base architecture.

Comparison with Context Aggregation Approaches. We compare the performance of several well-verified context aggregation approaches, i.e., Atrous Spatial Pyramid Pooling (ASPP) in DeepLabv3 Chen et al. (2017b), the Pyramid Pooling Module (PPM) in PSPNet Zhao et al. (2017), RCCA in CCNet Huang et al. (2019b) and Self-Attention in non-local networks Wang et al. (2018). All of these experiments are conducted under the same training/testing settings for fairness. We report the results in Table 6. Our HMANet outperforms the other context aggregation approaches, which demonstrates the effectiveness of capturing global contextual information from different perspectives.

Efficiency Comparison. We further compare our proposed class augmented attention module and region shuffle attention module with ASPP Chen et al. (2017a, b), PPM Zhao et al. (2017), SA Wang et al. (2018), RCCA Huang et al. (2019b), OCR Yuan et al. (2019) and ISA Huang et al. (2019a) in terms of efficiency, including parameters, GPU memory and computation cost (GFLOPs). We report the results in Table 7. Notably, we evaluate the cost of all the above methods without counting the backbone, while including the cost of the convolution used for dimension reduction, to ensure a fair comparison. As shown in Table 7, compared with the standard Self-Attention (SA) mechanism, our RSA module requires less GPU memory and significantly reduces FLOPs by about 77% with fewer parameters, which proves the efficiency of region-wise representations in capturing long-range contextual information.

Method OA(%) mIoU(%)
Baseline 86.51 72.69
+ ASPP (Our impl.) Chen et al. (2017b) 90.51 81.39
+ PPM (Our impl.) Zhao et al. (2017) 90.82 82.52
+ Self-Attention (Our impl.) Wang et al. (2018) 90.62 82.17
+ RCCA (Our impl.) Huang et al. (2019b) 90.76 82.45
+ Ours 90.98 82.87
Table 6: Comparison with context aggregation approaches.
Method Params(M) Memory(MB) GFLOPs
ASPP Chen et al. (2017b) 15.1 284 503
PPM Zhao et al. (2017) 22.0 792 619
SA Wang et al. (2018) 10.5 2168 619
RCCA Huang et al. (2019b) 10.6 427 804
OCR Yuan et al. (2019) 10.5 202 354
ISA Huang et al. (2019a) 11.8 252 386
CAA(Ours) 9.3 283 148
RSA(Ours) 3.8 110 144
Table 7: Efficiency comparison with ASPP, PPM, Self-Attention, RCCA, OCR and ISA when processing a fixed-size input feature map during the inference stage.

4.4.3 Comparison with State-of-the-art

We first adopt some common strategies to improve performance following Fu et al. (2019); Yuan and Wang (2018); Li et al. (2019). (1) DA: Data augmentation with random scaling (from 0.5 to 2.0) and random left-right flipping. (2) Multi-Grid: We employ hierarchical grids of different sizes (1,2,4) within stage-4 of ResNet-101. (3) MS + Flip: We average the segmentation score maps from 5 image scales and left-right flipping counterparts during inference.
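The MS + Flip inference step can be sketched as follows; the particular set of five scales is an assumption, and the model is assumed to output class score maps at the resolution of its input.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ms_flip_inference(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Average score maps over several image scales and their flipped versions.

    The five scale factors are placeholders; `model(image)` is assumed to
    return class logits with the same spatial size as its input.
    """
    _, _, h, w = image.shape
    avg = 0.0
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode='bilinear',
                                align_corners=False)
        for flip in (False, True):
            inp = torch.flip(resized, dims=[3]) if flip else resized
            logits = model(inp)
            if flip:
                logits = torch.flip(logits, dims=[3])   # flip the scores back
            # resize scores to the original resolution before averaging
            avg = avg + F.interpolate(logits, size=(h, w), mode='bilinear',
                                      align_corners=False)
    return avg / (2 * len(scales))
```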

Experimental results are shown in Table 8. We successively adopt the above strategies to obtain better object representations, which yields improvements of 0.19%, 0.11% and 0.16% in overall accuracy, respectively.

We further compare our method with existing methods on the Vaihingen test set. Notably, most of these methods adopt the same backbone (ResNet-101) as ours. Results are shown in Table 5. It can be seen that our HMANet outperforms other context aggregation methods and attention-based methods by a large margin, while being much more efficient in parameters, memory and GFLOPs. In particular, our F1 score on Car is much higher than those of other approaches; it improves over the second best, CCNet, by 0.93%, which demonstrates the effectiveness of capturing category-based information and global region-wise correlation.

Figure 6: Qualitative comparisons between our method and baseline on Vaihingen test set.
Method DA MG MS + Flip OA(%) mIoU(%)
HMANet - - - 90.98 82.87
HMANet ✓ - - 91.17 83.11
HMANet ✓ ✓ - 91.28 83.27
HMANet ✓ ✓ ✓ 91.44 83.49
Table 8: Performance comparison between data augmentation (DA), multi-grid (MG) and multi-scale testing with horizontal flipping (MS + Flip). We report the results on the Vaihingen test set.
Method Imp. surf. Building Low veg. Tree Car mean OA(%) mIoU(%)
FCN Long et al. (2015) 88.61 93.29 83.29 79.83 93.02 87.61 85.59 78.34
UZ_1 Volpi and Tuia (2016) 89.30 95.40 81.80 80.50 86.50 86.70 85.80 -
S-RA-FCN Mou et al. (2019) 91.33 94.70 86.81 83.47 94.52 90.17 88.59 82.38
DANet Fu et al. (2019) 91.50 95.83 87.21 88.79 95.16 91.70 90.56 83.77
V-FuseNet Audebert et al. (2018) 92.70 96.30 87.30 88.50 95.40 92.04 90.60 -
Multi-filter CNN Sun et al. (2018) 90.94 96.98 76.32 73.37 88.55 85.23 90.65 -
TreeUNet Yue et al. (2019) 93.10 97.30 86.60 87.10 95.80 91.98 90.70 -
DeepLabV3+ Chen et al. (2017b) 92.95 95.88 87.62 88.15 96.02 92.12 90.88 84.32
CASIA3 Liu et al. (2018) 93.40 96.80 87.60 88.30 96.10 92.44 91.00 -
PSPNet Zhao et al. (2017) 93.36 96.97 87.75 88.50 95.42 92.40 91.08 84.88
BKHN3 93.30 97.20 88.00 88.50 96.00 92.60 91.10 -
AMA_1 93.40 96.80 87.70 88.80 96.00 92.54 91.20 -
CCNet Huang et al. (2019b) 93.58 96.77 86.87 88.59 96.24 92.41 91.47 85.65
HUSTW4 Sun et al. (2019) 93.60 97.60 88.50 88.80 94.60 92.62 91.60 -
SWJ_2 94.40 97.40 87.80 87.60 94.70 92.38 91.70 -
HMANet (Ours) 93.85 97.56 88.65 89.12 96.84 93.20 92.21 87.28
Table 9: Numerical comparisons with state-of-the-arts on Potsdam test set.
Figure 7: Visualization results of HMANet on Potsdam test set.

4.4.4 Visualization Results

We provide qualitative comparisons between our HMANet and the baseline network in Fig. 6, including patches of two different sizes. In particular, we use red dashed boxes to mark challenging regions that are easily misclassified. It can be seen that our method outperforms the baseline by a large margin. HMANet predicts more accurate segmentation maps, that is, it obtains finer boundary information and maintains object coherence, which demonstrates the effectiveness of modeling category-based correlation and region-wise representations.

4.5 Experiments on Potsdam Dataset

In this section, we carry out experiments on the ISPRS Potsdam benchmark to further evaluate the effectiveness of HMANet. Empirically, we adopt the same training and testing settings as on the Vaihingen dataset. Numerical comparisons with previous state-of-the-art methods are shown in Table 9. Remarkably, HMANet achieves 92.21% in overall accuracy and 87.28% in mean IoU. Notably, we compare the two available types of input images, i.e., the RGB and IRRG color modes. Results show that the former obtains better segmentation maps.

In addition, qualitative results are presented in Fig. 7. It can be seen that HMANet produces better segmentation maps than the baseline. We mark the improved regions with red dashed boxes (Best viewed in color).

5 Conclusion

In this paper, we propose a novel attention-based framework for dense prediction tasks in the field of remote sensing, namely the Hybrid Multiple Attention Network (HMANet), which adaptively captures global contextual information from the perspectives of space, channel and category. In particular, we introduce a class augmented attention module embedded with a class channel attention module to compute category-based correlation and adaptively recalibrate the class-level information. Additionally, to address feature redundancy and improve the efficiency of the self-attention mechanism, a region shuffle attention module is presented to obtain robust region-wise representations. Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate the effectiveness and efficiency of the proposed HMANet.

References

  • N. Audebert, B. Le Saux, and S. Lefèvre (2018) Beyond rgb: very high resolution urban remote sensing with multimodal deep networks. ISPRS Journal of Photogrammetry and Remote Sensing 140, pp. 20–32. Cited by: Table 5, Table 9.
  • V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2481–2495. Cited by: §2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.
  • Z. Cao, K. Fu, X. Lu, W. Diao, H. Sun, M. Yan, H. Yu, and X. Sun (2019) End-to-end dsm fusion networks for semantic segmentation in high-resolution aerial images. IEEE Geoscience and Remote Sensing Letters. Cited by: §2.
  • L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017a) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §1, §2, §4.3, §4.4.2.
  • L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §1, §1, §2, §4.4.2, §4.4.2, Table 5, Table 6, Table 7, Table 9.
  • L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018a) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §2.
  • Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng (2018b) A²-nets: double attention networks. In Advances in Neural Information Processing Systems, pp. 352–361. Cited by: §1, §2.
  • B. Cheng, L. Chen, Y. Wei, Y. Zhu, Z. Huang, J. Xiong, T. S. Huang, W. Hwu, and H. Shi (2019) SPGNet: semantic prediction guidance for scene parsing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5218–5228. Cited by: §2.
  • J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3146–3154. Cited by: §1, §2, §3.1, §3.2, §4.4.3, Table 5, Table 9.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2, §4.3.
  • J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §3.3.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2.
  • L. Huang, Y. Yuan, J. Guo, C. Zhang, X. Chen, and J. Wang (2019a) Interlaced sparse self-attention for semantic segmentation. arXiv preprint arXiv:1907.12273. Cited by: §1, §2, §4.4.2, Table 7.
  • Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019b) Ccnet: criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 603–612. Cited by: §1, §2, §3.1, §4.3, §4.4.2, §4.4.2, Table 5, Table 6, Table 7, Table 9.
  • X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu (2019) Expectation-maximization attention networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9167–9176. Cited by: §1, §2, §3.1, §4.3, §4.4.3.
  • G. Lin, A. Milan, C. Shen, and I. Reid (2017) Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1925–1934. Cited by: §2.
  • Y. Liu, B. Fan, L. Wang, J. Bai, S. Xiang, and C. Pan (2018) Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS Journal of Photogrammetry and Remote Sensing 145, pp. 78–95. Cited by: §1, §2, Table 5, Table 9.
  • J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1, §1, §3.1, §4.3, Table 5, Table 9.
  • E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez (2017) High-resolution aerial image labeling with convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing 55 (12), pp. 7092–7103. Cited by: §4.1.
  • D. Marcos, M. Volpi, B. Kellenberger, and D. Tuia (2018) Land cover mapping at very high resolution with rotation equivariant cnns: towards small yet accurate models. ISPRS journal of photogrammetry and remote sensing 145, pp. 96–107. Cited by: §4.1, Table 5.
  • D. Marmanis, K. Schindler, J. D. Wegner, S. Galliani, M. Datcu, and U. Stilla (2018) Classification with an edge: improving semantic image segmentation with boundary detection. ISPRS Journal of Photogrammetry and Remote Sensing 135, pp. 158–172. Cited by: §2, Table 5.
  • L. Mi and Z. Chen (2020) Superpixel-enhanced deep neural forest for remote sensing image semantic segmentation. ISPRS Journal of Photogrammetry and Remote Sensing 159, pp. 140–152. Cited by: §1, §2.
  • L. Mou, Y. Hua, and X. X. Zhu (2019) A relation-augmented fully convolutional network for semantic segmentation in aerial scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 12416–12425. Cited by: §2, §4.1, Table 5, Table 9.
  • C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun (2017) Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4353–4361. Cited by: §2.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2.
  • S. Rota Bulò, L. Porzi, and P. Kontschieder (2018) In-place activated batchnorm for memory-optimized training of dnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5639–5647. Cited by: §4.3.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §2.
  • J. Sherrah (2016) Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. arXiv preprint arXiv:1606.02585. Cited by: §4.1.
  • Y. Sun, Y. Tian, and Y. Xu (2019) Problems of encoder-decoder frameworks for high-resolution remote sensing image segmentation: structural stereotype and insufficient learning. Neurocomputing 330, pp. 297–304. Cited by: Table 9.
  • Y. Sun, X. Zhang, Q. Xin, and J. Huang (2018) Developing a multi-filter convolutional neural network for semantic segmentation using high-resolution aerial imagery and lidar data. ISPRS journal of photogrammetry and remote sensing 143, pp. 3–14. Cited by: Table 9.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.
  • M. Volpi and D. Tuia (2016) Dense semantic labeling of subdecimeter resolution images with convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing 55 (2), pp. 881–893. Cited by: §4.1, Table 5, Table 9.
  • Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu (2019) ECA-net: efficient channel attention for deep convolutional neural networks. arXiv preprint arXiv:1910.03151. Cited by: §3.3.
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §1, §3.1, §3.4, §4.4.2, §4.4.2, Table 6, Table 7.
  • C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018a) Bisenet: bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 325–341. Cited by: §2.
  • C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018b) Learning a discriminative feature network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1857–1866. Cited by: §2.
  • Y. Yuan, X. Chen, and J. Wang (2019) Object-contextual representations for semantic segmentation. arXiv preprint arXiv:1909.11065. Cited by: §2, §3.1, §4.4.2, Table 7.
  • Y. Yuan and J. Wang (2018) Ocnet: object context network for scene parsing. arXiv preprint arXiv:1809.00916. Cited by: §4.4.3.
  • K. Yue, L. Yang, R. Li, W. Hu, F. Zhang, and W. Li (2019) TreeUNet: adaptive tree convolutional neural networks for subdecimeter aerial image segmentation. ISPRS Journal of Photogrammetry and Remote Sensing 156, pp. 1–13. Cited by: §2, Table 5, Table 9.
  • F. Zhang, Y. Chen, Z. Li, Z. Hong, J. Liu, F. Ma, J. Han, and E. Ding (2019) ACFNet: attentional class feature network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6798–6807. Cited by: §2, §3.1, Table 5.
  • H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §1, §1, §2, §3.5, §4.4.2, §4.4.2, Table 5, Table 6, Table 7, Table 9.
  • Z. Zhu, M. Xu, S. Bai, T. Huang, and X. Bai (2019) Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 593–602. Cited by: §2.