MAP-Net: Multi Attending Path Neural Network for Building Footprint Extraction from Remote Sensed Imagery

by   Qing Zhu, et al.

Accurately and efficiently extracting building footprints from a wide range of remote sensed imagery remains a challenge due to the complex structures, variety of scales and diverse appearances of buildings. Existing convolutional neural network (CNN)-based building extraction methods often fail to detect tiny buildings because the spatial information in CNN feature maps is lost during repeated pooling operations, and the segmentation edges of large buildings remain inaccurate. Moreover, features extracted by a CNN are always partial, restricted by the size of the receptive field, so large-scale buildings with low texture are often extracted discontinuously and with holes. This paper proposes a novel multi attending path neural network (MAP-Net) for accurately extracting multiscale building footprints and precise boundaries. MAP-Net learns spatial localization-preserved multiscale features through multiple parallel paths, each generated gradually stage by stage to extract high-level semantic features at a fixed resolution. An attention module then adaptively squeezes channel-wise features from each path for optimization, and a pyramid spatial pooling module captures global dependency to refine discontinuous building footprints. Experimental results show that MAP-Net outperforms state-of-the-art (SOTA) algorithms in boundary localization accuracy as well as continuity of large buildings. Specifically, our method achieved 0.68%, 1.74% and 1.46% precision and 1.53%, 1.50% and 0.82% IoU improvement without increasing computational complexity compared with the latest HRNetv2 on the Urban 3D, Deep Globe and WHU datasets, respectively. The TensorFlow implementation is available at







I Introduction

The rapid development of remote sensing technology has made it easier to acquire large numbers of high-resolution optical remote sensing images that support building footprint extraction over wide areas. Immediate and accurate building footprint information is significant for illegal-building monitoring, 3D building reconstruction, urban planning and disaster emergency response. Due to low inter-class variance and high intra-class variance of buildings in optical remote sensing imagery, parking lots, roads and other non-buildings are highly similar to buildings in appearance. With the variety of building materials, scales and illumination, the representation of buildings in remote sensing imagery shows significant differences. Therefore, accurately and efficiently extracting building footprints from remote sensing imagery remains a challenge.

Over the past two decades, numerous algorithms have been proposed to extract building footprints. They can be divided into two categories: traditional image processing based and CNN-based methods.

Traditional building extraction methods utilize the characteristics of the spectrum, texture, geometry and shadow [12, 28, 15, 5, 9, 44] to design feature operators for extracting buildings from optical images. Since these features vary under different illumination conditions, sensor types and building architectures, traditional methods can only resolve specific issues on specific data. [2, 13, 16, 35] combined optical imagery with GIS [34] or digital surface models (DSM) obtained from light detection and ranging (Lidar) or synthetic aperture radar interferometry [45] to distinguish non-building areas that are highly similar to buildings, which increased the robustness of building extraction, although acquiring the corresponding multi-source data over a wide range is always costly.

Because buildings in remote sensing images are diverse in structure, appearance and scale, building extraction algorithms have evolved from handcrafted feature-based methods to learned feature-based methods, such as deep convolutional neural networks (DCNNs). Moreover, deep networks have proven practical in designing CNN models. For instance, ResNet [19] with 152 layers introduced identity mapping in a residual block to alleviate the vanishing-gradient and degradation problems during propagation, making it possible to design deeper networks that extract richer semantic features.

Evolving from the CNN, fully convolutional network (FCN)-based [45],[21, 24, 39, 32, 11, 31, 4, 36, 1, 42] building extraction methods achieve impressive results and are often used in building semantic segmentation tasks. The encoder-decoder framework [4],[27, 23, 30, 33, 3, 25] obtains more accurate building footprints than FCN-based methods, particularly in boundary localization, since it recovers spatial details through skip connections that fuse shallow high-resolution features in the decoder stage. Nevertheless, the coarse features introduced by shallow layers become the main obstacle to accurate building boundaries. [23, 30, 7, 29] used a conditional random field (CRF) for post-processing to refine the edges of footprints, which achieved great improvements on building boundaries.

For the problem of multiscale building extraction, [36], [26, 23] integrated hierarchical results extracted from multiple models or designed specific CNN architectures to handle multiscale input for accurately extracting multiscale buildings, which improved performance but obviously increased computational complexity. [43] proposed a pyramid spatial pooling module that introduces several global pooling layers to capture multiscale features without significantly increasing computational complexity. It is more efficient than [36] and [23] in multiscale building extraction and improves continuity, since global dependency is captured by the pooling layers.

The attention mechanism [18, 14, 41, 6, 40, 8, 20] is another method for capturing global relations with long-range dependencies in the spatial or channel dimension, which effectively improves segmentation performance. Recently, HRNet [37, 38] proposed a high-resolution CNN to address multiscale feature extraction and achieved new state-of-the-art results in semantic segmentation.

In previous studies, CNN-based building footprint extraction algorithms have mainly been encoder-decoder based, losing spatial details in the encoder stage and recovering them by fusing shallow feature maps during the decoder stage. However, this causes inaccurate localization of building boundaries because of the coarse features introduced from shallow layers, and small buildings may go unrecognized. Additionally, the extracted features are always partial, restricted by the local receptive field, and large-scale buildings with low texture are often extracted discontinuously and with holes.

This research proposes MAP-Net to solve the problems described above. First, parallel paths are generated gradually, one per stage, to extract high-level semantics and preserve localization details through serial convolution blocks with fixed spatial resolution. Then, an attention mechanism-based feature enhancement module is introduced to adaptively squeeze channel-wise feature maps from each path for multiscale feature optimization. Pyramid spatial pooling operations then extract global semantic information to obtain continuous building footprints. The main contributions of this study are as follows:

  1. We propose a MAP-Net for efficient, accurate and exact multiscale building footprint boundary extraction through parallel localization-preserved convolutional networks.

  2. We introduce a channel-wise attention module to adaptively squeeze multiscale features extracted from multipath. These features strengthen the building representation by optimally combining global semantic and spatial localization.

  3. We validate the effectiveness of the proposed network and feature enhancement modules through extensive ablation studies.

  4. The proposed method outperforms other state-of-the-art algorithms, achieving 1.74%, 0.68% and 1.46% precision and 1.50%, 1.53% and 0.82% IoU improvement compared with the latest HRNetv2 [38] on the Deep Globe [10], Urban 3D [17] and WHU [22] datasets, respectively.

The rest of the paper is organized as follows. Section II introduces the detailed structure of the proposed network for building extraction. Section III describes the experiments and analyses the results. The discussions and conclusions of this paper are presented in Section IV.

II Methodology

II-A Overview

Fig. 1: Structure of the proposed MAP-Net, which is composed of three modules: (A) Detail preserved multipath feature extraction network; (B) Attention-based features adaptive squeeze and global spatial pooling enhancement module; (C) Upsampling and building footprint extraction module. The conv block is composed of a series of residual modules to extract features and is shared with each path. A gen block generates a new parallel path to extract richer semantic features on the basis of conv block.

Repeated pooling layers or strided convolutions lose spatial localization during the feature extraction procedure. Existing CNN-based building extraction methods recover spatial localization through skip connections that fuse shallow feature maps, or upsample the feature maps extracted from the last layer by interpolation. However, shallow feature maps contain coarse semantics, introducing noise into building extraction. In addition, convolutions process local neighbourhood information and cannot capture global dependency for large buildings. We propose MAP-Net for multiscale building footprint extraction with accurate boundaries and continuous entities. Fig. 1 illustrates the structure of the proposed MAP-Net. It mainly includes three components:

  1. A parallel multipath network to extract multiscale high-level semantic features while preserving spatial detail information;

  2. An attention-based multiscale features adaptive squeeze and spatial global pooling enhancement module;

  3. A building footprint extraction module.

The detail preserved feature extraction network includes three stages, and a parallel path is generated gradually in each stage to extract richer high-level semantic representations while the spatial resolution of the features is fixed to preserve local details. Features are fused among paths at the end of each stage during the feature extraction process for multiscale feature reuse, as shown by the black two-way arrow. Then, the multiscale features extracted from each path are concatenated, fused and squeezed by an attention-based module for feature optimization. The spatial pooling module extracts global dependency to suppress holes and obtain continuous building entities in the final extraction module. The numbers of channels (C) and resolutions of the feature maps are marked in the figure.

The remainder of this section is arranged as follows. Section B presents the detail preserved multipath feature extraction network. The attention-based multiscale feature adaptive squeeze and spatial global pooling enhancement are described in Section C. Finally, Section D describes the basic units and training strategies involved in this study.

II-B Localization-Preserved Multipath Network

Fig. 2: Part of the detail preserved multipath network. There are two parallel paths with multiscale features extracted in the previous stage. A new path is generated in the next stage to extract high-level semantic representations with down-sampled resolution and doubled channels. Features in parallel paths are fused between two stages. Rescale layers represented by arrows guarantee that the features have the same dimension to be fused.

Compared with the encoder-decoder-based CNN structure, the advantage of a localization-preserved multipath feature extraction network is that it extracts multiscale features containing rich high-level semantic representation and accurate spatial localization information, rather than recovering spatial details by fusing shallow feature maps during decoding. Multiscale features from different stages are fed into several parallel paths that are generated gradually to extract richer semantics and preserve spatial resolution without increasing the computational complexity of the network.

Fig. 2 illustrates part of the proposed multipath network. Two parallel paths extract feature maps with different dimensions in the previous stage, and a new path is generated in the next stage to extract high-level semantic features with halved resolution and doubled channels. Feature maps extracted in each path maintain their spatial resolution, preserving as many spatial details as possible and providing a guarantee for accurate pixel localization in the building segmentation procedure. Features in the parallel paths are fused at the end of every stage to make full use of the multiscale representation.

To guarantee that the feature maps from different paths have the same dimension to be fused, specific layers are designed to rescale the dimension of feature maps during the extraction process. The green arrow is composed of a max pooling layer to downsample resolutions and a 1×1 convolutional layer to increase feature channels. The orange arrow upsamples the resolutions through bilinear interpolation and decreases channels with a 1×1 convolutional layer. The blue arrow represents a 3×3 convolutional layer, and the black arrow indicates each parallel path. Thus, feature maps can be fused by pixel-wise addition or channel-wise concatenation.
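As a minimal NumPy sketch of how these rescale layers make cross-path fusion dimension-compatible — with nearest-neighbour upsampling standing in for bilinear interpolation and random matrices standing in for the learned 1×1 convolution weights:

```python
import numpy as np

def conv1x1(x, w):
    # a 1x1 convolution is a per-pixel channel projection: (H, W, Cin) @ (Cin, Cout)
    return x @ w

def max_pool_2x(x):
    # 2x2 max pooling with stride 2: (H, W, C) -> (H/2, W/2, C)
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def upsample_2x(x):
    # nearest-neighbour upsampling stands in for bilinear interpolation here
    return x.repeat(2, axis=0).repeat(2, axis=1)

rng = np.random.default_rng(0)
hi = rng.standard_normal((8, 8, 64))    # higher-resolution path
lo = rng.standard_normal((4, 4, 128))   # lower-resolution path

# "green arrow": downsample hi and double its channels, then fuse into lo
w_down = rng.standard_normal((64, 128)) * 0.1
lo_fused = lo + conv1x1(max_pool_2x(hi), w_down)

# "orange arrow": upsample lo and halve its channels, then fuse into hi
w_up = rng.standard_normal((128, 64)) * 0.1
hi_fused = hi + conv1x1(upsample_2x(lo), w_up)

print(hi_fused.shape, lo_fused.shape)  # (8, 8, 64) (4, 4, 128)
```

After rescaling, both fusions reduce to simple pixel-wise addition, exactly as the figure's arrows indicate.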

Throughout the feature extraction process, the spatial resolutions and channels of the feature maps in each path are fixed. Features in each path are extracted by a series of convolutional blocks, which suppresses the coarse semantics present in the high-resolution feature maps of encoder-decoder-based CNNs. Because detailed representation is preserved in the higher-resolution features, smaller buildings and boundary localization can be extracted exactly. The effect of a multipath network that considers both preserved localization and high-level semantics is examined in experiment 3.3.1.

Considering the trade-off between complexity and accuracy, the structure of MAP-Net is composed of three parallel paths for extracting multiscale features. The spatial resolutions of feature maps are 1/4, 1/8 and 1/16 of the original image, with corresponding numbers of channels of 64, 128 and 256, respectively.

II-C Attention-Based Feature Squeeze

Fig. 3: Feature semantic enhancement module. First, multiscale features extracted from the multipath are scaled by bilinear interpolation and concatenated. Then, the channel attention enhancement module adaptively squeezes significant channel-wise features to reconstruct optimal features. Finally, a spatial pooling enhancement module is introduced to capture global dependency for continuous building footprints.

Feature maps extracted from the multiple paths have different dimensions. Higher-resolution features contain localization details and high-level semantic information, while lower-resolution features provide richer global features. The features are upsampled to 1/4 of the original image resolution through bilinear interpolation and fused by concatenation, as shown in Fig. 3. The channel attention squeeze module adaptively measures the significance of each channel to optimize the multiscale features. A spatial pooling enhancement module is introduced to capture global dependence for continuously extracting building entities, especially large buildings with low texture. The details are described as follows.

In previous CNN-based methods [36, 26, 23], multiscale features were concatenated directly for final pixel-wise prediction. Nevertheless, each channel has a dissimilar influence on building extraction, and some channels may weaken the semantic representation while increasing the computational complexity. In our research, the multiscale features from different paths contain spatial localization and richer semantic representation. It is necessary to distinguish valuable channel-wise features for accurately and efficiently extracting buildings, while prior knowledge can hardly weight the importance of each channel. The attention-based feature adaptive squeeze module, inspired by [20], learns a weight for each channel and automatically reconstructs the feature maps for optimal representation.

As illustrated in Fig. 3, a global average pooling operation produces a vector of length 7C from the concatenated multiscale channel-wise features, and a fully connected layer with a 7C × 7C weight matrix then learns a weight vector of length 7C corresponding to each channel. The parameters of the fully connected layer are randomly initialized and gradually learned from the features. Finally, the vector representing the significance of each channel is normalized by a sigmoid function and multiplied with the original features to reconstruct enhanced feature maps.
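The squeeze step can be sketched in a few lines of NumPy; the fully connected weights here are random stand-ins for the learned 7C × 7C parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention_squeeze(feat, W):
    # feat: (H, W, 7C) concatenated multiscale features; W: (7C, 7C) FC weights
    v = feat.mean(axis=(0, 1))   # global average pooling -> length-7C vector
    a = sigmoid(v @ W)           # channel significance weights, normalized to (0, 1)
    return feat * a              # reweight each channel of the original features

C = 64
rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 16, 7 * C))
W = rng.standard_normal((7 * C, 7 * C)) * 0.01  # random stand-in for learned weights
out = channel_attention_squeeze(feat, W)
print(out.shape)  # (16, 16, 448)
```

Because the sigmoid keeps every weight in (0, 1), the module can only attenuate uninformative channels, never amplify them past their original magnitude.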

Because the extracted features are always partial, restricted by local receptive fields, a spatial pooling module is introduced to extract global dependence. The implementation is similar to [43] except that the global features are generated by four average pooling layers with different sizes, designed in accordance with the dimensions of the features, and added to the original feature maps pixel-wise for global spatial enhancement. It captures global relations spatially, which cannot be extracted by the CNN due to its local receptive field. Hence, the extracted buildings have better integrity.
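A hedged NumPy sketch of this spatial pooling enhancement, using nearest-neighbour resizing in place of interpolation and illustrative bin sizes (the actual pooling sizes depend on the feature dimensions):

```python
import numpy as np

def avg_pool_to(x, size):
    # adaptive average pooling of (H, W, C) down to (size, size, C)
    H, W, C = x.shape
    return x.reshape(size, H // size, size, W // size, C).mean(axis=(1, 3))

def upsample_to(x, H, W):
    # nearest-neighbour resize back to (H, W, C); the paper resizes by interpolation
    s = x.shape[0]
    return x.repeat(H // s, axis=0).repeat(W // s, axis=1)

def spatial_pooling_enhance(feat, bin_sizes=(1, 2, 4, 8)):
    # add each pooled global-context map back onto the original features pixel-wise
    H, W, _ = feat.shape
    out = feat.copy()
    for s in bin_sizes:
        out += upsample_to(avg_pool_to(feat, s), H, W)
    return out

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 16, 32))
print(spatial_pooling_enhance(feat).shape)  # (16, 16, 32)
```

The 1×1 bin reduces to a global average over the whole map, which is what lets far-apart pixels of one large building influence each other.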

II-D Basic Block and Training Strategy

Fig. 4: Detail of the basic blocks. (a) Downsample block. (b) Conv Block. (c) Residual block. (d) Upsample block.

To decrease the computational complexity, a downsampling block is introduced to decrease the resolution of the input before the multipath network, as shown in Fig. 4(a). It consists of two 3×3 convolutional layers, each followed by batch normalization (BN) and a rectified linear unit (ReLU) activation function, and two max pooling layers, extracting feature maps with 64 channels and 1/4 the spatial resolution of the input image. Fig. 4(b) represents the conv block, which includes several residual blocks in series; the impact of different numbers of blocks on performance is explored in experiment 3.3.2. The residual block consists of a 1×1 convolutional layer for reducing the dimensions of the features, two 3×3 layers for extracting features, and a 1×1 convolutional layer for restoring the dimensions of the input; a shortcut fuses input to output through element-wise addition, and BN and ReLU execute before the convolutional layers, as illustrated in Fig. 4(c). In the building footprint extraction module, shown in Fig. 4(d), the resolution of the features is recovered through bilinear interpolation in two stages, and convolutional layers are used to decrease the number of channels.
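The bottleneck residual block described above can be sketched as follows in NumPy; batch normalization is omitted and the weights are random stand-ins, so this only illustrates the data flow:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv3x3(x, w):
    # 'same' 3x3 convolution: x is (H, W, Cin), w is (3, 3, Cin, Cout)
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W] @ w[i, j]
    return out

def residual_block(x, c_mid, rng):
    # 1x1 reduce -> two 3x3 -> 1x1 restore, with an identity shortcut;
    # activations precede the convolutions (BN omitted in this sketch)
    c = x.shape[-1]
    w1 = rng.standard_normal((c, c_mid)) * 0.05
    w2 = rng.standard_normal((3, 3, c_mid, c_mid)) * 0.05
    w3 = rng.standard_normal((3, 3, c_mid, c_mid)) * 0.05
    w4 = rng.standard_normal((c_mid, c)) * 0.05
    h = relu(x) @ w1           # 1x1: reduce channels
    h = conv3x3(relu(h), w2)   # 3x3: extract features
    h = conv3x3(relu(h), w3)   # 3x3: extract features
    h = relu(h) @ w4           # 1x1: restore channels
    return x + h               # element-wise shortcut addition

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 64))
y = residual_block(x, 16, rng)
print(y.shape)  # (8, 8, 64)
```

Because input and output channels match, the shortcut needs no projection, so blocks of this kind can be chained to any depth within a conv block.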

Our research was implemented in TensorFlow using a single 2080Ti GPU with 12 GB of memory. The Adam optimizer was chosen with an initial learning rate of 0.001, and beta1 and beta2 were set to the recommended defaults. All compared methods were trained from scratch for approximately 80 epochs until convergence, with random rotation and flipping for data augmentation, on the three building datasets described in Section III. The batch size was set to 4, restricted by the GPU memory size, and the same hyperparameters were maintained to compare the performance of the different methods.

Sigmoid cross-entropy loss was selected as the loss function because of the pixel-wise binary classification involved. The loss at position i is given as (1), where logits_i represents the predicted result and y_i represents the ground truth; the sigmoid function is applied to logits_i to ensure that p_i ∈ (0, 1), as shown in (2). The loss value for an input image is the average of loss_i over all positions.

loss_i = -[ y_i · log(p_i) + (1 - y_i) · log(1 - p_i) ]   (1)

p_i = sigmoid(logits_i) = 1 / (1 + exp(-logits_i))   (2)
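Equations (1) and (2) can be implemented directly; the small eps term below is a numerical safeguard of this sketch, not part of the paper's formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_cross_entropy(logits, y, eps=1e-12):
    # pixel-wise binary cross-entropy, averaged over all positions of the image
    p = sigmoid(logits)  # squashes logits into (0, 1), as in (2)
    loss = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))  # (1)
    return loss.mean()

logits = np.array([[2.0, -1.0], [0.0, 3.0]])  # predicted logits per pixel
y = np.array([[1.0, 0.0], [1.0, 1.0]])        # ground-truth building mask
print(round(sigmoid_cross_entropy(logits, y), 4))  # ≈ 0.2955
```

In practice a fused, numerically stable primitive (e.g. TensorFlow's built-in sigmoid cross-entropy with logits) would be preferred over this explicit form.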


III Experiment and Analysis

III-A Dataset

To evaluate the proposed method, we conducted a comprehensive experiment on three open datasets, including the WHU building dataset [22], the Deep Globe Building Extraction Challenge dataset [10] and the USSOCOM Urban 3D Challenge dataset [17]. The details are described as follows.

The WHU building dataset includes both aerial and satellite subsets with corresponding shapefiles and raster images. In our experiment, we selected the aerial subset, which contains buildings of various appearances and scales, to evaluate the robustness of the proposed algorithm. It consists of more than 187,000 buildings, covering over a 450 km² area, with 30 cm ground resolution. Each image has three bands, corresponding to red (R), green (G) and blue (B) wavelengths, and the size of each image is 512×512 pixels. There are 8,188 tiles in total, including 4,736, 2,416 and 1,036 tiles as the training, test and validation datasets, respectively. We conducted our experiment with the originally provided dataset partitioning.

The Deep Globe Building Extraction Challenge dataset [10] contains WorldView-3 satellite imagery captured over Vegas, Paris, Shanghai and Khartoum. In this research, the Vegas and Shanghai subsets were selected to evaluate the generalization performance of the proposed algorithm. There are approximately 243,382 buildings with 30 cm ground resolution, covering over 1,216 km², and the size of each image is 650×650 pixels. All images were randomly divided 6:1:3 into training, validation and test sets.

The USSOCOM Urban 3D Challenge dataset [17] contains 208 orthorectified RGB tiles, with corresponding DSMs and digital terrain models (DTMs) generated from commercial satellite imagery. It contains approximately 157,000 buildings, covering over 360 km² with a ground resolution of 50 cm, and the size of each image is 2048×2048 pixels. The DSM and DTM indicate the elevation of buildings, which obviously improves building extraction performance; however, we used only the RGB images in our experiment to evaluate the performance of the proposed method. The training, validation and test sets include 104, 62 and 42 tiles following the original data partitioning, and we randomly clipped the images to 512×512 pixels for training and testing.

III-B Evaluation Metric

Generally, evaluation metric methodologies can be divided into two categories: pixel-level metrics and instance-level metrics. The pixel-level method counts correctly classified and misclassified pixels. In the instance-level method, a building is correctly extracted only when the intersection over union between the prediction and the ground truth is larger than a specific threshold. Semantic segmentation-based building footprint extraction aims to classify every pixel of a specific input image as building or not. Therefore, we apply pixel-level metrics, including precision, recall, F1-score and intersection over union (IoU), to evaluate the performance of MAP-Net and the other methods.

There are four classifying conditions: true prediction on a positive sample (TP), false prediction on a positive sample (FP), true prediction on a negative sample (TN) and false prediction on a negative sample (FN). Precision represents the percentage of TP in the total positive predictions, recall indicates the percentage of TP over the total positive samples, the F1-score is the weighted average of precision and recall, which considers both FP and FN, and IoU is the intersection of the prediction and the ground truth over their union, averaged over the whole image set. The equations are given as follows:

Precision = TP / (TP + FP)    Recall = TP / (TP + FN)

F1 = 2 · Precision · Recall / (Precision + Recall)

IoU = TP / (TP + FP + FN)


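These pixel-level metrics follow directly from the confusion counts; a small self-contained example:

```python
import numpy as np

def pixel_metrics(pred, gt):
    # pred, gt: binary arrays of the same shape (1 = building, 0 = background)
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
# tp = 2, fp = 1, fn = 1
print(pixel_metrics(pred, gt))  # (0.666..., 0.666..., 0.666..., 0.5)
```

Note that IoU is always at most the F1-score on the same counts, which is why IoU gaps between methods tend to be larger than F1 gaps in the tables below.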
III-C Experimental Setup

In this section, we first analysed the significance of the proposed multipath architecture for extracting multiscale buildings with exact localization on boundaries compared with the popular encoder-decoder framework. Second, we explored the impact of different network parameters on the complexity and accuracy of MAP-Net. Third, a contrast experiment was carried out to compare the performance of MAP-Net with four state-of-the-art algorithms on building extraction. Finally, we conducted an ablation experiment to validate the significance of the proposed network and analysed the trade-off between complexity and accuracy among the compared methodologies. Details are described in the following sections.

III-C1 Significance of Multipath

Fig. 5: Feature maps extracted from localization-preserved multipath networks with different paths (P), stages (S) and spatial resolutions (R) referred to in Fig. 1. Column (a) represents two sample images containing multiscale buildings. Columns (b-d) are feature maps extracted from path 1 with the same spatial resolution on each stage. Columns (d-f) are the feature maps extracted from stage 3 but with decreasing spatial resolution in each path.

Feature maps extracted from the proposed localization-preserved multipath network are shown in Fig. 5. Columns (b-d) are extracted from path 1 with the same spatial resolution at each stage, corresponding to the sample images in column (a). They indicate that higher-resolution feature maps extracted from deeper convolutional layers (larger S) retain richer semantics; in other words, building and background can be distinguished more evidently. Columns (d-f) show the feature maps extracted from each path at stage 3 with decreasing spatial resolution. They show that lower-resolution feature maps blur at the edges of buildings and, worse, small buildings may be lost completely, as shown in column (f), due to the exact localization lost during the feature extraction procedure.

Encoder-decoder-based networks fuse higher-resolution feature maps extracted from shallow layers, such as columns (b) or (c), to recover exact localization through skip connections at the decoder stage, which introduces noise due to the coarse semantic features. In addition, small buildings may be lost in the lowest-resolution feature maps, such as column (f), and cannot be refined accurately during the decoder stage. As a result, extracted building footprints are inaccurate at the boundary, or worse, small buildings go unrecognized.

Multipath networks extract multiscale feature maps through parallel paths. The resolution of the feature maps in each path is fixed throughout the feature extraction process. Features with higher spatial resolution preserve exact localization and contain rich semantics, such as column (d), which is beneficial for extracting exact boundaries and small buildings. Additionally, the features with lower spatial resolution capture global semantic representations, which contribute to the extraction of large buildings. The multiscale features extracted from the multipath are fused and enhanced to extract buildings at multiple scales, which makes up for the shortcomings of existing networks.

III-C2 Network Parameter

The structure of the proposed network is mainly affected by the depth of the convolution network and the number of parallel paths. We designed an experiment to explore the impact of different network parameters on the performance of MAP-Net on the WHU dataset. The depth is represented by the number of residual blocks (N-blocks) in each path; empirically, these were set from 3 to 6 in our experiments. Similarly, the number of paths (N-paths) was chosen from 2 to 4 according to the resolution of the input image. The IoU metric was used to evaluate the accuracy, and the number of trainable parameters (Para.) was counted to represent the complexity of the network.

Fig. 6: Performance of MAP-Net with different network structures. The diamond, circle and rectangle represent different numbers of paths. Red and blue represent the IoU score and trainable parameters (Para.), respectively. The horizontal axis indicates the number of residual blocks (N-blocks) in a convolutional block.

The experimental result is illustrated in Fig. 6. As N-blocks increases, the IoU score first increases and then decreases after N-blocks exceeds a specific value, which may be explained by the complexity of the network growing with N-blocks while the generalization ability of the model weakens. The Para. value grows linearly with the increase in N-blocks but exponentially with the increase in N-paths, since each generated path doubles the feature channels, which greatly increases the parameters in the feature extraction and enhancement stages.

Features with specific resolution were extracted from each path. The number of paths impacts the combination of multiscale semantic features fused in MAP-Net. When the N-paths equalled 3, the IoU metric was better than that of 2 or 4, as shown by the solid line marked by the red circle in Fig. 6.

Considering accuracy and complexity, the better structure of MAP-Net is composed of three parallel paths, with each convolutional block consisting of four residual blocks; this configuration contains fewer parameters and performs better than the others, as shown by the solid markers in red and blue.

Fig. 7: Example of results with the UNet, PSPNet, ResNet101, HRNetv2 and our proposed method on the WHU dataset. (a) Original image. (b) UNet. (c) PSPNet. (d) ResNet101. (e) HRNetv2. (f) Ours. (g) Ground truth.
Fig. 8: Example of results with the UNet, PSPNet, ResNet101, HRNetv2 and our proposed method on the Deep Globe dataset. (a) Original image. (b) UNet. (c) PSPNet. (d) ResNet101. (e) HRNetv2. (f) Ours. (g) Ground truth.
Fig. 9: Example of results with the UNet, PSPNet, ResNet101, HRNetv2 and our proposed method on the Urban 3D dataset. (a) Original image. (b) UNet. (c) PSPNet. (d) ResNet101. (e) HRNetv2. (f) Ours. (g) Ground truth.
Method IoU(%) Precision(%) Recall(%) F1-score(%)
UNet 88.75 94.85 93.25 94.04
PSPNet 88.87 94.28 93.93 94.10
ResNet101 89.18 94.47 94.09 94.28
HRNetv2 90.04 94.16 95.37 94.76
Ours 90.86 95.62 94.81 95.21
TABLE I: Comparison of State-of-the-Art Methods and Ours on the WHU Dataset.
Method IoU(%) Precision(%) Recall(%) F1-score(%)
UNet 76.34 89.07 84.23 86.59
PSPNet 78.76 87.36 88.89 88.12
ResNet101 79.16 89.1 87.65 88.37
HRNetv2 79.13 89.55 87.17 88.35
Ours 80.63 91.29 87.35 89.28
TABLE II: Comparison of State-of-the-Art Methods and Ours on the Deep Globe Dataset.
Method IoU(%) Precision(%) Recall(%) F1-score(%)
UNet 84.56 92.59 90.69 91.63
PSPNet 86.19 93.00 92.17 92.46
ResNet101 86.17 92.83 92.31 92.57
HRNetv2 86.15 92.74 92.38 92.56
Ours 87.68 93.42 93.45 93.44

TABLE III: Comparison of State-of-the-Art Methods and Ours on the Urban 3D Dataset.

III-C3 Performance Evaluation

To evaluate the performance of the proposed network, we conducted contrast experiments comparing MAP-Net with four state-of-the-art methods, including UNet, PSPNet with a ResNet50 backbone, ResNet101 and HRNetv2, on the datasets [10, 17, 22]. Experimental results are shown in TABLE I, TABLE II and TABLE III. Our proposed method demonstrates a clear improvement over the other methods on the three experimental datasets, obtaining approximately 0.82%, 1.50% and 1.53% IoU improvement and 0.45%, 0.93% and 0.88% F1-score improvement on the WHU, Deep Globe and Urban 3D datasets, respectively, compared with the latest HRNetv2. The best records are marked in bold.

To compare different methods, some example results on each dataset are presented in Fig. 7, Fig. 8 and Fig. 9. Fig. 7 shows extracted building footprints on the WHU dataset. There are four examples, including buildings with various appearances and scales. Columns (a) and (g) represent the original image and corresponding ground truth, and columns (b-f) are extracted results from UNet, PSPNet, ResNet101, HRNetv2 and MAP-Net, respectively.

The results show that our proposed method clearly outperforms the four compared methods, especially in recognizing small buildings more accurately and extracting large buildings more completely, which benefits from the localization-preserved multipath feature extraction network and the multiscale feature enhancement module. The extracted building boundaries also align more exactly with the ground truth.

Example results on the Deep Globe dataset and Urban 3D dataset are illustrated in Fig. 8 and Fig. 9. Each column has the same meaning as presented in Fig. 7.

No. Method IoU(%) Precision(%) Recall(%) F1-score(%) IoU Gain(%)
1 HRNetv1 89.54 94.55 94.41 94.48
2 Baseline 89.78 94.09 95.15 94.62 0.24
3 Baseline+ M 90.28 95.03 94.76 94.89 0.74
4 Baseline+ MC 90.57 95.44 94.67 95.05 1.03
5 Baseline+ MS 90.62 95.21 94.95 95.08 1.08
6 MAP-Net 90.86 95.62 94.81 95.21 1.32
TABLE IV: Influence of Different Modules. (M): Multipath Feature Extraction. (C): Channel Feature Squeeze. (S): Spatial Feature Enhancement.

III-C4 Ablation Experiments

To explore the contributions of the different modules of MAP-Net, we conducted ablation experiments on the WHU dataset. Our baseline is MAP-Net built on HRNetv1 with an architecture optimized for building extraction.

On the basis of the baseline, we denote the fusion of localization-preserved multiscale features extracted from the multipath backbone as (M), the attention-based channel-wise feature squeeze module as (C), and the spatial pooling enhancement module as (S). We evaluate performance by IoU, precision, recall and F1-score. Experimental results are recorded in TABLE IV, with the best records marked in bold.
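As a rough illustration of module (C), a squeeze-and-excitation style channel attention can be sketched as below. This is a conceptual NumPy sketch under our own assumptions (function name, weight shapes, reduction ratio), not the exact MAP-Net implementation:

```python
import numpy as np

def channel_squeeze(features, w1, b1, w2, b2):
    """SE-style channel attention sketch of the channel squeeze module (C).

    features: (H, W, C) feature map.
    w1 (C, C/r), b1 (C/r,), w2 (C/r, C), b2 (C,): learned bottleneck weights
    (shapes are illustrative assumptions).
    """
    z = features.mean(axis=(0, 1))                 # global average pool -> (C,)
    h = np.maximum(z @ w1 + b1, 0.0)               # bottleneck FC + ReLU
    s = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))       # sigmoid channel weights
    return features * s                            # reweight each channel
```

With all-zero weights the sigmoid gate is 0.5 everywhere, so every channel is simply halved; training learns which channels to emphasize.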

The results show that the attention-based feature squeeze and the spatial pooling enhancement module clearly improved building extraction performance, and the multipath localization-preserved strategy contributed a further 0.5% IoU on an already strong baseline.
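The spatial enhancement module (S) follows the pyramid-pooling idea: average-pool the feature map over several grid sizes to capture global context, upsample the pooled maps back, and concatenate them with the input. A minimal NumPy sketch (bin sizes, nearest-neighbour upsampling and the concatenation scheme are our assumptions for illustration):

```python
import numpy as np

def pyramid_pool(features, bins=(1, 2, 4)):
    """Sketch of pyramid spatial pooling enhancement (module S).

    features: (H, W, C) with H and W divisible by every bin size.
    Returns the input concatenated with upsampled pooled context maps.
    """
    h, w, c = features.shape
    pooled = [features]
    for b in bins:
        bh, bw = h // b, w // b
        # average-pool into a b x b grid of block means
        grid = features.reshape(b, bh, b, bw, c).mean(axis=(1, 3))
        # nearest-neighbour upsample back to (H, W, C)
        up = np.repeat(np.repeat(grid, bh, axis=0), bw, axis=1)
        pooled.append(up)
    return np.concatenate(pooled, axis=-1)
```

The 1x1 bin injects the global mean into every pixel, which is what lets distant pixels of one large low-texture building share evidence and avoid holes.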

It is worth noting that our configurations No. 3 to No. 6 obtained higher precision than HRNetv1 and our baseline. A probable explanation is that our method suppresses false positive predictions, which we attribute to the accurate multiscale features extracted by the localization-preserved multipath network. The same conclusion can be inferred from the other datasets.


III-C5 Complexity of MAP-Net

Our proposed algorithm extracts multiscale features; in particular, some paths must process high-resolution feature maps throughout the whole network to preserve exact localization, which could lead to a large number of parameters. To validate the trade-off between performance and complexity of MAP-Net, we compared the trainable parameters, IoU score and model size of five related methods on the WHU dataset.

The experimental results are presented in Fig. 10. ResNet101 is the most complex model yet performs poorly, owing to its many convolutional layers and the highest channel counts. HRNetv2 has more parameters than HRNetv1 and performs best among the compared methods except MAP-Net, which indicates that multiscale features are an important factor for building extraction. MAP-Net shows the best performance while requiring the fewest parameters. Although it maintains a high-resolution feature map, which could lead to a large number of parameters, its channel counts remain small, allowing it to extract multiscale features efficiently.
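The last point can be made concrete: the parameter count of a convolution depends only on kernel size and channel widths, not on the spatial resolution the layer runs at, so a high-resolution path with narrow channels stays cheap. The channel widths below are illustrative, not MAP-Net's actual layer sizes:

```python
def conv_params(k, c_in, c_out):
    """Trainable parameters of a k x k convolution (weights + biases)."""
    return k * k * c_in * c_out + c_out

# Parameter count is independent of the H x W resolution of the feature map:
wide = conv_params(3, 256, 256)    # 3*3*256*256 + 256 = 590,080
narrow = conv_params(3, 32, 32)    # 3*3*32*32   + 32  =   9,248
```

A 3x3 layer with 32 channels costs roughly 64x fewer parameters than one with 256 channels, regardless of whether it processes a full-resolution or a downsampled map.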

Fig. 10: Complexity and accuracy comparison among related methods. The IoU score and number of trainable parameters for each method are marked in the figure. The radius of each circle indicates the size of the model file.

IV Discussion and Conclusion

To address the problems of inaccurate boundaries, missed small buildings and discontinuous footprints for large buildings, in this research we proposed a novel localization-preserved multipath feature extraction network with channel and spatial enhancement modules for building footprint extraction.

Multiscale features extracted from the parallel paths contain local details as well as rich semantic representations, allowing the network to extract building footprints with exact edges and to recognize small buildings. The enhancement modules further reconstruct and optimize features along the channel and spatial dimensions, which suppresses holes and yields continuous footprints for large buildings.

Experiments on three building extraction datasets demonstrate that MAP-Net outperforms other state-of-the-art algorithms with higher accuracy and lower complexity. In addition, we conducted ablation experiments to evaluate the contribution of each proposed module and showed that the localization-preserved multipath network extracts buildings with higher precision than previous methods.

Generally, our research provides a new approach for accurately and efficiently extracting multiscale objects that are common in the real world. Currently, our experiments target building extraction; in future work we will study multi-class extraction tasks, such as land cover, toward automatic interpretation of remote sensing imagery.


  • [1] R. Alshehhi, P. R. Marpu, W. L. Woon, and M. Dalla Mura (2017) Simultaneous extraction of roads and buildings in remote sensing imagery with convolutional neural networks. ISPRS Journal of Photogrammetry and Remote Sensing 130, pp. 139–149. Cited by: §I.
  • [2] M. Awrangjeb, C. Zhang, and C. S. Fraser (2013) Automatic extraction of building roofs using lidar data and multispectral imagery. ISPRS journal of photogrammetry and remote sensing 83, pp. 1–18. Cited by: §I.
  • [3] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2481–2495. Cited by: §I.
  • [4] K. Bittner, S. Cui, and P. Reinartz (2017) BUILDING extraction from remote sensing data using fully convolutional networks.. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences 42. Cited by: §I.
  • [5] J. Burochin, B. Vallet, M. Brédif, C. Mallet, T. Brosset, and N. Paparoditis (2014) Detecting blind building façades from highly overlapping wide angle aerial imagery. ISPRS journal of photogrammetry and remote sensing 96, pp. 193–209. Cited by: §I.
  • [6] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu (2019) GCNet: non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492. Cited by: §I.
  • [7] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §I.
  • [8] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng (2018) A^ 2-nets: double attention networks. In Advances in Neural Information Processing Systems, pp. 352–361. Cited by: §I.
  • [9] M. Cote and P. Saeedi (2012) Automatic rooftop extraction in nadir aerial imagery of suburban regions using corners and variational level set evolution. IEEE transactions on geoscience and remote sensing 51 (1), pp. 313–328. Cited by: §I.
  • [10] I. Demir, K. Koperski, D. Lindenbaum, G. Pang, J. Huang, S. Basu, F. Hughes, D. Tuia, and R. Raska (2018) Deepglobe 2018: a challenge to parse the earth through satellite images. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 172–17209. Cited by: item 4, §III-A, §III-A, §III-C3.
  • [11] Z. Deng, H. Sun, S. Zhou, J. Zhao, L. Lei, and H. Zou (2018) Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS journal of photogrammetry and remote sensing 145, pp. 3–22. Cited by: §I.
  • [12] J. Du, D. Chen, R. Wang, J. Peethambaran, P. T. Mathiopoulos, L. Xie, and T. Yun (2019) A novel framework for 2.5-d building contouring from large-scale residential scenes. IEEE Transactions on Geoscience and Remote Sensing 57 (6), pp. 4121–4145. Cited by: §I.
  • [13] S. Du, Y. Zhang, Z. Zou, S. Xu, X. He, and S. Chen (2017) Automatic building extraction from lidar data fusion of point and grid-based features. ISPRS Journal of Photogrammetry and Remote Sensing 130, pp. 294–307. Cited by: §I.
  • [14] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3146–3154. Cited by: §I.
  • [15] N. L. Gavankar and S. K. Ghosh (2018) Automatic building footprint extraction from high-resolution satellite image using mathematical morphology. European Journal of Remote Sensing 51 (1), pp. 182–193. Cited by: §I.
  • [16] S. Gilani, M. Awrangjeb, and G. Lu (2016) An automatic building extraction and regularisation technique using lidar point cloud data and orthoimage. Remote Sensing 8 (3), pp. 258. Cited by: §I.
  • [17] H. R. Goldberg, S. Wang, G. A. Christie, and M. Z. Brown (2018) Urban 3d challenge: building footprint detection using orthorectified imagery and digital surface models from commercial satellites. In Geospatial Informatics, Motion Imagery, and Network Analytics VIII, Vol. 10645, pp. 1064503. Cited by: item 4, §III-A, §III-A, §III-C3.
  • [18] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao (2019) Adaptive pyramid context network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7519–7528. Cited by: §I.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §I.
  • [20] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §I, §II-C.
  • [21] Z. Huang, G. Cheng, H. Wang, H. Li, L. Shi, and C. Pan (2016) Building extraction from multi-source remote sensing images via deep deconvolution neural networks. In 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 1835–1838. Cited by: §I.
  • [22] S. Ji, S. Wei, and M. Lu (2018) Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Transactions on Geoscience and Remote Sensing 57 (1), pp. 574–586. Cited by: item 4, §III-A, §III-C3.
  • [23] S. Ji, S. Wei, and M. Lu (2019) A scale robust convolutional neural network for automatic building extraction from aerial and satellite imagery. International Journal of Remote Sensing 40 (9), pp. 3308–3322. Cited by: §I, §I, §II-C.
  • [24] R. Kemker, C. Salvaggio, and C. Kanan (2018) Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning. ISPRS journal of photogrammetry and remote sensing 145, pp. 60–77. Cited by: §I.
  • [25] A. Khalel and M. El-Saban (2018) Automatic pixelwise object labeling for aerial imagery using stacked u-nets. arXiv preprint arXiv:1803.04953. Cited by: §I.
  • [26] L. Li, J. Liang, M. Weng, and H. Zhu (2018) A multiple-feature reuse network to extract buildings from remote sensing imagery. Remote Sensing 10 (9), pp. 1350. Cited by: §I, §II-C.
  • [27] W. Li, C. He, J. Fang, J. Zheng, H. Fu, and L. Yu (2019) Semantic segmentation-based building footprint extraction using very high-resolution satellite images and multi-source gis data. Remote Sensing 11 (4), pp. 403. Cited by: §I.
  • [28] Z. Li, W. Shi, Q. Wang, and Z. Miao (2014) Extracting man-made objects from high spatial resolution remote sensing images via fast level set evolutions. IEEE Transactions on Geoscience and Remote Sensing 53 (2), pp. 883–899. Cited by: §I.
  • [29] J. Lin, W. Jing, H. Song, and G. Chen (2019) ESFNet: efficient network for building extraction from high-resolution aerial images. IEEE Access 7, pp. 54285–54294. Cited by: §I.
  • [30] P. Liu, X. Liu, M. Liu, Q. Shi, J. Yang, X. Xu, and Y. Zhang (2019) Building footprint extraction from high-resolution images via spatial residual inception convolutional neural network. Remote Sensing 11 (7), pp. 830. Cited by: §I.
  • [31] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez (2016) Convolutional neural networks for large-scale remote-sensing image classification. IEEE Transactions on Geoscience and Remote Sensing 55 (2), pp. 645–657. Cited by: §I.
  • [32] E. Maltezos, A. Doulamis, N. Doulamis, and C. Ioannidis (2018) Building extraction from lidar data applying deep convolutional neural networks. IEEE Geoscience and Remote Sensing Letters 16 (1), pp. 155–159. Cited by: §I.
  • [33] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §I.
  • [34] E. Simonetto, H. Oriot, and R. Garello (2005) Rectangular building extraction from stereoscopic airborne radar images. IEEE Transactions on Geoscience and remote Sensing 43 (10), pp. 2386–2395. Cited by: §I.
  • [35] G. Sohn and I. Dowman (2007) Data fusion of high-resolution satellite imagery and lidar data for automatic building extraction. ISPRS Journal of Photogrammetry and Remote Sensing 62 (1), pp. 43–63. Cited by: §I.
  • [36] G. Sun, H. Huang, A. Zhang, F. Li, H. Zhao, and H. Fu (2019) Fusion of multiscale convolutional neural networks for building extraction in very high-resolution images. Remote Sensing 11 (3), pp. 227. Cited by: §I, §I, §II-C.
  • [37] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212. Cited by: §I.
  • [38] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang (2019) High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514. Cited by: item 4, §I.
  • [39] Y. Sun, X. Zhang, Q. Xin, and J. Huang (2018) Developing a multi-filter convolutional neural network for semantic segmentation using high-resolution aerial imagery and lidar data. ISPRS journal of photogrammetry and remote sensing 143, pp. 3–14. Cited by: §I.
  • [40] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §I.
  • [41] Y. Yuan and J. Wang (2018) Ocnet: object context network for scene parsing. arXiv preprint arXiv:1809.00916. Cited by: §I.
  • [42] R. Zhang, G. Li, M. Li, and L. Wang (2018) Fusion of images and point clouds for the semantic segmentation of large-scale 3d scenes based on deep learning. ISPRS journal of photogrammetry and remote sensing 143, pp. 85–96. Cited by: §I.
  • [43] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §I, §II-C.
  • [44] G. Zhou and X. Zhou (2014) Seamless fusion of lidar and aerial imagery for building extraction. IEEE Transactions on Geoscience and Remote Sensing 52 (11), pp. 7393–7407. Cited by: §I.
  • [45] X. X. Zhu, D. Tuia, L. Mou, G. S. Xia, L. Zhang, F. Xu, and F. Fraundorfer (2017) Deep learning in remote sensing: a review. Cited by: §I, §I.