Adversarial Loss for Semantic Segmentation of Aerial Imagery

by Clint Sebastian, et al.
TU Eindhoven

Automatic building extraction from aerial imagery has several applications in urban planning, disaster management, and change detection. In recent years, several works have adopted deep convolutional neural networks (CNNs) for building extraction, since they produce rich features that are invariant to lighting conditions, shadows, etc. Although several advances have been made, building extraction from aerial imagery still presents multiple challenges. Most deep learning segmentation methods optimize a per-pixel loss with respect to the ground truth without knowledge of the context. This often leads to imperfect outputs with missing or unrefined regions. In this work, we propose a novel loss function combining adversarial and cross-entropy losses that learns to understand both local and global contexts for semantic segmentation. The newly proposed loss function deployed on the DeepLab v3+ network obtains state-of-the-art results on the Massachusetts buildings dataset. The loss function improves the structure and refines the edges of buildings without requiring any of the commonly used post-processing methods, such as Conditional Random Fields. We also perform ablation studies to understand the impact of the adversarial loss. Finally, the proposed method achieves a relaxed F1 score of 95.59% on the Massachusetts buildings dataset, compared to the previous best F1 of 94.88%.




I Introduction

Several developments in the collection of remote sensing imagery have resulted in the availability of high-resolution aerial image datasets for exploring applications such as object detection and image retrieval. Detection and recognition of objects in aerial imagery is crucial for urban planning, disaster mitigation, map making, and change detection. Buildings are among the most prominent objects that are maintained and updated. Therefore, building extraction benefits all of the aforementioned applications. Because of the increasing amount of aerial imagery, automating the detection process becomes desirable. In recent years, advances in machine learning along with the development of low-cost hardware have resulted in high-performance object detection algorithms. However, building detection from remote sensing images still faces several challenges: large variations in building appearance (varying building shapes, sizes, and colors), lighting conditions, and shadows often pose difficulties for reliable detection.

Fig. 1: Building segmentation results using our proposed method on Massachusetts building dataset. The colors white, black, blue and red indicate true positives, true negatives, false positives and false negatives, respectively.

Many of the earlier approaches relied on hand-engineered features for building extraction. They exploited features such as the structure, color and hyper-spectral data of remote sensing images to improve performance. These feature-based methods are coupled with machine learning algorithms for detection and classification [1, 2, 3]. However, due to the limitations of low-level features, these algorithms have low performance. In contrast to traditional methods, deep learning methods benefit from learning features by optimizing an objective function. The success of deep learning algorithms, such as Convolutional Neural Networks (CNNs), has resulted in improved performance on various computer vision tasks. These advances in deep learning have benefited several remote sensing applications, such as aerial image object detection and image retrieval.

Building detection in remote sensing is usually posed as a segmentation problem. Recent works have obtained state-of-the-art results in semantic segmentation for remote sensing imagery. Most works have explored encoder-decoder network structures to limit the parameter increase in the bottleneck layers for semantic segmentation [4, 5, 6]. Fully Convolutional Networks (FCNs) introduced the first encoder-decoder structure for semantic segmentation. FCNs replaced the fully connected layers with convolutional layers, which reduces the number of parameters of the CNN model [4]. Other works have built further on this encoder-decoder structure while improving segmentation performance.

Apart from utilizing a sophisticated architecture, most CNN-based segmentation algorithms rely on cross-entropy or similar loss functions as the objective function. Due to the limitations imposed by cross-entropy, we explore an alternative loss function inspired by adversarial learning that preserves structure and refines the results without the need for additional post-processing steps such as Conditional Random Fields (CRFs). In particular, we explore adversarial learning [7] in conjunction with the direct per-pixel optimization utilized by the cross-entropy loss.

II Related Work

Satellite imagery has been systematically captured over the last decade and a large amount of research has been conducted by both the remote sensing and the computer vision communities. Several approaches have been proposed for segmentation of buildings and other terrestrial objects from aerial imagery.

Image segmentation

In recent years, deep learning algorithms have provided state-of-the-art results for segmentation. Earlier work that relied on deep learning used fully connected layers to produce a vector that was later reshaped to a tensor [8, 9, 10]. However, these have been replaced by fully convolutional layers, which remove the output size restriction. Most segmentation networks use an encoder-decoder architecture such as a Fully Convolutional Network (FCN), U-Net, etc. However, the unique nature of remote sensing imagery has fuelled the design of several custom architectures for aerial image segmentation [6]. Much of the work on building segmentation has focused on improving the network architecture. Approaches such as attention [11], larger receptive fields [12], and other post-processing techniques are often added to existing networks to improve aerial image segmentation [12]. Besides these aspects, a few works have also considered a larger context as input for the networks [13, 10]. Context provides an understanding of the surroundings of an object and leads to higher-quality segmentation. However, most of these works rely on a per-pixel loss to improve performance, rather than capturing the properties of aerial imagery.

Adversarial learning

Adversarial learning has been primarily explored for generative models, where it is used to synthesize perceptually realistic images [7]. Adversarial learning is also used to create models that are robust against adversarial attacks. An adversarial learning approach utilizes a discriminator network alongside the generator network to distinguish real from fake samples. Instead of relying on post-processing techniques such as CRFs, adversarial learning provides conditioning and structure to the segmentation outputs. A few previous publications have considered adversarial learning for semantic segmentation [14]. The general approach in these works deploys a pair-wise input to the discriminator, where both the generated and input images are fed to the discriminator. Finally, the direct pixel-level and adversarial losses are combined in a weighted scheme. Adding the adversarial loss fills in missing regions by learning to capture the overall structure of a building, similar to the task of inpainting from context information [15].

Fig. 2: Overview of the proposed method. Dotted lines indicate the combined cross-entropy and adversarial loss of the generator samples, fed back to update the segmentation network parameters. The adversarial loss of both real and generated samples is used to update the discriminator parameters (not shown in this figure).

III Method

III-A Dataset

For the experiments, we use the Massachusetts building dataset [16]. It consists of 151 high-resolution RGB aerial images of regions in Boston. Each image has a resolution of 1,500 × 1,500 pixels with a spatial resolution of one square meter per pixel. The regions depict primarily urban and sub-urban areas with a coverage of roughly 340 km² (151 images of 1.5 km × 1.5 km each). The dataset is split into 137, 4 and 10 images for training, validation and testing, respectively. During training, each image is divided into 300 × 300 pixel patches without any overlap. Data augmentation is performed by flipping the images left-right and top-down and applying rotations of 90°, 180° and 270°.
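The patching and augmentation steps described above can be sketched as follows. This is a minimal illustration on nested Python lists, not the authors' pipeline: the function names are hypothetical, and whether the original patch is kept alongside its five augmented variants is an assumption.

```python
def split_into_patches(image, patch_size=300):
    """Split a 2D image (list of rows) into non-overlapping square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            rows = image[top:top + patch_size]
            patches.append([row[left:left + patch_size] for row in rows])
    return patches


def flip_left_right(patch):
    """Mirror a patch horizontally."""
    return [row[::-1] for row in patch]


def flip_top_down(patch):
    """Mirror a patch vertically."""
    return patch[::-1]


def rotate_90(patch):
    """Rotate a patch by 90 degrees clockwise."""
    return [list(column) for column in zip(*patch[::-1])]


def augment(patch):
    """Return the patch, its two flips and its three rotations.

    Keeping the unmodified patch in the output is an assumption.
    """
    rot90 = rotate_90(patch)
    rot180 = rotate_90(rot90)
    rot270 = rotate_90(rot180)
    return [patch, flip_left_right(patch), flip_top_down(patch),
            rot90, rot180, rot270]


# 1,500 / 300 = 5 patches per side, so 25 patches per image.
patches_per_image = (1500 // 300) ** 2
# 137 training images, before augmentation.
training_patches = 137 * patches_per_image
```

Under these assumptions, each 1,500 × 1,500 image yields 25 patches, so the 137 training images give 3,425 patches before augmentation.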

III-B Adversarial learning

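We combine both adversarial and cross-entropy losses to jointly optimize the generator. The final loss is defined as

$$\mathcal{L}(G, D) = \mathcal{L}_{adv}(G, D) + \mathcal{L}_{ce}(G),$$

where $G$ is the generator network and $D$ is the discriminator network. The term $\mathcal{L}_{adv}$ is the adversarial loss and $\mathcal{L}_{ce}$ is the cross-entropy loss. Losses $\mathcal{L}_{adv}$ and $\mathcal{L}_{ce}$ are specified by

$$\mathcal{L}_{adv}(G, D) = \mathbb{E}_{y \sim p_y}\left[\log D(y)\right] + \mathbb{E}_{x \sim p_x}\left[\log\left(1 - D(\hat{y})\right)\right],$$

$$\mathcal{L}_{ce}(G) = -\mathbb{E}_{x \sim p_x,\, y \sim p_y}\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right],$$

where $\hat{y} = G(x)$ denotes the output from the segmentation network, $y$ is the label (sampled from a real distribution $p_y$) and $x$ (sampled from a real distribution $p_x$) is the input RGB image. During training, the generator weights are updated based on the combination of equally weighted adversarial and cross-entropy losses.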
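As a rough numerical illustration of the equally weighted combination, the sketch below assumes the standard GAN objective of [7] for the adversarial term and a binary per-pixel cross-entropy. The function names and the `EPS` clamp are illustrative choices, not taken from the paper.

```python
import math

EPS = 1e-7  # numerical guard against log(0); an implementation assumption


def _clamp(value):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(value, EPS), 1.0 - EPS)


def cross_entropy_loss(predictions, labels):
    """Mean binary cross-entropy between predicted building
    probabilities and 0/1 ground-truth labels (flat lists of pixels)."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = _clamp(p)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def generator_adversarial_loss(d_fake):
    """Generator-side adversarial term, mean of log(1 - D(G(x))).

    d_fake holds discriminator outputs in (0, 1) for generated maps.
    """
    return sum(math.log(1.0 - _clamp(d)) for d in d_fake) / len(d_fake)


def discriminator_loss(d_real, d_fake):
    """Negated adversarial objective, which the discriminator minimizes."""
    real_term = sum(math.log(_clamp(d)) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - _clamp(d)) for d in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(predictions, labels, d_fake):
    """Equally weighted sum of adversarial and cross-entropy losses."""
    return (generator_adversarial_loss(d_fake)
            + cross_entropy_loss(predictions, labels))
```

In this form the adversarial term becomes more negative as the discriminator is fooled more often, so minimizing the sum pushes the generator towards outputs that are both accurate per pixel and plausible as a whole.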

III-C Network architectures

Like most adversarial training approaches, our architecture is composed of a generator and a discriminator. The segmentation network acts as the generator for adversarial training. An overview of the method is shown in Figure 2.

III-C1 Generator

To test the effectiveness of adversarial learning in conjunction with cross-entropy loss, we deploy the combined loss on several existing state-of-the-art networks. We test the new loss on DeepLab v3+ [17], DenseNet [18, 5], and PSPNet [19]. DeepLab v3+ is an extension to the series of DeepLab architectures. It consists of Atrous Spatial Pyramid Pooling, combined with the low-level features from earlier layers of a pre-trained ResNet model [20]. DenseNet consists of dense blocks in which every layer is connected to every other layer by concatenation. PSPNet utilizes a pooling module in which the final output of the network is pooled at different levels and the results are combined via concatenation.

III-C2 Discriminator

In all experiments, we employ the same discriminator architecture. Unlike previous work, our discriminator is not symmetric to the generator. It consists of 4 convolution layers (3 × 3 kernel size with 32, 64, 128, 256 filters) and 2 fully connected layers (512, 1 outputs) that classify the image as real or fake. Batch normalization [21] is not applied and exponential linear units are used for training the discriminator, as done in [22].
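To give a feel for the size of the convolutional part of this discriminator, the following sketch counts its trainable parameters. It assumes a single-channel input (a segmentation map) and a bias term per filter, neither of which is stated in the text. The fully connected layers are left out because their size depends on strides and input resolution, which are not given here.

```python
def conv_params(in_channels, out_channels, kernel_size=3, bias=True):
    """Parameter count of one 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def discriminator_conv_params(in_channels=1, filters=(32, 64, 128, 256)):
    """Per-layer parameter counts for the stacked 3 x 3 convolutions.

    in_channels=1 is an assumption (single-channel segmentation map).
    """
    counts = []
    for out_channels in filters:
        counts.append(conv_params(in_channels, out_channels))
        in_channels = out_channels
    return counts


layer_counts = discriminator_conv_params()
total_conv_params = sum(layer_counts)
```

Under these assumptions the four convolution layers hold 320, 18,496, 73,856 and 295,168 parameters, or 387,840 in total.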

III-D Implementation details

All the networks trained with cross-entropy loss use the Adam optimizer ($\beta_1$ = 0.9 and $\beta_2$ = 0.99) with a batch size of 3 for 90 epochs. Both DeepLab v3+ and PSPNet networks are trained using a pretrained model with a learning rate of 10. DenseNet is trained without pretraining with a learning rate of 10. The networks trained with adversarial and cross-entropy losses use the same settings as above for the generator (the segmentation network). The discriminator is also trained with the Adam optimizer ($\beta_1$ = 0.5 and $\beta_2$ = 0.9) with a learning rate of 10 for DeepLab v3+ and PSPNet, and 10 for DenseNet. The discriminator to generator training ratio is set to unity.

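III-E Evaluation metric

The commonly used metrics for the evaluation of detection results are the precision and recall measures. Precision and recall are also known as correctness and completeness in remote sensing literature. For evaluation, we use the Accuracy, the F1 measure and the mean IoU (mIoU) metrics, to obtain valid comparisons with previous work. Accuracy is computed by

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

while the F1 score and Precision and Recall are defined by

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}.$$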
Fig. 3: Building segmentation results for various segmentation networks. Panels: aerial image, DeepLab v3+, PSPNet, PSPNet + adv., FC-DenseNet, FC-DenseNet + adv. The "+ adv." suffix indicates the addition of the adversarial loss to the cross-entropy loss. The colors white, black, blue and red indicate true positives, true negatives, false positives and false negatives, respectively.

Similarly, IoU is computed as

$$\text{IoU} = \frac{TP}{TP + FP + FN},$$

where $TP$, $TN$, $FP$, and $FN$ are the true positives, true negatives, false positives, and false negatives, respectively. In building and road detection, the relaxed versions of the F1 and IoU metrics are used. The relaxed Precision is the fraction of predicted building pixels that are within a radius of $\rho$ pixels of a ground-truth building pixel, whereas the relaxed Recall is the fraction of ground-truth building pixels that are within $\rho$ pixels of a predicted building pixel. The value of $\rho$ is set to $\rho$ = 3 in all the experiments, which is identical to previous work.
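The strict and relaxed measures described in this section can be sketched on small binary masks as follows. The sketch makes one assumption that the text does not settle: "within a radius" is implemented as a square window of half-width equal to the slack (Chebyshev distance). A Euclidean radius would be an equally plausible reading. All function names are illustrative.

```python
def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN between two binary masks (lists of rows)."""
    tp = tn = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy_and_iou(tp, tn, fp, fn):
    """Strict accuracy and intersection over union from the counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    union = tp + fp + fn
    return accuracy, (tp / union if union else 0.0)


def has_neighbor(mask, row, col, slack):
    """True if mask has a positive pixel within `slack` pixels of
    (row, col), using a square window (an assumption)."""
    height, width = len(mask), len(mask[0])
    for r in range(max(0, row - slack), min(height, row + slack + 1)):
        for c in range(max(0, col - slack), min(width, col + slack + 1)):
            if mask[r][c]:
                return True
    return False


def positive_pixels(mask):
    """Coordinates of all positive pixels in a binary mask."""
    return [(r, c) for r, row in enumerate(mask)
            for c, value in enumerate(row) if value]


def relaxed_scores(pred, truth, slack=3):
    """Relaxed precision, recall and F1 with the given slack in pixels."""
    pred_px, truth_px = positive_pixels(pred), positive_pixels(truth)
    hit_pred = sum(has_neighbor(truth, r, c, slack) for r, c in pred_px)
    hit_truth = sum(has_neighbor(pred, r, c, slack) for r, c in truth_px)
    precision = hit_pred / len(pred_px) if pred_px else 0.0
    recall = hit_truth / len(truth_px) if truth_px else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy one-row masks: one true building pixel, two predicted pixels.
truth = [[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]]
pred = [[0, 0, 0, 0, 1, 0, 0, 0, 0, 1]]
strict = confusion_counts(pred, truth)
relaxed = relaxed_scores(pred, truth, slack=3)
```

In the toy example no predicted pixel coincides with the ground truth, so the strict IoU is zero. With a slack of 3, one of the two predicted pixels lies close enough to count, giving a relaxed precision of 0.5 and a relaxed recall of 1.0.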

| Network | Accuracy | Relaxed F1 | Relaxed IoU |
|---|---|---|---|
| Mnih & Hinton | - | 92.11 | - |
| Saito et al. | - | [94.88] | - |
| ELU-FCN-CRF | - | 93.93 | 89.08 |
| Dual Path Network | - | 94.23 | - |
| DeepLab v3+ | 92.13 | 92.65 | 86.31 |
| PSPNet | 90.9 | 89.52 | 81.2 |
| PSPNet + adv. | 91.02 (+0.12) | 91.17 (+1.65) | 83.78 (+2.58) |
| FC-DenseNet | [93.18] | 94.33 | [89.27] |
| FC-DenseNet + adv. | **93.45** (+0.27) | **95.59** (+1.26) | **91.55** (+2.28) |

TABLE I: Performance of different methods on the Massachusetts Building dataset. Best results are presented in bold, second best are between [ ] brackets. Results in ( ) parentheses are improvements with adversarial + cross-entropy loss.
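The improvements listed in parentheses can be recomputed directly from the table entries. The short sketch below is only an arithmetic check. The dictionary layout and function name are illustrative, and the scores are copied from Table I in the order Accuracy, Relaxed F1, Relaxed IoU.

```python
# Scores copied from Table I: (Accuracy, Relaxed F1, Relaxed IoU).
baseline = {
    "PSPNet": (90.9, 89.52, 81.2),
    "FC-DenseNet": (93.18, 94.33, 89.27),
}
with_adversarial = {
    "PSPNet": (91.02, 91.17, 83.78),
    "FC-DenseNet": (93.45, 95.59, 91.55),
}


def improvements(before, after):
    """Per-metric gain of each network, rounded to two decimals."""
    return {
        name: tuple(round(a - b, 2)
                    for a, b in zip(after[name], before[name]))
        for name in before
    }


gains = improvements(baseline, with_adversarial)
```

For both networks the largest gain is in relaxed IoU, followed by relaxed F1, with accuracy changing the least.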

IV Results

To compare the results against previous methods, we measure the performance across different metrics. The results are summarized in Table I. From this table, it is evident that Fully Convolutional DenseNet (FC-DenseNet) with adversarial loss offers the best performance. We observe that with sufficient data augmentation, we are able to produce results that are competitive with other state-of-the-art methods. The addition of adversarial loss to the segmentation task consistently offers better performance across all the metrics. We observe this positive trend with both DenseNet and PSPNet. Note that the first 3 methods in Table I deploy CRFs as a post-processing step. Compared to dual-path networks, our method is significantly less expensive to train and to run at inference, since dual-path networks use the parameter-intensive AlexNet and VGGNet to learn global and local features. During inference, the discriminator network is not used, and hence our method has the same computational cost as running a standard segmentation network. From the qualitative results, it can be seen that the addition of the adversarial loss fills in missing regions more coherently than the standard cross-entropy loss. This effect is visualized in Figure 3. The proposed system partly acts as an inpainting network, where a prior segmentation is generated using the cross-entropy loss and is simultaneously reconstructed by the adversarial loss.

V Conclusion

We have proposed a loss function to train CNNs for semantic segmentation of aerial imagery. The proposed loss function, which is a combination of the adversarial and cross-entropy losses, consistently improves performance without any additional cost during inference. We conclude that the addition of the adversarial loss improves the overall structure and produces a more coherent output that takes the context into consideration. Furthermore, our method has been evaluated across commonly used metrics and a comparison with state-of-the-art methods is provided. Finally, the proposed method outperforms the state-of-the-art results on the Massachusetts building dataset with a relaxed F1 score of 95.59% without any additional post-processing techniques.


  • [1] G. Forlani, C. Nardinocchi, M. Scaioni, and P. Zingaretti, “Complete classification of raw lidar data and 3d reconstruction of buildings,” Pattern Analysis and Applications, vol. 8, no. 4, pp. 357–374, 2006.
  • [2] E. Frontoni, K. Khoshelham, C. Nardinocchi, S. Nedkov, and P. Zingaretti, “Comparative analysis of automatic approaches to building detection from multi-source aerial data,” in Proceedings GEOBIA 2008-Pixels, Objects, Intelligence GEOgraphic Object Based Image Analysis for the 21st Century, Calgary, Canada, 5-8 August 2008; IAPRS, XXXVIII (4/C1), 2008. International Society of Photogrammetry and Remote Sensing (ISPRS), 2008.
  • [3] A. Ok, “Robust detection of buildings from a single color aerial image,” Proceedings of GEOBIA 2008, p. 6, 2008.
  • [4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [5] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 11–19.
  • [6] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [8] V. Mnih and G. E. Hinton, “Learning to detect roads in high-resolution aerial images,” in European Conference on Computer Vision. Springer, 2010, pp. 210–223.
  • [9] S. Saito and Y. Aoki, “Building and road detection from large aerial imagery,” in SPIE/IS&T Electronic Imaging. International Society for Optics and Photonics, 2015, pp. 94 050K–94 050K.
  • [10] C. Sebastian, B. Boom, T. van Lankveld, E. Bondarev, and P. H. N. De With, “Bootstrapped cnns for building segmentation on rgb-d aerial imagery,” ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. IV-4, pp. 187–192, 2018.
  • [11] J. Huang, X. Zhang, Q. Xin, Y. Sun, and P. Zhang, “Automatic building extraction from high-resolution aerial images and lidar data using gated residual refinement network,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 151, pp. 91–105, 2019.
  • [12] Y. Liu, B. Fan, L. Wang, J. Bai, S. Xiang, and C. Pan, “Semantic labeling in very high resolution images via a self-cascaded convolutional neural network,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 145, pp. 78–95, 2018.
  • [13] A. Marcu, “A local-global approach to semantic segmentation in aerial images,” CoRR, vol. abs/1607.05620, 2016.
  • [14] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, “Semantic segmentation using adversarial networks,” in NIPS Workshop on Adversarial Training, 2016.
  • [15] R. Uittenbogaard, C. Sebastian, J. Vijverberg, B. Boom, and P. H. N. de With, “Conditional Transfer with Dense Residual Attention: Synthesizing traffic signs from street-view imagery,” in 2018 24th International Conference on Pattern Recognition (ICPR), Aug 2018, pp. 553–559.
  • [16] V. Mnih, “Machine learning for aerial image labeling,” Ph.D. dissertation, University of Toronto, 2013.
  • [17] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
  • [18] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [19] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [21] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ser. ICML’15, 2015, pp. 448–456.
  • [22] M. Ghafoorian, C. Nugteren, N. Baka, O. Booij, and M. Hofmann, “El-gan: embedding loss driven generative adversarial networks for lane detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0–0.