Encoded Hourglass Network for Semantic Segmentation of High Resolution Aerial Imagery

10/30/2018 ∙ by Panfeng Li, et al.

Fully Convolutional Networks (FCNs) have been widely used in recent work for semantic segmentation of high resolution aerial imagery. However, FCNs are poor at extracting multi-scale features and exploiting contextual information. In this paper, we explore a stacked encoder-decoder structure, which enables repeated bottom-up, top-down inference across various scales and consolidates global and local information of the image. Moreover, we utilize the Context Encoding Module to capture the global contextual semantics of scenes and selectively emphasize or de-emphasize class-dependent featuremaps. Our approach is further enhanced by intermediate supervision on the predictions of multiple decoders and achieves 87.01% pixel accuracy on the Potsdam test set, surpassing various baseline models.




1 Introduction

Semantic segmentation has become one of the most important problems in remotely sensed aerial imagery analysis. In particular, it can be applied to change detection, urban planning, or even automatic map making. Compared to semantic segmentation of natural images, the task can be much more challenging for remotely sensed high resolution aerial imagery because of the high spatial resolution and the large number of pixels. Furthermore, aerial images are generally taken from above, for example by drones, so the viewing perspective differs from that of natural images. To achieve good performance on semantic segmentation of high resolution aerial imagery, a segmentation model should have the following two characteristics:


  • Extraction of rich features across multi-scales to capture relatively small objects.

  • Utilization of contextual semantics which is significant for distinguishing ground objects from the top side view.

Recently, various deep neural network structures have been utilized for semantic segmentation in aerial imagery. In [1, 2, 3], Fully Convolutional Networks (FCN) [4] are used as the network backbone. Audebert et al. [5] further utilize SegNet [6] for the segmentation task, an adaptation of FCN that replaces the decoder with a series of pooling and convolutional layers. To overcome the loss of information from the initial layers and combine features from different scales, skip connections have been used in [7, 8] for high resolution imagery segmentation. However, is the use of skip connections good enough for extracting rich multi-scale features? Recent work [9] shows that stacking multiple encoder-decoder structures end-to-end, which enables repeated bottom-up, top-down inference across various scales, greatly enhances network performance. In this paper, we adapt the Stacked Hourglass Architecture (SHG) proposed by [9] as the backbone of our network, which has the power to extract rich multi-scale features.

Besides multi-scale features, it is important to learn global contextual information for semantic segmentation of high resolution imagery. Recent work [10] combines a classic computer vision encoder, Bag-of-Words (BoW) [11], with deep learning and introduces the Encoding Layer [12], which captures the global contextual semantics of the whole image. The proposed EncNet [10], based on the Encoding Layer, achieved new state-of-the-art results on multiple semantic segmentation datasets. However, EncNet is based on FCN, which does not exploit the stacked encoder-decoder structure. As a result, EncNet does not reuse the featuremaps attended by encoded semantics, and it lacks intermediate supervision on the predictions of multiple encoders, which has demonstrated strong performance in prior work [9, 13, 14].

In this paper, we develop a novel Encoded Hourglass Network (EnHGNet). We utilize SHG as the backbone of our network, and after the class-dependent featuremaps are emphasized or de-emphasized by contextual semantics, we reuse them in later stacks of Hourglass Modules, which adds more feedback loops for learning contextual semantics through intermediate supervision. Our EnHGNet can both capture rich multi-scale features and exploit contextual information, which addresses the difficulties of semantic segmentation of high resolution aerial imagery.

2 Approach

Figure 1: (a) Each white box corresponds to a wide-dropout residual block. Blue circles are the intermediate predictions, whereas the yellow one is the final prediction. A loss is applied to all these predictions using the same ground truth. The region in the dashed orange box represents the encoding procedure, where the red rhombus is the context layer and the pink box is the branch for SE-loss. (b) The Encoding Layer contains a codebook and smoothing factors, capturing encoded semantics. The top branch predicts scaling factors that selectively highlight class-dependent featuremaps. The bottom branch predicts the presence of the categories in the scene. (Notation: FC, fully connected layer; $\otimes$, channel-wise multiplication.)

In this section, we will first review two related structures, namely Stacked Hourglass Architecture [9] and Context Encoding Module [10] and then introduce our EnHGNet.

2.1 Stacked Hourglass Architecture

Stacked Hourglass Architecture (SHG) [9] is a sequence of modules each shaped like an hourglass. Each Hourglass Module first processes features down to a very low resolution with a set of convolutional and pooling layers, then repeatedly applies bilinear upsampling and combines features until it reaches the final output resolution. The Stacked Hourglass Architecture connects multiple Hourglass Modules end-to-end, which enables repeated bottom-up, top-down inference across various scales and consolidates global and local information of the whole image. As a result, the performance of the network is greatly enhanced.

In our implementation, we replace the residual block [15] used in the original architecture with the wide-dropout residual block [16], which prevents overfitting during training and improves the network performance. We stack 4 Hourglass Modules in our network, and within each Hourglass Module the number of output features is 128, 128, 256, and 256 at the successive locations where the resolution drops.
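The repeated bottom-up, top-down flow described above can be sketched as a toy, one-dimensional example. This is only an illustration under simplifying assumptions: the function names are hypothetical, average pooling and linear interpolation stand in for the pooling and bilinear upsampling, and the residual blocks, feature counts, and intermediate predictions of the real network are not modeled.

```python
def avg_pool(x):
    """Halve the resolution by averaging neighbouring pairs."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def upsample_linear(x, out_len):
    """Linear interpolation to out_len samples (1-D analogue of bilinear)."""
    if len(x) == 1 or out_len == 1:
        return [x[0]] * out_len
    out = []
    for i in range(out_len):
        pos = i * (len(x) - 1) / (out_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(x) - 1)
        frac = pos - lo
        out.append(x[lo] * (1.0 - frac) + x[hi] * frac)
    return out


def hourglass(x, depth):
    """One hourglass: pool down `depth` times, then upsample and add skips."""
    if depth == 0 or len(x) < 2:
        return list(x)
    skip = list(x)                           # branch kept at this resolution
    low = hourglass(avg_pool(x), depth - 1)  # process at half resolution
    up = upsample_linear(low, len(x))        # return to this resolution
    return [s + u for s, u in zip(skip, up)]


def stacked_hourglass(x, num_stacks=4, depth=4):
    """Chain hourglasses end-to-end and keep every stack's output."""
    outputs = []
    for _ in range(num_stacks):
        x = hourglass(x, depth)
        outputs.append(x)
    return outputs
```

Each entry of the returned list has the input resolution, which is what allows a prediction (and hence a loss) to be attached after every stack.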

2.2 Context Encoding Module

Beyond pixel-level information, the use of contextual information is a key point for semantic labeling. Contextual relationships provide significant clues from neighboring objects. For example, a pedestrian is likely to appear on a road and a book is likely to appear on a table. To learn the global semantic context, the Encoding Layer [12] is used to capture the context statistics [10]. The Encoding Layer includes a learnable inherent dictionary, which stores the semantic context information, and a set of scaling factors, which attend to the featuremaps of different classes. The following part reviews the details of the Encoding Layer.

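The input of the Encoding Layer is a featuremap with $C$ channels and spatial size $H \times W$, which corresponds to a set of $C$-dimensional input features $X = \{x_1, \dots, x_N\}$, where $N$ is the total number of features and $N = H \times W$. The layer has a learnable inherent codebook $D = \{d_1, \dots, d_K\}$ containing $K$ codewords (visual centers) and a set of smoothing factors $S = \{s_1, \dots, s_K\}$ of the visual centers. The output of the Encoding Layer is the residual encoder $E = \{e_1, \dots, e_K\}$, consisting of $K$ vectors of dimension $C$, where $e_k = \sum_{i=1}^{N} e_{ik}$ aggregates the residuals with soft-assignment weights, namely

$$e_{ik} = \frac{\exp\left(-s_k \|r_{ik}\|^2\right)}{\sum_{j=1}^{K} \exp\left(-s_j \|r_{ij}\|^2\right)} \, r_{ik},$$

where the residuals are given by $r_{ik} = x_i - d_k$. The final encoded semantics is summed up over the residual encoders, namely $e = \sum_{k=1}^{K} \phi(e_k)$, where $\phi$ denotes ReLU activation.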
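As a rough illustration of the Encoding Layer computation reviewed above (residuals between features and codewords, soft-assignment weights controlled by the smoothing factors, aggregation per codeword, then a ReLU and a sum over codewords), here is a minimal pure-Python sketch. The function name and the list-of-lists data layout are assumptions made for readability; the codebook and smoothing factors are passed in as fixed values rather than learned.

```python
import math


def encoding_layer(features, codebook, smoothing):
    """features: N vectors of length C; codebook: K vectors of length C;
    smoothing: K positive factors.
    Returns (E, e): the K residual encoders and the encoded semantics."""
    K, C = len(codebook), len(codebook[0])
    E = [[0.0] * C for _ in range(K)]
    for x in features:
        # Residual of this feature to every codeword.
        residuals = [[xc - dc for xc, dc in zip(x, d)] for d in codebook]
        # Soft-assignment logits: -s_k * ||r_ik||^2.
        logits = [-s * sum(r * r for r in res)
                  for s, res in zip(smoothing, residuals)]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        for k in range(K):
            w = weights[k] / total
            for c in range(C):
                E[k][c] += w * residuals[k][c]
    # Encoded semantics: ReLU on each residual encoder, summed over codewords.
    e = [sum(max(E[k][c], 0.0) for k in range(K)) for c in range(C)]
    return E, e
```

For instance, with two features that coincide with the two codewords, each feature contributes almost nothing to its own codeword and a small weighted residual to the other one, so the residual encoders stay close to zero.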

Two further branches are applied upon the Encoding Layer. One stacks a fully connected layer on it with a sigmoid activation function and outputs the scaling factors $\gamma = \delta(W e)$, where $e$ is the encoded semantics, $W$ denotes the weights of the fully connected layer, and $\delta$ is the sigmoid function. The output of the Encoding Module is then obtained by a channel-wise multiplication between the input featuremaps $X$ and the scaling factors $\gamma$, namely $Y = X \otimes \gamma$, which emphasizes or de-emphasizes class-dependent featuremaps. The other branch also stacks a fully connected layer with a sigmoid activation function on the Encoding Layer; it outputs individual predictions for the presence of object categories in an image and is trained with a binary cross entropy loss, namely the Semantic Encoding loss (SE-loss) [10]. Unlike the pixel-level softmax loss, SE-loss treats big and small objects equally, because it only considers whether a category is present, not how many pixels or instances it covers. The SE-loss forces the network to understand the global semantic information and regularizes the training of the Context Encoding Module.
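The attention branch described above can be sketched in a few lines: a fully connected layer followed by a sigmoid turns the encoded semantics into one scaling factor per channel, and each channel of the featuremap is multiplied by its factor. The function names and the channel-first nested-list layout are assumptions for illustration, and no bias term is included because the text mentions only the weights.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def scaling_factors(e, W):
    """gamma = sigmoid(W e); W has one row per featuremap channel."""
    return [sigmoid(sum(w * v for w, v in zip(row, e))) for row in W]


def channel_scale(featuremap, gamma):
    """featuremap[c] is the H x W map of channel c; scale it by gamma[c]."""
    return [[[gamma[c] * v for v in row] for row in chan]
            for c, chan in enumerate(featuremap)]
```

Because every factor lies in (0, 1), a channel is never amplified beyond its input; emphasis is relative, with less relevant channels pushed closer to zero.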

2.3 Encoded Hourglass Network (EnHGNet)

Combining the Stacked Hourglass Architecture and the Context Encoding Module, we build an Encoded Hourglass Network (EnHGNet), an overview of which is shown in Fig. 1. The Encoding Layers are placed where the feature maps reach the size of the original input, shown as the red rhombus in Fig. 1(a). This resolution retains more details without increasing the number of parameters. Compared to EncNet [10], our EnHGNet has the following differences.

1) We utilize SHG as the backbone of our network, whereas EncNet is based on FCN. Compared with FCN, SHG enables repeated bottom-up, top-down inference across various scales and consolidates global and local information of the whole image.

2) EncNet stacks the Context Encoding Module on top of convolutional layers right before the final prediction, so it does not reuse the featuremaps attended by encoded semantics. Here we have two Context Encoding Modules in one Hourglass Module, and every Encoding Layer shares the same codebook and smoothing factors. After the featuremaps are emphasized or de-emphasized, we reuse them in later stacks of Hourglass Modules, which adds more feedback loops for learning contextual semantics through intermediate supervision.

3) EncNet adds SE-loss to two stages of its base network, whereas we calculate two SE-losses in each Hourglass Module and sum the SE-losses of all Hourglass Modules as the final SE-loss. This procedure also helps produce more accurate individual predictions for the presence of object categories.

Figure 2: Results on Potsdam test set. EnHGNet produces more accurate predictions. For example, in the 1st and 3rd images, EnHGNet captures the shallow red and white regions well and in the last two images, EnHGNet predicts more consistent results.

3 Experimental Results

In this section, we first briefly introduce the dataset, then describe the implementation details and show our final results.

3.1 Data

The Potsdam 2D segmentation dataset is provided by Commission II of ISPRS [17]. The dataset includes 38 high resolution aerial images, of which 24 are used for training and the remaining 14 for testing. Each image has a resolution of $6000 \times 6000$ pixels and the ground sampling distance is 5 cm. The image data contain five channels, namely near-infrared (NIR), red (R), green (G), blue (B), and the normalized digital surface model (nDSM).

We use a sliding window method to extract fixed-size patches from the original images without overlap, padding with 0s if needed. We further split the original training images into training and validation sets with a ratio of 9:1. In the final dataset, there are 12,441 images for training, 1,383 images for validation, and 8,064 images for testing.
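The patch counts above can be checked with simple arithmetic. The sketch below assumes that each Potsdam image is 6000 × 6000 pixels and that the patch side is 256 pixels. The patch size is not given in the text here, and any side from 250 to 260 pixels would yield the same counts, so 256 is only a plausible guess and not a value confirmed by the source.

```python
import math

IMAGE_SIDE = 6000  # assumed Potsdam image side in pixels
PATCH_SIDE = 256   # assumed patch side; not stated in the text


def patches_per_image(image_side, patch_side):
    """Non-overlapping sliding window, zero-padding the last row and column."""
    per_axis = math.ceil(image_side / patch_side)
    return per_axis * per_axis


per_image = patches_per_image(IMAGE_SIDE, PATCH_SIDE)  # 24 * 24 = 576
train_val_total = 24 * per_image                       # 13824 patches
test_total = 14 * per_image                            # 8064 patches

# 9:1 split of the training patches, rounding the training share down.
train_count = train_val_total * 9 // 10                # 12441
val_count = train_val_total - train_count              # 1383
```

Under these assumptions the numbers agree with the reported 12,441 training, 1,383 validation, and 8,064 test patches.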

3.2 Implementation Details

We implement our network with the open-source toolbox TensorFlow [18] and train it on 4 NVIDIA GTX 1080 Ti GPUs. Each GPU processes a batch of 4 images, so the total batch size is 16. The training data are randomly shuffled and we drop the remainder of the last batch. We use the learning rate scheduling of prior work [19, 10]. The Adam optimizer [20] is used for optimization, with the base learning rate and the power of the schedule set as hyperparameters. The image data in the Potsdam dataset have 5 channels, for which pretrained weights are unavailable. We therefore follow a procedure similar to [10]: we first pretrain the network without the Context Encoding Module for 100 epochs, then restore the pretrained weights and train our EnHGNet for another 100 epochs.
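The learning rate scheduling used in the cited prior work [19, 10] is the polynomial ("poly") decay, in which the rate falls from its base value to zero as training progresses. A minimal sketch follows; the function name is hypothetical, and the base learning rate and power used in the usage note are placeholders, not the values used by the authors.

```python
def poly_learning_rate(step, total_steps, base_lr, power):
    """Poly schedule: base_lr * (1 - step / total_steps) ** power."""
    progress = min(step, total_steps) / total_steps
    return base_lr * (1.0 - progress) ** power
```

For example, `poly_learning_rate(step, 10000, 1e-3, 0.9)` would start at the placeholder base rate of 1e-3 and decay smoothly to zero at step 10000.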

For data augmentation, we randomly flip the image horizontally and vertically, then scale it by a factor between 0.5 and 2, and finally crop it to a fixed size, padding with 0s if needed.
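The augmentation pipeline can be sketched on a single-channel image stored as a list of rows. Several details here are assumptions, since the text does not specify them: each flip is applied with probability 0.5, the scale is drawn uniformly, resizing uses nearest-neighbour sampling, and the crop position is random. In practice the same transform would also have to be applied to the label map.

```python
import random


def augment(image, crop_size, rng):
    """Random flips, random rescale in [0.5, 2], then a fixed-size crop
    that is zero-padded when the scaled image is smaller than the crop."""
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]  # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1]                   # vertical flip
    scale = rng.uniform(0.5, 2.0)
    h, w = len(image), len(image[0])
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    # Nearest-neighbour resize (assumed; the interpolation is not stated).
    scaled = [[image[min(h - 1, int(i / scale))][min(w - 1, int(j / scale))]
               for j in range(new_w)] for i in range(new_h)]
    top = rng.randint(0, max(0, new_h - crop_size))
    left = rng.randint(0, max(0, new_w - crop_size))
    out = [[0] * crop_size for _ in range(crop_size)]
    for i in range(min(crop_size, new_h - top)):
        for j in range(min(crop_size, new_w - left)):
            out[i][j] = scaled[top + i][left + j]
    return out
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible for a given seed.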

The ground truth for SE-loss is a binary vector whose length equals the number of categories, where each entry indicates whether that category is present in the image. The final loss is a weighted sum of the per-pixel softmax loss and the SE-loss. For training EnHGNet, we follow prior work [10] and use 32 codewords in the Encoding Layers and set the weight for SE-loss to 0.2.
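A small sketch of how the SE-loss target and the combined loss could be computed is given below. The helper names are hypothetical, the binary cross entropy is averaged over categories (the reduction is not specified in the text), and the per-pixel softmax loss is treated as an already computed number.

```python
import math


def se_target(label_map, num_classes):
    """Binary vector: 1.0 where the class occurs anywhere in the label map."""
    present = {label for row in label_map for label in row}
    return [1.0 if c in present else 0.0 for c in range(num_classes)]


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross entropy between predicted presences and targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def total_loss(pixel_loss, se_losses, se_weight=0.2):
    """Per-pixel loss plus the weighted sum of all SE-loss terms."""
    return pixel_loss + se_weight * sum(se_losses)
```

Note that a class covering a single pixel and a class covering half the image produce the same target entry, which is why the SE-loss weighs small and large objects equally.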

3.3 Results on Potsdam Dataset

| Method | Backbone | pixAcc (%) | mIoU (%) |
|---|---|---|---|
| FCN [4] | VGG-16 | 82.75 | 61.71 |
| SegNet [6] | VGG-16 | 83.93 | 63.42 |
| SHG [9] | Hourglass-104 | 85.38 | 67.26 |
| EncNet [10] | ResNet-101 | 86.37 | 69.04 |
| EnHGNet (ours) | Hourglass-104 | 87.01 | 69.78 |
| EnHGNet* (ours) | Hourglass-104 | 86.97 | 69.95 |

Table 1: Segmentation results on Potsdam test set. *: Dilated convolution [21] used in residual blocks.

We train our EnHGNet on the training set and evaluate it on the test set using two standard metrics, namely pixAcc and mIoU. The validation set is used for adjusting hyperparameters in the network.
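For reference, the two metrics can be computed from a predicted label map and a ground-truth label map as in the sketch below. The function name is hypothetical; classes that appear in neither map are skipped when averaging IoU, which is one common convention and an assumption here, and ignore labels are not handled.

```python
def segmentation_metrics(pred, truth, num_classes):
    """Return (pixAcc, mIoU) for two label maps given as lists of rows."""
    correct = total = 0
    inter = [0] * num_classes
    union = [0] * num_classes
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            total += 1
            if p == t:
                correct += 1
                inter[p] += 1
                union[p] += 1
            else:
                union[p] += 1  # pixel counted in the predicted class
                union[t] += 1  # and in the true class
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return correct / total, sum(ious) / len(ious)
```

pixAcc rewards getting large regions right, while mIoU averages over classes and is therefore more sensitive to errors on small classes.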

We use FCN [4], SegNet [6], SHG [9] and EncNet [10] as our baseline models. FCN is the most widely used framework for semantic segmentation, and SegNet is an adaptation of FCN that replaces the decoder with a series of pooling and convolution layers. SHG and EncNet are compared with our EnHGNet to show the effectiveness of the Context Encoding Module and the Stacked Hourglass Architecture, respectively.

EnHGNet achieves 87.01% pixAcc and 69.78% mIoU on the Potsdam test set, and EnHGNet* (with dilated convolution in the residual blocks) achieves 86.97% pixAcc and 69.95% mIoU; both outperform all baseline models. The numerical results are shown in Table 1. We also show some visual examples in Fig. 2.

4 Conclusion

We develop a novel Encoded Hourglass Network (EnHGNet) for semantic segmentation of high resolution aerial imagery. EnHGNet can both extract rich multi-scale features of the image and learn the contextual semantics in scenes. This is achieved by repeated bottom-up, top-down inference across various scales and by selectively highlighting the class-dependent featuremaps. Our EnHGNet also utilizes intermediate supervision to enhance the performance. On the Potsdam test set, EnHGNet outperforms all baseline models in both pixAcc and mIoU.


This work was funded by the Center for Space and Earth Science at Los Alamos National Laboratory.


  • [1] M. Volpi and D. Tuia, “Dense semantic labeling of subdecimeter resolution images with convolutional neural networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 2, pp. 881–893, Feb 2017.
  • [2] P. Kaiser, J. D. Wegner, A. Lucchi, M. Jaggi, T. Hofmann, and K. Schindler, “Learning aerial image segmentation from online maps,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 11, pp. 6054–6068, Nov 2017.
  • [3] Y. Liu, S. Piramanayagam, S. T. Monteiro, and E. Saber, “Dense semantic labeling of very-high-resolution aerial imagery and lidar with fully-convolutional neural networks and higher-order crfs,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1561–1570.
  • [4] Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
  • [5] Nicolas Audebert, Bertrand Le Saux, and Sébastien Lefèvre, “Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks,” ISPRS Journal of Photogrammetry and Remote Sensing, 2017.
  • [6] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, Dec 2017.
  • [7] J. Li, W. Ding, H. Li, and C. Liu, “Semantic segmentation for high-resolution aerial imagery using multi-skip network and markov random fields,” in 2017 IEEE International Conference on Unmanned Systems (ICUS), Oct 2017, pp. 12–17.
  • [8] Lichao Mou and Xiao xiang Zhu, “Vehicle instance segmentation from aerial image and video using a multi-task learning residual fully convolutional network,” CoRR, vol. abs/1805.10485, 2018.
  • [9] Alejandro Newell, Kaiyu Yang, and Jia Deng, “Stacked hourglass networks for human pose estimation,” in ECCV, 2016.
  • [10] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal, “Context encoding for semantic segmentation,” in CVPR, 2018.
  • [11] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in ECML, 1998.
  • [12] Hang Zhang, Jia Xue, and Kristin Dana, “Deep ten: Texture encoding network,” in CVPR, 2017.
  • [13] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh, “Convolutional pose machines,” in CVPR, 2016.
  • [14] João Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik, “Human pose estimation with iterative error feedback,” in CVPR, 2016.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [16] Sergey Zagoruyko and Nikos Komodakis, “Wide residual networks,” in BMVC, 2016.
  • [17] F. Rottensteiner, G. Sohn, J. Jung, M. Gerke, C. Baillard, S. Benitez, and U. Breitkopf, “The ISPRS benchmark on urban object classification and 3d building reconstruction,” ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 1, no. 3, pp. 293–298, 2012.
  • [18] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, and Zhifeng Chen, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” in arXiv:1603.04467, 2016.
  • [19] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia, “Pyramid scene parsing network,” in CVPR, 2017.
  • [20] Diederik P. Kingma and Jimmy Lei Ba, “Adam: A method for stochastic optimization,” in ICLR, 2014.
  • [21] Fisher Yu and Vladlen Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.