Semantic segmentation has become one of the most important problems in the analysis of remotely sensed aerial imagery. In particular, it can be applied to change detection, urban planning, and automatic mapping. Compared with semantic segmentation of natural images, the task is considerably more challenging for remotely sensed aerial imagery due to the high spatial resolution and the large number of pixels. Furthermore, aerial images are generally taken from above, e.g., by drones, so the viewing perspective differs from that of natural images. To achieve good performance on semantic segmentation of high resolution aerial imagery, a segmentation model should have the following two characteristics:
Extraction of rich features across multiple scales to capture relatively small objects.
Utilization of contextual semantics, which is crucial for distinguishing ground objects from a top-down view.
Recently, various deep neural network structures have been applied to semantic segmentation of aerial imagery. In [1, 2, 3], Fully Convolutional Networks (FCN) are used as the backbone. Audebert et al. further utilize SegNet, an adaptation of FCN that replaces the decoder with a series of pooling and convolutional layers. To mitigate the loss of information from the initial layers and to combine features from different scales, skip connections are used in [7, 8] for high resolution imagery segmentation. However, are skip connections sufficient for extracting rich multi-scale features? Recent work shows that stacking multiple encoder-decoder structures end-to-end, which enables repeated bottom-up, top-down inference across scales, greatly enhances network performance. In this paper, we adopt the Stacked Hourglass Architecture (SHG) as the backbone of our network, which has the capacity to extract rich multi-scale features.
Besides multi-scale features, it is important to learn global contextual information for semantic segmentation of high resolution imagery. Recent work leverages a classic computer vision encoder, Bag-of-Words (BoW), within deep learning and introduces the Encoding Layer, which captures the global contextual semantics of the whole image. The proposed EncNet, built on the Encoding Layer, achieved new state-of-the-art results on multiple semantic segmentation datasets. However, EncNet is based on FCN and does not exploit a stacked encoder-decoder structure. As a result, EncNet neither reuses the featuremaps attended by the encoded semantics nor applies intermediate supervision to the predictions of multiple encoders, both of which have demonstrated strong performance in prior work [9, 13, 14].
In this paper, we develop a novel Encoded Hourglass Network (EnHGNet). We use SHG as the backbone of our network; after the class-dependent featuremaps are emphasized or de-emphasized by the contextual semantics, we reuse them in later stacks of Hourglass Modules, which adds feedback loops for learning contextual semantics through intermediate supervision. EnHGNet can both capture rich multi-scale features and exploit contextual information, addressing the difficulties of semantic segmentation of high resolution aerial imagery.
2.1 Stacked Hourglass Architecture
The Stacked Hourglass Architecture (SHG) is a sequence of modules, each shaped like an hourglass. Each Hourglass Module first processes features down to a very low resolution with a set of convolutional and pooling layers, then repeatedly bilinearly upsamples and combines features until the final output resolution is reached. The Stacked Hourglass Architecture connects multiple Hourglass Modules end-to-end, which enables repeated bottom-up, top-down inference across scales and consolidates global and local information of the whole image, greatly enhancing network performance.
In our implementation, we replace the residual block used in the original architecture with the wide-dropout residual block, which prevents overfitting during training and improves performance. We stack 4 Hourglass Modules, and the number of output features in each Hourglass Module is 128, 128, 256, and 256 at the successive locations where the resolution drops.
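The bottom-up, top-down computation of a Hourglass Module can be illustrated with a minimal NumPy sketch. This is a structural illustration only, not the paper's implementation: the convolutional residual blocks are replaced by an identity `block`, pooling by 2x2 average pooling, and the bilinear upsampling by nearest-neighbour repetition.

```python
import numpy as np

def pool2x(x):
    # 2x2 average pooling over the spatial dims of an (H, W, C) featuremap
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample2x(x):
    # nearest-neighbour stand-in for the bilinear upsampling in the paper
    return x.repeat(2, axis=0).repeat(2, axis=1)

def hourglass(x, depth, block=lambda f: f):
    # one Hourglass Module: pool down `depth` times, then upsample and
    # combine with the skip branch at every scale
    skip = block(x)
    down = block(pool2x(x))
    inner = hourglass(down, depth - 1, block) if depth > 1 else block(down)
    return skip + upsample2x(inner)

def stacked_hourglass(x, num_stacks=4, depth=4):
    # connect modules end-to-end; the real network also takes an
    # intermediate prediction from each stack for supervision
    for _ in range(num_stacks):
        x = hourglass(x, depth)
    return x
```

With `depth=4` the spatial resolution must be divisible by 16; the skip additions at every scale are what let the module consolidate global and local information.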
2.2 Context Encoding Module
Beyond pixel-level information, utilizing contextual information is a key point for semantic labeling. Contextual relationships provide significant clues from neighboring objects: for example, a pedestrian is likely to appear on a road and a book on a table. To learn the global semantic context, the Encoding Layer is used to capture context statistics. The Encoding Layer includes a learnable inherent dictionary that stores semantic context information and a set of scaling factors that attend to the featuremaps of different classes. The following reviews the details of the Encoding Layer.
The input of the Encoding Layer is a featuremap of shape $C\times H\times W$, which corresponds to a set of $C$-dimensional input features $X=\{x_1,\dots,x_N\}$, where $N=H\times W$ is the total number of features. The layer has a learnable inherent codebook $D=\{d_1,\dots,d_K\}$ containing $K$ codewords (visual centers) and a set of smoothing factors $S=\{s_1,\dots,s_K\}$ of the visual centers. The output of the Encoding Layer is the residual encoder $E=\{e_1,\dots,e_K\}$ of shape $K\times C$, where $e_k$ aggregates the residuals with soft-assignment weights, namely
$$e_k=\sum_{i=1}^{N} w_{ik}\, r_{ik},\qquad w_{ik}=\frac{\exp\!\left(-s_k\|r_{ik}\|^2\right)}{\sum_{j=1}^{K}\exp\!\left(-s_j\|r_{ij}\|^2\right)},$$
where the residuals are given by $r_{ik}=x_i-d_k$. The final encoded semantics is summed over the residual encoders, namely $e=\sum_{k=1}^{K}\phi(e_k)$, where $\phi$ denotes ReLU activation.
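The Encoding Layer described above can be sketched directly in NumPy. This is a forward-pass illustration of the soft-assignment aggregation, with batch normalization omitted and all shapes assumed.

```python
import numpy as np

def encoding_layer(X, D, s):
    # X: (N, C) flattened featuremap, D: (K, C) codebook,
    # s: (K,) smoothing factors of the visual centers
    # residuals r_ik = x_i - d_k, shape (N, K, C)
    R = X[:, None, :] - D[None, :, :]
    # soft-assignment weights w_ik proportional to exp(-s_k * ||r_ik||^2),
    # normalized over the K codewords (numerically stabilized)
    logits = -s[None, :] * (R ** 2).sum(-1)            # (N, K)
    W = np.exp(logits - logits.max(1, keepdims=True))
    W /= W.sum(1, keepdims=True)
    # aggregate residuals: e_k = sum_i w_ik * r_ik
    E = (W[:, :, None] * R).sum(0)                     # (K, C)
    # final encoded semantics: sum over codewords after ReLU
    return np.maximum(E, 0).sum(0)                     # (C,)
```

In training, `D` and `s` are learnable parameters updated by backpropagation; this sketch only shows the forward computation.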
Two further branches are applied on top of the Encoding Layer. One stacks a fully connected layer with a sigmoid activation and outputs the scaling factors $\gamma=\delta(W e)$, where $W$ denotes the weights of the fully connected layer and $\delta$ is the sigmoid function. The output of the Encoding Module is then obtained by a channel-wise multiplication between the input featuremaps $X$ and the scaling factors $\gamma$, namely $Y=X\otimes\gamma$, which emphasizes or de-emphasizes the class-dependent featuremaps. The other branch also stacks a fully connected layer with a sigmoid activation on the Encoding Layer; it outputs individual predictions for the presence of each object category in the image and is trained with a binary cross-entropy loss, the Semantic Encoding loss (SE-loss). Unlike the pixel-level softmax loss, the SE-loss weights big and small objects equally, since it only considers the existence of a category, not its pixel count or size. The SE-loss forces the network to understand the global semantic information and regularizes the training of the Context Encoding Module.
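The two branches can be sketched as follows. The weight matrices `W_gamma` and `W_se` stand for the two fully connected layers; their shapes and the numeric stabilizer are assumptions of this sketch, not details from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_attention(X, e, W_gamma):
    # X: (H, W, C) featuremap, e: (C,) encoded semantics,
    # W_gamma: (C, C) weights of the scaling-factor branch
    gamma = sigmoid(e @ W_gamma)           # (C,) per-channel scaling factors
    return X * gamma[None, None, :]        # channel-wise multiplication

def se_loss(e, W_se, present):
    # present: binary vector, 1 if a category appears in the image
    p = sigmoid(e @ W_se)                  # (num_classes,) predicted presence
    eps = 1e-7                             # avoid log(0)
    return -np.mean(present * np.log(p + eps)
                    + (1 - present) * np.log(1 - p + eps))
```

Because `present` ignores object size, large and small categories contribute equally to this loss, which is the regularizing effect described above.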
2.3 Encoded Hourglass Network (EnHGNet)
Combining the Stacked Hourglass Architecture and the Context Encoding Module, we build the Encoded Hourglass Network (EnHGNet), an overview of which is shown in Fig. 1. The Encoding Layers are placed where the featuremaps reach the size of the original input, shown as the red rhombi in Fig. 1(a). This resolution retains more details without increasing the number of parameters. Compared to EncNet, our EnHGNet has the following differences.
1) We utilize SHG as the backbone of our network, whereas EncNet is based on FCN. Compared with FCN, SHG enables repeated bottom-up, top-down inference across scales and consolidates global and local information of the whole image.
2) EncNet stacks the Context Encoding Module on top of the convolutional layers right before the final prediction, which does not reuse the featuremaps attended by the encoded semantics. Here we place two Context Encoding Modules in each Hourglass Module, and every Encoding Layer shares the same codebook and smoothing factors. After the featuremaps are emphasized or de-emphasized, we reuse them in later stacks of Hourglass Modules, which adds more feedback loops for learning contextual semantics through intermediate supervision.
3) EncNet adds the SE-loss at two stages of its base network; here we compute two SE-losses in each Hourglass Module and sum the SE-losses of all Hourglass Modules to form the final SE-loss. This procedure also yields more accurate predictions of the presence of object categories.
3 Experimental Results
In this section, we first briefly introduce the dataset, then describe the implementation details and show our final results.
3.1 Dataset
The Potsdam 2D segmentation dataset is provided by Commission II of ISPRS. The dataset includes 38 high resolution aerial images, of which 24 are used for training and the remaining 14 for testing. All images have the same pixel resolution, and the ground sampling distance is 5 cm. The image data contain five channels, namely near-infrared (NIR), red (R), green (G), blue (B), and the normalized digital surface models (nDSMs).
We use a sliding window to extract fixed-size patches from the original images without overlap, padding if needed. We further split the original training images into training and validation sets with a ratio of 9:1. In total, the dataset contains 12,441 images for training, 1,383 for validation, and 8,064 for testing.
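The patch extraction can be sketched as below. The patch size of 256 is an assumption for illustration; the paper's actual patch size is not stated in this excerpt.

```python
import numpy as np

def extract_patches(image, patch=256):
    # zero-pad so both spatial dims are multiples of the patch size
    h, w, c = image.shape
    ph = (-h) % patch
    pw = (-w) % patch
    padded = np.pad(image, ((0, ph), (0, pw), (0, 0)))
    # non-overlapping sliding window over the padded image
    patches = []
    for i in range(0, padded.shape[0], patch):
        for j in range(0, padded.shape[1], patch):
            patches.append(padded[i:i + patch, j:j + patch])
    return np.stack(patches)
```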
3.2 Implementation Details
We implement our network in the open-source toolbox TensorFlow and train it on 4 NVIDIA GTX 1080 Ti GPUs. Each GPU processes a batch of 4 images, for a total batch size of 16. The training data are randomly shuffled and the remainder of the last batch is dropped. The Adam optimizer is used for optimization, with the base learning rate and the decay power set following the learning rate scheduling of prior work [19, 10]. The image data in the Potsdam dataset have 5 channels, for which pretrained weights are unavailable. We therefore follow prior work and first pretrain the network without the Context Encoding Module for 100 epochs, then restore the pretrained weights and train the full EnHGNet for another 100 epochs.
For data augmentation, we randomly flip the image horizontally and vertically, scale it by a factor between 0.5 and 2, and finally crop it to a fixed size, padding with zeros if needed.
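The augmentation pipeline can be sketched as follows. Nearest-neighbour resizing and a crop size of 256 are simplifying assumptions of this sketch; a real pipeline would typically use bilinear resizing for the image channels.

```python
import numpy as np

def augment(image, label, crop=256, rng=np.random.default_rng(0)):
    # random horizontal / vertical flips, applied jointly to image and label
    if rng.random() < 0.5:
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:
        image, label = image[::-1], label[::-1]
    # random scaling between 0.5x and 2x (nearest-neighbour for simplicity)
    scale = rng.uniform(0.5, 2.0)
    h, w = image.shape[:2]
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    ri = np.arange(nh) * h // nh
    ci = np.arange(nw) * w // nw
    image, label = image[ri][:, ci], label[ri][:, ci]
    # zero-pad if smaller than the crop, then take a random crop
    ph, pw = max(0, crop - nh), max(0, crop - nw)
    image = np.pad(image, ((0, ph), (0, pw), (0, 0)))
    label = np.pad(label, ((0, ph), (0, pw)))
    i = rng.integers(0, image.shape[0] - crop + 1)
    j = rng.integers(0, image.shape[1] - crop + 1)
    return image[i:i + crop, j:j + crop], label[i:i + crop, j:j + crop]
```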
The ground truth for the SE-loss is a binary vector whose length equals the number of categories, where each bin indicates whether the corresponding category is present in the image. The final loss is a weighted sum of the per-pixel softmax loss and the SE-loss. For training EnHGNet, we follow prior work and use 32 codewords in the Encoding Layers, and we set the weight of the SE-loss to 0.2.
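The loss construction can be sketched as below; the function names are illustrative only. The SE-loss target is derived from the segmentation labels, and the per-stack SE-losses are summed before the 0.2 weighting described above.

```python
import numpy as np

def pixel_softmax_loss(logits, labels):
    # logits: (H, W, num_classes), labels: (H, W) integer class ids
    z = logits - logits.max(-1, keepdims=True)          # stabilized log-softmax
    logp = z - np.log(np.exp(z).sum(-1, keepdims=True))
    h, w = labels.shape
    return -logp[np.arange(h)[:, None], np.arange(w)[None, :], labels].mean()

def presence_target(labels, num_classes):
    # binary ground-truth vector for the SE-loss: 1 iff the class appears
    return np.isin(np.arange(num_classes), labels).astype(float)

def total_loss(pixel_loss, se_losses, se_weight=0.2):
    # SE-losses from all Hourglass Modules are summed, then weighted
    return pixel_loss + se_weight * sum(se_losses)
```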
3.3 Results on Potsdam Dataset
We train our EnHGNet on the training set and evaluate it on the test set using two standard metrics, pixel accuracy (pixAcc) and mean Intersection-over-Union (mIoU). The validation set is used to tune the hyperparameters of the network.
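For concreteness, the two evaluation metrics can be computed as follows; classes absent from both prediction and ground truth are skipped in the mIoU average, one common convention.

```python
import numpy as np

def pix_acc(pred, gt):
    # fraction of correctly labeled pixels
    return (pred == gt).mean()

def mean_iou(pred, gt, num_classes):
    # per-class intersection-over-union, averaged over classes that occur
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```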
We use FCN, SegNet, SHG, and EncNet as baseline models. FCN is the most widely used framework for semantic segmentation, and SegNet is an adaptation of FCN that replaces the decoder with a series of pooling and convolutional layers. SHG and EncNet are compared with our EnHGNet to show the effectiveness of the Context Encoding Module and the Stacked Hourglass Architecture, respectively.
4 Conclusion
We have developed a novel Encoded Hourglass Network (EnHGNet) for semantic segmentation of high resolution aerial imagery. EnHGNet both extracts rich multi-scale features of the image and learns the contextual semantics of scenes, which is achieved by repeated bottom-up, top-down inference across scales and by selectively highlighting the class-dependent featuremaps. EnHGNet also utilizes intermediate supervision to enhance performance. Experimental results on the Potsdam test set demonstrate the superiority of our EnHGNet.
This work was funded by the Center for Space and Earth Science at Los Alamos National Laboratory.
References
-  M. Volpi and D. Tuia, “Dense semantic labeling of subdecimeter resolution images with convolutional neural networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 2, pp. 881–893, Feb 2017.
-  P. Kaiser, J. D. Wegner, A. Lucchi, M. Jaggi, T. Hofmann, and K. Schindler, “Learning aerial image segmentation from online maps,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 11, pp. 6054–6068, Nov 2017.
-  Y. Liu, S. Piramanayagam, S. T. Monteiro, and E. Saber, “Dense semantic labeling of very-high-resolution aerial imagery and lidar with fully-convolutional neural networks and higher-order CRFs,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1561–1570.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
-  Nicolas Audebert, Bertrand Le Saux, and Sébastien Lefèvre, “Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks,” ISPRS Journal of Photogrammetry and Remote Sensing, 2017.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, Dec 2017.
-  J. Li, W. Ding, H. Li, and C. Liu, “Semantic segmentation for high-resolution aerial imagery using multi-skip network and markov random fields,” in 2017 IEEE International Conference on Unmanned Systems (ICUS), Oct 2017, pp. 12–17.
-  Lichao Mou and Xiao xiang Zhu, “Vehicle instance segmentation from aerial image and video using a multi-task learning residual fully convolutional network,” CoRR, vol. abs/1805.10485, 2018.
-  Alejandro Newell, Kaiyu Yang, and Jia Deng, “Stacked hourglass networks for human pose estimation,” in ECCV, 2016.
-  Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal, “Context encoding for semantic segmentation,” in CVPR, 2018.
-  Thorsten Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in ECML, 1998.
-  Hang Zhang, Jia Xue, and Kristin Dana, “Deep ten: Texture encoding network,” in CVPR, 2017.
-  Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh, “Convolutional pose machines,” in CVPR, 2016.
-  João Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik, “Human pose estimation with iterative error feedback,” in CVPR, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
-  Sergey Zagoruyko and Nikos Komodakis, “Wide residual networks,” in BMVC, 2016.
-  F. Rottensteiner, G. Sohn, J. Jung, M. Gerke, C. Baillard, S. Benitez, and U. Breitkopf, “The ISPRS benchmark on urban object classification and 3d building reconstruction,” ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 1, no. 3, pp. 293–298, 2012.
-  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” in arXiv:1603.04467, 2016.
-  Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia, “Pyramid scene parsing network,” in CVPR, 2017.
-  Diederik P. Kingma and Jimmy Lei Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
-  Fisher Yu and Vladlen Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.