Building segmentation from remote sensing data is an important task in the remote sensing community and benefits a wide range of applications, such as land use management, urban planning, and monitoring. However, the variations of buildings in color, shape, material, and background make this task challenging. Early efforts sought effective handcrafted visual features; for example, the Morphological Building Index (MBI) models a relation between building characteristics (e.g., brightness, size, and contrast) and morphological operators. However, such handcrafted methods generalize poorly. Recently, Convolutional Neural Networks (CNNs) have been widely used for building segmentation and have shown promising results, also in large-scale tasks (see Fig. 1 (a)). However, when we zoom in on these segmentation results, e.g., Fig. 1 (b) and Fig. 1 (c), we can clearly see that they are not perfect: the boundaries of individual buildings are blurred. This is caused by the pooling layers in many existing methods that directly learn semantic masks; pooling leads to information loss, which further reduces the chance of preserving sharp boundaries.
We have observed that buildings, as man-made objects, usually have distinct corner points that effectively depict their shape and structure. Therefore, in this paper, we propose a bottom-up instance segmentation method that first detects the keypoints of a building and then reconstructs the semantic mask from these keypoints. By doing so, more fine-grained building boundaries can be preserved.
We note a contemporary work, PolyMapper, which also uses keypoints for building segmentation. PolyMapper predicts keypoints and groups them with a CNN-Recurrent Neural Network (RNN) structure. Our approach differs in two key aspects: keypoint detection and grouping. In PolyMapper, a heatmap mask of building boundaries is generated first, and a mask of candidate keypoints is then obtained from an additional convolutional layer. Our proposed method avoids this intermediate learning step and detects keypoints directly from the input. The second difference is the grouping: our approach is purely geometric, without any deep feature learning.
In our approach, a building is considered as a set of keypoints. Fig. 2 provides an overview of the proposed approach, which consists of a CNN, a Region Proposal Network (RPN), and a Fully Convolutional Network (FCN). First, the CNN extracts feature maps; the RPN then slides over these maps to generate "proposals" (candidate bounding boxes) where buildings may exist. For each proposal, local features are obtained with RoIAlign, and the FCN predicts a heatmap of keypoints from these features. Once the keypoints are extracted from the heatmap, they are grouped into boundaries in a purely geometric way. Finally, the buildings of interest are delineated with these boundaries as a polygon map.
2.2 Keypoint Detection and Grouping
Our approach is a two-stage procedure. In the first stage, class and box offsets of proposals are predicted in parallel. The second stage then outputs a heatmap of keypoints for each object. For each input patch, a corresponding heatmap $\hat{Y} \in [0, 1]^{H \times W}$ is predicted by the proposed network, where $H$ and $W$ are the height and width of the input patch, respectively. This heatmap indicates the locations of the keypoints. The training is guided by a regression of a Gaussian heatmap $Y$
, where each keypoint denotes the mean of a Gaussian kernel. In this way, the penalty is reduced for negative locations within a radius of a positive location, instead of all negatives being penalized equally during training. This accounts for the fact that some false keypoint detections can still generate a bounding box that sufficiently overlaps with the ground reference annotation. For keypoint estimation on each object, a modified focal loss is utilized for training, which maintains a balance between positive and negative locations:

$$L_{kp} = -\frac{1}{N} \sum_{xy} \begin{cases} (1 - \hat{Y}_{xy})^{\alpha} \log(\hat{Y}_{xy}) & \text{if } Y_{xy} = 1 \\ (1 - Y_{xy})^{\beta} (\hat{Y}_{xy})^{\alpha} \log(1 - \hat{Y}_{xy}) & \text{otherwise,} \end{cases}$$

where $N$ is the number of objects in a patch, and $\alpha$ and $\beta$ are hyper-parameters fixed to 2 and 4 during training, as in CornerNet. The total loss of our network is a multi-task loss $L = L_{cls} + L_{box} + L_{kp}$, where $L_{cls}$ is a cross-entropy loss for bounding-box classification and $L_{box}$ is a bounding-box regression loss, both defined as in Fast R-CNN.
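As a concrete illustration, the Gaussian target rendering and the penalty-reduced focal loss described above can be sketched in NumPy as follows; the kernel spread `sigma` is an assumed hyper-parameter, not a value from the paper:

```python
import numpy as np

def gaussian_target(shape, keypoints, sigma=2.0):
    """Render the ground-truth heatmap Y: one Gaussian kernel per keypoint.

    `sigma` (tied to the penalty-reduction radius) is an assumed value.
    `keypoints` holds (x, y) pixel coordinates.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.zeros(shape, dtype=np.float64)
    for kx, ky in keypoints:
        g = np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # overlapping kernels: keep the max
    return heatmap

def keypoint_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced focal loss over one patch (CornerNet-style).

    pred: predicted heatmap in (0, 1); gt: Gaussian ground-truth heatmap,
    equal to 1 exactly at the keypoint locations.
    """
    pos = gt == 1.0
    n = max(pos.sum(), 1)  # number of keypoints, used for normalization
    pos_loss = ((1 - pred[pos]) ** alpha * np.log(pred[pos] + eps)).sum()
    neg = ~pos
    # (1 - gt)^beta shrinks the penalty near positive locations
    neg_loss = ((1 - gt[neg]) ** beta * pred[neg] ** alpha
                * np.log(1 - pred[neg] + eps)).sum()
    return -(pos_loss + neg_loss) / n
```

Note how the `(1 - Y)^beta` term implements the reduced penalty: locations close to a keypoint have `Y` near 1, so their contribution to the negative loss is strongly down-weighted.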
The keypoints are then extracted from the predicted heatmap by detecting all peaks. This procedure, called ExtractPeak, first selects the pixel locations with a value greater than a threshold; peaks are then the local maxima within a fixed-size window surrounding these selected pixels.
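A minimal sketch of this peak extraction is given below; the threshold `tau` and window size are assumed placeholder values, since the paper's exact settings are not reproduced here:

```python
import numpy as np

def extract_peaks(heatmap, tau=0.5, window=3):
    """ExtractPeak: keep pixels above `tau` that are local maxima in a
    window x window neighbourhood. Returns (row, col) keypoint coordinates.

    `tau` and `window` are assumed values, not the paper's settings.
    """
    h, w = heatmap.shape
    r = window // 2
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y, x]
            if v <= tau:
                continue
            # window is clipped at the image border
            patch = heatmap[max(0, y - r):y + r + 1,
                            max(0, x - r):x + r + 1]
            if v >= patch.max():  # note: a flat plateau yields several peaks
                peaks.append((y, x))
    return peaks
```

In practice the same operation is often implemented as a max-pooling pass over the heatmap, which is equivalent but faster on GPU.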
Finally, we adopt a simple geometric method to approximate the segmentation mask by creating a polygon whose edges sequentially connect the keypoints. More specifically, an extreme keypoint (the left-most, right-most, bottom-most, or top-most one) is first selected as the start point, and the first edge is generated by connecting it to its nearest neighbour. The latter then serves as the start point for the next round. Edges are extended in this way until the current keypoint meets the initial keypoint. Finally, a polygon is formed from all generated edges.
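The grouping step above can be sketched as a nearest-neighbour chaining, assuming Euclidean distance and using the lexicographically smallest point as the extreme start point:

```python
def group_keypoints(points):
    """Chain keypoints into a closed polygon boundary.

    Starts from an extreme point (here: the lexicographically smallest
    (row, col) tuple, i.e. a top-most point) and repeatedly connects to the
    nearest unvisited keypoint. A sketch of the geometric grouping,
    assuming Euclidean nearest-neighbour edges.
    """
    pts = [tuple(p) for p in points]
    start = min(pts)          # extreme point; ties broken by second coordinate
    polygon = [start]
    remaining = set(pts) - {start}
    current = start
    while remaining:
        # squared Euclidean distance suffices for the argmin
        nxt = min(remaining,
                  key=lambda p: (p[0] - current[0]) ** 2
                              + (p[1] - current[1]) ** 2)
        polygon.append(nxt)
        remaining.remove(nxt)
        current = nxt
    polygon.append(start)     # close the polygon at the initial keypoint
    return polygon
```

This greedy chaining works well for roughly convex outlines; strongly concave buildings may need a more careful ordering, which the purely geometric formulation leaves open.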
The Aerial Imagery for Roof Segmentation (AIRS) dataset is a publicly available dataset that aims at developing methods for building segmentation from very-high-resolution aerial imagery (0.075 m spatial resolution). Note that our goal in this work is to accurately segment individual buildings. Therefore, 1680 patches, each containing an individual building in its center, are extracted from the AIRS dataset to validate our method, with all patches sharing the same size. The split is 1400 patches for training, 140 for validation, and 140 for testing. The proposed approach is implemented in the Keras framework on an NVIDIA Tesla P100 with 16 GB of memory. Training uses an SGD optimizer with a momentum of 0.9 for 40 epochs.
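The stated optimizer setup could be expressed in Keras as follows; only the momentum and the number of epochs come from the text, so the learning rate is an assumed placeholder:

```python
from tensorflow import keras

# SGD with momentum 0.9, as stated above; the learning rate (1e-3) is an
# assumed placeholder, not a value reported in the paper.
optimizer = keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9)

# Typical usage (model, total loss, and data pipelines are assumed to exist):
# model.compile(optimizer=optimizer, loss=total_loss)
# model.fit(train_patches, epochs=40, validation_data=val_patches)
```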
To evaluate the performance of the proposed algorithm, four metrics are used in this research. Mask accuracy is evaluated by the F1-Score and Intersection over Union (IoU), while the Structural Similarity Index (SSIM) and F-Measure serve as accuracy measures for the boundary.
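Of the four metrics, the two mask-level ones can be sketched directly on binary masks (SSIM and the boundary F-Measure additionally require boundary extraction and are omitted from this sketch):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over Union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def mask_f1(pred, gt):
    """F1-Score (equivalently, the Dice coefficient) between binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2 * tp / denom if denom else 1.0
```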
Fig. 3 shows the intermediate results of our approach, overlaid on the input aerial imagery. One is the point map, acquired from the keypoint heatmap predicted by the deep network; the other is the boundary map formed by these detected keypoints.
Table 1 and Fig. 4 compare our method to FCN-8s and Mask R-CNN, which are state-of-the-art semantic segmentation and instance segmentation methods, respectively. The proposed approach outperforms both in terms of mask and boundary accuracy. Notably, sharper boundaries and geometric details are preserved by our network, so the shape and structure of the buildings are well depicted. This shows the advantage of detecting keypoints over directly learning semantic masks for the task of building segmentation.
| Method | F1-Score | IoU | SSIM | F-Measure |
|---|---|---|---|---|
| FCN-8s | 89.01 % | 83.15 % | 92.58 % | 8.29 % |
| Mask R-CNN | 94.73 % | 90.22 % | 96.82 % | 9.63 % |
| Proposed method | 95.08 % | 90.81 % | 96.93 % | 11.29 % |
In this paper, we have proposed a new instance segmentation approach that obtains semantic masks of buildings based on keypoint detection. Our approach first detects the keypoints of a building and then polygonizes them to generate a segmentation mask with fine semantic boundaries. We evaluated our method on a subset of the AIRS dataset, and the experimental results demonstrate that the proposed network provides competitive results compared to state-of-the-art semantic and instance segmentation methods. Notably, the building boundaries generated by our method are fine-grained, and the shapes of buildings are well preserved in the segmentation masks. This is beneficial for further steps such as vectorization, which rely heavily on accurate geometric details.
- (2019) Aerial imagery for roof segmentation: a large-scale dataset towards automatic mapping of buildings. ISPRS Journal of Photogrammetry and Remote Sensing 147, pp. 42–55.
- (2015) Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
- (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
- (2011) Morphological building/shadow index for building extraction from high-resolution imagery over urban areas. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 5 (1), pp. 161–172.
- (2018) CornerNet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750.
- (2019) Topological map extraction from overhead images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1715–1724.
- (2018) Building segmentation on satellite images. Web: https://project.inria.fr/aerialimagelabeling/files/2018/01/fp_ohleyer_compressed.pdf.
- (2018) Building footprint generation using improved generative adversarial networks. IEEE Geoscience and Remote Sensing Letters 16 (4), pp. 603–607.
- (2019) Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 850–859.