
Accurate Polygonal Mapping of Buildings in Satellite Imagery

by   Bowen Xu, et al.

This paper studies the problem of polygonal mapping of buildings by tackling the issue of mask reversibility that leads to a notable performance gap between the predicted masks and polygons from the learning-based methods. We addressed such an issue by exploiting the hierarchical supervision (of bottom-level vertices, mid-level line segments and the high-level regional masks) and proposed a novel interaction mechanism of feature embedding sourced from different levels of supervision signals to obtain reversible building masks for polygonal mapping of buildings. As a result, we show that the learned reversible building masks take all the merits of the advances of deep convolutional neural networks for high-performing polygonal mapping of buildings. In the experiments, we evaluated our method on the two public benchmarks of AICrowd and Inria. On the AICrowd dataset, our proposed method obtains unanimous improvements on the metrics of AP, APboundary and PoLiS. For the Inria dataset, our proposed method also obtains very competitive results on the metrics of IoU and Accuracy. The models and source code are available at



1 Introduction

Polygonal mapping of buildings, which aims at precisely extracting the footprints of buildings from high-resolution satellite imagery in the form of polygons, is a core and dynamic problem in photogrammetric computer vision and remote sensing (sohn2001extraction; zorzi2021polyworld), and plays an important role in Geographic Information Systems (GIS) as one of the basic feature classes (MAYER1999automatic).

Because the footprints and roof outlines of buildings coincide under most circumstances in orthorectified satellite imagery, the mapping of buildings is roughly equivalent to extracting building polygons (wang2022learning). Thus, the problem of polygonal mapping of buildings in satellite imagery has usually been formulated in a multi-step paradigm (Girard2020polygonal; Li2021joint): (1) computing the rasterized binary masks of buildings from the input images, (2) vectorizing the binary masks into polygons with certain heuristic post-processing schemes, and (3) optionally simplifying the initial polygons to reduce the redundancy of vertices. In this paradigm, building segmentation, as a key sub-problem, has recently been significantly advanced by deep neural networks (DNNs) (mnih2013phd; Yuan2018learning). Thereafter, the quality of polygonal mapping of buildings can be improved (Wei2020toward) by leveraging off-the-shelf algorithms such as Marching Squares (MarchingCubes) for polygon generation and Douglas-Peucker (douglas1973al) for shape simplification.

(a) Frame-Field Learning (Girard2020polygonal)
(b) HiSup Learning (Ours)
Figure 1: An illustrative comparison between the prior art, Frame-Field Learning (Girard2020polygonal), and our proposed HiSup Learning for polygonal mapping of buildings. In contrast to the Frame-Field, which is learned as a complement to handle the uncertainty of the predicted building masks, our proposed method learns a shape-aware feature embedding by end-to-end learning with Hierarchical Supervisions, thus minimizing the gap between the predictions of masks (in the middle column) and polygons (in the right column).

Although the performance of polygonal mapping of buildings has been significantly improved by higher-quality segmentation masks, the task still depends heavily on the quality of the learned masks. Furthermore, a closer look at the polygonal annotations of buildings reveals that the learned masks are difficult to convert back into the polygonal representation, an issue we call mask reversibility. More precisely, given a polygonal annotation of a building, the mask representation is obtained by rasterizing the polygon into the image domain, and it can be reversed to the polygon as long as the resolution of the image domain is in a reasonable range. However, when we use the mask representation of the polygonal annotations as the supervision signal for optimizing a convolutional neural network, it is extremely challenging to obtain such reversible masks.

To handle this mask-reversibility challenge, the most recent studies either exploited the shape attributes of buildings with Frame-Field (Girard2020polygonal) or improved the polygonization scheme (Li2020appro) applied to off-the-shelf mask predictions, pushing the accuracy of predictions to a higher level. Despite that, there is still a notably large performance gap on the AICrowd benchmark (Mohanty2020deep) between the polygon and mask predictions. The winning solution (Jaku2018winner) achieves a mask average precision of 79.1%, whereas the best-performing model of Frame-Field (Girard2020polygonal), which is applied to the masks, achieves an average precision of 67.0%. As the mask predicted in Fig. 1(a) is ambiguous at the shape boundaries, the final polygons predicted by Frame-Field (Girard2020polygonal) are still far from the human annotations. One possible reason for such a large performance gap is that the supervision signals used pay more attention to the high-level regional information than to the detailed geometric shape of buildings.

In this paper, we make efforts to close the gap between the polygonal representations and the mask ones of buildings with deep convolutional neural networks, inspired by an observation:

The shape composition of polygonal buildings is made up of points (or vertices) at the bottom, uses line segments to associate the points as the middle-level information, and finally forms the regional instances of buildings at high-level semantics.

Therefore, we are motivated to build full-level supervision to train the neural networks for the sake of the shape correctness of the predicted masks. As long as the masks predicted by convolutional neural networks capture the polygonal shapes well, only a simple post-processing scheme is required to achieve accurate polygonal mapping of buildings.

Towards this goal, we are inspired by the dual representation of line segments, named regional attraction fields (xue2019learning; xue2021pami) at the middle-level description as the bridge to get the connection between the unary information of points and complex geometry of regional shapes. Different from the representation of edge pixels, the dual regional representation of line segments demonstrates advantages in two aspects:

  • It intrinsically connects to the vertices of polygons, as they can be viewed as the endpoints of line segments.

  • The non-local support region of each line segment corresponds to the region-level information of a part of the building mask.

Although the properties of regional attraction fields (xue2019learning; xue2021pami) are appealing and promising for polygonal mapping of buildings, several challenging open problems remain in computing higher-order geometric shapes from the mid-level line segments (or their dual representation). One may come up with solutions that directly leverage the detected line segments (or wireframes) as the basic computing clues. However, this leads to an NP-Hard problem of finding all the closed loops in an undirected graph. Meanwhile, the detection accuracy of line segments and wireframes also affects the final results. Most recently, PolyWorld (zorzi2021polyworld) attempted to solve the problem of polygonal mapping of buildings along this line by using Sinkhorn (marco2013sink) to approximate the step of finding cycles in a graph. However, it still lacks flexibility when handling complicated buildings that have holes.

In the study of this paper, we present a hierarchical supervision (HiSup) learning scheme to address the mask reversibility issue and show that the geometric information of polygons should be used inductively in the embedding space of the multi-head predictions by end-to-end learning. Specifically, we design a novel information aggregation mechanism to enhance the feature embedding for both the bottom-level vertex prediction and the high-level mask prediction. By explicitly encoding and aggregating the boundary information into the high-level mask head, we successfully reduced the gap between the mask predictions and the expected polygonal ones. During the inference, we design a simple greedy scheme to polygonize and simplify the masks by using the predicted vertices. Our proposed method, i.e., HiSup, achieves the state-of-the-art of polygonal mapping of buildings.

In the experiments, we evaluate our method on the challenging benchmarks of AICrowd (Mohanty2020deep) and Inria (maggi2017can). Compared to the current state-of-the-art approaches Frame-Field (Girard2020polygonal) and the method of (Li2021joint), the building polygons computed by our HiSup obtain a significant improvement in AP (Lin2014microsoft) of 12.4 points and 5.6 points, respectively. In terms of the more challenging metric, APboundary (Cheng2021boundary), we also surpass the previous state-of-the-art performance of 50.0% obtained by PolyWorld (zorzi2021polyworld). On the Inria dataset (maggi2017can), our proposed HiSup also obtains a considerable improvement over Frame-Field (Girard2020polygonal). A systematic ablation study is performed to further justify the proposed method.

The main contributions of this paper are summarized as:

  1. We address the mask reversibility issue of polygonal mapping of buildings by proposing the HiSup, which takes the hierarchical supervision signals of vertices, boundaries, and masks to guide the convolutional neural network to learn in a shape-aware fashion.

  2. We present a key component of aggregating the embedding of attraction fields into the learning of high-level masks and the bottom-level vertices.

  3. We set several new state-of-the-art results on the challenging AICrowd benchmark in terms of AP and APboundary by learning semantically-accurate and geometrically-precise masks with our HiSup.

The rest of the paper is organized as follows. Section 2 reviews related works. Section 3 describes the hierarchical representation. Section 4 introduces our polygonal mapping of buildings method. Section 5 reports experimental results and some discussions. Finally, conclusions are drawn in section 6.

2 Related work

The (polygonal) mapping of buildings is a longstanding problem, which was extensively studied by means of hand-crafted features such as textures, geometry, spectrum, and shadows in a vast body of literature (pesaresi2011improved; Xia2019geosay; ngo2016shape; senaras2013fusion; mannok2015orient, to cite a few). These kinds of methods can get good results on certain buildings that satisfy the predefined rules, but their performance degrades when buildings have inconsistent appearances in complex environments. Therefore, research on polygonal mapping of buildings has gradually moved to deep learning solutions to handle more challenging real-world data. In this section, we mainly review the recent advancements in learning-based polygonal mapping of buildings.

2.1 Learning Building Masks with ConvNets

The mask representation is widely used for building extraction in two mainstream pipelines: the building map segmentation (yang2018mapping) and the building instance segmentation (chen2019darnet). The former one formulates the problem of building extraction as a pixel-wised binary classification task and the latter one leverages the advances of detected bounding boxes of buildings to directly extract building mask for each instance. Those two pipelines correspond to the semantic segmentation (long2015fully) and instance segmentation (He2017maskrcnn) respectively in natural images, whereas, they have the same goal for the task of building extraction in satellite images. As the orthorectified satellite imagery eliminates the occlusions between building instances, it is straightforward to delineate the foreground pixels from the instance-agnostic segmentation map into multiple instances by the connected components labeling schemes (vladimir2018ternaus; Jaku2018winner).

In general, learning a ConvNet is a promising way to obtain high-quality building masks from satellite images (saito2016multiple; alshehhi2017simultaneous; maggi2017conv; Yuan2018learning). However, focusing mainly on high-level semantics makes these approaches inaccurate in terms of building shape correctness. Some works focus on alleviating the boundary uncertainties of building maps, including multi-scale feature fusion (maggi2017high; ji2019fully) and geometrically-aware learning (li2022building) with the assistance of attraction fields (xue2019learning; xue2021pami) or edge constraints (liu2021multiscale). However, their results are not satisfactory for accurate polygonal mapping of buildings.

In our method, we concentrate on the geometric correctness of building masks learned by ConvNets. Compared to those methods, our method learns a unified embedding space by exploiting the different levels of building representations, which maintains the accuracy of the segmentation masks in terms of both high-level semantics and geometric correctness.

2.2 Polygonal Mapping of Buildings

Polygonal mapping of buildings refers to obtaining vectorized building instances, which is the most compact and meaningful representation format. Only ordered building vertices are kept to describe individual building instances. Traditionally, building vectorization tends to be a separate process applied after obtaining the raster map of building instance segmentation. One of the basic algorithms is the Douglas-Peucker family of simplification techniques (douglas1973al). Those simplification algorithms are usually rough and have difficulty identifying the right building vertices. Additional refinement strategies based on empirical building shapes are proposed in (Wei2020toward). Lately, based on both the image and its semantic probability map, a polygonal partition approach has shown promising results for building vectorization (Li2020appro). However, these multi-stage methods mostly demand separate optimization processes, so an overall optimized result is hard to obtain.

Recently, automatic polygonal mapping of buildings from remote sensing imagery has received considerable attention. Some methods combine the vectorization process with segmentation through multi-task learning. By learning a building segmentation map aligned with the proposed frame field, Frame-Field (Girard2020polygonal) obtains building polygons through corner-aware contour simplification. The frame field map is a 4D vector calculated according to the direction of building edges. Compared with the frame field (Girard2020polygonal), the attraction fields (xue2019learning; xue2021pami) adopted in our method carry more contextual and structural information. Li2021joint generate building masks, corner locations, and edge orientations simultaneously. The corner location map and the edge orientation map are then combined to help identify vertices among the pixels belonging to building boundaries. Beyond that, an additional refinement process is adopted to further adjust the building vertices at a finer scale. The refinement of building vertices can also be embedded into the whole model and trained jointly (Chen2020polygoncnn).

The above methods all get building polygons through simplifying the predicted building contours. Instead, we use detected vertices to rebuild the whole building polygons.

There are other methods that extract building polygons on the basis of an object detection formulation. Buildings are first marked with bounding boxes. Then, for every building, corners are predicted one by one through recurrent neural networks (RNNs) with convolutional long-short term memory (ConvLSTM) (Li2019topo). This framework is further improved by (zhao2021building) with modified modules including a global context block (GCB), a boundary refinement block (BRB) and gated recurrent units (GRU). Following the same detection formulation, liu2021afmrnn proposed AFM-RNN, which introduced the attraction fields (xue2019learning) as embedded features to enhance building corner detection. In our method, we take advantage of the attraction fields as the bridge to connect both the building masks and vertices implicitly.

Lately, zorzi2021polyworld proposed PolyWorld, which considers all building polygons within one image as a whole undirected graph. They predict building polygons by detecting vertices and solving for the connections between adjacent vertices. The obtained building polygons are visually very neat. Failure cases appear where buildings have complex structures such as inner yards.

3 A Hierarchical Representation of Buildings

(a) Building Polygons
(b) Segmentation Map
(c) Attraction Field Map
(d) Convex/Concave Vertices
Figure 2: An illustration of the hierarchical representation of vectorized building maps. For the expected building polygons in (a), we learn the high-level segmentation masks (b), the mid-level attraction fields of line segments (c) and the bottom-level vertices (d) with a single convolutional neural network, which guides the neural network to learn a unified embedding for accurate polygonal mapping of buildings.

Given a satellite image defined on a discrete image lattice of a given resolution, the building polygons present in the image are regions of the continuous 2D real plane that extends the lattice. As buildings are man-made objects, the boundary of each region can be described by a polygon that consists of an ordered array of 2D vertices. The order of the vertices in the array indicates the adjacency relationship among them.

3.1 High-Level Regional Masks of Buildings

The mask representation is extensively used for the extraction of buildings as a high-level semantic segmentation task. In this sense, the regions of building polygons are represented by a binary segmentation map defined on the image lattice, in which a pixel is labeled as foreground if it lies inside any building region and as background otherwise.

Benefiting from the bird's-eye view of satellites, the building regions can be observed without occlusion between instances, so a high-performing convolutional neural network for semantic image segmentation can obtain instance-level segmentation results in most cases. However, because of their concentration on high-level semantics, the predicted segmentation maps usually pay more attention to the interior of the building regions than to their boundaries. As a result, it is very challenging to compute accurate polygons from the predicted masks.
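To make the mask representation concrete, the following is a minimal sketch of how a polygonal annotation can be rasterized into a binary segmentation map. It is an illustration only: the function names `point_in_polygon` and `rasterize`, the even-odd ray casting test, and the pixel-center sampling convention are assumptions made here and are not taken from the paper.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test for a point against a closed polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y0 > y) != (y1 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygons, height, width):
    """Binary mask: 1 where the pixel center lies inside any polygon."""
    mask = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            cx, cy = c + 0.5, r + 0.5  # assumed pixel-center convention
            if any(point_in_polygon(cx, cy, p) for p in polygons):
                mask[r][c] = 1
    return mask


# A 3 x 3 square building drawn on a 5 x 5 lattice.
square = [(1.0, 1.0), (4.0, 1.0), (4.0, 4.0), (1.0, 4.0)]
mask = rasterize([square], height=5, width=5)
for row in mask:
    print(row)
```

The forward direction (polygon to mask) is this simple; the difficulty discussed in the text lies in the reverse direction, when the mask is a network prediction rather than an exact rasterization.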

3.2 Bottom-Level and Mid-Level Geometries

As one main goal of building segmentation is for the vectorized representation, the polygonal representation about the boundary information is of great importance for the end task. Here, we briefly discuss both the bottom-level vertices and the mid-level line segments about the boundary information.

When we focus on the polygonal boundary of the buildings, the polygon of each building mask contains two kinds of information: the bottom-level vertices and their adjacency relationship, represented by mid-level line segments. With the bottom-level and mid-level geometries of buildings, there is another challenging problem of grouping primitives into instances. It was recently approached from the view of graph computation with the Sinkhorn approximation (marco2013sink) to avoid the NP-Hard problem of finding cycles in an undirected graph (zorzi2021polyworld).

In this paper, we believe that although the mask learning is a high-level recognition task, as the polygonal buildings can be lossily converted into mask representation, there should be a unified embedding space to learn and decode consistent multi-level outputs. To approach this, we use the bottom-level vertices and mid-level line segments as the complementary information with the learning of masks, which guides the convolutional neural networks to focus on the bottom-level and mid-level information for the shape correctness of the masks.

Convex and Concave Building Vertices.

Inspired by Li2021joint, we divide the building vertices (i.e., junctions) according to their convexity. In detail, for each building polygon, the vertices that lie on its minimal convex hull are regarded as convex vertices, and the remaining vertices of the polygon are regarded as concave vertices. By stacking all the vertices across instances, we obtain two sets, one of convex vertices and one of concave vertices, as the bottom-level representation of buildings.
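The convex/concave split described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the helper names are hypothetical, the convex hull is computed with the standard monotone chain algorithm (an assumption, since the text does not name a hull algorithm), and vertices that are collinear with hull edges are treated as non-hull vertices.

```python
def cross(o, a, b):
    """Z component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone chain convex hull; returns hull vertices in order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def split_vertices(polygon):
    """Split polygon vertices into hull (convex) and non-hull (concave) sets."""
    hull = set(convex_hull(polygon))
    convex = [v for v in polygon if v in hull]
    concave = [v for v in polygon if v not in hull]
    return convex, concave


# An L-shaped building footprint: only the inner corner is off the hull.
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
convex, concave = split_vertices(l_shape)
print("convex:", convex)
print("concave:", concave)
```

For the L-shaped footprint, the inner corner at (2, 2) is the only vertex that does not lie on the convex hull, so it is the only one assigned to the concave set.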

Regional Encoding of Line Segments.

As discussed before, line segments act as the bridge between the bottom-level vertices and the high-level masks, so we are interested in a representation of line segments that is well suited for polygonal mapping of buildings. The most natural choice might be the edge map (or line heatmap). However, we found that learning edge maps improves the mask reversibility only marginally. Therefore, we use the dual representation, the Attraction Field Representation Map (AFM) (xue2019learning; xue2021pami), for our end task. The line segments of the polygonal instances, each defined by its two endpoints, partition the image lattice into regions by assigning every pixel to its closest line segment. With this region partition, the attraction field at any pixel is computed as the 2D vector between the pixel and its projection on the line segment of its region. As the attraction field map encodes the line segments according to the region partition, the high-level masks are implicitly encoded.
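As an illustration of the regional encoding, the sketch below computes an attraction field map for a small set of line segments. It is written under stated assumptions rather than from the authors' code: each pixel is assigned to its nearest segment, the attraction vector is taken as the projection point minus the pixel (the sign convention is an assumption), pixel coordinates are integer lattice positions, and no normalization by the image size is applied. All function names are hypothetical.

```python
def project_to_segment(p, a, b):
    """Closest point to p on the segment from a to b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return (a[0], a[1])
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))  # clamp so the projection stays on the segment
    return (a[0] + t * dx, a[1] + t * dy)


def attraction_field(segments, height, width):
    """Per-pixel attraction vectors and the index of the owning segment."""
    field = [[(0.0, 0.0)] * width for _ in range(height)]
    owner = [[-1] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            p = (float(c), float(r))
            best = None
            for idx, (a, b) in enumerate(segments):
                q = project_to_segment(p, a, b)
                d2 = (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2
                if best is None or d2 < best[0]:
                    best = (d2, idx, q)
            _, idx, q = best
            owner[r][c] = idx  # region partition: nearest segment
            field[r][c] = (q[0] - p[0], q[1] - p[1])  # assumed sign convention
    return field, owner


# The four edges of a square building on a 7 x 7 lattice.
square = [(1.0, 1.0), (5.0, 1.0), (5.0, 5.0), (1.0, 5.0)]
segments = [(square[i], square[(i + 1) % 4]) for i in range(4)]
field, owner = attraction_field(segments, height=7, width=7)
print(field[0][3], field[2][3])
```

In this toy example, the pixel above the top edge and the pixel just inside it both belong to the region of the top edge, and their attraction vectors point towards that edge from opposite sides, which is how the field implicitly separates the inside of the building from the outside.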

4 Learning Reversible Building Masks

In this section, we present a novel approach that learns building masks in a unified embedding space under the supervision of masks, line segments and vertices. The unified embedding space combines the merits of the different levels of information about buildings and thereby enables the convolutional neural networks to learn both the instance-level information and the geometric details with the best reversibility for polygonal mapping of buildings. Then, we present a mask attraction scheme that polygonizes the reversible masks into the expected polygons by attracting the sparse vertices to the boundary of the masks.

Figure 3: The overall training process. The “Conv” module denotes three consecutive sequences of convolution, batch normalization and ReLU layers. The “Head” module denotes two convolution layers connected with an activation layer. The two operator symbols in the figure denote element-wise product and element-wise addition. The “Enhance” module denotes the concatenation operations described in Geometric-Aware Mask Learning.

4.1 Network Architecture Overview

As illustrated in Fig. 3, given an input image, we use a backbone network (e.g., HRNets (Wang2021hrnet) or UResNets, i.e., U-Net (ronneberger2015u) with a ResNet (he2016deep) encoder) to extract a shared feature map whose spatial size is that of the input image divided by the down-sampling factor of the backbone. Because our method learns from the bottom-level to the high-level information of buildings, three convolution layers (each followed by BatchNorm (ioffe2015batch) and ReLU (nair2010rectified)) with 256 output channels transform the backbone feature into three embedding features: one for the vertices, one for the attraction field maps, and one for the segmentation maps. These embedding features are then transformed by the cross-level interaction module (introduced in Sec. 4.2) to predict (1) the heatmap and the short-range offset field of the vertices, (2) the attraction field map of the line segments, and (3) the segmentation mask of the building polygons.

4.2 Learning from Cross-Level Interactions

As the attraction field representation (xue2019learning; xue2021pami) has strong correlations with both the vertices and the segmentation masks, we use it to regularize the vertex and mask feature maps, emphasizing shape correctness in the mask predictions and semantic consistency in the learning of vertices.

Efficient Channel Attention with AFM Embedding.

Inspired by the success of ECA-Net (wang2020eca), we extend the efficient channel attention mechanism to enhance the feature representations of the vertex and mask branches with the AFM embedding. The attention is built from a 1-D convolution, global average pooling and a Sigmoid function.

Geometric-Aware Mask Learning.

We lift the predicted attraction field map to a 128-channel tensor with a linear layer and concatenate it to the backbone feature. The concatenated feature map is subsequently used to predict the building masks, which guides the backbone feature to learn shape correctness from the attraction field targets via backpropagation. As the backbone feature map is regularized by the predicted attraction field, we are able to use the enhanced feature map for the final mask prediction. During training, both the predicted intermediate mask and the final mask are supervised by the binary cross entropy (BCE) loss against the ground-truth masks, while the attraction field is learned with a loss against the ground truth of attraction fields generated from the edges of the polygons.

Semantic-Consistent Vertices Prediction.

Instead of using the salient edge pixels (from predicted masks) as building vertices, we design a vertex branch to learn the heatmap and short-range offsets of vertices. Taking the enhanced vertex feature as input, a convolution layer is used to learn the heatmap with a cross-entropy loss and the offset map with a regression loss. The ground-truth heatmap distinguishes background pixels from vertex targets, and the ground-truth short-range offsets are bounded and defined at the target pixels.

Total Loss.

To train the neural network, we use predefined loss factors to balance the different types of predictions in the total loss function, which is the weighted sum of the losses of the predicted masks, the attraction field maps, and the heatmaps and offsets of the vertices. In our experiments, the loss factors are set empirically according to the magnitudes of the individual losses.

Figure 4: The inference process of our method. Building polygons are generated by matching masks with their nearest junctions. The redundant junctions that do not belong to building vertices are removed afterwards.

4.3 Extracting Polygons by Mask Attraction

Benefiting from our design of the cross-level interaction module, the learned masks have the desired property of shape correctness in the boundary geometry. Therefore, the polygonization of the building masks is straightforward: it only requires tracing the boundary pixels of the predicted masks. To compute the most simplified polygons without redundant vertices, we found that the vertices learned from the heatmap and offset map cooperate very well with the masks when the traced boundary pixels are attracted to the instance-agnostic vertices. We term our polygonization scheme Mask-and-Vertices Attraction, or MaV-Attr for short. The whole pipeline of our polygon extraction is shown in Fig. 4.

Polygon Initialization.

Given a predicted mask and the corresponding heatmap and offset map of vertices, we first binarize the mask with a given threshold and split the binary mask into building instances by computing its connected components. A redundant polygonal representation of each instance is then computed by tracing all of its boundary pixels, so that the initial polygon has as many vertices as the instance has boundary pixels. With these initial redundant polygons, we only need to focus on their simplification using sparse vertices, which are computed by local non-maximum suppression (NMS) on the heatmap. The local NMS is implemented by a MaxPooling layer with a fixed kernel size. After the local NMS, the top-scoring vertices whose classification score exceeds a threshold are retained and refined by the offset vectors for sub-pixel accuracy.
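The vertex extraction step can be sketched in plain Python as below. This is a simplified illustration, not the paper's implementation: the kernel size of 3, the `top_k` and `score_thresh` values, the function name `extract_vertices`, and the convention that a refined vertex equals the pixel index plus its offset are all assumptions chosen for the example. Equal scores inside one window are all kept in this sketch.

```python
def extract_vertices(heatmap, offsets, kernel=3, top_k=1000, score_thresh=0.5):
    """Local-maximum NMS on a heatmap, followed by offset refinement.

    heatmap: 2D list of vertex scores.
    offsets: 2D list of (dx, dy) short-range offsets, same shape as heatmap.
    All default values are illustrative assumptions.
    """
    h, w = len(heatmap), len(heatmap[0])
    half = kernel // 2
    candidates = []
    for r in range(h):
        for c in range(w):
            score = heatmap[r][c]
            if score <= score_thresh:
                continue
            # Max pooling over the local window, clipped at the image border.
            window_max = max(
                heatmap[rr][cc]
                for rr in range(max(0, r - half), min(h, r + half + 1))
                for cc in range(max(0, c - half), min(w, c + half + 1))
            )
            if score == window_max:  # the pixel survives the local NMS
                dx, dy = offsets[r][c]
                candidates.append((score, c + dx, r + dy))
    # Keep the top-k highest scoring vertices.
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(x, y) for _, x, y in candidates[:top_k]]


heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.0, 0.1, 0.2],
]
offsets = [[(0.25, -0.25)] * 4 for _ in range(4)]
vertices = extract_vertices(heatmap, offsets)
print(vertices)
```

A MaxPooling layer with stride 1 and matching padding produces the same local maxima on a GPU in a single call, which is presumably why the text implements the NMS that way.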

Polygon Simplification.

For each initial polygon, we match every boundary pixel to its closest predicted vertex in terms of Euclidean distance, which gives an index mapping from each boundary pixel of the polygon to a vertex. Then, we go through all the boundary pixels and, among those that share the same vertex index, keep only the one with the minimal distance to that vertex and remove the others. In addition, boundary pixels that are farther from every predicted vertex than a given distance threshold are also removed. As we already know the topological order of the boundary pixels, a single linear scan suffices to remove the redundant boundary pixels efficiently. In the final step, we check the adjacent edges of the polygons and merge them if they are parallel up to a given angle tolerance.

Every vertex of the resulting simplified polygon belongs to the set of predicted vertices, and the number of its vertices is far smaller than the number of boundary pixels of the initial polygon.
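The matching and pruning steps of the simplification can be sketched as follows. This is a hedged reading of the procedure rather than the authors' code: the function name `simplify_polygon`, the `max_dist` value, and the choice to emit the matched predicted vertex in place of the retained boundary pixel are assumptions, and the final merging of nearly parallel adjacent edges is omitted.

```python
import math


def simplify_polygon(boundary, vertices, max_dist=2.0):
    """Simplify a traced boundary by attracting it to predicted vertices.

    boundary: ordered list of (x, y) boundary pixels of one building mask.
    vertices: list of predicted (x, y) vertices shared by all instances.
    max_dist: assumed distance threshold for discarding unmatched pixels.
    """
    best = {}  # vertex index -> (distance, position along the boundary)
    for pos, (bx, by) in enumerate(boundary):
        dists = [math.hypot(bx - vx, by - vy) for vx, vy in vertices]
        idx = min(range(len(vertices)), key=dists.__getitem__)
        d = dists[idx]
        if d > max_dist:
            continue  # too far from every predicted vertex
        # Among pixels mapped to the same vertex, keep only the closest one.
        if idx not in best or d < best[idx][0]:
            best[idx] = (d, pos)
    # Restore the boundary order so the output is a valid polygon.
    ordered = sorted(best.items(), key=lambda item: item[1][1])
    return [vertices[idx] for idx, _ in ordered]


# Boundary pixels of a 5 x 5 square mask, traced in order.
boundary = (
    [(x, 0) for x in range(5)]
    + [(4, y) for y in range(1, 5)]
    + [(x, 4) for x in range(3, -1, -1)]
    + [(0, y) for y in range(3, 0, -1)]
)
# Four sub-pixel corner predictions plus one vertex of another building.
vertices = [(0.2, 0.1), (3.9, 0.2), (4.1, 3.8), (0.1, 4.2), (10.0, 10.0)]
simplified = simplify_polygon(boundary, vertices)
print(simplified)
```

In the example, the 16 traced boundary pixels collapse to the four predicted corners, and the vertex belonging to a different building is never selected because no boundary pixel has it as its closest vertex.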

4.4 Implementation details

In our implementation, we mainly use HRNetV2-W48 (Wang2021hrnet) as the backbone. We also use the smaller backbones HRNetV2-W18 and HRNetV2-W32 (Wang2021hrnet), as well as UResNet101 (ronneberger2015u; he2016deep), to show that our method can be used in more practical configurations. The input images are resized to a fixed resolution during both training and testing. As all the backbone networks yield down-sampled feature maps, the output maps of our method have a correspondingly lower resolution than the input. In the inference phase, we resize the polygonal outputs to the original image size (e.g., 300 × 300 for the AICrowd dataset (Mohanty2020deep)) for evaluation.

During training, we use the data augmentation strategies of (Girard2020polygonal), including random flip, random rotation, and color jittering. The ADAM optimizer (diederik2015adam) with weight decay is used to train our network on 4 Nvidia RTX 3090 GPUs with PyTorch 1.8. The learning rate is decayed after 25 epochs of training. For the ablation study, we reduce the number of training epochs due to limited computation resources and time.

5 Experiment

5.1 Datasets and evaluation metrics

We compare our method to the state-of-the-art methods on two publicly available datasets.

5.1.1 AICrowd dataset

The AICrowd Mapping Challenge dataset (Mohanty2020deep) (AICrowd dataset) contains 300 × 300 pixel RGB images and corresponding annotations in MS-COCO (Lin2014microsoft) format. The training set includes 280741 tiles. As the challenge is already closed, we use the validation subset, which contains 60317 tiles, for testing. The AICrowd dataset also provides a small version that contains fewer samples, which we use in the ablation studies.

To evaluate the performance of building instance segmentation, we follow the standard MS-COCO (Lin2014microsoft) evaluation metrics, which are also the official metrics for the AICrowd dataset. In particular, we report the average precision (AP) and average recall (AR) over a range of intersection over union (IoU) thresholds from 0.50 to 0.95 with a step of 0.05. The precision and recall scores at IoU thresholds of 0.50 and 0.75 are also listed in Table 1, denoted as AP50, AP75, AR50 and AR75. Moreover, to put extra emphasis on the quality of building contours, we introduce the metrics of Boundary IoU (Cheng2021boundary). Compared with the MS-COCO metrics, the Boundary IoU metrics better reveal the accuracy of the detected instance boundaries. For two masks and , the Boundary IoU only computes the IoU between the sets of pixels that are within distance from the two mask contours. Denoting the contours of masks and as and , the Boundary IoU is defined as


Here the distance is set to 0.02 (as a fraction of the image diagonal), which is consistent with the setting for COCO instance segmentation. The average boundary precision, denoted by APboundary and computed over a range of Boundary IoU thresholds, is reported in Table 2.
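To make the Boundary IoU computation concrete, the following is a minimal pure-Python sketch. It assumes the common formulation in which each mask is reduced to the band of its own pixels lying within the given distance of its contour, obtained here by repeated 3x3 erosion, and the IoU is then taken between the two bands. Masks are represented as sets of (row, column) pixels, and the helper names `rect`, `erode`, `boundary_band` and `boundary_iou` are illustrative rather than taken from the paper or any library.

```python
import math


def rect(r0, r1, c0, c1):
    """Pixel set of a filled rectangle covering rows r0..r1-1 and columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


def erode(pixels):
    """One 3x3 binary erosion; any pixel not in the set counts as background."""
    return {(r, c) for (r, c) in pixels
            if all((r + dr, c + dc) in pixels
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1))}


def boundary_band(pixels, height, width, ratio=0.02):
    """Mask pixels within d pixels of the contour, d = ratio * image diagonal (at least 1)."""
    d = max(1, round(ratio * math.hypot(height, width)))
    core = pixels
    for _ in range(d):
        core = erode(core)
    return pixels - core


def boundary_iou(gt, pred, height, width, ratio=0.02):
    """IoU restricted to the boundary bands of the two masks."""
    g = boundary_band(gt, height, width, ratio)
    p = boundary_band(pred, height, width, ratio)
    union = len(g | p)
    return len(g & p) / union if union else 1.0


# A 10x10 square and the same square shifted by one pixel, in a 20x20 image.
gt_mask = rect(5, 15, 5, 15)
pred_mask = rect(5, 15, 6, 16)
mask_iou = len(gt_mask & pred_mask) / len(gt_mask | pred_mask)
b_iou = boundary_iou(gt_mask, pred_mask, 20, 20)
print(round(mask_iou, 3), round(b_iou, 3))
```

In this toy case a one-pixel shift leaves the mask IoU above 0.8 while the Boundary IoU falls to one third, which illustrates why the boundary-based metric is more sensitive to contour quality.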

All the above evaluation metrics are computed at the segmentation level. To directly compare the quality of predicted building polygons, we also report the PoLiS metric (Avbelj2015metric). For two given polygons A and B, the PoLiS is defined as the average distance between each vertex of A and its closest point on the boundary of B, and vice versa. Suppose the polygon B has vertices , the PoLiS metric can be expressed as


where and are the normalization factors. The IoU threshold for filtering predicted building polygons is set to 0.5, following zhao2021building.
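The verbal definition above can be sketched in a few lines of Python. This sketch assumes the usual form of PoLiS, in which each directed term is the sum of vertex-to-boundary distances divided by twice the number of vertices of the source polygon; polygons are given as lists of (x, y) vertices with implicit closure, and the function names are illustrative.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def point_boundary_distance(p, polygon):
    """Distance from p to the closest point on the closed polygon boundary."""
    n = len(polygon)
    return min(point_segment_distance(p, polygon[i], polygon[(i + 1) % n])
               for i in range(n))


def polis(poly_a, poly_b):
    """Symmetric average vertex-to-boundary distance between two polygons."""
    a_term = sum(point_boundary_distance(v, poly_b) for v in poly_a) / (2 * len(poly_a))
    b_term = sum(point_boundary_distance(v, poly_a) for v in poly_b) / (2 * len(poly_b))
    return a_term + b_term


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shifted = [(0.5, 0.0), (1.5, 0.0), (1.5, 1.0), (0.5, 1.0)]
score = polis(square, shifted)
print(score)
```

A lower PoLiS value indicates that the two polygons are more similar; identical polygons score 0.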

The IoU and the complexity-aware IoU (CIoU) (zorzi2021polyworld) are also calculated for evaluation. For the two polygons and , the CIoU metric is computed as:


where the first term IoU indicates the normal IoU between two compared polygon masks and . The second term is the relative difference between the total number of vertices from polygon and the number of vertices from polygon . The CIoU metric considers both segmentation accuracy and polygonization complexity.
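As a small worked example, the sketch below assumes that the two terms are combined multiplicatively, with the IoU weighted by one minus the relative difference of the vertex counts, and that the relative difference is the absolute difference of the two counts divided by their sum. The function names are illustrative, and the IoU value is taken as an input rather than computed from geometry.

```python
def relative_difference(n_pred, n_gt):
    """Relative difference between two vertex counts, in [0, 1]."""
    total = n_pred + n_gt
    return abs(n_pred - n_gt) / total if total else 0.0


def c_iou(iou, n_pred, n_gt):
    """Complexity-aware IoU: the IoU is discounted when vertex counts disagree."""
    return iou * (1.0 - relative_difference(n_pred, n_gt))


same_complexity = c_iou(0.9, n_pred=4, n_gt=4)
over_complex = c_iou(0.9, n_pred=12, n_gt=4)
print(same_complexity, over_complex)
```

Under these assumptions, a prediction that overlaps well but uses three times as many vertices as the ground truth loses half of its score, which is how the metric penalizes redundant vertices.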

5.1.2 Inria dataset

The Inria Aerial Image Labeling dataset (maggi2017can) (Inria dataset) contains aerial orthorectified color imagery of pixels with a spatial resolution of 0.3 m. Unlike the AICrowd dataset, the training and testing images of the Inria dataset come from different cities, where the building distribution varies. The original Inria dataset provides pixel-wise semantic segmentation annotations, which are not directly suitable for training our polygon-based method. For training, we therefore use a traditional method to convert the raster ground truth labels to vector format. As suggested in (maggi2017can), the first five images of each location in the Inria training set are used for validation.

As the ground truth of the Inria (maggi2017can) testing samples is not public, we follow the official evaluation protocol by submitting the polygonized results in the format of segmentation masks to the evaluation server. The recommended metrics of IoU and accuracy (Acc) are adopted for comparison. The Acc for the Inria dataset is defined as the percentage of correctly classified pixels, while the IoU is computed only for pixels belonging to buildings.
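The two Inria metrics can be illustrated with a short sketch on binary label sequences. It assumes labels of 1 for building and 0 for background, and the function name `building_iou_and_accuracy` is illustrative; the official evaluation server may differ in details such as how scores are aggregated across images.

```python
def building_iou_and_accuracy(pred, gt):
    """Return (IoU of the building class, overall pixel accuracy) for flat 0/1 label sequences."""
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    tn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 0)
    denom = tp + fp + fn
    iou = tp / denom if denom else 1.0
    acc = (tp + tn) / len(gt)
    return iou, acc


gt_labels = [1, 1, 1, 0, 0, 0, 0, 0]
pred_labels = [1, 1, 0, 1, 0, 0, 0, 0]
iou, acc = building_iou_and_accuracy(pred_labels, gt_labels)
print(iou, acc)
```

Because true negatives count towards accuracy but not towards IoU, accuracy stays high on images dominated by background, which is consistent with the Acc values in Table 3 lying much closer together than the IoU values.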

5.2 Results and Analysis

5.2.1 Results on AICrowd dataset

Table 1 and Table 2 report the quantitative evaluation results on the AICrowd dataset. We compare our method with competing approaches including Mask R-CNN (He2017maskrcnn) based on the implementation of (Mohanty2020deep), PolyMapper (Li2019topo), ASIP (Li2020appro), Frame-Field (Girard2020polygonal), PolyWorld (zorzi2021polyworld) and the recent work proposed by (Li2021joint). The Mask R-CNN implemented by Mohanty2020deep serves as the baseline for the AICrowd dataset; it only produces building masks instead of polygons. To make it comparable with the others, a post-processing step with Douglas-Peucker (DP) simplification (douglas1973al) is adopted to generate vectorized results from the pixel-wise segmentation. The remaining methods all output polygonal building instances.

It can be seen in Table 1 that our method achieves either the highest or comparable scores under the MS-COCO metrics. The average precision (AP) of our method reaches 79.4%. Only the recall at an IoU threshold of 0.5 is slightly lower than that of a few competing methods such as ASIP (Li2020appro), probably due to the omission of some tiny building pieces near the image edges.

| Method | Backbone | AP | AP50 | AP75 | AR | AR50 | AR75 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mask R-CNN (He2017maskrcnn; crowdAI2018baseline) | ResNet101 | 41.9 | 67.5 | 48.8 | 47.6 | 70.8 | 55.5 |
| PolyMapper [1] (Li2019topo) | VGG16 | 50.8 | 81.7 | 58.6 | 58.5 | 85.4 | 67.0 |
| ASIP (Li2020appro) | - | 65.8 | 87.6 | 73.4 | 78.7 | **94.3** | 86.1 |
| Frame-Field [2] (Girard2020polygonal) | UResNet101 | 67.0 | 92.1 | 75.6 | 73.2 | 93.5 | 81.1 |
| Li2021joint | UResNet101 | 73.8 | 92.0 | 81.9 | 72.6 | 90.5 | 80.7 |
| PolyWorld (zorzi2021polyworld) | R2U-Net | 63.3 | 88.6 | 70.5 | 75.4 | 93.5 | 83.1 |
| HiSup (ours) | HRNetV2-W48 | **79.4** | **92.7** | **85.3** | **81.5** | 93.1 | **86.7** |

[1] The implementation code from the author's GitHub repository is used for training by ourselves, and testing is performed on the defined testing dataset.
[2] The results are reported by applying the polygonization method of Frame-Field (Girard2020polygonal) to the segmentation probability maps from the UNet-Variant (Jaku2018winner).

Table 1: Comparison of evaluation results of MS-COCO metrics on the AICrowd dataset. The highest scores are in bold.

Table 2 reports additional evaluation results for the methods with available code. The Boundary IoU is a stricter measurement that focuses on the building boundary area. Our method ranks first on the average boundary precision, which indicates that the building polygons it produces have more precise shapes. Apart from the evaluation of instance segmentation, our method also gets 0.726 on the PoLiS metric (lower is better), which outperforms the other three vectorization methods. This demonstrates that our method has less overall dissimilarity between the vertices of predicted building polygons and the ground truth. For the CIoU metric, our method surpasses all the other listed methods, which means that the buildings extracted by our method are represented by polygons that are both concise and accurate. Meanwhile, the learned reversible building masks give our method more flexibility for representing subtle structural details of buildings.

Fig. 5 shows a visual comparison of qualitative results. Each building instance is represented by a polygon of a different color. Compared with the mask-related approach Frame-Field (Girard2020polygonal), our method produces building polygons with fewer but more precise vertices. The PolyMapper (Li2019topo) and PolyWorld (zorzi2021polyworld) approaches are based on point inference. They both tend to predict relatively regular, simple polygons, but they are not good at dealing with buildings of complex structures. For instance, they fail to extract the building marked with the green hollow polygon in the image in the last column. In contrast, our method handles buildings with holes while still producing compact polygon representations.

| Method | Backbone | APboundary | PoLiS | C-IoU | IoU |
| --- | --- | --- | --- | --- | --- |
| Mask R-CNN (crowdAI2018baseline) | ResNet101 | 15.4 | 3.454 | 50.1 | 61.3 |
| PolyMapper [1] (Li2019topo) | VGG16 | 22.6 | 2.215 | 67.5 | 77.6 |
| Frame-Field (ASM) [3] (Girard2020polygonal) | UResNet101 | 34.4 | 1.945 | 73.8 | 84.3 |
| PolyWorld (zorzi2021polyworld) | R2U-Net | 50.0 | 0.962 | 88.3 | 91.2 |
| HiSup (ours) | HRNetV2-W48 | **66.5** | **0.726** | **89.6** | **94.3** |

[1] See footnote 1 of Table 1.
[3] ASM refers to the Active Skeleton Model polygonization algorithm with marching squares as initialization in Frame-Field (Girard2020polygonal). Here the simplification tolerance parameter equals 1. The whole pretrained UResNet101 model is downloaded.

Table 2: Comparison of additional results on the AICrowd dataset. The best scores are in bold.
Figure 5: Qualitative examples of polygonal mapping of buildings on the AICrowd dataset. Building instances are marked with colors. From the top row to the bottom row: PolyMapper (Li2019topo), Frame-Field (Girard2020polygonal) with ASM as polygonization and UResNet101 as backbone, PolyWorld (zorzi2021polyworld), our results and ground truth.

5.2.2 Results on Inria dataset

For the Inria dataset, our method is compared with five other methods. The top two methods (“Advanced Institute” and “Eugene Khvedchenya”) in Table 3 take the first and second places on the public leaderboard of the Inria dataset website. The ICT-Net (chatterjee2019on), which combines an improved UNet with Dense blocks (huang2017densely) and SE blocks (hu2020sqeeze), also obtains an IoU score higher than 80. These three methods are trained directly on the semantic segmentation annotations of the original Inria dataset. They all produce segmentation results only and need extra post-processing for vectorization. In addition, we list the results of Frame-Field (Girard2020polygonal) and zorzi2020machine, which produce vectorized results as our method does.

Compared with the AICrowd dataset, the Inria dataset contains many buildings with more complex shapes and structures, and the patterns of building distribution vary greatly across cities. Table 3 shows the quantitative results of IoU and Acc on the testing images. Compared with the state-of-the-art approaches for polygonal mapping of buildings (Girard2020polygonal; zorzi2020machine), our proposed method achieves better performance on both IoU and Acc.

Fig. 6 shows qualitative results on an image from the test set of the Inria dataset. Our method produces vectorized polygons that fit buildings of various shapes.

Figure 6: Crop of the polygonal mapping results of our method on a test image from the Inria dataset. Building instances are marked with polygons of different colors. Four zoomed-in views are provided as typical examples.
| Method | IoU | Acc. |
| --- | --- | --- |
| Advanced Institute [4] | 81.91 | 97.41 |
| Eugene Khvedchenya [4] | 81.06 | 97.25 |
| ICT-Net (chatterjee2019on) | 80.32 | 97.14 |
| zorzi2020machine | 74.40 | 96.10 |
| Frame-Field (Girard2020polygonal) | 74.80 | 95.96 [4] |
| HiSup (ours) | 75.53 | 96.27 |

[4] Reported on

Table 3: Comparison of evaluation results on the Inria dataset.

5.3 Discussion

We perform additional studies to further validate the design details of our method.

5.3.1 Effectiveness of mask and vertex attraction

On the basis of the same segmentation masks, we compare our polygon generation algorithm with other methods to validate its effectiveness. We generate the set of segmentation masks by applying our method to the images of the AICrowd small validation dataset. The traditional DP simplification (douglas1973al) algorithm, ASIP (Li2020appro), and Frame-Field (Girard2020polygonal) polygonization are chosen for comparison. The tolerance parameter of DP simplification (douglas1973al) is set to 1. For the ASIP (Li2020appro) polygonization, we keep the parameter settings the same as in the released code. For the Frame-Field (Girard2020polygonal) polygonization, we adopt the Active Skeleton Model with a tolerance of 1, UResNet101 as backbone and marching squares as initialization.
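For reference, the DP baseline can be sketched as the classic recursive Douglas-Peucker procedure on an open polyline: keep the two endpoints, find the intermediate point farthest from the line through them, and recurse on both halves if that distance exceeds the tolerance. This is a generic textbook sketch with illustrative function names, not the implementation used in the experiments; a closed building contour would first need to be split into open chains.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length


def douglas_peucker(points, tolerance):
    """Simplify an open polyline; the result is a subset of the input points."""
    if len(points) < 3:
        return list(points)
    start, end = points[0], points[-1]
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        dist = point_line_distance(points[i], start, end)
        if dist > max_dist:
            max_dist, index = dist, i
    if max_dist > tolerance:
        left = douglas_peucker(points[:index + 1], tolerance)
        right = douglas_peucker(points[index:], tolerance)
        return left[:-1] + right  # drop the duplicated split point
    return [start, end]


contour = [(0.0, 0.0), (2.0, 0.5), (4.0, 0.0), (4.0, 4.0), (2.0, 4.3), (0.0, 4.0)]
simplified = douglas_peucker(contour, tolerance=1.0)
print(simplified)
```

The procedure only decides which input points to keep, so it cannot move a vertex onto the true building corner; any localization error in the mask contour is inherited by the simplified polygon.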

As shown in Table 4, in terms of the segmentation evaluation metrics, the segmentation mask obtained by our method reaches a fairly high AP of 79.3%. The DP simplification (douglas1973al) algorithm gets an AP of 69.5%, which is comparable to most existing methods. Based on the accurate segmentation masks, the ASIP (Li2020appro) and Frame-Field (Girard2020polygonal) polygonization methods get APs of 70.5% and 73.7%, respectively. Both scores are higher than the originally reported results shown in Table 1 (65.8% and 67.0%). Such differences support our view that the reversibility of building masks is decisive for the final polygonization performance.

Despite the comparable performance, the polygonization processes of DP and ASIP incur a great loss compared with the original masks. The segmentation AP of the generated polygons drops by roughly nine to ten points relative to the masks, and the APboundary drops by more than 10 points. The frame field vectors introduced by the Frame-Field learning method (Girard2020polygonal) narrow the gap between masks and polygons slightly. In contrast, our method has the minimum loss when vectorizing the original masks: the gap in AP is less than one point. By matching with vertices, the polygons obtained by our method have more accurate shapes, which leads to an even higher APboundary score than that of the masks. Besides, our generated polygons get the minimum PoLiS score.

| Method | AP | APboundary | PoLiS |
| --- | --- | --- | --- |
| Segmentation mask | 79.3 | 65.8 | - |
| DP poly (douglas1973al) | 69.5 | 51.3 | 1.178 |
| ASIP poly (Li2020appro) | 70.5 | 54.5 | 1.160 |
| Frame-field poly (Girard2020polygonal) | 73.7 | 57.9 | 0.984 |
| Our poly | 78.8 | 66.3 | 0.737 |

Table 4: Comparison of our polygon generation with other approaches on the AICrowd small dataset. “Segmentation mask” refers to the mask results predicted by our proposed network model. “DP poly” refers to the traditional Douglas-Peucker simplification algorithm. “ASIP poly” refers to the ASIP polygonization method. “Frame-field poly” refers to the ASM polygonization method from Frame-Field with UResNet101. “Our poly” refers to our polygon generation algorithm.

5.3.2 Effectiveness of hierarchical supervision

We validate the effect of hierarchical supervision by replacing the AFM with edge and frame-field vectors while the rest of the network architecture stays the same. We also evaluate the designed Cross-Level Interactions by predicting the AFM only as a separate output of the backbone, without the proposed interactions of the AFM embedding. The baseline model predicts only masks and junctions from the backbone features. All these models are trained and tested on the small version of the AICrowd dataset.

The resulting masks and polygons are evaluated by the AP from MS-COCO metrics. The generated junctions are measured by F1-score with the controlled distance threshold set to 5 pixels.
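A junction F1-score with a distance threshold can be computed along the following lines. The matching protocol is not specified in the text, so this sketch assumes a simple greedy one-to-one matching in which each predicted junction claims the nearest unmatched ground-truth junction within the threshold; the function name `junction_f1` is illustrative.

```python
import math


def junction_f1(pred, gt, threshold=5.0):
    """F1-score of predicted junctions under greedy one-to-one matching within `threshold` pixels."""
    unmatched = list(gt)
    true_positives = 0
    for px, py in pred:
        best_index, best_dist = None, threshold
        for i, (gx, gy) in enumerate(unmatched):
            dist = math.hypot(px - gx, py - gy)
            if dist <= best_dist:
                best_index, best_dist = i, dist
        if best_index is not None:
            unmatched.pop(best_index)  # each ground-truth junction is matched at most once
            true_positives += 1
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gt) if gt else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


pred_junctions = [(0.0, 0.0), (10.0, 10.0), (50.0, 50.0)]
gt_junctions = [(1.0, 1.0), (10.0, 12.0), (100.0, 100.0)]
f1 = junction_f1(pred_junctions, gt_junctions, threshold=5.0)
print(round(f1, 3))
```

In this example two of the three predictions fall within 5 pixels of a ground-truth junction, so precision and recall are both two thirds and so is the F1-score.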

As shown in Table 5, simply predicting the AFM representation with a new head improves the mask, junction and polygon results over the baseline. Introducing the proposed Cross-Level Interactions with the AFM embedding improves the performance further. The results also show that the AFM supervision improves mask localization and junction detection more than edge supervision or frame-field vector supervision does.

| Baseline | Edge | Frame-field | AFM (head only) | AFM | Mask (AP) | Polygon (AP) | Junction (F1) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | - | - | - | - | 57.1 | 55.9 | 59.3 |
| ✓ | ✓ | - | - | - | 57.7 | 56.4 | 60.9 |
| ✓ | - | ✓ | - | - | 57.7 | 56.4 | 61.6 |
| ✓ | - | - | ✓ | - | 57.6 | 56.5 | 61.7 |
| ✓ | - | - | - | ✓ | 58.1 | 56.7 | 62.6 |

Table 5: Ablation study of the proposed hierarchical supervision on the AICrowd small dataset. The comparisons are made by replacing the supervision of AFM with edge and frame-field maps, respectively. “AFM (head only)” stands for the simplified version of our model that omits the proposed cross-level interactions.
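| Backbone | AP | AP50 | AR | AR50 | APboundary | PoLiS | C-IoU |
| --- | --- | --- | --- | --- | --- | --- | --- |
| UResNet101 | 75.8 | 90.8 | 77.7 | 91.7 | 61.6 | 0.925 | 88.2 |
| HRNetV2-W18 | 71.4 | 89.6 | 73.7 | 90.4 | 52.6 | 1.116 | 85.2 |
| HRNetV2-W32 | 76.2 | 91.7 | 78.5 | 92.1 | 61.2 | 0.872 | 88.1 |

Table 6: Evaluation of our method with different backbones on the AICrowd dataset.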

5.3.3 Different backbones

We test our method with different backbone network architectures. The classical UResNet101 model is chosen for comparison because it is used by most previous methods. From the HRNet (Wang2021hrnet) series, we adopt the lightweight models HRNetV2-W18 and HRNetV2-W32. Table 6 shows the quantitative evaluation results of the different backbone settings on the AICrowd dataset. Our method with UResNet101 gets an AP of 75.8%, which already outperforms the competing methods listed in Table 1. The larger HRNetV2-W32 model performs better than both HRNetV2-W18 and UResNet101.

6 Conclusion

This paper studies the problem of polygonal mapping of buildings from the novel viewpoint of mask reversibility. By taking hierarchical supervision signals from the bottom-level vertices to the high-level regional masks, we present a novel method, HiSup, which uses the mid-level attraction fields of line segments as the most important linkage. The key component of HiSup, the Cross-Level Feature Interaction, learns a unified feature embedding that captures both high-level semantics and the shape correctness of buildings, thereby closing the gap between the segmentation masks and polygons of buildings in satellite imagery. In the experiments, we show that the proposed HiSup outperforms existing methods for polygonal mapping of buildings on several challenging metrics on the two public benchmarks of AICrowd (Mohanty2020deep) and Inria (maggi2017can). The systematic ablation studies further justify the design choices of HiSup.