HiSup
This paper studies the problem of polygonal mapping of buildings by tackling the issue of mask reversibility, which leads to a notable performance gap between the masks and the polygons predicted by learning-based methods. We address this issue by exploiting hierarchical supervision (of bottom-level vertices, mid-level line segments and high-level regional masks) and propose a novel interaction mechanism for the feature embeddings sourced from the different levels of supervision signals, so as to obtain reversible building masks for polygonal mapping of buildings. As a result, we show that the learned reversible building masks inherit the merits of recent advances in deep convolutional neural networks for high-performing polygonal mapping of buildings. In the experiments, we evaluate our method on the two public benchmarks of AICrowd and Inria. On the AICrowd dataset, our proposed method obtains consistent improvements on the metrics of AP, $AP^{boundary}$ and PoLiS. On the Inria dataset, our proposed method also obtains very competitive results on the metrics of IoU and Accuracy. The models and source code are available at https://github.com/SarahwXU.
Polygonal mapping of buildings, which aims at precisely extracting building footprints from high-resolution satellite imagery in the form of polygons, is a core and active problem in photogrammetric computer vision and remote sensing (sohn2001extraction; zorzi2021polyworld), and plays an important role in Geographic Information Systems (GIS) by making up the basic feature classes (MAYER1999automatic).
Since the footprints and roof outlines of buildings coincide under most circumstances in orthorectified satellite imagery, the mapping of buildings is roughly equivalent to extracting building polygons (wang2022learning). Thus, the problem of polygonal mapping of buildings in satellite imagery has usually been formulated in a multi-step paradigm (Girard2020polygonal; Li2021joint): (1) computing the rasterized binary masks of buildings from the input images, (2) vectorizing the binary masks into polygons with certain heuristic post-processing schemes, and (3) optionally simplifying the initial polygons to reduce the redundancy of vertices. In this paradigm, building segmentation, as a key subproblem, has recently been significantly advanced by deep neural networks (DNNs) (mnih2013phd; Yuan2018learning). Thereafter, the quality of polygonal mapping of buildings can be improved (Wei2020toward) by leveraging off-the-shelf algorithms such as Marching Squares (MarchingCubes) for polygon generation and Douglas-Peucker (douglas1973al) for shape simplification.
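To make step (3) of this paradigm concrete, the following is a minimal sketch of Douglas-Peucker simplification on an open polyline. It is a generic textbook version, not the implementation used by any of the cited methods; the function names and the `tolerance` value are illustrative assumptions.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Parameter of the orthogonal projection, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def douglas_peucker(points, tolerance):
    """Simplify an open polyline; drop points closer than `tolerance` to the chord."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord joining both ends.
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        return [points[0], points[-1]]
    # Otherwise keep that point and recurse on both halves.
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right

# A noisy L-shaped contour (hypothetical data).
contour = [(0, 0), (1, 0.05), (2, 0), (3, 0.02), (4, 0), (4, 1), (4, 2), (4, 3)]
simplified = douglas_peucker(contour, tolerance=0.1)
```

A closed building contour would typically be split at two anchor points and each half simplified separately. The result depends heavily on the tolerance, which is one reason purely geometric simplification struggles to recover the true building corners.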


Although polygonal mapping of buildings has improved significantly thanks to higher-quality segmentation masks, the task still depends heavily on the quality of the learned masks. Furthermore, a closer look at the polygonal annotations of buildings reveals that the learned masks suffer from an issue of reversibility to the polygonal representation, which we call mask reversibility. More precisely, given a polygonal annotation of a building, the mask representation is obtained by rasterizing the polygon into an image domain, and it can be reversed to the polygon as long as the resolution of the image domain is in a reasonable range. However, when the mask representation of the polygonal annotations is used as the supervision signal for optimizing a convolutional neural network, it is extremely challenging to obtain such reversible masks.
To handle this mask-reversibility challenge, the most recent studies either exploited the shape attributes of buildings, as in FrameField (Girard2020polygonal), or improved the polygonal mapping schemes (Li2020appro) built on off-the-shelf mask predictions, pushing the accuracy of predictions to a higher level. Despite that, there remains a notably large performance gap on the AICrowd benchmark (Mohanty2020deep) between polygon and mask predictions. The winning solution (Jaku2018winner) achieves a mask average precision of 79.1%, whereas the best-performing FrameField model (Girard2020polygonal), which was applied to those masks, achieves an average precision of 67.0%. As the mask predicted in Fig. 1(a) is ambiguous at the shape boundaries, the polygons finally predicted by FrameField (Girard2020polygonal) are still far from the human annotations. One possible reason for such a large performance gap is that the supervision signals used pay more attention to high-level regional information than to the detailed geometric shape of buildings.
In this paper, we aim to close the gap between the polygonal and the mask representations of buildings with deep convolutional neural networks, inspired by an observation:
The shape of a polygonal building is composed of points (or vertices) at the bottom level, line segments that associate the points as middle-level information, and finally the regional instances of buildings as high-level semantics.
Therefore, we are motivated to build full-level supervision to train the neural networks for the sake of shape correctness of the predicted masks. As long as the masks predicted by convolutional neural networks capture the polygonal shapes faithfully, only a simple post-processing scheme is required to achieve accurate polygonal mapping of buildings.
Towards this goal, we are inspired by the dual representation of line segments, named regional attraction fields (xue2019learning; xue2021pami), and use it at the middle level as a bridge connecting the unary information of points with the complex geometry of regional shapes. Different from the representation of edge pixels, the dual regional representation of line segments has advantages in two aspects:
It intrinsically connects to the vertices of polygons, as they can be viewed as the endpoints of line segments.
The non-local support region of each line segment corresponds to the region-level information of a part of the building mask.
Although the properties of regional attraction fields (xue2019learning; xue2021pami) are appealing and promising for polygonal mapping of buildings, several challenging open problems remain in computing the higher-order geometric shapes from the mid-level line segments (or their dual representation). One may come up with solutions that directly leverage the detected line segments (or wireframes) as the basic computing clues. However, this leads to an NP-hard problem of finding all the closed loops in an undirected graph. Meanwhile, the detection accuracy of line segments and wireframes also affects the final results. Most recently, PolyWorld (zorzi2021polyworld) attempted to solve the problem of polygonal mapping of buildings along this line by using Sinkhorn (marco2013sink) to approximate the step of finding cycles in a graph. However, it still lacks flexibility when handling complicated buildings that have holes.
In this paper, we present a hierarchical supervision (HiSup) learning scheme to address the mask reversibility issue and show that the geometric information of polygons should be used inductively in the embedding space of the multi-head predictions by end-to-end learning. Specifically, we design a novel information aggregation mechanism to enhance the feature embedding for both the bottom-level vertex prediction and the high-level mask prediction. By explicitly encoding and aggregating the boundary information into the high-level mask head, we reduce the gap between the mask predictions and the expected polygonal ones. During inference, we design a simple greedy scheme to polygonize and simplify the masks by using the predicted vertices. Our proposed method, i.e., HiSup, achieves state-of-the-art performance for polygonal mapping of buildings.
In the experiments, we evaluate our method on the challenging benchmarks of AICrowd (Mohanty2020deep) and Inria (maggi2017can). Compared to the current state-of-the-art approaches FrameField (Girard2020polygonal) and the method of (Li2021joint), the building polygons computed by our HiSup improve the AP (Lin2014microsoft) by 12.4 points and 5.6 points, respectively. In terms of the more challenging metric $AP^{boundary}$ (Cheng2021boundary), we improve on the state-of-the-art performance of 50.0% obtained by PolyWorld (zorzi2021polyworld). On the Inria dataset (maggi2017can), our proposed HiSup also obtains a considerable improvement over FrameField (Girard2020polygonal). A systematic ablation study is performed to further justify the proposed method.
The main contributions of this paper are summarized as:
We address the mask reversibility issue of polygonal mapping of buildings by proposing HiSup, which takes the hierarchical supervision signals of vertices, boundaries, and masks to guide the convolutional neural network to learn in a shape-aware fashion.
We present a key component that aggregates the embedding of attraction fields into the learning of the high-level masks and the bottom-level vertices.
We set several new state-of-the-art results on the challenging AICrowd benchmark in terms of AP and $AP^{boundary}$ by learning semantically-accurate and geometrically-precise masks with our HiSup.
The rest of the paper is organized as follows. Section 2 reviews related works. Section 3 describes the hierarchical representation. Section 4 introduces our polygonal mapping of buildings method. Section 5 reports experimental results and some discussions. Finally, conclusions are drawn in Section 6.
The (polygonal) mapping of buildings is a long-standing problem, which was extensively studied by means of hand-crafted features such as textures, geometry, spectrum, and shadows, with a vast body of literature (pesaresi2011improved; Xia2019geosay; ngo2016shape; senaras2013fusion; mannok2015orient, to cite a few). These kinds of methods can obtain good results on certain buildings that satisfy the pre-defined rules, but their performance degrades when buildings have inconsistent appearances in complex environments. Therefore, research on polygonal mapping of buildings has gradually moved to deep learning solutions to handle more challenging real-world data. In this section, we mainly review the recent advancements in learning-based polygonal mapping of buildings.
The mask representation is widely used for building extraction in two mainstream pipelines: building map segmentation (yang2018mapping) and building instance segmentation (chen2019darnet). The former formulates the problem of building extraction as a pixel-wise binary classification task, while the latter leverages detected bounding boxes of buildings to directly extract a building mask for each instance. Those two pipelines correspond to semantic segmentation (long2015fully) and instance segmentation (He2017maskrcnn), respectively, in natural images, whereas they have the same goal for the task of building extraction in satellite images. As orthorectified satellite imagery eliminates the occlusions between building instances, it is straightforward to separate the foreground pixels of the instance-agnostic segmentation map into multiple instances by connected components labeling schemes (vladimir2018ternaus; Jaku2018winner).
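The connected components step mentioned above can be sketched as follows. This is a minimal breadth-first labelling on a binary mask stored as a list of rows; the use of 4-connectivity and the function name are assumptions made for illustration, not details taken from the cited works.

```python
from collections import deque

def connected_components(mask):
    """Label 4-connected foreground regions of a binary mask (list of rows)."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and labels[y][x] == 0:
                # Start a new instance and flood-fill it.
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count

# Two separate buildings in a toy semantic mask (hypothetical data).
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
labels, num_instances = connected_components(mask)
```

This only works as an instance splitter because buildings in orthorectified imagery do not occlude each other; two touching buildings would still be merged into one component.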
In general, learning a ConvNet is a promising way to obtain high-quality building masks from satellite images (saito2016multiple; alshehhi2017simultaneous; maggi2017conv; Yuan2018learning). However, focusing mainly on high-level semantics makes these approaches inaccurate in terms of building shape correctness. Some works focus on alleviating the boundary uncertainties of building maps, including multi-scale feature fusion (maggi2017high; ji2019fully) and geometrically-aware learning (li2022building) with the assistance of attraction fields (xue2019learning; xue2021pami) or edge constraints (liu2021multiscale). However, their results are not satisfactory for accurate polygonal mapping of buildings.
In our method, we concentrate on the geometric correctness of building masks learned by ConvNets. Compared to those methods, our method learns a unified embedding space by exploiting the different levels of building representations, which maintains the accuracy of the segmentation masks in terms of both high-level semantics and geometric correctness.
Polygonal mapping of buildings refers to obtaining vectorized building instances, the most compact and meaningful representation format. Only ordered building vertices are kept to describe individual building instances. Traditionally, building vectorization is a separate process applied after obtaining the raster map of building instance segmentation. One of the basic families of algorithms is the Douglas-Peucker simplification and its variants (douglas1973al). Those simplification algorithms are usually coarse and have difficulty identifying the right building vertices. Extra refinement strategies based on empirical building shapes are proposed in (Wei2020toward). Lately, based on both the image and its semantic probability map, a polygonal partition approach has shown promising results for building vectorization (Li2020appro). However, these multi-stage methods mostly demand separate optimization processes, so an overall optimized result is hard to obtain.
Currently, automatic polygonal mapping of buildings from remote sensing imagery has received considerable attention. Some methods combine the vectorization process with segmentation through multi-task learning. By learning a building segmentation map aligned with the proposed frame field vectors, FrameField (Girard2020polygonal) obtains building polygons through corner-aware contour simplification. The frame field map it introduces is a 4D vector calculated according to the direction of building edges. In comparison with the frame field (Girard2020polygonal), the attraction fields (xue2019learning; xue2021pami) adopted in our method carry more contextual and structural information. Li2021joint generate building masks, corner locations, and edge orientations simultaneously. The corner location map and edge orientation map are then combined to help identify vertices among the pixels belonging to building boundaries. Beyond that, an additional refinement process is adopted to further adjust building vertices on a finer scale. The refinement of building vertices can also be embedded into the whole model and trained jointly (Chen2020polygoncnn).
The above methods all get building polygons through simplifying the predicted building contours. Instead, we use detected vertices to rebuild the whole building polygons.
There are other methods that extract building polygons on the basis of an object detection formulation. Buildings are first marked with bounding boxes. Then, for every building, corners are predicted one by one through recurrent neural networks (RNNs) with convolutional long short-term memory (ConvLSTM) (Li2019topo). This framework is further improved by (zhao2021building) with modified modules including a global context block (GCB), a boundary refinement block (BRB) and gated recurrent units (GRU). Following the same detection formulation, liu2021afmrnn proposed AFMRNN, which introduced the attraction fields (xue2019learning) as embedded features to enhance building corner detection. In our method, we take advantage of the attraction fields as a bridge that implicitly connects the building masks and the vertices.
Lately, zorzi2021polyworld came up with a new proposal termed PolyWorld, which considers all building polygons within one image as a whole undirected graph. They predict building polygons by detecting vertices and solving for their adjacency connections. The obtained building polygons are visually very neat. Failure cases appear where buildings have complex structures such as inner yards.
Given a satellite image defined on the image lattice, the building polygons present in the image are regions of the continuation of that lattice in the 2D real space. As buildings are man-made objects, the boundary of each region can be described by a polygon that consists of an array of vertices. The order of the vertices in the array indicates the adjacency relationship among them.
The mask representation is extensively used for the extraction of buildings as a high-level semantic segmentation task. In this sense, the regions of building polygons are represented by a segmentation map defined on the image lattice by
(1) 
Benefiting from the bird's-eye view of satellites, the building regions can be observed without occlusion between instances, so a high-performing convolutional neural network for semantic segmentation can obtain instance-level segmentation results in most cases. However, because they concentrate on high-level semantics, the predicted segmentation maps usually pay more attention to the interior of the regions than to their boundaries. As a result, it is very challenging to compute accurate polygons from the predicted masks.
As one main goal of building segmentation is the vectorized representation, the polygonal representation of the boundary information is of great importance for the end task. Here, we briefly discuss both the bottom-level vertices and the mid-level line segments that carry the boundary information.
When we focus on the polygonal boundary of the buildings, the polygon of each building mask contains two kinds of information: the bottom-level vertices and their adjacency relationship, represented by mid-level line segments. With the bottom-level and mid-level geometries of buildings, there is another challenging problem of grouping primitives into instances. It was recently approached from the view of graph computation with Sinkhorn approximation (marco2013sink) to avoid the NP-hard problem of finding cycles in an undirected graph (zorzi2021polyworld).
In this paper, we believe that although mask learning is a high-level recognition task, since polygonal buildings can be lossily converted into the mask representation, there should be a unified embedding space to learn and decode consistent multi-level outputs. To approach this, we use the bottom-level vertices and mid-level line segments as information complementary to the learning of masks, which guides the convolutional neural networks to focus on the bottom-level and mid-level information for the shape correctness of the masks.
Inspired by Li2021joint, we divide the building vertices (i.e., junctions) according to their convexity. In detail, among the vertices of a polygon, those belonging to its minimal convex hull are regarded as convex vertices, and the remaining vertices, which do not lie on the hull, form a second set. By stacking all the vertices across instances, we obtain two sets of vertices as the bottom-level representation of buildings.
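The split by convexity can be illustrated with a small sketch: compute the convex hull of a polygon's vertices (here with Andrew's monotone chain, a standard algorithm chosen for illustration) and separate the vertices that lie on the hull from the rest. The label "concave" for the second set and all function names are assumptions of this sketch.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns the hull corners, collinear points dropped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def split_vertices(polygon):
    """Split polygon vertices into those on the convex hull and the rest."""
    hull = set(convex_hull(polygon))
    convex = [v for v in polygon if v in hull]
    concave = [v for v in polygon if v not in hull]
    return convex, concave

# An L-shaped building footprint (hypothetical data).
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
convex, concave = split_vertices(l_shape)
```

For the L-shaped footprint, only the inner corner at (2, 2) falls off the hull, which matches the intuition that reflex corners are the ones a convex outline misses.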
As discussed before, the line segments are the bridge between the bottom-level vertices and the high-level masks, so we are interested in a representation of line segments that is well-suited for polygonal mapping of buildings. The most natural choice might be the edge map (or line heatmap). However, we found that learning edge maps improves the mask reversibility only marginally. Therefore, we are motivated to use the dual representation, the Attraction Field Representation Map (AFM) (xue2019learning; xue2021pami), for our end task. The line segments of the polygonal instances, each given by its two endpoints, partition the image lattice into regions by
(2) 
(3) 
With this region partition, the attraction field map between any pixel and its closest line segment is computed by
(4) 
where each pixel is associated with its projection on that line segment. As the attraction field map encodes the line segments according to the region partition, the high-level masks are implicitly encoded.
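As an illustration of this dual representation, the sketch below computes an attraction field for a list of line segments: every pixel is assigned to its nearest segment and stores the vector pointing to its projection on that segment. It is a brute-force approximation written for clarity; clamping the projection to the segment endpoints, the (x, y) coordinate convention and the function names are assumptions of this sketch rather than details confirmed by the text.

```python
def project_to_segment(p, a, b):
    """Closest point to p on the segment a-b (projection clamped to the endpoints)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    denom = dx * dx + dy * dy
    t = 0.0 if denom == 0 else ((px - ax) * dx + (py - ay) * dy) / denom
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)

def attraction_field_map(height, width, segments):
    """Per-pixel vector from the pixel to its projection on the nearest segment.

    `segments` must be a non-empty list of ((x1, y1), (x2, y2)) pairs.
    """
    field = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best = None
            for a, b in segments:
                qx, qy = project_to_segment((x, y), a, b)
                d2 = (qx - x) ** 2 + (qy - y) ** 2
                # The nearest segment defines the region the pixel belongs to.
                if best is None or d2 < best[0]:
                    best = (d2, qx - x, qy - y)
            field[y][x] = (best[1], best[2])
    return field

# Edges of a square building polygon on a 5x5 lattice (hypothetical data).
square = [(1, 1), (3, 1), (3, 3), (1, 3)]
edges = [(square[i], square[(i + 1) % 4]) for i in range(4)]
afm = attraction_field_map(5, 5, edges)
```

Every pixel carries information about an edge, not only the pixels on the boundary, which is what makes the representation dense and region-like compared with an edge map.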
In this section, we present a novel approach that learns building masks in a unified embedding space under the supervision of masks, line segments and vertices. The unified embedding space combines the merits of the different levels of building information and thereby enables the convolutional neural networks to learn both the instance-level information and the geometric details with the best reversibility for polygonal mapping of buildings. Then, we present a mask attraction scheme that polygonizes the reversible masks into the expected polygons by attracting the sparse vertices to the boundary of masks.
As illustrated in Fig. 3, given an input image, we use a backbone network (e.g., HRNets (Wang2021hrnet) or UResNets, i.e., U-Net (ronneberger2015u) with a ResNet (he2016deep) encoder) to extract a shared feature map whose spatial size is that of the image reduced by the downsampling factor of the backbone. Because our method learns from the bottom-level to the high-level information of buildings, three convolution layers (each including BatchNorm (ioffe2015batch) and ReLU (nair2010rectified)) with 256 output channels are used to transform the backbone feature into the embedding features of the vertices, of the attraction field maps, and of the segmentation maps. The different embedding features are then transformed by the cross-level interaction module (introduced in Sec. 4.2) to predict (1) the heatmap and the short-range offset field for the vertices, (2) the attraction field map of the line segments, and (3) the segmentation mask of the building polygons.
As the attraction field representation (xue2019learning; xue2021pami) has strong correlations with the vertices and segmentation masks, we regularize the feature maps of the masks and of the vertices to emphasize shape correctness in the mask predictions and semantic consistency in the learning of vertices.
Inspired by the success of ECA-Net (wang2020eca), we extend the efficient channel attention mechanism to enhance the feature representations of the masks and of the vertices by
(5) 
where the three operators involved are the 1D convolution, global average pooling and the Sigmoid function.
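For readers unfamiliar with efficient channel attention, the following sketch shows a plain ECA-style gate built from the three operators named above: global average pooling per channel, a 1D convolution across channels, and a Sigmoid. It operates on nested Python lists for self-containment and applies the gate to the same feature it was computed from; how the gate is wired between the attraction field embedding and the mask or vertex embeddings in HiSup is not specified here, so this wiring and the kernel weights are assumptions.

```python
import math

def eca_gate(feature, kernel):
    """Apply an ECA-style channel gate.

    feature: list of C channels, each an H x W list of lists.
    kernel:  odd-length list of 1D convolution weights shared across channels.
    """
    channels = len(feature)
    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    pad = len(kernel) // 2
    gates = []
    for i in range(channels):
        # 1D convolution over the channel axis with zero padding.
        s = 0.0
        for j, weight in enumerate(kernel):
            idx = i + j - pad
            if 0 <= idx < channels:
                s += weight * pooled[idx]
        gates.append(1.0 / (1.0 + math.exp(-s)))  # Sigmoid
    # Rescale every channel by its gate.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(feature, gates)]

# Two 1x1 channels and an identity-like kernel (hypothetical values).
feature = [[[2.0]], [[4.0]]]
gated = eca_gate(feature, kernel=[0.0, 1.0, 0.0])
```

The appeal of this design is its cost: the gate adds only as many parameters as the kernel has weights, regardless of the number of channels.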
We lift the predicted attraction field by a linear layer to a 128-channel tensor and concatenate it to the backbone feature. The concatenated feature map is subsequently used to predict an intermediate building mask, which guides the backbone feature to learn shape correctness from the target attraction field maps via backpropagation. As the backbone feature map is regularized by the predicted attraction field, we are able to use the enhanced feature map for the final mask prediction. During training, the predicted intermediate mask and the final mask are supervised by the binary cross entropy (BCE) loss with the ground-truth masks by
(6)
while the attraction field is learned from the loss function by
(7) 
where the ground truth of the attraction fields is generated from the edges of the polygons.
Instead of using the salient edge pixels (from predicted masks) as building vertices, we design a vertex branch to learn the heatmap and short-range offsets of vertices. Taking the enhanced vertex feature, a convolution layer is used to learn the heatmap and the offset map, with a cross-entropy loss for the heatmap and a separate loss for the offsets, by
(8) 
where the ground-truth heatmap takes one value for background pixels and another for the vertex targets, and the ground-truth short-range offsets are defined, within a bounded range, for the target pixels.
To train the neural network, we use the predefined loss factors to balance the different types of predictions in the total loss function by
(9) 
where the loss factors weight the predicted masks, the attraction field maps, and the heatmaps and offsets of the vertices, respectively. In our experiments, we set these factors empirically according to the magnitudes of the corresponding losses.
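The structure of such a weighted multi-task objective can be sketched as below. The binary cross entropy for the masks follows the text; using an L1 loss for the attraction fields and offsets is an assumption of this sketch, and the loss names and weight values in the usage lines are purely hypothetical placeholders, not the factors used in the experiments.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross entropy over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)

def l1_loss(pred, target):
    """Mean absolute error over flat lists (assumed regression loss)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def total_loss(losses, weights):
    """Weighted sum of named loss terms; both arguments are dicts with equal keys."""
    return sum(weights[name] * value for name, value in losses.items())

# Hypothetical loss values and weights, for illustration only.
losses = {
    "mask": bce_loss([0.9, 0.2], [1, 0]),
    "afm": l1_loss([0.1, -0.3], [0.0, -0.5]),
}
weights = {"mask": 1.0, "afm": 0.1}
loss = total_loss(losses, weights)
```

Setting the weights according to loss magnitudes, as the text describes, keeps any single term from dominating the gradient early in training.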
Benefiting from our design of the cross-level interaction module, the learned masks have the desired property of shape correctness in the boundary geometry. Therefore, the polygonization of the building masks is straightforward: it traces the boundary pixels of the predicted masks. To compute the most simplified polygons without redundant vertices, we found that the vertices learned from the heatmap and offset predictions cooperate very well with the masks, by attracting the traced boundary pixels of the masks to the instance-agnostic vertices. We term our polygonization scheme Mask-and-Vertices Attraction, or MaVAttr for short. The whole pipeline of our polygon extraction is shown in Fig. 4.
Given a predicted mask, together with the corresponding heatmap and offsets of vertices, we extract the building instances by thresholding the mask into a binary prediction, which is subsequently split into instances by computing the connected components. The redundant polygonal representation of each instance is computed by tracing all of its boundary pixels. With the initial redundant polygons, we only need to focus on their simplification by sparse vertices, which are computed by applying local non-maximum suppression (NMS) to the vertex heatmap. The local NMS is implemented by a MaxPooling layer. After the local NMS, the top-scoring vertices whose classification score exceeds a threshold are retained and refined by the offset vectors for sub-pixel accuracy.
For each initial polygon, we match every boundary pixel to the closest predicted vertex by computing the Euclidean distance by
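The vertex extraction step can be sketched in plain Python as follows. A pixel survives the local NMS if it equals the maximum of its neighbourhood, which is what a max-pooling comparison computes. The 3x3 window, the threshold and top-k values in the usage lines, and the convention that the refined position is the pixel coordinate plus its offset are all assumptions made for this illustration.

```python
def local_nms(heatmap, kernel_size=3):
    """Keep only pixels that are the maximum of their local window."""
    h, w = len(heatmap), len(heatmap[0])
    r = kernel_size // 2
    kept = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window_max = max(
                heatmap[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            )
            if heatmap[y][x] == window_max:
                kept[y][x] = heatmap[y][x]
    return kept

def extract_vertices(heatmap, offsets, top_k, score_threshold, kernel_size=3):
    """Return sub-pixel vertices (x, y) after NMS, thresholding and top-k selection."""
    h, w = len(heatmap), len(heatmap[0])
    kept = local_nms(heatmap, kernel_size)
    candidates = [
        (kept[y][x], x, y)
        for y in range(h)
        for x in range(w)
        if kept[y][x] > score_threshold
    ]
    candidates.sort(reverse=True)  # highest score first
    vertices = []
    for _, x, y in candidates[:top_k]:
        dx, dy = offsets[y][x]
        vertices.append((x + dx, y + dy))  # assumed offset convention
    return vertices

# Toy heatmap with two peaks (hypothetical data).
heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.2, 0.8],
]
offsets = [[(0.0, 0.0)] * 4 for _ in range(3)]
offsets[1][1] = (0.25, -0.25)
vertices = extract_vertices(heatmap, offsets, top_k=10, score_threshold=0.5)
```

In a deep learning framework the same suppression is usually written as a comparison between the heatmap and its max-pooled copy, which runs in one pass on the GPU.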
(10) 
which defines an index mapping from each boundary pixel of the initial polygon to its closest predicted vertex. Then, we go through all the boundary pixels and remove the non-minimal pixels that share the same index. That is, for any two boundary pixels mapped to the same vertex, the one farther from that vertex is removed from the initial polygon. In addition, the pixels that are far away from every predicted vertex (beyond a given distance threshold) are also removed. As we already know the topological relationship between the boundary pixels, a single linear scan suffices to remove the redundant boundary pixels efficiently. In the final step, we check the adjacent edges of the polygons and merge them if they are parallel up to an angle tolerance.
Every vertex of the resulting simplified polygon must belong to the set of predicted vertices, and the number of its vertices is far smaller than the number of boundary pixels of the initial polygon.
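The matching and pruning steps described above can be sketched as a single scan over the traced boundary. For every boundary pixel the nearest predicted vertex is found; per vertex only the closest boundary pixel is remembered, pixels beyond a distance threshold are ignored, and the surviving vertices are emitted in boundary order. The function name, the threshold value and the omission of the final parallel-edge merging are assumptions and simplifications of this sketch.

```python
import math

def mask_vertex_attraction(boundary, vertices, max_distance):
    """Simplify a traced boundary into an ordered list of predicted vertices.

    boundary:     ordered list of (x, y) boundary pixels of one instance.
    vertices:     list of predicted (x, y) vertices, shared by all instances.
    max_distance: boundary pixels farther than this from every vertex are dropped.
    """
    best = {}  # vertex index -> (distance, position along the boundary)
    for pos, (bx, by) in enumerate(boundary):
        dist, idx = min(
            (math.hypot(bx - vx, by - vy), j) for j, (vx, vy) in enumerate(vertices)
        )
        if dist > max_distance:
            continue
        # Keep only the boundary pixel closest to each matched vertex.
        if idx not in best or dist < best[idx][0]:
            best[idx] = (dist, pos)
    # The traced order of the boundary gives the order of the polygon vertices.
    ordered = sorted(best.items(), key=lambda item: item[1][1])
    return [vertices[idx] for idx, _ in ordered]

# Traced boundary of a 4x4 square and noisy predicted vertices (hypothetical data).
boundary = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2),
            (3, 3), (2, 3), (1, 3), (0, 3), (0, 2), (0, 1)]
predicted = [(3.1, 2.9), (0.1, 0.0), (2.9, 0.1), (0.0, 3.1), (10.0, 10.0)]
polygon = mask_vertex_attraction(boundary, predicted, max_distance=1.0)
```

Because the vertices are instance-agnostic, a vertex belonging to another building, such as the one at (10, 10) here, is simply never selected for this instance.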
In our implementation, we mainly use HRNetV2-W48 (Wang2021hrnet) as the backbone for our method. Meanwhile, we also use the smaller backbones HRNetV2-W18 and HRNetV2-W32 (Wang2021hrnet), as well as UResNet101 (ronneberger2015u; he2016deep), to show the possible usage of our method in more practical configurations. The input images are resized to a fixed size during both training and testing. As all the backbone networks used yield downsampled feature maps, the output maps of our method have a correspondingly reduced resolution. In the inference phase, we rescale the polygonal outputs to the original image size of the dataset (e.g., AICrowd (Mohanty2020deep)) for evaluation.
During training, we use the data augmentation strategies of (Girard2020polygonal), including random flip, random rotation, and color jittering. Our network is trained with the ADAM optimizer (diederik2015adam) and weight decay on 4 Nvidia RTX 3090 GPUs using PyTorch 1.8. The learning rate is decayed after 25 epochs of training. For the ablation study, we reduce the number of training epochs due to the limited computation resources and time.
We compared our method to the state-of-the-art methods on two publicly available datasets.
The AICrowd Mapping Challenge dataset (Mohanty2020deep) (AICrowd dataset) contains RGB image tiles and the corresponding annotations in MS-COCO (Lin2014microsoft) format. The training set includes 280741 tiles. As the challenge is already closed, we use the validation subset, which contains 60317 tiles, for testing. The AICrowd dataset also provides a small version that contains fewer samples, which we use in the ablation studies.
For evaluating the performance of building instance segmentation, we follow the standard MS-COCO (Lin2014microsoft) evaluation metrics, which are also the official metrics for the AICrowd dataset. In particular, we report the average precision (AP) and average recall (AR) under a range of intersection over union (IoU) thresholds from 0.50 to 0.95 with a step of 0.05. The precision and recall scores under IoU thresholds of 0.50 and 0.75 are also listed in Table 1, denoted as $AP_{50}$, $AP_{75}$, $AR_{50}$ and $AR_{75}$. Moreover, to put extra emphasis on the quality of building contours, we introduce the Boundary IoU metric (Cheng2021boundary). Compared with the MS-COCO metrics, Boundary IoU better reveals the accuracy of the detected instance boundaries. For two masks $G$ and $P$, the Boundary IoU only computes the IoU between the sets of pixels that lie within distance $d$ from the two mask contours. Denoting these sets of pixels around the contours of $G$ and $P$ as $G_d$ and $P_d$, the Boundary IoU is defined as
$$\text{Boundary IoU}(G, P) = \frac{|(G_d \cap G) \cap (P_d \cap P)|}{|(G_d \cap G) \cup (P_d \cap P)|} \tag{11}$$
Here $d$ equals 0.02, which is consistent with the setting of COCO instance segmentation. The average boundary precision, denoted by $AP^{boundary}$ and computed over a range of Boundary IoU thresholds, is reported in Table 2.
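A small sketch can make the Boundary IoU computation tangible. Each mask is reduced to the foreground pixels lying within a given distance of its contour, and the IoU is taken between these two boundary bands. The brute-force distance test, the contour definition via 4-neighbours, and the distance given directly in pixels are simplifying assumptions of this sketch; reference implementations typically derive the distance from a ratio of the image diagonal and use morphological erosion.

```python
def contour_pixels(mask):
    """Foreground pixels with a background (or out-of-image) 4-neighbour."""
    h, w = len(mask), len(mask[0])
    contour = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    contour.append((y, x))
                    break
    return contour

def boundary_region(mask, d):
    """Set of foreground pixels within Euclidean distance d of the contour."""
    contour = contour_pixels(mask)
    region = set()
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value and any((y - cy) ** 2 + (x - cx) ** 2 <= d * d for cy, cx in contour):
                region.add((y, x))
    return region

def boundary_iou(mask_a, mask_b, d):
    """IoU restricted to the boundary bands of the two masks."""
    a, b = boundary_region(mask_a, d), boundary_region(mask_b, d)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def square_mask(h, w, top, left, size):
    """Binary mask with one filled square (helper for the toy example)."""
    return [[1 if top <= y < top + size and left <= x < left + size else 0
             for x in range(w)] for y in range(h)]

# Two 6x6 squares shifted by one pixel (hypothetical data).
mask_a = square_mask(8, 9, 1, 1, 6)
mask_b = square_mask(8, 9, 1, 2, 6)
score = boundary_iou(mask_a, mask_b, d=1)
```

For these two squares the ordinary mask IoU is 30/42, roughly 0.71, while the boundary-restricted score is lower, which illustrates why the metric is more sensitive to contour errors.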
All the above evaluation metrics are based on results at the segmentation level. To directly compare the quality of the predicted building polygons, we also report the PoLiS metric (Avbelj2015metric). For two given polygons A and B, PoLiS is defined as the average distance between each vertex of A and its closest point on the boundary of B, and vice versa. Suppose the polygon A has vertices $a_j$, $j = 1, \dots, q$, and the polygon B has vertices $b_k$, $k = 1, \dots, r$; the PoLiS metric can be expressed as
$$\mathrm{PoLiS}(A, B) = \frac{1}{2q} \sum_{a_j \in A} \min_{b \in \partial B} \lVert a_j - b \rVert + \frac{1}{2r} \sum_{b_k \in B} \min_{a \in \partial A} \lVert b_k - a \rVert \tag{12}$$
where $\frac{1}{2q}$ and $\frac{1}{2r}$ are the normalization factors. The IoU threshold for filtering predicted building polygons is set to 0.5, following zhao2021building.
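The PoLiS computation described above can be sketched directly: for each vertex of one polygon, take the distance to the closest point on the other polygon's boundary, average with the normalization of one over twice the vertex count, and add the symmetric term. The function names are illustrative, and polygons are assumed to be closed rings given without a repeated first vertex.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def point_to_boundary(p, polygon):
    """Distance from p to the closest point on the closed polygon boundary."""
    n = len(polygon)
    return min(
        point_segment_distance(p, polygon[i], polygon[(i + 1) % n]) for i in range(n)
    )

def polis(poly_a, poly_b):
    """Symmetric PoLiS distance between two polygons (lower is better)."""
    term_a = sum(point_to_boundary(a, poly_b) for a in poly_a) / (2 * len(poly_a))
    term_b = sum(point_to_boundary(b, poly_a) for b in poly_b) / (2 * len(poly_b))
    return term_a + term_b

# Two 4x4 squares offset by one unit along x (hypothetical data).
square_a = [(0, 0), (4, 0), (4, 4), (0, 4)]
square_b = [(1, 0), (5, 0), (5, 4), (1, 4)]
distance = polis(square_a, square_b)
```

Because distances are measured to the boundary rather than to the nearest vertex, a prediction with extra collinear vertices is not penalized by PoLiS itself, which is one motivation for also reporting a complexity-aware metric.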
The IoU and the complexity-aware IoU (CIoU) (zorzi2021polyworld) are also calculated for evaluation. For two polygons A and B, the CIoU metric is computed as
$$\mathrm{CIoU}(A, B) = \mathrm{IoU}(A, B) \cdot \bigl(1 - \mathrm{RD}(N_A, N_B)\bigr), \qquad \mathrm{RD}(N_A, N_B) = \frac{|N_A - N_B|}{N_A + N_B} \tag{13}$$
where the first term, IoU, is the ordinary IoU between the masks of the two polygons A and B. The second term involves the relative difference RD between the number of vertices $N_A$ of polygon A and the number of vertices $N_B$ of polygon B. The CIoU metric thus considers both segmentation accuracy and polygonization complexity.
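As a quick numerical illustration of how the complexity term acts, the sketch below scales a given IoU by one minus the relative difference of the vertex counts. The function name and the example numbers are hypothetical.

```python
def ciou(iou, n_pred, n_gt):
    """Complexity-aware IoU: IoU scaled by (1 - relative vertex-count difference)."""
    relative_difference = abs(n_pred - n_gt) / (n_pred + n_gt)
    return iou * (1.0 - relative_difference)

# Same mask overlap, different polygon complexity (hypothetical numbers).
same_complexity = ciou(0.9, n_pred=4, n_gt=4)    # vertex counts agree
over_complex = ciou(0.9, n_pred=12, n_gt=4)      # three times too many vertices
```

With identical vertex counts the score equals the plain IoU, while a prediction using three times as many vertices as the reference loses half of its score even though the masks overlap equally well.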
The Inria Aerial Image Labeling dataset (maggi2017can) (Inria dataset) contains orthorectified aerial color imagery with a spatial resolution of 0.3 m. Unlike the AICrowd dataset, the training and testing images of the Inria dataset come from different cities, where the building distribution varies. The original Inria dataset provides annotations for pixel-wise semantic segmentation, which are not directly suitable for obtaining polygonized results with our method. For training, we therefore use a traditional method to convert the raster ground truth labels to vector format. As suggested in (maggi2017can), the first five images of each location from the Inria training set are used for validation.
As the Inria dataset (maggi2017can) does not make its testing annotations public, we follow the official evaluation protocol by submitting the polygonized results in the format of segmentation masks to the evaluation server. The recommended metrics of IoU and accuracy (Acc) are adopted for comparison. The Acc for the Inria dataset is defined as the percentage of correctly classified pixels, while the IoU is computed only for pixels belonging to buildings.
Table 1 and Table 2 demonstrate the quantitative evaluation results on the AICrowd dataset. We compare our method with other competing approaches including Mask RCNN (He2017maskrcnn) based on the implementation of (Mohanty2020deep), PolyMapper (Li2019topo), ASIP (Li2020appro), FrameField (Girard2020polygonal), PolyWorld (zorzi2021polyworld) and the recent work proposed by (Li2021joint). The Mask RCNN implemented by Mohanty2020deep serves as the baseline for AICrowd dataset, which can only obtain building masks instead of polygons. To be able to compare with others, a postprocessing with DouglasPeucker (DP) simplification (douglas1973al) is adopted to generate vectorized results from the pixelwise segmentation. The rest of methods all have outputs of polygonal building instances.
Table 1 shows that our method obtains either the highest or comparable scores under the MS COCO metrics. The average precision (AP) of our method reaches 79.4%. Only the recall at an IoU threshold of 0.5 is slightly lower than that of the ASIP (Li2020appro) method, probably because some tiny building pieces near the image borders are missed.
| Method | Backbone | AP | AP50 | AP75 | AR | AR50 | AR75 |
|---|---|---|---|---|---|---|---|
| Mask R-CNN (He2017maskrcnn; crowdAI2018baseline) | ResNet-101 | 41.9 | 67.5 | 48.8 | 47.6 | 70.8 | 55.5 |
| PolyMapper^1 (Li2019topo) | VGG-16 | 50.8 | 81.7 | 58.6 | 58.5 | 85.4 | 67.0 |
| ASIP (Li2020appro) | - | 65.8 | 87.6 | 73.4 | 78.7 | 94.3 | 86.1 |
| FrameField^2 (Girard2020polygonal) | UResNet101 | 67.0 | 92.1 | 75.6 | 73.2 | 93.5 | 81.1 |
| Li2021joint | UResNet101 | 73.8 | 92.0 | 81.9 | 72.6 | 90.5 | 80.7 |
| PolyWorld (zorzi2021polyworld) | R2U-Net | 63.3 | 88.6 | 70.5 | 75.4 | 93.5 | 83.1 |
| HiSup (ours) | HRNetV2-W48 | 79.4 | 92.7 | 85.3 | 81.5 | 93.1 | 86.7 |

^1 The implementation code from the author's GitHub repository (https://github.com/lizuoyue/ETHThesis/tree/master/building) is used for training by ourselves, and we perform the testing on the defined testing dataset.
^2 Here the results are reported by applying the polygonization method mentioned in FrameField (Girard2020polygonal) to the segmentation probability maps from the U-Net variant (Jaku2018winner).
Table 2 reports additional evaluation results for the methods with available code. The Boundary IoU is a stricter measurement that focuses on the building boundary area. Our method ranks first on the boundary average precision, which indicates that the building polygons obtained by our method have more precise shapes. Apart from the instance segmentation metrics, our method also obtains 0.726 on the PoLiS metric, outperforming the other three vectorization methods. This demonstrates that the vertices of our predicted building polygons are less dissimilar overall to the ground truth. On the CIoU metric, our method surpasses all the other listed methods, which means that the buildings extracted by our method are represented by polygons that are both concise and accurate. Meanwhile, the learned reversible building masks give our method more flexibility in representing subtle structural details of buildings.
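To make the PoLiS score more concrete, here is a minimal sketch based on the commonly used definition of the metric: the mean distance from each vertex of one polygon to the closest point on the boundary of the other, computed in both directions and averaged. The function names and the two example squares are hypothetical, and the evaluation code used for Table 2 may differ in details such as how predictions are matched to ground-truth instances.

```python
import math


def _point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def _distance_to_boundary(p, ring):
    """Minimum distance from p to the boundary of a closed ring of vertices."""
    n = len(ring)
    return min(
        _point_segment_distance(p, ring[i], ring[(i + 1) % n]) for i in range(n)
    )


def polis(poly_a, poly_b):
    """Symmetric vertex-to-boundary distance between two polygons (lower is better)."""
    term_a = sum(_distance_to_boundary(p, poly_b) for p in poly_a) / (2 * len(poly_a))
    term_b = sum(_distance_to_boundary(p, poly_a) for p in poly_b) / (2 * len(poly_b))
    return term_a + term_b


# Hypothetical example: a 4x4 square and the same square shifted by one pixel in x.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
shifted = [(1, 0), (5, 0), (5, 4), (1, 4)]
score = polis(square, shifted)
```

Unlike IoU-based scores, this quantity is expressed in pixel units and reacts directly to misplaced vertices, so it complements the mask-based metrics when judging polygon quality.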
Fig. 5 shows a visual comparison of the qualitative results. Each building instance is represented by a polygon of a different color. Compared with the mask-based approach FrameField (Girard2020polygonal), our method produces building polygons with fewer but more precise vertices. The PolyMapper (Li2019topo) and PolyWorld (zorzi2021polyworld) approaches are based on point inference. Both tend to predict relatively regular, simple polygons, but they do not cope well with buildings of complex structure. For instance, they fail to extract the building marked by the green hollow polygon in the image in the last column. In contrast, our method handles buildings with holes while still producing compact polygon representations.
| Method | Backbone | AP^boundary | PoLiS | CIoU | IoU |
|---|---|---|---|---|---|
| Mask R-CNN (crowdAI2018baseline) | ResNet-101 | 15.4 | 3.454 | 50.1 | 61.3 |
| PolyMapper^1 (Li2019topo) | VGG-16 | 22.6 | 2.215 | 67.5 | 77.6 |
| FrameField (ASM)^3 (Girard2020polygonal) | UResNet101 | 34.4 | 1.945 | 73.8 | 84.3 |
| PolyWorld (zorzi2021polyworld) | R2U-Net | 50.0 | 0.962 | 88.3 | 91.2 |
| HiSup (ours) | HRNetV2-W48 | 66.5 | 0.726 | 89.6 | 94.3 |

^3 ASM refers to the Active Skeleton Model polygonization algorithm with marching squares as initialization in FrameField (Girard2020polygonal). Here the simplification tolerance parameter equals 1. The whole pretrained UResNet101 model is downloaded from https://github.com/Lydorn/PolygonizationbyFrameFieldLearning.
[Fig. 5: qualitative comparison of PolyMapper, FrameField (ASM), PolyWorld, our method, and the ground truth.]
For the Inria dataset, our method is compared with five other methods. The top two methods in Table 3 ("Advanced Institute" and "Eugene Khvedchenya") hold the first and second places on the public leaderboard of the Inria dataset website. ICTNet (chatterjee2019on), which combines an improved U-Net with Dense blocks (huang2017densely) and SE blocks (hu2020sqeeze), also achieves an IoU above 80%. These three methods are trained directly on the semantic segmentation annotations of the original Inria dataset. They produce segmentation results only and need extra post-processing for vectorization. In addition, we list the results of FrameField (Girard2020polygonal) and zorzi2020machine, which produce vectorized results as our method does.
Compared with the AICrowd dataset, the Inria dataset contains many buildings with more complex shapes and structures, and the patterns of building distribution vary greatly between cities. Table 3 shows the quantitative results of IoU and Acc on the testing images. Compared to the state-of-the-art approaches for polygonal mapping of buildings (Girard2020polygonal; zorzi2020machine), our proposed method achieves better performance on both IoU and Acc.
Fig. 6 shows the qualitative results on an image from the test set of the Inria dataset. Our method produces vectorized polygons that fit buildings of various shapes.
| Method | IoU | Acc. |
|---|---|---|
| Advanced Institute^4 | 81.91 | 97.41 |
| Eugene Khvedchenya^4 | 81.06 | 97.25 |
| ICTNet (chatterjee2019on) | 80.32 | 97.14 |
| zorzi2020machine | 74.40 | 96.10 |
| FrameField (Girard2020polygonal) | 74.80 | 95.96^4 |
| HiSup (ours) | 75.53 | 96.27 |

^4 Reported on https://project.inria.fr/aerialimagelabeling/leaderboard/.
We perform additional studies to further validate the individual design choices of our method.
To validate the effectiveness of our polygon generation algorithm, we compare it with other methods on the basis of the same segmentation masks. We generate the set of segmentation masks by applying our method to the images of the AICrowd small validation dataset. The traditional DP simplification algorithm (douglas1973al), ASIP (Li2020appro), and FrameField (Girard2020polygonal) polygonization are chosen for comparison. The tolerance parameter of DP simplification (douglas1973al) is set to 1. For the ASIP (Li2020appro) polygonization, we keep the parameter settings of the released code. For the FrameField (Girard2020polygonal) polygonization, we adopt the Active Skeleton Model with a tolerance of 1, UResNet101 as backbone and marching squares as initialization.
As shown in Table 4, the segmentation masks obtained by our method reach a fairly high AP of 79.3%. The DP simplification algorithm (douglas1973al) yields an AP of 69.5%, which is comparable to most existing methods. Starting from the same accurate segmentation masks, the ASIP (Li2020appro) and FrameField (Girard2020polygonal) polygonization methods obtain APs of 70.5% and 73.7%, respectively. Both scores are higher than the originally reported results in Table 1 (65.8% and 67.0%). Such differences support our view that the reversibility of building masks is decisive for the final polygonization performance.
Despite the comparable performance, the polygonization processes of DP and ASIP incur a large loss relative to the original masks. The AP of the generated polygons drops by roughly nine to ten points compared with the masks, and the boundary AP drops by more than 10 points. The frame field vectors introduced by the FrameField learning method (Girard2020polygonal) narrow the gap between masks and polygons slightly. In contrast, our method has the smallest loss when vectorizing the original masks: the gap in AP is less than one point. By matching with vertices, the polygons obtained by our method have more accurate shapes, which leads to a boundary AP even higher than that of the masks. Besides, our generated polygons obtain the lowest PoLiS score.
| Method | AP | AP^boundary | PoLiS |
|---|---|---|---|
| Segmentation mask | 79.3 | 65.8 | - |
| DP poly (douglas1973al) | 69.5 | 51.3 | 1.178 |
| ASIP poly (Li2020appro) | 70.5 | 54.5 | 1.160 |
| FrameField poly (Girard2020polygonal) | 73.7 | 57.9 | 0.984 |
| Our poly | 78.8 | 66.3 | 0.737 |
We validate the effect of hierarchical supervision by replacing the AFM with edge maps and frame field vectors while the rest of the network architecture stays the same. We also evaluate the designed Cross-Level Interactions by predicting the AFM only as a separate output of the backbone, without the proposed interactions of the AFM embedding. The baseline model predicts only masks and junctions from the backbone features. All mentioned models are trained and tested on the small version of the AICrowd dataset.
The resulting masks and polygons are evaluated by the AP from the MS COCO metrics. The generated junctions are measured by the F1-score with the distance threshold set to 5 pixels.
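A thresholded junction F1-score of this kind can be sketched as follows. The greedy one-to-one matching, the assumption that predictions are ordered by decreasing confidence, and all names are choices made for this illustration; the exact matching protocol used in the experiments is not specified in the text.

```python
import math


def junction_f1(pred, gt, threshold=5.0):
    """F1-score of predicted junctions; a hit lies within `threshold` pixels of an unmatched GT junction."""
    matched = set()
    true_positives = 0
    for px, py in pred:  # assumed to be sorted by decreasing confidence
        best_j, best_d = None, threshold
        for j, (gx, gy) in enumerate(gt):
            if j in matched:
                continue
            d = math.hypot(px - gx, py - gy)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            # Each ground-truth junction can be matched at most once.
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gt) if gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical junction sets in pixel coordinates.
gt_junctions = [(10, 10), (50, 10), (50, 50)]
pred_junctions = [(12, 11), (49, 10), (80, 80), (10, 12)]
f1 = junction_f1(pred_junctions, gt_junctions)
```

In this example two of the four predictions are matched: the last prediction is close to an already matched junction and therefore counts as a false positive, which is the behavior that penalizes duplicate detections.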
As shown in Table 5, simply predicting the AFM representation with a new head improves the mask, junction and polygon results over the baseline. When the proposed Cross-Level Interactions with AFM embedding are introduced, the performance is further improved. The results also show that AFM supervision improves mask localization and junction detection more than edge supervision or frame field vector supervision.
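| Baseline | Edge | Frame field | AFM (head only) | AFM | Mask (AP) | Polygon (AP) | Junction (F1) |
|---|---|---|---|---|---|---|---|
| | | | | | 57.1 | 55.9 | 59.3 |
| | | | | | 57.7 | 56.4 | 60.9 |
| | | | | | 57.7 | 56.4 | 61.6 |
| | | | | | 57.6 | 56.5 | 61.7 |
| | | | | | 58.1 | 56.7 | 62.6 |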
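| Backbone | AP | AP50 | AR | AR50 | AP^boundary | PoLiS | CIoU |
|---|---|---|---|---|---|---|---|
| UResNet101 | 75.8 | 90.8 | 77.7 | 91.7 | 61.6 | 0.925 | 88.2 |
| HRNetV2-W18 | 71.4 | 89.6 | 73.7 | 90.4 | 52.6 | 1.116 | 85.2 |
| HRNetV2-W32 | 76.2 | 91.7 | 78.5 | 92.1 | 61.2 | 0.872 | 88.1 |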
We test our method with different backbone network architectures. The classical UResNet101 model is chosen for comparison because it is used by most previous methods. From the HRNet (Wang2021hrnet) series, we adopt the lightweight models HRNetV2-W18 and HRNetV2-W32. Table 6 shows the quantitative evaluation results of the different backbone settings on the AICrowd dataset. Our method with UResNet101 obtains an AP of 75.8%, which outperforms currently available methods. The larger HRNetV2-W32 model achieves a higher AP and AR than both HRNetV2-W18 and UResNet101.
This paper studies the problem of polygonal mapping of buildings from the novel viewpoint of mask reversibility. By exploiting hierarchical supervision signals, from the bottom-level vertices to the high-level regional masks, we present a novel method, HiSup, which uses the mid-level attraction fields of line segments as the most important linkage. The key component of our proposed HiSup method, the Cross-Level Feature Interaction, learns a unified feature embedding that captures both high-level semantics and the shape correctness of buildings, thereby closing the gap between the segmentation masks and the polygons of buildings in satellite imagery. In the experiments, we show that the proposed HiSup outperforms existing methods for polygonal mapping of buildings on several challenging metrics on the two public benchmarks of AICrowd (Mohanty2020deep) and Inria (maggi2017can). The systematic ablation studies further justify the design choices of HiSup.