Building footprint extraction from remote sensing data has many important applications, such as urban planning, tax estimation, population estimation, and energy demand estimation. Manually labeling building footprints is labor-intensive and time-consuming. Therefore, automated building extraction methods are crucial for making the above applications cost-effective. Recently, open-sourced satellite building image datasets and the rapid development of Convolutional Neural Networks (CNNs) have pushed the performance boundary further for automated building extraction. Among the datasets, high-resolution satellite image datasets such as SpaceNet, the Inria Building dataset, and DeepGlobe provide astonishing building details and precise building footprints. In our study, we focus on the Inria Building dataset for its relatively accurate labels.
Figure 1: a) An example building. b) Segmentation output of a CNN network, with the contour of the output marked in red. c) Input of our module, the building probability map produced by a CNN segmentation network. d) Polygon extracted by our method.
Convolutional Neural Networks (CNNs) such as DeepLab, U-Net, and Fully Convolutional Networks have delivered promising results for semantic segmentation [5, 22, 18]. Based on these semantic segmentation networks, numerous methods have been developed specifically for building extraction [31, 2, 13, 29, 12, 16, 10, 3]. Typical building labels for CNN training include more interior pixels than boundary pixels. This imbalance often causes CNNs to produce inaccurate building edges. Some studies utilize the distance transform as additional information to enhance building boundaries. However, CNN approaches focus on textures derived from convolutional filters and do not consider the spatial continuity and smoothness of object boundaries. As a result, building edges may not be straight even with the enhancement from the distance transform. To address these issues, several studies developed end-to-end networks for building polygon estimation [20, 6, 32, 21]. There are two approaches: the active contour approach and the edge assembly approach. The active contour approach considers both the accuracy and the smoothness of edges in its loss function. Yet, its smoothness term discourages the sharp corners that most buildings have. The edge assembly approach first uses a CNN to detect building edges and corners, then assembles them into polygons. This approach builds on the premise that all building edges are detected. Detecting building edges can be difficult when there is an occlusion, and a missing edge can cause the entire polygon to collapse. In summary, existing methods do not perform well at producing smooth edges and sharp corners while handling occlusions at the same time.
In this paper, we leverage the performance of CNNs and propose a module that uses prior knowledge of building corners to create angular and concise building polygons from CNN segmentation outputs. We describe a new transform, the Relative Gradient Angle Transform (RGA Transform), that converts object contours from time vs. space to time vs. angle. We propose a new shape descriptor, the Boundary Orientation Relation Set (BORS), to describe angle relationships between edges in the RGA domain, such as orthogonality and parallelism. Finally, we develop an energy minimization framework that makes use of the angle relationships in BORS to straighten edges and reconstruct sharp corners, and the resulting corners create a polygon. Experimental results demonstrate that our method refines CNN output from a rounded approximation to a more clear-cut angular shape of the building footprint. Figure 2 shows the block diagram of our method.
2 Related Work
Semantic Segmentation Using CNNs:
Pixel-level semantic segmentation is a key task and an active topic in computer vision. The emergence of CNNs has enabled great advancement in this field. Fully Convolutional Networks (FCNs) introduce deconvolution as an upsampling operation to replace the fully connected layers in classification models. U-Net uses multiple upsampling layers and skip connections to improve upon FCNs. SegNet adopts an encoder-decoder architecture to produce dense feature maps. In order to integrate information from various spatial scales, PSPNet features a pyramid pooling module to distinguish patterns with different scales. FPN employs lateral connections to merge feature maps from the bottom-up pathway and the top-down pathway. DeepLab makes use of both dilated convolutions and fully connected Conditional Random Fields (CRFs) as a post-processing step to incorporate both local and global information.
Building Footprint Extraction: The automation of building footprint extraction is an important problem in remote sensing. Early methods focus on utilizing height or 3D geometry information. Studies in [28, 33] propose approaches to extract building footprints from LIDAR data. In , a preliminary building footprint is estimated by the shortest path, and the footprint is refined by maximizing a posterior probability. In , the contour of the building is refined by smaller operations including split, intersect, merge, and remove. These operations involve many thresholds, which makes the method non-robust. In [4, 26], Bredif et al. use height information in digital surface models (DSMs) and digital elevation models (DEMs) to tackle the problem with energy functions and an assumption of rectangular building shapes.
Recent CNNs show promising improvement in semantic segmentation performance, which has also enabled great progress in high-resolution aerial and satellite imagery analysis. U-Net is a popular baseline. In [13, 12], Ji et al. and Iglovikov et al. develop modifications to U-Net targeting building extraction. In , Li et al. apply thresholding and post-processing to U-Net predictions. Other works rely on additional input information for further improvements. In , Bittner et al. fuse RGB, panchromatic, and normalized DSM data as CNN inputs to produce better building segmentation. In , Yuan et al. introduce the signed distance function of building boundaries to improve the CNN output representation. In [29, 2], Yang et al. and Bischke et al. make use of additional distance-transformed building labels as input to CNNs, exploring better preservation of boundaries. In , a composite loss function and weighted building labels are used for the same purpose.
Polygon-based Building Boundary Delineation: The building boundary is considered the most important feature of a building footprint because it defines the shape and location of the building. While pixel-based CNN approaches produce results with high recall, they are not good at building boundary delineation: accurate predictions with sharp corners and straight building edges are hard to achieve. To address this challenge, Marcos et al. propose the deep structured active contours (DSAC) framework, which combines CNNs and the active contour model (ACM) to produce a polygon-based output model that is trainable end-to-end. Cheng et al. improve it by representing contour points in polar coordinates as active rays.
Although the approaches above improve mask coverage compared to pure CNN-based segmentation, blob-like contours that do not closely resemble building boundaries still exist. In , Zhang et al. employ Graph Neural Networks to reconstruct building planar graphs from high-resolution satellite imagery. In , Nauata et al. detect corners, edges, and regions using a CNN, and assemble them using integer programming. These methods generate polygonal building output, but their usability is limited by the requirement of extra building corner and edge annotations for learning planar graph reconstruction.
3 Assumptions and Preprocessing
In this paper, we assume all shapes are closed. To simplify the notation, we use a circular condition for index calculations. For example, given a shape with length $n$ and an index $i$, the index $i \bmod n$ will be used for any index calculation. In addition, all angles are described in degrees.
Let $P$ be the probability map of the network output. Figure 1(c) shows an example probability map for a building. A building pixel in $P$ has a value closer to 1, and non-building pixels have values closer to 0. Its corresponding building segmentation mask $M$ is obtained by thresholding $P$ at probability 0.5:
$$M(x, y) = \begin{cases} 1 & \text{if } P(x, y) \ge 0.5 \\ 0 & \text{otherwise,} \end{cases}$$
where $(x, y)$ is a coordinate in the mask.
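As a concrete illustration, the thresholding step can be sketched as follows; the small array `P` here is a made-up probability map, not an actual CNN output:

```python
import numpy as np

# Hypothetical 3x3 probability map standing in for a CNN output.
P = np.array([[0.9, 0.8, 0.1],
              [0.7, 0.6, 0.2],
              [0.1, 0.3, 0.05]])

# Binary segmentation mask: 1 where the building probability is >= 0.5.
M = (P >= 0.5).astype(np.uint8)
```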
We extract individual buildings in $M$ using Connected Components Analysis . Let $M_k$ be the segmentation mask of a connected component.
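A minimal sketch of this extraction using SciPy's connected-component labeling (the mask below is made up, and the paper's exact implementation may differ):

```python
import numpy as np
from scipy import ndimage

# Hypothetical binary segmentation mask containing two separate buildings.
M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 0, 1]], dtype=np.uint8)

# Label 4-connected components; each label k yields one building mask M_k.
labels, num = ndimage.label(M)
masks = [(labels == k).astype(np.uint8) for k in range(1, num + 1)]
```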
4 Relative Gradient Angle Transform
The gradient angle is an important feature along an object's boundary. Figure (a) shows an example of a gradient angle. We want to represent an object's contour by time vs. angle instead of the typical time vs. space. Angle ranges involved in trigonometry computations are limited to a bounded interval such as $[0°, 360°)$ instead of the full real line. When converting contour points from position to gradient angle, the bounded angle range creates two problems. First, similar angles may have large numerical differences. For example, 0 and 359 have a numerical difference of 359 degrees, but the actual difference between the two is 1 degree. This makes the gradient angle signal spiky and difficult to analyze. The second problem is the following. Ideally, the gradient angle goes from 0 to 360 along an object's contour in one cycle. However, when computing gradient angles using the $\arctan$ function, every angle gets mapped to the $(-90°, 90°]$ range. Determining whether we need to add 360 degrees (which produces the same angle with a different numerical value) or 180 degrees (which produces the opposite angle with the same tangent value) to each angle is difficult. Therefore, direct conversion of contours from the time vs. space domain to the time vs. angle domain is problematic.
Instead of computing gradient angles for individual contour points, we find that the relative difference between adjacent gradient angles is more representative for describing the shape of a contour. If we define the signed gradient angle to be positive pointing inward of the object's mask, we observe that a positive angle difference indicates convexity and a negative angle difference indicates concavity of the object's shape at the angle difference location. Therefore, we propose the Relative Gradient Angle (RGA) Transform to convert contour signals into gradient angle signals along an object's contour.
We apply contour extraction on the mask using the method described in . We obtain an initial contour signal $C = \{c_1, c_2, \ldots, c_n\}$, where $n$ is the number of points and $c_i$ is the $i$th contour point with image coordinate $(x_i, y_i)$. Computing tangent angles directly on the initial contour set can produce noisy results due to spatial aliasing. Therefore, we smooth the contour using a moving average filter with a window size of $w$. We obtain a smoothed contour $\bar{C} = \{\bar{c}_1, \bar{c}_2, \ldots, \bar{c}_n\}$. Figure (b) shows the smoothed contour of a building. We define the middle point of each adjacent contour point pair to be $m_i = (\bar{c}_i + \bar{c}_{i+1})/2$, where $i = 1, \ldots, n$ and $\bar{c}_{n+1} = \bar{c}_1$. The tangent angle at the middle point is
$$\tau_i = \arctan\frac{\bar{y}_{i+1} - \bar{y}_i}{\bar{x}_{i+1} - \bar{x}_i}.$$
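The smoothing and tangent-angle steps can be sketched as below; the helper names and the closed unit-square contour are illustrative choices of ours, not the paper's:

```python
import numpy as np

def smooth_closed_contour(c, w):
    """Circular moving-average filter of odd window size w over a closed
    contour c of shape (n, 2); indices wrap around the contour."""
    n = len(c)
    offsets = np.arange(-(w // 2), w // 2 + 1)
    idx = (np.arange(n)[:, None] + offsets) % n
    return c[idx].mean(axis=1)

def midpoint_tangent_angles(c):
    """Tangent angle (degrees) at the midpoint of each adjacent point
    pair (c_i, c_{i+1}), wrapping the last pair back to the first point."""
    d = np.roll(c, -1, axis=0) - c
    return np.degrees(np.arctan2(d[:, 1], d[:, 0]))

# A counter-clockwise unit square: tangents are 0, 90, 180, -90 degrees.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
```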
Now, we compute gradient angles at the object's contour. We define the gradient angle $g_i$ to be the orthogonal angle pointing inward of the object's mask at $m_i$:
$$g_i = \tau_i \pm 90°,$$
where the sign is chosen so that $g_i$ points inward of the mask.
Let the contour signal in the RGA domain be $A = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$. We call each contour point in the RGA domain a contour angle, and the contour signal in the RGA domain a contour angle signal. We define the $i$th contour angle as the following:
$$\alpha_1 = g_1, \qquad \alpha_i = \alpha_{i-1} + \Delta(g_i, g_{i-1}) \text{ for } i > 1,$$
where $\Delta(a, b)$ is a function that subtracts angle $b$ from $a$ and returns the difference in the $(-180°, 180°]$ range. The set always starts with $g_1$ and ends with $g_1 + 360°$, since the relative differences accumulate to a full turn around a closed contour.
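A sketch of the wrapped-difference and accumulation steps; the helper names and the square's gradient angles below are illustrative assumptions:

```python
import numpy as np

def angle_diff(a, b):
    """Subtract angle b from a, returning the difference wrapped into
    the (-180, 180] degree range."""
    d = (a - b) % 360.0
    return d - 360.0 if d > 180.0 else d

def rga_signal(gradient_angles):
    """Contour angle signal: cumulative sum of wrapped differences
    between consecutive gradient angles, starting at the first angle."""
    alphas = [float(gradient_angles[0])]
    for i in range(1, len(gradient_angles)):
        alphas.append(alphas[-1] + angle_diff(gradient_angles[i],
                                              gradient_angles[i - 1]))
    return alphas
```

For a convex shape such as a square, every corner contributes a positive wrapped difference, so the signal rises monotonically.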
5 Boundary Orientation Relation Set
For each type of object, we assume there is an angle structure in the contour angle signal. In other words, angles with a fixed relation co-occur in the contour angle signal. For example, in building applications, relationships between angles can be orthogonal or parallel. We name the set of angle relationships a Boundary Orientation Relation Set (BORS). We define a BOR set as $R = \{r_1, r_2, \ldots, r_m\}$, where $m$ is the number of relations. Let $\mathcal{B}$ be a bank of relation sets. Each BOR set represents a type of object shape. For example, in this paper, we predefine the BOR set for buildings as $R = \{0°, 90°, 180°, 270°\}$.
To find the best BOR set describing a shape, we first use a median filter with a window size of $w_m$ to filter the noise in the contour angle signal $A$. Unlike other filters, the median filter preserves angle values that are in the contour angle signal. We obtain the processed contour angle signal $\hat{A}$. We then compute the contour angle distribution, denoted as $H(\theta)$, over variable degrees $\theta$. Let $\theta_0$ be an initial angle. Then the set of contour angles associated with the BOR set is $\Theta(\theta_0) = \{\theta_0 + r \mid r \in R\}$. We call $\Theta(\theta_0)$ a structure angle set, and each angle in $\Theta(\theta_0)$ a structure angle.
To find the best relation set, we first define the probability of the contour angle signal containing a relation described by the BOR set $R$ with initial angle $\theta_0$ to be
$$p(R, \theta_0) = \sum_{\theta \in \Theta(\theta_0)} H(\theta).$$
Figure (a) shows the histogram of a contour angle signal.
The initial angle $\hat{\theta}_0$ for the BOR set can be estimated using Maximum Likelihood Estimation (MLE):
$$\hat{\theta}_0 = \arg\max_{\theta_0} p(R, \theta_0).$$
The optimal BOR set $\hat{R}$ can be estimated using MLE as well:
$$\hat{R} = \arg\max_{R \in \mathcal{B}} p(R, \hat{\theta}_0).$$
Figure (b) shows the contour points associated with the optimal BOR set.
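The histogram-based probability and the MLE scan can be sketched as follows; the hit tolerance, the scan step, and the synthetic angle list are our assumptions, not values from the paper:

```python
import numpy as np

BORS = [0, 90, 180, 270]  # orthogonal/parallel relations for buildings

def relation_probability(angles, theta0, relations=BORS, tol=0.5):
    """Fraction of contour angles within tol degrees (wrapped) of some
    structure angle theta0 + r; a stand-in for the histogram mass."""
    angles = np.asarray(angles, dtype=float) % 360.0
    hits = np.zeros(len(angles), dtype=bool)
    for r in relations:
        d = np.abs((angles - (theta0 + r) + 180.0) % 360.0 - 180.0)
        hits |= d <= tol
    return hits.mean()

def estimate_theta0(angles, relations=BORS, step=1.0):
    """MLE of the initial angle: scan theta0 over [0, 90) (the
    orthogonal relation set repeats every 90 degrees) and keep argmax."""
    cands = np.arange(0.0, 90.0, step)
    probs = [relation_probability(angles, t, relations) for t in cands]
    return float(cands[int(np.argmax(probs))])

# A mostly-rectangular contour rotated by 10 degrees, plus one outlier.
angles = [10] * 5 + [100] * 5 + [190] * 3 + [280] * 3 + [45]
```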
6 Shape Refinement and Reconstruction by Quantization Based Energy Minimization
Ideally, every angle in a contour angle signal is a structure angle, so $p(\hat{R}, \hat{\theta}_0) = 1$, meaning that the contour angle signal can be quantized to the structure angle set under the relation set without any loss. It also indicates that the transitions between structure angles are abrupt. In reality, network outputs have round corners where the transitions between structure angles are more gradual. These transitions contribute to the non-structure angle probabilities in the histogram. We replace round corners with sharp corners by quantizing contour angles to their nearest structure angles. However, quantizing noise to its nearest angle can amplify the uncertainty of the noise and create an oscillation effect in the contour angle signal, which results in a "step" effect in the contour signal. Therefore, in our energy minimization framework, we identify and remove noise before applying quantization to contour signals. Figure 5 shows the block diagram of the energy minimization framework.
Let $\hat{R}$ be the estimated BOR set, and let $\Theta$ be the structure angle set. We adjust the contour angle signal $A$ to maximize the probability of the relation set $\hat{R}$.
Equivalently, we minimize the transitions between structure angles, and propose the energy function below:
$$E(A) = \sum_{i=1}^{n} \mathbf{1}\left[\alpha_i \notin \Theta\right],$$
which counts the contour angles that are not structure angles.
We adopt a divide and conquer strategy to minimize the energy function $E$. Let a contour angle signal between two structure angles be a transition angle signal $A_t$, and let its corresponding contour signal be a transition signal $C_t$. We minimize $E$ over the contour angle signal by minimizing the same energy function over each transition angle signal $A_t$. Figure (a) shows a transition signal, and the blue colored line in Figure (d) shows its corresponding transition angle signal.
Let a set of consecutive identical contour angles and their contour points represent a candidate edge. We want to identify whether a candidate edge is noise. Candidate edges with a structure angle are not noise, and we call them structure edges. For every structure edge, we fit a line using Least Squares Estimation (LSE), and we call this line an edge line. These lines will be used to estimate polygon vertices later.
Given a Boundary Orientation Relation Set $\hat{R}$ with structure angle set $\Theta$, and a transition signal $C_t$ with its contour angle signal $A_t$, we minimize the objective function by the following three steps.
First, we divide the transition angle signal $A_t$ into two disjoint sets: a candidate angle set $A_c$ and an edge transition angle set $A_e$. Their corresponding contour sets are $C_c$ and $C_e$. Details are described in Section 6.1.
Second, for each angle in the candidate angle set $A_c$, we quantize it to the nearest structure angle in $\Theta$. We obtain a new contour angle signal $A_c'$. As a result, the probability of the candidate angle set containing the relation set becomes 1.
Details are described in Section 6.2.
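The nearest-structure-angle quantization can be sketched as follows; the structure angle set in the example is illustrative:

```python
import numpy as np

def quantize_to_structure(angles, structure_angles):
    """Snap each contour angle to its nearest structure angle using
    wrapped (circular) angular distance in degrees."""
    a = np.asarray(angles, dtype=float)
    s = np.asarray(structure_angles, dtype=float)
    # Pairwise wrapped distance between angles and structure angles.
    d = np.abs((a[:, None] - s[None, :] + 180.0) % 360.0 - 180.0)
    return s[np.argmin(d, axis=1)]
```

Note the wrapped distance: 350° is closer to 0° (10° apart) than to 270° (80° apart).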
Finally, for contour points in the edge transition contour set $C_e$, we replace them with intersections between edges. As a result, gradual angle transitions are replaced with abrupt angle changes. Moreover, adding intersections between candidate edges does not change the relative gradient angle of each edge. We obtain a new edge transition angle set $A_e'$, in which all angles are structure angles. Now, the probability of the edge transition angle set containing the relation set becomes 1 as well.
Details are described in Section 6.3.
Since $A_c'$ and $A_e'$ together form the new contour angle signal, and both contain only structure angles, the probability of the whole signal containing the relation set becomes 1, and the energy function is minimized to 0.
6.1 Noise Removal
Candidate edges in transition angle signals do not have a structure angle, because the contour of a network's output may not contain building corners or parts of building edges due to noise, smoothing filters, and occlusions. In these cases, a contour shape may not carry enough information to determine whether a candidate edge is a building edge or noise. Thus, additional information such as texture and edge features of the object needs to be used.
In this paper, we use the network's probability map as additional information because the CNN considers both texture and edge features. Ideally, an object's contour aligns with its edge in the probability map. Let $D$ be the first-order derivative of the probability map obtained by applying Sobel filters and normalized to the range $[0, 1]$. Figure (c) shows the first-order derivative map of a building corner. We obtain an edge map $E$ from $D$:
$$E(x, y) = \begin{cases} 1 & \text{if } D(x, y) \ge \tau \\ 0 & \text{otherwise,} \end{cases}$$
where $\tau$ is a threshold. Based on this assumption, any contour point in a contour signal should also be an edge point in $E$. We divide the contour signal into subsets, where each subset contains a consecutive contour signal that describes a candidate edge. Let $C^{(j)}$ be the contour set for the candidate edge $e_j$. Due to occlusion and noise, part of $C^{(j)}$ may not align with edge responses in $E$, and the edge response set of the candidate edge may contain multiple 0s. We assume a candidate edge can be trusted only if at least one point in $C^{(j)}$ is an edge point. If edge $e_j$ does not include any edge points, we move its contour signal from the candidate contour set to the edge transition contour set. We obtain the new candidate contour set $C_c'$ and edge transition contour set $C_e'$.
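A sketch of the edge map and the trust test using SciPy's Sobel filters; the step-shaped probability map and the helper names are our assumptions:

```python
import numpy as np
from scipy import ndimage

def edge_map(P, tau=0.1):
    """Binary edge map: threshold the normalized Sobel gradient
    magnitude of the probability map P at tau."""
    g = np.hypot(ndimage.sobel(P, axis=1), ndimage.sobel(P, axis=0))
    if g.max() > 0:
        g = g / g.max()
    return (g >= tau).astype(np.uint8)

def edge_is_trusted(points, E):
    """A candidate edge is trusted if at least one of its contour
    points (x, y) lands on an edge pixel in E."""
    return any(E[y, x] == 1 for x, y in points)

# A vertical step edge between columns 1 and 2 of the probability map.
P = np.zeros((5, 5))
P[:, :2] = 1.0
E = edge_map(P)
```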
6.2 Relative Gradient Angle Quantization
Let $A^{(j)}$ be the contour angle signal of a candidate edge $e_j$. From the candidate edge definition, all angles in $A^{(j)}$ are equal to some angle $\theta$. Let $\theta^*$ be $\theta$'s nearest structure angle in $\Theta$. We quantize all angles in $A^{(j)}$ to $\theta^*$. With the new angle, we fit a line for the candidate edge using LSE under the constraint that the line orientation is $\theta^*$. Figure (b) shows an example of the estimated line after angle adjustment.
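With the orientation fixed, the least-squares fit reduces to choosing the offset, so the optimal line passes through the centroid of the points. A sketch under that assumption (the sample points are made up):

```python
import numpy as np

def fit_line_fixed_angle(points, theta_deg):
    """LSE line fit with orientation fixed to a structure angle: only
    the offset is free, so the line minimizing perpendicular error
    passes through the centroid with direction theta."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    t = np.radians(theta_deg)
    direction = np.array([np.cos(t), np.sin(t)])
    return centroid, direction  # point-direction form x(s) = c + s*d

# Noisy points along a horizontal edge near y = 2.
pts = [(0.0, 2.1), (1.0, 1.9), (2.0, 2.0)]
c, d = fit_line_fixed_angle(pts, 0.0)
```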
6.3 Edge Transition Analysis
At this point, all structure edges and trusted candidate edges have fitted lines describing the object's edges. We aggregate all structure edges and trusted candidate edges into an edge set, and solve the transition between consecutive edges in the set.
If the transition between two edges is non-parallel, we compute the intersection of the two estimated edge lines. Figure (a) shows an example of solving a non-parallel transition.
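Computing a corner as the intersection of two fitted edge lines in point-direction form can be sketched as (a minimal example with made-up lines):

```python
import numpy as np

def line_intersection(c1, d1, c2, d2):
    """Intersection of two non-parallel lines x = c + t*d: solve
    d1*t1 - d2*t2 = c2 - c1 for the line parameters t1, t2."""
    A = np.column_stack([np.asarray(d1, float), -np.asarray(d2, float)])
    t = np.linalg.solve(A, np.asarray(c2, float) - np.asarray(c1, float))
    return np.asarray(c1, float) + t[0] * np.asarray(d1, float)

# Horizontal edge through (0, 2) meets vertical edge through (3, 0).
corner = line_intersection([0, 2], [1, 0], [3, 0], [0, 1])
```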
If the transition between two edges is parallel, we create a new edge between the two edges to bridge the transition. Let $e_1$ and $e_2$ be two consecutive edges that are parallel to each other. Let $p_1 \in e_1$ and $p_2 \in e_2$ be the closest pair of points. We project $p_1$ and $p_2$ onto the line of $e_1$ and obtain $p_1'$ and $p_2'$. The horizontal difference is $d_x$ and the vertical difference is $d_y$. The middle point is $p_m = (p_1' + p_2')/2$. If there are at least two contour samples within the middle 50% vertical range, we use these samples to represent the new edge $e_m$. Otherwise, we use the middle points to represent the new edge $e_m$. The angle of $e_m$ is then quantized and the line of $e_m$ is estimated according to Section 6.2. The intersections between $e_1$ and $e_m$, and between $e_m$ and $e_2$, are then computed. Figure (b) shows an example of solving a parallel transition.
After all the steps, we minimize the objective function to 0, and the set of all intersections between edges represents the object's polygon. Figure (a) shows a transition signal. Figure (e) shows the histogram before and after minimizing the objective function. Figure (f) shows the updated transition angle signal after minimizing the objective function. Figure 1 shows the overall result for the example.
7 Experimental Results
In this experiment, we evaluate our method on the Inria Building Dataset . In particular, we test our method on buildings with orthogonal corners to demonstrate the feasibility of our method on a set of predefined building corner angles. Buildings with other types of corners will be tested in future work. The Inria dataset contains high-resolution orthorectified aerial images with label masks for buildings. We use image tiles from the Austin region for network training and method testing because there are more residential buildings with orthogonal corners in that region. There are 36 image tiles, and each tile has a size of 5000 × 5000 pixels, covering an area of 1500 m × 1500 m at 30 cm spatial resolution. We use 8 image tiles for network training and 28 image tiles for testing. Our method is evaluated on the buildings that do not touch the image border, because the image border creates non-orthogonal corners. The total number of buildings in the test set is 11362.
For the CNN, we use PSPNet for demonstration; our method works with any segmentation CNN. The output probability maps from PSPNet are used as the inputs of our method. Because we evaluate our method on buildings with orthogonal corners, we use one Boundary Orientation Relation Set $R = \{0°, 90°, 180°, 270°\}$, and the BORS bank is $\mathcal{B} = \{R\}$. The moving average window size $w$ is set to 11 empirically for removing noise and spatial aliasing. The median filter window size $w_m$ is also set to 11 empirically for removing noise in contour angle signals. The edge threshold $\tau$ is set to 0.1. We experimented with values from 0 to 0.95 with a step size of 0.05, and 0.1 provides the best result. A higher $\tau$ value means fewer detected edges and more shape reconstruction.
From the experiment, PSPNet achieves an 80% Intersection over Union (IoU). Based on the PSPNet outputs, our module converts them into polygons. We generate building masks from each polygon, and the masks achieve 78% IoU. Although the proposed approach achieves a lower IoU than the PSPNet contours, we visually find that our method demonstrates a strong capability of reconstructing building corners, especially orthogonal corners, and of straightening wavy edges (as shown in Figure 8). This is essential in most building detection applications, where realistic reconstructed building shapes with realistic corners and edges are more important than the mask accuracy reflected by IoU. We attribute the lower IoU of our method to the following facts. (i) The orthogonal corner assumption in our method is too strong to account for non-orthogonal corners in some buildings. (ii) Our method is not robust enough to the network's false positive detections, which leads to amplification of the network's mistakes. (iii) Ground truth edges are sometimes inaccurate, as demonstrated in Figure 8 IV. In future research, we will improve our method by (1) using more realistic corner assumptions in addition to the orthogonal corner assumption, (2) preventing over-straightening of edges that causes missing building details (as shown in Figure 8 V), and (3) enhancing the robustness of our method to the network's false positive detections to prevent amplification of the network's mistakes (as shown in Figure 8 VI).
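For reference, the IoU metric used above can be computed as follows; the tiny masks are illustrative, not experiment data:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

pred = np.array([[1, 1], [1, 0]])
gt = np.array([[1, 1], [0, 0]])  # intersection = 2, union = 3
```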
This paper presents a method to extract building polygons from CNN segmentation outputs. A new transform, the Relative Gradient Angle Transform, is described for converting object contour signals into the time vs. angle domain. A new shape descriptor, the Boundary Orientation Relation Set, is proposed to represent angle relationships along an object's contour. An energy minimization framework that makes use of the angle relationships in BORS is proposed to straighten edges and reconstruct sharp corners, and the resulting corners create a polygon. Experimental results demonstrate that our method refines CNN output from a rounded approximation to a more clear-cut angular shape of the building footprint. In the future, we will learn BORS from building labels to handle more building types, such as buildings with parallelogram shapes. We will investigate adaptive window sizes for smoothing filters to improve detail handling. We will also investigate incorporating CNN features into our method to address corner reconstruction on false positives.
References

- SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495, 2017.
- Multi-task learning for segmentation of building footprints with deep neural networks. Proceedings of the IEEE International Conference on Image Processing, pp. 1480–1484, 2019.
- Building footprint extraction from VHR remote sensing images combined with normalized DSMs using fused fully convolutional networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (8), pp. 2615–2629, 2018.
- Extracting polygonal building footprints from digital surface models: a fully-automatic global optimization framework. ISPRS Journal of Photogrammetry and Remote Sensing 77, pp. 57–65, 2013.
- DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), 2018.
- DARNet: deep active ray network for building segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7431–7439, 2019.
- Introduction to Algorithms, 3rd edition. The MIT Press, Cambridge, Massachusetts, 2009.
- DeepGlobe 2018: a challenge to parse the earth through satellite images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 172–179, 2018.
- SpaceNet: a remote sensing dataset and challenge series. arXiv preprint arXiv:1807.01232, 2018.
- Building detection from satellite imagery using a composite loss function. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 229–232, 2018.
- Algorithm 447: efficient algorithms for graph manipulation. Communications of the ACM, pp. 372–378, 1973.
- TernausNetV2: fully convolutional network for instance segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 233–237, 2018.
- Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Transactions on Geoscience and Remote Sensing 57 (1), pp. 574–586, 2018.
- Snakes: active contour models. International Journal of Computer Vision, pp. 321–331, 1988.
- Efficient inference in fully connected CRFs with Gaussian edge potentials. Advances in Neural Information Processing Systems, pp. 109–117, 2011.
- Semantic segmentation based building extraction method using multi-source GIS map datasets and satellite imagery. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 238–241, 2018.
- Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.
- Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark. Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, pp. 3226–3229, 2017.
- Learning deep structured active contours end-to-end. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8877–8885, 2018.
- Vectorizing world buildings: planar graph reconstruction by primitive detection and relationship classification. arXiv preprint arXiv:1912.05135, 2019.
- U-Net: convolutional networks for biomedical image segmentation. arXiv preprint arXiv:1505.04597, 2015.
- A 3×3 isotropic gradient operator for image processing. Pattern Classification and Scene Analysis, pp. 271–272, 1973.
- Gauss and the invention of least squares. The Annals of Statistics, pp. 465–474, 1981.
- Topological structural analysis of digitized binary images by border following. Computer Vision, Graphics, and Image Processing 30 (1), pp. 32–46, 1985.
- An efficient stochastic approach for building footprint extraction from digital elevation models. ISPRS Journal of Photogrammetry and Remote Sensing 65 (4), pp. 317–327, 2010.
- Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977.
- A Bayesian approach to building footprint extraction from aerial LIDAR data. Third International Symposium on 3D Data Processing, Visualization, and Transmission, pp. 192–199, 2006.
- Building extraction at scale using convolutional neural network: mapping of the United States. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (8), pp. 2600–2614, 2018.
- Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
- Learning building extraction in aerial scenes with convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (11), pp. 2793–2798, 2017.
- Conv-MPN: convolutional message passing neural network for structured outdoor architecture reconstruction. arXiv preprint arXiv:1912.01756, 2019.
- Automatic construction of building footprints from airborne LIDAR data. IEEE Transactions on Geoscience and Remote Sensing 44 (9), pp. 2523–2533, 2006.
- Pyramid scene parsing network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890, 2017.