GeoSay: A Geometric Saliency for Extracting Buildings in Remote Sensing Images

11/07/2018 ∙ by Gui-Song Xia, et al.

Automatic extraction of buildings in remote sensing images is an important but challenging task with many applications in fields such as urban planning and navigation. This paper addresses the problem of building extraction in very high-spatial-resolution (VHSR) remote sensing (RS) images, whose spatial resolution is often as fine as half a meter and which provide rich information about buildings. Based on the observation that buildings in VHSR-RS images are always more distinguishable in geometry than in the texture or spectral domain, this paper proposes a geometric building index (GBI) for accurate building extraction, computed from the geometric saliency of VHSR-RS images. More precisely, given an image, the geometric saliency is derived from a mid-level geometric representation based on meaningful junctions that can locally describe the geometrical structures of images. The resulting GBI is finally measured by integrating the derived geometric saliency of buildings. Experiments on three public and commonly used datasets demonstrate that the proposed GBI achieves state-of-the-art performance and shows impressive generalization capability. Additionally, GBI preserves both the exact position and the accurate shape of single buildings compared to existing methods.




1 Introduction

Figure 1: Outline of the proposed method. Given an input image, we first detect junctions with the ASJ detector. For each single junction, its reliability (NFA) and an angle constraint derived from statistics contribute to its own building index. The neighboring junctions are found by the K-nearest-neighbor algorithm with a distance constraint, and the weight of each neighbor is calculated from that distance. After combining the information of a single junction and its neighbors, the geometric building index is computed for the whole image. The GBI is then blurred by a Gaussian kernel, and shadow information is added by applying the black top-hat transform.

Obtaining accurate locations and footprint shapes of buildings is an important task in remote sensing applications, and the generated building maps can be used in many fields, such as urban mapping and planning and autonomous driving, as presented by [Ghanea et al., 2016, Jinghui et al., 2004]. In real applications, accurate building maps are often produced manually, which is laborious and expensive. As a result, the speed of updating building maps cannot keep up with the pace of urbanization, especially in cities that develop rapidly, e.g. most of the cities in China. Nowadays, very high-spatial-resolution (VHR) remote sensing (RS) images with spatial resolutions as fine as half a meter, from either aerial or satellite platforms, provide rich details of buildings and have become a popular data source for building mapping. Automatic methods for accurately extracting the locations and footprint shapes of buildings from VHR-RS images are therefore in high demand, e.g. see [Sirmacek and Unsalan, 2011, Pesaresi et al., 2008, Xu et al., 2015].

In the past decades, much research has been dedicated to extracting built-up areas and buildings from remote sensing images, e.g. see [Martinez-Fonte et al., 2005, Pesaresi et al., 2008, Sirmacek and Unsalan, 2009, Huang and Zhang, 2011, Liu et al., 2013, Shao et al., 2014]. A popular approach is to compute building indexes in RS images. For instance, [Zha et al., 2003] proposed the normalized difference built-up index (NDBI) to extract buildings in Landsat-TM images, by making use of the spectral features of buildings in the 4-th and 5-th bands of the multi-spectral images. Besides spectral features, texture features have also been widely used for building extraction. Based on the observation that pixels around buildings often have high local contrast because of cast shadows, [Pesaresi et al., 2008] proposed the texture-derived built-up presence index (Pantex), which uses texture information computed from the gray-level co-occurrence matrix (GLCM) to extract built-up areas from satellite images. As an extension, [Shao et al., 2014] developed the built-up areas saliency index (BASI), relying on multi-scale and multi-direction texture features measured with the non-subsampled contourlet transform. It demonstrated better results in built-up area extraction than the Pantex of [Pesaresi et al., 2008]. In contrast with texture features, geometrical and morphological profiles provide another cue for extracting buildings in VHR-RS images. [Huang and Zhang, 2011] used multi-scale and multi-directional morphological operators to compute features of buildings in RS images and developed the so-called morphological building index (MBI). MBI integrates several morphological characteristics of buildings, such as brightness, shape, and size.

However, it is worth noticing that the geometric structures of buildings become more and more important as the spatial resolution of RS images increases. Especially in VHR-RS images, they are often the most prominent features of buildings. For instance, [Martinez-Fonte et al., 2005] have shown that the density of corners (e.g. Harris corners developed by [Harris and Stephens, 1988] or SUSAN corners proposed by [Smith and Brady, 1997]) is effective for distinguishing man-made structures from natural objects. [Sirmacek and Unsalan, 2009] combined the SIFT key-points of [Lowe, 2004] with graph theory, and explored the relationships between local geometrical features. Compared with other methods, this approach is more theoretically grounded, but it has high computational complexity and is time consuming. Alternatively, [Sirmacek and Unsalan, 2010] later introduced another technique using Gabor feature points and spatial voting, which reported comparable results on the same datasets with much less time for building extraction. Along this line, [Kovács and Szirányi, 2013] developed a method that replaces Gabor filters with a new point feature detector, the so-called Modified Harris for Edges and Corners (MHEC), and proposed an orientation-sensitive voting matrix. More recently, [Liu et al., 2013] demonstrated that clearer geometrical profiles, such as the precise corners and junctions detected by [Xia et al., 2014], are more effective than local key-points for building extraction in VHR-RS images, and proposed the perceptual building index (PBI). Observe that PBI is robust to changes of image contrast and resolution thanks to the robustness of the detected junctions.

In recent years, unlike the above-mentioned methods that rely on hand-crafted features of buildings, approaches based on deep learning [Hu et al., 2015, Zhu et al., 2017] have been proposed to train end-to-end building detection models from a set of annotated images, and they have become one of the most popular directions. [Saito et al., 2016] employed convolutional neural networks (CNNs) as feature extractors to extract both buildings and roads simultaneously from aerial images. In this method, a five-layer multi-channel prediction CNN was designed, which took image patches as input. In addition, a special cost function and model averaging were used, in comparison with those introduced by [Mnih, 2013]. More recently, [Zuo et al., 2016] further improved the building extraction accuracy by developing a hierarchically fused fully convolutional network (HF-FCN). It takes the original image pixels as input and outputs the probability map of the building category. [Shrestha and Vanneschi, 2018] combined a fully convolutional network with conditional random fields for building extraction, which reduced the noise (falsely classified buildings) and sharpened the boundaries of single buildings. As we shall see in the experiments of this paper, these deep learning-based methods can achieve satisfactory results on their own training and testing datasets, but they often show limited generalization capability on images from different datasets and sensors. In addition, collecting well-annotated training data is usually difficult and costs a lot of money and labor in real tasks.

This paper presents a new method for accurately detecting buildings in VHR-RS images by computing the geometric saliency of buildings. Our work is inspired by the observation that, in VHR-RS images, buildings are always more distinguishable in geometry (both local and global) than in other features. Instead of propagating probabilities through spatial voting like PBI, which results in many redundant false pixels, we propose a geometric reasoning process to extract the accurate position and shape of single buildings based on a robust mid-level geometric representation. More precisely, as illustrated in Fig. 1, we first represent VHR-RS images with a mid-level geometrical representation, by exploiting junctions that can locally depict anisotropic geometrical structures of images. We then derive the saliency of geometric structures on buildings, by considering both the probability of each junction, which measures its saliency with respect to its surroundings, and the relationships between junctions. This process can encode both local and semi-global geometric saliency of buildings in images. Finally, the geometric building index (GBI) of the whole image is measured by integrating the computed geometric saliency (GeoSay). A preliminary version of this work was presented by [Huang et al., 2018].

In contrast with existing building indexes, our method results in fewer redundant non-building areas and can provide the accurate location and geometric shape (contours) of buildings. As we shall see in Section 5, our method achieves state-of-the-art performance on three public datasets (all results in this paper are made available online by the authors). Meanwhile, GBI generates reasonably good results independently of satellites, scene categories or image contrast, and it shows promising generalization power across datasets, especially in comparison with learning-based approaches.

The rest of this paper is organized as follows. Sec. 2 introduces junctions and the mid-level geometric structural representation of images. Sec. 3 analyzes the relationship between buildings and junctions. Based on this relationship, Sec. 4 explains in detail how the geometric building index is computed from junctions. In Sec. 5, we compare the proposed method with several state-of-the-art methods on three public and commonly used datasets. Finally, we draw some concluding remarks in Sec. 6.

2 Preliminary: a junction-based representation of images

As mentioned before, local geometric features, such as the Harris corners of [Harris and Stephens, 1988], have long been employed to detect buildings in VHR-RS images. However, inferring the exact position and shape of a building directly from local geometric features is still difficult. Most recent investigations can only find built-up areas rather than single buildings based on local geometric features. In order to infer single buildings, we need to figure out whether local structures such as points and lines are located around buildings, and then integrate related neighboring structures into a whole building. One of the main difficulties of this task is how to find the relationship between buildings and local geometric structures.

In this paper, we propose to explore the relationships between buildings and one specific local geometric structure, i.e. the junction. As given by [Xia et al., 2014], a junction is defined as a local geometrical structure where several edges intersect. Thus, a junction is composed of a central point (corner) and several branches (edges). In contrast with corners, such as Harris corners, a junction is a mid-level geometric structure. Moreover, T-junctions and L-junctions are often distinctive geometric features of man-made objects, e.g. buildings in remote sensing images.

The detection of junctions has been studied for years, but a detailed review is beyond the scope of this paper. Here, we briefly describe the ASJ junction detector of [Xue et al., 2017], which will be used in our work.

Consider a discrete panchromatic remote sensing image as a function $I: \Omega \mapsto \mathbb{R}$, where $\Omega$ is an image grid. A junction, illustrated on the right of Fig. 2, can be defined as

$$\jmath = \left\{ p,\ \{r_i, \theta_i\}_{i=1}^{M},\ \varsigma \right\},$$

where $p$ is the junction's position in the image $I$, $M$ denotes the number of branches, and $r_i$ and $\theta_i$ are the scale and orientation of the $i$-th branch respectively. $\varsigma$ is the significance of the junction, and the smaller the better.

Detecting junctions in an image amounts to finding all the local structures $\jmath$, modeled by the template illustrated on the right of Fig. 2, and estimating their parameters.

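[Xia et al., 2014] proposed the a-contrario junction detector (ACJ) with the help of the a-contrario methodology [Desolneux et al., 2000], where they assumed that the scales of all branches are identical, i.e. $r_1 = r_2 = \dots = r_M = r$. With the ACJ detector, starting from the intersection point of a candidate junction, we need a measurement to judge whether a junction exists there and to find its branches. Each branch of a junction corresponds to an edge, thus the gradient inside the branch's neighborhood should be consistent with the direction of the branch. Given a scale $r$, the neighbors of a branch at point $p$ with direction $\theta$ are defined as the pixels inside a small sector along $\theta$ with radius $r$:

$$S_p(r, \theta) = \left\{ q \in \Omega \;:\; q \neq p,\ \|\overrightarrow{pq}\| \leq r,\ d_{2\pi}\big(\alpha(\overrightarrow{pq}), \theta\big) \leq \Delta(r) \right\},$$

where $\Delta(r)$ is a predefined parameter related to $r$, $\alpha(\overrightarrow{pq})$ is the angle of the vector $\overrightarrow{pq}$ in $[0, 2\pi]$, and $d_{2\pi}$ is the distance along the unit circle, defined as $d_{2\pi}(\alpha, \beta) = \min(|\alpha - \beta|,\ 2\pi - |\alpha - \beta|)$.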
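The sector membership test described above can be illustrated with a short sketch. This is not the ACJ implementation: the function names are hypothetical, pixels are taken as `(x, y)` tuples, and `delta` stands for the angular half-width of the sector, which is assumed here to be supplied by the caller.

```python
import math


def circular_distance(a, b):
    """Distance between two angles along the unit circle, in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def in_sector(p, q, r, theta, delta):
    """Return True if pixel q lies in the sector anchored at p.

    The sector has radius r, points in direction theta, and has an
    angular half-width delta (an assumed, caller-supplied parameter).
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > r:
        return False  # q coincides with p or is farther away than the scale
    alpha = math.atan2(dy, dx) % (2 * math.pi)  # angle of vector pq in [0, 2*pi)
    return circular_distance(alpha, theta) <= delta
```

Taking the distance modulo $2\pi$ matters for directions near $0$: a pixel at angle $2\pi - 0.1$ is only $0.1$ radians away from a branch with direction $0$.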

When a branch corresponds to an edge, most of its neighboring pixels should have gradients consistent with the direction of this branch. The strength of a branch is therefore measured inside its neighboring sector. For a given sector $S_p(r, \theta)$, its strength is measured by

$$t\big(S_p(r, \theta)\big) = \sum_{q \in S_p(r, \theta)} \gamma_p(q),$$

where $\gamma_p(q)$ is the pairwise strength between $p$ and $q$. It is computed from $\tilde{\nabla} I(q)$, the normalized gradient of the image $I$ at position $q$, and $\phi(q)$, the direction of the gradient; its exact expression is given in [Xia et al., 2014].

Figure 2: Template of the isotropic-scale junction defined in ACJ (left) and of the anisotropic-scale junction (ASJ) (right), taken from [Xue et al., 2017]. Junctions detected by ASJ have their own scale for each branch.

With the strength of all branches, the strength of a junction is defined as the minimal strength of its branches. But here we would still need to set a threshold parameter to find meaningful junctions with large enough strength, i.e. junctions whose strength exceeds this threshold. [Xia et al., 2014] has shown that one can instead compute a value called the number of false alarms (NFA), based on the a-contrario theory, to measure the strength of a junction without such a parameter. A meaningful junction should have an NFA in the range $[0, 1]$, and the smaller the better. In the definition of a junction, the significance $\varsigma$ corresponds to the value of the NFA.

Although ACJ was developed to detect junctions in natural images, it can also handle VHR remote sensing images without difficulty. However, the junctions detected by ACJ have only one scale, which means that the lengths of all branches are the same, whereas buildings are often rectangular objects with edges of unequal length. To deal with this problem, [Xue et al., 2017] introduced an improved version of ACJ, called the anisotropic-scale junction (ASJ) detector. Starting from the junctions detected by ACJ, ASJ estimates anisotropic scales along the directions of the junction's branches (see Fig. 2, taken from [Xue et al., 2017]). More details on the ACJ and ASJ detectors can be found in the work of [Xia et al., 2014] and [Xue et al., 2017].

In this paper, we employ the state-of-the-art ASJ detector to extract junctions and build a mid-level geometric profile of buildings. Thus, given a panchromatic remote sensing image $I$, we can represent it by a set of detected junctions $\mathcal{J} = \{\jmath_k\}_{k=1}^{N}$, where $N$ is the number of detected junctions.

3 Statistics of junctions in VHR RS images

Figure 3: Left: The ratio of different types of junctions in building and background areas. The L-junction is clearly the main type in both areas. Right: The structure of an L-junction. An L-junction contains a corner point $p$, two branches ($b_1, b_2$) with their endpoints ($q_1, q_2$), and the included angle $\vartheta$ of the two branches. Each junction also forms a parallelogram region ($P$), and the center of a junction is defined as the center of its parallelogram.

Junctions can be divided into different types based on their number of branches. For example, an L-junction has two branches ($M = 2$), while Y/T-junctions and X-junctions have three and four branches respectively. It is clear that L-junctions often correspond to object corners, while T-junctions imply occlusions between objects in images. Y-junctions and arrow-junctions usually correspond to corners of 3D objects. Junctions of higher order are less discriminative. In order to verify this assumption, we counted the ratio of junctions of different types detected in the building areas and background areas of the Spacenet65 dataset (details of the datasets are described in Section 5). To judge whether a junction is located along buildings or not, we evaluate the overlap between the building areas and the parallelogram region $P$ spanned by the junction. If the area of the overlapping region between $P$ and the ground truth is larger than a given threshold, the junction is considered to be located in the building area. We computed the distribution of junction types over the junctions collected from the Spacenet65 dataset. The result is shown in Fig. 3. Observe that the distributions of junction types are similar in the building area and the background area: L-junctions are dominant, while junctions with more branches are rare.

It is also worth noticing that the L-junction is a fundamental element, and a junction of any type can be represented by several L-junctions. Considering this observation and the computational complexity of handling different junction types, we finally decided to use only L-junctions. To fully utilize all junctions, junctions with more than 3 branches were separated into several L-junctions. For convenience, we rewrite the definition of an L-junction as

$$\jmath = \left\{ p,\ b_1,\ b_2,\ c,\ \vartheta,\ \varsigma \right\},$$

where $b_1, b_2$ are the two branches of the L-junction and $b_i = \overrightarrow{pq_i}$, with $\|b_i\| = r_i$ for $i = 1, 2$. $c$ is the center of the L-junction $\jmath$. $\vartheta$ is the included angle of the junction's two branches. The significance $\varsigma$ is inherited from the original junction. The details of an L-junction are shown in Fig. 3.
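As an illustration of the L-junction attributes described above (endpoints, spanned parallelogram, center, and included angle), the following sketch derives them from a corner point and two branches. The function name, the `(x, y)` coordinate convention, and the returned dictionary layout are assumptions made for this example, not part of the authors' code.

```python
import math


def l_junction_geometry(p, r1, theta1, r2, theta2):
    """Derive the geometric attributes of an L-junction.

    p is the corner point (x, y); (r1, theta1) and (r2, theta2) are the
    scale and orientation (in radians) of the two branches.
    """
    q1 = (p[0] + r1 * math.cos(theta1), p[1] + r1 * math.sin(theta1))
    q2 = (p[0] + r2 * math.cos(theta2), p[1] + r2 * math.sin(theta2))
    # Fourth vertex of the parallelogram spanned by the two branches.
    q3 = (q1[0] + q2[0] - p[0], q1[1] + q2[1] - p[1])
    # The center of a parallelogram is the midpoint of its diagonal p-q3.
    center = ((p[0] + q3[0]) / 2, (p[1] + q3[1]) / 2)
    # Included angle between the two branches, folded into [0, pi].
    diff = abs(theta1 - theta2) % (2 * math.pi)
    included = min(diff, 2 * math.pi - diff)
    return {
        "endpoints": (q1, q2),
        "vertices": [p, q1, q3, q2],
        "center": center,
        "angle": included,
    }
```

For a corner at the origin with branches of length 4 along the x axis and length 2 along the y axis, the center is (2, 1) and the included angle is a right angle.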

Another observation is that, in RS images, the statistics of junctions should differ between buildings and background. As junctions are detected in areas with high gradients, they are likely to be found around the corners of buildings. Buildings are typical man-made objects and their shapes are usually very regular or, more precisely, rectangular. Thus, the included angles of L-junctions should follow a particular distribution when they are located along buildings. To verify this supposition, we calculated the distributions of the included angles of L-junctions in the different regions of the Spacenet65 dataset, as illustrated in Fig. 4. One can see that the included angles of junctions are very close to a right angle in the building area, and differ markedly from those of junctions in the background area. In the building area, the angles of junctions are highly concentrated in a narrow interval around a right angle, while in the background area the distribution around a right angle has less contrast with the other intervals. Such distributions can help us distinguish junctions around buildings from those on other objects, and can be used as a prior in the detection of buildings. In order to parameterize these distributions, we fit the two distributions with Gaussian Mixture Models (GMM, [Mclachlan and Peel, 2000]). The distribution of angles and the fitted parametric probability curves are shown in Fig. 4.

In fact, we also collected statistics on other properties of junctions in the different areas, such as scale and position. The results show, however, that the distributions of those attributes contribute little to distinguishing buildings from background compared with the angles.

4 GeoSay: from junctions to building index

In this section, we will mainly explain the definition of geometric saliency (GeoSay) based on junctions and the details of proposed geometric building index (GBI).

4.1 Geometric saliency

Buildings in VHR images are often geometrically composed of several parallelograms (sometimes even rectangles). A regular building may have several corners where L-junctions will be detected. An L-junction is likely to be detected at the corner of a building, and the junction's two branches will coincide with the two edges meeting at that corner. Thus the parallelogram spanned by the junction's branches will largely overlap with the building. In such a case, the spanned parallelogram can precisely represent a part of the building and preserve its geometric shape.

Based on the ASJ detector, we obtain a mid-level geometric representation of the whole image. Our task is then to find salient junctions that are located at the corners of buildings. Once we find such junctions, buildings can be detected based on the relationship between the junctions' parallelograms and buildings. Here we define first-order and pairwise geometric saliency to find salient junctions, based on the properties of single junctions and on the relationship between neighboring junctions respectively.

4.1.1 First-order geometric saliency

Figure 4: The histograms in blue represent the distribution of the included angles of L-junctions (x axis) in the building area (left) and in the background area (right). The angles around buildings are concentrated around a right angle. The fitted curve is shown in red, and it approximates the distribution well.

There are two important characteristics of junctions located along buildings. First, for an image $I$, the significance $\varsigma$ of each junction indicates the reliability of its detection. The smaller $\varsigma$ is, the more reliable the junction will be. Second, as we have shown in Section 3, the distributions of included angles are statistically discriminative between buildings and background.

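Given an image, all detected junctions can be divided into two subsets, i.e., $\mathcal{J} = \mathcal{J}_B \cup \mathcal{J}_{\bar{B}}$, inside buildings and outside buildings. For a junction $\jmath$ with its parametric description, in particular its included angle $\vartheta$, the posterior probability $P(\jmath \in \mathcal{J}_B \mid \vartheta)$, measuring the possibility of the event that the junction is inside buildings, can be derived by

$$P(\jmath \in \mathcal{J}_B \mid \vartheta) = \frac{p(\vartheta \mid \jmath \in \mathcal{J}_B)\, P(\jmath \in \mathcal{J}_B)}{p(\vartheta \mid \jmath \in \mathcal{J}_B)\, P(\jmath \in \mathcal{J}_B) + p(\vartheta \mid \jmath \in \mathcal{J}_{\bar{B}})\, P(\jmath \in \mathcal{J}_{\bar{B}})},$$

where the prior probabilities $P(\jmath \in \mathcal{J}_B)$, $P(\jmath \in \mathcal{J}_{\bar{B}})$ and the likelihoods $p(\vartheta \mid \jmath \in \mathcal{J}_B)$, $p(\vartheta \mid \jmath \in \mathcal{J}_{\bar{B}})$ can be estimated from a given dataset of buildings, e.g. the Spacenet65 dataset, based on the fitted GMM model.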
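The posterior described above can be sketched with Bayes' rule and two one-dimensional Gaussian mixtures acting as likelihoods. Everything numeric below is a placeholder: the mixture parameters, the use of degrees, and the prior are hypothetical values chosen only to make the example run, not the values fitted in the paper.

```python
import math


def gmm_pdf(x, components):
    """Density of a 1-D Gaussian mixture at x.

    components is a list of (weight, mean, std) tuples.
    """
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in components
    )


def posterior_building(angle, gmm_building, gmm_background, prior_building):
    """Posterior probability that a junction with this angle is on a building."""
    num = gmm_pdf(angle, gmm_building) * prior_building
    den = num + gmm_pdf(angle, gmm_background) * (1.0 - prior_building)
    return num / den if den > 0 else 0.0


# Hypothetical mixtures over the included angle in degrees (NOT fitted values).
GMM_BUILDING = [(0.7, 90.0, 5.0), (0.3, 90.0, 30.0)]
GMM_BACKGROUND = [(1.0, 90.0, 40.0)]

p_right = posterior_building(90.0, GMM_BUILDING, GMM_BACKGROUND, 0.5)
p_acute = posterior_building(45.0, GMM_BUILDING, GMM_BACKGROUND, 0.5)
```

With these placeholder mixtures, a junction with a right angle receives a much higher posterior than one with a 45 degree angle, which is the behavior the angle prior is meant to capture.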

The first-order geometric saliency of a junction is computed by combining the significance parameter and the included angle of that single junction. It indicates the degree to which a single junction is located along buildings.

4.1.2 Pairwise geometric saliency

First-order geometric saliency encodes the properties of single junctions, and pairwise geometric saliency is defined for utilizing the relationship between neighboring junctions.

When there are many junctions whose centers are very close to each other in a region, the probability that a building exists there is higher. Thus, the pair-wise relationships of junctions are useful cues for deriving geometric saliency. In contrast with the first-order saliency, the pair-wise one can encode more global geometric information in images. Here, we use $K$ nearest neighbors to compute the pair-wise saliency. For a junction $\jmath$ with center $c_{\jmath}$, its $K$-nearest neighbors ($K$-NN), denoted by $\mathcal{N}_K(\jmath)$, are defined as the set of junctions $\jmath'$ among the $K$ junctions with the closest centers that satisfy

$$\|c_{\jmath'} - c_{\jmath}\|_2 \leq r_{\max}, \tag{8}$$

where $r_{\max}$ represents the maximal branch length of the junction $\jmath$. An example of junctions and neighboring junctions is displayed in Fig. 5, where the green points are the centers of the junctions inside the $K$-NN of the junction whose center is shown in red. Based on these neighboring junctions, the pair-wise geometric saliency of a junction $\jmath$ is defined, with each neighbor weighted according to its distance to $\jmath$ (see Fig. 1).

Figure 5: Left: junctions detected by the ASJ detector. Right: neighbors of the selected red junction. Its neighbors are the green junctions whose centers satisfy Eq. 8.

Furthermore, neighboring junctions should have similar scales, as they are located on the same building. Besides the distance constraint, there is therefore also a scale constraint: if the ratio of scales between two junctions is too large, the neighboring junction is discarded from the neighbor list.
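The neighbor selection described above (nearest neighbors, then a distance constraint and a scale constraint) can be sketched as follows. Several details are assumptions made for this example: junctions are plain dictionaries with hypothetical keys `center` and `scales`, the distance bound is taken to be the maximal branch length of the query junction, and the scale of a junction is taken to be its longest branch.

```python
import math


def find_neighbors(junctions, idx, k, max_scale_ratio=3.0):
    """Return indices of the neighbors of junctions[idx].

    Each junction is a dict with 'center' (x, y) and 'scales' (r1, r2).
    Candidates are the k junctions with the closest centers; a candidate
    is kept only if it passes the distance and scale constraints.
    """
    query = junctions[idx]
    r_max = max(query["scales"])
    candidates = sorted(
        (math.dist(query["center"], other["center"]), n)
        for n, other in enumerate(junctions)
        if n != idx
    )[:k]
    neighbors = []
    for distance, n in candidates:
        ratio = max(junctions[n]["scales"]) / r_max
        close_enough = distance <= r_max
        similar_scale = 1.0 / max_scale_ratio <= ratio <= max_scale_ratio
        if close_enough and similar_scale:
            neighbors.append(n)
    return neighbors


example = [
    {"center": (0, 0), "scales": (10, 8)},    # query junction
    {"center": (5, 0), "scales": (9, 9)},     # close, similar scale: kept
    {"center": (50, 0), "scales": (10, 10)},  # too far away
    {"center": (3, 4), "scales": (40, 2)},    # four times larger: discarded
    {"center": (0, 6), "scales": (3, 3)},     # more than three times smaller
]
kept = find_neighbors(example, 0, k=4)
```

A brute-force sort is used here for clarity; for images with many junctions a spatial index such as a k-d tree would be the usual choice.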

4.2 Geometric building index

Note that, given an L-junction $\jmath$, its two branches uniquely span a parallelogram $P_{\jmath}$, as shown in Fig. 3. Our geometric building index (GBI) associates each pixel with a saliency measuring the possibility that the pixel belongs to a building, which is the summation of the saliency of all junctions whose parallelogram contains the pixel. Thus, for a pixel $x$ in $I$, we calculate its building index by

$$\mathrm{GBI}(x) = \sum_{\jmath \in \mathcal{J}} \mathcal{S}(\jmath)\, \mathbb{1}_{P_{\jmath}}(x),$$

where $\mathcal{J}$ is the list of junctions detected by the ASJ detector in image $I$, $\mathcal{S}(\jmath)$ is the geometric saliency of junction $\jmath$, and $\mathbb{1}_{P_{\jmath}}(x)$ is an indicator function, which equals $1$ if the pixel $x$ is inside the parallelogram of junction $\jmath$ and $0$ otherwise.
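The accumulation of saliency over junction parallelograms can be sketched as below. This is an illustrative pure-Python version, not the authors' Matlab code: each junction is represented by a hypothetical tuple `(p, b1, b2, saliency)` with corner point, two branch vectors and an already computed saliency value, and the indicator is evaluated by expressing the pixel in the basis of the two branches.

```python
def point_in_parallelogram(x, p, b1, b2):
    """Indicator of the parallelogram spanned by branches b1, b2 at corner p.

    Solves x = p + s*b1 + t*b2 and checks 0 <= s <= 1 and 0 <= t <= 1.
    """
    det = b1[0] * b2[1] - b1[1] * b2[0]
    if det == 0:
        return False  # degenerate junction: the branches are collinear
    dx, dy = x[0] - p[0], x[1] - p[1]
    s = (dx * b2[1] - dy * b2[0]) / det
    t = (b1[0] * dy - b1[1] * dx) / det
    return 0.0 <= s <= 1.0 and 0.0 <= t <= 1.0


def geometric_building_index(height, width, junctions):
    """Sum the saliency of every junction whose parallelogram covers a pixel."""
    gbi = [[0.0] * width for _ in range(height)]
    for p, b1, b2, saliency in junctions:
        for y in range(height):
            for x in range(width):
                if point_in_parallelogram((x, y), p, b1, b2):
                    gbi[y][x] += saliency
    return gbi


demo_junctions = [
    ((1, 1), (2, 0), (0, 2), 0.5),
    ((2, 2), (2, 0), (0, 1), 0.25),
]
demo_gbi = geometric_building_index(5, 5, demo_junctions)
```

Pixels covered by both parallelograms accumulate both saliency values, so overlapping junctions on the same building reinforce each other. A practical implementation would restrict the inner loops to the bounding box of each parallelogram.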

Furthermore, VHR-RS images contain many shadows. The shape of a building's shadow is often regular, so junctions will be detected there as well. [Huang and Zhang, 2012] applied the black top-hat transform to extract shadows. Inspired by this idea, we apply the black top-hat transform to the brightness channel of the original image. The transformed image is obtained by applying a morphological closing to the original image and subtracting the original image from the result. We then use one minus this transformed image as a multiplicative suppression term in the computation of GBI.
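A minimal sketch of this shadow suppression step is given below, assuming the brightness channel is a list of rows with values scaled to $[0, 1]$ and a square structuring element of odd size. The function names are hypothetical, and the window is simply truncated at the image border, which is one of several possible border conventions.

```python
def _window_filter(img, size, op):
    """Apply op (max or min) over a size-by-size window around each pixel."""
    h, w = len(img), len(img[0])
    half = size // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            values = [
                img[j][i]
                for j in range(max(0, y - half), min(h, y + half + 1))
                for i in range(max(0, x - half), min(w, x + half + 1))
            ]
            out[y][x] = op(values)
    return out


def black_top_hat(img, size=3):
    """Morphological closing (dilation, then erosion) minus the original."""
    dilated = _window_filter(img, size, max)
    closed = _window_filter(dilated, size, min)
    return [[c - v for c, v in zip(c_row, v_row)]
            for c_row, v_row in zip(closed, img)]


def suppress_shadows(gbi, brightness, size=3):
    """Multiply the index map by (1 - black top-hat of the brightness)."""
    bth = black_top_hat(brightness, size)
    return [[g * (1.0 - b) for g, b in zip(g_row, b_row)]
            for g_row, b_row in zip(gbi, bth)]


# A bright 5x5 image with one dark pixel standing in for a small shadow.
bright = [[1.0] * 5 for _ in range(5)]
bright[2][2] = 0.0
flat_index = [[1.0] * 5 for _ in range(5)]
suppressed = suppress_shadows(flat_index, bright)
```

The black top-hat responds to dark structures smaller than the structuring element, so the index is pushed toward zero on the dark pixel and left unchanged elsewhere.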

4.3 Numerical implementation of geometric building index

The code of the ASJ detector, written in C++, was provided by its authors on GitHub. The algorithm for calculating the geometric building index was implemented in Matlab, and its pseudocode is shown in Algorithm 1.

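  Input: Image $I$;
  Output: matrix $\mathrm{GBI}$ containing the geometric building index;
  //** Step 1: junction detection **//
  $\mathcal{J} \leftarrow$ Apply ASJ detector on $I$;
  for all $\jmath \in \mathcal{J}$ do
     $\mathcal{N}_K(\jmath) \leftarrow$ the neighbors satisfying the constraint in Eq.8;
  end for
  for all $\jmath \in \mathcal{J}$ do
     //** Step 2: first-order geometric saliency **//
     compute the first-order saliency of $\jmath$ from its significance and included angle;
     //** Step 3: pair-wise geometric saliency **//
     for all $\jmath' \in \mathcal{N}_K(\jmath)$ do
        accumulate the distance-weighted contribution of $\jmath'$ to the saliency of $\jmath$;
     end for
     //** Step 4: geometric building index **//
     $P_{\jmath} \leftarrow$ parallelogram bounded by the two branches of $\jmath$;
     for all pixels $x \in P_{\jmath}$ do
        add the saliency of $\jmath$ to $\mathrm{GBI}(x)$;
     end for
  end for
  //** Step 5: plus shadow information **//
  Apply black top-hat transform in the brightness channel of $I$ and use it to suppress $\mathrm{GBI}$
Algorithm 1 Computation of geometric building index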

The whole algorithm can be divided into five steps. Step 1 detects junctions in the input VHR-RS image and computes their parameters in advance. Step 2 computes the first-order geometric saliency of every junction based on its significance and angle information. Using the neighboring junctions, Step 3 generates the pair-wise geometric saliency. Step 4 then generates the building index of each pixel based on the relationship between the junctions' parallelograms and buildings. Step 5 integrates shadow information into the building index by using the black top-hat transform.

5 Experiments and analysis

In this section, we evaluate the proposed method, i.e. GBI, and compare it with the state of the art on three public and commonly used datasets. The compared building extraction algorithms include BASI ([Shao et al., 2014]), MBI ([Huang and Zhang, 2011]) and PBI ([Liu et al., 2013]), which do not need any training data. We also compare with a deep learning-based method, i.e. HF-FCN ([Zuo et al., 2016]), for which we used the model provided by its authors on GitHub. We also conduct some ablation studies on GBI.

5.1 Experimental setups

5.1.1 Datasets

We used three public datasets as follows.

  • Spacenet65 dataset by [SpaceNet, ]: SpaceNet is a corpus of commercial satellite imagery and partly labeled data, available for academic use. Our experimental data comes from the Area of Interest 1 (AOI 1) at Rio de Janeiro. This dataset collects imagery from the DigitalGlobe WorldView-2 satellite with a spatial resolution of 0.5 m. We used the RGB images with labeled building footprints. To make testing the algorithms feasible, we cropped the original satellite images into smaller images with a fixed size of 2000×2000 pixels. Note that not all cropped images contain buildings, and the given building footprints are not complete. Therefore we only picked cropped images that contain buildings and whose building footprints are all marked. After that, we manually corrected every building footprint in the chosen images. The final Spacenet65 dataset includes 65 images of 2000×2000 pixels and their ground truth. This dataset covers mostly urban and rural areas and buildings of different appearances.

  • Massachusetts dataset: It is a dataset designed for training neural networks for building detection, proposed by [Mnih, 2013]. It contains three subsets: 131 images for training, 4 images for validation and 10 images for testing. The images are 1500×1500 pixels with a resolution of 1 m. Because the gradients of these images are noisy, which makes local structures difficult to extract, we smoothed all images with a small 3×3 Gaussian kernel before applying the ASJ detector.

  • Potsdam dataset: It is published by [ISPRS, ] and contains 38 patches (of the same size), each consisting of a true orthophoto (TOP) extracted from a larger TOP mosaic. Here, we only used the ortho-corrected images. They are all very large images of size 6000×6000 pixels, and we cropped them as before to 2000×2000 pixels. After deleting images without buildings, we finally obtained 214 images with their ground truth. The resolution of the ground truth is 0.05 m. Due to this high resolution, buildings in this dataset appear somewhat larger than in the others.

5.1.2 Settings of parameters

The distributions of the included angles of junctions in the different areas are fitted on the Spacenet65 dataset by applying a Gaussian Mixture Model (GMM) to over 200k collected junctions. Based on the individual shapes of the two distributions in the building and background areas, they are fitted with 3 and 4 Gaussian components respectively.

In general, a rectangular building can be fully represented by 4 L-junctions. Thus, here we set $K$ accordingly in the process of finding neighbors. The ratio of scales between neighboring junctions is set to 3. It means that when the scale of a neighboring junction $\jmath'$ is three times larger or smaller than that of the junction $\jmath$, $\jmath'$ will be discarded.

For blurring, the size of the Gaussian kernel is set to 5-by-5 and its standard deviation $\sigma$ is set to 0.5. The structuring element of the black top-hat transform is a square.

5.1.3 Evaluation metric

To evaluate the accuracy of detection, we employed two commonly used metrics: mean Average Precision (mAP) [Buckley and Voorhees, 2000] and F-score [Powers, 2011]. As directly assessing the generated index map is difficult, we used thresholds ranging over $[0, 1]$ with a fixed step to segment the index map. For each binary segmentation result, pixels are divided into true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The precision of building detection is then the proportion of correctly detected building pixels among all detected building pixels. The recall is the proportion of correctly detected building pixels among all building pixels. They can be written as

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$

Taking recall as the x axis and precision as the y axis, we can plot a precision-recall curve (Fig. 6), where the precision $p(r)$ is a function of the recall $r$. The average precision (AP) is the area between the curve and the x axis,

$$AP = \int_0^1 p(r)\, dr.$$

The mean average precision is the mean value of the AP scores of the images in one dataset.

The F-score is defined as the harmonic mean of precision and recall,

$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

For an image, each segmentation result corresponds to an F-score, and we choose the maximal F-score as the final result. The F-score of a dataset is the average score of all the images in the dataset.
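The evaluation protocol above can be sketched in a few lines. Some choices below are assumptions made for this example: the threshold step of 0.1, the convention that precision is 1 when nothing is detected, and the use of the trapezoidal rule to approximate the area under the precision-recall curve.

```python
def precision_recall(index_map, truth, threshold):
    """Precision and recall of the index map binarized at threshold."""
    tp = fp = fn = 0
    for index_row, truth_row in zip(index_map, truth):
        for value, is_building in zip(index_row, truth_row):
            predicted = value >= threshold
            if predicted and is_building:
                tp += 1
            elif predicted:
                fp += 1
            elif is_building:
                fn += 1
    # Assumed convention: precision is 1.0 when no pixel is detected.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(index_map, truth, step=0.1):
    """Return (average precision, maximal F-score) for one image."""
    n = int(round(1 / step))
    points = [precision_recall(index_map, truth, k * step) for k in range(n + 1)]
    best_f = max(f_score(p, r) for p, r in points)
    # Trapezoidal approximation of the area under the precision-recall curve.
    curve = sorted((r, p) for p, r in points)
    ap = sum((r2 - r1) * (p1 + p2) / 2
             for (r1, p1), (r2, p2) in zip(curve, curve[1:]))
    return ap, best_f


toy_index = [[0.9, 0.8], [0.2, 0.1]]
toy_truth = [[1, 1], [0, 0]]
toy_ap, toy_best_f = evaluate(toy_index, toy_truth)
```

The dataset-level scores are then obtained by averaging the per-image AP (giving mAP) and the per-image maximal F-score over all images.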

Methods      Spacenet65          Massachusetts       Potsdam
             mAP     F-score     mAP     F-score     mAP     F-score
BASI         0.34    0.44        0.32    0.40        0.34    0.44
MBI          0.28    0.35        0.28    0.38        0.17    0.35
PBI          0.27    0.37        0.25    0.36        0.41    0.50
HF-FCN       0.04    0.12        0.76*   0.74*       0.03    0.10
GBI (ours)   0.46*   0.52*       0.37    0.44        0.46*   0.59*
Table 1: F-score and mAP of several methods on three datasets. The best result in each column is marked with *.

5.2 Results and analysis

The F-scores and mAP of all compared methods on the three datasets are shown in Table 1. Our method achieves the best performance on both the Spacenet-65 and Potsdam datasets. According to Table 1, the mAP of GBI exceeds that of the second-best method by 0.12 on Spacenet-65 and by 0.05 on Potsdam, and its F-score is higher by 0.08 and 0.09, respectively. On the Massachusetts dataset, HF-FCN obtains the best performance, mainly because the HF-FCN model is trained directly on this dataset. However, HF-FCN shows very poor generalization capability when applied to the other two datasets. BASI, a texture-based method, also performs well on the three datasets although it is not the best one, but it generates many false detections around buildings. MBI and PBI have the same problem. The score of PBI is high on the Potsdam dataset, but the segmented results (Fig. 9) show that it only detects built-up areas. The areas detected by BASI, MBI, and PBI are often correct but irregular, so these methods may be suited as a pre-processing step to find built-up areas before accurate building detection.

Fig. 6 compares the precision-recall (PR) curves on the Spacenet65, Massachusetts and Potsdam datasets, respectively. On the Spacenet dataset, HF-FCN (the purple curve) has high precision but a very low recall rate. The curve of MBI drops very fast as the recall rate increases: the detected buildings are often incomplete and their geometric shapes are broken. Although BASI does not reach such high precision, its precision decreases only slightly as the recall rate increases. Compared with the other methods, GBI performs well. The curve of GBI (the red one) is always higher than the others and achieves the highest F-score. However, the recall rate of GBI rarely reaches 1, because not every building can be detected through junctions and the geometric saliency is therefore limited in those areas. Besides GBI, other methods such as HF-FCN and MBI have the same problem.

Figure 6: PR curves of different methods on the Spacenet65, Massachusetts and Potsdam datasets, respectively. There are five methods, and ours is shown in red. F indicates the F-score.
Figure 7: The PR curves of the proposed algorithm with different settings on the Massachusetts dataset.
Figure 8: Building indexes generated by different methods. The color represents the value of the building index, increasing from blue to red. The results of BASI and PBI are distributed mainly around building areas, and it is difficult to distinguish roads from single buildings. MBI misses some parts of buildings and HF-FCN generalizes poorly to different datasets. In contrast, the building indexes generated by the proposed GBI are mainly distributed on single buildings, which are easy to identify.
Figure 9: Segmented results of different building detection methods. Green areas are correctly detected buildings, red areas are false detections, and blue areas are missed buildings. Compared with other methods, the result of GBI is very clean and much more similar to the ground truth. Furthermore, it preserves the geometric shape of single buildings.

Fig. 8 and Fig. 9 display the corresponding building indexes and segmented results of 5 representative images from the three datasets. The building indexes generated by GBI concentrate more on buildings. Thus the segmented result is very clean (with fewer false detections) and the precision is much higher than that of other methods. In contrast, the building indexes calculated by the texture-based BASI are very messy. Besides the buildings, they also take large values on non-building objects, such as long roads and textured forest. The indexes given by MBI suffer from the same problem. PBI only detects the building areas and cannot provide the accurate shape of buildings. HF-FCN has a high precision but a very low recall. The results given by GBI, on the other hand, better preserve the shape of single buildings. For instance, in the segmented results, the detected buildings have a clear boundary.

Similar results can be found on the Potsdam and Massachusetts datasets. It is worth noting that HF-FCN shows by far the best performance on the Massachusetts dataset but extremely poor performance on the other two datasets. This is mainly because the HF-FCN model was trained on the Massachusetts dataset and has learned no prior knowledge from the other two datasets. Thus, when there is little prior knowledge about the data and no annotated training samples are available, GBI can be the best choice for building extraction.

5.3 Ablation studies

Methods                Spacenet65        Massachusetts     Potsdam
                       mAP    F-score    mAP    F-score    mAP    F-score
raw saliency           0.41   0.48       0.31   0.41       0.45   0.58
+ neighbor             0.42   0.49       0.32   0.42       0.45   0.59
+ angle                0.45   0.52       0.35   0.44       0.46   0.59
+ shadow (final GBI)   0.46   0.52       0.37   0.44       0.46   0.59
Table 2: Effect of taking different part of GBI into consideration. The best result is shown in bold.

The computation of GBI involves several parts: the junction significance (raw saliency of a single junction), the neighboring information (pairwise geometric saliency), the distribution of junction angles and the shadow information. It is of great interest to study the effect of these parts on the final accuracy of building detection.

In this experiment, we carry out ablation studies by changing the way the geometric saliency is generated. First, we call it the raw saliency when only the significance of junctions is used and there is no pairwise term (the first-order saliency in Eq. 7 is changed accordingly). We then add the pairwise term, the prior distribution of junction angles, and the shadow information to the raw saliency one after another. Note that the fourth variant, with shadow information, is the final GBI used in the previous experiments. The results of this experiment are shown in Table 2, and the precision-recall curves evaluated on the Massachusetts dataset are displayed in Fig. 7.

One can see that the newly introduced information has a positive effect on the accuracy on the Massachusetts and Spacenet datasets. More precisely, taking the distribution of junction angles into consideration increases the performance the most. The Massachusetts dataset contains mainly urban areas, so buildings are more likely to have regular shapes, and introducing the angle distribution yields more salient junctions inside building areas. For the computation of GBI, we use the parallelograms of junctions. When neighboring junctions are located in the same building, their parallelograms overlap. In such cases, the first-order geometric saliency already contains the neighboring information implicitly, and the pairwise term does not help much. In addition, shadow information also helps to improve the performance, as the Massachusetts dataset contains many shadows. However, the improvements on the Potsdam dataset are not remarkable. In this dataset, buildings are very large and there are only a few buildings in an image due to the very high resolution. Meanwhile, there are few shadows because of the low illumination. In such cases, the new information can hardly improve accuracy. Nevertheless, the geometric structure is very salient at such a high resolution: junctions are generally located around buildings and thus yield reasonable results.

6 Conclusion

In this paper, we proposed a new method to calculate a building index for extracting buildings in VHSR-RS images, which is based on the defined geometric saliency of mid-level geometrical structures, i.e. junctions. Our method achieves state-of-the-art performance compared with existing building-index methods for extracting buildings from remote sensing images. Furthermore, as the resolution and details of images increase, the performance of GBI improves noticeably. The building areas detected by GBI have clearer boundaries and fewer redundant cluttered areas than those of other methods, which makes them much more convenient to use in real applications.

In our current work, we only consider the geometric cues in panchromatic remote sensing images. Combining them with other cues, such as textural and spectral information, could also help. For example, the NDVI could deal with farmlands. In the future, fusing different types of features will be a promising way to improve the accuracy of building detection.


  • [Buckley and Voorhees, 2000] Buckley, C. and Voorhees, E. M. (2000). Evaluating evaluation measure stability. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’00, pages 33–40.
  • [Desolneux et al., 2000] Desolneux, A., Moisan, L., and Morel, J.-M. (2000). Meaningful alignments. International Journal of Computer Vision, 40(1):7–23.
  • [Ghanea et al., 2016] Ghanea, M., Moallem, P., and Momeni, M. (2016). Building extraction from high-resolution satellite images in urban areas: recent methods and strategies against significant challenges. International Journal of Remote Sensing, 37(21):5234–5248.
  • [Harris and Stephens, 1988] Harris, C. and Stephens, M. (1988). A combined corner and edge detector. In Alvey Vision Conference, volume 15, pages 10–5244. Manchester, UK.
  • [Hu et al., 2015] Hu, F., Xia, G., Hu, J., and Zhang, L. (2015). Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sensing, 7(11):14680–14707.
  • [Huang et al., 2018] Huang, J., Xia, G., Hu, F., and Zhang, L. (2018). Accurate building detection in VHR remote sensing images using geometric saliency. CoRR, abs/1806.00908.
  • [Huang and Zhang, 2011] Huang, X. and Zhang, L. (2011). A multidirectional and multiscale morphological index for automatic building extraction from multispectral geoeye-1 imagery. Photogrammetric Engineering & Remote Sensing, 77(7):721–732.
  • [Huang and Zhang, 2012] Huang, X. and Zhang, L. (2012). Morphological building/shadow index for building extraction from high-resolution imagery over urban areas. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(1):161–172.
  • [ISPRS, ] ISPRS. Potsdam:
  • [Jinghui et al., 2004] Jinghui, D., Veronique, P., and Hanqing, L. (2004). Building extraction in urban areas from satellite images using gis data as prior information. In IGARSS 2004, volume 7, pages 4762–4764 vol.7.
  • [Kovács and Szirányi, 2013] Kovács, A. and Szirányi, T. (2013). Improved harris feature point set for orientation-sensitive urban-area detection in aerial images. Geoscience and Remote Sensing Letters, 10(4):796–800.
  • [Liu et al., 2013] Liu, G., Xia, G.-S., Huang, X., Yang, W., and Zhang, L. (2013). A perception-inspired building index for automatic built-up area detection in high-resolution satellite images. In Geoscience and Remote Sensing Symposium (IGARSS), 2013 IEEE International, pages 3132–3135. IEEE.
  • [Lowe, 2004] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.
  • [Martinez-Fonte et al., 2005] Martinez-Fonte, L., Gautama, S., Philips, W., and Goeman, W. (2005). Evaluating corner detectors for the extraction of man-made structures in urban areas. In Geoscience and Remote Sensing Symposium, 2005. IGARSS’05. Proceedings. 2005 IEEE International, volume 1, pages 4–pp. IEEE.
  • [Mclachlan and Peel, 2000] Mclachlan, G. J. and Peel, D. (2000). Finite mixture models. Partha Deb, 39(4):521–541.
  • [Mnih, 2013] Mnih, V. (2013). Machine Learning for Aerial Image Labeling. PhD thesis, University of Toronto.
  • [Pesaresi et al., 2008] Pesaresi, M., Gerhardinger, A., and Kayitakire, F. (2008). A robust built-up area presence index by anisotropic rotation-invariant textural measure. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 1(3):180–192.
  • [Powers, 2011] Powers, D. M. (2011). Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation.
  • [Saito et al., 2016] Saito, S., Yamashita, T., and Aoki, Y. (2016). Multiple object extraction from aerial imagery with convolutional neural networks. Electronic Imaging, 2016(10):1–9.
  • [Shao et al., 2014] Shao, Z., Tian, Y., and Shen, X. (2014). BASI: A new index to extract built-up areas from high-resolution remote sensing images by visual attention model. Remote Sensing Letters, 5(4):305–314.
  • [Shrestha and Vanneschi, 2018] Shrestha, S. and Vanneschi, L. (2018). Improved fully convolutional network with conditional random fields for building extraction. Remote Sensing, 10(7).
  • [Sirmacek and Unsalan, 2009] Sirmacek, B. and Unsalan, C. (2009). Urban area detection using gabor features and spatial voting. In 2009 IEEE 17th Signal Processing and Communications Applications Conference, pages 812–815.
  • [Sirmacek and Unsalan, 2010] Sirmacek, B. and Unsalan, C. (2010). Urban area detection using local feature points and spatial voting. IEEE Geoscience and Remote Sensing Letters, 7(1):146–150.
  • [Sirmacek and Unsalan, 2011] Sirmacek, B. and Unsalan, C. (2011). A probabilistic framework to detect buildings in aerial and satellite images. IEEE Transactions on Geoscience and Remote Sensing, 49(1):211–221.
  • [Smith and Brady, 1997] Smith, S. M. and Brady, J. M. (1997). SUSAN - A new approach to low level image processing. International journal of computer vision, 23(1):45–78.
  • [SpaceNet, ] SpaceNet. Spacenet on amazon web services (aws).
  • [Xia et al., 2014] Xia, G., Delon, J., and Gousseau, Y. (2014). Accurate junction detection and characterization in natural images. International Journal of Computer Vision, 106(1):31–56.
  • [Xu et al., 2015] Xu, B., Xue, N., Xia, G., and Zhang, L. (2015). Finding edges of buildings via a junction process in high-resolution remotely sensed images. In 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 477–480.
  • [Xue et al., 2017] Xue, N., Xia, G. S., Bai, X., Zhang, L., and Shen, W. (2017). Anisotropic-scale junction detection and matching for indoor images. IEEE Transactions on Image Processing, PP(99):1–1.
  • [Zha et al., 2003] Zha, Y., Gao, J., and Ni, S. (2003). Use of normalized difference built-up index in automatically mapping urban areas from TM imagery. International Journal of Remote Sensing, 24(3):583–594.
  • [Zhu et al., 2017] Zhu, X. X., Tuia, D., Mou, L., Xia, G., Zhang, L., Xu, F., and Fraundorfer, F. (2017). Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine, 5(4):8–36.
  • [Zuo et al., 2016] Zuo, T., Feng, J., and Chen, X. (2016). HF-FCN: Hierarchically fused fully convolutional network for robust building extraction. In Asian Conference on Computer Vision, pages 291–302. Springer.