Deep Built-Structure Counting in Satellite Imagery Using Attention Based Re-Weighting

by Anza Shakeel, et al.

In this paper, we address the challenging problem of counting built structures in satellite imagery. Building density is a more accurate estimate of population density, urban-area expansion, and its environmental impact than built-up area segmentation. However, variation in building shapes, overlapping boundaries, and varying densities make this a complex task. To tackle it, we propose a deep learning based regression technique for counting built structures in satellite imagery. Our proposed framework intelligently combines features from different regions of a satellite image using attention based re-weighting techniques. Multiple parallel convolutional networks are designed to capture information at different granularities. These features are combined in the FusionNet, which is trained to weigh features of different granularities differently, allowing us to predict a precise building count. To train and evaluate the proposed method, we put forward a new large-scale and challenging built-structure-count dataset. Our dataset is constructed by collecting satellite imagery from diverse geographical areas (plains, urban centers, deserts, etc.) across the globe (Asia, Europe, North America, and Africa) and captures a wide range of built-structure densities. Detailed experimental results and analysis validate the proposed technique. FusionNet achieves a Mean Absolute Error of 3.65 and an R-squared measure of 0.88; on an unseen region, its prediction is off by only 19 buildings out of the 656 buildings in that area.








1 Introduction

Accurate, detailed, and up-to-date analysis of urban and non-urban areas plays a vital role in building an economic and social understanding of a region, helping in policy making and in designing interventions. This analysis depends on reliable and up-to-date surveys, which are lacking in the economically challenged areas of the world (Jean et al., 2016). Among the important, but laborious to gather, statistics are population densities and built-up area, especially in either densely constructed or scarcely populated areas. An accurate and up-to-date mapping of built-up areas is necessary for effective disaster (e.g., flood or earthquake) relief, urban food security, and estimation of the effects of urbanization on farmlands, forest volume, and population. While there has recently been a surge in using machine learning and satellite imagery to discover economic and social patterns such as poverty (Jean et al., 2016), slavery (Boyd et al., 2018), population spread, and large-scale urban patterns (Albert et al., 2017), there have also been some successes in built-up-area estimation and building detection (Zhang et al., 2017). Unlike the aforementioned works, which rely on collective features of the image to regress a value, building detection requires detailed visual analysis, more accurately labeled data, and reasonable-resolution imagery. Over the years, accurate results for the prediction of land-use and land-cover maps, such as (Zhou et al., 2018; Längkvist et al., 2015; Albert et al., 2017), have been presented. However, these are either image classification based approaches or techniques restricted to segmenting out areas without producing a realistic count of structures. Furthermore, these approaches are not able to capture changes inside urban regions.

Counting allows fine-grained urban population analysis and a detailed view of change occurring within urban and rural centers, without explicitly tackling the complex task of individual building segmentation. It is a better surrogate for population analysis, more helpful in disaster management (damage and destruction estimation) and urban food security analysis, and allows complex economic analysis of different parts of a city (indirectly allowing us to understand how much land is used by each building).

Several datasets have been introduced by different researchers for various remote sensing applications. Available satellite imagery datasets include (Yang and Newsam, 2013), (Zou et al., 2015), (Xia et al., 2017), (Rottensteiner et al., 2012), (Lam et al., 2018), and (Zhou et al., 2018), comprising land-use class labels, bounding boxes for overhead object localization, and segmentation masks for categories like roads, vegetation, and buildings. Note that these land-use and land-cover datasets cover regions where buildings are separated from each other, or use hyper- or multi-spectral imagery. Although these datasets are challenging and useful, none of them addresses the important problem of building counting in satellite imagery, especially in congested regions.

Figure 1: FusionNet built-structure count results. (a) An area of Egypt is processed and the results are represented as a heat map; the image is divided into 27 cells, and each predicted count is assigned to a count range. (b) The ground truth count (blue) and predicted count (red) are plotted for each cell. The precise building-count results demonstrate the generalization and robustness of our proposed deep model (best viewed in color).

We venture into estimating the density of buildings in visible-spectrum satellite imagery and present counting results on a diverse set of images taken from sparsely to densely populated areas across the globe. We propose a deep regression based network and two new attention based re-weighting techniques to obtain building counts. For a thorough evaluation of our proposed approaches, we have collected a new large dataset of satellite imagery capturing built structures of different densities (low, medium, and high), including scenes without any built structure. Furthermore, we have provided detailed building-count annotations for each satellite image. To our knowledge, this is the first time the challenging task of building counting has been handled at this scale. Our work exploits recent developments in Deep Learning (LeCun et al., 2015) and proposes a Convolutional Neural Network (Krizhevsky et al., 2012a) based solution for estimating the number of buildings in a region. In summary, our work makes the following technical contributions.

  1. We propose three new convolutional neural network based approaches for building counting. First, we propose deep regression based counting. Second, we employ an attention network and introduce two new attention based re-weighting techniques to count the number of buildings.

  2. We propose a large, diverse satellite-imagery-based dataset with hand-counted numbers of buildings.

  3. Extensive experiments evaluating the different approaches are performed. Experimental results demonstrate that our approach achieves state-of-the-art results compared to competitive baselines.

  4. Since we automatically estimate the built regions through attention networks, we require only image-level count information, unlike previous methods (Li et al., 2016; Sam et al., 2017; Liu et al., 2018b), which require sub-image-level information.

Figure 2: Dataset samples. (a) shows the geographic locations of the areas of interest from which satellite imagery is taken. (b) presents examples of urban, desert, and hilly regions from our dataset.

In what follows, we first provide a detailed review of existing techniques. Details of our dataset collection are shared in Sec. 3, and in Sec. 4 we present our proposed methodologies along with implementation details. Sec. 5 consists of results and their analysis. Finally, Sec. 6 concludes the paper.

2 Related Work

Identifying urban markers in satellite imagery has been explored extensively. Most of these works differ in the quality of imagery, the sensor, the resolution, and the granularity at which results are reported. Head et al. (2017) used high-resolution satellite imagery and night-time satellite imagery as input to train CNN based deep neural networks for predicting economic markers representing human development. Gueguen and Hamid (2015) proposed tree-of-shapes features to perform damage detection using both pre- and post-event satellite imagery. LaLonde et al. (2018) performed object detection in wide-area motion imagery. Cheng et al. (2016, 2019) proposed rotation-invariant CNNs for object detection in VHR remote sensing images. Note that since we automatically estimate the density of built regions through attention networks, we require only image-level count information instead of expensive object-level annotations. Similarly, there is extensive literature on building detection, mainly relying on multi-spectral imaging. Built-up area detection and building detection systems vary in the information they use (visible spectrum, multi-spectral imaging, DEM (Digital Elevation Model), LiDAR), the features they extract (lines, corners, texture, etc.), the machine learning applied to these features, the resolution of the input and output, and the final objective they achieve. The Global Human Settlement Layer (Pesaresi et al., 2016) has been constructed using Landsat imagery of multiple years, giving the percentage of built-up coverage in each pixel (38.2 m spatial resolution). Deep learning based semantic segmentation (Long et al., 2015; Badrinarayanan et al., 2017) has been applied to satellite imagery for land-use and land-cover analysis (Audebert et al., 2016). Also, Yang et al. (2018) modified (Badrinarayanan et al., 2017) to design a two-stage CNN that first segments the land-cover type; the segmented land-cover polygons are then further processed for land-use classification. A boundary detector based semantic segmentation model is trained by Marmanis et al. (2018); in this model, a Digital Elevation Model (DEM) is used with the input image to train the pipeline. To our knowledge, however, there is no previous work on counting buildings from satellite imagery, let alone using only the RGB spectrum.

Initial works relied on detecting local features, such as edges, lines, and corners (Huertas and Nevatia, 1988; Sirmacek and Unsalan, 2009) or photometric properties (Müller and Zaum, 2005; Ghaffarian and Ghaffarian, 2014), and then intelligently combining this information (Izadi and Saeedi, 2012; Krishnamachari and Chellappa, 1996). Izadi and Saeedi (2012) looked for intersections of lines and shadow cues to define buildings. Müller and Zaum (2005) used low-level features to capture geometric (roundness, size), photometric, and structural properties (shadow, and the presence of neighboring houses). They assumed that roof-tops are more visible in the red channel, constraining themselves to one particular type of roof; this is not true in general, as seen in Fig. 2. Ghaffarian and Ghaffarian (2014) proposed a variation of the FastICA algorithm with k-means to detect buildings in monocular high-resolution Google Earth images using the LUV space. Although their source of imagery is similar to ours, they rely on low-level, hand-crafted features, and their objective is only built-up area segmentation, not an estimate of the number of buildings.

Another direction is to group low-level features intelligently for detecting buildings. Krishnamachari and Chellappa (1996) used Markov Random Fields to detect buildings by combining straight line segments detected from the edge map of an aerial image. In designing their objective function, they used the insight that only line segments near each other need to be combined, and that such combinations should encourage rectangular shapes. Ok (2013) performed multiple graph-cuts using shadow cues to detect buildings in Very High-Resolution multi-spectral images. Most of the previous work in building detection relied on shadow detection (Ok, 2013; Müller and Zaum, 2005; Huertas and Nevatia, 1988; Ngo et al., 2017; Irvin and McKeown, 1989) or shadow cues (Chen et al., 2014). However, shadows depend on the position of the illumination source at the time of image capture. Several works use more than one optical sensor (Ngo et al., 2017; Ok, 2013) and rely on multi-spectral and/or high-resolution satellite imagery. Such systems mostly end up segmenting the built-up area and fail where buildings are built very close to each other.

Closest to our work is that of Xia and Wang (2018), where the authors segment out building instances, but they use high-resolution imagery and a mobile LiDAR dataset that is difficult to obtain. We solve the problem of counting the number of buildings from satellite or aerial imagery in RGB space. This is a difficult problem, especially for densely populated areas in general (Zhang et al., 2017), and more specifically where buildings are connected. Architectural and cultural designs affect how buildings appear from above, making it difficult to separately identify the boundaries of each building.

Counting objects in images or videos (Idrees et al., 2013) is an interesting and important problem. However, most recent works have targeted crowd counting, perhaps because dataset preparation for it is easier or because the problem can be relegated to counting heads, whereas a building has no such distinctive sub-part. Counting objects can be a complex or easy problem depending on the sample. If objects are separable, a simple method is to detect objects and count them; for instance, see (Hsieh et al., 2017; Liu et al., 2018b; Sam et al., 2017). Furthermore, the recent success of deep learning based object detectors (Ren et al., 2017; Redmon et al., 2016; Hu and Ramanan, 2017) allows object-detection based counting methods to be more accurate. Moreover, many of these works exploit the structure of the object; for example, in crowd counting (Idrees et al., 2013), Hu and Ramanan (2017) and Attia and Dayan (2018) use head shapes, which are consistent across humans, to detect heads and use them for counting. They also use perspective information, i.e., the density of humans per pixel in patches far away from the camera is higher than in patches near the camera. Many works also use the fact that humans stand upright and are thus roughly aligned to an axis. The dataset collected by Marsden et al. (2018) likewise has varying densities and perspective properties. The same is true for the car counting problem. In sharp contrast, perspective information is not useful for counting buildings in satellite images, especially in the case of irregular construction, where houses of all sizes are built next to each other.

DecideNet (Liu et al., 2018a) comes closest to our work in trying to find a middle path between counting by detection and counting by regression. However, that is where the similarity ends. Their algorithm relies on an object detection pipeline based on detecting heads, and their regression pipeline requires that dots be placed on the heads of the humans. Neither condition applies to our problem: there is no visible "head" on a building, and our pipeline does not require an individual dot for each house. Our method relies on exploiting built-area segmentation (not to be confused with instance-level segmentation) for the attention. It requires only the total count for the image, not the sub-image count information required by previous counting methods.

Datasets play a vital role in the research, development, and evaluation of new technologies. Lu et al. (2018) proposed a remote sensing captioning dataset where each image is accompanied by five sentences. Xia et al. (2017) proposed a new large-scale scene classification dataset that includes 30 scene classes such as beach, airport, desert, and farmland. Other scene classification datasets include the UC-Merced (Yang and Newsam, 2010), NWPU-RESISC45 (Cheng et al., 2017), WHU-RS (Sheng et al., 2012), and RSSCN7 (Zou et al., 2015) datasets. Zhou et al. (2018) presented a 38-class dataset for remote sensing image retrieval applications. Finally, Tian et al. (2017) put forward a new dataset for cross-view image geo-localization. In contrast, we collect a new challenging dataset that captures built areas of various densities from the satellite view.

Figure 3: Distribution of the dataset according to the area covered by detected regions. (Left) A pie chart showing how the data is divided into count windows; the majority of images correspond to the count window of 1 to 20. (Right) The built-up region plotted against the number of structures. The covered area is divided into bins, and the number of images in each bin per count is computed and represented by the heat map shown on the right.

3 Data Collection

The proliferation of deep learning libraries has enabled many to train for classification and regression tasks on the basis of hyper- or multi-spectral images without explicitly hand-designing, weighing, and fusing functions of different channels. However, the use of deep-learning based methods in remote sensing has been hindered by the absence of large-scale datasets (Demir et al., 2018). To the best of our knowledge, there is no publicly available dataset for counting buildings using satellite imagery that covers different geographical areas and a variety of built-up densities. Therefore, we have collected a new geographically diverse dataset by extracting, sorting, and marking satellite images. Although we mainly focus on counting the number of buildings, our dataset can be used for other remote sensing applications as well, such as a more accurate surrogate for population density estimation or neighborhood type estimation. With the development of the latest high-quality hyper-spectral optical sensors, good-quality high-resolution satellite images are publicly available for several developed countries. However, the majority of publicly available satellite images are still of low quality, as shown in Fig. 2, in comparison to the ground image datasets available today. In this paper, we focus on RGB images at a resolution of 0.3 m per pixel, which might be considered VHR with respect to satellite imagery but is of low quality in terms of edge sharpness and noise, especially for the task at hand, counting the number of buildings.

Note that the same building looks quite different depending on the position of the satellite and the time of day the image was taken. The height of the building, shadows, degree of separation, and types of boundaries between buildings make these images challenging. To make our dataset realistic, we have collected satellite images at different times and from geographically different locations depicting built areas of different diversity. Below, we provide details about our dataset collection and annotation process.

3.1 Sorting and collecting data

The satellite images are collected from regions with geographical and architectural differences that cover natural, urban, and desert landscapes (Fig. 2). Collecting images from different locations induces scene-type variability and makes the dataset challenging to evaluate on. We selected different regions from these geographically different locations and downloaded highly dense, moderately dense, and low-density areas using the Google Earth API. All images are captured at zoom level 19, which corresponds to 0.3 m per pixel. A building in a densely populated urban residential area covers a considerably larger footprint than in hilly regions and other rural areas. The tile size is selected on Google Maps to capture all types of small, medium, and large built structures. Table 1 shows the number of images downloaded from the different landscapes along with their areas in square kilometers.

Landscape Number of Images Area Covered (km²)
Urban Areas 2211 22.55
Hilly Areas 251 2.56
Desert 220 2.24
Total 2682 27.35
Table 1: Details of collected dataset.

Manually downloading geo-located images from Google Maps is a daunting task; therefore, a Matlab® based tool was designed that calls the Google Earth API to automatically download geo-located images. Specifically, at a given size and scale, the image array and its corresponding latitude and longitude vectors are saved. Note that this pixel-level geo-location is very useful for visualization and post-processing purposes.

The table in Fig. 3 shows how challenging the collected dataset is. The built-up region is computed using the satellite segmentation network explained in later sections. The percentage of area covered by the detected built structures is compared with the labeled count of buildings, across varying structure sizes. As the satellite images are collected from various locations, the dataset covers a variety of architectural designs of buildings with varying sizes that are difficult to learn from. As shown in the table in Fig. 3, there exist images containing few buildings that nevertheless cover a large percentage of the area.

Figure 4: (a) Network architectures. Top: pixel-wise classifier (SS-Net) to detect built structures. Bottom: counting-by-regression model based on DenseNet features; pooled features are fed into a three-layer neural network. (b) Qualitative results of SS-Net on two test sets. Left: building detection probabilities overlaid on a region of the city of Lahore (Pakistan). The shades of blue represent the probable existence of built structures: dark blue means high probability and light blue means low probability. Selected cells from the Lahore tile, numbered 1 to 3, are zoomed in to show the accuracy of our algorithm. Right: results on four samples from the Village Finders test set.

3.2 Data tagging

A thorough annotation of the collected dataset was performed. Specifically, we designed a Matlab based GUI to tag the ground truth building count. To ensure good annotation quality, each image was annotated by at least two annotators. In Fig. 3, we provide detailed statistics of our dataset. The pie chart on the left shows the percentage of data belonging to each count window; as can be seen, our dataset contains images with a wide variety of house counts, from no built structure to a large number of built structures. The table on the right of Fig. 3 shows the number of images with a specific percentage of area covered by structures relative to the number of buildings in them. Note that the built-up ratio is obtained using our Satellite Segmentation Net (Sec. 4.2.1).

Figure 5: Network architectures for the two proposed counting-by-attention methods. (a) The backbone (light blue) is shared by the green and yellow streams. Channel-wise multiplication is performed between the probability map of SS-Net and the feature volume of DenseNet-121; this weighted feature map is input to both streams, which are trained separately. The green block represents CCPP and the yellow one represents GWAP. (b) Network architecture of FusionNet. The information captured by each stream is fused to perform regression. The input of the top stream (DRC) is the feature volume from DenseNet, while the CCPP and GWAP streams take the weighted feature volume as input. The FC layers from all three streams are concatenated to form a fused FC layer, and the error is back-propagated collectively.

4 Methodology

The primary goal of our paper is to achieve a precise count of the number of buildings in each satellite image. This is a challenging problem, as satellite imagery is usually of lower resolution and quality than generally available ground imagery. Most importantly, there is often no visible space between neighboring buildings, making it difficult to accurately delineate each building. Therefore, we propose to map deep visual features to real numbers representing the count of built structures in the image. As input to our regression model, we initially take deep features from DenseNet (Huang et al., 2017a) and map them to the building counting problem through a fully connected neural network. Although we achieve decent counting results this way (see Table 3), DenseNet gives equal importance to all image regions, resulting in a loss of accuracy. Our key intuition is that for counting purposes, features belonging to built-up regions are more important than features originating from the rest of the image, such as farm fields or streets. Therefore, we propose two deep regression approaches using attention based re-weighting (ABW), in which we decrease the influence of deep features from non-built areas, enabling the algorithm to predict the count more accurately. Our experimental results validate this intuition. Below, we provide details for each of our proposed approaches.

4.1 Deep Regression Counting (DRC)

We pose built-structure counting as a deep regression problem, that is, training deep learning based models with regression as the output layer. Transfer learning (Bengio, 2012) is performed by extracting deep features from the global average pooling layer of DenseNet (Huang et al., 2017a), pre-trained on ImageNet. Note that ImageNet (Krizhevsky et al., 2012a) is a large dataset with 1K class labels. Many recent works, e.g., (Huang et al., 2017b), indicate that features learned by CNN models such as VGG (Simonyan and Zisserman, 2014) and AlexNet (Krizhevsky et al., 2012b), trained on such large datasets, can be used to perform transfer learning for tasks with limited training data. DenseNet was chosen for its reported high accuracy and computational efficiency.

The extracted features are fed into a fully connected (FC) neural network. We use a three-layer network having 512, 32, and 1 units, respectively (Fig. 4), with 60% dropout between the FC layers. ReLU is used as the activation after the first and second fully connected layers; no activation function is applied at the output layer. We do not use the ImageNet mean values to normalize our remote sensing data: although initializing weights from such datasets is more helpful than random values, the ImageNet mean represents everyday ground images, and in our experiments normalizing satellite imagery with these values disrupts the input and hurts the accuracy of the model.
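The regression head described above can be sketched in a few lines. The 512/32/1 layer sizes follow the text; the 1024-dimensional input (DenseNet-121's pooled feature size), the random weight values, and the omission of dropout (inference mode) are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class CountRegressionHead:
    """Three-layer FC head (512 -> 32 -> 1) over pooled backbone features.

    Weights here are random placeholders; training and the 60% dropout
    (active only during training) are omitted from this sketch."""
    def __init__(self, in_dim=1024):  # assumed DenseNet-121 pooled size
        self.W1 = rng.normal(0, 0.01, (in_dim, 512)); self.b1 = np.zeros(512)
        self.W2 = rng.normal(0, 0.01, (512, 32));     self.b2 = np.zeros(32)
        self.W3 = rng.normal(0, 0.01, (32, 1));       self.b3 = np.zeros(1)

    def predict(self, features):
        h = relu(features @ self.W1 + self.b1)  # FC1 + ReLU
        h = relu(h @ self.W2 + self.b2)         # FC2 + ReLU
        return h @ self.W3 + self.b3            # linear output = count

head = CountRegressionHead()
batch = rng.normal(size=(4, 1024))  # pooled features of 4 images
counts = head.predict(batch)
print(counts.shape)  # (4, 1)
```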

4.2 Deep Regression Counting by Attention

Deep regression counting suffers from giving equal weight to all features, whether or not they belong to a built-up region. Attention based architectures help a neural network concentrate on the task at hand without being affected by noise. To exploit local information for a precise building count, we propose to use built-up region segmentation probabilities as the attention.

4.2.1 Satellite Segmentation Net (SS-Net):

We train a compact VGG-based (Simonyan and Zisserman, 2014) fully convolutional neural network (Long et al., 2015) to perform pixel-wise built-up region classification. We call this network SS-Net. The output convolutional layer of this network predicts whether a 64 × 64 input patch belongs to a built structure or not. SS-Net is trained on the low-resolution Village Finders dataset (Murtaza et al., 2009). We randomly crop patches from each image and use the segmentation masks associated with them to generate labels. The weights of the network are initialized with pre-trained VGG weights (trained on ImageNet), and the data is normalized by computing its own mean values instead of using the ImageNet ones. During training, each patch in the training set is augmented four times by flipping, inverting, and rotating the patch by 45 degrees. Inspired by (Dupret and Koda, 2001) and (Harwood et al., 2017), to counter the problem of unbalanced data, a bootstrapping based hard negative mining technique is applied: after every 15 epochs, new samples are evaluated and all those with a false positive response are added as negative examples to the training set. Fig. 4 shows the network architecture of SS-Net.
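The patch augmentation described above can be sketched as follows. The paper rotates patches by 45 degrees; to keep this sketch dependency-free, a 90-degree `np.rot90` stands in for that rotation (a real pipeline would use an image-rotation routine with interpolation):

```python
import numpy as np

def augment(patch):
    """Return the 4 versions of a training patch used here:
    original, horizontal flip, vertical flip ("inversion"), and a
    rotation (90 degrees as a stand-in for the paper's 45)."""
    return [
        patch,             # original
        np.fliplr(patch),  # horizontal flip
        np.flipud(patch),  # vertical flip
        np.rot90(patch),   # rotation stand-in
    ]

patch = np.arange(16).reshape(4, 4)
versions = augment(patch)
print(len(versions))  # 4
```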

Evaluation Metric SS-Net Result
Pixel-wise accuracy 0.947
F1 score 0.8
Table 2: Segmentation results of SS-Net on the Village Finders test set. The results demonstrate the high accuracy of the proposed technique.

During inference, when we present a patch to SS-Net, it returns the probability of that patch containing a building or part of a building. Since SS-Net is fully convolutional, it can process images of any size greater than 64 × 64 pixels. The output size of the feature map at any layer is given by o = (i − f + 2p)/s + 1, where o is the output size, i the input size, p the padding, f the filter size, and s the stride. In our experiments, we used f = 2 and s = 2 for the max-pooling layers and s = 1 for the convolution layers; the padding p depends on the convolution layer. For instance, after the three max-pooling layers of stride 2 and kernel size 2 and the subsequent convolution layers, the input image is downsampled by a factor of 8 into a probability map with 2 channels. A probability map representing the input image is thus generated for each image, and bi-linear interpolation is performed to re-size the map to the input image size. Qualitative results in Fig. 4 demonstrate that our SS-Net can segment the built and non-built areas with very high accuracy. Table 2 shows the accuracy and F1-score of SS-Net on the Village Finders test set.
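The layer-size bookkeeping above can be checked with a few lines; the 512-pixel input width is an arbitrary example, not a size from the paper:

```python
def conv_out_size(i, f, s, p=0):
    """Output spatial size of a conv/pooling layer:
    o = floor((i - f + 2p) / s) + 1."""
    return (i - f + 2 * p) // s + 1

# Trace a 512-pixel-wide input through the three stride-2,
# kernel-size-2 max-pooling layers described above.
size = 512
for _ in range(3):
    size = conv_out_size(size, f=2, s=2)
print(size)  # 64, i.e. downsampled by a factor of 8
```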

The building probability computed at each pixel is used to improve the regression algorithm for counting buildings. In the sections below, we discuss in detail our two proposed approaches that use the output probability maps of SS-Net for improved building counts.

Model DRC GWAP CCPP FusionNet
Total Absolute Error (Low-Count: 0 to 30) 1158 1121 1136 1001
Total Absolute Error (Medium-Count: 31 to 60) 814 820 796 743
Total Absolute Error (High-Count: >60) 229 180 161 176
Total Absolute Error (TAE: Total) 2201 2121 2093 1920
Mean Absolute Error (TAE / Total Number of Images) 4.14 3.99 3.94 3.61
R-squared (coefficient of determination) 0.86 0.872 0.875 0.88
Table 3: Total Absolute Error of structures in the Low-Count, Medium-Count, and High-Count ranges; the numbers in brackets give the building-count range of a satellite image patch. The Low-Count set contains 3880, the Medium-Count set 3937, and the High-Count set 1128 structures in the test set. The Mean Absolute Error (MAE) and R-squared score of each model are also listed.
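The error measures in Table 3 follow their standard definitions over per-image counts; a minimal sketch (the sample counts are illustrative, not from the paper):

```python
import numpy as np

def count_metrics(y_true, y_pred):
    """Total Absolute Error, Mean Absolute Error, and R-squared
    for predicted per-image building counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    tae = abs_err.sum()         # Total Absolute Error
    mae = abs_err.mean()        # MAE = TAE / number of images
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot  # coefficient of determination
    return tae, mae, r2

tae, mae, r2 = count_metrics([10, 40, 70], [12, 38, 69])
print(tae)  # 5.0
```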

4.2.2 Global Weighted Average Pooling (GWAP):

Similar to Sec. 4.1, the pre-trained DenseNet is used to extract the features. However, in this algorithm the attention map generated by the SS-Net is used to perform Attention Based Global Weighted Average Pooling (GWAP) over the features. To achieve GWAP, we first multiply each feature map extracted from the DenseNet with the probability map generated by SS-Net, and then compute the average of each channel independently. This reduces dimensionality while aggregating the spatial information. Each value in the pooled vector corresponds to the density of constructed regions in a satellite image. These activation maps and SS-Net output probability maps for a typical image are shown in Fig. 5 under the 'Counting by Attention' pipeline. Similar to GAP (Lin et al., 2013), GWAP also directly corresponds to the features learned. However, in GWAP, features from different locations of the image are given different weights. As shown in Fig. 5, DenseNet produces feature maps (output of the last convolutional layer) which are agnostic to the built structure, while SS-Net provides a high probability score on the built area. Combining these two maps filters out the activation values of DenseNet from non-built areas. This meaningful representation is then fed into the 3-layer fully connected neural network, with 512, 32 and 1 units, referred to as the regression pipeline. The yellow block along with the blue block (Fig. 5) displays the network architecture of GWAP.
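A minimal numpy sketch of the GWAP operation (the shapes here are illustrative toys, not the paper's exact feature dimensions):

```python
import numpy as np

def gwap(features, attention):
    """Global Weighted Average Pooling.
    features:  (H, W, C) feature volume from the backbone (e.g. DenseNet).
    attention: (H, W) building-probability map from the segmentation net.
    Returns a length-C vector: the per-channel average, re-weighted so that
    activations over non-built areas are suppressed."""
    weighted = features * attention[..., None]  # broadcast map over channels
    return weighted.mean(axis=(0, 1))

feats = np.ones((2, 2, 3))                  # toy feature volume
attn = np.array([[1.0, 0.0], [0.0, 1.0]])   # half the pixels are "built"
print(gwap(feats, attn))                    # [0.5 0.5 0.5]
```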

Figure 6: (a) Mean absolute error of all categories. (b) Ground truth and predicted count of the testing samples inferred using FusionNet.

4.2.3 Cross Channel Parametric Pooling (CCPP):

GWAP allows the algorithm to consider only the built area; however, it suffers from a lack of accuracy. One reason is the effect of averaging, i.e., features representing buildings at different locations are summed up, whereas recognizing them separately is required for an accurate building count, especially in densely built-up areas. Instead of predicting one single value for the whole image, if one can predict the count at different locations of the image, then the final number should be a summation of these counts, thus reducing the effect of averaging. However, we only have one count per image and not the count at each location. To counter this shortcoming, we design a network that accounts for cross-channel correlation and the spatial layout of the feature map. Specifically, we employ a convolution of kernel size 1x1 that outputs a single activation map. This activation map is presented to the fully connected regression pipeline, which predicts the final count. Note that the architecture of the regression pipeline is the same in all methods; however, minor changes to the activation function and optimizer were experimented with and are discussed in the implementation details.

The output of this layer is a single channel, visualized in the green block (Fig. 5). This convolutional layer performs learnable interactions within the weighted feature volume at every location, learning to combine and compare built structures of all sizes captured by the weighted feature map. Its response differs across parts of the image, corresponding to the density of the buildings at each location.
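CCPP can be sketched as a 1x1 convolution collapsing the weighted feature volume to a single activation map (shapes are illustrative; in the real network the per-channel weights are learned, not fixed as here):

```python
import numpy as np

def ccpp(features, weights, bias=0.0):
    """Cross Channel Parametric Pooling: a 1x1 convolution that mixes
    channels at every spatial location, producing one activation map.
    features: (H, W, C); weights: (C,). Returns (H, W)."""
    return features @ weights + bias

feats = np.ones((4, 4, 3))        # toy weighted feature volume
w = np.array([0.2, 0.3, 0.5])     # learned in practice; fixed here
out = ccpp(feats, w)
print(out.shape)                  # (4, 4) -- single-channel map
```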

4.2.4 Counting by Attention with FusionNet:

All the models discussed above suffer from one shortcoming or another. GWAP is unable to give credence to local information. CCPP, which handles local information, struggles when images with low counts are presented, largely due to the lack of a larger perspective (Table 3). The attention based pipelines (Sec. 4.2.2 & 4.2.3) do better than the generic deep regression pipeline in our case by detecting the areas with buildings. However, as shown in Fig. 4, our building segmentation system also discards other useful information, such as the location of streets or roads, or other markers highlighting the natural boundaries of buildings. FusionNet has been designed to counter these shortcomings and enhance the benefits by fusing the features extracted from each method. These fused features are processed by the fully connected regression network, which outputs the final count. After concatenating the output of the FC layers, the number of units in the fused layer is 1536. Finally, the fused layer is fed into the 3-layer fully connected neural network, with 512, 32 and 1 units, referred to before as the regression pipeline. The network architecture of FusionNet is displayed in Fig. 5.

When fused, the above approaches complement each other, improving the learning of the regression pipeline. During training, the penalty is back-propagated collectively whenever any of the streams contributes to an erroneous count.
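The fusion step can be sketched as a concatenation of the three 512-d feature vectors followed by the 512-32-1 regression head (weights are random here, only to show the shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
# One 512-d vector per stream: DRC, GWAP, CCPP.
f_drc, f_gwap, f_ccpp = (rng.standard_normal(512) for _ in range(3))

fused = np.concatenate([f_drc, f_gwap, f_ccpp])   # 3 x 512 -> 1536

def dense(x, w, b, act=lambda z: np.maximum(z, 0)):
    # Fully connected layer with an activation (ReLU by default).
    return act(w @ x + b)

h1 = dense(fused, rng.standard_normal((512, 1536)) * 0.01, np.zeros(512))
h2 = dense(h1, rng.standard_normal((32, 512)) * 0.01, np.zeros(32))
count = dense(h2, rng.standard_normal((1, 32)), np.zeros(1), act=lambda z: z)
print(fused.shape, count.shape)   # (1536,) (1,)
```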

4.3 Implementation details

All regression-based models are trained on images whose size in pixels corresponds to the area covered on the ground at a resolution of 0.3 m per pixel. While using DenseNet features, we did not apply any normalization technique: in our experiments, normalizing remote sensing data with ImageNet mean values disrupts the images. To prevent the model from over-fitting, Dropout layers (Srivastava et al., 2014) are applied on the fully connected layers. Apart from FusionNet, all models have the same regression pipeline comprising three FC layers of 512, 32 and 1 units. In FusionNet, the last fully connected layer of all three blocks (DRC, CCPP, GWAP) contains 512 units. Concatenating them creates an input layer of 1536 dimensions, which feeds a fully connected layer of 512 units, followed by fully connected layers of 32 units and 1 unit. While training the Deep Regression Counting, we use ReLU as the activation function. However, for all attention based models, leaky ReLU is used; this counters the high activation values resulting from the DenseNet features and their product with the probability maps. For training the built-up area segmentation network, we normalize by subtracting the mean of the whole training set from each image. SS-Net is trained with the SGD optimizer and a batch size of 16 on patches of size 64, so only the first three blocks of the VGG-16 are used. The training data for counting was augmented by flipping, inverting, and rotating images by 90 and 270 degrees, increasing its size five-fold. All experiments were performed using Keras with TensorFlow as the backend. A channel-wise cross-entropy loss function is used for training SS-Net. For training DRC, mean squared error is used. Furthermore, all attention based networks (GWAP, CCPP, and FusionNet) are trained with root mean squared error.
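The five-fold augmentation described above can be sketched as follows (assuming the non-rotation variants are the original plus horizontal and vertical flips):

```python
import numpy as np

def augment(img):
    """Return five variants of an image: the original, its horizontal
    flip, its vertical flip, and rotations by 90 and 270 degrees."""
    return [img,
            np.fliplr(img),     # horizontal flip
            np.flipud(img),     # vertical flip
            np.rot90(img, 1),   # 90 degrees
            np.rot90(img, 3)]   # 270 degrees

variants = augment(np.arange(16).reshape(4, 4))
print(len(variants))  # 5
```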

Figure 7: For each method two sample results are shown, one where our prediction is accurate and one where it’s not (Sec. 5.1).

5 Results and Analysis

A thorough comparative analysis of proposed approaches is performed by evaluating their results on the test set of 531 satellite images, extracted from our collected dataset. In Table 3, we provide a quantitative comparison of all four proposed approaches, by calculating the mean absolute error (MAE), total absolute error (TAE) and R-squared measure. The MAE decreases and R-squared values improve, as we move from deep regression counting to FusionNet.

In order to perform an in-depth analysis of our results, the test set is divided into three ranges on the basis of the ground-truth building count: (a) Low-Count (0 to 30): sparsely built, (b) Medium-Count (31 to 60): reasonably populated, and (c) High-Count (greater than 60): densely built. Out of 531 images, 416, 100 and 15 fall in the Low-Count, Medium-Count and High-Count ranges, respectively. TAE for all four approaches is computed separately on each set. The 531 testing images cover a total of 8945 structures. MAE is computed by dividing the total absolute error by the total number of images. Compared to deep regression counting, both attention re-weighted counting approaches have better results. Finally, the fusion of all three approaches further decreases the mean absolute error (see Table 3). The proposed approach is quite efficient; DRC, GWAP and FusionNet took 0.07 (0.02), 0.8 (0.026), and 0.9 (0.029) sec/image (sec/km) respectively.
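The Table 3 totals can be checked directly: for FusionNet, the per-range TAEs sum to the reported total, and dividing by the 531 test images recovers the reported MAE up to rounding:

```python
# FusionNet per-range Total Absolute Errors from Table 3.
tae = {"low": 1001, "medium": 743, "high": 176}

total = sum(tae.values())
print(total)          # 1920

mae = total / 531     # 531 test images; ~3.6, matching Table 3 up to rounding
```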

5.1 Comparison and Analysis of results

As indicated by the MAE results (Table 3 and Fig. 6), introducing the attention mechanism considerably decreases the MAE (3.6%) and increases the R-squared value. Fig. 7 shows the images corresponding to the minimum and maximum MAE for each of our proposed models. On fine-grained analysis, we observe that the GWAP network is accurate for Low-Count images, whereas the CCPP network predicts with lower TAE on Medium-Count and High-Count images (Table 3). For low-density images, where both CCPP and GWAP are much better than DRC, GWAP's TAE is much lower than CCPP's. With attention, the MAE between ground truth and predicted count generally decreases, but CCPP is sometimes distracted and over-counts the structures in some images; for example, in Fig. 7, vehicles parked on the road mislead the model. However, for both medium- and high-density images, the TAE of CCPP is much lower than that of GWAP, indicating that more detailed local information is needed where the density of buildings is higher. To capitalize on the complementary nature of CCPP and GWAP, FusionNet is trained, combining deep regression counting with both attention models. Fig. 6 compares the MAE of these models. We retrained DRC on mean-subtracted data (using the ImageNet mean); the MAE of this mean-subtracted DRC rose from 4.14 to 16.98. Deep regression performed on features extracted from SS-Net increased the MAE to 5.5, since these features do not capture the inner structure within a segment.

5.2 Counting in large neighbourhood

In order to show the generalization capacity and effectiveness of our model, we test our approach on a portion of Cairo's densely populated region. The region covered in this testing tile is diverse, containing both small and large structures in densely and moderately populated areas. Ground truth is created by manually counting the buildings, and comes to 656. Our approach, FusionNet, predicted 675 buildings, which is quite close to the ground truth. To perform a detailed analysis, we divide the image into 27 equal cells. FusionNet's prediction for each cell is compared with the ground-truth count of buildings in that cell, obtained by hand-marking each cell in the image. The predicted count is overlaid on the map for visualization (Fig. 1) by assigning different colors according to the predicted count in each window. For a quantitative comparison, a graph of predicted count versus ground-truth count is shown in Fig. 1, indicating that the predicted values closely follow the ground truth. We note that cell 18 is intensely populated and contains irregular construction, which makes it difficult even for human annotators to count. High accuracy on a large image outside of the training and test sets demonstrates the generalization capacity and robustness of our proposed approach.
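Counting over a large tile reduces to summing per-cell predictions over a non-overlapping grid; a sketch (the cell size and the per-cell predictor are placeholders, not the paper's exact values):

```python
import numpy as np

def count_tile(tile, predict, cell):
    """Slide a non-overlapping cell grid over a large tile and sum
    the per-cell predicted building counts."""
    h, w = tile.shape[:2]
    total = 0.0
    for y in range(0, h, cell):
        for x in range(0, w, cell):
            total += predict(tile[y:y + cell, x:x + cell])
    return total

# Toy check: a 9x3 grid of cells with one "building" predicted per cell,
# mirroring the 27-cell analysis of the Cairo tile.
tile = np.ones((9 * 8, 3 * 8))
print(count_tile(tile, predict=lambda c: 1, cell=8))  # 27.0
```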

6 Conclusion

In this paper, we have attempted to solve the difficult problem of counting buildings in satellite imagery. The diversity in the shapes of urban structures and the variations in city planning and sensor response make the problem challenging. We have introduced a new challenging benchmark dataset capturing different geographical regions and areas with different building counts at various built densities (the dataset will be made publicly available). Instead of using deep learning as a black box, we have presented an attention based mechanism, built on insights into how Deep Convolutional Neural Networks work, so that our model can capture variations in urban structures. Our final solution, FusionNet, combines the information captured by different pipelines at different granularities, making it robust to densely built as well as sparsely built areas, and to large structures (covering a large area) as well as small ones. FusionNet is able to handle a variety of roof types, including the difficult case of flat roofs, especially when buildings are interconnected. Future directions include improving image quality through super-resolution before feature computation and investigating other pooling techniques to improve building counting.

Acknowledgment: We greatly appreciate discussion and useful comments provided by Hamza Rawal, Maria Zubair, Komal Khan and Umar Saif.


  • Albert et al. (2017) Albert, A., Kaur, J. and Gonzalez, M. C., 2017. Using convolutional networks and satellite imagery to identify patterns in urban environments at a large scale. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 1357–1366.
  • Attia and Dayan (2018) Attia, A. and Dayan, S., 2018. Detecting and counting tiny faces. arXiv preprint arXiv:1707.08952.
  • Audebert et al. (2016) Audebert, N., Le Saux, B. and Lefèvre, S., 2016. Semantic segmentation of earth observation data using multimodal and multi-scale deep networks. In: Asian Conference on Computer Vision, Springer, pp. 180–196.
  • Badrinarayanan et al. (2017) Badrinarayanan, V., Kendall, A. and Cipolla, R., 2017. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 2481–2495.
  • Bengio (2012) Bengio, Y., 2012. Deep learning of representations for unsupervised and transfer learning. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pp. 17–36.
  • Boyd et al. (2018) Boyd, D. S., Jackson, B., Wardlaw, J., Foody, G. M., Marsh, S. and Bales, K., 2018. Slavery from space: Demonstrating the role for satellite remote sensing to inform evidence-based action related to un sdg number 8. ISPRS Journal of Photogrammetry and Remote Sensing 142, pp. 380 – 388.
  • Chen et al. (2014) Chen, D., Shang, S. and Wu, C., 2014. Shadow-based building detection and segmentation in high-resolution remote sensing image. Journal of Multimedia 9, pp. 181–188.
  • Cheng et al. (2017) Cheng, G., Han, J. and Lu, X., 2017. Remote sensing image scene classification: benchmark and state of the art. Proceedings of the IEEE 105(10), pp. 1865–1883.
  • Cheng et al. (2019) Cheng, G., Han, J., Zhou, P. and Xu, D., 2019. Learning rotation-invariant and fisher discriminative convolutional neural networks for object detection. IEEE Transactions on Image Processing 28(1), pp. 265–278.
  • Cheng et al. (2016) Cheng, G., Zhou, P. and Han, J., 2016. Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 54(12), pp. 7405–7415.
  • Demir et al. (2018) Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D. and Raskar, R., 2018. Deepglobe 2018: A challenge to parse the earth through satellite images. ArXiv e-prints pp. 172–17209.
  • Dupret and Koda (2001) Dupret, G. and Koda, M., 2001. Bootstrap re-sampling for unbalanced data in supervised learning. European Journal of Operational Research pp. 141–156.
  • Ghaffarian and Ghaffarian (2014) Ghaffarian, S. and Ghaffarian, S., 2014. Automatic building detection based on Purposive FastICA (PFICA) algorithm using monocular high resolution Google Earth images. ISPRS Journal of Photogrammetry and Remote Sensing 97, pp. 152–159.
  • Gueguen and Hamid (2015) Gueguen, L. and Hamid, R., 2015. Large-scale damage detection using satellite imagery. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1321–1328.
  • Harwood et al. (2017) Harwood, B., Kumar, B., Carneiro, G., Reid, I. and Drummond, T., 2017. Smart mining for deep metric learning. Proceedings of the IEEE International Conference on Computer Vision pp. 2821–2829.
  • Head et al. (2017) Head, A., Manguin, M., Tran, N. and Blumenstock, J., 2017. Can human development be measured with satellite imagery? In: Proceedings of the Ninth International Conference on Information and Communication Technologies and Development, p. 8.
  • Hsieh et al. (2017) Hsieh, M.-R., Lin, Y.-L. and Hsu, W. H., 2017. Drone-based object counting by spatially regularized regional proposal network. In: The IEEE International Conference on Computer Vision, Vol. 1, pp. 4165–4173.
  • Hu and Ramanan (2017) Hu, P. and Ramanan, D., 2017. Finding tiny faces. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 951–959.
  • Huang et al. (2017a) Huang, G., Liu, Z., Van Der Maaten, L. and Weinberger, K. Q., 2017a. Densely connected convolutional networks. In: IEEE conference on computer vision and pattern recognition, pp. 4700–4708.
  • Huang et al. (2017b) Huang, Z., Pan, Z. and Lei, B., 2017b. Transfer learning with deep convolutional neural network for sar target classification with limited labeled data. Remote Sensing 9, pp. 907.
  • Huertas and Nevatia (1988) Huertas, A. and Nevatia, R., 1988. Detecting buildings in aerial images. Computer Vision, Graphics, and Image Processing 41(2), pp. 131–152.
  • Idrees et al. (2013) Idrees, H., Saleemi, I., Seibert, C. and Shah, M., 2013. Multi-source multi-scale counting in extremely dense crowd images. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2547–2554.
  • Irvin and McKeown (1989) Irvin, R. B. and McKeown, D. M., 1989. Methods for exploiting the relationship between buildings and their shadows in aerial imagery. IEEE Transactions on Systems, Man, and Cybernetics 19(6), pp. 1564–1575.
  • Izadi and Saeedi (2012) Izadi, M. and Saeedi, P., 2012. Three-dimensional polygonal building model estimation from single satellite images. IEEE Transactions on Geoscience and Remote Sensing 50(6), pp. 2254–2272.
  • Jean et al. (2016) Jean, N., Burke, M., Xie, M., Davis, W. M., Lobell, D. B. and Ermon, S., 2016. Combining satellite imagery and machine learning to predict poverty. Science 353(6301), pp. 790–794.
  • Krishnamachari and Chellappa (1996) Krishnamachari, S. and Chellappa, R., 1996. Delineating buildings by grouping lines with mrfs. IEEE Transactions on Image Processing 5(1), pp. 164–168.
  • Krizhevsky et al. (2012a) Krizhevsky, A., Sutskever, I. and Hinton, G. E., 2012a. Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105.
  • Krizhevsky et al. (2012b) Krizhevsky, A., Sutskever, I. and Hinton, G. E., 2012b. Imagenet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems, pp. 1097–1105.
  • LaLonde et al. (2018) LaLonde, R., Zhang, D. and Shah, M., 2018. Clusternet: Detecting small objects in large scenes by exploiting spatio-temporal information. In: IEEE Computer Vision and Pattern Recognition, pp. 4003–4012.
  • Lam et al. (2018) Lam, D., Kuzma, R., McGee, K., Dooley, S., Laielli, M., Klaric, M., Bulatov, Y. and McCord, B., 2018. xView: Objects in context in overhead imagery. arXiv:1802.07856.
  • Längkvist et al. (2015) Längkvist, M., Kiselev, A., Alirezaie, M. and Loutfi, A., 2015. Classification and segmentation of satellite orthoimagery using convolutional neural networks. Remote Sensing p. 329.
  • LeCun et al. (2015) LeCun, Y., Bengio, Y. and Hinton, G., 2015. Deep learning. Nature 521(7553), pp. 436.
  • Li et al. (2016) Li, W., Fu, H., Yu, L. and Cracknell, A., 2016. Deep learning based oil palm tree detection and counting for high-resolution remote sensing images. Remote Sensing 9, pp. 22.
  • Lin et al. (2013) Lin, M., Chen, Q. and Yan, S., 2013. Network in network. arXiv preprint arXiv:1312.4400.
  • Liu et al. (2018a) Liu, J., Gao, C., Meng, D. and Hauptmann, A. G., 2018a. Decidenet: Counting varying density crowds through attention guided detection and density estimation. In: The IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206.
  • Liu et al. (2018b) Liu, X., Weijer, J. and D. Bagdanov, A., 2018b. Leveraging unlabeled data for crowd counting by learning to rank. IEEE Conference on Computer Vision and Pattern Recognition.
  • Long et al. (2015) Long, J., Shelhamer, E. and Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
  • Lu et al. (2018) Lu, X., Wang, B., Zheng, X. and Li, X., 2018. Exploring models and data for remote sensing image caption generation. IEEE Transactions on Geoscience and Remote Sensing 56(4), pp. 2183–2195.
  • Marmanis et al. (2018) Marmanis, D., Schindler, K., Wegner, J. D., Galliani, S., Datcu, M. and Stilla, U., 2018. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS Journal of Photogrammetry and Remote Sensing 135, pp. 158–172.
  • Marsden et al. (2018) Marsden, M., McGuinness, K., Little, S., Keogh, C. E. and O’Connor, N. E., 2018. People, penguins and petri dishes: adapting object counting models to new visual domains and object types without forgetting. IEEE Conference on Computer Vision and Pattern Recognition pp. 8070–8079.
  • Müller and Zaum (2005) Müller, S. and Zaum, D. W., 2005. Robust building detection in aerial images. International Archives of Photogrammetry and Remote Sensing 36(B2/W24), pp. 143–148.
  • Murtaza et al. (2009) Murtaza, K., Khan, S. and Rajpoot, N. M., 2009. Villagefinder: Segmentation of nucleated villages in satellite imagery. BMVC pp. 1–11.
  • Ngo et al. (2017) Ngo, T.-T., Mazet, V., Collet, C. and De Fraipont, P., 2017. Shape-based building detection in visible band images using shadow information. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10(3), pp. 920–932.
  • Ok (2013) Ok, A. O., 2013. Automated detection of buildings from single vhr multispectral images using shadow information and graph cuts. ISPRS Journal of Photogrammetry and Remote Sensing 86, pp. 21–40.
  • Pesaresi et al. (2016) Pesaresi, M., Ehrlich, D., Ferri, S., Florczyk, A., Freire, S., Halkia, M., Julea, A., Kemper, T., Soille, P. and Syrris, V., 2016. Operating procedure for the production of the global human settlement layer from landsat data of the epochs 1975, 1990, 2000, and 2014. Publications Office of the European Union.
  • Redmon et al. (2016) Redmon, J., Divvala, S., Girshick, R. and Farhadi, A., 2016. You only look once: Unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
  • Ren et al. (2017) Ren, S., He, K., Girshick, R. and Sun, J., 2017. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Analysis and Machine Intelligence.
  • Rottensteiner et al. (2012) Rottensteiner, F., Sohn, G., Jung, J., Gerke, M., Bailard, C., Benitez, S. and Breitkopf, U., 2012. The isprs benchmark on urban object classification and 3d building reconstruction. In: ISPRS 2012 Proceedings of the XXII ISPRS Congress : Imaging a Sustainable Future, 25 August - 01 September 2012, Melbourne, Australia. Peer reviewed Annals, Volume I-7, 2012, International Society for Photogrammetry and Remote Sensing (ISPRS), pp. 293–298.
  • Sam et al. (2017) Sam, D. B., Surya, S. and Babu, R. V., 2017. Switching convolutional neural network for crowd counting. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4031–4039.
  • Sheng et al. (2012) Sheng, G., Yang, W., Xu, T. and Sun, H., 2012. High-resolution satellite scene classification using a sparse coding based multiple feature combination. International Journal of Remote Sensing 33, pp. 2395–2412.
  • Simonyan and Zisserman (2014) Simonyan, K. and Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv 1409.1556.
  • Sirmacek and Unsalan (2009) Sirmacek, B. and Unsalan, C., 2009. Urban-area and building detection using sift keypoints and graph theory. IEEE Transactions on Geoscience and Remote Sensing 47(4), pp. 1156–1167.
  • Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
  • Tian et al. (2017) Tian, Y., Chen, C. and Shah, M., 2017. Cross-view image matching for geo-localization in urban environments. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1998–2006.
  • Xia et al. (2017) Xia, G., Hu, J., Hu, F., Shi, B., Bai, X., Zhong, Y., Zhang, L. and Lu, X., 2017. Aid: A benchmark data set for performance evaluation of aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing 55(7), pp. 3965–3981.
  • Xia and Wang (2018) Xia, S. and Wang, R., 2018. Extraction of residential building instances in suburban areas from mobile lidar data. ISPRS Journal of Photogrammetry and Remote Sensing 144, pp. 453–468.
  • Yang et al. (2018) Yang, C., Rottensteiner, F. and Heipke, C., 2018. Classification of land cover and land use based on convolutional neural networks. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences IV-3, pp. 251–258.
  • Yang and Newsam (2010) Yang, Y. and Newsam, S., 2010. Bag-of-visual-words and spatial extensions for land-use classification. In: SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 270–279.
  • Yang and Newsam (2013) Yang, Y. and Newsam, S., 2013. Geographic image retrieval using local invariant features. IEEE Transactions on Geoscience and Remote Sensing 51(2), pp. 818–832.
  • Zhang et al. (2017) Zhang, A., Liu, X., Gros, A. and Tiecke, T., 2017. Building detection from satellite images on a global scale. arXiv preprint arXiv:1707.08952.
  • Zhou et al. (2018) Zhou, W., Newsam, S., Li, C. and Shao, Z., 2018. Patternnet: A benchmark dataset for performance evaluation of remote sensing image retrieval. ISPRS Journal of Photogrammetry and Remote Sensing.
  • Zou et al. (2015) Zou, Q., Ni, L., Zhang, T. and Wang, Q., 2015. Deep learning based feature selection for remote sensing scene classification. IEEE Geoscience and Remote Sensing Letters 12(11), pp. 2321–2325.