
Building Footprint Generation by Integrating Convolution Neural Network with Feature Pairwise Conditional Random Field (FPCRF)

Building footprint maps are vital to many remote sensing applications, such as 3D building modeling, urban planning, and disaster management. Due to the complexity of buildings, the accurate and reliable generation of building footprints from remote sensing imagery is still a challenging task. In this work, an end-to-end building footprint generation approach that integrates a convolution neural network (CNN) and a graph model is proposed. The CNN serves as the feature extractor, while the graph model can take spatial correlation into consideration. Moreover, we propose to implement the feature pairwise conditional random field (FPCRF) as a graph model to preserve sharp boundaries and fine-grained segmentation. Experiments are conducted on four different datasets: (1) Planetscope satellite imagery of the cities of Munich, Paris, Rome, and Zurich; (2) ISPRS benchmark data from the city of Potsdam; (3) the Dstl Kaggle dataset; and (4) Inria Aerial Image Labeling data of Austin, Chicago, Kitsap County, Western Tyrol, and Vienna. It is found that the proposed end-to-end building footprint generation framework with FPCRF as the graph model can further improve the accuracy of building footprint generation over using only a CNN, which is the current state-of-the-art approach.




I Introduction

Building footprint generation is an active field of research within the domain of remote sensing (RS). The established building footprint maps are useful for understanding urban dynamics in many important applications, and also facilitate the assessment of the extent of damage after natural disasters such as earthquakes. OpenStreetMap (OSM) can provide manually annotated building footprint information for some urban areas; however, it is not available in many parts of the world. Therefore, high-resolution RS imagery, which covers global areas and contains huge potential for meaningful ground information extraction, is a reliable source of data for building footprint generation. However, automatic building footprint generation from high-resolution RS imagery is still difficult because of variations in the appearance of buildings, complicated background interference, shooting angles, shadows, and illumination conditions. Moreover, buildings and other impervious objects in urban areas have similar spectral and spatial characteristics.

Early studies of automatic building footprint generation from high-resolution RS imagery rely on the regular shapes and line segments of buildings to recognize them. Line segments of a building are first detected and extracted by edge drawing lines (EDLines) [2], and then hierarchically grouped into candidate rectangular buildings by a graph search-based perceptual grouping approach in [52]. Some studies also propose building indices to identify the presence of a building. The morphological building index (MBI) [19], which takes the characteristics of buildings into consideration by integrating multiscale and multidirectional morphological operators, can be implemented to extract buildings automatically. The most widely used approaches are classification-based approaches, which make use of spectral, structural, and context information. The pixel shape index (PSI) [58], a shape feature measuring the gray similarity distance in each direction, is integrated with spectral features to extract buildings by using a support vector machine. However, the main problem with these algorithms is that multiple features need to be engineered for the proper classifier, which may consume too many computational resources and thus preclude large-scale applications.

Based on learning data representations, deep learning is the state-of-the-art method for many big data analysis applications [61] [10] [29] [26]. Deep learning architectures such as convolutional neural networks (CNN), which are artificial neural networks based on multiple processing layers, have been extensively employed in many computer vision tasks [17] [16]. A major advantage of CNN is its independence from prior knowledge and hand-crafted features, which underpins its more powerful generalization capability. CNN is superior to other approaches with respect to accuracy and efficiency. In particular, many CNN models have been proposed and applied to semantic segmentation with quite promising results, such as the fully convolutional network (FCN) [30], U-Net [43], SegNet [4], ResNet [15], ENet [40], DenseNet [18], PSPNet [59], and DeepLabv3+ [9]. Recently, the generative adversarial network (GAN) [14] has shown potential in solving such problems.

In fact, the task of building footprint generation belongs to the branch of semantic segmentation in computer vision. In the RS community, recent research has also made an effort to improve building footprint generation through the application of the aforementioned CNN models. In order to perform building segmentation, a multi-constraint fully convolutional network (MC–FCN) model is proposed in [54], which consists of an FCN architecture and multiple constraints. In [55], a modified and extended architecture of both ResNet and U-Net, named Res-U-Net, is proposed to improve the accuracy of building segmentation results from RS imagery. A comparatively simple and memory-efficient model, SegNet, is used for multi-task learning (a shared representation for boundary and segmentation prediction) for building footprint generation in [5]. A conditional GAN called cwGAN-gp [44], whose loss function is derived from the Wasserstein distance and an added gradient penalty term, is proposed to improve building footprint generation results.

However, CNN-based semantic segmentation often produces non-sharp boundaries and visually degraded results, which stem from the inherent invariance of CNN architectures to spatial transformations. In this case, the common approach to improving the accuracy of pixel-level segmentation is to adopt a graph model such as a conditional random field (CRF) as a post-processing step. A fully connected CRF [24] is applied to accurately localize segment boundaries and assign the most probable label to each pixel after FCN-based training in [6]. In this case, the CRF inference is used as a post-processing step, which is not integrated with the training of the CNN. In this research, we propose an accurate and reliable building footprint generation framework, which makes three contributions:

(1) Since each existing CNN model also has its own limitations, achieving more accurate segmentation results is still critical for automatic building footprint generation. The use of a graph model enables the combination of low-level image information such as the interactions between pixels, which is especially important for capturing fine local details. Therefore, in order to achieve more accurate segmentation results, we propose to combine CNN and a graph model in an end-to-end framework for building footprint generation, which has not been adequately addressed in the current literature.

(2) In addition, it should be noted that, in this research, we propose a graph model called feature pairwise conditional random field (FPCRF) to be exploited in the building footprint generation framework. Specifically, we design a pairwise potential term with localized constraints in CRF. This term combines feature kernels extracted from CNN, which allows more complete feature learning than other traditional graph models. Moreover, the localized processing facilitates the efficient message passing operation.

(3) Recently, there has been some development of deep learning methods in the computer vision community that seek to enhance the results of semantic segmentation; this development offers the RS community an opportunity to investigate the application of building footprint generation using deep learning methods. However, there is still a lack of a comprehensive investigation into the state-of-the-art CNN models in the tasks of automated building footprint generation from remote sensing imagery. With the aim of better understanding the usability and generalization ability of the state-of-the-art approaches, we compare and analyze the performances and characteristics of different CNN models for building footprint generation.

This research is organized as follows. In Section II, a brief review of related works is presented. Then, the proposed framework is introduced in Section III, followed by experiments in Section IV and results in Section V. Next, a discussion is provided in Section VI, leading to conclusions in Section VII.

II Related Work

II-A Semantic Segmentation

Deep learning methods have been commonly used in the field of computer vision, from coarse to fine inference. Classification is the coarse inference, which makes a prediction for a whole input. Semantic segmentation is the fine-grained inference, which assigns a label to each pixel. CNN can learn an enhanced feature representation end-to-end for solving the semantic segmentation problems. FCN or encoder-decoder based architectures have been successfully implemented to produce spatially explicit label maps efficiently.

FCN is a forerunner of semantic segmentation, which transforms popular classification models into fully convolutional ones and replaces the fully-connected layers with transposed convolutions to solve pixel labelling problems. Apart from the FCN architecture, the performance of other variants such as encoder-decoder based architectures is also remarkable. The spatial dimension is gradually reduced with pooling layers in the encoder, while the local detail and spatial dimension are recovered in the decoder. Moreover, there are skip connections from encoder to decoder in U-Net, which compensate high-level semantic information with low-level details. In SegNet, the max-pooling indices are reused in the decoding process, which results in a substantial reduction of the number of parameters. ResNet-DUC [53] is similar to U-Net, but uses a ResNet block instead of a normal block. In the ResNet block, the layers are reformulated as learning residual functions of the input layer, which is easier to optimize [15]. ENet consists of a large encoder and a small decoder, where the large encoder can operate on smaller-resolution data and contributes to efficient information processing. The potential of GAN has also been investigated in the semantic segmentation domain. GAN comprises two networks: a discriminator and a generator. The discriminator learns the boundary between classes, while the generator learns the distribution of individual classes. The two networks play a two-player min-max game to optimize both of their objective functions. PSPNet is a typical example of a multi-scale processing network, which first generates a feature map from a feature extraction network (ResNet, DenseNet, etc.), and then utilizes a pyramid pooling module to combine multi-scale feature maps. DeepLab [8] is a state-of-the-art semantic segmentation model, which by now has four versions with different improvements over time: DeepLab V1, DeepLab V2, DeepLab V3, and DeepLab V3+. Both DeepLab V1 and DeepLab V2 use CRF as a post-processing step, where the prediction is refined both qualitatively and quantitatively. DeepLab V3 improves over previous DeepLab versions without CRF post-processing, owing to a better way of encoding multi-scale context in its network architecture. DeepLab V3 performs multi-scale processing, and by using atrous convolution it can achieve satisfactory results without increasing the number of parameters. The DeepLab V3+ model is an extension of DeepLab V3 that adds an intermediate decoder module, which can recover object boundaries better. Currently, FC-DenseNet has shown superior results on terrestrial scene interpretation tasks. FC-DenseNet extends the DenseNet architecture to fully convolutional networks for pixel-level labeling tasks. In the DenseNet block, all preceding features are taken as input, and then its output features are transferred to all subsequent layers [18]. Through this feature reuse, the potential of the network can be utilized to improve the ease of training and parameter efficiency.
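The dense connectivity described above can be illustrated with a minimal PyTorch sketch. The class name, growth rate, and number of layers here are illustrative assumptions, not the configuration of FC-DenseNet itself:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Minimal dense block: each layer receives the concatenation of all
    preceding feature maps, so features are reused by subsequent layers."""
    def __init__(self, in_channels, growth_rate=12, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, kernel_size=3, padding=1),
            ))
            ch += growth_rate  # the input to the next layer grows
        self.out_channels = ch

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

block = DenseBlock(in_channels=8)
y = block(torch.randn(1, 8, 16, 16))  # output: 8 + 3 * 12 = 44 channels
```

Each 3x3 convolution contributes only `growth_rate` new channels, which is why parameter counts stay moderate despite the dense concatenations.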

The development of CNN has rapidly improved the performance of semantic segmentation algorithms, which has elicited an increasing interest in the RS domain. Many research works have transferred these common CNN models and adapted them for RS imagery, which has already achieved good performance [42]. An efficient multi-scale approach is implemented for CNN in [3], leveraging both a large spatial context and high resolution data to allow better semantic segmentation results. In [51], a multi-task learning method for semantic segmentation is proposed that learns the semantic class likelihoods and semantic boundaries across classes by CNN simultaneously. The spatial relation and channel relation modules are combined with CNN in [37], which has achieved competitive semantic segmentation results.

Ii-B CNN for Building Footprint Generation

In the RS domain, semantic segmentation is involved in numerous applications, such as change detection [36], land-cover classification [34], road extraction [25], and building footprint generation [50]. Since buildings are important objects among the various terrestrial targets in RS imagery, the task of building footprint generation has been heavily studied in the RS community.

One of the CNN models commonly used for building footprint generation is FCN, which has shown superiority in accuracy as well as computational time. When applied to RS data, FCN is usually adapted. In [33], a multiscale neuron module is designed in FCN, which is able to provide fine-grained building footprint maps. A multilayer perceptron (MLP) network is derived on top of the base FCN in [31], which extracts intermediate features from the base FCN to provide finer results. In [7], three parallel FCNs are first implemented to combine different data sources, and then merged at a late stage to automatically generate a more accurate building footprint map. A variant of FCN, which introduces an additional higher-resolution skip connection, is adopted in [24] in order to preserve consistently improved results. The method proposed in [35] employs a similar strategy by adding skip connections, which can minimize information loss from downsampling.

Apart from FCN, other encoder-decoder based architectures such as SegNet are also preferred for building footprint generation, because their memory requirements are significantly lower than FCN's. In this regard, larger-scale problems can be solved in parallel more efficiently at the inference stage. In [56], the building footprints across the entire continental United States are generated by SegNet with better fulfillment of the quality and computational time requirements. However, SegNet has a low edge accuracy, since it only uses a part of the layers to generate the predicted output. Another encoder-decoder based architecture, U-Net, which combines both the low and high layers, is widely exploited to generate building footprint maps with their edges preserved. A Siamese U-Net [22], where original images and their down-sampled counterparts are fed into the network separately, is proposed to improve the final results, especially for large buildings. Currently, some newly proposed networks, such as FC-DenseNet and GAN, have also demonstrated promising performances in building footprint generation. In [27], a generator using FC-DenseNet and an adversarial discriminator are jointly trained for building footprint generation from RS imagery.

II-C Graph Model

Exploiting CNN for semantic segmentation tasks is still a significant challenge. The convolutional layer of CNN is a weight-sharing architecture; hence, its shift-invariant and spatially invariant characteristics limit the spatial accuracy of segmentation [13]. The convolution filters with large receptive fields and the max-pooling layers in CNN also lead to coarse segmentation output, such as non-sharp boundaries and blob-like shapes [60]. Moreover, CNN fails to refine local details without taking the interactions between pixels into consideration. Graph models enable the modeling of interactions between pixels and can integrate more elaborate terms to preserve sharp boundaries. Therefore, graph models, which have the ability to capture fine-grained details, can be utilized to enhance the semantic segmentation results from CNN.

A graph model is a probabilistic model that encodes a distribution based on a graph-based representation. In a graph model, conditional dependencies are expressed between random variables. There are two categories of graphical representations of distributions, Bayesian networks and Markov random fields (MRF), which are distinguished by their encoded set of independences and the induced factorization of the distribution. In Bayesian networks, the network structure of the model is based on a directed acyclic graph, where the joint distribution is represented as a product of conditional distributions. An MRF is an undirected graph, which is described by random variables with a Markov property: only the present state contributes to the conditional probability distribution of future states of the process. CRF is a notable variant of MRF, in which each random variable is conditioned upon some global observations. FullCRF is a notable example of CRF, which can be regarded as a recurrent neural network (RNN) that forms a part of a deep network for end-to-end training. However, FullCRF is based on a complex data structure and does not allow efficient GPU computation. Recently, some research has focused on the improvement of CRF. The work in [21] proposes to use bilateral convolution layers (BCL) built inside CNN architectures for efficient CRF inference, where the receptive field of the filters can change. ConvCRF [48] is a recently proposed CRF algorithm that adds a conditional independence assumption to FullCRF, an adjustment that reduces the complexity of the pairwise potential. A recent example is PAC-CRF [46], which proposes a pixel-adaptive convolution (PAC) for efficient CRF inference to alleviate the computation; its filter weights depend on a spatially varying kernel utilizing local pixel features.

Some researchers have tried to implement both CNN models and graph models for building footprint generation. The results have shown that combining graph models and CNN models can lead to better results, especially along the boundaries of buildings [45]. In [49], MRF is integrated as a post-processing stage after the training of CNN, which has ameliorated the final building footprint generation map. The CRF is exploited in [38] and [39] to smooth the final pixel labeling results from CNN, which can respect the edges present in the imagery. However, the graph models are exploited only as post-processing steps in these studies. In [62], the FullCRF is plugged in at the end of the FCN for end-to-end training, which has preserved sharp boundaries, but requires longer training time and greater efforts to find optimal parameters.

III Methodology

In this section, the proposed building footprint generation framework is first described. Then, we introduce the proposed FPCRF, which has a designed pairwise potential term for complete feature learning and efficient computation. The experiment design for detailed investigation of FPCRF parameters is provided in Section IV.C.

III-A The Proposed Building Footprint Generation Framework

The building footprint generation in our research is actually a semantic segmentation task in the computer vision field. Recently, CNN has achieved great success in semantic segmentation tasks, as it is able to learn a strong feature representation instead of hand-crafted features. However, there are also some problems with CNN models, such as limited spatial accuracy, non-sharp boundaries, and so on. Parallel with CNN models, graph models, which enable interactions between pixels to be modeled, have also been shown to be effective methods to improve semantic segmentation results. For example, sharp boundaries and fine-grained details can be preserved by graph models. In order to harness the strengths of both models, we propose to integrate CNN and a graph model in the framework of building footprint generation. However, it should be noted that although the results could be improved by simply including graph models after learning from CNN, an end-to-end training scheme that fully integrates the graph models with CNN is preferred in our research. The end-to-end approach can provide more replicable and stable building footprint maps, especially for large scale applications. In this regard, we propose to utilize FPCRF as the graph model in the end-to-end framework, as it is superior to other graph models in terms of computation efficiency and completeness in feature learning.

In our proposed approach, CNN and FPCRF are integrated in an end-to-end framework, where the gradients are propagated through the entire pipeline. In this case, CNN and FPCRF can co-adapt and therefore produce the optimal output. Fig. 1 shows the overall architecture of the proposed approach. It has two major components: CNN and FPCRF. The output of the CNN consists of two parts. One output is the segmentation probability obtained from the last softmax layer of the CNN, which predicts labels for pixels. This segmentation probability obtained from the CNN is utilized as the unary potential [60]. The other output is the features extracted from the CNN, which encode each pixel as a fixed-length vector representation (i.e., embedding). This feature embedding is used for the pairwise potential calculation, which encourages assigning similar labels to pixels with similar properties. The FPCRF component is utilized as the graph model to complement the results obtained from the CNN. FPCRF takes the patch of feature embeddings and unary potentials as input and models their spatial correlations. The final output from FPCRF is the marginal distribution of each pixel, which represents the probability of each class label given the patch embedding.

Fig. 1: Flowchart of the proposed approach
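A minimal PyTorch sketch of this two-output design is given below; the backbone, layer widths, class count, and embedding dimension are hypothetical placeholders, not the actual CNN used in the paper:

```python
import torch
import torch.nn as nn

class CNNFeatureHead(nn.Module):
    """Hypothetical stand-in for the CNN component: from one backbone it
    emits (a) per-pixel log-probabilities used as the unary potential and
    (b) a per-pixel feature embedding used by the FPCRF pairwise term."""
    def __init__(self, in_ch=3, embed_dim=16, num_classes=11):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.cls_head = nn.Conv2d(32, num_classes, 1)   # -> unary potential
        self.embed_head = nn.Conv2d(32, embed_dim, 1)   # -> pairwise features

    def forward(self, x):
        h = self.backbone(x)
        unary = torch.log_softmax(self.cls_head(h), dim=1)
        embedding = self.embed_head(h)
        return unary, embedding

net = CNNFeatureHead()
unary, emb = net(torch.randn(2, 3, 64, 64))
```

Because both heads branch off the same backbone, gradients flowing back from the FPCRF stage shape the very features the pairwise term depends on, which is the point of the end-to-end coupling.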

III-B Data Preprocessing

Since the ground truth of the building footprint is generated using OSM with different data sources from satellite images, the inconsistencies between datasets need to be resolved by the preprocessing steps, including coregistration and truncated signed distance labels.

(1) Coregistration: One inconsistency is the misalignment between OSM building footprints and satellite imagery, which is caused by the different projections and accuracy levels of the data sources. This misalignment leads to inaccurate training samples, which need to be corrected. In this regard, we make the assumption that, after translation, the building footprint from OSM will be aligned with the satellite imagery content within a local neighborhood [57]. A cross-correlation is calculated between the building footprint and the gradient magnitude of the satellite imagery, where the maximum of the cross-correlation corresponds to the estimated alignment location. In this regard, the offsets in both the row and column directions can be derived, which correspond to the translation coefficients. An example of satellite imagery overlaid with the OSM building footprint is presented in Fig. 2 (a). There are noticeable misalignments between the building footprint and the satellite imagery. The local neighborhood size is selected as 7. Fig. 2 (b) illustrates the coregistration result.

Fig. 2: (a) Before coregistration, (b) After coregistration.
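The translation search can be sketched as a brute-force cross-correlation over integer offsets. This is an illustrative reimplementation under the stated assumptions (neighborhood size 7, i.e., offsets of at most ±3 pixels), not the authors' code:

```python
import numpy as np

def estimate_shift(mask, grad_mag, radius=3):
    """Return the (row, col) offset within +/-radius that maximizes the
    cross-correlation between the shifted OSM building mask and the
    gradient magnitude of the satellite image."""
    best, best_score = (0, 0), -np.inf
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            shifted = np.roll(np.roll(mask, dr, axis=0), dc, axis=1)
            score = float((shifted * grad_mag).sum())
            if score > best_score:
                best, best_score = (dr, dc), score
    return best

# Synthetic check: a mask displaced by (-2, +1) needs a (+2, -1) correction.
grad = np.zeros((32, 32))
grad[10:20, 10:20] = 1.0                               # image gradient "content"
mask = np.roll(np.roll(grad, -2, axis=0), 1, axis=1)   # misaligned OSM footprint
offset = estimate_shift(mask, grad)                    # -> (2, -1)
```

The recovered `(2, -1)` offset is then applied to the OSM polygons as the translation coefficients before the patches are cut.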

(2) Truncated signed distance label: In order to incorporate both semantic information and geometric properties of the buildings during training [5], the distances from pixels to the boundaries of buildings are extracted as output representations. In our experiment, the signed distance from a pixel to its closest point on the boundary is calculated, with positive values indicating the building interior and negative values indicating the building exterior. Then we truncate the distance at a given threshold to only incorporate the pixels closest to the border [5]. Finally, the distance values are categorized into a number of class labels [5]. The advantage of this truncated signed distance mask is that the location of the boundary and implicit geometric properties of each pixel can be captured. In addition, different buildings can be distinguished based on the distances between them and their labels.

Given that Q is the set of pixels on the object boundary and C is the set of pixels belonging to the building class, the truncated distance D(p) for every pixel p is calculated as

D(p) = δ(p) · min( min_{q∈Q} d(p, q), T ),   δ(p) = +1 if p ∈ C, −1 otherwise,

where d(p, q) is the Euclidean distance between pixels p and q, and T is the truncation threshold. The sign function δ is used to weight the pixel distances to represent whether the pixels are inside or outside the building masks. To facilitate training, the continuous distance values are then uniformly quantized.

Fig. 3: (a) Binary label, (b) Truncated signed distance label, (c) Colorbar for the class label.

In this research, we use 11 classes with the labels {0, 1, ..., 10}. Class 5 represents the building boundary; when the class label is greater than 5, the pixel belongs to the building. Similarly, a non-building pixel has a class label smaller than 5. Fig. 3 illustrates the binary label and truncated signed-distance label of a building footprint, which are used in the network training. Based on the raw output (multiclass) from a trained network, we simply select a threshold to classify the class labels into a final binary building footprint result: a pixel is considered a building pixel if its label is greater than or equal to 5; otherwise, it is considered non-building.
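As an illustration, the labeling scheme can be sketched in a brute-force way with NumPy; the truncation threshold and the exact quantization rule here are assumptions for demonstration, not the paper's parameter choices:

```python
import numpy as np

def truncated_signed_distance_labels(mask, trunc=20.0, num_classes=11):
    """Sketch of the truncated signed-distance labeling: for every pixel,
    take the Euclidean distance to the nearest building-boundary pixel,
    sign it (+ inside, - outside), truncate at `trunc`, and uniformly
    quantize into `num_classes` labels so that the boundary maps to the
    middle class (class 5 for 11 classes). Brute-force for clarity."""
    h, w = mask.shape
    # Boundary: building pixels with at least one non-building 4-neighbor.
    pad = np.pad(mask, 1, constant_values=0)
    nb_min = np.minimum.reduce([pad[:-2, 1:-1], pad[2:, 1:-1],
                                pad[1:-1, :-2], pad[1:-1, 2:]])
    boundary = (mask == 1) & (nb_min == 0)
    by, bx = np.nonzero(boundary)
    yy, xx = np.mgrid[0:h, 0:w]
    d = np.sqrt((yy[..., None] - by) ** 2 + (xx[..., None] - bx) ** 2).min(-1)
    signed = np.where(mask == 1, 1.0, -1.0) * np.minimum(d, trunc)
    # Map [-trunc, trunc] uniformly onto {0, ..., num_classes - 1}.
    labels = np.round((signed + trunc) / (2 * trunc) * (num_classes - 1))
    return labels.astype(int)

mask = np.zeros((16, 16), dtype=int)
mask[4:12, 4:12] = 1                      # one square building
lab = truncated_signed_distance_labels(mask)
```

Boundary pixels land on class 5, interior pixels above 5, and exterior pixels below 5, matching the thresholding rule described above.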

III-C FPCRF

An image can be regarded as a graph, where every pixel is a vertex, and there are edges between each pair of pixels. FPCRF provides a probabilistic model for an image that is both local and modular.

In FPCRF, the joint probability of the random variables is expressed as a product of functions over cliques,

P(X) = (1/Z) ∏_{c∈C_G} φ_c(X_c),

where X = {X_1, ..., X_N} is a field defined over a set of variables, with N being the number of pixels, and the domain of each variable is a set of labels L = {l_1, ..., l_L}, with L being the number of classes. G = (V, E) denotes a graph where V = {X_1, ..., X_N}. I is a global observation (image). φ_c is a potential induced by the clique c (each two vertices are linked) in the graph G, and Z is the partition function. The energy of a labeling x is E(x) = Σ_{c∈C_G} ψ_c(x_c). The Gibbs distribution is a probability distribution that measures a system with a certain state as a function of that state's energy. A conditional random field (CRF) explicitly gives a representation of the conditional independence between nodes of a graph. CRF and the Gibbs distribution are proved to be equivalent with regard to the same graph by the Hammersley-Clifford theorem [47], which indicates that, when the Gibbs distribution is given, the conditional independence specified by the corresponding CRF will be satisfied by all of the Gibbs joint probability distributions. Therefore, the Gibbs distribution characterized by FPCRF can thus be expressed as

P(X = x | I) = (1/Z(I)) exp(−E(x | I)).
In order to take (1) the interactions between pixels and (2) the approximate inference into consideration during learning, the Gibbs energy is expressed as

E(x) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j),

where i and j range from 1 to N. The term ψ_u(x_i) is the unary potential, which is independent for each pixel. The unary potential is a distribution over the label assignment from the classifier. The term ψ_p(x_i, x_j) is a pairwise potential function that is determined based on the compatibility among pairs of pixels. This pairwise potential term can overcome the drawbacks of the noisy and inconsistent labeling produced by the unary potential alone.

In FPCRF, the pairwise potential is defined by the expression below,

ψ_p(x_i, x_j) = μ(x_i, x_j) Σ_{m=1}^{K} w^{(m)} k^{(m)}(f_i, f_j),

where the w^{(m)} are learnable parameters and K is the number of kernels, which is determined by the selected kernels. The terms f_i and f_j are feature vectors for pixels i and j and may depend on the input image I. The function μ is the compatibility transformation and captures the compatibility between labels x_i and x_j.

However, FullCRF and ConvCRF only use shallow features, i.e., the color and position of the pixel, for the kernels in the pairwise potential term, which do not fully harness the complete features extracted from CNN. In this regard, we propose FPCRF as a graph model to be exploited in the building footprint generation framework.

Inspired by the fact that ConvCRF is based on localized processing, we design a pairwise potential term with localized constraints in FPCRF that allows complete feature learning. The kernel utilized for the pairwise potential in FPCRF is a Gaussian kernel, which is defined by the feature vectors f^{(1)}, ..., f^{(D)}, where D is the number of feature vector types. The kernel is defined as:

k(f_i, f_j) = exp(−‖f_i − f_j‖² / (2θ²)),

where θ is a learnable parameter.
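A direct NumPy transcription of this kernel (with θ fixed rather than learned, purely for illustration):

```python
import numpy as np

def gaussian_kernel(f_i, f_j, theta=1.0):
    """k(f_i, f_j) = exp(-||f_i - f_j||^2 / (2 * theta^2)); theta is a
    learnable bandwidth in the model but fixed here for illustration."""
    diff = np.asarray(f_i, dtype=float) - np.asarray(f_j, dtype=float)
    return float(np.exp(-(diff @ diff) / (2.0 * theta ** 2)))

k_same = gaussian_kernel([1.0, 2.0], [1.0, 2.0])  # identical features -> 1.0
k_far = gaussian_kernel([0.0, 0.0], [5.0, 5.0])   # distant features -> ~0
```

Pixels with similar CNN embeddings thus receive a large pairwise weight, encouraging them to share a label, while dissimilar pixels interact only weakly.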

The labeling x* of the random field is derived by the maximum a posteriori (MAP) method,

x* = argmax_x P(x | I).
The most probable labeling can be yielded by the minimization of the Gibbs energy in FPCRF. However, the exact minimization is intractable. In this regard, mean field inference is utilized for the approximation of the FPCRF distribution. A distribution Q that minimizes the KL-divergence from the exact distribution P is computed by the mean field approximation,

Q* = argmin_Q KL(Q ‖ P),

where the approximated distribution can be represented as a product of independent marginal distributions,

Q(x) = ∏_i Q_i(x_i).
The combined message passing result of all kernels is expressed as:

Q̃_i(l) = Σ_{m=1}^{K} w^{(m)} Σ_{j≠i} k^{(m)}(f_i, f_j) Q_j(l).
The steps of the mean field algorithm are presented in Table I.

Mean field approximation in FPCRF
1. Initialize: Q_i(x_i) ← (1/Z_i) exp(−ψ_u(x_i))
2. while not converged
3.   Message passing for all kernels m: Q̃_i^{(m)}(l) ← Σ_{j≠i} k^{(m)}(f_i, f_j) Q_j(l)
4.   Weighting filter outputs: Q̂_i(l) ← Σ_m w^{(m)} Q̃_i^{(m)}(l)
5.   Compatibility transformation: Q̌_i(x_i) ← Σ_{l∈L} μ(x_i, l) Q̂_i(l)
6.   Adding unary potentials: Q_i(x_i) ← exp(−ψ_u(x_i) − Q̌_i(x_i))
7.   Normalization of Q_i
8. end while
TABLE I: The steps of the mean field algorithm in FPCRF

The steps of the mean field inference algorithm of FPCRF are reformulated as network layers, where the error differentials in each layer with respect to its inputs are sent to previous layers by back propagation during training [60]. FPCRF exploits a filter to assign different penalties to different pairs of labels.

To implement the efficient computation of the convolution, the input is first tiled into specific shapes, which are related to the filter size. An efficient message passing operation in FPCRF can be implemented analogously to a 2D convolution [48]. The message passing step is then reformulated as a convolution with a truncated Gaussian kernel.
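A naive, loop-based sketch of one localized message-passing step may clarify the idea; the real implementation tiles the input and runs it as a batched convolution on the GPU, whereas this version trades efficiency for readability (the window size and fixed bandwidth are illustrative):

```python
import numpy as np

def local_message_passing(Q, feats, filter_size=3, theta=1.0):
    """One localized message-passing step: each pixel aggregates the
    marginals Q of its neighbors inside a filter_size x filter_size
    window, weighted by a Gaussian kernel on the feature vectors,
    i.e., a content-adaptive (truncated) convolution."""
    C, H, W = Q.shape
    r = filter_size // 2
    out = np.zeros_like(Q)
    for y in range(H):
        for x in range(W):
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W and (dy, dx) != (0, 0):
                        diff = feats[:, y, x] - feats[:, ny, nx]
                        w = np.exp(-(diff @ diff) / (2.0 * theta ** 2))
                        out[:, y, x] += w * Q[:, ny, nx]  # no self-message
    return out

C, H, W = 2, 6, 6
Q = np.full((C, H, W), 0.5)                        # uniform initial marginals
feats = np.random.default_rng(1).normal(size=(4, H, W))
msg = local_message_passing(Q, feats)
```

Because the kernel weights come from the per-pixel feature vectors rather than a fixed stencil, the "convolution" adapts its weights at every location, which is exactly what makes the operation content-aware.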

IV Experiments

IV-A Study Area and Dataset

In this research, the study sites cover four cities (see Fig. 4): (1) Munich, Germany; (2) Rome, Italy; (3) Paris, France; and (4) Zurich, Switzerland. We use Planetscope satellite imagery [41] with three bands (Red, Green, Blue; RGB) and 3 m spatial resolution to validate our proposed method. The imagery is processed using a sliding window. The corresponding building footprints (stored as polygon shapefiles) are downloaded from OSM, where detailed building footprints for these four cities are publicly released. Some patches are mismatched as a result of the time difference between the OSM building footprints and the satellite imagery. For example, a building might appear in the OSM building footprint while it is missing in the corresponding satellite imagery, or vice versa. To limit such patches, we manually selected 3000 pairs of proper patches. The selected pairs are then separated into two parts: 80% of the sample patches are used for training the network and 20% are used for model validation.

Fig. 4: True color Planetscope satellite images and building footprint of Munich, Rome, Paris, and Zurich

IV-B Experiment Setup

In this research, all networks were investigated within a PyTorch framework on an NVIDIA Titan X GPU with 12 GB of memory. For all networks, a stochastic gradient descent (SGD) optimizer with a learning rate of 0.0001 was utilized, and the negative log-likelihood loss (NLLLoss) was taken as the loss function. The batch size of all networks was 4.
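For reference, the loss minimized by all networks can be written out explicitly. This is a NumPy sketch of the negative log-likelihood over a batch (in the experiments it is `torch.nn.NLLLoss` optimized with SGD at a learning rate of 0.0001 and a batch size of 4); `nll_loss` is an illustrative name:

```python
import numpy as np

def nll_loss(log_probs, targets):
    """Mean negative log-likelihood of the target classes.

    log_probs: (B, L) per-class log-probabilities (e.g., log-softmax outputs).
    targets:   (B,) integer class labels.
    """
    # Pick the log-probability of the correct class for each sample and average.
    return -np.mean(log_probs[np.arange(len(targets)), targets])
```

The networks in the paper minimize this quantity per batch with SGD.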

In our proposed end-to-end approach, CNN and FPCRF are the two vital parts of the framework, where the CNN component acts as a feature extractor, and the FPCRF models pixel correlations by using the pairwise potential. Hence, we first investigate which CNN model has the better feature extraction capability. Then, the feature kernels taken in the pairwise potential calculation of FPCRF are carefully studied to find the optimal feature embedding. Moreover, the sensitivity of the filter size $r$, the only hyperparameter of FPCRF, is analyzed. Additionally, to demonstrate the superiority of our proposed framework, we train the following networks for comparison:

1) FCN-8s, which uses VGG16 as the encoder and an up-sampling layer together with a convolutional layer as the decoder.

2) ResNet-DUC, which has [3, 4, 6, 3, 3, 6, 4, 3] convolutional layers in each ResNet block.

3) SegNet, which attaches a reversed VGG16 as a decoder to the encoder.

4) U-Net, which has a depth of five with feature channels of [64, 128, 256, 512, 1024] at each depth.

5) ENet, which consists of five stages, where the first three stages act as the encoder, while the last two stages belong to the decoder.

6) cwGAN-gp, which also uses a U-Net of depth five in the generator.

7) FC-DenseNet, with each dense block having [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] convolutional layers.

8) PSPNet starts off with a standard feature extraction network (ResNet101).

9) DeepLabv3+ utilizes the Xception model [11] as the feature extractor.

V Results

The three metrics selected to evaluate the results in the following experiments are overall accuracy, F1 score, and intersection over union (IoU), which are widely used to evaluate building footprint generation results:

$$\text{Overall accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \quad \text{precision} = \frac{TP}{TP + FP}, \quad \text{recall} = \frac{TP}{TP + FN},$$

$$IoU = \frac{TP}{TP + FP + FN},$$

where $TP$ is the number of building pixels correctly detected, and $FN$ denotes the missed building pixels. $FP$ and $TN$ are the numbers of non-building pixels in the ground reference detected as buildings and non-buildings in the result, respectively. The F1 score indicates a balance between precision and recall.
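The three metrics follow directly from the confusion-matrix counts; a NumPy sketch over binary masks (the function name `evaluation_metrics` is ours):

```python
import numpy as np

def evaluation_metrics(pred, gt):
    """Overall accuracy, F1 score, and IoU from binary building masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    tp = np.sum(pred & gt)     # building pixels correctly detected
    fn = np.sum(~pred & gt)    # missed building pixels
    fp = np.sum(pred & ~gt)    # non-building pixels detected as building
    tn = np.sum(~pred & ~gt)   # non-building pixels correctly detected
    oa = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return oa, f1, iou
```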

V-A Feature Extractor Combined with FPCRF

Fig. 5 and Table II list the results of the different CNN models combined with FPCRF. The results of FC-DenseNet combined with FPCRF are more accurate than those of the other two CNN models combined with FPCRF. This is due to the superiority of FC-DenseNet, which extends the DenseNet architecture to an FCN for semantic segmentation. In the DenseNet block, through feature reuse, there are shorter connections within the layers close to the input or output, which strengthens the learning of discriminative features. Moreover, features are combined by iterative concatenation, which contributes to an improved flow of information. In addition, a standard skip connection between the encoder and decoder is used to pass higher-resolution information, which helps the decoder recover spatially detailed information from the encoder.

Methods Overall accuracy F1 score IOU
FC-DenseNet + FPCRF 0.9297 0.6698 0.5046
FCN-8s + FPCRF 0.9248 0.6340 0.4642
U-Net+ FPCRF 0.8927 0.6278 0.4575
TABLE II: Accuracy of different feature extractors combined with FPCRF
Fig. 5: The predicted results (in red) obtained from different feature extractors combined with FPCRF: (a) FC-DenseNet, (b) FCN-8s, (c) U-Net, and (d) ground truth.

V-B Kernel Selection in FPCRF

FullCRF and ConvCRF only utilize pairwise potentials from shallow features, which include only the appearance and smooth Gaussian kernels. In the implementation of ConvCRF, the unary potential is obtained from the CNN, and only the smooth kernel and appearance kernel are utilized for the calculation of the pairwise potential term. FPCRF is able to greatly reduce the complexity of the pairwise potential, which makes exact message passing and complete feature learning possible. In this regard, we can use the features extracted from the CNN models to calculate the pairwise potentials, which may facilitate training. The results for FC-DenseNet combined with FPCRF using the different kernels are presented in Table III and Fig. 6. The appearance kernel (a) and the smooth kernel (s) are the same as in FullCRF and ConvCRF. The feature difference kernel (fd) represents the CNN-extracted feature difference calculated with a Gaussian function, and the feature spatial kernel (fs) is the feature difference combined with the position difference, calculated with a Gaussian function. In the feature cosine kernel (fc), the cosine distance between feature vectors is implemented as the pairwise potential [28]. The detailed formulas of the different kernels are listed below:

(1) appearance kernel (a):

$$k_a(i, j) = \exp\left(-\frac{\|p_i - p_j\|^2}{2\theta_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\theta_\beta^2}\right),$$

where $p$ is the feature of position, $I$ is the feature of color, and $\theta_\alpha$ and $\theta_\beta$ are learnable parameters.

(2) smooth kernel (s):

$$k_s(i, j) = \exp\left(-\frac{\|p_i - p_j\|^2}{2\theta_\gamma^2}\right),$$

where $\theta_\gamma$ is a learnable parameter.

(3) feature difference kernel (fd):

$$k_{fd}(i, j) = \exp\left(-\frac{\|d_i - d_j\|^2}{2\theta_\delta^2}\right),$$

where $d$ is the feature extracted from the CNN and $\theta_\delta$ is a learnable parameter.

(4) feature spatial kernel (fs):

$$k_{fs}(i, j) = \exp\left(-\frac{\|p_i - p_j\|^2}{2\theta_\alpha^2} - \frac{\|d_i - d_j\|^2}{2\theta_\delta^2}\right),$$

where $\theta_\alpha$ and $\theta_\delta$ are learnable parameters.

(5) feature cosine kernel (fc):

$$k_{fc}(i, j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|}.$$
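The five kernels can be sketched together in NumPy; in FPCRF the $\theta$ parameters are learnable, and the function name `pairwise_kernels` is ours:

```python
import numpy as np

def gauss(sq_dist, theta):
    # Gaussian weighting of a squared distance with bandwidth theta.
    return np.exp(-sq_dist / (2.0 * theta ** 2))

def pairwise_kernels(p_i, p_j, I_i, I_j, d_i, d_j, t_alpha, t_beta, t_gamma, t_delta):
    """The five pairwise kernels compared in Table III.

    p: pixel positions, I: RGB values, d: CNN-extracted features; all thetas learnable.
    """
    dp = np.sum((p_i - p_j) ** 2)
    dI = np.sum((I_i - I_j) ** 2)
    dd = np.sum((d_i - d_j) ** 2)
    k_a = gauss(dp, t_alpha) * gauss(dI, t_beta)    # (1) appearance
    k_s = gauss(dp, t_gamma)                        # (2) smooth
    k_fd = gauss(dd, t_delta)                       # (3) feature difference
    k_fs = gauss(dp, t_alpha) * gauss(dd, t_delta)  # (4) feature spatial
    k_fc = np.dot(d_i, d_j) / (np.linalg.norm(d_i) * np.linalg.norm(d_j))  # (5) feature cosine
    return k_a, k_s, k_fd, k_fs, k_fc
```

For identical pixel pairs all five kernels take their maximum value of 1.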
“FC-DenseNet + FPCRF (a+s)” corresponds to “ConvCRF”, meaning that the unary potential is the segmentation probability obtained from FC-DenseNet, but only the smooth kernel and appearance kernel are utilized for the calculation of the pairwise potential term. It should be noted that in our proposed method “FC-DenseNet + FPCRF (fd)”, FC-DenseNet not only provides the segmentation probability as the unary potential, but also extracts the features for the calculation of the pairwise potential term. FC-DenseNet combined with FPCRF using the feature difference kernel (fd) outperforms the other kernels, with the highest F1 score and IoU. There are several reasons for this. The smooth kernel (s), which removes small isolated regions, is not useful in our case: since the spatial resolution of the satellite imagery is coarse, removing the smooth kernel preserves isolated small buildings. The feature spatial kernel (fs) controls the degree to which nearby pixels with similar features may belong to the same class. However, since the filter size already imposes locality, we want the pixels within the filter to contribute equally to the centered pixel. In addition, the appearance kernel (a) has not shown any improvement to the results. This may result from the fact that the RGB information in the appearance kernel (a) is not sufficient to distinguish buildings from non-building areas (sometimes roads and buildings have similar RGB information). The feature cosine kernel (fc) shows very low accuracy, which can be explained by the fact that the Gaussian function in the feature difference kernel (fd) can suppress noise, whereas the cosine distance is strongly affected by it. In this case, when the cosine distance between feature vectors is implemented as the pairwise potential, the final results suffer from great instability.

Methods Overall accuracy F1 score IOU
FC-DenseNet + FPCRF (a+s) 0.9075 0.6653 0.4986
FC-DenseNet + FPCRF (a+s+fd) 0.9166 0.6682 0.5018
FC-DenseNet + FPCRF (s+fd) 0.9013 0.6660 0.4991
FC-DenseNet + FPCRF (a+fd) 0.9212 0.6685 0.5013
FC-DenseNet + FPCRF (fs) 0.9275 0.6673 0.5006
FC-DenseNet + FPCRF (fd) 0.9297 0.6698 0.5046
FC-DenseNet + FPCRF (fc) 0.7888 0.4521 0.2921
TABLE III: Accuracy of FC-DenseNet combined with FPCRF from different kernels. (a: appearance kernel, s: smooth kernel, fd: feature difference kernel, fs: feature spatial kernel, fc: feature cosine kernel)
Fig. 6: The predicted results (in red) obtained from FC-DenseNet combined with FPCRF from different kernels: (a) a+s, (b) a+s+fd, (c) s+fd, (d) a+fd, (e) fs, (f) fd, (g) fc, (h) ground truth. (a: appearance kernel, s: smooth kernel, fd: feature difference kernel, fs: feature spatial kernel, fc: feature cosine kernel)

V-C Hyperparameter Analysis in FPCRF

The hyperparameter filter size $r$ in FPCRF implies that the pairwise potential is zero when the Manhattan distance between a pair of pixels exceeds $r$. In order to better understand the influence of various filter sizes on building footprint generation, the visual results of FC-DenseNet combined with FPCRF with different filter sizes $r$, as well as their accuracy indexes, are compared in Fig. 7 and Table IV. From the visual results, we can observe that when the filter size is not optimal, more non-building areas are wrongly detected as building areas, and some small buildings are not detected. This can be explained by the fact that the filter size determines the number of useful neighboring pixels, which contribute to the improvement of the segmentation results.
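The locality induced by the filter size $r$ can be made explicit as a mask over a $(2r+1) \times (2r+1)$ window; a NumPy sketch (`locality_mask` is an illustrative name). For $r = 7$, each pixel interacts with $2r^2 + 2r + 1 = 113$ positions (including itself):

```python
import numpy as np

def locality_mask(r):
    """Boolean (2r+1)x(2r+1) mask: True where the Manhattan distance to the center is <= r.

    Pairwise potentials outside this mask are set to zero in FPCRF.
    """
    dy, dx = np.mgrid[-r:r + 1, -r:r + 1]
    return np.abs(dy) + np.abs(dx) <= r
```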

Methods Overall accuracy F1 score IOU
FC-DenseNet + FPCRF (r=5) 0.9121 0.6665 0.4985
FC-DenseNet + FPCRF (r=7) 0.9297 0.6698 0.5046
FC-DenseNet + FPCRF (r=9) 0.9142 0.6670 0.4993
TABLE IV: Accuracy of FC-DenseNet combined with FPCRF with different filter sizes $r$.
Fig. 7: The predicted results (in red) obtained from FC-DenseNet combined with FPCRF with different filter sizes: (a) FC-DenseNet+FPCRF (r=5), (b) FC-DenseNet+FPCRF (r=7), (c) FC-DenseNet+FPCRF (r=9), and (d) ground truth.

VI Discussion

VI-A Additional Datasets

Three additional datasets, the ISPRS benchmark data, the Dstl Kaggle dataset, and the Inria Aerial Image Labeling data, are used to test the performance and characteristics of the different networks for building footprint generation.

The first dataset is the ISPRS benchmark data [20], shown in Fig. 8. The dataset covers the city of Potsdam and contains 38 aerial images with four channels (Red, Green, Blue, and Near-infrared) at 5 cm spatial resolution. The corresponding ground truth, which includes six categories, is also available from the ISPRS benchmark data. In this research, we take the building class as building and the other five classes as non-building; only the natural color (RGB) aerial imagery is utilized. The images 7-07, 7-08, 7-09, 7-10, 7-11, 7-12, and 7-13 are used as the validation set, and the remaining images are exploited for training.

The Dstl Kaggle dataset [12] is the second dataset, which provides 57 satellite images in both 3-band RGB and 16-band multi-spectral formats. Here, we use the 3-band images with a spatial resolution of 1.24 m. In this dataset, 10 different classes have been labeled within some images. In this research, the building pixels come from the building class, and the non-building pixels are the remaining pixels. Ten satellite images that have a corresponding building class in the ground truth are exploited for this experiment: eight images with IDs 6100-2-3, 6100-1-2, 6100-3-1, 6110-4-0, 6120-2-0, 6120-2-2, 6140-1-2, and 6140-3-1 for training, and two images with IDs 6100-1-3 and 6100-2-2 for validation. Fig. 9 illustrates one sample satellite image.

The third dataset is the Inria Aerial Image Labeling data [32]. This dataset contains 360 aerial images at a 30 cm spatial resolution, with three bands: Red, Green, and Blue. In this research, 36 tiles of aerial imagery and their corresponding ground truth (building and non-building) are selected for each of the following five regions: Austin, Chicago, Kitsap County, Western Tyrol, and Vienna, which cover dissimilar urban settlements. The sample data are shown in Fig. 10. To split the training and validation sets, we use the first eight images of every city for validation.

In order to obtain more training data, the satellite imagery and corresponding ground truth from the Dstl Kaggle dataset are cut into small patches with an overlap of 64 pixels. However, since the numbers of samples from the ISPRS benchmark data and the Inria Aerial Image Labeling data are sufficient for network training, the aerial imagery and corresponding ground truth from both datasets are cut into non-overlapping patches. The numbers of training and validation patches for the three additional datasets are listed in Table V.
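The patch extraction described above can be sketched as a sliding window; a minimal NumPy version where `extract_patches` is an illustrative name (setting `overlap=0` yields the non-overlapping tiling used for the ISPRS and Inria data):

```python
import numpy as np

def extract_patches(image, patch_size, overlap=0):
    """Cut an image into square patches; stride = patch_size - overlap.

    overlap=0 gives non-overlapping tiles; a positive overlap yields more patches.
    """
    stride = patch_size - overlap
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches
```

The same routine is applied identically to an image and its ground-truth mask so the patch pairs stay aligned.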

Fig. 8: The aerial imagery in ISPRS dataset (spatial resolution: 5cm).
Fig. 9: The WorldView 3 imagery in Dstl dataset (spatial resolution: 1.24m).
Fig. 10: The aerial imagery in Inria dataset (spatial resolution: 30cm).
Dataset Number of training patches Number of validation patches
ISPRS benchmark data (spatial resolution: 5cm) 16000 3573
Dstl Kaggle dataset (spatial resolution: 1.24m) 2312 578
Inria Aerial Image Labeling data (spatial resolution: 30cm) 50540 14440
TABLE V: The numbers of training and validation patches of the three additional datasets

VI-B Comparison with Other Models

In this research, several popular semantic segmentation neural networks were also investigated on the four different datasets for comparison with the proposed method. Their accuracy indexes for building footprint generation are presented in Tables VI, VII, VIII, and IX. Moreover, the visual results of the different networks are illustrated in Figs. 11, 12, 13, and 14. The training and inference time costs of the different methods on the Planetscope dataset are listed in Fig. 15, where the training time measures the whole training set for 100 epochs, and the inference time refers to the time cost for each patch.

DeepLabv3+ and PSPNet, which are state-of-the-art networks for semantic segmentation tasks in computer vision, achieved satisfactory accuracy. Both networks adopt multiscale processing techniques, which not only allow the refinement of details but also retain high-level semantic information. They can also take the global structure into consideration when making local predictions. ENet is highly superior with respect to both training and inference time, due to its specific architecture. First, the decoder uses max-pooling indices to produce sparse upsampled maps, which reduces the training time requirements. The input size is also reduced heavily by the first two blocks, which adopt only a small number of features. Moreover, in the first stage, a max-pooling operation is performed in parallel with a strided convolution, and the resulting feature maps are concatenated, which speeds up the inference process of the initial block. Compared to other CNN models, cwGAN-gp, a recently proposed network, also shows promising results for building footprint generation. The generator of cwGAN-gp exploits skip connections, which help retain the boundaries of the buildings. Moreover, the generator and discriminator of the GAN are both improved by the min-max game. However, the difficulty of training GANs also leads to the longest training time among all the CNN models. Among all CNN models, FC-DenseNet is the superior network with respect to numerical accuracy and visual results. On one hand, feature maps produced by different layers are concatenated in the DenseNet block, which increases variation in the input of subsequent layers. On the other hand, high-frequency information can be transferred by a standard skip connection between the encoder and the decoder, which contributes to the recovery of spatial details.

The architectures of the networks, such as the feature extractor, decoder, and skip connections, have different significance when applied to satellite imagery of diverse spatial resolutions. On one hand, for higher spatial resolution imagery (ISPRS dataset), the feature extractor is rather important. For instance, the accuracy indexes of PSPNet are much higher than those of DeepLabv3+, which means that the ResNet101 in PSPNet has a better feature extraction capability than the Xception in DeepLabv3+. On the other hand, the decoder plays an important role for the other datasets, which include lower spatial resolution satellite imagery. DeepLabv3+ achieves much better results than PSPNet when applied to lower spatial resolution satellite imagery (Planetscope, Dstl, and Inria datasets). This is owing to the decoder module on top of the encoder output in DeepLabv3+, which contributes to sharper segmentation results. The skip connection in networks such as U-Net is also vital for lower spatial resolution satellite imagery, as it is able to concatenate feature maps from both low-level and high-level layers. Hence, it can create a more efficient path for information propagation. However, it consumes more training and inference time, due to the fact that the feature maps from the encoder are transferred and concatenated to the decoder.

However, there are still some problems with CNN-based results, such as weak boundaries and coarse pixel-level predictions. Therefore, graph models can be implemented to overcome the drawbacks of exploiting only a CNN for building footprint generation. CRF is a popular graph model with widespread success in solving semantic segmentation problems. The CRF inference can be used as a post-processing step that is not integrated with the training of the CNN; however, in this case, the strength of the CRF cannot be fully harnessed. Therefore, we adopt an end-to-end deep learning network to produce sharp boundaries and fine-grained segmentation, in which FullCRF and FPCRF are combined with CNN models in one unified framework. When connected with CRF-based graph models, the results are improved, as wrongly detected non-building pixels are removed. FC-DenseNet combined with FPCRF achieves higher IoU and F1 scores than FC-DenseNet combined with FullCRF, and can also better preserve details and sharper boundaries. Moreover, FPCRF can substantially reduce the time needed for the training and inference stages. This superiority can be attributed to two reasons. First, FPCRF uses exact message passing, which avoids the approximation errors resulting from the permutohedral lattice approximation [1] in FullCRF. Second, the localized processing in FPCRF implements the feature learning more efficiently.

Models Overall accuracy F1 score IoU
ResNet-DUC 0.7976 0.4593 0.2981
SegNet 0.8263 0.5597 0.3886
ENet 0.8379 0.5831 0.4115
U-Net 0.8435 0.6054 0.4341
FCN-8s 0.8505 0.6292 0.4590
cwGAN-gp 0.8453 0.6339 0.4641
PSPNet 0.8395 0.5948 0.4233
DeepLabv3+ 0.8742 0.6592 0.4901
FC-DenseNet 0.8718 0.6556 0.4877
FC-DenseNet+FullCRF 0.8913 0.6580 0.4903
FC-DenseNet+FPCRF 0.9297 0.6698 0.5046
TABLE VI: Comparison of accuracy indexes among different models of Planetscope dataset (spatial resolution: 3m)
Fig. 11: The predicted results (in red) obtained from (a) ResNet-Duc, (b) SegNet, (c) ENet, (d) U-Net, (e) FCN-8s, (f) cwGAN-gp, (g) PSPNet, (h) DeepLabv3+, (i) FC-DenseNet, (j) FC-DenseNet+FullCRF, (k) FC-DenseNet+FPCRF, and (l) ground truth from Planetscope dataset (spatial resolution: 3m).
Models Overall accuracy F1 score IoU
ResNet-DUC 0.7475 0.6766 0.5051
SegNet 0.8948 0.8511 0.7408
ENet 0.7711 0.7764 0.6110
U-Net 0.8892 0.8392 0.7229
FCN-8s 0.8617 0.7986 0.6647
cwGAN-gp 0.8926 0.8504 0.7397
PSPNet 0.9141 0.9144 0.8682
DeepLabv3+ 0.8995 0.9086 0.8325
FC-DenseNet 0.9186 0.9182 0.8789
FC-DenseNet+FullCRF 0.9298 0.9232 0.8826
FC-DenseNet+FPCRF 0.9315 0.9358 0.8974
TABLE VII: Comparison of accuracy indexes among different models of ISPRS dataset (spatial resolution: 5cm)
Fig. 12: The predicted results (in red) obtained from (a) ResNet-Duc, (b) SegNet, (c) ENet, (d) U-Net, (e) FCN-8s, (f) cwGAN-gp, (g) PSPNet, (h) DeepLabv3+, (i) FC-DenseNet, (j) FC-DenseNet+FullCRF, (k) FC-DenseNet+FPCRF, and (l) ground truth from ISPRS dataset (spatial resolution: 5cm).
Models Overall accuracy F1 score IoU
ResNet-DUC 0.8923 0.5184 0.3499
SegNet 0.9240 0.6050 0.4337
ENet 0.9127 0.6890 0.5189
U-Net 0.9485 0.7576 0.5887
FCN-8s 0.9447 0.7467 0.5779
cwGAN-gp 0.9412 0.7291 0.5732
PSPNet 0.9379 0.6926 0.5297
DeepLabv3+ 0.9602 0.7578 0.6100
FC-DenseNet 0.9507 0.7602 0.5928
FC-DenseNet+FullCRF 0.9598 0.7697 0.6034
FC-DenseNet+FPCRF 0.9604 0.7821 0.6176
TABLE VIII: Comparison of accuracy indexes among different models of Dstl dataset (spatial resolution: 1.24m)
Fig. 13: The predicted results (in red) obtained from (a) ResNet-Duc, (b) SegNet, (c) ENet, (d) U-Net, (e) FCN-8s, (f) cwGAN-gp, (g) PSPNet, (h) DeepLabv3+, (i) FC-DenseNet, (j) FC-DenseNet+FullCRF, (k) FC-DenseNet+FPCRF, and (l) ground truth from Dstl dataset (spatial resolution: 1.24m).
Models Overall accuracy F1 score IoU
ResNet-DUC 0.8704 0.7395 0.6097
SegNet 0.8826 0.7845 0.6455
ENet 0.8972 0.8001 0.6669
U-Net 0.9018 0.8027 0.6704
FCN-8s 0.9169 0.8192 0.6837
cwGAN-gp 0.9387 0.8371 0.7198
PSPNet 0.8960 0.7951 0.6599
DeepLabv3+ 0.9498 0.8551 0.7299
FC-DenseNet 0.9426 0.8536 0.7258
FC-DenseNet+FullCRF 0.9485 0.8605 0.7312
FC-DenseNet+FPCRF 0.9581 0.8765 0.7479
TABLE IX: Comparison of accuracy indexes among different models of INRIA dataset (spatial resolution: 30cm)
Fig. 14: The predicted results (in red) obtained from (a) ResNet-Duc, (b) SegNet, (c) ENet, (d) U-Net, (e) FCN-8s, (f) cwGAN-gp, (g) PSPNet, (h) DeepLabv3+, (i) FC-DenseNet, (j) FC-DenseNet+FullCRF, (k) FC-DenseNet+FPCRF, and (l) ground truth from Inria dataset (spatial resolution: 30cm).
Fig. 15: Comparison of training time and inference time among different models from Planetscope dataset (spatial resolution: 3m)

VII Conclusion

Considering that CNN-based results exhibit weak boundaries and coarse pixel-level label predictions, we have proposed an end-to-end building footprint generation framework integrating a CNN and a graph model in this research. Moreover, a number of state-of-the-art CNN models for semantic segmentation are selected to generate building footprints from high-resolution RS images for comparison. The effectiveness of the CNN models and the proposed end-to-end CNN-graph model building footprint generation approach is validated on four different datasets: (1) Planetscope satellite imagery of the cities of Munich, Paris, Rome, and Zurich; (2) aerial imagery of the city of Potsdam (northern Germany) from the ISPRS benchmark data; (3) WorldView-3 satellite imagery from the Dstl Kaggle dataset; and (4) aerial imagery of the cities of Austin, Chicago, Kitsap County, Western Tyrol, and Vienna from the Inria Aerial Image Labeling data. The experimental results show that building footprint generation based on CNN-graph model methods can obtain more accurate results than CNN-based methods alone. Furthermore, FPCRF as the graph model in our proposed framework is effective in producing sharp boundaries and fine-grained segmentation results. On one hand, the completeness of the buildings can be preserved. On the other hand, some non-buildings, which are wrongly detected as buildings by CNN models, can be removed by the graph model. Thus, we believe the proposed CNN-graph model method will be of practical value for the monitoring of fast-growing urban areas. In the future, we plan to extend our work to instance segmentation. More types of graph models will also be investigated.


  • [1] A. Adams, J. Baek, and M. A. Davis (2010) Fast high-dimensional filtering using the permutohedral lattice. In Computer Graphics Forum, Vol. 29, pp. 753–762. Cited by: §VI-B.
  • [2] C. Akinlar and C. Topal (2011) EDLines: a real-time line segment detector with a false detection control. Pattern Recognition Letters 32 (13), pp. 1633–1642. Cited by: §I.
  • [3] N. Audebert, B. Le Saux, and S. Lefèvre (2018) Beyond rgb: very high resolution urban remote sensing with multimodal deep networks. ISPRS Journal of Photogrammetry and Remote Sensing 140, pp. 20–32. Cited by: §II-A.
  • [4] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2481–2495. Cited by: §I.
  • [5] B. Bischke, P. Helber, J. Folz, D. Borth, and A. Dengel (2017) Multi-task learning for segmentation of building footprints with deep neural networks. arXiv preprint arXiv:1709.05932. Cited by: §I, §III-B.
  • [6] K. Bittner, S. Cui, and P. Reinartz (2017) Building extraction from remote sensing data using fully convolutional networks. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences 42. Cited by: §I.
  • [7] K. Bittner, F. Adam, S. Cui, M. Körner, and P. Reinartz (2018) Building footprint extraction from vhr remote sensing images combined with normalized dsms using fused fully convolutional networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (8), pp. 2615–2629. Cited by: §II-B.
  • [8] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §II-A.
  • [9] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818. Cited by: §I.
  • [10] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi (2016) Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing 54 (10), pp. 6232–6251. Cited by: §I.
  • [11] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258. Cited by: §IV-B.
  • [12] (Website) External Links: Link Cited by: §VI-A.
  • [13] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez (2017) A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857. Cited by: §II-C.
  • [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §I.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I, §II-A.
  • [16] Y. Hua, L. Mou, and X. X. Zhu (2019) Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional LSTM network for multi-label aerial image classification. ISPRS Journal of Photogrammetry and Remote Sensing 149, pp. 188–199. Cited by: §I.
  • [17] Y. Hua, L. Mou, and X. X. Zhu (2019) Relation network for multi-label aerial image classification. arXiv:1907.07274. Cited by: §I.
  • [18] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §I, §II-A.
  • [19] X. Huang and L. Zhang (2011) A multidirectional and multiscale morphological index for automatic building extraction from multispectral geoeye-1 imagery. Photogrammetric Engineering & Remote Sensing 77 (7), pp. 721–732. Cited by: §I.
  • [20] (Website) External Links: Link Cited by: §VI-A.
  • [21] V. Jampani, M. Kiefel, and P. V. Gehler (2016) Learning sparse high dimensional filters: image filtering, dense crfs and bilateral neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4452–4461. Cited by: §II-C.
  • [22] S. Ji, S. Wei, and M. Lu (2018) Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Transactions on Geoscience and Remote Sensing (99), pp. 1–13. Cited by: §II-B.
  • [23] Jülich Supercomputing Centre (2019) JUWELS: Modular Tier-0/1 Supercomputer at the Jülich Supercomputing Centre. Journal of large-scale research facilities 5 (A135). External Links: Document, Link Cited by: §IV-B.
  • [24] P. Kaiser, J. D. Wegner, A. Lucchi, M. Jaggi, T. Hofmann, and K. Schindler (2017) Learning aerial image segmentation from online maps. IEEE Transactions on Geoscience and Remote Sensing 55 (11), pp. 6054–6068. Cited by: §I, §II-B.
  • [25] B. Le Saux, A. Beaupère, A. Boulch, J. Brossard, A. Manier, and G. Villemin (2018) Railway detection: from filtering to segmentation networks. In IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 4819–4822. Cited by: §II-B.
  • [26] J. Li, X. Huang, and J. Gong (2019) Deep neural network for remote sensing image interpretation: status and perspectives. National Science Review. Cited by: §I.
  • [27] X. Li, X. Yao, and Y. Fang (2018) Building-a-nets: robust building extraction from high-resolution remote sensing images with adversarial networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (99), pp. 1–8. Cited by: §II-B.
  • [28] Y. Li and W. Ping (2018) Cancer metastasis detection with neural conditional random field. arXiv preprint arXiv:1806.07064. Cited by: §V-B.
  • [29] W. Liao, F. Van Coillie, L. Gao, L. Li, B. Zhang, and J. Chanussot (2018) Deep learning for fusion of apex hyperspectral and full-waveform lidar remote sensing data for tree species mapping. IEEE Access 6, pp. 68716–68729. Cited by: §I.
  • [30] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §I.
  • [31] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez (2017) Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark. In 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 3226–3229. Cited by: §II-B.
  • [32] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez (2017) Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Cited by: §VI-A.
  • [33] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez (2017) Convolutional neural networks for large-scale remote-sensing image classification. IEEE Transactions on Geoscience and Remote Sensing 55 (2), pp. 645–657. Cited by: §II-B.
  • [34] D. Marcos, M. Volpi, B. Kellenberger, and D. Tuia (2018) Land cover mapping at very high resolution with rotation equivariant cnns: towards small yet accurate models. ISPRS Journal of Photogrammetry and Remote Sensing 145, pp. 96–107. Cited by: §II-B.
  • [35] D. Marmanis, K. Schindler, J. D. Wegner, S. Galliani, M. Datcu, and U. Stilla (2018) Classification with an edge: improving semantic image segmentation with boundary detection. ISPRS Journal of Photogrammetry and Remote Sensing 135, pp. 158–172. Cited by: §II-B.
  • [36] L. Mou, L. Bruzzone, and X. X. Zhu (2019) Learning spectral-spatial-temporal features via a recurrent convolutional neural network for change detection in multispectral imagery. IEEE Transactions on Geoscience and Remote Sensing 57 (2), pp. 924–935. Cited by: §II-B.
  • [37] L. Mou, Y. Hua, and X. X. Zhu (2019) A relation-augmented fully convolutional network for semantic segmentation in aerial scenes. arXiv preprint arXiv:1904.05730. Cited by: §II-A.
  • [38] S. Paisitkriangkrai, J. Sherrah, P. Janney, and A. Van Den Hengel (2015) Effective semantic pixel labelling with convolutional networks and conditional random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 36–43. Cited by: §II-C.
  • [39] S. Paisitkriangkrai, J. Sherrah, P. Janney, and A. Van Den Hengel (2016) Semantic labeling of aerial and satellite imagery. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9 (7), pp. 2868–2881. Cited by: §II-C.
  • [40] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) Enet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147. Cited by: §I.
  • [41] (Website) Cited by: §IV-A.
  • [42] C. Qiu, M. Schmitt, C. Geiss, T. K. Chen, and X. X. Zhu (2020) A framework for large-scale mapping of human settlement extent from sentinel-2 images via fully convolutional neural networks. arXiv preprint arXiv:2001.11935. Cited by: §II-A.
  • [43] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §I.
  • [44] Y. Shi, Q. Li, and X. X. Zhu (2019) Building footprint generation using improved generative adversarial networks. IEEE Geoscience and Remote Sensing Letters 16 (4), pp. 603–607. Cited by: §I.
  • [45] Y. Shi, Q. Li, and X. X. Zhu (2020) Building segmentation through a gated graph convolutional neural network with deep structured feature embedding. ISPRS Journal of Photogrammetry and Remote Sensing 159, pp. 184–197. Cited by: §II-C.
  • [46] H. Su, V. Jampani, D. Sun, O. Gallo, E. Learned-Miller, and J. Kautz (2019) Pixel-adaptive convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11166–11175. Cited by: §II-C.
  • [47] C. Sutton and A. McCallum (2012) An introduction to conditional random fields. Foundations and Trends® in Machine Learning 4 (4), pp. 267–373. Cited by: §III-C.
  • [48] M. T. Teichmann and R. Cipolla (2018) Convolutional crfs for semantic segmentation. arXiv preprint arXiv:1805.04777. Cited by: §II-C, §III-C.
  • [49] M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios (2015) Building detection in very high resolution multispectral data with deep learning features. In 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 1873–1876. Cited by: §II-C.
  • [50] J. E. Vargas-Muñoz, S. Lobry, A. X. Falcão, and D. Tuia (2019) Correcting rural building annotations in openstreetmap using convolutional neural networks. ISPRS Journal of Photogrammetry and Remote Sensing 147, pp. 283–293. Cited by: §II-B.
  • [51] M. Volpi and D. Tuia (2018) Deep multi-task learning for a geographically-regularized semantic segmentation of aerial images. ISPRS journal of photogrammetry and remote sensing 144, pp. 48–60. Cited by: §II-A.
  • [52] J. Wang, X. Yang, X. Qin, X. Ye, and Q. Qin (2015) An efficient approach for automatic rectangular building extraction from very high resolution optical satellite imagery. IEEE Geoscience and Remote Sensing Letters 12 (3), pp. 487–491. Cited by: §I.
  • [53] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell (2018) Understanding convolution for semantic segmentation. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1451–1460. Cited by: §II-A.
  • [54] G. Wu, X. Shao, Z. Guo, Q. Chen, W. Yuan, X. Shi, Y. Xu, and R. Shibasaki (2018) Automatic building segmentation of aerial imagery using multi-constraint fully convolutional networks. Remote Sensing 10 (3), pp. 407. Cited by: §I.
  • [55] Y. Xu, L. Wu, Z. Xie, and Z. Chen (2018) Building extraction in very high resolution remote sensing imagery using deep learning and guided filters. Remote Sensing 10 (1), pp. 144. Cited by: §I.
  • [56] H. L. Yang, J. Yuan, D. Lunga, M. Laverdiere, A. Rose, and B. Bhaduri (2018) Building extraction at scale using convolutional neural network: mapping of the united states. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (8), pp. 2600–2614. Cited by: §II-B.
  • [57] J. Yuan and A. M. Cheriyadat (2014) Learning to count buildings in diverse aerial scenes. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 271–280. Cited by: §III-B.
  • [58] L. Zhang, X. Huang, B. Huang, and P. Li (2006) A pixel shape index coupled with spectral information for classification of high spatial resolution remotely sensed imagery. IEEE Transactions on Geoscience and Remote Sensing 44 (10), pp. 2950–2961. Cited by: §I.
  • [59] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §I.
  • [60] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr (2015) Conditional random fields as recurrent neural networks. In Proceedings of the IEEE international conference on computer vision, pp. 1529–1537. Cited by: §II-C, §II-C, §III-A, §III-C.
  • [61] X. X. Zhu, D. Tuia, L. Mou, G. Xia, L. Zhang, F. Xu, and F. Fraundorfer (2017) Deep learning in remote sensing: a comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5 (4), pp. 8–36. Cited by: §I.
  • [62] X. Zhuo, F. Fraundorfer, F. Kurz, and P. Reinartz (2018) Optimization of openstreetmap building footprints based on semantic information of oblique uav images. Remote Sensing 10 (4), pp. 624. Cited by: §II-C.