1 Introduction
Building footprint generation is an active topic in the field of remote sensing. Recently, it has received considerable attention due to its huge potential in autonomous driving, virtual reality, urban planning, environmental, and demographic applications. Manual extraction of buildings from optical images is time-consuming and difficult in large-scale practice. In contrast, semantic segmentation, which aims to classify each pixel with a corresponding class, is a comparatively inexpensive and time-saving technique for extracting building footprints. Various semi-automatic and automatic methods [bib:Ok2013Automated, bib:Xu2018Building, bib:Bittner2018Building, bib:Chen2019Aerial] have been developed to improve segmentation accuracy. Traditionally, feature extraction and classification are the two main steps of these methods, and the extraction of such handcrafted features usually requires strong domain-specific knowledge.
In recent years, deep learning has garnered great success in semantic segmentation. In particular, deep convolutional neural networks (DCNNs) have shown promising results due to their high capacity for learning from data, and have instigated compelling advances over traditional semantic segmentation methods [bib:zhu2017deep]. However, exploiting DCNNs for semantic segmentation tasks still raises significant challenges. The convolutional layer of a DCNN is a weight-sharing architecture with both shift-invariant and spatially invariant characteristics. While this invariance is clearly desirable for high-level vision tasks, it may hamper low-level tasks such as pose estimation and semantic segmentation, where precise localization is required rather than abstraction of spatial details. For instance, coarse segmentation outputs such as non-sharp boundaries and blob-like shapes are caused by convolution filters with large receptive fields and pooling layers in DCNNs. Moreover, DCNNs fail to recover fine local details without considering the interactions between pixels.
To overcome these issues, probabilistic graph models, such as the conditional random field (CRF) [bib:chen2017] and the Markov random field (MRF) [bib:liu2015], have been introduced and connected to DCNNs at the final layer. The main concept of using a CRF for semantic segmentation is to transform the problem of pixel-wise classification into one of probabilistic inference, which assumes that similar pixels should have the same labels. This substantially improves the pixel-wise label predictions, generating precise borders and exhaustive segmentation. In [bib:chen2017], instead of using the CRF as a post-processing step, the authors propose an end-to-end architecture that combines an FCN with a fully connected CRF. However, these frameworks have not sufficiently extracted the features from the images. Different feature levels have different properties for semantic segmentation: low-level features are rich in spatial detail but lack semantic information, while high-level features are the opposite, so they are naturally complementary. Another issue with CRFs is that their information propagation is not sufficient.
In this work, we propose a generic framework for semantic segmentation that integrates deep structured feature embedding (DSFE) and a graph convolutional network. In order to extract more comprehensive and representative features, we exploit deep structured feature embedding techniques to enhance feature fusion by incorporating multi-level characteristics. Furthermore, we propose a new graph convolutional network, the gated graph convolutional network (GGCN). A GCN can aggregate information from neighbouring nodes (short range), which allows the model to learn about local structures. A recurrent neural network (RNN) with gated recurrent units (GRUs) has proven successful at modelling long-term dependencies in sequential data. Hence, we adopt an RNN with GRUs for long-range information propagation. The proposed network integrates the two architectures, thus taking into account both local and global contextual dependencies, which is useful for semantic segmentation tasks. As a consequence, DSFE-GGCN is a trainable end-to-end framework. We show that joint learning of the deep structured feature embedding and GGCN parameters results in considerable performance gains.
Contributions
The contributions of this work are summarized as follows:

A generic framework for semantic segmentation is proposed, which integrates deep structured feature embedding and a graph convolutional neural network into an end-to-end workflow.

We propose a novel network architecture, called the "gated graph convolutional neural network," which combines an RNN with GRUs for long-distance information propagation and a GCN for short-distance information propagation.

An effective four-step preprocessing approach is proposed for data augmentation, especially for medium-resolution satellite imagery.

The performance of different DCNNs and the proposed framework is analyzed through a systematic investigation. Our framework with GGCN surpasses the state-of-the-art approaches to building footprint extraction.
2 Related Work
2.1 Semantic segmentation with DCNNs
The fully convolutional network (FCN) was first proposed in [bib:long2015] for the task of semantic segmentation, in which convolutional layers take the place of fully connected layers. The FCN makes training more efficient and allows an arbitrary input size at inference. A more memory-efficient approach with an alternative decoder variant, SegNet, was proposed in [bib:badrinarayanan2015]: the stored indices of the max-pooling step in the downsampling path are used by the decoder for upsampling. Another variant of the encoder-decoder architecture is U-Net [bib:ronneberger2015], whose long skip connections enable the recovery of the information lost to downsampling in the encoder. One key issue with fully convolutional neural networks is that the spatial resolution is significantly downsampled by operations such as strided convolutional layers or pooling layers. In order to overcome this poor localization property, [bib:zheng2015] proposed an approach to improve the spatial resolution, using the probabilistic graph model CRF to achieve fine-grained boundaries. Instead of using the CRF as a post-processing step, DeepLab-CRF [bib:chen2017] introduces a fully connected CRF layer, which leads to an end-to-end trainable network.
2.2 Graph model
A graph model is a probabilistic model that encodes a distribution based on a graph-based representation. The Markov random field (MRF) is a classic graph model, which uses an undirected graph to describe the joint probability distribution of random variables. It has been applied to many image-processing tasks, including image co-registration, image segmentation, and image super-resolution. The MRF takes into account the relationships of the neighbours to infer the most likely label of a pixel. The conditional random field (CRF) is an extension of the MRF that models the conditional probability distribution instead of the joint probability distribution. As a discriminative model, the CRF shows better performance when samples are limited. The combination of DCNNs and the graph model CRF [bib:zheng2015, bib:chen2017] can produce high-resolution predictions for better segmentation. Recent work [bib:bruna2013] has extended DCNNs to topologies that differ from the low-dimensional grid structure, but due to significant computational drawbacks it is impractical for real-world use. Henaff et al. [bib:henaff2015] and Defferrard et al. [bib:defferrard2016] further improved the GCN to successfully overcome this issue. Grid-like data can be interpreted as a special type of graph data, where the nodes lie on the grid and the number of neighbours is fixed. In this work, we propose a gated graph convolutional network, which is a trainable inference system based on a GCN and an RNN with GRUs.
2.3 Building footprint extraction
Building footprint generation currently excites a great deal of interest and is an active field of research in remote sensing, photogrammetry, and computer vision. The resulting building footprint maps are used in many important applications to analyze the process of urbanization, such as urban growth and sustainable urban development.
In [bib:yuan2018], the authors propose a multi-stage ConvNet with a bilinear-interpolation upsampling operation; the trained model achieves superior performance on very-high-resolution aerial imagery. Recently, an end-to-end trainable active contour model (ACM) was developed for building instance extraction [bib:marcos2018], which learns ACM parameterizations using a DCNN. In [bib:huang2019], a residual refinement network was proposed to extract building footprints using aerial images and LiDAR point clouds. In [bib:shi2018], the authors exploit an improved conditional Wasserstein generative adversarial network to generate building footprints automatically. Recent work [bib:wang2017] has shown that most related tasks, such as building segmentation, building height estimation, and building contour extraction, are still difficult for modern convolutional networks. In this work, we show a significant performance improvement in building footprint extraction using our proposed novel framework.
3 Methodology
The details of the DSFE-GGCN framework are introduced in this section. The workflow of the proposed method is shown in Fig. 1. An image can be generalized as a graph whose nodes lie on a two-dimensional grid, with each pixel representing a node. The embedding vectors can be computed initially from node inputs, e.g., node type embeddings, and then propagated on the graph to aggregate information from the local neighborhood.
3.1 Deep structured feature embedding
Deep embedding methods typically map images into an embedding space where their distances preserve relative similarity. In general, representations of the data can be learned by graph embedding techniques [bib:yan2007], which take into account the relationships within the data. In addition, data from different sources, such as images, point clouds, and social media data, can be transformed into a feature space, which can be further used for segmentation or other tasks. In this study, the data source is only imagery. Hence, we exploit a more efficient approach for feature embedding that uses DCNNs as the feature extractor.
However, the resolution of the later layers in the neural network is extremely downsampled, a phenomenon caused by strided convolution, max-pooling, and other operations. Several methods have been introduced to recover precise information from the downsampled feature maps. One common approach is to utilize interpolation techniques [bib:badrinarayanan2015], which are both computationally cheap and memory-saving. An alternative is deconvolution, in which recorded indices of the pooling operation are used to retrieve information from the feature maps [bib:noh2015]. Recently, long skip connections between the contracting and expanding paths were introduced to retrieve detailed spatial information from the high-level feature layers [bib:ronneberger2015]. In combination with the DenseNet block [bib:huang2017], FC-DenseNet was proposed in [bib:jegou2017], where the upsampling path is composed of deconvolution, unpooling, and long skip connections. Consequently, all the feature maps from deconvolution, unpooling, or skip connections are exploited for the computation in the upsampling path of the dense blocks. Moreover, recent work shows that multiple DCNN features extracted from different networks can be complementary and could be fused to improve segmentation accuracy. However, the method for fusing multiple features is still an open problem that needs systematic investigation.
As mentioned above, low-level features yield better localization and high-level features give more comprehensive semantics. Therefore, in this work, we concatenate features from different levels progressively in order to propagate information about localization, semantics, and other properties through the graph convolutional neural network.
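As a concrete illustration, the progressive multi-level fusion can be sketched as below; the channel counts and the nearest-neighbour upsampling are our own simplifications, not the paper's exact configuration.

```python
import numpy as np

def upsample_nn(fmap, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_features(feature_maps, target_hw):
    """Bring every feature map to a common spatial size, then concatenate
    them along the channel axis (multi-level feature fusion)."""
    resized = [upsample_nn(f, target_hw // f.shape[1]) for f in feature_maps]
    return np.concatenate(resized, axis=0)

# Hypothetical encoder outputs, from low-level (spatial detail) to
# high-level (semantics); shapes are illustrative only.
low = np.random.rand(64, 64, 64)
mid = np.random.rand(128, 32, 32)
high = np.random.rand(256, 16, 16)
fused = fuse_features([low, mid, high], target_hw=64)
print(fused.shape)  # (448, 64, 64)
```

The fused tensor then provides the per-pixel (per-node) feature vectors that are fed to the graph network.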
3.2 Gated graph convolutional neural network
An undirected and connected graph G = (V, E) consists of a set of nodes V and edges E. The unnormalized graph Laplacian matrix is defined as:

(1) L = D - A

where A is the adjacency matrix representing the topology of G, and D is the degree matrix, which is calculated by D_ii = \sum_j A_ij. The graph Laplacian L is symmetric and positive semi-definite; therefore, its eigenvalue decomposition can be expressed as:

(2) L = U \Lambda U^T

where U = [u_0, ..., u_{n-1}] are the orthonormal eigenvectors, known as the graph Fourier modes, and \Lambda = diag(\lambda_0, ..., \lambda_{n-1}) are the eigenvalues of L, forming a non-negative diagonal matrix. Assuming a signal x on the graph nodes, its graph Fourier transform can be formulated as \hat{x} = U^T x. If g is a filter, the convolution of g and x is written as:

(3) g * x = U g_\theta(\Lambda) U^T x

where g_\theta(\Lambda) is the spectral representation of the filter. Rather than computing the Fourier transform explicitly, the filter coefficients can be parameterized as a polynomial g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k, as shown in [bib:henaff2015]. With the polynomial parametrization of the filter, the spectral filter is exactly K-localized in space. Moreover, the learning complexity is O(K), the filter support size, which is the same complexity as classical DCNNs.
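A minimal numerical sketch of Eqs. (1)-(3) on a toy four-node path graph, assuming NumPy; it checks that the Laplacian is positive semi-definite and that the graph Fourier transform is invertible.

```python
import numpy as np

# Toy 4-node path graph: edges 0-1, 1-2, 2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))   # degree matrix, D_ii = sum_j A_ij
L = D - A                    # unnormalized graph Laplacian (Eq. 1)

lam, U = np.linalg.eigh(L)   # eigendecomposition L = U diag(lam) U^T (Eq. 2)
assert np.all(lam >= -1e-10) # positive semi-definite

x = np.array([1.0, 2.0, 3.0, 4.0])  # a signal on the graph nodes
x_hat = U.T @ x              # graph Fourier transform
x_rec = U @ x_hat            # inverse transform recovers the signal
```

Spectral filtering (Eq. 3) then amounts to rescaling `x_hat` by `g(lam)` before transforming back.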
In order to avoid explicit multiplication in the spectral domain, the spectral representation of the filter can alternatively be approximated by a Chebyshev polynomial expansion, which is formulated as:

(4) g_\theta(\Lambda) \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda})

where T_k is the Chebyshev polynomial of order k and \tilde{\Lambda} = 2\Lambda / \lambda_max - I. The graph convolution can then be defined as:

(5) g * x \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}) x

where \tilde{L} = 2L / \lambda_max - I, and \lambda_max is the maximal eigenvalue. In [bib:kipf2016], the authors further simplify the Chebyshev framework, setting K = 1 and assuming \lambda_max \approx 2, allowing them to redefine a single convolutional layer as simply:

(6) H^{(l+1)} = \sigma( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} )

where H^{(l)} is the hidden layer. By taking into account the self-connections, the original adjacency matrix A of the graph is transformed to \tilde{A} = A + I_N, where I_N is the identity matrix. W^{(l)} is the trainable weight matrix, and the new degree matrix can be calculated by \tilde{D}_ii = \sum_j \tilde{A}_ij. The function \sigma denotes a non-linear activation function. This simplified form improves computational performance on larger graphs and predictive performance on small training sets.
Propagation model
The propagation process can be formulated as:

(7) m_v^{(t+1)} = M( { h_w^{(t)} : w \in N(v) } )

(8) h_v^{(t+1)} = U( h_v^{(t)}, m_v^{(t+1)} )

where m_v^{(t+1)} is the message layer at time step t+1, which represents the messages propagated from the neighbours N(v) to the node v. The message layer at time step t+1 then serves as input to update the hidden layer h_v^{(t+1)} with the update function U. Our proposed method uses a GCN as the message function M, which makes it easy for the propagation model to learn to propagate the node embeddings from node v to all nodes reachable from v. We adopt gating techniques to surpass GCN performance, because with their aid each node can maintain its own memory while gathering valuable information from its neighbours.
The unrolled propagation model at time step t can be written as:

(9) m_v^{(t)} = \sum_{w \in N(v)} \hat{A}_vw h_w^{(t-1)} W

(10) z_v^{(t)} = \sigma( W^z m_v^{(t)} + U^z h_v^{(t-1)} )

(11) r_v^{(t)} = \sigma( W^r m_v^{(t)} + U^r h_v^{(t-1)} )

(12) \tilde{h}_v^{(t)} = \rho( W^h m_v^{(t)} + U^h ( r_v^{(t)} \odot h_v^{(t-1)} ) )

(13) h_v^{(t)} = (1 - z_v^{(t)}) \odot h_v^{(t-1)} + z_v^{(t)} \odot \tilde{h}_v^{(t)}

where r_v^{(t)} and z_v^{(t)} are the reset and update gates, and W^z, W^r, W^h, U^z, U^r, U^h are learnable weights for the different gates. The function \rho is the ReLU function, \sigma is the logistic sigmoid function, and \odot is the element-wise (Hadamard) product. The initial hidden representation of each node is taken from the feature vectors of the DSFE step. For a given time step t, the messages from the neighbourhood of node v are aggregated using the GCN. After that, the hidden state of the next time step is updated by gated recurrent units, which use the hidden state and the message at time step t as input. With the help of the reset gate and the update gate in the GRU [bib:cho2014], each node can maintain its own memory and extract useful information from incoming messages. As the number of time steps increases, the model is capable of capturing long-range dependencies, which are difficult to model in a vanilla GCN.
Prediction model
The node classification is defined as:

(14) \hat{y}_v = softmax( W^o h_v^{(T)} )

Since we have transformed the binary semantic segmentation problem into a multi-label pixel labeling task, a softmax with a negative log-likelihood loss function is used to predict the probability of each node.
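The propagation (Eqs. 9-13) and prediction (Eq. 14) steps can be sketched as follows; the weight names, sizes, random initialization, and the trivial adjacency matrix are our illustrative assumptions, not the trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ggcn_step(A_hat, H, P):
    """One unrolled GGCN propagation step: a graph convolution aggregates
    neighbour messages, then GRU-style reset/update gates refresh each
    node's hidden state. P holds illustratively named weight matrices."""
    M = A_hat @ H @ P["Wm"]                              # Eq. 9: GCN message
    Z = sigmoid(M @ P["Wz"] + H @ P["Uz"])               # Eq. 10: update gate
    R = sigmoid(M @ P["Wr"] + H @ P["Ur"])               # Eq. 11: reset gate
    H_cand = np.maximum(M @ P["Wh"] + (R * H) @ P["Uh"], 0.0)  # Eq. 12 (ReLU)
    return (1.0 - Z) * H + Z * H_cand                    # Eq. 13: gated update

d = 8
rng = np.random.default_rng(0)
P = {k: rng.standard_normal((d, d)) * 0.1
     for k in ["Wm", "Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]}
A_hat = np.eye(4)                # trivial normalized adjacency, demo only
H = rng.standard_normal((4, d))  # initial node states from the DSFE step
for _ in range(3):               # three propagation time steps
    H = ggcn_step(A_hat, H, P)
# Eq. 14: per-node probabilities over the 11 distance classes.
probs = softmax(H @ rng.standard_normal((d, 11)))
```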
4 Experiments
4.1 Datasets
In this work, we use PlanetScope satellite imagery [bib:planet] with three channels (R, G, B) at a 3 m spatial resolution. The imagery is acquired by Doves, which provide complete coverage of the Earth once a day. The study sites cover four cities: (1) Munich, Germany; (2) Rome, Italy; (3) Paris, France; and (4) Zurich, Switzerland. The corresponding building footprint layer is downloaded from OpenStreetMap (OSM) [bib:osm]. The images are cropped into patches of a fixed size, with an overlap of 19 pixels in one direction. In the end, 48,000 sample patches are generated; 80% of the patches are used for training and 20% for testing, and the training and testing data are spatially separated.
4.2 Preprocessing
The datasets utilized in this work consist of PlanetScope satellite imagery and OSM building footprints as ground truth. However, since the data sources for OSM differ from the satellite imagery, there are likely inconsistencies between the OSM building footprints and the satellite imagery. Therefore, we carry out preprocessing steps to limit these inconsistencies before the experiments, which include band normalization, co-registration, refinement, and a truncated signed distance map (TSDM) (see Fig. 2). In the next sections, we mainly focus on the co-registration and TSDM steps.
4.2.1 Co-registration
One inconsistency is misalignment between the OSM building footprints and the satellite imagery, which is caused by different projections and accuracy levels of the data sources. Fig. 3(a) shows an example of an OSM building footprint overlaid on the corresponding satellite imagery. There are noticeable misalignments between the building footprint and the satellite imagery. These misalignments lead to inaccurate training samples and need to be corrected.
The co-registration process includes several steps: (1) the satellite imagery is transformed from RGB to grayscale; (2) the Gaussian gradient of the grayscale imagery is calculated; (3) the cross-correlation between the gradient magnitude of the grayscale image and the building footprints is computed; (4) the pixel with the maximum cross-correlation is found, from which the offsets in the row and column directions are derived. Fig. 3(b) shows the result after co-registration.
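The four steps can be sketched as follows, assuming NumPy; as a simplification, a plain finite-difference gradient magnitude stands in for the Gaussian gradient, and an FFT implements the (circular) cross-correlation.

```python
import numpy as np

def xcorr_offset(a, b):
    """Signed (row, col) shift of image `a` relative to `b`, found at the
    peak of their circular cross-correlation (step 4)."""
    c = np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))).real
    r, col = np.unravel_index(np.argmax(c), c.shape)
    h, w = c.shape  # wrap indices to signed shifts
    return (r - h if r > h // 2 else r, col - w if col > w // 2 else col)

def coregister(image_rgb, footprints):
    """Sketch of the four co-registration steps above."""
    gray = image_rgb.mean(axis=2)        # step 1: RGB -> grayscale
    gy, gx = np.gradient(gray)           # step 2: gradient (plain, not Gaussian)
    grad = np.hypot(gy, gx)              # gradient magnitude
    return xcorr_offset(grad, footprints)  # steps 3-4: correlation peak
```

For example, correlating a footprint mask against a copy of itself shifted by (3, 5) pixels recovers exactly that offset.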
4.2.2 Truncated signed distance map
In order to incorporate both semantic information about class labels and geometric properties into the training of the network, the distances of pixels to building boundaries are extracted as output representations. In our experiment, the value of the signed distance function (SDF) is determined by the distance between a pixel and its nearest point on the boundary: positive values imply that the pixel lies within a building, and negative values indicate that it lies outside.
We then truncate the distance at a given threshold to incorporate only the pixels closest to the border. In this case, the problem in our research becomes a multi-label segmentation task, in which the detailed signed distance map enhances the prediction result. The truncated signed distance function can be expressed as:

(15) TSDM(p) = sign(p) \cdot min( d(p), T )

where d(p) denotes the Euclidean distance between the pixel p and its nearest point on the boundary of the building. The term sign(p) is a sign function indicating whether the pixel lies inside or outside an object, and T is the truncation threshold.
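Assuming SciPy's Euclidean distance transform is available, the map of Eq. (15) can be sketched as:

```python
import numpy as np
from scipy import ndimage

def truncated_sdm(mask, threshold=5):
    """Truncated signed distance map in the spirit of Eq. (15): positive
    Euclidean distances inside buildings, negative outside, truncated at
    +/- threshold."""
    inside = ndimage.distance_transform_edt(mask)       # > 0 inside buildings
    outside = ndimage.distance_transform_edt(1 - mask)  # > 0 outside buildings
    return np.clip(inside - outside, -threshold, threshold)

mask = np.zeros((12, 12), dtype=int)
mask[3:9, 3:9] = 1                       # a 6x6 toy building
tsdm = truncated_sdm(mask)
labels = (tsdm + 5).round().astype(int)  # maps [-5, 5] onto 11 classes
```

The last line illustrates how the truncated distances can be binned into the class labels used for training.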
4.3 Experimental setup
We use 11 classes for the truncated signed distance map, whose values lie in [-5, 5], with the truncation threshold set to 5. For all networks, stochastic gradient descent (SGD) is used with a fixed learning rate. The negative log-likelihood loss (NLLLoss) is adopted as the loss function. The proposed framework is implemented in PyTorch, and experiments are run on an NVIDIA Tesla P100 16 GB GPU. Several semantic segmentation methods, including FCN-32s, SegNet, FCN-16s, U-Net, FCN-8s, ResNet-DUC, CWGAN-GP, FC-DenseNet, GCN, GraphSAGE, and GGNN, are chosen as comparison algorithms.
4.4 Numerical results
Three metrics are selected to evaluate the results in the following experiments: overall accuracy (OA), F1 score, and Intersection over Union (IoU). The experiments are carried out as follows. First, as a baseline, we assess the capability of different deep convolutional neural networks for building footprint extraction. Then, we use different DCNNs as the deep structured feature embedding, combined with the GCN [bib:kipf2016], to decide which DCNN is the best feature extractor for our proposed framework. Finally, we use the best feature extractor for the DSFE and compare the proposed framework with different graph models.
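For a binary building mask, the three metrics reduce to counts of true/false positives and negatives; a minimal sketch:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """OA, F1, and IoU for binary masks (building = 1, background = 0)."""
    tp = np.sum((pred == 1) & (gt == 1))  # true positives
    tn = np.sum((pred == 0) & (gt == 0))  # true negatives
    fp = np.sum((pred == 1) & (gt == 0))  # false positives
    fn = np.sum((pred == 0) & (gt == 1))  # false negatives
    oa = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)      # F1 score
    iou = tp / (tp + fp + fn)             # Intersection over Union
    return oa, f1, iou

pred = np.array([1, 1, 0, 0])
gt = np.array([1, 0, 0, 0])
oa, f1, iou = segmentation_metrics(pred, gt)
print(oa, f1, iou)  # 0.75 0.666... 0.5
```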
4.4.1 Baseline with different DCNNs
In this section, the performance of state-of-the-art DCNNs for building footprint generation is first investigated, which indicates the capability of each DCNN for feature extraction and precise localization. The results are shown in Table 1.
Methods  OA  F1  IoU

FCN-32s  0.7318  0.2697  0.1559
FCN-16s  0.7698  0.3993  0.2494
ResNet-DUC  0.7945  0.4542  0.2930
ENet  0.8243  0.5427  0.3724
SegNet  0.8261  0.5558  0.3848
U-Net  0.8412  0.6043  0.4329
FCN-8s  0.8472  0.6222  0.4513
CWGAN-GP  0.8483  0.6268  0.4562
FC-DenseNet  0.8551  0.6328  0.4628
FCN-32s and FCN-16s exhibit poor performance, since the feature maps of the later layers contain only high-level semantics with poor localization. ResNet-DUC achieves better results than the previous two because of its hybrid dilated convolution and dense upsampling convolution; however, it is limited by the lack of skip connections. Max-pooling indices are reused in SegNet during the decoding process, which reduces the number of network parameters and leads to efficient training. However, as only the max-pooling indices are passed to the decoder, some local details cannot be recovered; e.g., small buildings are neglected. FCN-8s and U-Net outperform the previous networks due to their concatenation of low-level features. Compared to the other CNN models, CWGAN-GP shows promising results for building footprint generation; its enhanced performance stems from the min-max competition between the discriminator and the generator of the GAN.
FC-DenseNet outperforms all other semantic segmentation neural networks in both numerical accuracy and visual results. On one hand, the DenseNet block concatenates the different features learned by the convolution layers, which boosts the input diversity of subsequent layers and promotes more efficient training. On the other hand, detailed spatial information can be propagated by shortcut connections between the convolution and deconvolution paths, which enhances the recovery of fine-grained segmentation in the deconvolution path.
4.4.2 Proposed framework with different DSFE
In order to choose the best feature extractor for our task, three representative DCNNs have been adopted in the proposed framework with the graph convolutional network. The statistical results are shown in Table 2.
Methods  OA  F1  IoU

DSFE(U-Net)-GCN  0.8396  0.6258  0.4544
DSFE(FCN-8s)-GCN  0.8594  0.6320  0.4611
DSFE(FC-DenseNet)-GCN  0.8640  0.6677  0.5012
From Table 2, we can see that different DCNNs exhibit different capabilities for feature embedding. It is clear that FC-DenseNet, as the feature extractor in DSFE with GCN, produces the best result. This is due to the strength of FC-DenseNet, which extends the DenseNet architecture to a U-Net-like network for semantic segmentation. In the DenseNet block, through feature reuse, there are shorter connections between layers close to the input and those close to the output, which forces the intermediate layers to learn discriminative features. Moreover, DenseNet combines features by iteratively concatenating them, which contributes to improved information and gradient propagation in the network.
As can be seen in Fig. 5, DSFE(FC-DenseNet)-GCN gives the best result, which implies that FC-DenseNet is a powerful tool for extracting different levels of features.
4.4.3 Proposed framework with different graph models
In this section, we choose FC-DenseNet as the feature extractor in the DSFE and combine it with different graph models. The results are summarized in Table 3.
Methods  OA  F1  IoU

FC-DenseNet [bib:jegou2017]  0.8551  0.6328  0.4628
DSFE-CRF [bib:chen2017]  0.8592  0.6415  0.4757
DSFE-GCN [bib:kipf2016]  0.8640  0.6677  0.5012
DSFE-GraphSAGE [bib:hamilton2017]  0.8719  0.6726  0.5067
DSFE-GGNN [bib:li2016]  0.8787  0.6778  0.5123
DSFE-GGCN  0.8881  0.6899  0.5251
The results show that DSFE-GGCN delivers the best performance for our task; its IoU increases by 6.2% compared to the best DCNN result. Fig. 6 shows a visual comparison of all the networks used in Section 4. We mark the key region with a yellow bounding box; close-ups of the key regions are shown in Fig. 7.
5 Discussion
5.1 Additional dataset
We validate our proposed method with experiments on the ISPRS 2D Semantic Labeling Contest dataset, which covers the city of Potsdam and comprises 38 tiles of aerial imagery [bib:isprs]. In order to maintain consistency, images with three spectral bands (red, green, blue) are used in this experiment, without a digital surface model (DSM). Each aerial image has a spatial resolution of 5 cm. The corresponding ground truth, which includes six classes (impervious surfaces, building, low vegetation, trees, cars, and clutter/background), is also provided for evaluating the results. For our experiments, we split the 38 tiles into a training subset (tile numbers 2_10 to 6_15) and a test subset (tile numbers 7_07 to 7_13). The building class is regarded as building and the other five classes are considered non-building. We cut 16,000 patches from the training subset and 3,573 patches from the test subset. As mentioned in the previous section, the TSDM data augmentation step targets medium-resolution imagery, and here the ground truth is already well co-registered with the optical imagery. Therefore, no data preprocessing step is applied to the ISPRS dataset; the optical images are fed directly into the networks.
5.2 Experimental setup
The SGD optimizer is adopted and the initial learning rate is set to 10^-4, which is reduced by a factor of ten whenever the validation loss saturates. Once the learning rate drops below 10^-8, training stops. The number of epochs is in the range of 120 to 160 for all networks, and the training batch size is 4.
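The schedule described above can be sketched as a small plateau test; the patience value and the exact notion of "saturated" are our illustrative assumptions.

```python
def reduce_on_plateau(lr, val_losses, patience=3, factor=0.1, min_lr=1e-8):
    """Divide the learning rate by ten when the last `patience` validation
    losses show no improvement over the earlier minimum, and signal that
    training should stop once the rate falls below `min_lr`."""
    if len(val_losses) > patience and \
            min(val_losses[-patience:]) >= min(val_losses[:-patience]):
        lr *= factor          # validation loss saturated: reduce the rate
    return lr, lr < min_lr    # (new rate, stop-training flag)

# A stalled loss history triggers one reduction; an improving one does not.
lr, stop = reduce_on_plateau(1e-4, [1.0, 0.9, 0.9, 0.9, 0.9])
print(lr, stop)  # 1e-05 False
```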
5.3 Experimental results
The metrics OA, F1 score, and IoU are used to evaluate the results. Fig. 8 shows a visual comparison of the results predicted on the ISPRS Potsdam dataset using different networks.
FCN-8s detects a significantly higher percentage of buildings than FCN-16s and FCN-32s by combining predictions not only from the final layer but also from coarser layers, allowing more information to be preserved. The boundaries of buildings detected by U-Net are sharper than those of SegNet or ENet. However, unlike in the medium-resolution case, the completeness of the results obtained by SegNet and ENet is better than for U-Net, which indicates that spatial information propagation is undertaken more effectively by recording the pooling indices than by concatenating the low-level features when the resolution is high enough, i.e., when comprehensive spatial information exists. Finer details are captured by the proposed framework with different graph models, such as CRF-as-RNN, GCN, and GGCN, than by CNN-only methods, which confirms the effectiveness of the graph model in modelling the interactions among pixels and in spatial information propagation. Compared to CRF-as-RNN and GCN, the proposed GGCN gives a better result. A close-up view of the key region is shown in Fig. 9. It can be seen that DSFE-GGCN delivers better completeness and sharpness in building extraction than the other methods.
Methods  OA  F1  IoU

FCN-32s  0.7371  0.6186  0.4478
FCN-16s  0.8247  0.7429  0.5910
ResNet-DUC  0.7475  0.6766  0.5051
ENet  0.7711  0.7764  0.6110
SegNet  0.8948  0.8511  0.7408
U-Net  0.8892  0.8392  0.7229
FCN-8s  0.8617  0.7986  0.6647
CWGAN-GP  0.8926  0.8504  0.7397
FC-DenseNet  0.9186  0.9182  0.8789
DSFE-GCN  0.9221  0.9375  0.9097
DSFE-GGCN  0.9271  0.9422  0.9196
Table 4 summarizes the results of the different deep convolutional neural networks and the proposed framework on the ISPRS dataset. As can be seen, the proposed DSFE-GGCN/DSFE-GCN framework contributes a significant improvement over the DCNNs. Moreover, compared to DSFE-GCN, DSFE-GGCN can effectively propagate information over both short and long ranges, which leads to better results.
6 Conclusion
In this work, we develop a novel framework for semantic segmentation that combines deep structured feature embedding and a graph convolutional network. Specifically, we propose a gated graph convolutional network that improves information propagation by combining an RNN with a GCN. Our proposed framework outperforms the state-of-the-art methods for building footprint extraction. Although we have used building footprint extraction as the practical application, the proposed method can be generally applied to other binary or multi-label segmentation tasks, such as road extraction, settlement layer extraction, or semantic segmentation of very-high-resolution data in general. In addition, the proposed GGCN can work directly with unstructured data, such as point clouds and social media text messages.
7 Acknowledgments
This work is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. ERC-2016-StG-714087, acronym: So2Sat, www.so2sat.eu), the Helmholtz Association under the framework of the Young Investigators Group "SiPEO" (VH-NG-1018, www.sipeo.bgu.tum.de), Munich Aerospace e.V. Fakultät für Luft- und Raumfahrt, and the Bavaria California Technology Center (Project: Large-Scale Problems in Earth Observation). The authors thank the Gauss Centre for Supercomputing (GCS) e.V. for funding this project by providing computing time on the GCS Supercomputer SuperMUC at the Leibniz Supercomputing Centre (LRZ) and on the supercomputer JURECA at Forschungszentrum Jülich. The authors thank Planet for providing the datasets.
8 References
(1) A. O. Ok, "Automated detection of buildings from single VHR multispectral images using shadow information and graph cuts," ISPRS J. Photogramm. Remote Sens., vol. 86, pp. 21-40, 2013.
 (2) Y. Xu, L. Wu, Z. Xie, and Z. Chen, “Building Extraction in Very High Resolution Remote Sensing Imagery Using Deep Learning and Guided Filters,” Remote Sensing, vol. 10, pp. 144, Jan. 2018.
(3) K. Bittner, F. Adam, S. Cui, M. Körner, and P. Reinartz, "Building Footprint Extraction From VHR Remote Sensing Images Combined With Normalized DSMs Using Fused Fully Convolutional Networks," IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 11, pp. 2615-2629, 2018.
(4) Q. Chen, L. Wang, Y. Wu, G. Wu, Z. Guo, and S. Waslander, "Aerial imagery for roof segmentation: A large-scale dataset towards automatic mapping of buildings," ISPRS J. Photogramm. Remote Sens., vol. 147, pp. 42-55, 2019.
(5) X. X. Zhu, D. Tuia, L. Mou, G. Xia, L. Zhang, F. Xu, and F. Fraundorfer, "Deep learning in remote sensing: a comprehensive review and list of resources," IEEE Geosci. Remote Sens. Mag., vol. 5, no. 4, pp. 8-36, 2017.

(6) Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang, "Semantic Image Segmentation via Deep Parsing Network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1377-1385.
(7) L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834-848, 2017.
(8) J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3431-3440.
 (9) V. Badrinarayanan, A. Handa, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling,” arXiv preprint arXiv:1505.07293, 2015.
 (10) O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention, 2015, pp. 234–241.
 (11) S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, “Conditional random fields as recurrent neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1529–1537.
 (12) S. Jégou, M. Drozdzal, D. Vázquez, A. Romero, and Y. Bengio, “The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation,” arXiv preprint arXiv:1611.09326, 2017.
 (13) G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4700–4708.
 (14) T. Akilan, Q. M. Wu, and W. Jiang, “A feature embedding strategy for high-level CNN representations from multiple convnets,” in Proc. IEEE Conf. Signal and Information Processing, 2017, pp. 1195–1199.
 (15) H. Noh, S. Hong, and B. Han, “Learning Deconvolution Network for Semantic Segmentation,” in Proc. Int. Conf. Comput. Vision, 2015, pp. 1520–1528.
 (16) O. A. Penatti, K. Nogueira, and J. A. dos Santos, “Do Deep Features Generalize from Everyday Objects to Remote Sensing and Aerial Scenes Domains?” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2015, pp. 44–51.
 (17) J. Yuan, “Learning Building Extraction in Aerial Scenes with Convolutional Networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 11, pp. 2793–2798, Nov. 2018.
 (18) D. Marcos, D. Tuia, B. Kellenberger, L. Zhang, M. Bai, R. Liao, and R. Urtasun, “Learning deep structured active contours endtoend,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
 (19) J. Huang, X. Zhang, Q. Xin, Y. Sun, and P. Zhang, “Automatic building extraction from high-resolution aerial images and LiDAR data using gated residual refinement network,” ISPRS J. Photogramm. Remote Sens., vol. 151, pp. 91–105, 2019.
 (20) Y. Shi, Q. Li, and X. X. Zhu, “Building Footprint Generation Using Improved Generative Adversarial Networks,” IEEE Geosci. Remote Sens. Lett., 2018, doi:10.1109/LGRS.2018.2878486.
 (21) S. Wang, M. Bai, G. Mattyus, H. Chu, W. Luo, B. Yang, J. Liang, J. Cheverie, S. Fidler, and R. Urtasun, “TorontoCity: Seeing the world with a million eyes,” in Proc. Int. Conf. Comput. Vision, 2017.
 (22) S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, Jan. 2007.
 (23) J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
 (24) M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graphstructured data,” arXiv preprint arXiv:1506.05163, 2015.
 (25) M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Proc. Int. Conf. Neural Information Processing Systems (NIPS), 2016, pp. 3844–3852.
 (26) T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 (27) W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive Representation Learning on Large Graphs,” arXiv preprint arXiv:1706.02216, 2017.
 (28) Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated Graph Sequence Neural Networks,” in Proc. Int. Conf. Learning Representations, 2016.
 (29) K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” arXiv preprint arXiv:1406.1078, 2014.
 (30) PlanetScope. https://www.planet.com/
 (31) OpenStreetMap. http://www.openstreetmap.org
 (32) ISPRS 2D Semantic Labeling Dataset – Potsdam. http://www2.isprs.org/commissions/comm3/wg4/2dsemlabelpotsdam.html