DSM Building Shape Refinement from Combined Remote Sensing Images based on Wnet-cGANs

03/08/2019 ∙ by Ksenia Bittner, et al. ∙ Technische Universität München

We describe the workflow of a digital surface model (DSM) refinement algorithm that uses a hybrid conditional generative adversarial network (cGAN) whose generative part consists of two parallel networks merged at the last stage, forming a WNet architecture. The inputs to this WNet-cGAN are stereo DSMs and panchromatic (PAN) half-meter resolution satellite images. Fusing the two propagates fine detail from the spectral image and completes the 3D information about building shapes that is missing from the stereo DSM. In addition, it refines building outlines and edges, making them more rectangular and sharp.




1 Introduction

A digital surface model (DSM) is an important and valuable data source for many remote sensing applications, such as building detection and reconstruction, cartographic analysis, urban planning, environmental investigations, and disaster assessment. Its use in these applications is motivated by the fact that it already provides a geometric description of the topographic surface. With recent advances in sensor technologies, it has become possible to generate DSMs with a small ground sampling distance (GSD) not only from land surveying, aerial images, laser ranging data, or interferometric synthetic aperture radar (InSAR), but also from satellite stereo images. The main advantages of satellite photogrammetric DSMs are their large land coverage and the possibility of accessing remote areas. However, DSMs generated with image-based matching approaches miss objects such as steep walls in urban areas, or contain unwanted outliers and noise due to temporal changes, matching errors, or occlusions. To overcome these problems, algorithms from computer vision have been analyzed and adapted to satellite imagery. For example, filtering techniques such as a geostatistical filter integrated with a hierarchical surface fitting technique, a threshold slope-based filter, or a Gaussian noise removal filter are commonly used to improve DSM quality. Moreover, some methodologies propose fusing DSMs obtained from different data sources to compensate for the limitations and gaps that each of them has individually.


With recent developments in deep learning, it has become possible to achieve top scores on many tasks, including image processing. As a result, several works have already investigated its applicability to remote sensing applications such as landscape classification, building and road extraction, and traffic monitoring. Recently, a class of neural networks called generative adversarial networks (GANs) was applied to three-dimensional remote sensing data and proved suitable. In particular, the generation of large-scale 3D surface models with building shapes refined to level of detail (LoD) 2 from stereo satellite DSMs was studied using conditional GANs (cGANs) [bittner2018automatic, bittner2018dsm]. In this paper, we follow those ideas and propose a hybrid cGAN architecture that couples half-meter resolution satellite PAN images and DSMs to produce 3D surface models with not only refined 3D building shapes but also completed structures, more accurate outlines, and sharper edges.

2 Methodology

Figure 1: Schematic overview of the proposed architecture for building shape refinement in the 3D surface model by WNet-cGAN using depth and spectral information (figure labels: stereo DSM, PAN, GT).

The birth of GAN-based domain adaptation neural networks, introduced by goodfellow2014generative, yielded great achievements in generating realistic images. The idea behind adversarial learning is to train a pair of networks in competition: a generator G that tries to fool the discriminator by making the source domain look as much like the target domain as possible, and a discriminator D that tries to differentiate between the target domain and the transformed source domain. Taking the source distribution as input instead of a uniform distribution, and using this external information to restrict both the generator in its output and the discriminator in its expected input, leads to the conditional type of GAN. The objective function of a cGAN (Eq. 1) can be expressed as a two-player minimax game between the generator and the discriminator, in which G intends to minimize the objective function against D, which aims to maximize it. Note that in the first term of Eq. 1 we use a least-squares objective instead of the common negative log-likelihood. The second term of Eq. 1 regularizes the generator so that its output stays close to the ground truth.
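As a sketch of the kind of objective described here, and not necessarily the exact form of Eq. 1, a least-squares conditional GAN with a distance regularizer is commonly written as below. The symbols x (stereo DSM), p (PAN image), y (ground-truth DSM), the weight \lambda, the factors 1/2, and the choice of an L1 distance for the second term are all assumptions made for illustration.

```latex
\begin{align}
G^{*} &= \arg\min_{G}\max_{D}\; \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda\,\mathcal{L}_{1}(G),
\qquad
\mathcal{L}_{1}(G) = \mathbb{E}_{x,p,y}\big[\lVert y - G(x,p)\rVert_{1}\big].
\end{align}
% In the least-squares variant, the two players are usually trained
% with the following pair of losses instead of the log-likelihood terms:
\begin{align}
\min_{D}\;& \tfrac{1}{2}\,\mathbb{E}_{x,y}\big[(D(x,y) - 1)^{2}\big]
          + \tfrac{1}{2}\,\mathbb{E}_{x,p}\big[D(x, G(x,p))^{2}\big], \\
\min_{G}\;& \tfrac{1}{2}\,\mathbb{E}_{x,p}\big[(D(x, G(x,p)) - 1)^{2}\big]
          + \lambda\,\mathcal{L}_{1}(G).
\end{align}
```

In this form the discriminator is pushed toward 1 on ground-truth pairs and toward 0 on generated pairs, while the generator is rewarded both for fooling the discriminator and for staying close to the ground truth.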

In our previous work, we adapted the architecture proposed by isola2016image to obtain refined 3D surface models from noisy and inaccurate stereo DSMs. Now, we propose a new cGAN architecture that integrates depth information from stereo DSMs with spectral information from PAN images. The latter provide sharper information about building silhouettes, which allows a better reconstruction not only of building outlines but also of their missing parts. Since intensity and depth information have different physical meanings, we propose a hybrid network in which two separate UNet-type networks [ronneberger2015u] with the same architecture are used: one stream is fed with the PAN image and the other with the stereo DSM, yielding a so-called WNet architecture. Before the last upsampling layer, which leads to the final output size, we concatenate the intermediate features from both streams. Moreover, we extend the network with an additional convolutional layer that fuses the information from the two modalities. As investigated earlier, this fusion can correct small failures in the predictions by automatically learning which stream of the network provides the best prediction result [bittner2018building]. Finally, an activation function is applied on the top layer of the generator G.

The discriminator D is a binary classification network with a sigmoid activation function on its top layer, which outputs the probability that the input image belongs either to class 1 (“real”) or class 0 (“generated”). It has five convolutional layers, which are followed by a leaky rectified linear unit (ReLU) activation function with a negative slope of 0.2. The input to D is a concatenation of a stereo DSM with either a WNet-generated 3D surface model or a ground-truth DSM. A simplified representation of the proposed network architecture is shown in Fig. 1.
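The two-stream layout can be illustrated with a deliberately small PyTorch sketch. It is not the authors' implementation: the depth of each stream (two downsampling steps), the channel width, the 4 x 4 kernels, the 1 x 1 fusion kernel, the tanh output activation, and the placement of the sigmoid directly after the fifth convolution are all assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn


def down(c_in, c_out):
    # Strided convolution: halves the spatial size (encoder step).
    return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.LeakyReLU(0.2))


def up(c_in, c_out):
    # Transposed convolution: doubles the spatial size (decoder step).
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.ReLU())


class Stream(nn.Module):
    """Tiny U-Net-like stream that stops before the last upsampling layer."""

    def __init__(self, width=16):
        super().__init__()
        self.d1 = down(1, width)
        self.d2 = down(width, 2 * width)
        self.u1 = up(2 * width, width)

    def forward(self, x):
        e1 = self.d1(x)                     # H/2
        e2 = self.d2(e1)                    # H/4
        d1 = self.u1(e2)                    # back to H/2
        return torch.cat([d1, e1], dim=1)   # skip connection, 2*width channels


class WNetGenerator(nn.Module):
    """Two parallel streams merged before the last upsampling layer."""

    def __init__(self, width=16):
        super().__init__()
        self.dsm_stream = Stream(width)
        self.pan_stream = Stream(width)
        self.last_up = up(4 * width, width)
        # Extra convolution fusing both modalities (kernel size assumed).
        self.fuse = nn.Conv2d(width, 1, kernel_size=1)
        self.out = nn.Tanh()                # output activation assumed

    def forward(self, dsm, pan):
        feats = torch.cat([self.dsm_stream(dsm), self.pan_stream(pan)], dim=1)
        return self.out(self.fuse(self.last_up(feats)))


class Discriminator(nn.Module):
    """Five convolutions, leaky ReLU (slope 0.2), sigmoid on top."""

    def __init__(self, width=16):
        super().__init__()
        chans = [2, width, 2 * width, 4 * width, 8 * width]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
        layers += [nn.Conv2d(chans[-1], 1, 4, stride=1, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, stereo_dsm, candidate):
        # The candidate is either a generated or a ground-truth surface model.
        return self.net(torch.cat([stereo_dsm, candidate], dim=1))


generator, discriminator = WNetGenerator(), Discriminator()
dsm = torch.randn(2, 1, 64, 64)   # random stand-in for stereo DSM patches
pan = torch.randn(2, 1, 64, 64)   # random stand-in for PAN patches
refined = generator(dsm, pan)
scores = discriminator(dsm, refined)
print(refined.shape, scores.shape)
```

With these assumed settings the generator returns one height channel at the input resolution, and the discriminator returns a small grid of per-patch probabilities rather than a single scalar; a single-output head would be an equally valid reading of the description.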

3 Study Area and Experiments

Experiments were performed on WorldView-1 data covering the city of Berlin, Germany. As input data, we used a stereo DSM and one of six very high-resolution PAN images, both at half-meter resolution. The PAN image is orthorectified. As ground truth for learning the mapping function between the noisy DSM and a DSM with better building shape quality, we used the LoD2-DSM generated from a CityGML data model. The detailed methodology for creating the LoD2-DSM is given in our previous work. The CityGML data model is freely available on the download portal Berlin 3D (http://www.businesslocationcenter.de/downloadportal).

Figure 2: Visual analysis of DSMs generated by the stereo cGAN and WNet-cGAN architectures over selected urban areas. The DSM images are color-shaded for better visualization. Panels for the first area: (a) PAN, (b) stereo DSM, (c) cGAN LoD2-DSM, (d) WNet-cGAN LoD2-DSM, (e) GT: LoD2-DSM. Panels (f) to (j) show the same sequence for the second area.

The proposed WNet-cGAN is implemented with the PyTorch Python package. For training, the satellite images were tiled into patches that fit into the memory of a single NVIDIA TITAN X (Pascal) GPU. The total number of epochs was set to 200 with a batch size of 5. We trained the DSM-to-LoD2 WNet-cGAN network with minibatch stochastic gradient descent (SGD) using the Adam optimizer, for which an initial learning rate and two momentum parameters were set.
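The tiling step can be sketched in a few lines of plain Python. The function name `tile_image`, the patch size of 2, and the choice of non-overlapping patches that skip incomplete border tiles are hypothetical and only serve the illustration; the same grid would have to be applied to the stereo DSM, the PAN image, and the ground truth so that the patches stay aligned.

```python
def tile_image(image, patch_size, stride=None):
    """Cut a 2D raster (list of rows) into square patches.

    Patches that would cross the right or bottom border are skipped.
    """
    stride = stride or patch_size
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# Toy 4 x 6 raster standing in for a DSM or PAN image.
raster = [[r * 6 + c for c in range(6)] for r in range(4)]
patches = tile_image(raster, patch_size=2)
print(len(patches))  # 6
```

Passing a `stride` smaller than `patch_size` would produce overlapping patches, which is a common way to reduce border artifacts when the predictions are stitched back together.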

4 Results and Discussion

Two selected areas of the resulting LoD2-like DSMs generated from combined spectral and depth information, together with the LoD2-like DSMs generated from a single data source, are illustrated in Fig. 2. From Fig. 2(b) and Fig. 2(g) we can see that refining building shapes from the stereo DSM alone is a very challenging task for several reasons. First, the presence of vegetation can influence the reconstruction, as some parts of buildings are covered by trees. Second, the stereo DSM itself is very noisy due to failures in the generation algorithms. As a result, in most cases the roof types and, consequently, their shapes are indistinguishable. On the other hand, Fig. 2(a) and Fig. 2(f) show that edges and outlines are clearly visible in the PAN image. Refining 3D buildings from the PAN image alone, however, would be very difficult, as it contains no 3D information. Combining the two types of information therefore lets each compensate for what the other lacks.

The hybrid WNet-cGAN architecture is able to reconstruct more complete building structures than the cGAN trained on a single data source (see the highlighted buildings in Fig. 2(d)). Even complicated building forms are preserved in the reconstructed 3D surface model. An obvious example is the zigzag-shaped building in the upper-left part of Fig. 2(d). This information could only be obtained from the PAN image (see Fig. 2(a)).

The second example depicts a smaller area at a larger scale for better visual investigation. Here, the central building is complete and more details are distinguishable. In addition, the ridge lines of the roofs are much more visible. One can even guess which roof type parts of a building belong to: gable or hip. A clear contribution of the spectral information to the building shape refinement task can be seen at the upper-right building in Fig. 2(i), whose structure is more complete. The outlines of all buildings are more clearly rectilinear and the building shapes become more symmetrical. To examine the 3D information in more detail, we illustrate some building profiles in Fig. 3. The roof forms, such as gable and hip, are clearly improved, and the ridge lines tend to be sharp peaks. With the profile in Fig. 3(d) we again highlight the ability of the proposed architecture to reconstruct even complicated buildings, which are difficult to reconstruct from the stereo DSM alone.

Figure 3: Visual analysis of selected building profiles in the generated DSMs. Panels: (a) Profile 1, (b) Profile 3, (c) Profile 4, (d) Profile 5.

Model         MAE    RMSE   NMAD   NCC
Stereo DSM    3.00   5.97   1.48   0.90
cGAN          2.01   4.78   0.86   0.92
Fused-cGAN    1.79   4.36   0.67   0.94
Table 1: Prediction accuracies of the cGAN and Fused-cGAN models on the investigated metrics over the Berlin area.

To quantify the quality of the generated DSMs, we evaluated the mean absolute error (MAE), root mean square error (RMSE), normalized median absolute deviation (NMAD), and normalized cross-correlation (NCC), metrics commonly used to assess 3D surface model accuracy, on the cGAN and WNet-cGAN setups and report the results in Table 1, where the WNet-cGAN is listed as Fused-cGAN. As we are interested in quantifying the improvements of the building shapes only, the metrics were measured only within the areas where buildings are present, plus a three-pixel buffer around each of them. This was achieved by applying a dilation procedure to the footprint boundaries of the binary building mask. The results show that the DSM from the WNet-cGAN is better than both the original stereo DSM and the DSM generated by the cGAN model on all metrics. This is reasonable, as the spectral information provides additional cues that help to reconstruct building structures more accurately and in more detail, which is not possible using the stereo DSM alone. This especially affects the corners, outlines, and ridge lines. As the NCC metric indicates how closely the form of an object resembles the ground-truth object, the gain of 4 % over the stereo DSM (0.90 to 0.94) and of 2 % over the DSM generated by the cGAN model (0.92 to 0.94), measured over the whole test area, which includes thousands of buildings, demonstrates the advantage of using complementary information for such complicated tasks. The high RMSE values are due to the difference in acquisition time between the DSM generated from stereo satellite images and the ground-truth data: several buildings are missing from, or newly constructed in, the more recent data set.
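The masked evaluation can be sketched in plain Python as follows. The helper names `dilate` and `masked_metrics`, the square structuring element used for the buffer, the NMAD scale factor 1.4826, and the zero-mean form of the NCC are assumptions based on common definitions, not details confirmed by the paper.

```python
import math
import statistics


def dilate(mask, radius):
    """Grow a binary mask by `radius` pixels using a square neighbourhood."""
    height, width = len(mask), len(mask[0])
    grown = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(height, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                        grown[rr][cc] = True
    return grown


def masked_metrics(pred, truth, building_mask, buffer_px=3):
    """MAE, RMSE, NMAD and NCC inside the buffered building mask."""
    region = dilate(building_mask, buffer_px)
    cells = [(r, c) for r, row in enumerate(region)
             for c, inside in enumerate(row) if inside]
    p = [pred[r][c] for r, c in cells]
    t = [truth[r][c] for r, c in cells]
    errors = [a - b for a, b in zip(p, t)]
    n = len(errors)

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    median_error = statistics.median(errors)
    nmad = 1.4826 * statistics.median(abs(e - median_error) for e in errors)

    # Zero-mean normalized cross-correlation; undefined for constant inputs.
    mean_p, mean_t = sum(p) / n, sum(t) / n
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_t = sum((b - mean_t) ** 2 for b in t)
    ncc = cov / math.sqrt(var_p * var_t)
    return {"MAE": mae, "RMSE": rmse, "NMAD": nmad, "NCC": ncc}


# Toy 2 x 2 heights with every pixel marked as building and no buffer.
pred = [[1.0, 2.0], [3.0, 5.0]]
truth = [[1.0, 2.0], [3.0, 4.0]]
mask = [[True, True], [True, True]]
result = masked_metrics(pred, truth, mask, buffer_px=0)
print(result["MAE"], result["RMSE"])  # 0.25 0.5
```

Because NMAD is built on medians, it is far less sensitive than RMSE to a few large height errors, such as those caused by buildings that exist in only one of the two data sets.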

5 Conclusion

Refinement and filtering techniques from the literature for DSM quality improvement are adequate only for either small-scale DSMs or DSMs without discontinuities. As a result, there is a need for a refinement procedure that can handle discontinuities, mainly building forms in urban regions, in high-resolution large-scale DSMs. A common strategy in remote sensing for refinement procedures is to use all available information from different data sources. Their combination helps to compensate for the mistakes and gaps in each independent data source.

We present a method for automatic large-scale DSM generation with building shapes refined to LoD 2 from multiple spaceborne remote sensing data sources on the basis of cGANs. The designed end-to-end WNet-cGAN integrates the contextual information from height and spectral images to produce good-quality 3D surface models. The obtained results show the potential of the proposed methodology to generate more complete building structures in DSMs. The network is able to learn how to combine the complementary strengths of the PAN image and the stereo DSM: the stereo DSM provides elevation information about the objects, whereas the PAN image provides texture information and, as a result, more accurate building boundaries and silhouettes.