Towards Monocular Digital Elevation Model (DEM) Estimation by Convolutional Neural Networks - Application on Synthetic Aperture Radar Images

by   Gabriele Costante, et al.
Università Perugia

Synthetic aperture radar (SAR) interferometry (InSAR) is performed using repeat-pass geometry and is used to reconstruct the topography of the Earth's surface. The main problem of the range-Doppler focusing technique is the two-dimensional nature of the SAR result, which is affected by the layover indetermination. To resolve this ambiguity, a minimum of two sensor acquisitions, separated by a baseline extended in the cross-slant-range direction, are needed. However, given their multi-temporal nature, these techniques are vulnerable to variations of atmospheric and terrestrial environmental parameters, in addition to physical platform instabilities. Furthermore, either two radars are needed or an interferometric cycle is required (spanning days to weeks), which makes real-time DEM estimation impossible. In this work, the authors propose a novel experimental alternative to the InSAR method that uses single-pass acquisitions, following a data-driven approach implemented with deep neural networks. We propose a fully convolutional neural network (CNN) encoder-decoder architecture, training it on radar images in order to estimate DEMs from single-pass image acquisitions. Our results on a set of Sentinel images show that the method is able to learn, to some extent, the statistical properties of the DEM. The results of this exploratory analysis are encouraging and open the way to solving the single-pass DEM estimation problem with data-driven approaches.




1 Introduction

Interferometric Synthetic Aperture Radar (InSAR) allows topographic reconstruction of a physical environment. The technique is performed by designing a spatial single-baseline SAR geometry [12], [3], [22], where the result is a digital elevation model (DEM). However, to solve the phase indetermination with good altitude accuracy, a minimum of two passes is needed, which usually implies waiting days, or even months, between the first and the second pass. We propose a machine-learning method, implemented with Convolutional Neural Networks (CNNs), to estimate the topographic reconstruction, i.e., a DEM, using only one single-look-complex (SLC) SAR image. Before describing the novel signal processing technique, a brief overview of the history of InSAR is given. It is necessary to go back to 1980, when Walker [21] established the feasibility of fine Doppler frequency resolution for the range-Doppler SAR image. In this context, a high-energy scattering point target may move through several range-Doppler resolution cells, producing a smeared trace.

SAR data can be represented as a three-dimensional Fourier transform of the object reflectivity density, so a full three-dimensional reconstruction of the environment can be obtained by an inverse Fourier transform. Munson et al. [17] show that spotlight SAR, interpreted as a tomographic reconstruction problem, synthesizes high-resolution maps of terrain observed along multiple observation angles.

Figure 1: Overview of the proposed DEM estimation approach. The proposed approach does not need multiple radar acquisitions. Instead, it processes a single look complex radar image to estimate the associated elevation model. In order to achieve this, we exploit the convolutional neural network paradigm. In particular, the encoder section extracts highly informative local structures (i.e., features) from the input radar image. Afterwards, the decoder section decodes the features and predicts the DEM image.

Jakowatz et al. [10] extend the work of Munson et al. to a new three-dimensional formulation, making the simplifying assumption that the SAR range-Doppler image is two-dimensional. Unfortunately, this assumption implies the generation of the layover effect and, in order to resolve targets in the cross-slant range, multiple observations have to be performed. In [2] the author gives a theoretical explanation of frequency diversity in SAR tomography. A very good introduction to InSAR is given in [15]; the work gives detailed information on combining complex SAR images recorded by antennas positioned at different locations. Recent years have seen a refinement of the InSAR technique aimed at removing the need for multiple satellite passes. InSAR can also be applied using two sensors mounted on the same platform, a configuration called single-pass interferometry [15]. However, to obtain a digital elevation model with useful accuracy, a minimum baseline is needed. An application of InSAR from the spaceborne radar perspective is given in [16]. Colesanti et al. [4] performed a valuable study on ERS-ENVISAT interferometry, despite the 31 MHz shift between the two carrier frequencies. In [7], the authors demonstrated absolute height estimation from a single staring-spotlight SAR image, exploiting the different azimuth defocusing levels generated by scatterers positioned at different heights. The limitation of this technique is that it is tightly bound to the nature of the staring-spotlight acquisition, which yields a reduced range-azimuth swath of observation; accurate absolute height estimation is possible only for a few azimuth intra-chromatic high-coherency scatterers. Moreover, all the aforementioned methods require complex models and computations to take into account the atmosphere, sensor and environment conditions. To the authors' knowledge, the possibility of computing DEM estimates with a standard SAR sensor and a single-pass acquisition has not been tackled before. In this work, we propose a different paradigm to solve this problem. Since a large amount of SAR imagery has been collected in the past, we adopt a data-driven approach. The work has been inspired by recent work on monocular depth estimation in the Robotics and Computer Vision communities [13], [14], [19], [20], [9], [8], [18], [11]. Usually, in the Robotics context, depth estimation from standard camera sensors is done by triangulating information collected through stereo rigs, or by using multiple passes of the same sensor. Recently, Convolutional Neural Network (CNN) models have been proposed to reconstruct a depth map from a single image acquisition. The problem of learning depth from image appearance has similarities with the task of learning DEMs from radar images. In this work, we apply the same reasoning, learning the conditional distribution of digital elevation models given single-pass radar images. We show that the proposed model is able to learn, to some extent, the spatial relationships in the input data, even with a moderate amount of data. This preliminary study already shows promising results for future developments.

2 Methodology

In order to perform DEM estimation from single-pass SAR acquisition, we need to infer the structure of the observed Earth portion by only using a single radar image. We achieve this by devising a deep neural network architecture that learns to predict the DEM by extracting structures and high-level information from the input radar image. The key intuition behind this strategy lies in the exploitation of local image structures to infer the DEM value at a certain location (i.e., image pixel). By using multiple stages of convolutional filters, we are able to extract high-level structures (i.e., features) at different scales. These features are then used by the model to resolve ambiguities and estimate the DEM.

In the remainder of this section, we firstly describe more formally the principles behind our approach. Afterwards, we provide details about the proposed convolutional neural network architecture.

2.1 Estimation Problem Formulation

We want to model a function that, given a single radar image represented in the complex range-azimuth domain, estimates the corresponding DEM, filtering out radar noise and resolving the layover indetermination. The output of the model is the DEM image D, where each entry contains the elevation value at that location. In order to evaluate the contribution of the complex components of the radar image S, we give the model as input the absolute value |S| and the phase ∠S of the complex image. Thus, our function is defined as f : (|S|, ∠S) → D.
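The input decomposition can be sketched in a few lines of NumPy (`slc_to_channels` is an illustrative helper name, not from the paper):

```python
import numpy as np

def slc_to_channels(slc):
    """Split a complex SLC image into the two real-valued input
    channels used by the network: magnitude |S| and phase angle(S)."""
    magnitude = np.abs(slc)
    phase = np.angle(slc)          # in [-pi, pi]
    return np.stack([magnitude, phase], axis=-1)

# Toy 2x2 complex "image"
slc = np.array([[1 + 1j, 0 + 2j],
                [3 + 0j, -1 - 1j]])
channels = slc_to_channels(slc)
print(channels.shape)  # (2, 2, 2): H x W x (magnitude, phase)
```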

Figure 2: Overview of the proposed Convolutional Neural Network DEM estimator. The encoder section processes the input radar image (i.e., the absolute-value and phase images) and extracts features at different scales to detect informative local structures. The decoder section decodes the features to estimate the associated DEM image.
Layer name Kernel size Stride Padding output size activation
Input - - - -
Encoder Section Conv1 same ReLU
Conv2 same ReLU
MaxPool - -
Conv3 valid ReLU
Conv4 valid ReLU
Conv5 valid ReLU
Conv6 valid Linear
Decoder Section T-Conv1 valid PReLU
T-Conv2 valid PReLU
T-Conv3 valid PReLU
T-Conv3 valid PReLU
T-Conv4 valid PReLU
ConvOutput same Linear
Table 1: Details of the network architecture. The network is composed of two sections, namely the encoder and the decoder. The encoder section has six convolutional layers and a max-pooling layer to extract features at different scale levels. The decoder section decodes the features using transposed convolutions to estimate the DEM image. Padding same is used when we need to preserve the spatial dimensions of a layer input; conversely, padding valid indicates that the convolution operation processes only valid patches of the input (i.e., the output dimension is slightly smaller due to border effects).
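The same/valid bookkeeping follows the standard convolution-arithmetic formulas. The sketch below (kernel sizes are illustrative, not the paper's) shows how valid layers shrink the feature maps and how a strided transposed convolution upsamples them:

```python
def conv_out(n, k, stride=1, padding="valid"):
    """Output length of a convolution along one spatial axis."""
    if padding == "same":
        return -(-n // stride)            # ceil(n / stride): size preserved for stride 1
    return (n - k) // stride + 1          # 'valid': only full patches fit

def tconv_out(n, k, stride=1):
    """Output length of a 'valid' transposed convolution along one axis."""
    return (n - 1) * stride + k

# 'same' preserves size; each 'valid' layer loses k-1 border pixels
assert conv_out(140, 3, padding="same") == 140
assert conv_out(140, 3, padding="valid") == 138
# a transposed conv with stride 4 upsamples roughly 4x
assert tconv_out(35, 4, stride=4) == 140
```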

For the network structure we exploit the encoder-decoder paradigm, similar to [1, 6, 13, 14]. This kind of architecture is composed of two main blocks, each made of a number of convolutional layers, as shown in Figure 1. The encoder part computes spatial features and, at the same time, reduces the image representation size layer after layer, in order to find an encoded representation of the image; the decoder part takes this encoded representation and decompresses it, with upsampling and convolutions, to finally reconstruct the original image. The loss that is minimized is the DEM reconstruction error, which is propagated through the decoder and encoder layers. In this way, the network learns a lower-dimensional representation (an embedding) of the input radar images, removing noise and increasing the generalization properties for further processing. We use a variation of the encoder-decoder architecture where the input and output are not the same: in our case the inputs are radar images and the outputs are DEMs reprojected in radar coordinates (slant range versus azimuth).

The architecture we propose is a fully convolutional deep network, able to handle inputs of generic size. Furthermore, fully convolutional architectures preserve spatial information in both the encoder and decoder sections, which is crucial to fully exploit local structure information.

The encoder section is composed of a series of convolutional layers, which sequentially apply learned filters to their input to compute the features. To extract higher-level features, the input is downsampled multiple times; for this we use max pooling.

The decoder section is composed of a stack of transposed convolutional layers that learn to reconstruct pixel-wise predictions of the DEM image from the features computed by the encoder section. Differently from the encoder section, instead of using unpooling layers to reverse the pooling operations, we take advantage of transposed convolutional layers to learn an effective upsampling strategy.

The network is detailed in Table 1 and shown in Figure 2. All the convolutional layers in the encoder section have rectified linear (ReLU) activation functions, except for the last one (Conv6), which has a linear activation. The decoder section has five transposed convolutional (T-Conv) layers that decode the features extracted by the first section of the network. The last T-Conv layer performs an upsampling by striding the convolution operations by a factor of 4. All the T-Conv layers use parametric rectified linear units (PReLU) to allow for negative activations during the decoding phase. Finally, a single-channel convolutional layer with a linear activation outputs the predicted DEM image.

All the convolutional filters are regularized with L2 penalty to prevent overfitting.
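To make the architecture concrete, the following is a minimal PyTorch sketch of an encoder-decoder of this shape. The layer counts and activations follow the description above, but the kernel sizes, channel widths and strides are illustrative assumptions (the exact values are not restated here); in particular, the final T-Conv below uses stride 2 so that a 140x140 input maps back to a 140x140 DEM, whereas the paper's last T-Conv uses stride 4.

```python
import torch
import torch.nn as nn

class DemEstimator(nn.Module):
    """Sketch of an encoder-decoder DEM estimator (assumed layer sizes)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),    # Conv1, 'same'
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),   # Conv2, 'same'
            nn.MaxPool2d(2),                              # downsampling
            nn.Conv2d(32, 64, 3), nn.ReLU(),              # Conv3, 'valid'
            nn.Conv2d(64, 64, 3), nn.ReLU(),              # Conv4, 'valid'
            nn.Conv2d(64, 128, 3), nn.ReLU(),             # Conv5, 'valid'
            nn.Conv2d(128, 128, 3),                       # Conv6, linear
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 3), nn.PReLU(),   # five T-Conv layers
            nn.ConvTranspose2d(64, 64, 3), nn.PReLU(),
            nn.ConvTranspose2d(64, 32, 3), nn.PReLU(),
            nn.ConvTranspose2d(32, 32, 3), nn.PReLU(),
            nn.ConvTranspose2d(32, 32, 2, stride=2), nn.PReLU(),  # strided upsampling
            nn.Conv2d(32, 1, 3, padding=1),               # linear output: DEM
        )

    def forward(self, x):  # x: (B, 2, H, W) -> magnitude + phase channels
        return self.decoder(self.encoder(x))

model = DemEstimator()
out = model(torch.zeros(1, 2, 140, 140))
print(out.shape)  # torch.Size([1, 1, 140, 140])
```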

The objective minimized during the learning phase is the pixel-wise root mean squared error (RMSE) between the estimated and the GT-DEM images:

RMSE = sqrt( (1/N) Σ_{i=1..N} (d̂_i − d_i)² ),

where N is the number of pixels of the DEM image, and d̂_i and d_i are the estimated and ground-truth elevation values at pixel i.
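This loss takes a few lines of NumPy (`rmse_loss` is an illustrative helper name):

```python
import numpy as np

def rmse_loss(dem_pred, dem_gt):
    """Pixel-wise RMSE between estimated and ground-truth DEM images."""
    diff = dem_pred - dem_gt
    return np.sqrt(np.mean(diff ** 2))

pred = np.array([[10.0, 20.0], [30.0, 40.0]])
gt   = np.array([[11.0, 19.0], [31.0, 39.0]])
print(rmse_loss(pred, gt))  # 1.0
```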

3 Experiments

In this section, we describe the experiments we ran to validate the proposed CNN-based DEM estimation approach. In the following, we first describe the experimental setup, providing details about the datasets used, the preprocessing procedures and the CNN training. Afterwards, we discuss the results and draw conclusions.

3.1 Datasets

We test our approach on three different datasets, namely the Alps, the California and the Tucson datasets. The SLC image and the associated GT DEM are depicted in Figures 3-4, 5-6 and 7-8, respectively.


Figure 3: SLC - Alps
Figure 4: GT DEM - Alps


Figure 5: SLC - California
Figure 6: GT DEM - California
Figure 7: SLC - Tucson
Figure 8: GT DEM - Tucson
Figure 9: The datasets used for validating the proposed approach. The first row refers to the Alps dataset, the second to the California dataset and the third to the Tucson dataset. The first column depicts the SLC images, while the second one shows the associated GT DEM.

These datasets are taken from the Sentinel mission of the European Space Agency. In particular, we use three different acquisitions observing the Alps (Italy), California (USA) and the city of Tucson (USA). The data are Single Look Complex (SLC) and vertical-vertical (VV) polarized. Each acquisition is composed of an SLC image with an associated DEM (GT-DEM) computed with standard InSAR techniques. The SLC images are provided as large complex-valued matrices (typically 12000x20000 entries), while the GT-DEM is a real-valued matrix with the same size as its corresponding SLC image.

To learn the CNN model, we generate the training and test samples by sliding a 4000x4000 window over the SLC/GT-DEM pair. The window has a step of 100 pixels in both the row and column directions. The size of the window is chosen so that each sample contains enough local structure information to allow the CNN to properly estimate the DEM image. Each sample is downsampled to 140x140 pixels to make the learning task tractable. Depending on the size of the input matrices, we generate up to 22000 samples for each dataset (the exact sample numbers are discussed in the following sections). The train-test split is generated by randomly selecting 65% of the samples for training and 35% for testing.
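A minimal NumPy sketch of this sliding-window sample generation follows. The resampling method is an assumption (plain strided subsampling), since the text does not state how the 4000x4000 crops are reduced to 140x140:

```python
import numpy as np

def extract_samples(slc, dem, win=4000, step=100, out=140):
    """Slide a win x win window over an SLC/GT-DEM pair and downsample
    each crop to out x out (nearest-neighbour subsampling, an assumption)."""
    samples = []
    h, w = slc.shape
    stride = win // out
    for r in range(0, h - win + 1, step):
        for c in range(0, w - win + 1, step):
            s = slc[r:r + win, c:c + win][::stride, ::stride][:out, :out]
            d = dem[r:r + win, c:c + win][::stride, ::stride][:out, :out]
            samples.append((s, d))
    return samples

# Toy run with scaled-down sizes
slc = np.zeros((500, 600), dtype=complex)
dem = np.zeros((500, 600))
pairs = extract_samples(slc, dem, win=400, step=100, out=20)
print(len(pairs))  # row offsets 0,100 (2) x col offsets 0,100,200 (3) -> 6
```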

3.2 Training details

The CNN is trained using the Adam optimizer, setting the learning rate, the exponential decay rates for the moment estimates (β₁ and β₂), and ε to fixed values; all the L2 regularizer weights of the convolutional layers are likewise set to a single fixed value. The batch size is set to 128 for all the experiments, and the training set is randomly shuffled at the end of each epoch. Each model is trained for 500 epochs, which takes approximately four hours on a desktop workstation equipped with a Titan Xp GPU. Once the model is learnt, predictions run very fast at test time: the computation of the DEM image associated with a 4000x4000 SLC subwindow takes ms, i.e., it runs at approximately Hz.
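The per-epoch shuffling and batching scheme can be sketched as follows (batch size 128 as above; `epoch_batches` is an illustrative helper, not the authors' code):

```python
import numpy as np

def epoch_batches(n_samples, batch_size=128, rng=None):
    """Yield shuffled mini-batch index arrays for one training epoch
    (the training set is reshuffled at each epoch)."""
    rng = rng or np.random.default_rng(0)
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

batches = list(epoch_batches(1000, batch_size=128))
print(len(batches))      # ceil(1000 / 128) = 8 batches
print(len(batches[-1]))  # last, partial batch: 1000 - 7*128 = 104
```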

3.3 Discussion

Figure 22 shows examples of the real and estimated DEMs for each dataset. In addition, elevation profiles are plotted for two sample ranges (in pixels), in order to better show the estimation properties of the network. The Alps dataset is composed of train and test images, and the average RMSE over all test images is m. The California dataset is composed of train and test images; the average RMSE in this case is m. Finally, the Tucson dataset has train and test images; in this test, the average RMSE is m.


Figure 10: GT DEM - Alps test
Figure 11: Est - DEM - Alps test
Figure 12: Est. vs GT altitudes
Range - 30 Alps test
Figure 13: Est. vs GT altitudes
Range 120 - Alps test


Figure 14: GT DEM - California test
Figure 15: Est - DEM - California test
Figure 16: Est. vs GT altitudes
Range 30 - California test
Figure 17: Est. vs GT altitudes
Range 120 - California test
Figure 18: GT DEM - Tucson test
Figure 19: Est - DEM - Tucson test
Figure 20: Est. vs GT altitudes
Range 30 - Tucson test
Figure 21: Est. vs GT altitudes
Range 120 - Tucson test
Figure 22: Comparison between the estimated and the ground truth DEM images. The first row refers to the experiment on the Alps dataset, while the second and the third show the results for the California and the Tucson tests, respectively. The first column depicts the GT DEM of a sample image from the three datasets, while the second column shows the relative estimated DEM. The third and the fourth columns compare the estimated altitude profiles with the ground truth ones at fixed range values.

Qualitatively, the network learned the altitude statistics, producing results that closely resemble the ground truth. The main difference is a smoothing effect in the network estimate compared with the original. This is more evident when comparing the GT and predicted profiles at fixed range values (in pixels with respect to the image coordinates) of 30 and 120, shown in Figures 12-13, 16-17 and 20-21 for the Alps, the California and the Tucson datasets, respectively. From the profiles it is even more apparent that the network outputs a digital elevation model that closely resembles the original one: the general trends of the GT DEM are closely followed, and the main differences between GT and prediction are due to the smoothing of crest ripples, since to the network these ripples appear as a high-frequency signal (noise) superimposed on the general elevation model. To better quantify the performance of the network, we quantized the range of elevations in each dataset and computed the average error for each bin, in order to analyse the error distribution as a function of the GT elevation. The resulting plots are shown in Figures 23, 24 and 25 for the Alps, the California and the Tucson datasets, respectively. The three plots, together with those in Figure 22 and the average RMSE values on the test sets, show that the performance of the estimation network degrades when the terrain is mountainous, while it is close to the real DEM for slowly varying terrain. This is expected, since altitude information is not directly contained in a single-pass radar image, so the network has to extract it from context-level information. We hypothesize that increasing the amount of data given to the network would further reduce the errors on the high-frequency ripples. Furthermore, devising more complex architectures should also help to better model the variability of high crests.
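The binned-error analysis described above can be sketched in NumPy as follows (the error metric per bin, mean absolute error here, is an assumption; the bin count follows the 100 bins mentioned below):

```python
import numpy as np

def binned_mean_error(gt, pred, n_bins=100):
    """Average absolute estimation error per quantized GT-elevation bin."""
    gt, pred = gt.ravel(), pred.ravel()
    edges = np.linspace(gt.min(), gt.max(), n_bins + 1)
    idx = np.clip(np.digitize(gt, edges) - 1, 0, n_bins - 1)
    err = np.abs(pred - gt)
    return np.array([err[idx == b].mean() if np.any(idx == b) else np.nan
                     for b in range(n_bins)])

# Toy example: larger error at high elevations, as observed for mountainous terrain
gt = np.linspace(0.0, 999.0, 1000)
pred = gt + np.where(gt > 500, 10.0, 1.0)
curve = binned_mean_error(gt, pred, n_bins=10)
print(curve[0], curve[-1])  # 1.0 10.0
```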


Figure 23: Avg estimation error on the Alps data
Figure 24: Avg estimation error on the California data
Figure 25: Avg estimation error on the Tucson data
Figure 26: Average estimation error computed on quantized elevations (100 bins) for the Alps (Figure 23), the California (Figure 24) and the Tucson (Figure 25) data.

4 Conclusions

In this paper, we have proposed a novel method able to estimate DEMs using single SAR images instead of interferometric pairs. The proposed method follows a data-driven approach, implemented through an encoder-decoder CNN architecture, and can potentially resolve the layover indetermination present in a single SLC SAR image using image context information. Our results show that this method is promising, and able to learn useful DEM estimates even with moderate training time and data. The CNN was trained on a set of Sentinel data.


  • [1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [2] Filippo Biondi. MCA-SAR-Tomography. In EUSAR 2016: 11th European Conference on Synthetic Aperture Radar, Proceedings of, pages 1–4. VDE, 2016.
  • [3] Fabio Bovenga, Dominique Derauw, Fabio Michele Rana, Christian Barbier, Alberto Refice, Nicola Veneziani, and Raffaele Vitulli. Multi-chromatic analysis of sar images for coherent target detection. Remote Sensing, 6(9):8822–8843, 2014.
  • [4] C Colesanti, F De Zan, A Ferretti, C Prati, and F Rocca. Generation of dem with sub-metric vertical accuracy from 30' ERS-ENVISAT pairs. In Proc. FRINGE 2003 Workshop, Frascati, Italy, pages 1–5, 2003.
  • [5] Mauro Coltelli. Generation of digital elevation models by using sir-c/x-sar multifrequency two-pass. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 34(5), 1995.
  • [6] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [7] Sergi Duque, Helko Breit, Ulrich Balss, and Alessandro Parizzi. Absolute height estimation using a single terrasar-x staring spotlight acquisition. IEEE Geoscience and Remote Sensing Letters, 12(8):1735–1739, 2015.
  • [8] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
  • [9] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pages 2366–2374, 2014.
  • [10] Charles V Jakowatz and P Thompson. A new look at spotlight mode synthetic aperture radar as tomography: imaging 3-d targets. IEEE transactions on image processing, 4(5):699–703, 1995.
  • [11] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. 2015.
  • [12] Soren N Madsen and Howard A Zebker. Automated absolute phase retrieval in across-track interferometry. 1992.
  • [13] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia. Fast robust monocular depth estimation for obstacle detection with fully convolutional networks. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4296–4303, Oct 2016.
  • [14] M. Mancini, G. Costante, P. Valigi, T. A. Ciarfuglia, J. Delmerico, and D. Scaramuzza. Toward domain independence for learning-based monocular depth estimation. IEEE Robotics and Automation Letters, 2(3):1778–1785, July 2017.
  • [15] Gerard Margarit, Jordi J Mallorqui, and Xavier Fabregas. Single-pass polarimetric sar interferometry for vessel classification. IEEE transactions on geoscience and remote sensing, 45(11):3494–3502, 2007.
  • [16] Joao Moreira, Marcus Schwabisch, Gianfranco Fornaro, Riccardo Lanari, Richard Bamler, Dieter Just, Ulrich Steinbrecher, Helko Breit, Michael Eineder, Giorgio Franceschetti, et al. X-sar interferometry: First results. IEEE transactions on Geoscience and Remote Sensing, 33(4):950–956, 1995.
  • [17] David C Munson, James D O’Brien, and W Kenneth Jenkins. A tomographic formulation of spotlight-mode synthetic aperture radar. Proceedings of the IEEE, 71(8):917–925, 1983.
  • [18] Anirban Roy and Sinisa Todorovic. Monocular depth estimation using neural regression forest. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [19] Ashutosh Saxena, Sung H Chung, and Andrew Y Ng. Learning depth from single monocular images. In Advances in neural information processing systems, pages 1161–1168, 2006.
  • [20] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):824–840, 2009.
  • [21] Jack L Walker. Range-doppler imaging of rotating objects. IEEE Transactions on Aerospace and Electronic systems, (1):23–52, 1980.
  • [22] Howard A Zebker and Richard M Goldstein. Topographic mapping from interferometric synthetic aperture radar observations. Journal of Geophysical Research: Solid Earth, 91(B5):4993–4999, 1986.