Interferometric Synthetic Aperture Radar (InSAR) enables topographic reconstruction of a physical environment. The technique is performed by designing a spatial single-baseline SAR geometry, where the result is a digital elevation model (DEM). However, to resolve the phase indetermination with good altitude accuracy, a minimum of two passes is needed, which usually implies waiting days, or even months, between the first and the second pass. We propose a method for estimating the topographic reconstruction with machine learning, implemented by Convolutional Neural Networks (CNNs), in order to estimate a DEM using only one single-look-complex (SLC) SAR image. Before describing the novel signal processing technique, a brief review of InSAR history is given. It is necessary to go back to 1980, when Walker et al. demonstrated the feasibility of fine Doppler frequency resolution for the range-Doppler SAR image. In this context, a high-energy scattering point target may move through several range-Doppler resolution cells, producing a smeared trace.
SAR data can be represented as a three-dimensional Fourier transform of the object reflectivity density, so a full three-dimensional reconstruction of the environment can be obtained by an inverse Fourier transform. Munson et al. showed that spotlight SAR, interpreted as a tomographic reconstruction problem, synthesizes high-resolution terrain maps observed along multiple observation angles.
Jakowatz et al. extend the work of Munson et al. to a new three-dimensional formulation, making the simplifying assumption that the SAR range-Doppler image is two-dimensional. Unfortunately, this assumption implies the generation of the layover effect and, in order to explore target detection in the cross-slant range, multiple observations have to be performed. A theoretical explanation of the frequency diversity in SAR tomography has also been given. A very good introduction to InSAR provides detailed information on combining complex SAR images recorded by antennas positioned at different locations. Recent years have seen a refinement of the InSAR technique, trying to remove the need for multiple satellite passes. InSAR can also be applied using two sensors mounted on the same platform; this configuration is called single-pass interferometry. However, to obtain a digital elevation model with useful accuracy a minimum baseline is needed. Applications of InSAR from the spaceborne radar perspective have also been reported. Colesanti et al. performed a valuable study on ERS-ENVISAT interferometry, despite the carrier frequencies of the two sensors being shifted by 31 MHz. In
the authors demonstrated the estimation of absolute height from a single staring-spotlight SAR image, exploiting the different azimuth defocusing levels generated by scatterers positioned at different heights. The limitation of this technique is that it is tightly bound to the nature of the staring-spotlight acquisition, which offers a reduced range-azimuth swath of observation, and accurate absolute height estimation is possible only for a few azimuth intra-chromatic high-coherency scatterers. Moreover, all the aforementioned methods require complex models and computations to account for atmosphere, sensor and environment conditions. To the best of the authors' knowledge, the possibility of computing DEM estimates with a standard SAR sensor and a single-pass acquisition has not been tackled before. In this work, we propose a different paradigm to solve this problem. Since a large amount of SAR imagery has been collected in the past, we adopt a data-driven approach. The work is inspired by recent work on monocular depth estimation in the Robotics and Computer Vision communities. Usually, in the robotics context, depth estimation from standard camera sensors is performed by triangulating information collected through stereo rigs, or by using multiple passes of the same sensor. Recently, Convolutional Neural Network (CNN) models have been proposed to reconstruct a depth map from a single image acquisition. The problem of learning depth from image appearance has similarities with the task of learning DEMs from radar images. In this work, we apply the same reasoning, learning the conditional distribution of digital elevation models given single SAR acquisitions. We show that the proposed model is able to learn, to some extent, the spatial relationships in the input data, even with a moderate amount of data. This preliminary study already shows promising results for future developments.
In order to perform DEM estimation from a single-pass SAR acquisition, we need to infer the structure of the observed Earth portion using only a single radar image. We achieve this by devising a deep neural network architecture that learns to predict the DEM by extracting structures and high-level information from the input radar image. The key intuition behind this strategy lies in the exploitation of local image structures to infer the DEM value at a certain location (i.e., image pixel). By using multiple stages of convolutional filters, we are able to extract high-level structures (i.e., features) at different scales. These features are then used by the model to resolve ambiguities and estimate the DEM.
In the remainder of this section, we first describe more formally the principles behind our approach. Afterwards, we provide details about the proposed convolutional neural network architecture.
2.1 Estimation Problem Formulation
We want to model a function that, given a single radar image represented in the complex range-azimuth domain, estimates the corresponding DEM, filtering out radar noise and resolving the layover indetermination. The output of the model is the DEM image, where each entry contains the elevation value at that location. In order to evaluate the contribution of the complex components of the radar image, we give the model as input the absolute value and the phase of the complex image. Thus, denoting the complex SLC image by S and the estimated DEM by D̂, our function is defined as f : (|S|, ∠S) ↦ D̂.
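As a minimal sketch of the input preparation (variable names are illustrative, not the paper's code), the two input channels can be extracted from a complex SLC array as follows:

```python
import numpy as np

# Toy complex SLC patch; a real one would be e.g. 140x140 after resampling.
slc = np.array([[1 + 1j, 0 + 2j],
                [3 + 0j, -1 - 1j]])

magnitude = np.abs(slc)      # first input channel: |S|
phase = np.angle(slc)        # second input channel: angle(S), in radians

# Stack the two channels as the network input tensor (H x W x 2).
x = np.stack([magnitude, phase], axis=-1)
print(x.shape)   # (2, 2, 2)
```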
Layer name | Kernel size | Stride | Padding | Output size | Activation
Details of the network architecture. The network is composed of two sections, namely the encoder and the decoder. The encoder section has six convolutional filters and a max pooling layer to extract features at different scale levels. The decoder section decodes the features using transposed convolutions to estimate the DEM image. Padding same is used when we need to preserve the dimensions of a layer input. Conversely, padding valid indicates that the convolution operation processes only valid patches of the input (i.e., the output dimension is slightly smaller due to border effects).
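For reference, the spatial output size of a convolution under the two padding modes can be computed with a small helper like the following (an illustrative sketch, not code from the paper):

```python
import math

def conv_output_size(n, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        # Input is zero-padded so only the stride shrinks the output.
        return math.ceil(n / stride)
    if padding == "valid":
        # Only fully valid kernel positions are evaluated.
        return (n - kernel) // stride + 1
    raise ValueError(padding)

print(conv_output_size(140, 3, 1, "same"))   # 140: dimensions preserved
print(conv_output_size(140, 3, 1, "valid"))  # 138: border effects
```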
For the network structure we exploit the encoder-decoder paradigm, similarly to [1, 6, 13, 14]. This kind of architecture is composed of two main blocks, each one comprising a number of convolutional layers, as shown in Figure 1. The encoder part computes the spatial features and at the same time reduces the image representation size layer after layer, in order to find an encoded representation of the image; the decoder part takes this encoded representation and decompresses it, with upsamplings and convolutions, to finally reconstruct the original image. The loss that is minimized is the DEM reconstruction error, which is propagated through the decoder and encoder layers. In this way, the network learns a lower-dimensional representation (an embedding) of the input radar images, removing noise and increasing the generalization properties for further processing. We propose a variation of the encoder-decoder architecture in which the input and output are not the same: in our case, the inputs are radar images and the outputs are the DEMs reprojected into the radar coordinates (slant range versus azimuth).
The architecture we propose is a fully convolutional deep network, which is able to handle inputs of generic size. Furthermore, fully convolutional architectures preserve the spatial information in both the encoder and the decoder sections, which is crucial to fully exploit local structure information.
The encoder section is composed of a series of convolutional layers, which sequentially apply learned filters to their input to compute the features.
To extract higher level features, the input is downsampled multiple times. To scale inputs, we use max pooling.
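Max pooling can be sketched as follows (a minimal NumPy illustration of the downsampling step, not the paper's implementation):

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling: keep the max of each size x size tile."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]              # drop ragged borders
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.array([[1., 2., 5., 3.],
              [4., 0., 1., 2.],
              [7., 8., 0., 1.],
              [2., 3., 4., 6.]])
print(max_pool2d(x))   # [[4. 5.]
                       #  [8. 6.]]
```

Each 2x2 tile is reduced to its maximum, halving the spatial resolution while keeping the strongest responses.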
The decoder section is composed of a stack of transposed convolutional layers that learn to reconstruct the pixel-wise predictions of the DEM image from the features computed in the encoder section. Differently from the encoder section, instead of using unpooling layers to reverse the pooling operations, we take advantage of transposed convolutional layers to learn an effective upsampling strategy.
All the convolutional layers in the encoder section have rectified-linear (ReLU) activation functions, except for the last one (Conv6), which has a linear activation function.
The decoder section has five
Transposed Convolutional (T-Conv) layers to decode the features extracted by the first section of the network. The last T-Conv layer performs an upsampling by striding the convolution operations by a factor of 4. All the T-Conv layers use parametric rectified-linear unit (PReLU) activations to allow for negative activations during the decoding phase. Finally, a single-channel convolutional layer with a linear activation outputs the predicted DEM image.
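A strided transposed convolution followed by a PReLU can be sketched in NumPy as below (single channel, no padding; an illustration of the upsampling mechanics, not the trained layers):

```python
import numpy as np

def transposed_conv2d(x, kernel, stride):
    """Single-channel 2-D transposed convolution (no padding).

    Each input pixel scatters a copy of the kernel, scaled by its value,
    onto the output grid at stride-spaced positions; overlapping
    contributions are summed.  Output size: (n - 1) * stride + k.
    """
    n, m = x.shape
    k = kernel.shape[0]
    out = np.zeros(((n - 1) * stride + k, (m - 1) * stride + k))
    for i in range(n):
        for j in range(m):
            out[i * stride:i * stride + k,
                j * stride:j * stride + k] += x[i, j] * kernel
    return out

def prelu(x, alpha=0.1):
    """Parametric ReLU: identity for x >= 0, alpha * x otherwise."""
    return np.where(x >= 0, x, alpha * x)

feat = np.arange(4.0).reshape(2, 2)   # toy 2x2 feature map
kern = np.ones((2, 2))                # a (normally learned) 2x2 kernel
up = prelu(transposed_conv2d(feat, kern, stride=2))
print(up.shape)   # (4, 4): stride-2 transposed conv doubles the size
```

With stride equal to the kernel size, each input pixel fills its own tile of the output; smaller strides make the tiles overlap and the kernel learns how contributions blend.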
All the convolutional filters are regularized with an L2 penalty to prevent overfitting.
The objective minimized during the learning phase is the pixel-wise root mean squared error (RMSE) between the estimated and the GT-DEM images:

RMSE = sqrt( (1/N) * Σ_{i=1}^{N} (d̂_i − d_i)² )

where N is the number of pixels of the DEM image, d_i is the ground-truth elevation at pixel i and d̂_i its estimate.
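The loss above is straightforward to compute; a NumPy sketch with a toy example:

```python
import numpy as np

def rmse(pred, gt):
    """Pixel-wise RMSE between predicted and ground-truth DEMs."""
    return np.sqrt(np.mean((pred - gt) ** 2))

gt = np.array([[10.0, 20.0],
               [30.0, 40.0]])
pred = gt + 3.0          # a uniform 3 m bias in the estimate
print(rmse(pred, gt))    # 3.0
```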
In this section, we describe the experiments we ran to validate the proposed CNN-based DEM estimation approach. In the following, we first describe the experimental setup, providing details about the datasets used, the preprocessing procedures and the CNN training. Afterwards, we discuss the results and draw conclusions.
We test our approach on three different datasets, namely the Alps, the California and the Tucson datasets. The SLC images and the associated GT DEMs are depicted in Figures 9-9, 9-9 and 9-9, respectively.
These datasets are taken from the Sentinel mission of the European Space Agency. In particular, we use three different acquisitions observing the Alps (Italy), California (USA) and the city of Tucson (USA). The data are Single Look Complex (SLC) and vertical-vertical (VV) polarized. Each acquisition is composed of an SLC image with the associated DEM (GT-DEM) computed with standard InSAR techniques. The SLC images are provided as a large complex matrix (typically 12000x20000 entries), while the GT-DEM is a real-valued matrix with the same size as its corresponding SLC image.
To learn the CNN model, we generate the training and test samples by sliding a 4000x4000 window over the SLC/GT-DEM pair. The window has a step of 100 pixels in both the row and column directions. The size of the window is chosen so that each sample contains enough local structure information to allow the CNN to properly estimate the DEM image. Each sample is downsampled to 140x140 pixels to make the learning task tractable. Depending on the size of the input matrices, we generate up to 22000 samples for each dataset (the exact sample numbers are discussed in the following sections). The train-test split is generated by randomly selecting 65% of the samples for training and 35% for testing.
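The sample generation step can be sketched as follows (naive decimation is used for the downsampling, since the paper does not specify the resampling filter; function and variable names are illustrative):

```python
import numpy as np

def sliding_windows(slc, dem, win=4000, step=100, out=140):
    """Generate (SLC patch, GT-DEM patch) sample pairs.

    A win x win window slides over the co-registered SLC/GT-DEM pair
    with the given step in both row and column directions; each patch
    is reduced to out x out pixels by decimation and cropping.
    """
    rows, cols = slc.shape
    dec = max(1, win // out)                      # decimation factor
    for r in range(0, rows - win + 1, step):
        for c in range(0, cols - win + 1, step):
            s = slc[r:r + win, c:c + win][::dec, ::dec][:out, :out]
            d = dem[r:r + win, c:c + win][::dec, ::dec][:out, :out]
            yield s, d

# Toy example: a 10x10 scene, 4x4 windows with step 3, reduced to 2x2.
slc = np.random.randn(10, 10) + 1j * np.random.randn(10, 10)
dem = np.random.randn(10, 10)
pairs = list(sliding_windows(slc, dem, win=4, step=3, out=2))
print(len(pairs), pairs[0][0].shape)   # 9 (2, 2)
```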
3.2 Training details
The CNN is trained using the Adam optimizer, setting the learning rate, the exponential decay rates β1 and β2 for the moment estimates, and the constant ε. All the L2 regularizer values of the convolutional layers are set to the same constant. The batch size is set to 128 for all the experiments and the training set is randomly shuffled at the end of each epoch. Each model is trained for 500 epochs, which takes approximately four hours on a desktop workstation equipped with a Titan Xp GPU. Once the model is learnt, predictions run very fast at test time: the DEM image associated with a 4000x4000 SLC subwindow is computed in a few milliseconds.
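For completeness, a single Adam update step looks like the following (the hyperparameter values shown are the common defaults from Kingma and Ba, not the specific values used in this work):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam parameter update (scalar or array parameters)."""
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Sanity check: minimize (theta - 3)^2 starting from theta = 0.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * (theta - 3.0), m, v, t, lr=0.1)
```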
Figure 22 shows examples of the real DEMs and the estimated ones for each dataset. In addition, the elevation profiles are plotted for two sample ranges (in pixels), in order to better show the estimation properties of the network. The Alps dataset is composed of train images and test images, and the average RMSE on all test images is m. The California dataset is composed of train images and test images; the average RMSE in this case is m. Finally, the Tucson dataset has train and test images; in this test, the average RMSE is m.
Qualitatively, the network learned the altitude statistics, giving a result that closely resembles the ground truth. The main difference is a smoothing effect in the network estimate compared with the original. This is more evident if we compare the GT and predicted profiles for fixed range values (in pixels with respect to the image coordinates) of 30 and 120 pixels, respectively. This is shown in Figures 22-22, 22-22 and 22-22 for the Alps, the California and the Tucson datasets, respectively. From the profiles it is even more apparent that the network is able to output a digital elevation model that closely resembles the original. The general trends of the GT DEM are closely followed, and the main differences between GT and prediction are due to the smoothing of crest ripples, since, for the network, these ripples in the ground truth resemble a high-frequency signal (noise) superimposed on the general elevation model. To better quantify the performance of the network, we quantized the range of elevations in the datasets and computed the average error for each bin, in order to analyse the error distribution given the GT elevation. The resulting plots are shown in Figures 26, 26 and 26 for the Alps, the California and the Tucson datasets, respectively. The three plots, together with the ones in Figure 22 and the average RMSE on the test sets, show that the estimation performance degrades when the terrain is mountainous, while the estimates are close to the real DEM for slowly varying terrain features. This is expected, since the altitude information is not directly included in a single-pass radar image, so the network has to extract it from context-level information. We hypothesize that increasing the amount of data given to the network would further reduce the errors on the high-frequency ripples.
Furthermore, devising more complex architectures should also help to better model the variability of high crests.
In this paper, we have proposed a novel method able to estimate DEMs from single SAR images instead of interferometric pairs. The proposed method uses a data-driven approach, implemented through an encoder-decoder CNN architecture, and is potentially able to solve the layover indetermination present in a single SLC SAR image using image context information. Our results show that the method is promising, being able to learn useful DEM estimates even with moderate training time and data. A set of Sentinel data has been used to train the CNN.
-  Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  Filippo Biondi. MCA-SAR-Tomography. In EUSAR 2016: 11th European Conference on Synthetic Aperture Radar, Proceedings of, pages 1–4. VDE, 2016.
-  Fabio Bovenga, Dominique Derauw, Fabio Michele Rana, Christian Barbier, Alberto Refice, Nicola Veneziani, and Raffaele Vitulli. Multi-chromatic analysis of SAR images for coherent target detection. Remote Sensing, 6(9):8822–8843, 2014.
-  C Colesanti, F De Zan, A Ferretti, C Prati, and F Rocca. Generation of DEM with sub-metric vertical accuracy from 30' ERS-ENVISAT pairs. In Proc. FRINGE 2003 Workshop, Frascati, Italy, pages 1–5, 2003.
-  Mauro Coltelli. Generation of digital elevation models by using SIR-C/X-SAR multifrequency two-pass. IEEE Transactions on Geoscience and Remote Sensing, 34(5), 1995.
-  A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
-  Sergi Duque, Helko Breit, Ulrich Balss, and Alessandro Parizzi. Absolute height estimation using a single TerraSAR-X staring spotlight acquisition. IEEE Geoscience and Remote Sensing Letters, 12(8):1735–1739, 2015.
-  David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
-  David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pages 2366–2374, 2014.
-  Charles V Jakowatz and P Thompson. A new look at spotlight mode synthetic aperture radar as tomography: imaging 3-D targets. IEEE Transactions on Image Processing, 4(5):699–703, 1995.
-  Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. 2015.
-  Soren N Madsen and Howard A Zebker. Automated absolute phase retrieval in across-track interferometry. 1992.
-  M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia. Fast robust monocular depth estimation for obstacle detection with fully convolutional networks. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4296–4303, Oct 2016.
-  M. Mancini, G. Costante, P. Valigi, T. A. Ciarfuglia, J. Delmerico, and D. Scaramuzza. Toward domain independence for learning-based monocular depth estimation. IEEE Robotics and Automation Letters, 2(3):1778–1785, July 2017.
-  Gerard Margarit, Jordi J Mallorqui, and Xavier Fabregas. Single-pass polarimetric SAR interferometry for vessel classification. IEEE Transactions on Geoscience and Remote Sensing, 45(11):3494–3502, 2007.
-  Joao Moreira, Marcus Schwabisch, Gianfranco Fornaro, Riccardo Lanari, Richard Bamler, Dieter Just, Ulrich Steinbrecher, Helko Breit, Michael Eineder, Giorgio Franceschetti, et al. X-SAR interferometry: first results. IEEE Transactions on Geoscience and Remote Sensing, 33(4):950–956, 1995.
-  David C Munson, James D O’Brien, and W Kenneth Jenkins. A tomographic formulation of spotlight-mode synthetic aperture radar. Proceedings of the IEEE, 71(8):917–925, 1983.
-  Anirban Roy and Sinisa Todorovic. Monocular depth estimation using neural regression forest. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  Ashutosh Saxena, Sung H Chung, and Andrew Y Ng. Learning depth from single monocular images. In Advances in neural information processing systems, pages 1161–1168, 2006.
-  Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3D: learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2009.
-  Jack L Walker. Range-Doppler imaging of rotating objects. IEEE Transactions on Aerospace and Electronic Systems, (1):23–52, 1980.
-  Howard A Zebker and Richard M Goldstein. Topographic mapping from interferometric synthetic aperture radar observations. Journal of Geophysical Research: Solid Earth, 91(B5):4993–4999, 1986.