Over the last few years, we have experienced a rising interest in applications of remote sensing—geospatial monitoring from space and airborne platforms. Today’s sensors have seen a rapid and large improvement in spatial and spectral resolution, which has expanded the capabilities from observation of geological, atmospheric, and vegetation phenomena to applications such as the extraction of high-resolution elevation models and maritime vessel tracking [9, 16, 4, 27, 3].
One specific remote sensing method, synthetic aperture radar (SAR) imaging, uses a moving, side-looking radar which transmits electromagnetic pulses and sequentially receives the back-scattered signal. This time series of amplitude and phase values contains information on the relative location, surface geometry/roughness, and permittivity of the reflecting objects. The frequency of the transmitted electromagnetic signal defines the penetration depth into the soil, and the bandwidth tunes the geometric resolution. SAR is particularly interesting for a wide range of applications as it is, as opposed to visual or multispectral imaging methods, resilient to weather conditions and cloud coverage and independent of the lighting conditions (night, dusk/dawn). By a mere overflight of such a sensor aboard a satellite, airplane, or drone, data can be collected for crop growth monitoring, ocean wave height measurement, iceberg tracking, biomass estimation, snow monitoring, and maritime vessel tracking.
This wide range of applications and the high reliability have led to a rapid increase in SAR imaging satellites orbiting the Earth. It has fueled the rise of newly founded data providers such as ICEYE and Capella Space, which intend to provide such imaging data for commercial purposes at resolutions of up to 0.5 m/px and with a schedule to reach hourly re-visitation rates of any point globally within the next few years.
Given the growing amount of SAR data, it is critical to develop automated analysis methods, possibly in real-time pipelines. Deep neural networks (DNNs) play a key role in this automation effort. They have become the method of choice for many computer vision applications over the last few years, pushing the accuracy far beyond previous methods and exceeding human accuracy on tasks such as image recognition. DNNs have also shown state-of-the-art performance in image segmentation in application scenarios from road segmentation for autonomous driving to tumor segmentation in medical image analysis and road network extraction from visual imagery.
The main contributions of this work are three-fold: 1) presenting a dataset with high-resolution (0.15 m/px) SAR imaging data and ground truth annotations for urban land-use segmentation, 2) proposing and evaluating deep neural network topologies for automatic segmentation of such data, and 3) providing a detailed analysis of the segmentation results and of which input data/features are most beneficial for the quality of results, reducing the error rate from 16.0% to 4.8% relative to related work for a similar task and reaching a mean IoU of 74.67%.
II Related Work
In this section, we first introduce existing SAR datasets and then provide an overview of current analysis methods.
II-A SAR Datasets
Probably the best-known SAR dataset for classification and detection purposes is the MSTAR dataset of the U.S. Air Force from 1995/1996 for target recognition based on 30 cm/px X-band images with HH polarization. It contains images with to px of 15 types of armored vehicles with a few hundred images each and a few large scenes with which more complex scenes can be composed for detection tasks. Nevertheless, such detection tasks are not very hard, as they involve clearly distinct objects in an open field and often simplified classes (e.g., personnel carrier/tank/rocket launcher/air defense unit). State-of-the-art methods achieve a 100% detection and recognition rate on such generated detection problems [5, 29].
Another large SAR dataset is OpenSARShip 2.0 [9]. It uses 41 images of the ESA’s Sentinel-1 mission and includes labels of ships of various types, of which 8470 are cargo ships, and 1740 are tankers. It uses C-band SAR to generate 10 m/px resolution images.
Several recent SAR datasets focus on providing matching pairs of SAR and visual images: the TUM/DLR SEN1-2 dataset from 2018 [18] provides 10 m/px data from Sentinel-1 and 2, and the SARptical dataset (also 2018) [24] provides TerraSAR-X data at a 1 m/px resolution including extracted 3D point clouds and matching visual images.
Other datasets such as [1] or JAXA’s ALOS PALSAR forest/non-forest [20] (18 m/px) typically come with a much lower resolution (>10 m/px) and are often intended for applications in large-scale land cover analysis with classes such as water/forest/urban/agriculture/bareland. Most SAR datasets, particularly more recent ones with improved resolution, do not come with any ground truth annotations. Such datasets include the SLC datasets of ICEYE with 1 m/px resolution X-band SAR images with a VV polarization, publicly available images of the DLR/Airbus TerraSAR-X satellite, etc. Such datasets must first be combined with and registered to existing labels, or annotated with ground truth labels.
II-B SAR Data Analysis Methods
Traditional signal processing methods still dominate the field of SAR data analysis. In some cases, properties specific to SAR data and to particular sensor configurations are exploited, e.g., for change detection [14, 13]. Tasks focusing on semantic segmentation are also widely addressed using statistical models and engineered features [2, 6, 21].
DNNs have achieved excellent results segmenting visual data, including satellite images, for geostatistical uses as well as disaster-relief (e.g., road passability estimation) [16, 4, 27, 3]. To leverage these results, several researchers have created datasets and proposed methods for SAR-to-visual translation and SAR data generation from visual images using generative adversarial networks (GANs) or using DNNs on visual data to generate labels for training DNNs on SAR data [7, 22, 11]. These works provide helpful methods to rapidly obtain labels for SAR datasets or enlarge such datasets with generated samples at a reduced cost. However, the resulting data is inherently less accurate than systematically collected ground truth labels.
Several efforts have also been undertaken towards urban land-use segmentation. In [28], the authors compare the building segmentation accuracy of an atrous ResNet-50 DNN using OSM ground truth when trained on TerraSAR-X data with a 2.9 m/px resolution and when trained on visual data extracted from Google Earth, achieving pixel-wise accuracies of 74% and 82.9%, respectively. In [19], the authors propose an FCN-style [12] network for building segmentation and achieve a pixel accuracy of 78%–92%. In [8], the authors segment roads using the FCN-8s and DeepLabv3+ networks on 1.25 m/px TerraSAR-X images, achieving pixel accuracies of 43%–46%. They found the OSM annotations to be too inaccurate and manually labeled all the roads based on human analysis of the SAR data. Further, they introduced a spatial tolerance for the road classification, not considering a stripe of 1–8 around the boundary between road and non-road segments.
III Dataset Creation
III-A SAR Data
The data was recorded in autumn 2017 by armasuisse and the University of Zurich’s SARLab during two flights over Thun, Switzerland, with an off-nadir angle of 55° and with a 35 GHz carrier (Ka-band) using the Fraunhofer/IGI Miranda-35 sensor. From the recorded data, a rasterized “image” with a 0.15 m/px resolution was generated, including phase information. The system contains 4 channels, each with data recorded from a separate receiving antenna. We do not apply any preprocessing methods to compensate for foreshortening or layover, or to reduce the speckle noise [15]. The resulting images each cover an area of 2.2 km² with 97.17 Mpx and have been recorded once from the left and once from the right, with most of the recorded regions of interest overlapping (91.8 Mpx). The latter reduces the areas not visible to the radar due to occlusion (cf. Fig. 1(b) and Fig. 1(c)).
High-quality ground truth data is crucial to train effective DNNs: it not only severely affects the quality of the results, but also has a substantial impact on training time and generalization of the trained model [30]. For this work, we consider three classes (building, road, other) and allow for some pixels to be annotated as unlabeled.
We have used OSM data on buildings and roads to generate the ground truth segmentation. In order to further improve the quality, we fused this information with the building shape annotations of the government-created swisstopo swissTLM3D map. Fig. 1(e) visualizes the annotations of both data sources on a tile of the dataset. The annotations mostly overlap, but significant differences exist, with one of the sources sometimes missing a building entirely. Visual inspection showed that neither of these sources reliably captures all buildings. We have thus merged the annotations to only classify a pixel as building or not-a-building where the sources agree, and left the pixels with unclear annotations as unlabeled (cf. Fig. 1(e)).
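The agreement rule for the building annotations can be illustrated with a minimal sketch. It assumes that both map sources have already been rasterized to boolean masks on the same pixel grid; the label ids and the function name `fuse_building_masks` are hypothetical and not taken from this work.

```python
# Hypothetical label ids, chosen only for this illustration.
BUILDING, NOT_BUILDING, UNLABELED = 1, 0, 255


def fuse_building_masks(osm_mask, topo_mask):
    """Fuse two rasterized building masks of identical shape.

    A pixel is labeled only where both sources agree; pixels on which
    the sources disagree are marked as unlabeled.
    """
    fused = []
    for osm_row, topo_row in zip(osm_mask, topo_mask):
        row = []
        for in_osm, in_topo in zip(osm_row, topo_row):
            if in_osm and in_topo:
                row.append(BUILDING)
            elif not in_osm and not in_topo:
                row.append(NOT_BUILDING)
            else:
                row.append(UNLABELED)  # sources disagree
        fused.append(row)
    return fused
```

Unlabeled pixels would then be excluded from the loss and from the evaluation metrics, so that disagreement between the maps does not translate into wrong training targets.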
The roads captured in OSM, as well as the swisstopo data, are provided as line segments with a rank annotation (for OSM: motorway, path, pedestrian, platform, primary, residential, secondary, service, steps, tertiary, track, bus stop, cycleway, footway, living street, unclassified). While this information aids in estimating the width of the road required to create the segmentation ground truth, a very wide variance remains within each category. We thus assign two widths to each type of road: an expected maximum width and a minimum width. The latter is used to annotate pixels as road, whereas the former is used to mark the surrounding pixels as unlabeled, similar to [8].
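The two-width rule can be sketched for a single road segment as follows. This is only an illustration under stated assumptions: coordinates are in meters, the widths are treated as full widths centered on the line segment, and the function names as well as the example widths in the usage note are hypothetical.

```python
import math


def distance_to_segment(point, start, end):
    """Euclidean distance from a point to a line segment (all in meters)."""
    (px, py), (ax, ay), (bx, by) = point, start, end
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Projection of the point onto the segment, clamped to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = min(1.0, max(0.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def label_pixel(point, segment, min_width, max_width):
    """Label one pixel center relative to one road center line."""
    distance = distance_to_segment(point, segment[0], segment[1])
    if distance <= min_width / 2.0:
        return "road"       # within the minimum road width
    if distance <= max_width / 2.0:
        return "unlabeled"  # might or might not be road
    return "other"
```

For example, with a hypothetical minimum width of 3 m and maximum width of 8 m, a pixel 1 m from the center line is labeled road, a pixel 3 m away is left unlabeled, and a pixel 5 m away falls into the class other.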
With roads of this many different ranks and sizes, some of them are of minor relevance and hard to recognize due to the small size. We thus create three different annotations, which we then evaluate experimentally in Sec. V:
Annotating all the roads from OSM including the minor ones, and
Combining the annotations from OSM and swissTLM3D using the same rule for fusing the data as for the buildings: labeling roads only where both sources agree (cf. Fig. 1(g) and Fig. 1(h)). We can see that some roads, especially the minor ones, have a significant offset between the two map sources, leaving some of them entirely unlabeled.
IV-A DNN Topology
We show the proposed DNN topology in Fig. 2. After a first 2D convolution from one or more input feature maps (depending on the selected features) to 16, the feature maps are passed through one of our basic building blocks: 3 parallel convolutions with 16 output feature maps each and dilation factors of 1, 2, and 4, respectively, whose results are concatenated, followed by a convolution to reduce the number of feature maps back to 16. This allows capturing information from various context sizes in each such block. This is followed by the common U-Net structure introduced in [17]: the feature maps are fed through several convolution layers, some of which are strided, followed by several transposed convolution layers to recover the resolution. After each of these, the last of the previous sets of feature maps with matching resolution is concatenated to recover the high-resolution information and thereby allow fine-grained segmentation.
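To illustrate why parallel convolutions with dilation factors 1, 2, and 4 capture different context sizes, the following sketch reduces the idea to one dimension and a single feature map. It is not the network itself: the kernel size of 3, the zero padding, and all function names are assumptions made for this illustration.

```python
def dilated_extent(kernel_size, dilation):
    """Number of input samples spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, 'same'-sized 1-D correlation with a dilated kernel."""
    half = (len(kernel) - 1) // 2
    output = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + (k - half) * dilation  # dilated tap position
            if 0 <= j < len(signal):
                acc += weight * signal[j]
        output.append(acc)
    return output


def parallel_dilated_block(signal, kernel, dilations=(1, 2, 4)):
    """One output row per dilation factor, standing in for concatenation."""
    return [dilated_conv1d(signal, kernel, d) for d in dilations]
```

With a kernel of size 3, the three branches span 3, 5, and 9 input samples, respectively, at the same number of weights per branch.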
The presented network is light-weight in terms of parameter count and computation effort compared to other segmentation networks. Due to the low number of feature maps throughout the network, the small filter sizes, and the use of dilated convolutions and striding to cover the required field of view, it has merely 63k parameters. It requires 13k multiply-accumulate (MAC) operations per pixel for inference (note that the stride of the first convolution layer can be brought forward to the dilated convolution layer with identical results), which means that for a relatively large input of roughly 1 Mpx it requires 13.6G MAC operations. For comparison, recent GPUs such as Nvidia’s RTX 2080 Ti can perform 6.7T single-precision MAC operations per second and can thus, in a very rough approximation, process 515 Mpx/s.
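The throughput estimate above is plain arithmetic and can be reproduced with a short sketch. The two constants are the figures quoted in the text; the function names are hypothetical, and the result is a peak-rate bound that ignores memory transfers and utilization.

```python
MAC_PER_PIXEL = 13e3          # MAC operations per pixel, as quoted above
GPU_MAC_PER_SECOND = 6.7e12   # single-precision MAC/s, as quoted above


def macs_for_image(num_pixels):
    """Total MAC operations for one inference on an image."""
    return MAC_PER_PIXEL * num_pixels


def pixels_for_macs(total_macs):
    """Image size (in pixels) that corresponds to a given MAC budget."""
    return total_macs / MAC_PER_PIXEL


def peak_throughput_px_per_s():
    """Upper bound on the processed pixels per second at peak MAC rate."""
    return GPU_MAC_PER_SECOND / MAC_PER_PIXEL
```

Under these assumptions, `peak_throughput_px_per_s()` evaluates to about 515 Mpx/s, and `pixels_for_macs(13.6e9)` to about 1.05 Mpx.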
IV-B Feature Extraction
Our overall SAR data comprises two recordings (one from the left, one from the right), each of which has multiple channels (receive antennas) as well as magnitude and phase information. We extract the following features:
Magnitude information: The main information, and a visually human-interpretable image, is obtained by 1) compressing the magnitude values of any of the SAR images of all the recordings and channels to log-scale, 2) clamping the maximum value at the 99th percentile, and 3) limiting the range to 25 dB. An image after such preprocessing is shown in Fig. 1(b).
Phase information: Each pixel also records the phase of the reflected signal. Such data can be used to obtain further information on the relative distance. In order to convert it to a real-valued feature map, we use either the real and imaginary parts of the length-normalized complex vector of each pixel (named cos and sin phase hereafter), or the real and imaginary parts of the dB-scaled magnitude rotated by the phase of the original pixel value (named re/im hereafter).
Phase difference information: As opposed to the other features, which can be applied to each recording and channel individually, we here take the phase difference of each pixel between two channels (channels 1 and 4, unless specified otherwise). Visually, such a feature map is indicative of radar shadows and structure height. We convert this data into two real-valued feature maps identically to the phase information. The cosine/real component of the phase difference is shown in Fig. 1(d).
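The three feature types can be sketched per pixel as follows, treating an image as a flat list of complex values. Several details are assumptions not stated in the text: the nearest-rank percentile estimator, the small offset that avoids the logarithm of zero, the rescaling of the 25 dB range to [0, 1], and all function names.

```python
import cmath
import math


def magnitude_feature(pixels, percentile=99.0, range_db=25.0, eps=1e-12):
    """Log-scale the magnitudes, clamp at a percentile, limit the range."""
    db = [20.0 * math.log10(abs(z) + eps) for z in pixels]
    ranked = sorted(db)
    # Nearest-rank percentile (assumed estimator).
    rank = max(1, math.ceil(percentile / 100.0 * len(ranked)))
    top = ranked[rank - 1]
    floor = top - range_db
    # Clamp to [floor, top]; the rescaling to [0, 1] is an assumption.
    return [(min(max(v, floor), top) - floor) / range_db for v in db]


def cos_sin_phase(z):
    """Real and imaginary part of the length-normalized complex value."""
    magnitude = abs(z)
    if magnitude == 0.0:
        return 0.0, 0.0  # phase undefined; assumed convention
    unit = z / magnitude
    return unit.real, unit.imag


def re_im(z, scaled_magnitude):
    """Scaled magnitude rotated by the phase of the original pixel value."""
    rotated = cmath.rect(scaled_magnitude, cmath.phase(z))
    return rotated.real, rotated.imag


def phase_difference(z_a, z_b):
    """cos/sin of the phase difference between two channels at one pixel."""
    return cos_sin_phase(z_a * z_b.conjugate())
```

Multiplying by the complex conjugate yields a value whose phase is the difference of the two input phases, so the phase difference can reuse the same conversion as the phase itself.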
By combining the magnitude feature map (FM) with the 4 phase FMs for each channel and both recordings, plus the two phase difference FMs for each recording, we obtain a total of 44 FMs. However, these features are redundant, require a sizable amount of memory, and might only lead to overfitting. We thus perform feature selection based on the experimental results in Sec. V.
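The count of 44 FMs follows from enumerating all combinations, as the following sketch shows. The feature names and the naming scheme are hypothetical; only the counts (2 recordings, 4 channels, 1 magnitude plus 4 phase FMs per channel, 2 phase difference FMs per recording) are taken from the text.

```python
def list_feature_maps(recordings=("left", "right"),
                      channels=(1, 2, 3, 4),
                      diff_pair=(1, 4)):
    """Enumerate all candidate feature maps by a descriptive name."""
    per_channel = ("magnitude", "cos_phase", "sin_phase", "re", "im")
    per_recording = ("cos_phase_diff", "sin_phase_diff")
    names = [f"{rec}/ch{ch}/{feat}"
             for rec in recordings
             for ch in channels
             for feat in per_channel]
    names += [f"{rec}/ch{diff_pair[0]}-ch{diff_pair[1]}/{feat}"
              for rec in recordings
              for feat in per_recording]
    return names
```

The enumeration yields 2 × 4 × 5 = 40 per-channel FMs plus 2 × 2 = 4 phase difference FMs.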
V Results & Discussion
|exp.||1||2||3||4||5||6||7||8||9||10||11||12||13||14||15||16||17|
|pixel accuracy [%]||89.26||85.97||83.80||85.78||89.37||85.22||90.50||86.85||90.03||84.89||86.64||91.89||91.78||90.81||90.11||95.19||93.49|
|mean accuracy [%]||76.31||66.52||63.76||66.14||72.89||65.00||80.78||68.22||71.43||65.45||67.65||75.12||79.00||72.21||72.29||90.30||86.12|
|mean IoU [%]||56.92||59.02||55.33||57.03||64.15||55.48||68.43||59.34||65.79||58.09||60.21||70.18||69.69||66.35||66.05||74.67||69.90|
Including swisstopo annotations for roads.
Without class balancing.
V-A Experimental Setup
For our experiments, we have split the SAR images into spatial tiles of
. The resulting 86 tiles are split into a training set of 64 tiles (74%) and a test set of 22 tiles (26%). The networks were trained with PyTorch using the Adam optimizer. Starting from an initial learning rate of , we used a reduce-on-plateau learning rate schedule (factor: 0.1, patience: 10, relative threshold:
). The batch size was 8 and split across 4 RTX 2080 Ti GPUs. We use the cross-entropy loss function for all experiments. Unless otherwise noted, we have weighted the classes in the loss function to compensate for the inherent class imbalance. The experiments have, in general, fully converged after a maximum of 80 epochs, corresponding to an average training time of merely 8 minutes. Following related work, we use the metrics pixel accuracy (PA), mean accuracy (MA), and mean intersection-over-union (mIoU),
$$\mathrm{PA} = \frac{\sum_i n_{ii}}{\sum_i \sum_j n_{ij}}, \quad \mathrm{MA} = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{\sum_j n_{ij}}, \quad \mathrm{mIoU} = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{\sum_j n_{ij} + \sum_j n_{ji} - n_{ii}},$$
where $n_{ij}$ is the number of pixels with target class $i$ and predicted class $j$, and $n_{cl}$ is the number of classes.
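The three metrics can be computed from a confusion matrix, as in the following minimal sketch. It assumes the common definitions from the segmentation literature, with `n[i][j]` denoting the number of pixels with target class `i` and predicted class `j`, unlabeled pixels excluded beforehand, and every class occurring at least once. The function name is hypothetical.

```python
def segmentation_metrics(n):
    """Return (pixel accuracy, mean accuracy, mean IoU) for matrix n."""
    num_classes = len(n)
    correct = [n[i][i] for i in range(num_classes)]
    # Pixels per target class (row sums) and per predicted class (column sums).
    target = [sum(n[i]) for i in range(num_classes)]
    predicted = [sum(n[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    pixel_accuracy = sum(correct) / sum(target)
    mean_accuracy = sum(c / t for c, t in zip(correct, target)) / num_classes
    # IoU per class: intersection over (target + predicted - intersection).
    mean_iou = sum(c / (t + p - c)
                   for c, t, p in zip(correct, target, predicted)) / num_classes
    return pixel_accuracy, mean_accuracy, mean_iou
```

Pixel accuracy is dominated by the most frequent class, whereas mean accuracy and mean IoU weight all classes equally, which is why the three metrics can rank the same experiments differently.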
V-B Feature Selection
We have trained and evaluated the network for many combinations of features. An overview of the results is shown in Tbl. I. For the phase information features, we can observe that in each of the experiment groups 1–4, 5–8, 9–11, and 12–13, providing phase difference information or phase information in the cos/sin representation consistently resulted in worse results than using only the magnitude features. Using only the re/im features without the magnitude feature has shown similar performance to providing the magnitude feature map, although generally slightly worse with the exception of the exp. group 5–8. For the remaining analyses, we thus use the magnitude features solely.
As for using multiple channels, we see a clear accuracy gain comparing experiments 1 to 5 and 9 to 12 of 1.73% and 1.86%, respectively. Similarly, for using data from both flights, we observe improvements by 2.39% and 2.52% from experiments 1 to 9 and 5 to 12.
V-C Ground Truth Selection
In Sec. III-B, we have left the decision on which annotations to use for the roads to experimental evaluation. Exp. 14 and 15 include the swisstopo road annotations in addition to the OSM data used for the other experiments. However, they do not further improve the accuracy but instead reduce it. Each additional road comes with several pixels marked as unlabeled in its surroundings, and roads are only labeled as such if the ground truths agree. This might leave the impression of simplifying the classification task. However, requiring the agreement of both annotation sources also implies removing some valuable labeled road pixels from the training data. We thus attribute this small accuracy drop to the latter effect outweighing the slight simplification of the task and proceed using only OSM annotations to label the roads.
|resolution||2.9 m/px||2.9 m/px||0.5 m/px||1.25 m/px||0.15 m/px|
|labels||OSM||OSM||human||human||OSM & swisstopo|
|DNN type||Atrous-ResNet50||Atrous-ResNet50||FCN||FCN||mod. U-Net|
|qualitative acc.||bad||bad||ok||very good||very good|
V-D Overall Accuracy and Class Balancing
Following our insights on which features to use from Sec. V-B, we use data from all channels and both flights but no phase information, to achieve a pixel accuracy of 91.89% and a mIoU of 70.18%. If we do not compensate for the class imbalance, the pixel accuracy and the mIoU further increase to 95.19% and 74.67%. For a more qualitative analysis, we provide some example results from the best predictor with and without class balancing alongside the ground truth information in Fig. 3. Particularly for the network trained with class balancing, we can see an outstanding segmentation quality. The buildings are segmented very well, and most misclassified pixels are observed where segments classified as road leak into the driveways or where very small clusters of pixels in the backyards are marked as road. For the class-balanced network, the road class is assigned much less frequently, as a misclassification as other is penalized much more heavily, and the pixels labeled as road are much fewer, particularly on smaller roads. As the class road is not as strictly defined as building (private roads, forecourts, and driveways are labeled as other although they are not distinguishable from their public counterparts to a human annotator either), the decision of the resulting classifier can be expected and observed to be less confident as well.
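Class balancing in the loss can be sketched as a weighted cross-entropy. The text does not state how the class weights are chosen, so the inverse-frequency weights below are an assumption, as are the id used for unlabeled pixels, the normalization by the sum of weights, and the function names.

```python
import math


def inverse_frequency_weights(labels, num_classes, unlabeled=-1):
    """Weight each class inversely to its pixel count (assumed scheme)."""
    counts = [0] * num_classes
    for y in labels:
        if y != unlabeled:
            counts[y] += 1
    total = sum(counts)
    return [total / (num_classes * c) if c else 0.0 for c in counts]


def weighted_cross_entropy(logits, labels, weights, unlabeled=-1):
    """Weighted mean cross-entropy over all labeled pixels."""
    weighted_loss, weight_sum = 0.0, 0.0
    for scores, y in zip(logits, labels):
        if y == unlabeled:
            continue  # unlabeled pixels do not contribute to the loss
        peak = max(scores)  # subtracted for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        weighted_loss += weights[y] * (log_norm - scores[y])
        weight_sum += weights[y]
    return weighted_loss / weight_sum if weight_sum else 0.0
```

With such weights, errors on rare classes such as road cost more than errors on the dominant class, which trades pixel accuracy against per-class accuracy.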
V-E Comparison to Related Work
We compare our results to related work in Tbl. II. We see a vast improvement in accuracy, reducing the error rate from 16% to 4.8% on a similar task. Similarly, we observe gains in mean accuracy and mIoU. A qualitative comparison of the segmentation outputs shows vast improvements (cf. Fig. 3 and [28, 8, 25]). The main differences to the other methods are the fusion of two annotation sources, the vastly increased resolution, the two 4-channel recordings from opposite directions, as well as the optimized DNN.
We have proposed and evaluated a DNN to automatically and reliably perform urban scene segmentation from high-resolution SAR data, achieving a pixel accuracy of 95.2% and a mean IoU of 74.7% with data collected over a region of merely 2.2 km². The presented DNN is not only effective, but also very small with only 63k parameters and computationally simple enough to achieve a throughput of around 500 Mpx/s using a single GPU. We have further identified that additional SAR receive antennas and data from multiple flights massively improve the segmentation accuracy while phase information showed no positive effect. The procedure described for generating a high-quality segmentation ground truth from multiple inaccurate building and road annotations has proven crucial to achieving good segmentation results.
We would like to thank armasuisse Science and Technology for funding this research and for providing—jointly with the University of Zurich’s SARLab—the SAR data used throughout this work.
[1] (2016) Land Cover Mapping Using Sentinel-1 SAR Data. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLI-B7 (July), pp. 757–761.
[2] (2017) SAR Image Segmentation with GMMs. In Proc. IET Radar, pp. 3–6.
[3] Fully Convolutional Neural Network with Augmented Atrous Spatial Pyramid Pool and Fully Connected Fusion Path for High Resolution Remote Sensing Image Segmentation. Applied Sciences 9 (9), pp. 1816.
[4] (2018) Soft computing approaches for image segmentation: a survey. Multimedia Tools and Applications 77 (21), pp. 28483–28537.
[5] (2019) D-ATR for SAR Images Based on Deep Neural Networks. Remote Sensing 11 (8).
[6] (2018) Adaptive Hierarchical Multinomial Latent Model with Hybrid Kernel Function for SAR Image Semantic Segmentation. IEEE Trans. on Geoscience and Remote Sensing 56 (10), pp. 5997–6015.
[7] (2019) SAR-to-Optical Image Translation Based on Conditional Generative Adversarial Networks: Optimization, Opportunities and Limits. Remote Sensing 11 (17).
[8] (2018) Road Segmentation in SAR Satellite Images With Deep Fully Convolutional Neural Networks. IEEE Geoscience and Remote Sensing Letters 15 (12), pp. 1867–1871.
[9] (2018) OpenSARShip: A Dataset Dedicated to Sentinel-1 Ship Interpretation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (1), pp. 195–208.
[10] (2015) Adam: A Method for Stochastic Optimization. In Proc. ICLR.
[11] (2018) Generating simulated SAR images using Generative Adversarial Network. In Proc. SPIE Applications of Digital Image Processing XLI.
[12] (2015) Fully Convolutional Networks for Semantic Segmentation. In Proc. IEEE CVPR.
[13] (2019) A Back-Projection Tomographic Framework for VHR SAR Image Change Detection. IEEE Transactions on Geoscience and Remote Sensing 57 (7), pp. 4470–4484.
[14] (2018) A Multisquint Framework for Change Detection in High-Resolution Multitemporal SAR Images. IEEE Transactions on Geoscience and Remote Sensing 56 (6), pp. 3611–3623.
[15] (2013) A tutorial on synthetic aperture radar. IEEE Geoscience and Remote Sensing Magazine 1 (1), pp. 6–43.
[16] (2019) Road Passability Estimation Using Deep Neural Networks and Satellite Image Patches. In Proc. BiDS.
[17] (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proc. MICCAI, Vol. 9351, pp. 234–241.
[18] The SEN1-2 Dataset for Deep Learning in SAR-Optical Data Fusion. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 4 (1), pp. 141–146.
[19] (2019) Buildings detection in VHR SAR images using fully convolution neural networks. IEEE Transactions on Geoscience and Remote Sensing 57 (2), pp. 1100–1116.
[20] (2014) New global forest/non-forest maps from ALOS PALSAR data (2007–2010). Remote Sensing of Environment 155, pp. 13–31.
[21] (2017) Deep Structured Features for Semantic Segmentation. In Proc. IEEE EUSIPCO.
[22] (2018) Synthetic Aperture Radar Image Generation With Deep Generative Models. IEEE Geoscience and Remote Sensing Letters 16 (6), pp. 912–916.
[23] (2018) Embedded Classification of Local Field Potentials Recorded from Rat Barrel Cortex with Implanted Multi-Electrode Array. In Proc. IEEE BIOCAS.
[24] (2018) The SARptical Dataset for Joint Analysis of SAR and Optical Image in Dense Urban Area. In Proc. IEEE IGARSS, pp. 6840–6843.
[25] PolSAR Image Semantic Segmentation Based on Deep Transfer Learning: Realizing Smooth Classification With Small Training Sets. IEEE Geoscience and Remote Sensing Letters 16 (6), pp. 977–981.
[26] High-Resolution PolSAR Scene Classification with Pretrained Deep Convnets and Manifold Polarimetric Parameters. IEEE Transactions on Geoscience and Remote Sensing 56 (10), pp. 6159–6168.
[27] (2017) W-Net: A Deep Model for Fully Unsupervised Image Segmentation. arXiv:1711.08506.
[28] (2017) Semantic Segmentation using Deep Neural Networks for SAR and Optical Image Pairs. In Proc. Big data from space, pp. 2–5.
[29] (2018) Deep learning model-based algorithm for SAR ATR. In Proc. SPIE Algorithms for Synthetic Aperture Radar Imagery XXV.
[30] (2018) On the Importance of Label Quality for Semantic Segmentation. In Proc. IEEE/CVF CVPR, pp. 1479–1487.