Satellite data and aerial imagery have become increasingly accessible in recent years and are changing our understanding of the planet. To make use of the valuable information contained in the sensory data and raw images captured by satellites, it is necessary to add a layer that reflects abstract geospatial features and real objects. With the success of deep learning, convolutional neural networks (CNNs) have proven to be a powerful approach for extracting such higher-level features and outperform many traditional computer vision methods. Applications of CNNs in this context are manifold, ranging from regression and classification over object detection to semantic and instance segmentation.
One of the key challenges in remote sensing is the processing and fusion of multi-modal data sources. Data fusion aims to integrate information acquired with different spatial resolutions, spectral bands and imaging modes from sensors mounted on satellites, aircraft and ground platforms in order to generate a combined representation that contains more detailed information than each of the individual sources. Fusing complementary information across modalities can lead to substantial synergy effects and improve the overall accuracy [5, 6]. Dedicated datasets and challenges such as the IEEE GRSS Data Fusion Contest have recently been released to push research forward in data fusion and multi-modal segmentation.
However, most prior work on multi-modal data fusion assumes that all modalities are available at inference time. This assumption can greatly limit the applications of multi-modal analysis, because in practice the data collection process is likely to produce data with missing modalities. Within the domain of remote sensing, this is the rule rather than the exception due to effects such as (1) blocking of spectral responses for optical sensors in the presence of clouds, (2) adverse constellations of non-geostationary satellites at particular points in time and (3) corrupted or incomplete sensory data caused by unexpected exposures such as space debris and solar flares. In all of these scenarios, models that have been trained on multi-modal information can no longer be used, and a fallback to a model that uses only the available modality is often necessary.
Within the scope of this work, we consider the problem of missing modalities using building footprint segmentation in a multi-modal setup as an example use case. Our segmentation network, depicted in Fig. 1 (2), is trained on two modalities: optical (RGB) and depth information. Our goal is to apply this model even in the absence of one modality by using a second CNN that creates a synthetic representation of the missing modality. Note that our approach is not restricted to high-resolution data or to the absence of the depth modality. It can generalize to other modalities as well, under the assumption that the two modalities share a common feature space that allows the translation from one modality to the other.
The contributions of this work are as follows:
We show that Generative Adversarial Networks (GANs) can be effectively used to overcome the problems that arise when modalities are missing, incomplete or corrupted at inference time. Focusing on semantic segmentation of building footprints, our approach improves the Intersection over Union (IoU) of the building class by about 2 percentage points over the same network relying only on the available modality.
To the best of the authors' knowledge, this is the first work in the literature that uses GANs not only for data augmentation during training but specifically to solve the problem of incomplete and missing modalities during inference with satellite image data.
II Related Work
Our work is closely related to the following research areas:
II-A Multi-modal Segmentation
Multi-modal data fusion for semantic segmentation is an active research field, and different strategies ranging from early fusion to in-fusion and late fusion have been proposed. Audebert et al. explore different fusion strategies within the context of multi-modal segmentation in remote sensing. In their experiments, fusion with the FuseNet architecture proved to be slightly better than a late fusion approach. A similar idea was explored previously by Huang et al., who use a deconvolutional network with two parallel streams on RGB and NRG band combinations and fuse the predictions of the two streams. Since early fusion via channel stacking achieved results similar to such a fusion approach in our experiments, but with fewer convolutional filters and less computational load, we fuse the modalities in this work via channel stacking.
II-B Overcoming Missing Modalities
While there has been a lot of research on reconstructing incomplete and sparse data via low-rank recovery or factorization, there is little work dealing with the complete absence of one modality during inference. One of the first attempts to overcome this problem is the Hallucination Network proposed by Hoffman et al. In this approach, CNNs are used to extract features for each modality and the predictions are obtained by a late fusion of all modalities. An additional network, the hallucination network, is then trained via regression such that its features resemble those of the network operating on the real modality. If a modality is missing, the network trained on the real modality can be replaced by the corresponding hallucination network. Kampffmeyer et al. successfully applied this approach to satellite data to overcome missing depth information in land use and land cover segmentation.
II-C Generative Adversarial Networks
The GAN framework has achieved impressive results on image generation in recent years. In the GAN setup, a discriminator and a generator play a zero-sum game: the generator tries to fool the discriminator, which in turn tries to distinguish between real and generated data samples. Because training is unstable, various improvements with respect to new training objectives and combinations with other models have been proposed. In this work, we use GANs to capture the data distribution of modalities, such that in case of a missing modality we can generate the corresponding synthetic data and use it for inference.
II-D Image-to-Image Translation
Image-to-image translation is the problem of translating a given representation of a scene from a source domain into the representation of a target domain. Note that the source and target domains do not necessarily need to be different. One of the first unified frameworks, which is also used in our work, was proposed by Isola et al. Their approach uses GANs in a conditional setting. Recent studies extend this work and learn image translations in an unpaired manner without supervision. Since this is an ill-posed problem, additional constraints are required to preserve certain properties of the image representation in the source domain. Various approaches have introduced such constraints, ranging from pixel values to cycle-consistent training objectives. Liu et al. proposed the UNIT framework, in which the image representations of two different domains are mapped and reconstructed through a shared latent space. Our work is closely connected to such multi-domain image translation; however, we do not explicitly enforce additional constraints such as a shared latent space. Closest to our work are the approaches in [21, 22], which directly learn the mapping from RGB to synthetic depth.
In this section, we describe our two-step approach to overcome the effects of missing modalities with Generative Adversarial Networks. We first train a segmentation network on multi-modal input (RGB and depth) for building footprint segmentation, as depicted in Fig. 1 (2). We then use a conditional GAN to capture the data distribution of the depth channel with respect to the RGB image. More formally, the conditional GAN is trained to learn the mapping from the RGB image $x$ and noise vector $z$ to a synthetic depth image $y$ with $G: \{x, z\} \rightarrow y$. The generator G receives the noise vector and the RGB image as input and generates a synthetic depth image that is meant to be indistinguishable from the real depth image. The discriminator D is trained adversarially to discriminate between two kinds of pairs: (1) real pairs of RGB and real depth and (2) synthetic pairs of RGB and synthetic depth. The adversarial training setup of the discriminator is depicted in Fig. 2. After this training, we keep only the generator G and use it to produce a synthetic depth image from RGB as input for the segmentation network.
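The inference path described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `generator` and `segmenter` stand for the trained networks and are passed in as plain callables, and the toy stand-ins at the bottom are hypothetical placeholders that only make the example runnable.

```python
import numpy as np

def segment_with_synthetic_depth(rgb, generator, segmenter):
    """Two-step inference when the depth modality is missing.

    rgb:       float array of shape (H, W, 3)
    generator: callable, (H, W, 3) RGB -> (H, W) synthetic depth
    segmenter: callable, (H, W, 4) stacked input -> (H, W) building mask
    """
    synthetic_depth = generator(rgb)  # step 1: translate RGB to depth
    # Early fusion: append the depth map as a fourth channel.
    stacked = np.concatenate([rgb, synthetic_depth[..., None]], axis=-1)
    return segmenter(stacked)         # step 2: segment the fused input

# Toy stand-ins (hypothetical, for illustration only).
def toy_generator(rgb):
    return rgb.mean(axis=-1)          # pretend brightness encodes height

def toy_segmenter(x):
    return (x[..., 3] > 0.5).astype(np.uint8)  # threshold the depth channel

rgb = np.zeros((4, 4, 3))
rgb[:2, :2] = 1.0                     # a bright "building" in one corner
mask = segment_with_synthetic_depth(rgb, toy_generator, toy_segmenter)
```

The segmentation network itself is unchanged; only the source of its fourth input channel differs between the real and the synthetic case.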
III-A Segmentation Network Architecture
We use SegNet as our segmentation network. Its encoder consists of 13 convolutional layers with 3x3 convolutions and five 2x2 max-pooling layers. The decoder is a mirrored version of the encoder and uses the pooling indices of the encoder to upsample the feature maps. SegNet offers a good trade-off between memory consumption and classification accuracy compared to other approaches such as U-Net and has been shown to achieve good results on the segmentation of building footprints. In practice, it could be replaced by any other segmentation network.
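III-B Generative Adversarial Training
The objective of our approach is formalized as follows:
$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G) \qquad (1)$$
where $\mathcal{L}_{cGAN}$ represents the loss of the conditional GAN, with $x$ denoting the RGB image, $y$ the real depth image and $z$ the noise vector:
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right]$$
Motivated by previous work, we add a further regularization term to the overall objective, weighted by a factor $\lambda$:
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\lVert y - G(x, z) \rVert_{1}\right]$$
This regularizer forces the generator not only to fool the discriminator, but also to create synthetic depth images that are close to the real depth maps in the L1 sense. We use the L1 norm since it encourages less blurring than L2.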
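To make the two loss terms concrete, the following sketch evaluates them with NumPy on pre-computed discriminator outputs. It assumes the standard conditional GAN formulation of Isola et al.; the function names, the small `eps` for numerical stability and the example weight `lam=10.0` are illustrative assumptions, not values taken from this work. The L1 term is averaged over pixels here, which differs from the norm only by a constant factor.

```python
import numpy as np

def discriminator_objective(d_real, d_fake, eps=1e-12):
    """cGAN value that the discriminator tries to maximize.

    d_real: D(x, y) for real pairs, probabilities in (0, 1)
    d_fake: D(x, G(x, z)) for synthetic pairs, probabilities in (0, 1)
    """
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

def generator_objective(d_fake, fake_depth, real_depth, lam, eps=1e-12):
    """Value that the generator tries to minimize: adversarial term + lam * L1."""
    adversarial = np.mean(np.log(1.0 - d_fake + eps))
    l1 = np.mean(np.abs(real_depth - fake_depth))  # mean absolute error
    return adversarial + lam * l1

# Toy numbers: an undecided discriminator and a depth map that is off by 1 m.
d_real = np.array([0.5, 0.5])
d_fake = np.array([0.5, 0.5])
real_depth = np.full((2, 2), 3.0)
fake_depth = np.full((2, 2), 2.0)

d_value = discriminator_objective(d_real, d_fake)
g_value = generator_objective(d_fake, fake_depth, real_depth, lam=10.0)
g_value_perfect = generator_objective(d_fake, real_depth, real_depth, lam=10.0)
```

With identical discriminator outputs, the generator value drops by exactly `lam` times the mean absolute depth error once the synthetic depth matches the real one, which is the pull toward the ground truth that the regularizer adds.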
III-C Network Architectures
We use architectures for the generator and discriminator networks similar to those presented by Isola et al. In the following, we discuss key points and differences.
III-C1 Generator Network
The generator network in this work builds upon a CNN with an encoder-decoder architecture. This architecture has been shown to be successful for pixel-based computer vision tasks ranging from semantic and instance segmentation to image-to-image translation. Our encoder consists of eight 4x4 convolutions with a stride of 2, each followed by BatchNorm and ReLU layers. Each encoder block (convolution, BatchNorm, ReLU) thus reduces the spatial dimension of its input. This yields a model that maps features from a high-dimensional input space into a low-dimensional latent space. The decoder mirrors the structure of the encoder and learns the reverse mapping from the latent space to a high-dimensional output space. Since down-sampling a high-resolution input results in a loss of high-frequency information, which we want to keep in the unknown output modality, we add skip connections to each layer of the decoder. This follows the architecture of U-Net, where feature maps of the encoder are fused with the feature maps of the decoder using concatenation and an additional convolution. With these skip connections, high-frequency information contained in the encoder feature maps can be transferred to the up-sampled feature maps of the decoder, which otherwise lack this level of detail.
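The effect of eight stride-2 convolutions on the spatial resolution can be checked with the usual convolution output-size formula. The sketch below is only an illustration: the padding of 1 and the input sizes of 256 and 512 pixels are assumptions, since neither is stated in the text.

```python
def encoder_spatial_sizes(input_size, num_layers=8, kernel=4, stride=2, padding=1):
    """Spatial size after each encoder convolution (square inputs).

    Uses out = floor((in + 2 * padding - kernel) / stride) + 1.
    """
    sizes = [input_size]
    for _ in range(num_layers):
        sizes.append((sizes[-1] + 2 * padding - kernel) // stride + 1)
    return sizes

sizes_256 = encoder_spatial_sizes(256)
sizes_512 = encoder_spatial_sizes(512)
```

Under these assumptions each layer halves the resolution, so a 256-pixel input is reduced to a single spatial position after eight layers, while a 512-pixel input ends at 2 x 2. This is why the skip connections matter: almost no spatial detail survives in the bottleneck itself.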
III-C2 Discriminator Network
Our discriminator network is modeled as a CNN that learns a binary classification between real and synthetic pairs with respect to the two modalities RGB and depth. Before we explain details of the network architecture, note that the objective function in Eq. 1 contains an L1 regularization term. The presence of this term forces the generator to capture low-frequency information. However, generative models that rely only on L1 or L2 distances, such as Variational Autoencoders, have been shown to fail at modeling high-frequency information, which leads to blurry images. This motivates restricting the discriminator network to capture only the high-frequency information. Isola et al. showed that it is sufficient to restrict the attention of the discriminator to the structure in local image patches in order to model high-frequency details. We use their proposed PatchGAN architecture, which penalizes structure at the scale of patches. This discriminator is applied convolutionally across the image on smaller patches and the results are aggregated by averaging. Compared to discriminators that operate on the full image, this is advantageous because the smaller network has fewer parameters, runs faster and can process large images. Because of the close connection of PatchGAN to Markov random fields, the network can be seen as a form of style loss.
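The patch-wise aggregation can be illustrated with a small sketch. It is a simplification: a real PatchGAN is a fully convolutional network whose receptive field defines the patch, whereas here an arbitrary scoring function `score_patch` is slid over the input explicitly. The patch size and stride are hypothetical parameters, as the text does not fix them.

```python
import numpy as np

def patchwise_score(image, score_patch, patch_size, stride):
    """Score every patch independently and average the results.

    image:       array of shape (H, W) or (H, W, C), e.g. a stacked RGB/depth pair
    score_patch: callable mapping one patch to a scalar "realness" score
    """
    height, width = image.shape[:2]
    scores = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = image[top:top + patch_size, left:left + patch_size]
            scores.append(score_patch(patch))
    return float(np.mean(scores))

# Toy check: with the patch mean as "score", non-overlapping patches that tile
# the image reproduce the global mean.
image = np.arange(16, dtype=float).reshape(4, 4)
score = patchwise_score(image, np.mean, patch_size=2, stride=2)
```

Because each score depends only on a local window, the same discriminator can be applied to inputs of any size, which matches the runtime and large-image argument made above.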
IV Experimental Results
IV-A Urban Mapper 3D Dataset
We evaluate our approach on the Urban Mapper 3D Dataset. This dataset comprises 236 orthorectified color images of 2048 x 2048 pixels each. The satellite scenes cover the two cities Jacksonville and Tampa in Florida (USA) with a ground sample distance of 0.5 meters. In addition to optical RGB images, the dataset includes a digital surface model (DSM) and a digital terrain model (DTM) at the same spatial resolution for each tile. The surface and terrain models are represented as a single band of 32-bit signed floating point values that represent height (in meters) referenced to the WGS84 ellipsoid. Ground-truth data is provided only for the public training set, in the form of instance and class labels for each building footprint. Since we do not have access to the private test set, we randomly split the training set with a ratio of 80/20 (for both locations) into new training and test sets. We subtract the DTM band from the DSM band in order to obtain the normalized height of each pixel. This representation is robust against topographical elevations such as hills and represents height information with respect to the surface as origin. In the following experiments, we use the RGB images, the normalized depth and the building class labels as ground truth.
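The height normalization is a per-pixel subtraction. The sketch below shows it on a tiny synthetic tile; the array values are made up for illustration and the function name is an assumption.

```python
import numpy as np

def normalized_height(dsm, dtm):
    """Height above ground per pixel: surface model minus terrain model."""
    return np.asarray(dsm, dtype=np.float32) - np.asarray(dtm, dtype=np.float32)

# Sloping terrain (heights in meters) with a 5 m building on the slope.
dtm = np.array([[10.0, 11.0, 12.0],
                [10.0, 11.0, 12.0]])
dsm = dtm.copy()
dsm[0, 1] += 5.0

ndsm = normalized_height(dsm, dtm)
```

In the raw DSM the building pixel (16 m) is only 4 m above the highest bare-ground pixel (12 m), while in the normalized map it stands out as 5 m above a ground level of 0 m everywhere else. This is the robustness against hills mentioned above.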
IV-B Evaluation Metrics
Following an established evaluation protocol for the segmentation of building footprints, we report our results on the same two metrics for all of the following experiments. The first one is the Intersection over Union (IoU) for the positive building class. This is the number of pixels labeled as building in both the prediction and the ground truth, divided by the number of pixels labeled as building in the prediction or the ground truth. As the second metric, we report accuracy, the percentage of correctly classified pixels.
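Both metrics are straightforward to compute from binary masks. The following is a minimal NumPy sketch; the function names are illustrative, and returning 1.0 for an empty union is a convention chosen here, not something specified in the text.

```python
import numpy as np

def building_iou(pred, target):
    """IoU of the positive (building) class for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumed convention: no buildings predicted or present
    return float(np.logical_and(pred, target).sum() / union)

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the ground truth."""
    return float(np.mean(np.asarray(pred) == np.asarray(target)))

pred = np.array([[1, 1, 0],
                 [0, 0, 0]])
target = np.array([[1, 0, 0],
                   [1, 0, 0]])

iou = building_iou(pred, target)       # 1 shared pixel, 3 in the union
acc = pixel_accuracy(pred, target)     # 4 of 6 pixels agree
```

The example also shows why the two numbers diverge in practice: background pixels dominate accuracy, whereas the IoU only looks at pixels that are building in at least one of the two masks.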
IV-C Experimental Setup - Baselines & Network Training
In order to evaluate the performance of our approach, we first train different segmentation networks as a proxy to define our lower and upper bounds. We use RGB as the known modality that is available during training and testing, whereas the depth modality is available only during the training phase.
|Method|IoU (%)|Accuracy (%)|Average (%)|
|RGB only (Lower Bound)|62.09|92.91|77.50|
|RGB & Partial Depth (Baseline)|62.34|92.91|77.63|
|RGB & Synthetic Depth|63.96|93.86|78.91|
|RGB & Depth (Upper Bound)|65.58|94.05|79.82|
IV-C1 Lower Bound - Training with RGB only
We define the lower bound for this experiment by using a SegNet trained on RGB information only. Even though depth information is available during training, it is discarded. Any network trained on multi-modal information or missing modalities should be better than this network.
IV-C2 Upper Bound - Training with RGB and Depth
Our upper bound is determined by a SegNet that is trained and tested on all modalities. We use an early fusion approach with channel stacking to combine the three RGB channels and the depth channel. It is worth mentioning that different approaches have been proposed in the past to fuse depth and optical RGB information. A commonly used approach on satellite data is the FuseNet architecture, in which two separate encoders fuse feature maps at each resolution via the add operation. Since such an approach requires more parameters and we did not observe a significant improvement over early fusion, we decided against it. Our aim here is to show that complementing RGB with the depth modality using early fusion leads to better segmentation results than RGB only. Since RGB and depth information are available during both training and testing, this approach constitutes our upper bound.
IV-C3 Baseline - Training with Incomplete Depth
We use the same network architecture as above with an early fusion approach for our baseline. However, at test time only RGB information is available. In order to force the network to cope with the missing modality, we use the following training procedure. For each training sample, we draw with equal probability (p=0.5) from one of two distributions: (1) RGB patches with the corresponding depth patches and (2) RGB patches with no depth. This training setup ensures that the network still produces a meaningful output if depth is missing. Since depth information is nevertheless available during training, this modality can be leveraged to extract features that produce better segmentation results than relying on RGB alone.
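The sampling procedure for the baseline can be sketched as follows. How "no depth" is encoded is not specified in the text; filling the depth channel with zeros is an assumption made here for illustration, as are the function and parameter names.

```python
import numpy as np

def sample_training_input(rgb, depth, rng, p_missing=0.5):
    """Stack RGB and depth; with probability p_missing, blank out the depth.

    rgb:   array of shape (H, W, 3)
    depth: array of shape (H, W)
    A missing depth channel is represented by zeros (assumption).
    """
    if rng.random() < p_missing:
        depth = np.zeros_like(depth)
    return np.concatenate([rgb, depth[..., None]], axis=-1)

rng = np.random.default_rng(0)
rgb = np.ones((8, 8, 3))
depth = np.ones((8, 8))

batch = [sample_training_input(rgb, depth, rng) for _ in range(1000)]
missing_fraction = float(np.mean([x[..., 3].sum() == 0 for x in batch]))
```

Over many draws roughly half of the inputs carry real depth and half carry an empty channel, so the network sees both situations equally often during training.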
IV-C4 Network Training
Where applicable, we initialize the weights of the encoder with the weights of a VGG16 model pre-trained on ImageNet. If the first convolution has more than three input channels, we initialize the additional channels with the average over the RGB channels. In the following, all networks are trained with SGD using a learning rate of 0.01, a weight decay of 0.0005 and a momentum of 0.9. We use the negative log-likelihood loss and reduce the learning rate according to the poly policy (power=0.9). Training is stopped after 30 epochs. We extract 16 patches of size 512 x 512 pixels from each satellite image and use a batch size of 4 for training. We apply random flipping in the vertical and horizontal directions as data augmentation.
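Two details of this setup lend themselves to a short sketch: the poly learning-rate policy and the initialization of the additional input channel. The code below assumes the common form of the poly policy, base_lr * (1 - step / max_steps) ** power, and uses random numbers as a stand-in for the pre-trained filters; whether the decay is applied per iteration or per epoch is not stated in the text.

```python
import numpy as np

def poly_lr(base_lr, step, max_steps, power=0.9):
    """Poly policy: decays from base_lr at step 0 to 0 at max_steps."""
    return base_lr * (1.0 - step / max_steps) ** power

def extend_first_conv(rgb_weights):
    """Append one input channel initialized with the mean of the RGB filters.

    rgb_weights: array of shape (out_channels, 3, kH, kW)
    returns:     array of shape (out_channels, 4, kH, kW)
    """
    extra = rgb_weights.mean(axis=1, keepdims=True)
    return np.concatenate([rgb_weights, extra], axis=1)

# Random stand-in for pre-trained first-layer filters (not real VGG16 weights).
rng = np.random.default_rng(0)
vgg_like = rng.normal(size=(64, 3, 3, 3))
four_channel = extend_first_conv(vgg_like)

lr_start = poly_lr(0.01, step=0, max_steps=30)
lr_mid = poly_lr(0.01, step=15, max_steps=30)
lr_end = poly_lr(0.01, step=30, max_steps=30)
```

Initializing the fourth channel with the mean RGB filter keeps its weights in the same range as the pre-trained ones, so the depth input starts out on a similar scale to the color inputs instead of from random noise.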
IV-D Synthetic Image Generation - Quantitative Results
We train the GAN defined in Section III-B for 200 epochs with a batch size of 4 using the Adam optimizer. We start with a learning rate of 0.0002 and reduce it linearly to 0 starting from epoch 100. After this adversarial training, the generator has learned the mapping from RGB input to synthetic depth output.
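The learning-rate schedule for the GAN can be written as a small function. This sketch assumes the rate is updated once per epoch and reaches exactly 0 at epoch 200; the function and parameter names are illustrative.

```python
def linear_decay_lr(epoch, base_lr=0.0002, total_epochs=200, decay_start=100):
    """Constant learning rate until decay_start, then linear decay to 0."""
    if epoch < decay_start:
        return base_lr
    remaining = total_epochs - epoch
    return base_lr * remaining / (total_epochs - decay_start)

lr_epoch_50 = linear_decay_lr(50)
lr_epoch_150 = linear_decay_lr(150)
lr_epoch_200 = linear_decay_lr(200)
```

Under these assumptions the rate stays at 0.0002 for the first half of training, is halved at epoch 150 and vanishes at the end.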
The quantitative results obtained when feeding the synthetic depth together with the RGB image into the network trained on RGB and real depth are shown in Table I (row 3). Our approach cannot match the best case, in which all modalities are available during testing. However, it is still clearly better than the models that rely on RGB only. The IoU of the positive building class improves by about 2 percentage points (62.09 to 63.96) over the network trained on RGB only and by about 1.6 percentage points (62.34 to 63.96) over our baseline that was explicitly trained on RGB and the incomplete depth modality. Given the large number of pixels that have to be classified, this improvement is significant. Similar results are achieved for the pixel accuracy metric.
IV-E Synthetic Image Generation - Qualitative Results
For four different patches, Fig. 3 shows the RGB image (a), the real depth (b) and the synthetically generated depth image (c), as well as the segmentation results of the models using these different input modalities (d-g). When comparing the images of synthetic versus real depth (b & c), a high similarity between the two representations can be observed. This includes sharp edges on building outlines and on shadows of buildings, and holds even for more complex building structures as shown in the first two rows. When comparing the predictions based on RGB only against RGB with synthetic depth (d & e), an improvement in the segmentation mask can be observed when using synthetic depth. This holds for large buildings, as seen in the upper right part of row 1, and for small buildings, as shown in the lower left part of row 2.
In the third row of Fig. 3, an interesting case of incomplete or corrupted data can be observed. The real depth image shows two vertical rows of trees (in orange-yellowish color), but in the optical image only one row of trees is present. It could be the case that the images of the two modalities were taken at two different points in time. When this noisy multi-modal information is passed to the segmentation network, the parking space in the optical image is wrongly classified as building because of the similar texture in the optical image and the height cues in the depth image. The generator, on the other hand, did not predict any height cues at this position, and consequently the parking space is correctly classified as ground.
The opposite effect can be observed in the fourth row. Here, information is lost near the image border in the upper right corner of the real depth image whereas the synthetic image contains a stronger signal of height. Consequently the upper right building is detected when relying on synthetic data but not when using the noisy but real modalities.
In this paper, we addressed the problem of missing modalities at inference time in a setup in which models are trained on multi-modal information. We showed that GANs provide a powerful approach to overcome this problem and that the IoU can be improved by about 2 percentage points compared to relying only on the single available modality. It can be seen in Fig. 3 that our model learned to estimate height not only for the building class, which this work primarily focused on, but also for other land cover classes such as trees. It would therefore be interesting to extend our segmentation approach to additional land cover classes. We are currently extending this work with other training objectives and generative models such as Structure-to-Signal Autoencoders.
- [1] Adrien Lagrange, Bertrand Le Saux, Anne Beaupere, Alexandre Boulch, Adrien Chan-Hon-Tong, Stéphane Herbin, Hicham Randrianarivo, and Marin Ferecatu, “Benchmarking Classification of Earth-Observation Data: From Learning explicit Features to Convolutional Networks,” in Geoscience and Remote Sensing Symposium (IGARSS), 2015 IEEE International. IEEE, 2015, pp. 4173–4176.
- [2] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth, “Introducing EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification,” in Geoscience and Remote Sensing Symposium (IGARSS), 2018 IEEE International. IEEE, 2018.
- [3] Benjamin Bischke, Patrick Helber, Joachim Folz, Damian Borth, and Andreas Dengel, “Multi-Task Learning for Segmentation of Building Footprints with Deep Neural Networks,” arXiv preprint arXiv:1709.05932, 2017.
- [4] Michael Kampffmeyer, Robert Jenssen, et al., “Urban Land Cover Classification with Missing Data using Deep Convolutional Neural Networks,” in Geoscience and Remote Sensing Symposium (IGARSS), 2017 IEEE International. IEEE, 2017, pp. 5161–5164.
- [5] Naoto Yokoya, Pedram Ghamisi, Junshi Xia, Sergey Sukhanov, Roel Heremans, Ivan Tankoyeu, Benjamin Bechtel, Bertrand Le Saux, Gabriele Moser, and Devis Tuia, “Open Data for Global Multimodal Land Use Classification: Outcome of the 2017 IEEE GRSS Data Fusion Contest,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 5, pp. 1363–1377, 2018.
- [6] Jixian Zhang, “Multi-Source Remote Sensing Data Fusion: Status and Trends,” International Journal of Image and Data Fusion, vol. 1, no. 1, pp. 5–24, 2010.
- [7] IEEE GRSS, “2018 IEEE GRSS Data Fusion Contest: Advanced Multi-Sensor Optical Remote Sensing for Urban Land Use and Land Cover Classification,” Available: http://www.grss-ieee.org/community/technical-committees/data-fusion/data-fusion-contest/, 2018.
- [8] Nicolas Audebert, Bertrand Le Saux, and Sébastien Lefèvre, “Beyond RGB: Very High Resolution Urban Remote Sensing with Multimodal Deep Networks,” ISPRS Journal of Photogrammetry and Remote Sensing, 2017.
- [9] Caner Hazirbas, Lingni Ma, Csaba Domokos, and Daniel Cremers, “FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture,” in Asian Conference on Computer Vision. Springer, 2016, pp. 213–228.
- [10] Zuming Huang, Guangliang Cheng, Hongzhen Wang, Haichang Li, Limin Shi, and Chunhong Pan, “Building Extraction from Multi-Source Remote Sensing Images via Deep Deconvolution Neural Networks,” in Geoscience and Remote Sensing Symposium (IGARSS), 2016 IEEE International. IEEE, 2016, pp. 1835–1838.
- [11] Matthew Brand, “Incremental Singular Value Decomposition of Uncertain Data with Missing Values,” in European Conference on Computer Vision. Springer, 2002, pp. 707–720.
- [12] Evrim Acar, Daniel M Dunlavy, Tamara G Kolda, and Morten Mørup, “Scalable Tensor Factorizations for Incomplete Data,” Chemometrics and Intelligent Laboratory Systems, vol. 106, no. 1, pp. 41–56, 2011.
- [13] Judy Hoffman, Saurabh Gupta, and Trevor Darrell, “Learning with Side Information Through Modality Hallucination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 826–834.
- [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative Adversarial Nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
- [15] Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein GAN,” arXiv preprint arXiv:1701.07875, 2017.
- [16] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey, “Adversarial Autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
- [17] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” arXiv preprint, 2017.
- [18] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb, “Learning from Simulated and Unsupervised Images through Adversarial Training,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, vol. 3, p. 6.
- [19] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks,” arXiv preprint arXiv:1703.10593, 2017.
- [20] Ming-Yu Liu, Thomas Breuel, and Jan Kautz, “Unsupervised Image-to-Image Translation Networks,” in Advances in Neural Information Processing Systems, 2017, pp. 700–708.
- [21] Lichao Mou and Xiao Xiang Zhu, “IM2HEIGHT: Height Estimation from Single Monocular Imagery via Fully Residual Convolutional-Deconvolutional Network,” arXiv preprint arXiv:1802.10249, 2018.
- [22] Shivangi Srivastava, Michele Volpi, and Devis Tuia, “Joint Height Estimation and Semantic Labeling of Monocular Aerial Images with CNNs,” in Geoscience and Remote Sensing Symposium (IGARSS), 2017 IEEE International. IEEE, 2017, pp. 5173–5176.
- [23] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla, “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
- [24] Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [25] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
- [26] Diederik P Kingma and Max Welling, “Auto-Encoding Variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
- [27] Chuan Li and Michael Wand, “Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks,” in European Conference on Computer Vision. Springer, 2016, pp. 702–716.
- [28] Emmanuel Maggiori, Yuliya Tarabalka, Guillaume Charpiat, and Pierre Alliez, “Can Semantic Labeling Methods Generalize to Any City? The Inria Aerial Image Labeling Benchmark,” arXiv preprint arXiv:1409.1556, 2017.
- [29] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia, “Pyramid Scene Parsing Network,” arXiv preprint arXiv:1612.01105, 2016.
- [30] Joachim Folz, Sebastian Palacio, Joern Hees, Damian Borth, and Andreas Dengel, “Adversarial Defense based on Structure-to-Signal Autoencoders,” arXiv preprint arXiv:1803.07994, 2018.