Urban Land Cover Classification with Missing Data Using Deep Convolutional Neural Networks

by   Michael Kampffmeyer, et al.

Automatic urban land cover classification is a classical problem in remote sensing, and good urban land cover maps build the foundation for many tasks, such as environmental monitoring. It is a particularly challenging problem, as classes generally have low inter-class and high intra-class variance. A common technique to improve urban land cover classification performance in remote sensing is the fusing of data from different sensors with different data modalities. However, all modalities are rarely available for all test data, and this missing data problem poses severe challenges for multi-modal learning. Inspired by recent successes in deep learning, we propose as a remedy a convolutional neural network (CNN) architecture for urban remote sensing image segmentation trained on data modalities which are not all available at test time. We train our architecture with a cost function particularly suited for imbalanced classes, as this is a frequent problem in remote sensing, especially in urban areas. We demonstrate the method using two benchmark datasets, both consisting of optical and digital surface model (DSM) images. We simulate missing data by assuming that the DSM images are missing during testing and show that our method outperforms both a CNN trained on optical images and an ensemble of two CNNs trained only on optical images. We further evaluate the potential of our method to handle situations where only some DSM images are missing during testing and show that we can clearly exploit training time information of the missing modality during testing.



I Introduction

More than half of the world population now lives in cities, and 2.5 billion more people are expected to move into cities by 2050 [1]. Although constituting only a small percentage of global land cover, urban areas significantly alter climate, biogeochemistry, and hydrology at local, regional, and global scales. Thus, in order to support sustainable urban development, accurate information on the urban land cover is needed.

Fig. 1: A brief illustration of the issue addressed in this work. We propose a method to effectively produce urban land cover classification when some data modalities are missing partially or completely during the test phase. For instance, the top part of the figure illustrates a scenario, where data modality 2 is completely missing during testing and data modality 3 is missing for some of the test images. Our method leverages all available training modalities (top left part of the figure) to increase overall performance when performing inference (bottom part of figure).

Due to the successes of deep learning architectures, convolutional neural networks (CNNs) have found increased use in the field of remote sensing [2, 3, 4, 5, 6], outperforming more traditional approaches [7]. The strength of CNNs is their ability to learn features that exploit the spatial context and thereby provide land cover maps with high accuracy. Most modern approaches to urban land cover classification rely on deep convolutional neural networks with multiple modalities [2, 3, 4, 5, 6], effectively fusing information contained in the individual data modalities. Here we use the term data modalities to refer to data obtained by different imaging sensors, potentially operating in different frequency ranges and/or using a different imaging principle.

Data fusion is a frequently studied topic in land cover classification since it often leads to improved accuracy. It aims to integrate information acquired, potentially with different spatial resolutions, spectral bands and imaging modes, from sensors mounted on satellites, aircraft and ground platforms. The resulting fused data contains more detailed information than each of the individual sources [8, 9]. Several approaches to fusing data modalities exist for CNNs; however, these approaches generally break down during the testing stage when not all modalities are available. A concrete example where not all training data modalities are likely to be present during testing is disaster monitoring and assessment. Such systems are often time-critical, especially in highly populated areas, and require analysis immediately after a disaster happens. Due to this requirement, it is likely that not all data modalities have been gathered.

In this work, we propose an approach to remedy this problem. The key contribution is an approach to urban land cover mapping that exploits the relationships between the various modalities in the training dataset to improve performance when certain data modalities are missing during the test-phase. The approach further provides the flexibility to use all data modalities during testing in scenarios where they are available. We illustrate the problem that we consider in this work in Figure 1. Concretely we focus on how to exploit all available training modalities in the following three problem scenarios:

Problem Scenario 1: One modality is completely missing during testing. However, since the modality is available during training, we would like to exploit the available training data.

Problem Scenario 2: One modality is partly missing during testing. Given a set of images, a given data modality is missing only for a subset of the images. We would like to model the missing data modality for these images, but still want to make use of the data modality for the test images where it is available.

Problem Scenario 3: Multiple modalities are either completely or partly missing during testing. This extends Problem Scenarios 1 and 2 to multiple missing data modalities.

Keeping to the example of disaster monitoring, and more specifically flood monitoring, these three problem scenarios can, for instance, occur in the following situations. One common approach to the mapping and detection of floods is the use of Synthetic Aperture Radar (SAR) [10], but very high-resolution optical data from airborne sensors can be incorporated to improve performance [10]. However, this data is not always available during the inference phase (Problem Scenario 1). Problem Scenario 2 allows us to use a single model to perform inference on these images, even if only part of the area is covered by the high-resolution optical images, for instance, due to the high cost of acquiring them. Finally, when extending the task of flood mapping to flood emergency management with tasks like damage analysis [11, 12] and estimation of elements-at-risk and vulnerability [10], it can be beneficial to map buildings and infrastructure with the help of additional data modalities, such as LIDAR (Light Detection And Ranging) based DSMs. However, adding the LIDAR-based DSMs will potentially lead to multiple missing modalities (Problem Scenario 3).

The proposed system will build upon the hallucination network strategy proposed by Hoffman et al. [13] for training CNNs for object detection from data modalities when one of the modalities is completely missing during testing. We extend their work to land-cover classification (Problem Scenario 1) and extend the underlying idea of hallucination networks to handle situations where multiple modalities are missing (Problem Scenario 3). We further analyze its potential to handle both situations where data modalities are completely missing and situations where data modalities are only missing for some of the test images (Problem Scenario 2). Section IV describes the approach taken in detail.

Note that the aforementioned problems are inherently different from the more traditional missing data scenarios that have been investigated previously in a remote sensing context. To the authors' knowledge, this is the first time that these problems have been addressed in remote sensing.

Another challenge that is often encountered when designing classifiers for land cover mapping is class imbalance. Land covers within the area of interest are often highly imbalanced, where some land cover types are frequent, whereas others are rare. Moreover, many objects of interest in remote sensing are small compared to the overall image. A solution based on optimizing the overall classification accuracy is often not satisfactory since ’small’ classes will often be suppressed [14]. In a recent paper [2], the influence of imbalanced classes was reduced by introducing weights to the cost function. Classes with few samples were given high weights, whereas classes with many samples were given small weights. Due to the inherent difficulties in urban land cover classification caused by small objects and imbalanced classes, we specially tailor the network by incorporating a class-wise median frequency balancing approach into the cost function.

A preliminary version of this paper appeared in [15]. Here, we extend our work by (i) providing a more thorough discussion of the background literature, (ii) extending and evaluating the methodology for scenarios where multiple data modalities are not present during the test phase (Problem Scenario 3), (iii) expanding the experiment section, (iv) evaluating the effect of median frequency balancing on the small (imbalanced) classes, and (v) evaluating the potential of the trained model to also handle situations where both modalities are available during testing (Problem Scenario 2).

This paper is structured as follows. Section II provides an overview of the background literature. Section III introduces the datasets used in our work and the evaluation procedure. In Section IV we present the methodology and training procedure. Experimental evaluation of the method is performed in Section V and finally in Section VI we draw conclusions and point to future directions.

Fig. 2: A small example patch from the Vaihingen validation dataset. From left to right and top to bottom: RG+I image, DSM, normalized DSM, and ground truth image. It illustrates the difference in size between classes, such as the car class (yellow) and the building class (blue).
Fig. 3: A small example patch from the Potsdam validation dataset. From left to right and top to bottom: RGB image, IR, normalized DSM, and ground truth image. The DSM modality has been omitted in this figure, as we only use the normalized DSM in our experiments.

II Background

Our approach to segmentation builds on the recent successes that deep learning techniques have achieved for image segmentation using CNNs. Instead of requiring elaborate feature design, CNNs are able to learn relevant features from data. Even though segmentation can be viewed as a pixel-wise classification problem, most current state-of-the-art CNN models for image segmentation are inspired by the idea of fully convolutional, pixel-to-pixel, end-to-end learnable architectures [16]. These architectures generally consist of an encoder and a decoder, where the encoder creates a lower-resolution representation of the image and the decoder produces the pixel-wise prediction from this representation. Upsampling in the decoder can be performed in several ways, with one common approach being the fractional-strided convolution (also referred to as deconvolution). Alternative approaches avoid the explicit learning of an upsampling by using the max-pooling locations from the downsampling stage to place activations, thereby achieving a sparse upsampling map, which is then converted into a dense representation with the help of convolutional filters [17]. All layers in these networks are based on convolutions and, unlike previous approaches, do not make use of fully connected layers, which allows their application to test images of varying size.

Lately, deep learning and CNNs have also increasingly been applied to remote sensing tasks such as domain adaptation [18], and especially to object detection and image segmentation. For instance, Paisitkriangkrai et al. [19] proposed a scheme for high-resolution land cover classification using a combination of a patch-based CNN and a random forest classifier that is trained on hand-crafted features. To increase the classification accuracy further, a conditional random field (CRF) was used to smooth the final pixel labeling results. Further, Krylov et al. [4] proposed the use of a patch-based neural network to perform land cover classification in SPOT-5 imagery. More recent works, to a large extent, make use of fully convolutional architectures [2, 3, 5], achieving better overall accuracy and being computationally more efficient [2, 5]. Building on the good performance of fully convolutional architectures for land cover classification, two-stage systems for object detection have been proposed, where the segmentation is followed by a classification step [20], as have novel approaches to fusing heterogeneous data modalities in a land cover classification context [6].

Missing data is a common problem in real-world data that can occur, for instance, due to corrupted sensors. In the traditional sense, the term missing data has been used to refer to partly missing data in a given data modality. In remote sensing, missing data can also occur due to the inherent properties of certain remote sensing systems. Cloud cover in optical images and the acquisition of images at night using optical sensors are two examples of missing data that can occur due to the inherent nature of the sensor. Common approaches to handling missing data for pattern recognition can be grouped into four main groups. Incomplete data can either be deleted, leading to the removal of potentially large amounts of training data, or missing values can be imputed. Alternatively, model-based approaches can be used, such as the expectation-maximization algorithm, which models the data distribution, or approaches can be designed that incorporate missing data into the machine learning approach, as in the case of decision trees.

Traditionally, most works on handling missing data in remote sensing consider partly missing data, that is, situations where only parts of a modality are available. One common approach is to simply discard these incomplete data samples [23]. However, more advanced methods have been proposed, for instance by Latif and Mercier [24], who leverage Self-Organizing Maps to estimate and impute missing values. Further, various data imputation techniques have been used to fill the missing parts in the data [25]. However, imputation may often produce suboptimal results, since the assumption of missingness at random [22] may not hold when missingness arises from the properties of the sensors. Other works that utilize the missingness directly as part of the machine learning approach include Salberg and Jenssen [26], who propose to train SVMs for all types of missingness patterns for land cover classification, and Aksoy et al. [23], who present a decision-tree-based model for data fusion when some modalities are partly missing.

However, to the authors' knowledge, none of these approaches is able to exploit the knowledge provided by all data sources to improve the accuracy when one or more data sources are completely missing. This aspect, concerning missing modalities, is the focus of this work.

III Benchmark datasets for urban land cover classification

To perform urban land cover classification we require large-scale datasets. In this work, we focus on two commonly used benchmark datasets that each consist of more than one data modality. The datasets are the ISPRS Vaihingen and the ISPRS Potsdam 2D semantic labeling benchmark datasets [27]. The Vaihingen dataset consists of high-resolution true ortho photo images with the three bands corresponding to near infrared, red and green bands (RG+I). The images are acquired with a ground sampling distance of cm and of varying size (ranging from approximately 3 million to 10 million pixels). Ground truth is available for images. The Potsdam dataset consists of four-channel true ortho photo images with the channels corresponding to red, green, blue and infrared (RGB+I). The ground sampling distance is cm with each image containing 36 million pixels and with ground truth available for images. Additionally, both the digital surface model (DSM) and the normalized DSM (produced by Gerke [28]) are available for all images in both datasets. Figures 2 and 3 show two example patches, one from the Vaihingen dataset and one from the Potsdam dataset. The datasets contain six classes: impervious surface (white in ground truth), buildings (blue), low vegetation (cyan), trees (green), cars (yellow) and background/clutter (red). To evaluate the approach and its use for missing data modalities for urban land cover classification, the normalized DSM was included only during the training phase.

In this paper, we follow the guidelines and the evaluation metrics that have been specified by the ISPRS [27] to evaluate performance on these datasets. The evaluation metrics are the F1-score and accuracy. The F1-score is computed per class and overall as

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

The accuracy is the percentage of correctly labeled pixels measured as $\sum_i n_{ii} / \sum_i t_i$, where $n_{ij}$ denotes the number of pixels of class $i$ that are classified as class $j$ and $t_i$ the total number of pixels in class $i$. We report the overall class-independent and also the mean class accuracy.
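As an illustration of these metrics, the following minimal NumPy sketch computes per-class F1-scores, per-class accuracies and the overall accuracy from a confusion matrix. The function names and the toy labels are hypothetical and are not part of the ISPRS evaluation code; boundary erosion is not handled here.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Return n where n[i, j] counts pixels of class i predicted as class j."""
    idx = num_classes * y_true.ravel() + y_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def scores_from_confusion(n):
    """Per-class F1, per-class accuracy (n_ii / t_i) and overall accuracy."""
    tp = np.diag(n).astype(float)
    t = n.sum(axis=1)            # t_i: number of pixels that truly belong to class i
    predicted = n.sum(axis=0)    # number of pixels predicted as class i
    recall = tp / np.maximum(t, 1)
    precision = tp / np.maximum(predicted, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    overall_acc = tp.sum() / n.sum()
    return f1, recall, overall_acc

# Toy example with three classes (illustrative values only).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
n = confusion_matrix(y_true, y_pred, num_classes=3)
f1, class_acc, overall_acc = scores_from_confusion(n)
mean_class_acc = class_acc.mean()
```

The mean class accuracy averages the per-class accuracies with equal weight, which is why it reacts more strongly than the overall accuracy to errors on small classes.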

Following the ISPRS specifications, the class boundaries are eroded with a disk of radius 3 and ignored in the evaluation to reduce boundary effects. For evaluation, the labeled part of the Vaihingen dataset is divided into a training set, a validation set and a test set containing , and images, respectively. The Potsdam dataset is similarly split into training, validation and test sets (, and images). The models are trained on the training dataset, hyperparameters are tuned according to the validation dataset and the final accuracy is reported on the test dataset.

IV Approach

In this section, we describe the implementation and the training details of the proposed architecture. We start by explaining the main idea of hallucination networks in IV-A, since this framework is at the core of the proposed approach. The concept of median frequency balancing for handling small classes is explained in IV-B. In IV-C we provide details on the overall implementation and describe how we combine median frequency balancing and the idea of hallucination networks to perform urban land cover classification in scenarios where a data modality is missing during the inference phase. Finally, we propose in IV-D an extension of the hallucination network methodology from dealing solely with one missing modality to handling the case of multiple missing modalities.

Our proposed method builds on a common approach to data fusion using convolutional neural networks, where features are extracted independently from each data modality using separate dedicated CNNs and the final scores are combined. Such approaches have previously shown promising results for combining multiple data modalities, such as for example RGB and Depth information [29, 6].

IV-A Hallucination Networks

Hallucination networks [13] are a recent attempt to exploit, at test time, data modalities that are available solely during the training phase in order to improve object detection performance. They are inspired by recent work on transferring information between neural networks through network distillation, which was initially proposed as a way to compress the knowledge of a large model into a smaller one by training a small neural network on the softmax output of the larger network [30].

To utilize all available training data modalities when some modalities are missing during the inference phase, a further network, the hallucination network, is added alongside the networks for the available modalities. This network takes as input a data modality that we assume to be available during both training and testing and attempts to learn a mapping from this modality to the modality that is missing during testing. This is done by adding a loss that makes the mid-level features of the hallucination network mimic the mid-level features of the network for the data modality that is missing during inference. If the modality is missing during testing, inference is performed by passing the available modality not only to its dedicated network but also to the hallucination network. The raw scores of the two networks are then averaged and the land cover classification map is produced through a final softmax layer.
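The fusion step at inference time can be sketched as follows. This is a minimal NumPy illustration under the assumption that each network outputs raw per-class scores of shape (classes, height, width); the function names are hypothetical and the networks themselves are replaced by random score maps.

```python
import numpy as np

def softmax(scores, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def fuse_scores(rgb_scores, second_scores):
    """Average the raw (pre-softmax) scores of two networks, then apply softmax.

    second_scores would come from the hallucination network when the depth
    modality is missing, and from the depth network when it is available.
    """
    fused = 0.5 * (rgb_scores + second_scores)
    probs = softmax(fused, axis=0)
    return probs, probs.argmax(axis=0)

# Stand-in score maps for 6 classes on a 4x4 patch (random, for illustration).
rng = np.random.default_rng(0)
rgb_scores = rng.normal(size=(6, 4, 4))
hallucination_scores = rng.normal(size=(6, 4, 4))
probs, labels = fuse_scores(rgb_scores, hallucination_scores)
```

Because the softmax is monotonic in the scores, the predicted label map equals the per-pixel argmax of the summed raw scores.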

Fig. 4: The network architecture that addresses Problem Scenario 1 of the proposed method. Note that we choose the layer where the hallucination loss is applied to be the third pooling layer, based on the fact that Hoffman et al. [13] reported the best accuracy when hallucination is applied on the mid/late-level features. A more thorough analysis of this choice is left for future work.

IV-B Median frequency balancing

Median frequency balancing is a weighted cross-entropy loss function that has been shown to yield good performance for imbalanced classes [31, 17, 2]. Instead of optimizing the commonly used cross-entropy loss, where small classes contribute less to the overall loss and are therefore often neglected compared to the larger classes, each class in the loss function is weighted by the ratio of the median class frequency and the class frequency (computed over the training dataset). The updated cost formulation is

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c \in \mathcal{C}} l_c^{(n)} \log\left(\hat{p}_c^{(n)}\right) w_c, \qquad w_c = \frac{\operatorname{median}\left(\{ f_c \mid c \in \mathcal{C} \}\right)}{f_c},$$

where $w_c$ denotes the weight for class $c$, $f_c$ is the frequency of pixels in class $c$, and $\hat{p}_c^{(n)}$ is the softmax probability of sample $n$ being in class $c$. $l_c^{(n)}$ corresponds to the label of sample $n$ for class $c$ when the label is given in one-hot encoding, $\mathcal{C}$ is the set of all classes and $N$ is the number of samples in the mini-batch.
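The weighting scheme can be illustrated with a short NumPy sketch. The pixel counts below are made up for the example, the function names are hypothetical, and the sketch assumes that every class occurs at least once in the training data (otherwise the weight is undefined).

```python
import numpy as np

def median_frequency_weights(class_pixel_counts):
    """w_c = median(f) / f_c, with f_c the pixel frequency of class c."""
    counts = np.asarray(class_pixel_counts, dtype=float)
    freq = counts / counts.sum()
    return np.median(freq) / freq

def weighted_cross_entropy(probs, onehot, weights, eps=1e-12):
    """Mean over the mini-batch of the class-weighted cross-entropy.

    probs and onehot have shape (N, C); weights has shape (C,).
    """
    per_sample = -(onehot * np.log(probs + eps) * weights).sum(axis=1)
    return per_sample.mean()

# Hypothetical pixel counts for four classes, from frequent to rare.
weights = median_frequency_weights([50, 30, 15, 5])

# A single sample of class 0 predicted with probability 0.5.
loss = weighted_cross_entropy(
    probs=np.array([[0.5, 0.5]]),
    onehot=np.array([[1.0, 0.0]]),
    weights=np.array([2.0, 1.0]),
)
```

In this toy case the rarest class receives a weight ten times that of the most frequent class, so misclassifying its pixels contributes correspondingly more to the loss.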

IV-C Object segmentation for imbalanced classes with one missing modality

Our architecture for the individual networks follows the architecture of Kampffmeyer et al. [2] and is based on the fully convolutional network architecture [16]. It consists of four sets of two convolutional layers, each set followed by a max-pooling layer and each individual convolution layer followed by a ReLU nonlinearity and a batch normalization layer [32]. The first convolutional layer has stride due to memory restrictions during the test phase when considering the large images, whereas all other convolutions are of stride . Three of these networks are trained jointly: the first for the RG+I or RGB+I images, which will be referred to as the RGB network for simplicity, the second for the depth image, and the third as the hallucination network. Figure 4 illustrates the complete architecture that is used to address Problem Scenario 1. Training is performed on image patches of size , which are extracted from the original images with overlap and are flipped and rotated at 90-degree intervals as part of the data augmentation step. For the Potsdam dataset, we only rotate the images, since much more data is available, making additional data augmentation unnecessary.

The total loss consists of six individual losses and is

$$L = \gamma L_{\text{hallucinate}} + L_{R} + L_{D} + L_{H} + L_{RD} + L_{RH}. \tag{4}$$

The loss is optimized using backpropagation.

$L_{\text{hallucinate}}$, $L_{R}$, $L_{D}$ and $L_{H}$ are the hallucination loss, the loss of the RGB network, the loss of the depth network and the loss of the hallucination network, respectively, and $L_{RD}$ and $L_{RH}$ are the joint losses. The parameter $\gamma$ is the weight parameter for the hallucination loss. The individual losses for the three networks, $L_{R}$, $L_{D}$ and $L_{H}$, ensure that the features learned by the three individual networks will be useful for the land cover classification task also independently from each other. The joint losses, $L_{RD}$ and $L_{RH}$, instead ensure that the network performs well on all the combinations of modalities that can be observed during testing. Here we assume that the data for the RGB network is always available, but that the depth data modality might be missing during testing. If the depth data is missing during testing, the combination of the RGB and the hallucination network can be used for inference. If, on the other hand, the depth data modality is available during testing, the combination of the RGB and the depth network can be used. The potential of using the same architecture for both scenarios (Problem Scenario 2) is analyzed in Section V.
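The way the six terms combine can be sketched in NumPy as follows. This is only an illustration under two assumptions that the text does not state explicitly: the joint losses are computed on the averaged raw scores of the two networks involved (mirroring the inference procedure), and a plain cross-entropy is used in place of the class-weighted loss for brevity. All function names are hypothetical.

```python
import numpy as np

def cross_entropy(scores, labels):
    """Mean cross-entropy for raw scores of shape (N, C) and integer labels (N,)."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def total_loss(rgb, depth, hal, labels, hallucination_loss, gamma):
    """Weighted hallucination loss plus three individual and two joint losses."""
    individual = (cross_entropy(rgb, labels)
                  + cross_entropy(depth, labels)
                  + cross_entropy(hal, labels))
    # Assumption: joint predictions average the raw scores of the two networks.
    joint = (cross_entropy(0.5 * (rgb + depth), labels)
             + cross_entropy(0.5 * (rgb + hal), labels))
    return gamma * hallucination_loss + individual + joint

# Uninformative scores for 4 samples and 6 classes: every term equals log(6).
zeros = np.zeros((4, 6))
labels = np.array([0, 1, 2, 3])
loss = total_loss(zeros, zeros, zeros, labels, hallucination_loss=0.5, gamma=2.0)
```

With all scores equal, each of the five classification terms reduces to the logarithm of the number of classes, which gives a quick sanity check for an implementation.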

Fig. 5: Network architecture of the proposed method when handling multiple missing modalities (Problem Scenario 3).

Following Hoffman et al. [13], the hallucination loss is

$$L_{\text{hallucinate}} = \left\lVert \sigma\left(A_{\ell}^{D}\right) - \sigma\left(A_{\ell}^{H}\right) \right\rVert_2^2,$$

where $\sigma$ is the sigmoid function and $A_{\ell}^{D}$ and $A_{\ell}^{H}$ refer to the activations of the depth network and the hallucination network at a certain network depth $\ell$. To prevent the depth features from adapting during the end-to-end training procedure, the learning rate for the layers before network depth $\ell$ is set to zero for the depth network. The hallucination loss in our experiments is based on the activations after the third pooling layer, and median frequency balancing is used for all terms in the loss function except the hallucination loss.
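A minimal NumPy sketch of this loss is given below, assuming the squared Euclidean distance between sigmoid-transformed activations as in Hoffman et al. [13]. The function names are hypothetical and the activations are small stand-in arrays rather than real feature maps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hallucination_loss(depth_activations, hallucination_activations):
    """Squared L2 distance between sigmoid-transformed mid-level activations."""
    diff = sigmoid(depth_activations) - sigmoid(hallucination_activations)
    return float(np.sum(diff ** 2))

# Identical activations give zero loss.
same = hallucination_loss(np.ones((2, 2)), np.ones((2, 2)))

# Four activations whose sigmoids differ by 0.5 each give 4 * 0.25 = 1.0.
apart = hallucination_loss(np.zeros(4), np.full(4, 50.0))
```

The sigmoid bounds each activation to the interval (0, 1), so every element contributes at most 1 to the loss regardless of the raw activation scale.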

Training details

Training is performed by first training the RGB and depth networks separately and then fine-tuning the whole architecture end-to-end. The depth network was used to initialize the hallucination network, and the weight of the hallucination loss was set such that the hallucination loss is roughly times the largest of the remaining terms in Eq. 4, following the example of Hoffman et al. [13]. A more thorough analysis of how to optimally set this parameter is left for future work.

To avoid large variations in the magnitude of the gradients, gradient clipping is performed to clip outlier gradients to an acceptable range, which is determined by monitoring the gradients during training. Training is performed using Adam.

IV-D Multiple missing modalities

In this section, we propose an extension of the hallucination network to handle more than one missing data modality (Problem Scenario 3). We focus especially on the Potsdam dataset, as it comes with three unique modalities: RGB, infrared and depth. Figure 5 illustrates the updated architecture, where an additional hallucination network is introduced to handle a missing infrared data modality. The updated cost function consists of eleven terms, which are optimized jointly following the same procedure as in the one-missing-modality case, and is

$$L = \gamma \left( L_{\text{hallucinate}}^{D} + L_{\text{hallucinate}}^{I} \right) + L_{R} + L_{D} + L_{I} + L_{H_D} + L_{H_I} + L_{R D I} + L_{R H_D I} + L_{R D H_I} + L_{R H_D H_I}.$$

In addition to the two hallucination losses, $L_{\text{hallucinate}}^{D}$ and $L_{\text{hallucinate}}^{I}$, which are used to transfer information from the training phase about the relationship between RGB and depth images to the test stage, we have five losses, one for each network. As in the single missing modality scenario, these losses ensure that the individual models learn good individual feature representations that can then be used for the final prediction task. In addition, four losses are added for the various combinations of available and missing modalities, namely $L_{R D I}$, $L_{R H_D I}$, $L_{R D H_I}$ and $L_{R H_D H_I}$, where the subscript indicates the network losses that are combined. These losses ensure that the network is able to utilize the availability of data modalities during testing for the cases where modalities are only sometimes missing. For instance, given that both infrared and depth information is missing, testing is performed by using the RGB and the two hallucination networks, which means that the loss $L_{R H_D H_I}$ should be low. On the other hand, if, for example, the depth information is missing but infrared is available, we would in the test phase use the RGB network, the depth hallucination network and the infrared network. In this case, the loss $L_{R H_D I}$ should be low.
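The choice of networks at test time can be sketched as below. This is a hypothetical NumPy illustration: the network names are placeholders, and averaging the raw scores of the selected networks is assumed to carry over from the single missing modality case.

```python
import numpy as np

def select_networks(depth_available, infrared_available):
    """Pick the real network for each available modality, else its hallucination network."""
    networks = ["rgb"]
    networks.append("depth" if depth_available else "depth_hallucination")
    networks.append("infrared" if infrared_available else "infrared_hallucination")
    return networks

def predict(scores_by_network, depth_available, infrared_available):
    """Average the raw scores (C, H, W) of the selected networks and take the argmax."""
    selected = select_networks(depth_available, infrared_available)
    fused = np.mean([scores_by_network[name] for name in selected], axis=0)
    return fused.argmax(axis=0)

# Random stand-in score maps for the five networks (6 classes, 4x4 patch).
rng = np.random.default_rng(1)
names = ["rgb", "depth", "infrared", "depth_hallucination", "infrared_hallucination"]
scores_by_network = {name: rng.normal(size=(6, 4, 4)) for name in names}

# Depth missing, infrared available.
labels = predict(scores_by_network, depth_available=False, infrared_available=True)
```

The four possible return values of `select_networks` correspond to the four combination losses in the cost function, so every configuration that can occur at test time has been optimized during training.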

Method              MFB  Imp Surf        Building        Low veg         Tree            Car             Overall
                         F1      Acc     F1      Acc     F1      Acc     F1      Acc     F1      Acc     Avg F1  Avg Acc Acc
RG+I                     84.95   87.26   89.24   86.43   75.47   72.27   82.90   86.95   75.37   83.82   81.59   83.35   83.50
RG+I-ensemble            85.30   87.83   89.32   86.14   76.12   73.52   83.28   87.01   78.15   85.19   82.44   83.94   83.86
Hallucination            87.15   89.58   91.13   88.30   77.05   74.33   83.72   87.47   80.92   84.30   83.99   84.80   85.20
RG+I&Depth               88.89   89.33   93.29   91.42   77.29   75.68   83.88   87.38   81.82   83.42   85.03   85.45   86.33

RG+I                     86.83   89.13   91.49   89.81   75.30   75.18   85.67   86.36   59.00   44.38   79.66   76.97   84.96
RG+I-ensemble            87.14   89.47   91.65   89.79   75.71   75.74   85.76   86.39   60.01   45.09   80.05   77.30   85.17
Hallucination            88.19   91.14   92.49   90.58   76.54   74.31   86.49   88.34   74.75   62.90   83.69   81.45   86.22
RG+I&Depth               88.17   91.35   93.06   91.36   76.92   75.69   86.69   87.54   65.31   50.01   82.03   79.19   86.39
TABLE I: Performance of the different models for the Vaihingen dataset. The F1 scores and accuracies are shown as percentages. Bold numbers indicate the best accuracy among the first three models. The final model, RG+I&Depth, is used as a reference to illustrate the overall accuracy that could be achieved by a model if all data modalities are available and no hallucination network is employed.
Method              MFB  Imp Surf        Building        Low veg         Tree            Car             Overall
                         F1      Acc     F1      Acc     F1      Acc     F1      Acc     F1      Acc     Avg F1  Avg Acc Acc
RGB+I                    86.36   90.57   92.74   91.84   81.69   82.73   83.94   84.78   88.51   97.61   86.65   89.51   85.74
RGB+I-ensemble           87.03   91.28   93.64   92.99   82.19   82.98   84.49   85.08   89.15   97.86   87.30   90.04   86.43
Hallucination            87.26   93.01   93.84   91.41   82.08   83.72   84.81   84.25   88.17   97.88   87.23   90.05   86.53
RGB+I&Depth              88.76   91.00   96.26   97.79   81.38   79.36   83.48   87.22   84.98   98.02   86.97   90.68   87.58

RGB+I                    86.65   92.82   93.49   93.07   81.41   83.20   81.93   78.34   88.11   81.78   86.32   85.84   85.79
RGB+I-ensemble           87.06   91.98   93.98   95.14   82.22   84.62   82.71   77.96   89.61   84.58   87.11   86.86   86.39
Hallucination            87.47   94.51   94.62   93.47   82.76   83.13   83.86   80.42   91.04   90.62   87.95   88.43   86.96
RGB+I&Depth              88.75   94.50   96.74   97.26   82.10   83.90   83.14   79.85   82.49   71.37   86.65   85.38   87.76
TABLE II: Performance of the different models for the Potsdam dataset. The F1 scores and accuracies are shown as percentages. Bold numbers indicate the best accuracy among the first three models. The final model, RGB+I&Depth, is used as a reference to illustrate the overall accuracy that could be achieved by a model if all data modalities are available and no hallucination network is employed.
Method              MFB  Imp Surf        Building        Low veg         Tree            Car             Overall
                         F1      Acc     F1      Acc     F1      Acc     F1      Acc     F1      Acc     Avg F1  Avg Acc Acc
Hallucination            87.26   93.01   93.84   91.41   82.08   83.72   84.81   84.25   88.17   97.88   87.23   90.05   86.53
RGB+I&Depth (Hal)        87.90   93.00   95.20   94.14   82.01   82.78   84.52   83.91   88.76   97.64   87.68   90.29   87.20

Hallucination            87.47   94.51   94.62   93.47   82.76   83.13   83.86   80.42   91.04   90.62   87.95   88.43   86.96
RGB+I&Depth (Hal)        87.78   94.92   95.59   95.06   82.65   83.86   82.78   77.45   89.13   85.01   87.59   87.26   87.20
TABLE III: Performance of the hallucination Potsdam model. The F1 scores and accuracies are shown as percentages. Bold numbers indicate the best accuracy among the two models. The final model, RGB+I&Depth (Hal), corresponds to the trained hallucination model. However, in this case we do not use the hallucination network during testing, but instead assume that the depth information is available such that we can use the trained RGB+I and Depth networks.

V Experiments and results

In this section, we perform experiments on the Vaihingen (Section V-A) and Potsdam (Section V-B) datasets and report both qualitative and quantitative results to address Problem Scenario 1. For the trained Potsdam hallucination network, we further analyze its ability to perform inference when both modalities are available at test time. This is crucial for Problem Scenario 2, the problem of partially missing data modalities: a single trained model should not only perform well when a data modality is missing, but should also exploit that modality to improve performance when it is present during testing. Section V-C addresses Problem Scenario 3 and focuses on the case of multiple missing modalities for the Potsdam dataset, where both infrared and depth measurements are missing. For all datasets, we compare our proposed approach to a CNN trained only on the RGB+I (or RG+I) images. To ensure fairness with respect to the number of network parameters, we also compare the approach to an ensemble of two CNNs trained on the RGB+I (or RG+I) images. For the ensemble, the softmax outputs of the two CNNs were averaged during the test phase.
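The ensemble baseline described above can be sketched for a single pixel as follows. This is a minimal illustration, not the Caffe implementation used in the experiments: the function names and the raw score values are hypothetical, and only the averaging of the two softmax outputs is taken from the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(scores_a, scores_b):
    """Average the softmax outputs of two networks for one pixel.

    Returns the predicted class index and the averaged probabilities.
    """
    p_a = softmax(scores_a)
    p_b = softmax(scores_b)
    avg = [(a + b) / 2.0 for a, b in zip(p_a, p_b)]
    label = max(range(len(avg)), key=avg.__getitem__)
    return label, avg


# Hypothetical raw scores of two optical-only CNNs for one pixel (five classes).
scores_a = [2.0, 1.0, 0.1, 0.0, -1.0]
scores_b = [0.5, 2.5, 0.0, 0.2, -0.5]
label, probs = ensemble_predict(scores_a, scores_b)
```

In a full segmentation pipeline the same averaging would be applied independently at every pixel of the output map.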

V-A Vaihingen

Initially, we focus on Problem Scenario 1, the case of one missing data modality. Table I shows the results for the models when trained on the Vaihingen dataset with and without median frequency balancing. It illustrates that the hallucination network outperforms both the single RG+I model and the ensemble in terms of overall accuracy. Increases in accuracy can be observed in most classes, with large increases for the impervious surface class and the building class. Given that these two classes appear similar in many of the RGB images, it is not surprising that additional depth information benefits these classes considerably. This indicates that some of the additional information contained in the available training depth data benefits the test phase of our model even though the depth data is missing during testing.

Fig. 6: Segmentation results for an image in the test dataset. (a) RG+I image. (b) Ground truth. (c) Segmentation Ensemble. (d) Segmentation Hallucination.
Fig. 7: Closeup of the segmentation results for the bottom left corner of the image. (a) RG+I image. (b) Ground truth. (c) Ensemble. (d) Hallucination.
Fig. 8: Illustration of the effect that MFB has on the small car class. To highlight the car class, the color of the car class has been changed to red. (a) RG+I image. (b) Ground truth. (c) Hallucination without MFB. (d) Hallucination with MFB.

To illustrate some of the differences between the results achieved by the ensemble and the hallucination method, Figure 6 shows one of the RG+I images from the test set, the ground truth, and the segmentations achieved by the proposed method with the median frequency balancing cost function and by the RG+I ensemble. It can be seen that both models perform well. However, when comparing the results of the two models, it can be observed that the ensemble assigns more impervious surface pixels to the building class. This is as expected, as the color and shape of some roof areas can be considered similar to impervious surfaces. Including the additional depth information, as done in our proposed method, allows a better separation between these classes, as the normalized depth measurements indicate differences between buildings and impervious surfaces. Figure 7, a close-up of the bottom right corner of the image in Figure 6, illustrates this difference clearly, as large parts of the building are classified as impervious surface by the RG+I ensemble. By making use of the available depth information during the training phase, the proposed model is able to capture the building class more accurately.

For comparison, we also investigate the performance of a model when the RG+I and depth data are available during both training and testing, and we provide the results in Table I. The overall accuracy for this model is, as expected, higher, as more information is available to the network. The largest increases are observed in the building and the low vegetation classes. This corresponds to our intuition that the depth data is most useful for classes that can clearly be distinguished from similar-looking classes with the help of a height model. It also illustrates that the hallucination model, with regard to overall accuracy, is able to capture a significant part of the information contained in the depth data.

Table I also illustrates the effect of using median frequency balancing, as it can be observed that the smaller car class in our experiments generally performs much worse when not using median frequency balancing. The overall accuracy drops slightly when using the balanced cost function; however, given the large improvements in the smaller classes, this appears to be a negligible difference. Figure 8 shows a qualitative example of the effect of median frequency balancing. It can be seen that the hallucination network trained without balancing misses some of the cars or only classifies a few of the center pixels in each car. This coincides with our preconception that small classes will be neglected in an effort to optimize overall accuracy.
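To make the balancing concrete, the following is a minimal sketch of how median frequency balancing weights are commonly computed. It assumes the usual definition in which a class weight is the median class frequency divided by that class's frequency, and a class's frequency is its pixel count divided by the total pixel count of the images in which it appears. The function name and the toy label maps are hypothetical, and the modified Caffe loss used in the experiments is not reproduced here.

```python
from statistics import median


def median_frequency_weights(label_images, num_classes):
    """Return a dict mapping class index to its loss weight.

    label_images: list of images, each a flat list of class indices.
    """
    pixel_count = [0] * num_classes    # pixels labelled with class c
    image_pixels = [0] * num_classes   # pixels in images that contain class c
    for labels in label_images:
        for c in labels:
            pixel_count[c] += 1
        for c in set(labels):
            image_pixels[c] += len(labels)
    freqs = {
        c: pixel_count[c] / image_pixels[c]
        for c in range(num_classes)
        if image_pixels[c] > 0
    }
    med = median(freqs.values())
    return {c: med / f for c, f in freqs.items()}


# Two toy 10-pixel label maps; class 2 plays the role of a rare class such as cars.
image_1 = [0] * 6 + [1] * 3 + [2] * 1
image_2 = [0] * 8 + [1] * 2
weights = median_frequency_weights([image_1, image_2], num_classes=3)
```

Rare classes receive weights above one and frequent classes weights below one, so misclassified pixels of a small class contribute more to the weighted loss.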

Method             MFB  Imp Surf      Building      Low veg       Tree          Car           Overall
                        F1     Acc    F1     Acc    F1     Acc    F1     Acc    F1     Acc    Avg F1  Avg Acc  Acc
RGB                     85.70  90.01  92.20  91.41  79.45  81.14  82.41  81.57  87.55  97.81  85.46   88.39    84.77
RGB-ensemble            86.04  89.64  92.98  92.80  81.00  83.71  83.45  82.16  89.09  98.03  86.51   89.27    85.55
Hallucination           86.77  92.16  94.03  93.10  82.39  83.03  84.72  83.41  90.98  97.95  87.78   89.93    86.56
RGB&I&Depth             88.48  92.81  96.48  96.32  80.66  82.91  83.47  81.15  88.82  97.46  87.58   90.13    87.49
TABLE IV: Performance of the different models for the Potsdam dataset when we assume multiple missing modalities, namely infrared and depth. The F1 scores and accuracies are shown as percentages.

In addition to the F1-measure and the overall accuracy, we also investigated the use of the Intersection over Union (IoU) as an additional evaluation metric. However, as the overall trend is similar to the other metrics, we omitted this evaluation metric in our experimental results. Results for this metric for the experiments reported in this section can be found in the Appendix.
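As a reference for how these metrics relate, the sketch below computes per-class F1 and IoU together with the overall accuracy from a confusion matrix. The function name and the toy matrix are hypothetical. Per-class recall is included as one possible reading of the per-class accuracy columns in the tables, which is an assumption rather than something stated in this section.

```python
def segmentation_metrics(conf):
    """Compute per-class F1, IoU and recall plus the overall accuracy.

    conf[i][j] is the number of pixels with true class i predicted as class j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[c][c] for c in range(n))
    per_class = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                         # true class c, predicted otherwise
        fp = sum(conf[r][c] for r in range(n)) - tp    # predicted c, true class differs
        denom = tp + fp + fn
        per_class.append({
            "f1": 2 * tp / (2 * tp + fp + fn) if denom else 0.0,
            "iou": tp / denom if denom else 0.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        })
    overall_acc = correct / total if total else 0.0
    return per_class, overall_acc


# Toy two-class confusion matrix over 100 pixels.
conf = [[50, 10],
        [5, 35]]
per_class, overall_acc = segmentation_metrics(conf)
```

For a single class, IoU and F1 are monotonically related through IoU = F1 / (2 - F1), which is consistent with both metrics showing a similar trend.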

V-B Potsdam

Table II illustrates the results when performing the same set of experiments on the Potsdam dataset. Overall, we observe similar results, where the hallucination network outperforms both the RGB+I and the ensemble method in terms of overall accuracy. However, the overall increase in accuracy is lower for the Potsdam dataset, which we hypothesize is due to the higher resolution of the images. Due to the higher resolution, the RGB+I bands contain more cues that make it possible to distinguish between, for example, buildings and impervious surfaces. Table II also shows that median frequency balancing has an effect for the Potsdam dataset. However, also due to the higher resolution, which results in cars being larger in the small training image patches, the effect is smaller than for the Vaihingen dataset.

Further, we explore the effect of using the trained hallucination model to evaluate images during the test phase where both RGB+I and Depth images are available for the Potsdam dataset (Problem Scenario 2). To illustrate this, we take the trained hallucination model and evaluate its performance when both image modalities are available during testing. The RGB+I image is fed only into the RGB+I part of the hallucination network and the depth image is fed into the hallucination network's depth network. The final classification is again produced by applying a softmax to the averaged raw scores of the two networks. Note that no additional training is performed and the hallucination model is applied as is. These results are presented in Table III, which, for easier comparison, also includes a duplicate of the hallucination network's performance when the depth data modality is missing (from Table II). It can be seen that the hallucination model, when presented with the depth modality during testing, performs better than when the modality is missing. This means that the trained hallucination model can use the depth modality when it is not missing to achieve better performance, and can therefore be used both in situations where both modalities are available and in situations where the depth information is missing (Problem Scenario 2). This flexibility, however, comes at a price, as the overall accuracy for the model when both modalities are available is slightly lower than that of RGB+I&Depth in Table II, where the network is only optimized for situations where both modalities are available.
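The fusion step described above, a softmax applied to the averaged raw scores of two streams, can be sketched for one pixel as follows. The function names and score values are hypothetical. The only point illustrated is that the same fusion applies whether the second stream is the hallucination network (depth missing) or the depth network (depth available).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(optical_scores, second_stream_scores):
    """Average the raw scores of two streams, then apply a softmax."""
    avg = [(a + b) / 2.0 for a, b in zip(optical_scores, second_stream_scores)]
    return softmax(avg)


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical raw scores for one pixel over five classes.
optical = [1.2, 0.8, 0.1, 0.0, -0.7]
hallucinated = [0.9, 1.1, 0.2, -0.1, -0.5]   # second stream when depth is missing
depth = [0.2, 2.4, 0.0, -0.3, -0.9]          # second stream when depth is available

probs_missing = fuse_scores(optical, hallucinated)
probs_available = fuse_scores(optical, depth)
label_missing = argmax(probs_missing)
label_available = argmax(probs_available)
```

Note that this differs from the ensemble baseline, where the softmax outputs rather than the raw scores are averaged.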

V-C Multiple missing modalities

In this experiment, we evaluate the proposed method on the multiple missing modality scenario, Problem Scenario 3. We assume that the RGB data modality of the Potsdam dataset is available, while the infrared and the depth modalities are missing. Table IV shows the results for the different methods. As in the case of a single missing modality, the Hallucination network outperforms both the RGB network and the RGB-ensemble network, while performing worse than the network trained on all available modalities. This illustrates that the proposed model can handle not only a single missing modality during inference, but also multiple missing data modalities.

V-D Experimental setup

All experiments in this paper were performed using the deep learning framework Caffe [35] on a single Titan X. Modifications were made to the framework to support the median frequency balancing. Inference on an image patch takes 0.08 seconds for the hallucination model illustrated in Figure 4. The model was trained for approximately hours for the Vaihingen dataset and hours for the Potsdam dataset.

VI Conclusions and Future Work

In this paper, we have proposed a method for image segmentation in urban remote sensing that makes use of data modalities that are only available during the training phase. Our experiments show that the method performs better than both a single model using only the available data and an ensemble of two models. Additionally, by making use of the median frequency balancing cost function, we achieve good performance on small classes. We therefore consider it an attractive choice for handling missing data modalities in urban remote sensing. Note that the proposed methodology is applicable not only to urban land cover classification but can be generalized to alternative land cover classification scenarios.

The drawback of the proposed approach for multiple missing data modalities is that it does not scale well with the number of (potentially) missing modalities, as the number of losses grows exponentially. This is in many cases not problematic, as it is common to consider only a limited number of data modalities. However, in future work we intend to extend this work to a larger number of missing modalities, trying to devise end-to-end learnable networks that can 'hallucinate' several modalities in a scalable way and that can provide a single model for all combinations of modalities. Also, we currently assume that one of the data modalities is always available; for example, in the Vaihingen example we assume that the true ortho photo is available during training and testing. In future work, we would like to explore ways of avoiding this restriction. Further, we will investigate transfer learning methodology and the usage of pre-trained models for the sub-networks where pre-trained models are available, for example in the case where we have optical image data. Another potential research direction is to utilize larger networks that achieve state-of-the-art performance on segmentation tasks. However, as the focus of this work was to investigate the potential for handling missing data modalities and not to achieve overall state-of-the-art performance, this is left for future work.


We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used for this research and the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF) for providing the Vaihingen and the Potsdam dataset [36]. This work was partially funded by the Norwegian Research Council FRIPRO grant no. 239844 on developing the Next Generation Learning Machines.


  • [1] United Nations, “World urbanization prospects: The 2014 revision,” 2015.
  • [2] M. Kampffmeyer, A.-B. Salberg, and R. Jenssen, “Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks,” in Proceedings IEEE Conference Computer Vision Pattern Recognition Workshops, 2016, pp. 1–9.
  • [3] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, “Fully convolutional neural networks for remote sensing image classification,” in 2016 International Geoscience and Remote Sensing Symposium, 2016, pp. 5071–5074.
  • [4] V. A. Krylov, M. De Martino, G. Moser, and S. B. Serpico, “Large urban zone classification on SPOT-5 imagery with convolutional neural networks,” in 2016 International Geoscience and Remote Sensing Symposium.   IEEE, 2016, pp. 1796–1799.
  • [5] M. Volpi and D. Tuia, “Dense semantic labeling of subdecimeter resolution images with convolutional neural networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 2, pp. 881–893, 2017.
  • [6] N. Audebert, B. Le Saux, and S. Lefèvre, “Fusion of heterogeneous data in convolutional networks for urban semantic labeling,” in Urban Remote Sensing Event (JURSE), 2017 Joint.   IEEE, 2017, pp. 1–4.
  • [7] A. Lagrange, B. Le Saux, A. Beaupère, A. Boulch, A. Chan-Hon-Tong, S. Herbin, H. Randrianarivo, and M. Ferecatu, “Benchmarking classification of earth-observation data: from learning explicit features to convolutional networks,” in 2015 International Geoscience and Remote Sensing Symposium, 2015, pp. 4173–4176.
  • [8] J. Zhang, “Multi-source remote sensing data fusion: status and trends,” International Journal of Image and Data Fusion, vol. 1, no. 1, pp. 5–24, 2010.
  • [9] F. Bovolo and L. Bruzzone, “The time variable in data fusion: a change detection perspective,” IEEE Geoscience and Remote Sensing Magazine, vol. 3, no. 3, pp. 8–26, 2015.
  • [10] S. B. Serpico, S. Dellepiane, G. Boni, G. Moser, E. Angiati, and R. Rudari, “Information extraction from remote sensing images for flood monitoring and damage evaluation,” Proceedings of the IEEE, vol. 100, no. 10, pp. 2946–2970, 2012.
  • [11] D. Brunner, G. Lemoine, and L. Bruzzone, “Earthquake damage assessment of buildings using VHR optical and SAR imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 5, pp. 2403–2420, 2010.
  • [12] L. Gueguen and R. Hamid, “Large-scale damage detection using satellite imagery,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1321–1328.
  • [13] J. Hoffman, S. Gupta, and T. Darrell, “Learning with side information through modality hallucination,” in Proceedings IEEE Conference Computer Vision Pattern Recognition (CVPR), June 2016.
  • [14] A. Estabrooks, T. Jo, and N. Japkowicz, “A multiple resampling method for learning from imbalanced data sets,” Computational Intelligence, vol. 20, no. 1, pp. 18–36, 2004.
  • [15] M. Kampffmeyer, A.-B. Salberg, and R. Jenssen, “Urban land cover classification with missing data using deep convolutional neural networks,” in 2017 International Geoscience and Remote Sensing Symposium, 2017.
  • [16] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings IEEE Conference Computer Vision Pattern Recognition, 2015, pp. 3431–3440.
  • [17] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561, 2015.
  • [18] E. Riz, B. Demir, and L. Bruzzone, “Domain adaptation based on deep denoising auto-encoders for classification of remote sensing images,” in SPIE Remote Sensing.   International Society for Optics and Photonics, 2016, pp. 100040K–100040K.
  • [19] S. Paisitkriangkrai, J. Sherrah, P. Janney, V.-D. Hengel et al., “Effective semantic pixel labelling with convolutional networks and conditional random fields,” in Proceedings IEEE Conference Computer Vision Pattern Recognition Workshops, 2015, pp. 36–43.
  • [20] N. Audebert, B. Le Saux, and S. Lefèvre, “Segment-before-detect: Vehicle detection and classification through semantic segmentation of aerial images,” Remote Sensing, vol. 9, no. 4, p. 368, 2017.
  • [21] P. J. García-Laencina, J.-L. Sancho-Gómez, and A. R. Figueiras-Vidal, “Pattern classification with missing data: a review,” Neural Computing and Applications, vol. 19, no. 2, pp. 263–282, 2010.
  • [22] J. L. Schafer, Analysis of incomplete multivariate data.   CRC press, 1997.
  • [23] S. Aksoy, K. Koperski, C. Tusk, and G. Marchisio, “Land cover classification with multi-sensor fusion of partly missing data,” Photogrammetric Engineering & Remote Sensing, vol. 75, no. 5, pp. 577–593, 2009.
  • [24] B. A. Latif and G. Mercier, Self-Organizing Maps for processing of data with missing values and outliers: application to remote sensing images.   INTECH, 2010.
  • [25] H. Shen, X. Li, Q. Cheng, C. Zeng, G. Yang, H. Li, and L. Zhang, “Missing information reconstruction of remote sensing data: A technical review,” IEEE Geoscience and Remote Sensing Magazine, vol. 3, no. 3, pp. 61–85, 2015.
  • [26] A.-B. Salberg and R. Jenssen, “Land-cover classification of partly missing data using support vector machines,” International Journal of Remote Sensing, vol. 33, no. 14, pp. 4471–4481, 2012.
  • [27] “ISPRS 2d semantic labeling contest,” http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html.
  • [28] M. Gerke, “Use of the stair vision library within the ISPRS 2D semantic labeling benchmark (Vaihingen),” University of Twente, Tech. Rep., 2015.
  • [29] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in European Conference on Computer Vision.   Springer, 2014, pp. 345–360.
  • [30] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [31] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650–2658.
  • [32] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [33] R. Pascanu, T. Mikolov, and Y. Bengio, “Understanding the exploding gradient problem,” Computing Research Repository (CoRR) abs/1211.5063, 2012.
  • [34] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [35] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
  • [36] M. Cramer, “The DGPF-test on digital airborne camera evaluation–overview and test design,” Photogrammetrie-Fernerkundung-Geoinformation, no. 2, pp. 73–82, 2010.