Appearance Based Deep Domain Adaptation for the Classification of Aerial Images

by Dennis Wittich, et al.
uni hannover

This paper addresses domain adaptation for the pixel-wise classification of remotely sensed data using deep neural networks (DNN) as a strategy to reduce the requirements of DNN with respect to the availability of training data. We focus on the setting in which labelled data are only available in a source domain DS, but not in a target domain DT. Our method is based on adversarial training of an appearance adaptation network (AAN) that transforms images from DS such that they look like images from DT. Together with the original label maps from DS, the transformed images are used to adapt a DNN to DT. We propose a joint training strategy of the AAN and the classifier, which constrains the AAN to transform the images such that they are correctly classified. In this way, objects of a certain class are changed such that they resemble objects of the same class in DT. To further improve the adaptation performance, we propose a new regularization loss for the discriminator network used in domain adversarial training. We also address the problem of finding the optimal values of the trained network parameters, proposing an unsupervised entropy based parameter selection criterion which compensates for the fact that there is no validation set in DT that could be monitored. As a minor contribution, we present a new weighting strategy for the cross-entropy loss, addressing the problem of imbalanced class distributions. Our method is evaluated in 42 adaptation scenarios using datasets from 7 cities, all consisting of high-resolution digital orthophotos and height data. It achieves a positive transfer in all cases, and on average it improves the performance in the target domain by 4.3%. On datasets from the ISPRS semantic labelling benchmark, our method outperforms those from recent publications by 10-20%.



1 Introduction

The task of pixel-wise image classification is to assign a class label to each pixel in the image according to a pre-defined class structure. In remote sensing (RS) applications, the data usually consist of an orthorectified multi-spectral image (MSI), and, possibly, height information, e.g. obtained by 3D-reconstruction from overlapping images. For several years, research on this topic has been dominated by supervised methods based on deep learning, in particular on Fully Convolutional Neural Networks (FCN) (Long et al., 2015), e.g. (Marmanis et al., 2016; Zhang et al., 2019). One of the main problems related to deep learning in RS is that FCN require a large amount of training samples, the generation of which involves a labour-intensive interactive labelling process (Zhu et al., 2017b). One strategy to solve this problem is domain adaptation (DA) (Tuia et al., 2016), a special setting of transfer learning (TL) (Pan and Yang, 2009). This paper presents a new method for DA dedicated to the pixel-wise classification of RS data based on deep learning.

In DA, data are assumed to be available in different domains. The domains share the same feature space, but the data might follow different distributions. In both domains, a learning task is to be solved, characterized by the same class structure. One distinguishes a source domain DS, in which an abundance of training samples with known class labels is available, and a target domain DT. In semi-supervised DA, the scenario we are interested in, no labelled samples are available in DT (Tuia et al., 2016). The main goal of DA is to use the information available in DS to find a better solution for the task in DT, which requires the domains to be related (Pan and Yang, 2009).

In RS applications, the domains can be associated with imagery from different geographical regions or acquired at different epochs, but having the same input channels. The source domain corresponds to images for which pixel-level class annotations are known, e.g. from earlier projects, whereas the target domain corresponds to a new set of images to be classified according to the same class structure. In semi-supervised DA (which we will simply refer to as DA in the remainder of this paper) this is to be achieved without having to generate (new) training labels in DT, even though there is a domain gap, i.e. even though the distributions of the data in the two domains are different (Tuia et al., 2016). Due to this domain gap, a classifier trained using the labelled data from DS will achieve a lower performance in DT compared to a classifier trained using labelled data from DT. The goal of DA is to reduce this performance gap, i.e. to achieve a positive transfer, while at the same time avoiding a negative transfer, i.e. a reduction in the classifier’s performance after DA.

There are different strategies for DA. Methods based on instance transfer start with training a classifier using data from DS. This classifier is applied to data from DT to predict semi-labels, and the classifier is re-trained using target domain data with semi-labels in an iterative procedure. We believe that this strategy is strongly limited by the initial performance of the (source) classifier in DT. The second strategy for DA, used frequently in the context of deep learning (Wang and Deng, 2018), is based on representation transfer. Such methods map images from both domains to a shared and domain-invariant representation space in which a classifier trained using samples from DS can also be used to classify data from DT, e.g. (Tzeng et al., 2017a; Liu et al., 2020). However, we found such an approach to be rather difficult to train and prone to negative transfer (Wittich and Rottensteiner, 2019).

Consequently, the approach presented in this paper follows another strategy, which we refer to as appearance adaptation and which is based on methods for style transfer (Zhu et al., 2017a; Liu et al., 2017). It can be seen as a special case of representation transfer in which the common representation is the space of the original radiometric features of DT. A common approach is to use unlabelled samples from both domains to train an appearance adaptation network that transforms images from DS so that they look like images from DT. As the label information is not changed in this process, the transformed source images with known labels are used to train a classifier for DT. This strategy was originally applied to street scene segmentation (Hoffman et al., 2018; Zhang et al., 2018); examples of its application in RS are (Benjdira et al., 2019; Tasar et al., 2020a, b; Li et al., 2020; Soto et al., 2020; Gritzner and Ostermann, 2020).

A major problem of appearance adaptation is to achieve what we call semantic consistency: it is not sufficient for the transformed images to give an overall impression similar to the images from DT, but the appearance of objects has to be adapted so that after the transformation the objects of any given class look similar to the objects of the same class in DT (e.g., pixels corresponding to buildings should look like buildings of DT after the adaptation). This is particularly difficult if the label distributions in DS and DT are very different (Soto et al., 2020; Gritzner and Ostermann, 2020). Soto et al. (2020) tried to solve this problem by applying a cycle consistency constraint (Zhu et al., 2017a), in which images transformed from DS to DT and back again using a second adaptation network are required to have the same grey values as the original ones. This approach resulted in artifacts in the adapted images, a problem called hallucination of features (Cohen et al., 2018). Gritzner and Ostermann (2020) tried to align the label distributions based on label maps predicted in DT, but this did not lead to significant improvements in DA. Approaches trying to constrain the label distribution in DT to be similar to the one in DS, e.g. (Zhang et al., 2017), may even be detrimental in RS, where real differences in the label distributions exist and have been shown to be problematic for DA (Wittich and Rottensteiner, 2019). We conclude that achieving semantic consistency remains an unsolved problem in appearance-based DA, especially in the presence of large differences in the label distributions in DS and DT.

Another unsolved problem in DA is parameter selection, i.e. the selection of the values of the network parameters to be used for classification, which can also be seen as the problem of selecting a stopping criterion for training. Gritzner and Ostermann (2020) report that the classification error in DT shows large variations in the DA process and may even increase again after reaching a minimum, a behaviour we also observed in (Wittich and Rottensteiner, 2019; Wittich, 2020). In classical machine learning, a validation dataset is used to select a proper epoch to stop the training process. However, in our DA scenario, there are no labelled samples in DT and, thus, there is no validation set. In such a context, parameter selection is usually solved by making the number of training epochs a hyper-parameter to be tuned and using the parameter values determined in the final iteration. However, if one DA scenario with labels in DT is used to tune this hyper-parameter, there is no guarantee that the optimal number of training epochs in DA can be transferred to other pairs of source and target domains. It would be desirable to be able to select the appropriate epoch for stopping the training process based on some other criterion.

In this work we address these unsolved problems by proposing a new method for DA based on appearance adaptation. It applies adversarial training (Goodfellow et al., 2014), in which a discriminator network learns to distinguish adapted images from real target images, while the adaptation network learns to fool the discriminator by transforming the images such that they look like images from DT (Zhu et al., 2017a). However, unlike most existing approaches we do not apply cycle-consistency to constrain the adaptation, but try to achieve semantic consistency by joint training of the appearance adaptation and classification networks. Thus, the appearance adaptation network does not only learn to transform images from DS so that they look like images from DT, but the transformed images still need to be classified correctly. We believe this to be a suitable measure to prevent transformed regions belonging to a certain class from looking like areas corresponding to another class in DT. The contributions of this paper can be summarized as follows:

  1. We present a novel approach for DA based on semantically consistent appearance adaptation that relies on adversarial training of an appearance adaptation network jointly with the classification network. The approach requires only a single adaptation network from DS to DT, which makes it less memory-intensive and easier to tune than existing methods.

  2. To further improve semantic consistency we introduce a new regularization term to adversarial training to mitigate problems due to large differences in the class distributions in DS and DT. It prevents the discriminator from learning trivial solutions for differentiating samples from different domains.

  3. We propose a new criterion for selecting the optimal parameter values in DA that does not require any labelled validation data in DT. It relies on an entropy-based confidence measure for the predictions in DT.

  4. As a minor contribution, we address the problem of a poor classification performance for classes that are under-represented in the training set, presenting a new adaptive cross-entropy loss function which uses class-wise performance metrics to tune the weights of samples in supervised training.
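The entropy-based confidence measure behind the third contribution is not specified in this part of the paper. The following is only a minimal sketch of the general idea, assuming that the criterion is the mean per-pixel Shannon entropy of the softmax outputs on unlabelled target images and that the parameter set with the lowest value is kept. Both assumptions, as well as the names `mean_entropy` and `select_checkpoint`, are illustrative and not taken from the paper.

```python
import numpy as np

def mean_entropy(probs, eps=1e-12):
    """Mean per-pixel Shannon entropy of class probabilities.

    probs: array of shape (..., n_classes) whose last axis sums to 1.
    """
    probs = np.asarray(probs, dtype=np.float64)
    pixel_entropy = -np.sum(probs * np.log(probs + eps), axis=-1)
    return float(pixel_entropy.mean())

def select_checkpoint(probs_per_checkpoint):
    """Return the index of the checkpoint with the lowest mean entropy (assumed rule)."""
    scores = [mean_entropy(p) for p in probs_per_checkpoint]
    return int(np.argmin(scores)), scores

# Toy usage: two checkpoints evaluated on a 2x2 target image with 3 classes.
uncertain = np.full((2, 2, 3), 1.0 / 3.0)                       # maximally unsure
confident = np.tile(np.array([0.98, 0.01, 0.01]), (2, 2, 1))    # nearly one-hot
best, scores = select_checkpoint([uncertain, confident])
```

No labels from DT enter this computation, which is the property the contribution relies on. Whether low entropy actually tracks target accuracy is an empirical question that the sketch does not answer.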

2 Related Work

2.1 Semi-supervised deep domain adaptation

This review discusses methods for adversarial representation transfer, appearance adaptation, and hybrid methods combining different strategies. A more generic overview on deep DA can be found in (Wang and Deng, 2018).

2.1.1 Adversarial methods for representation transfer

The main idea of representation transfer is to map images from both domains to a shared representation space in which a single classifier is applied. Many approaches for representation transfer rely on adversarial training. Originally developed for methods designed to predict a single class label for every input image (Ganin et al., 2016; Tzeng et al., 2017b), this approach has also been transferred to pixel-wise classification of street scenes (Huang et al., 2018). In this application, an important motivation for DA is the desire to use synthetic images to generate training samples with pixel-wise label annotations for training (source domain DS) and adapt the resultant classifier to real (target) images. However, we think that the success of DA in this scenario is strongly related to the fact that typical street scenes are rather similar to each other with respect to the class distributions in the two domains. In some approaches, this is used to constrain the DA, e.g. to deliver a label distribution in DT similar to the one in DS, e.g. (Zhang et al., 2017; Vu et al., 2019), or to consider prior information about the typical location of objects in a street scene (e.g., sky is expected to occur in the upper part of an image) (Zou et al., 2018). However, for reasons already discussed, such assumptions are not justified in RS applications.

An example of DA based on representation transfer in RS is (Riz et al., 2016). A domain-independent feature representation from images of two geographical areas is obtained by training a stacked auto-encoder using images from both domains that learns to reconstruct the input image via a lower-dimensional feature space. This seems to work well for domains that are rather similar, but it remains unclear whether it would still be sufficient in the presence of larger domain differences. Gritzner and Ostermann (2020) perform representation transfer based on a domain distance for the pixel-wise classification of aerial images. Their results show that the adaptation performance strongly decreases if the class distributions are very different in the two domains. To improve the adaptation performance they align the representations using target images found to be semantically similar according to label maps predicted by the classifier trained on source domain data, but this only leads to an improvement in half of the presented experiments. Liu et al. (2020) aim at representation transfer by matching so-called feature curves from both domains using adversarial training. However, the domain gap that could be bridged by this method was limited. This indicates that adversarial representation transfer is difficult if the domain gap is large, e.g. when adapting between two different cities in which the objects have a different appearance or in which the class distributions are dissimilar, both of which are the case for the public benchmark dataset used in (Liu et al., 2020). We use this approach as a baseline for comparison in our experiments. In (Wittich and Rottensteiner, 2019) we also used representation transfer based on adversarial training for DA. We could achieve a small but stable improvement of the classification results due to DA if an early network layer was chosen for transfer. However, the results strongly depended on the hyper-parameters used in training, which makes this approach difficult to tune.

From the overview presented in this section we conclude that approaches based on representation transfer are facing problems in the presence of a large domain gap, in particular in case of large differences in the class distributions of the two domains. Approaches for stabilizing adversarial representation transfer for street scene classification frequently rely on assumptions which are not generally justified in RS applications, e.g. on class distributions to be similar.

2.1.2 DA based on appearance adaptation

This group of methods can be seen as a special case of representation transfer in which the shared representation space is the original feature space of the images in one domain. Relying on concepts for style transfer, they take an image from DS and adapt it so that it looks as if it were a sample from DT. As the training labels from DS are not affected by this transformation, a classifier can be trained in a supervised way based on the transformed images, e.g. (Yang et al., 2020b; Yang and Soatto, 2020; Chang et al., 2019; Musto and Zinelli, 2020; Chen et al., 2019; Hoffman et al., 2018). Maintaining semantic consistency as defined in section 1 is crucial for the success of this strategy. Some authors try to achieve this goal in the frequency domain, where the adaptation is applied to the amplitude, either based on learning (Yang et al., 2020b) or by swapping the low frequency coefficients between the source and target images (Yang and Soatto, 2020). We believe that in RS it may sometimes also be required to consider modifications of higher frequencies, e.g. when transferring between domains corresponding to images acquired at different seasons, in which deciduous trees look completely different.
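To make the frequency-domain idea concrete, the following sketch swaps the low-frequency amplitude of a source image with that of a target image while keeping the source phase. It is written in the spirit of the swapping approach described above and is not the implementation of Yang and Soatto (2020). The function name `swap_low_freq_amplitude`, the parameter `beta` and the square window are assumptions made for illustration.

```python
import numpy as np

def swap_low_freq_amplitude(source, target, beta=0.05):
    """Replace the low-frequency amplitude of `source` with that of `target`.

    source, target: float arrays of shape (H, W, C) with identical shapes.
    beta: fraction of the shorter image side that defines the half-width
          of the swapped (centred) frequency window.
    """
    fft_src = np.fft.fft2(source, axes=(0, 1))
    fft_tgt = np.fft.fft2(target, axes=(0, 1))
    amp_src, phase_src = np.abs(fft_src), np.angle(fft_src)
    amp_tgt = np.abs(fft_tgt)

    # Centre the zero frequency so that low frequencies form a central square.
    amp_src = np.fft.fftshift(amp_src, axes=(0, 1))
    amp_tgt = np.fft.fftshift(amp_tgt, axes=(0, 1))
    h, w = source.shape[:2]
    b = int(np.floor(min(h, w) * beta))
    ch, cw = h // 2, w // 2
    rows, cols = slice(ch - b, ch + b + 1), slice(cw - b, cw + b + 1)
    amp_src[rows, cols] = amp_tgt[rows, cols]
    amp_src = np.fft.ifftshift(amp_src, axes=(0, 1))

    # Recombine the mixed amplitude with the unchanged source phase.
    mixed = amp_src * np.exp(1j * phase_src)
    return np.real(np.fft.ifft2(mixed, axes=(0, 1)))
```

With `beta=0.0` only the zero-frequency coefficient is exchanged, so the output takes on the per-channel mean of the target image. Larger values transfer coarser radiometric structure, but the high-frequency texture of the source is never touched, which is the limitation pointed out above for seasonal changes.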

In the field of RS, Benjdira et al. (2019) used CycleGAN (Zhu et al., 2017a) to transform images from two different cities which were treated as different domains. Their main goal is to learn semantically consistent transformations between both domains by incorporating a cycle-consistency loss. The method, which we use as another baseline to compare our method to, results in quite large improvements in the classification performance for two out of six classes due to DA, but the results for the other classes could barely be improved.

Other papers on DA for RS applications present modifications of CycleGAN designed to improve the DA performance. Soto et al. (2020) address deforestation detection based on satellite imagery, defining the domains to correspond to images of different geographical regions. They show that cycle consistency is not sufficient to preserve the semantic structure in the adaptation process, apparently because the differences in the class distributions of the domains lead to hallucinated structures in the transformed images. They introduce an additional identity mapping loss which reduces the amount of hallucinated structures, but also leads to a decreased performance of the classifier after DA. Gritzner and Ostermann (2020) address the pixel-wise classification of urban scenes. Observing that plain CycleGAN led to a negative transfer in part of their experiments, the authors tried to improve the adaptation by training on semantically paired images (cf. section 2.1.1), but this did not result in a significant improvement.

There are also strategies to achieve semantic consistency that do not require cycle-consistency. Tasar et al. (2020a) learn a colour mapping to perform the image adaptation from the source to the target domain. However, as this approach cannot adapt the texture of objects, we think it is too limited to work in more complex DA scenarios, e.g. requiring transfer between images from different seasons. Tasar et al. (2020b) use a bi-directional image-to-image transformation based on an alternative to cycle-consistency called cross-cycle-consistency (Lee et al., 2018) and an alignment of the image gradients between the images before and after the transformation to achieve semantic consistency. In our opinion, this may be too strong a regularization when trying to apply DA to imagery from different seasons, because the gradient maps may change a lot in vegetated areas.

The method proposed in this paper is also based on appearance adaptation. However, compared to the papers cited in this section we propose a different strategy to achieve semantic consistency. Instead of relying on cycle-consistency or cross-cycle-consistency, we only train a single adaptation network that transforms images from the source to the target domain. Semantic consistency is achieved not only by forcing the transformed source images to look like those from DT, but also by requiring them to be classified correctly after the transformation. We also address possible problems caused by different class distributions in the two domains by applying a new regularization term to the output of the discriminator network. To the best of our knowledge, ours is the first approach for appearance adaptation based on joint training of the adaptation and the classification networks in the context of RS.

2.1.3 DA based on hybrid approaches

Quite a few hybrid approaches combine representation matching and appearance adaptation. For instance, Zhang et al. (2018) use gradient-based style transfer (Gatys et al., 2016) to reduce the visual difference between the two domains before feeding the images to the classification network, where representation transfer is applied. However, this approach leads to high computation times and requires a considerable amount of hyper-parameter tuning to achieve good results. One of the first hybrid approaches combining CycleGAN for appearance adaptation and adversarial representation transfer was CyCADA (Hoffman et al., 2018), applied to street scene classification. It is based on a rather complex network which, according to the authors, cannot be trained end-to-end on a consumer GPU due to very high memory requirements. Musto and Zinelli (2020) extend CyCADA by feeding the predicted label map to the appearance adaptation network and enforcing consistency between the label maps predicted for both, the original and the transformed source images. Chen et al. (2019) additionally enforce the predictions of original target images and adapted target images to be consistent, whereas Chang et al. (2019) replace cycle-consistency with cross-cycle-consistency. All of these methods are very complex with respect to the number of modules, parameters and training procedures. This may be the reason why none of them directly uses the classification loss for the transformed source images jointly with the losses related to the adaptation network to enforce semantic consistency. In contrast, our method does not combine appearance adaptation with representation transfer, but only learns the appearance adaptation and the classifier, which reduces the network complexity.

An architecture very similar to CyCADA was trained end-to-end in (Murez et al., 2018). The authors learn two encoders which embed images from both domains in a shared feature space by adversarial training. Simultaneously, two decoders are trained, which recover images based on the embeddings. To enforce semantic consistency the authors use an identity loss that enforces recovered representations to look like the original inputs. Further, they learn image-to-image transformations by decoding representations from each domain to the corresponding other one. The second aspect is again achieved via adversarial training, requiring two additional discriminators. The actual classifier is optimized to correctly classify embeddings for the source domain and embeddings for images transformed from DS to DT. The approach achieves good results for the pixel-wise classification of street scenes. However, it is unclear if it is transferable to DA in RS, where different class distributions pose additional challenges (Wittich and Rottensteiner, 2019). We consider (Murez et al., 2018) to be the work closest to our approach, because it is also based on the joint training of the appearance adaptation and the classification networks. However, we do not map the images to some domain-independent intermediate representation, but only adapt images from DS to DT. Thus, we only need one domain discriminator and one appearance adaptation network, which largely reduces the overall memory footprint of the architecture. Secondly, we propose an additional regularization of the discriminator, addressing the problems due to large differences in the class distributions.

Whereas there are many hybrid methods for street scene classification, we found only one such approach addressing the pixel-wise classification of RS images. Ji et al. (2020) combine appearance adaptation and representation transfer, using adversarial training in both cases. For the appearance adaptation they also rely on cycle-consistency. Representation transfer is applied in the last layer of the network. As already discussed in section 2.1.2, we think that cycle consistency may not be sufficient if the domains are very different. However, because the authors report quite good results on publicly available datasets, we use this approach as another baseline for the evaluation of our method.

2.2 Stopping and parameter selection criteria

A problem that is barely addressed in research on DA is the stopping criterion for the adaptation. In supervised training, it is common practice to monitor the performance of the classifier on a validation set and to select the parameter values resulting in the best validation performance as the final result of the training process (Prechelt, 1998). However, this approach cannot be used for semi-supervised DA for lack of labelled samples that could be used for validation in the target domain. Unfortunately, in DA it is particularly important for how long the adaptation process is continued. For instance, Gritzner and Ostermann (2020) show that after increasing for some time, the performance on a test set from the target domain decreases if the adaptation is carried out for too long. Benaim et al. (2018) discuss this problem for unsupervised appearance adaptation. However, they do not propose a solution, but derive a bound to predict the success of such methods. The common strategy in DA is to fix the number of epochs for the adaptation and to use the very last parameter set for inference (Tasar et al., 2020b, a; Benjdira et al., 2019; Musto and Zinelli, 2020), which means that this hyper-parameter has to be tuned with care. Some publications do not even state for how many epochs they train their model (Murez et al., 2018; Liu et al., 2020; Hoffman et al., 2018; Chen et al., 2019). To the best of our knowledge, this paper proposes the first method to solve the problem of unsupervised parameter selection in DA.

2.3 Training with imbalanced data

In RS applications, the class distribution of the training samples is often imbalanced. In such a case, the cross-entropy loss, which is commonly used for training FCNs, is dominated by frequent classes, so that after training, the prediction quality of under-represented classes may not be satisfactory. One way to compensate for this imbalance is to take measures which lead to well-defined clusters in the feature space. This can be achieved by considering similarity measures like the Euclidean distance (Hadsell et al., 2006; Schroff et al., 2015) or the cosine similarity (Yang et al., 2020a) of latent representations of samples belonging to the same class as constraints in the loss function. Hadsell et al. (2006) and Schroff et al. (2015) have shown that a clustering approach based on the Euclidean distance of representations can improve the results in tasks related to the prediction of a single label per image, but it remains unclear if this also applies to pixel-wise classification. Yang et al. (2020a) could improve the mean F1-score in a scenario addressing the pixel-wise classification of aerial images by enforcing the cosine similarity of representations belonging to the same class to be close to the respective centroid. However, we think that the cosine similarity does not necessarily lead to compact clusters, as it mainly affects the directions of vectors in feature space.

Another way of compensating class imbalance is to use a weighted cross-entropy loss in which pixels which correspond to an under-represented class are considered with a higher weight than pixels of the more frequent classes. However, this approach may be problematic in a DA scenario, because the class distribution in the target domain is unknown. Alternative loss functions such as the focal loss for binary classification (Lin et al., 2017) or its variant for the multi-class case (Yang et al., 2019) define the weights according to the predicted score for the reference class of each sample. In this way, the training process should focus on pixels that were predicted with high uncertainty, which was shown to increase the prediction quality of under-represented classes to some extent. However, pixels with a low confidence frequently correspond to pixels at object boundaries, where the label information is uncertain due to geometrical inaccuracies. In this work, we propose a different approach which incorporates class-wise performance metrics, but does not do this for every sample individually. The basic idea is that classes which are predicted with a lower quality should have a higher impact on the overall loss. An alternative would be to use the dice loss (Sorensen, 1948; Ren et al., 2020), but this would cause the classifier to focus too much on object borders. We think that this can be suboptimal for the reasons given in the context of the focal loss.
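The loss variants discussed above can be compared in a few lines. The sketch below gives the standard class-weighted cross-entropy and a multi-class focal loss, followed by a purely hypothetical weighting rule that illustrates the basic idea of letting poorly predicted classes contribute more. The rule in `weights_from_class_scores`, including its `floor` parameter, is an assumption made for illustration and is not the adaptive loss proposed in this paper.

```python
import numpy as np

def _reference_prob(probs, labels):
    """Pick the predicted probability of the reference class for every pixel."""
    return np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]

def weighted_cross_entropy(probs, labels, class_weights, eps=1e-12):
    """Cross-entropy in which each pixel is weighted by its reference class."""
    w = np.asarray(class_weights, dtype=np.float64)[labels]
    p_ref = _reference_prob(probs, labels)
    return float(np.sum(-w * np.log(p_ref + eps)) / np.sum(w))

def focal_loss(probs, labels, gamma=2.0, eps=1e-12):
    """Multi-class focal loss: down-weights pixels predicted with high confidence."""
    p_ref = _reference_prob(probs, labels)
    return float(np.mean(-(1.0 - p_ref) ** gamma * np.log(p_ref + eps)))

def weights_from_class_scores(class_scores, floor=0.05):
    """Hypothetical rule: the lower a class-wise score (e.g. IoU), the higher its weight."""
    w = np.maximum(1.0 - np.asarray(class_scores, dtype=np.float64), floor)
    return w / w.mean()

# Toy usage: a 1x2 image with two classes.
probs = np.array([[[0.9, 0.1], [0.6, 0.4]]])   # shape (H, W, n_classes)
labels = np.array([[0, 1]])                    # reference label per pixel
ce = weighted_cross_entropy(probs, labels, np.ones(2))
fl = focal_loss(probs, labels, gamma=2.0)
w = weights_from_class_scores([0.9, 0.5])
```

Note the difference in granularity: the focal loss weights every pixel by its own predicted score, whereas the last function yields one weight per class, which is then applied through `weighted_cross_entropy`.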

3 Methodology

We start with a formalization of the task following (Tuia et al., 2016). An input image from the input feature space is characterized by its height, its width and its number of channels. If height information is available, all channels but the last correspond to the orthorectified multispectral image and the last channel contains the metric height information for each pixel, e.g. in the form of a normalized digital surface model (nDSM) which contains the height above terrain for every pixel. Each input image corresponds to a label map that encodes the semantic label of every pixel. Thus, the label map comes from a categorical label space whose size is given by the number of classes in the pre-defined class structure. The learning task is to find a parameter set of a classifier such that it predicts the correct label map for any input image. In this work, we consider two domains, a source domain DS, where a training set of labelled images is assumed to be available, and a different but related target domain DT, where only a set of unlabelled images is available. We address the semi-supervised setting of domain adaptation, trying to use the labelled source images and the unlabelled target images jointly to solve the learning task in DT such that the resulting model achieves a better performance than a classifier trained only on the labelled source images. This corresponds to a common scenario in RS, where a dataset labelled in the past is used for training a classifier which is to be applied to new data without requiring new training labels.

3.1 Overview

An overview of the method including the main loss terms used in training is shown in figure 1. As in all DA methods based on appearance adaptation, the core idea is to substitute missing label information in DT by labelled images from DS that were transformed such that they have an appearance similar to images from DT. This is achieved by an appearance adaptation network A. The transformed versions of the source images are used jointly with the corresponding label maps from the source domain to train a classifier C in a supervised way. By passing the transformed images through the network and optimizing its parameters such that the predictions are correct, the classification network should be adapted to the target domain. For this approach to be successful, the transformed images should look like images from the target domain, but the transformation also has to be semantically consistent in the way defined in section 1. To that end, we propose a joint training of A and C, using the supervised loss of the transformed images as a guidance for A to achieve semantic consistency. To make the transformed images look like images from the target domain, we rely on adversarial training of A and a domain discriminator D. The idea is that A learns to transform input images from DS such that they look as if they came from DT (adversarial training), while still being correctly classified by C (supervised guidance). The latter aspect is the main key to achieving a good performance in the target domain, but simultaneously, due to a second supervised loss computed on the original source images, C also learns to classify images from the source domain, which is required to achieve semantic consistency (cf. 3.3.2). In addition, we introduce a regularization loss which should prevent D from learning to differentiate the domains only based on the occurrence of simple features and, thus, should prevent A from hallucinating structures. In the subsequent sections the network architecture and the training strategy are described in detail.

Figure 1: Method overview. The classification network C predicts maps of class scores and for source images () and transformed images (), respectively, the latter being produced by the appearance adaptation network A. In both cases, C is trained to predict label maps that match the reference from by minimizing the supervised loss terms and . Simultaneously, A is trained in an adversarial way using the discriminator D. To enforce semantic consistency, A is also trained to minimize . D delivers probability maps for the corresponding inputs to belong to . They are considered in the adversarial loss terms and and in the proposed regularization term .

3.2 Network architecture

The network architecture consists of the three modules: the classification network C, the appearance adaptation network A and the domain discriminator D with corresponding sets , and of trainable parameters.

3.2.1 Classification network

Layer Layer type H / W Depth
1 Input layer 256
2 Conv(3) stride 2, BN, ReLU 128 32
3 Conv(3), BN, ReLU 128 64 *
4 Xception block 64 128 *
5 Xception block 32 256 *
6-15 Xception block 16 728 *
16 Xception block 8 1024 *
17 Conv(3), BN, ReLU 8 1536 *
18 Conv(3), BN 8 2048 *

16 Upsample, Concat(15) 16 2776
17, 18 Conv(3), ReLU 16 256
19 Upsample, Concat(5) 32 512
20, 21 Conv(3), ReLU 32 128
22 Upsample, Concat(4) 64 256
23, 24 Conv(3), ReLU 64 64
25 Upsample, Concat(3) 128 128
26, 27 Conv(3), ReLU 128 32
28 Upsample 256 32
29, 30 Conv(3), ReLU 256 16
31 Conv(1), Softmax 256
Table 1: Classification network C. Conv(k): convolution with kernel size k; BN: Batch-Normalization; ReLU: rectified linear unit. Layers marked with an asterisk are pre-trained on the ImageNet dataset (Deng et al., 2009). Concat(X): depth-wise concatenation of the output of the layer X and the current layer. H / W / Depth: output dimensions.

The classification is performed by an FCN which takes an image with channels as input and predicts pixel-wise class scores. These scores are normalized by applying the softmax function, which results in probabilistic class scores for each pixel, arranged in maps for images from and for transformed images . This corresponds to the red and blue paths in figure 1, respectively. The actual class predictions are obtained by selecting the class with the highest probability for each pixel. In table 1 the full architecture of the classification network is given. We use a UNet-like architecture (Ronneberger et al., 2015) with an Xception backbone (Chollet, 2017) pre-trained on ImageNet (Deng et al., 2009). The Xception backbone corresponds to the encoder of our classification network, including all layers before the one with the lowest spatial resolution (layer 18 in table 1). In the decoder of the network, nearest neighbour interpolation is used for upsampling. In preliminary experiments we compared this architecture in combination with a pre-trained initialization to a residual network with completely random initialization, similarly to (Wittich, 2020). We observed a comparable performance but a noticeable reduction of the training time. Because the backbone is pre-trained on three-channel RGB-images, the first layer of the network cannot be used if the number of input channels is different from three. In that case, the first layer is replaced by a convolution layer with input-channels which is initialized randomly. The parameters of the decoder are also initialized randomly; all random initializations are based on (He et al., 2015). Overall, this network has about 28.8 M parameters. We choose this rather large network because it has to learn to classify images from and from (cf. section 3.3), which we assume to be a more complex task than classifying only images from one domain due to the higher variability of the data. By using a larger network, we reduce the risk that the learning capacity of C becomes a limiting factor of the method.

3.2.2 Appearance adaptation network

The appearance adaptation network takes an image from the source domain as input and delivers the transformed version . This corresponds to the dotted green line in figure 1. For this task we use a residual FCN with about 5 M parameters that is a simplified version of the one used in (Wittich, 2020). Table 2 lists all layers of the network. An initial convolution with a stride of four pixels downsamples the signal by a factor of 4. It is followed by 15 residual blocks at the reduced scale. Each of them consists of two subsequent convolutions with 64 and 256 filters, respectively, zero padding and ReLU activation. The result of each residual block is added to its input. At the end, two transposed convolutions are used to enable predictions at the original spatial resolution of the input. Commonly, non-linearities with a bounded range of values such as the hyperbolic tangent function are used to produce the output of networks for image generation (regression of colour values) (Goodfellow et al., 2014). However, because we also want to be able to predict transformed height information we do not restrict the values to a fixed range. Consequently, we do not apply any activation function to the output of the last convolutional layer.

We chose a residual network for the appearance adaptation because we expect the optimal solution for this task not to deviate too much from an identity mapping. For example, objects which have the same appearance in both domains should not be changed at all in the adaptation. Following (He et al., 2016), we assume that residual networks are well suited to learn such a solution, which we confirmed in preliminary experiments.

Layer Layer type H / W Depth
1 Input layer 256
2 Conv(6) stride 4, ReLU, BN 64 256
3-18 Residual block 64 256
19 T-Conv(4) stride 2, BN, ReLU 128 128
20 T-Conv(4) stride 2 256
Table 2: Residual network for appearance adaptation. T-Conv(k): transposed convolution with kernel size k. For other abbreviations, cf. table 1.

3.2.3 Domain discriminator

The discriminator network is required for the training of the appearance adaptation network in an adversarial way (cf. section 3.3.2). Its task is to predict whether an image is from the target domain or whether it is a transformed source image. However, instead of predicting one class score per image (Goodfellow et al., 2014), we predict a map of probabilistic class scores. This has been shown to achieve better results for adapting the appearance of images, for instance by Isola et al. (2017), whose discriminator architecture we adapt for our purposes. Our network, which has about 2.8 M parameters, is described in table 3. It takes either a transformed image or a target domain image as input and predicts maps of probabilistic class scores and , respectively. This corresponds to the solid green and yellow paths in figure 1. Given the receptive field of the network, each value in the probability maps denotes the probability for the corresponding support window in the input image to come from the target domain. Deviating from (Isola et al., 2017), we replace the batch-normalization layers by a spectral normalization of the weights, as proposed in (Miyato et al., 2018). In preliminary experiments we found this to lead to a more realistic appearance of the transformed images.

Layer Layer type H / W Depth
1 Input layer: MSI + nDSM 254
2 Conv(4) stride 2, LReLU 126 64
3 SN-Conv(4) stride 2, LReLU 62 128
4 SN-Conv(4) stride 2, LReLU 30 256
5 SN-Conv(4) stride 1, LReLU 27 512
6 SN-Conv(4) stride 1, Sigmoid 24 1
Table 3: Discriminator network. SN-Conv(k): convolution with kernel size k and spectral normalization of the weights. LReLU: leaky ReLU with slope 0.1 (as in (Wittich, 2020)). For other abbreviations, cf. table 1.
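The spatial sizes listed in table 3 can be checked with simple convolution arithmetic. The following minimal sketch assumes that the convolutions of D use no padding (the text does not state this explicitly); under that assumption it reproduces the sizes in the table and derives the side length of the support window in the input image that influences a single value of the discriminator output. The function names are hypothetical.

```python
# Layers 2-6 of table 3 as (kernel size, stride).
# ASSUMPTION: no zero padding is applied in any layer.
LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]


def output_sizes(input_size, layers):
    """Spatial size after each unpadded convolution: floor((n - k) / s) + 1."""
    sizes = []
    n = input_size
    for kernel, stride in layers:
        n = (n - kernel) // stride + 1
        sizes.append(n)
    return sizes


def receptive_field(layers):
    """Side length of the input window seen by one value of the last layer."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # growth of the window at this layer
        jump *= stride                # distance between neighbours in input pixels
    return field


print(output_sizes(254, LAYERS))  # [126, 62, 30, 27, 24]
print(receptive_field(LAYERS))    # 70
```

Under the stated assumption, each of the 24 x 24 output values therefore judges a window of 70 x 70 input pixels.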

3.3 Training

To determine the parameters of all networks, we use a two-stage training strategy which consists of the source training and the subsequent DA. In source training, only the parameters of the classification network C are determined by conventional supervised training using the labelled source domain dataset (cf. section 3.3.1), resulting in the parameter set . In the second stage, described in section 3.3.2, the actual DA is carried out using and . In this process, the parameter set is used as initialization for , and the parameters of the other networks are randomly initialized according to (He et al., 2015). The second stage could also be carried out starting from a random initialization. However, in preliminary experiments we observed random initialization to increase the time required for training by a factor of 2.

3.3.1 Supervised source training

During source training, the parameters for the classification network C are iteratively updated by stochastic gradient descent, minimizing a supervised loss that is based on the discrepancy between the predicted labels and the reference. Commonly, the cross-entropy loss is used for a multi-class classification problem. However, this loss can be suboptimal if the class-distribution of the training dataset is imbalanced, as is often the case in RS applications. In such a case, the cross-entropy loss is dominated by the frequent classes, which is the reason why the prediction quality of under-represented classes may not be satisfactory after training. For the reasons discussed in section 2.3, we think that existing methods to mitigate this problem are not optimal and, thus, we propose a new approach. Similarly to the focal loss (Lin et al., 2017; Yang et al., 2019), it also adapts the loss function by a weight that depends on the quality of the prediction. However, in contrast to the core idea of the focal loss, the training procedure should not focus on individual pixels whose class labels are difficult to predict, but it should focus on classes which are predicted with low quality. Thus, we determine one weight per class (rather than per pixel as in (Lin et al., 2017)) which depends on the current prediction quality of that class.

After initialization, training starts with one epoch based on the standard softmax-cross-entropy loss. In all subsequent training epochs, the training images of a minibatch are classified using the current state of the classifier and the results are compared to the reference to determine the intersection over union $IoU_c$ as a class-specific performance indicator for all classes $c$:

$$IoU_c = \frac{TP_c}{TP_c + FP_c + FN_c} \qquad (1)$$

In equation 1, $TP_c$, $FP_c$ and $FN_c$ are the numbers of true positives, false positives and false negatives, respectively, of all samples assigned to class $c$. These IoU scores are used to determine the class-wise weights:


In equation 2, is the difference between the class wise score and the mean of all classes, and the hyper-parameter scales the influence of classes with a lower . In the next epoch the classifier is trained using a weighted cross-entropy loss in which the loss of each pixel is weighted by according to its reference label. For a batch of training images with height H and width W, this loss becomes


where is the index of an image in the batch, are the indices of a pixel in an image and is the class index. is a short-hand for the number of pixels in a batch. The symbol indicates whether pixel in the label map belongs to class () or not (), whereas denotes the softmax output for class at pixel . After each new epoch the weights are re-computed according to equations 1 and 2 and used for the subsequent epoch. To prevent the classifier from overfitting, we use L2-regularization of the parameters and apply data augmentation in the way described in section 4.2.1.
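The procedure of this section can be sketched in a few lines of plain Python. The IoU and the weighted cross-entropy follow the description above. The concrete form of the class weights is an assumption made only for illustration, because equation 2 is not legible in this text: the sketch uses (1 - delta)^kappa, where delta is the difference between the class-wise IoU and the mean IoU, and `kappa` is a hypothetical name for the scaling hyper-parameter.

```python
import math


def class_iou(reference, predicted, num_classes):
    """IoU per class from flat lists of reference and predicted labels."""
    ious = []
    for c in range(num_classes):
        pairs = list(zip(reference, predicted))
        tp = sum(1 for r, p in pairs if r == c and p == c)
        fp = sum(1 for r, p in pairs if r != c and p == c)
        fn = sum(1 for r, p in pairs if r == c and p != c)
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 0.0)
    return ious


def class_weights(ious, kappa):
    """ASSUMED weight form (1 - delta)^kappa with delta = IoU_c - mean IoU.

    Classes below the mean IoU obtain a weight above 1, and vice versa.
    """
    mean_iou = sum(ious) / len(ious)
    return [(1.0 - (iou - mean_iou)) ** kappa for iou in ious]


def weighted_cross_entropy(reference, probabilities, weights):
    """Mean of -w_c * log(p_c) over all pixels, c being the reference class."""
    total = 0.0
    for label, probs in zip(reference, probabilities):
        total -= weights[label] * math.log(probs[label])
    return total / len(reference)


# Toy example: class 1 is never predicted and therefore receives a higher weight.
ious = class_iou([0, 0, 0, 1], [0, 0, 0, 0], num_classes=2)
weights = class_weights(ious, kappa=1.0)
print(ious, weights)
```

With all weights equal to one, the function reduces to the standard cross-entropy used in the first epoch.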

3.3.2 Embedded appearance adaptation

In the adaptation phase, the parameters of all networks are determined using variants of stochastic gradient descent. To simplify the notation we define a combined set consisting of the parameters of the adaptation network A and the classification network C. As we want to determine these parameters by joint training of the two networks, according to the principles of adversarial training, in each iteration the gradient of a joint loss function with respect to is computed first. This corresponds to the paths visualized by the red, the blue and the two green arrows in figure 1, based on a batch of B images from and the corresponding labels. The resulting gradient is used to update the parameter set . After that, within the same training iteration, the gradient of a discriminator loss with respect to the parameters of the discriminator is computed and used to update . This requires an additional set of B unlabelled images from , which are processed jointly with the B transformed images from that contributed to the update of , and it corresponds to the paths visualized by the yellow and green arrows in figure 1. We follow the common practice to alternate between updating and (Goodfellow et al., 2014). In the following, the two steps are described in detail.

Joint update of and

The joint loss used to update A and C consists of three components:


where and are weighting factors to control the relative influence of the corresponding loss terms.

The first term is related to the main goal of the DA, namely to achieve a good classification performance on images from the target domain. It is formulated as a supervised classification loss for transformed images similarly to equation 3:


where denotes the predicted probabilistic class score for the pixel in the transformed input image and the remaining symbols are those already defined in the context of equation 3.

The second term in equation 4, is the supervised loss for images from the source domain and is computed according to equation 3. This is important to achieve semantic consistency. If C were solely trained using the transformed images, it would be possible for the appearance adaptation network to produce semantically inconsistent results, e.g. images in which transformed trees look like target domain buildings and vice versa. In this case, could still learn to predict the label maps correctly, but it would no longer perform well for real target images. Considering for source images will constrain the target classifier so that does not deviate too much from the classifier of the source domain, so that this loss acts as a regularization term.

The last term in equation 4 is the adversarial loss for and realizes the component of adversarial training that influences . The appearance adaptation module A should learn to transform images from such that they look like images from by maximising the probabilities predicted by D for the transformed images to be from , which results in the following loss:


where corresponds to the prediction of D at position (), i.e. the predicted probability of the corresponding support window in the transformed input image presented to D to be an image from the target domain, which should be large such that can learn to fool the discriminator (cf. section 3.2.3). and are the height and width of the discriminator output, respectively, and is the number of discriminator pixels in the batch with images.

As stated above, each training iteration starts with a step aiming at minimizing the joint loss with respect to and . While depends on via the adversarial term , these parameters are not updated in this context. Minimizing will adapt the adaptation network such that fools the discriminator , i.e. such that its output cannot be discriminated from a real target domain image, and at the same time it will update the classifier such that it performs well for target domain images. The supervised loss acts as a regularizer for .

In one training iteration, a source domain image will contribute to the loss twice, namely via (equation 5) and (equation 3). This has to be considered in the batch normalization layers of . We only use the transformed images to compute the running averages of these layers, because we assume them to be more closely related to the statistics of than the values for the source domain images if the appearance adaptation is successful.

Update of

The second step of adversarial training is related to the update of the discriminator network , which is based on a loss function consisting of two terms, the second one being weighted by :


The first term is typical for adversarial training and is supposed to train the discriminator network D to differentiate real images from and transformed source images :


where is the probability for the image patch corresponding to the discriminator pixel () for the image in a batch of images from to be obtained from a target domain image and the remaining symbols are identical to those defined in the context of equation 6.

The second term in equation 7 corresponds to the proposed new regularization for D that is motivated by the following line of thought, supported by observations in preliminary experiments. Whenever the feature distributions between and are very different, the discriminator network can easily distinguish the domains by simply focussing on frequent features. For instance, if has a higher frequency of pixels corresponding to vegetation than , the discriminator will quickly learn to predict the probability for an image to be drawn from based on the number of pixels that are representative for vegetation. In consequence, D will predict such regions to come from with a high probability, and this will have a higher impact on the overall decision for an image than other areas which are more difficult to differentiate and where the predicted probabilities might be close to . We note that such a situation results in a high variance in the predicted maps . As A tries to fool the discriminator, it will learn to mimic such features, e.g. to predict vegetation areas in the example just mentioned. However, as the respective areas may correspond to other classes, this would result in hallucinated structures, which means that semantic consistency will not be achieved. To prevent the discriminator from learning such solutions, we propose to constrain the variance of the output of the discriminator. In this way, D also has to learn to differentiate the more difficult areas and, thus, to learn non-trivial differences between the domains. To do so, the batch-wise standard deviation of the values of the discriminator output for the target images and the transformed source images of a batch is penalized, which results in the regularization loss


where denotes the average value in for the image set in the batch. The remaining symbols are as described in the context of equation 8.
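The two terms of the discriminator loss can be illustrated as follows. This is only a sketch under stated assumptions, since equations 7 to 9 are not legible in this text: the adversarial term is assumed to be the standard cross-entropy form of (Goodfellow et al., 2014), the regularization term is assumed to be the sum of the standard deviations of the discriminator output for the target images and for the transformed source images, and `lambda_reg` is a hypothetical name for the weight of the second term. The inputs are flattened lists of discriminator probabilities strictly between 0 and 1.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    """Population standard deviation of a flat list of values."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def adversarial_loss_d(d_target, d_transformed):
    """ASSUMED standard GAN form: target patches should score 1,
    patches of transformed source images should score 0."""
    real = mean([-math.log(p) for p in d_target])
    fake = mean([-math.log(1.0 - p) for p in d_transformed])
    return real + fake


def regularization_loss(d_target, d_transformed):
    """Penalise the spread of the discriminator output within each image set."""
    return std(d_target) + std(d_transformed)


def discriminator_loss(d_target, d_transformed, lambda_reg):
    return (adversarial_loss_d(d_target, d_transformed)
            + lambda_reg * regularization_loss(d_target, d_transformed))


# A map with a few very confident regions is penalised more than a uniform one.
patchy = [0.9, 0.5, 0.9, 0.5]
uniform = [0.7, 0.7, 0.7, 0.7]
print(regularization_loss(patchy, uniform))
```

The design intent described above is visible in the toy example: the discriminator cannot lower the loss by being very confident on a few easy regions only, because that increases the spread of its output.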

In order to update the parameters of , the gradients of with respect to are used, which will train such that it can discriminate well between real target images and transformed source images, thus making the task of the adaptation network more difficult. Note that in this process, the transformed images are not directly presented to the discriminator, but instead a small horizontal and vertical shift as well as a radiometric transformation as described in section 4.2.1 are applied. The shift is drawn from a uniform distribution with the lower limit of zero and the upper limit of 4 pixels, which corresponds to the filter size of the first convolution in D. In preliminary experiments, we found this to reduce artefacts in the form of high-frequency repeating patterns in the transformed images.

3.4 Entropy based parameter selection

Many DA methods rely on iterative processes that are repeated for a fixed number of epochs, using the parameter set after the very last iteration for inference (Tasar et al., 2020b, a; Benjdira et al., 2019). For the reasons given in section 2.2 we propose another strategy, namely to select a parameter set according to an optimality criterion derived from the data. For lack of labelled data in , this criterion cannot be based on the validation error in that domain. Instead, we use the average entropy of the predicted class scores for images from as an approximate measure for the validation performance. The entropy of the class scores can be interpreted as a measure of uncertainty of the predictions (Wittich, 2020). Consequently, we expect good parameter values to lead to a low uncertainty of the predictions and, thus, to a low entropy. After each epoch in the adaptation, the classification network C is used to predict the probabilistic class scores for every pixel of all images from . For such a pixel, the entropy is computed according to


and the average entropy is determined from all the pixel-wise values. The parameter set having the lowest value of is selected after running the adaptation for a fixed number of epochs. To save computation time, the mean entropy is not computed for the first couple of epochs, because A and D are not expected to deliver meaningful results in the beginning of the adaptation phase.
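The selection criterion can be sketched as follows. The natural logarithm is an assumption (the base of the logarithm is not legible in this text and does not affect which epoch is selected), and the function names are hypothetical.

```python
import math


def pixel_entropy(probs):
    """Shannon entropy of the class scores of one pixel (natural log assumed)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(score_map):
    """Average entropy over an iterable of per-pixel class-score lists."""
    values = [pixel_entropy(p) for p in score_map]
    return sum(values) / len(values)


def select_epoch(entropy_per_epoch, first_epoch):
    """Return the epoch with the lowest mean entropy among epochs >= first_epoch.

    entropy_per_epoch maps an epoch number to the mean entropy measured on the
    unlabelled target images after that epoch.
    """
    candidates = {e: h for e, h in entropy_per_epoch.items() if e >= first_epoch}
    return min(candidates, key=candidates.get)


# Epoch 1 has the lowest value but is ignored because evaluation starts at epoch 2.
history = {1: 0.10, 2: 0.50, 3: 0.30, 4: 0.40}
print(select_epoch(history, first_epoch=2))  # 3
```

In practice the parameter set of every candidate epoch has to be stored (or the best one kept) so that the selected state can be restored after the adaptation.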

3.5 Resolution adaptation

Methods for appearance adaptation have problems if the ground sampling distance (GSD) of the source domain () is different from the one of the target domain () (Liu et al., 2020; Benjdira et al., 2019). To overcome this problem we pre-scale the images of one domain using the information about the GSD, which is usually available in RS applications. If the data from are downsampled to the resolution of . This includes downsampling of both, the image data (using bilinear interpolation) and the reference (using nearest neighbour interpolation). Then, source training and DA are performed using the resampled data from and the original data from . On the other hand, if , we downsample the image data from to and perform source-training and domain adaptation in the resolution of . In order to predict label maps for at the original resolution, the predicted probabilities are upsampled to using bilinear interpolation. The pixel-wise class predictions are then obtained by selecting the class with the highest probability for each pixel in the upsampled probability maps.
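The decision logic of this section can be summarized in a small helper. The sketch assumes, following our reading of the paragraph above, that the coarser of the two GSDs always serves as the working resolution; the function name and the return format are hypothetical.

```python
def resolution_plan(gsd_source_cm, gsd_target_cm):
    """Decide which domain is downsampled and by which factor.

    Returns (working_gsd, source_scale, target_scale, upsample_predictions).
    A scale of 0.5 means that the side length of an image is halved.
    """
    # ASSUMPTION: the coarser GSD is used for source training and adaptation.
    working_gsd = max(gsd_source_cm, gsd_target_cm)
    source_scale = gsd_source_cm / working_gsd
    target_scale = gsd_target_cm / working_gsd
    # Predicted probabilities only need upsampling if the target was downsampled.
    upsample_predictions = target_scale < 1.0
    return working_gsd, source_scale, target_scale, upsample_predictions


# Source finer than target: only the source images and labels are downsampled.
print(resolution_plan(5, 9))
# Target finer than source: target images are downsampled, predictions upsampled.
print(resolution_plan(9, 5))
```

Images and probability maps would be resampled bilinearly and label maps with nearest neighbour interpolation, as described above.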

4 Experiments

4.1 Test Datasets

For the evaluation of the proposed method seven datasets were used, each showing a different German city and each being treated as a single domain. The first group of datasets consists of MSI, height data and label maps for the cities of Schleswig (S), Hameln (Hm), Buxtehude (B), Hannover (H) and Nienburg (N) with a GSD of 20 cm (Wittich, 2020). All of these datasets include images (red, green, blue, near infrared) and a normalized digital surface model (nDSM) which contains the height above ground for each pixel. The blue channel was not used in the experiments, because it was not available for all datasets. The reference was generated by manual labelling according to the class structure shown in table 4. In (Wittich, 2020), each dataset was split into subsets for training and testing, respectively. We denote the training subset of a city by and the testing subset as .

City P V S Hm B H N
Size in M pixel 85.5 34 26 37 100 100 100
Capturing season A S S A S S S
Class distr. [%]
Sealed Ground (SG) 29.6 27.8 14.1 18.8 22.1 33.6 22.8
Building (BU) 25.7 26.0 14.7 19.1 19.7 36.7 18.4
Low Vegetation (LV) 22.6 21.3 38.9 36.2 36.9 7.5 40.3
High Vegetation (HV) 15.5 22.9 31.5 24.5 20.3 20.6 17.8
Vehicle (VH) 1.8 1.2 0.8 1.3 1.0 1.6 0.7
Clutter (CL) 4.8 0.8 - - - - -
Table 4: Dataset overview. Capturing season is either autumn (A) or summer (S). Class distr.: percentage of pixels assigned to the corresponding class in every city.

We also used the Potsdam (P) and Vaihingen (V) datasets of the ISPRS labelling benchmark (Wegner et al., 2017). They consist of orthophotos, nDSMs and label maps with 6 classes as shown in table 4. P consists of images captured with a GSD of 5 cm, whereas the imagery from V has a GSD of 9 cm and does not include a blue channel. Both datasets were split into training and testing areas by the benchmark organizers; we denote the training areas by and and the test areas by and , respectively. In the experiments conducted to compare our method to (Liu et al., 2020), we further use a subset which includes areas from both, and ; cf. (Liu et al., 2020) for details. In some experiments, we used resampled versions of the datasets and , which is indicated by a subscript showing the GSD, e.g. refers to data from resampled to a GSD of 20 cm. We use bilinear interpolation for resampling the image and height data and nearest neighbour interpolation for the label maps. In order to align the class structure of and to the one of the other cities, we follow the approach of Liu et al. (2020) and ignore the class Clutter. The corresponding regions in the reference do not contribute to the training loss, and at test time they do not contribute to the evaluation procedure. Whenever Clutter is ignored, we denote the datasets by and , respectively.

Table 4 shows the class distributions as well as the overall size of each dataset, along with the capturing season as a possible reason for a domain gap. In all datasets the class Vehicle is strongly under-represented. Also, the class distributions are very different between the datasets. For example, has a larger amount of High Vegetation () than (), and has a much smaller amount of Low Vegetation () than (). The Jensen-Shannon-Divergence () (Lin, 1991) indicates that the class distributions of and are the least similar ones ((), while the distributions of and are most similar (). Each dataset was pre-processed so that each channel has a mean of zero and a standard-deviation of one. In case of the nDSMs, the heights were divided by a constant value of 30 m (instead of the standard deviation) to preserve the relative metric height information.
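For illustration, the Jensen-Shannon divergence between two class distributions can be computed as in the following sketch. The base-2 logarithm is an assumption (it bounds the divergence between 0 and 1; the base used for the values reported here is not stated in this text). The two example distributions are the last two columns of table 4, i.e. the cities H and N.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 vanish."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jensen_shannon_divergence(p, q):
    """JSD of two discrete distributions (base-2 log assumed, 0 <= JSD <= 1)."""
    total_p, total_q = sum(p), sum(q)
    p = [v / total_p for v in p]  # normalise, e.g. from percent to probabilities
    q = [v / total_q for v in q]
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Class distributions in percent (SG, BU, LV, HV, VH) taken from table 4.
hannover = [33.6, 36.7, 7.5, 20.6, 1.6]
nienburg = [22.8, 18.4, 40.3, 17.8, 0.7]
print(jensen_shannon_divergence(hannover, nienburg))
```

The divergence is symmetric and equals zero only for identical distributions, which makes it a convenient measure for comparing the class distributions of two domains.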

4.2 Experimental settings, evaluation protocol and test setup

4.2.1 Experimental settings

In all experiments, the supervised source training of the classification network is conducted using stochastic minibatch gradient descent with a learning rate of and momentum of . We use a L2-regularisation of , implemented as weight decay with a weight of . These values were adopted from (Chollet, 2017), the only difference being we do not decrease the learning rate over time but instead start with a lower learning rate, which is a common strategy when starting from a pre-trained network. Further hyper-parameters were tuned empirically using the domain by training multiple networks with different sets of hyper-parameters and selecting the parameter set achieving the highest mean F1-score on . This domain was chosen because it was also used for tuning in (Wittich, 2020), to which we compare our new method in one of the experiments. As a result, the parameter of the adaptive loss (equation 2) is set to and the number of epochs of source training to 50, where each epoch consists of 2,500 iterations. The batch-size was set to .

The hyper-parameters for DA were tuned by performing DA from S to Hm with different parameter values and choosing the ones achieving the highest mean F1-score on Hm after the adaptation. This pair of domains was chosen because it is very challenging, S having been captured in summer and Hm in autumn. The resulting weights and in equation 4 were set to and the weight of the regularization term in equation 7 was set to . The batch-size was again set to . Following (Isola et al., 2017; Soto et al., 2020), the appearance adaptation network A and the discriminator D are both optimized using the ADAM optimizer (Kingma and Ba, 2014) using and and a learning rate of . The adaptation is run for 50 epochs. After epoch 25 we start with evaluating the entropy based selection criterion (cf. section 3.4), so that the parameter set achieving the lowest average entropy of the class scores after epoch 25 defines the parameter set to be used for classification.

We apply data augmentation to images from both domains in the source training and DA procedures, following the strategy presented in (Wittich, 2020). We crop randomly rotated patches out of all available training images to build the training batches online. While the rotated patches are generated using bilinear interpolation, the corresponding label maps are generated using nearest-neighbour interpolation. Each channel of the cropped and rotated patches is multiplied by a random value drawn from a normal distribution , and a random value drawn from is added. We used in both cases; this value was determined in the parameter tuning process.

To classify an image that is larger than the input size of C, we use an inference protocol that aims at increasing the quality of the predictions (Wittich, 2020). The image is processed by C in a sliding window fashion with a horizontal and vertical overlap of . To further increase the redundancy, each window is additionally flipped in both horizontal and vertical directions, and it is also rotated by 180°; these flipped and rotated versions of the window are classified, too. In the end, all class scores that correspond to the same pixel are averaged and the class achieving the highest score is selected as the classification result.
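The averaging over flipped and rotated versions of a window can be sketched as follows. The example uses nested Python lists instead of a tensor library, omits the sliding window with overlap, and treats `classifier` as a hypothetical callable that maps a 2D grid of pixel values to a 2D grid of per-pixel class-score lists of the same size.

```python
def flip_h(grid):
    """Mirror a 2D grid left-right."""
    return [row[::-1] for row in grid]


def flip_v(grid):
    """Mirror a 2D grid top-bottom."""
    return grid[::-1]


def rot180(grid):
    """Rotate a 2D grid by 180 degrees."""
    return flip_v(flip_h(grid))


# Original window, both flips and the rotation; each transform is its own inverse.
TRANSFORMS = [lambda g: g, flip_h, flip_v, rot180]


def predict_with_flips(window, classifier):
    """Average the class scores obtained for all transformed versions of a window."""
    rows, cols = len(window), len(window[0])
    summed = None
    for transform in TRANSFORMS:
        # Classify the transformed window, then map the scores back.
        scores = transform(classifier(transform(window)))
        if summed is None:
            summed = [[list(px) for px in row] for row in scores]
        else:
            for r in range(rows):
                for c in range(cols):
                    summed[r][c] = [a + b for a, b in zip(summed[r][c], scores[r][c])]
    n = len(TRANSFORMS)
    return [[[v / n for v in px] for px in row] for row in summed]


def argmax_labels(score_map):
    """Select the class with the highest averaged score for every pixel."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in score_map]


# Toy classifier: the score of class 0 equals the pixel value.
toy = lambda g: [[[v, 1.0 - v] for v in row] for row in g]
print(argmax_labels(predict_with_flips([[0.9, 0.2], [0.4, 0.7]], toy)))
```

Because every score map is mapped back to the original orientation before averaging, position-dependent errors of the classifier are distributed over different pixels and partly cancel out.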

4.2.2 Evaluation protocol

Based on the predicted label maps, a confusion matrix is determined by comparing the predictions to the reference label maps. From the confusion matrix, the numbers of true positive ($TP_c$), false positive ($FP_c$) and false negative ($FN_c$) predictions are derived for each class $c$. For each class, we compute the intersection over union $IoU_c$ (cf. equation 1) and the F1-score $F1_c$:

$$F1_c = \frac{2 \, TP_c}{2 \, TP_c + FP_c + FN_c}$$

As global metrics, the overall accuracy , i.e. the percentage of correct class assignments, the mean intersection over union and the mean F1-score are reported, where the means are taken over all classes. In addition, we consider the positive transfer rate , where can be any of the above performance metrics. We present this metric as where is the number of different adaptation scenarios in an experiment and the number of positive transfers w.r.t. the performance metric .
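The evaluation metrics can be derived from a confusion matrix as in the following minimal sketch. The convention that rows correspond to the reference and columns to the prediction, as well as all function names, are assumptions made for this example.

```python
def class_metrics(confusion):
    """IoU and F1-score per class from a square confusion matrix.

    confusion[r][p] counts the pixels with reference class r and prediction p.
    """
    n = len(confusion)
    ious, f1s = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        ious.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    return ious, f1s


def overall_accuracy(confusion):
    """Fraction of pixels whose prediction matches the reference."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(len(confusion)))
    return correct / total


def positive_transfers(before, after):
    """Return (number of scenarios improved by the adaptation, number of scenarios)."""
    improved = sum(1 for b, a in zip(before, after) if a > b)
    return improved, len(before)


confusion = [[8, 2],
             [1, 9]]
ious, f1s = class_metrics(confusion)
print(sum(ious) / len(ious), sum(f1s) / len(f1s), overall_accuracy(confusion))
print(positive_transfers([0.5, 0.6, 0.7], [0.6, 0.55, 0.8]))  # (2, 3)
```

The mean IoU and the mean F1-score are the unweighted averages of the class-wise values, so that rare classes such as Vehicle contribute as much as frequent ones.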

4.2.3 Test setup

Using the parameter settings and training methods described in section 4.2.1, different experiments were conducted that are presented in the subsequent sections. In all these experiments, a domain corresponds to the data of one of the cities described in section 4.1. The first set of experiments solely evaluates source training and mainly provides a baseline for the other experiments. On the one hand, evaluating the classification accuracy that can be achieved when the training and test data belong to the same domain is an indication for what could be expected in the optimal case; on the other hand, the results achieved when applying a classifier trained on source domain data to the target domain without DA correspond to the worst-case scenario and indicate the domain gap for each combination of cities serving as source and target domains, respectively. These experiments are reported in section 4.3. Section 4.4 is dedicated to the evaluation of domain adaptation. As seven domains are available, there are 42 possible pairs of source and target domains, and for every combination we report the evaluation metrics achieved in the target domain after the adaptation. After that, several ablation studies are reported in section 4.5, in which we want to assess the influence of some components of our method on the classification results. Finally, in section 4.6, our method is compared to existing DA approaches from the field of RS. Details on the experimental protocols that deviate from those described in section 4.2.1 and the selection of the datasets for specific experiments are discussed in the respective sections.

4.3 Evaluation of source training

In order to evaluate the source training, we used all datasets described in section 4.1 at a GSD of . In case of and , the results are based on the datasets and , respectively, i.e. we excluded the class Clutter.

4.3.1 Training and testing on the same domain

In each of the experiments reported in this section, we trained a classifier using training data from one city and applied it to test data from the same city, using the definition of the (non-overlapping) training and test sets described in (Wittich, 2020). DA was not applied. Thus, table 5 presents the performance of classifiers trained on the training subset and evaluated on the testing subset for each city .

Table 5: scores and overall accuracies obtained on the test sets for the domains after training on the training subsets of the same domain.

As training and test data are taken from the same domain, we expect the results in table 5 to represent the optimum that can be achieved by our classification network . They are in line with the current state of the art in pixel-wise classification. Whereas the highest OA achieved for and according to the score board of the ISPRS benchmark (Wegner et al., 2017) is about 91%, the benchmark protocol excludes pixels near object boundaries from the evaluation, where wrong classification results are more likely to occur than in other areas.

4.3.2 Training and testing on different domains

For the experiments described in this section, the source classifier, trained using the entire dataset from one city (DS), was applied to all data from another city (DT). This was performed for all possible combinations of source and target domains. For all of these 42 DA scenarios, the results show how a classifier performs if it is applied to another domain, and the effectiveness of DA can be assessed by comparing the performance in DT after DA to the one reported here. Note that we did not use the split into training and test samples that was used in section 4.3.1 because we wanted the evaluation of source training and of DA in the next section to be based on a dataset that was as large as possible. Table 6 presents the mean F1 values for this cross-domain evaluation. The last column and the last row show the average scores for every source and target domain, respectively. In the bottom-right corner of the table the average score over all scenarios is presented. The result for the scenario S → Hm was excluded in the computation of all average scores because this setting was used for tuning the DA method. We do not report other quality metrics to save space; they behave very similarly to the mean F1 score.

The values in table 6 vary quite strongly between the scenarios. The average performance is worst when applying the models to H, which is probably due to the fact that the centre of Hannover is much more densely populated than the other cities. A second possible reason is the poor quality of the height information in H, where the nDSM has a patchy appearance and height changes are badly aligned with the objects; cf. example for H in figure 1. Nevertheless, the model trained on H performs rather well in the other domains; only the models trained on and perform better on average. Again this could be due to the poor quality of the height information, which makes the classifier focus on the radiometric information. A strong influence of seasonal differences between DS and DT cannot be observed. For example, the results for the setting are rather bad although the data for both domains were captured in autumn. In contrast, the results for the opposite scenario are comparatively good, which also indicates that the domain gap is not symmetrical.

S Hm B H N Avg.
S - ()
Hm -
B -
H -
N -

Table 6: Mean F1 scores obtained on DT after source training on DS (before DA). Avg.: average scores per row / column, respectively, excluding the scores for the scenario with S as source and Hm as target domain.

Representing results before DA, the metrics in table 6 correspond to the worst-case scenario that can occur when a classifier is transferred to another domain without adaptation. A direct comparison of these results to the best-case scenario described in section 4.3.1 and especially in table 5 is not possible because the numbers are based on different test sets (test set of a domain in table 5, all data from domain in table 6). Nevertheless, we assume the results in table 5 to be a good approximation of the missing values on the main diagonal in table 6. Comparing the average values of the intra-domain settings in table 5 () to the ones of the cross-domain settings in table 6 (), we conclude that there is a considerable performance drop of on average, which we attribute to the domain gap between the datasets.
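The row, column and overall averages used in the cross-domain tables can be sketched as follows. This is only an illustration under stated assumptions: the function name, the three domain labels A, B, C and all score values are hypothetical and do not correspond to any result in this paper. The sketch merely shows how the main diagonal and a designated tuning scenario are left out of every average.

```python
from statistics import mean


def cross_domain_averages(scores, excluded=()):
    """Average a cross-domain score table.

    scores:   dict mapping (source, target) -> metric value
    excluded: pairs that must not contribute to any average
              (e.g. the scenario used for hyper-parameter tuning)
    Returns (average per source, average per target, overall average).
    """
    # Drop intra-domain entries (main diagonal) and explicitly excluded pairs.
    kept = {pair: value for pair, value in scores.items()
            if pair[0] != pair[1] and pair not in excluded}
    sources = sorted({s for s, _ in kept})
    targets = sorted({t for _, t in kept})
    by_source = {s: mean(v for (src, _), v in kept.items() if src == s)
                 for s in sources}
    by_target = {t: mean(v for (_, tgt), v in kept.items() if tgt == t)
                 for t in targets}
    overall = mean(kept.values())
    return by_source, by_target, overall


# Hypothetical scores for three made-up domains (NOT values from the paper).
example_scores = {
    ("A", "B"): 60.0, ("A", "C"): 70.0,
    ("B", "A"): 50.0, ("B", "C"): 80.0,
    ("C", "A"): 40.0, ("C", "B"): 90.0,
}
by_source, by_target, overall = cross_domain_averages(
    example_scores, excluded={("A", "B")})
```

Note that excluding one scenario changes both the average of its row and the average of its column, which is why the averages of the affected source and target domains are based on one value fewer than the others.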

4.4 Evaluation of DA

Using the parameters after source training (section 4.3.2) as initialization for , the proposed DA method is used to adapt each model to all domains but the source domain. The adapted models are then evaluated in DT. Table 7 shows the mean F1 values after DA. Analogously to table 6, the last column and the last row show the average metrics by source and target domain, respectively, and the value in the bottom-right corner shows the average score over all scenarios. Again, the result for S → Hm was excluded in the computation of the averages. For an easier comparison, the improvements of the averaged metrics after DA over the scores after source training in table 6 are listed in parentheses.

S Hm B H N Avg.
S - ()
Hm -
B -
H -
N -
Table 7: Mean F1 scores obtained on DT after adapting from DS. The values in parentheses show the relative improvements compared to source training in table 6.

In all scenarios, DA results in a positive transfer, i.e. the mean F1-score is higher after DA than before DA. In all cases, the OA and the mean IoU are improved, too, and behave equivalently in a qualitative sense, but as in section 4.3, we only report the mean F1-score to save space. The improvement of the mean F1-score due to DA ranges from a minimum of for the setting H to a maximum of for H. In general, the improvement is higher when the initial performance before DA was lower. This seems understandable because improving better models is more difficult. When comparing the average improvements by domains it can be seen that the adaptation from and to H resulted in the largest average improvement. The probable reason for this is a relatively large visual difference of this domain from all the others due to the different density of urban development as well as the special appearance of the height information in H (cf. section 4.3.2). Nevertheless, even after adaptation the average performance on H is still considerably worse with respect to the mean F1-score compared to the other domains. The models that were originally trained on Hm have on average the best performance before and after adapting them to the other domains, but DA achieves the smallest improvement. We conclude that our approach results in a stable improvement and that it is higher if a model has a poorer performance in DT before DA.

In the scenarios and , involving the domains with the most dissimilar class distributions with (cf. section 4.1), an improvement of about could be achieved by DA, which is about twice the average improvement over all scenarios. However, in both cases the resulting performance after DA is lower than the one achieved in the same target domains when adapting from almost all other domains (the exception being , which has an even lower value than ). and are domains with most similar class distributions. Of all scenarios involving as target domain, the adaptation based on as source domain performs best. Vice versa, using as source domain results in the best performance of all scenarios involving as target domain. We take these results as an indication that a large difference in the class distributions does indeed lead to a worse performance in the target domain after DA. However, as DA improves the result in the respective scenarios considerably, we think that this is mainly due to a poor initial performance of the classifiers in the target domains before DA. We also believe that different class distributions are likely to go along with a different appearance of objects. For example, buildings in densely developed urban areas look different from and cover a larger area than those in suburban areas, which can lead to a bad initial performance in the target domain.

To further summarize the DA performance, the average performance metrics for the cross-domain scenarios before and after DA are presented in table 8, again not considering the results for S → Hm. Again we can compare the average value achieved after DA in table 7, which is , to the one achieved in the intra-domain setting from table 5, which is . DA could reduce the average performance gap from to . Although this is still a rather large difference, DA could nevertheless compensate for about one third of the original performance gap.

Metric Before DA After DA Improvement
OA 77.7

Table 8: Average global performance metrics before and after DA.
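For reference, the three global metrics discussed here (OA, mean F1-score and mean IoU) can all be derived from a confusion matrix. The following sketch uses the standard textbook definitions; it is an assumption that these coincide exactly with the definitions used for the tables, as they are not restated in this section. The function name and the 2x2 example matrix are purely illustrative.

```python
def global_metrics(conf):
    """Compute (OA, mean F1, mean IoU) from a square confusion matrix.

    conf[i][j] is the number of pixels with reference class i that were
    predicted as class j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    f1_scores, ious = [], []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                        # reference c, predicted otherwise
        fp = sum(conf[r][c] for r in range(n)) - tp   # predicted c, reference otherwise
        f1_denominator = 2 * tp + fp + fn
        union = tp + fp + fn
        # Classes that never occur get a score of 0 in this sketch.
        f1_scores.append(2 * tp / f1_denominator if f1_denominator else 0.0)
        ious.append(tp / union if union else 0.0)
    overall_accuracy = sum(conf[c][c] for c in range(n)) / total
    return overall_accuracy, sum(f1_scores) / n, sum(ious) / n


# Illustrative two-class confusion matrix (made-up pixel counts).
example_conf = [[8, 2],
                [1, 9]]
oa, mean_f1, mean_iou = global_metrics(example_conf)
```

Because OA is dominated by frequent classes whereas the mean F1-score and the mean IoU weight all classes equally, the three metrics need not change by the same amount after DA, even if they usually move in the same direction.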

Figures 2 and 3 show some examples of appearance adaptation and the results of the classifier in . Note that the shown reference labels in were used neither for training nor for DA. Having a look at the transformed images and nDSMs as well as exemplary samples from , it would seem that the style was adapted quite well. The examples 1) and 2) in figure 2, involving the domain H, should be highlighted. The nDSM is of relatively low quality in that domain, but nevertheless the figure indicates that in both cases, in which serves as target and source domain, respectively, style transfer works quite well for the height data. A nice adaptation of the spectral channels can be seen in example 5) in figure 3. Whereas the image from is very bright and shows strong contrast, its transformed version is darker and has a lower contrast, corresponding to the appearance of the data in . Some trees look rather greyish, which matches the appearance of the exemplary image from .

Figure 2: Qualitative examples of DA for three scenarios. Panels: 1) Hm → H, 2) H → Hm, 3) N. Columns from left to right: predicted (upper) and reference (lower) label map for , image / nDSM from , transformed image / nDSM , example image / nDSM , predicted (upper) and reference (lower) label map for . Colour-codes: SG (grey), BU (red), HV (green), LV (yellow), VH (blue), CL (magenta). For the abbreviations of classes, cf. table 4.

Figure 3: More qualitative examples of the adaptation. Panels: 4) S → B, 5) B, 6) N. Columns and rows as in figure 2. Green arrows in 5) indicate an example for a problematic region.

The predicted source domain labels shown in figures 2 and 3 are nearly error-free, which is to be expected because these images were used directly for training. There are more errors in the predictions for the images from DT, where no reference labels were available for training and DA. The majority of these errors are related to a confusion of high and low vegetation. These classes are actually hard to separate and the evaluation is highly influenced by the labelling policy, which is the reason why these problems of DA are not completely unexpected. In example 3) (figure 2) the hedge was not detected as High Vegetation in DT, probably because no similar structure exists in DS. In examples 4) and 5) (figure 3), the large buildings were predicted rather well, but the small Building instances are problematic. While in 4) their outlines are very imprecise, in 5) the small building in the centre of the patch was not predicted at all. In most examples the prediction of narrow paths belonging to Impervious Surface is also problematic in DT. Such a case is highlighted by the green arrows in example 5) (figure 3). However, such structures are often difficult to differentiate even for a human observer, particularly if the path is in a shaded area or if it is partially covered with dirt.

4.5 Ablation studies

4.5.1 Source training: adaptive loss

In the first ablation study, we evaluate if the adaptive cross-entropy loss (ACE) proposed in section 3.3.1 can help to improve the prediction quality of under-represented classes based on the original Vaihingen dataset with the original class-structure. We compare the ACE loss to the regular cross-entropy loss (CE) and the multi-class focal loss (FCL). The models are trained according to the protocol described in section 4.2 on the training area . Each experiment is repeated three times, each time starting from a different random initialization of the layers that are not pre-trained. Table 9 shows the means and standard-deviations of the achieved metrics using the provided reference label maps of . For comparison to the leader-board of the ISPRS labelling benchmark, the version based on the ACE-loss is also evaluated on the eroded reference where the pixels close to the object boundaries are not considered (table 9, last row).
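To make the two baseline losses in this comparison concrete, the following sketch contrasts the regular cross-entropy with the multi-class focal loss for a single pixel. It uses the common form of the focal loss, -(1 - p_t)^gamma * log(p_t), with gamma = 2 as a frequently used default; whether the FCL variant evaluated here uses exactly this parameterisation is an assumption. The ACE loss is defined in section 3.3.1 and is deliberately not reproduced here. Function names and probability values are illustrative only.

```python
import math


def cross_entropy(probs, label):
    """Regular cross-entropy for one pixel.

    probs: predicted class probabilities (summing to 1)
    label: index of the reference class
    """
    return -math.log(probs[label])


def focal_loss(probs, label, gamma=2.0):
    """Multi-class focal loss for one pixel (common formulation).

    The factor (1 - p_t)**gamma down-weights pixels that are already
    classified with high confidence; gamma = 0 recovers cross-entropy.
    """
    p_t = probs[label]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Illustrative predictions for a three-class problem (made-up values).
confident = [0.7, 0.2, 0.1]   # reference class 0 is predicted with p = 0.7
uncertain = [0.2, 0.5, 0.3]   # reference class 0 is predicted with p = 0.2

ce_confident = cross_entropy(confident, 0)
fl_confident = focal_loss(confident, 0)
ce_uncertain = cross_entropy(uncertain, 0)
fl_uncertain = focal_loss(uncertain, 0)
```

In this formulation the loss of the confident pixel is reduced by a factor of (1 - 0.7)^2 = 0.09, whereas the loss of the uncertain pixel is only reduced by (1 - 0.2)^2 = 0.64, so that training concentrates on hard pixels rather than on rare classes as such.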

class-wise F1-scores
Full reference
Eroded reference
Table 9: Results on the Vaihingen dataset from the ISPRS semantic labelling benchmark. Scores show mean and standard-deviation over three runs. Best results when comparing to the full reference label maps are printed in bold font. Last row: metrics achieved when using the eroded reference for evaluation.

The table indicates that the best performance is shown by the model based on the ACE loss. It not only achieves the best OA and , but also the highest class-wise F1-scores for five out of six classes; for the sixth class the difference to the best model is below 0.3%. In general, the variation in OA is very low: the difference between the best model and the one achieving the worst OA, the model trained from scratch using the focal loss, is smaller than 0.5%. The largest impact of the loss can be observed in the F1-score of the under-represented classes Vehicle and Clutter. The proposed ACE loss achieves the highest metrics for these classes; in case of Clutter, the improvement is about 5%. Compared to the regular CE loss, the focal loss resulted in a reduced performance for Vehicle and had no impact on the F1-score of Clutter. The metrics achieved on the eroded reference are comparable to the best-performing results in the leader-board of the benchmark. We want to point out that according to the leader-board there is only one listed approach that achieves higher values for the F1-score of the class Vehicle, unfortunately without a publication. The variant using the proposed ACE loss leads to a value of . This is comparable to the best result in (Yang et al., 2020a) on this benchmark (), where cosine similarity and the focal loss were used to address class imbalance.

In order to analyse the performance during training and the convergence behaviour, the class-wise F1-scores of two classes on the test set were tracked during training (cf. figure 4). The scores are based on the non-eroded reference, and no horizontal and vertical flipping was used during inference, which is the reason why the results from the last epoch are slightly worse than those reported in table 9. As far as the under-represented class Vehicle is concerned, figure 4 shows that the ACE loss outperforms the other losses and the focal loss performs worst by a relatively large margin. For the frequent class Sealed Ground there is no clear difference between using CE and ACE, but both also outperform the focal loss. To summarize, the proposed ACE loss mitigates the problems of imbalanced class distributions to some degree compared to the other loss functions.

Figure 4: Mean of class-wise F1-scores for the frequent class Sealed Ground (SG) and the under-represented class Vehicle (VH), averaged over three runs. The shaded areas correspond to the 95% confidence intervals.

4.5.2 Influence of the regularization term for the discriminator

In this section we analyse the influence of the hyper-parameter that controls the influence of the regularization loss of the discriminator (cf. equation 7) on the performance of DA. To that end, the adaptation from S to Hm, which was the scenario used to tune the hyper-parameters, is performed several times using different values for in the range between , which implies the regularization loss is not considered, to . The experiment is repeated three times for each value. In table 10 the means and the standard deviations for several quality metrics after adaptation are given for each tested value of .

Table 10: Influence of the hyper-parameter on the adaptation scenario . Best results are printed in bold font. corresponds to training without the regularization loss. ST: source training without DA. , : number of positive transfers with respect to and OA, respectively.

Among all tested values for , DA performs best for , where a positive transfer with respect to and is observed in all three test runs. The latter statement also applies to , but in this case, the resulting scores are worse. Increasing beyond 4 as well as decreasing it results in a worse DA performance. Disabling the auxiliary loss by setting resulted in a positive transfer only once with respect to all of the listed global metrics. Although the improvement in OA and due to is very small (1.2% and 1.7%, respectively), we conclude that this regularization loss has a stabilizing influence on DA and helps to avoid negative transfer if the hyper-parameter is tuned properly.
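The per-setting summary used in this ablation (mean, standard deviation and number of positive transfers over repeated runs) can be sketched as follows. All numbers are hypothetical and only serve to illustrate the bookkeeping; in particular, the use of the sample standard deviation is an assumption, as the estimator is not specified in this section.

```python
from statistics import mean, stdev


def summarise_runs(baseline, runs):
    """Summarise repeated DA runs for one hyper-parameter value.

    baseline: metric after source training without DA (ST)
    runs:     metric values after DA, one per repetition
    Returns (mean, sample standard deviation, number of positive transfers),
    where a positive transfer is a run that exceeds the baseline.
    """
    positive = sum(1 for score in runs if score > baseline)
    return mean(runs), stdev(runs), positive


# Hypothetical values for three repetitions (NOT results from the paper).
example_baseline = 70.0
example_runs = [71.5, 69.0, 72.0]
run_mean, run_std, n_positive = summarise_runs(example_baseline, example_runs)
```

With only three repetitions per setting the standard deviation is a rough indicator at best, which is why the number of positive transfers is a useful complementary measure of stability.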

Figure 5 shows a visual example of the appearance adaptation from to with and without the regularization loss. This is a difficult scenario because trees occur more frequently in and they also have a different appearance due to seasonal effects. The model which was trained using delivers reasonable appearance adaptation results, whereas the model which was trained without the proposed regularization leads to artifacts in the transformed image and nDSM; some of them are highlighted by green arrows in figure 5. Without regularization the model hallucinates structures with a large reflectance in the infrared band, most likely due to the higher occurrence of such regions in the target domain.

Figure 5: Visual comparison for the appearance adaptation from to when using (centre left) and (centre right). Green arrows: hallucinated structures in the transformed data. Right column: example from .

4.5.3 Evaluation of the parameter selection criterion

The entropy-based criterion for parameter selection proposed in section 3.4 is based on the assumption that parameter sets which achieve a lower mean entropy in DT also have a better performance in that domain. To validate this assumption, the mean F1 values were tracked for epochs 26-50 in the DA experiments described in section 4.4, so that the mean F1-score can be analysed as a function of the mean entropy in DT. The results are shown in figure 6 for DA scenarios involving four domains (two autumn domains - , - and two summer domains - , ). Every point corresponds to one epoch in DA training and, thus, to a set of model parameter values; parameter sets which resulted in a positive transfer are shown in blue, those that yielded a negative transfer in red. Furthermore, three parameter sets are highlighted: the set after the last epoch (red cross), the one resulting in the highest mean F1-score (star), and the set achieving the minimum entropy, i.e. the set selected by our method (large green dot).
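The selection rule evaluated here can be illustrated by a minimal sketch: the mean entropy of the predicted class probabilities is computed over the unlabelled target-domain pixels after every epoch, and the parameter set with the lowest value is kept. The exact formulation is given in section 3.4; the function names, the use of the natural logarithm and the restriction of the candidates to epochs 26-50 are assumptions made for this illustration, and the entropy values in the example are made up.

```python
import math


def pixel_entropy(probs):
    """Shannon entropy of one predicted class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(prob_vectors):
    """Mean entropy over all pixels of the (unlabelled) target domain.

    prob_vectors: sequence of per-pixel class-probability vectors
    """
    values = [pixel_entropy(p) for p in prob_vectors]
    return sum(values) / len(values)


def select_epoch(entropy_by_epoch, first=26, last=50):
    """Return the epoch with minimum mean entropy among the candidates.

    entropy_by_epoch: dict mapping epoch number -> mean target entropy
    Only epochs in [first, last] are considered (assumed candidate range).
    """
    candidates = {epoch: h for epoch, h in entropy_by_epoch.items()
                  if first <= epoch <= last}
    return min(candidates, key=candidates.get)


# Made-up mean entropies; epoch 25 lies outside the candidate range.
example_entropies = {25: 0.10, 26: 0.40, 27: 0.25, 28: 0.30}
selected = select_epoch(example_entropies)
```

The criterion requires no labels in the target domain, because the entropy only depends on the predicted probabilities; this is what allows it to replace the monitoring of a validation set.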

Figure 6: Visualisation of the mean F1 scores achieved in as a function of the mean entropy in the DA epochs 26-50 for several DA scenarios. Each point corresponds to one DA epoch. The heading of each graph encodes the scenario ( to ). Other DA scenarios resulted in similar graphs, but are omitted to save space.

Figure 6 shows that in many scenarios the performance is higher for parameter sets that also achieved a low entropy in . Whereas the parameters chosen according to our selection criterion, i.e. those with the lowest entropy (green dots), correspond to a positive transfer in all 42 DA scenarios, the parameters after the last epoch lead to a negative transfer in three cases. In 26 out of the 42 scenarios the parameter sets with minimum entropy resulted in a better performance than those obtained after the last training epoch, in 11 the results were slightly worse and in 5 cases they were almost identical. On average, using the model from the last training epoch results in a value of , while using the proposed criterion results in . This is rather close to the average value of the best models (). Particularly in difficult scenarios such as or , where many parameter sets resulted in a negative transfer, the set selected according to our entropy-based criterion achieved a positive transfer. Furthermore, the improvement was quite large in a few cases. For example, in and the improvement is around w.r.t . The largest difference can be observed in the scenario where the last model () has a much lower mean F1-score than the model which is selected according to the proposed criterion (). In the worst-case scenario the last model is only better than the selected one. We take this as an indication that our selection criterion is better than picking the last model. It must be noted that the overall maximum number of 50 epochs was chosen empirically, and there might be another fixed number of epochs that would achieve better results, but this would require a rather difficult tuning procedure based on multiple scenarios. We thus conclude that, whereas our selection criterion might not result in the best choice of parameters in all scenarios, it achieves a consistently good choice. 
Although the improvement in is relatively small on average (), the selection criterion definitely helps to avoid negative transfer, achieving a positive transfer in all investigated scenarios, and in some cases it can lead to a significant improvement.

4.6 Comparison to the state of the art

4.6.1 Comparison to instance transfer

In this section we compare the proposed method to the variant of instance transfer described in (Wittich, 2020). For this comparison we use the same inference protocol and the same datasets with the same split into training and test sets that were also used in that publication. In the initial cross-domain evaluation before DA, models trained using the training data from are evaluated on the test sets of all other domains. All models are then adapted to the other domains, using the labelled training samples from and all unlabelled samples from , and the results are evaluated using the test samples in . The results of the classifier trained on the training sets and applied to the test sets of the same domains