Domain-Adversarial Training of Self-Attention Based Networks for Land Cover Classification using Multi-temporal Sentinel-2 Satellite Imagery

04/01/2021 ∙ by Martini Mauro, et al. ∙ Politecnico di Torino

The increasing availability of large-scale labeled remote sensing data has prompted researchers to develop increasingly precise and accurate data-driven models for land cover and crop classification (LC&CC). Moreover, with the introduction of self-attention and introspection mechanisms, deep learning approaches have shown promising results in processing long temporal sequences in the multi-spectral domain at a limited computational cost. Nevertheless, most practical applications cannot rely on labeled data, and field surveys are a time-consuming solution that strictly limits the number of collected samples. Moreover, atmospheric conditions and the specific characteristics of a geographical region constitute a relevant domain gap, which prevents a model trained on the available dataset from being applied directly to the area of interest. In this paper, we investigate adversarial training of deep neural networks to bridge the domain discrepancy between distinct geographical zones. In particular, we perform a thorough analysis of domain adaptation applied to challenging multi-spectral, multi-temporal data, highlighting the advantages of adapting state-of-the-art self-attention based models for LC&CC to different target zones where labeled data are not available. Extensive experimentation demonstrates significant performance and generalization gains when applying domain-adversarial training to source and target regions with marked dissimilarities between the distributions of extracted features.




1 Introduction

In the past few decades, the launch of many satellite missions with short revisit times and comparatively high-resolution sensors has produced an extensive repository of remote sensing images. The availability of open data from many Earth observation satellites has made remote sensing easily accessible [rudd2017application]. Data sets are available free of charge from several satellite missions, such as Sentinel-2 and Landsat [novelli2016performance]. These satellites are equipped with multi-spectral sensors with short revisit times and good spatial and spectral resolution, allowing researchers to test modern image analysis techniques and extract more detailed information about the target object. It is now possible to monitor dynamic processes on Earth [de2011analysis, pacifici2014importance]. It has also become easier to estimate and classify biophysical parameters using several data sources [amoros2013multitemporal, li2015airborne, rembold2013using]. Overall, this new scenario makes it possible to perform detailed land cover monitoring, change detection, image mosaicking, and large-scale processing using multi-temporal and multi-source images [rudd2017application, khaliq2019refining, gomez2016optical, khaliq2018analyzing].

One of the most essential remote sensing applications is land cover and crop classification (LC&CC). It assigns labels to cover types such as forest, ocean, and agricultural land. Mapping can also be done manually using satellite images, but the process is tedious, costly, and time-consuming. A detailed global land cover map is not yet available, although the Corine Land Cover (CLC) map [buttner2004corine] provides land cover information at a resolution of 100 m per pixel. However, this map only covers the European area and is updated once every six years. There are several ways to perform land cover classification automatically. In general, classification involves creating a training dataset from samples annotated with the corresponding class labels, training a model on that dataset, and evaluating the resulting predictions. The number and quality of training samples play a pivotal role in the performance of the trained model. From a remote sensing perspective, training sample collection requires a ground survey or visual photo-interpretation by an expert [tuia2016domain]. Ground surveying requires GIS expertise and human resources, which is typically not economical, while visual interpretation is not suitable for some applications, such as estimating chlorophyll concentration [verrelst2012gaussian] or classifying tree species [ballanti2016tree].

In land use and land cover (LULC) mapping, pixel-level segmentation is more challenging than classification. Conventional machine learning tools have recently been found to perform poorly in semantic segmentation, including LULC. However, it was shown [penatti2015deep, mazzia2020improvement] that Convolutional Neural Networks (CNNs) outperform traditional land cover classification procedures. In the land segmentation track of the DeepGlobe challenge [demir2018deepglobe], Deep Neural Networks (DNNs) completely dominate the leaderboards. Notable examples of land cover classification with DNNs are based on ResNet and DenseNet [tian2018dense, kuo2018deep].

Since land cover differs between locations, a model trained on one area cannot be directly deployed on other areas. In addition, imagery from different satellites is not the same, owing to differences in resolution, capture time, and other radiometric parameters. Because of these changing variables, a dataset acquired by one satellite over one region and a dataset acquired by another satellite over the same or a different region exhibit a domain shift. One way to achieve a reliable outcome is to train a model on a huge number of training samples so that it generalizes to all classes in all regions. However, that requires an enormous labeled dataset, which is time- and labor-intensive to build.

Another way to deal with the shift between datasets is Domain Adaptation (DA). The distribution shift between the target and source datasets is mainly due to temporal differences in the acquisition, differences in the acquisition sensors, and geographical differences. The domain shift degrades the performance of a model trained on a source dataset and applied to a target dataset. Domain adaptation methods often rely on learning domain-invariant models that maintain comparable performance on the two datasets. Existing domain adaptation techniques may be classified as supervised, semi-supervised, and unsupervised. Supervised DA methods presume that labeled data are available for both domains. Semi-supervised methods assume that only a small amount of labeled data is available for the target domain, while unsupervised methods rely on labeled data for the source domain only. The authors of [tuia2016domain] divide domain adaptation methodologies into four categories: domain-invariant feature selection, adapting data distributions, adapting classifiers, and adaptive classifiers using active learning methods.

This work investigates adversarial training of deep neural networks to bridge the domain discrepancy between distinct geographical zones. In particular, we perform a thorough analysis of domain adaptation applied to challenging multi-spectral, multi-temporal data, highlighting the advantages of adapting state-of-the-art self-attention based models for LC&CC to different target zones where labeled data are not available.

This article is organized as follows. Section 2 covers the related work on LC&CC and on domain adaptation techniques, including DANN. Section 3 describes the dataset. A detailed description of the proposed method is presented in Section 4. The experimental setup, the results, and the related discussion are reported in Section 5. Finally, Section 6 draws some conclusions and outlines future directions.

2 Related Work

2.1 Land Cover and Crop Classification

LC&CC has been the subject of many studies in the past. A widely used classification method relies on time series of vegetation indices (VI), derived from remotely sensed imagery, to extract temporal features and phenological metrics. Thresholds and simple statistical techniques are also used to calculate the time of peak VI, the maximum VI, and other vegetation-related metrics [walker2014dryland, arvor2011classification]. Older image classification methods, illustrated in [moranduzzo2013automatic, hao2015feature], use handcrafted features for image representation and train classifiers such as support vector machines and random forests. With massive datasets and improved computing hardware now available, ML methods can instead learn how to extract features directly from the data. Random Forest (RF) based classifiers are another common approach for remote sensing applications [hao2015feature], though it should be noted that multiple features need to be derived and fed to the RF classifier for a more effective output.

One of the newest and most powerful concepts integrated into mapping is a branch of machine learning known as Deep Learning (DL). DL can be used to solve a wide range of problems in signal processing, computer vision, image processing, and natural language processing. DL has contributed significantly to remote sensing image classification thanks to its ability to represent features and its capacity for automated end-to-end learning. DL methods use autoencoders, which extract features automatically without resorting to dedicated feature extraction algorithms [hubel1962receptive, szeliski2010computer]. In the remote sensing field, object detection and image segmentation have been performed extensively using two-dimensional CNNs [krizhevsky2012imagenet, zeiler2014visualizing] to extract spatial features from high-resolution images. 2D CNNs proved better than 1D CNNs in crop classification [kussul2017deep].

2.2 Domain Adaptation

Domain adaptation aims to reduce the domain shift between source and target datasets. According to [wang2018deep, ma2019super, bengana2020improving], there are three possible approaches. The first approach consists of reducing the difference between the target and source data in the feature space. For this purpose, the maximum mean discrepancy (MMD) is often used as a cost function to minimize the distance, or to check that feature extraction is consistent in both source and target domains [bengana2020improving]. The second approach utilizes Generative Adversarial Networks (GANs) [goodfellow2014generative] for adversarial domain adaptation. The purpose of the GAN is to make the spectral characteristics of the source and target datasets similar. An example is shown in [tzeng2017adversarial], where the target dataset is translated to the source dataset using GANs. The translation relies on a discriminator that distinguishes the two datasets. The last approach creates a shared representation of both domains. In this method, one domain can be translated to another, and both domains can be translated into a common space. The method also provides a transfer function that translates one domain to another and back to the original state. CycleGAN implements this third approach and involves two discriminators that are used to translate one domain to another and vice versa [zhu2017unpaired].

DA methods are particularly useful for semantic segmentation, where pixel-level annotations are scarce. General domain adaptation methods do not transfer well to semantic segmentation [zhang2017curriculum]; thus, adversarial and reconstruction-based procedures are preferred. Adversarial and constraint-based adaptations are performed at pixel level using architectures that exploit adversarial domain adaptation with GANs to generate source-like images [hoffman2016fcns]. The images are then segmented using a network trained on the source dataset. In [chang2019all], the Domain Invariant Structure Extraction (DISE) framework was adopted to transform images into domain-invariant structure and domain-specific texture representations. The bidirectional method prevents the translation model from reaching a point where the discriminator can no longer identify images from the same distribution and the alignment fails [li2019bidirectional].

3 Study Area and Data

Figure 1: Magnified view of the four NUTS-3 regions of Brittany, located in the northwest of France and covering 27,200 km². The strict division of the supervised BreizhCrops dataset into the four regions allows a formal and controlled analysis of domain adaptation for LC&CC with multi-spectral and multi-temporal data.
Figure 2: Class frequencies in the four NUTS-3 regions of Brittany. The respective number of parcels highlights the strong class imbalance, reflecting the substantial imbalance in real-world crop-type-mapping datasets.

        Barley  Wheat  Rapeseed   Corn  Sunflower  Orchards  Nuts  Perm. meadows  Temp. meadows
Zone 1   13051  30380      5596  44003          1       937    10          32641          52013
Zone 2   10736  15026      2349  36620          6       348    18          36536          39143
Zone 3    7154  27202      3557  42011         10      1217    10          32524          52682
Zone 4    5981  17009      3244  31361          2       552    11          26134          38141

Table 1: Number of samples per class in the four NUTS-3 regions of Brittany. Instances are derived from L2A bottom-of-atmosphere parcels in order to disentangle our analysis from variations in atmospheric conditions.

To promote reproducibility of our experimentation, we rely on BreizhCrops, a large-scale time series benchmark dataset introduced in 2020 by Rußwurm et al. [russwurm2019breizhcrops] for supervised classification of field crops from satellite data. The dataset comprises multivariate time series examples from the region of Brittany, France, covering the 2017 season from January 1 to December 31. In particular, the authors of the dataset exploited all available Sentinel-2 images from Google Earth Engine [gorelick2017google] and farmer surveys collected by the French National Institute of Forest and Geography Information (IGN) to gather more than 600k samples divided into 9 classes, with 45 temporal steps and 13 spectral bands. Most importantly, as shown in Fig. 1, the acquired data are split into distinct regional areas. Indeed, as regulated by the Nomenclature des unités territoriales statistiques (NUTS), the overall dataset is divided into the four NUTS-3 regions Côtes-d’Armor, Finistère, Ille-et-Vilaine, and Morbihan. That, in conjunction with the challenging nature of the dataset, makes BreizhCrops an ideal benchmark to test domain adaptation for multi-spectral and multi-temporal data for LC&CC.

As summarized in Fig. 2, a class imbalance can be observed in the collected parcels, even though the authors of the dataset avoided broad categories such as diverse or fodder crops. The imbalance stems from the nature of agricultural production, which focuses on a few dominant crop types. It constitutes a challenge for every type of classifier, but it reflects the strong imbalance in real-world crop-type-mapping datasets. Finally, to disentangle the domain adaptation analysis from the influence of random variations in atmospheric conditions, we exclusively make use of L2A bottom-of-atmosphere parcels, where data acquired over time and space share the same reflectance scale. Adjacency and slope effects are corrected by the MAJA processing chain [hagolle2015multi], which employs the 60-meter spectral bands to apply atmospheric correction and detect clouds. As a result, only ten spectral features are available for each parcel. Tab. 1 summarizes the number of samples collected for the domain adaptation experimentation, divided by class and region.

4 Methodology

In this work, unsupervised domain adaptation is considered to tackle knowledge transfer for land cover classification from satellite images. In particular, the proposed methodology is intended to investigate the application of representation learning (RL) techniques for domain adaptation when dealing with multi-temporal data. For this purpose, a Transformer Encoder-based classifier is adapted to a Domain-Adversarial Neural Network (DANN) architecture and trained accordingly.

In this section, a thorough description of the methodology is provided. First, we frame domain adaptation with the DANN method. Then, we briefly explain the Transformer Encoder structure with self-attention adopted for the multi-temporal crops classification. Finally, we describe the resulting architecture of the attention-based DANN, which is used to train a classifier with improved domain generalization.

4.1 Domain-Adversarial Neural Networks

Classifiers obtained with Deep Neural Networks often suffer from a lack of generalization related to possible variations in the appearance of the same objects. This problem is usually identified as a domain gap. In the land cover classification task, this situation is very common and can be associated with the spectral shift affecting data collected in different regions at different times. The shift is often related to photogrammetric distortion or visual differences in the appearance of the land. Furthermore, when dealing with satellite images, a dataset usually needs to be created by labeling images of a specific region in order to train a classification model. Despite this time-consuming procedure, standard training does not guarantee satisfactory performance on images of different regions.

Domain-Adversarial Neural Networks (DANN) is a representation learning technique that allows a classifier to generalize better from a source domain to a target domain. This domain adaptation method consists of adding a branch to the original feed-forward architecture of the classifier and carrying out adversarial training. From a generic perspective, it is possible to identify three main components of the DANN: a feature extractor $G_f$ with parameters $\theta_f$, a label predictor $G_y$ with parameters $\theta_y$, and a domain classifier $G_d$ with parameters $\theta_d$. The feature extractor is the first block of the DANN model. It is responsible for learning the function $G_f(\cdot; \theta_f)$, which maps the input samples $x$ to a $d$-dimensional vector containing the extracted features. The label predictor function, $G_y(\cdot; \theta_y)$, computes the label associated with the predicted class of the sample. The domain discriminator function $G_d(\cdot; \theta_d)$ distinguishes between source and target domains given the extracted features. The combination of feature extractor and label predictor gives the complete classifier model. The domain classifier is a secondary branch, similar to the label predictor, which receives the feature vector extracted by the first block of the network.

Given these three main elements, the total loss used to train DANN is given by the following expression, according to the authors [ganin2016domain]:

$$\mathcal{L}(\theta_f, \theta_y, \theta_d) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_y^i(\theta_f, \theta_y) - \lambda \left( \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_d^i(\theta_f, \theta_d) + \frac{1}{n'} \sum_{i=n+1}^{n+n'} \mathcal{L}_d^i(\theta_f, \theta_d) \right)$$

The first term is the label predictor loss, while the second one involves the domain discriminator loss. The hyper-parameter $\lambda$ can be tuned to weigh the contribution of the two learning terms. A more detailed analysis of the choice of $\lambda$ is proposed in the experiments section. $n$ and $n'$ are respectively the numbers of samples from the source and the target domains. The expression of the total loss function also describes the principal goals of DANN: first of all, we want to obtain a label predictor with low classification risk. Second, we add a regularization term for the domain adaptation. To this end, we aim to find a set of parameters $\theta_f$ of the feature extractor that can map a generic input sample from either the source or the target domain to a new latent space of features where the domain gap is reduced. At the same time, the classification performance must not be affected. For this reason, the extracted features should be discriminative as well as domain invariant. According to this goal, the optimal choice of the parameters $\theta_f$ and $\theta_y$ is the one that minimizes the total loss function while keeping $\theta_d$ unchanged. Conversely, the domain discriminator parameters $\theta_d$ are updated to maximize the loss while not changing the other ones:

$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} \mathcal{L}(\theta_f, \theta_y, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} \mathcal{L}(\hat{\theta}_f, \hat{\theta}_y, \theta_d)$$
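As a rough illustration of how the two terms of the total loss combine, the following Python sketch evaluates the DANN objective of [ganin2016domain] from per-sample losses. The function name `dann_total_loss` and its argument names are hypothetical and do not come from the paper; the per-sample losses are assumed to be already computed, for instance as cross-entropy values.

```python
def dann_total_loss(source_label_losses, source_domain_losses,
                    target_domain_losses, lam):
    """Label loss on source samples minus lam times the domain loss.

    source_label_losses:  per-sample label losses on the n source samples
    source_domain_losses: per-sample domain losses on the same n source samples
    target_domain_losses: per-sample domain losses on the n' target samples
    lam:                  weight of the domain term (hypothetical argument name)
    """
    n = len(source_label_losses)
    n_target = len(target_domain_losses)
    if len(source_domain_losses) != n:
        raise ValueError("one domain loss is expected per source sample")

    label_term = sum(source_label_losses) / n
    domain_term = (sum(source_domain_losses) / n
                   + sum(target_domain_losses) / n_target)
    # The feature extractor and label predictor minimize this value,
    # while the domain discriminator maximizes it.
    return label_term - lam * domain_term


if __name__ == "__main__":
    print(dann_total_loss([1.0, 3.0], [0.5, 0.5], [1.0], lam=0.5))
```

Note that target samples contribute only to the domain term, which is why no target labels are needed.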


In the original DANN paper, the parameters of each part of the neural network model are updated with an SGD optimizer. Here we instead use Adam [kingma2014adam], and the parameters $\theta_f$, $\theta_y$ and $\theta_d$ are updated according to its rules.

The first and second moments of Adam, $m$ and $v$, are computed from the gradients of the specific DANN element. For example, the feature extractor gradients $\partial \mathcal{L}_y / \partial \theta_f$ and $\partial \mathcal{L}_d / \partial \theta_f$ are used to compute $m_f$ and $v_f$. In contrast, the gradients obtained from the label predictor and the domain discriminator are only used to update their respective moments, $(m_y, v_y)$ and $(m_d, v_d)$.
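For reference, one plausible way to write these updates combines the standard Adam rules of [kingma2014adam] with the DANN gradients of [ganin2016domain]. This is a sketch under stated assumptions rather than the exact expression used by the authors: the step counter $s$, the learning rate $\eta$, the index $k \in \{f, y, d\}$, and the placement of $\lambda$ in $g_d$ are notational choices made here.

```latex
% Gradients seen by each block (the minus sign comes from the gradient reversal)
g_f = \frac{\partial \mathcal{L}_y}{\partial \theta_f}
      - \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_f}, \qquad
g_y = \frac{\partial \mathcal{L}_y}{\partial \theta_y}, \qquad
g_d = \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_d}

% Standard Adam moment estimates and update, applied separately for k in {f, y, d}
m_k \leftarrow \beta_1 m_k + (1 - \beta_1)\, g_k, \qquad
v_k \leftarrow \beta_2 v_k + (1 - \beta_2)\, g_k^2

\hat{m}_k = \frac{m_k}{1 - \beta_1^{s}}, \qquad
\hat{v}_k = \frac{v_k}{1 - \beta_2^{s}}, \qquad
\theta_k \leftarrow \theta_k - \eta\, \frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon}
```

Because Adam normalizes each step by $\sqrt{\hat{v}_k}$, a constant scale factor on $g_d$ has little effect on the discriminator update, whereas the relative weight $\lambda$ inside $g_f$ does change the direction of the feature extractor update.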

The feature extractor and the domain discriminator play adversarial roles during the training process. A good feature extractor can fool the domain discriminator by forwarding a vector of domain-invariant features. The role of the domain discriminator is to improve and evaluate this ability. A key intuition in the DANN method is to carry out the adversarial training with standard back-propagation of the gradients, thanks to a custom Gradient Reversal Layer (GRL) between the feature extractor and the domain discriminator. This layer does not add other parameters to the model but changes the sign of the upstream gradients. The GRL operation $R(x)$ can be formulated with the following expressions for the forward and back-propagation steps:

$$R(x) = x, \qquad \frac{dR}{dx} = -I$$

where $I$ is the identity matrix. Hence, by performing optimization steps on the resulting DANN architecture, we can update the parameters to reach saddle points of the total loss function reported above.
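The behaviour of the Gradient Reversal Layer can be sketched without any deep learning framework. The class below is a hypothetical, framework-free illustration (the name `GradientReversal` and the explicit `forward`/`backward` methods are assumptions made here): the forward pass is the identity, and the backward pass flips the sign of the upstream gradient. The optional scale factor `lam` follows the scaling described in Section 4.3; with `lam=1.0` the layer reduces to a pure sign flip.

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, lam=1.0):
        # lam scales the reversed gradient (assumed placement of the
        # domain adaptation parameter; lam=1.0 gives a pure sign flip).
        self.lam = lam

    def forward(self, features):
        # The features reach the domain discriminator unchanged.
        return list(features)

    def backward(self, upstream_grad):
        # The gradient flowing back to the feature extractor is reversed.
        return [-self.lam * g for g in upstream_grad]


if __name__ == "__main__":
    grl = GradientReversal(lam=0.5)
    print(grl.forward([1.0, -2.0]))
    print(grl.backward([0.2, -0.4]))
```

In an automatic differentiation framework the same idea is usually expressed as a custom operation with a user-defined backward rule, so that a single back-propagation pass trains both adversaries.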


4.2 Classification of Multi-Spectral Time Series Data with Self-Attention

Self-attention, popularized by the Transformer model in 2017 [vaswani2017attention], has provided a considerable boost in machine translation performance while being more parallelizable and requiring significantly less time to train. Nevertheless, the introspection capability behind the success of Transformers is not limited to natural language processing: it can be adapted to any time series analysis to filter the data and focus on the most relevant aspects of the representation.

The $i$-th sample of a multi-spectral, multi-temporal acquisition can be represented as a matrix $X^{(i)} \in \mathbb{R}^{t \times b}$, where $t$ is the temporal dimension and $b$ is the number of spectral bands. It is therefore a 1D sequence of tokens, $(x_1, \dots, x_t)$ with $x_j \in \mathbb{R}^{b}$, that can be linearly projected to feed a standard Transformer encoder. The encoder maps a temporal input sequence to a continuous representation $Z_L \in \mathbb{R}^{t \times d_{model}}$, where $L$ is the output layer of the Transformer model and $d_{model}$ is the constant latent dimension of the projection space. Self-attention, through local multi-head dot-product self-attention blocks, can manipulate the temporal sequence, finding correlations between different time steps and completely avoiding the use of recurrent layers. Subsequently, the output representation can be exploited to classify the input sequence. That can be achieved by further processing the output encoder matrix and feeding a classification head trained to map the hidden representation to one of the $k$ classes.


Several approaches have been proposed in the literature to obtain this result. In [devlin2018bert] and [dosovitskiy2020image], a learnable embedding is prepended to the input sequence, and its state at the output of the Transformer encoder serves as a hidden representation of the class membership. Only that output token is fed to the classification head to obtain the final prediction. Alternatively, the output sequence can be averaged or processed with a max operation over the temporal dimension [russwurm2020self]. Regardless of the type of processing applied to the output sequence, the encoder will adapt to process the sequence properly and embed the information needed for the classification task. In conclusion, a Transformer encoder can be repurposed to process a multi-spectral input sequence and find valuable correlations between the different time steps to perform LC&CC with a high level of accuracy.
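The two ingredients discussed in this subsection, dot-product self-attention across time steps and pooling over the temporal dimension, can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the model used in the paper: it uses a single head and identity query, key and value projections (a real Transformer layer learns these projections and uses several heads), and all function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    tokens is a list of t time steps, each a list of d features.
    Queries, keys and values are the tokens themselves (assumption made
    here for brevity; learned projections are omitted).
    """
    d = len(tokens[0])
    output = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in tokens]
        weights = softmax(scores)
        # Each output step is a weighted average of all time steps.
        output.append([sum(w * value[c] for w, value in zip(weights, tokens))
                       for c in range(d)])
    return output


def max_pool_time(sequence):
    """Max over the temporal axis: one value per feature."""
    return [max(column) for column in zip(*sequence)]


def mean_pool_time(sequence):
    """Average over the temporal axis: one value per feature."""
    return [sum(column) / len(column) for column in zip(*sequence)]


if __name__ == "__main__":
    sequence = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]  # t = 3 steps, d = 2 features
    encoded = self_attention(sequence)
    print(max_pool_time(encoded))
    print(mean_pool_time(encoded))
```

Either pooled vector has a fixed size regardless of the sequence length, which is what allows a plain multi-layer perceptron to act as the classification head.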

4.3 DANN for Land Cover and Crops Classification

Figure 3: Overview of the overall framework to train a Transformer encoder with domain-adversarial training. The multi-spectral temporal sequence is first linearly projected and fused with a position encoding. Subsequently, the self-attention based model manipulates the input series and, through a max operation applied to the last layer of the encoder, it is possible to extract a token from the output sequence. Finally, gradients derived from the LC&CC and domain classifiers train the network while keeping the distributions of the source and target domains close.

We employ DANN in conjunction with self-attention based models to bridge the domain gap between different geographical regions. The overall architecture of the adopted methodology is shown in Fig. 3. Firstly, an input sequence is linearly projected to the constant latent dimension of the Transformer model, $d_{model}$. A Transformer encoder contains no recurrence or convolution that would let it make use of the order of the sequence. Therefore, a positional encoding is injected to provide information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension $d_{model}$ as the projected sequence, so that the two can be summed. Guided by experimentation, as in [dosovitskiy2020image], we adopt a learnable positional encoding instead of the sine and cosine functions with different frequencies of [vaswani2017attention]. The resulting pre-processed input sequence feeds the Transformer encoder, parameterized by $\theta_f$, which outputs a continuous representation $Z_L$. Subsequently, we make use of the max function over the temporal axis to extract a token, $z$, from the output sequence.
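The pre-processing step described above, a linear projection followed by the sum with a positional encoding, can be sketched as follows. The helper names, the toy dimensions, and the example weights are hypothetical; in the actual model both the projection weights and the positional table would be learned during training.

```python
def linear_project(sequence, weights, bias):
    """Project each time step from b spectral bands to d_model features.

    sequence: t time steps, each a list of b values
    weights:  b rows of d_model values (hypothetical learned matrix)
    bias:     d_model values
    """
    return [[sum(x_c * weights[c][j] for c, x_c in enumerate(step)) + bias[j]
             for j in range(len(bias))]
            for step in sequence]


def add_positional_encoding(projected, positional_table):
    """Sum the projected sequence with one encoding vector per time step.

    positional_table has the same shape as the projected sequence
    (t rows of d_model values), so the two can be added element-wise.
    """
    return [[value + pos for value, pos in zip(step, encoding)]
            for step, encoding in zip(projected, positional_table)]


if __name__ == "__main__":
    sequence = [[1.0, 2.0], [3.0, 4.0]]           # t = 2 steps, b = 2 bands
    weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # b = 2 -> d_model = 3
    bias = [0.0, 0.0, 0.0]
    positions = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]
    tokens = add_positional_encoding(
        linear_project(sequence, weights, bias), positions)
    print(tokens)
```

Because the positional table is indexed by time step, two identical spectral measurements taken at different dates end up with different token representations.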

The extracted representation constitutes the input for both the LC&CC and the domain multi-layer perceptron classifiers. The first network provides a probability distribution over the different classes. The domain classifier, on the other hand, outputs the probability that the extracted representation belongs to the target or source domain. Using the cross-entropy loss function for both classifiers, it is possible to compute the respective gradients and update the weights, $\theta_f$, of the feature extractor. Indeed, by inverting the sign of the gradients derived from the domain classifier, $\partial \mathcal{L}_d / \partial \theta_f$, and multiplying them by a scale factor $\lambda$, we can progressively reduce the distance between the latent spaces of the two domains while training the encoder on the classification task. Overall, the proposed training framework provides an effective solution to transfer the acquired knowledge of a model to a different region, exploiting only the original nature of the data.

5 Experiments and Discussion

We experiment with the proposed methodology on the four regions of the multi-temporal satellite BreizhCrops dataset presented in Section 3. The first objective of the experimentation is to investigate how the classification performance of a state-of-the-art model for LC&CC is affected by a lack of generalization towards different geographical regions. Then, we highlight how adversarial training can mitigate the domain gap and significantly boost performance for source and target regions with a marked distribution distance. It is important to remark that the method relies on the availability of samples from both the source and target domains, whereas only source labels are required; the absence of target labels does not allow direct application of transfer learning techniques. Finally, in the last part of the section, the obtained results are discussed and inspected through dimensionality reduction techniques, validating the proposed method for practical use.

Figure 4: $\lambda$ scheduling: the value of the domain adaptation parameter $\lambda$ is increased progressively during training. This allows the feature extractor to learn basic features during the initial epochs. Different final values of $\lambda$ are tested to study the right level of adaptation required in the different cases. The parameter $\gamma$ influences the slope of the curve and is kept constant at 10, so that $\lambda$ reaches the desired value in a suitable number of epochs.

5.1 Experimental settings

We carry out a complete set of experiments to compare the Transformer encoder classifier performance with and without DANN. In the final architecture, the classifier model comprises a transformer encoder feature extractor and a final classification stage.

In all experimentation, the transformer encoder receives as input a batch of 256 tensors with

temporal steps and spectral bands in the image samples. Moreover, to linearly project the temporal sequence to the constant latent dimension of the encoder, the input is first passed to a dense layer with 64 units. So, is equal to 128. On the other hand, the multi-head attention Transformer encoder is defined with a number of layers and attention heads equal to and . Finally, the dimension of internal fully-connected layers

. Rectified linear units is the non-linear activation function used for all neurons of the encoder.

The LC&CC classification stage is a simple multi-layer perceptron head composed of a normalization layer, a fully-connected layer with 128 units, ReLU as activation function, and a final layer with

neurons. On the other hand, for the DANN experimentation, the domain predictor is identical to the multi-layer perceptron head of the LC&CC classifier, with 128 units and a ReLU activation. However, the number of neurons in the final layer is set to , since we always perform a single target domain adaptation.

A cross-entropy loss function is chosen to train both the classifiers. The parameters of both models are updated using Adam optimizer with , and . A fixed number of epochs is always set to 250. The learning rate value is changed during training according to an exponential decay policy from a starting value of 0.001, with a decay scheduled for each epoch equal to . A key point in the experimental settings is related to the domain adaptation parameter . It acts as a regularization parameter, since it regulates the impact of the domain discriminator gradients on the feature extractor during training. Therefore, it can be considered as the principal hyper-parameter to tune when using DANN. We always use a scheduling policy for , as suggested in the original publication of DANN:


$$\lambda_p = \lambda_{max} \left( \frac{2}{1 + \exp(-\gamma p)} - 1 \right)$$

where $p \in [0, 1]$ is the training progress and $\lambda_{max}$ is the plateau value reached. This is the actual value of $\lambda$ used for the second half of the training, which affects the final performance of the model in terms of generalization. The parameter $\gamma$ defines the slope of the curve and is fixed to 10 so that $\lambda_{max}$ is reached in a suitable number of epochs. A scheduled value of $\lambda$ allows the feature extractor to learn the basic features for the classification during the first epochs. It then adjusts the mapping function so that the source and target domain feature distributions overlap at the end of the training process. As shown in Fig. 4, different values of $\lambda_{max}$ are tested to study the response of the model.
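The scheduling policy can be sketched as a small Python function. It assumes that the plateau value simply rescales the schedule suggested in [ganin2016domain], with the training progress defined as the current epoch divided by the total number of epochs; the function and argument names are hypothetical.

```python
import math


def lambda_schedule(progress, lam_max=1.0, gamma=10.0):
    """Domain adaptation weight as a function of training progress.

    progress: fraction of training completed, between 0 and 1
    lam_max:  plateau value reached at the end of training (assumed name)
    gamma:    slope of the curve, kept at 10 in the experiments
    """
    return lam_max * (2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0)


if __name__ == "__main__":
    num_epochs = 250  # number of epochs used in the experiments
    for epoch in (0, 25, 125, 250):
        print(epoch, round(lambda_schedule(epoch / num_epochs), 4))
```

With `gamma=10`, the weight is zero at the start and already exceeds 98% of the plateau at the midpoint of training, which matches the statement that the plateau value is effectively the one used for the second half of the training.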

                 Transformer Encoder                                    DANN
Source  Target   Train Acc.  Accuracy  F1-Accuracy  K-score  MMD        Accuracy  F1-Accuracy  K-score  MMD
1       2        0.8577      0.7877    0.5675       0.7229   0.1109     0.7628    0.5540       0.6950   0.0077
1       3        0.8577      0.7436    0.5266       0.6606   0.1620     0.7449    0.5080       0.6714   0.0183
1       4        0.8577      0.7941    0.5675       0.7294   0.0516     0.7960    0.5734       0.7343   0.0086
2       1        0.8951      0.7433    0.5309       0.6773   0.1577     0.7403    0.5161       0.6687   0.0208
2       3        0.8951      0.4967    0.3592       0.3642   0.6700     0.6505    0.4544       0.5483   0.0104
2       4        0.8951      0.6006    0.4395       0.4912   0.2536     0.7482    0.4832       0.6735   0.0416
3       1        0.8750      0.7767    0.5339       0.7122   0.1819     0.8045    0.5778       0.7488   0.0121
3       2        0.8750      0.6638    0.4594       0.5615   0.6254     0.7589    0.5334       0.6865   0.0277
3       4        0.8750      0.7348    0.5074       0.6504   0.1184     0.7968    0.5778       0.7338   0.0115
4       1        0.8870      0.7927    0.5551       0.7354   0.0339     0.8233    0.5822       0.7753   0.0039
4       2        0.8870      0.7600    0.5443       0.6870   0.0953     0.8003    0.5788       0.7399   0.0084
4       3        0.8870      0.7111    0.4961       0.6230   0.0960     0.7673    0.5443       0.6965   0.0062
Table 2: Results of crops classification for the Transformer Encoder classifier trained with and without DANN using . The two models are trained and tested on all the possible combinations of source/target domains available in BreizhCrops dataset. Accuracy, F1-Accuracy and K-score are the metrics used to compare the classification quality. Training accuracy is also reported for the Transformer encoder classifier. Maximum Mean Discrepancy computed on a subset of extracted features of source and target domain shows the successful reduction of features distance obtained with DANN.
| Source zone | Target zone | Accuracy improvement [%] | F1-score improvement [%] | K-score improvement [%] |
|---|---|---|---|---|
| 1 | 2 | -3.1576 | -2.3859 | -3.8508 |
| 1 | 3 | 0.1762 | -3.5378 | 1.6395 |
| 1 | 4 | 0.2296 | 1.0467 | 0.6773 |
| 2 | 1 | -0.3996 | -2.7935 | -1.2698 |
| 2 | 3 | 30.9721 | 26.4916 | 50.5414 |
| 2 | 4 | 24.5690 | 9.9474 | 37.1046 |
| 3 | 1 | 3.5803 | 8.2152 | 5.1446 |
| 3 | 2 | 14.3204 | 16.1075 | 22.2539 |
| 3 | 4 | 8.4475 | 13.8791 | 12.8283 |
| 4 | 1 | 3.8705 | 4.8817 | 5.4228 |
| 4 | 2 | 5.3053 | 6.3384 | 7.6922 |
| 4 | 3 | 7.9018 | 9.7154 | 11.8067 |

Table 3: Comparison between the Transformer Encoder classifier with and without DANN, in terms of the classification metrics reported in Tab. 2. This run of experiments is conducted with a scheduled adaptation parameter $\lambda$.
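The improvements in Tab. 3 are consistent with a relative change of each DANN metric with respect to the Transformer encoder baseline of Tab. 2. The minimal sketch below reproduces the computation under that assumption for the zone 2 (source) to zone 3 (target) case. The function name is hypothetical, and small deviations from the tabulated values are expected because the inputs are rounded to four decimals.

```python
def relative_improvement(baseline: float, adapted: float) -> float:
    """Percentage change of a metric with respect to its baseline value."""
    if baseline == 0.0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (adapted - baseline) / baseline


# Zone 2 (source) -> zone 3 (target), values taken from Tab. 2.
accuracy_gain = relative_improvement(baseline=0.4967, adapted=0.6505)
kappa_gain = relative_improvement(baseline=0.3642, adapted=0.5483)

print(f"accuracy improvement: {accuracy_gain:.2f} %")
print(f"K-score improvement: {kappa_gain:.2f} %")
```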

The Transformer encoder classifier is trained and tested on all the possible combinations of regions to quantify the existing domain gap. The classification performance is evaluated with three different metrics, chosen among the ones proposed in the BreizhCrops dataset benchmarks: Accuracy, F1-score and K-score. This last metric is Cohen's kappa [cohen1960coefficient], computed as $\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ and $p_e$ are the empirical and expected probability of agreement on a label. In addition, we use the Maximum Mean Discrepancy (MMD) metric to quantitatively evaluate the distance between source and target distributions.
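As an illustration of the K-score, the sketch below computes Cohen's kappa from two label sequences as $(p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. It is a minimal standard-library example, not the evaluation code used for the experiments; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred) -> float:
    """Cohen's kappa between reference labels and predicted labels."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("label sequences must be non-empty and of equal length")
    # Observed agreement: fraction of samples with matching labels.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement: product of the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Undefined (division by zero) when p_e == 1, i.e. a single class everywhere.
    return (p_o - p_e) / (1.0 - p_e)


# Toy example with hypothetical crop labels.
kappa_perfect = cohen_kappa([0, 1, 2, 1], [0, 1, 2, 1])
kappa_chance = cohen_kappa([0, 0, 1, 1], [0, 1, 0, 1])
```

Unlike plain accuracy, the score is zero when the agreement equals what the label frequencies alone would produce, which makes it informative on class distributions that are not uniform.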

| Source zone | Target zone | MMD (run 1) | Accuracy improvement [%] (run 1) | F1-score improvement [%] (run 1) | K-score improvement [%] (run 1) | MMD (run 2) | Accuracy improvement [%] (run 2) | F1-score improvement [%] (run 2) | K-score improvement [%] (run 2) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 0.0081 | -5.9370 | -7.4556 | -7.5298 | 0.0139 | -3.0445 | -3.4958 | -3.8474 |
| 1 | 3 | 0.0175 | -0.8177 | -5.7995 | -0.0167 | 0.0047 | -1.2319 | -8.0403 | -0.6464 |
| 1 | 4 | 0.0083 | -2.0227 | -2.4705 | -2.2594 | 0.0059 | -1.6857 | 0.8998 | -1.9084 |
| 2 | 1 | 0.0214 | 5.0107 | 0.1788 | 6.2900 | 0.0075 | -3.6125 | -6.4026 | -5.3582 |
| 2 | 3 | 0.0074 | 27.9781 | 24.1537 | 46.4754 | 0.0039 | 35.8191 | 62.0768 | 29.7026 |
| 2 | 4 | 0.0420 | 25.7184 | 20.1661 | 38.8051 | 0.0035 | 20.7318 | 11.9886 | 30.9672 |
| 3 | 1 | 0.0164 | 3.1832 | 5.3662 | 4.3668 | 0.0227 | 3.9681 | 7.9060 | 5.7821 |
| 3 | 2 | 0.0057 | 11.1249 | 14.1963 | 17.6895 | 0.0036 | 12.2495 | 13.3626 | 19.3933 |
| 3 | 4 | 0.0123 | 8.3884 | 14.1846 | 12.8221 | 0.0248 | 7.2283 | 11.9942 | 11.1784 |
| 4 | 1 | 0.0037 | 3.1445 | 3.4514 | 4.3273 | 0.0042 | -0.0469 | 0.9347 | -0.1183 |
| 4 | 2 | 0.0112 | 3.3195 | 1.5598 | 4.8393 | 0.0093 | 3.9222 | 1.5396 | 5.6998 |
| 4 | 3 | 0.0105 | 8.4599 | 10.9973 | 12.6350 | 0.0104 | 8.7947 | 11.3400 | 13.0299 |

Table 4: Results of the comparison between the Transformer Encoder classifier with and without DANN, in terms of classification metrics. The table summarizes the results obtained from two runs of experiments, conducted with two different schedules of the adaptation parameter $\lambda$. For each pair of source-target zones, the MMD obtained with DANN and the percentage improvements on the classification metrics are shown.

5.2 Maximum Mean Discrepancy

MMD is a statistical test originally proposed in [gretton2012kernel] to measure the distance between two distributions. MMD is widely used in domain adaptation, since it fits the need to understand whether the features extracted from the source and the target domain overlap. MMD can be directly exploited as a loss function for adversarial training of generative models or for domain adaptation purposes, as shown in [dziugaite2015training], [long2017deep]. However, in this work we limit its usage to showing the results of the Transformer Encoder DANN in terms of reduction of the feature distance.
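Formally, MMD is a kernel-based difference between feature means. Given a set of samples $X$ with a probability measure $P$, the feature mean can be expressed as:

$$\mu_P = \mathbb{E}_{X \sim P}\left[ \phi(X) \right]$$

where $\phi$ is the feature map that maps $X$ to a new feature space $\mathcal{F}$. If it satisfies the necessary theoretical conditions, a kernel-based approach can be used to compute the inner product of the feature means of two distributions of samples $X \sim P$ and $Y \sim Q$:

$$\langle \mu_P, \mu_Q \rangle_{\mathcal{F}} = \mathbb{E}_{X \sim P,\, Y \sim Q}\left[ \langle \phi(X), \phi(Y) \rangle_{\mathcal{F}} \right] = \mathbb{E}_{X \sim P,\, Y \sim Q}\left[ k(X, Y) \right] \tag{11}$$

At this point the MMD can be defined as the distance between the feature means of $P$ and $Q$:

$$\mathrm{MMD}^2(P, Q) = \lVert \mu_P - \mu_Q \rVert_{\mathcal{F}}^2$$

which can be expressed in more detail by using Equation (11):

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{X, X' \sim P}\left[ k(X, X') \right] - 2\, \mathbb{E}_{X \sim P,\, Y \sim Q}\left[ k(X, Y) \right] + \mathbb{E}_{Y, Y' \sim Q}\left[ k(Y, Y') \right]$$

However, an empirical estimate of MMD needs to be computed, since in a real case only samples are available instead of the explicit formulation of the distributions. It is possible to obtain the MMD expression by considering the empirical estimates of the feature means based on their samples:

$$\widehat{\mathrm{MMD}}^2(X, Y) = \frac{1}{m(m-1)} \sum_{i} \sum_{j \neq i} k(x_i, x_j) - \frac{2}{m^2} \sum_{i} \sum_{j} k(x_i, y_j) + \frac{1}{m(m-1)} \sum_{i} \sum_{j \neq i} k(y_i, y_j)$$

where $x_i$ and $y_i$ are in this case the image samples from source and target domains, and $m$ is the number of samples of the considered subsets. Finally, we specifically use a Gaussian kernel with the following expression:

$$k(x, x') = \exp\left( -\frac{\lVert x - x' \rVert^2}{2 \sigma^2} \right)$$
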
Figure 5: 2D feature visualization obtained with PCA, extracted with the Transformer encoder trained on the source domain. In (a) and (b) we have features extracted from zone 1 (source) and 2 (target): a low MMD distance indicates no need for domain adaptation. On the contrary, (c) and (d) show that features from zones 2 (source) and 3 (target) are mapped poorly in the target domain, with a consequent low classification accuracy. The case reported in (e) and (f) shows zone 4 (source) to 3 (target), where, regardless of an initial low MMD, the classifier accuracy can still be improved by reducing the domain gap.
Figure 6: 2D feature visualization obtained with PCA, extracted with Transformer DANN models trained on the specific source-target domains. In (a) and (b) we have features extracted from zone 2 (source) and 3 (target): here the positive effect of DANN in terms of feature overlap is very clear compared to what is shown in Fig. 5 (c-d). On the contrary, (c) and (d) show that features from zones 1 (source) and 2 (target), which have a low MMD, are less consistent. The case reported in (e) and (f) shows zone 4 (source) to 3 (target), where, regardless of an initial low MMD, the classifier accuracy can still be improved by reducing the distance between the extracted features.

5.3 Results interpretation and applicability study

In this section, we compare the Transformer classifier with and without DANN, highlighting the scenarios in which adversarial training gives a definite advantage when training a classifier for LC&CC. The results in Tab. 2, 3 and 4 show that DANN adversarial training improves the transferability of the classifier to other domains in the majority of the cases. Nonetheless, we investigate a potential criterion to decide whether the transfer of learning from source to target can be effectively improved by DANN. In more detail, since DANN aims to overlap feature distributions, we look at the features extracted from a subset of 10000 samples of each zone dataset. Thanks to the balanced distribution of samples for each class in the datasets, the subsets can be selected by random sampling and are representative of the whole dataset.

We use the set of extracted features to compute a numerical evaluation of the distance decrease and to give a graphical visualization of the effect of DANN. From a quantitative perspective, we propose Maximum Mean Discrepancy as the feature distance metric to detect the conditions in which DANN is an appropriate methodology. Since MMD is computed without considering the clustering of classes, only unlabeled image samples are needed. We use the PCA algorithm to compute the principal components of the extracted features, and we exploit them to provide 2D and 3D visualizations of relevant cases.
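The label-free distance test described above can be sketched as follows. This is a minimal standard-library example, not the code used for the experiments: it assumes the unbiased estimate of the squared MMD with a Gaussian kernel, equal-sized source and target subsets, and a bandwidth `sigma = 1.0` chosen only for illustration. The synthetic 2D points stand in for the extracted features, and all names are hypothetical.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, sigma=1.0):
    """Unbiased estimate of the squared MMD between two equal-sized sample sets."""
    m = len(xs)
    if m < 2 or len(ys) != m:
        raise ValueError("both sets need the same size, with at least 2 samples")
    # Within-set terms skip the diagonal (i == j); the cross term keeps all pairs.
    k_xx = sum(gaussian_kernel(xs[i], xs[j], sigma)
               for i in range(m) for j in range(m) if i != j)
    k_yy = sum(gaussian_kernel(ys[i], ys[j], sigma)
               for i in range(m) for j in range(m) if i != j)
    k_xy = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return (k_xx + k_yy) / (m * (m - 1)) - 2.0 * k_xy / (m * m)


# Synthetic stand-ins for extracted features (no labels are needed).
rng = random.Random(0)
source = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
similar_target = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
shifted_target = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(100)]

mmd_similar = mmd2_unbiased(source, similar_target)
mmd_shifted = mmd2_unbiased(source, shifted_target)
```

The estimate stays close to zero when the two sets follow the same distribution and grows when the target is shifted, which is the behaviour exploited here to flag a domain gap. The double loops cost a number of kernel evaluations quadratic in the subset size, so a vectorized implementation would be preferable for subsets of 10000 samples.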

Figure 7: 3D feature visualization and comparison. (a) and (b) show the features extracted from zone 1 (source) and 2 (target), obtained with the Transformer encoder and with DANN, respectively. The Transformer encoder alone can correctly map features on both domains. Differently, the improvement provided by the DANN model is very evident in (c) and (d), representing the features extracted from zone 2 (source) and 3 (target), where the Transformer encoder alone presents both high values of MMD and low classification accuracy on the target domain.

First, we look at the MMD values obtained from both the Transformer encoder and DANN in Tab. 2. DANN is always able to reduce the distance between the feature distributions. However, this is not always associated with an increase in classification performance. The key information is contained in the MMD value computed on the source and target features extracted by the standard classifier. This simple test is crucial and can also be performed without labels. The best improvement with DANN is reached with zone 2 as the source domain and zone 3 as the target domain. The percentage improvement shown in Tab. 3, an increase of more than 30% in accuracy, corresponds to an initial MMD value of 0.6700, which DANN reduces to 0.0104. This observation suggests that high MMD values indicate a lack of generalization of the classifier and a domain gap. It should also be considered that the geographical zones of interest are close to each other, so small domain gaps are to be expected. A clear example is zone 1, when chosen as the source domain. This factor can be considered an additional difficulty of the case study. It is therefore possible that the same methodology, applied to other regions of the planet sharing the same crop categories, would show larger improvements. Another peculiar case is zone 4 (source) and zone 3 (target). The MMD value is already low in the initial analysis, without the intervention of DANN. However, a classification boost is still achieved.

We report a visual representation of the extracted features to add meaning to the previous considerations. In particular, Fig. 5 and 6 show the 2D principal components obtained from the peculiar cases defined below:

  • case 1: zone 2 (source), zone 3 (target). In this case DANN shows the greatest improvements with an initial high value of MMD. Features are visually reported in (c),(d) of Fig. 5 when extracted by standard Transformer encoder, in (a),(b) of Fig. 6 when extracted by DANN. The difference is visually clear.

  • case 2: zone 1 (source), zone 2 (target). In this case DANN shows the worst results, with an initial low value of MMD. Features are visually reported in (a),(b) of Fig. 5 when extracted by the standard Transformer encoder, in (c),(d) of Fig. 6 when extracted by DANN. They appear very similar even without DANN.

  • case 3: zone 4 (source), zone 3 (target). In this case DANN shows good improvements with an initial low value of MMD. Features are visually reported in (e),(f) of Fig. 5 when extracted by standard Transformer encoder, in (e),(f) of Fig. 6 when extracted by DANN. The difference is visually clear.

Finally, case 1 and case 2 defined above are also considered for a 3D representation. Fig. 7 shows the obtained results. For each subplot in the figure, both source and target domain features are scattered. This visual perspective highlights the effect of the DANN method in both the worst and the best application scenario. In case 2, the difference between source and target features is small even without DANN, as shown in (a). Differently, in case 1 the situation changes significantly from (c) to (d) thanks to the adversarial training.
The proposed discussion underlines some interesting insights on the correlation between reducing the domain gap and improving classifier performance. The isolated cases considered provide a good reference to decide whether adopting the proposed DANN methodology is a reasonable and convenient choice for Land Cover classification on multi-spectral temporal sequences.

6 Conclusions

In this paper, we investigated adversarial training for domain adaptation with state-of-the-art self-attention based models for LC&CC. Domain gaps between distinct geographical regions prevent the direct reuse of a trained model on areas that differ from the training domain, and the practical difficulty of acquiring labeled data prevents the direct application of transfer learning techniques. Our extensive experimentation highlights the advantages of applying the proposed methodology to Transformer models trained on multi-spectral, multi-temporal data, with a considerable gain in performance when the distribution distance between target and source regions is large. Future work may investigate the advantages and disadvantages of different domain adaptation techniques applied to LC&CC.