A Bayesian-inspired, deep learning, semi-supervised domain adaptation technique for land cover mapping

05/25/2020 ∙ by Benjamin Lucas, et al. ∙ Monash University

Land cover maps are a vital input variable to many types of environmental research and management. While they can be produced automatically by machine learning techniques, these techniques require substantial training data to achieve high levels of accuracy, and such data are not always available. One technique researchers use when labelled training data are scarce is domain adaptation (DA), where data from an alternate region, known as the source domain, are used to train a classifier and this model is adapted to map the study region, or target domain. The scenario we address in this paper is known as semi-supervised DA, where some labelled samples are available in the target domain. In this paper we present Sourcerer, a Bayesian-inspired, deep learning-based, semi-supervised DA technique for producing land cover maps from satellite image time series (SITS) data. The technique takes a convolutional neural network trained on a source domain and then trains further on the available target domain with a novel regularizer applied to the model weights. The regularizer adjusts the degree to which the model is modified to fit the target data, limiting the degree of change when the target data are few in number and increasing it as target data quantity increases. Our experiments on Sentinel-2 time series images compare Sourcerer with two state-of-the-art semi-supervised domain adaptation techniques and four baseline models. We show that on two different source-target domain pairings Sourcerer outperforms all other methods for any quantity of labelled target data available. In fact, the results on the more difficult target domain show that the starting accuracy of Sourcerer (when no labelled target data are available), 74.2%, is greater than that of the next-best state-of-the-art method trained on 20,000 labelled target instances.


1 Introduction

Land cover maps enable us to observe and understand the evolution of the Earth over many spatial and temporal scales (Turner et al., 2007), and as such, they are considered a vital component of all types of environmental research and management (Bojinski et al., 2014; Loveland et al., 2000; Lavorel et al., 2007; Asner et al., 2005; Armsworth et al., 2006).

Land cover maps can be automatically produced by applying supervised machine learning models to images acquired by satellites. Traditionally, models were learnt from single images; however, in recent times the use of temporally ordered sequences of images, known as satellite image time series (SITS), has become the new standard (Inglada et al., 2017). Figure 1 depicts the production of time series from a pixel of Earth imaged by satellite.

Maps produced using SITS have been found to be significantly more accurate, as these data enable classification of some land cover types that single images do not (Defourny et al., 2019; Vuolo et al., 2018). For instance, soy and corn are both summer crops and will appear similar in a single image; in contrast, their different growth rates are clearly evident in SITS data.

Figure 1: The production of time series data from satellite images (Tan et al., 2017).

The current state-of-the-art methods for producing land cover maps from SITS are deep learning and random forests (Wulder et al., 2018; Azzari and Lobell, 2017). However, the accuracy of both of these is highly dependent upon the availability of a large quantity of labelled data. The need for large quantities of labelled training data presents a major problem in land cover mapping for three reasons:

  1. Labelled data are both expensive and time-consuming to acquire at the resolution of the latest Earth observation satellites (10 meters in the case of Sentinel-2 as used in this paper).

  2. The data are often specific to their location. For example, Figure 2 shows the Normalized Difference Vegetation Index of pea crops growing in 3 different regions of France. It is clear that even within one country, the same crop can take on three distinctly different profiles, meaning we cannot simply borrow data from a nearby region when training a model.

  3. Land cover changes over time, and thus we cannot reliably use old labelled data to train a new model as it may no longer be accurate.

Consequently, labelled data that are both recent and sourced from the study area are at best scarce, and frequently non-existent, making it extremely challenging to use state-of-the-art methods to create land cover maps.

Figure 2: Mean NDVI of pea crops in 2016 for three Sentinel-2 satellite tiles located in France (see Figure 4 for the exact locations of the tiles).

Researchers have proposed three main approaches to tackle this problem: (1) using out-of-date reference data (Tardy et al., 2017); (2) active learning strategies (Persello and Bruzzone, 2012; Matasci et al., 2012); and (3) domain adaptation (DA).

The first approach best suits the scenario in which accurate historical data exist (which is often not the case for the same reasons outlined above) and state-of-the-art algorithms can be used to identify which of the historical data are now outdated (Pelletier et al., 2017; Frenay and Verleysen, 2014; Bailly et al., 2018; Damodaran et al., 2020). The second approach best suits the scenario where sufficient computational resources are available and where further data collection is feasible (i.e., timely, affordable) (Tuia et al., 2011), and therefore lends itself favorably to smaller scale applications. The third approach, DA, is best suited to a situation where ample labelled data from a different location are readily available to the practitioner and can be used to train a classifier. Some DA approaches are specifically aimed at the scenario where some additional labelled data are available from the study area, which is the scenario we address in this paper.

In DA, data from a source domain, an area where labels are available, is utilized for the purpose of classifying data from a target domain, where labels are unavailable, or scarce. Generally speaking, this can occur in two ways: (a) by adapting the source domain data to appear more statistically similar to the target domain; or (b) learning a classifier on the source domain data and then adapting it to classify the target domain.

In this paper, we address the particular scenario of semi-supervised DA, in which a relatively small amount of labelled data is available from the target domain. Our method, Sourcerer, is a deep learning-based method for semi-supervised DA based on a Bayesian-inspired, novel regularizer for the weights of a convolutional neural network (CNN). We demonstrate Sourcerer on Sentinel-2 image time series data and show that on a 30-class land cover classification problem it outperforms the current state-of-the-art methods in semi-supervised DA, regardless of the quantity of labelled target data available. In particular, our contributions can be summarized as follows:

  1. Proposing Sourcerer: a novel method for semi-supervised DA on SITS data;

  2. Achieving state-of-the-art performance on two separate source-target pairings on Sentinel-2 data;

  3. Providing a semi-supervised DA method emphasizing a user-friendly implementation, as it can be applied to a pre-trained deep learning model and does not require the user to possess the source domain training data;

  4. Providing an open source implementation of Sourcerer for reproducibility and wider implementation.

The remainder of the paper is organized as follows: Section 2 discusses domain adaptation, the current state of the art, and presents the existing work using DA for remote sensing; Section 3 presents Sourcerer: our Bayesian-inspired, deep learning-based method for semi-supervised DA; Section 4 details the data used in the experiments presented in Section 5; finally, we draw conclusions and suggest future directions in Section 6.

2 Domain Adaptation

DA belongs to a family of machine learning techniques that deal with data distributions that are not stationary over time or space (Tuia et al., 2016). It utilizes labelled data from a source domain, in which labels are widely available, for the purpose of classifying the area of interest in which labels are scarce (or unavailable), the target domain. Implicit in this are the assumptions that the source joint distribution $P_S(X, Y)$ is sufficiently different to the target joint distribution $P_T(X, Y)$ for it to be sub-optimal to use a model trained on $P_S(X, Y)$, but nonetheless sufficiently similar to be useful for the learning task, where $X$ is the input (observations) and $Y$ the output (land cover labels).

When using SITS for land cover mapping, DA can be applied in two ways: temporally or spatially. The first situation primarily arises when a map is in need of updating but reference data from the present time is unavailable. In this case, a map of the study area from a previous year (or years) can be used as the source domain and adapted to map the present day land cover (the target domain) (Tardy et al., 2017; Demir et al., 2013; Tardy et al., 2019).

The second setting, and the one we will explore in this paper, occurs when data from one geographical region is used as the source domain and domain adaptation is used to map a different geographical region (the target domain).

There are two general scenarios that are presented in DA research, unsupervised DA and semi-supervised DA (Kouw and Loog, 2019), which differ in whether labelled target data are available. In unsupervised DA, no labelled data are available in the target domain and the methods acquire information only from the structure of the unlabelled data. In semi-supervised DA, some labelled samples are available. However, there are usually insufficient samples to train an accurate classifier, so the labelled target data complement the source data in training a classifier. We note that this definition differs from the one given in the most cited survey of DA in remote sensing (Tuia et al., 2016) but is consistent with the definition used in the overwhelming majority of DA research, particularly in the field of computer vision (Patel et al., 2015). In accordance with the definition of DA, it is assumed that sufficient labelled source data are available in both scenarios.

While the vast majority of DA research focuses on unsupervised methods, we have chosen to present a semi-supervised DA method as we believe that this is a more practical scenario in remote sensing/land cover mapping—where funding is available to obtain some labelled data. In this case, a state-of-the-art semi-supervised DA technique would help practitioners produce high-accuracy land cover maps without having to perform additional large scale data collection.

In the following sections we provide a brief overview of the state of the art in both unsupervised and semi-supervised DA.

2.1 Unsupervised Domain Adaptation

Due to the large quantity of research in unsupervised DA, we emphasize the current state of the art; for a more comprehensive review of the field we direct the reader to Kouw and Loog (2019). Unsupervised DA occurs when labelled data are available in the source domain, while the target domain has only unlabelled samples available. Early techniques addressing this problem attempt to align the source and target data spaces, or projections thereof, to one another (Huang et al., 2007; Kouw et al., 2016). These methods often also include the use of dimension reduction techniques, such as principal component analysis or transfer component analysis, based on the assumption that the reduced spaces will be more similar to one another (Pan et al., 2011; Fernando et al., 2013; Gong et al., 2012). These ideas have been further extended in Long et al. (2015) by the addition of deep learning and the maximum mean discrepancy criterion to find features that are transferable between domains.

More recently, DA research has had a marked shift towards deep learning methods. The primary difference is that, whereas the adaptation method and the classifier were traditionally orthogonal to one another, deep learning-based methods perform the adaptation and the training of the classifier in one step (often simultaneously). Deep learning methods have been applied in various ways, including: sharing model weights (Sun and Saenko, 2016); adversarial loss functions (Ganin et al., 2016); generative adversarial networks (Tzeng et al., 2017); and iteratively learning the target-domain decision boundary (Shu et al., 2018).

Another major area of recent research in unsupervised DA is optimal transport (OT), which seeks the minimum-cost transformation between the source and target distributions by attributing a cost to the transformation of each instance in the dataset (Courty et al., 2016; Damodaran et al., 2018). OT has been used in land cover mapping to produce maps with no present day reference data, where maps from previous years are used as the source domain and the present day land cover is used as the target. Tardy et al. (2019) found that a 17-class problem was too difficult for most variants of OT, with the best producing a map with only 70 percent accuracy. It has also been shown that OT can be used in a multimodal context for land cover mapping, where data from one device act as the source domain and data from another as the target (Courty et al., 2016).

2.2 Semi-supervised Domain Adaptation

Semi-supervised DA occurs when labelled data is available in both the source and target domains, but the quantity available in the target domain is insufficient to train an accurate model.

This scenario has great applicability to land cover mapping as resources are rarely available for a large scale data collection campaign, and therefore a successful semi-supervised DA method will allow for the production of large-scale maps at a small fraction of the cost.

For example, Inglada et al. (2017) required approximately 35 million training instances to create a land cover map of France, a quantity that is unfeasible to obtain in many nations, particularly those that are resource-poor.

While less studied, semi-supervised DA research has followed a similar trajectory to unsupervised DA research over the last decade. In fact, a number of unsupervised methods have also been applied to the semi-supervised setting with slight modifications to utilize the labelled target data.

Most early methods worked by mapping the source and target domain data to a new feature space, ensuring that instances from the same class map to a similar area of the space, regardless of their originating domain (Gong et al., 2012; Wang and Mahadevan, 2011). In general, these methods do not handle non-linear deformations or high-dimensional data problems particularly well and are therefore of less relevance to the field of remote sensing (Tuia and Camps-Valls, 2016).

Kernel manifold alignment (KEMA) (Tuia and Camps-Valls, 2016) was developed to combat the issue of dealing with high-dimensional data, by creating the data transform based on only a few labelled samples from each domain. However, when KEMA was applied to a land cover mapping problem by Bailly et al. (2017), the results were unsatisfactory, yielding only 70 percent accuracy on a 7-class classification problem.

Recently, deep learning has resulted in marked advances in semi-supervised DA. Domain-adversarial neural networks (DANN) (Ganin et al., 2016) can be used as either a semi-supervised or unsupervised method, as required. This method aims to learn class labels that are domain-independent. To achieve this, a CNN is trained with a loss function comprised of two components—a class-specific component and a domain-specific component. This approach seeks to simultaneously minimize the loss of predicting class labels while maximizing the loss of predicting whether the instance came from the source or target domain. Consequently, the model learns to accurately classify classes while having increasing difficulty distinguishing between domains.

The other method representing the current state of the art in semi-supervised DA is minimax entropy (MME) (Saito et al., 2019). This method learns a prototype (a representative datapoint) for each class in the labelled data and then minimizes the distance between these prototypes and the unlabelled data, thus learning discriminating features. As the labelled data is dominated by instances from the source domain, the method uses a novel adversarial method to shift the class prototypes towards the target domain data.

It is important to note that both DANN and MME perform semi-supervised DA by first pooling the labelled source and target data. This is a major point of difference with the approach we propose in this paper.

3 Sourcerer

The method we present in this paper, Sourcerer, is a novel, Bayesian-inspired, deep learning-based method for semi-supervised DA for SITS. It uses a CNN model trained on the source domain as a starting point and improves upon it by further training it on the available labelled target data. A critical and distinguishing feature of our approach is a novel regularizer (SourceRegLoss) that is used while training on the target data. This tunes the amount of trust placed on the updates. That is, as the quantity of labelled target data increases, the model places gradually more trust on what is learnt from this data (and consequently, relies less upon the weights learnt from the source data).

Sourcerer not only delivers excellent performance, but is also widely applicable as it does not require access to the source data, but instead works on a model previously trained on that data. In our experiments, we demonstrate the flexibility of our approach by training a model on a source domain once, and then utilizing this pre-trained model and Sourcerer to classify two different target domains.

3.1 TempCNN

The CNN model utilized by Sourcerer is TempCNN (Pelletier et al., 2019), which has been shown to be a highly accurate model for pixel-based analysis of SITS data. It has been demonstrated to significantly outperform other types of deep learning models, including recurrent neural networks and ResNet variations, at large geographical scale.

The model comprises 3 convolutional blocks, followed by 1 fully-connected block and a softmax layer (see Figure 3). Each convolutional block consists of 64 convolutional filters of length 5, followed by a batch normalization layer, a dropout layer with a rate of 0.5, ending with a ReLU activation function. The convolutions are 1-dimensional and are performed along the temporal axis only. The fully-connected layer has 256 neurons, followed by the same batch normalization layer, dropout and ReLU function. The final layer in our case is a softmax with 30 units representing the 30 land cover classes of our classification problem (see Section 4.2).

Figure 3: The TempCNN model architecture, as presented in Pelletier et al. (2019).
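To make the layer sizes above concrete, the following sketch walks through the output shape and learnable parameter count of each block. It is an illustration only, not the authors' implementation: it assumes "same" padding (so the temporal length is preserved), a flatten step before the fully-connected layer, and two learnable parameters (scale and shift) per batch-normalized channel. The input size of 10 bands by 37 timesteps is taken from Section 4.1; the function names are hypothetical.

```python
# Layer-by-layer shape and parameter count for a TempCNN-like network.
# Assumptions (not stated in the text): "same" padding, a flatten step before
# the dense layer, and two learnable batch-norm parameters per channel.

def conv1d_params(in_channels, out_channels, kernel_size):
    """Convolution weights plus one bias per filter."""
    return in_channels * kernel_size * out_channels + out_channels

def dense_params(in_features, out_features):
    """Dense weights plus one bias per unit."""
    return in_features * out_features + out_features

def tempcnn_summary(n_bands=10, n_timesteps=37, n_filters=64,
                    kernel_size=5, n_dense=256, n_classes=30):
    """Return a list of (layer name, output shape, learnable parameters)."""
    layers = []
    channels = n_bands
    for i in range(3):  # three convolutional blocks
        params = conv1d_params(channels, n_filters, kernel_size) + 2 * n_filters
        layers.append((f"conv_block_{i + 1}", (n_timesteps, n_filters), params))
        channels = n_filters
    flat = n_timesteps * n_filters  # flattened input to the dense block
    layers.append(("dense_block", (n_dense,),
                   dense_params(flat, n_dense) + 2 * n_dense))
    layers.append(("softmax", (n_classes,), dense_params(n_dense, n_classes)))
    return layers

summary = tempcnn_summary()
for name, shape, params in summary:
    print(f"{name:14s} output={shape} params={params}")
print("total learnable parameters:", sum(p for _, _, p in summary))
```

Under these assumptions the fully-connected block dominates the parameter count, which is relevant later because the source regularizer is applied to every learnable parameter.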

3.2 Source-regularized Loss Function

To utilize Sourcerer, one must first either train a model using the labelled source data with a standard loss function (for example, categorical cross-entropy loss for a classification problem), or obtain a pre-trained model. Let $\hat{\theta}_S$ denote the estimates of the parameters of the source model. Then, using these estimates as a reference point, the new target model is trained on the labelled target data using the following source-regularized loss function:

$$\mathrm{SourceRegLoss}(\theta) = \mathcal{L}\bigl(f_\theta(X_T), y_T\bigr) + \lambda \,\lVert \theta - \hat{\theta}_S \rVert_2^2 \tag{1}$$

where:
      $\mathcal{L}$ is the average loss (calculated per sample);
      $f_\theta$ is the current model with parameters $\theta$;
      $X_T, y_T$ are the target data and labels, respectively;
      $\hat{\theta}_S$ are the estimated parameters of a model trained on the source data; and
      $\lambda$ is the regularization hyperparameter.

During training, the proposed regularizer acts to shrink the values of the estimated parameters towards those that were learned on the source data. This is done by adding the squared difference between the parameters of the target model, $\theta$, and the estimated parameters of the source model, $\hat{\theta}_S$, to the loss function, penalizing parameter estimates that deviate substantially from the source model. This approach is motivated by the more general ideas of Bayesian inference. In Bayesian inference one formally specifies a prior guess at the likely population values of a model through the mechanism of a prior distribution. The resulting posterior distribution combines the information contained in the sample with the information in the prior. In our case, the use of the source model parameter estimates $\hat{\theta}_S$ as a reference point mimics the use of a prior distribution. The correspondence is even closer than this, due to the relationship between squared penalties and normal distributions (see Section 3.4 for further discussion). A similar approach has been previously used in transfer learning (Dalessandro et al., 2014), but to the best of our knowledge this is the first time it has been adapted to time series classification, to the field of Earth observation, and to CNNs in general.

The hyperparameter $\lambda$ controls the degree to which deviations of the target model from the source model are penalized. In standard Bayesian inference, the information contained in the prior distribution is outweighed by the information contained in the sample as the sample size grows, so that the effects of regularization are greater for small amounts of target data and correspondingly reduced as the amount of target data increases. We discuss in Section 3.3 a simple technique for choosing $\lambda$ that mimics this behavior. The regularization is applied to all learnable parameters of the model; for the TempCNN this includes the weights and biases of the convolutional layers, the fully-connected layers, and the batch normalization layers.

We note that we found optimum results by freezing the running mean and running variance parameters of the batch normalization layer after training on the source data. This is because the available target data have low variability as they are from a limited number of polygons and therefore the batch mean and batch variance of these data are not representative of the data as a whole. That is, the batch means are skewed towards the classes present and the batch variance will be lower given the limited number of classes present in the target data.

It is also important to emphasize that our proposed loss function adds no additional computational cost to the training of the target model.
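The source-regularized loss described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the released implementation: the model parameters are represented as a flat list of numbers, the data term is the average categorical cross-entropy, and the names `cross_entropy` and `source_reg_loss` are hypothetical.

```python
import math

def cross_entropy(probs, labels):
    """Average categorical cross-entropy; probs[i][k] is P(class k | sample i)."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)

def source_reg_loss(probs, labels, theta, theta_source, lam):
    """Average data loss plus lam times the squared distance to the source parameters."""
    penalty = sum((t - s) ** 2 for t, s in zip(theta, theta_source))
    return cross_entropy(probs, labels) + lam * penalty

# Toy example: two target samples, three classes, four flattened parameters.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # hypothetical model outputs
labels = [0, 1]
theta_source = [0.5, -1.0, 2.0, 0.0]        # parameters learnt on the source
theta = [0.6, -1.0, 1.8, 0.1]               # current target-model parameters

no_reg = source_reg_loss(probs, labels, theta, theta_source, lam=0.0)
with_reg = source_reg_loss(probs, labels, theta, theta_source, lam=10.0)
print(no_reg, with_reg)
```

With a large regularization weight the penalty term dominates and any move away from the source parameters is expensive; with a weight of zero the loss reduces to ordinary training on the target data alone.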

3.3 Determining the Regularization Hyperparameter

The amount of regularization applied by Sourcerer is determined entirely by the choice of $\lambda$. In this section we propose a heuristic choice that automatically balances the amount of regularization against the amount of available labelled target data. When the quantity of labelled target data is small, we would like the procedure to use a large value for $\lambda$, making the value of the weights tend toward the parameters of the source model; as the quantity of target data grows, $\lambda$ should decrease so that the model is increasingly free to fit the target data. To see that such a schedule is sound, we note that as the amount of target data, $N$, grows the average loss is of order $O(1)$, i.e., it does not grow in magnitude with $N$. To ensure that for large amounts of target data the regularization has little effect we require that $\lambda = o(1)$, i.e., $\lambda$ tends to zero as $N \to \infty$. This is a necessary condition for our learning procedure to be statistically consistent.

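To achieve this desired behavior, we propose a simple heuristic schedule for $\lambda$. We fix the value of $\lambda$ at the two extreme points: (i) when we have a minimum quantity of target data ($N_{\min}$, $\lambda_{\max}$) and (ii) when we have some large amount of target data ($N_{\max}$, $\lambda_{\min}$). We then fit a concave-up power curve between these points. The usual form of a power curve is:

$$\lambda = a N^{b} \tag{2}$$

where:
      $\lambda$ is the regularization hyperparameter;
      $N$ is the quantity of labelled target data available; and,
      $a$, $b$ are constants.

Using the properties of a power curve, this formula can also be represented as a linear equation on a log-log scale. Therefore, to find the schedule for $\lambda$ we find the line that passes through the log transform of our two points: $(\log(N_{\min}), \log(\lambda_{\max}))$ and $(\log(N_{\max}), \log(\lambda_{\min}))$, respectively. The slope of the resulting line is:

$$b = \frac{\log(\lambda_{\min}) - \log(\lambda_{\max})}{\log(N_{\max}) - \log(N_{\min})} \tag{3}$$

Using this slope and the point $(\log(N_{\min}), \log(\lambda_{\max}))$, we can define the equation of the line as:

$$\log(\lambda) - \log(\lambda_{\max}) = b \,\bigl(\log(N) - \log(N_{\min})\bigr)$$

and solve for $\lambda$, yielding

$$\lambda = \lambda_{\max} \left(\frac{N}{N_{\min}}\right)^{b} \tag{4}$$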

We now describe some simple and reasonable heuristic choices for some of the free variables. In the (unlikely) case in which only one labelled target instance is available, a very large value for $\lambda$ will ensure that the model uses the source parameters. By similar reasoning, when a significant amount of labelled target data is available, a suitably small value of $\lambda$ will allow the model to learn from the target data and largely ignore the source model. Following this argument, we set $N_{\min} = 1$, fix $\lambda_{\max}$ to a very large constant and $\lambda_{\min}$ to a very small constant, and Equation 4 reduces to:

$$\lambda = \lambda_{\max} N^{b} \tag{5}$$

where $b$ is now

$$b = \frac{\log(\lambda_{\min}) - \log(\lambda_{\max})}{\log(N_{\max})} \tag{6}$$

This leaves $N_{\max}$ as the only free, user-specified hyperparameter of the procedure. We note that as long as $N_{\max} > 1$ (a reasonable choice), then $b < 0$ and the schedule (5) satisfies the condition $\lambda \to 0$ as $N \to \infty$, as prescribed above. We have performed a sensitivity analysis of this parameter in Section 5.2.4.
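The power-curve schedule derived in this section can be sketched as a short function. This is an illustration under stated assumptions: the function name `lambda_schedule` is hypothetical, and the default values for the largest and smallest regularization weights (`lam_max`, `lam_min`) and the example `n_max` are placeholders chosen for illustration, not values confirmed by the text.

```python
import math

def lambda_schedule(n_target, n_max, n_min=1, lam_max=1e10, lam_min=1e-10):
    """Regularization weight on the power curve through
    (n_min, lam_max) and (n_max, lam_min).

    lam_max and lam_min are illustrative defaults, not values from the paper.
    """
    if n_target < n_min:
        raise ValueError("n_target must be at least n_min")
    # Slope of the straight line on the log-log scale; negative when
    # n_max > n_min and lam_min < lam_max.
    b = (math.log(lam_min) - math.log(lam_max)) / (math.log(n_max) - math.log(n_min))
    return lam_max * (n_target / n_min) ** b

n_max = 1_000_000  # hypothetical choice of the user-specified hyperparameter
for n in (1, 100, 10_000, 1_000_000):
    print(n, lambda_schedule(n, n_max))
```

By construction the schedule returns the largest weight at the minimum quantity of target data and the smallest weight at `n_max`, and it decreases monotonically in between, so the source model is trusted less as more labelled target data become available.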

3.4 Connection to Bayesian Inference

We now examine the close connection between Sourcerer and Bayesian inference. This has been previously noted, but we now make the connection more explicit. First we briefly review Bayesian statistics. In the Bayesian approach we have a probabilistic model of data, $p(\mathcal{D} \mid \theta)$, with unknown parameters $\theta$ that we would like to fit to some observed dataset $\mathcal{D}$. We further must propose a probability distribution $\pi(\theta)$ that describes our belief about which values of $\theta$ are likely to be the (unknown) population value, before seeing the data (i.e., a priori). This is called a prior distribution. Bayesian inference proceeds by forming a posterior distribution using Bayes’ rule:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, \pi(\theta)}{p(\mathcal{D})}$$

where $p(\mathcal{D})$ denotes the marginal distribution of the data. The posterior distribution describes the likelihood of certain values of $\theta$ being the true (unknown) population value of $\theta$, after observing data $\mathcal{D}$, and is used as a basis for statistical inference. In practice, computing the normalizing term $p(\mathcal{D})$ is usually infeasible, particularly for complex models such as neural networks, and instead of using the complete posterior it is common practice to estimate $\theta$ by maximizing the unnormalized posterior

$$\hat{\theta} = \arg\max_{\theta} \; p(\mathcal{D} \mid \theta)\, \pi(\theta).$$

A particular strength of the Bayesian framework is that it allows us to formally encode our prior beliefs, or previous information, into the learning process.

We can connect Sourcerer, and the source-regularized loss (eq. 1) on which it is based, to Bayesian inference by noting several equivalencies. First, we note that maximizing the posterior is equivalent to minimizing the negative logarithm of the posterior. The choice of cross-entropy loss for categorical regression is equivalent to choosing our data model $p(\mathcal{D} \mid \theta)$ to be an appropriate neural network with a multinomial logistic regression output layer, and our choice of squared ($\ell_2$) regularization is equivalent to assuming a normal prior distribution for the parameters of the form

$$\theta_j \sim N\bigl(\hat{\theta}_{S,j},\, \sigma^2\bigr), \qquad \sigma^2 \propto 1/\lambda,$$

that is, assuming that each of the model parameters is a priori normally distributed with a mean equal to the estimated value of the corresponding parameter in the source model, and a variance inversely proportional to $\lambda$. In this way we can interpret $\hat{\theta}_S$ as setting our “best guess” for the value of our parameter, and $\lambda$ as determining how much weight we place on our prior beliefs. Large values of $\lambda$ lead to small prior variance, and a concentration of probability around our prior guess $\hat{\theta}_S$, and small values spread probability more diffusely, placing less importance on our prior guess.
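As a worked step, the equivalence between the squared penalty and a normal prior can be written out explicitly. The notation here is assumed for illustration: $\mathcal{D}$ is the labelled target data, $\theta$ the model parameters, $\pi(\theta)$ the prior, $\hat{\theta}_S$ the source estimates, $\sigma^2$ a common prior variance and $N$ the number of target samples. Taking the negative logarithm of the unnormalized posterior under independent normal priors gives:

```latex
\begin{aligned}
-\log\bigl[p(\mathcal{D}\mid\theta)\,\pi(\theta)\bigr]
  &= -\log p(\mathcal{D}\mid\theta)
     - \sum_{j}\log N\bigl(\theta_j \mid \hat{\theta}_{S,j}, \sigma^2\bigr) \\
  &= \underbrace{-\log p(\mathcal{D}\mid\theta)}_{\text{summed cross-entropy}}
     + \frac{1}{2\sigma^2}\sum_{j}\bigl(\theta_j-\hat{\theta}_{S,j}\bigr)^2
     + \text{const}.
\end{aligned}
```

Dividing through by $N$ turns the first term into the average loss and leaves a penalty of the same squared form as in the source-regularized loss, with a weight of $1/(2N\sigma^2)$. Under these assumptions the regularization hyperparameter plays the role of an inverse prior variance, which is why a large value concentrates the estimate around the source parameters.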

We note that this idea of using a prior guess and regularizing a loss for estimation of (high dimensional) parameter vectors is itself certainly not new. In fact, the concept dates back as early as the seminal work of James and Stein (1961), a ground-breaking piece of work in which the authors propose the first formal shrinkage estimator. The James-Stein procedure was designed to estimate the mean of a multivariate normal, and was shown to uniformly improve on regular least-squares (i.e., equivalent in our setting to using the target data only) by shrinking the estimates towards a reference point (equivalent to the existence of a source model). This is essentially the same idea that underlies our proposal.

4 Data

All experiments were performed using the Satellite Image Time Series (SITS) data acquired by the Sentinel-2A satellite, starting on 1 January 2016 and running through to 26 December 2016 (its twin satellite Sentinel-2B was launched in March 2017). Table 1 shows the dates of the images for each satellite tile used in our experiments (tiles discussed further in Section 4.3).

| T31TEL | T31TDJ | T32ULU | Interpolated Dates |
|--------|--------|--------|--------------------|
| 12-MAR | 12-JAN | 26-JAN | 01-JAN |
| 22-MAR | 12-MAR | 05-FEB | 11-JAN |
| 08-APR | 22-MAR | 09-MAR | 21-JAN |
| 28-APR | 29-MAR | 26-MAR | 31-JAN |
| 08-MAY | 08-APR | 29-MAR | 10-FEB |
| 18-MAY | 08-APR | 08-APR | 20-FEB |
| 21-MAY | 11-APR | 28-APR | 01-MAR |
| 28-MAY | 18-APR | 05-MAY | 11-MAR |
| 07-JUN | 28-APR | 08-MAY | 21-MAR |
| 20-JUN | 01-MAY | 25-MAY | 31-MAR |
| 27-JUN | 18-MAY | 28-MAY | 10-APR |
| 30-JUN | 21-MAY | 07-JUN | 20-APR |
| 07-JUL | 28-MAY | 24-JUN | 30-APR |
| 10-JUL | 07-JUN | 24-JUN | 10-MAY |
| 17-JUL | 10-JUN | 07-JUL | 20-MAY |
| 20-JUL | 20-JUN | 17-JUL | 30-MAY |
| 30-JUL | 27-JUN | 27-JUL | 09-JUN |
| 06-AUG | 07-JUL | 13-AUG | 19-JUN |
| 16-AUG | 10-JUL | 16-AUG | 29-JUN |
| 19-AUG | 17-JUL | 23-AUG | 09-JUL |
| 26-AUG | 20-JUL | 26-AUG | 19-JUL |
| 29-AUG | 27-JUL | 02-SEP | 29-JUL |
| 05-SEP | 30-JUL | 12-SEP | 08-AUG |
| 08-SEP | 06-AUG | 22-SEP | 18-AUG |
| 25-SEP | 16-AUG | 25-SEP | 28-AUG |
| 28-SEP | 19-AUG | 02-OCT | 07-SEP |
| 05-OCT | 26-AUG | 05-OCT | 17-SEP |
| 15-OCT | 29-AUG | 12-OCT | 27-SEP |
| 18-OCT | 05-SEP | 22-OCT | 07-OCT |
| 18-OCT | 15-SEP | 22-OCT | 17-OCT |
| 18-OCT | 28-SEP | 01-NOV | 27-OCT |
| 18-OCT | 08-OCT | 01-DEC | 06-NOV |
| 07-NOV | 15-OCT | 04-DEC | 16-NOV |
| 17-NOV | 18-OCT | 11-DEC | 26-NOV |
| 27-NOV | 18-OCT | 14-DEC | 06-DEC |
| 04-DEC | 04-NOV | 21-DEC | 16-DEC |
| 07-DEC | 14-NOV | 31-DEC | 26-DEC |
| 14-DEC | 17-NOV |  |  |
| 17-DEC | 27-NOV |  |  |
| 24-DEC | 07-DEC |  |  |
| 27-DEC | 14-DEC |  |  |
|  | 17-DEC |  |  |
|  | 27-DEC |  |  |
Table 1: Original image dates for each tile used in the experiments and the interpolated dates after pre-processing (all from 2016)

4.1 Preprocessing

All Sentinel-2A data have been collected and prepared by our colleagues from the CESBIO lab using iota2 software (Inglada et al., 2016). The key steps in this process are outlined below:

  • Atmospheric, adjacency and slope effects are corrected for using the MAJA processing chain (Hagolle et al., 2015). The outputs are top-of-canopy images with associated cloud masks. We note that only the images with a cloud cover of less than 80% are processed by MAJA.

  • The images are gapfilled using a linear temporal interpolation with a time gap of 10 days, resulting in 37 dates for each pixel (Inglada et al., 2017). Ten days is a natural choice for the time gap as it represents the revisit frequency of one Sentinel-2 satellite. However, the orbit of the satellite results in some overlap between areas, and therefore some pixels are imaged more frequently than others. Thus, gapfilling is a vital processing step to ensure that each pixel has the same number of timestamps. It also allows for the correction of images that are compromised by cloud cover. Table 1 shows the dates of the original images for each satellite tile and the interpolated dates.

  • Each image is comprised of 10 spectral bands—four of which are recorded at a spatial resolution of 10 metres and six that are recorded at a resolution of 20 metres, which are then reinterpolated at 10 metres.
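The gapfilling step in the list above can be sketched for a single pixel and a single band as follows. This is a minimal illustration, not the iota2 implementation: the function names are hypothetical, cloud-masked acquisitions are represented by `None`, and the handling of grid dates that fall before the first or after the last valid observation (nearest valid value) is an assumption, as is the rule for duplicate acquisition dates.

```python
from datetime import date, timedelta

def interpolation_grid(start, end, step_days=10):
    """Regularly spaced dates from start to end (inclusive)."""
    grid, current = [], start
    while current <= end:
        grid.append(current)
        current += timedelta(days=step_days)
    return grid

def gapfill(obs_dates, obs_values, grid):
    """Linearly interpolate valid observations onto the grid dates."""
    # Keep one valid value per acquisition date (later duplicates overwrite
    # earlier ones) and drop cloud-masked observations (None).
    valid = sorted({d.toordinal(): v
                    for d, v in zip(obs_dates, obs_values) if v is not None}.items())
    if not valid:
        raise ValueError("at least one valid observation is required")
    filled = []
    for g in grid:
        x = g.toordinal()
        if x <= valid[0][0]:          # before the first valid observation
            filled.append(valid[0][1])
        elif x >= valid[-1][0]:       # after the last valid observation
            filled.append(valid[-1][1])
        else:                         # between two valid observations
            for (x0, y0), (x1, y1) in zip(valid, valid[1:]):
                if x0 <= x <= x1:
                    filled.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
                    break
    return filled

grid = interpolation_grid(date(2016, 1, 1), date(2016, 12, 26))
obs_dates = [date(2016, 1, 26), date(2016, 2, 5), date(2016, 3, 9)]
obs_values = [0.20, None, 0.63]  # hypothetical band values; None marks cloud
series = gapfill(obs_dates, obs_values, grid)
print(len(grid), series[:8])
```

Starting on 1 January 2016 with a 10-day step, the grid ends on 26 December 2016 and contains 37 dates, matching the interpolated dates in Table 1.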

After this process the resulting instances (pixels) are each represented by a multivariate time series with 10 variables (one for each spectral band) of length 37. The data has been normalized per spectral band using values from the source domain data. Following Pelletier et al. (2019), a variation on min-max normalization has been used, replacing the absolute minimum and maximum values with the 2nd and 98th percentile values, respectively. The percentiles used are estimated using all of the values of the series at each individual timestep. This normalization differs from the usual method for time series classification (Bagnall et al., 2017) but deliberately avoids two potential pitfalls in using standard methods. First, it retains the relative scale of the spectral bands as this is important to SITS data (for instance, in the calculation of normalized difference vegetation index). Second, if the data were normalized per image, the ability to track changes over time would be lost. The normalization method used preserves both the capacity to combine band values and to track changes through time.
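The percentile-based normalization described above can be sketched as follows. This is one possible reading, stated as an assumption: a single pair of limits is fitted per spectral band from all source-domain pixels and timesteps, and the same limits are then applied to any series. The function names and the nested-list data layout are hypothetical, and values outside the fitted range are left unclipped.

```python
def percentile(sorted_values, q):
    """q-th percentile (0-100) of pre-sorted values, with linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac

def fit_band_limits(source_series):
    """Per-band 2nd and 98th percentiles estimated from the source domain.

    source_series[i][t][b] is the value of band b at timestep t for pixel i.
    """
    n_bands = len(source_series[0][0])
    limits = []
    for b in range(n_bands):
        values = sorted(px[t][b] for px in source_series for t in range(len(px)))
        limits.append((percentile(values, 2), percentile(values, 98)))
    return limits

def normalize(series, limits):
    """Apply (x - p2) / (p98 - p2) per band to one pixel's time series."""
    return [[(x - lo) / (hi - lo) for x, (lo, hi) in zip(step, limits)]
            for step in series]

# Toy source domain: one pixel, 101 timesteps, one band with values 0..100.
source = [[[float(v)] for v in range(101)]]
limits = fit_band_limits(source)
print(limits, normalize([[50.0]], limits))
```

Because the limits come from the source domain only, target-domain series are scaled with the same constants, which keeps the inputs seen by the pre-trained model on a consistent scale.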

4.2 Reference Data

The reference data are the same as those used previously to produce a land cover map of France in Inglada et al. (2017). The reference data originate from four sources:

  1. The Agricultural Land Parcel Information System (2016) (Registre Parcellaire Graphique): a compilation of data gathered from farmers’ declarations of agricultural land (Cantelaube and Carles, 2015).

  2. Urban Atlas (2012): a land cover dataset gathered by the European Environment Agency (EEA) detailing the land cover of cities in continental Europe at a very high resolution (2.5 metres) using 27 urban classes (Lavalle et al., 2002).

  3. The CORINE Land Cover Inventory (CLC 2012): an inventory of land cover information gathered by the EEA using 44 land cover classes at a spatial resolution of 24-56 metres (Bossard et al., 2000).

  4. French National Geographic Institute ‘BD-Topo’: a national topographical map produced by the government of France (Maugeais et al., 2011).

Information from these sources has been amalgamated to create a dataset using a nomenclature of 30 land cover classes:

  • Five urban classes: High-density Urban, Low-Density Urban, Industrial, Parking, Roads;

  • Fourteen vegetation classes: Rapeseed, Winter Wheat and Barley, Spring Barley, Pea, Soy, Sunflower, Corn, Corn silage, Rice, Beetroot, Potatoes, Grassland, Orchards, Vineyards;

  • Seven natural and semi-natural classes: Deciduous Forest, Coniferous Forest, Lawn, Woodlands, Surface Minerals, Beaches and Dunes, Glaciers; and,

  • Four other classes: Peat, Marshland, Inter-tidal Land, Water.

4.3 Source and Target Tiles

The experiments were conducted on three study areas: we used one as a source domain and two as target domains, with each area representing a Sentinel-2 tile (110 km × 110 km). The experiment regions are all located in France (see map in Figure 4) as there is full reference data available as outlined in Section 4.2. The source domain (tile T31TEL) was chosen at random amongst the available Sentinel-2 tiles and is located within a highland region known as Massif Central (45.1°N, 2.6°E). The two target tiles were chosen specifically to observe the variation in results between a target region with a similar climatic profile (T32ULU) and a target region with a very different climatic profile (T31TDJ). Target domain T31TDJ is located near the city of Toulouse in south-west France (43.6°N, 1.4°E), and T32ULU is located in the north-eastern region of France called Grand Est, which includes the city of Strasbourg (48.4°N, 7.5°E).

Figure 4: Climate map of France from (Joly et al., 2010) with our three study regions identified

Our colleagues at CESBIO who provided us with the preprocessed data also provided us with predefined train and test sets per tile. We have chosen not to modify this split as it has been performed such that instances that belong to the same polygon are in the same set—ensuring independence between training and testing sets (as per Roberts et al. (2017)). In our data, a polygon represents a contiguous area of one land cover class (a corn crop, a river, an industrial estate, etc.) and consequently instances from the same polygon have near-identical profiles. For example, Figure 5 depicts three of the spectral bands from three different pixels of sunflower from within the same polygon. The similarity between these instances demonstrates that if the data were split at random, rather than blocked by polygon, and these instances were distributed to both the train and test sets, the problem of classifying them would be trivial.

Figure 5: The green, red and near infrared reflectance time series for 3 different sunflower pixels located in the same polygon.
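The idea of blocking the split by polygon can be sketched as below. The split used in the paper was predefined by CESBIO, so this is only a hypothetical illustration: the function name, the `test_fraction` value and the toy polygon identifiers are assumptions.

```python
import random
from collections import defaultdict

def split_by_polygon(polygon_ids, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) such that no polygon appears in both sets."""
    by_polygon = defaultdict(list)
    for idx, pid in enumerate(polygon_ids):
        by_polygon[pid].append(idx)

    # Shuffle the polygons, not the pixels, and hold out whole polygons.
    polygons = sorted(by_polygon)
    random.Random(seed).shuffle(polygons)
    n_test = max(1, round(test_fraction * len(polygons)))
    test_polygons = set(polygons[:n_test])

    train_idx, test_idx = [], []
    for pid, indices in by_polygon.items():
        if pid in test_polygons:
            test_idx.extend(indices)
        else:
            train_idx.extend(indices)
    return sorted(train_idx), sorted(test_idx)

# Toy example: ten pixels belonging to four polygons.
polygon_ids = [7, 7, 7, 3, 3, 9, 9, 9, 9, 4]
train_idx, test_idx = split_by_polygon(polygon_ids)
train_polygons = {polygon_ids[i] for i in train_idx}
test_polygons = {polygon_ids[i] for i in test_idx}
```

Splitting at the pixel level instead would place near-identical time series from one polygon on both sides of the split, which is the leakage the blocked split avoids.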

Table 2 displays the total number of instances per domain and per set. We note that while this shows all of the target training data, we conduct our experiments under the condition that only a predetermined quantity is available (per experiment), and we study the evolution of test accuracy for increasing quantities of target training data.

Set Source (T31TEL) Target 1 (T31TDJ) Target 2 (T32ULU)
Train 12,647,452 8,758,196 15,122,125
Test - 3,371,843 5,599,461
Table 2: Total number of train and test instances (pixels) available for each domain.

A comparison of the land cover classes of the training data for the regions is displayed in Table 3.

Label Description Source (T31TEL) Target 1 (T31TDJ) Target 2 (T32ULU)
1 Urban (high density) 16709 (0.13%) 18242 (0.14%) 9871 (0.08%)
2 Urban (low density) 740326 (5.85) 307343 (2.43) 942652 (7.45)
3 Industrial 502479 (3.97) 188150 (1.49) 649285 (5.13)
4 Parking 9198 (0.07) 2779 (0.02) 20469 (0.16)
5 Road 57634 (0.46) 8898 (0.07) 74980 (0.59)
6 Rapeseed 79401 (0.63) 247425 (1.96) 291462 (2.3)
7 Wheat & Barley (winter) 509236 (4.03) 773027 (6.11) 444315 (3.51)
8 Barley (spring) 22135 (0.18) 45397 (0.36) 81282 (0.64)
9 Pea 15298 (0.12) 103890 (0.82) 49802 (0.39)
10 Soy 6296 (0.05) 262310 (2.07) 177084 (1.4)
11 Sunflower 298067 (2.36) 1823222 (14.42) 106120 (0.84)
12 Corn 609941 (4.82) 305467 (2.42) 2204111 (17.43)
13 Corn silage 827715 (6.54) 339535 (2.68) 644489 (5.1)
15 Beetroot 207575 (1.64) 9636 (0.08) 302543 (2.39)
16 Potatoes 26617 (0.21) 3465 (0.03) 40855 (0.32)
17 Grassland 2277897 (18.01) 533604 (4.22) 1119211 (8.85)
18 Orchards 527 (<0.01) 16434 (0.13) 7337 (0.06)
19 Vineyards 2578 (0.02) 357489 (2.83) 19145 (0.15)
20 Deciduous forest 1088129 (8.6) 926583 (7.33) 1972989 (15.6)
21 Coniferous forest 4732777 (37.42) 1091930 (8.63) 5373252 (42.48)
22 Lawn 128711 (1.02) 682709 (5.4) 148231 (1.17)
23 Woodlands 347466 (2.75) 368759 (2.92) 52902 (0.42)
24 Minerals 381 (<0.01) 8483 (0.07) 2072 (0.02)
27 Peat 0 (0) 0 (0) 5324 (0.04)
28 Marshland 0 (0) 71985 (0.57) 7130 (0.06)
30 Water 140359 (1.11) 261436 (2.07) 375212 (2.97)
TOTALS 12,647,452 (100) 8,758,196 (100) 15,122,125 (100)
Table 3: Distribution of land cover classes across each domain.

5 Experiments

In the following experiments, a given run was performed by training a model using all of the available source training data and a fixed quantity of target training data. The target data represent a fixed number of polygons, rather than a fixed number of instances. A polygon represents a contiguous area with the same land cover (e.g. a farm, forest or residential area), meaning that the number of instances in a polygon can range from as few as 7 to well over 1,000. On average, tile T31TDJ has 336 instances (i.e. time series) per polygon and tile T32ULU has 279 instances per polygon.

Treating the target data in this manner makes the problem more realistic, but also more difficult. It is more realistic because in practice reference data are collected per site (polygon), and not per satellite pixel. It is more difficult because, rather than forming a growing random sample, the data are not distributed across the whole domain and do not reflect the true class distribution of the area. For example, if an experiment is performed with 10 polygons of target data this will equate to approximately 3,000 training instances, but it will represent at most 10 classes from the target domain and all the instances within one of these classes will be quite similar. A sample of this nature is more difficult to learn from than a sample of the same size that is randomly selected across the whole target domain. The number of polygons was increased according to the following schedule:

  • no. of polygons:

Each experiment was repeated five times, and to enable comparison between runs, a linear interpolation of the test accuracies was applied to give the test accuracy for specific quantities of training data (number of pixels). To enable comparison between methods, the interpolated results from each of the five runs were averaged. All experiments were performed using an implementation in PyTorch 1.3.1 (Paszke et al., 2019). Our code and the results of the experiments are available at: https://github.com/benjaminmlucas/sourcerer.
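A minimal sketch of the interpolate-then-average step is given below. Each run uses a different set of polygons, so it reaches different pixel counts, and interpolation puts all runs on a common grid. The function name, the grid and the accuracy values are invented for illustration and are not results from the paper.

```python
import numpy as np

def average_interpolated_accuracy(runs, grid):
    """Interpolate each run's accuracy curve onto a shared grid, then average.

    runs: list of (n_instances, accuracies) pairs, one pair per run, with
          n_instances in increasing order.
    grid: pixel counts at which the runs are compared.
    """
    curves = [
        np.interp(grid, np.asarray(n, dtype=float), np.asarray(acc, dtype=float))
        for n, acc in runs
    ]
    return np.mean(curves, axis=0)

# Two made-up runs that were evaluated at different pixel counts.
runs = [
    ([1000, 3000, 9000], [0.70, 0.74, 0.78]),
    ([1500, 2500, 10000], [0.71, 0.73, 0.79]),
]
grid = [2000, 5000]
mean_accuracy = average_interpolated_accuracy(runs, grid)
```

Note that `np.interp` holds the end values constant outside the measured range, so the grid should stay within the pixel counts covered by every run.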

5.1 Experimental Settings

This section describes the seven experimental settings that were compared in our experiments:

  • Sourcerer;

  • 4 baseline configurations: Source Only, Target Only, Naive TempCNN, and Finetuned TempCNN;

  • 2 state-of-the-art semi-supervised DA methods: MiniMax Entropy (MME), and Domain-adversarial Neural Networks (DANN).

These configurations are detailed below.

5.1.1 Sourcerer

Sourcerer starts with a TempCNN model trained on the source domain data. The weights of this model are used as the initial values for training on the labelled target data with the amount the model is allowed to vary from these values (), based on the quantity of labelled target data. We note that this highlights a significant benefit of Sourcerer—that its prerequisites for use are only to have available a pre-trained model and the labelled target data. That is, the practitioner does not have to be in possession of the source data to apply our method, as opposed to MME and DANN (presented in Section 5.1.3), where all labelled instances (source and target) are pooled and the model is trained on this pooled data, and therefore these methods require all of the labelled source data to be available. This can be of significant practical benefit as labelled training data from a Sentinel-2 tile is of the order of hundreds of gigabytes and using Sourcerer means that this only has to be stored and used for training on one occasion.
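The effect of regularizing towards the source weights can be illustrated with a toy, one-parameter example. The sketch assumes the regularizer is a squared distance to the source weight, scaled by a strength `lam`. This form, the scalar setting and all names are assumptions for illustration; the actual SourceRegLoss is defined in Equation 1.

```python
def source_regularized_objective(w, w_source, target_optimum, lam):
    """Toy objective: fit to the target data plus a pull towards the source weight."""
    data_fit = (w - target_optimum) ** 2
    penalty = lam * (w - w_source) ** 2
    return data_fit + penalty

def minimizer(w_source, target_optimum, lam):
    """Closed-form minimizer, from setting the derivative of the objective to zero."""
    return (target_optimum + lam * w_source) / (1.0 + lam)

w_source, target_optimum = 1.0, 3.0

# Strong regularization (little target data): the weight stays near the source.
strong = minimizer(w_source, target_optimum, lam=1e6)
# Weak regularization (plentiful target data): the weight moves to the target fit.
weak = minimizer(w_source, target_optimum, lam=1e-6)
# Equal weighting lands halfway between the two.
balanced = minimizer(w_source, target_optimum, lam=1.0)
```

In this toy setting the solution is a weighted average of the source weight and the target-only optimum, which mirrors the intended behaviour: the model behaves like Source Only when target data are scarce and like Target Only when they are plentiful.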

The value of is a function of a single hyperparameter: (see Section 3.3), which we have set to for all of our experiments. We believe this to be a reasonable choice as a training set of instances provides enough variation to train an accurate model and thus, it is unlikely that restricting the model’s learning by regularizing towards the source parameters will be beneficial to the overall accuracy (we note that a sensitivity analysis is provided in section 5.2.4). Substituting this value into equation 6, gives and the final schedule for our values as:

(7)

Once is calculated the model is trained with SourceRegLoss (Equation 1) using the Adam optimizer. We vary the number of epochs used for training with the quantity of training data such that

gradient updates have been performed or 1 epoch is completed (see Equation 8), as due to the large quantity of training data, little learning occurs beyond this point (and this was shown specifically for TempCNN in (Pelletier et al., 2019)).

(8)

5.1.2 Baseline Configurations

The following four settings correspond to methods that can map the target domain without using any DA methods, and in doing so, represent various lower bounds for our method to compare to (also see Lucas et al. (2019) for a discussion of the performance of baseline CNN configurations).

Source Only

This is the baseline configuration in which a model is trained on labelled source data only. This is the simplest setting as it is independent of the amount of labelled target data available, and hence returns only a single value for test accuracy. This configuration sets the lower bound we would expect for test accuracy when no target data is available and no DA method is applied. For comparative purposes, we have used the TempCNN model, the same categorical cross-entropy loss function, and the same number of training epochs as used for Sourcerer.

Target Only

This is a baseline configuration in which the only labelled data used for training the model are from the target domain. This configuration also acts as a lower bound on the test accuracy for when DA is no longer required, that is, when enough target data are available to train an accurate classifier. As in the Source Only configuration, we have used the same TempCNN model, categorical cross-entropy loss function and number of training epochs as for Sourcerer.

Naive TempCNN

This is a baseline configuration in which a TempCNN model is first trained on the labelled source data and then trained on the labelled target data, without applying a particular DA method.

Finetuned TempCNN

This is a baseline configuration where the TempCNN is first trained on the labelled source data, at which point the weights of the convolutional layers are frozen, and only the fully-connected layer(s) and the softmax are finetuned by training on the labelled target data.

This technique is common in transfer learning for computer vision problems (Yosinski et al., 2014) as the convolutional layers of a CNN learn general features of data while the fully-connected layers learn specifics. This configuration tests whether the general features of SITS data can first be learnt from any source domain and then finetuned using the available labelled target data.

5.1.3 State-of-the-art Methods

The two methods presented here represent the current state of the art in semi-supervised DA. Like Sourcerer, they are both deep learning-based methods; however, unlike our method, each of them requires the labelled source data to be available to train the model for the target domain. These methods concatenate the labelled source and labelled target data and train on this pooled dataset. While we note that using the labelled data in this manner creates a different (and easier) problem, these methods have been included to illustrate that our method is competitive with, or indeed outperforms, the current state of the art, with the additional benefit of only requiring a model pre-trained on the source domain (not all of the source data).

MME

This state-of-the-art method (Saito et al., 2019) is based on training a CNN model using 2 loss functions. It learns a prototype of each class from the labelled data and then minimizes the distance between these prototypes and the unlabelled data, in the process learning discriminating features of the data. It can be implemented on any CNN model, so to control for model choice we have implemented it using the TempCNN architecture used in our model. Training occurs in two steps: first, a batch of labelled data (pooled from both source and target domains) is passed forward through the model, a standard loss function is calculated and the weights are updated via backpropagation; then, a batch of unlabelled data is passed forward through the model and an entropy loss function is calculated using the output of the convolutional layers of the model. Once trained, unlabelled target data are tested as per a standard CNN, via a forward pass of the model.
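To make the entropy term concrete, the sketch below computes the mean prediction entropy of a batch of unlabelled instances. It assumes class logits are already available and only evaluates the entropy quantity; it does not reproduce the adversarial updates of MME, and the function names and example logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def mean_prediction_entropy(logits, eps=1e-12):
    """Average Shannon entropy (in nats) of the predicted class distributions."""
    probs = softmax(logits)
    return float(-(probs * np.log(probs + eps)).sum(axis=1).mean())

n_classes = 30  # size of the land cover nomenclature used in this paper

# Uninformative predictions: identical logits give a uniform distribution.
uniform_logits = np.zeros((4, n_classes))
# Confident predictions: one logit dominates in every row.
confident_logits = np.zeros((4, n_classes))
confident_logits[:, 0] = 20.0

entropy_uniform = mean_prediction_entropy(uniform_logits)
entropy_confident = mean_prediction_entropy(confident_logits)
```

The entropy is largest (the logarithm of the number of classes) when the model is maximally uncertain and approaches zero for confident predictions, which is why it is a usable training signal on data without labels.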

DANN

This state-of-the-art method (Ganin et al., 2016) is based on maximizing accuracy in predicting the class label of an instance while not being able to tell whether it was from the source or target domain. Learning in this manner discourages the model from learning features that are specific to a domain. To achieve this it uses a CNN model with two output layers—one for the class and one for the domain. In our case, each instance has a class label (land cover: 0-29) and a domain label (binary: source/target) and the loss function is the addition of the loss calculated using the class labels and the inverse of the loss calculated using the domain labels. For unlabelled instances, there will be only the domain label available. We have used a TempCNN model with two outputs following the convolutional layers, each with a fully connected layer and softmax. Once trained, unlabelled target data is tested via a forward pass of the model and the class labels are recorded (the domain labels are ignored).
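The combined objective described above can be sketched as follows, reading "the inverse of the loss" as its negation. The weighting factor `domain_weight`, the function names and the toy probabilities are assumptions for illustration; this is not the DANN training procedure itself, only the value of the loss seen by the shared feature layers.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + eps).mean())

def dann_feature_loss(class_probs, class_labels, domain_probs, domain_labels,
                      domain_weight=1.0):
    """Class loss minus the weighted domain loss."""
    class_loss = cross_entropy(class_probs, class_labels)
    domain_loss = cross_entropy(domain_probs, domain_labels)
    return class_loss - domain_weight * domain_loss

# Two labelled instances with three classes (toy numbers).
class_probs = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
class_labels = np.array([0, 1])
domain_labels = np.array([0, 1])  # 0 = source, 1 = target

# Features that reveal the domain: the domain output is confident and correct.
domain_separable = np.array([[0.95, 0.05], [0.05, 0.95]])
# Domain-invariant features: the domain output is at chance level.
domain_confused = np.array([[0.5, 0.5], [0.5, 0.5]])

loss_separable = dann_feature_loss(class_probs, class_labels,
                                   domain_separable, domain_labels)
loss_confused = dann_feature_loss(class_probs, class_labels,
                                  domain_confused, domain_labels)
```

With the class predictions held fixed, the loss is lower when the domain output cannot separate source from target, so minimizing it rewards features that are not specific to either domain.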

5.2 Results

The following section presents the results of semi-supervised DA experiments performed on two source-target domain pairs:

  1. T31TEL (source)–T31TDJ (target); and

  2. T31TEL (source)–T32ULU (target).

The results are presented for our method, Sourcerer, the two state-of-the-art methods, as well as the four baseline configurations. In each instance, the overall accuracy is plotted against an increasing amount of labelled target training data. Then in Section 5.2.3 we present a visual analysis of land cover maps produced in various experiments.

5.2.1 Sourcerer Versus the State of the Art

We start with the most challenging comparison: against MME and DANN. Figure 6 shows the average overall accuracy for Sourcerer against the state-of-the-art methods, DANN and MME, for each target tile. It is evident from these plots that Sourcerer produces a higher test accuracy than either DANN or MME for any given quantity of labelled target data. In fact, on tile T31TDJ the best accuracy achieved by DANN, 76.8% (when training on 1M labelled instances), is achieved by Sourcerer when training on only 25,000 instances. On tile T32ULU, DANN achieves 86.5% accuracy using 1M labelled target instances, which is below the initial test accuracy of Sourcerer (87.5%), that is, without any adaptation.

For each experiment, MME starts with the lowest overall accuracy but increases noticeably as more target data become available. When around 1M target instances are available, it produces a test accuracy within 0.5% of Sourcerer, for each target tile. The improvement indicates that the MME method is learning the difference between the domains; however, it is not learning quickly enough for our application, where more than 1M instances are unlikely to be available.

We reiterate that not only is Sourcerer outperforming each of these methods, but it is doing so in a more convenient manner. Both MME and DANN use the labelled source data in the training process, whereas once a model is trained on the source data, Sourcerer can use this pre-trained model and the target data to map any target region.

(a) T31TDJ: x-axis in log scale
(b) T32ULU: x-axis in log scale
(c) T31TDJ: x-axis in linear scale
(d) T32ULU: x-axis in linear scale
Figure 6: Average overall accuracy for Sourcerer against the state-of-the-art methods, DANN and MME, and 2 baseline configurations. Models were trained on the source domain (T31TEL) and increasing quantities of labelled target data. Results are shown for target domains T31TDJ in (a) and (c) and T32ULU in (b) and (d).

5.2.2 Sourcerer Versus the Baseline Configurations

We now turn to the comparison of Sourcerer with baseline configurations. Figure 7 shows the average overall accuracy for Sourcerer against the baseline configurations–Naive TempCNN, Finetuned TempCNN, Target Only and Source Only–for each target tile. It shows that for each quantity of target data available, Sourcerer is either equal to or exceeds the performance of all baseline configurations. This aligns with the intuition of how Sourcerer is designed—for small quantities of target data, the model parameters will be heavily regularized towards those learned on the source data, and hence returns the same accuracy as Source Only; while for large quantities (where DA is not necessary), the model is allowed to learn from the available data and hence returns the same accuracy as Target Only. The model gradually increases in accuracy between these two extreme situations.

Comparing the performance of Sourcerer on the two tiles, it is evident that the magnitude of its benefit is dependent on the similarity of the source and target domains. For example, when 25,000 labelled target instances are available Sourcerer outperforms Source Only by 2.5% on target tile T31TDJ (74.2% to 76.8%) where the domains are less similar climatically, compared to 0.3% (87.5% to 87.8%) on tile T32ULU, where the domains are more similar.

An interesting result is also present in the Naive TempCNN and Finetuned TempCNN experiments. In these settings, the model initially decreases in accuracy when trained only with labelled target data (see Figure 7). On target tile T31TDJ, Source Only achieves a test accuracy of 0.744, while the Naive TempCNN dips to as low as 0.646 when 1,000 labelled target instances are available (approximately 3-4 polygons), before increasing again and becoming more accurate when moderate-to-large quantities of data are available. Similarly, on tile T32ULU the model begins at 0.875 test accuracy and drops to 0.812 before increasing again. A similar pattern is observed in the Finetuned TempCNN on each target tile.

This dip occurs for two reasons: (1) The available target training data originates from few polygons, and consequently the model overfits the classes present in the target data; and (2) There are some classes present in the target domain that were absent from the source, which significantly shifts the weights of the TempCNN model when they are presented for the first time (Lucas et al., 2019). These results demonstrate that the convolutions of a TempCNN cannot overcome the domain shift alone and that a semi-supervised DA method like Sourcerer is necessary for optimal accuracy.

For the Target Only configuration, it takes well over 1M training instances to reach the performance of Sourcerer on each target tile, re-emphasizing the case for an accurate semi-supervised DA method. On target tile T31TDJ, Target Only needs 100,000 labelled instances to achieve 75% test accuracy, whereas Sourcerer needs only 1,000. On tile T32ULU, Target Only requires 500,000 training instances to achieve the starting accuracy of Sourcerer (87.5%).

Figure 7: Average overall accuracy for Sourcerer against the Baseline configurations, trained on the source domain (T31TEL) and increasing quantities of labelled target data for domains T31TDJ (a) and T32ULU (b).

5.2.3 Visual Analysis of Results

Figure 8: A false-color Sentinel-2 image, land cover maps produced using DANN and Sourcerer, and the ground truth land cover classes. Maps were created with 64 polygons of labelled target data available (approximately 12,000 instances). Legend provided in Table 4.

In this section, we illustrate what the differences in overall accuracy mean for the resulting land cover maps. Figure 8 shows two land cover maps produced using DANN and Sourcerer, each trained on 64 labelled target polygons (approximately 12,000 instances), in comparison with the ground truth polygons from the test data. The two maps disagree over large areas of agricultural land, with the DANN-based model classifying large amounts of corn where Sourcerer classified soy. As soy and corn are both summer crops, their spectral profiles appear similar and an accurate classifier is required to separate them correctly. In this case, we can see from the test data that the correct land cover for these polygons is soy, as predicted by Sourcerer.

When more data are available the differences between the maps produced are more subtle. Figure 9 shows land cover maps produced by training on 512 labelled target polygons (approximately 99,000 instances) using MME and Sourcerer, as well as a Sentinel-2 false-color image and the ground truth. If we compare how each method classifies the rapeseed crop (crop A in the ground truth subfigure), it can be seen that MME correctly classifies few pixels of this crop, while Sourcerer accurately classifies almost the whole crop. When we consider the corn crop (B) located just below the image’s center, the MME-based model classifies approximately half of this crop as corn silage, while Sourcerer classifies almost the complete polygon correctly.

Color Class Color Class Color Class
Urban (high density) Soy Deciduous forest
Urban (low density) Sunflower Coniferous forest
Industrial Corn Lawn
Parking Corn silage Woodlands
Road Beetroot Minerals
Rapeseed Potatoes Peat
Wheat & Barley Grassland Marshland
Barley (spring) Orchards Water
Peas Vineyards
Table 4: Legend of land cover classes for Figures 8 and 9
Figure 9: A false-color Sentinel-2 image, land cover maps produced using MME and Sourcerer, and the ground truth land cover classes. Maps were created with 512 polygons of labelled target data available (approximately 99,000 instances). Legend provided in Table 4.

5.2.4 Sensitivity Analysis of

Sourcerer has only one user-defined hyperparameter, , which represents the quantity of labelled target data at which the regularization applied to the model approaches zero (as discussed in Section 3.3)— it represents the quantity of target data at which we would no longer require source data and DA to learn an accurate model. We have performed experiments on each target tile with three different values of , (the default value), and . Figure 10 shows the average overall accuracy for the three models for each target tile.

Figure 10: Average overall accuracy for Sourcerer with different values for the hyperparameter , trained on the source domain (T31TEL) and increasing quantities of labelled target data for domains T31TDJ (a) and T32ULU (b).

On tile T31TDJ, each of the three models of Sourcerer outperforms the state of the art for all quantities of labelled target data. This is also the case for tile T32ULU for models with of and . The model with of does dip below the performance of MME when around 8,000 target instances are used. This dip in performance is the same as that displayed by the Naive TempCNN in Section 5.2.2, and indicates that a value of for the does not regularize the model sufficiently.

The results for the other two models show that the choice of being either or will produce similar performance, and thus any attempt to optimize this value further is not likely to be necessary.

6 Conclusion and Future Work

In this paper we presented Sourcerer, a Bayesian-inspired, deep learning-based, semi-supervised DA technique for producing land cover maps from SITS data. The technique takes a CNN trained on a source domain and treats this as a prior distribution for the weights of the model, with the degree to which the model is modified to fit the target domain limited by the quantity of labelled target data available.

Our experiments using Sentinel-2 time series images showed that Sourcerer outperforms all other methods for any quantity of labelled target data available on two different source-target domain pairings. On the more difficult target domain, the starting accuracy (when no labelled target data are available) of Sourcerer is 74.2%, and this is greater than the next-best state-of-the-art method when trained on 20,000 labelled target instances.

Sourcerer’s high accuracy is also complemented by its straightforward manner of application, as it only requires a model pre-trained on the source domain, rather than all of the source data. This offers great promise for efficiently mapping resource-poor areas, as the practitioner only has to download a model, and not millions of instances of source domain data.

The Bayesian connection of our method offers the possibility for further improvements to Sourcerer. James and Stein (1961) show that the optimal choice of

is inversely proportional to an unbiased estimate of the Kullback–Leibler divergence between the target only model and the reference (source) model; that is, the more the target only model differs from the reference model, the less weight should be placed on the reference. Though accurate estimation of Kullback–Leibler divergences between neural networks is difficult, a similar idea could potentially be adapted for use in Sourcerer to refine the selection of

.

Another way of choosing , would be in a more formal, and data-driven manner, with a prior distribution placed on , and it integrated directly into the posterior distribution. In this manner an appropriate value for could be estimated directly from the target data by a straightforward integration into a posterior sampling scheme or a variational Bayes approach, both of which are gaining popularity in the neural network community.

Supplementary material

To aid replication, the code for our method and the raw results of all experiments are available at https://github.com/benjaminmlucas/sourcerer.

Acknowledgements

The authors would like to thank our colleagues from the CESBIO laboratory (in particular Jordi Inglada and Olivier Hagolle) for providing us with the corrected Sentinel-2 data and associated labels.
This research was supported by the Australian Research Council under grant DE170100037.

References

  • Armsworth et al. (2006) Armsworth PR, Daily GC, Kareiva P, Sanchirico JN (2006) Land market feedbacks can undermine biodiversity conservation. Proceedings of the National Academy of Sciences 103(14):5403–5408
  • Asner et al. (2005) Asner GP, Knapp DE, Broadbent EN, Oliveira PJC, et al. (2005) Selective logging in the Brazilian Amazon. Science 310(5747):480–2
  • Azzari and Lobell (2017) Azzari G, Lobell D (2017) Landsat-based classification in the cloud: An opportunity for a paradigm shift in land cover monitoring. Remote Sensing of Environment 202:64–74
  • Bagnall et al. (2017) Bagnall A, Lines J, Bostrom A, Large J, Keogh E (2017) The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery 31(3):606–660
  • Bailly et al. (2017) Bailly A, Chapel L, Tavenard R, Camps-Valls G (2017) Nonlinear time-series adaptation for land cover classification. IEEE Geoscience and Remote Sensing Letters 14(6):896–900, DOI 10.1109/LGRS.2017.2686639
  • Bailly et al. (2018) Bailly S, Giordano S, Landrieu L, Chehata N (2018) Crop-rotation structured classification using multi-source sentinel images and lpis for crop type mapping. In: IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, pp 1950–1953
  • Bojinski et al. (2014) Bojinski S, Verstraete M, Peterson TC, Richter C, Simmons A, Zemp M (2014) The concept of essential climate variables in support of climate research, applications, and policy. Bulletin of the American Meteorological Society 95(9):1431–1443
  • Bossard et al. (2000) Bossard M, Feranec J, Oťaheľ J (2000) Corine land cover technical guide. Tech. rep., European Environment Agency, Copenhagen, Denmark
  • Cantelaube and Carles (2015) Cantelaube P, Carles M (2015) Le registre parcellaire graphique : des données géographiques pour décrire la couverture du sol agricole. Cahier des Techniques de l’INRA pp 58–64
  • Courty et al. (2016) Courty N, Flamary R, Tuia D, Corpetti T (2016) Optimal transport for data fusion in remote sensing. In: 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp 3571–3574, DOI 10.1109/IGARSS.2016.7729925
  • Courty et al. (2016) Courty N, Flamary R, Tuia D, Rakotomamonjy A (2016) Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(9):1853–1865
  • Dalessandro et al. (2014) Dalessandro B, Chen D, Raeder T, Perlich C, Han Williams M, Provost F (2014) Scalable hands-free transfer learning for online advertising. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 1573–1582
  • Damodaran et al. (2018) Damodaran BB, Kellenberger B, Flamary R, Tuia D, Courty N (2018) Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer Vision – ECCV 2018, Springer International Publishing, Cham, pp 467–483
  • Damodaran et al. (2020) Damodaran BB, Flamary R, Seguy V, Courty N (2020) An entropic optimal transport loss for learning deep neural networks under label noise in remote sensing images. Computer Vision and Image Understanding 191:102863, DOI https://doi.org/10.1016/j.cviu.2019.102863, URL http://www.sciencedirect.com/science/article/pii/S1077314219301559
  • Defourny et al. (2019) Defourny P, Bontemps S, Bellemans N, Cara C, Dedieu G, Guzzonato E, Hagolle O, Inglada J, Nicola L, Rabaute T, et al. (2019) Near real-time agriculture monitoring at national scale at parcel resolution: Performance assessment of the Sen2-Agri automated system in various cropping systems around the world. Remote Sensing of Environment 221:551–568
  • Demir et al. (2013) Demir B, Bovolo F, Bruzzone L (2013) Updating land-cover maps by classification of image time series: A novel change-detection-driven transfer learning approach. IEEE Transactions on Geoscience and Remote Sensing 51(1):300–312
  • Fernando et al. (2013) Fernando B, Habrard A, Sebban M, Tuytelaars T (2013) Unsupervised visual domain adaptation using subspace alignment. In: 2013 IEEE International Conference on Computer Vision, pp 2960–2967, DOI 10.1109/ICCV.2013.368
  • Frenay and Verleysen (2014) Frenay B, Verleysen M (2014) Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems 25(5):845–869, DOI 10.1109/TNNLS.2013.2292894
  • Ganin et al. (2016) Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, Marchand M, Lempitsky V (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1):2096–2030
  • Gong et al. (2012) Gong B, Shi Y, Sha F, Grauman K (2012) Geodesic flow kernel for unsupervised domain adaptation. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 2066–2073
  • Hagolle et al. (2015) Hagolle O, Huc M, Villa Pascual D, Dedieu G (2015) A multi-temporal and multi-spectral method to estimate aerosol optical thickness over land, for the atmospheric correction of FormoSat-2, LandSat, VENμS and Sentinel-2 images. Remote Sensing 7(3):2668–2691
  • Huang et al. (2007) Huang J, Gretton A, Borgwardt K, Schölkopf B, Smola AJ (2007) Correcting sample selection bias by unlabeled data. In: Schölkopf B, Platt JC, Hoffman T (eds) Advances in Neural Information Processing Systems 19, MIT Press, pp 601–608, URL http://papers.nips.cc/paper/3075-correcting-sample-selection-bias-by-unlabeled-data.pdf
  • Inglada et al. (2016) Inglada J, Vincent A, Arias M, Tardy B (2016) iota2: a land cover map production system. DOI 10.5281/zenodo.58150, URL https://doi.org/10.5281/zenodo.58150
  • Inglada et al. (2017) Inglada J, Vincent A, Arias M, Tardy B, Morin D, Rodes I (2017) Operational high resolution land cover map production at the country scale using satellite image time series. Remote Sensing 9(1):95
  • James and Stein (1961) James W, Stein C (1961) Estimation with quadratic loss. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, University of California Press, Berkeley, Calif., pp 361–379, URL https://projecteuclid.org/euclid.bsmsp/1200512173
  • Joly et al. (2010) Joly D, Brossard T, Cardot H, Cavailhes J, Hilal M, Wavresky P (2010) Les types de climats en France, une construction spatiale. Cybergeo: European Journal of Geography
  • Kouw and Loog (2019) Kouw WM, Loog M (2019) A review of domain adaptation without target labels. IEEE Transactions on Pattern Analysis and Machine Intelligence pp 1–1, DOI 10.1109/TPAMI.2019.2945942
  • Kouw et al. (2016) Kouw WM, van der Maaten LJP, Krijthe JH, Loog M (2016) Feature-level domain adaptation. Journal of Machine Learning Research 17(171):1–32, URL http://jmlr.org/papers/v17/15-206.html
  • Lavalle et al. (2002) Lavalle C, Demicheli L, Kasanko M, et al. (2002) Towards an urban atlas. Assessment of spatial data on 25 European cities and urban areas. Environmental issue report. European Environment Agency, Copenhagen
  • Lavorel et al. (2007) Lavorel S, Flannigan MD, Lambin EF, Scholes MC (2007) Vulnerability of land systems to fire: Interactions among humans, climate, the atmosphere, and ecosystems. Mitigation and Adaptation Strategies for Global Change 12(1):33–53
  • Long et al. (2015) Long M, Cao Y, Wang J, Jordan M (2015) Learning transferable features with deep adaptation networks. In: International Conference on Machine Learning, pp 97–105
  • Loveland et al. (2000) Loveland T, Reed B, Brown J, Ohlen D, Zhu Z, Yang L, Merchant J (2000) Development of a global land cover characteristics database and IGBP DISCover from 1 km AVHRR data. International Journal of Remote Sensing 21(6-7):1303–1330
  • Lucas et al. (2019) Lucas B, Pelletier C, Inglada J, Schmidt D, Webb GI, Petitjean F (2019) Exploring data quantity requirements for Domain Adaptation in the classification of satellite image time series. In: IEEE 10th International Workshop on the Analysis of Multitemporal Remote Sensing Images (MultiTemp), pp 1–4
  • Matasci et al. (2012) Matasci G, Tuia D, Kanevski M (2012) SVM-based boosting of active learning strategies for efficient domain adaptation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 5(5):1335–1343, DOI 10.1109/JSTARS.2012.2202881
  • Maugeais et al. (2011) Maugeais E, Lecordix F, Halbecq X, Braun A (2011) Dérivation cartographique multi échelles de la BDTopo de l’IGN France: mise en œuvre du processus de production de la nouvelle carte de base. In: Proceedings of the 25th International Cartographic Conference, Paris, pp 3–8
  • Pan et al. (2011) Pan SJ, Tsang IW, Kwok JT, Yang Q (2011) Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22(2):199–210, DOI 10.1109/TNN.2010.2091281
  • Paszke et al. (2019) Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: An imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R (eds) Advances in Neural Information Processing Systems 32, Curran Associates, Inc., pp 8024–8035, URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  • Patel et al. (2015) Patel VM, Gopalan R, Li R, Chellappa R (2015) Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine 32(3):53–69
  • Pelletier et al. (2017) Pelletier C, Valero S, Inglada J, Champion N, Marais Sicre C, Dedieu G (2017) Effect of training class label noise on classification performances for land cover mapping with satellite image time series. Remote Sensing 9(2):173
  • Pelletier et al. (2019) Pelletier C, Webb GI, Petitjean F (2019) Temporal convolutional neural network for the classification of satellite image time series. Remote Sensing 11(5):523
  • Persello and Bruzzone (2012) Persello C, Bruzzone L (2012) Active learning for domain adaptation in the supervised classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 50(11):4468–4483, DOI 10.1109/TGRS.2012.2192740
  • Roberts et al. (2017) Roberts DR, Bahn V, Ciuti S, Boyce MS, Elith J, Guillera-Arroita G, Hauenstein S, Lahoz-Monfort JJ, Schröder B, Thuiller W, et al. (2017) Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40(8):913–929
  • Saito et al. (2019) Saito K, Kim D, Sclaroff S, Darrell T, Saenko K (2019) Semi-supervised domain adaptation via minimax entropy. In: Proceedings of the IEEE International Conference on Computer Vision, pp 8050–8058
  • Shu et al. (2018) Shu R, Bui HH, Narui H, Ermon S (2018) A DIRT-T approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735
  • Sun and Saenko (2016) Sun B, Saenko K (2016) Deep CORAL: Correlation alignment for deep domain adaptation. In: Hua G, Jégou H (eds) Computer Vision – ECCV 2016 Workshops, Springer International Publishing, pp 443–450
  • Tan et al. (2017) Tan CW, Webb GI, Petitjean F (2017) Indexing and classifying gigabytes of time series under time warping. In: Proceedings of the 2017 SIAM International Conference on Data Mining, pp 282–290, DOI 10.1137/1.9781611974973.32, URL https://epubs.siam.org/doi/abs/10.1137/1.9781611974973.32
  • Tardy et al. (2017) Tardy B, Inglada J, Michel J (2017) Fusion approaches for land cover map production using high resolution image time series without reference data of the corresponding period. Remote Sensing 9(11)
  • Tardy et al. (2019) Tardy B, Inglada J, Michel J (2019) Assessment of optimal transport for operational land-cover mapping using high-resolution satellite images time series without reference data of the mapping period. Remote Sensing 11(9):1047
  • Tuia and Camps-Valls (2016) Tuia D, Camps-Valls G (2016) Kernel manifold alignment for domain adaptation. PLoS ONE 11(2)
  • Tuia et al. (2011) Tuia D, Pasolli E, Emery W (2011) Using active learning to adapt remote sensing image classifiers. Remote Sensing of Environment 115(9):2232–2242
  • Tuia et al. (2016) Tuia D, Persello C, Bruzzone L (2016) Domain adaptation for the classification of remote sensing data: An overview of recent advances. IEEE Geoscience And Remote Sensing Magazine 4(2):41–57
  • Turner et al. (2007) Turner BL, Lambin EF, Reenberg A (2007) The emergence of land change science for global environmental change and sustainability. Proceedings of the National Academy of Sciences of the United States of America 104(52):20666–20671
  • Tzeng et al. (2017) Tzeng E, Hoffman J, Saenko K, Darrell T (2017) Adversarial discriminative domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7167–7176
  • Vuolo et al. (2018) Vuolo F, Neuwirth M, Immitzer M, Atzberger C, Ng WT (2018) How much does multi-temporal Sentinel-2 data improve crop type classification? International Journal of Applied Earth Observation and Geoinformation 72:122–130, DOI 10.1016/j.jag.2018.06.007
  • Wang and Mahadevan (2011) Wang C, Mahadevan S (2011) Heterogeneous domain adaptation using manifold alignment. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, AAAI Press, IJCAI’11, pp 1541–1546
  • Wulder et al. (2018) Wulder MA, Coops NC, Roy DP, White JC, Hermosilla T (2018) Land cover 2.0. International Journal of Remote Sensing 39(12):4254–4284
  • Yosinski et al. (2014) Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? In: Advances in neural information processing systems, pp 3320–3328