1 Introduction
Deep learning has been applied with tremendous success to a variety of tasks in remote sensing image analysis. For instance, the achievement of state-of-the-art performance in scene classification (Cheng et al., 2018; Anwer et al., 2018), pixel-wise labeling of both multispectral (Huang et al., 2018; Audebert et al., 2018; Maggiori et al., 2017) and hyperspectral datasets (Zhong et al., 2018; Wang et al., 2017), object detection (Kellenberger et al., 2018) and image retrieval (Zhou et al., 2018; Ye et al., 2017; Li et al., 2018) highlights the recent success of deep learning models in remote sensing. But this phenomenal performance is highly dependent on the availability of large datasets with accurate annotations (labels). If either the size of the dataset or the accuracy of the labels is insufficient (i.e., small-scale datasets or inaccurate labels), the performance of deep learning methods can suffer drastically. The former can be addressed to some degree by data augmentation strategies; solving the latter case of inaccurate labeling is more difficult.

To address the large-scale data requirements of deep learning methods, new datasets have been proposed recently in the remote sensing community (Zhou et al., 2018; Huang et al., 2018; Cheng et al., 2017; Kemker et al., 2017; Wang et al., 2016; Xia et al., 2017). This trend will continue to grow in the coming years, due for instance to the large constellations of Earth observation satellites. One of the major challenges in collecting such new large-scale data is the accurate labeling of the samples. Manual expert labeling of such large collections of samples is often not feasible and not cost-effective. Thus, labeling is usually performed by non-experts through crowdsourcing (Snow et al., 2008; Haklay, 2010), keyword queries through search engines in the case of images, OpenStreetMap, or outdated classification maps (Kaiser et al., 2017). These cheap surrogate procedures allow scaling up the size of labeled datasets, but at the cost of introducing label noise (i.e., inaccurately labeled samples). Even when experts are involved in labeling the data samples, they must be provided with sufficient information; otherwise inaccurate labeling may still occur (for instance, during a field survey) (Hickey, 1996). Note that in some applications, labeling is a subjective task (Smyth et al., 1995)
that can again introduce label noise. Furthermore, label noise can also occur due to the misregistration of satellite images. Hence, in general, large-scale datasets are likely to contain inaccurately labeled samples, i.e., to be affected by label noise. In this case, when deep learning methods are trained with conventional loss functions (for instance, categorical cross-entropy or mean squared error), they are not robust to label noise, and as a result the classification accuracy decreases significantly
(Zhang et al., 2017). This calls for robust approaches to mitigate the impact of label noise on deep learning methods.

Recently, it was shown that while training deeper neural networks, models tend to memorize the training data, and this phenomenon is more severe when the dataset is affected by label noise (Zhang et al., 2017)
. The impact of label noise on deep learning models can be partly circumvented by regularization techniques such as dropout layers and weight regularization. These standard procedures make neural networks robust to some extent, but they are still prone to memorizing noisy labels at medium-to-large noise levels. The problem of learning with noisy labels has long been studied in machine learning
(Frenay and Verleysen, 2014; Brooks; Zhu and Wu, 2004; Sáez et al., 2014; Hickey, 1996; Smyth et al., 1995; Natarajan et al., 2013), but still only few works have focused on neural networks. Recently, new approaches have been proposed in the computer vision and machine learning fields to tackle label noise by cleaning the noisy labels or designing robust loss functions within the deep learning framework (Jiang et al., 2018; Vahdat, 2017; Patrini et al., 2017).

To mitigate the impact of label noise, one category of methods relies on estimating the noise transition probability, which describes the probability of one class label being mislabeled as another, and uses it to make learning robust to label noise (Vahdat, 2017; Natarajan et al., 2013; Patrini et al., 2017). Among those, some require a small set of clean labels to estimate the noise transition probability (Vahdat, 2017). The other category of methods proposes to use loss functions which are inherently tolerant to label noise (Natarajan et al., 2013; van Rooyen et al., 2015; Masnadi-Shirazi and Vasconcelos, 2008; Ghosh et al., 2015; Aritra et al., 2017). Though these methods provide satisfactory results, none of them considers the implicit local geometric structure of the underlying data.

The primary objective of this paper is to develop a robust approach to tackle label noise in remote sensing image analysis. To the best of our knowledge, the sensitivity of deep neural networks to label noise has not been well studied in remote sensing image analysis so far. Hence the first contribution of this article lies in studying the robustness of deep neural networks to label noise, and in analysing the efficiency of existing robust loss functions for remote sensing classification tasks. The second contribution of this paper is to propose a novel robust solution to tackle label noise based on optimal transportation theory (Villani, 2009)
. Indeed, we propose to learn a deep learning model which is robust to label noise by fitting the model to the label-features joint distribution of the dataset with respect to the entropy-regularized optimal transport distance. We coin this method CLEOT, for Classification Loss with Entropic Optimal Transport. One major advantage of our approach compared to existing methods is that it inherently exploits the geometric structure of the underlying data. A stochastic approximation scheme is proposed to solve the learning problem, which allows the use of our approach within deep learning frameworks. Experiments are conducted on several remote sensing aerial and hyperspectral benchmark datasets, and the results demonstrate that our approach is more robust (tolerant) to high levels of label noise than current state-of-the-art methods.
The remainder of the paper is organized as follows. Section 2 discusses related works, section 3 defines the label noise and describes the problem formulation, and section 4 introduces optimal transport. The proposed method is then presented in section 4.3, while experimental datasets and results are presented in section 5. We finally draw some conclusions in section 6.
2 Related works
2.1 Learning with noisy labels
Label noise and attribute (feature) noise are the two types of noise commonly found in machine learning datasets. Label noise is considered more harmful and more difficult to tackle than attribute noise, and can decrease the classification performance significantly (Zhu and Wu, 2004). Learning with noisy labels with shallow learning methods has been widely investigated in the literature (Frenay and Verleysen, 2014; Brooks; Zhu and Wu, 2004; Sáez et al., 2014; Hickey, 1996; Smyth et al., 1995; Natarajan et al., 2013), but studies in the context of deep learning still remain scarce (though growing recently) (Reed et al., 2015; Vahdat, 2017; Hendrycks et al., 2018; Patrini et al., 2017). Among the several methods which have been proposed to robustly train deep neural networks on datasets with noisy labels, a first set of methods approaches the problem from the perspective of cleaning the noisy labels and using the clean estimated labels for training, or smoothly reduces the impact of noisy labels by putting smaller weights on noisy samples, either through directed graphical models (Xiao et al., 2015), conditional random fields (Vahdat, 2017), knowledge graph distillation (Li et al., 2017), meta-learning (Ren et al., 2018) or noise-transition matrix estimation (Hendrycks et al., 2018). But those methods require an additional small subset of data with clean labels, or require ground truth of pre-identified noisy labels, in order to model the noise in the dataset. A second kind of methods tries to detect clean instances among the noisy ones, and uses them to update the parameters of the trained neural network (Jiang et al., 2018; Ding et al., 2018). In this category, two deep networks or a two-stage framework are employed to remove noisy label instances. The last kind of methods designs robust loss functions or loss correction approaches. The robust loss functions unhinged (van Rooyen et al., 2015), savage (Masnadi-Shirazi and Vasconcelos, 2008), sigmoid and ramp (Ghosh et al., 2015) are inherently robust to label noise with associated theoretical bounds. Most of these methods rely on the assumption of a symmetric loss function. The loss correction approaches adjust the loss function to eliminate the influence of the noisy labels, either through the forward and backward correction approach (Patrini et al., 2017) using a noise transition model estimated from the noisy labeled data, by adding a linear layer on top of the softmax layer (Sukhbaatar et al., 2014; Jacob and Ehud, 2017), or by using a bootstrap approach (Reed et al., 2015) that replaces the noisy labels with a soft or hard combination of the noisy labels and their predicted labels.

In remote sensing image analysis, the adverse effect of label noise has received little attention in the literature. The impact of label noise has recently been studied in (Frank et al., 2017; Pelletier et al., 2017a) with shallow classifiers. The feasibility of using online OpenStreetMap data (outdated or mislabeled ground truth) to obtain a classification map with a deep neural network was studied in (Kaiser et al., 2017); however, they did not directly address label noise as a specificity of the problem. Some other studies tackle label noise in the context of shallow classifiers (random forest, logistic regression) by selecting clean labeled instances via outlier detection (Pelletier et al., 2017b), or by using an existing noise-tolerant logistic regression method (Maas et al., 2016, 2017).

2.2 Optimal transport
Optimal transport theory provides the Wasserstein distance, which measures the discrepancy between probability distributions in a geometrically sound manner. More recently, optimal transport has found applications in domain adaptation (Courty et al., 2017b, a; Damodaran et al., 2018), generative models (Seguy et al., 2018; Genevay et al., 2017; Arjovsky et al., 2017), data mining (Courty et al., 2018) and image processing (Solomon et al., 2015; Papadakis, 2015).

Among those applications, domain adaptation is the one most closely related to the problem of noisy labels. It indeed aims at adapting a classifier to better predict on new data whose distribution differs from that of the training data. In the case of label noise, we similarly want the classifier to perform well on clean data while being trained on a noisy dataset. One recent approach, coined JDOT for joint distribution optimal transport (Courty et al., 2017a), proposes to estimate a classifier that minimizes the Wasserstein distance between the joint feature/label distribution and a joint distribution predicted (with the model) on the new data. This approach has recently been extended to the deep learning framework in (Damodaran et al., 2018), and will be described in more detail in section 4.2.
3 Problem formulation and noise model
3.1 Traditional supervised learning
Let $\{x_i\}_{i=1}^{n}$ be the training features/images and $\{y_i\}_{i=1}^{n}$ be their associated one-hot encoded class labels ($y_i \in \{0,1\}^{K}$, where $K$ is the number of classes), sampled from the joint distribution $P(x, y)$. Let $f$ be a neural network model with parameters $\theta$, which maps the input features into class conditional probabilities $f(x)$. The loss function $\mathcal{L}(y, f(x))$ measures the discrepancy (error) between the true label $y$ and the label distribution $f(x)$ predicted by the neural network. In the standard supervised learning setting, one estimates the parameters $\theta$ of $f$ by minimizing the empirical risk on the training set:

(1) $\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(x_i))$
In this paper we use the cross-entropy, defined as $\mathcal{L}(y, f(x)) = -\sum_{k=1}^{K} y^{k} \log f^{k}(x)$; thus eq. (1) can be re-expressed as

(2) $\min_{\theta} \; -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_i^{k} \log f^{k}(x_i)$
The neural network model is estimated by minimizing the objective above with respect to its parameters $\theta$ through stochastic optimization procedures. However, minimizing the loss function of eq. (2) can in certain scenarios lead to overfitting. When the dataset is affected by label noise, minimizing the empirical risk can degrade the performance of the neural network. Hence a suitable modification of the loss function is necessary to learn a robust neural network model, which is the direction of our proposed method.
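To make eqs. (1)–(2) concrete, the empirical risk under cross-entropy can be written in a few lines of NumPy. This is a minimal sketch with our own array names: `Y` holds one-hot labels and `P` the predicted class probabilities.

```python
import numpy as np

def cross_entropy(Y, P, eps=1e-12):
    """Cross-entropy loss of eq. (2) between one-hot labels Y and
    predicted class probabilities P, both of shape (n, K)."""
    return -np.sum(Y * np.log(P + eps), axis=1)

def empirical_risk(Y, P):
    """Empirical risk of eq. (1) with the cross-entropy loss."""
    return cross_entropy(Y, P).mean()

# A confident correct prediction incurs a small loss; a wrong or
# mislabeled one inflates the risk, which is why label noise hurts.
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
P = np.array([[0.9, 0.1], [0.2, 0.8]])
risk = empirical_risk(Y, P)
```

Under label noise, some rows of `Y` are wrong, and minimizing this risk fits the model to those wrong labels.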
In the following subsection, we describe the label noise in the datasets, and how to artificially simulate this noise in two different settings.
3.2 Label noise
Large-scale datasets are commonly subject to label noise (mislabeled samples), especially when one of the surrogate labeling strategies discussed in the introduction is used. Label noise in a dataset can be of two types: asymmetric and symmetric.
In the asymmetric case, each label $y = j$ in the training set is flipped to $\tilde{y} = k$ with probability $\rho_{jk}$, defining the noise transition matrix $Q$, with $Q_{jk} = p(\tilde{y} = k \mid y = j)$ indicating the probability of class label $j$ being flipped to class label $k$. Thus, the training samples are observed from the joint distribution

(3) $\tilde{P}(x, \tilde{y} = k) = \sum_{j=1}^{K} Q_{jk} \, P(x, y = j)$
This noise model is realistic and can occur in real-world scenarios, where non-experts find it difficult to distinguish between similar fine-grained classes. However, the matrix $Q$ is generally unknown in real-world scenarios.
In the symmetric case, each label is flipped uniformly across all the other classes with probability $\rho$, irrespective of the similarity between classes. In this case, the matrix $Q$ has entries $1 - \rho$ on the diagonal and $\rho / (K - 1)$ in the off-diagonal elements. This noise model is much simpler and has a single parameter.
For both noise types, learning the classifier with the loss function of eq. (1) is not robust and can lead to overfitting the noisy training labels.
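As an illustration, both noise models can be simulated with a transition matrix $Q$. The sketch below (function names are ours, symmetric case only; an asymmetric $Q$ would simply be hand-crafted from the chosen class permutations) flips each integer label according to the row of $Q$ indexed by its clean class.

```python
import numpy as np

def symmetric_transition_matrix(K, rho):
    """Noise transition matrix Q with 1 - rho on the diagonal and
    rho / (K - 1) elsewhere (symmetric label noise)."""
    Q = np.full((K, K), rho / (K - 1))
    np.fill_diagonal(Q, 1.0 - rho)
    return Q

def corrupt_labels(y, Q, rng):
    """Flip each integer label y[i] by sampling from row y[i] of Q."""
    return np.array([rng.choice(len(Q), p=Q[label]) for label in y])

rng = np.random.default_rng(0)
K, rho = 5, 0.4
Q = symmetric_transition_matrix(K, rho)
y_clean = rng.integers(0, K, size=1000)
y_noisy = corrupt_labels(y_clean, Q, rng)
noise_rate = np.mean(y_noisy != y_clean)   # close to rho on average
```

Each row of $Q$ sums to one, so the corrupted labels remain a valid sample from the noisy joint distribution of eq. (3).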
4 Classification Loss with Entropic Optimal Transport (CLEOT)
In this section we first provide an introduction to optimal transport (OT) by discussing unregularized and regularized OT. Next we introduce joint distribution OT, which is the starting point of our method. Then we formulate our approach and discuss the numerical resolution of the proposed learning problem.
4.1 Introduction to optimal transport
Optimal transport (see for instance the two monographs by Villani (Villani, 2003, 2009)) is a theory that allows comparing probability distributions in a geometrically sound manner, even when their respective supports do not overlap. OT is hence well suited to work on empirical distributions and allows taking into account the geometry of the dataset in its embedding space. Formally, OT searches for a probabilistic coupling $\gamma \in \Pi(\mu_1, \mu_2)$ between two distributions $\mu_1$ and $\mu_2$ which yields a minimal total displacement cost w.r.t. a given cost function $c(x_1, x_2)$ measuring the dissimilarity between samples $x_1$ and $x_2$ on the supports of $\mu_1$ and $\mu_2$ respectively. Here, $\Pi(\mu_1, \mu_2)$ describes the space of joint probability distributions with marginals $\mu_1$ and $\mu_2$. In a discrete setting (both distributions are empirical) the OT problem becomes:

(4) $W(\mu_1, \mu_2) = \min_{\gamma \in \Pi(a, b)} \; \langle \gamma, \mathbf{C} \rangle_F$

where $\langle \cdot, \cdot \rangle_F$ is the Frobenius dot product, $\mathbf{C}$ is a ground cost matrix representing the pairwise costs $c(x_i, x_j)$, $\gamma$ is a matrix of size $n_1 \times n_2$ with prescribed marginals $a$ and $b$, and $n_1$ and $n_2$ are the sizes of the supports of $\mu_1$ and $\mu_2$ respectively. The minimum of this optimization problem can be used as a measure of discrepancy between distributions, and, whenever the cost $c$ is a metric, OT is also a metric and is called the Wasserstein distance.
OT solvers have a super-cubic complexity in the size of the support of the input distributions, which makes OT approaches intractable when dealing with medium- to large-scale datasets. In order to speed up OT computation, Cuturi (2013) proposed to solve a regularized version of OT instead of the above linear program. Regularization is achieved by adding a negative entropy term on the coupling $\gamma$. Thus, the so-called entropy-regularized Wasserstein distance can be defined from eq. (4) as

(5) $W_{\lambda}(\mu_1, \mu_2) = \langle \gamma_{\lambda}^{\star}, \mathbf{C} \rangle_F$

with

(6) $\gamma_{\lambda}^{\star} = \operatorname*{argmin}_{\gamma \in \Pi(a, b)} \; \langle \gamma, \mathbf{C} \rangle_F + \lambda \, \Omega(\gamma)$

where $\Omega(\gamma) = \sum_{i,j} \gamma_{i,j} \log \gamma_{i,j}$ is the negative entropy of $\gamma$, and $\lambda$ is the trade-off between the two terms. When $\lambda \to 0$, eq. (5) recovers the original optimal transport problem from eq. (4), and when $\lambda \to \infty$ the resulting divergence has strong links with the maximum mean discrepancy, as discussed in (Genevay et al., 2017). Efficient computational schemes were proposed with entropic regularization (Cuturi, 2013), as well as stochastic versions using the dual formulation of the problem (Genevay et al., 2016; Arjovsky et al., 2017; Seguy et al., 2018), allowing to tackle medium- to large-sized problems.
Note that the regularized Wasserstein distance in eq. (5) is defined only with the linear term, whereas the OT matrix is optimized with an additional regularization term in eq. (6). This allows for a better approximation of the Wasserstein distance, as discussed in Luise et al. (2018), but comes with a slightly more complex problem to minimize when used as an objective value, as discussed in the following.
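Eqs. (5)–(6) can be solved with a few Sinkhorn iterations (Cuturi, 2013). The NumPy sketch below follows the convention above (regularizer weighted by $\lambda$); it is a didactic illustration only, and in practice one would use a dedicated library such as POT.

```python
import numpy as np

def sinkhorn(a, b, C, lam, n_iter=200):
    """Entropy-regularized OT of eq. (6) via Sinkhorn iterations.
    a, b: marginal weights; C: ground cost matrix; lam: regularization
    strength (larger lam spreads the mass of the coupling more)."""
    K = np.exp(-C / lam)               # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)              # alternately rescale to match
        u = a / (K @ v)                # the two prescribed marginals
    gamma = u[:, None] * K * v[None, :]
    return gamma

# Two small empirical distributions with uniform weights.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(4, 2)), rng.normal(size=(6, 2))
C = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)   # squared distances
a, b = np.full(4, 1 / 4), np.full(6, 1 / 6)
gamma = sinkhorn(a, b, C, lam=0.5)
W_lam = np.sum(gamma * C)              # entropic Wasserstein value, eq. (5)
```

The returned coupling satisfies the marginal constraints of $\Pi(a, b)$ up to the numerical tolerance of the iterations.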
4.2 Joint distribution optimal transport
In the context of unsupervised domain adaptation, Courty et al. (2017a) proposed the joint distribution optimal transport (JDOT) method. The idea is to consider the optimal transport problem between distributions on the product space of the feature and label spaces, instead of considering the feature space distributions only.
In this setting, the source measure $\mu_s$ and the target measure $\mu_t$ are measures on the product space $\mathcal{X} \times \mathcal{Y}$, and we note $(x_i^s, y_i^s)$ and $(x_j^t, y_j^t)$ the samples of $\mu_s$ and $\mu_t$ respectively. The generalized ground cost associated to this space can be naturally expressed as a weighted combination of costs in the input and label spaces, reading

(7) $\mathcal{D}(x_i^s, y_i^s; x_j^t, y_j^t) = \alpha \, d(x_i^s, x_j^t) + \beta \, \mathcal{L}(y_i^s, y_j^t)$

for the $i$-th element of the support of $\mu_s$ and the $j$-th element of the support of $\mu_t$. Here $d$ is chosen as a distance and $\mathcal{L}$ is a classification loss (e.g. hinge or cross-entropy). The parameters $\alpha$ and $\beta$ are two scalar values weighting the relative contributions of the feature and label discrepancies. In the unsupervised domain adaptation setting, the target labels are unknown and we seek to learn a classifier $f$ to estimate the label of each target sample. Hence, replacing the labels of the samples from the target distribution with the predictions $f(x_j^t)$, we define the ground cost

(8) $\mathcal{D}(x_i^s, y_i^s; x_j^t, f(x_j^t)) = \alpha \, d(x_i^s, x_j^t) + \beta \, \mathcal{L}(y_i^s, f(x_j^t))$

Accounting for the classification loss, JDOT leads to the following minimization problem:

(9) $\min_{f, \, \gamma \in \Pi(\mu_s, \mu_t)} \; \langle \gamma, \mathbf{D}_f \rangle_F$

where $\mathbf{D}_f$ depends on $f$ and gathers all the pairwise costs $\mathcal{D}(x_i^s, y_i^s; x_j^t, f(x_j^t))$. As a by-product of this optimization problem, samples that share a similar representation and a common label (through classification) are matched, yielding better discrimination. JDOT has recently been extended to deep learning strategies (Damodaran et al., 2018) by computing the optimal transport w.r.t. deep embeddings of the data rather than the original feature space, and by proposing a large-scale variant of the regularized OT optimization problem.
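The ground cost of eqs. (7)–(8) is straightforward to assemble. The sketch below (array names are ours; squared Euclidean distance for $d$ and cross-entropy for $\mathcal{L}$ are one common choice, not the only one) computes the full pairwise cost matrix $\mathbf{D}_f$.

```python
import numpy as np

def jdot_ground_cost(Xs, Ys, Xt, Pt, alpha, beta, eps=1e-12):
    """Pairwise ground cost of eq. (8): alpha * squared Euclidean
    distance between features + beta * cross-entropy between source
    one-hot labels Ys and target class probabilities Pt = f(Xt)."""
    d_feat = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
    d_lab = -Ys @ np.log(Pt + eps).T       # CE(y_i, f(x_j)) for all i, j
    return alpha * d_feat + beta * d_lab

# Toy example: 3 source samples, 4 target samples, 2 classes.
rng = np.random.default_rng(0)
Xs, Xt = rng.normal(size=(3, 2)), rng.normal(size=(4, 2))
Ys = np.eye(2)[[0, 1, 0]]                  # one-hot source labels
Pt = rng.dirichlet(np.ones(2), size=4)     # predicted target probabilities
D = jdot_ground_cost(Xs, Ys, Xt, Pt, alpha=1.0, beta=1.0)
```

Plugging `D` into an OT solver in place of the feature-only cost matrix is exactly what moves from eq. (4) to eq. (9).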
4.3 Learning with noisy labels using entropyregularized OT
The main idea of our proposed method is to learn a neural network model $f$ efficiently in the presence of noisy labels. Let $\{(x_i, \tilde{y}_i)\}_{i=1}^{n}$ be the samples and their associated noisy one-hot labels observed from $\tilde{P}(x, \tilde{y})$. We note $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, \tilde{y}_i)}$ the discrete distribution corresponding to these samples. Our proposal is to learn $f$ that yields a discrete distribution $\hat{\mu}_f = \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, f(x_i))}$ which minimizes the following problem:

(10) $\min_{f} \; W_{\lambda}(\hat{\mu}, \hat{\mu}_f)$
which can be reformulated as the following bilevel optimization problem:

(11) $\min_{f} \; \langle \gamma^{\star}, \mathbf{D}_f \rangle_F$

s.t. (12) $\gamma^{\star} = \operatorname*{argmin}_{\gamma \in \Pi(\hat{\mu}, \hat{\mu}_f)} \; \langle \gamma, \mathbf{D}_f \rangle_F + \lambda \, \Omega(\gamma)$

with

(13) $\mathbf{D}_f(i, j) = \alpha \, d(x_i, x_j) + \beta \, \mathcal{L}(\tilde{y}_i, f(x_j))$
As we can see from the objective function eq. (11), the prediction $f(x_j)$ will be learned such that each sample's classification needs to be close to every label $\tilde{y}_i$ for which $\gamma^{\star}_{i,j}$ is nonzero. This highlights the role of the optimal coupling in helping learn a classifier which is smoother thanks to this averaging process, since the entropic regularization will promote a spread of mass in $\gamma^{\star}$. Here the geometry of the dataset is taken into account through the ground metric on the joint feature-label space. This averaging process is even clearer when the classification loss is linear in its first argument, which is the case for the cross-entropy loss. Indeed, we have in that case

(14) $\langle \gamma^{\star}, \mathbf{D}_f \rangle_F = \alpha \, \langle \gamma^{\star}, \mathbf{C} \rangle_F + \frac{\beta}{n} \sum_{j=1}^{n} \mathcal{L}(\hat{y}_j, f(x_j)), \quad \text{with} \quad \hat{y}_j = n \sum_{i=1}^{n} \gamma^{\star}_{i,j} \, \tilde{y}_i$

where $\mathbf{C}$ gathers the pairwise feature distances $d(x_i, x_j)$, and $\hat{y}_j$ is an average of the labels $\tilde{y}_i$ with weights given by $\gamma^{\star}$, hence a denoised estimate for $\tilde{y}_j$. Our approach thus corresponds to learning from labels that have been smoothed by substituting each noisy label with a weighted combination of labels, where the weights are provided by the optimal coupling estimated w.r.t. the ground cost matrix $\mathbf{D}_f$. We name this approach CLEOT, for Classification Loss with Entropic Optimal Transport. This approach is notably motivated by the denoising capacity of the entropy-regularized optimal transport problem, as explored in (Rigollet and Weed, 2018), where the denoising is here conducted directly on the joint distribution.
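The label smoothing of eq. (14) reduces to a single matrix product once the coupling is known. The sketch below (array names are ours) normalizes the columns of $\gamma$, which under uniform marginals is equivalent to the factor $n$, and averages the noisy one-hot labels.

```python
import numpy as np

def smoothed_labels(gamma, Y_noisy):
    """Denoised label estimates of eq. (14): each row of the result is
    the gamma-weighted average of the noisy one-hot labels Y_noisy.
    Column-normalizing gamma matches the n * gamma scaling when the
    marginals are uniform, and keeps each output row a probability vector."""
    weights = gamma / gamma.sum(axis=0, keepdims=True)
    return weights.T @ Y_noisy

# A coupling whose column 0 puts most mass on samples 0 and 2 produces
# a soft label close to their (shared) noisy class.
gamma = np.array([[0.20, 0.05],
                  [0.05, 0.20],
                  [0.25, 0.25]])
Y_noisy = np.eye(2)[[0, 1, 0]]
Y_hat = smoothed_labels(gamma, Y_noisy)   # rows sum to 1
```

If a sample's noisy label disagrees with its geometric neighbors, its smoothed label is pulled toward the locally dominant class, which is exactly the denoising effect described above.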
In order to better interpret our approach, we can look at the limit cases. When $\beta = 0$ and $\lambda = 0$, the label loss disappears from the OT metric and the OT matrix is the solution of the transport between the feature distribution and itself. In this case the solution is obviously (a scaled) identity matrix, and the optimization problem w.r.t. $f$ boils down to the classical empirical risk minimization of eq. (1), without cross-terms. When $\beta > 0$ and $\lambda = 0$, the OT is performed in the joint distribution sense and will include label information through cost (13), but the solution will be very sparse (a permutation), at the risk of overfitting the sample assignment in $\gamma^{\star}$. However, when the entropy regularization is included ($\lambda > 0$), the probability mass in $\gamma^{\star}$ is spread out; as a result, the optimal coupling will share mass between samples which have similar feature and label representations, and thus perform label smoothing. This averaging is of particular interest when the labels are corrupted by noise, since we always suppose that the good labels win locally on average (or else nothing can be learned anyway). Thus, learning the neural network model $f$ with CLEOT (eq. (11)) naturally mitigates the impact of label noise and yields a robust classifier.
Finally, we discuss how to solve the optimization problem (10). The authors of JDOT originally proposed to perform alternating optimization on $f$ and $\gamma$. This approach works for unregularized OT and converges to a stationary point; however, this does not hold true for regularized OT. When using regularized OT as proposed here, the problem is a bilevel optimization problem (Colson et al., 2007), and bilevel optimization problems are notoriously difficult to solve. Since the inner problem is a regularized OT problem that is strongly convex, one could solve the problem by using the implicit function theorem, as discussed in Luise et al. (2018) on a different application. However, solving for the full coupling is computationally infeasible both in terms of time and memory, because $\gamma$ is a dense matrix whose size scales quadratically with the number of samples. Even if modern solvers have been proposed for regularized OT in the dual (Seguy et al., 2018; Arjovsky et al., 2017; Genevay et al., 2016) or primal (Genevay et al., 2017), they are still computationally intensive and cannot be used properly with alternating optimization. This problem is even aggravated by the necessity of solving the OT problem at each iteration. In order to circumvent these problems, we use a stochastic optimization scheme, solving the problem on minibatches, which enables learning complex deep neural networks on large datasets.
4.4 Stochastic approximation of proposed method
We propose to approximate the objective function eq. (11) of our proposed method by sampling minibatches of size $m$, and minimizing the optimization problem:

(15) $\min_{f} \; \mathbb{E} \left[ \langle \gamma^{\star}, \mathbf{D}_f \rangle_F \right]$

s.t. (16) $\gamma^{\star} = \operatorname*{argmin}_{\gamma \in \Pi(\hat{\mu}_m, \hat{\mu}_{f,m})} \; \langle \gamma, \mathbf{D}_f \rangle_F + \lambda \, \Omega(\gamma)$

where the expectation is taken over the randomly sampled minibatches, $\hat{\mu}_m$ and $\hat{\mu}_{f,m}$ denote the minibatch counterparts of $\hat{\mu}$ and $\hat{\mu}_f$, and (16) is solved only on the minibatches. As the minibatch size $m$ increases, the optimization problem converges to eq. (11). Still, as discussed in Genevay et al. (2017), the expected value of OT over minibatches is not equivalent to the full OT and may lead to a different minimum. In practice, it has the effect of densifying the equivalent full OT matrix, adding a further regularization.
In order to optimize the problem above on minibatches, we use the Sinkhorn-autodiff approach introduced in (Genevay et al., 2017), which relies on automatic differentiation of the Sinkhorn algorithm that quickly estimates the solution of entropic regularized OT and its gradients (see the pseudo-code in algorithm 1). Note that we could have used the approach of Luise et al. (2018) for computing the gradients instead of Sinkhorn-autodiff, but their approach relies on the implicit function theorem, which supposes that the inner problem is solved exactly. Since it is difficult to ensure exact convergence of the Sinkhorn algorithm, we prefer to perform automatic differentiation through the algorithm with a finite number of iterations, which provides reasonable gradients even when Sinkhorn has not converged. This stochastic approach has two major advantages: it scales to large datasets, and it can be easily integrated into modern deep learning frameworks in an end-to-end fashion.
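For illustration, a single minibatch step can be sketched in NumPy using the simpler alternating view of eq. (14) rather than Sinkhorn-autodiff: solve the entropic OT problem (16) on the batch, smooth the labels, then take a gradient step. We use a linear softmax model (whose cross-entropy gradient has a closed form) purely so the sketch stays self-contained; the actual method instead backpropagates through the Sinkhorn iterations of a deep network, and all names below are ours.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def sinkhorn(a, b, C, lam, n_iter=100):
    """Entropic OT of eq. (16) on the minibatch via Sinkhorn iterations."""
    K = np.exp(-C / lam)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def cleot_minibatch_step(W, X, Y_noisy, alpha, beta, lam, lr):
    """One minibatch step: build the joint cost of eq. (13), solve the
    entropic OT problem, smooth the labels as in eq. (14), then take a
    gradient step on the cross-entropy against the smoothed labels."""
    m = len(X)
    P = softmax(X @ W)                                   # model predictions
    d_feat = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    d_lab = -Y_noisy @ np.log(P + 1e-12).T               # CE(y_i, f(x_j))
    D = alpha * d_feat + beta * d_lab
    a = np.full(m, 1 / m)
    gamma = sinkhorn(a, a, D, lam)
    Y_hat = m * gamma.T @ Y_noisy                        # eq. (14) smoothing
    grad = X.T @ (P - Y_hat) / m                         # softmax-CE gradient
    return W - lr * grad

# Tiny driver: a 2-class problem with 20 noisy-labeled samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
Y_noisy = np.eye(2)[rng.integers(0, 2, size=20)]
W = np.zeros((2, 2))
for _ in range(5):
    W = cleot_minibatch_step(W, X, Y_noisy, alpha=0.1, beta=1.0, lam=1.0, lr=0.5)
```

In a deep learning framework, `W` becomes the network parameters and the manual gradient is replaced by automatic differentiation through the whole computation, including the Sinkhorn loop.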
4.5 Illustration on a toy example
For the sake of clarity, we propose an illustration (Figure 1) of the behavior of the method on a simple toy example. It consists in the classical two moons problem, which is a binary classification problem. From a clean version of the dataset (Figure 1.a), containing 400 data samples, labels are randomly flipped with a probability (Figure 1
.b). The classifier is a fully connected neural networks that consists in two hidden layers of size 256, with Relu activations. The model is adjusted along 500 epochs. A graphical representation of the decision boundary is given in (Figure
1.c), where one can clearly see that the model is not capable of separating properly the two classes, resulting in a complex boundary that encloses mislabeled samples. Then, three iterations of the proposed approach are represented (one per line). Column (d) shows the coupling as a graph, i.e. links between samples corresponds to entries of that are above a given threshold (as is dense). The width of the connection is proportional to the magnitude of the entry. As expected, most of the connections highlight a geometrical and class label proximity. Labels are then propagated (Column (e)) following eq. 14. The classifier is fine tuned over this new set of fuzzy labels. Column (f) shows the new decision boundary, as well as the corresponding accuracy score (in red). As performances increase, it is worth noting the relative lower complexity of the classifier, that almost correctly classifies clean samples (0.99 of accuracy) after three iterations of CLEOT.5 Experiments and Results
We evaluate our proposed CLEOT method and state-of-the-art (SoA) methods on two remote sensing tasks: remote sensing (aerial image) scene classification, and pixel-wise labeling of hyperspectral images. The effectiveness of our proposed method is compared with several SoA methods which modify the loss function similarly to ours. The considered SoA methods are Backward and Forward loss correction (Patrini et al., 2017), Unhinged (van Rooyen et al., 2015), Sigmoid and Ramp (Ghosh et al., 2015), Savage (Masnadi-Shirazi and Vasconcelos, 2008), and Bootstrap soft (Reed et al., 2015). The Unhinged, Sigmoid, Ramp and Savage losses did not perform well in their original form. Preliminary experiments showed that these methods either do not converge or converge to poor solutions, and sometimes result in premature saturation. In order to make these methods comparable, we stacked batch normalization and softmax pooling right before the loss function. This procedure increased the performance of the state-of-the-art methods compared to the implementation mentioned in (Patrini et al., 2017) and in their respective articles. Thus, the performance of the SoA methods can be considered a strong baseline for our proposed method. A similar procedure is also used for the rest of the methods (including ours) for uniformity. The source code of our proposed method and the SoA methods will be published upon acceptance.

In the next subsections, for each dataset, we first present the data, then detail the label noise simulations and implementation details, and finally present and discuss the results.
5.1 Aerial Image Labeling
We have considered four diverse, publicly available remote sensing aerial scene classification datasets: NWPU-RESISC45 (Cheng et al., 2017), NWPU19 (Cheng et al., 2017), PatternNet (Zhou et al., 2018), and AID (Xia et al., 2017). The description of each dataset is provided below, followed by a description of the label noise applied to them.
5.1.1 Datasets
NWPU-RESISC45

This dataset consists of 31'500 remote sensing images covering 45 scene classes. Each class contains 700 images with a size of 256 × 256 pixels in the red-green-blue (RGB) color space. The spatial resolution of this dataset varies from about 30 m to 0.2 m per pixel. This dataset was extracted by experts in the field of remote sensing image interpretation from Google Earth (Google Inc.). Additional details on this dataset can be found in (Cheng et al., 2017).
NWPU19

This dataset is a subset of the NWPU-RESISC45 dataset, consisting of 13'300 remote sensing images divided into 19 scene classes. The number of samples per class, the image size and the spatial configuration are the same as for NWPU-RESISC45.
PatternNet
This is a large-scale high-resolution remote sensing dataset collected for remote sensing image retrieval. Here, we have used it for the classification task. It contains 38 classes, and each class consists of 800 images of size 256 × 256 pixels, totaling 30'400 image scenes. The images in PatternNet are collected from Google Earth imagery or via the Google Maps API for US cities. The images are of higher spatial resolution than those of the NWPU-RESISC45 dataset; the highest spatial resolution is around 0.062 m and the lowest around 4.69 m. For further information, please see (Zhou et al., 2018).
AID
This dataset is made up of 10'000 images covering 30 scene classes. Unlike the above datasets, the number of images per class varies considerably with the aerial scene type, from 220 to 420 sample images. The spatial resolution varies from 0.5 m to 8 m, and the size of each aerial image is fixed to 600 × 600 pixels. Similar to the above datasets, this dataset is also collected from Google Earth, at different times and seasons, over different countries and regions around the world. For more details, please see (Xia et al., 2017).
5.1.2 Label noise simulation
In order to evaluate our proposed method, we artificially simulate (asymmetric) label noise in the above datasets so as to mimic real-world scenarios. We carefully inspected the samples and flipped labels, with the noise probability, to visually similar classes. The selected class permutations are reported in Table 1.
Dataset | Label noise (source class → noisy class)

NWPU-RESISC45 (14/45 classes impacted) | baseball diamond → medium residential, beach → river, dense residential → medium residential, intersection → freeway, mobile home park → dense residential, overpass → intersection, tennis court → medium residential, runway → freeway, thermal power station → cloud, wetland → lake, rectangular farm land → meadow, church → palace, commercial area → dense residential

NWPU19 (7/19 classes impacted) | baseball diamond → medium residential, beach → river, dense residential → medium residential, intersection → freeway, mobile home park → dense residential, overpass → intersection, tennis court → medium residential

PatternNet (11/38 classes impacted) | cemetery → christmas tree farm, harbor → ferry terminal, dense residential → coastal home, overpass → intersection, parking space → parking lot, runway mark → parking space, coastal home → sparse residential, swimming pool → coastal home

AID (12/30 classes impacted) | bareland → desert, centre → storage tank, church → {centre, storage tank}, dense residential → medium residential, desert → bareland, industrial → medium residential, meadow → farm land, medium residential → dense residential, play ground → {meadow, school}, resort → medium residential, school → {medium residential, play ground}, stadium → play ground
5.1.3 Model
We employed a pretrained VGG16 architecture, replacing the last layer with a two-layer MLP that maps to 512 hidden neurons before predicting the classes, with weight regularization. A dropout layer is inserted before the last MLP layer, and batch normalization is applied before the softmax operator. For uniformity, all methods follow a similar architecture design. During training, the network is fine-tuned with the weights of the VGG16 layers frozen. We optimized the SoA methods with the SGD optimizer with momentum, using a minibatch size of 128. The proposed CLEOT method is optimized in the same way, but with a different learning rate and minibatch size (50 samples per class). The hyperparameters of the CLEOT method are set experimentally for all the datasets. Additionally, we used an early stopping criterion to terminate the training process if the validation loss did not decrease for a given number of epochs; this prevents overfitting to the noisy labels for all the methods. Furthermore, we retained the model weights with the best validation loss. For each dataset, we partitioned the available samples into 80% for training, 10% for validation and the remaining 10% for evaluating performance. All methods are trained with the noisy labeled training and validation samples, and evaluated on the clean labeled test instances.
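The early stopping criterion described above (terminate when the validation loss has not improved for a fixed number of epochs, and retain the best weights) can be sketched as below; the patience value is a placeholder, since the exact number of epochs is not stated in the text.

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for
    `patience` consecutive epochs, keeping the best model state.
    A sketch of the criterion described in the text; the patience
    value is not given there, so it is a parameter here."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None
        self.bad_epochs = 0

    def step(self, val_loss, model_state):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_state = model_state   # retain weights with best val loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `if stopper.step(val_loss, snapshot): break` would terminate training, after which `stopper.best_state` holds the retained weights.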
5.1.4 Results
Method  NWPU45  NWPU19  

0.0  0.2  0.4  0.6  0.8  0.0  0.2  0.4  0.6  0.8  
Cross entropy  82.93±0.09  80.53±0.30  75.63±1.04  67.80±0.40  62.05±0.04  89.75±0.30  84.95±0.49  78.19±2.08  68.45±0.95  62.25±0.17 
Unhinged  82.81±0.21  82.13±0.14  78.38±0.57  63.07±0.26  61.01±0.01  90.07±0.20  87.36±0.46  79.22±0.77  66.65±0.85  61.22±0.41 
Sigmoid  71.74±0.40  68.08±0.18  65.76±0.50  57.10±0.06  56.61±0.31  89.69±0.06  88.73±0.19  84.37±0.09  66.19±0.29  59.62±0.11 
Ramp  82.99±0.10  82.26±0.20  78.81±0.26  62.97±0.16  60.91±0.32  90.77±0.28  86.69±0.18  78.62±0.77  67.25±0.60  60.70±0.44 
Savage  76.85±0.15  75.13±0.11  69.96±0.14  59.56±0.03  58.08±0.07  90.20±0.18  89.01±2.80  81.13±0.41  67.21±0.20  60.62±0.12 
Bootstrap soft  82.98±0.17  80.65±0.47  75.82±0.88  67.39±0.86  62.22±0.21  89.74±0.18  85.19±0.56  79.64±0.95  69.20±1.50  62.03±0.05 
Backward  82.79±0.14  80.65±0.51  75.96±0.72  68.67±0.75  62.45±0.52  89.73±0.43  85.20±0.37  78.17±1.01  68.72±1.60  62.06±0.08 
Forward  83.06±0.11  80.87±0.53  74.97±1.02  68.12±1.16  62.56±0.16  89.97±0.32  85.37±1.04  78.89±1.28  69.07±0.95  62.31±0.24 
CLEOT  82.41±0.27  81.54±0.18  80.84±0.45  76.07±0.35  70.14±0.33  89.98±0.38  89.17±0.30  86.26±0.63  78.04±0.42  68.08±0.53 
PatternNet  AID  
0.0  0.2  0.4  0.6  0.8  0.0  0.2  0.4  0.6  0.8  
Cross entropy  97.68±0.15  94.82±0.30  89.11±0.55  79.76±1.36  73.68±0.34  86.94±0.51  82.92±0.33  73.80±1.03  65.56±1.18  57.80±0.84 
Unhinged  97.76±0.16  97.55±0.05  95.31±0.19  73.62±0.59  71.23±0.14  87.64±0.19  86.33±0.19  78.67±0.29  65.93±0.52  57.20±0.16 
Sigmoid  96.63±0.03  96.31±0.31  94.63±0.24  73.17±0.39  70.58±0.04  85.41±0.26  84.71±0.25  82.05±0.17  60.96±0.44  56.18±0.08 
Ramp  97.73±0.07  97.56±0.02  95.44±0.16  72.94±0.47  71.37±0.16  87.74±0.22  86.24±0.23  78.37±0.56  66.04±0.59  57.21±0.27 
Savage  96.82±0.05  96.41±0.03  93.94±0.11  73.16±0.13  70.72±0.01  83.65±0.10  85.73±0.21  82.28±0.27  62.88±0.50  56.55±0.48 
Bootstrap soft  97.62±0.13  94.45±0.39  88.88±0.79  79.13±0.73  73.39±0.48  87.03±0.40  82.54±0.78  73.75±0.82  65.24±1.09  58.00±0.45 
Backward  97.60±0.10  94.76±0.31  89.07±0.70  79.89±0.36  73.47±0.32  86.87±0.52  82.63±0.59  74.03±0.56  65.71±1.16  57.90±0.22 
Forward  97.67±0.06  94.43±0.78  89.16±1.01  79.44±0.51  73.21±0.62  86.91±0.41  82.30±1.08  73.59±0.76  64.91±0.74  58.43±0.59 
CLEOT  97.29±0.04  96.77±0.09  94.51±0.15  83.75±0.20  79.84±0.22  87.02±0.63  85.39±1.12  79.19±0.94  71.76±0.66  63.23±0.42 
The average classification accuracies and standard deviations of the SoA methods and the proposed CLEOT method on the remote sensing aerial scene classification datasets. The accuracy measures are averaged over 5 runs, and the best accuracies are reported in bold. Table 2 presents the classification performance of our proposed and SoA methods on the four aerial scene classification datasets with different noise levels. We also include the Cross entropy loss function, which serves as the baseline for all the approaches. The noise level 0.0 indicates that the methods are trained with clean labeled training and validation samples, and can be considered as the gold standard. The impact of label noise is analyzed in the range of 0.2 to 0.8. The amount of actual noise depends on the number of classes affected by the label noise in each dataset.
When the conventional Cross entropy loss function is considered, the classification accuracy initially drops by a few percentage points (3-5%) and then decreases drastically (by more than 15%) as the magnitude of label noise increases, on all four datasets. This shows that regularization techniques such as weight regularizers, dropout and early stopping can counteract label noise only up to a point and are ineffective under high-level label noise. It emphasizes the need to include robust loss functions when training deep neural networks for remote sensing image analysis. Next, considering the performance of the existing SoA methods: they remain robust, with no degradation relative to the clean training set under low-level noise and a drop of about 4% at mid-level noise, and they outperform the conventional Cross entropy loss function. However, under high-level label noise, the SoA methods perform similarly to or worse than the Cross entropy loss function. Thus, the SoA methods are still limited in tackling complex noise scenarios.
Lastly, when the performance of the proposed CLEOT method is analyzed, one can see that CLEOT achieves performance better than or similar to the SoA methods under low-level noise, and impressive performance at the higher noise levels. For instance, on average our method loses only 2.6% and 10.6% relative to the clean training set at the 0.4 and 0.6 noise levels, whereas the best SoA method loses about 4% and 22.5%, respectively. Further, the Forward and Backward methods are inferior to the robust loss functions (Unhinged, Sigmoid, etc.), which is contrary to the observation in Patrini et al. (2017). This reveals that methods performing well on machine learning benchmarks do not necessarily achieve better performance on remote sensing datasets; thus new methods have to be designed specifically for remote sensing data. Finally, it is noted from Table 2 that, among the SoA methods, no single method consistently performs best across datasets and noise levels, which makes the choice among existing SoA methods difficult for a given real-world task. On the contrary, our method consistently outperforms them across datasets and noise levels.
5.2 Hyperspectral image classification
Next, we evaluate our proposed method on the pixel-wise labeling task for hyperspectral data. For this, we chose three hyperspectral datasets from three different types of sensors, covering agricultural and urban land cover settings.
5.2.1 Datasets
Pavia University
The first hyperspectral dataset considered here was collected over the University of Pavia, Italy by the ROSIS airborne hyperspectral sensor in the framework of the HySens project managed by DLR (the German national aerospace agency). The ROSIS sensor collects images in 115 spectral bands in the spectral range from 0.43 to 0.86 µm, with a spatial resolution of 1.3 m/pixel. After removal of noisy bands, 103 bands were retained for the experiments. The data contain 610×340 pixels with nine classes of interest, covering urban materials. The total number of available labeled ground truth samples is 42'776.
Chikusei
This airborne hyperspectral dataset was acquired by a Headwall Hyperspec-VNIR-C imaging sensor over agricultural and urban areas in Chikusei, Ibaraki, Japan. The dataset has 128 bands in the spectral range from 363 nm to 1018 nm. The scene consists of 2517×2335 pixels, and the ground sampling distance was 2.5 m. Ground truth for 19 classes was collected via a field survey and visual inspection of high-resolution color images obtained by a Canon EOS 5D Mark II together with the hyperspectral data. The number of labeled reference samples is 77'592. For additional details, please refer to (Yokoya and Iwasaki, 2016).
Grss_dfc_2018
The last hyperspectral dataset was acquired over the University of Houston campus and its neighborhood in February 2017 by an ITRES CASI 1500 imaging sensor. This dataset contains 48 spectral bands covering the spectral range from 380 nm to 1050 nm, with a 1 m ground sampling distance. The scene consists of 601×2384 pixels representing 20 urban land use/land cover classes, and contains 504'856 labeled reference samples. The details of this dataset can be found at http://www.grss-ieee.org/community/technical-committees/data-fusion.
Dataset  Label noise 

Pavia University 7/9 classes impacted  meadows → trees, gravel → self-blocking bricks, bare soil → meadows, bitumen → asphalt 
Chikusei 10/19 classes impacted  baresoil (park) → baresoil (farm); baresoil (farm) → baresoil (park), rowcrops; weeds → grass, rowcrops, forest; forest → rice (grown), weeds; grass → weeds, rowcrops; rice (grown) → forest, weeds; rowcrops → baresoil (farm), weeds, grass; plastic house → asphalt, manmade (dark); manmade (dark) → plastic house; paved ground → baresoil (farm) 
GRSS_DFC_2018 10/20 classes impacted  healthy grass → stressed grass; stressed grass → bare earth; evergreen trees → deciduous trees; deciduous trees → residential buildings; residential buildings → roads, sidewalks; non-residential buildings → sidewalks; roads → major thoroughfares, sidewalks; sidewalks → major thoroughfares, crosswalks; crosswalks → major thoroughfares; major thoroughfares → highways 
5.2.2 Label noise simulation
For the hyperspectral datasets, it is difficult to identify similar classes by visual inspection because of the high dimensionality of the data. We therefore measure class similarity using the Jeffries-Matusita distance and the Transformed Divergence measure (Richards et al., 1999), and visually interpret the spectral signatures of some training samples of the most similar classes. Accordingly, the class labels are flipped as defined in Table 3.
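As a sketch, the Jeffries-Matusita distance between two classes can be computed under a Gaussian assumption on the class-conditional densities (the standard formulation, cf. Richards et al. (1999)); values near 0 indicate nearly inseparable (i.e. spectrally similar) classes, values near 2 well-separated ones. The Gaussian assumption and the estimation from sample statistics are assumptions of this sketch, not a claim about the authors' exact implementation.

```python
import numpy as np

def jeffries_matusita(x1, x2):
    """Jeffries-Matusita distance between two classes of samples
    (rows = pixels, columns = spectral bands), assuming Gaussian
    class-conditional densities. Range: [0, 2]."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    c1, c2 = np.cov(x1, rowvar=False), np.cov(x2, rowvar=False)
    c = (c1 + c2) / 2.0
    d = m1 - m2
    # Bhattacharyya distance between the two Gaussians
    b = (d @ np.linalg.solve(c, d)) / 8.0 + 0.5 * np.log(
        np.linalg.det(c) / np.sqrt(np.linalg.det(c1) * np.linalg.det(c2))
    )
    return 2.0 * (1.0 - np.exp(-b))
```

Class pairs with the smallest JM distance are then candidates for the label-flip permutations in Table 3.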
5.2.3 Model
We used a recent state-of-the-art hyperspectral image classification framework named the spectral-spatial residual network (SSRN) (Zhong et al., 2018). It consecutively extracts spectral and spatial features for pixel-wise classification of a hyperspectral image. The spectral feature learning consists of two 3D convolutional layers and two spectral residual blocks. Following the spectral features, spatial features are extracted using a 3D convolutional layer and two spatial residual blocks. An average pooling layer is added on top of the spectral-spatial feature volume, followed by a fully connected layer with a softmax activation function. A dropout layer (p=0.5) is added after the average pooling layer, and a batch normalization layer is stacked before the softmax activation. Please refer to (Zhong et al., 2018) for additional details of the SSRN architecture. We trained the SSRN architecture using the SGD optimizer with momentum (lr = 0.01, m = 0.9) for 600 epochs, with a batch size of 128 for the SoA methods and 256 for CLEOT. As with aerial scene classification, we employed the early stopping criterion to avoid overfitting, terminating the training process if the validation loss did not decrease for a given number of epochs. All methods follow a similar training procedure. The hyperparameters of our proposed CLEOT method are set experimentally for all the datasets, with the entropic regularizer set separately for the PaviaU and Chikusei datasets and for the remaining dataset.
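A minimal sketch of one spectral residual block in the spirit of SSRN is given below; the channel count and spectral kernel size are illustrative placeholders, not the exact SSRN hyperparameters.

```python
import torch
import torch.nn as nn

class SpectralResBlock(nn.Module):
    """A 3D convolutional residual block in the spirit of the spectral
    part of SSRN (Zhong et al., 2018): convolutions along the spectral
    axis only, with an identity shortcut. Kernel size is an assumption."""

    def __init__(self, channels, spectral_kernel=7):
        super().__init__()
        pad = (spectral_kernel // 2, 0, 0)   # pad only the spectral axis
        self.conv1 = nn.Conv3d(channels, channels,
                               (spectral_kernel, 1, 1), padding=pad)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels,
                               (spectral_kernel, 1, 1), padding=pad)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, bands, height, width)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity shortcut preserves shape
```

Because the shortcut is an identity, the block preserves the spectral-spatial volume shape, so several blocks can be stacked as in SSRN.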
While partitioning the ground truth reference samples into training, validation and testing subsets, we followed the convention in the hyperspectral remote sensing community of training classifiers with a small number of training samples. Accordingly, we used 20% of the samples for training, 10% for validation and the remaining 70% for testing for the Pavia University and Chikusei datasets, whereas for the GRSS_DFC_2018 dataset we used 10% for training, 10% for validation and the remaining 80% for evaluation. The training and validation samples are impacted by label noise, and the clean labeled test samples are used for evaluation.
5.2.4 Results
Table 4 presents the classification performance of the SoA methods and CLEOT on the three hyperspectral datasets. The experiments are conducted with different noise levels (see Table 4) to effectively assess the robustness of the SoA and proposed methods. The noise level 0.0 corresponds to clean labeled training and validation samples, and is an upper bound for all the methods.
Method  PaviaU  GRSS_DFC_2018  

0.0  0.1  0.2  0.3  0.4  0.0  0.1  0.2  0.3  0.4  0.5  
(0.0)  (0.19)  (0.37)  (0.57)  
Cross entropy  99.93±0.02  95.62±0.38  85.31±0.60  78.01±1.78  65.13±1.49  84.93±2.27  68.66±16.13  76.71±6.37  79.54±16.57  71.54±14.54  44.33±11.52 
Unhinged  99.97±0.02  98.58±0.01  94.91±0.01  93.02±0.02  78.55±0.01  87.07±1.48  94.23±1.03  85.24±3.85  92.76±0.65  75.90±2.21  45.11±4.45 
Sigmoid  99.98±0.00  99.68±0.05  97.56±0.01  88.69±0.01  69.31±0.02  90.33±6.72  84.87±7.01  90.18±4.09  86.23±3.79  85.18±5.31  54.04±6.11 
Ramp  99.59±0.49  98.58±0.17  96.04±0.65  90.70±2.02  75.75±4.65  92.32±3.11  85.91±8.29  77.50±3.32  85.70±3.69  82.01±8.13  40.19±8.92 
Savage  99.97±0.01  95.91±0.86  86.07±0.38  76.06±0.08  64.84±1.33  83.09±10.68  92.20±3.27  88.42±4.15  91.69±1.43  83.96±3.50  48.23±15.16 
Bootstrap soft  99.89±0.01  91.44±3.28  85.96±2.34  75.27±1.18  66.07±2.25  85.57±13.96  87.55±5.59  87.93±5.96  80.07±3.81  75.92±12.42  28.37±7.38 
Backward  99.94±0.02  90.91±3.52  84.08±0.64  76.26±3.39  71.65±6.74  85.51±7.97  69.67±25.91  84.44±5.94  86.08±4.64  78.40±11.54  71.13±12.40 
Forward  96.65±4.65  95.74±0.26  87.64±1.05  84.05±3.24  65.46±5.09  79.26±9.11  88.50±0.50  88.84±3.04  87.35±6.44  85.44±5.28  83.30±4.46 
CLEOT  99.80±0.12  99.28±0.24  98.42±0.21  96.91±0.50  91.31±0.23  96.01±0.59  95.50±0.56  94.81±0.50  92.18±1.51  91.08±0.15  62.98±1.65 
Method  chikusei  
0.0  0.1  0.2  0.3  0.4  0.5  0.6  
Cross entropy  99.99±0.01  99.27±0.04  98.98±0.02  95.92±0.17  92.40±0.42  86.37±0.31  61.31±0.26  
Unhinged  99.87±0.18  99.74±0.16  99.45±0.09  99.19±0.01  97.57±0.22  91.72±0.37  67.29±7.45  
Sigmoid  99.44±0.01  99.51±0.01  99.42±0.01  99.23±0.02  99.06±0.02  94.35±0.06  64.02±0.03  
Ramp  99.88±0.15  99.16±0.64  99.50±0.01  99.17±0.13  97.69±0.75  91.81±0.96  64.79±7.15  
Savage  99.99±0.02  99.98±0.00  98.67±0.01  90.78±1.45  83.39±3.81  66.87±2.46  52.40±0.05  
Bootstrap soft  99.95±0.04  99.59±0.16  98.73±0.31  95.79±1.44  92.35±0.48  81.39±7.73  63.48±4.44  
Backward  99.92±0.10  97.74±1.43  97.34±1.24  95.81±0.90  89.68±5.60  79.26±4.32  65.31±9.60  
Forward  99.99±0.01  99.17±0.01  98.98±0.04  96.84±0.05  95.58±0.06  83.89±0.06  64.15±0.04  
CLEOT  99.88±0.01  99.41±0.13  99.59±0.12  99.24±0.01  99.10±0.02  96.64±1.20  84.50±1.35 
Figure 2 shows the relative difference (in %) of each method with respect to the baseline (Cross entropy) method. The proposed CLEOT consistently outperforms the existing SoA methods by a large margin (about 15-20%) at the higher label noise levels (except on GRSS_DFC_2018, where our method nevertheless retains a large performance margin over the other methods). At the lower label noise levels, Fig. 2 reveals no significant difference between the best SoA method and CLEOT. However, our method has several distinct advantages over the SoA. CLEOT (i) converges faster than the best SoA method, which is beneficial for very large scale remote sensing datasets; (ii) performs consistently well irrespective of the noise level, whereas the best SoA method varies with the noise level (for instance, on Pavia University, Sigmoid is best at mid-level noise, but at high-level noise Unhinged outperforms the Sigmoid loss function); and (iii) degrades monotonically in classification accuracy as the noise level increases on the complex dataset, whereas the SoA methods do not follow this trend and are therefore less reliable. This also suggests that the best SoA methods may be more sensitive to the neural network initialization under label noise. Thus, our proposed method can be considered as an alternative candidate for training robust deep neural networks for remote sensing image analysis.
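For clarity, the relative difference plotted in Figure 2 is assumed here to be the percentage change of a method's accuracy with respect to the Cross entropy baseline; the exact definition used in the figure is not given in the text, so this is a sketch.

```python
def relative_difference(acc_method, acc_baseline):
    """Relative difference (in %) of a method's accuracy with respect
    to the Cross entropy baseline (assumed definition)."""
    return 100.0 * (acc_method - acc_baseline) / acc_baseline
```

For example, on Pavia University at noise level 0.4 (Table 4), CLEOT at 91.31% against the Cross entropy baseline at 65.13% gives a relative difference of about +40%.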
As observed for aerial scene classification, the classification accuracy of the Cross entropy loss decreases as the noise level increases. The magnitude of the decrease depends on the number of classes affected by the label noise, as well as on the nature of the dataset. It is noted that on the Chikusei dataset, Cross entropy is very robust compared to the other datasets; this might be due to the large patches of homogeneous landscapes in the dataset, as well as the label noise appearing at the pixel level. In future work, we will consider more complex noise models for the hyperspectral datasets, where label noise could appear as spatially correlated clusters of pixels.
6 Conclusion
In this paper, we proposed the CLEOT method to learn robust deep neural networks under label noise in remote sensing. The proposed method leverages the geometric structure of the underlying data and uses optimal transport with entropic regularization to regularize the classification model. We evaluated the robustness of CLEOT on two very different applications, one focusing on image scene classification and the other on pixel-wise classification of hyperspectral images, with different deep learning architectures. Our proposed approach performed better than competing state-of-the-art approaches and showed strong robustness in the presence of a significant amount of label noise. Future work will consider other regularization schemes for the optimal transport problem, and the use of an embedding metric in the definition of the cost matrix instead of relying on the distance in the input space.
Acknowledgments
This work benefited from the support of a Region Bretagne grant and the OATMIL ANR-17-CE23-0012 project of the French National Research Agency (ANR), and from CNRS PEPS 3IA DESTOPT. The authors would like to thank Prof. Paolo Gamba for providing the Pavia University dataset, the National Center for Airborne Laser Mapping and the Hyperspectral Image Analysis Laboratory at the University of Houston for acquiring and providing the GRSS_DFC_2018 data, and the IEEE GRSS Image Analysis and Data Fusion Technical Committee. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
References

Anwer et al. (2018)
Anwer, R.M., Khan, F.S.,
van de Weijer, J., Molinier, M.,
Laaksonen, J., 2018.
Binary patterns encoded convolutional neural networks for texture recognition and remote sensing scene classification.
ISPRS Journal of Photogrammetry and Remote Sensing 138, 74–85. 
 Ghosh et al. (2017) Ghosh, A., Kumar, H., Sastry, P.S., 2017. Robust Loss Functions under Label Noise for Deep Neural Networks, in: AAAI Conference on Artificial Intelligence.
 Arjovsky et al. (2017) Arjovsky, M., Chintala, S., Bottou, L., 2017. Wasserstein generative adversarial networks, in: ICML, pp. 214–223.
 Audebert et al. (2018) Audebert, N., Le Saux, B., Lefèvre, S., 2018. Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS Journal of Photogrammetry and Remote Sensing 140, 20–32.
 Brooks (2011) Brooks, J.P., 2011. Support Vector Machines with the Ramp Loss and the Hard Margin Loss. Operations Research 59, 467–479.
 Cheng et al. (2017) Cheng, G., Han, J., Lu, X., 2017. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proceedings of the IEEE 105, 1865–1883.
 Cheng et al. (2018) Cheng, G., Yang, C., Yao, X., Guo, L., Han, J., 2018. When Deep Learning Meets Metric Learning: Remote Sensing Image Scene Classification via Learning Discriminative CNNs. IEEE Transactions on Geoscience and Remote Sensing 56, 2811–2821.
 Colson et al. (2007) Colson, B., Marcotte, P., Savard, G., 2007. An overview of bilevel optimization. Annals of operations research 153, 235–256.
 Courty et al. (2018) Courty, N., Flamary, R., Ducoffe, M., 2018. Learning wasserstein embeddings. ICLR, arXiv:1710.07457 .
 Courty et al. (2017a) Courty, N., Flamary, R., Habrard, A., Rakotomamonjy, A., 2017a. Joint distribution optimal transportation for domain adaptation, in: NIPS.
 Courty et al. (2017b) Courty, N., Flamary, R., Tuia, D., Rakotomamonjy, A., 2017b. Optimal transport for domain adaptation. IEEE TPAMI 39, 1853–1865.
 Cuturi (2013) Cuturi, M., 2013. Sinkhorn distances: Lightspeed computation of optimal transportation, in: NIPS, pp. 2292–2300.
 Damodaran et al. (2018) Damodaran, B.B., Kellenberger, B., Flamary, R., Tuia, D., Courty, N., 2018. DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation, in: The European Conference on Computer Vision (ECCV), pp. 447–463.
 Ding et al. (2018) Ding, Y., Wang, L., Fan, D., Gong, B., 2018. A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels. arXiv preprint arXiv:1802.02679.
 Frank et al. (2017) Frank, J., Rebbapragada, U., Bialas, J., Oommen, T., Havens, T.C., 2017. Effect of Label Noise on the Machine-Learned Classification of Earthquake Damage. Remote Sensing 9, 803.
 Frenay and Verleysen (2014) Frenay, B., Verleysen, M., 2014. Classification in the Presence of Label Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems 25, 845–869.
 Genevay et al. (2016) Genevay, A., Cuturi, M., Peyré, G., Bach, F., 2016. Stochastic optimization for large-scale optimal transport, in: NIPS, pp. 3432–3440.
 Genevay et al. (2017) Genevay, A., Peyré, G., Cuturi, M., 2017. Sinkhorn-AutoDiff: Tractable Wasserstein learning of generative models. arXiv preprint arXiv:1706.00292.
 Ghosh et al. (2015) Ghosh, A., Manwani, N., Sastry, P., 2015. Making risk minimization tolerant to label noise. Neurocomputing 160, 93–107.
 Haklay (2010) Haklay, M., 2010. How Good is Volunteered Geographical Information? A Comparative Study of OpenStreetMap and Ordnance Survey Datasets. Environment and Planning B: Planning and Design 37, 682–703.
 Hendrycks et al. (2018) Hendrycks, D., Mazeika, M., Wilson, D., Gimpel, K., 2018. Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise, in: Advances in Neural Information Processing Systems.
 Hickey (1996) Hickey, R.J., 1996. Noise modelling and evaluating learning from examples. Artificial Intelligence 82, 157–179.
 Huang et al. (2018) Huang, B., Lu, K., Audebert, N., Khalel, A., Tarabalka, Y., Malof, J., Boulch, A., Saux, B.L., Collins, L., Bradbury, K., Lefèvre, S., ElSaban, M., 2018. Largescale semantic classification: outcome of the first year of Inria aerial image labeling benchmark, in: IEEE International Geoscience and Remote Sensing Symposium – IGARSS 2018.
 Jacob and Ehud (2017) Jacob, G., Ehud, B.R., 2017. Training deep neural networks using a noise adaptation layer, in: ICLR.
 Jiang et al. (2018) Jiang, L., Zhou, Z., Leung, T., Li, L.J., Fei-Fei, L., 2018. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels, in: Proceedings of the 35th International Conference on Machine Learning.
 Kaiser et al. (2017) Kaiser, P., Wegner, J.D., Lucchi, A., Jaggi, M., Hofmann, T., Schindler, K., 2017. Learning Aerial Image Segmentation From Online Maps. IEEE Transactions on Geoscience and Remote Sensing 55, 6054–6068.
 Kellenberger et al. (2018) Kellenberger, B., Marcos, D., Tuia, D., 2018. Detecting mammals in UAV images: Best practices to address a substantially imbalanced dataset with deep learning. Remote Sensing of Environment 216, 139–153.
 Kemker et al. (2017) Kemker, R., Salvaggio, C., Kanan, C., 2017. High-Resolution Multispectral Dataset for Semantic Segmentation. arXiv preprint arXiv:1703.01918.
 Li et al. (2017) Li, Y., Yang, J., Song, Y., Cao, L., Luo, J., Li, L.J., 2017. Learning from Noisy Labels with Distillation, in: IEEE International Conference on Computer Vision.
 Li et al. (2018) Li, Y., Zhang, Y., Huang, X., Zhu, H., Ma, J., 2018. Large-Scale Remote Sensing Image Retrieval by Deep Hashing Neural Networks. IEEE Transactions on Geoscience and Remote Sensing 56, 950–965.
 Luise et al. (2018) Luise, G., Rudi, A., Pontil, M., Ciliberto, C., 2018. Differential properties of sinkhorn approximation for learning with wasserstein distance. arXiv preprint arXiv:1805.11897 .
 Maas et al. (2016) Maas, A., Rottensteiner, F., Heipke, C., 2016. Using label noise robust logistic regression for automated updating of topographic geospatial databases, in: ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume III7.
 Maas et al. (2017) Maas, A., Rottensteiner, F., Heipke, C., 2017. Classification under label noise based on outdated maps, in: ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume IV1/W1.
 Maggiori et al. (2017) Maggiori, E., Tarabalka, Y., Charpiat, G., Alliez, P., 2017. HighResolution Aerial Image Labeling With Convolutional Neural Networks. IEEE Transactions on Geoscience and Remote Sensing 55, 7092–7103.
 Masnadi-Shirazi and Vasconcelos (2008) Masnadi-Shirazi, H., Vasconcelos, N., 2008. On the Design of Loss Functions for Classification: Theory, Robustness to Outliers, and SavageBoost, in: Proceedings of the 21st International Conference on Neural Information Processing Systems, pp. 1049–1056.
 Natarajan et al. (2013) Natarajan, N., Dhillon, I.S., Ravikumar, P., Tewari, A., 2013. Learning with Noisy Labels, in: 26th International Conference on Neural Information Processing Systems, Nevada. pp. 1196–1204.
 Papadakis (2015) Papadakis, N., 2015. Optimal Transport for Image Processing. Ph.D. thesis. URL: https://hal.archives-ouvertes.fr/tel-01246096.

 Patrini et al. (2017) Patrini, G., Rozza, A., Menon, A., Nock, R., Qu, L., 2017. Making Deep Neural Networks Robust to Label Noise: a Loss Correction Approach, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI.
 Pelletier et al. (2017a) Pelletier, C., Valero, S., Inglada, J., Champion, N., Marais Sicre, C., Dedieu, G., 2017a. Effect of Training Class Label Noise on Classification Performances for Land Cover Mapping with Satellite Image Time Series. Remote Sensing 9, 173.
 Pelletier et al. (2017b) Pelletier, C., Valero, S., Inglada, J., Dedieu, G., Champion, N., 2017b. Filtering mislabeled data for improving time series classification, in: 2017 9th International Workshop on the Analysis of Multitemporal Remote Sensing Images (MultiTemp), IEEE. pp. 1–4.
 Reed et al. (2015) Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., Rabinovich, A., 2015. Training Deep Neural Networks on Noisy Labels with Bootstrapping, in: Workshop at International conference on learning representations.
 Ren et al. (2018) Ren, M., Zeng, W., Yang, B., Urtasun, R., 2018. Learning to Reweight Examples for Robust Deep Learning, in: Proceedings of the 35th International Conference on Machine Learning.
 Richards et al. (1999) Richards, J.A., Jia, X., Gessner, W., 1999. Remote sensing digital image analysis: an introduction. Springer-Verlag.
 Rigollet and Weed (2018) Rigollet, P., Weed, J., 2018. Entropic optimal transport is maximumlikelihood deconvolution. arXiv preprint arXiv:1809.05572 .
 van Rooyen et al. (2015) van Rooyen, B., Menon, A.K., Williamson, R.C., 2015. Learning with symmetric label noise: the importance of being unhinged, in: Proceedings of the 28th International Conference on Neural Information Processing Systems  Volume 1, MIT Press. pp. 10–18.
 Sáez et al. (2014) Sáez, J.A., Galar, M., Luengo, J., Herrera, F., 2014. Analyzing the presence of noise in multiclass problems: alleviating its influence with the One-vs-One decomposition. Knowledge and Information Systems 38, 179–206.
 Seguy et al. (2018) Seguy, V., Damodaran, B.B., Flamary, R., Courty, N., Rolet, A., Blondel, M., 2018. Large-scale optimal transport and mapping estimation. ICLR, arXiv preprint arXiv:1711.02283.
 Smyth et al. (1995) Smyth, P., Fayyad, U.M., Burl, M.C., Perona, P., Baldi, P., 1995. Inferring Ground Truth from Subjective Labelling of Venus Images, in: Advances in Neural Information Processing Systems, pp. 1085–1092.

Snow et al. (2008)
Snow, R., O’Connor, B.,
Jurafsky, D., Ng, A.Y.,
2008.
Cheap and fast—but is it good?: evaluating nonexpert annotations for natural language tasks, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics. pp. 254–263.
 Solomon et al. (2015) Solomon, J., de Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., Guibas, L., 2015. Convolutional wasserstein distances: efficient optimal transportation on geometric domains. ACM Transactions on Graphics 34, 66:1–66.
 Sukhbaatar et al. (2014) Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L., Fergus, R., 2014. Training Convolutional Networks with Noisy Labels. arXiv preprint arXiv:1406.2080 .
 Tong Xiao et al. (2015) Tong Xiao, Tian Xia, Yi Yang, Chang Huang, Xiaogang Wang, 2015. Learning from massive noisy labeled data for image classification, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 2691–2699.
 Vahdat (2017) Vahdat, A., 2017. Toward Robustness against Label Noise in Training Deep Discriminative Neural Networks, in: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA.
 Villani (2003) Villani, C., 2003. Topics in optimal transportation. 58, American Mathematical Soc.
 Villani (2009) Villani, C., 2009. Optimal transport: old and new. Grundlehren der mathematischen Wissenschaften, Springer.
 Wang et al. (2017) Wang, L., Zhang, J., Liu, P., Choo, K.K.R., Huang, F., 2017. Spectral–spatial multifeaturebased deep learning for hyperspectral remote sensing image classification. Soft Computing 21, 213–221.
 Wang et al. (2016) Wang, S., Bai, M., Mattyus, G., Chu, H., Luo, W., Yang, B., Liang, J., Cheverie, J., Fidler, S., Urtasun, R., 2016. TorontoCity: Seeing the World with a Million Eyes. arXiv preprint arXiv:1612.00423 .
 Xia et al. (2017) Xia, G.S., Hu, J., Hu, F., Shi, B., Bai, X., Zhong, Y., Zhang, L., Lu, X., 2017. AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification. IEEE Transactions on Geoscience and Remote Sensing 55, 3965–3981.
 Ye et al. (2017) Ye, D., Li, Y., Tao, C., Xie, X., Wang, X., 2017. Multiple Feature Hashing Learning for LargeScale Remote Sensing Image Retrieval. ISPRS International Journal of GeoInformation 6, 364.
 Yokoya and Iwasaki (2016) Yokoya, N., Iwasaki, A., 2016. Airborne hyperspectral data over Chikusei. Technical Report. Space Application Laboratory, University of Tokyo. Japan.
 Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O., 2017. Understanding deep learning requires rethinking generalization, in: International Conference on Learning Representations (ICLR).
 Zhong et al. (2018) Zhong, Z., Li, J., Luo, Z., Chapman, M., 2018. Spectral–Spatial Residual Network for Hyperspectral Image Classification: A 3D Deep Learning Framework. IEEE Transactions on Geoscience and Remote Sensing 56, 847–858.
 Zhou et al. (2018) Zhou, W., Newsam, S., Li, C., Shao, Z., 2018. PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval. ISPRS Journal of Photogrammetry and Remote Sensing.
 Zhu and Wu (2004) Zhu, X., Wu, X., 2004. Class Noise vs. Attribute Noise: A Quantitative Study. Artificial Intelligence Review 22, 177–210.