## 1 Introduction

In many real-world applications, it is desirable to use models trained on largely annotated data sets (or *source domains*) to label a newly collected, and therefore unlabeled, data set (or *target domain*). However, differences in the probability distributions between the two hinder the direct application of learned models to the latter. To overcome this problem, recent machine learning research has devised a family of techniques, called domain adaptation (DA), that deal with situations where test and training samples (also called target and source samples, respectively) follow different probability distributions Qui09 ; Pat15 . The inequality between joint distributions can be characterized in a variety of ways depending on the assumptions made about the conditional and marginal distributions. Among those, covariate shift (or sample selection bias) considers the situation where the inequality between probability density functions (pdfs) is due to a change in the marginal distributions DBLP:conf/icdm/ZadroznyLA03 ; Bickel:2007:DLD:1273496.1273507 ; nips07:Huang ; DBLP:conf/nips/LiuZ14 ; DBLP:conf/icml/WenYG14 ; Fernando:2013:UVD:2586117.2587168 ; DBLP:conf/nips/CourtyFHR17 .

Despite the numerous methods proposed in the literature to solve the DA problem under the covariate shift assumption, very few consider the situation where the change in the joint distribution is caused by a shift in the distribution of the outputs, a setting usually called target shift. In practice this means that a change in the class proportions across domains is at the root of the shift: such a situation is known as choice-based or endogenous stratified sampling in econometrics econometrica , or as prior probability shift Storkey09whentraining . In the classification context, target shift was first introduced in Japkowicz:2002:CIP:1293951.1293954 and referred to as the class imbalance problem. Several approaches have been proposed to solve it. In llw-svmcns-02 , the authors assumed that the shift in the target distribution was known a priori, while in DBLP:conf/pakdd/YuZ08 , partial knowledge of the target shift was supposed to be available. In both cases, assuming prior knowledge about class proportions in the target domain seems quite restrictive in practice. More recent methods that avoid this kind of assumption are Chan:2005:WSD and conf/icml/ZhangSMW13 . In the former, the authors used a variation of the EM approach to deal with target shift, which relies on a computationally expensive estimation of the conditional distribution. In the latter, the authors estimate the proportions in both domains directly from observations. Their approach, however, also relies on a computationally expensive optimization over kernel embeddings of probability functions and thus suffers from the same weaknesses as that of Chan:2005:WSD . Despite the rather small number of works on the subject, target shift often occurs in practice, especially in applications related to anomaly/novelty detection Blanchard:2010:SND:1756006.1953028 ; pmlr-v30-Scott13 ; pmlr-v33-sanderson14 , or in tasks where spatially located training sets are used to classify large areas, as in remote sensing image classification tuia15 .

In this paper, we propose a new algorithm for correcting the target shift based on optimal transport (OT). The recent appearance of efficient formulations of OT cuturi:2013 has allowed its application in DA, as OT explicitly learns the transformation of a given source pdf into the pdf of the target sample. In this work, we build upon a recent work on DA DBLP:journals/pami/CourtyFTR17 , where the authors successfully cast the DA problem as an OT problem, and extend it to deal with the target shift setting. Our motivation to propose new algorithms specific to target shift stems from the fact that many popular DA algorithms designed to tackle covariate shift cannot handle target shift equally well. This is illustrated in Figure 1, where we show that the OT-based DA method mentioned above fails to restrict the transportation of mass across instances of different classes when the class proportions of the source and target domains differ. As we show in the following sections, our model, called *Joint Class Proportion and Optimal Transport (JCPOT)*, manages to do it correctly. Furthermore, and contrary to the original contribution, we consider the much more challenging case of multi-source domain adaptation, where more than one source domain with changing distributions of outputs is used for learning. To the best of our knowledge, this is the first multi-source algorithm that efficiently leverages target shift and whose performance increases with the number of source domains considered.

The rest of the paper is organized as follows. In Section 2, we briefly present regularized OT and its application to DA. Section 3 details the target shift problem and provides a generalization bound for this learning scenario. In Section 4, we present the proposed JCPOT method for unsupervised DA, when no target labels are used for adaptation. In Section 5, we provide comparisons to state-of-the-art methods on synthetic data in the single- and multi-source adaptation cases, and we report results for a real-life case study in remote sensing pixel classification.

## 2 Optimal Transport

#### Basics and notation

OT can be seen as the search for a plan that moves (transports) a probability measure $\mu_s$ onto another measure $\mu_t$ with a minimum cost given by some function $c(x, y)$. In our case, we use the squared Euclidean distance $c(x, y) = \|x - y\|_2^2$, but other domain-specific measures, more suited to the problem at hand, could be used instead. In the relaxed formulation of Kantorovitch Kantorovich42 , OT seeks an optimal coupling $\gamma$ that can be seen as a joint probability distribution between $\mu_s$ and $\mu_t$. In other words, if we define $\Pi(\mu_s, \mu_t)$ as the space of probability distributions over $\mathcal{X} \times \mathcal{X}$ with marginals $\mu_s$ and $\mu_t$, the optimal transport is the coupling $\gamma_0 \in \Pi(\mu_s, \mu_t)$ which minimizes the following quantity:

$$W(\mu_s, \mu_t) = \min_{\gamma \in \Pi(\mu_s, \mu_t)} \int_{\mathcal{X} \times \mathcal{X}} c(x_s, x_t)\, d\gamma(x_s, x_t),$$

where $c(x_s, x_t)$ is the cost of moving $x_s$ to $x_t$ (issued from distributions $\mu_s$ and $\mu_t$, respectively). In the discrete version of the problem, i.e. when $\mu_s$ and $\mu_t$ are defined as empirical measures based on vectors in $\mathbb{R}^d$, the previous equation reads:

$$\gamma_0 = \operatorname*{argmin}_{\gamma \in \Pi(\mu_s, \mu_t)} \langle \gamma, \mathbf{C} \rangle_F, \qquad (1)$$

where $\langle \cdot, \cdot \rangle_F$ is the Frobenius dot product, $\mathbf{C} \in \mathbb{R}^{n_s \times n_t}$ is a cost matrix representing the pairwise costs of transporting bin $i$ to bin $j$, and $\gamma$ is a joint distribution given by a matrix of size $n_s \times n_t$, with marginals defined as $\gamma \mathbf{1}_{n_t} = \mu_s$ and $\gamma^T \mathbf{1}_{n_s} = \mu_t$. Solving equation (1) is a simple linear programming problem with equality constraints, but its dimensions scale quadratically with the size of the samples. Alternatively, one can consider a regularized version of the problem, which has the extra benefit of being faster to solve.
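To make the discrete problem concrete, the following sketch solves equation (1) directly as a linear program with `scipy`; the helper `discrete_ot` and the toy marginals are our own illustration, not part of the paper:

```python
import numpy as np
from scipy.optimize import linprog

def discrete_ot(mu_s, mu_t, C):
    """Solve the discrete Kantorovich problem of Eq. (1) as a linear
    program over the n_s * n_t entries of the (flattened) coupling."""
    ns, nt = C.shape
    # Row-marginal constraints: sum_j gamma[i, j] = mu_s[i].
    A_rows = np.kron(np.eye(ns), np.ones((1, nt)))
    # Column-marginal constraints: sum_i gamma[i, j] = mu_t[j].
    A_cols = np.kron(np.ones((1, ns)), np.eye(nt))
    res = linprog(C.ravel(),
                  A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([mu_s, mu_t]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(ns, nt), res.fun

# Two bins on each side; the cheapest plan keeps all mass on the diagonal.
gamma, cost = discrete_ot(np.array([0.5, 0.5]), np.array([0.5, 0.5]),
                          np.array([[0.0, 1.0], [1.0, 0.0]]))
```

The constraint matrix has one redundant row (both marginals sum to one), which the HiGHS solver tolerates; this brute-force formulation is only practical for small samples, which is precisely what motivates the regularized version.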

#### Entropic regularization

cuturi:2013 added a regularization term to (1) that controls the smoothness of the coupling through the entropy of $\gamma$. The entropy regularized version of the discrete OT reads:

$$\gamma_0^\lambda = \operatorname*{argmin}_{\gamma \in \Pi(\mu_s, \mu_t)} \langle \gamma, \mathbf{C} \rangle_F - \frac{1}{\lambda} h(\gamma), \qquad (2)$$

where $h(\gamma) = -\sum_{i,j} \gamma_{ij} \log \gamma_{ij}$ is the entropy of $\gamma$. Similarly, denoting the Kullback-Leibler divergence between a coupling $\gamma$ and a positive reference matrix $\zeta$ as $\mathrm{KL}(\gamma | \zeta) = \sum_{i,j} \gamma_{ij} \log\frac{\gamma_{ij}}{\zeta_{ij}} - \gamma_{ij} + \zeta_{ij}$, one can establish the following link between OT and Bregman projections.

###### Proposition 1.

(benamou:2015, , Eq. (6,7)). For $\zeta = e^{-\lambda \mathbf{C}}$, the minimizer $\gamma_0^\lambda$ of (2) is the solution of the following Bregman projection:

$$\gamma_0^\lambda = \operatorname*{argmin}_{\gamma \in \Pi(\mu_s, \mu_t)} \mathrm{KL}(\gamma | \zeta).$$

For an undefined second marginal, $\gamma$ is solely constrained by the marginal $\mu_s$, i.e. $\gamma \in \Pi(\mu_s) = \{\gamma : \gamma \mathbf{1} = \mu_s\}$, and the minimizer is the solution of the following closed-form projection:

$$\operatorname*{argmin}_{\gamma \in \Pi(\mu_s)} \mathrm{KL}(\gamma | \zeta) = \operatorname{diag}\!\left(\frac{\mu_s}{\zeta \mathbf{1}}\right) \zeta, \qquad (3)$$

where the division has to be understood component-wise.

As it follows from this proposition, the entropy regularized version of OT can be solved with a simple algorithm based on successive projections over the two marginal constraints, each admitting a closed-form solution. We refer the reader to benamou:2015 for more details on this subject.
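These alternating projections can be sketched in a few lines of numpy (the helper `sinkhorn` and the toy example are our own, with `reg` playing the role of the entropic regularization strength):

```python
import numpy as np

def sinkhorn(mu_s, mu_t, C, reg, n_iter=1000):
    """Entropy-regularized OT via Sinkhorn's iterations, i.e. successive
    Bregman projections onto the two marginal constraint sets."""
    K = np.exp(-C / reg)                # element-wise Gibbs kernel
    u = np.ones_like(mu_s)
    for _ in range(n_iter):
        v = mu_t / (K.T @ u)            # enforce the target marginal
        u = mu_s / (K @ v)              # enforce the source marginal
    return u[:, None] * K * v[None, :]  # coupling gamma

# Toy check: the marginals of the coupling match mu_s and mu_t.
x, y = np.linspace(0, 1, 3), np.linspace(0, 1, 4)
C = (x[:, None] - y[None, :]) ** 2
gamma = sinkhorn(np.full(3, 1 / 3), np.full(4, 1 / 4), C, reg=0.1)
```

Each loop iteration is exactly one closed-form projection of the form (3), first onto the target-marginal constraint and then onto the source one.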

#### Application to domain adaptation

A solution to the two domains adaptation problem based on OT has been proposed in courty14 . It consists in estimating a transformation of the source domain sample that minimizes their average displacement w.r.t. target sample, *i.e.* an optimal transport solution between the discrete distributions of the two domains.
The success of the proposed algorithm is due to an important advantage offered by OT metric over other distances used in DA (e.g. MMD): it preserves the topology of the data and admits a rather efficient estimation. The authors further added a regularization term used to encourage instances from the target sample to be transported to instances of the source sample of the same class, therefore promoting group sparsity in thanks to the norm with and courty14 or and DBLP:journals/pami/CourtyFTR17 .

## 3 Domain adaptation under the target shift

In this section, we formalize the target shift problem and provide a generalization bound that shows the key factors that have an impact when learning under it.

To this end, let us consider a binary classification problem with $N$ source domains, each represented by a sample $S_j$ of size $n_j$, $j = 1, \dots, N$, drawn from a probability distribution $P_{S_j}(X, Y)$. Here $P_{S_j}(X \mid Y = +1)$ and $P_{S_j}(X \mid Y = -1)$ are the distributions of the source data given the class labels $+1$ and $-1$, respectively, with $P_{S_j}(X \mid Y) = P_T(X \mid Y)$. We also possess a target sample of size $n$ drawn from a probability distribution $P_T(X, Y)$ such that $P_{S_j}(Y) \neq P_T(Y)$. This last condition is a characterization of target shift used in previous theoretical works on the subject pmlr-v30-Scott13 .

Following bendavid10 , we define a domain as a pair consisting of a distribution $\mathcal{D}$ on some space of inputs $\mathcal{X}$ and a labeling function $f : \mathcal{X} \rightarrow [0, 1]$. A hypothesis class $\mathcal{H}$ is a set of functions $h$ such that $h : \mathcal{X} \rightarrow \{0, 1\}$.

Given a convex loss function $\ell$, the probability according to the distribution $\mathcal{D}$ that a hypothesis $h$ disagrees with a labeling function $f$ (which can also be a hypothesis) is defined as

$$\epsilon(h, f) = \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(h(x), f(x))\right]. \qquad (4)$$

In the multi-source case, when the source and target error functions are defined w.r.t. $(\mathcal{D}_{S_j}, f_{S_j})$ or $(\mathcal{D}_T, f_T)$, we use the shorthands $\epsilon_{S_j}(h) = \epsilon_{S_j}(h, f_{S_j})$ and $\epsilon_T(h) = \epsilon_T(h, f_T)$. The ultimate goal of multi-source DA then is to learn a hypothesis on the source domains that has the best possible performance in the target one.

To this end, we define the combined error of the source domains as a weighted sum of the source domain error functions:

$$\epsilon_\alpha(h) = \sum_{j=1}^{N} \alpha_j\, \epsilon_{S_j}(h), \qquad \alpha_j \geq 0,\ \sum_{j=1}^{N} \alpha_j = 1.$$

We further denote by $f_\alpha$ the labeling function associated to the mixture distribution $\mathcal{D}_\alpha = \sum_{j=1}^{N} \alpha_j \mathcal{D}_{S_j}$. In the multi-source scenario, the combined error is minimized in order to produce a hypothesis that is used on the target domain. Here the different weights $\alpha_j$ can be seen as measures reflecting the proximity of the corresponding source domain distribution to the target one.

For the target shift setup introduced above, we can prove the following proposition.

###### Proposition 2.

Let $\mathcal{H}$ denote the hypothesis space of predictors $h : \mathcal{X} \rightarrow [0, 1]$ and let $\ell$ be a convex loss function. Let $\mathrm{disc}_\ell(\mathcal{D}, \mathcal{D}')$ be the discrepancy distance DBLP:conf/colt/MansourMR09 between two probability distributions $\mathcal{D}$ and $\mathcal{D}'$. Then, for any fixed $\alpha$ the following holds for any $h \in \mathcal{H}$:

$$\epsilon_T(h) \leq \epsilon_\alpha(h) + \mathrm{disc}_\ell(\mathcal{D}_\alpha, \mathcal{D}_T) + \lambda_\alpha,$$

where $\lambda_\alpha = \min_{h' \in \mathcal{H}} \left(\epsilon_\alpha(h') + \epsilon_T(h')\right)$ represents the joint error between the combined source error and the target one (proofs of all theoretical results of this paper can be found in the Supplementary material).

The second term in the bound can be minimized for any $h$ when $\mathcal{D}_\alpha = \mathcal{D}_T$. This can be achieved by a proper reweighting of the class distributions in the source domains, but it requires access to the target proportions, which are assumed to be unknown. In the next section, we propose to estimate the optimal proportions by minimizing the sum of the Wasserstein distances between all reweighted sources and the target distribution. We show in the Supplementary material that, under some mild assumptions, the weights that minimize the Wasserstein distance between the weighted source and the target distribution are exactly the target proportions in the asymptotic case, i.e. when using the true class distributions.

## 4 Joint Class Proportion and Optimal Transport (JCPOT)

In this section, we introduce the proposed JCPOT method, that aims at finding the optimal transportation plan and estimating class proportions jointly. The main underlying idea behind JCPOT is that we reweigh instances in the source domains in order to compensate for the discrepancy between the source and target domains class proportions.

### 4.1 Data and class-based weighting

We assume to have access to several data sets corresponding to different source domains $\mathcal{D}^{(j)}$, $j = 1, \dots, N$. These domains are formed by instances $\mathbf{x}_i^{(j)} \in \mathbb{R}^d$, with each instance being associated with one of the $C$ classes of interest. In the following, we use the superscript $(j)$ when referring to quantities in one of the source domains (e.g. $\mathbf{x}_i^{(j)}$) and the equivalent without superscript when referring to the same quantity in the (single) target domain (e.g. $\mathbf{x}_i$). Let $y_i^{(j)}$ be the corresponding class, i.e. $y_i^{(j)} \in \{1, \dots, C\}$. We are also given a target domain $\mathcal{D}^T$, populated by $n$ instances defined in $\mathbb{R}^d$. The goal of unsupervised multi-source adaptation is to recover the classes of the target domain samples, which are all unknown.

JCPOT works under the target shift assumption presented in Section 3. For every source domain, we assume that its data points follow a probability distribution function or probability measure $\mu^{(j)}$ ($j = 1, \dots, N$). In real-world situations, $\mu^{(j)}$ is only accessible through the $n_j$ instances that we can use to define the empirical distribution $\hat{\mu}^{(j)} = \sum_{i=1}^{n_j} m_i^{(j)} \delta_{\mathbf{x}_i^{(j)}}$, where $\delta_{\mathbf{x}_i^{(j)}}$ are Dirac measures located at $\mathbf{x}_i^{(j)}$, and $m_i^{(j)}$ is an associated probability mass. By denoting the corresponding vector of masses as $\mathbf{m}^{(j)} = [m_1^{(j)}, \dots, m_{n_j}^{(j)}]^T$ and $\boldsymbol{\delta}^{(j)}$ the corresponding vector of Dirac measures, one can write $\hat{\mu}^{(j)} = \langle \mathbf{m}^{(j)}, \boldsymbol{\delta}^{(j)} \rangle$. Note that when the data set is a collection of independent data points, the weights of all instances in the sample are usually set to be equal. In this work, however, we use different weights for each class of the source domain so that we can adapt the proportions of classes w.r.t. the target domain. To this end, we note that the measures can be decomposed among the classes as $\hat{\mu}^{(j)} = \sum_{c=1}^{C} \hat{\mu}_c^{(j)}$. We denote by $h_c^{(j)}$ the proportion of class $c$ in $\mathcal{D}^{(j)}$. By construction, we have $\sum_{c=1}^{C} h_c^{(j)} = 1$.

Since we chose to have equal weights within each class, we define two linear operators $\mathbf{D}_1^{(j)} \in \mathbb{R}^{C \times n_j}$ and $\mathbf{D}_2^{(j)} \in \mathbb{R}^{n_j \times C}$ that express the transformation from the vector of masses to the class proportions and back: $\mathbf{D}_1^{(j)}$ retrieves the class proportions with $\mathbf{h}^{(j)} = \mathbf{D}_1^{(j)} \mathbf{m}^{(j)}$, and $\mathbf{D}_2^{(j)}$ returns the weights of all instances for a given vector of class proportions with $\mathbf{m}^{(j)} = \mathbf{D}_2^{(j)} \mathbf{h}^{(j)}$, where the masses are distributed equiproportionally among all the data points associated to one class.
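As an illustration, the two operators can be materialized as matrices (our own sketch; `y` holds the source labels and masses are assumed uniform within each class):

```python
import numpy as np

def class_operators(y, n_classes):
    """Build the two linear operators mapping instance masses to class
    proportions (D1) and class proportions back to instance masses (D2).

    D1[c, i] = 1 if y[i] == c, so D1 @ m sums the masses per class.
    D2[i, c] = 1/|class c| if y[i] == c, so D2 @ h spreads each class
    proportion equally over that class's instances."""
    n = len(y)
    D1 = np.zeros((n_classes, n))
    D2 = np.zeros((n, n_classes))
    for c in range(n_classes):
        idx = (y == c)
        D1[c, idx] = 1.0
        D2[idx, c] = 1.0 / idx.sum()
    return D1, D2
```

By construction `D1 @ (D2 @ h) = h` for any proportion vector `h`, so the two operators are mutual inverses on the simplex of class proportions.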

### 4.2 Multi-source domain adaptation with JCPOT

As illustrated in Section 1, having matching proportions between the source and the target domains helps in finding better couplings, and, as shown in Section 3 it also enhances the adaptation results.

To this end, we propose to estimate the class proportions in the target domain by solving a constrained Wasserstein barycenter problem benamou:2015 , in which we use the operators defined above to match the proportions to the uniformly weighted target distribution $\hat{\mu}^T$. The corresponding optimization problem can be written as follows:

$$\min_{\mathbf{h} \in \Delta_C}\ \sum_{j=1}^{N} w_j\, W_{\mathrm{reg}}\!\left(\langle \mathbf{D}_2^{(j)} \mathbf{h}, \boldsymbol{\delta}^{(j)} \rangle, \hat{\mu}^T\right), \qquad (5)$$

where the regularized Wasserstein distances are defined as

$$W_{\mathrm{reg}}\!\left(\hat{\mu}^{(j)}, \hat{\mu}^T\right) = \min_{\gamma^{(j)} \in \Pi(\hat{\mu}^{(j)}, \hat{\mu}^T)} \mathrm{KL}\!\left(\gamma^{(j)} \,\middle|\, \zeta^{(j)}\right),$$

provided that $\zeta^{(j)} = e^{-\lambda \mathbf{C}^{(j)}}$. The $w_j$ are convex coefficients ($w_j \geq 0$, $\sum_j w_j = 1$) accounting for the relative importance of each domain. Here, the couplings $\gamma^{(j)}$ link each source domain to the target domain. This problem leads to $N$ marginal constraints w.r.t. the uniform target distribution, and $N$ marginal constraints related to the unknown proportions $\mathbf{h}$.

Optimizing for the first $N$ marginal constraints can be done independently for each $\gamma^{(j)}$ by solving the problem expressed in Equation 3. On the contrary, the remaining constraints require solving the proposed optimization problem for $\mathbf{h}$ and the $\gamma^{(j)}$ simultaneously. To do so, we formulate the problem as a Bregman projection with prescribed row sums ($\gamma^{(j)} \mathbf{1}_n = \mathbf{D}_2^{(j)} \mathbf{h}$), i.e.,

$$\min_{\mathbf{h}, \{\gamma^{(j)}\}}\ \sum_{j=1}^{N} w_j\, \mathrm{KL}\!\left(\gamma^{(j)} \,\middle|\, \zeta^{(j)}\right) \quad \text{s.t.} \quad \gamma^{(j)} \mathbf{1}_n = \mathbf{D}_2^{(j)} \mathbf{h},\ \ j = 1, \dots, N. \qquad (6)$$

This problem admits a closed form solution that we establish in the following result.

###### Proposition 3.

The solution of the projection defined in Equation 6 is given by:

$$\gamma^{(j)} = \operatorname{diag}\!\left(\frac{\mathbf{D}_2^{(j)} \mathbf{h}}{\zeta^{(j)} \mathbf{1}_n}\right) \zeta^{(j)}, \qquad h_c \propto \prod_{j=1}^{N} \left(n_c^{(j)}\, e^{\left(\mathbf{D}_2^{(j)T} \log\left(\zeta^{(j)} \mathbf{1}_n\right)\right)_c}\right)^{w_j},$$

where the division is component-wise, $n_c^{(j)}$ is the number of instances of class $c$ in domain $j$, and $\mathbf{h}$ is normalized to sum to one.

The initial problem can now be solved through an Iterative Bregman projections scheme where the matrix updates for coupling matrix can be computed in parallel for each domain.
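A minimal sketch of this iterative scheme follows (our own illustration, not the authors' exact algorithm: the function name, the equal domain weights `w`, the fixed iteration count, and the arithmetic class-mass marginals are all assumptions):

```python
import numpy as np

def jcpot_proportions(Xs_list, ys_list, Xt, reg=0.1, n_iter=200):
    """Alternating Bregman projections in the spirit of JCPOT: estimate
    target class proportions h by alternately (a) matching each coupling's
    column marginal to the uniform target distribution and (b) matching
    the class-grouped row marginals of all couplings to a shared h."""
    n_classes = max(int(y.max()) for y in ys_list) + 1
    nt = Xt.shape[0]
    K, D1, D2 = [], [], []
    for X, y in zip(Xs_list, ys_list):
        C = ((X[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
        K.append(np.exp(-C / reg))      # Gibbs kernel per source domain
        d1 = np.zeros((n_classes, len(y)))
        d2 = np.zeros((len(y), n_classes))
        for c in range(n_classes):
            idx = (y == c)
            d1[c, idx] = 1.0
            d2[idx, c] = 1.0 / idx.sum()
        D1.append(d1)
        D2.append(d2)
    couplings = [k.copy() for k in K]
    w = np.full(len(K), 1.0 / len(K))   # equal domain weights
    for _ in range(n_iter):
        # (a) projection on the uniform target marginal, as in Eq. (3)
        for j, g in enumerate(couplings):
            couplings[j] = g * ((1.0 / nt) / g.sum(axis=0))[None, :]
        # (b) projection on the shared class-proportion constraint:
        # h as the weighted geometric mean of the per-class row masses
        logh = sum(wj * np.log(d1 @ g.sum(axis=1))
                   for wj, d1, g in zip(w, D1, couplings))
        h = np.exp(logh)
        for j, (d2, g) in enumerate(zip(D2, couplings)):
            couplings[j] = ((d2 @ h) / g.sum(axis=1))[:, None] * g
    return h / h.sum(), couplings
```

On well-separated toy classes, the estimated `h` recovers the target proportions even when the source proportions are very different.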

### 4.3 Classification in the target domain

When both the class proportions and the corresponding coupling matrices are obtained, we still need to adapt the source and target samples and to classify the unlabeled target instances. Below, we describe two possible ways of performing these tasks.

#### Barycentric mapping

In DBLP:journals/pami/CourtyFTR17 , the authors proposed to use the OT matrices to estimate the position of each source instance as the barycenter of the target instances, weighted by the mass transported from that source instance. This approach extends to the multi-source setting and naturally provides a target-aligned position for each point from each source domain. These adapted source samples can then be used to learn a classifier that is applied directly to the target sample. In the sequel, we denote the variation of JCPOT that uses the barycentric mapping as JCPOT-PT. For this approach, DBLP:journals/pami/CourtyFTR17 noted that too much regularization has a shrinkage effect on the new positions, since the mass spreads to all target points in this configuration. Moreover, it requires the estimation of a target classifier, trained on the transported source samples, to provide predictions for the target sample. In the next paragraph, we propose a novel approach to estimate target labels that relies only on the OT matrices.

#### Label propagation

We propose, alternatively, to use the OT matrices to perform label propagation onto the target sample. Since we have access to the labels in the source domains and since the OT matrices describe the transportation of mass, we can measure, for each target instance, the proportion of mass coming from every class. We therefore propose to estimate the label proportions of the target sample as $\mathbf{L} = \sum_{j=1}^{N} w_j\, \mathbf{D}_1^{(j)} \gamma^{(j)}$, where the component $(c, i)$ of $\mathbf{L}$ contains the probability estimate of target instance $i$ belonging to class $c$. Note that this label propagation technique can be seen as a form of boosting, since the expression of $\mathbf{L}$ corresponds to a linear combination of weak classifiers obtained from each source domain. To the best of our knowledge, this is the first time this type of approach is proposed in DA. In the following, we denote it by JCPOT-LP, where LP stands for label propagation.
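The propagation step itself reduces to a couple of matrix products (a sketch under our own conventions; `couplings` and `D1_list` are the per-domain OT matrices and class-indicator operators from the previous sections):

```python
import numpy as np

def label_propagation(couplings, D1_list, weights=None):
    """Propagate source labels through the OT couplings: for each target
    point, accumulate the transported mass per class over all domains,
    then normalize into class-probability estimates."""
    if weights is None:
        weights = np.full(len(couplings), 1.0 / len(couplings))
    # D1 @ gamma has shape (n_classes, n_target): mass of each class
    # landing on each target instance.
    L = sum(w * (d1 @ g) for w, d1, g in zip(weights, D1_list, couplings))
    L = L / L.sum(axis=0, keepdims=True)  # normalize per target point
    return L.argmax(axis=0), L.T          # hard labels, soft estimates
```

No extra classifier has to be trained on transported points: the couplings alone carry the label information to the target sample.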

## 5 Experimental results

In this section, we present the results of our algorithm for both synthetic and real-world data from the task of remote sensing classification.

#### Baseline and state-of-the-art methods

We compare the proposed JCPOT algorithm to two other methods designed to tackle target shift, namely betaEM, a variation of the EM algorithm proposed in Chan:2005:WSD , and betaKMM, an algorithm based on kernel embeddings proposed in conf/icml/ZhangSMW13 (we use their public implementation available at http://people.tuebingen.mpg.de/kzhang/Code-TarS.zip). As explained in Section 4.3, our algorithm can obtain the target labels in two different ways: either based on label propagation (JCPOT-LP) or based on transporting points and applying a standard classification algorithm after the transformation (JCPOT-PT). Furthermore, we also consider two additional DA algorithms that use OT courty14 : OTDA-LP and OTDA-PT, which align the domains based on OT but without considering the discrepancies in class proportions.

### 5.1 Synthetic data

#### Data generation

In the multi-source setup, we sample 20 source domains, each consisting of 500 instances, and a target domain with 400 instances. We vary the source domains’ class proportions randomly while keeping the target ones fixed. For more details on the generative process and for additional empirical results regarding the accuracy of the proportion estimation, the sensitivity of JCPOT to hyper-parameter tuning, and the running time comparison, we refer the interested reader to the Supplementary material.

#### Results

Table 1 reports the average performance over five runs for each domain adaptation task when the number of source domains varies from 2 to 20. As betaEM, betaKMM and OTDA are not designed to work in the multi-source scenario, we merge the data from all source domains and use it as a single source domain. From the results, we can see that the algorithm with label propagation (JCPOT-LP) provides the best results and outperforms the other state-of-the-art DA methods. On the other hand, all methods addressing specifically the target shift problem perform better than the OTDA method designed for covariate shift. This result supports our claim about the necessity of specially designed algorithms that take into account the shifting class proportions in DA.

### 5.2 Real-world data from remote sensing application

#### Data set

We consider the task of classifying superpixels from satellite images at very high resolution into a set of land cover/land use classes tuia15 . We use the ‘Zurich Summer’ data set (available at https://sites.google.com/site/michelevolpiresearch/data/zurich-dataset), composed of 20 images issued from a large image acquired by the QuickBird satellite over the city of Zurich, Switzerland, in August 2002, where the features are extracted as described in (tuia_zurich, , Section 3.B). For this data set, we consider a multi-class classification task corresponding to the classes Roads, Buildings, Trees and Grass, shared by all images. The number of superpixels per class is imbalanced and varies across images: it thus represents a real target shift problem. We consider 18 out of the 20 images, since two images exhibit a very scarce ground truth, making a reliable estimation of the true class proportions difficult. We use each image in turn as the target domain, while considering the remaining 17 images as source domains. Figure 2 presents both the original images and the ground truths of several images from the considered data set. One can observe that the classes of all three images have very unequal proportions compared to each other.

#### Results

The results over 5 trials obtained on this data set are reported in the lower part of Table 1. The proposed JCPOT method based on label propagation significantly improves the classification accuracy over the other baselines. The results show an important improvement over the “No adaptation” case, with an increase reaching 10% for JCPOT-LP. We also note that the results obtained by JCPOT-LP outperform the “Target only” baseline. This shows the benefit brought by multiple source domains: once properly adapted, they represent a much larger annotated sample than the target domain sample alone. This claim is also confirmed by the increasing performance of our approach with the increasing number of source domains. Overall, the obtained results show that the proposed method handles the adaptation problem quite well and thus allows manual labeling to be avoided in real-world applications.

## 6 Conclusions

In this paper we proposed JCPOT, a novel method dealing with target shift: a particular and largely understudied DA scenario occurring when the difference between the source and target distributions is induced by differences in their class proportions. To justify the necessity of accounting for target shift explicitly, we presented a theoretical result showing that unmatched proportions between source and target domains lead to inefficient adaptation. Our method addresses the target shift problem by tackling the estimation of class proportions and the alignment of domain distributions jointly, in the optimal transport framework. We used the idea of Wasserstein barycenters to extend our model to the multi-source case in the unsupervised DA scenario. In our experiments on both synthetic and real-world data, the proposed JCPOT method outperforms current state-of-the-art methods and provides a reliable estimation of the proportions in the unlabeled target sample.

In the future, we plan to investigate the application of our algorithm to anomaly and novelty detection in order to deal with cases of highly imbalanced proportions. This would allow our method to be applied to many important tasks related to health care, fraud detection or wildlife animal detection: areas where this problem is highly prevalent. Finally, we also plan to adapt our strategy to other domain adaptation settings, where the shift happens in the conditional distributions, or where only subsets of classes (possibly disjoint) are present in each domain.

## Supplementary material

## 1 Proof of Proposition 2

###### Proposition 2.

Let $\mathcal{H}$ denote the hypothesis space of predictors $h : \mathcal{X} \rightarrow [0, 1]$ and let $\ell$ be a convex loss function. Let $\mathrm{disc}_\ell(\mathcal{D}, \mathcal{D}')$ be the discrepancy distance DBLP:conf/colt/MansourMR09 between two probability distributions $\mathcal{D}$ and $\mathcal{D}'$. Then, for any fixed $\alpha$ the following holds for any $h \in \mathcal{H}$:

$$\epsilon_T(h) \leq \epsilon_\alpha(h) + \mathrm{disc}_\ell(\mathcal{D}_\alpha, \mathcal{D}_T) + \lambda_\alpha,$$

where $\lambda_\alpha = \min_{h' \in \mathcal{H}} \left(\epsilon_\alpha(h') + \epsilon_T(h')\right)$ represents the joint error between the combined source error and the target one.

###### Proof.

Let $h^* = \operatorname*{argmin}_{h' \in \mathcal{H}} \left(\epsilon_\alpha(h') + \epsilon_T(h')\right)$. Then

$$\epsilon_T(h) \leq \epsilon_T(h^*) + \epsilon_T(h^*, h) \qquad (1)$$

$$\leq \epsilon_T(h^*) + \epsilon_\alpha(h^*, h) + \left|\epsilon_T(h^*, h) - \epsilon_\alpha(h^*, h)\right| \qquad (2)$$

$$\leq \epsilon_T(h^*) + \epsilon_\alpha(h^*) + \epsilon_\alpha(h) + \left|\epsilon_T(h^*, h) - \epsilon_\alpha(h^*, h)\right|. \qquad (3)$$

Here lines (1) and (3) are obtained due to the validity of the triangle inequality for the classification error function DBLP:journals/jmlr/CrammerKW08 . Regarding the discrepancy term, we obtain:

$$\left|\epsilon_T(h^*, h) - \epsilon_\alpha(h^*, h)\right| \leq \sup_{h', h'' \in \mathcal{H}} \left|\epsilon_T(h', h'') - \epsilon_\alpha(h', h'')\right| = \mathrm{disc}_\ell(\mathcal{D}_\alpha, \mathcal{D}_T).$$

The final result is obtained by combining the last expression with (3) and noting that $\epsilon_T(h^*) + \epsilon_\alpha(h^*) = \lambda_\alpha$. ∎

We further note that this result can be made data-dependent for predefined families of loss functions, such as the 0-1 loss and the $L_q$ loss often used in classification and regression, respectively. To this end, one may apply (DBLP:conf/colt/MansourMR09, , Corollaries 6 and 7) in order to replace the true distributions $\mathcal{D}_\alpha$ and $\mathcal{D}_T$ by their empirical counterparts.

### Proof of optimal weight as minimum of Wasserstein distance

In this section, we prove that the minimization of the Wasserstein distance between a weighted source distribution and a target distribution yields the optimal proportion estimation. To proceed, let us consider the multi-class problem with $C$ classes, where the target distribution is defined as

$$\mu_T = \sum_{c=1}^{C} h_c^T \mu_c,$$

with $\mu_c$ being the distribution of class $c$. Similarly to the previous section, the source distribution with weighted classes can then be defined as

$$\mu_S(\mathbf{h}) = \sum_{c=1}^{C} h_c \mu_c,$$

where the $h_c$ are coefficients lying in the probability simplex $\Delta_C$ that reweight the corresponding classes.

As the proportions of the classes in the target distribution are unknown, our goal is to reweight the source class distributions by solving the following optimization problem:

$$\mathbf{h}^* = \operatorname*{argmin}_{\mathbf{h} \in \Delta_C} W\!\left(\mu_S(\mathbf{h}), \mu_T\right). \qquad (4)$$

We can now state the following proposition.

###### Proposition.

Assume that no class distribution can be expressed as a mixture of the others, i.e. $\mu_c \neq \sum_{c' \neq c} \beta_{c'} \mu_{c'}$ for every $c$ and every set of coefficients $\beta_{c'} \geq 0$. Then, for any target distribution $\mu_T = \sum_c h_c^T \mu_c$, the unique solution $\mathbf{h}^*$ minimizing (4) is given by $\mathbf{h}^* = \mathbf{h}^T$.

###### Proof.

We first note that for any two probability distributions $\mu$ and $\nu$, $W(\mu, \nu) \geq 0$, and $W(\mu, \nu) = 0$ if and only if $\mu = \nu$, as the Wasserstein distance is a valid metric on the space of probability measures. In this case, when $\mathbf{h} = \mathbf{h}^T$, we have $\mu_S(\mathbf{h}) = \mu_T$, and thus $\mathbf{h}^T$ is a feasible solution of the optimization problem given in (4) with objective value zero. On the other hand, for any solution $\mathbf{h}$ such that $\mathbf{h} \neq \mathbf{h}^T$, the assumption made in the statement of the proposition implies $\mu_S(\mathbf{h}) \neq \mu_T$, and thus $W(\mu_S(\mathbf{h}), \mu_T) > 0$. This assumption roughly means that none of the class distributions can be expressed as a weighted sum involving the other class distributions. Hence, $\mathbf{h}^T$ is the unique solution of the optimization problem (4). ∎

Note that this result extends straightforwardly to the multi-source case, where the optimal solution minimizing the sum of the Wasserstein distances over all source distributions is the vector of target domain proportions. As the real distributions are accessible only through available finite samples, in practice we propose to minimize the Wasserstein distance between the empirical target distribution $\hat{\mu}_T$ and the empirical source distributions $\hat{\mu}_S(\mathbf{h})$. The convergence of the exact solution of this problem with empirical measures can be characterized using the concentration inequalities established for the Wasserstein distance in bobkov ; fournier:hal-00915365 , where the rate of convergence is inversely proportional to the number of available instances in the source domains and, consequently, to the number of source domains.
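To illustrate the proposition in practice, in one dimension the binary target proportion can be recovered by a grid search over the simplex that minimizes an empirical Wasserstein-1 distance computed from quantile functions (a toy sketch under our own assumptions, namely well-separated classes; the helpers `w1_weighted` and `estimate_proportion` are hypothetical):

```python
import numpy as np

def w1_weighted(xs, ws, xt):
    """Empirical 1-D Wasserstein-1 distance between a weighted sample
    (xs, ws) and a sample xt, via the quantile-function formula."""
    qs = np.linspace(0, 1, 200, endpoint=False) + 1.0 / 400
    order = np.argsort(xs)
    cdf = np.cumsum(ws[order])          # weighted empirical CDF
    q_src = xs[order][np.searchsorted(cdf, qs)]
    q_tgt = np.quantile(xt, qs)
    return np.abs(q_src - q_tgt).mean()

def estimate_proportion(x0, x1, xt, grid=None):
    """Binary case: search h in (0,1) minimizing W1 between the
    reweighted source mixture h*mu_0 + (1-h)*mu_1 and the target."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    xs = np.concatenate([x0, x1])

    def cost(h):
        ws = np.concatenate([np.full(len(x0), h / len(x0)),
                             np.full(len(x1), (1 - h) / len(x1))])
        return w1_weighted(xs, ws, xt)

    return min(grid, key=cost)
```

With two well-separated classes and a target containing 75% of class 0, the grid search recovers the proportion up to the grid resolution, matching the asymptotic statement of the proposition.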

## 2 Proof of Proposition 3

For the sake of completeness, we recall that the considered optimization problem has the following form:

$$\min_{\mathbf{h} \in \Delta_C}\ \sum_{j=1}^{N} w_j\, W_{\mathrm{reg}}\!\left(\langle \mathbf{D}_2^{(j)} \mathbf{h}, \boldsymbol{\delta}^{(j)} \rangle, \hat{\mu}^T\right), \qquad (5)$$

where the regularized Wasserstein distances can be expressed as

$$W_{\mathrm{reg}}\!\left(\hat{\mu}^{(j)}, \hat{\mu}^T\right) = \min_{\gamma^{(j)} \in \Pi(\hat{\mu}^{(j)}, \hat{\mu}^T)} \mathrm{KL}\!\left(\gamma^{(j)} \,\middle|\, \zeta^{(j)}\right),$$

provided that $\zeta^{(j)} = e^{-\lambda \mathbf{C}^{(j)}}$. The $w_j$ are convex coefficients ($w_j \geq 0$, $\sum_j w_j = 1$) accounting for the relative importance of each domain.

In order to solve it for the constraints related to the unknown proportions $\mathbf{h}$, we formulate the problem as a Bregman projection with prescribed row sums ($\gamma^{(j)} \mathbf{1}_n = \mathbf{D}_2^{(j)} \mathbf{h}$), i.e.,

$$\min_{\mathbf{h}, \{\gamma^{(j)}\}}\ \sum_{j=1}^{N} w_j\, \mathrm{KL}\!\left(\gamma^{(j)} \,\middle|\, \zeta^{(j)}\right) \quad \text{s.t.} \quad \gamma^{(j)} \mathbf{1}_n = \mathbf{D}_2^{(j)} \mathbf{h},\ \ j = 1, \dots, N. \qquad (6)$$

This problem admits a closed form solution that we establish in the following result.

###### Proposition 3.

The solution of the projection defined in Equation 6 is given by:

$$\gamma^{(j)} = \operatorname{diag}\!\left(\frac{\mathbf{D}_2^{(j)} \mathbf{h}}{\zeta^{(j)} \mathbf{1}_n}\right) \zeta^{(j)}, \qquad h_c \propto \prod_{j=1}^{N} \left(n_c^{(j)}\, e^{\left(\mathbf{D}_2^{(j)T} \log\left(\zeta^{(j)} \mathbf{1}_n\right)\right)_c}\right)^{w_j},$$

where the division is component-wise, $n_c^{(j)}$ is the number of instances of class $c$ in domain $j$, and $\mathbf{h}$ is normalized to sum to one.

###### Proof.

We follow a similar line of reasoning as (benamou:2015, , Proposition (2)). We write the following optimization problem with a collection of Lagrange multipliers, $\beta^{(j)} \in \mathbb{R}^{n_j}$ for the row-sum constraints and $\nu \in \mathbb{R}$ for the simplex constraint on $\mathbf{h}$:

$$\mathcal{L} = \sum_{j=1}^{N} w_j\, \mathrm{KL}\!\left(\gamma^{(j)} \,\middle|\, \zeta^{(j)}\right) + \sum_{j=1}^{N} \left\langle \beta^{(j)}, \gamma^{(j)} \mathbf{1}_n - \mathbf{D}_2^{(j)} \mathbf{h} \right\rangle + \nu \left(\sum_{c=1}^{C} h_c - 1\right). \qquad (7)$$

We now compute the derivatives w.r.t. $\gamma^{(j)}_{ik}$, $h_c$ and $\nu$:

$$\frac{\partial \mathcal{L}}{\partial \gamma^{(j)}_{ik}} = w_j \log\frac{\gamma^{(j)}_{ik}}{\zeta^{(j)}_{ik}} + \beta^{(j)}_i, \qquad (8)$$

$$\frac{\partial \mathcal{L}}{\partial h_c} = -\sum_{j=1}^{N} \left(\mathbf{D}_2^{(j)T} \beta^{(j)}\right)_c + \nu, \qquad (9)$$

$$\frac{\partial \mathcal{L}}{\partial \nu} = \sum_{c=1}^{C} h_c - 1. \qquad (10)$$

Setting the first equation to zero leads to

$$\log\frac{\gamma^{(j)}_{ik}}{\zeta^{(j)}_{ik}} = -\frac{\beta^{(j)}_i}{w_j}, \qquad (11)$$

$$\gamma^{(j)}_{ik} = \zeta^{(j)}_{ik}\, e^{-\beta^{(j)}_i / w_j}, \qquad (12)$$

$$\gamma^{(j)} = \left(e^{-\beta^{(j)}/w_j} \mathbf{1}_n^T\right) \circ \zeta^{(j)}, \qquad (13)$$

with $\circ$ the Hadamard product. Finally, by multiplying the two terms by $\mathbf{1}_n$, we get:

$$\gamma^{(j)} \mathbf{1}_n = e^{-\beta^{(j)}/w_j} \circ \left(\zeta^{(j)} \mathbf{1}_n\right). \qquad (14)$$

Using the row-sum constraint of the system, $\gamma^{(j)} \mathbf{1}_n = \mathbf{D}_2^{(j)} \mathbf{h}$, we know that

$$e^{-\beta^{(j)}/w_j} = \frac{\mathbf{D}_2^{(j)} \mathbf{h}}{\zeta^{(j)} \mathbf{1}_n}, \qquad (15)$$

and, plugging this expression into (13), we obtain:

$$\gamma^{(j)} = \operatorname{diag}\!\left(\frac{\mathbf{D}_2^{(j)} \mathbf{h}}{\zeta^{(j)} \mathbf{1}_n}\right) \zeta^{(j)}, \qquad (16)$$

that is the first element of the solution of the projection. Taking the logarithm of (15), we get:

$$\beta^{(j)} = w_j \left(\log\left(\zeta^{(j)} \mathbf{1}_n\right) - \log\left(\mathbf{D}_2^{(j)} \mathbf{h}\right)\right). \qquad (17)$$

Setting the third equation to zero leads to $\sum_c h_c = 1$. Because of the specific structure of $\mathbf{D}_2^{(j)}$ (each column $c$ has $n_c^{(j)}$ nonzero entries equal to $1/n_c^{(j)}$), we also have $\left(\mathbf{D}_2^{(j)T} \log\left(\mathbf{D}_2^{(j)} \mathbf{h}\right)\right)_c = \log h_c - \log n_c^{(j)}$. Therefore, setting the second equation to zero and substituting (17), we obtain:

$$\sum_{j=1}^{N} w_j \left(\mathbf{D}_2^{(j)T} \left(\log\left(\zeta^{(j)} \mathbf{1}_n\right) - \log\left(\mathbf{D}_2^{(j)} \mathbf{h}\right)\right)\right)_c = \nu, \qquad (18)$$

$$\sum_{j=1}^{N} w_j \left(\mathbf{D}_2^{(j)T} \log\left(\zeta^{(j)} \mathbf{1}_n\right)\right)_c - \log h_c + \sum_{j=1}^{N} w_j \log n_c^{(j)} = \nu, \qquad (19)$$

where we used $\sum_j w_j = 1$. From Equation 19 we get

$$\log h_c = \sum_{j=1}^{N} w_j \left(\left(\mathbf{D}_2^{(j)T} \log\left(\zeta^{(j)} \mathbf{1}_n\right)\right)_c + \log n_c^{(j)}\right) - \nu, \qquad (20)$$

$$h_c = e^{-\nu} \prod_{j=1}^{N} \left(n_c^{(j)}\, e^{\left(\mathbf{D}_2^{(j)T} \log\left(\zeta^{(j)} \mathbf{1}_n\right)\right)_c}\right)^{w_j}, \qquad (21)$$

and finally, since $\sum_c h_c = 1$, the multiplier $\nu$ acts as a normalization constant and we get

$$h_c = \frac{\prod_{j=1}^{N} \left(n_c^{(j)}\, e^{\left(\mathbf{D}_2^{(j)T} \log\left(\zeta^{(j)} \mathbf{1}_n\right)\right)_c}\right)^{w_j}}{\sum_{c'=1}^{C} \prod_{j=1}^{N} \left(n_{c'}^{(j)}\, e^{\left(\mathbf{D}_2^{(j)T} \log\left(\zeta^{(j)} \mathbf{1}_n\right)\right)_{c'}}\right)^{w_j}}, \qquad (22)$$

which concludes the proof. ∎

## 3 Experimental results

In this section, we provide the details on the generative process of the synthetic data used in the main paper and present results of several other experiments that we could not include into the main paper due to lack of space.

### 3.1 Data set generation

In the main paper, we considered the multi-source scenario, for which we generated a binary classification problem where the instances of each class were drawn from Gaussian distributions. The pseudo-code of the proposed algorithm is given in Algorithm 1.

### 3.2 Proportion estimation

Here we study the deviation of the estimated proportions from their true values, measured as the distance between the estimated and true proportion vectors. The results of this study are presented in Table 2. From this table, we can see that our method provides a reliable estimate of the target class proportions in all considered cases.

| Number of source domains | 2 | 5 | 8 | 11 | 14 | 17 | 20 |
|---|---|---|---|---|---|---|---|
| Simulated data | 0.039 | 0.045 | 0.027 | 0.029 | 0.035 | 0.033 | 0.034 |
| Zurich data set | 0.036 | 0.039 | 0.032 | 0.035 | 0.03 | 0.026 | - |

### 3.3 Sensitivity to hyper-parameters

Figure 3 illustrates the classification results obtained by JCPOT when varying, respectively, the regularization parameter and the overall size of the source samples, in a setting with 4 source domains and 1 target domain. In the latter scenario, we vary the sample size by increasing it by 500 for the source domains (125 instances per domain) and by 200 for the target domain. From these figures, we observe that higher values of the regularization parameter can lead to a decrease in the performance of our algorithm, while the source domains’ sample size does not appear to have a strong influence on the results.

### 3.4 Running time comparison

In Table 3, we give the running times of all the algorithms considered in the empirical evaluation of the main paper on the simulated data. From the results, we can see that betaEM is the least computationally demanding method, closely followed by the proposed JCPOT method and OTDA. We also note that betaKMM is the most computationally demanding method.

| Number of source domains | 2 | 5 | 8 | 11 | 14 | 17 | 20 |
|---|---|---|---|---|---|---|---|
| betaEM | 0.179 | 0.174 | 0.241 | 0.314 | 0.394 | 0.458 | 0.524 |
| betaKMM | 16.057 | 193.331 | 119.859 | 117.982 | 190.623 | 172.903 | 209.53 |
| OTDA | 0.515 | 1.04 | 1.622 | 2.276 | 2.978 | 3.824 | 4.488 |
| JCPOT | 0.31 | 1.079 | 1.766 | 2.285 | 3.296 | 4.38 | 4.722 |

## References

- (1) J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. MIT Press, 2009.
- (2) V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: a survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, 2015.
- (3) Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example weighting. In ICDM, page 435, 2003.
- (4) Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In ICML, pages 81–88, 2007.
- (5) J. Huang, A.J. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In NIPS, volume 19, 2007.
- (6) Anqi Liu and Brian D. Ziebart. Robust classification under sample selection bias. In NIPS, pages 37–45, 2014.
- (7) Junfeng Wen, Chun-Nam Yu, and Russell Greiner. Robust learning under uncertain test distributions: Relating covariate shift to model misspecification. In ICML, pages 631–639, 2014.
- (8) Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In ICCV, pages 2960–2967, 2013.
- (9) Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In NIPS, pages 3733–3742, 2017.
- (10) C. Manski and S. Lerman. The estimation of choice probabilities from choice-based samples. Econometrica, 45:1977–1988, 1977.
- (11) Amos J. Storkey. When training and test sets are different: characterising learning transfer. In Dataset Shift in Machine Learning, pages 3–28. MIT Press, 2009.
- (12) Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. In IDA, volume 6, pages 429–449, 2002.
- (13) Yi Lin, Yoonkyung Lee, and Grace Wahba. Support vector machines for classification in nonstandard situations. Machine Learning, 46(1-3):191–202, 2002.
- (14) Yang Yu and Zhi-Hua Zhou. A framework for modeling positive class expansion with single snapshot. In PAKDD, pages 429–440, 2008.
- (15) Yee Seng Chan and Hwee Tou Ng. Word sense disambiguation with distribution estimation. In IJCAI, pages 1010–1015, 2005.
- (16) Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In ICML, volume 28, pages 819–827, 2013.
- (17) Gilles Blanchard, Gyemin Lee, and Clayton Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11:2973–3009, 2010.
- (18) Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In COLT, volume 30, pages 489–511, 2013.
- (19) Tyler Sanderson and Clayton Scott. Class Proportion Estimation with Application to Multiclass Anomaly Rejection. In AISTATS, volume 33, pages 850–858, 2014.
- (20) D. Tuia, R. Flamary, A. Rakotomamonjy, and N. Courty. Multitemporal classification without new labels: a solution with optimal transport. In 8th International Workshop on the Analysis of Multitemporal Remote Sensing Images, 2015.
- (21) N. Courty, R. Flamary, and D. Tuia. Domain adaptation with regularized optimal transport. In ECML/PKDD, pages 1–16, 2014.
- (22) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pages 2292–2300, 2013.
- (23) Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017.
- (24) L. Kantorovich. On the translocation of masses. Doklady of the Academy of Sciences of the USSR, 37:199–201, 1942.
- (25) Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
- (26) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Vaughan. A theory of learning from different domains. Machine Learning, 79:151–175, 2010.
- (27) Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.
- (28) Devis Tuia, Michele Volpi, and Gabriele Moser. Decision fusion with multiple spatial supports by conditional random fields. IEEE Transactions on Geoscience and Remote Sensing, pages 1–13, 2018.
- (29) Koby Crammer, Michael J. Kearns, and Jennifer Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9:1757–1774, 2008.
- (30) S. Bobkov and M. Ledoux. One-dimensional empirical measures, order statistics and Kantorovich transport distances. To appear in: Memoirs of the AMS, 2016.
- (31) Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707, 2015.