Hidden Covariate Shift: A Minimal Assumption For Domain Adaptation

07/29/2019 ∙ by Victor Bouvier, et al.

Unsupervised Domain Adaptation aims to learn a model on a source domain with labeled data in order to perform well on unlabeled data of a target domain. Current approaches focus on learning Domain Invariant Representations. This relies on the assumption that such representations are well-suited for learning the supervised task in the target domain. We argue instead that a better and more minimal assumption for performing Domain Adaptation is the Hidden Covariate Shift hypothesis. This approach consists in learning a representation of the data such that the label distribution conditioned on this representation is domain invariant. From the Hidden Covariate Shift assumption, we derive an optimization procedure which learns to match an estimated joint distribution on the target domain and a re-weighted joint distribution on the source domain. The re-weighting is done in the representation space and is learned during the optimization procedure. We show on synthetic data and real world data that our approach deals with both Target Shift and Concept Drift. We report state-of-the-art performance on the Amazon Reviews dataset (Blitzer et al., 2007), demonstrating the viability of this approach.




1 Introduction

Supervised Learning consists in learning a model on a sample of data drawn from an unknown distribution. It assumes that the data distribution is conserved at test time. This assumption may not hold for a wide range of industrial applications (Candela et al., 2009; Kull & Flach, 2014), since several factors may alter the data generative process: different preprocessing, sample rejection, conditions of data collection, etc. Domain Adaptation (DA) (Patel et al., 2015; Pan et al., 2010) is a typical case of such an issue. By transferring knowledge learned on a source domain where labels are available, Domain Adaptation aims to learn a better model on the target domain. Prior work on Domain Adaptation differs mainly in the assumption made about the change in data distribution between the source and the target domains.

A first family of approaches, known as Importance Sampling (Candela et al., 2009), consists in re-weighting each sample in order to incorporate during learning the fact that the joint distribution may change across domains. The optimal re-weighting is obtained by computing the ratio between the target and source joint distributions. This requires the availability of labels in the target domain, which is often infeasible. To overcome this issue, two mutually exclusive assumptions are commonly made, depending on prior knowledge of the nature of the distributional shift. The first, called Covariate Shift or Sample Selection Bias (Sugiyama et al., 2008; Wen et al., 2014; Zadrozny et al., 2003; Bickel et al., 2007; Huang et al., 2007), assumes changes in p(X) while the conditional distribution p(Y|X) is conserved across domains. Under this assumption the re-weighting depends only on X. This results in a well-posed problem since samples of X are available in both source and target domains. The second, called Target Shift (Storkey, 2009) or Endogenous Stratified Sampling (Manski & Lerman, 1977), assumes changes in p(Y) while the conditional distribution p(X|Y) is conserved across domains. Under this assumption, the re-weighting depends only on Y. The major drawback of this assumption is the lack of labeled data in the target domain: the re-weighting is performed using estimated labels in the target domain. Challenging cases for Importance Sampling methods are Conditional Shift (when p(X|Y) changes) and Concept Drift (when p(Y|X) changes).
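To make the re-weighting idea concrete, here is a minimal numpy sketch of an importance-weighted loss; the weight values below are hypothetical stand-ins for estimated density ratios:

```python
import numpy as np

def weighted_nll(y_true, p_pred, weights):
    """Importance-weighted negative log-likelihood for binary labels:
    each source sample is re-weighted by an estimate of the density
    ratio between the target and source distributions."""
    eps = 1e-12
    ll = y_true * np.log(p_pred + eps) + (1 - y_true) * np.log(1 - p_pred + eps)
    return -np.mean(weights * ll)

# Under Covariate Shift the weights depend only on x; under Target
# Shift they depend only on y. Hypothetical values for illustration:
y = np.array([1.0, 0.0, 1.0, 1.0])
p = np.array([0.9, 0.2, 0.7, 0.6])
w = np.array([1.5, 0.5, 1.0, 1.2])
loss = weighted_nll(y, p, w)
```

With uniform weights the usual (unweighted) mean negative log-likelihood is recovered.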

To address such challenging cases of distributional shift, Invariant Representation methods aim to learn a representation that makes source and target data indistinguishable (Baktashmotlagh et al., 2013). Such a representation is essentially learned by matching the distribution of representations between the source and target domains using adversarial learning (Ganin & Lempitsky, 2014; Ganin et al., 2016; Tzeng et al., 2014). One of the major drawbacks of such approaches is that they do not naturally handle the case of Target Shift. To address this issue, (Manders et al., 2018; Yan et al., 2017) generalize the approach of (Ganin & Lempitsky, 2014) by matching class-conditional representation distributions and estimating the label distribution shift during learning.

Figure 1: This figure illustrates an instance of domain adaptation transferring knowledge learned on a source domain to a target domain that suffers from distributional shift. Blue refers to the source domain and red to the target domain. Importance Sampling methods learn to re-weight each source sample with a factor in order to match the distribution in the target domain. Invariant Representation methods aim to learn a representation of the data which makes the domains indistinguishable.

As suggested in (Courty et al., 2017), there is no clear reason why learning a representation φ such that the distribution of φ(X) is conserved across domains will help solve the task using the intermediate state φ(X). Likewise, methods based on class ratio estimation between source and target domains implicitly assume that there exists a non-linear mapping φ such that the class-conditional distribution of φ(X) is conserved across domains. Furthermore, those methods make a second assumption: that the conservation of this class-conditional distribution across domains will help solve the task using the intermediate state φ(X). This assumption reflects an underlying causal scenario (Gong et al., 2016) and there is no clear reason why it necessarily holds for real world data.


We believe the Hidden Covariate Shift hypothesis (Kull & Flach, 2014) is a rather minimal assumption for performing Domain Adaptation in challenging distributional shift settings. Hidden Covariate Shift is a particular case of Concept Drift (where p(Y|X) may change across domains (Gama et al., 2014)) in which there exists a non-linear function φ such that p(Y|φ(X)) is conserved across domains. We believe that learning such a function is the underlying objective of any Domain Adaptation method. Using the Maximum Mean Discrepancy (MMD), we theoretically show that this assumption can be expressed as a null MMD involving the joint distribution of (φ(X), Y) and the density ratio of representations between the target and source domains. We must however estimate the joint distribution on the target domain. Our formulation naturally handles challenging distributional shift cases (Concept Drift and Target Shift). We show in several experiments with both synthetic and real world data that our formulation is robust to these cases.

2 Proposed Approach

2.1 Introduction

2.1.1 Notations

We denote by X a random variable of features (with realization x) and by Y a random variable of labels (with realization y), which respectively take their values in a feature space 𝒳 and a label space 𝒴. On these spaces, we introduce two probability distributions: p_S (source distribution) and p_T (target distribution) on 𝒳 × 𝒴. We introduce the source domain S and the target domain T. The expectation over the source domain is noted E_S and the expectation over the target domain E_T. We call a Hidden Covariate Representation a function φ defined on 𝒳 such that p_S(Y|φ(X)) = p_T(Y|φ(X)). For notational convenience, we write Z = φ(X) and refer to Z as the hidden covariate representation. For a given set A, C(A) denotes the set of measurable and bounded functions from A to ℝ.

2.1.2 Distribution matching with Maximum Mean Discrepancy

Invariant Representation methods for learning cross-domain representations rely on quantifying how data distributions from the source and the target domains differ. Formally, for two given distributions p and q, such methods introduce a proxy which quantifies how close p and q are. In the present work, we suggest using the Maximum Mean Discrepancy measure (Gretton et al., 2007, 2012), denoted MMD. This measure is based on the following property:

p = q  if and only if  E_{x∼p}[f(x)] = E_{x∼q}[f(x)]  for all f ∈ C(A)

where C(A) is the set of measurable and bounded functions on the common support A. We can derive a proxy of this property called the Maximum Mean Discrepancy:

MMD(p, q) = sup_{f ∈ C(A)} ( E_{x∼p}[f(x)] − E_{x∼q}[f(x)] ).
2.1.3 Main contributions

Figure 2: This figure illustrates how our approach works by exhibiting the three learning steps. Step (1) consists in labeling samples in the target domain with a Covariate Shift adaptation in the representation space. Step (2) consists in evaluating the density ratio of representations drawn from the target domain with respect to representations drawn from the source domain. Step (3) consists in learning φ such that the conditional label distribution given φ(X) is invariant, assuming labeled samples are available in the target domain and the density ratio is exact.

In this section, we derive an optimization procedure from the Hidden Covariate Shift assumption. More specifically, we show that a given representation φ is a Hidden Covariate Representation if the joint distribution of (φ(X), Y) in the target domain and a re-weighted joint distribution in the source domain are equal. The re-weighting consists in a factor w(z) = p_T(z)/p_S(z) with z = φ(x). This implies the introduction of three different losses which we detail further:

  1. For a given representation φ, we need to estimate the density ratio w(z) = p_T(z)/p_S(z) (step (2) in Figure 2). The loss proxy for learning w for a given φ is called the Hidden Weight Loss.

  2. Since we address the context of Unsupervised Domain Adaptation, labeled samples in the target domain are not available. We estimate them through a target labeler (step (1) in Figure 2). We show that learning this labeler is equivalent to a Covariate Shift adaptation in the representation space. The loss proxy for learning it, for a given φ and w, is called the Hidden Covariate Loss.

  3. Assuming the density ratio and labeled samples are available in the target domain, it is possible to learn φ as a Hidden Covariate Representation by distribution matching (step (3) in Figure 2). The loss proxy for learning φ, for a given labeler and w, is called the Reweighted Distribution Matching Loss.

2.2 Formulation

2.2.1 From hidden covariate shift assumption to a distribution matching problem

Let φ denote a Hidden Covariate Representation and Z = φ(X). By definition, for all f ∈ C(𝒴) (the measurable bounded functions on the label space) and for all z:

E_S[f(Y) | Z = z] = E_T[f(Y) | Z = z]   (3)

The fact that equation 3 holds for all z is equivalent to:

E_S[f(Y) | Z] = E_T[f(Y) | Z]   (4)

which holds for all f ∈ C(𝒴). Since equation 4 holds for all f, we can add a dependency on z in f.¹ Thus, equation 4 is equivalent to:

E_S[f(Y, Z) | Z] = E_T[f(Y, Z) | Z]   (5)

where f belongs to the set of functions such that y ↦ f(y, z) is measurable and bounded and z ↦ f(y, z) is measurable. Then, it is equivalent to:

E_{z∼p_T}[ E_S[f(Y, Z) | Z = z] ] = E_T[f(Y, Z)]   (6)

which consists in taking the expectation over the variable Z for Z ∼ p_T. Noting that p_T(z) = w(z)·p_S(z) where w(z) = p_T(z)/p_S(z), we obtain:

E_S[w(Z) f(Y, Z)] = E_T[f(Y, Z)]   (7)

The application of the transfer theorem with Z = φ(X) leads to:

E_S[w(φ(X)) f(Y, φ(X))] = E_T[f(Y, φ(X))]   (8)

where the equality holds for all f. To summarize, the Hidden Covariate Shift assumption is equivalent to the equality between two distributions: the joint distribution of (φ(X), Y) in the target domain and its re-weighted counterpart in the source domain. We have shown this equality is equivalent to a null discrepancy between the joint distribution in the target domain and a re-weighted version in the source domain. To compute this discrepancy, we must evaluate the density ratio w. Such a task can be challenging in high dimension and we may want to keep the dimension of the representation reasonably low. Additionally, we need to estimate the labels in the target domain.

¹The choice of a relevant f may differ with the value of z: considering two values z ≠ z′, the function which exhibits the equality at z may differ from the function which exhibits it at z′.

2.2.2 Losses

Hidden Covariate Loss

For a given Hidden Covariate Representation φ, the source and target distributions of Z = φ(X) verify the covariate shift assumption introduced in (Sugiyama et al., 2007). The authors have shown that the Domain Adaptation problem can then be solved by instance re-weighting in the loss function of label estimation. In our context, the re-weighting is done in the representation space. Thus, the label estimation in the target domain is obtained by minimizing, over target labelers g, the re-weighted supervised loss

E_S[ w(φ(X)) · ℓ(g(φ(X)), Y) ]

where w(z) = p_T(z)/p_S(z) and ℓ is the supervised loss of the task (e.g. cross-entropy).

Hidden Weight Loss

For a given Hidden Covariate Representation φ, we have shown in equation 8 that Domain Adaptation relies on the estimation of the density ratio w(z) = p_T(z)/p_S(z). (Gretton et al., 2007) show that such weights verify E_S[w(Z) f(Z)] = E_T[f(Z)] for all measurable bounded f; thus an estimation is obtained by minimizing the discrepancy between the re-weighted source distribution of Z and its target distribution.
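As an illustration, such weights can be estimated in the spirit of Kernel Mean Matching (Huang et al., 2007) by projected gradient descent on the squared MMD between the re-weighted source representations and the target representations; the kernel bandwidth, learning rate and step count below are assumptions:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Pairwise Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def estimate_weights(Zs, Zt, sigma=1.0, lr=1.0, steps=300):
    """Estimate w(z) ~ p_T(z) / p_S(z) at the source points Zs by
    minimizing MMD^2 between the re-weighted source sample and the
    target sample Zt (projected gradient: w >= 0, mean(w) = 1)."""
    m, n = len(Zs), len(Zt)
    Kss = gaussian_gram(Zs, Zs, sigma)
    kst = gaussian_gram(Zs, Zt, sigma).sum(axis=1)
    w = np.ones(m)
    for _ in range(steps):
        # gradient of (w' Kss w) / m^2 - 2 (w . kst) / (m n)
        grad = 2.0 * (Kss @ w) / m**2 - 2.0 * kst / (m * n)
        w = np.clip(w - lr * grad, 0.0, None)
        w /= w.mean()  # keep the re-weighted source a probability mass
    return w
```

On one-dimensional source points drawn around 0 and target points drawn around 1, the estimated weights are larger for source points lying on the target side, as expected.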

Reweighted Distribution Matching Loss

Assuming that estimations of both the target labels and the density ratio w are available, following Section 2.2.1 an estimation of φ is obtained by minimizing the following loss: the MMD between the estimated target joint distribution of (φ(X), Ŷ) and the re-weighted source joint distribution of (φ(X), Y), where Ŷ denotes the estimated labels in the target domain.

2.3 Learning

2.3.1 Unbiased estimation of the losses with RKHS

Since the supremum over C(A) is highly intractable, the work of (Gretton et al., 2007) suggests using a Reproducing Kernel Hilbert Space (RKHS). For the unit ball of an RKHS associated with a kernel k, the supremum has a closed form:

MMD²(p, q) = E_{x,x′∼p}[k(x, x′)] − 2 E_{x∼p, y∼q}[k(x, y)] + E_{y,y′∼q}[k(y, y′)].

Furthermore, the authors have derived an unbiased empirical estimate of MMD² given m samples (x_i) drawn from p and n samples (y_j) drawn from q:

MMD²_u = (1/(m(m−1))) Σ_{i≠j} k(x_i, x_j) + (1/(n(n−1))) Σ_{i≠j} k(y_i, y_j) − (2/(mn)) Σ_{i,j} k(x_i, y_j).
In our specific context, for a given φ and w, the samples matched are the pairs (φ(x), y) with a joint kernel on 𝒵 × 𝒴; the Reweighted Distribution Matching Loss and the Hidden Weight Loss can then be expressed as MMD quantities of the form above, with the source terms re-weighted by w.
Following the unbiased estimation suggested in (Gretton et al., 2007), and considering source samples with their labels together with target samples associated with estimated labels, we derive the corresponding empirical loss estimators. Details on kernel estimation are given in Appendix B.
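As a concrete sketch, the unbiased MMD² estimate above can be computed as follows with a Gaussian kernel (the bandwidth σ is a free parameter here, distinct from the kernel choices of Appendix B):

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Pairwise Gaussian kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased empirical estimate of MMD^2 (Gretton et al., 2007):
    off-diagonal means of the within-sample Gram matrices minus twice
    the mean of the cross Gram matrix."""
    m, n = len(X), len(Y)
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    t_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    t_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return t_xx + t_yy - 2.0 * Kxy.mean()
```

Samples from the same distribution give an estimate close to zero (possibly slightly negative, since the estimator is unbiased), while shifted samples give a clearly positive value.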

2.3.2 Optimization procedure

Using the previous results, learning a Hidden Covariate Representation consists in solving the following optimization problem:

minimize the Reweighted Distribution Matching Loss over φ, such that the target labeler minimizes the Hidden Covariate Loss and the weights minimize the Hidden Weight Loss. (19)

Since the model can collapse to a state where the representation and the label are independent in both the source and the target domains (which trivially makes the conditional label distribution invariant), we add the loss of the supervised task as a regularization term, weighted by a trade-off hyper-parameter (equation 21).

We suggest a simple optimization procedure detailed in Algorithm 1. We use the notation φ_θ to emphasize that φ is parametrized by a set of parameters θ. Details about the procedure are given in Appendix A.
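The alternating structure of the procedure can be sketched as follows; the `Component` stub and the placeholder losses are stand-ins for the actual modules and loss computations:

```python
class Component:
    """Stand-in for a trainable module (labeler g, weights w, or phi):
    step() would apply one gradient update on the given loss value."""
    def __init__(self):
        self.updates = 0
    def step(self, loss):
        self.updates += 1  # a real module would backpropagate here

def train(phi, g_t, w, losses, outer_steps, inner_steps=5):
    """Alternating optimization: g_t and w are refreshed several times
    (here 5, as in our experiments) before each update of phi."""
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            g_t.step(losses["hidden_covariate"](phi, g_t, w))
            w.step(losses["hidden_weight"](phi, w))
        phi.step(losses["matching"](phi, g_t, w) + losses["supervised"](phi))

phi, g_t, w = Component(), Component(), Component()
losses = {  # placeholder loss values for illustration only
    "hidden_covariate": lambda *a: 0.0,
    "hidden_weight": lambda *a: 0.0,
    "matching": lambda *a: 0.0,
    "supervised": lambda *a: 0.0,
}
train(phi, g_t, w, losses, outer_steps=10)
```

The key design choice is that the labeler and the weights are kept approximately optimal for the current representation before the representation itself moves.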

2.3.3 Model selection with hidden reversed validation

Most Domain Adaptation methods depend on a large set of hyper-parameters, and a major difficulty of Domain Adaptation is the lack of labeled data in the target domain. (Zhong et al., 2010) suggest using Reverse Validation to select the best hyper-parameters. This method consists in running the domain adaptation procedure in order to infer labels on the target domain; then, the same Domain Adaptation method is used in the reversed source/target situation using the estimated labels in the target domain. The best model is finally chosen by comparing the labels predicted in this reverse step with the source ground truth. In a Covariate Shift situation, Domain Adaptation only needs to learn instance weights. It is straightforward to show that the second adaptation then only consists in learning a new labeler by minimizing a re-weighted supervised loss, using the estimated labels in the target domain and the density ratio learned during the first adaptation. In our context, we aim to learn a function φ such that the source and target distributions of φ(X) are in a Covariate Shift situation. We therefore suggest Hidden Reverse Validation, which consists in learning φ with our proposed approach and validating the model with reverse validation in the representation space, by only learning a new labeler through a re-weighted supervised loss. Hidden Reverse Validation has the major advantage of dramatically reducing computation time, since the second adaptation is only an instance re-weighted supervised problem.
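A minimal sketch of the Hidden Reverse Validation score, with a weighted nearest-centroid classifier standing in for the re-weighted supervised learner (any weighted classifier would do):

```python
import numpy as np

def weighted_centroids(Z, y, w):
    """Class centroids of representations Z, weighted by w (binary labels)."""
    c0 = (w[y == 0, None] * Z[y == 0]).sum(0) / w[y == 0].sum()
    c1 = (w[y == 1, None] * Z[y == 1]).sum(0) / w[y == 1].sum()
    return c0, c1

def hidden_reverse_score(Zs, ys, Zt, yt_hat, w_back):
    """Reverse step: fit a labeler on target representations with the
    estimated labels yt_hat (re-weighted by w_back, the density ratio
    in the reversed direction), then score it on source ground truth."""
    c0, c1 = weighted_centroids(Zt, yt_hat, w_back)
    d0 = np.linalg.norm(Zs - c0, axis=1)
    d1 = np.linalg.norm(Zs - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return (pred == ys).mean()
```

The hyper-parameter configuration with the highest reverse score would then be selected.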

3 Experiments

3.1 Baselines

The literature on Domain Adaptation is large and vastly different methods can be employed (learning invariant features with or without label conditioning). Moreover, several discrepancy measures may be used: divergence measures (e.g. KL-divergence, Mutual Information), integral divergence measures (such as MMD) or Optimal Transport based measures (the Wasserstein distance, which is also an integral divergence measure). For the sake of simplicity and to ensure a fair comparison with methods from the literature, we focus on differences between invariance assumptions; the same discrepancy measure (MMD with the same kernels) is used throughout. We detail how the distribution matching loss is modified for each baseline:

Based on p(Z) invariance

It learns to match p_S(Z) and p_T(Z); it thus implicitly assumes that the label distribution is conserved across domains. The distribution matching loss is the plain MMD between source and target representations. This invariance covers (Ganin & Lempitsky, 2014; Ganin et al., 2016; Shen et al., 2018; Baktashmotlagh et al., 2013).

Based on p(Z|Y) invariance

It learns to match p_S(Z|Y) and p_T(Z|Y); the re-weighting thus depends only on the labels Y. The distribution matching loss is the class-re-weighted MMD. This invariance covers namely (Yan et al., 2017; Manders et al., 2018; Chen et al., 2018; Zhang et al., 2013).

We will flag our method as based on p(Y|Z) invariance.

3.2 A toy experiment on synthetic data

We generate synthetic two-dimensional data for a binary classification problem which exhibits a challenging case of distributional shift (equations 22, 23, 24 below and Figure 3 (Top)). In this setting, we observe a Target Shift situation since the label distribution changes across domains. Besides, it suffers from Concept Drift, with a dependency between the label and one of the features that is inverted between the source and target domains. Furthermore, the conditional label distribution given the other feature is conserved, but the distribution of this feature is normal in the source domain and uniform in the target domain; thus the data verify the Hidden Covariate Shift assumption. The model we considered is a linear model where features, which lie in a two-dimensional space, are projected to a representation lying in a one-dimensional space. A sigmoid layer is then applied on the representation to determine the label. The estimation of the density ratio is done with a two-layer neural network with a non-linearity and 10 hidden dimensions.

We report in Figure 3 the distribution of the representation and the estimation of the density ratio for four snapshots equally spaced during learning. We observe that after the pretraining step (top-left figure), representations from the target domain are not well separated and seem to have a Gaussian component. The main reason is that the model uses the drifting feature to infer the label. During learning (from top left, to top right, to bottom left and to bottom right), we observe that the distribution of representations from the target domain switches from a Gaussian to a uniform distribution (the distribution of the conserved feature in the target domain): the model has learned to forget the drifting feature when inferring labels on the target domain. Furthermore, we can observe that the estimation of the density ratio integrates both the label ratio (to the left of the linear separation, the density ratio is much higher than to the right) and the fact that representations from the target domain have a higher density close to the separator than representations from the source domain (since representations in the target domain are harder to separate).

Figure 3: Top: Sample of synthetic data for the toy experiment. The black vertical line is the optimal linear separation. Bottom: four snapshots of learning with our suggested approach, reporting the representation distribution (top) and the density ratio estimation (bottom). The black vertical line is the learned separation in the hidden space.

3.3 Amazon Reviews dataset

Task No DA p(Z) inv. p(Z|Y) inv. Ours (p(Y|Z) inv.)
ED 72.1 76.1 74.8 75.4 / 76.1
EB 71.8 74.4 71.8 / 73.3 73.2 / 74.2
EK 83.8 86.1 / 87.3 85.8 / 86.8 85.4
DE 73.6 82.2 79.8 / 81.0 80.2 / 81.0
DK 77.3 83.0 83.0 82.2 / 83.4
DB 78.6 78.8 / 79.4 77.0 77.9 / 78.6
BD 78.7 80.3 79.7 / 80.0 78.4 / 78.6
BE 72.9 80.4 / 80.6 79.2 / 79.6 78.5 / 79.0
BK 76.1 83.7 81.8 82.8
KE 83.2 84.4 82.0 / 83.1 83.6
KD 73.9 80.4 / 80.6 79.2 / 79.6 78.5 / 79.0
KB 72.4 80.5 / 82.7 79.8 / 81.0 80.2 / 81.0
AVG 76.2 80.9 / 81.2 79.5 / 80.1 79.7 / 80.2
Table 1: Accuracy on the Amazon Reviews dataset (standard benchmark). The baseline based on p(Z) invariance significantly outperforms the approaches based on p(Z|Y) and p(Y|Z) invariance. This is due to its implicit prior knowledge of the target label distribution.

In our experiment on real-world data, we used the Amazon Reviews dataset (Blitzer et al., 2007). This dataset contains reviews collected on the Amazon website about products from four categories (Books (B), DVD (D), Kitchen (K), Electronics (E)); reviews with a rating higher than 4 are considered positive, and negative otherwise. The authors introduced a preprocessed version where relevant unigrams and bigrams are selected. This allows evaluating the model on 12 Domain Adaptation tasks (one per ordered pair of categories).

In order to follow the preprocessing choices of the literature, for each task we selected the 5000 most frequent features encoded as a bag of words (Ganin & Lempitsky, 2014). The model used in our experiments is a one-layer neural network with 50 hidden dimensions and a non-linearity for learning the representation, and a two-hidden-layer neural network with 10 hidden dimensions for learning the density ratio; the labelers are sigmoid layers. We used a slow optimizer (RMSProp) with a small learning rate and a large batch of 128 samples; the labeler and the weight estimator are updated 5 times at each update of the representation. Training is stopped after 30 epochs. Those choices are motivated by the numerical stability observed during experiments. The trade-off hyper-parameter (equation 21) was selected by Hidden Reverse Validation, repeating the experiment for 3 different but fixed seeds for replicability. Thus, performance on each task is selected among 9 runs. For transparency, we report two results separated with '/': the left is the performance of the model selected by Hidden Reverse Validation and the right is the best performance observed among the 9 runs. This allows observing the variability of the methods. We bold the best model comparing methods based on p(Y|Z) versus p(Z|Y) invariance, since methods based on p(Z) invariance have a clear advantage: their hypothesis is verified for the standard benchmark and the Concept Drift experiment. We report in Table 1 the accuracy of our approach compared to the baselines on the standard Amazon Reviews benchmark.

3.3.1 Concept Drift

We filtered the original datasets to obtain a Concept Drift situation. For a pair of datasets (D1, D2), we build the source domain by over-representing positive reviews from D1 and negative reviews from D2. The target domain is balanced: it has the same number of negative and positive reviews from each dataset. More formally, we sample the source domain such that positive reviews are over-represented in D1 and negative reviews in D2, while conserving the overall proportions of domains and labels. For the target domain, we take a sample non-overlapping with the source dataset, with balanced labels within each dataset. The Concept Drift situation occurs since determining whether a sample is drawn from D1 or D2 helps to infer the label in the source domain but does not help in the target domain. We report results in Table 2 as the accuracy on the conventional test set.
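As an illustration, the source split described above can be sketched as follows (the proportion parameter and the sizes are hypothetical; the real splits are built analogously from the review datasets):

```python
import numpy as np

def concept_drift_split(Xa, ya, Xb, yb, p_pos_a, n_total, rng):
    """Source domain for the Concept Drift setting: over-represent
    positive reviews from dataset A and negative reviews from dataset B,
    while keeping P(D=A) = 1/2 and P(Y=1) = 1/2 overall."""
    n_half = n_total // 2
    n_pos_a = int(p_pos_a * n_half)

    def take(X, y, label, k):
        idx = rng.choice(np.flatnonzero(y == label), size=k, replace=False)
        return X[idx], y[idx]

    parts = [take(Xa, ya, 1, n_pos_a),          # many positives from A
             take(Xa, ya, 0, n_half - n_pos_a),  # few negatives from A
             take(Xb, yb, 1, n_half - n_pos_a),  # few positives from B
             take(Xb, yb, 0, n_pos_a)]           # many negatives from B
    X = np.concatenate([p[0] for p in parts])
    y = np.concatenate([p[1] for p in parts])
    d = np.concatenate([np.zeros(n_half), np.ones(n_half)])  # 0 = A, 1 = B
    return X, y, d
```

By construction the dataset indicator is predictive of the label in the source domain only, which is exactly the Concept Drift we want.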

Task No DA p(Z) inv. p(Z|Y) inv. Ours (p(Y|Z) inv.)
(E, D) 71.1 78.1 76.4 75.8
(E, B) 70.3 75.3 / 77.3 74.0 / 76.1 73.0 / 75.7
(E, K) 83.2 86.6 84.9 / 85.4 85.1 / 85.2
(D, E) 76.7 83.0 80.3 / 82.7 82.1
(D, K) 75.5 85.2 83.9 / 85.4 82.7 / 83.1
(D, B) 75.4 78.0 75.2 / 76.4 76.3 / 78.0
(B, D) 75.0 79.3 76.5 / 77.9 78.0 / 78.1
(B, E) 76.7 82.3 81.4 / 81.8 81.0
(B, K) 77.7 85.3 82.5 / 85.2 83.4 / 84.7
(K, E) 81.6 82.1 / 83.1 80.3 / 82.1 82.4 / 83.2
(K, D) 73.7 76.0 / 76.6 74.2 / 76.7 75.6 / 76.3
(K, B) 71.7 76.9 74.6 / 76.4 74.0 / 75.7
AVG 75.7 80.7 / 80.9 78.7 / 80.2 79.1 / 79.9
Table 2: Accuracy on the Amazon Reviews dataset in the Concept Drift situation. Notation (E, B) stands for a source domain over-representing positive reviews from E and negative reviews from B. p(Z) invariance provides significant improvements. This is due to its implicit prior knowledge of the label distribution in the target domain.

3.3.2 Target shift

Task No DA p(Z) inv. p(Z|Y) inv. Ours (p(Y|Z) inv.)
ED 71.2 71.6 / 73.0 72.6 73.8
EB 70.9 72.2 70.6 / 71.7 72.3
EK 85.1 81.1 / 81.5 83.7 / 85.1 84.9 / 85.9
DE 74.2 79.0 / 79.5 83.2 80.8
DK 76.4 79.0 / 79.4 81.3 78.1 / 81.6
DB 78.7 75.9 / 76.9 77.6 77.6
BD 78.2 78.1 78.3 / 78.6 79.1
BE 71.1 76.9 / 77.3 78.8 / 80.4 77.2
BK 76.9 81.0 82.6 82.6
KE 82.2 81.6 83.4 80.4
KD 73.1 72.9 / 74.4 71.8 / 72.0 76.1
KB 73.2 71.7 / 73.1 70.2 / 75.1 73.9
AVG 76.0 76.8 / 77.3 77.8 / 78.6 78.1 / 78.4
Hard Target Shift (same task order; columns: p(Z) inv., p(Z|Y) inv., Ours):
69.4 71.3 72.8
69.5 66.7 / 67.8 69.0 / 69.9
74.8 / 75.0 82.9 84.0
71.1 / 74.8 77.1 / 79.1 79.3
73.4 80.4 78.5 / 78.6
73.1 / 73.3 75.9 75.5
72.8 76.8 / 77.7 77.3 / 78.2
69.8 / 72.5 73.0 / 77.4 75.6 / 76.4
75.3 79.9 / 80.5 75.4 / 79.2
74.3 81.4 / 82.3 83.2
70.9 72.7 / 73.4 73.8 / 74.1
72.1 70.9 / 71.5 73.3
AVG 72.2 / 72.8 75.8 / 76.7 76.5 / 77.0
Table 3: Performance on the Amazon Reviews dataset with Soft Target Shift (top) and Hard Target Shift (bottom). The method based on p(Z) invariance fails to learn in this context, while our approach is marginally better than the baseline based on p(Z|Y) invariance.

We filtered the original dataset to obtain a Target Shift situation. This consists in randomly rejecting samples in order to obtain a target set with a desired amount of Target Shift. We investigate two cases: Soft Target Shift and Hard Target Shift. We report the results of the experiment in Table 3.
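The subsampling described above can be sketched as follows (the desired positive rate is a free parameter; the actual rates used in the Soft and Hard settings are not reproduced here):

```python
import numpy as np

def target_shift_sample(X, y, p_pos, n, rng):
    """Subsample (X, y) so the returned set of size n has positive rate
    p_pos, emulating random rejection of samples to induce Target Shift."""
    n_pos = int(round(p_pos * n))
    pos = rng.choice(np.flatnonzero(y == 1), size=n_pos, replace=False)
    neg = rng.choice(np.flatnonzero(y == 0), size=n - n_pos, replace=False)
    idx = np.concatenate([pos, neg])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Applying this to the target set only leaves the source label distribution untouched, so the label marginals differ across domains as required.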

3.3.3 Analysis

We note that our approach does learn, since it improves over the model without Domain Adaptation on the standard benchmark (see Table 1). In the Concept Drift situation, our approach still improves over the model without Domain Adaptation. In both cases, our approach remains significantly below the baseline based on p(Z) invariance. This drop in performance is explained by the fact that the p(Z) invariance baseline has a major advantage: its implicit prior knowledge of the target label distribution. Nevertheless, in the context of Target Shift, the baseline based on p(Z) invariance fails to learn: the model degrades badly compared to the model without Domain Adaptation in the Hard Target Shift setting. Our approach still learns in this situation, improving performance with respect to the model without Domain Adaptation in both the soft and the hard settings. Across the standard benchmark, the Concept Drift and the Target Shift situations, our approach is globally marginally better than the baseline based on p(Z|Y) invariance (from +0.2 points on the standard benchmark to +0.7 points for Hard Target Shift). This may be explained by the fact that our approach can adapt itself to each situation by learning to set the weights appropriately, making it a more generic formulation.

4 Related work

Addressing Target Shift is known to be a difficult problem. (Manders et al., 2018) and (Yan et al., 2017) have shown it is possible to adapt, respectively, the adversarial framework of (Ganin & Lempitsky, 2014) and MMD-based methods (Gretton et al., 2007) for learning invariant representations by re-weighting the domain discrepancy measure at each training step with an estimated class distribution in the target domain. In the context of Optimal Transport based discrepancy measures (Arjovsky et al., 2017; Shen et al., 2018), (Chen et al., 2018) suggest learning the class ratio adversarially, incorporating it into the supremum over the dual critic function of the Wasserstein measure. Conditional Shift (where p(X|Y) may change while keeping p(Y) constant) is also a common assumption for extending invariant representation methods to challenging contexts of distributional shift. Such methods traditionally use a location-scale transformation for learning the representation (Zhang et al., 2013). (Gong et al., 2016) suggest a component-wise version of location-scale transformations to avoid noisy features which cannot be well matched. Those methods are naturally extended in their original work to the context of both Target Shift and Conditional Shift using class ratio estimation.

Previous methods are essentially based on matching marginals or representations conditioned on labels. However, Joint Domain Adaptation methods have recently received a lot of attention and seem well-suited for tackling challenging distributional shift cases. (Courty et al., 2017; Damodaran et al., 2018) have shown it is possible to learn a non-linear function minimizing an Optimal Transport cost between the joint distribution on the source and the estimated joint distribution on the target. Furthermore, this approach was successfully adapted to the context of Target Shift (Redko et al., 2018). (Long et al., 2016, 2018) are the first, to our knowledge, to perform MMD on joint distributions. They suggest joining internal states of a neural network and performing MMD in the tensor product of the reproducing kernel Hilbert spaces associated with each layer. They claim their formulation handles harder distributional shift cases than traditional invariant representation methods, since it weights each layer of the network with respect to the others in the kernel mean embedding space. Our work mainly differs from (Long et al., 2016, 2018) by proposing a constrained re-weighting structure in the kernel mean embedding space. This structural constraint is derived from the Hidden Covariate Shift assumption.

5 Discussion and Future Work

In the present work, we have explored a method for performing Unsupervised Domain Adaptation based on the Hidden Covariate Shift assumption. To our knowledge, the proposed approach is novel and differs from current approaches by incorporating during learning the implicit assumption of Domain Adaptation: learning a representation such that the conditional label distribution given that representation is conserved. We adapted the reverse validation method for model selection to our specific case, suggesting Hidden Reverse Validation. Furthermore, our approach has the interest of taking the best of two traditional approaches, namely Covariate Shift and Domain Invariant Representations. We have shown the viability of our formulation in contexts of Target Shift or Concept Drift on both synthetic and real-world data. We reported performance comparable to state-of-the-art approaches on the Amazon Reviews dataset. Besides, we have observed that Hidden Reverse Validation may not always reflect the performance of the selected model at test time on the target set. This may result from the fact that we need to estimate a density ratio in high dimension, which may be highly noisy.

In future work, we want to exhibit situations where our formulation has a clear advantage with respect to state-of-the-art approaches. We believe that p(Y|Z) invariance is a weaker constraint than p(Z) invariance. On the one hand, this allows learning representations with domain-specific information, although such information is not used during inference. This can be a desideratum in a context of Multi-Task Learning or Transfer Learning. On the other hand, this weaker constraint leads to an over-parametrized formulation. Although it is not theoretically justified to look for p(Z) invariance for learning with an intermediate representation, it has the advantage of regularizing the model with a reasonable heuristic. This intuition was confirmed during experiments when some runs collapsed to learning exactly the ratio of label distributions. Therefore, future work will focus on introducing regularization compatible with the Hidden Covariate Shift hypothesis. Finally, we want to extend our formulation to deep neural networks, where the network learns a sequence of representations such that the conditional label distribution at each layer is conserved across domains.


This work was funded by Sidetrade and ANRT (France).


  • Arjovsky et al. (2017) Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • Baktashmotlagh et al. (2013) Baktashmotlagh, Mahsa, Harandi, Mehrtash T, Lovell, Brian C, and Salzmann, Mathieu. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 769–776, 2013.
  • Bickel et al. (2007) Bickel, Steffen, Brückner, Michael, and Scheffer, Tobias. Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pp. 81–88. ACM, 2007.
  • Blitzer et al. (2007) Blitzer, John, Dredze, Mark, and Pereira, Fernando. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th annual meeting of the association of computational linguistics, pp. 440–447, 2007.
  • Candela et al. (2009) Candela, J Quiñonero, Sugiyama, Masashi, Schwaighofer, Anton, and Lawrence, Neil D. Dataset shift in machine learning, 2009.
  • Chen et al. (2018) Chen, Qingchao, Liu, Yang, Wang, Zhaowen, Wassell, Ian, and Chetty, Kevin. Re-weighted adversarial adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7976–7985, 2018.
  • Courty et al. (2017) Courty, Nicolas, Flamary, Rémi, Habrard, Amaury, and Rakotomamonjy, Alain. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, pp. 3730–3739, 2017.
  • Damodaran et al. (2018) Damodaran, Bharath Bhushan, Kellenberger, Benjamin, Flamary, Rémi, Tuia, Devis, and Courty, Nicolas. Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. arXiv preprint arXiv:1803.10081, 2018.
  • Gama et al. (2014) Gama, João, Žliobaitė, Indrė, Bifet, Albert, Pechenizkiy, Mykola, and Bouchachia, Abdelhamid. A survey on concept drift adaptation. ACM computing surveys (CSUR), 46(4):44, 2014.
  • Ganin & Lempitsky (2014) Ganin, Yaroslav and Lempitsky, Victor. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.
  • Ganin et al. (2016) Ganin, Yaroslav, Ustinova, Evgeniya, Ajakan, Hana, Germain, Pascal, Larochelle, Hugo, Laviolette, François, Marchand, Mario, and Lempitsky, Victor. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • Gong et al. (2016) Gong, Mingming, Zhang, Kun, Liu, Tongliang, Tao, Dacheng, Glymour, Clark, and Schölkopf, Bernhard. Domain adaptation with conditional transferable components. In International conference on machine learning, pp. 2839–2848, 2016.
  • Gretton et al. (2007) Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte, Schölkopf, Bernhard, and Smola, Alex J. A kernel method for the two-sample-problem. In Advances in neural information processing systems, pp. 513–520, 2007.
  • Gretton et al. (2012) Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte J, Schölkopf, Bernhard, and Smola, Alexander. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • Huang et al. (2007) Huang, Jiayuan, Gretton, Arthur, Borgwardt, Karsten M, Schölkopf, Bernhard, and Smola, Alex J. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pp. 601–608, 2007.
  • Kull & Flach (2014) Kull, Meelis and Flach, Peter. Patterns of dataset shift. In First International Workshop on Learning over Multiple Contexts (LMCE) at ECML-PKDD, 2014.
  • Long et al. (2016) Long, Mingsheng, Zhu, Han, Wang, Jianmin, and Jordan, Michael I. Deep transfer learning with joint adaptation networks. arXiv preprint arXiv:1605.06636, 2016.
  • Long et al. (2018) Long, Mingsheng, Cao, Zhangjie, Wang, Jianmin, and Jordan, Michael I. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1647–1657, 2018.
  • Manders et al. (2018) Manders, Jeroen, Marchiori, Elena, and van Laarhoven, Twan. Simple domain adaptation with class prediction uncertainty alignment. arXiv preprint arXiv:1804.04448, 2018.
  • Manski & Lerman (1977) Manski, Charles F and Lerman, Steven R. The estimation of choice probabilities from choice based samples. Econometrica: Journal of the Econometric Society, pp. 1977–1988, 1977.
  • Pan et al. (2010) Pan, Sinno Jialin, Yang, Qiang, et al. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • Patel et al. (2015) Patel, Vishal M, Gopalan, Raghuraman, Li, Ruonan, and Chellappa, Rama. Visual domain adaptation: A survey of recent advances. IEEE signal processing magazine, 32(3):53–69, 2015.
  • Redko et al. (2018) Redko, Ievgen, Courty, Nicolas, Flamary, Rémi, and Tuia, Devis. Optimal transport for multi-source domain adaptation under target shift. arXiv preprint arXiv:1803.04899, 2018.
  • Shen et al. (2018) Shen, Jian, Qu, Yanru, Zhang, Weinan, and Yu, Yong. Wasserstein distance guided representation learning for domain adaptation. In AAAI, 2018.
  • Storkey (2009) Storkey, Amos. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, pp. 3–28, 2009.
  • Sugiyama et al. (2007) Sugiyama, Masashi, Krauledat, Matthias, and Müller, Klaus-Robert. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005, 2007.
  • Sugiyama et al. (2008) Sugiyama, Masashi, Nakajima, Shinichi, Kashima, Hisashi, von Bünau, Paul, and Kawanabe, Motoaki. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pp. 1433–1440, 2008.
  • Tzeng et al. (2014) Tzeng, Eric, Hoffman, Judy, Zhang, Ning, Saenko, Kate, and Darrell, Trevor. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • Wen et al. (2014) Wen, Junfeng, Yu, Chun-Nam, and Greiner, Russell. Robust learning under uncertain test distributions: Relating covariate shift to model misspecification. In ICML, pp. 631–639, 2014.
  • Yan et al. (2017) Yan, Hongliang, Ding, Yukang, Li, Peihua, Wang, Qilong, Xu, Yong, and Zuo, Wangmeng. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, 2017.
  • Zadrozny et al. (2003) Zadrozny, Bianca, Langford, John, and Abe, Naoki. Cost-sensitive learning by cost-proportionate example weighting. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pp. 435–442. IEEE, 2003.
  • Zhang et al. (2013) Zhang, Kun, Schölkopf, Bernhard, Muandet, Krikamol, and Wang, Zhikun. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pp. 819–827, 2013.
  • Zhong et al. (2010) Zhong, Erheng, Fan, Wei, Yang, Qiang, Verscheure, Olivier, and Ren, Jiangtao. Cross validation framework to choose amongst models and datasets for transfer learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 547–562. Springer, 2010.

Appendix A Optimization procedure details

  Input: source data sampled from the source distribution, target data sampled from the target distribution, batch size, number of weight-update iterations, number of representation-update iterations.
  initialize the representation with supervision on the source domain (representation pre-training)
  initialize the weights by minimizing the joint-distribution matching objective (weight pre-training)
  repeat
     Sample a batch of data of the given size from the source and target empirical distributions
     for each weight-update iteration do
        update the weights on the current representations
     end for
     for each representation-update iteration do
        update the representation and the classifier
     end for
  until convergence
Algorithm 1 Learning hidden covariate representation
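The control flow of Algorithm 1 can be sketched as the following skeleton. This is a minimal sketch, not the authors' implementation: the function name and the callback signatures (`sample_batch`, `update_weights`, `update_representation`, `converged`) are our own illustrative choices standing in for the stripped update equations, which alternate re-weighting updates and representation updates on each mini-batch.

```python
def learn_hidden_covariate_representation(
        sample_batch,            # draws a joint source/target mini-batch
        update_weights,          # one gradient step on the re-weighting network
        update_representation,   # one gradient step on representation + classifier
        n_weight_steps,          # inner iterations for the weights
        n_repr_steps,            # inner iterations for the representation
        max_rounds=1000,
        converged=lambda round_idx: False):
    """Skeleton of Algorithm 1: after pre-training (not shown), alternate
    weight updates and representation updates until convergence."""
    for round_idx in range(max_rounds):
        batch = sample_batch()
        # inner loop 1: learn the re-weighting in the representation space
        for _ in range(n_weight_steps):
            update_weights(batch)
        # inner loop 2: update the representation and the classifier
        for _ in range(n_repr_steps):
            update_representation(batch)
        if converged(round_idx):
            break
    return round_idx
```

In practice each callback would perform a stochastic gradient step on the joint-distribution matching objective described in the main text; the skeleton only fixes the alternating schedule.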

Appendix B Training details: Kernels

We used the following kernels for computing Maximum Mean Discrepancy:

  • Linear kernels,

  • Gaussian kernels:


  • Quadratic kernels:


During learning, we need to compute kernels in the joint (representation, label) space. Since the label variable is discrete, we propose casting it into a continuous variable.
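The following sketch illustrates how such joint-space kernel evaluations can be implemented. The explicit kernel formulas were lost in extraction, so we assume the standard forms (linear, Gaussian with an assumed bandwidth `sigma`, inhomogeneous quadratic), and we use a one-hot encoding as one plausible way to cast the discrete label into a continuous vector; these choices are ours, not necessarily the paper's.

```python
import numpy as np

def linear_kernel(u, v):
    # standard form: k(u, v) = <u, v>
    return u @ v.T

def gaussian_kernel(u, v, sigma=1.0):
    # standard form: k(u, v) = exp(-||u - v||^2 / (2 sigma^2))
    sq_dists = ((u[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def quadratic_kernel(u, v):
    # standard inhomogeneous form: k(u, v) = (<u, v> + 1)^2
    return (u @ v.T + 1.0) ** 2

def one_hot(y, n_classes):
    # cast the discrete label into a continuous vector so that kernels
    # can be evaluated on the joint (representation, label) space
    return np.eye(n_classes)[y]

def joint(z, y, n_classes):
    # joint-space sample: representation concatenated with encoded label
    return np.concatenate([z, one_hot(y, n_classes)], axis=1)

def mmd2(src, tgt, kernel):
    # biased empirical estimate of the squared Maximum Mean Discrepancy
    return (kernel(src, src).mean() + kernel(tgt, tgt).mean()
            - 2.0 * kernel(src, tgt).mean())
```

With this encoding, `mmd2(joint(z_s, y_s, K), joint(z_t, y_t, K), gaussian_kernel)` estimates the discrepancy between the two joint distributions in representation space; the estimate is zero when both samples coincide and grows as the samples drift apart.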