Domain Generalization via Conditional Invariant Representation

07/23/2018
by Ya Li, et al.
The University of Sydney
USTC

Domain generalization aims to apply knowledge gained from multiple labeled source domains to unseen target domains. The main difficulty comes from the dataset bias: training data and test data have different distributions, and the training set contains heterogeneous samples from different distributions. Let X denote the features, and Y be the class labels. Existing domain generalization methods address the dataset bias problem by learning a domain-invariant representation h(X) that has the same marginal distribution P(h(X)) across multiple source domains. The functional relationship encoded in P(Y|X) is usually assumed to be stable across domains such that P(Y|h(X)) is also invariant. However, it is unclear whether this assumption holds in practical problems. In this paper, we consider the general situation where both P(X) and P(Y|X) can change across all domains. We propose to learn a feature representation which has domain-invariant class conditional distributions P(h(X)|Y). With the conditional invariant representation, the invariance of the joint distribution P(h(X),Y) can be guaranteed if the class prior P(Y) does not change across training and test domains. Extensive experiments on both synthetic and real data demonstrate the effectiveness of the proposed method.

Introduction

Recent years have witnessed great success of supervised learning in various pattern recognition problems, such as image classification, object detection, and speech recognition. Standard supervised learning relies heavily on the i.i.d. data assumption; however, dataset bias is unavoidable in many situations due to selection bias or mechanism changes. For example, this problem has been well recognized in the computer vision community [Torralba and Efros2011, Khosla et al.2012]: the widely adopted vision datasets have their own special properties and are not representative of the visual world. In medical diagnosis, the distribution of cell types varies from patient to patient, and we need to train a classifier on the data collected from previous patients that generalizes well to unseen patients [Blanchard, Lee, and Scott2011, Muandet, Balduzzi, and Schölkopf2013]. These problems are known as domain generalization, in which the training set consists of data from heterogeneous source domains, say patients, and the test data distribution is different from that of the training data.

To handle the distribution changes, many existing domain generalization methods aim to learn domain-invariant representations that have stable distributions across all source domains [Muandet, Balduzzi, and Schölkopf2013, Erfani et al.2016, Ghifary et al.2017]. The learned invariant representations are expected to generalize well to any unseen test set under the assumption that the changes of distribution across source and test domains are caused by some common factors whose effects are removed in the invariant representations. In computer vision, such factors could be illumination, camera viewpoints, and backgrounds. These methods have achieved good performance in computer vision [Ghifary et al.2015, Ghifary et al.2017] and medical diagnosis [Muandet, Balduzzi, and Schölkopf2013].

However, existing methods that learn domain-invariant representations assume that only P(X) changes across domains while the conditional distribution P(Y|X) stays rather stable. Thus, the conditional distribution P(Y|h(X)) is also invariant, and the learning problem reduces to ensuring that the marginal distribution P(h(X)) is invariant across domains. This assumption greatly simplifies the problem, but it is unclear whether it holds in practical situations. According to some recent results in causal learning [Schölkopf et al.2012, Janzing and Scholkopf2010], P(Y|X) can be stable when P(X) changes in the situation where X is the cause of Y, i.e., the causal structure is X → Y. This is because the mechanism that generates the cause, i.e., P(X), is not coupled with the mechanism that generates the effect from the cause, i.e., P(Y|X), but not vice versa. That is to say, if Y is the cause and X is the effect, P(Y|X) often changes together with P(X); in this situation, if P(X) changes, it is very likely that P(Y|X) also changes across domains, which violates the assumed stability of P(Y|X). In practice, there are plenty of problems where the causal structure is Y → X. For example, in face recognition, Y is the person identity, X is the feature, and the domain corresponds to the viewpoint. If we consider each viewpoint as a domain, then in each domain j we have a class-conditional distribution $P^{j}(X \mid Y)$. According to Bayes' theorem,

$P^{j}(Y \mid X) = \dfrac{P^{j}(X \mid Y)\, P^{j}(Y)}{P^{j}(X)},$

thus P(Y|X) changes across domains. This conflicts with the previous assumption that P(Y|X) keeps unchanged. There are also other examples, e.g., speaker recognition and person re-identification [Yang et al.2017].

In this paper, we assume that both P(X) and P(Y|X) change across domains. We aim to find a feature transformation h(X) that has an invariant class-conditional distribution P(h(X)|Y). To achieve this, we propose to minimize two regularization terms that enforce distribution invariance across source domains. The first term measures the variance of each class-conditional distribution across all source domains and then sums up the variances over all classes. The second term is the variance of the class prior-normalized marginal distribution P(h(X)), which measures the global distribution discrepancy. The normalization of class priors is introduced to remove the effects brought by possible changes in P(Y) across source domains. If the prior distribution does not change across source domains, the second term reduces to the common technique used in existing domain-invariant representation learning methods [Muandet, Balduzzi, and Schölkopf2013, Ghifary et al.2017]. To preserve the discriminative power of the learned representation, we also incorporate the intra-class and inter-class distances used in kernel Fisher discriminant analysis (FDA) [Mika et al.1999].

Compared to existing domain-invariant representation learning methods, our method does not require the assumption of a stable P(Y|X); instead, it exploits the labels on the source domains, which were overlooked by previous methods. In particular, if the prior distribution P(Y) on the test set is the same as that on the training set containing all source domains, our method is able to learn representations that have an invariant joint distribution P(h(X), Y) across all domains. We conduct a series of experiments on both synthetic and real data, and the results demonstrate the effectiveness of our method.

Related Work

Domain generalization has been widely applied in classification tasks [Xu et al.2014, Duan et al.2009, Muandet, Balduzzi, and Schölkopf2013, Ghifary et al.2017, Ghifary et al.2015, Erfani et al.2016]. Compared with standard supervised learning, domain generalization methods aim to reduce the data bias across different domains and improve the generalization of the learned model to unseen but related domains. For example, [Xu et al.2014] assumed that positive samples within the same shared latent domain should have similar likelihood and proposed to exploit the low-rank structure of latent domains for domain generalization. [Muandet, Balduzzi, and Schölkopf2013] proposed domain-invariant component analysis (DICA), which learns an invariant feature representation by minimizing the difference between the marginal distributions of the source domains. [Ghifary et al.2017] proposed a unified framework called scatter component analysis for domain adaptation and domain generalization, which combines domain scatter [Muandet, Balduzzi, and Schölkopf2013], kernel PCA [Schölkopf, Smola, and Müller1998], and kernel FDA [Mika et al.1999] in a single objective function. However, all these methods assume that the distributions of different domains differ only in the marginal distribution P(X) while the conditional distribution P(Y|X) remains stable across domains. This assumption simplifies the problem of domain generalization, but it is easily violated in real-world applications.

Domain adaptation is a related problem which has been extensively studied in the literature [Baktashmotlagh et al.2013, Huang et al.2007, Pan et al.2011, Long et al.2017, Shao, Kit, and Fu2014, Shao et al.2016, Luo et al.2017, Liu, Yang, and Tao2017]. Assuming that only P(X) changes, the distribution changes can be corrected by importance reweighting [Huang et al.2007] or domain-invariant feature learning [Pan et al.2011, Baktashmotlagh et al.2013], using unlabeled data from the source and target domains. Recently, several works have considered the situation where both P(X) and P(Y|X) change across domains [Zhang et al.2013, Gong et al.2016, Long et al.2017]. [Zhang et al.2013] and [Gong et al.2016] considered the domain adaptation problem in the generalized target shift (GeTarS) scenario, where the causal direction is Y → X; in this scenario, both the change of the prior distribution P(Y) and of the class-conditional distribution P(X|Y) are considered to reduce the data bias across domains. [Zhang et al.2013] assumed that features from the source domains can be transferred to the target domain by a location-scale transformation, which is restrictive in real-world applications because of the presence of noise in the features. [Gong et al.2016] proposed to learn components whose class-conditional distribution is invariant across domains and to estimate the target label distribution from labeled source domain data and unlabeled target domain data. Since there are no labels in the target domain to match class-conditionals, the invariance of the class-conditionals is achieved by minimizing the discrepancy of the marginal distributions under some untestable assumptions. [Long et al.2017] proposed an iterative way to match the conditionals by using the predicted labels from previous iterations as pseudo labels. Different from these domain adaptation methods, domain generalization does not require unlabeled data from the target domains.

Conditional Invariant Domain Generalization

In this section, we first establish the basic notations of domains and formally introduce the definition of domain generalization. Then we give a detailed description of the proposed conditional invariant domain generalization (CIDG) method.

Problem Definition

Denote X and Y as the input feature and label spaces, respectively. A domain defined on X × Y can be represented by a joint probability distribution P(X, Y). For simplicity, we denote the joint probability distribution of the i-th source domain as $\mathbb{P}^i$. The i-th domain is associated with a sample $S^i = \{(x^i_j, y^i_j)\}_{j=1}^{n_i}$, where $n_i$ denotes the sample size of the i-th domain. Then we can define domain generalization as follows.

Definition 1 (Domain Generalization).

Given m related source domains, each associated with a sample $S^i = \{(x^i_j, y^i_j)\}_{j=1}^{n_i}$, $i = 1, \ldots, m$, the goal of domain generalization is to learn a classification function f from the source domain samples that applies well to an unseen but related target domain $\mathbb{P}^t$.

Kernel Mean Embedding

Before introducing the proposed method, we briefly review the kernel mean embedding of distributions, which is an important mathematical tool to represent and compare distributions [Song, Fukumizu, and Gretton2013, Sriperumbudur et al.2010]. Let $\mathcal{H}$ denote a characteristic reproducing kernel Hilbert space (RKHS) on X associated with a kernel $k(\cdot,\cdot)$, and let $\phi(\cdot)$ be the associated feature map such that $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$ for any two observations x and x' from a domain. The kernel embedding of a distribution $\mathbb{P}$ can be formulated as the following:

$\mu_{\mathbb{P}} = \mathbb{E}_{X \sim \mathbb{P}}[\phi(X)] = \mathbb{E}_{X \sim \mathbb{P}}[k(X, \cdot)],$   (1)

where $\mu_{\mathbb{P}}$ denotes $\mu_{\mathbb{P}(X)}$ for simplicity. If the kernel is characteristic, then the mean embedding is injective, so all the information about the distribution is preserved [Sriperumbudur et al.2010]. The kernel embedding cannot be computed directly and is usually estimated from observations. Given a sample $S = \{x_i\}_{i=1}^{n}$, where n is the sample size of the domain, the kernel embedding can be empirically estimated as the following:

$\hat{\mu}_{\mathbb{P}} = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i).$   (2)
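The scatter terms introduced below compare such embeddings through their pairwise RKHS inner products, which reduce to averages of kernel evaluations. Below is a minimal NumPy sketch of the empirical mean-embedding distance implied by equation (2), assuming an RBF kernel; the helper names (`rbf_kernel`, `mmd2`) are ours, not from the paper.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """RBF kernel matrix with entries k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def mmd2(X1, X2, sigma=1.0):
    """Squared RKHS distance between the empirical mean embeddings of X1 and X2.

    Uses <mu_P, mu_Q> ~= mean(k(X_P, X_Q)), so
    ||mu_1 - mu_2||^2 = mean(k(X1,X1)) - 2*mean(k(X1,X2)) + mean(k(X2,X2)).
    """
    return (rbf_kernel(X1, X1, sigma).mean()
            - 2.0 * rbf_kernel(X1, X2, sigma).mean()
            + rbf_kernel(X2, X2, sigma).mean())
```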

Proposed Approach

The proposed conditional invariant domain generalization (CIDG) method aims to find a conditional invariant representation h(X) (a linear transformation of the original features) that reduces the variance of the class-conditional distribution P(h(X)|Y) across the source domains. Suppose we can learn a perfect conditional invariant representation h(X), which satisfies $P^1(h(X)|Y) = \cdots = P^m(h(X)|Y) = P^t(h(X)|Y)$, where $P^t$ denotes the target domain. We can gather all the source domains to construct a new single domain with a joint distribution $P^s(h(X), Y)$. Then, under the condition $P^s(Y) = P^t(Y)$, the learned h(X) has an invariant joint distribution across the training and test domains. In contrast, previous methods can only guarantee that P(h(X)) is invariant, and whether P(Y|h(X)) is invariant remains unknown. If $P^t(Y)$ is different from $P^s(Y)$, our method cannot guarantee the invariance of the joint distribution either; nevertheless, it can at least guarantee invariant class-conditional distributions, which is still better than previous methods, because P(Y|h(X)) is usually not very sensitive to changes in the prior P(Y) when h(X) is highly correlated with Y.
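The step from invariant class-conditionals to an invariant joint distribution can be spelled out with the chain rule; the superscripts s and t below (pooled source and target) are our notation:

$P^{t}\big(h(X), Y\big) = P^{t}\big(h(X) \mid Y\big)\, P^{t}(Y) = P^{s}\big(h(X) \mid Y\big)\, P^{s}(Y) = P^{s}\big(h(X), Y\big),$

where the middle equality uses the learned invariance $P^{t}(h(X) \mid Y) = P^{s}(h(X) \mid Y)$ together with the assumption $P^{t}(Y) = P^{s}(Y)$.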

The learning of conditional invariant representations is achieved mainly through two regularization terms: the total scatter of the class-conditional distributions and the scatter of the class prior-normalized marginal distributions. The first term measures the variance of P(h(X)|Y) locally for each class, while the second term measures the variance of the class-conditional distributions globally. In addition to these two terms, we also incorporate several terms that measure the discriminative power of the representation, as done in previous works. By minimizing the distribution variance across domains and maximizing the discriminative power in one objective function, we can obtain a conditional invariant representation that is predictive of the labels on unseen target domains.

Total scatter of class-conditional distributions

Suppose we have m related domains on X × Y. The marginal distribution on X of the j-th domain is denoted as $\mathbb{P}^j(X)$. Suppose the class labels of each domain vary from 1 to C. For simplicity, the c-th class-conditional distribution $\mathbb{P}^j(X \mid Y = c)$ of the j-th domain is denoted as $\mathbb{P}^j_c$. The total scatter of class-conditional distributions across domains can be formulated as:

$\Psi^{con} = \sum_{c=1}^{C} \frac{1}{m} \sum_{j=1}^{m} \big\| \mu_{\mathbb{P}^j_c} - \bar{\mu}_c \big\|^2_{\mathcal{H}},$   (3)

where $\bar{\mu}_c = \frac{1}{m}\sum_{j=1}^{m} \mu_{\mathbb{P}^j_c}$ and $\frac{1}{m}\sum_{j=1}^{m} \| \mu_{\mathbb{P}^j_c} - \bar{\mu}_c \|^2_{\mathcal{H}}$ is called the domain scatter [Ghifary et al.2017] or distributional variance [Muandet, Balduzzi, and Schölkopf2013]. Instead of measuring the domain scatter w.r.t. the marginal distributions as done in previous works such as [Ghifary et al.2017], we measure the domain scatter w.r.t. each class-conditional distribution and then sum them together.

Before introducing the computation of the above scatter, we first give the formulation of the learned feature transformation. Denote $\mathbf{X} = [x_1, \ldots, x_n] \in \mathbb{R}^{D \times n}$ as the data matrix of samples from all source domains, where D is the dimension of the feature space and $n = \sum_{i=1}^{m} n_i$. Define $\Phi = [\phi(x_1), \ldots, \phi(x_n)]$ as the matrix of mapped features associated with the feature map $\phi$. We aim to find a linear feature transformation W that maps $\phi(x)$ into a finite-dimensional subspace $\mathbb{R}^q$, that is, $h(x) = W^{\top}\phi(x)$. Following kernel principal component analysis (KPCA) [Schölkopf, Smola, and Müller1998], the linear transformation can be formulated as a linear combination of the mapped samples, $W = \Phi B$, where $B \in \mathbb{R}^{n \times q}$ is the coefficient matrix. By using this representation, we can avoid explicitly computing the feature map and use the kernel trick instead.
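To make the kernel trick explicit (the notation here is written out for illustration: $K = \Phi^{\top}\Phi$ is the Gram matrix with columns $K_{:,i}$, $\mathcal{I}^{j}_{c}$ indexes the samples of class c in domain j, $n^{c}_{j} = |\mathcal{I}^{j}_{c}|$, and $\hat{\mu}_{\mathbb{P}^{j}_{c}}$ is the empirical class-conditional embedding defined in (7) below), a transformed embedding reduces to kernel evaluations only:

$W^{\top}\hat{\mu}_{\mathbb{P}^{j}_{c}} = B^{\top}\Phi^{\top}\hat{\mu}_{\mathbb{P}^{j}_{c}} = \frac{1}{n^{c}_{j}} B^{\top} \sum_{i \in \mathcal{I}^{j}_{c}} K_{:,i},$

so every scatter term below can be written as a quadratic form in B involving only the kernel matrix K.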

For simplicity, we write each squared RKHS norm as a trace,

$\big\|\mu_{\mathbb{P}^j_c} - \bar{\mu}_c\big\|^2_{\mathcal{H}} = \mathrm{tr}\Big(\big(\mu_{\mathbb{P}^j_c} - \bar{\mu}_c\big)\big(\mu_{\mathbb{P}^j_c} - \bar{\mu}_c\big)^{\top}\Big),$   (4)

where $\mathrm{tr}(\cdot)$ is the trace operator. To measure the scatter of the distributions of the transformed features $h(X) = W^{\top}\phi(X)$, we apply the linear feature transformation $W = \Phi B$ to the above scatter and obtain

$\Psi^{con}(B) = \mathrm{tr}\big(B^{\top} K^{con} B\big),$   (5)

where $K^{con}$ is:

$K^{con} = \sum_{c=1}^{C} \frac{1}{m} \sum_{j=1}^{m} \Phi^{\top}\big(\mu_{\mathbb{P}^j_c} - \bar{\mu}_c\big)\big(\mu_{\mathbb{P}^j_c} - \bar{\mu}_c\big)^{\top}\Phi,$   (6)

in which $\mu_{\mathbb{P}^j_c}$ and $\bar{\mu}_c$ can be computed according to the empirical estimation shown in equation (2). Denote $x^{j}_{i}$ as the i-th sample of the j-th domain and let $\mathcal{I}^{j}_{c}$ denote the indices of examples in the c-th class of the j-th domain, where $j \in \{1, \ldots, m\}$ and $c \in \{1, \ldots, C\}$. Let $n^{c}_{j}$ denote the sample size of the c-th class from the j-th domain; then we have:

$\hat{\mu}_{\mathbb{P}^j_c} = \frac{1}{n^{c}_{j}} \sum_{i \in \mathcal{I}^{j}_{c}} \phi\big(x^{j}_{i}\big).$   (7)
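As a concrete check of the quantity behind equations (3)-(7), the following sketch computes the scalar total scatter of class-conditional mean embeddings directly from kernel evaluations (no transformation B yet). It reuses the `rbf_kernel` helper from the earlier sketch, assumes every class appears in every domain, and the function names are ours.

```python
import numpy as np
# rbf_kernel as defined in the earlier sketch

def mean_embedding_scatter(samples, sigma=1.0):
    """(1/m) * sum_j ||mu_j - mu_bar||_H^2 for a list of sample arrays.

    Relies on <mu_i, mu_j> ~= mean(k(X_i, X_j)), so the scatter equals
    (1/m) * sum_j <mu_j, mu_j>  -  <mu_bar, mu_bar>.
    """
    m = len(samples)
    G = np.zeros((m, m))                       # Gram matrix of mean embeddings
    for i in range(m):
        for j in range(m):
            G[i, j] = rbf_kernel(samples[i], samples[j], sigma).mean()
    return np.trace(G) / m - G.mean()

def total_class_conditional_scatter(X_by_domain, y_by_domain, sigma=1.0):
    """Sum over classes of the cross-domain scatter of class-conditional embeddings."""
    classes = np.unique(np.concatenate(y_by_domain))
    return sum(
        mean_embedding_scatter(
            [X[y == c] for X, y in zip(X_by_domain, y_by_domain)], sigma)
        for c in classes)
```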

Scatter of class prior-normalized marginal distributions

The scatter of each class-conditional distribution is estimated locally using only the samples from that class. When the number of examples in each class is small, optimizing (3) can easily overfit the data. To further improve the estimation accuracy, we propose another regularization term which measures the scatter of the class prior-normalized marginal distributions. This regularization term measures the global distance between all class-conditionals. In the j-th domain, the marginal distribution is defined as

$\mathbb{P}^{j}(X) = \sum_{c=1}^{C} \mathbb{P}^{j}(Y = c)\, \mathbb{P}^{j}(X \mid Y = c).$   (8)

If the class prior distribution P(Y) does not change across domains, and we can also find a feature representation that has invariant class-conditionals across the source domains, then the marginal distribution of the learned representation is also domain-invariant, but not vice versa. Nevertheless, searching for a representation that reduces the discrepancy between the marginal distributions can to some extent reduce the discrepancy between the class-conditional distributions, although the original purpose was to match marginal distributions only [Muandet, Balduzzi, and Schölkopf2013, Ghifary et al.2017]. However, if the class prior P(Y) changes across source domains, the above statements are no longer true. That is to say, even if the class-conditionals are domain-invariant, the marginal distributions are not invariant because of the changes in P(Y). To mitigate this issue, we propose to match the class prior-normalized marginal distribution, which is defined as follows:

$\mathbb{P}^{j,norm}(X) = \sum_{c=1}^{C} \frac{1}{C}\, \mathbb{P}^{j}(X \mid Y = c).$   (9)

It can be seen that the class prior-normalized marginal distribution enforces the same prior probability for each class. Therefore, the changes in the prior distribution P(Y) across source domains are adjusted for, which guarantees that the prior-normalized marginal distribution is domain-invariant when the class-conditionals are invariant. By embedding the class prior-normalized marginal distribution into the Hilbert space, the scatter of the normalized marginal distributions across domains can be formulated as:

$\Psi^{norm} = \frac{1}{m} \sum_{j=1}^{m} \big\| \mu_{\mathbb{P}^{j,norm}} - \bar{\mu}^{norm} \big\|^2_{\mathcal{H}},$   (10)

where $\bar{\mu}^{norm} = \frac{1}{m}\sum_{j=1}^{m} \mu_{\mathbb{P}^{j,norm}}$, $\mu_{\mathbb{P}^{j,norm}}$ is the kernel mean embedding of the prior-normalized marginal distribution of the j-th domain, and $\bar{\mu}^{norm}$ is the kernel mean of the class prior-normalized marginal distributions of all domains. To learn the domain-invariant representation, we apply the linear feature transformation $W = \Phi B$ to the above scatter, resulting in:

$\Psi^{norm}(B) = \mathrm{tr}\big(B^{\top} K^{norm} B\big),$   (11)

where $K^{norm}$ can be formulated as follows:

$K^{norm} = \frac{1}{m} \sum_{j=1}^{m} \Phi^{\top}\big(\mu_{\mathbb{P}^{j,norm}} - \bar{\mu}^{norm}\big)\big(\mu_{\mathbb{P}^{j,norm}} - \bar{\mu}^{norm}\big)^{\top}\Phi.$   (12)

$\mu_{\mathbb{P}^{j,norm}}$ in (12) can be empirically estimated from the observations as:

$\hat{\mu}_{\mathbb{P}^{j,norm}} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{n^{c}_{j}} \sum_{i \in \mathcal{I}^{j}_{c}} \phi\big(x^{j}_{i}\big).$   (13)

Note that if the class priors $\mathbb{P}^{j}(Y = c)$ are identical for all c, that is, if the classes are balanced, the class prior-normalized marginal distribution reduces to the empirical estimate of the original marginal distribution adopted in [Muandet, Balduzzi, and Schölkopf2013, Ghifary et al.2017].
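A scalar sketch of the prior-normalized scatter behind equations (9)-(13): each domain's embedding is the uniform average of its per-class mean embeddings, so differences in class frequencies across domains cancel out. It again reuses `rbf_kernel` from the earlier sketch; the function name is ours.

```python
import numpy as np
# rbf_kernel as defined in the earlier sketch

def prior_normalized_scatter(X_by_domain, y_by_domain, sigma=1.0):
    """Cross-domain scatter of class prior-normalized marginal embeddings."""
    classes = np.unique(np.concatenate(y_by_domain))
    m = len(X_by_domain)
    G = np.zeros((m, m))          # inner products between normalized embeddings
    for i in range(m):
        for j in range(m):
            # <mu_i_norm, mu_j_norm> = (1/C^2) * sum over class pairs of
            # <mu_{i,ci}, mu_{j,cj}>, each approximated by mean kernel values
            G[i, j] = np.mean([
                rbf_kernel(X_by_domain[i][y_by_domain[i] == ci],
                           X_by_domain[j][y_by_domain[j] == cj], sigma).mean()
                for ci in classes for cj in classes])
    return np.trace(G) / m - G.mean()
```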

Preserving Discriminative Power

In addition to the two proposed domain-invariance regularization terms, we also consider extra terms to preserve the discriminativeness of the learned representation. There has been plenty of work on supervised dimension reduction, and kernel Fisher discriminant analysis [Mika et al.1999] is a representative method which has already been used in domain generalization [Ghifary et al.2017]. Because the focus of our method is to better learn the domain-invariant representations, we incorporate kernel Fisher discriminant analysis for a fair comparison with existing methods. Specifically, examples with the same label should be similar and examples with different labels should be well separated. These two constraints can be formulated as two regularization terms, the within-class scatter and the between-class scatter, which are briefly described as follows.

Between-class scatter:

$\Psi^{bet}(B) = \mathrm{tr}\big(B^{\top} Q^{bet} B\big),$   (14)

where the matrix $Q^{bet}$ can be computed as:

$Q^{bet} = \sum_{c=1}^{C} n_c\, \Phi^{\top}\big(\mu_c - \mu\big)\big(\mu_c - \mu\big)^{\top}\Phi,$   (15)

and $n_c$ denotes the number of examples in the c-th class from all domains. Note that $\mu_c$ and $\mu$ can be empirically estimated as $\hat{\mu}_c = \frac{1}{n_c}\sum_{i \in \mathcal{I}_c}\phi(x_i)$ and $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}\phi(x_i)$, where $\mathcal{I}_c$ indexes the examples of the c-th class over all domains.

Within-class scatter:

$\Psi^{wit}(B) = \mathrm{tr}\big(B^{\top} Q^{wit} B\big),$   (16)

where the matrix $Q^{wit}$ can be computed as:

$Q^{wit} = \sum_{c=1}^{C} \sum_{i \in \mathcal{I}_c} \Phi^{\top}\big(\phi(x_i) - \mu_c\big)\big(\phi(x_i) - \mu_c\big)^{\top}\Phi.$   (17)
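For intuition, here is a plain-feature analogue of the between- and within-class scatters in equations (14)-(17); the paper computes them in the RKHS via the kernel matrix, whereas this NumPy version operates on explicit feature vectors and is only a sketch.

```python
import numpy as np

def fda_scatters(X, y):
    """Between-class and within-class scatter matrices of pooled source data.

    X : (n, d) feature matrix, y : (n,) labels.
    Returns (S_b, S_w), the explicit-feature counterparts of Q^bet and Q^wit.
    """
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        diff = (mc - mean_all)[:, None]
        S_b += len(Xc) * diff @ diff.T        # n_c (mu_c - mu)(mu_c - mu)^T
        S_w += (Xc - mc).T @ (Xc - mc)        # sum_i (x_i - mu_c)(x_i - mu_c)^T
    return S_b, S_w
```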

Objective Function and Optimization

In this subsection, we first formulate our objective function with the above regularization terms and then find the solutions by maximizing the objective function.

The proposed CIDG aims to learn an invariant feature transformation by solving the following optimization problem:

$\max_{B} \ \dfrac{\Psi^{bet}(B)}{\Psi^{con}(B) + \Psi^{norm}(B) + \Psi^{wit}(B)}.$   (18)

The numerator enforces the distance between features in different classes to be large. The denominator aims to learn a conditional invariant feature representation and reduce the distance between features in the same class simultaneously.

Replacing the scatters with equations (5), (11), (14), and (16), and introducing trade-off parameters, the above objective function can be reformulated as follows:

$\max_{B} \ \dfrac{\mathrm{tr}\big(B^{\top} Q^{bet} B\big)}{\mathrm{tr}\big(B^{\top}\big(K^{con} + \gamma K^{norm} + \lambda Q^{wit}\big) B\big)},$   (19)

where $\gamma$ and $\lambda$ are trade-off parameters, which need to be selected according to the validation set.

Note that the above objective function is invariant to rescaling $B \rightarrow \sigma B$, where $\sigma$ is a constant. Consequently, (19) can be reformulated as the following constrained optimization problem:

$\max_{B} \ \mathrm{tr}\big(B^{\top} Q^{bet} B\big) \quad \text{s.t.} \quad B^{\top}\big(K^{con} + \gamma K^{norm} + \lambda Q^{wit}\big) B = I_q,$   (20)

which yields the Lagrangian:

$\mathcal{L}(B, \Gamma) = \mathrm{tr}\big(B^{\top} Q^{bet} B\big) - \mathrm{tr}\Big(\big(B^{\top}\big(K^{con} + \gamma K^{norm} + \lambda Q^{wit}\big) B - I_q\big)\,\Gamma\Big),$   (21)

where $I_q$ is an identity matrix of dimension q and $\Gamma$ is a diagonal matrix with the Lagrange multipliers on its diagonal. Solving (21) by setting the derivative w.r.t. B to zero, we arrive at a standard eigenvalue decomposition problem:

$\big(K^{con} + \gamma K^{norm} + \lambda Q^{wit}\big)^{-1} Q^{bet}\, B = B\, \Gamma.$   (22)

In practice, a small constant $\varepsilon$ is added to the inverted term to obtain a more stable solution, which becomes $\big(K^{con} + \gamma K^{norm} + \lambda Q^{wit} + \varepsilon I\big)^{-1} Q^{bet}\, B = B\, \Gamma$. We summarize our CIDG method in Algorithm 1.
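A minimal sketch of a solver for (22), treating it as the equivalent generalized symmetric eigenproblem; `numerator` and `denominator` stand for $Q^{bet}$ and $K^{con} + \gamma K^{norm} + \lambda Q^{wit}$, and the function name is ours.

```python
import numpy as np
from scipy.linalg import eigh

def solve_projection(numerator, denominator, n_components, eps=1e-4):
    """Solve numerator @ b = lam * (denominator + eps*I) @ b and keep the
    eigenvectors with the largest eigenvalues (the columns of B)."""
    d = denominator.shape[0]
    vals, vecs = eigh(numerator, denominator + eps * np.eye(d))
    order = np.argsort(vals)[::-1]            # largest eigenvalues first
    return vecs[:, order[:n_components]], vals[order[:n_components]]
```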

Input: m source domains with datasets $\{S^i\}_{i=1}^{m}$ and trade-off parameters $\gamma$, $\lambda$, $\varepsilon$.
Output: Invariant feature transformation B and the corresponding eigenvalues.
1: Construct the kernel matrix K from the data samples of all source domains, $K_{uv} = k(x_u, x_v)$, $u, v = 1, \ldots, n$, and construct the matrices $Q^{bet}$, $K^{con}$, $K^{norm}$, and $Q^{wit}$ from equations (14), (5), (11), and (16).
2: Center the kernel matrix, $\bar{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n$, where $\mathbf{1}_n$ denotes an $n \times n$ matrix with all entries equal to $1/n$.
3: Solve equation (22) to get the optimal feature transformation matrix B formed by the eigenvectors corresponding to the first q leading eigenvalues.
4: Given a target domain with a set of data $S^t = \{x^t_u\}_{u=1}^{n_t}$, construct a kernel matrix $K^t$ between the samples from the target domain and the samples from the source domains, $K^t_{uv} = k(x^t_u, x_v)$. Then apply the centering operation $\bar{K}^t = K^t - \mathbf{1}'_n K - K^t \mathbf{1}_n + \mathbf{1}'_n K \mathbf{1}_n$, where $\mathbf{1}'_n$ denotes an $n_t \times n$ matrix with all entries equal to $1/n$.
5: The learned feature matrix of the target domain is computed as $Z^t = \bar{K}^t B$.
Algorithm 1: Conditional invariant domain generalization (CIDG)
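Steps 4-5 can be written compactly as below (a sketch in our notation: `K_train` is the source kernel matrix, `K_cross` the target-source kernel, `B` the learned coefficients; the centering convention follows standard out-of-sample KPCA and is our assumption):

```python
import numpy as np

def project_target(K_train, K_cross, B):
    """Project target-domain samples with the learned coefficient matrix B."""
    n = K_train.shape[0]
    one_nn = np.ones((n, n)) / n                      # n x n matrix of 1/n
    one_tn = np.ones((K_cross.shape[0], n)) / n       # n_t x n matrix of 1/n
    K_cross_c = (K_cross - one_tn @ K_train
                 - K_cross @ one_nn + one_tn @ K_train @ one_nn)
    return K_cross_c @ B                              # learned target features
```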
domain        |        domain 1         |           domain 2            |           domain 3
class         |    1       2       3    |     1         2         3     |     1        2         3
x (mean, std) | (1,0.3) (2,0.3) (3,0.3) | (3.5,0.3) (4.5,0.3) (5.5,0.3) | (8,0.3)  (9.5,0.3) (10,0.3)
y (mean, std) | (2,0.3) (1,0.3) (2,0.3) | (2.5,0.3) (1.5,0.3) (2.5,0.3) | (2.5,0.3) (1.5,0.3) (2.5,0.3)
# samples     |   30      20      30    |    20        60        40     |    40       40        40
Table 1: Details of the generated distributions of the three domains.
Figure 1: Performance comparison between different methods. The figures in the first row visualize the samples according to the three different domains (yellow, magenta, cyan). The figures in the second row visualize the samples of the three classes (green, red, blue) in the different domains (star, circle, cross). Note that the left two domains (yellow, magenta) are the source domains and the right one (cyan) is the target domain.

Experiments

In this section, we conduct experiments on one synthetic dataset and two real-world image classification datasets to demonstrate the effectiveness of our conditional invariant domain generalization (CIDG) method. The synthetic data are two dimensional, which facilitates comparing the performance of different methods by visualizing the data distributions. The two real-world image classification datasets are the VLCS and Office+Caltech datasets, which are widely used to evaluate the performance of domain generalization and domain adaptation [Ghifary et al.2017, Gong et al.2016, Khosla et al.2012]. We compare our CIDG with several state-of-the-art domain generalization methods, which are summarized below.

  • K-nearest neighbors (KNN) using the original features, which serves as the baseline method.

  • Kernel principal component analysis (KPCA) [Schölkopf, Smola, and Müller1998] which finds the dominant components of the original features. KNN is applied for classification on the KPCA features.

  • Undo-Bias [Khosla et al.2012], a multi-task learning method that aims to reduce dataset bias. Because Undo-Bias is a binary classification algorithm, we use the one-vs-rest strategy for multi-class classification.

  • Domain-invariant component analysis (DICA) [Muandet, Balduzzi, and Schölkopf2013], a domain generalization method that learns a domain-invariant feature representation in terms of marginal distributions. We use KNN for classification on the learned feature representation.

  • Scatter component analysis (SCA) [Ghifary et al.2017], another method that learns domain-invariant features in terms of marginal distributions. It incorporates discriminative terms and domain scatter terms into a unified framework.

Note that we have also conducted experiments using kernel Fisher discriminant analysis (kernel FDA); however, it performs worse than KPCA. Consequently, we do not report its results in this paper.

Source Target 1NN KPCA DICA Undo-bias SCA CIDG
L,C,S V
V,C,S L
V,L,S C
V,C,L S
C,S V,L
C,L V,S
C,V L,S
L,S V,C
L,V S,C
V,S L,C
Table 2: Performance comparison between different methods with respect to accuracy (%) on the VLCS dataset.
Source Target 1NN KPCA DICA Undo-bias SCA CIDG
W,D,C A
A,W,D C
A,W,C D
A,C,D W
A,C D,W
D,W A,C
A,W C,D
A,D C,W
C,W A,D
C,D A,W
Table 3: Performance comparison between different methods with respect to accuracy (%) on the Office+Caltech dataset.

Synthetic Dataset

In this section, we randomly generate two-dimensional examples for the source and target domains from different Gaussian distributions N(μ, σ), where μ is the mean and σ is the standard deviation. The mean and standard deviation pairs of the different classes in the three domains are shown in Table 1. We consider the first two domains as source domains and the third one as the target domain. The first row of Figure 1 visualizes the samples from the three different domains with three different colors (yellow, magenta, cyan); the domains are domain 1, domain 2, and domain 3 from left to right. The second row of Figure 1 shows that each domain has three clusters (green, red, blue) corresponding to the three classes, and the domains are represented by different shapes (star, circle, cross). The first column illustrates the raw feature distributions.
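A sketch that reproduces the synthetic data of Table 1 with NumPy (the random seed and variable names are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# (x_mean, y_mean, n_samples) per class for each domain, from Table 1; std = 0.3
domains = [
    [(1.0, 2.0, 30), (2.0, 1.0, 20), (3.0, 2.0, 30)],    # domain 1 (source)
    [(3.5, 2.5, 20), (4.5, 1.5, 60), (5.5, 2.5, 40)],    # domain 2 (source)
    [(8.0, 2.5, 40), (9.5, 1.5, 40), (10.0, 2.5, 40)],   # domain 3 (target)
]

X_by_domain, y_by_domain = [], []
for spec in domains:
    X_parts, y_parts = [], []
    for label, (mx, my, n) in enumerate(spec):
        X_parts.append(np.column_stack([rng.normal(mx, 0.3, n),
                                        rng.normal(my, 0.3, n)]))
        y_parts.append(np.full(n, label))
    X_by_domain.append(np.vstack(X_parts))
    y_by_domain.append(np.concatenate(y_parts))
```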

We compare our CIDG with KNN, KPCA, DICA, and SCA to evaluate the distributions of the learned feature representations across domains. Since Undo-Bias is an SVM-based method that does not explicitly learn a feature representation, we do not compare with it on the synthetic data. We use the RBF kernel for all methods involving the computation of kernel matrices. In all experiments, domain 1 and domain 2 are used as source domains and domain 3 is used as the unseen target domain. From the results in Figure 1, we can see that the proposed CIDG achieves the best accuracy. KPCA shows almost no improvement over the baseline KNN method on the synthetic dataset. DICA clusters one class (blue) well but performs badly on the other two classes. SCA learns a better feature distribution, but the blue class and the green class are mixed in the learned representation; additionally, the samples of the same class lie along a line rather than in a clear cluster. Our CIDG learns more robust feature representations, and the learned features of the same class are distributed in a well-shaped cluster.

VLCS Dataset

VLCS is an image classification dataset widely used for evaluating the performance of domain generalization. It contains images from four sub-datasets corresponding to four domains: PASCAL VOC2007 (V) [Everingham et al.2010], LabelMe (L) [Russell et al.2008], Caltech-101 (C) [Griffin, Holub, and Perona2007], and SUN09 (S) [Choi et al.2010]. Five shared classes (bird, car, chair, dog, and person) are selected from these four datasets. The images are preprocessed by subtracting the mean values and cropping the central region of the resized images. The preprocessed images are then fed into the DeCAF network to extract the 4096-dimensional DeCAF6 features [Donahue et al.2014]. We randomly select a fixed fraction of the data from each domain as the training set and repeat the random selection five times. The mean classification accuracy and standard deviation over the five random selections are reported for each method. All parameters are selected through validation, in which part of the training data is held out as the validation set. All kernel methods use an RBF kernel, and the learned features are classified using KNN, except for Undo-Bias. The results are shown in Table 2.
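For concreteness, one repetition of the evaluation protocol could be sketched as follows; `train_fraction` stands for the per-domain training fraction (not restated here), and `learn_features` is a hypothetical hook for fitting CIDG on the pooled source training data and projecting source and target samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate_once(X_sources, y_sources, X_target, y_target,
                  learn_features, train_fraction, seed):
    """Split each source domain, learn the representation, classify the target."""
    X_tr, y_tr = [], []
    for X, y in zip(X_sources, y_sources):
        Xs, _, ys, _ = train_test_split(X, y, train_size=train_fraction,
                                        random_state=seed)
        X_tr.append(Xs)
        y_tr.append(ys)
    Z_train, Z_target = learn_features(X_tr, y_tr, X_target)   # hypothetical hook
    clf = KNeighborsClassifier(n_neighbors=1).fit(Z_train, np.concatenate(y_tr))
    return clf.score(Z_target, y_target)

# accuracies over five repetitions: [evaluate_once(..., seed=s) for s in range(5)]
```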

From the results in Table 2, we can see that our conditional invariant domain generalization (CIDG) performs the best on 9 of the 10 domain generalization tasks. KPCA performs the best when L,S are the source domains and V,C are the target domains. Note that almost all domain generalization methods outperform 1NN on the raw features. However, some methods perform even worse than 1NN on the raw features for several tasks. This is mainly because the features of real-world images are complicated and noisy, so the learned features are not always discriminative when generalized to the target domains.

Office+Caltech Dataset

The Office+Caltech image dataset consists of the ten categories shared between the Office dataset and the Caltech-256 dataset (C). Because the Office dataset contains three sub-datasets, AMAZON (A), DSLR (D), and WEBCAM (W), we have four different domains in total. Similarly, we randomly select a fixed fraction of the data from each domain as the training set and repeat the random selection five times. The mean classification accuracy and standard deviation over the five random selections are reported for each method. The feature extraction is the same as that used for the VLCS dataset, except that we use the CAFFE network [Jia et al.2014] instead of the DeCAF network. The other settings are the same as those in the experiments on the VLCS dataset.

From the results in Table 3, we can see that the proposed CIDG achieves the best performance on 9 of the 10 domain generalization tasks. This further validates that enforcing conditional invariance is more reasonable than enforcing only marginal invariance. Note that Undo-Bias is an SVM-based method, which is possibly the main reason why it outperforms CIDG when using D,W as source domains and A,C as target domains.

Conclusion

In this paper, we have proposed a conditional invariant domain generalization approach for the situation in which both P(X) and P(Y|X) change across domains. Different from previous works, which assume that only P(X) changes, our method can learn representations that have an invariant joint distribution across domains, provided the prior distribution P(Y) does not change between the source domains and the target domain. Two regularization terms that enforce class-conditional distribution invariance across domains are proposed and validated on both synthetic and real datasets.

Acknowledgments

This work was supported by National Key Research and Development Program of China 2017YFB1002203, NSFC No.61572451, No.61390514, and No. 61632019, Youth Innovation Promotion Association CAS CX2100060016, Fok Ying Tung Education Foundation WF2100060004, and Australian Research Council Projects FL-170100117, DP-180103424, DP-140102164, LP-150100671.

References

  • [Baktashmotlagh et al.2013] Baktashmotlagh, M.; Harandi, M.; Lovell, B.; and Salzmann, M. 2013. Unsupervised domain adaptation by domain invariant projection. In Computer Vision (ICCV), 2013 IEEE International Conference on, 769–776.
  • [Blanchard, Lee, and Scott2011] Blanchard, G.; Lee, G.; and Scott, C. 2011. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in neural information processing systems, 2178–2186.
  • [Choi et al.2010] Choi, M. J.; Lim, J. J.; Torralba, A.; and Willsky, A. S. 2010. Exploiting hierarchical context on a large database of object categories. In Computer vision and pattern recognition (CVPR), 2010 IEEE conference on, 129–136. IEEE.
  • [Donahue et al.2014] Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; and Darrell, T. 2014. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, 647–655.
  • [Duan et al.2009] Duan, L.; Tsang, I. W.; Xu, D.; and Chua, T.-S. 2009. Domain adaptation from multiple sources via auxiliary classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, 289–296. ACM.
  • [Erfani et al.2016] Erfani, S. M.; Baktashmotlagh, M.; Moshtaghi, M.; Nguyen, V.; Leckie, C.; Bailey, J.; and Ramamohanarao, K. 2016. Robust domain generalisation by enforcing distribution invariance. In IJCAI, 1455–1461.
  • [Everingham et al.2010] Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The pascal visual object classes (voc) challenge. International journal of computer vision 88(2):303–338.
  • [Ghifary et al.2015] Ghifary, M.; Bastiaan Kleijn, W.; Zhang, M.; and Balduzzi, D. 2015. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, 2551–2559.
  • [Ghifary et al.2017] Ghifary, M.; Balduzzi, D.; Kleijn, W. B.; and Zhang, M. 2017. Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE transactions on pattern analysis and machine intelligence 39(7):1414–1430.
  • [Gong et al.2016] Gong, M.; Zhang, K.; Liu, T.; Tao, D.; Glymour, C.; and Schölkopf, B. 2016. Domain adaptation with conditional transferable components. In International Conference on Machine Learning, 2839–2848.
  • [Griffin, Holub, and Perona2007] Griffin, G.; Holub, A.; and Perona, P. 2007. Caltech-256 object category dataset.
  • [Huang et al.2007] Huang, J.; Smola, A.; Gretton, A.; Borgwardt, K.; and Schölkopf, B. 2007. Correcting sample selection bias by unlabeled data. In NIPS 19, 601–608.
  • [Janzing and Scholkopf2010] Janzing, D., and Scholkopf, B. 2010. Causal inference using the algorithmic markov condition. IEEE Transactions on Information Theory 56(10):5168–5194.
  • [Jia et al.2014] Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, 675–678. ACM.
  • [Khosla et al.2012] Khosla, A.; Zhou, T.; Malisiewicz, T.; Efros, A. A.; and Torralba, A. 2012. Undoing the damage of dataset bias. In European Conference on Computer Vision, 158–171. Springer.
  • [Liu, Yang, and Tao2017] Liu, T.; Yang, Q.; and Tao, D. 2017. Understanding how feature structure transfers in transfer learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2365–2371.
  • [Long et al.2017] Long, M.; Zhu, H.; Wang, J.; and Jordan, M. I. 2017. Deep transfer learning with joint adaptation networks. In Precup, D., and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 2208–2217. International Convention Centre, Sydney, Australia: PMLR.
  • [Luo et al.2017] Luo, Y.; Wen, Y.; Liu, T.; and Tao, D. 2017. General heterogeneous transfer distance metric learning via knowledge fragments transfer.
  • [Mika et al.1999] Mika, S.; Ratsch, G.; Weston, J.; Scholkopf, B.; and Mullers, K.-R. 1999. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop., 41–48. IEEE.
  • [Muandet, Balduzzi, and Schölkopf2013] Muandet, K.; Balduzzi, D.; and Schölkopf, B. 2013. Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 10–18.
  • [Pan et al.2011] Pan, S. J.; Tsang, I. W.; Kwok, J. T.; and Yang, Q. 2011. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22(2):199–210.
  • [Russell et al.2008] Russell, B. C.; Torralba, A.; Murphy, K. P.; and Freeman, W. T. 2008. Labelme: a database and web-based tool for image annotation. International journal of computer vision 77(1):157–173.
  • [Schölkopf et al.2012] Schölkopf, B.; Janzing, D.; Peters, J.; Sgouritsa, E.; Zhang, K.; and Mooij, J. 2012. On causal and anticausal learning. arXiv preprint arXiv:1206.6471.
  • [Schölkopf, Smola, and Müller1998] Schölkopf, B.; Smola, A.; and Müller, K.-R. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation 10(5):1299–1319.
  • [Shao et al.2016] Shao, M.; Ding, Z.; Zhao, H.; and Fu, Y. 2016. Spectral bisection tree guided deep adaptive exemplar autoencoder for unsupervised domain adaptation. In AAAI, 2023–2029.
  • [Shao, Kit, and Fu2014] Shao, M.; Kit, D.; and Fu, Y. 2014. Generalized transfer subspace learning through low-rank constraint. International Journal of Computer Vision 109(1-2):74–93.
  • [Song, Fukumizu, and Gretton2013] Song, L.; Fukumizu, K.; and Gretton, A. 2013. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine 30(4):98–111.
  • [Sriperumbudur et al.2010] Sriperumbudur, B. K.; Gretton, A.; Fukumizu, K.; Schölkopf, B.; and Lanckriet, G. R. 2010. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research 11(Apr):1517–1561.
  • [Torralba and Efros2011] Torralba, A., and Efros, A. A. 2011. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 1521–1528. IEEE.
  • [Xu et al.2014] Xu, Z.; Li, W.; Niu, L.; and Xu, D. 2014. Exploiting low-rank structure from latent domains for domain generalization. In European Conference on Computer Vision, 628–643. Springer.
  • [Yang et al.2017] Yang, X.; Wang, M.; Hong, R.; Tian, Q.; and Rui, Y. 2017. Enhancing person re-identification in a self-trained subspace. arXiv preprint arXiv:1704.06020.
  • [Zhang et al.2013] Zhang, K.; Schölkopf, B.; Muandet, K.; and Wang, Z. 2013. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, 819–827.