Domain Generalization via Invariant Feature Representation

01/10/2013 ∙ Krikamol Muandet et al. ∙ Max Planck Society and ETH Zurich

This paper investigates domain generalization: How to take knowledge acquired from an arbitrary number of related domains and apply it to previously unseen domains? We propose Domain-Invariant Component Analysis (DICA), a kernel-based optimization algorithm that learns an invariant transformation by minimizing the dissimilarity across domains, whilst preserving the functional relationship between input and output variables. A learning-theoretic analysis shows that reducing dissimilarity improves the expected generalization ability of classifiers on new domains, motivating the proposed algorithm. Experimental results on synthetic and real-world datasets demonstrate that DICA successfully learns invariant features and improves classifier performance in practice.


1. Introduction

Domain generalization considers how to take knowledge acquired from an arbitrary number of related domains and apply it to previously unseen domains. To illustrate the problem, consider an example taken from Blanchard et al. (2011), which studied automatic gating of flow cytometry data. For each patient, a set of cells is obtained from peripheral blood samples using a flow cytometer. The cells are then labeled by an expert into different subpopulations, e.g., as a lymphocyte or not. Correctly identifying cell subpopulations is vital for diagnosing the health of patients. However, manual gating is very time consuming. To automate gating, we need to construct a classifier that generalizes well to previously unseen patients, where the distribution of cell types may differ dramatically from the training data.

Unfortunately, we cannot apply standard machine learning techniques directly because the data violates the basic assumption that training data and test data come from the same distribution. Moreover, the training set consists of heterogeneous samples from several distributions, i.e., gated cells from several patients. In this case, the data exhibits covariate (or dataset) shift (Widmer and Kubat 1996, Quiñonero-Candela et al. 2009, Bickel et al. 2009b): although the marginal distributions on cell attributes vary due to biological or technical variations, the functional relationship across different domains is largely stable (cell type is a stable function of a cell's chemical attributes).

A considerable effort has been made in domain adaptation and transfer learning to remedy this problem, see Pan and Yang (2010a), Ben-David et al. (2010) and references therein. Given a test domain, e.g., a cell population from a new patient, the idea of domain adaptation is to adapt a classifier trained on the training domain, e.g., a cell population from another patient, such that the generalization error on the test domain is minimized. The main drawback of this approach is that one has to repeat this process for every new patient, which can be time-consuming – especially in medical diagnosis where time is a valuable asset. In this work, across-domain information, which may be more informative than the domain-specific information, is extracted from the training data and used to generalize the classifier to new patients without retraining.

Figure 1. A simplified schematic diagram of the domain generalization framework. A major difference between our framework and most previous work in domain adaptation is that we do not observe the test domains during training time. See text for detailed description on how the data are generated.

1.1. Overview.

The goal of (supervised) domain generalization is to estimate a functional relationship $f : \mathcal{X} \to \mathcal{Y}$ that handles changes in the marginal $\mathbb{P}(X)$ or the conditional $\mathbb{P}(Y \mid X)$ well, see Figure 1. We assume that the conditional probability $\mathbb{P}(Y \mid X)$ is stable or varies smoothly with the marginal $\mathbb{P}(X)$. Even if the conditional is stable, learning algorithms may still suffer from model misspecification due to variation in the marginal $\mathbb{P}(X)$. That is, if the learning algorithm cannot find a solution that perfectly captures the functional relationship between $X$ and $Y$, then its approximate solution will be sensitive to changes in $\mathbb{P}(X)$.

In this paper, we introduce Domain-Invariant Component Analysis (DICA), a kernel-based algorithm that finds a transformation of the data that (i) minimizes the difference between the marginal distributions of the domains as much as possible while (ii) preserving the functional relationship between $X$ and $Y$.

The novelty of this work is twofold. First, DICA extracts invariants: features that transfer across domains. It not only minimizes the divergence between the marginal distributions $\mathbb{P}(X)$, but also preserves the functional relationship encoded in the posterior $\mathbb{P}(Y \mid X)$. The resulting learning algorithm is very simple. Second, while prior work in domain adaptation focused on using data from many different domains to improve performance on a specific target task observed during training (the classifier is adapted to the specific target task), we assume access to abundant training data and are interested in the generalization ability of the invariant subspace to previously unseen domains (the classifier generalizes to new domains without retraining).

Moreover, we show that DICA generalizes or is closely related to many well-known dimension reduction algorithms, including kernel principal component analysis (KPCA) (Schölkopf et al. 1998, Fukumizu et al. 2004a), transfer component analysis (TCA) (Pan et al. 2011), and covariance operator inverse regression (COIR) (Kim and Pavlovic 2011); see §2.5. The performance of DICA is analyzed theoretically (§2.6) and demonstrated empirically (§3).

1.2. Related work.

Domain generalization is a form of transfer learning, which applies expertise acquired in source domains to improve learning of target domains (cf. Pan and Yang (2010a) and references therein). Most previous work assumes the availability of the target domain to which the knowledge will be transferred. In contrast, domain generalization focuses on the generalization ability on previously unseen domains. That is, the test data comes from domains that are not available during training.

Recently, Blanchard et al. (2011) proposed an augmented SVM that incorporates empirical marginal distributions into the kernel. A detailed error analysis showed universal consistency of the approach. We apply methods from Blanchard et al. (2011) to derive theoretical guarantees on the finite sample performance of DICA.

Learning a shared subspace is a common approach in settings where there is distribution mismatch. For example, a typical approach in multitask learning is to uncover a joint (latent) feature/subspace that benefits tasks individually (Argyriou et al. 2007, Gu and Zhou 2009, Passos et al. 2012). A similar idea has been adopted in domain adaptation, where the learned subspace reduces mismatch between source and target domains (Gretton et al. 2009, Pan et al. 2011). Although these approaches have proven successful in various applications, no previous work has fully investigated the generalization ability of a subspace to unseen domains.

2. Domain-Invariant Component Analysis

Let $\mathcal{X}$ denote a nonempty input space and $\mathcal{Y}$ an arbitrary output space. We define a domain to be a joint distribution $\mathbb{P}_{XY}$ on $\mathcal{X}\times\mathcal{Y}$, and let $\mathfrak{P}_{\mathcal{X}\times\mathcal{Y}}$ denote the set of all domains. Let $\mathfrak{P}_{\mathcal{X}}$ and $\mathfrak{P}_{\mathcal{Y}|\mathcal{X}}$ denote the sets of probability distributions on $\mathcal{X}$ and on $\mathcal{Y}$ given $\mathcal{X}$, respectively.

We assume domains are sampled from a probability distribution $\mathscr{P}$ on $\mathfrak{P}_{\mathcal{X}\times\mathcal{Y}}$ which has a bounded second moment, i.e., the variance is well-defined. Domains are not observed directly. Instead, we observe $N$ sample sets $S = \{S_i\}_{i=1}^{N}$ with $S_i = \{(x_k^i, y_k^i)\}_{k=1}^{n_i}$, where the $i$-th domain $\mathbb{P}_{XY}^{i}$ is sampled from $\mathscr{P}$ and each pair $(x_k^i, y_k^i)$ is sampled from $\mathbb{P}_{XY}^{i}$. Since in general $\mathbb{P}_{XY}^{i} \neq \mathbb{P}_{XY}^{j}$, the samples in $S$ are not i.i.d. Let $\hat{\mathbb{P}}_{XY}^{i}$ denote the empirical distribution associated with each sample set $S_i$. For brevity, we use $\mathbb{P}_{X}^{i}$ and $\mathbb{P}_{i}$ interchangeably to denote the marginal distribution on $\mathcal{X}$.

Let $\mathcal{H}$ and $\mathcal{F}$ denote reproducing kernel Hilbert spaces (RKHSes) on $\mathcal{X}$ and $\mathcal{Y}$ with kernels $k$ and $l$, respectively. Associated with $\mathcal{H}$ and $\mathcal{F}$ are the feature maps $\phi : \mathcal{X}\to\mathcal{H}$ and $\varphi : \mathcal{Y}\to\mathcal{F}$ induced by the kernels $k$ and $l$. Without loss of generality, we assume the feature maps have zero means, i.e., $\mathbb{E}[\phi(x)] = 0$ and $\mathbb{E}[\varphi(y)] = 0$. Let $\Sigma_{xx}$, $\Sigma_{yy}$, $\Sigma_{xy}$, and $\Sigma_{yx}$ be the covariance operators in and between the RKHSes of $x$ and $y$.

2.1. Objective.

Using the samples $S$, our goal is to produce an estimate $f : \mathcal{X}\to\mathcal{Y}$ that generalizes well to test samples drawn according to some unknown test distribution (Blanchard et al. 2011). Since the performance of $f$ depends in part on how dissimilar the test distribution is from those in the training samples, we propose to preprocess the data to actively reduce the dissimilarity between domains. Intuitively, we want to find a transformation $B$ of the feature space $\mathcal{H}$ that (i) minimizes the distance between the empirical distributions of the transformed samples and (ii) preserves the functional relationship between $X$ and $Y$, i.e., the conditional $\mathbb{P}(Y \mid X)$. We formulate an optimization problem capturing these constraints below.

2.2. Distributional Variance

First, we define the distributional variance, which measures the dissimilarity across domains. It is convenient to represent distributions as elements in an RKHS (Berlinet and Agnan 2004, Smola et al. 2007, Sriperumbudur et al. 2010) using the mean map

$$\mu_{\mathbb{P}} := \mathbb{E}_{x\sim\mathbb{P}}\left[k(x,\cdot)\right] = \int_{\mathcal{X}} k(x,\cdot)\,\mathrm{d}\mathbb{P}(x). \qquad (1)$$

We assume that $k(x,\cdot)$ is bounded for any $x\in\mathcal{X}$, so that $\mu_{\mathbb{P}}\in\mathcal{H}$ for every $\mathbb{P}$. If $k$ is characteristic then (1) is injective, i.e., all the information about the distribution is preserved (Sriperumbudur et al. 2010). It also holds that $\mathbb{E}_{x\sim\mathbb{P}}[f(x)] = \langle\mu_{\mathbb{P}}, f\rangle_{\mathcal{H}}$ for all $f\in\mathcal{H}$ and any $\mathbb{P}$.
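To make the mean map concrete, the following sketch (not part of the original paper; the Gaussian RBF kernel, its bandwidth, and the toy data are illustrative choices) estimates inner products between empirical mean embeddings, and hence the RKHS distance between two distributions, directly from samples.

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of X and Z."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mean_embedding_inner(X, Z, sigma=1.0):
    """Empirical estimate of <mu_P, mu_Q>_H from samples X ~ P and Z ~ Q."""
    return rbf(X, Z, sigma).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # sample from P
Z = rng.normal(0.5, 1.0, size=(500, 2))   # sample from a shifted Q

# Squared RKHS distance between the embeddings:
# ||mu_P - mu_Q||^2 = <mu_P, mu_P> + <mu_Q, mu_Q> - 2 <mu_P, mu_Q>.
dist2 = (mean_embedding_inner(X, X) + mean_embedding_inner(Z, Z)
         - 2 * mean_embedding_inner(X, Z))
print(dist2)  # > 0 here; near 0 when P = Q and the kernel is characteristic
```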

We decompose the distribution $\mathscr{P}$ into $\mathscr{P}_{X}$, which generates the marginal distribution $\mathbb{P}_{X}$, and $\mathscr{P}_{Y|X}$, which generates the posterior $\mathbb{P}_{Y|X}$. The data generating process begins by generating the marginal $\mathbb{P}_{X}$ according to $\mathscr{P}_{X}$. Conditioned on $\mathbb{P}_{X}$, it then generates the conditional $\mathbb{P}_{Y|X}$ according to $\mathscr{P}_{Y|X}$. A data point $(x, y)$ is then generated according to $\mathbb{P}_{X}$ and $\mathbb{P}_{Y|X}$, respectively. Given a set of distributions $\mathcal{P} = \{\mathbb{P}_1,\ldots,\mathbb{P}_N\}$ drawn according to $\mathscr{P}_{X}$, define the $N\times N$ Gram matrix $G$ with entries

$$G_{ij} = \langle\mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j}\rangle_{\mathcal{H}} = \int\!\!\int k(x, z)\,\mathrm{d}\mathbb{P}_i(x)\,\mathrm{d}\mathbb{P}_j(z) \qquad (2)$$

for $i, j = 1,\ldots,N$. Note that $G_{ij}$ is the inner product between the kernel mean embeddings of $\mathbb{P}_i$ and $\mathbb{P}_j$ in $\mathcal{H}$. Based on (2), we define the distributional variance, which estimates the variance of the distribution $\mathscr{P}_{X}$:

Definition 1.

Introduce a probability distribution on $\mathcal{H}$ that places mass $1/N$ at each mean embedding $\mu_{\mathbb{P}_i}$, and center it to obtain the covariance operator of $\mathcal{P}$, denoted $\Sigma$. The distributional variance is

$$\mathbb{V}_{\mathcal{H}}(\mathcal{P}) := \frac{1}{N}\operatorname{tr}(\Sigma). \qquad (3)$$

The following theorem shows that the distributional variance is suitable as a measure of divergence between domains.

Theorem 1.

Let $\mathcal{P} = \{\mathbb{P}_1,\ldots,\mathbb{P}_N\}$. If $k$ is a characteristic kernel, then $\mathbb{V}_{\mathcal{H}}(\mathcal{P}) = 0$ if and only if $\mathbb{P}_1 = \mathbb{P}_2 = \cdots = \mathbb{P}_N$.

To estimate $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$ from the sample sets $S_1,\ldots,S_N$ drawn from $\mathbb{P}_1,\ldots,\mathbb{P}_N$, we define a block kernel matrix $K$ and a coefficient matrix $Q$, where the $(i,j)$-th block of $K$ is the Gram matrix evaluated between the samples $S_i$ and $S_j$. Following (3), the entries of $Q$ take one constant value for pairs of points drawn from the same domain and another for pairs drawn from different domains. Hence, the empirical distributional variance is

$$\hat{\mathbb{V}}_{\mathcal{H}}(S) = \operatorname{tr}(KQ). \qquad (4)$$
Theorem 2.

The empirical estimator $\hat{\mathbb{V}}_{\mathcal{H}}(S)$ obtained from the Gram matrix of the empirical mean embeddings is a consistent estimator of $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$.
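As a sanity check on the definition above and on Theorem 1, the following sketch (illustrative, not from the paper; kernel and data are stand-ins) computes the Gram matrix of empirical mean embeddings and the empirical distributional variance as the normalized trace of its centered version, which is close to zero when all domains coincide and clearly positive otherwise.

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def distributional_variance(samples, sigma=1.0):
    """Empirical distributional variance of a list of sample sets.

    G[i, j] estimates <mu_i, mu_j>_H; the variance is
    (1/N) tr(G) - (1/N^2) sum_ij G[i, j], i.e., the normalized trace of
    the centered Gram matrix of the mean embeddings.
    """
    N = len(samples)
    G = np.array([[rbf(Si, Sj, sigma).mean() for Sj in samples] for Si in samples])
    return G.trace() / N - G.sum() / N ** 2

rng = np.random.default_rng(0)
same = [rng.normal(0, 1, size=(300, 2)) for _ in range(5)]      # identical domains
shifted = [rng.normal(m, 1, size=(300, 2)) for m in range(5)]   # shifted domains
print(distributional_variance(same))     # close to zero
print(distributional_variance(shifted))  # clearly positive
```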

2.3. Formulation of DICA

DICA finds an orthogonal transform $B$ onto a low-dimensional subspace that minimizes the distributional variance between samples from $S$, i.e., the dissimilarity across domains. Simultaneously, we require that $B$ preserves the functional relationship between $X$ and $Y$, i.e., $\mathbb{P}(Y \mid X) = \mathbb{P}(Y \mid B^{\top}\phi(X))$.

2.3.1. Minimizing distributional variance.

In order to simplify notation, we "flatten" $S$ to $\{(x_k, y_k)\}_{k=1}^{n}$, where $n = \sum_{i=1}^{N} n_i$. Let $b_j = \sum_{k=1}^{n}\beta_{k}^{j}\phi(x_k) = \Phi\beta_j$ be the $j$-th basis function of the transform, where $\Phi = [\phi(x_1),\ldots,\phi(x_n)]$ and the $\beta_j$ are $n$-dimensional coefficient vectors; we collect them in the matrix $B = [\beta_1,\ldots,\beta_m]$. Let $\tilde{x}_k$ denote the projection of $\phi(x_k)$ onto the basis functions, i.e., $\tilde{x}_k = (\langle b_1,\phi(x_k)\rangle,\ldots,\langle b_m,\phi(x_k)\rangle)^{\top}$. The kernel on the $B$-projection of $S$ is

$$\tilde{K} = K B B^{\top} K. \qquad (5)$$

After applying the transformation, the empirical distributional variance between the sample distributions is

$$\hat{\mathbb{V}}_{\mathcal{H}}\big(B^{\top}\Phi\big) = \operatorname{tr}\big(B^{\top} K Q K B\big). \qquad (6)$$

2.3.2. Preserving the functional relationship.

The central subspace $C$ is the minimal subspace that captures the functional relationship between $X$ and $Y$, i.e., $Y \perp\!\!\!\perp X \mid C^{\top}X$. Note that in this work we generalize a linear transformation $C^{\top}x$ to a nonlinear one $B^{\top}\phi(x)$. To find the central subspace we use the inverse regression framework (Li 1991):

Theorem 3.

If there exists a central subspace $C = [c_1,\ldots,c_m]$ satisfying $Y \perp\!\!\!\perp X \mid C^{\top}X$, and for any $a$, $\mathbb{E}[a^{\top}X \mid C^{\top}X]$ is linear in $c_1^{\top}X,\ldots,c_m^{\top}X$, then $\mathbb{E}[X \mid Y] - \mathbb{E}[X]$ lies in the subspace spanned by $\Sigma_{xx}c_1,\ldots,\Sigma_{xx}c_m$, where $\Sigma_{xx}$ denotes the covariance of $X$.

It follows that the bases of the central subspace coincide with the $m$ largest eigenvectors of the covariance of the inverse regressor, $\mathbb{V}(\mathbb{E}[X\mid Y])$, premultiplied by $\Sigma_{xx}^{-1}$. Thus, the basis $c_j$ is the solution to the eigenvalue problem $\mathbb{V}(\mathbb{E}[X\mid Y])\,c_j = \lambda_j\,\Sigma_{xx}c_j$. Alternatively, for each $j$ one may maximize $c_j^{\top}\mathbb{V}(\mathbb{E}[X\mid Y])\,c_j$ subject to $c_j^{\top}\Sigma_{xx}c_j = 1$, under the condition that $c_j$ is chosen to not be in the span of the previously chosen $c_1,\ldots,c_{j-1}$. In our case, $x$ is mapped to $\phi(x)$ in the RKHS $\mathcal{H}$ induced by the kernel $k$, and the transformation has nonlinear basis functions $b_1,\ldots,b_m$. This nonlinear extension implies that the basis functions lie in a function space spanned by the eigenfunctions of the operator $\Sigma_{xx}^{-1}\mathcal{V}(\mathbb{E}[\phi(X)\mid Y])$ (Wu 2008, Kim and Pavlovic 2011). Since we always work in the feature space $\mathcal{H}$, we drop the explicit feature map from the notation below.

To avoid slicing the output space explicitly (Li 1991, Wu 2008), we exploit its kernel structure when estimating the covariance of the inverse regressor. The following result from Kim and Pavlovic (2011) states that, under a mild assumption, $\mathcal{V}(\mathbb{E}[\phi(X)\mid Y])$ can be expressed in terms of covariance operators:

Theorem 4.

If for all $f\in\mathcal{H}$ there exists $g\in\mathcal{F}$ such that $\mathbb{E}[f(X)\mid Y] = g(Y)$ for almost every $Y$, then

$$\mathcal{V}(\mathbb{E}[\phi(X)\mid Y]) = \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}. \qquad (7)$$

Let $\Phi = [\phi(x_1),\ldots,\phi(x_n)]$ and $\Psi = [\varphi(y_1),\ldots,\varphi(y_n)]$. The covariance of the inverse regressor (7) is estimated from the samples by replacing the covariance operators with their empirical counterparts, where $\hat{\Sigma}_{xy} = \frac{1}{n}\Phi\Psi^{\top}$ and $\hat{\Sigma}_{yy} = \frac{1}{n}\Psi\Psi^{\top}$. Assuming the required inverses exist, a straightforward computation (see Supplementary) shows

$$\hat{\mathcal{V}}(\mathbb{E}[\phi(X)\mid Y]) = \frac{1}{n}\,\Phi\, L\,(L + n\varepsilon I_n)^{-1}\,\Phi^{\top}, \qquad (8)$$

where $L = \Psi^{\top}\Psi$ is the Gram matrix on the outputs and $L(L + n\varepsilon I_n)^{-1}$ smoothes the affinity structure of the output space $\mathcal{Y}$, thus acting as a kernel regularizer. Since we are interested in the projection of $\hat{\mathcal{V}}(\mathbb{E}[\phi(X)\mid Y])$ onto the basis functions $b_1,\ldots,b_m$, we formulate the optimization in terms of the coefficient matrix $B$. For a new test sample $x^{t}$, the projection onto the basis function $b_j$ is $\langle b_j, \phi(x^{t})\rangle_{\mathcal{H}} = \sum_{k=1}^{n}\beta_{k}^{j}\,k(x_k, x^{t})$.

2.3.3. The optimization problem.

Combining (6) and (8), DICA finds the transformation $B$ that solves

(9)

The numerator requires that $B$ aligns with the bases of the central subspace. The denominator forces both the dissimilarity across domains and the complexity of $B$ to be small, thereby tightening generalization bounds, see §2.6. Rewriting (9) as a constrained optimization (see Supplementary) yields the Lagrangian

(10)

where $\Gamma$ is a diagonal matrix containing the Lagrange multipliers. Setting the derivative of (10) w.r.t. $B$ to zero yields the generalized eigenvalue problem:

(11)

The transformation $B$ corresponds to the leading eigenvectors of the generalized eigenvalue problem (11). (In practice, it is more numerically stable to solve a slightly regularized version of (11), where $\varepsilon$ is a small constant.)

The inverse regression framework based on covariance operators has two benefits. First, it avoids explicitly slicing the output space, which makes it suitable for high-dimensional output. Second, it allows for structured outputs on which explicit slicing may be impossible, e.g., trees and sequences. Since our framework is based entirely on kernels, it is applicable to any type of input and output variables, as long as the corresponding kernels can be defined.

2.4. Unsupervised DICA

In some application domains, such as image denoising, information about the target may not be available. We therefore derive an unsupervised version of DICA. Instead of preserving the central subspace, unsupervised DICA (UDICA) maximizes the variance of the transformed data in the feature space, which is estimated as $\operatorname{tr}(B^{\top}K^{2}B)/n$. Thus, UDICA solves

(12)

Similar to DICA, the solution of (12) is obtained by solving the generalized eigenvalue problem

(13)

UDICA is a special case of DICA. Algorithm 1 summarizes supervised and unsupervised domain-invariant component analysis.

Input: Number of basis functions $m$, regularization parameters, and the sample $S = \{S_i\}_{i=1}^{N}$.
Output: Projection $B$ and projected kernel $\tilde{K}$.
1:  Calculate the Gram matrices $K$ on the inputs and $L$ on the outputs, and the coefficient matrix $Q$.
2:  Supervised: form the matrices of the generalized eigenvalue problem (11).
3:  Unsupervised: form the matrices of the generalized eigenvalue problem (13).
4:  Solve the generalized eigenvalue problem for $B$ (the $m$ leading eigenvectors).
5:  Output $B$ and $\tilde{K} = K B B^{\top} K$.
6:  The test kernel is $\tilde{K}^{t} = K^{t} B B^{\top} K$, where $K^{t}$ is the joint kernel between test and training data.
Algorithm 1 Domain-Invariant Component Analysis
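The following sketch illustrates the computational core of Algorithm 1 in its unsupervised form. Since eqs. (12)-(13) are not reproduced above, the sketch makes its assumption explicit: the objective trades off the projected total variance $\operatorname{tr}(B^{\top}K^{2}B)$ against the projected distributional variance $\operatorname{tr}(B^{\top}KQKB)$ plus a norm penalty $\operatorname{tr}(B^{\top}KB)$, leading to a generalized eigenvalue problem. The coefficient matrix Q is derived from the definition of the distributional variance; the regularization constant `eps` is an illustrative choice, and the exact matrices in the paper's formulation may differ.

```python
import numpy as np
from scipy.linalg import eigh

def rbf(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def udica_sketch(samples, m=2, sigma=1.0, eps=1e-4):
    """Unsupervised domain-invariant projection (illustrative sketch).

    samples: list of (n_i, d) arrays, one per training domain.
    Returns the coefficient matrix B (n x m) and the pooled kernel K.
    """
    X = np.vstack(samples)
    n, N = X.shape[0], len(samples)
    K = rbf(X, X, sigma)

    # Coefficient matrix Q such that tr(K Q) equals the empirical
    # distributional variance: Q = M ((1/N) I - (1/N^2) 1 1^T) M^T,
    # where column i of M puts weight 1/n_i on the points of domain i.
    M = np.zeros((n, N))
    start = 0
    for i, S in enumerate(samples):
        M[start:start + len(S), i] = 1.0 / len(S)
        start += len(S)
    C = np.eye(N) / N - np.ones((N, N)) / N ** 2
    Q = M @ C @ M.T

    # Generalized eigenproblem: maximize the projected variance tr(B^T K^2 B)
    # while keeping the projected distributional variance tr(B^T K Q K B)
    # and the complexity tr(B^T K B) small.
    A = K @ K / n
    R = K @ Q @ K + K + eps * np.eye(n)
    vals, vecs = eigh(A, R)       # eigenvalues in ascending order
    B = vecs[:, -m:]              # m leading eigenvectors
    return B, K

# Projected kernels (cf. steps 5-6 of Algorithm 1):
# K_tilde = K @ B @ B.T @ K, and K_test_tilde = K_test @ B @ B.T @ K.
```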

2.5. Relations to Other Methods

The DICA and UDICA algorithms generalize many well-known dimension reduction techniques. In the supervised setting, if the dataset contains samples drawn from a single distribution, then the coefficient matrix $Q$ vanishes. Substituting $Q = 0$ into (11) gives an eigenvalue problem that corresponds to covariance operator inverse regression (COIR) (Kim and Pavlovic 2011).

If there is only a single distribution, then unsupervised DICA reduces to KPCA: the distributional variance term vanishes and finding $B$ reduces to an eigendecomposition of the kernel matrix $K$, which recovers KPCA (Schölkopf et al. 1998). If there are two domains, source and target, then UDICA is closely related – though not identical to – Transfer Component Analysis (Pan et al. 2011). This follows from the observation that the distributional variance of two domains is proportional to the squared RKHS distance between their mean embeddings, i.e., the squared MMD; see the proof of Theorem 1.

2.6. A Learning-Theoretic Bound

We bound the generalization error of a classifier trained after DICA preprocessing. The main complication is that samples are not identically distributed. We adapt an approach to this problem developed in Blanchard et al. (2011) to prove a generalization bound that applies after transforming the empirical sample using $B$.

Define the kernel $\bar{k}$ on $\mathfrak{P}_{\mathcal{X}}\times\mathcal{X}$ as $\bar{k}\big((\mathbb{P}, x), (\mathbb{P}', x')\big) = k_{\mathrm{P}}(\mathbb{P}, \mathbb{P}')\,k(x, x')$. Here, $k$ is the kernel on $\mathcal{X}$ and the kernel on distributions is $k_{\mathrm{P}}(\mathbb{P}, \mathbb{P}') = \kappa(\mu_{\mathbb{P}}, \mu_{\mathbb{P}'})$, where $\kappa$ is a positive definite kernel (Christmann and Steinwart 2010, Muandet et al. 2012). Let $\bar{\Phi}$ denote the corresponding feature map.

Theorem 5.

Under reasonable technical assumptions, see Supplementary, it holds with probability at least $1-\delta$ that the gap between the expected error on new domains and the training error is bounded by a sum of two terms, described below.

The LHS of the bound is the difference between the training error and the expected error (with respect to the distribution on domains $\mathscr{P}$) after applying $B$.

The first term in the bound quantifies the distributional variance after applying the transform: the higher the distributional variance, the worse the guarantee, tying in with analogous results in Ben-David et al. (2007; 2010). The second term in the bound depends on the size of the distortion introduced by $B$: the more complicated the transform, the worse the guarantee.

The bound reveals a tradeoff between reducing the distributional variance and the complexity or size of the transform used to do so. The denominator of (9) is a sum of these terms, so that DICA tightens the bound in Theorem 5.

Preserving the functional relationship (i.e., the central subspace) by maximizing the numerator in (9) should reduce the empirical risk. However, a rigorous demonstration has yet to be found.

3. Experiments

We illustrate the difference between the proposed algorithms and their single-domain counterparts using a synthetic dataset. Furthermore, we evaluate DICA in two tasks: a classification task on flow cytometry data and a regression task for Parkinson’s telemonitoring.

3.1. Toy Experiments

We generate 10 collections of data points. The data in each collection are generated from a five-dimensional zero-mean Gaussian distribution. For each collection, the covariance of the distribution is drawn from a Wishart distribution; this step simulates different marginal distributions. The output value is a function of weighted projections of the input plus additive Gaussian noise, so that the weight vectors form a low-dimensional subspace that captures the functional relationship between input and output. We then apply the KPCA, UDICA, COIR, and DICA algorithms to the dataset, using Gaussian RBF kernels for both inputs and outputs.

Fig. 2 shows projections of the training and three previously unseen test datasets onto the first two eigenvectors. The subspaces obtained from UDICA and DICA are more stable than for KPCA and COIR. In particular, COIR shows a substantial difference between training and test data, suggesting overfitting.

KPCA
UDICA
COIR
DICA
Figure 2. Projections of a synthetic dataset onto the first two eigenvectors obtained from KPCA, UDICA, COIR, and DICA. The colors of the data points correspond to the output values. The shaded boxes depict the projection of training data, whereas the unshaded boxes show projections of unseen test datasets. The feature representations learnt by UDICA and DICA are more stable across test domains than those learnt by KPCA and COIR.

3.2. Gating of Flow Cytometry Data

Graft-versus-host disease (GvHD) occurs in allogeneic hematopoietic stem cell transplant recipients when donor immune cells in the graft recognize the recipient as "foreign" and initiate an attack on the skin, gut, liver, and other tissues. It is a significant clinical problem in the field of allogeneic blood and marrow transplantation. The GvHD dataset (Brinkman et al. 2007) consists of weekly peripheral blood samples obtained from 31 patients following allogeneic blood and marrow transplant. The goal of gating is to identify the target cell subpopulations, which were found to have a high correlation with the development of GvHD (Brinkman et al. 2007). We expect to find a subspace that is robust to the biological variation between patients and indicative of GvHD development. For each patient, we select a dataset that contains sufficient numbers of the target cell populations. As a result, we omit one patient due to insufficient data. The corresponding flow cytometry datasets from 30 patients have sample sizes ranging from 1,000 to 10,000, and the proportion of target cells in each dataset ranges from 10% to 30%, depending on the development of the GvHD.

Methods   Pooling SVM                                   Distributional SVM
          100           500           1000              100           500           1000
Input     91.68 ± .91   92.11 ± 1.14  93.57 ± .77       91.53 ± .76   92.81 ± .93   92.41 ± .98
KPCA      91.65 ± .93   92.06 ± 1.15  93.59 ± .77       91.83 ± .60   90.86 ± 1.98  92.61 ± 1.12
COIR      91.71 ± .88   92.00 ± 1.05  92.57 ± .97       91.42 ± .95   91.54 ± 1.14  92.61 ± .89
UDICA     91.20 ± .81   92.21 ± .19   93.02 ± .77       91.51 ± .79   91.74 ± 1.08  93.02 ± .77
DICA      91.37 ± .91   92.71 ± .82   94.16 ± .73       91.51 ± .89   93.42 ± .73   93.33 ± .86
Table 1. Average accuracies (± standard deviation) over 30 random subsamples of the GvHD datasets; the three columns per setting correspond to 100, 500, and 1,000 subsampled data points. The pooling SVM applies a standard kernel function on the pooled data from multiple domains, whereas the distributional SVM also considers similarity between domains using kernel (14). With sufficiently many samples, DICA outperforms other methods in both the pooling and distributional settings. The performance of the pooling SVM and the distributional SVM is comparable in this case.

To evaluate the performance of the proposed algorithms, we took data from 10 patients for training and the remaining 20 patients for testing. We subsample the training sets and test sets to have 100, 500, and 1,000 data points (cells) each. We compare SVM classifiers under two settings, namely a pooling SVM and a distributional SVM. The pooling SVM disregards the inter-patient variation by combining all datasets from different patients, whereas the distributional SVM also takes the inter-patient variation into account via the kernel function of Blanchard et al. (2011),

$$\bar{k}\big((\mathbb{P}_i, x), (\mathbb{P}_j, \tilde{x})\big) = k_{\mathrm{P}}(\mathbb{P}_i, \mathbb{P}_j)\,k(x, \tilde{x}), \qquad (14)$$

where $k$ is the kernel on $\mathcal{X}$ and $k_{\mathrm{P}}$ is the kernel on distributions, computed from the empirical samples of each patient. For the pooling SVM, the kernel $k_{\mathrm{P}}$ is constant for any $\mathbb{P}_i$ and $\mathbb{P}_j$. Moreover, we use the output kernel $l(y, \tilde{y})$, which is 1 if $y = \tilde{y}$ and 0 otherwise. We compare the performance of the SVMs trained on the datasets preprocessed using the KPCA, COIR, UDICA, and DICA algorithms. It is important to note that we are not defining another kernel on top of the preprocessed data: the kernel for KPCA, COIR, UDICA, and DICA is exactly (5). We perform 10-fold cross validation on the parameter grids to optimize for accuracy.
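The sketch below shows how a kernel of the form (14) can be assembled for the pooled training data. The particular choices of $k$ and $k_{\mathrm{P}}$ used in the paper are not specified above, so the Gaussian RBF choices here, with $k_{\mathrm{P}}$ computed on empirical mean embeddings, are illustrative.

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def distributional_kernel(samples, sigma_x=1.0, sigma_p=1.0):
    """Kernel matrix of the product form k_P(P_i, P_j) * k(x, x') on pooled data.

    samples: list of (n_i, d) arrays, one per patient/domain.
    k_P is an RBF kernel on empirical mean embeddings (illustrative choice).
    """
    X = np.vstack(samples)
    N = len(samples)
    sizes = [len(S) for S in samples]
    domain_of = np.repeat(np.arange(N), sizes)   # domain index of each pooled point

    # G[i, j] = <mu_i, mu_j>_H estimated from the samples.
    G = np.array([[rbf(Si, Sj, sigma_x).mean() for Sj in samples] for Si in samples])
    d2 = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G   # ||mu_i - mu_j||^2
    KP = np.exp(-d2 / (2 * sigma_p ** 2))                    # kernel on distributions

    Kx = rbf(X, X, sigma_x)                                  # kernel on inputs
    return KP[np.ix_(domain_of, domain_of)] * Kx             # entrywise product

# The resulting matrix can be passed to an SVM with a precomputed kernel,
# e.g., sklearn.svm.SVC(kernel="precomputed"); the pooling SVM corresponds
# to replacing KP by a constant matrix.
```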

Methods   Pooling SVM    Distributional SVM
Input     92.03 ± 8.21   93.19 ± 7.20
KPCA      91.99 ± 9.02   93.11 ± 6.83
COIR      92.40 ± 8.63   92.92 ± 8.20
UDICA     92.51 ± 5.09   92.74 ± 5.01
DICA      92.72 ± 6.41   94.80 ± 3.81
Table 2. The average leave-one-out accuracies over 30 subjects on GvHD data. The distributional SVM outperforms the pooling SVM. DICA improves classifier accuracy.

Table 1 reports average accuracies and their standard deviations over 30 repetitions of the experiments. For a sufficiently large number of samples, DICA outperforms the other approaches. The pooling SVM and distributional SVM achieve comparable accuracies. The average leave-one-out accuracies over 30 subjects are reported in Table 2 (see the Supplementary for more detail).

3.3. Parkinson’s Telemonitoring

To evaluate DICA in a regression setting, we apply it to a Parkinson's telemonitoring dataset (http://archive.ics.uci.edu/ml/datasets/Parkinson’s+Telemonitoring). The dataset consists of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited for a six-month trial of a telemonitoring device for remote symptom progression monitoring. The aim is to predict the clinician's motor and total UPDRS scores of Parkinson's disease symptoms from 16 voice measures. There are around 200 recordings per patient.

Methods   Pooling GP Regression           Distributional GP Regression
          motor score     total score     motor score     total score
LLS       8.82 ± 0.77     11.80 ± 1.54    8.82 ± 0.77     11.80 ± 1.54
Input     9.58 ± 1.06     12.67 ± 1.40    8.57 ± 0.77     11.50 ± 1.56
KPCA      8.54 ± 0.89     11.20 ± 1.47    8.50 ± 0.87     11.22 ± 1.49
UDICA     8.67 ± 0.83     11.36 ± 1.43    8.75 ± 0.97     11.55 ± 1.52
COIR      9.25 ± 0.75     12.41 ± 1.63    9.23 ± 0.90     11.97 ± 2.09
DICA      8.40 ± 0.76     11.05 ± 1.50    8.35 ± 0.82     10.02 ± 1.01
Table 3. Root mean square error (RMSE) of the independent Gaussian process regressions (GPR) applied to the Parkinson's telemonitoring dataset. DICA outperforms other approaches in both settings, and distributional GP regression outperforms pooling GP regression.

We adopt the same experimental settings as in §3.2, except that we employ two independent Gaussian process (GP) regressions to predict the motor and total UPDRS scores. For COIR and DICA, we consider an output kernel on the scores to fully account for the affinity structure of the output variable, with its bandwidth parameter set to the median of the motor and total UPDRS scores. The voice measurements from 30 patients are used for training and the rest for testing.
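A minimal sketch of GP regression with a precomputed kernel, as used here for the two UPDRS targets; the noise variance and the kernel construction are illustrative, and the closed-form posterior mean below is the standard formula rather than the paper's exact implementation.

```python
import numpy as np

def gp_predict(K_train, K_test_train, y_train, noise=1.0):
    """Posterior mean of GP regression with a precomputed kernel.

    K_train: (n, n) kernel between training points,
    K_test_train: (m, n) kernel between test and training points,
    y_train: (n,) targets.
    """
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + noise * np.eye(n), y_train)
    return K_test_train @ alpha

# Two independent regressions, one per target, sharing the same kernel:
# motor_pred = gp_predict(K_tr, K_te_tr, motor_scores)
# total_pred = gp_predict(K_tr, K_te_tr, total_scores)
# K_tr and K_te_tr can be built with distributional_kernel(...) above;
# pooling GP regression uses the plain input kernel instead.
```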

Figure 3. The root mean square error (RMSE) of motor and total UPDRS scores predicted by GP regression after different preprocessing methods on the Parkinson's telemonitoring dataset. The top and middle rows depict the pooling and distributional settings; the bottom row compares the two settings. Results of linear least squares (LLS) are given as a baseline.

Fig. 3 depicts the results. DICA consistently, though not statistically significantly, outperforms other approaches, see Table 3. Inter-patient (i.e. across domain) variation worsens prediction accuracy on new patients. Reducing this variation with DICA improves the accuracy on new patients. Moreover, incorporating the inter-subject variation via distributional GP regression further improves the generalization ability, see Fig. 3.

4. Conclusion and Discussion

To conclude, we proposed a simple algorithm called Domain-Invariant Component Analysis (DICA) for learning an invariant transformation of the data, and showed both theoretically and empirically that it benefits domain generalization. Theorem 5 shows that the generalization error on previously unseen domains grows with the distributional variance. We also showed that DICA generalizes KPCA and COIR, and is closely related to TCA. Finally, experimental results on both synthetic and real-world datasets show that DICA performs well in practice. Interestingly, the results also suggest that the distributional SVM, which takes inter-domain variation into account, outperforms the pooling SVM, which ignores it.

The motivating assumption in this work is that the functional relationship is stable or varies smoothly across domains. This is a reasonable assumption for automatic gating of flow cytometry data, because the inter-subject variation of cell populations makes it impossible for a domain expert to apply the same gating to all subjects, and it similarly makes sense for the Parkinson's telemonitoring data. Nevertheless, the assumption does not hold in many applications where the conditional distributions differ substantially. It remains unclear how to develop techniques that generalize to previously unseen domains in these scenarios.

DICA can be adapted to novel applications by equipping the optimization problem with appropriate constraints. For example, one can formulate a semi-supervised extension of DICA by forcing the invariant basis functions to lie on a manifold or preserve a neighborhood structure. Moreover, by incorporating the distributional variance as a regularizer in the objective function, the invariant features and classifier can be optimized simultaneously.

Acknowledgments

We thank Samory Kpotufe and Kun Zhang for fruitful discussions and the three anonymous reviewers for insightful comments and suggestions that significantly improved the paper.

References

  • Altun and Smola [2006] Y. Altun and A. Smola. Unifying divergence minimization and statistical inference via convex duality. In Proc. of Conf. on Learning Theory (COLT), 2006.
  • Argyriou et al. [2007] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, pages 41–48. MIT Press, 2007.
  • Ben-David et al. [2007] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 19, pages 137–144. MIT Press, 2007.
  • Ben-David et al. [2010] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79:151–175, 2010.
  • Berlinet and Agnan [2004] A. Berlinet and T. C. Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic Publishers, 2004.
  • Bickel et al. [2009a] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10:2137–2155, Dec. 2009a. ISSN 1532-4435.
  • Bickel et al. [2009b] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, pages 2137–2155, 2009b.
  • Blanchard et al. [2011] G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems 24, pages 2178–2186, 2011.
  • Brinkman et al. [2007] R. R. Brinkman, M. Gasparetto, S.-J. J. Lee, A. J. Ribickas, J. Perkins, W. Janssen, R. Smiley, and C. Smith. High-content flow cytometry and temporal data analysis for defining a cellular signature of graft-versus-host disease. Biol Blood Marrow Transplant, 13(6):691–700, 2007. ISSN 1083-8791.
  • Caruana [1997] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
  • Christmann and Steinwart [2010] A. Christmann and I. Steinwart. Universal kernels on Non-Standard input spaces. In Advances in Neural Information Processing Systems 23, pages 406–414. MIT Press, 2010.
  • Fukumizu et al. [2004a] K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimensionality reduction for supervised learning. In Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004a.
  • Fukumizu et al. [2004b] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004b.
  • Gretton et al. [2009] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Dataset Shift in Machine Learning, chapter Covariate Shift by Kernel Mean Matching, pages 131–160. MIT Press, 2009.
  • Gu and Zhou [2009] Q. Gu and J. Zhou. Learning the shared subspace for multi-task clustering and transductive transfer classification. In Proceedings of the 9th IEEE International Conference on Data Mining, pages 159–168. IEEE Computer Society, 2009.
  • Kim and Pavlovic [2011] M. Kim and V. Pavlovic. Central subspace dimensionality reduction using covariance operators. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):657–670, 2011.
  • Li [1991] K.-C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
  • Muandet et al. [2012] K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems 25, pages 10–18. MIT Press, 2012.
  • Pan and Yang [2010a] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, October 2010a.
  • Pan and Yang [2010b] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22:1345–1359, 2010b.
  • Pan et al. [2011] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
  • Passos et al. [2012] A. Passos, P. Rai, J. Wainer, and H. Daumé III. Flexible modeling of latent task structures in multitask learning. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, UK, 2012.
  • Quiñonero-Candela et al. [2009] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. MIT Press, 2009.
  • Schölkopf et al. [1998] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, July 1998.
  • Smola et al. [2007] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference In Algorithmic Learning Theory, pages 13–31. Springer-Verlag, 2007.
  • Sriperumbudur et al. [2010] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 99:1517–1561, 2010.
  • Widmer and Kubat [1996] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23:69–101, 1996.
  • Wu [2008] H.-M. Wu. Kernel sliced inverse regression with applications to classification. Journal of Computational and Graphical Statistics, 17(3):590–610, 2008.

Appendix A Domain Generalization and Related Frameworks

The most fundamental assumption in machine learning is that the observations are independent and identically distributed (i.i.d.). That is, each observation comes from the same probability distribution as the others and all are mutually independent. However, this assumption is often violated in practice, in which case standard machine learning algorithms do not perform well. In the past decades, many techniques have been proposed to tackle scenarios where there is a mismatch between training and test distributions. These include domain adaptation [Bickel et al., 2009a], multitask learning [Caruana, 1997], transfer learning [Pan and Yang, 2010b], covariate/dataset shift [Quiñonero-Candela et al., 2009] and concept drift [Widmer and Kubat, 1996]. To better understand domain generalization, we briefly discuss how it relates to some of these approaches.

A.1. Transfer learning (see e.g., Pan and Yang [2010b] and references therein).

Transfer learning aims at transferring knowledge from some previous tasks to a target task when the latter has limited training data. That is, although there may be few labeled examples for the target task, "knowledge" obtained in related tasks may be available. Transfer learning focuses on improving the learning of the target predictive function using the knowledge in the source tasks. Although not identical, domain generalization can be viewed as a form of transfer learning in which knowledge of the target task is unavailable during training.

A.2. Multitask learning (see e.g., Caruana [1997] and references therein).

The goal of multitask learning is to learn multiple tasks simultaneously – especially when training examples in each task are scarce. By learning all tasks simultaneously, one expects to improve generalization on individual tasks. An important assumption is therefore that all the tasks are related. Multitask learning differs from domain generalization because learning the new task often requires retraining.

A.3. Domain adaptation (see e.g., Bickel et al. [2009a] and references therein).

Domain adaptation, also known as covariate shift, deals primarily with a mismatch between training and test distributions. Domain generalization deals with a broader setting where training instances may have been collected from multiple source domains. A second difference is that in domain adaptation one observes the target domain during the training time whereas in domain generalization one does not.

Table 4 summarizes the main differences between the various frameworks.

Framework Distribution Mismatch Multiple Sources Target Domain
Standard Setup
Transfer Learning
Multi-task Learning
Domain Adaptation
Domain Generalization
Table 4. Comparison of domain generalization with other well-known frameworks. Domain generalization is closely related to multi-task learning and domain adaptation; the difference is that one does not observe the target domains in which the classifier will be applied, and the classifier is not retrained.

Appendix B Proof of Theorem 1

Lemma 6.

Given a set of distributions $\mathcal{P} = \{\mathbb{P}_1,\ldots,\mathbb{P}_N\}$, the distributional variance of $\mathcal{P}$ is $\mathbb{V}_{\mathcal{H}}(\mathcal{P}) = \frac{1}{N}\sum_{i=1}^{N} G_{ii} - \frac{1}{N^{2}}\sum_{i,j=1}^{N} G_{ij}$, where $G_{ij} = \langle\mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j}\rangle_{\mathcal{H}}$.

Proof.

Let $\bar{\mathbb{P}}$ be the probability distribution defined as the uniform mixture of the $\mathbb{P}_i$, i.e., $\bar{\mathbb{P}} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{P}_i$. It follows from the linearity of expectation that $\mu_{\bar{\mathbb{P}}} = \frac{1}{N}\sum_{i=1}^{N}\mu_{\mathbb{P}_i}$. For brevity, we will denote $\mu_{\mathbb{P}_i}$ by $\mu_i$. Then, expanding (3) gives

$$\mathbb{V}_{\mathcal{H}}(\mathcal{P}) = \frac{1}{N}\sum_{i=1}^{N}\big\|\mu_i - \mu_{\bar{\mathbb{P}}}\big\|_{\mathcal{H}}^{2} = \frac{1}{N}\sum_{i=1}^{N} G_{ii} - \frac{1}{N^{2}}\sum_{i,j=1}^{N} G_{ij},$$

which completes the proof. ∎

  • For a characteristic kernel $k$, $\mathbb{V}_{\mathcal{H}}(\mathcal{P}) = 0$ if and only if $\mathbb{P}_1 = \mathbb{P}_2 = \cdots = \mathbb{P}_N$.

Proof.

Since $k$ is characteristic, $\|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_{\mathcal{H}}$ is a metric and is zero iff $\mathbb{P} = \mathbb{Q}$ for any distributions $\mathbb{P}$ and $\mathbb{Q}$ [Sriperumbudur et al., 2010]. By Lemma 6, $\mathbb{V}_{\mathcal{H}}(\mathcal{P}) = \frac{1}{N}\sum_{i=1}^{N}\|\mu_i - \mu_{\bar{\mathbb{P}}}\|_{\mathcal{H}}^{2}$. Thus, $\mathbb{V}_{\mathcal{H}}(\mathcal{P}) = 0$ iff $\|\mu_i - \mu_{\bar{\mathbb{P}}}\|_{\mathcal{H}} = 0$ for all $i$. Consequently, if $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$ is zero, this implies that $\mu_i = \mu_{\bar{\mathbb{P}}}$ for all $i$, meaning that $\mathbb{P}_1 = \cdots = \mathbb{P}_N$. Conversely, if $\mathbb{P}_1 = \cdots = \mathbb{P}_N$, then $\|\mu_i - \mu_{\bar{\mathbb{P}}}\|_{\mathcal{H}}$ is zero for all $i$ and thereby $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$ is zero. ∎

Appendix C Proof of Theorem 2

  • The empirical estimator $\hat{\mathbb{V}}_{\mathcal{H}}(S)$ obtained from the Gram matrix of the empirical mean embeddings is a consistent estimator of $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$.

Proof.

Recall that $\hat{\mathbb{V}}_{\mathcal{H}}(S)$ is computed from the Gram matrix $\hat{G}$ of the empirical mean embeddings $\hat{\mu}_i = \frac{1}{n_i}\sum_{k=1}^{n_i} k(x_k^i,\cdot)$, whereas $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$ is computed from the Gram matrix $G$ of the population embeddings $\mu_i$. By Theorem 15 in Altun and Smola [2006], we have fast convergence of $\hat{\mu}_i$ to $\mu_i$. Consequently, $\hat{G}_{ij}$ converges to $G_{ij}$, which implies that $\hat{\mathbb{V}}_{\mathcal{H}}(S)$ converges to $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$. Hence, $\hat{\mathbb{V}}_{\mathcal{H}}(S)$ is a consistent estimator of $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$. ∎

Appendix D Derivation of Eq. (8)

DICA employs the covariance of the inverse regressor $\mathcal{V}(\mathbb{E}[\phi(X)\mid Y])$, which can be written in terms of covariance operators. Let $\mathcal{H}$ and $\mathcal{F}$ be the RKHSes of $\mathcal{X}$ and $\mathcal{Y}$ endowed with reproducing kernels $k$ and $l$, respectively. Let $\Sigma_{xx}$, $\Sigma_{yy}$, $\Sigma_{xy}$, and $\Sigma_{yx}$ be the covariance operators in and between the corresponding RKHSes of $x$ and $y$. We define the conditional covariance operator of $X$ given $Y$, denoted by $\Sigma_{xx|y}$, as

$$\Sigma_{xx|y} := \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}. \qquad (15)$$

The following theorem from Fukumizu et al. [2004b] states that, under mild conditions, $\Sigma_{xx|y}$ equals the expected conditional variance of $\phi(X)$ given $Y$.

Theorem 7.

For any $f\in\mathcal{H}$, if there exists $g\in\mathcal{F}$ such that $\mathbb{E}[f(X)\mid Y] = g(Y)$ for almost every $Y$, then $\langle f, \Sigma_{xx|y} f\rangle_{\mathcal{H}} = \mathbb{E}_{Y}\big[\operatorname{Var}_{X|Y}[f(X)\mid Y]\big]$.

Using the identity $\operatorname{Var}[X] = \mathbb{E}[\operatorname{Var}[X\mid Y]] + \operatorname{Var}[\mathbb{E}[X\mid Y]]$, which holds for any random variables $X$ and $Y$, the covariance of the inverse regressor can be expressed in terms of the conditional covariance operators as follows:

$$\mathcal{V}(\mathbb{E}[\phi(X)\mid Y]) = \Sigma_{xx} - \Sigma_{xx|y} = \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}. \qquad (16)$$