Supervised learning is perhaps the most popular task in machine learning and has recently achieved dramatic successes in many applications such as object recognition [1, 2], object detection , speech recognition , and machine translation . These successes derive in large part from the availability of massive labeled datasets such as PASCAL VOC2007 
and ImageNet. Unfortunately, obtaining labels is often a time-consuming and costly process that requires human experts. Furthermore, the process of collecting samples is prone to dataset bias [8, 9]
, i.e., a learning algorithm trained on a particular dataset generalizes poorly across datasets. In object recognition, for example, training images may be collected under specific conditions involving camera viewpoints, backgrounds, lighting conditions, and object transformations. In such situations, the classifiers obtained with learning algorithms operating on samples from one dataset cannot be directly applied to other related datasets. Developing learning algorithms that are robust to label scarcity and dataset bias is therefore an important and compelling problem.
represents a probability distribution from which the samples are drawn and is often equated with a dataset. The domain is usually divided into two different types: thesource domain and the target domain, to distinguish between a domain with labeled samples and a domain without labeled samples. These two domains are related but different, which limits the applicability of standard supervised learning models on the target domain. In particular, the basic assumption in standard supervised learning that training and test data come from the same distribution is violated.
The goal of domain adaptation is to produce good models on a target domain, by training on labels from the source domain(s) and leveraging unlabeled samples from the target domain as supplementary information during training. Domain adaptation has demonstrated significant successes in various applications, such as sentiment classification [12, 13], visual object recognition [14, 15, 16, 17], and WiFi localization .
Finally, the problem of domain generalization arises in situations where unlabeled target samples are not available, but samples from multiple source domains can be accessed. Examples of domain generalization applications are automatic gating of flow cytometry [11, 19] and visual object recognition [20, 21, 22].
The main practical issue is that several state-of-the-art domain adaptation and domain generalization algorithms for object recognition result in optimization problems that are inefficient to solve [23, 15, 17, 22]. Therefore, they may not be suitable in situations that require a real-time learning stage. Furthermore, although domain adaptation and domain generalization are closely related problems, domain adaptation algorithms cannot in general be applied directly to domain generalization, since they rely on the availability of (unlabeled) samples from the target domain. It is highly desirable to develop algorithms that can be computed more efficiently, are compatible with both domain adaptation and domain generalization, and provides state-of-the-art performance.
1.1 Goals and Objectives
To address the fore-mentioned issues, we propose a fast unified algorithm for reducing dataset bias that can be used for both domain adaptation and domain generalization. The basic idea of our algorithm is to learn representations as inputs to a classifier that are invariant to dataset bias. Intuitively, the learnt representations should incorporate four requirements: (i) separate points with different labels and (ii) separate the data as a whole (high variance), whilst (iii) not separating points sharing a label and (iv) reducing mismatch between the two or more domains. The main contributions of this paper are as follows:
The first contribution is scatter, a simple geometric function that quantifies the mean squared distance of a distribution from its centroid. We show that the above four requirements can be encoded through scatter and establish the relationship with Linear Discriminant Analysis 
, Principal Component Analysis, Maximum Mean Discrepancy and Distributional Variance .
The second contribution is a fast scatter-based feature learning algorithm that can be applied to both domain adaptation and domain generalization problems, Scatter Component Analysis (SCA), see Algorithm 1. To the best of our knowledge, SCA is the first multi-purpose algorithm applicable across a range of domain adaptation and generalization tasks. The SCA optimization reduces to a generalized eigenproblem that admits a fast and exact solution on par with Kernel PCA  in terms of time complexity.
The third contribution is the derivation of a theoretical bound for SCA in the case of domain adaptation. Our theoretical analysis shows that domain scatter controls the generalization performance of SCA. We demonstrate that domain scatter controls the discrepancy distance under certain conditions. The discrepancy distance has previously been shown to control the generalization performance of domain adaptation algorithms .
We performed extensive experiments to evaluate the performance of SCA against a large suite of alternatives in both domain adaptation and domain generalization settings. We found that SCA performs considerably faster than the prior state-of-the-art across a range of visual object cross-domain recognition, with competitive or better performance in terms of accuracy.
1.2 Organization of the Paper
This paper is organized as follows. Section 2 describes the problem definitions and reviews existing work on domain adaptation and domain generalization. Sections 3 and 4 describes our proposed tool and also the corresponding feature learning algorithm, Scatter Component Analysis (SCA). The theoretical domain adaptation bound for SCA is then presented in Section 5. Comprehensive evaluation results and analyses are provided in Sections 6 and 7. Finally, Section 8 concludes the paper.
2 Background and Literature Review
This section establishes the basic definitions of domains, domain adaptation, and domain generalization. It then reviews existing work in domain adaptation and domain generalization, particularly in the area of computer vision and object recognition.
A domain is a probability distribution on , where and are the input and label spaces respectively. For the sake of simplicity, we equate with . The terms domain and distribution are used interchangeably throughout the paper. Let be an i.i.d. sample from a domain. It is convenient to use the notation for the corresponding empirical distribution , where is the Dirac delta. We define domain adaptation and domain generalization as follows.
Definition 1 (Domain Adaptation).
Let and be a source and target domain respectively, where . Denote by and samples drawn from both domains. The task of domain adaptation is to learn a good labeling function given and as the training examples.
Definition 2 (Domain Generalization).
Let be a set of source domains and be a target domain. Denote by samples drawn from source domains. The task of domain generalization is to learn a labeling function given as the training examples.
It is instructive to compare these two related definitions. The main difference between domain adaptation and domain generalization is on the availability of the unlabeled target samples. Both have the same goal: learning a labeling function that performs well on the target domain. In practice, domain generalization requires to work well although might not violate Definition 2. Note that domain generalization can be exactly reduced to domain adaptation if and .
Domain adaptation and domain generalization have recently attracted great interest in machine learning. We present a review of recent literature that is organized into two parts: i) domain adaptation and ii) domain generalization.
2.1 Domain Adaptation
Earlier studies on domain adaptation focused on natural language processing, see, e.g., and references therein. Domain adaptation has gained increasing attention in computer vision for solving dataset bias in object recognition [16, 29, 30, 31, 32, 15] and object detection . The reader is encouraged to consult the recent survey in visual domain adaptation  for a more comprehensive review. We classify domain adaptation algorithms into three categories: i) the classifier adaptation approach, ii) the selection/reweighting approach, and iii) the feature transformation-based approach.
. Adaptive Support Vector Machines (A-SVMs) utilize auxiliary classifiers to adapt a primary classifier that performs well on a target domain, where the optimization criterion is similar to standard SVMs. The Domain Adaptation Machine (DAM)  employs both a domain-dependent regularizer based on a smoothness assumption and a sparsity regularizer in Least-Squares SVMs . Recently, a multi-instance learning based classifier for action and event recognition, trained on weakly labeled web data, was proposed .
The reweighting/selection approach reduces sample bias by reweighting or selecting source instances that are ‘close’ to target instances – selection can be considered as the ‘hard’ version of reweighting. The basic idea has been studied under the name of covariate shift . Gong et al. applied a convex optimization strategy to select some source images that are maximally similar to the target images according to Maximum Mean Discrepancy  – referred to as landmarks. The landmarks are then used to construct multiple auxiliary tasks as a basis for composing domain-invariant features. Transfer Joint Matching (TJM)  uses a reweighting strategy as a regularizer based on -norm structured sparsity on the source subspace bases.
The feature transformation-based approach is perhaps the most popular approach in domain adaptation. Daume III  proposed a simple feature augmentation method by replicating the source and target data, both are in
, as well as zero-padding such that the resulting features are in. Li et al.  extended the method for the case of heterogeneous features, i.e., source and target features have different dimensionality, by introducing a common subspace learnt via the standard SVM formulation. A subspace learning-based algorithm, Transfer Component Analysis (TCA) and its semi-supervised version SSTCA , utilizes the Maximum Mean Discrepancy (MMD)  to minimize dataset bias in WiFi localization and text classification applications. Metric learning-based domain adaptation approaches have been proposed [16, 47], which were early studies in object recognition on the Office dataset. The idea of extracting ‘intermediate features’ to minimize dataset bias by projecting data onto multiple intermediate subspaces was also considered. Sampling Geodesic Flow (SGF)  and Geodesic Flow Kernel (GFK) 
generate multiple subspaces via an interpolation between the source and the target subspace on a Grassmann manifold – a point on the manifold is a subspace. Subspace Alignment (SA) transforms a source PCA subspace into a new subspace that is well-aligned to a target PCA subspace without requiring intermediate subspaces. A recent method called Correlation Alignment (CORAL) facilitates adaptive features by aligning the source and target covariance matrices . Other subspace learning-based methods such as Transfer Sparse Coding (TSC)  and Domain Invariant Projection (DIP)  make use of MMD, following TCA, to match the source and target distributions in the feature space. One of the methods proposed in 
follows a similar intuition by using Hellinger distance as an alternative to MMD. Algorithms based on hierarchical non-linear features or deep learning are also capable of producing powerful domain adaptive features[52, 53, 54, 55, 14, 56].
Several works have addressed Probably Approximately Correct theoretical bounds for domain adaptation. Ben-David et al.  presented the first theoretical analysis of domain adaptation, that is, an adaptation bound in classification tasks based on the -distance . Mansour et al.  extended this work in several ways built on Rademacher complexity  and the discrepancy distance, as an alternative to -distance. In this paper, we provide a domain adaptation bound for our new algorithm based on the latter analysis.
2.2 Domain Generalization
Domain generalization is a newer line of research than domain adaptation. Blanchard et al.  first studied this issue and proposed an augmented SVM that encodes empirical marginal distributions into the kernel for solving automatic gating of flow cytometry. A feature projection-based algorithm, Domain-Invariant Component Analysis (DICA) , was then introduced to solve the same problem. DICA extends Kernel PCA  by incorporating the distributional variance to reduce the dissimilarity across domains and the central subspace  to capture the functional relationship between the features and their corresponding labels.
Domain generalization algorithms also have been used in object recognition. Khosla et al.  proposed a multi-task max-margin classifier, which we refer to as Undo-Bias, that explicitly encodes dataset-specific biases in feature space. These biases are used to push the dataset-specific weights to be similar to the global weights. Fang et al. developed Unbiased Metric Learning (UML) based on a learning-to-rank framework. Validated on weakly-labeled web images, UML produces a less biased distance metric that provides good object recognition performance. Xu et al. extended an exemplar-SVM 
to domain generalization by adding a nuclear norm-based regularizer that captures the likelihoods of all positive samples. The proposed model is referred to as LRE-SVM that provides the state-of-the-art performance. More recently, an autoencoder based algorithm to extract domain-invariant features via multi-task learning has been proposed.
Although both domain adaptation and domain generalization have the same goal (reducing dataset bias), the approaches are generally not compatible to each other – domain adaptation methods cannot be directly applied to domain generalization or vice versa. To our best knowledge, only LRE-SVM can be applied to both domain adaptation and domain generalization. The domain generalization algorithm formulation as in DICA, Undo-Bias, or UML typically does not allow to take into account unlabeled data from the target domain. Furthermore, several state-of-the-art domain adaptation and domain generalization algorithms such as TJM and LRE-SVM, require the solution of a computationally complex optimization that induces high complexity in time. In this work, we establish a fast algorithm that overcomes the above issues.
We work in a feature space that is a reproducing kernel Hilbert space (RKHS) . The main motivation is to transform original inputs onto , which is high or possibly infinite dimensional space, with the hope that the new features are linearly separable. The most important property of RKHS is perhaps to allow a computationally feasible transformation onto by virtue of the kernel trick.
Definition 3 (Reproducing Kernel Hilbert Space).
Let be an arbitrary set and a Hilbert space of functions on . Define the evaluation functional by . Then is a reproducing kernel Hilbert space (RKHS) if the functional is always bounded: i.e. for all there exists an such that
It follows that there is a function (referred to as the canonical feature map) satisfying:
Consequently, for each , one can write
The function is referred to as the reproducing kernel.
Expression (1) is the weakest condition that ensures the existence of an inner product and also the ability to evaluate each function in at every point in the domain, while (2) provides more useful notion in practice.
Before introducing scatter, it is convenient to first represent domains as points in RKHS using the mean map :
Definition 4 (Mean map).
Suppose that is equipped with a kernel, and that is the corresponding RKHS with feature map . Let denote the set of probability distributions on . The mean map takes distributions on to points in :
Geometrically, the mean map is the centroid of the image of the distribution under . We define scatter as the variance of points in the image around its centroid:
Definition 5 (Scatter).
The scatter of distribution on relative to is
where is the norm on .
The scatter of a domain cannot be computed directly; instead it is estimated from observations. The scatter of a finite set of observationsis computed with respect to the empirical distribution
We provide a theorem that shows how the difference between the true scatter and a finite sample estimate decreases with the sample size.
Theorem 1 (Scatter Bound).
Suppose is a true distribution over all samples of size and is its empirical distribution. Further suppose that for all . Then, with probability ,
See Supplementary material. ∎
Note that the right hand site of the bound is smaller for lower values of and higher values of . Furthermore, if is in the form of Gaussian kernel, the bound only depends on since .
We provide an example for later use. If the input space is a vector space and is the identity then it follows immediately that
Lemma 2 (Total variance as scatter).
The scatter of the set of -dimensional points (in a matrix form) relative to the identity map , is the total variance:
where denotes the trace operation and with .
We utilize scatter to formulate a feature learning algorithm referred to as Scatter Component Analysis (SCA). Specifically, scatter quantifies requirements needed in SCA to develop an effective solution for both domain adaptation and generalization, which will be described in the next section.
4 Scatter Component Analysis (SCA)
SCA aims to efficiently learn a representation that improves both domain adaptation and domain generalization. The strategy is to convert the observations into a configuration of points in feature space such that the domain mismatch is reduced. SCA then finds a representation of the problem (that is, a linear transformation of feature space) for which (i) the source and target domains are similar and (ii) elements with the same label are similar; whereas (iii) elements with different labels are well separated and (iv) the variance of the whole data is maximized. Each requirement can be quantified throughscatter that leads to four consequences: (i) domain scatter, (ii) between-class scatter, (iii) within-class scatter, and (iv) total scatter.
The remainder of the subsection defines the above four scatter quantities in more detail (along the way relating the terms to principal component analysis, the maximum mean discrepancy, and Fisher’s linear discriminant) and describes the SCA’s learning algorithm. We will also see that SCA can be easily switched to either domain adaptation or domain generalization by modifying the configuration of the input domains.
4.1 Total Scatter
Given domains on , we define the total domain as the mean . The total scatter is then defined by
It is worth emphasizing that this definition is general in the sense that it covers both domain adaptation ( and one of them is the target domain) and domain generalization ().
Total scatter is estimated from data as follows. Let be the matrix of unlabeled samples from all domains (, where is the number of examples in the -th domain). Given a feature map corresponding to kernel , define a set of functions arranged in a column vector . After centering by subtracting the mean, the covariance matrix is . By Lemma 2,
We are interested in the total scatter after applying a linear transform to a finite relevant subspace . To avoid the direct computation of , which could be expensive or undoable, we use the kernel trick. Let be the transformed feature vectors and . After fixing such that , the total transformed scatter is
We remark that, in our notation, Kernel Principal Component Analysis (KPCA)  corresponds to the optimization problem
4.2 Domain Scatter
Suppose we are given domains on . We can think of the set as a sample from some latent distribution on domains. Equipping the sample with the empirical distribution and computing scatter relative to the identity map on yields domain scatter:
where . Note that domain scatter coincides with the distributional variance introduced in . Domain scatter is also closely related to the Maximum Mean Discrepancy (MMD), used in some domain adaptation algorithms [64, 45, 15].
Let be a set of functions . The maximum mean discrepancy between domains and is
The MMD measures the extent to which two domains resemble one another from the perspective of function class . The following theorem relates domain scatter to MMD given two domains, where the case of interest is bounded linear functions on the feature space:
Lemma 3 (Scatter recovers MMD).
The scatter of domains and on is their (squared) maximum mean discrepancy:
In particular, if is induced by a characteristic kernel on then if and only if .
Lemma 3 also tells that the domain scatter is a valid metric if the kernel on is characteristic . The most important example of a characteristic kernel is the Gaussian RBF kernel, which is the kernel used in the theoretical results and experiments below. We also remark that MMD can be estimated from observed data with bound provided in , which is analogous to Theorem 1.
Domain scatter in a transformed feature space in is estimated as follows. Suppose we have samples . Recall that , where contains projected samples from all domains: and
is the corresponding kernel matrix, where . By some algebra, the domain scatter is
where is a coefficient matrix with if , and otherwise.
4.3 Class Scatter
For each class , let denote the conditional distribution on induced by the total labeled domain when (the number of labeled domains does not necessarily equal to the number of source domains ). We define the within-class scatter and between-class scatter as
The class scatters are estimated as follows. Let denote the -tuple of source samples in class . The centroid of is . Furthermore, let denote the -tuple of all class centroids where centroid appears times in . The centroid of is then the centroid of the source domain: . It follows that the within-class scatter is
and the between-class scatter is
The right-hand sides of the above equations are the classical definitions of within- and between- class scatter . The classical linear discriminant is thus a ratio of scatters
Maximizing Fisher’s linear discriminant increases the separation of the data points with respect to the class clusters.
4.4 The Algorithm
Here we formulate the SCA’s learning algorithm by incorporating the above four quantities. The objective of SCA is to seek a representation by solving an optimization problem in the form of the following expression
Maximizing the numerator encourages SCA to preserve the total variability of the data and the separability of classes. Minimizing the denominator encourages SCA to find a representation for which the source and target domains are similar, and source samples sharing a label are similar.
Objective function. We reformulate (16) in three ways. First, we express it in terms of linear algebra. Second, we insert hyper-parameters that control the trade-off between scatters as one scatter quantity could be more important than others in a particular case. Third, we impose the constraint that is small to control the scale of the solution.
Explicitly, SCA finds a projection matrix that solves the constrained optimization
and are the trade-off parameters controlling the total and between-class scatter, and domain scatter respectively.
Observe that the above optimization is invariant to rescaling . Therefore, optimization (17) can be rewritten as
which results in Lagrangian
To solve (17), set the first derivative , inducing the generalized eigenproblem
where are the leading eigenvalues and
contains the corresponding eigenvectors.111In the implementation, a numerically more stable variant is obtained by using (20) using , where is a fixed small constant. Algorithm 1 provides a complete summary of SCA.
4.5 Relation to other methods
SCA is closely related to a number of feature learning and domain adaptation methods. To see this, let us observe Lagrangian (4.4). Setting the hyper-parameters and in (4.4) recovers KPCA. Setting and recovers the Kernel Fisher Discriminant (KFD) method . KFD with linear kernel is equivalent to Fisher’s linear discriminant, which is the basis of a domain adaptation method for object detection proposed in .
Setting and (that is, ignoring class separation) yields a new algorithm: unsupervised Scatter Component Analysis (uSCA), which is closely related to TCA. The difference between the two algorithms is that TCA constrains the total variance and regularizes the transform, whereas uSCA trades-off the total variance and constrains the transform (recall that should be small) motivated by Theorem 1. It turns out that uSCA consistently outperforms TCA in the case of domain adaptation, see Section 6.
Eliminating the term from the denominator in (17) from uSCA yields TCA . The semi-supervised extension SSTCA of TCA differs markedly from SCA. Instead of incorporating within- and between- class scatter into the objective function, SSTCA incorporates a term derived from the Hilbert-Schmidt Independence Criterion that maximizes the dependence of the embedding on labels.
uSCA is essentially equivalent to unsupervised Domain Invariant Component Analysis (uDICA) in the case of two domains . However, as for SSTCA, supervised DICA incorporates label-information differently from SCA – via the notion of a central subspace. In particular, supervised DICA requires that all data points are labeled, and so it cannot be applied in our experiments.
4.6 Computational Complexity
Here we analyze the computation complexity of the SCA algorithm. Suppose that we have domains with are the number of samples for each domain ( covers the domain generalization case). Denote the total number of samples by and the number of leading eigenvectors by . Computing the matrices , , , and takes (Line 1 at Algorithm 1). Hence, the total complexity of SCA after solving the eigendecomposition problem (Line 2) takes , or quadratic in . This complexity is similar to that of KPCA and Transfer Component Analysis .
In comparison to Transfer Joint Matching (TJM) , the prior state-of-the-art domain adaptation algorithm for object recognition, TJM uses an alternating eigendecomposition procedure in which iterations are needed. Using our notation, the complexity of TJM is , i.e., TJM is times slower than SCA.
4.7 Hyper-parameter Settings
Before reporting the detailed evaluation results, it is important to explain how SCA hyper-parameters were tuned. The formulation of SCA described in Section 4 has four hyper-parameters: 1) the choice of the kernel, 2) the number of subspace bases , 3) the between-class and total scatters trade-off , and 4) the domain scatter ,. Tuning all those hyper-parameters using a standard strategy, e.g., a grid-search, might be impractical due to two reasons. The first is of the computational complexity. The second, which is crucial, is that cross-validating a large number of hyper-parameters may worsen the generalization on the target domain, since labeled samples from the target domain are not available.
Our strategy to deal with the issue is to reduce the number of tunable hyper-parameters. For the kernel selection, we chose the RBF kernel , where the kernel bandwidth was set analytically to the median distance between samples in the aggregate domain following ,
For domain adaptation, was fixed at . Thus, only two hyper-parameters remain tunable: and . For domain generalization, was set at 1, i.e., the total scatter was eliminated, and was allowed to be tuned – the number of tunable hyper-parameters remains unchanged. The configuration is based on an empirical observation that setting is no better (if not worse) than in terms of both the cross-validation and test performance for domain generalization cases. In all evaluations, we used 5-fold cross validation using source labeled data to find the optimal and . We found that this strategy is sufficient to produce good SCA models for both domain adaptation and generalization cases.
5 Analysis of Adaptation Performance
We derive a bound for domain adapation that shows how the MMD controls generalization performance in the case of the squared loss . Despite the widespread use of the MMD for domain adaptation [68, 36, 23, 15, 45], to the best of our knowledge, this is the first generalization bound. The main idea is to incorporate the MMD (that is, domain scatter) into the adaptation bound proven for the discrepancy distance . A generalization bound for domain generalization in terms of domain scatter is given in , see remark 1.
Let denote a hypothesis class of functions from to where
is a compact set. Given a loss function defined over pairs of labelsand a distribution over , let denote the expected loss for any two hypotheses . We consider the case where the hypothesis set is a subset of an RKHS .
We first introduce discrepancy distance, , which measures the difference between two distributions and .
Definition 7 (Discrepancy Distance ).
Let be a set of functions mapping from to . The discrepancy distance between two distributions and over is defined by
The discrepancy is symmetric and satisfies the triangle inequality, but it does not define a distance in general: such that .
If we assume that we have a universal kernel [69, 70], i.e. as topological spaces, and the loss is the squared loss  then the discrepancy is a metric. The most important example of a universal kernel is the Gaussian RBF kernel, which is the kernel used in the experiments below.
The main step of the proof is to find a relationship between domain scatter and the discrepancy distance. We are able to do so in the special case where the kernel is universal and the loss is the mean-square error. The main technical challenge is that the discrepancy distance is quadratic in the hypotheses (involving terms of the form and ) whereas the MMD is linear. We therefore need to bound the effects of the multiplication operator:
Definition 8 (Multiplication Operator).
Let be the space of continuous functions on the compact set equipped with the supremum norm . Given , define the multiplication operator as the bounded linear operator given by
Note that a general RKHS is not closed under the multiplication operator . However, since the kernel is universal, it follows that is closed under multiplication since the space of continuous functions is closed under multiplication. Moreover, we can define the sup-norm on using its identification with .
The following Lemma upper bounds the norm of multiplication operator, which will be useful to prove our main theorem.
Given , , where is equipped with a universal kernel, it holds that .
Straightforward calculation. The Lemma requires a universal kernel since is only defined if . ∎
We now show that the domain scatter of two distributions upper-bounds the discrepancy distance.
Lemma 5 (Domain scatter bounds discrepancy).
Let be an RKHS with a universal kernel. Suppose that is the square loss, and consider the hypothesis set
where is a constant Let and be two domains over . Then the following inequality holds:
See Supplementary material. ∎
Lemma 5 allows us to relate domain scatter to generalization bounds for domain adaptation proven in . Before stating the bounds, we introduce Rademacher complexity , which measures the degree to which a class of functions can fit random noise. This measure is the basis of bounding the empirical loss and expected loss.
Definition 9 (Rademacher Complexity).
Let be a family of functions mapping from to and be a fixed sample of size . The empirical Rademacher complexity of with respect to the sample is
where are Rademacher variables, with s independent uniform random variables taking values in
s independent uniform random variables taking values in. The Rademacher complexity over all samples of size is
The supplementary material discusses how to associate a family of functions to a loss function, and provides a useful Rademacher bound. We now have all the ingredients to derive domain adaptation bounds in terms of domain scatter.
Let and be the true labeling functions on domain and respectively, and and be the minimizers. For a successful domain adaptation, we shall assume that is small. The following theorem provides a domain adaptation bound in terms of scatter (recall that the MMD is a special case of scatter by Lemma 3).
Theorem 6 (Adaptation bounds with domain scatter).
Let be a family of functions mapping from to , and be a source and target sample respectively. Let the rest of the assumptions be as in Lemma 5 and Theorem 8 in the supplementary material. For any hypothesis , with probability at least , the following adaptation bound holds:
It is instructive to compare Theorem 6 above with Theorem 9 in , which is the analog if we expand in (5) with its empirical measure. It is also straightforward to rewrite the bound in term of the empirical scatter by applying Theorem 1.
The significance of Theorem 6 is twofold. First, it highlights that the scatter controls the generalization performance in domain adaptation. Second, the bound shows a direct connection between scatter (also MMD) and the domain adaptation theory proposed in . Note that the bound might not be useful for practical purposes, since it is loose and pessimistic as they hold for all hypotheses and all possible data distributions.
Remark 1 (The role of scatter in domain generalization).
Theorem 5 of  shows that the domain scatter (or, alternatively, the distributional variance) is one of the key terms arising in a generalization bound in the setting of domain generalization.
6 Experiment I : Domain Adaptation
The first set of experiments evaluated the domain adaptation performance of SCA on synthetic data and real-world object recognition tasks. The synthetic data was designed to understand the behavior of the learned features compared to other algorithms, whereas the real-world images were utilized to verify the performance of SCA.
The experiments are divided into two parts. Section 6.1 visualizes performance on synthetic data. Section 6.2 evaluates performance on a range of cross-domain object recognition tasks with a standard yet realistic hyper-parameter tuning. Some additional results with a tuning protocol established in the literature are also reported in the supplementary material for completeness.
6.1 Synthetic data
Figure 1 depicts synthetic data that consists of two dimensional data points under three classes with six clusters. The data points in each cluster were generated from a Gaussian distribution, where and
is the mean and standard deviation of the-th cluster. The RBF kernel was used for all algorithms. All tunable hyper-parameters were selected according to 1-nearest neighbor’s test accuracy. We compare features extracted from Kernel Principal Component Analysis (KPCA), Semi-Supervised Transfer Component Analysis (SSTCA) , Transfer Joint Matching (TJM) , and SCA.
The top row of Figure 1 illustrates how the features extracted from the MMD-based algorithms (SSTCA, TJM, and SCA) reduce the domain mismatch. Red and blue colors indicate the source and target domains, respectively. Good features for domain adaptation should have a configuration of which the red and blue colors are mixed. This effect can be seen in features extracted from SSTCA, TJM, and SCA, which indicates that the domain mismatch is successfully reduced in the feature space. In classification, domain adaptive features should also have a certain level of class separability. The bottom row highlights a major difference between SCA and the other algorithms in terms of the class separability: the SCA features are more clustered with respect to the classes, with more prominent gaps among clusters. This suggests that it would be easier for a simple function to correctly classify SCA features.
6.2 Real world object recognition
We summarize the complete domain adaptation results over a range of cross-domain object recognition tasks. Several real-world image datasets were utilized such as handwritten digits (MNIST  and USPS ) and general objects (MSRC , VOC2007 , Caltech-256 , Office ). Three cross-domain pairs were constructed from these datasets: USPS+MNIST, MSRC+VOC2007, and Office+Caltech.
6.2.1 Data Setup
The USPS+MNIST pair consists of raw images subsampled from datasets of handwritten digits. MNIST contains 60,000 training images and 10,000 test images of size . USPS has 7,291 training images and 2,007 test images of size  . The pair was constructed by randomly sampling 1,800 images from USPS and 2,000 images from MNIST. Images were uniformly rescaled to size and encoded into feature vectors representing the gray-scale pixel values. Two source target classification tasks were constructed: usps mnist and mnist usps.
The MSRC+VOC2007 pair consist of 240-dimensional images that share 6 object categories: “aeroplane”, “bicycle”,“bird”, “car”, “cow”, and “sheep” taken from the MSRC and VOC2007  datasets. The pair was constructed by selecting all 1,269 images in MSRC and 1,530 images in VOC2007. As in , features were extracted from the raw pixels as follows. First, images were uniformly rescaled to be 256 pixels in length. Second, 128-dimensional dense SIFT (DSIFT) features were extracted using the VLFeat open source package 
. Finally, a 240-dimensional codebook was created using K-means clustering to obtain the codewords.
The Office+Caltech consists of 2,533 images of ten categories (8 to 151 images per category per domain), that forms four domains: () amazon, () dslr, () webcam, and () caltech. amazon images were acquired in a controlled environment with studio lighting. dslr consists of high resolution images captured by a digital SLR camera in a home environment under natural lighting. webcam images were acquired in a similar environment to dslr, but with a low-resolution webcam. Finally, caltech images were collected from Google Images . Taking all possible source-target combinations yields 12 cross-domain datasets denoted by . We used two types of extracted features from these datasets that are publicly available: SURF-BoW222http://www-scf.usc.edu/~boqinggo/da.html  and 333http://vc.sce.ntu.edu.sg/transfer_learning_domain_adaptation/domain_adaptation_home.html . SURF-BoW features were extracted using SURF  and quantized into 800-bin histograms with codebooks computed by K-means on a subset of amazon images. The final histograms were standardized to have zero mean and unit standard deviation in each dimension. Deep Convolutional Activation Features (DeCAF) were constructed by 
using the deep convolutional neural network architecture in. The model inputs are the mean-centered raw RGB pixel values that are forward propagated through 5 convolutional layers and 3 fully-connected layers. We used the outputs from the 6th layer as the features, leading to dimensional features.