A basic assumption in statistical machine learning is that the training and the test data come from the same distribution. However, this assumption does not hold in many real-world applications. For example, in image recognition, the distributions in training and testing can differ due to varying scene, lighting, view angle, image resolution, etc. Annotating data for a new domain is often expensive and time-consuming, so in many application scenarios plenty of data are available, but none or only a very small amount of them are labeled. Transfer learning (TL) has shown promising performance in handling this challenge, by transferring knowledge from a labeled source domain to a new (unlabeled) target domain [25, 24]. In the last decade, it has been widely used in image recognition [10, 19, 6], emotion recognition, brain-computer interfaces [14, 29], and so on [26, 16, 15].
Typical TL approaches can be categorized into parameter-based transfers, instance-based transfers, and feature transformation based transfers. Parameter-based transfers need some labeled data, whereas this paper focuses on unsupervised domain adaptation, in which the target domain does not have any labeled data at all. Instance-based transfers assume that the source and the target domains share the same conditional distribution [25, 30], which usually does not hold in practice. Feature transformation based transfers relax this assumption, and only assume that there exists a common subspace, in which the source and the target domains have similar distributions. This paper considers feature transformation based TL.
According to Pan and Yang, TL can be applied when the source and the target domains have different feature spaces, label spaces, marginal probability distributions, and/or conditional probability distributions. Existing feature transformation based TL approaches mainly focus on minimizing the distribution divergence between the source and the target domains by a distribution metric. Frequently used metrics include maximum mean discrepancy (MMD) [27], Wasserstein distance, etc. MMD on the marginal and/or conditional distribution is probably the most popular metric in TL. Existing MMD based distribution adaptation approaches consider either the marginal distribution only, or both the marginal and the conditional distributions with equal weight [19, 20, 5] or different weights, even in deep learning [8, 18] and adversarial learning [7, 21].
Among them, joint distribution adaptation (JDA) is the most widely used baseline in TL, whose idea is to measure the distribution shift between two domains using the marginal and the conditional MMD. Some works extended JDA by adding a regularization term, structural consistency, source domain discriminability, etc. For JDA based approaches, the marginal and conditional distributions are often treated equally, which may not be optimal; so, balanced distribution adaptation (BDA) was proposed to give them different weights. However, both JDA and BDA consider the marginal and conditional distributions separately, ignoring the intrinsic dependency between them. The performance may be improved if this dependency can be taken into consideration.
Two measures need to be considered during feature transformation to facilitate domain adaptation. One is transferability, which measures how well the feature representation can minimize the cross-domain discrepancies. The other is discriminability, which measures how easily different classes can be distinguished by a supervised classifier. Traditional distribution adaptation approaches usually seek to achieve high transferability [19, 3, 5], so that the knowledge learned from the source domain can be effectively transferred to the target domain; however, the feature discriminability has been largely ignored.
This paper considers the scenario in which the source and the target domains share the same feature and label spaces, which is the most common assumption in TL. Different from joint MMD based approaches, we do not use a (weighted) sum of the marginal and conditional MMDs to estimate the distribution discrepancy; instead, we use the joint probability distribution directly, which in theory can better leverage the relationship between different distributions. To consider both transferability and discriminability simultaneously, we propose joint probability MMD for distribution adaptation, which minimizes the distribution discrepancy of the same class between different domains and maximizes the distribution discrepancy between different classes. In addition, our approach can be easily kernelized to consider nonlinear shifts between the domains. The proposed approach has been verified through extensive experiments on six real-world datasets: object recognition (Office, Caltech-256, COIL20), face recognition (PIE), and digit recognition (USPS, MNIST).
2 Related Works
Our proposed JPDA is related to MMD based TL approaches, e.g., JDA and BDA, which are briefly introduced in this section.
2.1 Joint Distribution Adaptation (JDA)
Long et al. proposed JDA to measure the distribution discrepancy between the source and the target domains in a reproducing kernel Hilbert space using both the marginal and the conditional MMD:
$$d(\mathcal{D}_s, \mathcal{D}_t) \approx d\big(P(X_s), P(X_t)\big) + d\big(P(Y_s|X_s), P(Y_t|X_t)\big), \tag{1}$$
where $\mathcal{D}_s$ and $\mathcal{D}_t$ denote the source and the target domain distribution, respectively, and $d(\cdot, \cdot)$ is the MMD metric. JDA ignores the relationship between different conditional distributions, and also the relationship between the marginal and the conditional distributions.
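To make the MMD metric concrete, the following minimal sketch computes the standard biased empirical estimate of the squared MMD between two samples in the RKHS of an RBF kernel. The function names, the `gamma` bandwidth, and the convention that samples are stored as columns are assumptions made for illustration; this is not the JDA implementation.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix between the columns of X (d x n) and Y (d x m)."""
    # Pairwise squared Euclidean distances, shape (n, m).
    d2 = ((X.T[:, None, :] - Y.T[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def mmd2_biased(Xs, Xt, gamma=1.0):
    """Biased estimate of the squared MMD between two samples (columns = samples)."""
    k_ss = rbf_kernel(Xs, Xs, gamma).mean()
    k_tt = rbf_kernel(Xt, Xt, gamma).mean()
    k_st = rbf_kernel(Xs, Xt, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st
```

The estimate is zero when both samples are identical and grows as the two samples move apart, which is the behavior distribution adaptation approaches exploit when they minimize it.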
2.2 Balanced Distribution Adaptation (BDA)
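Wang et al. proposed BDA to match both the marginal and the conditional distributions between the source and the target domains, by introducing a trade-off parameter $\mu$:
$$d(\mathcal{D}_s, \mathcal{D}_t) \approx (1-\mu)\, d\big(P(X_s), P(X_t)\big) + \mu\, d\big(P(Y_s|X_s), P(Y_t|X_t)\big). \tag{2}$$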
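For $C$-class classification, BDA uses the $\mathcal{A}$-distance to estimate the marginal and the conditional shift weights:
$$\hat{\mu} \approx 1 - \frac{d_M}{d_M + \sum_{c=1}^{C} d_c}, \tag{3}$$
where $d_M$ ($d_c$) is calculated by $d_{\mathcal{A}}(\mathcal{D}_s, \mathcal{D}_t) = 2\,(1 - 2\epsilon(h))$, in which $\epsilon(h)$ means the error of training a linear classifier $h$ discriminating the two domains $\mathcal{D}_s$ and $\mathcal{D}_t$ (Class-$c$ in $\mathcal{D}_s$ and $\mathcal{D}_t$). BDA needs to train $C+1$ classifiers to learn $\mu$, which is computationally expensive for big data.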
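As a small worked example of this weight estimate, the sketch below assumes the commonly used proxy $\mathcal{A}$-distance $2(1 - 2\epsilon)$ and the weight $\mu \approx 1 - d_M / (d_M + \sum_c d_c)$. The function names and the error rates in the usage note are hypothetical; in practice each error rate would come from a separately trained domain classifier, which is where the cost of BDA arises.

```python
def proxy_a_distance(error):
    """Proxy A-distance from the error rate of a source-vs-target classifier."""
    return 2.0 * (1.0 - 2.0 * error)

def bda_weight(marginal_error, class_errors):
    """Estimate the conditional weight mu from one marginal classifier error
    and one per-class classifier error (C + 1 classifiers in total)."""
    d_m = proxy_a_distance(marginal_error)
    d_c = sum(proxy_a_distance(e) for e in class_errors)
    # Assumption: at least one distance is nonzero, so the denominator is not 0.
    return 1.0 - d_m / (d_m + d_c)
```

For instance, with a hypothetical error of 0.25 for the marginal classifier and for each of three class-wise classifiers, every distance equals 1 and the estimated weight is 1 - 1/4 = 0.75.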
3 Joint Probability Distribution Adaptation (JPDA)
This section introduces our proposed JPDA.
3.1 Problem Definition
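Assume the source domain has $n_s$ labeled samples $\{(x_{s,i}, y_{s,i})\}_{i=1}^{n_s}$, and the target domain has $n_t$ unlabeled samples $\{x_{t,j}\}_{j=1}^{n_t}$, where $x \in \mathbb{R}^{d}$ is the feature vector, and $y$ is its label, with $y \in \{1, \ldots, C\}$ for $C$-class classification.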
Assume also the feature spaces and label spaces of the two domains are the same, i.e., and . TL seeks to learn a hypothesis that maps the source domain to of the target domain. Different from previous TL approaches, we do not assume or ; instead, we assume , and define the following general objective function for classification:
where defines a metric to measure the distribution shift between the source and the target domains, is the empirical loss over the source domain’s samples, controls the model complexity of , and is a regularization parameter. For linear functions, has the form of an inner product .
3.2 Rethink the MMD Metric
In traditional domain adaptation, MMD is frequently used to reduce the marginal and the conditional distribution discrepancies between the source and the target domains, i.e.,
which is a two-step approximation of the joint probability distribution .
In the first step, it uses to estimate . In unsupervised domain adaptation, and can be estimated directly from data . For , we can train a base classifier on the labeled source data and apply it to the unlabeled target data to get the pseudo labels
, then the posterior probabilities can be estimated.
In the second step, it uses to estimate . For and , it is difficult to directly minimize the posterior probability distributions of the two domains. So, the traditional MMD uses class-conditional distributions and instead [19, 20]. Then, the class conditional distributions and are computed with the true source labels and pseudo target labels, respectively.
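Consider a linear hypothesis $h$. Define a transformation matrix $A$ for the source and the target domains. The traditional MMD can be formulated as
$$d(\mathcal{D}_s, \mathcal{D}_t) \approx \Big\| \frac{1}{n_s} \sum_{i=1}^{n_s} A^\top x_{s,i} - \frac{1}{n_t} \sum_{j=1}^{n_t} A^\top x_{t,j} \Big\|_2^2 + \sum_{c=1}^{C} \Big\| \frac{1}{n_s^c} \sum_{i=1}^{n_s^c} A^\top x_{s,i}^c - \frac{1}{n_t^c} \sum_{j=1}^{n_t^c} A^\top x_{t,j}^c \Big\|_2^2, \tag{6}$$
where $x_{s,i}^c$ ($x_{t,j}^c$) is the $i$-th ($j$-th) feature vector in the $c$-th class of the source (target) domain, and $n_s^c$ and $n_t^c$ are the number of samples in the $c$-th class of the source domain and the target domain, respectively.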
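The marginal-plus-conditional form of the traditional MMD described above can be sketched in a few lines. This is an illustrative sketch, not the JDA code: it assumes samples are stored as columns, labels are 0-based integers, target labels are pseudo labels from some base classifier, and classes absent from either domain are skipped. The function name `traditional_mmd` is hypothetical.

```python
import numpy as np

def traditional_mmd(A, Xs, ys, Xt, yt_pseudo, num_classes):
    """Marginal MMD plus class-wise (conditional) MMD after the linear map A."""
    Zs, Zt = A.T @ Xs, A.T @ Xt  # projected data, shape (p, n)
    # Marginal term: squared distance between the two domain means.
    marginal = np.sum((Zs.mean(axis=1) - Zt.mean(axis=1)) ** 2)
    # Conditional term: squared distance between per-class means.
    conditional = 0.0
    for c in range(num_classes):
        s_mask, t_mask = ys == c, yt_pseudo == c
        if s_mask.any() and t_mask.any():  # skip classes missing in a domain
            mean_s = Zs[:, s_mask].mean(axis=1)
            mean_t = Zt[:, t_mask].mean(axis=1)
            conditional += np.sum((mean_s - mean_t) ** 2)
    return float(marginal + conditional)
```

Note that each class mean is normalized by its own class size, so the class priors play no role in the conditional term.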
To quantify the distribution shift between the source and the target domains, we consider the joint probability directly.
(The Joint Probability MMD) Let $Y_s$ and $Y_t$ denote the label categories of the source and the target domain, respectively. Let the class-conditional probability and class prior probability be $P(X|Y)$ and $P(Y)$, respectively. Then, according to Bayes' rule, the joint probability MMD is
$$d(\mathcal{D}_s, \mathcal{D}_t) = d\big(P(X_s|Y_s)P(Y_s),\; P(X_t|Y_t)P(Y_t)\big). \tag{7}$$
The joint MMD is based on the product of the marginal distribution and the posterior probability, whereas the joint probability MMD is based on the product of the class-conditional probability and the class prior probability. The latter can be computed directly from the data.
Directly optimizing (7) may improve the transferability between the source and the target domains, but it does not consider the discriminability between different classes at all. So, we define the joint probability MMD as a weighted sum of two terms. The first minimizes the joint probability MMD on the same class in the two domains for better transferability, and the second maximizes the joint probability MMD on different classes for better discriminability.
Transferability: Minimize the joint probability MMD of the same class in the two domains
Since we want to transfer the knowledge in the source domain to the target domain, we minimize the joint probability MMD of the same class in the two domains according to
The conditional MMD of is:
Consider when , and when . We can define each , then (9) can be represented as
Let the source domain one-hot label matrix be , and the predicted target domain one-hot label matrix be , where . Then, the joint probability MMD of the same class between the source and the target domains is:
where and are defined as
Note that , whose every column vector is the mean feature for a certain class. The joint probability MMD of the same class in (11) is different from the conditional MMD in (10), since and are used in (6), whereas and are used in (11). Additionally, in (6) is estimated, whereas in (11) is known in advance.
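A minimal sketch of the same-class term may help here. It assumes the matrix form in which the one-hot label matrices are divided by the total domain sizes, so that every column of the product is a class mean weighted by its empirical class prior; samples are stored as columns, and the function name is hypothetical.

```python
import numpy as np

def same_class_jp_mmd(A, Xs, Ys_onehot, Xt, Yt_onehot):
    """Joint probability MMD of the same class between the two domains.

    Xs: (d, n_s), Xt: (d, n_t); Ys_onehot: (n_s, C); Yt_onehot: (n_t, C),
    where Yt_onehot holds the predicted (pseudo) target labels.
    """
    ns, nt = Xs.shape[1], Xt.shape[1]
    Ns = Ys_onehot / ns  # divide by n_s, not by the per-class count
    Nt = Yt_onehot / nt
    # Column c is P(c) times the mean projected feature of class c.
    diff = A.T @ Xs @ Ns - A.T @ Xt @ Nt  # shape (p, C)
    return float(np.sum(diff ** 2))       # squared Frobenius norm
```

Because the normalization uses the domain size instead of the class size, a class that is rare in both domains contributes less to the discrepancy than it would in the conditional MMD.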
Discriminability: Maximize the joint probability MMD between different classes
Similar to linear discriminant analysis, we want to maximize the discrepancies between different classes to increase their discriminability. We maximize the MMD of different classes according to
Similarly, using the one-hot label matrices and , the joint probability MMD of different classes is:
in which and are paired class indices of the source and the target domains’ labels.
Let be the -th column vector of . All column vectors form a set . Similarly, let denote the set with all the column vectors of . Then, and can be defined by
where is , and denotes a matrix with columns, whose each column is .
The new joint probability MMD
By adding a trade-off parameter between the transferability and the discriminability, the new joint probability MMD is then defined as
By doing this, we can improve the transferability between different domains and the discriminability between different classes simultaneously during distribution adaptation. The joint probability MMD is based on the joint probability, which can handle more probability distribution shifts, and better represent the relationship between different classes.
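The combination of the two terms can be sketched as follows, under stated assumptions: the trade-off parameter is called `mu` here, the combined metric is taken to be the same-class term minus `mu` times the different-class term, and the different-class term is computed with an explicit loop over ordered class pairs rather than with stacked label matrices. Samples are stored as columns and all function names are hypothetical.

```python
import numpy as np

def class_embeddings(A, X, Y_onehot):
    """Prior-weighted class means of the projected data, shape (p, C)."""
    return A.T @ X @ (Y_onehot / X.shape[1])

def discriminative_jp_mmd(A, Xs, Ys_onehot, Xt, Yt_onehot, mu=0.1):
    """Return (combined, transferability term, discriminability term)."""
    Es = class_embeddings(A, Xs, Ys_onehot)
    Et = class_embeddings(A, Xt, Yt_onehot)
    C = Es.shape[1]
    # Same class in both domains: to be minimized (transferability).
    same = sum(np.sum((Es[:, c] - Et[:, c]) ** 2) for c in range(C))
    # Different classes across domains: to be maximized (discriminability).
    diff = sum(np.sum((Es[:, c] - Et[:, k]) ** 2)
               for c in range(C) for k in range(C) if c != k)
    return float(same - mu * diff), float(same), float(diff)
```

Minimizing the combined value pulls matching classes together across domains while pushing mismatched class pairs apart; setting `mu` to zero recovers the same-class term alone.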
3.4 Overall Objective Function
We can embed the joint probability MMD metric into an unsupervised distribution adaptation framework. For simplicity and to verify the superiority of joint probability MMD over the traditional MMD, we only integrate it with a regularization term and a principal component preservation constraint, similar to TCA, JDA and BDA:
where is the centering matrix, in which and is a matrix with all elements being .
and can be directly obtained from the target pseudo label iterations. By setting the derivative , it becomes a generalized Eigen-decomposition problem:
The projection vectors are the eigenvectors corresponding to the smallest eigenvalues. Then, we obtain the updated subspace projection.
The pseudocode of JPDA for classification is shown in Algorithm 1.
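The iterative procedure can be sketched as below. This is a hedged reading of the description in this section, not the authors' released code: it assumes the objective is the same-class term minus `mu` times the different-class term plus `lam` times the squared Frobenius norm of the projection, subject to the centered-scatter constraint; it initializes pseudo labels with a 1-nearest-neighbor classifier on the raw features; and it solves the generalized eigenproblem by Cholesky whitening with a small `ridge` added for numerical stability. All names and default values are illustrative.

```python
import numpy as np

def one_hot(labels, num_classes):
    """(n, C) one-hot matrix for integer labels in [0, C)."""
    return np.eye(num_classes)[labels]

def nn_predict(Z_train, y_train, Z_test):
    """1-nearest-neighbor prediction; samples are columns."""
    d2 = ((Z_test.T[:, None, :] - Z_train.T[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

def jpda_sketch(Xs, ys, Xt, num_classes, p=2, mu=0.1, lam=1.0,
                n_iter=5, ridge=1e-6):
    """Return the projection A (d, p) and the final target pseudo labels.
    Assumes num_classes >= 2 and labels in [0, num_classes)."""
    d, ns, nt = Xs.shape[0], Xs.shape[1], Xt.shape[1]
    X = np.hstack([Xs, Xt])
    n = ns + nt
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = X @ H @ X.T + ridge * np.eye(d)          # constraint matrix (ridge: assumption)
    L_inv = np.linalg.inv(np.linalg.cholesky(B)) # whitening factor
    Ns = one_hot(ys, num_classes) / ns
    yt = nn_predict(Xs, ys, Xt)                  # initial pseudo labels (assumption)
    pairs = [(c, k) for c in range(num_classes)
             for k in range(num_classes) if c != k]
    for _ in range(n_iter):
        Nt = one_hot(yt, num_classes) / nt
        E_same = np.vstack([Ns, -Nt])            # (n, C)
        E_diff = np.stack([np.concatenate([Ns[:, c], -Nt[:, k]])
                           for c, k in pairs], axis=1)
        R = E_same @ E_same.T - mu * (E_diff @ E_diff.T)
        M = X @ R @ X.T + lam * np.eye(d)
        S = L_inv @ M @ L_inv.T
        _, W = np.linalg.eigh((S + S.T) / 2)     # eigenvalues in ascending order
        A = L_inv.T @ W[:, :p]                   # p smallest eigenvalues
        yt = nn_predict(A.T @ Xs, ys, A.T @ Xt)  # refresh pseudo labels
    return A, yt
```

The whitening step turns the constrained problem into an ordinary symmetric eigenproblem, so the returned projection satisfies the scatter constraint up to the added ridge; a generic generalized eigensolver could be used instead.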
To consider nonlinear distribution adaptation, kernel function in a Reproducing Kernel Hilbert Space can be adopted. Define , , and , where and is the number of samples. Then, the objective function becomes
It can be optimized in a similar way to (19).
4 Experiments
In this section, we evaluate the performance of JPDA through extensive experiments on image classification. The code is available at https://github.com/chamwen/JPDA.
4.1 Data Preparation
Office, Caltech, COIL20, PIE, MNIST and USPS are six benchmark datasets widely used to evaluate visual domain adaptation algorithms. Some examples from these datasets are shown in Figure 1.
Office-Caltech [9, 12] is a popular benchmark for visual domain adaptation. The Office database contains three real-world object domains: Amazon, Webcam, and DSLR. It has 4,652 images and 31 categories. Caltech-256 is a standard database for object recognition. The database has 30,607 images and 256 categories. Our experiments used the public Office+Caltech dataset released by Gong et al. SURF features were extracted and quantized into an 800-bin histogram with code-books computed from $k$-means clustering on a subset of images from Amazon. Then, the histograms were normalized. We had four domains: C (Caltech-256), A (Amazon), W (Webcam), and D (DSLR). By randomly selecting one domain as the source domain and a different domain as the target domain, we had $4 \times 3 = 12$ different cross-domain transfers, e.g., C→A, C→W, C→D, …, and D→W.
COIL20 contains 20 objects with 1,440 images. The images of each object were taken 5 degrees apart as the object was rotated on a turntable, and each object has 72 images. Each image has 256 gray levels per pixel. Our experiments used the public COIL20 dataset released by Long et al. The dataset was partitioned into two subsets, COIL1 and COIL2, each containing the images taken in a different set of directions. TL configuration COIL1→COIL2 was obtained by selecting all 720 images in COIL1 to form the source domain, and all 720 images in COIL2 to form the target domain. We switched the two domains to obtain another TL configuration, COIL2→COIL1.
PIE, which stands for “Pose, Illumination, Expression”, is a benchmark for face recognition. The database has 68 individuals with 41,368 face images. The face images were captured by 13 synchronized cameras (different poses) and 21 flashes (different illuminations and/or expressions). Our experiments used the public PIE dataset released by Long et al. It has five subsets: PIE1 (C05, left pose), PIE2 (C07, upward pose), PIE3 (C09, downward pose), PIE4 (C27, frontal pose), and PIE5 (C29, right pose). In each subset (pose), all face images were taken under different lighting, illumination, and expression conditions. By randomly selecting one subset (pose) as the source domain and a different one as the target domain, we had $5 \times 4 = 20$ different cross-domain TL configurations, e.g., PIE1→PIE2, PIE1→PIE3, PIE1→PIE4, PIE1→PIE5, …, and PIE4→PIE5.
The USPS dataset consists of 7,291 training images and 2,007 test images. The MNIST dataset has a training set of 60,000 examples and a test set of 10,000 examples. Our experiments used the public USPS and MNIST datasets released by Long et al. USPS and MNIST share 10 classes of digits, but they have different distributions. To speed up the experiments, a TL configuration USPS→MNIST was constructed by randomly sampling 1,800 images in USPS to form the source domain, and randomly sampling 2,000 images in MNIST to form the target domain. Another TL configuration, MNIST→USPS, was obtained by switching the above source and target domains. All images were uniformly rescaled to a common size, and gray-scale pixel values were used as feature vectors, so that the source and target data shared the same feature space.
4.2 Baseline Methods
4.3 Experimental Setup
A 1-nearest neighbor classifier was applied after TCA, JDA, BDA and JPDA. The parameter settings in  were used for TCA, JDA and BDA. We set the subspace dimensionality , the regularization parameter with linear kernel for Office+Caltech dataset, and with primal kernel for all other datasets, and the iteration number for TCA, JDA and BDA.
For JPDA, the balance parameter was set to , the regularization parameter for Office+Caltech dataset, and for all other datasets. The target domain classification accuracy was used as the performance measure.
4.4 Experimental Results
The classification accuracies of JPDA and the three baselines in the 14 cross-domain object recognition tasks and two cross-domain digit recognition tasks are given in Table 1, and in the 20 cross-domain face recognition tasks are given in Table 2. JPDA outperformed the three baselines in most tasks. The average classification accuracy of JPDA on the 20 PIE datasets was 64.62%, representing a 4.38% performance improvement over JDA. These results verified that JPDA is effective and robust in cross-domain visual recognition.
JPDA also outperformed BDA, which improves on joint MMD by adding a balance factor between the marginal MMD and the conditional MMD. These results demonstrated the limitation of BDA, i.e., the $\mathcal{A}$-distance based approach cannot guarantee improved performance.
We also verified whether JPDA can increase both the transferability and the discriminability. We used t-SNE to reduce the dimensionality of the feature to two, and visualize the data distributions. Figure 2 shows the results of the first three classes’ data distributions when transferring Caltech (source) to Amazon (target), before and after different distribution adaptation approaches, where Raw denotes the raw data distribution. For the raw distribution, the samples from Class 1 and Class 3 (also some from Class 2) from the source and the target domains are mixed together. After distribution adaptation, JPDA brings data distributions of the source and the target domains together, and also keeps samples from different classes discriminative. JDA and BDA do not have such good discriminability, especially for samples from Class 2 and Class 3.
4.5 Convergence and Time Complexity
We also empirically checked the convergence of different TL approaches. Figure 3 shows the average objective values and classification accuracies in the 20 transfer tasks on PIE, as the number of iterations increased from 1 to 20. JPDA converged very quickly and achieved high classification accuracy.
The computational costs of JPDA and the three baselines are shown in Table 3. JPDA was always faster than JDA and BDA. Especially, when the dataset is large (PIE), JPDA can save over 50% computing time. TCA is the most efficient since it does not need to be optimized iteratively.
4.6 Parameter Sensitivity
We also analyzed the parameter sensitivity of JPDA on different datasets to validate that a wide range of parameter values can be used to obtain satisfactory performance. Results on different types of datasets had shown that the number of iterations and the dimensionality of the subspace are good choices, so we only studied the sensitivity of JPDA to the two adjustable parameters, the balance parameter and the regularization parameter . The results are shown in Figure 4. JPDA is robust to the balance parameter , and to the regularization parameter .
5 Conclusion
TL makes use of data or knowledge in one task to help solve a different, yet related, task. This paper focused on distribution adaptation with joint probability MMD, and proposed a novel JPDA approach. JPDA can improve the transferability between different domains and the discriminability between different classes simultaneously, by minimizing the joint probability MMD of the same class in the source and target domains, and maximizing the joint probability MMD of different classes. Compared with traditional MMD approaches, JPDA has a simpler form, and is more effective in measuring the discrepancy between different domains. Experiments on six image classification datasets verified the effectiveness of JPDA. Our future research will consider more effective TL metrics for deep learning and adversarial learning.
References
[1] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira, ‘Analysis of representations for domain adaptation’, in Proc. Advances in Neural Information Processing Systems, pp. 137–144, Vancouver, B.C., Canada, (December 2007).
[2] Christopher M Bishop, Pattern Recognition and Machine Learning, Springer, New York, NY, 2006.
[3] Yue Cao, Mingsheng Long, and Jianmin Wang, ‘Unsupervised domain adaptation with distribution matching machines’, 2795–2802, (February 2018).
[4] Xinyang Chen, Sinan Wang, Mingsheng Long, and Jianmin Wang, ‘Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation’, in Proc. 36th Int’l Conf. on Machine Learning, pp. 1081–1090, Long Beach, CA, (June 2019).
[5] Zhengming Ding, Sheng Li, Ming Shao, and Yun Fu, ‘Graph adaptive knowledge transfer for unsupervised domain adaptation’, in Proc. 15th European Conf. on Computer Vision, pp. 37–52, Munich, Germany, (September 2018).
[6] Zhengming Ding, Ming Shao, and Yun Fu, ‘Transfer learning for image classification with incomplete multiple sources’, in Proc. Int’l Joint Conference on Neural Networks, pp. 2188–2195, Vancouver, B.C., Canada, (July 2016).
[7] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky, ‘Domain-adversarial training of neural networks’, Journal of Machine Learning Research, 17(1), 2096–2030, (May 2016).
[8] Muhammad Ghifary, W Bastiaan Kleijn, and Mengjie Zhang, ‘Domain adaptive neural networks for object recognition’, in Proc. Pacific Rim Int’l Conf. on Artificial Intelligence, pp. 898–904, Queensland, Australia, (December 2014).
[9] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman, ‘Geodesic flow kernel for unsupervised domain adaptation’, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2066–2073, Providence, Rhode Island, (June 2012).
[10] Raghuraman Gopalan, Ruonan Li, and Rama Chellappa, ‘Domain adaptation for object recognition: An unsupervised approach’, in Proc. Int’l Conf. on Computer Vision, pp. 999–1006, Barcelona, Spain, (November 2011).
[11] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola, ‘A kernel two-sample test’, Journal of Machine Learning Research, 13(3), 723–773, (March 2012).
[12] Gregory Griffin, Alex Holub, and Pietro Perona, Caltech-256 object category dataset, Technical report, Caltech, 2007.
[13] Cheng-An Hou, Yao-Hung Hubert Tsai, Yi-Ren Yeh, and Yu-Chiang Frank Wang, ‘Unsupervised domain adaptation with label and structural consistency’, IEEE Trans. on Image Processing, 25(12), 5552–5562, (September 2016).
[14] Vinay Jayaram, Morteza Alamgir, Yasemin Altun, Bernhard Scholkopf, and Moritz Grosse-Wentrup, ‘Transfer learning in brain-computer interfaces’, IEEE Computational Intelligence Magazine, 11(1), 20–31, (January 2016).
[15] Michael Kampffmeyer, Yinbo Chen, Xiaodan Liang, Hao Wang, Yujia Zhang, and Eric P Xing, ‘Rethinking knowledge graph propagation for zero-shot learning’, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 11487–11496, Long Beach, CA, (June 2019).
[16] Tuan Lai, Trung Bui, Nedim Lipka, and Sheng Li, ‘Supervised transfer learning for product information question answering’, in Proc. 17th IEEE Int’l Conf. on Machine Learning and Applications, pp. 1109–1114, Orlando, FL, (December 2018).
[17] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht, ‘Sliced wasserstein discrepancy for unsupervised domain adaptation’, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 10285–10295, Long Beach, CA, (June 2019).
[18] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan, ‘Learning transferable features with deep adaptation networks’, in Proc. 32nd Int’l Conf. on Machine Learning, volume 37, pp. 97–105, Lille, France, (July 2015).
[19] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu, ‘Transfer feature learning with joint distribution adaptation’, in Proc. IEEE Int’l Conf. on Computer Vision, pp. 2200–2207, Sydney, Australia, (December 2013).
[20] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu, ‘Transfer joint matching for unsupervised domain adaptation’, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1410–1417, Columbus, Ohio, (June 2014).
[21] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan, ‘Deep transfer learning with joint adaptation networks’, in Proc. 34th Int’l Conf. on Machine Learning, pp. 2208–2217, Sydney, NSW, Australia, (August 2017).
[22] Laurens van der Maaten and Geoffrey Hinton, ‘Visualizing data using t-SNE’, Journal of Machine Learning Research, 9, 2579–2605, (November 2008).
[23] Hong Wei Ng, Dung Nguyen, Vassilios Vonikakis, and Stefan Winkler, ‘Deep learning for emotion recognition on small datasets using transfer learning’, in Proc. ACM Int’l Conf. on Multimodal Interaction, pp. 443–449, Seattle, Washington, (November 2015).
[24] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang, ‘Domain adaptation via transfer component analysis’, IEEE Trans. on Neural Networks, 22(2), 199–210, (February 2011).
[25] Sinno Jialin Pan and Qiang Yang, ‘A survey on transfer learning’, IEEE Trans. on Knowledge and Data Engineering, 22(10), 1345–1359, (October 2009).
[26] Dong Wang and Thomas Fang Zheng, ‘Transfer learning for speech and language processing’, in Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pp. 1225–1237, Hong Kong, China, (December 2015).
[27] Hao Wang, Wei Wang, Chen Zhang, and Fanjiang Xu, ‘Cross-domain metric learning based on information theory’, in Proc. 28th AAAI Conf. on Artificial Intelligence, pp. 2099–2105, Québec, Canada, (July 2014).
[28] Jindong Wang, Wenjie Feng, Yiqiang Chen, Han Yu, Meiyu Huang, and Philip S Yu, ‘Visual domain adaptation with manifold embedded distribution alignment’, in Proc. 2018 ACM Multimedia Conf. on Multimedia Conf., pp. 402–410, Seoul, Republic of Korea, (October 2018).
[29] Dongrui Wu, ‘Online and offline domain adaptation for reducing BCI calibration effort’, IEEE Trans. on Human-Machine Systems, 47(4), 550–563, (September 2017).
[30] Jing Zhang, Wanqing Li, and Philip Ogunbona, ‘Joint geometrical and statistical alignment for visual domain adaptation’, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1859–1867, Hawaii, (July 2017).