The success of deep visual learning largely relies on the abundance of data annotated with ground-truth labels where the main assumption is that the training and test data follow from the same underlying distribution. However, in real-world problems this presumption rarely holds due to a number of artifacts, such as the different types of noise or sensors, changes in object view or context, resulting in degradation of performance during inference on test data. One way to address this problem would be to collect labeled data in the test domain and learn a test-specific classifier while possibly leveraging the model estimated from the training data. Nevertheless, this would typically be a highly costly effort.
Domain adaptation, a formalism to circumvent the aforementioned problem, is the task of adapting a model trained in one domain, called the source, to another target domain, where the source domain data is typically fully labeled but we only have access to images from the target domain with no (or very few) labels. Although there are several slightly different setups for the problem, in this paper we focus on the unsupervised domain adaptation (UDA) with classification of instances as the ultimate objective. That is, given the fully labeled data from the source domain and unlabeled data from the target, the goal is to learn a classifier that performs well on the target domain itself.
One mainstream direction to tackle UDA is the shared space embedding process. The idea is to find a latent space shared by both domains such that the classifier learned on it using the fully labeled data from the source will also perform well on the target domain. This is accomplished, and supported in theory, by enforcing a requirement that the distributions of latent points in the two domains be indistinguishable from each other. A large family of UDA approaches including [19, 17, 1, 13, 34, 26, 16, 37, 15] leverage this idea (see Sec. 4 for more details). However, their performance remains unsatisfactory, in part because the methods inherently rely on matching of marginal, class-free distributions while using the underlying assumption that the shift in the two distributions, termed covariate shift, can be reduced without using the target domain labels.
To address this issue, an effective solution was recently proposed that takes into account the class-specific decision boundary. Its motivation follows a theorem relating the target domain error to the maximal disagreement between any two classifiers, which is tighter than the former bound. It implies that a provably small target error is achievable by minimizing the maximum classifier discrepancy (MCD). This approach, the MCD Algorithm (MCDA for short), attempted to minimize MCD directly using adversarial learning similar to GAN training, i.e., through solving a minimax problem that finds the pair of most discrepant classifiers and reduces their disagreement.
In this paper we further extend the MCD principle by proposing a more systematic and effective way to achieve consistency in the hypothesis space of classifiers $\mathcal{H}$ through Gaussian process (GP) endowed priors, with deep neural networks (DNNs) used to induce their mean and covariance functions. The crux of our approach is to regard the classifiers as random functions and use their posterior distribution, conditioned on the source samples, as the prior on $\mathcal{H}$. The key consequences and advantages of this Bayesian treatment are: (1) One can effectively minimize the inconsistency in $\mathcal{H}$ over the target domain by regularizing the source-induced prior using a max-margin learning principle, a significantly easier-to-solve task than the minimax optimization of MCDA, which may suffer from the difficulty of attaining an equilibrium point coupled with the need for proper initialization. (2) We can quantify the measure of prediction uncertainty and use it to credibly gauge the quality of prediction at test time.
We further adopt a deep kernel technique within variational inference to turn the non-parametric Bayesian inference into a more tractable parametric one, leading to a learning algorithm computationally as scalable and efficient as conventional (non-Bayesian) deep models. Our extensive experimental results on several standard benchmarks demonstrate that the proposed approach achieves state-of-the-art prediction performance, outperforming recent UDA methods including MCDA.
2 Problem Setup and Preliminaries
We begin with the formal description of the UDA task for a multi-class classification problem.
Unsupervised domain adaptation: Consider the joint space of inputs and class labels, $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{Y} = \{1, \dots, K\}$ for ($K$-way) classification. Suppose we have two domains on this joint space, source (S) and target (T), defined by unknown distributions $p_S(\mathbf{x}, y)$ and $p_T(\mathbf{x}, y)$, respectively. We are given source-domain training examples with labels, $\mathcal{D}_S = \{(\mathbf{x}_i^S, y_i^S)\}_{i=1}^{N_S}$, and target data $\mathcal{D}_T = \{\mathbf{x}_i^T\}_{i=1}^{N_T}$ with no labels. We assume that the two domains share the same set of class labels. The goal is to assign the correct class labels $\{y_i^T\}$ to the target data points $\mathcal{D}_T$.
To tackle the problem in the shared latent space framework, we seek to learn the embedding function $G: \mathcal{X} \to \mathcal{Z}$ and a classifier $h: \mathcal{Z} \to \mathcal{Y}$ in the shared latent space $\mathcal{Z}$. The embedding function and the classifier are shared across both domains and will be applied to classify samples in the target domain using the composition $y = h(\mathbf{z}) = h(G(\mathbf{x}))$.
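Our goal is to find the pair $(h, G)$ resulting in the lowest generalization error on the target domain,
$$(h^*, G^*) = \arg\min_{h, G} \; e_T(h, G) = \arg\min_{h, G} \; \mathbb{E}_{(\mathbf{x}, y) \sim p_T(\mathbf{x}, y)}\big[I(h(G(\mathbf{x})) \neq y)\big], \tag{1}$$
with $I(\cdot)$ the indicator function. Optimizing $e_T$ directly is typically infeasible. Instead, one can exploit the known upper bounds on the target error, which we restate, without loss of generality, for the case of fixed $G$.
$$e_T(h) \leq e_S(h) + \sup_{h, h' \in \mathcal{H}} \big| d_S(h, h') - d_T(h, h') \big| + e^* \tag{2}$$
$$\phantom{e_T(h)} \leq e_S(h) + \sup_{h \in \mathcal{H}} \big| \mathbb{E}_{\mathbf{z} \sim S}[I(h(\mathbf{z}) = +1)] - \mathbb{E}_{\mathbf{z} \sim T}[I(h(\mathbf{z}) = +1)] \big| + e^* \tag{3}$$
Here $e_S(h)$ is the error rate of $h$ on the source domain, $e^* := \min_{h \in \mathcal{H}} e_S(h) + e_T(h)$, and $d_S(h, h') := \mathbb{E}_{\mathbf{z} \sim S}[I(h(\mathbf{z}) \neq h'(\mathbf{z}))]$ denotes the discrepancy between two classifiers $h$ and $h'$ on the source domain $S$, and similarly for $d_T(h, h')$. We use $\mathbf{z} \sim S$ to denote the distribution of $\mathbf{z}$ in the latent space induced by $G$ and $p_S(\mathbf{x})$.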
Looser bound. With $e^*$ an uncontrollable quantity, due to the lack of labels for $T$ in the training data, the optimal $h$ can be sought through minimization of the source error $e_S(h)$ and the worst-case discrepancy terms. In the looser bound (3), the supremum term is, up to a constant, equivalent to $\sup_{h \in \mathcal{H}} \mathbb{E}_{\mathbf{z} \sim S}[I(h(\mathbf{z}) = +1)] + \mathbb{E}_{\mathbf{z} \sim T}[I(h(\mathbf{z}) = -1)]$, the maximal accuracy of a domain discriminator (labeling $S$ as $+1$ and $T$ as $-1$). Hence, to reduce the upper bound one needs to choose the embedding $G$ where the source and the target inputs are indistinguishable from each other in $\mathcal{Z}$. This input density matching was exploited in many previous approaches [57, 14, 7, 56], and is typically accomplished through adversarial learning or the maximum mean discrepancy.
Tighter bound. Recently, the tighter bound (2) was exploited under the assumption that $\mathcal{H}$ is restricted to classifiers with small errors on $S$. Consequently, $d_S(h, h')$ becomes negligible as any two $h, h' \in \mathcal{H}$ agree on the source domain. The supremum in (2), interpreted as the Maximum Classifier Discrepancy (MCD), reduces to:
$$\sup_{h, h' \in \mathcal{H}} \mathbb{E}_{\mathbf{z} \sim T}\big[I(h(\mathbf{z}) \neq h'(\mathbf{z}))\big]. \tag{4}$$
The resulting approach, named MCDA, aims to minimize (4) directly via adversarial-cooperative learning of two deep classifier networks $h(\mathbf{z})$ and $h'(\mathbf{z})$. For the source domain data, these two classifiers aim to minimize the classification errors cooperatively. An adversarial game is played in the target domain: $h$ and $h'$ aim to be maximally discrepant, whereas $G$ seeks to minimize the discrepancy (see the Supplementary Material for further technical details).
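To make the discrepancy between two classifiers concrete, the following minimal sketch estimates it empirically as the fraction of points on which the two classifiers disagree. The function name `empirical_discrepancy` and the prediction lists are hypothetical and chosen only for illustration; this is not the MCDA training procedure, only the quantity that the adversarial game described above pushes up (for the classifiers) and down (for the embedding).

```python
from typing import Sequence


def empirical_discrepancy(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Fraction of points on which two classifiers predict different labels."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if len(preds_a) == 0:
        raise ValueError("need at least one prediction")
    disagreements = sum(1 for a, b in zip(preds_a, preds_b) if a != b)
    return disagreements / len(preds_a)


# Hypothetical predictions of two classifiers on five target-domain points.
h_preds = [0, 1, 2, 1, 0]
h_prime_preds = [0, 1, 1, 1, 2]
target_discrepancy = empirical_discrepancy(h_preds, h_prime_preds)
```

On these made-up predictions the two classifiers disagree on two of five points. In practice, differentiable surrogates of this count are used so that the discrepancy can be optimized by gradient descent.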
3 Our Approach
Overview. We adopt the MCD principle, but propose a more systematic and effective way to achieve hypothesis consistency, instead of the difficult minimax optimization. Our idea is to adopt a Bayesian framework to induce the hypothesis space. Specifically, we build a Gaussian process classifier model on top of the shared space. The GP posterior inferred from the source domain data naturally defines our hypothesis space $\mathcal{H}$. We then optimize the embedding and the kernel of the GP so that the posterior hypothesis distribution leads to consistent (least discrepant) class predictions most of the time, resulting in reduction of (4). The details are described below.
3.1 GP-endowed Maximum Separation Model
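We consider a multi-class Gaussian process classifier defined on $\mathcal{Z}$: there are $K$ underlying latent functions $\mathbf{f}(\cdot) := \{f_j(\cdot)\}_{j=1}^{K}$, a priori independently GP distributed, namely
$$P(\mathbf{f}) = \prod_{j=1}^{K} P(f_j), \qquad f_j \sim \mathcal{GP}\big(0, k_j(\cdot, \cdot)\big), \tag{5}$$
where each $k_j$ is a covariance function of $f_j$, defined on $\mathcal{Z} \times \mathcal{Z}$. For an input point $\mathbf{z} \in \mathcal{Z}$, we regard $f_j(\mathbf{z})$ as the model's confidence toward class $j$, leading to the class prediction rule:
$$h(\mathbf{z}) = \arg\max_{1 \leq j \leq K} f_j(\mathbf{z}). \tag{6}$$
We use the softmax likelihood model,
$$P(y = j \mid \mathbf{f}(\mathbf{z})) = \frac{e^{f_j(\mathbf{z})}}{\sum_{r=1}^{K} e^{f_r(\mathbf{z})}}, \qquad j = 1, \dots, K. \tag{7}$$
Source-driven Prior. The labeled source data, $\mathcal{D}_S$, induces a posterior distribution on the latent functions $\mathbf{f}$,
$$p(\mathbf{f} \mid \mathcal{D}_S) \propto p(\mathbf{f}) \cdot \prod_{i=1}^{N_S} P\big(y_i^S \mid \mathbf{f}(\mathbf{z}_i^S)\big), \tag{8}$$
where $\mathbf{z}_i^S = G(\mathbf{x}_i^S)$. The key idea is to use (8) to define our hypothesis space $\mathcal{H}$. The posterior places most of its probability mass on those $\mathbf{f}$ that attain high likelihood scores on $S$ while being smooth due to the GP prior. Note that we use the term prior for the hypothesis space $\mathcal{H}$, which is induced from the posterior of the latent functions $\mathbf{f}$. We use the prior of $\mathcal{H}$ and the posterior of $\mathbf{f}$ interchangeably.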
Note that due to the non-linear/non-Gaussian likelihood (7), exact posterior inference is intractable, and one has to resort to approximate inference. We will discuss an approach for efficient variational approximate inference in Sec. 3.2. For the exposition here, let us assume that the posterior distribution is accessible.
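As a small illustration of the class prediction rule and the softmax likelihood introduced above, the sketch below evaluates both for one vector of latent function values. The helper names `predict_class` and `softmax_likelihood`, as well as the example values, are assumptions made here for illustration and do not come from the paper.

```python
import math
from typing import List


def predict_class(f_values: List[float]) -> int:
    """Class prediction rule: index of the largest latent function value."""
    return max(range(len(f_values)), key=lambda j: f_values[j])


def softmax_likelihood(f_values: List[float]) -> List[float]:
    """Softmax class probabilities computed from latent function values."""
    shift = max(f_values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in f_values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical latent values f_1(z), f_2(z), f_3(z) at a single point z.
f_z = [2.0, 0.5, -1.0]
probs = softmax_likelihood(f_z)
label = predict_class(f_z)
```

Because the exponential is monotone, the class with the largest latent value is also the class with the largest softmax probability, so the two functions always agree on the most likely label.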
Target-driven Maximally Consistent Posterior. While $\mathcal{D}_S$ serves to induce the prior of $\mathcal{H}$, $\mathcal{D}_T$ will be used to reshape this prior. According to MCD, we want this hypothesis space to be shaped in the following way: for each target domain point $\mathbf{z} = G(\mathbf{x})$, $\mathbf{x} \sim T$, the latent function values $\mathbf{f}(\mathbf{z})$ sampled from the posterior (8) should lead to a class prediction (made by (6)) that is as consistent as possible across the samples.
This is illustrated in Fig. 1. Consider two different priors, $p_A$ and $p_B$, at a point $\mathbf{z}$, that is, $p_A(\mathbf{f}(\mathbf{z}))$ and $p_B(\mathbf{f}(\mathbf{z}))$, where for brevity we drop the conditioning on $\mathcal{D}_S$ in notation. For simplicity, we assume that the latent functions $f_j$'s are independent from each other. Fig. 1 shows that the distributions of $f_j$'s are well-separated from each other in $p_A$, yet overlap significantly in $p_B$. Hence, there is a strong chance for the class predictions to be inconsistent in $p_B$, but consistent in $p_A$ (identical ordering of the colored samples below the figure). This means that the hypothesis space induced from $p_B$ contains highly discrepant classifiers, whereas most classifiers in the hypothesis space of $p_A$ agree with each other (least discrepant). In other words, the maximum discrepancy principle translates into the maximum posterior separation in our Bayesian GP framework.
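The effect described above can be checked with a short Monte Carlo sketch: draw latent values from independent Gaussians and count how often the sampled prediction matches the prediction at the means. The means, standard deviations, three-class setup, and function name `prediction_consistency` are all hypothetical; they only mimic a well-separated and a heavily overlapping posterior and are not taken from the paper.

```python
import random
from typing import List


def prediction_consistency(means: List[float], stds: List[float],
                           num_samples: int = 2000, seed: int = 0) -> float:
    """Fraction of posterior samples whose argmax equals the argmax of the means."""
    rng = random.Random(seed)
    map_class = max(range(len(means)), key=lambda j: means[j])
    agree = 0
    for _ in range(num_samples):
        # One joint sample of the (independent) latent function values.
        draws = [rng.gauss(m, s) for m, s in zip(means, stds)]
        sampled_class = max(range(len(draws)), key=lambda j: draws[j])
        if sampled_class == map_class:
            agree += 1
    return agree / num_samples


# Well-separated posterior: sampled predictions almost never change.
separated = prediction_consistency([3.0, 0.0, -3.0], [0.3, 0.3, 0.3])
# Overlapping posterior: sampled predictions fluctuate across samples.
overlapping = prediction_consistency([0.3, 0.0, -0.3], [1.0, 1.0, 1.0])
```

A consistency close to one corresponds to a hypothesis space whose classifiers mostly agree at that point, which is the situation the maximum posterior separation criterion aims for.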
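We describe how this goal can be properly formulated. First we consider the posterior of $\mathbf{f}$ to be approximated as an independent Gaussian (this choice conforms to the variational density family we choose in Sec. 3.2). For any target domain point $\mathbf{z}$ and each $j = 1, \dots, K$, let the mean and the variance of the prior in (8) be:
$$\mu_j(\mathbf{z}) := \int f_j(\mathbf{z}) \, p\big(f_j(\mathbf{z}) \mid \mathcal{D}_S, \mathbf{z}\big) \, df_j(\mathbf{z}), \tag{9}$$
$$\sigma_j^2(\mathbf{z}) := \int \big(f_j(\mathbf{z}) - \mu_j(\mathbf{z})\big)^2 \, p\big(f_j(\mathbf{z}) \mid \mathcal{D}_S, \mathbf{z}\big) \, df_j(\mathbf{z}). \tag{10}$$
The maximum-a-posteriori (MAP) class prediction by the model is denoted by $j^* = \arg\max_{1 \leq j \leq K} \mu_j(\mathbf{z})$. As we seek to avoid fluctuations in the class prediction $j^*$ across samples, we consider the worst scenario where even an unlikely (e.g., at chance level) sample from $f_j(\mathbf{z})$, $j$ other than $j^*$, cannot overtake $f_{j^*}(\mathbf{z})$. That is, we seek
$$\mu_{j^*}(\mathbf{z}) - \alpha \sigma_{j^*}(\mathbf{z}) \geq \max_{j \neq j^*} \big(\mu_j(\mathbf{z}) + \alpha \sigma_j(\mathbf{z})\big), \tag{11}$$
where $\alpha$ is the normal cutting point for the least chance (e.g., $\alpha = 1.96$ if a 2.5% one-sided tail is considered).
While this should hold for most samples, it will not hold for all. We therefore introduce an additional slack $\xi(\mathbf{z}) \geq 0$ to relax the desideratum. Furthermore, for ease of optimization (we used the topk() function in PyTorch to compute the largest and the second largest elements; the function allows automatic gradients), we impose a slightly stricter constraint than (11), leading to the final constraint:
$$\max_{1 \leq j \leq K} \mu_j(\mathbf{z}) \geq 1 + \max_{j \neq j^*} \mu_j(\mathbf{z}) + \alpha \max_{1 \leq j \leq K} \sigma_j(\mathbf{z}) - \xi(\mathbf{z}). \tag{12}$$
A constant, $1$ here, was added to normalize the scale of $f_j$'s.
Our objective now is to find such an embedding $G$, GP kernel parameters $k$, and minimal slacks $\xi$, to impose (12). Equivalently, we pose it as the following optimization problem, for each $\mathbf{z} \sim T$:
$$\min_{G, k} \Big( \max_{j \neq j^*} \mu_j(\mathbf{z}) - \max_{1 \leq j \leq K} \mu_j(\mathbf{z}) + 1 + \alpha \max_{1 \leq j \leq K} \sigma_j(\mathbf{z}) \Big)_+, \tag{13}$$
with $(a)_+ = \max(0, a)$.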
Note that (12) and (13) are reminiscent of the large-margin classifier learning in traditional supervised learning. In contrast, we replace the ground-truth labels with the most confidently predicted labels by our model since the target domain is unlabeled. This aims to place class boundaries in low-density regions, in line with the entropy minimization or max-margin confident prediction principle of classical semi-supervised learning [20, 67].
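The max-margin separation criterion can be sketched for a single target point as a hinge on the gap between the largest and second largest posterior means, inflated by the largest posterior standard deviation. The exact form below, the default `alpha=1.96`, and the unit `margin` are assumptions of this sketch, and `max_separation_loss` is a hypothetical helper name. Plain Python sorting is used here, whereas a differentiable top-k operation would be needed for gradient-based training.

```python
from typing import List


def max_separation_loss(mu: List[float], sigma: List[float],
                        alpha: float = 1.96, margin: float = 1.0) -> float:
    """Hinge penalty that is zero once the top posterior mean beats the runner-up
    by at least `margin` plus `alpha` times the largest posterior std."""
    if len(mu) != len(sigma) or len(mu) < 2:
        raise ValueError("need matching mean/std lists with at least two classes")
    ranked = sorted(mu, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    violation = runner_up - top + margin + alpha * max(sigma)
    return max(0.0, violation)


# Hypothetical posterior means and stds at two target points.
confident = max_separation_loss([5.0, 0.0, -1.0], [0.5, 0.5, 0.5])
uncertain = max_separation_loss([1.0, 0.8, -1.0], [0.5, 0.5, 0.5])
```

The first point is already separated by a wide margin and incurs no penalty, while the second incurs a positive penalty that shrinks as the means move apart or the variances decrease.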
In what follows, we describe an approximate, scalable GP posterior inference, where we combine the variational inference optimization with the aforementioned posterior maximum separation criterion (13).
3.2 Variational Inference with Deep Kernels
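We describe our scalable variational inference approach to approximate the posterior (8). Although there are scalable GP approximation schemes based on the random feature expansion and the pseudo/induced inputs [43, 51, 55, 12], here we adopt the deep kernel trick [25, 60] to exploit the deeply structured features. The main idea is to model an explicit finite-dimensional feature space mapping to define a covariance function. Specifically, we consider a nonlinear feature mapping $\phi: \mathcal{Z} \to \mathbb{R}^d$ such that the covariance function is defined as an inner product in a feature space, namely $k(\mathbf{z}, \mathbf{z}') := \phi(\mathbf{z})^\top \phi(\mathbf{z}')$, where we model $\phi(\cdot)$ as a deep neural network. A critical advantage of the explicit feature representation is that we turn the non-parametric GP into a parametric Bayesian model. As a consequence, all inference operations in the non-parametric GP reduce to computationally more efficient parametric ones, avoiding the need to store the Gram matrix of the entire training data set, as well as its inversion.
Formally, we consider $K$ latent functions modeled as $f_j(\mathbf{z}) = \mathbf{w}_j^\top \phi(\mathbf{z})$ with $\mathbf{w}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ independently for $j = 1, \dots, K$. We let $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_K]^\top$. Note that the feature function $\phi(\cdot)$ is shared across classes to reduce the number of parameters and avoid overfitting. The parameters of the deep model that represents $\phi(\cdot)$ serve as GP kernel parameters, since $\mathrm{Cov}(f_j(\mathbf{z}), f_j(\mathbf{z}')) = \mathrm{Cov}(\mathbf{w}_j^\top \phi(\mathbf{z}), \mathbf{w}_j^\top \phi(\mathbf{z}')) = \phi(\mathbf{z})^\top \phi(\mathbf{z}')$. Consequently, the source-driven prior (8) becomes
$$p(\mathbf{W} \mid \mathcal{D}_S) \propto \prod_{j=1}^{K} \mathcal{N}(\mathbf{w}_j; \mathbf{0}, \mathbf{I}) \cdot \prod_{i=1}^{N_S} P\big(y_i^S \mid \mathbf{W} \phi(\mathbf{z}_i^S)\big). \tag{14}$$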
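Since computing (14) is intractable, we introduce a variational density $q(\mathbf{W})$ to approximate it. We assume a fully factorized Gaussian,
$$q(\mathbf{W}) = \prod_{j=1}^{K} \mathcal{N}(\mathbf{w}_j; \mathbf{m}_j, \mathbf{S}_j), \tag{15}$$
where $\mathbf{m}_j \in \mathbb{R}^d$ and $\mathbf{S}_j \in \mathbb{R}^{d \times d}$ constitute the variational parameters. We further let $\mathbf{S}_j$'s be diagonal matrices. To have $q(\mathbf{W}) \approx p(\mathbf{W} \mid \mathcal{D}_S)$, we use the fact that the marginal log-likelihood can be lower bounded:
$$\log P\big(\{y_i^S\}_{i=1}^{N_S} \mid \{\mathbf{z}_i^S\}_{i=1}^{N_S}, \phi(\cdot)\big) \geq \mathrm{ELBO}, \tag{16}$$
where the evidence lower-bound (ELBO) is defined as:
$$\mathrm{ELBO} := \sum_{i=1}^{N_S} \mathbb{E}_{q(\mathbf{W})}\big[\log P\big(y_i^S \mid \mathbf{W} \phi(\mathbf{z}_i^S)\big)\big] - \sum_{j=1}^{K} \mathrm{KL}\big(q(\mathbf{w}_j) \,\|\, \mathcal{N}(\mathbf{w}_j; \mathbf{0}, \mathbf{I})\big), \tag{17}$$
with the likelihood stemming from (7). As the gap in (16) is the KL divergence between $q(\mathbf{W})$ and the true posterior $p(\mathbf{W} \mid \mathcal{D}_S)$, increasing the ELBO wrt the variational parameters $\{\mathbf{m}_j, \mathbf{S}_j\}$ brings $q(\mathbf{W})$ closer to the true posterior. Raising the ELBO wrt the GP kernel parameters (i.e., the parameters of $\phi$) and the embedding $G$ (note that the inputs $\mathbf{z}$ also depend on $G$) can potentially improve the marginal likelihood (i.e., the left hand side in (16)).
In optimizing the ELBO (17), the KL term (denoted by KL) can be analytically derived as
$$\mathrm{KL} = \frac{1}{2} \sum_{j=1}^{K} \Big( \mathrm{Tr}(\mathbf{S}_j) + \|\mathbf{m}_j\|_2^2 - \log\det(\mathbf{S}_j) - d \Big). \tag{18}$$
However, there are two key challenges: the log-likelihood expectation over $q(\mathbf{W})$ does not admit a closed form, and one has to deal with large $N_S$. To address the former, we adopt Monte-Carlo estimation using $M$ iid samples $\{\mathbf{W}^{(m)}\}_{m=1}^{M}$ from $q(\mathbf{W})$, where the samples are expressed in terms of the variational parameters (i.e., the reparametrization trick) to facilitate optimization. That is, for each $j$ and $m$,
$$\mathbf{w}_j^{(m)} = \mathbf{m}_j + \mathbf{S}_j^{1/2} \boldsymbol{\epsilon}_j^{(m)}, \qquad \boldsymbol{\epsilon}_j^{(m)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \tag{19}$$
For the latter issue, we use stochastic optimization with a random mini-batch $B_S \subset \mathcal{D}_S$. That is, we optimize the sample estimate of the log-likelihood defined as:
$$\mathrm{LL} = \frac{1}{M} \sum_{m=1}^{M} \frac{N_S}{|B_S|} \sum_{i \in B_S} \log P\big(y_i^S \mid \mathbf{W}^{(m)} \phi(\mathbf{z}_i^S)\big). \tag{20}$$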
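The three ingredients of this section (closed-form KL term, reparametrized sampling, and a mini-batch Monte Carlo estimate of the log-likelihood) can be sketched in plain Python for a diagonal Gaussian variational density. This is a toy sketch under stated assumptions: the features are fixed two-dimensional vectors rather than outputs of a deep network, all function names are hypothetical, and no gradients are computed.

```python
import math
import random
from typing import List


def kl_diag_gaussian_to_standard(mean: List[float], var: List[float]) -> float:
    """KL( N(mean, diag(var)) || N(0, I) ) in closed form."""
    return 0.5 * sum(v + m * m - math.log(v) - 1.0 for m, v in zip(mean, var))


def sample_weights(mean: List[float], var: List[float], rng: random.Random) -> List[float]:
    """Reparametrized draw: w = mean + sqrt(var) * eps with eps ~ N(0, I)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, var)]


def log_softmax_likelihood(logits: List[float], label: int) -> float:
    """Log of the softmax probability assigned to `label`."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(l - shift) for l in logits))
    return logits[label] - log_norm


def mc_log_likelihood(means, variances, features, labels,
                      num_total: int, num_samples: int, rng: random.Random) -> float:
    """Mini-batch Monte Carlo estimate of the expected log-likelihood,
    rescaled by num_total / batch_size to cover the full source set."""
    scale = num_total / len(features)
    total = 0.0
    for _ in range(num_samples):
        weights = [sample_weights(m, v, rng) for m, v in zip(means, variances)]
        for phi, y in zip(features, labels):
            logits = [sum(w_d * p_d for w_d, p_d in zip(w, phi)) for w in weights]
            total += scale * log_softmax_likelihood(logits, y)
    return total / num_samples


rng = random.Random(0)
means = [[1.0, 0.0], [0.0, 1.0]]        # variational means, one row per class
variances = [[0.1, 0.1], [0.1, 0.1]]    # diagonal variational variances
features = [[2.0, 0.0], [0.0, 2.0]]     # hypothetical feature vectors phi(z)
labels = [0, 1]
ll = mc_log_likelihood(means, variances, features, labels,
                       num_total=100, num_samples=50, rng=rng)
kl = sum(kl_diag_gaussian_to_standard(m, v) for m, v in zip(means, variances))
negative_elbo = kl - ll
```

In a real implementation the same quantities would be computed with an automatic differentiation library, so that the negative ELBO can be minimized with respect to the variational parameters and the network weights.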
3.3 Optimization Strategy
Now we combine the maximum posterior separation criterion in (13) with the variational inference discussed in the previous section to arrive at the comprehensive optimization task.
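With $q(\mathbf{W})$ fixed, we rewrite our posterior maximum separation loss in (13) as follows. We consider stochastic optimization with a random mini-batch $B_T \subset \mathcal{D}_T$ sampled from the target domain data.
$$\mathrm{MS} := \frac{1}{|B_T|} \sum_{i \in B_T} \Big( \max_{j \neq j^*} \mu_j(\mathbf{z}_i) - \max_{1 \leq j \leq K} \mu_j(\mathbf{z}_i) + 1 + \alpha \max_{1 \leq j \leq K} \sigma_j(\mathbf{z}_i) \Big)_+,$$
where $\mu_j(\mathbf{z}) = \mathbf{m}_j^\top \phi(\mathbf{z})$ and $\sigma_j(\mathbf{z}) = \big(\phi(\mathbf{z})^\top \mathbf{S}_j \phi(\mathbf{z})\big)^{1/2}$ are the mean and the standard deviation of $f_j(\mathbf{z})$ under $q(\mathbf{W})$.
Combining all objectives thus far, our algorithm can be summarized as the following two optimizations alternating with each other:
$$\min_{\{\mathbf{m}_j, \mathbf{S}_j\}} \; -\mathrm{LL} + \mathrm{KL} \qquad \text{(variational inference)}$$
$$\min_{G, \phi} \; -\mathrm{LL} + \mathrm{KL} + \lambda \cdot \mathrm{MS} \qquad \text{(model selection)}$$
Here LL is the mini-batch estimate of the expected log-likelihood and KL is the KL term of the ELBO from Sec. 3.2, and $\lambda$ is the impact of the maximum separation loss. From an algorithmic point of view, our algorithm can be viewed as a max-margin Gaussian process classifier on the original input space $\mathcal{X}$ without explicitly considering a shared space $\mathcal{Z}$; for further details about this connection, the reader is encouraged to refer to the Supplementary Material.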
4 Related Work
There has been extensive prior work on domain adaptation . Recent approaches have focused on transferring deep neural network representations from a labeled source dataset to an unlabeled target domain by matching the distributions of features between different domains, aiming to extract domain-invariant features [46, 4, 8, 39, 48, 63, 61, 6, 36, 47]. To this end, it is critical to first define a measure of distance (divergence) between source and target distributions. One popular measure is the non-parametric Maximum Mean Discrepancy (MMD) (adopted by [6, 62, 34]), which measures the distance between the sample means of the two domains in the reproducing Kernel Hilbert Space (RKHS) induced by a pre-specified kernel. The deep Correlation Alignment (CORAL) method  attempted to match the sample mean and covariance of the source/target distributions, while it was further generalized to potentially infinite-dimensional feature spaces in  to effectively align the RKHS covariance matrices (descriptors) across domains.
The Deep Adaptation Network (DAN) applied MMD to layers embedded in an RKHS to match higher order moments of the two distributions more effectively. The Deep Transfer Network (DTN) achieved alignment of source and target distributions using two types of network layers based on the MMD distance: the shared feature extraction layer that can learn a subspace that matches the marginal distributions of the source and the target samples, and the discrimination layer that can match the conditional distributions by classifier transduction.
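For readers unfamiliar with MMD, the sketch below computes the standard biased estimate of the squared MMD between two small sample sets using an RBF kernel. It is a generic textbook form rather than the specific variant used by DAN or DTN; the kernel bandwidth, the sample values, and the helper names are assumptions made for illustration.

```python
import math
from typing import List


def rbf_kernel(x: List[float], y: List[float], bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source: List[List[float]], target: List[List[float]],
                bandwidth: float = 1.0) -> float:
    """Biased estimate of the squared MMD: squared distance between the kernel
    mean embeddings of the two sample sets."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source) + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Hypothetical 2-D features from a source domain and a shifted target domain.
source_feats = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
shifted_feats = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1]]
same = mmd_squared(source_feats, source_feats)
diff = mmd_squared(source_feats, shifted_feats)
```

Identical sample sets give an estimate of zero, and the estimate grows as the two feature distributions drift apart, which is why it can serve as a training penalty for domain alignment.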
Many recent UDA approaches leverage deep neural networks with the adversarial training strategy [46, 4, 8, 39, 48, 63], which allows the learning of feature representations to be simultaneously discriminative for the labeled source domain data and indistinguishable between source and target domains. For instance, the Domain-Adversarial Training of Neural Networks (DANN) technique allows the network to learn domain invariant representations in an adversarial fashion by adding an auxiliary domain classifier and back-propagating inverse gradients. The Adversarial Discriminative Domain Adaptation (ADDA) first learns a discriminative feature subspace using the labeled source samples. Then, it encodes the target data to this subspace using an asymmetric transformation learned through a domain-adversarial loss. The DupGAN is a GAN-like model with duplex discriminators that restrict the latent representation to be domain invariant while preserving its category information.
In parallel, within the shared-latent space framework, an unsupervised image-to-image translation (UNIT) framework based on the Coupled GANs was proposed. Another interesting idea is the pixel-level domain adaptation method (PixelDA), which imposes alignment of distributions not in the feature space but directly in the raw pixel space via adversarial approaches. The intention is to adapt the source samples as if they were drawn from the target domain, while maintaining the original content. Similarly, the CycleGAN has been utilized to constrain the features extracted by the encoder network to reconstruct the images in both domains. A joint adversarial discriminative approach has also been proposed that can transfer the information of the target distribution to the learned embedding using a generator-discriminator pair.
5 Experimental Results
We compare the proposed method with the state of the art on standard benchmark datasets. The digit classification task consists of three datasets, each containing ten digit classes: MNIST, SVHN, and USPS. We also evaluated our method on the traffic sign datasets, Synthetic Traffic Signs (SYN SIGNS) and the German Traffic Signs Recognition Benchmark (GTSRB), which contain 43 types of signs. Finally, we report performance on the VisDA object classification dataset with more than 280K images across twelve categories (the details of the datasets are available in the Supplementary Material). Fig. 2 illustrates image samples from different datasets and domains.
We evaluate the performance of all methods with the classification accuracy score. We used ADAM for training, with the learning rate and the momentum terms set to fixed values. We used batches of the same size from each domain, and the input images were mean-centered. The hyper-parameters are set empirically. The sensitivity w.r.t. the hyper-parameters $\alpha$ and $\lambda$ will be discussed in Sec. 5.3. We also used the same network structures as in prior work. Specifically, we employed the CNN architecture used in prior work for the digit and traffic sign datasets and used a ResNet101 model pre-trained on ImageNet. We added batch normalization to each layer in these models. Quantitative evaluation involves a comparison of the performance of our model to previous works and to "Source Only" models that do not use any domain adaptation. For the "Source Only" baseline, we train models on the unaltered source training data and evaluate on the target test data. The training details for the compared methods are available in our Supplementary Material due to the space limit.
5.1 Results on Digit and Traffic Signs datasets
We show the accuracy of different methods in Tab. 1. It can be seen that the proposed method outperformed competitors in all settings, confirming consistently better generalization of our model over target data. This is partially due to combining DNNs and the GP/Bayesian approach. GPs exploit local generalization by locally interpolating between neighbors, adjusting the target functions rapidly in the presence of training data. DNNs have good generalization capability for unseen input configurations by learning multiple levels of distributed representations. The results demonstrate that GPDA can improve generalization performance by adopting both of these advantages.
5.2 Results on VisDA dataset
Results for this experiment are summarized in Tab. 2. We observe that our GPDA achieved, on average, the best performance compared to other competing methods. Due to the vastly varying difficulty of classifying different categories of objects, in addition to reporting the average classification accuracy we also report the average rank of each method over all objects (the lower the rank, the better). The higher performance of GPDA compared to other methods is mainly attributed to modeling the classifier as a random function and consequently incorporating the classifier uncertainty (variance of the prediction) into the proposed loss function, Eq. 28. The image structure for this dataset is more complex than that of digits, yet our method exhibits very strong performance even under such challenging conditions.
Another key observation is that some of the competing methods (e.g., MMD, DANN) perform worse than the source-only model in classes such as car and plant, while GPDA and MCDA performed better across all classes, which clearly demonstrates the effectiveness of the MCD principle.
indicates the method used a few labeled target samples as validation, different from our GPDA setting. We repeated each experiment five times and report the average and the standard deviation of the accuracy. The accuracy for MCDA was obtained from classifier. is MCDA’s hyper-parameter, which denotes the number of times the feature generator is updated to mimic classifiers. MNIST and USPS denote all the training samples were used to train the models.
5.3 Ablation Studies
Two complementary studies are conducted to investigate the impact of the two hyper-parameters $\alpha$ and $\lambda$, controlling the trade-off of the variance of the classifier's posterior distribution and the MCD loss term, respectively. To this end, we conducted additional experiments for the digit datasets to analyze the parameter sensitivity of GPDA w.r.t. $\alpha$ and $\lambda$, with results depicted in Fig. 3(a) and 3(b), respectively. Sensitivity analysis is performed by varying one parameter at a time over a given range, while the other parameters are set to their final values. From Fig. 3(b), we see that when $\lambda = 0$ (no MCD regularization term), the performance drops considerably. As $\lambda$ increases, the performance also increases, demonstrating the benefit of hypothesis consistency (MS term) over the target samples. Indeed, using the proposed learning scheme, we find a representation space in which we embed the knowledge from the target domain into the learned classifier.
Similarly, from Fig. 3(a), we see that when $\alpha = 0$ (no prediction uncertainty) the classification accuracy is lower than in the case where we utilize the prediction uncertainty, $\alpha > 0$. The key observation is that it is more beneficial to make use of the information from the full posterior distribution of the classifier during the learning process than to treat the classifier as a deterministic function.
5.4 Prediction Uncertainty vs. Prediction Quality
Another advantage of our GPDA model, inherited from Bayesian modeling, is that it provides a quantified measure of prediction uncertainty. In the multi-class setup considered here, this uncertainty amounts to the degree of overlap between the two largest mean posteriors, $p(f_{j^*}(\mathbf{z}))$ and $p(f_{j^\dagger}(\mathbf{z}))$, where $j^*$ and $j^\dagger$ are the indices of the largest and the second largest among the posterior means $\{\mu_j(\mathbf{z})\}_{j=1}^{K}$, respectively (cf. (9)). Intuitively, if the two overlap significantly, our model's decision is less certain, meaning that we anticipate the class prediction may not be trustworthy. On the other hand, if the two are well separated, we expect high prediction quality.
To verify this hypothesis more rigorously, we evaluate the distances between two posteriors (i.e., measure of certainty in prediction) for two different cohorts: correctly classified test target samples by our model and incorrectly predicted ones. More specifically, for the SVHN to MNIST adaptation task, we evaluate the Bhattacharyya distances  for the samples in the two cohorts. In our variational Gaussian approximation (21), the Bhattacharyya distance can be computed in a closed form (See Appendix in supplementary for details).
The histograms of the distances are depicted in Fig. 4 where we contrast the two models, one at an early stage of training and the other after convergence. Our final model in Fig. 4(a) exhibits large distances for most of the samples in the correctly predicted cohort (green), implying well separated posteriors or high certainty. For the incorrectly predicted samples (red), the distances are small implying significant overlap between the two posteriors, i.e., high uncertainty. On the other hand, for the model prior to convergence, Fig. 4(b), the two posteriors overlap strongly (small distances along horizontal axis) for most samples regardless of the correctness of prediction. This confirms that our algorithm enforces posterior separation by large margin during the training process.
This analysis also suggests that the measure of prediction uncertainty provided by our GPDA model can be used as an indicator of prediction quality, namely whether the prediction made by our model is trustworthy or not. To verify this, we depict some sample test images in Fig. 5. We differentiate samples according to their Bhattacharyya distances. When the prediction is uncertain (left panel), we see that the images are indeed difficult examples even for humans. An interesting case is when the prediction certainty is high but the sample is incorrectly classified (lower right panel), where the images look peculiar in the sense that humans are also prone to misclassify them with considerably high certainty.
5.5 Analysis of Shared Space Embedding
We use t-SNE on the VisDA dataset to visualize the feature representations from different classes. Fig. 6 depicts the embedding of the learned features and of the original features. Colors indicate source (red) and target (blue) domains. Notice that GPDA significantly reduces the domain mismatch, resulting in the expected tight clustering. This is partially due to the use of the proposed probabilistic MCD approach, which shrinks the classifier hypothesis class to contain only consistent classifiers on target samples while exploiting the uncertainty in the prediction.
6 Conclusion
We proposed a novel probabilistic approach for UDA that learns an efficient domain-adaptive classifier with strong generalization to target domains. The key to the proposed approach is to model the classifier's hypothesis space in Bayesian fashion and impose consistency over the target samples in their space by constraining the classifier's posterior distribution. To tackle the intractability of computing the exact posteriors, we combined the variational Bayesian method with a deep kernel technique to efficiently approximate the classifier's posterior distribution. We showed on three challenging benchmark datasets for image classification that the proposed method outperforms the current state of the art in unsupervised domain adaptation of visual categories.
Appendix A Overview
In this Supplement, we present additional analyses highlighting the ability of our model, GPDA, to leverage its inherent measure of uncertainty both to produce increasingly accurate predictions and to provide a measure of its own trustworthiness. These new results are summarized in Appendix B. Appendix C provides further analysis showing the key connection between GPDA and the max-margin Gaussian Process classification in the original space $\mathcal{X}$, removing the explicit need for a shared space of traditional domain adaptation approaches. We then present specific details of all datasets used in our experiments as well as the particulars of relevant experimental setups in Appendix D. Finally, we provide a brief overview of Gaussian Process models in Appendix E and of another related state-of-the-art domain adaptation approach, the MCDA, in Appendix F.
Appendix B Additional Analyses: Prediction Uncertainty vs. Prediction Quality
A key benefit of our GPDA algorithm, inherited from Bayesian modeling, is that it provides a quantified measure of prediction uncertainty. In the multi-class setup, for an input $\mathbf{x}$ we measure the uncertainty as the degree of overlap between the two largest mean posteriors, $p(f_{j^*}(\mathbf{z}))$ and $p(f_{j^\dagger}(\mathbf{z}))$, where $\mathbf{z} = G(\mathbf{x})$, and $j^*$ and $j^\dagger$ are the indices of the largest and the second largest among the posterior means $\{\mu_j(\mathbf{z})\}_{j=1}^{K}$, respectively,
$$j^* = \arg\max_{1 \leq j \leq K} \mu_j(\mathbf{z}), \qquad j^\dagger = \arg\max_{j \neq j^*} \mu_j(\mathbf{z}).$$
If the two overlap significantly, our model's decision is less certain, signifying that we anticipate the class prediction not to be trustworthy. On the other hand, if the two are well separated, we expect high prediction quality.
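Bhattacharyya distance. In the main paper (Sec. 5.4 and Fig. 4), we have verified this hypothesis by evaluating the Bhattacharyya distances (BD) between two posteriors (i.e., a measure of certainty in prediction) for two different cohorts: correctly classified test target samples by our model and incorrectly predicted ones, for the SVHN to MNIST adaptation task. Since we use a variational Gaussian approximation of the posteriors, $p(f_j(\mathbf{z})) = \mathcal{N}(\mu_j(\mathbf{z}), \sigma_j(\mathbf{z})^2)$, where $\mu_j$ and $\sigma_j$ are determined by Eq. (22) in the main paper, the Bhattacharyya distance can be computed in closed form:
$$\mathrm{BD} = \frac{1}{4} \frac{(\mu_{j^*} - \mu_{j^\dagger})^2}{\sigma_{j^*}^2 + \sigma_{j^\dagger}^2} + \frac{1}{2} \log \frac{\sigma_{j^*}^2 + \sigma_{j^\dagger}^2}{2 \sigma_{j^*} \sigma_{j^\dagger}}.$$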
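The closed form of the Bhattacharyya distance between two univariate Gaussians is standard and can be sketched in a few lines. The function name and the example means and standard deviations are hypothetical; the two calls mimic a heavily overlapping and a well-separated pair of posteriors.

```python
import math


def bhattacharyya_distance(mu1: float, sigma1: float,
                           mu2: float, sigma2: float) -> float:
    """Bhattacharyya distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    if sigma1 <= 0.0 or sigma2 <= 0.0:
        raise ValueError("standard deviations must be positive")
    var_sum = sigma1 ** 2 + sigma2 ** 2
    mean_term = 0.25 * (mu1 - mu2) ** 2 / var_sum
    var_term = 0.5 * math.log(var_sum / (2.0 * sigma1 * sigma2))
    return mean_term + var_term


# Hypothetical top-two posteriors: strong overlap vs. clear separation.
overlapping_bd = bhattacharyya_distance(1.0, 1.0, 0.8, 1.0)
separated_bd = bhattacharyya_distance(5.0, 0.5, 0.0, 0.5)
```

A small distance signals that the two largest posteriors overlap (an uncertain prediction), while a large distance signals a confident prediction.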
Bayes Optimal Error Rate. An alternative metric to measure the prediction uncertainty, perhaps more principled in the Bayesian sense, is the Bayes optimal error rate between the two largest mean posteriors, which can be computed as
where is the CDF of and is the Bayes optimal decision threshold, .
The interpretation is: the smaller the Bayes error rate, the more certain our prediction is, and vice versa. We depict the histograms of the Bayes error rates for two cohorts in Fig. 7.
As shown, the conclusion is very similar to our earlier analysis based on Bhattacharyya distances: Our final model in Fig. 7(a) exhibits low error rates for most of the samples in the correctly predicted cohort (green), implying well separated posteriors or high certainty of prediction. For the incorrectly predicted samples (red), the error rates are mostly high implying significant overlap between the two posteriors, i.e., high uncertainty of prediction.
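As a rough numerical counterpart to the Bayes error analysis above, the sketch below computes the equal-prior Bayes error between two univariate Gaussians by integrating the pointwise minimum of the two densities with the trapezoidal rule. This is an assumption of the sketch and may differ from the closed-form threshold expression used for the figures; the function names, integration range, and step count are hypothetical choices.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def bayes_error_rate(mu1: float, sigma1: float, mu2: float, sigma2: float,
                     num_steps: int = 20000) -> float:
    """Equal-prior Bayes error: 0.5 * integral of min(p1, p2), trapezoidal rule."""
    lo = min(mu1 - 8.0 * sigma1, mu2 - 8.0 * sigma2)
    hi = max(mu1 + 8.0 * sigma1, mu2 + 8.0 * sigma2)
    step = (hi - lo) / num_steps
    total = 0.0
    for i in range(num_steps + 1):
        x = lo + i * step
        weight = 0.5 if i in (0, num_steps) else 1.0  # trapezoid end-point weights
        total += weight * min(normal_pdf(x, mu1, sigma1), normal_pdf(x, mu2, sigma2))
    return 0.5 * total * step


# Hypothetical top-two posteriors: identical, moderately apart, far apart.
identical_err = bayes_error_rate(0.0, 1.0, 0.0, 1.0)
moderate_err = bayes_error_rate(1.0, 1.0, 0.0, 1.0)
separated_err = bayes_error_rate(5.0, 0.5, 0.0, 0.5)
```

The error is one half when the two posteriors coincide and approaches zero as they separate, which matches the interpretation that a smaller Bayes error means a more certain prediction.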
Uncertainty in GPDA vs MCDA. Lastly, to demonstrate that it is the unique property of our GPDA model that the uncertainty measure can be used to credibly gauge the quality of prediction at test time, we contrast our model with other non-Bayesian approaches. Specifically, we consider MCDA, as the second-best competing method. The MCDA is a non-Bayesian method that yields point estimate class prediction, namely . By point estimate, we mean that the MCDA prediction places all its probability mass on a single (softmax) probability (score) value for each class , unable to provide a degree of uncertainty in its prediction (e.g., in our GPDA model).
However, one can define a heuristic notion of uncertainty for the MCDA by measuring how distant the two largest score predictions are from each other. More specifically, we compute the following quantity, dubbed Bhattacharyya pseudo distance (BPD), as a measure of uncertainty in the MCDA:
where and are the indices of the largest and the second largest among the scores , respectively. Note that (25) is the log-ratio between the largest two class prediction scores. We name it the pseudo distance as it reduces to the Bhattacharyya distance if we form Gaussians with the mean equal to and the same variances for both and .
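As a rough illustration of the quantity described above, the sketch below computes the log-ratio between the two largest class scores. It follows only the verbal description in the text; any scaling constant in Eq. (25) is not reproduced, and the function and variable names are hypothetical.

```python
import numpy as np


def top2_log_ratio(scores):
    """Log-ratio between the largest and second-largest class scores.

    `scores` is a 1-D array of strictly positive (e.g., softmax) scores.
    Returns (log_ratio, j1, j2), where j1 and j2 index the largest and
    second-largest scores. Any constant factor in Eq. (25) is omitted.
    """
    p = np.asarray(scores, dtype=float)
    if p.ndim != 1 or p.size < 2:
        raise ValueError("need a 1-D score vector with at least two classes")
    if np.any(p <= 0):
        raise ValueError("scores must be strictly positive")
    order = np.argsort(p)[::-1]  # indices sorted by descending score
    j1, j2 = int(order[0]), int(order[1])
    return float(np.log(p[j1]) - np.log(p[j2])), j1, j2


# A confident prediction has a large gap between the top two scores,
# an ambiguous one has a gap close to zero.
gap_confident, _, _ = top2_log_ratio([0.90, 0.05, 0.05])
gap_ambiguous, _, _ = top2_log_ratio([0.45, 0.44, 0.11])
```

A value near zero means the two top classes are nearly tied, which plays the role of a low distance (high uncertainty) in the histograms.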
We depict the histograms of the pseudo distances for MCDA’s two cohorts in Fig. 8(b), where the Bhattacharyya histograms for our GPDA are also shown in Fig. 8(a) for comparison. Unlike the more clear separation attained in our GPDA model, the MCDA exhibits two issues: i) For the correctly predicted samples (green), a considerable number of points have large overlap between and (i.e., low BPDs).
ii) For the incorrectly predicted samples (red), the number of cases where the two largest scores are relatively well separated (e.g., those with , namely, certain predictions) exceeds that of our GPDA model, suggesting higher prediction uncertainty.
This signifies the unique benefit of our Bayesian domain adaptation approach, that is, the capability to utilize the prediction uncertainty as a gauge of prediction quality.
Figure 8: (GPDA vs. MCDA) Histograms of Bhattacharyya distances between two largest mean posteriors (prediction certainty) for (a) GPDA and (b) MCDA. The X-axis is the Bhattacharyya distance, an indication of prediction certainty; the higher the distance, the more certain the prediction is. For the non-Bayesian point-estimate-based MCDA, we compute the Bhattacharyya pseudo distance instead, as described in the text. Qualitatively, our GPDA model exhibits stronger correlation (histograms less overlapped) between prediction uncertainty (horizontal axis) and prediction correctness (color).
GPDA vs. MCDA – Hard vs. Easy Instances. As a counterpart to Fig. 5 in the main paper, we also depict in Fig. 9(b) some sample target test images that are correctly/incorrectly predicted by the MCDA with low/high certainty according to the BPD. For ease of comparison, we also show the samples for our GPDA from Fig. 5 of the main paper in Fig. 9(a). Unlike the GPDA, the uncertainty estimates produced by the MCDA show less agreement with human assessment. Images whose BPDs are low (i.e., predictions judged uncertain by the MCDA, shown in the left panel of Fig. 9(b)) appear to be visually easy for a human to classify, with no ambiguity, with possible exceptions in a few cases: e.g., the last example in the correct/low quadrant may look like "five", while the first example in the incorrect/low quadrant may be interpreted as "four". Furthermore, the sample images in the incorrect/high quadrant of Fig. 9(b), i.e., those predicted by the MCDA with high certainty but misclassified, are relatively easy for a human to classify, other than the second example, which may be confused with "one".
This empirical analysis verifies that the measure of prediction uncertainty provided by our GPDA model can be used as a more accurate indicator of prediction quality than that implied by the MCDA, our top competitor. That is, our model’s quantitative uncertainty measure can determine, with high precision, whether the prediction made by the model is trustworthy or not.
Appendix C A Remark on Proposed GPDA Algorithm
In this section we discuss the strong connection between the GPDA algorithm and the max-margin confident prediction (or the entropy minimization) framework in classical semi-supervised learning [20, 67]. More specifically, we show that our GPDA algorithm, from an algorithmic point of view, can be viewed as a max-margin Gaussian process classifier on the original input space without explicitly considering a shared space .
Recall that the GPDA algorithm can be summarized as the following two alternating optimizations:
where the key terms in these objectives are defined as follows:
Note that . Although we have built a GP classification model on top of the shared space , leading to the algorithm above, in our learning objective terms (26–28), the deep kernel feature mapping and the embedding function always appear together in the composite form .
Thus, our approach is functionally equivalent to building a GP classification model on top of the original space, where the explicit feature mapping is . More formally, our classifier can be written as , a function of . Consequently, our approach can be viewed as a max-margin Gaussian process classifier, without explicitly considering the shared space, where we push the posterior inferred from the source domain data to meet the large margin criterion on the (unlabeled) target domain data. This is clearly in line with entropy minimization or max-margin confident prediction principles in classical semi-supervised learning [20, 67].
Appendix D Details of Datasets and Experimental Setups
We now present additional details of experiments on the three datasets used in the main paper. For all experiments, we set , the number of posterior samples from the variational density (see Sec. 3.2 in the main paper for more details).
D.1 Digit and Traffic Signs Datasets
We followed the experimental setup used in  in the following three adaptation scenarios. For this experiment, we compare our GPDA model with various state-of-the-art unsupervised domain adaptation approaches, namely MMD, DANN, DSN, ADDA, and MCDA.
SVHN→MNIST. In this adaptation scenario, we used the standard training set as our training samples and the standard testing set as our testing samples, for both the source and the target domains.
SYN SIGNS→GTSRB. Following MCDA, we randomly selected 31,367 samples for the target training set and evaluated the accuracy on the remaining samples.
MNIST↔USPS. For this experiment, we followed the protocols used in ADDA and PixelDA. ADDA provides the setting where only a part of the training samples is utilized during training: 2,000 training samples are drawn from MNIST and 1,800 from USPS. PixelDA allows one to utilize all of the standard training samples during learning.
D.2 VisDA Dataset
We used the VisDA dataset to evaluate adaptation from synthetic to real-object images. The dataset is an instance of cross-domain object classification, with over 280K images across 12 categories in the combined training, validation, and testing domains. The source images, 152,397 synthetic images, were generated by rendering 3D models of the same object categories as in the real data from different angles and under different lighting conditions. The validation set of 55,388 images was collected from MSCOCO. In our experiment, we considered the images of the validation split as the target domain and trained models in the unsupervised domain adaptation setting. We evaluate the performance of the ResNet101 model pre-trained on ImageNet. For this experiment, we compare our model with MMD, DANN, and MCDA.
Appendix E Background – Gaussian Process
A Gaussian Process (GP) is an infinite collection of random variables , such that any finite number of samples have a joint Gaussian distribution. A GP is fully specified by the mean function and the covariance function , typically user-defined. GPs can also be interpreted as a distribution over functions such that any finite collection of function values has a joint Gaussian distribution:
where is the vector and is the covariance matrix with .
A training dataset consists of pairs of data , where are noisy observations of some latent function with Gaussian noise , . The likelihood of the data and the prior give the joint probability model , where denotes the noisy targets and denotes the vector of underlying latent function values. The predictive distribution at a set of test points is given in closed form using the properties of conditional Gaussians,
where denotes the covariance matrix evaluated among the test inputs and denotes the covariance matrix evaluated between the test points and the training set . If there are test points, the covariance matrix is of size and is of size .
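The closed-form predictive distribution described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes a zero mean function and an RBF covariance function (the text leaves the kernel user-defined), and all names (`rbf_kernel`, `gp_predict`, `noise_var`) are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """RBF covariance between rows of a (n, d) and rows of b (m, d)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return signal_var * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_predict(x_train, y_train, x_test, noise_var=1e-2):
    """Predictive mean and covariance of a zero-mean GP regressor.

    mean = K(X*, X) [K(X, X) + noise_var I]^{-1} y
    cov  = K(X*, X*) - K(X*, X) [K(X, X) + noise_var I]^{-1} K(X, X*)
    """
    k_xx = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_sx = rbf_kernel(x_test, x_train)   # test-by-train covariance
    k_ss = rbf_kernel(x_test, x_test)    # test-by-test covariance
    chol = np.linalg.cholesky(k_xx)      # stable alternative to a direct inverse
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_sx.T)
    mean = k_sx @ alpha
    cov = k_ss - v.T @ v
    return mean, cov


# Usage: three noisy observations of a latent function, five test points.
x_tr = np.array([[0.0], [1.0], [2.0]])
y_tr = np.sin(x_tr[:, 0])
x_te = np.linspace(0.0, 2.0, 5)[:, None]
pred_mean, pred_cov = gp_predict(x_tr, y_tr, x_te)
```

With five test points and three training points, the test covariance is 5 by 5 and the cross-covariance is 5 by 3, matching the sizes discussed above.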
E.1 Gaussian Process Classification
In Gaussian Process Classification (GPC), the target values are discrete class labels, hence it is not appropriate to model them via a multivariate Gaussian density. Instead, we use the Gaussian process as a latent function whose sign determines the class label for binary classification; for multi-class classification one can use multiple GPs or a multivariate GP.
The key difference between GP regression and GPC is how the output data, , are connected to the underlying function values, . Precisely, they are no longer connected via a simple noise process as in the previous section; the outputs are now discrete: for example, in the binary classification framework, say for one class and for the other. In this case, one could try fitting a GP that produces an output of for some values of and for others, simulating the discrete nature of the problem. Then, the classification of a new data point involves two steps:
i) Evaluate a ‘latent function’ which models qualitatively how the likelihood of one class versus the other changes over the axis. This is the usual GP.
ii) Squeeze the output of this latent function onto using the logistic function, .
Writing these two steps schematically,
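The two steps can be illustrated with a small sketch. It assumes the latent GP has already produced a Gaussian predictive mean and variance at the test point, and it averages the logistic output over Monte Carlo samples of the latent value; the sampling scheme and all names are our assumptions, not something specified in the text.

```python
import numpy as np


def logistic(f):
    """Logistic function squeezing a real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-f))


def class_probability(latent_mean, latent_var, n_samples=10000, seed=0):
    """Probability of the positive class at one test point.

    Step i):  draw latent function values f ~ N(latent_mean, latent_var).
    Step ii): squeeze each draw through the logistic function and average.
    """
    rng = np.random.default_rng(seed)
    f = latent_mean + np.sqrt(latent_var) * rng.standard_normal(n_samples)
    return float(np.mean(logistic(f)))


# A latent mean of zero gives a probability near one half; a clearly
# positive latent mean gives a probability above one half.
p_neutral = class_probability(0.0, 1.0)
p_positive = class_probability(2.0, 1.0)
```

Larger latent variance pulls the averaged probability toward one half, which is one way the latent uncertainty shows up in the class prediction.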
Appendix F Background – MCDA 
For multi-class classification, the MCDA adopts classifier networks that output class prediction probabilities, . The discrepancy between and is defined as the expected normalized difference, that is, . The learning algorithm is a coordinate descent optimization alternating among three steps:
(Fix ) , where
Here, stands for the cross entropy (or log) loss, i.e., . All the expectations are approximately estimated on a mini-batch. Optionally, Step-3 can be repeated times (on the same mini-batch) to boost the convergence of the embedding network .
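To make the discrepancy term concrete, the sketch below computes a mini-batch estimate of the discrepancy between two classifiers' softmax outputs. It assumes the mean absolute difference averaged over classes and samples, as in the original maximum classifier discrepancy formulation; that reading of "expected normalized difference" is our assumption, and the names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax of a (batch, classes) array of logits."""
    z = logits - np.max(logits, axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)


def classifier_discrepancy(probs_a, probs_b):
    """Mini-batch estimate of the discrepancy between two classifiers.

    Mean absolute difference between the two (batch, classes) probability
    arrays, averaged over both classes and samples (assumed definition).
    """
    return float(np.mean(np.abs(probs_a - probs_b)))


# Usage on a random mini-batch of 4 samples and 10 classes.
rng = np.random.default_rng(0)
probs_1 = softmax(rng.standard_normal((4, 10)))
probs_2 = softmax(rng.standard_normal((4, 10)))
disc = classifier_discrepancy(probs_1, probs_2)
```

In the alternating scheme, the classifiers are updated to increase this quantity on target samples while the embedding network is updated to decrease it.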
References
-  M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Unsupervised domain adaptation by domain invariant projection. In IEEE International Conference on Computer Vision (ICCV), pages 769–776. IEEE, 2013.
-  S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1–2):151–175, 2010.
-  S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation, 2007. In Advances in Neural Information Processing Systems.
-  S. Benaim and L. Wolf. One-sided unsupervised domain mapping. In Advances in Neural Information Processing Systems (NIPS), pages 752–762, 2017.
-  Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
-  K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 7, 2017.
-  K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems (NIPS), pages 343–351, 2016.
-  N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems (NIPS), pages 3733–3742, 2017.
-  G. Csurka. A comprehensive survey on domain adaptation for visual applications. pages 1–35. Springer, 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
-  K. G. Derpanis. The bhattacharyya measure. Mendeley Computer, 1(4):1990–1992, 2008.
-  A. Dezfouli and E. V. Bonilla. Scalable inference for Gaussian process models with black-box likelihoods, 2015. In Advances in Neural Information Processing Systems.
-  Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. International Conference on Machine Learning (ICML), 2015.
-  Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
-  M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision (ECCV), pages 597–613, 2016.
-  B. Gholami, V. Pavlovic, et al. Punda: Probabilistic unsupervised domain adaptation for knowledge transfer across visual categories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3581–3590, 2017.
-  B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), pages 222–230, 2013.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets, 2014. In Advances in Neural Information Processing Systems.
-  R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In IEEE International Conference on Computer Vision (ICCV), pages 999–1006, 2011.
-  Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization, 2004. In Proc. of Advances in Neural Information Processing Systems.
-  A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723–773, 2012.
-  P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers. Associative domain adaptation. In International Conference on Computer Vision (ICCV), volume 2, page 6, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  L. Hu, M. Kan, S. Shan, and X. Chen. Duplex generative adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1498–1507, 2018.
-  W. Huang, D. Zhao, F. Sun, H. Liu, and E. Chang. Scalable Gaussian process regression using deep neural networks. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI).
-  M. Kan, S. Shan, and X. Chen. Bi-shifting auto-encoder for unsupervised domain adaptation. In IEEE International Conference on Computer Vision (ICCV), pages 3846–3854, 2015.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representation (ICLR), 2015.
-  D. P. Kingma and M. Welling. Auto-encoding variational Bayes, 2014. In Proceedings of the Second International Conference on Learning Representations, ICLR.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems (NIPS), pages 700–708, 2017.
-  M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), pages 469–477. Curran Associates, Inc., 2016.
-  M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. International Conference on Machine Learning (ICML), 2015.
-  M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer joint matching for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1410–1417, 2014.
-  L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
-  M. Mancini, L. Porzi, S. R. Bulò, B. Caputo, and E. Ricci. Boosting domain adaptation by discovering latent domains. arXiv preprint arXiv:1805.01386, 2018.
-  T. Ming Harry Hsu, W. Yu Chen, C.-A. Hou, Y.-H. Hubert Tsai, Y.-R. Yeh, and Y.-C. Frank Wang. Unsupervised domain adaptation with imbalanced cross-domain data. In IEEE International Conference on Computer Vision (ICCV), pages 4121–4129, 2015.
-  B. Moiseev, A. Konev, A. Chigorin, and A. Konushin. Evaluation of traffic sign recognition methods trained on synthetically generated data. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 576–583. Springer, 2013.
-  S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto. Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems (NIPS), pages 6673–6683, 2017.
-  Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. arXiv preprint arXiv:1712.00479, 13, 2017.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
-  X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
-  J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
-  A. Rahimi and B. Recht. Random features for large-scale kernel machines, 2008. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T. (eds.), Advances in Neural Information Processing Systems 20.
-  C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
-  S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems (NIPS), pages 506–516, 2017.
-  A. Rozantsev, M. Salzmann, and P. Fua. Residual parameter transfer for deep domain adaptation. In Conference on Computer Vision and Pattern Recognition, 2018.
-  K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. International Conference on Machine Learning (ICML), 2017.
-  K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. Computer Vision and Pattern Recognition, 2018.
-  S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. ArXiv e-prints, abs/1704.01705, 2017.
-  E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs, 2006. In Advances in Neural Information Processing Systems.
-  J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The german traffic sign recognition benchmark: a multi-class classification competition. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 1453–1460. IEEE, 2011.
-  M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems (NIPS), pages 1433–1440, 2008.
-  B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision (ECCV), pages 443–450. Springer, 2016.
-  M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes, 2009. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics.
-  E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
-  E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance, 2014. arXiv:1412.3474.
-  V. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
-  X. Wang, B. Wang, X. Bai, W. Liu, and Z. Tu. Max-margin multiple-instance dictionary learning. In International Conference on Machine Learning, pages 846–854, 2013.
-  A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning, 2016. AI and Statistics (AISTATS).
-  H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  W. Zellinger, T. Grubinger, E. Lughofer, T. Natschläger, and S. Saminger-Platz. Central moment discrepancy (cmd) for domain-invariant representation learning. International Conference on Learning Representation (ICLR), 2017.
-  J. Zhang, W. Li, and P. Ogunbona. Joint geometrical and statistical alignment for visual domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  X. Zhang, F. X. Yu, S.-F. Chang, and S. Wang. Deep transfer network: Unsupervised domain adaptation. arXiv preprint arXiv:1503.00591, 2015.
-  Z. Zhang, M. Wang, Y. Huang, and A. Nehorai. Aligning infinite-dimensional covariance matrices in reproducing kernel hilbert spaces for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3437–3445, 2018.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.
-  X. Zhu and A. B. Goldberg. Introduction to semi-supervised learning. Morgan & Claypool, 2009.