Co-regularized Alignment for Unsupervised Domain Adaptation

11/13/2018 · by Abhishek Kumar, et al. · MIT-IBM

Deep neural networks, trained with large amounts of labeled data, can fail to generalize well when tested on examples from a target domain whose distribution differs from the training data distribution, referred to as the source domain. It can be expensive or even infeasible to obtain the required amount of labeled data in all possible domains. Unsupervised domain adaptation sets out to address this problem, aiming to learn a good predictive model for the target domain using labeled examples from the source domain but only unlabeled examples from the target domain. Domain alignment approaches this problem by matching the source and target feature distributions, and has been used as a key component in many state-of-the-art domain adaptation methods. However, matching the marginal feature distributions does not guarantee that the corresponding class conditional distributions will be aligned across the two domains. We propose co-regularized domain alignment for unsupervised domain adaptation, which constructs multiple diverse feature spaces and aligns source and target distributions in each of them individually, while encouraging the alignments to agree with each other with regard to the class predictions on the unlabeled target examples. The proposed method is generic and can be used to improve any domain adaptation method which uses domain alignment. We instantiate it in the context of a recent state-of-the-art method and observe that it provides significant performance improvements on several domain adaptation benchmarks.






1 Introduction

Deep learning has shown impressive performance improvements on a wide variety of tasks. These remarkable gains often rely on access to large amounts of labeled examples for the concepts of interest. However, a predictive model trained on a certain distribution of data (referred to as the source domain) can fail to generalize when faced with observations pertaining to the same concepts but from a different distribution (referred to as the target domain). This problem of mismatch in training and test data distributions is commonly referred to as domain or covariate shift Shimodaira (2000). The goal in domain adaptation is to address this mismatch and obtain a model that generalizes well on the target domain with limited or no labeled examples from the target domain. Domain adaptation finds applications in many practical scenarios, including the special case when the source domain consists of simulated or synthetic data (for which labels are readily available from the simulator) and the target domain consists of real world observations Vazquez et al. (2014); Sun and Saenko (2014); Bousmalis et al. (2017a).

We consider the problem of unsupervised domain adaptation where the learner has access to only unlabeled examples from the target domain. The goal is to learn a good predictive model for the target domain using labeled source examples and unlabeled target examples. Domain alignment Fernando et al. (2013); Ganin and Lempitsky (2015) approaches this problem by extracting features that are invariant to the domain but preserve the discriminative information required for prediction. Domain alignment has been used as a crucial ingredient in numerous existing domain adaptation methods Ghifary et al. (2014); Long et al. (2015); Tzeng et al. (2015); Sun and Saenko (2016); Bousmalis et al. (2016); Ganin et al. (2016); Yan et al. (2017); Tzeng et al. (2017); Shu et al. (2018). The core idea is to align the distributions of points (in the feature space) belonging to the same concept class across the two domains (i.e., aligning g#P_s(·|y) and g#P_t(·|y) for every class y, where g is a measurable feature generator and g#P denotes the push-forward of a distribution P under g), and the prediction performance in the target domain directly depends on the correctness of this alignment. However, the right alignment of class conditional distributions can be challenging to achieve without access to any labels in the target domain. Indeed, there is still a significant gap between the performance of unsupervised domain adapted classifiers obtained with existing methods and a fully-supervised target classifier, especially when the discrepancy between the source and target domains is high. (Heavily-tuned manual data augmentation can be used to bring the two domains closer in the observed space French et al. (2018), but to be successful it requires the augmentation to be tuned individually for every domain pair.)

In this work, we propose an approach to improve the alignment of class conditional feature distributions of source and target domains for unsupervised domain adaptation. Our approach works by constructing two (or possibly more) diverse feature embeddings for the source domain examples and aligning the target domain feature distribution to each of them individually. We co-regularize the multiple alignments by making them agree with each other with regard to the class prediction, which helps in reducing the search space of possible alignments while still keeping the correct set of alignments under consideration. The proposed method is generic and can be used to improve any domain adaptation method that uses domain alignment as an ingredient. We evaluate our approach on commonly used benchmark domain adaptation tasks such as digit recognition (MNIST, MNIST-M, SVHN, Synthetic Digits) and object recognition (CIFAR-10, STL), and observe significant improvements over state-of-the-art performance on these benchmarks.

2 Formulation

We first provide a brief background on domain alignment, highlighting the challenges involved in using it for unsupervised domain adaptation.

2.1 Domain Alignment

Figure 1: Example scenarios for domain alignment between source (green) and target (blue). The solid boundary denotes one class and the dashed boundary denotes the other class. (a) g#P_s and g#P_t are not aligned but d_HΔH is zero for H (a hypothesis class of linear separators) given by the shaded orange region; (b) the marginal distributions g#P_s and g#P_t are aligned reasonably well but the best achievable joint error λ is high; (c) the marginal distributions g#P_s and g#P_t are aligned reasonably well and λ is small.

The idea of aligning source and target distributions for domain adaptation can be motivated from the following result by Ben-David et al. (2010):

Theorem 1 (Ben-David et al. (2010))

Let H be the common hypothesis class for source and target. The expected error on the target domain of any h ∈ H is upper bounded as

ε_t(h) ≤ ε_s(h) + (1/2) d_HΔH(P_s, P_t) + λ,

where d_HΔH(P_s, P_t) = 2 sup_{h, h′ ∈ H} | Pr_{x∼P_s}[h(x) ≠ h′(x)] − Pr_{x∼P_t}[h(x) ≠ h′(x)] |,
λ = min_{h ∈ H} [ε_s(h) + ε_t(h)], and ε_s(h) is the expected error of h on the source domain.

Let g_s and g_t be the feature generators for source and target examples, respectively. We assume g_s = g_t = g for simplicity, but the following discussion also holds for different g_s and g_t. Let g#P_s be the push-forward distribution of the source distribution P_s induced by g (similarly for g#P_t). Let H be a class of hypotheses defined over the feature space. It should be noted that alignment of the distributions g#P_s and g#P_t is not a necessary condition for d_HΔH to vanish, and there may exist feature generators and hypothesis classes for which d_HΔH is zero without g#P_s and g#P_t being well aligned (Fig. 1(a)). However, for unaligned g#P_s and g#P_t, it is difficult to choose an appropriate hypothesis class H with small d_HΔH and small λ without access to labeled target data.

On the other hand, if the source feature distribution g#P_s and the target feature distribution g#P_t are aligned well, it is easy to see that the HΔH-distance will vanish for any space of sufficiently smooth hypotheses. A small HΔH-distance alone does not guarantee a small expected error on the target domain (Fig. 1(b)): it is also required that there exists a hypothesis with low expected error on both source and target domains (i.e., a small λ). For well-aligned marginal feature distributions, having a low λ requires that the corresponding class conditional distributions g#P_s(·|y) and g#P_t(·|y) be aligned for all classes y (Fig. 1(c)). However, directly pursuing the alignment of the class conditional distributions is not possible, as we do not have access to target labels in unsupervised domain adaptation. Hence most unsupervised domain adaptation methods optimize for alignment of the marginal distributions g#P_s and g#P_t, hoping that the corresponding class conditional distributions will get aligned as a result.

There is a large body of work on distribution alignment which becomes readily applicable here. The goal is to find a feature generator g (or a pair of feature generators g_s and g_t) such that g#P_s and g#P_t are close. Methods based on minimizing various distances between the two distributions (e.g., maximum mean discrepancy Ghifary et al. (2014); Yan et al. (2017), suitable divergences and their approximations Ganin and Lempitsky (2015); Bousmalis et al. (2016); Shu et al. (2018)) or matching the moments of the two distributions Tzeng et al. (2015); Sun and Saenko (2016) have been proposed for unsupervised domain adaptation.
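As a concrete illustration of the distance-minimization family, the following numpy sketch estimates the (biased) squared maximum mean discrepancy between two samples with a Gaussian kernel; the bandwidth and the toy data are our own illustrative choices, not taken from any of the cited methods:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian kernel matrix between rows of x and rows of y.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased estimator of the squared maximum mean discrepancy:
    # the squared RKHS distance between empirical mean embeddings.
    return (gaussian_kernel(x, x, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
shifted = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
# `same` is close to zero; `shifted` is clearly positive.
```

In an alignment method, an estimate of this kind (computed on minibatches of source and target features) would be minimized with respect to the feature generator's parameters.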

2.2 Co-regularized Domain Alignment

The idea of co-regularization has been successfully used in semi-supervised learning Sindhwani et al. (2005); Sridharan and Kakade (2008); Rosenberg and Bartlett (2007); Sindhwani and Rosenberg (2008) for reducing the size of the hypothesis class. It works by learning two predictors in two hypothesis classes F_1 and F_2, respectively, while penalizing the disagreement between their predictions on the unlabeled examples. This intuitively results in shrinking the search space by ruling out predictors from F_1 that do not have an agreeing predictor in F_2 (and vice versa) Sindhwani and Rosenberg (2008). When F_1 and F_2 are reproducing kernel Hilbert spaces, the co-regularized hypothesis class has been formally shown to have a reduced Rademacher complexity, by an amount that depends on a certain data-dependent distance between the two views Rosenberg and Bartlett (2007). This results in improved generalization bounds relative to the best predictor in the co-regularized class (reduced variance). (Sridharan and Kakade (2008) show that the bias introduced by co-regularization is small when each view carries sufficient information about the label on its own, i.e., the mutual information between the label and either view conditioned on the other view is small, and that generalization bounds relative to the Bayes optimal predictor are also tight.)
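To make the mechanism concrete, here is a minimal numpy sketch of a two-view co-regularized least-squares objective; the function name and the quadratic losses are illustrative choices, not the formulation of any specific cited paper:

```python
import numpy as np

def coreg_objective(w1, w2, X1, X2, y, Xu1, Xu2, mu=1.0):
    # Each view sees a different representation of the data (X1/X2 for
    # labeled points, Xu1/Xu2 for unlabeled points). The co-regularizer
    # penalizes squared disagreement of the two linear predictors on
    # the unlabeled points, shrinking the effective hypothesis space.
    fit1 = ((X1 @ w1 - y) ** 2).mean()
    fit2 = ((X2 @ w2 - y) ** 2).mean()
    agree = ((Xu1 @ w1 - Xu2 @ w2) ** 2).mean()  # co-regularization term
    return fit1 + fit2 + mu * agree

# With agreeing, well-fitting predictors the objective vanishes.
obj = coreg_objective(np.ones(2), np.ones(2), np.eye(2), np.eye(2),
                      np.ones(2), np.eye(2), np.eye(2))
# obj == 0.0
```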

Suppose the true labeling functions for the source and target domains are given by f_s and f_t, respectively. Let S_y and T_y be the sets of points which are assigned label y in the source and target domains, respectively. As discussed in the earlier section, the hope is that alignment of the marginal distributions g#P_s and g#P_t will result in aligning the corresponding class conditionals, but this is not guaranteed. There might be sets A ⊂ S_y and B ⊂ T_{y′}, for y ≠ y′, such that their images under g (i.e., g(A) and g(B)) get aligned in the feature space, which is difficult to detect or correct in the absence of target labels.

We propose to use the idea of co-regularization to trim the space of possible alignments without ruling out the desirable alignments of class conditional distributions. Let H_1 and H_2 be the two hypothesis spaces for the feature generators, and F_1 and F_2 be the hypothesis classes of predictors defined on the outputs of the feature generators from H_1 and H_2, respectively. We want to learn pairs (g_i, h_i) ∈ H_i × F_i, for i = 1, 2, such that each composite predictor h_i ∘ g_i minimizes the prediction error on the source domain, while aligning the source and target feature distributions by minimizing a suitable distance between g_i#P_s and g_i#P_t. To measure the disagreement between the alignments of feature distributions in the two feature spaces, we look at the distance between the predictions h_1(g_1(x)) and h_2(g_2(x)) on unlabeled target examples x ∼ P_t. If the predictions agree, it can be seen as an indicator that the alignment of source and target feature distributions is similar across the two feature spaces induced by g_1 and g_2 (with respect to the classifier boundaries). Coming back to the example of erroneous alignment given in the previous paragraph, if there is a g_1 ∈ H_1 which erroneously aligns the images of two differently-labeled sets but no g_2 ∈ H_2 produces an agreeing alignment with respect to the classifier predictions, this alignment will be ruled out of consideration. Hence, ideally we would like to construct H_1 and H_2 such that they induce complementary erroneous alignments of the source and target distributions, while each of them still contains the set of desirable feature generators that produce the right alignments.

The proposed co-regularized domain alignment (referred to as Co-DA) can be summarized by the following objective function (denoting f_i := h_i ∘ g_i for i = 1, 2):

min_{g_1, h_1, g_2, h_2}  L_y(f_1) + L_y(f_2) + λ_d [L_d(g_1) + L_d(g_2)] + λ_p L_p(f_1, f_2) − λ_div D_g(g_1, g_2)        (2)

where L_y(f_i) is the usual cross-entropy loss for the source examples (assuming f_i outputs the probabilities of the classes and y is the one-hot label vector), L_d(g_i) is the loss term measuring the distance between the two distributions g_i#P_s and g_i#P_t, L_p(f_1, f_2) measures the disagreement between the two predictions for a target sample, and D_g(g_1, g_2) quantifies the diversity of g_1 and g_2. In the following, we instantiate Co-DA algorithmically, arriving at a concrete objective that can be optimized.

2.2.1 Algorithmic Instantiation

We make our approach of co-regularized domain alignment more concrete by making the following algorithmic choices:

Domain alignment.  Following much of the earlier work, we minimize the variational form of the Jensen-Shannon (JS) divergence Nguyen et al. (2010); Goodfellow et al. (2014) between source and target feature distributions Ganin and Lempitsky (2015); Bousmalis et al. (2016); Shu et al. (2018):

L_d(g_i) = max_{d_i}  E_{x∼P_s}[ln d_i(g_i(x))] + E_{x∼P_t}[ln(1 − d_i(g_i(x)))]        (3)

where d_i is the domain discriminator, taken to be a two-layer neural network that outputs the probability of the input sample belonging to the source domain.
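In code, the discriminator side of this objective can be sketched as follows (numpy, with illustrative names; in practice the discriminator and feature generator are neural networks trained by alternating stochastic gradient steps):

```python
import numpy as np

def discriminator_loss(d_src, d_tgt, eps=1e-8):
    # d_src / d_tgt: the discriminator's probabilities that source /
    # target features come from the source domain. The discriminator
    # minimizes this negative log-likelihood; the feature generator is
    # then updated to fool it, GAN-style.
    return -(np.log(d_src + eps).mean() + np.log(1.0 - d_tgt + eps).mean())

sharp = discriminator_loss(np.array([0.999]), np.array([0.001]))
confused = discriminator_loss(np.array([0.5]), np.array([0.5]))
```

A perfectly confused discriminator (outputting 0.5 everywhere) attains loss 2 ln 2 ≈ 1.386, the point at which the induced JS divergence between the two feature distributions is minimized.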

Target prediction agreement.  We use the ℓ_1 distance between the predicted class probabilities (twice the total variation distance) as the measure of disagreement (although other measures such as the JS-divergence are also possible):

L_p(f_1, f_2) = E_{x∼P_t} ‖f_1(x) − f_2(x)‖_1        (4)
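A minimal numpy version of this disagreement measure (names illustrative):

```python
import numpy as np

def prediction_disagreement(p1, p2):
    # Mean L1 distance between the class-probability vectors predicted
    # by the two hypotheses on a batch of target examples; for
    # probability vectors this equals twice the total variation distance.
    return np.abs(p1 - p2).sum(axis=1).mean()
```

Identical predictions give 0; fully disjoint one-hot predictions attain the maximum value of 2.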
Diverse g_1 and g_2.  It is desirable to have g_1 and g_2 induce distribution alignments whose errors differ from each other, so that target prediction agreement can play its role. To this end, we encourage the source feature distributions induced by g_1 and g_2 to be different from each other. There can be multiple ways to approach this; here we adopt the simpler option of pushing the minibatch means (with batch size b) far apart:

D_g(g_1, g_2) = min( ν, ‖ (1/b) Σ_{i=1}^b g_1(x_i) − (1/b) Σ_{i=1}^b g_2(x_i) ‖_2 )        (5)

The hyperparameter ν is a positive real controlling the maximum disparity between g_1 and g_2. This cap is needed for the stability of the feature maps g_1 and g_2 during training: we empirically observed that taking ν to be infinity results in their continued divergence from each other, harming the alignment of the source and target distributions in both feature spaces. Note that we only encourage the source feature distributions g_1#P_s and g_2#P_s to be different, hoping that aligning the corresponding target feature distributions to them will produce different alignments.
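The capped mean-pushing term can be sketched as follows (numpy; the cap value is illustrative, and whether the term is subtracted from the loss or maximized directly is a sign convention):

```python
import numpy as np

def diversity_term(f1, f2, nu=5.0):
    # f1, f2: minibatch feature embeddings from the two generators.
    # The distance between the minibatch means is encouraged to grow,
    # but is capped at nu so the generators cannot diverge indefinitely.
    gap = np.linalg.norm(f1.mean(axis=0) - f2.mean(axis=0))
    return min(gap, nu)  # maximized (i.e., subtracted from the loss)
```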

Cluster assumption.  The large amount of target unlabeled data can be used to bias the classifier boundaries to pass through regions containing a low density of data points. This is referred to as the cluster assumption Chapelle and Zien (2005), which has been used for semi-supervised learning Grandvalet and Bengio (2005); Miyato et al. (2017) and was also recently used for unsupervised domain adaptation Shu et al. (2018). Minimization of the conditional entropy of the predictions on target examples can be used to push the predictor boundaries away from the high-density regions Grandvalet and Bengio (2005); Miyato et al. (2017); Shu et al. (2018). However, this alone may result in overfitting to the unlabeled examples if the classifier has high capacity. To avoid this, virtual adversarial training (VAT) Miyato et al. (2017) has been successfully used in conjunction with conditional entropy minimization to smooth the classifier surface around the unlabeled points Miyato et al. (2017); Shu et al. (2018). We follow this line of work and add the following additional loss terms for conditional entropy minimization and VAT to the objective in (2):

L_c(f_i) = −E_{x∼P_t}[ f_i(x)^⊤ ln f_i(x) ],   L_v^t(f_i) = E_{x∼P_t}[ max_{‖r‖≤ε} KL( f_i(x) ‖ f_i(x + r) ) ]        (6)

We also use the VAT loss L_v^s on the source domain examples following Shu et al. (2018). Our final objective is given as:

min_{g_1, h_1, g_2, h_2}  Σ_{i=1,2} [ L_y(f_i) + λ_d L_d(g_i) + λ_s L_v^s(f_i) + λ_t ( L_c(f_i) + L_v^t(f_i) ) ] + λ_p L_p(f_1, f_2) − λ_div D_g(g_1, g_2)        (7)
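The conditional-entropy part of these terms is straightforward to write down; the numpy sketch below computes it for a batch of predicted class probabilities (VAT itself additionally requires a power-iteration step to find the adversarial direction r, which we omit here):

```python
import numpy as np

def conditional_entropy(probs, eps=1e-8):
    # Mean Shannon entropy of the predicted class distributions on
    # target examples; minimizing it pushes decision boundaries away
    # from high-density regions (the cluster assumption).
    return -(probs * np.log(probs + eps)).sum(axis=1).mean()
```

Confident predictions have near-zero entropy, while a uniform prediction over K classes attains the maximum value ln K.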
(1) The proposed co-regularized domain alignment (Co-DA) can be used to improve any domain adaptation method that has a domain alignment component in it. We instantiate it in the context of a recently proposed method, VADA Shu et al. (2018), whose objective corresponds to the per-hypothesis terms in Eq. (7) and which has shown state-of-the-art results on several datasets. Indeed, we observe that co-regularized domain alignment significantly improves upon these results.
(2) The proposed method extends naturally to more than two hypotheses; however, we limit ourselves to two hypothesis classes in the empirical evaluations.

3 Related Work

Domain Adaptation.  Due to the significance of domain adaptation in reducing the need for labeled data, there has been extensive activity on it over the past several years. Domain alignment has almost become a representative approach for domain adaptation, acting as a crucial component in many recently proposed methods Ghifary et al. (2014); Long et al. (2015); Tzeng et al. (2015); Sun and Saenko (2016); Bousmalis et al. (2016); Ganin et al. (2016); Yan et al. (2017); Tzeng et al. (2017); Shu et al. (2018). The proposed co-regularized domain alignment framework is applicable in all such methods that utilize domain alignment as an ingredient. Perhaps most related to our proposed method is a recent work by Saito et al. (2018), who proposed directly optimizing a proxy for the HΔH-distance Ben-David et al. (2010) in the context of deep neural networks. Their model consists of a single feature generator that feeds two different multi-layer NN classifiers F_1 and F_2. Their approach alternates between two steps: (i) for a fixed feature generator, finding F_1 and F_2 such that the discrepancy, or disagreement, between their predictions on target samples is maximized; (ii) for fixed F_1 and F_2, finding a feature generator which minimizes the discrepancy between the two classifiers' predictions on target samples. Our approach also has a discrepancy minimization term over the predictions for target samples, but the core idea is fundamentally different: we want to have diverse feature generators g_1 and g_2 that induce different alignments of the source and target populations, and which can correct each other's errors by minimizing the disagreement between them as measured by target predictions. Further, unlike Saito et al. (2018), where the discrepancy is maximized at the final predictions (step (i)), we maximize diversity at the output of the feature generators g_1 and g_2. Apart from the aforementioned approaches, methods based on image translations across domains have also been proposed for unsupervised domain adaptation Liu et al. (2017); Murez et al. (2017); Bousmalis et al. (2017b).

Co-regularization and Co-training.  The related ideas of co-training Blum and Mitchell (1998) and co-regularization Sindhwani et al. (2005); Sindhwani and Rosenberg (2008) have been successfully used for semi-supervised learning as well as unsupervised learning Kumar et al. (2011); Kumar and Daumé (2011). Chen et al. (2011) used the idea of co-training for semi-supervised domain adaptation (assuming a few target labeled examples are available) by finding a suitable split of the features into two sets based on the notion of ε-expandibility Balcan et al. (2005). A related work, Daume III et al. (2010), used the idea of co-regularization for semi-supervised domain adaptation, but their approach is quite different from our method: they learn different classifiers for source and target, making their predictions agree on the unlabeled target samples. Tri-training Zhou and Li (2005) can be regarded as an extension of co-training Blum and Mitchell (1998) and uses the output of three different classifiers to assign pseudo-labels to unlabeled examples. Saito et al. (2017) proposed asymmetric tri-training for unsupervised domain adaptation, where one of the three models is learned only on pseudo-labeled target examples. Asymmetric tri-training, similar to Saito et al. (2018), works with a single feature generator which feeds into three different classifiers.

Ensemble learning.  There is an extensive line of work on ensemble methods for neural nets which combine predictions from multiple models Drucker et al. (1994); Dietterich (2000); Rosen (1996); Liu and Yao (1999); Lee et al. (2016). Several ensemble methods also encourage diversity among the classifiers in the ensemble Liu and Yao (1999); Lee et al. (2016). However, ensemble methods have a different motivation from co-regularization/co-training: in the latter, diversity and agreement go hand in hand, working together towards reducing the size of the hypothesis space and the two classifiers converge to a similar performance after the completion of training due to the agreement objective. Indeed, we observe this in our experiments as well and either of the two classifiers can be used for test time predictions. On the other hand, ensemble methods need to combine predictions from all member models to get desired accuracy which can be both memory and computation intensive.

4 Experiments

We evaluate the proposed co-regularized domain alignment (Co-DA) by instantiating it in the context of a recently proposed method, VADA Shu et al. (2018), which has shown state-of-the-art results on several benchmarks, and observe that Co-DA yields further significant improvement over it, establishing a new state-of-the-art in several cases. For a fair comparison, we evaluate on the same datasets as used in Shu et al. (2018) (i.e., MNIST, SVHN, MNIST-M, Synthetic Digits, CIFAR-10 and STL), and base our implementation on the code released by the authors, to rule out incidental differences due to implementation-specific details.

Network architecture.  VADA Shu et al. (2018) has three components in the model architecture: a feature generator g, a feature classifier h that takes the output of g as input, and a domain discriminator d for domain alignment (Eq. 3). Their data classifier consists of nine conv layers followed by a global pool and an fc layer, with some additional dropout, max-pool and Gaussian noise layers in between. The last few layers of this network (the last three conv layers, global pool and fc layer) are taken as the feature classifier h, and the remaining earlier layers are taken as the feature generator g. Each conv and fc layer in g and h is followed by batch-norm. The objective of VADA for learning a data classifier corresponds to a single per-hypothesis term in Eq. (7). We experiment with the following architectural versions for creating the hypotheses (g_1, h_1) and (g_2, h_2) in our method: (i) We use two VADA models as our two hypotheses, with each of them following the same architecture as used in Shu et al. (2018) (for all three components g, h and d) but initialized with different random seeds. This version is referred to as Co-DA in the result tables. (ii) We use a single (shared) set of parameters for the conv and fc layers in g and h, but use conditional batch-normalization Dumoulin et al. (2017) to create two different sets of batch-norm layers for the two hypotheses. However, we still have two different discriminators (unshared parameters) performing domain alignment for the features induced by g_1 and g_2. This version is referred to as Co-DA_bn in the result tables. Additionally, we also experiment with fully shared network parameters without conditional batch-normalization (i.e., shared batch-norm layers): in this case, g_1 and g_2 differ only due to random sampling in each forward pass through the model (by virtue of the dropout and Gaussian noise layers in the feature generator). We refer to this variant as Co-DA_sh (for shared parameters). The diversity term (Eq. (5)) becomes inapplicable in this case. This variant also bears resemblance to the Π-model Laine and Aila (2016) and fraternal dropout Zolna et al. (2018), which were recently proposed in the context of (semi-)supervised learning.
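A minimal numpy sketch of the conditional batch-normalization idea used in the shared-parameter variant (the function name and parameter layout are illustrative; real implementations also track running statistics for inference):

```python
import numpy as np

def conditional_batchnorm(x, gammas, betas, branch, eps=1e-5):
    # Normalize with batch statistics, then apply the affine parameters
    # (gamma, beta) belonging to the chosen hypothesis. Conv/fc weights
    # upstream can be fully shared while the two branches still induce
    # different feature embeddings.
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gammas[branch] * x_hat + betas[branch]

x = np.array([[0.0, 2.0], [2.0, 0.0]])
gammas = [np.ones(2), 2 * np.ones(2)]
betas = [np.zeros(2), np.ones(2)]
out0 = conditional_batchnorm(x, gammas, betas, 0)  # branch 0's affine
out1 = conditional_batchnorm(x, gammas, betas, 1)  # branch 1's affine
```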

Other details and hyperparameters.  For domain alignment, which involves solving a saddle point problem (a minimization over g_i and a maximization over d_i, as defined in Eq. 3), Shu et al. (2018) replace gradient reversal Ganin and Lempitsky (2015) with alternating minimization, as used by Goodfellow et al. (2014) in the context of GAN training. This is claimed to alleviate the problem of saturating gradients, and we also use this approach following Shu et al. (2018). We also use instance normalization following Shu et al. (2018), which helps in making the classifier invariant to channel-wide shifts and scaling of the input pixel intensities. We do not use any sort of data augmentation in any of our experiments. For the VADA hyperparameters in Eq. (7), we fix their values to those reported by Shu et al. (2018) for all the datasets (obtained after a hyperparameter search in Shu et al. (2018)). For the domain alignment hyperparameter λ_d we do our own grid search. The hyperparameter for target prediction agreement, λ_p, was also obtained by a grid search. For the hyperparameters in the diversity term, we fix the weight λ_div and do a grid search for the cap ν (Eq. 5). The hyperparameters are tuned by randomly selecting a small number of labeled target examples from the training set and using them for validation, following Shu et al. (2018); Saito et al. (2017). We completely follow Shu et al. (2018) for training our model, using the Adam optimizer with Polyak averaging (i.e., an exponential moving average with momentum on the parameter trajectory), and train the models in all experiments for the same number of iterations as in Shu et al. (2018).
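Polyak averaging as used here is simply an exponential moving average over the parameter trajectory, with the averaged copy used at evaluation time; a sketch (the momentum value is illustrative, not the one used in the paper):

```python
import numpy as np

def polyak_update(ema_params, params, momentum=0.998):
    # One EMA step over a list of parameter arrays; `ema_params` is the
    # evaluation copy, `params` the current training parameters.
    return [momentum * e + (1 - momentum) * p
            for e, p in zip(ema_params, params)]

ema = polyak_update([np.zeros(2)], [np.ones(2)])
```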

Method | MNIST→SVHN | SVHN→MNIST | MNIST→MNIST-M | Syn-DIGITS→SVHN | CIFAR→STL | STL→CIFAR
DANN Ganin and Lempitsky (2015) | 35.7 | 71.1 | 81.5 | 90.3 | - | -
DSN Bousmalis et al. (2016) | - | 82.7 | 83.2 | 91.2 | - | -
ATT Saito et al. (2017) | 52.8 | 86.2 | 94.2 | 92.9 | - | -
MCD Saito et al. (2018) | - | 96.2 | - | - | - | -
Without instance-normalized input:
VADA Shu et al. (2018) | 47.5 | 97.9 | 97.7 | 94.8 | 80.0 | 73.5
Co-DA (no diversity term) | 50.7/50.1 | 97.4/97.2 | 98.9/99.0 | 94.9/94.6 | 81.3/80.0 | 76.1/75.5
Co-DA_bn (no diversity term) | 46.0/45.9 | 98.4/98.3 | 99.0/99.0 | 94.9/94.8 | 80.4/80.3 | 76.3/76.6
Co-DA_sh | 52.8 | 98.6 | 98.9 | 96.1 | 78.9 | 76.1
Co-DA | 52.0/49.7 | 98.3/98.2 | 99.0/98.9 | 96.1/96.0 | 81.1/80.4 | 76.4/75.7
Co-DA_bn | 55.3/55.2 | 98.8/98.7 | 98.6/98.7 | 95.4/95.3 | 81.4/81.2 | 76.3/76.2
VADA+DIRT-T Shu et al. (2018) | 54.5 | 99.4 | 98.9 | 96.1 | - | 75.3
Co-DA+DIRT-T | 59.8/60.8 | 99.4/99.4 | 99.1/99.0 | 96.4/96.5 | - | 76.3/76.6
Co-DA_bn+DIRT-T | 62.4/63.0 | 99.3/99.2 | 98.9/99.0 | 96.1/96.1 | - | 77.6/77.5
With instance-normalized input:
VADA Shu et al. (2018) | 73.3 | 94.5 | 95.7 | 94.9 | 78.3 | 71.4
Co-DA (no diversity term) | 78.5/78.2 | 97.6/97.5 | 97.1/96.4 | 95.1/94.9 | 80.1/79.2 | 74.5/73.9
Co-DA_bn (no diversity term) | 74.5/74.3 | 98.4/98.4 | 96.7/96.6 | 95.3/95.2 | 78.9/79.0 | 74.2/74.4
Co-DA_sh | 79.9 | 98.7 | 96.9 | 96.0 | 78.4 | 74.7
Co-DA | 81.7/80.9 | 98.6/98.5 | 97.5/97.0 | 96.0/95.9 | 80.6/79.9 | 74.7/74.2
Co-DA_bn | 81.4/81.3 | 98.5/98.5 | 98.0/97.9 | 95.3/95.3 | 80.6/80.4 | 74.7/74.6
VADA+DIRT-T Shu et al. (2018) | 76.5 | 99.4 | 98.7 | 96.2 | - | 73.3
Co-DA+DIRT-T | 88.0/87.3 | 99.3/99.4 | 98.7/98.6 | 96.4/96.5 | - | 74.8/74.2
Co-DA_bn+DIRT-T | 86.5/86.7 | 99.4/99.3 | 98.8/98.8 | 96.4/96.5 | - | 75.9/75.6
Table 1: Test accuracy on the target domain. Co-DA_bn is the proposed method with the two classifiers sharing parameters but having different batch-norm layers and different domain discriminators. Co-DA_sh is another variant where the only sources of difference between the two classifiers are the stochastic layers (dropout and Gaussian noise); these stochastic layers collapse to their expectations at test time, so we effectively have a single classifier during the test phase. For the other variants, the two numbers A/B are the accuracies of the two classifiers at the end of training. Numbers in bold denote the best accuracy among comparable methods and those in italics denote the close runner-up, if any. VADA and DIRT-T results are taken from Shu et al. (2018).

Baselines.  We primarily compare with VADA Shu et al. (2018) to show that co-regularized domain alignment can provide further improvements over state-of-the-art results. We also show results for Co-DA without the diversity loss term, to tease apart the effect of explicitly encouraging diversity through Eq. 5 (note that some diversity can arise even without this term, due to different random seeds and the Gaussian noise / dropout layers present in the networks). Shu et al. (2018) also propose to incrementally refine the learned VADA model by shifting the classifier boundaries to pass through low-density regions of the target domain (referred to as the DIRT-T phase) while keeping the classifier from moving too far from its previous state: the classifier at each refinement iteration (starting from the VADA solution) is obtained by minimizing the cluster-assumption losses on the target while penalizing divergence of its predictions from those of the previous iterate. We also perform DIRT-T refinement individually on each of the two trained hypotheses obtained with Co-DA to see how it compares with DIRT-T refinement on the VADA model Shu et al. (2018). Note that the DIRT-T refinement phase is carried out individually for each hypothesis and there is no co-regularization term connecting the two in the DIRT-T phase. Again following the evaluation protocol in Shu et al. (2018), we train DIRT-T for a fixed number of iterations, with the number of iterations taken as a hyperparameter. We do not perform any additional hyperparameter search for DIRT-T, and the values are fixed to those reported in Shu et al. (2018) for all datasets. Apart from VADA, we also show comparisons with other recently proposed unsupervised domain adaptation methods for completeness.

Figure 2: Test accuracy as the training iterations proceed for MNIST→SVHN with instance-normalization: there is high disagreement between the two classifiers during the earlier iterations for Co-DA, which vanishes eventually at convergence. VADA Shu et al. (2018) gets to a much higher accuracy early on during training but eventually falls short of Co-DA's performance.

Figure 3: Test accuracy of a kNN classifier on the target domain for VADA and Co-DA: source domain features (the output of the feature generator g, followed by PCA for dimensionality reduction) are used as training data for the classifier.

4.1 Domain adaptation results

We evaluate Co-DA on the following domain adaptation benchmarks. The results are shown in Table 1. The two numbers A/B reported in the table for the proposed methods are the individual test accuracies of the two classifiers, which are quite close to each other at convergence.

MNIST→SVHN.  Both MNIST and SVHN are digits datasets but differ greatly in style: MNIST consists of gray-scale handwritten digits whereas SVHN consists of house numbers from street-view images. This is the most challenging domain adaptation setting in our experiments (many earlier domain adaptation methods omitted it due to the difficulty of adaptation). VADA Shu et al. (2018) showed good performance (73.3%) on this challenging setting using instance normalization but without using any data augmentation. The proposed Co-DA improves on it substantially (to 81.7%), even surpassing the performance of VADA+DIRT-T (76.5%) Shu et al. (2018). Figure 2 shows the test accuracy as training proceeds. For the case of no instance-normalization as well, we see a substantial improvement over VADA, from 47.5% to over 52% with the proposed variants (Table 1). Applying iterative refinement with DIRT-T Shu et al. (2018) further improves the accuracy, to 88.0% with instance norm and 63.0% without instance norm. This sets a new state-of-the-art for MNIST→SVHN domain adaptation without using any data augmentation. To directly measure the improvement in source and target feature distribution alignment, we also do the following experiment: (i) we take the feature embeddings g(x) for the source training examples, reduce their dimensionality using PCA, and use these as the training set for a k-nearest-neighbor (kNN) classifier; (ii) we then compute the accuracy of this kNN classifier on the target test sets (again applying PCA to the output of the feature generator). We do steps (i) and (ii) for VADA as well, and repeat for multiple values of k. Fig. 3 compares the target test accuracy scores for VADA and Co-DA.
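The probe in steps (i) and (ii) can be sketched with numpy as follows (the data here is synthetic and all names are illustrative; the paper's actual inputs are the trained generator's embeddings):

```python
import numpy as np

def pca_fit(X, k):
    # Principal directions of the (centered) source embeddings.
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def knn_predict(Xtr, ytr, Xte, k=3):
    # Majority vote over the k nearest source embeddings.
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(ytr[row]).argmax() for row in nn])

# Two well-separated source clusters; target points near each cluster.
Xs = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
ys = np.array([0, 0, 1, 1])
mean, comps = pca_fit(Xs, 2)
Zs = (Xs - mean) @ comps
Zt = (np.array([[0.0, 0.5], [5.0, 5.5]]) - mean) @ comps
preds = knn_predict(Zs, ys, Zt, k=2)  # -> array([0, 1])
```

Higher kNN accuracy on target test points indicates that target features land near correctly-labeled source features, i.e., a better class-conditional alignment.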

SVHN→MNIST.  This adaptation direction is easier, as MNIST, the target domain, is easy to classify and the performance of existing methods is already quite high (97.9% with VADA without instance-normalization). Co-DA is still able to yield a reasonable improvement over VADA: about 0.5 to 1% without instance-normalization, and about 4% with instance-normalization. Applying DIRT-T after Co-DA does not give a significant boost over VADA+DIRT-T, as the performance is already saturated with Co-DA (close to 99.4%).

MNIST→MNIST-M.  Images in MNIST-M are created by blending MNIST digits with random color patches from the BSDS500 dataset. Co-DA provides improvements over VADA similar to the SVHN→MNIST setting: about 1.3% without instance-normalization, and about 2% with instance-normalization.

Syn-DIGITS→SVHN.  The Syn-DIGITS data consist of synthetic digit images of varying positioning, orientation, background, stroke color, and amount of blur. We again observe a reasonable improvement with Co-DA over VADA, getting close to the accuracy of a fully supervised target model for SVHN (without data augmentation).

CIFAR↔STL.  CIFAR has more labeled examples than STL, hence CIFAR→STL is an easier adaptation problem than STL→CIFAR. We observe more significant gains on the harder problem of STL→CIFAR, with Co-DA improving over VADA both with and without instance normalization.

5 Conclusion

We proposed co-regularization based domain alignment for unsupervised domain adaptation. We instantiated it in the context of a state-of-the-art domain adaptation method and observed that it provides improved performance on several commonly used domain adaptation benchmarks, with substantial gains in the more challenging tasks, setting a new state of the art in these cases. Further investigation is needed into more effective diversity losses (Eq. (5)). A theoretical understanding of co-regularization for domain adaptation in the context of deep neural networks, particularly characterizing its effect on the alignment of source and target feature distributions, is also an interesting direction for future work.
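The co-regularization idea, penalizing disagreement between the two hypotheses' class predictions on unlabeled target examples, can be sketched as below. The squared L2 distance between predicted class-probability vectors is one simple illustrative choice of agreement penalty, not necessarily the exact loss used in the paper; the function and variable names are hypothetical.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Numerically stable row-wise softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def agreement_penalty(logits1: np.ndarray, logits2: np.ndarray) -> float:
    """Mean squared L2 distance between the two hypotheses'
    class-probability vectors on a batch of unlabeled target examples.
    Minimizing this term encourages the two alignments to agree on
    their target predictions."""
    p1, p2 = softmax(logits1), softmax(logits2)
    return float(np.mean(np.sum((p1 - p2) ** 2, axis=1)))

# Identical predictions incur zero penalty; disagreement incurs a positive one.
a = np.array([[2.0, 0.0], [0.0, 3.0]])
b = np.array([[0.0, 2.0], [3.0, 0.0]])
zero_pen = agreement_penalty(a, a)
pos_pen = agreement_penalty(a, b)
```

In training, this term would be added to the per-hypothesis classification and domain-alignment losses, alongside a diversity loss that keeps the two feature spaces from collapsing into one.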


  • Balcan et al. (2005) Maria-Florina Balcan, Avrim Blum, and Ke Yang. Co-training and expansion: Towards bridging theory and practice. In Advances in neural information processing systems, pages 89–96, 2005.
  • Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • Blum and Mitchell (1998) A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Conference on Learning Theory, 1998.
  • Bousmalis et al. (2016) Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.
  • Bousmalis et al. (2017a) Konstantinos Bousmalis, Alex Irpan, Paul Wohlhart, Yunfei Bai, Matthew Kelcey, Mrinal Kalakrishnan, Laura Downs, Julian Ibarz, Peter Pastor, Kurt Konolige, et al. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. arXiv preprint arXiv:1709.07857, 2017a.
  • Bousmalis et al. (2017b) Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 7, 2017b.
  • Chapelle and Zien (2005) Olivier Chapelle and A. Zien. Semi-Supervised Classification by Low Density Separation. In AISTATS, pages 57–64, 2005.
  • Chen et al. (2011) Minmin Chen, Kilian Q Weinberger, and John Blitzer. Co-training for domain adaptation. In Advances in neural information processing systems, pages 2456–2464, 2011.
  • Daume III et al. (2010) Hal Daume III, Abhishek Kumar, and Avishek Saha. Co-regularization Based Semi-supervised Domain Adaptation. In Advances in Neural Information Processing Systems, 2010.
  • Dietterich (2000) Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
  • Drucker et al. (1994) Harris Drucker, Corinna Cortes, Lawrence D Jackel, Yann LeCun, and Vladimir Vapnik. Boosting and other ensemble methods. Neural Computation, 6(6):1289–1301, 1994.
  • Dumoulin et al. (2017) Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In Proceedings of the International Conference on Learning Representations, Toulon, France, April 2017.
  • Fernando et al. (2013) Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2960–2967. IEEE, 2013.
  • French et al. (2018) Geoffrey French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for domain adaptation. In International Conference on Learning Representations, 2018.
  • Ganin and Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • Ghifary et al. (2014) Muhammad Ghifary, W Bastiaan Kleijn, and Mengjie Zhang. Domain adaptive neural networks for object recognition. In Pacific Rim International Conference on Artificial Intelligence, pages 898–904. Springer, 2014.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • Grandvalet and Bengio (2005) Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2005.
  • Kumar and Daumé (2011) Abhishek Kumar and Hal Daumé. A co-training approach for multi-view spectral clustering. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 393–400, 2011.
  • Kumar et al. (2011) Abhishek Kumar, Piyush Rai, and Hal Daume. Co-regularized multi-view spectral clustering. In Advances in neural information processing systems, pages 1413–1421, 2011.
  • Laine and Aila (2016) Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
  • Lee et al. (2016) Stefan Lee, Senthil Purushwalkam Shiva Prakash, Michael Cogswell, Viresh Ranjan, David Crandall, and Dhruv Batra. Stochastic multiple choice learning for training diverse deep ensembles. In Advances in Neural Information Processing Systems, pages 2119–2127, 2016.
  • Liu et al. (2017) Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
  • Liu and Yao (1999) Yong Liu and Xin Yao. Ensemble learning via negative correlation. Neural networks, 12(10):1399–1404, 1999.
  • Long et al. (2015) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.
  • Miyato et al. (2017) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976, 2017.
  • Murez et al. (2017) Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. arXiv preprint arXiv:1712.00479, 2017.
  • Nguyen et al. (2010) XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  • Rosen (1996) Bruce E Rosen. Ensemble learning using decorrelated neural networks. Connection science, 8(3-4):373–384, 1996.
  • Rosenberg and Bartlett (2007) David S Rosenberg and Peter L Bartlett. The rademacher complexity of co-regularized kernel classes. In Artificial Intelligence and Statistics, pages 396–403, 2007.
  • Saito et al. (2017) Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. arXiv preprint arXiv:1702.08400, 2017.
  • Saito et al. (2018) Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Shimodaira (2000) Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
  • Shu et al. (2018) Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A dirt-t approach to unsupervised domain adaptation. In International Conference on Learning Representations, 2018.
  • Sindhwani and Rosenberg (2008) Vikas Sindhwani and David S Rosenberg. An rkhs for multi-view learning and manifold co-regularization. In Proceedings of the 25th international conference on Machine learning, 2008.
  • Sindhwani et al. (2005) Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A Co-regularization approach to semi-supervised learning with multiple views. In Proceedings of the Workshop on Learning with Multiple Views, International Conference on Machine Learning, 2005.
  • Sridharan and Kakade (2008) Karthik Sridharan and Sham M Kakade. An information theoretic framework for multi-view learning. In COLT, 2008.
  • Sun and Saenko (2014) Baochen Sun and Kate Saenko. From virtual to reality: Fast adaptation of virtual object detectors to real domains. In BMVC, volume 1, page 3, 2014.
  • Sun and Saenko (2016) Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.
  • Tzeng et al. (2015) Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4068–4076. IEEE, 2015.
  • Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
  • Vazquez et al. (2014) David Vazquez, Antonio M Lopez, Javier Marin, Daniel Ponsa, and David Geronimo. Virtual and real world adaptation for pedestrian detection. IEEE transactions on pattern analysis and machine intelligence, 36(4):797–809, 2014.
  • Yan et al. (2017) Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang, Yong Xu, and Wangmeng Zuo. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2272–2281, 2017.
  • Zhou and Li (2005) Zhi-Hua Zhou and Ming Li. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on knowledge and Data Engineering, 17(11):1529–1541, 2005.
  • Zolna et al. (2018) Konrad Zolna, Devansh Arpit, Dendi Suhubdy, and Yoshua Bengio. Fraternal dropout. In International Conference on Learning Representations, 2018.