Virtual Mixup Training for Unsupervised Domain Adaptation

05/10/2019 · by Xudong Mao et al. · City University of Hong Kong

We study the problem of unsupervised domain adaptation, which aims to adapt models trained on a labeled source domain to a completely unlabeled target domain. Domain adversarial training is a promising approach and has been a basis for many state-of-the-art methods in unsupervised domain adaptation. The idea of domain adversarial training is to align the feature spaces of the source and target domains by adversarially training a domain classifier and a feature encoder. Recently, the cluster assumption has been applied to unsupervised domain adaptation and achieved strong performance. In this paper, we propose a new regularization method called Virtual Mixup Training (VMT), which further constrains the classifier under the cluster assumption. The idea of VMT is to impose a locally-Lipschitz constraint on the model by smoothing the output distribution along the lines between pairs of training samples. Unlike the traditional mixup model, our method constructs the combined samples without label information, allowing it to be applied to unsupervised domain adaptation. The proposed method is generic and can be combined with existing methods that use domain adversarial training. We combine VMT with a recent state-of-the-art model called VADA, and extensive experiments demonstrate that VMT significantly improves the performance of VADA on several domain adaptation benchmark datasets. For the challenging task of adapting MNIST to SVHN, VMT improves the accuracy of VADA by over 30% when instance normalization is not used; with instance normalization, VMT reaches an accuracy of 96.4%, which is very close to the accuracy (96.5%) of the train-on-target model. Code will be made publicly available.


1 Introduction

Deep neural networks have driven profound advances in a wide variety of fields such as image classification Krizhevsky et al. (2012), detection Girshick et al. (2014), and segmentation Long et al. (2015a). However, their performance often depends on large amounts of labeled training data. In real-world tasks, generating labeled training data can be very expensive and may not always be feasible. One approach to this problem is to learn from related labeled source data and generalize to unlabeled target data, which is known as domain adaptation. In this work, we consider the problem of unsupervised domain adaptation, where the training samples in the target domain are completely unlabeled.

For unsupervised domain adaptation, Ganin et al. (2016) proposed domain adversarial training to learn domain-invariant features between the source and target domains; it has served as a basis for numerous domain adaptation methods Tzeng et al. (2017); Kumar et al. (2018); Shu et al. (2018); Saito et al. (2018); Xie et al. (2018). Most follow-up studies focus on how to learn better-aligned domain-invariant features, including adversarial discriminative adaptation Tzeng et al. (2017), maximizing classifier discrepancy Saito et al. (2018), and class-conditional alignment Xie et al. (2018); Kumar et al. (2018).

[Figure 1: framework.pdf]

Figure 1: The framework of VMT. $h_\theta$ is the classifier, and $D_{\mathrm{KL}}$ denotes the KL-divergence.

Recently, Shu et al. (2018) successfully combined the cluster assumption Grandvalet and Bengio (2005) with domain adversarial training. They also pointed out that a locally-Lipschitz constraint is critical to the performance of the cluster assumption: without it, the high-capacity classifier may abruptly change its predictions in the vicinity of the training samples. To this end, they adopted virtual adversarial training Miyato et al. (2018) to constrain the local Lipschitzness of the classifier. In this paper, we follow this line and propose a new method to constrain the local Lipschitzness.

Inspired by the virtual labels used by Miyato et al. (2018), we propose Virtual Mixup Training (VMT), which extends mixup Zhang et al. (2018) to use virtual labels, thereby making it applicable to unsupervised domain adaptation. Here, virtual labels are labels obtained from the current estimate of the classifier. Specifically, as shown in Figure 1, we first construct convex combinations $(\tilde{x}, \tilde{y})$ of pairs of training samples and their virtual labels, and then define a penalty term that punishes the difference between the combined sample's prediction $h_\theta(\tilde{x})$ and the combined virtual label $\tilde{y}$. This penalty term enforces a linear change of the output distribution in-between training samples, imposing a locally-Lipschitz constraint on the classifier. Note that VMT can be applied to both the target and source domains. For the source domain, we also replace the real labels with virtual labels, without using the label information of the source domain.

In the experiments, we combine VMT with a recent state-of-the-art model called VADA Shu et al. (2018) and evaluate on several commonly used benchmark datasets. The experimental results show that VMT improves the performance of VADA on all tasks. For the most challenging task, MNIST → SVHN without instance normalization, our model improves VADA's accuracy from 54.5% to 86.4%. When using instance normalization, our model achieves an accuracy of 96.4%, which is very close to the accuracy (96.5%) of the train-on-target model.

2 Related Work

Domain adaptation. Domain adaptation has gained extensive attention in recent years due to its ability to utilize unlabeled data. A theoretical analysis of domain adaptation is presented in Ben-David et al. (2010). Early works Shimodaira (2000); Mansour et al. (2009) tried to minimize the discrepancy distance between the source and target feature distributions. Long et al. (2015b), Sun and Saenko (2016), and Das and Lee (2018) extended this approach by matching higher-order statistics of the two distributions. Huang et al. (2007), Tzeng et al. (2015), and Ganin et al. (2016) proposed to project the source and target feature distributions into a common space and match the learned features as closely as possible. In particular, Ganin et al. (2016) proposed domain adversarial training to learn domain-invariant features, which has been a basis of numerous domain adaptation methods Tzeng et al. (2017); Saito et al. (2018); Xie et al. (2018); Shu et al. (2018); Kumar et al. (2018). Tzeng et al. (2017) proposed a generalized framework based on domain adversarial training that combines discriminative modeling with the GAN loss Goodfellow et al. (2014). Saito et al. (2018) proposed to utilize two different classifiers to learn not only domain-invariant but also class-specific features. Xie et al. (2018) also proposed to learn class-specific features by assigning virtual labels to the target samples and aligning the class centroids between the source and target domains. Shu et al. (2018) proposed to combine the cluster assumption Grandvalet and Bengio (2005) with domain adversarial training. They also adopted virtual adversarial training Miyato et al. (2018) to constrain the local Lipschitzness of the classifier, having found that the locally-Lipschitz constraint is critical to the performance of the cluster assumption. Kumar et al. (2018) extended Shu et al. (2018) by using co-regularization Sindhwani et al. (2005) to align class-specific features. We also follow the line of Shu et al. (2018) and propose a new method to constrain the local Lipschitzness.

There are also many other promising models, including domain separation networks Bousmalis et al. (2016b), reconstruction-classification networks Ghifary et al. (2016), tri-training Saito et al. (2017), and self-ensembling French et al. (2018). Another effective direction for domain adaptation is image-to-image translation Taigman et al. (2017); Bousmalis et al. (2017); Liu et al. (2017); Mao and Li (2018); Murez et al. (2018); Hoffman et al. (2018), where the source samples are first translated into the target domain within the same class, and the translated samples are then used to train the classifier.

Local Lipschitzness. Grandvalet and Bengio (2005) pointed out that local Lipschitzness is critical to the performance of the cluster assumption. Ben-David and Urner (2014) also showed theoretically that Lipschitzness can be viewed as a way of formalizing the cluster assumption. Constraining local Lipschitzness has proven effective in semi-supervised learning Bachman et al. (2014); Sajjadi et al. (2016); Tarvainen and Valpola (2017); Laine and Aila (2017); Miyato et al. (2018) and domain adaptation French et al. (2018); Shu et al. (2018). Generally, these methods smooth the output distribution of the model by constructing points surrounding the original points and enforcing consistent predictions between the surrounding and original points. Specifically, Bachman et al. (2014), Sajjadi et al. (2016), and Laine and Aila (2017) utilized the randomness of neural networks to construct the surrounding points. Tarvainen and Valpola (2017) and French et al. (2018) proposed to construct two different networks and enforce them to output consistent predictions for the same input. Miyato et al. (2018) utilized adversarial examples Goodfellow et al. (2015) to regularize the model along the direction that most violates local Lipschitzness.

Mixup. Zhang et al. (2018) proposed a regularization method called mixup to improve the generalization of neural networks. Mixup generates convex combinations of pairs of training examples and their labels, favoring smoothness of the output distribution. A similar idea is presented in Tokozume et al. (2018) for image classification. Verma et al. (2018) extended mixup by mixing at the output of a random hidden layer. Guo et al. (2019) proposed to learn the mixing policy with an additional network instead of using a random policy. An idea similar to ours is described in Verma et al. (2019) for semi-supervised learning: they also use mixup to enforce consistent predictions on unlabeled training samples. Berthelot et al. (2019) extended this method by mixing labeled and unlabeled samples.

Virtual labels. Virtual (or pseudo) labels have been widely used in semi-supervised learning Blum and Mitchell (1998); Miyato et al. (2018) and domain adaptation Chen et al. (2011); Saito et al. (2017); Xie et al. (2018). Chen et al. (2011) and Saito et al. (2017) proposed to first use multiple classifiers to assign virtual labels to the target samples, and then train the classifier on the target samples with these virtual labels. Xie et al. (2018) proposed to compute the class centroids of the virtual labels to reduce the bias caused by false virtual labels. However, these methods rely heavily on the accuracy of the virtual labels, since they directly use the predicted classes. The method most closely related to ours is virtual adversarial training Miyato et al. (2018), which enforces the virtual labels of an original sample and its adversarial example to be similar and therefore does not depend on the predicted classes of the virtual labels.

3 Method

3.1 Background

3.1.1 Domain Adversarial Training

We first describe domain adversarial training Ganin et al. (2016), which is a basis of our model. Let $\mathcal{D}_s$ be the joint distribution of input samples $x$ and labels $y$ in the source domain, and let $\mathcal{D}_t$ be the input distribution of the target domain. Suppose a classifier $h_\theta = g_\theta \circ f_\theta$ can be decomposed into a feature encoder $f_\theta$ and an embedding classifier $g_\theta$: the input $x$ is first mapped through the feature encoder $f_\theta$, and then through the embedding classifier $g_\theta$. A domain discriminator $D$ maps the feature vector $f_\theta(x)$ to a domain label. The domain discriminator and feature encoder are trained adversarially: $D$ tries to distinguish whether the input sample comes from the source or the target domain, while $f_\theta$ aims to generate indistinguishable feature vectors for samples from the two domains. The objective of domain adversarial training can be formalized as follows:

$\min_\theta \; \mathcal{L}_y(\theta; \mathcal{D}_s) + \lambda_d \, \mathcal{L}_d(\theta; \mathcal{D}_s, \mathcal{D}_t)$   (1)

where $\mathcal{L}_y(\theta; \mathcal{D}_s) = \mathbb{E}_{(x, y) \sim \mathcal{D}_s}\left[-y^\top \ln h_\theta(x)\right]$ is the source classification loss, $\mathcal{L}_d(\theta; \mathcal{D}_s, \mathcal{D}_t) = \sup_D \, \mathbb{E}_{x \sim \mathcal{D}_s}\left[\ln D(f_\theta(x))\right] + \mathbb{E}_{x \sim \mathcal{D}_t}\left[\ln\left(1 - D(f_\theta(x))\right)\right]$ is the domain adversarial loss, and $\lambda_d$ is used to adjust the weight of $\mathcal{L}_d$.
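To make the alternating optimization concrete, here is a minimal PyTorch sketch of one domain-adversarial update. It is hypothetical illustration code, not the paper's implementation: the module names, optimizers, and the default value of lambda_d are assumptions.

import torch
import torch.nn.functional as F

def adversarial_step(encoder, classifier, discriminator,
                     opt_main, opt_disc, x_s, y_s, x_t, lambda_d=1e-2):
    # Discriminator update: label source features as 1 and target features as 0.
    f_s, f_t = encoder(x_s).detach(), encoder(x_t).detach()
    logit_s, logit_t = discriminator(f_s), discriminator(f_t)
    d_loss = F.binary_cross_entropy_with_logits(logit_s, torch.ones_like(logit_s)) \
        + F.binary_cross_entropy_with_logits(logit_t, torch.zeros_like(logit_t))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # Encoder/classifier update: source cross-entropy L_y plus lambda_d * L_d,
    # where the encoder tries to make target features indistinguishable from
    # source features (i.e., fool the discriminator).
    logit_t = discriminator(encoder(x_t))
    loss = F.cross_entropy(classifier(encoder(x_s)), y_s) \
        + lambda_d * F.binary_cross_entropy_with_logits(logit_t, torch.ones_like(logit_t))
    opt_main.zero_grad(); loss.backward(); opt_main.step()
    return loss.item()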

3.1.2 Cluster Assumption

The cluster assumption states that the input data contains clusters, and that samples in the same cluster come from the same class Grandvalet and Bengio (2005). It has been widely used in semi-supervised learning Grandvalet and Bengio (2005); Sajjadi et al. (2016); Miyato et al. (2018), and recently has been applied to unsupervised domain adaptation Shu et al. (2018). Conditional entropy minimization is usually adopted to enforce the behavior of the cluster assumption Grandvalet and Bengio (2005); Sajjadi et al. (2016); Miyato et al. (2018); Shu et al. (2018):

$\mathcal{L}_c(\theta; \mathcal{D}) = -\,\mathbb{E}_{x \sim \mathcal{D}}\left[h_\theta(x)^\top \ln h_\theta(x)\right]$   (2)
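As a concrete reference, a minimal PyTorch sketch of this conditional entropy term might look as follows (hypothetical code, not taken from the official implementation):

import torch
import torch.nn.functional as F

def conditional_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Eq. 2: E_x[-h(x)^T ln h(x)], computed from raw logits for numerical stability.
    p = F.softmax(logits, dim=1)
    log_p = F.log_softmax(logits, dim=1)
    return -(p * log_p).sum(dim=1).mean()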

In practice, another critical component is the local Lipschitzness of the classifier: without a locally-Lipschitz constraint, the classifier may abruptly change its predictions in the vicinity of the training samples. To this end, Shu et al. (2018) adopted virtual adversarial training Miyato et al. (2018) to impose the locally-Lipschitz constraint:

$\mathcal{L}_v(\theta; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}\left[\max_{\|r\| \le \epsilon} D_{\mathrm{KL}}\left(h_\theta(x) \,\|\, h_\theta(x + r)\right)\right]$   (3)
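For reference, the following is a hedged PyTorch sketch of this VAT penalty, using one power-iteration step to approximate the worst-case perturbation in the spirit of Miyato et al. (2018); the perturbation sizes xi and eps are illustrative assumptions, not the paper's tuned values.

import torch
import torch.nn.functional as F

def _l2_normalize(d):
    # Normalize each sample's perturbation to unit L2 norm.
    norm = d.flatten(1).norm(dim=1).view(-1, *([1] * (d.dim() - 1)))
    return d / (norm + 1e-8)

def vat_loss(model, x, xi=1e-6, eps=3.5):
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)              # current prediction h(x)
    d = xi * _l2_normalize(torch.randn_like(x))     # tiny random probe direction
    d.requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x + d), dim=1), p, reduction="batchmean")
    g = torch.autograd.grad(kl, d)[0]               # direction of steepest KL increase
    r_adv = eps * _l2_normalize(g)                  # approximate max over ||r|| <= eps
    return F.kl_div(F.log_softmax(model(x + r_adv.detach()), dim=1), p,
                    reduction="batchmean")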

3.2 Virtual Mixup Training

Following the line of enforcing the cluster assumption, we propose Virtual Mixup Training (VMT), a novel approach to enforcing local Lipschitzness. Mixup Zhang et al. (2018) has proven effective at smoothing the output distribution of neural networks in many supervised problems. The idea of mixup is to encourage the classifier to behave linearly in-between training samples by applying the following convex combinations of labeled samples:

$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$   (4)

However, for unsupervised domain adaptation, we have no direct information about $y_i$ and $y_j$ in the target domain. Inspired by Miyato et al. (2018), we replace $y_i$ and $y_j$ with the approximations $h_\theta(x_i)$ and $h_\theta(x_j)$, the current predictions of the classifier $h_\theta$. We call these predictions virtual labels, and formalize our proposed VMT as follows:

$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda h_\theta(x_i) + (1 - \lambda) h_\theta(x_j)$   (5)

where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ with $\alpha \in (0, \infty)$. Our goal is to make the classifier behave linearly along the lines between $x_i$ and $x_j$; we therefore enforce the combined sample's prediction $h_\theta(\tilde{x})$ and the combined label $\tilde{y}$ to be consistent. Based on this, we arrive at the objective of VMT:

$\mathcal{L}_m(\theta; \mathcal{D}) = \mathbb{E}_{\tilde{x}, \tilde{y}}\left[D_{\mathrm{KL}}\left(\tilde{y} \,\|\, h_\theta(\tilde{x})\right)\right]$   (6)
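The following is a minimal PyTorch sketch of the VMT penalty of Eqs. 5 and 6. It is illustration code under stated assumptions: samples are paired by shuffling within the batch, and alpha=1.0 is a placeholder default rather than the value used in our experiments.

import torch
import torch.nn.functional as F
from torch.distributions import Beta

def vmt_loss(model, x, alpha=1.0):
    with torch.no_grad():
        y_virtual = F.softmax(model(x), dim=1)            # virtual labels h(x)
    lam = Beta(alpha, alpha).sample()                     # lambda ~ Beta(alpha, alpha)
    idx = torch.randperm(x.size(0))                       # random pairing in the batch
    x_mix = lam * x + (1 - lam) * x[idx]                  # Eq. 5: mixed inputs
    y_mix = lam * y_virtual + (1 - lam) * y_virtual[idx]  # Eq. 5: mixed virtual labels
    log_p = F.log_softmax(model(x_mix), dim=1)
    return F.kl_div(log_p, y_mix, reduction="batchmean")  # Eq. 6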

VMT can be understood as smoothing the output distribution of the classifier, thereby imposing a locally-Lipschitz constraint on it. Such a constraint has proven effective in favoring the cluster assumption Grandvalet and Bengio (2005); Shu et al. (2018). We also show empirically in Section 4.3 that VMT is orthogonal to another locally-Lipschitz-constraint technique, virtual adversarial training (Eq. 3). Combining VMT with Eqs. 1, 2, and 3, we get the following objective:

$\min_\theta \; \mathcal{L}_y(\theta; \mathcal{D}_s) + \lambda_d \, \mathcal{L}_d(\theta; \mathcal{D}_s, \mathcal{D}_t) + \lambda_s \, \mathcal{L}_v(\theta; \mathcal{D}_s) + \lambda_t \left[\mathcal{L}_v(\theta; \mathcal{D}_t) + \mathcal{L}_c(\theta; \mathcal{D}_t)\right] + \lambda_{sm} \, \mathcal{L}_m(\theta; \mathcal{D}_s) + \lambda_{tm} \, \mathcal{L}_m(\theta; \mathcal{D}_t)$   (7)

where $\lambda_{sm}$ and $\lambda_{tm}$ are used to adjust the weights of the VMT penalty terms on the source and target domains. In Eq. 7, we apply VMT to both the source and target domains; for the source domain, we also replace $y_i$ and $y_j$ with virtual labels, without using the label information. Note that except for $\alpha$ in Eq. 5, which is fixed in our experiments, we do not introduce hyperparameters that require careful tuning beyond those of VADA, and the weights $\lambda_{sm}$ and $\lambda_{tm}$ are easy to choose empirically.
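Putting the pieces together, a hypothetical sketch of Eq. 7 could assemble the terms as follows, reusing the conditional_entropy, vat_loss, and vmt_loss sketches above; the argument names are placeholders, and the weights should be set as in Appendix A.

import torch
import torch.nn.functional as F

# `model` is the full classifier h (feature encoder followed by embedding
# classifier); `encoder` and `discriminator` are as in the adversarial sketch.
def total_loss(model, encoder, discriminator, x_s, y_s, x_t,
               lam_d, lam_s, lam_t, lam_sm, lam_tm):
    l_y = F.cross_entropy(model(x_s), y_s)            # supervised source term L_y
    logit_t = discriminator(encoder(x_t))
    l_d = F.binary_cross_entropy_with_logits(         # feature-alignment term L_d
        logit_t, torch.ones_like(logit_t))
    l_src = lam_s * vat_loss(model, x_s) + lam_sm * vmt_loss(model, x_s)
    l_tgt = lam_t * (vat_loss(model, x_t) + conditional_entropy(model(x_t))) \
        + lam_tm * vmt_loss(model, x_t)
    return l_y + lam_d * l_d + l_src + l_tgt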

Like mixup Zhang et al. (2018), the implementation of VMT is simple and straightforward. One important advantage of VMT is its low computational cost; we show in Section 4.3 that VMT is considerably cheaper than virtual adversarial training. Despite its simplicity, VMT achieves new state-of-the-art performance on several benchmark datasets. In particular, for the challenging task of adapting MNIST to SVHN without instance normalization, VMT improves the accuracy of VADA by over 30%.

4 Experiments

For the evaluation, we focus on visual domain adaptation and evaluate our model on several benchmark datasets, including MNIST, MNIST-M, Synthetic Digits (SYN), Street View House Numbers (SVHN), CIFAR-10, and STL-10.

4.1 Implementation Details

Architecture. We use the same network architectures as VADA Shu et al. (2018) for a fair comparison. In particular, a small CNN is used for the digit tasks, and a larger CNN is used for the tasks between CIFAR-10 and STL-10.

Iterative refinement training. Shu et al. (2018) proposed an iterative refinement training technique called DIRT-T for further optimizing the cluster assumption on the target domain. We find this strategy is also very effective for our model. Specifically, we first initialize with a trained VMT model using Eq. 7, and then iteratively minimize the following objective on the target domain:

$\min_{\theta_n} \; \lambda_t \left[\mathcal{L}_c(\theta_n; \mathcal{D}_t) + \mathcal{L}_v(\theta_n; \mathcal{D}_t) + \mathcal{L}_m(\theta_n; \mathcal{D}_t)\right] + \beta \, \mathbb{E}_{x \sim \mathcal{D}_t}\left[D_{\mathrm{KL}}\left(h_{\theta_{n-1}}(x) \,\|\, h_{\theta_n}(x)\right)\right]$   (8)

where $\beta$ controls how closely the refined classifier $h_{\theta_n}$ stays to the previous estimate $h_{\theta_{n-1}}$. We report results both with and without DIRT-T in the following experiments.
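A hypothetical sketch of the DIRT-T refinement loop, reusing the loss sketches above: a frozen copy of the previous classifier serves as the teacher $h_{\theta_{n-1}}$. The refresh interval of 5000 iterations follows Appendix A, while the loop structure itself is an illustrative assumption.

import copy
import torch
import torch.nn.functional as F

def dirt_t_refine(model, opt, target_loader, lam_t, beta, refine_interval=5000):
    teacher = copy.deepcopy(model).eval()             # h_{theta_{n-1}}
    for step, x_t in enumerate(target_loader):
        with torch.no_grad():
            p_teacher = F.softmax(teacher(x_t), dim=1)
        # Target-side cluster losses of Eq. 8.
        l_cluster = conditional_entropy(model(x_t)) + vat_loss(model, x_t) \
            + vmt_loss(model, x_t)
        # KL term keeping theta_n close to the previous estimate theta_{n-1}.
        l_kl = F.kl_div(F.log_softmax(model(x_t), dim=1), p_teacher,
                        reduction="batchmean")
        loss = lam_t * l_cluster + beta * l_kl
        opt.zero_grad(); loss.backward(); opt.step()
        if (step + 1) % refine_interval == 0:         # refresh the teacher
            teacher = copy.deepcopy(model).eval()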

Hyperparameters. We fix $\alpha$ in Eq. 5 to a single value for all experiments. For $\lambda_d$, $\lambda_s$, and $\lambda_t$, we follow Shu et al. (2018) and restrict the hyperparameter search to the same small candidate sets. For $\lambda_{sm}$ and $\lambda_{tm}$, we likewise restrict the search to a small set of candidate values. A complete list of the hyperparameters is presented in Appendix A.

Baselines. We primarily compare our model with two baselines: VADA Shu et al. (2018) and Co-DA Kumar et al. (2018). Co-DA is also based on VADA and uses a co-regularization method to achieve better domain alignment. We also report the results of several other recently proposed unsupervised domain adaptation models for comparison.

Other details. Following Shu et al. (2018), we replace gradient reversal Ganin et al. (2016) with the strategy of alternating updates Goodfellow et al. (2014) between the domain discriminator and the feature encoder. We also follow Shu et al. (2018) in applying instance normalization to the input images and report performance both with and without it. We use the Adam optimizer with an exponential moving average applied to the parameter trajectory. Our implementation is based on the official implementation of VADA Shu et al. (2018) (https://github.com/RuiShu/dirt-t), and the code will be made publicly available.
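For reference, here is a minimal sketch of the exponential moving average over the parameter trajectory (a hypothetical helper: the momentum value is an assumption, and batch-norm running statistics are ignored for brevity); evaluation uses the averaged copy.

import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, momentum=0.998):
    # ema <- momentum * ema + (1 - momentum) * current parameters
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(momentum).add_(p, alpha=1.0 - momentum)

# Usage: ema_model = copy.deepcopy(model); call update_ema(ema_model, model)
# after each optimizer step.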

[Table 1 spans six tasks (MNIST → SVHN, SVHN → MNIST, MNIST → MNIST-M, SYN → SVHN, CIFAR → STL, STL → CIFAR) and compares MMD Long et al. (2015b), DANN Ganin et al. (2016), DRCN Ghifary et al. (2016), DSN Bousmalis et al. (2016a), kNN-Ad Sener et al. (2016), PixelDA Bousmalis et al. (2017), ATT Saito et al. (2017), and the π-model (aug) French et al. (2018), followed by Source-Only, VADA Shu et al. (2018), Co-DA Kumar et al. (2018), VMT (ours), VADA+DIRT-T Shu et al. (2018), Co-DA+DIRT-T Kumar et al. (2018), and VMT+DIRT-T (ours), each reported both without and with instance-normalized input.]

Table 1: Test set accuracy on the visual domain adaptation benchmark datasets. For all tasks, VMT improves the accuracy of VADA and achieves state-of-the-art performance.

4.2 Model Evaluation

We evaluate VMT on the following unsupervised domain adaptation tasks; the results are shown in Table 1. Our proposed VMT achieves state-of-the-art performance on all tasks.

MNIST → SVHN. We first evaluate VMT on the adaptation task from MNIST to SVHN. Adapting from MNIST to SVHN is usually treated as a challenging task Ganin et al. (2016); Shu et al. (2018) since the intrinsic dimensionality of MNIST is significantly lower than that of SVHN. It is especially difficult when the input is not instance-normalized, as shown in Table 1. For MNIST → SVHN without instance normalization, VADA removes the conditional entropy minimization (Eq. 2), as it behaves unstably and quickly finds a degenerate solution Shu et al. (2018). We find that this problem no longer exists in our model, so we retain the conditional entropy minimization during training. For MNIST → SVHN, we observe significant improvements over the baselines. In particular, without instance normalization, VMT+DIRT-T outperforms VADA+DIRT-T by 31.9% and Co-DA+DIRT-T by 23.4%, and VMT outperforms VADA and Co-DA by 20.4% and 12.6%, respectively. With instance normalization, VMT+DIRT-T achieves an accuracy of 96.4%. Moreover, we train a classifier on the target domain (i.e., SVHN) with labels revealed, using the same network architecture and settings; this serves as an upper bound for domain adaptation methods. This train-on-target model achieves an accuracy of 96.5%, so the accuracy of VMT+DIRT-T (96.4%) is very close to the upper bound.

SVHN → MNIST. This task is much easier than MNIST → SVHN, and VADA already achieves high accuracy on it. VMT still improves the accuracy of VADA by 4.6% and 1% with and without instance normalization, respectively. VMT+DIRT-T performs similarly to VADA+DIRT-T, and VMT performs similarly to Co-DA on this task.

MNIST → MNIST-M. We then evaluate on the adaptation task from MNIST to MNIST-M, where the images in MNIST-M are constructed by blending MNIST digits with randomly cropped color patches from the BSDS500 dataset. For this task, VMT improves the accuracy of VADA by 2.5% and 1.3% with and without instance normalization, respectively, and performs similarly to Co-DA.

SYN → SVHN. We also evaluate on the adaptation task from Synthetic Digits (SYN) to SVHN. The SYN dataset is constructed by rendering digit images with standard fonts while varying the position, orientation, background, stroke color, and amount of blur. Similar to MNIST → MNIST-M, we observe a reasonable improvement of VMT over VADA and a similar performance between VMT and Co-DA.

CIFAR-10 → STL-10. There are nine overlapping classes between CIFAR-10 and STL-10. Following French et al. (2018); Shu et al. (2018); Kumar et al. (2018), we remove the non-overlapping classes and keep the nine overlapping ones. For this task, VMT improves the accuracy of VADA by 2.6% and 1.6% with and without instance normalization, respectively, and performs similarly to Co-DA. Note that DIRT-T has no effect on this task because STL-10 contains a very small training set, making it difficult to estimate the conditional entropy.

STL-10 → CIFAR-10. We finally evaluate on the adaptation task from STL-10 to CIFAR-10. For this task, VMT outperforms VADA by about 5% and Co-DA by about 2%, both with and without instance normalization. When using DIRT-T, VMT+DIRT-T outperforms VADA+DIRT-T by 4.7% and 3.9% and Co-DA+DIRT-T by 2.1% and 1.6% with and without instance normalization, respectively.

[Table 2 spans the six tasks of Table 1 (MNIST → SVHN, SVHN → MNIST, MNIST → MNIST-M, SYN → SVHN, CIFAR → STL, STL → CIFAR) with instance-normalized input.]

Table 2: Test set accuracy in comparison experiments between VAT and VMT. $\mathcal{L}_c$ denotes the conditional entropy loss, $\mathcal{L}_v$ denotes the VAT loss, and $\mathcal{L}_m$ denotes the VMT loss. A configuration such as $\mathcal{L}_c + \mathcal{L}_m$ means that we only use $\mathcal{L}_y$, $\mathcal{L}_d$, $\mathcal{L}_c$, and $\mathcal{L}_m$ in Eq. 7, setting the weights of the other losses to 0. The results of the VAT-based configurations are duplicated from Shu et al. (2018).

[Table 3 lists methods and their accuracies for the ablation over the source- and target-domain VMT losses.]

Table 3: Test set accuracy on the adaptation task of MNIST → SVHN with instance-normalized input. $\mathcal{L}_c$ denotes the conditional entropy loss, and $\mathcal{L}_{ms}$ and $\mathcal{L}_{mt}$ denote the VMT loss on the source and target domains, respectively. For example, $\mathcal{L}_c + \mathcal{L}_{mt}$ means that we only use $\mathcal{L}_y$, $\mathcal{L}_d$, $\mathcal{L}_c$, and $\mathcal{L}_{mt}$ in Eq. 7, setting the weights of the other losses to 0. The accuracy of $\mathcal{L}_c$ differs from Shu et al. (2018) because we use a different hyperparameter setting for MNIST → SVHN with instance-normalized input.

4.3 Comparing with Virtual Adversarial Training

As stated in Section 3.1.2, virtual adversarial training (VAT) Miyato et al. (2018) is another approach to imposing the locally-Lipschitz constraint, as used by Shu et al. (2018). We conduct comparison experiments between VAT (Eq. 3) and our proposed VMT (Eq. 6); the results are shown in Table 2. VMT achieves higher accuracy than VAT on all tasks, which demonstrates that VMT surpasses VAT in favoring the cluster assumption. Furthermore, combining VMT and VAT further improves performance. This shows that VMT is orthogonal to VAT, and the two can be used together to constrain local Lipschitzness. Compared with VAT, another advantage of VMT is its low computational cost: for the task of MNIST → SVHN with instance normalization, VMT takes about 100 seconds per 1000 iterations on our GPU server, while VAT needs about 140 seconds. A more detailed comparison of accuracy over time is presented in Appendix B.

4.4 Analysis of VMT on the Source Domain

To analyze the role of VMT on the source domain, i.e., the $\mathcal{L}_m(\theta; \mathcal{D}_s)$ term in Eq. 7 (denoted $\mathcal{L}_{ms}$ in Table 3), we present an ablation study in Table 3, conducted on the adaptation task of MNIST → SVHN with instance normalization. From Table 3, we make three observations. First, VMT on the target domain plays a critical role, as $\mathcal{L}_c + \mathcal{L}_{mt}$ achieves much higher accuracy than $\mathcal{L}_c + \mathcal{L}_{ms}$. This is reasonable because our final goal is to classify samples from the target domain. Second, applying VMT on both the source and target domains further improves performance by 3.5%. This may be because applying VMT on both domains yields similar feature vectors for the source and target domains, which further improves performance on the target domain. Third, applying VMT on the source domain alone has a negative impact: $\mathcal{L}_c + \mathcal{L}_{ms}$ shows a lower accuracy than $\mathcal{L}_c$, reducing the accuracy from 70.2% to 68.4%.

[Figure 2: five T-SNE panels: (a) Source only, (b) VADA, (c) VADA+DIRT-T, (d) VMT, (e) VMT+DIRT-T.]

Figure 2: T-SNE visualization of the last hidden layer for MNIST (red) to SVHN (blue) without instance normalization. Compared with VADA, VMT generates closer feature vectors for the source and target domains and shows stronger clustering of the target domain. VMT+DIRT-T brings the source and target features closest together.

4.5 Visualization of Representation

We further present T-SNE visualizations in Figure 2, using the most challenging task (MNIST → SVHN without instance normalization) to highlight the differences. As shown in Figure 2, source-only training yields a discriminative clustering of the source domain but collapses the target domain into a single cluster. VMT draws the features of the source and target domains much closer together than VADA does and shows stronger clustering of the target samples. VMT+DIRT-T brings the source and target feature vectors closer still.

5 Conclusion

In this paper, we proposed a novel method, called Virtual Mixup Training, for unsupervised domain adaptation. VMT is designed to constrain local Lipschitzness, which further improves the performance of the cluster assumption Grandvalet and Bengio (2005); Shu et al. (2018). The idea of VMT is to make predictions change linearly along the lines between pairs of training samples. In particular, we first construct convex combinations of training samples and their virtual labels, and then add a penalty term that punishes the difference between the prediction of the combined sample and the combined virtual label. We show empirically that VMT significantly improves the performance of a recent state-of-the-art model called VADA, which is based on the cluster assumption. For the challenging adaptation task from MNIST to SVHN, our model achieves an accuracy of 96.4%, which is very close to the accuracy (96.5%) of the train-on-target model. Given the strong performance of VMT, we would like to explore its use in other computer vision tasks, which we leave as future work.

References

  • Bachman et al. [2014] Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems, pages 3365–3373, 2014.
  • Ben-David and Urner [2014] Shai Ben-David and Ruth Urner. Domain adaptation—can quantity compensate for quality? Annals of Mathematics and Artificial Intelligence, 70(3):185–202, 2014.
  • Ben-David et al. [2010] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 2010.
  • Berthelot et al. [2019] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. arXiv:1905.02249, 2019.
  • Blum and Mitchell [1998] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.
  • Bousmalis et al. [2016a] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016a.
  • Bousmalis et al. [2016b] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems 29, pages 343–351, 2016b.
  • Bousmalis et al. [2017] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 3722–3731, 2017.
  • Chen et al. [2011] Minmin Chen, Kilian Q. Weinberger, and John C. Blitzer. Co-training for domain adaptation. In Advances in Neural Information Processing Systems, pages 2456–2464, 2011.
  • Das and Lee [2018] Debasmit Das and C.S. George Lee. Sample-to-sample correspondence for unsupervised domain adaptation. arXiv:1805.00355, 2018.
  • French et al. [2018] Geoff French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for visual domain adaptation. In International Conference on Learning Representations, 2018.
  • Ganin et al. [2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, January 2016.
  • Ghifary et al. [2016] Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597–613, 2016.
  • Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • Goodfellow et al. [2015] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
  • Grandvalet and Bengio [2005] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems 17, pages 529–536, 2005.
  • Guo et al. [2019] Hongyu Guo, Yongyi Mao, and Richong Zhang. Mixup as locally linear out-of-manifold regularization. In AAAI Conference on Artificial Intelligence, 2019.
  • Hoffman et al. [2018] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, pages 1989–1998, 2018.
  • Huang et al. [2007] Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex J. Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems 19, pages 601–608, 2007.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • Kumar et al. [2018] Abhishek Kumar, Prasanna Sattigeri, Kahini Wadhawan, Leonid Karlinsky, Rogerio Feris, Bill Freeman, and Gregory Wornell. Co-regularized alignment for unsupervised domain adaptation. In Advances in Neural Information Processing Systems, pages 9345–9356, 2018.
  • Laine and Aila [2017] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, 2017.
  • Liu et al. [2017] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, 2017.
  • Long et al. [2015a] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, 2015a.
  • Long et al. [2015b] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105, 2015b.
  • Mansour et al. [2009] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv:0902.3430, 2009.
  • Mao and Li [2018] Xudong Mao and Qing Li. Unpaired multi-domain image generation via regularized conditional gans. arXiv:1805.02456, 2018.
  • Miyato et al. [2018] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • Murez et al. [2018] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Saito et al. [2017] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In International Conference on Machine Learning, pages 2988–2997, 2017.
  • Saito et al. [2018] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Sajjadi et al. [2016] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pages 1163–1171, 2016.
  • Sener et al. [2016] Ozan Sener, Hyun Oh Song, Ashutosh Saxena, and Silvio Savarese. Learning transferrable representations for unsupervised domain adaptation. In Advances in Neural Information Processing Systems, pages 2110–2118, 2016.
  • Shimodaira [2000] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
  • Shu et al. [2018] Rui Shu, Hung Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-t approach to unsupervised domain adaptation. In International Conference on Learning Representations, 2018.
  • Sindhwani et al. [2005] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In Workshop on Learning with Multiple Views, International Conference on Machine Learning, 2005.
  • Sun and Saenko [2016] Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision Workshops, pages 443–450, 2016.
  • Taigman et al. [2017] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. In International Conference on Learning Representations, 2017.
  • Tarvainen and Valpola [2017] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pages 1195–1204, 2017.
  • Tokozume et al. [2018] Yuji Tokozume, Yoshitaka Ushiku, and Tatsuya Harada. Between-class learning for image classification. In The IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Tzeng et al. [2015] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In IEEE International Conference on Computer Vision, 2015.
  • Tzeng et al. [2017] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • Verma et al. [2018] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, Aaron Courville, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. arXiv:1806.05236, 2018.
  • Verma et al. [2019] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In International Joint Conferences on Artificial Intelligence, 2019.
  • Xie et al. [2018] Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, pages 5423–5432, 2018.
  • Zhang et al. [2018] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

Appendix A Hyperparameters

Task Instance Normalization
MNIST → SVHN Yes
MNIST → SVHN No
SVHN → MNIST Yes, No
MNIST → MNIST-M Yes, No
SYN → SVHN Yes, No
CIFAR → STL Yes, No
STL → CIFAR Yes, No
Table 4: Details of the hyperparameters. We set the refinement interval Shu et al. [2018] of DIRT-T to 5000 iterations. The only exception is MNIST → MNIST-M: for this special case, we set the refinement interval to 500 and use a different weight for the KL term $\beta$ in Eq. 8.

Appendix B Dynamic Accuracy Results of VAT and VMT

[Figure 3: time_cmp.pdf]

Figure 3: Test set accuracy over time on the adaptation task of MNIST → SVHN with instance normalization. The blue and red lines correspond to the two configurations compared in Table 2 (VMT and VAT).