Learning Smooth Representation for Unsupervised Domain Adaptation

05/26/2019 · Guanyu Cai, et al.

In unsupervised domain adaptation, existing methods that utilize the decision boundary have achieved remarkable performance, but they lack an analysis of the relationship between the decision boundary and features. In our work, we propose a new principle: adaptive classifiers and transferable features can be obtained in the target domain by learning smooth representations. We analyze the relationship between the decision boundary and ambiguous target features in terms of smoothness. Thereafter, local smooth discrepancy is defined to measure the smoothness of a sample and to detect sensitive samples that are easily misclassified. To strengthen smoothness, sensitive samples are corrected in feature space by optimizing the local smooth discrepancy. Moreover, an upper bound on the generalization error is derived theoretically. Finally, we evaluate our method on several standard benchmark datasets. Empirical evidence shows that the proposed method is comparable or superior to the state-of-the-art methods and that local smooth discrepancy is a valid metric for evaluating the performance of a domain adaptation method.


1 Introduction

The performance of various computer vision tasks has improved significantly with the development of deep convolutional neural networks (CNNs) [krizhevsky2012imagenet]. However, this improvement presupposes that numerous labeled samples are available and that test samples are drawn from the same distribution as the training ones. Once there is a dataset shift between the training and test samples, the performance of a CNN model drops dramatically [donahue2014decaf; ben2010theory]. To tackle this problem, unsupervised domain adaptation (UDA) transfers knowledge from a labeled source domain to an unlabeled target domain.

One class of classical UDA methods matches moments of features in the source and target domains, regarding the moments of a distribution as its main characteristics [zellinger2017central; sun2016deep; pmlr-v37-long15]. By matching moments, these methods aim to match the distributions of different domains. Another group of remarkable UDA methods is based on an adversarial training strategy. [ganin2016domain] first introduced a domain classifier to match the distributions of distinct domains, and [liu2016coupled; bousmalis2017unsupervised] also propose effective methods based on this principle. However, these methods ignore the effect of the decision boundary: aligning distributions does not ensure that target samples are well classified. The relationship between the decision boundary and target samples is also important.

[saito2018maximum; long2018conditional; long2017deep] regard the output of the softmax function as complementary information for training a UDA model. These methods still focus on matching the distributions of different domains, but they take the influence of the decision boundary into account. Moreover, MCD [saito2018maximum] explains that misclassification in the target domain is caused by ambiguous target features near the decision boundary and proposes a Siamese-like network to overcome this phenomenon. These methods achieve excellent performance and their philosophies are clearly illustrated. However, they do not explain well why a decision boundary that is robust in the source domain becomes invalid in the target domain.

In this work, we give a new point of view to explain why a decision boundary is ill-suited to the target domain. In our opinion, the smoothness of target samples is the key to model performance. We therefore define a non-negative function named local smooth discrepancy to measure the smoothness of a sample. Finally, an optimization schedule based on local smooth discrepancy is proposed to tackle the UDA problem.

In particular, we observe that the performance drop in the target domain resembles the overfitting problem, which is common for deep learning models because of their extreme nonlinearity. In UDA, the dataset shift between domains intensifies this overfitting. A popular belief based on widely observed facts is that models with good generalization ability are smooth [miyato2018virtual]. Based on this principle, we explain how dataset shift and smoothness affect performance in the target domain. Further, we define a formula named local smooth discrepancy to measure the smoothness of a target sample, and introduce two specific ways of computing it. Utilizing local smooth discrepancy, we propose a detailed training strategy for the UDA problem. The model contains a feature generator $G$ and a classifier $C$. The classifier tries to classify source samples correctly and to detect sensitive target samples that are easily misclassified; the principle used to detect sensitive samples is that they are not smooth with respect to the classifier. The feature generator is trained to strengthen the smoothness of these sensitive samples. In other words, we seek samples that are not smooth and modify them to be smooth. Note that we add no auxiliary network, which is common in MCD [saito2018maximum] and adversarial training methods [ganin2016domain; long2018conditional].

2 Related Work

A theoretical work proposed in [ben2010theory] confirms that a discrepancy between the source and target distributions renders a model invalid in the target domain. Because the distribution of a domain is difficult to characterize directly, an intuitive idea is to match moments of the distributions instead. [pmlr-v37-long15] matches expectations of features, [sun2016deep] takes the covariance of features into consideration, and [zellinger2017central] utilizes high-order moments to align the distributions of different domains. These methods perform well in numerous settings.

Since DANN [ganin2016domain] was proposed, methods based on adversarial training have gradually become popular. These methods introduce a domain classifier to predict which domain a sample is drawn from, while a feature generator is trained to fool the domain classifier so that features from different domains are matched. [tzeng2017adversarial] follows this philosophy and implements adversarial training in feature space. Several methods implement adversarial training at the pixel level [murez2018image; bousmalis2017unsupervised; liu2016coupled]; they try to generate target images from labeled source images so that a classification model can be trained with labeled target-style images. Specifically, PixelDA [bousmalis2017unsupervised] follows the training strategy of generative adversarial networks (GANs) and obtains excellent performance on digits datasets. [murez2018image; liu2016coupled] introduce training strategies similar to CycleGAN to improve their performance.

However, an essential factor, the decision boundary, is ignored in these methods. They only take the marginal distributions into consideration while neglecting the information contained in predictions. To address this, several methods utilize category-discriminative information [8451152; 10.1007/978-3-030-01424-7_34; DAS201880]. JAN [long2017deep] modifies DAN [pmlr-v37-long15] to match joint distributions. CADA [long2018conditional] subtly changes the training strategy of DANN [ganin2016domain] and achieves remarkable results. ATDA [pmlr-v70-saito17a] adds two auxiliary classifiers that generate valid pseudo labels for target samples, constructing decision boundaries for the target domain from two different views. Moreover, MCD [saito2018maximum] gives a concise explanation of why matching marginal distributions can still cause misclassification, and proposes a Siamese-like network and an adversarial training strategy to solve the UDA problem; it is comparable or superior to existing methods on several benchmark domain adaptation datasets.

3 Preliminary

In this section, we give a brief description of UDA and define several notions. In the context of vanilla UDA, source images $x_s$ and corresponding labels $y_s$ are drawn from the underlying joint source distribution $p_S(x, y)$. Meanwhile, there is a set of unlabeled target images $x_t$ drawn from a marginal target distribution $p_T(x)$. The goal of a UDA task is to train a model with $\{x_s, y_s\}$ and $\{x_t\}$ to predict the target labels $y_t$. In practical settings, we only have access to finite samples: a source domain $S = \{(x_s^i, y_s^i)\}_{i=1}^{n_s}$ and a target domain $T = \{x_t^j\}_{j=1}^{n_t}$ consisting of $n_s$ and $n_t$ samples, respectively.

In our setting, the trained model is composed of a feature generator $G$ and a classifier $C$. $G$ takes inputs $x_s$ or $x_t$, and the generated features $G(x_s)$ or $G(x_t)$ are fed into $C$ to be classified into different classes. Because of the dataset shift between the source and target domains, a model that achieves a low source risk $\epsilon_S$ can easily incur an unsatisfactory target risk $\epsilon_T$. Thus, the goal of UDA methods narrows down to reducing $\epsilon_T$.
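To make the setup concrete, the following is a minimal PyTorch sketch of this generator-classifier composition. The architectures here are illustrative placeholders, not the networks used in the experiments.

```python
# Minimal sketch of the G/C composition, assuming small illustrative
# architectures (not the authors' exact networks).
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """G: maps an input image x to a feature vector G(x)."""
    def __init__(self, in_channels=3, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class Classifier(nn.Module):
    """C: maps a feature vector G(x) to class logits."""
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, f):
        return self.fc(f)
```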

4 Learning Smooth Representation

4.1 Main Idea

Figure 1: A visual illustration of how the proposed method achieves adaptation. (Upper) The left block explains why dataset shift between different domains causes misclassification. The other two blocks depict the process of forming a robust margin between target samples and the decision boundary. At the same time, distributions of the target and source domains are matched. (Lower) The optimization schedule is able to project sensitive samples to new locations away from the decision boundary in three steps. Dashed lines indicate fixed network parameters.

Previous UDA methods, such as moment alignment [sun2016deep; pmlr-v37-long15] and adversarial training [ganin2016domain; tzeng2017adversarial], aim to completely match the feature distributions of different domains. Although they achieve excellent performance on various tasks, the decision boundary, which implies the determinacy of predictions, is ignored. MCD [saito2018maximum] takes it into consideration, but discusses it only within a specific Siamese-like network architecture. In this work, we consider a more general setting in which a classification network is composed only of a feature generator $G$ and a classifier $C$. As shown in Figure 1, a $C$ well trained in the source domain is able to classify source samples correctly; for a robust model, there is a large margin between the decision boundary and the samples. However, the dataset shift across domains makes it probable that a set of target samples crosses the boundary decided by the source domain, and these samples are misclassified. Therefore, we regard samples near the decision boundary as sensitive ones that easily lead $C$ into errors. This is similar to the overfitting problem common in deep learning models: the extreme nonlinearity of deep neural networks causes a performance drop when facing out-of-distribution samples. In UDA, the dataset shift further exacerbates this situation, since numerous target samples are out-of-distribution for $C$. To obtain a $C$ that works well in the target domain, $G$ needs to project $x_t$ into regions of feature space away from the decision boundary.

An intuitive way to form a large margin between target samples and the source decision boundary is to detect samples close to the boundary and force $G$ to project them far away from it. As shown in Figure 1, samples near the decision boundary are less smooth than those far away from it. In our setting, smoothness means that a sample and the samples around it should belong to the same category. Therefore, if a sample lies in close proximity to the decision boundary, some samples in its neighborhood inevitably cross the boundary, so smoothness breaks down. On the contrary, samples away from the boundary robustly preserve smoothness. This observation inspires us to design a specific algorithm for the UDA problem: we define a concrete formula to measure the smoothness of a sample, use it to detect sensitive samples, and then train $G$ to project them far away from the boundary; in other words, they become smooth. Figure 1 gives a visual illustration of this procedure: not only is the dataset shift between the source and target domains reduced, but source and target samples also achieve a large margin from the decision boundary, so the desired smooth representations of both domains are learned. An interesting point is that promoting the smoothness of a model is a typical principle for alleviating overfitting; however, whereas a regularization term is the common remedy for overfitting, we design a novel optimization schedule to tackle the UDA problem.

4.2 Local Smooth Discrepancy

According to the definition of smoothness in our setting, we define a concrete formula named local smooth discrepancy (LSD) to measure it:

$$\mathrm{LSD}(x_t; \theta) = d\big(C(G(x_t)),\, C(G(x_t) + r)\big), \qquad \|r\| \le \epsilon, \tag{1}$$

where $\theta$ denotes all the parameters of the model, $r$ denotes a noise added to the output of $G$, and $\epsilon$ denotes the maximum norm of $r$. $d$ is a discrepancy function that measures the divergence between two outputs of $C$. In (1), the noise $r$ is added to the feature vector $G(x_t)$ to probe another feature vector in $G(x_t)$'s neighborhood, and $\epsilon$ controls the range of sampling in that neighborhood. As the choice of $d$, we employ the cross-entropy loss function in all experiments.
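As a concrete reference, the sketch below computes Eq. (1) with cross-entropy as $d$, assuming the illustrative $G$ and $C$ modules above; how the noise $r$ is produced is deferred to the two plans that follow.

```python
import torch
import torch.nn.functional as F

def local_smooth_discrepancy(G, C, x_t, r):
    """Eq. (1): d(C(G(x_t)), C(G(x_t) + r)) with d = cross-entropy.

    `r` is any noise tensor with norm at most eps, produced by the
    isotropic or anisotropic plan described below.
    """
    f = G(x_t)                                   # feature vector G(x_t)
    p_clean = F.softmax(C(f), dim=1).detach()    # prediction at G(x_t)
    logp_noisy = F.log_softmax(C(f + r), dim=1)  # prediction in the neighborhood
    # Cross-entropy between the clean and perturbed predictive distributions.
    return -(p_clean * logp_noisy).sum(dim=1).mean()
```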

Although LSD is well defined, there is still an essential point that deserves attention. Specifically, $r$ in (1) is limited only by its norm, and its direction is ignored. In fact, the purpose of adding $r$ includes detecting sensitive samples. If $G(x_t)$ is close to the decision boundary but all $G(x_t) + r$ belong to the same category as $G(x_t)$, the direction of $r$ fails to detect sensitive samples; in this case, sensitive samples are not pushed away from the boundary, and a robust margin between target samples and the decision boundary cannot be formed. To solve this problem, we propose two plans to produce $r$: an isotropic one and an anisotropic one.

Isotropic Plan In the isotropic plan, we draw $r$ from a Gaussian distribution and normalize it to satisfy $\|r\| = \epsilon$. Because $r$ is isotropic, a considerable number of the perturbed features $G(x_t) + r$ must lie closer to the decision boundary than the corresponding $G(x_t)$. Thus, for $G(x_t)$ near the boundary, several perturbations that break smoothness are detected. The formula of LSD is modified into:

$$\mathrm{LSD}_{\mathrm{iso}}(x_t; \theta) = d\big(C(G(x_t)),\, C(G(x_t) + r)\big), \qquad r \sim \mathcal{N}(0, I),\ \|r\| = \epsilon. \tag{2}$$
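A minimal sketch of the isotropic plan, assuming a per-sample L2 normalization of the Gaussian draw:

```python
def isotropic_noise(f, eps):
    """Eq. (2): draw r ~ N(0, I) and rescale it so that ||r||_2 = eps.

    `f` is a batch of feature vectors G(x_t); the returned noise has
    the same shape.
    """
    r = torch.randn_like(f)
    return eps * r / (r.norm(dim=1, keepdim=True) + 1e-12)
```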

Anisotropic Plan The anisotropic plan only looks for noise $r$ that drags $G(x_t)$ closer to the decision boundary and ignores $r$ in other directions. To reach this goal, we take insight from adversarial attacks [goodfellow2014explaining]: an adversarial attack applies a barely perceptible perturbation to an image, found by maximizing the model's prediction error, to cause the model to misclassify. This philosophy fits well with our goal of seeking noise that makes the perturbed feature vectors belong to different classes. However, such perturbations are found at the image level and require true labels, which are unreachable for the target domain in UDA; moreover, in our setting, perturbations are added in feature space. We therefore make several modifications to traditional adversarial attack methods. In detail, traditional adversarial attack methods approximate the adversarial perturbation by:

$$r_{\mathrm{adv}} = \epsilon\, \mathrm{sign}\big(\nabla_{x}\, \ell(y, C(G(x)))\big). \tag{3}$$

Instead, we approximate it by:

$$r_{\mathrm{adv}} = \epsilon\, \mathrm{sign}\big(\nabla_{G(x_t)}\, \ell(\hat{y}_t, C(G(x_t)))\big), \qquad \hat{y}_t = \mathrm{onehot}\big(C(G(x_t))\big), \tag{4}$$

where $\mathrm{onehot}(\cdot)$ denotes transforming the softmax output of $C$ into a one-hot vector. Equation (4) computes gradients in the feature space instead of the image space and replaces the true label $y$ with the pseudo label $\hat{y}_t$. These modifications result in a new LSD for anisotropic noise:

$$\mathrm{LSD}_{\mathrm{ani}}(x_t; \theta) = d\big(C(G(x_t)),\, C(G(x_t) + r_{\mathrm{adv}})\big). \tag{5}$$
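The sketch below is one way to implement Eq. (4), under the assumption that $\epsilon$ scales the sign of the feature-space gradient and that the classifier's own one-hot prediction serves as the pseudo label:

```python
def anisotropic_noise(G, C, x_t, eps):
    """Eq. (4): FGSM-style perturbation computed in feature space,
    using the classifier's own prediction as the pseudo label."""
    f = G(x_t).detach().requires_grad_(True)  # perturb features, not pixels
    logits = C(f)
    pseudo = logits.argmax(dim=1)             # one-hot pseudo label \hat{y}_t
    loss = F.cross_entropy(logits, pseudo)
    grad = torch.autograd.grad(loss, f)[0]    # gradient w.r.t. G(x_t)
    return eps * grad.sign()                  # direction toward the boundary
```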

4.3 Optimization for Local Smooth Discrepancy

After defining the local smooth discrepancy, we design an optimization schedule that follows our motivation: we seek sensitive samples that violate the smoothness rule and train $G$ to project them into smooth locations.

First, we train $G$ and $C$ in the source domain to obtain the decision boundary, which is crucial to the following steps:

$$\min_{G, C}\ -\mathbb{E}_{(x_s, y_s) \sim S} \sum_{k=1}^{K} \mathbb{1}[k = y_s] \log C(G(x_s))_k, \tag{6}$$

where $\mathbb{1}[\cdot]$ is an indicator function and $K$ denotes the number of classes in a task. Then, we produce sensitive samples based on the decision boundary. Note that the supervised learning in (6) forms a robust margin between source samples and the decision boundary, so we focus on target samples in this step. In our work, sensitive samples are generated in the feature space of $G$:

$$\tilde{f}_t = G(x_t) + r, \tag{7}$$

where $\|r\| \le \epsilon$ and $r$ is a general notation for the added noise; in practice, we set $r \sim \mathcal{N}(0, I)$ with $\|r\| = \epsilon$ for the isotropic plan, or $r = r_{\mathrm{adv}}$ from (4) for the anisotropic plan. Finally, we train $G$ to minimize the local smooth discrepancy of target samples. Only the parameters of $G$ are updated in this step; $G$ is trained to project $\tilde{f}_t$ to the same category as $G(x_t)$:

$$\min_{\theta_G}\ \mathbb{E}_{x_t \sim T}\ \ell\big(\hat{y}_t,\, C(\tilde{f}_t)\big), \tag{8}$$

where $\theta_G$ denotes the parameters of $G$ and $\ell$ denotes the cross-entropy loss function. Steps (6), (7), and (8) are repeated in the optimization schedule, as shown in Figure 1.
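One iteration of the three-step schedule might look like the following sketch. The optimizer choice, learning rates, and loader names are placeholders, since the paper's exact values are not reproduced here; the $\epsilon = 0.5$ matches the setting reported in the experiments.

```python
# Sketch of one iteration of the schedule in Eqs. (6)-(8).
# Optimizer settings and loader names are illustrative placeholders.
opt_gc = torch.optim.Adam(list(G.parameters()) + list(C.parameters()), lr=1e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)

for (x_s, y_s), x_t in zip(source_loader, target_loader):
    # Step 1, Eq. (6): supervised cross-entropy on the source domain.
    opt_gc.zero_grad()
    F.cross_entropy(C(G(x_s)), y_s).backward()
    opt_gc.step()

    # Step 2, Eq. (7): generate sensitive samples in feature space.
    r = anisotropic_noise(G, C, x_t, eps=0.5)  # or isotropic_noise(G(x_t), 0.5)

    # Step 3, Eq. (8): update G only, pulling G(x_t) + r back to the
    # pseudo label of the clean feature.
    opt_g.zero_grad()
    f_t = G(x_t)
    pseudo = C(f_t).argmax(dim=1).detach()
    F.cross_entropy(C(f_t + r), pseudo).backward()
    opt_g.step()
```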

4.4 Generalization Error Bound

In this section, we theoretically analyze why the proposed method works well in the target domain. Ben-David et al. [ben2010theory] proposed a theory that bounds the expected error in the target domain $\epsilon_T(h)$. The bound is composed of three terms: the expected error in the source domain $\epsilon_S(h)$, a distance $d_{\mathcal{H}\Delta\mathcal{H}}(S, T)$ that measures the discrepancy between the source and target domains, and the combined error $\lambda$ of the ideal joint hypothesis.

Theorem 1 ([ben2010theory]) Let $\mathcal{H}$ be the hypothesis class. Suppose that $\mathcal{H}$ is symmetric (i.e., for $h \in \mathcal{H}$, its inverse $-h$ is also in $\mathcal{H}$). Given two domains $S$ and $T$, for any $h \in \mathcal{H}$ we have:

$$\epsilon_T(h) \le \epsilon_S(h) + \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(S, T) + \lambda, \qquad d_{\mathcal{H}\Delta\mathcal{H}}(S, T) = 2 \sup_{h, h' \in \mathcal{H}} \Big|\, \mathbb{E}_{x \sim S}\, \mathbb{1}[h(x) \neq h'(x)] - \mathbb{E}_{x \sim T}\, \mathbb{1}[h(x) \neq h'(x)] \,\Big|, \tag{9}$$

where $\epsilon_S(h)$ and $\epsilon_T(h)$ are the errors of hypothesis $h$ in the corresponding domains, and $\mathbb{1}[\cdot]$ denotes an indicator function.

According to [ben2010theory], $\epsilon_S(h)$ is extremely small with supervised learning and $\lambda$ is also negligible if a UDA problem is solvable. Thus, we concentrate on minimizing $d_{\mathcal{H}\Delta\mathcal{H}}(S, T)$. Because the true labels of the source domain are available, both $h$ and $h'$ are able to classify source samples correctly, so the source term $\mathbb{E}_{x \sim S}\, \mathbb{1}[h(x) \neq h'(x)]$ is assumed to be low. Thus, to approximate $d_{\mathcal{H}\Delta\mathcal{H}}(S, T)$, we need to approximate $\sup_{h, h' \in \mathcal{H}} \mathbb{E}_{x \sim T}\, \mathbb{1}[h(x) \neq h'(x)]$.

Actually, MCD [saito2018maximum] trains two different classifiers to be inconsistent in the target domain in order to reach the supremum of $\mathbb{E}_{x \sim T}\, \mathbb{1}[h(x) \neq h'(x)]$. In our method, however, we try to approximate it without an auxiliary classifier $h'$. Assume that there exists an ideal $h^*$ whose target error is extremely low. If there is a set of sensitive samples $\tilde{x}_t$ that is also drawn from $p_T$ and can easily lead $h$ to wrong results, the supremum is well approximated by the disagreement between $h$ and $h^*$ on $\tilde{x}_t$. According to [miyato2018virtual], both the proposed isotropic and anisotropic plans are able to produce perturbed samples that follow $p_T$ and lead to results inconsistent with those of $h$; therefore, we regard the generated sensitive samples as $\tilde{x}_t$ in our method. In practice, since $h^*$ is unavailable, we replace its prediction with that of $h$ on the clean sample $x_t$. Note that $\tilde{x}_t$ is generated from the corresponding $x_t$ and their labels are the same, so this replacement is reasonable. Finally, this term is approximated by:

$$\mathbb{E}_{x_t \sim T}\ \mathbb{1}\big[C(G(x_t) + r) \neq C(G(x_t))\big]. \tag{10}$$

In fact, (10) is a discrete way to measure the smoothness of target samples. To make the objective optimizable with SGD, we propose the local smooth discrepancy as the objective function in place of (10). In addition, although using pseudo labels of $x_t$ in place of the unreachable true labels cannot guarantee that the ideal hypothesis is obtained, the iterative optimization and the increasing accuracy of $C$ during training ensure that our method works well.
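As a rough empirical proxy for (10), one could count prediction flips under isotropic noise (the anisotropic plan needs gradients, which a no-grad evaluation loop does not provide). The following is a sketch under those assumptions:

```python
def smoothness_violation_rate(G, C, target_loader, eps):
    """Monte-Carlo estimate of Eq. (10): the fraction of target samples
    whose perturbed feature G(x_t) + r changes the predicted class."""
    violated, total = 0, 0
    with torch.no_grad():
        for x_t in target_loader:
            f = G(x_t)
            r = isotropic_noise(f, eps)  # gradient-free plan
            flips = C(f).argmax(dim=1) != C(f + r).argmax(dim=1)
            violated += flips.sum().item()
            total += x_t.size(0)
    return violated / total
```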

5 Experiments

In order to verify the effectiveness of the proposed method (SRDA), we conduct several classification experiments on standard benchmark datasets. First, we test SRDA on several digits classification datasets, which are the most common datasets for UDA. Then, we test it on a more complex and massive dataset, VisDA [peng2017visda], to show its advanced performance. Finally, we analyze the effectiveness of LSD in two settings. In all experiments, we implement the models with PyTorch and employ the optimization schedule we propose.

5.1 Digits Classification

We evaluate four types of adaptation scenarios utilizing the digits datasets MNIST [lecun1998gradient], USPS [hull1994database], Synthetic Traffic Signs (SYNSIG) [moiseev2013evaluation], Street View House Numbers (SVHN) [netzer2011reading], and the German Traffic Signs Recognition Benchmark (GTSRB) [stallkamp2011german]. Specifically, MNIST, USPS, and SVHN consist of 10 classes, whereas SYNSIG and GTSRB are traffic sign datasets with 43 classes. We set four transfer tasks: SVHN→MNIST, SYNSIG→GTSRB, MNIST→USPS, and USPS→MNIST. The dataset shift in SVHN→MNIST is caused by a difference in image properties: SVHN contains RGB images while MNIST contains grayscale images. The shift between USPS and MNIST is relatively small, because both are grayscale handwritten digit datasets. Images in SYNSIG and GTSRB have distinct properties, because those in SYNSIG are synthesized while those in GTSRB are collected from the real world. For a fair comparison, we follow the protocols provided in MCD [saito2018maximum] and ADDA [tzeng2017adversarial].

Method        SVHN→MNIST  SYNSIG→GTSRB  MNIST→USPS  USPS→MNIST
Source Only   67.1        85.1          76.7        63.4
DAN           71.1        91.1          -           -
DANN          71.1        88.7          77.1        73.0
DSN           82.7        93.1          91.3        -
ADDA          76.0        -             89.4        90.1
CoGAN         -           -             91.2        89.1
ATDA          86.2        96.2          -           -
ASSC          95.7        82.8          -           -
DRCN          82.0        -             91.8        73.7
MCD           96.2        94.4          94.2        94.1
SRDA (FGSM)   93.73       90.46         80.29       95.25
SRDA (VAT)    95.10       89.45         76.85       95.49
SRDA (RAN)    98.81       89.28         93.42       94.46

Table 1: Classification accuracy (%) on the four digits classification tasks. The first row corresponds to the performance when no adaptation is performed. We evaluate three SRDA models with different plans for adding noise. Baseline results are cited from the respective studies.

In this experiment, in order to verify the robustness of SRDA, we implement both the isotropic and anisotropic plans. For the isotropic plan, we sample noise from a standard Gaussian distribution; in all four tasks, the hyper-parameter $\epsilon$ is set to 0.5 and the learning rate is . For the anisotropic plan, sensitive samples are generated in two different ways: we choose two classical adversarial attack algorithms, FGSM [goodfellow2014explaining] and VAT [miyato2018virtual], to produce the noise added to the feature vectors. Note that FGSM needs true labels to compute gradients via backpropagation, so we replace them with pseudo labels. Except for MNIST→USPS, $\epsilon$ is set to 0.5 in the other three tasks; in MNIST→USPS, $\epsilon$ is set to 0.3. We set the batch size to 128 in all tasks for both plans, and all models are trained for 150 epochs. The learning rate is set to for the FGSM plan and for the VAT plan.

The results of the digits classification experiment are shown in Table 1. We compare our three SRDA models, namely SRDA (FGSM), SRDA (VAT), and SRDA (RAN), with other state-of-the-art UDA algorithms: DAN [pmlr-v37-long15], DANN [ganin2016domain], DSN [NIPS2016_6254], ADDA [tzeng2017adversarial], CoGAN [liu2016coupled], ATDA [pmlr-v70-saito17a], ASSC [Haeusser_2017_ICCV], DRCN [DBLP:conf/eccv/GhifaryKZBL16], and MCD [saito2018maximum]. SRDA ranks first in two of the four tasks. In USPS→MNIST, the most difficult task, our three models take the top three places; only MCD is comparable to them, and the other methods trail by a large margin. In SVHN→MNIST, SRDA (RAN) ranks first and the other two models are in the top five. In the remaining two tasks, our models do not obtain the best results. We attribute this to the relatively satisfactory results obtained without adaptation: once a model without adaptation already places target samples away from the decision boundary, SRDA can hardly detect enough sensitive samples to optimize $G$, and thus cannot improve much over the baseline. The fact that the improvement is smallest in SYNSIG→GTSRB, the easiest setting, supports this conclusion.

5.2 VisDA Classification

Method      Pl    Bc    Bs    Ca    Hr    Kf    Mc    Ps    Pt    Sk    Tr    Tk    Avg
No UDA      55.1  53.3  61.9  59.1  80.6  17.9  79.7  31.2  81.0  26.5  73.5  8.5   52.4
DAN         87.1  63.0  76.5  42.0  90.3  42.9  85.9  53.1  49.7  36.3  85.8  20.7  61.1
DANN        81.9  77.7  82.8  44.3  81.2  29.5  65.1  28.6  51.9  54.6  82.8  7.8   57.4
MCD         81.1  55.3  83.6  65.7  87.6  72.7  83.1  73.9  85.3  47.7  73.2  27.1  69.7
MCD         90.3  49.3  82.1  62.9  91.8  69.4  83.8  72.8  79.8  53.3  81.5  29.7  70.6
MCD         87.0  60.9  83.7  64.0  88.9  79.6  84.7  76.9  88.6  40.3  83.0  25.8  71.9
SRDA (F)    90.1  67.0  82.3  56.0  84.8  88.2  90.3  77.0  82.5  26.8  85.0  16.2  71.1
SRDA (V)    89.4  43.5  81.2  60.2  81.1  57.6  93.7  76.6  81.8  41.3  79.6  22.0  69.5
SRDA (R)    90.9  74.8  81.9  59.1  87.5  77.3  89.9  79.4  85.3  40.6  85.1  21.6  73.3

Table 2: Classification accuracy (%) on the VisDA classification experiment. The first row corresponds to the performance when no adaptation is performed. The middle columns correspond to the 12 categories (plane, bicycle, bus, car, horse, knife, motorcycle, person, plant, skateboard, train, truck) and the rightmost column is the average accuracy. We evaluate three SRDA models with different plans for adding noise. The three MCD rows correspond to different hyper-parameter settings. Baseline results are cited from the respective studies.

We further assess SRDA on a more complex object classification dataset. VisDA [peng2017visda] poses an adaptation task from synthetic-object to real-object images. It contains more than 280K images belonging to 12 categories, divided into training, validation, and test sets. The 152,397 training images are synthesized by rendering 3D models under different angles and lighting conditions; the 55,388 validation images are collected from MSCOCO [lin2014microsoft]. In this experiment, we regard the training set as the source domain and the validation set as the target domain. To ensure fairness, we use the same backbone network as MCD [saito2018maximum], ResNet-101 [he2016deep]; the generator and classifier networks are also configured identically. In addition, we employ the entropy minimization trick used in MCD [saito2018maximum].

In this experiment, we also implement both the isotropic and anisotropic plans; for the anisotropic plan, the FGSM [goodfellow2014explaining] and VAT [miyato2018virtual] algorithms are implemented. All models are trained for 15 epochs with a batch size of 32. The learning rate is set to . We compare SRDA (FGSM), SRDA (VAT), and SRDA (RAN) with several typical methods: DAN [pmlr-v37-long15], DANN [ganin2016domain], and the state-of-the-art MCD [saito2018maximum].

The results of the VisDA classification experiment are shown in Table 2. SRDA and MCD [saito2018maximum] achieve much better accuracy than the other methods. Moreover, SRDA (RAN) ranks first among all models, and SRDA (FGSM) obtains accuracy comparable to MCD. In detail, SRDA (RAN) achieves the best results for the plane and person classes, SRDA (VAT) for the motorcycle class, and SRDA (FGSM) for the knife class. An interesting phenomenon is that the three SRDA models perform diversely across categories: for example, SRDA (FGSM) performs much better than the others on knife, while SRDA (VAT) ranks first on motorcycle. Overall, SRDA (RAN) performs best. This reflects the importance of detecting sensitive samples: a better-designed plan that seeks more sensitive samples, and a metric that characterizes the smoothness of samples more precisely, are promising directions for further improving the proposed method.

5.3 Discussion of Local Smooth Discrepancy

To verify that the defined LSD indeed reflects the smoothness of samples and the performance of a model, we show the relationship between LSD and accuracy in Figure 2. Three models, SRDA (RAN), SRDA (FGSM), and SRDA (VAT), are assessed on VisDA, following the settings of the VisDA classification experiment. Note that because accuracy is recorded every epoch while LSD is recorded every step, we show the accuracy after quadratic interpolation.

Figure 2: (a) SRDA (RAN), (b) SRDA (FGSM), (c) SRDA (VAT), (d) MCD. The three plots on the left display the relationship between LSD (red line) and accuracy (blue line) during training for the three SRDA models evaluated on VisDA: as the discrepancy decreases, the accuracy increases. The plot on the right displays the relationship between LSD and accuracy for MCD: models with higher accuracy obtain lower LSD.

As shown in Figures 2(a), 2(b), and 2(c), the accuracy of all three models gradually increases as LSD decreases, indicating that the proposed LSD is a reasonable metric for evaluating the performance of a UDA model. As shown in Table 2, SRDA (RAN) performs best on VisDA, SRDA (FGSM) ranks second, and SRDA (VAT) is the worst of the three. The ordering of performance on VisDA also corresponds to the ordering of LSD: SRDA (VAT) shows the highest loss, SRDA (RAN) obtains the lowest, and SRDA (FGSM) ranks in the middle. This verifies that smoothness is a key factor affecting the performance of a UDA model and confirms our smoothness-based explanation of the performance drop in the target domain.

Moreover, to show that LSD is a general metric for assessing the performance of a UDA model, we further test it on MCD models with different accuracies. In this experiment, we follow the settings described in MCD [saito2018maximum] and train models on VisDA; overall, we train 12 MCD models with different accuracies by tuning hyper-parameters. As LSD is not the objective function of MCD, we use the original FGSM algorithm to generate adversarial samples at the image level. Traditional white-box adversarial attack algorithms generate samples with their own networks, which would introduce a confounding variable: the adversarial samples would differ across MCD models. Thus, to ensure fairness, we generate the adversarial samples with SRDA (RAN) for every MCD model, with $\epsilon$ again set to 0.5. With these adversarial images, LSD is calculated against the corresponding images in the target domain. To ensure the validity of the results, both classifiers in MCD are tested.

As shown in Figure 2(d), there is an obvious relationship between LSD and accuracy: low LSD corresponds to high accuracy. Across the 12 MCD models, the LSD of both classifiers gradually decreases from roughly 0.6 to 0.3 as accuracy grows. This means that LSD is a reasonable general metric to evaluate the performance of a UDA model. However, several MCD models with high accuracy show relatively high LSD; we attribute this fluctuation to the randomness of the gradient descent algorithm.

6 Conclusion

In this paper, we propose a novel method for the UDA problem. We introduce a definition of smoothness for UDA and explain the performance drop in the target domain in terms of smoothness. Based on this definition, we propose a formula named local smooth discrepancy to measure the smoothness of samples, which both detects sensitive samples and serves as an objective function. Building on these components, we give a concise training strategy for UDA models together with a theoretical analysis. We extensively evaluate our method on several benchmark datasets; in almost all experiments, it is comparable or superior to the state-of-the-art methods.

References

  • (1) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • (2) Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3722–3731, 2017.
  • (3) Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 343–351. Curran Associates, Inc., 2016.
  • (4) D. Das and C. S. George Lee. Unsupervised domain adaptation using regularized hyper-graph matching. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 3758–3762, Oct 2018.
  • (5) Debasmit Das and C. S. George Lee. Graph matching and pseudo-label guided deep unsupervised domain adaptation. In Věra Kůrková, Yannis Manolopoulos, Barbara Hammer, Lazaros Iliadis, and Ilias Maglogiannis, editors, Artificial Neural Networks and Machine Learning – ICANN 2018, pages 342–352, Cham, 2018. Springer International Publishing.
  • (6) Debasmit Das and C.S. George Lee. Sample-to-sample correspondence for unsupervised domain adaptation. Engineering Applications of Artificial Intelligence, 73:80–91, 2018.
  • (7) Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655, 2014.
  • (8) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • (9) Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In ECCV (4), volume 9908 of Lecture Notes in Computer Science, pages 597–613. Springer, 2016.
  • (10) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • (11) Philip Haeusser, Thomas Frerix, Alexander Mordvintsev, and Daniel Cremers. Associative domain adaptation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • (12) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • (13) Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.
  • (14) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • (15) Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • (16) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
  • (17) Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
  • (18) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 97–105, Lille, France, 07–09 Jul 2015. PMLR.
  • (19) Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 1640–1650, 2018.
  • (20) Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2208–2217. JMLR.org, 2017.
  • (21) Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • (22) Boris Moiseev, Artem Konev, Alexander Chigorin, and Anton Konushin. Evaluation of traffic sign recognition methods trained on synthetically generated data. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 576–583. Springer, 2013.
  • (23) Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4500–4509, 2018.
  • (24) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • (25) Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
  • (26) Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2988–2997, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • (27) Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2018.
  • (28) Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The german traffic sign recognition benchmark: A multi-class classification competition. In IJCNN, volume 6, page 7, 2011.
  • (29) Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.
  • (30) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
  • (31) Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. Central moment discrepancy (cmd) for domain-invariant representation learning. International Conference on Learning Representations, 2017.