Factorized Adversarial Networks for Unsupervised Domain Adaptation

06/04/2018 · by Jian Ren, et al.

In this paper, we propose Factorized Adversarial Networks (FAN) to solve unsupervised domain adaptation problems for image classification tasks. Our networks map the data distribution into a latent feature space that is factorized, for both the source and target domains, into a domain-specific subspace containing domain-specific characteristics and a task-specific subspace retaining category information. Unsupervised domain adaptation is achieved through adversarial training that minimizes the discrepancy between the distributions of the two task-specific subspaces from the source and target domains. We demonstrate that the proposed approach outperforms state-of-the-art methods on multiple benchmark datasets used in the literature for unsupervised domain adaptation. Furthermore, we collect two real-world tagging datasets that are much larger than existing benchmark datasets, and obtain significant improvements over baselines, demonstrating the practical value of our approach.


1 Introduction

Rapid development of deep convolutional neural networks (CNNs) has led to promising performance on various computer vision tasks [1][2][3], especially with the help of large-scale annotated datasets such as ImageNet [4]. However, when a model learned from a large dataset in one domain (the source domain) is applied to another domain (the target domain) with different characteristics, it is not guaranteed to generalize well. To mitigate the influence of domain shift [5], two major approaches are widely employed. One popular approach is to fine-tune the model learned from the source domain using annotated data drawn from the target distribution [6]. However, this requires data annotation in the target domain, which is costly and labor intensive. The other approach is to generate synthetic data that is analogous to the target distribution [7][8]. Although this approach can provide unlimited synthetic training data, the resulting model may not perform as well as one trained on real data with much more complicated distributions.

Figure 1: The proposed unsupervised domain adaptation approach factorizes source and target latent feature space into two subspaces using two different networks. The domain-specific subspace stores domain-specific information, while the task-specific subspace stores the category information. We use adversarial training to minimize the discrepancy between the two task-specific subspaces.

In this work, we focus on the image classification task and aim to solve the unsupervised domain adaptation problem. In our problem setting, the source domain contains a large amount of annotated data, but there is no annotation available for the images in the target domain. The two domains share the same high-level categories although they are drawn from different distributions.

We propose Factorized Adversarial Networks (FAN) to address this unsupervised domain adaptation problem. FAN encodes input data from both domains to a latent embedding space which is factorized into two complementary subspaces, a domain-specific subspace (DSS) and a task-specific subspace (TSS), as illustrated in Figure 1. In an image recognition scenario, the task-specific subspace should ideally only contain image category related information, while the domain-specific subspace contains domain characteristics that are irrelevant to classification, e.g., different backgrounds should not impact digit recognition. We use a mutual information loss to enforce the orthogonality constraint between the two subspaces. The motivation of this factorization is to allow us to adapt only the task-specific subspace of the target domain to that of the source domain. To do the adaptation, we apply an adversarial network to minimize the distribution discrepancy between the two task-specific subspaces, with a loss function adopted from the Generative Adversarial Network (GAN) [9].

A two-stage training process is used to train FAN. In the first stage, we train a convolutional network in the source domain to predict the image labels as well as reconstruct the input images. The features in the task-specific subspace are used to predict the image labels, while the domain-specific subspace features, concatenated with the image classification logits, are used to reconstruct the input images. In the second stage, we train the network in the target domain using the adversarial loss and the reconstruction loss to generate a task-specific subspace that is indistinguishable from the one generated in the source domain. A discriminator network judges from which domain the task-specific features are generated. The target domain network and the discriminator are updated by the gradients in an adversarial way so that the task-specific subspace of the target domain is adapted to that of the source domain.

We apply our proposed method to visual domain adaptation using the benchmark digits datasets, including MNIST [10], USPS [11] and SVHN [12], and achieve superior results compared to state-of-the-art approaches. We also apply the method to two real-world tagging datasets that we collected, one by crawling images from search engines such as Google and Flickr, and the other from photos shot on mobile phones. The two datasets share the same 100 classes, with each dataset containing more than 115,000 images, and we achieve significant improvement on the classification task compared with state-of-the-art methods.

In summary, our contributions are three-fold:

  • Novel Factorized Adversarial Networks (FAN) that tackle unsupervised domain adaptation in an effective way.

  • Detailed analysis on the design of the network architecture along with visualization of the factorized subspaces.

  • New state-of-the-art domain adaptation results on digits benchmark datasets as well as newly collected larger-scale real-world tagging datasets.

2 Related Work

Unsupervised domain adaptation Extensive studies on unsupervised domain adaptation have been conducted in recent years in order to effectively transfer representative features learned in a source domain to a target domain. In this section, we focus on research utilizing deep neural networks, as they generalize better even for complex distributions [1][13][14].

One category of unsupervised domain adaptation applies the Maximum Mean Discrepancy (MMD) [15] loss as a metric to learn domain-invariant features. The MMD loss computes the distance between the embedding spaces of two domains using kernel tricks. Deep domain confusion (DDC) [16] minimizes both a classification loss and an MMD loss in one layer. The deep adaptation network proposed in [17] places MMD losses at multiple task-specific layers that have been embedded in a reproducing kernel Hilbert space, while other layers are shared between the source and target domains. Similarly, the domain separation network (DSN) [18] maintains a shared embedding between the two domains as well as individual domain representations. The Deep Reconstruction-Classification Network (DRCN) [19] shares the encoding for both source and target domains. On the contrary, the work in [20] demonstrates that it is effective to relate the weights in the form of linear transformations instead of sharing them. Unlike the approaches discussed above, the authors in [21] propose the deep correlation alignment (CORAL) algorithm, which matches the covariance of the source and target features to learn a transformation from the source domain to the target domain.

Based on the idea of adversarial training [9], several studies propose using a domain classifier built on top of the networks to distinguish the represented features of the two distributions. Features extracted from the two domains are used to train the domain classifier, along with the classification loss for the source domain [22]. The gradient reversal (RevGrad) algorithm [23] trains the domain classifier by reversing its gradients. The authors of [24] propose an adversarial discriminative domain adaptation (ADDA) model in which weights are not shared between the source and target domains, and the network in the target domain is trained to fool the domain classifier so that it cannot reliably predict the domain.

Generative adversarial networks GAN [9]-related approaches are also used to synthesize images and perform unsupervised domain adaptation in the joint distribution space. A generator is trained to model the image distribution and generate synthetic images, while a discriminator is trained to differentiate the synthesized distribution from the real distribution. Coupled GAN (CoGAN) [25] uses two GANs on the source and target domains to generate images from the two distributions. The two GANs take the same noise as input, and domain adaptation is implemented by training a classifier on the input of the discriminator. The work in [7] uses images from the source domain as a condition for the generator; both the generated images and the source images are used to train the classifier. The authors of [26] propose a learning strategy to generate cross-domain images and train a task-specific classifier with the generated images and the source distributions.

Hidden factors discovery There has been research on discovering higher-order factors of variation in the latent space for image classification and generation tasks [27][28][29][30]. For example, the work in [27] utilizes an autoencoder to disentangle various transformations from the input distributions; the network is jointly trained to reconstruct input images as well as estimate the image category. In contrast, InfoGAN [28] learns disentangled representations from images in an unsupervised fashion by decomposing the latent code from the input noise vector. In this study, we propose learning the task-specific feature in an effective way instead of learning interpretable hidden factors, and we find that factorizing the domain representations helps to adapt knowledge between the two domains.

Comparison with similar studies The motivation of our proposed FAN is to find a subspace in which unsupervised domain adaptation for classification is most appropriate. It shares similarities with previous studies, especially DSN [18] and ADDA [24]. While domain separation [27][31][32] and adversarial training [7][22] have been extensively explored in many tasks in the existing literature, we unify the two approaches in one novel framework for unsupervised domain adaptation and demonstrate its clear advantage over DSN [18] and ADDA [24] in experiments.

Figure 2: The architecture of FAN. The encoders from two domains map input images into two feature spaces. Both feature spaces are factorized into two subspaces, the domain-specific subspace (DSS) and the task-specific subspace (TSS). The adaptation is accomplished by jointly training the discriminator and target network using both the GAN loss and reconstruction loss to find the domain invariant feature in TSS.

3 Our Approach

In this section, we present our Factorized Adversarial Networks (FAN) for unsupervised domain adaptation. The architecture of FAN is illustrated in Figure 2: two encoder-decoder structured neural networks, one for the source domain and one for the target domain, that mirror each other except for the training losses, plus a discriminator network. We aim to find a domain-invariant feature space that retains the classification information through adversarial training. To achieve this, we explicitly factorize the latent feature space into a task-specific subspace and a complementary domain-specific subspace: the task-specific subspace aims to minimize the classification loss across domains, while the domain-specific subspace, combined with the classification logits, aims to reconstruct the input samples. If the adversarial discriminator cannot tell which domain the task-specific subspace comes from, that subspace should retain classification information invariant to domain shifts; the domain-specific subspace, on the other hand, should capture the domain-specific but classification-irrelevant information needed for reconstruction. The proposed explicit feature space factorization helps to remove domain-specific information and relieves the burden of adversarial training, enabling more effective domain adaptation.

More formally, in our unsupervised domain adaptation setting we have a source distribution of labeled images $X^s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, where $y_i^s$ is a one-hot vector encoding the image class label, and a target distribution of unlabeled images $X^t = \{x_i^t\}_{i=1}^{N_t}$. Our goal is to first find a mapping $f^s$ that maps the source task-specific subspace to the source logit space with labeled training data, and then find a mapping function $f^t$ for the target domain that maps the target task-specific subspace into a target logit space that is indistinguishable from the source logit space. The target mapping function thus retains the discriminative information needed for the target domain, and therefore inference in the target domain can be done simply with $f^t$ and softmax. Our learning procedure consists of two steps: we first train a source domain network that factorizes the latent feature space, and we then update the target domain network by adapting the target domain task-specific subspace to its source domain counterpart with the help of adversarial training. We discuss these two steps in the following sections.

3.1 Feature Space Factorization

Our networks contain two convolutional encoder-decoder networks, and the latent feature space generated by the encoders is factorized into a complementary task-specific subspace and domain-specific subspace. In the first step of our approach, we train the factorization network in the source domain, as shown in Figure 2. To avoid cluttered notation, we drop the domain indicator superscripts in the following when there is no confusion. Let $z = E(x; \theta_E)$ denote the encoder function that encodes an input sample $x$ into a latent feature $z$ with parameters $\theta_E$ in the source domain. We split the latent feature into two parts $z_d$ and $z_t$, where $z_d$ represents the feature in the domain-specific subspace and $z_t$ represents the feature in the task-specific subspace. The mapping $l = f(z_t; \theta_f)$ maps the task-specific subspace into a logit space with parameters $\theta_f$. We then concatenate $z_d$ and $l$ and feed them into a decoder $R$ to reconstruct the input sample, $\hat{x} = R([z_d; l]; \theta_R)$, where $[z_d; l]$ includes the necessary attributes for reconstruction. Ideally, $z_t$ should contain discriminant information that is invariant across domains, while $z_d$ retains information that is specific to the domain, less relevant to classification but necessary for reconstruction. We optimize the following objective function to obtain the two desired subspaces in the source domain:

$\min_{\theta_E, \theta_f, \theta_R} \mathcal{L}_{src} = \mathcal{L}_{cls} + \alpha \mathcal{L}_{mi} + \beta \mathcal{L}_{recon}$   (1)

where $\alpha$ and $\beta$ are hyperparameters that control the trade-off among the loss terms.

$\mathcal{L}_{cls}$ is the cross-entropy loss used to train the source network for classification with the parameters $\theta_E$ and $\theta_f$ on the labeled source domain training data:

$\mathcal{L}_{cls} = -\sum_{i=1}^{N_s} y_i^s \cdot \log \hat{y}_i^s$   (2)

where $\hat{y}_i^s$ is the softmax output of the classification branch, $\hat{y}_i^s = \mathrm{softmax}(f(z_{t,i}^s; \theta_f))$.

We add a mutual information loss term to encourage orthogonality between the domain-specific subspace and the task-specific subspace:

$\mathcal{L}_{mi} = \sum_{i=1}^{N_s} \left( z_{d,i}^{\top} z_{t,i} \right)^2$   (3)

where $z_{d,i}$ and $z_{t,i}$ denote the domain-specific feature and the task-specific feature of the $i$-th sample, respectively.

We use the reconstruction loss to minimize the squared error between the input sample and the reconstructed one:

$\mathcal{L}_{recon} = \sum_{i=1}^{N_s} \left\| x_i - R([z_{d,i}; l_i]; \theta_R) \right\|_2^2$   (4)

where $l_i$ denotes the logit vector of the $i$-th sample.

The three loss terms work together in the optimization of Eqn. 1. The classification loss encourages the learned feature to retain as much discriminative information as possible, the reconstruction loss relies on the domain-specific information in $z_d$ together with the logit input for reconstruction, and the mutual information loss encourages the separation of the two subspaces. We can thus obtain a task-specific subspace that is discriminative with much less domain-specific information, and hence more invariant to domain shifts.
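To make the three terms concrete, here is a minimal sketch of the source-domain objective in PyTorch. This is our illustration rather than the authors' released code: the function and variable names are ours, the integer-label form of the cross-entropy is an implementation convenience (the paper uses one-hot labels), the inner-product penalty is our reading of the mutual information loss in Eqn. 3, and $z_d$ and $z_t$ are assumed to have equal dimensions. The default weightings follow the values reported in Section 4.1.

```python
import torch
import torch.nn.functional as F

def source_losses(x, y, z_d, z_t, logits, x_hat, alpha=2.0, beta=1.0):
    """Eqn. 1: L_src = L_cls + alpha * L_mi + beta * L_recon (a sketch).

    x:      input images, shape (N, 1, 28, 28)
    y:      integer class labels, shape (N,)  (one-hot in the paper)
    z_d:    domain-specific features, shape (N, d)
    z_t:    task-specific features, shape (N, d)
    logits: classification logits f(z_t), shape (N, 10)
    x_hat:  reconstruction R([z_d; logits]), same shape as x
    """
    # Eqn. 2: cross-entropy over the softmax of the classification logits.
    l_cls = F.cross_entropy(logits, y)
    # Eqn. 3 (our reading): squared inner product between the two subspaces,
    # penalizing any overlap to encourage orthogonality.
    l_mi = (z_d * z_t).sum(dim=1).pow(2).mean()
    # Eqn. 4: squared reconstruction error against the input.
    l_recon = (x_hat - x).pow(2).flatten(1).sum(dim=1).mean()
    return l_cls + alpha * l_mi + beta * l_recon
```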

The target domain network has the same architecture as the source domain network. In the second step of our approach, we fix the learned source domain factorization network and train the target factorization network with adversarial adaptation, as discussed in the following section.

3.2 Adversarial Domain Adaptation

Our factorization network is designed to capture discriminant information in the task-specific subspace while dropping domain-specific information as much as possible. We leverage adversarial training to minimize the discrepancy between the task-specific subspace of the target domain and that of the source domain, so that the knowledge learned in the source domain can easily be transferred to the target domain. Specifically, we learn our target domain network by optimizing the following objective function:

$\min_{\theta_E^t, \theta_f^t, \theta_R^t} \mathcal{L}_{tgt} = \gamma \mathcal{L}_{adv_E} + \eta \mathcal{L}_{recon}^t$   (5)

where $\gamma$ and $\eta$ are hyperparameters that balance the contributions of the adversarial training loss and the reconstruction loss.

The reconstruction loss $\mathcal{L}_{recon}^t$ in the target domain is defined similarly to Eqn. 4, over the target domain network parameters. The adversarial training losses are defined similarly to the GAN loss [9]. Instead of using the task-specific subspace directly, we use the logit space obtained from the source domain to guide the learning in the target domain, which works better in practice. The discriminator $D$ maps an input logit vector to a binary label, where "true" denotes the source domain and "false" denotes the target domain. The target domain network is learned in an adversarial way to fool the discriminator so that the discrepancy between the two logit spaces is minimized. Specifically, the adversarial losses $\mathcal{L}_{adv_D}$ for optimizing the discriminator and $\mathcal{L}_{adv_E}$ for optimizing the target domain encoder are defined as

$\mathcal{L}_{adv_D} = -\mathbb{E}_{x^s \sim X^s}\big[\log D(l^s)\big] - \mathbb{E}_{x^t \sim X^t}\big[\log\big(1 - D(l^t)\big)\big]$   (6)
$\mathcal{L}_{adv_E} = -\mathbb{E}_{x^t \sim X^t}\big[\log D(l^t)\big]$   (7)

where $\theta_E^t$ and $\theta_f^t$ denote the network parameters of the target domain encoder and logit mapping, and $l^s = f^s(z_t^s; \theta_f^s)$ and $l^t = f^t(z_t^t; \theta_f^t)$ are the source and target logits. As the task-specific subspace in the target domain aims to learn a distribution similar to that of the source domain, the mutual information loss is not necessary for the target domain. In our experiments, we did try using Eqn. 3 in the target domain, but did not observe further improvement.

Unlike the symmetric structure of our network shown in Figure 2, we perform asymmetric adaptation during optimization: the target domain network is fine-tuned from the source domain network instead of sharing weights between the two networks. Previous efforts explored sharing weights between the source and target networks to reduce model parameters [33][22], or leaving the target network completely untied [20][24]. We found that sharing weights is not necessary for shallow networks such as LeNet [10], but it is imperative to partially share some early layers for deeper networks such as ResNet [13], which is standard practice for training deep nets. By jointly optimizing the adversarial loss and the reconstruction loss, we force the target domain task-specific subspace to match the distribution of the source domain task-specific subspace, which is discriminative for the classification task, while leaving the less relevant target domain-specific representations to be captured by the domain-specific subspace. Together, the two terms encourage the network to learn more discriminative and domain-invariant feature representations for the task.
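The alternating updates described above can be sketched as follows, again in PyTorch and again our illustration rather than the authors' code: `enc_t`, `cls_t`, and `dec_t` stand for the target encoder, logit mapping, and decoder, `disc` for the discriminator, and `l_s` for logits produced by the frozen source network.

```python
import torch
import torch.nn.functional as F

def adaptation_step(x_t, l_s, enc_t, cls_t, dec_t, disc,
                    opt_d, opt_t, gamma=2.0, eta=1.0):
    """One adversarial update on the target network (a sketch).

    x_t: batch of target images; l_s: logits from the frozen source network.
    """
    # Discriminator update (Eqn. 6): source logits are "true",
    # target logits are "false".
    z_d, z_t = enc_t(x_t)
    l_t = cls_t(z_t)
    d_src, d_tgt = disc(l_s.detach()), disc(l_t.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src))
              + F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Target network update (Eqns. 5 and 7): fool the discriminator
    # while still reconstructing the target inputs.
    x_hat = dec_t(torch.cat([z_d, l_t], dim=1))
    d_tgt = disc(l_t)
    loss_adv = F.binary_cross_entropy_with_logits(d_tgt, torch.ones_like(d_tgt))
    loss_recon = (x_hat - x_t).pow(2).flatten(1).sum(dim=1).mean()
    loss_t = gamma * loss_adv + eta * loss_recon
    opt_t.zero_grad(); loss_t.backward(); opt_t.step()
    return loss_d.item(), loss_t.item()
```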

4 Experiments

We evaluate the proposed FAN on unsupervised domain adaptation tasks using benchmark datasets including MNIST [10], USPS [11] and SVHN [12], as well as two much larger real-world tagging datasets we collected, each containing more than 100,000 images. We demonstrate that our approach significantly improves over previous state-of-the-art methods.

4.1 Digits Datasets

We use three digits datasets, MNIST [10], USPS [11] and SVHN [12], as the benchmark and follow previous studies [19][22][23][24][25] to perform three unsupervised adaptation settings: MNIST→USPS, USPS→MNIST and SVHN→MNIST. The benchmark datasets contain images of the 10 digits from 0 to 9. Sample images from the three datasets are shown in Figure 3(a). To run the experiments in an unsupervised manner, the labels of the target domain training images are withheld.

Network architecture The network we use in the experiments contains an encoder and a decoder and has the same structure under the three experimental settings. Following the recent work [24] for fair comparison, we adopt a similarly modified LeNet [10] as the encoder that differs only in utilizing batch normalization (BN). We also applied BN to [24] but observed no improvement. Specifically, the encoder consists of two convolutional layers with kernel size 5×5 and 20 and 50 filters, respectively. Each convolutional layer is followed by rectified linear units (ReLU), BN, and max pooling. After that we have two fully connected (FC) layers with 500 and 100 hidden units, respectively. The activations from the last FC layer are split into two parts, one for the domain-specific subspace and the other for the task-specific subspace. The task-specific feature is connected to an FC layer to produce the classification logits for prediction, while the domain-specific feature is concatenated with the classification logits as input for the decoding phase. The decoder employs a deconvolution architecture [34] including one FC layer with 300 hidden units, two 5×5 convolutional layers with 16 filters, one upsampling layer to 28×28, and two 3×3 convolutional layers with 16 and 1 filters, respectively. The FC layer and convolutional layers are followed by ReLU and BN, except for the last convolutional layer, which gives the reconstruction output. The logit activations from the two domains are sent to the discriminator network, which contains three FC layers. The first two FC layers have 500 hidden units each, followed by ReLU and BN. The last FC layer provides the domain label estimation for the input samples.
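For concreteness, the encoder described above could be written roughly as follows in PyTorch. This is a sketch under our assumptions: in particular, the 100-unit feature is split evenly into 50 domain-specific and 50 task-specific units, which the paper does not state explicitly.

```python
import torch
import torch.nn as nn

class FANEncoder(nn.Module):
    """Modified LeNet encoder with a factorized feature space (a sketch)."""

    def __init__(self, num_classes=10, dss_dim=50, tss_dim=50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.BatchNorm2d(20),
            nn.MaxPool2d(2),
            nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(), nn.BatchNorm2d(50),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * 4 * 4, 500), nn.ReLU(), nn.BatchNorm1d(500),
            nn.Linear(500, dss_dim + tss_dim), nn.ReLU(),
            nn.BatchNorm1d(dss_dim + tss_dim),
        )
        self.dss_dim = dss_dim
        self.classifier = nn.Linear(tss_dim, num_classes)

    def forward(self, x):                        # x: (N, 1, 28, 28)
        z = self.fc(self.features(x))
        z_d, z_t = z[:, :self.dss_dim], z[:, self.dss_dim:]
        return z_d, z_t, self.classifier(z_t)    # DSS, TSS, logits
```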

(a) Example images from the three digits datasets. Left three columns: MNIST; middle three columns: USPS; right three columns: SVHN.
(b) Example images from the Crawling dataset (left two columns) and the Mobile dataset (right two columns). Top row: forest; bottom row: steering wheel.
Figure 3: Visualization of example images from the five datasets used in the study.

Implementation details Since images in the different datasets vary in size, we resize the images in the USPS and SVHN datasets to 28×28 to match the input image size of MNIST. In addition, we convert the RGB images in SVHN to grayscale. All pixel values are normalized to the range of 0 to 1. For the unsupervised adaptation between MNIST and USPS, two training paradigms are implemented. The first follows the training strategy introduced in [35], which samples 2,000 training images from MNIST and 1,800 training images from USPS. For the second training protocol, we utilize all the training data from the two domains and denote it as MNIST→USPS (full) and USPS→MNIST (full). For both training protocols, the testing set remains the same. For adaptation from SVHN to MNIST, we use all the training images from the two datasets. The training process contains two steps. The first step is to train a model in the source domain using Eqn. 1 with α set to 2 and β set to 1. In the second step, we fix the trained source domain model and train the recognition model in the target domain using Eqn. 5, where γ is 2 and η is 1. We initialize the target domain network using the weights of the model trained in the source domain. No data augmentation is used in the experiments.
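The preprocessing is straightforward to reproduce; a minimal sketch with torchvision, following the resizing, grayscale conversion, and [0, 1] normalization described above (transform names are ours):

```python
from torchvision import transforms

# Bring SVHN's 32x32 RGB images to MNIST's 28x28 grayscale format;
# ToTensor() already scales pixel values to [0, 1].
svhn_to_mnist = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
])

# USPS images only need resizing (they are already grayscale).
usps_to_mnist = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
])
```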

| Method | MNIST→USPS | USPS→MNIST | SVHN→MNIST | Largest Improvement |
|---|---|---|---|---|
| Baseline | 0.752 ± 0.016 | 0.571 ± 0.017 | 0.601 ± 0.011 | 0.339 |
| DSN [18][36] | 0.913 | - | 0.827 | 0.098 |
| RevGrad [23] | 0.771 ± 0.018 | 0.730 ± 0.020 | 0.739 | 0.186 |
| DDC [22] | 0.791 ± 0.005 | 0.665 ± 0.033 | 0.681 ± 0.003 | 0.245 |
| CoGAN [25] | 0.912 ± 0.008 | 0.891 ± 0.008 | - | 0.019 |
| DRCN [19] | 0.918 ± 0.0009 | 0.737 ± 0.0004 | 0.820 ± 0.0016 | 0.173 |
| ADDA [24] | 0.894 ± 0.002 | 0.901 ± 0.008 | 0.760 ± 0.018 | 0.165 |
| Ours | 0.921 ± 0.014 | 0.910 ± 0.011 | 0.925 ± 0.011 | - |
| Ours (full) | 0.963 ± 0.002 | 0.971 ± 0.008 | - | - |

Table 1: Experimental results on unsupervised domain adaptation for the digits datasets MNIST, USPS, and SVHN. "Full" denotes using the entire training set for the domain adaptation between MNIST and USPS. The last column shows the largest improvement of our method over each listed method across the three experiments.

Comparison results Table 1 shows our results compared with recent methods. Our approach clearly achieves the best overall performance on all three domain adaptation experiments under the same settings. Compared with previous methods, our method significantly outperforms each of them on at least one of the three experiments, with a gap of over 10% in many cases, as shown in the last column of Table 1. For the adaptation between MNIST and USPS, we also show results using the full set of training data from both domains and observe that it significantly improves the accuracy, implying that our adaptation network can better minimize the distribution shift with more training data.

Ablation analysis of our network design We conduct an ablation study on the design of our factorization architecture. The structures of the four network settings are shown in Figure 4, with the following details.

  • Joint feature: As shown in Figure 4a, we learn a joint feature space for both image reconstruction and classification, and use reconstruction losses in both domains along with the classification loss in source domain to train the network.

  • Feature separation: As shown in Figure 4b, in this setting, we separate the latent features into two parts. One part is used for reconstruction and the other part is used for classification.

  • Feature concatenation: As shown in Figure 4c, the previous reconstruction features are concatenated with the classification logits as new reconstruction features.

  • Full factorization: As shown in Figure 4d, we add mutual information loss in this setting to explicitly enforce the orthogonality between the two separated features, thus factorizing the latent feature space into a domain-specific subspace and a task-specific subspace.

For all four settings, we conduct the same two-stage training process and apply adversarial learning in the second stage. The results in Table 2 indicate that better feature separation yields stronger results, and our full factorization method performs best.

Figure 4: Four network architectures for the study of feature factorization.

| Task | Joint feature | Feature separation | Feature concatenation | Full factorization |
|---|---|---|---|---|
| MNIST→USPS (full) | 0.955 ± 0.004 | 0.958 ± 0.002 | 0.961 ± 0.002 | 0.963 ± 0.002 |
| USPS→MNIST (full) | 0.933 ± 0.017 | 0.936 ± 0.014 | 0.958 ± 0.009 | 0.971 ± 0.008 |
| SVHN→MNIST | 0.829 ± 0.019 | 0.858 ± 0.024 | 0.905 ± 0.006 | 0.925 ± 0.011 |

Table 2: Analysis of the effects of feature factorization under different network structures.

Analysis of the embedding spaces Besides the quantitative results, we visualize the high-dimensional features of the factorized subspaces in the 2D plane for the adaptation from SVHN to MNIST using t-SNE [37]. We randomly select 1,000 images from the two testing sets and show the visualization results in Figure 5. We set the perplexity to 35 for all four visualizations. The embeddings of the logit space before and after adaptation for the two domains are shown in Figure 5(a) and Figure 5(b), respectively. As expected, after adaptation the samples from the target domain are clustered into more distinct groups and match the clusters in the source domain better.

The visualizations of the domain-specific subspaces before and after adaptation are shown in Figure 5(c) and Figure 5(d), respectively. After adaptation, we simultaneously learn a good task-specific subspace and a good domain-specific subspace for the target domain. The domain-specific subspace should capture information specific to each domain, and indeed the two domain-specific subspaces are separated further after adaptation, which indicates that our learning algorithm is effective.
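Such visualizations are easy to reproduce; a minimal sketch with scikit-learn, where `feat_src` and `feat_tgt` are placeholders (our names) for the features extracted from the sampled test images of the two domains:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders for the extracted features; substitute real logit or
# domain-specific features from the two test sets.
feat_src = np.random.randn(1000, 10)   # e.g., SVHN logits
feat_tgt = np.random.randn(1000, 10)   # e.g., MNIST logits

emb = TSNE(n_components=2, perplexity=35).fit_transform(
    np.concatenate([feat_src, feat_tgt], axis=0))
plt.scatter(emb[:1000, 0], emb[:1000, 1], c="red", s=4, label="SVHN (source)")
plt.scatter(emb[1000:, 0], emb[1000:, 1], c="blue", s=4, label="MNIST (target)")
plt.legend()
plt.show()
```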

(a) Embedding of the logit space before the adaptation.
(b) Embedding of the logit space after the adaptation.
(c) Embedding of the domain-specific subspace before the adaptation.
(d) Embedding of the domain-specific subspace after the adaptation.
Figure 5: Visualization of the domain adaptation from SVHN (source domain, red) to MNIST (target domain, blue). We show the t-SNE embeddings of the logit space before adaptation (a) and after adaptation (b), and of the domain-specific subspace before adaptation (c) and after adaptation (d).
(a)
(b)
Figure 6: Reconstruction results using the target domain reconstruction network for domain adaptation from SVHN to MNIST. (a) Reconstruction results using the testing samples from target domain. (b) Reconstruction results using the concatenation of domain specific features from target domain and classification logits from source domain.

Furthermore, we analyze the embedding subspaces of the target domain with two reconstruction experiments. Figure 6(a) shows the reconstruction results using features extracted from target domain testing samples. We also concatenate the domain-specific features of the target domain samples with the logit activations of randomly selected testing images of the same class from the source domain, and show the reconstruction results in Figure 6(b). Although the reconstruction quality in Figure 6(b) is not as good as that in Figure 6(a), the images are still very similar, which indicates that the task-specific subspaces of the two domains indeed share similar distributions and that the target domain-specific subspace stores the domain characteristics needed for reconstruction.

4.2 Real-world tagging Datasets

While many studies in the literature tackle unsupervised domain adaptation, they mostly evaluate their algorithms on small and simple datasets such as the digits datasets [10][11][12] and the office dataset [38]. Whether domain adaptation algorithms can work for large-scale, real-world, complex applications remains unclear. Previous work [18] points out problems with evaluation on the office dataset [38][39], where models pretrained on ImageNet have to be used [40]. So instead of working on the toy office dataset, we collected two real-world tagging datasets to benchmark unsupervised domain adaptation algorithms, where we have sufficient images to train deep networks from scratch.

The first dataset is collected from search engines and named the Crawling dataset, while the second is collected from photos shot on mobile phones and named the Mobile dataset. The two datasets contain the same 100 classes. Example images from the two datasets are shown in Figure 3(b). There are two major differences between the datasets: 1) images in the Crawling dataset usually have good quality and clear backgrounds, while images in the Mobile dataset suffer from defects such as blur and defocus, as well as noisy backgrounds and various image filters and stickers; 2) the Mobile dataset contains mostly vertical images, while images in the Crawling dataset have various aspect ratios. We use the Crawling data as the source domain and the Mobile data as the target domain because crawled data can easily be collected with labels by keyword search. The Crawling dataset includes 150,000 training images. The Mobile dataset contains 115,000 images, of which we randomly select 100,000 as the training set, 10,000 as the testing set, and the rest as the validation set. Compared with the digits datasets, the real-world tagging datasets not only are larger in scale but also better reflect real-world scenarios.

Network architecture The encoder part of our network uses the ResNet-50 [13] architecture. The activations from the last average pooling layer are factorized equally into two parts. The task-specific subspace features are followed by an FC layer to estimate the classification logits, and the domain-specific features are concatenated with the classification logits to serve as input for the decoder. The decoder network uses the architecture from DCGAN [41]. It contains 5 fractionally-strided convolutional layers with 256, 256, 128, 64 and 3 filters, respectively. Each layer is followed by ReLU and BN, except for the last layer. The discriminator network contains three FC layers. The first two FC layers have 1024 and 2048 hidden units, respectively, followed by ReLU and BN. The output of the last FC layer is used for domain label classification.
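A sketch of how the factorization could sit on top of ResNet-50 follows (our illustration; the even split of the 2048-d pooled feature into two 1024-d subspaces is our assumption, consistent with the equal factorization mentioned above):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FANResNetEncoder(nn.Module):
    """ResNet-50 backbone with the pooled feature split into DSS and TSS."""

    def __init__(self, num_classes=100):
        super().__init__()
        backbone = resnet50()
        # Keep everything up to (and including) the global average pooling.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, x):                        # x: (N, 3, 224, 224)
        z = self.backbone(x).flatten(1)          # (N, 2048)
        z_d, z_t = z[:, :1024], z[:, 1024:]      # equal factorization
        logits = self.classifier(z_t)
        # Decoder input: domain-specific features + classification logits.
        dec_in = torch.cat([z_d, logits], dim=1)
        return z_d, z_t, logits, dec_in
```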

Implementation details All images are resized to 256×256 and randomly cropped to 224×224 during training. α is set to 5 and β to 1 in Eqn. 1, and we set γ to 2 and η to 1 in Eqn. 5.

To measure whether more unlabeled training images in the target domain contribute to the generalizability of the target model, we perform three sets of experiments in addition to the one without adaptation. In the first two sets, we randomly select 10% and 50% of the images from each class of the target training set, while in the third set we use the full target training set. The Top-1 and Top-5 accuracies on the testing set of the target domain are shown in Table 3. Compared with the model without adaptation, using 10% of the training images from each class improves the Top-1 and Top-5 accuracy by 3.75% and 3.69%, respectively. Using the full training set improves the Top-1 accuracy by more than 10% and the Top-5 accuracy by more than 12%. We also compare our results with ADDA [24] using the full target training set, as shown in Table 3; our approach outperforms ADDA by 2.46% and 3.05% on Top-1 and Top-5 accuracy, respectively. These results demonstrate that our method can significantly improve performance over baselines in real-world applications. In addition, they show that more unlabeled training data from the target domain helps the unsupervised adaptation.

| Method | Top-1 | Top-5 |
|---|---|---|
| No adaptation | 0.3571 | 0.6607 |
| ADDA [24] (full set of target training) | 0.4386 | 0.7533 |
| Ours (10% of target training) | 0.3946 | 0.6976 |
| Ours (50% of target training) | 0.4041 | 0.7018 |
| Ours (full set of target training) | 0.4632 | 0.7838 |

Table 3: Top-1 and Top-5 accuracies on the testing set of the Mobile dataset.

5 Conclusion

In this paper, we introduce FAN for unsupervised domain adaptation. We factorize the latent feature space into a task-specific subspace and a domain-specific subspace for both the source and target domains, and perform domain adaptation only on the task-specific subspace. The network in the source domain is jointly trained for image classification and reconstruction under the factorization architecture to learn a discriminative task-specific subspace while pushing away domain-specific information as much as possible. The network in the target domain is learned under the same factorization structure with a GAN loss to adapt the target domain task-specific subspace to the source domain task-specific subspace. We evaluate our proposed framework on four domain adaptation tasks, achieving state-of-the-art results on all of them. For future work, we would like to extend our algorithm to other vision tasks beyond image classification.

References