Domain Adaptation by Maximizing Population Correlation with Neural Architecture Search

by   Zhixiong Yue, et al.

In Domain Adaptation (DA), where the feature distributions of the source and target domains are different, various distance-based methods have been proposed to minimize the discrepancy between the source and target domains to handle the domain shift. In this paper, we propose a new similarity function, which is called Population Correlation (PC), to measure the domain discrepancy for DA. Based on the PC function, we propose a new method called Domain Adaptation by Maximizing Population Correlation (DAMPC) to learn a domain-invariant feature representation for DA. Moreover, most existing DA methods use hand-crafted bottleneck networks, which may limit the capacity and flexibility of the corresponding model. Therefore, we further propose a method called DAMPC with Neural Architecture Search (DAMPC-NAS) to search for an optimal network architecture for DAMPC. Experiments on several benchmark datasets, including Office-31, Office-Home, and VisDA-2017, show that the proposed DAMPC-NAS method achieves better results than state-of-the-art DA methods.




1 Introduction

With access to large-scale labeled data, deep neural networks have achieved state-of-the-art performance on a variety of machine learning problems and applications krizhevsky2012imagenet; oquab2014learning; donahue2014decaf; yosinski2014transferable; ren2015faster; he2016deep; he2017mask. However, because data annotation is time-consuming and labor-intensive, it is often hard to collect enough labeled data for model training in a target domain of interest. One solution is to transfer a deep neural network trained on a data-sufficient source domain to the target domain where only unlabeled data is available. However, this learning paradigm suffers from the shift in data distributions across different domains, which poses a major obstacle to adapting predictive models to the target task.

Domain Adaptation (DA) pan2009survey; yang2020transfer aims to learn a high-performance learner on a target domain by utilizing the knowledge transferred from a source domain, which has a different but related data distribution. A number of DA methods aim to bridge the gap between source and target domains so that the classifier learned in the source domain can be applied to the target domain. Recent DA works toward this goal can be grouped into two main categories: distance-based methods ben2007analysis; ben2010theory; zhuang2015supervised; tzeng2014deep; long2015learning; courty2016optimal; sun2016return; sun2016deep; zellinger2017central; chen2019joint and adversarial DA methods ganin2016domain; long2017conditional; pei2018multi; tzeng2017adversarial; saito2018maximum. Both categories aim to learn domain-invariant feature representations. In this paper, we mainly focus on distance-based DA methods.

For distance functions adopted by DA, the first attempt is the Proxy $\mathcal{A}$-distance ben2010theory, which aims to minimize the generalization error by discriminating between source and target samples. Maximum Mean Discrepancy (MMD) gretton2006kernel is a popular distance measure between two domains and it has been used in Deep Domain Confusion (DDC) tzeng2014deep and Deep Adaptation Network (DAN) long2015learning. Although numerous distance-based DA methods have been proposed, learning a domain-invariant feature representation is still challenging since distances in a high-dimensional space may not truly reflect the domain discrepancy. Moreover, all of these methods are developed with hand-crafted network architectures. Since the difficulty levels of different DA tasks are not the same, complex tasks may require a more sophisticated network architecture than easy tasks. Hence, using the same hand-crafted network architecture for all tasks may limit the capacity and versatility of DA methods.

To alleviate these limitations, in this paper, we propose a new similarity function, which is called Population Correlation (PC), to measure the similarity between the source and target domains. Based on the PC function, we propose a novel domain adaptation method called Domain Adaptation by Maximizing Population Correlation (DAMPC). DAMPC aims to maximize the PC between the source and target domains so that a learning model can learn a domain-invariant feature representation. Specifically, with the PC defined as the maximum of pairwise correlations between source and target samples, the proposed DAMPC method maximizes the PC to force the two domains to have similar distributions while minimizing the classification loss on the labeled source samples. Built on the DAMPC method, we design a reinforcement-based Neural Architecture Search (NAS) method called DAMPC-NAS to search for an optimal network architecture for DAMPC. In this way, DAMPC-NAS can learn suitable network architectures for different DA tasks. To the best of our knowledge, the proposed DAMPC-NAS method is the first NAS framework designed for similarity-based DA methods. DAMPC-NAS is also one of the few works that integrate NAS methods into deep DA methods. Our contributions are summarized as follows.

  • We propose a new similarity measure, i.e., PC, to measure the domain similarity. Based on the PC, we propose a DAMPC method for DA.

  • We design the DAMPC-NAS framework to search optimal network architectures for the proposed DAMPC method.

  • Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed methods.

2 Related Work

Neural Architecture Search

NAS aims to design the architecture of a neural network in an automated way. Compared with manually designed architectures of neural networks, NAS has demonstrated the capability to find architectures with state-of-the-art performance in various tasks pham2018efficient; lsy19; ghiasi2019fpn. For example, the NAS-FPN method ghiasi2019fpn leverages NAS to learn an effective architecture of the feature pyramid network for object detection.

Although NAS can achieve satisfactory performance, the high computational cost of the search procedure makes NAS less attractive. To accelerate the search procedure, one-shot NAS leverages a supergraph, which contains all the candidate architectures in the search space. In the supergraph, weights of operations on edges are shared across different candidate architectures. ENAS pham2018efficient employs a reinforcement-based method to train a controller that samples architectures from a supergraph with a weight sharing mechanism. DARTS lsy19 searches architectures with a differentiable objective function based on a supergraph that uses the softmax function to combine all candidate operations on each edge. The final architecture is determined based on the weights corresponding to the candidate operations on each edge.

Domain Adaptation

DA aims to transfer the knowledge learned from a source domain with labeled data to a target domain without labeled data, where there is a shift between the two domains. As discussed in the introduction, recent works in DA can be mainly grouped into two categories: distance-based methods and adversarial DA methods. In this paper, we mainly focus on distance-based methods, which minimize the discrepancy between the source and target domains via some measures, including the MMD used in DDC tzeng2014deep, DAN long2015learning, Weighted Domain Adaptation Network (WDAN) yan2017mind, Joint Adaptation Networks (JAN) long2017deep, and Deep Subdomain Adaptation Network (DSAN) zhu2020deep, the Kullback-Leibler divergence adopted in Transfer Learning with Deep Autoencoders (TLDA) zhuang2015supervised, the second-order statistics utilized in CORrelation ALignment (CORAL) sun2016return; sun2016deep, and the Central Moment Discrepancy (CMD) zellinger2017central.


Figure 1: Overview of the DAMPC-NAS framework. Source and target data first go through the feature extractor to extract hidden features. The controller samples cell choices for each cell and connections between the cells from search space to generate the architecture of the sampled network. Source and target data with the extracted feature representation then go through the sampled network. Finally, the cross-entropy loss is minimized and the PC is maximized. The controller’s policy is updated by the reward of the negative overall loss.

Neural Architecture Search for Domain Adaptation

There are few works on NAS for DA. To improve the generalization ability of neural networks for DA, li2020adapting analyze the generalization bound of neural architectures and propose the AdaptNAS method to adapt neural architectures between domains. li2020network propose a DARTS-like method for DA, which combines DARTS and DA into one framework. robbiano2021adversarial aim to learn an auxiliary branch network from data for an adversarial DA method. In this paper, different from those works, we aim to leverage NAS to search for optimal neural architectures for the proposed DAMPC method.

3 Methodology

In this section, we introduce the proposed PC similarity and the DAMPC method as well as the DAMPC-NAS method.

3.1 Population Correlation

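We first present the definition of PC. Here we study DA under the unsupervised setting, that is, the target domain has unlabeled data only. In DA, the source domain has $n_s$ labeled samples $\mathcal{D}_s = \{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{n_s}$ and the target domain has $n_t$ unlabeled samples $\mathcal{D}_t = \{\mathbf{x}_j^t\}_{j=1}^{n_t}$. To adapt the classifier trained on the source domain to the target domain, one solution is to minimize the domain discrepancy or equivalently maximize the domain similarity. To achieve this, we propose the PC to measure the similarity between the source and target domains. Specifically, suppose $f(\cdot)$ is the feature extraction network. Then the PC between the source and target domains can be computed based on each pair of source and target samples as

$$\mathrm{PC}(\mathcal{D}_s, \mathcal{D}_t) = \max_{i \in [n_s],\, j \in [n_t]} \rho\big(f(\mathbf{x}_i^s), f(\mathbf{x}_j^t)\big) = \max_{i \in [n_s],\, j \in [n_t]} \frac{f(\mathbf{x}_i^s)^\top f(\mathbf{x}_j^t)}{\|f(\mathbf{x}_i^s)\|_2 \, \|f(\mathbf{x}_j^t)\|_2}, \qquad (1)$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm of a vector, $\rho(\cdot, \cdot)$ denotes the correlation between two vectors, and $[n]$ denotes the set of integers $\{1, \ldots, n\}$ for an integer $n$. Here we use the cosine similarity to calculate the correlation between two vectors, thus the larger the PC value is, the more similar the two domains are.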
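Reading the PC as the maximum pairwise cosine similarity between source and target features (as described in the introduction), a minimal pure-Python sketch could look as follows. The function names `cosine` and `population_correlation` and the toy feature vectors are illustrative assumptions, not the authors' implementation; in practice the features would be mini-batch outputs of the feature extraction network.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def population_correlation(source_feats, target_feats):
    """Maximum cosine similarity over all (source, target) feature pairs."""
    return max(cosine(s, t) for s in source_feats for t in target_feats)

# Toy features (assumed values, for illustration only).
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.0, 1.0], [-1.0, 0.0]]
pc = population_correlation(source, target)
```

Because the cosine similarity lies in [-1, 1], so does this quantity, and a value close to 1 means that at least one source/target pair is almost perfectly aligned in feature space.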

3.2 DAMPC

Built on the PC introduced in the previous section, in this section, we present the proposed DAMPC method which aims to learn a domain-invariant feature representation. For DA tasks, the hidden feature representations learned by the feature extraction network should be not only discriminative to train a strong classifier but also domain-invariant to both the source and target domains. Maximizing the PC alone only encourages a domain-invariant feature representation, while minimizing the classification loss alone only encourages a discriminative one. Therefore, we combine the classification loss and the PC to obtain the final objective function, which is formulated as

$$\mathcal{L} = \frac{1}{n_s} \sum_{i=1}^{n_s} \ell\big(h(f(\mathbf{x}_i^s)), y_i^s\big) - \lambda \, \mathrm{PC}(\mathcal{D}_s, \mathcal{D}_t), \qquad (2)$$

where $(\mathbf{x}_i^s, y_i^s)$ is the $i$-th of the $n_s$ labeled source samples, $f(\cdot)$ is the feature extraction network, $\lambda$ is a trade-off parameter, $h(\cdot)$ denotes the classification layer, and $\ell(\cdot, \cdot)$ denotes the classification loss such as the cross-entropy loss.

By minimizing Eq. (2), the final learned feature representations are not only discriminative for classification but also domain-invariant for the adaptation.
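The combination of the two terms can be sketched numerically as below, assuming the objective is the average source cross-entropy minus a trade-off weight times the PC (the negative PC term of Eq. (2)). All names (`cross_entropy`, `max_cosine`, `dampc_loss`, `lam`) and the toy inputs are hypothetical and chosen for illustration; this is not the authors' code.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax of the true class, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def max_cosine(src_feats, tgt_feats):
    """PC read as the maximum pairwise cosine similarity."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    return max(cos(s, t) for s in src_feats for t in tgt_feats)

def dampc_loss(src_logits, src_labels, src_feats, tgt_feats, lam):
    """Average source classification loss minus lam times the PC."""
    ce = sum(cross_entropy(z, y) for z, y in zip(src_logits, src_labels))
    ce /= len(src_labels)
    return ce - lam * max_cosine(src_feats, tgt_feats)

# Toy example: one source sample with uninformative logits.
aligned = dampc_loss([[0.0, 0.0]], [0], [[1.0, 0.0]], [[1.0, 0.0]], lam=0.5)
orthogonal = dampc_loss([[0.0, 0.0]], [0], [[1.0, 0.0]], [[0.0, 1.0]], lam=0.5)
```

With identical classification loss, the configuration whose target feature is aligned with a source feature receives the lower objective value, which is the behaviour the PC term is meant to reward.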

3.3 DAMPC-NAS

In this section, we introduce the proposed DAMPC-NAS framework that finds an optimal architecture for the DAMPC method introduced in the previous section. An overview of the DAMPC-NAS framework is shown in Figure 1.

Cell-based Search Space

We design the search space on top of the ResNet-50 backbone, whose architecture is kept fixed, and hence we only search the architecture after the backbone. The search space of the DAMPC-NAS method consists of two parts: within cells and between cells. We design the cell as the composition of a fully connected layer, a batch-norm layer, and a dropout layer as well as the associated activation functions. Within the cell, we search for the size of the fully connected layer and the location of the skip connection. Specifically, the size of the fully connected layer in a cell can be ‘the same as input size’ or ‘the half of input size’. The starting location of the skip connection can be chosen from the cell input, the fully connected layer, and the batch-norm layer. Between the cells, we search for input and output connections of the cells. For example, if there are three cells in the search space, the input of “Cell 1” can be chosen from the outputs of “Backbone” and “Cell 0”, and the input of “Cell 2” can be chosen from the outputs of “Cell 0” and “Cell 1”. Hence, the input of a cell can be chosen from the outputs of the previous two cells. The PC can be computed from the output of any one of the cells. Moreover, the output of one of the cells, i.e., “Cell 0”, “Cell 1” or “Cell 2”, is connected to the classifier trained on source domain data. Hence, the total number of configurations in the search space is the product of the numbers of these within-cell and between-cell choices. An illustration of the search space in the DAMPC-NAS method is shown in Figure 2. For efficiency, we use the search space with three cells in all experiments.

Figure 2: The search space of the DAMPC-NAS method. Dashed lines represent possible search choices and numbered grey circles indicate the order of choices generated from the controller.
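To make the structure of the search space concrete, the following sketch enumerates and samples configurations under one reading of the description above: three cells, two fully connected sizes and three skip-connection starts per cell, each cell input drawn from the previous two modules, and the PC and the classifier each attached to one cell output. The encoding, the names (`choice_groups`, `sample_architecture`), and the assumption that all choices are independent are illustrative; the resulting count need not match the number reported in the original paper.

```python
import itertools
import random

NUM_CELLS = 3  # assumed number of cells
FC_SIZES = ("same_as_input", "half_of_input")
SKIP_STARTS = ("cell_input", "fc", "batch_norm")

def input_choices(i):
    """Candidate inputs of cell i: outputs of the previous two modules."""
    modules = ["backbone"] + [f"cell{k}" for k in range(NUM_CELLS)]
    return tuple(modules[max(0, i - 1): i + 1])

def choice_groups():
    """One tuple of options per decision made by the controller."""
    groups = []
    for i in range(NUM_CELLS):
        groups += [FC_SIZES, SKIP_STARTS, input_choices(i)]
    cells = tuple(f"cell{k}" for k in range(NUM_CELLS))
    groups += [cells, cells]  # which output feeds the PC, which the classifier
    return groups

def sample_architecture(rng):
    """Draw one option per decision uniformly at random."""
    return [rng.choice(group) for group in choice_groups()]

num_configs = sum(1 for _ in itertools.product(*choice_groups()))
arch = sample_architecture(random.Random(0))
```

A learned controller would replace the uniform `rng.choice` with a softmax over each group of options; the list-of-decisions encoding stays the same.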

Searching optimal architecture

The searching algorithm for the DAMPC-NAS method is described in Algorithm 1. DAMPC-NAS is a reinforcement-based NAS framework which leverages a controller network to sample architectures from the search space. The controller network is an LSTM that samples each search choice via a softmax classifier. We denote by $\theta$ the learnable parameters of the controller. The policy of the controller is denoted by $\pi(\cdot\,; \theta)$.

In each epoch, the training procedure of DAMPC-NAS consists of two phases. In the first phase, we fix the parameters $\theta$ of the controller and train the shared weights $w$ in the search space. Specifically, the controller samples an architecture $m$ from the search space with policy $\pi(\cdot\,; \theta)$. For each mini-batch from the source data $\mathcal{D}_s$ and the target data $\mathcal{D}_t$, the loss $\mathcal{L}$ is computed according to Eq. (2) and the shared weights of the sampled architecture are updated by minimizing $\mathcal{L}$. In the second phase, we fix all the shared weights in the search space and update the parameters $\theta$ of the controller. Specifically, after one epoch of training, $-\mathcal{L}$ is used as the reward to update the policy of the controller. The gradient is computed via the REINFORCE algorithm williams1992simple with a moving average baseline.

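Input: source data $\mathcal{D}_s$, target data $\mathcal{D}_t$, the number of training epochs $T$
Output: the searched architecture with learned weights
1  initialize the controller;
2  for $t = 1$ to $T$ do
3      sample an architecture $m$ from the search space with policy $\pi(\cdot\,; \theta)$;
       // fix the controller policy and train the shared weights $w$ in $m$
4      for each mini-batch in $\mathcal{D}_s$ and $\mathcal{D}_t$ do
5          compute $\mathcal{L}$ in Eq. (2) with $m$;
6          update $w$ in $m$ by minimizing $\mathcal{L}$;
7      end for
       // fix $w$ in $m$ and update $\theta$ in the policy
8      calculate the reward of $m$ as $R = -\mathcal{L}$;
9      update $\theta$ in $\pi(\cdot\,; \theta)$ with reward $R$;
10 end for
Return: $m$ with trained weights
Algorithm 1: Overview of DAMPC-NAS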
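The controller update in the second phase can be illustrated with a toy REINFORCE loop with a moving-average baseline. The sketch below makes strong simplifying assumptions that are not part of the method: the controller is a single softmax over three candidate architectures rather than an LSTM, and each architecture has a fixed, known loss instead of a loss obtained by training shared weights. All names (`run_search`, `decay`, `lr`) and values are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_search(losses, epochs=2000, lr=0.1, decay=0.9, seed=0):
    """Toy REINFORCE: one logit per candidate, reward = negative loss."""
    rng = random.Random(seed)
    theta = [0.0] * len(losses)  # controller parameters
    baseline = 0.0               # moving average of past rewards
    for _ in range(epochs):
        probs = softmax(theta)
        arch = rng.choices(range(len(losses)), weights=probs)[0]
        reward = -losses[arch]
        advantage = reward - baseline
        # Gradient of log pi(arch) w.r.t. the logits is onehot(arch) - probs.
        for k in range(len(theta)):
            grad = (1.0 if k == arch else 0.0) - probs[k]
            theta[k] += lr * advantage * grad
        baseline = decay * baseline + (1.0 - decay) * reward
    return softmax(theta)

# Assumed per-architecture losses; index 1 is the best candidate.
final_probs = run_search([1.0, 0.2, 0.7])
```

Subtracting the baseline does not change the expected gradient but reduces its variance, which is the usual motivation for pairing REINFORCE with a moving average of past rewards.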

In summary, the DAMPC-NAS method is a one-shot style NAS method. That is, the DAMPC-NAS method trains a supernet that contains all shared parameters in the search space during the searching process. The DAMPC-NAS method samples a child network in each epoch to calculate the loss function defined in Eq. (2) and updates its shared parameters in the search space. Parameters in the controller are updated by the reward, which is the negative loss of the sampled child network. After searching, all weights of the final architecture are retained for testing. Different from two-stage one-shot NAS methods, there is no need for the DAMPC-NAS method to retrain the final architecture from scratch for testing since DAMPC-NAS can directly optimize the objective in Eq. (2), which is just the negative reward for the controller, in an end-to-end manner. In this way, the architecture is optimized alongside child networks’ parameters. Therefore, the final architecture derived from the DAMPC-NAS method can be deployed directly without parameter retraining, which improves the efficiency.

4 Experiments

In this section, we empirically evaluate the proposed method.

| Type | Method | A→D | A→W | D→A | D→W | W→A | W→D | Avg |
|---|---|---|---|---|---|---|---|---|
| | ResNet-50 he2016deep | 68.9 | 68.4 | 62.5 | 96.7 | 60.7 | 99.3 | 76.1 |
| Dist Based | JDA long2013transfer | 80.7 | 73.6 | 64.7 | 96.5 | 63.1 | 98.6 | 79.5 |
| Dist Based | DDC tzeng2014deep | 76.5 | 75.6 | 62.2 | 96.0 | 61.5 | 98.2 | 78.3 |
| Dist Based | DAN long2015learning | 78.6 | 80.5 | 63.6 | 97.1 | 62.8 | 99.6 | 80.4 |
| Dist Based | D-CORAL sun2016deep | 81.5 | 77.0 | 65.9 | 97.1 | 64.3 | 99.6 | 80.9 |
| Dist Based | JAN long2017deep | 84.7 | 85.4 | 68.6 | 97.4 | 70.0 | 99.8 | 84.3 |
| Dist Based | MDDA wang2020transfer | 86.3 | 86.0 | 72.1 | 97.1 | 73.2 | 99.2 | 85.7 |
| Adv Based | DANN ganin2015unsupervised | 79.7 | 82.0 | 68.2 | 96.9 | 67.4 | 99.1 | 82.2 |
| Adv Based | ADDA tzeng2017adversarial | 77.8 | 86.2 | 69.5 | 96.2 | 68.9 | 98.4 | 82.9 |
| Adv Based | CAN zhang2018collaborative | 85.5 | 81.5 | 65.9 | 98.2 | 63.4 | 99.7 | 82.4 |
| Adv Based | DDAN wang2020transfer | 84.9 | 88.8 | 65.3 | 96.7 | 65.0 | 100.0 | 83.5 |
| | DAMPC-NAS (Ours) | 89.16 | 93.08 | 70.36 | 98.74 | 69.05 | 100.0 | 86.69 |

Table 1: Accuracy (%) on the Office-31 dataset with ResNet-50 as the backbone.

| Type | Method | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | ResNet-50 he2016deep | 34.9 | 50.0 | 58.0 | 37.4 | 41.9 | 46.2 | 38.5 | 31.2 | 60.4 | 53.9 | 41.2 | 59.9 | 46.1 |
| Dist Based | JDA long2013transfer | 38.9 | 54.8 | 58.2 | 36.2 | 53.1 | 50.2 | 42.1 | 38.2 | 63.1 | 50.2 | 44.0 | 68.2 | 49.8 |
| Dist Based | DAN long2015learning | 43.6 | 57.0 | 67.9 | 45.8 | 56.5 | 60.4 | 44.0 | 43.6 | 67.7 | 63.1 | 51.5 | 74.3 | 56.3 |
| Dist Based | D-CORAL sun2016deep | 42.2 | 59.1 | 64.9 | 46.4 | 56.3 | 58.3 | 45.4 | 41.2 | 68.5 | 60.1 | 48.2 | 73.1 | 55.3 |
| Dist Based | JAN long2017deep | 45.9 | 61.2 | 68.9 | 50.4 | 59.7 | 61.0 | 45.8 | 43.4 | 70.3 | 63.9 | 52.4 | 76.8 | 58.3 |
| Adv Based | DANN ganin2015unsupervised | 45.6 | 59.3 | 70.1 | 47.0 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6 |
| Adv Based | CDAN long2017conditional | 46.6 | 65.9 | 73.4 | 55.7 | 62.7 | 64.2 | 51.8 | 49.1 | 74.5 | 68.2 | 56.9 | 80.7 | 62.8 |
| Adv Based | DDAN wang2020transfer | 51.0 | 66.0 | 73.9 | 57.0 | 63.1 | 65.1 | 52.0 | 48.4 | 72.7 | 65.1 | 56.6 | 78.9 | 62.5 |
| | DAMPC-NAS (Ours) | 46.53 | 68.42 | 75.24 | 58.3 | 66.3 | 67.48 | 56.94 | 44.77 | 75.33 | 69.26 | 51.94 | 80.33 | 63.4 |

Table 2: Accuracy (%) on the Office-Home dataset with ResNet-50 as the backbone.

4.1 Setup

We conduct experiments on three benchmark datasets, including Office-31 saenko2010adapting, Office-Home venkateswara2017deep, and VisDA-2017 peng2017visda. The Office-31 dataset has 4,652 images in 31 categories collected from three distinct domains: Amazon (A), Webcam (W) and DSLR (D). We can construct six transfer tasks: A→W, D→W, W→D, A→D, D→A, and W→A. The Office-Home dataset consists of 15,500 images in 65 object classes under the office and home settings, forming four extremely dissimilar domains, namely Artistic (Ar), Clip Art (Cl), Product (Pr), and Real-World (Rw), and 12 transfer tasks. The VisDA-2017 dataset has over 280K images across 12 classes. It contains two very distinct domains: Synthetic, which contains renderings of 3D models from different angles and under different lighting conditions, and Real, which contains natural images. On this dataset, we study one transfer task: Synthetic→Real.

We compare the proposed DAMPC-NAS method with state-of-the-art DA methods, including Joint Distribution Adaptation (JDA) long2013transfer, Deep Domain Confusion (DDC) tzeng2014deep, Deep Adaptation Network (DAN) long2015learning, Domain Adversarial Neural Network (DANN) ganin2015unsupervised, Correlation Alignment for Deep Domain Adaptation (D-CORAL) sun2016deep, Residual Transfer Networks (RTN) long2016unsupervised, Joint Adaptation Networks (JAN) long2017deep, Adversarial Discriminative Domain Adaptation (ADDA) tzeng2017adversarial, Conditional Domain Adversarial Networks (CDAN) long2017conditional, Collaborative and Adversarial Network (CAN) zhang2018collaborative, Manifold Dynamic Distribution Adaptation (MDDA) wang2020transfer, and Dynamic Distribution Adaptation Network (DDAN) wang2020transfer. The results of baseline methods are directly reported from DDAN wang2020transfer and CDAN long2017conditional.

We use the PyTorch package paszke2017automatic to implement all the models and leverage the ResNet-50 network he2016deep pretrained on the ImageNet dataset as the backbone for the feature extraction. For optimization, we use mini-batch SGD with a Nesterov momentum of 0.9. The learning rate is adjusted by $\eta_p = \eta_0 / (1 + \alpha p)^{\beta}$, where $p$ is the index of training steps, $\eta_0$ = 0.1, $\alpha$ = 0.001, and $\beta$ = 0.75. The batch size is set to 128 for all the datasets.
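As a small worked example, the schedule can be written as a function of the training step, assuming it has the common inverse-decay form in which the initial rate 0.1 is divided by (1 + 0.001 × step) raised to the power 0.75. The function and argument names are hypothetical, and the functional form is an assumption based on the three constants listed above.

```python
def learning_rate(step, eta0=0.1, alpha=0.001, beta=0.75):
    """Inverse-decay schedule: eta0 / (1 + alpha * step) ** beta."""
    return eta0 / (1.0 + alpha * step) ** beta

# Learning rates at a few training steps.
schedule = [learning_rate(p) for p in (0, 1000, 10000)]
```

Under this form the rate starts at 0.1 and decays smoothly and monotonically, so early steps move quickly while later steps fine-tune the weights.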

4.2 Results

Figure 3: Searched architecture for transfer task D→W of the Office-31 dataset. Left: architectures within the three cells. Right: connections between the three cells, the PC, and the classifier.

The classification results on the Office-31 dataset are shown in Table 1. The proposed DAMPC-NAS method achieves the best average accuracy. DAMPC-NAS performs the best in four out of six transfer tasks, with especially large gains on A→D and A→W, which transfer from a large source domain to a small target domain. In the other two tasks, D→A and W→A, DAMPC-NAS is outperformed by the best baseline method (MDDA). These results imply that the proposed DAMPC-NAS model works well when the source data is sufficient and that it is able to learn transferable feature representations for effective domain adaptation.

Figure 3 shows the architecture found by DAMPC-NAS for the transfer task D→W constructed on the Office-31 dataset. The left part of Figure 3 shows the search choices within the three cells found by the DAMPC-NAS method and the right part of Figure 3 shows the connections among the three cells, the PC, and the classifier. In Cell 0, the DAMPC-NAS method chooses the FC layer with the same size as the input and the skip connection starts from the batch-norm layer. In Cell 1, the choice of FC is the same as in Cell 0 but the skip connection starts from the cell input. In Cell 2, the skip connection is the same as in Cell 1 but the FC layer is half the size of the input. For connections between cells, the DAMPC-NAS method chooses to use the output of Cell 0 to calculate the PC and the output of Cell 1 to calculate the classification loss. For the simple transfer task D→W, the searched architecture therefore uses only two cells, which indicates that the DAMPC-NAS method can adaptively learn an architecture depending on the complexity of the DA task. Moreover, the location of the skip connection moves forward in Cell 1 and Cell 2 when compared with Cell 0, which helps reduce the network depth and alleviate the vanishing gradient problem.

Table 2 shows the classification results on the Office-Home dataset. According to the results, DAMPC-NAS achieves the best average accuracy and performs the best in eight out of twelve transfer tasks. When transferring from a large source domain to a small target domain (i.e., Cl→Ar, Pr→Ar, and Rw→Ar), DAMPC-NAS achieves the best performance. This phenomenon is similar to that on the Office-31 dataset, which again demonstrates that the proposed DAMPC-NAS model works well when the source data is sufficient.

According to the experimental results on the most challenging VisDA-2017 dataset shown in Table 3, the proposed DAMPC-NAS method outperforms all the baseline methods, including the strongest baseline CDAN (68.75% versus 66.8%), which again demonstrates the effectiveness of the proposed method.

| Type | Method | Synthetic→Real |
|---|---|---|
| | ResNet-50 he2016deep | 45.6 |
| Dist Based | DAN long2015learning | 53.0 |
| Dist Based | RTN long2016unsupervised | 53.6 |
| Dist Based | JAN long2017deep | 61.6 |
| Adv Based | DANN ganin2015unsupervised | 55.0 |
| Adv Based | CDAN long2017conditional | 66.8 |
| | DAMPC-NAS (Ours) | 68.75 |

Table 3: Accuracy (%) on the VisDA-2017 dataset with ResNet-50 as the backbone.

4.3 Ablation Study

Firstly, we conduct an ablation study on the Office-31, Office-Home, and VisDA-2017 datasets to demonstrate the effectiveness of the proposed PC. We compare PC with widely used distance functions, including Proxy $\mathcal{A}$-distance, Kullback-Leibler divergence (KL-divergence), Maximum Mean Discrepancy (MMD), CORrelation ALignment (CORAL), and Central Moment Discrepancy (CMD). For fair comparison, we only replace the negative PC term in Eq. (2) with each of these distance functions. Specifically, we adopt the ResNet-50 as the backbone, followed by the bottleneck layer (consisting of a fully connected layer, a batch normalization layer, a ReLU activation function, and a dropout function) used for generating hidden features and a fully connected layer used for prediction. According to the experimental results shown in Tables 4, 5 and 6, none of these distance functions yields a clear performance improvement over using no distance function (i.e., ResNet-50 with the bottleneck layer only, the "None" rows). One possible reason is that the normalization layer used in the bottleneck layer has already improved the performance of the ResNet-50, so adding these distance functions cannot improve the performance further. However, the proposed PC still obtains a performance improvement over this baseline, which indicates the effectiveness of the proposed PC.
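The swap used in this ablation can be sketched as follows: the negative PC term of Eq. (2) is replaced by a positively weighted distance. The sketch uses a linear-kernel MMD (the squared distance between the source and target feature means) purely as an assumed stand-in, since the kernels and estimators used in the ablation are not specified here; all function names and the toy values are illustrative.

```python
def mean_vector(feats):
    """Coordinate-wise mean of a list of equal-length feature vectors."""
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]

def linear_mmd(src_feats, tgt_feats):
    """Squared Euclidean distance between source and target feature means."""
    mu_s, mu_t = mean_vector(src_feats), mean_vector(tgt_feats)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))

def objective_with_distance(cls_loss, src_feats, tgt_feats, lam):
    """Classification loss plus lam times a distance (replaces -lam * PC)."""
    return cls_loss + lam * linear_mmd(src_feats, tgt_feats)

# Toy features (assumed values, for illustration only).
src = [[1.0, 0.0], [3.0, 0.0]]
tgt = [[0.0, 0.0], [0.0, 2.0]]
mmd = linear_mmd(src, tgt)
total = objective_with_distance(0.5, src, tgt, lam=0.1)
```

Note the sign change: a similarity such as the PC is subtracted from the loss, whereas a distance is added, so that minimizing the objective pulls the two domains together in both cases.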

| Measurement | A→D | A→W | D→A | D→W | W→A | W→D | Avg |
|---|---|---|---|---|---|---|---|
| None | 83.53 | 80.50 | 64.61 | 98.49 | 62.69 | 100.0 | 81.64 |
| Proxy $\mathcal{A}$-distance | 82.73 | 81.01 | 64.04 | 98.11 | 61.77 | 100.0 | 81.28 |
| KL-divergence | 83.94 | 79.75 | 63.90 | 97.86 | 63.51 | 99.80 | 81.46 |
| MMD | 83.13 | 79.25 | 64.11 | 98.74 | 63.12 | 100.0 | 81.39 |
| CORAL | 84.34 | 80.25 | 64.61 | 98.24 | 62.80 | 99.80 | 81.67 |
| CMD | 82.93 | 79.50 | 64.29 | 98.62 | 63.10 | 100.0 | 81.41 |
| PC (Ours) | 88.35 | 91.32 | 70.36 | 98.49 | 69.05 | 100.0 | 86.26 |

Table 4: Ablation Study on the Office-31 dataset with ResNet-50 as the backbone.

| Measurement | Synthetic→Real |
|---|---|
| None | 57.68 |
| Proxy $\mathcal{A}$-distance | 56.36 |
| KL-divergence | 56.27 |
| MMD | 58.76 |
| CORAL | 56.66 |
| CMD | 56.65 |
| PC (Ours) | 65.25 |

Table 5: Ablation Study on the VisDA-2017 dataset with ResNet-50 as the backbone.

| Measurement | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| None | 43.41 | 66.55 | 74.64 | 56.61 | 63.98 | 65.32 | 53.36 | 39.36 | 72.64 | 64.73 | 46.30 | 76.55 | 60.29 |
| Proxy $\mathcal{A}$-distance | 43.21 | 65.44 | 74.85 | 55.09 | 62.51 | 65.37 | 52.33 | 38.63 | 72.83 | 64.57 | 46.23 | 76.66 | 59.81 |
| KL-divergence | 44.01 | 66.75 | 74.50 | 55.75 | 63.42 | 66.51 | 52.74 | 38.14 | 73.43 | 65.84 | 44.79 | 77.13 | 60.25 |
| MMD | 43.78 | 66.28 | 74.48 | 55.62 | 64.07 | 66.19 | 53.40 | 38.30 | 73.15 | 64.89 | 45.52 | 77.43 | 60.26 |
| CORAL | 44.15 | 65.85 | 74.16 | 55.42 | 63.01 | 66.83 | 52.95 | 39.38 | 72.53 | 65.14 | 45.96 | 77.07 | 60.20 |
| CMD | 44.40 | 65.92 | 74.50 | 54.68 | 63.37 | 67.07 | 52.78 | 38.88 | 72.94 | 65.64 | 45.29 | 77.36 | 60.24 |
| PC (Ours) | 46.19 | 66.03 | 73.7 | 57.89 | 63.48 | 65.80 | 56.94 | 44.19 | 75.58 | 69.02 | 51.11 | 78.89 | 62.24 |

Table 6: Ablation Study on the Office-Home dataset with ResNet-50 as the backbone.

Figure 4: t-SNE visualization of different methods for the transfer task A→D in the Office-31 dataset: (a) ResNet-50, (b) DAN, (c) DAMPC-NAS (Ours).

Then we conduct another ablation study on the Office-31 dataset to demonstrate the effectiveness of the architecture searching process in the DAMPC-NAS method. Specifically, we modify Algorithm 1 to search for an optimal architecture for the DAN by replacing the negative PC term with MMD in the objective of Eq. (2). We refer to the resulting method as DAN-NAS. According to the experimental results shown in Figure 5, DAN-NAS performs comparably to, and even slightly better than, DAN on the six transfer tasks in the Office-31 dataset, which demonstrates the usefulness of the search process in the DAMPC-NAS method.

Figure 5: DAMPC-NAS with DAN on the Office-31 dataset.

4.4 Visualization

In Figure 4, we visualize the hidden feature representations for the transfer task A→D constructed on the Office-31 dataset, learned by ResNet-50 (trained on source samples only), DAN, and DAMPC-NAS, respectively. According to Figure 4, samples under the representations learned by ResNet-50 and DAN are hard to distinguish, whereas those under the representation learned by DAMPC-NAS are more separable, which implies that the proposed DAMPC-NAS method can learn discriminative and transferable feature representations for DA.

5 Conclusion

In this paper, we propose a new DA method called DAMPC based on the proposed PC function that can measure the domain similarity. We further design the DAMPC-NAS framework that searches for optimal network architectures for DA tasks. Experimental results on the Office-31, Office-Home, and VisDA-2017 datasets demonstrate the effectiveness of the proposed method. Moreover, the proposed DAMPC-NAS framework has shown its potential to search for optimal architectures for other DA methods. In our future studies, we will apply the proposed DAMPC-NAS framework to search architectures for other DA methods and other DA settings.