1 Introduction
With access to large-scale labeled data, deep neural networks have achieved state-of-the-art performance on a variety of machine learning problems and applications
krizhevsky2012imagenet; oquab2014learning; donahue2014decaf; yosinski2014transferable; ren2015faster; he2016deep; he2017mask. However, collecting enough labeled data for model training is often intolerably time-consuming and labor-expensive for a target domain of interest. One solution is to transfer a deep neural network trained on a data-sufficient source domain to the target domain, where only unlabeled data is available. However, this learning paradigm suffers from the shift in data distributions across domains, which poses a major obstacle to adapting predictive models for the target task.
Domain Adaptation (DA) pan2009survey; yang2020transfer
aims to learn a high-performance learner on a target domain by utilizing the knowledge transferred from a source domain, which has a different but related data distribution. A number of DA methods aim to bridge the gap between the source and target domains so that the classifier learned on the source domain can be applied to the target domain. To achieve this goal, recent DA works can be grouped into two main categories:
distance-based methods ben2007analysis; ben2010theory; zhuang2015supervised; tzeng2014deep; long2015learning; courty2016optimal; sun2016return; sun2016deep; zellinger2017central; chen2019joint and adversarial DA methods ganin2016domain; long2017conditional; pei2018multi; tzeng2017adversarial; saito2018maximum. Both categories aim to learn domain-invariant feature representations. In this paper, we mainly focus on distance-based DA methods.
Among the distance functions adopted by DA, the first attempt is the Proxy distance ben2010theory, which estimates the domain discrepancy by how well a classifier can discriminate between source and target samples. Maximum Mean Discrepancy (MMD) gretton2006kernel is a popular distance measure between two domains and has been used in Deep Domain Confusion (DDC) tzeng2014deep and the Deep Adaptation Network (DAN) long2015learning. Although numerous distance-based DA methods have been proposed, learning domain-invariant feature representations remains challenging, since distances in a high-dimensional space may fail to truly reflect the domain discrepancy. Moreover, all of these methods are built on hand-crafted network architectures. Since different DA tasks have different levels of difficulty, complex tasks may require more sophisticated network architectures than easy ones; hence, using the same hand-crafted architecture may limit the capacity and versatility of DA methods.
To alleviate these limitations, in this paper we propose a new similarity function, called Population Correlation (PC), to measure the similarity between the source and target domains. Based on the PC function, we propose a novel domain adaptation method called Domain Adaptation by Maximizing Population Correlation (DAMPC). DAMPC maximizes the PC between the source and target domains so that a learning model can learn a domain-invariant feature representation. Specifically, with the PC defined as the maximum of pairwise correlations between source and target samples, the proposed DAMPC method maximizes it to force the two domains to have similar distributions while minimizing the classification loss on the labeled source samples. Built on the DAMPC method, we design a reinforcement-based Neural Architecture Search (NAS) method called DAMPC-NAS to search for an optimal network architecture for DAMPC. In this way, DAMPC-NAS can learn suitable network architectures for different DA tasks. To the best of our knowledge, the proposed DAMPC-NAS method is the first NAS framework designed for similarity-based DA methods. DAMPC-NAS is also one of the few works that integrate NAS into deep DA methods. Our contributions are summarized as follows.

We propose a new similarity measure, i.e., the PC, to measure the domain similarity. Based on the PC, we propose the DAMPC method for DA.

We design the DAMPC-NAS framework to search optimal network architectures for the proposed DAMPC method.

Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed methods.
2 Related Work
Neural Architecture Search
NAS aims to design the architecture of a neural network in an automated way. Compared with manually designed neural network architectures, NAS has demonstrated the capability to find architectures with state-of-the-art performance in various tasks pham2018efficient; lsy19; ghiasi2019fpn. For example, the NAS-FPN method ghiasi2019fpn leverages NAS to learn an effective architecture of the feature pyramid network for object detection.
Although NAS can achieve satisfactory performance, the high computational cost of the search procedure makes NAS less attractive. To accelerate the search procedure, one-shot NAS leverages a supergraph that contains all the candidate architectures in the search space. In the supergraph, the weights of operations on edges are shared across different candidate architectures. ENAS pham2018efficient employs a reinforcement-based method to train a controller that samples architectures from a supergraph with a weight-sharing mechanism. DARTS lsy19 searches architectures with a differentiable objective function based on a supergraph in which a softmax function combines all candidate operations on each edge. The final architecture is determined based on the weights corresponding to the candidate operations on each edge.
Domain Adaptation
DA aims to transfer the knowledge learned from a source domain with labeled data to a target domain without labeled data, where there is a domain shift between the two domains. As discussed in the introduction, recent works in DA can mainly be grouped into two categories: distance-based methods and adversarial DA methods. In this paper, we mainly focus on distance-based methods, which minimize the discrepancy between the source and target domains via some measure, including the MMD used in DDC tzeng2014deep, DAN long2015learning, the Weighted Domain Adaptation Network (WDAN) yan2017mind, Joint Adaptation Networks (JAN) long2017deep, and the Deep Subdomain Adaptation Network (DSAN) zhu2020deep, the Kullback-Leibler divergence adopted in Transfer Learning with Deep Autoencoders (TLDA) zhuang2015supervised, the second-order statistics utilized in CORrelation ALignment (CORAL) sun2016return; sun2016deep, and the Central Moment Discrepancy (CMD) zellinger2017central.
Neural Architecture Search for Domain Adaptation
There are few works on NAS for DA. To improve the generalization ability of neural networks for DA, li2020adapting analyze the generalization bound of neural architectures and propose the AdaptNAS method to adapt neural architectures between domains. li2020network propose a DARTS-like method for DA, which combines DARTS and DA into one framework. robbiano2021adversarial aim to learn an auxiliary branch network from data for an adversarial DA method. Different from those works, in this paper we leverage NAS to search for optimal neural architectures for the proposed DAMPC method.
3 Methodology
In this section, we introduce the proposed PC similarity, the DAMPC method, and the DAMPC-NAS method.
3.1 Population Correlation
We first present the definition of the PC. Here we study DA under the unsupervised setting, that is, the target domain has unlabeled data only. In DA, the source domain has $n_s$ labeled samples $\{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{n_s}$ and the target domain has $n_t$ unlabeled samples $\{\mathbf{x}_j^t\}_{j=1}^{n_t}$. To adapt the classifier trained on the source domain to the target domain, one solution is to minimize the domain discrepancy or, equivalently, to maximize the domain similarity. To achieve this, we propose the PC to measure the similarity between the source and target domains. Specifically, suppose $f(\cdot)$ is the feature extraction network. Then the PC between the source and target domains can be computed based on each pair of source and target samples as
$$\mathrm{PC} = \max_{i \in [n_s],\, j \in [n_t]} \rho\big(f(\mathbf{x}_i^s), f(\mathbf{x}_j^t)\big), \qquad \rho(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}^\top \mathbf{b}}{\|\mathbf{a}\|_2\, \|\mathbf{b}\|_2}, \qquad (1)$$
where $\|\cdot\|_2$ denotes the $\ell_2$ norm of a vector, $\rho(\cdot,\cdot)$ denotes the correlation between two vectors, and $[n]$ denotes the set of integers $\{1,\dots,n\}$ for an integer $n$. Here we use the cosine similarity to calculate the correlation between two vectors; thus, the larger the PC value is, the more similar the two domains are.
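To make the definition concrete, below is a minimal PyTorch-style sketch of the PC computation over one mini-batch of source features and one mini-batch of target features; the function name and the batch-wise usage are our own illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def population_correlation(feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
    """Population Correlation between source and target feature batches, as in Eq. (1).

    feat_s: (n_s, d) source features extracted by f(.)
    feat_t: (n_t, d) target features extracted by f(.)
    Returns the maximum pairwise cosine similarity.
    """
    # L2-normalize so that the inner product equals the cosine similarity.
    s = F.normalize(feat_s, dim=1)
    t = F.normalize(feat_t, dim=1)
    corr = s @ t.T        # (n_s, n_t) pairwise correlations
    return corr.max()     # maximum over all source-target pairs
```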
3.2 DAMPC
Built on the PC introduced in the previous section, in this section we present the proposed DAMPC method, which aims to learn a domain-invariant feature representation. For DA tasks, the hidden feature representations learned by the feature extraction network should be not only discriminative enough to train a strong classifier but also invariant to both the source and target domains. Maximizing the PC alone only helps learn a domain-invariant feature representation, and minimizing the classification loss alone only learns a discriminative feature representation. Therefore, we combine the classification loss and the PC to obtain the final objective function, which is formulated as
$$\mathcal{L} = \frac{1}{n_s} \sum_{i=1}^{n_s} \ell\big(g(f(\mathbf{x}_i^s)), y_i^s\big) - \lambda\, \mathrm{PC}, \qquad (2)$$
where $\lambda$ is a trade-off parameter, $g(\cdot)$ denotes the classification layer, and $\ell(\cdot,\cdot)$ denotes the classification loss such as the cross-entropy loss.
By minimizing Eq. (2), the final learned feature representations are not only discriminative for classification but also domain-invariant for the adaptation.
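A minimal sketch of one training step under this objective is given below, reusing the population_correlation helper above; the module names backbone (for $f$) and classifier (for $g$) and the trade-off value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dampc_step(backbone, classifier, optimizer, xs, ys, xt, lam=1.0):
    """One DAMPC update: source classification loss minus lam * PC, as in Eq. (2)."""
    feat_s = backbone(xs)   # source features f(x^s)
    feat_t = backbone(xt)   # target features f(x^t)
    cls_loss = F.cross_entropy(classifier(feat_s), ys)
    loss = cls_loss - lam * population_correlation(feat_s, feat_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```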
3.3 DAMPC-NAS
In this section, we introduce the proposed DAMPC-NAS framework, which finds an optimal architecture for the DAMPC method introduced in the previous section. An overview of the DAMPC-NAS framework is shown in Figure 1.
Cell-based Search Space
We design the search space on top of the ResNet-50 backbone, whose architecture is kept fixed; hence, we only search the architecture after the backbone. The search space of the DAMPC-NAS method consists of two parts: within cells and between cells. We design each cell as a composition of a fully connected (FC) layer, a batch-norm layer, and a dropout layer, together with the associated activation functions. Within a cell, we search for the size of the FC layer and the location of the skip connection. Specifically, the search choice of the FC layer in a cell can be ‘the same as the input size’ or ‘half of the input size’. The starting location of the skip connection can be chosen from the cell input, the FC layer, and the batch-norm layer. Between cells, we search for the input and output connections of the cells. For example, if there are three cells in the search space, the input of “Cell 1” can be chosen from the outputs of “Backbone” and “Cell 0”, and the input of “Cell 2” can be chosen from the outputs of “Cell 0” and “Cell 1”; hence, the input of a cell can be chosen from the outputs of the previous two cells. The PC can be computed from the output of any one of the cells. Moreover, one of the cell outputs, i.e., “Cell 0”, “Cell 1”, or “Cell 2”, connects to the classifier trained on the source domain data. Hence, the total search space contains a large number of configurations. An illustration of the search space in the DAMPC-NAS method is shown in Figure 2. In our experiments, for efficiency, we use a search space with three cells; a sketch of one sampled configuration is shown below.
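For illustration, a sampled architecture from this search space could be represented by a simple configuration object; the field names below are our own and not from the paper.

```python
# One hypothetical sampled architecture from the DAMPC-NAS search space.
sampled_architecture = {
    "cells": [
        {"fc_size": "same_as_input", "skip_from": "batchnorm"},   # Cell 0
        {"fc_size": "same_as_input", "skip_from": "cell_input"},  # Cell 1
        {"fc_size": "half_of_input", "skip_from": "cell_input"},  # Cell 2
    ],
    # Between-cell connections: which earlier output feeds each cell.
    "cell_inputs": {"cell_1": "cell_0", "cell_2": "cell_1"},
    # Which cell output is used to compute the PC and the classification loss.
    "pc_from": "cell_0",
    "classifier_from": "cell_1",
}
```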
Searching Optimal Architecture
The search algorithm of the DAMPC-NAS method is described in Algorithm 1. DAMPC-NAS is a reinforcement-based NAS framework that leverages a controller network to sample architectures from the search space. The controller network is an LSTM that samples each search choice via a softmax classifier. We denote by $\theta$ the learnable parameters of the controller and by $\pi(\cdot;\theta)$ its policy.
In each epoch, the training procedure of DAMPC-NAS consists of two phases. In the first phase, we fix the parameters $\theta$ of the controller and train the shared weights $w$ in the search space. Specifically, the controller samples an architecture from the search space with policy $\pi(\cdot;\theta)$. For each mini-batch drawn from the source and target domains, the loss $\mathcal{L}$ is computed according to Eq. (2), and the shared weights of the sampled architecture are updated by minimizing $\mathcal{L}$. In the second phase, we fix all the shared weights $w$ in the search space and update the parameters $\theta$ of the controller. Specifically, after one epoch of training, $-\mathcal{L}$ is used as the reward to update the policy of the controller. The gradient is computed via the REINFORCE algorithm williams1992simple with a moving average baseline.
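A minimal sketch of the second-phase controller update using the standard REINFORCE estimator with a moving-average baseline is given below; the controller interface (a stored list of sampled log-probabilities) is an assumption made for illustration.

```python
import torch

def update_controller(controller, ctrl_optimizer, reward, baseline, decay=0.95):
    """REINFORCE update of the controller with a moving-average baseline.

    reward:   Python float, the negative DAMPC loss of the sampled child network.
    baseline: Python float, running average of past rewards.
    Returns the updated baseline.
    """
    # Log-probabilities stored by the controller when it sampled the architecture.
    log_prob = torch.stack(controller.sampled_log_probs).sum()
    advantage = reward - baseline
    policy_loss = -advantage * log_prob   # gradient ascent on the expected reward
    ctrl_optimizer.zero_grad()
    policy_loss.backward()
    ctrl_optimizer.step()
    # Update the moving-average baseline.
    return decay * baseline + (1.0 - decay) * reward
```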
In summary, the DAMPC-NAS method is a one-shot NAS method. That is, DAMPC-NAS trains a supernet that contains all the shared parameters in the search space during the search process, samples a child network in each epoch to calculate the loss function defined in Eq. (2), and updates the shared parameters of that child network in the search space. Parameters of the controller are updated by the reward, which is the negative loss of the sampled child network. After searching, all weights of the final architecture are retained for testing. Different from two-stage one-shot NAS methods, DAMPC-NAS does not need to retrain the final architecture from scratch for testing, since it directly optimizes the objective in Eq. (2), which is just the negative reward of the controller, in an end-to-end manner. In this way, the architecture is optimized alongside the child networks' parameters. Therefore, the final architecture derived from the DAMPC-NAS method can be deployed directly without parameter retraining, which improves efficiency.
4 Experiments
In this section, we empirically evaluate the proposed method.
Type  Method  A→D  A→W  D→A  D→W  W→A  W→D  Avg 

ResNet-50 he2016deep  68.9  68.4  62.5  96.7  60.7  99.3  76.1  
Dist Based  JDA long2013transfer  80.7  73.6  64.7  96.5  63.1  98.6  79.5 
DDC tzeng2014deep  76.5  75.6  62.2  96.0  61.5  98.2  78.3  
DAN long2015learning  78.6  80.5  63.6  97.1  62.8  99.6  80.4  
DCORAL sun2016deep  81.5  77.0  65.9  97.1  64.3  99.6  80.9  
JAN long2017deep  84.7  85.4  68.6  97.4  70.0  99.8  84.3  
MDDA wang2020transfer  86.3  86.0  72.1  97.1  73.2  99.2  85.7  
Adv Based  DANN ganin2015unsupervised  79.7  82.0  68.2  96.9  67.4  99.1  82.2 
ADDA tzeng2017adversarial  77.8  86.2  69.5  96.2  68.9  98.4  82.9  
CAN zhang2018collaborative  85.5  81.5  65.9  98.2  63.4  99.7  82.4  
DDAN wang2020transfer  84.9  88.8  65.3  96.7  65.0  100.0  83.5  
DAMPC-NAS (Ours)  89.16  93.08  70.36  98.74  69.05  100.0  86.69 
Type  Method  Ar→Cl  Ar→Pr  Ar→Rw  Cl→Ar  Cl→Pr  Cl→Rw  Pr→Ar  Pr→Cl  Pr→Rw  Rw→Ar  Rw→Cl  Rw→Pr  Avg 

ResNet-50 he2016deep  34.9  50.0  58.0  37.4  41.9  46.2  38.5  31.2  60.4  53.9  41.2  59.9  46.1  
Dist Based  JDA long2013transfer  38.9  54.8  58.2  36.2  53.1  50.2  42.1  38.2  63.1  50.2  44.0  68.2  49.8 
DAN long2015learning  43.6  57.0  67.9  45.8  56.5  60.4  44.0  43.6  67.7  63.1  51.5  74.3  56.3  
DCORAL sun2016deep  42.2  59.1  64.9  46.4  56.3  58.3  45.4  41.2  68.5  60.1  48.2  73.1  55.3  
JAN long2017deep  45.9  61.2  68.9  50.4  59.7  61.0  45.8  43.4  70.3  63.9  52.4  76.8  58.3  
Adv Based  DANN ganin2015unsupervised  45.6  59.3  70.1  47.0  58.5  60.9  46.1  43.7  68.5  63.2  51.8  76.8  57.6 
CDAN long2017conditional  46.6  65.9  73.4  55.7  62.7  64.2  51.8  49.1  74.5  68.2  56.9  80.7  62.8  
DDAN wang2020transfer  51.0  66.0  73.9  57.0  63.1  65.1  52.0  48.4  72.7  65.1  56.6  78.9  62.5  
DAMPC-NAS (Ours)  46.53  68.42  75.24  58.3  66.3  67.48  56.94  44.77  75.33  69.26  51.94  80.33  63.4 
4.1 Setup
We conduct experiments on three benchmark datasets, including Office-31 saenko2010adapting, Office-Home venkateswara2017deep, and VisDA-2017 peng2017visda. The Office-31 dataset has 4,652 images in 31 categories collected from three distinct domains: Amazon (A), Webcam (W), and DSLR (D). We construct six transfer tasks: A→W, D→W, W→D, A→D, D→A, and W→A. The Office-Home dataset consists of 15,500 images in 65 object classes under office and home settings, forming four extremely dissimilar domains: Artistic (Ar), Clip Art (Cl), Product (Pr), and Real-World (Rw), and 12 transfer tasks. The VisDA-2017 dataset has over 280K images across 12 classes. It contains two very distinct domains: Synthetic, which contains renderings of 3D models from different angles and under different lighting conditions, and Real, which consists of natural images. On this dataset, we study one transfer task: Synthetic→Real.
We compare the proposed DAMPC-NAS method with state-of-the-art DA methods, including Joint Distribution Adaptation (JDA) long2013transfer, Deep Domain Confusion (DDC) tzeng2014deep, Deep Adaptation Network (DAN) long2015learning, Domain Adversarial Neural Network (DANN) ganin2015unsupervised, Correlation Alignment for Deep Domain Adaptation (DCORAL) sun2016deep, Residual Transfer Networks (RTN) long2016unsupervised, Joint Adaptation Networks (JAN) long2017deep, Adversarial Discriminative Domain Adaptation (ADDA) tzeng2017adversarial, Conditional Domain Adversarial Networks (CDAN) long2017conditional, Collaborative and Adversarial Network (CAN) zhang2018collaborative, Manifold Dynamic Distribution Adaptation (MDDA) wang2020transfer, and Dynamic Distribution Adaptation Network (DDAN) wang2020transfer. The results of the baseline methods are directly taken from DDAN wang2020transfer and CDAN long2017conditional.
We use the PyTorch package
paszke2017automatic to implement all the models and leverage the ResNet-50 network he2016deep pretrained on the ImageNet dataset russakovsky2015imagenet as the backbone for feature extraction. For optimization, we use mini-batch SGD with Nesterov momentum 0.9. The learning rate is adjusted by $\eta_i = \frac{\eta_0}{(1 + \alpha i)^{\beta}}$, where $i$ is the index of training steps, $\eta_0 = 0.1$, $\alpha = 0.001$, and $\beta = 0.75$.
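As a small sketch, such a schedule can be implemented with PyTorch's LambdaLR; note that the exact formula above is our reconstruction of the garbled original, so the code is illustrative only.

```python
import torch

# Illustrative learning-rate schedule: eta_i = eta_0 / (1 + alpha * i) ** beta.
eta0, alpha, beta = 0.1, 0.001, 0.75
model = torch.nn.Linear(2048, 31)  # placeholder for the actual network
optimizer = torch.optim.SGD(model.parameters(), lr=eta0,
                            momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda i: (1.0 + alpha * i) ** (-beta))
# scheduler.step() is called once per training step to decay the learning rate.
```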
The batch size is set to 128 for all the datasets.
4.2 Results
The classification results on the Office-31 dataset are shown in Table 1. As illustrated in Table 1, the proposed DAMPC-NAS method achieves the best average accuracy. In four out of the six transfer tasks, DAMPC-NAS performs the best, especially on the tasks A→D and A→W, which transfer from a large source domain to a small target domain; in the other two tasks, DAMPC-NAS performs only slightly worse than the best baseline method. These results imply that the proposed DAMPC-NAS model works well when the source data is sufficient and that it is able to learn transferable feature representations for effective domain adaptation.
Figure 3 shows the architecture found by DAMPC-NAS for the transfer task D→W constructed on the Office-31 dataset. The left part of Figure 3 shows the search choices within the three cells found by the DAMPC-NAS method, and the right part shows the connections among the three cells, the PC, and the classifier. In Cell 0, the DAMPC-NAS method chooses the FC layer with the same size as the input, and the skip connection starts from the batch-norm layer. In Cell 1, the choice of the FC layer is the same as in Cell 0, but the skip connection starts from the cell input. In Cell 2, the skip connection is the same as in Cell 1, but the FC layer is half the size of the input. For the connections between cells, the DAMPC-NAS method chooses the output of Cell 0 to calculate the PC and the output of Cell 1 to calculate the classification loss. For the simple transfer task D→W, the searched architecture effectively uses only two cells, which indicates that the DAMPC-NAS method can adaptively learn an architecture depending on the complexity of the DA task. Moreover, the location of the skip connection moves forward in Cell 1 and Cell 2 compared with Cell 0, which helps reduce the effective network depth and alleviate the vanishing gradient problem.
Table 2 shows the classification results on the Office-Home dataset. According to the results, DAMPC-NAS achieves the best average accuracy and performs the best in eight out of the twelve transfer tasks. When transferring from a large source domain to a small target domain (i.e., Cl→Ar, Pr→Ar, and Rw→Ar), DAMPC-NAS achieves the best performance. This phenomenon is similar to that on the Office-31 dataset, which again demonstrates that the proposed DAMPC-NAS model works well when the source data is sufficient.
According to the experimental results on the most challenging VisDA-2017 dataset shown in Table 3, the proposed DAMPC-NAS method outperforms all the baseline methods, improving by nearly 2% over the strongest state-of-the-art baseline (i.e., CDAN) on this dataset, which again demonstrates the effectiveness of the proposed method.
Type  Method  Synthetic→Real 
ResNet-50 he2016deep  45.6  
Dist Based  DAN long2015learning  53.0 
RTN long2016unsupervised  53.6  
JAN long2017deep  61.6  
Adv Based  DANN ganin2015unsupervised  55.0 
CDAN long2017conditional  66.8  
DAMPC-NAS (Ours)  68.75 
4.3 Ablation Study
First, we conduct an ablation study on the Office-31, Office-Home, and VisDA-2017 datasets to demonstrate the effectiveness of the proposed PC. We compare the PC with widely used distance functions, including the Proxy distance, the Kullback-Leibler divergence (KL-divergence), the Maximum Mean Discrepancy (MMD), CORrelation ALignment (CORAL), and the Central Moment Discrepancy (CMD). For a fair comparison, we only replace the negative PC term in Eq. (2) with these distance functions, as sketched below. Specifically, we adopt ResNet-50 as the backbone, followed by a bottleneck layer (consisting of a fully connected layer, a batch normalization layer, a ReLU activation function, and a dropout layer) used for generating hidden features and a fully connected layer used for prediction.
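As an illustration of such a replacement, the following sketch substitutes a simple linear-kernel MMD estimate for the negative PC term; the paper does not specify which MMD estimator was used in this ablation, so this particular choice is our own assumption.

```python
import torch
import torch.nn.functional as F

def linear_mmd(feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
    """Linear-kernel MMD between source and target feature batches (illustrative)."""
    delta = feat_s.mean(dim=0) - feat_t.mean(dim=0)
    return (delta * delta).sum()

def ablation_loss(logits_s, ys, feat_s, feat_t, lam=1.0):
    """Eq. (2) with the negative PC term replaced by an MMD penalty."""
    return F.cross_entropy(logits_s, ys) + lam * linear_mmd(feat_s, feat_t)
```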
According to the experimental results shown in Tables 4, 5, and 6, none of these distance functions obtains a performance improvement compared with using no distance function (i.e., ResNet-50). One possible reason is that the normalization layer used in the bottleneck has already improved the performance of ResNet-50, and adopting these distance functions cannot improve the performance further. However, the proposed PC still obtains a performance improvement over ResNet-50, which indicates the effectiveness of the proposed PC.
Measurement  A→D  A→W  D→A  D→W  W→A  W→D  Avg 

None  83.53  80.50  64.61  98.49  62.69  100.0  81.64 
Proxy distance  82.73  81.01  64.04  98.11  61.77  100.0  81.28 
KLdivergence  83.94  79.75  63.90  97.86  63.51  99.80  81.46 
MMD  83.13  79.25  64.11  98.74  63.12  100.0  81.39 
CORAL  84.34  80.25  64.61  98.24  62.80  99.80  81.67 
CMD  82.93  79.50  64.29  98.62  63.10  100.0  81.41 
PC (Ours)  88.35  91.32  70.36  98.49  69.05  100.0  86.26 
Measurement  Synthetic→Real 

None  57.68 
Proxy distance  56.36 
KLdivergence  56.27 
MMD  58.76 
CORAL  56.66 
CMD  56.65 
PC (Ours)  65.25 
Measurement  Ar→Cl  Ar→Pr  Ar→Rw  Cl→Ar  Cl→Pr  Cl→Rw  Pr→Ar  Pr→Cl  Pr→Rw  Rw→Ar  Rw→Cl  Rw→Pr  Avg 

None  43.41  66.55  74.64  56.61  63.98  65.32  53.36  39.36  72.64  64.73  46.30  76.55  60.29 
Proxy distance  43.21  65.44  74.85  55.09  62.51  65.37  52.33  38.63  72.83  64.57  46.23  76.66  59.81 
KLdivergence  44.01  66.75  74.50  55.75  63.42  66.51  52.74  38.14  73.43  65.84  44.79  77.13  60.25 
MMD  43.78  66.28  74.48  55.62  64.07  66.19  53.40  38.30  73.15  64.89  45.52  77.43  60.26 
CORAL  44.15  65.85  74.16  55.42  63.01  66.83  52.95  39.38  72.53  65.14  45.96  77.07  60.20 
CMD  44.40  65.92  74.50  54.68  63.37  67.07  52.78  38.88  72.94  65.64  45.29  77.36  60.24 
PC (Ours)  46.19  66.03  73.7  57.89  63.48  65.80  56.94  44.19  75.58  69.02  51.11  78.89  62.24 
We then conduct another ablation study on the Office-31 dataset to demonstrate the effectiveness of the architecture search process in the DAMPC-NAS method. Specifically, we modify Algorithm 1 to search for an optimal architecture for DAN by replacing the negative PC term in Eq. (2) with the MMD. According to the experimental results shown in Figure 5, the resulting DAN-NAS performs comparably to and even slightly better than DAN on the six transfer tasks of the Office-31 dataset, which demonstrates the usefulness of the architecture search process in the DAMPC-NAS framework.
4.4 Visualization
We visualize in Figure 4 the hidden feature representations of the transfer task A→D constructed on the Office-31 dataset, learned by ResNet-50 (trained on source samples only), DAN, and DAMPC-NAS, respectively. According to Figure 4, samples with the representations learned by ResNet-50 and DAN are not distinguishable, while those learned by DAMPC-NAS are more separable, which implies that the proposed DAMPC-NAS method can learn discriminative and transferable feature representations for DA.
5 Conclusion
In this paper, we propose a new DA method called DAMPC based on the proposed PC function, which measures the domain similarity. We further design the DAMPC-NAS framework, which searches for optimal network architectures for DA tasks. Experimental results on the Office-31, Office-Home, and VisDA-2017 datasets demonstrate the effectiveness of the proposed methods. Moreover, the proposed DAMPC-NAS framework has shown its potential to search optimal architectures for other DA methods. In future work, we will apply the proposed DAMPC-NAS framework to search architectures for other DA methods and other DA settings.