1 Introduction
Domain Adaptation (DA) pan2009survey; yang2020transfer aims to learn a high-performance learner on a target domain by utilizing knowledge transferred from a source domain, whose data distribution is different from but related to that of the target domain. A number of DA methods aim to bridge the gap between the source and target domains so that a classifier learned on the source domain can be applied to the target domain. To achieve this goal, recent DA works can be grouped into two main categories: distance-based methods ben2007analysis; ben2010theory; zhuang2015supervised; tzeng2014deep; long2015learning; courty2016optimal; sun2016return; sun2016deep; zellinger2017central; chen2019joint and adversarial DA methods ganin2016domain; long2017conditional; pei2018multi; tzeng2017adversarial; saito2018maximum. Both categories aim to learn domain-invariant feature representations. In this paper, we mainly focus on distance-based DA methods.
For distance functions adopted by DA, the first attempt is the Proxy $\mathcal{A}$-distance ben2010theory, which aims to minimize the generalization error by discriminating between source and target samples. Maximum Mean Discrepancy (MMD) gretton2006kernel is a popular distance measure between two domains, and it has been used in Deep Domain Confusion (DDC) tzeng2014deep and Deep Adaptation Network (DAN) long2015learning. Although numerous distance-based DA methods have been proposed, learning domain-invariant feature representations remains challenging, since distances in a high-dimensional space may fail to truly reflect the domain discrepancy. Moreover, all of these methods are built on hand-crafted network architectures. Since different DA tasks vary in difficulty, complex tasks may require more sophisticated network architectures than easy ones; hence, using the same hand-crafted network architecture may limit the capacity and versatility of DA methods.
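As an illustration of the distance functions discussed above, the following is a minimal PyTorch sketch of the biased empirical MMD with an RBF kernel, the measure minimized in DDC and DAN between source and target features (the function name and bandwidth value are our own choices, not the original implementations):

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Biased empirical MMD^2 between samples x (n, d) and y (m, d)
    using a Gaussian (RBF) kernel with bandwidth sigma."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, mapped through the RBF kernel.
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
```

The value is (approximately) zero when the two feature sets come from the same distribution and grows as the distributions drift apart.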
To alleviate these limitations, in this paper we propose a new similarity function, called Population Correlation (PC), to measure the similarity between the source and target domains. Based on the PC function, we propose a novel domain adaptation method called Domain Adaptation by Maximizing Population Correlation (DAMPC). DAMPC aims to maximize the PC between the source and target domains so that a learning model can learn domain-invariant feature representations. Specifically, with the PC defined as the maximum of pairwise correlations between source and target samples, the proposed DAMPC method maximizes the PC to force the two domains to have similar distributions, while minimizing the classification loss on the labeled source samples. Built on the DAMPC method, we design a reinforcement-based Neural Architecture Search (NAS) method called DAMPC-NAS to search an optimal network architecture for DAMPC. In this way, DAMPC-NAS can learn suitable network architectures for different DA tasks. To the best of our knowledge, the proposed DAMPC-NAS method is the first NAS framework designed for similarity-based DA methods. DAMPC-NAS is also one of the few works that integrate NAS methods into deep DA methods. Our contributions are summarized as follows.
We propose a new similarity measure, i.e., PC, to measure the domain similarity. Based on the PC, we propose a DAMPC method for DA.
We design the DAMPC-NAS framework to search optimal network architectures for the proposed DAMPC method.
Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed methods.
2 Related Work
Neural Architecture Search
NAS aims to design the architecture of a neural network in an automated way. Compared with manually designed architectures, NAS has demonstrated the capability to find architectures with state-of-the-art performance in various tasks pham2018efficient; lsy19; ghiasi2019fpn. For example, the NAS-FPN method ghiasi2019fpn leverages NAS to learn an effective architecture of the feature pyramid network for object detection.
Although NAS can achieve satisfactory performance, the high computational cost of the search procedure makes NAS less attractive. To accelerate the search procedure, one-shot NAS leverages a supergraph that contains all the candidate architectures in the search space. In the supergraph, the weights of operations on edges are shared across different candidate architectures. ENAS pham2018efficient employs a reinforcement-based method to train a controller that samples architectures from a supergraph with a weight-sharing mechanism. DARTS lsy19 searches architectures with a differentiable objective function based on a supergraph that uses the softmax function to combine all candidate operations on each edge. The final architecture is determined based on the weights corresponding to the candidate operations on each edge.
Domain Adaptation
DA aims to transfer the knowledge learned from a source domain with labeled data to a target domain without labeled data, where there is a domain shift between the two domains. As discussed in the introduction, recent works in DA can be mainly grouped into two categories: distance-based methods and adversarial DA methods. In this paper, we mainly focus on distance-based methods, which minimize the discrepancy between the source and target domains via some measure, including the MMD used in DDC tzeng2014deep, DAN long2015learning, the Weighted Domain Adaptation Network (WDAN) yan2017mind, Joint Adaptation Networks (JAN) long2017deep, and the Deep Subdomain Adaptation Network (DSAN) zhu2020deep, the Kullback-Leibler divergence adopted in Transfer Learning with Deep Autoencoders (TLDA) zhuang2015supervised, the second-order statistics utilized in CORrelation ALignment (CORAL) sun2016return; sun2016deep, and the Central Moment Discrepancy (CMD) zellinger2017central.
Neural Architecture Search for Domain Adaptation
There are few works on NAS for DA. To improve the generalization ability of neural networks for DA, li2020adapting analyze the generalization bound of neural architectures and propose the AdaptNAS method to adapt neural architectures between domains. li2020network propose a DARTS-like method for DA, which combines DARTS and DA into one framework. robbiano2021adversarial aim to learn an auxiliary branch network from data for an adversarial DA method. Different from those works, in this paper we aim to leverage NAS to search optimal neural architectures for the proposed DAMPC method.
3 Methodology
In this section, we introduce the proposed PC similarity and the DAMPC method, as well as the DAMPC-NAS method.
3.1 Population Correlation
We first present the definition of PC. We study DA under the unsupervised setting, where the source domain has $n_s$ labeled samples $\{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{n_s}$ and the target domain has $n_t$ unlabeled samples $\{\mathbf{x}_j^t\}_{j=1}^{n_t}$. To adapt the classifier trained on the source domain to the target domain, one solution is to minimize the domain discrepancy or, equivalently, maximize the domain similarity. To this end, we propose the PC to measure the similarity between the source and target domains. Specifically, suppose $f(\cdot)$ is the feature extraction network. Then the PC between the source and target domains can be computed based on each pair of source and target samples as
$$\mathrm{PC} = \max_{i\in[n_s],\, j\in[n_t]} \mathrm{corr}\big(f(\mathbf{x}_i^s), f(\mathbf{x}_j^t)\big), \qquad \mathrm{corr}(\mathbf{u},\mathbf{v}) = \frac{\mathbf{u}^\top \mathbf{v}}{\|\mathbf{u}\|_2 \|\mathbf{v}\|_2}, \eqno(1)$$
where $\|\cdot\|_2$ denotes the $\ell_2$ norm of a vector, $\mathrm{corr}(\cdot,\cdot)$ denotes the correlation between two vectors, and $[n]$ denotes the set of integers $\{1,\ldots,n\}$ for an integer $n$. Here we use the cosine similarity to calculate the correlation between two vectors; thus, the larger the PC value is, the more similar the two domains are.
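The PC, i.e., the maximum pairwise cosine correlation between source and target features, can be sketched in PyTorch as follows (a minimal sketch; the function name is ours):

```python
import torch

def population_correlation(feat_s, feat_t):
    """PC between source features (n_s, d) and target features (n_t, d)."""
    # Normalize rows to unit length so that dot products equal cosine similarities.
    fs = torch.nn.functional.normalize(feat_s, dim=1)
    ft = torch.nn.functional.normalize(feat_t, dim=1)
    # Pairwise correlation matrix of shape (n_s, n_t).
    corr = fs @ ft.t()
    # PC is the maximum over all source-target pairs.
    return corr.max()
```

Because the operation is differentiable, the PC can be maximized directly by gradient descent on the feature extractor.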
3.2 Domain Adaptation by Maximizing Population Correlation
Built on the PC introduced in the previous section, in this section we present the proposed DAMPC method, which aims to learn domain-invariant feature representations. For DA tasks, the hidden feature representations learned by the feature extraction network should be not only discriminative, so as to train a strong classifier, but also invariant across the source and target domains. Maximizing the PC alone helps learn a domain-invariant feature representation, and minimizing the classification loss alone learns a discriminative feature representation. Therefore, we combine the classification loss and the PC to obtain the final objective function, which is formulated as
$$\min_{f,\, g}\ \mathcal{L} = \frac{1}{n_s}\sum_{i=1}^{n_s} \ell\big(g(f(\mathbf{x}_i^s)), y_i^s\big) - \lambda\, \mathrm{PC}, \eqno(2)$$
where $\lambda$ is a trade-off parameter, $g$ denotes the classification layer, and $\ell$ denotes the classification loss such as the cross-entropy loss.
By minimizing Eq. (2), the final learned feature representations are not only discriminative for classification but also domain-invariant for the adaptation.
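A minimal PyTorch sketch of this combined objective is given below (the trade-off value and function names are illustrative placeholders, not the authors' code):

```python
import torch
import torch.nn.functional as F

def dampc_loss(classifier, feat_s, labels_s, feat_t, lam=0.3):
    """Classification loss on labeled source features minus lam * PC,
    so that minimizing the total maximizes the population correlation."""
    # Cross-entropy on the labeled source samples.
    cls_loss = F.cross_entropy(classifier(feat_s), labels_s)
    # PC: maximum pairwise cosine correlation between domains.
    fs = F.normalize(feat_s, dim=1)
    ft = F.normalize(feat_t, dim=1)
    pc = (fs @ ft.t()).max()
    return cls_loss - lam * pc
```

In training, `feat_s` and `feat_t` would be the backbone features of a source and a target mini-batch, and the loss is backpropagated through both the classifier and the feature extractor.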
3.3 DAMPC-NAS
In this section, we introduce the proposed DAMPC-NAS framework, which finds an optimal architecture for the DAMPC method introduced in the previous section. An overview of the DAMPC-NAS framework is shown in Figure 1.
Cell-based Search Space
We design the search space on top of the ResNet-50 backbone, whose architecture is kept fixed; hence we only search the architecture after the backbone. The search space of the DAMPC-NAS method consists of two parts: within cells and between cells. We design each cell as a composition of a fully connected (FC) layer, a batch-norm layer, and a dropout layer, together with the associated activation functions. Within a cell, we search for the size of the FC layer and the starting location of the skip connection. Specifically, the FC layer in a cell can be 'the same size as the input' or 'half the input size'. The starting location of the skip connection can be chosen from the cell input, the FC layer, and the batch-norm layer. Between the cells, we search for the input and output connections of the cells. For example, if there are three cells in the search space, the input of “Cell 1” can be chosen from the outputs of “Backbone” and “Cell 0”, and the input of “Cell 2” can be chosen from the outputs of “Cell 0” and “Cell 1”; hence the input of a cell can be chosen from the outputs of the previous two cells. The PC can be calculated from the output of any one of the cells. Moreover, one of the outputs of the cells, i.e., “Cell 0”, “Cell 1”, and “Cell 2”, connects to the classifier trained on the source domain data. The total number of configurations in the search space is the product of these choices. An illustration of the search space in the DAMPC-NAS method is shown in Figure 2. In experiments, for efficiency, we use the search space with three cells.
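A sketch of one such cell under the choices described above is given below (the exact layer ordering, dropout rate, and the projection used when dimensions differ are our assumptions for illustration):

```python
import torch
import torch.nn as nn

class Cell(nn.Module):
    """One searchable cell: an FC layer (same or half input size), batch-norm,
    activation, dropout, plus a skip connection whose starting location is a
    search choice ('input', 'fc', or 'bn')."""
    def __init__(self, in_dim, half_fc=False, skip_from="input"):
        super().__init__()
        out_dim = in_dim // 2 if half_fc else in_dim
        self.fc = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(0.5)
        self.skip_from = skip_from
        # Project the skip branch when the cell halves the dimension.
        self.proj = nn.Linear(in_dim, out_dim) if (skip_from == "input" and half_fc) else None

    def forward(self, x):
        h_fc = self.fc(x)
        h_bn = self.bn(h_fc)
        out = self.drop(self.act(h_bn))
        skip = {"input": self.proj(x) if self.proj is not None else x,
                "fc": h_fc, "bn": h_bn}[self.skip_from]
        return out + skip
```

Enumerating `half_fc` and `skip_from` per cell, together with the between-cell connections, yields the candidate architectures sampled by the controller.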
Searching the Optimal Architecture
The search algorithm of the DAMPC-NAS method is described in Algorithm 1. DAMPC-NAS is a reinforcement-based NAS framework that leverages a controller network to sample architectures from the search space. The controller network is an LSTM that samples search choices via a softmax classifier. We denote by $\theta$ the learnable parameters of the controller and by $\pi(\cdot;\theta)$ its policy.
In each epoch, the training procedure of DAMPC-NAS consists of two phases. In the first phase, we fix the parameters $\theta$ of the controller and train the shared weights $w$ in the search space. Specifically, the controller samples an architecture from the search space with the policy $\pi(\cdot;\theta)$. For each mini-batch from the source and target domains, the loss $\mathcal{L}$ is computed according to Eq. (2) and the shared weights of the sampled architecture are updated by minimizing $\mathcal{L}$. In the second phase, we fix all the shared weights $w$ in the search space and update the parameters $\theta$ of the controller. Specifically, after one epoch of training, $-\mathcal{L}$ is used as the reward to update the policy $\pi(\cdot;\theta)$. The gradient is computed via the REINFORCE algorithm williams1992simple with a moving average baseline.
In summary, the DAMPC-NAS method is a one-shot NAS method. That is, during the search process, DAMPC-NAS trains a supernet that contains all the shared parameters in the search space. In each epoch, DAMPC-NAS samples a child network to calculate the loss function defined in Eq. (2) and updates the shared parameters in the search space. The parameters of the controller are updated by the reward, which is the negative loss of the sampled child network. After searching, all the weights of the final architecture are retained for testing. Different from two-stage one-shot NAS methods, DAMPC-NAS does not need to retrain the final architecture from scratch for testing, since it directly optimizes the objective in Eq. (2), which is just the negative reward of the controller, in an end-to-end manner. In this way, the architecture is optimized alongside the child networks' parameters. Therefore, the final architecture derived from the DAMPC-NAS method can be deployed directly without parameter retraining, which improves efficiency.
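The controller update in the second phase can be sketched as a REINFORCE step with a moving-average baseline (a minimal sketch; class and function names are illustrative, not the authors' code):

```python
import torch

class MovingAverageBaseline:
    """Exponential moving average of past rewards, used to reduce variance."""
    def __init__(self, decay=0.95):
        self.decay, self.value = decay, 0.0

    def update(self, reward):
        self.value = self.decay * self.value + (1 - self.decay) * reward
        return self.value

def controller_step(log_prob, reward, baseline, optimizer):
    """One REINFORCE update: scale the log-probability of the sampled
    architecture by the advantage (reward minus baseline); here the
    reward is the negative loss of the sampled child network."""
    advantage = reward - baseline.update(reward)
    loss = -advantage * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Here `log_prob` is the sum of the log-probabilities of the controller's sampled choices, so the gradient increases the probability of architectures whose reward beats the baseline.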
4 Experiments
In this section, we empirically evaluate the proposed method.
[Table 1: classification accuracy (%) on the Office-31 dataset; the baselines include the distance-based JDA long2013transfer and the adversarial DANN ganin2015unsupervised.]
[Table 2: classification accuracy (%) on the Office-Home dataset; the baselines include the distance-based JDA long2013transfer and the adversarial DANN ganin2015unsupervised.]
We conduct experiments on three benchmark datasets, including Office-31 saenko2010adapting, Office-Home venkateswara2017deep, and VisDA-2017 peng2017visda. The Office-31 dataset has 4,652 images in 31 categories collected from three distinct domains: Amazon (A), Webcam (W), and DSLR (D). We construct six transfer tasks: A→W, D→W, W→D, A→D, D→A, and W→A. The Office-Home dataset consists of 15,500 images in 65 object classes under office and home settings, forming four extremely dissimilar domains: Artistic (Ar), Clip Art (Cl), Product (Pr), and Real-World (Rw), and 12 transfer tasks. The VisDA-2017 dataset has over 280K images across 12 classes. It contains two very distinct domains: Synthetic, which contains renderings of 3D models from different angles and under different lighting conditions, and Real, which consists of natural images. On this dataset, we study one transfer task: Synthetic→Real.
We compare the proposed DAMPC-NAS method with state-of-the-art DA methods, including Joint Distribution Adaptation (JDA) long2013transfer, Deep Domain Confusion (DDC) tzeng2014deep, Deep Adaptation Network (DAN) long2015learning, Domain Adversarial Neural Network (DANN) ganin2015unsupervised, Correlation Alignment for Deep Domain Adaptation (D-CORAL) sun2016deep, Residual Transfer Networks (RTN) long2016unsupervised, Joint Adaptation Networks (JAN) long2017deep, Adversarial Discriminative Domain Adaptation (ADDA) tzeng2017adversarial, Conditional Domain Adversarial Networks (CDAN) long2017conditional, Collaborative and Adversarial Network (CAN) zhang2018collaborative, Manifold Dynamic Distribution Adaptation (MDDA) wang2020transfer, and Dynamic Distribution Adaptation Network (DDAN) wang2020transfer. The results of the baseline methods are directly taken from DDAN wang2020transfer and CDAN long2017conditional.
We use the PyTorch package paszke2017automatic to implement all the models and leverage the ResNet-50 network he2016deep pretrained on the ImageNet dataset russakovsky2015imagenet as the backbone for feature extraction. For optimization, we use mini-batch SGD with a Nesterov momentum of 0.9. The learning rate is adjusted by $\eta_p = \eta_0 (1 + \alpha p)^{-\beta}$, where $p$ is the index of training steps, $\eta_0 = 0.1$, $\alpha = 0.001$, and $\beta = 0.75$. The batch size is set to 128 for all the datasets.
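The learning-rate schedule above can be sketched as follows (assuming the standard annealing form used in prior deep DA work, with the constants stated in the text):

```python
def learning_rate(step, eta0=0.1, alpha=0.001, beta=0.75):
    """eta_p = eta0 * (1 + alpha * p)^(-beta), where p is the training-step index."""
    return eta0 * (1 + alpha * step) ** (-beta)
```

The rate starts at `eta0` and decays smoothly as training progresses, which is the usual choice for fine-tuning a pretrained backbone in DA.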
The classification results on the Office-31 dataset are shown in Table 1. As illustrated in Table 1, the proposed DAMPC-NAS method achieves the best average accuracy. In four out of six transfer tasks, DAMPC-NAS performs the best, especially on the tasks A→D and A→W, which transfer from a large source domain to a small target domain; in the other two tasks, DAMPC-NAS performs only slightly worse than the best baseline method. These results imply that the proposed DAMPC-NAS model works well when the source data are sufficient and that it can learn transferable feature representations for effective domain adaptation.
Figure 3 shows the architecture found by DAMPC-NAS for the transfer task D→W constructed on the Office-31 dataset. The left part of Figure 3 shows the search choices within the three cells found by the DAMPC-NAS method, and the right part shows the connections among the three cells, the PC, and the classifier. In Cell 0, the DAMPC-NAS method chooses the FC layer with the same size as the input, and the skip connection starts from the batch-norm layer. In Cell 1, the choice of the FC layer is the same as in Cell 0, but the skip connection starts from the cell input. In Cell 2, the skip connection is the same as in Cell 1, but the FC layer has half the input size. For the connections between cells, the DAMPC-NAS method chooses the output of Cell 0 to calculate the PC and the output of Cell 1 to calculate the classification loss. For the simple transfer task D→W, the searched architecture effectively uses only two cells, which indicates that the DAMPC-NAS method can adaptively learn an architecture depending on the complexity of the DA task. Moreover, compared with Cell 0, the starting location of the skip connection moves forward in Cell 1 and Cell 2, which helps reduce the effective network depth and alleviate the vanishing gradient problem.
Table 2 shows the classification results on the Office-Home dataset. According to the results, DAMPC-NAS achieves the best average accuracy and performs the best in eight out of twelve transfer tasks. When transferring from a large source domain to a small target domain (i.e., Cl→Ar, Pr→Ar, and Rw→Ar), DAMPC-NAS achieves the best performance. This phenomenon is similar to that on the Office-31 dataset, which again demonstrates that the proposed DAMPC-NAS model works well when the source data are sufficient.
According to the experimental results on the most challenging VisDA-2017 dataset shown in Table 3, the proposed DAMPC-NAS method outperforms all the baseline methods, including the state-of-the-art CDAN, which again demonstrates the effectiveness of the proposed method.
[Table 3: classification accuracy (%) on the VisDA-2017 dataset; the baselines include the distance-based DAN long2015learning (53.0) and the adversarial DANN ganin2015unsupervised (55.0).]
4.3 Ablation Study
Firstly, we conduct an ablation study on the Office-31, Office-Home, and VisDA-2017 datasets to demonstrate the effectiveness of the proposed PC. We compare the PC with widely used distance functions, including the Proxy $\mathcal{A}$-distance, the Kullback-Leibler divergence (KL-divergence), Maximum Mean Discrepancy (MMD), CORrelation ALignment (CORAL), and the Central Moment Discrepancy (CMD). For a fair comparison, we only replace the negative PC term in Eq. (2) with these distance functions. Specifically, we adopt ResNet-50 as the backbone, followed by a bottleneck layer (consisting of a fully connected layer, a batch normalization layer, a ReLU activation function, and a dropout function) used for generating hidden features and a fully connected layer used for prediction. According to the experimental results shown in Tables 4, 5, and 6, none of the distance functions obtains a performance improvement over using no distance function at all (i.e., ResNet-50). One possible reason is that the batch normalization layer in the bottleneck has already improved the performance of ResNet-50, and adopting these distance functions cannot improve the performance further. However, the proposed PC still obtains a performance improvement over ResNet-50, which indicates the effectiveness of the proposed PC.
Then we conduct another ablation study on the Office-31 dataset to demonstrate the effectiveness of the architecture search process in the DAMPC-NAS method. Specifically, we modify Algorithm 1 to search an optimal architecture for DAN by replacing the negative PC term in the objective function with MMD. According to the experimental results shown in Figure 5, the resulting DAN-NAS performs comparably to and even slightly better than DAN on the six transfer tasks of the Office-31 dataset, which demonstrates the usefulness of the search process in the DAMPC-NAS framework.
4.4 Feature Visualization
We visualize in Figure 4 the hidden feature representations for the transfer task A→D constructed on the Office-31 dataset, learned by ResNet-50 (trained on source samples only), DAN, and DAMPC-NAS, respectively. According to Figure 4, the samples represented by ResNet-50 and DAN are not well separated, while those represented by DAMPC-NAS are more separable, which implies that the proposed DAMPC-NAS method can learn discriminative and transferable feature representations for DA.
5 Conclusion
In this paper, we propose a new DA method called DAMPC based on the proposed PC function, which measures the domain similarity. We further design the DAMPC-NAS framework, which searches optimal network architectures for DA tasks. Experimental results on the Office-31, Office-Home, and VisDA-2017 datasets demonstrate the effectiveness of the proposed methods. Moreover, the proposed DAMPC-NAS framework has shown its potential to search optimal architectures for other DA methods. In future work, we will apply the DAMPC-NAS framework to search architectures for other DA methods and other DA settings.