One key element behind the remarkable success of deep learning algorithms is the availability of large-scale labeled data [he2016deep]. However, in many practical applications, only limited or even no training data is provided. On the one hand, it is prohibitively labor-intensive and expensive to obtain abundant labeled data. On the other hand, visual data naturally exhibit large variance, which fundamentally limits the scalability and applicability of supervised learning models when handling new scenarios with few labeled examples [ni2019dual]. In such cases, conventional deep learning approaches suffer from performance decay. Directly transferring models trained on labeled source domains to unlabeled target domains may result in unsatisfactory performance because of domain shift [torralba2011unbiased], which calls for domain adaptation (DA) methods [bousmalis2016domain, zhao2018emotiongan, hoffman2018cycada]. Unsupervised DA (UDA) addresses this problem by transferring knowledge from a labeled source domain to an unlabeled target domain and exploring domain-invariant structures and representations to bridge the domain gap [netzer2011reading]. Both theoretical results [ben2010theory, gopalan2014unsupervised, tzeng2017adversarial] and algorithms for domain adaptation [pan2010survey, long2015learning, hoffman2018cycada, zhao2019cycleemotiongan] have been proposed recently.
Though these methods make progress on DA, most of them focus on the single-source setting [sun2011two, ganin2016domain] and fail to consider a more practical scenario in which there are multiple labeled source domains with different distributions. Naively applying single-source DA algorithms may lead to suboptimal solutions [shen2017wasserstein], which calls for effective multi-source domain adaptation (MDA) techniques. Recently, some deep MDA approaches have been proposed [zhao2018adversarial, xu2018deep, li2018extracting, peng2019moment, zhao2019multi], but most of them suffer from the following limitations. (1) They sacrifice the discriminative property of the extracted features for the desired task learner in order to learn domain-invariant features. (2) They treat the multiple sources equally and fail to consider the different discrepancies between each source and the target, as illustrated in Figure 1. Such treatment may lead to suboptimal performance when some sources are very different from the target [zhao2018adversarial]. (3) They treat the samples from each source equally, without distilling the source data, even though different samples from the same source domain may have different similarities to the target. (4) The adversarial learning based methods suffer from the vanishing gradient problem when the domain classifier can perfectly distinguish target representations from source ones.
In this paper, we propose a novel multi-source distilling domain adaptation (MDDA) network to address the above challenges by thoroughly exploring the relationships among the different sources and the target. As shown in Figure 2, MDDA can be divided into four stages. (1) We first pre-train the source classifiers separately using the training data from each source. (2) We fix the feature extractor of each source and adversarially map the target into the feature space of each source by minimizing the empirical Wasserstein distance between the source and target [arjovsky2017wasserstein], which provides more stable gradients even when the target and source distributions do not overlap. (3) We select the source training samples that are closer to the target to fine-tune the source classifiers. (4) We build the target predictor by aggregating the source predictions based on source domain weights that correspond to the discrepancy between each source and the target. We propose a mechanism to automatically choose a weighting strategy over the source domains, emphasizing more relevant sources and suppressing irrelevant ones, and aggregate the multiple source classifiers based on these weights. With these four stages, the proposed MDDA can extract features that are both discriminative for the learning task and indiscriminate with respect to the shift among the multiple source and target domains.
The main contributions of this paper are summarized as follows:
We propose MDDA to explore the relationships among the different sources and the target, and achieve more accurate inference on the target by fine-tuning and aggregating the source classifiers based on these relationships.
Compared to [xu2018deep], which symmetrically maps the multiple sources and the target into the same space, MDDA learns more discriminative target representations and avoids the oscillation caused by simultaneously changing the multi-source and target distributions, by using separate feature extractors that asymmetrically map the target to the feature space of each source in an adversarial manner. Wasserstein distance is used in the adversarial training to obtain more stable gradients even when the target and source distributions do not overlap.
We propose the source distilling mechanism to select the source training samples that are closer to the target and fine-tune the source classifiers with these samples.
We propose a novel mechanism to automatically choose a weighting strategy over the source domains, emphasizing more relevant sources and suppressing irrelevant ones, and aggregate the multiple source classifiers based on these weights to build a more accurate target predictor.
We extensively evaluate MDDA on public benchmarks, achieving state-of-the-art performance and verifying its efficacy.
Single-source UDA The emphasis of recent single-source UDA (SUDA) methods has shifted to deep learning architectures in an end-to-end fashion. Most deep SUDA methods employ a conjoined architecture with two streams to respectively represent the models for the source domain and the target domain [zhuo2017deep]. Generally, these methods are trained jointly with a traditional task loss based on the labeled source data and another loss to tackle the domain shift problem, such as discrepancy loss, adversarial loss, reconstruction loss, etc. Discrepancy-based methods explicitly measure the discrepancy between the source and target domains of the two network streams, such as the multiple kernel variant of maximum mean discrepancies [long2015learning], correlation alignment (CORAL) [sun2016return, sun2017correlation, zhuo2017deep], and contrastive domain discrepancy [kang2019contrastive]. Adversarial generative models combine the domain discriminative model with a generative component to generate fake source or target data generally based on GAN [goodfellow2014generative] and its variants, such as CoGAN [liu2016coupled], SimGAN [shrivastava2017learning], CycleGAN [zhu2017unpaired, zhao2019cycleemotiongan], and CyCADA [hoffman2018cycada]. Adversarial discriminative models usually employ an adversarial objective with respect to a domain discriminator to encourage domain confusion [ganin2016domain, tzeng2017adversarial, chen2017no, shen2017wasserstein, tsai2018learning, huang2018domain]. Most of these methods suffer from low accuracy when directly applied to the MDA problem.
Multi-source DA MDA assumes training data are collected from multiple sources [sun2015survey, zhao2019multi]. There are theoretical analyses [ben2010theory, hoffman2018algorithms] that support existing MDA algorithms. Early MDA methods mainly focus on shallow models, falling into two categories [sun2015survey]: feature representation approaches [sun2011two, duan2012exploiting, chattopadhyay2012multisource, duan2012domain] and combinations of pre-learned classifiers [xu2012multi, sun2013bayesian]. Some recent shallow MDA methods deal with special cases, such as incomplete MDA [ding2018incomplete] and target shift [redko2019optimal].
Recently, some representative deep learning based MDA methods have been proposed, such as multi-source domain adversarial network (MDAN) [zhao2018adversarial], deep cocktail network (DCTN) [xu2018deep], and moment matching network (MMN) [peng2019moment]. All of these MDA methods employ a shared feature extractor network to symmetrically map the multiple sources and the target into the same space. For each source-target pair in MDAN and DCTN, a discriminator is trained to distinguish the source and target features. MDAN concatenates all extracted source features and labels into one domain to train a single task classifier, while DCTN trains a classifier for each source domain and combines the predictions of the different classifiers for a target image using perplexity scores as weights. MMN transfers the learned knowledge from multiple sources to the target by dynamically aligning moments of their feature distributions; the final prediction for a target image is a uniform average over the classifiers from the different source domains. Different from these works, we employ an unshared feature extractor to obtain the feature representation of each source, match the target features to each source feature space asymmetrically, distill the pre-trained classifiers with selected representative samples, and combine the predictions of the different classifiers using a novel weighting strategy.
Suppose we have $M$ source domains $\mathcal{S}_1, \mathcal{S}_2, \dots, \mathcal{S}_M$ and one target domain $\mathcal{T}$. In the unsupervised domain adaptation (UDA) scenario, $\mathcal{S}_1, \dots, \mathcal{S}_M$ are labeled and $\mathcal{T}$ is fully unlabeled. For the $i$th source domain $\mathcal{S}_i$, the observed images and corresponding labels drawn from the source distribution are $X_i = \{x_i^j\}_{j=1}^{N_i}$ and $Y_i = \{y_i^j\}_{j=1}^{N_i}$, where $N_i$ is the number of source images. The target images drawn from the target distribution are $X_T = \{x_T^j\}_{j=1}^{N_T}$ without label observation, where $N_T$ is the number of target images. Unless otherwise specified, we assume (1) homogeneity, i.e. the data from all domains are observed in the same feature space (with the same dimensionality) but exhibit different distributions; (2) closed set, i.e. all domains share the same class label space $\mathcal{C}$, indicating that all the domains share their categories. Our goal is to learn an adaptation model that can correctly predict the label of a sample from the target domain based on $\{(X_i, Y_i)\}_{i=1}^{M}$ and $X_T$. Please note that our method can be easily extended to tackle heterogeneous DA [li2014learning, hubert2016learning] by changing the network structure of the target feature extractor, open set DA [panareda2017open] by adding an “unknown” class, or category shift DA [xu2018deep] by reweighing the predictions of only those domains that contain the specified category. We leave such studies for future work.
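The homogeneity and closed-set assumptions above can be made concrete as simple sanity checks on the data. The sketch below is our illustration, not part of the paper: helper names are hypothetical, and NumPy feature matrices stand in for image tensors, with one (features, labels) pair per source domain.

```python
import numpy as np

def check_homogeneity(source_feats, target_feats):
    # Homogeneity: all domains are observed in the same feature space,
    # i.e. every domain has the same feature dimensionality.
    dims = {f.shape[1] for f in source_feats}
    dims.add(target_feats.shape[1])
    return len(dims) == 1

def check_closed_set(source_labels, num_classes):
    # Closed set: every source only uses labels from the shared
    # label space {0, ..., C-1}.
    return all(set(np.unique(y)) <= set(range(num_classes))
               for y in source_labels)
```

In the heterogeneous-DA extension mentioned above, `check_homogeneity` would be allowed to fail and the target feature extractor would change accordingly.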
Multi-source Distilling Domain Adaptation
In this section, we introduce the proposed multi-source distilling domain adaptation (MDDA) network. MDDA is a novel approach that overcomes the limitations of existing multi-source domain adaptation methods by thoroughly exploring the relationships among the different sources and the target. It achieves more accurate inference on the target by fine-tuning and aggregating the source classifiers based on these relationships. As shown in Figure 2
, MDDA can be divided into four stages. We first pre-train the source classifiers separately with the training data from each source. Then, we fix the feature extractor of each source and map the target into the feature space of each source adversarially by minimizing the estimated Wasserstein distance between the source and target. MDDA learns more discriminative target representations and avoids the oscillation from the simultaneous changing of the multi-source and target distributions by using separate feature extractors that asymmetrically map the target to the feature space of the source in an adversarial manner. In the third stage, the source samples closer to the target are selected to fine-tune the source classifiers. Finally, we build the target predictor by aggregating the source predictions based on the discrepancy between each source and target. We propose a novel mechanism to automatically choose a weighting strategy over source domains to emphasize more relevant sources and suppress the irrelevant ones. With the above four stages, MDDA extracts features that are both discriminative for the learning task and indiscriminate with respect to the shift among the multiple source and target domains. We will explain each stage in the following subsections.
Source Classifier Pre-training
To extract more task-discriminative features and learn accurate classifiers, we pre-train a feature extractor $F_i$ and a classifier $C_i$ for each labeled source domain $\mathcal{S}_i$, with unshared weights between different domains. Taking the $K$-class classification task as an example, $F_i$ and $C_i$ are optimized by minimizing the following cross-entropy loss:

$$\mathcal{L}_{cls}^{i} = -\mathbb{E}_{(x, y) \sim \mathcal{S}_i} \sum_{k=1}^{K} \mathbb{1}[y = k] \log \sigma_k\big(C_i(F_i(x))\big),$$

where $\sigma$ is the softmax function and $\mathbb{1}$ is an indicator function. Compared with a shared feature extractor network that extracts domain-invariant features across different source domains [zhao2018adversarial, xu2018deep], the unshared feature extractor network obtains discriminative feature representations and an accurate classifier for each source domain. When the multiple predictions based on the source classifiers and matched target features are aggregated in the later stage, the final target prediction is thus better boosted.
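As a minimal runnable illustration of this stage, the sketch below trains one softmax classifier per source with plain gradient descent on the cross-entropy loss. A single linear layer stands in for the paper's feature extractor plus classifier (the actual networks are convolutional), and the helper names are ours:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pretrain_source(X, y, num_classes, lr=0.1, steps=200):
    """Minimize the cross-entropy loss on one labeled source domain.
    Called once per source: the weights W are NOT shared across domains."""
    n, d = X.shape
    W = np.zeros((d, num_classes))
    onehot = np.eye(num_classes)[y]          # indicator 1[y = k]
    for _ in range(steps):
        p = softmax(X @ W)                   # predicted class probabilities
        W -= lr * (X.T @ (p - onehot)) / n   # gradient of the CE loss
    return W
```

Each source domain gets its own `W`; predictions are `softmax(X @ W)`, mirroring the unshared-weights design above.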
Adversarial Discriminative Adaptation
After the pre-training stage, we learn a separate target encoder $F_{T_i}$ that maps the target into the feature space of each source $\mathcal{S}_i$. A discriminator $D_i$ is trained to distinguish the encoded target features from $F_{T_i}$ and the encoded source features from the pre-trained $F_i$ by maximizing the estimated Wasserstein distance between them, while $F_{T_i}$ tries to confuse $D_i$, i.e. to minimize the Wasserstein distance. Similar to GAN [goodfellow2014generative], we model this as a two-player minimax game. Following [arjovsky2017wasserstein], we suppose the discriminators are all 1-Lipschitz; then $D_i$ can be optimized by maximizing the estimated Wasserstein distance

$$\mathcal{L}_{wd}^{i} = \mathbb{E}_{x \sim \mathcal{S}_i}\big[D_i(F_i(x))\big] - \mathbb{E}_{x_T \sim \mathcal{T}}\big[D_i(F_{T_i}(x_T))\big],$$

while $F_{T_i}$ is obtained by minimizing the same quantity. In this way, the target encoder confuses the discriminator by minimizing the Wasserstein distance between the encoded target features and the source ones.
To enforce the Lipschitz constraint, we add a gradient penalty for the parameters of each discriminator as in [gulrajani2017improved]:

$$\mathcal{L}_{grad}^{i} = \mathbb{E}_{\hat{h} \sim \mathcal{H}_i}\big[\big(\|\nabla_{\hat{h}} D_i(\hat{h})\|_2 - 1\big)^2\big],$$

where $\mathcal{H}_i$ is a feature set that contains not only the source and target features but also random points along the straight lines between source and target feature pairs [gulrajani2017improved]. $D_i$ is then optimized by maximizing the penalized objective $\mathcal{L}_{wd}^{i} - \lambda \mathcal{L}_{grad}^{i}$, where $\mathcal{L}_{wd}^{i}$ denotes the empirical Wasserstein distance between the encoded source and target features, and $\lambda$ is a balancing coefficient whose value can be set empirically.
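In practice this stage uses a neural critic trained with automatic differentiation, as in WGAN-GP. To keep a dependency-free, runnable sketch, the toy below uses a linear critic D(h) = h · w, for which the gradient with respect to the input is simply w, so the gradient penalty has a closed form; the linear critic and all names are our simplification, not the paper's architecture:

```python
import numpy as np

def wasserstein_estimate(w, src_feats, tgt_feats):
    # Empirical Wasserstein distance under the critic D(h) = h @ w:
    # mean critic score on source features minus mean score on target features.
    return (src_feats @ w).mean() - (tgt_feats @ w).mean()

def gradient_penalty(w):
    # For a linear critic, grad_h D(h) = w everywhere, so the penalty
    # (||grad D||_2 - 1)^2 is identical at all interpolated points.
    return (np.linalg.norm(w) - 1.0) ** 2

def critic_objective(w, src_feats, tgt_feats, lam=10.0):
    # The critic maximizes this objective; the target encoder would then
    # be updated to shrink the distance term. lam = 10 follows WGAN-GP.
    return wasserstein_estimate(w, src_feats, tgt_feats) - lam * gradient_penalty(w)
```

With a neural critic, the penalty is instead evaluated at random interpolations between source and target feature pairs, as the text describes.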
We further dig into each source domain and select the source training samples that are closer to the target, based on the estimated Wasserstein distance, to fine-tune the source classifiers. Such a source distilling mechanism utilizes more relevant training data and further improves the target performance of the aggregated source classifiers. We select the source samples based on the estimated Wasserstein distance, since it represents the divergence between the source and target data. For each source sample $x_i^j$ in the $i$th source domain, we calculate its Wasserstein distance to the target domain:

$$\mathcal{W}_i^j = \Big| D_i\big(F_i(x_i^j)\big) - \frac{1}{N_T} \sum_{k=1}^{N_T} D_i\big(F_{T_i}(x_T^k)\big) \Big|.$$

For each source sample $x_i^j$, $\mathcal{W}_i^j$ reflects its distance to the target domain: the smaller the value, the closer the sample is to the target domain. Therefore, in each source domain $\mathcal{S}_i$, we select the subset $\hat{X}_i$ of the source data whose $\mathcal{W}_i^j$ is smaller than that of the remaining samples. With these selected source data and their labels, we fine-tune the classifier $C_i$ by minimizing the same cross-entropy objective as in the pre-training stage.
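The selection step can be sketched as follows; the per-sample critic scores are a stand-in for the D_i(F_i(x)) values above, and the function name is ours:

```python
import numpy as np

def distill_indices(src_scores, tgt_scores, keep_frac=0.5):
    """Per-sample distance to the target: the absolute gap between a source
    sample's critic score and the mean target critic score. Returns the
    indices of the keep_frac most target-like source samples."""
    dist = np.abs(src_scores - tgt_scores.mean())
    k = max(1, int(keep_frac * len(src_scores)))
    return np.argsort(dist)[:k]  # indices with the smallest distances
```

The source classifier is then fine-tuned with the same cross-entropy loss as in pre-training, restricted to the selected samples.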
Aggregated Target Prediction
In the testing stage, the goal is to accurately classify a given target image $x_T$. For each source domain, we extract the features of the target image with the target encoder $F_{T_i}$ learned in stage 2, and obtain a source-specific prediction with the distilled source classifier $C_i$. We then combine the predictions from the different source classifiers to obtain the final prediction:

$$\hat{y}_T = \sum_{i=1}^{M} w_i \, C_i\big(F_{T_i}(x_T)\big).$$

The key problem here is how to select the weights $w_i$ for the predictions from the different source classifiers. We design a novel weighting strategy based on the discrepancy between each source and the target to emphasize more relevant sources and suppress irrelevant ones. We assume that, after training in stage 2, the estimated Wasserstein distance $\mathcal{W}_i$ between each source and the target follows a standard Gaussian distribution. The weight of each domain is therefore computed from the corresponding Gaussian density, normalized over all source domains:

$$w_i = \frac{e^{-\mathcal{W}_i^2 / 2}}{\sum_{m=1}^{M} e^{-\mathcal{W}_m^2 / 2}}.$$
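Under the stated standard-Gaussian assumption, the weighting and aggregation can be sketched as below; `per_source_probs` holds one row of class probabilities per distilled source classifier, and the helper names are ours:

```python
import numpy as np

def source_weights(w_dists):
    # Standard-Gaussian density of each source's estimated Wasserstein
    # distance to the target, normalized over sources: sources closer
    # to the target (smaller distance) receive larger weight.
    dens = np.exp(-0.5 * np.asarray(w_dists, dtype=float) ** 2)
    return dens / dens.sum()

def aggregate_predictions(per_source_probs, w_dists):
    # Final target prediction: weighted sum of the per-source class
    # probability rows, one row per source classifier.
    weights = source_weights(w_dists)
    return weights @ np.asarray(per_source_probs, dtype=float)
```

The weights sum to one, so the aggregated output remains a valid probability distribution over the shared label space.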
| Standards | Method | →mm | →mt | →up | →sv | →sy | Avg |
| Single-best DA | DAN [long2015learning] | 63.8 | 96.3 | 94.2 | 62.5 | 85.4 | 80.4 |
| Source-combined DA | DAN [long2015learning] | 67.9 | 97.5 | 93.5 | 67.8 | 86.9 | 82.7 |
| Multi-source DA | DCTN [xu2018deep] | 70.5 | 96.2 | 92.8 | 77.6 | 86.8 | 84.8 |
We evaluate the proposed MDDA model on multi-source domain adaptation tasks in visual classification applications, including digit recognition and object classification.
Digits-five includes 5 digit image datasets sampled from different domains, including handwritten mt (MNIST) [lecun1998gradient], combined mm (MNIST-M) [ganin2015unsupervised], street image sv (SVHN) [netzer2011reading], synthetic sy (Synthetic Digits) [ganin2015unsupervised], and handwritten up (USPS) [hull1994database]. Following [xu2018deep, peng2019moment], we sample 25,000 images for training and 9,000 for testing in mt, mm, sv, sy, and select the entire 9,298 images in up as a domain.
Office-31 [saenko2010adapting] contains 4,110 images of 31 categories collected from office environments in 3 image domains: A (Amazon), downloaded from amazon.com, and W (Webcam) and D (DSLR), taken by a web camera and a digital SLR camera, respectively.
To compare MDDA with the state-of-the-art approaches for MDA, we select the following baselines. (1) Source-only, i.e. training on the source domains and testing on the target domain directly, which can be viewed as a lower bound of DA. (2) Single-source DA, which performs multi-source DA via single-source DA methods, including conventional models, i.e. TCA [pan2011domain] and GFK [gong2012geodesic], and deep methods, i.e. DDC [tzeng2015simultaneous], DRCN [ghifary2016deep], RevGrad [ganin2015unsupervised], DAN [long2015learning], RTN [long2016unsupervised], CORAL [sun2016return], DANN [ganin2016domain], and ADDA [tzeng2017adversarial]. (3) Multi-source DA, which extends single-source DA methods to multi-source settings, including DCTN [xu2018deep] and MDAN [zhao2018adversarial].
| Standards | Method | →D | →W | →A | Avg |
| Single-best DA | TCA [pan2011domain] | 95.2 | 93.2 | 51.6 | 80.0 |
| Source-combined DA | RevGrad [ganin2015unsupervised] | 98.8 | 96.2 | 54.6 | 83.2 |
| Multi-source DA | DCTN [xu2018deep] | 99.6 | 96.9 | 54.9 | 83.8 |
For the source-only and single-source DA baselines, we employ two strategies: (1) source-combined, i.e. all source domains are combined into a single traditional source; (2) single-best, i.e. adaptation is performed on each single source and the best result on the target test set is reported.
In the Digits-five experiments, we use three convolutional layers and two fully connected layers as the encoder, and one fully connected layer as the classifier. In the Office-31 experiments, we use AlexNet as our backbone; the last layer serves as the classifier and the remaining layers as the encoder. Following [gulrajani2017improved], we set the balancing coefficient $\lambda$ in Eq. (5) to 10.
Comparison with the State-of-the-art
The performance comparisons between MDDA and the state-of-the-art approaches, measured by classification accuracy, are shown in Table 1 and Table 2 for the Digits-five and Office-31 datasets, respectively. From the results, we have the following observations.
(1) The source-only method, i.e. directly transferring the models trained on the source domains to the target domain, performs the worst in most adaptation settings. Due to the presence of domain shift, the joint probability distributions of observed images and class labels differ greatly between the source and target domains, which results in the model's low transferability from the source domains to the target domain. Further, even with more training samples, the source-combined setting does not necessarily perform better than the single-best one. This is because domain shift also exists across different source domains, which may confuse the classifier. For example, if one source domain is very similar to the target, such as sv and sy, and the other source domains are quite different, a simple combination would enlarge the domain shift between the combined source and the target. This observation demonstrates the necessity of designing DA algorithms to address the domain shift problem.
(2) Almost all adaptation methods outperform the source-only methods, demonstrating the effectiveness of DA in image classification. Comparing the Single-best DA and Source-combined DA, it is clear that on average the Source-combined DA performs better, which is different from the source-only scenario. This is because after adaptation, domain-invariant representations are learned for the samples of different domains. Therefore, the Source-combined DA works better with the help of more training data.
(3) Generally, multi-source DA performs better than the other adaptation standards. This is clearer when comparing methods that employ similar adaptation architectures, such as our MDDA vs. ADDA [tzeng2017adversarial] and MDAN [zhao2018adversarial] vs. DANN [ganin2016domain]. Multi-source DA bridges not only the domain shift between the sources and the target but also the shift across the different source domains, which boosts adaptation by exploring the complementarity of different sources.
(4) The proposed MDDA model performs better than state-of-the-art multi-source methods in most cases. On one hand, the performance improvements of MDDA over the best source-combined method are 3.1% and 0.5% on the Digits-five and Office-31 datasets, respectively. On the other hand, MDDA achieves 3.3% and 0.4% improvements over DCTN [xu2018deep], and 4.8% and 0.9% over MDAN [zhao2018adversarial], on the Digits-five and Office-31 datasets, respectively. These results demonstrate that the proposed MDDA model achieves superior performance relative to state-of-the-art approaches. The improvements stem from the advantages of MDDA. First, the unshared weights enable learning the best feature extractor and classifier for each source domain, which boosts performance during aggregation. Second, the novel weighting strategy based on the Wasserstein distance better emphasizes the domains that are closer to the target. Finally, for each source domain, selected samples are distilled to fine-tune the source classifier, which further adapts it to the target features.
Interpretability and Ablation Study
To show the adaptation ability of the proposed MDDA model, we visualize the features before and after adversarial adaptation with t-SNE embeddings [maaten2008visualizing] on the mt→up and mm→sy tasks. As illustrated in Figure 3, we make two observations: (1) the target features become denser after adversarial adaptation; (2) the target domain fits the source domain more tightly after adversarial adaptation, which demonstrates that MDDA can align the distributions of the source and target domains.
The proposed MDDA model contains two major components: source distilling for fine-tuning the source classifiers and a novel weighting strategy for aggregating the target prediction. We conduct an ablation study to further verify their effectiveness by changing one component while fixing the other.
We compare the proposed weighting strategy with a straightforward baseline: uniform weights. The results on the Digits-five and Office-31 datasets are shown in Table 3 and Table 4, respectively. We observe that the proposed weighting strategy outperforms uniform weighting. This is reasonable because uniform weights do not reflect the importance of the different sources, which may have different similarities to the target. By considering the relative similarity of each source to the target based on the Wasserstein distance, the proposed MDDA achieves 6.6% and 1.1% improvements on the Digits-five and Office-31 datasets, respectively. These observations demonstrate the effectiveness of the proposed weighting strategy.
Table 5 and Table 6 compare the performance with and without fine-tuning the source classifiers on the distilled source samples, on the Digits-five and Office-31 datasets, respectively. It is clear that without distilling, the adaptation performance drops in most cases. For example, source distilling yields 0.3% and 0.5% average accuracy improvements on the Digits-five and Office-31 datasets. This confirms the validity of distilling the sources, since the selected source samples are more similar to the target ones and the fine-tuned classifiers enhance transferability.
To better demonstrate the effectiveness of source distilling, we give an example of the Wasserstein distance based ADDA method before and after distilling on the Digits-five dataset, with sy as the target domain and the others as source domains. As shown in Table 7, the performance gains of source distilling vary across sources. For sources with larger domain discrepancies to the target, e.g. mt to sy and up to sy, source distilling yields higher improvements (2.5% and 2.1%, respectively), while the improvement is less obvious for sources with smaller discrepancies to the target, e.g. sv to sy (0.1%) and mm to sy (0.2%). This is reasonable: when a source domain is far from the target, the distilled samples can pull the classifier closer to the target domain; if the source is already very similar to the target, the influence of the distilled samples is less pronounced.
To show the interpretability of our model, we use heat maps generated by the Grad-CAM algorithm [gradcam2017iccv] to visualize the attention before and after our proposed domain adaptation. As illustrated in Figure 4, we observe that after domain adaptation, the attention generated by our model focuses better on the more discriminative regions, which indicates that the model attends to the discriminative regions of the objects for classification even when the background or viewpoint changes. This observation verifies that our model learns features that are more invariant across domains while remaining discriminative for the desired learning task (i.e. image classification).
For example, the ring binder in the first row shows that before adaptation, the model focuses on a region in the background, instead of the central target object. However, after our domain adaptation, the model can correctly focus on the ring binder and thus is more discriminative for the classification. Similar observations can be found in the second and third rows. In the last row, we find that attention is enhanced on the discriminative regions of the object (the laptop) after our domain adaptation.
In this paper, we have proposed an effective multi-source domain adaptation approach, MDDA. The separately pre-trained feature extractor and classifier for each source domain sufficiently exploit the discriminability of the labeled source data. The adversarial discriminative adaptation and source distilling match the target feature distribution to the source ones and fine-tune the pre-trained classifiers. A novel weighting strategy is designed to jointly combine the predictions from the different source classifiers. Extensive experiments on the Digits-five and Office-31 benchmarks demonstrate that MDDA achieves 3.3% and 0.4% performance improvements, respectively, over the state-of-the-art multi-source domain adaptation approach (i.e. DCTN) for digit and object classification. In future studies, we plan to extend the MDDA model to more challenging vision tasks, such as scene segmentation. We also aim to investigate methods that combine generative and discriminative pipelines for multi-source domain adaptation.
This work is supported by Berkeley DeepDrive and the National Natural Science Foundation of China (No. 61701273).