1 Background and Motivation
The availability of large-scale labeled training data, such as ImageNet, has enabled deep neural networks (DNNs) to achieve remarkable success in many learning tasks, ranging from computer vision to natural language processing. For example, the classification error of the “Classification + localization with provided training data” task in the Large Scale Visual Recognition Challenge fell from 0.28 in 2010 to 0.0225 in 2017 (http://image-net.org/challenges/LSVRC/2017), outperforming even human classification. However, in many practical applications, obtaining labeled training data is expensive, time-consuming, or even impossible. For example, in fine-grained recognition, only experts can provide reliable labels [gebru2017fine]; in semantic segmentation, it takes about 90 minutes to label each Cityscapes image [cordts2016cityscapes]; in autonomous driving, it is difficult to label 3D LiDAR point clouds point by point [wu2019squeezesegv2].
One potential solution is to transfer a model trained on a separate, labeled source domain to the desired unlabeled or sparsely labeled target domain. But as Figure 1 demonstrates, the direct transfer of models across domains leads to poor performance. Figure 1(a) shows that even for the simple task of digit recognition, training on the MNIST source domain [lecun1998gradient] for digit classification in the MNIST-M target domain [ganin2015unsupervised] leads to a digit classification accuracy decrease from 96.0% to 52.3% when training a LeNet-5 model [lecun1998gradient]. Figure 1(b) shows a more realistic example of training a semantic segmentation model on a synthetic source dataset GTA [richter2016playing] and conducting pixel-wise segmentation on a real target dataset Cityscapes [cordts2016cityscapes] using the FCN model [long2015fully]. If we train on the real data, we obtain a mean intersection-over-union (mIoU) of 62.6%; but if we train on synthetic data, the mIoU drops significantly to 21.7%.
The poor performance from directly transferring models across domains stems from a phenomenon known as domain shift [torralba2011unbiased, zhao2018emotiongan], whereby the joint probability distributions of observed data and labels differ between the two domains. Domain shift exists in many forms, such as from dataset to dataset, from simulation to real-world, from RGB images to depth, and from CAD models to real images.
The phenomenon of domain shift motivates the research on domain adaptation (DA), which aims to learn a model from a labeled source domain that can generalize well to a different, but related, target domain. Existing DA methods mainly focus on the single-source scenario. In the deep learning era, recent single-source DA (SDA) methods usually employ a conjoined architecture with two approaches to respectively represent the models for the source and target domains. One approach aims to learn a task model based on the labeled source data using corresponding task losses, such as cross-entropy loss for classification. The other approach aims to deal with the domain shift by aligning the target and source domains. Based on the alignment strategies, deep SDA methods can be classified into four categories:
Discrepancy-based methods try to align the features by explicitly measuring the discrepancy on corresponding activation layers, such as maximum mean discrepancy (MMD) [long2015learning], correlation alignment [sun2017correlation], and contrastive domain discrepancy [kang2019contrastive].
Adversarial generative methods generate fake data to align the source and target domains at pixel-level based on Generative Adversarial Network (GAN) [goodfellow2014generative] and its variants, such as CycleGAN [zhu2017unpaired, zhao2019cycleemotiongan].
Adversarial discriminative methods employ an adversarial objective with a domain discriminator to align the features [tzeng2017adversarial, tsai2018learning].
Reconstruction-based methods aim to reconstruct the target input from the extracted features using the source task model [ghifary2016deep].
In practice, the labeled data may be collected from multiple sources with different distributions [sun2015survey, bhatt2016multi]. In such cases, the aforementioned SDA methods could be trivially applied by combining the sources into a single source, an approach we refer to as source-combined DA. However, source-combined DA often performs worse than simply using one of the sources and discarding the others. As illustrated in Figure 2, the accuracy of the best single-source digit recognition adaptation using DANN [ganin2016domain] is 71.3%, while the source-combined accuracy drops to 70.8%. For segmentation adaptation using CyCADA [hoffman2018cycada], the mIoU of source-combined DA (37.3%) is also lower than that of SDA from GTA (38.7%). Because domain shift exists not only between each source and the target but also among different sources, the data from different sources may interfere with each other during the learning process [riemer2019learning]. Therefore, multi-source domain adaptation (MDA) is needed in order to leverage all of the available data.
The early MDA methods mainly focus on shallow models [sun2015survey], either learning a latent feature space for different domains [sun2011two, duan2012exploiting] or combining pre-learned source classifiers [schweikert2009empirical]. Recently, the emphasis on MDA has shifted to deep learning architectures. In this paper, we systematically survey recent progress on deep learning based MDA, summarize and compare similarities and differences in the approaches, and discuss potential future research directions.
2 Problem Definition
In the typical MDA setting, there are multiple source domains $S_1, S_2, \cdots, S_M$ ($M$ is the number of sources) and one target domain $T$. Suppose the observed data and corresponding labels (the label could be of any type, such as object classes, bounding boxes, or semantic segmentation) in the $i$-th source $S_i$, drawn from distribution $p_i(x, y)$, are $X_i = \{x_i^j\}_{j=1}^{N_i}$ and $Y_i = \{y_i^j\}_{j=1}^{N_i}$, respectively, where $N_i$ is the number of source samples. Let $X_T = \{x_T^j\}_{j=1}^{N_T}$ and $Y_T = \{y_T^j\}_{j=1}^{N_T}$ denote the target data and corresponding labels drawn from the target distribution $p_T(x, y)$, where $N_T$ is the number of target samples.
Suppose the number of labeled target samples is $N_{TL}$. The MDA problem can then be classified into different categories:
unsupervised MDA, when $N_{TL} = 0$;
fully supervised MDA, when $N_{TL} = N_T$;
semi-supervised MDA, otherwise.
Table 1: Datasets commonly employed for evaluating MDA models.
|area||task||dataset||reference||#domains||#samples||labels||short description|
|CV||digit recognition||Digits-five (D)||lecun1998gradient, netzer2011reading||5||145,298||10 classes||handwritten, synthetic, and street-image digits|
|CV||object classification||Office-31 (O)||saenko2010adapting||3||4,110||31 classes||images from amazon and taken by different cameras|
|CV||object classification||Office-Caltech (OC)||gong2013connecting||4||2,533||10 classes||overlapping categories from Office-31 and C|
|CV||object classification||Office-Home (OH)||venkateswara2017deep||4||15,500||65 classes||artistic, clipart, product, and real objects|
|CV||object classification||ImageCLEF (IC)||Challenge (http://imageclef.org/2014/adaptation)||3||1,800||12 classes||shared categories from 3 datasets|
|CV||object classification||PACS (P)||li2017deeper||4||9,991||7 classes||photographic, artistic, cartoon, and sketchy objects|
|CV||object classification||DomainNet (DN)||peng2019moment||6||600,000||345 classes||clipart, infographic, artistic, quickdrawn, real, and sketchy objects|
|CV||sentiment classification||SentiImage (SI)||machajdik2010affective||4||25,986||2 classes||artistic and social images on visual sentiment|
|CV||vehicle counting||WebCamT (W)||zhang2017understanding||8||16,000||vehicle counts||each camera used as one domain|
|CV||semantic segmentation||Sim2RealSeg (S2R)||cordts2016cityscapes, yu2018bdd100k, richter2016playing, ros2016synthia||4||49,366||16 classes||simulation-to-real adaptation for pixel-wise predictions|
|NLP||sentiment classification||AmazonReviews (AR)||chen2012marginalized||4||12,000||2 classes||reviews on four kinds of products|
|NLP||sentiment classification||MediaReviews (MR)||liu2017adversarial||5||6,897||2 classes||reviews on products and movies|
|NLP||part-of-speech tagging||SANCL (S)||petrov2012overview||5||5,250||tags||part-of-speech tagging in 5 web domains|
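Suppose $x_i \in \mathbb{R}^{d_i}$ and $x_T \in \mathbb{R}^{d_T}$ are observations in source $S_i$ and target $T$, respectively. We can then classify MDA into:
homogeneous MDA, when $d_1 = \cdots = d_M = d_T$;
heterogeneous MDA, otherwise.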
Suppose $C_i$ and $C_T$ are the label sets for source $S_i$ and target $T$, respectively. We can then define different MDA strategies:
closed set MDA, when $C_1 = \cdots = C_M = C_T$;
open set MDA, when for at least one $C_i$, $C_i \cap C_T \subset C_T$;
partial MDA, when for at least one $C_i$, $C_T \subset C_i$;
universal MDA, when no prior knowledge of the label sets is available;
where $\cap$ and $\subset$ indicate the intersection and the proper subset relation between two sets.
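The label-set strategies above can be illustrated with a minimal Python sketch. It assumes the usual reading of the definitions: closed set when all label sets coincide, open set when at least one source misses some target class, and partial when the target label set is a proper subset of at least one source label set. The function name `label_set_strategy` and the order in which the conditions are checked are hypothetical choices for illustration; the universal case is not modeled because it is defined by the absence of prior knowledge rather than by a set relation.

```python
from typing import Iterable, Set


def label_set_strategy(source_label_sets: Iterable[Set[str]],
                       target_label_set: Set[str]) -> str:
    """Name the MDA strategy implied by the source and target label sets."""
    sources = [set(s) for s in source_label_sets]
    target = set(target_label_set)

    # Closed set: every source shares exactly the target label set.
    if all(s == target for s in sources):
        return "closed set"
    # Open set: for at least one source, the shared labels form a
    # proper subset of the target labels (the target has unseen classes).
    if any((s & target) < target for s in sources):
        return "open set"
    # Partial: the target label set is a proper subset of some source's.
    if any(target < s for s in sources):
        return "partial"
    # Any remaining configuration is left unnamed in this sketch.
    return "other"


if __name__ == "__main__":
    print(label_set_strategy([{"cat", "dog"}, {"cat", "dog"}], {"cat", "dog"}))
    print(label_set_strategy([{"cat"}, {"cat", "dog"}], {"cat", "dog"}))
    print(label_set_strategy([{"cat", "dog", "car"}, {"cat", "dog"}], {"cat", "dog"}))
```

Since the open set and partial conditions can hold at the same time for different sources, a practical implementation might return a set of labels instead of a single string.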
Suppose the number of labeled source samples is $N_{iL}$ for source $S_i$, out of $N_i$ source samples in total. The MDA problem can then be classified into:
strongly supervised MDA, when $N_{iL} = N_i$ for $i = 1, \cdots, M$;
weakly supervised MDA, otherwise.
When adapting to multiple target domains simultaneously, the task becomes multi-target MDA. When the target data is unavailable during training [yue2019domain], the task is often called multi-source domain generalization or zero-shot MDA.
3 Datasets
The datasets for evaluating MDA models usually contain multiple domains with different styles, such as synthetic vs. real or artistic vs. sketchy, which imposes a large domain shift among the domains. Here we summarize the datasets commonly employed in both computer vision (CV) and natural language processing (NLP), as shown in Table 1.
Digit recognition. Digits-five includes 5 digit image datasets sampled from different domains: handwritten MNIST (mt) [lecun1998gradient], MNIST-M (mm) [ganin2015unsupervised] combining MNIST digits with randomly extracted color patches, street-image SVHN (sv) [netzer2011reading], Synthetic Digits (sy) [ganin2015unsupervised] generated from Windows fonts under various conditions, and handwritten USPS (up) [hull1994database]. Usually, 25,000 images are sampled for training and 9,000 for testing in each of mt, mm, sv, and sy. All 9,298 images in up are used.
Object classification. Office-31 [saenko2010adapting] contains 4,110 images in 31 categories collected from office environments in 3 domains: Amazon (A) with 2,817 images downloaded from amazon.com, and Webcam (W) and DSLR (D) with 795 and 498 images taken by a web camera and a digital SLR camera, respectively, under different photographic settings.
Office-Caltech [gong2013connecting] consists of the 10 overlapping categories shared by Office-31 [saenko2010adapting] and Caltech-256 (C) [griffin2007caltech]. There are 2,533 images in total.
Office-Home [venkateswara2017deep] consists of about 15,500 images from 65 categories of everyday objects in office and home settings. There are 4 different domains: Artistic images (Ar), Clip Art (Cl), Product images (Pr) and Real-World images (Rw).
ImageCLEF, which originated from the ImageCLEF 2014 domain adaptation challenge (http://imageclef.org/2014/adaptation), consists of 12 object categories shared by ImageNet ILSVRC 2012 (I), Pascal VOC 2012 (P), and Caltech-256 (C). There are 600 images in each domain, with 50 for each category.
PACS [li2017deeper] contains 9,991 images of 7 object categories extracted from 4 different domains: Photo (P), Art paintings (A), Cartoon (C) and Sketch (S).
DomainNet [peng2019moment], the largest DA dataset to date for object classification, contains about 600K images from 6 domains: Clipart, Infograph, Painting, Quickdraw, Real, and Sketch. There are 345 object categories in total.
Sentiment classification of images. SentiImage [lin2020multi] is a DA dataset with 4 domains for binary sentiment classification of images: social Flickr and Instagram (FI) [you2016building], artistic ArtPhoto (AP) [machajdik2010affective], social Twitter I (TI) [you2015robust], and social Twitter II (TII) [borth2013large]. There are 23,308, 806, 1,269, and 603 images in these 4 domains, respectively.
Vehicle counting. WebCamT [zhang2017understanding] is a vehicle counting dataset from large-scale city camera videos with low resolution, low frame rate, and high occlusion. In total, there are 60,000 frames with vehicle bounding box and count annotations. For MDA, 8 cameras located at different intersections are selected, each with more than 2,000 labeled images. We can view each camera as a domain.
Scene segmentation. Sim2RealSeg contains 2 synthetic datasets (GTA, SYNTHIA) and 2 real datasets (Cityscapes, BDDS) for segmentation. Cityscapes (CS) [cordts2016cityscapes] contains vehicle-centric urban street images collected from a moving vehicle in 50 cities in Germany and neighboring countries. There are 5,000 images with pixel-wise annotations into 19 classes. BDDS [yu2018bdd100k] contains 10,000 real-world dash cam video frames with a label space compatible with Cityscapes. GTA [richter2016playing] is a vehicle-egocentric image dataset collected in the high-fidelity rendered computer game GTA-V. It contains 24,966 images (video frames) with the same 19 classes as Cityscapes. SYNTHIA [ros2016synthia] is a large synthetic dataset. To pair with Cityscapes, a subset named SYNTHIA-RANDCITYSCAPES is designed with 9,400 images, which are automatically annotated with 16 classes compatible with Cityscapes, one void class, and some unnamed classes. The common 16 classes are used for MDA.
Sentiment classification of natural languages. Amazon Reviews [chen2012marginalized] is a dataset of reviews on four kinds of products: Books (B), DVDs (D), Electronics (E), and Kitchen appliances (K). Reviews are encoded as 5,000-dimensional feature vectors of unigrams and bigrams and are labeled with binary sentiment. Each source has 2,000 labeled examples, and the target test set has 3,000 to 6,000 examples.
Media Reviews [liu2017adversarial] contains 16 domains of product reviews and movie reviews for binary sentiment classification. Usually, 5 domains with 6,897 labeled samples are employed for MDA: Apparel, Baby, Books, and Camera from Amazon, and MR from Rotten Tomatoes.
Part-of-speech tagging. The SANCL dataset [petrov2012overview] contains part-of-speech tagging annotations in 5 web domains: Emails (E), Weblogs (W), Answers (A), Newsgroups (N), and Reviews (R). 750 sentences from each source are used for training.
Unless otherwise specified, each domain is selected in turn as the target and the remaining domains are considered as the sources. For WebCamT, 2 domains are randomly selected as the target. For Sim2RealSeg, MDA is often performed using the simulation-to-real setting [zhao2019multi], i.e. from synthetic GTA and SYNTHIA to real Cityscapes and BDDS. For SANCL, N, R, and A are used as target domains, while E and W are used as source domains [guo2018multi].
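The default evaluation protocol described above, where each domain serves once as the target and the remaining domains serve as sources, can be sketched in a few lines of Python. The helper name `leave_one_domain_out` is hypothetical, and the sketch only enumerates the splits; it does not cover the dataset-specific exceptions (WebCamT, Sim2RealSeg, SANCL).

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_domain_out(domains: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (sources, target) splits, holding out each domain once."""
    for i, target in enumerate(domains):
        # All domains except the held-out one act as labeled sources.
        sources = [d for j, d in enumerate(domains) if j != i]
        yield sources, target


if __name__ == "__main__":
    # Office-31 domains: Amazon (A), Webcam (W), DSLR (D).
    for sources, target in leave_one_domain_out(["A", "W", "D"]):
        print(f"sources={sources} -> target={target}")
```

Reported results for a dataset are then typically an aggregate over these splits, although the exact aggregation depends on the individual paper.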
Table 2: Comparison of representative latent space transformation methods for MDA.
|reference||feature extractor||alignment method||alignment loss||alignment domains||classifier alignment||#classifiers||weight||backbone||dataset||result|
|[mancini2018boosting]||shared||-||-||-||CT loss||1||-||AlexNet||O, OC, P||83.6, 91.8, 85.3|
|[guo2018multi]||shared||discrepancy||MMD||target and each source||-||$M$||PoS metric||AlexNet||AR, S||84.8, 90.1|
|[hoffman2018algorithms]||shared||discrepancy||Rényi-divergence||target and each source||CT loss||1||-||AlexNet||O||87.6|
|[zhu2019aligning]||shared||discrepancy||MMD||target and each source||$L_1$ loss||$M$||uniform||ResNet-50||O, OH, IC||90.2, 89.4, 74.1|
|[rakshit2019unsupervised]||unshared||discrepancy||$L_2$ distance||pairwise all domains||CT loss||1||-||ResNet-50||O, OC, IC||88.3, 97.5, 91.2|
|[peng2019moment]||shared||discrepancy||moment distance||pairwise all domains||$L_1$ loss||$M$||relative error||LeNet-5||D||87.7|
|[guo2020multi]||shared||discrepancy||mixture distance||target and each source||CT loss||1||-||BiLSTM||MR||79.3|
|[xu2018deep]||shared||discriminator||GAN loss||target and each source||-||$M$||perplexity score||AlexNet||D, O, IC||74.2, 83.8, 80.8|
|[li2018extracting]||shared||discriminator||Wasserstein||pairwise all domains||CT loss||1||-||AlexNet||D||79.9|
|[zhao2018adversarial]||shared||discriminator||$\mathcal{H}$-divergence||target and each source||CT loss||1||-||BiLSTM||AR||82.7|
|[wang2019tmda]||shared||discriminator||Wasserstein||pairwise all domains||CT loss||1||-||BiLSTM||AR||84.5|
|[zhao2020multi]||unshared||discriminator||Wasserstein||target and each source||-||$M$||Wasserstein||LeNet-5||D||88.1|
4 Deep Multi-source Domain Adaptation
Existing methods on deep MDA primarily focus on the unsupervised, homogeneous, closed set, strongly supervised, one target, and target data available settings. That is, there is one target domain, the target data is unlabeled but available during the training process, the source data is fully labeled, the source and target data are observed in the same data space, and the label sets of all sources and the target are the same. In this paper, we focus on MDA methods under these settings.
There are some theoretical analyses that support existing MDA algorithms. Most theories are based on the seminal theoretical model [blitzer2008learning, ben2010theory]. Mansour et al. [mansour2009mda] assumed that the target distribution can be approximated by a mixture of the source distributions. Therefore, a weighted combination of source classifiers has been widely employed for MDA. Moreover, tighter cross-domain generalization bounds and more accurate measurements of domain discrepancy can provide intuitions for deriving effective MDA algorithms. Hoffman et al. [hoffman2018algorithms] derived a novel bound using DC-programming and calculated more accurate combination weights. Zhao et al. [zhao2018adversarial] extended the generalization bound of the seminal theoretical model to multiple sources under both classification and regression settings. Besides the domain discrepancy between the target and each source [hoffman2018algorithms, zhao2018adversarial], Li et al. [li2018extracting] also considered the relationship between pairwise sources and derived a tighter bound on weighted multi-source discrepancy. Based on this bound, more relevant source domains can be picked out.
Typically, some task models (e.g. classifiers) are learned based on the labeled source data with corresponding task loss, such as cross-entropy loss for classification. Meanwhile, specific alignments among the source and target domains are conducted to bridge the domain shift so that the learned task models can be better transferred to the target domain. Based on the different alignment strategies, we can classify MDA into different categories. Latent space transformation tries to align the latent space (e.g. features) of different domains based on optimizing the discrepancy loss or adversarial loss. Intermediate domain generation explicitly generates an intermediate adapted domain for each source that is indistinguishable from the target domain. The task models are then trained on the adapted domain. Figure 3 summarizes the common overall framework of existing MDA methods.
4.1 Latent Space Transformation
The two common methods for aligning the latent spaces of different domains are discrepancy-based methods and adversarial methods. We discuss these two methods below, and Table 2 summarizes key examples of each method.
Discrepancy-based methods explicitly measure the discrepancy of the latent spaces (typically features) from different domains by optimizing some specific discrepancy losses, such as maximum mean discrepancy (MMD) [guo2018multi, zhu2019aligning], Rényi-divergence [hoffman2018algorithms], $L_2$ distance [rakshit2019unsupervised], and moment distance [peng2019moment]. Guo et al. [guo2020multi] claimed that different discrepancies or distances can only provide specific estimates of domain similarities and that each distance has its pathological cases. Therefore, they consider a mixture of several distances [guo2020multi], including $L_2$ distance, cosine distance, MMD, Fisher linear discriminant, and correlation alignment. Minimizing the discrepancy to align the features among the source and target domains does not introduce any new parameters that must be learned.
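As a rough illustration of how a discrepancy such as MMD can be evaluated between the target and each source, the following minimal sketch computes a biased estimate of the squared MMD with a Gaussian (RBF) kernel on NumPy feature arrays. The function names, the fixed bandwidth, and the use of a single kernel are assumptions made for illustration and are not taken from the surveyed methods, which typically compute such losses on network activations during training.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))


def mmd_squared(source: np.ndarray, target: np.ndarray, bandwidth: float = 1.0) -> float:
    """Biased estimate of the squared MMD between two feature sets."""
    k_ss = rbf_kernel(source, source, bandwidth)
    k_tt = rbf_kernel(target, target, bandwidth)
    k_st = rbf_kernel(source, target, bandwidth)
    return float(k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean())


def multi_source_mmd(sources, target, bandwidth: float = 1.0):
    """Squared MMD between the target and each source (one value per source)."""
    return [mmd_squared(s, target, bandwidth) for s in sources]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(0.0, 1.0, size=(200, 4))
    near_source = rng.normal(0.0, 1.0, size=(200, 4))  # same distribution
    far_source = rng.normal(3.0, 1.0, size=(200, 4))   # shifted distribution
    print(multi_source_mmd([near_source, far_source], target))
```

In this toy setting the shifted source yields a larger value than the source drawn from the target distribution, which is the kind of signal a discrepancy loss minimizes during training.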
Adversarial methods try to align the features by making them indistinguishable to a discriminator. Some representative optimized objectives include GAN loss [xu2018deep], $\mathcal{H}$-divergence [zhao2018adversarial], and Wasserstein distance [li2018extracting, wang2019tmda, zhao2020multi]. These methods aim to confuse the discriminator so that it cannot distinguish whether the features from multiple sources were drawn from the same distribution. Compared with GAN loss and $\mathcal{H}$-divergence, Wasserstein distance can provide more stable gradients even when the target and source distributions do not overlap [zhao2020multi]. The discriminator is often implemented as a network, which leads to new parameters that must be learned.
There are many modular implementation details for both types of methods, such as how to align the target and multiple sources, whether the feature extractors are shared, how to select the more relevant sources, and how to combine the multiple predictions from different classifiers.
Alignment domains. There are different ways to align the target and multiple sources. The most common method is to pairwise align the target with each source [xu2018deep, guo2018multi, zhao2018adversarial, hoffman2018algorithms, zhu2019aligning, zhao2020multi, guo2020multi]. Since domain shift also exists among different sources, several methods enforce pairwise alignment between every pair of domains, including both the sources and the target [li2018extracting, rakshit2019unsupervised, peng2019moment, wang2019tmda].
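The difference between the two alignment schemes can be made concrete by enumerating the domain pairs on which an alignment loss would be computed. The sketch below is illustrative only; the function names are hypothetical, and the domain names are placeholders.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def target_source_pairs(sources: Sequence[str], target: str) -> List[Tuple[str, str]]:
    """Pairs used when the target is aligned with each source separately."""
    return [(s, target) for s in sources]


def all_domain_pairs(sources: Sequence[str], target: str) -> List[Tuple[str, str]]:
    """Pairs used when every pair of domains (sources and target) is aligned."""
    return list(combinations(list(sources) + [target], 2))


if __name__ == "__main__":
    sources = ["S1", "S2", "S3"]
    print(target_source_pairs(sources, "T"))  # one pair per source
    print(all_domain_pairs(sources, "T"))     # also includes source-source pairs
```

With M sources, the first scheme produces M pairs, whereas the second produces M(M+1)/2 pairs, so the number of alignment terms grows quadratically with the number of sources.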
Table 3: Comparison of representative intermediate domain generation methods for MDA.
|reference||domain generator||pixel alignment domains||feature alignment loss||feature alignment domains||#classifiers||weight||backbone||dataset||task||result|
|[russo2019towards]||CoGAN||target and each source||GAN loss||target and each source||$M$||uniform||DeepLabV2||S2R-CS||seg||42.8|
|[zhao2019multi]||CycleGAN||target and aggregated source||GAN loss||target and each source||1||-||FCN8s||S2R-CS||seg||41.4|
|[lin2020multi]||VAE+CycleGAN||target and combined source||-||-||1||-||ResNet-18||SI||cls||68.1|
Weight sharing of feature extractor. Most methods employ shared feature extractors to learn domain-invariant features. However, domain invariance may be detrimental to discriminative power. In contrast, Rakshit et al. [rakshit2019unsupervised] adopted one feature extractor for each source and target pair with unshared weights, while Zhao et al. [zhao2020multi] first pre-trained one feature extractor for each source and then mapped the target into the feature space of each source. In both cases, the number of feature extractors grows with the number of sources. Although unshared feature extractors can better align the target and sources in the latent space, they substantially increase the number of parameters in the model.
Classifier alignment. Intuitively, the classifiers trained on different sources may produce misaligned predictions for target samples that are close to the domain boundary. Minimizing a specific classifier discrepancy, such as the $L_1$ loss [zhu2019aligning, peng2019moment], better aligns the classifiers, so that they learn a generalized classification boundary for such target samples. Instead of explicitly training one classifier for each source, many methods train a compound classifier based on a specific combined task loss, such as normalized activations [mancini2018boosting] and a bandit controller [guo2020multi].
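A classifier discrepancy of this kind can be sketched as the mean absolute difference between the class-probability outputs of every pair of source classifiers on the same target samples. The following NumPy example is a simplified illustration under that assumption; the function name `classifier_l1_discrepancy` and the averaging over pairs are hypothetical choices, not the exact formulation of any cited method.

```python
from itertools import combinations

import numpy as np


def classifier_l1_discrepancy(probs: np.ndarray) -> float:
    """Mean absolute difference between pairs of classifier outputs.

    probs has shape (M, N, C): M source classifiers, N target samples,
    C class probabilities per sample.
    """
    probs = np.asarray(probs, dtype=float)
    pairs = list(combinations(range(probs.shape[0]), 2))
    if not pairs:
        # A single classifier has nothing to disagree with.
        return 0.0
    return float(np.mean([np.abs(probs[i] - probs[j]).mean() for i, j in pairs]))


if __name__ == "__main__":
    agree = np.array([[[0.9, 0.1]], [[0.9, 0.1]]])
    disagree = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
    print(classifier_l1_discrepancy(agree))     # identical outputs
    print(classifier_l1_discrepancy(disagree))  # opposite outputs
```

During training, such a term would be added to the task loss so that the source classifiers are encouraged to agree on target samples.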
Target prediction. After aligning the features of target and source domains in the latent space, the classifiers trained based on the labeled source samples can be used to predict the labels of a target sample. Since there are multiple sources, it is possible that they will yield different target predictions. One way to reconcile these different predictions is to uniformly average the predictions from different source classifiers [zhu2019aligning]. However, different sources may have different relationships with the target, e.g. one source might better align with the target, so a non-uniform, weighted averaging of the predictions leads to better results. Weighting strategies, known as a source selection process, include uniform weight [zhu2019aligning], perplexity score based on adversarial loss [xu2018deep], point-to-set (PoS) metric using Mahalanobis distance [guo2018multi], relative error based on source-only accuracy [peng2019moment], and Wasserstein distance based weights [zhao2020multi].
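The weighted combination of source predictions described above can be sketched as follows. The example assumes that each source classifier outputs class probabilities for the target samples and that a per-source distance to the target (for instance a discrepancy estimate) is available. Converting distances to weights with a softmax over negative distances is a hypothetical choice made for illustration; the cited methods each define their own weighting scheme.

```python
import numpy as np


def weights_from_distances(distances, temperature: float = 1.0) -> np.ndarray:
    """Turn per-source distances to the target into normalized weights.

    Smaller distances receive larger weights (softmax over negative distances).
    """
    logits = -np.asarray(distances, dtype=float) / temperature
    logits -= logits.max()  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def combine_predictions(probs, weights=None) -> np.ndarray:
    """Weighted average of source classifier outputs.

    probs has shape (M, N, C); the result has shape (N, C).
    Uniform weights are used when none are given.
    """
    probs = np.asarray(probs, dtype=float)
    num_sources = probs.shape[0]
    if weights is None:
        weights = np.full(num_sources, 1.0 / num_sources)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Contract the source axis: sum_i weights[i] * probs[i].
    return np.tensordot(weights, probs, axes=1)


if __name__ == "__main__":
    probs = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])  # two sources, one sample
    print(combine_predictions(probs))               # uniform averaging
    w = weights_from_distances([0.1, 2.0])          # first source is closer
    print(combine_predictions(probs, w))            # leans toward first source
```

With uniform weights the two classifiers contribute equally, while distance-based weights let the source that is closer to the target dominate the final prediction.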
Besides the source importance, Zhao et al. [zhao2020multi] also considered the sample importance, i.e. different samples from the same source may still have different similarities to the target samples. The source samples that are closer to the target are distilled (based on a manually selected Wasserstein distance threshold) to fine-tune the source classifiers. Automatically and adaptively selecting the most relevant training samples for each source remains an open research problem.
4.2 Intermediate Domain Generation
Feature-level alignment only aligns high-level information, which is insufficient for fine-grained predictions, such as pixel-wise semantic segmentation [zhao2019multi]. Generating an intermediate adapted domain with pixel-level alignment, typically via GANs [goodfellow2014generative], can help address this problem.
Domain generator. Since the original GAN is highly under-constrained, some improved versions are employed, such as Coupled GAN (CoGAN) in [russo2019towards] and CycleGAN in MADAN [zhao2019multi]. Instead of directly taking the original source data as input to the generator [russo2019towards, zhao2019multi], Lin et al. [lin2020multi] used a variational autoencoder to map all source and target domains to a latent space and then generated an adapted domain from the latent space. Russo et al. [russo2019towards] then tried to align the target and each adapted domain, while Lin et al. [lin2020multi] aligned the target and the combined adapted domain from the latent space. Zhao et al. [zhao2019multi] proposed to aggregate different adapted domains using a sub-domain aggregation discriminator and a cross-domain cycle discriminator, where the pixel-level alignment is then conducted between the aggregated and target domains. Zhao et al. [zhao2019multi] and Lin et al. [lin2020multi] showed that the semantics might change in the intermediate representation, and that enforcing semantic consistency before and after generation can help preserve the labels.
Feature alignment and target prediction. Feature-level alignment is often jointly considered with pixel-level alignment. Both alignments are usually achieved by minimizing the GAN loss with a discriminator. In [russo2019towards], one classifier is trained on each adapted domain and the multiple predictions for a given target sample are averaged. Alternatively, only one classifier is trained, either on the aggregated domain [zhao2019multi] or on the combined adapted domain [lin2020multi], which is obtained by a single generator from the latent space shared by all source domains. A comparison of these methods is summarized in Table 3.
5 Conclusion and Future Directions
In this paper, we provided a survey of recent MDA developments in the deep learning era. We motivated MDA, defined different MDA strategies, and summarized the datasets that are commonly used for performing MDA evaluation. Our survey focused on a typical MDA setting, i.e. unsupervised, homogeneous, closed set, and one target MDA. We classified these methods into different categories, and compared the representative ones technically and experimentally. We conclude with several open research directions:
Specific MDA strategy implementation. As introduced in Section 2, there are many types of MDA strategies, and implementing an MDA strategy according to the specific problem requirement would likely yield better results than a one-size-fits-all MDA approach. Further investigation is needed to determine which MDA strategies work the best for which types of problems. Also, real-world applications may have a small amount of labeled target data; determining how to include this data and what fraction of this data is needed for a certain performance remains an open question.
Multi-modal MDA. The labeled source data may be of different modalities, such as LiDAR, radar, and image. Further research is needed to find techniques for fusing different data modalities in MDA. A further extension of this idea is to have varied modalities in different sources as well as partially labeled, multi-modal sources.
Incremental and online MDA. Designing incremental and online MDA algorithms remains largely unexplored and may provide great benefit for real-world scenarios, such as updating deployed MDA models when new source or target data becomes available.