Multi-source Domain Adaptation in the Deep Learning Era: A Systematic Survey

02/26/2020 ∙ by Sicheng Zhao, et al. (University of California, Berkeley; Beijing Didi Infinity Technology and Development Co., Ltd.)

In many practical applications, it is often difficult and expensive to obtain enough large-scale labeled data to train deep neural networks to their full capability. Therefore, transferring the learned knowledge from a separate, labeled source domain to an unlabeled or sparsely labeled target domain becomes an appealing alternative. However, direct transfer often results in significant performance decay due to domain shift. Domain adaptation (DA) addresses this problem by minimizing the impact of domain shift between the source and target domains. Multi-source domain adaptation (MDA) is a powerful extension in which the labeled data may be collected from multiple sources with different distributions. Due to the success of DA methods and the prevalence of multi-source data, MDA has attracted increasing attention in both academia and industry. In this survey, we define various MDA strategies and summarize available datasets for evaluation. We also compare modern MDA methods in the deep learning era, including latent space transformation and intermediate domain generation. Finally, we discuss future research directions for MDA.


1 Background and Motivation

The availability of large-scale labeled training data, such as ImageNet, has enabled deep neural networks (DNNs) to achieve remarkable success in many learning tasks, ranging from computer vision to natural language processing. For example, the classification error in the "Classification + localization with provided training data" task of the Large Scale Visual Recognition Challenge dropped from 0.28 in 2010 to 0.0225 in 2017 (http://image-net.org/challenges/LSVRC/2017), surpassing human-level performance. However, in many practical applications, obtaining labeled training data is often expensive, time-consuming, or even infeasible. For example, in fine-grained recognition, only experts can provide reliable labels [gebru2017fine]; in semantic segmentation, it takes about 90 minutes to label each Cityscapes image [cordts2016cityscapes]; in autonomous driving, it is difficult to label point-wise 3D LiDAR point clouds [wu2019squeezesegv2].

Figure 1: An example of domain shift in the single-source scenario. Models trained on the labeled source domain do not perform well when directly transferred to the target domain.

One potential solution is to transfer a model trained on a separate, labeled source domain to the desired unlabeled or sparsely labeled target domain. But as Figure 1 demonstrates, directly transferring models across domains leads to poor performance. Figure 1(a) shows that even for the simple task of digit recognition, a LeNet-5 model [lecun1998gradient] trained on the MNIST source domain [lecun1998gradient] suffers a classification accuracy drop from 96.0% to 52.3% when evaluated on the MNIST-M target domain [ganin2015unsupervised]. Figure 1(b) shows a more realistic example: training an FCN semantic segmentation model [long2015fully] on the synthetic source dataset GTA [richter2016playing] and conducting pixel-wise segmentation on the real target dataset Cityscapes [cordts2016cityscapes]. Training on the real data yields a mean intersection-over-union (mIoU) of 62.6%, but training on the synthetic data drops the mIoU significantly, to 21.7%.

Figure 2: An example of domain shift in the multi-source scenario. Combining multiple sources into one source and directly performing single-source domain adaptation on the entire dataset does not guarantee better performance compared to just using the best individual source domain.

The poor performance from directly transferring models across domains stems from a phenomenon known as domain shift [torralba2011unbiased, zhao2018emotiongan]: the joint probability distributions of observed data and labels differ between the two domains. Domain shift exists in many forms, such as from dataset to dataset, from simulation to the real world, from RGB images to depth, and from CAD models to real images.

The phenomenon of domain shift motivates research on domain adaptation (DA), which aims to learn a model from a labeled source domain that generalizes well to a different, but related, target domain. Existing DA methods mainly focus on the single-source scenario. In the deep learning era, recent single-source DA (SDA) methods usually employ a conjoined architecture with two streams that represent the source and target models, respectively. One stream learns a task model on the labeled source data using a corresponding task loss, such as cross-entropy for classification. The other stream handles the domain shift by aligning the target and source domains. Based on their alignment strategies, deep SDA methods can be classified into four categories:

  1. Discrepancy-based methods align the features by explicitly measuring the discrepancy between corresponding activation layers, using metrics such as maximum mean discrepancy (MMD) [long2015learning], correlation alignment [sun2017correlation], and contrastive domain discrepancy [kang2019contrastive] (see the sketch after this list).

  2. Adversarial generative methods generate fake data to align the source and target domains at pixel-level based on Generative Adversarial Network (GAN) [goodfellow2014generative] and its variants, such as CycleGAN [zhu2017unpaired, zhao2019cycleemotiongan].

  3. Adversarial discriminative methods employ an adversarial objective with a domain discriminator to align the features [tzeng2017adversarial, tsai2018learning].

  4. Reconstruction based methods aim to reconstruct the target input from the extracted features using the source task model [ghifary2016deep].
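
To make the discrepancy-based idea concrete, here is a minimal PyTorch sketch of an RBF-kernel MMD loss between a source and a target feature batch. The kernel bandwidths, variable names, and the way the penalty is combined with the task loss are illustrative assumptions, not any particular paper's implementation:

```python
import torch

def rbf_mmd2(source_feats, target_feats, sigmas=(1.0, 5.0, 10.0)):
    """A (biased) estimate of squared MMD between two feature batches,
    using a mixture of RBF kernels; bandwidths are illustrative."""
    def kernel(a, b):
        dist2 = torch.cdist(a, b).pow(2)  # pairwise squared distances
        return sum(torch.exp(-dist2 / (2.0 * s * s)) for s in sigmas)

    return (kernel(source_feats, source_feats).mean()
            + kernel(target_feats, target_feats).mean()
            - 2.0 * kernel(source_feats, target_feats).mean())

# Hypothetical training step: the total objective adds the MMD penalty
# to the supervised task loss computed on the labeled source data, e.g.
#   loss = task_loss(classifier(f_src), y_src) + lam * rbf_mmd2(f_src, f_tgt)
```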

In practice, the labeled data may be collected from multiple sources with different distributions [sun2015survey, bhatt2016multi]. In such cases, the aforementioned SDA methods can be trivially applied by combining the sources into a single source: an approach we refer to as source-combined DA. However, source-combined DA often performs worse than simply using the best individual source and discarding the others. As illustrated in Figure 2, the accuracy of the best single-source digit recognition adaptation using DANN [ganin2016domain] is 71.3%, while the source-combined accuracy drops to 70.8%. For segmentation adaptation using CyCADA [hoffman2018cycada], the mIoU of source-combined DA (37.3%) is likewise lower than that of SDA from GTA alone (38.7%). Because domain shift exists not only between each source and the target but also among the different sources, the combined data from different sources may interfere with each other during learning [riemer2019learning]. Therefore, multi-source domain adaptation (MDA) is needed in order to leverage all of the available data.

The early MDA methods mainly focus on shallow models [sun2015survey], either learning a latent feature space for different domains [sun2011two, duan2012exploiting] or combining pre-learned source classifiers [schweikert2009empirical]. Recently, the emphasis on MDA has shifted to deep learning architectures. In this paper, we systematically survey recent progress on deep learning based MDA, summarize and compare similarities and differences in the approaches, and discuss potential future research directions.

2 Problem Definition

In the typical MDA setting, there are multiple source domains $\{S_i\}_{i=1}^{M}$ ($M$ is the number of sources) and one target domain $T$. Suppose the observed data and corresponding labels (the labels could be of any type, such as object classes, bounding boxes, or semantic segmentation maps) in source $S_i$, drawn from the distribution $p_i(\mathbf{x}, y)$, are $\mathbf{X}_i = \{\mathbf{x}_i^j\}_{j=1}^{N_i}$ and $\mathbf{Y}_i = \{y_i^j\}_{j=1}^{N_i}$, respectively, where $N_i$ is the number of samples in $S_i$. Let $\mathbf{X}_T = \{\mathbf{x}_T^j\}_{j=1}^{N_T}$ and $\mathbf{Y}_T = \{y_T^j\}_{j=1}^{N_T}$ denote the target data and corresponding labels drawn from the target distribution $p_T(\mathbf{x}, y)$, where $N_T$ is the number of target samples.

Suppose the number of labeled target samples is $N_{TL}$; the MDA problem can then be classified into different categories:

  • unsupervised MDA, when $N_{TL} = 0$;

  • fully supervised MDA, when $N_{TL} = N_T$;

  • semi-supervised MDA, otherwise.

| Area | Task | Dataset | Reference | #D | #S | Labels | Short description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CV | digit recognition | Digits-five (D) | [lecun1998gradient, netzer2011reading, hull1994database, ganin2015unsupervised] | 5 | 145,298 | 10 classes | handwritten, synthetic, and street-image digits |
| CV | object classification | Office-31 (O) | [saenko2010adapting] | 3 | 4,110 | 31 classes | images from amazon and taken by different cameras |
| CV | object classification | Office-Caltech (OC) | [gong2013connecting] | 4 | 2,533 | 10 classes | overlapping categories from Office-31 and C |
| CV | object classification | Office-Home (OH) | [venkateswara2017deep] | 4 | 15,500 | 65 classes | artistic, clipart, product, and real objects |
| CV | object classification | ImageCLEF (IC) | ImageCLEF 2014 DA challenge | 3 | 1,800 | 12 classes | shared categories from 3 datasets |
| CV | object classification | PACS (P) | [li2017deeper] | 4 | 9,991 | 7 classes | photographic, artistic, cartoon, and sketchy objects |
| CV | object classification | DomainNet (DN) | [peng2019moment] | 6 | 600,000 | 345 classes | clipart, infographic, artistic, quickdrawn, real, and sketchy objects |
| CV | sentiment classification | SentiImage (SI) | [machajdik2010affective, you2016building, you2015robust, borth2013large] | 4 | 25,986 | 2 classes | artistic and social images on visual sentiment |
| CV | vehicle counting | WebCamT (W) | [zhang2017understanding] | 8 | 16,000 | vehicle counts | each camera used as one domain |
| CV | semantic segmentation | Sim2RealSeg (S2R) | [cordts2016cityscapes, yu2018bdd100k, richter2016playing, ros2016synthia] | 4 | 49,366 | 16 classes | simulation-to-real adaptation for pixel-wise predictions |
| NLP | sentiment classification | AmazonReviews (AR) | [chen2012marginalized] | 4 | 12,000 | 2 classes | reviews on four kinds of products |
| NLP | sentiment classification | MediaReviews (MR) | [liu2017adversarial] | 5 | 6,897 | 2 classes | reviews on products and movies |
| NLP | part-of-speech tagging | SANCL (S) | [petrov2012overview] | 5 | 5,250 | tags | part-of-speech tagging in 5 web domains |

Table 1: Released and freely available datasets for MDA, where '#D' and '#S' denote the number of domains and the total number of samples typically used for MDA, respectively.

Suppose $\mathbf{x}_i$ and $\mathbf{x}_T$ are observations in source $S_i$ and target $T$, and $d(\cdot)$ denotes the dimensionality of an observation; we can then classify MDA into:

  • homogeneous MDA, when $d(\mathbf{x}_i) = d(\mathbf{x}_T)$ for all $i$;

  • heterogeneous MDA, otherwise.

Suppose $\mathcal{C}_i$ and $\mathcal{C}_T$ are the label sets of source $S_i$ and target $T$; we can define different MDA strategies:

  • closed set MDA, when $\mathcal{C}_i = \mathcal{C}_T$ for all $i$;

  • open set MDA, when for at least one $\mathcal{C}_i$, $\mathcal{C}_i \cap \mathcal{C}_T \subsetneq \mathcal{C}_T$;

  • partial MDA, when for at least one $\mathcal{C}_i$, $\mathcal{C}_T \subsetneq \mathcal{C}_i$;

  • universal MDA, when no prior knowledge of the label sets is available;

where $\cap$ and $\subsetneq$ denote the intersection and proper subset of two sets, respectively.

Suppose the number of labeled samples in source $S_i$ is $N_{iL}$; the MDA problem can then be classified into:

  • strongly supervised MDA, when $N_{iL} = N_i$ for all $i \in \{1, \dots, M\}$;

  • weakly supervised MDA, otherwise.

When adapting to multiple target domains simultaneously, the task becomes multi-target MDA. When the target data is unavailable during training [yue2019domain], the task is often called multi-source domain generalization or zero-shot MDA.

3 Datasets

The datasets for evaluating MDA models usually contain multiple domains with different styles, such as synthetic vs. real or artistic vs. sketchy, which induce large domain shifts across domains. Here we summarize the commonly employed datasets in both computer vision (CV) and natural language processing (NLP), as shown in Table 1.

Digit recognition. Digits-five includes 5 digit image datasets sampled from different domains: handwritten MNIST (mt) [lecun1998gradient]; MNIST-M (mm) [ganin2015unsupervised], created by combining MNIST with randomly extracted color patches; street-image SVHN (sv) [netzer2011reading]; Synthetic Digits (sy) [ganin2015unsupervised], generated from Windows fonts under various conditions; and handwritten USPS (up) [hull1994database]. Usually, 25,000 images are sampled for training and 9,000 for testing in mt, mm, sv, and sy, while all 9,298 images in up are used.

Object classification. Office-31 [saenko2010adapting] contains 4,110 images in 31 categories collected from office environments in 3 domains: Amazon (A), with 2,817 images downloaded from amazon.com, and Webcam (W) and DSLR (D), with 795 and 498 images taken by a web camera and a digital SLR camera under different photographic settings, respectively.

Office-Caltech [gong2013connecting] consists of the 10 categories shared by Office-31 [saenko2010adapting] and Caltech-256 (C) [griffin2007caltech], with 2,533 images in total.

Office-Home [venkateswara2017deep] consists of about 15,500 images from 65 categories of everyday objects in office and home settings. There are 4 different domains: Artistic images (Ar), Clip Art (Cl), Product images (Pr) and Real-World images (Rw).

ImageCLEF, which originated from the ImageCLEF 2014 domain adaptation challenge (http://imageclef.org/2014/adaptation), consists of 12 object categories shared by ImageNet ILSVRC 2012 (I), Pascal VOC 2012 (P), and C. Each domain has 600 images, with 50 per category.

PACS [li2017deeper] contains 9,991 images of 7 object categories extracted from 4 different domains: Photo (P), Art paintings (A), Cartoon (C) and Sketch (S).

DomainNet [peng2019moment], the largest DA dataset to date for object classification, contains about 600K images from 6 domains: Clipart, Infograph, Painting, Quickdraw, Real, and Sketch, covering 345 object categories in total.

Sentiment classification of images. SentiImage [lin2020multi] is a DA dataset with 4 domains for binary sentiment classification of images: social Flickr and Instagram (FI) [you2016building], artistic ArtPhoto (AP) [machajdik2010affective], social Twitter I (TI) [you2015robust], and social Twitter II (TII) [borth2013large], with 23,308, 806, 1,269, and 603 images in these 4 domains, respectively.

Vehicle counting. WebCamT [zhang2017understanding] is a vehicle counting dataset of large-scale city camera videos with low resolution, low frame rate, and high occlusion. In total there are 60,000 frames with vehicle bounding box and count annotations. For MDA, 8 cameras located at different intersections, each with more than 2,000 labeled images, are selected, and each camera is viewed as a separate domain.

Figure 3: Illustration of the widely employed framework for MDA. The solid arrows and dash-dot arrows indicate the training of latent space transformation and intermediate domain generation, respectively. The dashed arrows represent the inference process. Most existing MDA methods can be obtained by varying the component details, enforcing certain constraints, or slightly changing the architecture. Best viewed in color.

Scene segmentation. Sim2RealSeg contains 2 synthetic datasets (GTA, SYNTHIA) and 2 real datasets (Cityscapes, BDDS) for segmentation. Cityscapes (CS) [cordts2016cityscapes] contains vehicle-centric urban street images collected from a moving vehicle in 50 cities in Germany and neighboring countries; it provides 5,000 images with pixel-wise annotations over 19 classes. BDDS [yu2018bdd100k] contains 10,000 real-world dash-cam video frames with a label space compatible with Cityscapes. GTA [richter2016playing] is a vehicle-egocentric image dataset collected in the high-fidelity rendered computer game GTA-V; it contains 24,966 images (video frames) with the same 19 classes as Cityscapes. SYNTHIA [ros2016synthia] is a large synthetic dataset. To pair with Cityscapes, a subset named SYNTHIA-RAND-CITYSCAPES was designed, with 9,400 images automatically annotated with 16 Cityscapes-compatible classes, one void class, and some unnamed classes. The common 16 classes are used for MDA.

Sentiment classification of natural languages. Amazon Reviews [chen2012marginalized] is a dataset of reviews on four kinds of products: Books (B), DVDs (D), Electronics (E), and Kitchen appliances (K). Reviews are encoded as 5,000-dimensional feature vectors of unigrams and bigrams and are labeled with binary sentiment. Each source has 2,000 labeled examples, and the target test set has 3,000 to 6,000 examples.

Media Reviews [liu2017adversarial] contains 16 domains of product and movie reviews for binary sentiment classification. 5 domains with 6,897 labeled samples are usually employed for MDA: Apparel, Baby, Books, and Camera from Amazon, and MR from Rotten Tomatoes.

Part-of-speech tagging. The SANCL dataset [petrov2012overview] contains part-of-speech tagging annotations in 5 web domains: Emails (E), Weblogs (W), Answers (A), Newsgroups (N), and Reviews (R). 750 sentences from each source are used for training.

Unless otherwise specified, each domain is selected in turn as the target and the remaining domains are considered as the sources. For WebCamT, 2 domains are randomly selected as targets. For Sim2RealSeg, MDA is often performed in the simulation-to-real setting [zhao2019multi], i.e. from synthetic GTA and SYNTHIA to real Cityscapes and BDDS. For SANCL, N, R, and A are used as target domains, while E and W are used as source domains [guo2018multi].

| Reference | Feature extractor | Feature alignment method | Feature alignment loss | Feature alignment domains | Classifier alignment | #C | Classifier weight | Task backbone | Dataset | Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [mancini2018boosting] | shared | – | – | – | CT loss | 1 | – | AlexNet | O, OC, P | 83.6, 91.8, 85.3 |
| [guo2018multi] | shared | discrepancy | MMD | target and each source | – | M | PoS metric | AlexNet | AR, S | 84.8, 90.1 |
| [hoffman2018algorithms] | shared | discrepancy | Rényi-divergence | target and each source | CT loss | 1 | – | AlexNet | O | 87.6 |
| [zhu2019aligning] | shared | discrepancy | MMD | target and each source | L1 loss | M | uniform | ResNet-50 | O, OH, IC | 90.2, 89.4, 74.1 |
| [rakshit2019unsupervised] | unshared | discrepancy | distance | pairwise all domains | CT loss | 1 | – | ResNet-50 | O, OC, IC | 88.3, 97.5, 91.2 |
| [peng2019moment] | shared | discrepancy | moment distance | pairwise all domains | L1 loss | M | relative error | LeNet-5 / ResNet-101 / ResNet-101 | D / OC / DN | 87.7 / 96.4 / 42.6 |
| [guo2020multi] | shared | discrepancy | mixture distance | target and each source | CT loss | 1 | – | BiLSTM | MR | 79.3 |
| [xu2018deep] | shared | discriminator | GAN loss | target and each source | – | M | perplexity score | AlexNet | D, O, IC | 74.2, 83.8, 80.8 |
| [li2018extracting] | shared | discriminator | Wasserstein | pairwise all domains | CT loss | 1 | – | AlexNet | D | 79.9 |
| [zhao2018adversarial] | shared | discriminator | H-divergence | target and each source | CT loss | 1 | – | BiLSTM / AlexNet / FCN8s | AR / D / W | 82.7 / 76.6 / 1.4 |
| [wang2019tmda] | shared | discriminator | Wasserstein | pairwise all domains | CT loss | 1 | – | BiLSTM / AlexNet | AR / D | 84.5 / 83.4 |
| [zhao2020multi] | unshared | discriminator | Wasserstein | target and each source | – | M | Wasserstein | LeNet-5 / AlexNet | D / O | 88.1 / 84.2 |

Table 2: Comparison of different latent space transformation methods for MDA, where '#C', 'CT loss', and 'MMD' are short for the number of classifiers at inference time ($M$ is the number of source domains), combined task loss, and maximum mean discrepancy, respectively; '–' marks entries that do not apply. 'Result' is the average performance over all target domains, measured by accuracy for classification and by counting error for vehicle counting.

4 Deep Multi-source Domain Adaptation

Existing deep MDA methods primarily consider the unsupervised, homogeneous, closed set, strongly supervised, single-target setting with target data available during training. That is, there is one target domain; the target data is unlabeled but available during training; the source data is fully labeled; the source and target data lie in the same observation space; and the label sets of all sources and the target are identical. In this paper, we focus on MDA methods under these settings.

There is a body of theoretical analysis supporting existing MDA algorithms. Most theories build on the seminal theoretical model of [blitzer2008learning, ben2010theory]. Mansour et al. [mansour2009mda] assumed that the target distribution can be approximated by a mixture of the source distributions; therefore, weighted combinations of source classifiers have been widely employed for MDA. Moreover, tighter cross-domain generalization bounds and more accurate measurements of domain discrepancy provide intuitions for deriving effective MDA algorithms. Hoffman et al. [hoffman2018algorithms] derived a novel bound using DC-programming and calculated more accurate combination weights. Zhao et al. [zhao2018adversarial] extended the generalization bound of the seminal theoretical model to multiple sources under both classification and regression settings. Besides the domain discrepancy between the target and each source [hoffman2018algorithms, zhao2018adversarial], Li et al. [li2018extracting] also considered the relationship between pairwise sources and derived a tighter bound on weighted multi-source discrepancy, based on which more relevant source domains can be picked out.
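
For intuition, the seminal single-source bound [ben2010theory] that these analyses extend can be written as follows; the multi-source variants replace the single source-error term with a weighted mixture over the $M$ sources (the statement below is the standard form, not any one MDA paper's exact bound):

```latex
% For any hypothesis h in a class H (Ben-David et al., 2010):
\epsilon_T(h) \;\le\; \epsilon_S(h)
  \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)
  \;+\; \lambda^{*},
\qquad
\lambda^{*} \;=\; \min_{h' \in \mathcal{H}} \big[\epsilon_S(h') + \epsilon_T(h')\big].
```

Multi-source extensions such as [zhao2018adversarial] bound the target error $\epsilon_T(h)$ by a weighted combination $\sum_{i=1}^{M} \alpha_i \epsilon_i(h)$ plus corresponding divergence terms, which directly motivates the weighted source-combination strategies discussed below.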

Typically, task models (e.g., classifiers) are learned on the labeled source data with corresponding task losses, such as cross-entropy for classification. Meanwhile, specific alignments among the source and target domains are performed to bridge the domain shift so that the learned task models transfer better to the target domain. Based on the alignment strategy, we can classify MDA methods into different categories. Latent space transformation aligns the latent spaces (e.g., features) of different domains by optimizing a discrepancy loss or an adversarial loss. Intermediate domain generation explicitly generates, for each source, an adapted intermediate domain that is indistinguishable from the target domain; the task models are then trained on the adapted domains. Figure 3 summarizes the common overall framework of existing MDA methods.

4.1 Latent Space Transformation

The two common methods for aligning the latent spaces of different domains are discrepancy-based methods and adversarial methods. We discuss these two methods below, and Table 2 summarizes key examples of each method.

Discrepancy-based methods explicitly measure the discrepancy of the latent spaces (typically features) of different domains by optimizing specific discrepancy losses, such as maximum mean discrepancy (MMD) [guo2018multi, zhu2019aligning], Rényi-divergence [hoffman2018algorithms], distance-based losses [rakshit2019unsupervised], and moment distance [peng2019moment]. Guo et al. [guo2020multi] claimed that different discrepancies or distances can only provide specific estimates of domain similarity and that each distance has its pathological cases; therefore, they consider a mixture of several distances, including $L_2$ distance, Cosine distance, MMD, Fisher linear discriminant, and Correlation alignment [guo2020multi]. Minimizing the discrepancy to align the features of the source and target domains does not introduce any new parameters that must be learned.

Adversarial methods align the features by making them indistinguishable to a domain discriminator. Representative optimization objectives include the GAN loss [xu2018deep], $\mathcal{H}$-divergence [zhao2018adversarial], and Wasserstein distance [li2018extracting, wang2019tmda, zhao2020multi]. These methods aim to confuse the discriminator so that it cannot tell which domain the features were drawn from. Compared with the GAN loss and $\mathcal{H}$-divergence, the Wasserstein distance provides more stable gradients even when the target and source distributions do not overlap [zhao2020multi]. The discriminator is typically implemented as a network, which introduces new parameters that must be learned.
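
As a concrete illustration of the adversarial route, below is a minimal DANN-style gradient-reversal sketch in PyTorch, extended to multiple sources with one binary domain discriminator per source-target pair. The discriminator sizes, the per-pair layout, and all names here are illustrative assumptions rather than a specific paper's architecture:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by -lam on
    backward, so the feature extractor learns to confuse the discriminator."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# One binary domain discriminator per (source_i, target) pair.
num_sources, feat_dim = 3, 256
discriminators = nn.ModuleList(
    nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
    for _ in range(num_sources)
)

def adversarial_alignment_loss(source_feat_list, target_feats, lam=1.0):
    """Sum of per-pair domain-confusion losses (BCE with logits)."""
    bce = nn.BCEWithLogitsLoss()
    loss = 0.0
    for i, f_s in enumerate(source_feat_list):
        logits_s = discriminators[i](grad_reverse(f_s, lam))
        logits_t = discriminators[i](grad_reverse(target_feats, lam))
        loss = loss + bce(logits_s, torch.ones_like(logits_s)) \
                    + bce(logits_t, torch.zeros_like(logits_t))
    return loss
```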

There are many modular implementation details in both types of methods, such as how to align the target and the multiple sources, whether the feature extractors are shared, how to select the most relevant sources, and how to combine the multiple predictions from different classifiers.

Alignment domains. There are different ways to align the target and multiple sources. The most common is to pairwise align the target with each source [xu2018deep, guo2018multi, zhao2018adversarial, hoffman2018algorithms, zhu2019aligning, zhao2020multi, guo2020multi]. Since domain shift also exists among the sources, several methods instead enforce pairwise alignment among all domains, sources included [li2018extracting, rakshit2019unsupervised, peng2019moment, wang2019tmda].

| Reference | Domain generator | Feature extractor | Pixel alignment domains | Feature alignment loss | Feature alignment domains | #C | Classifier weight | Task backbone | Dataset | Task | Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [russo2019towards] | CoGAN | shared | target and each source | GAN loss | target and each source | M | uniform | DeepLabV2 | S2R-CS | seg | 42.8 |
| [zhao2019multi] | CycleGAN | shared | target and aggregated source | GAN loss | target and each source | 1 | – | FCN8s | S2R-CS / S2R-BDDS | seg | 41.4 / 36.3 |
| [lin2020multi] | VAE+CycleGAN | unshared | target and combined source | – | – | 1 | – | ResNet-18 | SI | cls | 68.1 |

Table 3: Comparison of different intermediate domain generation methods for MDA, where '#C', 'seg', and 'cls' are short for the number of classifiers at inference time ($M$ is the number of source domains), segmentation, and classification, respectively; '–' marks entries that do not apply. 'Result' is the average performance over all target domains, measured by accuracy for classification and mean intersection-over-union (mIoU) for segmentation.

Weight sharing of feature extractor. Most methods employ shared feature extractors to learn domain-invariant features. However, domain invariance may be detrimental to discriminative power. In contrast, Rakshit et al. [rakshit2019unsupervised] adopted one feature extractor with unshared weights for each source-target pair, while Zhao et al. [zhao2020multi] first pre-trained one feature extractor for each source and then mapped the target into the feature space of each source. Correspondingly, the number of feature extractors scales with the number of sources $M$. Although unshared feature extractors can better align the target and sources in the latent space, they substantially increase the number of parameters in the model.

Classifier alignment. Intuitively, classifiers trained on different sources may produce misaligned predictions for target samples that are close to the domain boundary. By minimizing a specific classifier discrepancy, such as an $L_1$ loss [zhu2019aligning, peng2019moment], the classifiers are better aligned, which helps learn a generalized classification boundary for such target samples. Instead of explicitly training one classifier for each source, many methods train a compound classifier based on a specific combined task loss, such as normalized activations [mancini2018boosting] or a bandit controller [guo2020multi].
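
As a hedged sketch of such a classifier-discrepancy term, the following assumes one classifier head per source and penalizes the mean pairwise $L_1$ distance between their class-probability outputs on unlabeled target features (in the spirit of the $L_1$ alignment losses cited above, not an exact reproduction of either method):

```python
import itertools
import torch
import torch.nn.functional as F

def classifier_l1_discrepancy(classifier_list, target_feats):
    """Mean pairwise L1 distance between the class-probability outputs
    of per-source classifier heads on target features (>= 2 heads)."""
    probs = [F.softmax(clf(target_feats), dim=1) for clf in classifier_list]
    pairs = list(itertools.combinations(probs, 2))
    return sum((p - q).abs().mean() for p, q in pairs) / len(pairs)
```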

Target prediction. After aligning the features of the target and source domains in the latent space, the classifiers trained on the labeled source samples can be used to predict the labels of target samples. Since there are multiple sources, the classifiers may yield conflicting target predictions. One way to reconcile them is to uniformly average the predictions from the different source classifiers [zhu2019aligning]. However, different sources have different relationships with the target, e.g., one source may align better with the target, so a non-uniform, weighted average of the predictions often yields better results. Weighting strategies, also known as source selection, include uniform weights [zhu2019aligning], perplexity scores based on adversarial loss [xu2018deep], a point-to-set (PoS) metric using Mahalanobis distance [guo2018multi], relative error based on source-only accuracy [peng2019moment], and Wasserstein-distance-based weights [zhao2020multi].
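
The weighted combination itself is straightforward; here is a small sketch where the per-source scores may come from any of the strategies above and are normalized to sum to one (names and shapes are illustrative):

```python
import torch

def weighted_target_prediction(classifier_list, source_scores, target_feats):
    """Combine per-source classifier outputs with normalized weights."""
    w = torch.tensor(source_scores, dtype=torch.float)
    w = w / w.sum()  # normalize source-importance scores to a distribution
    probs = torch.stack(
        [torch.softmax(clf(target_feats), dim=1) for clf in classifier_list]
    )  # shape: (num_sources, batch, num_classes)
    combined = (w.view(-1, 1, 1) * probs).sum(dim=0)
    return combined.argmax(dim=1)
```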

Besides source importance, Zhao et al. [zhao2020multi] also considered sample importance, i.e. different samples from the same source may have different similarities to the target samples. The source samples that are closer to the target are distilled (based on a manually selected Wasserstein distance threshold) to fine-tune the source classifiers. Automatically and adaptively selecting the most relevant training samples for each source remains an open research problem.

4.2 Intermediate Domain Generation

Feature-level alignment only aligns high-level information, which is insufficient for fine-grained predictions, such as pixel-wise semantic segmentation [zhao2019multi]. Generating an intermediate adapted domain with pixel-level alignment, typically via GANs [goodfellow2014generative], can help address this problem.

Domain generator. Since the original GAN is highly under-constrained, improved variants are employed, such as Coupled GAN (CoGAN) in [russo2019towards] and CycleGAN in MADAN [zhao2019multi]. Instead of directly feeding the original source data to the generator [russo2019towards, zhao2019multi], Lin et al. [lin2020multi] used a variational autoencoder to map all source and target domains to a latent space and then generated an adapted domain from that latent space. Russo et al. [russo2019towards] then aligned the target with each adapted domain, while Lin et al. [lin2020multi] aligned the target with the combined adapted domain generated from the latent space. Zhao et al. [zhao2019multi] proposed to aggregate the different adapted domains using a sub-domain aggregation discriminator and a cross-domain cycle discriminator, with pixel-level alignment then conducted between the aggregated and target domains. Both [zhao2019multi] and [lin2020multi] showed that semantics may change during intermediate domain generation, and that enforcing semantic consistency before and after generation helps preserve the labels.
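
The semantic-consistency idea can be sketched as follows: a classifier pretrained on labeled source data and then frozen should produce (approximately) the same predictions before and after the source-to-target generator is applied. The KL-divergence form below is one plausible instantiation, not the exact loss used in [zhao2019multi] or [lin2020multi]:

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(frozen_clf, generator, source_images):
    """Penalize label drift introduced by the source-to-target generator;
    assuming frozen_clf's parameters are frozen, only the generator
    receives gradients."""
    with torch.no_grad():
        ref = F.softmax(frozen_clf(source_images), dim=1)
    adapted = generator(source_images)
    log_probs = F.log_softmax(frozen_clf(adapted), dim=1)
    return F.kl_div(log_probs, ref, reduction="batchmean")
```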

Feature alignment and target prediction. Feature-level alignment is often considered jointly with pixel-level alignment; both are usually achieved by minimizing a GAN loss with a discriminator. In [russo2019towards], one classifier is trained on each adapted domain and the multiple predictions for a given target sample are averaged. Alternatively, a single classifier is trained on the aggregated domain [zhao2019multi] or on the combined adapted domain [lin2020multi], which is obtained from the latent space by a single generator shared across all source domains. These methods are compared in Table 3.

5 Conclusion and Future Directions

In this paper, we provided a survey of recent MDA developments in the deep learning era. We motivated MDA, defined different MDA strategies, and summarized the datasets that are commonly used for performing MDA evaluation. Our survey focused on a typical MDA setting, i.e. unsupervised, homogeneous, closed set, and one target MDA. We classified these methods into different categories, and compared the representative ones technically and experimentally. We conclude with several open research directions:

Specific MDA strategy implementation. As introduced in Section 2, there are many types of MDA strategies, and implementing an MDA strategy according to the specific problem requirement would likely yield better results than a one-size-fits-all MDA approach. Further investigation is needed to determine which MDA strategies work the best for which types of problems. Also, real-world applications may have a small amount of labeled target data; determining how to include this data and what fraction of this data is needed for a certain performance remains an open question.

Multi-modal MDA. The labeled source data may come in different modalities, such as LiDAR, radar, and images. Further research is needed on techniques for fusing different data modalities in MDA. A further extension of this idea is to have varied modalities in different sources, as well as partially labeled, multi-modal sources.

Incremental and online MDA. Designing incremental and online MDA algorithms remains largely unexplored and may provide great benefit for real-world scenarios, such as updating deployed MDA models when new source or target data becomes available.

References