Multi-Adversarial Domain Adaptation

09/04/2018 ∙ by Zhongyi Pei, et al. ∙ Tsinghua University

Recent advances in deep domain adaptation reveal that adversarial learning can be embedded into deep networks to learn transferable features that reduce distribution discrepancy between the source and target domains. Existing domain adversarial adaptation methods based on a single domain discriminator only align the source and target data distributions without exploiting the complex multimode structures. In this paper, we present a multi-adversarial domain adaptation (MADA) approach, which captures multimode structures to enable fine-grained alignment of different data distributions based on multiple domain discriminators. The adaptation can be achieved by stochastic gradient descent with the gradients computed by back-propagation in linear time. Empirical evidence demonstrates that the proposed model outperforms state-of-the-art methods on standard domain adaptation datasets.





Code Repositories


Code release for "Multi-Adversarial Domain Adaptation" (AAAI 2018)



Introduction

Deep networks, when trained on large-scale labeled datasets, can learn transferable representations which are generically useful across diverse tasks and application domains [Donahue et al.2014, Yosinski et al.2014]. However, due to a phenomenon known as dataset bias or domain shift [Torralba and Efros2011], predictive models trained with these deep representations on one large dataset do not generalize well to novel datasets and tasks. The typical solution is to further fine-tune these networks on task-specific datasets; however, it is often prohibitively expensive to collect enough labeled data to properly fine-tune the high-capacity deep networks. Hence, there is strong motivation to establish effective algorithms that reduce the labeling cost by leveraging readily available labeled data from a different but related source domain. This promising transfer learning paradigm, however, suffers from the shift in data distributions across different domains, which poses a major obstacle in adapting classification models to target tasks [Pan and Yang2010].

Existing transfer learning methods assume a shared label space and different feature distributions across the source and target domains. These methods bridge different domains by learning domain-invariant feature representations without using target labels, so that the classifier learned from the source domain can be directly applied to the target domain. Recent studies have revealed that deep neural networks can learn more transferable features for domain adaptation [Donahue et al.2014, Yosinski et al.2014], by disentangling explanatory factors of variations behind domains. The latest advances have been achieved by embedding domain adaptation modules in the pipeline of deep feature learning to extract domain-invariant representations [Tzeng et al.2014, Long et al.2015, Ganin and Lempitsky2015, Tzeng et al.2015, Long et al.2016, Bousmalis et al.2016, Long et al.2017].

Figure 1: The difficulty of domain adaptation: discriminative structures may be mixed up or falsely aligned across domains. As an intuitive example, in this figure, the source class cat is falsely aligned with target class dog, making final classification wrong.

Recently, adversarial learning has been successfully embedded into deep networks to learn transferable features that reduce distribution discrepancy between the source and target domains. Domain adversarial adaptation methods [Ganin and Lempitsky2015, Tzeng et al.2015] are among the top-performing deep architectures. These methods mainly align the whole source and target distributions, without considering the complex multimode structures underlying the data distributions. As a result, not only will all data from the source and target domains be confused, but the discriminative structures may also be mixed up, leading to false alignment of the corresponding discriminative structures of different distributions; an intuitive example is shown in Figure 1. Hence, matching the whole source and target domains as previous methods do, without exploiting the discriminative structures, may not work well for diverse domain adaptation scenarios.

There are two technical challenges to enabling domain adaptation: (1) enhancing positive transfer by maximally matching the multimode structures underlying data distributions across domains, and (2) alleviating negative transfer by preventing false alignment of modes in different distributions across domains. Motivated by these challenges, we present a multi-adversarial domain adaptation (MADA) approach, which captures multimode structures to enable fine-grained alignment of different data distributions based on multiple domain discriminators. A key improvement over previous methods is the capability to simultaneously promote positive transfer of relevant data and alleviate negative transfer of irrelevant data. The adaptation can be achieved by stochastic gradient descent with the gradients computed by back-propagation in linear time. Empirical evidence demonstrates that the proposed MADA approach outperforms state-of-the-art methods on standard domain adaptation benchmarks.

Related Work

Transfer learning [Pan and Yang2010] bridges different domains or tasks to mitigate the burden of manual labeling for machine learning [Pan et al.2011, Duan, Tsang, and Xu2012, Zhang et al.2013, Wang and Schneider2014], computer vision [Saenko et al.2010, Gong et al.2012, Hoffman et al.2014] and natural language processing [Collobert et al.2011]. The main technical difficulty of transfer learning is to formally reduce the distribution discrepancy across different domains. Deep networks can learn abstract representations that disentangle different explanatory factors of variations behind data [Bengio, Courville, and Vincent2013] and manifest invariant factors underlying different populations that transfer well from original tasks to similar novel tasks [Yosinski et al.2014]. Thus deep networks have been explored for transfer learning [Glorot, Bordes, and Bengio2011, Oquab et al.2013, Hoffman et al.2014] and for multimodal and multi-task learning [Collobert et al.2011, Ngiam et al.2011], where significant performance gains have been witnessed relative to prior shallow transfer learning methods.

However, recent advances show that deep networks can learn abstract feature representations that only reduce, but do not remove, the cross-domain discrepancy [Glorot, Bordes, and Bengio2011, Tzeng et al.2014], resulting in unbounded risk for target tasks [Mansour, Mohri, and Rostamizadeh2009, Ben-David et al.2010]. Some recent work bridges deep learning and domain adaptation [Tzeng et al.2014, Long et al.2015, Ganin and Lempitsky2015, Tzeng et al.2015, Long et al.2016, Bousmalis et al.2016, Long et al.2017], extending deep convolutional networks (CNNs) to domain adaptation either by adding adaptation layers through which the mean embeddings of distributions are matched [Tzeng et al.2014, Long et al.2015, Long et al.2016], or by adding a subnetwork as a domain discriminator while the deep features are learned to confuse the discriminator in a domain-adversarial training paradigm [Ganin and Lempitsky2015, Tzeng et al.2015]. While performance was significantly improved, these state-of-the-art methods may be restricted by the fact that neither the discriminative structures nor the complex multimode structures are exploited for fine-grained alignment of different distributions.

Adversarial learning has been explored for generative modeling in Generative Adversarial Networks (GANs) [Goodfellow et al.2014]. Recently, several difficulties of GANs have been addressed, e.g. easing training [Arjovsky, Chintala, and Bottou2017, Arjovsky and Bottou2017] and avoiding mode collapse [Mirza and Osindero2014, Che et al.2017, Metz et al.2017]. In particular, the Generative Multi-Adversarial Network (GMAN) [Durugkar, Gemp, and Mahadevan2017] extends GANs to multiple discriminators, including a formidable adversary and a forgiving teacher, which significantly eases model training and enhances distribution matching.

Figure 2: The architecture of the proposed Multi-Adversarial Domain Adaptation (MADA) approach, where $f$ is the extracted deep features, $\hat{y}$ is the predicted data label, and $\hat{d}$ is the predicted domain label; $G_f$ is the feature extractor, $G_y$ and $L_y$ are the label predictor and its loss, $G_d^k$ and $L_d^k$ are the domain discriminator and its loss; GRL stands for Gradient Reversal Layer. The blue part shows the multiple adversarial networks (each for a class, $K$ in total) crafted in this paper. Best viewed in color.

Multi-Adversarial Domain Adaptation

In unsupervised domain adaptation, we are given a source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ of $n_s$ labeled examples and a target domain $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$ of $n_t$ unlabeled examples. The source domain and target domain are sampled from joint distributions $P(X^s, Y^s)$ and $Q(X^t, Y^t)$ respectively, and note that $P \neq Q$. The goal of this paper is to design a deep neural network that enables learning of transferable features and an adaptive classifier to reduce the shifts in the joint distributions across domains, such that the target risk is minimized by jointly minimizing the source risk and the distribution discrepancy through multi-adversarial domain adaptation.

There are two technical challenges to enabling domain adaptation: (1) enhancing positive transfer by maximally matching the multimode structures underlying the source and target data distributions $P$ and $Q$ across domains, and (2) alleviating negative transfer by preventing false alignment of different distribution modes across domains. These two challenges motivate the multi-adversarial domain adaptation approach.

Domain Adversarial Network

Domain adversarial networks have been successfully applied to transfer learning [Ganin and Lempitsky2015, Tzeng et al.2015] by extracting transferable features that can reduce the distribution shift between the source domain and the target domain. The adversarial learning procedure is a two-player game, where the first player is the domain discriminator trained to distinguish the source domain from the target domain, and the second player is the feature extractor fine-tuned simultaneously to confuse the domain discriminator.

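To extract domain-invariant features $f$, the parameters $\theta_f$ of the feature extractor $G_f$ are learned by maximizing the loss of the domain discriminator $G_d$, while the parameters $\theta_d$ of the domain discriminator $G_d$ are learned by minimizing the loss of the domain discriminator. In addition, the loss of the label predictor $G_y$ is also minimized. The objective of the domain adversarial network [Ganin and Lempitsky2015] is the functional:

$$C_0(\theta_f, \theta_y, \theta_d) = \frac{1}{n_s} \sum_{x_i \in \mathcal{D}_s} L_y(G_y(G_f(x_i)), y_i) - \frac{\lambda}{n} \sum_{x_i \in (\mathcal{D}_s \cup \mathcal{D}_t)} L_d(G_d(G_f(x_i)), d_i), \tag{1}$$

where $\mathcal{D}_s$ and $\mathcal{D}_t$ are the source and target domains with $n_s$ and $n_t$ examples, $n = n_s + n_t$, $d_i$ is the domain label of $x_i$, and $\lambda$ is a trade-off parameter between the two objectives that shape the features during learning. After training convergence, the parameters $\hat{\theta}_f$, $\hat{\theta}_y$, $\hat{\theta}_d$ will deliver a saddle point of the functional (1):

$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} C_0(\theta_f, \theta_y, \theta_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} C_0(\theta_f, \theta_y, \theta_d). \tag{2}$$
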
Domain adversarial networks [Ganin and Lempitsky2015, Tzeng et al.2015] are the top-performing architectures for standard domain adaptation when the distributions of the source domain and target domain can be aligned successfully.
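
As a rough illustration of the single-discriminator objective described above, the following minimal sketch evaluates the label loss minus the weighted domain loss on a toy batch and shows the backward rule of a gradient reversal layer. It is not the authors' implementation: the function names, the convention that domain label 1 marks source and 0 marks target, and the toy numbers are all assumptions made for this example.

```python
import math


def cross_entropy(probs, label):
    """Label loss: negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def domain_loss(p_source, d):
    """Binary cross-entropy domain loss; d = 1 marks source, d = 0 target (assumed)."""
    return -(d * math.log(p_source) + (1 - d) * math.log(1.0 - p_source))


def adversarial_objective(label_probs, labels, domain_probs, domain_labels, lam):
    """Mean label loss on source minus lam times mean domain loss on source and target."""
    n_s, n = len(labels), len(domain_labels)
    label_term = sum(cross_entropy(p, y) for p, y in zip(label_probs, labels)) / n_s
    domain_term = sum(domain_loss(p, d) for p, d in zip(domain_probs, domain_labels)) / n
    return label_term - lam * domain_term


def reverse_gradient(grad, lam):
    """Backward rule of a gradient reversal layer: identity forward, -lam * grad backward."""
    return [-lam * g for g in grad]


# Toy batch: two labeled source points plus two unlabeled target points.
label_probs = [[0.9, 0.1], [0.2, 0.8]]  # label predictor outputs on the source points
labels = [0, 1]
domain_probs = [0.5, 0.5, 0.5, 0.5]     # a fully confused domain discriminator
domain_labels = [1, 1, 0, 0]
c0 = adversarial_objective(label_probs, labels, domain_probs, domain_labels, lam=1.0)
```

The feature extractor and label predictor try to lower this value while the discriminator tries to raise it; the reversal rule lets one backward pass serve both sides by flipping the sign of the domain-loss gradient before it reaches the feature extractor.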

Multi-Adversarial Domain Adaptation

In practical domain adaptation problems, however, the data distributions of the source domain and target domain usually embody complex multimode structures, reflecting either the class boundaries in supervised learning or the cluster boundaries in unsupervised learning. Thus, previous domain adversarial adaptation methods that only match the data distributions without exploiting the multimode structures may be prone to either under transfer or negative transfer. Under transfer may happen when different modes of the distributions cannot be maximally matched. Negative transfer may happen when the corresponding modes of the distributions across domains are falsely aligned. To promote positive transfer and combat negative transfer, we should find a technology to reveal the multimode structures underlying distributions on which multi-adversarial domain adaptation can be performed.

To match the source and target domains upon the multimode structures underlying data distributions, we notice that the labeled information of the source domain provides strong signals to reveal the multimode structures. Therefore, we split the domain discriminator $G_d$ in Equation (1) into $K$ class-wise domain discriminators $G_d^k$, $k = 1, \ldots, K$, each responsible for matching the source and target domain data associated with class $k$, as shown in Figure 2. Since target domain data are fully unlabeled, it is not easy to decide which domain discriminator $G_d^k$ is responsible for each target data point. Fortunately, we observe that the output of the label predictor $\hat{y}_i = G_y(G_f(x_i))$ for each data point $x_i$ is a probability distribution over the label space of $K$ classes. This distribution characterizes well the probability of assigning $x_i$ to each of the $K$ classes. Thus, it is natural to use $\hat{y}_i$ as the probability indicating how much each data point $x_i$ should attend to the $K$ domain discriminators $G_d^k$, $k = 1, \ldots, K$. The attention of each point $x_i$ to a domain discriminator $G_d^k$ can be modeled by weighting its features $G_f(x_i)$ with probability $\hat{y}_i^k$. Applying this to all $K$ domain discriminators yields

$$L_d = \frac{1}{n} \sum_{k=1}^{K} \sum_{x_i \in \mathcal{D}_s \cup \mathcal{D}_t} L_d^k(G_d^k(\hat{y}_i^k G_f(x_i)), d_i), \tag{3}$$

where $G_d^k$ is the $k$-th domain discriminator while $L_d^k$ is its cross-entropy loss, $d_i$ is the domain label of point $x_i$, and $n$ is the total number of source and target examples. We note that the above strategy shares similar ideas with the attention mechanism.
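
The probability-weighted, class-wise discriminator loss can be sketched in a few lines. This is only an illustration under stated assumptions: each discriminator is reduced to a bias-free logistic unit with a hypothetical weight vector, features are plain Python lists, and domain label 1 marks source while 0 marks target.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, d):
    """Binary cross-entropy between predicted source probability p and domain label d."""
    return -(d * math.log(p) + (1 - d) * math.log(1.0 - p))


def multi_adversarial_loss(features, class_probs, domain_labels, disc_weights):
    """Average over points of the summed losses of K class-wise discriminators.

    Discriminator k sees the features of point i scaled by class_probs[i][k].
    """
    n = len(features)
    total = 0.0
    for k, w_k in enumerate(disc_weights):
        for f_i, y_hat_i, d_i in zip(features, class_probs, domain_labels):
            weighted = [y_hat_i[k] * v for v in f_i]            # soft attention to class k
            logit = sum(w * v for w, v in zip(w_k, weighted))   # linear discriminator (assumed)
            total += bce(sigmoid(logit), d_i)
    return total / n


features = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
class_probs = [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]]  # label predictor outputs, K = 2
domain_labels = [1, 1, 0]
untrained = multi_adversarial_loss(features, class_probs, domain_labels, [[0.0, 0.0], [0.0, 0.0]])

# A point assigned entirely to class 0 reaches discriminator 1 as a zero vector.
one_hot = multi_adversarial_loss([[1.0, 1.0]], [[1.0, 0.0]], [1], [[1.0, 0.0], [5.0, 5.0]])
```

In this bias-free sketch, a point whose predicted probability for class k is zero always produces a logit of zero at discriminator k, so it contributes a constant loss and no gradient to that discriminator's weights. That is one way to read the claim that irrelevant classes are filtered out by the probability weighting.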

Compared with the previous single-discriminator domain adversarial network in Equation (1), the proposed multi-adversarial domain adaptation network enables fine-grained adaptation where each data point $x_i$ is matched only by those relevant domain discriminators according to its predicted class probability $\hat{y}_i$. This fine-grained adaptation may introduce three benefits. (1) It avoids the hard assignment of each point to only one domain discriminator, which tends to be inaccurate for target domain data. (2) It circumvents negative transfer since each point is only aligned to the most relevant classes, while the irrelevant classes are filtered out by the probability and will not be included in the corresponding domain discriminators, hence avoiding false alignment of the discriminative structures in different distributions. (3) The multiple domain discriminators are trained with probability-weighted data points $\hat{y}_i^k G_f(x_i)$, which naturally yields multiple domain discriminators with different parameters $\theta_d^k$; these domain discriminators with different parameters promote positive transfer for each instance.

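Putting everything together, the objective of Multi-Adversarial Domain Adaptation (MADA) is

$$C(\theta_f, \theta_y, \theta_d^k|_{k=1}^{K}) = \frac{1}{n_s} \sum_{x_i \in \mathcal{D}_s} L_y(G_y(G_f(x_i)), y_i) - \frac{\lambda}{n} \sum_{k=1}^{K} \sum_{x_i \in \mathcal{D}} L_d^k(G_d^k(\hat{y}_i^k G_f(x_i)), d_i), \tag{4}$$

where $n = n_s + n_t$, $\mathcal{D} = \mathcal{D}_s \cup \mathcal{D}_t$, and $\lambda$ is a hyper-parameter that trades off the two objectives in the unified optimization problem. The optimization problem is to find the parameters $\hat{\theta}_f$, $\hat{\theta}_y$ and $\hat{\theta}_d^k$ ($k = 1, \ldots, K$) that jointly satisfy

$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} C(\theta_f, \theta_y, \theta_d^k|_{k=1}^{K}), \qquad (\hat{\theta}_d^1, \ldots, \hat{\theta}_d^K) = \arg\max_{\theta_d^1, \ldots, \theta_d^K} C(\theta_f, \theta_y, \theta_d^k|_{k=1}^{K}). \tag{5}$$
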
The multi-adversarial domain adaptation (MADA) model simultaneously enhances positive transfer by maximally matching the multimode structures underlying data distributions across domains, and circumvents negative transfer by avoiding false alignment of the distribution modes across domains.

Experiments


We evaluate the proposed multi-adversarial domain adaptation (MADA) model against state-of-the-art transfer learning and deep learning methods. The codes, datasets and configurations will be available online at

| Method | A → W | D → W | W → D | A → D | D → A | W → A | Avg |
|---|---|---|---|---|---|---|---|
| AlexNet [Krizhevsky, Sutskever, and Hinton2012] | 60.6±0.4 | 95.4±0.2 | 99.0±0.1 | 64.2±0.3 | 45.5±0.5 | 48.3±0.5 | 68.8 |
| TCA [Pan et al.2011] | 59.0±0.0 | 90.2±0.0 | 88.2±0.0 | 57.8±0.0 | 51.6±0.0 | 47.9±0.0 | 65.8 |
| GFK [Gong et al.2012] | 58.4±0.0 | 93.6±0.0 | 91.0±0.0 | 58.6±0.0 | 52.4±0.0 | 46.1±0.0 | 66.7 |
| DDC [Tzeng et al.2014] | 61.0±0.5 | 95.0±0.3 | 98.5±0.3 | 64.9±0.4 | 47.2±0.5 | 49.4±0.4 | 69.3 |
| DAN [Long et al.2015] | 68.5±0.3 | 96.0±0.1 | 99.0±0.1 | 66.8±0.2 | 50.0±0.4 | 49.8±0.3 | 71.7 |
| RTN [Long et al.2016] | 73.3±0.2 | 96.8±0.2 | 99.6±0.1 | 71.0±0.2 | 50.5±0.3 | 51.0±0.1 | 73.7 |
| RevGrad [Ganin and Lempitsky2015] | 73.0±0.5 | 96.4±0.3 | 99.2±0.3 | 72.3±0.3 | 52.4±0.4 | 50.4±0.5 | 74.1 |
| MADA | 78.5±0.2 | 99.8±0.1 | 100.0±0.0 | 74.1±0.1 | 56.0±0.2 | 54.5±0.3 | 77.1 |
| ResNet [He et al.2016] | 68.4±0.2 | 96.7±0.1 | 99.3±0.1 | 68.9±0.2 | 62.5±0.3 | 60.7±0.3 | 76.1 |
| TCA [Pan et al.2011] | 74.7±0.0 | 96.7±0.0 | 99.6±0.0 | 76.1±0.0 | 63.7±0.0 | 62.9±0.0 | 79.3 |
| GFK [Gong et al.2012] | 74.8±0.0 | 95.0±0.0 | 98.2±0.0 | 76.5±0.0 | 65.4±0.0 | 63.0±0.0 | 78.8 |
| DDC [Tzeng et al.2014] | 75.8±0.2 | 95.0±0.2 | 98.2±0.1 | 77.5±0.3 | 67.4±0.4 | 64.0±0.5 | 79.7 |
| DAN [Long et al.2015] | 83.8±0.4 | 96.8±0.2 | 99.5±0.1 | 78.4±0.2 | 66.7±0.3 | 62.7±0.2 | 81.3 |
| RTN [Long et al.2016] | 84.5±0.2 | 96.8±0.1 | 99.4±0.1 | 77.5±0.3 | 66.2±0.2 | 64.8±0.3 | 81.6 |
| RevGrad [Ganin and Lempitsky2015] | 82.0±0.4 | 96.9±0.2 | 99.1±0.1 | 79.7±0.4 | 68.2±0.4 | 67.4±0.5 | 82.2 |
| MADA | 90.0±0.1 | 97.4±0.1 | 99.6±0.1 | 87.8±0.2 | 70.3±0.3 | 66.4±0.3 | 85.2 |

Table 1: Accuracy (%) on Office-31 for unsupervised domain adaptation (AlexNet and ResNet). The upper block of rows uses AlexNet as the base network and the lower block uses ResNet.


Office-31 [Saenko et al.2010] is a standard benchmark for visual domain adaptation, comprising 4,652 images and 31 categories collected from three distinct domains: Amazon (A), which contains images downloaded from amazon.com, and Webcam (W) and DSLR (D), which contain images taken by a web camera and a digital SLR camera, respectively, in different environments. We evaluate all methods across three transfer tasks A → W, D → W and W → D, which are widely used by previous deep transfer learning methods [Tzeng et al.2014, Ganin and Lempitsky2015], and another three transfer tasks A → D, D → A and W → A as used in [Long et al.2015, Tzeng et al.2015, Long et al.2016].

ImageCLEF-DA is a benchmark dataset for the ImageCLEF 2014 domain adaptation challenge, which is organized by selecting the 12 common categories shared by the following three public datasets, each considered as a domain: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). The 12 common categories are aeroplane, bike, bird, boat, bottle, bus, car, dog, horse, monitor, motorbike, and people. There are 50 images in each category and 600 images in each domain. We use all domain combinations and build 6 transfer tasks: I → P, P → I, I → C, C → I, C → P, and P → C. Unlike the Office-31 dataset, where different domains are of different sizes, the three domains in this dataset are of equal size, making it a good complementary benchmark.

We compare the proposed multi-adversarial domain adaptation (MADA) with both shallow and deep transfer learning methods: Transfer Component Analysis (TCA) [Pan et al.2011], Geodesic Flow Kernel (GFK) [Gong et al.2012], Deep Domain Confusion (DDC) [Tzeng et al.2014], Deep Adaptation Network (DAN) [Long et al.2015], Residual Transfer Network (RTN) [Long et al.2016], and Reverse Gradient (RevGrad) [Ganin and Lempitsky2015]. TCA learns a shared feature space by Kernel PCA with a linear-MMD penalty. GFK interpolates across an infinite number of intermediate subspaces to bridge the source and target subspaces. For these shallow transfer methods, we adopt SVM as the base classifier. DDC maximizes domain confusion by adding to deep networks a single adaptation layer that is regularized by linear-kernel MMD. DAN learns transferable features by embedding deep features of multiple domain-specific layers to reproducing kernel Hilbert spaces (RKHSs) and matching different distributions optimally using multi-kernel MMD. RTN jointly learns transferable features and adapts different source and target classifiers via deep residual learning [He et al.2016]. RevGrad enables domain adversarial learning [Goodfellow et al.2014] by adapting a single layer of deep networks, which matches the source and target domains by making them indistinguishable for a domain discriminator.

We follow standard evaluation protocols for unsupervised domain adaptation [Long et al.2015, Ganin and Lempitsky2015]. For both Office-31 and ImageCLEF-DA datasets, we use all labeled source examples and all unlabeled target examples. We compare the average classification accuracy of each method over three random experiments, and report the standard error of the classification accuracies across different experiments of the same transfer task. For all baseline methods, we either follow their original model selection procedures, or conduct transfer cross-validation [Zhong et al.2010] if their model selection strategies are not specified. We also adopt transfer cross-validation [Zhong et al.2010] to select the parameter $\lambda$ for the MADA models. Fortunately, our models perform very stably under different parameter values, thus we fix $\lambda$ throughout all experiments. For MMD-based methods (TCA, DDC, DAN, and RTN), we use a Gaussian kernel with bandwidth set to the median pairwise squared distance on the training data, i.e. the median trick [Gretton et al.2012, Long et al.2015]. We examine the influence of deep representations for domain adaptation by exploring AlexNet [Krizhevsky, Sutskever, and Hinton2012] and ResNet [He et al.2016] as base architectures for learning deep representations. For shallow methods, we follow DeCAF [Donahue et al.2014] and use as deep representations the activations of the fc7 (AlexNet) and pool5 (ResNet) layers.
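
The median trick mentioned for the MMD-based baselines can be illustrated with a small standard-library sketch. The helper names and the particular kernel parameterization, exp(-||a - b||^2 / bandwidth), are assumptions made for this example; implementations differ in how they scale the bandwidth.

```python
import math
from itertools import combinations
from statistics import median


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def median_bandwidth(points):
    """Median of the pairwise squared distances over all distinct pairs of points."""
    return median(squared_distance(a, b) for a, b in combinations(points, 2))


def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel with the bandwidth in the denominator of the exponent (assumed form)."""
    return math.exp(-squared_distance(a, b) / bandwidth)


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
bandwidth = median_bandwidth(points)  # pairwise squared distances are 1, 4 and 5
k_far = gaussian_kernel(points[0], points[2], bandwidth)
```

Setting the bandwidth from the data in this way keeps kernel values away from the degenerate extremes of all-ones or all-zeros without a separate tuning step.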

We implement all deep methods based on the Caffe [Jia et al.2014] framework, and fine-tune from AlexNet [Krizhevsky, Sutskever, and Hinton2012] and ResNet [He et al.2016] models pre-trained on the ImageNet dataset [Russakovsky et al.2014]. We fine-tune all convolutional and pooling layers and train the classifier layer via back-propagation. Since the classifier is trained from scratch, we set its learning rate to be 10 times that of the lower layers. We employ mini-batch stochastic gradient descent (SGD) with momentum of 0.9 and the learning rate strategy implemented in RevGrad [Ganin and Lempitsky2015]: the learning rate is not selected by a grid search due to high computational cost; instead, it is adjusted during SGD using the formula $\eta_p = \frac{\eta_0}{(1 + \alpha p)^{\beta}}$, where $p$ is the training progress linearly changing from 0 to 1, $\eta_0 = 0.01$, $\alpha = 10$, and $\beta = 0.75$, which is optimized to promote convergence and low error on the source domain. To suppress noisy activations at the early stages of training, instead of fixing the parameter $\lambda$, we gradually change it by multiplying it by $\frac{2}{1 + \exp(-\delta p)} - 1$, where $\delta = 10$ [Ganin and Lempitsky2015]. This progressive training strategy significantly stabilizes the parameter sensitivity of the proposed approach.
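
The two schedules borrowed from RevGrad can be written down directly. The sketch below assumes the commonly cited forms: an inverse-power learning-rate decay, eta_0 / (1 + alpha * p)^beta, and a sigmoid-shaped ramp for the adversarial weight, 2 / (1 + exp(-delta * p)) - 1. The default constants (0.01, 10, 0.75 and 10) are the values usually associated with that schedule and should be treated as assumptions here, as should the function names.

```python
import math


def learning_rate(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate at training progress p in [0, 1]."""
    return eta0 / (1.0 + alpha * p) ** beta


def adaptation_factor(p, delta=10.0):
    """Multiplier for the trade-off parameter; rises smoothly from 0 towards 1."""
    return 2.0 / (1.0 + math.exp(-delta * p)) - 1.0


# Sample both schedules at a few points of training progress.
progress = [0.0, 0.25, 0.5, 0.75, 1.0]
rates = [learning_rate(p) for p in progress]
factors = [adaptation_factor(p) for p in progress]
```

At the start of training the multiplier is exactly zero, so the domain discriminators have no influence on the feature extractor while the label predictor is still producing noisy class probabilities; the influence then grows as training progresses.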

| Method | I → P | P → I | I → C | C → I | C → P | P → C | Avg |
|---|---|---|---|---|---|---|---|
| AlexNet [Krizhevsky, Sutskever, and Hinton2012] | 66.2±0.2 | 70.0±0.2 | 84.3±0.2 | 71.3±0.4 | 59.3±0.5 | 84.5±0.3 | 73.9 |
| DAN [Long et al.2015] | 67.3±0.2 | 80.5±0.3 | 87.7±0.3 | 76.0±0.3 | 61.6±0.3 | 88.4±0.2 | 76.9 |
| RTN [Long et al.2016] | 67.4±0.3 | 82.3±0.3 | 89.5±0.4 | 78.0±0.2 | 63.0±0.2 | 90.1±0.1 | 78.4 |
| RevGrad [Ganin and Lempitsky2015] | 66.5±0.5 | 81.8±0.4 | 89.0±0.5 | 79.8±0.5 | 63.5±0.4 | 88.7±0.4 | 78.2 |
| MADA | 68.3±0.3 | 83.0±0.1 | 91.0±0.2 | 80.7±0.2 | 63.8±0.2 | 92.2±0.3 | 79.8 |
| ResNet [He et al.2016] | 74.8±0.3 | 83.9±0.1 | 91.5±0.3 | 78.0±0.2 | 65.5±0.3 | 91.2±0.3 | 80.7 |
| DAN [Long et al.2015] | 75.0±0.4 | 86.2±0.2 | 93.3±0.2 | 84.1±0.4 | 69.8±0.4 | 91.3±0.4 | 83.3 |
| RTN [Long et al.2016] | 75.6±0.3 | 86.8±0.1 | 95.3±0.1 | 86.9±0.3 | 72.7±0.3 | 92.2±0.4 | 84.9 |
| RevGrad [Ganin and Lempitsky2015] | 75.0±0.6 | 86.0±0.3 | 96.2±0.4 | 87.0±0.5 | 74.3±0.5 | 91.5±0.6 | 85.0 |
| MADA | 75.0±0.3 | 87.9±0.2 | 96.0±0.3 | 88.8±0.3 | 75.2±0.2 | 92.2±0.3 | 85.8 |

Table 2: Accuracy (%) on ImageCLEF-DA for unsupervised domain adaptation (AlexNet and ResNet). The upper block of rows uses AlexNet as the base network and the lower block uses ResNet.

| Method | A → W | D → W | W → D | A → D | D → A | W → A | Avg |
|---|---|---|---|---|---|---|---|
| AlexNet [Krizhevsky, Sutskever, and Hinton2012] | 58.2±0.4 | 95.9±0.2 | 99.0±0.1 | 60.4±0.3 | 49.8±0.5 | 47.3±0.5 | 68.4 |
| RevGrad [Ganin and Lempitsky2015] | 65.1±0.5 | 91.7±0.3 | 97.1±0.3 | 60.6±0.3 | 42.1±0.4 | 42.9±0.5 | 66.6 |
| MADA | 70.8±0.2 | 96.6±0.1 | 99.5±0.0 | 69.6±0.1 | 51.4±0.2 | 54.2±0.3 | 73.7 |

Table 3: Accuracy (%) on Office-31 for domain adaptation from 31 classes to 25 classes (AlexNet)


The classification accuracy results on the Office-31 dataset for unsupervised domain adaptation based on AlexNet and ResNet are shown in Table 1. For fair comparison, the results of DAN [Long et al.2015], RTN [Long et al.2016], and RevGrad [Ganin and Lempitsky2015] are directly reported from their original papers. MADA outperforms all comparison methods on most transfer tasks. It is noteworthy that MADA improves the classification accuracies substantially on hard transfer tasks, e.g. A → W, A → D, D → A, and W → A, where the source and target domains are substantially different, and produces comparable classification accuracies on the easy transfer tasks, D → W and W → D, where the source and target domains are similar [Saenko et al.2010]. The three domains in the ImageCLEF-DA dataset are balanced in each category. As reported in Table 2, the MADA approach outperforms the comparison methods on most transfer tasks. The encouraging results highlight the importance of multi-adversarial domain adaptation in deep neural networks, and suggest that MADA is able to learn more transferable representations for effective domain adaptation.

The experimental results reveal several insightful observations. (1) Standard deep learning methods either outperform (AlexNet) or underperform (ResNet) traditional shallow transfer learning methods (TCA and GFK) that use deep features as input. This confirms the current understanding that deep networks, even extremely deep ones (ResNet), can learn abstract feature representations that only reduce but do not remove the cross-domain discrepancy [Yosinski et al.2014]. (2) Deep transfer learning methods substantially outperform both standard deep learning methods and traditional shallow transfer learning methods with deep features as input. This validates that explicitly reducing the cross-domain discrepancy by embedding domain-adaptation modules into deep networks (DDC, DAN, RTN, and RevGrad) can learn more transferable features. (3) MADA substantially outperforms previous methods based on multilayer adaptation (DAN), semi-supervised adaptation (RTN), or domain adversarial training (RevGrad). Although both MADA and RevGrad [Ganin and Lempitsky2015] perform domain adversarial adaptation, the improvement from RevGrad to MADA is crucial for domain adaptation: RevGrad matches data distributions across domains without exploiting the complex multimode structures, whereas MADA enables domain adaptation by making the source and target domains indistinguishable to multiple domain discriminators, each responsible for matching the source and target data associated with the same class, which can essentially reduce the shift in data distributions with complex multimode structures.

(a) RevGrad: source=A
(b) RevGrad: target=W
(c) MADA: source=A
(d) MADA: target=W
Figure 3: The t-SNE visualization of deep features extracted by RevGrad (a)(b) and MADA (c)(d).

Figure 4: Empirical analysis: (a) Sharing strategies, (b) $\mathcal{A}$-distance, and (c) Convergence performance.

Negative transfer is an important technical bottleneck for successful domain adaptation. Negative transfer is more likely to happen when the source domain is substantially larger than the target domain, in which case many source data points are irrelevant to the target domain. To evaluate the robustness against negative transfer, we randomly remove 6 classes from the target domain in all transfer tasks constructed from the Office-31 dataset. For example, we perform domain adaptation on transfer task A 31 → W 25, where the source domain A has 31 classes but the target domain W has only 25 classes. In this more general and challenging scenario, we observe from Table 3 that the top-performing adversarial adaptation method, RevGrad, significantly underperforms standard AlexNet on most transfer tasks. This is evidence of the negative transfer difficulty. The proposed MADA approach significantly exceeds the performance of both AlexNet and RevGrad, and successfully avoids the negative transfer trap. These positive results imply that multi-adversarial adaptation can alleviate negative transfer.


Feature Visualization: We go deeper into the feature transferability by visualizing in Figures 3(a)–3(d) the network activations of task A → W (10 classes) learned by RevGrad and MADA at their respective bottleneck layers, using t-SNE embeddings [Donahue et al.2014]. The visualization results reveal several interesting observations. (1) Under RevGrad features, the source and target domains are made indistinguishable; however, different categories are not clearly discriminated. The reason is that domain adversarial learning is performed only at the feature layer, while the discriminative information is not taken into account by the domain adversary. (2) Under MADA features, not only are the source and target domains made more indistinguishable, but different categories are also better discriminated, which leads to the best adaptation accuracy. These superior results benefit from the integration of discriminative information into multiple domain discriminators, which enables matching of complex multimode structures of the source and target data distributions.

Sharing Strategies: Besides the proposed multi-adversarial strategy, one may consider other sharing strategies for multiple domain discriminators. For example, one can consider sharing all network parameters in the multiple domain discriminators, which is similar to previous domain adversarial adaptation methods with a single domain discriminator; or consider sharing only a fraction of the network parameters for more flexibility. To examine different sharing strategies, we compare two variants of MADA: MADA-full, which shares all parameters of the multiple domain discriminator networks, and MADA-partial, which shares only the lowest layers of the multiple discriminator networks. The accuracy results of tasks A → W and A → D in Figure 4(a) reveal that the transfer performance decreases when we share more parameters of the multiple discriminators. This confirms our motivation that multiple domain discriminators are necessary to establish fine-grained distribution alignment.

Distribution Discrepancy: The domain adaptation theory [Ben-David et al.2010, Mansour, Mohri, and Rostamizadeh2009] suggests the $\mathcal{A}$-distance as a measure of cross-domain discrepancy, which, together with the source risk, will bound the target risk. The proxy $\mathcal{A}$-distance is defined as $d_{\mathcal{A}} = 2(1 - 2\epsilon)$, where $\epsilon$ is the generalization error of a classifier (e.g. kernel SVM) trained on the binary task of discriminating source and target. Figure 4(b) shows $d_{\mathcal{A}}$ on tasks A → W and W → D with features of ResNet, RevGrad, and MADA. We observe that $d_{\mathcal{A}}$ using MADA features is much smaller than that using ResNet and RevGrad features, which suggests that MADA features can reduce the cross-domain gap more effectively. As domains W and D are similar, $d_{\mathcal{A}}$ of task W → D is smaller than that of A → W, which explains the better accuracy of W → D.
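
For concreteness, the proxy distance used in this analysis, 2(1 - 2 * error), can be computed from the held-out error of any source-versus-target classifier. The sketch below is a minimal illustration with hypothetical helper names; the predictions are made-up inputs, not results from the paper.

```python
def classification_error(predicted, actual):
    """Fraction of held-out points whose predicted domain differs from the true domain."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)


def proxy_a_distance(error):
    """Proxy distance 2 * (1 - 2 * error) for a domain classifier with the given error."""
    if not 0.0 <= error <= 1.0:
        raise ValueError("error must lie in [0, 1]")
    return 2.0 * (1.0 - 2.0 * error)


# Hypothetical held-out domain predictions (1 = source, 0 = target).
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 0]
error = classification_error(predicted, actual)
distance = proxy_a_distance(error)
```

A classifier at chance level (error 0.5) gives a distance of 0, meaning the two domains cannot be told apart from the features, while a perfect classifier (error 0) gives the maximum value of 2.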

Convergence Performance: Since MADA involves alternating optimization procedures, we examine its convergence performance in comparison with ResNet and RevGrad. Figure 4(c) shows the test errors of different methods on task A → W, which suggests that MADA converges as stably as RevGrad while significantly outperforming RevGrad throughout the whole process of convergence. Also, the computational complexity of MADA is similar to that of RevGrad, since the multiple domain discriminators only occupy a small fraction of the overall computational complexity.

Conclusion


This paper presented a novel multi-adversarial domain adaptation approach to enable effective deep transfer learning. Unlike previous domain adversarial adaptation methods that only match the feature distributions across domains without exploiting the complex multimode structures, the proposed approach further exploits the discriminative structures to enable fine-grained distribution alignment in a multi-adversarial adaptation framework, which can simultaneously promote positive transfer and circumvent negative transfer. Experiments show that the proposed approach achieves state-of-the-art results.

Acknowledgments


This work was supported by the National Key Research and Development Program of China (2016YFB1000701), National Natural Science Foundation of China (61772299, 61325008, 61502265, 61672313) and Tsinghua National Laboratory (TNList) Key Project.

References


  • [Arjovsky and Bottou2017] Arjovsky, M., and Bottou, L. 2017. Towards principled methods for training generative adversarial networks. In ICLR.
  • [Arjovsky, Chintala, and Bottou2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.
  • [Ben-David et al.2010] Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine Learning 79(1-2):151–175.
  • [Bengio, Courville, and Vincent2013] Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 35(8):1798–1828.
  • [Bousmalis et al.2016] Bousmalis, K.; Trigeorgis, G.; Silberman, N.; Krishnan, D.; and Erhan, D. 2016. Domain separation networks. In NIPS, 343–351.
  • [Che et al.2017] Che, T.; Li, Y.; Jacob, A. P.; Bengio, Y.; and Li, W. 2017. Mode regularized generative adversarial networks. ICLR.
  • [Collobert et al.2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research (JMLR) 12:2493–2537.
  • [Donahue et al.2014] Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; and Darrell, T. 2014. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML.
  • [Duan, Tsang, and Xu2012] Duan, L.; Tsang, I. W.; and Xu, D. 2012. Domain transfer multiple kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 34(3):465–479.
  • [Durugkar, Gemp, and Mahadevan2017] Durugkar, I.; Gemp, I.; and Mahadevan, S. 2017. Generative multi-adversarial networks. ICLR.
  • [Ganin and Lempitsky2015] Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In ICML.
  • [Glorot, Bordes, and Bengio2011] Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML.
  • [Gong et al.2012] Gong, B.; Shi, Y.; Sha, F.; and Grauman, K. 2012. Geodesic flow kernel for unsupervised domain adaptation. In CVPR.
  • [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS.
  • [Gretton et al.2012] Gretton, A.; Borgwardt, K.; Rasch, M.; Schölkopf, B.; and Smola, A. 2012. A kernel two-sample test. Journal of Machine Learning Research (JMLR) 13:723–773.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
  • [Hoffman et al.2014] Hoffman, J.; Guadarrama, S.; Tzeng, E.; Hu, R.; Donahue, J.; Girshick, R.; Darrell, T.; and Saenko, K. 2014. LSDA: Large scale detection through adaptation. In NIPS.
  • [Jia et al.2014] Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. In ACM MM.
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
  • [Long et al.2015] Long, M.; Cao, Y.; Wang, J.; and Jordan, M. I. 2015. Learning transferable features with deep adaptation networks. In ICML.
  • [Long et al.2016] Long, M.; Zhu, H.; Wang, J.; and Jordan, M. I. 2016. Unsupervised domain adaptation with residual transfer networks. In NIPS, 136–144.
  • [Long et al.2017] Long, M.; Zhu, H.; Wang, J.; and Jordan, M. I. 2017. Deep transfer learning with joint adaptation networks. In ICML.
  • [Mansour, Mohri, and Rostamizadeh2009] Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2009. Domain adaptation: Learning bounds and algorithms. In COLT.
  • [Metz et al.2017] Metz, L.; Poole, B.; Pfau, D.; and Sohl-Dickstein, J. 2017. Unrolled generative adversarial networks. ICLR.
  • [Mirza and Osindero2014] Mirza, M., and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
  • [Ngiam et al.2011] Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; and Ng, A. Y. 2011. Multimodal deep learning. In ICML.
  • [Oquab et al.2013] Oquab, M.; Bottou, L.; Laptev, I.; and Sivic, J. 2013. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR.
  • [Pan and Yang2010] Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE) 22(10):1345–1359.
  • [Pan et al.2011] Pan, S. J.; Tsang, I. W.; Kwok, J. T.; and Yang, Q. 2011. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks (TNN) 22(2):199–210.
  • [Russakovsky et al.2014] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2014. ImageNet Large Scale Visual Recognition Challenge.
  • [Saenko et al.2010] Saenko, K.; Kulis, B.; Fritz, M.; and Darrell, T. 2010. Adapting visual category models to new domains. In ECCV.
  • [Torralba and Efros2011] Torralba, A., and Efros, A. A. 2011. Unbiased look at dataset bias. In CVPR.
  • [Tzeng et al.2014] Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; and Darrell, T. 2014. Deep domain confusion: Maximizing for domain invariance. CoRR abs/1412.3474.
  • [Tzeng et al.2015] Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; and Darrell, T. 2015. Simultaneous deep transfer across domains and tasks. In ICCV.
  • [Wang and Schneider2014] Wang, X., and Schneider, J. 2014. Flexible transfer learning under support and model shift. In NIPS.
  • [Yosinski et al.2014] Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In NIPS.
  • [Zhang et al.2013] Zhang, K.; Schölkopf, B.; Muandet, K.; and Wang, Z. 2013. Domain adaptation under target and conditional shift. In ICML.
  • [Zhong et al.2010] Zhong, E.; Fan, W.; Yang, Q.; Verscheure, O.; and Ren, J. 2010. Cross validation framework to choose amongst models and datasets for transfer learning. In ECML/PKDD, 547–562. Springer.