Adversarial Domain Adaptation Being Aware of Class Relationships

05/28/2019 ∙ by Zeya Wang, et al. ∙ 1

Adversarial training is a useful approach to promote the learning of transferable representations across the source and target domains, and has been widely applied to domain adaptation (DA) tasks based on deep neural networks. Until very recently, existing adversarial domain adaptation (ADA) methods ignored the useful information in the label space, which is an important factor accounting for the complicated data distributions associated with different semantic classes. In particular, inter-class semantic relationships have rarely been considered or discussed in existing work on transfer learning. In this paper, we propose a novel relationship-aware adversarial domain adaptation (RADA) algorithm, which first utilizes a single multi-class domain discriminator to enforce the learning of the inter-class dependency structure during domain-adversarial training, and then aligns this structure with the inter-class dependencies characterized by training the label predictor on the source domain. Specifically, we impose a regularization term that penalizes the structure discrepancy between the inter-class dependencies estimated from the domain discriminator and from the label predictor. Through this alignment, our proposed method makes ADA aware of class relationships. Empirical studies show that incorporating class relationships significantly improves performance on benchmark datasets.




1 Introduction

The success of deep learning largely depends on large-scale labeled datasets (e.g. ImageNet [8]). Manually annotating labels is costly and time-consuming, which becomes an obstacle to applying deep learning models to new datasets [10]. An effective approach to building a model on unlabeled data in a target domain is to leverage off-the-shelf labeled data from relevant source domains. However, due to domain shift [28], models trained on source domains usually do not generalize well to target domains. Recently, adversarial training has been introduced to learn domain-invariant features, and it substantially improves DA performance [10]. These adversarial-learning-based methods incorporate a domain discriminator to encourage domain confusion, thereby minimizing the distribution discrepancy between source and target domains [10, 2, 21, 29, 24]. Despite the significant improvement from ADA, most existing methods simply match the distributions across domains without considering the structure behind the complicated data distributions. The conditional distributions of data given different semantic classes can differ, which may lead to multimodal distributions in multi-class classification. Failing to capture the modes of the data distribution will mislead the alignment of distributions across domains [24, 19, 1]. Current attempts focus on revealing this complex structure within the feature space, but ignore the high-level semantics from the label space. Some recent studies design separate class-wise domain discriminators, where each discriminator is responsible for the distribution alignment of only one semantic class [24, 6]. By including class information, these approaches successfully mitigate false distribution alignment across domains. However, assigning a separate discriminator to each class essentially constrains all classes to be orthogonal to each other. These methods do not explore the inter-class semantic relationships in the label space for DA.

Utilizing structure information from the label space could help capture the multimodal structure more accurately. Intuitively, class relationships should remain consistent across domains (Figure 1), which motivates us to exploit the structure information among semantic classes and inject it into the learning process of DA. Multi-task learning (MTL) jointly learns multiple related tasks through knowledge sharing, and structure learning has gained growing popularity for explicitly exploiting hidden task structures. The Gaussian graphical model is a powerful tool for studying the conditional dependency structure among random variables, and has therefore been widely used for learning the structure of task relationships [11]. Recently, this approach has been extended to exploit class relationships with deep neural networks (DNNs) for improved video categorization performance [17], which provides an effective solution for characterizing inter-class relationships in our work.

Inspired by this line of work, we first design a single multi-class domain discriminator that implements class-specific domain classification. In doing so, we encourage knowledge sharing across classes for domain classification, which enables the learning of inter-class dependencies and also favors a parsimonious network. We introduce a structure regularization that constrains the class relationships captured by domain discrimination to agree as closely as possible with the inter-class dependencies revealed by label prediction on source domain data. Because this work focuses on how class relationships can be incorporated to improve DA, we build our model on top of the domain adversarial neural network (DANN) [10], which is the plain ADA framework. The presented design and regularizer can be seen as an “add-on” that is easily integrated into other ADA frameworks. Experiments on benchmark datasets show that the proposed approach outperforms the competing methods.

Figure 1: Class relationships are intuitively supposed to be similar between source and target domains (solid lines imply strong relationships while dashed lines imply weak relationships). We are motivated to encourage the similarity for making ADA aware of class relationships.

2 Related Work

Adversarial Domain Adaptation

Deep DA methods attempt to generalize deep neural networks across different domains. The most commonly used approaches are based on discrepancy minimization [30, 18, 26, 21, 20, 5] or adversarial training [9, 10, 29, 31]. Adversarial training, inspired by generative modeling in GANs [14], is an important approach for deep transfer learning tasks. DANN is proposed with a domain discriminator that classifies whether a sample comes from the source or the target domain [10, 9]. With a gradient reversal layer (GRL), it promotes the learning of features that are discriminative for classification while ensuring that the learned feature distributions over different domains are similar. Recent works recognize the importance of exploiting the complex structure behind the data distributions for DA rather than just aligning the whole source and target distributions [19, 24]. Multi-adversarial domain adaptation (MADA) utilizes information from the label space by assigning class-wise discriminators to capture the multimodal structure arising from different classes [24]. However, the structure information among classes in the label space remains unexplored for DA.

Structure Learning

Multi-task learning (MTL) seeks to improve generalization performance by transferring knowledge among related tasks. This knowledge sharing makes it possible to learn the structure among tasks, so structure learning, which studies how to accurately characterize task relationships, has become a central issue in MTL [11, 32]. As one of the earliest MTL models, DNNs also share certain commonalities (neurons of the hidden layers) among the neurons of the output layer [4, 17]. Inspired by methods that explicitly model task relationships in MTL [12, 32], recent studies on multi-class classification with CNNs exploit the inter-class relationships by imposing a regularization, which has been validated to improve video categorization performance [17].

3 Methods

In this section, we first discuss how class relationships are modeled with DNNs, followed by the design of a single discriminator that performs class-specific domain classification. We then introduce our RADA algorithm, which keeps the domain adversarial training aware of class relationships.

3.1 Inter-class Dependency Structure Learning with Deep Neural Networks

For the multi-class classification problem, we are given a set of samples, each consisting of input features and an associated class label. A DNN maps the input features of each sample to its class through a large number of interconnected neurons. Typically, these neurons are arranged in multiple layers, e.g., convolutional and pooling layers. In the classification task, a stack of fully connected (FC) layers is often placed on top of these layers to predict the final class scores. Considering only the FC layers, each layer has a weight matrix and a bias vector whose sizes are determined by the number of neurons in that layer. Each layer applies an activation function to an affine transformation of its input, and the output of the last layer is the final output of the network. For simplicity of the following discussion, we concatenate each bias vector to the row vectors of the corresponding weight matrix to obtain a unified weight matrix. The training objective is computed with a cross-entropy loss:
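As a notational sketch only, the layer recursion and the training objective described above can be written in one conventional form. The symbols $n$, $x_i$, $y_i$, $L$, $W_l$, $b_l$, $h_l$, $\sigma$, $\ell$ and $f$ are assumptions chosen for illustration and may differ from the notation of equation (1).

```latex
% Assumed notation: n samples (x_i, y_i); L fully connected layers; h_0 is the input to the FC stack.
h_l = \sigma\left(W_l h_{l-1} + b_l\right), \qquad l = 1, \dots, L, \qquad f(x_i) = h_L,
\\
% Unified weight matrix: the bias is appended to each row of W_l.
\tilde{W}_l = \left[\, W_l \;\; b_l \,\right],
\\
% Empirical cross-entropy objective over all unified weight matrices.
\min_{\tilde{W}_1, \dots, \tilde{W}_L} \; \frac{1}{n} \sum_{i=1}^{n} \ell\left(f(x_i),\, y_i\right).
```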


Inspired by recent research on learning task relationships in MTL [32, 17, 12], DNNs have been used in classification problems to exploit the inter-class dependency structure through an additional regularization on the output layer that enforces knowledge sharing across classes. One typical way to model the dependency structure among classes is through a precision matrix, each off-diagonal element of which captures the pairwise partial correlation between two classes. Specifically, we assume that the row vectors of the weight matrix of the output layer follow a multivariate Gaussian distribution whose inverse covariance is this precision matrix. By maximizing the log-likelihood subject to a positive semidefinite constraint, the precision matrix can be optimized concurrently with the training objective in equation (1) by:
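A minimal sketch of the estimation step, assuming the plain Gaussian maximum-likelihood form: for a fixed weight matrix W with d rows and one column per class, minimizing tr(ΩS) − d·log det(Ω) with S = WᵀW gives Ω = d·S⁻¹. The orientation of W, the helper names, the restriction to two classes, and the toy weights are all assumptions made for illustration; this is not necessarily the exact form of equation (2).

```python
def gram(W):
    """Return S = W^T W for a weight matrix given as d rows of C entries."""
    C = len(W[0])
    return [[sum(row[i] * row[j] for row in W) for j in range(C)] for i in range(C)]

def inv2(M):
    """Invert a 2x2 matrix; raises if it is singular or not positive definite."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det <= 0:
        raise ValueError("matrix must be positive definite")
    return [[d / det, -b / det], [-c / det, a / det]]

def precision_mle(W):
    """Closed-form minimizer of tr(Omega S) - d * log det(Omega), i.e. d * S^{-1}."""
    d = len(W)
    S_inv = inv2(gram(W))
    return [[d * v for v in row] for row in S_inv]

# Toy output-layer weights (hypothetical): 4 rows, 2 classes with similar columns.
W = [[1.0, 0.8], [-0.5, -0.3], [0.2, 0.4], [0.9, 0.7]]
Omega = precision_mle(W)
print(Omega)
```

Because the two columns of the toy W are similar, the off-diagonal entry of Omega is negative, which corresponds to a positive partial correlation between the two classes.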

Figure 2: The architecture of the proposed RADA algorithm built on top of the plain DANN model. In our paper we use a one-layer domain discriminator with . Note that double arrows represent deterministic inference and dashed lines denote the structure discrepancy.

3.2 Multi-class Adversarial Domain Adaptation

In an unsupervised domain adaptation (UDA) problem, we are given labeled source domain data and unlabeled target domain data. DANN has been designed to extract domain-invariant features between source and target domains through an adversarial training scheme [10]. The whole architecture consists of three parts: a feature extractor, a label predictor, and a domain discriminator. The feature extractor and the label predictor together form a standard feed-forward DNN for predicting class labels. The domain discriminator is trained to discriminate between samples from the source and target domains, while the feature extractor is fine-tuned to confuse the discriminator. In the adversarial training procedure, the parameters of the domain discriminator are learned by minimizing a binary cross-entropy loss over the domain labels, while the parameters of the feature extractor are learned by maximizing this domain loss jointly with minimizing the label prediction loss (equation (3)). This is achieved by inserting a gradient reversal layer (GRL) between the feature extractor and the domain discriminator, which ensures that the feature distributions over the source and target domains are made similar.


Here the GRL is represented by a pseudo-function [10], and the adversarial loss is weighted by a balancing parameter.
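For reference, the DANN objective that this section builds on is commonly written as below. The notation ($G_f$, $G_y$, $G_d$, $\theta_f$, $\theta_y$, $\theta_d$, $L_y$, $L_d$, $R_\lambda$, $n_s$, $n_t$) follows the DANN literature [9, 10, 24] and is an assumption here; it may differ from the symbols and normalization used in equation (3).

```latex
% Saddle-point objective of DANN (assumed notation).
E(\theta_f, \theta_y, \theta_d)
  = \frac{1}{n_s} \sum_{x_i \in \mathcal{D}_s} L_y\big(G_y(G_f(x_i)),\, y_i\big)
  - \frac{\lambda}{n_s + n_t} \sum_{x_i \in \mathcal{D}_s \cup \mathcal{D}_t}
      L_d\big(G_d(G_f(x_i)),\, d_i\big),
\\
(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d),
\qquad
\hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d),
\\
% Gradient reversal layer: identity in the forward pass, sign-flipped and scaled gradient in the backward pass.
R_\lambda(x) = x, \qquad \frac{\mathrm{d} R_\lambda}{\mathrm{d} x} = -\lambda I.
```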

In order to capture the multimodal structure of the data distribution that is attributable to different semantic classes, a design with multiple discriminators has been applied, in which each discriminator is responsible for matching the source and target domain data associated with one particular class [24]. This design has been shown to enhance positive transfer and alleviate negative transfer. However, two concerns remain: 1) it makes a strong assumption of orthogonality across classes during distribution alignment, i.e., it neglects the structure information among the semantic classes; 2) the number of discriminators grows with the number of classes, which raises the memory cost of the network parameters. To address these concerns, we first present a multi-class ADA design; note that our use of “multi-class” differs from that in standard multi-class classification. Instead of adopting separate discriminators, we use a single discriminator with a multi-branch design to match the multimodal structure across different classes.

Figure 2 illustrates the multi-branch discriminator within the whole network. One shared hidden layer encodes the common discriminative features between domains for all classes. The shared layer is followed by a layer with class-specific nodes, where each node (branch) predicts the domain label only for samples of its associated class and is muted when the domain label is predicted for samples of other classes. Each class thus has its own binary domain classification loss. With label information, source domain data can be easily assigned to the corresponding class-specific node. For the unlabeled target domain data, a weighted sum of the loss values from the different nodes is calculated, where the probability score vector given by the label predictor is used as the weights. Integrating this new design, we update the objective of our multi-class ADA as:



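Here the class weights are the one-hot encoding of the label for source samples and the predicted probability scores for target samples.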
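The class-weighted domain loss described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact form of equation (4): the function names, the convention that each node outputs a probability of "source", and the toy numbers are all hypothetical.

```python
import math

def bce(p_source, is_source):
    """Binary cross-entropy for one node; p_source is its predicted P(source)."""
    return -math.log(p_source) if is_source else -math.log(1.0 - p_source)

def one_hot(label, num_classes):
    """Class weights for a labeled source sample."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def multi_branch_domain_loss(node_probs, class_weights, is_source):
    """Weighted sum of per-class domain losses; zero-weight nodes are muted."""
    return sum(
        w * bce(p, is_source)
        for p, w in zip(node_probs, class_weights)
        if w > 0.0
    )

# Hypothetical outputs of three class-specific nodes for one sample.
node_probs = [0.9, 0.6, 0.2]

# Source sample labeled as class 1: only node 1 contributes.
source_loss = multi_branch_domain_loss(node_probs, one_hot(1, 3), is_source=True)

# Target sample: the label predictor's probabilities act as soft weights.
predicted_probs = [0.1, 0.7, 0.2]
target_loss = multi_branch_domain_loss(node_probs, predicted_probs, is_source=False)
print(source_loss, target_loss)
```

Using soft weights for target samples lets every node receive a gradient in proportion to how likely the sample is to belong to its class, which is what allows a single shared discriminator to replace per-class discriminators.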

3.3 Adversarial Domain Adaptation Being Aware of Class Relationships

Incorporating the information of class relationships into the alignment process between the source and target data distributions will relax the orthogonality assumption and help match the multimodal structure of the data distributions as closely as possible. Recall from Section 3.1 that a precision matrix is used to model the inter-class dependency structure with a DNN. By implicitly injecting this structure into the adversarial training process, we can make ADA automatically aware of class relationships.

With the features extracted by the feature extractor, the label predictor predicts class labels, and the class relationships can be characterized from this prediction as well. For the domain discriminator to capture a similar inter-class dependency structure while aligning the source and target data distributions, the precision matrix estimated from the class-specific domain classification performed by the discriminator should be consistent with that estimated from the prediction task performed by the label predictor. To maximize this consistency, we propose an approach that minimizes the discrepancy between the class relationships learned by the label predictor and by the domain discriminator (as shown in Figure 2). We define one precision matrix for the output-layer weight matrix of the label predictor and one for that of the domain discriminator; these two matrices characterize the inter-class dependencies with respect to the two networks. According to equation (2), for a fixed output-layer weight matrix, the corresponding precision matrix is obtained by solving


The solution to this minimization problem is straightforward to derive and follows in closed form from the spectral theorem. Similarly, solving


for the other network yields its precision matrix. We adopt a structure regularization approach that minimizes the discrepancy between the two precision matrices (a.k.a. KL divergence [7]). Because this measure is asymmetric, we formulate the discrepancy in two ways:


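which minimizes the divergence in one direction between the two precision matrices, or


which minimizes the divergence in the opposite direction. Inserting the two closed-form solutions into equation (7) or (8), we design a regularization that minimizes the discrepancy of class relationships between the label predictor and the domain discriminator: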
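To make the asymmetry concrete, the following sketch evaluates the KL divergence between two zero-mean Gaussians parameterized by their precision matrices, using the standard identity KL(p‖q) = ½[tr(Ω_q Ω_p⁻¹) − k + ln(det Ω_p / det Ω_q)]. The 2×2 restriction, the helper names, and the toy matrices are assumptions for illustration; any constant factors or scaling used in equations (7) and (8) are not reproduced here.

```python
import math

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    """Inverse of a 2x2 matrix."""
    det = det2(M)
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

def matmul2(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def kl_from_precisions(omega_p, omega_q):
    """KL( N(0, omega_p^{-1}) || N(0, omega_q^{-1}) ) for 2x2 precision matrices."""
    k = 2
    prod = matmul2(omega_q, inv2(omega_p))
    trace = prod[0][0] + prod[1][1]
    return 0.5 * (trace - k + math.log(det2(omega_p) / det2(omega_q)))

# Hypothetical precision matrices for the label predictor and the discriminator.
omega_y = [[2.0, -0.5], [-0.5, 1.0]]
omega_d = [[1.5, 0.2], [0.2, 1.2]]

forward = kl_from_precisions(omega_y, omega_d)
backward = kl_from_precisions(omega_d, omega_y)
print(forward, backward)
```

The two directions give different values for the same pair of matrices, which is why two variants of the regularizer arise.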


Integrating the penalty into equation (4), we have our final training objective:


Here the relationship-aware regularization term is weighted by a balancing parameter.

4 Experiments

Method I→P P→I I→C C→I C→P P→C Average
ResNet [16] 74.8 83.9 91.5 78.0 65.5 91.2 80.7
DAN [18] 75.0 86.2 93.3 84.1 69.8 91.3 83.3
RTN [20] 75.6 86.8 95.3 86.9 72.7 92.2 84.9
DANN [10] 75.0 86.0 96.2 87.0 74.3 91.5 85.0
JAN [21] 76.8 88.0 94.7 89.5 74.2 91.7 85.8
CAN [31] 78.2 87.5 94.2 89.5 75.8 89.2 85.7
MADA [24] 75.0 87.9 96.0 88.8 75.2 92.2 85.8
RADA 78.8 92.1 97.3 90.9 76.4 94.6 88.4
RADA 79.2 92.4 97.5 91.1 76.6 95.3 88.7
Table 1: Mean accuracy (%) on ImageCLEF-DA for UDA (ResNet-50)
Method A→W D→W W→D A→D D→A W→A Average
ResNet [16] 68.4 96.7 99.3 68.9 62.5 60.7 76.1
TCA [23] 74.7 96.7 99.6 76.1 63.7 62.9 79.3
GFK [13] 74.8 95.0 98.2 76.5 65.4 63.0 78.8
DDC [30] 75.8 95.0 98.2 77.5 67.4 64.0 79.7
DAN [18] 83.8 96.8 99.5 78.4 66.7 62.7 81.3
RTN [20] 84.5 96.8 99.4 77.5 66.2 64.8 81.6
DANN [10] 82.0 96.9 99.1 79.7 68.2 67.4 82.2
ADDA [29] 86.2 96.2 98.4 77.8 69.5 68.9 82.9
JAN [21] 85.4 97.4 99.8 84.7 68.6 70.0 84.3
JDDA [5] 82.6 95.2 99.7 79.8 57.4 66.7 80.2
CAN [31] 81.5 98.2 99.7 85.5 65.9 63.4 82.4
MADA [24] 90.0 97.4 99.6 87.8 70.3 66.4 85.2
RADA 91.5 99.0 100.0 90.3 71.5 70.1 87.1
RADA 91.5 98.9 100.0 90.7 71.5 71.3 87.3
Table 2: Mean accuracy (%) on Office-31 for UDA (ResNet-50)

4.1 Experiment Setup

Datasets We evaluate our model on two benchmarks. The first dataset is ImageCLEF-DA. All the images are collected from three public datasets: Caltech256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). They cover 12 common categories shared by the three datasets. We evaluate our method on the transfer tasks with all domain combinations: I→P, P→I, I→C, C→I, C→P, and P→C. The other dataset is Office-31 [25], which consists of images from 31 categories. All the images are collected from three different domains: Amazon (A), whose images are downloaded from the web; DSLR (D), taken with a digital SLR camera; and Webcam (W), recorded with a simple webcam. This dataset, with images from different photographic settings, represents visual domain shifts. We evaluate our method in terms of classification accuracy on all six transfer tasks: A→W, D→W, W→D, A→D, D→A, and W→A.

Method A→W D→W W→D A→D D→A W→A Average
ResNet [16] 54.5 94.6 94.3 65.6 73.2 71.7 75.6
DAN [18] 46.4 53.6 58.6 42.7 65.7 65.3 55.4
ADDA [29] 43.7 46.5 40.1 43.7 42.8 46.0 43.8
RTN [20] 75.3 97.1 98.3 66.9 85.6 85.7 84.8
JAN [21] 43.4 53.6 41.4 35.7 51.0 51.6 46.1
DANN [10] 41.4 46.8 38.9 41.4 41.3 44.7 42.4
MADA [24] (reimplementation) 63.5 84.8 99.7 67.7 59.1 64.0 73.1
RADA 83.0 97.4 97.2 87.4 86.0 85.5 89.4
RADA 82.8 97.4 97.6 86.8 86.6 86.3 89.6
Table 3: Mean accuracy (%) on Office-31 for PDA from 31 classes to 10 classes (ResNet-50)

Implementation Details Training and testing are implemented in PyTorch. For all the transfer tasks, we use stochastic gradient descent (SGD) with momentum to minimize the loss function given by equation (11). We adopt balanced sampling between classes to increase the chance that samples from each category are drawn in each batch. The learning rate is initialized separately for the CNN layers and for the fully connected layers, and is then exponentially decayed during SGD by a factor that depends on the training progress, measured in epochs [10]. All weights and biases are regularized by weight decay. The balancing parameter for adversarial training starts from a fixed value and is decayed by a factor over the course of training, while the balancing parameter for the relationship-aware regularization is fixed across all the experiments [24]. We implement our method and all the compared deep learning methods on top of ResNet-50 [16] pre-trained on the ImageNet dataset [8].
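As an illustration of a progress-dependent learning-rate decay, the sketch below uses the annealing form reported in [10], lr = lr0 / (1 + α·p)^β with α = 10 and β = 0.75. Treating these as the values used here is an assumption, as are the base learning rate, the number of epochs, and the function name; in practice the CNN layers and the fully connected layers would each use their own base rate.

```python
def annealed_learning_rate(base_lr, progress, alpha=10.0, beta=0.75):
    """Learning rate after a fraction `progress` in [0, 1] of training.

    The functional form and the defaults follow [10]; they are assumptions here.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    return base_lr / (1.0 + alpha * progress) ** beta

total_epochs = 100  # hypothetical
base_lr = 0.01      # hypothetical
schedule = [
    annealed_learning_rate(base_lr, epoch / total_epochs)
    for epoch in range(total_epochs + 1)
]
print(schedule[0], schedule[-1])
```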

Baselines We follow standard evaluation protocols for UDA, using all labeled source samples and all unlabeled target samples, and report the mean classification accuracy over three random experiments [24]. We compare RADA with recent state-of-the-art deep transfer learning methods: Deep Domain Confusion (DDC) [30], Deep Adaptation Network (DAN) [18], Residual Transfer Network (RTN) [20], DANN [10], Adversarial Discriminative Domain Adaptation (ADDA) [29], Joint Adaptation Network (JAN) [21], MADA [24], Collaborative and Adversarial Network (CAN) [31] and Joint Discriminative Domain Adaptation (JDDA) [5]; and with traditional machine learning methods: Transfer Component Analysis (TCA) [23] and Geodesic Flow Kernel (GFK) [13].

4.2 Main Results

The mean classification accuracy on ImageCLEF-DA and Office-31 is reported in Table 1 and Table 2. The results of the baseline methods are taken from the literature [24, 19, 5, 31, 3]. The two RADA rows in each table are our method trained with the two alternative directions of the structure discrepancy. As shown in Tables 1 and 2, both RADA variants outperform the baseline methods across all the transfer tasks on both ImageCLEF-DA and Office-31. The second variant slightly outperforms the first on average. This very similar performance suggests that the choice of the target matrix does not make much difference, in spite of the asymmetry of the adopted discrepancy metric. The number of parameters used by RADA is also greatly reduced compared with multiple-discriminator methods (e.g. MADA). For the ImageCLEF-DA and Office-31 datasets, adopting 12 and 31 independent two-layer discriminators, respectively, generates far more parameters than the single discriminator of RADA. This reduction is important in practice, especially when the number of classes is very large. The improved performance with an even simpler network highlights the significance of incorporating class relationships into the adversarial training process. The alignment of class relationships between the label predictor and the domain discriminator introduces more structure information from the label space into the adversarial training process, and efficiently promotes the learning of transferable representations by the feature extractor.
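The parameter saving can be illustrated with a rough count. The sketch below compares K independent two-layer discriminators with one shared hidden layer followed by K class-specific output nodes; the feature and hidden widths are hypothetical and are not the sizes used in the experiments.

```python
def mlp_params(sizes):
    """Weights plus biases of a fully connected stack with the given layer widths."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(sizes, sizes[1:]))

def separate_discriminators(feature_dim, hidden_dim, num_classes):
    """One two-layer binary discriminator per class."""
    return num_classes * mlp_params([feature_dim, hidden_dim, 1])

def shared_multibranch_discriminator(feature_dim, hidden_dim, num_classes):
    """One shared hidden layer followed by one output node per class."""
    return mlp_params([feature_dim, hidden_dim, num_classes])

feature_dim, hidden_dim = 2048, 1024  # hypothetical widths
for name, num_classes in [("ImageCLEF-DA", 12), ("Office-31", 31)]:
    separate = separate_discriminators(feature_dim, hidden_dim, num_classes)
    shared = shared_multibranch_discriminator(feature_dim, hidden_dim, num_classes)
    print(name, separate, shared)
```

With separate discriminators the count grows linearly in the number of classes through the full hidden layer, whereas the shared design adds only one output node per class.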

We additionally provide evaluations for the partial domain adaptation (PDA) problem, where the target label space is a subset of the source label space. PDA is a new technical bottleneck that is more challenging and practical than standard DA, since the outlier classes in the source domain can cause negative transfer when discriminating the target classes [3, 24]. To show the robustness of our method under PDA, we perform the evaluation in a benchmark experimental setup. From Office-31, we use all the categories for the source domain and choose the ten categories shared with Caltech256 [15] for the target domain. In all the transfer tasks, the source domain therefore contains 31 classes and the target domain has 10 classes. From Table 3, we observe that RADA outperforms ResNet and the other general DA methods on average, with particularly large margins on tasks such as A→W and A→D, which suggests that it successfully avoids the negative transfer trap.

Figure 3: The t-SNE visualization of embedded features from the target domain. Panels: (a) DANN, (b) MADA, (c) RADA, (d) RADA.
Figure 4: Visualization of the characterized class relationships and confusion matrices for one transfer task. Panels: (a) heatmap of partial correlations, (b) confusion matrix of RADA, (c) confusion matrix of MADA.

4.3 Empirical Analysis

Feature Visualization In order to visualize the embedded data, we use t-SNE [22] to project the feature representations after pool5 in ResNet-50, trained with DANN, MADA and RADA respectively, to a lower-dimensional space. The two-dimensional map of the embedded target-domain data from one transfer task is shown in Figure 3, where the class information is indicated by assigning different colors and numerical labels to the data points. We observe that the embedded features from different classes are better separated with RADA and MADA than with DANN. Although MADA is able to separate most of the data points according to their class labels, several classes around the center are still mixed up, while RADA separates those points better. By integrating the information of class relationships, RADA can better extract the features that uniquely belong to each class and capture the modes of the data distribution.

Figure 5: Proxy A-Distance

Class Relationships Partial correlation is a symmetric measure of association between two variables while controlling for the effect of the other variables. It is commonly used to model the conditional dependencies among a group of variables. The partial correlation between two classes can be calculated from the corresponding elements of the precision matrix. With the two precision matrices estimated on one transfer task using RADA, we calculate and visualize the partial correlations among all the classes in Figure 4, where the estimate from the label predictor is displayed in the upper triangular part and that from the discriminator in the lower triangular part. The symmetry of the heatmap in Figure 4(a) indicates that our regularization successfully encourages the class relationships to be consistent between the label predictor and the domain discriminator. Some class relationships are interesting and intuitive. For example, not surprisingly, the class paper notebook is found to be positively associated with ring binder by both the label predictor and the discriminator with RADA (black framed cell in Figure 4(a)). Aware of such class relationships, our method avoids misclassifying several images of ring binder as paper notebook (Figure 4(b)) compared to MADA (black framed cell in Figure 4(c)).
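The conversion from a precision matrix to partial correlations uses the standard identity ρ_ij = −ω_ij / √(ω_ii·ω_jj). The sketch below applies it to a toy matrix; the function name and the numerical values are hypothetical.

```python
import math

def partial_correlations(omega):
    """Convert a precision matrix (nested lists) into a partial-correlation matrix."""
    n = len(omega)
    rho = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                rho[i][j] = 1.0  # diagonal set to 1 by convention
            else:
                rho[i][j] = -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
    return rho

# Toy 3-class precision matrix (hypothetical values, positive definite).
omega = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, 0.5],
    [0.0, 0.5, 1.0],
]
rho = partial_correlations(omega)
for row in rho:
    print(row)
```

A negative off-diagonal precision entry corresponds to a positive partial correlation, and a zero entry means the two classes are conditionally independent given the rest.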

Distribution Discrepancy Proxy A-Distance (PAD) [2, 10] is a widely used metric for measuring the discrepancy between the feature distributions of the source and target domains. PAD is defined as 2(1 − 2ε), where ε is the classification error (e.g. mean absolute error) of a domain classifier (e.g. SVM). Generally, a lower PAD indicates a better generalization ability. As shown in Figure 5, RADA consistently outperforms DANN and MADA on the two transfer tasks considered. This indicates that RADA can better extract domain-invariant features. In addition, the PAD values of one RADA variant are slightly lower than those of the other, showing that it has a better generalization ability.
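The PAD computation itself is a one-line transformation of the domain classifier's error, following the definition in [2, 10]. In the sketch below the function name and the example error rates are hypothetical.

```python
def proxy_a_distance(domain_classifier_error):
    """Proxy A-distance 2 * (1 - 2 * error) for a domain classifier's test error."""
    if not 0.0 <= domain_classifier_error <= 1.0:
        raise ValueError("error must lie in [0, 1]")
    return 2.0 * (1.0 - 2.0 * domain_classifier_error)

# A classifier at chance level (error 0.5) cannot tell the domains apart,
# while a perfect classifier (error 0.0) separates them completely.
for error in (0.5, 0.25, 0.0):
    print(error, proxy_a_distance(error))
```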

5 Conclusion

We present a novel approach to DA that reveals structure information from the label space in order to align complicated data distributions during adversarial training. We propose a new design of multi-class domain discriminator and a novel regularizer that aligns the inter-class dependencies characterized by the label predictor and by the domain discriminator. Experiments show that considering class relationship information can substantially improve transfer learning performance.


  • [1] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 224–232. JMLR. org, 2017.
  • [2] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137–144, 2007.
  • [3] Zhangjie Cao, Lijia Ma, Mingsheng Long, and Jianmin Wang. Partial adversarial domain adaptation. In European Conference on Computer Vision, pages 135–150, 2018.
  • [4] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
  • [5] Chao Chen, Zhihong Chen, Boyuan Jiang, and Xinyu Jin. Joint domain alignment and discriminative feature learning for unsupervised deep domain adaptation. In AAAI Conference on Artificial Intelligence, 2019.
  • [6] Yi-Hsin Chen, Wei-Yu Chen, Yu-Ting Chen, Bo-Cheng Tsai, Yu-Chiang Frank Wang, and Min Sun. No more discrimination: Cross city adaptation of road scene segmenters. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2011–2020. IEEE, 2017.
  • [7] Xiangzhao Cui, Chun Li, Jine Zhao, Li Zeng, Defei Zhang, and Jianxin Pan. Covariance structure regularization via frobenius-norm discrepancy. Linear Algebra and its Applications, 510:124–145, 2016.
  • [8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • [9] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189, 2015.
  • [10] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [11] André R Gonçalves, Puja Das, Soumyadeep Chatterjee, Vidyashankar Sivakumar, Fernando J Von Zuben, and Arindam Banerjee. Multi-task sparse structure learning. In International Conference on Information and Knowledge Management, pages 451–460. ACM, 2014.
  • [12] André R Gonçalves, Fernando J Von Zuben, and Arindam Banerjee. Multi-task sparse structure learning with gaussian copula models. Journal of Machine Learning Research, 17(1):1205–1234, 2016.
  • [13] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073. IEEE, 2012.
  • [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • [15] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [17] Yu-Gang Jiang, Zuxuan Wu, Jun Wang, Xiangyang Xue, and Shih-Fu Chang. Exploiting feature and class relationships in video categorization with regularized deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2):352–364, 2018.
  • [18] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37, pages 97–105. JMLR. org, 2015.
  • [19] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 1640–1650, 2018.
  • [20] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.
  • [21] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208–2217. JMLR. org, 2017.
  • [22] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • [23] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
  • [24] Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. In AAAI Conference on Artificial Intelligence, 2018.
  • [25] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision, pages 213–226. Springer, 2010.
  • [26] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [27] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
  • [28] A Torralba and AA Efros. Unbiased look at dataset bias. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1521–1528. IEEE Computer Society, 2011.
  • [29] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 4, 2017.
  • [30] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [31] Weichen Zhang, Wanli Ouyang, Wen Li, and Dong Xu. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3801–3809, 2018.
  • [32] Yu Zhang and Dit-Yan Yeung. A convex formulation for learning task relationships in multi-task learning. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 733–742. AUAI Press, 2010.