Utilizing structure information from the label space could help capture the multimodal structure more accurately. Intuitively, the class relationships are supposed to remain consistent across domains (Figure 1), which motivates us to exploit the structure information among semantic classes and inject it into the learning process of DA. Multi-task learning (MTL) jointly learns multiple related tasks through knowledge sharing, where structure learning has gained growing popularity for explicitly exploiting hidden task structures. The Gaussian graphical model is a powerful tool for studying the conditional dependency structure among random variables, and has therefore been widely used for learning the structure of task relationships. Recently, this approach has been extended to exploit class relationships with deep neural networks (DNNs) for improved video categorization performance, which provides an effective solution for characterizing inter-class relationships in our work.
Inspired by this line of work, we first design a single multi-class domain discriminator that implements class-specific domain classification. In doing so, we encourage knowledge sharing across classes for domain classification, which enables the learning of inter-class dependencies and also favors a parsimonious network. We then introduce a structure regularization that constrains the class relationships captured by the domain discriminator to maximally agree with the inter-class dependencies revealed by label prediction on source domain data. Given that this work focuses on how class relationships can be incorporated to improve DA, we build our model on top of the domain adversarial neural network (DANN), a plain ADA framework. We point out that the presented design and regularizer can be seen as an “add-on” and can be easily integrated into other ADA frameworks. Experiments on benchmark datasets show that the proposed approach outperforms the competing methods.
2 Related Work
Adversarial Domain Adaptation
Deep DA methods attempt to generalize deep neural networks across different domains. The most commonly used approaches are based on discrepancy minimization [30, 18, 26, 21, 20, 5] or adversarial training [9, 10, 29, 31]. Adversarial training, inspired by generative modeling in GANs, is an important approach for deep transfer learning tasks. DANN introduces a domain discriminator that classifies whether a sample comes from the source or the target domain [10, 9]. With a gradient reversal layer (GRL), it promotes the learning of discriminative features for classification while ensuring that the learned feature distributions over the two domains are similar. Recent works realize the importance of exploiting the complex structure behind the data distributions for DA, rather than just aligning the whole source and target distributions [19, 24]. Multi-adversarial domain adaptation (MADA) utilizes information from the label space by assigning class-wise discriminators to capture the multimodal structure induced by different classes. However, the structure information within the label space remains largely unexplored for DA.
Multi-task learning (MTL) seeks to improve generalization performance by transferring knowledge among related tasks. This knowledge-sharing property makes it possible to learn the structure among tasks, so structure learning, which studies how to accurately characterize task relationships, has become a central issue of MTL [11, 32]. As in one of the earliest MTL models, DNNs also share certain commonalities (neurons of the hidden layers) among the neurons of the output layer [4, 17]. Inspired by methods that explicitly model task relationships in MTL [12, 32], recent studies on multi-class classification with CNNs exploit and harness inter-class relationships by imposing a regularization, which has been successfully validated for improving video categorization performance.
In this section, we first discuss how class relationships are modeled with DNNs, followed by the design of a single discriminator that performs class-specific domain classification. We then introduce our RADA algorithm, which keeps the domain adversarial training aware of class relationships.
3.1 Inter-class Dependency Structure Learning with Deep Neural Networks
For the multi-class classification problem, we are given data $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ represents the input features and $y_i$ is the associated label of each sample. A DNN maps the input features of each sample to its associated class through a large number of interconnected neurons. Typically, these neurons are arranged in multiple layers, e.g., convolutional and pooling layers. In the classification task, a stack of fully connected (FC) layers is often placed on top of these layers to predict the final class scores. Considering only the FC layers in a network with $L$ layers in total, we use $W^{(l)}$ and $b^{(l)}$ to denote the weight matrix and bias vector of the neurons in the $l$-th layer. Let $h^{(l-1)}$ and $h^{(l)}$ denote the input and output of the $l$-th layer with an activation function, so that the final output of the network is $h^{(L)}$. For simplicity of the following discussion, we concatenate $b^{(l)}$ to the row vectors of $W^{(l)}$ to obtain a unified weight matrix. The training objective can be calculated through a cross-entropy loss:
Inspired by recent research on learning task relationships in MTL [32, 17, 12], DNNs have been used in classification problems to exploit the inter-class dependency structure through an additional regularization on the output layer that enforces knowledge sharing across classes. One typical way to model the dependency structure among classes is through a precision matrix $\Omega$, in which each off-diagonal element captures the pairwise partial correlation between two classes. Specifically, we assume the row vectors of the weight matrix of the output layer follow a multivariate Gaussian distribution whose precision matrix is $\Omega$. By maximizing its log-likelihood subject to the positive semidefinite constraint, $\Omega$ can be optimized jointly with the training objective in equation (1) by:
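As an illustrative sketch (not the paper's exact implementation), this structure-learning step admits a simple alternating scheme: with the output-layer weights fixed, the precision matrix has a closed-form estimate from the empirical covariance of the class weight vectors; with the precision fixed, the negative Gaussian log-likelihood acts as a penalty on the weights. The function names and the ridge term `eps` are our own assumptions:

```python
import numpy as np

def estimate_precision(W, eps=1e-3):
    """Closed-form precision estimate for the output-layer weight matrix
    W (K classes x d features), treating each of the d columns as one
    sample of a K-dimensional Gaussian. eps adds a small ridge so the
    empirical covariance is invertible (an assumption for stability)."""
    K, d = W.shape
    cov = W @ W.T / d + eps * np.eye(K)  # empirical class covariance
    return np.linalg.inv(cov)            # precision matrix Omega

def structure_penalty(W, Omega):
    """Negative Gaussian log-likelihood (up to constants) used as the
    structure regularizer: tr(Omega W W^T)/d - log det(Omega)."""
    _, d = W.shape
    _, logdet = np.linalg.slogdet(Omega)
    return np.trace(Omega @ W @ W.T) / d - logdet
```

Because the Gaussian likelihood is maximized exactly by the inverse empirical covariance, the closed-form estimate always yields a lower penalty than, e.g., an identity precision.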
3.2 Multi-class Adversarial Domain Adaptation
In an unsupervised domain adaptation (UDA) problem, we are given labeled source domain data and unlabeled target domain data. DANN is designed to extract domain-invariant features between source and target domains through an adversarial training scheme. The whole architecture consists of three parts: a feature extractor $G_f$, a label predictor $G_y$, and a domain discriminator $G_d$. $G_f$ and $G_y$ together form a standard feed-forward DNN for predicting class labels. $G_d$ is trained to discriminate samples between the source and target domains, while $G_f$ is tuned to confuse $G_d$. In the adversarial training procedure, $G_d$ is learned by minimizing a binary cross-entropy loss over the domain labels, while $G_f$ is learned by maximizing this domain loss jointly with minimizing the label prediction loss (equation (3)). This is achieved by integrating a gradient reversal layer (GRL) between $G_f$ and $G_d$, finally ensuring that the feature distributions over the source and target domains are made similar.
where $R(\cdot)$ is the pseudo-function for the GRL, and $\lambda$ is a balancing parameter for the adversarial loss.
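The GRL can be sketched in PyTorch as a custom autograd function: identity in the forward pass, gradient scaled by a negative factor in the backward pass. This is a minimal sketch of the standard construction; the class and function names are our own:

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Gradient reversal layer: forward is the identity, backward
    multiplies the incoming gradient by -lambd."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient; no gradient for lambd.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```

Placing this layer between the feature extractor and the domain discriminator lets a single backward pass simultaneously train the discriminator and push the extractor toward domain-confusing features.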
In order to capture the multimodal structure of the data distribution attributable to different semantic classes for DA, a design with multiple discriminators has been applied, such that one discriminator is responsible for matching the source and target domain data associated with one particular class. This design has been shown to enhance positive transfer and alleviate negative transfer. However, two concerns remain: 1) it makes a strong assumption of orthogonality across classes during distribution alignment, i.e., it neglects the structure information among the semantic classes; and 2) the number of discriminators grows with the number of classes, which elevates the memory cost of the network parameters. To address these concerns, we first present a multi-class ADA, where it should be noted that the way we use “multi-class” differs from its use in standard multi-class classification. Instead of adopting separate discriminators, we use one single discriminator with a multi-branch design to match the multimodal structure across different classes.
Figure 2 illustrates the discriminator within the whole network. One shared hidden layer encodes the common discriminative features between domains for all classes. The shared layer is followed by a layer with class-specific nodes, where each node/branch only predicts the domain label for samples of its associated class and is muted when the domain label is predicted for samples associated with other classes. We denote the binary domain classification loss associated with each class accordingly. With label information, source domain data can be easily assigned to their class-specific nodes. For the unlabeled target domain data, a weighted sum of loss values from the different nodes is calculated, where the probability scores given by the label predictor are used as the weights. Integrating this new design, we update the objective of our multi-class ADA as:
where the class weights are given by the one-hot encoding of the label for source samples and by the predicted class probabilities for target samples.
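A minimal PyTorch sketch of such a single multi-branch discriminator follows. The hidden size, module names, and exact weighting scheme are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiClassDiscriminator(nn.Module):
    """One shared hidden layer followed by K class-specific binary
    branches (one domain logit per class)."""
    def __init__(self, feat_dim, num_classes, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.heads = nn.Linear(hidden, num_classes)  # one logit per branch

    def forward(self, feats, class_weights, domain_labels):
        """class_weights: one-hot labels for source samples, predicted
        class probabilities for target samples (batch x K).
        domain_labels: 1.0 for source, 0.0 for target (batch,).
        Returns the class-weighted binary domain loss."""
        logits = self.heads(self.shared(feats))                  # batch x K
        per_branch = F.binary_cross_entropy_with_logits(
            logits, domain_labels.unsqueeze(1).expand_as(logits),
            reduction="none")                                    # batch x K
        # Each sample contributes only through its (soft) class branch.
        return (class_weights * per_branch).sum(dim=1).mean()
```

Because the shared layer is reused by every branch, the per-class discriminators of a MADA-style design collapse into one parsimonious network.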
3.3 Adversarial Domain Adaptation Being Aware of Class Relationships
Incorporating the information of class relationships into the alignment process between the source and target data distributions relaxes the orthogonality assumption and helps maximally match the multimodal structure of the data distributions. Recall from Section 3.1 that a precision matrix is used to model the inter-class dependency structure with a DNN. By implicitly injecting this structure into the adversarial training process, we can make ADA automatically aware of class relationships.
With the features extracted by the feature extractor, the label predictor outputs class predictions, from which the class relationships can also be characterized. In order for the domain discriminator to capture a similar inter-class dependency structure while aligning the source and target data distributions, the precision matrix $\Omega_d$ estimated from the class-specific domain classification task implemented by the discriminator is supposed to be consistent with the precision matrix $\Omega_y$ estimated from the prediction task done by the label predictor. To maximize this consistency, we propose to minimize the discrepancy between the class relationships respectively learned from the label predictor and the domain discriminator (as shown in Figure 2), where $\Omega_y$ and $\Omega_d$ are defined w.r.t. the weight matrices of the output layers of the two modules. Following equation (2), each precision matrix is solved by maximizing the Gaussian log-likelihood of the corresponding output-layer weights; the closed-form solutions follow straightforwardly from the spectral theorem. We then adopt a structure regularization that minimizes the discrepancy between $\Omega_y$ and $\Omega_d$, measured by the KL divergence. Given that the KL divergence is an asymmetric metric, the discrepancy can be formulated in either direction, yielding two variants of our regularizer.
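For concreteness, under the zero-mean Gaussian weight model of Section 3.1, the KL divergence between the distributions induced by the predictor's precision matrix $\Omega_y$ and the discriminator's precision matrix $\Omega_d$ takes the standard closed form for $K$ classes:

```latex
D_{\mathrm{KL}}\!\left(\mathcal{N}(0,\Omega_y^{-1}) \,\middle\|\, \mathcal{N}(0,\Omega_d^{-1})\right)
= \tfrac{1}{2}\left[\operatorname{tr}\!\left(\Omega_d\,\Omega_y^{-1}\right) - K
+ \log\frac{\det \Omega_y}{\det \Omega_d}\right]
```

Swapping the roles of $\Omega_y$ and $\Omega_d$ gives the other direction of the discrepancy.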
Integrating this penalty into equation (4), we have our final training objective:
where the additional coefficient is a balancing parameter for the relationship-aware regularization term.
4.1 Experiment Setup
Datasets We evaluate our model on two benchmarks. The first is ImageCLEF-DA (http://imageclef.org/2014/adaptation). All the images are collected from three public datasets: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). The images fall into 12 common categories shared by the three datasets, with an equal number of images in each category. We evaluate our method on the transfer tasks built from all domain combinations. The other dataset is Office-31, which consists of images from 31 categories collected from three different domains: Amazon (A), with images downloaded from amazon.com; DSLR (D), with images taken by a digital SLR camera; and Webcam (W), with images recorded by a simple webcam. The images from these different photographical settings represent visual domain shifts. We evaluate our method in terms of classification accuracy on all six transfer tasks.
| MADA (reimplementation) | 63.5 | 84.8 | 99.7 | 67.7 | 59.1 | 64.0 | 73.1 |
Implementation Details The training and testing are implemented in PyTorch. For all transfer tasks, we use stochastic gradient descent (SGD) with momentum to minimize the loss function given by equation (11). We adopt balanced sampling between classes to increase the chance that samples from each category are drawn in each batch. The learning rate is initialized separately for the CNN layers and the fully connected layers, and then exponentially decayed during SGD as training progresses, measured by epoch number. All weights and biases are regularized by weight decay. The balancing parameter for adversarial training is decayed by a fixed factor through the training process, while the coefficient of the structure regularizer is fixed across all experiments. We implement our method and all the compared deep learning methods based on ResNet-50 pre-trained on the ImageNet dataset.
Baselines We follow the standard evaluation protocol for UDA, using all labeled source samples and all unlabeled target samples, and report the mean classification accuracy over three random experiments. We compare RADA with recent state-of-the-art deep transfer learning methods: Deep Domain Confusion (DDC), Deep Adaptation Network (DAN), Residual Transfer Network (RTN), DANN, Adversarial Discriminative Domain Adaptation (ADDA), Joint Adaptation Network (JAN), MADA, Collaborative and Adversarial Network (CAN), and Joint Discriminative Domain Adaptation (JDDA); and with traditional machine learning methods: Transfer Component Analysis (TCA) and Geodesic Flow Kernel (GFK).
4.2 Main Results
The mean classification accuracies on ImageCLEF-DA and Office-31 are reported in Table 1 and Table 2. The results of the baseline methods are taken from previous literature [24, 19, 5, 31, 3]. The two RADA variants correspond to the two directions of the KL-based structure regularizer. As shown in Table 1, both variants outperform the baseline methods across all transfer tasks on both ImageCLEF-DA and Office-31, with one variant slightly ahead of the other. This very similar performance suggests that the choice of the target matrix does not make much difference despite the asymmetry of the adopted discrepancy metric. The number of parameters used by RADA is also greatly reduced compared with multiple-discriminator methods (e.g., MADA): on the ImageCLEF-DA and Office-31 datasets, adopting independent two-layer discriminators, one per class, generates many more parameters than RADA's single multi-branch discriminator. This reduction matters in practice, especially when the number of classes is very large. The improved performance with an even simpler network highlights the significance of incorporating class relationships into the adversarial training process. Aligning the class relationships between the label predictor and the domain discriminator introduces more structure information from the label space into adversarial training, and efficiently promotes the learning of transferable representations by the feature extractor.
We additionally provide evaluations on the partial domain adaptation (PDA) problem, where the target label space is a subset of the source label space. PDA is more challenging and practical than standard DA, since the outlier classes in the source domain can cause negative transfer when discriminating the target classes [3, 24]. To show the robustness of our method under PDA, we follow a benchmark experimental setup: from Office-31, we use all the categories for the source domain and choose the ten categories shared with Caltech-256 for the target domain, so that across all transfer tasks the source domain contains 31 classes and the target domain contains 10 classes. From Table 3, we observe that RADA outperforms ResNet and other general DA methods, especially on several of the transfer tasks, which suggests it successfully avoids the negative transfer trap.
4.3 Empirical Analysis
Feature Visualization In order to visualize the embedded data, we use t-SNE to project the feature representations after pool5 of ResNet-50, trained respectively with DANN, MADA, and RADA, into a lower-dimensional space. The two-dimensional map of the embedded target domain data from one transfer task is visualized in Figure 3, where the class information is indicated by assigning data points different colors and numerical labels. We observe that the embedded features from different classes are better separated by RADA and MADA than by DANN. Although MADA is able to separate most of the data points according to their class labels, several classes around the center are still mixed up, whereas RADA separates those points better. By integrating the information of class relationships, RADA can better extract the features uniquely belonging to each class and capture the modes of the data distribution.
Class Relationships Partial correlation is a symmetric measure of association between two variables while controlling for the effect of other variables, and is commonly used to model the conditional dependencies among a group of variables. The partial correlation between classes $j$ and $k$ can be calculated from the elements of the precision matrix as $\rho_{jk} = -\omega_{jk}/\sqrt{\omega_{jj}\,\omega_{kk}}$. With the precision matrices estimated by RADA on one transfer task, we calculate and visualize the partial correlations among all classes in Figure 4, where the estimate from the label predictor is displayed on the upper triangular part and that from the discriminator on the lower triangular part. The symmetry of the heatmap in Figure 4(a) indicates that our regularization successfully encourages the class relationships to be consistent between the label predictor and the discriminator. Some class relationships are interesting and intuitive. For example, not surprisingly, the class paper notebook is found to be positively associated with ring binder by both the label predictor and the discriminator in RADA (black-framed cell in Figure 4(a)). Aware of such class relationships, our method avoids misclassifying several images of ring binder (Figure 4(b)) as paper notebook, unlike MADA (black-framed cell in Figure 4(c)).
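The partial-correlation computation from a precision matrix can be sketched in a few lines (NumPy; the function name is illustrative):

```python
import numpy as np

def partial_correlations(Omega):
    """Partial correlation rho_jk = -omega_jk / sqrt(omega_jj * omega_kk)
    computed elementwise from a precision matrix Omega; the diagonal is
    set to 1 by convention."""
    d = np.sqrt(np.diag(Omega))
    rho = -Omega / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho
```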
Distribution Discrepancy Proxy A-distance (PAD) [2, 10] is a widely used metric for measuring the distributional discrepancy of features between the source and target domains. PAD is defined as $d_A = 2(1 - 2\epsilon)$, where $\epsilon$ is the classification error (e.g., mean absolute error) of a domain classifier (e.g., an SVM). Generally, a lower PAD indicates better generalization ability. As shown in Figure 5, on two transfer tasks RADA consistently achieves lower PAD than DANN and MADA, indicating that RADA better extracts domain-invariant features. In addition, the PAD values of one RADA variant are slightly lower than those of the other, showing a slightly better generalization ability.
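The PAD computation is simple enough to state directly (a sketch; the error is the test error of any source-vs-target domain classifier):

```python
def proxy_a_distance(domain_error):
    """Proxy A-distance d_A = 2 * (1 - 2 * err). A chance-level domain
    classifier (err = 0.5) gives 0 (well-aligned domains); a perfect
    one (err = 0) gives 2 (fully separable domains)."""
    return 2.0 * (1.0 - 2.0 * domain_error)
```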
We present a novel approach to DA that reveals structure information from the label space for aligning complicated data distributions during adversarial training. We propose a new design of a multi-class domain discriminator and a novel regularizer that aligns the inter-class dependencies respectively characterized by the label predictor and the domain discriminator. Experiments show that considering class relationship information can substantially improve transfer learning performance.
-  Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 224–232. JMLR. org, 2017.
-  Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137–144, 2007.
-  Zhangjie Cao, Lijia Ma, Mingsheng Long, and Jianmin Wang. Partial adversarial domain adaptation. In European Conference on Computer Vision, pages 135–150, 2018.
-  Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
-  Chao Chen, Zhihong Chen, Boyuan Jiang, and Xinyu Jin. Joint domain alignment and discriminative feature learning for unsupervised deep domain adaptation. In AAAI Conference on Artificial Intelligence, 2019.
-  Yi-Hsin Chen, Wei-Yu Chen, Yu-Ting Chen, Bo-Cheng Tsai, Yu-Chiang Frank Wang, and Min Sun. No more discrimination: Cross city adaptation of road scene segmenters. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2011–2020. IEEE, 2017.
-  Xiangzhao Cui, Chun Li, Jine Zhao, Li Zeng, Defei Zhang, and Jianxin Pan. Covariance structure regularization via frobenius-norm discrepancy. Linear Algebra and its Applications, 510:124–145, 2016.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
-  Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189, 2015.
-  Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030, 2016.
-  André R Gonçalves, Puja Das, Soumyadeep Chatterjee, Vidyashankar Sivakumar, Fernando J Von Zuben, and Arindam Banerjee. Multi-task sparse structure learning. In International Conference on Information and Knowledge Management, pages 451–460. ACM, 2014.
-  André R Gonçalves, Fernando J Von Zuben, and Arindam Banerjee. Multi-task sparse structure learning with gaussian copula models. Journal of Machine Learning Research, 17(1):1205–1234, 2016.
-  Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073. IEEE, 2012.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
-  Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  Yu-Gang Jiang, Zuxuan Wu, Jun Wang, Xiangyang Xue, and Shih-Fu Chang. Exploiting feature and class relationships in video categorization with regularized deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2):352–364, 2018.
-  Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37, pages 97–105. JMLR. org, 2015.
-  Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 1640–1650, 2018.
-  Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.
-  Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208–2217. JMLR. org, 2017.
-  Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
-  Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
-  Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. In AAAI Conference on Artificial Intelligence, 2018.
-  Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision, pages 213–226. Springer, 2010.
-  Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
-  Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
-  A Torralba and AA Efros. Unbiased look at dataset bias. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1521–1528. IEEE Computer Society, 2011.
-  Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 4, 2017.
-  Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
-  Weichen Zhang, Wanli Ouyang, Wen Li, and Dong Xu. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3801–3809, 2018.
-  Yu Zhang and Dit-Yan Yeung. A convex formulation for learning task relationships in multi-task learning. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 733–742. AUAI Press, 2010.