Improving Unsupervised Domain Adaptation by Reducing Bi-level Feature Redundancy

12/28/2020 · by Mengzhu Wang, et al.

Reducing feature redundancy has shown beneficial effects for improving the accuracy of deep learning models, and it is therefore also indispensable for models of unsupervised domain adaptation (UDA). Nevertheless, most recent efforts in the field of UDA ignore this point. Moreover, the main schemes that realize this goal are generally designed independently of UDA and involve only a single domain, and thus might not be effective for cross-domain tasks. In this paper, we emphasize the significance of reducing feature redundancy for improving UDA in a bi-level way. At the first level, we ensure compact domain-specific features with a transferable decorrelated normalization module, which preserves specific domain information whilst easing the side effect of feature redundancy on the subsequent domain invariance. At the second level, the domain-invariant feature redundancy caused by the domain-shared representation is further mitigated via an alternative, brand-new orthogonality for better generalization. These two novel aspects can be easily plugged into any BN-based backbone neural network. Specifically, simply applying them to ResNet50 has achieved competitive performance with the state-of-the-art on five popular benchmarks. Our code will be available at https://github.com/dreamkily/gUDA.


I Introduction

Recent advances in deep neural networks have brought about extraordinary performance in various visual tasks, especially in image classification. Yet, for cross-domain classification tasks, a classifier directly trained on one large annotated dataset (source domain) can degrade on another dataset (target domain) due to the problem of domain shift. A candidate solution to this issue is domain adaptation [1], [2], [3], [4], [5], which fills the distribution gap between the source domain and the target domain. In practice, unsupervised domain adaptation (UDA) is a promising technique since it does not require the target dataset to be annotated during training. Nonetheless, this also leads to a series of difficulties, of which the most challenging one is how to leverage the unlabeled data from the target domain to reduce domain shift.

The early research efforts in this respect use proper distance metrics such as the maximum mean discrepancy (MMD) [6] and its variants, including the maximum mean and covariance discrepancy (MMCD) [7], to measure the inter-domain feature distribution, and then train a model to minimize such distance metrics. Afterwards, a series of attempts [8], [9] reweight or select key instances [10] in the source domain to minimize MMD with class-wise information. Other works, such as the deep confusion network [3], treat MMD and its variants [11] as a regularization to bound the learned feature distribution across domains. With the advent of generative adversarial networks (GAN) [12], adversarial learning has become another main line of confusing feature distributions to learn domain-invariant features. In this line, a domain-invariant feature generator is trained for two domains to fool a learned discriminator in a two-player game, where the generator learns domain-invariant features, whereas the discriminator helps induce the domain-specific features. For instance, ADDA [13] is the first to exploit GANs for domain adaptation. CDAN [4] simultaneously confuses features and labels across domains by aligning multi-modal structures across domains via a multi-linear conditioning strategy. Differently, MADA [14] captures multi-modal structures within individual domains by using multiple domain discriminators. Parallel to such studies, another line focuses on exploring network modules like Batch Normalization (BN) [15] for improving UDA. To make BN applicable to domain adaptation tasks, AdaBN [16] refines batch normalization to leverage the first- and second-order statistics of the target domain and transfers knowledge from the source domain to the target domain. Transferable batch normalization [17] delves into the channel-wise transferability of normalization techniques for domain adaptation. Apparently, such preceding arts predominantly focus on learning transferable feature representations, but few consider whether the learned feature representations contain redundant information and how feature redundancy affects UDA performance.

As stated in [18], convolutional neural networks (CNNs) have significant redundancy between filters and feature channels. This, in a sense, obstructs network performance gains owing to the imbalanced distribution of the feature spectrum. Recently, promoting diversity via weight orthogonality [19] and decorrelating features [20] have been shown to make the feature spectrum near uniform in different manners. Nevertheless, little progress explicitly applies such insights to UDA. Importantly, such insights are in general designed for single-domain visual tasks and may not be suitable for cross-domain counterparts. To this end, we revisit the UDA problem from the perspective of feature redundancy and propose to prevent feature redundancy for improving UDA models in a bi-level way. Specifically, we first explore a transferable decorrelated batch normalization module (TDBN) to prevent domain-specific feature redundancy by jointly learning transferable and decorrelated features. We then explore a novel orthogonality-based regularization for enhancing the diversity of domain-invariant features to mitigate the corresponding feature redundancy. As a result, by simply plugging the two components into a pre-trained BN-based backbone ResNet50, a deep UDA model is constructed to simultaneously reduce the redundancy of both intra-domain and inter-domain features, which eases overfitting and achieves better generalization. Through comprehensive experiments on five widely-used UDA benchmarks, reducing the bi-level feature redundancy shows promising efficacy for UDA, and the corresponding deep UDA model yields encouraging performance as compared to several state-of-the-art siblings.

Our main contributions are summarized as follows:

We provide a novel perspective on how to reduce feature redundancy for improving UDA. A bi-level way to reduce feature redundancy is proposed, which can be easily coupled with any BN-based backbone network for UDA.

A transferable decorrelated normalization module (TDBN) is devised to prevent domain-specific feature redundancy through relieving intra-domain feature co-adaptation for better generalization.

A novel orthogonal regularization is proposed to enhance inter-domain feature diversity, serving as an alternative approach to realizing orthogonality. It helps reduce domain-invariant feature redundancy by stabilizing the feature distributions as well as regularizing the network to induce compact features. Besides, it yields (near) orthogonality without using singular value decomposition (SVD).

II Related Work

In this section, we review some recent trends in unsupervised domain adaptation that are most related to our approach.

II-A Unsupervised Domain Adaptation

The goal of UDA is to transfer knowledge [21], [22], [23] from an annotated source domain to another unlabeled target domain by reducing domain shift. From the standpoint of feature learning, many UDA studies can be considered as either domain-invariant feature learning or domain-specific feature learning.

Domain-invariant feature learning. In UDA, the primary pursuit is to learn domain-invariant features. Many mainstream approaches [24], [25], [7] belong to this category of domain-invariant feature learning and achieve this pursuit by aligning data distributions across domains in either a coarse or a fine-grained manner.

The idea behind coarse alignment approaches is the usage of either global distribution alignment or global data structure. For global distribution alignment, distribution discrepancy metrics such as WMMD [24], CMD [25], and MMCD [7], which consider data statistics, are directly minimized by the UDA models. Correlation alignment (CORAL) [2] minimizes the discrepancy between the covariance matrices of the whole source features and the target features. Wasserstein distance guided representation learning (WDGRL) [26] learns domain-invariant representations to reduce domain shift based on the Wasserstein distance. Also, adversarial learning has been widely used for coarse domain alignment. Domain adversarial neural networks (DANN) [27] confuse domain features via a domain adversarial loss to induce domain-invariant representations. The joint adaptation network (JAN) [3] incorporates MMD [6] with adversarial learning to align the joint distributions of specific layers across domains. Besides, the deep relevance network considers a global low-rank structure to align different domains via discriminative relevance regularization (DRR) [28]. For fine-grained alignment, such efforts seek to match multimode data structures like class-wise margins and data distributions [14], [4], [29], [30]. Multi-adversarial domain adaptation (MADA) [14] captures the multimode structures to enable fine-grained alignment of different data distributions based on local domain discriminators so as to learn domain-invariant features. The conditional domain adaptation network (CDAN) [4] simultaneously aligns features and labels and overcomes mode mismatch in adversarial learning for domain invariance. Cycle-consistent adversarial domain adaptation (CyCADA) [29] introduces a cycle-consistency loss to ensure the consistency of local semantic mappings.

Domain-specific feature learning. Domain-specific feature learning assumes that different domains carry different information, which should be separated from domain-invariant features. In general, methods in this respect devise network components and loss functions for each domain. Among them, Batch Normalization (BN) is a versatile network component that has been extended to learn domain-specific features for UDA [23], [17]. DSBN [23] adapts to both domains by specifying separate batch normalization layers to explicitly separate domain-specific information from domain-invariant features. Transferable Normalization (TransNorm) [17] associates domain-specific features via an attention mechanism to realize transferability. Adaptive Batch Normalization (AdaBN) [16] proposes a brute-force post-processing method that replaces batch normalization statistics with the counterparts computed on target samples. Domain-specific whitening transform (DWT) [20] proposes domain-specific alignment layers to align the domain-specific covariance matrices of intermediate features. Different from these methods, which only encourage transferability across domains, this work focuses on the reduction of feature redundancy for better generalization. Besides, our method learns domain-specific and domain-invariant features simultaneously, and enjoys both benefits.

II-B Generalization Strategies for DNNs

Many approaches seek to improve DNNs for better generalization by easing overfitting. Towards this goal, several regularizers and network structures have been developed to improve model generalization ability. For instance, early stopping [31] achieves this aim by keeping finite samples from being abused as early as possible. Weight decay [32] penalizes large weights in stochastic gradient descent (SGD), like $\ell_2$ regularization, to reduce model complexity. Decoupled weight decay regularization [33] further extends weight decay to be applicable to Adam by decoupling weight decay from the gradient-based update. Model ensembling [34] is an effective family of methods for model generalization that averages the inference results of multiple models. Among them, AdaBoost [35] is a basic and versatile form, which combines a group of weak classifiers to yield a united stronger classifier. Data augmentation [36] augments a few instances with hand-crafted variations to help the model learn invariant features, and behaves like a regularizer, thereby easing overfitting. Dropout [37], [38] also regularizes the network thanks to its model-ensemble behavior. Different from Dropout, DropConnect [39] randomly drops a portion of the weights rather than the feature activations to regularize the network in an implicit way. BN [15], [40] also handles the internal covariate shift problem to stabilize network training. To reduce feature redundancy, decorrelated batch normalization [41] takes a step towards better generalization via decorrelated learning. Besides, orthogonality [19] stabilizes the distribution of network activations with efficient convergence and achieves better generalization. Such techniques reported in the literature have shown the ability to improve network generalization. Motivated by such insights, we utilize decorrelated learning and orthogonality to improve the UDA model from the viewpoint of redundancy minimization.

II-C Orthogonality in DNNs

As claimed in [19], [42], [43], orthogonality imposed on the weights is able to stabilize the optimization of DNNs by preventing the explosion or vanishing of back-propagated gradients. In [39], orthogonality is applied in signal processing since it is capable of preserving activation energy and reducing redundancy in representations. Yang et al. [43], [44] showed that orthogonal weight normalization in nonlinear sigmoid networks can attain dynamical isometry. Huang et al. [42] showed that a novel orthogonal weight normalization method can guarantee stability and formulated this problem for feed-forward neural networks. Also, Huang et al. [45] proposed a computationally efficient and numerically stable orthogonalization method to learn a layer-wise orthogonal weight matrix in DNNs, which enables control over the orthogonality of a weight matrix. There is also research on recurrent neural networks (RNNs) that builds on orthogonal matrices to avoid the gradient vanishing problem. Different from the above, our method is related to methods that impose an orthogonal regularization in the loss function under the Frobenius norm; it can stabilize training by ensuring that the fully-connected layer outputs follow identical distributions, which reduces feature co-adaptation. Orthogonal to such siblings, the proposed (near) orthogonality is implicitly obtained through joint matrix trace and determinant constraints in theory. To the best of our knowledge, this orthogonality is first mentioned here and applied to the UDA problem.

III Method

This section reconsiders the UDA problem from the viewpoint of feature redundancy and details how to reduce bi-level feature redundancy to improve the unsupervised domain adaptation (UDA) model.

III-A Preliminary

In this paper, we focus on UDA, where the source domain has sufficient labels whereas the target domain only has unlabeled examples. In UDA, we suppose $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ and $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$ to be the labeled source data and the unlabeled target data, drawn from different distributions, respectively. The data probability distributions of $\mathcal{D}_s$ and $\mathcal{D}_t$ are $P$ and $Q$, respectively, where $n_s$ and $n_t$ denote the numbers of source and target samples; in UDA we have $P \neq Q$. Our goal is to predict the target labels and reduce the difference between the two distributions.

III-B Model

Fig. 1: The distribution of the feature spectrum and the layer-wise feature similarity of ResNet50, DANN, and our UDA model, respectively: the proposed bi-level way of reducing feature redundancy clearly makes the feature spectrum near uniform as compared to both ResNet50 and DANN. Note that, for the results of our model, the notations BN1, ..., BN6 are abused here for conciseness; they actually correspond to the proposed BN modules. In ResNet50 and DANN, they stand for vanilla batch normalization modules.

Motivation. In the field of UDA, mainstream methods focus on filling the distribution gap between the source and the target caused by domain shift. As a matter of fact, the inherent vulnerabilities of deep networks themselves, such as instability in network learning and redundancy in feature representations, are easily neglected. Apparently, such internal weaknesses could further deteriorate in cross-domain tasks owing to cross-domain data heterogeneity. Nevertheless, few efforts truly discuss the effect of this insight on domain adaptation; in contrast, most arts consider learning discriminative or transferable feature patterns for cross-domain classification tasks. Different from such arts, this study primarily aims to prevent feature patterns from being redundant, namely, reducing feature redundancy or redundancy minimization.

Recent studies [46] claim that the imbalanced distribution of the feature spectrum could contribute to feature redundancy, and, fortunately, feature diversity or feature decorrelation can mitigate this issue to some degree. Besides, many works judge feature redundancy by feature similarity. Inspired by such previous insights, we scrutinize whether there exists feature redundancy in existing UDA models, in light of both the distribution of the feature spectrum and the layer-wise feature similarity. As Fig. 1 shows, the distributions of the feature spectrum of both ResNet50 and the domain adversarial neural network (DANN) on a transfer task of the Office-Home dataset are somewhat imbalanced, and meanwhile their corresponding features, coming from the channels in different BN layers and the features in Fc layers, are still highly similar. In particular, DANN has better transferability with lower feature similarity than ResNet50, but still suffers from a relatively serious distribution imbalance of the spectrum. This implies that low feature similarity alone is not enough to prevent feature redundancy. Conversely, ResNet50 has weaker transferability with higher feature similarity, but has a relatively balanced distribution of the spectrum as compared to DANN. This also indicates that a near uniform distribution of the spectrum alone is still insufficient to ensure redundancy minimization. Based on these discussions, we consider explicitly reducing feature redundancy in a bi-level way, which jointly makes the feature similarity much lower and the distribution of the feature spectrum close to uniform. As shown in Fig. 1, in contrast with both ResNet50 and DANN, the proposed approach plugged into ResNet50 can significantly reduce the layer-wise feature similarity of ResNet50 as well as mitigate the corresponding imbalanced feature spectrum. Generally speaking, making the feature spectrum near uniform is beneficial for stabilizing network learning and model convergence, while reducing feature similarity assists in achieving better model generalization. Since the proposed method achieves both goals, killing two birds with one stone, it is definitely helpful for reducing feature redundancy to promote UDA performance.

Building on the above analysis, the proposed approach to reducing feature redundancy contains two aspects or levels: the first level is designed for the reduction of domain-specific (or intra-domain) feature redundancy, while the second seeks to prevent domain-invariant (or inter-domain) feature redundancy. Fig. 1(b) shows that, in both ResNet50 and DANN, the domain-specific features learned by the BN layers and the domain-invariant features learned by the Fc layers are redundant to a certain extent, while our results have lower feature similarity and a near uniform distribution of the feature spectrum. Thus, this bi-level manner is feasible. In detail, we first try to learn domain-specific features that are as compact as possible through a transferable decorrelated normalization module (TDBN), with the goal of maintaining specific domain information as well as reducing the risk of this feature redundancy affecting the following domain invariance. Then, the domain-invariant feature redundancy caused by the domain-shared representation is further reduced by using a novel orthogonality for feature diversity. The above bi-level scheme is lightweight and can be easily plugged into ResNet50 as our UDA model (Ours in short), as sketched below. As the empirical studies show, the proposed UDA model without extra hassle achieves competitive performance with several state-of-the-arts.
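To make the "plug-in" claim concrete, the following is a hedged sketch of how the two components could be attached to a BN-based backbone. The helper names (swap_bn_for_tdbn, the TDBN class placeholder) are our own illustrative assumptions, not the authors' code; the ImageNet initialization step described later is omitted here.

```python
import torch
import torchvision

def swap_bn_for_tdbn(module, tdbn_factory):
    """Recursively replace every BatchNorm2d in `module` with a TDBN-style
    module produced by `tdbn_factory` (a placeholder for Section III-C)."""
    for name, child in module.named_children():
        if isinstance(child, torch.nn.BatchNorm2d):
            setattr(module, name, tdbn_factory(child.num_features))
        else:
            swap_bn_for_tdbn(child, tdbn_factory)

backbone = torchvision.models.resnet50()                       # BN-based backbone (ImageNet init omitted)
backbone.fc = torch.nn.Linear(backbone.fc.in_features, 65)     # e.g., 65 Office-Home classes
# swap_bn_for_tdbn(backbone, TDBN)   # TDBN: the module of Section III-C (not shown here)
# The orthogonal regularizer of Section III-D is then added on backbone.fc.weight.
```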

Fig. 2: Illustration of the proposed UDA model based on ResNet50 with a bi-level way to prevent feature redundancy: (a) each BN is replaced with TDBN for the purpose of reducing domain-specific feature redundancy; (b) the pre-softmax weights in the fully-connected layer are constrained to be (near) orthogonal by an orthogonal regularizer, which keeps domain-invariant features from being redundant.

The Flowchart. For clarity, the proposed UDA model is illustrated in Fig. 2. To hinder the overfitting risk caused by domain-specific feature redundancy, TDBN first mitigates feature co-adaptation and then enhances feature transferability. Different from existing BN-based modules, TDBN not only reduces feature redundancy but also learns transferable features, without introducing any extra parameters. Albeit simple, the performance gain is attractive. For the ultimate classifier in the Fc layer, the pre-softmax weights are constrained to be (near) orthogonal for domain-invariant feature diversity, thereby reducing feature redundancy. This is because the orthogonality extracts more compact feature bases to span the whole feature subspace, and thus it encourages the model to learn compact features. This property also helps reduce the risk of model overfitting. Besides, the orthogonality has additional appealing properties: accelerating model convergence and stabilizing network training. The difference from existing orthogonalities lies in that the proposed orthogonality involves neither singular value decomposition (SVD) nor the iterative calculation of the largest and smallest singular values.

III-C Reducing Domain-specific Feature Redundancy

Recall that domain-specific feature learning assumes that each domain has distinct information, and thus domain-specific features should be separated from domain-invariant features. In the field of UDA, batch normalization (BN) has been extended to learn domain-specific features by calculating domain-independent statistical information, i.e., the mean and variance, for standardization. It serves as an important network component that accelerates network convergence by reducing the internal covariate shift. However, most BN-based modules are based on standardization only, and hence have no ability to simultaneously reduce feature redundancy and realize feature transferability. Towards this goal, we couple decorrelated BN [41] with the channel-wise transferability attention idea from transferable BN [17] to jointly enjoy their strengths: the former mitigates feature co-adaptation to reduce feature redundancy, while the latter focuses on feature transferability across domains. For completeness, we detail our BN module, termed Transferable Decorrelated Batch Normalization (TDBN), below.

In a BN module, given a mini-batch input, the corresponding standard normalized outputs become:

$\hat{x}_i = \dfrac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta, \qquad i = 1, \dots, m \qquad (1)$

where $\mu$ and $\sigma^2$ are the mean and variance of the mini-batch, $\epsilon$ is a predefined small number to prevent numerical instability, both $\gamma$ and $\beta$ are extra learnable parameters, and $m$ is the number of samples in the mini-batch.
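For illustration, a minimal PyTorch sketch of the standardization and affine transform in Eq. (1) (training-mode batch statistics only; running averages and the per-channel reshaping of convolutional features are omitted) might look as follows:

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Eq. (1): standardize a mini-batch x of shape (m, d), then rescale/shift."""
    mu = x.mean(dim=0)                       # mini-batch mean
    var = x.var(dim=0, unbiased=False)       # mini-batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

x = torch.randn(24, 256)                     # mini-batch of 24 samples, as in our settings
gamma, beta = torch.ones(256), torch.zeros(256)
y = batch_norm(x, gamma, beta)
```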

BN merely performs feature standardization without feature decorrelation, and thus it does not reduce feature redundancy. Huang et al. [41] proposed a Decorrelated Batch Normalization (DBN) module to address this issue, and then improved DBN with an iterative optimization algorithm for efficiency [47]. Intrinsically, both amount to performing ZCA whitening on the centered inputs. The concrete formula of this process is:

$\mu = \dfrac{1}{m}\sum_{i=1}^{m} x_i, \qquad \bar{x}_i = x_i - \mu \qquad (2)$
$\hat{x}_i = \Sigma^{-1/2}\,\bar{x}_i \qquad (3)$

where the covariance matrix is

$\Sigma = \dfrac{1}{m}\sum_{i=1}^{m} \bar{x}_i \bar{x}_i^{\top} + \epsilon I \qquad (4)$

and its singular value decomposition (SVD) takes the form $\Sigma = U \Lambda U^{\top}$, wherein $\Lambda$ and $U$ are the singular values and the singular vectors, respectively. Iterative whitening normalization [47] efficiently computes $\Sigma^{-1/2}$ by the Newton iteration method; the iterates can be calculated as follows:

$\Sigma_N = \Sigma / \mathrm{tr}(\Sigma), \qquad P_0 = I, \qquad P_t = \dfrac{1}{2}\big(3P_{t-1} - P_{t-1}^{3}\Sigma_N\big), \quad t = 1, \dots, T \qquad (5)$

where $\mathrm{tr}(\Sigma)$ indicates the trace of $\Sigma$. Then $\Sigma^{-1/2}$ can be calculated as follows:

$\Sigma^{-1/2} = P_T / \sqrt{\mathrm{tr}(\Sigma)} \qquad (6)$
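For concreteness, a minimal sketch of this whitening step with the Newton iteration (following the iterative normalization recipe of [47]; the iteration count and tensor shapes are our assumptions) could read:

```python
import torch

def newton_whitening(x, n_iter=5, eps=1e-5):
    """ZCA-style whitening of a mini-batch x of shape (m, d), where the inverse
    square root of the covariance is obtained by Newton iterations instead of SVD."""
    m, d = x.shape
    xc = x - x.mean(dim=0, keepdim=True)              # center the batch, Eq. (2)
    sigma = xc.t() @ xc / m + eps * torch.eye(d)      # covariance matrix, Eq. (4)
    trace = sigma.diagonal().sum()
    sigma_n = sigma / trace                           # trace-normalized covariance
    p = torch.eye(d)
    for _ in range(n_iter):                           # Newton iteration, Eq. (5)
        p = 0.5 * (3.0 * p - p @ p @ p @ sigma_n)
    whitening = p / trace.sqrt()                      # approximates sigma^{-1/2}, Eq. (6)
    return xc @ whitening                             # whitened activations, Eq. (3)

x_white = newton_whitening(torch.randn(24, 64))
```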

Whitening the activations ensures that all the dimensions along the eigenvectors have equal importance in the subsequent linear layer, whereas standardization only ensures that the normalized output gives equal importance to each dimension by multiplying a diagonal scaling matrix. Obviously, whitening alone ignores the interplay between the source domain and the target domain. Similar to TransNorm, which aligns the sufficient statistics of both domains via channel transferability, we also quantify the transferability of different channels and adapt them for the two domains. For each channel of the whitened data, the importance of the different channels can be determined by:

(7)

where

(8)

wherein $C$ denotes the number of channels in the layer. In contrast to TransNorm [17], the distance is replaced by a similarity without introducing extra parameters. Following [17], we get the final output of the proposed TDBN module as below:

(9)
(10)
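Since Eqs. (7)-(10) follow the TransNorm recipe with the distance replaced by a similarity, a hedged sketch of such a channel-attention reweighting is given below. The concrete similarity (inverse gap between the per-channel means of the whitened source and target batches) and the variable names are our assumptions, not necessarily the authors' exact formulas.

```python
import torch

def channel_reweight(y_s, y_t):
    """Channels whose whitened source/target statistics agree are deemed more
    transferable and are amplified (in the spirit of Eqs. (7)-(10))."""
    num_channels = y_s.size(1)
    gap = (y_s.mean(dim=0) - y_t.mean(dim=0)).abs()   # per-channel statistic gap
    sim = 1.0 / (1.0 + gap)                           # similarity: large when the gap is small
    alpha = num_channels * sim / sim.sum()            # normalized channel attention
    return y_s * (1.0 + alpha), y_t * (1.0 + alpha)   # reweighted source/target outputs

y_s, y_t = torch.randn(24, 64), torch.randn(24, 64)   # whitened mini-batches (shapes assumed)
out_s, out_t = channel_reweight(y_s, y_t)
```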

III-D Preventing Domain-invariant Feature Redundancy

Domain-invariant features are features that transfer across domains, and learning them is always a predominant aim of UDA. Hence, they play a critical role in the ultimate inference ability of the classifier on the target domain. Considering this, we constrain the pre-softmax weights in the fully-connected (Fc) layer, which behaves like a classifier, to induce domain invariance. Recall that increasing feature diversity can assist in preventing feature redundancy, and orthogonality has exactly this ability. Thus, we explore an orthogonal regularization over the pre-softmax weights of the classifier for feature diversity.

As is well known, orthogonality has shown great success in improving the stability of deep neural networks, because it can preserve energy, stabilize activation distributions, and ensure a uniform spectrum. Due to such appealing properties, many studies explore orthogonality [19], [42], [45], [48] in different ways to promote DNNs. Some studies regard it as an optimization problem on Stiefel manifolds [49], while others consider its soft version as a differentiable regularizer [19] to regularize CNNs. Such novel insights spur us to develop an alternative orthogonal regularizer that avoids iterative spectrum computation for efficiency. Moreover, it can guarantee the singular values to be near one in theory. This significantly differs from most preceding arts [19].

Prior to detailing our orthogonality, we first present its constraint form as below:

$\mathrm{tr}(WW^{\top}) = k, \qquad \det(WW^{\top}) = 1 \qquad (11)$

where $W \in \mathbb{R}^{k \times d}$ is the pre-softmax weight matrix. Note that our orthogonal constraint is not limited to the pre-softmax weights. $\det(\cdot)$ denotes the determinant of a square matrix, while $\mathrm{tr}(\cdot)$ is the trace of a square matrix. This constraint can implicitly ensure the weight matrix to be orthogonal in theory.

Lemma 1. Any non-singular symmetric matrix $M = WW^{\top}$ has singular values that are all equal to one under the orthogonal constraint of Eq. (11).

Proof  By simple algebra, one knows:

$\det(M) = \prod_{i=1}^{k}\sigma_i, \qquad \mathrm{tr}(M) = \sum_{i=1}^{k}\sigma_i \qquad (12)$

where $\sigma_i$ is the $i$-th singular value of $M$. According to Eq. (12), the inequalities hold:

$1 = \det(M)^{1/k} = \Big(\prod_{i=1}^{k}\sigma_i\Big)^{1/k} \le \dfrac{1}{k}\sum_{i=1}^{k}\sigma_i = \dfrac{1}{k}\,\mathrm{tr}(M) = 1 \qquad (13)$

Of them, the first and last equalities of (13) follow from the constraint of Eq. (11), and thus hold naturally. According to the basic inequality theory given in the APPENDIX, equality in the middle step is attained only when all $\sigma_i$ are equal, so there holds:

$\sigma_1 = \sigma_2 = \cdots = \sigma_k = 1 \qquad (14)$

This completes the proof.

Theorem 1. The learned weight matrix $W$ under the constraint of Eq. (11) could be orthogonal, i.e., $WW^{\top} = I$ (which requires $k \le d$).

Proof. Assume that $M = WW^{\top}$, where $W \in \mathbb{R}^{k \times d}$. Then, $M$ is most likely a non-singular symmetric matrix; in practice, this point can be guaranteed. Assume that there exists the singular value decomposition of $W$, i.e., $W = USV^{\top}$. Accordingly, $M = USS^{\top}U^{\top}$. According to Lemma 1, the singular values of $M$ are all ones because $M$ satisfies Eq. (11). Since $SS^{\top}$ is the square matrix that holds all the squared singular values of $W$, there still holds $SS^{\top} = I$. This implies that $WW^{\top} = USS^{\top}U^{\top} = UU^{\top} = I$.
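As a quick numerical illustration of Lemma 1 and Theorem 1 (our own sanity check, not part of the original paper), a matrix with orthonormal rows satisfies the trace and determinant conditions of Eq. (11), and its singular values are indeed all one:

```python
import torch

k, d = 31, 256
q, _ = torch.linalg.qr(torch.randn(d, k))     # d x k matrix with orthonormal columns
w = q.t()                                     # k x d weight with orthonormal rows
gram = w @ w.t()                              # W W^T
print(gram.diagonal().sum())                  # ~= k, i.e., tr(W W^T) = k
print(torch.linalg.det(gram))                 # ~= 1, i.e., det(W W^T) = 1
print(torch.linalg.svdvals(w))                # all singular values ~= 1
```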

To the best of our knowledge, no previous works explicitly mention these findings. Besides, it is non-trivial to directly optimize the above constrained objective. Thus, we treat this orthogonal constraint as a regularization for efficient optimization. This type of relaxation is also the mainstream way of implementing orthogonality. Following this, we rewrite Eq. (11) as:

$\mathcal{L}_{orth} = \lambda\Big[\big(\mathrm{tr}(WW^{\top}) - k\big)^{2} + \big(\det(WW^{\top}) - 1\big)^{2}\Big] \qquad (15)$

where $\lambda$ is a regularization parameter. We resort to automatically computing the gradient of Eq. (15) for simplicity; we need neither the expensive SVD nor the iterative calculation of singular values required by, e.g., the regularizer based on the spectral restricted isometry property [50]. Our orthogonality, in an implicit fashion, encourages the network to learn diverse features so that the features become more compact.
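A hedged PyTorch sketch of this regularizer, assuming the squared-penalty form of Eq. (15) reconstructed above (the classifier dimensions below are hypothetical), is:

```python
import torch

def orthogonal_penalty(weight, lam=1e-3):
    """Push tr(W W^T) towards k and det(W W^T) towards 1, which (by Lemma 1)
    drives the singular values of W towards one without computing any SVD.
    Assumes the squared-penalty reading of Eq. (15)."""
    k = weight.size(0)                         # rows of the pre-softmax weight W (k x d)
    gram = weight @ weight.t()                 # W W^T, a k x k symmetric matrix
    trace_term = (gram.diagonal().sum() - k) ** 2
    det_term = (torch.linalg.det(gram) - 1.0) ** 2
    return lam * (trace_term + det_term)

# usage: add the penalty to the task loss before back-propagation
w = torch.nn.Linear(256, 31).weight            # hypothetical 31-way classifier (Office-31)
loss = orthogonal_penalty(w)
loss.backward()
```

For very large k, penalizing the log-determinant instead may be numerically safer; that variation is our suggestion rather than the paper's choice.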

IV Experiments

Fig. 3: Object recognition sample images from each dataset. (a) Office-31 images from three domains, (b) Office-Home images from four domains, (c) VisDA-C images across twelve classes, and (d) ImageCLEF-DA images from three domains across twelve categories.

Fig. 4: (a) MNIST and USPS dataset images across the ten digit classes, (b) MNIST and SVHN dataset images.

This section details the implementation of our UDA model (Ours in short) and the corresponding training protocols, and then reports our experimental evaluation. We conduct five experiments on both small- and large-scale datasets to evaluate our method against its well-behaved siblings. Moreover, we also present an ablation study to analyze the effect of each component on UDA.

IV-A Experimental Settings

Data Preparation. We conduct five experiments on seven datasets: Office-31, Office-Home, ImageCLEF-DA, MNIST, USPS, SVHN, and VisDA-C. These datasets as popular benchmarks have been widely adopted to evaluate domain adaptation algorithms in previous arts.

Office-31 [51] is a standard benchmark for testing domain adaptation methods. It contains 4,652 images organized in 31 classes from three different domains: Amazon (A), DSLR (D), and Webcam (W). Amazon images are collected from amazon.com, while both Webcam and DSLR images are manually collected in an office environment. We evaluate all the compared methods on all six transfer tasks: A→W, A→D, D→W, W→A, D→A, and W→D.

Office-Home [52] contains four distinct domains, namely Art (A), Clipart (Cl), Product (Pr), and Real World (Rw), and each domain involves 65 different categories. There are 15,500 images in total in the dataset, and thus it serves as a large-scale benchmark for testing domain adaptation methods.

ImageCLEF-DA (http://imageclef.org/2014/adaptation) is the benchmark of the ImageCLEF 2014 domain adaptation challenge, which is composed of 12 common categories shared by three public datasets: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). There are 50 images per category and 600 images in each domain. We evaluate the methods on all six transfer tasks: C→I, C→P, P→I, P→C, I→C, and I→P.

VisDA-2017 [53] is a challenging simulation-to-real dataset with two very distinct domains: synthetic images, i.e., renderings of 3-D models from different angles and under different lighting conditions, and real natural images. It contains 12 classes shared by the training, validation, and test domains.

MNIST (M) comes from the National Institute of Standards and Technology. It is a standard digit recognition dataset covering the ten handwritten digits 0-9 in the form of 28×28-pixel grayscale images, and is divided into training and testing sets. The training set comes from 250 different people, of which 50% are high school students and the rest are Census Bureau workers; the testing set has a similar composition. The dataset contains 60,000 training images and 10,000 testing images. Similar to MNIST (M), USPS (U) also contains the 10 classes 0-9 and has 7,291 training and 2,007 testing images; all the images are 16×16 grayscale pixels. It follows a significantly different probability distribution from MNIST. Besides, SVHN (S) [54] contains colored 10-digit images of size 32×32, with 73,257 training images and 26,032 testing images.

Implementation Details. For fairness, in the object recognition tasks we adopt the same base network, ResNet50 [55], as our backbone; for example, on the Office-Home dataset we use ResNet50. Especially for VisDA-2017, owing to its huge volume, we use ResNet101 as our backbone. We fine-tune all convolutional and pooling layers from the ImageNet pre-trained models and then train the classifier layer by back-propagation, i.e., the network is initialized with ImageNet weights. Our model adopts the same configuration. In the fully-connected layer, we discard the original output dimension and set the number of classes to match each dataset (e.g., 65 for Office-Home). During the training phase, we set the mini-batch size to 24 and the number of epochs to 200. We use the Adam optimizer with an initial learning rate of 0.01 and weight decay. The learning rate decreases by a factor of 10 after 50 and 90 epochs. The maximum number of iterations of the optimizer is set to 10,000. For digit classification, we follow the protocols in [13]. We use the same hyper-parameter setting for all the tasks.
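Under these settings, a hedged sketch of the optimizer and learning-rate schedule is shown below; the model and data pipeline are placeholders, and the weight-decay value is intentionally left unspecified, as in the text.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(2048, 65))     # stand-in for the adapted ResNet50 head
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)   # weight_decay=... as stated (value omitted)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 90], gamma=0.1)

for epoch in range(200):
    # ... one pass over the source/target mini-batches of size 24 goes here ...
    scheduler.step()
```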

We implement the proposed UDA model in PyTorch and report the average classification accuracy on both object recognition and digit classification. On the Office-31 dataset, we compare our method with popular and recent transfer learning methods, including deep residual learning for image recognition (ResNet50) [55], geodesic flow kernel for unsupervised domain adaptation (GFK) [56], domain-adversarial training of neural networks (DAN) [27], joint adaptation networks (JAN) [3], multi-adversarial domain adaptation (MADA) [14], collaborative and adversarial network for unsupervised domain adaptation (iCAN) [57], conditional adversarial domain adaptation (CDAN and CDAN+E) [4], confidence regularized self-training (CRST) [58], and batch nuclear-norm maximization under label insufficient situations (BNM) [59]; the results of the compared methods are taken from the original papers for a fair comparison. On Office-Home, we compare our UDA model with ResNet50, DAN, JAN, and unsupervised domain adaptation using feature-whitening and consensus loss (DWT and DWT-MEC) [20]. On VisDA-2017, we compare it with ResNet101, DAN, deep adversarial neural networks (DANN), maximum classifier discrepancy (MCD) [11], CDAN, batch spectral penalization (BSP+DANN and BSP+CDAN) [60], and domain-specific batch normalization for unsupervised domain adaptation (DSBN) [61]. On ImageCLEF-DA, we compare our UDA model with several standard deep learning and deep transfer learning methods: ResNet50, DAN, DANN [62], JAN, CDAN, CDAN+E, CAN and iCAN, manifold embedded distribution alignment (MEDA), adversarial tight match (ATM) [30], DSAN, and self-adaptive re-weighted adversarial domain adaptation (SAR) [63]. For digit classification, we compare our model with CORAL, simultaneous deep transfer across domains and tasks (MMD) [64], deep reconstruction-classification networks (DRCN) [65], coupled generative adversarial networks (CoGAN) [66], adversarial discriminative domain adaptation (ADDA) [13], unsupervised image-to-image translation networks (UNIT) [67], asymmetric tri-training for unsupervised domain adaptation (ATT) [68], generate to adapt: aligning domains using generative adversarial networks (GTA) [69], and learning semantic representations for unsupervised domain adaptation (MSTN) [70], where the results of the baselines are borrowed from [20].

Method M→U U→M S→M
Source Only
MMD (2015 ICCV)
DRCN (2016 ECCV)
CoGAN (2016 NIPS)
ADDA (2017 CVPR)
UNIT (2017 NIPS)
ATT (2017 ICML)
GTA (2019 CVPR)
MSTN (2018 PMLR)
DWT (2019 CVPR)
DWT-MEC (2019 CVPR)
ATM (2020 TPAMI)
SAR (2020 IJCAI)
Ours
TABLE I: Accuracy (%) on the digit recognition tasks. The best results are highlighted in bold.
Method A→D A→W D→A D→W W→A W→D Avg
ResNet50
GFK (2012 CVPR)
DAN (2015 ICML)
JAN (2017 ICML)
MADA (2018 AAAI)
CAN (2018 CVPR)
CDAN (2018 NIPS)
CDAN+E (2018 NIPS)
CRST (2019 ICCV)
BNM (2020 CVPR)
Ours
TABLE II: Classification accuracies (%) on Office-31 dataset (ResNet50)
Method P→R A→P C→A P→C C→P P→A A→C R→P A→R R→C R→A C→R Avg
ResNet50
DAN (2015 ICML)
DANN (2016 JMLR)
JAN (2017 ICML)
CDAN-M (2018 NIPS)
CDAN (2018 NIPS)
CDAN+E (2018 NIPS)
DWT (2019 CVPR)
ATM (2020 TPAMI)
Ours
TABLE III: Classification accuracies (%) on Office-Home dataset (ResNet50)
Method I→P P→I I→C C→I C→P P→C Avg
ResNet50
DAN (2015 ICML)
DANN (2016 JMLR)
CAN (2018 CVPR)
JAN (2017 ICML)
CDAN+E (2018 NIPS)
CDAN-M (2018 NIPS)
iCAN (2018 CVPR)
MEDA (2018 ACM MM)
ATM (2020 TPAMI)
DSAN (2020 TNNLS)
SAR (2020 IJCAI)
Ours
TABLE IV: Classification accuracies (%) on ImageCLEF-DA dataset (ResNet50)
Method aero truck train skate person plant motor knife horse car bus bicycle Avg
ResNet101
DAN (2015 ICML)
DANN (2016 JMLR)
MCD (2018 CVPR)
BSP+DANN (2019 ICML)
CDAN (2018 NIPS)
BSP+CDAN (2019 ICML)
DSBN+MSTN (2019 CVPR)
DSAN (2020 TNNLS)
Ours
TABLE V: Classification accuracies (%) on VisDA-2017 dataset (ResNet101)
Methods A→D A→W D→A D→W W→A W→D Avg
ResNet50 68.9 68.4 62.5 96.7 60.7 99.3 76.1
+TDBN 85.7 85.2 71.2 99.1 68.9 100.0 85.0
+OL 88.8 83.2 72.3 99.0 70.9 100.0 85.7
Ours (TDBN+OL) 95.5 87.5 74.7 100.0 71.3 100.0 88.2
TABLE VI: Ablation study with regard to transferability on the Office-31 dataset

IV-B Results

Office-31. Table II quantifies the performance of our model employing TDBN and the orthogonal weights, and compares it with state-of-the-art records on the Office-31 dataset. TDBN means that we replace the batch normalization layers with TDBN layers, and we additionally use the orthogonal weights in the fully-connected layer. In standard domain adaptation, our model outperforms the compared methods, most of which consider feature alignment without inter-domain relationships. From Table II, our model also behaves better than the several compared methods that do account for the relationships within the features.

Office-Home. Table III shows that ours exceeds the compared methods by more than 3.6% in average accuracy. Also, considering that Office-Home has 12 transfer tasks, from Table III we can easily find that our method achieves the best result on all of them. Since Office-Home is a large-scale dataset, it can be concluded that our proposed method not only works well on standard benchmarks but also generalizes well to large-scale datasets.

ImageCLEF-DA. The classification results on ImageCLEF-DA are shown in Table IV. Our model outperforms the other compared methods on most transfer tasks. In particular, ours surpasses DSAN by 3.3%, which implies that redundancy reduction is beneficial for transferability.

VisDA-C. Table V reports the classification results of all the methods on the VisDA-C dataset. As Table V shows, the proposed UDA model consistently achieves performance gains on all the adaptation tasks.

MNIST-USPS-SVHN. We also consider the digit classification tasks M→U, U→M, and S→M. The experimental results are recorded in Table I. The proposed UDA model is superior to the other counterparts overall.

Overall Analysis. The above experimental results reveal some insightful observations: 1) In UDA, DWT [20] directly utilizes a decorrelated batch normalization method and outperforms the previous methods, but it only considers aligning the global distribution. Differently, our method considers inter-domain and intra-domain feature correlations as well as feature standardization, which can capture more information about each category. 2) On object recognition, ours achieves strong performance, which verifies the effectiveness of our model. 3) Compared with several advanced GAN-based UDA methods, our method without extra hassle achieves competitive performance, which shows its efficacy.

Ablation Study. In this subsection, we provide an ablation analysis of our method. We perform ablative experiments on the Office-31 dataset to analyze the functionality of each component of our UDA model. Table VI summarizes the ablation results on Office-31 using ResNet50 as the baseline, where the last column reports the average recognition accuracy of ours as compared to two other configurations, i.e., TDBN without the orthogonal loss, and the orthogonal loss (OL) without TDBN. The comparative results directly show that both TDBN and OL greatly improve the performance of ResNet50 at a competitive level. Moreover, their cooperation in our model yields even greater performance gains, which implies that they are mutually complementary. This is in line with the previous claim that the two schemes work in a bi-level way: TDBN reduces the domain-specific feature redundancy, while OL reduces the domain-invariant feature redundancy.

To further observe their ability to narrow the distribution gap across domains, the $\mathcal{A}$-distance [71] between the source and target domains is used here to evaluate the functionality of each component of our model. Following [71], the $\mathcal{A}$-distance is defined as $d_{\mathcal{A}} = 2(1 - 2\epsilon)$, where $\epsilon$ is the classification error of a binary domain classifier trained to discriminate the source and target domains. We select all the transfer tasks on Office-31 to show the $\mathcal{A}$-distance bridged by ResNet50, TDBN, OL, and our model, respectively, as in Fig. 5. From this figure, we can observe that simultaneously reducing the inter-domain and intra-domain feature redundancy can greatly reduce the $\mathcal{A}$-distance across domains.
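For reference, a minimal sketch of estimating this quantity from extracted deep features is given below; the linear-SVM domain classifier and the 50/50 train/test split are our assumptions, since [71] only requires some binary domain classifier.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def a_distance(feat_src, feat_tgt):
    """Estimate d_A = 2(1 - 2*err), where err is the test error of a binary
    classifier trained to tell source features from target features."""
    x = np.concatenate([feat_src, feat_tgt])
    y = np.concatenate([np.zeros(len(feat_src)), np.ones(len(feat_tgt))])
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)
    clf = LinearSVC(C=1.0, max_iter=5000).fit(x_tr, y_tr)
    err = 1.0 - clf.score(x_te, y_te)
    return 2.0 * (1.0 - 2.0 * err)
```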

Fig. 5: The $\mathcal{A}$-distance across domain distributions on the Office-31 dataset

V Conclusion

In this paper, we propose a simple UDA approach from the perspective of feature redundancy minimization. Compared to existing arts that focus on feature discrimination, the proposed model seeks to reduce feature redundancy, thereby enhancing the generalization ability of the deep UDA model. Our model performs a bi-level reduction of feature redundancy: the first level decorrelates domain-specific features as well as matches the feature distributions; the second level stabilizes the distribution of network activations, regularizes the deep neural network, and learns compact yet diverse features. Experiments on several large-scale publicly available image classification datasets show the superiority of our model over recent state-of-the-art traditional and deep counterparts.

References

  • [1] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” arXiv preprint arXiv:1412.3474, 2014.
  • [2] B. Sun and K. Saenko, “Deep coral: Correlation alignment for deep domain adaptation,” in European Conference on Computer Vision.   Springer, 2016, pp. 443–450.
  • [3] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Deep transfer learning with joint adaptation networks,” in International Conference on Machine Learning.   PMLR, 2017, pp. 2208–2217.
  • [4] M. Long, Z. Cao, J. Wang, and M. I. Jordan, “Conditional adversarial domain adaptation,” in Advances in Neural Information Processing Systems, 2018, pp. 1640–1650.
  • [5] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, pp. 1345–1359, 2009.
  • [6] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 723–773, 2012.
  • [7] W. Zhang, X. Zhang, L. Lan, and Z. Luo, “Maximum mean and covariance discrepancy for unsupervised domain adaptation,” Neural Processing Letters, vol. 51, no. 1, pp. 347–366, 2020.
  • [8] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, “Boosting for transfer learning,” in Proceedings of the 24th international conference on Machine learning, 2007, pp. 193–200.
  • [9] Y. Xu, S. J. Pan, H. Xiong, Q. Wu, R. Luo, H. Min, and H. Song, “A unified framework for metric transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 6, pp. 1158–1171, 2017.
  • [10] Y.-H. Hubert Tsai, Y.-R. Yeh, and Y.-C. Frank Wang, “Learning cross-domain landmarks for heterogeneous domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5081–5090.
  • [11] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, “Maximum classifier discrepancy for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3723–3732.
  • [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [13] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7167–7176.
  • [14] Z. Pei, Z. Cao, M. Long, and J. Wang, “Multi-adversarial domain adaptation,” arXiv preprint arXiv:1809.02176, 2018.
  • [15] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [16] Y. Li, N. Wang, J. Shi, X. Hou, and J. Liu, “Adaptive batch normalization for practical domain adaptation,” Pattern Recognition, vol. 80, pp. 109–117, 2018.
  • [17] X. Wang, Y. Jin, M. Long, J. Wang, and M. I. Jordan, “Transferable normalization: Towards improving transferability of deep neural networks,” in Advances in Neural Information Processing Systems, 2019, pp. 1953–1963.
  • [18] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai et al., “Recent advances in convolutional neural networks,” Pattern Recognition, vol. 77, pp. 354–377, 2018.
  • [19] N. Bansal, X. Chen, and Z. Wang, “Can we gain more from orthogonality regularizations in training deep networks?” in Advances in Neural Information Processing Systems, 2018, pp. 4261–4271.
  • [20] S. Roy, A. Siarohin, E. Sangineto, S. R. Bulo, N. Sebe, and E. Ricci, “Unsupervised domain adaptation using feature-whitening and consensus loss,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9471–9480.
  • [21] X. Yu, T. Liu, M. Gong, K. Zhang, K. Batmanghelich, and D. Tao, “Transfer learning with label noise,” arXiv preprint arXiv:1707.09724, 2017.
  • [22] M. Jing, J. Zhao, J. Li, L. Zhu, Y. Yang, and H. T. Shen, “Adaptive component embedding for domain adaptation,” IEEE Transactions on Cybernetics, 2020.
  • [23] W.-G. Chang, T. You, S. Seo, S. Kwak, and B. Han, “Domain-specific batch normalization for unsupervised domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [24] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo, “Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [25] W. Zellinger, T. Grubinger, E. Lughofer, T. Natschläger, and S. Saminger-Platz, “Central moment discrepancy (cmd) for domain-invariant representation learning,” arXiv preprint arXiv:1702.08811, 2017.
  • [26] J. Shen, Y. Qu, W. Zhang, and Y. Yu, “Wasserstein distance guided representation learning for domain adaptation,” arXiv preprint arXiv:1707.01217, 2017.
  • [27] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [28] W. Zhang, X. Zhang, L. Lan, and Z. Luo, “Enhancing unsupervised domain adaptation by discriminative relevance regularization,” Knowledge and Information Systems, vol. 62, no. 9, pp. 3641–3664, 2020.
  • [29] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” in International conference on machine learning.   PMLR, 2018, pp. 1989–1998.
  • [30] J. Li, E. Chen, Z. Ding, L. Zhu, K. Lu, and H. T. Shen, “Maximum density divergence for domain adaptation,” arXiv preprint arXiv:2004.12615, 2020.
  • [31] Y. Yao, L. Rosasco, and A. Caponnetto, “On early stopping in gradient descent learning,” Constructive Approximation, vol. 26, no. 2, pp. 289–315, 2007.
  • [32] A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Advances in neural information processing systems, 1992, pp. 950–957.
  • [33] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [34] H. Wang, W. Fan, P. S. Yu, and J. Han, “Mining concept-drifting data streams using ensemble classifiers,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003, pp. 226–235.
  • [35] A. Vezhnevets and V. Vezhnevets, “Modest adaboost-teaching adaboost to generalize better,” in Graphicon, vol. 12, no. 5, 2005, pp. 987–997.
  • [36] R. Reed, S. Oh, R. Marks et al., “Regularization using jittered training data,” in International Joint Conference on Neural Networks, vol. 3, 1992, pp. 147–152.
  • [37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [38] S. Srinivas and R. V. Babu, “Generalized dropout,” arXiv preprint arXiv:1611.06791, 2016.
  • [39] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in International conference on machine learning, 2013, pp. 1058–1066.
  • [40] J. Kohler, H. Daneshmand, A. Lucchi, M. Zhou, K. Neymeyr, and T. Hofmann, “Towards a theoretical understanding of batch normalization,” stat, vol. 1050, p. 27, 2018.
  • [41] L. Huang, D. Yang, B. Lang, and J. Deng, “Decorrelated batch normalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 791–800.
  • [42] L. Huang, X. Liu, B. Lang, A. W. Yu, Y. Wang, and B. Li, “Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks,” arXiv preprint arXiv:1709.06079, 2017.
  • [43] G. Yang, J. Pennington, V. Rao, J. Sohl-Dickstein, and S. S. Schoenholz, “A mean field theory of batch normalization,” arXiv preprint arXiv:1902.08129, 2019.
  • [44] D. Xie, J. Xiong, and S. Pu, “All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6176–6185.
  • [45] L. Huang, L. Liu, F. Zhu, D. Wan, Z. Yuan, B. Li, and L. Shao, “Controllable orthogonalization in training dnns,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6429–6438.
  • [46] J. Wang, Y. Chen, R. Chakraborty, and S. X. Yu, “Orthogonal convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 505–11 515.
  • [47] L. Huang, Y. Zhou, F. Zhu, L. Liu, and L. Shao, “Iterative normalization: Beyond standardization towards efficient whitening,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [48] K. Jia, S. Li, Y. Wen, T. Liu, and D. Tao, “Orthogonal deep neural networks,” IEEE transactions on pattern analysis and machine intelligence, 2019.
  • [49] J. Balogh and B. Tóth, “Global optimization on stiefel manifolds: a computational approach,” Central European Journal of Operations Research, vol. 13, no. 3, p. 213, 2005.
  • [50] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A simple proof of the restricted isometry property for random matrices,” Constructive Approximation, vol. 28, no. 3, pp. 253–263, 2008.
  • [51] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in European conference on computer vision.   Springer, 2010, pp. 213–226.
  • [52] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan, “Deep hashing network for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5018–5027.
  • [53] X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko, “Visda: The visual domain adaptation challenge,” arXiv preprint arXiv:1710.06924, 2017.
  • [54] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” 2011.
  • [55] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [56] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2012, pp. 2066–2073.
  • [57] W. Zhang, W. Ouyang, W. Li, and D. Xu, “Collaborative and adversarial network for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3801–3809.
  • [58] Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang, “Confidence regularized self-training,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5982–5991.
  • [59] S. Cui, S. Wang, J. Zhuo, L. Li, Q. Huang, and Q. Tian, “Towards discriminability and diversity: Batch nuclear-norm maximization under label insufficient situations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3941–3950.
  • [60] X. Chen, S. Wang, M. Long, and J. Wang, “Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation,” in International Conference on Machine Learning, 2019, pp. 1081–1090.
  • [61] W.-G. Chang, T. You, S. Seo, S. Kwak, and B. Han, “Domain-specific batch normalization for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7354–7362.
  • [62] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li, “Deep reconstruction-classification networks for unsupervised domain adaptation,” in European Conference on Computer Vision.   Springer, 2016, pp. 597–613.
  • [63] S. Wang and L. Zhang, “Self-adaptive re-weighted adversarial domain adaptation,” arXiv preprint arXiv:2006.00223, 2020.
  • [64] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4068–4076.
  • [65] J. Wang, W. Feng, Y. Chen, H. Yu, M. Huang, and P. S. Yu, “Visual domain adaptation with manifold embedded distribution alignment,” in Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 402–410.
  • [66] M.-Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” in Advances in neural information processing systems, 2016, pp. 469–477.
  • [67] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in Advances in neural information processing systems, 2017, pp. 700–708.
  • [68] K. Saito, Y. Ushiku, and T. Harada, “Asymmetric tri-training for unsupervised domain adaptation,” arXiv preprint arXiv:1702.08400, 2017.
  • [69] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa, “Generate to adapt: Aligning domains using generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8503–8512.
  • [70] S. Xie, Z. Zheng, L. Chen, and C. Chen, “Learning semantic representations for unsupervised domain adaptation,” in International Conference on Machine Learning, 2018, pp. 5423–5432.
  • [71] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine learning, vol. 79, no. 1-2, pp. 151–175, 2010.

Appendix A Basic Inequality Theorem

Basic Inequality Theorem. For any positive real numbers $a_1, a_2, \dots, a_n$, there holds:

$\dfrac{1}{n}\sum_{i=1}^{n} a_i \;\ge\; \Big(\prod_{i=1}^{n} a_i\Big)^{1/n} \qquad (16)$

where the equality holds if and only if $a_1 = a_2 = \cdots = a_n$.

Proof. If $a_1, \dots, a_n$ are positive real numbers, the arithmetic mean is $A_n = \frac{1}{n}\sum_{i=1}^{n} a_i$ and the geometric mean is $G_n = \big(\prod_{i=1}^{n} a_i\big)^{1/n}$; the basic conclusion to draw is that $A_n \ge G_n$.

Suppose $a_1 \le a_2 \le \cdots \le a_n$ are arranged in ascending order. When $n = 1$, the inequality obviously holds. Suppose the inequality holds when $n = k$. Then, when $n = k + 1$, the arithmetic mean is

$A = \dfrac{1}{k+1}\sum_{i=1}^{k+1} a_i \qquad (17)$

Owing to the assumed ascending order, we can get:

$a_1 \le A \le a_{k+1} \qquad (18)$

According to Eq. (18), we replace $a_1$ with $A$ and $a_{k+1}$ with $a_1 + a_{k+1} - A$, respectively, to keep the arithmetic mean unchanged; the new sequence is $A, a_2, \dots, a_k, a_1 + a_{k+1} - A$, and its mean is still $A$. So the sub-sequence excluding the first element $A$ is:

$a_2, a_3, \dots, a_k, \ a_1 + a_{k+1} - A \qquad (19)$

In terms of our induction hypothesis, the geometric mean of (19) is not larger than its arithmetic mean $A$, and then we can yield:

$a_2 a_3 \cdots a_k\,(a_1 + a_{k+1} - A) \le A^{k} \qquad (20)$

Let us multiply both sides by $A$. Eq. (20) will become:

$A\, a_2 a_3 \cdots a_k\,(a_1 + a_{k+1} - A) \le A^{k+1} \qquad (21)$

Moreover, according to Eq. (18), $(A - a_1)(a_{k+1} - A) \ge 0$, so that

$A\,(a_1 + a_{k+1} - A) \ge a_1 a_{k+1} \qquad (22)$

Then, combining Eqs. (21) and (22), $a_1 a_2 \cdots a_{k+1} \le A^{k+1}$, i.e., the inequality also holds for $n = k + 1$. Obviously, the equality holds only when $a_1 = a_2 = \cdots = a_n$.

Mengzhu Wang received a Bachelor’s degree in Information and Computing Science from Tianjin University of Commerce, Tianjin, China, in 2016. She received the Master’s degree from Chongqing University (CQU), China, in 2018. Since 2019, she has been pursuing the Ph.D. degree with the School of Computer Science, National University of Defense Technology, Changsha, China. Her current research interests include transfer learning, image segmentation, and computer vision.

Xiang Zhang received the M.S., and Ph.D. degrees from the National University of Defense Technology, Changsha, China, in 2010 and 2015, respectively. He is currently a research assistant with the Institute for Quantum Information State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology. His current research interests include computer vision and machine learning.

Long Lan is currently a lecturer with the College of Computer, National University of Defense Technology. He received the Ph.D. degree in computer science from the National University of Defense Technology in 2017. He was a visiting Ph.D. student at the University of Technology Sydney from 2015 to 2017. His research interests include multi-object tracking, computer vision, and discrete optimization.

Wei Wang is currently a Ph.D. candidate at the School of Software Technology, Dalian University of Technology, Dalian, China. He received the B.S. degree at the school of science from the Anhui Agricultural University, Hefei, China, in 2015. He received the M.S. degree at the school of computer science and technology from the Anhui University, Hefei, China, in 2018. His major research interests include transfer learning, zero-shot learning, deep learning, etc.

Huibin Tan received a Bachelor’s degree in Computer Science and Technology from Northeastern University of China in 2014. In 2014-2015, she was a postgraduate student at the School of Computer Science, National University of Defense Technology, and later became a Ph.D. student through direct admission. Since 2016, she has been pursuing the Ph.D. degree with the School of Computer Science, National University of Defense Technology, Changsha, China. Her current research interests include face recognition, visual tracking, and representation learning.

Zhigang Luo received the B.S., M.S., and Ph.D. degrees from the National University of Defense Technology, Changsha, China, in 1981, 1993, and 2000, respectively. He is currently a Professor with the College of Computer, National University of Defense Technology. His current research interests include machine learning, computer vision and bioinformatics.