I Introduction
Recent advances in deep neural networks have brought about extraordinary performance in various visual tasks, especially in image classification. Yet, for cross-domain classification tasks, a classifier directly trained on one large annotated dataset (the source domain) could degrade on another dataset (the target domain) due to the problem of domain shift. A promising remedy for this issue is domain adaptation
[1], [2], [3], [4], [5], which fills the distribution gap between the source domain and the target domain. In reality, unsupervised domain adaptation (UDA) is a promising technique since it does not require the target dataset to be annotated during training. Nonetheless, this also leads to a series of difficulties, of which the most challenging is how to leverage unlabeled data from the target domain to reduce domain shift. Early research efforts in this respect use proper distance metrics such as the maximum mean discrepancy (MMD) [6] and its variants, including the maximum mean and covariance discrepancy (MMCD) [7], to measure inter-domain feature distributions, and then train a model to minimize such distance metrics. Afterwards, a series of attempts [8], [9] reweight or select key instances [10] in the source domain to minimize MMD with class-wise information. Other works like the deep confusion network [3] treat MMD and its variants [11]
as a regularization to bound the learned feature distributions across domains. With the advent of generative adversarial networks (GANs)
[12], adversarial learning has become another main line of confusing feature distributions to learn domain-invariant features. In this vein, a feature generator is trained on the two domains to fool a learned discriminator in a two-player game, where the generator learns domain-invariant features, whereas the discriminator helps induce the domain-specific features. For instance, ADDA [13] is the first to exploit GANs for domain adaptation. CDAN [4] simultaneously confuses features and labels across domains by aligning multimodal structures via multilinear conditioning strategies. Differently, MADA [14] captures multimodal structures within individual domains by using multiple domain discriminators. Parallel to such studies, another line focuses on exploring network modules like Batch Normalization (BN)
[15] for improving UDA. To make BN applicable to domain adaptation tasks, AdaBN [16] refines batch normalization to leverage the first- and second-order statistics of the target domain and transfers knowledge from the source domain to the target domain. Transferable batch normalization [17] delves into the channel-wise transferability of normalization techniques for domain adaptation. Apparently, such preceding arts predominantly focus on learning transferable feature representations, but few consider whether the learned representations contain abundant information and how feature redundancy affects UDA performance. As stated in [18]
, convolutional neural networks (CNNs) have significant redundancy between filters and feature channels. In a sense, this obstructs network performance gains owing to the imbalanced distribution of the feature spectrum. Recently, promoting diversity via weight orthogonality
[19] and decorrelating features [20] have been shown to make the feature spectrum near uniform in different manners. Nevertheless, little work explicitly applies such insights to UDA. Importantly, such insights are in general designed for single-domain visual tasks and may not be suitable for cross-domain counterparts. To this end, we revisit the UDA problem from the perspective of feature redundancy and propose to prevent feature redundancy for improving UDA models in a bi-level way. Specifically, we first explore a transferable decorrelated batch normalization module (TDBN) to prevent domain-specific feature redundancy by jointly learning transferable and decorrelated features. We then explore a novel orthogonality-based regularization for enhancing the diversity of domain-invariant features to mitigate the corresponding feature redundancy. As a result, by simply plugging the two components into a pre-trained BN-based backbone (ResNet-50), a deep UDA model is constructed to simultaneously reduce the redundancy of both intra-domain and inter-domain features, which could ease overfitting and achieve better generalization. Through comprehensive experiments on five widely-used UDA benchmarks, reducing the bi-level feature redundancy shows promising efficacy for UDA, and the corresponding deep UDA model yields encouraging performance as compared to several state-of-the-art siblings. Our main contributions are summarized as follows:
We provide a novel perspective on how to reduce feature redundancy for improving UDA. A bi-level way to reduce feature redundancy is proposed, which can be easily coupled with any BN-based backbone network for UDA.
A transferable decorrelated batch normalization module (TDBN) is devised to prevent domain-specific feature redundancy by relieving intra-domain feature co-adaptation for better generalization.
A novel orthogonal regularization is proposed to enhance inter-domain feature diversity, which can serve as an alternative approach to realizing orthogonality. It helps to reduce domain-invariant feature redundancy by stabilizing the feature distributions as well as regularizing the network to induce compact features. Besides, it yields (near) orthogonality without the use of singular value decomposition (SVD).
II Related Work
In this section, we review recent trends in unsupervised domain adaptation that are most related to our approach.
II-A Unsupervised Domain Adaptation
The goal of UDA is to transfer knowledge [21], [22], [23] from an annotated source domain to another, unlabeled target domain by reducing domain shift. From the standpoint of feature learning, many UDA studies can be considered as either domain-invariant feature learning or domain-specific feature learning.
Domain-invariant feature learning. In UDA, the primary pursuit is to learn domain-invariant features. Many mainstream approaches [24], [25], [7] belong to this category and achieve this pursuit by aligning data distributions across domains in either a coarse or a fine-grained manner.
The idea behind coarse alignment approaches is the usage of either global distribution alignment or global data structure. For global distribution alignment, many distribution discrepancy metrics like WMMD [24], CMD [25], and MMCD [7], which consider data statistics, are directly minimized by UDA models. Correlation alignment (CORAL) [2] minimizes the discrepancy between the covariance matrices of the whole source features and the target features. Wasserstein distance guided representation learning (WDGRL) [26] learns domain-invariant representations to reduce domain shift based on the Wasserstein distance. Also, adversarial learning has been widely used for coarse domain alignment. Domain adversarial neural networks (DANN) [27] confuse domain features via a domain adversarial loss to induce domain-invariant representations. Joint adaptation networks (JAN) [3] incorporate MMD [6]
with adversarial learning to align the joint distributions of specific layers across domains. Besides, the deep relevance network considers global low-rank structure to align different domains via discriminative relevance regularization (DRR)
[28]. For fine-grained alignment, such efforts seek to match multi-mode data structures like class-wise margins and data distributions [14], [4], [29], [30]. Multi-adversarial domain adaptation (MADA) [14] captures multi-mode structures to enable fine-grained alignment of different data distributions based on local domain discriminators to learn domain-invariant features. The conditional domain adaptation network (CDAN) [4] simultaneously aligns features and labels and overcomes mode mismatch in adversarial learning for domain invariance. Cycle-consistent adversarial domain adaptation (CyCADA) [29] introduces a cycle-consistency loss to ensure the consistency of local semantic mappings.

Domain-specific feature learning. Domain-specific feature learning assumes that different domains carry different information, which should be separated from domain-invariant features. In general, methods in this respect devise network components and loss functions for each domain. Among them, Batch Normalization (BN) is a versatile network component that has been extended to learn domain-specific features for UDA [23], [17]. DSBN [23] adapts to both domains by specifying batch normalization layers to separate domain-specific information from domain-invariant features in an explicit manner. Transferable Normalization (TransNorm) [17] associates domain-specific features via an attention mechanism to realize transferability. Adaptive Batch Normalization (AdaBN) [16] proposes a brute-force post-processing method to replace batch normalization statistics with those of the target samples. Domain-specific whitening transform (DWT) [20] proposes domain-specific alignment layers to align domain-specific covariance matrices of intermediate features. Different from these methods, which only encourage transferability across domains, this work focuses on the reduction of feature redundancy for better generalization. Besides, our method learns domain-specific features and domain-invariant features simultaneously, and enjoys both benefits.

II-B Generalization Strategies for DNNs
Many approaches seek to improve DNNs for better generalization by easing overfitting. Towards this goal, several regularizers and network structures have been developed. For instance, early stopping [31] achieves this aim by refraining from abusing finite samples as early as possible. Weight decay [32] penalizes large weights in stochastic gradient descent (SGD), like $\ell_2$ regularization, to reduce model complexity. Decoupled weight decay regularization [33] further extends weight decay to be applicable to Adam by decoupling weight decay from the gradient-based update. Model ensembling [34] is an effective method for model generalization that averages the inference results of multiple models. Among these, AdaBoost [35] is a basic and versatile form, which combines a group of weak classifiers to yield a stronger, united classifier. Data augmentation [36] generates different variations of a few instances by hand to help the model learn invariant features, and behaves like a regularizer, thereby easing overfitting. Dropout [37], [38] can also regularize the network by virtue of its model-ensemble behavior. Different from Dropout, DropConnect [39] randomly drops a portion of the weights rather than the feature activations to regularize the network in an implicit way. BN [15], [40] can also handle the internal covariate shift problem to stabilize network training. To reduce feature redundancy, decorrelated batch normalization [41] takes a step towards better generalization by decorrelated learning. Besides, orthogonality [19] stabilizes the distribution of network activations with efficient convergence and achieves better generalization. Such techniques have shown the ability to improve network generalization. Motivated by such insights, we utilize decorrelated learning and orthogonality to improve UDA models from the view of redundancy minimization.

II-C Orthogonality in DNNs
As claimed in [19], [42], [43], orthogonality imposed on the weights is able to stabilize the optimization of DNNs by preventing the explosion or vanishing of back-propagated gradients. In [39], orthogonality is applied in signal processing since it is capable of preserving activation energy and reducing redundancy in representations. Yang et al. [43], [44] showed that orthogonal weight normalization in nonlinear sigmoid networks can attain dynamical isometry. Huang et al. [42]
showed that a novel orthogonal weight normalization method can guarantee stability, and formulated this problem for feed-forward neural networks. Also, Huang et al. [45] proposed a computationally efficient and numerically stable orthogonalization method to learn a layer-wise orthogonal weight matrix in DNNs, which enables control over the orthogonality of a weight matrix. There is also research on Recurrent Neural Networks (RNNs) that uses orthogonal matrices to avoid gradient vanishing problems. Different from the above, our method is related to the methods that impose an orthogonal regularization in the loss function under the Frobenius norm; it can stabilize training by ensuring that fully-connected layer outputs follow identical distributions, which reduces feature co-adaptation. In contrast to such siblings, the proposed (near) orthogonality is implicitly obtained through joint matrix trace and determinant constraints in theory. To the best of our knowledge, this orthogonality is first presented here and applied to the UDA problem.
III Method
This section reconsiders the UDA problem from the view of feature redundancy and details how to reduce bi-level feature redundancy to improve the unsupervised domain adaptation (UDA) model.
III-A Preliminary
In this paper, we focus on UDA, where the source domain has sufficient labels whereas the target domain only has unlabeled examples. In UDA, we suppose $\mathcal{D}_s=\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ and $\mathcal{D}_t=\{x_j^t\}_{j=1}^{n_t}$ to be the labeled source data and the unlabeled target data, drawn from different distributions. The data probability distributions of $\mathcal{D}_s$ and $\mathcal{D}_t$ are $P$ and $Q$, respectively, where $x^s \sim P$ and $x^t \sim Q$, and in UDA we have $P \neq Q$. Our goal is to predict the target labels and reduce the difference between the two distributions.

III-B Model
Motivation. In the field of UDA, mainstream methods focus on filling the distribution gap between the source and the target caused by domain shift. As a matter of fact, it is easy to neglect the inherent vulnerabilities of deep networks themselves, such as instability in network learning and redundancy in feature representations. Apparently, such internal weaknesses could continue to deteriorate in cross-domain tasks owing to cross-domain data heterogeneity. Nevertheless, few efforts truly discuss the effect of this insight in domain adaptation; in contrast, most works consider learning discriminative or transferable feature patterns for cross-domain classification tasks. Different from such works, this study primarily aims to prevent feature patterns from being redundant, namely, reducing feature redundancy, or redundancy minimization.
Recent studies [46] claim that the imbalanced distribution of the feature spectrum could contribute to feature redundancy and, fortunately, that feature diversity or feature decorrelation can mitigate this issue to some degree. Besides, many works judge feature redundancy by feature similarity. Inspired by such insights, we scrutinize whether feature redundancy exists in current UDA models, in light of both the distribution of the feature spectrum and layer-wise feature similarity. As Fig. 1 shows, the distributions of the feature spectrum of both ResNet-50 and domain adversarial neural networks (DANN) on a transfer task of the Office-Home dataset are somewhat imbalanced, and meanwhile their corresponding features, coming from the channels in different BN layers and from the Fc layers, are still highly similar. In particular, DANN has better transferability with lower feature similarity than ResNet-50, but still suffers from a relatively serious distribution imbalance of the spectrum. This implies that low feature similarity alone is not enough to prevent feature redundancy. Conversely, ResNet-50 has weaker transferability with higher feature similarity, but a relatively balanced distribution of the spectrum compared to DANN. This indicates that a near-uniform distribution of the spectrum is likewise insufficient to ensure redundancy minimization. Based on these observations, we consider explicitly reducing feature redundancy in a bi-level way, which jointly makes feature similarity much lower and the distribution of the feature spectrum closer to uniform. As in Fig. 1, in contrast with both ResNet-50 and DANN, the proposed approach plugged into ResNet-50 can significantly reduce the layer-wise feature similarity of ResNet-50 as well as mitigate the corresponding imbalanced feature spectrum. Generally speaking, making the feature spectrum near uniform is beneficial for stabilizing network learning and model convergence, while reducing feature similarity assists in achieving better model generalization. Since the proposed method achieves both goals, killing two birds with one stone, our model is definitely helpful for reducing feature redundancy to promote UDA performance.

Building on the above analysis, the proposed approach to reducing feature redundancy contains two aspects or levels: the first level is designed for the reduction of domain-specific (or intra-domain) feature redundancy, while the second seeks to prevent domain-invariant (or inter-domain) feature redundancy. Fig. 1(b) shows that, in both ResNet-50 and DANN, the domain-specific features learned by BN layers and the domain-invariant features learned by Fc layers are redundant to a certain extent, while our results have lower feature similarity and a near-uniform distribution of the feature spectrum. Thus, this bi-level manner is feasible. In detail, we first try to learn domain-specific features that are as compact as possible through a transferable decorrelated batch normalization module (TDBN), with the goal of maintaining domain-specific information as well as reducing the risk that this feature redundancy harms the subsequent domain invariance. Then, domain-invariant feature redundancy caused by the domain-shared representation is further reduced by using a novel orthogonality for feature diversity. The above bi-level scheme is lightweight and can be easily plugged into ResNet-50 as our UDA model (Ours in short).
As empirical studies show, the proposed UDA model, without extra hassle, achieves performance competitive with several state-of-the-art methods.
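The two diagnostics used above, the distribution of the feature spectrum and layer-wise feature similarity, are easy to compute for any feature matrix. Below is a minimal NumPy sketch of how such measurements could be made; the exact normalization and similarity measure used for Fig. 1 are assumptions:

```python
import numpy as np

def spectrum_distribution(F):
    """Normalized singular-value spectrum of a feature matrix F (samples x channels).
    A near-uniform distribution suggests low spectral redundancy."""
    s = np.linalg.svd(F - F.mean(axis=0), compute_uv=False)
    return s / s.sum()

def mean_abs_cosine(F):
    """Average absolute pairwise cosine similarity between feature channels."""
    Fc = F - F.mean(axis=0)
    Fn = Fc / (np.linalg.norm(Fc, axis=0, keepdims=True) + 1e-12)
    C = Fn.T @ Fn                          # channel-by-channel cosine similarities
    d = C.shape[0]
    return (np.abs(C).sum() - d) / (d * (d - 1))   # exclude the unit diagonal

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 32))                        # decorrelated features
Y = 0.1 * (X @ rng.standard_normal((32, 32))) + X[:, :1]  # redundant features
# Y has a dominant spectral direction and highly similar channels
print(spectrum_distribution(Y).max(), mean_abs_cosine(X), mean_abs_cosine(Y))
```

On such toy data the redundant matrix `Y` exhibits both a more peaked spectrum and a larger mean channel similarity than the decorrelated `X`, mirroring the two symptoms discussed above.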
The Flowchart. For clarity, the proposed UDA model is illustrated in Fig. 2
. To mitigate the overfitting risk of domain-specific feature redundancy, TDBN first relieves feature co-adaptation and then enhances feature transferability. Different from existing BN-based modules, TDBN not only reduces feature redundancy but also learns transferable features, without introducing any extra parameters. Albeit simple, the performance gain is attractive. For the ultimate classifier in the Fc layer, the pre-softmax weights are constrained to be (near) orthogonal for domain-invariant feature diversity, thereby reducing feature redundancy. This is because orthogonality can extract more compact feature bases to span the whole feature subspace, and thus encourages learning compact features. This property also reduces the risk of model overfitting. Besides, the orthogonality offers further appealing properties: accelerating model convergence and stabilizing network training. The difference from existing orthogonality techniques lies in that the proposed orthogonality involves neither singular value decomposition (SVD) nor the iterative calculation of the largest and smallest singular values.
III-C Reducing Domain-specific Feature Redundancy
Recall that domain-specific feature learning assumes that each domain has distinct information; thus, domain-specific features should be separated from domain-invariant features. In the field of UDA, batch normalization (BN) has been extended to learn domain-specific features by calculating per-domain statistics, i.e., the mean and variance, for standardization. It serves as an important network component that accelerates network convergence by reducing covariate shift. Most BN-based modules are based on standardization alone and hence cannot simultaneously reduce feature redundancy and realize feature transferability. Towards this goal, we couple decorrelated BN [41] with the channel-wise transferability attention idea from transferable BN [17] to jointly enjoy their strengths: the former mitigates feature co-adaptation to reduce feature redundancy, while the latter focuses on feature transferability across domains. For completeness, we detail our BN module, termed Transferable Decorrelated Batch Normalization (TDBN), below.

In a BN module, given a mini-batch input, the corresponding standard normalized outputs become:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}},\qquad y_i = \gamma \hat{x}_i + \beta, \tag{1}$$

where $\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$ and $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu)^2$ are the mean and variance of the mini-batch, $\epsilon$ is a predefined small number to prevent numerical instability, both $\gamma$ and $\beta$ are extra learnable parameters, and $m$ is the number of mini-batch samples.
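For concreteness, the standardization and affine transform of Eq. (1) can be sketched in a few lines of NumPy (training-mode statistics only; running averages for inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Eq. (1): standardize each channel over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                 # mini-batch mean per channel
    var = x.var(axis=0)                 # mini-batch variance per channel
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
y = batch_norm(x)
# per-channel mean is ~0 and variance ~1 after standardization
print(y.mean(axis=0).max(), y.var(axis=0).mean())
```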
BN merely performs feature standardization without conducting feature decorrelation; thus, it does not reduce feature redundancy. Huang et al. [41] proposed a Decorrelated Batch Normalization (DBN) module to address this issue, and then improved DBN with an iterative optimization algorithm for efficiency [47]. Intrinsically, both amount to performing ZCA whitening on the standardized inputs. The concrete formulas of this process are:
$$x_c = x - \mu, \tag{2}$$
$$\hat{x} = \Sigma^{-\frac{1}{2}}\, x_c, \tag{3}$$
where the covariance matrix is
$$\Sigma = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu)(x_i - \mu)^{T} + \epsilon I, \tag{4}$$
and the corresponding singular value decomposition (SVD) form is $\Sigma = U \Lambda U^{T}$, wherein $\Lambda$ and $U$ are the singular values and the singular vectors, respectively, so that $\Sigma^{-\frac{1}{2}} = U \Lambda^{-\frac{1}{2}} U^{T}$. Iterative whitening normalization [47] efficiently computes $\Sigma^{-\frac{1}{2}}$ by the Newton iteration method. The normalized covariance $\Sigma_N$ can be calculated as follows:
$$\Sigma_N = \frac{\Sigma}{\operatorname{tr}(\Sigma)}, \tag{5}$$
where $\operatorname{tr}(\Sigma)$ indicates the trace of $\Sigma$; then $P_k$ can be calculated as follows:
$$P_k = \frac{1}{2}\left(3P_{k-1} - P_{k-1}^{3}\,\Sigma_N\right),\qquad P_0 = I, \tag{6}$$
so that after $T$ iterations $\Sigma^{-\frac{1}{2}} \approx P_T / \sqrt{\operatorname{tr}(\Sigma)}$.
Whitening the activations ensures that all dimensions along the eigenvectors have equal importance in the subsequent linear layer, while standardization ensures that the normalized output gives equal importance to each dimension by multiplying a diagonal scaling matrix. Obviously, this ignores the interplay between the source domain and the target domain. Similar to TransNorm, which aligns the sufficient statistics of both domains via channel transferability, we also quantify the transferability of different channels and adapt them to the two domains. For each channel $c$ of the whitened data $\hat{x}$, the importance of the different channels can be determined by:
$$\alpha_c = \frac{C\, s_c}{\sum_{j=1}^{C} s_j}, \tag{7}$$
where
$$s_c = \frac{1}{1 + d_c},\qquad d_c = \left|\frac{\mu_s^{(c)}}{\sqrt{(\sigma_s^{(c)})^2 + \epsilon}} - \frac{\mu_t^{(c)}}{\sqrt{(\sigma_t^{(c)})^2 + \epsilon}}\right|, \tag{8}$$
wherein $C$ denotes the number of channels in the layer. In contrast to TransNorm [17], the distance has been replaced by the similarity $s_c$ without introducing extra parameters. Following [17], we get the final output of the proposed TDBN module as below:
$$y = \gamma \odot \hat{x} + \beta, \tag{9}$$
$$\tilde{y} = (1 + \alpha) \odot y. \tag{10}$$
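A schematic NumPy sketch of the channel attention in Eqs. (7)-(10) is given below; the reciprocal-distance similarity used here is an assumption for illustration, and the exact similarity in TDBN may differ:

```python
import numpy as np

def channel_attention(mu_s, var_s, mu_t, var_t, eps=1e-5):
    """Channels whose source/target statistics agree receive larger weights.
    s_c = 1/(1 + d_c) is an assumed similarity; alpha is normalized to sum to C."""
    d = np.abs(mu_s / np.sqrt(var_s + eps) - mu_t / np.sqrt(var_t + eps))
    s = 1.0 / (1.0 + d)                         # channel similarity
    return s.size * s / s.sum()                 # normalized channel importance

def tdbn_output(x_hat, gamma, beta, alpha):
    """Affine transform of the whitened input, then residual channel re-weighting."""
    return (1.0 + alpha) * (gamma * x_hat + beta)

mu_s, var_s = np.array([0.0, 1.0, 3.0]), np.ones(3)
mu_t, var_t = np.array([0.0, 1.2, -3.0]), np.ones(3)
alpha = channel_attention(mu_s, var_s, mu_t, var_t)
print(alpha)   # the channel with matching statistics gets the largest weight
```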
III-D Preventing Domain-invariant Feature Redundancy
Learning domain-invariant features means learning features that transfer across domains, which is always a predominant aim of UDA. Hence, it plays a critical role in the ultimate inference ability of the classifier on the target domain. Considering this, we try to constrain the pre-softmax weights in the fully-connected (Fc) layer, which behaves like a classifier, to induce domain invariance. Recall that increasing feature diversity can assist in preventing feature redundancy, and orthogonality has just the ability to cope with this issue. Thus, we explore an orthogonal regularization over the pre-softmax weights of the classifier for feature diversity.
As is well known, orthogonality has shown great success in improving the stability of deep neural networks, because it can preserve energy, stabilize activation distributions, and ensure a uniform spectrum. Due to such appealing properties, many studies explore orthogonality [19], [42], [45], [48] in different ways to promote DNNs. Some studies regard it as an optimization problem on Stiefel manifolds [49], while others consider its soft version as a differentiable regularizer [19] to regularize CNNs. Such insights spur us to develop an alternative orthogonal regularizer that avoids iterative spectrum computation for efficiency. Moreover, it can guarantee the singular values to be near one in theory. This significantly differs from most preceding works [19].
Prior to detailing our orthogonality, we first present its constraint form as below:
$$\operatorname{tr}(W^{T}W) = k, \qquad \det(W^{T}W) = 1, \tag{11}$$
where $W \in \mathbb{R}^{d \times k}$ is the pre-softmax weight matrix. Note that our orthogonal constraint is not limited to the pre-softmax weights. $\det(\cdot)$ denotes the determinant of a square matrix, while $\operatorname{tr}(\cdot)$ is the trace of a square matrix. This constraint can implicitly ensure the weight matrix to be orthogonal in theory.
Lemma 1. Any nonsingular symmetric matrix $M \in \mathbb{R}^{k \times k}$ has singular values that are all ones under the orthogonal constraint of Eq. (11).
Proof. By simple algebra, one knows:
$$\operatorname{tr}(M) = \sum_{i=1}^{k}\sigma_i, \qquad \det(M) = \prod_{i=1}^{k}\sigma_i, \tag{12}$$
where $\sigma_i$ is the $i$th singular value of $M$. According to Eq. (12), the following inequalities hold:
$$1 = \frac{1}{k}\operatorname{tr}(M) = \frac{1}{k}\sum_{i=1}^{k}\sigma_i \;\ge\; \left(\prod_{i=1}^{k}\sigma_i\right)^{\frac{1}{k}} = \left(\det(M)\right)^{\frac{1}{k}} = 1, \tag{13}$$
where the two outer equalities follow from the constraint Eq. (11), and the middle inequality is the arithmetic-geometric mean inequality. According to the basic inequality theory given in the APPENDIX, equality holds only when all the singular values are equal; together with $\sum_{i}\sigma_i = k$, there holds:
$$\sigma_1 = \sigma_2 = \cdots = \sigma_k = 1. \tag{14}$$
This completes the proof.
Theorem 1. The learned weight matrix $W$ under the constraint of Eq. (11) could be orthogonal, i.e., $W^{T}W = I$.

Proof. Assume that $M = W^{T}W$, where $W \in \mathbb{R}^{d \times k}$ with $d \ge k$. Then $M$ is most likely a nonsingular symmetric matrix; in practice, this point can be guaranteed. Assume that there exists the singular value decomposition of $W$, i.e., $W = U \Sigma V^{T}$. Accordingly, $M = V \Sigma^{T}\Sigma V^{T}$. According to Lemma 1, all singular values of $M$ are ones, so $\Sigma^{T}\Sigma = I$, because the singular values of $M$ are exactly the squared singular values of $W$. Since $V$ is a square orthogonal matrix, there still holds $V V^{T} = I$. This implies that $M = W^{T}W = V \Sigma^{T}\Sigma V^{T} = I$.
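Lemma 1 and Theorem 1 can be checked numerically: a matrix with orthonormal columns satisfies the trace and determinant conditions of Eq. (11) exactly, while for a generic matrix the arithmetic-geometric-mean gap over the Gram matrix is strictly positive (a small NumPy verification of this reconstructed argument):

```python
import numpy as np

def am_gm_gap(w):
    """tr(W^T W)/k - det(W^T W)^(1/k): non-negative, and zero iff all
    singular values of W are equal (the AM-GM equality condition)."""
    g = w.T @ w
    k = g.shape[0]
    return np.trace(g) / k - np.linalg.det(g) ** (1.0 / k)

rng = np.random.default_rng(3)
d, k = 10, 4
q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal columns
m = q.T @ q
print(np.trace(m), np.linalg.det(m))               # ~k and ~1: Eq. (11) holds

w = rng.standard_normal((d, k))                    # a generic, non-orthogonal W
print(am_gm_gap(q), am_gm_gap(w))                  # gap ~0 vs. strictly positive
```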
To the best of our knowledge, no previous works explicitly mention these findings. Besides, it is nontrivial to directly optimize the above constrained objective. Thus, we treat this orthogonal constraint as a regularization for efficient optimization; this type of relaxation is also the mainstream way of implementing orthogonality. Following this, we relax the constraint Eq. (11) into the regularized objective:
$$\mathcal{L} = \mathcal{L}_{cls} + \lambda\left[\left(\operatorname{tr}(W^{T}W) - k\right)^{2} + \left(\det(W^{T}W) - 1\right)^{2}\right], \tag{15}$$
where $\lambda$ is a regularization parameter and $\mathcal{L}_{cls}$ is the classification loss. We resort to auto-computing the gradient of Eq. (15) for simplicity. We need neither the expensive SVD nor iterative calculation of singular values, unlike, e.g., the regularizer based on the spectral restricted isometry property [50]. Our orthogonality implicitly encourages the network to learn diverse features so that the features become more compact.
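As an illustration, such a trace-determinant penalty can be written as a plain differentiable function of the weights; the squared-error form below is a sketch of one possible relaxation of Eq. (11), not necessarily the exact one used in our implementation:

```python
import numpy as np

def orth_penalty(w):
    """Soft trace/determinant penalty on the Gram matrix G = W^T W.
    It is zero exactly when tr(G) = k and det(G) = 1, i.e., at orthogonality,
    and it involves neither SVD nor iterative singular-value computation."""
    g = w.T @ w
    k = g.shape[0]
    return (np.trace(g) - k) ** 2 + (np.linalg.det(g) - 1.0) ** 2

rng = np.random.default_rng(4)
q, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # orthonormal columns
w = rng.standard_normal((8, 3))                    # generic weights
print(orth_penalty(q), orth_penalty(w))            # ~0 for q, clearly positive for w
```

In a deep learning framework this penalty would simply be added to the classification loss with weight $\lambda$ and differentiated by autograd, matching the auto-computed gradients mentioned above.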
IV Experiments
This section details the implementation of our UDA model (Ours in short) and the corresponding training protocols, and then reports our experimental evaluation. We conduct five experiments on both small- and large-scale datasets to evaluate our method against its well-behaved siblings. Moreover, we present an ablation study to analyze the effect of each component on UDA.
IV-A Experimental Settings
Data Preparation. We conduct five experiments on seven datasets: Office-31, Office-Home, ImageCLEF-DA, MNIST, USPS, SVHN, and VisDA-2017. These datasets are popular benchmarks that have been widely adopted to evaluate domain adaptation algorithms in previous works.
Office-31 [51] is a standard benchmark for testing domain adaptation methods. It contains 4,652 images organized in 31 classes from three different domains: Amazon (A), DSLR (D), and Webcam (W). Amazon images are collected from amazon.com, while both Webcam and DSLR images are manually collected in an office environment. We evaluate all compared methods on all six transfer tasks: A→W, A→D, D→W, W→A, D→A, and W→D.
Office-Home [52] contains four distinct domains, including Art (A), Clipart (Cl), Product (Pr), and Real World (Rw), and each domain involves 65 different categories. There are 15,500 images in total, so it serves as a large-scale benchmark for testing domain adaptation methods.
ImageCLEF-DA (http://imageclef.org/2014/adaptation) is a benchmark for the ImageCLEF 2014 domain adaptation challenge, which is composed of 12 common categories shared by three public datasets: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). There are 50 images per category and 600 images in each domain. We evaluate the methods on all six transfer tasks: C→I, C→P, P→I, P→C, I→C, and I→P.

VisDA-2017 [53] is a challenging simulation-to-real dataset with two very distinct domains: a synthetic domain of renderings of 3D models from different angles and under different lighting conditions, and a real domain of natural images. It contains 12 classes in the training, validation, and test domains.
MNIST (M) comes from the National Institute of Standards and Technology. It is a standard digit recognition dataset covering the ten handwritten digits 0-9 in the form of 28×28 grayscale images, and it is divided into training and testing sets. The training set comes from 250 different people, of which 50% are high school students and the remainder Census Bureau workers; the testing set is constructed similarly. This dataset contains 60,000 training images and 10,000 testing images. Similar to MNIST (M), USPS (U) also contains the 10 classes 0-9 and has 7,291 training and 2,007 testing images. All images are cropped to 16×16 grayscale pixels, and the dataset follows a probability distribution significantly different from that of MNIST. Besides,
SVHN (S) [54] contains colored 10-digit images of size 32×32, with 73,257 training images and 26,032 testing images.

Implementation Details. For fairness, in the object recognition tasks we adopt the same base network, ResNet-50, as our backbone; for example, on the Office-Home dataset we use ResNet-50 [55]. Especially for VisDA-2017, owing to its huge volume, we use ResNet-101 as our backbone. We fine-tune all convolutional and pooling layers from the ImageNet pre-trained models and then train the classifier layer by back-propagation; the network is initialized with ImageNet weights. In the fully-connected layers, we replace the original output dimension with the number of classes of the target task (e.g., 65 classes for Office-Home), and we do the same for the other domains. During the training phase, we set the mini-batch size to 24 and the number of epochs to 200. We use the Adam optimizer with an initial learning rate of 0.01 together with weight decay. The learning rate decreases by a factor of 10 after 50 and 90 epochs. The maximum number of optimizer iterations is set to 10,000. For digit classification, we follow the protocols in [13]. We use the same hyper-parameter setting for all tasks. We implement the proposed UDA model in PyTorch and report the average classification accuracy on both object recognition and digit classification. On the Office-31 dataset, we compare our method with popular and recent transfer learning methods, including deep residual learning for image recognition (ResNet-50)
[55], geodesic flow kernel for unsupervised domain adaptation (GFK) [56], deep adaptation networks (DAN) [27], joint adaptation networks (JAN) [3], multi-adversarial domain adaptation (MADA) [14], the collaborative and adversarial network for unsupervised domain adaptation (iCAN) [57], conditional adversarial domain adaptation (CDAN and CDAN+E) [4], confidence-regularized self-training (CRST) [58], and batch nuclear-norm maximization under label-insufficient situations (BNM) [59]; the results of the compared methods are taken from the original papers for a fair comparison. On Office-Home, we compare our UDA model with ResNet-50, DAN, JAN, and unsupervised domain adaptation using feature whitening and consensus loss (DWT and DWT-MEC) [20]; on VisDA-2017, we compare it with ResNet-101, DAN, domain adversarial neural networks (DANN), maximum classifier discrepancy (MCD) [11], CDAN, batch spectral penalization (BSP+DANN and BSP+CDAN) [60], and domain-specific batch normalization for unsupervised domain adaptation (DSBN) [61]. On ImageCLEF-DA, we compare our UDA model with several standard deep learning and deep transfer learning methods: ResNet-50, DAN, DANN [62], JAN, CDAN, CDAN+E, CAN and iCAN, manifold embedded distribution alignment (MEDA), adversarial tight match (ATM) [30], DSAN, and self-adaptive re-weighted adversarial domain adaptation (SAR) [63]. For digit classification, we compare our model with CORAL, simultaneous deep transfer across domains and tasks (MMD) [64], deep reconstruction-classification networks (DRCN) [65], coupled generative adversarial networks (CoGAN) [66], adversarial discriminative domain adaptation (ADDA) [13], unsupervised image-to-image translation networks (UNIT) [67], asymmetric tri-training for unsupervised domain adaptation (ATT) [68], generate to adapt: aligning domains using generative adversarial networks (GTA) [69], and learning semantic representations for unsupervised domain adaptation (MSTN) [70], while the results of the baselines are borrowed from [20].

Method  M→U  U→M  S→M

Source Only  
MMD (2015 ICCV)  
DRCN (2016 ECCV)  
CoGAN (2016 NIPS)  
ADDA (2017 CVPR)  
UNIT (2017 NIPS)  
ATT (2017 ICML)  
GTA (2019 CVPR)  
MSTN (2018 PMLR)  
DWT (2019 CVPR)  
DWTMEC (2019 CVPR)  
ATM (2020 TPAMI)  
SAR (2020 IJCAI)  
Ours 
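The learning-rate schedule from the experimental setup (a drop by a factor of 10 after 50 and 90 epochs) can be sketched as a small helper; the base learning rate of 0.01 below is an assumed placeholder, not a value reported in the paper. In PyTorch, the same schedule corresponds to `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 90], gamma=0.1)`.

```python
def step_lr(base_lr, epoch, milestones=(50, 90), gamma=0.1):
    # Learning rate in effect at `epoch`: decayed by `gamma`
    # once for every milestone already passed.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Assumed base learning rate of 0.01 for illustration.
lr_early = step_lr(0.01, 10)   # before the first drop
lr_mid = step_lr(0.01, 60)     # after the 50-epoch drop
```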
TABLE II: Classification accuracy (%) on the Office-31 dataset.
Method  A→D  A→W  D→A  D→W  W→A  W→D  Avg

ResNet50  
GFK (2012 CVPR)  
DAN (2015 ICML)  
JAN (2017 ICML)  
MADA (2018 AAAI)  
CAN (2018 CVPR)  
CDAN (2018 NIPS)  
CDAN+E (2018 NIPS)  
CRST (2019 ICCV)  
BNM (2020 CVPR)  
Ours 
TABLE III: Classification accuracy (%) on the Office-Home dataset.
Method  P→R  A→P  C→A  P→C  C→P  P→A  A→C  R→P  A→R  R→C  R→A  C→R  Avg

ResNet50  
DAN (2015 ICML)  
DANN (2016 JMLR)  
JAN (2017 ICML)  
CDANM (2018 NIPS)  
CDAN (2018 NIPS)  
CDAN+E (2018 NIPS)  
DWT (2019 CVPR)  
ATM (2020 TPAMI)  
Ours 
TABLE IV: Classification accuracy (%) on the ImageCLEF-DA dataset.
Method  I→P  P→I  I→C  C→I  C→P  P→C  Avg

ResNet50  
DAN (2015 ICML)  
DANN (2016 JMLR)  
CAN (2018 CVPR)  
JAN (2017 ICML)  
CDAN+E (2018 NIPS)  
CDANM (2018 NIPS)  
iCAN (2018 CVPR)  
MEDA (2018 ACM MM)  
ATM (2020 TPAMI)  
DSAN (2020 TNNLS)  
SAR (2020 IJCAI)  
Ours 
TABLE V: Per-class classification accuracy (%) on the VisDA-2017 dataset.
Method  aero  truck  train  skate  person  plant  motor  knife  horse  car  bus  bicycle  Avg

ResNet101  
DAN (2015 ICML)  
DANN (2016 JMLR)  
MCD (2018 CVPR)  
BSP+DANN (2019 ICML)  
CDAN (2018 NIPS)  
BSP+CDAN (2019 ICML)  
DSBN+MSTN (2019 CVPR)  
DSAN (2020 TNNLS)  
Ours 
TABLE VI: Ablation study on the Office-31 dataset (classification accuracy, %).
Methods  A→D  A→W  D→A  D→W  W→A  W→D  Avg

ResNet-50  68.9  68.4  62.5  96.7  60.7  99.3  76.1
+TDBN  85.7  85.2  71.2  99.1  68.9  100.0  85.0
+OL  88.8  83.2  72.3  99.0  70.9  100.0  85.7
Ours (TDBN+OL)  95.5  87.5  74.7  100.0  71.3  100.0  88.2
IV-B Results
Office-31. Table II quantifies the performance of our model employing TDBN and orthogonal weights, and compares it with state-of-the-art records on the Office-31 dataset. TDBN means that we replace the batch normalization layers with TDBN layers; we then use orthogonal weights in the fully-connected layers. In standard domain adaptation, our model outperforms the compared methods, most of which consider feature alignment without inter-domain relationships. From Table II, our model also behaves better than the several compared methods that do account for the relationships within the features.
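As a rough illustration of the orthogonality side of this design, a common soft orthogonality penalty on fully-connected weights is ||W Wᵀ − I||²_F [19]; the exact form of our OL term is not reproduced here, so the following NumPy sketch should be read as an assumption based on that generic penalty, not as the paper's implementation.

```python
import numpy as np

def orthogonality_penalty(W):
    # Soft orthogonality penalty ||W W^T - I||_F^2 for a weight
    # matrix W with one output unit per row; it is zero iff the
    # rows of W are orthonormal. Hypothetical form of the OL term.
    gram = W @ W.T
    return float(np.sum((gram - np.eye(W.shape[0])) ** 2))
```

A perfectly orthonormal weight matrix (e.g., the identity) incurs zero penalty, so minimizing this term pushes the rows of W toward mutual decorrelation.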
Office-Home. Table III shows that ours exceeds the compared methods in average accuracy by more than 3.6%. Moreover, Office-Home involves 12 transfer tasks, and from Table III we can easily find that our method achieves the best result on all of them. Since Office-Home is a large-scale dataset, it can be concluded that our proposed method not only works well on standard benchmarks but also generalizes well to large-scale datasets.
ImageCLEF-DA. The classification results on ImageCLEF-DA are shown in Table IV. Our model outperforms the other compared methods on most transfer tasks. In particular, ours surpasses DSAN by 3.3%, which implies that redundancy reduction is beneficial for transferability.
VisDA-2017. Table V reports the classification results of all the methods on the VisDA-2017 dataset. As Table V shows, the proposed UDA model consistently achieves performance gains on all the adaptation tasks.
MNIST-USPS-SVHN. We also consider the digit classification tasks MNIST→USPS, USPS→MNIST, and SVHN→MNIST. Experimental results are recorded in Table I. The proposed UDA model is superior to the other counterparts overall.
Overall Analysis. The above experimental results reveal some insightful observations: 1) In UDA, DWT [20] directly utilizes a decorrelated batch normalization method and outperforms the previous methods, but it only considers aligning the global distribution. Differently, our method considers inter-domain and intra-domain feature correlations as well as feature standardization, which captures more information about each category. 2) On object recognition, ours achieves strong performance, which verifies the effectiveness of our model. 3) Compared with several advanced GAN-based UDA methods, our method achieves competitive performance without extra hassle, which shows its efficacy.
Ablation Study. In this subsection, we provide an ablation analysis of our method. We perform ablative experiments on the Office-31 dataset to analyze the functionality of each component in our UDA model. Table VI summarizes the ablation results on the Office-31 dataset using ResNet-50 as the baseline, where the last column of the table reports the average recognition accuracy of ours and of two other configurations, i.e., TDBN without the orthogonality loss, and the orthogonality loss (OL) without TDBN. The comparative results show that both TDBN and OL greatly improve the performance of ResNet-50 at a competitive level, and their cooperation in our model yields further performance gains. This implies that they are mutually complementary. This is in line with the previous claim that both schemes are conducted in a bi-level way: TDBN reduces the domain-specific feature redundancy, while OL reduces the domain-invariant feature redundancy.
To further observe their ability to narrow the distribution gap across domains, the $\mathcal{A}$-distance [71] is used here as the distribution discrepancy between the source and target domains to evaluate the functionality of each component of our model. Following [71], the $\mathcal{A}$-distance is defined as $d_{\mathcal{A}} = 2(1 - 2\epsilon)$, where $\epsilon$ is the classification error of a binary domain classifier trained to discriminate the source and target domains. We select all the transfer tasks on Office-31 to show the $\mathcal{A}$-distance bridged by ResNet-50, TDBN, OL, and our model, respectively, as in Fig. 5. From this figure, we can observe that simultaneously reducing inter-domain and intra-domain feature redundancy can greatly reduce the $\mathcal{A}$-distance across domains.
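The computation behind this evaluation is simple enough to sketch; assuming `eps` is the held-out error of the binary domain classifier, the proxy $\mathcal{A}$-distance from [71] is:

```python
def a_distance(eps):
    # Proxy A-distance d_A = 2(1 - 2*eps), where eps is the test error
    # of a binary classifier separating source from target features [71].
    return 2.0 * (1.0 - 2.0 * eps)

# A classifier at chance level (eps = 0.5) gives d_A = 0: the domains
# are indistinguishable. A perfect classifier (eps = 0) gives d_A = 2.
```

Smaller values therefore indicate a better-aligned feature space, which is the quantity plotted in Fig. 5.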
V Conclusion
In this paper, we propose a simple UDA approach from the perspective of feature redundancy minimization. Compared to existing arts that focus on feature discrimination, the proposed model seeks to reduce feature redundancy, thereby enhancing the generalization ability of the deep UDA model. Our model performs bi-level feature redundancy reduction: the first level decorrelates domain-specific features as well as matches feature distributions; the second level stabilizes the distribution of network activations, regularizes deep neural networks, and learns compact yet diverse features. Experiments on several large-scale publicly available image classification datasets show the superiority of our model against recent state-of-the-art traditional and deep counterparts.
References
 [1] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” arXiv preprint arXiv:1412.3474, 2014.

 [2] B. Sun and K. Saenko, “Deep coral: Correlation alignment for deep domain adaptation,” in European Conference on Computer Vision. Springer, 2016, pp. 443–450.
 [3] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Deep transfer learning with joint adaptation networks,” in International Conference on Machine Learning. PMLR, 2017, pp. 2208–2217.
 [4] M. Long, Z. Cao, J. Wang, and M. I. Jordan, “Conditional adversarial domain adaptation,” in Advances in Neural Information Processing Systems, 2018, pp. 1640–1650.
 [5] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, pp. 1345–1359, 2009.
 [6] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel twosample test,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 723–773, 2012.
 [7] W. Zhang, X. Zhang, L. Lan, and Z. Luo, “Maximum mean and covariance discrepancy for unsupervised domain adaptation,” Neural Processing Letters, vol. 51, no. 1, pp. 347–366, 2020.
 [8] W. Dai, Q. Yang, G.R. Xue, and Y. Yu, “Boosting for transfer learning,” in Proceedings of the 24th international conference on Machine learning, 2007, pp. 193–200.
 [9] Y. Xu, S. J. Pan, H. Xiong, Q. Wu, R. Luo, H. Min, and H. Song, “A unified framework for metric transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 6, pp. 1158–1171, 2017.

 [10] Y.-H. Hubert Tsai, Y.-R. Yeh, and Y.-C. Frank Wang, “Learning cross-domain landmarks for heterogeneous domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5081–5090.
 [11] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, “Maximum classifier discrepancy for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3723–3732.
 [12] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
 [13] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7167–7176.
 [14] Z. Pei, Z. Cao, M. Long, and J. Wang, “Multiadversarial domain adaptation,” arXiv preprint arXiv:1809.02176, 2018.
 [15] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [16] Y. Li, N. Wang, J. Shi, X. Hou, and J. Liu, “Adaptive batch normalization for practical domain adaptation,” Pattern Recognition, vol. 80, pp. 109–117, 2018.
 [17] X. Wang, Y. Jin, M. Long, J. Wang, and M. I. Jordan, “Transferable normalization: Towards improving transferability of deep neural networks,” in Advances in Neural Information Processing Systems, 2019, pp. 1953–1963.
 [18] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai et al., “Recent advances in convolutional neural networks,” Pattern Recognition, vol. 77, pp. 354–377, 2018.
 [19] N. Bansal, X. Chen, and Z. Wang, “Can we gain more from orthogonality regularizations in training deep networks?” in Advances in Neural Information Processing Systems, 2018, pp. 4261–4271.
 [20] S. Roy, A. Siarohin, E. Sangineto, S. R. Bulo, N. Sebe, and E. Ricci, “Unsupervised domain adaptation using featurewhitening and consensus loss,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9471–9480.
 [21] X. Yu, T. Liu, M. Gong, K. Zhang, K. Batmanghelich, and D. Tao, “Transfer learning with label noise,” arXiv preprint arXiv:1707.09724, 2017.
 [22] M. Jing, J. Zhao, J. Li, L. Zhu, Y. Yang, and H. T. Shen, “Adaptive component embedding for domain adaptation,” IEEE Transactions on Cybernetics, 2020.
 [23] W.G. Chang, T. You, S. Seo, S. Kwak, and B. Han, “Domainspecific batch normalization for unsupervised domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [24] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo, “Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [25] W. Zellinger, T. Grubinger, E. Lughofer, T. Natschläger, and S. SamingerPlatz, “Central moment discrepancy (cmd) for domaininvariant representation learning,” arXiv preprint arXiv:1702.08811, 2017.
 [26] J. Shen, Y. Qu, W. Zhang, and Y. Yu, “Wasserstein distance guided representation learning for domain adaptation,” arXiv preprint arXiv:1707.01217, 2017.
 [27] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domainadversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
 [28] W. Zhang, X. Zhang, L. Lan, and Z. Luo, “Enhancing unsupervised domain adaptation by discriminative relevance regularization,” Knowledge and Information Systems, vol. 62, no. 9, pp. 3641–3664, 2020.
 [29] J. Hoffman, E. Tzeng, T. Park, J.Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycleconsistent adversarial domain adaptation,” in International conference on machine learning. PMLR, 2018, pp. 1989–1998.
 [30] L. Jingjing, C. Erpeng, D. Zhengming, Z. Lei, L. Ke, and S. H. Tao, “Maximum density divergence for domain adaptation,” arXiv preprint arXiv:2004.12615, 2020.
 [31] Y. Yao, L. Rosasco, and A. Caponnetto, “On early stopping in gradient descent learning,” Constructive Approximation, vol. 26, no. 2, pp. 289–315, 2007.
 [32] A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Advances in neural information processing systems, 1992, pp. 950–957.
 [33] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
 [34] H. Wang, W. Fan, P. S. Yu, and J. Han, “Mining conceptdrifting data streams using ensemble classifiers,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003, pp. 226–235.
 [35] A. Vezhnevets and V. Vezhnevets, “Modest adaboostteaching adaboost to generalize better,” in Graphicon, vol. 12, no. 5, 2005, pp. 987–997.
 [36] R. Reed, S. Oh, R. Marks et al., “Regularization using jittered training data,” in International Joint Conference on Neural Networks, vol. 3, 1992, pp. 147–152.
 [37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [38] S. Srinivas and R. V. Babu, “Generalized dropout,” arXiv preprint arXiv:1611.06791, 2016.
 [39] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in International conference on machine learning, 2013, pp. 1058–1066.
 [40] J. Kohler, H. Daneshmand, A. Lucchi, M. Zhou, K. Neymeyr, and T. Hofmann, “Towards a theoretical understanding of batch normalization,” stat, vol. 1050, p. 27, 2018.
 [41] L. Huang, D. Yang, B. Lang, and J. Deng, “Decorrelated batch normalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 791–800.
 [42] L. Huang, X. Liu, B. Lang, A. W. Yu, Y. Wang, and B. Li, “Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks,” arXiv preprint arXiv:1709.06079, 2017.
 [43] G. Yang, J. Pennington, V. Rao, J. SohlDickstein, and S. S. Schoenholz, “A mean field theory of batch normalization,” arXiv preprint arXiv:1902.08129, 2019.
 [44] D. Xie, J. Xiong, and S. Pu, “All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6176–6185.
 [45] L. Huang, L. Liu, F. Zhu, D. Wan, Z. Yuan, B. Li, and L. Shao, “Controllable orthogonalization in training dnns,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6429–6438.
 [46] J. Wang, Y. Chen, R. Chakraborty, and S. X. Yu, “Orthogonal convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 505–11 515.
 [47] L. Huang, Y. Zhou, F. Zhu, L. Liu, and L. Shao, “Iterative normalization: Beyond standardization towards efficient whitening,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [48] K. Jia, S. Li, Y. Wen, T. Liu, and D. Tao, “Orthogonal deep neural networks,” IEEE transactions on pattern analysis and machine intelligence, 2019.
 [49] J. Balogh and B. Tóth, “Global optimization on stiefel manifolds: a computational approach,” Central European Journal of Operations Research, vol. 13, no. 3, p. 213, 2005.
 [50] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A simple proof of the restricted isometry property for random matrices,” Constructive Approximation, vol. 28, no. 3, pp. 253–263, 2008.
 [51] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in European conference on computer vision. Springer, 2010, pp. 213–226.
 [52] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan, “Deep hashing network for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5018–5027.
 [53] X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko, “Visda: The visual domain adaptation challenge,” arXiv preprint arXiv:1710.06924, 2017.
 [54] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” 2011.
 [55] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [56] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2066–2073.
 [57] W. Zhang, W. Ouyang, W. Li, and D. Xu, “Collaborative and adversarial network for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3801–3809.
 [58] Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang, “Confidence regularized selftraining,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5982–5991.
 [59] S. Cui, S. Wang, J. Zhuo, L. Li, Q. Huang, and Q. Tian, “Towards discriminability and diversity: Batch nuclearnorm maximization under label insufficient situations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3941–3950.
 [60] X. Chen, S. Wang, M. Long, and J. Wang, “Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation,” in International Conference on Machine Learning, 2019, pp. 1081–1090.
 [61] W.G. Chang, T. You, S. Seo, S. Kwak, and B. Han, “Domainspecific batch normalization for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7354–7362.
 [62] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li, “Deep reconstructionclassification networks for unsupervised domain adaptation,” in European Conference on Computer Vision. Springer, 2016, pp. 597–613.
 [63] S. Wang and L. Zhang, “Selfadaptive reweighted adversarial domain adaptation,” arXiv preprint arXiv:2006.00223, 2020.
 [64] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4068–4076.
 [65] J. Wang, W. Feng, Y. Chen, H. Yu, M. Huang, and P. S. Yu, “Visual domain adaptation with manifold embedded distribution alignment,” in Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 402–410.
 [66] M.Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” in Advances in neural information processing systems, 2016, pp. 469–477.
 [67] M.Y. Liu, T. Breuel, and J. Kautz, “Unsupervised imagetoimage translation networks,” in Advances in neural information processing systems, 2017, pp. 700–708.
 [68] K. Saito, Y. Ushiku, and T. Harada, “Asymmetric tritraining for unsupervised domain adaptation,” arXiv preprint arXiv:1702.08400, 2017.
 [69] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa, “Generate to adapt: Aligning domains using generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8503–8512.
 [70] S. Xie, Z. Zheng, L. Chen, and C. Chen, “Learning semantic representations for unsupervised domain adaptation,” in International Conference on Machine Learning, 2018, pp. 5423–5432.
 [71] S. BenDavid, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine learning, vol. 79, no. 12, pp. 151–175, 2010.
Appendix A Basic Inequality Theorem
Basic Inequality Theorem. For any positive real numbers $a_1, a_2, \dots, a_n$, there holds:
$$\frac{a_1 + a_2 + \cdots + a_n}{n} \ge \sqrt[n]{a_1 a_2 \cdots a_n}, \qquad (16)$$
where the equality holds if and only if $a_1 = a_2 = \cdots = a_n$.
Proof. If $a_1, \dots, a_n$ are positive real numbers, the arithmetic mean is $A_n = \frac{1}{n}\sum_{i=1}^{n} a_i$ and the geometric mean is $G_n = \sqrt[n]{a_1 a_2 \cdots a_n}$; we aim to draw the basic conclusion that $A_n \ge G_n$.
Suppose $a_1, a_2, \dots, a_n$ are arranged in an ascending order. When $n = 1$, the inequality obviously holds. Suppose the inequality holds when $n - 1$. Then, when $n$, the arithmetic mean is
$$A_n = \frac{a_1 + a_2 + \cdots + a_n}{n}. \qquad (17)$$
Owing to the supposed sequence being ascending, we can get:
$$a_1 \le A_n \le a_n. \qquad (18)$$
According to Eq. (18), we replace $a_1$ with $A_n$ and $a_n$ with $a_1 + a_n - A_n$, respectively, to keep the arithmetic mean unchanged; then the new sequence is $A_n, a_2, \dots, a_{n-1}, a_1 + a_n - A_n$, and its mean is still $A_n$. So the subsequence without the first element $A_n$ is:
$$a_2, a_3, \dots, a_{n-1}, a_1 + a_n - A_n, \qquad (19)$$
whose arithmetic mean is $\frac{nA_n - A_n}{n-1} = A_n$. In terms of our hypothesis, the geometric mean of (19) is not larger than $A_n$, and then we can yield:
$$A_n^{\,n-1} \ge a_2 a_3 \cdots a_{n-1} (a_1 + a_n - A_n). \qquad (20)$$
Let us multiply both sides by $A_n$. Eq. (20) will become:
$$A_n^{\,n} \ge A_n a_2 a_3 \cdots a_{n-1} (a_1 + a_n - A_n). \qquad (21)$$
According to Eq. (22),
$$A_n (a_1 + a_n - A_n) - a_1 a_n = (A_n - a_1)(a_n - A_n) \ge 0. \qquad (22)$$
Then, $A_n^{\,n} \ge a_1 a_2 \cdots a_n$, i.e., $A_n \ge G_n$. Obviously, the equality holds only when $a_1 = a_2 = \cdots = a_n$.
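The inequality proved above is easy to check numerically; the following sketch verifies that the arithmetic mean dominates the geometric mean on random positive inputs, together with the equality case.

```python
import math
import random

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # Computed via logarithms for numerical stability on positive inputs.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

random.seed(0)
for _ in range(1000):
    xs = [random.uniform(0.1, 10.0) for _ in range(random.randint(1, 8))]
    # AM >= GM, up to floating-point tolerance.
    assert arithmetic_mean(xs) >= geometric_mean(xs) - 1e-9

# Equality holds when all elements are equal.
assert abs(arithmetic_mean([3.0] * 5) - geometric_mean([3.0] * 5)) < 1e-9
```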
Mengzhu Wang received a Bachelor’s degree in Information and Computing Science from Tianjin University of Commerce, Tianjin, China, in 2016. She received the Master’s degree from Chongqing University (CQU), China, in 2018. Since 2019, she has been pursuing the Ph.D. degree with the School of Computer Science, National University of Defense Technology, Changsha, China. Her current research interests include transfer learning, image segmentation, and computer vision. 
Xiang Zhang received the M.S., and Ph.D. degrees from the National University of Defense Technology, Changsha, China, in 2010 and 2015, respectively. He is currently a research assistant with the Institute for Quantum Information State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology. His current research interests include computer vision and machine learning. 
Long Lan is currently a lecturer with the College of Computer, National University of Defense Technology. He received the Ph.D. degree in computer science from the National University of Defense Technology in 2017. He was a visiting Ph.D. student at the University of Technology Sydney from 2015 to 2017. His research interests include multi-object tracking, computer vision, and discrete optimization. 
Wei Wang is currently a Ph.D. candidate at the School of Software Technology, Dalian University of Technology, Dalian, China. He received the B.S. degree at the school of science from the Anhui Agricultural University, Hefei, China, in 2015. He received the M.S. degree at the school of computer science and technology from the Anhui University, Hefei, China, in 2018. His major research interests include transfer learning, zeroshot learning, deep learning, etc. 
Huibin Tan received a Bachelor’s degree in Computer Science and Technology from Northeastern University, China, in 2014. From 2014 to 2015, she was a postgraduate student at the School of Computer Science, National University of Defense Technology, and later became a Ph.D. student through a direct Ph.D. application. Since 2016, she has been pursuing the Ph.D. degree with the School of Computer Science, National University of Defense Technology, Changsha, China. Her current research interests include face recognition, visual tracking, and representation learning. 
Zhigang Luo received the B.S., M.S., and Ph.D. degrees from the National University of Defense Technology, Changsha, China, in 1981, 1993, and 2000, respectively. He is currently a Professor with the College of Computer, National University of Defense Technology. His current research interests include machine learning, computer vision and bioinformatics. 