1 Introduction
Visual understanding of an image or video is a longstanding and challenging problem in computer vision. Visual classification, as a fundamental problem of visual understanding, aims to recognize what an image depicts. A well-established route to visual classification is to build a learning model on a collected image dataset, which can be regarded as the target data. However, labeling a large number of target samples is cost-ineffective: it consumes substantial human resources in labor and time, and quickly becomes unrealistic. Therefore, leveraging a source domain that has a different distribution but is semantically related to the target, and that contains sufficient labeled samples, for recognizing target-task samples is becoming an increasingly important topic.
With the explosive increase of multi-source data from the Internet, such as YouTube and Flickr, a large number of labeled web databases can be easily crawled. It is thus natural to consider training a learning model on multi-source web data for recognizing target data. However, a prevailing problem is that distribution mismatch and domain shift [1, 2] across the source and target domains often exist, owing to various factors in computer vision such as resolution, illumination, viewpoint, and background. As a result, classification performance degrades dramatically when the source data used to learn the classifier model has a different distribution from the target data on which the model is applied. This is because the fundamental independent and identically distributed (i.i.d.) condition assumed in statistical learning is no longer satisfied, which promoted the emergence of transfer learning (TL) and domain adaptation (DA) [3, 4, 5]. In early work, TL assumed different joint probability distributions between the source and target domains, i.e., $P_s(X,Y) \neq P_t(X,Y)$. DA assumed different marginal distributions, i.e., $P_s(X) \neq P_t(X)$, but a shared category space between domains, i.e., $\mathcal{Y}_s = \mathcal{Y}_t$. Several related reviews on transfer learning and domain adaptation can be found in [4, 6, 7, 8, 9, 10, 11, 12, 13]. In this paper, we use the general name Transfer Adaptation Learning (TAL) to unify both TL and DA. In the past decade, TAL has been an active area in the machine learning community; its goal is to narrow the distribution gap between source and target data, such that labeled source data from one or more relevant domains can be utilized for executing tasks in the target domain, as illustrated in Fig. 1.
Moving forward, deep learning (DL) techniques [14, 15, 16, 17] have recently become dominant and powerful algorithms for feature representation and abstraction in image classification. In particular, the parameter adaptability and generality of DL models to other target data is noteworthy: a pre-trained deep neural network can be fine-tuned with a small amount of target data. Therefore, fine-tuning has become a commonly used strategy for training deep models and frameworks in various applications, such as object detection [18, 19, 20, 21], person re-identification [22, 23, 24], medical imaging [25, 26, 27], remote sensing [28, 29, 30, 31], etc. Generally, fine-tuning can be regarded as a prototype for bridging big source data and target data [32], which also facilitates the research of visual transfer learning in computer vision. Conceptually, fine-tuning is a big-data-driven transfer learning method, which depends on a model pre-trained on a large source database. The context of the transfer learning challenge and why pre-training of representations can be useful have been formally explored in [32]. More broadly, from the viewpoint of generative learning, the popular generative adversarial net (GAN) [33] and its variants [34, 35, 36, 37, 38], which aim to synthesize plausible images of some target distribution from, for example, a noise signal (source distribution), can also be regarded as generalized transfer learning techniques. Differently, conventional transfer learning approaches put emphasis on output-knowledge (high-level model parameter) adaptation across source and target domains, while GANs focus on input-data (low-level image pixel) adaptation from a source distribution to a target distribution. Recently, image pixel-level transfer has been intensively studied in image-to-image translation
[39, 40, 41, 42, 43], style transfer [44, 45, 46] and target face synthesis (e.g., pose transfer vs. age transfer) [47, 48, 49, 50, 51], etc. In this paper, we focus on technical advances and challenges in model-driven transfer adaptation learning. Learning from multiple sources for transferring or adapting to new target domains offers the possibility of promoting model generalization and of understanding the essence of biological learning. Transfer adaptation learning is similar to, but different from, multi-task learning, which resorts to shared feature representations or classifiers learned simultaneously for related tasks [52]. In the past decade, a number of transfer learning and domain adaptation approaches have emerged. In this paper, the challenges and advances in this research field are identified and surveyed. Specifically, we explore five key challenges of transfer adaptation learning, which go beyond the early semi-supervised and unsupervised split.

Instance Reweighting Adaptation. Due to the probability distribution discrepancy across domains, it is natural to account for the difference by directly inferring the resampling weights of instances based on feature distribution matching across source and target data in a nonparametric manner. Parameter estimation of the weights under a parametric distribution assumption remains a challenge.

Feature Adaptation. To adapt data from multiple sources, a common approach is to learn a feature subspace or representation in which the projected source and target domains have similar distributions. The heterogeneity of data distributions makes it challenging to obtain such a generic feature.

Classifier Adaptation. A classifier trained on instances of the source domain is often biased when recognizing instances from the target domain due to domain shift. Learning a generalized classifier from multiple domains that can be applied to other, different domains is a challenging topic.

Deep Network Adaptation. Deep neural networks are recognized for their strong feature representation power, but a typical deep model is built on a single domain. Large domain shift makes it challenging to train deep neural networks that yield transferable deep representations.

Adversarial Adaptation. Adversarial learning originates from generative adversarial nets. The objective of TL/DA is to bring the source and target domains closer in feature space. This amounts to confusing the two domains such that they cannot be easily discriminated. Therefore, a technical challenge arises in achieving domain confusion by using adversarial training and gaming strategies.
For each challenge, taxonomic classes and subclasses are presented to structure the recent work in transfer adaptation learning. We start with a discussion of weakly-supervised learning perspectives in Section 2, which is followed by the technical advances in transfer adaptation learning, including instance reweighting adaptation (Section 3), feature adaptation (Section 4), classifier adaptation (Section 5), deep network adaptation (Section 6), and adversarial adaptation (Section 7). The existing benchmarks and future challenging tasks are discussed in Section 8 and the paper is concluded in Section 9.
2 Weakly-supervised Learning Perspective
The concept of weak learning originated about 20 years ago in the AdaBoost [53] and ensemble learning [54] algorithms, which tend to combine multiple weak learners to solve a problem. AdaBoost, which has been listed among the top 10 algorithms in data mining [55], learns multiple weak learners, each obtained by training on the re-weighted, previously misclassified examples. By ensembling multiple weak learners, performance is significantly boosted. Although the weak-learning concept was proposed as early as 1997, problems in that era were still formulated under strong supervision due to the relatively small data scale; that is, the early problems could be strongly learned by conventional statistical learning models. Today, in the big data era, the problem truly becomes one of weak supervision, due to the inaccurate, inexact, and incomplete characteristics of data labels [56], and therefore has to be weakly learned. Currently, weakly-supervised learning is becoming a leading research topic. Undoubtedly, transfer adaptation learning, which resorts to solving cross-domain problems, is also a kind of weakly-supervised learning methodology. This section presents typical weakly-supervised learning frameworks and perspectives.
2.1 Semi-supervised Learning
Semi-supervised learning (SSL) aims to solve problems where the dataset contains a large amount of unlabeled examples and only a few labeled examples [57, 58]. Generally, semi-supervised learning methods fall into four categories. (i) Generative methods, which model the labeled and unlabeled data via an inherent generative model [59, 60, 61]. (ii) Low-density separation methods, which constrain the classifier boundary to cross low-density regions [62, 63, 64]. (iii) Disagreement-based methods, which advocate co-training of multiple learners for annotating the unlabeled instances [65, 66, 67]. (iv) Graph-based methods, which build a connection graph over the training instances for label propagation through graph modeling [68, 69, 70, 71]. Good literature reviews of semi-supervised learning can be found in [72, 73].
Consider a general SSL framework; the following expected risk is generally minimized:

(1) $\min_{\theta} \; \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim P} \left[ \ell(f(\mathbf{x}; \theta), \mathbf{y}) \right]$

where $P$ is the probability distribution, $\mathbf{X} \in \mathbb{R}^{d \times n}$ is the data, $\mathbf{Y} \in \mathbb{R}^{n \times c}$ is the label matrix in which a zero vector is posed for unlabeled samples, and $\theta$ is the model parameter. $d$, $n$ and $c$ denote the dimensionality, the number of samples, and the number of classes of the data, respectively. The training data usually comes from a subset; therefore, the regularized risk, i.e., the average empirical risk with regularization, is minimized:

(2) $\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i; \theta), \mathbf{y}_i) + \lambda \Omega(\theta)$

where $\ell(\cdot, \cdot)$ is the prediction loss function, the first term is the average empirical risk (e.g., mean squared loss) on the training data, and $\Omega(\theta)$ is the regularizer. A general SSL model with graph-based manifold regularization can be written as

(3) $\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i; \theta), \mathbf{y}_i) + \lambda \Omega(\theta) + \gamma \sum_{i,j=1}^{n} W_{ij} \left\| \hat{\mathbf{y}}_i - \hat{\mathbf{y}}_j \right\|^2$

where $\hat{\mathbf{y}}_i$ is the predicted label for sample $\mathbf{x}_i$ and $\mathbf{W}$ is the affinity matrix used for locality preservation. Usually, $W_{ij} = 1$ if $\mathbf{x}_i$ and $\mathbf{x}_j$ are neighbors, and $W_{ij} = 0$ otherwise. The key difference from transfer learning is that the marginal distribution and the label space distribution are the same, i.e., $P_{tr}(X) = P_{te}(X)$ and $\mathcal{Y}_{tr} = \mathcal{Y}_{te}$. Generally, SSL attempts to exploit the unlabeled data as auxiliary information for training on the labeled data without human intervention, because the distribution of unlabeled data can intrinsically reflect sample class information. In SSL models, three basic assumptions have been established: smoothness, cluster, and manifold. The smoothness assumption states that data are distributed with varying density, and two instances falling into the same high-density region have the same label. The cluster assumption states that data have an inherent cluster structure, and two samples in the same cluster are more similar. The manifold assumption means that the data lie on a manifold, and instances in a small local neighborhood have similar semantics. The three basic assumptions are visually shown in Fig. 2.
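To make the graph-based manifold regularization of Eq.(3) concrete, the following minimal sketch propagates labels over a kNN affinity graph. The function `label_propagation`, the binary affinities, and the damping factor `alpha` are illustrative assumptions for this survey, not a specific published algorithm.

```python
import numpy as np

def label_propagation(X, y, alpha=0.9, k=3, n_iter=50):
    """Graph-based SSL sketch: propagate labels over a kNN affinity graph.

    X: (n, d) data matrix; y: (n,) integer labels, -1 for unlabeled samples.
    Returns predicted labels for all n samples.
    """
    n = X.shape[0]
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # binary kNN affinity: W_ij = 1 if x_i and x_j are neighbors, else 0
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d2[i])[1:k + 1]] = 1.0
    W = np.maximum(W, W.T)                       # symmetrize the graph
    Dm = np.diag(1.0 / np.sqrt(W.sum(1)))
    S = Dm @ W @ Dm                              # normalized affinity
    classes = np.unique(y[y >= 0])
    Y = np.zeros((n, len(classes)))              # zero rows for unlabeled points
    for j, c in enumerate(classes):
        Y[y == c, j] = 1.0
    F = Y.copy()
    for _ in range(n_iter):                      # F <- alpha*S*F + (1-alpha)*Y
        F = alpha * (S @ F) + (1 - alpha) * Y
    return classes[F.argmax(1)]
```

On two well-separated clusters with a single labeled point each, the propagation recovers cluster membership, which reflects the cluster and smoothness assumptions above.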
2.2 Active Learning
Active learning (AL) aims to obtain the ground-truth labels of selected unlabeled instances with human intervention [74, 75], which is different from semi-supervised learning that exploits unlabeled data together with labeled data for improving recognition performance. Specifically, AL progressively selects and annotates the most informative data points from the pool of unlabeled samples, such that the labeling cost for training an effective model can be minimized [76, 77]. There are two engines in the active learning paradigm: a learning engine and a selection engine. The learning engine aims to obtain a baseline classifier, while the selection engine tries to select unlabeled instances and deliver them to human experts for manual annotation. The selection criterion is generally determined based on information uncertainty [74, 75].
2.3 Zero-shot Learning
Recently, zero-shot learning (ZSL) [78, 79, 80, 81, 82], as a typical weakly-supervised learning paradigm, has attracted researchers' attention. ZSL tries to recognize samples of unseen categories that never appear in the training data, i.e., there is no overlap between the seen categories in the training data and the unseen categories in the test data. That is, the label space distribution between training and test data is different, i.e., $\mathcal{Y}_{train} \cap \mathcal{Y}_{test} = \emptyset$, which can be recognized as a special case of transfer learning. This situation often occurs in various fields, because manually annotating tens of thousands of different object classes in the world is quite expensive and almost unrealistic. The general problem of ZSL is as follows.

Zero-shot learning with disjoint training and testing classes. Let $\mathcal{X}$ be an arbitrary feature space of the training data. Let $\mathcal{Y}_s$ and $\mathcal{Y}_u$ be the sets of seen and unseen object categories, respectively, with $\mathcal{Y}_s \cap \mathcal{Y}_u = \emptyset$. The task is to learn a classifier $f: \mathcal{X} \rightarrow \mathcal{Y}_u$ by using the training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $y_i \in \mathcal{Y}_s$.
An extension of ZSL is one/few-shot learning (O/FSL), where a few labeled examples of each unseen object class are revealed during the training process. The usual idea of Z/O/FSL is to learn an embedding of the image features into a semantic space or semantic attributes [79, 83]. Afterwards, recognition of new classes can be conducted by matching the semantic embedding of the visual features with the semantic/attribute representation. However, the visual-semantic mapping learned from the seen categories may not generalize well to the unseen categories due to domain shift, which makes applying transfer learning to ZSL a challenging topic. Actually, for improving ZSL under domain shift, transductive or semi-supervised zero-shot learning approaches have been studied for reducing the difference of visual-semantic mappings between seen and unseen categories [84, 85, 86, 87, 88].
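A minimal sketch of the attribute-embedding idea described above: a ridge-regularized linear map from visual features to the attribute space is fit on seen classes, and unseen-class samples are classified by the nearest unseen-class attribute vector. The function name, the least-squares map, and the Euclidean matching rule are illustrative assumptions rather than a specific method from the cited works.

```python
import numpy as np

def zsl_attribute_classifier(X_seen, y_seen, attr_seen, attr_unseen):
    """Zero-shot sketch: map visual features to attribute space, then match
    unseen-class samples to the nearest unseen-class attribute vector.

    X_seen:      (n, d) features of seen-class training samples
    y_seen:      (n,) seen-class labels in 0..K_s-1
    attr_seen:   (K_s, a) attribute vectors of the seen classes
    attr_unseen: (K_u, a) attribute vectors of the unseen classes
    Returns a function mapping features (m, d) -> unseen-class indices (m,).
    """
    A = attr_seen[y_seen]                      # per-sample attribute targets
    d = X_seen.shape[1]
    # ridge-regularized least squares for the visual->semantic map W (d, a)
    W = np.linalg.solve(X_seen.T @ X_seen + 1e-3 * np.eye(d), X_seen.T @ A)

    def predict(X):
        emb = X @ W                            # project into attribute space
        # nearest unseen-class attribute vector (Euclidean distance)
        dists = ((emb[:, None, :] - attr_unseen[None, :, :]) ** 2).sum(-1)
        return dists.argmin(1)
    return predict
```

The domain-shift problem mentioned above shows up exactly here: `W` is fit only on seen classes, so the projection of unseen-class samples may be systematically biased.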
2.4 Open Set Recognition
Conventional recognition tasks in computer vision, where all testing classes are known at training time, are generally referred to as closed-set recognition. Open set recognition addresses a more realistic scenario where unknown classes can be encountered at testing time [89, 90, 91, 92], which shares very similar characteristics with ZSL. ZSL differs from open set recognition in that the former uses the semantic embedding of visual features for recognizing unknown classes, while the latter focuses on a one-class classification problem.
More recently, a framework similar to transductive ZSL for recognition under domain shift is open-set domain adaptation [93, 94], which was established on the concept of open set recognition. Conventional domain adaptation assumes that the categories in the target domain are known and can be seen in the source domain, while open-set domain adaptation addresses scenarios where the target domain contains instances of categories that are unseen in the source domain [94]. The differences between zero-shot learning and open-set domain adaptation are that (1) ZSL tends to solve the recognition of instances of unseen categories under the same marginal distribution across training and test data, while open-set domain adaptation aims to solve the same problem under different marginal distributions across source and target domains; and (2) generalized ZSL [83, 95] was proposed for the scenario where the training and test classes are not necessarily disjoint, while open-set domain adaptation was proposed for the scenario where a few categories of interest are still shared across source and target data. This paper surveys the mainstream closed-set domain adaptation and transfer learning challenges.
3 Instance Reweighting Adaptation
When the training and test data are drawn from different distributions, this is commonly referred to as sample selection bias or covariate shift [96]. Instance reweighting aims to infer the resampling weights directly by feature distribution matching across different domains in a nonparametric manner. Generally, given a training set $D_{tr} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, a learning model can be obtained by minimizing the following expected risk over the training distribution:

(4) $\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{(\mathbf{x}, y) \sim P_{tr}} \left[ \ell(\mathbf{x}, y, \theta) \right]$

But actually, we are more concerned with the expected risk over the test distribution:

(5) $\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{(\mathbf{x}, y) \sim P_{te}} \left[ \ell(\mathbf{x}, y, \theta) \right] = \arg\min_{\theta} \; \mathbb{E}_{(\mathbf{x}, y) \sim P_{tr}} \left[ \frac{P_{te}(\mathbf{x}, y)}{P_{tr}(\mathbf{x}, y)} \ell(\mathbf{x}, y, \theta) \right]$

where $P_{tr}$ and $P_{te}$ represent the probability distributions of the training and test data, respectively, $\ell(\cdot)$ is the loss function, and $\beta(\mathbf{x}, y) = P_{te}(\mathbf{x}, y) / P_{tr}(\mathbf{x}, y)$ is the ratio between the two probabilities, which amounts to the weighting coefficient. Obviously, when $P_{tr} = P_{te}$, we have $\beta = 1$.

From Eq.(5), we see that $P_{tr}$ and $P_{te}$ can be estimated for computing the weight $\beta$ by following [97], based on prior knowledge of the class distributions. Although this is intuitive, it requires very good density estimation of $P_{tr}$ and $P_{te}$. In particular, small errors or noise in estimating $P_{tr}$ and $P_{te}$ can result in serious overweighting of observations with very large coefficients. Therefore, in order to improve the reliability of the weights, $\beta$ can be directly estimated by imposing flexible constraints into the learning model, without having to estimate the two probability distributions.
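One common way to estimate the ratio $\beta$ without explicit density estimation is the discriminator trick: train a probabilistic classifier to separate source from target samples and convert its posterior into a density ratio. The sketch below, with a hand-rolled logistic regression, is an illustrative assumption and is not the estimator of [97].

```python
import numpy as np

def density_ratio_weights(X_src, X_tgt, n_iter=500, lr=0.1):
    """Estimate importance weights beta(x) ~ p_tgt(x)/p_src(x) for the source
    samples via the discriminator trick: fit a logistic regression that tells
    target (label 1) from source (label 0); then beta(x) = P(t|x)/P(s|x),
    corrected for the source/target sample-size imbalance."""
    X = np.vstack([X_src, X_tgt])
    X = np.hstack([X, np.ones((len(X), 1))])         # bias feature
    y = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):                          # batch gradient descent
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    # posterior P(target | x) evaluated at the source points
    p_t = 1.0 / (1.0 + np.exp(-X[:len(X_src)] @ w))
    return p_t / (1.0 - p_t) * len(X_src) / len(X_tgt)
```

On a shifted Gaussian pair, source samples lying in the high-density region of the target receive larger weights, as Eq.(5) requires.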
Sample reweighting based domain adaptation methods mainly focus on the case where the difference between the source and target domains is not too large. The objective is to reweight the source samples so that the source data distribution moves closer to the target data distribution. Usually, when the distribution difference between the two domains is relatively large, sample reweighting can be combined with other methods (e.g., feature adaptation) as an auxiliary transfer learning step. Instance reweighting has been studied with different models, which can be divided into three categories based on the weighting scheme: (i) Intuitive Weighting, (ii) Kernel Mapping Based Weighting, and (iii) Co-training Based Weighting. These methods put emphasis on the learning or computation of the weights by using different criteria and training protocols. The taxonomy of instance reweighting based models is summarized in Table I.
Reweighting Adaptation | Model Basis | Reference
Intuitive Weighting | Adaptive tuning | [98, 99, 100, 101]
Kernel Map-Based: Distribution Matching | KMM & MMD | [96, 102, 103]
Kernel Map-Based: Sample Selection | Clustering & structural sparsity | [104, 105]
Co-training-Based | Double classifiers | [106, 107]
3.1 Intuitive Weighting
Instance reweighting based domain adaptation was first proposed for natural language processing (NLP) [98, 99]. In [98], Jiang and Zhai proposed an intuitive instance-weighted domain adaptation framework, which introduced several parameters for characterizing the distribution difference between source and target samples. For example, for each labeled source instance $(\mathbf{x}_i^s, y_i^s)$, a parameter $\alpha_i$ was used to indicate how likely $P_t(y_i^s|\mathbf{x}_i^s)$ is close to $P_s(y_i^s|\mathbf{x}_i^s)$, and a parameter $\beta_i$ was ideally computed as $P_t(\mathbf{x}_i^s)/P_s(\mathbf{x}_i^s)$. Obviously, a large $\alpha_i$ means high confidence that the labeled source sample contributes positively to learning effectiveness. A small $\beta_i$ means the two probabilities are very different, and the instance can be discarded in the learning process. Additionally, for each unlabeled target instance $\mathbf{x}_j^t$ and each possible label $y$ in the hypothesis space, a parameter $\gamma_j(y)$ indicates how likely a tentative pseudo-label $y$ can be assigned to $\mathbf{x}_j^t$, so that $(\mathbf{x}_j^t, y)$ is included as a training sample. Generally, $\alpha_i$, $\beta_i$ and $\gamma_j$ play an intuitive role in sample selection by removing misleading source samples and adding valuable pseudo-labeled target samples during the transfer learning process. Although the optimal values of these weights for the target domain are unknown, the intuitions behind them can serve as guidelines for designing heuristic parameter tuning schemes [98]. Therefore, adaptive learning of these intuitive weights remains a challenging issue.

In [99], Wang et al. proposed two instance weighting schemes for neural machine translation (NMT) domain adaptation, i.e., sentence weighting and dynamic domain weighting. Specifically, given the parallel training corpus $D$ consisting of in-domain data $D_{in}$ and out-of-domain data $D_{out}$, the sentence-weighted NMT objective function was written as

(6) $L_{sw}(\theta) = \sum_{(\mathbf{x}, \mathbf{y}) \in D} \lambda(\mathbf{x}, \mathbf{y}) \log P(\mathbf{y}|\mathbf{x}; \theta)$

where $\lambda(\mathbf{x}, \mathbf{y})$ is the weight that scores each sentence pair, $P(\mathbf{y}|\mathbf{x}; \theta)$ is the conditional probability activated by the softmax function, and $\mathbf{x}$ and $\mathbf{y}$ represent the source sentence and target sentence, respectively. For domain weighting (dw), a weight $w$ was designed for the in-domain data, and the NMT objective function Eq.(6) can be transformed as [99]

(7) $L_{dw}(\theta) = w \sum_{(\mathbf{x}, \mathbf{y}) \in D_{in}} \log P(\mathbf{y}|\mathbf{x}; \theta) + \sum_{(\mathbf{x}, \mathbf{y}) \in D_{out}} \log P(\mathbf{y}|\mathbf{x}; \theta)$

A dynamic batch-weight tuning scheme was further proposed that monotonically increases the ratio of in-domain data in the mini-batch, supervised by the training cost. Dai et al. proposed the TrAdaBoost [100] transfer learning method, which leverages a boosting algorithm to automatically tune the weights of the training samples.
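The core of TrAdaBoost is its asymmetric per-round weight update: misclassified source samples are down-weighted (they appear misleading for the target task), while misclassified target samples are up-weighted as in AdaBoost. The helper below sketches one round; its name and argument layout are illustrative, and the clipping of `epsilon` is an added safeguard not in the original algorithm.

```python
import numpy as np

def tradaboost_update(w_src, w_tgt, err_src, err_tgt, epsilon, n_rounds):
    """One TrAdaBoost-style round of weight updates (sketch).

    w_src, w_tgt:     current weights of source / target training samples
    err_src, err_tgt: per-sample absolute errors |h(x) - y| in [0, 1]
    epsilon:          weighted error rate of the learner on the target samples
    n_rounds:         total number of boosting rounds N
    """
    n_src = len(w_src)
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src) / n_rounds))
    epsilon = min(max(epsilon, 1e-8), 0.499)       # keep beta_tgt well-defined
    beta_tgt = epsilon / (1.0 - epsilon)
    new_src = w_src * beta_src ** err_src          # shrink bad source weights
    new_tgt = w_tgt * beta_tgt ** (-err_tgt)       # grow bad target weights
    return new_src, new_tgt
```

Correctly classified samples (error 0) keep their weights unchanged, so the update only acts on the disagreeing instances.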
In [101], Chen et al. proposed a more intuitive weighting based subspace alignment method, which reweights the source samples to generate a source subspace that is close to the target subspace. Let $\mathbf{w}$ denote the weighting vector of the source samples. Obviously, the weight $w_i$ w.r.t. the source sample $\mathbf{x}_i^s$ increases if its distribution is closer to the target data. Therefore, a simple weight assignment strategy was presented for assigning larger weights to the source samples that are closer to the target domain [101].

After obtaining the weight vector $\mathbf{w}$, the weighted source subspace can be obtained by performing PCA on the following covariance matrix of the weighted source data:

(8) $\mathbf{C}_w = \sum_{i=1}^{n_s} w_i \left( \mathbf{x}_i^s - \bar{\mathbf{x}}_w \right) \left( \mathbf{x}_i^s - \bar{\mathbf{x}}_w \right)^{T}$

where $\bar{\mathbf{x}}_w = \sum_{i=1}^{n_s} w_i \mathbf{x}_i^s$ is the weighted mean vector. The leading eigenvectors $\mathbf{P}_s$ then span the source subspace. By performing PCA on the target data, the eigenvectors $\mathbf{P}_t$ span the target subspace. Thereafter, the following unsupervised domain adaptation model, subspace alignment (SA) [108], with Frobenius-norm minimization, was implemented:

(9) $\min_{\mathbf{M}} \; \left\| \mathbf{P}_s \mathbf{M} - \mathbf{P}_t \right\|_F^2$

The subspace alignment matrix $\mathbf{M}$ can be easily solved in closed form with the least-squares solution $\mathbf{M}^{*} = \mathbf{P}_s^{T} \mathbf{P}_t$.
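A minimal sketch of the (unweighted) SA step with the closed-form solution $M^{*} = P_s^T P_t$. Computing the PCA bases via SVD of the centered data, the function names, and the choice to return projected features are illustrative assumptions.

```python
import numpy as np

def subspace_alignment(X_src, X_tgt, dim=2):
    """Subspace Alignment (SA) sketch: align the PCA source basis to the
    PCA target basis with the closed-form M* = Ps^T Pt, then project.
    Returns the aligned source features and the projected target features.
    """
    def pca_basis(X, d):
        Xc = X - X.mean(0)
        # principal directions via SVD of the centered data matrix
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Vt[:d].T                       # (features, d) orthonormal basis

    Ps = pca_basis(X_src, dim)
    Pt = pca_basis(X_tgt, dim)
    M = Ps.T @ Pt                             # least-squares solution of Eq.(9)
    Zs = (X_src - X_src.mean(0)) @ Ps @ M     # source: project, then align
    Zt = (X_tgt - X_tgt.mean(0)) @ Pt         # target: project only
    return Zs, Zt
```

A target-domain classifier can then be trained on `Zs` and applied to `Zt`, since both now live in the aligned target subspace.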
3.2 Kernel Mapping Based Weighting
The intuitive weighting based domain adaptation discussed above operates in the raw data space. In order to infer the sampling weights by distribution matching across source and target data in a feature space in a nonparametric way, kernel mapping based weighting was proposed. Briefly, the distribution difference between source and target data can be better characterized by reweighting the source samples such that the means of the source and target instances in a reproducing kernel Hilbert space (RKHS) are close [96]. Kernel mapping based weighting consists of two categories of methods: Distribution Matching [96, 102, 103] and Sample Selection [104, 105].

(1) Distribution Matching. The intuitive idea of distribution matching is to match the means of the source and target data in an RKHS by resampling the weights of the source data. Two similar distribution matching criteria, i.e., kernel mean matching (KMM) [96] and maximum mean discrepancy (MMD) [109, 110], have been used as nonparametric statistics to measure the distribution difference. Specifically, Huang et al. [96] first proposed to reweight the source samples with weights $\boldsymbol{\beta}$, such that the discrepancy between the means of the target data and the weighted source data is minimized:

(10) $\min_{\boldsymbol{\beta}} \; \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} \beta_i \varphi(\mathbf{x}_i^s) - \frac{1}{n_t} \sum_{j=1}^{n_t} \varphi(\mathbf{x}_j^t) \right\|_{\mathcal{H}}^2, \quad \text{s.t. } \beta_i \in [0, B], \; \left| \frac{1}{n_s} \sum_{i=1}^{n_s} \beta_i - 1 \right| \leq \epsilon$

where $\varphi(\cdot)$ is the nonlinear mapping function into the RKHS $\mathcal{H}$.
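The KMM problem in Eq.(10) is usually solved as a quadratic program; the sketch below instead uses a simple projected gradient scheme. The solver choice, the RBF bandwidth `sigma`, and the rescaling step are illustrative assumptions, not the procedure of [96].

```python
import numpy as np

def kernel_mean_matching(X_src, X_tgt, sigma=1.0, B=10.0, n_iter=2000, lr=1.0):
    """Kernel Mean Matching (KMM) sketch via projected gradient descent.
    The weights beta minimize the RKHS distance of Eq.(10) with an RBF
    kernel, clipped to [0, B] and rescaled so the mean weight stays near 1.
    """
    def rbf(A, C):
        d2 = (A**2).sum(1)[:, None] + (C**2).sum(1)[None, :] - 2 * A @ C.T
        return np.exp(-d2 / (2 * sigma**2))

    ns, nt = len(X_src), len(X_tgt)
    K = rbf(X_src, X_src)                          # (ns, ns) source Gram matrix
    kappa = rbf(X_src, X_tgt).sum(1) * ns / nt     # cross term with the target
    beta = np.ones(ns)
    for _ in range(n_iter):
        grad = (K @ beta - kappa) / ns**2          # gradient of the objective
        beta -= lr * grad
        beta = np.clip(beta, 0.0, B)               # box constraint
        beta *= ns / max(beta.sum(), 1e-12)        # keep the mean weight near 1
        beta = np.clip(beta, 0.0, B)
    return beta
```

Source samples lying where the target density is high end up with weights above 1, and samples far from the target support are driven toward 0.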
Chu et al. [102] further proposed a selective transfer machine (STM) by minimizing the KMM objective for distribution matching, while simultaneously minimizing the empirical risk of the classifier learned on the reweighted training samples:

(11) $(\mathbf{w}^{*}, \theta^{*}) = \arg\min_{\mathbf{w}, \theta} \; R(D_{tr}, \mathbf{w}, \theta) + \lambda \Omega_{KMM}(\mathbf{w}; D_{tr}, D_{te})$

where $R(\cdot)$ is the empirical risk (loss) on the training set $D_{tr}$, $\Omega_{KMM}(\cdot)$ indicates the distribution mismatch formulated by KMM, $\mathbf{w}$ is the weighting vector of the source samples, and $\theta$ denotes the classifier parameters. From Eq.(11), the KMM based distribution mismatch plays the role of a regularizer on the sampling weights.

More recently, Yan et al. [103] proposed a weighted MMD (WMMD) for domain adaptation, implemented with a convolutional neural network. WMMD overcomes the flaw of conventional MMD, which ignores class weight bias and assumes the same class weights between the source and target domains. WMMD is formulated as [103]

(12) $\left\| \frac{1}{\sum_{i=1}^{n_s} \alpha_{y_i^s}} \sum_{i=1}^{n_s} \alpha_{y_i^s} \varphi(\mathbf{x}_i^s) - \frac{1}{n_t} \sum_{j=1}^{n_t} \varphi(\mathbf{x}_j^t) \right\|_{\mathcal{H}}^2$

where $\alpha_{y_i^s}$ is the class weight w.r.t. the class $y_i^s$ of the source sample $\mathbf{x}_i^s$ and $\varphi(\cdot)$ is the nonlinear mapping into the RKHS $\mathcal{H}$. $n_s$ and $n_t$ denote the number of samples drawn from the source and target domains, respectively.
(2) Sample Selection is another kind of kernel mapping based reweighting method. Zhong et al. [104] proposed a cluster based sample selection method, KMapWeighted, which was established on the assumption that the kernel mapping can make the marginal distributions across domains similar, but the conditional probabilities between the two domains after kernel mapping are still different. Therefore, in the RKHS, they further select those source samples that are more likely similar to the target data via a K-means based clustering criterion: the data in the same cluster should share the same labels, and the source samples with labels similar to the target data were selected.
Long et al. [105] proposed the transfer joint matching (TJM) method for domain adaptation by minimizing the MMD based distribution mismatch between source and target data, in which the transformation matrix was imposed with structural sparsity (i.e., an $\ell_{2,1}$-norm regularization constraint) for sample selection. Larger coefficients then correspond to stronger correlation between the source samples and the target domain. The TJM model is given as [105]

(13) $\min_{\mathbf{A}} \; \mathrm{tr}\!\left( \mathbf{A}^{T} \mathbf{K} \mathbf{M} \mathbf{K}^{T} \mathbf{A} \right) + \lambda \left( \left\| \mathbf{A}_s \right\|_{2,1} + \left\| \mathbf{A}_t \right\|_F^2 \right), \quad \text{s.t. } \mathbf{A}^{T} \mathbf{K} \mathbf{H} \mathbf{K}^{T} \mathbf{A} = \mathbf{I}$

where the $\ell_{2,1}$-norm on the source transformation $\mathbf{A}_s$ means that source outliers can be excluded when transferring to the target domain, the target transformation $\mathbf{A}_t$ is regularized for smoothness, $\mathbf{M}$ is the matrix deduced from MMD, $\mathbf{H}$ is the centering matrix, and $\mathbf{K}$ is the kernel matrix.

3.3 Co-training Based Weighting
Co-training [66] assumes that the dataset is characterized by two different views, on which two classifiers are separately learned. Unlabeled inputs predicted with high confidence by one of the two classifiers can be moved into the training set of the other. In weighting based transfer learning, Chen et al. proposed the CODA [106] method, in which two classifiers with different weight vectors were trained and jointly minimized with weighting, so that both classifiers are better trained on the training set. In essence, classifier based sample reweighting is similar in spirit to TrAdaBoost [100] and KMapWeighted [104].
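A minimal co-training loop over two views can be sketched as follows; the nearest-centroid base learner and the distance-margin confidence are illustrative assumptions standing in for the classifiers of [66] or CODA [106].

```python
import numpy as np

def co_training(X1, X2, y, n_rounds=5, n_add=2):
    """Co-training sketch over two feature views X1, X2 (y = -1 if unlabeled).
    Each round, a nearest-centroid classifier per view pseudo-labels its most
    confident unlabeled points, growing the shared labeled set."""
    y = y.copy()
    classes = np.unique(y[y >= 0])

    def centroid_predict(X, labels):
        cents = np.array([X[labels == c].mean(0) for c in classes])
        d = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(-1)
        margin = np.partition(d, 1, axis=1)[:, 1] - d.min(1)  # confidence
        return classes[d.argmin(1)], margin

    for _ in range(n_rounds):
        if (y < 0).sum() == 0:
            break
        for Xv in (X1, X2):
            pred, conf = centroid_predict(Xv, y)
            unl = np.where(y < 0)[0]
            if len(unl) == 0:
                break
            top = unl[np.argsort(-conf[unl])[:n_add]]  # most confident points
            y[top] = pred[top]                         # pseudo-label them
    return y
```

With two well-separated clusters visible in both views and one labeled seed per class, the loop labels the whole dataset correctly.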
In [107], Chen et al. proposed a reweighted adversarial adaptation network (RAAN) for unsupervised domain adaptation. Two classifiers, a multi-class source instance classifier $C$ and a binary domain classifier $D$, were designed for adversarial training. The domain classifier aims to discriminate whether features are from the source or the target domain, while the feature representation network $F$ tries to confuse it, which formulates an adversarial training scheme. For improving the domain confusion effect, the source feature distribution is reweighted with weights $\alpha$ during training of the domain classifier $D$. With the gaming between $F$ and $D$, as in GANs [33], the following minimax objective function was used [107]:

(14) $\min_{F} \max_{D} \; \mathbb{E}_{\mathbf{x}^t \sim P_t} \left[ \log D(F(\mathbf{x}^t)) \right] + \mathbb{E}_{\mathbf{x}^s \sim P_s} \left[ \alpha(\mathbf{x}^s) \log \left( 1 - D(F(\mathbf{x}^s)) \right) \right]$

where the weight $\alpha(\mathbf{x}^s)$ is multiplied with the source-domain term, and both the weights and the domain classifier were trained in a cooperative way. The learning of the source classifier $C$ was easily performed by minimizing the cross-entropy loss.
3.4 Discussion and Summary
In this section, we identified three kinds of instance reweighting: intuitive, kernel mapping based, and co-training based. Intuitive reweighting advocates tuning the weights of the source samples such that the weighted source distribution moves closer to the target distribution. Kernel mapping based reweighting is further divided into distribution matching and sample selection: the former learns source sample weights such that the kernel mean discrepancy between the target data and the weighted source data is minimized, while the latter performs sample selection by using K-means clustering (cluster assumption) and $\ell_{2,1}$-norm based structural sparsity in the RKHS. The co-training mechanism focuses on learning with two classifiers. Additionally, adversarial training of a weighted domain classifier can facilitate domain confusion.
Although instance reweighting is the earliest approach to the domain mismatch problem, some directions remain worth studying: 1) essentially, instance weighting can be incorporated into most learning frameworks; 2) the initialization and estimation of instance weights are important, and the weights can be treated as latent variables obeying some probability distribution.
4 Feature Adaptation
Feature adaptation aims to discover a common feature representation of data drawn from multiple sources by using different techniques, both linear and nonlinear. In the past decade, feature adaptation induced transfer adaptation learning has been intensively studied and, in our taxonomy, can be categorized into (i) Feature Subspace-Based, (ii) Feature Transformation-Based, (iii) Feature Reconstruction-Based and (iv) Feature Coding-Based methods. Despite these advances, the technical challenges faced by researchers lie in domain subspace alignment, projection learning for distribution matching, generic representation, and shared domain dictionary coding. The taxonomy of feature adaptation challenges is summarized in Table II.
Feature Adaptation | Model Basis | Reference
Feature Subspace: Geodesic path | Grassmann manifold | [111, 112]
Feature Subspace: Alignment | Subspace learning | [108, 113, 114]
Feature Transformation: Projection | Kernel matching & discriminative criteria | [115, 116, 117, 118, 144, 145, 146, 147, 148, 149, 150, 151]
Feature Transformation: Metric | |
Feature Transformation: Augmentation | |
Feature Reconstruction | Low-rank models |
Feature Reconstruction | Sparse models |
Feature Coding: Domain-shared dictionary | Dictionary learning |
Feature Coding: Domain-specific dictionary | Dictionary learning |
4.1 Feature Subspace-Based
Subspace learning generally resorts to unsupervised domain adaptation. Three representative models are sampling geodesic flow (SGF) [111], geodesic flow kernel (GFK) [112] and subspace alignment (SA) [108]. The three methods share a common property: the data is assumed to be representable by a low-dimensional linear subspace; that is, a low-dimensional Grassmann manifold is embedded in the high-dimensional data. Generally, principal component analysis (PCA) is used to construct the Grassmann manifold, where the source and target domains become two points and a geodesic flow or path between them is formulated. SGF, proposed by Gopalan et al. [111], is an unsupervised low-dimensional subspace transfer method, which samples a group of subspaces along the geodesic path between the source and target data and aims to find an intermediate representation with a closer domain distance.

Similar to but different from SGF, Gong et al. proposed GFK [112], in which a geodesic flow kernel is used to model the domain shift by integrating an infinite number of subspaces. GFK explores an intrinsic low-dimensional spatial structure that associates the two domains; the main idea is to find a geodesic flow $\Phi(t)$, $t \in [0, 1]$, from the source subspace $\Phi(0)$ to the target subspace $\Phi(1)$, such that a raw feature $\mathbf{x}$ can be transformed into a feature of infinite dimension where the distribution difference is easier to reduce. In particular, the infinite-dimensional feature in the manifold space can be represented as $\mathbf{z} = \{\Phi(t)^{T}\mathbf{x} : t \in [0, 1]\}$. The inner product of the transformed features $\mathbf{z}_i$ and $\mathbf{z}_j$ defines a positive semidefinite geodesic flow kernel as follows:

(15) $\langle \mathbf{z}_i, \mathbf{z}_j \rangle = \int_{0}^{1} \left( \Phi(t)^{T} \mathbf{x}_i \right)^{T} \left( \Phi(t)^{T} \mathbf{x}_j \right) dt = \mathbf{x}_i^{T} \mathbf{G} \mathbf{x}_j$

where $\mathbf{G}$ is a positive semidefinite mapping matrix. With $\mathbf{G}$, features in the original space can be transformed into the Grassmann manifold space.
For aligning the source subspace to the target subspace, Fernando et al. proposed SA [108], which moves the two subspaces, viewed as points on the Grassmann manifold, closer by directly designing an alignment matrix $\mathbf{M}$ that bridges the source and target subspaces. The model of SA is described in Eq.(9). As presented in SA, the subspaces of the source and target data are spanned by the eigenvectors induced by PCA. Further, Sun and Saenko proposed subspace distribution alignment (SDA) [113], which simultaneously aligns the distributions as well as the subspace bases, overcoming the flaw that SA does not take the distribution difference into account.
More intuitively, Liu and Zhang proposed a guided transfer hashing (GTH) framework [114], which introduces a more generic way of moving the source subspace $P_s$ closer to the target subspace $P_t$:
$\min_{P_s, P_t}\ \| M \odot (P_s - P_t) \|_F^2 \qquad (16)$
where $M$ is a weighting matrix on the difference between the source and target subspaces and $\odot$ denotes the element-wise product. In this way, the two subspaces can be solved alternately and progressively, which is therefore recognized as a guided transfer mechanism.
4.2 Feature Transformation-Based
This kind of model aims to learn a transformation or projection of the data under some distribution matching metric between the source and target domains [5, 142, 143], such that the distribution difference of the transformed or projected features across the two domains can be removed or relieved. Feature transformation based domain adaptation has been a mainstream of the visual transfer learning community in recent years, and can be further divided into Projection, Metric, and Augmentation according to the model formulation.
(1) Projection-Based domain adaptation aims to solve projection matrices for the source and target domains that reduce the marginal and conditional distribution differences between domains, by introducing a Kernel Matching Criterion [118, 115, 144, 116, 117, 145, 146, 147] or a Discriminative Criterion [148, 149, 150, 151]. The kernel matching criterion generally adopts the maximum mean discrepancy (MMD) statistic, which characterizes the marginal and conditional distribution differences between source and target data. In the unsupervised domain adaptation setting, the labels of target domain samples are generally unavailable; therefore, pseudo-labels of target samples should be iteratively predicted for quantifying the conditional MMD between domains [148, 152]. The discriminative criterion focuses on the within-class compactness and between-class separability of the projection. Mathematically, the empirical nonparametric MMD in a universal RKHS is written as
$\mathrm{MMD}^2(X_s, X_t) = \Big\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(x_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(x_j^t) \Big\|_{\mathcal{H}}^2 \qquad (17)$
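By expanding the squared norm in Eq.(17) with the kernel trick, the estimate reduces to means of Gram-matrix blocks; a minimal sketch with a Gaussian kernel (the fixed bandwidth `gamma` and the names are our illustrative choices):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2)
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd2(Xs, Xt, gamma=1.0):
    # Biased empirical estimate of the squared MMD in Eq.(17)
    return (rbf_kernel(Xs, Xs, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean()
            - 2 * rbf_kernel(Xs, Xt, gamma).mean())
```

The estimate is zero for identical samples and grows with the distribution shift, which is exactly what projection-based methods minimize after mapping.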
Specifically, with the MMD-based kernel matching criterion, Pan and Yang first proposed transfer component analysis (TCA) [115] by introducing the marginal MMD with projection as the loss function. The joint distribution adaptation (JDA) proposed by Long et al. [116] further introduced the conditional MMD on the basis of TCA, such that the cross-domain distribution alignment becomes more discriminative. The general model can be written as
$\min_{A}\ \mathcal{D}_m(A^{T}X_s, A^{T}X_t) + \mathcal{D}_c(A^{T}X_s, A^{T}X_t; \hat{y}_t) + \lambda\,\Omega(A) \qquad (18)$
where $A$ denotes the projection matrix, $\hat{y}_t$ denotes the predicted pseudo-labels of the target data, and $\mathcal{D}_m$ and $\mathcal{D}_c$ represent the marginal and conditional distribution discrepancies, respectively. For improving the discrimination of the projection matrix, such that the within-class compactness and between-class separability in each domain can be better characterized, models with joint discriminative subspace learning and MMD minimization were proposed, for example JGSA [148] and CDSL [149], generally written as
$\min_{P}\ F(P) + \mathcal{D}_m(P^{T}X_s, P^{T}X_t) + \mathcal{D}_c(P^{T}X_s, P^{T}X_t; \hat{y}_t) \qquad (19)$
where $F(P)$ is a scalable subspace learning function of the projection $P$, for example linear discriminant analysis (LDA), locality preserving projection (LPP), marginal Fisher analysis (MFA), principal component analysis (PCA), etc. In addition to the MMD-based criterion in projection-based transfer models, Bregman divergence based [118], Hilbert-Schmidt independence criterion (HSIC) based [153, 117, 154, 133], and manifold criterion based [126] models have also been studied.
In [118], Si et al. proposed transfer subspace learning (TSL) by introducing a Bregman divergence-based discrepancy as regularization instead of MMD, which is written as
$\min_{W}\ F(W) + \lambda\, D_{W}(P_s \,\|\, P_t) \qquad (20)$
where $F(W)$ is similar to Eq.(19) and $D_{W}(P_s\|P_t)$ is the Bregman divergence-based regularization that measures the distance between the probability distribution of the training samples and that of the test samples in the projected subspace $W$.
The HSIC, proposed by Gretton et al. [153] (the same authors as MMD), was used to measure the dependency between two sets $\mathcal{X}$ and $\mathcal{Y}$. Let $k_1$ and $k_2$ denote the kernel functions w.r.t. the RKHSs $\mathcal{F}$ and $\mathcal{G}$. The HSIC is mathematically written as [153]
$\mathrm{HSIC}(\mathcal{X}, \mathcal{Y}) = (n-1)^{-2}\, \mathrm{tr}(K_1 H K_2 H) \qquad (21)$
where $n$ is the size of the sets and Eq.(21) is an empirical estimate of the Hilbert-Schmidt norm of the cross-covariance operator. $K_1$ and $K_2$ denote the two kernel Gram matrices, and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{T}$ is the centering matrix. HSIC will be zero if and only if $\mathcal{X}$ and $\mathcal{Y}$ are independent. In [133], Wang et al. proposed to use the projected HSIC as regularization, which is written as
$\max_{P}\ (n-1)^{-2}\,\mathrm{tr}(K_{P^{T}X}\, H\, K_{Y}\, H) \qquad (22)$
where $Y$ denotes the label set of the source and target data. Obviously, the model encourages the dependency between the projected feature set $P^{T}X$ and the label set $Y$, such that the classification performance can be improved. In model formulation, the general way is to set a common projection $P$ for both domains. Another way is to learn two projections $P_s$ and $P_t$, one for each domain, such that domain-specific projections can be solved [148, 122, 114, 129]. For moving the two projections of both domains closer, the Frobenius norm of their difference, as in Eq.(16), can be used.
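Given the two Gram matrices, the empirical HSIC of Eq.(21) is a single trace computation; a small sketch (names are ours):

```python
import numpy as np

def hsic(K1, K2):
    # Empirical HSIC: tr(K1 H K2 H) / (n - 1)^2, with H the centering matrix
    n = K1.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K1 @ H @ K2 @ H) / (n - 1) ** 2
```

For instance, with linear kernels one would pass `K1 = X @ X.T` and `K2 = Y @ Y.T`; a constant kernel matrix yields zero, reflecting no measurable dependence.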
(2) Metric-Based methods aim to learn a good distance metric from labeled source data that can be easily adapted to a related but different target domain [155]. Metric transfer has a close link to projection-based methods if the metric $M$ is a positive semidefinite matrix and can be decomposed as $M = PP^{T}$ [119]. Metric-based transfer can be divided into First-order statistic [119, 120, 142, 156, 157, 158] and Second-order statistic [121, 122, 159, 160, 161] based distance metrics, such as the Euclidean or Mahalanobis distance.
First-order metric transfer generally learns a metric $M$ under which the distance between the source and target features is minimized, and it can be written as
$\min_{M \succeq 0}\ \big(\mu(f(X_s)) - \mu(f(X_t))\big)^{T} M \big(\mu(f(X_s)) - \mu(f(X_t))\big) \qquad (23)$
where $f(\cdot)$ is the feature representation or mapping function and $\mu(\cdot)$ denotes the domain mean; $f$ can be a linear mapping [142], a kernel mapping [120, 157], an autoencoder [119] or a neural network [156].
For example, the robust transfer metric learning (RTML) proposed by Ding et al. [119] adopts an autoencoder-based feature representation for metric learning, such that the Mahalanobis distance between the source and target domains is minimized. The objective function of RTML has the form
$\min_{M \succeq 0}\ d_{M}(X_s, X_t) + \alpha\, \mathcal{L}_{AE}(\bar{X}, \tilde{X}) + \beta\, \mathrm{rank}(M) \qquad (24)$
where $M$ is a positive semidefinite metric matrix, $\bar{X}$ is the repeated version of $X$, and $\tilde{X}$ is the randomly corrupted version of $X$. The first term is the Mahalanobis distance induced domain discrepancy under metric $M$, the second term is the autoencoder for feature learning, and the third term is the low-rank constraint for characterizing the internal correlation between domains.
Second-order metric transfer generally learns a metric under which the distance between the covariances, instead of the means, of the source and target domains is minimized [160, 122, 159, 121]. For example, Sun et al. [159, 121] proposed the simple but efficient correlation alignment (CORAL), which aligns the second-order statistics (i.e., the covariances) of the source and target distributions instead of a first-order metric. By introducing a metric matrix $A$, the difference between the source covariance $C_s$ and the target covariance $C_t$ in CORAL can be minimized by solving
$\min_{A}\ \| A^{T} C_s A - C_t \|_F^2 \qquad (25)$
Eq.(25) amounts to matching two centered Gaussian distributions, which is the basic assumption of such second-order statistic based transfer.
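Eq.(25) has a well-known closed-form solution of whitening-recoloring form, $A = C_s^{-1/2} C_t^{1/2}$; the NumPy sketch below (the regularizer `eps` and the names are our assumptions) re-colors the source features with the target covariance:

```python
import numpy as np

def mat_pow(C, p):
    # Matrix power of a symmetric PSD matrix via eigendecomposition
    w, V = np.linalg.eigh(C)
    return (V * w**p) @ V.T

def coral(Xs, Xt, eps=1e-5):
    # Whiten the source with Cs^{-1/2}, then re-color with Ct^{1/2}
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    return Xs @ mat_pow(Cs, -0.5) @ mat_pow(Ct, 0.5)
```

After the transform, the empirical covariance of the source features matches that of the target up to the regularization.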
(3) Augmentation-Based domain adaptation often assumes that the feature representation is grouped into three types: a common representation, a source-specific representation and a target-specific representation. In the general case, the source domain is characterized as the composition of the common and source-specific components and, similarly, the target domain as the composition of the common and target-specific components. Feature augmentation based DA can be divided into the generic Zero Padding [162, 123, 163, 124, 164] and the more recent Generative [125, 126] types.
Zero Padding was first proposed by Daumé III [162] in the EasyAdapt (EA) model. Assume the raw input space to be $\mathbb{R}^{d}$; then the augmented feature space is $\mathbb{R}^{3d}$. Define the mapping functions of the source and target domains from $\mathbb{R}^{d}$ to $\mathbb{R}^{3d}$ as $\Phi_s$ and $\Phi_t$, respectively. Then, there is
$\Phi_s(x) = (x, x, \mathbf{0}), \qquad \Phi_t(x) = (x, \mathbf{0}, x) \qquad (26)$
where $\mathbf{0}$ is the $d$-dimensional zero vector. The first, second and third blocks of the augmented feature in Eq.(26) represent the common, source-specific and target-specific feature components, respectively. However, in heterogeneous domain adaptation, which addresses different feature dimensionalities between the source and target domains [165, 166, 167], for example cross-modal learning (e.g., images vs. text), Li et al. [124] argued that such simple zero padding for dimensionality consistency between domains is not meaningful, because there would be no correspondences between the heterogeneous features. Therefore, Li et al. [124] proposed a heterogeneous feature augmentation (HFA) model, which incorporates the projected features together with the raw features for feature augmentation by introducing two projection matrices $P$ and $Q$. The augmented features for the source and target domains can be written as
$\varphi_s(x^{s}) = (P x^{s},\, x^{s},\, \mathbf{0}_{d_t}), \qquad \varphi_t(x^{t}) = (Q x^{t},\, \mathbf{0}_{d_s},\, x^{t}) \qquad (27)$
where $d_s$ and $d_t$ represent the dimensionalities of the source and target data, respectively. For incorporating unlabeled target data, Daumé III further proposed the EA++ model with zero padding based feature augmentation for semi-supervised domain adaptation [123, 163]. Chen et al. [164] proposed zero padding based camera correlation aware feature augmentation (CRAFT) for cross-view person re-identification.
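The EA mapping of Eq.(26) takes only a few lines; the sketch below (our naming) augments a feature matrix for either domain so that a single classifier can then be trained on the concatenation of both augmented sets:

```python
import numpy as np

def easy_adapt(X, domain):
    # Map d-dim rows to 3d dims: (common, source-specific, target-specific)
    zeros = np.zeros_like(X)
    if domain == "source":
        return np.hstack([X, X, zeros])
    return np.hstack([X, zeros, X])
```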
Generative methods for feature augmentation mainly focus on plausible data generation for enhancing robustness in domain transfer. In [125], Volpi et al. proposed adversarial feature augmentation by introducing two generative adversarial nets (GANs). The first GAN trains a generator that synthesizes plausible source samples (data augmentation) from input noise and conditional labels. The second GAN trains a shared feature encoder (feature augmentation) for both domains by adversarial learning against the synthesized source samples. Finally, the encoder is used as the domain-adapted feature extractor shared by both domains. In [126], Zhang et al. proposed manifold criterion guided intermediate domain generation for feature augmentation, which improves transfer performance by generating high-quality intermediate features.
4.3 Feature Reconstruction-Based
Feature reconstruction between source and target data using a representational matrix for domain transfer has been studied for several years. By linear sample reconstruction in an intermediate representation with low-rankness and sparsity, it can well characterize the intrinsic relatedness and correspondences between the source and target domains, while excluding noises and outliers during domain adaptation. Feature reconstruction based domain transfer can generally be divided into two types: Low-rank Reconstruction [127, 128, 126, 129] and Sparse Reconstruction [130, 132, 131, 133]. For the former, the reconstruction matrix is imposed with a low-rank constraint to characterize the domain differences and uncover the domain noises, such that the relatedness between domains can be discovered. For the latter, sparsity or structural sparsity is generally used for transferrable sample selection. Methodologically, reconstruction based domain transfer is closely related to low-rank representation (LRR) [168, 169], matrix recovery [170, 171] and sparse subspace clustering (SSC) [172, 173, 174].
(1) Low-rank Reconstruction based domain adaptation was first proposed by Jhuo et al. [127], in which the transformed source features are reconstructed by the target domain, with a low-rank constraint on the reconstruction matrix $Z$ and a structured norm constraint on the error $E$:
$\min_{W, Z, E}\ \mathrm{rank}(Z) + \lambda \|E\|_{2,1} \quad \mathrm{s.t.}\ W X_s = X_t Z + E \qquad (28)$
However, seeking an alignment between $W X_s$ and $X_t$ may not transfer knowledge directly, due to the out-of-domain problem of the unilateral projection $W$.
On the basis of [127], Shao et al. [128] proposed latent subspace transfer learning (LTSL), which tends to reconstruct the target data using the source data as basis in a projected latent subspace:
$\min_{P, Z, E}\ F(P) + \lambda_1 \|Z\|_{*} + \lambda_2 \|E\|_{2,1} \quad \mathrm{s.t.}\ P^{T} X_t = P^{T} X_s Z + E \qquad (29)$
where $F(P)$ is a subspace learning function, similar to Eq.(19), Eq.(20) and Eq.(22). Comparing Eq.(28) with Eq.(29), the major difference lies in the latent space learning of $P$ for both domains in LTSL. Both methods, established on LRR, advocate low-rank reconstruction between domains for transfer learning. As demonstrated in [169], a trivial solution may easily be encountered when handling disjoint subspaces and insufficient data using LRR, and a strong independent subspace assumption is necessary.
(2) Sparse Reconstruction based domain transfer is established on SSC which, different from LRR, is well supported by theoretical analysis and experiments when handling data near the intersections of subspaces [173]. Therefore, in [130], Zhang et al. proposed a latent sparse domain transfer (LSDT) model, which jointly learns the sparse reconstruction coefficients $Z$ between domains and the latent subspace projection $P$:
$\min_{P, Z}\ \| P X_t - P X_s Z \|_F^2 + \lambda \|Z\|_1 \quad \mathrm{s.t.}\ P P^{T} = I \qquad (30)$
where $X = [X_s, X_t]$ is the feature set grouped by $X_s$ and $X_t$. With the sparsity constraint on $Z$, the most transferrable samples can be selected during domain adaptation, which is more robust to noise and outliers drawn from the source domain. The model has also been kernelized by defining the projection $P$ as a linear representation of $X$; the reconstruction is then implemented in a high-dimensional reproducing kernel Hilbert space (RKHS), based on the Representer theorem. In [132], Zhang et al. proposed a norm-constrained reconstruction transfer model with discriminative subspace learning, in which domain-class consistency is guaranteed. A joint low-rankness and sparsity constraint on the reconstruction matrix was proposed in [131], such that the global and local structures of the data can be preserved.
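The sparse-coding subproblem of such models, solving for $Z$ with the subspace fixed, can be sketched with a plain ISTA loop; this is our illustrative reduction (the subspace $P$ is omitted), not the authors' actual solver:

```python
import numpy as np

def sparse_reconstruct(Xs, Xt, lam=0.01, n_iter=2000):
    # Solve min_Z 0.5*||Xt - Xs Z||_F^2 + lam*||Z||_1 by ISTA.
    # Columns of Xs/Xt are samples; Z picks source samples per target sample.
    L = np.linalg.norm(Xs, 2) ** 2              # Lipschitz const. of the gradient
    Z = np.zeros((Xs.shape[1], Xt.shape[1]))
    for _ in range(n_iter):
        G = Xs.T @ (Xs @ Z - Xt)                # gradient of the smooth part
        Z = Z - G / L
        Z = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)  # soft-threshold
    return Z
```

The soft-threshold step drives most coefficients to exactly zero, which is what selects the transferrable source samples.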
4.4 Feature Coding-Based
In feature reconstruction based transfer models, the focus is learning the reconstruction coefficients across domains on the basis of the raw features of the source or target data. Differently, feature coding based transfer learning puts emphasis on seeking a group of bases (i.e., a dictionary) and representation coefficients in each domain, which is generally called domain adaptive dictionary learning. Typical dictionary learning approaches aim to minimize the representation error of a given data set under a sparsity constraint [175, 176, 177]. Cross-domain dictionary learning aims to learn domain adaptive dictionaries without requiring any explicit correspondences between domains, and is generally divided into two types: domain-shared dictionary based [134, 135, 136, 137] and domain-specific dictionary based [138, 139, 140, 141, 178]. Obviously, the former resorts to learning one common dictionary for both domains, while the latter obtains two or more dictionaries, one per domain.
(1) Domain-shared dictionary learning aims at representing the source and target domains using a common dictionary. In [134, 136], Shekhar et al. proposed to separately represent the source and target data in a latent subspace with a shared dictionary $D$, which can be written as
$\min_{P_s, P_t, D, A_s, A_t}\ \| P_s X_s - D A_s \|_F^2 + \| P_t X_t - D A_t \|_F^2 + \Omega(P_s, P_t, D) \qquad (31)$
where $P_s$ and $P_t$ denote the latent subspace projections, $A_s$ and $A_t$ denote the representational coefficients of the source and target data under the shared dictionary $D$, and $\Omega(\cdot)$ denotes the regularizer. The shared dictionary is demonstrated to incorporate the common information from both domains.
(2) Domain-specific dictionary learning tends to learn multiple dictionaries, one for each domain, to represent the data in each domain based on domain-specific or common representation coefficients [140, 178]. The general model can be written as
$\min_{D_s, D_t, A_s, A_t}\ \| X_s - D_s A_s \|_F^2 + \| X_t - D_t A_t \|_F^2 + \lambda\, \Delta(A_s, A_t) \qquad (32)$
where $\Delta(A_s, A_t)$ denotes the difference between the representation coefficients of the source and target. If $\Delta(A_s, A_t) = 0$, then $A_s = A_t = A$ and the model in Eq.(32) degenerates to common representation coefficients based domain adaptive dictionary learning [138].
In [139, 135, 141], a set of intermediate domains that bridge the gap between the source and target domains is incorporated as multiple dictionaries $\{D_k\}$, which can progressively capture the intrinsic domain shift between the source domain dictionary $D_s$ and the target domain dictionary $D_t$. The difference between the atoms of two adjacent sub-dictionaries can well characterize the incremental transition and shift between the two domains. Actually, this kind of model can be linked with SGF [111] and GFK [112], which sample a finite or infinite number of intermediate subspaces on the Grassmann manifold for better capturing the intrinsic domain shift.
4.5 Discussion and Summary
In this section, feature adaptation methods are presented, including subspace, transformation, reconstruction and coding based types. Feature subspace methods focus on subspace alignment between domains on the Grassmann manifold. Feature transformation is further categorized into three subclasses: projection learning with the MMD criterion, metric learning with first-order or second-order statistics, and augmentation with zero padding. Feature reconstruction aims to explicitly bridge the source and target data in a latent subspace by low-rank or sparse reconstruction. Finally, feature coding focuses on domain data representation by learning domain adaptive dictionaries without explicit correspondences between domains.
Feature adaptation has been intensively studied, and two future directions can be specified: 1) more reliable probability distribution similarity metrics are needed beyond the Gaussian kernel induced MMD; 2) for learning domain-invariant representations, an ensemble of linear and nonlinear models is desired.
5 Classifier Adaptation
In cross-domain visual categorization, classifier adaptation based TAL aims to learn a generic classifier by leveraging labeled samples drawn from the source domain and a few labeled samples from the target domain [3, 179, 180, 181]. Typical cross-domain classifier adaptation can be divided into (i) Kernel Classifier-Based [3, 182, 183, 184, 179, 185, 186], (ii) Manifold Regularizer-Based [187, 188, 189, 190, 191] and (iii) Bayesian Classifier-Based [192, 193, 194, 195, 196, 197]. The taxonomy of classifier adaptation approaches is summarized in Table III.
Classifier Adaptation | Model Basis | Reference
Kernel Classifier | SVM & MKL | [3, 182, 183, 184, 179, 185, 186]
Manifold Regularizer | graph regularization | [187, 188, 189, 190, 191]
Bayesian Classifier | probabilistic graphical models | [192, 193, 194, 195, 196, 197]
5.1 Kernel Classifier-Based
Yang et al. [3] first proposed the adaptive support vector machine (ASVM) in 2007 for target classifier training, which assumes that there exists a bias $\Delta f(x)$ between the source classifier $f_s(x)$ and the target classifier $f_t(x)$. This means that the bias can be added to the source classifier to generate a new decision function that is adapted to classifying the target data. That is,
$f_t(x) = f_s(x) + \Delta f(x) = f_s(x) + w^{T}\phi(x) \qquad (33)$
where $w$ is the parameter of the bias function $\Delta f(x)$, which is solved by a standard SVM:
$\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n_l}\xi_i \quad \mathrm{s.t.}\ y_i\big(f_s(x_i) + w^{T}\phi(x_i)\big) \ge 1 - \xi_i,\ \xi_i \ge 0 \qquad (34)$
In Eq.(34), $f_s(\cdot)$ is known and trained on labeled source data, the pairs $(x_i, y_i)$ are drawn from the few labeled target data, and $w$ is the parameter of $\Delta f(x)$ rather than of $f_t(x)$.
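In the spirit of Eq.(33)-(34), a linear, least-squares surrogate can illustrate the idea: keep the source classifier fixed and fit only a small corrective term on the few labeled target samples. The ridge surrogate below replaces the hinge loss, and all names are our assumptions:

```python
import numpy as np

def adapt_classifier(ws, Xl, yl, lam=1e-6):
    # f_t(x) = x @ ws + x @ dw, with dw fit to the residual yl - Xl @ ws
    residual = yl - Xl @ ws
    d = Xl.shape[1]
    dw = np.linalg.solve(Xl.T @ Xl + lam * np.eye(d), Xl.T @ residual)
    return ws + dw  # adapted target classifier
```

Because only the correction `dw` is learned, a handful of target labels suffices, which is exactly the point of ASVM-style adaptation.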
More recently, on the basis of ASVM, Duan et al. proposed a series of multiple kernel learning (MKL) based domain transfer classifiers [182, 183, 184, 185], including AMKL, DTSVM, and DTMKL, in which the kernel function is assumed to be a linear combination of multiple predefined base kernels, following the MKL methodology [198, 199]. Additionally, for reducing the domain distribution mismatch, an MMD based kernel matching metric is jointly minimized with the structural risk based classifiers. The general model of MKL based classifier adaptation can be written as
$\min_{k, f}\ \Omega\big(\mathrm{MMD}_k^2(X_s, X_t)\big) + \theta\, R(k, f) \quad \mathrm{s.t.}\ k = \sum_{m=1}^{M} d_m k_m,\ d_m \ge 0,\ \sum_{m=1}^{M} d_m = 1 \qquad (35)$
where $R(k, f)$ denotes the structural risk on the labeled training samples, $f$ is the decision function, $\Omega(\cdot)$ is a monotonic increasing function, and $k$ is a linear combination of a set of base kernels $k_m$ with coefficients $d_m \ge 0$ and $\sum_m d_m = 1$. The structural risk is generally formulated with the hinge loss, i.e., $\max(0, 1 - yf(x))$, as in SVM. Duan et al. [200] also proposed a domain adaptation machine (DAM), which incorporates SVM hinge loss based structural risk with multiple domain regularizers for target classifier learning. Regularized least-squares loss based classifier adaptation can be referred to in [187, 188].
5.2 Manifold Regularizer-Based
The manifold assumption in semi-supervised learning means that similar samples with small distances in the feature space are more likely to belong to the same class. By constructing an affinity graph based manifold regularizer, the classifier trained on source data can more easily adapt to target data through label propagation. Long et al. [187] and Cao et al. [190] proposed ARTL and DMM, which advocate manifold regularization based structural risk and between-domain MMD minimization for classifier training, structure preservation and domain alignment. In [191], Yao et al. proposed to simultaneously minimize the classification error, preserve the geometric structure of the data and restrict the similarity characterized on unlabeled target data. Zhang and Zhang [188] proposed EDA, a manifold regularization based least-squares classifier on both domains with label precomputation and refinement for domain adaptation. More recently, Wang et al. [189] proposed MEDA, a domain-invariant classifier in the Grassmann manifold with structural risk minimization, which performs cross-domain alignment of marginal and conditional distributions with different importances. Graph based manifold regularization can be written as
$\frac{1}{2}\sum_{i,j} W_{ij}\, \| \hat{y}_i - \hat{y}_j \|^2 = \mathrm{tr}(\hat{Y}^{T} L \hat{Y}), \quad L = D - W \qquad (36)$
where the samples come from both the source and target domains, $\hat{Y}$ is the matrix of predicted labels, $L$ is the graph Laplacian matrix, $W_{ij}$ is the affinity weight between samples $x_i$ and $x_j$, and $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. This term constrains geometric structure preservation in label propagation and helps classifier adaptation. Although the manifold regularizer can improve classifier adaptation performance, the manifold assumption may not always hold, particularly when the domain distributions do not match [126].
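The regularizer in Eq.(36) can be computed directly from an affinity graph; a NumPy sketch with a Gaussian affinity (the bandwidth `sigma` and the names are our choices):

```python
import numpy as np

def manifold_regularizer(X, F, sigma=1.0):
    # tr(F^T L F) with L = D - W and Gaussian affinities W_ij
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma**2))
    L = np.diag(W.sum(axis=1)) - W
    return np.trace(F.T @ L @ F)
```

The value is zero when all predictions agree and grows when neighboring samples receive different labels, which is what penalizes label propagation across the graph.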
5.3 Bayesian Classifier-Based
In learning complex systems with limited data, Bayesian learning can integrate prior knowledge to improve the weak generalization of models caused by data scarcity. For unsupervised domain adaptation, an underlying assumption of the kernel classifier and manifold classifier based models is that the conditional domain shift between domains can be minimized without relying on target labels. Additionally, these methods are deterministic and rely more on expensive cross-validation for determining the underlying manifold space where the kernel mismatch between domains is effectively reduced. Recently, probabilistic models, i.e., Bayesian classifier based graphical models, for DA/TL have been studied [192, 193, 194, 195, 196, 197], which aim at better insight into the transfer process from the source domain to the target domain.
In [192], Gönen and Margolin first proposed a graphical model, kernelized Bayesian transfer learning (KBTL), for domain adaptation. This work seeks a shared subspace and learns a coupled linear classifier in this subspace using a full Bayesian framework, solved by a variational approximation based inference algorithm. In [195], Gholami et al. proposed a probabilistic latent variable model (PUnDA) for unsupervised domain adaptation, which simultaneously learns the classifier in a projected latent space and minimizes the MMD based domain disparity. A regularized variational Bayesian (VB) algorithm is used for efficient model parameter estimation in PUnDA, because the computation of the exact posterior distribution of the latent variables is intractable. More recently, Karbalayghareh et al. [196] proposed an optimal Bayesian transfer learning (OBTL) classifier that formulates the optimal Bayesian classifier (OBC) in the target domain using the prior knowledge of the source and target domains, where OBC [201] aims to achieve the Bayesian minimum mean squared error over uncertainty classes. To avoid costly computations such as MCMC sampling, the OBTL classifier is derived based on Laplace-approximated hypergeometric functions.
5.4 Discussion and Summary
In this section, classifier adaptation methods, including kernel classifier, manifold regularizer and Bayesian classifier based ones, are surveyed. They mostly rely on a small amount of labeled target domain data and facilitate semi-supervised transfer learning; they can be easily adapted to unsupervised transfer learning by precomputing and iteratively updating pseudo-labels for the completely unlabeled target domain. The kernel classifier models focus on SVM or MKL learning jointly with MMD based domain disparity minimization. The manifold regularizer based models aim to preserve the data affinity structure for label propagation. The Bayesian classifier based models resort to compensating the generalization loss caused by data scarcity by modeling prior knowledge under reliable distribution assumptions, and they offer a theoretical understanding of transfer learning from the viewpoint of data generation.
However, some inherent flaws exist: 1) incorrect pseudo-labels of target data lead to significant performance degradation; 2) inaccurate distribution assumptions in estimating latent variables produce strongly negative effects; 3) the manifold assumption between domains does not hold under serious domain disparity.
6 Deep Network Adaptation
Deep neural networks (DNNs) have been recognized as dominant techniques for addressing computer vision tasks, due to their powerful feature representation and end-to-end training capability. Although DNNs can achieve more generalized features and performance in visual categorization, they rely on massive amounts of labeled data. For a target domain where labeled data is unavailable or only very few labeled samples are available, deep network adaptation started to rise. Yosinski et al. [202] discussed the transferability of features in the bottom, middle and top layers of DNNs, and demonstrated that the transferability of features decreases as the distance between domains increases. In [203], Donahue et al. proposed the deep convolutional activation feature (DeCAF), extracted using a pre-trained AlexNet model [16], which well proved the generalization ability of DNNs for generic visual classification and further facilitated deep transfer learning and deep domain adaptation. Generally, the three types of TAL models presented in Sections 3, 4 and 5, i.e., instance reweighting, feature adaptation and classifier adaptation, can be incorporated into DNNs with end-to-end training for deep network adaptation. In 2015, Long et al. [204, 205] proposed the deep adaptation network (DAN) for learning transferrable features, which for the first time opened the topic of deep transfer and adaptation. The basic idea of DAN is to enhance feature transferability in the task-specific layers of DNNs by embedding the higher-layer features into reproducing kernel Hilbert spaces (RKHSs) for nonparametric kernel matching (e.g., MMD-based) between domains. In training, DAN is fine-tuned from an ImageNet pre-trained DNN, such as AlexNet [16], VGGNet [206], GoogLeNet [207] or ResNet [17]. Currently, the works in deep network adaptation can be divided into (i) Marginal Alignment-Based, (ii) Conditional Alignment-Based and (iii) Autoencoder-Based, of which the first two focus on convolutional neural networks. The taxonomy of deep network adaptation challenges is summarized in Table IV.
Deep Net Adaptation | Model Basis | Reference
Marginal Alignment | CNN & MMD | [204, 208, 209, 210]
Conditional Alignment | CNN & class-conditional MMD | [213, 214]
Autoencoder-Based | SDA/mSDA & reconstruction | [215, 216, 217, 166, 218]
6.1 Marginal Alignment-Based
In unsupervised deep domain adaptation frameworks, for reducing the distribution disparity between the labeled source domain and the unlabeled target domain, the top-layer features are generally transformed into an RKHS where MMD based kernel matching between domains is performed, which is recognized as marginal alignment based deep network adaptation [204, 208, 209, 210]. For image classification, the softmax guided cross-entropy loss on the labeled source data is generally minimized. Representative works include DDC proposed by Tzeng et al. [208] and DAN [204]. The model can be written as
$\min\ \frac{1}{n_s}\sum_{i=1}^{n_s} \ell\big(f(x_i^{s}), y_i^{s}\big) + \lambda \sum_{l=l_1}^{l_2} \mathrm{MMD}^2\big(\mathcal{D}_s^{l}, \mathcal{D}_t^{l}\big) \qquad (37)$
where $\ell(\cdot,\cdot)$ is the cross-entropy loss function, $f(\cdot)$ is the feature representation function, $\mathcal{D}_s^{l}$ and $\mathcal{D}_t^{l}$ denote the domain feature sets from the $l$-th layer, and $\mathrm{MMD}(\cdot,\cdot)$ is the marginal alignment function (i.e., MMD in Eq.(17)) between domains. Clearly, in Eq.(37), multiple MMDs are formulated, one for each layer, and the summation of all MMDs is minimized. For better measuring the discrepancy between domains, a unified MMD called joint MMD (JMMD) was further designed by Long et al. [210] in a tensor product Hilbert space for matching the joint distribution of the activations of multiple layers.
The model in Eq.(37) does not take into account the network outputs of the target domain stream, which may not well adapt the source classifier to the target data. For addressing this problem, the conditional-entropy minimization principle [220], which favors low-density separation between classes in the unlabeled target data, was further exploited in [205, 211, 212]. The entropy minimization is written as
$\min\ -\frac{1}{n_t}\sum_{j=1}^{n_t}\sum_{c=1}^{C} p_{j,c}\, \log p_{j,c} \qquad (38)$
where $p_{j,c}$ is the probability that sample $x_j^{t}$ is predicted as class $c$. Entropy minimization amounts to minimizing the uncertainty of the predicted labels of target samples. Additionally, following the assumption of ASVM in [3], the residual between the source and target classifiers is learned in the residual transfer network (RTN) [211] with a residual connection.
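Eq.(38) is straightforward to compute from the network logits; a numerically stable NumPy sketch (names are ours):

```python
import numpy as np

def target_entropy(logits):
    # Mean prediction entropy over unlabeled target samples (Eq. 38)
    z = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1).mean()
```

Uniform predictions give the maximum value $\log C$, while confident predictions drive the loss toward zero, encouraging decision boundaries in low-density regions.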
6.2 Conditional Alignment-Based
In marginal alignment based deep network adaptation, only top-layer feature matching in RKHS spaces is formulated using the nonparametric MMD metric. However, high-level semantic information is not taken into account in domain alignment, which may degrade the adaptability of source-trained DNNs to the unlabeled target domain. Therefore, conditional alignment based deep network adaptation methods were presented jointly with marginal alignment based models [213, 214]. Similar to the formulation of MMD in Eq.(17), the conditional alignment is generally formulated by building an MMD-like metric on the class probabilities $p(c\,|\,x)$, the uncertainty of predicting a sample as class $c$, between domains:
$\sum_{c=1}^{C} \Big\| \frac{1}{n_s}\sum_{i=1}^{n_s} p(c\,|\,x_i^{s}) - \frac{1}{n_t}\sum_{j=1}^{n_t} p(c\,|\,x_j^{t}) \Big\|^2 \qquad (39)$
Therefore, a conditional alignment based deep adaptation model is generally constructed by combining Eq.(37) and Eq.(39). The probability constraint between domains can effectively improve semantic discrimination. Actually, a norm penalty can also be imposed directly on the difference between the probabilities of source and target samples.
6.3 Autoencoder-Based
As mentioned above, the training of DNNs needs a large amount of labeled source data. For unsupervised feature learning in domain adaptation, deep autoencoder based network adaptation frameworks were presented [215, 216, 217, 166, 218]. Generic autoencoders comprise an encoder function and a decoder function, which are typically trained to minimize the reconstruction error. Denoising autoencoders (DAE) are generally constructed with one-layer neural networks that reconstruct the original data from partially or randomly corrupted data [221]. Denoising autoencoders can be stacked into a deep network (i.e., SDA), optimized in a greedy layer-wise fashion based on stochastic gradient descent (SGD). The rationale behind deep autoencoder based network adaptation is that the encoder and decoder parameters trained on source data can be adapted to represent samples from a target domain.
In [215]
, Glorot et al. proposed a SDA based feature representation in conjunction with SVMs for sentiment analysis across different domains. Chen et al.
[216] proposed a marginalized stacked denoising autoencoder (mSDA), which addressed two crucial limitations of SDAs, such as high computational cost and low scalability to highdimensional features, by inducing a closedform solution of parameters without SGD. In [217], Zhuang et al. proposed a supervised deep autoencoder for learning domain invariant features. The encoder is constructed with two encoding layers: embedding layer for domain disparity minimization and label encoding layer for softmax guided source classifier training. Suppose , and to be the input sample, intermediate representation (encoded) and reconstructed output (decoded), respectively, then there is(40) 
where h = f(x) is the intermediate feature representation of sample x. Generally, the stacked deep autoencoder based TAL framework can be written as
min_{f,g,θ} Σ_{x∈X_s∪X_t} L_r(x, g(f(x))) + α·D(f(X_s), f(X_t)) + β·L_c(θ; f(X_s), Y_s) + γ·Ω(f, g), (41)
where f is the domain-shared encoder, g is the domain-shared decoder, L_r represents the reconstruction error loss (e.g., the squared ℓ2-norm loss), D(f(X_s), f(X_t)) is the distribution discrepancy metric between the source features f(X_s) and the target features f(X_t), L_c is the classifier loss (e.g., cross-entropy) with parameter θ learned on the source set, Ω is the regularizer on the network parameters of f and g, and α, β and γ are trade-off coefficients. In [217], a Kullback-Leibler (KL) divergence [222] based distribution distance metric was considered. The KL divergence is a non-symmetric measure of the divergence between two probability distributions P and Q, defined as D_KL(P‖Q) = Σ_i P(i) log(P(i)/Q(i)). A smaller value of D_KL(P‖Q) means higher similarity between the two distributions. Because D_KL(P‖Q) ≠ D_KL(Q‖P), a symmetric KL version was used in [217], in which the discrepancy term D(·,·) in Eq.(41) was written as
D(f(X_s), f(X_t)) = D_KL(P_s‖P_t) + D_KL(P_t‖P_s), (42)
where P_s and P_t represent the probability distributions of the source and target domains, obtained by normalizing h̄_s and h̄_t, the mean vectors of the encoded feature representations of the source and target samples, respectively.
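The symmetric KL term above can be computed directly from two batches of encoded features; the snippet below is a sketch in the spirit of the embedding-layer constraint of [217], where the variable names and the normalization step are our assumptions (encodings are expected to be non-negative, e.g., sigmoid outputs).

```python
import numpy as np

def symmetric_kl(h_src, h_tgt, eps=1e-12):
    """Symmetric KL divergence between the normalized mean encodings of
    two domains (sketch; eps guards against zeros in the log)."""
    p = h_src.mean(axis=0)
    q = h_tgt.mean(axis=0)
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    kl_pq = np.sum(p * np.log((p + eps) / (q + eps)))
    kl_qp = np.sum(q * np.log((q + eps) / (p + eps)))
    return kl_pq + kl_qp

h_s = np.random.rand(32, 10)        # hypothetical encoded source batch
h_t = np.random.rand(32, 10) ** 3   # target batch with a skewed profile
assert np.isclose(symmetric_kl(h_s, h_s), 0.0)  # identical domains: zero
assert symmetric_kl(h_s, h_t) >= 0.0            # divergence is non-negative
```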
Similar to the reconstruction protocol in stacked autoencoders, a related work with deep reconstruction based on convolutional neural networks can be referred to as [219], in which the encoded source feature representation is fed into the source classifier for visual classification and simultaneously into the decoder module for reconstructing the target data. Under this framework, a shared encoder for both domains can be learned.
6.4 Discussion and Summary
In this section, deep network adaptation advances are presented and categorized into three types of technical approaches: marginal alignment based, conditional alignment based and autoencoder based. A common characteristic of these methods is that the softmax guided cross-entropy loss on labeled source data is minimized for classifier learning. In marginal alignment based models, the distribution discrepancy of the feature representations from the top layers is generally characterized by MMD. Beyond that, the semantic similarity across domains is further characterized in conditional alignment based models. Different from both marginal and conditional alignment models, the autoencoder based ones tend to learn domain invariant feature embeddings by imposing a Kullback-Leibler divergence on the feature embedding layer.
Despite recent advances, deep network adaptation faces several challenges: 1) a large amount of labeled source data is needed for training (fine-tuning) a deep network; 2) the confidence of the class predicted for an unlabeled target sample is sometimes very low when the domain disparity is very large.
7 Adversarial Adaptation
Adversarial learning, originating from the generative adversarial net (GAN) [33], is a promising approach for generating pixel-level target samples or feature-level target representations by training robust DNNs. Currently, adversarial learning has become an increasingly popular idea for addressing TAL issues, by minimizing the between-domain discrepancy through an adversarial objective (e.g., a binary domain discriminator), instead of the generic MMD-based domain disparity in RKHS spaces. In fact, minimizing the domain disparity amounts to domain confusion in a learned feature space, where the domain discriminator cannot discriminate which domain a sample comes from. In this paper, the adversarial adaptation based TAL approaches are divided into three types: (i) Gradient Reversal-Based, (ii) Minimax Optimization-Based and (iii) Generative Adversarial Net-Based. The first two resort to feature-level domain confusion supervised by a domain discriminator for domain distribution discrepancy minimization, while the last one tends to pixel-level domain transfer by synthesizing plausible target domain images. The taxonomy of adversarial adaptation challenges is summarized in Table V.
[Table V: The taxonomy of adversarial adaptation methods, grouped by model basis into Gradient Reversal-Based, Minimax Optimization-Based and GANs-Based, with representative references.]
7.1 Gradient Reversal-Based
In adversarial optimization of DNNs between the general cross-entropy loss for source classifier learning and the domain discriminator for domain label prediction, Ganin and Lempitsky [223] first demonstrated that domain adaptation behavior can be achieved by adding a simple but effective gradient reversal layer (GRL). The augmented deep architecture can still be trained using standard stochastic gradient descent (SGD) based backpropagation. The gradient reversal based adversarial adaptation network consists of three parts: a domain-invariant feature extractor G_f (with parameters θ_f), a visual classifier G_y and a domain classifier G_d. Objectively, G_f can be learned by trying to minimize the visual classifier loss L_y and simultaneously maximize the domain classifier loss L_d, such that the feature representation can be domain invariant (i.e., domain confusion) and class discriminative. Therefore, in backpropagation optimization of θ_f, the gradients contributed by the losses L_y and L_d are ∂L_y/∂θ_f and −λ·∂L_d/∂θ_f, respectively. The essence of GRL lies in the reversed gradient with negative multiplier −λ. More recently, the gradient reversal based adversarial strategy has been used for domain adaptation [224, 225, 226, 227] under the CNN architecture, domain adaptive object detection [228] under the Faster R-CNN framework, large-scale kinship verification [229, 230] and fine-grained visual classification [231] under the Siamese network. Following a similar protocol to [223], in [224, 228], a domain classifier was designed as an adversarial objective for learning domain-invariant features by deploying a GRL layer. In [229, 230], two methods, AdvNet and AdvKin, were proposed, in which a general Siamese network was constructed with three fully-connected (fc) layers for similarity learning. The reversed gradient with negative multiplier was placed in one fc layer (MMD loss), the generic contrastive loss was deployed in another fc layer, and the softmax guided cross-entropy loss was deployed in the last fc layer. In [226], Pei et al. argued that single domain discriminator based adversarial adaptation only aligns the between-domain distribution without exploiting the multi-mode structures. Therefore, they proposed a multi-adversarial domain adaptation (MADA) method based on GRL with multiple class-wise domain discriminators for capturing multi-mode structures, such that fine-grained alignment of different distributions is enabled. Also, Zhang et al. [227] proposed a collaborative adversarial network (CAN) by designing multiple domain classifiers, one for each feature extraction block in the CNN.
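The GRL trick itself is tiny: it is the identity in the forward pass and flips (and scales) the gradient in the backward pass. A minimal sketch with manual backprop rather than an autodiff framework:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; multiplies incoming gradients by
    -lam in the backward pass (the GRL idea of [223], sketched manually)."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed, scaled gradient

# The feature extractor receives ordinary gradients from the label loss
# and reversed gradients from the domain loss:
grl = GradientReversal(lam=0.5)
features = np.ones(4)
grad_from_domain_loss = np.full(4, 2.0)
assert np.allclose(grl.forward(features), features)              # identity forward
assert np.allclose(grl.backward(grad_from_domain_loss), -1.0)    # -0.5 * 2.0
```

Placed between the feature extractor and the domain classifier, this single layer lets one SGD pass minimize L_y while effectively maximizing L_d with respect to θ_f.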
7.2 Minimax Optimization-Based
In GANs, the two key parts, the generator G and the discriminator D, are often placed in an adversarial state and generally solved by a minimax based gaming optimization method [33]. Therefore, minimax optimization based adversarial adaptation can be implemented for domain confusion through an adversarial objective of the domain classifier or regressor [232, 233, 234, 235, 236, 237, 238, 239]. Minimax optimization based adversarial adaptation training of DNNs originated in 2015 [232, 233]. Domain confusion maximization based adversarial domain adaptation was first proposed by Tzeng et al. [232], in which an adversarial CNN framework was deployed with a classification loss, a soft-label loss and two adversarial objectives, i.e., a domain confusion loss and a domain classifier loss. In [233], Ajakan et al. first proposed an adversarial training of stacked autoencoders (DANN) deployed with a classification loss and an adversarial objective, i.e., a domain regressor loss.
Suppose the visual classifier trained on labeled source data to be C, the domain discriminator to be D, and the feature representation to be F, with corresponding parameters θ_C, θ_D and θ_F. The general adversarial adaptation model aims to minimize the visual classifier loss L_C and maximize the domain discriminator loss L_D by learning θ_F, such that the feature representation function F can be more discriminative and domain-invariant. Simultaneously, the adversarial training aims to minimize the domain discriminator loss L_D under θ_D. Generally, maximizing L_D with respect to θ_F amounts to maximizing the domain confusion, such that the discriminator cannot tell which domain the samples come from, and vice versa. The above process can be generally formulated as the following adversarial adaptation model,
min_{θ_F, θ_C} L_C(C(F(X_s)), Y_s) − λ·L_D(D(F(X_s)), D(F(X_t))),
min_{θ_D} L_D(D(F(X_s)), D(F(X_t))), (43)
where X_s and X_t denote the source and target domain samples, and Y_s denotes the source data labels.
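The alternating optimization behind Eq.(43) can be sketched on toy data: here F is taken as a linear map and D as a logistic domain discriminator, the label classifier C is omitted, and all sizes and learning rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    # clipped for numerical safety
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30.0, 30.0)))

Xs = rng.normal(2.0, 1.0, (100, 3))   # source samples
Xt = rng.normal(0.0, 1.0, (100, 3))   # target samples
A = np.eye(3)                         # parameters of F(x) = x @ A
w = np.zeros(3)                       # parameters of D(f) = sigmoid(f @ w)

for _ in range(100):
    Fs, Ft = Xs @ A, Xt @ A
    ps, pt = sigmoid(Fs @ w), sigmoid(Ft @ w)
    # gradient of L_D (source labeled 1, target 0) w.r.t. w and A
    grad_w = Fs.T @ (ps - 1.0) / len(Fs) + Ft.T @ pt / len(Ft)
    grad_A = (Xs.T @ ((ps - 1.0)[:, None] * w[None, :]) / len(Fs)
              + Xt.T @ (pt[:, None] * w[None, :]) / len(Ft))
    w -= 0.1 * grad_w   # D descends its loss (tell domains apart)
    A += 0.1 * grad_A   # F ascends the same loss (domain confusion)

assert np.isfinite(A).all() and np.isfinite(w).all()
final_ps = sigmoid((Xs @ A) @ w)
assert 0.0 <= final_ps.min() <= final_ps.max() <= 1.0
```

The two gradient steps play the roles of the two sub-problems in Eq.(43): the discriminator update minimizes L_D over θ_D, while the feature update moves θ_F in the ascent direction of L_D, i.e., toward domain confusion.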
Under this basic framework in Eq.(43), Tzeng et al. [235] further proposed an adversarial discriminative domain adaptation (ADDA) method, in which two CNNs were separately learned for the source and target domains. The training of the source CNN relied only on the source data and labels by minimizing the cross-entropy loss, while the target CNN and the domain discriminator were alternately trained in an adversarial fashion with the source CNN fixed. Rozantsev et al. [237] proposed a residual parameter transfer model with adversarial domain confusion supervised by a domain classifier, in which the residual transform between domains was deployed in convolutional layers. For augmenting the domain-specific feature representation, Long et al. [238] proposed a conditional domain adversarial network (CDAN), in which the feature representation and classifier prediction were integrated via a multilinear map for jointly learning the domain classifier. More recently, Saito et al. [240] proposed a novel adversarial strategy, i.e., maximum classifier discrepancy (MCD), which aims to maximize the discrepancy between two classifiers’ outputs instead of using a domain discriminator; the feature extractor, in turn, aims to minimize the two classifiers’ discrepancy. They argued that the general domain discriminator does not take into account the task-specific decision boundaries between classes, which may lead to ambiguous features near class boundaries from the feature extractor.
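The discrepancy that MCD-style methods play their game over can be as simple as the mean absolute difference between the two classifiers' class-probability outputs on the same target batch; the sketch below illustrates that idea (the function name is ours, and the exact loss in [240] may differ in detail).

```python
import numpy as np

def classifier_discrepancy(p1, p2):
    """Mean absolute difference between the class-probability outputs of
    two classifiers on the same target batch (MCD-style discrepancy sketch)."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    return float(np.mean(np.abs(p1 - p2)))

p1 = np.array([[0.9, 0.1], [0.2, 0.8]])  # hypothetical softmax outputs
p2 = np.array([[0.6, 0.4], [0.2, 0.8]])
assert classifier_discrepancy(p1, p1) == 0.0  # agreeing classifiers
assert classifier_discrepancy(p1, p2) > 0.0   # disagreement near boundaries
```

Target samples on which the two classifiers disagree are exactly the ambiguous ones near decision boundaries, so minimizing this quantity through the feature extractor pushes target features away from those boundaries.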
7.3 Generative Adversarial Net-Based
A generative adversarial net (GAN) [33] and its variants are generally composed of two key parts: a generator G and a discriminator D. The generator aims to synthesize plausible images using an encoder and a decoder, while the discriminator plays the role of identifying authenticity by recognizing a sample as real or fake. A minimax gaming based alternating optimization scheme is generally used for solving G and D. In TAL studies, starting from 2017, GAN based models have been presented to synthesize pixel-level images whose distribution approximates the target domain, and then enable cross-domain image classification by using the synthesized image samples (e.g., objects, scenes, pedestrians and faces, etc.) [241, 242, 243, 244, 245, 41, 246, 247, 248].
Under the CycleGAN framework proposed by Zhu et al. [40], Hoffman et al. [241] first proposed a cycle-consistent adversarial domain adaptation model (CyCADA) for adapting representations at both the pixel level and the feature level without requiring aligned pairs, by jointly minimizing a pixel loss, a feature loss, a semantic loss and a cycle consistency loss. Bousmalis et al. [242] and Taigman et al. [243] proposed GAN-based models for unsupervised image-level domain adaptation, which aim to adapt source domain images to appear as if drawn from the target domain with well-preserved identity. In [244], Hu et al. proposed a duplex GAN (DupGAN) for image-level domain transformation, in which duplex discriminators, one for each domain, were trained against the generator for ensuring the reality of the domain transformation. Murez et al. [245] and Hong et al. [246] proposed image-to-image translation based domain adaptation models by leveraging GANs and synthetic data for semantic segmentation of target domain images. Person re-identification (ReID) is a typical cross-domain feature matching and retrieval problem [178, 164]. Recently, for addressing ReID challenges in complex scenarios, GAN-based domain adaptation was presented for plausible person image generation from a source domain to a target domain [249, 247, 248], across different visual cues and styles, such as poses, backgrounds, lightings, resolutions, seasons, etc. Additionally, GAN based cross-domain facial image generation for pose-invariant face representation, face frontalization and rotation has been intensively studied [36, 50, 48, 47], all of which tends to address domain adaptation and transfer problems in face recognition across poses.
7.4 Discussion and Summary
In this section, adversarial adaptation is presented in three streams: gradient reversal, minimax optimization and generative adversarial net (GAN). Gradient reversal and minimax optimization share a common characteristic, i.e., feature-level adaptation, by introducing a domain discriminator based adversarial objective trained against the feature extractor; the difference between them lies in the adversarial training strategy. Different from both, GAN-based adversarial adaptation focuses on pixel-level adaptation, i.e., image generation from the source domain to the target domain, such that the synthesized images appear as if drawn from the target domain.
Adversarial adaptation is recognized to be an emerging perspective; despite these advances, it still faces several challenges: 1) the domain discriminator is easily over-trained; 2) maximizing only domain confusion easily leads to class bias; 3) the gaming between the feature generator and the discriminator is human-dependent.
8 Benchmark Datasets
In this section, the benchmark datasets for testing TAL models are introduced to help readers get started with studies of transfer adaptation learning. In total, 12 benchmark datasets, including Office-31 (3DA) [5], Office+Caltech-10 (4DA) [5, 112, 203, 250], MNIST+USPS [130, 131], Multi-PIE [130, 131], COIL-20 [251], MSRC+VOC2007 [116], IVLSC [252, 253], AwA [254], Cross-dataset Testbed [1], Office-Home [255], ImageCLEF [256], and PACS [252], are summarized, each of which contains at least 2 different domains.
8.1 Office-31 (3DA)
Office-31 is a popular benchmark for visual domain transfer, which includes 31 categories of samples drawn from three different domains, i.e., Amazon (A), DSLR (D) and Webcam (W). Amazon consists of online e-commerce pictures, DSLR contains high-resolution pictures and Webcam contains low-resolution pictures taken by a web camera. There are 4,652 images in total, composed of 2,817, 498 and 795 images from domains A, D and W, respectively. For feature extraction, (1) as shallow features, 800-dimensional feature vectors extracted by Speeded-Up Robust Features (SURF) were used, and (2) as deep features, 4096-dimensional feature vectors extracted from a pre-trained AlexNet or VGG net were used. In model evaluation, six source→target domain pairs were tested, i.e., A→D, A→W, D→A, D→W, W→A, W→D.
8.2 Office+Caltech-10 (4DA)
This 4DA dataset contains 4 domains, in which 3 domains (A, D, W) are from Office-31 and the fourth domain (C) is from Caltech-256, a benchmark containing 30,607 images of 256 classes for object recognition. The 10 classes common to Office-31 and Caltech-256 were selected to form 4DA; therefore 2,533 images, composed of 958, 157, 295 and 1,123 images from domains A, D, W and C, were collected. In evaluation, 12 tasks with different source→target domain pairs are addressed, i.e., A→D, A→C, A→W, D→A, D→C, D→W, C→A, C→D, C→W, W→A, W→C, W→D.
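The 12 evaluation tasks are simply the ordered pairs of the four domains, which can be enumerated mechanically:

```python
from itertools import permutations

# The 4DA cross-domain tasks are all ordered source→target domain pairs.
domains = ["A", "D", "W", "C"]
tasks = [f"{s}→{t}" for s, t in permutations(domains, 2)]

assert len(tasks) == 12          # 4 * 3 ordered pairs
assert "A→D" in tasks and "W→C" in tasks
```

The same enumeration yields the 6 tasks of Office-31 (3 domains) and the 20 tasks of IVLSC (5 domains).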
8.3 MNIST+USPS
MNIST and USPS are two benchmarks containing 10 categories of digit images under different distributions for handwritten digit recognition, and are therefore qualified for TAL tasks. MNIST includes 60,000 training pictures and 10,000 test pictures. USPS includes 7,291 training pictures and 2,007 test pictures. For TAL tasks, 2,000 pictures and 1,800 pictures were randomly selected from MNIST and USPS, respectively. For feature extraction, each image was resized to 16×16 and a 256-dimensional feature vector encoding the pixel values was extracted. In evaluation, 2 cross-domain tasks, i.e., MNIST→USPS and USPS→MNIST, are addressed.
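The 256-dimensional raw-pixel feature above is just the flattened 16×16 image; a minimal sketch (the [0, 1] scaling is our assumption about the preprocessing, not a detail stated here):

```python
import numpy as np

def pixel_feature(img16):
    """Encode a 16x16 grayscale digit as the 256-dimensional raw pixel
    vector used in the MNIST↔USPS transfer setting (sketch)."""
    img16 = np.asarray(img16, dtype=float)
    assert img16.shape == (16, 16), "images are resized to 16x16 first"
    return (img16 / 255.0).reshape(256)  # flatten row-major into 256 dims

digit = np.random.randint(0, 256, size=(16, 16))
feat = pixel_feature(digit)
assert feat.shape == (256,)
assert 0.0 <= feat.min() and feat.max() <= 1.0
```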
8.4 Multi-PIE
Multi-PIE is a benchmark with poses, illuminations and expressions for face recognition, which includes 41,368 faces of 68 different identities. For TAL tasks, (1) face recognition across poses is generally evaluated on five different face orientations, including C05: left pose, C07: upward pose, C09: downward pose, C27: frontal pose and C29: right pose. In total, 3,332, 1,629, 1,632, 3,329, and 1,632 facial images are contained in C05, C07, C09, C27 and C29, respectively. Therefore, 20 tasks were evaluated, i.e., C05→C07, C05→C09, C05→C27, etc.; (2) face recognition across illumination and exposure conditions is evaluated by randomly selecting two sets, PIE1 and PIE2, from the frontal face images. Two tasks, PIE1→PIE2 and PIE2→PIE1, were evaluated.
8.5 COIL-20
COIL-20 is a 3D object recognition benchmark containing 1,440 images of 20 object categories. By rotating each object horizontally in steps of 5 degrees, 72 images per class covering 360 degrees were obtained. For TAL tasks, two disjoint subsets with different distributions, i.e., COIL1 and COIL2, were prepared, where COIL1 and COIL2 each contain the images from two complementary ranges of rotation angles. Therefore, two cross-domain tasks, i.e., COIL1→COIL2 and COIL2→COIL1, were evaluated.
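The angle-based split can be reproduced in a few lines. The exact ranges below, [0°, 85°] ∪ [180°, 265°] for COIL1 and the complementary quadrants for COIL2, are our assumption about the commonly used protocol, not values stated in this section:

```python
# 72 views per object at multiples of 5 degrees
angles = range(0, 360, 5)
coil1 = [a for a in angles if 0 <= a <= 85 or 180 <= a <= 265]
coil2 = [a for a in angles if 90 <= a <= 175 or 270 <= a <= 355]

assert len(coil1) == len(coil2) == 36     # each domain gets half of the 72 views
assert set(coil1).isdisjoint(coil2)       # disjoint viewing-angle subsets
```

Because the two subsets cover opposite quadrants of the turntable, their image distributions differ, which is what makes the COIL1↔COIL2 pair a valid cross-domain task.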
8.6 MSRC+VOC2007
MSRC contains 4,323 images of 18 categories and VOC2007 contains 5,011 images of 20 categories. 1,269 and 1,530 images w.r.t. the six common categories, i.e., aeroplane, bicycle, bird, car, cow and sheep, were selected from MSRC and VOC2007, respectively. For feature representation, 128-dimensional DenseSIFT features were extracted for the cross-domain image classification tasks MSRC→VOC2007 and VOC2007→MSRC.
8.7 IVLSC
IVLSC is a large-scale image dataset containing five subsets, i.e., ImageNet (I), VOC2007 (V), LabelMe (L), SUN09 (S), and Caltech (C). For TAL tasks, 7,341, 3,376, 2,656, 3,282, and 1,415 samples w.r.t. the five common categories, i.e., bird, cat, chair, dog and human, were randomly selected from the I, V, L, S, and C domains, respectively. For feature representation, 4096-dimensional DeCAF6 deep features were extracted for cross-domain image classification under 20 tasks, i.e., I→V, I→L, I→S, I→C, …, C→I, C→V, C→L, C→S.
8.8 AwA
AwA is an animal identification dataset containing 30,475 images of 50 categories, which provides a benchmark due to the inherent data distribution difference. This data set is currently less used in evaluating TAL algorithms.
8.9 Cross-dataset Testbed
This benchmark contains 10,473 images of 40 categories, collected from three domains: 3,847 images in Caltech-256 (C), 4,000 images in ImageNet (I), and 2,626 images in SUN (S). For feature extraction, 4096-dimensional DeCAF7 deep features were used for the cross-domain image classification tasks C→I, C→S, I→C, I→S, S→C, S→I.
8.10 Office-Home
Office-Home is a relatively new benchmark containing 15,585 images of 65 categories, collected from 4 domains, i.e., (1) Art (Ar): artistic depictions of objects in the form of sketches, paintings, ornamentation, etc.; (2) Clipart (Cl): a collection of clipart images; (3) Product (Pr): images of objects without a background, akin to the Amazon category in the Office dataset; (4) Real-World (RW): images of objects captured with a regular camera. In detail, there are 2,421, 4,379, 4,428 and 4,357 images in the Ar, Cl, Pr and RW domains, respectively. In evaluation, 12 cross-domain tasks were tested, e.g., Ar→Cl, Ar→Pr, Ar→RW, Cl→Ar, etc.
8.11 ImageCLEF
This benchmark includes 1,800 images of 12 categories, drawn from 3 domains: 600 images in Caltech-256 (C), 600 images in ImageNet ILSVRC2012 (I), and 600 images in Pascal VOC2012 (P). Therefore, 6 cross-domain tasks, i.e., C→I, C→P, I→C, I→P, P→C, P→I, were evaluated.
8.12 PACS
PACS is a new benchmark containing 7 common categories, i.e., dog, elephant, giraffe, guitar, horse, house and person, from 4 domains: 1,670 images in Photo (P), 2,048 images in Art Painting (A), 2,344 images in Cartoon (C), and 3,929 images in Sketch (S). For feature representation, 4096-dimensional VGG-M deep features were used and 12 cross-domain tasks are evaluated, e.g., P→A, P→C, P→S, A→P, A→C, etc.
8.13 Discussion and Summary
In this section, 12 benchmarks constructed from popular computer vision datasets, such as ImageNet, ILSVRC, PASCAL VOC, Caltech-256, Multi-PIE and MNIST, for addressing cross-domain image classification tasks are presented. Despite these endeavors made by researchers, building more benchmarks for cross-domain vision understanding problems, namely object detection, semantic segmentation, visual relation modeling, scene parsing, etc., remains a future challenge for transfer adaptation learning.
9 Conclusion
Transfer adaptation learning is an energetic research field which aims to learn domain adaptive representations and classifiers from source domains, toward representing and recognizing samples from a distribution-different but semantically related target domain. This paper surveyed recent advances in transfer adaptation learning over the past decade and presented a new taxonomy of five technical challenges faced by researchers: instance re-weighting adaptation, feature adaptation, classifier adaptation, deep network adaptation and adversarial adaptation. Besides, 12 visual benchmarks that address multiple cross-domain recognition tasks were collected and summarized to help facilitate researchers’ insight into the tasks and scenarios that transfer adaptation learning aims to address.
The proposed taxonomy of transfer adaptation learning challenges provides a framework for researchers to better understand and identify the status of the field, and its future research challenges and directions. Each challenge was summarized with a discussion of existing problems and future directions, which, we believe, are worth studying for better capturing general domain knowledge, toward universal machine learning. Throughout the entire research line, one specific area of transfer adaptation learning that seems to be still understudied is the co-adaptation of multiple but heterogeneous domains, which goes beyond two homogeneous domains. This challenge is closer to real-world scenarios, in which numerous domains can be found, and co-adaptation is expected to capture the commonality and specificity among multiple domains. Additionally, an open question that remains unanswered is: when will we need transfer adaptation learning for a given application scenario? The basic condition for analyzing whether a cross-domain shift occurs is still not clear. We regard these as promising directions of transfer adaptation learning for future research.
Acknowledgment
The author would like to thank the pioneering researchers in transfer learning, domain adaptation and other related fields. The author would also like to thank Dr. Mingsheng Long and Dr. Lixin Duan for their kind help in providing insightful discussions.
References
 [1] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in CVPR, 2011, pp. 1521–1528.
 [2] H. Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” J. Stat. Plan. Inference, vol. 90, no. 2, pp. 227–244, 2000.
 [3] J. Yang, R. Yan, and A. G. Hauptmann, “Cross-domain video concept detection using adaptive SVMs,” in ACM MM, 2007, pp. 188–197.
 [4] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
 [5] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in ECCV, 2010.
 [6] D. Cook, K. Feuz, and N. Krishnan, “Transfer learning for activity recognition: a survey,” Knowl. Inf. Syst., vol. 36, pp. 537–556, 2013.
 [7] J. Lu, V. Behbood, P. Hao, H. Zuo, S. Xue, and G. Zhang, “Transfer learning using computational intelligence: A survey,” Knowledge Based Systems, vol. 80, pp. 14–23, 2015.
 [8] V. Patel, R. Gopalan, R. Li, and R. Chellappa, “Visual domain adaptation,” IEEE Signal Processing Magazine, pp. 53–69, 2015.
 [9] L. Shao, F. Zhu, and X. Li, “Transfer learning for visual categorization: A survey,” IEEE Trans. Neural Networks and Learning Systems, vol. 26, no. 5, pp. 1019–1034, 2015.
 [10] W. Pan, “A survey of transfer learning for collaborative recommendation,” Neurocomputing, vol. 177, pp. 447–453, 2016.
 [11] M. Wang and W. Deng, “Deep visual domain adaptation: A survey,” Neurocomputing, vol. 312, pp. 135–153, 2018.
 [12] K. Weiss, T. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” J. Big Data, vol. 3, pp. 1–40, 2016.
 [13] S. Salaken, A. Khosravi, T. Nguyen, and S. Nahavandi, “Extreme learning machine based transfer learning algorithms: A survey,” Neurocomputing, vol. 267, pp. 516–524, 2017.
 [14] G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
 [15] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
 [16] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105.
 [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2015, pp. 770–778.
 [18] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
 [19] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” in NIPS, 2016.
 [20] W. Liu, D. Anguelov, D. Erhan, S. Christian, S. Reed, C. Fu, and A. Berg, “SSD: Single shot multibox detector,” in ECCV, 2016.
 [21] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. Berg, “DSSD: Deconvolutional single shot detector,” in arXiv, 2017.
 [22] E. Ahmed, M. Jones, and T. Marks, “An improved deep learning architecture for person re-identification,” in CVPR, 2015.
 [23] D. Chung, K. Tahboub, and E. Delp, “A two-stream siamese convolutional neural network for person re-identification,” in ICCV, 2017.
 [24] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, “Person re-identification by multi-channel parts-based CNN with improved triplet loss function,” in CVPR, 2016.
 [25] L. Hou, D. Samaras, T. Kurc, Y. Gao, J. Davis, and J. Saltz, “Patch-based convolutional neural network for whole slide tissue image classification,” in CVPR, 2016.
 [26] A. Esteva, B. Kuprel, R. Novoa, J. Ko, S. Swetter, H. Blau, and S. Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, pp. 115–118, 2017.
 [27] D. Shen, G. Wu, and H. Suk, “Deep learning in medical image analysis,” Annual Review of Biomedical Engineering, vol. 19, pp. 221–248, 2017.
 [28] M. Xie, N. Jean, M. Burke, D. Lobell, and S. Ermon, “Transfer learning from deep features for remote sensing and poverty mapping,” in AAAI, 2016.
 [29] N. Jean, M. Burke, M. Xie, W. Davis, D. Lobell, and S. Ermon, “Combining satellite imagery and machine learning to predict poverty,” Science, vol. 353, no. 6301, pp. 790–794, 2016.
 [30] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi, “Deep feature extraction and classification of hyperspectral images based on convolutional neural networks,” IEEE Trans. Geoscience and Remote Sensing, vol. 54, no. 10, pp. 6232–6251, 2016.
 [31] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, “Convolutional neural networks for largescale remotesensing image classification,” IEEE Trans. Geoscience and Remote Sensing, vol. 55, no. 2, pp. 645–657, 2017.
 [32] Y. Bengio, “Deep learning of representations for unsupervised and transfer learning,” JMLR, vol. 27, p. 17–37, 2012.
 [33] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in arXiv, 2014.
 [34] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” in arXiv, 2014.
 [35] L. Tran, X. Yin, and X. Liu, “Representation learning by rotating your faces,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2018.

 [36] ——, “Disentangled representation learning GAN for pose-invariant face recognition,” in CVPR, 2017.
 [37] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in ICML, 2016.
 [38] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in ICLR, 2016.

 [39] P. Isola, J. Zhu, T. Zhou, and A. Efros, “Image-to-image translation with conditional adversarial networks,” in CVPR, 2017, pp. 1125–1134.
 [40] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017.
 [41] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in CVPR, 2018, pp. 8789–8797.
 [42] M. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in NIPS, 2017.
 [43] D. Yoo, N. Kim, S. Park, A. Paek, and I. Kweon, “Pixel-level domain transfer,” in ECCV, 2016.
 [44] L. Gatys, A. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in CVPR, 2016, pp. 2414–2423.
 [45] L. Gatys, A. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman, “Controlling perceptual factors in neural style transfer,” in CVPR, 2017, pp. 3985–3993.

 [46] J. Johnson, A. Alahi, and F. Li, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, 2016, pp. 694–711.
 [47] Y. Hu, X. Wu, B. Yu, R. He, and Z. Sun, “Pose-guided photorealistic face rotation,” in CVPR, 2018.
 [48] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Towards large-pose face frontalization in the wild,” in ICCV, 2017.
 [49] J. Cao, Y. Hu, B. Yu, R. He, and Z. Sun, “Load balanced GANs for multi-view face image synthesis,” in arXiv:1802.07447, 2018.
 [50] R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis,” in ICCV, 2017.
 [51] H. Yang, D. Huang, Y. Wang, and A. Jain, “Learning face age progression: A pyramid architecture of GANs,” in CVPR, 2018.
 [52] A. Evgeniou and M. Pontil, “Multi-task feature learning,” in NIPS, 2007.
 [53] Y. Freund and R. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.
 [54] T. Dietterich, “Machine learning: Four current directions,” AI Mag., vol. 18, no. 4, pp. 97–136, 1997.
 [55] X. Wu, V. Kumar, J. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. McLachlan, A. Ng, B. Liu, P. Yu, Z. Zhou, M. Steinbach, D. Hand, and D. Steinberg, “Top 10 algorithms in data mining,” Knowl. Inf. Syst., vol. 14, pp. 1–37, 2008.
 [56] Z.H. Zhou, “A brief introduction to weakly supervised learning,” National Science Review, vol. 1, pp. 1–10, 2017.
 [57] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, 2006.
 [58] Z. Zhou and M. Li, “Semi-supervised learning by disagreement,” Knowledge and Information Systems, vol. 24, no. 3, pp. 415–439, 2010.
 [59] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, “Text classification from labeled and unlabeled documents using EM,” Machine Learning, vol. 39, pp. 103–134, 2000.
 [60] D. Miller and H. Uyar, “A mixture of experts classifier with learning based on both labeled and unlabeled data,” in NIPS, 1997, pp. 571–577.
 [61] D. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling, “Semi-supervised learning with deep generative models,” in NIPS, 2014.
 [62] O. Chapelle and A. Zien, “Semi-supervised classification by low density separation,” in AISTATS, 2005, pp. 57–64.
 [63] T. Joachims, “Transductive inference for text classification using support vector machines,” in ICML, 1999, pp. 200–209.
 [64] Y. Li, I. Tsang, J. Kwok, and Z. Zhou, “Convex and scalable weakly labeled SVMs,” Journal of Machine Learning Research, vol. 14, pp. 2151–2188, 2013.
 [65] Z. Zhou and M. Li, “Tri-training: exploiting unlabeled data using three classifiers,” IEEE Trans. Knowledge Data Engineering, vol. 17, pp. 1529–1541, 2005.
 [66] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in COLT, 1998, pp. 92–100.
 [67] Z. Zhou and M. Li, “Semi-supervised learning by disagreement,” Knowledge and Information Systems, vol. 24, no. 3, pp. 415–439, 2010.
 [68] A. Blum and S. Chawla, “Learning from labeled and unlabeled data using graph mincuts,” in ICML, 2001, pp. 19–26.
 [69] M. Belkin and P. Niyogi, “Semi-supervised learning on Riemannian manifolds,” Machine Learning, vol. 56, no. 1–3, pp. 209–239, 2004.
 [70] T. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
 [71] Z. Yang, W. Cohen, and R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings,” in ICML, 2016.
 [72] X. Zhu, “Semi-supervised learning literature survey,” Technical Report 1530, 2008.
 [73] I. Triguero, S. Garcia, and F. Herrera, “Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study,” Knowledge and Information Systems, vol. 42, no. 2, pp. 245–284, 2015.
 [74] D. D. Lewis and W. A. Gale, “A sequential algorithm for training text classifiers,” ACM SIGIR Forum, vol. 29, no. 2, pp. 13–19, 1994.
 [75] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” in ICML, 2000, pp. 999–1006.
 [76] E. Elhamifar, G. Sapiro, A. Yang, and S. Sastry, “A convex optimization framework for active learning,” in ICCV, 2013, pp. 209–216.
 [77] B. Settles, “Active learning literature survey,” Technical Report, 2010.
 [78] M. Rohrbach, M. Stark, and B. Schiele, “Evaluating knowledge transfer and zero-shot learning in a large-scale setting,” in CVPR, 2011, pp. 1641–1648.
 [79] C. Lampert, H. Nickisch, and S. Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 453–465, 2014.
 [80] Y. Fu, T. Hospedales, T. Xiang, and S. Gong, “Transductive multi-view zero-shot learning,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 37, no. 11, pp. 2332–2345, 2015.
 [81] Z. Ding, M. Shao, and Y. Fu, “Generative zero-shot learning via low-rank embedded semantic dictionary,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2018.
 [82] L. Niu, J. Cai, A. Veeraraghavan, and L. Zhang, “Zero-shot learning via category-specific visual-semantic mapping and label refinement,” IEEE Trans. Image Processing, 2018.
 [83] S. Rahman, S. Khan, and F. Porikli, “A unified approach for conventional zero-shot, generalized zero-shot and few-shot learning,” IEEE Trans. Image Processing, pp. 1–16, 2018.
 [84] Y. Yu, Z.