Transfer Learning for Cross-Dataset Recognition: A Survey

05/11/2017 · by Jing Zhang, et al.

This paper summarises and analyses cross-dataset transfer learning techniques for recognition, with an emphasis on what kinds of methods can be used to boost the target task when the available source and target data are presented in different forms. For the first time, it summarises several transferring criteria in detail at the concept level; these criteria are the key bases that guide what kind of knowledge to transfer between datasets. In addition, a taxonomy of cross-dataset scenarios and problems is proposed according to the properties of data that define how different datasets diverge, and the recent advances on each specific problem under different scenarios are reviewed. Moreover, some real-world applications and the corresponding commonly used benchmarks of cross-dataset recognition are reviewed. Lastly, several future directions are identified.


1 Introduction

How humans transfer learning from one context to another similar context has been explored in the fields of psychology and education [Woodworth and Thorndike (1901), Perkins et al. (1992)]. For example, learning to drive a car helps a person later learn more quickly to drive a truck, and learning mathematics prepares students to study physics. Machine learning algorithms are largely inspired by human brains. However, most of them require a huge number of training examples to learn a new model from scratch and fail to apply knowledge learned from previous domains or tasks. This may be because a basic assumption of statistical learning theory is that the training and test data are drawn from the same distribution and belong to the same task. Intuitively, learning from scratch is neither realistic nor practical, because it violates how humans learn. In addition, manually labelling a large amount of data for a new domain or task is labour intensive, especially for modern “data-hungry” and “data-driven” learning techniques (i.e. deep learning). However, the big data era provides a huge amount of available data collected for other domains and tasks. Hence, using previously available data smartly for a current task with scarce data will be beneficial for real-world applications.

To reuse previous knowledge for current tasks, the differences between the old data and the new data need to be taken into account. Take object recognition as an example. As claimed by Torralba2011, despite the great efforts of object dataset creators, the datasets appear to have strong built-in bias caused by various factors, such as selection bias, capture bias, category or label bias, and negative set bias. This suggests that no matter how big a dataset is, it cannot cover the complexity of the real visual world. Hence, the dataset bias needs to be considered before reusing data from previous datasets. Pan2010 summarise that the differences between datasets can be caused by domain divergence (i.e. distribution shift or feature space difference) or task divergence (i.e. conditional distribution shift or label space difference), or both. For example, in visual recognition, the distributions of the previous and current data can be discrepant due to different environments, lighting, backgrounds, sensor types, resolutions, view angles, and post-processing. Those external factors may cause distribution divergence or even feature space divergence between domains. On the other hand, task divergence between current and previous data is also ubiquitous. For example, it is highly possible that an animal species that we want to recognize has not been seen by the pre-established recognition system. If we still want the system to recognize the new animal species, traditional machine learning techniques require a large amount of labelled samples of that species, and the whole system needs to be retrained. Hence, when either domain divergence or task divergence occurs, traditional machine learning techniques cannot be used directly.
To make the best use of previous data, some transferring strategies need to be employed that take the domain or task shift into account, and then tackle the problem in the new data without a drop in performance. Thus, the objective of this paper is to review the related state-of-the-art transfer learning techniques for efficiently reusing previous data for current tasks.

There have been other surveys on transfer learning during the past few years. For example, the survey by Pan2010 categorized and reviewed transfer learning for classification, regression, and clustering problems. They defined three settings of transfer learning, namely inductive transfer, transductive transfer and unsupervised transfer, and covered the basic scenarios with one source domain and one target domain, transferring knowledge between datasets with homogeneous feature spaces. However, real-world applications require techniques that can transfer knowledge in more complicated and unconstrained scenarios, such as multiple sources, heterogeneous features, heterogeneous tasks, sequential data, or even transferring previous knowledge to unseen target data. Shao2015 review the transfer learning algorithms for applications of visual categorization. Similarly, the work presented by Patel2015 focuses on visual domain adaptation. These two surveys only cover methodologies for transferring in the visual domains. In addition, Shao’s work reviews previous research based on only two main categories: feature-based knowledge transfer and classifier-based knowledge transfer. Patel’s work is purely approach-based and only focuses on domain adaptation, a subtopic of transfer learning. Hence, it remains unclear how to transfer knowledge in more complex real-world scenarios where different kinds of data are presented, because different techniques are applied depending on the types of previous and current knowledge and on the characteristics of the available data, which have not been well summarized and analysed by previous surveys.

In this paper, our focus is to analyse and summarise how to take full advantage of previous data presented in different forms (such as homogeneous or heterogeneous feature space and label space, and availability of labels, etc.) for current domain and task. Specifically, the key contributions of this survey are as follows:

  • First, this paper defines the dataset properties that potentially lead to dataset divergence in terms of domain and task, such as feature space, data sufficiency, data balance, sequential data, availability of labels, and label spaces. Different properties define different scenarios and problems, which lead to different strategies for feasible and proper transfer.

  • Secondly, this paper for the first time summarises several transferring criteria in detail at the concept level, which have been used by previous research explicitly or implicitly. These criteria are the key bases that guide what kind of knowledge to transfer between datasets. The basic assumptions and the potential categories of knowledge (i.e. instance, feature representation, classifier, or hybrid knowledge) that can be transferred using each criterion are analysed and presented. Moreover, several commonly used methods or ideas are illustrated under each criterion.

  • Thirdly, we give a comprehensive overview of the recent advances (based on transferring criterion and knowledge) on each scenario defined by the properties of dataset divergence, and connect each scenario with the referred techniques among the literature, which provides a systematic knowledge map of transfer learning for cross-dataset recognition and presents specific strategies for each scenario.

  • Fourthly, some real world applications of cross-dataset recognition are reviewed and the commonly used benchmark datasets for each application are provided.

  • Lastly, several future research directions are identified on transfer learning for cross-dataset recognition.

The rest of the paper is organised as follows. Section 2 defines the terminologies used in the paper, and summarises the cross-dataset transferring criteria that guide the transferring of knowledge and the basic ideas of the typical methods that use each criterion. Section 3 discusses the scenario where the feature spaces and label spaces between training and test datasets are homogeneous, suggesting that the shift is caused by the data distributions. The scenario of homogeneous feature and label spaces is further classified into seven problems: labelled target dataset; labelled plus unlabelled target dataset; unlabelled target dataset; imbalanced unlabelled target dataset; sequential labelled target data; sequential unlabelled target data; unavailable target training data. Section 4 discusses the scenario where the label spaces are the same but the feature spaces between the training and test datasets are different, with three problems: labelled target dataset; labelled plus unlabelled target dataset; unlabelled target dataset. Section 5 discusses the scenario where the feature spaces are the same but the label spaces between the training and test datasets are different, with five problems: labelled target dataset; unlabelled target dataset; sequential labelled target dataset; unavailable target training dataset; unlabelled source dataset. Section 6 discusses the scenario where both the label spaces and the feature spaces between the training and test datasets are different, with two problems: labelled target dataset; sequential labelled target dataset. In Section 7, several real-world applications and the most commonly used datasets for cross-dataset transfer learning are summarised. In Section 8, the conclusion and discussion on future directions are presented.

2 Overview

2.1 Terminologies and Definitions

We begin with the definitions of terminologies. We follow the definitions of “domain” and “task” defined by Pan2010.

(Domain [Pan and Yang (2010)]) A domain is defined as $\mathcal{D} = \{\mathcal{X}, P(X)\}$, which is composed of two components: a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, \ldots, x_n\} \in \mathcal{X}$.

(Task [Pan and Yang (2010)]) Given a specific domain, a task is defined as $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$, which is composed of two components: a label space $\mathcal{Y}$ and a predictive function $f(\cdot)$, where $f(\cdot)$ can be seen as a conditional distribution $P(Y|X)$ and $Y = \{y_1, \ldots, y_n\} \in \mathcal{Y}$.

In addition, we give the formal definition of “Dataset” in the context of cross-dataset recognition. (Dataset) A dataset is defined as $\mathcal{S} = \{\mathcal{D}, \mathcal{T}\}$, which is a collection of data that belong to a specific domain $\mathcal{D}$ with a specific task $\mathcal{T}$.

According to the definitions, two datasets can be different with respect to domain, or task, or both. We define the auxiliary training dataset as source dataset and the current test dataset as target dataset. The properties of the source and target datasets determine what kinds of methods can be used for transferring knowledge. Below, we define several dataset properties (based on “Data” and “Label”) that potentially lead to different types of divergence and can further define different problems.

  • Data:

    • Feature space:

      the consistency of feature spaces (i.e. different feature extraction methods or different data modalities) between source and target datasets.

    • Data availability: the availability and sufficiency of target data in the training stage.

    • Balanced data: whether the numbers of data samples in each class are balanced.

    • Sequential data: whether the data are sequential and evolving over time.

  • Label:

    • Label availability: the availability of labels in source and target datasets.

    • Label space: whether the data categories of the two datasets are identical.

Based on these dataset properties, feature space and label space are used as the starting points of the taxonomy, while the remaining data properties define the specific problems within each scenario. Hence, four scenarios are categorised as the first layer of the taxonomy of problems and are described as follows,

  • Homogeneous feature spaces and label spaces: the source and target datasets are different only in terms of data distributions (i.e. domain divergence occurs). The feature spaces and label spaces are identical across datasets.

  • Heterogeneous label spaces: the source and target datasets are different in terms of label spaces (i.e. task divergence occurs) but with same feature spaces.

  • Heterogeneous feature spaces: the source and target datasets are different in terms of feature spaces (i.e. domain divergence occurs) but with same label spaces.

  • Heterogeneous feature spaces and label spaces: both the feature spaces and the label spaces between source and target dataset are different (i.e. both domain and task divergence occur).

2.2 “What to Transfer”

Four main categories of knowledge are transferred in the literature:

  • Instance-based transfer: re-weight some data in the source domain or in both domains to reduce domain divergence.

  • Feature representation based transfer: learn “good” feature representations to minimize the domain shift and the error of the learning task.

  • Classifier-based transfer: learn a new model that minimizes the generalization error in the target domain via training instances from both domains.

  • Hybrid knowledge transfer: transfer more than one kind of knowledge, e.g. joint instance and feature representation transfer, joint instance and classifier transfer, or joint feature representation and classifier transfer.

2.3 Transferring Criteria

In this section, we summarise the most typical criteria for transferring knowledge between datasets, which have been used explicitly or implicitly in the literature. The basic assumptions and the potential categories of knowledge (“what to transfer”) that can be transferred using each criterion are analysed and presented. Moreover, several commonly used methods or ideas are illustrated under each criterion. The identified criteria include: statistic criterion, geometric criterion, higher-level representation criterion, correspondence criterion, class-based criterion, self labelling, and hybrid criterion. We denote the source dataset as $\mathcal{X}_s = \{x_i^s\}_{i=1}^{n_s}$, drawn from distribution $P_s(X)$, and the target dataset as $\mathcal{X}_t = \{x_j^t\}_{j=1}^{n_t}$, drawn from distribution $P_t(X)$.

  • Statistic Criterion: aims at reducing the statistical distribution shift between datasets using some mechanisms, such as instance re-weighting, feature transformation, and classifier parameter transfer. The statistic criterion generally assumes sufficient data in each dataset to approximate the respective statistic distributions. This criterion has been used for instance-based transfer, feature representation based transfer, classifier-based transfer, and hybrid transfer (i.e. joint instance and feature representation transfer [Long et al. (2014), Aljundi et al. (2015)]). The most commonly used methods for comparing and thereby reducing distribution shift are summarised as follows.

    1. Kullback-Leibler divergence [Sugiyama et al. (2008)]

      $D_{KL}(P_s \,\|\, P_t) = \int P_s(x) \log \frac{P_s(x)}{P_t(x)} \, dx$   (1)
    2. Jensen-Shannon divergence [Quanz et al. (2012)]

      $D_{JS}(P_s \,\|\, P_t) = \tfrac{1}{2} D_{KL}(P_s \,\|\, M) + \tfrac{1}{2} D_{KL}(P_t \,\|\, M), \quad M = \tfrac{1}{2}(P_s + P_t)$   (2)
    3. Quadratic divergence [Si et al. (2010)]

      $D_{Q}(P_s, P_t) = \int \big(P_s(x) - P_t(x)\big)^2 \, dx$   (3)
    4. Hellinger distance [Baktashmotlagh et al. (2014)]

      $D_{H}(P_s, P_t) = \Big(\int \big(\sqrt{P_s(x)} - \sqrt{P_t(x)}\big)^2 \, dx\Big)^{1/2}$   (4)
    5. Mutual information [Shi and Sha (2012)]

      $I(X; Q) = H(Q) - H(Q \,|\, X)$   (5)

      where $X$ represents all the data from the source and target domains, $Q$ denotes the domain label (i.e. 0 for the source domain, and 1 for the target domain), $p(q \,|\, x_i)$ is the two-dimensional posterior probability vector of assigning $x_i$ to either the source or the target, given all other data points from the two domains, and $p(q)$ is the estimated prior distribution of the domain label. By minimizing the mutual information between the data instance $X$ and its (binary) domain label $Q$, the domain shift is reduced.

    6. $\mathcal{A}$-divergence [Ben-David et al. (2007), Ben-David et al. (2010)]

      $d_{\mathcal{A}}(P_s, P_t) = 2 \sup_{A \in \mathcal{A}} \big| P_s(A) - P_t(A) \big|$   (6)

      where $\mathcal{H}$ is a hypothesis class on $\mathcal{X}$, and $\mathcal{A}$ is the set of subsets of $\mathcal{X}$ for which some $h \in \mathcal{H}$ is the characteristic function; that is, $x \in A \Leftrightarrow h(x) = 1$. The $\mathcal{A}$-divergence measures the distance between distributions in a hypothesis class $\mathcal{H}$, and can be approximated by the empirical risk of a classifier that discriminates between instances drawn from $P_s$ and instances drawn from $P_t$.

    7. Maximum Mean Discrepancy (MMD) [Pan et al. (2009)]

      $\mathrm{MMD}(\mathcal{X}_s, \mathcal{X}_t) = \Big\| \frac{1}{n_s} \sum_{i=1}^{n_s} \phi(x_i^s) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi(x_j^t) \Big\|_{\mathcal{H}}$   (7)

      where $\phi(\cdot)$ represents the feature map induced by a kernel function that maps the original data to a reproducing kernel Hilbert space (RKHS). The MMD compares the statistic moments of distributions. If the kernel is a characteristic kernel (e.g. a Gaussian or Laplace kernel), MMD compares all orders of statistic moments, making MMD a metric on distributions.
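As an illustration of the statistic criterion, the empirical (biased) MMD estimate of Eq. (7) can be computed directly from kernel matrices. The sketch below, assuming a Gaussian kernel and NumPy, is illustrative rather than a reproduction of any surveyed method:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix between rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(Xs, Xt, sigma=1.0):
    """Biased estimate of the squared MMD between samples Xs and Xt."""
    Kss = gaussian_kernel(Xs, Xs, sigma)
    Ktt = gaussian_kernel(Xt, Xt, sigma)
    Kst = gaussian_kernel(Xs, Xt, sigma)
    return Kss.mean() + Ktt.mean() - 2 * Kst.mean()

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 5)), rng.normal(0, 1, (200, 5)))
shifted = mmd2(rng.normal(0, 1, (200, 5)), rng.normal(2, 1, (200, 5)))
# A distribution shift between the two samples yields a larger MMD value.
```

The estimate equals the squared RKHS norm of the difference of empirical mean embeddings, so it is non-negative and grows with the shift between the two samples.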

  • Geometric Criterion: bridges datasets according to their geometrical properties. This criterion assumes domain shift can be reduced using the relationship of geometric structures between datasets and is generally used for feature representation based transfer. The examples of transferring using geometric criterion are as follows.

    1. Subspace alignment [Fernando et al. (2013)]
      Learn a linear mapping $M$ to align the source subspace coordinate system to the target one:

      $M^* = \arg\min_{M} \| X_s M - X_t \|_F^2 = X_s^\top X_t$   (8)

      where the subspaces $X_s$ and $X_t$ are pre-learned using PCA on the source and target domains respectively.

    2. Intermediate subspaces [Gopalan et al. (2011), Gong et al. (2012)]
      Identify intermediate subspaces between the source and target, and then learn the information from these subspaces to convey the domain changes. The subspaces are identified as points on a Grassmann manifold, with the source subspace and target subspace being two such points. To bridge them, points on the geodesic path between them (a constant-velocity curve on the manifold) are sampled to form the intermediate subspaces. Then both source and target data are projected onto the obtained intermediate subspaces (either a sampled subset along the geodesic [Gopalan et al. (2011)] or all of them [Gong et al. (2012)]) to augment the data and help find the correlations between domains.

    3. Manifold alignment (without correspondence) [Cui et al. (2014a)]
      Align the manifolds defined by the source and target datasets without correspondence information, so that the two manifolds are aligned geometrically:

      (9)

      where the adjacency matrices of the two datasets are full (i.e. every pair of samples is connected), the correspondence matrix and the linear projections of the data from the two datasets need to be learnt, and geometry-preserving terms preserve the manifold structures of the respective domains.
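The subspace alignment idea of Eq. (8) has a closed-form solution, since $M^* = X_s^\top X_t$ minimises the Frobenius objective. A minimal NumPy sketch (PCA bases via SVD; the synthetic data and dimensionalities are illustrative assumptions):

```python
import numpy as np

def pca_basis(X, d):
    """Top-d principal directions (as columns) of centred data X."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T  # shape: (features, d)

def subspace_alignment(Xs_data, Xt_data, d):
    """Align the source PCA subspace to the target one.

    The closed-form minimiser of ||Ps M - Pt||_F^2 over M is Ps^T Pt,
    so the aligned source basis is Ps Ps^T Pt."""
    Ps, Pt = pca_basis(Xs_data, d), pca_basis(Xt_data, d)
    M = Ps.T @ Pt
    return Ps @ M, Pt  # aligned source basis, target basis

rng = np.random.default_rng(1)
Xs_data = rng.normal(size=(100, 10))
Xt_data = Xs_data @ rng.normal(size=(10, 10)) * 0.1 + rng.normal(size=(100, 10))
Pa, Pt = subspace_alignment(Xs_data, Xt_data, d=3)
# After alignment, the source basis is at least as close to the target
# basis (in Frobenius norm) as the unaligned one, since M = I is feasible.
```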

  • Higher-level Representations: aims at finding higher-level representations that are representative, compact, or invariant between datasets. This criterion assumes neither the existence of labelled data nor of a correspondence set; rather, it assumes that domain-invariant higher-level representations exist between the datasets. This criterion is generally used for feature representation based transfer.

    Note that the higher-level representation criterion is commonly used together with other criteria for better transfer, but it is also used independently without any mechanism to reduce the domain divergence explicitly.

    1. Sparse coding [Raina et al. (2007)]
      The dictionary is learnt from the source data, and the learnt dictionary is then applied to the target data to obtain the sparse codes of the target data.

      Step 1: $\min_{B, a} \sum_i \| x_i^s - B a_i \|_2^2 + \beta \| a_i \|_1$   (10)
      Step 2: $\hat{a}_j^t = \arg\min_{a} \| x_j^t - B a \|_2^2 + \beta \| a \|_1$   (11)

      where $B$ is the dictionary learnt on the source data, and $a_i$ (resp. $\hat{a}_j^t$) are the sparse codes.

    2. Low-rank representation [Shao et al. (2012)]
      Find a subspace where each datum in the target domain can be linearly represented by the corresponding subspace in the source domain:

      $\min_{P, Z} \ \mathrm{rank}(Z) \quad \mathrm{s.t.} \quad P X_t = P X_s Z$   (12)

      Because $Z$ is constrained to be low rank, each target sample is linearly represented by some subspace from a subspace union in the source domain. Hence, the structure information in the source and target domains is considered.

    3. Deep Neural Network Representations [Donahue et al. (2014)]
      Use deep neural networks to learn more transferable features by disentangling the explanatory factors of variations underlying the data samples, and grouping deep features hierarchically according to their relatedness to invariant factors.

    4. Stacked Denoising Auto-encoders (SDAs) [Glorot et al. (2011), Chen et al. (2012)]
      Train the SDAs [Vincent et al. (2010)] to reconstruct the data from all the domains. It has been shown that SDAs can disentangle hidden factors which explain the variations in the input data and are invariant to different domains [Glorot et al. (2011), Chen et al. (2012)]. The key idea is to map the corrupted input $\tilde{x}$ to a hidden representation, and then decode the hidden representation back to a reconstructed vector in the input space:

      encoder: $h = \sigma(W \tilde{x} + b)$   (13)
      decoder: $\hat{x} = \sigma(W' h + b')$   (14)

      where $\tilde{x}$ is the corrupted version of the data $x$ from all the domains, $(W, b, W', b')$ are the parameters of the auto-encoder, $\sigma$ is the nonlinear squashing function, $h$ is the hidden representation of the input data, and $\hat{x}$ is the reconstructed data, which is supposed to be as similar as possible to the uncorrupted input $x$. Then several layers of denoising auto-encoders can be learnt and stacked to constitute the stacked denoising auto-encoders (SDAs) to build deep architectures.

    5. Attribute space representation [Lampert et al. (2009)]
      Use a human-specified high-level description (attributes) of the target objects instead of training images to detect objects in an image. Assume $a^c = (a_1^c, \ldots, a_m^c)$ is the attribute representation for class $c$, with $m$ attributes over all the classes. Probabilistic classifiers $p(a_k \,|\, x)$ are learnt first for each attribute from the source dataset. In the test stage, assume each target class $c$ is determined by its attribute vector, i.e. $p(a \,|\, c) = [\![ a = a^c ]\!]$, and use Bayes’ rule to obtain $p(c \,|\, a)$. Hence, the posterior of a target class given an image $x$ can be calculated as follows:

      $p(c \,|\, x) = \frac{p(c)}{p(a^c)} \prod_{k=1}^{m} p(a_k^c \,|\, x)$   (15)

      Since the attributes are assigned on a per-class basis instead of a per-image basis, the manual effort to add a new object class is much smaller than labelling a huge amount of training data.
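The denoising auto-encoder idea behind the SDA can be illustrated with a single tied-weight layer. The sketch below is a minimal NumPy version under stated assumptions (masking noise, sigmoid activations, plain gradient descent, synthetic data); it is not the surveyed implementations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae(X, hidden, noise=0.3, lr=0.5, epochs=200, seed=0):
    """One-layer denoising auto-encoder with tied weights (W' = W^T):
    corrupt the input with masking noise, encode h = sigmoid(W x~ + b),
    decode x^ = sigmoid(W^T h + c), and minimise the squared error
    against the *clean* input."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, 0.1, (hidden, d))
    b, c = np.zeros(hidden), np.zeros(d)
    losses = []
    for _ in range(epochs):
        Xn = X * (rng.random(X.shape) > noise)   # masking corruption
        H = sigmoid(Xn @ W.T + b)                # encoder
        R = sigmoid(H @ W + c)                   # decoder (tied weights)
        dR = (R - X) * R * (1 - R)               # backprop through decoder
        dH = (dR @ W.T) * H * (1 - H)            # backprop through encoder
        W -= lr / n * (H.T @ dR + dH.T @ Xn)
        b -= lr / n * dH.sum(0)
        c -= lr / n * dR.sum(0)
        losses.append(np.mean((R - X) ** 2))
    return W, b, c, losses

rng = np.random.default_rng(4)
# Pooled "source + target" data with correlated (duplicated) features in [0, 1].
X = np.repeat(rng.random((200, 4)), 2, axis=1)
W, b, c, losses = train_dae(X, hidden=4)
# Reconstruction error falls as the hidden code captures the shared structure.
```

Stacking several such layers (each trained on the hidden codes of the previous one) gives the deep SDA architecture described above.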

  • Correspondence Criterion: uses the paired correspondence samples from different domains to construct the relationship between domains. Obviously, a set of corresponding samples (i.e. the same object captured from different view angles, or by different sensors) are required for correspondence criterion. This criterion is commonly used for feature representation-based transfer. The typical methods are as follows.

    1. Sparse coding with correspondence [Zheng et al. (2012)]
      The corresponding samples between domains are forced to share the same sparse codes:

      $\min_{D_s, D_t, \alpha} \sum_i \| x_i^s - D_s \alpha_i \|_2^2 + \| x_i^t - D_t \alpha_i \|_2^2 + \lambda \| \alpha_i \|_1$   (16)

      where $D_s$ and $D_t$ are the dictionaries, and the $\alpha_i$ are the shared sparse codes.

    2. Manifold alignment (with correspondence) [Zhai et al. (2010)]
      Given a set of correspondence samples $C$ between domains, learn mapping matrices $P_s$ and $P_t$ for the source and target sets respectively to preserve the correspondence relationships after mapping:

      $\min_{P_s, P_t} \sum_{(i,j) \in C} \| P_s x_i^s - P_t x_j^t \|^2 + \lambda_s \Omega_s(P_s) + \lambda_t \Omega_t(P_t)$   (17)

      where $\Omega_s(P_s)$ and $\Omega_t(P_t)$ are the manifold regularization terms which are used to preserve the intrinsic manifold structures of the source and target domains.
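To make the correspondence criterion concrete, consider a deliberately simplified, one-sided variant of correspondence-based alignment: the target projection is fixed to the identity, and ridge regularisation stands in for the manifold-preservation terms. The paired samples then determine a linear map by regularised least squares (the data and map below are synthetic assumptions):

```python
import numpy as np

def fit_correspondence_map(Xs_pairs, Xt_pairs, reg=1e-3):
    """Ridge-regularised least-squares map sending each source sample to
    its paired target sample: argmin_P ||Xs P - Xt||_F^2 + reg ||P||_F^2."""
    d = Xs_pairs.shape[1]
    A = Xs_pairs.T @ Xs_pairs + reg * np.eye(d)
    B = Xs_pairs.T @ Xt_pairs
    return np.linalg.solve(A, B)  # (source dims, target dims)

rng = np.random.default_rng(2)
true_map = rng.normal(size=(5, 5))
Xs_pairs = rng.normal(size=(300, 5))
Xt_pairs = Xs_pairs @ true_map + 0.01 * rng.normal(size=(300, 5))
P = fit_correspondence_map(Xs_pairs, Xt_pairs)
# P recovers the underlying correspondence map up to the noise level.
```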

  • Class-based Criterion: uses the class label information as guidance for connecting different datasets. Hence, labelled samples from each dataset are assumed to be available, whether sufficient or not. This criterion has been used for feature representation-based transfer [Daumé III (2007), Saenko et al. (2010)], classifier-based transfer [Yang et al. (2007), Ma et al. (2014)], and hybrid knowledge transfer (i.e. joint feature-based and classifier-based transfer [Hoffman et al. (2013)]). Below are the commonly used methods that use this supervision criterion for transferring source knowledge to the target dataset.

    1. Feature augmentation [Daumé III (2007)]
      Each feature is augmented into three versions: a general version, a source-specific version and a target-specific version. The augmented source data contain only the general and source-specific versions, while the augmented target data contain the general and target-specific versions; the remaining dimensions are filled with zeros:

      $\Phi^s(x) = \langle x, x, \mathbf{0} \rangle, \quad \Phi^t(x) = \langle x, \mathbf{0}, x \rangle$   (18)

      where the first block is the general version, the second is the source-specific version, and the third is the target-specific version.

    2. Metric learning [Saenko et al. (2010)]
      Learn a metric such that samples from different domains with the same label are close, while samples from different domains with different labels are far apart:

      $d_W(x_i^s, x_j^t) \le u \ \ \mathrm{if}\ y_i = y_j, \qquad d_W(x_i^s, x_j^t) \ge \ell \ \ \mathrm{if}\ y_i \ne y_j$   (19)

      where $u$ and $\ell$ are threshold parameters, $d_W(x_i^s, x_j^t)$ is the pairwise distance parameterized by $W$ for some distance function, and $W$ is the matrix that needs to be learnt.

    3. Linear Discriminative Model [Yang et al. (2007)]
      Use the source classifier parameters to regularize the target classifier parameters:

      $\min_{f_t} \ \sum_i \mathcal{L}\big(f_t(x_i^t), y_i^t\big) + \lambda \, \Omega(f_t, f_s)$   (20)

      where $\mathcal{L}$ denotes some loss function over the data and labels, and $\Omega$ denotes some regularization between the source and target classifiers.

    4. Bayesian Model [Fei-Fei et al. (2006)]

      Present the source knowledge as a prior probability in the space of target model parameters. Specifically, the “general knowledge” from source domain categories are extracted and then represented in the form of a prior probability density function in the space of model parameters. Given a small training set in the target domain, this knowledge can be updated and produce a posterior density.
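As a concrete instance of the class-based criterion, the feature augmentation scheme described above reduces to a few lines of array manipulation. A minimal NumPy sketch (the toy inputs are illustrative, not from any surveyed experiment):

```python
import numpy as np

def augment(X, domain):
    """'Frustratingly easy' feature augmentation: each row becomes
    <general, source-specific, target-specific>, i.e. source rows map
    to <x, x, 0> and target rows to <x, 0, x>."""
    zeros = np.zeros_like(X)
    if domain == "source":
        return np.hstack([X, X, zeros])
    return np.hstack([X, zeros, X])

Xs = np.ones((2, 3))          # two toy source samples
Xt = 2 * np.ones((2, 3))      # two toy target samples
As = augment(Xs, "source")    # rows follow the <x, x, 0> layout
At = augment(Xt, "target")    # rows follow the <x, 0, x> layout
```

Any standard classifier trained on the stacked augmented data can then trade off shared and domain-specific weights through the duplicated blocks.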

  • Self Labelling: uses the source domain samples to train an initial model, which is used to obtain pseudo labels for the target domain samples. Then the target samples are incorporated to retrain the model, and the procedure is carried out iteratively until convergence. Self labelling has been used for classifier-based transfer [Dai et al. (2007a), Tan et al. (2009)] and hybrid knowledge transfer [Dai et al. (2007b), Chen et al. (2011)].

    1. Self-training [Dai et al. (2007b), Tan et al. (2009)]
      Initialize the target model parameters using the source data (or re-weighted source data), i.e. $\theta_t^{(0)} = \theta_s$. The pseudo labels of the target samples can be obtained using the initial model; an EM-style procedure then iteratively refines the target model by incorporating some mechanisms (e.g. assigning small weights to source samples that are dissimilar to the target samples):

      E-step: $\hat{y}_j^t = \arg\max_{y} P\big(y \,|\, x_j^t; \theta^{(k)}\big)$   (21)
      M-step: $\theta^{(k+1)} = \arg\max_{\theta} \sum_j v_j \log P\big(\hat{y}_j^t, x_j^t; \theta\big) + \sum_i w_i \log P\big(y_i^s, x_i^s; \theta\big)$   (22)

      where $v_j$ and $w_i$ are weights assigned to each target and source sample using some mechanism.
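The self-labelling loop can be sketched end to end with a deliberately simple model. Below, a nearest-centroid classifier stands in for "the model" (an assumption for illustration; surveyed methods use richer models and instance weighting), and the shifted Gaussian data are synthetic:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Class centroids as a minimal stand-in for a trainable model."""
    return {c: X[y == c].mean(0) for c in np.unique(y)}

def nearest_centroid_predict(model, X):
    classes = sorted(model)
    D = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[D.argmin(0)]

def self_train(Xs, ys, Xt, rounds=5):
    """Initialise on source labels, pseudo-label the target, retrain on
    the union, and iterate -- the self-labelling loop described above."""
    model = nearest_centroid_fit(Xs, ys)
    for _ in range(rounds):
        yt_pseudo = nearest_centroid_predict(model, Xt)
        model = nearest_centroid_fit(np.vstack([Xs, Xt]),
                                     np.concatenate([ys, yt_pseudo]))
    return model

rng = np.random.default_rng(3)
# Source classes around (-2,-2) and (2,2); the target is shifted by +1.
Xs = np.vstack([rng.normal(-2, .5, (50, 2)), rng.normal(2, .5, (50, 2))])
ys = np.array([0] * 50 + [1] * 50)
Xt = Xs + 1.0
model = self_train(Xs, ys, Xt)
acc = (nearest_centroid_predict(model, Xt) == ys).mean()
# Retraining pulls the centroids toward the shifted target clusters.
```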

  • Hybrid Criterion: combines two or more of the above criteria for better transfer of knowledge. The combination of criteria is generally used for feature representation transfer, classifier transfer, and hybrid knowledge transfer. Several example combinations are:

    1. Correspondence + Higher-level representation [Huang and Wang (2013)]

    2. Higher-level representation + Statistic [Long et al. (2013), Long and Wang (2015), Wei et al. (2016)]

    3. Statistic + Geometric [Zhang et al. (2017)]

    4. Statistic + Self labelling [Dai et al. (2007a)]

    5. Correspondence + Class-based [Diethe et al. (2008)]

    6. Statistic + Class-based [Duan et al. (2012a)]

    7. Higher-level representation + Class-based [Zhu and Shao (2014)]

Based on the summary of the different transferring criteria and the categories of “what to transfer”, we discuss below what kinds of methods can be used for boosting the current task in the four different scenarios (homogeneous feature spaces and label spaces, heterogeneous feature spaces, heterogeneous label spaces, and heterogeneous feature spaces and label spaces) defined in Section 2.1, as well as the corresponding problems defined by the other dataset properties. An overview of the taxonomy can be found in Figure 1. In each problem, the related work is reviewed according to the transferring criteria: statistic criterion, geometric criterion, higher-level representation criterion, correspondence criterion, class-based criterion, self labelling, and hybrid criterion. The methods under each criterion are then further subdivided by “what to transfer”: instance-based transfer, feature representation-based transfer, classifier-based transfer, and hybrid knowledge transfer.

Figure 1: A taxonomy of cross-dataset recognition problems

3 Homogeneous Feature Spaces and Label Spaces

In this scenario, the feature spaces as well as the label spaces are identical between the source and target datasets. Hence, the source and target datasets generally differ in terms of data distributions (i.e. $P_s(X) \neq P_t(X)$).

3.1 Labelled Target Dataset

In this problem, a small number of labelled data in target domain are available. However, the labelled data in target domain are generally not sufficient for classification tasks. The characteristics of this problem can be found in Table 1, which corresponds to supervised domain adaptation among the literature.

Attributes Source dataset Target dataset
Data Feature space Identical between two sets Identical between two sets
Availability Sufficient Insufficient
Balanced Yes Yes
Sequential No No
Label Availability Labelled Labelled
Label space Identical between two sets Identical between two sets
Table 1: Characteristics of problem 3.1

3.1.1 Class-based Criterion

Feature Representations-based Transfer

The class-based criterion can be used to guide new feature representations that reduce the domain shift. For example, DaumeIII2007 propose a feature augmentation based method where each feature is replicated into a high-dimensional feature space. Specifically, the augmented source data contain the general and source-specific versions, while the augmented target data contain the general and target-specific versions. Supervised metric learning methods can also be used. Zhang2010 propose a supervised transfer metric learning (TML) method using the idea of multi-task metric learning. Perrot2015 propose metric hypothesis transfer learning, where a biased regularization is introduced to learn the target metric by regularizing the difference between the target metric and the source metric. They also provide a theoretical analysis of supervised regularized metric learning approaches.

Some methods assume that samples from only a subset of categories are available in the target training set. The adapted features are then generalized to unseen categories (unseen in the target dataset, but seen in the source dataset). The reason why the methods under this setting are not discussed under the problem of heterogeneous label spaces is that they still assume the same label spaces between domains, but some of the categories are not presented in the target training set. Generally, these methods assume the shift between domains is category-independent. For example, Saenko2010 propose a supervised metric learning method to learn a transformation that minimizes the effect of domain-induced changes in the feature distribution, using target labelled training data from a subset of categories. The transformation is then applied to unseen target test data that may come from categories different from those in the target training data.

Classifier-based Transfer

The class-based criterion has been used by classifier-based methods that transfer the parameters of discriminative classifiers across datasets. Yang2007 propose Adaptive Support Vector Machines (A-SVMs) to adapt one or more source classifiers to the target domain by learning a new decision boundary that is close to the original decision boundary while separating the target data. Similarly, Jiang2008 propose the cross-domain SVM (CD-SVM), which also preserves the discriminant property of the new classifier over the source data, enforced on the important source samples whose distribution is similar to the target data. Xu2014b introduce the adaptive structural SVM (A-SSVM) and structure-aware A-SSVM (SA-SSVM) to adapt classifier parameters between domains by taking the structure knowledge in the feature space into account.

3.1.2 Self Labelling

Hybrid Knowledge Transfer

The instance weights and classifier parameters are jointly exploited by [Dai et al. (2007b)]. They propose a transfer learning framework called TrAdaBoost, which extends boosting-based learning algorithms with a mechanism that decreases the weights of the instances most dissimilar to the target distribution in order to weaken their impact. The labelled target samples are used to help vote on the usefulness of each source-domain instance.
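A single round of the TrAdaBoost weight update can be sketched as follows. Variable names are ours and the constants follow the update rules reported by Dai et al.; this is an illustrative fragment, not the full boosting loop:

```python
import numpy as np

def tradaboost_update(w_src, w_tgt, err_src, err_tgt, eps_t, n_rounds):
    """One round of TrAdaBoost weight updates (sketch).

    err_src / err_tgt are 0/1 arrays marking misclassified instances;
    eps_t is the weighted error rate on the target data. Misclassified
    *source* instances are down-weighted (they look unlike the target
    distribution); misclassified *target* instances are up-weighted as
    in standard AdaBoost.
    """
    n = len(w_src)
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / n_rounds))
    beta_tgt = eps_t / (1.0 - eps_t)
    w_src = w_src * beta_src ** err_src     # shrink bad source weights
    w_tgt = w_tgt * beta_tgt ** (-err_tgt)  # grow bad target weights
    return w_src, w_tgt
```

After each round the weights are renormalized and a new weak learner is trained on the combined, re-weighted sample.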

3.1.3 Hybrid Criterion

Feature Representation-based Transfer

The higher-level representation criterion and the class-based criterion can be used together for better cross-dataset representation. For example, Zhu2013,Zhu2014 match source and target distributions via a cross-domain discriminative dictionary learning method, where a reconstructive, discriminative and domain-adaptive dictionary pair is learned such that instances of the same class from different domains have similar sparse code representations. Shekhar2013 also propose a dictionary learning based method. They jointly learn projections that map the data of the two domains into a low-dimensional subspace, together with a common discriminative dictionary that represents the data of both domains in the projected subspace.

Besides discriminative dictionary learning, label information can also guide deep neural networks to reduce domain shift. Koniusz2017 utilize two CNN streams, the source and target networks, fused at the classifier level. Features from the fully connected layers (fc7) of each network are used to compute second- or even higher-order scatter tensors, one per network stream per class. The scatters of the two network streams of the same class (within-class scatters) are then aligned while good separation of the between-class scatters is maintained. Hence, in addition to the higher-level representation criterion and the class-based criterion, the statistic criterion is also used in [Koniusz et al. (2017)].

3.2 Labelled Plus Unlabelled Target Dataset

Compared to the scenario where only limited labelled target data are presented, abundant unlabelled target domain data are additionally available in the training stage of this problem, which allows algorithms to learn the structure of the target domain. This setting is realistic in real world applications, where unlabelled data are much easier to obtain than labelled data. The characteristics of problem 3.2 are shown in Table 2; this setting is known in the literature as semi-supervised domain adaptation.

Attributes Source dataset Target data
Data Feature space Identical between two sets Identical between two sets
Availability Sufficient Insufficient labelled + sufficient unlabelled
Balanced Yes Yes
Sequential No No
Label Availability Labelled Labelled+Unlabelled
Label space Identical between two sets Identical between two sets
Table 2: Characteristics of problem 3.2

3.2.1 Statistic Criterion

Hybrid Knowledge Transfer

Zhong2009 propose an adaptive kernel approach that maps the marginal distributions of the target and source domain data into a common kernel space, and utilize a sample selection strategy to draw the conditional probabilities of the two domains closer; the statistic criterion is thus used for both feature transformation and instance re-weighting.

3.2.2 Correspondence Criterion

Feature Representation-based Transfer

Zhai2010 assume that, in addition to a set of labelled correspondence pairs between the source and target datasets, some unlabelled samples from both datasets are also available. They propose a manifold alignment method to learn explicit corresponding mappings from the different manifolds to underlying common embeddings, where the common embeddings should be consistent with the labelled corresponding pairs and should also preserve the local geometric structures of the respective datasets.

3.2.3 Class-based Criterion

Feature Representation-based Transfer

Some studies extend distance-based classifiers, such as the k-nearest neighbour (k-NN) and nearest class mean (NCM) classifiers, from a metric learning perspective to address domain adaptation. Tommasi2013 present an NBNN-based domain adaptation algorithm that iteratively learns a class-specific metric while inducing a large margin separation among classes for each sample. Similarly, Csurka2014 extend the Nearest Class Mean (NCM) classifier by introducing domain-dependent mean parameters for each class as well as domain-specific weights. They then propose a generic adaptive semi-supervised metric learning technique that iteratively curates the training set.

Daume2010 extend the fully supervised EASYADAPT method [Daumé III (2007)] to the semi-supervised setting, where the unlabelled data are utilized to co-regularize the learned hypotheses by making them agree on unlabelled samples.

Classifier-based Transfer

Duan2012c extend SVM-based classifier transfer methods with unlabelled target data. They propose a domain-dependent regularizer which enforces that the learned target classifiers and the pre-learned source classifiers have similar decision values on the unlabelled target instances.

Hybrid Knowledge Transfer

There is also a group of multiple kernel learning (MKL) based transfer methods [Duan et al. (2012d), Duan et al. (2012a)], which learn a new feature representation that simultaneously reduces the distribution shift and optimizes the target classifier, using both the statistic criterion and the class-based criterion. Duan2012a propose an Adaptive Multiple Kernel Learning (A-MKL) cross-domain learning method. A-MKL learns a kernel function and a classifier by minimizing both the structural risk functional and the distribution mismatch between domains using the MMD criterion. In addition to A-MKL, Duan2012 propose Domain Transfer Multiple Kernel Learning (DTMKL), where additional regularization terms enforce that the decision values of the target classifier and the existing base classifiers are similar on the unlabelled target patterns, and that the decision values of the target classifier on the labelled target data are close to the true labels.

3.2.4 Self Labelling

Hybrid Knowledge Transfer

The instance weights and feature representation can be exploited simultaneously in the self-labelling framework. For example, Chen2011 propose a co-training method in a self-training framework, which bridges the gap between the source and target domains by slowly adding to the training set both the target features and the instances on which the current algorithm is most confident. A subset of shared source and target features is then selected based on their compatibility. Lastly, to better exploit the domain-specific features of the unlabelled target domain, a pseudo multi-view co-training method is used to add further features to the training set.

3.2.5 Hybrid Criterion

Feature Representation-based Transfer

A large group of feature representation-based methods for this problem combines the statistic criterion and the class-based criterion. Pan2011 extend the unsupervised TCA method [Pan et al. (2009)] to a semi-supervised version by maximizing the label dependence using labelled target data while minimizing the MMD distance using unlabelled target data. Yao2015 propose a general subspace learning method for semi-supervised domain adaptation. They explore the low-dimensional structures across domains, such that the empirical risk of labelled samples from both domains is minimized, the distance between samples of the same category from different domains is small, and the outputs of the predictive function are restricted to similar values for similar examples. Quanz2012 modify the idea of sparse coding, focusing on identifying shared clusters between datasets whose distributions may differ, by incorporating distribution distance estimates for the embedded data. Kernel density estimation is used for estimating the distributions, and a symmetric version of the common KL-divergence measure, the Jensen-Shannon divergence, is used for comparing them. To incorporate target label information, the class-based distribution distance is estimated for each class. Tzeng2015 propose a CNN architecture that simultaneously learns a domain invariant representation using a domain confusion loss over all source and target data, and transfers the learned source semantic structure to the target domain via soft labels (defined as the average over the softmax of all activations of source examples in each category). Similar to the setting of [Saenko et al. (2010)], labelled target data from only a subset of the categories of interest are available in the training stage.

Recently, Wu2016 introduce a constrained deep transfer feature learning method that performs transfer learning and feature learning simultaneously. Specifically, paired source and target data are used to capture the joint distribution of the two domains as a bridge between them. A large amount of additional unpaired source data is then transferred to the target domain as pseudo data for further target-domain feature learning. Hence, the correspondence criterion and the higher-level representation criterion are used.

Hybrid Knowledge Transfer

Yamada2014 generalize the EASYADAPT method [Daumé III (2007)] to the semi-supervised setting. They propose to project the input features into a higher dimensional space and to estimate weights for the training samples based on the ratio of the test and training marginal distributions in that space, using unlabelled target samples. Hence, the class-based criterion and the statistic criterion are used for both feature representation and instance re-weighting.

3.3 Unlabelled Target Dataset

In this problem, no labelled target domain data are available, but sufficient unlabelled target domain data are observable in the training stage. Table 3 gives the characteristics of problem 3.3, where the source domain has sufficient labelled data while the target domain has sufficient unlabelled data. This problem is also named unsupervised domain adaptation. It has attracted increasing attention in recent years, as it is both more challenging and more realistic.

Attributes Source dataset Target data
Data Feature space Identical between two sets Identical between two sets
Availability Sufficient Sufficient
Balanced Yes Yes
Sequential No No
Label Availability Labelled Unlabelled
Label space Identical between two sets Identical between two sets
Table 3: Characteristics of problem 3.3

3.3.1 Statistic Criterion

Instance-based Transfer

Huang2006 present the kernel mean matching (KMM) method, which directly produces resampling weights using the MMD criterion, without requiring explicit distribution estimation. Sun2011 propose a two-stage domain adaptation methodology that computes the weights of multiple sources from their marginal probability differences (first stage), using the MMD criterion, as well as their conditional probability differences (second stage). Gong2013 select a subset of the source data by choosing the samples that are distributed most similarly to the target, again using the MMD criterion; a balancing constraint is added to balance the number of selected landmarks in each class. In contrast, Sugiyama2008 propose to estimate the sample importance using the Kullback-Leibler divergence, without explicit density estimation.
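To illustrate the idea behind KMM, the following sketch solves the same mean-matching objective with a linear kernel by projected gradient descent, instead of the quadratic program with an RBF kernel used by Huang2006 (a simplification for illustration; the function name and optimizer are ours):

```python
import numpy as np

def kmm_weights_linear(Xs, Xt, n_iter=2000, lr=0.01):
    """Kernel mean matching with a linear kernel (sketch).

    Finds non-negative source weights w so that the weighted source mean
    matches the target mean in feature space, by projected gradient
    descent on 0.5 * || (1/ns) * sum_i w_i x_i - mean(Xt) ||^2.
    """
    ns = len(Xs)
    w = np.ones(ns)
    mu_t = Xt.mean(axis=0)
    for _ in range(n_iter):
        diff = (w @ Xs) / ns - mu_t        # weighted source mean - target mean
        grad = Xs @ diff / ns              # gradient of the objective w.r.t. w
        w = np.maximum(w - lr * grad, 0.0) # project onto w >= 0
    return w
```

Source samples that resemble the target distribution receive larger weights; the full KMM additionally bounds the weights and constrains their mean to one.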

Rather than assuming a single domain per dataset, some methods assume a dataset may contain several distinctive sub-domains due to the large variations in visual data. For example, Gong2013a automatically discover latent domains from multiple sources. The latent domains are characterised by their large inter-domain variations and can yield strong discriminative models for new test data.

Feature Representation-based Transfer

The first line of feature representation-based methods uses the MMD criterion as a distance measure between two distributions. Pan2008 propose a kernel learning method that minimizes the distance between the distributions of the data. Building on [Pan et al. (2008)], Chen2009 minimize the distribution shift while at the same time minimizing the empirical loss on the labelled source data. Pan2009 then propose transfer component analysis (TCA) for domain adaptation, extending the kernel learning of [Pan et al. (2008)] to learn a parametric kernel map, which can deal with the out-of-sample problem and solves the objective function more efficiently. Baktashmotlagh2013 compare the distributions of the transformed data in an RKHS rather than in a lower-dimensional space as in TCA. Rather than merely minimizing the marginal distribution discrepancy, Long2013 adapt the joint distribution in a principled dimensionality reduction procedure by iteratively estimating pseudo-labels of the target data for adapting the conditional distribution. Geng2011 propose domain adaptation metric learning (DAML), introducing an MMD regularization between domains into the conventional metric learning framework.
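The MMD distance underlying these methods is straightforward to compute; a biased empirical estimate with an RBF kernel can be sketched as:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Pairwise RBF kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(Xs, Xt, gamma=1.0):
    """Squared Maximum Mean Discrepancy (biased estimate) between two
    samples: the distance between their kernel mean embeddings."""
    Kss = rbf_kernel(Xs, Xs, gamma)
    Ktt = rbf_kernel(Xt, Xt, gamma)
    Kst = rbf_kernel(Xs, Xt, gamma)
    return Kss.mean() + Ktt.mean() - 2.0 * Kst.mean()
```

The methods above minimize this quantity (or add it as a regularizer) while learning the feature transformation.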

Instead of using MMD to compare two distributions, other statistic criteria, such as the Hellinger distance on statistical manifolds, quadratic divergence, and mutual information, have also been used in this setting [Baktashmotlagh et al. (2014), Si et al. (2010), Jhuo et al. (2012), Shi and Sha (2012)]. Baktashmotlagh2014 argue that the MMD criterion does not exploit the fact that probability distributions lie on a Riemannian manifold. They propose to make better use of the structure of the manifold and rely on the Hellinger distance on the manifold to compare the source and target distributions. In the work of Si2010, a quadratic divergence-based regularization measures the difference between distributions; a family of subspace learning algorithms can thus be used to reduce the domain shift by adding the Bregman divergence-based regularization. Shi2012 reduce the distribution divergence from an information-theoretic perspective: the joint distribution divergence is reduced by minimizing the mutual information between domains while maximizing the mutual information between the data and the estimated labels, based on a discriminative clustering assumption. Sun2016 propose CORrelation ALignment (CORAL) to minimize domain shift by aligning the second-order statistics of the two domains.
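CORAL in particular has a simple closed form: whiten the source features, then re-colour them with the target covariance. A minimal sketch (which, unlike the original formulation, also matches the feature means):

```python
import numpy as np

def coral(Xs, Xt, eps=1e-6):
    """CORrelation ALignment (Sun et al., 2016), sketched: transform the
    source features so that their covariance matches the target covariance."""
    def mat_power(C, power):
        # symmetric matrix power via eigendecomposition
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** power) @ vecs.T

    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    Xs_c = Xs - Xs.mean(0)
    # whiten with Cs^{-1/2}, re-colour with Ct^{1/2}, shift to target mean
    return Xs_c @ mat_power(Cs, -0.5) @ mat_power(Ct, 0.5) + Xt.mean(0)
```

Because the transformation is linear and closed-form, CORAL needs no labels and no iterative optimization.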

Different from the previous global transformations, Optimal Transport [Courty et al. (2016)] defines a local transformation for each sample in the source domain. The domain adaptation problem is seen as a graph matching problem, where each source sample is mapped onto the target samples under a marginal distribution preservation constraint.

Classifier-based Transfer

Rather than reducing the distribution shift in the feature space, some methods use the MMD criterion to regularize a classifier trained on the source domain using unlabelled target data. For instance, Quanz2009 incorporate the MMD criterion as a constraint into the SVM paradigm, achieving a trade-off between a large-margin class separation in the source domain and the minimization of the marginal distribution discrepancy between domains, as projected along the linear classifier. Similarly, Long2014a extend Quanz's method to reduce both the marginal and conditional distribution discrepancies between domains in a domain invariant classifier learning framework.

Hybrid Knowledge Transfer

Some methods jointly re-weight instances and map features to a common space [Long et al. (2014), Hsu et al. (2015)]. Long2014 reduce the domain divergence by jointly matching the features and re-weighting the source domain instances in a low-dimensional space using MMD. Other methods assume that the target domain contains a subset of the categories presented in the source dataset. For example, Hsu2015 propose Closest Common Space Learning (CCSL) for associating data whose label sets differ across domains or that are collected from multiple datasets, with the capability of preserving label and structural information within and across domains.

3.3.2 Geometric Criterion

Feature Representation-based Transfer

Gopalan2011 propose a Sampling Geodesic Flow (SGF) method that creates intermediate representations of data between the two domains. The generative subspaces created from these intermediate domains are viewed as points on the Grassmann manifold, and points along the geodesic between them are sampled to obtain subspaces for adapting between domains. However, the number of intermediate points is a hard-to-determine hyper-parameter. Gong2012 tackle this problem by extending the idea of [Gopalan et al. (2011)] to a geodesic flow kernel (GFK) that integrates an infinite number of subspaces characterizing the changes from the source to the target domain. The methods in [Gopalan et al. (2011), Gopalan et al. (2014)] and [Gong et al. (2012), Gong et al. (2014)] opened up research on constructing intermediate domains to bridge the mismatch. For example, Caseiro2015 extend the SGF method [Gopalan et al. (2011)] to multiple source datasets, using smooth polynomial functions described by splines on the manifold to interpolate between all the source domains and the target domain. Zhang2013b connect the source and target domains by interpolating virtual views along a virtual path for cross-view action recognition. Rather than manipulating subspaces, Cui2014 characterize the samples of each domain by one covariance matrix and interpolate intermediate points (i.e., covariance matrices) to bridge the two domains. Other methods [Ni et al. (2013), Xu et al. (2015)] generate a set of intermediate dictionaries by learning domain-adaptive dictionaries between the domains.

Instead of modelling intermediate domains, some methods align the two domains directly [Fernando et al. (2013), Cui et al. (2014a)]. For instance, Fernando2013 learn a mapping function that aligns the source subspace with the target one directly. Cui2014a propose to align the manifolds defined by the source and target datasets by integrating geometry structure matching, feature matching and geometry preservation.
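The subspace alignment of Fernando2013 has a closed-form solution, M = Ps^T Pt, where Ps and Pt are the top-d PCA bases of the two domains. A minimal sketch (function names ours):

```python
import numpy as np

def pca_basis(X, d):
    """Top-d principal directions of X, as columns of a (dim, d) matrix."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T

def subspace_alignment(Xs, Xt, d=2):
    """Subspace Alignment (Fernando et al., 2013): align the source PCA
    subspace with the target one via M = Ps^T Pt, then project."""
    Ps, Pt = pca_basis(Xs, d), pca_basis(Xt, d)
    M = Ps.T @ Pt                        # closed-form alignment matrix
    Zs = (Xs - Xs.mean(0)) @ Ps @ M      # aligned source coordinates
    Zt = (Xt - Xt.mean(0)) @ Pt          # target coordinates
    return Zs, Zt
```

A classifier trained on Zs can then be applied directly to Zt, since both now live in the target subspace coordinates.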

Most existing methods assume a global shift and ignore the individual class differences across domains. Lin2016 take the individual class differences into consideration and generate one joint subspace for each class independently. Labels are then assigned to anchor subspaces (formed by unlabelled samples that are close to each other) in the target domain to guide the learning of the joint subspace of each class. Specifically, the labels of anchor subspaces are assigned using the principal angles as a measure of similarity between them and the source subspaces.

3.3.3 Higher-level Representation Criterion

Feature Representation-based Transfer

Blitzer2006 propose a structural correspondence learning (SCL) method to identify correspondences among features from different domains. The correlations are modelled with pivot features that behave in the same way for discriminative learning in both domains. The extracted pivot features are used to learn a mapping that maps the original feature of both domains to a shared feature space.

The low-rank criterion has also been used to guide the learning of domain invariant representations [Jhuo et al. (2012), Shao et al. (2014), Ding et al. (2015)]. Jhuo2012 propose to capture the intrinsic relatedness of the source samples using a low-rank structure and meanwhile identify the noise and outlier information using a sparse structure. Specifically, the source samples are transformed to an intermediate representation such that each source sample can be linearly represented by the samples of the target domain. Similarly, Shao2014 propose to project both source and target data to a generalized subspace where each target sample can be represented by a low-rank transformation of source samples. Ding2015a extend the low-rank coding method to a deep low-rank coding method, where multiple layers are stacked to learn multi-level features across the two domains, with each layer constrained by a low-rank coding.

Dictionary learning has also been used to learn domain invariant representations [Peng et al. (2016), Tsai et al. (2016a)]. Peng2016 propose a Multi-task Dictionary Learning (UMDL) model to learn view-invariant and discriminative information for person re-identification. Three types of dictionaries are learnt: a task-shared dictionary that is dataset invariant, a target-specific dictionary that is view-invariant, and a task-specific residual dictionary that encodes the residual parts of the features that cannot be modelled by the source task. Similar to [Hsu et al. (2015)], Tsai2016 also assume that the target domain contains a subset of the categories presented in the source dataset. They derive a domain invariant space for aligning and representing cross-domain data with a locality-constrained sparse coding method.

Bengio2012 argue that deep neural networks can learn more transferable features by disentangling the unknown factors of variation that underlie the training distribution. Donahue2014 propose the deep convolutional representation named DeCAF, where a deep convolutional neural network is pre-trained on a previous large-scale source dataset in a fully supervised fashion. The features (defined by the convolutional network weights learned on the source dataset) are then transferred to the target data.

Deep auto-encoders have also been used for cross-dataset tasks [Glorot et al. (2011), Kan et al. (2015), Chen et al. (2012), Jiang et al. (2016), Ghifary et al. (2016b)]. Glorot2011 propose to use stacked denoising autoencoders (SDAs) to learn more transferable features for domain adaptation. Kan2015 propose a Bi-shifting Auto-Encoder Network (BAE), which has one common encoder and two domain-specific decoders. The proposed BAE can shift source domain samples to the target domain and target domain samples to the source domain, using sparse reconstruction to ensure distribution consistency. Ghifary2016a propose a Deep Reconstruction-Classification Network (DRCN), which jointly learns a shared encoding representation for two tasks: i) supervised classification of labelled source data, and ii) unsupervised reconstruction of unlabelled target data. Chen2012 identify that SDAs are limited by their high computational cost, and hence propose the marginalized SDA (mSDA), adapting the greedy layer-by-layer training of SDAs for domain adaptation.

The CoGAN approach [Liu and Tuzel (2016)] applies Generative Adversarial Networks (GANs) [Goodfellow et al. (2014)] to the domain transfer problem by training two GANs to generate the source and target images respectively, with a weight-sharing constraint that allows CoGAN to learn a joint distribution over the two domains. The intuition is that images from different domains share the same high-level abstraction but have different low-level realizations.

3.3.4 Hybrid Criterion

Feature Representation-based Transfer

Zheng2012 propose to learn two dictionaries simultaneously for pairs of corresponding videos and encourage each video in a pair to have the same sparse representation, using both the correspondence and higher-level representation criteria. Similarly, Huang2013 present a joint model which learns a pair of dictionaries with a feature space for describing and associating cross-domain data. Long2013a incorporate the MMD criterion into the objective function of sparse coding to make the new representations robust to the distribution difference. Sun2015 extend subspace alignment [Fernando et al. (2013)] by aligning the second-order statistics (covariances) as well as the subspace bases. Zhang2017 propose to learn two coupled projections that map the source and target domain data into low-dimensional subspaces where the geometrical shift and distribution shift are reduced simultaneously, using both the geometric and statistic criteria.

As mentioned in Section 3.3.3, deep neural networks can learn more transferable features for domain adaptation [Bengio (2012), Donahue et al. (2014)] by disentangling the explanatory factors of variation underlying the data samples and grouping deep features hierarchically according to their relatedness to invariant factors. However, the features computed in the higher layers of a network depend greatly on the specific dataset and task [Yosinski et al. (2014)]; such task-specific features are not safely transferable to novel tasks. Hence, some recent works impose the statistic criterion on the deep learning framework to further reduce the domain bias. For instance, the MMD criterion is used to reduce the divergence of the marginal distributions [Tzeng et al. (2014), Long and Wang (2015)] or joint distributions [Long et al. (2017)] between domains. Long2016 use MMD to learn transferable features and adaptive classifiers. They relax the shared-classifier assumption made by previous methods and assume that the source and target classifiers differ by a residual function [He et al. (2016)]. Sun2016a extend the CORrelation ALignment (CORAL) method [Sun et al. (2016)] to align the second-order statistics of the source and target distributions within a deep learning framework. Zellinger2017 propose the Central Moment Discrepancy (CMD) method to match the higher-order central moments of the probability distributions by means of order-wise moment differences, which does not require computationally expensive distance and kernel matrix computations.

The statistic criterion has also been incorporated into deep autoencoders. Wei2016 propose to use the MMD distance to quantify the domain divergence and incorporate it into the learning of the transformation in the mSDA. The Domain Separation Networks (DSN) [Bousmalis et al. (2016)] introduce the notion of a private subspace for each domain, which captures domain-specific properties such as background and low-level image statistics. A shared subspace, enforced through the use of autoencoders and explicit loss functions (e.g. MMD or an adversarial loss), captures the features shared between the domains. The model integrates a reconstruction loss using a shared decoder, which learns to reconstruct the input sample from both the private and shared representations.

Hu2015a propose deep transfer metric learning (DTML), a deep learning framework in which the inter-class variations are maximized, the intra-class variations are minimized, and the distribution divergence between the source and target domains at the top layer of the network is minimized, simultaneously. Similar to [Hu et al. (2015)], Ding2017 also propose a metric learning based method. However, their method is developed in a marginalized denoising scheme, and a low-rank constraint is incorporated to guide the cross-domain metric learning by uncovering more common feature structure across the two domains. In addition, both the marginal and conditional distribution differences across the two domains are leveraged.

Inspired by recent adversarial learning [Goodfellow et al. (2014)], adversarial objectives are used to encourage samples from different domains to be non-discriminative with respect to domain labels [Tzeng et al. (2015), Ganin and Lempitsky (2015), Ganin et al. (2016), Tzeng et al. (2017), Bousmalis et al. (2017)]. For example, Tzeng2015 propose adding a domain classifier that predicts the binary domain label of the inputs, together with a domain confusion loss that encourages its predictions to be as close as possible to a uniform distribution over the binary labels. The gradient reversal algorithm (ReverseGrad) proposed by Ganin2015 also treats domain invariance as a binary classification problem, but directly maximizes the loss of the domain classifier by reversing its gradients. Bousmalis2017 propose a GAN-based method that adapts source-domain images at the pixel level so that they appear as if drawn from the target domain. Sankaranarayanan2017 propose an adversarial image generation framework to directly learn a shared feature embedding using labelled source data and unlabelled target data.
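The gradient reversal layer at the heart of ReverseGrad is conceptually tiny: an identity map in the forward pass whose backward pass negates (and scales) the incoming gradient, so the feature extractor is driven to maximize the domain classifier's loss. A framework-free sketch of just this layer (class and method names ours):

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer (after Ganin & Lempitsky, 2015), written as
    an explicit forward/backward pair instead of an autograd hook."""
    def __init__(self, lam=1.0):
        self.lam = lam                      # trade-off weight lambda

    def forward(self, x):
        return x                            # identity in the forward pass

    def backward(self, grad_output):
        return -self.lam * grad_output      # reversed, scaled gradient
```

Placed between the feature extractor and the domain classifier, this single sign flip turns ordinary gradient descent into the adversarial min-max training described above.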

Classifier-based Transfer

Dai2007a use both self labelling and the statistic criterion to transfer a Naive Bayes classifier. They first estimate the initial probabilities under the source domain distribution, and then use an EM algorithm to revise the model for the target distribution. Moreover, the Kullback-Leibler (KL) divergence between the training and test data is used to estimate the trade-off parameters.

Saito2017 propose an asymmetric tri-training method for unsupervised domain adaptation, where pseudo-labels are assigned to unlabelled samples and the deep neural networks are trained as if these were true labels. Hence, both self-labelling and the higher-level representation criterion are used.
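The self-labelling idea can be illustrated with a much-simplified loop that uses a nearest-class-mean rule in place of the deep networks (a conceptual sketch, not the asymmetric tri-training algorithm itself; function name ours):

```python
import numpy as np

def self_label(Xs, ys, Xt, n_iter=5):
    """Minimal self-labelling loop: classify target samples with a
    nearest-class-mean rule, then re-estimate the class means on the
    source data plus the pseudo-labelled target data, and repeat."""
    classes = np.unique(ys)
    means = np.stack([Xs[ys == c].mean(0) for c in classes])
    for _ in range(n_iter):
        d = ((Xt[:, None, :] - means[None]) ** 2).sum(-1)  # (n_t, n_classes)
        pseudo = classes[d.argmin(1)]                      # pseudo-labels
        means = np.stack([
            np.vstack([Xs[ys == c], Xt[pseudo == c]]).mean(0)
            for c in classes])
    return pseudo
```

Each iteration pulls the class means towards the target distribution, so later pseudo-labels reflect the target data more than the initial source-only decision rule.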

Hybrid Knowledge Transfer

Aljundi2015 consider both subspace alignment and the selection of landmarks similarly distributed between the two domains. The landmarks are selected so as to reduce the discrepancy between the domains and are then used to non-linearly project the data into the same space. Hence, both the geometric and statistic criteria are used for transferring instances and feature representations.

3.4 Imbalanced Unlabelled Target Dataset

This problem assumes that the target domain is class-imbalanced and only unlabelled target data are available, so the statistic criterion can be used. It is known as prior probability shift, or imbalanced data in the context of classification. The imbalanced data issue is quite common in practice. For example, abnormal actions (kick, punch, or fall down) are generally much rarer than normal actions in video surveillance, but usually require a higher recognition rate.

Attributes Source dataset Target data
Data Feature space Identical between two sets Identical between two sets
Availability Sufficient Sufficient
Balanced Yes No
Sequential No No
Label Availability Labelled Unlabelled
Label space Identical between two sets Identical between two sets
Table 4: Characteristics of problem 3.4

3.4.1 Statistic Criterion

Feature Representation-based Transfer

In the classification scenario, the prior probability shift (a change in P(y)) is referred to as the class imbalance problem [Japkowicz and Stephen (2002), Zhang et al. (2013a)], and the statistic criterion can be used. Zhang2013a assume that the class priors P(y) of the source and target datasets are different, but the source set is richer than the target set, such that the support of the target prior is contained in the support of the source prior. They tackle the prior probability shift by re-weighting the source samples with an idea similar to the Kernel Mean Matching method [Huang et al. (2006)]. They also define the situation where both the prior P(y) and the conditional P(x|y) change across datasets, and propose a kernel approach to re-weight and transform the source data to reduce the distribution shift, assuming that the source domain can be transferred to the target domain by a location-scale (LS) transformation (the source and target conditional distributions differ only in location and scale). Rather than assuming that all features can be transferred to the target domain by an LS transformation, Gong2016 propose to learn conditional invariant components via a linear transformation, and then re-weight the source domain data to reduce the remaining shift between domains.

Recently, Yan2017 introduce class-specific auxiliary weights into the original MMD to exploit the class prior probabilities of the source and target domains. The proposed weighted MMD model introduces an auxiliary weight for each class in the source domain, and a classification EM algorithm alternates between assigning pseudo-labels, estimating the auxiliary weights, and updating the model parameters.
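When the class priors are the only source of shift and estimates of the target priors are available, the re-weighting discussed in this subsection reduces to prior ratios, as in the following sketch (estimating the target priors from unlabelled data, e.g. by kernel mean matching, is the hard part and is assumed given here; function name ours):

```python
import numpy as np

def prior_shift_weights(ys, target_priors):
    """Re-weight source samples to correct prior probability shift:
    w_i = P_t(y_i) / P_s(y_i), where the source priors P_s(y) are
    estimated from the labelled source data and target_priors maps each
    class to an (externally estimated) target prior P_t(y)."""
    classes, counts = np.unique(ys, return_counts=True)
    src_priors = dict(zip(classes, counts / len(ys)))
    return np.array([target_priors[y] / src_priors[y] for y in ys])
```

Training any classifier with these sample weights makes the effective source class distribution match the (estimated) target one.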

3.5 Sequential Labelled Target Data

In real-world applications, the target data can be sequential video streams or continuously evolving data, and the distribution of the target data may also change over time. Since the target data are labelled, this problem is named supervised online domain adaptation.

Attributes              Source dataset                Target data
Data   Feature space    Identical between two sets    Identical between two sets
       Availability     Sufficient                    Insufficient
       Balanced         Yes                           Yes
       Sequential       No                            Yes
Label  Availability     Labelled                      Labelled
       Label space      Identical between two sets    Identical between two sets
Table 5: Characteristics of problem 3.5

3.5.1 Self Labelling

Classifier-based Methods

Xu2014a propose an incremental domain adaptation method for object detection under a weak-labelling assumption. Specifically, the adaptation model is a weighted ensemble of source and target classifiers, and the ensemble weights are updated over time.

3.6 Sequential Unlabelled Target Data

Similar to the problem in Section 3.5, the target data are sequential; however, no labelled target data are available. This problem is named unsupervised online domain adaptation and is related to, but different from, concept drift. Concept drift [Gama et al. (2014)] refers to changes in the conditional distribution of the output given the input while the distribution of the input stays unchanged, whereas in online domain adaptation the changes between domains are caused by changes of the input distribution.

Attributes              Source dataset                Target data
Data   Feature space    Identical between two sets    Identical between two sets
       Availability     Sufficient                    Sufficient
       Balanced         Yes                           Yes
       Sequential       No                            Yes
Label  Availability     Labelled                      Unlabelled
       Label space      Identical between two sets    Identical between two sets
Table 6: Characteristics of problem 3.6

3.6.1 Geometric Criterion

Feature Representation-Based Transfer

Hoffman2014a extend Subspace Alignment [Fernando et al. (2013)] to a continuously evolving target domain. Both the subspaces and the metrics that align them are updated as each new target sample arrives. Bitarafan2016 tackle the continuously evolving target domain by using the idea of GFK [Gong et al. (2012)] to construct a linear transformation, which is updated after each new batch of unlabelled target data arrives. Each arriving batch of target data is classified after the transformation and added to the source domain for recognizing the next batch.
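In its batch form (the online variants above update these quantities incrementally as samples arrive), Subspace Alignment can be sketched as:

```python
import numpy as np

def pca_basis(X, d):
    """Top-d principal directions of X as columns, via SVD of centred data."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T                     # (features, d)

def subspace_align(Xs, Xt, d=2):
    """Batch Subspace Alignment (Fernando et al., 2013): project source
    onto its own subspace, then rotate it into the target subspace."""
    Ps, Pt = pca_basis(Xs, d), pca_basis(Xt, d)
    M = Ps.T @ Pt                       # alignment matrix
    Zs = (Xs - Xs.mean(0)) @ Ps @ M     # aligned source features
    Zt = (Xt - Xt.mean(0)) @ Pt         # target features in their own subspace
    return Zs, Zt
```

A classifier trained on Zs can then be applied directly to Zt; the online extension recomputes Ps, Pt, and M as the target stream evolves.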

3.6.2 Self Labelling

Classifier-based Transfer

Jain2011 address online adaptation in the face detection task by adapting pre-trained classifiers using a Gaussian process regression scheme. The intuition is that the “easy-to-detect” faces can help the detection of “hard-to-detect” faces by normalizing the co-occurring “hard-to-detect” faces and thus reducing their difficulty of detection. Differently, Cao2010a address the cross-dataset action detection problem with a Maximum a Posteriori (MAP) estimation framework, which explores the spatial-temporal coherence of actions and makes use of prior information. Xu2016 propose an online domain adaptation model for multiple object tracking based on a two-level hierarchical adaptation tree, which consists of instance detectors in the leaf nodes and a category detector at the root node; the adaptation is executed in a progressive manner.

3.7 Unavailable Target Data

This problem is also named domain generalization in the literature, where the target domain data are assumed not to be available at the training stage. Thus, multiple source datasets are generally required to learn dataset-invariant knowledge that can generalize to a new dataset.

Attributes              Source dataset                Target data
Data   Feature space    Identical between two sets    Identical between two sets
       Availability     Sufficient                    No
       Balanced         Yes                           -
       Sequential       No                            No
Label  Availability     Labelled                      -
       Label space      Identical between two sets    Identical between two sets
Table 7: Characteristics of problem 3.7

3.7.1 Higher-level Representation Criterion

Feature Representation-Based Transfer

Most of the existing work tackles this problem by learning a domain-invariant and compact representation from the source domains [Blanchard et al. (2011), Khosla et al. (2012), Muandet et al. (2013), Fang et al. (2013), Stamos et al. (2015), Ghifary et al. (2016a)]. For example, Khosla2012 propose a discriminative framework that explicitly defines a bias associated with each dataset and attempts to approximate the weights for the visual world by undoing the bias from each dataset. Muandet2013 propose Domain-Invariant Component Analysis (DICA), a kernel-based optimization algorithm that learns an invariant transformation by minimizing the dissimilarity across domains, whilst preserving the functional relationship between input and output variables. Fang2013 propose an unbiased metric learning approach to learn an unbiased metric from multiple biased datasets. Ghifary2016 propose a scatter component analysis (SCA) method that finds a representation trading off between maximizing the separability of classes, minimizing the mismatch between domains, and maximizing the separability of data.
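At the risk of oversimplifying, the common ingredient of these representation-learning methods — keeping directions on which the source domains agree — can be caricatured as discarding the directions of largest between-domain mean scatter. This toy sketch is not DICA or SCA themselves, which additionally preserve class discriminability:

```python
import numpy as np

def invariant_projection(domains, keep):
    """Project onto the directions along which the source-domain means
    vary least (a crude stand-in for domain-invariant components).

    domains : list of (n_i, p) arrays, one per source dataset
    keep    : number of output dimensions
    """
    means = np.stack([D.mean(0) for D in domains])
    mu = means.mean(0)
    scatter = (means - mu).T @ (means - mu)     # between-domain mean scatter
    eigval, eigvec = np.linalg.eigh(scatter)    # eigenvalues ascending
    return eigvec[:, :keep]                     # least domain-varying directions

# Hypothetical usage: W = invariant_projection([Xa, Xb, Xc], keep=5); Z = X @ W
```

In a real method the projection is chosen by jointly trading off this domain-scatter term against class separability, rather than by this hard truncation.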

An ensemble of classifiers learnt from multiple sources has also been used to generalize to an unseen target domain [Xu et al. (2014a), Niu et al. (2015a), Niu et al. (2015b)]. Xu2014 propose to exploit the low-rank structure of multiple latent source domains; the domain shift is then reduced in an exemplar-SVMs framework by regularizing the likelihoods of positive samples within the same latent domain from each exemplar classifier to be similar. Similarly, Niu2015a extend this idea to source domain samples with multiple types of features (i.e., multi-view features). Niu2015 explicitly discover multiple hidden domains with different data distributions using earlier methods [Gong et al. (2013b)], and then learn one classifier per class and latent domain to form an ensemble of classifiers.

4 Heterogeneous Feature Spaces

This section discusses problems where the feature spaces of the source and target datasets differ but the label spaces are the same. Different feature spaces can be caused by different data modalities or different feature extraction methods.

4.1 Labelled Target Dataset

This problem assumes limited labelled target data are available at the training stage (see Table 8). It is named supervised heterogeneous domain adaptation.

Attributes              Source dataset                Target data
Data   Feature space    Different from target         Different from source
       Availability     Sufficient                    Insufficient
       Balanced         Yes                           Yes
       Sequential       No                            No
Label  Availability     Labelled                      Labelled
       Label space      Identical between two sets    Identical between two sets
Table 8: Characteristics of problem 4.1

4.1.1 Higher-level Representation Criterion

Feature Representation-based Transfer

Some methods assume that the source and target datasets differ only in their feature spaces while the distributions are the same. Since labelled data in the target dataset are scarce, Zhu2011 propose to use auxiliary heterogeneous data from the Web that contain both modalities to extract semantic concepts and find a shared latent semantic feature space between the different modalities. There are also methods that assume not only heterogeneous feature spaces between domains but also diverged data distributions; these are discussed in Section 4.1.3.

4.1.2 Class-based Criterion

Feature Representation-based Transfer

The class-based criterion has also been used to connect heterogeneous feature spaces. Finding the relationship between different feature spaces can be seen as translating between different languages. Hence, Dai2008 propose a translator using a language model to translate between the source and target feature spaces using the class labels. They argue that although the limited number of labelled data in the target domain is not sufficient for building a good classifier, it can support an effective translator. Kan2012 propose a multi-view discriminant analysis that seeks a discriminant common space by jointly learning multiple view-specific linear transformations using label information. Manifold alignment has also been used for heterogeneous domain adaptation with the class-based criterion. For example, Wang2011 propose to use class label information to align the manifolds. They treat each input domain as a manifold, and the goal is to find a mapping for each domain that projects it into a latent space where the topology (i.e. the manifold structure) and the discriminative information of each domain are preserved.

Feature augmentation based methods have also been proposed [Duan et al. (2012c), Li et al. (2014)], which transform the data from the two domains into a common subspace and then use two new feature mapping functions to augment the transformed data with their original features and zeros.
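Concretely, each source sample x is mapped to [Px; x; 0] and each target sample to [Qx; 0; x], so both land in one common augmented space; a sketch with hypothetical (here random) projection matrices P and Q:

```python
import numpy as np

def augment_source(x_s, P, d_t):
    """[P x_s ; x_s ; 0] — common-space part, original features, zero padding."""
    return np.concatenate([P @ x_s, x_s, np.zeros(d_t)])

def augment_target(x_t, Q, d_s):
    """[Q x_t ; 0 ; x_t] — common-space part, zero padding, original features."""
    return np.concatenate([Q @ x_t, np.zeros(d_s), x_t])

# Toy dimensions: 4-d source, 3-d target, 2-d common space.
rng = np.random.default_rng(0)
P, Q = rng.normal(size=(2, 4)), rng.normal(size=(2, 3))
a_s = augment_source(rng.normal(size=4), P, d_t=3)
a_t = augment_target(rng.normal(size=3), Q, d_s=4)
# Both vectors live in the same (2 + 4 + 3)-dimensional augmented space.
```

In the cited methods P and Q are learnt jointly with the classifier rather than fixed in advance.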

Similar to the case of [Saenko et al. (2010)] in homogeneous feature spaces (Section 3.1.1), a line of research assumes the label spaces of the target training set and target test set are non-overlapping subsets of the source label space. Kulis2011 extend [Saenko et al. (2010)] to learn an asymmetric non-linear transformation that maps points from one domain to another using supervised data from both domains. Hoffman2012 extend [Kulis et al. (2011)] to multi-domain adaptation by discovering latent domains with heterogeneous but unknown structure.

Classifier-based Transfer

Instead of using metric learning to learn an asymmetric feature transformation between heterogeneous features [Kulis et al. (2011)], an asymmetric mapping between classifiers can also be learnt to bridge source and target classifiers on heterogeneous features. For instance, Zhou2014 propose to learn the mapping across heterogeneous features by maximizing the dependency between the mapped source and target classifier weight vectors; in their formulation, the transformation matrix is constrained to be sparse and class-invariant.

4.1.3 Hybrid Criterion

Feature Representation-based Transfer

By contrast to the scenario in Section 4.1.1, where the source and target datasets differ only in their feature spaces, another scenario assumes that both the feature spaces and the data distributions differ. Shekhar2015 extend [Shekhar et al. (2013)] to heterogeneous feature spaces, where two projections and a latent dictionary are jointly learnt to simultaneously find a common discriminative low-dimensional space and reduce the distribution shift. Similarly, Sukhija2016 use the shared label distributions across domains as pivots for learning a sparse feature transformation; the shared label distributions and the relationship between the feature spaces and the label distributions are estimated in a supervised manner using random forests. Hence, both methods use the higher-level representation and class-based criteria.

Hybrid Knowledge Transfer

Hoffman2013 extend [Kulis et al. (2011)] to scale to large datasets by proposing the Max-Margin Domain Transforms (MMDT) method, which adapts max-margin classifiers and an asymmetric transform jointly to optimize both the feature representation and the classifier parameters; MMDT can be optimized quickly in linear space. Similar to [Shekhar et al. (2015), Sukhija et al. (2016)], Shi2010 also assume both the feature spaces and the data distributions differ between datasets. They propose to learn a spectral embedding to unify the different feature spaces and use a sample selection method to deal with the distribution mismatch. Hence, both feature transformation and sample selection are performed.

4.2 Labelled Plus Unlabelled Target Dataset

In this problem, both limited labelled and sufficient unlabelled target data are available (Table 9); this setting is named semi-supervised heterogeneous domain adaptation.

Attributes              Source dataset                Target data
Data   Feature space    Different from target         Different from source
       Availability     Sufficient                    Insufficient labelled + sufficient unlabelled
       Balanced         Yes                           Yes
       Sequential       No                            No
Label  Availability     Labelled                      Labelled + unlabelled
       Label space      Identical between two sets    Identical between two sets
Table 9: Characteristics of problem 4.2

4.2.1 Statistic Criterion

Hybrid Knowledge Transfer

Tsai2016a propose the Cross-Domain Landmark Selection (CDLS) algorithm for heterogeneous domain adaptation (HDA) using the statistic criterion (MMD), where instance weights and a feature transformation are learnt simultaneously. Specifically, CDLS derives a heterogeneous feature transformation which results in a domain-invariant subspace for associating cross-domain data, and assigns a weight to each instance according to its adaptation ability, using both labelled and unlabelled target samples.

4.2.2 Class-based Criterion

Feature Representation-based Transfer

Xiao2015 propose a kernel matching method for heterogeneous domain adaptation, in which the target data points are mapped to similar source data points by matching the target kernel matrix to a submatrix of the source kernel matrix using the label information. Specifically, the labelled target samples serve as pivot points for class separation: each labelled target sample is mapped to a source sample with the same class label, and the unlabelled target instances are then expected to map to source samples with the same labels, guided by the labelled pivot points through a distance measure between samples.

4.3 Unlabelled Target Dataset

This problem assumes no labelled target data are available (see Table 10); we name it unsupervised heterogeneous domain adaptation. The feature spaces can be completely different between datasets. Alternatively, the source data may consist of multiple modalities while the target data contain only one of them, or vice versa. This scenario is also considered under this problem because the relationships between the different feature spaces still need to be exploited.

Attributes              Source dataset                Target data
Data   Feature space    Different from target         Different from source
       Availability     Sufficient                    Sufficient
       Balanced         Yes                           Yes
       Sequential       No                            No
Label  Availability     Labelled                      Unlabelled
       Label space      Identical between two sets    Identical between two sets
Table 10: Characteristics of problem 4.3

4.3.1 Statistic Criterion

Hybrid Knowledge Transfer

Chen2014a assume the source dataset contains multiple modalities while the target dataset contains only one; however, the target data are also available at the training stage. In addition, they assume the source and target data come from different distributions, so the distribution mismatch also needs to be reduced. Both the feature representation and the classifier parameters are transferred by employing both the original and the privileged knowledge in the source dataset. Specifically, they use the statistic criterion (MMD) to transform the common modality of the source and target into a shared subspace, while the additional source modalities are transformed to the same representation in that common space. They iteratively learn the common space and a robust classifier, based on the intuition that a suitable common space helps learn a more robust classifier, and the robust classifier in turn helps find a more discriminative common space.

4.3.2 Higher-level Representation Criterion

Feature Representation Transfer

Similar to [Chen et al. (2014)], Ding2014 also assume that the source domain contains multi-modal data but the target domain has only one modality. They name this setting the Missing Modality Problem and propose to recover the missing modality in the target domain by finding similar data from the source domain; a latent factor is incorporated to uncover the missing modality based on the low-rank criterion.

Zero-padding has also been used to deal with heterogeneous feature spaces. For example, the marginalized SDA (mSDA) [Chen et al. (2012)] handles heterogeneous feature spaces by padding all input vectors with zeros so that both domains have equal dimensionality.
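The padding itself is straightforward (learning shared representations on top of the padded inputs is where mSDA does its actual work):

```python
import numpy as np

def zero_pad_pair(Xs, Xt):
    """Pad source and target feature matrices with zeros so both live in
    the concatenated (d_s + d_t)-dimensional space, as in mSDA."""
    n_s, d_s = Xs.shape
    n_t, d_t = Xt.shape
    Xs_pad = np.hstack([Xs, np.zeros((n_s, d_t))])  # source: [x_s ; 0]
    Xt_pad = np.hstack([np.zeros((n_t, d_s)), Xt])  # target: [0 ; x_t]
    return Xs_pad, Xt_pad
```

Both padded matrices can then be fed to a single denoising autoencoder, whose hidden layer becomes a shared representation.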

4.3.3 Correspondence Criterion

Feature Representation Transfer

Co-occurrence data between different feature spaces or modalities have been employed for heterogeneous domain adaptation [Qi et al. (2011a), Yang et al. (2016)]. In Qi's method [Qi et al. (2011a)], the co-occurrence data are used to map instances from different domains into a common space, as a bridge that semantically connects the two heterogeneous spaces. Differently, Yang2016 propose to learn transferred weights from co-occurrence data, which indicate the relatedness between heterogeneous domains. Specifically, they compute the principal components of the instances in each feature space, so that the co-occurrence data can be represented by these principal components. Using the principal component coefficients, a Markov Chain Monte Carlo method is employed to construct a directed cyclic network in which each node is a domain and each edge weight is the conditional dependence from one domain to another.

A line of research is dedicated to the task of translation between domains. For example, in machine translation between languages, sentence pairs are presented as a parallel training corpus for learning the translation system. A traditional phrase-based translation system [Koehn et al. (2003)] consists of many small sub-components that are tuned separately. By contrast, a newly emerging approach, neural machine translation [Kalchbrenner and Blunsom (2013), Sutskever et al. (2014), Cho et al. (2014), Bahdanau et al. (2015)], attempts to build and train a single, large neural network that reads a sentence and outputs a correct translation. A neural machine translation system typically consists of two components: an encoder for the source sentence and a decoder that produces the target sentence.

Similarly, in the vision domain, image-to-image translation [Isola et al. (2017)] has also been extensively exploited, which aims at converting an image from one representation of a given scene to another (e.g. greyscale to colour [Isola et al. (2017)], texture synthesis [Efros and Freeman (2001), Hertzmann et al. (2001), Li and Wand (2016)], sketch to photograph [Chen et al. (2009a), Isola et al. (2017)], time hallucination [Shih et al. (2013), Laffont et al. (2014), Isola et al. (2017)], image to semantic labels [Long et al. (2015), Eigen and Fergus (2015), Xie and Tu (2015)], and style transfer [Li and Wand (2016), Wang and Gupta (2016), Gatys et al. (2016), Johnson et al. (2016), Zhang et al. (2016)]). The key idea for tackling these tasks is to learn a translation model between paired samples from different domains. Recent deep learning based techniques have greatly advanced the image-to-image translation task; for example, deep convolutional neural networks [Long et al. (2015), Xie and Tu (2015), Eigen and Fergus (2015), Gatys et al. (2016), Johnson et al. (2016), Zhang et al. (2016)] and Generative Adversarial Networks (GANs) [Wang and Gupta (2016), Li and Wand (2016), Isola et al. (2017)] have been extensively exploited for learning the translation model.

Though the original purpose of these works on translation between domains may not be cross-dataset recognition, their ideas can be borrowed for cross-modality or cross-feature-space recognition: if the source domain data can be translated to the target domain properly, the target task can be boosted by the translated source data.

Gupta2016 assume that the source data are large-scale labelled RGB data and the target data are unlabelled RGB and depth image pairs. The source data are used to train multiple layers of rich representations with deep convolutional neural networks. The paired target data are then used to transfer the rich source parameters to the target network, by constraining paired samples from different modalities to have similar representations.

4.3.4 Self Labelling

Instance-based Transfer

Tang2012 propose a self-paced method for adapting object detectors from images to video, where the data modalities between domains are thus completely different. They iteratively adapt the detector by automatically discovering examples from the target video data, starting from the most confident ones. In each iteration, the number of target examples is increased while the number of source examples is decreased.

5 Heterogeneous Label Spaces

In this section, we discuss a set of problems that assume different label spaces but the same feature spaces between the source and target datasets. For example, in classification tasks, even when the label spaces differ between datasets, there still exists shared knowledge between previous categories (e.g. horse) and new categories (e.g. zebra) that can be used for learning the new categories.

5.1 Labelled Target Dataset

When limited labelled data (perhaps only one example per category) are available in the target dataset, the problem is generally named one-shot or few-shot learning. The characteristics of this problem are shown in Table 11. This setting is closely related to multi-task learning; the difference is that one-shot learning emphasizes recognition on the target data with limited labelled data, while multi-task learning aims at improving all tasks, each of which has good training data.

Attributes              Source dataset                Target data
Data   Feature space    Identical between two sets    Identical between two sets
       Availability     Sufficient                    Insufficient
       Balanced         Yes                           Yes
       Sequential       No                            No
Label  Availability     Labelled                      Labelled
       Label space      Different from target         Different from source
Table 11: Characteristics of problem 5.1

5.1.1 Statistic Criterion

Feature Representation-based Transfer

Bian2012 propose a transfer topic model (TTM) by assuming the source and target datasets can be represented by shared concepts even when their classes differ. Since the target labelled data are insufficient, they propose to use the topics learned from the source domain to regularize the topic estimation in the target domain, with Kullback-Leibler divergences used as regularizers to reduce the shift between topic pairs of the two domains.

5.1.2 Higher-level Representation Criterion

Feature Representation-based Transfer

Yang2009, Yang2009a propose to transfer the parameters of a distance function from the source data to the target; the distance function helps to detect patch saliency in the target videos.

5.1.3 Class-based Criterion

Instance-based Transfer

Qi2011 develop a cross-category label propagation algorithm, which directly propagates inter-category knowledge at the instance level between the source and target categories.

Feature Representation-based Transfer

Patricia2014 consider each source as an expert that judges the new target samples. The output confidence values of the prior models are treated as features and combined with the features of the target samples to build a target classifier.

Classifier-based Transfer

Several classifier-based methods have been proposed to transfer the parameters of classifiers. Fei-Fei2006 propose a Bayesian approach, where a generic object model is estimated from some source categories and then used as a prior to evaluate the target object parameter distribution with a maximum-a-posteriori technique. Lake2011 introduce a generative model of how characters are composed from strokes, where knowledge from previous characters helps to infer the latent strokes in novel characters.

Instead of using generative models, some discriminative classifier-based transfer methods [Tommasi et al. (2010), Aytar and Zisserman (2011), Ma et al. (2014), Jie et al. (2011)] have also been proposed to incorporate prior knowledge. For example, Tommasi2010 present a Least Squares Support Vector Machine (LS-SVM) based model adaptation algorithm able to select and appropriately weight prior knowledge coming from different categories, by assuming the new categories are similar to some of the “prior” categories. Similarly, Aytar2011 propose three transfer learning formulations, namely Adaptive SVM (A-SVM), Projective Model Transfer SVM (PMT-SVM), and Deformable Adaptive SVM (DA-SVM), where a template learnt previously for other categories is used to regularize the training of a new category; they also assume the target category is visually similar to a source category. Ma2014 design a multi-task learning framework to explore the shared knowledge between source and target classifiers and jointly optimize the classifiers for both sides. Jie2011 propose the Multi Kernel Transfer Learning (MKTL) method, which takes advantage of priors built over different features and with different learning methods; the priors are used as experts and their outputs are transferred to the target using multiple kernel learning (MKL). The intuition is that knowledge can be transferred between classes that share common visual properties, such as bicycle and motorbike.
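The shared idea of this family — shrink the new classifier toward a related source model rather than toward zero — can be sketched with a logistic loss standing in for the SVM objectives of the cited papers (the function, data, and settings below are illustrative):

```python
import numpy as np

def adapt_classifier(X, y, w_src, lam=1.0, lr=0.1, iters=500):
    """Minimize logistic loss + lam * ||w - w_src||^2: the classifier for
    the new category is regularized toward a related source template
    w_src instead of toward zero (cf. the A-SVM family above)."""
    w = w_src.astype(float).copy()
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))                    # sigmoid scores
        grad = X.T @ (p - y) / len(y) + 2 * lam * (w - w_src)
        w -= lr * grad
    return w

# Toy use: two well-separated classes, a weak but correctly oriented prior.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 0.3, size=(20, 2)),
               rng.normal([2, 0], 0.3, size=(20, 2))])
y = np.concatenate([np.zeros(20), np.ones(20)])
w_adapted = adapt_classifier(X, y, w_src=np.array([0.5, 0.0]), lam=0.01)
```

With few target samples, a larger lam keeps the solution close to the prior model; with more samples, lam can be decreased and the data dominate.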

Kuzborskij2013 conduct a theoretical analysis (focused on algorithmic stability) of the discriminative classifier-based transfer methods [Tommasi et al. (2010), Aytar and Zisserman (2011), Ma et al. (2014), Jie et al. (2011)], which they name Hypothesis Transfer Learning (HTL). In HTL, only hypotheses trained on a source domain are used to transfer parameters to the target domain, in the presence of a small set of labelled target data. They give a generalization bound in terms of the Leave-One-Out (LOO) risk and show that the relatedness of the source and target domains accelerates the convergence of the LOO error and the generalization error; in the case of unrelated domains, they show how a hypothetical algorithm could avoid negative transfer.

Recently, deep learning based approaches have emerged for few-shot learning. Vinyals2016 propose matching networks, which use an attention mechanism over a learned embedding of the limited labelled data of the target classes and can be interpreted as a weighted nearest-neighbour classifier applied within an embedding space. Ravi2017 propose a meta-learning approach to few-shot learning: an LSTM [Hochreiter and Schmidhuber (1997)] is trained to produce the updates to a classifier, given a few labelled target examples, such that it generalizes well to the target set. Snell2017 propose prototypical networks, which learn a non-linear mapping of the input into an embedding space using a neural network and take a class's prototype to be the mean of its support set in the embedding space.
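The nearest-prototype rule at the core of prototypical networks (stripped here of the learned neural embedding) is simple to sketch:

```python
import numpy as np

def prototypes(support_X, support_y):
    """One prototype per class: the mean of its (embedded) support samples."""
    classes = np.unique(support_y)
    protos = np.stack([support_X[support_y == c].mean(axis=0) for c in classes])
    return classes, protos

def nearest_prototype(query_X, classes, protos):
    """Assign each query point to the class of the nearest prototype."""
    d2 = ((query_X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]

# 2-way, 2-shot toy episode.
support_X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
support_y = np.array([0, 0, 1, 1])
classes, protos = prototypes(support_X, support_y)
pred = nearest_prototype(np.array([[0.1, 0.1], [5.0, 5.1]]), classes, protos)
# pred → [0, 1]
```

In the full method the inputs are first passed through a learned embedding network, trained episodically so that this rule generalizes to unseen classes.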

5.2 Unlabelled Target Dataset

Some research also tries to tackle the heterogeneous label space problem by assuming that no labelled target data are available. This problem can be named unsupervised transfer learning.

Attributes              Source dataset                Target data
Data   Feature space    Identical between two sets    Identical between two sets
       Availability     Sufficient                    Sufficient
       Balanced         Yes                           Yes
       Sequential       No                            No
Label  Availability     Labelled                      Unlabelled
       Label space      Different from target         Different from source
Table 12: Characteristics of problem 5.2

5.2.1 Higher-level Representation Criterion

Feature Representation-based Transfer

The higher-level representation criterion is generally used for this problem. Two different scenarios are considered in the literature.

The first scenario is that only the label spaces between the source and target datasets are different. Since there are no labels in the target training set, the unseen class information is generally gained from a higher-level semantic space shared between datasets. Some research uses Web search to build the semantic representation linking the different label spaces [Zheng et al. (2009), Hu and Yang (2011)]. For example, Zheng2009 assume that the target data have no labels, but the number and names of the activities are known, and the source and target activities have some kind of relationship. They bridge the activities in the two domains by learning a similarity function in the text feature space (word vectors) via Web search, and then train a weighted SVM model with probabilistic confidence weights learned from the similarity function. Other research assumes a shared human-specified high-level semantic space (i.e. attributes [Palatucci et al. (2009)] or text descriptions [Reed et al. (2016)]) between datasets. Given a defined attribute or text description ontology, each class can be represented by a vector in the semantic space. However, attribute annotations and text descriptions are expensive to acquire. In transductive settings [Li et al. (2015), Zhang and Saligrama (2015), Zhang and Saligrama (2016a), Zhang and Saligrama (2016b)], the attributes are substituted by visual similarity and data distribution information, where the target domain data of the unseen classes are required at the training stage to learn the model. Another strategy learns the semantic space from large, unrestricted, but freely available text corpora (e.g. Wikipedia) to derive a word vector space [Frome et al. (2013), Mikolov et al. (2013), Socher et al. (2013)]. The related work on semantic spaces (attributes, text descriptions, or word vectors) is further discussed in Section 5.4, since the target data are generally not required when a semantic space is involved.

The second scenario assumes that, in addition to the different label spaces, domain shift (i.e. a distribution shift of the features) also exists between datasets [Fu et al. (2015), Kodirov et al. (2015), Wang et al. (2016)]. For example, Fu2015 name this problem projection domain shift and propose a transductive multi-view embedding space that rectifies it. Kodirov2015 propose a regularised sparse coding framework which uses the projections of the target domain class labels in the semantic space to regularise the learned target domain projection, thus overcoming the projection domain shift problem.

5.3 Sequential Labelled Target Data

As mentioned, the target domain data can arrive sequentially. This problem therefore assumes the target data are sequential and can come from different classes; it is also called online transfer learning and is closely related to lifelong learning. Both concepts focus on continuous learning processes for evolving tasks. However, online transfer learning emphasizes performance on the target data (without sufficient target training data), whereas lifelong learning tries to improve the future target task (with sufficient target training data) as well as all the past tasks [Chen et al. (2015)]. Lifelong learning can also be seen as incremental/online multi-task learning.

Attributes              Source dataset                Target data
Data   Feature space    Identical between two sets    Identical between two sets
       Availability     Sufficient                    Insufficient
       Balanced         Yes                           Yes
       Sequential       No                            Yes
Label  Availability     Labelled                      Labelled
       Label space      Different from target         Different from source
Table 13: Characteristics of problem 5.3

5.3.1 Self Labelling

Classifier-based Transfer

Nater2011 address an action recognition scenario where each unseen activity to be recognized has only one labelled sample, with the help of other offline-trained activities that have many labelled samples. They build a multiclass model which exploits prior knowledge of the known classes and incrementally learns the new actions; the newly labelled activities are then integrated into the previous model to update the activity model. Zhao2010 propose an ensemble learning based method (OTL) that learns a classifier in an online fashion from target domain data and combines it with the source domain classifier. The combination weights are adjusted dynamically based on a loss function that evaluates the difference between the current prediction and the correct label of each new incoming sample. Tommasi2012 then extended OTL [Zhao and Hoi (2010)] to online transfer from multiple sources.
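The weight dynamics in OTL can be caricatured with multiplicative updates driven by each model's loss on the incoming labelled sample (a simplification of the actual OTL rule; the classifiers here are plain callables returning P(y=1|x), and in the real method the target classifier is itself updated online on each sample):

```python
import numpy as np

class OnlineEnsemble:
    """Convex combination of a frozen source classifier and an online
    target classifier; a model's weight shrinks multiplicatively with
    its squared error on each incoming labelled sample."""
    def __init__(self, source_clf, target_clf, eta=0.5):
        self.clfs = [source_clf, target_clf]
        self.w = np.array([0.5, 0.5])
        self.eta = eta

    def predict_proba(self, x):
        return sum(wi * clf(x) for wi, clf in zip(self.w, self.clfs))

    def update(self, x, y):
        losses = np.array([(clf(x) - y) ** 2 for clf in self.clfs])
        self.w = self.w * np.exp(-self.eta * losses)  # penalize lossy models
        self.w = self.w / self.w.sum()                # renormalize

# If the stream's labels contradict the source model, weight shifts
# toward the (here stubbed) target classifier.
ens = OnlineEnsemble(source_clf=lambda x: 0.9, target_clf=lambda x: 0.1)
for _ in range(5):
    ens.update(x=None, y=0)
```

This exponential-weighting scheme is a standard online-learning device; OTL's own update and theoretical guarantees differ in the details.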

5.4 Unavailable Target Data

This problem is also named zero-shot learning in the literature, where unseen categories are required to be recognized in the target set but no training data are available for those unseen categories. Different from domain generalization (see Section 3.7), the unseen target data are from different categories than the source data in zero-shot learning. As mentioned in Section 5.2, the unseen categories can be connected via some auxiliary information, such as a common semantic representation space.

Attribute          | Source dataset             | Target data
Feature space      | Identical between two sets | Identical between two sets
Availability       | Sufficient                 | No
Balanced           | Yes                        | -
Sequential         | No                         | No
Label availability | Labelled                   | -
Label space        | Different from target      | Different from source
Table 14: Characteristics of problem 5.4

5.4.1 Higher-level Representation Criterion

Feature Representation-based Transfer

Most of the methods for this problem rely on the existence of a labelled training set of seen classes and the knowledge about how each unseen class is semantically related to the seen classes. Seen and unseen classes are usually related in a high dimensional vector space, which is called semantic space. Such a space can be an attribute space [Palatucci et al. (2009)], text description space [Reed et al. (2016)], or a word vector space [Frome et al. (2013), Mikolov et al. (2013), Socher et al. (2013)].

The attribute space is the most commonly used intermediate semantic space. Attributes are defined as properties observable in images, described with human-designated names such as “white”, “hairy”, or “four-legged”. Hence, in addition to label annotation, attribute annotations are required for each class. However, because the attributes are assigned on a per-class basis instead of a per-image basis, the manual effort to add a new object class is kept minimal. This motivated the collection of datasets containing images annotated with visual attributes [Farhadi et al. (2009), Lampert et al. (2009)]. Two main strategies have been proposed for recognizing unseen object categories using attributes. The first is recognition using independent attributes, which consists of learning an independent classifier per attribute [Lampert et al. (2009), Palatucci et al. (2009), Kumar et al. (2009), Liu et al. (2011), Parikh and Grauman (2011)]. At test time, the independent classifiers allow the prediction of attribute values for each test sample, from which the test class label is inferred. Since attribute detectors are expected to generalize well across different categories, including those previously unseen, some research is devoted to modelling the uncertainty of attributes [Wang and Ji (2013), Jayaraman and Grauman (2014)] or robustly detecting attributes from images [Gan et al. (2016), Bucher et al. (2016)]. However, Akata2013 argue that because the attribute classifiers in previous work are learned independently of the end task, they might be optimal at predicting attributes but not necessarily at predicting classes. Hence, the second strategy is recognition by assuming a fixed transformation between the attributes and the class labels [Akata et al. (2015), Romera-Paredes and Torr (2015), Akata et al. (2016), Qiao et al. (2016), Xian et al. (2016)], which learns all attributes simultaneously.
To sum up, attribute-based methods are promising for recognizing unseen classes, but with a key drawback: attribute annotations are still required for each class.
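The first strategy (independent per-attribute classifiers, in the spirit of direct attribute prediction) can be sketched as follows. The helper names and the simple likelihood-product scoring are illustrative assumptions, not the exact formulation of any cited paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attribute_classifiers(X, class_ids, class_attributes):
    """One binary classifier per attribute; attribute labels are derived
    from each seen class's binary attribute signature, since attributes
    are annotated per class rather than per image."""
    clfs = []
    for a in range(class_attributes.shape[1]):
        y_attr = class_attributes[class_ids, a]   # attribute label per sample
        clfs.append(LogisticRegression().fit(X, y_attr))
    return clfs

def predict_unseen(x, clfs, unseen_attributes):
    """Score each unseen class by how well its attribute signature matches
    the predicted attribute probabilities (a naive likelihood product)."""
    p = np.array([c.predict_proba(x.reshape(1, -1))[0, 1] for c in clfs])
    scores = [np.prod(np.where(sig == 1, p, 1 - p)) for sig in unseen_attributes]
    return int(np.argmax(scores))
```

No image of an unseen class is ever used for training; only its attribute signature is needed at test time, which is exactly what keeps the annotation cost per new class low.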

Instead of using attributes, Reed2016 use image text descriptions to construct the semantic space, providing a natural language interface. However, similar to the attribute space, the good performance comes at the price of manual annotation.

The third semantic space is the word vector space [Frome et al. (2013), Mikolov et al. (2013), Socher et al. (2013), Lei Ba et al. (2015)], which is attractive since no extensive annotations are required for the semantic space. The word vector space is derived from both labelled images and a huge unannotated text corpus (e.g. Wikipedia) and is generally learnt by deep neural networks.
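A minimal sketch of the embedding idea: map image features into the word vector space with a regressor trained on seen classes, then label an unseen-class sample by its nearest unseen class embedding. The ridge-regression map below is a linear stand-in for the deep mappings used in the cited works, and all function names are hypothetical.

```python
import numpy as np

def fit_visual_semantic_map(X, Y_embed, reg=1e-2):
    """Ridge-regression map W from image features to word vectors, trained
    on (feature, class-word-vector) pairs from seen classes.  A linear
    stand-in for the learned deep mappings of DeViSE-style models."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y_embed)

def zero_shot_classify(x, W, unseen_class_vectors):
    """Assign the unseen class whose word vector is nearest to the sample's
    projected embedding under cosine similarity."""
    v = x @ W
    v = v / np.linalg.norm(v)
    U = unseen_class_vectors / np.linalg.norm(unseen_class_vectors, axis=1, keepdims=True)
    return int(np.argmax(U @ v))
```

Because the class embeddings come from the text corpus rather than from image annotation, adding a new unseen class only requires its word vector, which is the appeal of this semantic space.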

5.5 Unlabelled Source Dataset

This problem is similar to self-taught learning, where the source data are unlabelled but contain information (e.g. basic visual patterns) that can be used for target tasks.

Attribute          | Source dataset             | Target data
Feature space      | Identical between two sets | Identical between two sets
Availability       | Sufficient                 | Insufficient
Balanced           | Yes                        | Yes
Sequential         | No                         | No
Label availability | Unlabelled                 | Labelled
Label space        | Different from target      | Different from source
Table 15: Characteristics of problem 5.5

5.5.1 Higher-level Representation Criterion

Feature Representation-based Transfer

Raina2007 first presented the idea of “self-taught learning”. They propose to construct higher-level features using sparse coding with the unlabelled source data. Recently, Kumagai2016 provided a theoretical learning bound for self-taught learning, focusing on the performance of sparse coding in this setting.
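The sparse-coding recipe can be sketched with off-the-shelf tools. The function below is an illustrative sketch, not the exact procedure of Raina2007 (the dictionary learner, atom count, and helper name are our choices): learn a dictionary of basis patterns on the unlabelled source data, then encode the scarce labelled target data over that dictionary as higher-level features.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import LogisticRegression

def self_taught_features(unlabelled_source, labelled_target, n_atoms=16):
    """Self-taught learning sketch: fit a sparse dictionary on unlabelled
    source data (no labels used), then represent each labelled target
    sample by its sparse code over the learned atoms."""
    dico = MiniBatchDictionaryLearning(n_components=n_atoms, random_state=0)
    dico.fit(unlabelled_source)             # basis patterns learned without labels
    return dico.transform(labelled_target)  # sparse codes as higher-level features

# usage sketch (names hypothetical):
#   codes = self_taught_features(X_unlabelled, X_labelled)
#   clf = LogisticRegression().fit(codes, y_labelled)
```

The point of the construction is that the dictionary captures generic patterns (e.g. edges in images) from abundant unlabelled data, so the few labelled target samples can be represented more compactly than in raw feature space.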

The idea of self-taught learning has also been used in the deep learning framework, where the unlabelled data are used for pretraining the network to obtain a good starting point for its parameters [Le et al. (2011), Gan et al. (2014), Kuen et al. (2015)]. For instance, Le2011 propose a self-taught learning framework based on a deep Independent Subspace Analysis (ISA) network. They train the network on video blocks from the UCF and YouTube datasets and use the learned model to extract features and recognize actions on Hollywood2 video clips. Gan2014 use the unlabelled samples to pretrain the first layer of a convolutional deep belief network (CDBN) for initializing the network parameters. Kuen2015 use stacked convolutional autoencoders to learn invariant representations from unlabelled image patches for visual tracking.

6 Heterogeneous Feature Spaces and Label Spaces

In this section, a more challenging scenario is discussed, where both the feature spaces and the label spaces of the source and target datasets are different.

6.1 Labelled Target Dataset

This problem assumes that labelled target data are available. We name this problem heterogeneous supervised transfer learning.

Attribute          | Source dataset        | Target data
Feature space      | Different from target | Different from source
Availability       | Sufficient            | Insufficient
Balanced           | Yes                   | Yes
Sequential         | No                    | No
Label availability | Labelled              | Labelled
Label space        | Different from target | Different from source
Table 16: Characteristics of problem 6.1

6.1.1 Higher-level Representation Criterion

Feature Representation-based Transfer

Rather than assuming completely different feature spaces, Jia2014 propose to transfer the knowledge of RGB-D (RGB and depth) data to a dataset that only has RGB data, and use this additional source of information to recognize human actions from RGB videos. They apply latent low-rank tensor transfer learning to learn shared subspaces of the two databases. Specifically, each action sample is represented as a third-order tensor (row dimension, column dimension, and number of frames), and the subspace is learned by imposing low-rank constraints on each mode of the tensor, such that more spatio-temporal information is uncovered from the action videos.

6.1.2 Hybrid Criterion

Hybrid knowledge transfer

Hu2011 propose to transfer knowledge between different activity recognition tasks, relaxing the assumptions of the same feature space, the same label space, and the same underlying distribution by automatically learning a mapping between different sensors. They adopt an idea similar to translated learning [Dai et al. (2008)] to find a translator between different feature spaces using a statistical criterion (i.e. the Jensen-Shannon divergence). Then Web knowledge is used as a bridge to link the different label spaces via self labelling.

6.2 Sequential Labelled Target Data

This problem (Table 17) assumes that the sequential target data have a different feature space from the source data, which is named heterogeneous online supervised transfer learning.

Attribute          | Source dataset        | Target data
Feature space      | Different from target | Different from source
Availability       | Sufficient            | Insufficient
Balanced           | Yes                   | Yes
Sequential         | No                    | Yes
Label availability | Labelled              | Labelled
Label space        | Different from target | Different from source
Table 17: Characteristics of problem 6.2

6.2.1 Self Labelling

Classifier-based Transfer

As mentioned in Section 5.3, Zhao2010 propose the OTL method for online transfer learning. They also consider the case of heterogeneous feature spaces by assuming that the feature space of the source domain is a subset of that of the target domain. A multi-view approach is then proposed, adopting a co-regularization principle to learn two target classifiers online and simultaneously from the two views. An unseen target example is classified by the combination of the two target classifiers.
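A crude sketch of this two-view idea, assuming the source feature space corresponds to the first `d_old` coordinates of the target space. The perceptron-style updates and the explicit disagreement penalty are illustrative stand-ins for the co-regularized online updates in the paper, and all names are hypothetical.

```python
import numpy as np

class TwoViewOnlineLearner:
    """Sketch of online multi-view learning: the target feature vector
    splits into an 'old' view (shared with the source space) and a 'new'
    view; two linear classifiers are updated online and nudged toward
    agreeing on each example (a crude stand-in for co-regularization)."""

    def __init__(self, d_old, d_new, lr=0.1, co_reg=0.05):
        self.w1 = np.zeros(d_old)   # classifier on the shared (source) view
        self.w2 = np.zeros(d_new)   # classifier on the new target-only view
        self.lr, self.co_reg = lr, co_reg
        self.d_old = d_old

    def predict(self, x):
        x1, x2 = x[:self.d_old], x[self.d_old:]
        return 1 if (self.w1 @ x1 + self.w2 @ x2) >= 0 else -1

    def update(self, x, y):
        x1, x2 = x[:self.d_old], x[self.d_old:]
        s1, s2 = self.w1 @ x1, self.w2 @ x2
        # perceptron-style step on both views when the combined margin is violated
        if y * (s1 + s2) <= 1:
            self.w1 += self.lr * y * x1
            self.w2 += self.lr * y * x2
        # co-regularization sketch: penalize disagreement between the view scores
        self.w1 -= self.co_reg * (s1 - s2) * x1
        self.w2 -= self.co_reg * (s2 - s1) * x2
```

The combined prediction mirrors the paper's setup: both view-specific classifiers vote, while the agreement term couples the classifier on the shared source view with the one on the new features.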

7 Applications and Datasets

Cross-dataset transfer learning is crucial for many real world applications, including WiFi localization, sentiment classification, part-of-speech tagging, spam email filtering, text classification, object recognition, hand-written digit recognition, face recognition, person re-identification, scene categorization, action recognition, and video event detection. We take object recognition as an example. As claimed by Torralba2011, despite the great efforts of object dataset creators, the datasets appear to have strong built-in biases caused by various factors, such as selection bias, capture bias, category or label bias, and negative set bias. This suggests that no matter how big a dataset is, it is impossible to cover the complexity of the real visual world. Similarly, datasets for other tasks also have biases caused by different factors. Hence, to make good use of the big data available on the Internet, cross-dataset transfer learning is crucial for solving the current tasks effectively and efficiently. Below, we summarise a set of real world cross-dataset transfer learning applications and provide information on commonly used datasets for evaluating performance.

7.1 WiFi Localization

Indoor WiFi localization [Yang et al. (2008)] aims at estimating the location of a mobile device based on the received signal strength (RSS) from a set of access points that periodically send out wireless signals. The WiFi signal strength may be a function of many dynamic factors, such as time, device, and space. Hence, to estimate locations across different time periods, transfer learning is required. The WiFi dataset (http://www.cs.ust.hk/~qyang/ICDMDMC07/) is publicly available.

7.2 Sentiment Classification

The objective of automatic sentiment classification is to judge the overall opinion of a product, i.e. whether a product review is positive or negative. However, sentiment is expressed differently in different domains [Blitzer et al. (2007)]. Hence, a multi-domain sentiment dataset (https://www.cs.jhu.edu/~mdredze/datasets/sentiment/) was created.

Prettenhofer2010 extend the Multi-Domain Sentiment Dataset [Blitzer et al. (2007)] to a Cross-Lingual Sentiment (CLS) dataset (http://www.webis.de/research/corpora/corpus-webis-cls-10/cls-acl10-processed.tar.gz). The CLS dataset consists of Amazon product reviews of three product categories: books, DVDs, and music, with more than 4 million reviews in three languages: German, French, and Japanese.

7.3 Part-of-speech Tagging

Part-of-speech (PoS) tagging is one of the important natural language processing (NLP) tasks, which aims at labelling a word in context with its grammatical function. However, different domains use very different vocabularies. Hence, transferring a PoS tagger from one domain to another is crucial to reduce the effort of creating training corpora for each domain. Blitzer2006 propose to use sentences from the Wall Street Journal (WSJ) (http://www.cis.upenn.edu/~treebank/home.html) as source domain data, and sentences from Biomedical Text (http://languagelog.ldc.upenn.edu/myl/ldc/ITR/) as target domain data.

7.4 Spam Email Filtering

The ECML/PKDD 2006 Challenge focused on personalized spam filtering, aiming at building a filter for automatically detecting spam emails. Training such filters relies on publicly available sources. However, spam emails are person specific, meaning that the distribution of the combined sources of training data differs from that of the emails received by individual users. In this challenge, a spam email filtering dataset was provided (http://www.ecmlpkdd2006.org/challenge.html).

Another cross-domain spam email dataset was created by Bickel2006, built upon the Enron Corpus (https://www.cs.cmu.edu/~./enron/) [Klimt and Yang (2004)]. The dataset contains nine different inboxes with test emails (5,270 to 10,964 emails, depending on the inbox) and one set of training emails collected from various different sources.

7.5 Text Classification

Lang1995 created the 20 Newsgroups data set (http://qwone.com/~jason/20Newsgroups/), which is commonly used for evaluating cross-domain text classification algorithms. It contains 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups corresponding to different topics. The goal is to distinguish documents from different newsgroup categories, such as rec versus talk.

7.6 Object Recognition

Saenko2010 create the Office dataset (https://cs.stanford.edu/~jhoffman/domainadapt/) for evaluating the effects of domain shift. Three datasets representing three domains are involved: Amazon (images downloaded from online merchants), Webcam (low-resolution images from a web camera), and DSLR (high-resolution images from a digital SLR camera). Thirty-one object categories are shared by the three datasets.

Gong2012 combine the Caltech-256 dataset [Griffin et al. (2007)] with the Office dataset [Saenko et al. (2010)], constituting the Office+Caltech object dataset (http://www-scf.usc.edu/~boqinggo/domainadaptation.html). The Caltech-256 dataset contains 256 object classes downloaded from Google Images. The ten classes shared by the four datasets are generally selected for the cross-dataset object recognition task.

Another object dataset for cross-dataset analysis (https://sites.google.com/site/crossdataset/home) is proposed by Tommasi2014a. The dataset has two settings: a dense setting containing 40 classes shared by four datasets (Caltech256, Bing, ImageNet, and SUN), and a sparse setting containing 105 classes shared by at least four out of twelve datasets (RGB-D, a-Yahoo, ETH80, MSRCORID, PascalVOC07, AwA, Office, Caltech101, SUN, ImageNet, Bing, Caltech256).

7.7 Hand-Written Digit Recognition

For the cross-domain hand-written digit recognition task, combinations of different digit datasets (e.g. MNIST [Lecun et al. (1998)], USPS [Hull (1994)], and SVHN [Netzer et al. (2011)]) are used. The MNIST dataset (http://yann.lecun.com/exdb/mnist/) contains a training set of 60,000 examples and a test set of 10,000 examples of size 28×28. The USPS dataset (http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html) consists of 7,291 training images and 2,007 test images of size 16×16. The SVHN dataset (http://ufldl.stanford.edu/housenumbers/) was obtained from a large number of Street View images and comprises over 600,000 labelled characters.

7.8 Face Recognition

Dataset shift in face recognition can be caused by poses, resolution, illuminations, expressions, and modality. For example, the MultiPIE dataset (http://www.flintbox.com/public/project/4742/) [Gross et al. (2010)] contains face images under various poses, illuminations, and expressions. The face images under one pose can be used as the source dataset while the images under another pose can be used as the target dataset. Similarly, illuminations and expressions can also cause distribution shift and can be used for evaluating cross-domain performance. Some face datasets also contain face images captured by different sensors, resulting in multiple data modalities. For instance, the Oulu-CASIA NIR&VIS facial expression database (http://www.ee.oulu.fi/~gyzhao/) and the BUAA-VisNir Face Database (http://irip.buaa.edu.cn/Research/Research_Highlights.htm) contain face images captured under both NIR (near infrared) and VIS (visible light) conditions. There are also datasets that provide sketches of human faces, such as the Face Sketch Database (CUFS) (http://mmlab.ie.cuhk.edu.hk/archive/facesketch.html). Recognition across photos and sketches has also been addressed [Wang et al. (2012)], with potential application in law enforcement. For example, it is often required to compare face photos to a sketch drawn by an artist based on a verbal description of the suspect.

7.9 Person Re-identification

Person re-identification is another important real world application of cross-domain transfer learning. Several commonly used datasets are available, such as the VIPeR dataset (https://www.researchgate.net/publication/261596035_Viewpoint_Invariant_Pedestrian_Recognition_VIPeR_Dataset_v10), the CUHK Person Re-identification dataset (http://www.ee.cuhk.edu.hk/~xgwang/CUHK_identification.html), and the PRID dataset (https://lrs.icg.tugraz.at/datasets/prid/). The persons in these datasets are all captured from different viewpoints.

7.10 Scene Categorization

Cross-dataset transfer learning is also used for cross-task scene categorization, where the knowledge learnt from previous scene categories is used for new categories. The Flickr scene image data set (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm) contains 33 scene categories and 34,926 images in total. The SUN-397 database (http://groups.csail.mit.edu/vision/SUN/) is a very large scene recognition dataset containing 397 scene categories. There are also scene datasets that provide attributes, such as the SUN Attribute Database (https://cs.brown.edu/~gen/sunattributes.html), which provides 102 discriminative attributes and covers more than 700 scene categories.

7.11 Action Recognition

Cross-dataset video-based action recognition has also been addressed. For example, Zhu2014b conduct action recognition across the UCF YouTube dataset (http://crcv.ucf.edu/data/UCF_YouTube_Action.php) and the HMDB51 dataset (http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/), where the actions shared between the datasets are selected. Another line of research transfers across action datasets with different label spaces. For example, Ma2014 use laboratory-collected datasets (KTH (http://www.nada.kth.se/cvap/actions/) and HumanEva (http://humaneva.is.tue.mpg.de/)) as the source domain, while real-world action datasets such as the UCF YouTube dataset, the CareMedia dataset (http://www.informedia.cs.cmu.edu/caremedia/index.html), and Kinect Skeleton Action (KSA) are used as target data.

Another scenario of action transfer is cross-view action recognition. The most commonly used multi-view action dataset is the IXMAS dataset (http://4drepository.inrialpes.fr/public/viewgroup/6), where actions are captured from different views. Recently, some RGB-D based multi-view action datasets (Northwestern-UCLA Multiview Action3D (N-UCLA) (http://users.eecs.northwestern.edu/~jwa368/my_data.html), the ACT dataset (http://www.datatang.com/data/45062), and the Multi-View TJU Dataset (http://media.tju.edu.cn/mv_tju_dataset.html)) have also emerged following the release of Kinect sensors.

7.12 Video Event Detection

Video event detection or video concept detection aims to automatically classify video shots into certain semantic events (such as making a cake, a wedding ceremony, or changing a vehicle tire) or concepts (such as meeting, sports, and entertainment). We discuss the two tasks together because both extract semantic meaning from a video clip. The most commonly used data for evaluating video event detection are from the TREC Video Retrieval Evaluation (TRECVID) dataset series (http://www-nlpir.nist.gov/projects/trecvid/). Transfer learning has been used for cross-task video event detection in previous research. For example, the TRECVID 2005 dataset is used by Yang2007, where one of the 39 concepts is picked as the target concept and one of the 13 programs as the target program. Differently, TRECVID 2010 and TRECVID 2011 are used together by Ma2012 for cross-domain event detection, where the TRECVID 2011 semantic indexing task (SIN11) is used as the auxiliary source dataset.

8 Conclusion and Discussion

Learning from previous knowledge for current tasks is crucial for real world applications, taking full advantage of the big data available on the Internet. According to the properties of the previously available data and the current data in a real world scenario or application, completely different techniques should be carefully chosen for boosting the current task. This paper presents a comprehensive overview of recent advances by defining a taxonomy of scenarios and problems according to dataset characteristics. Though it is impossible for this survey to cover all related papers, the selected representative works reveal the recent advances as well as where the emphasis has been placed to date.

In general, most current methods adopt some transferring criteria to exploit previous related data for the target task. As mentioned in Section 2.3, different criteria can be used under different assumptions about the data. Some existing methods use one criterion and others use two or more. Carefully choosing criteria is crucial for better transfer performance. Future work is encouraged to refer to the defined criteria for transferring knowledge according to the properties of the available data. In addition, transferring using multiple criteria is also encouraged to guide the transfer process for better performance.

From the perspective of scenarios and problems, it can be seen from Figure 1 that most previous work focused on the first scenario (homogeneous feature spaces and label spaces), with seven sub-problems, while the last scenario (heterogeneous feature spaces and label spaces) is the least addressed, with only two sub-problems. Below, we discuss each sub-problem and identify some future directions based on the problems under the different scenarios.

Firstly, the supervised cases, i.e. supervised domain adaptation and supervised transfer learning, have been employed in all scenarios, where labelled data in the target dataset are available. This is because the supervised cases are considered the easiest to solve, with the guidance of target labels.

Secondly, the semi-supervised cases only appear in the first two scenarios, namely homogeneous feature spaces and label spaces, and heterogeneous feature spaces. Hence, a natural future direction is the development of semi-supervised methods for the remaining two scenarios, with the help of unlabelled target samples in addition to limited labelled target samples.

Thirdly, the unsupervised cases have been researched in the first three scenarios. This may be because the last scenario is more challenging than the other three. However, in real world applications, the available source data are generally presented in various forms and may focus on different tasks, and labelled target data are much harder to obtain than unlabelled data. Hence, future methods that transfer from unconstrained source data to unlabelled target data are encouraged.

Fourthly, the issue of data imbalance in the target dataset has been greatly neglected by previous research in all scenarios; only the homogeneous scenario has touched this issue in a few papers, which suggests another future research direction.

Fifthly, though the problem with sequential target data (whether labelled or unlabelled) has been touched on by a few works in three scenarios (all except the scenario of heterogeneous feature spaces), this problem has not received enough attention.

Sixthly, when the target data are not presented at the training stage, domain generalization and zero-shot learning have been proposed for the scenarios of homogeneous feature spaces and label spaces and of heterogeneous label spaces, where the feature spaces or data modalities between datasets are assumed to be identical. This may be because, if the source and target data are presented in different feature spaces or modalities, at least some target data are required for connecting the source and target data.

Lastly, a common assumption in most of the literature is that the source data are labelled. This may be because the source data are generally treated as auxiliary data for instructing the target task, and unlabelled data could be unrelated and lead to negative transfer. However, some research argues that unlabelled source data can still be a treasure, for example as a good starting point for parameter initialization for the target task, as mentioned in Section 5.5.1. Hence, future research is encouraged to exploit more of the potential of unlabelled source data.

References

  • Akata et al. (2016) Zeynep Akata, Mateusz Malinowski, Mario Fritz, and Bernt Schiele. 2016. Multi-cue zero-shot learning with strong supervision. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 59–68.
  • Akata et al. (2013) Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2013. Label-embedding for attribute-based classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 819–826.
  • Akata et al. (2015) Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. 2015. Evaluation of output embeddings for fine-grained image classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 2927–2936.
  • Aljundi et al. (2015) Rahaf Aljundi, Rémi Emonet, Damien Muselet, and Marc Sebban. 2015. Landmarks-based Kernelized Subspace Alignment for Unsupervised Domain Adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 56–63.
  • Aytar and Zisserman (2011) Yusuf Aytar and Andrew Zisserman. 2011. Tabula rasa: Model transfer for object category detection. In Proc. IEEE International Conference on Computer Vision. IEEE, 2252–2259.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. International Conference on Learning Representations.
  • Baktashmotlagh et al. (2013) Mahsa Baktashmotlagh, Mehrtash T Harandi, Brian C Lovell, and Mathieu Salzmann. 2013. Unsupervised domain adaptation by domain invariant projection. In Proc. IEEE International Conference on Computer Vision. IEEE, 769–776.
  • Baktashmotlagh et al. (2014) Mahsa Baktashmotlagh, Mehrtash T Harandi, Brian C Lovell, and Mathieu Salzmann. 2014. Domain adaptation on the statistical manifold. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2481–2488.
  • Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Machine learning 79, 1 (2010), 151–175.
  • Ben-David et al. (2007) Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. 2007. Analysis of Representations for Domain Adaptation. In Proc. Advances in Neural Information Processing Systems. 137–144.
  • Bengio (2012) Yoshua Bengio. 2012. Deep Learning of Representations for Unsupervised and Transfer Learning. Unsupervised and Transfer Learning Challenges in Machine Learning, Volume 7 (2012), 19.
  • Bian et al. (2012) Wei Bian, Dacheng Tao, and Yong Rui. 2012. Cross-Domain Human Action Recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42, 2 (April 2012), 298–307.
  • Bickel and Scheffer (2006) Steffen Bickel and Tobias Scheffer. 2006. Dirichlet-enhanced spam filtering based on biased samples. (2006), 161–168.
  • Bitarafan et al. (2016) Adeleh Bitarafan, Mahdieh Soleymani Baghshah, and Marzieh Gheisari. 2016. Incremental Evolving Domain Adaptation. IEEE Transactions on Knowledge and Data Engineering 28, 8 (Aug 2016), 2128–2141.
  • Blanchard et al. (2011) Gilles Blanchard, Gyemin Lee, and Clayton Scott. 2011. Generalizing from several related classification tasks to a new unlabeled sample. In Proc. Advances in Neural Information Processing Systems. 2178–2186.
  • Blitzer et al. (2007) John Blitzer, Mark Dredze, Fernando Pereira, and others. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, Vol. 7. 440–447.
  • Blitzer et al. (2006) John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proc. 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 120–128.
  • Bousmalis et al. (2017) Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
  • Bousmalis et al. (2016) Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. 2016. Domain separation networks. In Advances in Neural Information Processing Systems. 343–351.
  • Bucher et al. (2016) Maxime Bucher, Stéphane Herbin, and Frédéric Jurie. 2016. Improving Semantic Embedding Consistency by Metric Learning for Zero-Shot Classiffication. In Proc. European Conference on Computer Vision. Springer, 730–746.
  • Cao et al. (2010) Liangliang Cao, Zicheng Liu, and Thomas S Huang. 2010. Cross-dataset action detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1998–2005.
  • Caseiro et al. (2015) Rui Caseiro, João F Henriques, Pedro Martins, and Jorge Batista. 2015. Beyond the shortest path: Unsupervised Domain Adaptation by Sampling Subspaces along the Spline Flow. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3846–3854.
  • Chen et al. (2009b) Bo Chen, Wai Lam, Ivor Tsang, and Tak-Lam Wong. 2009b. Extracting discriminative concepts for domain adaptation in text mining. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 179–188.
  • Chen et al. (2014) Lin Chen, Wen Li, and Dong Xu. 2014. Recognizing RGB images by learning from RGB-D data. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1418–1425.
  • Chen et al. (2011) Minmin Chen, Kilian Q Weinberger, and John Blitzer. 2011. Co-training for domain adaptation. In Proc. Advances in Neural Information Processing Systems. 2456–2464.
  • Chen et al. (2012) Minmin Chen, Zhixiang Xu, Fei Sha, and Kilian Q Weinberger. 2012. Marginalized Denoising Autoencoders for Domain Adaptation. In Proc. International Conference on Machine Learning. 767–774.
  • Chen et al. (2009a) Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. 2009a. Sketch2photo: Internet image montage. ACM Transactions on Graphics 28, 5 (2009), 124.
  • Chen et al. (2015) Zhiyuan Chen, Nianzu Ma, and Bing Liu. 2015. Lifelong learning for sentiment classification. In Association for Computational Linguistics.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. Syntax, Semantics and Structure in Statistical Translation (2014), 103.
  • Courty et al. (2016) Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. 2016. Optimal transport for Domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2016).
  • Csurka et al. (2014) Gabriela Csurka, Boris Chidlovskii, and Florent Perronnin. 2014. Domain adaptation with a domain specific class means classifier. In Proc. European Conference on Computer Vision Workshops. Springer, 32–46.
  • Cui et al. (2014a) Zhen Cui, Hong Chang, Shiguang Shan, and Xilin Chen. 2014a. Generalized unsupervised manifold alignment. In Proc. Advances in Neural Information Processing Systems. 2429–2437.
  • Cui et al. (2014b) Zhen Cui, Wen Li, Dong Xu, Shiguang Shan, Xilin Chen, and Xuelong Li. 2014b. Flowing on Riemannian manifold: Domain adaptation by shifting covariance. IEEE Transactions on Cybernetics 44, 12 (2014), 2264–2273.
  • Dai et al. (2008) Wenyuan Dai, Yuqiang Chen, Gui-Rong Xue, Qiang Yang, and Yong Yu. 2008. Translated learning: Transfer learning across different feature spaces. In Proc. Advances in Neural Information Processing Systems. 353–360.
  • Dai et al. (2007a) Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. 2007a. Transferring naive bayes classifiers for text classification. In Proc. AAAI Conference on Artificial Intelligence, Vol. 22. AAAI Press, 540.
  • Dai et al. (2007b) Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. 2007b. Boosting for transfer learning. In Proc. International Conference on Machine Learning. ACM, 193–200.
  • Daume et al. (2010) Hal Daume, Abhishek Kumar, and Avishek Saha. 2010. Co-regularization based semi-supervised domain adaptation. In Proc. Advances in Neural Information Processing Systems. 478–486.
  • Daumé III (2007) Hal Daumé III. 2007. Frustratingly easy domain adaptation. In Proc. Annual Meeting of the Association of Computational Linguistics. 256–263.
  • Diethe et al. (2008) Tom Diethe, David R Hardoon, and John Shawe-Taylor. 2008. Multiview Fisher discriminant analysis. In NIPS Workshop on Learning from Multiple Sources.
  • Ding and Fu (2017) Zhengming Ding and Yun Fu. 2017. Robust Transfer Metric Learning for Image Classification. IEEE Transactions on Image Processing 26, 2 (2017), 660–670.
  • Ding et al. (2014) Zhengming Ding, Ming Shao, and Yun Fu. 2014. Latent low-rank transfer subspace learning for missing modality recognition. In Proc. AAAI Conference on Artificial Intelligence. 1192–1198.
  • Ding et al. (2015) Zhengming Ding, Ming Shao, and Yun Fu. 2015. Deep Low-Rank Coding for Transfer Learning. In Proc. International Joint Conference on Artificial Intelligence. 3453–3459.
  • Donahue et al. (2014) Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In Proc. International Conference on Machine Learning. 647–655.
  • Duan et al. (2012a) Lixin Duan, Ivor W Tsang, and Dong Xu. 2012a. Domain transfer multiple kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 3 (2012), 465–479.
  • Duan et al. (2012d) Lixin Duan, Dong Xu, IW-H Tsang, and Jiebo Luo. 2012d. Visual event recognition in videos by learning from web data. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 9 (2012), 1667–1680.
  • Duan et al. (2012b) Lixin Duan, Dong Xu, and Ivor W Tsang. 2012b. Domain adaptation from multiple sources: A domain-dependent regularization approach. IEEE Transactions on Neural Networks and Learning Systems 23, 3 (2012), 504–518.
  • Duan et al. (2012c) Lixin Duan, Dong Xu, and Ivor W Tsang. 2012c. Learning with Augmented Features for Heterogeneous Domain Adaptation. In Proc. International Conference on Machine Learning. 711–718.
  • Efros and Freeman (2001) Alexei A Efros and William T Freeman. 2001. Image quilting for texture synthesis and transfer. In Proc. Annual Conference on Computer Graphics and Interactive Techniques. ACM, 341–346.
  • Eigen and Fergus (2015) David Eigen and Rob Fergus. 2015. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. IEEE International Conference on Computer Vision. 2650–2658.
  • Fang et al. (2013) Chen Fang, Ye Xu, and Daniel Rockmore. 2013. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In Proc. IEEE International Conference on Computer Vision. IEEE, 1657–1664.
  • Farhadi et al. (2009) Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1778–1785.
  • Fei-Fei et al. (2006) Li Fei-Fei, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 4 (2006), 594–611.
  • Fernando et al. (2013) Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. 2013. Unsupervised visual domain adaptation using subspace alignment. In Proc. IEEE International Conference on Computer Vision. IEEE, 2960–2967.
  • Frome et al. (2013) Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, and others. 2013. Devise: A deep visual-semantic embedding model. In Proc. Advances in Neural Information Processing Systems. 2121–2129.
  • Fu et al. (2015) Yanwei Fu, Timothy M Hospedales, Tao Xiang, and Shaogang Gong. 2015. Transductive multi-view zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 11 (2015), 2332–2345.
  • Gama et al. (2014) João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. 2014. A survey on concept drift adaptation. Comput. Surveys 46, 4 (2014), 44.
  • Gan et al. (2016) Chuang Gan, Tianbao Yang, and Boqing Gong. 2016. Learning Attributes Equals Multi-Source Domain Generalization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 87–97.
  • Gan et al. (2014) Junying Gan, Lichen Li, Yikui Zhai, and Yinhua Liu. 2014. Deep self-taught learning for facial beauty prediction. Neurocomputing 144 (2014), 295–303.
  • Ganin and Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised Domain Adaptation by Backpropagation. In Proc. International Conference on Machine Learning. 1180–1189.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17, 59 (2016), 1–35.
  • Gatys et al. (2016) Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 2414–2423.
  • Geng et al. (2011) Bo Geng, Dacheng Tao, and Chao Xu. 2011. DAML: Domain adaptation metric learning. IEEE Transactions on Image Processing 20, 10 (2011), 2980–2989.
  • Ghifary et al. (2016a) Muhammad Ghifary, David Balduzzi, W Bastiaan Kleijn, and Mengjie Zhang. 2016a. Scatter Component Analysis: A Unified Framework for Domain Adaptation and Domain Generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence PP, 99 (2016), 1–1.
  • Ghifary et al. (2016b) Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. 2016b. Deep Reconstruction-Classification Networks for Unsupervised Domain Adaptation. In Proc. European Conference on Computer Vision. Springer, 597–613.
  • Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proc. International Conference on Machine Learning. 513–520.
  • Gong et al. (2013a) Boqing Gong, Kristen Grauman, and Fei Sha. 2013a. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In Proc. International Conference on Machine Learning. 222–230.
  • Gong et al. (2013b) Boqing Gong, Kristen Grauman, and Fei Sha. 2013b. Reshaping visual datasets for domain adaptation. In Proc. Advances in Neural Information Processing Systems. 1286–1294.
  • Gong et al. (2014) Boqing Gong, Kristen Grauman, and Fei Sha. 2014. Learning kernels for unsupervised domain adaptation with applications to visual object recognition. International Journal of Computer Vision 109, 1-2 (2014), 3–27.
  • Gong et al. (2012) Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. 2012. Geodesic flow kernel for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2066–2073.
  • Gong et al. (2016) Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. 2016. Domain adaptation with conditional transferable components. In Proc. International Conference on Machine Learning. 2839–2848.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proc. Advances in Neural Information Processing Systems. 2672–2680.
  • Gopalan et al. (2011) Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. 2011. Domain adaptation for object recognition: An unsupervised approach. In Proc. IEEE International Conference on Computer Vision. IEEE, 999–1006.
  • Gopalan et al. (2014) Raghavan Gopalan, Ruonan Li, and Rama Chellappa. 2014. Unsupervised adaptation across domain shifts by generating intermediate data representations. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 11 (2014), 2288–2302.
  • Griffin et al. (2007) Gregory Griffin, Alex Holub, and Pietro Perona. 2007. Caltech-256 object category dataset. Technical Report.
  • Gross et al. (2010) Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. 2010. Multi-pie. Image and Vision Computing 28, 5 (2010), 807–813.
  • Gupta et al. (2016) Saurabh Gupta, Judy Hoffman, and Jitendra Malik. 2016. Cross modal distillation for supervision transfer. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 2827–2836.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
  • Hertzmann et al. (2001) Aaron Hertzmann, Charles E Jacobs, Nuria Oliver, Brian Curless, and David H Salesin. 2001. Image analogies. In Proc. Annual Conference on Computer Graphics and Interactive Techniques. ACM, 327–340.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
  • Hoffman et al. (2014) Judy Hoffman, Trevor Darrell, and Kate Saenko. 2014. Continuous manifold based adaptation for evolving visual domains. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 867–874.
  • Hoffman et al. (2012) Judy Hoffman, Brian Kulis, Trevor Darrell, and Kate Saenko. 2012. Discovering latent domains for multisource domain adaptation. In Proc. European Conference on Computer Vision. Springer, 702–715.
  • Hoffman et al. (2013) Judy Hoffman, Erik Rodner, Trevor Darrell, Jeff Donahue, and Kate Saenko. 2013. Efficient learning of domain-invariant image representations. In Proc. International Conference on Learning Representations.
  • Hsu et al. (2015) Tzu Ming Harry Hsu, Wei Yu Chen, Cheng-An Hou, Yao-Hung Hubert Tsai, Yi-Ren Yeh, and Yu-Chiang Frank Wang. 2015. Unsupervised Domain Adaptation With Imbalanced Cross-Domain Data. In Proc. IEEE International Conference on Computer Vision. IEEE, 4121–4129.
  • Hu and Yang (2011) Derek Hao Hu and Qiang Yang. 2011. Transfer learning for activity recognition via sensor mapping. In Proc. International Joint Conference on Artificial Intelligence, Vol. 22. 1962–1967.
  • Hu et al. (2015) Junlin Hu, Jiwen Lu, and Yap-Peng Tan. 2015. Deep transfer metric learning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 325–333.
  • Huang and Wang (2013) De-An Huang and Yu-Chiang Frank Wang. 2013. Coupled dictionary and feature space learning with applications to cross-domain image synthesis and recognition. In Proc. IEEE International Conference on Computer Vision. IEEE, 2496–2503.
  • Huang et al. (2006) Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. 2006. Correcting sample selection bias by unlabeled data. In Proc. Advances in Neural Information Processing Systems. 601–608.
  • Hull (1994) Jonathan J. Hull. 1994. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 5 (1994), 550–554.
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
  • Jain and Learned-Miller (2011) Vidit Jain and Erik Learned-Miller. 2011. Online domain adaptation of a pre-trained cascade of classifiers. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 577–584.
  • Japkowicz and Stephen (2002) Nathalie Japkowicz and Shaju Stephen. 2002. The class imbalance problem: A systematic study. Intelligent Data Analysis 6, 5 (2002), 429–449.
  • Jayaraman and Grauman (2014) Dinesh Jayaraman and Kristen Grauman. 2014. Zero-shot recognition with unreliable attributes. In Proc. Advances in Neural Information Processing Systems. 3464–3472.
  • Jhuo et al. (2012) I-Hong Jhuo, Dong Liu, DT Lee, Shih-Fu Chang, and others. 2012. Robust visual domain adaptation with low-rank reconstruction. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2168–2175.
  • Jia et al. (2014) Chengcheng Jia, Yu Kong, Zhengming Ding, and Yun Raymond Fu. 2014. Latent tensor transfer learning for rgb-d action recognition. In Proc. ACM International Conference on Multimedia. ACM, 87–96.
  • Jiang et al. (2016) Wenhao Jiang, Hongchang Gao, Fu-lai Chung, and Heng Huang. 2016. The l2,1-Norm Stacked Robust Autoencoders for Domain Adaptation. In Proc. AAAI Conference on Artificial Intelligence.
  • Jiang et al. (2008) Wei Jiang, Eric Zavesky, Shih-Fu Chang, and Alex Loui. 2008. Cross-domain learning methods for high-level visual concept classification. In Proc. IEEE International Conference on Image Processing. IEEE, 161–164.
  • Jie et al. (2011) Luo Jie, Tatiana Tommasi, and Barbara Caputo. 2011. Multiclass transfer learning from unconstrained priors. In Proc. IEEE International Conference on Computer Vision. IEEE, 1863–1870.
  • Johnson et al. (2016) Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proc. European Conference on Computer Vision.
  • Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Continuous Translation Models. In Proc. Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  • Kan et al. (2015) Meina Kan, Shiguang Shan, and Xilin Chen. 2015. Bi-shifting Auto-Encoder for Unsupervised Domain Adaptation. In Proc. IEEE International Conference on Computer Vision. IEEE, 3846–3854.
  • Kan et al. (2012) Meina Kan, Shiguang Shan, Haihong Zhang, Shihong Lao, and Xilin Chen. 2012. Multi-view discriminant analysis. In Proc. European Conference on Computer Vision. Springer, 808–821.
  • Khosla et al. (2012) Aditya Khosla, Tinghui Zhou, Tomasz Malisiewicz, Alexei A Efros, and Antonio Torralba. 2012. Undoing the damage of dataset bias. In Proc. European Conference on Computer Vision. Springer, 158–171.
  • Klimt and Yang (2004) Bryan Klimt and Yiming Yang. 2004. The enron corpus: A new dataset for email classification research. In Proc. European Conference on Machine Learning. Springer, 217–226.
  • Kodirov et al. (2015) Elyor Kodirov, Tao Xiang, Zhenyong Fu, and Shaogang Gong. 2015. Unsupervised domain adaptation for zero-shot learning. In Proc. IEEE International Conference on Computer Vision. IEEE, 2452–2460.
  • Koehn et al. (2003) Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 48–54.
  • Koniusz et al. (2017) Piotr Koniusz, Yusuf Tas, and Fatih Porikli. 2017. Domain Adaptation by Mixture of Alignments of Second- or Higher-Order Scatter Tensors. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
  • Kuen et al. (2015) Jason Kuen, Kian Ming Lim, and Chin Poo Lee. 2015. Self-taught learning of a deep invariant representation for visual tracking via temporal slowness principle. Pattern Recognition 48, 10 (2015), 2964–2982.
  • Kulis et al. (2011) Brian Kulis, Kate Saenko, and Trevor Darrell. 2011. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1785–1792.
  • Kumagai (2016) Wataru Kumagai. 2016. Learning Bound for Parameter Transfer Learning. In Proc. Advances in Neural Information Processing Systems. 2721–2729.
  • Kumar et al. (2009) Neeraj Kumar, Alexander C Berg, Peter N Belhumeur, and Shree K Nayar. 2009. Attribute and simile classifiers for face verification. In Proc. IEEE International Conference on Computer Vision. IEEE, 365–372.
  • Kuzborskij and Orabona (2013) Ilja Kuzborskij and Francesco Orabona. 2013. Stability and Hypothesis Transfer Learning. In Proc. International Conference on Machine Learning. 942–950.
  • Laffont et al. (2014) Pierre-Yves Laffont, Zhile Ren, Xiaofeng Tao, Chao Qian, and James Hays. 2014. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics 33, 4 (2014), 149.
  • Lake et al. (2011) Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. 2011. One shot learning of simple visual concepts. In Proc. Cognitive Science Society, Vol. 33.
  • Lampert et al. (2009) Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. 2009. Learning to detect unseen object classes by between-class attribute transfer. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 951–958.
  • Lang (1995) Ken Lang. 1995. Newsweeder: Learning to filter netnews. In Proc. International Conference on Machine Learning. 331–339.
  • Le et al. (2011) Quoc V Le, Will Y Zou, Serena Y Yeung, and Andrew Y Ng. 2011. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3361–3368.
  • Lecun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (Nov 1998), 2278–2324.
  • Lei Ba et al. (2015) Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, and others. 2015. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proc. IEEE International Conference on Computer Vision. 4247–4255.
  • Li and Wand (2016) Chuan Li and Michael Wand. 2016. Precomputed real-time texture synthesis with markovian generative adversarial networks. In Proc. European Conference on Computer Vision.
  • Li et al. (2014) Wen Li, Lixin Duan, Dong Xu, and Ivor W Tsang. 2014. Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 6 (2014), 1134–1148.
  • Li et al. (2015) Xin Li, Yuhong Guo, and Dale Schuurmans. 2015. Semi-supervised zero-shot classification with label representation learning. In Proc. IEEE International Conference on Computer Vision. 4211–4219.
  • Lin et al. (2016) Yuewei Lin, Jing Chen, Yu Cao, Youjie Zhou, Lingfeng Zhang, Yuan Yan Tang, and Song Wang. 2016. Cross-Domain Recognition by Identifying Joint Subspaces of Source Domain and Target Domain. IEEE Transactions on Cybernetics PP, 99 (2016), 1–12.
  • Liu et al. (2011) Jingen Liu, Benjamin Kuipers, and Silvio Savarese. 2011. Recognizing human actions by attributes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3337–3344.
  • Liu and Tuzel (2016) Ming-Yu Liu and Oncel Tuzel. 2016. Coupled generative adversarial networks. In Proc. Advances in Neural Information Processing Systems. 469–477.
  • Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
  • Long et al. (2013) Mingsheng Long, Guiguang Ding, Jianmin Wang, Jiaguang Sun, Yuchen Guo, and Philip S Yu. 2013. Transfer sparse coding for robust image representation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 407–414.
  • Long and Wang (2015) Mingsheng Long and Jianmin Wang. 2015. Learning Transferable Features with Deep Adaptation Networks. In Proc. International Conference on Machine Learning. 97–105.
  • Long et al. (2014) Mingsheng Long, Jianmin Wang, Guiguang Ding, Sinno Jialin Pan, and Philip S Yu. 2014. Adaptation regularization: A general framework for transfer learning. IEEE Transactions on Knowledge and Data Engineering 26, 5 (2014), 1076–1089.
  • Long et al. (2013) Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. 2013. Transfer feature learning with joint distribution adaptation. In Proc. IEEE International Conference on Computer Vision. IEEE, 2200–2207.
  • Long et al. (2014) Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. 2014. Transfer Joint Matching for Unsupervised Domain Adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1410–1417.
  • Long et al. (2017) Mingsheng Long, Jianmin Wang, and Michael I Jordan. 2017. Deep transfer learning with joint adaptation networks. In Proc. International Conference on Machine Learning.
  • Long et al. (2016) Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. 2016. Unsupervised domain adaptation with residual transfer networks. In Proc. Advances in Neural Information Processing Systems. 136–144.
  • Ma et al. (2012) Zhigang Ma, Yi Yang, Yang Cai, Nicu Sebe, and Alexander G Hauptmann. 2012. Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In Proc. ACM International Conference on Multimedia. ACM, 469–478.
  • Ma et al. (2014) Zhigang Ma, Yi Yang, Feiping Nie, Nicu Sebe, Shuicheng Yan, and Alexander G Hauptmann. 2014. Harnessing lab knowledge for real-world action recognition. International Journal of Computer Vision 109, 1-2 (2014), 60–73.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proc. Advances in Neural Information Processing Systems. 3111–3119.
  • Muandet et al. (2013) Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. 2013. Domain Generalization via Invariant Feature Representation. In Proc. International Conference on Machine Learning. 10–18.
  • Nater et al. (2011) Fabian Nater, Tatiana Tommasi, Helmut Grabner, Luc Van Gool, and Barbara Caputo. 2011. Transferring activities: Updating human behavior analysis. In Proc. IEEE International Conference on Computer Vision Workshops. IEEE, 1737–1744.
  • Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. 2011. Reading digits in natural images with unsupervised feature learning. In Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
  • Ni et al. (2013) Jie Ni, Qiang Qiu, and Rama Chellappa. 2013. Subspace interpolation via dictionary learning for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 692–699.
  • Niu et al. (2015a) Li Niu, Wen Li, and Dong Xu. 2015a. Multi-view domain generalization for visual recognition. In Proc. IEEE International Conference on Computer Vision. 4193–4201.
  • Niu et al. (2015b) Li Niu, Wen Li, and Dong Xu. 2015b. Visual Recognition by Learning from Web Data: A Weakly Supervised Domain Generalization Approach. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2774–2783.
  • Palatucci et al. (2009) Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. 2009. Zero-shot learning with semantic output codes. In Proc. Advances in Neural Information Processing Systems. 1410–1418.
  • Pan et al. (2009) Sinno Jialin Pan, Ivor W Tsang, James Tin Yau Kwok, and Qiang Yang. 2009. Domain Adaptation via Transfer Component Analysis. In Proc. International Joint Conference on Artificial Intelligence. 1187.
  • Pan et al. (2008) Sinno Jialin Pan, James T Kwok, and Qiang Yang. 2008. Transfer Learning via Dimensionality Reduction. In Proc. AAAI Conference on Artificial Intelligence, Vol. 8. 677–682.
  • Pan et al. (2011) Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. 2011. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22, 2 (2011), 199–210.
  • Pan and Yang (2010) Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
  • Parikh and Grauman (2011) Devi Parikh and Kristen Grauman. 2011. Relative attributes. In Proc. IEEE International Conference on Computer Vision. IEEE, 503–510.
  • Patel et al. (2015) Vishal M Patel, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. 2015. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine 32, 3 (2015), 53–69.
  • Patricia and Caputo (2014) Novi Patricia and Barbara Caputo. 2014. Learning to learn, from transfer learning to domain adaptation: A unifying perspective. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1442–1449.
  • Peng et al. (2016) Peixi Peng, Tao Xiang, Yaowei Wang, Massimiliano Pontil, Shaogang Gong, Tiejun Huang, and Yonghong Tian. 2016. Unsupervised Cross-Dataset Transfer Learning for Person Re-identification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
  • Perkins et al. (1992) David N Perkins, Gavriel Salomon, and others. 1992. Transfer of learning. International Encyclopedia of Education 2 (1992), 6452–6457.
  • Perrot and Habrard (2015) Michaël Perrot and Amaury Habrard. 2015. A Theoretical Analysis of Metric Hypothesis Transfer Learning. In Proc. International Conference on Machine Learning. 1708–1717.
  • Prettenhofer and Stein (2010) Peter Prettenhofer and Benno Stein. 2010. Cross-language text classification using structural correspondence learning. In Proc. Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1118–1127.
  • Qi et al. (2011a) Guo-Jun Qi, Charu Aggarwal, and Thomas Huang. 2011a. Towards semantic knowledge propagation from text corpus to web images. In Proc. International Conference on World Wide Web. ACM, 297–306.
  • Qi et al. (2011b) Guo-Jun Qi, Charu Aggarwal, Yong Rui, Qi Tian, Shiyu Chang, and Thomas Huang. 2011b. Towards cross-category knowledge propagation for learning visual concepts. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 897–904.
  • Qiao et al. (2016) Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. 2016. Less is more: zero-shot learning from online textual documents with noise suppression. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 2249–2257.
  • Quanz and Huan (2009) Brian Quanz and Jun Huan. 2009. Large margin transductive transfer learning. In Proc. ACM conference on Information and knowledge management. ACM, 1327–1336.
  • Quanz et al. (2012) Brian Quanz, Jun Huan, and Meenakshi Mishra. 2012. Knowledge transfer with low-quality data: A feature extraction issue. IEEE Transactions on Knowledge and Data Engineering 24, 10 (2012), 1789–1802.
  • Raina et al. (2007) Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. 2007. Self-taught learning: transfer learning from unlabeled data. In Proc. International Conference on Machine Learning. ACM, 759–766.
  • Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. 2017. Optimization as a Model for Few-Shot Learning. In Proc. International Conference on Learning Representations.
  • Reed et al. (2016) Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. 2016. Learning deep representations of fine-grained visual descriptions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 49–58.
  • Romera-Paredes and Torr (2015) Bernardino Romera-Paredes and Philip Torr. 2015. An embarrassingly simple approach to zero-shot learning. In Proc. International Conference on Machine Learning. 2152–2161.
  • Saenko et al. (2010) Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. 2010. Adapting visual category models to new domains. In Proc. European Conference on Computer Vision. Springer, 213–226.
  • Saito et al. (2017) Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. 2017. Asymmetric Tri-training for Unsupervised Domain Adaptation. In Proc. International Conference on Machine Learning.
  • Sankaranarayanan et al. (2017) Swami Sankaranarayanan, Yogesh Balaji, Carlos D Castillo, and Rama Chellappa. 2017. Generate To Adapt: Aligning Domains using Generative Adversarial Networks. arXiv preprint arXiv:1704.01705 (2017).
  • Shao et al. (2015) Ling Shao, Fan Zhu, and Xuelong Li. 2015. Transfer learning for visual categorization: a survey. IEEE Transactions on Neural Networks and Learning Systems 26, 5 (2015), 1019–1034.
  • Shao et al. (2012) Ming Shao, Carlos Castillo, Zhenghong Gu, and Yun Fu. 2012. Low-rank transfer subspace learning. In Proc. IEEE International Conference on Data Mining. IEEE, 1104–1109.
  • Shao et al. (2014) Ming Shao, Dmitry Kit, and Yun Fu. 2014. Generalized transfer subspace learning through low-rank constraint. International Journal of Computer Vision 109, 1-2 (2014), 74–93.
  • Shekhar et al. (2013) Sumit Shekhar, Vishal M Patel, Hien V Nguyen, and Rama Chellappa. 2013. Generalized domain-adaptive dictionaries. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 361–368.
  • Shekhar et al. (2015) Sumit Shekhar, Vishal M Patel, Hien Van Nguyen, and Rama Chellappa. 2015. Coupled projections for adaptation of dictionaries. IEEE Transactions on Image Processing 24, 10 (2015), 2941–2954.
  • Shi et al. (2010) Xiaoxiao Shi, Qi Liu, Wei Fan, Philip S Yu, and Ruixin Zhu. 2010. Transfer learning on heterogeneous feature spaces via spectral transformation. In Proc. IEEE International Conference on Data Mining. IEEE, 1049–1054.
  • Shi and Sha (2012) Yuan Shi and Fei Sha. 2012. Information-Theoretical Learning of Discriminative Clusters for Unsupervised Domain Adaptation. In Proc. International Conference on Machine Learning. 1079–1086.
  • Shih et al. (2013) Yichang Shih, Sylvain Paris, Frédo Durand, and William T Freeman. 2013. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics 32, 6 (2013), 200.
  • Si et al. (2010) Si Si, Dacheng Tao, and Bo Geng. 2010. Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering 22, 7 (2010), 929–942.
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard S Zemel. 2017. Prototypical Networks for Few-shot Learning. In Proc. International Conference on Learning Representations.
  • Socher et al. (2013) Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In Proc. Advances in Neural Information Processing Systems. 935–943.
  • Stamos et al. (2015) Dimitris Stamos, Samuele Martelli, Moin Nabi, Andrew McDonald, Vittorio Murino, and Massimiliano Pontil. 2015. Learning with dataset bias in latent subcategory models. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3650–3658.
  • Sugiyama et al. (2008) Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. 2008. Direct importance estimation with model selection and its application to covariate shift adaptation. In Proc. Advances in Neural Information Processing Systems. 1433–1440.
  • Sukhija et al. (2016) Sanatan Sukhija, Narayanan C Krishnan, and Gurkanwal Singh. 2016. Supervised Heterogeneous Domain Adaptation via Random Forests. In Proc. International Joint Conference on Artificial Intelligence. AAAI Press.
  • Sun et al. (2016) Baochen Sun, Jiashi Feng, and Kate Saenko. 2016. Return of Frustratingly Easy Domain Adaptation. In Proc. AAAI Conference on Artificial Intelligence.
  • Sun and Saenko (2015) Baochen Sun and Kate Saenko. 2015. Subspace distribution alignment for unsupervised domain adaptation. In Proc. British Machine Vision Conference.
  • Sun and Saenko (2016) Baochen Sun and Kate Saenko. 2016. Deep CORAL: Correlation alignment for deep domain adaptation. In Proc. Transferring and Adapting Source Knowledge in Computer Vision (TASK-CV) Workshop, in conjunction with ECCV.
  • Sun et al. (2011) Qian Sun, Rita Chattopadhyay, Sethuraman Panchanathan, and Jieping Ye. 2011. A two-stage weighting framework for multi-source domain adaptation. In Proc. Advances in Neural Information Processing Systems. 505–513.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems. 3104–3112.
  • Tan et al. (2009) Songbo Tan, Xueqi Cheng, Yuefen Wang, and Hongbo Xu. 2009. Adapting naive Bayes to domain adaptation for sentiment analysis. In Proc. European Conference on Information Retrieval. Springer, 337–349.
  • Tang et al. (2012) Kevin Tang, Vignesh Ramanathan, Li Fei-Fei, and Daphne Koller. 2012. Shifting weights: Adapting object detectors from image to video. In Proc. Advances in Neural Information Processing Systems. 638–646.
  • Tommasi and Caputo (2013) Tatiana Tommasi and Barbara Caputo. 2013. Frustratingly easy NBNN domain adaptation. In Proc. IEEE International Conference on Computer Vision. IEEE, 897–904.
  • Tommasi et al. (2010) Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. 2010. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3081–3088.
  • Tommasi et al. (2012) Tatiana Tommasi, Francesco Orabona, Mohsen Kaboli, and Barbara Caputo. 2012. Leveraging over prior knowledge for online learning of visual categories. In Proc. British Machine Vision Conference.
  • Tommasi and Tuytelaars (2014) Tatiana Tommasi and Tinne Tuytelaars. 2014. A testbed for cross-dataset analysis. In Proc. European Conference on Computer Vision. Springer, 18–31.
  • Torralba and Efros (2011) Antonio Torralba and Alexei Efros. 2011. Unbiased look at dataset bias. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1521–1528.
  • Tsai et al. (2016a) Yao-Hung Hubert Tsai, Cheng-An Hou, Wei-Yu Chen, Yi-Ren Yeh, and Yu-Chiang Frank Wang. 2016a. Domain-Constraint Transfer Coding for Imbalanced Unsupervised Domain Adaptation. In Proc. AAAI Conference on Artificial Intelligence.
  • Tsai et al. (2016b) Yao-Hung Hubert Tsai, Yi-Ren Yeh, and Yu-Chiang Frank Wang. 2016b. Learning Cross-Domain Landmarks for Heterogeneous Domain Adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 5081–5090.
  • Tzeng et al. (2015) Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. 2015. Simultaneous deep transfer across domains and tasks. In Proc. IEEE International Conference on Computer Vision. IEEE, 4068–4076.
  • Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
  • Tzeng et al. (2014) Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474 (2014).
  • Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11, Dec (2010), 3371–3408.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, and others. 2016. Matching networks for one shot learning. In Proc. Advances in Neural Information Processing Systems. 3630–3638.
  • Wang and Mahadevan (2011) Chang Wang and Sridhar Mahadevan. 2011. Heterogeneous domain adaptation using manifold alignment. In Proc. International Joint Conference on Artificial Intelligence. 1541–1546.
  • Wang et al. (2016) Donghui Wang, Yanan Li, Yuetan Lin, and Yueting Zhuang. 2016. Relational knowledge transfer for zero-shot learning. In Proc. AAAI Conference on Artificial Intelligence. AAAI Press, 2145–2151.
  • Wang et al. (2012) Shenlong Wang, Lei Zhang, Yan Liang, and Quan Pan. 2012. Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2216–2223.
  • Wang and Gupta (2016) Xiaolong Wang and Abhinav Gupta. 2016. Generative image modeling using style and structure adversarial networks. In Proc. European Conference on Computer Vision.
  • Wang and Ji (2013) Xiaoyang Wang and Qiang Ji. 2013. A unified probabilistic approach modeling relationships between attributes and objects. In Proc. IEEE International Conference on Computer Vision. 2120–2127.
  • Wei et al. (2016) Pengfei Wei, Yiping Ke, and Chi Keong Goh. 2016. Deep Nonlinear Feature Coding for Unsupervised Domain Adaptation. In Proc. International Joint Conferences on Artificial Intelligence.
  • Woodworth and Thorndike (1901) RS Woodworth and EL Thorndike. 1901. The influence of improvement in one mental function upon the efficiency of other functions (I). Psychological Review 8, 3 (1901), 247.
  • Wu and Ji (2016) Yue Wu and Qiang Ji. 2016. Constrained Deep Transfer Feature Learning and Its Applications. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE.
  • Xian et al. (2016) Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. 2016. Latent embeddings for zero-shot classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 69–77.
  • Xiao and Guo (2015) Min Xiao and Yuhong Guo. 2015. Feature space independent semi-supervised domain adaptation via kernel matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 1 (2015), 54–66.
  • Xie and Tu (2015) Saining Xie and Zhuowen Tu. 2015. Holistically-nested edge detection. In Proc. IEEE International Conference on Computer Vision. 1395–1403.
  • Xu et al. (2015) Hongyu Xu, Jingjing Zheng, and Rama Chellappa. 2015. Bridging the Domain Shift by Domain Adaptive Dictionary Learning. In Proc. British Machine Vision Conference.
  • Xu et al. (2014b) Jiaolong Xu, Sebastian Ramos, David Vazquez, and Antonio M Lopez. 2014b. Domain adaptation of deformable part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 12 (2014), 2367–2380.
  • Xu et al. (2014c) Jiaolong Xu, Sebastian Ramos, David Vázquez, and Antonio M López. 2014c. Incremental Domain Adaptation of Deformable Part-based Models.. In Proc. British Machine Vision Conference.
  • Xu et al. (2016) Jiaolong Xu, David Vázquez, Krystian Mikolajczyk, and Antonio M López. 2016. Hierarchical online domain adaptation of deformable part-based models. In Proc. IEEE International Conference on Robotics and Automation. 5536–5541.
  • Xu et al. (2014a) Zheng Xu, Wen Li, Li Niu, and Dong Xu. 2014a. Exploiting low-rank structure from latent domains for domain generalization. In Proc. European Conference on Computer Vision. Springer, 628–643.
  • Yamada et al. (2014) Makoto Yamada, Leonid Sigal, and Yi Chang. 2014. Domain Adaptation for Structured Regression. International Journal of Computer Vision 109, 1-2 (2014), 126–145.
  • Yan et al. (2017) Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang, Yong Xu, and Wangmeng Zuo. 2017. Mind the Class Weight Bias: Weighted Maximum Mean Discrepancy for Unsupervised Domain Adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
  • Yang et al. (2007) Jun Yang, Rong Yan, and Alexander G Hauptmann. 2007. Cross-domain video concept detection using adaptive svms. In Proc. ACM International Conference on Multimedia. ACM, 188–197.
  • Yang et al. (2016) Liu Yang, Liping Jing, Jian Yu, and Michael K Ng. 2016. Learning transferred weights from co-occurrence data for heterogeneous transfer learning. IEEE Transactions on Neural Networks and Learning Systems 27, 11 (2016), 2187–2200.
  • Yang et al. (2008) Qiang Yang, Sinno Jialin Pan, and Vincent Wenchen Zheng. 2008. Estimating Location Using Wi-Fi. IEEE Intelligent Systems 23, 1 (2008), 8–13.
  • Yang et al. (2009a) Weilong Yang, Yang Wang, and Greg Mori. 2009a. Efficient human action detection using a transferable distance function. In Proc. Asian Conference on Computer Vision. Springer, 417–426.
  • Yang et al. (2009b) Weilong Yang, Yang Wang, and Greg Mori. 2009b. Human action recognition from a single clip per action. In Proc. IEEE International Conference on Computer Vision Workshops. IEEE, 482–489.
  • Yao et al. (2015) Ting Yao, Yingwei Pan, Chong-Wah Ngo, Houqiang Li, and Tao Mei. 2015. Semi-supervised Domain Adaptation with Subspace Learning for Visual Recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2142–2150.
  • Yosinski et al. (2014) Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks?. In Proc. Advances in Neural Information Processing Systems. 3320–3328.
  • Zellinger et al. (2017) Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. 2017. Central moment discrepancy (CMD) for domain-invariant representation learning. In Proc. International Conference on Learning Representations.
  • Zhai et al. (2010) Deming Zhai, Bo Li, Hong Chang, Shiguang Shan, Xilin Chen, and Wen Gao. 2010. Manifold Alignment via Corresponding Projections. In Proc. British Machine Vision Conference. BMVA Press, 3.1–3.11. doi:10.5244/C.24.3.
  • Zhang et al. (2017) Jing Zhang, Wanqing Li, and Philip Ogunbona. 2017. Joint Geometrical and Statistical Alignment for Visual Domain Adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
  • Zhang et al. (2013a) Kun Zhang, Krikamol Muandet, Zhikun Wang, and others. 2013a. Domain adaptation under target and conditional shift. In Proc. International Conference on Machine Learning. 819–827.
  • Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. 2016. Colorful image colorization. In Proc. European Conference on Computer Vision.
  • Zhang and Yeung (2010) Yu Zhang and Dit-Yan Yeung. 2010. Transfer metric learning by learning task relationships. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1199–1208.
  • Zhang and Saligrama (2015) Ziming Zhang and Venkatesh Saligrama. 2015. Zero-shot learning via semantic similarity embedding. In Proc. IEEE International Conference on Computer Vision. IEEE, 4166–4174.
  • Zhang and Saligrama (2016a) Ziming Zhang and Venkatesh Saligrama. 2016a. Zero-shot learning via joint latent similarity embedding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 6034–6042.
  • Zhang and Saligrama (2016b) Ziming Zhang and Venkatesh Saligrama. 2016b. Zero-shot recognition via structured prediction. In Proc. European Conference on Computer Vision. Springer, 533–548.
  • Zhang et al. (2013b) Zhong Zhang, Chunheng Wang, Baihua Xiao, Wen Zhou, Shuang Liu, and Cunzhao Shi. 2013b. Cross-view action recognition via a continuous virtual path. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2690–2697.
  • Zhao and Hoi (2010) Peilin Zhao and Steven C Hoi. 2010. OTL: A framework of online transfer learning. In Proc. International Conference on Machine Learning. 1231–1238.
  • Zheng et al. (2012) Jingjing Zheng, Zhuolin Jiang, P Jonathon Phillips, and Rama Chellappa. 2012. Cross-View Action Recognition via a Transferable Dictionary Pair. In Proc. British Machine Vision Conference, Vol. 1. 1–11.
  • Zheng et al. (2009) Vincent Wenchen Zheng, Derek Hao Hu, and Qiang Yang. 2009. Cross-domain activity recognition. In Proc. International Conference on Ubiquitous Computing. ACM, 61–70.
  • Zhong et al. (2009) Erheng Zhong, Wei Fan, Jing Peng, Kun Zhang, Jiangtao Ren, Deepak Turaga, and Olivier Verscheure. 2009. Cross domain distribution adaptation via kernel mapping. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1027–1036.
  • Zhou et al. (2014) Joey Tianyi Zhou, Ivor W Tsang, Sinno Jialin Pan, and Mingkui Tan. 2014. Heterogeneous Domain Adaptation for Multiple Classes. In Proc. International Conference on Artificial Intelligence and Statistics. 1095–1103.
  • Zhu and Shao (2013) Fan Zhu and Ling Shao. 2013. Enhancing action recognition by cross-domain dictionary learning. In Proc. British Machine Vision Conference. BMVA Press.
  • Zhu and Shao (2014) Fan Zhu and Ling Shao. 2014. Weakly-supervised cross-domain dictionary learning for visual recognition. International Journal of Computer Vision 109, 1-2 (2014), 42–59.
  • Zhu et al. (2014) Y. Zhu, W. Chen, and G. Guo. 2014. Evaluating spatiotemporal interest point features for depth-based action recognition. Image and Vision Computing 32, 8 (2014), 453–464.
  • Zhu et al. (2011) Yin Zhu, Yuqiang Chen, Zhongqi Lu, Sinno Jialin Pan, Guirong Xue, Yong Yu, and Qiang Yang. 2011. Heterogeneous transfer learning for image classification. In Proc. AAAI Conference on Artificial Intelligence.