1 Introduction
Recent years have seen dramatic advances in object recognition by deep learning algorithms [23, 11, 32]. Much of the increased performance derives from applying large networks to massive labeled datasets such as PASCAL VOC [14] and ImageNet [22]. Unfortunately, dataset bias, which can include factors such as backgrounds, camera viewpoints and illumination, often causes algorithms to generalize poorly across datasets [35] and significantly limits their usefulness in practical applications. Developing algorithms that are invariant to dataset bias is therefore a compelling problem.
Problem definition.
In object recognition, the “visual world” can be considered as decomposing into views (e.g. perspectives or lighting conditions) corresponding to domains. For example, frontal-views and rotated-views correspond to two different domains. Alternatively, we can associate views or domains with standard image datasets such as PASCAL VOC2007 [14] and Office [31].
The problem of learning from multiple source domains and testing on unseen target domains is referred to as domain generalization [6, 26]. A domain is a probability distribution from which samples are drawn. Source domains provide training samples, whereas distinct target domains are used for testing. In the standard supervised learning framework, it is assumed that the source and target domains coincide. Dataset bias becomes a significant problem when training and test domains differ: applying a classifier trained on one dataset to images sampled from another typically results in poor performance [35, 18]. The goal of this paper is to learn features that improve generalization performance across domains.
Contribution.
The challenge is to build a system that recognizes objects in previously unseen datasets, given one or multiple training datasets. We introduce the Multi-task Autoencoder (MTAE), a feature learning algorithm that uses a multi-task strategy [8, 34] to learn unbiased object features, where each task is a data reconstruction.
Autoencoders were introduced to address the problem of ‘backpropagation without a teacher’ by using inputs as labels and learning to reconstruct them with minimal distortion [28, 5]. Denoising autoencoders in particular are a powerful basic circuit for unsupervised representation learning [36]. Intuitively, corrupting inputs forces autoencoders to learn representations that are robust to noise.
This paper proposes a broader view: that autoencoders are generic circuits for learning invariant features. The main contribution is a new training strategy based on naturally occurring transformations such as rotations in viewing angle, dilations in apparent object size, and shifts in lighting conditions. The resulting Multi-task Autoencoder learns features that are robust to real-world image variability, and therefore generalize well across domains. Extensive experiments show that MTAE with a denoising criterion outperforms the prior state-of-the-art in domain generalization over various cross-dataset recognition tasks.
2 Related work
Domain generalization has recently attracted attention in classification tasks, including automatic gating of flow cytometry data [6, 26] and object recognition [16, 21, 38]. Khosla et al. [21] proposed a multi-task max-margin classifier, which we refer to as UndoBias, that explicitly encodes dataset-specific biases in feature space. These biases are used to push the dataset-specific weights to be similar to the global weights. Fang et al. [16] developed Unbiased Metric Learning (UML), based on a learning-to-rank framework. Validated on weakly-labeled web images, UML produces a less biased distance metric that provides good object recognition performance. More recently, Xu et al. [38] extended an exemplar-SVM to domain generalization by adding a nuclear norm-based regularizer that captures the likelihoods of all positive samples. The proposed model is denoted by LRESVM.
Other works in object recognition address a similar problem, in the sense of having unknown targets, where the unseen dataset contains noisy images that are not in the training set [17, 33]. However, these methods were designed to be noise-specific and may suffer from dataset bias when observing objects with different types of noise.
A closely related task to domain generalization is domain adaptation, where unlabeled samples from the target dataset are available during training. Many domain adaptation algorithms have been proposed for object recognition (see, e.g., [2, 31]). Domain adaptation algorithms are not readily applicable to domain generalization, since no information is available about the target domain.
Our proposed algorithm is based on the feature learning approach. Feature learning has been of great interest in the machine learning community since the emergence of deep learning (see [4] and references therein). Some feature learning methods have been successfully applied to domain adaptation or transfer learning applications [9, 13]. To the best of our knowledge, there is no prior work along these lines on the more difficult problem of domain generalization, i.e., creating useful representations without observing the target domain.
3 The Proposed Approach
Our goal is to learn features that provide good domain generalization. To do so, we extend the autoencoder [7] into a model that jointly learns multiple data-reconstruction tasks taken from related domains. Our strategy is motivated by prior work demonstrating that learning from multiple related tasks can improve performance on a novel, yet related, task, relative to methods trained on a single task [1, 3, 8, 34].
3.1 Autoencoders
Autoencoders (AE) have become established as a pretraining model for deep learning [5]. Autoencoder training consists of two stages: 1) encoding and 2) decoding. Given an unlabeled input $x \in \mathbb{R}^{d_x}$, a single-hidden-layer autoencoder $f_\Theta(x)$ can be formulated as
\[ h = \sigma_{\mathrm{enc}}(W^\top x), \qquad \hat{x} = \sigma_{\mathrm{dec}}(V^\top h) = f_\Theta(x), \tag{1} \]
where $W \in \mathbb{R}^{d_x \times d_h}$ and $V \in \mathbb{R}^{d_h \times d_x}$ are the input-to-hidden and hidden-to-output connection weights, respectively (the bias terms are incorporated in our experiments but are intentionally omitted from the equations for the sake of simplicity), $h \in \mathbb{R}^{d_h}$ is the hidden node vector, and $\sigma_{\mathrm{enc}}(\cdot)$ and $\sigma_{\mathrm{dec}}(\cdot)$ are element-wise nonlinear activation functions, which are not necessarily identical. Popular choices for the activation function are, e.g., the sigmoid $s(a) = (1 + e^{-a})^{-1}$ and the rectified linear unit (ReLU) $s(a) = \max(0, a)$.
Let $\Theta = \{W, V\}$ be the autoencoder parameters and $\{x_i\}_{i=1}^{N}$ be a set of $N$ input data. Learning corresponds to minimizing the following objective
\[ \hat{\Theta} := \arg\min_{\Theta} \sum_{i=1}^{N} L\big(f_\Theta(x_i), x_i\big) + \eta R(\Theta), \tag{2} \]
where $L(\cdot,\cdot)$ is the loss function, usually a least-squares or cross-entropy loss, and $R(\cdot)$ is a regularization term used to avoid overfitting. The objective (2) can be optimized by the backpropagation algorithm [29]. If we apply autoencoders to raw pixels of visual object images, the weights usually form visually meaningful “filters” that can be interpreted qualitatively.
To create a discriminative model using the learnt autoencoder model, either of the following options can be considered: 1) the feature map $\sigma_{\mathrm{enc}}(\hat{W}^\top x)$ is extracted and used as an input to supervised learning algorithms while keeping the weight matrix $\hat{W}$ fixed; 2) the learnt weight matrix $\hat{W}$ is used to initialize a neural network model and is updated during the supervised neural network training (fine-tuning).
Recently, several variants such as denoising autoencoders (DAE) [37] and contractive autoencoders (CAE) [27] have been proposed to extract features that are more robust to small changes of the input. In DAEs, the objective is to reconstruct a clean input given its corrupted counterpart. Commonly used types of corruption are zero-masking, Gaussian, and salt-and-pepper noise. Features extracted by DAE have been shown to be more discriminative than those extracted by AE [37].
3.2 Multi-task Autoencoders
We refer to our proposed domain generalization algorithm as the Multi-task Autoencoder (MTAE). From an architectural viewpoint, MTAE is an autoencoder with multiple output layers, see Fig. 1. The input-hidden weights represent shared parameters and the hidden-output weights represent domain-specific parameters. The architecture is similar to the supervised multi-task neural networks proposed by Caruana [8]. The main difference is that the output layers of MTAE correspond to different domains instead of different class labels.
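As an illustration of this architecture, the following minimal NumPy sketch computes one shared hidden code and one reconstruction per domain. It is not the authors' implementation: the function name `mtae_forward`, the toy layer sizes, the sigmoid activations on both layers, and the omission of bias terms are all assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mtae_forward(x, W, V_list):
    """Return the hidden code and one reconstruction per domain.

    x:      (d_x,) input vector
    W:      (d_x, d_h) shared input-hidden weights
    V_list: list of (d_h, d_x) hidden-output weights, one per source domain
    """
    h = sigmoid(W.T @ x)                          # shared hidden representation
    recons = [sigmoid(V.T @ h) for V in V_list]   # one output layer per domain
    return h, recons

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_x, d_h, n_domains = 8, 4, 3
W = 0.1 * rng.standard_normal((d_x, d_h))
V_list = [0.1 * rng.standard_normal((d_h, d_x)) for _ in range(n_domains)]
h, recons = mtae_forward(rng.random(d_x), W, V_list)
```

Only `W` is reused after training; the per-domain decoders in `V_list` exist to shape the shared code.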
The most important component of MTAE is the training strategy, which constructs a generalized denoising autoencoder that learns invariances to naturally occurring transformations. Denoising autoencoders focus on the special case where the transformation is simply noise. In contrast, MTAE training treats a specific perspective on an object as the “corrupted” counterpart of another perspective (e.g., a rotated digit 6 is the noisy pair of the original digit). The autoencoder objective is then reformulated along the lines of multi-task learning: the model aims to jointly achieve good reconstruction of all source views given a particular view. For example, applying the strategy to handwritten digit images with several views, MTAE learns representations that are invariant across the source views, see Section 4.
Two types of reconstruction tasks are performed during MTAE training: 1) self-domain reconstruction and 2) between-domain reconstruction. Given $M$ source domains, there are $M \times M$ reconstruction tasks, of which $M$ tasks are self-domain reconstructions and the remaining $M \times (M-1)$ tasks are between-domain reconstructions. Note that the self-domain reconstruction is identical to the standard autoencoder reconstruction (Section 3.1).
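The task count can be checked with a short enumeration. The sketch below is only illustrative; the helper name `reconstruction_tasks` is hypothetical, and each task is represented as an (input domain, output domain) index pair.

```python
from itertools import product

def reconstruction_tasks(num_domains):
    """Enumerate every (input_domain, output_domain) reconstruction task."""
    tasks = list(product(range(num_domains), repeat=2))
    self_tasks = [(i, j) for i, j in tasks if i == j]      # self-domain
    between_tasks = [(i, j) for i, j in tasks if i != j]   # between-domain
    return self_tasks, between_tasks

self_tasks, between_tasks = reconstruction_tasks(3)
print(len(self_tasks), len(between_tasks))  # prints "3 6"
```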
Formal description.
Let $\{x_i^l\}_{i=1}^{n_l}$ be a set of $d_x$-dimensional data points in the $l$-th domain, where $l \in \{1, \ldots, M\}$. Each domain’s data points are combined into a matrix $X^l \in \mathbb{R}^{n_l \times d_x}$, where $x_i^{l\top}$ is its $i$-th row, such that $(x_i^1, x_i^2, \ldots, x_i^M)$ form a category-level correspondence. This configuration enforces the number of samples in a category to be the same in every domain. Note that such a configuration is necessary to ensure that the between-domain reconstruction works (we will discuss how to handle the case with unbalanced samples in Section 3.3). The input and output pairs used to train MTAE can then be written as concatenated matrices
\[ \bar{X} = \begin{bmatrix} X^1 \\ X^2 \\ \vdots \\ X^M \end{bmatrix}, \qquad \bar{X}^l = \begin{bmatrix} X^l \\ X^l \\ \vdots \\ X^l \end{bmatrix}, \tag{3} \]
where $\bar{X}, \bar{X}^l \in \mathbb{R}^{N \times d_x}$ and $N = \sum_{l=1}^{M} n_l$. In words, $\bar{X}$ is the matrix of data points taken from all domains and $\bar{X}^l$ is the matrix of replicated data sets taken from the $l$-th domain. The replication imposed in $\bar{X}^l$ constructs input-output pairs for the autoencoder learning algorithm. In practice, the algorithm can be implemented efficiently, without replicating the matrix in memory.
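One way to avoid the replication, sketched below under the assumption that every domain holds the same number of rows aligned by category, is to generate the input-output pairs by index instead of building the replicated target matrix. The generator name `training_pairs` and the toy data are hypothetical.

```python
def training_pairs(domains, target):
    """Yield (input_row, output_row) pairs for one target domain.

    domains: list of M lists, each holding n rows aligned by category
    target:  index of the domain whose rows serve as reconstruction targets
    """
    n = len(domains[0])
    for source in range(len(domains)):
        for i in range(n):
            # Row i of every source domain is paired with row i of the
            # target domain, so the target rows are never copied.
            yield domains[source][i], domains[target][i]

# Toy data: two domains with two one-dimensional samples each.
domains = [[[0.0], [1.0]], [[10.0], [11.0]]]
pairs = list(training_pairs(domains, target=1))
```

Here `pairs` contains two between-domain pairs (inputs from the first domain) followed by two self-domain pairs (inputs from the second domain).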
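We now describe MTAE more formally. Let $\bar{x}_i^\top$ and $\bar{x}_i^{l\top}$ be the $i$-th rows of the matrices $\bar{X}$ and $\bar{X}^l$, respectively. The feedforward MTAE reconstruction is
\[ h_i = \sigma_{\mathrm{enc}}(W^\top \bar{x}_i), \qquad f_{\Theta^{(l)}}(\bar{x}_i) = \sigma_{\mathrm{dec}}(V^{(l)\top} h_i), \tag{4} \]
where $\Theta^{(l)} = \{W, V^{(l)}\}$ contains the matrices of shared and individual weights, respectively.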
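The MTAE training is achieved as follows. Let us define the loss function summed over the data points
\[ J(\Theta^{(l)}) = \sum_{i=1}^{N} L\big(f_{\Theta^{(l)}}(\bar{x}_i), \bar{x}_i^l\big). \tag{5} \]
Given $M$ domains, training MTAE corresponds to minimizing the objective
\[ \hat{\Theta}^{(l)} := \arg\min_{\Theta^{(l)}} \sum_{l=1}^{M} J(\Theta^{(l)}) + \eta R(\Theta^{(l)}), \tag{6} \]
where $R(\Theta^{(l)})$ is a regularization term. In this work, we use the standard $\ell_2$-norm weight penalty $R(\Theta^{(l)}) = \|W\|_2^2 + \|V^{(l)}\|_2^2$. Stochastic gradient descent is applied on each reconstruction task to achieve the objective (6). Once training is completed, the optimal shared weights $\hat{W}$ are obtained. The stopping criterion is empirically determined by monitoring the average loss over all reconstruction tasks during training: the process is stopped when the average loss stabilizes. The detailed steps of MTAE training are summarized in Algorithm 1.
Algorithm 1
Input:
Data matrices based on (3.2): $\bar{X}$ and $\bar{X}^l$, for all $l \in \{1, \ldots, M\}$;
Source labels;
The learning rate;
Output:
MTAE learnt weights: $\hat{W}$ and $\hat{V}^{(l)}$, for all $l \in \{1, \ldots, M\}$;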
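The training procedure described above might look roughly as follows in NumPy. This is a sketch under stated assumptions, not the authors' code: it assumes a sigmoid encoder, a linear decoder, a squared-error loss, per-sample SGD updates, no bias terms, and a fixed number of epochs in place of the loss-stabilization stopping rule. The function name `train_mtae` and all hyperparameter values are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mtae(X, d_h, lr=0.1, decay=1e-4, epochs=100, seed=0):
    """Train a toy MTAE. X is a list of M aligned (n, d_x) arrays."""
    rng = np.random.default_rng(seed)
    M = len(X)
    n, d_x = X[0].shape
    W = 0.1 * rng.standard_normal((d_x, d_h))                 # shared weights
    V = [0.1 * rng.standard_normal((d_h, d_x)) for _ in range(M)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for l in range(M):                  # output (target) domain
            for s in range(M):              # input domain
                for i in rng.permutation(n):
                    x, t = X[s][i], X[l][i]        # input row and its target
                    h = sigmoid(W.T @ x)           # encode with shared weights
                    err = V[l].T @ h - t           # linear decoder residual
                    total += 0.5 * float(err @ err)
                    grad_V = np.outer(h, err) + decay * V[l]
                    delta = (V[l] @ err) * h * (1.0 - h)   # backprop to hidden
                    grad_W = np.outer(x, delta) + decay * W
                    V[l] -= lr * grad_V
                    W -= lr * grad_W
        history.append(total / (M * M * n))  # average loss over all tasks
    return W, V, history

# Toy data: the second "view" is the first with its features reversed.
rng = np.random.default_rng(1)
base = rng.random((20, 6))
X = [base, base[:, ::-1].copy()]
W, V, history = train_mtae(X, d_h=5)
```

Monitoring `history` corresponds to the average-loss criterion mentioned in the text; a practical version would stop once its values level off.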
3.3 Handling unbalanced samples per category
MTAE requires that every instance in a particular domain has a category-level corresponding pair in every other domain. MTAE’s apparent applicability is therefore limited to situations where the number of source samples per category is the same in every domain. However, unbalanced samples per category occur frequently in applications. To overcome this issue, we propose a simple random selection procedure applied in the between-domain reconstructions, denoted by randsel, which simply balances the samples per category while keeping their category-level correspondence.
In detail, the randsel strategy is as follows. Let $m_c$ be the number of subsamples in the $c$-th category, where $m_c = \min(n_{1c}, n_{2c}, \ldots, n_{Mc})$ and $n_{lc}$ is the number of samples in the $c$-th category of domain $l$. For each category $c$ and each domain $l$, select $m_c$ samples randomly such that $n_{1c} = n_{2c} = \cdots = n_{Mc} = m_c$. This procedure is executed in every iteration of the MTAE algorithm, see Line 3 of Algorithm 1.
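A minimal sketch of this balancing step is given below. It assumes that each domain is described by a list of category labels and returns, per domain and category, a random subset of sample indices of equal size; the function name `rand_sel` and the toy labels are hypothetical, and only categories present in every domain are kept.

```python
import random
from collections import defaultdict

def rand_sel(labels_per_domain, rng=random):
    """Return, for each domain, per-category index lists of equal length.

    labels_per_domain: list of label lists, one list per domain
    rng: the random module or a random.Random instance
    """
    by_cat = []
    for labels in labels_per_domain:
        groups = defaultdict(list)
        for idx, c in enumerate(labels):
            groups[c].append(idx)
        by_cat.append(groups)
    # Keep only the categories that occur in every domain.
    categories = set(by_cat[0]).intersection(*by_cat[1:])
    selection = [dict() for _ in labels_per_domain]
    for c in sorted(categories):
        m_c = min(len(groups[c]) for groups in by_cat)   # smallest count
        for d, groups in enumerate(by_cat):
            selection[d][c] = rng.sample(groups[c], m_c)
    return selection

labels = [["a", "a", "a", "b"], ["a", "b", "b"]]
sel = rand_sel(labels, random.Random(0))
```

Calling `rand_sel` again at each training iteration draws a fresh subset, so all samples of the larger categories are eventually used.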
4 Experiments and Results
We conducted experiments on several real-world object datasets to evaluate the domain generalization ability of our proposed system. In Section 4.1, we investigate the behaviour of MTAE in comparison to standard single-task autoencoder models on raw pixels as a proof of principle. In Section 4.2, we evaluate the performance of MTAE against several state-of-the-art algorithms on modern object datasets such as the Office [31], Caltech [20], PASCAL VOC2007 [14], LabelMe [30], and SUN09 [10].
4.1 Cross-recognition on the MNIST and ETH80 datasets
In this part, we aim to understand MTAE’s behavior when learning from multiple domains that form physically reasonable object transformations such as roll, pitch rotation, and dilation. The task is to categorize objects in views (domains) that were not presented during training. We evaluate MTAE against several autoencoder models. To perform the evaluation, a variety of object views were constructed from the MNIST handwritten digit [24] and ETH80 object [25] datasets.
Data setup.
We created four new datasets from MNIST and ETH80 images: 1) MNISTr, 2) MNISTs, 3) ETH80p, and 4) ETH80y. These new sets contain multiple domains so that every instance in one domain has a pair in another domain. The detailed setting for each dataset is as follows.
MNISTr contains six domains, each corresponding to a degree of roll rotation. We randomly chose 1000 digit images of ten classes from the original MNIST training set to represent the basic view, i.e., 0 degrees of rotation (note that the rotation angle of the basic view is not perfectly 0 degrees, since the original MNIST images have varying appearances); each class has 100 images. Each image was subsampled to a lower-resolution representation to simplify the computation. We then created five rotated views from this subset of 1000 images, separated by a fixed angular difference in the counterclockwise direction. MNISTs is the counterpart of MNISTr in which each domain corresponds to a dilation factor; its views are indexed by their dilation factors with respect to the basic view.
ETH80p consists of eight object classes with 10 subcategories for each class. In each subcategory, there are 41 different views with respect to pose angles. We took five views from each class, which represent the horizontal poses, i.e., pitch-rotated views ranging from the top view to the side view. This gives only 80 instances for each view. We then greyscaled and subsampled the images. ETH80y contains five views of ETH80 representing the vertical poses, i.e., yaw-rotated views ranging from the right-side view to the left-side view. Other settings, such as the image dimensionality and the preprocessing stage, are the same as for ETH80p. Examples of the resulting views are depicted in Fig. 2.
Table 1: Leave-one-domain-out classification accuracies on MNISTr (six leave-one-roll-rotation-out cases and their average) and MNISTs (five leave-one-dilation-out cases and their average). Columns: Source, Target, Raw, AE, DAE, CAE, uDICA, MTAE, DMTAE. [Numeric entries are missing from this copy.]
Baselines.
We compared the classification performance of our models with several single-task autoencoder models. Descriptions of the methods and their hyperparameter settings are provided below.

AE [5]: the standard autoencoder model trained by stochastic gradient descent, where all object views were concatenated as one set of inputs. The number of hidden nodes was fixed at 500 on the MNIST dataset and at 1000 on the ETH80 dataset. The learning rate, weight decay penalty, and number of iterations were determined empirically.

DAE [37]: the denoising autoencoder with zero-masking noise, where all object views were concatenated as one set of input data. The corruption level was kept fixed for all cases. Other hyperparameter values were identical to AE.

CAE [27]: the autoencoder model with the Jacobian matrix norm regularization, referred to as the contractive autoencoder. The corresponding regularization constant was set at 0.1.

MTAE: our proposed multi-task autoencoder model with the same hyperparameter settings as AE, except for the learning rate, which was set at 0.03 and was also chosen empirically. This value provides a lower reconstruction error for each task and visually clearer first-layer weights.

DMTAE: MTAE with a denoising criterion. The learning rate was set the same as for MTAE; other hyperparameters followed DAE.
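The zero-masking noise used by DAE and DMTAE can be sketched as follows. The helper name `zero_mask` is hypothetical, and the corruption level of 0.3 in the usage lines is only an illustrative value, not a setting taken from the experiments. In a denoising setup, the corrupted vector would serve as the input while the clean vector remains the reconstruction target.

```python
import random

def zero_mask(x, corruption_level, rng=random):
    """Return a copy of x in which each entry is zeroed with the given probability."""
    return [0.0 if rng.random() < corruption_level else v for v in x]

# Illustrative usage with a hypothetical corruption level.
clean = [1.0] * 1000
corrupted = zero_mask(clean, 0.3, random.Random(0))
num_zeroed = corrupted.count(0.0)
```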
We also evaluated the unsupervised Domain-Invariant Component Analysis (uDICA) [26] on these datasets for completeness. The hyperparameters were tuned using 10-fold cross-validation on source domains. We also ran experiments using the supervised variant, DICA, with the same tuning strategy. Surprisingly, the peak performance of uDICA is consistently higher than that of DICA. A possible explanation is that the Dirac kernel function measuring the label similarity is less appropriate in this application.
We normalized the raw pixels to a fixed range for autoencoder-based models and to the unit ball for uDICA. We evaluated the classification accuracies of the learnt features using a multi-class SVM with linear kernel (LSVM) [12]. Using a linear kernel keeps the classifier simple, since our main focus is on the feature extraction process. The LIBLINEAR package [15] was used to run the LSVM.
Crossdomain recognition results.
We evaluated the object classification accuracies of each algorithm by a leave-one-domain-out test, i.e., taking one domain as the test set and the remaining domains as the training set. For all autoencoder-based algorithms, we repeated the experiments on each leave-one-domain-out case 30 times and reported the average accuracies. The standard deviations are not reported since they are small.
The detailed results on MNISTr and MNISTs can be seen in Table 1. On average, MTAE has the second best classification accuracies, and in particular outperforms single-task autoencoder models. This indicates that the multi-task feature learning strategy can provide more discriminative features than single-task feature learning on unseen object views.
The algorithm with the best performance on these datasets is DMTAE. Specifically, DMTAE performs best on average and also on 9 out of 11 individual cross-domain cases of MNISTr and MNISTs. The closest single-task feature learning competitor to DMTAE is CAE. This suggests that the denoising criterion strongly benefits domain generalization. The denoising criterion is also useful for single-task feature learning, although it does not yield competitive accuracies, see the AE and DAE performance.
We also obtain a consistent trend on the ETH80p and ETH80y datasets, i.e., DMTAE and MTAE are the best and second best models in terms of average accuracy on both datasets.
Observe that there is an anomaly in the MNISTr dataset: the performance on one rotated view is far worse than on its neighboring views. This anomaly appears to be related to the geometry of the MNISTr digits. We found that the most frequently misclassified digits on this view are 4, 6, and 9 (typically 4 as 9, 6 as 4, and 9 as 8), which rarely occurs on the other MNISTr domains. The same phenomenon applies to LSVM.
Weight visualization.
Useful insight is obtained from considering the qualitative outcome of the MTAE training by visualizing the first-layer weights. Figure 4 depicts the weights of some autoencoder models, including ours, on the MNISTr dataset. Both MTAE and DMTAE’s weights form “filters” that tend to capture the underlying transformation across the MNISTr views, which is the rotation. This effect is not seen in AE and DAE, the filters of which only explain the contents of handwritten digits in the form of Fourier component-like descriptors such as local blob detectors and stroke detectors [37]. This might be a reason that MTAE and DMTAE features can provide better domain generalization than AE and DAE, since they implicitly capture the relationship among the source domains.
Next we discuss the difference between MTAE and DMTAE filters. The DMTAE filters not only capture the object transformation, but also produce features that describe the object contents more distinctively. These filters basically combine both properties of the DAE and MTAE filters that might benefit the domain generalization.
Invariance analysis.
A possible explanation for the effectiveness of MTAE relates to the dimensionality of the manifold in feature space where samples concentrate. We hypothesize that if features concentrate near a low-dimensional submanifold, then the algorithm has found simple invariant features and will generalize well.
To test the hypothesis, we examine the singular value spectrum of the Jacobian matrix $J_x(z) = \partial z / \partial x$, where $x$ and $z$ are the input and feature vectors, respectively [27]. The spectrum describes the local dimensionality of the manifold around which samples concentrate. If the spectrum decays rapidly, then the manifold is locally of low dimension.
Figure 3 depicts the average singular value spectrum on test samples from MNISTr and MNISTs. The spectrum of DMTAE decays the most rapidly, followed by MTAE and then DAE (with similar rates), and AE decaying the slowest. The ranking of decay rates of the four algorithms matches their ranking in terms of empirical performance in Table 1. Figure 3 thus provides partial confirmation for our hypothesis. However, a more detailed analysis is necessary before drawing strong conclusions.
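For a sigmoid encoder, the Jacobian of the features with respect to the input has a closed form, so the averaged spectrum can be computed directly. The sketch below assumes features of the form sigmoid(W.T @ x) without a bias term; the function names and the random toy weights are hypothetical and stand in for learnt weights and real test samples.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def jacobian_spectrum(x, W):
    """Singular values of dz/dx for z = sigmoid(W.T @ x), largest first."""
    z = sigmoid(W.T @ x)
    J = (z * (1.0 - z))[:, None] * W.T        # Jacobian, shape (d_h, d_x)
    return np.linalg.svd(J, compute_uv=False)

def average_spectrum(X, W):
    """Average the singular value spectrum over the rows of X."""
    return np.mean([jacobian_spectrum(x, W) for x in X], axis=0)

# Toy usage with random weights in place of learnt ones.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
X = rng.random((10, 6))
spec = average_spectrum(X, W)
```

A faster drop from the first to the last entry of `spec` would indicate a locally lower-dimensional feature manifold, which is the quantity compared across models in the text.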
4.2 Cross-recognition on the Office, Caltech, VOC2007, LabelMe, and SUN09 datasets
In the second set of experiments, we evaluated the cross-recognition performance of the proposed algorithms on modern object datasets. The aim is to show that MTAE and DMTAE are applicable and competitive in the more general setting. We used the Office, Caltech, PASCAL VOC2007, LabelMe, and SUN09 datasets, from which we formed two cross-domain datasets. Our general strategy is to extend the generalization of features extracted from the current best deep convolutional neural network [23].
Data Setup.
The first cross-domain dataset consists of images from the PASCAL VOC2007 (V), LabelMe (L), Caltech101 (C), and SUN09 (S) datasets, each of which represents one domain. C is an object-centric dataset, while V, L, and S are scene-centric. This dataset, which we abbreviate as VLCS, shares five object categories: ‘bird’, ‘car’, ‘chair’, ‘dog’, and ‘person’. Each domain in the VLCS dataset was divided into a training set (70%) and a test set (30%) by random selection from the overall dataset. The detailed training-test configuration for each domain is summarized in Table 2. Instead of using the raw features directly, we employed the DeCAF6 features [13] as inputs to the algorithms. These features have a dimensionality of 4,096 and are publicly available (http://www.cs.dartmouth.edu/chenfang/proj_page/FXR_iccv13/index.php).
The second cross-domain dataset is referred to as the Office+Caltech [31, 19] dataset, which contains four domains: Amazon (A), Webcam (W), DSLR (D), and Caltech256 (C), which share ten common categories. This dataset has 8 to 151 instances per category per domain, and 2,533 instances in total. We also used the DeCAF6 features extracted from this dataset, which are also publicly available (http://vc.sce.ntu.edu.sg/transfer_learning_domain_adaptation/).
Table 2: Number of training and test instances for each domain in the VLCS dataset.
Domain  VOC2007  LabelMe  Caltech101  SUN09
#training  2,363  1,859  991  2,297
#test  1,013  797  424  985
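Table 3: Cross-recognition accuracies of LSVM for every training/test pair of VLCS domains (rows: training domain; columns: test domain; domains: VOC2007, LabelMe, Caltech101, SUN09). [Numeric entries are missing from this copy.]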
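Table 4: Leave-one-domain-out recognition accuracies (%) on the VLCS dataset. Columns: Source, Target, LSVM, 1HNN, DAE+1HNN, UndoBias, UML, LRESVM, MTAE+1HNN, DMTAE+1HNN. Rows: L,C,S → V; V,C,S → L; V,L,S → C; V,L,C → S; Avg. [Most numeric entries are missing from this copy; the surviving values are 57.45 in row V,L,C → S and 65.46 in row Avg., whose columns cannot be identified.]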
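Table 5: Recognition accuracies (%) on the Office+Caltech dataset. Columns: Source, Target, LSVM, 1HNN, DAE+1HNN, UndoBias, UML, LRESVM, MTAE+1HNN, DMTAE+1HNN. Rows: A,C → D,W; D,W → A,C; C,D,W → A; A,W,D → C; Avg. [Numeric entries are missing from this copy.]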
Training protocol.
On these datasets, we utilized MTAE or DMTAE learning as pretraining for a fully-connected neural network with one hidden layer (1HNN). The number of hidden nodes was set at 2,000, which is less than the input dimensionality. In the pretraining stage, the number of output layers was the same as the number of source domains, each corresponding to a particular source domain. The sigmoid and linear activation functions were used for the encoding and decoding stages, respectively.
The learning rate, number of epochs, and batch size for MTAE pretraining were determined empirically with respect to the smallest average reconstruction loss. DMTAE has the same hyperparameter setting as MTAE except for the additional zero-masking corruption level. After the pretraining was completed, we performed backpropagation fine-tuning using 1HNN with a softmax output, where the first-layer weights were initialized by either the MTAE or DMTAE learnt weights. The supervised learning hyperparameters were tuned using 10-fold cross-validation (10FCV) on source domains. We denote the overall models by MTAE+1HNN and DMTAE+1HNN.
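The hand-over from pretraining to the supervised network can be sketched as below. Only the initialization and the forward pass are shown; the backpropagation fine-tuning itself is omitted. The function names, the dictionary layout, the random stand-in for pretrained weights, and the toy dimensions are assumptions made for the example, and bias terms are left out.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=1, keepdims=True)

def init_1hnn(W_pretrained, n_classes, seed=0):
    """Build 1HNN parameters whose first layer copies the pretrained weights."""
    rng = np.random.default_rng(seed)
    d_h = W_pretrained.shape[1]
    return {
        "W1": W_pretrained.copy(),                          # from MTAE/DMTAE
        "W2": 0.01 * rng.standard_normal((d_h, n_classes))  # random init
    }

def predict_proba(params, X):
    """Forward pass: sigmoid hidden layer followed by a softmax output."""
    H = 1.0 / (1.0 + np.exp(-(X @ params["W1"])))
    return softmax(H @ params["W2"])

# Toy usage: random values stand in for pretrained weights and input features.
rng = np.random.default_rng(1)
W_pre = 0.1 * rng.standard_normal((10, 4))
params = init_1hnn(W_pre, n_classes=3)
probs = predict_proba(params, rng.random((5, 10)))
```

During fine-tuning both `W1` and `W2` would be updated, so the pretrained weights act only as a starting point.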
Baselines.
We compared our proposed models with six baselines:

LSVM: an SVM with linear kernel.

1HNN: a single-hidden-layer neural network without pretraining.

DAE+1HNN: a two-layer neural network with denoising autoencoder pretraining.

UndoBias [21]: a multi-task SVM-based algorithm for undoing dataset bias. Its three hyperparameters require tuning by 10FCV.

UML [16]: a structural metric learning-based algorithm that aims to learn a less biased distance metric for classification tasks. The tuning originally proposed for this method uses a set of weakly-labeled data retrieved by querying class labels to a search engine. Here, however, we tuned the hyperparameters using the same strategy as for the other methods (10FCV) for a fair comparison.

LRESVM [38]: a nonlinear exemplar-SVM model with a nuclear norm regularization to impose a low-rank likelihood matrix. Its four hyperparameters were tuned using 10FCV.
The last three are the state-of-the-art domain generalization algorithms for object recognition.
We report the performance in terms of the classification accuracy (%) following Xu et al. [38]. For all algorithms that are optimized stochastically, we ran 10 independent training processes using the best performing hyperparameters and reported the average accuracies. Similar to the previous experiment, we do not report the standard deviations due to their small values.
Results on the VLCS dataset.
We first conducted the standard training-test evaluation using LSVM, i.e., learning the model on a training set from one domain and testing it on a test set from another domain, to check the ground-truth performance and also to identify the existence of dataset bias. The performance is summarized in Table 3. We can see that the bias indeed exists in every domain despite the use of DeCAF6, the sixth-layer features of the state-of-the-art deep convolutional neural network. The gap between the best cross-domain performance and the ground-truth performance is large.
We then evaluated the domain generalization performance of each algorithm. We conducted a leave-one-domain-out evaluation, which induces four cross-domain cases. The complete recognition results are shown in Table 4. In general, the dataset bias can be reduced by all algorithms after learning from multiple source domains (compare, e.g., the minimum accuracy over the first row of Table 4, with V as the target, with the maximum cross-recognition accuracy over the VOC2007 column in Table 3). Furthermore, Caltech101, which is object-centric, appears to be the easiest dataset to recognize, consistent with an investigation in [35]: scene-centric datasets tend to generalize well over object-centric datasets. Surprisingly, 1HNN already achieves competitive accuracy compared to the more complicated state-of-the-art algorithms UndoBias, UML, and LRESVM. Furthermore, DMTAE outperforms the other algorithms on three out of four cross-domain cases and on average, while MTAE has the second best performance on average.
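The four cross-domain cases of the leave-one-domain-out protocol can be generated mechanically. The following sketch uses a hypothetical helper `leave_one_domain_out`; the single-letter domain names follow the VLCS abbreviations introduced earlier.

```python
def leave_one_domain_out(domains):
    """Yield (source_domains, target_domain) splits, one per held-out domain."""
    for i, target in enumerate(domains):
        sources = domains[:i] + domains[i + 1:]
        yield sources, target

splits = list(leave_one_domain_out(["V", "L", "C", "S"]))
for sources, target in splits:
    print(",".join(sources), "->", target)
```

The loop prints the same source and target combinations that label the rows of Table 4.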
Results on the Office+Caltech dataset.
We now report the experimental results on the Office+Caltech dataset. Table 5 summarizes the recognition accuracies of each algorithm over four cross-domain cases. DMTAE+1HNN has the best performance on two out of four cross-domain cases and ranks second on the remaining cases. On average, DMTAE+1HNN performs better than the prior state-of-the-art on this dataset, LRESVM [38].
5 Conclusions
We have proposed a new approach to multi-task feature learning that reduces dataset bias in object recognition. The main idea is to extract features shared across domains via a training protocol that, given an image from one domain, learns to reconstruct analogs of that image for all domains. The strategy yields two variants: the Multi-task Autoencoder (MTAE) and the Denoising MTAE, which incorporates a denoising criterion. A comprehensive suite of cross-domain object recognition evaluations shows that the algorithms successfully learn domain-invariant features, yielding state-of-the-art performance when predicting the labels of objects from unseen target domains.
Our results suggest several directions for further study. Firstly, it is worth investigating whether stacking MTAEs improves performance. Secondly, more effective procedures for handling unbalanced samples are required, since these occur frequently in practice. Finally, a natural application of MTAEs is to streaming data such as video, where the appearance of objects transforms in real time.
The problem of dataset bias remains far from solved: even the best model on the VLCS dataset achieved only limited accuracy on average. A partial explanation for the poor performance compared to supervised learning is insufficient training data: the class overlap across datasets is quite small (only 5 classes are shared across VLCS). Further progress in domain generalization requires larger datasets.
References
[1] A. Argyriou, T. Evgeniou, and M. Pontil. Convex Multi-Task Feature Learning. Machine Learning, 73(3):243–272, 2008.
[2] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Domain Adaptation on the Statistical Manifold. In CVPR, pages 2481–2488, 2014.
[3] J. Baxter. A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
[4] Y. Bengio, A. C. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[5] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy Layer-Wise Training of Deep Networks. In NIPS, pages 153–160, 2007.
[6] G. Blanchard, G. Lee, and C. Scott. Generalizing from Several Related Classification Tasks to a New Unlabeled Sample. In NIPS, volume 1, pages 2178–2186, 2011.
[7] H. Bourlard and Y. Kamp. Auto-Association by Multilayer Perceptrons and Singular Value Decomposition. Biological Cybernetics, 59:291–294, 1988.
[8] R. Caruana. Multitask Learning. Machine Learning, 28:41–75, 1997.
[9] M. Chen, Z. Xu, K. Weinberger, and F. Sha. Marginalized Denoising Autoencoders for Domain Adaptation. In ICML, pages 767–774, 2012.
[10] M. J. Choi, J. J. Lim, A. Torralba, and A. S. Willsky. Exploiting hierarchical context on a large database of object categories. In CVPR, pages 129–136, 2010.
[11] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, pages 3642–3649, 2012.
[12] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292, 2001.
[13] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In ICML, pages 647–655, 2014.
[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.
[15] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. JMLR, 9:1871–1874, 2008.
[16] C. Fang, Y. Xu, and D. N. Rockmore. Unbiased Metric Learning: On the Utilization of Multiple Datasets and Web Images for Softening Bias. In ICCV, pages 1657–1664, 2013.
[17] M. Ghifary, W. B. Kleijn, and M. Zhang. Deep hybrid networks with good out-of-sample object recognition. In ICASSP, pages 5437–5441, 2014.
[18] B. Gong, K. Grauman, and F. Sha. Reshaping Visual Datasets for Domain Adaptation. In NIPS, pages 1286–1294, 2013.
[19] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic Flow Kernel for Unsupervised Domain Adaptation. In CVPR, pages 2066–2073, 2012.
[20] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, California Inst. of Tech., 2007.
[21] A. Khosla, T. Zhou, T. Malisiewicz, A. Efros, and A. Torralba. Undoing the Damage of Dataset Bias. In ECCV, volume I, pages 158–171, 2012.
[22] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master’s thesis, Department of Computer Science, University of Toronto, Apr. 2009.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, volume 25, pages 1106–1114, 2012.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, pages 2278–2324, 1998.
[25] B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object categorization. In CVPR, pages 409–415, 2003.
[26] K. Muandet, D. Balduzzi, and B. Schölkopf. Domain Generalization via Invariant Feature Representation. In ICML, pages 10–18, 2013.
[27] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. In ICML, number 1, pages 833–840, 2011.
[28] D. Rumelhart, G. Hinton, and R. Williams. Parallel Distributed Processing. I: Foundations. MIT Press, 1986.
[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[30] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: A database and web-based tool for image annotation. In IJCV, volume 77, pages 157–173, 2008.
[31] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting Visual Category Models to New Domains. In ECCV, pages 213–226, 2010.
[32] I. Sutskever, O. Vinyals, and Q. Le. Sequence to Sequence Learning with Neural Networks. In NIPS, 2014.
[33] Y. Tang and C. Eliasmith. Deep networks for robust visual recognition. In ICML, pages 1055–1062, 2010.
[34] S. Thrun. Is learning the n-th thing any easier than learning the first? In NIPS, pages 640–646, 1996.
[35] A. Torralba and A. Efros. Unbiased look at dataset bias. In CVPR, pages 1521–1528, 2011.
[36] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. In ICML, 2008.
[37] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
[38] Z. Xu, W. Li, L. Niu, and D. Xu. Exploiting Low-Rank Structure from Latent Domains for Domain Generalization. In ECCV, pages 628–643, 2014.