1 Introduction
As deep learning approaches have gained prominence in computer vision, tasks with large amounts of available labeled data have flourished with improved results. There are still many problems worth solving where labeled data on an equally large scale is too expensive to collect, annotate, or both, so a straightforward deep learning approach is not feasible. Typically, in such a scenario, practitioners train or reuse a model from a closely related dataset with a large number of samples, here called the source domain, and then train with the much smaller dataset of interest, referred to as the target domain. This process is well known as fine-tuning. Fine-tuning, while simple to implement, has been found to be suboptimal when compared to later techniques such as domain adaptation blitzer2006domain . Domain adaptation can be supervised tzengHDS15iccv ; koniusz2016domain , unsupervised ghifary2016deep ; long2015learning , or semi-supervised gong2012geodesic ; GuoX12 ; Yao_2015_CVPR , depending on what data is available in a labeled format and how much can be collected.
Unsupervised domain adaptation (UDA) algorithms do not need any target data labels, but they require large amounts of target training samples, which may not always be available. Conversely, supervised domain adaptation (SDA) algorithms do require labeled target data, and because labeling information is available, for the same quantity of target data, SDA outperforms UDA Motiian_2017_ICCV . Therefore, if the available target data is scarce, SDA becomes attractive, even if the labeling process is expensive, because only a few samples need to be processed.
Most domain adaptation approaches try to find a feature space such that the confusion between source and target distributions in that space is maximal (domain confusion). In such a space it is hard to say whether a sample has come from the source distribution or the target distribution. Recently, generative adversarial networks NIPS2014_5423 have been introduced for image generation, and they can also be used for domain adaptation. In NIPS2014_5423 , the goal is to learn a discriminator to distinguish between real samples and generated (fake) samples and then to learn a generator which best confuses the discriminator. Domain adaptation can also be seen as a generative adversarial network with one difference: in domain adaptation there is no need to generate samples; instead, the generator network is replaced with an inference network. When the discriminator cannot determine if a sample is from the source or the target distribution, the inference becomes optimal in terms of creating a joint latent space. In this manner, generative adversarial learning has been successfully modified for UDA liu2016coupled ; tzeng2017adversarial ; sankaranarayanan2017generate and provided very promising results.
Here instead, we are interested in adapting adversarial learning for SDA, which we call few-shot adversarial domain adaptation (FADA), for cases when there are very few labeled target samples available in training. In this few-shot learning regime, our SDA method has proven capable of increasing a model's performance at a very high rate with respect to the inclusion of additional samples. Indeed, even one additional sample can significantly increase performance.
Our first contribution is to handle this scarce data while providing effective training. Our second contribution is to extend adversarial learning NIPS2014_5423 to exploit the label information of target samples. We propose a novel way of creating pairs of samples using source and target samples to address the first challenge. We assign a group label to a pair according to the following procedure: 1 if the samples of a pair both come from the source distribution and share the same class label, 2 if they come from the source and target distributions but share the same class label, 3 if they both come from the source distribution but have different class labels, and 4 if they come from the source and target distributions and have different class labels. The second challenge is addressed by using adversarial learning NIPS2014_5423 to train a deep inference function, which confuses a well-trained domain-class discriminator (DCD) while maintaining a high classification accuracy for the source samples. The DCD is a multi-class classifier that takes pairs of samples as input and classifies them into the above four groups. Confusing the DCD will encourage domain confusion, as well as the semantic alignment of classes. Our third contribution is an extensive validation of FADA against the state of the art. Although our method is general, and can be used for all domain adaptation applications, we focus on visual recognition.
2 Related work
Naively training a classifier on one dataset for testing on another is known to produce suboptimal results, because an effect known as dataset bias ponce2006dataset ; torralbaE11cvpr ; tommasi2015deeper , or covariate shift shimodaira00jspi , occurs due to a difference in the distributions of the images between the datasets.
Prior work in domain adaptation has minimized this shift largely in three ways. Some try to find a function which can map from the source domain to the target domain saenkoKFD2010eccv ; kulis2011you ; gopalanLC11iccv ; gong2012geodesic ; fernandoHST13iccv ; tommasi2016learning ; Shrivastava_2017_CVPR . Others find a shared latent space that both domains can be mapped to before classification long2013transfer ; baktashmotlaghHLS13iccv ; muandet2013domain ; ganin2014unsupervised ; ganin2016domain ; panTKY11tnn ; motiian2016information ; Motiian_2017_ICCV . Finally, some use regularization to improve the fit on the target domain bergamo2010exploiting ; aytar2011tabula ; yang2007adapting ; duan2009domain ; becker2013non ; daume2006domain . UDA can leverage the first two approaches while SDA uses the second, third, or a combination of the two approaches. In addition to these methods, chenLX2014cvpr ; motiian2016ECCV ; sarafianos2017adaptive have addressed UDA when an auxiliary data view lapinHS2014nn ; motiian2016information is available during training, but that is beyond the scope of this work.
For this approach we are focused on finding a shared subspace for both the source and target distributions. Siamese networks chopra2005learning work well for subspace learning and have worked very well with deep convolutional neural networks Donahue13decaf ; Simonyan14c ; kumar2016learning ; varior2016siamese . Siamese networks have also been useful in domain adaptation recently. In tzengHDS15iccv , which is a deep SDA approach, unlabeled and sparsely labeled target domain data are used to optimize for domain invariance to facilitate domain transfer while using a soft label distribution matching loss. In sun2016deep , which is a deep UDA approach, unlabeled target data is used to learn a nonlinear transformation that aligns correlations of layer activations in deep neural networks. Some approaches went beyond Siamese weight-sharing and used coupled networks for DA. koniusz2016domain uses two CNN streams, for source and target, fused at the classifier level. rozantsev2016beyond , which is a deep UDA approach and can be seen as an SDA after fine-tuning, also uses a two-stream architecture, for source and target, with related but not shared weights. Motiian_2017_ICCV , which is an SDA approach, creates positive and negative pairs using source and target data and then finds a shared feature space between source and target by bringing together the positive pairs and pushing apart the negative pairs.
Recently, adversarial learning NIPS2014_5423 has shown promising results in domain adaptation and can be seen as an example of the second category. liu2016coupled introduced a coupled generative adversarial network (CoGAN) for learning a joint distribution of multi-domain images for different applications including UDA. tzeng2017adversarial has used the adversarial loss for discriminative UDA. sankaranarayanan2017generate introduces an approach that leverages unlabeled data to bring the source and target distributions closer by inducing a symbiotic relationship between the learned embedding and a generative adversarial framework.
Here we use adversarial learning to train inference networks such that samples from different distributions are not distinguishable. We consider the task where very few labeled target data are available in training. With this assumption, it is not possible to use the standard adversarial loss used in liu2016coupled ; tzeng2017adversarial ; sankaranarayanan2017generate , because the training target data would be insufficient. We address that problem by modifying the usual pairing technique used in many applications such as learning similarity metrics chopra2005learning ; Discriminative2014Hu ; hoffer2015deep . Our pairing technique encodes domain labels as well as class labels of the training data (source and target samples), producing four groups of pairs. We then introduce a multi-class discriminator with four outputs and design an adversarial learning strategy to find a shared feature space. Our method also encourages the semantic alignment of classes, while other adversarial UDA approaches do not.
3 Few-shot adversarial domain adaptation
In this section we describe the model we propose to address supervised domain adaptation (SDA). We are given a training dataset made of pairs $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{N}$. The feature $x_i^s \in \mathcal{X}$ is a realization from a random variable $X^s$, and the label $y_i^s \in \mathcal{Y}$ is a realization from a random variable $Y$. In addition, we are also given the training data $\mathcal{D}_t = \{(x_i^t, y_i^t)\}_{i=1}^{M}$, where $x_i^t \in \mathcal{X}$ is a realization from a random variable $X^t$, and the labels $y_i^t \in \mathcal{Y}$. We assume that there is a covariate shift shimodaira00jspi between $X^s$ and $X^t$, i.e., there is a difference between the probability distributions $p(X^s)$ and $p(X^t)$. We say that $X^s$ represents the source domain and that $X^t$ represents the target domain. Under this setting the goal is to learn a prediction function $f: \mathcal{X} \to \mathcal{Y}$ that performs well at test time on data from the target domain.
The problem formulated thus far is typically referred to as supervised domain adaptation. In this work we are especially concerned with the version of this problem where only very few target labeled samples per class are available. We aim at handling cases where there is only one target labeled sample, and there can even be some classes with no target samples at all.
In the absence of covariate shift, a visual classifier $f$ is trained by minimizing a classification loss
$$\mathcal{L}_C(f) = E[\ell(f(X^s), Y)], \qquad (1)$$
where $E[\cdot]$ denotes statistical expectation and $\ell$ could be any appropriate loss function. When the distributions of $X^s$ and $X^t$ are different, a deep model $f_s$ trained with $\mathcal{D}_s$ will have reduced performance on the target domain. Increasing it would be trivial by simply training a new model $f_t$ with data $\mathcal{D}_t$. However, $\mathcal{D}_t$ is small and deep models require large amounts of labeled data.
In general, $f$ could be modeled by the composition of two functions, i.e., $f = h \circ g$. Here $g: \mathcal{X} \to \mathcal{Z}$ would be an inference from the input space $\mathcal{X}$ to a feature or inference space $\mathcal{Z}$, and $h: \mathcal{Z} \to \mathcal{Y}$ would be a function for predicting from the feature space. With this notation we would have $f_s = h_s \circ g_s$ and $f_t = h_t \circ g_t$, and the SDA problem would be about finding the best approximation for $g_t$ and $h_t$, given the constraints on the available data.
If $g_s$ and $g_t$ are able to embed source and target samples, respectively, to a domain invariant space, it is safe to assume from the feature to the label space that $h_t = h_s = h$. Therefore, domain adaptation paradigms are looking for such inference functions so that they can use the prediction function $h_s$ for target samples.
Traditional unsupervised DA (UDA) paradigms try to align the distributions of the features in the feature space, mapped from the source and the target domains, using a metric between distributions. Maximum Mean Discrepancy grettonBRSS06nips is a popular one, and other metrics like the Kullback-Leibler divergence kullback1951information and the Jensen–Shannon NIPS2014_5423 divergence have become popular when using adversarial learning. Once the distributions are aligned, a classifier function would no longer be able to tell whether a sample is coming from the source or the target domain. Recent UDA paradigms try to find inference functions that satisfy this important goal using adversarial learning. Adversarial training looks for a domain discriminator $D$ that is able to distinguish between samples of source and target distributions. In this case $D$ is a binary classifier trained with the standard cross-entropy loss
$$\mathcal{L}_{adv\text{-}D}(X^s, X^t, g_s, g_t) = -E[\log(D(g_s(X^s)))] - E[\log(1 - D(g_t(X^t)))]. \qquad (2)$$
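As an illustration of the binary cross-entropy that a domain discriminator is trained with, here is a minimal sketch in plain Python. It assumes the discriminator outputs the probability that an embedded sample comes from the source domain; the function name `domain_discriminator_loss` and the example probabilities are hypothetical and not taken from the paper.

```python
import math

def domain_discriminator_loss(p_source, p_target, eps=1e-12):
    """Binary cross-entropy of a domain discriminator (illustrative only).

    p_source: predicted P(source) for embedded source samples.
    p_target: predicted P(source) for embedded target samples.
    """
    # Source samples should be scored close to 1.
    src_term = sum(-math.log(max(p, eps)) for p in p_source) / len(p_source)
    # Target samples should be scored close to 0.
    tgt_term = sum(-math.log(max(1.0 - p, eps)) for p in p_target) / len(p_target)
    return src_term + tgt_term

# A discriminator that separates the domains well has a low loss ...
confident = domain_discriminator_loss([0.9, 0.8], [0.1, 0.2])
# ... while one that outputs 0.5 everywhere (full confusion) sits at 2*log(2).
confused = domain_discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

Updating the inference function adversarially pushes the discriminator from the first situation toward the second.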
Once the discriminator is learned, adversarial learning tries to update the target inference function in order to confuse the discriminator. In other words, the adversarial training is looking for an inference function that is able to map a target sample to a feature space such that the discriminator will no longer distinguish it from a source sample.
From the above discussion it is clear that, in order to perform well, UDA needs to align the distributions effectively. This can happen only if the distributions are represented by a sufficiently large dataset. Therefore, UDA approaches are in a position of weakness when we assume $\mathcal{D}_t$ to be small. Moreover, UDA approaches have another intrinsic limitation: even with perfect confusion alignment, there is no guarantee that samples from different domains but with the same class label will map nearby in the feature space. This lack of semantic alignment is a major source of performance reduction.
3.1 Handling Scarce Target Data
We are interested in the case where very few labeled target samples (as low as 1 sample per class) are available. We are facing two challenges in this setting. First, since the size of $\mathcal{D}_t$ is small, we need to find a way to augment it. Second, we need to somehow use the label information of $\mathcal{D}_t$. Therefore, we create pairs of samples. In this way, we are able to alleviate the lack of training target samples by pairing them with each training source sample. In Motiian_2017_ICCV , we have shown that creating positive and negative pairs using source and target data is very effective for SDA. Since the method proposed in Motiian_2017_ICCV does not encode the domain information of the samples, it cannot be used in adversarial learning. Here we extend Motiian_2017_ICCV by creating 4 groups of pairs ($\mathcal{G}_i$, $i = 1, 2, 3, 4$) as follows: we break down the positive pairs into two groups (Groups 1 and 2), where pairs of the first group consist of samples from the source distribution with the same class labels, while pairs of the second group also have the same class label but come from different distributions (one from the source and one from the target distribution). This is important because we can encode both label and domain information of training samples. Similarly, we break down the negative pairs into two groups (Groups 3 and 4), where pairs of the third group consist of samples from the source distribution with different class labels, while pairs of the fourth group have different class labels and come from different distributions (one from the source and one from the target distribution). See Figure 1. In order to give each group the same number of members we use all possible pairs from $\mathcal{G}_2$, as it is the smallest, and then uniformly sample from the pairs in $\mathcal{G}_1$, $\mathcal{G}_3$, and $\mathcal{G}_4$ to match the size of $\mathcal{G}_2$. Other reasonable proportions between the numbers of pairs can also be used.
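The pairing and balancing procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: samples are represented as `(sample, label)` tuples, the function name `make_groups` and the toy data are hypothetical, and the balancing simply subsamples the larger groups down to the size of Group 2.

```python
import random
from itertools import product

def make_groups(source, target, seed=0):
    """Build the four pair groups from lists of (sample, label) tuples."""
    groups = {1: [], 2: [], 3: [], 4: []}
    # Source-source pairs: Group 1 (same class) or Group 3 (different class).
    for i in range(len(source)):
        for j in range(i + 1, len(source)):
            (xa, ya), (xb, yb) = source[i], source[j]
            groups[1 if ya == yb else 3].append((xa, xb))
    # Source-target pairs: Group 2 (same class) or Group 4 (different class).
    for (xs, ys), (xt, yt) in product(source, target):
        groups[2 if ys == yt else 4].append((xs, xt))
    # Keep every Group 2 pair and subsample the others to the same size.
    rng = random.Random(seed)
    size = len(groups[2])
    return {
        g: pairs if g == 2 else rng.sample(pairs, min(size, len(pairs)))
        for g, pairs in groups.items()
    }

# Toy data: 8 source samples (4 per class) and 1 target sample per class.
source = [(f"s{i}", i % 2) for i in range(8)]
target = [("t0", 0), ("t1", 1)]
groups = make_groups(source, target)
```

With one labeled target sample per class, Group 2 already contains as many pairs as there are same-class source samples, which is how pairing alleviates the scarcity of target data.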
In classical adversarial learning we would at this point learn a domain discriminator, but since we have semantic information to consider as well, we are interested in learning a multi-class discriminator (we call it domain-class discriminator (DCD)) in order to introduce semantic alignment of the source and target domains. By expanding the binary classifier to its multi-class equivalent, we can train a classifier that will evaluate which of the 4 groups a given sample pair belongs to. We model the DCD with 2 fully connected layers with a softmax activation in the last layer, which we can train with the standard categorical cross-entropy loss
$$\mathcal{L}_{FADA\text{-}D} = -E\Big[\sum_{i=1}^{4} y_{\mathcal{G}_i} \log(D(\phi(\mathcal{G}_i)))\Big], \qquad (3)$$
where $y_{\mathcal{G}_i}$ is the label of $\mathcal{G}_i$ and $D$ is the DCD function. $\phi$ is a symbolic function that takes a pair as input and outputs the concatenation of the results of the appropriate inference functions. The output of $\phi$ is passed to the DCD (Figure 2).
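To make the four-way cross-entropy concrete, the following plain-Python sketch computes it for a batch of pairs. It assumes the DCD has already produced logits for each pair; the helper names `softmax`, `phi`, and `dcd_loss` and all numbers are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def phi(emb_a, emb_b):
    """Concatenate the embeddings of the two samples of a pair."""
    return list(emb_a) + list(emb_b)

def dcd_loss(group_probs, group_labels, eps=1e-12):
    """Mean categorical cross-entropy; group labels are in {1, 2, 3, 4}."""
    losses = [-math.log(max(p[g - 1], eps)) for p, g in zip(group_probs, group_labels)]
    return sum(losses) / len(losses)

# The DCD input for one pair is the concatenation of two embeddings.
pair_input = phi([0.2, -1.0], [0.4, 0.9])

# Hypothetical DCD outputs: one confident prediction, one uninformative.
probs_confident = softmax([4.0, 0.0, 0.0, 0.0])
probs_uniform = softmax([0.0, 0.0, 0.0, 0.0])
loss = dcd_loss([probs_confident, probs_uniform], [1, 3])
```

An uninformative DCD pays log(4) per pair, so any loss below that indicates it has learned something about domain or class membership.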
In the second step, we are interested in updating $g_t$ in order to confuse the DCD in such a way that the DCD can no longer distinguish between groups 1 and 2, and also between groups 3 and 4, using the loss
$$\mathcal{L}_{FADA\text{-}g} = -E\big[y_{\mathcal{G}_1} \log(D(\phi(\mathcal{G}_2))) + y_{\mathcal{G}_3} \log(D(\phi(\mathcal{G}_4)))\big]. \qquad (4)$$
(4) is inspired by the non-saturating game goodfellow2016nips and will force the inference function to embed target samples in a space where the DCD will no longer be able to distinguish between them.
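The confusion step can be illustrated with a short plain-Python sketch. It assumes the non-saturating formulation in which cross-domain pairs are scored against the label of their same-domain counterpart (Group 2 against label 1, Group 4 against label 3); the name `confusion_loss` and the probabilities are hypothetical.

```python
import math

# Cross-domain groups are pushed toward their source-only counterparts.
CONFUSION_TARGET = {2: 1, 4: 3}

def confusion_loss(group_probs, group_labels, eps=1e-12):
    """Cross-entropy of Group 2 / Group 4 pairs against labels 1 / 3."""
    terms = []
    for probs, group in zip(group_probs, group_labels):
        if group in CONFUSION_TARGET:  # Groups 1 and 3 do not contribute.
            wanted = CONFUSION_TARGET[group]
            terms.append(-math.log(max(probs[wanted - 1], eps)))
    return sum(terms) / len(terms) if terms else 0.0

# A Group 2 pair that the DCD still recognizes as Group 2: high loss.
recognized = confusion_loss([[0.05, 0.90, 0.03, 0.02]], [2])
# The same pair once the DCD mistakes it for Group 1: low loss.
fooled = confusion_loss([[0.90, 0.05, 0.03, 0.02]], [2])
```

Minimizing this quantity with respect to the inference function, while the DCD is held fixed, is what drives the embeddings of target samples toward those of source samples of the same class.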
Connection with multiclass discriminators:
Consider an image generation task where training samples come from $k$ classes. Learning the image generator can be done by any standard $k$-class classifier and adding generated samples as a new class (generated class) and correspondingly increasing the dimension of the classifier output from $k$ to $k+1$. During the adversarial learning, only the generated class is confused. This has proven effective for image generation salimans2016improved and other tasks. However, this is different from the proposed DCD, where group 1 is confused with 2, and group 3 is confused with 4. Inspired by salimans2016improved , we are able to create a classifier to also guarantee a high classification accuracy. Therefore, we suggest that (4) needs to be minimized together with the main classifier loss
$$\mathcal{L}_{FADA\text{-}g} = -\gamma E\big[y_{\mathcal{G}_1} \log(D(\phi(\mathcal{G}_2))) + y_{\mathcal{G}_3} \log(D(\phi(\mathcal{G}_4)))\big] + E[\ell(f(X^s), Y)] + E[\ell(f(X^t), Y)], \qquad (5)$$
where $\gamma$ strikes the balance between classification and confusion. Misclassifying pairs from group 2 as group 1, and likewise pairs from group 4 as group 3, means that the DCD is no longer able to distinguish positive or negative pairs of different distributions from positive or negative pairs of the source distribution, while the classifier is still able to discriminate positive pairs from negative pairs. This simultaneously satisfies the two main goals of SDA, domain confusion and class separability in the feature space. UDA only looks for domain confusion and does not address class separability, because of the lack of labeled target samples.
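The role of the balancing weight can be illustrated with a toy calculation. The sketch below assumes the combined objective is a weighted sum of a confusion term and classification terms on source and target samples; the function name, the two candidate embeddings, and every loss value are hypothetical.

```python
def generator_objective(confusion, cls_source, cls_target, gamma):
    """Weighted sum: gamma * confusion term + classification terms."""
    return gamma * confusion + cls_source + cls_target

# Two hypothetical candidate embeddings, described only by their loss values.
well_separated = {"confusion": 2.0, "cls_source": 0.1, "cls_target": 0.2}
well_confused = {"confusion": 0.3, "cls_source": 0.4, "cls_target": 0.6}

def preferred(gamma):
    """Return which candidate the combined objective favors for this gamma."""
    a = generator_objective(gamma=gamma, **well_separated)
    b = generator_objective(gamma=gamma, **well_confused)
    return "well_separated" if a < b else "well_confused"

small_gamma_choice = preferred(0.1)  # classification dominates
large_gamma_choice = preferred(1.0)  # confusion dominates
```

A small weight favors the embedding that classifies well but is easy for the DCD to read, while a large weight favors the embedding that confuses the DCD at some cost in classification loss.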
Connection with conditional GANs:
Concatenation of outputs of different inferences has been done before in conditional GANs. For example, reed2016generative ; reed2016learning ; zhang2016stackgan concatenate the input text to the penultimate layers of the discriminators. isola2016image concatenates positive and negative pairs before passing them to the discriminator. However, all of them use the vanilla binary discriminator.
Relationship between $g_s$ and $g_t$:
There is no restriction for $g_s$ and $g_t$, and they can be constrained or unconstrained. An obvious choice of constraint is equality (weight-sharing), which makes the inference functions symmetric. This can be seen as a regularizer and will reduce overfitting Motiian_2017_ICCV . Another approach would be learning an asymmetric inference function rozantsev2016beyond . Since we have access to very few target samples, we use weight-sharing ($g_s = g_t = g$).
Choice of $g_s$, $g_t$, and $h$:
Since we are interested in visual recognition, the inference functions $g_s$ and $g_t$ are modeled by a convolutional neural network (CNN) with some initial convolutional layers, followed by some fully connected layers which are described specifically in the experiments section. In addition, the prediction function $h$ is modeled by fully connected layers with a softmax activation function for the last layer.
Training Process:
Here we discuss the training process for the weight-sharing regularizer ($g_s = g_t = g$). Once the inference functions and the prediction function are chosen, FADA takes the following steps. First, $g$ and $h$ are initialized using the source dataset $\mathcal{D}_s$. Then, the four groups of pairs are created using $\mathcal{D}_s$ and $\mathcal{D}_t$. The next step is training the DCD using the four groups of pairs, which is done while freezing $g$. In the next step, the inference function $g$ and the prediction function $h$ are updated in order to confuse the DCD and maintain high classification accuracy, which is done while freezing the DCD. See Algorithm 1 and Figure 2. The training process for the non-weight-sharing case can be derived similarly.
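The order of these steps can be summarized as a small control-flow skeleton. This is a sketch only: the four callbacks are hypothetical placeholders for the actual initialization, pairing, and update routines, and repeating the two update steps for `n_rounds` rounds is an assumption about how the alternation is scheduled.

```python
def train_fada(init_g_h, build_groups, update_dcd, update_g_h, n_rounds=1):
    """Run the FADA steps in order, alternating the two adversarial updates."""
    init_g_h()               # initialize g and h on the source dataset
    groups = build_groups()  # create the four groups of pairs
    for _ in range(n_rounds):
        update_dcd(groups)   # train the DCD while g is frozen
        update_g_h(groups)   # update g and h while the DCD is frozen

# Stub callbacks that only record the order in which they are called.
calls = []
train_fada(
    init_g_h=lambda: calls.append("init g,h on source"),
    build_groups=lambda: calls.append("build groups") or {1: [], 2: [], 3: [], 4: []},
    update_dcd=lambda groups: calls.append("update DCD (g frozen)"),
    update_g_h=lambda groups: calls.append("update g,h (DCD frozen)"),
    n_rounds=2,
)
```

In a real implementation each callback would wrap an optimizer step in the chosen deep learning framework; the skeleton only fixes which component is frozen at which point.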
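Columns tzeng2014deep, rozantsev2016beyond, and ghifary2016deep are traditional UDA methods; columns liu2016coupled, tzeng2017adversarial, and sankaranarayanan2017generate are adversarial UDA methods. Columns 1 to 7 give the accuracy of the SDA method named in the SDA column for that number of labeled target samples per class. A dash marks an entry that is not reported.
Task | LB | tzeng2014deep | rozantsev2016beyond | ghifary2016deep | liu2016coupled | tzeng2017adversarial | sankaranarayanan2017generate | SDA | 1 | 2 | 3 | 4 | 5 | 6 | 7
$\mathcal{M} \to \mathcal{U}$ | 65.4 | 47.8 | 60.7 | 91.8 | 91.2 | 89.4 | 92.5 | FT | 82.3 | 84.9 | 85.7 | 86.5 | 87.2 | 88.4 | 88.6
 | | | | | | | | Motiian_2017_ICCV | 85.0 | 89.0 | 90.1 | 91.4 | 92.4 | 93.0 | 92.9
 | | | | | | | | FADA | 89.1 | 91.3 | 91.9 | 93.3 | 93.4 | 94.0 | 94.4
$\mathcal{U} \to \mathcal{M}$ | 58.6 | 63.1 | 67.3 | 73.7 | 89.1 | 90.1 | 90.8 | FT | 72.6 | 78.2 | 81.9 | 83.1 | 83.4 | 83.6 | 84.0
 | | | | | | | | Motiian_2017_ICCV | 78.4 | 82.2 | 85.8 | 86.1 | 88.8 | 89.6 | 89.4
 | | | | | | | | FADA | 81.1 | 84.2 | 87.5 | 89.9 | 91.1 | 91.2 | 91.5
$\mathcal{S} \to \mathcal{M}$ | 60.1 | - | - | 82.0 | 76.0 | - | 84.7 | FT | 65.5 | 68.6 | 70.7 | 73.3 | 74.5 | 74.6 | 75.4
 | | | | | | | | FADA | 72.8 | 81.8 | 82.6 | 85.1 | 86.1 | 86.8 | 87.2
$\mathcal{M} \to \mathcal{S}$ | 20.3 | - | - | 40.1 | - | - | 36.4 | FT | 29.7 | 31.2 | 36.1 | 36.7 | 38.1 | 38.3 | 39.1
 | | | | | | | | FADA | 37.7 | 40.5 | 42.9 | 46.3 | 46.1 | 46.8 | 47.0
$\mathcal{S} \to \mathcal{U}$ | 66.0 | - | - | - | - | - | - | FT | 69.4 | 71.8 | 74.3 | 76.2 | 78.1 | 77.9 | 78.9
 | | | | | | | | FADA | 78.3 | 83.2 | 85.2 | 85.7 | 86.2 | 87.1 | 87.5
$\mathcal{U} \to \mathcal{S}$ | 15.3 | - | - | - | - | - | - | FT | 19.9 | 22.2 | 22.8 | 24.6 | 25.4 | 25.4 | 25.6
 | | | | | | | | FADA | 27.5 | 29.8 | 34.5 | 36.0 | 37.9 | 41.3 | 42.9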
4 Experiments
We present results using the Office dataset saenkoKFD2010eccv , the MNIST dataset lecun1998gradient , the USPS dataset hull1994database , and the SVHN dataset netzer2011reading .
4.1 MNIST-USPS-SVHN Datasets
The MNIST ($\mathcal{M}$), USPS ($\mathcal{U}$), and SVHN ($\mathcal{S}$) datasets have recently been used for domain adaptation fernandoTT15prl ; rozantsev2016beyond ; tzeng2017adversarial . They contain images of digits from 0 to 9 in various environments, including in the wild in the case of SVHN netzer2011reading . We considered six cross-domain tasks. The first two tasks, $\mathcal{M} \to \mathcal{U}$ and $\mathcal{U} \to \mathcal{M}$, followed the experimental setting in fernandoTT15prl ; rozantsev2016beyond ; liu2016coupled ; tzeng2017adversarial ; sankaranarayanan2017generate , which involves randomly selecting 2000 images from MNIST and 1800 images from USPS. For the rest of the cross-domain tasks, $\mathcal{S} \to \mathcal{M}$, $\mathcal{M} \to \mathcal{S}$, $\mathcal{S} \to \mathcal{U}$, and $\mathcal{U} \to \mathcal{S}$, we used all training samples of the source domain for training and all testing samples of the target domain for testing.
Since fernandoTT15prl ; rozantsev2016beyond ; liu2016coupled ; tzeng2017adversarial ; sankaranarayanan2017generate introduced unsupervised methods, they used all samples of a target domain as unlabeled data in training. Here instead, we randomly selected $n$ labeled samples per class from target domain data and used them in training. We evaluated our approach for $n$ ranging from 1 to 7 and repeated each experiment several times (we only show the mean of the accuracies for this experiment because the standard deviation is very small).
Since the images of the USPS dataset have $16 \times 16$ pixels, we resized the images of the MNIST and SVHN datasets to $16 \times 16$ pixels. We assume $g_s$ and $g_t$ share weights ($g = g_s = g_t$) for this experiment. Similar to lecun1998gradient , we used 2 convolutional layers with 6 and 16 filters of $5 \times 5$ kernels followed by max-pooling layers and 2 fully connected layers with size 120 and 84 as the inference function $g$, and one fully connected layer with softmax activation as the prediction function $h$. Also, we used 2 fully connected layers, the last one with 4 outputs, as the DCD (a 4-group classifier). Training for each stage was done using the Adam optimizer kingmab14 . We compare our method with an SDA method under the same conditions, and with recent UDA methods. UDA methods use all target samples in their training stage, while we only use very few labeled target samples per category in training.
Table 1 shows the classification accuracies, where FADA-$n$ stands for our method when we use $n$ labeled target samples per category in training. FADA works well even when only one target sample per category ($n = 1$) is available in training. Also, we can see that the accuracy goes up as $n$ increases. This is interesting because we can get accuracies comparable with the state of the art using only 10 labeled target samples (one sample per class) instead of using thousands of unlabeled target samples. We also report the lower bound (LB) of our model, which corresponds to training the base model using only source samples. Moreover, we report the accuracies obtained by fine-tuning (FT) the base model on the available target data. Although Table 1 shows that FT increases the accuracies over LB, it performs worse than the SDA methods.
Figure 3 shows how much improvement can be obtained with respect to the base model. The base model is the lower bound LB. This is simply obtained by training $g$ and $h$ with only the classification loss and source training data; so, no adaptation is performed.
Weight-sharing. As we discussed earlier, weight-sharing can be seen as a regularizer that prevents the target network $g_t$ from overfitting. This is important because $g_t$ can be easily overfitted since target data is scarce. We also repeated one of the experiments without sharing weights, which gave a lower average accuracy than the weight-sharing case.
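Columns tzeng2014deep, long2015learning, and ghifary2016deep are unsupervised methods; columns tzengHDS15iccv, koniusz2016domain, Motiian_2017_ICCV, and FADA are supervised methods.
Task | LB | tzeng2014deep | long2015learning | ghifary2016deep | tzengHDS15iccv | koniusz2016domain | Motiian_2017_ICCV | FADA
$\mathcal{A} \to \mathcal{W}$ | 61.2 ± 0.9 | 61.8 ± 0.4 | 68.5 ± 0.4 | 68.7 ± 0.3 | 82.7 ± 0.8 | 84.5 ± 1.7 | 88.2 ± 1.0 | 88.1 ± 1.2
$\mathcal{A} \to \mathcal{D}$ | 62.3 ± 0.8 | 64.4 ± 0.3 | 67.0 ± 0.4 | 67.1 ± 0.3 | 86.1 ± 1.2 | 86.3 ± 0.8 | 89.0 ± 1.2 | 88.2 ± 1.0
$\mathcal{W} \to \mathcal{A}$ | 51.6 ± 0.9 | 52.2 ± 0.4 | 53.1 ± 0.3 | 54.09 ± 0.5 | 65.0 ± 0.5 | 65.7 ± 1.7 | 72.1 ± 1.0 | 71.1 ± 0.9
$\mathcal{W} \to \mathcal{D}$ | 95.6 ± 0.7 | 98.5 ± 0.4 | 99.0 ± 0.2 | 99.0 ± 0.2 | 97.6 ± 0.2 | 97.5 ± 0.7 | 97.6 ± 0.4 | 97.5 ± 0.6
$\mathcal{D} \to \mathcal{A}$ | 58.5 ± 0.8 | 52.1 ± 0.8 | 54.0 ± 0.4 | 56.0 ± 0.5 | 66.2 ± 0.3 | 66.5 ± 1.0 | 71.8 ± 0.5 | 68.1 ± 0.6
$\mathcal{D} \to \mathcal{W}$ | 80.1 ± 0.6 | 95.0 ± 0.5 | 96.0 ± 0.3 | 96.4 ± 0.3 | 95.7 ± 0.5 | 95.5 ± 0.6 | 96.4 ± 0.8 | 96.4 ± 0.8
Average | 68.2 | 70.6 | 72.9 | 73.6 | 82.2 | 82.6 | 85.8 | 84.9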
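As a quick sanity check on the Average row of the table, the per-task means can be recomputed directly from the listed accuracies. The snippet below does this for the first and last accuracy columns, in the row order in which they appear; the variable names are hypothetical.

```python
# Mean accuracies of the six tasks, in table row order.
first_column = [61.2, 62.3, 51.6, 95.6, 58.5, 80.1]  # LB
last_column = [88.1, 88.2, 71.1, 97.5, 68.1, 96.4]   # final column

def column_average(values):
    """Average of a column, rounded to one decimal as in the table."""
    return round(sum(values) / len(values), 1)

first_average = column_average(first_column)
last_average = column_average(last_column)
```

Both values agree with the Average row (68.2 and 84.9), which corresponds to an average gain of 16.7 points over the lower bound.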
4.2 Office Dataset
The Office dataset is a standard benchmark dataset for visual domain adaptation. It contains 31 object classes for three domains: Amazon, Webcam, and DSLR, indicated as $\mathcal{A}$, $\mathcal{W}$, and $\mathcal{D}$, for a total of 4,652 images. The first domain, $\mathcal{A}$, consists of images downloaded from online merchants; the second, $\mathcal{W}$, consists of low resolution images acquired by webcams; the third, $\mathcal{D}$, consists of high resolution images collected with digital SLRs. We consider four domain shifts using the three domains ($\mathcal{A} \to \mathcal{W}$, $\mathcal{A} \to \mathcal{D}$, $\mathcal{W} \to \mathcal{A}$, and $\mathcal{D} \to \mathcal{A}$). Since there is not a considerable domain shift between $\mathcal{W}$ and $\mathcal{D}$, we exclude $\mathcal{W} \to \mathcal{D}$ and $\mathcal{D} \to \mathcal{W}$.
We followed the setting described in tzengHDS15iccv . All classes of the office dataset and 5 traintest splits are considered. For the source domain, 20 examples per category for the Amazon domain, and 8 examples per category for the DSLR and Webcam domains are randomly selected for training for each split. Also, 3 labeled examples are randomly selected for each category in the target domain for training for each split. The rest of the target samples are used for testing. Note that we used the same splits generated by tzengHDS15iccv .
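The per-class sampling protocol described above can be sketched as follows. This is an illustrative plain-Python version operating on `(sample, label)` tuples; the function name `sample_split`, its parameters, and the toy data are hypothetical, and it does not reproduce the actual splits of tzengHDS15iccv.

```python
import random
from collections import defaultdict

def sample_split(source, target, n_source_per_class, n_target_per_class=3, seed=0):
    """Sample training examples per class; unused target samples form the test set."""
    rng = random.Random(seed)

    def by_class(items):
        grouped = defaultdict(list)
        for x, y in items:
            grouped[y].append((x, y))
        return grouped

    # Fixed number of source training examples per category.
    src_train = []
    for _, items in sorted(by_class(source).items()):
        src_train += rng.sample(items, n_source_per_class)

    # A few labeled target examples per category; the rest is used for testing.
    tgt_train, tgt_test = [], []
    for _, items in sorted(by_class(target).items()):
        chosen = set(rng.sample(range(len(items)), n_target_per_class))
        for i, item in enumerate(items):
            (tgt_train if i in chosen else tgt_test).append(item)
    return src_train, tgt_train, tgt_test

# Toy data with 2 classes: 25 source and 10 target samples per class.
source = [(f"a{i}", i % 2) for i in range(50)]
target = [(f"w{i}", i % 2) for i in range(20)]
src_train, tgt_train, tgt_test = sample_split(source, target, n_source_per_class=20)
```

Calling the function with a different `seed` for each split would give independent train-test splits under the same protocol.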
In addition to the SDA algorithms, we report the results of some recent UDA algorithms. They follow a different experimental protocol compared to the SDA algorithms, and use all samples of the target domain in training as unlabeled data together with all samples of the source domain. So, we cannot make an exact comparison between results. However, since UDA algorithms use all samples of the target domain in training and we use only very few of them (3 per class), we think it is still worth looking at how they differ.
Here we are interested in the case where $g_s$ and $g_t$ share weights ($g_s = g_t = g$). For the inference function $g$, we used the convolutional layers of the VGG-16 architecture Simonyan14c followed by 2 fully connected layers with output size of 1024 and 128, respectively. For the prediction function $h$, we used a fully connected layer with softmax activation. Similar to tzengHDS15iccv , we used the weights pre-trained on the ImageNet dataset imagenet2015 for the convolutional layers, and initialized the fully connected layers using all the source domain data. We model the DCD with 2 fully connected layers with a softmax activation in the last layer.
Table 2 reports the classification accuracy over 31 classes for the Office dataset and shows that FADA has performance comparable to the state of the art.
5 Conclusions
We have introduced a deep model combining a classification and an adversarial loss to address SDA in the few-shot learning regime. We have shown that adversarial learning can be augmented to address SDA. The approach is general in the sense that the architecture sub-components can be changed. We found that addressing the semantic distribution alignments with point-wise surrogates of distribution distances and similarities for SDA works very effectively, even when labeled target samples are very few. In addition, we found the SDA accuracy to converge very quickly as more labeled target samples per category become available. The approach shows clear promise as it sets new state-of-the-art performance in the experiments.
References
 (1) Y. Aytar and A. Zisserman. Tabula rasa: Model transfer for object category detection. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2252–2259. IEEE, 2011.
 (2) M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Unsupervised domain adaptation by domain invariant projection. In IEEE ICCV, pages 769–776, 2013.
 (3) C. J. Becker, C. M. Christoudias, and P. Fua. Nonlinear domain adaptation with boosting. In Advances in Neural Information Processing Systems, pages 485–493, 2013.
 (4) A. Bergamo and L. Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems, pages 181–189, 2010.

 (5) J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing, pages 120–128. Association for Computational Linguistics, 2006.
 (6) L. Chen, W. Li, and D. Xu. Recognizing RGB images by learning from RGB-D data. In CVPR, pages 1418–1425, June 2014.
 (7) S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE, 2005.
 (8) H. Daume III and D. Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126, 2006.
 (9) J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: a deep convolutional activation feature for generic visual recognition. In arXiv:1310.1531, 2013.
 (10) L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer svm for video concept detection. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1375–1381. IEEE, 2009.
 (11) B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In IEEE ICCV, pages 2960–2967, 2013.
 (12) B. Fernando, T. Tommasi, and T. Tuytelaars. Joint cross-domain classification and subspace learning for unsupervised adaptation. Pattern Recognition Letters, 2015.
 (13) Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.

 (14) Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
 (15) M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
 (16) B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.
 (17) I. Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
 (18) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
 (19) R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In IEEE ICCV, pages 999–1006, 2011.
 (20) A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In NIPS, 2006.
 (21) Y. Guo and M. Xiao. Cross language text classification via subspace co-regularized multi-view learning. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012.
 (22) E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
 (23) J. Hu, J. Lu, and Y.-P. Tan. Discriminative deep metric learning for face verification in the wild. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1875–1882, June 2014.
 (24) J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on pattern analysis and machine intelligence, 16(5):550–554, 1994.
 (25) P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
 (26) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 (27) P. Koniusz, Y. Tas, and F. Porikli. Domain adaptation by mixture of alignments of second- or higher-order scatter tensors. arXiv preprint arXiv:1611.08195, 2016.
 (28) B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1785–1792. IEEE, 2011.
 (29) S. Kullback and R. A. Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
 (30) B. Kumar, G. Carneiro, I. Reid, et al. Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5385–5394, 2016.
 (31) M. Lapin, M. Hein, and B. Schiele. Learning using privileged information: SVM+ and weighted SVM. Neural Networks, 53:95–108, 2014.
 (32) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 (33) M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
 (34) M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.
 (35) M. Long, G. Ding, J. Wang, J. Sun, Y. Guo, and P. S. Yu. Transfer sparse coding for robust image representation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 407–414, 2013.
 (36) S. Motiian and G. Doretto. Information bottleneck domain adaptation with privileged information for visual recognition. In European Conference on Computer Vision, pages 630–647. Springer, 2016.
 (37) S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. Information bottleneck learning using privileged information for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1496–1505, 2016.
 (38) S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. Unified deep supervised domain adaptation and generalization. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 (39) K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In ICML (1), pages 10–18, 2013.
 (40) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, 2011.
 (41) S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE TNN, 22(2):199–210, 2011.
 (42) J. Ponce, T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, et al. Dataset issues in object recognition. In Toward category-level object recognition, pages 29–48. Springer, 2006.
 (43) S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In Proceedings of the 33rd International Conference on Machine Learning, Volume 48, pages 1060–1069. JMLR.org, 2016.
 (44) S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217–225, 2016.
 (45) A. Rozantsev, M. Salzmann, and P. Fua. Beyond sharing weights for deep domain adaptation. arXiv preprint arXiv:1603.06432, 2016.
 (46) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
 (47) K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, pages 213–226, 2010.
 (48) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
 (49) S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. arXiv preprint arXiv:1704.01705, 2017.
 (50) N. Sarafianos, M. Vrigkas, and I. A. Kakadiaris. Adaptive SVM+: Learning with privileged information for domain adaptation. arXiv preprint arXiv:1708.09083, 2017.
 (51) H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
 (52) A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 (53) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 (54) B. Sun and K. Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In Computer Vision–ECCV 2016 Workshops, pages 443–450. Springer, 2016.
 (55) T. Tommasi, M. Lanzi, P. Russo, and B. Caputo. Learning the roots of visual domain shift. In Computer Vision–ECCV 2016 Workshops, pages 475–482. Springer, 2016.
 (56) T. Tommasi, N. Patricia, B. Caputo, and T. Tuytelaars. A deeper look at dataset bias. In German Conference on Pattern Recognition, pages 504–516. Springer, 2015.
 (57) A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528, 2011.
 (58) E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
 (59) E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 (60) E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.

 (61) R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human re-identification. In European Conference on Computer Vision, pages 135–153. Springer, 2016.
 (62) J. Yang, R. Yan, and A. G. Hauptmann. Adapting SVM classifiers to data with shifted distributions. In Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on, pages 69–76. IEEE, 2007.
 (63) T. Yao, Y. Pan, C.-W. Ngo, H. Li, and T. Mei. Semi-supervised domain adaptation with subspace learning for visual recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 (64) H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.