Humans are exceptional visual learners capable of generalizing their learned knowledge to novel domains and concepts and capable of learning from few examples. In recent years, computational models based on end-to-end learnable convolutional networks have made significant improvements for visual recognition [18, 28, 54] and have been shown to demonstrate some cross-task generalizations [8, 48] while enabling faster learning of subsequent tasks as most frequently evidenced through fine-tuning [14, 36, 50].
However, most efforts focus on the supervised learning scenario where a closed world assumption is made at training time about both the domain of interest and the tasks to be learned. Thus, any generalization ability of these models is only an observed byproduct. There has been a large push in the research community to address generalizing and adapting deep models across different domains [64, 13, 58, 38], to learn tasks in a data-efficient way through few-shot learning [27, 70, 47, 11], and to generically transfer information across tasks [1, 14, 50, 35].
While most approaches consider each scenario in isolation, we aim to directly tackle the joint problem of adapting to a novel domain which has new tasks and few annotations. Given a large labeled source dataset with annotations for a task set, A, we seek to transfer knowledge to a sparsely labeled target domain with a possibly wholly new task set, B. This setting is in line with our intuition that we should be able to learn reusable and general purpose representations which enable faster learning of future tasks with less human intervention. In addition, this setting closely matches the most common practical approach for training deep models, which is to use a large labeled source dataset (often ImageNet [6, 52]) to train an initial representation and then to continue supervised learning with a new set of data and often with new concepts.
In our approach, we jointly adapt a source representation for use in a distinct target domain using a new multi-layer unsupervised domain adversarial formulation while introducing a novel cross-domain and within-domain class similarity objective. This new objective can be applied even when the classes of the target domain do not overlap with those of the source domain.
We evaluate our approach in the challenging setting of joint transfer across domains and tasks and demonstrate our ability to successfully transfer, reducing the need for annotated data for the target domain and tasks. First, we present results transferring from a subset of Google Street View House Numbers (SVHN) containing only digits 0-4 to a subset of MNIST containing only digits 5-9. Second, we present results on the challenging setting of adapting from ImageNet object-centric images to UCF-101 videos for action recognition.
2 Related work
Domain adaptation. Domain adaptation seeks to learn, from related source domains, a model that performs well on the target data distribution. Existing work often assumes that both domains are defined on the same task and that labeled data in the target domain is sparse or non-existent. Several methods have tackled the problem with the Maximum Mean Discrepancy (MMD) loss [17, 36, 37, 38, 73] between the source and target domains. Weight sharing of CNN parameters [58, 22, 21, 3] and minimizing the distribution discrepancy of network activations [51, 65, 30] have also shown convincing results. Adversarial generative models [33, 32, 2, 59] aim at generating source-like data from target data by training a generator and a discriminator simultaneously, while adversarial discriminative models [62, 64, 13, 12, 23] focus on aligning the embedding feature representations of the target domain with those of the source domain. Inspired by adversarial discriminative models, we propose a method that aligns domain features with multi-layer information.
Transfer learning. In computer vision, examples of transfer learning include [1, 31, 61], which try to overcome the deficit of training samples for some categories by adapting classifiers trained for other categories. With the power of deep supervised learning and the ImageNet dataset [6, 52], learned knowledge can even transfer to a totally different task (e.g., image classification → object detection [50, 49, 34]; image classification → semantic segmentation) and then achieve state-of-the-art performance. In this paper, we focus on the setting where the source and target domains have differing label spaces but the label spaces share the same structure. That is, we adapt between classifying different category sets, but we do not transfer from classification to a localization plus classification task.
Few-shot learning. Few-shot learning seeks to learn new concepts with only a few annotated examples. Deep siamese networks are trained to rank similarity between examples. Matching networks learn a network that maps a small labeled support set and an unlabeled example to its label. Aside from these metric learning-based methods, meta-learning has also played an essential part. Ravi et al. propose to learn an LSTM meta-learner that learns the update rule of a learner. Finn et al. try to find a good initialization point that can be easily fine-tuned with new examples from new tasks. When there exists a domain shift, the results of prior few-shot learning methods are often degraded.
Unsupervised learning. Many unsupervised learning algorithms have focused on modeling raw data using reconstruction objectives [19, 69, 26]. Other probabilistic models, including restricted Boltzmann machines, deep Boltzmann machines, GANs [15, 10, 9], and autoregressive models [42, 66], are also popular. An alternative approach, often termed “self-supervised learning”, defines a pretext task such as predicting patch ordering, frame ordering, motion dynamics, or colorization, as a form of indirect supervision. Compared to these approaches, our unsupervised learning method does not rely on exploiting the spatial or temporal structure of the data, and is therefore more generic.
We introduce a semi-supervised learning algorithm which transfers information from a large labeled source domain, $\mathcal{S}$, to a sparsely labeled target domain, $\mathcal{T}$. The goal is to learn a strong target classifier without the large annotation overhead required by standard supervised learning approaches.
In fact, this setting is very commonly explored for convolutional network (convnet) based recognition methods. When learning with convnets, the usual learning procedure is to use a very large labeled dataset (e.g., ImageNet [6, 52]) for initial training of the network parameters (termed pre-training). The learned weights are then used as initialization for continued learning on new data and for new tasks, called fine-tuning. Fine-tuning has been broadly applied to reduce the number of labeled examples needed for learning new tasks, such as recognizing new object categories after ImageNet pre-training [54, 18], or learning new label structures such as detection after classification pre-training [14, 50]. Here we focus on transfer in the case of a shared label structure (e.g., classification of different category sets).
We assume the source domain contains $n^S$ images, $x^S \in \mathcal{X}^S$, with associated labels, $y^S \in \mathcal{Y}^S$. Similarly, the target domain consists of $n^T_u$ unlabeled images, $\tilde{x}^T \in \tilde{\mathcal{X}}^T$, as well as $n^T$ images, $x^T \in \mathcal{X}^T$, with associated labels, $y^T \in \mathcal{Y}^T$. We assume that the target domain is only sparsely labeled so that the number of image-label pairs is much smaller than the number of unlabeled images, $n^T \ll n^T_u$. Additionally, the number of source labeled images is assumed to be much larger than the number of target labeled images, $n^S \gg n^T$.
Unlike standard domain adaptation approaches which transfer knowledge from source to target domains assuming a marginal or conditional distribution shift under a shared label space ($\mathcal{Y}^S = \mathcal{Y}^T$), we tackle joint image or feature space adaptation as well as transfer across semantic spaces. Namely, we consider the case where the source and target label spaces are not equal, $\mathcal{Y}^S \neq \mathcal{Y}^T$, and even the most challenging case where the sets are non-overlapping, $\mathcal{Y}^S \cap \mathcal{Y}^T = \emptyset$.
3.1 Joint domain and semantic transfer
Our approach consists of unsupervised feature alignment between source and target as well as semantic transfer to the unlabeled target data from either the labeled target or the labeled source data. We introduce a new multi-layer domain discriminator which can be used for domain alignment following the recent domain adversarial learning approaches [13, 64]. We next introduce a new semantic transfer learning objective which uses cross category similarity and can be tuned to account for varying size of label set overlap.
We depict our overall model in Figure 1. We take the source labeled examples, $\{\mathcal{X}^S, \mathcal{Y}^S\}$, the target labeled examples, $\{\mathcal{X}^T, \mathcal{Y}^T\}$, and the unlabeled target images, $\tilde{\mathcal{X}}^T$, as input. We learn an initial layered source representation and classification network (depicted in blue in Figure 1) using standard supervised techniques. We then initialize the target model (depicted in green in Figure 1) with the source parameters and begin our adaptive transfer learning.
Our model jointly optimizes over a target supervised loss, $\mathcal{L}_{sup}$, a domain transfer objective, $\mathcal{L}_{DT}$, and finally a semantic transfer objective, $\mathcal{L}_{ST}$. Thus, our total objective can be written as follows:

$$\mathcal{L} = \mathcal{L}_{sup} + \alpha \mathcal{L}_{DT} + \beta \mathcal{L}_{ST} \tag{1}$$

where the hyperparameters $\alpha$ and $\beta$ determine the influence of the domain transfer loss and the semantic transfer loss, respectively. In the following sections we elaborate on our domain and semantic transfer objectives.
3.2 Multi-layer domain adversarial loss
We define a novel domain alignment objective function called multi-layer domain adversarial loss. Recent efforts in deep domain adaptation have shown strong performance using feature space domain adversarial objectives [13, 64]. These methods learn a target representation such that the target distribution viewed under this model is aligned with the source distribution viewed under the source representation. This alignment is accomplished through an adversarial minimization across domains, analogous to the prevalent generative adversarial approaches. In particular, a domain discriminator, $D$, is trained to classify whether a particular data point arises from the source or the target domain. Simultaneously, the target embedding function (defined as the application of the layers of the network) is trained to generate a target representation that cannot be distinguished from the source domain representation by the domain discriminator. Similar to [63, 64], we consider a representation to be domain invariant if the domain discriminator cannot distinguish examples from the two domains.
Prior work considers alignment for a single layer of the embedding at a time and as such learns a domain discriminator which takes the output from the corresponding source and target layers as input. Separately, domain alignment methods which focus on first and second order statistics have shown improved performance by applying domain alignment independently at multiple layers of the network. Rather than learning independent discriminators for each layer of the network, we propose a simultaneous alignment of multiple layers through a multi-layer discriminator.
At each layer of our multi-layer domain discriminator, information is accumulated from both the output of the previous discriminator layer and the source and target activations from the corresponding layer in their respective embeddings. Thus, the output of each discriminator layer is defined as:

$$d_l = D_l\big(\sigma(\gamma\, d_{l-1} \oplus E_l(x))\big) \tag{2}$$

where $l$ is the current layer, $D_l$ is the $l$-th discriminator layer, $E_l(x)$ is the activation of the $l$-th embedding layer, $\sigma(\cdot)$ is the activation function, $\gamma$ is the decay factor, $\oplus$ represents concatenation or element-wise summation, and $x$ is taken either from the source data or from the target data. Notice that the intermediate discriminator layers share the same structure with their corresponding encoding layers to match the dimensions.
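The layer-wise accumulation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the implementation used in our experiments: the merge is taken to be element-wise summation, the activation is a ReLU, each discriminator layer is a single linear map, the input to the first layer is assumed to be zero, and a sigmoid is applied to the final scalar. The helper names (`relu`, `linear`, `discriminator_forward`) and the default `gamma=0.5` are hypothetical.

```python
import math


def relu(v):
    """Element-wise ReLU on a list of floats (assumed activation)."""
    return [max(0.0, a) for a in v]


def linear(weights, bias, v):
    """Compute W v + b, where `weights` is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def discriminator_forward(activations, layers, gamma=0.5):
    """Accumulate embedding activations through the discriminator layers.

    activations: per-layer embedding outputs for one image, lowest layer first.
    layers: one (weights, bias) pair per discriminator layer; layer l must map
        the width of activations[l] to the width of activations[l + 1], and
        the last layer must output a single value.
    gamma: decay applied to the previous discriminator output.
    """
    d_prev = [0.0] * len(activations[0])  # assumption: nothing precedes layer 1
    for act, (weights, bias) in zip(activations, layers):
        if len(act) != len(d_prev):
            raise ValueError("discriminator and embedding widths must match")
        # Element-wise summation of the decayed previous output and the
        # current embedding activation (one of the two merge options).
        merged = [gamma * d + a for d, a in zip(d_prev, act)]
        d_prev = linear(weights, bias, relu(merged))
    # Assumption: squash the final scalar into a domain probability.
    return 1.0 / (1.0 + math.exp(-d_prev[0]))


# Toy example with embedding widths 3 and 2 (values are arbitrary).
toy_activations = [[1.0, -2.0, 0.5], [0.3, 0.7]]
toy_layers = [
    ([[0.1, 0.2, 0.3], [-0.1, 0.0, 0.4]], [0.0, 0.1]),  # 3 -> 2
    ([[1.0, -1.0]], [0.0]),                             # 2 -> 1
]
print(discriminator_forward(toy_activations, toy_layers))
```

Using summation rather than concatenation keeps each discriminator layer the same width as its embedding layer, which is why the sketch checks that the widths agree before merging.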
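Thus, the following loss functions are proposed to optimize the multi-layer domain discriminator and the embeddings, respectively, according to our domain transfer objective:

$$\mathcal{L}_{DT}^{D} = -\mathbb{E}_{x \sim \mathcal{X}^S}\big[\log d_l^S\big] - \mathbb{E}_{x \sim \{\mathcal{X}^T, \tilde{\mathcal{X}}^T\}}\big[\log(1 - d_l^T)\big] \tag{3}$$

$$\mathcal{L}_{DT}^{M} = -\mathbb{E}_{x \sim \mathcal{X}^S}\big[\log(1 - d_l^S)\big] - \mathbb{E}_{x \sim \{\mathcal{X}^T, \tilde{\mathcal{X}}^T\}}\big[\log d_l^T\big] \tag{4}$$

where $\mathcal{X}^S$ denotes the source images, $\{\mathcal{X}^T, \tilde{\mathcal{X}}^T\}$ the labeled and unlabeled target images, and $d_l^S$, $d_l^T$ are the outputs of the last layer of the source and target multi-layer domain discriminator. Note that these losses are placed after the final domain discriminator layer and the last embedding layer but then produce gradients which back-propagate throughout all relevant lower layer parameters. These two losses together comprise $\mathcal{L}_{DT}$, and there is no iterative optimization procedure involved.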
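To make the roles of the two losses concrete, the following sketch computes a discriminator loss and an embedding loss from final-layer discriminator outputs. It is a hedged illustration rather than our training code: it assumes the discriminator outputs are probabilities that an example comes from the source domain, and it assumes the embedding loss is the same binary cross entropy with the domain labels swapped. The function names and the `eps` smoothing term are hypothetical.

```python
import math


def discriminator_loss(d_source, d_target, eps=1e-12):
    """Binary cross entropy with source labeled 1 and target labeled 0.

    d_source, d_target: final-layer discriminator outputs in (0, 1) for a
    batch of source and target examples respectively.
    """
    source_term = -sum(math.log(d + eps) for d in d_source) / len(d_source)
    target_term = -sum(math.log(1.0 - d + eps) for d in d_target) / len(d_target)
    return source_term + target_term


def embedding_loss(d_source, d_target, eps=1e-12):
    """Same cross entropy with the domain labels swapped (assumed form).

    Minimizing this pushes the embeddings toward representations that the
    discriminator assigns to the wrong domain.
    """
    return discriminator_loss(d_target, d_source, eps)


# A confident, correct discriminator has a low loss of its own, while the
# embedding loss is high, which is the signal used to update the embeddings.
print(discriminator_loss([0.9], [0.1]), embedding_loss([0.9], [0.1]))
```

When the discriminator outputs 0.5 for every example, both losses equal 2 log 2, which corresponds to the domain-invariant case where the two domains cannot be told apart.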
This multi-layer discriminator (shown in Figure 1 - yellow) allows for deeper alignment of the source and target representations which we find empirically results in improved target classification performance as well as more stable adversarial learning.
3.3 Cross category similarity for semantic transfer
In the previous section, we introduced a method for transferring an embedding from the source to the target domain. However, this only enforces alignment of the global domain statistics with no class specific transfer. Here, we define a new semantic transfer objective, $\mathcal{L}_{ST}$, which transfers information from a labeled set of data to an unlabeled set of data by minimizing the entropy of the softmax with temperature of the similarity vector between an unlabeled point and all labeled points. Thus, this loss may be applied either between the source and unlabeled target data or between the labeled and unlabeled target data.
For each unlabeled target image, $\tilde{x}^T$, we compute the similarity, $\psi(\cdot)$, to each labeled example or to each prototypical example per class in the labeled set. For simplicity of presentation let us consider semantic transfer from the source to the target domain first. For each target unlabeled image we compute a similarity vector $v_s(\tilde{x}^T)$ whose $i$-th element is the similarity between this target image and the $i$-th labeled source image: $[v_s(\tilde{x}^T)]_i = \psi(\tilde{x}^T, x_i^S)$. Our semantic transfer loss can be defined as follows:

$$\mathcal{L}_{ST}(\tilde{\mathcal{X}}^T, \mathcal{X}^S) = \sum_{\tilde{x}^T \in \tilde{\mathcal{X}}^T} H\big(\sigma(v_s(\tilde{x}^T)/\tau)\big) \tag{5}$$

where $H(\cdot)$ is the information entropy function, $\sigma(\cdot)$ is the softmax function, and $\tau$ is the temperature of the softmax. Note that the temperature can be used to directly control the percentage of source examples we expect the target example to be similar to (see Figure 2).
Entropy minimization has been widely used for unsupervised and semi-supervised learning by encouraging low density separation between clusters or classes. Recently this principle of entropy minimization has been applied to unsupervised adaptation. There, the source and target domains are assumed to share a label space, and each unlabeled target example is passed through the initial source classifier and the entropy of the softmax output scores is minimized.
In contrast, we do not assume a shared label space between the source and target domains and as such cannot assume that each target image maps to a single source label. Instead, we compute pairwise similarities between target points and the source points (or per-class averages of source points) across the feature spaces aligned by our multi-layer domain adversarial transfer. We then tune the softmax temperature based on the expected similarity between the source and target labeled sets. For example, if the source and target label sets overlap, then a small temperature will encourage each target point to be very similar to one source class, whereas a larger temperature will allow target points to be similar to multiple source classes.
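The effect of the temperature on this objective can be seen in a small sketch: the loss is the entropy of a tempered softmax over each unlabeled example's similarity vector. The code below is a minimal illustration in plain Python, assuming the per-example entropies are simply summed; the function names and the toy similarity values are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax of scores divided by the temperature (numerically stabilized)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def semantic_transfer_loss(similarity_vectors, temperature):
    """Sum of tempered-softmax entropies, one term per unlabeled example.

    similarity_vectors: for each unlabeled example, its similarities to the
    labeled examples (or to per-class prototypes).
    """
    return sum(entropy(softmax(v, temperature)) for v in similarity_vectors)


# The same similarity vector yields a lower entropy at a lower temperature,
# so a small temperature rewards matching a single labeled class.
toy_similarities = [[2.0, 1.0, 0.0]]
print(semantic_transfer_loss(toy_similarities, temperature=0.5))
print(semantic_transfer_loss(toy_similarities, temperature=2.0))
```

A uniform similarity vector over n labeled points gives the maximum entropy log n regardless of temperature, so the loss only decreases when the embedding makes some labeled points clearly closer than others.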
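For semantic transfer within the target domain, we utilize the metric-based cross entropy loss between labeled target examples to stabilize and improve the learning. For a labeled target example, in addition to the traditional cross entropy loss, we also calculate a metric-based cross entropy loss (we call it "metric-based" to cue the reader that this is not a cross entropy within the label space). Assume we have $k$ labeled examples from each class in the target domain. We compute the embedding for each example and then the centroid $c_i$ of each class $i$ in the embedding space. Thus, we can compute the similarity vector $v_t(x^T)$ for each labeled example, where the $i$-th element is the similarity between this labeled example and the centroid of class $i$: $[v_t(x^T)]_i = \psi(x^T, c_i)$. We can then calculate the metric-based cross entropy loss:

$$\mathcal{L}_{ST,sup}(\mathcal{X}^T) = -\sum_{\{x^T, y^T\} \in \mathcal{X}^T} \log \frac{\exp\big(\psi(x^T, c_{y^T})\big)}{\sum_{i=1}^{n} \exp\big(\psi(x^T, c_i)\big)} \tag{6}$$

where $n$ is the number of target classes. Similar to the source-to-target scenario, for target-to-target we also have the unsupervised part,

$$\mathcal{L}_{ST,unsup}(\tilde{\mathcal{X}}^T, \mathcal{X}^T) = \sum_{\tilde{x}^T \in \tilde{\mathcal{X}}^T} H\big(\sigma(v_t(\tilde{x}^T)/\tau)\big) \tag{7}$$

where $H(\cdot)$ is the entropy, $\sigma(\cdot)$ the softmax, and $\tau$ the softmax temperature. With the metric-based cross entropy loss, we introduce the constraint that target domain data from the same class should be similar in the embedding space. Also, we find that this loss provides guidance that lets the unsupervised semantic transfer learn in a more stable way. The full semantic transfer objective $\mathcal{L}_{ST}$ is the combination of the unsupervised source-target loss (Equation 5), the metric-based cross entropy loss on labeled target data (Equation 6), and the unsupervised target-target loss (Equation 7), i.e.,

$$\mathcal{L}_{ST} = \mathcal{L}_{ST}(\tilde{\mathcal{X}}^T, \mathcal{X}^S) + \mathcal{L}_{ST,sup}(\mathcal{X}^T) + \mathcal{L}_{ST,unsup}(\tilde{\mathcal{X}}^T, \mathcal{X}^T) \tag{8}$$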
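The metric-based cross entropy described in this section can be sketched as a softmax over similarities to class centroids. The code below is an illustrative sketch only: it assumes a plain dot product as the similarity, averages the per-example losses, and computes centroids from the same labeled batch. All names are hypothetical.

```python
import math


def centroid(vectors):
    """Mean of a list of equal-length embedding vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def dot(a, b):
    """Dot product, used here as the (assumed) similarity function."""
    return sum(x * y for x, y in zip(a, b))


def metric_cross_entropy(embeddings, labels, similarity=dot):
    """Cross entropy over similarities to class centroids.

    embeddings: embedding vectors of the labeled target examples.
    labels: the class label of each embedding.
    Returns the mean negative log-probability of each example's own class,
    where class scores are similarities to the class centroids.
    """
    classes = sorted(set(labels))
    centroids = {
        c: centroid([e for e, y in zip(embeddings, labels) if y == c])
        for c in classes
    }
    total = 0.0
    for e, y in zip(embeddings, labels):
        sims = [similarity(e, centroids[c]) for c in classes]
        peak = max(sims)
        # Stable log-sum-exp over the similarities to all centroids.
        log_z = peak + math.log(sum(math.exp(s - peak) for s in sims))
        total += log_z - similarity(e, centroids[y])
    return total / len(embeddings)


# Well-separated classes give a loss near zero; identical embeddings for
# different classes give log(number of classes).
print(metric_cross_entropy([[5.0, 0.0], [0.0, 5.0]], [5, 6]))
print(metric_cross_entropy([[1.0, 0.0], [1.0, 0.0]], [5, 6]))
```

Because the scores are similarities to centroids rather than classifier logits, the loss acts directly on the geometry of the embedding space, which is the sense in which it is "metric-based".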
This section is structured as follows. In section 4.1, we show that our method outperforms the fine-tuning approach by a large margin, and that all parts of our method are necessary. In section 4.2, we show that our method generalizes to bigger datasets. In section 4.3, we show that our multi-layer domain adversarial method outperforms state-of-the-art domain adversarial approaches.
Datasets. We perform adaptation experiments across two different paired data settings. First, for adaptation across different digit domains, we use MNIST and Google Street View House Numbers (SVHN). The MNIST handwritten digits database has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in fixed-size images. SVHN is a real-world image dataset for machine learning and object recognition algorithms with minimal requirements on data preprocessing and formatting. It has 73,257 digits for training and 26,032 digits for testing. As our second experimental setup, we consider adaptation from object-centric images in ImageNet to action recognition in video using the UCF-101 dataset. ImageNet is a large benchmark for the object classification task. We use the task 1 split from ILSVRC2012. UCF-101 is an action recognition dataset collected from YouTube. With 13,320 videos from 101 action categories, UCF-101 provides a large diversity in terms of actions, with large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc.
Implementation details. We pre-train the source domain embedding function with cross-entropy loss. For the domain adversarial loss, the discriminator takes the last three layer activations as input when the number of output classes is the same for the source and target tasks, and takes the second-to-last and third-to-last layer activations when they are different. The similarity score is chosen as the dot product of the normalized support features and the unnormalized target feature. We use the temperature for source-target semantic transfer and for within target transfer as the label space is shared. We use and in our objective function. The network is trained with the Adam optimizer and with learning rate . We conduct all the experiments with the PyTorch framework.
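The similarity score described above can be written out directly. The sketch below is a plain-Python illustration that assumes "normalized" means L2 normalization of the support feature; the function names and the `eps` term guarding against division by zero are hypothetical.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a feature vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / (norm + eps) for x in vector]


def similarity(support_feature, target_feature):
    """Dot product of the normalized support feature and the raw target feature."""
    support = l2_normalize(support_feature)
    return sum(s * t for s, t in zip(support, target_feature))


def similarity_vector(support_features, target_feature):
    """Similarities between one target feature and every support feature."""
    return [similarity(s, target_feature) for s in support_features]


# Rescaling a support feature leaves the score unchanged, while rescaling the
# target feature rescales the score by the same factor.
print(similarity([3.0, 4.0], [3.0, 4.0]))
print(similarity([30.0, 40.0], [3.0, 4.0]))
print(similarity([3.0, 4.0], [6.0, 8.0]))
```

Leaving the target feature unnormalized means its magnitude scales the whole similarity vector, so it interacts with the softmax temperature in the semantic transfer loss.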
4.1 SVHN 0-4 → MNIST 5-9
Experimental setting. In this experiment, we define three datasets: (i) labeled data in source domain ; (ii) few labeled data in target domain ; (iii) unlabeled data in target domain . We take the training split of SVHN dataset as dataset . To fairly compare with traditional learning paradigm and episodic training, we subsample examples from each class to construct dataset so that we can perform traditional training or episodic ()-shot learning. We experiment with , which corresponds to labeled examples, or of the total training data respectively. Since our approach involves using annotations from a small subset of the data, we randomly subsample different subsets from the training split of MNIST dataset, and use the remaining data as for each . Note that source domain and target domain have non-overlapping classes: we only utilize digits - in SVHN, and digits - in MNIST.
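The construction of the few-labeled and unlabeled target sets can be sketched as a per-class subsampling step. This is a hypothetical helper for illustration, not our data pipeline: the function name, the seeding scheme, and the toy label list are assumptions.

```python
import random
from collections import defaultdict


def split_few_shot(labels, k, seed=0):
    """Pick k labeled examples per class; the rest are treated as unlabeled.

    labels: class label of every example in the target training split.
    Returns (labeled_indices, unlabeled_indices). Raises ValueError if some
    class has fewer than k examples.
    """
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    labeled = []
    for label in sorted(indices_by_class):
        labeled.extend(rng.sample(indices_by_class[label], k))

    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled


# Toy target split with digits 5-9, ten examples per class.
toy_labels = [5, 6, 7, 8, 9] * 10
labeled_idx, unlabeled_idx = split_few_shot(toy_labels, k=3, seed=1)
print(len(labeled_idx), len(unlabeled_idx))
```

Calling the helper with different seeds gives the different random subsets over which accuracies are later averaged.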
Baselines and prior work. We compare against six different methods: (i) Target only: the model is trained on from scratch; (ii) Fine-tune: the model is pretrained on and fine-tuned on ; (iii) Matching networks : we first pretrain the model on , then use as the support set in the matching networks; (iv) Fine-tuned matching networks: same as baseline iii, except that for each the model is fine-tuned on with 5-way ()-shot learning: examples in each class are randomly selected as the support set, and the last example in each class is used as the query set; (v) Fine-tune + adversarial: in addition to baseline ii, the model is also trained on and with a domain adversarial loss; (vi.) Full model: fine-tune the model with the proposed multi-layer domain adversarial loss.
Results and analysis. We calculate the mean and standard error of the accuracies across the subsampled sets of data, which are shown in Table 1. Due to domain shift, matching networks perform poorly without fine-tuning, and fine-tuning is only marginally better than training from scratch. Using only the multi-layer adversarial loss on top of fine-tuning improves the overall performance, but is more sensitive to the subsampled data. Our full method achieves a significant performance gain, especially when the number of labeled examples is small. For reference, fine-tuning on full target dataset gives an accuracy of .
|Target only||0.642 ± 0.026||0.771 ± 0.015||0.801 ± 0.010||0.840 ± 0.013|
|Fine-tune||0.612 ± 0.020||0.779 ± 0.018||0.802 ± 0.016||0.830 ± 0.011|
|Matching nets||0.469 ± 0.019||0.455 ± 0.014||0.566 ± 0.013||0.513 ± 0.023|
|Fine-tuned matching nets||0.645 ± 0.019||0.755 ± 0.024||0.793 ± 0.013||0.827 ± 0.011|
|Ours: fine-tune + adv.||0.702 ± 0.020||0.800 ± 0.013||0.804 ± 0.014||0.831 ± 0.013|
|Ours: full model ()||0.917 ± 0.007||0.936 ± 0.006||0.942 ± 0.006||0.950 ± 0.004|
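Each table entry above is a mean accuracy together with its standard error over repeated random subsets. A minimal sketch of that computation follows, assuming the sample standard deviation (n - 1 in the denominator) is used; the function name and the example accuracies are hypothetical and are not results from our experiments.

```python
import math
import statistics


def mean_and_standard_error(accuracies):
    """Return (mean, standard error of the mean) for a list of run accuracies.

    Uses the sample standard deviation, so at least two runs are required.
    """
    mean = statistics.mean(accuracies)
    standard_error = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, standard_error


# Illustrative accuracies from three hypothetical runs.
mean, sem = mean_and_standard_error([0.90, 0.92, 0.94])
print(f"{mean:.3f} ± {sem:.3f}")
```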
4.2 Image object recognition → video action recognition
Problem analysis. Many recent works [60, 24] study the domain shift between images and video in the object detection settings. Compared to still images, videos provide several advantages: (i) motion provides information for foreground vs background segmentation ; (ii) videos often show multiple views and thus provide 3D information. On the other hand, video frames usually suffer from: (i) motion blur; (ii) compression artifacts; (iii) objects out-of-focus or out-of-frame.
Experimental setting. In this experiment, we focus on three dataset splits: (i) ImageNet training set as the labeled data in source domain ; (ii) video clips per class randomly sampled from UCF-101 training as the few labeled data in target domain set ; (iii) the remaining videos in UCF-101 training set as the unlabeled data in target domain . We experiment with , which corresponds video clips, or of the total training data respectively. Each experiment is run times on , , and .
Baselines and prior work. We compare our method with two baseline methods: (i) Target only: the model is trained on from scratch; (ii) Fine-tune: the model is first pre-trained on , then fine-tuned on . For reference, we report the performance of a fully supervised method .
Results and analysis. The accuracy of each model is shown in Table 2. We also fine-tune a model with all the labeled data for comparison. Per-frame performance (img) and average-across-frame performance (vid) are both reported. Note that we calculate the average-across-frame performance by averaging the softmax scores of the frames in a video. Our method achieves a significant improvement in average-across-frame performance over standard fine-tuning for every number of labeled clips per class that we tested. Note that compared to fine-tuning, our method has a bigger gap between per-frame and per-video accuracy. We believe that this is due to the semantic transfer: our entropy loss encourages a larger variance among the per-frame softmax scores within each video (if the variance is zero, then per-frame accuracy = per-video accuracy). By making more confident predictions on key frames, our method achieves a more significant gain with respect to per-video performance, even when there is little change in the per-frame prediction.
|Target only (img)||0.098 ± 0.003||0.126 ± 0.022||0.100 ± 0.035||-|
|Target only (vid)||0.105 ± 0.003||0.133 ± 0.024||0.106 ± 0.038||-|
|Two-stream spatial||-||-||-||0.708 - 0.720|
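The difference between per-frame (img) and average-across-frame (vid) evaluation can be illustrated with a short sketch. The function names and the toy softmax scores below are hypothetical; the example only shows how one confident key frame can change the video-level prediction while the per-frame predictions stay the same.

```python
def frame_predictions(frame_probs):
    """Arg-max class for every frame's softmax score vector."""
    return [max(range(len(p)), key=p.__getitem__) for p in frame_probs]


def video_prediction(frame_probs):
    """Average the per-frame softmax scores, then take the arg-max class.

    Returns (predicted_class, averaged_scores).
    """
    count = len(frame_probs)
    averaged = [sum(column) / count for column in zip(*frame_probs)]
    return max(range(len(averaged)), key=averaged.__getitem__), averaged


# Two weakly confident frames vote for class 0, one confident key frame votes
# for class 1; the averaged scores follow the confident frame.
toy_frames = [[0.55, 0.45], [0.55, 0.45], [0.05, 0.95]]
print(frame_predictions(toy_frames))
print(video_prediction(toy_frames))
```

If every frame had identical scores, the averaged prediction would match each per-frame prediction, which is the zero-variance case mentioned in the analysis above.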
4.3 Ablation: unsupervised domain adaptation
To validate our multi-layer domain adversarial loss objective, we conduct an ablation experiment for unsupervised domain adaptation. We compare against multiple recent domain adversarial unsupervised adaptation methods. In this experiment, we first pretrain a source embedding CNN on the training split of SVHN and then adapt the target embedding for MNIST by performing adversarial domain adaptation. We evaluate the classification performance on the test split of MNIST. We follow the same training strategy and model architecture for the embedding network as .
All the models here have a two-step training strategy and share the first stage. ADDA optimizes the encoder and classifier simultaneously. We also propose a similar method, but optimize the encoder only. Finally, we try a model with no classifier in the last layer (i.e., we perform domain adversarial training in feature space). We choose as the decay factor for this model.
The accuracy of each model is shown in Table 3. We find that our method achieves a performance gain over the best competing domain adversarial approach, indicating that our multi-layer objective indeed contributes to our overall performance. In addition, in our experiments, we found that the multi-layer approach improved overall optimization stability, as evidenced by our small standard error.
|Source only||0.601 ± 0.011|
|Gradient reversal||0.739|
|Domain confusion||0.681 ± 0.003|
|ADDA||0.760 ± 0.018|
In this paper, we propose a method to learn a representation that is transferable across different domains and tasks in a data-efficient manner. The framework is trained jointly to minimize the domain shift, to transfer knowledge to new tasks, and to learn from large amounts of unlabeled data. We show superior performance over the popular fine-tuning approach. We hope to keep improving the method in future work.
We would like to start by thanking our sponsors: Stanford Computer Science Department and Stanford Program in AI-assisted Care (PAC). Next, we specially thank De-An Huang, Kenji Hata, Serena Yeung, Ozan Sener and all the members of Stanford Vision and Learning Lab for their insightful discussion and feedback. Lastly, we thank all the anonymous reviewers for their valuable comments.
-  Yusuf Aytar and Andrew Zisserman. Tabula rasa: Model transfer for object category detection. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2252–2259. IEEE, 2011.
-  Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. arXiv preprint arXiv:1612.05424, 2016.
-  Lluis Castrejon, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Learning aligned cross-modal representations from weakly aligned data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2940–2949, 2016.
-  Gabriela Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017.
-  Virginia R de Sa. Learning classification with unlabeled data. Advances in neural information processing systems, pages 112–112, 1994.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
-  Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
-  Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages I–647–I–655. JMLR.org, 2014.
-  Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
-  Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
-  Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
-  Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.
-  Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
-  Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  Yves Grandvalet, Yoshua Bengio, et al. Semi-supervised learning by entropy minimization. In NIPS, volume 17, pages 529–536, 2004.
-  A Gretton, A.J. Smola, J Huang, Marcel Schmittfull, K.M. Borgwardt, B Schölkopf, J Quiñonero Candela, M Sugiyama, A Schwaighofer, and N D. Lawrence. Covariate shift by kernel mean matching. In Dataset Shift in Machine Learning, 131-160 (2009), 01 2009.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
-  Geoffrey E Hinton and Terrence J Sejnowski. Learning and relearning in Boltzmann machines. Parallel Distributed Processing, 1, 1986.
-  Judy Hoffman, Saurabh Gupta, and Trevor Darrell. Learning with side information through modality hallucination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 826–834, 2016.
-  Judy Hoffman, Saurabh Gupta, Jian Leong, Sergio Guadarrama, and Trevor Darrell. Cross-modal adaptation for rgb-d detection. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 5032–5039. IEEE, 2016.
-  Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
-  Vicky Kalogeiton, Vittorio Ferrari, and Cordelia Schmid. Analysing domain shift factors between videos and images for object detection. IEEE transactions on pattern analysis and machine intelligence, 38(11):2327–2334, 2016.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  Gregory Koch. Siamese neural networks for one-shot image recognition. PhD thesis, University of Toronto, 2015.
-  Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), 2012.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779, 2016.
-  Joseph J Lim, Ruslan Salakhutdinov, and Antonio Torralba. Transfer learning by borrowing examples for multiclass object detection. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pages 118–126. Curran Associates Inc., 2011.
-  Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.
-  Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
-  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
-  Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.
-  Mingsheng Long, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. arXiv preprint arXiv:1605.06636, 2016.
-  Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.
-  Zelun Luo, Boya Peng, De-An Huang, Alexandre Alahi, and Li Fei-Fei. Unsupervised learning of long-term motion dynamics for videos. arXiv preprint arXiv:1701.01821, 2017.
-  Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.
-  Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
-  Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
-  Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1717–1724, 2014.
-  Gintautas Palubinskas, Xavier Descombes, and Frithjof Kruggel. An unsupervised clustering method using the entropy minimization. In AAAI, 1999.
-  Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
-  Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. arXiv preprint arXiv:1612.06370, 2016.
-  Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, volume 1, page 6, 2017.
-  Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2014, Columbus, OH, USA, June 23-28, 2014, pages 512–519, 2014.
-  Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep domain adaptation. arXiv preprint arXiv:1603.06432, 2016.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Artificial Intelligence and Statistics, pages 448–455, 2009.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
-  Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
-  Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175, 2017.
-  Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
-  Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In Computer Vision–ECCV 2016 Workshops, pages 443–450. Springer, 2016.
-  Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
-  Kevin Tang, Vignesh Ramanathan, Li Fei-Fei, and Daphne Koller. Shifting weights: Adapting object detectors from image to video. In Advances in Neural Information Processing Systems, pages 638–646, 2012.
-  Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3081–3088. IEEE, 2010.
-  Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068–4076, 2015.
-  Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In International Conference in Computer Vision (ICCV), 2015.
-  Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2017.
-  Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
-  Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
-  Laurens Van Der Maaten. Accelerating t-SNE using tree-based algorithms. Journal of machine learning research, 15(1):3221–3245, 2014.
-  Laurens Van der Maaten and Geoffrey Hinton. Visualizing non-metric similarities in multiple maps. Machine learning, 87(1):33–55, 2012.
-  Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
-  Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
-  Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):1–40, 2016.
-  Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
-  Xu Zhang, Felix Xinnan Yu, Shih-Fu Chang, and Shengjin Wang. Deep transfer network: Unsupervised domain adaptation. arXiv preprint arXiv:1503.00591, 2015.
(a) SVHN 0-4 → MNIST 5-9
| max pool | conv-batchnorm-relu | max pool |
| layer type | conv-batchnorm-relu | max pool | conv-batchnorm-relu | max pool |
(b) Image object recognition → video action recognition
Embedding network structure: ResNet-18. We refer readers to the PyTorch implementation: https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py.
| layer type | conv-batchnorm-leaky relu | conv-batchnorm-leaky relu | conv |
(c) Ablation: unsupervised domain adaptation
| layer type | conv-relu | max pool | conv-relu | max pool |
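
The layer listing in (c) gives only layer types; kernel sizes, strides, and channel widths are not recoverable from it. As a rough sketch only, the snippet below traces how the spatial size of a feature map would shrink through a conv-relu, max pool, conv-relu, max pool stack. The 28×28 input, the unpadded 5×5 stride-1 convolutions, and the 2×2 stride-2 pooling are hypothetical values chosen for illustration, not settings taken from the paper.

```python
# Illustrative only: every hyperparameter below is an assumption,
# since the appendix table lists layer types without their sizes.

def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

# (layer type, kernel, stride), following the order listed in (c).
LAYERS = [
    ("conv-relu", 5, 1),  # hypothetical 5x5 convolution, no padding
    ("max pool", 2, 2),   # hypothetical 2x2 pooling, stride 2
    ("conv-relu", 5, 1),
    ("max pool", 2, 2),
]

def trace_sizes(input_size, layers=LAYERS):
    """Return the spatial size before the first layer and after each layer."""
    sizes = [input_size]
    for _name, kernel, stride in layers:
        sizes.append(out_size(sizes[-1], kernel, stride))
    return sizes

if __name__ == "__main__":
    for (name, _k, _s), size in zip(LAYERS, trace_sizes(28)[1:]):
        print(f"{name:10s} -> {size}x{size}")
```

Under these assumed settings a 28×28 input reaches the final pooling layer at 4×4; different kernel sizes or padding would give different values.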