GradMix: Multi-source Transfer across Domains and Tasks

02/09/2020 · by Junnan Li, et al.

The computer vision community is witnessing an unprecedented rate of new tasks being proposed and addressed, thanks to the ability of deep convolutional networks to find complex mappings from X to Y. The advent of each task is often accompanied by the release of a large-scale annotated dataset for supervised training of deep networks. However, it is expensive and time-consuming to manually label a sufficient amount of training data. Therefore, it is important to develop algorithms that can leverage off-the-shelf labeled datasets to learn useful knowledge for the target task. While previous works mostly focus on transfer learning from a single source, we study multi-source transfer across domains and tasks (MS-DTT) in a semi-supervised setting. We propose GradMix, a model-agnostic method applicable to any model trained with a gradient-based learning rule, which transfers knowledge via gradient descent by weighting and mixing the gradients from all sources during training. GradMix follows a meta-learning objective: it assigns layer-wise weights to the source gradients such that the combined gradient follows the direction that minimizes the loss for a small set of samples from the target dataset. In addition, we propose to adaptively adjust the learning rate for each mini-batch based on its importance to the target task, and a pseudo-labeling method to leverage the unlabeled samples in the target domain. We conduct MS-DTT experiments on two tasks, digit recognition and action recognition, and demonstrate the advantageous performance of the proposed method against multiple baselines.






1 Introduction

Figure 1: High-level overview of the proposed method. We transfer knowledge to the target domain by weighting and mixing gradients from source domains, such that the combined gradient should minimize the loss for a few validation samples from the target domain.

Deep convolutional networks (ConvNets) have significantly improved the state-of-the-art for visual recognition by finding complex mappings from X to Y. Unfortunately, these impressive gains in performance come only when massive amounts of labeled pairs (x, y) are available for supervised training. For many application domains, it is often prohibitive to manually label sufficient training data, due to the significant amount of human effort required or the concern of violating individuals’ privacy. Hence, there is strong incentive to develop algorithms that can reduce the burden of manual labeling, typically by leveraging off-the-shelf labeled datasets from other related domains and tasks.

There has been a large body of effort in the research community to adapt deep models across domains [7, 21, 39], to transfer knowledge across tasks [23, 8, 42], and to learn efficiently in a few-shot manner [5, 29, 30]. However, most works focus on a single-source and single-target scenario. Recently, some works [41, 25, 43] have proposed deep approaches for multi-source domain adaptation, but they assume that the source and target domains share the same label space (task).

In many computer vision applications, there often exist multiple labeled datasets available from different domains and/or tasks related to the target application. Hence, it is important and practically valuable to transfer knowledge from as many source datasets as possible. In this work, we formalize this problem as multi-source domain and task transfer (MS-DTT). Given a set of labeled source datasets {S_1, …, S_k}, we aim to transfer knowledge to a sparsely labeled target dataset T. Each source dataset S_i could come from a different domain compared to T, have a different task, or differ in both domain and task. We focus on a semi-supervised setting where only a few samples in T have labels.

Most works achieve domain transfer by aligning the feature distributions of the source domain and the target domain [20, 21, 7, 38, 25, 41]. However, this approach could be suboptimal for MS-DTT: the distributions of source data and target data could differ significantly in both input space and label space, so feature alignment may generate indiscriminative features for the target classes. In addition, feature alignment introduces additional layers and loss terms, which require careful design to perform well.

In this work, we propose a generic and scalable method, namely GradMix, for semi-supervised MS-DTT. GradMix is a model-agnostic method, applicable to any model that uses a gradient-based learning rule. Our method does not introduce extra layers or loss functions for feature alignment. Instead, we perform knowledge transfer via gradient descent, by weighting and mixing the gradients from all the source datasets during training. We follow a meta-learning paradigm and model the most basic assumption: the combined gradient should minimize the loss for a set of unbiased samples from the target dataset [31]. We propose an online method to weight and mix the source gradients at each training iteration, such that the knowledge most useful for the target task is preserved through the gradient update. Our method can adaptively adjust the learning rate for each mini-batch based on its importance to the target task. In addition, we propose a pseudo-labeling method based on model ensembles to learn from the unlabeled data in the target domain. We perform extensive experiments on two MS-DTT tasks, digit recognition and action recognition, and demonstrate the advantageous performance of the proposed method compared to multiple baselines.

2 Related Work

2.1 Domain Adaptation

Domain adaptation seeks to address the domain shift problem [4] and learn from source domain a model that performs well on the target domain. Most existing works focus on aligning the feature distribution of the source domain and the target domain. Several works attempt to learn domain-invariant features by minimizing Maximum Mean Discrepancy [20, 21, 36]. Other methods propose adversarial discriminative models, which try to learn domain-agnostic representations by maximizing a domain confusion loss [7, 38, 23].

Recently, multi-source domain adaptation with deep models has been studied. Mancini et al. [25] use DA-layers [3, 18] to minimize the distribution discrepancy of network activations. Xu et al. [41] propose a multi-way adversarial domain discriminator that minimizes the domain discrepancies between the target and each of the sources. Zhao et al. [43] propose multi-source domain adversarial networks that approach domain adaptation by optimizing domain-adaptive generalization bounds. However, all of these methods [25, 41, 43] assume that the source and target domains have a shared label space.

2.2 Transfer Learning

Transfer learning extends domain adaptation to more general cases, where the source and target domain could differ in both input space and label space [28, 40, 16, 14]. In computer vision, transfer learning has been widely studied to overcome the deficit of labeled data by adapting models trained for other tasks. With the advance of deep supervised learning, ConvNets trained on large datasets such as ImageNet [32] have achieved state-of-the-art performance when transferred to other tasks (e.g., object detection [8], semantic segmentation [19]) by simple fine-tuning. In this work, we focus on the setting where source and target domains have the same input space and different label spaces.

2.3 Meta-Learning

Meta-learning aims to utilize knowledge from past experiences to learn quickly on target tasks from only a few annotated samples. Meta-learning generally performs learning at a level higher than where conventional learning occurs, e.g., learning the update rule of a learner [29], or finding an initialization point that is more robust [17] or can be easily fine-tuned [5]. Li et al. [13] propose a meta-learning method to train models with good generalization ability to novel domains. Franceschi et al. [6] introduce a framework based on bilevel programming that unifies gradient-based hyperparameter optimization and meta-learning. Sun et al. [37] propose a meta-transfer learning method to address the few-shot learning task. Ren et al. [31] propose example reweighting in a meta-learning framework. Our method follows the meta-learning paradigm that uses validation loss as the meta-objective. However, different from [31], which reweights samples in a batch for robust learning against noise, we reweight source-domain gradients layer-wise for transfer learning. Gradient alignment has also been used to enhance learning congruency in [22].

3 Method

3.1 Problem Formulation

We first formally introduce the semi-supervised MS-DTT problem. Assume that there exists a set of source domains {S_i}_{i=1}^k and a target domain T. Each source domain S_i contains images with associated labels. Similarly, the target domain consists of a large set of unlabeled images as well as a small set of labeled images with associated labels. We assume the target domain is only sparsely labeled, i.e., the number of labeled target images is much smaller than the number of unlabeled ones. Our goal is to learn a strong target classifier that can predict labels for the unlabeled target images.

Different from standard domain adaptation approaches that assume a shared label space between each source and the target domain, we study the problem of joint transfer across domains and tasks. In our setting, only one of the source domains needs to have the same label space as the target domain. The other source domains could either have a label space that partially overlaps with the target domain's, or a non-overlapping one.

3.2 Meta-learning Objective

Let θ denote the network parameters of our model. We consider a loss function L(θ) to minimize during training. For deep networks, stochastic gradient descent (SGD) or its variants are commonly used to optimize the loss function. At every step t of training, we forward a mini-batch of samples from each of the k source domains, and apply back-propagation to calculate the gradients w.r.t. the parameters θ, denoted g_1, …, g_k. The parameters are then adjusted according to the sum of the source gradients. For example, for vanilla SGD:

θ ← θ − α (g_1 + … + g_k),

where α is the learning rate.
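As a concrete sketch, the mixed update above can be written in a few lines of numpy; the flat parameter vector and per-source gradients are illustrative stand-ins for a real network's parameters.

```python
import numpy as np

def mixed_sgd_step(params, source_grads, weights, lr):
    """One SGD step on a weighted combination of per-source gradients.

    params:       flat parameter vector
    source_grads: list of k gradient vectors, one per source domain
    weights:      per-source weights w_i (all ones recovers the plain sum)
    lr:           learning rate alpha
    """
    combined = sum(w * g for w, g in zip(weights, source_grads))
    return params - lr * combined

# Example: two sources, equal weights
params = np.zeros(2)
grads = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
new_params = mixed_sgd_step(params, grads, [1.0, 1.0], lr=0.1)  # -> [-0.1, -0.1]
```

Setting all weights to one recovers the unweighted sum; GradMix instead learns the weights at every iteration.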

In semi-supervised MS-DTT, we also have a small validation set V that contains the few labeled samples from the target domain. We want to learn a set of weights for the source gradients, w_1, …, w_k, such that when taking a gradient descent step using their weighted combination Σ_i w_i g_i, the loss L_V on the validation set is minimized:

w* = argmin_w L_V(θ − α Σ_i w_i g_i).
3.3 Layer-wise Gradient Weighting

Calculating the optimal w* requires two nested loops of optimization, which can be computationally expensive. Here we propose an approximation to the above objective. At each training iteration t, we do a forward-backward pass using the small validation set V to calculate the validation gradient g_v. We take a first-order approximation and assume that adjusting θ in the direction of g_v minimizes the validation loss. Therefore, we find the optimal w* by maximizing the cosine similarity between the combined source gradient and the validation gradient:

w* = argmax_w cos(Σ_i w_i g_i, g_v),

where the cosine similarity between two vectors a and b is defined as:

cos(a, b) = (a · b) / (‖a‖ ‖b‖).
Instead of using a global weight value for each source gradient, we propose layer-wise gradient weighting, where the gradient for each network layer is weighted separately. This enables a finer level of gradient combination. Specifically, in our MS-DTT setting, all source domains and the target domain share the same parameters up to the last fully-connected (fc) layer, which is task-specific (the target domain shares its last layer only with the source domain that has the same label space as the target). Therefore, for each layer l with parameters θ_l, and for each source domain i, we have a corresponding weight w_i^l. We can then write Equation 4 as:

w* = argmax_w cos(Σ_i Σ_{l=1}^L w_i^l g_i^l, g_v),

where L is the total number of layers of the ConvNet and g_i^l is the gradient of source i for layer l. We constrain w_i^l ≥ 0 for all i and l, since a negative gradient update can usually result in unstable behavior. To efficiently solve the above constrained non-linear optimization problem, we use a sequential quadratic programming method, SLSQP, implemented in NLopt [10].
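The constrained maximization can be prototyped without a full solver. The paper uses SLSQP from NLopt; as a minimal numpy-only stand-in (an assumption, not the authors' implementation), projected gradient ascent on the cosine objective with the w ≥ 0 constraint behaves similarly for small problems:

```python
import numpy as np

def optimal_weights(source_grads, val_grad, steps=300, lr=0.1):
    """Maximize cos(sum_i w_i g_i, g_v) subject to w_i >= 0 by projected
    gradient ascent (a stand-in for the SLSQP solver used in the paper)."""
    G = np.stack(source_grads)              # (k, d): one gradient per source
    v = np.asarray(val_grad, dtype=float)
    w = np.ones(len(source_grads)) / len(source_grads)
    nv = np.linalg.norm(v)
    for _ in range(steps):
        c = w @ G                           # combined source gradient
        nc = np.linalg.norm(c)
        cos = c @ v / (nc * nv + 1e-12)
        # analytic gradient of the cosine similarity w.r.t. w
        grad = G @ v / (nc * nv + 1e-12) - cos * (G @ c) / (nc ** 2 + 1e-12)
        w = np.maximum(w + lr * grad, 0.0)  # ascent step + projection onto w >= 0
    return w / (w.sum() + 1e-12)            # normalize as in Section 3.3

# A source gradient aligned with the validation gradient gets almost all the weight
w = optimal_weights([np.array([1.0, 0.0]), np.array([-1.0, 1.0])],
                    np.array([1.0, 0.0]))
```

In a layer-wise version, the same optimization runs over the concatenation of per-layer gradients, with one weight per (source, layer) pair.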

In practice, we normalize the weights for each layer across all source domains so that they sum to one:

Σ_i w_i^l = 1 for every layer l.
The computational overhead of GradMix mainly comes from optimizing the weights w and calculating the validation gradient g_v, which moderately increases the training time per batch compared to source-only training.

3.4 Adaptive Learning Rate

Intuitively, certain mini-batches from the source domains contain more useful knowledge that can be transferred to the target domain, whereas others contain less. Therefore, we want to adaptively adjust our training to pay more attention to the important mini-batches. To this end, we measure the importance score s of a mini-batch as the cosine similarity between the optimally combined gradient and the validation gradient:

s = cos(Σ_i Σ_l w_i^l g_i^l, g_v).

Based on s, we calculate a scaling term λ bounded between 0 and 1:

λ = 1 / (1 + exp(−β(s − γ))),

where β controls the rate of saturation for λ, and γ controls the shift along the horizontal axis (λ = 0.5 when s = γ). We determine the values of β and γ empirically through experiments.
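Only the qualitative constraints on the scaling term survive in this copy (bounded in (0, 1), saturation rate β, horizontal shift γ), so a logistic parameterization is one natural reading; the form below is an assumption, not necessarily the paper's exact equation.

```python
import math

def lr_scale(s, beta=10.0, gamma=0.0):
    """Importance-based learning-rate scale, bounded in (0, 1).

    s:     importance score of the mini-batch (cosine similarity)
    beta:  controls how fast the scale saturates toward 0 or 1
    gamma: shifts the curve horizontally; the scale is 0.5 when s == gamma
    NOTE: the logistic form is assumed; the paper's exact equation is lost here.
    """
    return 1.0 / (1.0 + math.exp(-beta * (s - gamma)))

# The scale grows monotonically with the importance score
low, mid, high = lr_scale(-1.0), lr_scale(0.0), lr_scale(1.0)
```

Any monotone squashing function with these properties would implement the same idea: batches whose combined gradient disagrees with the validation gradient get a small effective learning rate.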

Finally, we multiply the learning rate α by λ, and perform SGD to update the parameters:

θ ← θ − λα Σ_i Σ_l w_i^l g_i^l.
3.5 Pseudo-label with Ensembles

In our semi-supervised MS-DTT setting, there also exists a large set of unlabeled images in the target domain, denoted as U. We want to learn target-discriminative knowledge from U. To achieve this, we propose a method to calculate pseudo-labels for the unlabeled images and construct a pseudo-labeled dataset P. We then leverage P using the same gradient mixing method as described above. Specifically, we minimize an additional loss on P during training. At each training iteration t, we sample a mini-batch from P, calculate its gradient, and combine it with the source gradients using the proposed layer-wise weighting method.

In order to acquire the pseudo-labels, we first train a model using the source domain datasets following the proposed gradient mixing method, and use the learned model to label U. However, the learned model would inevitably create some false pseudo-labels. Previous studies found that ensembles of models help to produce more reliable pseudo-labels [34, 11]. Therefore, in this first step, we train multiple models with different combinations of β and γ in Equation 9. Then we pick the top R models with the best accuracies on the hyper-validation set (we set R = 3 in our experiments), and use their ensemble to create pseudo-labels. The difference in hyper-parameters during training ensures that the models learn significantly different sets of weights, hence the ensemble of their predictions is less biased.

Here we propose two approaches to create pseudo-labels, namely hard label and soft label:

Hard label. Here, we assume that a pseudo-label is more likely to be correct if all the models reach an agreement with high confidence. We assign a pseudo-label c, where c is a class index, to an unlabeled image if the two following conditions are satisfied. First, all of the R models should predict c as the class with maximum probability. Second, for all models, the probability for c should exceed a certain threshold, which is set to 0.8 in our experiments. If both conditions are satisfied, we add the image with label c into P. During training, the loss on P is the standard cross-entropy loss.
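The hard-label rule (unanimous argmax agreement plus a per-model 0.8 confidence threshold) can be sketched over softmax outputs; the array shapes and names here are illustrative.

```python
import numpy as np

def hard_pseudo_labels(prob_sets, threshold=0.8):
    """Select confidently agreed-upon pseudo-labels from an ensemble.

    prob_sets: list of R arrays, each (N, C) of softmax probabilities
               from one of the R ensemble models
    Returns (indices of kept samples, their hard pseudo-labels).
    """
    P = np.stack(prob_sets)                  # (R, N, C)
    preds = P.argmax(axis=2)                 # (R, N) per-model predictions
    agree = (preds == preds[0]).all(axis=0)  # condition 1: unanimous argmax
    confident = (P.max(axis=2) > threshold).all(axis=0)  # condition 2: all confident
    keep = agree & confident
    return np.where(keep)[0], preds[0][keep]

# Sample 0 is kept; sample 1 fails the threshold; sample 2 has disagreement
model_a = np.array([[0.90, 0.10], [0.60, 0.40], [0.85, 0.15]])
model_b = np.array([[0.95, 0.05], [0.90, 0.10], [0.30, 0.70]])
idx, labels = hard_pseudo_labels([model_a, model_b])
```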

Soft label. Let p_r denote the output of the r-th model's softmax layer for an input image, which represents a probability distribution over classes. We calculate the average of p_r across all of the R models as the soft pseudo-label for that image, i.e., q = (1/R) Σ_r p_r. Every unlabeled image is assigned a soft label and added to P. During training, let p be the output probability of the model; we want to minimize the KL-divergence between p and the soft pseudo-label q for all images in P. Therefore, the loss on P is the mean of KL(q ‖ p) over the pseudo-labeled images.
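A minimal numpy sketch of the soft-label path: averaging the ensemble outputs and the per-sample KL loss. The direction KL(soft label ‖ model output) is assumed here, since the original equation is not preserved in this copy.

```python
import numpy as np

def soft_pseudo_labels(prob_sets):
    """Average the R models' softmax outputs into soft pseudo-labels (N, C)."""
    return np.mean(np.stack(prob_sets), axis=0)

def soft_label_loss(p_model, q_soft, eps=1e-12):
    """Mean KL(q_soft || p_model) over the batch (direction assumed);
    zero when the model output matches the soft pseudo-labels exactly."""
    kl = np.sum(q_soft * (np.log(q_soft + eps) - np.log(p_model + eps)), axis=1)
    return float(np.mean(kl))

# Two models' softmax outputs for one sample average into one soft label
q = soft_pseudo_labels([np.array([[0.8, 0.2]]), np.array([[0.6, 0.4]])])  # -> [[0.7, 0.3]]
```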

For both the hard label and soft label approaches, after obtaining the pseudo-labels, we train a model from scratch using all available datasets: the sources, the validation set V, and the pseudo-labeled set P. Since the proposed gradient mixing method relies on V to estimate the model's performance on the target domain, we enlarge V to 100 samples per class by adding hard-labeled images from U using the method described above. The enlarged V represents the target domain with less bias, which helps to calculate better weights on the source gradients, such that the model's performance on the target domain is maximized.

3.6 Incorporating Semi-supervised Learning

We can further exploit the unlabeled target domain data U by leveraging semi-supervised learning (SSL) methods. Specifically, we incorporate two state-of-the-art SSL methods, virtual adversarial training [26] and MixMatch [2], into GradMix, by adding an additional unlabeled loss term on U during training. The details of the unlabeled loss can be found in the original papers [26, 2].

Figure 2: An illustration of the two experimental settings for multi-source domain and task transfer (MS-DTT): digit recognition and action recognition. Our method effectively transfers knowledge from multiple sources to the target task.

4 Experiment

4.1 Experimental Setup

Datasets. In our experiments, we perform MS-DTT across two different groups of data settings, as shown in Figure 2. First, we do transfer learning across different digit domains using MNIST [12] and Street View House Numbers (SVHN) [27]. MNIST is a popular benchmark for handwritten digit recognition, which contains a training set of 60,000 examples and a test set of 10,000 examples. SVHN is a real-world dataset consisting of images with colored backgrounds and blurred digits. It has 73,257 examples for training and 26,032 examples for testing.

For our second setup, we study MS-DTT from human activity images in MPII dataset [1] and human action images from the Web (BU101 dataset) [24], to video action recognition using UCF101 [35] dataset. MPII dataset consists of 28,821 images covering 410 human activities including home activities, religious activities, occupation, etc. UCF101 is a benchmark action recognition dataset collected from YouTube. It has 13,320 videos from 101 action categories, captured under various lighting conditions with camera motion and occlusion. We take the first split of UCF101 for our experiment. BU101 contains 23,800 images collected from the Web, with the same action categories as UCF101. It contains professional photos, commercial photos, and artistic photos, which differ significantly from video frames.

Network and implementation details. For digit recognition, we use the same ConvNet architecture as [23], which has 4 conv layers and 2 fc layers. We randomly initialize the weights, and train the network using SGD with a momentum of 0.9. For fine-tuning, we reduce the learning rate to 0.005. For action recognition, we use the ResNet-18 [9] architecture. We initialize the network with ImageNet pre-trained weights, which is important for all baseline methods to perform well. The learning rate is 0.001 for training and is reduced for fine-tuning.

Method k=2 k=3 k=4 k=5
Target only 71.35±1.85 77.15±1.36 81.43±1.41 84.83±1.10
Source only 82.39 82.39 82.39 82.39
Fine-tune 89.94±0.35 89.86±0.46 90.89±0.48 91.96±0.39
GradMix SGD [31] 89.30±0.73 89.78±0.72 91.70±0.45 92.05±0.29
GradMix w/o AdaLR 90.10±0.37 90.22±0.62 92.14±0.43 92.92±0.29
GradMix 91.17±0.37 91.45±0.52 92.14±0.40 93.06±0.46
MME [33] 90.25±0.31 90.37±0.36 91.38±0.29 91.76±0.24
MDDA [25] 90.23±0.40 90.28±0.50 91.45±0.37 91.85±0.31
DCTN [41] 91.81±0.26 92.34±0.28 92.42±0.39 92.97±0.37
GradMix w/ soft label 94.62±0.18 95.03±0.30 95.26±0.17 95.74±0.21
GradMix w/ hard label 96.02±0.24 96.24±0.33 96.63±0.17 96.84±0.20
GradMix w/ VAT [26] 96.23±0.21 96.35±0.31 96.87±0.19 96.94±0.20
GradMix w/ MixMatch [2] 96.30±0.23 96.43±0.32 96.85±0.19 97.02±0.21
Table 1: Classification accuracy (%) of the baselines and our method on the test split of MNIST 5-9. We report the mean and the standard error of each method across 10 runs with different randomly sampled validation sets V.


4.2 SVHN 5-9 + MNIST 0-4 → MNIST 5-9

Experimental setting. In this experiment, we define four sets of training data: (1) labeled images of digits 5-9 from the training split of the SVHN dataset as the first source S_1, (2) labeled images of digits 0-4 from the training split of the MNIST dataset as the second source S_2, (3) a few labeled images of digits 5-9 from the training split of the MNIST dataset as the validation set V, (4) unlabeled images from the rest of the training split of MNIST 5-9 as U. We subsample k examples from each class of MNIST 5-9 to construct the unbiased validation set V. We experiment with k = 2, 3, 4, 5, which corresponds to 10 to 25 labeled examples in total. Since V is randomly sampled, we repeat our experiment 10 times with different V. In order to monitor training progress and tune the hyper-parameters (β and γ), we split out another 1000 labeled samples from MNIST 5-9 as the hyper-validation set. The hyper-validation set is the traditional validation set and is fixed across the 10 runs.

Baselines. We compare the proposed method to multiple baseline methods:


  • Target only: the model is trained using V only.

  • Source only: the model is trained using S_1 and S_2 without gradient reweighting.

  • Fine-tune: the Source only model is fine-tuned using V.

  • MME [33]: Minimax Entropy is a state-of-the-art method for single-source semi-supervised domain adaptation. We use S_1 (SVHN 5-9) as the source domain because it has the same label space as the target task.

  • MDDA [25]: Multi-domain domain alignment layers that shift the network activations for each domain using a parameterized transformation equivalent to batch normalization.

  • DCTN [41]: Deep Cocktail Network, which uses multi-way adversarial adaptation to align the distribution of multiple source domains and the target domain.

We also evaluate different variants of our model with and without certain component to show its effect:


  • GradMix SGD: instead of calculating the optimal weights by maximizing the cosine similarity of gradients (Equation 6), we follow the method in [31] and perform SGD on the weights to directly minimize the validation error in Equation 3.

  • GradMix w/o AdaLR: the method in Section 3.3 without the adaptive learning rate (Section 3.4).

  • GradMix: the proposed method, trained using S_1, S_2 and V.

  • GradMix w/ hard label: using the hard label approach to create pseudo-labels for U, and training a model with all available datasets.

  • GradMix w/ soft label: using the soft label approach to create pseudo-labels for U, and training a model with all available datasets.

  • GradMix w/ VAT: incorporating VAT [26] into GradMix.

  • GradMix w/ MixMatch: incorporating MixMatch [2] into GradMix.

Figure 3: Loss on the hyper-validation set as training proceeds on the digit recognition task, for two different values of k (top and bottom rows). We define 1 epoch as training for 100 mini-batches (gradient descents).

90.92 90.96 90.95 90.58 90.75 90.75 90.51 90.63 91.12
90.41 90.75 89.95 90.79 90.59 89.95 90.58 90.63 90.56
89.76 90.44 90.42 90.94 90.28 90.40 90.52 90.70 90.66
90.05 90.89 90.93 90.57 90.77 90.69 89.99 90.58 90.71
90.32 90.70 90.48 90.94 90.47 90.92 90.20 90.23 90.86
90.52 90.03 89.67 90.01 89.84 90.51 91.45 90.58 90.70
Table 2: Results of GradMix using different combinations of β and γ. Numbers indicate the test accuracy (%) on MNIST 5-9 (averaged across 10 runs). The ensemble of the top three models is used to create pseudo-labels.

Results. Table 1 shows the results for the methods described above. We report the mean and standard error of classification accuracy across 10 runs with randomly sampled V. Methods in the upper part of the table do not use the unlabeled target domain data U. Among these methods, the proposed GradMix has the best performance. If we remove the adaptive learning rate, the accuracy decreases. As expected, the performance improves as k increases, which indicates that more samples in V help GradMix better combine the gradients during training.

The lower part of the table shows methods that leverage the unlabeled target data U. MME [33] only uses S_1, whereas the other methods use both S_1 and S_2. The proposed GradMix without U achieves performance comparable to state-of-the-art baselines that use U (MME, MDDA and DCTN). Using pseudo-labels with a model ensemble significantly improves performance compared to the baseline methods. Comparing soft labels to hard labels, the hard label approach achieves better performance. More detailed results about model ensembles for pseudo-labeling are shown later in the ablation study. Furthermore, both VAT [26] and MixMatch [2] achieve performance improvements by effectively utilizing the unlabeled data U.

Ablation Study. In this section, we perform ablation experiments to demonstrate the effectiveness of our method and the effect of different hyper-parameters. First, Figure 3 shows two examples of the hyper-validation loss as training proceeds. We show the loss for the Source only baseline and the proposed GradMix, where we perform hyper-validation every 100 mini-batches (gradient descents). In both examples with different k, GradMix achieves a quicker and steadier decrease of the hyper-validation loss.

In Table 2, we show the results of GradMix using different combinations of β and γ, obtained via grid search. The top three models are selected for the ensemble used to create pseudo-labels for the unlabeled set U.

In addition, we perform experiments with various numbers of models R used in the ensemble when creating pseudo-labels for the unlabeled set U. Figure 4 shows the results across all values of k. R = 3 has the best overall performance and a moderate computational cost. Therefore, we use the ensemble of the top three models to create reliable pseudo-labels.

Figure 4: Results of GradMix w/ hard label using various numbers of pre-trained models (R) for the ensemble on the digit recognition task. k is the number of labeled samples per class in V.
Method per-frame per-video
 k=3 k=5 k=10 k=3 k=5 k=10
Target only 42.58 53.31 63.05 43.74 55.50 64.74
Source only 41.96 41.96 41.96 43.46 43.46 43.46
Fine-tune 55.86 60.55 66.77 58.57 66.01 70.21
EnergyNet [15] 55.93 60.82 66.73 58.70 66.23 70.25
GradMix 56.25 61.73 67.30 59.41 66.27 71.49
MDDA [25] 56.65 61.58 67.65 60.00 65.14 71.54
DCTN [41] 57.88 61.97 68.46 61.64 66.59 72.85
GradMix w/ hard label 68.92 68.76 69.25 72.58 72.34 73.48
GradMix w/ VAT [26] 69.02 69.59 70.11 73.35 73.05 73.71
GradMix w/ MixMatch [2] 69.33 69.88 70.09 73.57 73.46 73.68
Table 3: Classification accuracy (%) of the baselines and our method on the test split of UCF101. We report the mean accuracy of each method across two runs with different randomly sampled V.

4.3 MPII + BU101 → UCF101

Experimental setting. In the action recognition experiment, we have four sets of training data similar to the digit recognition experiment: (1) S_1: labeled images from the training split of MPII, (2) S_2: labeled images from the training split of BU101, (3) V: k labeled video clips per class randomly sampled from the training split of UCF101, (4) U: unlabeled videos from the rest of the training split of UCF101. We experiment with k = 3, 5, 10, which corresponds to 303, 505 and 1010 labeled video clips. Each experiment is run two times with different V. We report the mean accuracy across the two runs for both per-frame and per-video classification. Per-frame classification performs individual image classification for every frame of a video, and per-video classification averages the softmax scores of all frames in a video to obtain the video's score.

Baselines. We compare our method with multiple baselines described in Section 4.2, including Target only, Source only, Fine-tune, MDDA [25] and DCTN [41]. In addition, we evaluate another baseline for knowledge transfer in action recognition, namely EnergyNet [15]: the ConvNet (ResNet-18) is first trained on MPII and BU101, then knowledge is transferred to UCF101 through spatial attention maps using a Siamese Energy Network.

Results. Table 3 shows the results for action recognition. Target only outperforms Source only even for the smallest k, which indicates a strong distribution shift between source data and target data for actions in the wild. For all values of k, the proposed GradMix outperforms baseline methods that use S_1, S_2 and V for training, in both per-frame and per-video accuracy. GradMix also achieves performance comparable to MDDA, which uses the unlabeled dataset U. The proposed pseudo-label method achieves a significant gain in accuracy by assigning hard labels to U and learning target-discriminative knowledge from the pseudo-labeled dataset. Furthermore, additional improvement is achieved by incorporating state-of-the-art semi-supervised learning methods.

5 Conclusion

In this work, we propose GradMix, a method for semi-supervised MS-DTT: multi-source domain and task transfer. GradMix assigns layer-wise weights to the gradients calculated from each source objective, such that the combined gradient optimizes the target objective, measured by the loss on a small validation set. GradMix can adaptively adjust the learning rate for each mini-batch based on its importance to the target task. In addition, we assign pseudo-labels to the unlabeled samples using model ensembles, and consider the pseudo-labeled dataset as a source during training. We validate the effectiveness of our method with extensive experiments on two MS-DTT settings, namely digit recognition and action recognition. GradMix is a generic framework applicable to any model trained with gradient descent. For future work, we intend to extend GradMix to other problems where labeled data for the target task is expensive to acquire, such as image captioning.


This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Strategic Capability Research Centres Funding Initiative. The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore.


  • [1] M. Andriluka, L. Pishchulin, P. V. Gehler, and B. Schiele (2014) 2D human pose estimation: new benchmark and state of the art analysis. In CVPR, pp. 3686–3693. Cited by: §4.1.
  • [2] D. Berthelot, N. Carlini, I. J. Goodfellow, N. Papernot, A. Oliver, and C. Raffel (2019) MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, Cited by: §3.6, 7th item, §4.2, Table 1, Table 3.
  • [3] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò (2017) AutoDIAL: automatic domain alignment layers. In ICCV, pp. 5077–5085. Cited by: §2.1.
  • [4] G. Csurka (2017) A comprehensive survey on domain adaptation for visual applications. In Domain Adaptation in Computer Vision Applications, pp. 1–35. Cited by: §2.1.
  • [5] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135. Cited by: §1, §2.3.
  • [6] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil (2018) Bilevel programming for hyperparameter optimization and meta-learning. In ICML, pp. 1563–1572. Cited by: §2.3.
  • [7] Y. Ganin and V. S. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In ICML, pp. 1180–1189. Cited by: §1, §2.1.
  • [8] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. In ICCV, pp. 2980–2988. Cited by: §1, §2.2.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §4.1.
  • [10] S. G. Johnson (2008) The NLopt nonlinear-optimization package. External Links: Link Cited by: §3.3.
  • [11] S. Laine and T. Aila (2017) Temporal ensembling for semi-supervised learning. In ICLR, Cited by: §3.5.
  • [12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.1.
  • [13] D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2018) Learning to generalize: meta-learning for domain generalization. In AAAI, Cited by: §2.3.
  • [14] J. Li, J. Liu, Y. Wong, S. Nishimura, and M. S. Kankanhalli (2019) Self-supervised representation learning using 360° data. In ACM Multimedia, pp. 998–1006. Cited by: §2.2.
  • [15] J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli (2017) Attention transfer from web images for video recognition. In ACM Multimedia, pp. 1–9. Cited by: §4.3, Table 3.
  • [16] J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli (2018) Unsupervised learning of view-invariant action representations. In NeurIPS, pp. 1262–1272. Cited by: §2.2.
  • [17] J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli (2019) Learning to learn from noisy labeled data. In CVPR, pp. 5051–5059. Cited by: §2.3.
  • [18] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou (2017) Revisiting batch normalization for practical domain adaptation. In ICLR, Cited by: §2.1.
  • [19] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440. Cited by: §2.2.
  • [20] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. In ICML, pp. 97–105. Cited by: §1, §2.1.
  • [21] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. In NeurIPS, pp. 136–144. Cited by: §1, §1, §2.1.
  • [22] Y. Luo, Y. Wong, M. S. Kankanhalli, and Q. Zhao (2019) Direction concentration learning: enhancing congruency in machine learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. Cited by: §2.3.
  • [23] Z. Luo, Y. Zou, J. Hoffman, and F. Li (2017) Label efficient learning of transferable representations across domains and tasks. In NeurIPS, pp. 164–176. Cited by: §1, §2.1, §4.1.
  • [24] S. Ma, S. A. Bargal, J. Zhang, L. Sigal, and S. Sclaroff (2017) Do less and achieve more: training CNNs for action recognition utilizing action images from the web. Pattern Recognition 68, pp. 334–345. Cited by: §4.1.
  • [25] M. Mancini, L. Porzi, S. R. Bulò, B. Caputo, and E. Ricci (2018) Boosting domain adaptation by discovering latent domains. In CVPR, pp. 3771–3780. Cited by: §1, §1, §2.1, 5th item, §4.3, Table 1, Table 3.
  • [26] T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2019) Virtual adversarial training: A regularization method for supervised and semi-supervised learning. TPAMI 41 (8). Cited by: §3.6, 6th item, §4.2, Table 1, Table 3.
  • [27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NeurIPS workshop, Cited by: §4.1.
  • [28] S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22 (10), pp. 1345–1359. Cited by: §2.2.
  • [29] S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In ICLR, Cited by: §1, §2.3.
  • [30] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel (2018) Meta-learning for semi-supervised few-shot classification. In ICLR, Cited by: §1.
  • [31] M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In ICML, pp. 4331–4340. Cited by: §1, §2.3, 1st item, Table 1.
  • [32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2015) ImageNet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §2.2.
  • [33] K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko (2019) Semi-supervised domain adaptation via minimax entropy. In ICCV, Cited by: 4th item, §4.2, Table 1.
  • [34] K. Saito, Y. Ushiku, and T. Harada (2017) Asymmetric tri-training for unsupervised domain adaptation. In ICML, pp. 2988–2997. Cited by: §3.5.
  • [35] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §4.1.
  • [36] B. Sun and K. Saenko (2016) Deep CORAL: correlation alignment for deep domain adaptation. In ECCV Workshops, pp. 443–450. Cited by: §2.1.
  • [37] Q. Sun, Y. Liu, T. Chua, and B. Schiele (2019-06) Meta-transfer learning for few-shot learning. In CVPR, Cited by: §2.3.
  • [38] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko (2015) Simultaneous deep transfer across domains and tasks. In ICCV, pp. 4068–4076. Cited by: §1, §2.1.
  • [39] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In CVPR, pp. 2962–2971. Cited by: §1.
  • [40] K. R. Weiss, T. M. Khoshgoftaar, and D. Wang (2016) A survey of transfer learning. Journal of Big Data 3, pp. 9. Cited by: §2.2.
  • [41] R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin (2018) Deep cocktail network: multi-source unsupervised domain adaptation with category shift. In CVPR, pp. 3964–3973. Cited by: §1, §1, §2.1, 6th item, §4.3, Table 1, Table 3.
  • [42] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese (2018) Taskonomy: disentangling task transfer learning. In CVPR, pp. 3712–3722. Cited by: §1.
  • [43] H. Zhao, S. Zhang, G. Wu, J. M. F. Moura, J. P. Costeira, and G. J. Gordon (2018) Adversarial multiple source domain adaptation. In NeurIPS, pp. 8568–8579. Cited by: §1, §2.1.