SOSELETO: A Unified Approach to Transfer Learning and Training with Noisy Labels

by Or Litany, et al.

We present SOSELETO (SOurce SELEction for Target Optimization), a new method for exploiting a source dataset to solve a classification problem on a target dataset. SOSELETO is based on the following simple intuition: some source examples are more informative than others for the target problem. To capture this intuition, source samples are each given weights; these weights are solved for jointly with the source and target classification problems via a bilevel optimization scheme. The target therefore gets to choose the source samples which are most informative for its own classification task. Furthermore, the bilevel nature of the optimization acts as a kind of regularization on the target, mitigating overfitting. SOSELETO may be applied to both classic transfer learning, as well as the problem of training on datasets with noisy labels; we show state of the art results on both of these problems.




1 Introduction

Deep learning has made possible many remarkable successes, leading to state-of-the-art algorithms in computer vision, speech and audio, and natural language processing. A key ingredient in this success has been the availability of large datasets. While such datasets are common in certain settings, in other scenarios this is not true. Examples of the latter include “specialist” scenarios, for instance a dataset which is entirely composed of different species of tree; and medical imaging, in which datasets on the order of hundreds to a thousand images are common.

A natural question is then how one may apply the techniques of deep learning within these relatively data-poor regimes. A standard approach involves the concept of transfer learning: one uses knowledge gleaned from the source (data-rich regime), and transfers it over to the target (data-poor regime). One of the most common versions of this approach involves a two-stage technique. In the first stage, a network is trained on the source classification task; in the second stage, this network is adapted to the target classification task. There are two variants for this second stage. In feature extraction (e.g. donahue2014decaf ), only the parameters of the last layer (i.e. the classifier) are allowed to adapt to the target classification task; whereas in fine-tuning (e.g. girshick2014rich ), the parameters of all of the network layers (i.e. both the features/representation and the classifier) are allowed to adapt. The idea is that by pre-training the network on the source data, a useful feature representation may be learned, which may then be recycled, either partially or completely, for the target regime. This two-stage approach has been quite popular, and works reasonably well on a variety of applications.

Despite this success, we claim that the two-stage approach misses an essential insight: some source examples are more informative than others for the target classification problem. For example, if the source is a large set of natural images and the target consists exclusively of cars, then we might expect that source images of cars, trucks, and motorcycles might be more relevant for the target task than, say, spoons. However, this example is merely illustrative; in practice, the source and target datasets may have no overlapping classes at all. As a result, we don’t know a priori which source examples will be important. Thus, we propose to learn this source filtering as part of an end-to-end training process.

The resulting algorithm is SOSELETO: SOurce SELEction for Target Optimization. Each training sample in the source dataset is given a weight, corresponding to how important it is. The shared source/target representation is then optimized by means of a bilevel optimization. In the interior level, the source minimizes its classification loss with respect to the representation parameters, for fixed values of the sample weights. In the exterior level, the target minimizes its classification loss with respect to both the source sample weights and its own classification layer. The sample weights implicitly control the representation through the interior level. The target therefore gets to choose the source samples which are most informative for its own classification task. Furthermore, the bilevel nature of the optimization acts as a kind of regularization on the target, mitigating overfitting, as the target does not directly control the representation parameters. Finally, note that the entire process – training of the shared representation, target classifier, and source weights – happens simultaneously.

Above, we have illustrated how SOSELETO may be applied to the problem of transfer learning. However, the same algorithm can be applied to the problem of training with noisy labels. Concretely, we assume that there is a large noisy dataset, as well as a much smaller clean dataset; the latter can be constructed cheaply through careful hand-labelling, given its small size. Then if we take the source to be the large noisy dataset, and the target to be the small clean dataset, SOSELETO can be applied to the problem. The algorithm will assign high weights to samples with correct labels and low weights to those with incorrect labels, thereby implicitly denoising the source, and allowing for an accurate classifier to be trained.

The remainder of the paper is organized as follows. Section 2 presents related work. Section 3 presents the SOSELETO algorithm, deriving descent equations as well as convergence properties of the bilevel optimization. Section 4 presents results of experiments on both transfer learning as well as training with noisy labels. Section 5 concludes.

2 Related work

Transfer learning   As described in Section 1, the most common techniques for transfer learning are feature extraction and fine-tuning, see for example donahue2014decaf and girshick2014rich , respectively. An older survey of transfer learning techniques may be found in pan2010survey . Domain adaptation saenko2010adapting is concerned with transferring knowledge when the source and target classes are the same. Earlier techniques aligned the source and target via matching of feature space statistics tzeng2014deep ; long2015learning ; subsequent work used adversarial methods to improve the domain adaptation performance ganin2015unsupervised ; tzeng2015simultaneous ; tzeng2017adversarial ; hoffman2017cycada .

In this paper, we are more interested in transfer learning where the source and target classes are different. A series of recent papers long2017deep ; pei2018multi ; cao2018unsupervised ; cao2018partial by Long et al. address domain adaptation that is closer to our setting. In particular, cao2018partial examines “partial transfer learning”, the case in which there is partial overlap between source and target classes (particularly when the target classes are a subset of the source). This setting is also dealt with in busto2017open .

Ge and Yu ge2017borrowing examine the scenario where the source and target classes are completely different. Similar to SOSELETO, they propose selecting a portion of the source dataset. However, the selection is not performed in an end-to-end fashion, as in SOSELETO; rather, selection is performed prior to training, by finding source examples which are similar to the target dataset, where similarity is measured by using filter bank descriptors.

Another recent work of interest is luo2017label , which focuses on a slightly different scenario: the target consists of a very small number of labelled examples (i.e. the few-shot regime), but a very large number of unlabelled examples. Training is achieved via an adversarial loss to align the source and the target representations, and a special entropy-based loss for the unlabelled part of the data.

Learning with noisy labels   Classification with noisy labels is a longstanding problem in the machine learning literature, see the review paper frenay2014classification and the references therein. Within the realm of deep learning, it has been observed that with sufficiently large data, learning with label noise, without modification to the learning algorithms, actually leads to reasonably high accuracy krause2016unreasonable ; sun2017revisiting .

The setting that is of greatest interest to us is when the large noisy dataset is accompanied by a small clean dataset. Sukhbaatar et al. sukhbaatar2014training introduce an additional noise layer into the CNN which attempts to adapt the output to align with the noisy label distribution; the parameters of this layer are also learned. Xiao et al. xiao2015learning use a more general noise model, in which the clean label, noisy label, noise type, and image are jointly specified by a probabilistic graphical model. Both the clean label and the type of noise must be inferred given the image, in this case by two separate CNNs. Li et al. li2017learning consider the same setting, but with additional information in the form of a knowledge graph on labels.

Other recent work on label noise includes rolnick2017deep , which shows that adding many copies of an image with noisy labels to a clean dataset barely dents performance; malach2017decoupling , in which two separate networks are simultaneously trained, and a sample only contributes to the gradient descent step if there is disagreement between the networks (if there is agreement, that probably means the label is wrong); and drory2018resistance , which analyzes theoretically the situations in which CNNs are more and less resistant to noise.

Bilevel optimization   A branch of mathematical programming, bilevel optimization has been extensively studied within this community colson2007overview ; bard2013practical . For recent developments, readers are referred to the review paper sinha2018review . Bilevel optimization has been used occasionally in both machine learning, e.g. bennett2006model ; bennett2008bilevel and computer vision, e.g. ochs2015bilevel . In this context, it is worth mentioning the recent work of Yang and Deng yang2017shape , which solves the shape from shading problem without using an external shape dataset. They simultaneously generate complex shapes from primitives and train on these shapes; while the formulation is different, the method has the flavor of a bilevel optimization framework.

3 SOSELETO: SOurce SELEction for Target Optimization

3.1 The algorithm

We have two datasets. The source set is the data-rich set, on which we can learn extensively. It is denoted by $\{(x_i^s, y_i^s)\}_{i=1}^{n^s}$, where as usual $x_i^s$ is the $i^{th}$ source training image, and $y_i^s$ is its corresponding label. The second dataset is the target set, which is data-poor; but it is this set which ultimately interests us. That is, the goal in the end is to learn a classifier on the target set, and the source set is only useful insofar as it helps in achieving this goal. The target set is denoted $\{(x_i^t, y_i^t)\}_{i=1}^{n^t}$, and it is assumed that it is much smaller than the source set, i.e. $n^t \ll n^s$.

Our goal is to exploit the source set to solve the target classification problem. The key insight is that not all source examples contribute equally useful information in regards to the target problem. For example, suppose that the source set consists of a broad collection of natural images; whereas the target set consists exclusively of various breeds of dog. We would assume that any images of dogs in the source set would help in the target classification task; images of wolves might also help, as might cats. Further afield it might be possible that objects with similar textures as dog fur might be useful, such as rugs. On the flip side, it is probably less likely that images of airplanes and beaches will be relevant (though not impossible). However, the idea is not to come with any preconceived notions (semantic or otherwise) as to which source images will help; rather, the goal is to let the algorithm choose the relevant source images, in an end-to-end fashion.

We assume that the source and target classifier networks have the same architecture, but different network parameters. In particular, the architecture is given by

$$F(x; \theta, \phi),$$

where $\phi$ is the last layer, or possibly the last few layers, and $\theta$ constitutes all of the remaining layers. We will refer to $\phi$ colloquially as the “classifier”, and to $\theta$ as the “features” or “representation”. (This is consistent with the usage in related papers, see for example tzeng2017adversarial .) Now, the source and target will share features, but not classifiers; that is, the source network will be given by $F(x; \theta, \phi^s)$, whereas the target network will be $F(x; \theta, \phi^t)$. The features $\theta$ are shared between the two, and this is what allows for transfer learning.

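The weighted source loss is given by

$$L_s(\theta, \phi^s, \alpha) = \frac{1}{n^s} \sum_{j=1}^{n^s} \alpha_j \, \ell\big(y_j^s, F(x_j^s; \theta, \phi^s)\big),$$

where $\alpha_j \in [0,1]$ is a weight assigned to each source training example; and $\ell(\cdot, \cdot)$ is a per example classification loss, in this case cross-entropy. The use of the weights $\alpha_j$ will allow us to decide which source images are most relevant for the target classification task.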
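As a minimal sketch of the weighted source loss, the snippet below averages per-example cross-entropy losses after scaling each one by its source weight. The function names and the toy logits are illustrative assumptions, not taken from the paper; the normalisation by the number of examples (rather than by the sum of the weights) is also an assumption of this sketch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Per-example cross-entropy loss for an integer class label."""
    return -math.log(softmax(logits)[label])

def weighted_source_loss(batch_logits, labels, weights):
    """Mean over examples of (source weight * per-example loss)."""
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)

# Hypothetical toy batch: the third label disagrees with its logits.
logits = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
labels = [0, 1, 1]
uniform = weighted_source_loss(logits, labels, [1.0, 1.0, 1.0])
filtered = weighted_source_loss(logits, labels, [1.0, 1.0, 0.0])
```

Setting the third weight to zero removes that example's contribution entirely, so `filtered` is smaller than `uniform`; this is the mechanism by which low weights switch off uninformative or mislabelled source examples.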

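The target loss is standard:

$$L_t(\theta, \phi^t) = \frac{1}{n^t} \sum_{i=1}^{n^t} \ell\big(y_i^t, F(x_i^t; \theta, \phi^t)\big).$$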

As noted in Section 1, this formulation allows us to address both the transfer learning problem as well as learning with label noise. In the former case, the source and target may have non-overlapping label spaces; high weights will indicate which source examples have relevant knowledge for the target classification task. In the latter case, the source is the noisy dataset, the target is the clean dataset, and they share a label space; high weights will indicate which source examples do not have label noise, and are therefore reliable. In either case, the target is much smaller than the source.

The question now becomes: how can we combine the source and target losses into a single optimization problem? A simple idea is to create a weighted sum of source and target losses. Unfortunately, issues are likely to arise regardless of the weight chosen. If the target is weighted equally to the source, then overfitting is likely given the small size of the target. On the other hand, if the weights are proportional to the size of the two sets, then the source will simply drown out the target.

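A more promising idea is to use bilevel optimization. Specifically, in the interior level we find the optimal features and source classifier as a function of the weights $\alpha$, by minimizing the source loss:

$$\theta^*(\alpha), \phi^{s*}(\alpha) = \arg\min_{\theta, \phi^s} L_s(\theta, \phi^s, \alpha). \tag{1}$$

In the exterior level, we minimize the target loss, but only through access to the source weights; that is, we solve:

$$\min_{\alpha, \phi^t} L_t(\theta^*(\alpha), \phi^t). \tag{2}$$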

Why might we expect this bilevel formulation to succeed? The key is that the target only has access to the features in an indirect manner, by controlling which source examples are included in the source classification problem. Thus, the target can influence the features chosen, but only in this roundabout way. This serves as an extra form of regularization, mitigating overfitting, which is the main threat when dealing with a small set such as the target.

Implementing the bilevel optimization is rendered somewhat challenging due to the need to solve the optimization problem in the interior level (1). Note that this optimization problem must be solved at every point in time; thus, if we choose to solve the optimization (2) for the exterior level via gradient descent, we will need to solve the interior level optimization (1) at each iteration of the gradient descent. This is clearly inefficient. Furthermore, it is counter to the standard deep learning practice of taking small steps which improve the loss. Thus, we instead propose the following procedure.

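At a given iteration, we will take a gradient descent step for the interior level problem (1):

$$\theta_{m+1} = \theta_m - \lambda_p \frac{\partial L_s}{\partial \theta}(\theta_m, \phi^s_m, \alpha_m) = \theta_m - \lambda_p Q \alpha_m, \tag{3}$$

where $m$ is the iteration number; $\lambda_p$ is the learning rate (where the subscript $p$ stands for “parameters”, to distinguish it from a second learning rate for $\alpha$, to appear shortly); and $Q$ is a matrix whose $j^{th}$ column is given by

$$q_j = \frac{1}{n^s} \frac{\partial}{\partial \theta} \ell\big(y_j^s, F(x_j^s; \theta_m, \phi^s_m)\big).$$

Thus, Equation (3) leads to an improvement in the features $\theta$, for a fixed set of source weights $\alpha$. Note that there will be an identical descent equation for the classifier $\phi^s$, which we omit for clarity.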
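The interior step of Equation (3) can be sketched on a toy model. The snippet below assumes a plain logistic model with a single parameter vector `theta` and no separate classifier layer, which is a simplification for illustration only; `per_example_grad` and `interior_step` are hypothetical names.

```python
import math

def per_example_grad(theta, x, y):
    """Gradient of the logistic loss for one example (x, y), with y in {0, 1}."""
    p = 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))
    return [(p - y) * xi for xi in x]

def interior_step(theta, xs, ys, alpha, lr_p):
    """One weighted descent step: theta - lr_p * (Q applied to alpha)."""
    n = len(xs)
    # Column j of Q is the gradient of source example j, scaled by 1/n.
    Q_cols = [[g / n for g in per_example_grad(theta, x, y)]
              for x, y in zip(xs, ys)]
    # Q applied to alpha is the alpha-weighted sum of the columns.
    step = [sum(a * col[k] for a, col in zip(alpha, Q_cols))
            for k in range(len(theta))]
    new_theta = [t - lr_p * s for t, s in zip(theta, step)]
    return new_theta, Q_cols

# Two source examples; the second has weight zero and so does not move theta.
new_theta, Q_cols = interior_step(
    theta=[0.0, 0.0],
    xs=[[1.0, 0.0], [0.0, 1.0]],
    ys=[1, 0],
    alpha=[1.0, 0.0],
    lr_p=1.0,
)
```

Only the first coordinate of `new_theta` changes here, because the second example, whose gradient lies along the second coordinate, carries zero weight.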

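Given this iterative version of the interior level of the bilevel optimization, we may now turn to the exterior level. Plugging Equation (3) into Equation (2) gives the following problem:

$$\min_{\alpha, \phi^t} L_t(\theta_m - \lambda_p Q \alpha, \phi^t),$$

where we have suppressed $Q$’s arguments for readability. We can then take a gradient descent step of this equation, yielding:

$$\begin{aligned}
\alpha_{m+1} &= \alpha_m - \lambda_\alpha \frac{\partial}{\partial \alpha} L_t(\theta_m - \lambda_p Q \alpha_m, \phi^t) \\
&= \alpha_m + \lambda_\alpha \lambda_p Q^T \frac{\partial L_t}{\partial \theta}(\theta_m - \lambda_p Q \alpha_m, \phi^t) \\
&\approx \alpha_m + \lambda_\alpha \lambda_p Q^T \frac{\partial L_t}{\partial \theta}(\theta_m, \phi^t), \tag{4}
\end{aligned}$$

where in the final line, we have made use of the fact that $\lambda_p$ is small. Of course, there will also be a descent equation for the classifier $\phi^t$.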
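Read one coordinate at a time, the weight update has a simple interpretation. The following is a sketch in notation assumed here for illustration: $q_j$ is the (scaled) loss gradient of source example $j$ with respect to the features, and $\lambda_\alpha$, $\lambda_p$ are the learning rates for the weights and the network parameters.

```latex
[\alpha_{m+1}]_j \;\approx\; [\alpha_m]_j
  + \lambda_\alpha \lambda_p \, q_j^T \frac{\partial L_t}{\partial \theta}(\theta_m, \phi^t)
```

The weight of source example $j$ therefore grows when its own gradient has a positive inner product with the target gradient, that is, when a descent step on that example would also decrease the target loss to first order. It shrinks when the two gradients oppose each other.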

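We have not yet dealt with the weight constraint. That is, we would like to explicitly require that each $\alpha_j \in [0,1]$. We may achieve this by requiring $\alpha_j = \sigma(\beta_j)$, where the new variable $\beta_j \in \mathbb{R}$, and $\sigma: \mathbb{R} \to [0,1]$ is a sigmoid-type function. As shown in the supplementary material, for a particular piecewise linear sigmoid function, replacing the Update Equation (4) with a corresponding update equation for $\beta$ is equivalent to modifying Equation (4) to read

$$\alpha_{m+1} = \mathrm{CLIP}_{[0,1]}\left( \alpha_m + \lambda_\alpha \lambda_p Q^T \frac{\partial L_t}{\partial \theta}(\theta_m, \phi^t) \right), \tag{5}$$

where $\mathrm{CLIP}_{[0,1]}$ clips the values below $0$ to be $0$; and above $1$ to be $1$.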
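The clipped update can be sketched in a few lines. This is an illustration under assumptions: `Q_cols` holds one per-example source gradient per entry, `target_grad` is the target-loss gradient with respect to the shared features, and all numeric values are made up.

```python
def clip(value, low=0.0, high=1.0):
    """Clamp a scalar to the interval [low, high]."""
    return max(low, min(high, value))

def update_weights(alpha, Q_cols, target_grad, lr_alpha, lr_p):
    """Apply alpha_j <- CLIP(alpha_j + lr_alpha * lr_p * <q_j, target_grad>)."""
    new_alpha = []
    for a, q in zip(alpha, Q_cols):
        alignment = sum(qk * gk for qk, gk in zip(q, target_grad))
        new_alpha.append(clip(a + lr_alpha * lr_p * alignment))
    return new_alpha

alpha = [0.5, 0.5, 0.95]
Q_cols = [[1.0, 0.0], [-1.0, 0.0], [2.0, 0.0]]   # hypothetical source gradients
target_grad = [1.0, 0.0]                          # hypothetical target gradient
new_alpha = update_weights(alpha, Q_cols, target_grad, lr_alpha=1.0, lr_p=0.1)
```

In this toy case the first weight rises, the second falls, and the third would exceed the upper bound and is clipped back to it.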

Thus, SOSELETO consists of alternating Equations (3) and (5), along with the descent equations for the source and target classifiers $\phi^s$ and $\phi^t$. As usual, the whole operation is done on a mini-batch basis, rather than using the entire set. SOSELETO is summarized in Algorithm 1. Note that the target derivatives $\partial L_t / \partial \theta$ and $\partial L_t / \partial \phi^t$ are evaluated over a target mini-batch; we suppress this for clarity.

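  Initialize: $\theta_0$, $\phi^s_0$, $\phi^t_0$, $\alpha_0$.
  while not converged do
     Sample source batch and target batch
     Update the features $\theta$ via Equation (3), and the source classifier $\phi^s$ via its descent equation
     Update the source weights $\alpha$ via Equation (5), and the target classifier $\phi^t$ via its descent equation
  end while
Algorithm 1 SOSELETO: SOurce SELEction for Target Optimization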
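The alternating updates can be sketched end to end on a toy noisy-label problem. This is a sketch under heavy simplifications that are assumptions of this example, not the paper's setup: a one-parameter logistic model, no separate classifier layers, full batches instead of mini-batches, and arbitrary learning rates.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soseleto_toy(source, target, lr_p=0.5, lr_alpha=50.0, steps=100):
    """Alternate the weighted source step and the clipped weight step.

    source, target: lists of (x, y) with scalar x and y in {0, 1}.
    """
    theta = 0.0
    alpha = [1.0] * len(source)
    n_s, n_t = len(source), len(target)
    for _ in range(steps):
        # Per-example source gradients, scaled by 1/n_s (the columns of Q).
        q = [(sigmoid(theta * x) - y) * x / n_s for x, y in source]
        # Target gradient with respect to the shared parameter.
        g_t = sum((sigmoid(theta * x) - y) * x for x, y in target) / n_t
        # Weight step: move alpha_j by its alignment with the target gradient, then clip.
        new_alpha = [min(1.0, max(0.0, a + lr_alpha * lr_p * qj * g_t))
                     for a, qj in zip(alpha, q)]
        # Feature step: weighted source descent using the current weights.
        theta -= lr_p * sum(a * qj for a, qj in zip(alpha, q))
        alpha = new_alpha
    return theta, alpha

random.seed(0)

def clean_point():
    """A scalar input with the clean label 1 if x > 0, else 0."""
    x = random.choice([-1, 1]) * random.uniform(0.2, 1.0)
    return x, int(x > 0)

target = [clean_point() for _ in range(10)]
source = [clean_point() for _ in range(40)]
flipped = set(range(0, 40, 4))   # corrupt every fourth source label
source = [(x, 1 - y) if j in flipped else (x, y)
          for j, (x, y) in enumerate(source)]

theta, alpha = soseleto_toy(source, target)
```

In this one-dimensional setting the gradient of a correctly labelled source point always has the same sign as the target gradient, and that of a flipped point always has the opposite sign. Correct points therefore stay at the upper clip value, while flipped points are pushed down.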

In terms of time-complexity, we note that each iteration requires both a source batch and a target batch; assuming identical batch sizes, this means that SOSELETO requires about twice the time of the ordinary source classification problem. Regarding space-complexity, in addition to the ordinary network parameters we need to store the source weights $\alpha$. Thus, the additional relative space-complexity required is the ratio of the source dataset size to the number of network parameters. This is obviously problem and architecture dependent; a typical number might be given by taking the source dataset to be Imagenet ILSVRC-2012 (size 1.2M) and the architecture to be ResNeXt-101 xie2017aggregated (size 44.3M parameters), yielding a relative space increase of about 3%.
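The quoted space overhead can be checked with a line of arithmetic, using the dataset and parameter counts stated above (one stored weight per source example):

```python
num_source_weights = 1.2e6    # one weight per ILSVRC-2012 training image
num_network_params = 44.3e6   # ResNeXt-101 parameter count quoted in the text
relative_increase = num_source_weights / num_network_params
print(f"{relative_increase:.1%}")  # prints 2.7%
```

This rounds to the figure of about 3% given in the text.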

3.2 Convergence properties

SOSELETO is only an approximation to the solution of a bilevel optimization problem. As a result, it is not entirely clear whether it will even converge. In this section, we demonstrate a set of sufficient conditions for SOSELETO to converge to a local minimum of the target loss $L_t$.

To this end, let us examine the change in the target loss from iteration $m$ to $m+1$. As shown in the supplementary material, to first order we have

Note that the latter two terms are both negative, and will therefore cause the target loss to decrease, as desired. As regards the first term, matters are unclear. However, it is clear that if we set the learning rate sufficiently large, the second term will eventually dominate the first term, and the target loss will be decreased. Indeed, we can do a slightly finer analysis. As shown in the supplementary material, can be upper bounded as follows:

Thus, a sufficient condition for the target loss to decrease is if

This bound is stricter than what is required in practice; we discuss actual setting for the two learning rates in Section 4.

4 Results

We briefly discuss some implementation details. In all experiments, we use the SGD optimizer without learning rate decay. We initialize all $\alpha$-values to the same value, and in practice clip them to a slightly expanded range; this allows more relevant source points some room to grow. Other settings are experiment specific, and are discussed in the relevant sections.

4.1 Noisy labels: synthetic experiment

Figure 1: Noisy labels: synthetic experiment. Left: clean dataset (target). Middle: noisy dataset (source). Right: noisy dataset, with size of points indicating the corresponding weight ($\alpha$-value).

To illustrate how SOSELETO works on the problem of learning with noisy labels, we begin with a small synthetic experiment, see Figure 1. The clean dataset, i.e. the target, consists of 50 points in $\mathbb{R}^2$; the two classes may be linearly separated by a line at a fixed angle clockwise from the y-axis. The noisy dataset consists of 100 points in $\mathbb{R}^2$; in this case, the classes are (erroneously) separated by the y-axis. Thus, any point between the y-axis and the true separating line is incorrectly labelled in the noisy dataset; this is the “mislabelled region”.

We run SOSELETO, which converges quickly on the target set. More interesting are the values of the source weights, i.e. the $\alpha$-values, which are illustrated in Figure 1 on the right. We see that points in the mislabelled region all have small weight, and some (e.g. black points at the bottom) have even disappeared from the plot, due to having weights of essentially zero. Points not in the mislabelled region all have larger weights; in fact, the points with the largest weights seem to lie near the true class separator, somewhat in the manner of support vectors.

4.2 Noisy labels: CIFAR-10

We now turn to a real-world setting of the problem of learning with label noise. We use a noisy version of CIFAR-10 krizhevsky2009learning , following the settings used in sukhbaatar2014training ; xiao2015learning . In particular, an overall noise level is selected. Based on this, a label confusion matrix is chosen such that the diagonal entries of the matrix are equal to one minus the noise level, and the off-diagonals are chosen randomly (while maintaining the matrix’s stochasticity). Noisy labels are then sampled according to this confusion matrix. We run experiments for various overall noise levels.
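The label corruption procedure can be sketched as follows. The text does not say how the random off-diagonal entries are drawn, so drawing them uniformly and rescaling each row is an assumption of this sketch, as are the function names and the sample size.

```python
import random

def make_confusion_matrix(num_classes, noise_level, rng):
    """Row-stochastic matrix with 1 - noise_level on the diagonal."""
    matrix = []
    for true_class in range(num_classes):
        # Random off-diagonal mass, rescaled so that it sums to noise_level.
        raw = [rng.random() if k != true_class else 0.0
               for k in range(num_classes)]
        scale = noise_level / sum(raw)
        row = [r * scale for r in raw]
        row[true_class] = 1.0 - noise_level
        matrix.append(row)
    return matrix

def corrupt_labels(labels, matrix, rng):
    """Sample a noisy label for each clean label from its confusion-matrix row."""
    classes = list(range(len(matrix)))
    return [rng.choices(classes, weights=matrix[y])[0] for y in labels]

rng = random.Random(0)
C = make_confusion_matrix(num_classes=10, noise_level=0.3, rng=rng)
clean = [rng.randrange(10) for _ in range(5000)]
noisy = corrupt_labels(clean, C, rng)
observed_noise = sum(c != n for c, n in zip(clean, noisy)) / len(clean)
```

Because every row keeps probability `1 - noise_level` on the true class, the fraction of corrupted labels in a large sample should be close to the chosen noise level.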

The target consists of a small clean dataset. CIFAR-10’s train set consists of 50K images; of this 50K, both sukhbaatar2014training ; xiao2015learning set aside 10K clean examples for pre-training, a necessary step in both of these algorithms. In contrast, we use a smaller clean dataset of half the size, i.e. 5K examples. We compare our results to the two state-of-the-art methods sukhbaatar2014training ; xiao2015learning , as they both address the same setting as we do: the large noisy dataset is accompanied by a small clean dataset, with no extra side-information available. In addition, we compare with the baseline of simply training on the noisy labels without modification. In all cases, Caffe’s CIFAR-10 Quick cifar10quick architecture has been used. For SOSELETO, the source batch-size is 32, and the target batch-size is 256. We use a larger target batch-size to enable more $\alpha$-values to be affected quickly.

Results are shown in Table 1 for three different overall noise levels, 30%, 40%, and 50%. Performance is reported for CIFAR-10’s test set, which is of size 10K. (Note that the competitors’ performance numbers are taken from xiao2015learning .) SOSELETO achieves state of the art on all three noise levels, with considerably better performance than both sukhbaatar2014training and xiao2015learning . Furthermore, it does so in each case with only half of the clean samples used in sukhbaatar2014training ; xiao2015learning .

We perform further analysis by examining the -values that SOSELETO chooses on convergence, see Figure 2. To visualize the results, we imagine thresholding the training samples in the source set on the basis of their -values; we only keep those samples with greater than a given threshold. By increasing the threshold, we both reduce the total number of samples available, as well as change the effective noise level, which is the fraction of remaining samples which have incorrect labels. We may therefore plot these two quantities against each other, as shown in Figure 2; we show three plots, one for each noise level. Looking at these plots, we see for example that for the noise level, if we take the half of the training samples with the highest -values, we are left with only about which have incorrect labels. We can therefore see that SOSELETO has effectively filtered out the incorrect labels in this instance. For the and noise levels, the corresponding numbers are about and incorrect labels; while not as effective in the noise level, SOSELETO is still operating as designed. Further evidence for this is provided by the large slopes of all three curves on the righthand side of the graph.
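The thresholding analysis described above can be sketched as a small helper that, for each threshold, reports the fraction of source samples kept and the effective noise level among them. The weights and mislabelling flags below are hypothetical values chosen for illustration, not results from the paper.

```python
def effective_noise_curve(alphas, is_mislabelled, thresholds):
    """Return (fraction kept, fraction of kept samples mislabelled) per threshold."""
    curve = []
    n = len(alphas)
    for t in thresholds:
        # Keep only samples whose weight is strictly greater than the threshold.
        kept = [bad for a, bad in zip(alphas, is_mislabelled) if a > t]
        if not kept:
            break  # nothing left to measure at this or higher thresholds
        curve.append((len(kept) / n, sum(kept) / len(kept)))
    return curve

alphas = [0.9, 0.8, 0.1, 0.7, 0.2, 0.95]                   # hypothetical learned weights
is_mislabelled = [False, False, True, False, True, False]  # hypothetical ground truth
curve = effective_noise_curve(alphas, is_mislabelled, thresholds=[0.0, 0.5])
```

In this toy case, raising the threshold from 0.0 to 0.5 keeps two thirds of the samples and removes both mislabelled ones, which is the qualitative behaviour the plots in Figure 2 are meant to show.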

Noise Level | CIFAR-10 Quick | Sukhbaatar et al. sukhbaatar2014training (10K clean examples) | Xiao et al. xiao2015learning (10K clean examples) | SOSELETO (5K clean examples)
Table 1: Noisy labels: CIFAR-10. Best results in bold.
Figure 2: Noisy labels on CIFAR-10: Effect of -values chosen by SOSELETO. Blue is noise, green is noise, red is noise. See accompanying explanation in the text.
Uses Unlabelled Data? | Method
No | Target only
No | Fine-tuning
Yes | Matching Nets vinyals2016matching
Yes | Fine-tuned Matching Nets (variant of vinyals2016matching )
Yes | Fine-tune + domain adversarial luo2017label
Yes | Label Efficient luo2017label
Table 2: SVHN 0-4 to MNIST 5-9. Best results in bold.

4.3 Transfer learning: SVHN 0-4 to MNIST 5-9

We now examine the performance of SOSELETO on a transfer learning task. In order to provide a challenging setting, we choose to (a) use source and target sets with disjoint label sets, and (b) use a very small target set. In particular, the source dataset is chosen to be the subset of Google Street View House Numbers (SVHN) netzer2011reading corresponding to digits 0-4. SVHN’s train set is of size 73,257 images, with about half of those belonging to the digits 0-4. The target dataset is a very small subset of MNIST lecun1998gradient corresponding to digits 5-9. While MNIST’s train set is of size 60K, with 30K corresponding to digits 5-9, we use very small subsets: either 20 or 25 images, with equal numbers sampled from each class (4 and 5, respectively). Thus, as mentioned, there is no overlap between source and target classes, making it a true transfer learning (rather than domain adaptation) problem; and the small target set size adds further challenge. Furthermore, this task has already been examined in luo2017label .

We compare our results with the following techniques. Target only, which indicates training on just the target set; standard fine-tuning; Matching Nets vinyals2016matching , a few-shot technique which is relevant given the small target size; fine-tuned Matching Nets, in which the previous result is then fine-tuned on the target set; and two variants of the Label Efficient Learning technique luo2017label : one which includes fine-tuning plus a domain adversarial loss, and the other the full technique presented in luo2017label . Note that besides the target only and fine-tuning approaches, all other approaches depend on unlabelled target data. Specifically, they use all of the remaining MNIST 5-9 examples, about 30K, in order to aid in transfer learning. SOSELETO, by contrast, does not make use of any of this data.

For each of the above methods, the simple LeNet architecture lecun1998gradient was used. For SOSELETO, the source batch-size is 32, and the target batch-size is 10 (it is chosen to be small since the target itself is very small). Additionally, the SVHN images were resized to $28 \times 28$, to match the MNIST size. The performance of the various methods is shown in Table 2, and is reported for MNIST’s test set which is of size 10K. We have divided Table 2 into two parts: those techniques which use the 30K examples of unlabelled data, and those which do not. SOSELETO has superior performance to all of the techniques which do not use unlabelled data. Furthermore, SOSELETO has superior performance to all of the techniques which do use unlabelled data, except the Label Efficient technique. It is noteworthy in particular that SOSELETO outperforms the few-shot techniques, despite not being designed to deal with such small amounts of data.

Finally, we note that although SOSELETO is not designed to use unlabelled data, one may do so using the following two-stage procedure. Stage 1: run SOSELETO as described above. Stage 2: use the learned SOSELETO classifier to classify the unlabelled data. This will now constitute a dataset with noisy labels, and SOSELETO can now be run in the mode of training with label noise, where the noisily labelled unsupervised data is now the source, and the target remains the same small clean set. In the case of , this procedure elevates the accuracy to above .

5 Conclusions

We have presented SOSELETO, a technique for exploiting a source dataset to learn a target classification task. This exploitation takes the form of joint training through bilevel optimization, in which the source loss is weighted by sample, and is optimized with respect to the network parameters; while the target loss is optimized with respect to these weights and its own classifier. We have derived an efficient algorithm for performing this bilevel optimization, through joint descent in the network parameters and the source weights, and have analyzed the algorithm’s convergence properties. We have empirically shown the effectiveness of the algorithm on both learning with label noise, as well as transfer learning problems. We note that SOSELETO is architecture-agnostic, and thus may be easily deployed. Furthermore, although we have focused on classification tasks, the technique is general and may be applied to other learning tasks within computer vision; this is an important direction for future research.


References

  • (1) Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
  • (2) Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • (3) Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • (4) Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European conference on computer vision, pages 213–226. Springer, 2010.
  • (5) Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • (6) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning-Volume 37, pages 97–105. JMLR. org, 2015.
  • (7) Yaroslav Ganin and Victor Lempitsky.

    Unsupervised domain adaptation by backpropagation.

    In International Conference on Machine Learning, pages 1180–1189, 2015.
  • (8) Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4068–4076. IEEE, 2015.
  • (9) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, 2017.
  • (10) Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
  • (11) Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208–2217, 2017.
  • (12) Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. In

    AAAI Conference on Artificial Intelligence

    , 2018.
  • (13) Yue Cao, Mingsheng Long, and Jianmin Wang. Unsupervised domain adaptation with distribution matching machines. In AAAI Conference on Artificial Intelligence, 2018.
  • (14) Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Michael I Jordan. Partial transfer learning with selective adversarial networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • (15) P Panareda Busto and Juergen Gall. Open set domain adaptation. In The IEEE International Conference on Computer Vision (ICCV), volume 1, page 3, 2017.
  • (16) Weifeng Ge and Yizhou Yu. Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, volume 6, 2017.
  • (17) Zelun Luo, Yuliang Zou, Judy Hoffman, and Li F Fei-Fei. Label efficient learning of transferable representations acrosss domains and tasks. In Advances in Neural Information Processing Systems, pages 164–176, 2017.
  • (18) Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey.

    IEEE transactions on neural networks and learning systems

    , 25(5):845–869, 2014.
  • (19) Jonathan Krause, Benjamin Sapp, Andrew Howard, Howard Zhou, Alexander Toshev, Tom Duerig, James Philbin, and Li Fei-Fei. The unreasonable effectiveness of noisy data for fine-grained recognition. In European Conference on Computer Vision, pages 301–320. Springer, 2016.
  • (20) Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 843–852. IEEE, 2017.
  • (21) Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
  • (22) Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.
  • (23) Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1910–1918, 2017.
  • (24) David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.
  • (25) Eran Malach and Shai Shalev-Shwartz. Decoupling" when to update" from" how to update". In Advances in Neural Information Processing Systems, pages 961–971, 2017.
  • (26) Amnon Drory, Shai Avidan, and Raja Giryes. On the resistance of neural nets to label noise. arXiv preprint arXiv:1803.11410, 2018.
  • (27) Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of operations research, 153(1):235–256, 2007.
  • (28) Jonathan F Bard. Practical bilevel optimization: algorithms and applications, volume 30. Springer Science & Business Media, 2013.
  • (29) Ankur Sinha, Pekka Malo, and Kalyanmoy Deb. A review on bilevel optimization: from classical to evolutionary approaches and applications.

    IEEE Transactions on Evolutionary Computation

    , 22(2):276–295, 2018.
  • (30) Kristin P Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli, and Jong-Shi Pang. Model selection via bilevel optimization. In Neural Networks, 2006. IJCNN’06. International Joint Conference on, pages 1922–1929. IEEE, 2006.
  • (31) Kristin P Bennett, Gautam Kunapuli, Jing Hu, and Jong-Shi Pang. Bilevel optimization and machine learning. In IEEE World Congress on Computational Intelligence, pages 25–47. Springer, 2008.
  • (32) Peter Ochs, René Ranftl, Thomas Brox, and Thomas Pock. Bilevel optimization with nonsmooth lower level problems. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 654–665. Springer, 2015.
  • (33) Dawei Yang and Jia Deng. Shape from shading through shape evolution. arXiv preprint arXiv:1712.02961, 2017.
  • (34) Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
  • (35) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report, 2009.
  • (36) CIFAR-10 Quick network.
  • (37) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
  • (38) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, 2011.
  • (39) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.