Deep Learning from Noisy Image Labels with Quality Embedding

11/02/2017 ∙ by Jiangchao Yao, et al. ∙ Shanghai Jiao Tong University

There is an emerging trend to leverage noisy image datasets in many visual recognition tasks. However, the label noise in these datasets severely degrades the performance of deep learning approaches. Recently, one mainstream approach has been to introduce a latent label to handle label noise, which has shown promising improvement in network designs. Nevertheless, the mismatch between latent labels and noisy labels still affects the predictions of such methods. To address this issue, we propose a quality embedding model, which explicitly introduces a quality variable to represent the trustworthiness of noisy labels. Our key idea is to identify the mismatch between the latent and noisy labels by embedding the quality variables into different subspaces, which effectively minimizes the noise effect. At the same time, the high-quality labels can still be used for training. To instantiate the model, we further propose a Contrastive-Additive Noise network (CAN), which consists of two important layers: (1) the contrastive layer estimates the quality variable in the embedding space to reduce the noise effect; and (2) the additive layer aggregates the prior predictions and noisy labels as the posterior to train the classifier. Moreover, to tackle the optimization difficulty, we derive an SGD algorithm with reparameterization tricks, which makes our method scalable to big data. We evaluate the proposed method experimentally over a range of noisy image datasets. Comprehensive results demonstrate that CAN outperforms state-of-the-art deep learning approaches.







I Introduction

While editorially labeled image data is crucial to visual classification [1, 2, 3, 4] and to weakly supervised detection and segmentation [5, 6, 7, 8, 9, 10], collecting such datasets in large volume can be prohibitive. Non-editorial means such as social tagging and crowdsourcing have been explored as efficient alternatives [11, 12, 13]. For example, a plethora of tagged images is available on the Flickr website, which provides valuable labeled resources for building image classifiers. The challenge, however, is that social tags used as labels are highly noisy. As a result, deep learning from noisy image labels has attracted increasing attention [14].

Previous studies have investigated label noise [15, 16, 17, 18, 19] for non-deep approaches in the machine learning community. For example, Vikas et al. [15] introduce parameters for annotators to model the transition from latent predictions to noisy labels. For parameter estimation, they resort to an EM optimization algorithm that is also adopted in contemporaneous works. However, it is not straightforward to apply these studies to deep learning methods due to the computational cost of the EM optimization.

With the success of deep learning in computer vision [1, 2, 3, 4], training neural networks with noisy image labels has also been explored [20, 14, 21, 22, 23, 24, 25, 26, 27, 28, 29]. These methods fall into two categories: building a robust loss function and modeling the latent label. The former paradigm is heuristic and usually depends on non-trivial hyperparameter selection. For instance, Reed et al. [24] construct a weighted combination of noisy image labels and predictions to supervise the network training. However, it is unclear how the weight should be set to match real-world label noise. As a popular example of the latter paradigm, Sukhbaatar et al. [14] model the latent label to handle the label noise. Specifically, the classifier is trained on latent labels, so the label noise does not directly affect the classifier. However, they adapt latent labels to noisy labels with a linear transition layer, which cannot sufficiently model the label corruption. Label noise can still pass through this layer and degrade the performance. The deficiency of the above deep learning methods is that they do not explicitly model the trustworthiness of noisy labels. Considering noise only implicitly, in the loss function or by modeling the latent label, may fail to reflect the nature of the noise, e.g., flips and outliers.

Fig. 1: Analysis of back-propagation in previous methods that model the latent label, as well as our idea to avoid the effect of label noise. (a) All images are forwarded through the model, and the mismatch errors caused by both label estimation and label noise are back-propagated. (b) With quality embedding as a control from latent labels to predictions, the negative effect of label noise is reduced in back-propagation.

In this paper, we follow the latter paradigm and propose a quality embedding model. Fig. 1 illustrates our idea as well as its advantage in reducing the noise effect. For example, in Fig. 1(a), the latent labels and predictions of the first three cat images should be approximately consistent due to their content similarity. However, a mismatch occurs between the second prediction and the corresponding annotation because of the label noise. For the fourth image, the prediction, which is induced by the estimation error of the latent label, also conflicts with the fourth annotation. As a result, these two kinds of mismatch are mixed together in back-propagation. On the other hand, if we explicitly introduce a quality variable to model the trustworthiness of noisy labels as in Fig. 1(b), label noise can be reduced more effectively. For example, if the quality variable of the second sample is embedded in the non-trustworthy subspace, the latent label can be perturbed accordingly, which prevents the mismatch error caused by label noise from being back-propagated. For the fourth sample, whose quality variable is estimated to lie in the trustworthy subspace, the latent label still transits to the final prediction and causes the mismatch, so the supervision from the correct annotation is fed back normally.

Mathematically, we illustrate the corresponding graphical model in Fig. 2. Different from previous latent-label-based deep learning approaches, a quality variable is specially introduced to model the trustworthiness of noisy labels. By embedding the quality variable into different subspaces, the shortcoming illustrated in Fig. 1(a) can be resolved as in Fig. 1(b). To instantiate our probabilistic model with a deep neural network, we further design a Contrastive-Additive Noise network (CAN), shown in Fig. 3. For parameter learning, we optimize an evidence lower bound [30, 31, 32] plus a variational mutual information regularizer, and derive an SGD algorithm. The major contributions of this paper can be summarized in the following four parts.

  • To address the shortcoming of existing latent-label-based deep learning approaches, we propose a quality embedding model that introduces a quality variable to represent the trustworthiness of noisy labels. By embedding the quality variable into different subspaces, the negative effect of label noise can be effectively reduced. Simultaneously, the supervision from high-quality labels can still be back-propagated normally for training.

  • To instantiate the quality embedding model, we design a Contrastive-Additive Noise network. Specifically, it consists of two important layers: (1) the contrastive layer estimates the quality variable in the embedding space to reduce the noise effect; (2) the additive layer aggregates prior predictions and noisy labels as the posterior to train the classifier.

  • To tackle the optimization difficulty, we apply reparameterization tricks and derive an efficient SGD algorithm, which makes our model scalable to big data.

  • We conduct a range of experiments to demonstrate that CAN outperforms existing state-of-the-art deep learning methods on noisy datasets. We further present a qualitative analysis of quality embedding, latent label estimation and noise patterns to give deeper insight into our model.

The rest of this paper is organized as follows. Section II briefly reviews related work on learning with noisy labels in deep learning. Section III introduces our quality embedding model, its instantiation as the Contrastive-Additive Noise network, and the optimization algorithm. We validate the effectiveness of our method over a range of experiments in Section IV. Section V concludes the paper.

II Related Work

Social websites and crowdsourcing platforms provide an effective way to gather a large number of low-cost annotations for images. However, in visual recognition tasks such as image classification, label noise can severely degrade the performance of classification models [33]. To exploit the value of noisy labels, several noise-aware deep learning methods have been proposed for the image classification task. Here, we briefly review these related works.

Robust loss function This line of research aims at designing a robust loss function to alleviate the noise effect. For instance, Joulin et al. [34] weight the cross-entropy loss with the sample number to balance the emphasis of noise in positive and negative instances. Izadinia et al. [23] estimate a global ratio of positive samples to weaken the supervision in the loss function. Reed et al. [24] consider the consistency of predictions on similar images and apply bootstrapping to the loss function. They substitute the noisy label with a weighted combination of the noisy label and the prediction to encourage consistent output. Recently, Li et al. [28] re-weight the noisy label with a soft label learned from side information. They train a teacher network on a clean dataset and compute the soft label by leveraging a knowledge graph. The soft label is then combined with the noisy label in the loss function to guide the student model's learning. Andreas et al. [29] rectify labels in the cross-entropy loss with a label-correction network trained on an extra clean dataset. While these methods are concerned with modifying the labels in the loss function by re-weighting or rectification, our approach additionally models the trustworthiness of noisy image labels to reduce the noise effect on training.

Modeling the latent labels This paradigm aims at modeling the latent labels to train the classifier, and at building a transition that adapts the latent labels to the noisy labels. With the success of deep learning in image recognition, this idea has received considerable attention. Mnih et al. [20] first propose a latent variable model on aerial images, which assumes that the noise is symmetric and random. Building on it, [14, 27] use a linear adaptation layer to model asymmetric label noise, and add the layer on top of a deep neural network. This transition layer can be regarded as a confusion matrix representing the label flip probabilities. However, the matrix depends only on the distribution of labels and ignores the image content. Chen et al. [12] apply a two-stage approach to model the latent label and learn the transition to the noisy label, in which a clean dataset is used. Different from methods that model the label transition at the dataset level, Xiao et al. [26] propose a probabilistic graphical model that perturbs the label at the image level. However, the model also needs a small amount of clean data to learn the conditional probability, which may constrain the generalization of the model. To demonstrate that human-centric noisy labels exhibit a specific structure that can be modeled, Misra et al. [22] build two parallel classifiers. One classifier deals with image recognition and the other models human reporting bias. However, it still suffers from the problem illustrated in Fig. 1(a), since similar images have similar latent variables. Although these methods take advantage of deep neural networks to model the latent label, the simple transition cannot sufficiently model the label corruption. We go further by uncovering the annotation quality from the training data and using it to guide the learning of our model.

III Quality Embedding Model

III-A Preliminaries

Consider a noisy image dataset in which each tuple consists of one image X and its noisy label vector Y. Note that X can be the original image or the feature vector extracted from the image. Y is a binary vector with one entry per category, indicating which labels are annotated. However, Y may be corrupted with annotation noise and thus be incorrect. We denote the underlying clean label by Z. We introduce S, a quality variable embedded in a multi-dimensional Gaussian space, to represent the annotation quality of Y. For ease of reference, we list the notations of this paper in Table I.

Formally, this is a multi-label, multi-class classification problem with noisy labels. We aim to train a deep classifier from these noisy training samples. Many other tasks are consistent with this setting, such as weakly supervised object detection and segmentation [5, 6, 7, 8, 9, 10] with web data.

Fig. 2: Quality embedding model for noisy image labels. The shaded nodes as the observed variables are image X and its noisy label vector Y. The latent label vector Z and the quality variable vector S are latent variables. Solid lines and dashed lines represent the generative process and the inference process respectively.

III-B Quality Embedding Model

III-B1 Quality embedding

In this section, we introduce a quality variable in parallel to the latent label; the two jointly transit to the noisy image label. Our probabilistic graphical model is illustrated in Fig. 2. In the generative process, the latent label vector Z depends purely on the instance X. We model this dependency with P(Z|X). The noisy label vector Y, however, is generated based on both the annotation quality S and the latent label Z, which we model with P(Y|Z,S). In the inference process, the distributions of Z and S are both modeled based on X and Y. We represent these two distributions with q(Z|X,Y) and q(S|X,Y) respectively, which play the role of posterior approximations.

According to the graphical model in Fig. 2, once given the training set, we have the following log-likelihood.


However, the log-likelihood function is difficult to compute explicitly. We instead choose to optimize an adjustable evidence lower bound (ELBO) [30, 31, 32]. The ELBO is obtained by introducing two variational distributions q(Z|X,Y) and q(S|X,Y) to approximate the true posterior distributions of Z and S. We give the form of our ELBO in Eq. (2).
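As a hedged sketch derived from the graphical model in Fig. 2 (not the authors' exact typesetting), the log-likelihood and a lower bound of the kind referred to above can be written as follows. The sketch assumes a prior P(S) on the quality variable and a variational posterior that factorizes as q(Z|X,Y) q(S|X,Y); both assumptions, and the notation itself, are illustrative.

```latex
\begin{align*}
\ln P(Y \mid X)
  &= \ln \sum_{Z} \int P(Z \mid X)\, P(S)\, P(Y \mid Z, S)\, \mathrm{d}S \\
  &\ge \mathbb{E}_{q(Z \mid X, Y)\, q(S \mid X, Y)}\big[\ln P(Y \mid Z, S)\big]
     - D_{\mathrm{KL}}\big(q(Z \mid X, Y)\,\|\,P(Z \mid X)\big)
     - D_{\mathrm{KL}}\big(q(S \mid X, Y)\,\|\,P(S)\big)
\end{align*}
```

Under these assumptions the inequality is Jensen's inequality applied to the factorized variational posterior, and the slack is the KL divergence between the variational posterior and the true posterior over (Z, S).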


The above bound is a good approximation of the marginal likelihood, which provides a basis for model selection [32]. When the gap between the marginal likelihood and the ELBO becomes zero, the variational distributions match the true posterior distributions.

Notation Description
number of training items
number of categories
dimension of the quality variable
image variable
noisy label vector variable
latent label vector variable
quality vector variable
parameter of classifier network
parameter of noise network
parameter of annotation quality network
parameter of latent label network
index of an item
th observed image
th observed noisy label vector
th latent label vector
th quality vector
mean of Gaussian distribution
covariance diagonal of Gaussian distribution
regularization coefficient
th sample from Gumbel distribution
th sample from Gaussian distribution
temperature in Gumbel-SoftMax
time-varying coefficient
TABLE I: Notations and their descriptions frequently used in this paper

III-B2 Variational mutual information regularizer

Although Fig. 2 presents the structural prior of our probabilistic model, optimizing the ELBO may not converge to the desired optimum, since modeling the distributions with neural networks introduces much flexibility. This is a common problem in Bayesian models, and a general solution is posterior regularization [35]. Posterior regularization enforces the desired expectation while retaining computational efficiency. Such methods have been applied in clustering [36], classification [37] and image generation [38]. In this paper, we introduce regularization for the variational distributions of Z and S from the perspective of mutual information maximization. We derive the regularizers as follows,


where I denotes the mutual information of two distributions and H is the entropy of a variable. As can be seen in Eq. (3), maximizing the mutual information is equivalent to minimizing the entropy of q(Z|X,Y) and q(S|X,Y). For the latent label Z, such posterior regularization forces the probabilities close to the extreme points. For the quality variable S, it encourages the distribution q(S|X,Y) to have a low variance.
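The link between mutual information, entropy and variance can be made concrete with two standard identities. This is a generic illustration in assumed notation, not a reconstruction of the paper's regularizer; the symbols for the mean, the diagonal variance and the dimension index are hypothetical.

```latex
\begin{align*}
I(Z; X, Y) &= H(Z) - H(Z \mid X, Y), \\
H\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\big)
  &= \tfrac{1}{2} \sum_{d} \ln\!\big(2 \pi e\, \sigma_d^2\big).
\end{align*}
```

The first identity shows that lowering the conditional entropy raises the mutual information when the marginal entropy is held fixed. The second shows that, for a diagonal Gaussian, lower entropy means smaller variances, and a Bernoulli entropy is smallest when its probability is near 0 or 1.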

III-B3 Objective

Combining Eq. (2) with Eq. (3), our objective becomes the maximization of the ELBO along with the mutual information regularizer. Note that a coefficient is substituted into Eq. (3) to weight the regularization effect in the optimization. Instead of maximization, we rewrite our goal as the following minimization problem for the sake of simplicity.


From Eq. (4), our model mainly differs from previous methods in three aspects. First, P(Y|Z,S) indicates that the transition from the latent label to the noisy label is based on both Z and S, while previous methods [20, 14, 27] depend only on Z. Second, previous works [20, 14, 27, 26, 22] use a linear transition, while our model applies a nonlinear implementation of P(Y|Z,S). Third, Z and S are approximated with q(Z|X,Y) and q(S|X,Y) from the posterior perspective, while previous works [26, 28, 29] may have to rely on an extra clean dataset or other label knowledge.

Fig. 3: The network consists of four modules, encoder, sampler, decoder and classifier, which are trained end-to-end. The encoder learns latent labels and evaluates the quality of noisy labels; the sampler generates samples from the encoder outputs; the decoder tries to recover noisy labels from the samples. Meanwhile, our classifier is learned based on the KL-divergence between q(Z|X,Y) and P(Z|X).

III-C Contrastive-Additive Noise Network

In this section, we instantiate our model with a Contrastive-Additive Noise network (CAN), shown in Fig. 3. Briefly, CAN consists of four modules: encoder, sampler, decoder and classifier, which correspond to the different parts of our model. In the following, we describe the design in detail.

III-C1 Architectures

The encoder module is used to model the variational distributions q(Z|X,Y) and q(S|X,Y). Concretely, we first forward X through a neural network to generate a prior label judgement. Then, according to this prior prediction and Y, we model the distribution parameters with two specially designed layers. The neural network for X can be chosen according to the type of X. If X is the original image, a convolutional neural network can be applied. If X is a feature vector, a fully-connected network can be chosen. In Fig. 3, we take the convolutional neural network as an example. The sampler module implements Monte Carlo sampling for Z and S. It receives the output of the encoder module and samples from the Gumbel and Gaussian distributions to generate a sample set of Z and S. In the next section, we discuss this part in detail together with the reparameterization tricks. The decoder module is a neural network for P(Y|Z,S), which consists of two groups of (linear, ReLU) layers followed by a Sigmoid layer. It takes the sampler output to recover noisy labels. Previous works [20, 14, 27, 26, 22] usually use a linear transition from Z to Y. We consider a nonlinear transition since we have the heterogeneous quality variable S. The classifier module, which models our most important target P(Z|X), employs the same network as the one used for X in the encoder module. It is trained based on the KL-divergence between q(Z|X,Y) and P(Z|X).

III-C2 Contrastive layer and additive layer

We now describe the two important layers in the encoder module. The distribution q(S|X,Y) is a multi-dimensional Gaussian distribution, so both its mean and variance need to be modeled. We exploit the contrastive layer to implement the estimation. It internally forwards the prior prediction and Y into a shared fully-connected layer with ReLU and transforms their difference into the mean and variance with another fully-connected layer. It is simply represented as follows,

This contrastive layer is built on the assumption that the quality variable is related to the difference between the prior prediction and Y. We evaluate their difference in a latent space with the shared layer and decide which subspace it is embedded in with the second layer. This embedding mechanism lets us identify the label quality explicitly and subsequently helps to reduce the noise effect in Y. This idea has not been proposed in previous noise-aware deep learning approaches [20, 14, 21, 22, 23, 24, 25, 26, 27, 28, 29].
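A minimal NumPy sketch of the contrastive layer as described in the text: a shared fully-connected layer with ReLU embeds the prior prediction and the noisy label vector, and a second fully-connected layer maps their difference to the Gaussian parameters. The names (`contrastive_layer`, `W_shared`, `W_out`), the sizes `K`, `H`, `D`, the random weights, and the choice to output a log-variance rather than a variance are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def contrastive_layer(y_prior, y_noisy, params):
    """Map the difference of two embedded label vectors to (mean, log-variance)."""
    W_shared, b_shared, W_out, b_out = params
    # The same fully-connected layer (shared weights) with ReLU embeds both inputs.
    h_prior = relu(y_prior @ W_shared + b_shared)
    h_noisy = relu(y_noisy @ W_shared + b_shared)
    # A second fully-connected layer turns the difference into Gaussian parameters.
    out = (h_prior - h_noisy) @ W_out + b_out
    mu, log_var = np.split(out, 2, axis=-1)
    return mu, log_var

# Hypothetical sizes: K categories, H hidden units, D quality dimensions.
rng = np.random.default_rng(0)
K, H, D = 20, 16, 8
params = (
    rng.normal(size=(K, H)) * 0.1,      # shared embedding weights
    np.zeros(H),                        # shared embedding bias
    rng.normal(size=(H, 2 * D)) * 0.1,  # output weights (mean and log-variance)
    np.zeros(2 * D),                    # output bias
)
y_prior = rng.uniform(size=(4, K))                       # prior predictions in [0, 1)
y_noisy = rng.integers(0, 2, size=(4, K)).astype(float)  # binary noisy labels
mu, log_var = contrastive_layer(y_prior, y_noisy, params)
```

In this sketch, identical inputs give a zero difference, so the output reduces to the bias of the second layer; this mirrors the stated assumption that the quality variable is driven by the disagreement between prediction and annotation.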

The distribution q(Z|X,Y) consists of one Bernoulli distribution per category, and thus one probability per category needs to be modeled. We design an additive layer to learn these parameters. It internally uses two non-shared fully-connected layers to transform the prior prediction and Y into a latent space, and then feeds their sum into another fully-connected layer followed by a sigmoid function, illustrated as follows,

This design learns a posterior label from the prior prediction and Y through a nonlinear combination with a neural network. Previous methods in [34, 23, 24, 28, 29] use a weight in their loss function to linearly combine the noisy label with the “soft” label from the prediction, the clean dataset or other side information. They usually need non-trivial manual tuning, while we resort to an automatic learning procedure with a neural network.
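A minimal NumPy sketch of the additive layer described above: two non-shared fully-connected layers embed the prior prediction and the noisy label vector, their sum is passed through a third fully-connected layer, and a sigmoid yields one Bernoulli probability per category. All names, sizes and weights are hypothetical, and no activation is applied between the embedding and the sum because the text does not mention one.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def additive_layer(y_prior, y_noisy, params):
    """Combine prior prediction and noisy label into posterior label probabilities."""
    W_prior, b_prior, W_noisy, b_noisy, W_out, b_out = params
    # Two separate (non-shared) fully-connected layers map both inputs to a latent space.
    h = (y_prior @ W_prior + b_prior) + (y_noisy @ W_noisy + b_noisy)
    # A final fully-connected layer plus sigmoid gives per-category probabilities.
    return sigmoid(h @ W_out + b_out)

# Hypothetical sizes: K categories, H latent units.
rng = np.random.default_rng(1)
K, H = 20, 16
params = (
    rng.normal(size=(K, H)) * 0.1, np.zeros(H),  # layer for the prior prediction
    rng.normal(size=(K, H)) * 0.1, np.zeros(H),  # layer for the noisy label
    rng.normal(size=(H, K)) * 0.1, np.zeros(K),  # output layer
)
y_prior = rng.uniform(size=(4, K))
y_noisy = rng.integers(0, 2, size=(4, K)).astype(float)
posterior = additive_layer(y_prior, y_noisy, params)
```

Unlike a fixed weighted average of noisy label and prediction, the mixing here is carried by learned weights, which is the contrast the paragraph draws with the loss-reweighting methods.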

The whole network can be trained end-to-end, as explained in the next section. During training, the noise effect is reduced by the branch of the quality variable, and simultaneously the posterior label is estimated by the additive layer to guarantee more reliable training. We will demonstrate the effectiveness of our network in the experiments.

III-D Optimization

In this section, we analyze the difficulty in optimization and derive an SGD algorithm with reparameterization tricks.

III-D1 The reparameterization tricks

The first term on the RHS of Eq. (4) has no closed form when either q(Z|X,Y) or q(S|X,Y) is not conjugate with P(Y|Z,S), let alone when these distributions are modeled with deep neural networks, as in this paper. The general approach is Monte Carlo sampling. However, Paisley et al. [31] have shown that when the derivative is taken with respect to the parameters of the variational distributions, the sampling estimate exhibits high variance. In this case, a large number of samples is required for an accurate estimate, which may lead to high GPU load and a heavy computational burden. Fortunately, reparameterization tricks [39, 40] have been explored in recent years to overcome this difficulty. They have shown promising efficiency in discrete and continuous representation learning. In short, the idea behind reparameterization tricks is to decouple the integration variable into a parameter-related part and a parameter-free variate. After integration by substitution, Monte Carlo sampling on this parameter-free variate has a small variance. Accordingly, we apply the reparameterization trick of [40] for the discrete Z and the reparameterization trick of [39] for the continuous S as follows,

where the temperature controls the discreteness of the samples. The Gumbel samples are drawn as -log(-log u) with u ∼ Uniform(0,1); the Gumbel and Gaussian samples are the parameter-free variates, while the Bernoulli probabilities and the Gaussian mean and variance are the parameter-related parts. With the above reparameterization tricks, we have the following low-variance sampling estimation,


where the sample number is the number of samples of Z and S drawn for each image. Based on Eq. (5), the first term on the RHS of Eq. (4) can be estimated efficiently, even when the sample number is set to 1 in training.
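The two reparameterizations can be sketched as follows, under stated assumptions: the discrete label uses the binary form of the Gumbel-softmax relaxation from [40], with Gumbel noise generated as -log(-log u), and the continuous quality variable uses the Gaussian form from [39], mean plus standard deviation times standard normal noise. Function names, the clipping constant, the log-variance parameterization and the example values are hypothetical.

```python
import numpy as np

def sample_gumbel(rng, shape, eps=1e-20):
    """Parameter-free Gumbel(0, 1) noise via -log(-log(u)), u ~ Uniform(0, 1)."""
    u = rng.uniform(size=shape)
    return -np.log(-np.log(u + eps) + eps)

def relaxed_bernoulli_sample(rng, pi, tau):
    """Binary Gumbel-softmax: a differentiable surrogate for z ~ Bernoulli(pi)."""
    pi = np.clip(pi, 1e-6, 1.0 - 1e-6)  # avoid log(0)
    logit_on = (np.log(pi) + sample_gumbel(rng, pi.shape)) / tau
    logit_off = (np.log(1.0 - pi) + sample_gumbel(rng, pi.shape)) / tau
    # Two-way softmax over {on, off}, written as a sigmoid of the logit difference.
    return 1.0 / (1.0 + np.exp(logit_off - logit_on))

def gaussian_sample(rng, mu, log_var):
    """s = mu + sigma * eps with parameter-free noise eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
pi = np.array([0.9, 0.1, 0.5])  # Bernoulli probabilities from the encoder
mu = np.array([0.5, -1.0])      # Gaussian mean from the encoder
log_var = np.zeros(2)           # Gaussian log-variance from the encoder
z = relaxed_bernoulli_sample(rng, pi, tau=0.5)
s = gaussian_sample(rng, mu, log_var)
```

In both functions the randomness enters only through noise that does not depend on the parameters, so gradients with respect to `pi`, `mu` and `log_var` can flow through a single sample; lowering `tau` pushes `z` closer to 0 or 1.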

III-D2 Stochastic variational gradient

The remaining terms on the RHS of Eq. (4) can be computed explicitly. We present their derivation in the appendix. Putting Eq. (5) and (A) (in the appendix) back into Eq. (4), the objective is differentiable with respect to the parameters of all distributions. We can therefore learn the parameters of each distribution with an SGD algorithm, even though they are all modeled with deep neural networks. This is important for deep learning, especially on large datasets. Given the parameters of the classifier network P(Z|X), the noise network P(Y|Z,S), the annotation quality network q(S|X,Y) and the latent label network q(Z|X,Y), their gradients can be computed with the following equations by the chain rule.


where abbreviated notation is used for the sake of space. Note that, although we have the above gradients for CAN, two undesirable problems exist in the optimization: (1) It is not easy to precisely decouple the information from back-propagation for Z and S respectively, i.e., to squeeze out the clean label information for Z and leave the quality-related information to S; (2) The label order of Z and Y may be inconsistent in the optimization. For example, the category in the first dimension of Z can correspond to the category in the second dimension of Y. To avoid these two problems, we asymmetrically inject auxiliary information into the optimization procedure in an annealing way, that is, we substitute the corresponding gradient with the following Eq. (7).


where the auxiliary term is the gradient of a cross-entropy loss, weighted by a time-varying term. In this equation, the update is initially decided by the cross-entropy gradient and then progressively anneals to the original gradient as training proceeds. This guarantees the decoupling in back-propagation by imposing an asymmetric constraint on Z, and makes the label order of Z and Y consistent in the optimization.
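The annealing behaviour described above can be sketched as a weighted mix of two gradients. The source only states the limiting behaviour (dominated by the cross-entropy gradient at first, then annealing to the original gradient), so the convex combination, the exponential schedule and the names `annealing_weight`, `annealed_gradient` and `decay_rate` are assumptions made for illustration.

```python
import math

def annealing_weight(step, decay_rate=1e-3):
    """Hypothetical time-varying coefficient: 1 at step 0, decaying towards 0."""
    return math.exp(-decay_rate * step)

def annealed_gradient(grad_objective, grad_cross_entropy, step, decay_rate=1e-3):
    """Mix the auxiliary cross-entropy gradient with the original objective gradient."""
    w = annealing_weight(step, decay_rate)
    return [
        w * g_ce + (1.0 - w) * g_obj
        for g_obj, g_ce in zip(grad_objective, grad_cross_entropy)
    ]

# Toy gradients for two parameters.
g_obj = [1.0, 2.0]
g_ce = [10.0, 20.0]
early = annealed_gradient(g_obj, g_ce, step=0)      # dominated by cross-entropy
late = annealed_gradient(g_obj, g_ce, step=10**6)   # dominated by the objective
```

Early in training the auxiliary supervision fixes which dimension of the latent label corresponds to which category; as the weight decays, the update hands control back to the gradient of the variational objective.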

Fig. 4: Difference between the conventional auto-encoder and our model. Solid lines and dashed lines represent the generative (decoding) and inference (encoding) procedures respectively. (a) A conventional auto-encoder is symmetric: observed knowledge is encoded to latent variables and decoded symmetrically. (b) Our model uses an auxiliary variable in this encoding-decoding procedure and meanwhile learns a discriminative part (X to Z).

Fig. 5: Left: the instance number of each category in WEB dataset. Right: the instance number of each category in AMT dataset.
Model aer bik brd boa btl bus car cat cha cow tbl dog hrs mbk prs plt shp sfa trn tv mAP
Resnet-N 93.5 85.3 90.1 85.1 51.2 82.3 84.8 91.2 59.3 87.1 72.1 88.7 91.3 88.9 76.1 54.4 87.6 70.0 90.4 61.4 79.5
LearnQ 92.8 86.1 91.0 87.8 50.2 84.9 85.1 90.9 59.2 88.3 71.1 90.1 91.2 88.1 78.3 56.6 89.1 73.1 90.7 64.3 80.4
ICNM 92.5 86.2 90.5 87.9 47.7 84.0 84.8 90.6 59.8 88.3 72.7 89.8 91.5 87.2 77.0 57.0 88.9 71.5 91.2 65.7 80.3
Bootstrap 94.0 88.4 90.3 88.2 51.7 83.8 86.5 91.0 65.4 88.0 77.4 90.4 91.8 90.8 79.8 55.2 92.8 75.2 90.8 66.4 81.9
CAN 95.5 87.0 91.4 89.9 60.1 85.5 87.6 92.0 67.2 90.1 77.7 91.8 93.3 90.6 82.1 56.0 93.6 80.7 94.5 70.6 83.8
TABLE II: Classification Results on V07TE
Model aer bik brd boa btl bus car cat cha cow tbl dog hrs mbk prs plt shp sfa trn tv mAP
Resnet-N 98.4 81.1 92.9 88.7 57.0 87.4 73.2 96.6 63.3 90.0 63.9 94.3 95.0 92.9 76.8 43.8 92.9 67.2 93.1 65.1 80.7
LearnQ 98.4 83.8 93.8 88.5 53.5 87.8 73.7 96.5 64.3 90.6 62.6 94.6 96.1 91.6 78.4 46.8 92.8 69.0 94.0 65.4 81.1
ICNM 98.1 82.9 93.6 88.9 53.4 87.7 72.3 96.2 64.7 91.2 66.3 94.2 96.2 91.4 78.0 44.0 93.5 69.3 94.4 66.9 81.2
Bootstrap 98.6 84.1 93.6 90.9 56.3 89.8 75.5 96.3 69.8 91.6 69.9 94.4 95.8 93.2 82.2 43.2 92.8 70.9 95.4 67.4 82.6
CAN 98.8 84.1 95.3 93.2 62.1 90.8 77.0 97.9 72.6 94.4 73.5 96.1 97.7 94.3 82.4 45.5 95.8 71.4 95.8 68.6 84.4
TABLE III: Classification Results on V12TE

The optimization procedure can be interpreted as a probabilistic auto-encoder [39]. However, our model differs from the traditional auto-encoder, as illustrated in Fig. 4. A conventional auto-encoder is symmetric, that is, observed knowledge is encoded into latent variables and decoded back to itself; for instance, in Fig. 4(a), the observation is encoded to the latent variables, which are then used to decode the observation. It is usually used in generative models and their applications such as image generation [41, 42]. In Fig. 4(b), our model uses an auxiliary variable in the encoding-decoding procedure, that is, X and Y are used to encode Z and S, and then Z and S are only used to decode Y. Simultaneously, a discriminative model is involved and jointly optimized with our auto-encoder.

IV Experiments

In this section, we conduct quantitative and qualitative experiments to show the superiority of CAN in classification. Specifically, we compare CAN with state-of-the-art methods and investigate its performance under varying training sizes, hyperparameter settings and artificial noise. To give deeper insight into how CAN works, we analyze the quality embedding, latent label estimation and noise transition in the network.

IV-A Datasets

We use five image datasets in total in the experiments.

WEB This dataset is a subset of YFCC100M [43] collected from a social image-sharing website. It is formed by randomly selecting images from YFCC100M that belong to the 20 categories of PASCAL VOC [44]. The statistics of this dataset are shown in the left panel of Fig. 5. There are 97,836 samples in total, and the sample number in each category ranges from 4k to 8k. Most images in this dataset belong to one class, and about 10k images have two or more. Labels in this dataset may contain annotation errors.

AMT This dataset is collected by Zhou et al. [18] from the Amazon Mechanical Turk platform. They submit images of 4 dog breeds from the Stanford Dog dataset [45] to Turkers and acquire their annotations. To ease the classification, Zhou et al. also provide a 5376-dimensional feature for each image. The statistics of this dataset are illustrated in the right panel of Fig. 5. There are 7,354 samples in total, and the sample number in each category is between 1k and 2k. All images in this dataset belong to one class. Labels in this dataset may contain annotation errors.

V07 This dataset is provided for the 20-category classification task in the PASCAL VOC Challenge 2007 [44]. It consists of two subsets: training (V07TR) and test (V07TE). There are 5,011 samples in V07TR and 4,952 samples in V07TE. All labels in this dataset are clean.

V12 This dataset is provided for the 20-category classification task in the PASCAL VOC Challenge 2012 [46]. It consists of two subsets: training (V12TR) and test (V12TE). There are 11,540 samples in V12TR and 10,991 samples in V12TE. All labels in this dataset are clean.

SD4 This last dataset consists of 4 categories of dogs (the same as in [18]) from the Stanford Dog dataset [45]. It is a fine-grained categorization dataset with 837 samples in total. We randomly partition the samples into training (SD4TR) and test (SD4TE) subsets. All labels in this dataset are clean.

IV-B Experimental Setup

For the WEB, V07 and V12 datasets, a 34-layer residual network [4] is adopted as the convolutional network in CAN, and the same configuration is applied to all baselines for fairness. In the training phase, we first resize the short side of each image to 224 and then follow the transformations used for the residual network to preprocess images. In the test phase, we average the results over six crops of each image as the final prediction. For the AMT and SD4 datasets, we directly use the features provided by [18]. Hence, a 3-layer perceptron network (5376→1024, ReLU, 1024→30, ReLU, 30→4) is adopted in place of the convolutional network in CAN. Both the temperature in the Gumbel-softmax function and the annealing coefficient in Eq. (7) vary over training according to the same formula. The sample number in the sampler is set to 1 following [39]. The regularizer coefficient is empirically set to 0.3. The batch size is set to 50, and the learning rate starts from 0.01 and is divided by 10 every 30 epochs. All experiments run for 90 epochs. For the evaluation metric, we adopt Average Precision (AP) and mean Average Precision (mAP) as in [44, 46].
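The learning-rate schedule stated above (start at 0.01, divide by 10 every 30 epochs, 90 epochs in total) can be written as a small step function. The function name, the 0-based epoch convention and the dictionary of settings are assumptions for illustration; only the numeric values come from the text.

```python
def learning_rate(epoch, base_lr=0.01, drop_every=30, factor=10.0):
    """Step schedule: divide the base rate by `factor` every `drop_every` epochs.

    `epoch` is assumed to be 0-based.
    """
    return base_lr / (factor ** (epoch // drop_every))

# Settings quoted in the experimental setup, gathered in one place.
setup = {
    "batch_size": 50,
    "epochs": 90,
    "regularizer_coefficient": 0.3,
    "sample_number": 1,
}

# Learning rate at the first epoch of each 30-epoch stage.
stage_rates = [learning_rate(e) for e in range(0, setup["epochs"], 30)]
```

With these values the 90 epochs split into three stages with rates of roughly 0.01, 0.001 and 0.0001.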

Fig. 6: Classification results with different training sizes. We sample the subsets of WEB and AMT with five different ratios for training, and evaluate all models on V07TE, V12TE and SD4TE datasets.

In the following sections on “model comparison”, “impact of training size” and “hyperparameter sensitivity”, we train all models on the WEB and AMT datasets and test them on the V07TE, V12TE and SD4TE datasets. Note that models trained on the WEB dataset are evaluated on both V07TE and V12TE, since they share the same categories, while models trained on the AMT dataset are only evaluated on SD4TE. For the “artificial noise” section, we first add a controlled amount of noise to the V07TR, V12TR and SD4TR datasets and then train all models on them. Finally, we test them on the V07TE, V12TE and SD4TE datasets.

IV-C Classification Results

IV-C1 Training with real-world noisy datasets

To demonstrate the effectiveness of the proposed method in classification, we compare CAN with three state-of-the-art approaches: LearnQ [14], ICNM [22] and Bootstrap [24]. In addition, two baselines, Resnet-N and MLP-N, are included, which directly train the 34-layer residual network and the 3-layer perceptron network on the WEB dataset and the AMT dataset, respectively. The classification performance for each category on the V07TE, V12TE and SD4TE datasets is reported in Tables II, III and IV.

From the results in Tables II and III, we find that CAN outperforms all baselines in terms of mAP and shows improvement in almost all categories. For example, on the V07TE dataset, CAN achieves 83.8 mAP, which outperforms Resnet-N by 4.3 mAP and the best baseline, Bootstrap, by 1.9 mAP. In challenging categories such as “bottle”, “chair” and “sofa”, it also achieves significant improvement. In contrast, although the results of LearnQ, ICNM and Bootstrap are better than those of Resnet-N, their improvement is still limited. Similarly, in Table IV, CAN outperforms the baselines by at least 2.8 mAP, while LearnQ, ICNM and Bootstrap only improve by about 1.6 mAP over MLP-N.

Model nft nwt iwh swh mAP
MLP-N 78.1 73.2 80.9 76.5 77.2
LearnQ 80.5 73.7 83.0 77.7 78.7
ICNM 80.5 72.8 83.9 78.3 78.9
Bootstrap 80.7 72.5 83.7 78.1 78.8
CAN 82.0 79.0 81.8 83.8 81.7
TABLE IV: Classification Results on SD4TE
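As a quick consistency check on Table IV, the mAP column can be recomputed as the plain mean of the four per-category AP values. The sketch below copies the numbers from the table; the helper name `mean_ap` is illustrative only.

```python
# Per-category AP values and reported mAP copied from Table IV
# (columns: nft, nwt, iwh, swh).
table_iv = {
    "MLP-N":     ([78.1, 73.2, 80.9, 76.5], 77.2),
    "LearnQ":    ([80.5, 73.7, 83.0, 77.7], 78.7),
    "ICNM":      ([80.5, 72.8, 83.9, 78.3], 78.9),
    "Bootstrap": ([80.7, 72.5, 83.7, 78.1], 78.8),
    "CAN":       ([82.0, 79.0, 81.8, 83.8], 81.7),
}


def mean_ap(ap_values):
    """mAP as the unweighted mean of per-category AP."""
    return sum(ap_values) / len(ap_values)


# Recompute mAP for every model and compare with the reported column.
recomputed = {model: mean_ap(aps) for model, (aps, _) in table_iv.items()}
gaps = {model: abs(recomputed[model] - table_iv[model][1]) for model in table_iv}

# Margin of CAN over the best competing method, using the reported mAP.
best_baseline = max(rep for model, (_, rep) in table_iv.items() if model != "CAN")
margin = table_iv["CAN"][1] - best_baseline

print(recomputed)
print("CAN margin over best baseline:", round(margin, 1))
```

The recomputed means agree with the reported column up to rounding to one decimal place, and the margin matches the 2.8 mAP quoted in the text.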

Based on the above experiments, we have the following interpretations. (1) LearnQ and ICNM, which only introduce the latent label to handle label noise, cannot sufficiently prevent noise from degrading the classifier. (2) Bootstrap shares with CAN the idea of estimating a posterior label for training. However, its loss function uses a linear combination of predictions and noisy labels, which still cannot prevent errors caused by label noise from being back-propagated. (3) Our approach, which on the one hand models the trustworthiness of noisy labels to reduce the noise effect, and on the other hand estimates the latent label from the posterior perspective to train the classifier, shows better classification performance.
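To make interpretation (2) concrete, the following is a minimal sketch of a loss whose target is a linear combination of the noisy label and the model's own prediction, in the spirit of the soft bootstrapping objective of [24]. The function name, the single-label one-hot setting and the value of `beta` are illustrative assumptions, not the configuration used in the experiments.

```python
import numpy as np


def soft_bootstrap_loss(probs, noisy_onehot, beta=0.95):
    """Cross-entropy against a mix of noisy labels and predictions.

    probs:        (n, k) predicted class probabilities.
    noisy_onehot: (n, k) one-hot noisy labels.
    beta:         weight on the noisy label (assumed value).
    """
    probs = np.asarray(probs, dtype=float)
    noisy_onehot = np.asarray(noisy_onehot, dtype=float)
    # The training target is a fixed linear combination, so a wrong
    # noisy label still contributes with weight beta.
    target = beta * noisy_onehot + (1.0 - beta) * probs
    per_sample = -np.sum(target * np.log(probs + 1e-12), axis=1)
    return float(per_sample.mean())


example_probs = np.array([[0.7, 0.3]])
example_label = np.array([[1.0, 0.0]])
print(soft_bootstrap_loss(example_probs, example_label, beta=1.0))
```

With `beta=1.0` the expression reduces to the ordinary cross-entropy on the noisy label, which illustrates why a fixed mixing weight cannot fully block the gradient contributed by an incorrect label.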

Regularizer coefficient   0     0.2   0.5   1     2     5     10
V07TE                     82.9  83.5  84.8  83.6  80.7  78.8  77.0
V12TE                     84.3  85.2  84.1  83.0  80.8  78.3  76.6
SD4TE                     78.6  80.7  80.4  79.9  76.4  73.9  71.3
TABLE V: Classification results (mAP) with different regularizer coefficients in CAN.

Fig. 7: Quality embedding visualization of two categories in WEB dataset and two categories in AMT dataset. Better distinguishability of clusters indicates better identifiability of mismatches between latent labels and noisy labels. Blue: trustworthy embedding, Green: non-trustworthy embedding.

P   | V07TE: Resnet-N LearnQ ICNM Bootstrap CAN | V12TE: Resnet-N LearnQ ICNM Bootstrap CAN | SD4TE: MLP-N LearnQ ICNM Bootstrap CAN
1.0 | 6.4 9.1 9.2 8.9 8.6 | 5.2 8.4 8.4 8.2 10.5 | 29.6 26.9 27.0 27.8 30.1
0.8 | 33.4 28.0 28.5 30.1 36.1 | 26.6 23.7 23.8 25.1 28.0 | 41.6 39.6 39.7 38.6 49.7
0.6 | 53.0 56.4 57.0 59.3 63.2 | 49.2 49.7 49.6 51.8 55.3 | 51.5 60.4 60.8 58.7 63.9
0.4 | 70.2 72.0 71.6 73.3 79.4 | 69.0 70.3 70.5 72.6 78.4 | 73.4 72.7 73.1 73.5 77.1
0.2 | 78.2 80.1 79.6 81.0 83.6 | 80.0 81.3 81.4 82.2 84.5 | 86.1 89.0 89.2 89.3 91.1
0.0 | 86.8 85.4 85.4 85.5 85.3 | 89.7 88.3 88.3 88.5 87.3 | 96.4 95.9 95.8 96.2 94.3
TABLE VI: Model performance (mAP) with quantitative noise, for each corruption probability P.

IV-C2 Impact of training size

To explore the reliability of the proposed method when the training size changes, we compare CAN with other methods on different scales of datasets. We randomly sample different ratios of subsets in WEB and AMT datasets for training, and illustrate results of all the methods on V07TE, V12TE and SD4TE in Fig. 6.

From Fig. 6, the results of all methods on these datasets decline as the training size decreases. However, CAN consistently performs better than the other models. For instance, in the left panel of Fig. 6, when the training subset is 20% of the data, CAN achieves 81.0 mAP on the V07TE dataset, while ICNM and LearnQ are even worse than the simplest model, Resnet-N (79.4 mAP). Similar trends can be found in the middle and right panels. These results demonstrate the reliability of CAN on datasets of different scales.

In Fig. 6, we also find that the decline on the SD4TE dataset is more pronounced than that on the V07TE and V12TE datasets. This is because, even with the 20% subset, there are still about 20k training samples in the WEB dataset, whereas only about 1.6k samples remain in the AMT dataset, which may not carry enough information to learn the classifier.

IV-C3 Hyperparameter sensitivity

To investigate the reliability of CAN under different regularizer coefficients, we set the coefficient to 0, 0.2, 0.5, 1, 2, 5 and 10 in turn to validate its effect. The results are shown in Table V. From this table, we find that the performance on all datasets first rises to a peak and then gradually decreases as the coefficient increases. For example, CAN achieves 85.2 mAP on the V12TE dataset when the coefficient is 0.2, but drops to 76.6 mAP when it is 10. This indicates that (1) a regularizer of proper strength encourages our model to find a good solution; (2) too strong a regularization may drive the solution away from the optimum. Empirically, setting the coefficient between 0 and 1 makes the variational mutual information regularizer collaborate well with the KL-divergences.

IV-C4 Controlled experiments with artificial noise

In the previous sections, all models are trained on WEB and AMT with their given noise, which does not reveal how the models behave at different noise levels. To show the superiority of CAN, we add a controlled amount of noise to the V07TR, V12TR and SD4TR datasets for training, and then compare the classification performance of all models on the V07TE, V12TE and SD4TE datasets. Noise is added by setting a corruption probability P, which decides at random whether the elements of each clean label vector are shuffled. We list the model performance under different settings of P in Table VI.
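The corruption procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: labels are stored as one row vector per sample, the shuffle is a uniform random permutation of the vector's elements, and the function name `corrupt_labels` and the `seed` argument are hypothetical.

```python
import numpy as np


def corrupt_labels(labels, p, seed=0):
    """With probability p, shuffle the elements of each label vector.

    labels: (n, k) array of clean label vectors (one row per sample).
    p:      corruption probability P in [0, 1].
    """
    rng = np.random.default_rng(seed)
    noisy = np.array(labels, copy=True)
    for i in range(noisy.shape[0]):
        # Draw once per sample to decide whether this row is corrupted.
        if rng.random() < p:
            noisy[i] = rng.permutation(noisy[i])
    return noisy


# Six clean one-hot label vectors over four classes.
clean = np.eye(4, dtype=int)[[0, 1, 2, 3, 0, 1]]
print(corrupt_labels(clean, p=0.6))
```

Note that a shuffled vector can coincide with the original by chance, so under this reading the fraction of labels that actually change is somewhat below P.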

As shown in Table VI, when the corruption probability P=1.0, the classification results of all models are close to random. As P decreases from 1.0 to 0, all models improve, since more clean samples become available for training. In particular, when P is set to 0.8, 0.6, 0.4 or 0.2, CAN robustly outperforms the other baselines. However, when the training data becomes purely clean, i.e., P=0, all noise-aware models are worse than Resnet-N and MLP-N. Table VI indicates that (1) the performance of all existing models is strongly related to the noise level in the datasets, and all noise-aware models perform badly under heavy noise; (2) when the training data is clean, noise-aware models may be worse than models that do not consider noise; (3) CAN shows advantages at different noise levels compared with existing methods.

Fig. 8: Exemplars on latent label estimation of WEB dataset (the first two rows) and AMT dataset (the third row). We forward the noisy label (black word in title) and the image into CAN and compute the latent label (red word in title).

IV-D Model Visualization

To give deeper insight into how CAN works, this section presents a qualitative analysis of quality embedding, latent label estimation and noise transition in CAN.

IV-D1 Quality embedding

The quality variable is estimated in the embedding space by the contrastive layer. To visualize this mechanism, we forward all the training samples into CAN to compute their quality embeddings. By comparing the consistency between the prior prediction (thresholded at 0.5) and the noisy label, we then binarize each embedding as a trustworthy embedding or a non-trustworthy embedding. Considering only the Gaussian mean of each quality variable together with the embedding type, a low-dimensional visualization of the quality embedding can be produced with the t-SNE package [47].
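The binarization step can be sketched per category as below. This is a minimal illustration: the function name `trust_flags` is hypothetical, and treating a probability exactly equal to 0.5 as a negative prediction is an assumption, since the text only states the threshold value.

```python
import numpy as np


def trust_flags(prior_probs, noisy_labels, threshold=0.5):
    """Mark an embedding as trustworthy when the thresholded prior
    prediction agrees with the noisy label for the given category.

    prior_probs:  (n,) prior prediction scores in [0, 1].
    noisy_labels: (n,) binary noisy labels (0 or 1).
    Returns a boolean array; True means trustworthy.
    """
    predicted = (np.asarray(prior_probs) > threshold).astype(int)
    return predicted == np.asarray(noisy_labels)


flags = trust_flags([0.9, 0.2, 0.7, 0.1], [1, 0, 0, 1])
print(flags)
```

The resulting flags would only be used to color the points; the low-dimensional coordinates themselves would come from running t-SNE on the Gaussian means of the quality variables.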

In Fig. 7, two exemplar categories, “aeroplane” and “bike”, from the WEB dataset and two exemplar categories, “Norfolk Terrier” and “Norwich Terrier”, from the AMT dataset are presented. As shown in Fig. 7, the embedding in each category exhibits two distinguishable clusters. This indicates that CAN can identify mismatches between latent labels and noisy labels, and selectively embed the quality variable into different subspaces based on the training samples. Thus the label noise can be effectively reduced with the aid of the quality variable.

Besides, we find that the embeddings for the first two categories are better separated than those for the last two categories in Fig. 7. This is because the categories in the WEB and AMT datasets differ notably in the number and diversity of training samples. For example, there are about 4,200 different images and annotations in “aeroplane”, while there are only about 200 different images and 1,300 annotations in “Norfolk Terrier”. Thus the embedding in the first two categories is uniformly distributed, whereas that in the last two categories is discretely cluttered.

Fig. 9: Transition patterns among labels conditioned on trustworthy embedding and non-trustworthy embedding on WEB dataset (the first two panels) and AMT dataset (the last two panels). Transition conditioned on trustworthy embedding requires the consistency between the latent label and the noisy label, and thus concentrates on the diagonal. Transition conditioned on non-trustworthy embedding identifies the mismatch between the latent label and the noisy label, and thus diffuses from the diagonal.

IV-D2 Latent label estimation

The latent label is estimated from the posterior perspective by the additive layer. To visualize this estimation, we forward all the training samples into CAN to compute the output of the additive layer. In Fig. 8, we present 20 examples from the WEB dataset and 8 examples from the AMT dataset.

From Fig. 8, we observe that (1) the annotations in the WEB dataset may be totally unrelated to the image content, e.g., “bottle” for the first aeroplane image; (2) in AMT, the Turkers also assign wrong labels to the fine-grained images. The former error usually comes from the batch annotation function provided by the Flickr website. The latter error usually comes from the limited domain knowledge of the Turkers. Nevertheless, the estimation shows that our additive layer still successfully rectifies the wrong labels. By training on these latent labels, CAN thus achieves better performance than the other baselines.

IV-D3 Noise transition

To explore how the quality embedding mediates the mismatch between latent labels and noisy labels, we investigate the transition patterns between them. First, we forward all the training samples into CAN to compute quality embeddings and latent labels. Second, we use K-means to binarize the quality embeddings (considering only the Gaussian mean) into trustworthy and non-trustworthy embeddings. Third, we count the transitions from latent labels to noisy labels conditioned on the two types of embeddings. In Fig. 9, we plot the two transition patterns as heatmaps for the WEB dataset and the AMT dataset, respectively.
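The three steps above can be sketched with a plain two-cluster K-means and a count matrix. Several choices here are assumptions made for illustration: the single-label setting, the deterministic initialization, the function names, and in particular the rule that the cluster with the larger share of diagonal (latent equals noisy) transitions is called trustworthy, since K-means itself returns unlabeled clusters.

```python
import numpy as np


def two_means(points, iters=20):
    """Plain 2-cluster K-means; returns a 0/1 cluster id per point."""
    points = np.asarray(points, dtype=float)
    # Deterministic init: the first point and the point farthest from it.
    first = points[0]
    far = points[np.argmax(np.linalg.norm(points - first, axis=1))]
    centers = np.stack([first, far])
    assign = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        for k in (0, 1):
            if np.any(assign == k):
                centers[k] = points[assign == k].mean(axis=0)
    return assign


def transition_counts(latent, noisy, mask, num_classes):
    """counts[z, y] = number of masked samples with latent z and noisy y."""
    counts = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(counts, (latent[mask], noisy[mask]), 1)
    return counts


def conditional_transitions(embed_means, latent, noisy, num_classes):
    """Return (trustworthy, non-trustworthy) transition count matrices."""
    latent = np.asarray(latent, dtype=int)
    noisy = np.asarray(noisy, dtype=int)
    cluster = two_means(embed_means)
    mats = [transition_counts(latent, noisy, cluster == k, num_classes)
            for k in (0, 1)]
    # Assumption: the cluster whose transitions sit mostly on the
    # diagonal is the trustworthy one.
    diag_share = [np.trace(m) / max(m.sum(), 1) for m in mats]
    trusted = int(np.argmax(diag_share))
    return mats[trusted], mats[1 - trusted]
```

Each returned matrix can then be row-normalized and drawn as a heatmap to obtain panels of the kind shown in Fig. 9.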

As shown for the WEB dataset in Fig. 9, the diagonal of the transition pattern conditioned on trustworthy embedding is dominant. In this case, noisy labels are considered reliable, and thus transitions mainly happen between identical labels. In contrast, the transition pattern conditioned on non-trustworthy embedding is diffuse, because in this case noisy labels are considered incorrect and transitions usually happen between different labels. The transition patterns on the AMT dataset in Fig. 9 have the same characteristics. Fig. 9 indicates that CAN relies on the quality embedding to automatically perturb the latent label so that it matches the noisy label.

The transition pattern conditioned on non-trustworthy embedding usually reflects the real-world noise, and some interesting patterns can be found. For instance, according to the second panel of Fig. 9, the “plt” class has few transitions to other classes, while the transition between “prs” and “tv” has a high value. This means that (1) people who upload “pottedplant” images to social websites rarely annotate them wrongly; (2) for “tv” images, some people focus on the persons in the TV program, while others pay attention to the TV itself. Similarly, in the fourth panel of Fig. 9, transitions on AMT usually occur between similar-looking dogs, i.e., between “Norfolk Terrier” and “Norwich Terrier” and between “Irish Wolfhound” and “Scottish Wolfhound”. This reflects that the two breeds within each of these pairs are harder to tell apart than breeds from different pairs.

V Conclusion

In this paper, we present a quality embedding model to learn the classifier from noisy image labels, which effectively avoids back-propagating errors from label noise. To instantiate the model, a Contrastive-Additive Noise network is designed. Regarding parameter estimation, we deduce an efficient SGD optimization algorithm by applying recent discrete and continuous reparameterization tricks. We demonstrate that our model outperforms other noise-aware deep learning methods on several noisy training datasets. In addition, detailed visualizations of three key parts are presented to give deeper insight into our model. However, we only validate our model on image data in this paper; other types of content remain to be explored.

Appendix A Computation for KL-divergences and regularizers

The remaining four terms on the RHS of Eq. (4) can be calculated without sampling. For the latent label, for example, both distributions involved are multinomial probabilities of the same dimension, so the KL-divergence term and the regularizer can be simplified by enumerating each dimension. The quality variable lies in a Gaussian space whose parameters are implicitly modeled by a network. If we assume its prior follows [39], its KL-divergence and regularizer are easy to compute due to conjugacy. Eq. (A) gives the simplifications for both variables.
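As a rough illustration of why these terms need no sampling, the sketch below evaluates the two standard closed-form KL-divergences involved. It is not a reconstruction of Eq. (A): the function names, the direction of the divergence, the diagonal covariance and the standard normal prior (the usual choice in [39]) are all assumptions, and the regularizer terms are not covered.

```python
import numpy as np


def categorical_kl(q, p, eps=1e-12):
    """KL(q || p) between two discrete distributions, by enumeration."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) in closed form."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0))


print(categorical_kl([1.0, 0.0], [0.5, 0.5]))        # equals log 2
print(gaussian_kl_to_standard_normal([1.0], [0.0]))  # equals 0.5
```

Both functions are deterministic in the distribution parameters, so their gradients can be computed directly without the reparameterization needed for the sampled terms.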




  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds.   Curran Associates, Inc., 2012, pp. 1097–1105.
  • [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014.
  • [3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [5] C. Wang, W. Ren, K. Huang, and T. Tan, “Weakly supervised object localization with latent category learning,” in European Conference on Computer Vision.   Springer, 2014, pp. 431–445.
  • [6] H. Bilen and A. Vedaldi, “Weakly supervised deep detection networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2846–2854.
  • [7] L. Wang, G. Hua, J. Xue, Z. Gao, and N. Zheng, “Joint segmentation and recognition of categorized objects from noisy web image collection,” IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 4070–4086, 2014.
  • [8] W. Zhang, S. Zeng, D. Wang, and X. Xue, “Weakly supervised semantic segmentation for social images,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [9] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele, “Simple does it: Weakly supervised instance and semantic segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [10] Z. Lu, Z. Fu, T. Xiang, P. Han, L. Wang, and X. Gao, “Learning from weak and noisy labels for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, pp. 486–500, 2017.
  • [11] S. K. Divvala, A. Farhadi, and C. Guestrin, “Learning everything about anything: Webly-supervised visual concept learning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
  • [12] X. Chen and A. Gupta, “Webly supervised learning of convolutional networks,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1431–1439.
  • [13] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017.
  • [14] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus, “Training convolutional networks with noisy labels,” Computer Science, 2015.
  • [15] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy, “Learning from crowds,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1297–1322, 2010.
  • [16] N. Natarajan, I. S. Dhillon, P. Ravikumar, and A. Tewari, “Learning with noisy labels,” Advances in Neural Information Processing Systems, vol. 26, pp. 1196–1204, 2013.
  • [17] T. Liu and D. Tao, “Classification with noisy labels by importance reweighting,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 38, no. 3, p. 447, 2014.
  • [18] D. Zhou, S. Basu, Y. Mao, and J. C. Platt, “Learning from the wisdom of crowds by minimax entropy,” in Advances in Neural Information Processing Systems, 2012, pp. 2195–2203.
  • [19] B. Frenay and M. Verleysen, “Classification in the presence of label noise: A survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 845–869, May 2014.
  • [20] V. Mnih and G. Hinton, “Learning to label aerial images from noisy data,” in International Conference on Machine Learning, 2012.
  • [21] S. Azadi, J. Feng, S. Jegelka, and T. Darrell, “Auxiliary image regularization for deep cnns with noisy labels,” 2016.
  • [22] I. Misra, C. Lawrence Zitnick, M. Mitchell, and R. Girshick, “Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2930–2939.
  • [23] H. Izadinia, B. C. Russell, A. Farhadi, M. D. Hoffman, and A. Hertzmann, “Deep classifiers from image tags in the wild,” in The Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions, 2015, pp. 13–18.
  • [24] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” Computer Science, 2014.
  • [25] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu, “Making neural networks robust to label noise: a loss correction approach,” arXiv preprint arXiv:1609.03683, 2016.
  • [26] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from massive noisy labeled data for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2691–2699.
  • [27] I. Jindal, M. Nokleby, and X. Chen, “Learning deep networks from noisy labels with dropout regularization,” in Data Mining (ICDM), 2016 IEEE 16th International Conference on.   IEEE, 2016, pp. 967–972.
  • [28] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and J. Li, “Learning from noisy labels with distillation,” arXiv preprint arXiv:1703.02391, 2017.
  • [29] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie, “Learning from noisy large-scale datasets with minimal supervision,” Computer Vision and Pattern Recognition (CVPR), 2017.
  • [30] M. J. Wainwright, M. I. Jordan et al., “Graphical models, exponential families, and variational inference,” Foundations and Trends® in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.
  • [31] D. M. Blei, M. I. Jordan, and J. W. Paisley, “Variational bayesian inference with stochastic search,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12), J. Langford and J. Pineau, Eds.   New York, NY, USA: ACM, 2012, pp. 1367–1374.
  • [32] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, no. just-accepted, 2017.
  • [33] D. F. Nettleton, A. Orriols-Puig, and A. Fornells, “A study of the effect of different types of noise on the precision of supervised learning techniques,” Artificial Intelligence Review, vol. 33, no. 4, pp. 275–306, 2010.
  • [34] A. Joulin, L. V. D. Maaten, A. Jabri, and N. Vasilache, Learning Visual Features from Large Weakly Supervised Data.   Springer International Publishing, 2015.
  • [35] K. Ganchev, J. Gillenwater, B. Taskar et al., “Posterior regularization for structured latent variable models,” Journal of Machine Learning Research, vol. 11, no. Jul, pp. 2001–2049, 2010.
  • [36] A. Krause, P. Perona, and R. G. Gomes, “Discriminative clustering by regularized information maximization,” in Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds.   Curran Associates, Inc., 2010, pp. 775–783.
  • [37] J. Zhu, N. Chen, and E. P. Xing, “Bayesian inference with posterior regularization and applications to infinite latent svms,” Journal of Machine Learning Research, vol. 15, p. 1799, 2014.
  • [38] X. Chen, X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds.   Curran Associates, Inc., 2016, pp. 2172–2180.
  • [39] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” stat, vol. 1050, p. 10, 2014.
  • [40] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” arXiv preprint arXiv:1611.01144, 2016.
  • [41] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-vae: Learning basic visual concepts with a constrained variational framework,” 2016.
  • [42] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with pixelcnn decoders,” in Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.
  • [43] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. J. Li, “The new data and new challenges in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2015.
  • [44] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge 2007 results,” 2007.
  • [45] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, “Novel dataset for fine-grained image categorization,” in First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
  • [46] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge 2012 results,” 2012.
  • [47] L. Van Der Maaten, “Accelerating t-sne using tree-based algorithms,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 3221–3245, 2014.