Loss factorization, weakly supervised learning and label noise robustness

02/08/2016 ∙ by Giorgio Patrini, et al. ∙ CSIRO ∙ Max Planck Society ∙ Australian National University

We prove that the empirical risk of most well-known loss functions factors into a linear term aggregating all labels with a term that is label free, and can further be expressed by sums of the loss. This holds true even for non-smooth, non-convex losses and in any RKHS. The first term is a (kernel) mean operator --the focal quantity of this work-- which we characterize as the sufficient statistic for the labels. The result tightens known generalization bounds and sheds new light on their interpretation. Factorization has a direct application on weakly supervised learning. In particular, we demonstrate that algorithms like SGD and proximal methods can be adapted with minimal effort to handle weak supervision, once the mean operator has been estimated. We apply this idea to learning with asymmetric noisy labels, connecting and extending prior work. Furthermore, we show that most losses enjoy a data-dependent (by the mean operator) form of noise robustness, in contrast with known negative results.




1 Introduction

Supervised learning is by far the most effective application of the machine learning paradigm. However, there is a growing need to decouple the success of the field from its topmost framework, which is often unrealistic in practice. In fact, while the amount of available data grows continuously, the corresponding training labels –often derived by human effort– become rare, and hence learning is performed with partially missing, aggregate-level and/or noisy labels. For this reason, weakly supervised learning (wsl) has attracted much research. In this work, we focus on binary classification under weak supervision. Traditionally, wsl problems are attacked by designing ad-hoc loss functions and optimization algorithms tied to the particular learning setting. Instead, we advocate not reinventing the wheel and present a unifying treatment. In summary, we show that, under a mild decomposability assumption,

Any loss admitting a minimizing algorithm over fully labelled data can also be minimized in the wsl setting, with provable generalization and noise robustness guarantees. Our proof is constructive: we show that a simple change in the input and in one line of code is sufficient.

Figure 1: Loss factorization into even and odd parts, $\ell = \ell_e + \ell_o$. Panels: (a) logistic loss, (b) square loss.

1.1 Contribution and related work

We introduce linear-odd losses (lols), a definition that does not demand smoothness or convexity, but only that the loss $\ell$ is such that its odd part $\ell_o(x) = (\ell(x) - \ell(-x))/2$ is linear. Many losses of practical interest are such, e.g. logistic and square. We prove a theorem reminiscent of Fisher-Neyman’s factorization (Lehmann and Casella, 1998) of the exponential family, which lays the foundation of this work: it shows how the empirical $\ell$-risk factors (Figure 1) into a label-free term and another term incorporating a sufficient statistic of the labels, the mean operator. The interplay of the two components is still apparent in newly derived generalization bounds, which also improve on known ones (Kakade et al., 2009). Aside from factorization, the above linearity is known to guarantee convexity of certain losses used in learning from positive and unlabeled data (pu) in the recent work of (du Plessis et al., 2015).

An advantage of isolating labels comes from applications to wsl. In this scenario, training labels are only partially observable due to an unknown noise process (García-García and Williamson, 2011; Hernandez-Gonzalez et al., 2016). For example, labels may be noisy (Natarajan et al., 2013), missing as with semi-supervision (Chapelle et al., 2006) and pu (du Plessis et al., 2015), or aggregated as happens in multiple instance learning (Dietterich et al., 1997) and learning from label proportions (llp) (Quadrianto et al., 2009). As the success of those areas shows, labels are not strictly needed for learning. However, most wsl methods implicitly assume that labels must be recovered in training, as pointed out by (Joulin and Bach, 2012). Instead, sufficiency supports a more principled two-step approach: (1) estimate the mean operator from weakly supervised data and (2) plug it into any lol and resort to known procedures for empirical risk minimization (erm). Thus, (1) becomes the only technical obstacle in adapting an algorithm, although one that is often easy to overcome. Indeed, this approach unifies a growing body of literature (Quadrianto et al., 2009; Patrini et al., 2014; van Rooyen et al., 2015; Gao et al., 2016). As a showcase, we implement (2) by adapting stochastic gradient descent (sgd) to wsl. The upgrade only requires transforming the input by a “double sample trick” and summing the mean operator in the model update.

Finally, we concentrate on learning with asymmetric label noise. We connect and extend previous work of (Natarajan et al., 2013; van Rooyen et al., 2015) by designing an unbiased mean operator estimator, for which we derive generalization bounds independent of the chosen lol and algorithm. Recent results (Long and Servedio, 2010) have shown that requiring the strongest form of robustness –on any possible noisy sample– rules out most commonly used losses, and have shifted research focus to non-convex (Stempfel and Ralaivola, 2009; Masnadi-Shirazi et al., 2010; Ding and Vishwanathan, 2010) or linear losses (van Rooyen et al., 2015). Our approach is more pragmatic, as we show that every lol enjoys an approximate form of noise robustness which converges, on a data-dependent basis, to the strongest one. The mean operator is still central here, being the data-dependent quantity that shapes the bound. The theory is validated with experimental analysis, for which we call the adapted sgd as a black box.

Elements of this work appeared in an early version (Patrini et al., 2015), mostly interested in elucidating the connection between loss factorization and $\epsilon$-label differential privacy (Chaudhuri and Hsu, 2011).

Next, Section 2 settles notation and background. Section 3 states the Factorization Theorem. Sections 4 and 5 focus on wsl and noisy labels. Section 6 discusses the paper. Proofs and additional results are given in the Appendix.

2 Learning setting and background

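We denote vectors in bold as $\boldsymbol{x}$ (bold is dropped inside formulas for lighter notation) and by $1\{p\}$ the indicator of $p$ being true. We define $[m] = \{1, \dots, m\}$ and $[x]_+ = \max(0, x)$. In binary classification, a learning sample $S = \{(x_i, y_i), i \in [m]\}$ is a sequence of (observation, label) pairs, the examples, drawn from an unknown distribution $D$ over $X \times Y$, with $X \subseteq \mathbb{R}^d$ and $Y = \{-1, 1\}$. Expectation (or average) over $D$ ($S$) is denoted as $\mathbb{E}_D$ ($\mathbb{E}_S$).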

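Given a hypothesis $h \in H$, $h: X \to \mathbb{R}$, a loss is any function $\ell: Y \times \mathbb{R} \to \mathbb{R}$. A loss gives a penalty $\ell(y, h(x))$ when predicting the value $h(x)$ and the observed label is $y$. We consider margin losses, i.e. $\ell(y, h(x)) = \ell(y h(x))$ (Reid and Williamson, 2010), which are implicitly symmetric: $\ell(y h(x)) = \ell(-y \cdot (-h(x)))$. For notational convenience, we will often use a generic scalar argument $\ell(x)$. Examples are 01 loss $1\{x < 0\}$, logistic loss $\log(1 + e^{-x})$, square loss $(1 - x)^2$ and hinge loss $\max(0, 1 - x)$.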

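The goal of binary classification is to select a hypothesis $h \in H$ that generalizes on $D$. That is, we aim to minimize the true risk on the 01 loss, $R_{D,01}(h) = \mathbb{E}_D[1\{y h(x) < 0\}]$. In practice, we only learn from a finite learning sample $S$ and thus minimize the empirical $\ell$-risk $R_{S,\ell}(h) = \mathbb{E}_S[\ell(y h(x))] = \frac{1}{m} \sum_{i \in [m]} \ell(y_i h(x_i))$, where $\ell$ is a tractable upper bound of the 01 loss.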

Finally, we discuss the meaning of wsl, and in particular of weakly supervised binary classification. The difference with the above is at training time: we learn on a sample $\tilde{S}$ drawn from a noisy distribution $\tilde{D}$ that may flip, aggregate or suppress labels, while observations are the same. Still, the purpose of learning is unchanged: to minimize the true risk. A rigorous definition will not be relevant in our study.

2.1 Background: exponential family and logistic loss

Some background on the exponential family is in order. We can learn a binary classifier by fitting a model in the conditional exponential family parametrized by $\theta$: $p_\theta(y | x) = \exp\big(\langle \theta, y x \rangle - \log \sum_{y' \in Y} \exp \langle \theta, y' x \rangle\big)$, with $y$ a random variable. The two terms in the exponent are the log-partition function and the sufficient statistic $y x$, which fully summarizes one example $(x, y)$. The Fisher-Neyman theorem (Lehmann and Casella, 1998, Theorem 6.5) gives a necessary and sufficient condition for sufficiency of a statistic $T(y)$: the probability distribution factors into two functions, such that $\theta$ interacts with $y$ only through $T$:

$$p_\theta(y | x) = g_\theta(T(y)) \, f(y).$$

In our case, it holds that $f(y) = 1$, $T(y) = y x$ and $g_\theta(T(y)) = p_\theta(y | x)$, since the value of $y$ is not needed to compute the log-partition function. This shows how $y x$ is indeed sufficient (for $y$). Now, under the i.i.d. assumption, the log-likelihood of $\theta$ is (the negative of)


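$$\sum_{i=1}^m \Big( \log \sum_{y \in Y} e^{y \langle \theta, x_i \rangle} - y_i \langle \theta, x_i \rangle \Big) = \sum_{i=1}^m \log \sum_{y \in Y} e^{y \langle \theta, x_i \rangle} - \Big\langle \theta, \sum_{i=1}^m y_i x_i \Big\rangle \quad (1)$$

$$= \sum_{i=1}^m \log\big(1 + e^{-2 y_i \langle \theta, x_i \rangle}\big). \quad (2)$$

Step (2) is true because $\log(e^{z} + e^{-z}) - y_i z = \log(1 + e^{-2 y_i z})$ for $y_i \in \{-1, 1\}$. At last, by re-parameterizing $\theta \to \theta / 2$ and normalizing by $m$, we obtain the empirical risk of logistic loss. Equation (1) shows how the loss splits into a linear term aggregating the labels and another, label-free term. We translate this property for classification with erm, by transferring the ideas of sufficiency and factorization to a wide set of losses including the ones of (Patrini et al., 2014).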

3 Loss factorization and sufficiency

The linear term just encountered in logistic loss integrates a well-studied statistical object. The (empirical) mean operator of a learning sample $S$ is $\mu_S = \mathbb{E}_S[y x] = \frac{1}{m} \sum_{i \in [m]} y_i x_i$. We drop the $S$ when clear from the context. The name mean operator, or mean map, is borrowed from the theory of Hilbert space embedding (Quadrianto et al., 2009). (We decide to keep the lighter notation of linear classifiers, but nothing should prevent the extension to non-parametric models, exchanging $x$ with an implicit feature map $\phi(x)$.) Its importance is due to the injectivity of the map –under conditions on the kernel– which is used in applications such as two-sample and independence tests, feature extraction and covariate shift (Smola et al., 2007). Here, $\mu$ will play the role of sufficient statistic for labels w.r.t. a set of losses. A function $T(S)$ is said to be a sufficient statistic for a variable $z$ w.r.t. a set of losses $L$ and a hypothesis space $H$ when for any $\ell \in L$, any $h \in H$ and any two samples $S$ and $S'$ the empirical $\ell$-risk is such that

$$R_{S,\ell}(h) - R_{S',\ell}(h) \text{ does not depend on } z \iff T(S) = T(S').$$

The definition is motivated by the one in Statistics, taking log-odd ratios (Patrini et al., 2014). We aim to establish sufficiency of mean operators for a large set of losses. The next theorem is our main contribution.

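Theorem 3 [Factorization] Let $H$ be the space of linear hypotheses. Assume that a loss $\ell$ is such that $\ell_o(x) = (\ell(x) - \ell(-x))/2$ is linear. Then, for any sample $S$ and hypothesis $h \in H$ the empirical $\ell$-risk can be written as

$$R_{S,\ell}(h) = \frac{1}{2} R_{S_{2x},\ell}(h) + \ell_o(h(\mu_S)),$$

where $S_{2x} = \{(x_i, \sigma), i \in [m], \sigma \in \{-1, 1\}\}$ and $R_{S_{2x},\ell}(h) = \frac{1}{m} \sum_{i \in [m]} \sum_{\sigma \in \{-1, 1\}} \ell(\sigma h(x_i))$.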


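Proof. We write $R_{S,\ell}(h) = \mathbb{E}_S[\ell(y h(x))]$ as

$$\frac{1}{2} \mathbb{E}_S\big[\ell(y h(x)) + \ell(-y h(x)) + \ell(y h(x)) - \ell(-y h(x))\big] = \frac{1}{2} \mathbb{E}_S\Big[\sum_{\sigma \in \{-1, 1\}} \ell(\sigma h(x))\Big] + \mathbb{E}_S[\ell_o(h(y x))]. \quad (3)$$

Step 3 is due to the definition of $\ell_o$ and linearity of $h$. The Theorem follows by linearity of $\ell_o$ and expectation. ∎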

Factorization splits the $\ell$-risk in two parts. A first term is the $\ell$-risk computed on the same loss on the “doubled sample” $S_{2x}$ that contains each observation twice, labelled with opposite signs, and hence it is label free. A second term is a loss $\ell_o$ of $h$ applied to the mean operator $\mu_S$, which aggregates all sample labels. Also observe that $\ell_o$ is by construction an odd function, i.e. symmetric w.r.t. the origin. We call the losses satisfying the Theorem linear-odd losses. A loss $\ell$ is $a$-linear-odd ($a$-lol) when $\ell_o(x) = (\ell(x) - \ell(-x))/2 = a x$, for any $a \in \mathbb{R}$. Notice how this does not exclude losses that are not smooth, convex, or proper (Reid and Williamson, 2010). The definition puts in a formal shape the intuition behind (du Plessis et al., 2015) for pu –although unrelated to factorization. From now on, we also consider $H$ as the space of linear hypotheses $h(x) = \langle \theta, x \rangle$. (Theorem 3 applies beyond lols and linear models; see Section 6.) As a consequence of Theorem 3, $\mu$ is sufficient for all labels.
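As a numerical sanity check of the factorization, the following sketch compares the standard empirical risk with its factored form (label-free average over the doubled sample plus a linear term in the mean operator) for logistic and square loss. The function names are hypothetical, and the slopes $a = -1/2$ (logistic) and $a = -2$ (square) are derived here from $\ell_o(x) = (\ell(x) - \ell(-x))/2$ rather than quoted from the paper's table.

```python
import numpy as np

def logistic(z):
    # log(1 + exp(-z)), computed stably
    return np.logaddexp(0.0, -z)

def square(z):
    return (1.0 - z) ** 2

def empirical_risk(loss, theta, X, y):
    # Standard labelled risk: average of loss(y_i * <theta, x_i>)
    return np.mean(loss(y * (X @ theta)))

def factored_risk(loss, a, theta, X, y):
    margins = X @ theta
    # Label-free part: average of the loss over the doubled sample
    label_free = 0.5 * np.mean(loss(margins) + loss(-margins))
    # Labels enter only through the mean operator mu = mean(y_i * x_i)
    mu = np.mean(y[:, None] * X, axis=0)
    return label_free + a * (theta @ mu)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.choice([-1.0, 1.0], size=50)
theta = rng.normal(size=3)

gap_logistic = abs(empirical_risk(logistic, theta, X, y)
                   - factored_risk(logistic, -0.5, theta, X, y))
gap_square = abs(empirical_risk(square, theta, X, y)
                 - factored_risk(square, -2.0, theta, X, y))
```

Both gaps vanish up to floating-point error, for any choice of `theta`, since the identity holds example by example.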

Table 1: Factorization of linear-odd losses: spl (including logistic, square and Matsushita) (Nock and Nielsen, 2009), double “2”-hinge and perceptron (du Plessis et al., 2015), unhinged (van Rooyen et al., 2015). For 01 loss see the text.

Corollary 3. The mean operator $\mu$ is a sufficient statistic for the label $y$ with regard to lols and $H$. (Proof in A.1) The corollary is at the core of the applications in the paper: the single vector $\mu$ summarizes all information concerning the linear relationship between $y$ and $x$ for losses that are lol (see also Section 6). Many known losses belong to this class; see Table 1. For logistic loss it holds that (Figure 1(a)):

$$R_{S,\ell}(h) = \frac{1}{m} \sum_{i \in [m]} \log\big(e^{h(x_i)/2} + e^{-h(x_i)/2}\big) - \frac{1}{2} h(\mu_S).$$

This “symmetrization” is known in the literature (Jaakkola and Jordan, 2000; Gao et al., 2016). Another case of lol is unhinged loss (van Rooyen et al., 2015) –while standard hinge loss does not factor in a linear term.

The Factorization Theorem 3 generalizes (Patrini et al., 2014, Lemma 1), which works for symmetric proper losses (spls), e.g. logistic, square and Matsushita losses. Given a permissible generator $\phi$ (Kearns and Mansour, 1996; Nock and Nielsen, 2009), i.e. $\mathrm{dom}(\phi) \supseteq [0, 1]$, $\phi$ is strongly convex, differentiable and symmetric with respect to $1/2$, spls are defined as $\ell(x) = a_\phi + \phi^\star(-x) / b_\phi$, where $\phi^\star$ is the convex conjugate of $\phi$. Then, since $\phi^\star(-x) = \phi^\star(x) - x$:

$$\ell_o(x) = \frac{\ell(x) - \ell(-x)}{2} = -\frac{x}{2 b_\phi}.$$

A similar result appears in (Masnadi-Shirazi, 2011, Theorem 11). A natural question is whether the classes spl and lol are equivalent. We answer in the negative.


Proposition. The exhaustive class of linear-odd losses is in 1-to-1 mapping with a proper subclass of even functions, i.e. $\ell(x) = \ell_e(x) + a x$, with $\ell_e$ any even function.

(Proof in A.2) Interestingly, the proposition also lets us engineer losses that always factor: choose any even function $\ell_e$ with desired properties –it need not be convex nor smooth. The loss is then $\ell(x) = \ell_e(x) + a x$, with $a$ to be chosen. For example, let , with . is a lol; furthermore, upperbounds 01 loss and intercepts it in . Non-convex can be constructed similarly. Yet, not all non-differentiable losses can be crafted this way since they are not lols. In Appendix B we provide necessary and sufficient conditions to bound losses of interest, including hinge and Huber loss, by lols.
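The recipe of the proposition can be sketched in a few lines: pick an even function, add a linear term, and check that the odd part of the result is exactly that linear term. The even function below (a clipped absolute value, neither convex nor smooth) and the helper names are arbitrary illustrative choices, not the example used in the paper.

```python
import numpy as np

def make_lol(even_fn, a):
    """Build loss(x) = even_fn(x) + a * x from an even function and a slope."""
    def loss(x):
        return even_fn(x) + a * x
    return loss

def odd_part(loss, x):
    # l_o(x) = (l(x) - l(-x)) / 2
    return 0.5 * (loss(x) - loss(-x))

def clipped_abs(x):
    # Even, non-convex and non-smooth: min(|x|, 1)
    return np.minimum(np.abs(x), 1.0)

loss = make_lol(clipped_abs, a=-0.5)

xs = np.linspace(-3.0, 3.0, 13)
odd_values = odd_part(loss, xs)   # should equal -0.5 * xs
```

Whatever the even function, its contribution cancels in the odd part, so the resulting loss is an $a$-lol by construction.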

From the optimization viewpoint, we may be interested in keeping properties of $\ell$ after factorization. The good news is that we are dealing with the same $\ell$ plus a linear term. Thus, if the property of interest is closed under summation with linear functions, then it will hold true. An example is convexity: if $\ell$ is lol and convex, so is the factored loss.

The next Theorem sheds new light on generalization bounds on Rademacher complexity with linear hypotheses. Assume is -lol and -Lipschitz. Suppose and . Let and . Then for any , with probability at least :

or more explicitly

(Proof in A.3) The term is derived by an improved upperbound to the Rademacher complexity of computed on (A.3, Lemma A.3); we call it in short the complexity term. The former expression displays the contribution of the non-linear part of the loss, keeping aside what is missing: a deviation of the empirical mean operator from its population mean. When is not known because of partial label knowledge, the choice of any estimator would affect the bound only through that norm discrepancy. The second expression highlights the interplay of the two loss components. is the only non-linear term, which may well be predominant in the bound for fast-growing losses, e.g. strongly convex. Moreover, we confirm that the linear-odd part does not change the complexity term and only affects the usual statistical penalty by a linear factor. A last important remark comes from comparing the bound with the one due to (Kakade et al., 2009, Corollary 4). Our factor in front of the complexity term is instead of , that is three times smaller. A similar statement may be derived for rkhs on top of (Bartlett and Mendelson, 2002; Altun and Smola, 2006).

4 Weakly supervised learning

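  Input: $S_{2x}$, $\mu$; $\lambda > 0$; $T > 0$
  For any $t = 1, \dots, T$:
    Pick $i_t \in \{1, \dots, 2m\}$ uniformly at random
    Pick any $v \in \partial \ell(y_{i_t} \langle \theta^t, x_{i_t} \rangle)$
Algorithm 1 μsgd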

In the next two Sections we discuss applications to wsl. Recall that in this scenario we learn on $\tilde{S}$ with partially observable labels, but aim to generalize to $D$. Assume we know an algorithm that can only learn on $S$. By sufficiency (Corollary 3), a principled two-step approach to use it on $\tilde{S}$ is: (1) estimate $\mu$ from $\tilde{S}$; (2) run the algorithm with the lol computed on the estimated $\mu$. This direction was explored by work on llp (Quadrianto et al., 2009, logistic loss) and (Patrini et al., 2014, spl) and in the setting of noisy labels (van Rooyen et al., 2015, unhinged loss) and (Gao et al., 2016, logistic loss). The approach contrasts with ad-hoc losses and optimization procedures, often trying to recover the latent labels by coordinate descent and EM (Joulin and Bach, 2012). Instead, the only difficulty here is to come up with a well-behaved estimator of $\mu$ –a statistic independent of both $h$ and $\ell$. Theorem 3 then assures bounded $\ell$-risk and, in turn, true risk. Under stricter conditions on $\ell$, (Altun and Smola, 2006, Section 4) and (Patrini et al., 2014, Theorem 6) provide finite-sample guarantees.

Algorithm 1, μsgd, adapts sgd for weak supervision. For the sake of presentation, we work on a simple version of sgd based on subgradient descent with regularization, inspired by PEGASOS (Shalev-Shwartz et al., 2011). Given $\mu$, changes are trivial: (i) construct $S_{2x}$ from $S$ and (ii) sum $a\mu$ to the subgradients of each example of $S_{2x}$. In Section 6 we upgrade proximal algorithms with the same minimal-effort strategy. The next Section shows an estimator of $\mu$ in the case of noisy labels and specializes μsgd. We also analyze the effect of noise through the lens of Theorem 3 and discuss a non-standard notion of noise robustness.
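The two changes can be illustrated with a minimal sketch of a PEGASOS-style loop. This is not the authors' exact Algorithm 1: the step size $1/(\lambda t)$, the absence of a projection step, the choice of logistic loss (with $a = -1/2$) and all names are assumptions made for illustration. Labels are never read inside the loop; they enter only through the mean operator passed as `mu`.

```python
import numpy as np

def logistic_grad(z):
    # Derivative of log(1 + exp(-z)), i.e. -1 / (1 + exp(z)), written stably
    return -0.5 * (1.0 - np.tanh(0.5 * z))

def mu_sgd(X, mu, a, loss_grad, lam=0.1, T=20000, seed=0):
    """Sketch of sgd on the doubled sample with a*mu added to each subgradient."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for t in range(1, T + 1):
        # Drawing (i, sigma) uniformly is a uniform draw from the doubled sample
        i = rng.integers(m)
        sigma = rng.choice([-1.0, 1.0])
        eta = 1.0 / (lam * t)
        v = loss_grad(sigma * (X[i] @ theta)) * sigma * X[i]
        # Regularized step: the only label information is the term a * mu
        theta = (1.0 - eta * lam) * theta - eta * (v + a * mu)
    return theta

# Toy usage: two well-separated Gaussian blobs (synthetic, for illustration only)
rng = np.random.default_rng(42)
y = rng.choice([-1.0, 1.0], size=200)
X = 2.0 * y[:, None] + rng.normal(size=(200, 2))
mu = np.mean(y[:, None] * X, axis=0)   # here computed from clean labels

theta = mu_sgd(X, mu, a=-0.5, loss_grad=logistic_grad)
accuracy = np.mean(np.sign(X @ theta) == y)
```

In a weakly supervised setting, `mu` would be replaced by an estimate obtained from the weak labels, while the loop itself stays unchanged.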

5 Asymmetric label noise

In learning with noisy labels, $\tilde{S}$ is a sequence of examples drawn from a distribution $\tilde{D}$, which samples from $D$ and flips labels at random. Each example $(x_i, \tilde{y}_i)$ is $(x_i, -y_i)$ with probability at most $1/2$, or it is $(x_i, y_i)$ otherwise. The noise rates are label dependent (while being independent of the observation), given by $(p_+, p_-) \in [0, 1/2)^2$ respectively for positive and negative examples, that is, asymmetric label noise (aln) (Natarajan et al., 2013).

Our first result builds on (Natarajan et al., 2013, Lemma 1), which provides a recipe for unbiased estimators of losses. Thanks to the Factorization Theorem 3, instead of estimating the whole $\ell$ we act on the sufficient statistic:

$$\hat{\mu}_S = \mathbb{E}_{\tilde{S}} \left[ \frac{\tilde{y} - (p_- - p_+)}{1 - p_- - p_+} \, x \right]. \quad (4)$$

The estimator is unbiased, that is, its expectation over the noise distribution $\tilde{D}$ is the population mean operator: $\mathbb{E}_{\tilde{D}}[\hat{\mu}_S] = \mu_D$. Denote then the risk computed on the estimator as $\hat{R}_{\tilde{S},\ell}(h)$, i.e. the factored risk of Theorem 3 with $\hat{\mu}_S$ in place of $\mu_S$. Unbiasedness transfers to the $\ell$-risk: $\mathbb{E}_{\tilde{D}}[\hat{R}_{\tilde{S},\ell}(h)] = R_{D,\ell}(h)$ (Proofs in A.4). We have therefore obtained a good candidate as input for any algorithm implementing our 2-step approach, like μsgd. However, in the context of the literature, there is more. On one hand, the estimators of (Natarajan et al., 2013) may not be convex even when $\ell$ is so, but this is never the case with lols; in fact, the linear-odd condition may be seen as an alternative sufficient condition to (Natarajan et al., 2013, Lemma 4) for convexity, without asking for $\ell$ to be differentiable, for the same reason as in (du Plessis et al., 2015). On the other hand, we generalize the approach of (van Rooyen et al., 2015) to losses beyond unhinged and to asymmetric noise. We now prove that any algorithm minimizing lols that uses the estimator in Equation 4 has a non-trivial generalization bound. We further assume that $\ell$ is Lipschitz.
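A short sketch may help to see why correcting the noisy labels inside the mean operator removes the bias. It assumes the convention that `p_plus` is the flip probability of positive labels and `p_minus` that of negative labels; the helper names are hypothetical. Because the estimator is linear in the noisy labels, its expectation over the noise is obtained by plugging in the expected noisy label of each example.

```python
import numpy as np

def mean_operator(X, y):
    # mu = (1/m) * sum_i y_i * x_i
    return np.mean(y[:, None] * X, axis=0)

def unbiased_mean_operator(X, y_noisy, p_plus, p_minus):
    """Noise-corrected mean operator computed from noisy labels only."""
    corrected = (y_noisy - (p_minus - p_plus)) / (1.0 - p_minus - p_plus)
    return np.mean(corrected[:, None] * X, axis=0)

def expected_noisy_label(y, p_plus, p_minus):
    # E[noisy label | clean label]: +1 flips w.p. p_plus, -1 flips w.p. p_minus
    return np.where(y > 0, 1.0 - 2.0 * p_plus, -(1.0 - 2.0 * p_minus))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = rng.choice([-1.0, 1.0], size=30)

# Expectation of the estimator over the label noise, computed exactly
expected_estimate = unbiased_mean_operator(
    X, expected_noisy_label(y, 0.2, 0.4), 0.2, 0.4)
clean = mean_operator(X, y)
```

The corrected label has conditional expectation exactly $+1$ or $-1$, so `expected_estimate` coincides with the clean mean operator; with zero noise rates the estimator reduces to the plain mean operator.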

Consider the setting of Theorem 3, except that here . Then for any , with probability at least :

(Proof in A.5) Again, the complexity term is tighter than prior work. (Natarajan et al., 2013, Theorem 3) proves a factor of that may even be unbounded due to noise, while our estimate shows a constant of about and it is noise free. In fact, lols are such that noise affects only the linear component of the bound, as a direct effect of factorization. Although we are not aware of any other such results, this is intuitive: Rademacher complexity is computed regardless of sample labels and therefore is unchanged by label noise. Furthermore, depending on the loss, the effect of (limited) noise on generalization may also be negligible since could be very large for losses such as strongly convex ones. This last remark fits well with the property of robustness that we are about to investigate.

5.1 Every lol is approximately noise-robust

The next result comes in pair with Theorem 5: it holds regardless of algorithm and (linear-odd) loss of choice. In particular, we demonstrate that every learner enjoys a distribution-dependent property of robustness against asymmetric label noise. No estimate of is involved and hence the theorem applies to any naïve supervised learner on . We first refine the notion of robustness of (Ghosh et al., 2015; van Rooyen et al., 2015) in a weaker sense.

Let respectively be the minimizers of in . is said -aln robust if for any , .

The distance of the two minimizers is measured by empirical -risk under expected label noise. -aln robust losses are also aln robust: in fact if then . And hence if has a unique global minimum, that will be . More generally

Assume . Then every -lol is -aln. That is

Moreover: (1) If for then every lol is aln for any . (2) Suppose that is also once differentiable and -strongly convex. Then .

Figure 2: Behavior of Theorem 5.1 on synthetic data.

(Proof in A.6) Unlike Theorem 5, this bound holds in expectation over the noisy risk . Its shape depends on the population mean operator, a distribution-dependent quantity. There are two immediate corollaries. When , we obtain optimality for all lols. The second corollary goes further, limiting the minimizers’ distance when losses are differentiable and strongly convex. But even more generally, under proper compactness assumptions on the domain of , Theorem 5.1 tells us much more: in the case has a unique global minimizer, the smaller , the closer the minimizer on noisy data will be to the minimizer on clean data . Therefore, assuming to know an efficient algorithm that computes a model not far from the global optimum , that will be not far from either. This is true in spite of the presence of local minima and/or saddle points.

(Long and Servedio, 2010) proves that no convex potential is noise tolerant, that is, 0-aln robust. (A convex potential is a loss , convex, such that and for . Many convex potentials are lols but not all, and there is no inclusion. An example is .) This is not a contradiction. To show the negative statement, the authors craft a case of breaking any of those losses. And in fact that choice of does not meet optimality in our bound, because , with . In contrast, we show that every element of the broad class of lols is approximately robust, as opposed to a worst-case statement. Finally, compare our -robustness to the one of (Ghosh et al., 2015): . Such a bound, while relating the (non-noisy) -risks, is not data-dependent and may not be very informative for high noise rates.

5.2 Experiments

  Input: ; ;
   Equation 4
Algorithm 2 μsgd applied on noisy labels

We analyze experimentally the theory developed so far. From now on, we assume $p_+$ and $p_-$ are known at learning time. In principle they can be tuned as hyper-parameters (Natarajan et al., 2013), at least for small $|Y|$ (Sukhbaatar and Fergus, 2014). While out of scope here, practical noise estimators have been studied (Bootkrajang and Kabán, 2012; Liu and Tao, 2014; Menon et al., 2015; Scott, 2015).

Figure 3: How mean operator and noise rate condition risks. Panels (a), (b), (c).

We begin by building a toy planar dataset to probe the behavior of Theorem 5.1. It is made of four observations: and three times, with the first example the only negative, repeated 5 times. We consider this the distribution so as to calculate . We fix and control to measure the discrepancy , its counterpart computed on , and how the two minimizers “differ in sign” by . The same simulation is run varying the noise rates with constant . We learn with by standard square loss. Results are in Figure 2. The closer the parameters to , the smaller , while they are equal when each parameter is individually . is negligible, which is good news for the -risk on sight.

Algorithm 2 learns with noisy labels on the estimator of Equation 4 by calling the black box of μsgd. The next results are based on UCI datasets (Bache and Lichman, 2013). We learn with logistic loss, without model’s intercept and set and (4 epochs). We measure and , injecting symmetric label noise and averaging over 25 runs. Again, we consider the whole distribution so as to play with the ingredients of Theorem 5.1. Figure 3(a) confirms how the combined effect of can explain most variation of . While this is not strictly implied by Theorem 5.1 that only involves , the observed behavior is expected. A similar picture is given in Figure 3(b) which displays the true risk computed on the minimizer of . From 3(a) and 3(b) we also deduce that large is a good proxy for generalization with linear classifiers; see the relative difference between points at the same level of noise. Finally, we have also monitored . Figure 3(c) shows that large indicates small as well. This remark can be useful in practice, when the norm can be accurately estimated from , as opposed to and , and used to anticipate the effect of noise on the task at hand.

We conclude with a systematic study of hold-out error of μsgd. The same datasets are now split in 1/5 test and 4/5 training sets once at random. In contrast with the previous experimental setting we perform cross-validation of $\lambda$ on 5 folds in the training set. We compare with vanilla sgd run on the corrupted sample $\tilde{S}$ and measure the gain from estimating $\mu$. The other parameters are the same for both algorithms; the learning rate is untouched from (Shalev-Shwartz et al., 2011) and not tuned for μsgd. The only differences are in input and gradient update. Table 2 reports test error for sgd and its difference with μsgd, for a range of values of $(p_+, p_-)$. μsgd systematically beats sgd with large noise rates, and still performs on par with its competitor under low or null noise. Interestingly, in the presence of very intense noise , while the contender is often doomed to random guess, μsgd still learns sensible models and improves up to relatively to the error of sgd.

dataset       sgd    sgd    sgd    sgd    sgd    sgd    sgd
australian    0.13   0.15   0.14   0.14   0.16   0.26   0.45
breast-can.   0.02   0.03   0.03   0.03   0.05   0.11   0.17
diabetes      0.28   0.29   0.29   0.27   0.28   0.39   0.59
german        0.27   0.26   0.27   0.29   0.31   0.31   0.31
heart         0.15   0.17   0.16   0.17   0.18   0.26   0.35
housing       0.17   0.23   0.22   0.20   0.22   0.34   0.41
ionosphere    0.14   0.19   0.20   0.20   0.21   0.35   0.54
sonar         0.27   0.29   0.29   0.34   0.36   0.43   0.45
Table 2: Test error for sgd and μsgd over 25 trials of artificially corrupted datasets.

6 Discussion and conclusion

Mean and covariance operators  The intuition behind the relevance of the mean operator becomes clear once we rewrite it as follows.


Let $\pi_+$ be the positive label proportion of $D$. Then

$$\mu_D = \mathrm{Cov}_D[x, y] + (2 \pi_+ - 1) \, \mathbb{E}_D[x].$$

(Proof in A.7) We have come to the unsurprising fact that –when observations are centered– the covariance is what we need to know about the labels for learning linear models. The rest of the loss may be seen as a data-dependent regularizer. However, notice how the condition does not imply , which would make linear classification hard and limit Theorem 5.1’s validity to degenerate cases. A kernelized version of this Lemma is given in (Song et al., 2009).
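The relation between the mean operator, the covariance between observations and labels, and the positive label proportion can be verified on any sample, since it follows from $\mathbb{E}[y] = 2\pi_+ - 1$ and the definition of covariance. The sketch below uses synthetic data and empirical (1/m-normalized) quantities; variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=2.0, size=(200, 4))            # deliberately not centered
y = np.where(rng.random(200) < 0.3, 1.0, -1.0)    # roughly 30% positives

mu = np.mean(y[:, None] * X, axis=0)              # mean operator
pi_plus = np.mean(y > 0)                          # positive label proportion

# Empirical covariance between each feature and the label (1/m normalization)
cov_xy = np.mean((X - X.mean(axis=0)) * (y - y.mean())[:, None], axis=0)

rhs = cov_xy + (2.0 * pi_plus - 1.0) * X.mean(axis=0)
```

After centering the observations the second term vanishes and the mean operator reduces to the covariance, which is the case discussed in the text.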

The generality of factorization  Factorization is ubiquitous for any (margin) loss, beyond the theory seen so far. A basic fact of real analysis supports it: a function $\ell$ is (uniquely) the sum of an even function $\ell_e$ and an odd function $\ell_o$:

$$\ell(x) = \frac{1}{2} \big(\ell(x) + \ell(-x)\big) + \frac{1}{2} \big(\ell(x) - \ell(-x)\big) = \ell_e(x) + \ell_o(x).$$

One can check that $\ell_e$ and $\ell_o$ are indeed even and odd (Figure 1). This is actually all we need for the factorization of $\ell$. [Factorization] For any sample $S$ and hypothesis $h$ the empirical $\ell$-risk can be written as

where is odd and is even and both uniquely defined. Its range of validity is exemplified by loss, a non-convex discontinuous piece-wise linear function, which factors as

It follows immediately that is sufficient for . However, is a function of model . This defeats the purpose of a sufficient statistic, which we aim to be computable from data only, and it is the main reason to define lols. The Factorization Theorem 3 can also be stated for rkhs. To show that, notice that we satisfy all hypotheses of the Representer Theorem (Schölkopf and Smola, 2002). Let be a feature map into a Reproducing Kernel Hilbert Space (rkhs) with symmetric positive definite kernel , such that . For any learning sample , the empirical -risk with regularization can be written as

and the optimal hypothesis admits a representation of the form . The whole paper may be read in the context of non-parametric models, with the kernel mean operator as sufficient statistic. Finally, it is simple to show factorization for square loss for regression (C). This finding may open further applications of our framework.
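The even/odd decomposition behind the general factorization can be checked numerically for losses that are not linear-odd. The sketch below (helper names are hypothetical) splits the empirical risk of hinge and 01 loss into a label-free even part and a label-dependent odd part, and shows that the odd part of hinge loss is not linear, which is why the labels cannot be summarized by the mean operator alone in that case.

```python
import numpy as np

def even_part(loss, x):
    return 0.5 * (loss(x) + loss(-x))

def odd_part(loss, x):
    return 0.5 * (loss(x) - loss(-x))

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def zero_one(z):
    return (z < 0).astype(float)

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = rng.choice([-1.0, 1.0], size=40)
theta = rng.normal(size=3)
margins = X @ theta

gaps = []
for loss in (hinge, zero_one):
    risk = np.mean(loss(y * margins))
    # The even part ignores the sign of its argument, hence it is label free
    label_free = np.mean(even_part(loss, margins))
    # The odd part carries all the label information
    label_dependent = np.mean(odd_part(loss, y * margins))
    gaps.append(abs(risk - (label_free + label_dependent)))

# Odd part of hinge at x = 0.5 and x = 2.0: not proportional to x
hinge_odd = odd_part(hinge, np.array([0.5, 2.0]))
```

For hinge loss the odd part evaluates to -0.5 and -1.5 at the two points, whereas a linear function through the first point would give -2.0 at the second.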

The linear-odd losses of (du Plessis et al., 2015)  This recent work in the context of pu shows how the linear-odd condition on a convex allows one to derive a tractable, i.e. still convex, loss for learning with pu. The approach is conceptually related to ours as it isolates a label-free term in the loss, with the goal of leveraging the unlabelled examples too. Interestingly, the linear term of their Equation 4 can be seen as a mean operator estimator like , where is the set of positive examples. Their manipulation of the loss is not equivalent to the Factorization Theorem 3 though, as explained in detail in (D). Besides that, since we reason at the higher level of wsl, we can frame a solution for pu by simply calling μsgd on defined above or on estimators improved by exploiting results of (Patrini et al., 2014).

Learning reductions  Solving a machine learning problem by solutions to other learning problems is a learning reduction (Beygelzimer et al., 2015). Our work does fit into this framework. Following (Beygelzimer et al., 2005), we define a wsl task as a triple , with weakly supervised advice , predictions space and loss , and we reduce to binary classification . Our reduction is somewhat simple, in the sense that does not change and neither does . However, Algorithm 1 modifies the internal code of the “oracle learner”, which contrasts with the concept of reduction. Anyway, we could as well write subgradients as

which equals , and thus the oracle would be untouched.

Beyond μsgd  μsgd is intimately similar to stochastic average gradient (sag) (Schmidt et al., 2013). Let denote if (example picked at time ), otherwise . Define the same for accordingly. Then, sag’s model update is:

and recalling that , μsgd’s update is

From this parallel, the two algorithms appear to be variants of a more general sampling mechanism of examples and gradient components, at each step. More generally, stochastic gradient is just one suite of algorithms that fits into our 2-step learning framework. Proximal methods (Bach et al., 2012) are another notable example. The same modus operandi leads to a proximal step of the form:

with and the regularizer. Once again, the adaptation works by summing in the gradient step and changing the input to .

A better (?) picture of robustness  The data-dependent worst-case result of (Long and Servedio, 2010), like any extreme-case argument, should be handled with care. It does not give the big picture for all data we may encounter in the real world, but only the most pessimistic. We present such a global view which appears better than expected: learning the minimizer from noisy data does not necessarily reduce convex losses to a singleton (van Rooyen et al., 2015) but depends on the mean operator for a large number of them (not necessarily linear, convex or smooth). Quite surprisingly, factorization also marries the two opposite views in one formula (see (Ghosh et al., 2015, Theorem 1)):

To conclude, we have seen how losses factor in a way that we can isolate the contribution of supervision. This has several implications both on theoretical and practical grounds: learning theory, formal analysis of label noise robustness, and adaptation of algorithms to handle poorly labelled data. An interesting question is whether factorization would let one identify what really matters in learning that is instead completely unsupervised, and to do so with more complex models than the ones considered here, as for example deep architectures.


The authors thank Aditya Menon for insightful feedback on an earlier draft. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Center of Excellence Program.


References

  • Lehmann and Casella (1998) E. L. Lehmann and G. Casella. Theory of point estimation, volume 31. Springer Science & Business Media, 1998.
  • Kakade et al. (2009) S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS*23, 2009.
  • du Plessis et al. (2015) M. C. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled data. In 32 ICML, pages 1386–1394, 2015.
  • García-García and Williamson (2011) D. García-García and R. C. Williamson. Degrees of supervision. In NIPS*25 workshop on Relations between machine learning problems, 2011.
  • Hernandez-Gonzalez et al. (2016) J. Hernandez-Gonzalez, I. Inza, and J.A. Lozano. Weak supervision and other non-standard classification problems: a taxonomy. In Pattern Recognition Letters, pages 49–55. Elsevier, 2016.
  • Natarajan et al. (2013) N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In NIPS*27, 2013.
  • Chapelle et al. (2006) O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised learning. MIT Press, Cambridge, 2006.
  • Dietterich et al. (1997) T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31–71, 1997.
  • Quadrianto et al. (2009) N. Quadrianto, A. J. Smola, T. S. Caetano, and Q. V. Le. Estimating labels from label proportions. JMLR, 10:2349–2374, 2009.
  • Joulin and Bach (2012) A. Joulin and F. R. Bach. A convex relaxation for weakly supervised classifiers. In 29 ICML, 2012.
  • Patrini et al. (2014) G. Patrini, R. Nock, P. Rivera, and T. Caetano. (Almost) no label no cry. In NIPS*28, 2014.
  • van Rooyen et al. (2015) B. van Rooyen, A. K. Menon, and R. C. Williamson. Learning with symmetric label noise: The importance of being unhinged. In NIPS*29, 2015.
  • Gao et al. (2016) W. Gao, L. Wang, Y.-F. Li, and Z.-H. Zhou. Risk minimization in the presence of label noise. In Proc. of the 30 AAAI Conference on Artificial Intelligence, 2016.
  • Long and Servedio (2010) P. M. Long and R. A. Servedio. Random classification noise defeats all convex potential boosters. Machine learning, 78(3):287–304, 2010.
  • Stempfel and Ralaivola (2009) G. Stempfel and L. Ralaivola. Learning SVMs from sloppily labeled data. In Artificial Neural Networks (ICANN), pages 884–893. Springer, 2009.
  • Masnadi-Shirazi et al. (2010) H. Masnadi-Shirazi, V. Mahadevan, and N. Vasconcelos. On the design of robust classifiers for computer vision. In Proc. of the 23 IEEE CVPR, pages 779–786. IEEE, 2010.
  • Ding and Vishwanathan (2010) N. Ding and S. V. N. Vishwanathan. t-logistic regression. In NIPS*24, pages 514–522, 2010.
  • Patrini et al. (2015) G. Patrini, F. Nielsen, and R. Nock. Bridging weak supervision and privacy aware learning via sufficient statistics. NIPS*29, Workshop on learning and privacy with incomplete data and weak supervision, 2015.
  • Chaudhuri and Hsu (2011) K. Chaudhuri and D. Hsu. Sample complexity bounds for differentially private learning. In JMLR, volume 2011, page 155, 2011.
  • Reid and Williamson (2010) M. D. Reid and R. C. Williamson. Composite binary losses. JMLR, 11:2387–2422, 2010.
  • Smola et al. (2007) A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pages 13–31. Springer, 2007.
  • Nock and Nielsen (2009) R. Nock and F. Nielsen. Bregman divergences and surrogates for learning. IEEE Trans. PAMI, 31:2048–2059, 2009.
  • Jaakkola and Jordan (2000) T. S. Jaakkola and M. I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1):25–37, 2000.
  • Kearns and Mansour (1996) M. J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In 28 ACM STOC, pages 459–468, 1996.
  • Masnadi-Shirazi (2011) H. Masnadi-Shirazi. The design of Bayes consistent loss functions for classification. PhD thesis, University of California at San Diego, 2011.
  • Bartlett and Mendelson (2002) P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, 2002.
  • Altun and Smola (2006) Y. Altun and A. J. Smola. Unifying divergence minimization and statistical inference via convex duality. In 19 COLT, 2006.
  • Shalev-Shwartz et al. (2011) S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
  • Ghosh et al. (2015) A. Ghosh, N. Manwani, and P. S. Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015.
  • Sukhbaatar and Fergus (2014) S. Sukhbaatar and R. Fergus. Learning from noisy labels with deep neural networks. arXiv:1406.2080, 2014.
  • Bootkrajang and Kabán (2012) J. Bootkrajang and A. Kabán. Label-noise robust logistic regression and its applications. In Machine Learning and Knowledge Discovery in Databases, pages 143–158. Springer, 2012.
  • Liu and Tao (2014) T. Liu and D. Tao. Classification with noisy labels by importance reweighting. arXiv:1411.7718, 2014.
  • Menon et al. (2015) A. Menon, B. Van Rooyen, C. S. Ong, and B. Williamson. Learning from corrupted binary labels via class-probability estimation. In 32 ICML, 2015.
  • Scott (2015) C. Scott. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In 18 AISTATS, 2015.
  • Bache and Lichman (2013) K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
  • Song et al. (2009) L. Song, J. Huang, A. J. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In 26 ICML. ACM, 2009.
  • Schölkopf and Smola (2002) B. Schölkopf and A. J. Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
  • Beygelzimer et al. (2015) A. Beygelzimer, H. Daumé III, J. Langford, and P. Mineiro. Learning reductions that really work. arXiv:1502.02704, 2015.
  • Beygelzimer et al. (2005) A. Beygelzimer, V. Dani, T. Hayes, J. Langford, and B. Zadrozny. Error limiting reductions between classification tasks. In 22 ICML, pages 49–56, 2005.
  • Schmidt et al. (2013) M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.
  • Bach et al. (2012) F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning, 4(1):1–106, 2012.
  • McDiarmid (1998) C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics, pages 1–54. Springer Verlag, 1998.
  • Manwani and Sastry (2013) N. Manwani and P. S. Sastry. Noise tolerance under risk minimization. Cybernetics, IEEE Transactions on, 43(3):1146–1151, 2013.


Appendix A Proofs

A.1 Proof of Lemma 3

We need to show the double implication that defines sufficiency for .
By Factorization Theorem (3), is label independent only if the odd part cancels out.
If then is independent of the label, because the label only appears in the mean operator due to Factorization Theorem (3).
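The sufficiency argument can be checked numerically on a toy case. The sketch below assumes the logistic loss (a standard example of a loss with linear odd part) and a hand-picked three-point sample; the data, the parameter vector and the function names are illustrative and not taken from the paper. Two different labelings that share the same mean operator yield the same empirical risk, while a labeling with a different mean operator does not.

```python
import numpy as np

def logistic_risk(theta, X, y):
    """Empirical logistic risk: mean of log(1 + exp(-y * <theta, x>))."""
    return np.mean(np.logaddexp(0.0, -y * (X @ theta)))

def mean_operator(X, y):
    """Empirical mean operator: average of y_i * x_i."""
    return (y @ X) / X.shape[0]

# Toy sample (assumption): the third observation is the sum of the first two.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_a = np.array([1.0, 1.0, -1.0])   # mean operator is the zero vector
y_b = -y_a                         # different labels, same mean operator
y_c = np.array([1.0, 1.0, 1.0])    # different mean operator
theta = np.array([0.7, -1.3])      # arbitrary linear model

same_mu = np.allclose(mean_operator(X, y_a), mean_operator(X, y_b))
same_risk = np.isclose(logistic_risk(theta, X, y_a), logistic_risk(theta, X, y_b))
risk_changes = not np.isclose(logistic_risk(theta, X, y_a), logistic_risk(theta, X, y_c))
print(same_mu, same_risk, risk_changes)
```

The equality holds for any choice of `theta`, since for this loss the labels enter the risk only through the linear term in the mean operator.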

A.2 Proof of Lemma 3

Consider the class of lols satisfying . For any element of the class, define , which is even. In fact we have

A.3 Proof of Theorem 3

We start by proving two helper lemmas. The first provides a bound on the Rademacher complexity computed on the sample .


Suppose even. Suppose be the observations space, and be the space of linear hypotheses. Let . Then the empirical Rademacher complexity

of on satisfies:


with .
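To make the quantity being bounded concrete, the following sketch estimates by Monte Carlo the empirical Rademacher complexity of a norm-bounded linear class on a generic sample, and compares it with the standard textbook bound. This is not the lemma's bound: the radius `B`, the sample `X`, the number of draws and the function names are assumptions made for illustration. It relies on the fact that, for the class of linear functions with Euclidean norm at most `B`, the supremum over the class of the sign-weighted average equals `B` times the norm of the sign-weighted sum of observations, divided by the sample size.

```python
import numpy as np

def empirical_rademacher_linear(X, B, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||theta|| <= B} (1/n) sum_i sigma_i <theta, x_i>."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # Closed-form supremum for each draw: (B / n) * || sum_i sigma_i x_i ||_2
    sups = B * np.linalg.norm(sigma @ X, axis=1) / n
    return sups.mean()

def standard_bound(X, B):
    """Standard bound via Jensen's inequality: (B / n) * sqrt(sum_i ||x_i||^2)."""
    n = X.shape[0]
    return B * np.sqrt(np.sum(X ** 2)) / n

# Illustrative data only.
X_demo = np.random.default_rng(1).normal(size=(50, 3))
estimate = empirical_rademacher_linear(X_demo, B=1.0)
bound = standard_bound(X_demo, B=1.0)
print(estimate, bound)
```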


Suppose without loss of generality that . The proof relies on the observation that ,