The goal of supervised learning is to find a function in some hypothesis class that predicts a relationship between instances and labels. Such a function should have low average loss according to the true distribution of instances and labels, . The learner is not given direct access to , but rather a training set comprising iid samples from . There are many algorithms for solving this problem (for example empirical risk minimization) and this problem is well understood.
There are many other types of data one could learn from. For example, in semi-supervised learning the learner is given instance label pairs together with instances devoid of labels. In learning with noisy labels [2, 24, 30], the learner observes instance label pairs where the observed labels have been corrupted by some noise process. There are many other variants including, but not limited to, learning with label proportions, learning with partial labels and multiple instance learning, as well as combinations of the above.
What is currently lacking is a general theory of learning from corrupted data, as well as a means to compare the relative usefulness of different data types. Such a theory is required if one wishes to make informed economic decisions on which data sets to acquire. For example, are clean data better or worse than noisy labels and partial labels?
To answer this question we first place the problem of corrupted learning into the abstract language of statistical decision theory. We then develop general lower and upper bounds on the risk relative to the amount of corruption of the clean data. Finally we show examples of problems that fit into this abstract framework.
The main contributions of this paper are:
Novel lower bounds on the risk of corrupted learning (theorem 5).
Analyses of the tightness of the above bounds.
In doing so we provide answers to our central question of how to rank different types of corrupted data, using either our upper or our lower bounds. While not the complete story for all problems, the contributions outlined above make progress toward the final goal of being able to make informed economic decisions regarding the acquisition of data sets. All proofs omitted in the main text appear in the appendix.
2 The Decision Theoretic Framework
Decision theory deals with the general problem of decision making under uncertainty. One starts with a set of possible true hypotheses (only one of which is actually true) as well as a set of actions available to the decision maker. Prior to acting, the decision maker performs an experiment, the outcome of which is assumed to be related to the true hypothesis, and observes in an observation space . Ultimately the decision maker makes act and incurs loss , with the unknown true hypothesis. We model the relationships between unknowns and the results of experiments with Markov kernels [34, 25, 29, 13]. The abstract development that follows is needed to place a wide range of corruption processes into a single framework so that they may be compared.
2.1 Markov Kernels
As much of our focus will be on noise on the labels and not on the instances, henceforth we will assume we are only working with finite sets.
the set of probability distributions on a set. Define a Markov kernel from a set to a set (denoted by ) to be a function . Denote the set of all Markov kernels from to by . Every function defines a Markov kernel with , a point mass on . Given two Markov kernels and we can compose them to form by taking
for all . One can also combine Markov kernels in parallel. If and , denote the product distribution by . If , , are Markov kernels then with
. By restricting ourselves to finite sets, distributions can be represented by vectors, Markov kernels by column stochastic matrices (non-negative matrices whose columns each sum to 1) and composition by matrix multiplication. An experiment on is any Markov kernel with domain and a learning algorithm is any Markov kernel with co-domain . Finally, from any experiment we define the replicated experiment , with the -fold product of .
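The matrix view described above can be made concrete with a short sketch. The helper names (`is_markov_kernel`, `compose`, `parallel`, `replicate`) and the numerical kernels are illustrative assumptions, not notation from the text; the code only relies on the stated facts that kernels between finite sets are column stochastic matrices and that composition is matrix multiplication.

```python
import numpy as np

def is_markov_kernel(T, tol=1e-12):
    """Column stochastic check: non-negative entries, every column sums to 1."""
    T = np.asarray(T, dtype=float)
    return bool(np.all(T >= -tol) and np.allclose(T.sum(axis=0), 1.0))

def compose(T2, T1):
    """Apply T1 first, then T2: composition is the matrix product T2 @ T1."""
    return np.asarray(T2, dtype=float) @ np.asarray(T1, dtype=float)

def parallel(T1, T2):
    """Kernels applied in parallel act on product distributions via the Kronecker product."""
    return np.kron(T1, T2)

def replicate(E, n):
    """Replicated experiment: column j becomes the n-fold product of column j of E."""
    E = np.asarray(E, dtype=float)
    cols = []
    for j in range(E.shape[1]):
        col = np.array([1.0])
        for _ in range(n):
            col = np.kron(col, E[:, j])
        cols.append(col)
    return np.stack(cols, axis=1)

# Hypothetical kernels on two-element sets (values chosen only for illustration).
T1 = np.array([[0.9, 0.2],
               [0.1, 0.8]])
T2 = np.array([[0.7, 0.3],
               [0.3, 0.7]])
P = np.array([0.6, 0.4])
Q = np.array([0.5, 0.5])

T21 = compose(T2, T1)       # still column stochastic
E3 = replicate(T1, 3)       # shape (8, 2): three i.i.d. observations per hypothesis
```

Note that `replicate` takes products column by column, so the result keeps the same domain as the original experiment, whereas `parallel` produces a kernel on the product of the two domains.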
2.2 Loss and Risk
One assesses the consequence of actions through a loss . It is sometimes useful to work with losses in curried form. From any loss and action , define with
. We measure the size of a loss function by its supremum norm. If and we overload our notation with .
Normally, we are not interested in the absolute loss of an action, rather its loss relative to the best action, defined formally as the regret . We measure the performance of an algorithm by the risk
For the sake of comparison by a single number either the max risk or the average risk with respect to a distribution can be used. We define a learning problem to be a pair with a loss and an experiment. We measure the difficulty of a learning problem by the minimax risk
Normally we are not concerned with the quality of a learning algorithm for observation of a single . Rather we wish to know the rate at which the risk decreases as the number of replications of the experiment grows. Hence the prime quantity of interest is .
2.3 Statistics vs Machine Learning
they can be readily applied to machine learning problems. The main distinction is that statistics focuses on parametric families and loss functions of type . The goal is to accurately reconstruct parameters. In machine learning one is interested in predicting the observations of the experiment well. There the focus is on problems with and loss functions of the form , where measures how well predicts the observation . Our focus is on problems of the second sort, however abstractly there is no real difference. Both are just different learning problems. When clear we use and interchangeably.
2.3.1 Supervised Learning
In Table 1 we explain the mapping of supervised learning into our abstract language. We focus on the problem of conditional probability estimation, of which learning a binary classifier is a special case. Letting be the instance space and the label space we have
| Unknowns | Distributions of instance, label pairs |
| Observation Space | Instance label pairs |
| Action Space | Function class |
| Experiment | Maps each to itself |
a standard object of study in learning theory .
2.4 Corrupted Learning
In corrupted learning, rather than observing , one observes a corrupted in a different observation space . We model the corruption process through a Markov kernel and define a corrupted learning problem to be the triple . For convenience we define the corrupted experiment . Ideally we wish to compare with . By general forms of the information processing theorem [32, 22] , however this does not allow one to rank the utility of different .
Even after many years of directed research, in general we cannot compute exactly, let alone for general corruptions. Consequently our effort for the remainder of the paper turns to upper and lower bounds of .
3 Upper Bounds for Corrupted Learning
When convenient we use the shorthand .  introduced a method of learning classifiers from data subjected to label noise, termed the method of unbiased estimators. Here we show that this method can be generalized to other corruptions. Firstly, , the dual space of . We use the notation . From any Markov kernel , we obtain a linear map with
where is the pullback of by . In terms of matrices is the transpose or adjoint of T.
A Markov kernel is reconstructible if has a left inverse, that is, there exists a linear map such that .
Intuitively, is reconstructible if there is some transformation that “undoes” the effects of . In general is not a Markov kernel. Many forms of corrupted learning are reconstructible, including semi-supervised learning, learning with label noise and learning with partial labels for all but a few pathological cases. The reader is directed to 10.1 for worked examples.
We call a left inverse of a reconstruction. For concreteness, one can always take
the Moore-Penrose pseudo inverse of . Reconstructible Markov kernels are exactly those where we can transfer a loss function from the clean distribution to the corrupted distribution. We have by properties of adjoints
In words, to take expectations of with samples from we use the corruption corrected .
Theorem 1 (Corruption Corrected Loss).
For all reconstructible , loss functions and reconstructions define the corruption corrected loss , with . Then for all distributions , .
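A minimal numerical sketch of the corruption corrected loss follows. It assumes a hypothetical asymmetric binary label noise kernel `T`, takes the reconstruction `R` to be the Moore-Penrose pseudo inverse, and forms the corrected loss as the transpose of `R` applied to the loss matrix. The array names and the numbers are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical corruption: column y is the distribution of the observed label
# given clean label y (asymmetric flip probabilities 0.1 and 0.3).
T = np.array([[0.9, 0.3],
              [0.1, 0.7]])

# A reconstruction: the pseudo inverse is a left inverse when T has full column rank.
R = np.linalg.pinv(T)

# loss[y, a] is the 0-1 loss of action a when the clean label is y.
loss = np.array([[0.0, 1.0],
                 [1.0, 0.0]])

# Corrected loss, indexed by the corrupted label: apply the adjoint (transpose) of R.
loss_R = R.T @ loss

# Any clean distribution P over labels (values are illustrative).
P = np.array([0.25, 0.75])

clean_risk = P @ loss                # expected loss of each action under P
corrupted_risk = (T @ P) @ loss_R    # expected corrected loss under the corrupted distribution
```

The two risk vectors agree for every action, which is the identity stated in the theorem. The corrected loss takes negative values here even though the original loss does not.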
We direct the reader to 10.1 for some examples of for different corruptions. Minimizing on a sample provides means to learn from corrupted data. Let , the average loss on the sample. By an application of the PAC Bayes bound ([28, 39, 10]) one has for all algorithms , priors and distributions
This bound yields the following theorem.
For all reconstructible Markov kernels , algorithms , priors , distributions and bounded loss functions
A similar result also holds with high probability on draws from . If is Empirical Risk Minimization (ERM), is finite and uniform on the above analysis yields convergence to the optimum as for learning with corrupted data versus for learning with clean data. Therefore, the ratio measures the relative difficulty of corrupted versus clean learning.
3.1 Upper Bounds for Combinations of Corrupted Data
Recall that our final goal is to be able to make informed economic decisions regarding the acquisition of data sets. As such, we wish to quantify the utility of a data set comprising different corrupted data. For example in learning with noisy labels out of data, there could be clean, slightly noisy and very noisy samples and so on. More generally we assume access to a corrupted sample , made up of different types of corrupted data, with .
Let be a collection of reconstructible Markov kernels. Let and , and . Then for all algorithms , priors , distributions and bounded loss functions
A similar result also holds with high probability on draws from . Theorem 3 is a generalization of the final bound appearing in  that only pertains to symmetric label noise and binary classification. Theorem 3 suggests the following means of choosing data sets. Let be the cost of acquiring data corrupted by and the maximum total cost. First, choose data from the with lowest until picking more violates the budget constraint. Then choose data from the second lowest and so on.
4 Lower Bounds for Corrupted Learning
Thus far we have developed upper bounds for ERM style algorithms. In particular we have found that reconstructible corruption does not affect the rate at which learning occurs; it only affects constants in the upper bound. Can we do better? Are these constants tight? To answer this question we develop lower bounds for corrupted learning.
Here we review Le Cam’s method, a powerful technique for generating lower bounds for learning problems that very often gives the correct rate and dependence on constants (including being able to reproduce the standard VC dimension lower bounds for classification presented in ). In recent times it has been used to establish lower bounds for: differentially private learning , learning in a distributed setup , function evaluations required in convex optimization  as well as generic lower bounds in statistical estimation problems . We show how this method can be extended using the strong data processing theorem [9, 15] to provide a general tool for lower bounding corrupted learning problems.
4.1 Le Cam’s Method and Minimax Lower Bounds
Le Cam’s method proceeds by reducing a general learning problem to an easier binary classification problem, before relating the best possible performance on this classification problem to the minimax risk. Define the separation , . The separation measures how hard it is to act well against both and simultaneously. We have the following (see section 10.4 for a more detailed treatment).
For all experiments , loss functions and
where is the variational divergence.
This lower bound is a trade off between distances measured by and statistical distances measured by the variational divergence. A learning problem is easy if proximity in variational divergence of and (hard to distinguish and statistically) implies proximity of and in (hard to distinguish and with actions).
If there exists with and we instantly get that the minimax regret must be positive. For corrupted experiments, if is not reconstructible it may be the case that for some . Hence we assume that is reconstructible.
4.1.1 Replication and Rates
We wish to lower bound how the risk decreases as grows. When working with replicated experiments it can be advantageous to work with an f-divergence (see section 4.3) other than the variational divergence and to invoke a generalized Pinsker inequality . Common choices in theoretical statistics are the Hellinger and alpha divergences  as well as the KL divergence . Here we use the variational divergence and the following lemma.
For all collections of distributions ,
Here we make use of the specific case where and for all .
For all experiments , loss functions , and
To use lemma 3, one defines and for , with the property
or equivalently . This yields a lower bound of
To obtain tight lower bounds, needs to be designed in a problem dependent fashion. However, as our goal here is to reason relatively we assume that is given.
4.1.2 Other Methods for Obtaining Minimax Lower Bounds
There are many other techniques for lower bounds in terms of functions of pairwise divergences  (for example Assouad’s method) as well as functions of pairwise f-divergences . While such methods are often required to get tighter lower bounds, all of what follows can be applied to these more intricate lower bounding techniques. Therefore, for the sake of conceptual clarity, we proceed with Le Cam’s method.
4.2 Measuring the Amount of Corruption
Rather than the experiment , in corrupted learning we work with the corrupted experiment . The information processing theorem for f-divergences states that
Thus any lower bound achieved by Le Cam’s method for can be directly transferred to one for . This is just a manifestation of theorems presented in [32, 22] and alluded to in section 2.4. However, this provides us with no means to rank different . For some , the information processing theorem can be strengthened, in the sense that one can find such that
The coefficient provides a means to measure the amount of corruption present in . For example if is constant and maps all to the same distribution, then . If is an invertible function, then . Together with lemma 3 this strong information processing theorem  leads to meaningful lower bounds that allow the comparison of different corrupted experiments.
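For finite sets, a contraction coefficient of this kind with respect to the variational divergence can be computed directly. The sketch below assumes the coefficient is the largest total variation distance between any two columns of the kernel, which is the classical Dobrushin coefficient; the function name `contraction_coefficient` is ours. It reproduces the two extreme cases mentioned above: a constant kernel gives 0 and an invertible function gives 1.

```python
import numpy as np
from itertools import combinations

def contraction_coefficient(T):
    """Largest total variation distance between two columns of a column stochastic matrix."""
    T = np.asarray(T, dtype=float)
    pairs = combinations(range(T.shape[1]), 2)
    return max((0.5 * np.abs(T[:, i] - T[:, j]).sum() for i, j in pairs), default=0.0)

# A constant kernel maps every input to the same distribution.
constant = np.array([[0.3, 0.3, 0.3],
                     [0.7, 0.7, 0.7]])

# An invertible function corresponds to a permutation matrix.
permutation = np.array([[0.0, 1.0],
                        [1.0, 0.0]])

# Hypothetical symmetric binary label noise with flip probability 0.2.
symmetric = np.array([[0.8, 0.2],
                      [0.2, 0.8]])

coefficients = {
    "constant": contraction_coefficient(constant),
    "permutation": contraction_coefficient(permutation),
    "symmetric": contraction_coefficient(symmetric),
}
```

Because the coefficient is a ratio of divergences, it does not depend on whether the variational divergence is normalized to lie in [0, 1] or in [0, 2].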
4.3 A Generic Strong Data Processing Theorem.
Following , we present a strong data processing theorem that works for all f-divergences.
Let be a set and a convex function with . For all distributions the f-divergence between and is
Both the variational and KL divergence are examples of f-divergences. For fixed we seek an such that
To do so we first relate the amount contracts and to a certain deconstruction for Markov kernels before proving when such a deconstruction can occur.
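The f-divergence of this section can be sketched for finite sets as follows. We assume the common convention in which the convex function is applied to the likelihood ratio and averaged under the second distribution, and we assume both distributions have full support so that the ratio and the logarithm are well defined. The function names are illustrative.

```python
import numpy as np

def f_divergence(f, P, Q):
    """Sum over outcomes of Q(x) * f(P(x) / Q(x)); assumes Q has full support."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return float(np.sum(Q * f(P / Q)))

def variational(t):
    """Generator of the (unnormalized) variational divergence."""
    return np.abs(t - 1.0)

def kl(t):
    """Generator of the KL divergence; assumes t > 0, i.e. P has full support too."""
    return t * np.log(t)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])

# A hypothetical kernel from a three-element set to a two-element set.
T = np.array([[0.8, 0.5, 0.1],
              [0.2, 0.5, 0.9]])

v_clean = f_divergence(variational, P, Q)
v_corrupt = f_divergence(variational, T @ P, T @ Q)
kl_clean = f_divergence(kl, P, Q)
kl_corrupt = f_divergence(kl, T @ P, T @ Q)
```

In this example both divergences shrink after passing the distributions through `T`, as the information processing theorem requires.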
For all Markov kernels and distributions , if there exists and such that with then .
Hence the amount contracts and is related to the amount of that fixes and . We seek the largest such that a decomposition is always possible, no matter what pair of distributions is required to fix.
For all Markov kernels define . Then if and only if for all pairs of distributions there exists a decomposition
with and .
Theorem 4 (Strong Data Processing).
For all Markov kernels define . Then for all ,
4.4 Relating to Variational Divergence
where . Hence is the operator 1-norm of T when restricted to . The above also shows that provides the tightest strong data processing theorem possible when using variational divergence, and hence it gives the tightest generic strong data processing theorem. We also have the following compositional property of .
For all Markov kernels and ,
4.5 Lower Bounds Relative to the Amount of Corruption
For all experiments , loss functions , , and corruptions
In words, if ever Le Cam’s method gives a lower bound of for repetitions of the clean experiment, we obtain a lower bound of for repetitions of the corrupted experiment. Hence the rate is unaffected, only the constants. However, a penalty of factor is unavoidable no matter what learning algorithm is used, suggesting that is a valid way of measuring the amount of corruption. We summarize the results of this section in the following theorem.
For all corruptions and experiments , if Le Cam’s method yields a lower bound then
In particular if one has a lower bound of for the clean problem, as is usual for many machine learning problems, theorem 5 yields a lower bound of for the corrupted problem.
4.6 Lower Bounds for Combinations of Corrupted Data
As in section 3.1 we present lower bounds for combinations of corrupted data. For example in learning with noisy labels out of data, there could be clean, slightly noisy and very noisy samples and so on.
Let , , be reconstructible Markov kernels. Let with . If Le Cam’s method yields a lower bound then
As in section 3.1 this bound suggests a means of choosing data sets, via the following integer program
where is the cost of acquiring data corrupted by and is the maximum total cost. This is exactly the unbounded knapsack problem  which admits the following near optimal greedy algorithm. First, choose data from the with highest until picking more violates the constraints. Then pick from the second highest and so on.
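The greedy procedure just described can be sketched as follows. Here `values[i]` is a hypothetical stand-in for the per-sample utility that the bound assigns to the i-th corruption, `costs[i]` is its per-sample cost and `budget` is the maximum total cost; all names and numbers are assumptions made for illustration. The sketch ranks corruptions by utility per unit cost and buys as much as possible from each in turn.

```python
def greedy_purchase(values, costs, budget):
    """Greedy heuristic for the unbounded knapsack problem.

    Ranks data types by value per unit cost, buys as many samples as the
    remaining budget allows from the best type, then moves to the next.
    Returns the number of samples bought per type and the unspent budget.
    """
    order = sorted(range(len(costs)), key=lambda i: values[i] / costs[i], reverse=True)
    counts = [0] * len(costs)
    remaining = budget
    for i in order:
        k = int(remaining // costs[i])   # most samples of type i we can still afford
        counts[i] = k
        remaining -= k * costs[i]
    return counts, remaining

# Illustrative numbers: clean, slightly noisy and very noisy data.
values = [1.0, 0.36, 0.04]
costs = [5, 2, 1]
counts, leftover = greedy_purchase(values, costs, budget=12)
```

As with any greedy solution to a knapsack problem, the result is only near optimal: leftover budget that cannot buy another sample of the best type is passed down to cheaper types.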
5 Measuring the Tightness of the Upper Bounds and Lower Bounds
In the previous sections we have shown upper bounds that depend on as well as lower bounds that depend on . Recall from theorem 1 that , as such the worst case ratio is determined by the operator norm of . For a linear map define
which are two operator norms of . They are equal to the maximum absolute column and row sum of respectively . Hence .
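The two operator norms can be checked numerically. The sketch below uses `numpy.linalg.norm`, whose `ord=1` and `ord=np.inf` options return the maximum absolute column sum and the maximum absolute row sum respectively. The matrix `R` is the inverse of a hypothetical asymmetric binary label noise kernel, chosen only for illustration.

```python
import numpy as np

# Hypothetical corruption kernel and its reconstruction.
T = np.array([[0.9, 0.3],
              [0.1, 0.7]])
R = np.linalg.inv(T)

norm_1 = np.linalg.norm(R, 1)            # maximum absolute column sum
norm_inf = np.linalg.norm(R, np.inf)     # maximum absolute row sum

# The same quantities computed from the definition.
max_col_sum = np.abs(R).sum(axis=0).max()
max_row_sum = np.abs(R).sum(axis=1).max()

# Taking the transpose (adjoint) swaps the two norms.
norm_1_of_adjoint = np.linalg.norm(R.T, 1)
```

In this example the two norms differ, so which one appears in a bound matters for asymmetric corruptions.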
For all losses , and reconstructions , .
If is reconstructible, with reconstruction , then
The intuition here is if contracts a particular greatly, which would occur if
was small (here ), then could greatly increase the norm of a loss . However, it need not increase the norm of the particular loss of interest. Note that for lower bounds we look at the best case separation of columns of , for upper bounds we essentially use the worst. We also get the following compositional theorem.
If and are reconstructible, with reconstructions and then is reconstructible with reconstruction . Furthermore .
What we have shown is the following implication, for all reconstructible
By lemma 9, in the worst case , and in the “optimistic worst case” we arrive at bounds a factor of apart. We do not know if this is the fault of our upper or lower bounding techniques. However, when considering specific and this gap is no longer present (see section 10.1).
Assuming is the cost of acquiring data corrupted by , theorem 3 ranks the utility of different corruptions by whereas theorem 6 ranks by . By lemma 9, is a proxy for , meaning both theorems are “doing the same thing”. In theorems 6 and 3 we have a best case and a worst case loss specific method for choosing data sets. Theorem 3 combined with lemma 8 provides a worst case loss insensitive method for choosing data sets.
6 What if Clean Learning is Fast?
The preceding largely solves the problem of learning from corrupted data when learning from the clean distribution occurs at a slow () rate. The reader is directed to section 10.12 for some preliminary work on when corrupted learning also occurs at a fast rate.
7 Proper Losses and Convexity
A loss is proper if
It is strictly proper if is the unique minimizer.
Proper losses provide suitable surrogate losses for learning problems. All strictly proper losses can be convexified through the use of the canonical link function [33, 36]. Ultimately one works with a loss of the form
with , the constant function and a convex function.
Theorem 7 (Preservation of Convexity).
Let and be a convex function. Define the loss . Then
Furthermore this loss is convex in .
This was first noticed in .
8 Uses in Supervised Learning
Recall in supervised learning and the goal is to find a function that predicts from with low expected loss. Many supervised learning techniques proceed by minimizing a proper loss. Given a suitable function class and a strictly proper loss , they attempt to find
Using the canonical link function and a carefully chosen function class leaves the learner with a convex problem. If we assume the labels have been corrupted by a corruption , we can correct for the corruptions and solve for
This objective is equivalent to the first and will also be convex.
We have sought to solve the problem of how to rank different forms of corrupted data with the ultimate goal of making informed decisions regarding the acquisition of data sets. To do so we have introduced a general framework in which many corrupted learning tasks can be expressed. Furthermore, we have derived general upper and lower bounds for the reconstructible subset of corrupted learning problems. Finally, we have shown that in some examples these bounds are tight enough to be of use and that they produce the quantities one would expect. These bounds facilitate the ranking of different corrupted data, either through the use of best case lower bounds or worst case upper bounds. We have shown both loss specific bounds and bounds that are worst case as the loss is varied. Future work will attempt to further refine these methods as well as extend the framework to non-reconstructible problems such as multiple instance learning and learning with label proportions. Theorems 3 and 6 provide means of choosing between data sets that feature collections of different corrupted data.
We now show examples of common corrupted learning problems. Once again, our focus is corruption of the labels and not the instances. Thus we work directly with losses . In particular we work with classification problems. We present the worst case upper bound, , as well as the upper bound relevant for loss, .
10.1.1 Noisy Labels
The above equation is lemma 1 in  and is the original method of unbiased estimators. Interestingly, even if is positive, can be negative. If the noise is symmetric with and is loss then
which is just a rescaled and shifted version of loss. If we work in the realizable setting, i.e. there is some with
then the above provides an interesting correspondence between learning with symmetric label noise and learning under distributions with large Tsybakov margin . Taking with separable in turn implies has Tsybakov margin . This means bounds developed for this setting  can be transferred to the setting of learning with symmetric label noise. Our lower bound reproduces the results of 
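The rescaled and shifted form of the corrected loss under symmetric label noise can be checked numerically. The sketch assumes binary labels flipped independently with probability `sigma` below one half and a 0-1 loss; the closed form it compares against, the loss minus `sigma` divided by one minus twice `sigma`, is our reading of the rescaling and should be treated as an assumption.

```python
import numpy as np

sigma = 0.2   # hypothetical flip probability, assumed to be below 1/2

# Symmetric label noise kernel and its exact inverse as the reconstruction.
T = np.array([[1 - sigma, sigma],
              [sigma, 1 - sigma]])
R = np.linalg.inv(T)

# loss01[y, a] is the 0-1 loss of predicting a when the label is y.
loss01 = np.array([[0.0, 1.0],
                   [1.0, 0.0]])

# Corrected loss obtained from the reconstruction.
loss_R = R.T @ loss01

# Candidate closed form: shift by sigma, rescale by 1 / (1 - 2 * sigma).
closed_form = (loss01 - sigma) / (1 - 2 * sigma)
```

Since the corrected loss is an increasing affine function of the 0-1 loss whenever `sigma` is below one half, minimizing one is equivalent to minimizing the other.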
Below is a table of the relevant parameters for learning with noisy binary labels. These results directly extend those present in  that considered only the case of symmetric label noise.
|Learning with Label Noise (Binary)|
We see that as long as is reconstructible. The pattern we see in this table is quite common. tends to be marginally greater than , with less than both. In the symmetric case our lower bound reproduces those of .
10.1.2 Semi-Supervised Learning
We consider the problem of semi-supervised learning . Here is the probability class has a missing label. We first consider the easier symmetric case where .
|Symmetric Semi-Supervised Learning|
Once again . As long as . Our lower bound confirms that in general unlabelled data does not help . Rather than using the method of unbiased estimators, one could simply throw away the unlabelled data leaving behind labelled instances on average.
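The symmetric semi-supervised corruption can be written down explicitly as a kernel from two clean labels to three observations, the third being a missing label. The sketch below assumes each label is erased independently with probability `sigma`, builds a reconstruction with the pseudo inverse, and computes the largest total variation distance between the columns of the kernel as a measure of the amount of corruption. The variable names are illustrative.

```python
import numpy as np

sigma = 0.3   # hypothetical probability that a label is missing

# Rows: observed label in {-1, +1, missing}; columns: clean label in {-1, +1}.
T = np.array([[1 - sigma, 0.0],
              [0.0, 1 - sigma],
              [sigma, sigma]])

# The pseudo inverse is a left inverse as long as sigma < 1.
R = np.linalg.pinv(T)

# Total variation distance between the two columns of T.
alpha = 0.5 * np.abs(T[:, 0] - T[:, 1]).sum()
```

Here `alpha` equals one minus `sigma`, which matches the fraction of labelled instances that remain, on average, if the unlabelled data is simply thrown away.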
Other parameters for the more general case are omitted due to complexity (they involve the maximum of three 4th order rational equations). They are available in closed form.
10.1.3 Three Class Symmetric Label Noise
In line with , here we present parameters for the three class variant of symmetric label noise. We have with , if and otherwise.
|Learning with Symmetric Label Noise (Multiclass)|