 # Mixture Proportion Estimation via Kernel Embedding of Distributions

Mixture proportion estimation (MPE) is the problem of estimating the weight of a component distribution in a mixture, given samples from the mixture and component. This problem constitutes a key part in many "weakly supervised learning" problems like learning with positive and unlabelled samples, learning with label noise, anomaly detection and crowdsourcing. While there have been several methods proposed to solve this problem, to the best of our knowledge no efficient algorithm with a proven convergence rate towards the true proportion exists for this problem. We fill this gap by constructing a provably correct algorithm for MPE, and derive convergence rates under certain assumptions on the distribution. Our method is based on embedding distributions onto an RKHS, and implementing it only requires solving a simple convex quadratic programming problem a few times. We run our algorithm on several standard classification datasets, and demonstrate that it performs comparably to or better than other algorithms on most datasets.


## 1 Introduction

Mixture proportion estimation (MPE) is the problem of estimating the weight of a component distribution in a mixture, given samples from the mixture and the component. Solving this problem is a key step in several “weakly supervised” learning problems. For example, MPE is a crucial ingredient in learning from positive and unlabelled samples (LPUE), in which one has access to unlabelled data and positively labelled data but wishes to construct a classifier distinguishing between positive and negative data (Liu et al., 2002; Denis et al., 2005; Ward et al., 2009). MPE also arises naturally in the task of learning a classifier with noisy labels in the training set, i.e., positive instances have a certain chance of being mislabelled as negative and vice versa, independent of the observed feature vector (Lawrence & Scholkopf, 2001; Bouveyron & Girard, 2009; Stempfel & Ralaivola, 2009; Long & Servido, 2010; Natarajan et al., 2013). Natarajan et al. (2013) show that this problem can be solved by minimizing an appropriate cost-sensitive loss, but the cost parameter depends on the label noise parameters, the computation of which can be broken into two MPE problems (Scott et al., 2013a). MPE also has applications in several other problems such as anomaly rejection (Sanderson & Scott, 2014) and crowdsourcing (Raykar et al., 2010).

When no assumptions are made on the mixture and the components, the problem is ill-defined, as the mixture proportion is not identifiable (Scott, 2015). While several methods have been proposed to solve the MPE problem (Blanchard et al., 2010; Sanderson & Scott, 2014; Scott, 2015; Elkan & Noto, 2008; du Plessis & Sugiyama, 2014; Jain et al., 2016), to the best of our knowledge no provable and efficient method is known for solving this problem in the general non-parametric setting with minimal assumptions. Some papers propose estimators that converge to the true proportion under certain conditions (Blanchard et al., 2010; Scott et al., 2013a; Scott, 2015), but these estimators cannot be computed efficiently. In practice, these papers therefore implement a method that is motivated by the provable estimator but has no direct guarantee of convergence to the true proportion. Other papers propose estimators that can be implemented efficiently (Elkan & Noto, 2008; du Plessis & Sugiyama, 2014), but the resulting estimators are correct only under very restrictive conditions (see Section 7) on the distribution. Further, all these methods except the one by du Plessis & Sugiyama (2014) require an accurate binary conditional probability estimator as a sub-routine and use methods like logistic regression to achieve this. In our opinion, requiring an accurate conditional probability estimate (which is a real valued function over the instance space) for estimating the mixture proportion (a single number) is too roundabout.

Our main contribution in this paper is an efficient algorithm for mixture proportion estimation along with convergence rates of the estimate to the true proportion (under certain conditions on the distribution). The algorithm is based on embedding the distributions (Gretton et al., 2012) into a reproducing kernel Hilbert space (RKHS), and only requires a simple quadratic programming solver as a sub-routine. Our method does not require the computation of a conditional probability estimate and is hence potentially better than other methods in terms of accuracy and efficiency. We test our method on some standard datasets, compare our results against several other algorithms designed for mixture proportion estimation and find that our method performs better than or comparable to previously known algorithms on most datasets.

The rest of the paper is organised as follows. The problem setup and notation are given in Section 2. In Section 3 we introduce the main object of our study, called the $\mathcal{C}$-distance, which essentially maps a candidate mixture proportion value to a measure of how ‘bad’ the candidate is. We give a new condition on the mixture and component distributions that we call ‘separability’ in Section 4, under which the $\mathcal{C}$-distance function explicitly reveals the true mixture proportion, and propose two estimators based on this. In Section 5 we give the rates of convergence of the proposed estimators to the true mixture proportion. We give an explicit implementation of one of the estimators based on a simple binary search procedure in Section 6. We give brief summaries of other known algorithms for mixture proportion estimation in Section 7 and list their characteristics and shortcomings. We give details of our experiments in Section 8 and conclude in Section 9.

## 2 Problem Setup and Notations

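Let $G, H$ be distributions over a compact metric space $\mathcal{X}$ with supports given by $\mathrm{supp}(G), \mathrm{supp}(H)$. Let $\kappa^* \in [0,1)$ and let $F$ be a distribution that is given by a convex combination (or equivalently, a mixture) of $G$ and $H$ as follows:

$$F = (1-\kappa^*)\,G + \kappa^* H.$$

Equivalently, we can write

$$G = \lambda^* F + (1-\lambda^*)\,H,$$

where $\lambda^* = \frac{1}{1-\kappa^*}$. Given samples $\{x_1, \ldots, x_n\}$ drawn i.i.d. from $F$ and $\{x_{n+1}, \ldots, x_{n+m}\}$ drawn i.i.d. from $H$, the objective in mixture proportion estimation (MPE) (Scott, 2015) is to estimate $\kappa^*$.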
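The two parameterisations of the mixture can be checked numerically. The sketch below assumes the relation $\lambda^* = 1/(1-\kappa^*)$ implied by the two displayed identities; the helper names and the three-outcome toy distributions are illustrative and not taken from the paper.

```python
def lambda_from_kappa(kappa):
    """Map the mixture proportion kappa* in [0, 1) to lambda* = 1 / (1 - kappa*)."""
    if not 0.0 <= kappa < 1.0:
        raise ValueError("kappa must lie in [0, 1)")
    return 1.0 / (1.0 - kappa)


def kappa_from_lambda(lam):
    """Inverse map: kappa* = 1 - 1 / lambda*, defined for lambda* >= 1."""
    if lam < 1.0:
        raise ValueError("lambda must be at least 1")
    return 1.0 - 1.0 / lam


# Toy discrete distributions over three outcomes (illustrative values only).
G = [0.5, 0.5, 0.0]
H = [0.0, 0.5, 0.5]
kappa = 0.4

# F = (1 - kappa) G + kappa H
F = [(1.0 - kappa) * g + kappa * h for g, h in zip(G, H)]

# G should be recovered as lambda F + (1 - lambda) H
lam = lambda_from_kappa(kappa)
G_rebuilt = [lam * f + (1.0 - lam) * h for f, h in zip(F, H)]
```

Note that $1-\lambda^*$ is negative whenever $\kappa^* > 0$, so the second identity is an affine rather than a convex combination of $F$ and $H$.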

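Let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950; Berlinet & Thomas, 2004) with a positive semi-definite kernel $k$ on $\mathcal{X}\times\mathcal{X}$. Let $\phi:\mathcal{X}\to\mathcal{H}$ represent the kernel mapping $x \mapsto k(x,\cdot)$. For any distribution $P$ over $\mathcal{X}$, let $\phi(P) = \mathbb{E}_{X\sim P}\,\phi(X)$. It can be seen that for any distribution $P$ and $f\in\mathcal{H}$, $\mathbb{E}_{X\sim P} f(X) = \langle f, \phi(P)\rangle_{\mathcal{H}}$. Let $\Delta_{n+m}\subseteq\mathbb{R}^{n+m}$ be the $(n+m)$-dimensional probability simplex given by $\Delta_{n+m} = \{p\in[0,1]^{n+m} : \sum_i p_i = 1\}$. Let $\mathcal{C}, \mathcal{C}_S$ be defined as

$$\mathcal{C} = \{w\in\mathcal{H} : w = \phi(P) \text{ for some distribution } P\},$$

$$\mathcal{C}_S = \Big\{w\in\mathcal{H} : w = \sum_{i=1}^{n+m}\alpha_i\,\phi(x_i) \text{ for some } \alpha\in\Delta_{n+m}\Big\}.$$

Clearly, $\mathcal{C}_S\subseteq\mathcal{C}$, and both are convex sets.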

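Let $\hat F$ be the distribution over $\mathcal{X}$ that is uniform over $\{x_1,\ldots,x_n\}$. Let $\hat H$ be the distribution over $\mathcal{X}$ that is uniform over $\{x_{n+1},\ldots,x_{n+m}\}$. As $F$ is a mixture of $G$ and $H$, some subset $S_1\subseteq\{x_1,\ldots,x_n\}$ is drawn from $G$ and the rest from $H$. We let $\hat G$ denote the uniform distribution over $S_1$. On average, we expect the cardinality of $S_1$ to be $n/\lambda^*$. Note that we do not know $S_1$ and hence cannot compute $\phi(\hat G)$ directly; however, we have that $\phi(\hat G)\in\mathcal{C}_S$.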

## 3 RKHS Distance to Valid Distributions

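Define the “$\mathcal{C}$-distance” function $d:[0,\infty)\to[0,\infty)$ as follows:

$$d(\lambda) = \inf_{w\in\mathcal{C}} \big\|\lambda\phi(F) + (1-\lambda)\phi(H) - w\big\|_{\mathcal{H}}. \tag{1}$$

Intuitively, $d(\lambda)$ reconstructs $\phi(G)$ from $F$ and $H$ assuming $\lambda^* = \lambda$, and computes its distance to $\mathcal{C}$. Also, define the empirical version of the $\mathcal{C}$-distance function, $\hat d:[0,\infty)\to[0,\infty)$, which we call the $\mathcal{C}_S$-distance function, as

$$\hat d(\lambda) = \inf_{w\in\mathcal{C}_S} \big\|\lambda\phi(\hat F) + (1-\lambda)\phi(\hat H) - w\big\|_{\mathcal{H}}. \tag{2}$$

Note that the $\mathcal{C}_S$-distance function $\hat d(\lambda)$ can be computed efficiently by solving a quadratic program. For any $\lambda\in[0,\infty)$, let $u_\lambda\in\mathbb{R}^{n+m}$ be such that $u_\lambda^\top = \frac{\lambda}{n}\,[\mathbf{1}_n^\top, \mathbf{0}_m^\top] + \frac{1-\lambda}{m}\,[\mathbf{0}_n^\top, \mathbf{1}_m^\top]$, where $\mathbf{1}_n$ is the $n$-dimensional all ones vector, and $\mathbf{0}_n$ is the $n$-dimensional all zeros vector. Let $K\in\mathbb{R}^{(n+m)\times(n+m)}$ be the kernel matrix given by $K_{i,j} = k(x_i, x_j)$. We then have

$$\big(\hat d(\lambda)\big)^2 = \inf_{v\in\Delta_{n+m}} (u_\lambda - v)^\top K\,(u_\lambda - v).$$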
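To make the quadratic program concrete, here is a minimal pure-Python sketch that approximates $\hat d(\lambda)$ on a tiny one-dimensional sample. It is not the paper's implementation (which uses CVXOPT, see Section 6): the Frank-Wolfe solver, the Gaussian RBF parameterisation $\exp(-\|x-y\|^2/(2\sigma^2))$, the function names and the toy data are all assumptions made for illustration.

```python
import math


def rbf_kernel_matrix(points, width):
    """Gram matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * width^2)) (assumed form)."""
    def k(a, b):
        sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-sq / (2.0 * width ** 2))
    return [[k(a, b) for b in points] for a in points]


def c_s_distance(K, lam, n, m, iters=5000):
    """Approximate d_hat(lam) = min over the simplex of sqrt((u - v)^T K (u - v)).

    Uses Frank-Wolfe with exact line search; the returned value is an upper
    bound on the true minimum that tightens as `iters` grows.
    """
    N = n + m
    u = [lam / n] * n + [(1.0 - lam) / m] * m      # the vector u_lambda
    v = [1.0 / N] * N                              # start at the simplex centre
    for _ in range(iters):
        r = [vi - ui for vi, ui in zip(v, u)]      # residual v - u
        Kr = [sum(K[i][j] * r[j] for j in range(N)) for i in range(N)]  # half-gradient
        s = min(range(N), key=lambda i: Kr[i])     # best simplex vertex e_s
        direction = [-vi for vi in v]
        direction[s] += 1.0                        # direction = e_s - v
        Kd = [sum(K[i][j] * direction[j] for j in range(N)) for i in range(N)]
        dKd = sum(di * kdi for di, kdi in zip(direction, Kd))
        dKr = sum(di * kri for di, kri in zip(direction, Kr))
        if dKr >= -1e-14 or dKd <= 0.0:            # no descent direction left
            break
        step = min(1.0, -dKr / dKd)                # exact line search on [0, 1]
        v = [vi + step * di for vi, di in zip(v, direction)]
    r = [vi - ui for vi, ui in zip(v, u)]
    value = sum(r[i] * sum(K[i][j] * r[j] for j in range(N)) for i in range(N))
    return math.sqrt(max(value, 0.0))


# Toy data: the mixture sample has 4 points near 0 (from G) and 2 near 10 (from H),
# so the empirical lambda* is 6 / 4 = 1.5. The component sample is near 10.
mixture_sample = [(0.0,), (0.1,), (-0.1,), (0.2,), (10.0,), (10.1,)]
component_sample = [(9.9,), (10.2,), (10.05,), (9.95,)]
n, m = len(mixture_sample), len(component_sample)
K = rbf_kernel_matrix(mixture_sample + component_sample, width=1.0)

d_at_1 = c_s_distance(K, 1.0, n, m)   # u_lambda lies in the simplex, so this is near 0
d_at_3 = c_s_distance(K, 3.0, n, m)   # well past lambda*, so this is clearly positive
```

On this toy sample the curve $\lambda\mapsto\hat d(\lambda)$ stays near zero up to roughly $\lambda = 1.5$ and grows afterwards. A general-purpose QP solver would be the natural choice for realistic sample sizes.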

We now give some basic properties of the $\mathcal{C}$-distance function and the $\mathcal{C}_S$-distance function that will be of use later. All proofs not found in the paper can be found in the supplementary material.

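###### Proposition 1.

$$d(\lambda) = 0,\ \ \forall\lambda\in[0,\lambda^*], \qquad \hat d(\lambda) = 0,\ \ \forall\lambda\in[0,1].$$

###### Proposition 2.

$d(\cdot)$ and $\hat d(\cdot)$ are non-decreasing convex functions on $[0,\infty)$.

Below, we give a simple reformulation of the $\mathcal{C}$-distance function and basic lower and upper bounds that reveal its structure.

###### Proposition 3.

For all $\mu\ge 0$,

$$d(\lambda^*+\mu) = \inf_{w\in\mathcal{C}} \big\|\phi(G) + \mu\,(\phi(F)-\phi(H)) - w\big\|_{\mathcal{H}}.$$

###### Proposition 4.

For all $\lambda,\mu\ge 0$,

$$d(\lambda) \ge \lambda\,\|\phi(F)-\phi(H)\| - \sup_{w\in\mathcal{C}}\|\phi(H)-w\|, \tag{3}$$

$$d(\lambda^*+\mu) \le \mu\,\|\phi(F)-\phi(H)\|. \tag{4}$$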

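Using standard results of Smola et al. (2007), we can show that the kernel mean embeddings of the empirical versions of $F$, $H$ and $G$ are close to the embeddings of the distributions themselves.

###### Lemma 5.

Let the kernel $k$ be such that $k(x,x)\le 1$ for all $x\in\mathcal{X}$. Let $\delta\in(0,1/4]$. The following holds with probability $1-4\delta$ (over the sample $x_1,\ldots,x_{n+m}$) if $n > 2(\lambda^*)^2\log(1/\delta)$:

$$\|\phi(F)-\phi(\hat F)\|_{\mathcal{H}} \le \frac{3\sqrt{\log(1/\delta)}}{\sqrt{n}},$$

$$\|\phi(H)-\phi(\hat H)\|_{\mathcal{H}} \le \frac{3\sqrt{\log(1/\delta)}}{\sqrt{m}},$$

$$\|\phi(G)-\phi(\hat G)\|_{\mathcal{H}} \le \frac{3\sqrt{\log(1/\delta)}}{\sqrt{n/(2\lambda^*)}}.$$

We will call this high probability event $E_\delta$. All our results hold under this event.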

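Using Lemma 5 one can show that the $\mathcal{C}$-distance function and the $\mathcal{C}_S$-distance function are close to each other. Of particular use to us is an upper bound on the $\mathcal{C}_S$-distance function $\hat d(\lambda)$ for $\lambda\in[1,\lambda^*]$, and a general lower bound on $\hat d(\lambda) - d(\lambda)$.

###### Lemma 6.

Let $k(x,x)\le 1$ for all $x\in\mathcal{X}$. Assume $E_\delta$. For all $\lambda\in[1,\lambda^*]$ we have that

$$\hat d(\lambda) \le \Big(2 - \frac{1}{\lambda^*} + \frac{\sqrt{2}}{\sqrt{\lambda^*}}\Big)\,\lambda\cdot\frac{3\sqrt{\log(1/\delta)}}{\sqrt{\min(m,n)}}.$$

###### Lemma 7.

Let $k(x,x)\le 1$ for all $x\in\mathcal{X}$. Assume $E_\delta$. For all $\lambda\ge 1$, we have

$$\hat d(\lambda) \ge d(\lambda) - (2\lambda-1)\cdot\frac{3\sqrt{\log(1/\delta)}}{\sqrt{\min(m,n)}}.$$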
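To get a feel for the size of these sample-dependent terms, the bounds of Lemmas 5 to 7 can be evaluated directly. The sketch below only transcribes the displayed expressions; the function names are hypothetical and $\lambda^*$ must be supplied by hand, since it is unknown in practice.

```python
import math


def embedding_deviation(delta, sample_size):
    """The common factor 3 * sqrt(log(1/delta)) / sqrt(sample_size) from Lemma 5."""
    return 3.0 * math.sqrt(math.log(1.0 / delta)) / math.sqrt(sample_size)


def lemma6_upper_bound(lam, lam_star, delta, n, m):
    """Upper bound on d_hat(lam) for lam in [1, lam_star] under the event of Lemma 5."""
    factor = 2.0 - 1.0 / lam_star + math.sqrt(2.0) / math.sqrt(lam_star)
    return factor * lam * embedding_deviation(delta, min(m, n))


def lemma7_slack(lam, delta, n, m):
    """Amount by which d_hat(lam) may fall below d(lam) for lam >= 1."""
    return (2.0 * lam - 1.0) * embedding_deviation(delta, min(m, n))


# Example: delta = 1/e makes log(1/delta) = 1, and min(m, n) = 900 gives 3 / 30 = 0.1.
delta = math.exp(-1.0)
base = embedding_deviation(delta, 900)
upper = lemma6_upper_bound(2.0, 2.0, delta, 900, 1600)
slack = lemma7_slack(1.0, delta, 900, 1600)
```

The constants are loose, so these bounds only become informative at fairly large sample sizes; their role in the paper is to establish rates rather than to give tight numerical guarantees.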

## 4 Mixture Proportion Estimation under a Separability Condition

Blanchard et al. (2010); Scott (2015) observe that without any assumptions on $G$ and $H$, the mixture proportion $\kappa^*$ is not identifiable, and postulate an “irreducibility” assumption under which $\kappa^*$ becomes identifiable. The irreducibility assumption essentially states that $G$ cannot be expressed as a non-trivial mixture of $H$ and some other distribution. Scott (2015) proposes a stronger assumption than irreducibility under which convergence rates are provided for the estimator proposed by Blanchard et al. (2010) to the true mixture proportion $\kappa^*$. We call this condition the “anchor set” condition as it is similar to the “anchor words” condition of Arora et al. (2012) when the domain is finite.

###### Definition 8.

A family of subsets $\mathcal{S}\subseteq 2^{\mathcal{X}}$, and distributions $G, H$ are said to satisfy the anchor set condition with margin $\gamma>0$, if there exists a compact set $A\in\mathcal{S}$ such that $A\subseteq\mathrm{supp}(H)\setminus\mathrm{supp}(G)$ and $H(A)\ge\gamma$.

We propose another condition which is similar to the anchor set condition (and is defined for a class of functions on $\mathcal{X}$ rather than subsets of $\mathcal{X}$). Under this condition we show that the $\mathcal{C}$-distance function (and hence the $\mathcal{C}_S$-distance function) reveals the true mixing proportion $\lambda^*$.

###### Definition 9.

A class of functions $\mathcal{H}\subseteq\mathbb{R}^{\mathcal{X}}$, and distributions $G, H$ are said to satisfy the separability condition with margin $\alpha>0$ and tolerance $\beta$, if there exists $h\in\mathcal{H}$ with $\|h\|_\infty\le 1$ and

$$\mathbb{E}_{X\sim G}\,h(X) \le \inf_x h(x) + \beta \le \mathbb{E}_{X\sim H}\,h(X) - \alpha.$$

We say that a kernel $k$ and distributions $G, H$ satisfy the separability condition, if the unit norm ball in its RKHS and distributions $G, H$ satisfy the separability condition.

Given a family of subsets satisfying the anchor set condition with margin $\gamma$, it can be easily seen that the family of functions given by the indicator functions of the family of subsets satisfies the separability condition with margin $\gamma$ and tolerance $\beta = 0$. Hence this represents a natural extension of the anchor set condition to a function space setting.

Under separability one can show that $\lambda^*$ is the “departure point from zero” for the $\mathcal{C}$-distance function.

###### Theorem 10.

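Let the kernel $k$, and distributions $G, H$ satisfy the separability condition with margin $\alpha>0$ and tolerance $\beta$. Then, for all $\mu\ge 0$,

$$d(\lambda^*+\mu) \ge \frac{\alpha\mu}{\lambda^*} - \beta.$$

###### Proof.

(Sketch) For any inner product $\langle\cdot,\cdot\rangle$ and its norm $\|\cdot\|$ over the vector space $\mathcal{H}$, we have that $\|f\|\ge\langle f, g\rangle$ for all $g\in\mathcal{H}$ with $\|g\|\le 1$. The proof mainly follows by lower bounding the norm in the definition of $d(\cdot)$ with an inner product with the witness of the separability condition. ∎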
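The inner-product step of the sketch can be written out explicitly. The following worked derivation is a reconstruction under stated assumptions rather than the authors' full proof: it uses $\lambda^* = 1/(1-\kappa^*)$, so that $\phi(F)-\phi(H) = (\phi(G)-\phi(H))/\lambda^*$, takes $h$ to be a witness of the separability condition with $\|h\|_{\mathcal{H}}\le 1$, and considers an arbitrary $w=\phi(P)\in\mathcal{C}$.

```latex
\begin{aligned}
\big\|\phi(G) + \mu(\phi(F)-\phi(H)) - w\big\|_{\mathcal{H}}
  &\ge \big\langle \phi(G) + \mu(\phi(F)-\phi(H)) - w,\; -h \big\rangle_{\mathcal{H}} \\
  &= \mathbb{E}_{X\sim P}h(X) - \mathbb{E}_{X\sim G}h(X)
     + \frac{\mu}{\lambda^*}\Big(\mathbb{E}_{X\sim H}h(X) - \mathbb{E}_{X\sim G}h(X)\Big) \\
  &\ge \inf_x h(x) - \mathbb{E}_{X\sim G}h(X)
     + \frac{\mu}{\lambda^*}\Big(\mathbb{E}_{X\sim H}h(X) - \mathbb{E}_{X\sim G}h(X)\Big) \\
  &\ge -\beta + \frac{\alpha\mu}{\lambda^*}.
\end{aligned}
```

The last line uses both halves of the separability chain: $\mathbb{E}_{X\sim G}h(X)\le\inf_x h(x)+\beta$ and $\mathbb{E}_{X\sim G}h(X)\le\mathbb{E}_{X\sim H}h(X)-\alpha$. Taking the infimum over $w\in\mathcal{C}$ and applying Proposition 3 gives the stated bound.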

Further, one can link the separability condition and the anchor set condition via universal kernels (like the Gaussian RBF kernel) (Michelli et al., 2006), which are kernels whose RKHS is dense in the space of all continuous functions over a compact domain.

###### Theorem 11.

Let the kernel $k$ be universal. Let the distributions $G, H$ be such that they satisfy the anchor set condition with margin $\gamma>0$ for some family of subsets of $\mathcal{X}$. Then, for all $\theta>0$, there exists a $\beta>0$ such that the kernel $k$, and distributions $G, H$ satisfy the separability condition with margin $\beta\theta$ and tolerance $\beta$.

###### Proof.

(Sketch) As the distributions $G, H$ satisfy the anchor set condition, there must exist a continuous non-negative function that is zero on the support of $G$ and greater than one on the set $A$ that witnesses the anchor set condition. Due to universality of the kernel $k$, there must exist an element in its RKHS that arbitrarily approximates this function. The normalised version of this function forms a witness to the separability condition. ∎

The ultimate objective in mixture proportion estimation is to estimate $\kappa^*$ (or equivalently $\lambda^*$). If one has direct access to $d(\cdot)$ and the kernel $k$ and distributions $G, H$ satisfy the separability condition with tolerance $\beta = 0$, then we have by Proposition 1 and Theorem 10 that

$$\lambda^* = \inf\{\lambda : d(\lambda) > 0\}.$$

We do not have direct access to $d(\cdot)$, but we can calculate $\hat d(\cdot)$. From Proposition 1 and Lemma 6, we have that for all $\lambda\in[0,\lambda^*]$, $\hat d(\lambda)$ converges to $0$ as the sample size $\min(m,n)$ increases. From Lemma 7 we have that for all $\lambda\ge 1$, $\hat d(\lambda)\ge d(\lambda)-\epsilon$ for any $\epsilon>0$ if $\min(m,n)$ is large enough. Hence $\hat d(\cdot)$ is a good surrogate for $d(\cdot)$, and based on this observation we propose two strategies for estimating $\lambda^*$ and show that the errors of both these strategies can be made to approach $0$ under the separability condition.

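The first estimator is called the value thresholding estimator. For some $\tau\in[0,\infty)$ it is defined as

$$\hat\lambda^V_\tau = \inf\{\lambda : \hat d(\lambda)\ge\tau\}.$$

The second estimator is called the gradient thresholding estimator. For some $\nu\in[0,\infty)$ it is defined as

$$\hat\lambda^G_\nu = \inf\{\lambda : \exists\, g\in\partial\hat d(\lambda),\ g\ge\nu\},$$

where $\partial\hat d(\lambda)$ is the sub-differential of $\hat d(\cdot)$ at $\lambda$. As $\hat d(\cdot)$ is a convex function, the slope of $\hat d(\cdot)$ is a non-decreasing function, and thus thresholding the gradient is also a viable strategy for estimating $\lambda^*$.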
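Because $\hat d(\cdot)$ is non-decreasing, the value thresholding estimator can be located by bisection. The sketch below is an illustration only: it takes any non-decreasing callable in place of $\hat d$, and the search interval, tolerance and the idealised hinge-shaped test curve are assumptions, not values from the paper.

```python
def value_threshold_estimate(d_hat, tau, lam_lo=0.0, lam_hi=10.0, tol=1e-6):
    """Bisection for inf{lam : d_hat(lam) >= tau}, assuming d_hat is non-decreasing."""
    if d_hat(lam_hi) < tau:
        return lam_hi                      # threshold never reached on the interval
    while lam_hi - lam_lo > tol:
        mid = 0.5 * (lam_lo + lam_hi)
        if d_hat(mid) >= tau:
            lam_hi = mid                   # the infimum is at or below mid
        else:
            lam_lo = mid                   # the infimum is above mid
    return lam_hi


def hinge(lam, lam_star=1.75, slope=2.0):
    """Idealised distance curve: zero up to lam_star, then linear growth."""
    return max(0.0, slope * (lam - lam_star))


# With tau = 0.1 and slope 2, the hinge first reaches tau at 1.75 + 0.05 = 1.8.
estimate = value_threshold_estimate(hinge, tau=0.1)
```

The example also shows why the choice of $\tau$ matters: on a hinge-shaped curve the estimate overshoots $\lambda^*$ by $\tau$ divided by the slope after the kink.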

To illustrate some of the ideas above, we plot $\hat d(\cdot)$ and $\nabla\hat d(\cdot)$ for two different true mixing proportions and sample sizes in Figure 2. The data points from the component and mixture distribution used for computing the plot are taken from the waveform dataset.

## 5 Convergence of Value and Gradient Thresholding Estimators

We now show that both the value thresholding estimator $\hat\lambda^V_\tau$ and the gradient thresholding estimator $\hat\lambda^G_\nu$ converge to $\lambda^*$ under appropriate conditions.

###### Theorem 12.

Let . Let for all . Let the kernel , and distributions satisfy the separability condition with tolerance and margin . Let the number of samples be large enough such that . Let the threshold be such that . We then have with probability

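$$\lambda^* - \hat\lambda^V_\tau \le 0, \qquad \hat\lambda^V_\tau - \lambda^* \le \frac{\beta\lambda^*}{\alpha} + c\cdot\sqrt{\log(1/\delta)}\cdot(\min(m,n))^{-1/2},$$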

where .

###### Proof.

(Sketch) Under event , Lemma 6 gives an upper bound on for , which is denoted by the line in Figure 0(a). Under event and the separability condition, Lemma 7 and Theorem 10 give a lower bound on for and is denoted by the line in Figure 0(a). These two bounds immediately give upper and lower bounds on the value thresholding estimator for any . An illustration is provided in Figure 0(a) by the horizontal line through . The points of intersection of this line with the feasible values of as in Figure 0(a), given by and in the figure form lower and upper bounds respectively for .

###### Theorem 13.

Let for all . Let the kernel , and distributions satisfy the separability condition with tolerance and margin . Let and . We then have with probability

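$$\lambda^* - \hat\lambda^G_\nu \le c\cdot\sqrt{\log(1/\delta)}\cdot(\min(m,n))^{-1/2}, \qquad \hat\lambda^G_\nu - \lambda^* \le \frac{4\beta\lambda^*}{\alpha} + c'\cdot\sqrt{\log(1/\delta)}\cdot(\min(m,n))^{-1/2},$$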

for constants and .

###### Proof.

(Sketch) The upper and lower bounds on $\hat d(\lambda)$ given by Lemmas 6 and 7 and Theorem 10 also immediately translate into upper and lower bounds for $\nabla\hat d(\lambda)$ (assume differentiability of $\hat d(\cdot)$ for convenience) due to convexity of $\hat d(\cdot)$. As shown in Figure 0(a), the gradient of $\hat d(\cdot)$ at a point to the left of the feasible region is upper bounded by the slope of a line joining two of the marked points. Similarly, the gradient of $\hat d(\cdot)$ at a point to the right is lower bounded by the slope of a line joining two of the marked points. Along with trivial bounds on $\nabla\hat d(\lambda)$, these bounds give the set of feasible values for the ordered pair $(\lambda, \nabla\hat d(\lambda))$, as illustrated in Figure 0(b). This immediately gives bounds on $\hat\lambda^G_\nu$ for any $\nu$. An illustration is provided in Figure 0(b) by the horizontal line at height $\nu$. The points of intersection of this line with the feasible values of $(\lambda, \nabla\hat d(\lambda))$ in Figure 0(b) form lower and upper bounds respectively for $\hat\lambda^G_\nu$. ∎

Remark: Both the value and gradient thresholding estimates converge to $\lambda^*$ with rate $O\big((\min(m,n))^{-1/2}\big)$, if the kernel satisfies the separability condition with a tolerance $\beta = 0$. In the event of the kernel only satisfying the separability condition with tolerance $\beta>0$, the estimates converge to within an additive term proportional to $\beta\lambda^*/\alpha$. As shown in Theorem 11, with a universal kernel the ratio $\beta/\alpha$ can be made arbitrarily low, and hence both the estimates actually converge to $\lambda^*$, but a specific rate is not possible, due to the dependence of the constants on $\alpha$ and $\beta$, without further assumptions on $G$ and $H$.

## 6 The Gradient Thresholding Algorithm

As can be seen in Theorems 12 and 13, the value and gradient thresholding estimators both converge to $\lambda^*$ at a rate of $O\big((\min(m,n))^{-1/2}\big)$, in the scenario where we know the optimal threshold. In practice, one needs to set the threshold heuristically, and we observe that the estimate $\hat\lambda^V_\tau$ is much more sensitive to the threshold $\tau$ than the gradient thresholding estimate $\hat\lambda^G_\nu$ is to the threshold $\nu$. This agrees with our intuition of the asymptotic behavior of $\hat d(\lambda)$ and $\nabla\hat d(\lambda)$: the curve of $\hat d(\lambda)$ vs $\lambda$ is close to a hinge, whereas the curve of $\nabla\hat d(\lambda)$ vs $\lambda$ is close to a step function. This can also be seen in Figure 1(b). Hence, our estimator of choice is the gradient thresholding estimator and we give an algorithm for implementing it in this section.

Due to the convexity of $\hat d(\cdot)$, the slope $\nabla\hat d(\cdot)$ is a non-decreasing function, and thus the gradient thresholding estimator $\hat\lambda^G_\nu$ can be computed efficiently via binary search. The details of the computation are given in Algorithm 1.

Algorithm 1 maintains upper and lower bounds on the gradient thresholding estimator, estimates the slope at the current point, and adjusts the upper and lower bounds based on the computed slope. (We assume an initial upper bound of 10 for convenience, as we don’t gain much by searching over higher values; $\hat\lambda = 10$ corresponds to a mixture proportion estimate of $\hat\kappa = 0.9$.) The slope at the current point is estimated numerically by computing the value of $\hat d(\cdot)$ at points close to the current point (lines 9 to 15 of Algorithm 1). We compute the value of $\hat d(\lambda)$ for a given $\lambda$ using the general purpose convex programming solver CVXOPT. (The accuracy parameter $\epsilon$ must be set large enough so that the optimization error in computing $\hat d(\cdot)$ is small when compared to $\epsilon$.)
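Algorithm 1 itself is not reproduced in this text, so the following is only a minimal sketch of the binary search described above, not the authors' code. It assumes a callable `d_hat`, a search interval of $[1, 10]$, and a central finite difference at $\lambda\pm\epsilon/4$ for the slope; the function names and the idealised hinge-shaped test curve are hypothetical.

```python
def gradient_threshold_estimate(d_hat, nu, lam_lo=1.0, lam_hi=10.0, eps=1e-3):
    """Binary search for the smallest lam whose estimated slope is at least nu.

    Relies on d_hat being convex, so that its slope is non-decreasing in lam.
    """
    while lam_hi - lam_lo > eps:
        lam = 0.5 * (lam_lo + lam_hi)
        # Central finite-difference estimate of the slope at lam (assumed scheme).
        slope = (d_hat(lam + eps / 4.0) - d_hat(lam - eps / 4.0)) / (eps / 2.0)
        if slope >= nu:
            lam_hi = lam                   # slope already large: move the upper bound
        else:
            lam_lo = lam                   # slope still small: move the lower bound
    return 0.5 * (lam_lo + lam_hi)


def hinge(lam, lam_star=1.75, slope=2.0):
    """Idealised distance curve: zero up to lam_star, then linear growth."""
    return max(0.0, slope * (lam - lam_star))


lam_hat = gradient_threshold_estimate(hinge, nu=1.0)
kappa_hat = 1.0 - 1.0 / lam_hat            # convert back to a mixture proportion
```

On the hinge the estimated slope jumps from 0 to 2 at $\lambda^* = 1.75$, so any threshold strictly between the two slopes returns essentially the same estimate. This is the insensitivity to $\nu$ that motivates preferring gradient thresholding.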

We employ the following simple strategy for model selection (choosing the kernel $k$ and threshold $\nu$). Given a set of kernels, we choose the kernel for which the empirical RKHS distance between the distributions $F$ and $H$, given by $\|\phi(\hat F)-\phi(\hat H)\|_{\mathcal{H}}$, is maximized. This corresponds to choosing a kernel for which the “roof” of the step-like function $\nabla\hat d(\cdot)$ is highest. We follow two different strategies for setting the gradient threshold $\nu$. One strategy is motivated by Lemma 6, where we can see that the slope of $\hat d(\lambda)$ for $\lambda\in[1,\lambda^*]$ is $O\big(1/\sqrt{\min(m,n)}\big)$, and based on this we set $\nu = 1/\sqrt{\min(m,n)}$. The other strategy is based on empirical observation, and $\nu$ is set as a convex combination of the initial slope of $\hat d$ at $\lambda = 1$ and the final slope at $\lambda = \infty$, which is equal to the RKHS distance between the distributions $F$ and $H$, given by $\|\phi(\hat F)-\phi(\hat H)\|_{\mathcal{H}}$. We call the resulting two algorithms “KM1” and “KM2” respectively in our experiments. [Footnote 3: In KM2, …]
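The kernel selection rule can be sketched in a few lines. The code below assumes a Gaussian RBF kernel of the form $\exp(-\|x-y\|^2/(2\sigma^2))$ and a hand-picked list of candidate widths; the function names and toy samples are hypothetical, and the empirical distance is expanded as $\|\phi(\hat F)-\phi(\hat H)\|^2_{\mathcal{H}} = \overline{K}_{FF} - 2\overline{K}_{FH} + \overline{K}_{HH}$, where each bar denotes an average of kernel evaluations.

```python
import math


def rbf(a, b, width):
    """Gaussian RBF kernel exp(-||a - b||^2 / (2 * width^2)) (assumed form)."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2.0 * width ** 2))


def mean_embedding_distance(xs_f, xs_h, width):
    """Empirical RKHS distance between the mean embeddings of two samples."""
    def avg(A, B):
        return sum(rbf(a, b, width) for a in A for b in B) / (len(A) * len(B))
    sq = avg(xs_f, xs_f) - 2.0 * avg(xs_f, xs_h) + avg(xs_h, xs_h)
    return math.sqrt(max(sq, 0.0))


def select_width(xs_f, xs_h, widths):
    """Pick the candidate width that maximises the empirical embedding distance."""
    return max(widths, key=lambda w: mean_embedding_distance(xs_f, xs_h, w))


# Toy samples: two tight, well-separated clusters on the real line.
xs_f = [(0.0,), (0.1,)]
xs_h = [(5.0,), (5.1,)]
best_width = select_width(xs_f, xs_h, widths=[0.01, 1.0, 100.0])
```

A very small width treats every point as distinct and a very large one blurs the two clusters together, so an intermediate width gives the largest distance on this toy data.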

## 7 Other Methods for Mixture Proportion Estimation

Blanchard et al. (2010) propose an estimator based on the following equality, which holds under an irreducibility condition (a strictly weaker requirement than the anchor set condition):

$$\kappa^* = \inf_{S\in\Theta,\ H(S)>0} \frac{F(S)}{H(S)},$$

where $\Theta$ is the set of measurable sets in $\mathcal{X}$. The proposed estimator replaces the exact terms $F(S)$ and $H(S)$ in the above ratio with the empirical quantities $\hat F(S)$ and $\hat H(S)$, includes VC-inequality based correction terms in the numerator and denominator, and restricts $\Theta$ to a sequence of VC classes. Blanchard et al. (2010) show that the proposed estimator converges to the true proportion under the irreducibility condition and also show that the convergence can be arbitrarily slow. Note that the requirement of taking an infimum over VC classes makes a direct implementation of this estimator computationally infeasible.
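On a finite domain the ratio characterisation $\kappa^* = \inf_S F(S)/H(S)$ can be evaluated exactly, since the infimum over sets is attained at a single outcome. The sketch below works with known probability vectors rather than samples, so it illustrates the identity only and not the estimator of Blanchard et al. (2010); the function name and toy distributions are hypothetical.

```python
def ratio_infimum(F, H):
    """Minimum of F[x] / H[x] over outcomes x with H[x] > 0.

    For discrete distributions this equals the infimum of F(S) / H(S) over sets S.
    """
    ratios = [f / h for f, h in zip(F, H) if h > 0.0]
    if not ratios:
        raise ValueError("H must put positive mass on at least one outcome")
    return min(ratios)


def mix(G, H, kappa):
    """Return F = (1 - kappa) G + kappa H for probability vectors G and H."""
    return [(1.0 - kappa) * g + kappa * h for g, h in zip(G, H)]


H = [0.0, 0.5, 0.5]

# Anchor case: the third outcome lies in the support of H but not of G.
G_anchor = [0.5, 0.5, 0.0]
kappa_anchor = ratio_infimum(mix(G_anchor, H, 0.4), H)

# No anchor: G itself contains a copy of H, so the ratio overestimates kappa.
G_reducible = [0.5, 0.25, 0.25]
kappa_reducible = ratio_infimum(mix(G_reducible, H, 0.4), H)
```

In the second case $G = 0.5\,H + 0.5\,\delta_1$, so $F$ can equally be written as a mixture with weight $0.7$ on $H$. This is the non-identifiability that the irreducibility assumption rules out.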

Scott (2015) show that the estimator of Blanchard et al. (2010) converges to the true proportion at the rate of under the anchor set condition, and also make the observation that the infimum over the sequence of VC classes can be replaced by an infimum over just the collection of base sets (e.g. the set of all open balls). Computationally, this observation reduces the complexity of a direct implementation of the estimator to where is the number of data points, and is the data dimension. But the estimator still remains intractable for even datasets with moderately large number of features.

Sanderson & Scott (2014); Scott (2015) propose algorithms based on the estimator of Blanchard et al. (2010), which treat samples from $F$ and $H$ as the two classes of a binary classification problem, build a conditional probability estimator and compute the estimate of $\kappa^*$ from the constructed ROC (receiver operating characteristic) curve. These algorithms return the correct answer when the conditional probability function learned is exact, but the effect of error in this step is not clearly understood. This method is referred to as “ROC” in our experimental section.

Elkan & Noto (2008) propose another method for estimating $\kappa^*$ by constructing a conditional probability estimator which treats samples from $F$ and $H$ as the two classes of a binary classification problem. Even in the limit of infinite data, it is known that this estimator gives the right answer only if the supports of $G$ and $H$ are completely distinct. This method is referred to as “EN” in our experiments.

du Plessis & Sugiyama (2014) propose a method for estimating $\kappa^*$ based on Pearson divergence minimization. It can be seen as similar in spirit to the method of Elkan & Noto (2008), and thus has the same shortcoming of being exact only when the supports of $G$ and $H$ are disjoint, even in the limit of infinite data. The main difference between the two is that this method does not require the estimation of a conditional probability model as an intermediate object, and computes the mixture proportion directly.

Recently, Jain et al. (2016) have proposed another method for the estimation of the mixture proportion which is based on maximizing the “likelihood” of the mixture proportion. The algorithm suggested by them computes a likelihood associated with each possible value of $\kappa^*$, and returns the smallest value for which the likelihood drops significantly. In a sense, it is similar to our gradient thresholding algorithm, which also computes a distance associated with each possible value of $\lambda^*$, and returns the smallest value for which the distance increases faster than a threshold. Their algorithm also requires a conditional probability model distinguishing $F$ and $H$ to be learned. It also has no guarantees of convergence to the true proportion $\kappa^*$. This method is referred to as “alphamax” in our experiments.

Menon et al. (2015); Liu & Tao (2016) and Scott et al. (2013b) propose to estimate the mixture proportion $\kappa^*$ based on the observation that, if the distributions $G$ and $H$ satisfy the anchor set condition, then $\kappa^*$ can be directly related to the maximum value of the conditional probability, given by $\max_x\eta(x)$, where $\eta$ is the conditional probability function in the binary classification problem that treats samples from one of $F$, $H$ as positive and samples from the other as negative. Thus one can get an estimate of $\kappa^*$ from an estimate $\hat\eta$ of the conditional probability through $\max_x\hat\eta(x)$. This method clearly requires estimating a conditional probability model, and is also less robust to errors in estimating the conditional probability due to the form of the estimator.

## 8 Experiments

We ran our algorithm with 6 standard binary classification datasets taken from the UCI machine learning repository, the details of which are given below in Table 1. (The shuttle, pageblocks and digits datasets are originally multiclass; they are used as binary datasets by either grouping or ignoring classes. In our experiments, we project the data points from the digits and mushroom datasets onto a 50-dimensional space given by PCA.)

From each binary dataset containing positive and negative labelled data points, we derived 6 different pairs of mixture and component distributions ($F$ and $H$ respectively) as follows. We chose a fraction of the positive data points to be part of the component distribution; the positive data points not chosen and the negative data points constitute the mixture distribution. The fraction of positive data points chosen to belong to the component distribution took one of three values, giving 3 different pairs of distributions. The positive and negative labels were flipped and the above procedure was repeated to get 3 more pairs of distributions. From each such distribution we drew a total of either 400, 800, 1600 or 3200 samples and ran the two variants of our kernel mean based gradient thresholding algorithm given by “KM1” and “KM2”. Our candidate kernels were five Gaussian RBF kernels, with the kernel width taking values uniformly in the log space between a tenth of the median pairwise distance and ten times the median distance, and among these kernels the kernel for which $\|\phi(\hat F)-\phi(\hat H)\|_{\mathcal{H}}$ is highest is chosen. We also ran the “alphamax”, “EN” and “ROC” algorithms for comparison. (The code for our algorithms KM1 and KM2 is at http://web.eecs.umich.edu/~cscott/code.html#kmpe. The code for ROC was taken from http://web.eecs.umich.edu/~cscott/code/mpe.zip. The codes for the alphamax and EN algorithms were the same as in Jain et al. (2016), and were acquired through personal communication.) The above was repeated 5 times with different random seeds, and the average error was computed. The results are plotted in Figure 3 and the actual error values used in the plots are given in the supplementary material Section H. Note that points in all plots are an average of 30 error terms arising from the 6 distributions for each dataset, and 5 different sets of samples for each distribution arising due to different random seeds.

[Figure panel (a): $n+m=3200$, $\kappa^*=0.43$, $\lambda^*=1.75$]

Figure 3: The average error made by the KM, alphamax, ROC and EN algorithms in predicting the mixture proportion $\kappa^*$ for various datasets as a function of the total number of samples from the mixture and component.

It can be seen from the plots in Figure 3, that our algorithms (KM1 and KM2) perform comparably to or better than other algorithms for all datasets except mushroom.

## 9 Conclusion

Mixture proportion estimation is an interesting and important problem that arises naturally in many ‘weakly supervised learning’ settings. In this paper, we give an efficient kernel mean embedding based method for this problem, and show convergence of the algorithm to the true mixture proportion under certain conditions. We also demonstrate the effectiveness of our algorithm in practice by running it on several benchmark datasets.

## Acknowledgements

This work was supported in part by NSF Grants No. 1422157, 1217880, and 1047871.

## References

• Aronszajn (1950) Aronszajn, N. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
• Arora et al. (2012) Arora, S., Ge, R., and Moitra, A. Learning topic models – going beyond SVD. In Proceedings of IEEE Foundations of Computer Science (FOCS), pp. 1–10, 2012.
• Berlinet & Thomas (2004) Berlinet, A. and Thomas, C. Reproducing kernel Hilbert spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
• Blanchard et al. (2010) Blanchard, G., Lee, G., and Scott, C. Semi-supervised novelty detection. Journal of Machine Learning Research, 11:2973–3009, 2010.
• Bouveyron & Girard (2009) Bouveyron, C. and Girard, S. Robust supervised classification with mixture models: Learning from data with uncertain labels. Journal of Pattern Recognition, 42:2649–2658, 2009.
• Denis et al. (2005) Denis, F., Gilleron, R., and Letouzey, F. Learning from positive and unlabeled examples. Theoretical Computer Science, 348(1):70–83, 2005.
• du Plessis & Sugiyama (2014) du Plessis, M. C. and Sugiyama, M. Class prior estimation from positive and unlabeled data. IEICE Transactions on Information and Systems, 97:1358–1362, 2014.
• Elkan & Noto (2008) Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD08), pp. 213–220, 2008.
• Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Scholkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
• Jain et al. (2016) Jain, S., White, M., Trosset, M. W., and Radivojac, P. Nonparametric semi-supervised learning of class proportions. arXiv:1601.01944, 2016.
• Lawrence & Scholkopf (2001) Lawrence, N. and Scholkopf, B. Estimating a kernel Fisher discriminant in the presence of label noise. In Proc. of the Int. Conf. in Machine Learning (ICML), 2001.
• Liu et al. (2002) Liu, B., Lee, W. S., Yu, P. S., and Li, X. Partially supervised classification of text documents. In Proc. of the Int. Conf. on Machine Learning (ICML), pp. 387–394, 2002.
• Liu & Tao (2016) Liu, T. and Tao, D. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2016.
• Long & Servido (2010) Long, P. and Servido, R. Random classification noise defeats all convex potential boosters. Machine Learning, 78:287–304, 2010.
• Menon et al. (2015) Menon, A. K., van Rooyen, B., Ong, C. S., and Williamson, R. C. Learning from corrupted binary labels via class-probability estimation. In Proc. of the Int. Conf. in Machine Learning (ICML), pp. 125–134, 2015.
• Michelli et al. (2006) Michelli, C., Xu, Y., and Zhang, H. Universal kernels. Journal of Machine Learning Research, 7:2651–2667, 2006.
• Natarajan et al. (2013) Natarajan, N., Dhillon, I. S., Ravikumar, P., and Tewari, A. Learning with noisy labels. In Advances in Neural Information Processing Systems (NIPS) 26, pp. 1196–1204, 2013.
• Raykar et al. (2010) Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. Learning from crowds. The Journal of Machine Learning Research, 11:1297–1322, 2010.
• Sanderson & Scott (2014) Sanderson, T. and Scott, C. Class proportion estimation with application to multiclass anomaly rejection. In Proc. of the 17th Int. Conf. on Artificial Intelligence and Statistics (AISTATS), 2014.
• Scott (2015) Scott, C. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In Proc. of the Int. Conf. on Artificial Intelligence and Statistics (AISTATS), 2015.
• Scott et al. (2013a) Scott, C., Blanchard, G., and Handy, G. Classification with asymmetric label noise: Consistency and maximal denoising. In Proc. Conf. on Learning Theory, JMLR W&CP, volume 30, pp. 489–511. 2013a.
• Scott et al. (2013b) Scott, C., Blanchard, G., Handy, G., Pozzi, S., and Flaska, M. Classification with asymmetric label noise: Consistency and maximal denoising. Technical Report arXiv:1303.1208, 2013b.
• Smola et al. (2007) Smola, A., Gretton, A., Song, L., and Scholkopf, B. A Hilbert space embedding for distributions. In Algorithmic Learning Theory (ALT), 2007.
• Stempfel & Ralaivola (2009) Stempfel, G. and Ralaivola, L. Learning SVMs from sloppily labeled data. In Proc. 19th Int. Conf. on Artificial Neural Networks: Part I, pp. 884–893, 2009.
• Ward et al. (2009) Ward, G., Hastie, T., Barry, S., Elith, J., and Leathwick, J. R. Presence-only data and the EM algorithm. Biometrics, 65:554–564, 2009.

## Appendix A Proof of Propositions 1, 2, 3 and 4

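###### Proposition.

$$d(\lambda) = 0,\ \ \forall\lambda\in[0,\lambda^*], \qquad \hat d(\lambda) = 0,\ \ \forall\lambda\in[0,1].$$

###### Proof.

The second equality follows from the convexity of $\mathcal{C}_S$ and the fact that both $\phi(\hat F)$ and $\phi(\hat H)$ are in $\mathcal{C}_S$.

The first statement is due to the following. Let $\lambda\in[0,\lambda^*]$, then we have that

$$\begin{aligned}
d(\lambda) &= \inf_{w\in\mathcal{C}} \big\|\lambda\phi(F) + (1-\lambda)\phi(H) - w\big\|_{\mathcal{H}} \\
&= \inf_{w\in\mathcal{C}} \Big\|\frac{\lambda}{\lambda^*}\big(\lambda^*\phi(F) + (1-\lambda^*)\phi(H)\big) + \Big(1-\frac{\lambda}{\lambda^*}\Big)\phi(H) - w\Big\|_{\mathcal{H}} \\
&= \inf_{w\in\mathcal{C}} \Big\|\frac{\lambda}{\lambda^*}\,\phi(G) + \Big(1-\frac{\lambda}{\lambda^*}\Big)\phi(H) - w\Big\|_{\mathcal{H}} \\
&= 0,
\end{aligned}$$

where the last equality holds because $\frac{\lambda}{\lambda^*}\phi(G) + \big(1-\frac{\lambda}{\lambda^*}\big)\phi(H)$ is the embedding of a mixture of $G$ and $H$ and hence lies in $\mathcal{C}$. ∎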

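###### Proposition.

$d(\cdot)$ and $\hat d(\cdot)$ are non-decreasing convex functions.

###### Proof.

Let $\lambda_1,\lambda_2\in[0,\infty)$. Let $\epsilon>0$. Let $w_1, w_2\in\mathcal{C}$ be such that

$$d(\lambda_1) \ge \|\lambda_1\phi(F) + (1-\lambda_1)\phi(H) - w_1\|_{\mathcal{H}} - \epsilon, \qquad d(\lambda_2) \ge \|\lambda_2\phi(F) + (1-\lambda_2)\phi(H) - w_2\|_{\mathcal{H}} - \epsilon.$$

By definition of $d(\cdot)$ such $w_1, w_2$ exist for all $\epsilon>0$.

Let $\gamma\in[0,1]$, $\lambda_\gamma = (1-\gamma)\lambda_1 + \gamma\lambda_2$ and $w_\gamma = (1-\gamma)w_1 + \gamma w_2$. Note that $w_\gamma\in\mathcal{C}$ as $\mathcal{C}$ is convex. We then have that

$$\begin{aligned}
d(\lambda_\gamma) &\le \|\lambda_\gamma\phi(F) + (1-\lambda_\gamma)\phi(H) - w_\gamma\|_{\mathcal{H}} \\
&= \|((1-\gamma)\lambda_1+\gamma\lambda_2)\phi(F) + (1-(1-\gamma)\lambda_1-\gamma\lambda_2)\phi(H) - w_\gamma\|_{\mathcal{H}} \\
&= \|((1-\gamma)\lambda_1+\gamma\lambda_2)\phi(F) + ((1-\gamma)(1-\lambda_1)+\gamma(1-\lambda_2))\phi(H) - w_\gamma\|_{\mathcal{H}} \\
&= \|(1-\gamma)(\lambda_1\phi(F)+(1-\lambda_1)\phi(H)-w_1) + \gamma(\lambda_2\phi(F)+(1-\lambda_2)\phi(H)-w_2)\|_{\mathcal{H}} \\
&\le (1-\gamma)\|\lambda_1\phi(F)+(1-\lambda_1)\phi(H)-w_1\|_{\mathcal{H}} + \gamma\|\lambda_2\phi(F)+(1-\lambda_2)\phi(H)-w_2\|_{\mathcal{H}} \\
&\le (1-\gamma)(d(\lambda_1)+\epsilon) + \gamma(d(\lambda_2)+\epsilon) \\
&= (1-\gamma)d(\lambda_1) + \gamma d(\lambda_2) + \epsilon.
\end{aligned}$$

As the above holds for all $\epsilon>0$ and $d(\lambda_\gamma)$ is independent of $\epsilon$, we have

$$d(\lambda_\gamma) = d((1-\gamma)\lambda_1+\gamma\lambda_2) \le (1-\gamma)d(\lambda_1) + \gamma d(\lambda_2).$$

Thus we have that $d(\cdot)$ is convex.

As $d(\cdot)$ is convex and $d(0) = 0 \le d(\lambda)$ for all $\lambda$, we have for any $\lambda>0$ and any $g\in\partial d(\lambda)$ that $0 = d(0) \ge d(\lambda) - g\lambda$, and hence $g\ge d(\lambda)/\lambda\ge 0$. By convexity, we then have that for all $\lambda>0$, all elements of the sub-differential $\partial d(\lambda)$ are non-negative and hence $d(\cdot)$ is a non-decreasing function.

By very similar arguments, we can also show that $\hat d(\cdot)$ is convex and non-decreasing. ∎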

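###### Proposition.

For all $\mu\ge 0$,

$$d(\lambda^*+\mu) = \inf_{w\in\mathcal{C}} \big\|\phi(G) + \mu\,(\phi(F)-\phi(H)) - w\big\|_{\mathcal{H}}.$$

###### Proof.

$$\begin{aligned}
d(\lambda^*+\mu) &= \inf_{w\in\mathcal{C}} \|(\lambda^*+\mu)\phi(F) + (1-\lambda^*-\mu)\phi(H) - w\|_{\mathcal{H}} \\
&= \inf_{w\in\mathcal{C}} \|\lambda^*\phi(F) + (1-\lambda^*)\phi(H) + \mu(\phi(F)-\phi(H)) - w\|_{\mathcal{H}} \\
&= \inf_{w\in\mathcal{C}} \|\phi(\lambda^*F + (1-\lambda^*)H) + \mu(\phi(F)-\phi(H)) - w\|_{\mathcal{H}},
\end{aligned}$$

and $\lambda^*F + (1-\lambda^*)H = G$. ∎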

###### Proposition.

For all $\lambda,\mu\ge 0$,

$$d(\lambda) \ge \lambda\,\|\phi(F)-\phi(H)\| - \sup_{w\in\mathcal{C}}\|\phi(H)-w\|, \tag{5}$$

$$d(\lambda^*+\mu) \le \mu\,\|\phi(F)-\phi(H)\|. \tag{6}$$

###### Proof.

The proof of the first inequality above follows from applying the triangle inequality to $d(\lambda)$ from Equation (1).

The proof of the second inequality above follows from Proposition 3 by setting $w = \phi(G)$. ∎
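Both steps are short enough to spell out. The following is a worked version of the two arguments, written under the assumption that the norm is the RKHS norm throughout and that $\phi(G)\in\mathcal{C}$ because $G$ is a distribution.

```latex
\begin{aligned}
\text{(5):}\quad
\|\lambda\phi(F) + (1-\lambda)\phi(H) - w\|
  &= \|\lambda(\phi(F)-\phi(H)) + (\phi(H) - w)\| \\
  &\ge \lambda\,\|\phi(F)-\phi(H)\| - \|\phi(H)-w\| \\
  &\ge \lambda\,\|\phi(F)-\phi(H)\| - \sup_{w'\in\mathcal{C}}\|\phi(H)-w'\|
  \qquad \text{for every } w\in\mathcal{C}, \\[4pt]
\text{(6):}\quad
d(\lambda^*+\mu)
  &\le \|\phi(G) + \mu(\phi(F)-\phi(H)) - \phi(G)\|
   = \mu\,\|\phi(F)-\phi(H)\|.
\end{aligned}
```

Taking the infimum over $w\in\mathcal{C}$ on the left-hand side of the first chain gives (5).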

## Appendix B Proof of Lemma 5

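###### Lemma.

Let the kernel $k$ be such that $k(x,x)\le 1$ for all $x\in\mathcal{X}$. Let $\delta\in(0,1/4]$. The following holds with probability $1-4\delta$ (over the sample $x_1,\ldots,x_{n+m}$) if $n > 2(\lambda^*)^2\log(1/\delta)$:

$$\|\phi(F)-\phi(\hat F)\|_{\mathcal{H}} \le \frac{3\sqrt{\log(1/\delta)}}{\sqrt{n}},$$

$$\|\phi(H)-\phi(\hat H)\|_{\mathcal{H}} \le \frac{3\sqrt{\log(1/\delta)}}{\sqrt{m}},$$

$$\|\phi(G)-\phi(\hat G)\|_{\mathcal{H}} \le \frac{3\sqrt{\log(1/\delta)}}{\sqrt{n/(2\lambda^*)}}.$$

The proof of the first two statements is a direct application of Theorem 2 of Smola et al. (2007), along with bounds on the Rademacher complexity. The proof of the third statement also uses Hoeffding’s inequality to show that out of the $n$ samples drawn from $F$, at least $n/(2\lambda^*)$ samples are drawn from $G$.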

###### Lemma 14.

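Let the kernel $k$ be such that $k(x,x)\le 1$ for all $x\in\mathcal{X}$. Then we have the following:

1. For all $h\in\mathcal{H}$ such that $\|h\|_{\mathcal{H}}\le 1$ we have that $\sup_{x\in\mathcal{X}}|h(x)|\le 1$.

2. For all distributions $P$ over $\mathcal{X}$, the Rademacher complexity of $\mathcal{H}$ is bounded above as follows:

$$R_n(\mathcal{H},P) = \frac{1}{n}\,\mathbb{E}_{x_1,\ldots,x_n\sim P}\,\mathbb{E}_{\sigma_1,\ldots,\sigma_n}\Bigg[\sup_{h:\|h\|_{\mathcal{H}}\le 1}\Bigg|\sum_{i=1}^n\sigma_i h(x_i)\Bigg|\Bigg] \le \frac{1}{\sqrt{n}}.$$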
###### Proof.

The first item simply follows from Cauchy-Schwarz and the reproducing property of

 |h(x)|=|⟨h