Positive-unlabeled (PU) learning, where a binary classifier is trained from positive (P) and unlabeled (U) data, has drawn considerable attention recently [Denis, 1998, Letouzey et al., 2000, Elkan and Noto, 2008, Ward et al., 2009, Scott and Blanchard, 2009, Blanchard et al., 2010, du Plessis et al., 2014, 2015a]. It appeals to both academia and industry: for example, the click-through data automatically collected by search engines are highly PU due to position biases [Dupret and Piwowarski, 2008, Craswell et al., 2008, Chapelle and Zhang, 2009]. Although PU learning uses no negative (N) data, in practice it sometimes even outperforms PN learning (i.e., ordinary supervised learning, perhaps with class-prior change [Quiñonero-Candela et al., 2009]). Nevertheless, this phenomenon has been analyzed neither theoretically nor experimentally, and when PU learning is likely to outperform PN learning remains an open problem. We address this question in this paper.
For PU learning, there are two problem settings, based on one sample (OS) and two samples (TS) of data respectively. More specifically, let $X \in \mathbb{R}^d$ and $Y \in \{\pm 1\}$ ($d \in \mathbb{N}$) be the input and output random variables, equipped with an underlying joint density $p(x, y)$. In OS [Elkan and Noto, 2008], a set of U data is sampled from the marginal density $p(x)$. Then if a data point $x$ is P, this P label is observed with probability $c$, and $x$ remains U with probability $1 - c$; if $x$ is N, this N label is never observed, and $x$ remains U with probability $1$. In TS [Ward et al., 2009], a set of P data is drawn from the positive marginal density $p(x \mid Y = +1)$ and a set of U data is drawn from $p(x)$. Denote by $n_+$ and $n_u$ the sizes of P and U data. As two random variables, they are fully independent in TS, and they satisfy $n_+/(n_+ + n_u) \approx c\pi$ in OS, where $\pi = p(Y = +1)$ is the class-prior probability. Therefore, TS is slightly more general than OS, and we will focus on TS problem settings.
Similarly, consider TS problem settings of PN and NU learning, where a set of N data (of size $n_-$) is sampled from $p(x \mid Y = -1)$ independently of the P/U data. For PN learning, if we enforce that $n_+/(n_+ + n_-) \approx \pi$ when sampling the data, it will be ordinary supervised learning; otherwise, it is supervised learning with class-prior change, a.k.a. prior probability shift [Quiñonero-Candela et al., 2009].
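To make the difference between the two sampling schemes concrete, here is a minimal simulation sketch. The Gaussian class-conditional densities, the function names, and the numeric values of the class prior and labeling probability are all hypothetical choices for illustration; they are not taken from the paper.

```python
import numpy as np

def sample_two_sample(rng, draw_pos, draw_neg, prior, n_pos, n_unl):
    """TS setting: P data from p(x|Y=+1), U data from p(x); both sizes are chosen freely."""
    x_pos = draw_pos(rng, n_pos)
    # U data come from the mixture p(x) = prior * p_+(x) + (1 - prior) * p_-(x).
    k = int((rng.random(n_unl) < prior).sum())
    x_unl = np.concatenate([draw_pos(rng, k), draw_neg(rng, n_unl - k)])
    return x_pos, x_unl

def sample_one_sample(rng, draw_pos, draw_neg, prior, label_prob, n_total):
    """OS setting: draw once from p(x), then reveal each positive label with prob. label_prob."""
    k = int((rng.random(n_total) < prior).sum())
    pos_points = draw_pos(rng, k)
    neg_points = draw_neg(rng, n_total - k)
    observed = rng.random(k) < label_prob
    x_pos = pos_points[observed]  # labeled positives
    # Unobserved positives and all negatives stay unlabeled.
    x_unl = np.concatenate([pos_points[~observed], neg_points])
    return x_pos, x_unl

# Hypothetical class-conditional densities (illustration only).
draw_pos = lambda rng, n: rng.normal(+1.0, 1.0, size=(n, 2))
draw_neg = lambda rng, n: rng.normal(-1.0, 1.0, size=(n, 2))

rng = np.random.default_rng(0)
prior, label_prob = 0.4, 0.5
os_pos, os_unl = sample_one_sample(rng, draw_pos, draw_neg, prior, label_prob, 100_000)
os_ratio = len(os_pos) / (len(os_pos) + len(os_unl))  # close to label_prob * prior
ts_pos, ts_unl = sample_two_sample(rng, draw_pos, draw_neg, prior, 300, 5_000)
```

In the OS sketch the two set sizes are tied together through the labeling probability, whereas in the TS sketch they are free inputs, which is the sense in which TS is the more general setting.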
In [du Plessis et al., 2014], a cost-sensitive formulation for PU learning was proposed, and its risk estimator was proven unbiased if the surrogate loss is non-convex and satisfies a symmetric condition. Therefore, we can naturally compare empirical risk minimizers in PU and NU learning against that in PN learning.
We establish risk bounds of three risk minimizers in PN, PU and NU learning for comparisons in a flavor of statistical learning theory [Vapnik, 1998, Bousquet et al., 2004]. For each minimizer, we firstly derive a uniform deviation bound from the risk estimator to the risk using Rademacher complexities (see, e.g., [Koltchinskii, 2001, Bartlett and Mendelson, 2002, Meir and Zhang, 2003, Mohri et al., 2012]), and secondly obtain an estimation error bound. Thirdly, if the surrogate loss is classification-calibrated [Bartlett et al., 2006], an excess risk bound is an immediate corollary. In [du Plessis et al., 2014], there was a generalization error bound similar to our uniform deviation bound for PU learning. However, it is based on a tricky decomposition of the risk, where surrogate losses for risk minimization and risk analysis are different and labels of U data are needed for risk evaluation, so that no further bound is implied. On the other hand, ours utilizes the same surrogate loss for risk minimization and analysis and requires no label of U data for risk evaluation, so that an estimation error bound is possible.
Our main results can be summarized as follows. Denote by $\hat{g}_{pn}$, $\hat{g}_{pu}$ and $\hat{g}_{nu}$ the risk minimizers in PN, PU and NU learning. Under a mild assumption on the function class and data distributions,
Finite-sample case: The estimation error bound of $\hat{g}_{pu}$ is tighter than that of $\hat{g}_{pn}$ whenever $\pi/\sqrt{n_+} + 1/\sqrt{n_u} < (1-\pi)/\sqrt{n_-}$, and the bound of $\hat{g}_{nu}$ is tighter than that of $\hat{g}_{pn}$ if $(1-\pi)/\sqrt{n_-} + 1/\sqrt{n_u} < \pi/\sqrt{n_+}$.
Asymptotic case: Either the limit of bounds of $\hat{g}_{pu}$ or that of $\hat{g}_{nu}$ (depending on $\pi$, $n_+$ and $n_-$) will improve on that of $\hat{g}_{pn}$, if $n_+, n_- \to \infty$ in the same order and $n_u \to \infty$ faster in order than $n_+$ and $n_-$.
Notice that both results rely only on the constant $\pi$ and the variables $n_+$, $n_-$ and $n_u$; they are simple and independent of the specific forms of the function class and/or the data distributions. The asymptotic case follows from the finite-sample case, which is based on theoretical comparisons of the aforementioned upper bounds on the estimation errors of $\hat{g}_{pn}$, $\hat{g}_{pu}$ and $\hat{g}_{nu}$. To the best of our knowledge, this is the first work that compares PU learning against PN learning.
2 Unbiased estimators to the risk
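For convenience, denote by $p_+(x) = p(x \mid Y = +1)$ and $p_-(x) = p(x \mid Y = -1)$ partial marginal densities. Recall that instead of data sampled from $p(x, y)$, we consider three sets of data $\mathcal{X}_+$, $\mathcal{X}_-$ and $\mathcal{X}_u$ which are drawn from three marginal densities $p_+(x)$, $p_-(x)$ and $p(x)$ independently.
Let $g: \mathbb{R}^d \to \mathbb{R}$ be a real-valued decision function for binary classification and $\ell: \mathbb{R} \times \{\pm 1\} \to \mathbb{R}$ be a Lipschitz-continuous loss function. Denote by
$$R_+(g) = \mathbb{E}_+[\ell(g(X), +1)], \qquad R_-(g) = \mathbb{E}_-[\ell(g(X), -1)]$$
partial risks, where $\mathbb{E}_\pm[\cdot] = \mathbb{E}_{X \sim p_\pm}[\cdot]$. Then the risk of $g$ w.r.t. $\ell$ under $p(x, y)$ is given by
$$R(g) = \mathbb{E}_{(X,Y)}[\ell(g(X), Y)] = \pi R_+(g) + (1-\pi) R_-(g). \tag{1}$$
In PN learning, by approximating $R(g)$ based on Eq. (1), we can get an empirical risk estimator as
$$\hat{R}_{pn}(g) = \frac{\pi}{n_+} \sum_{x_i \in \mathcal{X}_+} \ell(g(x_i), +1) + \frac{1-\pi}{n_-} \sum_{x_j \in \mathcal{X}_-} \ell(g(x_j), -1).$$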
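In PU learning, $\mathcal{X}_-$ is not available and then $R_-(g)$ cannot be directly estimated. However, [du Plessis et al., 2014] has shown that we can estimate $R(g)$ without any bias if $\ell$ satisfies the following symmetric condition:
$$\ell(t, +1) + \ell(t, -1) = 1. \tag{2}$$
Specifically, let $R_{u,-}(g) = \mathbb{E}_X[\ell(g(X), -1)] = \pi \mathbb{E}_+[\ell(g(X), -1)] + (1-\pi) R_-(g)$ be a risk that U data are regarded as N data. Given Eq. (2), we have $\mathbb{E}_+[\ell(g(X), -1)] = 1 - R_+(g)$, and hence
$$R(g) = 2\pi R_+(g) + R_{u,-}(g) - \pi. \tag{3}$$
By approximating $R(g)$ based on (3) using $\mathcal{X}_+$ and $\mathcal{X}_u$, we can obtain
$$\hat{R}_{pu}(g) = -\pi + \frac{2\pi}{n_+} \sum_{x_i \in \mathcal{X}_+} \ell(g(x_i), +1) + \frac{1}{n_u} \sum_{x_j \in \mathcal{X}_u} \ell(g(x_j), -1).$$
Although $\hat{R}_{pu}(g)$ regards $\mathcal{X}_u$ as N data and aims at separating $\mathcal{X}_+$ and $\mathcal{X}_u$ if being minimized, it is an unbiased and consistent estimator to $R(g)$ with a convergence rate $O_p(1/\sqrt{n_+} + 1/\sqrt{n_u})$ [Chung, 1968].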
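Similarly, in NU learning $R_+(g)$ cannot be directly estimated. Let $R_{u,+}(g) = \mathbb{E}_X[\ell(g(X), +1)] = \pi R_+(g) + (1-\pi) \mathbb{E}_-[\ell(g(X), +1)]$. Given Eq. (2), $\mathbb{E}_-[\ell(g(X), +1)] = 1 - R_-(g)$, and
$$R(g) = R_{u,+}(g) + 2(1-\pi) R_-(g) - (1-\pi). \tag{4}$$
By approximating $R(g)$ based on (4) using $\mathcal{X}_u$ and $\mathcal{X}_-$, we can obtain
$$\hat{R}_{nu}(g) = -(1-\pi) + \frac{1}{n_u} \sum_{x_i \in \mathcal{X}_u} \ell(g(x_i), +1) + \frac{2(1-\pi)}{n_-} \sum_{x_j \in \mathcal{X}_-} \ell(g(x_j), -1).$$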
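The three estimators of this section can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the inputs are arrays of decision values $g(x)$, and the loss is assumed to be a clipped linear loss that satisfies the symmetric condition (2). The demo builds an unlabeled set that is exactly the union of the P and N sets with a matching class prior, in which case the PU and NU estimators coincide with the PN estimator.

```python
import numpy as np

def symmetric_loss(t, y):
    """A loss with loss(t, +1) + loss(t, -1) = 1 (assumed here for illustration)."""
    return np.clip((1.0 - t * y) / 2.0, 0.0, 1.0)

def risk_pn(g_pos, g_neg, prior, loss=symmetric_loss):
    """PN estimator: prior-weighted average losses on P and N data."""
    return prior * loss(g_pos, +1).mean() + (1 - prior) * loss(g_neg, -1).mean()

def risk_pu(g_pos, g_unl, prior, loss=symmetric_loss):
    """PU estimator: U data are treated as N data, then corrected via condition (2)."""
    return -prior + 2 * prior * loss(g_pos, +1).mean() + loss(g_unl, -1).mean()

def risk_nu(g_neg, g_unl, prior, loss=symmetric_loss):
    """NU estimator: U data are treated as P data, then corrected via condition (2)."""
    return -(1 - prior) + loss(g_unl, +1).mean() + 2 * (1 - prior) * loss(g_neg, -1).mean()

rng = np.random.default_rng(0)
g_pos = rng.normal(+0.5, 1.0, size=30)   # decision values on 30 P points
g_neg = rng.normal(-0.5, 1.0, size=70)   # decision values on 70 N points
g_unl = np.concatenate([g_pos, g_neg])   # U set with exactly 30% positives
prior = 0.3

r_pn = risk_pn(g_pos, g_neg, prior)
r_pu = risk_pu(g_pos, g_unl, prior)
r_nu = risk_nu(g_neg, g_unl, prior)
```

With independently drawn sets the three values differ by sampling noise, but each one estimates the same risk whenever the loss satisfies condition (2).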
On the loss function
In order to train $g$ by minimizing these estimators, it remains to specify the loss $\ell$. The zero-one loss $\ell_{01}(t, y) = (1 - \mathrm{sign}(ty))/2$ satisfies (2) but is non-smooth. [du Plessis et al., 2014] proposed to use a scaled ramp loss as the surrogate loss for $\ell_{01}$ in PU learning:
$$\ell_{sr}(t, y) = \max\{0, \min\{1, (1 - ty)/2\}\},$$
instead of the popular hinge loss that does not satisfy (2). Let $I(g) = \mathbb{E}_{(X,Y)}[\ell_{01}(g(X), Y)]$ be the risk of $g$ w.r.t. $\ell_{01}$ under $p(x, y)$. Then, $\ell_{sr}$ is neither an upper bound of $\ell_{01}$, so that $I(g) \le R(g)$ is not guaranteed, nor a convex loss, so that it gets more difficult to know whether $\ell_{sr}$ is classification-calibrated or not [Bartlett et al., 2006]. (Footnote 1: A loss function $\ell$ is classification-calibrated if and only if there is a convex, invertible and nondecreasing transformation $\psi_\ell$ with $\psi_\ell(0) = 0$, such that $\psi_\ell(I(g) - \inf_g I(g)) \le R(g) - \inf_g R(g)$ [Bartlett et al., 2006].) If it is, we are able to control the excess risk w.r.t. $\ell_{01}$ by that w.r.t. $\ell_{sr}$. Here we prove the classification calibration of $\ell_{sr}$, and consequently it is a safe surrogate loss for $\ell_{01}$.
Theorem 1. The scaled ramp loss $\ell_{sr}$ is classification-calibrated (see Appendix A for the proof).
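The properties of the losses discussed above can be checked numerically. The sketch below assumes the scaled ramp loss has the clipped form $\max\{0, \min\{1, (1-ty)/2\}\}$ and uses the convention $(1 - \mathrm{sign}(ty))/2$ for the zero-one loss; both conventions, like the function names, are assumptions of this illustration.

```python
import numpy as np

def zero_one_loss(t, y):
    """Zero-one loss under the convention (1 - sign(t*y)) / 2."""
    return (1.0 - np.sign(t * y)) / 2.0

def scaled_ramp_loss(t, y):
    """Scaled ramp loss: (1 - t*y) / 2 clipped to the interval [0, 1]."""
    return np.clip((1.0 - t * y) / 2.0, 0.0, 1.0)

def hinge_loss(t, y):
    """Hinge loss max(0, 1 - t*y)."""
    return np.maximum(0.0, 1.0 - t * y)

t = np.linspace(-3.0, 3.0, 601)  # grid of decision values

# Symmetric condition: loss(t, +1) + loss(t, -1) must equal 1 for every t.
ramp_is_symmetric = bool(np.allclose(scaled_ramp_loss(t, +1) + scaled_ramp_loss(t, -1), 1.0))
hinge_is_symmetric = bool(np.allclose(hinge_loss(t, +1) + hinge_loss(t, -1), 1.0))

# The scaled ramp loss dips below the zero-one loss for misclassified points
# with small margin, so it is not an upper bound of the zero-one loss.
ramp_upper_bounds_01 = bool(np.all(scaled_ramp_loss(t, +1) >= zero_one_loss(t, +1)))
```

On this grid the scaled ramp loss passes the symmetry check and the hinge loss does not, while the upper-bound check fails for the scaled ramp loss, in line with the discussion above.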
3 Theoretical comparisons based on risk bounds
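When learning is involved, suppose we are given a function class $\mathcal{G}$, and let $g^* = \arg\min_{g \in \mathcal{G}} R(g)$ be the optimal decision function in $\mathcal{G}$, and $\hat{g}_{pn} = \arg\min_{g \in \mathcal{G}} \hat{R}_{pn}(g)$, $\hat{g}_{pu} = \arg\min_{g \in \mathcal{G}} \hat{R}_{pu}(g)$ and $\hat{g}_{nu} = \arg\min_{g \in \mathcal{G}} \hat{R}_{nu}(g)$ be arbitrary global minimizers to three risk estimators. Furthermore, let $R^* = \inf_g R(g)$ and $I^* = \inf_g I(g)$ denote the Bayes risks w.r.t. $\ell$ and $\ell_{01}$, where the infimum of $g$ is over all measurable functions.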
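In this section, we derive and compare risk bounds of three risk minimizers $\hat{g}_{pn}$, $\hat{g}_{pu}$ and $\hat{g}_{nu}$ under the following mild assumption on $\mathcal{G}$, $p(x)$, $p_+(x)$ and $p_-(x)$: There is a constant $C_{\mathcal{G}} > 0$ such that
$$\mathfrak{R}_{n,q}(\mathcal{G}) \le C_{\mathcal{G}}/\sqrt{n} \tag{5}$$
for any marginal density $q(x)$, where
$$\mathfrak{R}_{n,q}(\mathcal{G}) = \mathbb{E}_{\mathcal{X}} \mathbb{E}_{\sigma} \left[ \sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{x_i \in \mathcal{X}} \sigma_i g(x_i) \right]$$
is the Rademacher complexity of $\mathcal{G}$ for the sampling of size $n$ from $q(x)$ (that is, $\mathcal{X} = \{x_1, \ldots, x_n\}$ and $\sigma = \{\sigma_1, \ldots, \sigma_n\}$, with each $x_i$ drawn from $q(x)$ and each $\sigma_i$ as a Rademacher variable) [Mohri et al., 2012]. A special case is covered, namely, sets of hyperplanes with bounded normals and feature maps:
$$\mathcal{G} = \{ g(x) = \langle w, \phi(x) \rangle_{\mathcal{H}} \mid \|w\|_{\mathcal{H}} \le C_w, \ \|\phi(x)\|_{\mathcal{H}} \le C_\phi \}, \tag{6}$$
where $\mathcal{H}$ is a Hilbert space with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, $w \in \mathcal{H}$ is a normal vector, $\phi: \mathbb{R}^d \to \mathcal{H}$ is a feature map, and $C_w > 0$ and $C_\phi > 0$ are constants [Schölkopf and Smola, 2001].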
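For norm-bounded linear functions, the $1/\sqrt{n}$ decay required by the assumption above can be observed numerically. The sketch below is an illustration under stated assumptions: it uses the identity feature map, takes the feature-norm constant to be the largest norm in the sample, and estimates the empirical Rademacher complexity of one fixed sample (not its expectation over samples) by Monte Carlo. The function name and all constants are hypothetical.

```python
import numpy as np

def empirical_rademacher_linear(x, c_w, n_draws, rng):
    """Monte Carlo estimate of E_sigma sup_{||w|| <= c_w} (1/n) sum_i sigma_i <w, x_i>."""
    n = x.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher variables
    # The supremum over the ball has the closed form c_w * ||sum_i sigma_i x_i|| / n.
    sup_values = c_w * np.linalg.norm(sigma @ x, axis=1) / n
    return sup_values.mean()

rng = np.random.default_rng(0)
c_w = 2.0
estimates, bounds = [], []
for n in (50, 200, 800):
    x = rng.normal(size=(n, 2))              # hypothetical inputs in R^2
    c_phi = np.linalg.norm(x, axis=1).max()  # feature-norm bound for this sample
    estimates.append(empirical_rademacher_linear(x, c_w, 2000, rng))
    bounds.append(c_w * c_phi / np.sqrt(n))  # classical bound c_w * c_phi / sqrt(n)
```

Each estimate stays below the corresponding bound, and quadrupling the sample size roughly halves it, which is the behavior the constant $C_{\mathcal{G}}$ is meant to capture.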
3.1 Risk bounds
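Let $L_\ell$ be the Lipschitz constant of $\ell$ in its first parameter. To begin with, we establish the learning guarantee of $\hat{g}_{pu}$ (the proof can be found in Appendix A).
where $\mathfrak{R}_{n_+,p_+}(\mathcal{G})$ and $\mathfrak{R}_{n_u,p}(\mathcal{G})$ are the Rademacher complexities of $\mathcal{G}$ for the sampling of size $n_+$ from $p_+(x)$ and the sampling of size $n_u$ from $p(x)$. Moreover, if $\ell$ is a classification-calibrated loss, there exists nondecreasing $\varphi$ with $\varphi(0) = 0$, such that with probability at least $1 - \delta$,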
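In Theorem 2, $R(\hat{g}_{pu})$ and $R(g^*)$ are w.r.t. $p(x, y)$, though $\hat{g}_{pu}$ is trained from two samples following $p_+(x)$ and $p(x)$. We can see that (7) is an upper bound of the estimation error of $\hat{g}_{pu}$ w.r.t. $\ell$, whose right-hand side (RHS) is small if $\mathcal{G}$ is small; (8) is an upper bound of the excess risk of $\hat{g}_{pu}$ w.r.t. $\ell_{01}$, whose RHS also involves the approximation error of $\mathcal{G}$ (i.e., $R(g^*) - R^*$) that is small if $\mathcal{G}$ is large. When $\mathcal{G}$ is fixed and satisfies (5), we have $\mathfrak{R}_{n_+,p_+}(\mathcal{G}) = O(1/\sqrt{n_+})$ and $\mathfrak{R}_{n_u,p}(\mathcal{G}) = O(1/\sqrt{n_u})$, and then
$$R(\hat{g}_{pu}) \to R(g^*)$$
in $O_p(1/\sqrt{n_+} + 1/\sqrt{n_u})$. On the other hand, when the size of $\mathcal{G}$ grows with $n_+$ and $n_u$ properly, those complexities of $\mathcal{G}$ vanish slower in order than $1/\sqrt{n_+}$ and $1/\sqrt{n_u}$ but we may have
$$R(\hat{g}_{pu}) \to R^*, \qquad I(\hat{g}_{pu}) \to I^*,$$
which means $\hat{g}_{pu}$ approaches the Bayes classifier if $\ell$ is a classification-calibrated loss, in an order slower than $O_p(1/\sqrt{n_+} + 1/\sqrt{n_u})$ due to the growth of $\mathcal{G}$.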
Similarly, we can derive the learning guarantees of $\hat{g}_{pn}$ and $\hat{g}_{nu}$ for comparisons. We will just focus on estimation error bounds, because excess risk bounds are their immediate corollaries.
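Theorem 3. Assume (2). For any $\delta > 0$, with probability at least $1 - \delta$,
where $\mathfrak{R}_{n_-,p_-}(\mathcal{G})$ is the Rademacher complexity of $\mathcal{G}$ for the sampling of size $n_-$ from $p_-(x)$.
Theorem 4. Assume (2). For any $\delta > 0$, with probability at least $1 - \delta$,
Corollary 5. The estimation error bounds below hold separately with probability at least $1 - \delta$: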
3.2 Finite-sample comparisons
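Note that the three risk minimizers $\hat{g}_{pn}$, $\hat{g}_{pu}$ and $\hat{g}_{nu}$ work in similar problem settings and their bounds in Corollary 5 are proven using exactly the same proof technique. Hence, the differences in bounds reflect the intrinsic differences between risk minimizers. Let us compare those bounds. Define
$$\alpha_{pu,pn} = \frac{\pi/\sqrt{n_+} + 1/\sqrt{n_u}}{(1-\pi)/\sqrt{n_-}}, \tag{14}$$
$$\alpha_{nu,pn} = \frac{(1-\pi)/\sqrt{n_-} + 1/\sqrt{n_u}}{\pi/\sqrt{n_+}}. \tag{15}$$
Theorem 6 (Finite-sample comparisons). Assume (5) is satisfied. Then the estimation error bound of $\hat{g}_{pu}$ in (12) is tighter than that of $\hat{g}_{pn}$ in (11) whenever $\alpha_{pu,pn} < 1$, and the estimation error bound of $\hat{g}_{nu}$ in (13) is tighter than that of $\hat{g}_{pn}$ whenever $\alpha_{nu,pn} < 1$.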
We analyze some properties of $\alpha_{pu,pn}$ before going to our second main result. The most important property is that it relies on $\pi$, $n_+$, $n_-$ and $n_u$ only; it is independent of $\mathcal{G}$, $p(x)$, $p_+(x)$ and $p_-(x)$ as long as (5) is satisfied. Next, $\alpha_{pu,pn}$ is obviously a monotonic function of $\pi$, $n_+$, $n_-$ and $n_u$. Furthermore, it is unbounded no matter if $\pi$ is fixed or not. Properties of $\alpha_{nu,pn}$ are similar, as summarized in Table 1.
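| | no specification | | sizes are proportional | | $n_+/n_- = \pi/(1-\pi)$ | |
| | mono. inc. | mono. dec. | mono. inc. | mono. dec. | mono. inc. | minimum |
| $\alpha_{pu,pn}$ | $\pi$, $n_-$ | $n_+$, $n_u$ | $\pi$, $\rho_{pu}$ | $\rho_{pn}$ | $\rho_{pu}$ | $2\sqrt{\rho_{pu} + \sqrt{\rho_{pu}}}$ |
| $\alpha_{nu,pn}$ | $n_+$ | $\pi$, $n_-$, $n_u$ | $\rho_{pn}$, $\rho_{nu}$ | $\pi$ | $\rho_{nu}$ | $2\sqrt{\rho_{nu} + \sqrt{\rho_{nu}}}$ |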
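Implications of the monotonicity of $\alpha_{pu,pn}$ are given as follows. Intuitively, when other factors are fixed, larger $n_u$ or $n_-$ improves $\hat{g}_{pu}$ or $\hat{g}_{pn}$ respectively. However, it is less obvious why $\alpha_{pu,pn}$ is monotonically decreasing with $n_+$ and increasing with $\pi$. The weight of the empirical average over $\mathcal{X}_+$ is $2\pi$ in $\hat{R}_{pu}(g)$ and $\pi$ in $\hat{R}_{pn}(g)$, as in $\hat{R}_{pu}(g)$ it also joins the estimation of $(1-\pi)R_-(g)$. This makes $\mathcal{X}_+$ more important for $\hat{R}_{pu}(g)$, and thus larger $n_+$ improves $\hat{g}_{pu}$ more than $\hat{g}_{pn}$. Moreover, $(1-\pi)R_-(g)$ is directly estimated in $\hat{R}_{pn}(g)$ and the concentration is better if $\pi$ is larger, whereas it is indirectly estimated through $R_{u,-}(g) - \pi(1 - R_+(g))$ in $\hat{R}_{pu}(g)$ and the concentration is worse if $\pi$ is larger. As a result, when the sample sizes are fixed, $\hat{g}_{pu}$ is more (or less) favorable as $\pi$ decreases (or increases).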
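The monotonicity discussed above can be checked with a short numerical sketch. It assumes the two ratios take the form of the PU (respectively NU) bound's excess terms divided by the PN bound's remaining term, as in the finite-sample comparison; the function names and all numeric values are hypothetical.

```python
import math

def alpha_pu_pn(prior, n_pos, n_neg, n_unl):
    """Ratio comparing the PU bound with the PN bound; below 1 favors PU."""
    return (prior / math.sqrt(n_pos) + 1 / math.sqrt(n_unl)) / ((1 - prior) / math.sqrt(n_neg))

def alpha_nu_pn(prior, n_pos, n_neg, n_unl):
    """Ratio comparing the NU bound with the PN bound; below 1 favors NU."""
    return ((1 - prior) / math.sqrt(n_neg) + 1 / math.sqrt(n_unl)) / (prior / math.sqrt(n_pos))

# Hypothetical setting: few labeled data, many unlabeled data, small class prior.
a_pu = alpha_pu_pn(0.3, 100, 100, 10_000)
a_nu = alpha_nu_pn(0.3, 100, 100, 10_000)

# Sweep one argument at a time while the others stay fixed.
by_prior = [alpha_pu_pn(p, 100, 100, 10_000) for p in (0.2, 0.4, 0.6, 0.8)]
by_n_pos = [alpha_pu_pn(0.3, n, 100, 10_000) for n in (50, 100, 200, 400)]
by_n_neg = [alpha_pu_pn(0.3, 100, n, 10_000) for n in (50, 100, 200, 400)]
by_n_unl = [alpha_pu_pn(0.3, 100, 100, n) for n in (1_000, 10_000, 100_000)]
```

In this setting the PU ratio is below one while the NU ratio is above one, and the sweeps increase with the class prior and the N sample size and decrease with the P and U sample sizes.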
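A natural question is what the monotonicity of $\alpha_{pu,pn}$ would be if we enforce $n_+$, $n_-$ and $n_u$ to be proportional. To answer this question, we assume $n_+/n_- = \rho_{pn}$, $n_+/n_u = \rho_{pu}$ and $n_-/n_u = \rho_{nu}$, where $\rho_{pn}$, $\rho_{pu}$ and $\rho_{nu}$ are certain constants; then (14) and (15) can be rewritten as
$$\alpha_{pu,pn} = \frac{\pi + \sqrt{\rho_{pu}}}{(1-\pi)\sqrt{\rho_{pn}}}, \qquad \alpha_{nu,pn} = \frac{(1 - \pi + \sqrt{\rho_{nu}})\sqrt{\rho_{pn}}}{\pi}.$$
As shown in Table 1, $\alpha_{pu,pn}$ is now increasing with $\rho_{pu}$ and decreasing with $\rho_{pn}$. It is because, for instance, when $\rho_{pn}$ is fixed and $\rho_{pu}$ increases, $n_u$ is meant to decrease relative to $n_+$ and $n_-$.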
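Finally, the properties will dramatically change if we enforce that $n_+/n_- = \pi/(1-\pi)$ approximately holds, as in ordinary supervised learning. Under this constraint, we have
$$\alpha_{pu,pn} = \frac{\pi + \sqrt{\rho_{pu}}}{\sqrt{\pi(1-\pi)}} \ge 2\sqrt{\rho_{pu} + \sqrt{\rho_{pu}}},$$
where the equality is achieved at $\bar{\pi} = \sqrt{\rho_{pu}}/(2\sqrt{\rho_{pu}} + 1)$. Here, $\alpha_{pu,pn}$ decreases with $\pi$ if $\pi < \bar{\pi}$ and increases with $\pi$ if $\pi > \bar{\pi}$, though it is not convex in $\pi$. Only if $n_u$ is sufficiently larger than $n_+$ could $\alpha_{pu,pn} < 1$ be possible and $\hat{g}_{pu}$ have a tighter estimation error bound: the minimum falls below one only when $\rho_{pu} < (\sqrt{2} - 1)^2/4 \approx 0.043$.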
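The location and value of this minimum can be verified by a short calculation. The derivation below is a sketch that assumes the constrained ratio has the form $(\pi + \sqrt{\rho_{pu}})/\sqrt{\pi(1-\pi)}$, which is what the finite-sample ratio reduces to when $n_+/n_- = \pi/(1-\pi)$; the shorthand $s$ and the symbol $\tilde{\alpha}$ are introduced here only for this calculation.

```latex
\begin{align*}
\tilde{\alpha}(\pi) &= \frac{\pi + s}{\sqrt{\pi(1-\pi)}}, \qquad s = \sqrt{\rho_{pu}}, \\
\frac{d}{d\pi} \ln \tilde{\alpha}(\pi)
  &= \frac{1}{\pi + s} - \frac{1 - 2\pi}{2\pi(1-\pi)} = 0
  \iff 2\pi(1-\pi) = (1 - 2\pi)(\pi + s)
  \iff \pi = \frac{s}{2s + 1}, \\
\tilde{\alpha}\!\left(\frac{s}{2s+1}\right)
  &= \frac{2s(1+s)/(2s+1)}{\sqrt{s(1+s)}/(2s+1)}
  = 2\sqrt{s + s^2}
  = 2\sqrt{\rho_{pu} + \sqrt{\rho_{pu}}}.
\end{align*}
```

Since $\tilde{\alpha}(\pi)$ diverges as $\pi \to 0$ or $\pi \to 1$, this unique stationary point is the global minimum on $(0, 1)$.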
3.3 Asymptotic comparisons
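In practice, we may find that $\hat{g}_{pu}$ is worse than $\hat{g}_{pn}$ and $\alpha_{pu,pn} > 1$ given $\mathcal{X}_+$, $\mathcal{X}_-$ and $\mathcal{X}_u$. This is likely to happen especially when $n_u$ is not sufficiently larger than $n_+$ and $n_-$. Should we then try to collect much more U data or just give up PU learning? Moreover, if we are able to have as many U data as possible, is there any solution that would be provably better than PN learning?
We answer these questions by asymptotic comparisons. Notice that each pair of $(n_+, n_u)$ yields a value of the RHS of (12), each $(n_+, n_-)$ yields a value of the RHS of (11), and consequently each triple of $(n_+, n_-, n_u)$ determines a value of $\alpha_{pu,pn}$. Define the limits of $\alpha_{pu,pn}$ and $\alpha_{nu,pn}$ as
$$\alpha^*_{pu,pn} = \lim_{n_+, n_-, n_u \to \infty} \alpha_{pu,pn}, \qquad \alpha^*_{nu,pn} = \lim_{n_+, n_-, n_u \to \infty} \alpha_{nu,pn}.$$
Recall that $n_+$, $n_-$ and $n_u$ are independent, and we need two conditions for the existence of $\alpha^*_{pu,pn}$ and $\alpha^*_{nu,pn}$: $n_+ \to \infty$ and $n_- \to \infty$ in the same order, and $n_u \to \infty$ faster in order than them. This is a bit stricter than necessary, but is consistent with a practical assumption: P and N data are roughly equally expensive, whereas U data are much cheaper than P and N data. Intuitively, since $\alpha_{pu,pn}$ and $\alpha_{nu,pn}$ measure relative qualities of the estimation error bounds of $\hat{g}_{pu}$ and $\hat{g}_{nu}$ against that of $\hat{g}_{pn}$, $\alpha^*_{pu,pn}$ and $\alpha^*_{nu,pn}$ measure relative qualities of the limits of those bounds accordingly.
In order to illustrate properties of $\alpha^*_{pu,pn}$ and $\alpha^*_{nu,pn}$, assume only $n_u$ approaches infinity while $n_+$ and $n_-$ stay finite, so that $\alpha^*_{pu,pn} = \pi\sqrt{n_-}/((1-\pi)\sqrt{n_+})$ and $\alpha^*_{nu,pn} = (1-\pi)\sqrt{n_+}/(\pi\sqrt{n_-})$. Thus, $\alpha^*_{pu,pn}\alpha^*_{nu,pn} = 1$, which implies $\alpha^*_{pu,pn} < 1$ or $\alpha^*_{nu,pn} < 1$ unless $n_+/n_- = \pi^2/(1-\pi)^2$. In principle, this exception should be exceptionally rare since $n_+/n_-$ is a rational number whereas $\pi^2/(1-\pi)^2$ is a real number. This argument constitutes our second main result.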
Theorem 7 (Asymptotic comparisons).
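Assume (5) and one set of conditions below are satisfied:
(a) $n_+ < \infty$, $n_- < \infty$ and $n_u \to \infty$. In this case, let $\alpha^* = \pi\sqrt{n_-}/((1-\pi)\sqrt{n_+})$;
(b) $0 < \lim_{n_+, n_- \to \infty} n_+/n_- < \infty$ and $\lim_{n_+, n_-, n_u \to \infty} (n_+ + n_-)/n_u = 0$. In this case, let $\alpha^* = \pi/((1-\pi)\sqrt{\rho_{pn}})$ where $\rho_{pn} = \lim_{n_+, n_- \to \infty} n_+/n_-$.
Then, either the limit of estimation error bounds of $\hat{g}_{pu}$ will improve on that of $\hat{g}_{pn}$ (i.e., $\alpha^*_{pu,pn} < 1$) if $\alpha^* < 1$, or the limit of bounds of $\hat{g}_{nu}$ will improve on that of $\hat{g}_{pn}$ (i.e., $\alpha^*_{nu,pn} < 1$) if $\alpha^* > 1$. The only exception is $n_+/n_- = \pi^2/(1-\pi)^2$ in (a) or $\rho_{pn} = \pi^2/(1-\pi)^2$ in (b).
Proof. Note that $\alpha^* = \alpha^*_{pu,pn} = 1/\alpha^*_{nu,pn}$ in both cases. The proof of case (a) has been given as an illustration of the properties of $\alpha^*_{pu,pn}$ and $\alpha^*_{nu,pn}$. The proof of case (b) is analogous. ∎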
As a result, when we find that $\hat{g}_{pu}$ is worse than $\hat{g}_{pn}$ and $\alpha_{pu,pn} > 1$, we should look at $\alpha^*$ defined in Theorem 7. If $\alpha^* < 1$, $\hat{g}_{pu}$ is promising and we should collect more U data; otherwise, if $\alpha^* > 1$, we should give up $\hat{g}_{pu}$, but instead $\hat{g}_{nu}$ is promising and we should collect more U data as well. In addition, the gap between $\alpha^*$ and one indicates how many U data would be sufficient. If the gap is significant, slightly more U data may be enough; if the gap is slight, significantly more U data may be necessary. In practice, however, U data are cheaper but not free, and we cannot have as many U data as possible. Therefore, $\hat{g}_{pn}$ is still of practical importance given limited budgets.
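The decision rule described above can be sketched as a small helper for case (a), where only the unlabeled sample size grows. The function names and the numeric settings are hypothetical, and the finite-sample ratio is assumed to have the form used in the finite-sample comparison.

```python
import math

def alpha_pu_pn(prior, n_pos, n_neg, n_unl):
    """Finite-sample ratio comparing the PU bound with the PN bound."""
    return (prior / math.sqrt(n_pos) + 1 / math.sqrt(n_unl)) / ((1 - prior) / math.sqrt(n_neg))

def alpha_star(prior, n_pos, n_neg):
    """Limit of alpha_pu_pn as n_unl grows to infinity with n_pos and n_neg fixed."""
    return prior * math.sqrt(n_neg) / ((1 - prior) * math.sqrt(n_pos))

def promising_learner(prior, n_pos, n_neg):
    """Which learner benefits from collecting more U data, judged by alpha_star."""
    a = alpha_star(prior, n_pos, n_neg)
    if a < 1:
        return "PU"
    if a > 1:
        return "NU"
    return "neither"  # the rare tie n_pos / n_neg = prior^2 / (1 - prior)^2

small_prior_choice = promising_learner(0.3, 100, 100)
large_prior_choice = promising_learner(0.7, 100, 100)

# The finite-sample ratio approaches the limit as the U sample grows.
gaps = [alpha_pu_pn(0.3, 100, 100, n) - alpha_star(0.3, 100, 100)
        for n in (10**3, 10**6, 10**12)]
```

The shrinking gaps mirror the remark above: how far the limit lies from one indicates how much unlabeled data is needed before the finite-sample ratio itself drops below one.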
Theorem 2 relies on a fundamental lemma of the uniform deviation from the risk estimator $\hat{R}_{pu}(g)$ to the risk $R(g)$:
Lemma 8. For any $\delta > 0$, with probability at least $1 - \delta$,
Assumption (5) is satisfied not only by the special case in (6), but also by the vast majority of discriminative models in machine learning that are nonlinear in parameters, such as decision trees (cf. Theorem 17 in [Bartlett and Mendelson, 2002]) and feedforward neural networks (cf. Theorem 18 in [Bartlett and Mendelson, 2002]).
In the generalization error bound of [du Plessis et al., 2014], the surrogate loss is not for risk minimization and labels of $\mathcal{X}_u$ are needed for risk evaluation, so that no further bound is implied. Lemma 8 uses the same $\ell$ as risk minimization and requires no label of $\mathcal{X}_u$ for evaluating $\hat{R}_{pu}(g)$, so that it can serve as the stepping stone to our estimation error bound in Theorem 2.
In this section, we experimentally validate our theoretical findings.
Here, , and are in and drawn from three marginal densities
where $N(\mu, \Sigma)$ is the normal distribution with mean $\mu$ and covariance $\Sigma$, and $\mathbf{1}_d$ and $I_d$ are the all-one vector and identity matrix of size $d$. The test set contains one million data drawn from $p(x, y)$.
The model where and the scaled ramp loss are employed. In addition, an -regularization is added with the regularization parameter fixed to , and there is no hard constraint on or as in Eq. (6). The solver for minimizing three regularized risk estimators comes from [du Plessis et al., 2014] (refer also to [Collobert et al., 2006, Yuille and Rangarajan, 2001] for the optimization technique).
The results are reported in Figure 1. In the two panels for varying $n_u$ (theoretical and experimental), $n_+$, $n_-$ and $\pi$ are fixed and $n_u$ varies; in the two panels for varying $\pi$ (theoretical and experimental), $n_+$, $n_-$ and $n_u$ are fixed and $\pi$ varies. Specifically, the theoretical panel for varying $n_u$ shows $\alpha_{pu,pn}$ and $\alpha_{nu,pn}$ as functions of $n_u$, and the theoretical panel for varying $\pi$ shows them as functions of $\pi$. For the experimental results, $\hat{g}_{pn}$, $\hat{g}_{pu}$ and $\hat{g}_{nu}$ were trained based on random samplings for every $n_u$ in the experimental panel for varying $n_u$ and every $\pi$ in the experimental panel for varying $\pi$, and means with standard errors of the misclassification rates are shown, as $\ell_{sr}$ is classification-calibrated. Note that the empirical misclassification rates are essentially the risks w.r.t. $\ell_{01}$ as there were one million test data, and the fluctuations are attributed to the non-convex nature of $\ell_{sr}$. Also, the curve of $\hat{g}_{pn}$ is not a flat line in the experimental panel for varying $n_u$, since its training data at every $n_u$ were exactly the same as the training data of $\hat{g}_{pu}$ and $\hat{g}_{nu}$ for fair experimental comparisons.
In Figure 1, the theoretical and experimental results are highly consistent. The red and blue curves intersect at nearly the same positions in the theoretical and experimental panels, both for varying $n_u$ and for varying $\pi$, even though the risk minimizers in the experiments were locally optimal and regularized, making our estimation error bounds inexact.
Table 2 summarizes the specification of benchmarks, which were downloaded from many sources including the IDA benchmark repository [Rätsch et al., 2001], the UCI machine learning repository, the semi-supervised learning book [Chapelle et al., 2006], and the European ESPRIT 5516 project. (Footnote 3: See http://www.raetschlab.org/Members/raetsch/benchmark/ for IDA, http://archive.ics.uci.edu/ml/ for UCI, http://olivier.chapelle.cc/ssl-book/ for the SSL book and https://www.elen.ucl.ac.be/neural-nets/Research/Projects/ELENA/ for the ELENA project.) In Table 2, three rows describe the number of features, the number of data, and the ratio of P data according to the true class labels. Given a random sampling of $\mathcal{X}_+$, $\mathcal{X}_-$ and $\mathcal{X}_u$, the test set has all the remaining data if they are less than , or else drawn uniformly from the remaining data of size .
For benchmark data, the linear model for the artificial data is not enough, and its kernel version is employed. Consider training for example. Given a random sampling,