Theoretical Comparisons of Positive-Unlabeled Learning against Positive-Negative Learning

03/10/2016 · by Gang Niu et al.

In PU learning, a binary classifier is trained from positive (P) and unlabeled (U) data without negative (N) data. Although N data are missing, PU learning sometimes outperforms PN learning (i.e., ordinary supervised learning). Hitherto, neither theoretical nor experimental analysis has been given to explain this phenomenon. In this paper, we theoretically compare PU (and NU) learning against PN learning based on upper bounds on estimation errors. We find simple conditions under which PU and NU learning are likely to outperform PN learning, and we prove that, in terms of the upper bounds, either PU or NU learning (depending on the class-prior probability and the sizes of P and N data) given infinite U data will improve on PN learning. Our theoretical findings agree well with the experimental results on artificial and benchmark data, even when the experimental setup does not match the theoretical assumptions exactly.


1 Introduction

Positive-unlabeled (PU) learning, where a binary classifier is trained from P and U data, has drawn considerable attention recently [Denis, 1998, Letouzey et al., 2000, Elkan and Noto, 2008, Ward et al., 2009, Scott and Blanchard, 2009, Blanchard et al., 2010, du Plessis et al., 2014, 2015a]. It is appealing not only to academia but also to industry, since, for example, the click-through data automatically collected in search engines are highly PU due to position biases [Dupret and Piwowarski, 2008, Craswell et al., 2008, Chapelle and Zhang, 2009]. Although PU learning uses no negative (N) data, it is sometimes even better than PN learning (i.e., ordinary supervised learning, perhaps with class-prior change [Quiñonero-Candela et al., 2009]) in practice. Nevertheless, there is neither theoretical nor experimental analysis of this phenomenon, and it is still an open problem when PU learning is likely to outperform PN learning. We address this question in this paper.

Problem settings

For PU learning, there are two problem settings based on one sample (OS) and two samples (TS) of data respectively. More specifically, let $X \in \mathbb{R}^d$ and $Y \in \{+1, -1\}$ be the input and output random variables, equipped with an underlying joint density $p(x, y)$. In OS [Elkan and Noto, 2008], a set of U data is sampled from the marginal density $p(x)$. Then if a data point $x$ is P, this P label is observed with probability $\theta$, and $x$ remains U with probability $1-\theta$; if $x$ is N, this N label is never observed, and $x$ remains U with probability one. In TS [Ward et al., 2009], a set of P data is drawn from the positive marginal density $p_P(x) = p(x \mid y = +1)$ and a set of U data is drawn from $p(x)$. Denote by $n_P$ and $n_U$ the sizes of P and U data. As two random variables, $n_P$ and $n_U$ are fully independent in TS, whereas they satisfy $\mathbb{E}[n_P / (n_P + n_U)] = \pi\theta$ in OS, where $\pi = p(y = +1)$ is the class-prior probability. Therefore, TS is slightly more general than OS, and we will focus on TS problem settings.

Similarly, consider TS problem settings of PN and NU learning, where a set of N data (of size $n_N$) is sampled from the negative marginal density $p_N(x) = p(x \mid y = -1)$ independently of the P/U data. For PN learning, if we enforce that $n_P/n_N \approx \pi/(1-\pi)$ holds when sampling the data, it will be ordinary supervised learning; otherwise, it is supervised learning with class-prior change, a.k.a. prior probability shift [Quiñonero-Candela et al., 2009].

In [du Plessis et al., 2014], a cost-sensitive formulation for PU learning was proposed, and its risk estimator was proven unbiased if the (generally non-convex) surrogate loss satisfies a symmetric condition. Therefore, we can naturally compare the empirical risk minimizers in PU and NU learning against that in PN learning.

Contributions

We establish risk bounds of the three risk minimizers in PN, PU and NU learning for comparisons, in the flavor of statistical learning theory [Vapnik, 1998, Bousquet et al., 2004]. For each minimizer, we first derive a uniform deviation bound from the risk estimator to the risk using Rademacher complexities (see, e.g., [Koltchinskii, 2001, Bartlett and Mendelson, 2002, Meir and Zhang, 2003, Mohri et al., 2012]); we then obtain an estimation error bound; finally, if the surrogate loss is classification-calibrated [Bartlett et al., 2006], an excess risk bound follows as an immediate corollary. In [du Plessis et al., 2014], there was a generalization error bound similar to our uniform deviation bound for PU learning. However, it is based on a tricky decomposition of the risk, where the surrogate losses for risk minimization and risk analysis are different and the labels of U data are needed for risk evaluation, so that no further bound is implied. In contrast, ours utilizes the same surrogate loss for risk minimization and analysis and requires no labels of U data for risk evaluation, so that an estimation error bound is possible.

Our main results can be summarized as follows. Denote by $\hat{g}_{PN}$, $\hat{g}_{PU}$ and $\hat{g}_{NU}$ the risk minimizers in PN, PU and NU learning. Under a mild assumption on the function class and the data distributions,

  • Finite-sample case: The estimation error bound of $\hat{g}_{PU}$ is tighter than that of $\hat{g}_{PN}$ whenever $2\pi/\sqrt{n_P} + 1/\sqrt{n_U} < \pi/\sqrt{n_P} + (1-\pi)/\sqrt{n_N}$, and so is the bound of $\hat{g}_{NU}$ tighter than that of $\hat{g}_{PN}$ if $2(1-\pi)/\sqrt{n_N} + 1/\sqrt{n_U} < \pi/\sqrt{n_P} + (1-\pi)/\sqrt{n_N}$.

  • Asymptotic case: Either the limit of the bounds of $\hat{g}_{PU}$ or that of $\hat{g}_{NU}$ (depending on $\pi$, $n_P$ and $n_N$) will improve on that of $\hat{g}_{PN}$, if $n_P, n_N \to \infty$ in the same order and $n_U \to \infty$ faster in order than $n_P$ and $n_N$.

Notice that both results rely only on the constant $\pi$ and the variables $n_P$, $n_N$ and $n_U$; they are simple and independent of the specific forms of the function class and/or the data distributions. The asymptotic case follows from the finite-sample case, which is based on theoretical comparisons of the aforementioned upper bounds on the estimation errors of $\hat{g}_{PN}$, $\hat{g}_{PU}$ and $\hat{g}_{NU}$. To the best of our knowledge, this is the first work that compares PU learning against PN learning.

Throughout the paper, we assume that the class-prior probability $\pi$ is known. In practice, it can be effectively estimated from P, N and U data [Saerens et al., 2002, du Plessis and Sugiyama, 2012, Iyer et al., 2014] or from only P and U data [du Plessis et al., 2015b, Ramaswamy et al., 2016].

Organization

The rest of this paper is organized as follows. Unbiased estimators are reviewed in Section 2. Then in Section 3 we present our theoretical comparisons based on risk bounds. Finally, experiments are discussed in Section 4.

2 Unbiased estimators to the risk

For convenience, denote by $p_P(x) = p(x \mid y = +1)$ and $p_N(x) = p(x \mid y = -1)$ the partial marginal densities. Recall that instead of data sampled from $p(x, y)$, we consider three sets of data $\mathcal{X}_P$, $\mathcal{X}_N$ and $\mathcal{X}_U$, which are drawn from the three marginal densities $p_P(x)$, $p_N(x)$ and $p(x)$ independently.

Let $g : \mathbb{R}^d \to \mathbb{R}$ be a real-valued decision function for binary classification and $\ell : \mathbb{R} \times \{+1, -1\} \to \mathbb{R}$ be a Lipschitz-continuous loss function. Denote by

$R_P(g) = \mathbb{E}_P[\ell(g(X), +1)]$ and $R_N(g) = \mathbb{E}_N[\ell(g(X), -1)]$

the partial risks, where $\mathbb{E}_P$ and $\mathbb{E}_N$ are the expectations over $p_P(x)$ and $p_N(x)$. Then the risk of $g$ w.r.t. $\ell$ under $p(x, y)$ is given by

$R(g) = \pi R_P(g) + (1 - \pi) R_N(g)$.  (1)

In PN learning, by approximating $R(g)$ based on Eq. (1), we can get an empirical risk estimator as

$\hat{R}_{PN}(g) = \frac{\pi}{n_P} \sum_{i=1}^{n_P} \ell(g(x^P_i), +1) + \frac{1-\pi}{n_N} \sum_{j=1}^{n_N} \ell(g(x^N_j), -1)$.

For any fixed $g$, $\hat{R}_{PN}(g)$ is an unbiased and consistent estimator of $R(g)$, and its convergence rate is of order $\mathcal{O}_p(1/\sqrt{n_P} + 1/\sqrt{n_N})$ according to the central limit theorem [Chung, 1968], where $\mathcal{O}_p$ denotes the order in probability.
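As a concrete illustration, here is a minimal NumPy sketch of this estimator; the function name and the margin-style loss signature $\ell(t, y)$ are our own choices, not notation from the paper.

```python
import numpy as np

def risk_pn(g, loss, x_p, x_n, pi):
    """Empirical PN risk: (pi/n_P) * sum_i loss(g(x_i^P), +1)
    + ((1-pi)/n_N) * sum_j loss(g(x_j^N), -1)."""
    return pi * np.mean(loss(g(x_p), +1)) + (1 - pi) * np.mean(loss(g(x_n), -1))
```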

In PU learning, $\mathcal{X}_N$ is not available, and then $R_N(g)$ cannot be directly estimated. However, [du Plessis et al., 2014] has shown that we can estimate $R(g)$ without any bias if $\ell$ satisfies the following symmetric condition:

$\ell(t, +1) + \ell(t, -1) = 1$.  (2)

Specifically, let $R_{U,N}(g) = \mathbb{E}_{X \sim p(x)}[\ell(g(X), -1)]$ be a risk in which U data are regarded as N data. Given Eq. (2), we have $(1-\pi) R_N(g) = R_{U,N}(g) - \pi(1 - R_P(g))$, and hence

$R(g) = 2\pi R_P(g) + R_{U,N}(g) - \pi$.  (3)

By approximating $R(g)$ based on (3) using $\mathcal{X}_P$ and $\mathcal{X}_U$, we can obtain

$\hat{R}_{PU}(g) = \frac{2\pi}{n_P} \sum_{i=1}^{n_P} \ell(g(x^P_i), +1) + \frac{1}{n_U} \sum_{k=1}^{n_U} \ell(g(x^U_k), -1) - \pi$.

Although the term $\hat{R}_{U,N}(g)$ regards $\mathcal{X}_U$ as N data and aims at separating $\mathcal{X}_P$ and $\mathcal{X}_U$ if being minimized, $\hat{R}_{PU}(g)$ as a whole is an unbiased and consistent estimator of $R(g)$ with a convergence rate of order $\mathcal{O}_p(2\pi/\sqrt{n_P} + 1/\sqrt{n_U})$ [Chung, 1968].

Similarly, $R_P(g)$ in NU learning cannot be directly estimated. Let $R_{U,P}(g) = \mathbb{E}_{X \sim p(x)}[\ell(g(X), +1)]$. Given Eq. (2), $\pi R_P(g) = R_{U,P}(g) - (1-\pi)(1 - R_N(g))$, and

$R(g) = 2(1-\pi) R_N(g) + R_{U,P}(g) - (1-\pi)$.  (4)

By approximating $R(g)$ based on (4) using $\mathcal{X}_N$ and $\mathcal{X}_U$, we can obtain

$\hat{R}_{NU}(g) = \frac{2(1-\pi)}{n_N} \sum_{j=1}^{n_N} \ell(g(x^N_j), -1) + \frac{1}{n_U} \sum_{k=1}^{n_U} \ell(g(x^U_k), +1) - (1-\pi)$.
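The PU and NU estimators translate just as directly into code. Below is a sketch in the same style as the PN snippet above (again, the names are ours); both functions are only unbiased when the loss satisfies the symmetric condition (2).

```python
import numpy as np

def risk_pu(g, loss, x_p, x_u, pi):
    """Unbiased PU risk estimator from Eq. (3); requires
    loss(t, +1) + loss(t, -1) = 1 for all t (condition (2))."""
    return 2 * pi * np.mean(loss(g(x_p), +1)) + np.mean(loss(g(x_u), -1)) - pi

def risk_nu(g, loss, x_n, x_u, pi):
    """Unbiased NU risk estimator from Eq. (4), the mirror image of PU."""
    return (2 * (1 - pi) * np.mean(loss(g(x_n), -1))
            + np.mean(loss(g(x_u), +1)) - (1 - pi))
```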

On the loss function

In order to train $g$ by minimizing these estimators, it remains to specify the loss $\ell$. The zero-one loss $\ell_{01}(t, y) = (1 - \mathrm{sign}(ty))/2$ satisfies (2) but is non-smooth. [du Plessis et al., 2014] proposed to use a scaled ramp loss as the surrogate loss for $\ell_{01}$ in PU learning:

$\ell_{SR}(t, y) = \frac{1}{2} \max(0, \min(2, 1 - ty))$,

instead of the popular hinge loss, which does not satisfy (2). Let $I(g)$ be the risk of $g$ w.r.t. $\ell_{01}$ under $p(x, y)$. Then, $\ell_{SR}$ is neither an upper bound of $\ell_{01}$, so that $I(g) \le R(g)$ is not guaranteed, nor a convex loss, so that it gets more difficult to know whether $\ell_{SR}$ is classification-calibrated or not [Bartlett et al., 2006].¹ If it is, we are able to control the excess risk w.r.t. $\ell_{01}$ by that w.r.t. $\ell_{SR}$. Here we prove the classification calibration of $\ell_{SR}$, and consequently it is a safe surrogate loss for $\ell_{01}$.

¹A loss function $\ell$ is classification-calibrated if and only if there is a convex, invertible and nondecreasing transformation $\psi_\ell$ with $\psi_\ell(0) = 0$, such that $\psi_\ell(I(g) - I^*) \le R(g) - R^*$ [Bartlett et al., 2006].

Theorem 1.

The scaled ramp loss is classification-calibrated (see Appendix A for the proof).
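A quick numerical sanity check of the symmetric condition (2) for the scaled ramp loss (a throwaway script of ours, not from the paper):

```python
import numpy as np

def scaled_ramp(t, y):
    """Scaled ramp loss l_SR(t, y) = 0.5 * max(0, min(2, 1 - t*y))."""
    return 0.5 * np.clip(1 - t * y, 0.0, 2.0)

# Condition (2): l(t, +1) + l(t, -1) should be identically 1.
t = np.linspace(-5.0, 5.0, 1001)
assert np.allclose(scaled_ramp(t, +1) + scaled_ramp(t, -1), 1.0)
```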

3 Theoretical comparisons based on risk bounds

When learning is involved, suppose we are given a function class $\mathcal{G}$, and let $g^* = \arg\min_{g \in \mathcal{G}} R(g)$ be the optimal decision function in $\mathcal{G}$, and $\hat{g}_{PN}$, $\hat{g}_{PU}$ and $\hat{g}_{NU}$ be arbitrary global minimizers of the three risk estimators $\hat{R}_{PN}(g)$, $\hat{R}_{PU}(g)$ and $\hat{R}_{NU}(g)$. Furthermore, let $R^* = \inf_g R(g)$ and $I^* = \inf_g I(g)$ denote the Bayes risks w.r.t. $\ell$ and $\ell_{01}$, where the infimum of $g$ is over all measurable functions.

In this section, we derive and compare risk bounds of the three risk minimizers $\hat{g}_{PN}$, $\hat{g}_{PU}$ and $\hat{g}_{NU}$ under the following mild assumption on $\mathcal{G}$, $p_P(x)$, $p_N(x)$ and $p(x)$: there is a constant $C_{\mathcal{G}} > 0$ such that

$\mathfrak{R}_{n,q}(\mathcal{G}) \le C_{\mathcal{G}} / \sqrt{n}$  (5)

for any marginal density $q(x) \in \{p_P(x), p_N(x), p(x)\}$, where

$\mathfrak{R}_{n,q}(\mathcal{G}) = \mathbb{E}_{\mathcal{X}} \mathbb{E}_{\boldsymbol{\sigma}} \left[ \sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \sigma_i g(x_i) \right]$

is the Rademacher complexity of $\mathcal{G}$ for the sampling of size $n$ from $q(x)$ (that is, $\mathcal{X} = \{x_1, \ldots, x_n\}$ and $\boldsymbol{\sigma} = \{\sigma_1, \ldots, \sigma_n\}$, with each $x_i$ drawn from $q(x)$ and each $\sigma_i$ as a Rademacher variable) [Mohri et al., 2012]. A special case is covered, namely, sets of hyperplanes with bounded normals and feature maps:

$\mathcal{G} = \{ g(x) = \langle w, \phi(x) \rangle \mid \|w\| \le C_w, \|\phi(x)\| \le C_\phi \}$,  (6)

where $\mathcal{H}$ is a Hilbert space with an inner product $\langle \cdot, \cdot \rangle$, $w \in \mathcal{H}$ is a normal vector, $\phi : \mathbb{R}^d \to \mathcal{H}$ is a feature map, and $C_w > 0$ and $C_\phi > 0$ are constants [Schölkopf and Smola, 2001].
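For the linear-in-parameters class (6), the inner supremum has the closed form $(C_w/n)\,\|\sum_i \sigma_i \phi(x_i)\|$ by the Cauchy–Schwarz inequality, which makes assumption (5) easy to probe numerically. The Monte Carlo sketch below (our own construction, using the identity feature map) estimates the Rademacher complexity on a fixed sample and compares it against $C_w C_\phi / \sqrt{n}$:

```python
import numpy as np

def rademacher_linear(x, c_w, n_draws=1000, seed=0):
    """Monte Carlo estimate of the Rademacher complexity of
    {x -> <w, x> : ||w|| <= c_w} on a fixed sample x of shape (n, d);
    sup_w (1/n) sum_i sigma_i <w, x_i> = (c_w/n) * ||sum_i sigma_i x_i||."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    return c_w * np.mean(np.linalg.norm(sigma @ x, axis=1)) / n

rng = np.random.default_rng(1)
x = rng.normal(size=(500, 10))
x /= np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1.0)  # ||phi(x)|| <= 1
print(rademacher_linear(x, c_w=1.0), "<=", 1.0 / np.sqrt(500))  # bound (5)
```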

3.1 Risk bounds

Let $L_\ell$ be the Lipschitz constant of $\ell$ in its first parameter. To begin with, we establish the learning guarantee of $\hat{g}_{PU}$ (the proof can be found in Appendix A).

Theorem 2.

Assume (2). For any $0 < \delta < 1$, with probability at least $1 - \delta$,²

$R(\hat{g}_{PU}) - R(g^*) \le 8\pi L_\ell \mathfrak{R}_{n_P, p_P}(\mathcal{G}) + 4 L_\ell \mathfrak{R}_{n_U, p}(\mathcal{G}) + 4\pi \sqrt{\frac{\ln(2/\delta)}{2 n_P}} + 2 \sqrt{\frac{\ln(2/\delta)}{2 n_U}}$,  (7)

where $\mathfrak{R}_{n_P, p_P}(\mathcal{G})$ and $\mathfrak{R}_{n_U, p}(\mathcal{G})$ are the Rademacher complexities of $\mathcal{G}$ for the sampling of size $n_P$ from $p_P(x)$ and the sampling of size $n_U$ from $p(x)$. Moreover, if $\ell$ is a classification-calibrated loss, there exists a nondecreasing $\psi_\ell$ with $\psi_\ell(0) = 0$, such that with probability at least $1 - \delta$,

$\psi_\ell(I(\hat{g}_{PU}) - I^*) \le \mathrm{RHS\ of\ (7)} + R(g^*) - R^*$.  (8)

²Here, the probability is over repeated sampling of data for training $\hat{g}_{PU}$, while in Lemma 8, it will be for evaluating $\hat{R}_{PU}(g)$.

In Theorem 2, $R(\hat{g}_{PU})$ and $R(g^*)$ are w.r.t. $p(x, y)$, though $\hat{g}_{PU}$ is trained from two samples following $p_P(x)$ and $p(x)$. We can see that (7) is an upper bound of the estimation error of $\hat{g}_{PU}$ w.r.t. $\mathcal{G}$, whose right-hand side (RHS) is small if $\mathcal{G}$ is small; (8) is an upper bound of the excess risk of $\hat{g}_{PU}$ w.r.t. $\ell_{01}$, whose RHS also involves the approximation error of $\mathcal{G}$ (i.e., $R(g^*) - R^*$) that is small if $\mathcal{G}$ is large. When $\mathcal{G}$ is fixed and satisfies (5), we have $\mathfrak{R}_{n_P, p_P}(\mathcal{G}) = \mathcal{O}(1/\sqrt{n_P})$ and $\mathfrak{R}_{n_U, p}(\mathcal{G}) = \mathcal{O}(1/\sqrt{n_U})$, and then

$R(\hat{g}_{PU}) - R(g^*) \to 0$ in $\mathcal{O}_p(2\pi/\sqrt{n_P} + 1/\sqrt{n_U})$.

On the other hand, when the size of $\mathcal{G}$ grows with $n_P$ and $n_U$ properly, those complexities of $\mathcal{G}$ vanish slower in order than $1/\sqrt{n_P}$ and $1/\sqrt{n_U}$, but we may have

$I(\hat{g}_{PU}) - I^* \to 0$,

which means $\hat{g}_{PU}$ approaches the Bayes classifier if $\ell$ is a classification-calibrated loss, in an order slower than $\mathcal{O}_p(2\pi/\sqrt{n_P} + 1/\sqrt{n_U})$ due to the growth of $\mathcal{G}$.

Similarly, we can derive the learning guarantees of $\hat{g}_{PN}$ and $\hat{g}_{NU}$ for comparisons. We will just focus on estimation error bounds, because excess risk bounds are their immediate corollaries.

Theorem 3.

Assume (2). For any $0 < \delta < 1$, with probability at least $1 - \delta$,

$R(\hat{g}_{PN}) - R(g^*) \le 4\pi L_\ell \mathfrak{R}_{n_P, p_P}(\mathcal{G}) + 4(1-\pi) L_\ell \mathfrak{R}_{n_N, p_N}(\mathcal{G}) + 2\pi \sqrt{\frac{\ln(2/\delta)}{2 n_P}} + 2(1-\pi) \sqrt{\frac{\ln(2/\delta)}{2 n_N}}$,  (9)

where $\mathfrak{R}_{n_N, p_N}(\mathcal{G})$ is the Rademacher complexity of $\mathcal{G}$ for the sampling of size $n_N$ from $p_N(x)$.

Theorem 4.

Assume (2). For any $0 < \delta < 1$, with probability at least $1 - \delta$,

$R(\hat{g}_{NU}) - R(g^*) \le 8(1-\pi) L_\ell \mathfrak{R}_{n_N, p_N}(\mathcal{G}) + 4 L_\ell \mathfrak{R}_{n_U, p}(\mathcal{G}) + 4(1-\pi) \sqrt{\frac{\ln(2/\delta)}{2 n_N}} + 2 \sqrt{\frac{\ln(2/\delta)}{2 n_U}}$.  (10)

In order to compare the bounds, we simplify (9), (7) and (10) using Eq. (5). To this end, we define $C_\delta = 4 L_\ell C_{\mathcal{G}} + 2\sqrt{\ln(2/\delta)/2}$. For the special case of $\mathcal{G}$ defined in (6), define $C_{\mathcal{G}}$ accordingly as $C_{\mathcal{G}} = C_w C_\phi$.

Corollary 5.

The estimation error bounds below hold separately with probability at least $1 - \delta$:

$R(\hat{g}_{PN}) - R(g^*) \le C_\delta \left( \pi/\sqrt{n_P} + (1-\pi)/\sqrt{n_N} \right)$,  (11)
$R(\hat{g}_{PU}) - R(g^*) \le C_\delta \left( 2\pi/\sqrt{n_P} + 1/\sqrt{n_U} \right)$,  (12)
$R(\hat{g}_{NU}) - R(g^*) \le C_\delta \left( 2(1-\pi)/\sqrt{n_N} + 1/\sqrt{n_U} \right)$.  (13)

3.2 Finite-sample comparisons

Note that the three risk minimizers $\hat{g}_{PN}$, $\hat{g}_{PU}$ and $\hat{g}_{NU}$ work in similar problem settings, and their bounds in Corollary 5 are proven using exactly the same proof technique. Then, the differences in the bounds reflect the intrinsic differences between the risk minimizers. Let us compare those bounds. Define

$\alpha_{PU} = \dfrac{2\pi/\sqrt{n_P} + 1/\sqrt{n_U}}{\pi/\sqrt{n_P} + (1-\pi)/\sqrt{n_N}}$,  (14)

$\alpha_{NU} = \dfrac{2(1-\pi)/\sqrt{n_N} + 1/\sqrt{n_U}}{\pi/\sqrt{n_P} + (1-\pi)/\sqrt{n_N}}$.  (15)

Eqs. (14) and (15) constitute our first main result.

Theorem 6 (Finite-sample comparisons).

Assume (5) is satisfied. Then the estimation error bound of $\hat{g}_{PU}$ in (12) is tighter than that of $\hat{g}_{PN}$ in (11) if and only if $\alpha_{PU} < 1$; also, the estimation error bound of $\hat{g}_{NU}$ in (13) is tighter than that of $\hat{g}_{PN}$ if and only if $\alpha_{NU} < 1$.

Proof.

Fix $\pi$, $n_P$, $n_N$ and $n_U$, and then denote by $B_{PN}$, $B_{PU}$ and $B_{NU}$ the values of the RHSs of (11), (12) and (13). In fact, the definitions of $\alpha_{PU}$ and $\alpha_{NU}$ in (14) and (15) came from

$\alpha_{PU} = B_{PU} / B_{PN}$ and $\alpha_{NU} = B_{NU} / B_{PN}$.

As a consequence, compared with $B_{PN}$, $B_{PU}$ is smaller and (12) is tighter if and only if $\alpha_{PU} < 1$, and $B_{NU}$ is smaller and (13) is tighter if and only if $\alpha_{NU} < 1$. ∎
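Theorem 6 is easy to operationalize: given $\pi$ and the three sample sizes, the two ratios decide which bound is tightest. A small helper (our own code, not the authors') that evaluates (14) and (15):

```python
import numpy as np

def alphas(pi, n_p, n_n, n_u):
    """alpha_PU and alpha_NU from Eqs. (14)-(15): the PU/NU bounds (12)/(13)
    divided by the PN bound (11); values below 1 mean a tighter bound."""
    b_pn = pi / np.sqrt(n_p) + (1 - pi) / np.sqrt(n_n)
    a_pu = (2 * pi / np.sqrt(n_p) + 1 / np.sqrt(n_u)) / b_pn
    a_nu = (2 * (1 - pi) / np.sqrt(n_n) + 1 / np.sqrt(n_u)) / b_pn
    return a_pu, a_nu

# Hypothetical sizes: a small class prior strongly favors PU learning.
print(alphas(pi=0.2, n_p=100, n_n=100, n_u=10000))  # -> (0.5, 1.7)
```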

We analyze some properties of $\alpha_{PU}$ before going to our second main result. The most important property is that it relies on $\pi$, $n_P$, $n_N$ and $n_U$ only; it is independent of $\mathcal{G}$, $p_P(x)$, $p_N(x)$ and $p(x)$ as long as (5) is satisfied. Next, $\alpha_{PU}$ is a monotonic function of each of $\pi$, $n_P$, $n_N$ and $n_U$ when the others are fixed. Furthermore, it is unbounded no matter whether $\pi$ is fixed or not. The properties of $\alpha_{NU}$ are similar, as summarized in Table 1.

|               | no specification |            |            |            | sizes are proportional |                                           |
|               | $\pi$            | $n_P$      | $n_N$      | $n_U$      | $\pi$                  | $\pi$ given $n_P/n_N \approx \pi/(1-\pi)$ |
| $\alpha_{PU}$ | mono. inc.       | mono. dec. | mono. inc. | mono. dec. | mono. inc.             | minimum                                   |

Table 1: Properties of $\alpha_{PU}$ and $\alpha_{NU}$ (those of $\alpha_{NU}$ are analogous with the roles of P and N swapped).

Implications of the monotonicity of $\alpha_{PU}$ are given as follows. Intuitively, when the other factors are fixed, larger $n_P$ or $n_N$ improves $\alpha_{PU}$ or $\alpha_{NU}$ respectively, and larger $n_U$ improves both. However, it is complicated why $\alpha_{PU}$ is monotonically decreasing with $n_P$ and increasing with $\pi$. The weight of the empirical average over $\mathcal{X}_P$ is $\pi/n_P$ in $\hat{R}_{PN}(g)$ but $2\pi/n_P$ in $\hat{R}_{PU}(g)$, as in $\hat{R}_{PU}(g)$ it also joins the estimation of $(1-\pi) R_N(g)$. It makes $\mathcal{X}_P$ more important for $\hat{g}_{PU}$, and thus larger $n_P$ improves $\hat{g}_{PU}$ more than $\hat{g}_{PN}$. Moreover, $R_N(g)$ is directly estimated in $\hat{R}_{PN}(g)$ and the concentration is better if $\pi$ is smaller, whereas it is indirectly estimated through $\mathcal{X}_P$ and $\mathcal{X}_U$ in $\hat{R}_{PU}(g)$, with the error amplified by the coefficient $2\pi$, so the concentration is worse if $\pi$ is larger. As a result, when the sample sizes are fixed, PU learning is more (or less) favorable as $\pi$ decreases (or increases).

A natural question is what the monotonicity of $\alpha_{PU}$ would be if we enforce $n_P$, $n_N$ and $n_U$ to be proportional. To answer this question, we assume $n_P = \zeta_P n$, $n_N = \zeta_N n$ and $n_U = \zeta_U n$, where $\zeta_P$, $\zeta_N$ and $\zeta_U$ are certain constants; then (14) and (15) can be rewritten as

$\alpha_{PU} = \dfrac{2\pi/\sqrt{\zeta_P} + 1/\sqrt{\zeta_U}}{\pi/\sqrt{\zeta_P} + (1-\pi)/\sqrt{\zeta_N}}$ and $\alpha_{NU} = \dfrac{2(1-\pi)/\sqrt{\zeta_N} + 1/\sqrt{\zeta_U}}{\pi/\sqrt{\zeta_P} + (1-\pi)/\sqrt{\zeta_N}}$,

in which $n$ cancels out. As shown in Table 1, $\alpha_{PU}$ is now increasing with $\pi$ and independent of $n$: once the sizes are proportional, only the class prior and the ratios $\zeta_P$, $\zeta_N$ and $\zeta_U$ matter.

Finally, the properties will dramatically change if we enforce that $n_P/n_N = \pi/(1-\pi)$ approximately holds, as in ordinary supervised learning. Under this constraint, we have

$\alpha_{PU} = \dfrac{2\sqrt{\pi} + \sqrt{(n_P + n_N)/n_U}}{\sqrt{\pi} + \sqrt{1-\pi}} \ge \sqrt{2\pi}$,

where the equality is achieved at $\pi = 1/2$ and $n_U \to \infty$. Here, $\alpha_{PU}$ decreases with $\pi$ if $\sqrt{(n_P + n_N)/n_U}\,(\sqrt{1-\pi} - \sqrt{\pi}) > 2$ and increases with $\pi$ otherwise, though it is not convex in $\pi$. Only if $n_U$ is sufficiently larger than $n_P + n_N$ (e.g., $n_U > (n_P + n_N)/(\sqrt{1-\pi} - \sqrt{\pi})^2$ with $\pi < 1/2$), could $\alpha_{PU} < 1$ be possible and $\hat{g}_{PU}$ have a tighter estimation error bound.
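To give a feel for the scale involved, the following snippet (ours) computes, under our reconstruction of the constrained case above, the smallest $n_U$ that makes $\alpha_{PU} < 1$ when $n_P/n_N = \pi/(1-\pi)$:

```python
import numpy as np

def min_nu_for_pu(pi, n_pn):
    """Smallest n_U with alpha_PU < 1 when n_P/n_N = pi/(1-pi) and
    n_pn = n_P + n_N; solves sqrt(pi) + sqrt(n_pn/n_U) < sqrt(1-pi)."""
    gap = np.sqrt(1 - pi) - np.sqrt(pi)
    if gap <= 0:
        return np.inf  # pi >= 1/2: no amount of U data helps the PU bound
    return n_pn / gap**2

print(min_nu_for_pu(0.3, 1000))  # ~ 11973: U data must dwarf the labeled data
```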

3.3 Asymptotic comparisons

In practice, we may find that $\hat{g}_{PU}$ (or $\hat{g}_{NU}$) is worse than $\hat{g}_{PN}$ given the current $\pi$, $n_P$, $n_N$ and $n_U$. This is probably the consequence especially when $n_U$ is not sufficiently larger than $n_P$ and $n_N$. Should we then try to collect much more U data or just give up PU learning? Moreover, if we are able to have as many U data as possible, is there any solution that would be provably better than PN learning?

We answer these questions by asymptotic comparisons. Notice that each pair $(n_P, n_U)$ yields a value of the RHS of (12), each pair $(n_P, n_N)$ yields a value of the RHS of (11), and consequently each triple $(n_P, n_N, n_U)$ determines a value of $\alpha_{PU}$. Define the limits of $\alpha_{PU}$ and $\alpha_{NU}$ as

$\alpha_{PU,\infty} = \lim \alpha_{PU}$ and $\alpha_{NU,\infty} = \lim \alpha_{NU}$.

Recall that $n_P$, $n_N$ and $n_U$ are independent, and we need two conditions for the existence of $\alpha_{PU,\infty}$ and $\alpha_{NU,\infty}$: $n_P$ and $n_N$ grow in the same order, and $n_U$ grows faster in order than them. This is a bit stricter than what is necessary, but it is consistent with a practical assumption: P and N data are roughly equally expensive, whereas U data are much cheaper than P and N data. Intuitively, since $\alpha_{PU}$ and $\alpha_{NU}$ measure the relative qualities of the estimation error bounds of $\hat{g}_{PU}$ and $\hat{g}_{NU}$ against that of $\hat{g}_{PN}$, $\alpha_{PU,\infty}$ and $\alpha_{NU,\infty}$ measure the relative qualities of the limits of those bounds accordingly.

In order to illustrate the properties of $\alpha_{PU,\infty}$ and $\alpha_{NU,\infty}$, assume only $n_U$ approaches infinity while $n_P$ and $n_N$ stay finite, so that

$\alpha_{PU,\infty} = \dfrac{2\pi\sqrt{n_N}}{\pi\sqrt{n_N} + (1-\pi)\sqrt{n_P}}$ and $\alpha_{NU,\infty} = \dfrac{2(1-\pi)\sqrt{n_P}}{\pi\sqrt{n_N} + (1-\pi)\sqrt{n_P}}$.

Thus, $\alpha_{PU,\infty} + \alpha_{NU,\infty} = 2$, which implies $\alpha_{PU,\infty} < 1$ or $\alpha_{NU,\infty} < 1$ unless $\alpha_{PU,\infty} = \alpha_{NU,\infty} = 1$. In principle, this exception should be exceptionally rare, since $n_N/n_P$ is a rational number whereas $(1-\pi)^2/\pi^2$ is a real number. This argument constitutes our second main result.

Theorem 7 (Asymptotic comparisons).

Assume (5) and one set of conditions below are satisfied:

  (a) $n_P < \infty$, $n_N < \infty$ and $n_U \to \infty$. In this case, let $\alpha_\infty = 2\pi\sqrt{n_N} / (\pi\sqrt{n_N} + (1-\pi)\sqrt{n_P})$;

  (b) $n_P \to \infty$ and $n_N \to \infty$ with $n_P/n_N \to \gamma$, and $n_U/(n_P + n_N) \to \infty$. In this case, let $\alpha_\infty = 2\pi / (\pi + (1-\pi)\sqrt{\gamma})$, where $\gamma = \lim n_P/n_N$.

Then, either the limit of the estimation error bounds of $\hat{g}_{PU}$ will improve on that of $\hat{g}_{PN}$ (i.e., $\alpha_{PU,\infty} < 1$) if $\alpha_\infty < 1$, or the limit of the bounds of $\hat{g}_{NU}$ will improve on that of $\hat{g}_{PN}$ (i.e., $\alpha_{NU,\infty} < 1$) if $\alpha_\infty > 1$. The only exception is $\pi\sqrt{n_N} = (1-\pi)\sqrt{n_P}$ in (a) or $\pi = (1-\pi)\sqrt{\gamma}$ in (b).

Proof.

Note that $\alpha_{PU,\infty} + \alpha_{NU,\infty} = 2$ in both cases. The proof of case (a) has been given as an illustration of the properties of $\alpha_{PU,\infty}$ and $\alpha_{NU,\infty}$. The proof of case (b) is analogous. ∎

As a result, when we find that $\hat{g}_{PU}$ (or $\hat{g}_{NU}$) is worse than $\hat{g}_{PN}$ given the current data, we should look at $\alpha_\infty$ defined in Theorem 7. If $\alpha_\infty < 1$, $\hat{g}_{PU}$ is promising and we should collect more U data; if otherwise, we should give up $\hat{g}_{PU}$, but instead $\hat{g}_{NU}$ is promising and we should collect more U data as well. In addition, the gap between $\alpha_\infty$ and one indicates how many U data would be sufficient. If the gap is significant, slightly more U data may be enough; if the gap is slight, significantly more U data may be necessary. In practice, however, U data are cheaper but not free, and we cannot have as many U data as possible. Therefore, Theorem 6 is still of practical importance given limited budgets.
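This decision rule is one line of arithmetic; the sketch below (again our own helper, following Theorem 7(a)) turns it into code:

```python
import numpy as np

def alpha_inf(pi, n_p, n_n):
    """alpha_infinity from Theorem 7(a): the limit of alpha_PU as
    n_U -> infinity with n_P and n_N fixed; alpha_NU tends to 2 - alpha_inf."""
    return 2 * pi * np.sqrt(n_n) / (pi * np.sqrt(n_n) + (1 - pi) * np.sqrt(n_p))

a = alpha_inf(pi=0.4, n_p=500, n_n=500)  # equal sizes: alpha_inf = 2*pi = 0.8
print("collect more U data for", "PU" if a < 1 else "NU", "learning")
```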

3.4 Remarks

Theorem 2 relies on a fundamental lemma of the uniform deviation from the risk estimator $\hat{R}_{PU}(g)$ to the risk $R(g)$:

Lemma 8.

For any $0 < \delta < 1$, with probability at least $1 - \delta$,

$\sup_{g \in \mathcal{G}} |\hat{R}_{PU}(g) - R(g)| \le 4\pi L_\ell \mathfrak{R}_{n_P, p_P}(\mathcal{G}) + 2 L_\ell \mathfrak{R}_{n_U, p}(\mathcal{G}) + 2\pi \sqrt{\frac{\ln(2/\delta)}{2 n_P}} + \sqrt{\frac{\ln(2/\delta)}{2 n_U}}$.

In Lemma 8, $R(g)$ is w.r.t. $p(x, y)$, though $\hat{R}_{PU}(g)$ is w.r.t. $p_P(x)$ and $p(x)$. The Rademacher complexities are also w.r.t. $p_P(x)$ and $p(x)$, and they can be bounded easily for $\mathcal{G}$ defined in Eq. (6).

Theorems 6 and 7 rely on (5). Thanks to it, we can simplify Theorems 2, 3 and 4. In fact, (5) holds not only for the special case of $\mathcal{G}$ defined in (6), but also for the vast majority of discriminative models in machine learning that are nonlinear in their parameters, such as decision trees (cf. Theorem 17 in [Bartlett and Mendelson, 2002]) and feedforward neural networks (cf. Theorem 18 in [Bartlett and Mendelson, 2002]).

Theorem 2 in [du Plessis et al., 2014] is a similar bound of the same order as our Lemma 8. That theorem is based on a tricky decomposition of the risk, where the surrogate loss used for risk analysis differs from the one used for risk minimization and the labels of $\mathcal{X}_U$ are needed for risk evaluation, so that no further bound is implied. Lemma 8 uses the same $\ell$ as risk minimization and requires no labels of $\mathcal{X}_U$ for evaluating $R(g)$, so that it can serve as the stepping stone to our estimation error bound in Theorem 2.

4 Experiments

In this section, we experimentally validate our theoretical findings.

Figure 1: Theoretical and experimental results based on artificial data. Panels: (a) theoretical ($n_U$ varying), (b) experimental ($n_U$ varying), (c) theoretical ($\pi$ varying), (d) experimental ($\pi$ varying).
Artificial data

Here, $\mathcal{X}_P$, $\mathcal{X}_N$ and $\mathcal{X}_U$ are in $\mathbb{R}^d$ and drawn from the three marginal densities

$p_P(x) = \mathcal{N}(x; \mathbf{1}_d, I_d)$, $p_N(x) = \mathcal{N}(x; -\mathbf{1}_d, I_d)$ and $p(x) = \pi p_P(x) + (1-\pi) p_N(x)$,

where $\mathcal{N}(x; \mu, \Sigma)$ is the normal distribution with mean $\mu$ and covariance $\Sigma$, and $\mathbf{1}_d$ and $I_d$ are the all-one vector and identity matrix of size $d$. The test set contains one million data drawn from $p(x, y)$.

The linear model $g(x) = \langle w, x \rangle + b$ with $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$, together with the scaled ramp loss $\ell_{SR}$, is employed. In addition, an $\ell_2$-regularizer is added with the regularization parameter held fixed, and there is no hard constraint on $\|w\|$ or $\|\phi(x)\|$ as in Eq. (6). The solver for minimizing the three regularized risk estimators comes from [du Plessis et al., 2014] (refer also to [Collobert et al., 2006, Yuille and Rangarajan, 2001] for the optimization technique).
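For readers who want to reproduce the flavor of this experiment without that solver, here is a self-contained sketch that draws PU data from the setup above and minimizes the regularized $\hat{R}_{PU}$ by plain subgradient descent; this simplification (and all the specific sizes and hyperparameters) is ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
d, pi, n_p, n_u = 2, 0.5, 100, 1000

# PU training data: x^P ~ N(+1_d, I), x^U ~ pi*N(+1_d, I) + (1-pi)*N(-1_d, I).
x_p = rng.normal(1.0, 1.0, size=(n_p, d))
y_u = rng.random(n_u) < pi
x_u = rng.normal(np.where(y_u[:, None], 1.0, -1.0), 1.0, size=(n_u, d))

def ramp_grad(t, y):
    """(Sub)gradient of l_SR(t, y) = 0.5*max(0, min(2, 1 - t*y)) w.r.t. t."""
    return np.where((y * t > -1) & (y * t < 1), -0.5 * y, 0.0)

w, b, lam, lr = np.zeros(d), 0.0, 1e-4, 0.1
for _ in range(2000):
    g_p = ramp_grad(x_p @ w + b, +1)   # P data treated as positive
    g_u = ramp_grad(x_u @ w + b, -1)   # U data treated as negative, Eq. (3)
    grad_w = 2 * pi * x_p.T @ g_p / n_p + x_u.T @ g_u / n_u + 2 * lam * w
    grad_b = 2 * pi * g_p.mean() + g_u.mean()
    w, b = w - lr * grad_w, b - lr * grad_b

# Evaluate the misclassification rate on fresh labeled test data.
y_t = rng.random(100000) < pi
x_t = rng.normal(np.where(y_t[:, None], 1.0, -1.0), 1.0, size=(100000, d))
print("test error:", np.mean((x_t @ w + b > 0) != y_t))
```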

The results are reported in Figure 1. In Figures 1(a) and 1(b), $\pi$, $n_P$ and $n_N$ are fixed while $n_U$ varies; in Figures 1(c) and 1(d), $n_P$, $n_N$ and $n_U$ are fixed while $\pi$ varies. Specifically, Figure 1(a) shows $\alpha_{PU}$ and $\alpha_{NU}$ as functions of $n_U$, and Figure 1(c) shows them as functions of $\pi$. For the experimental results, $\hat{g}_{PN}$, $\hat{g}_{PU}$ and $\hat{g}_{NU}$ were trained based on repeated random samplings for every $n_U$ in Figure 1(b) and every $\pi$ in Figure 1(d), and means with standard errors of the misclassification rates are shown, as $\ell_{SR}$ is classification-calibrated. Note that the empirical misclassification rates are essentially the risks w.r.t. $\ell_{01}$, as there were one million test data, and the fluctuations are attributed to the non-convex nature of $\ell_{SR}$. Also, the curve of $\hat{g}_{PN}$ is not a flat line in Figure 1(b), since its training data at every $n_U$ were exactly the same as the P and N training data of $\hat{g}_{PU}$ and $\hat{g}_{NU}$ for fair experimental comparisons.

In Figure 1, the theoretical and experimental results are highly consistent: the red and blue curves intersect at nearly the same positions in Figures 1(a) and 1(b) and in Figures 1(c) and 1(d), even though the risk minimizers in the experiments were locally optimal and regularized, making our estimation error bounds inexact.

| dataset | banana | phoneme | magic | image | german | twonorm | waveform | spambase | coil2 |
| dim     | 2      | 5       | 10    | 18    | 20     | 20      | 21       | 57       | 241   |
| size    | 5300   | 5404    | 19020 | 2086  | 1000   | 7400    | 5000     | 4597     | 1500  |
| P ratio | .448   | .293    | .648  | .570  | .300   | .500    | .329     | .394     | .500  |

Table 2: Specification of benchmark datasets.
Figure 2: Experimental results based on benchmark data by varying $n_U$. Panels: (a) theoretical values; (b)-(j) results on banana, phoneme, magic, image, german, twonorm, waveform, spambase and coil2.
Figure 3: Experimental results based on benchmark data by varying $\pi$, with the same panel layout as Figure 2.
Benchmark data

Table 2 summarizes the specification of the benchmark datasets, which were downloaded from several sources, including the IDA benchmark repository [Rätsch et al., 2001], the UCI machine learning repository, the semi-supervised learning book [Chapelle et al., 2006], and the European ESPRIT 5516 (ELENA) project.³ In Table 2, the three rows describe the number of features, the number of data, and the ratio of P data according to the true class labels. Given a random sampling of $\mathcal{X}_P$, $\mathcal{X}_N$ and $\mathcal{X}_U$, the test set contains all the remaining data if there are fewer than a fixed maximum size, or else is drawn uniformly from the remaining data up to that size.

³See http://www.raetschlab.org/Members/raetsch/benchmark/ for IDA, http://archive.ics.uci.edu/ml/ for UCI, http://olivier.chapelle.cc/ssl-book/ for the SSL book and https://www.elen.ucl.ac.be/neural-nets/Research/Projects/ELENA/ for the ELENA project.

For benchmark data, the linear model used for the artificial data is not flexible enough, and its kernel version $g(x) = \langle w, \phi(x) \rangle + b$ is employed. Consider training $\hat{g}_{PU}$ for example. Given a random sampling,