1 Introduction
In real-world applications such as spam filtering (Drucker et al., 1999) and medical diagnosis (Huang et al., 2010), the losses of misclassifying a positive instance and a negative instance can be rather different. For instance, in medical diagnosis, misdiagnosing a patient as healthy is more dangerous than misclassifying a healthy person as sick. Meanwhile, in reality, it is often very difficult to define accurate costs for these two kinds of errors (Liu and Zhou, 2010; Zhou and Zhou, 2016). In such situations, it is more desirable to keep the classifier working under a small tolerance of false-positive rate (FPR), i.e., to allow the classifier to misclassify no more than a given fraction of negative instances. Traditional classifiers trained by maximizing classification accuracy or AUC are not suitable due to this mismatched goal.
In the literature, classification under a constrained false-positive rate is known as the Neyman-Pearson (NP) classification problem (Scott and Nowak, 2005; Lehmann and Romano, 2006; Rigollet and Tong, 2011), and existing approaches can be roughly grouped into several categories. One common approach is cost-sensitive learning, which assigns different costs to different classes; representatives include cost-sensitive SVM (Osuna et al., 1997; Davenport et al., 2006, 2010), cost-interval SVM (Liu and Zhou, 2010) and cost-sensitive boosting (Masnadi-Shirazi and Vasconcelos, 2007, 2011). Though effective and efficient in handling different misclassification costs, it is usually difficult to find the appropriate misclassification cost for a specific FPR tolerance. Another group of methods formulates this problem as a constrained optimization problem with the FPR tolerance as an explicit constraint (Mozer et al., 2002; Gasso et al., 2011; Mahdavi et al., 2013). These methods often need to find the saddle point of the Lagrangian function, leading to time-consuming alternating optimization. Moreover, a surrogate loss is often used to simplify the optimization problem, possibly leaving the tolerance constraint unsatisfied in practice. The third line of research is scoring-thresholding methods, which first train a scoring function and then find a threshold to meet the target FPR tolerance (Drucker et al., 1999). In practice, the scoring function can be trained by either class-conditional density estimation (Tong, 2013) or bipartite ranking (Narasimhan and Agarwal, 2013b). However, density estimation is itself another difficult problem, and most bipartite ranking methods are less scalable due to super-linear training complexity. Additionally, there are methods paying special attention to the positive class. For example, asymmetric SVM (Wu et al., 2008) maximizes the margin between negative samples and the core of positive samples, and one-class SVM (Ben-Hur et al., 2001) finds the smallest ball enclosing the positive samples. However, these methods do not incorporate the FPR tolerance into the learning procedure either.

In this paper, we address the tolerance-constrained learning problem by proposing False Positive Learning (FPL). Specifically,
FPL is a scoring-thresholding method. In the scoring stage, we explicitly learn a ranking function which optimizes the probability of ranking any positive instance above the centroid of the worst fraction of negative instances. We then show that, with the help of our newly proposed Euclidean projection algorithm, this ranking problem can be solved in linear time under the projected gradient framework. It is worth noting that this Euclidean projection problem generalizes a large family of projection problems, and our proposed linear-time algorithm based on bisection and divide-and-conquer is one to three orders of magnitude faster than existing state-of-the-art methods. In the thresholding stage, we devise an out-of-bootstrap thresholding method to transform the aforementioned ranking function into a low false-positive classifier. This method is much less prone to overfitting than existing thresholding methods. Theoretical analysis and experimental results show that the proposed method achieves superior performance over existing approaches.

2 From Constrained Optimization to Ranking
In this section, we show that the FPR-tolerance problem can be transformed into a ranking problem, and then formulate a convex ranking loss that is tighter than existing relaxation approaches.
Let be the instance space for some norm and be the label set, and let be a set of training instances, where and contain and instances independently sampled from distributions and , respectively. Let be the maximum tolerance of the false-positive rate. Consider the following Neyman-Pearson classification problem, which minimizes the false-negative rate of the classifier under a constraint on the false-positive rate:
(1) 
where is a scoring function and is a threshold. With finite training instances, the corresponding empirical risk minimization problem is
(2) 
where is the indicator function.
The empirical version of the optimization problem is difficult to handle due to the presence of a non-continuous constraint; we introduce an equivalent form below.
Proposition 1.
Define to be the th largest value in a multiset and to be the floor function. The constrained optimization problem (2) shares the same optimal solution and optimal objective value with the following ranking problem:
(3) 
Proof.
Proposition 1 reveals the connection between the constrained optimization problem (2) and the ranking problem (3). Intuitively, problem (3) compares each positive sample with the th largest negative sample. We give a further interpretation of its form: it is equivalent to maximizing the partial AUC near the risk area.
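To make the comparison concrete, the empirical 0-1 loss of problem (3) can be sketched as follows (function and variable names are hypothetical; the tolerance is written as `alpha`, and the reference negative score is the k-th largest with k the floor of `alpha` times the number of negatives):

```python
import numpy as np

def ranking_loss_3(pos_scores, neg_scores, alpha):
    """Empirical 0-1 loss of problem (3): the fraction of positive
    instances scored no higher than the k-th largest negative score,
    where k = floor(alpha * n_neg), clipped to at least 1."""
    n_neg = len(neg_scores)
    k = max(1, int(np.floor(alpha * n_neg)))
    kth_neg = np.sort(neg_scores)[-k]  # k-th largest negative score
    return np.mean(pos_scores <= kth_neg)
```

For instance, with four negative scores and alpha = 0.25, each positive score is compared against the single largest negative score.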
Although it may seem that partial-AUC optimization considers fewer samples than the full AUC, optimizing (3) is intractable even if we replace by a convex loss, since the operation is non-convex when . Indeed, Theorem 1 further shows that whichever is chosen, even for some weak hypothesis set of , the corresponding optimization problem is NP-hard.
Definition 1.
A surrogate function of is a continuous and non-increasing function satisfying (i) , ; (ii) , ; (iii) as .
Theorem 1.
For fixed , and surrogate function , optimization problem 
with hyperplane hypothesis set
is NP-hard.

This intractability motivates us to consider the following upper-bound approximation of (3):
(4) 
which prefers scores on positive examples to exceed the mean of the scores on the worst proportion of negative examples. If , optimization problem (4) is equivalent to the original problem (3). In general, equality also holds when all the scores of the worst proportion of negative examples are equal.
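A minimal sketch of the relaxed loss (4), under the same hypothetical naming: each positive score is compared against the mean (centroid) of the top `floor(alpha * n_neg)` negative scores. Since the mean of the top-k values is at least the k-th largest value, this 0-1 quantity upper-bounds the loss in (3):

```python
import numpy as np

def relaxed_loss_4(pos_scores, neg_scores, alpha):
    """0-1 version of relaxation (4): a positive instance counts as an
    error if its score does not exceed the centroid of the worst
    (highest-scoring) alpha-fraction of negative instances."""
    n_neg = len(neg_scores)
    k = max(1, int(np.floor(alpha * n_neg)))
    top_k_mean = np.mean(np.sort(neg_scores)[-k:])  # centroid of worst k
    return np.mean(pos_scores <= top_k_mean)
```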
The advantages of considering ranking problem (4) include (details given later):

For an appropriate hypothesis set of , replacing by its convex surrogate produces a tight convex upper bound of the original minimization problem (3), in contrast to cost-sensitive classification, which may only offer an insecure lower bound. This upper bound is also tighter than the convex relaxation of the original constrained minimization problem (2) (see Proposition 2);

By designing an efficient learning algorithm, we can find the globally optimal solution of this ranking problem in linear time, which makes it well suited for large-scale scenarios and outperforms most traditional ranking algorithms;

The FPR tolerance is taken into account explicitly, and no additional cost hyperparameters are required;

The generalization performance of the optimal solution can be theoretically established.
2.1 Comparison to Alternative Methods
Before introducing our algorithm, we briefly review some related approaches to FPR-constrained learning, and show that our approach can be seen as a tighter upper bound while maintaining linear training complexity.
Cost-Sensitive Learning One alternative approach to eliminating the constraint in (2) is to approximate it by introducing asymmetric costs for the different types of error into the classification learning framework:
where is a hyperparameter that penalizes false-positive instances. Although reasonable for handling different misclassification costs, we point out that methods under this framework in fact minimize a lower bound of problem (2). This can be verified by formulating (2) in its unconstrained form
Thus, for a fixed , minimizing is equivalent to minimizing a lower bound of . In other words, cost-sensitive learning methods are insecure in this setting: a mismatched can easily violate the constraint on the FPR or make it excessively strict, and the optimal for a specified is only known after solving the original constrained problem, which can be proved NP-hard.
In practice, one also has to make a continuous approximation of the constraint in (2) for tractable cost-sensitive training; this multi-level approximation makes the relationship between the new problem and the original one unclear, and blocks any straightforward theoretical justification.
Constrained Optimization The main difficulty of employing Lagrangian-based methods to solve problem (2) lies in the fact that both the constraint and the objective are non-continuous. To make the problem tractable while satisfying the FPR constraint strictly, the standard approach replaces with its convex surrogate function (see Definition 1) to obtain a convex constrained optimization problem.[1] Interestingly, Proposition 2 shows that whichever surrogate function and hypothesis set are chosen, the resulting constrained optimization problem is a weaker upper bound than that of our approach.

[1] For precisely approximating the indicator function in the constraint, choosing a "bounded" loss such as the sigmoid or ramp loss (Collobert et al., 2006; Gasso et al., 2011) seems appealing. However, bounded functions always bring about non-convexity, and the corresponding problems are usually NP-hard (Yu et al., 2012). These difficulties limit both efficient training and effective theoretical guarantees for the learned model.
Proposition 2.
For any nonempty hypothesis set and , convex surrogate function , let
we have
Thus, in general, there is a nonzero gap between our risk and the convex constrained optimization risk . In practice one may prefer the tighter approximation since it represents the original objective better. Moreover, our Theorem 5 also achieves a tighter bound on the generalization error rate by considering the empirical risk .
Ranking-Thresholding Traditional bipartite ranking methods usually have super-linear training complexity. Compared with them, the main advantage of our algorithm comes from its linear-time complexity in each iteration without sacrificing the convergence rate (please refer to Table 1). We name our algorithm FPL, and give a detailed description of how it works in the following sections.
3 Tolerance-Constrained False Positive Learning
Based on the previous discussion, our method is divided into two stages, namely scoring and thresholding. In the scoring stage, a function is learned to maximize the probability of assigning a higher score to positive instances than to the centroid of the top fraction of the negative instances. In the thresholding stage, a suitable threshold is chosen, and the final prediction for an instance is obtained by
(5) 
3.1 Tolerance-Constrained Ranking
In (4), we consider a linear scoring function , where is the weight vector to be learned, add regularization, and replace by one of its convex surrogates . Kernel methods can be used for nonlinear ranking functions. As a result, the learning problem is

(6)
where is the regularization parameter, and .
Directly minimizing (6) can be challenging due to the operator; we address this by considering its dual.
Theorem 2.
According to Theorem 2, learning the scoring function is equivalent to learning the dual variables and by solving problem (7). This optimization naturally falls within the scope of projected gradient methods. Here we choose where due to the simplicity of its conjugate. The key steps are summarized in Algorithm 1. At each iteration, we first update the solution using the gradient of the objective function , and then project the dual solution onto the feasible set . In the sequel, we will show that this projection problem can be efficiently solved in linear time. In practice, since is smooth, we also leverage Nesterov's method to further accelerate the convergence of our algorithm. Nesterov's method (Nesterov, 2003) achieves an accelerated convergence rate for smooth objective functions, where is the number of iterations.
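As an illustration only (the actual objective, feasible set, and step sizes of Algorithm 1 are not spelled out here), a generic Nesterov-accelerated projected gradient loop has the following shape, with the projection supplied as a black box:

```python
import numpy as np

def accelerated_projected_gradient(grad, project, x0, step, n_iter=100):
    """FISTA-like accelerated projected gradient sketch.
    grad    : gradient of the smooth objective
    project : Euclidean projection onto the feasible set
    step    : step size (e.g. 1/L for an L-smooth objective)"""
    x_prev = x0.copy()
    y = x0.copy()
    t_prev = 1.0
    for _ in range(n_iter):
        x = project(y - step * grad(y))               # gradient step + projection
        t = (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2)) / 2.0
        y = x + ((t_prev - 1.0) / t) * (x - x_prev)   # momentum extrapolation
        x_prev, t_prev = x, t
    return x_prev
```

As a toy usage, minimizing a quadratic over the nonnegative orthant converges to the clipped unconstrained minimizer.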
3.2 Linear-Time Projection onto the Top-k Simplex
One of our main technical results is a linear-time projection algorithm onto , even when is close to . For clarity of notation, we reformulate the projection problem as
(9) 
It should be noted that many Euclidean projection problems studied in the literature can be seen as special cases of this problem. If the term is fixed, or replaced by a constant upper bound , we obtain the well-studied continuous quadratic knapsack problem (CQKP)
where . Several efficient methods based on median-selecting or variable-fixing techniques are available (Patriksson, 2008). On the other hand, if , all upper-bound constraints are automatically satisfied and can be omitted. This special case has been studied, for example, in (Liu and Ye, 2009) and (Li et al., 2014), both of which achieve linear complexity.
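For intuition, here is the classic sort-based projection onto the simplex-like set {x >= 0, sum(x) = s}, the special case with no active upper bounds. The cited linear-time algorithms replace the full sort with median selection, but the optimality condition being solved is the same; this sketch is not the paper's algorithm:

```python
import numpy as np

def project_simplex(v, s=1.0):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = s}.
    Simple O(n log n) sort-based variant; O(n) methods use median
    selection instead of a full sort."""
    u = np.sort(v)[::-1]                # sort in decreasing order
    css = np.cumsum(u)
    # largest index rho with u[rho] + (s - css[rho]) / (rho + 1) > 0
    rho = np.nonzero(u + (s - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - s) / (rho + 1.0)  # shift making the sum equal s
    return np.maximum(v - theta, 0.0)
```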
Unfortunately, none of the above methods can be directly applied to the generalized case (9), due to its unfixed upper-bound constraint on when . To our knowledge, the only attempt to address the problem of an unfixed upper bound is (Lapin et al., 2015). They solve a similar (but simpler) problem
based on sorting and exhaustive search; their method has super-linear runtime complexity and is even quadratic when and are linearly dependent. By contrast, our proposed method can be applied to both of the aforementioned special cases with minor changes and retains linear complexity. The notable characteristic of our method is the efficient combination of bisection and divide-and-conquer: the former guarantees the worst-case complexity, and the latter significantly reduces the large constant factor of the bisection method.
We first introduce the following theorem, which gives a detailed description of the solution for (9).
Theorem 3.
is the optimal solution of (9) if and only if there exist dual variables , , satisfying the following system of linear constraints:
(10)  
(11)  
(12) 
and , .
Based on Theorem 3, the projection problem can be solved by finding the values of the three dual variables , and that satisfy the above linear system. Here we first propose a basic bisection method which guarantees the worst-case time complexity. A similar method has also been used in (Liu and Ye, 2009). For brevity, we denote by and the largest dimensions in and respectively, and define the functions , , and as follows:[2]

[2] Indeed, for some , is not single-valued and thus needs a more precise definition. We omit it here for brevity, and leave the details to Section 6.6.
(13)  
(14)  
(15)  
(16) 
The main idea of leveraging bisection to solve the system in Theorem 3 is to find the root of . In order to make bisection work, we need three conditions: should be continuous; the root of should be efficiently bracketed in an interval; and the values of at the two endpoints of this interval should have opposite signs. Fortunately, based on the following three lemmas, all of these requirements can be satisfied.
Lemma 1.
(Zero case) is an optimal solution of (9) if and only if .
Lemma 2.
(Bracketing ) If , .
Lemma 3.
(Monotonicity and convexity)

is convex, continuous and strictly decreasing in , ;

is continuous, monotonically decreasing in ;

is continuous, strictly increasing in ;

is continuous, strictly increasing in .
Furthermore, we can define the inverse function of as , and rewrite as:
(17) 
it is a convex function of , strictly decreasing in .
Lemma 1 deals with the special case of . Lemmas 2 and 3 jointly ensure that bisection works when : Lemma 2 bounds , and Lemma 3 shows that is continuous, and since it is also strictly increasing, the values of at the two endpoints must have opposite signs.
Basic method: bisection & leveraging convexity We start by selecting the current in the range . We then compute the corresponding by (13) in , and use the current to compute by (14). Computing can be completed in by a well-designed median-selecting algorithm (Kiwiel, 2007). With the current (i.e., updated) , and in hand, we can evaluate the sign of in and determine the new bounds of . In addition, the special case of can be checked using Lemma 1 in by a linear-time k-largest-element selection algorithm (Kiwiel, 2005). Since the bound on is independent of and , the number of iterations needed to find is , where is the maximum tolerated error. Thus, the worst-case runtime of this algorithm is . Furthermore, we also leverage the convexity of and to further improve this algorithm; please refer to (Liu and Ye, 2009) for more details on related techniques.
Although bisection solves the projection in linear time, it may suffer from a slow convergence rate. We further improve the runtime by reducing the constant factor . This technique benefits from exploiting the monotonicity of the functions , , and stated in Lemma 3. Notice that our method can also be used for finding the root of an arbitrary piecewise-linear and monotone function, without any convexity requirement.
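The bisection core described above can be sketched generically; this routine finds the root of any continuous monotone (e.g., piecewise-linear) function given a bracketing interval with a sign change, which is all the basic method requires (names hypothetical):

```python
def bisect_root(g, lo, hi, tol=1e-10, max_iter=200):
    """Find a root of a continuous monotone function g on [lo, hi],
    assuming g(lo) and g(hi) have opposite signs. Each iteration halves
    the bracket, giving O(log((hi - lo) / tol)) iterations."""
    g_lo = g(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if abs(g_mid) <= tol or hi - lo <= tol:
            return mid
        if (g_lo < 0) == (g_mid < 0):   # root lies in the right half
            lo, g_lo = mid, g_mid
        else:                            # root lies in the left half
            hi = mid
    return 0.5 * (lo + hi)
```

The same routine works for increasing and decreasing functions, matching the monotonicity cases of Lemma 3.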
Improved method: endpoint divide & conquer Lemma 3 reveals an important chain of monotonicity between the dual variables, which can be used to improve the performance of our baseline method. The key steps are summarized in Algorithm 2. Denote the value of a variable in iteration as . For instance, if , from Lemma 3 we have , and . This implies that we can set uncertainty intervals for , , and . As the interval of shrinks, the lengths of these four intervals are reduced simultaneously. On the other hand, notice that is in fact a piecewise-linear function (with at most segments), so computing its value only requires a comparison between and all of the s. By keeping a cache of the s and discarding in advance those elements which fall outside the current bounds of , in each iteration we can reduce the expected number of comparisons by half. A similar but more involved procedure can also be applied for computing , and , because these functions are also piecewise linear and their main cost is the comparison with endpoints. As a result, for an approximately linear function with evenly distributed breakpoints, if the first iteration of bisection costs time, the overall runtime of the projection algorithm will be , which is much less than that of the original bisection algorithm, .
3.3 Convergence and Computational Complexity
Following immediately from the convergence result of Nesterov’s method, we have:
Theorem 4.
Let and be the output of the FPL algorithm after T iterations; then , where .
Table 1: Training and validation complexity of compared methods.

Algorithm          Training   Validation
FPL                Linear
TopPush            Linear
CSSVM              Quadratic
                   Linear
Bipartite Ranking  Linear
Finally, the computational cost of each iteration is dominated by the gradient evaluation and the projection step. Since the complexity of the projection step is and the cost of computing the gradient is , combining these with Theorem 4 we have that, to find a suboptimal solution, the total computational complexity of FPL is . Table 1 compares the computational complexity of FPL with that of several state-of-the-art methods. The order of validation complexity corresponds to the number of hyperparameters. From this, it is easy to see that FPL is asymptotically more efficient.
3.4 Out-of-Bootstrap Thresholding
In the thresholding stage, the task is to identify the boundary between the positive instances and the worst fraction of the negative instances. Though thresholding on the training set is commonly used (Joachims, 1996; Davenport et al., 2010; Scheirer et al., 2013), it may result in overfitting. Hence, we propose an out-of-bootstrap method to find a more accurate and stable threshold. In each round, we randomly split the training set into two sets and , train on one, and select the threshold on the other. The procedure can be run for multiple rounds to make use of all the training data. Once the process is completed, we obtain the final threshold by averaging. The final scoring function, in turn, can be obtained in two ways: learn a scoring function on the full set of training data, or gather the weights learned in the previous rounds and average them. This method combines the advantages of out-of-bootstrap estimation and soft-thresholding: accurate error estimation and reduced variance with little sacrifice in bias, and it thus fits the setting of thresholding near the risk area.
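The procedure above can be sketched as follows. All names are hypothetical, and the held-out threshold rule (the (1 - alpha)-quantile of held-out negative scores) is an assumed concrete choice, not necessarily the paper's exact rule:

```python
import numpy as np

def oob_threshold(X, y, alpha, fit_scorer, n_rounds=10, seed=0):
    """Out-of-bootstrap thresholding sketch. In each round: split the
    training data in half, fit the scoring function on one half, and set
    the threshold on the held-out half as the (1 - alpha)-quantile of its
    negative scores, so roughly an alpha-fraction of held-out negatives
    score above it. The final threshold is the average over rounds."""
    rng = np.random.default_rng(seed)
    thresholds = []
    for _ in range(n_rounds):
        idx = rng.permutation(len(y))
        half = len(y) // 2
        train_idx, hold_idx = idx[:half], idx[half:]
        scorer = fit_scorer(X[train_idx], y[train_idx])  # train scoring fn
        hold_neg = X[hold_idx][y[hold_idx] == -1]        # held-out negatives
        neg_scores = scorer(hold_neg)
        thresholds.append(np.quantile(neg_scores, 1.0 - alpha))
    return float(np.mean(thresholds))
```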
4 Theoretical Guarantees
We now develop a theoretical guarantee for the scoring function, which bounds the probability of assigning any positive instance a higher score than a proportion of the negative instances. To this end, we first define , the probability of any negative instance being ranked above using , i.e., , and then measure the quality of by , the probability of assigning any positive instance a lower score than a fraction of the negative instances. The following theorem bounds by the empirical loss .
Theorem 5.
Given training data consisting of independent instances from distribution and independent instances from distribution , let be the optimal solution of problem (6). Assume and . Then, for proper and any , with probability at least ,
(18) 
where , and .
Theorem 5 implies that if is upper bounded by , the probability of ranking any positive sample below a fraction of the negative samples is also bounded by . If approaches infinity, becomes close to 0, which means that in this case, by thresholding at a suitable point, we can almost ensure that the true-positive rate approaches 1. Moreover, we observe that and play different roles in this bound. For instance, it is well known that the largest absolute value of Gaussian random instances grows as . Thus we believe that the growth of only slightly affects both the largest and the centroid of the top-proportion scores of the negative samples. This leads to the conclusion that increasing only slightly raises , but significantly reduces the margin between the target and . On the other hand, increasing reduces the upper bound of , thus increasing the chance of finding positive instances at the top. In sum, and control and , respectively.
5 Experimental Results
5.1 Effectiveness of the Linear-Time Projection
We first demonstrate the effectiveness of our projection algorithm. Following the settings of (Liu and Ye, 2009), we randomly draw 1000 samples from the normal distribution and solve the projection problem. The compared method is ibis (Liu and Ye, 2009), an improved bisection algorithm which also makes use of convexity and monotonicity. All experiments are run on an Intel Core i5 processor. As shown in Fig. 2, thanks to the efficient reduction of the constant factor, our method outperforms ibis, saving a substantial fraction of the running time in the limit case. We also solve the projection problem proposed in (Lapin et al., 2015) using a simplified version of our method, and compare it with the method presented in (Lapin et al., 2015) (PTkC), whose complexity is super-linear. As one can observe from Fig. 2(b), our method has linear complexity with respect to the problem size and does not suffer from the growth of . In the limit case (both large and ), it is more than three orders of magnitude faster than the competitors.
5.2 Ranking Performance
Table 2: Ranking performance under various FPR tolerances (%).

Datasets (pos/neg, dim): heart (120/150, d:13); spambase (1813/2788, d:57); real-sim (22238/50071, d:20958); w8a (2933/62767, d:300).
Tolerance levels: heart {5, 10}; spambase {0.1, 0.5, 1, 5, 10}; real-sim {0.01, 0.1, 1, 5, 10}; w8a {0.05, 0.1, 0.5, 1, 5, 10}.

CSSVM:   .526 .691 .109 .302 .811 .920 .376 .921 .972 .990 .501 .520 .649 .695 .828 .885
TopPush: .711 .112 .303 .484 .774 .845 .747 .920 .968 .983 .627 .656 .761 .842
(partial-AUC method): .509 .728, N/A for the remaining settings
Rank:
2Rank:
Next, we validate the ranking performance of our FPL method, i.e., scoring and sorting test samples, and then evaluate the proportion of positive samples ranked above a proportion of negative samples. Considering ranking performance independently avoids the practical problem of mismatching the constraint in (2) on the test set, and always offers us the optimal threshold.
Specifically, we choose (3) as the evaluation and validation criterion. Compared methods include cost-sensitive SVM (CSSVM) (Osuna et al., 1997), which has been shown to be a lower-bound approximation of (3); TopPush (Li et al., 2014), a ranking method which focuses on optimizing the absolute top of the ranking list and is also a special case of our model (); and (Narasimhan and Agarwal, 2013a), a more general method designed for optimizing an arbitrary partial AUC. We test two versions of our algorithm, Rank and Rank, which correspond to different choices of in the learning scheme. Intuitively, enlarging in the training phase can be seen as a top-down approximation, from the upper bound toward the original objective (2). On the other hand, the reason for choosing is that, roughly speaking, the average score of the top proportion of negative samples may be close to the score of the th largest negative sample.
Settings. We evaluate the performance on public benchmark datasets from different domains and of various sizes.[3] For small-scale datasets ( instances), 30 stratified holdout tests are carried out, with part of the data as the training set and the rest as the test set. For large datasets, we instead run 10 rounds. In each round, hyperparameters are chosen by 5-fold cross-validation over a grid, and the search scope is extended if the optimum lies on the boundary.

[3] https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary
Results. Table 2 reports the experimental results. We note that in most cases, our proposed method outperforms the other peer methods. This confirms the theoretical analysis that our methods can better exploit the capacity of the model. TopPush is highly competitive in the case of extremely small , but gradually loses its advantage as increases. The algorithm of is based on cutting-plane methods with an exponential number of constraints; similar techniques are also used in many other ranking or structured prediction methods, e.g., structured SVM (Tsochantaridis et al., 2005). The time complexity of such methods is , and we found that even for thousands of training samples, it is hard to finish the experiments in the allowed time.
5.3 Overall Classification Accuracy
Table 3: NP-score comparison. Compared methods: BSSVM, CSLR, CSSVM, CSSVM-OOB, FPL, FPL.

Dataset (pos/neg, dim)            Tolerance (%)
heart (120/150, d:13)             5, 10
breast-cancer (239/444, d:10)     1, 5, 10
spambase (1813/2788, d:57)        0.5, 1, 5, 10
real-sim (22238/50071, d:20958)   0.01, 0.1, 0.5, 1, 5, 10
w8a (1933/62767, d:123)           0.05, 0.1, 0.5, 1, 5, 10
In this section we compare the performance of the different models when jointly learning the scoring function and the threshold in the training phase, i.e., outputting a classifier. To evaluate a classifier under the maximum tolerance, we use the Neyman-Pearson score (NP-score) (Scott, 2007). The NP-score is defined in terms of the false-positive rate and true-positive rate of the classifier and the maximum tolerance. This measure punishes classifiers whose false-positive rates exceed the tolerance, and the punishment becomes higher as the tolerance shrinks.
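The exact NP-score formula is not recoverable from this text; a commonly used form in the NP-classification literature, given here purely as an illustrative assumption, adds to the false-negative rate a penalty, scaled by 1/alpha, for exceeding the FPR tolerance:

```python
def np_score(fpr, fnr, alpha):
    """Hypothetical NP-score sketch (assumed form, not necessarily the
    paper's): fnr + max(fpr - alpha, 0) / alpha. Classifiers within the
    tolerance are judged by FNR alone; violations are penalized more
    heavily as alpha shrinks. Lower is better."""
    return fnr + max(fpr - alpha, 0.0) / alpha
```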
Settings. We use a similar setting for classification as for the ranking experiments, i.e., for small-scale datasets, 30 stratified holdout tests are carried out; for large datasets, we instead run 10 rounds. Comparison baselines include: cost-sensitive logistic regression (CSLR), which chooses a surrogate function different from that of CSSVM; bias-shifting support vector machine (BSSVM), which first trains a standard SVM and then tunes the threshold to meet the specified false-positive rate; and cost-sensitive SVM (CSSVM). For a complete comparison, we also construct a CSSVM with our out-of-bootstrap thresholding (CSSVM-OOB), to eliminate possible performance gains coming from the different thresholding method and to focus on the training algorithm itself. For all compared methods, the hyperparameters are selected by 5-fold cross-validation with grid search, aiming to minimize the NP-score, and the search scope is extended when the optimal value lies on the boundary. For our FPL, in the ranking stage the regularization parameter is selected to minimize (3), and then the threshold is chosen to minimize the NP-score. We test two variants of our algorithm, FPL and FPL, which correspond to different choices of in the learning scheme. As mentioned previously, enlarging can be seen as a top-down approximation toward the original objective.

Results. The NP-score results are given in Table 3. First, we note that both our methods achieve the best performance in most of the tests, compared to the various baselines. Moreover, it is clear that even when using the same method to select the threshold, the performance of the cost-sensitive method is still limited. Another observation is that all three algorithms that use out-of-bootstrap thresholding can efficiently keep the false-positive rate under the constraint. Moreover, the FPL variants are more stable than the other algorithms, which we believe benefits from the accurate separation of positive and negative instances and the stable thresholding technique.
5.4 Scalability
We study how FPL scales with the number of training examples using the largest dataset, real-sim. In order to simulate the limit situation, we construct six datasets of different sizes by upsampling the original dataset, with sizes ranging from 72,309 to 2,313,888. We run the FPL ranking algorithm on these datasets with different tolerances and the optimal regularization parameter (chosen by cross-validation), and report the corresponding training times. The upsampling procedure ensures that, for a fixed tolerance, all six datasets share the same optimal regularization parameter; thus the data size is the only varying factor. Figure 3 shows the log-log plot of the training time of FPL versus the size of the training data, where different lines correspond to different tolerances. It is clear that the training time of FPL is indeed linearly dependent on the number of training instances. This is consistent with our theoretical analysis and also demonstrates the scalability of FPL.
6 Proofs and Technical Details
In this section, we give all the detailed proofs missing from the main text, along with ancillary remarks and comments.
Notation. In the following, we define to be the th largest dimension of a vector , and define .