1 Introduction
Classification plays an important role in many aspects of our society. In medical research, identifying pathogenically distinct tumor types is central to advances in cancer treatments (Golub et al., 1999; Alderton, 2014). In cyber security, spam messages and viruses make automatic categorical decisions a necessity. Binary classification is arguably the simplest and most important form of classification problem, and can serve as a building block for more complicated applications. We focus our attention on binary classification in this work. A few common notations are introduced to facilitate our discussion. Let $(X, Y)$ be a random pair where $X \in \mathcal{X} \subset \mathbb{R}^d$ is a vector of features and $Y \in \{0, 1\}$ indicates $X$'s class label. A classifier $\phi$ is a mapping from $\mathcal{X}$ to $\{0, 1\}$ that assigns $X$ to one of the classes. A classification loss function is defined to assign a "cost" to each misclassified instance, and the classification error is defined as the expectation of this loss function with respect to the joint distribution of $(X, Y)$. We will focus our discussion on the 0-1 loss function $\mathbb{1}(\phi(X) \neq Y)$ throughout the paper, where $\mathbb{1}(\cdot)$ denotes the indicator function. Denote by $\mathbb{P}$ and $\mathbb{E}$ the generic probability distribution and expectation, whose meaning depends on specific contexts. The classification error is then $R(\phi) = \mathbb{E}\,\mathbb{1}(\phi(X) \neq Y) = \mathbb{P}(\phi(X) \neq Y)$. The law of total probability allows us to decompose it into a weighted average of the type I error $R_0(\phi) = \mathbb{P}(\phi(X) \neq Y \mid Y = 0)$ and the type II error $R_1(\phi) = \mathbb{P}(\phi(X) \neq Y \mid Y = 1)$ as

$$R(\phi) = \mathbb{P}(Y = 0)\, R_0(\phi) + \mathbb{P}(Y = 1)\, R_1(\phi). \qquad (1.1)$$
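The decomposition (1.1) is easy to verify numerically. The sketch below uses a hypothetical toy model (a Gaussian mixture and a fixed threshold classifier, both illustrative choices) and checks that the empirical overall error equals the weighted average of the empirical type I and type II errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: P(Y = 1) = 0.3, X | Y = y ~ N(2y, 1); a fixed threshold classifier.
n = 100_000
y = rng.binomial(1, 0.3, size=n)
x = rng.normal(loc=2.0 * y, scale=1.0)
phi = (x > 1.0).astype(int)

overall = np.mean(phi != y)             # empirical R(phi)
r0 = np.mean(phi[y == 0] == 1)          # empirical type I error R_0(phi)
r1 = np.mean(phi[y == 1] == 0)          # empirical type II error R_1(phi)
p0, p1 = np.mean(y == 0), np.mean(y == 1)

# Decomposition (1.1): R(phi) = P(Y=0) R_0(phi) + P(Y=1) R_1(phi).
print(abs(overall - (p0 * r0 + p1 * r1)))  # zero up to float rounding
```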
With the advent of high-throughput technologies, classification tasks have experienced an exponential growth in feature dimension over the past decade. The fundamental challenge of "high dimension, low sample size" has motivated the development of a plethora of classification algorithms for various applications. While dependencies among features are usually considered a crucial characteristic of the data (Ackermann and Strimmer, 2009), and can effectively reduce classification errors under suitable models and relative data abundance (Shao et al., 2011; Cai and Liu, 2011; Fan et al., 2012; Mai et al., 2012; Witten and Tibshirani, 2012), independence rules, with their superb scalability, become a rule of thumb when the feature dimension grows faster than the sample size (Hastie et al., 2009; James et al., 2013). Despite Naive Bayes models' reputation of being "simplistic" for ignoring all dependency structure among features, they lead to simple classifiers that have proven worthy on high-dimensional data, with remarkably good performance in numerous real-life applications. Under the classical model setting of two-class Gaussian distributions with a common covariance matrix, Bickel and Levina (2004) showed the superior performance of Naive Bayes models over (naive implementations of) the Fisher linear discriminant rule under broad conditions in high-dimensional settings. Fan and Fan (2008) further established the necessity of feature selection for high-dimensional classification problems by showing that even independence rules can be as poor as random guessing due to noise accumulation. Featuring both an independence rule and feature selection, the (sparse) Naive Bayes model remains a good choice for classification when the sample size is fairly limited.

1.1 Asymmetrical priorities on errors
Most existing binary classification methods target the optimization of the overall risk (1.1), and may fail to serve the purpose when users' relative priorities over type I/II errors differ significantly from those implied by the marginal probabilities of the two classes. A representative example of such a scenario is the diagnosis of serious disease. Let 1 code the healthy class and 0 code the diseased class. Given that usually $\mathbb{P}(Y = 1) \gg \mathbb{P}(Y = 0)$, minimizing the overall risk (1.1) might yield classifiers with small overall risk (as a result of small $R_1$) yet large $R_0$, a situation quite undesirable in practice: flagging a healthy case incurs only the extra cost of additional tests, while failing to detect the disease endangers a life.
The neuroblastoma dataset introduced by Oberthuer et al. (2006) provides a perfect illustration of this intuition. The dataset contains gene expression profiles from 246 patients in a German neuroblastoma trial, among which 56 are high-risk (labeled as 0) and 190 are low-risk (labeled as 1). We randomly selected 41 '0's and 123 '1's as our training sample (so that the proportion of '0's is about the same as in the entire dataset), and tested the resulting classifiers on the remaining 15 '0's and 67 '1's. The average error rates of PSN (to be proposed; implemented here at significance level 0.05), Gaussian Naive Bayes (nb), penalized logistic regression (penlog), and Support Vector Machine (svm) over 1000 random splits are summarized in Table 1.

Error Type         PSN    nb     penlog  svm
type I  (0 as 1)   .038   .308   .529    .603
type II (1 as 0)   .761   .150   .103    .573

Table 1: Average type I and type II errors over 1000 random splits of the neuroblastoma data.
All procedures except PSN led to high type I errors, and are thus considered unsatisfactory given the more severe consequences of missing a diseased instance than vice versa.
One existing solution to asymmetric error control is cost-sensitive learning, which assigns two different costs as weights of the type I/II errors (Elkan, 2001; Zadrozny et al., 2003). Despite the many merits and practical values of this framework, limitations arise in applications when there is no consensus on how much cost should be assigned to each error type, or, more fundamentally, on whether it is morally acceptable to assign costs in the first place. Moreover, when users have a specific target for type I/II error control, cost-sensitive learning does not fit. Other methods aiming for small type I error include the Asymmetric Support Vector Machine (Wu et al., 2008) and $p$-values for classification (Dümbgen et al., 2008). However, the former has no theoretical guarantee on errors, while the latter treats all classes as equally important.
1.2 Neyman-Pearson (NP) paradigm and NP oracle inequalities
The Neyman-Pearson (NP) paradigm was introduced as a novel statistical framework for targeted type I/II error control. Assuming the type I error is the prioritized error type, this paradigm seeks to control $R_0$ under a user-specified level $\alpha$ while keeping $R_1$ as small as possible. The oracle is thus

$$\phi^* = \operatorname*{argmin}_{\phi:\, R_0(\phi) \le \alpha} R_1(\phi), \qquad (1.2)$$

where the significance level $\alpha$ reflects the level of conservativeness towards the type I error. Since $\phi^*$ is unattainable in the learning paradigm, the best within our capability is to construct a data-dependent classifier $\hat\phi$ that mimics it.
Despite its practical importance, NP classification has not received much attention in the statistics and machine learning communities.
Cannon et al. (2002) initiated the theoretical treatment of NP classification. Under the same framework, Scott (2005) and Scott and Nowak (2005) derived several results for traditional statistical learning, such as PAC bounds and oracle inequalities. By combining type I and type II errors in sensible ways, Scott (2007) proposed a performance measure for NP classification. More recently, Blanchard et al. (2010) developed a general solution to semi-supervised novelty detection by reducing it to NP classification. Other related works include Casasent and Chen (2003) and Han et al. (2008). A common issue with methods in this line of literature is that they all follow an empirical risk minimization (ERM) approach and use some form of relaxed empirical type I error constraint in the optimization program. As a result, the type I errors can only be proven to satisfy some relaxed upper bound. Take the framework set up by Cannon et al. (2002) for example. Given $\alpha$ and a slack $\varepsilon_0 > 0$, they proposed the program

$$\min_{\phi \in \mathcal{H}:\; \hat R_0(\phi) \le \alpha + \varepsilon_0/2} \hat R_1(\phi),$$

where $\mathcal{H}$ is a set of classifiers with finite Vapnik-Chervonenkis dimension, and $\hat R_0$, $\hat R_1$ are the empirical type I and type II errors respectively. It is shown that, with high probability, the solution to the above program satisfies simultaneously: i) the type I error is bounded from above by $\alpha + \varepsilon_0$, and ii) the type II error is bounded from above by $\min_{\phi \in \mathcal{H}:\, R_0(\phi) \le \alpha} R_1(\phi) + \varepsilon_1$ for some $\varepsilon_1 > 0$.
Rigollet and Tong (2011) marked a significant departure from the previous NP classification literature. That paper argues that a good classifier $\hat\phi$ under the NP paradigm should respect the chosen significance level $\alpha$, rather than some relaxation of it. More precisely, two NP oracle inequalities should be satisfied simultaneously with high probability:

(I) the type I error constraint is respected, i.e., $R_0(\hat\phi) \le \alpha$;

(II) the excess type II error $R_1(\hat\phi) - R_1(\phi^*)$ diminishes with explicit rates (w.r.t. sample size).

Recall that, for a classifier $\hat\phi$, the classical oracle inequality insists that, with high probability,

the excess risk $R(\hat\phi) - R(\phi^*)$ diminishes with explicit rates,  (1.3)

where $\phi^*(x) = \mathbb{1}(\eta(x) > 1/2)$ is the Bayes classifier, in which $\eta(x) = \mathbb{E}(Y \mid X = x)$ is the regression function of $Y$ on $X$ (see Koltchinskii (2008) and references within). The two NP oracle inequalities defined above can be thought of as a generalization of (1.3) that provides a novel characterization of classifiers' theoretical performance under the NP paradigm.
Using a more stringent empirical type I error constraint (than the level $\alpha$), Rigollet and Tong (2011) established NP oracle inequalities for their proposed classifiers under convex loss functions (as opposed to the indicator loss). They also proved an interesting negative result: under the 0-1 loss, ERM approaches (with or without convexification) cannot guarantee diminishing excess type II error as long as one insists that the type I error of the proposed classifier be bounded from above by $\alpha$ with high probability. This negative result motivated a plug-in approach to NP classification in Tong (2013).
1.3 Plug-in approaches
Plug-in methods for classical binary classification have been well studied in the literature, where the usual plug-in target is the Bayes classifier $\phi^*(x) = \mathbb{1}(\eta(x) > 1/2)$. Earlier works gave rise to pessimism about the plug-in approach. For example, under certain assumptions, Yang (1999) showed that plug-in estimators cannot achieve excess risk rates faster than $O(n^{-1/2})$, while direct methods can achieve rates up to $O(n^{-1})$ under the margin assumption (Mammen and Tsybakov, 1999; Tsybakov, 2004; Tsybakov and van de Geer, 2005; Tarigan and van de Geer, 2006). However, it was shown in Audibert and Tsybakov (2007) that plug-in classifiers based on local polynomial estimators can achieve rates faster than $O(n^{-1})$, given a smoothness condition on $\eta$ and the margin assumption.

The oracle classifier under the NP paradigm arises from its close connection to the Neyman-Pearson Lemma in statistical hypothesis testing. Hypothesis testing bears strong resemblance to binary classification if we assume the following model. Let $P_0$ and $P_1$ be two known probability distributions on $\mathcal{X} \subset \mathbb{R}^d$. Assume that $Y \sim \mathrm{Bernoulli}(\pi)$ for some $\pi \in (0, 1)$, and that the conditional distribution of $X$ given $Y = y$ is $P_y$, $y \in \{0, 1\}$. Given such a model, the goal of statistical hypothesis testing is to determine whether we should reject the null hypothesis that $X$ was generated from $P_0$. To this end, we construct a randomized test $\phi: \mathcal{X} \to [0, 1]$ that rejects the null with probability $\phi(X)$. Two types of errors arise: a type I error occurs when $H_0$ is rejected yet $X$ was generated from $P_0$, and a type II error occurs when $H_0$ is not rejected yet $X$ was generated from $P_1$. The Neyman-Pearson paradigm in hypothesis testing amounts to choosing $\phi$ that solves the constrained optimization problem

$$\max \; \mathbb{E}_{P_1}\,\phi(X) \quad \text{subject to} \quad \mathbb{E}_{P_0}\,\phi(X) \le \alpha,$$

where $\alpha$ is the significance level of the test. A solution to this constrained optimization problem is called a most powerful test of level $\alpha$. The Neyman-Pearson Lemma gives mild sufficient conditions for the existence of such a test.
Lemma 1.1 (Neyman-Pearson Lemma).
Let $P_0$ and $P_1$ be two probability measures with densities $p_0$ and $p_1$ respectively, and denote the density ratio by $r(x) = p_1(x)/p_0(x)$. For a given significance level $\alpha$, let $C_\alpha$ be such that $P_0(r(X) > C_\alpha) \le \alpha$ and $P_0(r(X) \ge C_\alpha) \ge \alpha$. Then, the most powerful test of level $\alpha$ is

$$\phi^*(X) = \begin{cases} 1 & \text{if } r(X) > C_\alpha, \\ \dfrac{\alpha - P_0(r(X) > C_\alpha)}{P_0(r(X) = C_\alpha)} & \text{if } r(X) = C_\alpha, \\ 0 & \text{if } r(X) < C_\alpha. \end{cases}$$
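For intuition, the lemma can be checked numerically in one dimension. The sketch below assumes two hypothetical Gaussian classes (an illustrative choice, not the paper's setting), for which the density ratio is monotone in $x$, so the oracle threshold $C_\alpha$ has a closed form; Monte Carlo draws from $P_0$ then confirm that the test has level close to $\alpha$:

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05

# Hypothetical known densities: p0 = N(0,1), p1 = N(2,1),
# so the density ratio is r(x) = exp(2x - 2), strictly increasing in x.
def r(x):
    return np.exp(2.0 * x - 2.0)

# P_0(r(X) > C_alpha) = alpha  <=>  the x-threshold is the (1 - alpha)
# quantile of N(0, 1), by monotonicity of r.
C_alpha = r(norm.ppf(1 - alpha))

rng = np.random.default_rng(1)
x0 = rng.normal(0.0, 1.0, size=400_000)   # draws from P_0
x1 = rng.normal(2.0, 1.0, size=400_000)   # draws from P_1

type1 = np.mean(r(x0) > C_alpha)          # should be close to alpha
power = np.mean(r(x1) > C_alpha)          # type II error is 1 - power
```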
Under a mild continuity assumption, we take the NP oracle

$$\phi^*(x) = \mathbb{1}\big(r(x) > C_\alpha\big) \qquad (1.4)$$

as our plug-in target for NP classification. With kernel density estimates $\hat p_1$, $\hat p_0$ of $p_1$, $p_0$, and a proper estimate of the threshold level $C_\alpha$, Tong (2013) constructed a plug-in classifier that satisfies both NP oracle inequalities with high probability when the dimensionality is small, leaving the high-dimensional case an uncharted territory.

1.4 Contribution
In the big data era, the NP classification framework faces the same curse of dimensionality as its classical counterpart. Despite the wide potential applications, this paper is the first attempt to construct performance-guaranteed classifiers under the NP paradigm in high-dimensional settings. Based on the Neyman-Pearson Lemma, we employ Naive Bayes models and propose a computationally feasible plug-in approach to construct classifiers that satisfy the NP oracle inequalities. We also improve the detection condition, a critical theoretical assumption first introduced in Tong (2013) for effective threshold level estimation, which grounds the good NP properties of these classifiers. The necessity of the new detection condition is also discussed. Note that the classifiers proposed in this work are not straightforward extensions of Tong (2013): kernel density estimation is now applied in combination with feature selection, and the threshold level is estimated in a more precise way by order statistics that require only a moderate sample size, while Tong (2013) resorted to Vapnik-Chervonenkis theory and required a sample size much bigger than what is available in most high-dimensional applications.

The rest of the paper is organized as follows. Two screening-based plug-in NP-type classifiers are presented in Section 2, where their theoretical properties are also discussed. Performance of the proposed classifiers is demonstrated in Section 3 by both simulation studies and real data analysis. We conclude in Section 4 with a short discussion. The technical proofs are relegated to the Appendix.
2 Methods
In this section, we first introduce several notations and definitions, with a focus on the detection condition. Then we present the plug-in procedure, together with its theoretical properties.
2.1 Notations and definitions
We introduce here several notations adapted from Audibert and Tsybakov (2007). For $\beta > 0$, denote by $\lfloor \beta \rfloor$ the largest integer strictly less than $\beta$. For any $x \in \mathbb{R}^d$ and any $\lfloor \beta \rfloor$ times continuously differentiable real-valued function $g$ on $\mathbb{R}^d$, we denote by $g_x$ its Taylor polynomial of degree $\lfloor \beta \rfloor$ at the point $x$. For $L > 0$, the Hölder class of functions, denoted by $\Sigma(\beta, L, \mathbb{R}^d)$, is the set of functions $g: \mathbb{R}^d \to \mathbb{R}$ that are $\lfloor \beta \rfloor$ times continuously differentiable and satisfy, for any $x, x' \in \mathbb{R}^d$, the inequality

$$|g(x') - g_x(x')| \le L \|x - x'\|^{\beta}.$$

The Hölder class of densities is defined as

$$\mathcal{P}_{\Sigma}(\beta, L, \mathbb{R}^d) = \Big\{ p : p \ge 0, \; \int p = 1, \; p \in \Sigma(\beta, L, \mathbb{R}^d) \Big\}.$$
We will use $\beta$-valid kernels (kernels of order $\beta$, Tsybakov (2009)) for all the kernel estimation throughout the theoretical discussion, the definition of which is as follows.

Definition 2.1.
Let $K$ be a real-valued function on $\mathbb{R}$ with support $[-1, 1]$. The function $K$ is a $\beta$-valid kernel if it satisfies $\int K = 1$, $\int |K|^p < \infty$ for any $p \ge 1$, $\int |t|^{\beta} |K(t)|\, dt < \infty$, and, in the case $\lfloor \beta \rfloor \ge 1$, it satisfies $\int t^j K(t)\, dt = 0$ for any $j \in \{1, \ldots, \lfloor \beta \rfloor\}$.
We assume that all the valid kernels considered in the theoretical part of this paper are constructed from Legendre polynomials, and are thus Lipschitz and bounded, satisfying the kernel conditions for the important technical Lemma A.6.
Definition 2.2 (margin assumption).
A function $f$ is said to satisfy the margin assumption of order $\bar\gamma$ with respect to a probability distribution $P$ at the level $C^*$ if there exists a positive constant $M_0$ such that for any $\delta \ge 0$,

$$P\big(|f(X) - C^*| \le \delta\big) \le M_0\, \delta^{\bar\gamma}.$$

This assumption was first introduced in Polonik (1995). In the classical binary classification framework, Mammen and Tsybakov (1999) proposed a similar condition named the "margin condition", which requires most data to be away from the optimal decision boundary. In the classical classification paradigm, Definition 2.2 reduces to the "margin condition" by taking $f = \eta$ and $C^* = 1/2$, with $\{x : \eta(x) = 1/2\}$ giving the decision boundary of the Bayes classifier. On the other hand, unlike the classical paradigm, where the optimal threshold level $1/2$ is known and does not need to be estimated, the optimal threshold level $C_\alpha$ in the NP paradigm is unknown and needs to be estimated, suggesting the necessity of having sufficient data around the decision boundary to detect it well. This concern motivated the following condition, improved from Tong (2013).
Definition 2.3 (detection condition).
A function $f$ is said to satisfy the detection condition of order $\underline\gamma$ with respect to a probability distribution $P$ at the level $(C^*, \delta^*)$ if there exists a positive constant $M_1$ such that for any $\delta \in (0, \delta^*)$,

$$P\big(C^* \le f(X) \le C^* + \delta\big) \ge M_1\, \delta^{\underline\gamma}.$$

The detection condition works as an opposing force to the margin assumption, and is essentially a lower bound on a probability. Though we take a power function as the lower bound here, so that it is simple and aesthetically similar to the margin assumption, any lower bound that is increasing and positive on $(0, \delta^*)$ should be able to serve the purpose. The version of the detection condition we use to establish the NP oracle inequalities for the (to be) proposed classifiers takes $f = r$, $C^* = C_\alpha$, and $P = P_0$ (recall that $P_0$ is the conditional distribution of $X$ given $Y = 0$).
Now we argue why such a condition is necessary to achieve the NP oracle inequalities. Consider the simpler case where the density ratio $r$ is known, and we only need a proper estimate of the threshold level $C_\alpha$. If there is nothing like the detection condition (Definition 2.3 involves a power function, but the idea is just to have some kind of lower bound), we could have, for some $\delta' > 0$,

$$P_0\big(C_\alpha < r(X) < C_\alpha + \delta'\big) = 0. \qquad (2.1)$$

In getting the threshold estimate of $C_\alpha$, we then cannot distinguish any threshold level between $C_\alpha$ and $C_\alpha + \delta'$. In particular, it is possible that $\hat C_\alpha = C_\alpha + \delta'$. But then the excess type II error is bounded from below as follows:

$$R_1(\hat\phi) - R_1(\phi^*) \ge \mathbb{P}\big(r(X) = C_\alpha + \delta' \mid Y = 1\big),$$

where the last quantity can be positive. Therefore, the second NP oracle inequality (diminishing excess type II error) does not hold for $\hat\phi(x) = \mathbb{1}(r(x) > \hat C_\alpha)$. Since some detection condition is necessary in this simpler case, it is certainly necessary in our real setup.
Note that Definition 2.3 is a significant improvement over the detection condition formulated in Tong (2013), which requires lower bounds on both sides of the level $C^*$:

$$P\big(C^* - \delta \le f(X) \le C^*\big) \ge M_1\, \delta^{\underline\gamma} \quad \text{and} \quad P\big(C^* \le f(X) \le C^* + \delta\big) \ge M_1\, \delta^{\underline\gamma}.$$

We are able to drop the lower bound for the first piece due to an improved layout of the proofs. Intuitively, our new detection condition ensures an upper bound on $\hat C_\alpha - C_\alpha$. But we do not need an extra condition to get a lower bound on $\hat C_\alpha - C_\alpha$, because of the type I error bound requirement (see the proof of Proposition 2.4 for details).
2.2 NeymanPearson plugin procedure
Suppose the sampling scheme is fixed as follows.
Assumption 1.
Assume the training sample contains i.i.d. observations $\mathcal{S}^1$ from class 1 with density $p_1$, and i.i.d. observations $\mathcal{S}^0$ from class 0 with density $p_0$. We further decompose $\mathcal{S}^1$ and $\mathcal{S}^0$ into independent subsamples of fixed proportions: $\mathcal{S}^1$ into a screening part and a density-estimation part, and $\mathcal{S}^0$ into a screening part, a density-estimation part, and a part reserved for estimating the threshold level.
The sample splitting idea has been considered in the literature, such as in Meinshausen and Bühlmann (2010) and Robins et al. (2006). Given these samples, we introduce the following plug-in procedure.
Definition 2.4.
Neyman-Pearson plug-in procedure

1. Use the screening and density-estimation subsamples to construct a density ratio estimate $\hat r$. The specific use of each subsample will be introduced in Section 2.4.

2. Given $\alpha, \delta \in (0, 1)$, choose a threshold estimate from the set $\{\hat r(X^0_i) : X^0_i \in \mathcal{S}^0_{\mathrm{thresh}}\}$, where $\mathcal{S}^0_{\mathrm{thresh}}$ denotes the class-0 subsample reserved for threshold estimation and $n$ denotes its size. Denote by $\hat r_{(k)}$ the $k$th order statistic of $\{\hat r(X^0_i)\}$, $k = 1, \ldots, n$. The corresponding plug-in classifier obtained by setting the threshold to $\hat r_{(k)}$ is

$$\hat\phi_k(x) = \mathbb{1}\big(\hat r(x) > \hat r_{(k)}\big). \qquad (2.2)$$

A generic procedure for choosing the optimal $k$ will be given in Section 2.3.
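The two steps can be sketched compactly. The snippet below assumes a generic, already-constructed density ratio estimate `r_hat` (here, for illustration only, the true ratio of two hypothetical Gaussians) and a held-out class-0 subsample for thresholding; the index `k` is left as an input:

```python
import numpy as np

def np_plugin(r_hat, x0_threshold, k):
    """Plug-in NP classifier (2.2): threshold r_hat at the k-th order
    statistic of its values on a held-out class-0 sample."""
    t = np.sort(r_hat(x0_threshold))    # order statistics r_hat(X_i^0)
    c_hat = t[k - 1]                    # k-th smallest (k is 1-indexed)
    return lambda x: (r_hat(x) > c_hat).astype(int)

rng = np.random.default_rng(2)
r_hat = lambda x: np.exp(2.0 * x - 2.0)     # stand-in density ratio estimate
x0 = rng.normal(0.0, 1.0, size=1000)        # class-0 threshold subsample
phi_k = np_plugin(r_hat, x0, k=980)

# On fresh class-0 data the type I error is roughly (n - k) / n = 0.02.
x0_fresh = rng.normal(0.0, 1.0, size=200_000)
type1 = np.mean(phi_k(x0_fresh) == 1)
```

Larger `k` means a higher threshold and hence a more conservative (smaller) type I error, at the cost of a larger type II error.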
2.3 Threshold estimate
For any arbitrary density ratio estimate $\hat r$, we employ a proper order statistic to estimate the threshold $C_\alpha$, and establish a probabilistic upper bound on the type I error of $\hat\phi_k$ for each $k$.
Proposition 2.1.
For any arbitrary density ratio estimate $\hat r$, let $\hat\phi_k(x) = \mathbb{1}(\hat r(x) > \hat r_{(k)})$. It holds for any $\alpha \in (0, 1)$ and $k \in \{1, \ldots, n\}$ that

$$\mathbb{P}\big(R_0(\hat\phi_k) > \alpha\big) \le F_{k,\, n-k+1}(1 - \alpha), \qquad (2.3)$$

where $F_{k,\, n-k+1}$ is the cdf of the Beta$(k, n-k+1)$ distribution. The inequality becomes an equality when $\hat r(X^0)$ is continuous almost surely.
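The Beta-cdf bound can be evaluated directly, and its exactness in the continuous case can be checked by simulation. A sketch, assuming for convenience that $\hat r(X^0)$ is uniform on $(0, 1)$ (any continuous distribution gives the same answer, since only ranks matter):

```python
import numpy as np
from scipy.stats import beta

n, k, alpha = 100, 98, 0.05

# The bound in (2.3): P(R_0(phi_k) > alpha) <= F_{k, n-k+1}(1 - alpha).
bound = beta.cdf(1 - alpha, k, n - k + 1)

# Monte Carlo check of the continuous case, where the bound is an equality:
# with r_hat(X^0) ~ U(0,1), the realized type I error of the k-th
# order-statistic threshold is 1 - U_(k).
rng = np.random.default_rng(3)
u = np.sort(rng.uniform(size=(20_000, n)), axis=1)
realized_type1 = 1.0 - u[:, k - 1]
violation_freq = np.mean(realized_type1 > alpha)  # close to `bound`
```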
In view of the above proposition, a sufficient condition for the classifier $\hat\phi_k$ to satisfy NP Oracle Inequality (I) at tolerance level $\delta$ is thus

$$F_{k,\, n-k+1}(1 - \alpha) \le \delta. \qquad (2.4)$$

Despite the potential tightness of (2.3), we are not able to derive an explicit formula for the minimum $k$ that satisfies (2.4). To get an explicit choice of $k$, we resort to concentration inequalities for an alternative.
Proposition 2.2.
For any arbitrary density ratio estimate $\hat r$, let $\hat\phi_k(x) = \mathbb{1}(\hat r(x) > \hat r_{(k)})$. It holds for any $\alpha \in (0, 1)$ and $k \in \{1, \ldots, n\}$ with $k/n \ge 1 - \alpha$ that

$$\mathbb{P}\big(R_0(\hat\phi_k) > \alpha\big) \le g(k), \qquad (2.5)$$

where

$$g(k) = \exp\Big(-2n\big(k/n - (1 - \alpha)\big)^2\Big). \qquad (2.6)$$

Let $\mathcal{K} = \{k : g(k) \le \delta\}$. Proposition 2.2 implies that $k \in \mathcal{K}$ is a sufficient condition for the classifier $\hat\phi_k$ to satisfy NP Oracle Inequality (I). The next step is to characterize $\mathcal{K}$ and choose some $k \in \mathcal{K}$ so that $\hat\phi_k$ has small excess type II error. Clearly, we would like to find the smallest element of $\mathcal{K}$.
Proposition 2.3.
The minimum $k$ that satisfies $g(k) \le \delta$ is

$$k_{\min} = \Big\lceil\, n(1 - \alpha) + \sqrt{\tfrac{n}{2} \log(1/\delta)} \,\Big\rceil, \qquad (2.7)$$

where $\lceil z \rceil$ denotes the smallest integer larger than or equal to $z$. Moreover,

(i) $k_{\min}/n \to 1 - \alpha$ as $n \to \infty$;

(ii) for any $k \ge k_{\min}$, we have $g(k) \le \delta$, and thus $\mathbb{P}(R_0(\hat\phi_k) > \alpha) \le \delta$.

Introduce the shorthand notations $\hat C_\alpha = \hat r_{(k_{\min})}$ and $\hat\phi = \hat\phi_{k_{\min}}$. We will take

$$\hat\phi(x) = \mathbb{1}\big(\hat r(x) > \hat C_\alpha\big) \qquad (2.8)$$

as the default NP plug-in classifier for any arbitrary $\hat r$. An alternative threshold estimate that also guarantees the type I error bound is derived in Appendix C. Assume $n \ge \log(1/\delta)/(2\alpha^2)$ for the rest of the theoretical discussion. It follows from Proposition 2.3 that $k_{\min} \le n$, and thus $\hat\phi$ is well defined, with guaranteed type I error control.
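The order-statistic index is cheap to compute. A sketch, assuming the Hoeffding-type form of (2.7) above, that also cross-checks the choice against the tighter Beta-cdf bound of Proposition 2.1:

```python
import numpy as np
from scipy.stats import beta

def k_min(n, alpha, delta):
    """Smallest index satisfying the concentration-based sufficient
    condition: k >= n(1 - alpha) + sqrt(n log(1/delta) / 2)."""
    return int(np.ceil(n * (1 - alpha) + np.sqrt(n * np.log(1 / delta) / 2)))

n, alpha, delta = 1000, 0.05, 0.05
k = k_min(n, alpha, delta)                      # 989 for these values

# Cross-check with the tighter bound (2.3): the violation probability
# at k_min is indeed below the tolerance delta.
violation_bound = beta.cdf(1 - alpha, k, n - k + 1)
```

Note how $k_{\min}$ exceeds the naive quantile index $\lceil n(1-\alpha) \rceil = 950$ by an extra margin of order $\sqrt{n}$, which is what buys the high-probability (rather than on-average) type I error control.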
Remark 2.1.
Note that $k_{\min}/n \to 1 - \alpha$. Thus, choosing the $k_{\min}$th order statistic of $\{\hat r(X^0_i)\}$ as the threshold can be viewed as a modification of the classical approach of estimating the $(1-\alpha)$ quantile of $\hat r(X^0)$ by the $\lceil n(1-\alpha) \rceil$th order statistic. Recall that the oracle threshold $C_\alpha$ is essentially the $(1-\alpha)$ quantile of the distribution of $r(X^0)$, so the intuition is that $\hat r_{(k_{\min})}$ is asymptotically (as $n \to \infty$) equivalent to the $(1-\alpha)$ quantile of $\hat r(X^0)$, which in turn converges (as the density-estimation sample sizes grow) to $C_\alpha$, the $(1-\alpha)$ quantile of $r(X^0)$, under moderate conditions.
Lemma 2.1.
Proposition 2.4.
Let $\hat\phi$ be defined as in (2.8). In addition to the assumptions of Lemma 2.1, assume that the density ratio $r$ satisfies the margin assumption of order $\bar\gamma$ at the level $C_\alpha$ (with constant $M_0$) and the detection condition of order $\underline\gamma$ at the level $(C_\alpha, \delta^*)$ (with constant $M_1$), both with respect to the distribution $P_0$. Then, if the uniform deviation $\|\hat r - r\|_\infty$ is sufficiently small, the excess type II error of the classifier $\hat\phi$ defined in (2.8) satisfies, with probability at least $1 - \delta$, an explicit bound driven by $\|\hat r - r\|_\infty$ and the sample sizes.

Given the above proposition, we can control the excess type II error as long as the uniform deviation $\|\hat r - r\|_\infty$ of the density ratio estimate is controlled. In the following subsection, we will introduce estimates $\hat r$ and provide bounds for $\|\hat r - r\|_\infty$.
2.4 Density ratio estimate
Denote the marginal densities of class 1 and class 0 by $p_{1j}$ and $p_{0j}$ ($j = 1, \ldots, d$) respectively. Naive Bayes models for the density ratio take the form

$$r(x) = \prod_{j=1}^{d} \frac{p_{1j}(x_j)}{p_{0j}(x_j)}.$$

The density-estimation subsamples are used to construct (nonparametric/parametric) estimates of $p_{1j}$ and $p_{0j}$ for $j = 1, \ldots, d$.

Nonparametric estimate of the density ratio. For the marginal densities $p_{1j}$ and $p_{0j}$, we apply the kernel estimates

$$\hat p_{1j}(x_j) = \frac{1}{n_1 h_1} \sum_{i=1}^{n_1} K\Big(\frac{X^1_{ij} - x_j}{h_1}\Big), \qquad \hat p_{0j}(x_j) = \frac{1}{n_0 h_0} \sum_{i=1}^{n_0} K\Big(\frac{X^0_{ij} - x_j}{h_0}\Big),$$

where $K$ is the kernel function, $h_1, h_0$ are the bandwidths, $n_1, n_0$ are the sizes of the density-estimation subsamples, and $X^1_{ij}$ and $X^0_{ij}$ denote the $j$th components of $X^1_i$ and $X^0_i$ respectively. The resulting nonparametric estimate is

$$\hat r(x) = \prod_{j=1}^{d} \frac{\hat p_{1j}(x_j)}{\hat p_{0j}(x_j)}. \qquad (2.10)$$
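A sketch of (2.10), with a Gaussian kernel standing in for the $\beta$-valid kernels of the theory, and arbitrary illustrative bandwidths:

```python
import numpy as np

def kde_1d(sample, h):
    """One-dimensional kernel density estimate with a Gaussian kernel."""
    def p_hat(x):
        z = (np.asarray(x)[:, None] - sample[None, :]) / h
        return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))
    return p_hat

def nb_density_ratio(x1, x0, h1, h0):
    """Nonparametric Naive Bayes density ratio (2.10): a product of
    per-feature one-dimensional KDE ratios."""
    d = x1.shape[1]
    p1 = [kde_1d(x1[:, j], h1) for j in range(d)]
    p0 = [kde_1d(x0[:, j], h0) for j in range(d)]
    def r_hat(x):
        num = np.prod([p1[j](x[:, j]) for j in range(d)], axis=0)
        den = np.prod([p0[j](x[:, j]) for j in range(d)], axis=0)
        return num / den
    return r_hat

# Toy check: the estimated ratio is large near the class-1 mean.
rng = np.random.default_rng(4)
x1 = rng.normal(2.0, 1.0, size=(500, 2))
x0 = rng.normal(0.0, 1.0, size=(500, 2))
r_hat = nb_density_ratio(x1, x0, h1=0.5, h0=0.5)
```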
Parametric estimate of the density ratio. Assume the two-class Gaussian model $X \mid Y = y \sim N(\mu_y, \Sigma)$, where $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$. We estimate $\mu_1$, $\mu_0$ and $\sigma_j^2$ using their sample versions $\hat\mu_1$, $\hat\mu_0$ and $\hat\sigma_j^2$. Under this model, the density ratio function is given by

$$r(x) = \exp\bigg( \sum_{j=1}^{d} \frac{(x_j - \mu_{0j})^2 - (x_j - \mu_{1j})^2}{2\sigma_j^2} \bigg),$$

and the corresponding parametric estimate is

$$\hat r(x) = \exp\bigg( \sum_{j=1}^{d} \frac{(x_j - \hat\mu_{0j})^2 - (x_j - \hat\mu_{1j})^2}{2\hat\sigma_j^2} \bigg). \qquad (2.11)$$
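Working on the log scale avoids overflow in the exponential. The sketch below estimates (2.11) with pooled per-feature variances; the precise variance estimator in the original may differ:

```python
import numpy as np

def gaussian_nb_log_ratio(x1, x0):
    """Log of the parametric Naive Bayes density ratio (2.11): two-class
    Gaussian model with a common diagonal covariance (pooled variances)."""
    mu1, mu0 = x1.mean(axis=0), x0.mean(axis=0)
    n1, n0 = len(x1), len(x0)
    var = (((x1 - mu1) ** 2).sum(axis=0)
           + ((x0 - mu0) ** 2).sum(axis=0)) / (n1 + n0 - 2)
    def log_r_hat(x):
        return (((x - mu0) ** 2 - (x - mu1) ** 2) / (2.0 * var)).sum(axis=1)
    return log_r_hat

# Toy check on two shifted Gaussian classes.
rng = np.random.default_rng(5)
x1 = rng.normal(1.0, 1.0, size=(300, 3))
x0 = rng.normal(0.0, 1.0, size=(300, 3))
log_r_hat = gaussian_nb_log_ratio(x1, x0)
```

Thresholding $\log \hat r$ at $\log \hat C_\alpha$ is equivalent to thresholding $\hat r$ at $\hat C_\alpha$, since the logarithm is monotone.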
2.5 Screening-based density ratio estimates and plug-in procedures

For "high dimension, low sample size" applications, complex models that take into account all features usually fail; even Naive Bayes models that ignore feature dependency might lead to poor performance due to noise accumulation (Fan and Fan, 2008). A common solution in these scenarios is to first study the marginal relation between the response and each of the features (Fan and Lv, 2008; Li et al., 2012). By selecting the most important individual features, we greatly reduce the model size, and other models can be applied after this screening step. We now introduce screening-based variants of the nonparametric and parametric estimates. Let $F_{0j}$ and $F_{1j}$ denote the cdfs of the $j$th feature conditional on $Y = 0$ and $Y = 1$ respectively, for $j = 1, \ldots, d$. Step 1 of the plug-in procedure in Definition 2.4 is now decomposed into a screening substep and an estimation substep.
Nonparametric Screening-based NP Naive Bayes (NSN) classifier

Step 1.1 Select features using the screening subsamples as follows:

$$\widehat{\mathcal{A}} = \Big\{ j : \sup_{x} \big| \hat F_{0j}(x) - \hat F_{1j}(x) \big| \ge b \Big\}, \qquad (2.12)$$

where $b$ is some threshold level, and

$$\hat F_{0j}(x) = \frac{1}{m_0} \sum_{i=1}^{m_0} \mathbb{1}(X^0_{ij} \le x), \qquad \hat F_{1j}(x) = \frac{1}{m_1} \sum_{i=1}^{m_1} \mathbb{1}(X^1_{ij} \le x) \qquad (2.13)$$

are the empirical cdfs based on the screening subsamples (of sizes $m_0$ and $m_1$).

Step 1.2 Use the density-estimation subsamples to construct kernel estimates $\hat p_{0j}$ and $\hat p_{1j}$ of $p_{0j}$ and $p_{1j}$ for $j \in \widehat{\mathcal{A}}$. The density ratio estimate is given by

$$\hat r_N(x) = \prod_{j \in \widehat{\mathcal{A}}} \frac{\hat p_{1j}(x_j)}{\hat p_{0j}(x_j)}.$$

Step 2 Given $\alpha, \delta \in (0, 1)$, use the class-0 threshold subsample to get a threshold estimate $\hat C_\alpha$ as in (2.8).

The resulting NSN classifier is

$$\hat\phi_N(x) = \mathbb{1}\big(\hat r_N(x) > \hat C_\alpha\big). \qquad (2.14)$$
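Step 1.1 amounts to a per-feature Kolmogorov-Smirnov-type sup-norm distance between the two empirical cdfs. A sketch, where the threshold `b` is a hypothetical tuning value rather than the theoretically prescribed level:

```python
import numpy as np

def ks_screen(x1, x0, b):
    """Screening step (2.12), sketched: keep feature j when the sup-norm
    distance between the two empirical cdfs is at least b."""
    selected = []
    for j in range(x1.shape[1]):
        grid = np.concatenate([x1[:, j], x0[:, j]])   # evaluate at data points
        f1 = np.searchsorted(np.sort(x1[:, j]), grid, side="right") / len(x1)
        f0 = np.searchsorted(np.sort(x0[:, j]), grid, side="right") / len(x0)
        if np.max(np.abs(f1 - f0)) >= b:
            selected.append(j)
    return selected

# Toy check: only feature 0 carries signal (a mean shift of 2).
rng = np.random.default_rng(6)
x1 = rng.normal(0.0, 1.0, size=(300, 10))
x0 = rng.normal(0.0, 1.0, size=(300, 10))
x1[:, 0] += 2.0
selected = ks_screen(x1, x0, b=0.3)
```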
Parametric Screeningbased NP Naive Bayes (PSN) classifier
The PSN procedure is similar to NSN, except for two differences. In Step 1.1, features are now selected based on two-sample $t$-statistics, with $\widehat{\mathcal{A}}$ again denoting the index set of the selected features. In Step 1.2, the estimates $\hat p_{0j}$ and $\hat p_{1j}$, for $j \in \widehat{\mathcal{A}}$, follow the two-class Gaussian model, and the resulting parametric screening-based density ratio estimate is

$$\hat r_P(x) = \exp\bigg( \sum_{j \in \widehat{\mathcal{A}}} \frac{(x_j - \hat\mu_{0j})^2 - (x_j - \hat\mu_{1j})^2}{2\hat\sigma_j^2} \bigg).$$

The corresponding PSN classifier is thus given by

$$\hat\phi_P(x) = \mathbb{1}\big(\hat r_P(x) > \hat C_\alpha\big). \qquad (2.15)$$
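Putting the pieces together, here is a compact end-to-end sketch of PSN. It simplifies the setup above in two labeled ways: screening keeps a fixed number `n_keep` of features (rather than thresholding the statistics), and the order-statistic index reuses the Hoeffding-type form discussed in Section 2.3:

```python
import numpy as np

def psn_fit(x1, x0, x0_threshold, alpha, delta, n_keep):
    """Sketch of the PSN classifier (2.15): t-statistic screening,
    Gaussian Naive Bayes log density ratio on the selected features,
    and an order-statistic threshold as in (2.8)."""
    n1, n0 = len(x1), len(x0)
    # Step 1.1: screen by absolute two-sample t-statistics.
    mu1, mu0 = x1.mean(axis=0), x0.mean(axis=0)
    se = np.sqrt(x1.var(axis=0, ddof=1) / n1 + x0.var(axis=0, ddof=1) / n0)
    keep = np.argsort(np.abs(mu1 - mu0) / se)[-n_keep:]
    # Step 1.2: pooled-variance Gaussian Naive Bayes log density ratio.
    v = (((x1[:, keep] - mu1[keep]) ** 2).sum(axis=0)
         + ((x0[:, keep] - mu0[keep]) ** 2).sum(axis=0)) / (n1 + n0 - 2)
    def log_r(x):
        return (((x[:, keep] - mu0[keep]) ** 2
                 - (x[:, keep] - mu1[keep]) ** 2) / (2.0 * v)).sum(axis=1)
    # Step 2: k_min-th order statistic of the held-out class-0 scores.
    n = len(x0_threshold)
    k = int(np.ceil(n * (1 - alpha) + np.sqrt(n * np.log(1 / delta) / 2)))
    c = np.sort(log_r(x0_threshold))[k - 1]
    return lambda x: (log_r(x) > c).astype(int)

# Toy high-dimensional example: 5 signal features among 100.
rng = np.random.default_rng(7)
d, shift = 100, 1.0
x1 = rng.normal(0.0, 1.0, size=(200, d)); x1[:, :5] += shift
x0 = rng.normal(0.0, 1.0, size=(200, d))
x0_thr = rng.normal(0.0, 1.0, size=(200, d))
phi = psn_fit(x1, x0, x0_thr, alpha=0.1, delta=0.1, n_keep=5)

x0_test = rng.normal(0.0, 1.0, size=(50_000, d))
x1_test = rng.normal(0.0, 1.0, size=(50_000, d)); x1_test[:, :5] += shift
type1 = np.mean(phi(x0_test) == 1)   # controlled at (in fact well below) alpha
type2 = np.mean(phi(x1_test) == 0)
```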
We assume the domains of all $p_{0j}$ and $p_{1j}$ to be $[-1, 1]$ in all the following theoretical discussion. We will prove the NP oracle inequalities for the NSN classifier $\hat\phi_N$; those for $\hat\phi_P$ can be developed similarly. Recall that, by Proposition 2.4, we need an upper bound on the uniform deviation $\|\hat r_N - r\|_\infty$. Necessarily, the performance of the screening step should be studied. To this end, we assume that only a small fraction of the features have marginal differentiating power.
Assumption 2.
There exists a signal set $\mathcal{A} \subset \{1, \ldots, d\}$ with size $s = |\mathcal{A}|$ such that $\sup_x |F_{0j}(x) - F_{1j}(x)| \ge c$ for some positive constant $c$ and all $j \in \mathcal{A}$, and $F_{0j} = F_{1j}$ for $j \notin \mathcal{A}$.
The following proposition shows that Step 1.1 achieves exact recovery ($\widehat{\mathcal{A}} = \mathcal{A}$) with high probability for some properly chosen threshold level $b$.
Proposition 2.5 (exact recovery).
Now we are ready to control the uniform deviation of density ratio estimate given in Step 1.2.
Assumption 3.
The marginal densities $p_{0j}$ and $p_{1j}$ belong to the Hölder class of densities $\mathcal{P}_{\Sigma}(\beta, L, [-1, 1])$ for all $j = 1, \ldots, d$, and there exists a constant $\mu_{\min} > 0$ such that $p_{0j} \ge \mu_{\min}$ and $p_{1j} \ge \mu_{\min}$ on $[-1, 1]$ for all $j$. Moreover, there is a uniform absolute upper bound for $p_{0j}$ and $p_{1j}$ over all $j$, and the kernel $K$ in the nonparametric density estimates is $\beta$-valid and Lipschitz.
Smoothness conditions (Assumption 3) and the margin assumption were used together in the classical classification literature. However, it is not entirely obvious why Assumption 3 does not render the detection condition redundant. We refer interested readers to Appendix B for more detailed discussion.