Neyman-Pearson Classification under High-Dimensional Settings

08/13/2015 ∙ by Anqi Zhao, et al. ∙ MIT, University of Southern California, Harvard University, Columbia University

Most existing binary classification methods target the optimization of the overall classification risk and may fail to serve some real-world applications such as cancer diagnosis, where users are more concerned with the risk of misclassifying one specific class than the other. The Neyman-Pearson (NP) paradigm was introduced in this context as a novel statistical framework for handling asymmetric type I/II error priorities. It seeks classifiers with a minimal type II error and a constrained type I error under a user-specified level. This article is the first attempt to construct classifiers with guaranteed theoretical performance under the NP paradigm in high-dimensional settings. Based on the fundamental Neyman-Pearson Lemma, we use a plug-in approach to construct NP-type classifiers for Naive Bayes models. The proposed classifiers satisfy the NP oracle inequalities, which are natural NP paradigm counterparts of the oracle inequalities in classical binary classification. Besides their desirable theoretical properties, we also demonstrate their numerical advantages in prioritized error control via both simulation and real data studies.




1 Introduction

Classification plays an important role in many aspects of our society. In medical research, identifying pathogenically distinct tumor types is central to advances in cancer treatments (Golub et al., 1999; Alderton, 2014). In cyber security, spam messages and viruses make automatic categorical decisions a necessity. Binary classification is arguably the simplest and most important form of classification problem, and can serve as a building block for more complicated applications. We focus our attention on binary classification in this work. A few common notations are introduced to facilitate our discussion. Let $(X, Y)$ be a random pair where $X \in \mathcal{X} \subset \mathbb{R}^d$ is a vector of features and $Y \in \{0, 1\}$ indicates $X$'s class label. A classifier $\phi: \mathcal{X} \to \{0, 1\}$ is a mapping from $\mathcal{X}$ to $\{0, 1\}$ that assigns $X$ to one of the classes. A classification loss function is defined to assign a “cost” to each misclassified instance $\phi(X) \neq Y$, and the classification error $R(\phi)$ is defined as the expectation of this loss function with respect to the joint distribution of $(X, Y)$. We will focus our discussion on the 0-1 loss function $\mathbf{1}(\phi(X) \neq Y)$ throughout the paper, where $\mathbf{1}(\cdot)$ denotes the indicator function. Denote by $\mathbb{P}$ and $\mathbb{E}$ the generic probability distribution and expectation, whose meaning depends on specific contexts. The classification error is $R(\phi) = \mathbb{E}\,\mathbf{1}(\phi(X) \neq Y) = \mathbb{P}(\phi(X) \neq Y)$. The law of total probability allows us to decompose it into a weighted average of type I error $R_0(\phi) = \mathbb{P}(\phi(X) \neq Y \mid Y = 0)$ and type II error $R_1(\phi) = \mathbb{P}(\phi(X) \neq Y \mid Y = 1)$ as

$$R(\phi) = \mathbb{P}(Y = 0)\, R_0(\phi) + \mathbb{P}(Y = 1)\, R_1(\phi). \tag{1.1}$$
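The decomposition of the overall risk into class-weighted type I and type II errors can be checked numerically. The following is a minimal sketch on made-up labels; the helper name `error_rates` and the toy data are illustrative assumptions, not part of the paper.

```python
def error_rates(y_true, y_pred):
    """Return (overall risk, type I error, type II error) under 0-1 loss."""
    n0 = sum(1 for y in y_true if y == 0)
    n1 = len(y_true) - n0
    # Type I error: class-0 instances that are predicted as class 1.
    r0 = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1) / n0
    # Type II error: class-1 instances that are predicted as class 0.
    r1 = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0) / n1
    risk = sum(1 for y, p in zip(y_true, y_pred) if y != p) / len(y_true)
    return risk, r0, r1


# Hypothetical toy data: 4 instances of class 0 and 16 of class 1.
y_true = [0] * 4 + [1] * 16
y_pred = [1, 1, 1, 0] + [1] * 15 + [0]  # 3 of the 4 class-0 cases are missed

risk, r0, r1 = error_rates(y_true, y_pred)
pi0 = y_true.count(0) / len(y_true)  # empirical proportion of class 0

print(risk, r0, r1)
# The overall risk is the class-proportion-weighted average of the two errors.
assert abs(risk - (pi0 * r0 + (1 - pi0) * r1)) < 1e-12
```

In this toy case the overall risk is low even though three quarters of the class-0 instances are misclassified, which is the kind of imbalance that motivates controlling the type I error directly.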


With the advent of high-throughput technologies, classification tasks have experienced an exponential growth in the feature dimensions throughout the past decade. The fundamental challenge of “high dimension, low sample size” has motivated the development of a plethora of classification algorithms for various applications. While dependencies among features are usually considered a crucial characteristic of the data (Ackermann and Strimmer, 2009), and can effectively reduce classification errors under suitable models and relative data abundance (Shao et al., 2011; Cai and Liu, 2011; Fan et al., 2012; Mai et al., 2012; Witten and Tibshirani, 2012), independence rules, with their superb scalability, become a rule of thumb when the feature dimension grows faster than the sample size (Hastie et al., 2009; James et al., 2013). Despite Naive Bayes models’ reputation of being “simplistic” by ignoring all dependency structure among features, they lead to simple classifiers that have proven worthy on high-dimensional data with remarkably good performances in numerous real-life applications. Taking the classical model setting of two-class Gaussian with a common covariance matrix, Bickel and Levina (2004) showed the superior performance of Naive Bayes models over (naive implementation of) the Fisher linear discriminant rule under broad conditions in high-dimensional settings. Fan and Fan (2008) further established the necessity of feature selection for high-dimensional classification problems by showing that even independence rules can be as poor as random guessing due to noise accumulation. Featuring both independence rule and feature selection, the (sparse) Naive Bayes model remains a good choice for classification when the sample size is fairly limited.

1.1 Asymmetrical priorities on errors

Most existing binary classification methods target the optimization of the overall risk (1.1) and may fail to serve the purpose when users’ relative priorities over type I/II errors differ significantly from those implied by the marginal probabilities of the two classes. A representative example of such a scenario is the diagnosis of serious disease. Let $Y = 1$ code the healthy class and $Y = 0$ code the diseased class. Given that usually $\mathbb{P}(Y = 1) \gg \mathbb{P}(Y = 0)$, minimizing the overall risk (1.1) might yield classifiers with small overall risk $R$ (as a result of small $R_1$) yet large $R_0$. This situation is quite undesirable in practice, given that flagging a healthy case incurs only the extra cost of additional tests while failing to detect the disease endangers a life.

The neuroblastoma dataset introduced by Oberthuer et al. (2006) provides a clear illustration of such intuition. The dataset contains gene expression profiles on genes from 246 patients in a German neuroblastoma trial, among which 56 are high-risk (labeled as 0) and 190 are low-risk (labeled as 1). We randomly selected 41 ‘0’s and 123 ‘1’s as our training sample (such that the proportion of ‘0’s is about the same as that in the entire dataset), and tested the resulting classifiers on the remaining 15 ‘0’s and 67 ‘1’s. The average error rates of PSN (to be proposed; implemented here at significance level 0.05), Gaussian Naive Bayes (nb), penalized logistic regression (pen-log), and Support Vector Machine (svm) over 1000 random splits are summarized in Table 1.

Error type          PSN    nb     pen-log   svm
type I  (0 as 1)    .038   .308   .529      .603
type II (1 as 0)    .761   .150   .103      .573
Table 1: Average error rates over 1000 random splits for neuroblastoma dataset.

All procedures except PSN led to high type I errors, and are thus considered unsatisfactory given the more severe consequences of missing a diseased instance than vice versa.

One existing solution to asymmetric error control is cost-sensitive learning, which assigns two different costs as weights of the type I/II errors (Elkan, 2001; Zadrozny et al., 2003). Despite the many merits and practical value of this framework, limitations arise in applications when there is no consensus over how much cost should be assigned to each class, or more fundamentally, whether it is morally acceptable to assign costs in the first place. Also, when users have a specific target for type I/II error control, cost-sensitive learning does not fit. Other methods aiming for small type I error include the Asymmetric Support Vector Machine (Wu et al., 2008), and the p-value for classification (Dümbgen et al., 2008). However, the former has no theoretical guarantee on errors, while the latter treats all classes as of equal importance.

1.2 Neyman-Pearson (NP) paradigm and NP oracle inequalities

The Neyman-Pearson (NP) paradigm was introduced as a novel statistical framework for targeted type I/II error control. Assuming type I error is the prioritized error type, this paradigm seeks to control $R_0$ under a user-specified level $\alpha$ with $R_1$ as small as possible. The oracle is thus

$$\phi^* \in \operatorname*{arg\,min}_{\phi:\, R_0(\phi) \le \alpha} R_1(\phi), \tag{1.2}$$

where the significance level $\alpha$ reflects the level of conservativeness towards type I error. Given that $\phi^*$ is unattainable in the learning paradigm, the best within our capability is to construct a data-dependent classifier $\hat\phi$ that mimics it.

Despite its practical importance, NP classification has not received much attention in the statistics and machine learning communities. Cannon et al. (2002) initiated the theoretical treatment of NP classification. Under the same framework, Scott (2005) and Scott and Nowak (2005) derived several results for traditional statistical learning such as PAC bounds or oracle inequalities. By combining type I and type II errors in sensible ways, Scott (2007) proposed a performance measure for NP classification. More recently, Blanchard et al. (2010) developed a general solution to semi-supervised novelty detection by reducing it to NP classification. Other related works include Casasent and Chen (2003) and Han et al. (2008). A common issue with methods in this line of literature is that they all follow an empirical risk minimization (ERM) approach, and use some form of relaxed empirical type I error constraint in the optimization program. As a result, all type I errors can only be proven to satisfy some relaxed upper bound. Take the framework set up by Cannon et al. (2002) for example. Given $\varepsilon_0 > 0$, they proposed the program

$$\min_{\phi \in \mathcal{H},\ \hat R_0(\phi) \le \alpha + \varepsilon_0/2} \hat R_1(\phi),$$

where $\mathcal{H}$ is a set of classifiers with finite Vapnik-Chervonenkis dimension, and $\hat R_0$, $\hat R_1$ are the empirical type I and type II errors respectively. It is shown that with high probability, the solution $\hat\phi$ to the above program satisfies simultaneously: i) the type I error $R_0(\hat\phi)$ is bounded from above by $\alpha + \varepsilon_0$, and ii) the type II error $R_1(\hat\phi)$ is bounded from above by $R_1(\phi^*) + \varepsilon_1$ for some $\varepsilon_1 > 0$.

Rigollet and Tong (2011) is a significant departure from the previous NP classification literature. This paper argues that a good classifier $\hat\phi$ under the NP paradigm should respect the chosen significance level $\alpha$, rather than some relaxation of it. More precisely, two NP oracle inequalities should be satisfied simultaneously with high probability:

  • the type I error constraint is respected, i.e., $R_0(\hat\phi) \le \alpha$.

  • the excess type II error $R_1(\hat\phi) - R_1(\phi^*)$ diminishes with explicit rates (w.r.t. sample size).

Recall that, for a classifier $\hat h$, the classical oracle inequality insists that with high probability

the excess risk $R(\hat h) - R(h^*)$ diminishes with explicit rates, (1.3)

where $h^*(x) = \mathbf{1}(\eta(x) \ge 1/2)$ is the Bayes classifier, in which $\eta(x) = \mathbb{E}[Y \mid X = x] = \mathbb{P}(Y = 1 \mid X = x)$ is the regression function of $Y$ on $X$ (see Koltchinskii (2008) and references within). The two NP oracle inequalities defined above can be thought of as a generalization of (1.3) that provides a novel characterization of classifiers’ theoretical performances under the NP paradigm.

Using a more stringent empirical type I error constraint (than the level $\alpha$), Rigollet and Tong (2011) established NP oracle inequalities for their proposed classifiers under convex loss functions (as opposed to the indicator loss). They also proved an interesting negative result: under the binary loss, ERM approaches (convexification or not) cannot guarantee diminishing excess type II error as long as one insists that the type I error of the proposed classifier be bounded from above by $\alpha$ with high probability. This negative result motivated a plug-in approach to NP classification in Tong (2013).

1.3 Plug-in approaches

Plug-in methods in classical binary classification have been well studied in the literature, where the usual plug-in target is the Bayes classifier $h^*(x) = \mathbf{1}(\eta(x) \ge 1/2)$. Earlier works gave rise to pessimism about the plug-in approach to classification. For example, under certain assumptions, Yang (1999) showed that plug-in estimators cannot achieve excess risk with rates faster than $O(1/\sqrt{n})$, while direct methods can achieve rates up to $O(1/n)$ under the margin assumption (Mammen and Tsybakov, 1999; Tsybakov, 2004; Tsybakov and van de Geer, 2005; Tarigan and van de Geer, 2006). However, it was shown in Audibert and Tsybakov (2007) that plug-in classifiers based on local polynomial estimators can achieve rates faster than $O(1/n)$, with a smoothness condition on $\eta$ and the margin assumption.

The oracle classifier under the NP paradigm arises from its close connection to the Neyman-Pearson Lemma in statistical hypothesis testing. Hypothesis testing bears strong resemblance to binary classification if we assume the following model. Let $P_0$ and $P_1$ be two known probability distributions on $\mathcal{X} \subset \mathbb{R}^d$. Assume that $Y \sim \mathrm{Bernoulli}(\zeta)$ for some $\zeta \in (0, 1)$, and the conditional distribution of $X$ given $Y$ is $P_Y$. Given such a model, the goal of statistical hypothesis testing is to determine if we should reject the null hypothesis that $X$ was generated from $P_0$. To this end, we construct a randomized test $\phi: \mathcal{X} \to [0, 1]$ that rejects the null with probability $\phi(X)$. Two types of errors arise: type I error occurs when $P_0$ is rejected yet $X \sim P_0$, and type II error occurs when $P_0$ is not rejected yet $X \sim P_1$. The Neyman-Pearson paradigm in hypothesis testing amounts to choosing $\phi$ that solves the following constrained optimization problem

$$\text{maximize } \mathbb{E}[\phi(X) \mid Y = 1], \quad \text{subject to } \mathbb{E}[\phi(X) \mid Y = 0] \le \alpha,$$

where $\alpha \in (0, 1)$ is the significance level of the test. A solution to this constrained optimization problem is called a most powerful test of level $\alpha$. The Neyman-Pearson Lemma gives mild sufficient conditions for the existence of such a test.

Lemma 1.1 (Neyman-Pearson Lemma).

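Let $P_1$ and $P_0$ be two probability measures with densities $p$ and $q$ respectively, and denote the density ratio as $r(x) = p(x)/q(x)$. For a given significance level $\alpha$, let $C_\alpha$ be such that $P_0\{r(X) > C_\alpha\} \le \alpha$ and $P_0\{r(X) \ge C_\alpha\} \ge \alpha$. Then, the most powerful test of level $\alpha$ is

$$\phi^*(X) = \begin{cases} 1 & \text{if } r(X) > C_\alpha, \\ 0 & \text{if } r(X) < C_\alpha, \\ \dfrac{\alpha - P_0\{r(X) > C_\alpha\}}{P_0\{r(X) = C_\alpha\}} & \text{if } r(X) = C_\alpha. \end{cases}$$

Under a mild continuity assumption, we take the NP oracle

$$\phi^*(x) = \mathbf{1}\{r(x) \ge C_\alpha\} = \mathbf{1}\{p(x)/q(x) \ge C_\alpha\}$$

as our plug-in target for NP classification. With kernel density estimates $\hat p$, $\hat q$, and a proper estimate $\hat C_\alpha$ of the threshold level, Tong (2013) constructed a plug-in classifier $\mathbf{1}\{\hat p(x)/\hat q(x) \ge \hat C_\alpha\}$ that satisfies both NP oracle inequalities with high probability when the dimensionality is small, leaving the high-dimensional case uncharted territory.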
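To make the NP oracle concrete, here is a minimal sketch for a univariate toy model in which both class densities are known: class 0 is N(0, 1) and class 1 is N(mu, 1). This model, the function name `np_oracle_gaussian`, and the value mu = 2 are assumptions chosen for illustration only. Because the density ratio is increasing in x in this model, thresholding the ratio is equivalent to thresholding x itself.

```python
import math
from statistics import NormalDist


def np_oracle_gaussian(mu, alpha):
    """NP oracle for class 0 ~ N(0, 1) versus class 1 ~ N(mu, 1), with mu > 0.

    The density ratio r(x) = exp(mu * x - mu**2 / 2) is increasing in x, so
    the rule 1(r(x) >= C_alpha) is the same as 1(x >= x_cut).
    """
    std = NormalDist()
    x_cut = std.inv_cdf(1 - alpha)                 # P0(X >= x_cut) = alpha
    c_alpha = math.exp(mu * x_cut - mu ** 2 / 2)   # threshold on the ratio scale
    type1 = 1 - std.cdf(x_cut)                     # equals alpha by construction
    type2 = NormalDist(mu, 1).cdf(x_cut)           # P1(X < x_cut)
    return c_alpha, type1, type2


mu = 2.0
c_alpha, type1, type2 = np_oracle_gaussian(mu, alpha=0.05)

# For comparison: the Bayes classifier under equal class priors cuts at mu / 2.
bayes_type1 = 1 - NormalDist().cdf(mu / 2)
bayes_type2 = NormalDist(mu, 1).cdf(mu / 2)

print("NP oracle:", c_alpha, type1, type2)
print("Bayes    :", bayes_type1, bayes_type2)
```

In this toy model the NP oracle holds the type I error at the chosen level and pays for it with a larger type II error than the equal-prior Bayes rule, which is the trade-off the NP paradigm accepts.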

1.4 Contribution

In the big data era, the NP classification framework faces the same curse of dimensionality as its classical counterpart. Despite the framework's wide potential applications, this paper is the first attempt to construct performance-guaranteed classifiers under the NP paradigm in high-dimensional settings. Based on the Neyman-Pearson Lemma, we employ Naive Bayes models and propose a computationally feasible plug-in approach to construct classifiers that satisfy the NP oracle inequalities. We also improve the detection condition, a critical theoretical assumption first introduced in Tong (2013), for effective threshold level estimation that grounds the good NP properties of these classifiers. Necessity of the new detection condition is also discussed. Note that classifiers proposed in this work are not straightforward extensions of Tong (2013): kernel density estimation is now applied in combination with feature selection, and the threshold level is estimated in a more precise way by order statistics that require only a moderate sample size, whereas Tong (2013) resorted to the Vapnik-Chervonenkis theory and required a sample size much bigger than what is available in most high-dimensional applications.

The rest of the paper is organized as follows. Two screening based plug-in NP-type classifiers are presented in Section 2, where theoretical properties are also discussed. Performance of the proposed classifiers is demonstrated in Section 3 by both simulation studies and real data analysis. We conclude in Section 4 with a short discussion. The technical proofs are relegated to the Appendix.

2 Methods

In this section, we first introduce several notations and definitions, with a focus on the detection condition. Then we present the plug-in procedure, together with its theoretical properties.

2.1 Notations and definitions

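We introduce here several notations adapted from Audibert and Tsybakov (2007). For $\beta > 0$, denote by $\lfloor\beta\rfloor$ the largest integer strictly less than $\beta$. For any $x \in \mathbb{R}^d$ and any $\lfloor\beta\rfloor$ times continuously differentiable real-valued function $g$ on $\mathbb{R}^d$, we denote by $g_x$ its Taylor polynomial of degree $\lfloor\beta\rfloor$ at point $x$. For $L > 0$, the $(\beta, L, [-1,1]^d)$-Hölder class of functions, denoted by $\Sigma(\beta, L, [-1,1]^d)$, is the set of functions $g: \mathbb{R}^d \to \mathbb{R}$ that are $\lfloor\beta\rfloor$ times continuously differentiable and satisfy, for any $x, x' \in [-1,1]^d$, the inequality $|g(x') - g_x(x')| \le L \|x - x'\|^\beta$. The $(\beta, L, [-1,1]^d)$-Hölder class of density is defined as

$$\mathcal{P}_\Sigma(\beta, L, [-1,1]^d) = \Big\{ p : p \ge 0,\ \int p = 1,\ p \in \Sigma(\beta, L, [-1,1]^d) \Big\}.$$

We will use $\beta$-valid kernels (kernels of order $\beta$, Tsybakov (2009)) for all the kernel estimation throughout the theoretical discussion, the definition of which is as follows.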

Definition 2.1.

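Let $K$ be a real-valued function on $\mathbb{R}^d$ with support $[-1,1]^d$. The function $K$ is a $\beta$-valid kernel if it satisfies $\int K = 1$, $\int |K|^s < \infty$ for any $s \ge 1$, $\int \|t\|^\beta |K(t)|\,dt < \infty$, and in the case $\lfloor\beta\rfloor \ge 1$, it satisfies $\int t^{a} K(t)\,dt = 0$ for any multi-index $a = (a_1, \ldots, a_d) \in \mathbb{N}^d$ such that $1 \le a_1 + \cdots + a_d \le \lfloor\beta\rfloor$.

We assume that all the $\beta$-valid kernels considered in the theoretical part of this paper are constructed from Legendre polynomials, and are thus Lipschitz and bounded, satisfying the kernel conditions for the important technical Lemma A.6.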

Definition 2.2 (margin assumption).

A function $f(\cdot)$ is said to satisfy the margin assumption of order $\bar\gamma$ with respect to probability distribution $P$ at the level $C^*$ if there exists a positive constant $M_0$, such that for any $\delta \ge 0$,

$$P\{|f(X) - C^*| \le \delta\} \le M_0\, \delta^{\bar\gamma}.$$

This assumption was first introduced in Polonik (1995). In the classical binary classification framework, Mammen and Tsybakov (1999) proposed a similar condition named “margin condition” by requiring most data to be away from the optimal decision boundary. In the classical classification paradigm, Definition 2.2 reduces to the “margin condition” by taking $f = \eta$ and $C^* = 1/2$, with $\{x : \eta(x) = 1/2\}$ giving the decision boundary of the Bayes classifier. On the other hand, unlike the classical paradigm where the optimal threshold level is known and does not need an estimate, the optimal threshold level $C_\alpha$ in the NP paradigm is unknown and needs to be estimated, suggesting the necessity of having sufficient data around the decision boundary to detect it well. This concern motivated the following condition improved from Tong (2013).

Definition 2.3 (detection condition).

A function $f(\cdot)$ is said to satisfy the detection condition of order $\underline\gamma$ with respect to $P$ (i.e., $X \sim P$) at level $(C^*, \delta^*)$ if there exists a positive constant $M_1$, such that for any $\delta \in (0, \delta^*)$,

$$P\{C^* \le f(X) \le C^* + \delta\} \ge M_1\, \delta^{\underline\gamma}.$$

A detection condition works as an opposite force to the margin assumption, and is basically an assumption on the lower bound of probability. Though we take here a power function as the lower bound, so that it is simple and aesthetically similar to the margin assumption, any increasing $u(\cdot)$ on $\mathbb{R}^+$ with $\lim_{x \to 0^+} u(x) = 0$ should be able to serve the purpose. The version of the detection condition we would use to establish the NP inequalities for the (to be) proposed classifiers takes $f = r$, $C^* = C_\alpha$, and $P = P_0$ (recall that $P_0$ is the conditional distribution of $X$ given $Y = 0$).

Now we argue why such a condition is necessary to achieve the NP oracle inequalities. Consider the simpler case where the density ratio is known, and we only need a proper estimate of the threshold level . If there is nothing like the detection condition (Definition 2.3 involves a power function, but the idea is just to have any kind of lower bound), we would have, for some ,


In getting the threshold estimate of , we can not distinguish any threshold level between and . In particular, it is possible that

But then the excess type II error is bounded from below as follows

where the last quantity can be positive. Therefore, the second NP oracle inequality (diminishing excess type II error) does not hold for . Since some detection condition is necessary in this simpler case, it is certainly necessary in our real setup.

Note that Definition 2.3 is a significant improvement of the detection condition formulated in Tong (2013), which requires

$$P\{C^* - \delta \le f(X) \le C^*\} \wedge P\{C^* \le f(X) \le C^* + \delta\} \ge M_1\, \delta^{\underline\gamma}.$$

We are able to drop the lower bound for the first piece due to an improved layout of the proofs. Intuitively, our new detection condition ensures an upper bound on $\hat C_\alpha$. But we do not need an extra condition to get a lower bound of $\hat C_\alpha$, because of the type I error bound requirement (see the proof of Proposition 2.4 for details).

2.2 Neyman-Pearson plug-in procedure

Suppose the sampling scheme is fixed as follows.

Assumption 1.

Assume the training sample contains i.i.d. observations from class 1 with density , and i.i.d. observations from class 0 with density . Given fixed , , , and such that , , we further decompose and into independent subsamples as: , and , where , , , , .

The sample splitting idea has been considered in the literature, such as in Meinshausen and Bühlmann (2010) and Robins et al. (2006). Given these samples, we introduce the following plug-in procedure.

Definition 2.4.

Neyman-Pearson plug-in procedure

  • Use , , , and to construct a density ratio estimate . The specific use of each subsample will be introduced in Section 2.4.

  • Given choose a threshold estimate from the set .

Denote by the -th order statistic of , . The corresponding plug-in classifier by setting is


A generic procedure for choosing the optimal will be given in Section 2.3.
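The thresholding step of the plug-in procedure can be sketched as follows: given any scoring function standing in for the density ratio estimate and a held-out class-0 sample, the threshold is the k-th order statistic of the held-out scores. The function name `np_plugin_classifier`, the identity score, the Gaussian class-0 sample, and the choice k = 195 are all hypothetical and serve only as an illustration.

```python
import random


def np_plugin_classifier(score, held_out_class0, k):
    """Return (classifier, threshold), where the threshold is the k-th order
    statistic (1-indexed) of the scores of a held-out class-0 sample."""
    scores = sorted(score(v) for v in held_out_class0)
    threshold = scores[k - 1]

    def classify(x):
        # Predict class 1 when the score reaches the estimated threshold.
        return 1 if score(x) >= threshold else 0

    return classify, threshold


rng = random.Random(0)
score = lambda x: x  # hypothetical stand-in for a density ratio estimate
held_out = [rng.gauss(0.0, 1.0) for _ in range(200)]  # held-out class-0 sample

classify, threshold = np_plugin_classifier(score, held_out, k=195)

# Empirical type I error on fresh class-0 draws (fraction predicted as class 1).
fresh = [rng.gauss(0.0, 1.0) for _ in range(20000)]
type1 = sum(classify(x) for x in fresh) / len(fresh)
print(threshold, type1)
```

A larger k gives a more conservative threshold and hence a smaller type I error, at the price of a larger type II error.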

2.3 Threshold estimate

For any arbitrary density ratio estimate , we employ a proper order statistic to estimate the threshold , and establish a probabilistic upper bound for the type I error of for each .

Proposition 2.1.

For any arbitrary density ratio estimate , let . It holds for any and that


where is the cdf of Beta. The inequality becomes equality when is continuous almost surely.

In view of the above proposition, a sufficient condition for the classifier to satisfy NP Oracle Inequality (I) at tolerance level is thus


Despite the potential tightness of (2.3), we are not able to derive an explicit formula for the minimum that satisfies (2.4). To get an explicit choice for , we resort to concentration inequalities for an alternative.
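Although no closed form is available, the smallest admissible order can be found by direct search. The sketch below assumes the bound of Proposition 2.1 takes the standard order-statistic form: for n held-out class-0 scores, the cdf of Beta(k, n - k + 1) evaluated at 1 - alpha, which equals the binomial tail P(Binomial(n, 1 - alpha) >= k). The names `violation_prob`, `min_order`, `n`, and `delta` (the tolerance level) are hypothetical.

```python
from math import comb


def violation_prob(k, n, alpha):
    """Bound on P(type I error > alpha) when thresholding at the k-th order
    statistic of n held-out class-0 scores: P(Binomial(n, 1 - alpha) >= k)."""
    return sum(
        comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
        for j in range(k, n + 1)
    )


def min_order(n, alpha, delta):
    """Smallest k in 1..n whose violation bound is at most delta, else None.

    The bound is decreasing in k, so the first admissible k is the smallest.
    """
    for k in range(1, n + 1):
        if violation_prob(k, n, alpha) <= delta:
            return k
    return None


print(min_order(100, alpha=0.05, delta=0.05))
print(min_order(59, alpha=0.05, delta=0.05))
print(min_order(58, alpha=0.05, delta=0.05))  # too few held-out observations
```

Under this assumed form, no order statistic is admissible when the held-out class-0 sample is too small: even the largest score fails once (1 - alpha)^n exceeds delta.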

Proposition 2.2.

For any arbitrary density ratio estimate , let . It holds for any and that




Let . Proposition 2.2 implies that is a sufficient condition for the classifier to satisfy NP Oracle Inequality (I). The next step is to characterize and choose some , so that has small excess type II error. Clearly, we would like to find the smallest element in .

Proposition 2.3.

The minimum that satisfies is


where denotes the smallest integer larger than or equal to , and


  1. .

  2. is asymptotically the empirical

    -th quantile of

    in the sense that

  3. For any , we have , and thus

Introduce shorthand notations , , and . We will take


as the default NP plug-in classifier for any arbitrary . An alternative threshold estimate that also guarantees type I error bound is derived in the Appendix C. Assume for the rest of the theoretical discussion. It follows from Proposition 2.3 that , and thus , with guaranteed type I error control.

Remark 2.1.

Note that . Thus, choosing the -th order statistic of as the threshold can be viewed as a modification to the classical approach of estimating the quantile of by the -th order statistic of . Recall that the oracle is actually the quantile of distribution so the intuition is that is asymptotically (when ) equivalent to the quantile of which in turn converges (when ) to as the quantile of under moderate conditions.

Lemma 2.1.

Let . In addition to Assumption 1, suppose be such that is continuous almost surely. Then for any and , the distance between ( as defined in (2.8)) and can be bounded as



If , we have

Proposition 2.4.

Let . In addition to assumptions of Lemma 2.1, assume that the density ratio satisfies the margin assumption of order at level (with constant ) and detection condition of order at level (with constant ), both with respect to distribution . If , the excess type II error of the classifier defined in (2.8) satisfies with probability at least ,

Given the above proposition, we can control the excess type II error as long as the uniform deviation of density ratio estimate is controlled. In the following subsection, we will introduce estimates and provide bounds for .

2.4 Density ratio estimate

Denote the marginal densities of class 1 and 0 as and () respectively, Naive Bayes models for the density ratio take the form

The subsamples , , and are used to construct (nonparametric/parametric) estimates of and for .

Nonparametric estimate of the density ratio. For marginal densities and , we apply kernel estimates , and , where is the kernel function, are the bandwidths, and and denote the -th component of and respectively. The resulting nonparametric estimate is
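A Naive Bayes kernel estimate of the density ratio multiplies the marginal kernel density estimates across features. The following is a minimal sketch on the log scale. The Epanechnikov kernel, the single shared bandwidth `h`, the small `floor` that guards against taking the logarithm of zero, and the simulated data are all illustrative assumptions rather than the choices analysed in the paper.

```python
import math
import random


def epanechnikov(u):
    """A simple kernel supported on [-1, 1]."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0


def kde(sample, h):
    """Univariate kernel density estimate with bandwidth h."""
    n = len(sample)
    return lambda x: sum(epanechnikov((x - s) / h) for s in sample) / (n * h)


def naive_bayes_log_ratio(class1, class0, h, floor=1e-12):
    """Log density ratio estimate: sum over features of
    log(p_hat_j(x_j)) - log(q_hat_j(x_j))."""
    d = len(class1[0])
    p_hat = [kde([row[j] for row in class1], h) for j in range(d)]
    q_hat = [kde([row[j] for row in class0], h) for j in range(d)]

    def log_ratio(x):
        return sum(
            math.log(max(p_hat[j](x[j]), floor))
            - math.log(max(q_hat[j](x[j]), floor))
            for j in range(d)
        )

    return log_ratio


# Hypothetical data: 3 independent features, class means +0.5 and -0.5.
rng = random.Random(0)
class1 = [[rng.gauss(0.5, 0.3) for _ in range(3)] for _ in range(200)]
class0 = [[rng.gauss(-0.5, 0.3) for _ in range(3)] for _ in range(200)]

log_ratio = naive_bayes_log_ratio(class1, class0, h=0.3)
print(log_ratio([0.5, 0.5, 0.5]), log_ratio([-0.5, -0.5, -0.5]))
```

Working with the log ratio avoids numerical underflow when many marginal ratios are multiplied, and it does not change the classifier because the logarithm is monotone.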


Parametric estimate of the density ratio. Assume the two-class Gaussian model where . We estimate , and using their sample versions , and . Under this model, the density ratio function is given by

and the corresponding parametric estimate is
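The parametric estimate can be sketched under the assumption that the two classes are Gaussian with class-specific means and a common diagonal covariance. In that case the log density ratio is linear in x: for each feature j it contributes (mu1_j - mu0_j) / sigma_j^2 times (x_j - (mu1_j + mu0_j) / 2). The function name `gaussian_nb_log_ratio`, the pooled variance estimator, and the toy data are illustrative assumptions.

```python
from statistics import fmean


def gaussian_nb_log_ratio(class1, class0):
    """Plug-in log density ratio for two Gaussian classes that share a
    diagonal covariance, using sample means and pooled sample variances."""
    d = len(class1[0])
    n, m = len(class1), len(class0)
    coef, mid = [], []
    for j in range(d):
        u = [row[j] for row in class1]
        v = [row[j] for row in class0]
        mu1, mu0 = fmean(u), fmean(v)
        # Pooled variance estimate of feature j across the two classes.
        pooled = (
            sum((a - mu1) ** 2 for a in u) + sum((b - mu0) ** 2 for b in v)
        ) / (n + m - 2)
        coef.append((mu1 - mu0) / pooled)
        mid.append((mu1 + mu0) / 2)

    def log_ratio(x):
        return sum(c * (xj - t) for c, xj, t in zip(coef, x, mid))

    return log_ratio


# Hypothetical toy data: only the first feature separates the classes.
class1 = [[1.0, 0.0], [3.0, 2.0]]
class0 = [[-1.0, 0.0], [-3.0, 2.0]]

log_ratio = gaussian_nb_log_ratio(class1, class0)
print(log_ratio([1.0, 5.0]))
```

In this toy data the second feature has identical means in both classes, so its coefficient is zero and it does not affect the ratio.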


2.5 Screening-based density ratio estimate and plug-in procedures

For “high dimension, low sample size” applications, complex models that take into account all features usually fail; even Naive Bayes models that ignore feature dependency might lead to poor performance due to noise accumulation (Fan and Fan, 2008). A common solution in these scenarios is to first study marginal relations between the response and each of the features (Fan and Lv, 2008; Li et al., 2012). By selecting the most important individual features, we greatly reduce the model size, and other models can be applied after this screening step. We now introduce screening based variants of and . Let and denote the cdfs of and respectively, for . Step 1 of the plug-in procedure (Definition 2.4) introduced in Section 2.2 is now decomposed into a screening substep and an estimation substep.

Nonparametric Screening-based NP Naive Bayes (NSN) classifier

Step 1.1

Select features using and as follows:


where is some threshold level, and


are the empirical cdfs.
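A screening substep based on empirical cdfs can be sketched as follows. The sketch assumes that the screening statistic for each feature is the largest gap between the two class-wise empirical cdfs (a Kolmogorov-Smirnov type distance) and that a feature is kept when this gap is at least the threshold. The exact statistic, the names `ks_distance`, `screen_features`, and `tau`, and the toy data are assumptions.

```python
from bisect import bisect_right


def ks_distance(a, b):
    """Largest absolute gap between the empirical cdfs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    # Both empirical cdfs are right-continuous step functions that jump only
    # at observed points, so the supremum is attained at an observed point.
    return max(
        abs(bisect_right(a_sorted, x) / len(a) - bisect_right(b_sorted, x) / len(b))
        for x in a_sorted + b_sorted
    )


def screen_features(class0, class1, tau):
    """Indices of features whose class-wise empirical cdfs differ by >= tau."""
    d = len(class0[0])
    return [
        j
        for j in range(d)
        if ks_distance([r[j] for r in class0], [r[j] for r in class1]) >= tau
    ]


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
class0 = [[0.0, 0.0], [1.0, 1.0]]
class1 = [[2.0, 0.0], [3.0, 1.0]]
print(screen_features(class0, class1, tau=0.5))
```

A distance of this kind depends only on the ranks of the observations, so it is unaffected by monotone transformations of a feature.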

Step 1.2

Use and to construct kernel estimates of and for . The density ratio estimate is given by

Step 2

Given , use to get a threshold estimate as in (2.8).

The resulting NSN classifier is


Parametric Screening-based NP Naive Bayes (PSN) classifier

The PSN procedure is similar to NSN, except for the following two differences. In Step 1.1, features are now selected based on t-statistics ( represent the index set of the selected features). In Step 1.2, , for follow two-class Gaussian model, and the resulting parametric screening-based density ratio estimate is

The corresponding PSN classifier is thus given by


We assume the domains of all and to be for all the following theoretical discussion. We will prove NP oracle inequalities for , and those for can be developed similarly. Recall that by Proposition 2.4, we need an upper bound for . Necessarily, performance of the screening step should be studied. To this end, we assume that only a small fraction of the features have marginal differentiating power.

Assumption 2.

There exists a signal set with size such that for some positive constant , and for .

The following proposition shows that Step 1.1 achieves exact recovery () with high probability for some properly chosen .

Proposition 2.5 (exact recovery).

Let . In addition to Assumptions 1 and 2, suppose . Then for any where , the screening substep Step 1.1 (2.12) satisfies

Now we are ready to control the uniform deviation of density ratio estimate given in Step 1.2.

Assumption 3.

The marginal densities , for all , and there exists such that for all . There exists some constant , such that , and there is a uniform absolute upper bound for and for and . Moreover, the kernel in the nonparametric density estimates is -valid and -Lipschitz.

Smoothness conditions (Assumption 3) and the margin assumption were used together in the classical classification literature. However, it is not entirely obvious why Assumption 3 does not render the detection condition redundant. We refer interested readers to Appendix B for more detailed discussion.

Let and be the constants in Lemma A.6 when applied to and respectively. Assumption 3 ensures the existence of absolute constants and