1 Introduction
Designing a binary classifier with asymmetric costs for type I (false positive) and type II (false negative) errors Gu et al. (2018); Zhang et al. (2017), or equivalently designing a Neyman-Pearson classifier Kim et al. (2019), is required in various applications ranging from face recognition Zhang et al. (2016) and online electrocardiogram monitoring Carrera et al. (2019) to video surveillance Saligrama and Chen (2012) and data imputation Ozkan et al. (2015b). For example, in medical diagnostics, a type II error (misdiagnosing as healthy) has perhaps more severe consequences, whereas a type I error (misdiagnosing as unhealthy) may result in devastating psychological effects Jiang et al. (2014). In this example, the error costs most likely need to be set asymmetrically for cost sensitive learning
Gu et al. (2018); Zhang et al. (2017) of the desired classifier. However, it is often more convenient but technically equivalent Kim et al. (2019); Davenport et al. (2010) to describe the user needs by the maximum tolerable type I error, cf. Tong et al. (2016) and the references therein, instead of having to determine the error costs that meet the tolerance. This leads to the Neyman-Pearson (NP) characterization of the desired classifier Kim et al. (2019); Davenport et al. (2010) and false positive rate controllability, where the goal is to maximize the detection power, i.e., minimize the type II error, while upper bounding the false positive rate, i.e., the type I error, by a user-specified threshold.
To this end, for the first time in the literature, we introduce a novel online and nonlinear NP classifier based on a single hidden layer feedforward neural network (SLFN), which is sequentially learned with a Lagrangian nonconvex NP objective (i.e., maximum detection power at a controllable user-specified false positive rate). We use stochastic gradient descent (SGD) optimization for scalability to voluminous data and online processing with limited memory requirements. During the SGD iterations, we a) sequentially infer the value of the Lagrangian multiplier in a data driven manner to obtain the correspondence between the asymmetrical error costs and the desired type I error rate, and b) update all the SLFN parameters to maximize the detection power (minimize the resulting cost sensitive classification error) at the desired false positive rate. To achieve powerful nonlinear modeling and improve scalability, we use the SLFN in a kernel inspired manner, cf. Rahimi and Recht (2008) for the kernel approach to nonlinearity. For this purpose, the hidden layer is initialized with a sinusoidal activation to approximately construct the high dimensional kernel space (of any symmetric and shift invariant kernel under Mercer’s conditions, e.g., radial basis function) through the random Fourier features (RFFs) Rahimi and Recht (2008). The output layer follows with identity activation.
We emphasize that the kernel inspired SLFN has two benefits: expedited powerful nonlinear modeling and scalability. First, it enables an excellent network initialization, as RFFs are already sufficiently powerful to learn complex nonlinear decision boundaries even when kept untrained. This speeds up and enhances the learning of complex nonlinearities by relieving the burden of network initialization. Second, the hidden layer is compactified thanks to the exponential rate of improvement in approximating the high dimensional kernel space due to Hoeffding’s inequality Rahimi and Recht (2008). As a result, the number of hidden nodes, the parameter complexity and the computational complexity of forward-backward network evaluations are reduced, and therefore the scalability substantially improves while overfitting is also mitigated. Moreover, thanks to the learning of the hidden layer, the randomly initialized Fourier features are continuously improved during SGD steps for even further compactification and better nonlinear modeling. Hence, our online NP classifier is powerfully nonlinear and computationally highly efficient, with processing complexity linear in the number of data instances and negligible space complexity.
The main contribution of our work is that we are the first to propose a Neyman-Pearson (NP) classifier that is both online and nonlinear. As an important novel addition to the literature, our algorithm is appropriate for contemporary fast streaming large scale data applications that require real time processing with capabilities of complex nonlinear modeling and false positive rate controllability. In our extensive experiments, the introduced classifier yields significantly better results compared to the competing state-of-the-art NP techniques; either performance-wise (in terms of the detection power and false positive rate controllability) at a comparable computational and space complexity, or efficiency-wise (in terms of complexity) at a comparable performance. The presented study is also the first to design a neural network (as an SLFN) in the context of NP characterization of classifiers, which is expected to open up new directions into deeper architectures since the NP approach has been left surprisingly unexplored in deep learning.
In the following Section 2, we discuss state-of-the-art NP classification methods. We provide the problem description in Section 3, and then introduce our technique for online and nonlinear NP classification in Section 4. After the experimental evaluation is presented in Section 5, we conclude in Section 6.
2 Related Work
Neyman-Pearson classification has found widespread use across various applications due to the direct control over the false positive rate that it offers, cf. Tong et al. (2016) and the references therein. For example, an NP classifier is commonly employed for anomaly detection, where the false positive rate controllability is particularly important. In the one class formulation (due to the extreme rarity of anomalies) of anomaly detection Scott and Nowak (2006); Hero (2007); Zhao and Saligrama (2009); Saligrama and Zhao (2012), the NP classification turns out (when the anomalies are assumed uniformly distributed) to amount to estimating the minimum volume set (MVS) that covers a fraction of the nominal data equal to one minus the desired false positive rate. Then, an instance is anomalous if it is not in the MVS. A structural risk minimization approach is presented in Scott and Nowak (2006) for learning the MVS based on a class of sets generated by a dyadic tree partitioning. Geometric entropy minimization Hero (2007) and empirical scoring Zhao and Saligrama (2009) can also be used to estimate the MVS, both of which are based on nearest neighbor graphs. The scoring of Zhao and Saligrama (2009) is later extended to local anomaly detection in Saligrama and Zhao (2012) and to a new one class support vector machine (SVM) in
Chen et al. (2013). Although the algorithms in these examples, which rely on batch processing (i.e., they are not online), have decent theoretical performance guarantees, they do not scale to large scale data due to their prohibitive computational as well as space complexity, and hence they cannot be used in our scenario of fast streaming applications. Online extensions to the original batch one class SVM Schölkopf et al. (2001), which can be shown to provide an estimator of the MVS Vert and Vert (2006), have been proposed for distributed processing Miao et al. (2018) and wireless sensor networks Zhang et al. (2009). However, neither these online extensions nor the original one class SVM address the false positive rate controllability, as they require additional manual parameter tuning for that. In contrast, our proposed online NP classifier directly controls (without parameter tuning) the false positive rate and maximizes the detection power with nonlinear modeling capabilities. Furthermore, the NP formulation in the one class setting requires the knowledge of the target density (e.g., anomaly), which is often unknown and thus typically assumed to be uniform; but then the problem can be turned into a supervised binary NP classification by simply sampling from the assumed target density. On the other hand, when there is also data from the target class, the one class formulation in the aforementioned studies does not directly address how to incorporate the target data. Hence, our two class supervised formulation of binary NP classification also covers the solution of the one class classification, and our proposed algorithm is consequently more general and applicable in both cases of target data availability.
Among the two class binary NP classification studies (cf. Tong et al. 
(2016) for a survey), plug-in approaches (such as Streit (1990) and Tong (2013)) based on density estimation as an application of the NP lemma Scott and Nowak (2005) are difficult to apply in high dimensions due to overfitting Hero (2007). In particular, Streit (1990) exploits the expectation-maximization algorithm for density estimation using a neural network, however with batch processing and manual tuning for finding the threshold that satisfies the NP type I error constraint. In Jarabo-Amores et al. (2013), a neural network is trained with symmetric error costs for modeling the likelihood ratio, which is thresholded to match the desired false positive rate, but determining the threshold requires additional work. Moreover, the approach of thresholding after training with symmetric error costs (cf. Tong et al. (2016) for other examples in addition to Jarabo-Amores et al. (2013)) does not yield NP optimality, since NP classification requires training with asymmetric error costs corresponding to the desired false positive rate. Unlike our presented work, the approaches in Streit (1990); Tong (2013); Jarabo-Amores et al. (2013) are also not online and do not allow real time false positive rate controllability. Recall that NP classification is equivalent to cost sensitive learning Davenport et al. (2010) when the desired false positive rate can be accurately translated to error costs, but achieving an accurate translation, i.e., correspondence, is typically nontrivial and requires special attention Davenport et al. (2010); Kong et al. (2019). This correspondence problem is addressed i) in Davenport et al. (2010) as parameter tuning with improved error estimations, and ii) in Kong et al. (2019) as an optimization with the assumption of class priors and unlabeled data. Besides the exploitation of SVM Davenport et al. (2010), other classifiers such as logistic regression
Cox (1958) have also been considered in Tong et al. (2018) and incorporated into a unifying NP framework as an umbrella algorithm. We emphasize that these approaches, the SVM based tuning approach Davenport et al. (2010) and the risk minimization of Scott and Nowak (2005) as well as the umbrella algorithm Tong et al. (2018) in addition to the optimization of Kong et al. (2019), do not satisfy our computational online processing requirements, as they are batch techniques and do not scale to large scale data.
In most of the contemporary fast streaming data applications, such as computer vision based surveillance Lu et al. (2013) and time series analysis Ozkan et al. (2015a), computationally efficient processing along with only limited space needs is a crucial design requirement. This is necessary for scalability in such applications, which constantly generate voluminous data at unprecedented rates. However, the literature about Neyman-Pearson classification (cf. Tong et al. (2016) for the current state) appears to be fairly limited from this large scale efficient processing point of view. Among the very few examples, a linear-time algorithm for learning a scoring function and thresholding is presented in Zhang et al. (2018), which is still not an online algorithm (i.e., it is not designed to process data indefinitely on the fly) since batch processing is assumed with large space complexity and processing latency. Moreover, the scoring of Zhang et al. (2018) is similar to the one of Zhao and Saligrama (2009) but, unlike Zhao and Saligrama (2009), trades off NP optimality for linear-time processing. Also, the technique of Zhang et al. (2018) is restricted to linearly separable data only, and it requires adjusting the threshold for false positive rate controllability, which can be impractical. The NP technique of Ozkan et al. (2015a) is truly online (and one class) but it is strongly restricted to Markov sources, and thus fails in the case of general non-Markov data (whereas our proposed algorithm has no such restriction). Another online NP classifier is presented in Gasso et al. (2011) without strict assumptions, unlike Ozkan et al. (2015a), but only for linearly separable data, while leaving the online generalization to the nonlinear setting as a future research direction.
To the best of our knowledge, online NP classification has not yet been studied in the nonlinear setting. Thus, for the first time in the literature, we solve the online and nonlinear NP classification problem based on a kernel inspired SLFN within the nonconvex Lagrangian optimization framework of Gasso et al. (2011); Uzawa (1958), and use SGD updates for scalability. Our NP classifier exploits Fourier features Rahimi and Recht (2008) and sinusoidal activations in the hidden layer of the SLFN (hence the name kernel inspired) to achieve powerful nonlinear modeling with high computational efficiency and online real time processing capability.
Random Fourier features (RFFs) and also kernels in general have been successfully used for classification and regression of large scale data (please refer to Rahimi and Recht (2008); Lu et al. (2016); Porikli and Ozkan (2011) and Wang et al. (2012) for examples). Our presented work also exploits RFFs (during SLFN initialization) for large scale learning but, in contrast, for the completely different goal of solving the problem of online nonlinear Neyman-Pearson (NP) classification with neural networks in a nonconvex Lagrangian optimization framework. Furthermore, the presented work learns the useful Fourier features with SGD updates beyond the initial randomness. On the other hand, kernels and RFFs have been previously studied in conjunction with neural networks. For example, computational relations from certain kernels to large networks are drawn in Cho and Saul (2009), and a kernel approximating convolutional neural network is proposed in Mairal et al. (2014) for visual recognition. In particular, RFFs have been used to learn deep Gaussian processes Cutajar et al. (2017), and for hybridization in deep models to connect linear layers nonlinearly Mehrkanoon and Suykens (2018). A radial basis function (rbf) network with batch processing, i.e., not online, is proposed in Casasent and Chen (2003), which briefly discusses a heuristic of varying the rbf parameters to manually control the false positive rate. Note that our SLFN is not an rbf network since we explicitly construct (during initialization) the kernel space in the hidden layer without a further need for kernel evaluations. We stress that the hidden layer of our SLFN for NP classification is the same as the RFF layer of Xie et al. (2019) for kernel learning (a simultaneous development of the same layer). The RFF layer in Xie et al. (2019) is proposed as a building block of deep architectures for the goal of kernel learning. However, our goal of designing an online nonlinear NP classifier is completely different. Hence, our formulation, network objective and the resulting training process as well as our algorithm and experimental demonstration in this paper are fundamentally different compared to Xie et al. (2019). Moreover, online processing is not a focus in these studies, except that Mairal et al. (2014) and Cutajar et al. (2017) address scalability to voluminous data; and none of those (including Xie et al. (2019) for kernel learning, and Mairal et al. (2014) and Cutajar et al. (2017) for scalability) consider our goal of NP classification.
3 Problem Description
Notation: In this paper, all vectors are column vectors and they are denoted by boldface lower case letters. For a vector , its transpose is represented by and the time index is given as subscript, i.e., . Also, a) is the indicator function returning if its argument condition holds, and returning , otherwise; and b) is the sign function returning if its argument is positive, and returning , otherwise.
Neyman-Pearson (NP) classification Tong et al. (2016) seeks a classifier for a dimensional observation to choose one of the two classes as , where (nontarget: , target: ) is the true class label and are the corresponding conditional probability density functions. The goal is to minimize the type II error (nondetection) rate
(1) 
(thus, the detection power is maximized) while upper bounding the type I error (false positive) rate by a user specified threshold as
(2) 
with being the corresponding expectations. Namely, is an NP classifier, if it satisfies . It is well known that by the NP lemma Scott and Nowak (2005), the likelihood ratio provides an NP test, i.e.,
(3) 
where the offset is chosen to satisfy the false positive rate constraint. Hence, finding the discriminant function is sufficient for NP testing.
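As a minimal illustration of the NP lemma when both class densities are known, the following sketch thresholds the log-likelihood ratio (a monotone transform of the likelihood ratio, so the test is unchanged) of two univariate Gaussians. The means, the variance, the sample sizes, the value of `tau` and all function names are assumptions made for illustration only, and the offset is estimated here from nontarget samples rather than computed in closed form.

```python
import numpy as np


def log_likelihood_ratio(x, mean_neg=0.0, mean_pos=2.0, sigma=1.0):
    """log p_1(x) - log p_{-1}(x) for two Gaussians sharing one variance (assumed toy densities)."""
    return ((x - mean_neg) ** 2 - (x - mean_pos) ** 2) / (2.0 * sigma ** 2)


def np_test(x, offset):
    """Decide target (+1) when the log-likelihood ratio exceeds the offset, else nontarget (-1)."""
    return np.where(log_likelihood_ratio(x) > offset, 1, -1)


rng = np.random.default_rng(0)
tau = 0.05  # assumed desired false positive rate

# Choose the offset as the (1 - tau) quantile of the ratio under the nontarget class,
# so that roughly a tau fraction of nontarget instances is declared target.
calibration_neg = rng.normal(0.0, 1.0, 50_000)
offset = np.quantile(log_likelihood_ratio(calibration_neg), 1.0 - tau)

# Evaluate on fresh samples from both classes.
fpr = np.mean(np_test(rng.normal(0.0, 1.0, 50_000), offset) == 1)
power = np.mean(np_test(rng.normal(2.0, 1.0, 50_000), offset) == 1)
print(f"false positive rate: {fpr:.3f}, detection power: {power:.3f}")
```

In this toy setting the discriminant is linear in the observation, which matches the equal-covariance Gaussian example discussed in the text; in realistic scenarios the densities are unknown and the discriminant has to be learned from data.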
The discriminant function can be simplified in many cases, and it might be linear or nonlinear as a function of after full simplification. We provide two corresponding examples in the following. For instance, if the conditional densities are both Gaussian with the same covariance, then the discriminant is linear. On the other hand, in the example of one class classification Schölkopf et al. (2001) with applications to anomaly detection, there is typically no data from the target (anomaly) hypothesis because of the extreme rarity of anomalies, and there is also not much prior information due to the unpredictable nature of anomalies. Hence, the usual approach is to assume that the target density is uniform (with a finite support) Zhao and Saligrama (2009), i.e., . Then, the critical region for the NP test to decide nontarget, i.e., , is known as the minimum volume set (MVS) Scott and Nowak (2006) covering fraction of the nontarget instances, i.e., is set with simplification such that . Consequently, the MVS has the minimum volume with respect to the uniform target density and hence maximizes the detection power. Here, the MVS discriminant (after simplification) is generally nonlinear, for instance, even when is Gaussian with zero mean and unit-diagonal covariance. Therefore, we emphasize that the discriminant of the NP test might be arbitrarily nonlinear in general. (Note that knowing the continuous valued discriminant is equivalent to knowing the discrete valued test due to one-to-one correspondence, i.e., . Hence, in the rest of the paper, we refer to the discriminant as the NP classifier as well.) Furthermore, since the discriminant definition requires the knowledge of the conditional densities, which are unavailable in most realistic scenarios, the discriminant is unknown. 
For this reason, NP classification refers to the data driven statistical learning of an approximation of the unknown discriminant based on two given classes of data , where is an appropriate set of functions which is sufficiently powerful to model the complexity of .
As a result, the data driven statistical learning of the NP classifier is obtained as the output of the following NP optimization:
(4) 
empirically estimates the type I (expectation in (2)) and type II (expectation in (1)) errors, respectively. For example, Gasso et al. (2011) studies the optimization in (4) for the set of linear discriminants, in which case, however, the resulting linear NP classifier is largely suboptimal in most realistic scenarios; for example, the MVS estimation for anomaly detection requires learning nonlinear class separation boundaries with a nonlinear discriminant.
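To make the need for a nonlinear discriminant concrete, the sketch below estimates the minimum volume set for a nominal class that is assumed to be a two-dimensional standard Gaussian. In that special case the density level sets are circles, so the MVS is a disk and the discriminant is quadratic in the observation. The dimension, sample size, `tau` and the function name are illustrative assumptions, and the radius is simply an empirical quantile.

```python
import numpy as np


def mvs_discriminant(X, radius_sq):
    """Positive outside the disk (declared anomalous), nonpositive inside (declared nominal)."""
    return np.sum(X ** 2, axis=1) - radius_sq


rng = np.random.default_rng(0)
tau, d, n = 0.05, 2, 20_000  # assumed false positive rate, dimension and sample size

# Fit: the squared radius is the (1 - tau) quantile of squared norms of nominal data.
nominal_train = rng.normal(size=(n, d))
radius_sq = np.quantile(np.sum(nominal_train ** 2, axis=1), 1.0 - tau)

# Check on fresh nominal data that about a tau fraction falls outside the set.
nominal_test = rng.normal(size=(n, d))
fpr = np.mean(mvs_discriminant(nominal_test, radius_sq) > 0)
print(f"squared radius: {radius_sq:.3f}, empirical false positive rate: {fpr:.3f}")
```

No single hyperplane can enclose a bounded region, so a linear discriminant cannot represent this set, which is the point made about linear NP classifiers above.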
Our goal in the presented work is to develop, for the first time in the literature to the best of our knowledge, an online nonlinear NP classifier for any given user-specified desired false positive rate with real time processing capability. In particular, we use a kernel inspired single hidden layer feed forward neural network (SLFN), cf. Fig. 1, to model the set of nonlinear candidate discriminant functions in (4) as
(5) 
where and are the hidden and output layer parameters, and and are the nonlinear hidden and identity output layer activations. We sequentially learn the SLFN parameters based on the NP objective (that is, maximizing the detection power at a user-specified false positive rate as given in (4)) with stochastic gradient descent (SGD) to obtain the nonlinear classification boundary, i.e., to estimate the unknown discriminant , in an online manner while maintaining scalability to voluminous data.
The data processing in our proposed algorithm is computationally highly efficient and truly online, with computational complexity linear in the total number of processed instances and constant space complexity. Namely, we sequentially observe the data indefinitely without knowing a horizon, and decide about its label as if the SLFN at time provides , and as , otherwise. Then, we update our model , i.e., update the SLFN at time , to obtain based on the error via SGD and discard the observed data, i.e., and , without storing. Hence, each instance is processed only once. In this processing framework, models the NP discriminant in (3). As a result of this processing efficiency, our algorithm is appropriate for large scale data applications.
4 SLFN for Online Nonlinear NP Classification
In order to learn nonlinear Neyman-Pearson classification boundaries, we use a single hidden layer feed forward neural network (SLFN), illustrated in Fig. 1, that is designed based on the kernel approach to nonlinear modeling (cf. Rahimi and Recht (2008) and the references therein for the mentioned kernel approach). Namely, the hidden layer is randomly initialized to explicitly transform the observation space (via ) into a high dimensional kernel space with sinusoidal hidden layer activations by using the random Fourier features Rahimi and Recht (2008). We use a certain variant of the perceptron algorithm Rosenblatt (1958) as the output layer with identity activation followed by a sigmoid loss. Based on this SLFN, we sequentially (in a truly online manner) learn the network parameters, i.e., the classifier parameters as well as the kernel mapping parameters , through SGD in accordance with the NP optimization objective (4).
In the hidden layer of the SLFN, the randomized initial transformation at time ,
(6) 
is constructed based on the fact (as provided in Rahimi and Recht (2008)) that any continuous, symmetric and shift invariant kernel can be approximated as with an appropriately randomized kernel feature mapping. Note that the kernel provides implicit access to the targeted high dimensional kernel space as it encodes the targeted inner products. This kernel space is explicitly and approximately constructed by the sinusoidal hidden layer activations of the SLFN, in which the new inner products across activations approximate the originally targeted inner products. Hence, linear techniques applied to the sinusoidal hidden layer activations can learn nonlinear models. In our method, we use the radial basis function (rbf) kernel with the bandwidth parameter (that is inversely related to the actual bandwidth). (We use the rbf kernel in this study as an example but it is not required. Thus, the presented technique can be straightforwardly extended to any symmetric and shift invariant kernel satisfying Bochner’s theorem, cf. Rahimi and Recht (2008).) Then,
(7) 
(due to Bochner’s theorem as provided in Rahimi and Recht (2008)) where the Fourier feature is
(8) 
and is sampled from the dimensional multivariate Gaussian distribution (which is the Fourier transform of the kernel at hand) with being the corresponding expectation. Hence, by replacing the expectation in (7) with the independent and identically distributed (i.i.d.) sample mean of the ensemble of size , we define our kernel mapping as
(9) 
which can be directly implemented in the hidden layer of the SLFN, cf. Fig. 1, along with the sinusoidal activation due to the definition of .
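The random Fourier feature construction described above can be checked numerically with a short sketch. It assumes the rbf kernel is parameterized as exp(-g * ||x - y||^2), whose Fourier transform is a zero-mean Gaussian with covariance 2g times the identity, and it stacks all cosine features before all sine features (a reordering of the odd/even layout, which leaves inner products unchanged). The names `rff_map`, `A`, `g` and `D`, and their values, are assumptions for illustration.

```python
import numpy as np


def rff_map(X, A):
    """Map rows of X to 2*D random Fourier features; A holds one frequency vector per row."""
    D = A.shape[0]
    Z = X @ A.T  # projections onto the random frequencies, shape (n, D)
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(D)


rng = np.random.default_rng(0)
d, D, g = 5, 2000, 0.5  # input dimension, number of frequencies, assumed bandwidth parameter

# Frequencies sampled from the Fourier transform of the assumed rbf kernel: N(0, 2g I).
A = rng.normal(0.0, np.sqrt(2.0 * g), size=(D, d))

X = rng.normal(size=(50, d))
Phi = rff_map(X, A)

# Inner products of the features approximate the exact kernel matrix.
K_approx = Phi @ Phi.T
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-g * sq_dists)
max_err = np.max(np.abs(K_approx - K_exact))
print(f"largest absolute kernel approximation error: {max_err:.4f}")
```

Increasing `D` tightens the approximation, so the choice of `D` trades accuracy of the kernel space against the width of the hidden layer.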
Note that keeps all the hidden layer parameters at time as a matrix of size consisting of ’s corresponding to the hidden units, i.e., . The hidden layer activation is sinusoidal: and for the odd and even indexed hidden nodes, respectively, due to the definition in (8). At time , is randomly initialized with an appropriate of the rbf kernel so that the SLFN starts with approximately constructing the high dimensional kernel space in its hidden layer, and in relation to (6), . Note that of the rbf kernel readily provides powerful nonlinear modeling to the SLFN even if the hidden layer is kept untrained. Thanks to this excellent network initialization, we achieve an expedited process of learning from data. Moreover, in the course of our sequential processing, the SLFN continuously updates and improves the hidden layer, i.e., kernel mapping, parameters as . Therefore, we actually optimize a nonlinear NP classifier in the larger space (as our optimization is not restricted to of the random initialization, cf. the definition of in (5)) for greater nonlinear modeling capability compared to the rbf kernel.
The SLFN in Fig. 1 that we use for online and nonlinear NP classification is compact in principle since the required number of hidden nodes is relatively small. The reason is that the convergence of the sample mean of the i.i.d. ensemble of size to the true mean is exponentially fast with the order of by Hoeffding’s inequality Rahimi and Recht (2008)
. On the other hand, since random Fourier features are independent of data, further compactification is possible by eliminating irrelevant, i.e., useless, Fourier features in a data driven manner, cf. the examples of feature selection in Porikli and Ozkan (2011) and the Nyström method in Yang et al. (2012) for this purpose. In contrast, and alternatively, we distill useful Fourier features in the hidden layer activations as a result of the sequential learning of the kernel mapping parameters, i.e., , via SGD. Hence, nodes of the SLFN are dedicated to only useful Fourier features, and thus we achieve a further network compactification by reducing the necessary number of hidden nodes as well as reducing the parameter complexity. Then, one can expect to better fight overfitting with great nonlinear modeling power and NP classification performance. This compactification also significantly reduces the computational as well as space complexity of our SLFN based classifier, which together with the SGD optimization yields scalability to voluminous data. Consequently, the proposed online NP classifier is computationally highly efficient and appropriate for real time processing in large scale data applications.
Remark: We obtain a sequence of kernel mapping parameters in the course of data processing. This means that at the end of processing instances, one can potentially construct a new nonisotropic rbf kernel by estimating the multivariate density of the collection (here, we assume that is large and the density is multivariate Gaussian. If it is not Gaussian, then one can straightforwardly incorporate a Gaussianity measure into the overall network objective) and then finding the corresponding nonisotropic rbf kernel by taking the inverse Fourier transform of the estimated density. Therefore, our algorithm is also kernel-adaptive since it essentially learns a new kernel (and also improves the previous one) at each SGD learning step. This kernel adaptation ability can be improved. For instance, one can start with a random mapping as described and estimate the density of the mapping parameters after convergence, and then restart with new samples from the converged density. Multiple iterations of this process may yield better kernel adaptation (but rerunning would hinder online processing and amount to batch processing, hence it is out of the scope of the present work), which we consider as future work.
In the output layer of the SLFN, we use a certain variant of perceptron Rosenblatt (1958) with the identity activation, i.e., . Then, the classification model is defined linearly after the hidden layer kernel inspired transformation as , where is the normal vector to the linear separator and is the bias. Thus, the decision of the SLFN is .
Regarding the overall network objective for sequential learning of the network parameters and solving the NP optimization in (4) to obtain our SLFN based online nonlinear NP classifier, we next formulate the NP objective similar to Gasso et al. (2011) as
(10) 
where the first term is the regularizer for which we use the magnitude of the classifier parameters in the output layer, i.e., , and is the regularization weight. For differentiability, the nondetection and false positive error rates are estimated based on data until time as
(11) 
with , (set cardinality) and . Note that another appropriate function can be used here to obtain a differentiable surrogate for the errors in (4) for estimating the error rates. However, our results in the rest of this paper are based on the sigmoid loss .
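The role of the sigmoid as a differentiable surrogate for the 0/1 errors can be illustrated with a small sketch that computes both the hard error rates and their smooth counterparts from discriminant scores. The labels are assumed to be +1 (target) and -1 (nontarget), the decision is assumed to be target when the score is positive, and the function names and toy values are illustrative assumptions.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def error_rates(scores, labels):
    """Return (hard miss rate, hard false positive rate, smooth miss rate, smooth false positive rate)."""
    pos = labels == 1
    neg = labels == -1
    hard_nd = np.mean(scores[pos] <= 0)  # targets that are not detected
    hard_fa = np.mean(scores[neg] > 0)   # nontargets declared target
    soft_nd = np.mean(sigmoid(-scores[pos]))  # close to 1 for a confidently missed target
    soft_fa = np.mean(sigmoid(scores[neg]))   # close to 1 for a confident false alarm
    return hard_nd, hard_fa, soft_nd, soft_fa


scores = np.array([4.0, 3.0, -2.0, -3.0, -1.0, 2.0])
labels = np.array([1, 1, 1, -1, -1, -1])
hard_nd, hard_fa, soft_nd, soft_fa = error_rates(scores, labels)
print(hard_nd, hard_fa, round(soft_nd, 4), round(soft_fa, 4))
```

Unlike the hard rates, the smooth estimates change continuously with the scores, which is what allows gradient based updates of the network parameters.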
For sequential optimization of the NP objective in (10), we next define the following Lagrangian
(12) 
where is the user-specified desired false positive rate and is the corresponding Lagrange multiplier.
Since the saddle points of (12) correspond to the local minima of (10), cf. Gasso et al. (2011) and Uzawa (1958) for the details, we apply the Uzawa approach Uzawa (1958) to search for the saddle points of (12) and learn our parameters in the online setting with SGD updates. To be more precise, we follow the optimization framework of Gasso et al. (2011) and solve the optimization via an iterative approach with gradient steps, where one iteration minimizes for a fixed and the other maximizes for a fixed . Note that the fixed minimization is a regularized weighted error minimization, where the ratio of the type I error rate cost to the one of type II error rate is . Hence, the unknown Lagrange multiplier defines (up to a scaling with the prior probabilities) the asymmetrical error costs that correspond to the false positive rate constraint in (4). On the other hand, the gradient ascent updates in the fixed maximization determine the unknown multiplier so that the type I error cost is decreased (increased) if the error estimate is below (above) the tolerable rate, in favor of detection power (true negative detection). This provides an iterative learning of the correspondence between the asymmetrical error costs and the NP constraint.
To this end, inserting the definitions in (11) into (12) with the regularization yields the overall SLFN objective as follows
(13) 
where and if , and , otherwise.
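The alternating scheme described above, a descent step on the primal variables for a fixed multiplier followed by an ascent step on the multiplier for fixed primal variables, can be seen in isolation on a toy constrained problem. The problem below (minimize x^2 subject to 1 - x <= 0), the step sizes and the function name are assumptions chosen only because the saddle point is known in closed form (x = 1 with multiplier 2); it is not the network objective of the text.

```python
def uzawa_toy(eta=0.1, beta=0.1, steps=5000):
    """Gradient descent-ascent on the Lagrangian L(x, gamma) = x**2 + gamma * (1 - x)."""
    x, gamma = 0.0, 0.0
    for _ in range(steps):
        # Primal step: descend on L with respect to x while gamma is held fixed.
        x -= eta * (2.0 * x - gamma)
        # Dual step: ascend on L with respect to gamma while x is held fixed.
        # The multiplier grows while the constraint 1 - x <= 0 is violated
        # and is kept nonnegative.
        gamma = max(0.0, gamma + beta * (1.0 - x))
    return x, gamma


x_star, gamma_star = uzawa_toy()
print(x_star, gamma_star)
```

In the NP setting the constraint violation is the gap between the estimated false positive rate and the tolerable rate, so the same dual step raises the cost of false positives when the rate is too high and lowers it otherwise.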
In order to learn the SLFN parameters for obtaining the proposed online nonlinear NP classifier via the NP optimization explained above, we use stochastic gradient descent (SGD) to sequentially optimize the overall network objective defined in (13). These network parameters are 1) , to project input to the higher dimensional kernel space, 2) and , which are the perceptron parameters of the output layer to classify the projected input , and 3) , to learn the correspondence between the error costs and the NP constraint.
Suppose at the beginning of time , we have an existing model learned with the past data as well as the error costs corresponding to ; and a little later, we observe the instance . SGD based optimization takes steps to update and to obtain and with respect to the partial derivatives of the instantaneous objective . Namely, based on the partial derivatives of the instantaneous objective defined in (13), the SGD updates for the SLFN parameters can be computed as , , , and , where is the learning rate and is named the Uzawa gain Uzawa (1958), controlling the learning rate of the Lagrange multiplier. Using the sigmoid yields the partial derivatives with as
(14)  
(15)  
(16)  
which can be straightforwardly incorporated into the backpropagation.
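As a rough illustration of the alternating descent and ascent steps described above, the following Python sketch applies one primal-dual SGD step to a plain linear scorer rather than to the SLFN. The per-instance objective used here (a sigmoid-smoothed miss for positive instances, and a multiplier-weighted sigmoid-smoothed false alarm minus the target rate for negative instances) is an assumption standing in for (13), and the names `primal_dual_step`, `mu`, `beta` and `tau` are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def primal_dual_step(w, b, lam, x, y, tau, mu, beta):
    """One SGD descent step on (w, b) and one ascent step on lam.

    Assumed instantaneous objective (a stand-in, not the paper's (13)):
      y == +1: sigmoid(-f)              soft miss
      y == -1: lam * (sigmoid(f) - tau) soft false alarm minus target rate
    where f = w . x + b is a linear score.
    """
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    if y == 1:
        s = sigmoid(-f)
        grad_f = -s * (1.0 - s)      # derivative of sigmoid(-f) w.r.t. f
        grad_lam = 0.0               # positives do not touch the constraint
    else:
        s = sigmoid(f)
        grad_f = lam * s * (1.0 - s)  # derivative of lam * sigmoid(f) w.r.t. f
        grad_lam = s - tau            # constraint violation estimate
    w_new = [wi - mu * grad_f * xi for wi, xi in zip(w, x)]
    b_new = b - mu * grad_f
    # Ascent on the multiplier, kept non-negative.
    lam_new = max(0.0, lam + beta * grad_lam)
    return w_new, b_new, lam_new
```

In this sketch, a negative instance scored above the target rate raises the multiplier (making false alarms more costly), while a positive instance only moves the scorer toward detection and leaves the multiplier unchanged.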
In our experiments, we obtain an empirical false positive rate estimate based on a sliding window that keeps the errors for a few hundred of the past negative data instances, and use the following update instead of the aforementioned stochastic one:
(17) 
which has been observed to yield a more stable and robust performance. Note that this update follows directly from (12) and does not hinder real-time online processing, since a past window of positive decisions requires almost no additional space (only bits in the case of, for instance, storing binary decisions for past negative instances).
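A windowed estimate of this kind can be sketched in a few lines of Python. The additive ascent form used below for the multiplier is an assumption (the exact form of (17) is not reproduced here), and the class name `WindowedFprTracker` as well as the parameters `window`, `tau` and `beta` are hypothetical.

```python
from collections import deque


class WindowedFprTracker:
    """Sliding-window false positive rate estimate with a multiplier update.

    Assumed update (may differ from the paper's (17)):
        lam <- max(0, lam + beta * (fpr_hat - tau))
    """

    def __init__(self, window, tau, beta, lam=1.0):
        # A bounded deque drops the oldest decision automatically.
        self.decisions = deque(maxlen=window)
        self.tau = tau
        self.beta = beta
        self.lam = lam

    def update(self, predicted_positive):
        """Call only for instances whose true label is negative."""
        self.decisions.append(1 if predicted_positive else 0)
        fpr_hat = sum(self.decisions) / len(self.decisions)
        self.lam = max(0.0, self.lam + self.beta * (fpr_hat - self.tau))
        return fpr_hat
```

Only one binary decision per stored negative instance is kept, which matches the remark above that the window adds almost no memory overhead.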
Based on the derivations above, we sequentially update the SLFN at each time in a truly online manner with (here, : total number of processed data instances) computational and space complexity in accordance with the NP objective. Hence, we construct our method called “NPNN” in Algorithm 1, which can be used in real time for online nonlinear Neyman-Pearson classification. We refer to Section 5 of our experimental study for all the details about the input parameters and initializations.
Remark: Recall that the goal in NP classification is to achieve the minimum miss rate (maximum detection power) while upper bounding the false positive rate (FPR) by a user-specified threshold . Therefore, both aspects (minimum miss rate and its FPR constraint) of this goal should be considered in evaluating the performance of NP classifiers. The NPscore of Scott (2007); Davenport et al. (2010) is defined as
(18) 
where is the NP model to be evaluated and controls the relative weights of the miss rate (with weight ) and its FPR constraint (with weight if the desired rate is exceeded, and with weight otherwise). Namely, controls the hardness of the NP FPR constraint, and a smaller NPscore indicates a better NP classifier. By enforcing a strict hard constraint on FPR with a very large , one can immediately reject models (while evaluating various models) that violate the FPR constraint with even a slight positive deviation from the desired FPR (a negative deviation does not violate it). However, even though the original NP formulation requires a hard constraint, we consider that it is not appropriate to use a hard constraint in practice, as also extensively explained in Scott (2007), for the following two reasons: (1) An NP classifier is typically learned using a set of observations, and that set is itself a random sample from the underlying density of the data. Hence, the estimated FPR of the model is also a random quantity, which is merely an estimator of the unknown true FPR . Note that the true FPR is actually the one to be strictly constrained, but it is unavailable. Thus, it is unreliable to enforce a strict hard constraint (with a very large ) on the random estimator , and a relatively soft constraint has more practical value by allowing a small positive deviation from the desired FPR . (2) Also, one might be willing to exchange true negatives in favor of detections with a small positive deviation from the desired FPR , when the gain is larger than the loss as the NPscore improves. Consequently, for parameter selections with cross validation in our algorithm design as well as for performance evaluations in our experiments, we opt for a relatively soft constraint and use in accordance with the recommendation of Scott (2007).
This choice allows a relatively small positive deviation from the desired FPR, and normalizes the deviation by measuring it in a relative percentage manner. For example, the positive deviations and both degrade the score equally by when the desired rates are and , respectively. Various other NP studies in the literature also allow, in practice, small positive deviations from the desired FPR . For instance, we observe such a deviation in Gasso et al. (2011) with both their own algorithm and the compared algorithms Davenport et al. (2010) in the case of the spambase dataset, in Zhang et al. (2018) with their own algorithm in the case of the heart and breast cancer datasets, and finally in Kong et al. (2019) with one of the compared algorithms Du Plessis et al. (2015) in all datasets.
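The relative normalization can be made concrete with a small Python sketch. It assumes the commonly used form of the measure from Scott (2007), namely kappa * max(FPR - tau, 0) + miss rate, with kappa defaulting to 1 / tau; the function name `np_score` and the numbers in the usage note are illustrative assumptions, not values taken from the experiments.

```python
def np_score(fpr, miss_rate, tau, kappa=None):
    """NP-score sketch: penalized FPR excess plus miss rate (lower is better).

    Assumed form: kappa * max(fpr - tau, 0) + miss_rate, where kappa
    defaults to 1 / tau so that the FPR excess is measured relative to tau.
    """
    if kappa is None:
        kappa = 1.0 / tau
    return kappa * max(fpr - tau, 0.0) + miss_rate
```

Under this assumed form, an FPR of 0.11 at a target of 0.1 and an FPR of 0.22 at a target of 0.2 are penalized equally (both are 10% relative excesses), whereas an FPR below the target adds no penalty and leaves only the miss rate.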
A comprehensive experimental evaluation of our proposed technique is provided next, based on real as well as synthetic datasets, in comparison to state-of-the-art competing methods.
5 Experiments
We present extensive comparisons of the proposed kernel inspired SLFN for online nonlinear Neyman-Pearson classification (NPNN), described in Algorithm 1, with different state-of-the-art NP classifiers. The compared techniques are online linear NP (OLNP) Gasso et al. (2011), as well as logistic regression (NPROCLOG) Cox (1958) and support vector machines with rbf kernel (NPROCSVM) Cortes and Vapnik (1995) in the NP framework of the umbrella algorithm described in Tong et al. (2018). Among these, OLNP (linear NP classification) is an online technique with computational complexity, whereas NPROCLOG (linear NP classification) and NPROCSVM (nonlinear NP classification) are batch techniques with at least computational complexity, where is the number of processed instances. In contrast, we emphasize that, to the best of our knowledge, the proposed NP classifier NPNN is the first in the literature that is both nonlinear and online, with computational and negligible space complexity, resulting in real-time nonlinear NP modeling and false positive rate controllability. Consequently, the proposed NPNN is appropriate for challenging fast streaming data applications.
We conduct experiments based on various real and synthetic datasets Dua and Graff (2017); Chang and Lin (2011) from several fields such as bioinformatics and computer vision, each of which is normalized by either unit-norm (each instance is divided by its magnitude) or z-score (each feature is brought to zero mean and unit variance) normalization before processing. For each dataset, the smaller class is designated as the positive (target) class. The details of the datasets are provided in Table 1, where the starred ones and unstarred ones are normalized with unit norm and z-score, respectively. For performance evaluations, we generate random permutations of each dataset, and each random permutation is split into two as training () and test () sequences. We strongly emphasize that the processing in the proposed algorithm NPNN is truly online, meaning that there are no separate training and test phases. However, since NPROCLOG and NPROCSVM are batch algorithms requiring a separate training, we opt to use training/test splits in this first set of experiments for a fair, statistically unbiased and robust performance comparison. Such a split is in fact not needed in practice in the case of the proposed NPNN, which by design processes data on the fly. Additional experiments based on two larger scale datasets to demonstrate the ideal use case (i.e. online processing without separate training/test phases) of the proposed algorithm NPNN are presented in Fig. 4.
The rbf kernel bandwidth parameter (for the proposed NPNN as well as NPROCSVM), the error cost parameter (for NPROCSVM) and the number of hidden nodes (for the SLFN in the proposed NPNN) are all fold cross-validated (based on NPscore) for each random permutation using the corresponding training sequence by a grid search with , and , where is the data dimension. As the regularization has been observed to help little, we opt to use along with SGD learning updates and , randomly initialized and (around ) and for both the proposed NPNN and OLNP uniformly in all of our experiments. We directly use the code provided by the authors Tong et al. (2018) for NPROCLOG and NPROCSVM and also optimize it by the aforementioned cross validation in terms of parameter selection. We observe that for the datasets of relatively short length, algorithms using SGD optimization, i.e., OLNP and NPNN, improve with multiple passes over the training sequence. Hence, the length of the training sequence of each random permutation is increased by concatenation with additional randomizations for only OLNP and NPNN (not for NPROCLOG and NPROCSVM) during training of both the cross-validation and actual training, resulting in an epoch-by-epoch training procedure. This concatenation is only for training purposes, and hence it is not used in testing and validation, i.e., the actual data size is used in all types of testing to avoid statistical bias and multiple counting. Our proposed algorithm NPNN does not need such a concatenation approach for data augmentation in the targeted fast streaming data applications (cf. Fig. 4), where data is already abundant and scarcity is not an issue.
We run all the algorithms on the test sequence of each of the random permutations (after training on the corresponding training sequences), and record in each case the achieved false positive rate, i.e., FPR, and true detection rate, i.e., TPR, for the target false positive rates (TFPR) . For performance evaluation, we compare the mean area under curve (AUC) of the resulting receiver operating characteristics (ROC) curves of TFPR vs TPR, as well as the mean of the resulting NPscores Scott (2007), cf. (18) with . Note that the mean AUC (higher is better) accounts only for the resulting detection power without regard to false positive rate tractability, whereas the mean NPscore (lower is better) provides an overall combined measure. We evaluate with both (Table 1) in addition to visual presentation of the mean ROC curves (Fig. 2) of FPR and TPR. Table 1 additionally reports the mean TPRs and mean FPRs. We also provide the decision boundaries and the mean convergence of the achieved false positive rate during training for the visually presentable dimensional Banana dataset (Fig. 3). All of our results are provided with the corresponding standard deviations.
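The two normalizations used for the datasets in this section (unit norm per instance, z-score per feature) can be sketched in plain Python as follows. The function names `unit_norm` and `z_score` are hypothetical, the population standard deviation is an assumption, and mapping zero-norm instances and constant features to zero is a choice made here for safety rather than something stated in the text.

```python
import math


def unit_norm(rows):
    """Divide each instance (row) by its Euclidean magnitude."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        # Assumption: an all-zero instance is left as zeros.
        out.append([v / norm if norm > 0 else 0.0 for v in row])
    return out


def z_score(rows):
    """Bring each feature (column) to zero mean and unit variance."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # Population standard deviation (divides by n); an assumption.
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(d)
    ]
    return [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(d)]
        for r in rows
    ]
```

In a streaming setting the feature means and variances would have to be estimated from an initial portion of the stream or tracked incrementally; the batch form above only mirrors the preprocessing described for the fixed datasets.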
We exclude the results of NPROCLOG from Table 1 (we keep NPROCSVM instead, since it generally performs better than NPROCLOG) due to the page limitation, as the table gets too wide otherwise. The results of NPROCLOG can be found in Fig. 2. Based on our detailed analysis presented in Table 1 along with the visualization with ROC curves in Fig. 2, we first conclude that, in general, the algorithms NPROCSVM and the proposed NPNN with powerful nonlinear classification capabilities significantly outperform the linear algorithms OLNP and NPROCLOG in terms of both AUC and NPscore. Hence, the proposed NPNN and NPROCSVM better address the need for modeling complex decision boundaries in contemporary applications. This significant performance difference in favor of the nonlinear algorithms NPROCSVM and the proposed NPNN is much clearer (especially in terms of the AUC) in highly nonlinear datasets such as Banana, Spiral, Iris, SVMguide1 and Fourclass, as shown in Fig. 2. In the case of a small size dataset that seems linear or less nonlinear (e.g., Bupaliver), although OLNP and NPROCLOG are both linear by design, thus target this dataset with the right complexity and are expected to be less affected by overfitting, the proposed NPNN competes well with both and even slightly outperforms them in terms of AUC (while staying comparable in terms of NPscore). We consider that this is most probably due to the successful compactification of the SLFN in the proposed NPNN, which reduces the parameter complexity by learning the Fourier features in the hidden layer.
As for the comparison between the nonlinear algorithms NPROCSVM and the proposed NPNN, we first strongly emphasize that NPROCSVM has computational complexity (in the worst case of full number of support vectors) between and in training and in test, where the space complexity is . On the other hand, the proposed NPNN is truly online without separate training or test phases, which only requires computational and negligible space complexity. Hence, NPROCSVM cannot be applied in our targeted large scale data processing applications due to its prohibitive complexity; nevertheless, we opt to include it in our experiments to set a baseline that is achievable by batch processing. According to the numeric results in Table 1 and the ROC curves in Fig. 2, we first observe that NPROCSVM and the proposed NPNN perform comparably in terms of the AUC, hence our technique (thanks to its computationally highly efficient implementation) can be used in large scale applications (where NPROCSVM computationally fails) without a loss in classification performance. In addition, our algorithm NPNN outperforms NPROCSVM in terms of AUC in datasets; and for small target false positive rate (), the proposed NPNN has higher TPR compared to NPROCSVM in datasets. On the other hand, comparing in terms of the NPscore, NPROCSVM performs better as a result of enhanced false positive rate controllability due to batch processing. However, this advantage of NPROCSVM over the proposed NPNN seems to disappear or decrease as the data size (relative to the dimension) and/or the desired false positive rate increases as observed in the cases of, for instance, Banana and Codrna datasets. Therefore, we expect no loss (compared to NPROCSVM) with the proposed NPNN in terms of false positive rate controllability as well, when data size increases as in the targeted scenario of the big data applications where NPROCSVM cannot be used. 
Indeed, we observe a decent nonlinear classification performance and false positive rate controllability with the proposed NPNN on, for example, the Banana dataset ( instances in only dimensions), as clearly visualized in Fig. 3, which shows the false positive rate convergence as well as the nonlinear decision boundaries for various desired false positive rates. Lastly, NPROCSVM seems to fail when TFPR requires only a few mistakes in the nontarget class. In this case, NPROCSVM picks zero mistakes, resulting in zero TPR and a poor NPscore in return. In contrast, the proposed NPNN successfully handles such situations as demonstrated by, for instance, TFPR= on the Iris dataset in Table 1.
Our experiments in Table 1, Fig. 2 and Fig. 3 include comparisons of the proposed NPNN with certain batch processing techniques (NPROCSVM and NPROCLOG). Hence, we utilize separate training and test phases, along with multiple passes over training sequences (due to small sized datasets in certain cases such as Iris), in those experiments for statistical fairness. However, we emphasize that in the targeted scenario of large scale data applications: 1) one can only use computationally scalable online algorithms (such as the proposed NPNN and OLNP), 2) multiple passes are not necessary as the data is abundant, and 3) one can target even smaller false positive rates such as and . Therefore, to better address this scenario of large scale data streaming conditions, we conduct additional experiments to compare the online methods (OLNP and the proposed NPNN) when processing large datasets (after z-score normalization and random permutations) on the fly, without separate training and test phases and based on just a single pass: covertype ( instances in dimensions) and Codrna ( instances in dimensions; this is the original full scale, of which we previously used a relatively small subset in Table 1 for testing the batch algorithms). We run for and present the resulting TPR and FPR at each time (in a time-accumulated manner after averaging over random permutations) in Fig. 4. Parameters are set with manual inspection based on a small fraction of the data in the beginning of the stream.
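One plausible reading of the time-accumulated TPR and FPR for a single pass is a running ratio over all instances seen so far, as in the Python sketch below. The function name `running_rates`, the label convention (+1 for target, -1 for nontarget) and the choice to report 0.0 before the first instance of a class arrives are assumptions made for illustration; averaging over random permutations is left out.

```python
def running_rates(labels, decisions):
    """Time-accumulated TPR and FPR over a single pass of a stream.

    labels:    true labels, +1 (target) or -1 (nontarget)
    decisions: 1 if the classifier declared a detection, else 0
    Returns two lists with the running TPR and FPR after each instance.
    """
    tp = fp = pos = neg = 0
    tpr_curve, fpr_curve = [], []
    for y, d in zip(labels, decisions):
        if y == 1:
            pos += 1
            tp += d
        else:
            neg += 1
            fp += d
        # Assumption: a rate is reported as 0.0 until its class is observed.
        tpr_curve.append(tp / pos if pos else 0.0)
        fpr_curve.append(fp / neg if neg else 0.0)
    return tpr_curve, fpr_curve
```

In a truly online run, each decision would be recorded before the model is updated with the corresponding label, so that the curves reflect predictions on unseen data.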
Although the false positive rate constraint is set harder (i.e. smaller as ) in this experiment (compared to the smallest TFPR value in Table 1), both techniques (OLNP and the proposed NPNN) successfully converge (the proposed NPNN appears to converge slightly better) to the target rate (FPR TFPR) uniformly in all cases. Therefore, both techniques promise decent (almost perfect) false positive rate controllability when the data is sufficient. On the other hand, the proposed NPNN strongly outperforms OLNP in terms of the TPR (again uniformly in all cases), which demonstrates the gain due to nonlinear modeling in the proposed NPNN. In terms of the NPscore, the proposed NPNN again strongly outperforms OLNP (except in one case, where we observe comparable results). We finally emphasize that the proposed NPNN achieves this high performance while processing data on the fly in an extremely efficient manner, both computationally and space-wise, in contrast to batch techniques, which fail in large scale streaming applications due to complexity, and linear techniques, which fail due to insufficient modeling power.
6 Conclusion
We considered binary classification with particular regard to i) a user defined constraint on the type I error (false positive) rate that requires false positive rate (FPR) controllability, ii) nonlinear modeling of complex decision boundaries, and iii) computational scalability to voluminous data with online processing. To this end, we propose a computationally highly efficient online algorithm to determine the pair of asymmetrical type I and type II error costs to satisfy the FPR constraint and solve the resulting cost sensitive nonlinear classification problem in the nonconvex sequential optimization framework of neural networks. The proposed algorithm is essentially a Neyman-Pearson classifier, which is based on a single hidden layer feedforward neural network (SLFN) with decent nonlinear classification capability thanks to its kernel inspired hidden layer. The SLFN that we use for Neyman-Pearson classification is compact in principle for two reasons. First, the hidden layer exploits during initialization the exponential convergence of the inner products of random Fourier features to the true kernel value with sinusoidal activation. Second, learning of the hidden layer parameters, i.e., Fourier features, helps to improve the randomly initialized Fourier features. Consequently, the required number of hidden nodes, i.e., the required number of network parameters and Fourier features, can be chosen relatively small. This reduces the parameter complexity and thus mitigates overfitting while significantly reducing the computational as well as space complexity. Then the output layer follows as a perceptron with identity activation. We sequentially learn the SLFN parameters through stochastic gradient descent based on a Lagrangian nonconvex optimization toward the goal of Neyman-Pearson classification. This procedure minimizes the type II error rate subject to the user-specified type I error rate, while producing classification decisions at run time.
Overall, the proposed algorithm is truly online and appropriate for contemporary fast streaming data applications with real-time processing and FPR controllability requirements. Our online algorithm was experimentally observed to either outperform (in terms of the detection power and false positive rate controllability) the state-of-the-art competing techniques with a comparable processing and space complexity, or perform comparably with the batch processing techniques, i.e., not online, that are however computationally prohibitively complex and not scalable.
References
Online anomaly detection for long-term ecg monitoring using wearable devices. Pattern Recognition 88, pp. 482–492. Cited by: §1.
Radial basis function neural networks for nonlinear fisher discrimination and Neyman–Pearson classification. Neural Networks 16 (5-6), pp. 529–535. Cited by: §2.
LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2 (3), pp. 1–27. Cited by: §5.
A new one-class svm for anomaly detection. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3567–3571. Cited by: §2.
Kernel methods for deep learning. Advances in Neural Information Processing Systems, pp. 342–350. Cited by: §2.
Support-vector networks. Machine learning 20 (3), pp. 273–297. Cited by: §5.
The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological) 20 (2), pp. 215–232. Cited by: §2, §5.
Random feature expansions for deep gaussian processes. International Conference on Machine Learning, pp. 884–893. Cited by: §2.
Tuning support vector machines for minimax and Neyman-Pearson classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (10), pp. 1888–1898. Cited by: §1, §2, §4.
Convex formulation for learning from positive and unlabeled data. International Conference on Machine Learning, pp. 1386–1394. Cited by: §4.
UCI machine learning repository (2017). URL http://archive.ics.uci.edu/ml. Cited by: §5.
Batch and online learning algorithms for nonconvex Neyman-Pearson classification. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 1–19. Cited by: §2, §2, §3, §4, §4, §4, §5.
Chunk incremental learning for cost-sensitive hinge loss support vector machine. Pattern Recognition 83, pp. 196–208. Cited by: §1.
Geometric entropy minimization (gem) for anomaly detection and localization. Advances in Neural Information Processing Systems, pp. 585–592. Cited by: §2, §2.
Radar detection with the Neyman–Pearson criterion using supervised-learning-machines trained with the cross-entropy error. EURASIP Journal on Advances in Signal Processing 2013 (1), pp. 44. Cited by: §2.
Cost-sensitive bayesian network classifiers. Pattern Recognition Letters 45, pp. 211–216. Cited by: §1.
Valid oversampling schemes to handle imbalance. Pattern Recognition Letters 125, pp. 661–667. Cited by: §1.
False positive rate control for positive unlabeled learning. Neurocomputing 367, pp. 13–19. Cited by: §2, §4.
Abnormal event detection at 150 fps in matlab. IEEE International Conference on Computer Vision, pp. 2720–2727. Cited by: §2.
Large scale online kernel learning. Journal of Machine Learning Research 17 (1), pp. 1613–1655. Cited by: §2.
Convolutional kernel networks. Advances in Neural Information Processing Systems, pp. 2627–2635. Cited by: §2.
Deep hybrid neural-kernel networks using random fourier features. Neurocomputing 298, pp. 46–54. Cited by: §2.
Distributed online one-class support vector machine for anomaly detection over networks. IEEE Transactions on Cybernetics (99), pp. 1–14. Cited by: §2.
Online anomaly detection under markov statistics with controllable type-i error. IEEE Transactions on Signal Processing 64 (6), pp. 1435–1445. Cited by: §2.
Data imputation through the identification of local anomalies. IEEE Transactions on Neural Networks and Learning Systems 26 (10), pp. 2381–2395. Cited by: §1.
Data driven frequency mapping for computationally scalable object detection. IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 30–35. Cited by: §2, §4.
Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, pp. 1177–1184. Cited by: §1, §1, §2, §2, §4, §4, §4, footnote 3.
The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65 (6), pp. 386. Cited by: §4, §4.
Video anomaly detection based on local statistical aggregates. IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2112–2119. Cited by: §1.
Local anomaly detection. Artificial Intelligence and Statistics, pp. 969–983. Cited by: §2.
Estimating the support of a high-dimensional distribution. Neural Computation 13 (7), pp. 1443–1471. Cited by: §2, §3.
Learning minimum volume sets. Journal of Machine Learning Research 7, pp. 665–704. Cited by: §2, §3.
A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory 51 (11), pp. 3806–3819. Cited by: §2, §3.
Performance measures for Neyman–Pearson classification. IEEE Transactions on Information Theory 53 (8), pp. 2852–2863. Cited by: §4, §5.
A neural network for optimum Neyman-Pearson classification. IEEE International Joint Conference on Neural Networks, pp. 685–690. Cited by: §2.
Neyman-Pearson classification algorithms and np receiver operating characteristics. Science Advances 4 (2), pp. eaao1659. Cited by: §2, §5, §5.
A survey on Neyman-Pearson classification and suggestions for future research. Wiley Interdisciplinary Reviews: Computational Statistics 8 (2), pp. 64–81. Cited by: §1, §2, §2, §2, §3.
A plug-in approach to Neyman-Pearson classification. Journal of Machine Learning Research 14, pp. 3011–3040. Cited by: §2.
Iterative methods for concave programming. Studies in Linear and Nonlinear Programming 6, pp. 154–165. Cited by: §2, §4, §4.
Consistency and convergence rates of one-class svms and related algorithms. Journal of Machine Learning Research 7, pp. 817–854. Cited by: §2.
Breaking the curse of kernelization: budgeted stochastic gradient descent for large-scale svm training. Journal of Machine Learning Research 13 (Oct), pp. 3103–3131. Cited by: §2.
Deep kernel learning via random fourier features. External Links: 1910.02660 Cited by: §2.
Nyström method vs random fourier features: a theoretical and empirical comparison. Advances in Neural Information Processing Systems, pp. 476–484. Cited by: §4.
Tau-fpl: tolerance-constrained learning in linear time. Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2, §4.
Cost-sensitive dictionary learning for face recognition. Pattern Recognition 60, pp. 613–629. Cited by: §1.
KRNN: k rare-class nearest neighbour classification. Pattern Recognition 62, pp. 33–44. Cited by: §1.
Adaptive and online one-class support vector machine-based outlier detection techniques for wireless sensor networks. IEEE International Conference on Advanced Information Networking and Applications Workshops, pp. 990–995. Cited by: §2.
Anomaly detection with score functions based on nearest neighbor graphs. Advances in Neural Information Processing Systems, pp. 2250–2258. Cited by: §2, §2, §3.