Classification is a common task in many research areas, and a classification rule is conventionally built under a training/testing framework, where label information is essential at the training stage. Modern computing and communication technologies make data collection very efficient, but the resulting huge data sets are usually not gathered for any particular research purpose. For classification problems, this means that the label information essential for a particular problem of interest may not be readily available in a data set for training the classification model. In this situation, we need an extra labeling procedure to obtain enough labeled subjects for constructing such a model. Besides the size of the training set, the impact, or information content, of individual subjects on a classification model is usually not the same. Therefore, considering the labeling cost in time and capital, how to enlarge the labeled training set by selecting the most "informative" subjects to be examined and labeled first, in order to accelerate the learning process, is an important issue. Deng et al. (2009)
gave a typical example of this scenario, in which a bank customer data set is used to build a money laundering prediction model. They treated it as a binary classification problem, and the learning subjects are adaptively and sequentially selected without their true label information. This kind of procedure is called active learning in the machine learning literature, and many active learning procedures have been proposed for different learning goals; see, e.g., Lewis and Gale (1994); Osugi et al. (2005); Lughofer (2012); Rubens et al. (2016) and the references therein.
Because active learning procedures sequentially recruit new observations into the training set to continuously improve performance, they are naturally sequential procedures from a statistical viewpoint. If new subjects are recruited according to information obtained from the previous training subjects, then this is an adaptive sequential procedure, as in stochastic regression models, and in this situation the observations may no longer be independent. In the literature, depending on the learning target and the model/method used, many different observation-recruiting procedures have been studied (Cohn et al., 1996; Zhu et al., 2003; Bouneffouf, 2016). Here we consider a binary classification problem, under the active learning scenario described above, with a logistic model and adaptively selected observations. Collecting new observations is essential in both active learning procedures and sequential experimental design methods. A major difference between them is that in active learning applications a pool of unlabeled samples is already available, and it is unlikely that we can observe/collect new observations exactly at the design points suggested by the theory of experimental design. Instead, we use the information obtained from experimental design to find suitable subjects within the studied data set. In addition to subject selection, we select the effective variables under the proposed active learning setup.
Because we do not want to label and use the whole data set for training, the number of learning observations is of interest, in addition to the selection scheme, for active learning procedures. Here, the proposed procedure is equipped with a data-dependent stopping criterion such that it has a satisfactory performance when the learning procedure stops. The receiver operating characteristic (ROC) curve and its related indexes are popular measures of classification performance; in this study we adopt the area under the ROC curve (AUC) to measure classification performance and study whether our procedure achieves a satisfactory AUC under the proposed stopping criterion. For active learning procedures aiming at other performance measures, please refer to Cohn et al. (1996); Zhu et al. (2003); Bouneffouf (2016) and the references therein.
In the rest of this paper, we first introduce the adaptive shrinkage estimate (ASE) for logistic models under adaptive designs (Wang and Chang, 2013; Lu et al., 2015) and use the ideas there to sequentially select the effective variables. Based on this estimate, we study an adaptive subject selection method together with the proposed active learning procedure, and then conduct numerical studies to illustrate the proposed method. We also apply our method to several real data sets and present the results, followed by a summary. Theoretical proofs and technical details are given in the Appendix.
Under the active learning scenario described above, we want to build a binary classification rule when there are only limited labeled subjects to begin with, together with a pool of unlabeled data that we can use to enlarge the training set after labeling. In our procedure, we simultaneously select the most informative subjects for training and identify the effective variables of the classification model. We adopt the idea of the ASE in Wang and Chang (2013) to find the effective variables during the course of the active learning procedure; they studied sequential estimation of linear models with a variable selection feature, and Lu et al. (2015) later extended their results to generalized linear models. However, no specific observation/subject recruiting procedure was discussed in their papers. For searching for suitable observations from the existing unlabeled data set at the training stage, we use D-optimality from experimental design theory as a guideline and use uncertainty sampling to boost computational efficiency. To fix notation, we first briefly summarize the ASE of a logistic model under adaptive designs and then describe the details of the proposed active learning procedure.
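As a concrete, deliberately simplified illustration of this kind of procedure, the sketch below alternates between refitting a logistic model on the labeled subjects and recruiting the unlabeled subject whose predicted probability is closest to 1/2. The Newton fitting routine, the small ridge term, and all tuning constants are our own assumptions for this sketch, not part of the proposed method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=15):
    """MLE of a logistic model via Newton's method; the small ridge term on
    the Hessian is a numerical-stability convenience, not part of the model."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        W = p * (1 - p)
        H = X.T @ (X * W[:, None]) + 1e-2 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta

def active_learning_loop(X_pool, y_pool, n_init=20, n_rounds=10, seed=None):
    """Start from a few labeled subjects, then repeatedly pick the most
    uncertain unlabeled subject, reveal its label, and refit."""
    rng = np.random.default_rng(seed)
    n = len(y_pool)
    labeled = list(rng.choice(n, size=n_init, replace=False))
    unlabeled = [i for i in range(n) if i not in labeled]
    beta = np.zeros(X_pool.shape[1])
    for _ in range(n_rounds):
        beta = fit_logistic(X_pool[labeled], y_pool[labeled])
        p = sigmoid(X_pool[unlabeled] @ beta)
        # uncertainty sampling: predicted probability closest to 1/2
        j = unlabeled[int(np.argmin(np.abs(p - 0.5)))]
        labeled.append(j)   # label y_pool[j] is revealed only after selection
        unlabeled.remove(j)
    return beta, labeled
```

The ASE thresholding and D-optimality steps of the proposed procedure are layered on top of this basic loop.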
2.1 Variable selection and adaptive shrinkage estimate
Suppose that $(x_i, y_i)$, $i = 1, 2, \ldots$, are random observations, where $x_i$ is a covariate vector of length $p$ and $y_i \in \{0, 1\}$ is a binary response, and let $\mathcal{F}_n$ denote the $\sigma$-field generated by the observations up to stage $n$. Hence, for each $n$, the covariate $x_n$ is $\mathcal{F}_{n-1}$-measurable, as in a stochastic regression model (Lai and Wei, 1982). Suppose that the binary response variables $y_i$ satisfy a logistic model
$$P(y_i = 1 \mid x_i) = \frac{\exp(x_i^\top \beta_0)}{1 + \exp(x_i^\top \beta_0)}, \tag{1}$$
where $\beta_0$ is a vector of (unknown) regression parameters. Let $\hat{\beta}_n$ be a solution to the estimating equation
$$\sum_{i=1}^{n} x_i \left( y_i - \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)} \right) = 0. \tag{2}$$
That is, $\hat{\beta}_n$ is the maximum likelihood estimate (MLE) of $\beta_0$ in (1) with a sample of size $n$.
It follows that, under some moment conditions on the covariate vectors $x_i$, $\hat{\beta}_n \to \beta_0$ with probability one (see, for example, Chen et al., 1999). Moreover, as $n$ goes to infinity,
$$\Sigma_n^{1/2} \left( \hat{\beta}_n - \beta_0 \right) \to N(0, I_p) \ \text{in distribution}, \tag{3}$$
where $\Sigma_n = \sum_{i=1}^{n} \pi_i (1 - \pi_i) x_i x_i^\top$ with $\pi_i = P(y_i = 1 \mid x_i)$ is the observed Fisher information matrix, and $I_p$ is the $p \times p$ identity matrix.
Let $\lambda_{\max}(n)$ and $\lambda_{\min}(n)$ be the maximum and minimum eigenvalues of $\Sigma_n$, respectively. Let $\delta_n > 0$ be a non-random threshold sequence, depending on $n$ and a constant $c > 0$, that vanishes more slowly than the estimation error of $\hat{\beta}_n$. Then, by the asymptotic properties of $\hat{\beta}_n$, we have, with probability one, as $n \to \infty$,
$$I\left( |\hat{\beta}_{nj}| \geq \delta_n \right) \to I\left( \beta_{0j} \neq 0 \right), \quad j = 1, \ldots, p, \tag{4}$$
where $I(\cdot)$ is the indicator function. Thus, for a given $\delta_n$, Equation (4) suggests that we can use the indicator $I(|\hat{\beta}_{nj}| \geq \delta_n)$ to determine whether the $j$th component, $\beta_{0j}$, of $\beta_0$ is significantly apart from zero.
Suppose that the following Conditions (A1) and (A2) are satisfied:
(A1) The random errors form a martingale difference sequence with respect to an increasing sequence of $\sigma$-fields, with
(A2) The maximum and minimum eigenvalues, $\lambda_{\max}(n)$ and $\lambda_{\min}(n)$, satisfy, with probability one, that
Then we have the following theorem.
Suppose that the observations satisfy Conditions (A1) and (A2). Then for any small , almost surely as . In addition, is a strongly consistent estimate of and .
(The proof is given in the Appendix.)
From Theorem 2.1, we know that
Following this result, we define an ASE based on the MLE $\hat{\beta}_n$:
where $I_n$ is a diagonal matrix of 0/1 indicators. Then, by Theorem 2.1, for each $j$, the $j$th component of the estimate is shrunk to 0 if the corresponding indicator is 0; otherwise it remains unchanged. Thus, the estimate is a "shrunk" version of the MLE, and we will refer to it as an adaptive shrinkage estimate (ASE) of $\beta_0$.
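A minimal sketch of the ASE construction described above: the MLE is multiplied by a diagonal matrix of 0/1 indicators, so components below the threshold are shrunk exactly to zero while the others are left unchanged. The threshold form c * n**(-1/4) is an illustrative assumption; the paper's exact threshold sequence may differ.

```python
import numpy as np

def adaptive_shrinkage_estimate(beta_mle, n, c=1.0, gamma=0.25):
    """ASE sketch: components of the MLE whose magnitude falls below a
    vanishing threshold are set exactly to zero via a diagonal 0/1 matrix.
    The threshold c * n**(-gamma) is an assumed illustrative choice."""
    delta_n = c * n ** (-gamma)
    I_n = np.diag((np.abs(beta_mle) >= delta_n).astype(float))  # indicator matrix
    return I_n @ beta_mle
```

With n = 1000 and the default constants, coefficients smaller than roughly 0.18 in magnitude are zeroed out, while the larger ones pass through untouched.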
If the following Condition (A3):
(A3) There exists a non-random positive definite symmetric matrix and a continuously increasing function such that
where is a positive definite matrix.
is satisfied, then we have the asymptotic normality of the ASE below. (The proof will also be presented in the Appendix.)
Suppose the assumptions of Theorem 2.1 are satisfied. Then with probability one, (i) and (ii) as . (iii) If, in addition, Assumption (A3) is satisfied, then for any small ,
where is a diagonal matrix.
Condition (A3) is a regularity condition on the random design matrix. Theorem 2.2 (i) and (ii) mean that is a strongly consistent estimate of with a convergence rate approximately equal to . Theorem 2.2 (iii) indicates that if for some , then the limiting distribution of will eventually degenerate to 0 when is large. Moreover, if with some , then and have the same asymptotic distribution.
Let be a positive integer-valued random variable such that converges to 1 in probability as . If the conditions of Theorem 2.2 are satisfied, then with probability one, and as
Corollary 1 states that, under a sequential sampling strategy, if a random sample size satisfies the above assumption, then the asymptotic distribution of the ASE remains unchanged. Based on this property, we propose a stopping criterion such that the estimate satisfies a pre-specified precision when we stop recruiting new subjects for training. A brief proof of this corollary is given below.
Proof of Corollary 1:
Because the integer-valued random variable satisfies that converges to 1 in probability as , and we know that with probability one as from Theorem 2.2, it follows that with probability one. To prove the asymptotic distribution of , it is sufficient to show that is uniformly continuous in probability (ucip) (Woodroofe, 1982); in the current problem the proof follows the arguments of Anscombe (1952) (see also Woodroofe, 1982; Wang and Chang, 2013) and uses the Hájek–Rényi inequality for martingale differences (see Chow and Teicher, 1988, page 247). Hence, we omit the detailed arguments here.
3 Subject selection, variable determination and stopping criterion
Suppose that we already have subjects as a training set at the th stage; let be the vector of labels of these subjects and the corresponding matrix of covariates. Without loss of generality, we can rearrange the components of as such that the values of the corresponding 's of and are 1 and 0, respectively. Hence, the lengths of and become and . It follows from linear algebra that there exists an orthonormal matrix , depending on the samples up to the current stage, such that and .
Under this setup, we will describe our subject selection strategy and stopping criterion below. We adopt both the D-optimality from the theory of experimental design and the concepts of uncertainty sampling in our subject selection scheme, and will separately discuss them below.
3.1 Subject selection strategy
Let and be the active sample set (training data under the current stage) and unlabeled sample pool, respectively. For each , we compute
Ranking the set in decreasing order, we take the covariates with the largest values as the uncertainty set , where is a pre-specified constant and is the number of samples in .
Uncertainty sampling strategy
For each , we compute
where is a given target value. We select the covariate with the minimum value in , then delete it from and add it to , where is the observed response value for .
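The two-step selection above can be sketched as follows. The score formulas here are illustrative stand-ins, since the exact expressions appear in the paper's (unrendered) equations: the uncertainty step ranks unlabeled subjects by how close their predicted probability is to 1/2, and the D-optimality step picks, among the most uncertain ones, the subject that most increases the determinant of the weighted information matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def select_next_subject(X_labeled, beta_hat, X_pool, r=50):
    """Sketch of the two-step selection: (i) build the uncertainty set of the
    r unlabeled subjects with predicted probability closest to 1/2, then
    (ii) choose the one maximizing the one-step D-optimality update."""
    p_pool = sigmoid(X_pool @ beta_hat)
    top = np.argsort(np.abs(p_pool - 0.5))[:r]      # uncertainty set U
    w = sigmoid(X_labeled @ beta_hat)
    w = w * (1 - w)
    M = X_labeled.T @ (X_labeled * w[:, None])      # current information matrix
    best_i, best_det = -1, -np.inf
    for i in top:
        x = X_pool[i]
        wi = p_pool[i] * (1 - p_pool[i])
        d = np.linalg.det(M + wi * np.outer(x, x))  # one-step determinant update
        if d > best_det:
            best_i, best_det = int(i), d
    return best_i
```

Restricting the determinant evaluations to the r most uncertain subjects is exactly how the uncertainty step confines the search range, as discussed next.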
Although we use an experimental design criterion to find promising candidate subjects for building the classification model, there is still a major difference between experimental design and the subject selection of active learning. In the conventional design of experiments, one constructs the design points/locations first and then conducts actual observations at these particular design points. However, in most active learning scenarios, the subjects already exist and we only use the design criteria to find promising candidate subjects. Hence, when the data set is large, how to efficiently search for the most informative subjects among all of them is an important computational issue in active learning. Here we use the uncertainty sampling method to confine the search range, which helps shorten the search time.
3.2 Variable selection and stopping criterion
The relation between logistic-type classification functions and the AUC has been intensively discussed (see Eguchi and Copas, 2002). These results suggest applying sequential confidence set estimation methods to a logistic model-based active learning procedure. In contrast to conventional sequential estimation problems, we also merge a variable selection method into the proposed learning process to find the effective variables, which makes the concluding model more compact and therefore easier to interpret.
To define the stopping criterion and to identify the effective variables, we first partition the matrix as follows. Denote
where and . Partition the matrix according to the first components of such that
This implies, by simple matrix computation, that
Let denote a generalized inverse of . It follows that
where and is the sub-vector of corresponding to . Then Theorem 2.2 implies that, as ,
Suppose that is a pre-fixed constant. Let , and be the maximum eigenvalue of . Then for a given ,
Let be the estimate of based on the first observations as defined in Theorem 2.1. Because the true in (3.2) is unknown, we replace it with a strongly consistent estimate . Let be the set of the first observations, and for a given , let be a constant satisfying the conditional probability for a given . Thus, for a given set of observed samples , is a constant, and is a quantile of the chi-square distribution with degrees of freedom. Now, suppose is a set of subjects in the beginning, with . Then a stopping time is defined as follows:
where and are the two constants defined before. Equation (15) means that we stop recruiting new samples into the training set once the maximum eigenvalue satisfies the inequality in (15). Replacing the non-random sample size in (3.2) with the newly defined stopping time , we define . Similarly, we have
which is a sequential fixed size confidence ellipsoid for .
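A sketch of how such a stopping rule can be checked in code: stop once the longest axis of the approximate confidence ellipsoid for the effective coefficients is no greater than a prescribed d. The matrix used, the axis-length scaling, and the Wilson–Hilferty approximation to the chi-square quantile are all our own assumptions for this sketch, not the paper's exact criterion.

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile(p, k):
    """Wilson-Hilferty approximation to the chi-square quantile (stdlib only)."""
    z = NormalDist().inv_cdf(p)
    return k * (1.0 - 2.0 / (9.0 * k) + z * (2.0 / (9.0 * k)) ** 0.5) ** 3

def should_stop(info_matrix, k_hat, d, alpha=0.05):
    """Stop when the longest axis of the approximate confidence ellipsoid,
    taken here as sqrt(chi2_{1-alpha, k_hat} * lambda_max(V)) with V the
    inverse information matrix, is no greater than d (an assumed form)."""
    V = np.linalg.inv(info_matrix)
    lam_max = np.linalg.eigvalsh(V)[-1]  # largest eigenvalue of V
    a = chi2_quantile(1.0 - alpha, k_hat)
    return bool((a * lam_max) ** 0.5 <= d)
```

Because the information matrix grows with the number of recruited subjects, the ellipsoid shrinks as recruitment continues, so the rule eventually triggers for any fixed d > 0.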
The following theorem says that, using both uncertainty sampling and the D-optimal design method to sequentially select new training subjects, the proposed active learning procedure has the following properties. (The proof of Theorem 3.1 is given in the Appendix.)
Note that under the active learning scenarios described here, when we recruit new subjects to join our training set, we do not know their label information; labels are revealed only after selection, and we estimate the regression coefficient vector with the selected subjects only. Because we sequentially find the effective variables using the current training subjects, the degrees of freedom of the asymptotic distribution are also data-dependent, which makes the sequential estimation procedure here different from conventional ones.
It is clear from the definition that the stopping time goes to infinity as goes to 0. Theorem 3.1 (ii) and (iii) say that if goes to 0, then the coverage probability of asymptotically equals the nominated value and the expected sample size of the sequential procedure approaches the best (unknown) sample size. In Chow and Robbins (1965), these two properties are called asymptotic consistency and asymptotic efficiency of sequential estimation methods, respectively. In addition, Theorem 3.1 (iv) states that almost surely converges to the number of effective parameters under the proposed sequential procedure. Hence, Theorems 2.1 and 3.1 (iv) together imply that the effective variables are eventually identified. In practice, the choice of the constant depends on the application and many other factors. A smaller means that a more accurate/precise estimate is required, and hence a larger sample size is usually needed.
3.3 Stopping criterion and area under ROC curve
Eguchi and Copas (2002) showed that when a logistic model is correct, its classification function asymptotically attains the theoretical maximum AUC. Here, we study whether the proposed procedure achieves a satisfactory AUC under the proposed stopping criterion.
Let be the angle between and , and let AUC and AUC be the AUCs of with respect to and . Because the AUC is scale-invariant, to show that AUC converges to AUC it is sufficient to show that converges to 0. From the definition, goes to infinity as goes to 0 with probability one. Since Theorems 2.1 and 2.2 together imply that almost surely as goes to 0, they also imply that converges to 0 almost surely. Thus, we have the corollary below, which shows that the empirical AUC also reaches its theoretical optimum under the proposed procedure.
Let be the angle between and . Then, under the assumptions of Theorem 3.1, as goes to 0. In addition, AUC converges to AUC almost surely as goes to 0.
(The proof follows from simple algebra and is given in the Appendix.)
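The corollary rests on two elementary facts that are easy to verify numerically: the angle between two coefficient vectors is computed from their normalized inner product, and the empirical AUC of a linear score is unchanged by rescaling the coefficients. The helper names below are ours, for illustration.

```python
import numpy as np

def angle_between(b1, b2):
    """Angle (in radians) between two coefficient vectors."""
    c = (b1 @ b2) / (np.linalg.norm(b1) * np.linalg.norm(b2))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def empirical_auc(scores, labels):
    """Empirical AUC: the probability that a random positive outscores a
    random negative (ties count 1/2)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins) / (len(pos) * len(neg))
```

Because empirical_auc(c * scores, labels) equals empirical_auc(scores, labels) for any c > 0, a vanishing angle between the estimated and true coefficient vectors forces the corresponding AUCs to coincide in the limit, which is the content of the corollary.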
4 Numerical results
We report the numerical results of the proposed ASE-based active learning procedure using some synthesized data sets, and compare with the results obtained from MLE-based active learning methods. In addition, we use the credit card fraud detection and the MAGIC gamma telescope data sets obtained from the Internet for demonstration purposes. We will describe these two data sets later.
4.1 Synthesized Data
We generate synthesized data based on a logistic regression model stated in (1)
with the coefficients , and (i.e. ),
with covariate vectors generated from a multivariate normal distribution, , where is a identity matrix. In addition to the uncertainty sampling, an optional clustering algorithm, using only the covariate vectors, is used to partition the training data in order to reduce the search time.
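A sketch of this data-generating scheme, with placeholder coefficients since the exact values used in the simulations are not shown here:

```python
import numpy as np

def make_synthetic(n, beta, seed=None):
    """Draw covariates from a multivariate normal with identity covariance
    and generate binary responses from the logistic model (1)."""
    rng = np.random.default_rng(seed)
    p = len(beta)
    X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = (rng.random(n) < prob).astype(int)   # Bernoulli(prob) responses
    return X, y
```

Setting some components of beta to zero, as in the simulation setup, lets one check whether the procedure identifies the effective variables.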
The results in Tables 1 and 2 are based on 1000 simulations, and Table 3 also lists the estimated number of nonzero parameters for the proposed procedure.
The results of these two tables show that when the classification performances of the two methods are similar, the proposed procedure stops earlier (uses fewer training subjects), has a shorter computational time, and yields a more compact model (fewer variables used) than the MLE-based procedure.
In addition, the results show that when becomes small, the estimates of the regression parameters converge to their corresponding true values, and the standard deviations of these estimates also decrease.
Note that there is no variable selection in the MLE-based procedure; hence there is no estimate of in those cases.
4.2 Real Examples
Credit Card Fraud Detection Data set
The credit card fraud detection data set is an anonymized data set obtained from Kaggle (www.kaggle.com/host), a platform for data science competitions; please refer to their website for further details.
This data set consists of transactions that occurred over two days in September 2013 by European cardholders. There are 492 frauds out of 284,807 transactions; i.e., only 0.172% of all transactions are frauds. We refer to these fraudulent transactions as positive cases. Due to confidentiality issues, the original features and detailed background information are not offered, and only numerical variables resulting from a PCA transformation, with 28 principal components, are available. The feature "Amount" is the transaction amount, which is not included in the PCA transformation, and "Class" is the response variable, which takes value 1 in case of fraud and 0 otherwise. We apply our procedure to this data set, using the first 3 and last 2 components in our analysis. We select these 5 PCA components as covariates because we want to show both the classification performance and the variable identification ability. The variable names Var1, Var2 and Var3 denote the first 3 PCA components, and Var27 and Var28 denote the last two. In each simulation run, we randomly select 400 fraud cases and 1600 regular cases, so the total size of the training set is 2000; the testing set then consists of the remaining 282,807 cases, of which 92 are frauds. Based on this setup, we expect our method to successfully find the first three effective components and produce satisfactory prediction results as well. (See the top three plots of Figure 1.) Because the ratio of cases to non-cases is small, using the AUC as the performance measure is also recommended (see also the Kaggle website). Here, we summarize the averages of the accuracy and AUC based on 1000 runs, and we also report the variable selection results and their corresponding coefficient estimates.
MAGIC Gamma Telescope Data Set
We obtained the MAGIC Gamma Telescope data set from the UC Irvine Machine Learning Repository (archive.ics.uci.edu/ml). According to its description, this data set simulates the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique.
There are 10 continuous real-valued variables and a class variable, and the total sample size is 19,020. Among them, there are 12,332 signal (gamma) and 6688 background (hadron) samples; we refer to the subjects with gamma signals as positive cases and the other 6688 subjects as negative cases. We conduct a PCA transformation similar to the previous example and use the first 4 and last 2 PCA components as our covariates. Hence, our procedure should only select the first 4 variables. (See the bottom three plots of Figure 1.) For each simulation run, we randomly select 20% of the subjects as our training set. In order to keep the positive-to-negative ratio, there are 2466 positive subjects and 1338 negative subjects in the training set. We repeat this scheme for 1000 runs. Our goal is to statistically discriminate the images caused by primary gammas (signal) from the images of hadronic showers initiated by cosmic rays in the upper atmosphere (background). A detailed explanation of this data set and its physical background can be found in the UC Irvine Machine Learning Repository and the original owner's website (www.magic.mppmu.mpg.de).
Table 3 shows the average number of training subjects (stopping time) used and the prediction performance (both accuracy and AUC) when we apply the ASE-based active learning procedure to these two data sets over 1000 runs with different values of . Figure 1 shows box-plots of the coefficient estimates based on 1000 runs for each case separately. For the credit card fraud data set, the first 3 variables (PCA components) are successfully identified; especially in the case of , the estimate of the intercept term of the model reflects the imbalanced ratio of the sizes of the cases to the controls.
Note that in the simulation study we already know that the required training set size of the MLE-based active learning procedure is 2 to 4 times that of the corresponding ASE-based method (see Table 1). Because the MLE-based method would take too much time, we only apply the ASE-based method to these two real data sets.
|Data set|d|Stopping time N (SD)|Computing time (SD)|Accuracy (SD)|AUC (SD)|
|Credit Card|0.5|1446.18 (208.90)|18018.49 (5981.34)|0.9801 (0.0027)|0.8378 (0.0227)|
|Credit Card|0.6|1024.39 (215.67)|11696.80 (4571.01)|0.9797 (0.0040)|0.8378 (0.0229)|
|Credit Card|0.7|740.81 (262.26)|7552.82 (4281.15)|0.9780 (0.0068)|0.8384 (0.0231)|
|Magic Gamma|0.5|719.48 (347.07)|2699.02 (2622.26)|0.7994 (0.0241)|0.6462 (0.0304)|
|Magic Gamma|0.6|441.54 (246.15)|1824.43 (2021.63)|0.8107 (0.0241)|0.6391 (0.0335)|
|Magic Gamma|0.7|293.53 (158.71)|697.18 (888.25)|0.8188 (0.0210)|0.6322 (0.0351)|
In this paper, we propose a procedure for building binary classification models when the complete label information is not available in the beginning of the training stage. To this end, we apply the idea of active learning procedure to a logistic model such that the proposed procedure can simultaneously select the most informative subjects for training and find the effective variables for this classification model.
In an active learning procedure, we continuously select new observations according to some predefined selection criteria until a predetermined stopping criterion is fulfilled. Here we use the information obtained from analyzing the current observations to select the most informative observations from a given data set in order to shorten the learning course; this selection scheme is different from conventional sequential analysis. We adopt an adaptive selection procedure in the proposed algorithm and treat the resulting dependent observations with a logistic model with adaptive covariates.
Because we use a parametric logistic model in our active learning procedure, we are able to use the methods of experimental design to find the most informative subjects in the given data set. The criterion used here is like that of conventional sequential optimal experimental design; however, in an active learning process we just use this criterion to search for promising observations in the existing data set. Thus, the key step is how to search for the most informative subjects in the given data set. We do not have to know exactly where the optimal design points are, as in traditional optimal experimental design problems. Instead, we select the observations that are "close" enough to the theoretical optimal design points.
We take advantage of stochastic approximation and optimal design, and aim to search for the next unlabeled data point(s) in a huge pool of subjects with the aid of uncertainty sampling. The uncertainty sampling strategy is a vague concept, and here we only use this idea to confine the search range for new samples in order to shorten the search time. The asymptotic results presented here only assume that the selected observations satisfy some regularity conditions and do not depend on any specific design criterion; hence, other selection criteria can also be used in our procedure. Here, we use the D-optimality criterion and the uncertainty sampling strategy together; depending on the learning target, other indexes can replace this criterion. Moreover, the relation between optimal experimental design and the uncertainty sampling method is not studied here, which will be an interesting future research problem.
Proof of Theorem 2.1: Let . First, we consider a non-zero component for some , . For any , we have
Since by definition of , we have
If for some , , then
In addition, we also have , almost surely as (Chen et al., 1999). Therefore,
Due to , we show
By the definition of , we know that
Since for all , it follows from the Dominated Convergence Theorem that . Hence, the proof of Theorem 2.1 is completed. ∎
Proof of Theorem 2.2:
By the definition of , it is easy to show that for any given ,
Hence, using the triangle inequality, we have
Proof of Theorem 3.1:
Under D-optimality, we choose the new samples such that the determinant of the information matrix is maximized, which implies that the minimum eigenvalue is of order . The uncertainty sampling step selects one sample from the candidate set chosen by D-optimality. It implies that is larger than 0 for all . Therefore, the proof of Theorem 3.1 follows arguments similar to those of Theorem 8 in Wang and Chang (2013) and Lemma 3.3 and Theorems 3.1 and 3.2 in Chang (2001). Hence, the details are omitted here.
Proof of Corollary 2:
From the definition of the stopping time, we know that goes to infinity as goes to 0 with probability one. It has been shown that almost surely, which implies that eventually converges to as . Similarly, it is known that as . Moreover, by definition, the maximum axis of is no greater than . Let denote the estimated direction . Then, by simple vector algebra, it can be shown that, as , with probability no less than ,
It is clear that if goes to 0, then