1 Introduction
Learning a supervised model with good predictive performance requires a sufficiently large amount of labeled data. In many real-world applications, unlabeled data is abundant and easily obtained, whereas acquiring class labels is time-consuming, costly, or requires expert knowledge; e.g., medical image analysis requires pathology experts. Additionally, labeling all data is often redundant, as some examples do not add further information to the already labeled ones. An active learner aims to learn a model that achieves good performance using as few training examples as possible Settles2009 . Active learning has been shown to be effective in a variety of domains, including text categorization Tong2001 kapoor2007active and medical image analysis hoi2006batch .
One common setting for active learning is pool-based active learning, where a large number of unlabeled examples together with a small number of labeled examples are initially available Settles2009 . The learner interacts with an oracle (e.g., a human expert) that provides labels when queried. At each step, the active learner chooses example(s) intelligently from the unlabeled pool and requests the labels of these queries from the oracle. Next, the training data is augmented with the newly labeled data, and the classifier is retrained. This iterative procedure is repeated until a stopping criterion (e.g., a budget constraint or a desired accuracy) is met. In the sequential mode, the active learner solicits the label for a single instance Lewis94 . In cases where the training process is hard and/or multiple annotators are available who could work in parallel, retraining the learner at each step would be inefficient. In batch mode active learning, the learner requests the labels of a set of examples at once at each step. In both sequential and batch mode active learning, the critical step is the proper selection of the example(s) whose labels are probed.
In this work, we focus on binary classification in a supervised learning setting for the pool-based learning scenario. We propose a new criterion, based on statistical leverage scores, to assess the representativeness of an example in the pool. We develop two novel active learning algorithms, which we abbreviate as ALEVS and DBALEVS. ALEVS (Active Learning by Statistical Leverage Scores) is a sequential active learning algorithm that queries one example at a time. Diverse Batch-mode Active Learning by Statistical Leverage Sampling (DBALEVS) is a batch mode algorithm that selects a batch of examples that are not only influential in their class but also diverse with respect to the already labeled examples and to the other examples in the batch. We achieve this by encoding these properties in a set function that is submodular and monotonically nondecreasing; therefore, we can utilize a greedy submodular maximization algorithm that is provably near-optimal.
2 Problem Setup and Notation
To explore the use of leverage scores for querying examples, we focus on binary classification in the pool-based active learning scenario, where a small number of labeled examples are provided along with a large pool of unlabeled examples. The objective is to learn an accurate classifier that maps the instance space to the set of class labels. The training data consist of examples, each described by a feature vector and a class label. We propose solutions for the sequential and the batch mode settings based on leverage scores.
In the sequential learning scenario, the active learner iteratively selects one example from the unlabeled pool and queries its label. Upon receiving the labeling request for the queried example, the labeling oracle responds with its true label. We assume uniform cost for labeling across examples. At each iteration, we maintain a set of labeled training examples and a set of unlabeled examples; the labeled set comprises pairs of feature vectors and labels, whereas the unlabeled set includes only feature vectors. The objective is to attain an accurate classifier while minimizing the number of examples queried, thereby reducing the labeling cost. In batch mode active learning, instead of selecting a single example at each iteration, a batch of examples is picked, where the batch size is specified a priori by the user. The notation used throughout the paper is provided in Appendix A.1, Table 7.
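The pool-based protocol described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-centroid classifier, the `random_query` strategy, and all function names are hypothetical stand-ins, and any query strategy with the same signature could be plugged in.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    # Hypothetical stand-in classifier: one centroid per class.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, X):
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes], axis=1)
    return np.array(classes)[np.argmin(dists, axis=1)]

def random_query(model, X, unlabeled, rng):
    # Baseline strategy: pick one unlabeled pool index uniformly at random.
    return int(rng.choice(sorted(unlabeled)))

def active_learning_loop(X, oracle, labeled, budget, query=random_query, seed=0):
    """Sequential pool-based loop: query, label, augment, retrain."""
    rng = np.random.default_rng(seed)
    labeled = dict(labeled)                      # pool index -> label
    unlabeled = set(range(len(X))) - set(labeled)

    def fit():
        idx = sorted(labeled)
        return nearest_centroid_fit(X[idx], np.array([labeled[i] for i in idx]))

    model = fit()
    for _ in range(budget):                      # stopping criterion: label budget
        if not unlabeled:
            break
        q = query(model, X, unlabeled, rng)      # choose an example from the pool
        labeled[q] = oracle(q)                   # the oracle returns the true label
        unlabeled.remove(q)
        model = fit()                            # retrain on the augmented set
    return model, labeled

# Toy pool: two well-separated Gaussian blobs with labels -1 and +1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)), rng.normal(2, 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)
model, labeled = active_learning_loop(X, lambda i: int(y[i]), {0: -1, 20: 1}, budget=5)
```

A batch mode variant would differ only in that the query step returns several indices before the classifier is retrained.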
The rest of this article is organized as follows. In the following section, we briefly review the related work. In Section 4, we introduce statistical leverage scores, discuss their use in the literature, and present the idea of querying based on statistical leverage scores. Section 5 presents ALEVS, an active learning method that queries a single example at each iteration, while Section 6 presents the batch mode active learning algorithm DBALEVS. Experimental results are reported in Sections 7 and 8. Lastly, Section 9 concludes the paper.
3 Related Work
For selecting a query, there are two main approaches proposed in the literature Settles2009 . The first one is to query examples based on their informativeness. Uncertainty sampling, in which the learner queries the example with the most uncertain class label, is one of the most widely used such methods Lewis94 . The uncertainty can be assessed by the distance to the decision boundary Tong2001 , through label entropy Lewis94 , or by the disagreement of an ensemble of classifiers trained with the current label set seung1992query ; Freund1997 . One common drawback of these algorithms, particularly at early iterations, is that the classifier is uncertain about many points and the decision boundary it forms is not reliable, as it relies on the limited set of examples available for training roy2001 . Furthermore, these approaches introduce a sampling bias and fail to exploit the unlabeled data distribution dasgupta2011two . Others choose the most informative example as the one that minimizes the model variance mackay1992information . To assess the informativeness of an example, yu:icml06expdesign extends classical experimental design to active learning and aims at finding examples that will lead to the best predictions.
The second set of approaches selects instances that are representative of the data distribution xu2003representative ; nguyen2004active ; xu2007incorporating ; dasgupta2008hierarchical . The success of these algorithms depends heavily on the clustering algorithm employed. There are also hybrid approaches that combine the informative and representative strategies in a single framework. settles2008analysis uses a weighting strategy in which the informativeness of an example is weighted by its similarity to the other points. A similar method, using density and entropy, is applied to a text classification problem zhu:2008 . QUIRE quire optimizes an objective function wherein both the informativeness and the representativeness of the examples are considered simultaneously.
Adapting single query selection to the batch mode setting by simply choosing the highest-ranked examples to fill the batch does not account for the fact that there can be redundant information among the selected examples. Several batch mode active learning strategies have been proposed xu2007incorporating ; zhu:2008 ; guo2008discriminative ; hoi2009batch ; guo2010active ; chattopadhyay2013batch ; gu2014batch ; yang2015multi ; wang2015querying . Among these, there are methods that directly optimize an objective function that represents a good-quality batch. The method introduced in zhu:2008 selects the top examples according to an objective function that combines density and entropy. Guo and Schuurmans guo2008discriminative select a batch of examples that achieves the best discriminative classification performance. For assembling a good batch, Guo guo2010active selects instances that maximize the mutual information between labeled and unlabeled examples. In yang2015multi , the most uncertain and representative queries are selected by minimizing the empirical risk. In the batch mode setting, selecting a diverse set of examples is critical. Brinker et al. brinker2003incorporating select a diverse batch using SVMs, where the diversity is measured as the angle between the hyperplane induced by the currently selected point and the hyperplanes induced by the previously selected points. hoi2006batch proposes a framework that minimizes the Fisher information and solves this optimization problem using the submodular properties of the set selection function. Chen and Krause chen2013near similarly employ submodular optimization, and their approach asks for the batch with the maximum marginal gain.
4 Statistical Leverage Scores and Our Motivation
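Statistical leverage scores of a matrix are the squared row norms of the matrix containing its (top) left singular vectors. For a symmetric positive semidefinite (SPSD) matrix, the statistical leverage scores relative to the best rank-$k$ approximation to the input matrix are defined as follows gittens2013 :
Definition 1 (Leverage scores for an SPSD matrix)
Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be an arbitrary SPSD matrix with the eigenvalue decomposition $\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{U}^{T}$. $\mathbf{U}$ can be partitioned as $\mathbf{U} = (\mathbf{U}_1 \;\; \mathbf{U}_2)$, where $\mathbf{U}_1$ comprises $k$ orthonormal columns spanning the top $k$-dimensional eigenspace of $\mathbf{A}$. Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$ be the eigenvalues of $\mathbf{A}$ ranked in descending order. Given $\mathbf{A}$ and a rank parameter $k$, the statistical leverage scores of $\mathbf{A}$ relative to the best rank-$k$ approximation to $\mathbf{A}$ are equal to the squared Euclidean norms of the rows of the $n \times k$ matrix $\mathbf{U}_1$:
$$\ell_j = \|(\mathbf{U}_1)_j\|_2^2 \tag{1}$$
for $j \in \{1, \dots, n\}$, where $\ell_j \in [0, 1]$ and $\sum_{j=1}^{n} \ell_j = k$.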
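Definition 1 can be illustrated with a few lines of NumPy. This is a minimal sketch, assuming a dense symmetric input and a full eigendecomposition via `numpy.linalg.eigh`; the function name `leverage_scores` and the toy matrix are illustrative only.

```python
import numpy as np

def leverage_scores(A, k):
    """Leverage scores of a symmetric PSD matrix A relative to its best rank-k approximation."""
    A = np.asarray(A, dtype=float)
    # eigh is intended for symmetric matrices and returns orthonormal eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(A)
    top = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    U1 = eigvecs[:, top]                   # n x k matrix with orthonormal columns
    return np.sum(U1 ** 2, axis=1)         # squared Euclidean norm of each row

# Toy SPSD matrix: a linear kernel on 6 random points in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = X @ X.T
scores = leverage_scores(K, k=2)
```

Because the columns of the eigenvector matrix are orthonormal, the scores lie in the unit interval and sum to the rank parameter, which serves as a quick sanity check on any implementation.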
Intuitively, leverage scores determine which columns (or rows) are most representative with respect to a low-rank subspace of the matrix. They have recently been used in low-rank matrix approximation algorithms to identify influential columns of the input matrix dkm2006 ; gittens2013 ; papailiopoulos2014provable ; mahoney2009cur ; yang2015 ; wang2015provably ; wang2015empirical . Mahoney et al. dkm2006 ; yang2015 show that, in a low-rank matrix approximation task, column subset selection is improved if the columns of the matrix are sampled from a probability distribution weighted by the leverage scores of the columns. Along with these randomized algorithms, Papailiopoulos et al. papailiopoulos2014provable demonstrate that deterministically selecting the subset of matrix columns with the largest leverage scores results in a good low-rank matrix approximation. In another work, CUR decomposition is improved with the use of statistical leverage scores mahoney2009cur . Gittens and Mahoney gittens2013 analyze different Nyström sampling strategies for SPSD matrices and show that sampling based on leverage scores is quite effective.
Motivated by this line of work, which shows that statistical leverage scores are effective in finding columns (or rows) that exhibit high influence on the best low-rank fit of the data matrix, we propose to measure the representativeness of an example in a class with its leverage score in the kernel matrix computed on the examples. A kernel function returns the dot product of two input vectors in a transformed, typically higher-dimensional, feature space scholkopf2001learning . For a given set of examples, entry $(i, j)$ of the kernel matrix is the kernel function evaluated on examples $i$ and $j$.
Statistical leverage scores reflect the influence of the examples in a kernel matrix by capturing the most dominant part of the matrix. Fig. 1 demonstrates this idea on two toy matrices. The entries of the first matrix are all drawn from the same uniform distribution (Fig. 1a), whereas the second matrix includes a submatrix whose entries are sampled uniformly at random, with the remaining entries generated differently (Fig. 1d). Hence, every example is equally representative in the first matrix, while only a few examples are representative in the second. Consider the linear kernels computed on these two sets of examples (Fig. 1b and 1e). The leverage scores computed on the two kernel matrices depict the structural difference between them and successfully identify the important rows (compare Fig. 1c and Fig. 1f). In the second kernel matrix, the rows with high leverage scores, rows 4-7 (Fig. 1f), encode most of the information in the matrix, while the rows with all-zero entries have leverage scores of 0. We use the idea that leverage scores can identify and rank the rows (examples) that carry the most information for constructing the original kernel matrix; thus, they can be used to assess the influence of the examples in the data distribution.
5 Proposed Sequential Active Learning Method: ALEVS
For the sequential learning scenario, where one example is queried for labeling at each round, we propose ALEVS. The following steps are taken to decide which example to query at each iteration.
First, the training examples are divided into two subsets based on class membership, and two separate feature matrices are formed on these subsets. At each iteration, a classifier is trained with the labeled training examples using a supervised method, and the class memberships of the unlabeled examples are predicted with this classifier. The rows of the positive-class feature matrix are the feature vectors of the examples with positive class membership at that iteration. These are the examples whose true labels are known to be positive, along with the examples whose true labels are not known but which the classifier predicts to be in the positive class. The negative-class feature matrix is constructed similarly from the negative examples.
In the second step, ALEVS computes a kernel matrix on each of the two feature matrices separately: one on the positive-class examples and one on the negative-class examples. In each kernel matrix, entry $(i, j)$ is the kernel function evaluated on examples $i$ and $j$ of that class. These two matrices encode the similarity of examples to other examples that are in the same class.
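The two steps above can be sketched as follows. The RBF kernel, the `gamma` parameter, and the +1/-1 label encoding are assumptions made for illustration; this section does not fix the kernel used by ALEVS, and the function names are hypothetical.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Kernel matrix with entries exp(-gamma * ||x_i - x_j||^2)."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)   # guard against tiny negative round-off
    return np.exp(-gamma * sq_dists)

def class_kernel_matrices(X, labels, gamma=1.0):
    """Split the pool by class membership and build one kernel matrix per class.

    `labels` holds +1/-1 for every example: the true label where it is known,
    and the current classifier's prediction otherwise.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == -1)
    return ((pos_idx, rbf_kernel_matrix(X[pos_idx], gamma)),
            (neg_idx, rbf_kernel_matrix(X[neg_idx], gamma)))

X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
(pos_idx, K_pos), (neg_idx, K_neg) = class_kernel_matrices(X_demo, [1, -1, 1, -1], gamma=0.5)
```

Keeping the pool indices alongside each kernel matrix makes it possible to map a row with a high leverage score back to the example it came from.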
We would like to find the examples that carry the most information for reconstructing the kernel matrix. ALEVS finds the example that imparts the strongest influence on the two class-specific kernel matrices through statistical leverage scores (Definition 1). To be able to compare leverage scores of examples computed on matrices with different sizes $n$ and rank parameters $k$, we use the scaled leverage scores, which ensures that the average leverage score is 1:
$$\ell_j = \frac{n}{k} \, \|(\mathbf{U}_1)_j\|_2^2 \tag{2}$$
where $(\mathbf{U}_1)_j$ is the $j$-th row of the matrix of top-$k$ eigenvectors from Definition 1.
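At each iteration, ALEVS computes the scaled leverage scores for the positive-class and negative-class kernel matrices, and the unlabeled example that corresponds to the highest leverage score row in these matrices is selected for query:
$$q = \arg\max_{i \in \mathcal{U}} \ell_i \tag{3}$$
where $\mathcal{U}$ indexes the unlabeled examples and $\ell_i$ is the scaled leverage score of example $i$ in the kernel matrix of its (known or predicted) class.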
These steps are repeated at each round of the active learning iterations. An important parameter in ALEVS is the target rank parameter. We fix a target proportion of variance to be explained and select the smallest rank for which the top eigenvalues account for at least that proportion of the sum of all eigenvalues. The overall procedure of ALEVS is summarized in Algorithms 1, 2, and 3.
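The rank selection rule and the query step can be sketched together. This is an illustrative reading of the text, not a transcription of Algorithms 1 to 3: the threshold name `tau`, the normalization of the eigenvalue sum, and all function names are assumptions.

```python
import numpy as np

def select_rank(eigvals, tau):
    """Smallest k such that the top-k eigenvalues explain at least a fraction tau."""
    vals = np.sort(np.clip(eigvals, 0.0, None))[::-1]   # descending, round-off clipped
    ratios = np.cumsum(vals) / np.sum(vals)
    k = int(np.searchsorted(ratios, tau - 1e-12)) + 1   # first position reaching tau
    return min(k, len(vals))

def scaled_leverage_scores(K, tau):
    """Leverage scores scaled by n/k so that their average is 1."""
    eigvals, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1]
    k = select_rank(eigvals, tau)
    U1 = eigvecs[:, order[:k]]
    n = K.shape[0]
    return (n / k) * np.sum(U1 ** 2, axis=1)

def alevs_query(K_pos, idx_pos, K_neg, idx_neg, unlabeled, tau=0.9):
    """Pool index of the unlabeled example with the highest scaled leverage score."""
    best_idx, best_score = None, -np.inf
    for K, idx in ((K_pos, idx_pos), (K_neg, idx_neg)):
        if len(idx) == 0:
            continue
        scores = scaled_leverage_scores(K, tau)
        for i, s in zip(idx, scores):
            if i in unlabeled and s > best_score:
                best_idx, best_score = int(i), float(s)
    return best_idx

# Toy usage with linear kernels on two small class groups.
rng = np.random.default_rng(0)
Xp, Xn = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
K_pos, K_neg = Xp @ Xp.T, Xn @ Xn.T
query = alevs_query(K_pos, [0, 1, 2, 3, 4], K_neg, [5, 6, 7, 8], unlabeled={1, 2, 6, 7})
```

Scaling by n/k matters here because the two class groups generally differ in size and in selected rank, so unscaled scores from the two kernel matrices would not be directly comparable.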
6 Proposed Batch-Mode Active Learning Method: DBALEVS
A highquality batch should contain highly influential examples in the data distribution. On the other hand, as some of the examples can be highly influential on an individual basis, they might contain redundant information and can form poor batches if they are queried together. DBALEVS aims to select a batch not only diverse within the current batch but also with respect to the already labeled examples. We encode these properties in a set scoring function and use it to select batches at each iteration. The sum of leverage scores of the examples in the batch assesses the total usefulness of a set of examples. To select a diverse set, we incorporate a term that penalizes the selection of examples that are similar to each other. For evaluating the similarity of examples, we use the kernel function. We define the following set scoring function:
Definition 2 (Set scoring function)
Given a set that is a subset of the ground set , the scoring function, , is defined as follows:
(4) 
Here, is a cardinality constraint on the selectable set size of a batch. denotes the leverage score of point , and denotes the kernel function evaluation of points and with the assumption that . is a parameter.
The first part of this function evaluates the individual representativeness of the examples in the set, while the second part penalizes the selection of highly similar instances. The influence of the diversity term can be adjusted by the trade-off parameter. We would like to select a batch that maximizes the set scoring function:
(5)  
This is a subset selection problem, and except for small ground sets and small batch sizes, an exhaustive search for the optimal batch is intractable. To tackle this computational challenge, we exploit the fact that the proposed set function is submodular. Although submodular maximization is also NP-hard in general Krause05nearoptimalnonmyopic , Nemhauser et al. nemhauser1978analysis showed that the greedy algorithm for selecting a subset of a given size is guaranteed to return a solution whose value is within a constant factor of the optimum (Theorem 6.1).
Theorem 6.1
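For a monotone, nonnegative, submodular function $F$ and a cardinality constraint $k$, the greedy algorithm yields a set $S_{\text{greedy}}$ that satisfies
$$F(S_{\text{greedy}}) \ge \left(1 - \frac{1}{e}\right) \max_{|S| \le k} F(S) \tag{6}$$
where $S_{\text{greedy}}$ denotes the greedily selected set with cardinality $k$ nemhauser1978analysis .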
The greedy algorithm adds to the solution, at each step, the element that gives the maximum increase in the function value. To be able to use this greedy algorithm with the aforementioned approximation bound, we need to show that the set scoring function is submodular, monotonically nondecreasing, and nonnegative. Below we first define submodularity and then prove that the set scoring function is submodular.
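The greedy rule can be sketched generically. The example below uses a toy coverage function, which is a standard monotone submodular function; it is not the set scoring function of Definition 2, and the names `greedy_maximize`, `REGIONS`, and `coverage` are illustrative.

```python
def greedy_maximize(ground_set, score, k):
    """Greedily build a set of at most k elements, adding the largest marginal gain each step."""
    selected = []
    remaining = list(ground_set)
    for _ in range(min(k, len(remaining))):
        base = score(selected)
        # Marginal gain of x is score(selected + [x]) - score(selected).
        best = max(remaining, key=lambda x: score(selected + [x]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy monotone submodular function: number of distinct items covered.
REGIONS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}

def coverage(subset):
    covered = set()
    for name in subset:
        covered |= REGIONS[name]
    return len(covered)

batch = greedy_maximize(REGIONS, coverage, k=2)
```

On this toy instance the greedy choice is "c" first (it covers four items) and then "a" (it adds three more), which here also happens to be the optimal pair. Each step re-evaluates the marginal gains because, for a submodular function, they can only shrink as the selected set grows.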
Definition 3 (Submodularity)
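Let $A \subseteq B \subseteq V$, where $V$ denotes the ground set, and let $x \in V \setminus B$ be an element. A set function $F : 2^{V} \to \mathbb{R}$ is called submodular if the following holds:
$$F(A \cup \{x\}) - F(A) \ge F(B \cup \{x\}) - F(B) \tag{7}$$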
Proposition 1 (Submodularity)
is submodular.
Proof
For to be submodular, the following should hold:
(8) 
Using Definition 2 for :
Rearranging the terms we end up with the following expression:
If we do the same simplification for the right hand side of the submodularity definition, , we arrive to a similar expression for set . Therefore,
Since and and , . Therefore, . Hence, is submodular. ∎.
To be able to apply the greedy algorithm with an approximation guarantee, we also need to show that is a monotonically nondecreasing and nonnegative function under reasonable conditions. The proofs that satisfies these conditions when the selected batch size is less than or equal to are provided in Appendix A.2 and A.3.
The procedure for querying a batch is summarized in Algorithm 4. First, the labeled and unlabeled pool is divided based on class labels. As in ALEVS, at each iteration the classifier is trained exclusively with the labeled training examples using a supervised method, and the class memberships of the unlabeled examples are predicted with this classifier. The examples whose true labels are known to be positive, along with the examples whose true labels are not known but which are predicted to be in the positive class, form the positive class group. The negative class group is constructed similarly from the negatively labeled and negatively predicted examples. Having divided the pool based on class membership, a kernel matrix is computed for each class group. Then the leverage scores of the examples are computed from the kernel matrices based on Definition 1 for each class. Note that we use the leverage scores without the scaling applied in ALEVS. This is necessary to ensure the submodularity of the set scoring function. DBALEVS selects half of the batch from the positive examples and half from the negative examples. For this selection, the method uses the set scoring function (Definition 2). For greedy maximization, the method uses the available labeled data of the positive and negative classes as the initial sets. This allows selecting a set that is also diverse with respect to the already labeled examples. This modified greedy maximization is given in Algorithm 5.
7 Results for ALEVS
We compare ALEVS with the following five approaches:

Random sampling: Selects an unlabelled example uniformly at random.

Uncertainty sampling: Queries the example that the current classifier is most uncertain about Lewis94 , that is, the one with the maximal value of $1 - P(\hat{y} \mid \mathbf{x})$; here $\hat{y}$ is the predicted class label for that example. The posterior probability is estimated with Platt's algorithm platt1999probabilistic based on the SVM's output.
Leverage sampling on all data (LevOnAll): Computes the leverage score on the pool of examples at the beginning of the iteration without paying attention to class membership, then at each iteration queries the unlabeled example with the largest leverage score.

Transductive experimental design: Selects observations that maximize the quality of the parameter estimates in a linear regression model yu:icml06expdesign . The model is also applicable to classification problems.
QUIRE: Selects an instance that is both informative and representative through optimizing a function that encodes these properties quire .
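Table 1: Characteristics of the datasets used in the ALEVS experiments (size, dimensionality, and positive-to-negative class ratio).
Dataset  Size  Dim.  +/-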

digit1  1500  241  1.00 
g241c  1500  241  1.00 
UvsV  1577  16  1.10 
USPS  1500  241  0.25 
twonorm  2000  20  1.00 
ringnorm  2000  20  1.00 
spambase  2000  57  0.66 
3vs5  2000  784  1.20 
We compare the methods on eight different datasets. Table 1 summarizes the characteristics of the datasets (for more information about the data, see Appendix A.4). Each dataset is divided into training and held-out test sets. We start with four randomly selected labeled examples, two from each class. At each iteration, the classifier is updated for all the methods with the training data, and the accuracy values are calculated on the same held-out test data. In all experiments, an SVM classifier with an RBF kernel is trained. Each experiment is repeated with random splits of the training and test data and a random initial selection of labeled examples.
Fig. 2 shows the average classification accuracy values of ALEVS and the other approaches at each iteration of active sampling. Table 2 and Table 3 summarize the win, tie, and loss counts of ALEVS versus each of the competing methods, based on a one-sided paired-sample t-test at a significance level of 0.05.
Table 2: Win/tie/loss counts of ALEVS versus each competing method over the first 50 iterations.
Dataset  vs. QUIRE  vs. LevOnAll  vs. Random  vs. Uncertainty  vs. ExpDesign

digit1  8/19/23  29/21/0  27/23/0  46/4/0  33/17/0 
g241c  0/34/16  32/18/0  30/20/0  34/16/0  30/19/1 
USPS  0/43/7  32/16/2  33/12/5  0/50/0  32/18/0 
ringnorm  47/3/0  48/2/0  49/1/0  47/3/0  49/1/0 
spambase  8/21/21  16/27/7  10/29/11  0/46/4  32/7/11 
MNIST3vs5  3/41/6  42/8/0  44/6/0  48/2/0  43/7/0 
UvsV  0/2/48  48/2/0  25/25/0  8/12/30  49/1/0 
twonorm  49/1/0  50/0/0  50/0/0  50/0/0  49/1/0 
Table 3: Win/tie/loss counts of ALEVS versus each competing method over iterations 50-100.
Dataset  vs. QUIRE  vs. LevOnAll  vs. Random  vs. Uncertainty  vs. ExpDesign

digit1  0/0/50  0/16/34  0/16/34  0/15/35  0/50/0 
g241c  9/41/0  50/0/0  50/0/0  50/0/0  50/0/0 
USPS  0/12/38  50/0/0  50/0/0  0/32/18  50/0/0 
ringnorm  50/0/0  22/28/0  50/0/0  13/37/0  50/0/0 
spambase  0/6/44  2/48/0  24/26/0  0/9/41  50/0/0 
MNIST3vs5  0/20/30  39/11/0  50/0/0  6/17/27  50/0/0 
UvsV  0/0/50  2/48/0  0/50/0  0/0/50  50/0/0 
twonorm  50/0/0  50/0/0  50/0/0  50/0/0  50/0/0 
We observe that ALEVS outperforms random sampling and uncertainty sampling on almost all datasets (Fig. 2). The exceptions are the USPS and spambase datasets, for which ALEVS performs as well as uncertainty sampling but not better. ALEVS' performance is consistently better than random sampling in the first 50 iterations of active sampling (Table 2). For iterations 50-100, the two methods tie on the spambase and UvsV datasets, and random sampling performs better on digit1 (Table 3). On the UvsV dataset, uncertainty sampling works well against all methods. When compared to transductive experimental design, ALEVS performs better on all datasets except digit1 in iterations 50-100, where the two methods tie (Table 2 and Table 3).
When comparing the performance of ALEVS against QUIRE, the datasets fall into three groups. The first group comprises ringnorm and twonorm, for which ALEVS decisively outperforms QUIRE. In the second group, ALEVS either outperforms QUIRE or ties with it over a subset of the iterations. In digit1, ALEVS outperforms QUIRE in the first 50 iterations. For the g241c dataset, ALEVS either ties with or performs worse than QUIRE in early iterations, but the performance of QUIRE is not consistent in these early iterations (Fig. 2(c)). In this dataset, ALEVS keeps up with QUIRE and outperforms it at later iterations. In the 3vs5 dataset, QUIRE and ALEVS tie in most of the first 50 iterations. The third group consists of datasets where ALEVS lags behind QUIRE: UvsV, USPS, and spambase. For UvsV, a few labels are sufficient to obtain good accuracy, and the performances of the different methods do not differ dramatically (Fig. 2(e)). For the spambase dataset, ALEVS shows promising performance around iterations 30 and 40 (Fig. 2(h)). In general, ALEVS manages to find effective examples for querying in the early iterations. Therefore, a strategy that combines ALEVS with a method that performs poorly in early iterations but does better in later iterations, such as uncertainty sampling, could lead to a strong active learner. Such a hybrid strategy will be explored in future work.
To understand whether computing class-specific kernel matrices has any merit, we compare ALEVS against LevOnAll and observe that ALEVS consistently outperforms it. Thus, calculating the leverage scores within each class is better at finding the influential data points than calculating them on the whole pool.
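Table 4: Class ratio of the examples queried by each method under the two class imbalance settings.
Class ratio  ALEVS  QUIRE  UNCERTAINTY  RANDOM  LEVONALL  EXP. DESIGN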

5:1  1.28  1.60  2.41  5.32  4.80  inf* 
10:1  1.31  1.64  2.30  10.55  10.19  inf * 
To understand how ALEVS reacts to class imbalance, we sample the 5vs3 dataset with two different class ratios, 5:1 and 10:1, and repeat the experiments on this dataset. The experimental setup is identical to that of the previous section, the only difference being the adoption of the F1 score for measuring performance. As depicted in Fig. 3, ALEVS successfully copes with the class imbalance and outperforms the other methods. The transductive experimental design method is unable to handle the class imbalance and returns F1 scores of 0; therefore, we exclude its results from the figures. Similarly, with the 10:1 class ratio, LevOnAll and random sampling return very poor F1 scores and are excluded from the graph. To understand why different methods handle the class imbalance differently, we analyze the class label distribution of the queried set. Table 4 illustrates that the methods that are robust to class imbalance sample almost equally from both classes, whereas those that fail sample close to the original class distribution. When the classifier is provided with a balanced training set, the overall training process is not hurt by the unequal class distribution. As the class balance deteriorates, adopting cost-sensitive training methods could solve the problem, but it adds an extra layer of complexity.
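A small worked example shows why the F1 score, rather than accuracy, exposes a failure under class imbalance. This is a minimal sketch, assuming the minority class is treated as the positive class (the text does not state which class is positive); the function name and toy labels are illustrative.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0                      # no positive example was recovered
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A 10:1 imbalanced toy test set and a classifier that always predicts the majority class.
y_true = [1] + [0] * 10
y_majority = [0] * 11
accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
f1_majority = f1_score(y_true, y_majority)
```

Here the always-majority classifier reaches an accuracy above 0.9 while its F1 score is 0, which matches the kind of degenerate behavior reported above for methods that never query the minority class.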
We also compare the methods in terms of their running times. The querying step of ALEVS involves computing the eigenvalue decomposition of the kernel matrices; however, in practice, this does not cause a computational bottleneck. We summarize the average CPU times for selecting one example in a single iteration from the unlabeled data pool in Appendix Fig. 6. As the figures show, ALEVS is as fast as uncertainty sampling.
8 Results for DBALEVS
We compare DBALEVS with the following approaches:

Random sampling: Selects the examples of a batch uniformly at random from the unlabeled pool.

Uncertainty sampling: Selects examples with maximal uncertainty.

Top leverage sampling (TopLev): Computes the leverage score on the whole pool at the beginning without paying attention to class membership and selects the top examples based on their leverage scores.

Nearoptimal batch mode active learning (NearOpt) : NearOpt chen2013near selects a batch of instances using adaptive submodular optimization.
To evaluate the performance of DBALEVS, we run experiments on six different datasets (Table 5). The details of these datasets can be found in Appendix Section X. Each dataset is divided into training and held-out test sets. We start with four randomly selected labeled examples, two from each class. The batch size is set to 10, and the diversity trade-off parameter is set to 0.5 for every dataset except ringnorm, for which it is set to 0.1. At each iteration, the classifier is updated for all the methods with the training data, and the accuracy values are calculated on the same held-out test data. In all experiments, an SVM classifier with an RBF kernel is used. For each dataset, the experiment is repeated with random splits of the training and test data and a random initial selection of labeled examples. For the set function in Definition 2 to be submodular, the values of the kernel function must lie in a bounded range. We use the RBF kernel in calculating the set scoring function, whose values fall in this range. However, the method is compatible with other kernel functions as long as they are normalized to this range. For optimizing the set scoring function in Definition 2, we use the submodular function optimization toolbox krause2010sfo .
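Table 5: Datasets used in the DBALEVS experiments (size as examples x features, and positive-to-negative class ratio).
Dataset  Size  +/-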

autos  1986 x 11009  1.00 
hardware  1945 x 9877  1.00 
sport  1993 x 11148  1.00 
ringnorm  7400 x 20  1.00 
3vs5  13454 x 784  1.20 
4vs9  13782 x 784  1.20 
Fig. 4 shows the average classification accuracy values of DBALEVS and the other approaches at each iteration of active sampling. Table 6 summarizes the win, tie, and loss counts of DBALEVS versus each of the competing methods, based on a one-sided paired-sample t-test at a significance level of 0.05.
Table 6: Win/tie/loss counts of DBALEVS versus each competing method.
Dataset  vs.NearOpt  vs.TopLev  vs.Random  vs.Uncertainty

ringnorm  20/0/0  0/1/19  20/0/0  20/0/0 
autos  18/11/1  49/8/3  49/11/0  51/6/3 
hardware  25/5/0  57/3/0  54/6/0  57/3/0 
sport  25/5/0  51/9/0  46/14/0  54/6/0 
4vs9  18/2/0  20/0/0  18/2/0  19/1/0 
3vs5  17/3/0  18/2/0  18/2/0  19/1/0 
DBALEVS outperforms NearOpt in almost all cases. For all of the datasets, DBALEVS wins over NearOpt except for few tie cases (See Table 6), and DBALEVS never loses against NearOpt. We also observe that the performances of NearOpt and random sampling are comparable. The results indicate that DBALEVS outperforms random sampling and uncertainty sampling approaches in all of the datasets. One interesting observation is, uncertainty sampling performs poorly in the batch mode setting. The main cause for this poor performance is the strong dependency of uncertainty sampling on the initial hypothesis. If the initial hypothesis formed from the initially labeled examples is not reasonable, then the query selection is affected by it because noninformative instances are selected at the subsequent iterations. One other drawback of this approach is that it fails to model the dependencies among the selected instances.
Another result is that random sampling performance is not bad and outperforms uncertainty sampling in all of the datasets. This surprising performance is also noted by others gu2014batch . It could be attributed to the fact that datasets that are available for running these experiments are not truly random; therefore, if the batch size is large enough, the examples provide valuable information to learn the datasets.
When compared to the TopLev baseline, we observe that DBALEVS consistently outperforms it, the only exception being the ringnorm dataset. One possible explanation lies in the structure of this dataset: since ringnorm is artificially created from multivariate Gaussians, by querying the points with maximal leverage scores, the learner receives labels from dense regions in different clusters without the need for diversification.
9 Conclusions and Future Work
In this study, we present a new query measure for active learning that is based on statistical leverage scores. We propose two novel algorithms based on this querying strategy: a sequential-mode algorithm, ALEVS, and a batch mode algorithm, DBALEVS. Our experiments on 8 different datasets show that the use of statistical leverage scores is an effective alternative strategy for finding influential examples. ALEVS achieves superior performance compared to common baselines such as uncertainty sampling and random selection, as well as an older method, transductive experimental design. More importantly, ALEVS performs better or equally well in terms of classification accuracy when compared to a state-of-the-art approach, QUIRE quire , which is documented to outperform other methods. Moreover, ALEVS runs much faster than QUIRE. We also show empirically that ALEVS is robust to class imbalance.
The second proposed algorithm, DBALEVS, is designed to address the batch mode active learning setting. We formulate a set scoring function that rewards examples with high leverage scores and penalizes the inclusion of similar examples in the set. We prove that this function is submodular, monotone and nonnegative, which enables us to use a greedy algorithm for the submodular maximization problem; the greedy solution is within a constant factor of the optimal solution. Our experiments on 6 different datasets show that incorporating leverage scores and kernel matrix entries to find an influential and diverse batch of points is an effective strategy. DBALEVS performs well against the common baselines, random sampling and uncertainty sampling, and against NearOpt chen2013near , which employs a recently introduced framework called adaptive submodularity. These results show that the statistical leverage score is an effective measure for deciding which examples to query in a pool of unlabeled examples.
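As a rough illustration of the greedy selection described above, the following sketch assumes a hypothetical set score F(S) = sum of leverage scores in S minus `lam` times the sum of kernel values over unordered pairs in S. This form, the names `leverage_scores` and `greedy_batch`, and the parameter `lam` are assumptions for illustration; the exact function of Definition 2 is not reproduced here. For a monotone, nonnegative submodular function under a cardinality constraint, the greedy algorithm is known to achieve a (1 - 1/e) approximation [26]; that guarantee applies to this sketch only when the assumed score satisfies those conditions.

```python
import numpy as np


def leverage_scores(K, k):
    """Rank-k leverage scores: squared row norms of the top-k eigenvectors of K."""
    _, eigvecs = np.linalg.eigh(K)       # eigenvalues returned in ascending order
    top_k = eigvecs[:, -k:]              # eigenvectors of the k largest eigenvalues
    return np.sum(top_k ** 2, axis=1)


def greedy_batch(scores, K, batch_size, lam):
    """Greedily build a batch under the assumed score
    F(S) = sum_{i in S} scores[i] - lam * sum_{i<j in S} K[i, j].

    The marginal gain of adding i to S is scores[i] - lam * sum_{j in S} K[i, j],
    so the gains are updated incrementally after each pick.
    """
    gains = np.asarray(scores, dtype=float).copy()
    available = np.ones(len(gains), dtype=bool)
    selected = []
    for _ in range(batch_size):
        masked = np.where(available, gains, -np.inf)  # exclude chosen points
        best = int(np.argmax(masked))
        selected.append(best)
        available[best] = False
        gains -= lam * K[:, best]                     # penalize similarity to the new pick
    return selected
```

With `lam = 0` the procedure reduces to picking the highest-scoring points; a larger `lam` trades leverage for diversity by steering later picks away from examples similar to those already chosen.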
The work presented here can be extended in different directions. We observe that ALEVS and DBALEVS are especially effective in early iterations, a stage at which many existing algorithms perform inadequately. Therefore, one future direction is to develop a hybrid strategy in which the proposed method works in cooperation with other methods. For example, the framework proposed in this study does not incorporate any knowledge about the uncertainty of the class labels. Another possible direction would be to study the adaptability of these methods to stream-based selective sampling active learning approaches. Finally, in order to understand the effectiveness of the methods, we did not consider more complex active learning scenarios such as nonuniform label costs and noisy or reluctant experts. The general framework presented here can be further investigated to incorporate these alternative settings.
Acknowledgements.
O.T. acknowledges support from Bilim Akademisi - The Science Academy, Turkey, under the BAGEP program.
References
 [1] L. Breiman. Bias, variance, and arcing classifiers. Technical report, University of California, Berkeley, 1996.
 [2] K. Brinker. Incorporating diversity in active learning with support vector machines. In ICML, volume 3, pages 59–66, 2003.
 [3] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.
 [4] R. Chattopadhyay, Z. Wang, W. Fan, I. Davidson, S. Panchanathan, and J. Ye. Batch mode active sampling based on marginal probability distribution matching. ACM Transactions on Knowledge Discovery from Data, 7(3):13, 2013.
 [5] Y. Chen and A. Krause. Near-optimal batch mode active learning and adaptive submodular optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 160–168, 2013.
 [6] S. Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781, 2011.
 [7] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In Proceedings of the 25th International Conference on Machine Learning, pages 208–215, 2008.
 [8] P. Drineas, R. Kannan, and M. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36:2006, 2004.
 [9] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2–3):133–168, Sept. 1997.
 [10] A. Gittens and M. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In Proceedings of the 30th International Conference on Machine Learning, 2013.
 [11] Q. Gu, T. Zhang, and J. Han. Batch-mode active learning via error bound minimization. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pages 300–309, 2014.
 [12] Y. Guo. Active instance sampling via matrix partition. In Advances in Neural Information Processing Systems 23, pages 802–810. Curran Associates Inc., 2010.
 [13] Y. Guo and D. Schuurmans. Discriminative batch mode active learning. In Advances in Neural Information Processing Systems 20, pages 593–600. Curran Associates Inc., 2008.
 [14] S. C. Hoi, R. Jin, and M. R. Lyu. Batch mode active learning with applications to text categorization and image retrieval. IEEE Transactions on Knowledge and Data Engineering, 21(9):1233–1248, 2009.
 [15] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd International Conference on Machine Learning, pages 417–424. ACM, 2006.
 [16] S.-J. Huang, R. Jin, and Z.-H. Zhou. Active learning by querying informative and representative examples. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(10):1936–1949, 2014.
 [17] Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang. Representative sampling for text classification using support vector machines. In European Conference on Information Retrieval, pages 393–407. Springer, 2003.
 [18] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Active learning with gaussian processes for object categorization. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
 [19] A. Krause. SFO: A toolbox for submodular function optimization. Journal of Machine Learning Research, 11(Mar):1141–1144, 2010.
 [20] A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, page 5, 2005.
 [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [22] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–12. Springer-Verlag, 1994.
 [23] M. Lichman. UCI machine learning repository, 2013.
 [24] D. J. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992.
 [25] M. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
 [26] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(1):265–294, 1978.
 [27] H. T. Nguyen and A. Smeulders. Active learning using pre-clustering. In Proceedings of the 21st International Conference on Machine Learning, page 79, 2004.
 [28] D. Papailiopoulos, A. Kyrillidis, and C. Boutsidis. Provable deterministic leverage score sampling. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 997–1006. ACM, 2014.
 [29] J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
 [30] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning, pages 441–448, 2001.
 [31] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT press, 2001.
 [32] B. Settles. Active learning literature survey. Technical Report 1648, University of Wisconsin–Madison, 2009.
 [33] B. Settles and M. Craven. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1070–1079. Association for Computational Linguistics, 2008.
 [34] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, pages 287–294. ACM, 1992.
 [35] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66, November 2001.
 [36] Y. Wang and A. Singh. Column subset selection with missing data via active sampling. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 1033–1041, 2015.
 [37] Y. Wang and A. Singh. An empirical comparison of sampling techniques for matrix column subset selection. In Proceedings of 53rd Annual Allerton Conference on Communication, Control, and Computing, pages 1069–1074. IEEE, 2015.
 [38] Z. Wang and J. Ye. Querying discriminative and representative samples for batch mode active learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 9(3):17, 2015.
 [39] Z. Xu, R. Akella, and Y. Zhang. Incorporating diversity and density in active learning for relevance feedback. In European Conference on Information Retrieval, pages 246–257. Springer, 2007.
 [40] T. Yang, L. Zhang, R. Jin, and S. Zhu. An explicit sampling dependent spectral error bound for column subset selection. In Proceedings of the 32nd International Conference on Machine Learning, pages 135–143, 2015.
 [41] Y. Yang, Z. Ma, F. Nie, X. Chang, and A. G. Hauptmann. Multiclass active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision, 113(2):113–127, 2015.
 [42] K. Yu, J. Bi, and V. Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), 2006.
 [43] J. Zhu, H. Wang, T. Yao, and B. K. Tsou. Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 1137–1144, 2008.
Appendix
a.1 Notation table
Symbol  Explanation 

input space  
output space  
dataset  
feature vector of a data point  
queried example  
class label  
labeling oracle  
classifier  
threshold parameter for rank parameter selection  
batch size  
queried batch  
diversity tradeoff parameter  
statistical leverage score  
eigenvalue  
rank parameter for calculating truncated leverage scores  
input feature matrix  
feature mapping  
kernel function evaluated on examples and  
kernel matrix  
th row of  
element in th row and th column of matrix  
kernel parameter configuration  
scale parameter of RBF kernel  
degree of polynomial kernel  
coefficient of polynomial kernel 
a.2 Proving is monotone
Proposition 2 (Monotonicity)
is a monotonically nondecreasing set function when input set size is at most and for kernel values and .
Proof
Consider two arbitrary sets, and , where . And let , and and . We need to show that the following inequality holds:
(9) 
Using Definition 2 for :
The leftmost summation calculated over terms and the second one sums over terms. Using the fact , these terms can be at most and , respectively. Additionally, since the minimum value that can take is 0. Then the following inequality holds:
Passing from the second line to the third line we used the fact that, . As , is monotonically nondecreasing for sets with sizes smaller than or equal to .
a.3 Proving is nonnegative
Proposition 3 (Nonnegativity)
is a nonnegative set function for sets with cardinality smaller than or equal to .
Proof
For to be nonnegative, the following statement should hold for sets with cardinality at most :
is defined as follows:
Moving from equality to inequality (line 2 to 3), we use the facts that the minimum value can take is zero, the maximum value of is 1, and . Thus, the summation of the leverage scores is minimum 0 and kernel terms can be at most , which is the number of elements in the matrix excluding the diagonals. Since , ; thereby, . This completes the proof that is nonnegative when the chosen set size is bounded with .
a.4 Datasets
The following datasets are used in the ALEVS experiments. The digit1, g241c, and USPS datasets are from [3] (http://olivier.chapelle.cc/sslbook/benchmarks.html). The spambase and letter datasets are obtained from [23]. The letter dataset is a multiclass dataset; we select a letter pair that is difficult to distinguish: UvsV. Similarly, we sample the digits 3 and 5 from the MNIST dataset [21] (http://yann.lecun.com/exdb/mnist/) to form 3vs5, since they are one of the most confused pairs in MNIST. Finally, twonorm and ringnorm are culled from http://www.cs.toronto.edu/~delve/data/twonorm/desc.html and http://www.cs.toronto.edu/~delve/data/ringnorm/desc.html, which are implementations of [1]. We use a random subsample of 2000 examples for ringnorm, twonorm, spambase, and 3vs5 because the running time of QUIRE is prohibitively long. The descriptions of these datasets are given in Table 1.
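The construction of a binary task from a multiclass dataset, followed by random subsampling, can be sketched as follows. The function name `make_binary_task`, the +1/-1 label encoding, and the seed handling are assumptions for illustration and are not taken from the original experimental code.

```python
import numpy as np


def make_binary_task(X, y, class_a, class_b, max_examples=None, seed=0):
    """Keep only two classes of a multiclass dataset and optionally subsample.

    class_a is mapped to +1 and class_b to -1 (an assumed encoding).
    """
    X = np.asarray(X)
    y = np.asarray(y)
    mask = (y == class_a) | (y == class_b)
    X_pair = X[mask]
    labels = np.where(y[mask] == class_a, 1, -1)
    if max_examples is not None and len(labels) > max_examples:
        rng = np.random.default_rng(seed)
        keep = rng.choice(len(labels), size=max_examples, replace=False)
        X_pair, labels = X_pair[keep], labels[keep]
    return X_pair, labels
```

For example, calling it with `class_a=3`, `class_b=5` and `max_examples=2000` on MNIST features and labels would produce a task in the spirit of 3vs5.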
The following datasets are used in the DBALEVS experiments. We use the autos, hardware and sport tasks from the 20newsgroups dataset (http://qwone.com/~jason/20Newsgroups/). These subtopics are picked because they are harder to differentiate. autos involves classification of the rec.autos and rec.motorcycles topics, hardware involves classifying the comp.sys.ibm.pc.hardware and comp.sys.mac.hardware topics, and sport involves classification of the rec.sport.baseball and rec.sport.hockey topics. We use a bag-of-words representation for features in these datasets. Similarly, the digit pairs 3-5 and 4-9 from the MNIST dataset [21] are sampled to create the 3vs5 and 4vs9 classification tasks. Finally, ringnorm is culled from http://www.cs.toronto.edu/~delve/data/ringnorm/desc.html, which is an implementation of [1]. The descriptions of these datasets and the parameters chosen for each dataset are listed in Table 5.
a.5 Effect of target rank k
One parameter that has a large impact on the performance of ALEVS is the target low-rank parameter k. In this work, we adaptively select the value of k for the negative and positive kernel matrices at each iteration by setting a threshold on the variance captured by the top k-dimensional eigenspace, as described in the RankSelector algorithm. We further analyze the effect of k by varying this threshold; we experiment on four datasets with three different threshold values. Accuracies shown in Fig. 5 are averages computed over 10 random experiments. Selecting the full-rank option for computing leverage scores does not necessarily provide the best performance. The low-rank representation acts as a regularizer and focuses on the core dimensions that matter in the datasets. For the datasets digit1, twonorm, ringnorm, works best, whereas for 3vs5 threshold value 0.75 is a better choice and is the worst choice. This difference is expected, as the eigenvalue spectra of the matrices are different.
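A threshold-based rank choice of this kind can be sketched as below. The sketch assumes that "variance" means the fraction of the eigenvalue sum captured by the top k eigenvalues of the kernel matrix; the function name `select_rank` is hypothetical, and the details may differ from the RankSelector algorithm itself.

```python
import numpy as np


def select_rank(K, threshold):
    """Smallest k whose top-k eigenvalues hold at least `threshold` of the eigenvalue sum.

    Assumes K is a symmetric positive semidefinite kernel matrix with a
    nonzero trace.
    """
    eigvals = np.linalg.eigvalsh(K)[::-1]        # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)        # guard against tiny negative values
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, threshold)) + 1
    return min(k, len(eigvals))                  # cap at full rank
```

Because the selected k depends on how quickly the eigenvalues decay, the same threshold can yield very different ranks on different kernel matrices, which is consistent with the dataset-dependent behavior noted above.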
a.6 ALEVS runtime performance
We performed the experiments in Matlab on a computer with a 2.6 GHz CPU (24-core) and 64 GB of memory running the Ubuntu 14.04 LTS operating system. ALEVS is almost as fast as uncertainty sampling.
a.7 DBALEVS runtime performance
The querying step of DBALEVS involves the eigenvalue decomposition of the kernel matrices and the greedy maximization procedure. Fig. 7 displays the average CPU times for selecting a batch in a single iteration from the unlabeled data pool. DBALEVS has runtimes comparable to those of the NearOpt method. Experiments are conducted in Matlab on a computer with a 2.6 GHz CPU (24-core) and 64 GB of memory running the Ubuntu 14.04 LTS operating system.