Active Learning Methods based on Statistical Leverage Scores

12/06/2018 ∙ by Cem Orhan, et al. ∙ Sabancı University

In many real-world machine learning applications, unlabeled data are abundant whereas class labels are expensive and scarce. An active learner aims to obtain a model of high accuracy with as few labeled instances as possible by effectively selecting useful examples for labeling. We propose a new selection criterion that is based on statistical leverage scores and present two novel active learning methods based on this criterion: ALEVS for querying a single example at each iteration and DBALEVS for querying a batch of examples. To assess the representativeness of the examples in the pool, ALEVS and DBALEVS use the statistical leverage scores of the kernel matrices computed on the examples of each class. Additionally, DBALEVS selects a diverse set of examples that are highly representative but are dissimilar to already labeled examples by maximizing a submodular set function defined with the statistical leverage scores and the kernel matrix computed on the pool of examples. The submodularity of the set scoring function lets us efficiently identify batches that are within a constant factor of the optimal batch. Our experiments on diverse datasets show that querying based on leverage scores is a powerful strategy for active learning.


1 Introduction

Learning a supervised model with good predictive performance requires a sufficiently large amount of labeled data. In many real-world applications, while unlabeled data are abundant and easily obtained, acquiring class labels is time-consuming, costly, or requires expert knowledge; medical image analysis, for example, requires pathology experts. Additionally, labeling all data is often redundant, as some examples do not add further information to the already labeled ones. An active learner aims to learn a model using as few training examples as possible while achieving good model performance Settles2009 . Active learning has been shown to be effective in a variety of domains, including text categorization Tong2001 , computer vision kapoor2007active , and medical image analysis hoi2006batch .

One common setting for active learning is pool-based active learning, where a large number of unlabeled examples together with a small number of labeled examples are initially available Settles2009 . The learner interacts with an oracle (e.g., a human expert) that provides labels when queried. At each step, the active learner chooses example(s) intelligently from the unlabeled pool and requests the labels of these queries from the oracle. Next, the training data is augmented with the newly labeled data, and the classifier is retrained. This iterative procedure is repeated until a stopping criterion (e.g., a budget constraint or a desired accuracy) is met. In the sequential mode, the active learner solicits the label for a single instance Lewis94 . In cases where the training process is hard and/or there are multiple annotators available that could work in parallel, retraining the learner at each step would be inefficient. In batch mode active learning, the learner requests the labels of a set of examples at once at each step. In both sequential and batch mode active learning, the critical step is the proper selection of label(s) to probe.

In this work, we focus on binary classification in a supervised learning setting for the pool-based learning scenario. We propose a new criterion, based on statistical leverage scores, to assess the representativeness of an example in the pool. We develop two novel active learning algorithms, ALEVS and DBALEVS. ALEVS (Active Learning by Statistical Leverage Scores) is a sequential active learning algorithm that queries one example at a time. Diverse Batch-mode Active Learning by Statistical Leverage Sampling (DBALEVS) is a batch mode algorithm which selects a batch of examples that are not only influential within their class but also diverse with respect to the already labeled examples and to the other examples in the batch. We achieve this by encoding these properties in a set function that is submodular and monotonically non-decreasing; therefore, we can utilize a greedy submodular maximization algorithm that is provably near-optimal.

2 Problem Setup and Notation

To explore the use of leverage scores for querying examples, we focus on supervised binary classification in the pool-based active learning scenario, where a small number of labeled examples are provided along with a large pool of unlabeled examples. The objective is to learn an accurate classifier $h: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ denotes the instance space and $\mathcal{Y} = \{-1, +1\}$ is the set of class labels. We denote the training data with $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, wherein the $i$-th example's feature vector is denoted by $\mathbf{x}_i \in \mathcal{X}$ and its class label with $y_i \in \mathcal{Y}$. We propose solutions for the sequential learning and the batch mode setting based on leverage scores.

In the sequential learning scenario, the active learner iteratively selects one example from the unlabeled pool and queries its label. We denote the queried example with $x^q$ and its feature vector with $\mathbf{x}^q$. The labeling oracle $\mathcal{O}$, upon receiving the labeling request for $x^q$, responds with the true label $y^q$. We assume uniform labeling cost across examples. We denote the labeled set of training examples at iteration $t$ with $\mathcal{L}^{(t)}$ and the set of unlabeled examples with $\mathcal{U}^{(t)}$; while $\mathcal{L}^{(t)}$ comprises labeled pairs, $\mathcal{U}^{(t)}$ includes only feature vectors.

The objective is to attain a classifier of good accuracy while minimizing the number of examples queried, thereby reducing the labeling cost. In batch mode active learning, instead of selecting a single example at each iteration, batches of size $b$ are sequentially picked, where $b$ is specified a priori by the user. We will refer to the set of examples queried at iteration $t$ as $\mathcal{B}^{(t)}$. Notation used throughout the paper is provided in Appendix A.1, Table 7.

The rest of this article is organized as follows. In the following section, we briefly review the related work. In Section 4, we introduce statistical leverage scores, discuss their use in the literature, and present the idea of querying based on statistical leverage scores. Section 5 presents ALEVS, the method for querying a single example at each active learning iteration, while Section 6 presents the batch mode active learning algorithm DBALEVS. Experimental results are reported in Sections 7 and 8. Lastly, Section 9 concludes.

3 Related Work

For selecting a query, there are two main approaches proposed in the literature Settles2009 . The first is to query examples based on their informativeness. Uncertainty sampling, in which the learner queries the example with the most uncertain class label, is one of the most widely used such methods Lewis94 . The uncertainty can be assessed by the distance to the decision boundary Tong2001 , through label entropy Lewis94 , or by the disagreement of an ensemble of classifiers trained with the current label set seung1992query ; Freund1997 . One common drawback of these algorithms, particularly at early iterations, is that the classifier is uncertain about many points and the decision boundary formed by the classifier is not reliable, as it relies on the limited set of examples available for training roy2001 . Furthermore, these approaches introduce a sampling bias, and the methods fail to exploit the unlabeled data distribution dasgupta2011two . Others choose the most informative example as the one that minimizes the model variance mackay1992information . To assess the informativeness of an example, yu:icml06expdesign extends classical experimental design to active learning and aims at finding examples that will lead to the best predictions. The second set of approaches selects instances that are representative of the data distribution xu2003representative ; nguyen2004active ; xu2007incorporating ; dasgupta2008hierarchical . These algorithms' success heavily depends on the clustering algorithm employed. There are also hybrid approaches that combine the informativeness and representativeness strategies in a single framework. settles2008analysis uses a weighting strategy that weights the informativeness of an example by its similarity to the other points. A similar method, using density and entropy, is applied to a text classification problem zhu:2008 . QUIRE quire optimizes an objective function wherein both the informativeness and representativeness of the examples are considered simultaneously.

Adapting single query selection to the batch mode setting by simply choosing the top $b$ examples, where $b$ is the number of elements in the batch, does not account for the fact that there can be redundant information among the selected set of examples. Several batch mode active learning strategies have been proposed xu2007incorporating ; zhu:2008 ; guo2008discriminative ; hoi2009batch ; guo2010active ; chattopadhyay2013batch ; gu2014batch ; yang2015multi ; wang2015querying . Among these, there are methods that directly optimize an objective function that characterizes a good batch. The method introduced in zhu:2008 selects the top examples that satisfy an objective function combining density and entropy. Guo and Schuurmans guo2008discriminative select a batch of examples which achieves the best discriminative classification performance. For assembling a good batch, Guo guo2010active selects instances that maximize the mutual information between labeled and unlabeled examples. In yang2015multi , the most uncertain and representative queries are selected by minimizing the empirical risk. In the batch mode setting, selecting a diverse set of examples is critical. Brinker brinker2003incorporating selects a diverse batch using SVMs, where the diversity is measured as the angle between the hyperplane induced by the currently selected point and the hyperplanes induced by the previously selected points. hoi2006batch proposes a framework that minimizes the Fisher information and solves this optimization problem using the submodular properties of the set selection function. Chen and Krause chen2013near similarly employ submodular optimization, and their approach asks for the batch with the maximum marginal gain.

4 Statistical Leverage Scores and Our Motivation

Statistical leverage scores of a matrix are the squared row-norms of the matrix containing its (top) left singular vectors. For a symmetric positive semi-definite (SPSD) matrix, the statistical leverage scores relative to the best rank-$k$ approximation to the input matrix are defined as follows gittens2013 :

Definition 1 (Leverage scores for an SPSD matrix)

Let $K \in \mathbb{R}^{n \times n}$ be an arbitrary SPSD matrix with the eigenvalue decomposition $K = U \Lambda U^{T}$. $U$ can be partitioned as $U = [U_1 \; U_2]$, where $U_1$ comprises $k$ orthonormal columns spanning the top $k$-dimensional eigenspace of $K$. Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$ be the eigenvalues of $K$ ranked in descending order. Given $K$ and a rank parameter $k$, the statistical leverage scores of $K$ relative to the best rank-$k$ approximation to $K$ are equal to the squared Euclidean norms of the rows of the matrix $U_1$:

$$\ell_j = \left\| (U_1)_{j,:} \right\|_2^2 \qquad (1)$$

for $j \in \{1, \dots, n\}$, where $(U_1)_{j,:}$ denotes the $j$-th row of $U_1$ and $k \le n$.
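To make the definition concrete, the following short Python sketch (our illustration, not code from the paper) computes the rank-$k$ leverage scores of an SPSD matrix exactly as in Equation (1):

  import numpy as np

  def leverage_scores(K, k):
      # Rank-k statistical leverage scores of an SPSD matrix K (Definition 1).
      # eigh returns eigenvalues in ascending order for symmetric matrices.
      eigvals, eigvecs = np.linalg.eigh(K)
      U1 = eigvecs[:, -k:]            # orthonormal columns spanning the top-k eigenspace
      return np.sum(U1 ** 2, axis=1)  # squared Euclidean norms of the rows of U1

The scores are non-negative and sum to $k$, so they can be read as a budget of importance distributed over the rows.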

Intuitively, leverage scores determine which columns (or rows) are most representative with respect to a rank-$k$ subspace of the matrix. They have recently been used in low-rank matrix approximation algorithms to identify influential columns of the input matrix dkm2006 ; gittens2013 ; papailiopoulos2014provable ; mahoney2009cur ; yang2015 ; wang2015provably ; wang2015empirical . Drineas et al. dkm2006 and Yang et al. yang2015 show that in a low-rank matrix approximation task, column subset selection is improved if the columns of the matrices are sampled based on a probability distribution weighted by the leverage scores of the columns. Along with these randomized algorithms, Papailiopoulos et al. papailiopoulos2014provable demonstrate that deterministically selecting the subset of matrix columns with the largest leverage scores results in a good low-rank matrix approximation. In another work, CUR decomposition is improved with the use of statistical leverage scores mahoney2009cur . Gittens and Mahoney gittens2013 analyze different Nyström sampling strategies for SPSD matrices and show that sampling based on leverage scores is quite effective.

Figure 1: Leverage scores demonstrated on two toy matrices (panels a–c: the first matrix, its linear kernel, and the resulting leverage scores; panels d–f: the same for the second matrix).

Motivated by this line of work, which shows that statistical leverage scores are effective in finding columns (or rows) that exhibit high influence on the best low-rank fit of the data matrix, we propose to measure the representativeness of an example in a class with its leverage score in the kernel matrix computed on the examples. A kernel function $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$ returns the dot product of the input vectors in a typically higher-dimensional transformed feature space scholkopf2001learning . For a given set of $n$ examples, the kernel matrix $K \in \mathbb{R}^{n \times n}$ is defined as $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$.

Statistical leverage scores reflect the influence of the examples in a kernel matrix by capturing the most dominant part of the matrix. Fig 1 demonstrates this idea on two toy matrices. The first matrix, $A_1$, contains entries that are drawn from a uniform distribution (Fig 1a), whereas the second, $A_2$, includes a submatrix whose entries are sampled uniformly at random while its remaining entries are zero (Fig 1d). Hence, every example is equally representative in $A_1$, while few examples in $A_2$ are representative. Consider the linear kernels computed on these examples, denoted $K_1$ and $K_2$ respectively (Fig 1b and 1e). The leverage scores computed on $K_1$ and $K_2$ depict the structural difference between the two matrices and successfully identify the important rows (compare Fig 1c and Fig 1f). The rows of $K_2$ with high leverage scores, rows 4-7 (Fig 1f), encode most of the information in the matrix, while the rows with all zero entries have leverage scores of 0. Since leverage scores can identify and rank the rows (examples) that carry the most information in constructing the original kernel matrix, they can be used to assess the influence of the examples in the data distribution.
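A small experiment in the spirit of Fig 1 reproduces this behavior (our sketch; the exact matrix sizes and ranges are assumptions, and leverage_scores is the helper defined above):

  import numpy as np

  rng = np.random.default_rng(0)
  n, d = 10, 8
  A1 = rng.uniform(size=(n, d))       # every row equally representative
  A2 = np.zeros((n, d))
  A2[3:7] = rng.uniform(size=(4, d))  # only rows 4-7 carry information

  for A in (A1, A2):
      K = A @ A.T                     # linear kernel matrix
      print(np.round(leverage_scores(K, k=3), 2))
  # For the second matrix, the all-zero rows receive leverage scores of 0,
  # while rows 4-7 receive the highest scores.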

5 Proposed Sequential Active Learning Method: ALEVS

For the sequential learning scenario, where at each round one example is queried for labeling, we propose ALEVS. The following steps are taken in deciding the example to query at each iteration $t$.

First, the training examples are divided into two subsets based on class memberships, and two separate feature matrices are formed on these subsets. Let $h^{(t)}$ be the classifier at iteration $t$ that is trained on the labeled training examples with a supervised method; the class memberships of the unlabeled examples are predicted with $h^{(t)}$. $X_+^{(t)}$ is a feature matrix whose rows are the feature vectors of examples with positive class membership at iteration $t$. These examples are those whose true labels are known to be positive, along with the examples for which the true labels are not known but which are predicted to be in the positive class by $h^{(t)}$. $X_-^{(t)}$ is similarly constructed from the negative examples.

In the second step, ALEVS computes kernel matrices on $X_+^{(t)}$ and $X_-^{(t)}$ separately. ALEVS computes one kernel matrix on the positive class examples, which we denote with $K_+^{(t)}$; similarly, a kernel matrix $K_-^{(t)}$ is defined on the negatively labeled feature matrix $X_-^{(t)}$. These two matrices encode the similarity of examples to other examples that are in the same class.

We would like to find the examples that carry the most information for reconstructing the kernel matrix. ALEVS finds the example that imparts the strongest influence on the kernel matrices $K_+^{(t)}$ and $K_-^{(t)}$ through statistical leverage scores (Definition 1). To be able to compare leverage scores of examples computed on matrices with different sizes $n$ and rank parameters $k$, we use scaled leverage scores; since the leverage scores sum to $k$, this scaling ensures that the average leverage score is 1:

$$\tilde{\ell}_j = \frac{n}{k}\, \ell_j \qquad (2)$$

At iteration $t$, ALEVS computes leverage scores for $K_+^{(t)}$ and $K_-^{(t)}$, and the unlabeled example that corresponds to the highest-leverage-score row in these matrices is selected for query:

$$x^q = \operatorname*{arg\,max}_{x_j \in \mathcal{U}^{(t)}} \; \tilde{\ell}_j \qquad (3)$$

where $\tilde{\ell}_j$ is the scaled leverage score of $x_j$ in the kernel matrix of its predicted class.

These steps are repeated at each round of the active learning iterations. An important parameter in ALEVS is the target rank $k$. Let $\theta$ be the proportion of variance explained by the top $k$ eigenvalues. We select the minimum possible low-rank parameter $k$ such that the sum of the top $k$ eigenvalues is at least a $\theta$ fraction of the total. The overall procedure of ALEVS is summarized in Algorithms 1, 2, and 3.

  Input: $\mathcal{D}$: a training dataset of $n$ instances; $\mathcal{O}$: labeling oracle; $\theta$: eigenvalue threshold; $\Theta$: kernel parameters.
  Output: $h$: final classifier.
  Initialize:
     $\mathcal{L}^{(0)}$       // initial set of labeled instances
     $\mathcal{U}^{(0)} \leftarrow \mathcal{D} \setminus \mathcal{L}^{(0)}$   // the pool of unlabeled instances
     $t \leftarrow 0$
  repeat
     —————— Classification —————————
     $h^{(t)} \leftarrow$ train($\mathcal{L}^{(t)}$)
     $\hat{y} \leftarrow$ predict($h^{(t)}$, $\mathcal{U}^{(t)}$)
     —————— Sampling ———————————
     Based on $\mathcal{L}^{(t)}$ and $\hat{y}$, construct $X_+$ and $X_-$
     $K_+ \leftarrow$ ComputeKernel($X_+$, $\Theta$)
     $K_- \leftarrow$ ComputeKernel($X_-$, $\Theta$)
     $\tilde{\ell}_+ \leftarrow$ ComputeLeverage($K_+$, $\theta$)
     $\tilde{\ell}_- \leftarrow$ ComputeLeverage($K_-$, $\theta$)
     $x^q \leftarrow$ unlabeled example with the largest scaled leverage score in $\tilde{\ell}_+$ and $\tilde{\ell}_-$ (Eq. 3)
     $y^q \leftarrow$ query($\mathcal{O}$, $x^q$)
     —————— Update ———————————–
     $\mathcal{L}^{(t+1)} \leftarrow \mathcal{L}^{(t)} \cup \{(x^q, y^q)\}$
     $\mathcal{U}^{(t+1)} \leftarrow \mathcal{U}^{(t)} \setminus \{x^q\}$
     $t \leftarrow t + 1$
  until stopping criterion
  $h \leftarrow$ train($\mathcal{L}^{(t)}$)
Algorithm 1 ALEVS: Active Learning with Leverage Score Sampling
  Input: $K$: kernel matrix; $\theta$: eigenvalue threshold.
  Output: $\tilde{\ell}$: leverage scores.
  $[U, \Lambda] \leftarrow$ eig($K$)
  $\lambda \leftarrow$ diag($\Lambda$)
  $k \leftarrow$ RankSelector($\lambda$, $\theta$)
  $U_1 \leftarrow$ the $n \times k$ submatrix of $U$ whose columns span the top $k$-eigenspace of $K$
  for $j = 1$ to $n$ do
     $\tilde{\ell}_j \leftarrow \frac{n}{k} \left\| (U_1)_{j,:} \right\|_2^2$
  end for
Algorithm 2 ComputeLeverage
  Input: $\lambda$: vector containing eigenvalues; $\theta$: eigenvalue threshold.
  Output: $k$: target rank.
  $\lambda \leftarrow$ sort($\lambda$, 'descend')
  $k \leftarrow 0$
  while $\sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{n} \lambda_i < \theta$ do
     $k \leftarrow k + 1$
  end while
Algorithm 3 RankSelector
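A compact Python rendering of Algorithms 2 and 3 (our sketch, with the $n/k$ scaling of Equation (2) applied) could look as follows:

  import numpy as np

  def rank_selector(eigvals, theta):
      # Smallest k whose top-k eigenvalues explain at least a theta fraction of variance.
      lam = np.sort(eigvals)[::-1]                 # descending order
      explained = np.cumsum(lam) / np.sum(lam)
      return int(min(np.searchsorted(explained, theta) + 1, len(lam)))

  def compute_leverage(K, theta):
      # Scaled rank-k leverage scores of a kernel matrix K (Algorithm 2 and Eq. (2)).
      n = K.shape[0]
      eigvals, eigvecs = np.linalg.eigh(K)         # ascending eigenvalues
      k = rank_selector(eigvals, theta)
      U1 = eigvecs[:, -k:]                         # top-k eigenspace
      return (n / k) * np.sum(U1 ** 2, axis=1)     # scaling makes the average score 1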

6 Proposed Batch-Mode Active Learning Method: DBALEVS

A high-quality batch should contain examples that are highly influential in the data distribution. On the other hand, even though individual examples may each be highly influential, together they might contain redundant information and form poor batches if queried jointly. DBALEVS aims to select a batch that is diverse not only within itself but also with respect to the already labeled examples. We encode these properties in a set scoring function and use it to select batches at each iteration. The sum of the leverage scores of the examples in the batch assesses the total usefulness of a set of examples. To select a diverse set, we incorporate a term that penalizes the selection of examples that are similar to each other. For evaluating the similarity of examples, we use the kernel function. We define the following set scoring function:

Definition 2 (Set scoring function)

Given a set $S$ that is a subset of the ground set $V$, the scoring function $F$, defined over sets satisfying $|S| \le b_{\max}$, is:

(4)

Here, $b_{\max}$ is a cardinality constraint on the selectable set size of a batch, $\ell_i$ denotes the leverage score of point $x_i$, and $\kappa(x_i, x_j)$ denotes the kernel function evaluated on points $x_i$ and $x_j$, with the assumption that $0 \le \kappa(x_i, x_j) \le 1$. $\beta \ge 0$ is a trade-off parameter.
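One concrete instantiation consistent with this description and with the proof structure in Appendices A.2 and A.3 is the following sketch (the additive constant $C$, which depends on the batch size and keeps the marginal gains non-negative, is our assumption rather than the paper's exact choice):

$$F(S) \;=\; \sum_{x_i \in S} \ell_i \;+\; \beta \Big( C\,|S| \;-\; \sum_{x_i \in S} \sum_{\substack{x_j \in S \\ j \neq i}} \kappa(x_i, x_j) \Big).$$

The first term rewards sets of high-leverage examples; the second term grows when the selected examples are mutually dissimilar (small $\kappa$) and shrinks when they are redundant.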

The first part of this function evaluates the individual representativeness of the examples in the set, while the second part penalizes the selection of highly similar instances. The influence of the diversity term can be adjusted by the trade-off parameter $\beta$. We would like to select a batch that maximizes the set function $F$:

$$\mathcal{B}^{(t)} = \operatorname*{arg\,max}_{S \subseteq \mathcal{U}^{(t)},\; |S| = b} F(S) \qquad (5)$$

This is a subset selection problem, and except for small pools and small values of $b$, exhaustive search for the optimal batch is intractable. To tackle this computational challenge, we exploit the fact that the suggested set function is submodular. Although submodular maximization is NP-hard in general Krause05near-optimalnonmyopic , Nemhauser et al. nemhauser1978analysis showed that the greedy algorithm for selecting a subset of size $b$ is guaranteed to return a solution within a constant factor of the optimal value (Theorem 6.1).

Theorem 6.1

For a monotone, non-negative, submodular function $F$ and a cardinality constraint $b$, the greedy approximation yields:

$$F(S_g) \;\ge\; \left(1 - \frac{1}{e}\right) \max_{S: |S| \le b} F(S) \qquad (6)$$

where $S_g$ denotes the greedily selected set of cardinality $b$ nemhauser1978analysis .

The greedy algorithm adds, at each step, the element that gives the maximum increase in the objective. To be able to use this greedy algorithm with the aforementioned approximation bound, we need to show that $F$ is a submodular, monotonically non-decreasing, and non-negative function. Below we first define submodularity and then prove that $F$ is submodular.

Definition 3 (Submodularity)

Let $S \subseteq T \subseteq V$, where $V$ denotes the ground set, and let $e \in V \setminus T$ be an element. A set function $F$ is called submodular if the following holds:

$$F(S \cup \{e\}) - F(S) \;\ge\; F(T \cup \{e\}) - F(T) \qquad (7)$$
Proposition 1 (Submodularity)

$F$ is submodular.

Proof

For $F$ to be submodular, the following should hold for all $S \subseteq T \subseteq V$ and $e \in V \setminus T$:

$$F(S \cup \{e\}) - F(S) \;\ge\; F(T \cup \{e\}) - F(T) \qquad (8)$$

Expanding the left-hand side using Definition 2 and rearranging the terms, the marginal gain of adding $e$ to $S$ reduces to the leverage score of $e$ plus a constant term, minus a penalty proportional to $\sum_{x_j \in S} \kappa(e, x_j)$. The same simplification applied to the right-hand side of the submodularity definition yields the analogous expression for the set $T$. Since $S \subseteq T$, $\kappa(\cdot, \cdot) \ge 0$, and $\beta \ge 0$, we have $\sum_{x_j \in S} \kappa(e, x_j) \le \sum_{x_j \in T} \kappa(e, x_j)$, so the marginal gain with respect to $S$ is at least the marginal gain with respect to $T$. Hence, $F$ is submodular. ∎

To be able to apply the greedy algorithm with an approximation guarantee, we also need to show that $F$ is monotonically non-decreasing and non-negative under reasonable conditions. The proofs that $F$ satisfies these conditions when the selected set size is at most $b_{\max}$ are provided in Appendices A.2 and A.3.

  Input: $\mathcal{D}$: a training dataset of $n$ instances; $\mathcal{O}$: labeling oracle; $\theta$: eigenvalue threshold; $\Theta$: kernel parameters; $F$: set scoring function in Definition 2; $b$: batch size; $\beta$: diversity trade-off parameter for $F$.
  Output: $h$: final classifier.
  Initialize:
     $\mathcal{L}^{(0)}$       // initial set of labeled instances
     $\mathcal{U}^{(0)} \leftarrow \mathcal{D} \setminus \mathcal{L}^{(0)}$   // the pool of unlabeled instances
     $t \leftarrow 0$
  repeat
     —————— Classification —————————
     $h^{(t)} \leftarrow$ train($\mathcal{L}^{(t)}$)
     $\hat{y} \leftarrow$ predict($h^{(t)}$, $\mathcal{U}^{(t)}$)
     —————— Sampling ———————————
     Based on $\mathcal{L}^{(t)}$ and $\hat{y}$, construct $X_+$ and $X_-$
     Based on $\mathcal{L}^{(t)}$, construct $\mathcal{L}_+$ and $\mathcal{L}_-$ // labeled class matrices
     $K_+ \leftarrow$ ComputeKernel($X_+$, $\Theta$)
     $K_- \leftarrow$ ComputeKernel($X_-$, $\Theta$)
     $\ell_+ \leftarrow$ ComputeLeverage($K_+$, $\theta$)
     $\ell_- \leftarrow$ ComputeLeverage($K_-$, $\theta$)
     $\mathcal{B}_+ \leftarrow$ B-GreedyAlgorithm($F$, $b/2$, $\mathcal{L}_+$, $X_+$, $\ell_+$, $K_+$, $\beta$)
     $\mathcal{B}_- \leftarrow$ B-GreedyAlgorithm($F$, $b/2$, $\mathcal{L}_-$, $X_-$, $\ell_-$, $K_-$, $\beta$)
     $\mathcal{B}^{(t)} \leftarrow \mathcal{B}_+ \cup \mathcal{B}_-$
     $Y^{(t)} \leftarrow$ query($\mathcal{O}$, $\mathcal{B}^{(t)}$)
     —————— Update ———————————–
     $\mathcal{L}^{(t+1)} \leftarrow \mathcal{L}^{(t)} \cup \{(\mathcal{B}^{(t)}, Y^{(t)})\}$
     $\mathcal{U}^{(t+1)} \leftarrow \mathcal{U}^{(t)} \setminus \mathcal{B}^{(t)}$
     $t \leftarrow t + 1$
  until stopping criterion
  $h \leftarrow$ train($\mathcal{L}^{(t)}$)
Algorithm 4 DBALEVS: Diverse Batch Mode Active Learning with Leverage Score Sampling
  Input: $F$: set scoring function in Definition 2; $b$: batch size; $S_0$: initial set; $V$: ground set; $\ell$: leverage scores; $K$: kernel matrix; $\beta$: diversity parameter for $F$.
  Output: $\mathcal{B}$: selected set.
  $S \leftarrow S_0$
  $\mathcal{B} \leftarrow \emptyset$
  while $|\mathcal{B}| < b$ do
     $e \leftarrow \operatorname*{arg\,max}_{e \in V \setminus S} F(S \cup \{e\})$
     $S \leftarrow S \cup \{e\}$;  $\mathcal{B} \leftarrow \mathcal{B} \cup \{e\}$
  end while
  return $\mathcal{B}$
Algorithm 5 B-GreedyAlgorithm
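For concreteness, a minimal Python sketch of the greedy selection in Algorithm 5 is given below, using the marginal gain implied by the set function form sketched after Definition 2 (so the gain expression, including the constant C, is an assumption tied to that reconstruction):

  import numpy as np

  def b_greedy(scores, K, initial, batch_size, beta, C):
      # Greedily grow a batch by the largest marginal gain of F.
      # scores: unscaled leverage scores; K: kernel matrix with entries in [0, 1];
      # initial: indices of already labeled examples, so diversity is also
      # enforced against them; C: the assumed batch-size-dependent constant.
      selected = set(initial)
      batch = []
      candidates = set(range(len(scores))) - selected
      while len(batch) < batch_size and candidates:
          # Marginal gain of adding e: its leverage score plus the diversity term.
          gains = {e: scores[e] + beta * (C - 2.0 * sum(K[e, j] for j in selected))
                   for e in candidates}
          best = max(gains, key=gains.get)
          selected.add(best)
          batch.append(best)
          candidates.remove(best)
      return batch

Because the gain of a fixed candidate only decreases as the selected set grows, a lazy-greedy variant with a priority queue returns the same batch faster; the SFO toolbox used in Section 8 provides optimized implementations of such greedy maximization.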

The procedure for querying a batch is summarized in Algorithm 4. First, the pool is divided based on class labels. As in ALEVS, at iteration $t$ the classifier $h^{(t)}$ is trained exclusively on the labeled training examples with a supervised method, and the class memberships of the unlabeled examples are predicted with $h^{(t)}$. The examples whose true labels are known to be positive, along with the instances for which the true labels are not known but which are predicted to be in the positive class by $h^{(t)}$, form the positive class group $X_+$; $X_-$ is similarly constructed from the negatively predicted and labeled examples. Having divided the pool based on class memberships, the kernel matrices for each class are computed: $K_+$ is formed using $X_+$, and $K_-$ is formed using $X_-$. Then the leverage scores of the examples are computed from the kernel matrix of each class based on Definition 1. Note that we use the leverage scores without the $n/k$ scaling; this is necessary to ensure the submodularity of $F$. DBALEVS selects half of the batch from the positive examples and half from the negative examples. For this selection, the method uses the set scoring function (Definition 2). For greedy maximization, the method uses the available labeled data for the positive ($\mathcal{L}_+$) and negative ($\mathcal{L}_-$) classes as the initial set. This allows selecting a set that is also diverse with respect to the already labeled examples. This modified greedy maximization is given in Algorithm 5.

7 Results for ALEVS

We compare ALEVS with the following five approaches:

  • Random sampling: Selects an unlabeled example uniformly at random.

  • Uncertainty sampling: Queries the example that the current classifier is most uncertain about Lewis94 , that is, the one with maximal $1 - P(\hat{y} \mid \mathbf{x})$; here $\hat{y}$ is the predicted class label for that example. The posterior probability is estimated with Platt's algorithm platt1999probabilistic based on the SVM's output.

  • Leverage sampling on all data (LevOnAll): Computes the leverage scores on the whole pool of examples once at the beginning, without paying attention to class membership, then at each iteration queries the unlabeled example with the largest leverage score.

  • Transductive experimental design (ExpDesign): Selects observations so as to maximize the quality of the parameter estimates in a linear regression model yu:icml06expdesign . The model is also applicable to classification problems.

  • QUIRE: Selects an instance that is both informative and representative through optimizing a function that encodes these properties quire .

Dataset Size  Dim.  +/-
digit1 1500  241  1.00
g241c 1500  241  1.00
UvsV 1577  16  1.10
USPS 1500  241  0.25
twonorm 2000  20  1.00
ringnorm 2000  20  1.00
spambase 2000  57  0.66
3vs5 2000  784  1.20
Table 1: Datasets for ALEVS experiments; the number of samples, the number of features, and the positive-to-negative class ratio (+/-) are listed.

We compare the methods on eight different datasets. Table 1 summarizes the characteristics of the datasets (for more information about the data, see Appendix A.4). Each dataset is divided into training and held-out test sets. We start with four randomly selected labeled examples, two from each class. At each iteration, the classifier is updated for all the methods with the training data, and the accuracy values are calculated on the same held-out test data. In all experiments, an SVM classifier with an RBF kernel is trained. The experiments are repeated over random splits of the training and test data and random initial selections of labeled examples.

Fig. 2 shows the average classification accuracy values of ALEVS and the other approaches at each iteration of active sampling. Tables 2 and 3 summarize the win, tie, and loss counts of ALEVS versus each of the competing methods based on a one-sided paired-sample t-test at the 0.05 significance level.

Dataset vs. QUIRE vs. LevOnAll vs. Random vs. Uncertainty vs. ExpDesign
digit1 8/19/23 29/21/0 27/23/0 46/4/0 33/17/0
g241c 0/34/16 32/18/0 30/20/0 34/16/0 30/19/1
USPS 0/43/7 32/16/2 33/12/5 0/50/0 32/18/0
ringnorm 47/3/0 48/2/0 49/1/0 47/3/0 49/1/0
spambase 8/21/21 16/27/7 10/29/11 0/46/4 32/7/11
MNIST-3vs5 3/41/6 42/8/0 44/6/0 48/2/0 43/7/0
UvsV 0/2/48 48/2/0 25/25/0 8/12/30 49/1/0
twonorm 49/1/0 50/0/0 50/0/0 50/0/0 49/1/0
Table 2: Win/Tie/Loss counts of ALEVS against the competitor algorithms for the first 50 iterations (sequential mode).

Dataset vs. QUIRE vs. LevOnAll vs. Random vs. Uncertainty vs. ExpDesign
digit1 0/0/50 0/16/34 0/16/34 0/15/35 0/50/0
g241c 9/41/0 50/0/0 50/0/0 50/0/0 50/0/0
USPS 0/12/38 50/0/0 50/0/0 0/32/18 50/0/0
ringnorm 50/0/0 22/28/0 50/0/0 13/37/0 50/0/0
spambase 0/6/44 2/48/0 24/26/0 0/9/41 50/0/0
MNIST-3vs5 0/20/30 39/11/0 50/0/0 6/17/27 50/0/0
UvsV 0/0/50 2/48/0 0/50/0 0/0/50 50/0/0
twonorm 50/0/0 50/0/0 50/0/0 50/0/0 50/0/0
Table 3: Win/Tie/Loss counts of ALEVS against the competitor algorithms for iterations between 50 and 100 (sequential mode).
Figure 2: Comparison of ALEVS with other methods on classification accuracy (panels a–h: digit1, 3vs5, g241c, UvsV, USPS, twonorm, ringnorm, spambase). The dashed line indicates the accuracy obtained when the model is trained with all of the training data.

We observe that ALEVS outperforms random sampling and uncertainty sampling in almost all datasets (Fig. 2). Exceptions are the USPS and spambase datasets, for which ALEVS performs as well as uncertainty sampling but not better. ALEVS' performance is consistently better than random sampling in the first 50 iterations of active sampling (Table 2). For iterations between 50 and 100, the two methods tie on the spambase and UvsV datasets, and random sampling performs better on digit1 (Table 3). On the UvsV dataset, uncertainty sampling works well against all methods. When compared to transductive experimental design, ALEVS performs better on all datasets except digit1 in iterations 50-100, where the two methods tie (Tables 2 and 3).

When comparing the performance of ALEVS against QUIRE, the datasets fall into three groups. The first group comprises ringnorm and twonorm, for which ALEVS decisively outperforms QUIRE. In the second group of datasets, ALEVS either outperforms QUIRE or ties with it over a subset of the iterations. On digit1, ALEVS outperforms QUIRE in the first 50 iterations. For the g241c dataset, ALEVS either ties or performs worse than QUIRE in early iterations, but the performance of QUIRE is not consistent in these early iterations (Fig. 2c). On this dataset, ALEVS keeps pace with QUIRE and outperforms it at later iterations. On the 3vs5 dataset, QUIRE and ALEVS tie in most of the first 50 iterations. There are also datasets where ALEVS lags behind QUIRE; these include UvsV, USPS, and spambase. For UvsV, few labels are sufficient to obtain good accuracy, and the performances of the different methods do not differ dramatically (Fig. 2d). For the spambase dataset, ALEVS shows promising performance around iterations 30 and 40 (Fig. 2h). One observes that ALEVS generally manages to find effective examples for querying in the early iterations. Therefore, a strategy that combines ALEVS with a method that performs poorly in early iterations but does better in later iterations – such as uncertainty sampling – could lead to a strong active learner. Such a hybrid learner will be explored in future work.

To understand whether computing class-specific kernel matrices has any merit, we compare ALEVS against LevOnAll and observe that ALEVS consistently outperforms it. Thus, calculating the leverage scores within each class is better at finding the influential data points than calculating them on the whole pool.

Figure 3: Comparison of ALEVS with other methods on imbalanced datasets with (a) 5:1 and (b) 10:1 class ratios. The dashed horizontal line indicates the F1 achieved when trained on the whole training data.
Class ratio ALEVS QUIRE UNCERTAINTY RANDOM LEV-ON-ALL EXP. DESIGN
  5:1 1.28 1.60 2.41   5.32   4.80 inf*
10:1 1.31 1.64 2.30 10.55 10.19 inf*
Table 4: +/- class ratios of the queried sets by each method.

To understand how ALEVS reacts to class imbalance, we sample the 3vs5 dataset with two different class ratios, 5:1 and 10:1, and repeat the experiments on this dataset. The experimental setup is identical to that of the previous section, the only difference being the adoption of the F1 score for measuring performance. As depicted in Fig. 3, ALEVS successfully copes with the class imbalance and outperforms the other methods. The transductive experimental design method is unable to handle the class imbalance and returns F1 scores of 0; therefore, we exclude its results from the figures. Similarly, at the 10:1 class ratio, LevOnAll and random sampling return very poor F1 scores and are excluded from the graph. To understand why different methods handle the class imbalance differently, we analyze the class label distribution of the queried sets. Table 4 illustrates that the methods that are robust to class imbalance sample equally from both classes, whereas those that fail sample close to the original class distribution. Since the classifier is provided with a balanced dataset, the overall training process is not hurt by the unequal class distribution. As the class label balance deteriorates further, adopting cost-sensitive training methods could address the problem, but this adds an extra layer of complexity.

We also compare the methods in terms of their running times. The querying step of ALEVS involves the eigenvalue decomposition of the kernel matrices; however, in practice, this does not cause a computational bottleneck. We summarize the average CPU times for selecting one example from the unlabeled data pool in a single iteration in Appendix Fig. 6. As the figures show, ALEVS is as fast as uncertainty sampling.

8 Results for DBALEVS

We compare DBALEVS with the following approaches:

  • Random sampling: Selects a batch of examples uniformly at random from the unlabeled pool.

  • Uncertainty sampling: Selects the batch of examples with maximal uncertainty.

  • Top leverage sampling (Top-Lev): Computes the leverage scores on the whole pool once at the beginning, without paying attention to class membership, and at each iteration selects the top examples based on their leverage scores.

  • Near-optimal batch mode active learning (NearOpt): Selects a batch of instances using adaptive submodular optimization chen2013near .

To evaluate the performance of DBALEVS, we run experiments on six different datasets (Table 5). The details of these datasets can be found in Appendix A.4. Each dataset is divided into training and held-out test sets. We start with four randomly selected labeled examples, two from each class. The batch size $b$ is set to 10, and the diversity trade-off parameter $\beta$ is set to 0.5 for each dataset except ringnorm, for which it is set to 0.1. At each iteration, the classifier is updated for all the methods with the training data, and the accuracy values are calculated on the same held-out test data. In all experiments, an SVM classifier with an RBF kernel is used. For each dataset, the experiment is repeated over random splits of the training and test data and random initial selections of labeled examples. For the set function defined in Definition 2 to be submodular, the kernel function should satisfy $0 \le \kappa(\mathbf{x}_i, \mathbf{x}_j) \le 1$. We use the RBF kernel in calculating the set scoring function, which lies in this range. However, the method is compatible with other kernel functions as long as they are normalized to the range $[0, 1]$. For optimizing the set scoring function in Definition 2, we use the submodular function optimization toolbox krause2010sfo .
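The RBF kernel meets the $[0, 1]$ requirement by construction, since $\exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2) \in (0, 1]$; a minimal sketch (ours, not the paper's code) of computing such a kernel matrix:

  import numpy as np

  def rbf_kernel(X, gamma):
      # RBF kernel matrix; all entries lie in (0, 1], as Definition 2 requires.
      sq_norms = np.sum(X ** 2, axis=1)
      sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
      return np.exp(-gamma * np.maximum(sq_dists, 0.0))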

Dataset Size  +/-
autos 1986 x 11009  1.00
hardware 1945 x 9877  1.00
sport 1993 x 11148  1.00
ringnorm 7400 x 20  1.00
3vs5 13454 x 784  1.20
4vs9 13782 x 784  1.20
Table 5: Datasets for DBALEVS experiments; the number of samples × features and the positive-to-negative class ratio (+/-) are listed.

Fig. 4 shows the average classification accuracy values of DBALEVS and the other approaches at each iteration of active sampling. Table 6 summarizes the win, tie, and loss counts of DBALEVS versus each of the competing methods based on a one-sided paired-sample t-test at the 0.05 significance level.

Dataset vs.NearOpt vs.Top-Lev vs.Random vs.Uncertainty
ringnorm 20/0/0 0/1/19 20/0/0 20/0/0
autos 18/11/1 49/8/3 49/11/0 51/6/3
hardware 25/5/0 57/3/0 54/6/0 57/3/0
sport 25/5/0 51/9/0 46/14/0 54/6/0
4vs9 18/2/0 20/0/0 18/2/0 19/1/0
3vs5 17/3/0 18/2/0 18/2/0 19/1/0
Table 6: Win/Tie/Loss counts for DBALEVS against the competitor algorithm.
Figure 4: Comparison of DBALEVS with other methods on classification accuracy (panels a–f: autos, hardware, sport, ringnorm, 3vs5, 4vs9).

DBALEVS outperforms NearOpt in almost all cases. For all of the datasets, DBALEVS wins over NearOpt except for a few tie cases (see Table 6), and it virtually never loses against NearOpt. We also observe that the performances of NearOpt and random sampling are comparable. The results indicate that DBALEVS outperforms the random sampling and uncertainty sampling approaches on all of the datasets. One interesting observation is that uncertainty sampling performs poorly in the batch mode setting. The main cause of this poor performance is the strong dependency of uncertainty sampling on the initial hypothesis: if the initial hypothesis formed from the initially labeled examples is not reasonable, the query selection is affected, because non-informative instances are selected in the subsequent iterations. Another drawback of this approach is that it fails to model the dependencies among the selected instances.

Another result is that random sampling performs reasonably well and outperforms uncertainty sampling on all of the datasets. This surprising performance has also been noted by others gu2014batch . It could be attributed to the fact that the datasets available for these experiments are not truly random; therefore, if the batch size is large enough, randomly drawn examples provide valuable information for learning.

When compared to the Top-Lev baseline, we observe that DBALEVS consistently outperforms it, the only exception being the ringnorm dataset. One possible explanation pertains to the structure of this dataset: since ringnorm is artificially created from multivariate Gaussians, by querying points with maximal leverage scores the learner receives labels from dense regions in different clusters without the need for diversification.

9 Conclusions and Future Work

In this study, we present a new query measure for active learning that is based on statistical leverage scores. We propose two novel algorithms based on this querying strategy: a sequential-mode algorithm, ALEVS, and a batch mode algorithm, DBALEVS. Our experimentation on 8 different datasets shows that statistical leverage scores offer an effective strategy for finding influential examples. ALEVS achieves superior performance compared to common baselines such as uncertainty sampling and random selection, as well as transductive experimental design. More importantly, ALEVS performs better than or as well as a state-of-the-art approach, QUIRE quire , which is documented to outperform other methods, in terms of classification accuracy. Moreover, ALEVS runs much faster than QUIRE. We also show empirically that ALEVS is robust to class imbalance.

The second proposed algorithm, DBALEVS, is designed to address the batch mode active learning setting. We formulate a set scoring function that rewards examples with high leverage scores and penalizes the inclusion of similar examples in the set. We prove that this function is submodular, monotone, and non-negative, which enables us to use a greedy algorithm for the submodular maximization problem that produces a solution within a constant factor of the optimal solution. Our experiments on 6 different datasets show that incorporating leverage scores and kernel matrix entries to find an influential and diverse batch of points is an effective strategy. DBALEVS performs well against the common baselines of random sampling and uncertainty sampling, and against NearOpt chen2013near , which employs the recently introduced adaptive submodularity framework. These results show that the statistical leverage score is an effective measure for deciding which examples to query in a pool of unlabeled examples.

The work presented here can be extended in different directions. We observe that ALEVS and DBALEVS are especially effective in early iterations, a stage where many existing algorithms perform inadequately. Therefore, a future direction is to develop a hybrid strategy in which the proposed methods work in cooperation with other methods; for example, the framework proposed in this study does not incorporate any knowledge about the uncertainty of the class labels. Another possible direction would be to study the adaptability of these methods to stream-based selective sampling active learning. Finally, to focus on understanding the effectiveness of the methods, we did not consider more complex active learning scenarios such as non-uniform labeling costs and noisy or reluctant experts; the general framework presented here can be further investigated to incorporate these alternative settings.

Acknowledgements.
O.T. acknowledges support from Bilim Akademisi - The Science Academy, Turkey under the BAGEP program.

References

  • [1] L. Breiman. Bias, variance, and arcing classifiers. Technical report, University of California, Berkeley, 1996.
  • [2] K. Brinker. Incorporating diversity in active learning with support vector machines. In ICML, volume 3, pages 59–66, 2003.
  • [3] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.
  • [4] R. Chattopadhyay, Z. Wang, W. Fan, I. Davidson, S. Panchanathan, and J. Ye. Batch mode active sampling based on marginal probability distribution matching. ACM Transactions on Knowledge Discovery from Data, 7(3):13, 2013.
  • [5] Y. Chen and A. Krause. Near-optimal batch mode active learning and adaptive submodular optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 160–168, 2013.
  • [6] S. Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781, 2011.
  • [7] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In Proceedings of the 25th International Conference on Machine Learning, pages 208–215, 2008.
  • [8] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184–206, 2006.
  • [9] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, Sept. 1997.
  • [10] A. Gittens and M. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In Proceedings of the 30th International Conference on Machine Learning, 2013.
  • [11] Q. Gu, T. Zhang, and J. Han. Batch-mode active learning via error bound minimization. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pages 300–309, 2014.
  • [12] Y. Guo. Active instance sampling via matrix partition. In Advances in Neural Information Processing Systems 23, pages 802–810. Curran Associates Inc., 2010.
  • [13] Y. Guo and D. Schuurmans. Discriminative batch mode active learning. In Advances in Neural Information Processing Systems 20, pages 593–600. Curran Associates Inc., 2008.
  • [14] S. C. Hoi, R. Jin, and M. R. Lyu. Batch mode active learning with applications to text categorization and image retrieval. IEEE Transactions on Knowledge and Data Engineering, 21(9):1233–1248, 2009.
  • [15] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd international conference on Machine learning, pages 417–424. ACM, 2006.
  • [16] S.-J. Huang, R. Jin, and Z.-H. Zhou. Active learning by querying informative and representative examples. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(10):1936–1949, 2014.
  • [17] Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang. Representative sampling for text classification using support vector machines. In European Conference on Information Retrieval, pages 393–407. Springer, 2003.
  • [18] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Active learning with gaussian processes for object categorization. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
  • [19] A. Krause. Sfo: A toolbox for submodular function optimization. Journal of Machine Learning Research, 11(Mar):1141–1144, 2010.
  • [20] A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, page 5, 2005.
  • [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [22] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–12. Springer-Verlag, 1994.
  • [23] M. Lichman. UCI machine learning repository, 2013.
  • [24] D. J. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992.
  • [25] M. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
  • [26] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978.
  • [27] H. T. Nguyen and A. Smeulders. Active learning using pre-clustering. In Proceedings of the 21st International Conference on Machine Learning, page 79, 2004.
  • [28] D. Papailiopoulos, A. Kyrillidis, and C. Boutsidis. Provable deterministic leverage score sampling. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 997–1006. ACM, 2014.
  • [29] J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
  • [30] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning, pages 441–448, 2001.
  • [31] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT press, 2001.
  • [32] B. Settles. Active learning literature survey. Technical Report 1648, University of Wisconsin–Madison, 2009.
  • [33] B. Settles and M. Craven. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1070–1079. Association for Computational Linguistics, 2008.
  • [34] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, pages 287–294. ACM, 1992.
  • [35] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66, November 2001.
  • [36] Y. Wang and A. Singh. Column subset selection with missing data via active sampling. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 1033–1041, 2015.
  • [37] Y. Wang and A. Singh. An empirical comparison of sampling techniques for matrix column subset selection. In Proceedings of 53rd Annual Allerton Conference on Communication, Control, and Computing, pages 1069–1074. IEEE, 2015.
  • [38] Z. Wang and J. Ye. Querying discriminative and representative samples for batch mode active learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 9(3):17, 2015.
  • [39] Z. Xu, R. Akella, and Y. Zhang. Incorporating diversity and density in active learning for relevance feedback. In European Conference on Information Retrieval, pages 246–257. Springer, 2007.
  • [40] T. Yang, L. Zhang, R. Jin, and S. Zhu. An explicit sampling dependent spectral error bound for column subset selection. In Proceedings of the 32nd International Conference on Machine Learning, pages 135–143, 2015.
  • [41] Y. Yang, Z. Ma, F. Nie, X. Chang, and A. G. Hauptmann. Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision, 113(2):113–127, 2015.
  • [42] K. Yu, J. Bi, and V. Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), 2006.
  • [43] J. Zhu, H. Wang, T. Yao, and B. K. Tsou. Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 1137–1144, 2008.

Appendix

A.1 Notation table

Symbol Explanation
$\mathcal{X}$ input space
$\mathcal{Y}$ output space
$\mathcal{D}$ dataset
$\mathbf{x}_i$ feature vector of a data point
$x^q$ queried example
$y$ class label
$\mathcal{O}$ labeling oracle
$h$ classifier
$\theta$ threshold parameter for rank parameter selection
$b$ batch size
$\mathcal{B}$ queried batch
$\beta$ diversity trade-off parameter
$\ell_j$ statistical leverage score
$\lambda_i$ eigenvalue
$k$ rank parameter for calculating truncated leverage scores
$X$ input feature matrix
$\phi$ feature mapping
$\kappa(\mathbf{x}_i, \mathbf{x}_j)$ kernel function evaluated on examples $\mathbf{x}_i$ and $\mathbf{x}_j$
$K$ kernel matrix
$K_{j,:}$ $j$-th row of $K$
$K_{ij}$ element in the $i$-th row and $j$-th column of matrix $K$
$\Theta$ kernel parameter configuration
$\gamma$ scale parameter of the RBF kernel
$p$ degree of the polynomial kernel
$c$ coefficient of the polynomial kernel
Table 7: Notation used throughout the article.

A.2 Proving F is monotone

Proposition 2 (Monotonicity)

$F$ is a monotonically non-decreasing set function when the input set size is at most $b_{\max}$ and the kernel values satisfy $0 \le \kappa(\mathbf{x}_i, \mathbf{x}_j) \le 1$.

Proof

Consider two arbitrary sets $S$ and $T$, where $S \subseteq T \subseteq V$. We need to show that the following inequality holds:

$$F(T) \ge F(S) \qquad (9)$$

Expanding both sides using Definition 2, the kernel penalty of $T$ is summed over $|T|(|T|-1)$ terms and that of $S$ over $|S|(|S|-1)$ terms; using the fact that $\kappa \le 1$, these sums can be at most $|T|(|T|-1)$ and $|S|(|S|-1)$, respectively. Additionally, since the leverage scores are non-negative, the minimum value that the leverage-score contribution of the elements in $T \setminus S$ can take is 0. The bound $|T| \le b_{\max}$ then guarantees that the constant term accompanying each added element outweighs the additional kernel penalty terms, so $F(T) - F(S) \ge 0$. Hence, $F$ is monotonically non-decreasing for sets with sizes smaller than or equal to $b_{\max}$. ∎

A.3 Proving F is non-negative

Proposition 3 (Non-negativity)

$F$ is a non-negative set function for sets with cardinality smaller than or equal to $b_{\max}$.

Proof

For $F$ to be non-negative, $F(S) \ge 0$ should hold for all sets $S$ with $|S| \le b_{\max}$. Expanding $F(S)$ using Definition 2 and moving from equality to inequality, we use the facts that the minimum value the leverage-score summation can take is zero, the maximum value of $\kappa$ is 1, and $\beta \ge 0$. Thus, the summation of the leverage scores is at least 0, and the kernel penalty is summed over at most $|S|(|S|-1)$ terms, which is the number of elements in the kernel matrix excluding the diagonal. Since $|S| \le b_{\max}$, the constant term in $F$ dominates the kernel penalty, and thereby $F(S) \ge 0$. This completes the proof that $F$ is non-negative when the chosen set size is bounded by $b_{\max}$. ∎

A.4 Datasets

The following datasets are used in the ALEVS experiments. The digit1, g241c, and USPS datasets are from http://olivier.chapelle.cc/ssl-book/benchmarks.html [3]. The spambase and letter datasets are obtained from [23]. The letter dataset is a multi-class dataset; we select a letter pair that is difficult to distinguish: UvsV. Similarly, we sample the 3 and 5 digits from the MNIST dataset (http://yann.lecun.com/exdb/mnist/) as 3vs5, since they are one of the most confused pairs in the MNIST dataset [21]. Finally, twonorm and ringnorm are culled from http://www.cs.toronto.edu/~delve/data/twonorm/desc.html and http://www.cs.toronto.edu/~delve/data/ringnorm/desc.html, which are implementations of [1]. We use a random subsample of 2000 examples for ringnorm, twonorm, spambase, and 3vs5 because the running time of QUIRE is otherwise prohibitively long. The description of these datasets is given in Table 1.

The following datasets are used in the DBALEVS experiments. We use the autos, hardware, and sport tasks from the 20-newsgroups dataset (http://qwone.com/~jason/20Newsgroups/). These subtopics are picked because they are harder to differentiate: autos involves classification of the rec.autos and rec.motorcycles topics; hardware involves classifying the comp.sys.ibm.pc.hardware and comp.sys.mac.hardware topics; and lastly, the sport dataset involves classification of the rec.sport.baseball and rec.sport.hockey topics. We use a bag-of-words representation for the features in these datasets. Similarly, the 3-5 and 4-9 digit pairs from the MNIST dataset [21] are sampled to create the 3vs5 and 4vs9 classification tasks. Finally, ringnorm is culled from http://www.cs.toronto.edu/~delve/data/ringnorm/desc.html, which is an implementation of [1]. The description of these datasets and the parameters chosen for each are listed in Table 5.

A.5 Effect of the target rank $k$

Figure 5: Effect of the target rank selected by threshold $\theta$ on test set accuracy for ALEVS (sequential mode), shown for digit1, twonorm, ringnorm, and 3vs5.

One parameter that has a large impact on the performance of ALEVS is the target low-rank parameter $k$. In this work, we adaptively select the value of $k$ for the negative and positive kernel matrices at each iteration by setting a threshold $\theta$ on the variance explained by the top-$k$ eigenspace, as described in the RankSelector algorithm. We further analyze the effect of $k$ by varying this threshold, experimenting on four datasets with three different values. The accuracies shown in Fig. 5 are averages computed over 10 random experiments. Selecting the full-rank option for computing leverage scores does not necessarily provide the best performance; the low-rank representation acts as a regularizer and focuses on the core dimensions that matter in the datasets. For the datasets digit1, twonorm, and ringnorm, the same threshold value works best, whereas for 3vs5 the threshold value 0.75 is a better choice. This difference is expected, as the eigenvalue spectra of the matrices are different.

A.6 ALEVS runtime performance

Figure 6: Comparison of ALEVS with other methods on running times (sequential mode; panels a–h: digit1, 3vs5, g241c, UvsV, USPS, twonorm, ringnorm, spambase).

We performed the experiments in Matlab on a computer with a 2.6 GHz CPU (24 cores) and 64 GB of memory running the Ubuntu 14.04 LTS operating system. ALEVS is almost as fast as uncertainty sampling.

A.7 DBALEVS runtime performance

Figure 7: Comparison of DBALEVS with other methods on runtimes (batch mode; panels a–f: autos, hardware, sport, ringnorm, 3vs5, 4vs9).

The querying step of DBALEVS involves the eigenvalue decomposition of the kernel matrices and the greedy maximization procedure. Fig. 7 displays the average CPU times for selecting a batch from the unlabeled data pool in a single iteration. DBALEVS has runtimes comparable to those of the NearOpt method. Experiments were conducted in Matlab on a computer with a 2.6 GHz CPU (24 cores) and 64 GB of memory running the Ubuntu 14.04 LTS operating system.