Target-Independent Active Learning via Distribution-Splitting

09/28/2018
by   Xiaofeng Cao, et al.
University of Technology Sydney

To reduce the label complexity of Agnostic Active Learning (the A^2 algorithm), volume-splitting splits the hypothesis edges to reduce the Vapnik-Chervonenkis (VC) dimension of the version space. However, the effectiveness of volume-splitting critically depends on the initial hypothesis; this problem is known as the target-dependent label complexity gap. This paper attempts to minimize this gap by introducing a novel notion of number density, which provides a more natural and direct way to describe the hypothesis distribution than volume. By discovering the connections between the hypothesis and the input distribution, we map the volume of the version space into the number density and propose a target-independent distribution-splitting strategy with the following advantages: 1) it provides the same theoretical guarantees on reducing label complexity and error rate as volume-splitting; 2) it breaks the curse of the initial hypothesis; 3) it provides model guidance for a target-independent AL algorithm in real AL tasks. With these guarantees, we then split the input distribution into a number of near-optimal spheres and develop an application algorithm called Distribution-based A^2 (DA^2). Experiments further verify the effectiveness of the halving and querying abilities of DA^2.


1 Introduction

Active learning (AL), which leverages abundant unlabeled data to improve model performance, has been widely adopted in various machine learning tasks, such as text processing (Li et al., 2012), image annotation (Li and Guo, 2013), multi-label classification (Huang et al., 2017) (Huang et al., 2015), and multi-task learning (Harpale and Yang, 2010) (Fang et al., 2017). Through AL, researchers can strategically query “highly informative” data (McCallumzy and Nigamy, 1998) to reduce the error rate of the current learning model in different classification tasks. However, a natural question arises: if we increase the size of the active query set, will the error rate keep decreasing? Moreover, can we find an ε-optimal hypothesis (where ε is the prediction error rate) that is nearly as good as we desire?

This question has been considered in the Agnostic Active Learning (A^2) community, which studies the relationships among hypotheses in the VC (Vapnik-Chervonenkis) class (Vapnik and Chervonenkis, 2015) of the version space by splitting the hypothesis edges. With the goal of minimizing the number of queries on unlabeled data, A^2 algorithms (Dasgupta et al., 2008) (Balcan et al., 2006) try to improve the learning models with labeled data sampled from various distributions under diverse noise conditions. In particular, halfspace learning (Gonen et al., 2013) is an agnostic active learning problem over a unit sphere that explores theoretical guarantees on the error rate and the label complexity (the number of labels requested before achieving a desired error).

Under the assumption of a uniform distribution over an enclosed unit sphere with a bounded noise condition, an A^2 algorithm that actively learns a homogeneous halfspace can decrease the label complexity. However, the halfspaces heavily depend on the initial hypothesis. For example, Figure 1.(a) describes a binary classification task in a two-dimensional sphere (circle) with a uniform distribution, where an arbitrary halfspace can generate a perceptron classifier (Dasgupta et al., 2005). To reduce the error rate of the initial hypothesis, researchers usually sample target points from the colored candidate pool to shrink their “volumes” (areas) as fast as possible, since the volume size decides the label complexity. However, there exists a challenging gap for A^2 algorithms: the VC dimension of sampling heavily depends on the distance between the initial and target (ε-optimal) hypotheses, and this problematic setting is called “target-dependent” label complexity (Gonen et al., 2013). That is, a poor input hypothesis which is far away from the desired hypothesis leads to a large volume of the candidate hypothesis pool, and the label complexity of querying then increases dramatically. Therefore, the volume of the version space decides how strongly the querying target depends on the initial hypothesis, and a series of AL approaches that shrink the volume of the version space can be loosely referred to as volume-splitting.

Figure 1: Illustrations of distribution-splitting over a unit sphere. In Figure 1.(a), general AL approaches halve the “volume” of the colored candidate area to reduce the VC dimension, but this process heavily depends on the initial hypothesis. In Figure 1.(b), we halve the number density of the input distribution to obtain a halfspace with sparse density. Then, we split the halfspace into a number of spheres to represent its distribution structure.

To decrease the dependence on the initial hypothesis, (Dasgupta, 2006) removed the hypothesis edges with large model distances to split the volume of the version space. In that work, the version space, which includes all feasible hypotheses, is embedded as a graph in a high-dimensional space. After splitting, the VC dimension decreases dramatically, and the upper bound of the label complexity is also reduced. However, volume-splitting of the version space still cannot completely solve the target-dependent problem because: 1) performing volume-splitting in the candidate data pool can reduce the effect of the initial hypothesis but cannot completely remove the dependence on it; 2) volume-splitting is hard to implement in active learning tasks over the input distribution, although this theoretical description has attracted the attention of researchers. Thus, it is necessary to find an alternative splitting strategy that can both reach the same theoretical goal as volume-splitting of the version space and deal with application tasks over the input distribution. Inspired by these observations, we collect evidence on the relationship between the version space and the input distribution.

From the perspective of (Cao et al., 2018a), the VC dimension of the version space decides the number of hypotheses and plays an important role in describing its volume. Therefore, we are motivated to map the notion of volume into the number density of the input distribution. Especially for a bounded local distribution, the more data located in the input distribution space, the larger the volume of the version space (see Figure 2). Another perspective which supports our assumption is that the input distribution induces a natural topology on the version space, and a local hypothesis easily captures its relevant local distribution (Dasgupta et al., 2008). From the above perspectives, we perform the splitting on the input distribution with the following advantages: 1) it provides the same theoretical guarantees on reducing label complexity and error rate as volume-splitting; 2) it breaks the curse of the initial hypothesis; 3) it provides model guidance for a target-independent AL algorithm in real AL tasks.

Figure 2: Illustration of the theoretical assumption of this paper. In this figure, we can observe that the number density of the input distribution decides the volume of the version space.

In this paper, we first shrink the volume of the version space to reduce its VC dimension by halving the number density of the input distribution to obtain a halfspace. With the theoretical advantages of halfspaces on error difference and label complexity, we then split the input distribution into a number of near-optimal spheres to represent its structure. The contributions of this paper are described as follows.

  • We connect version space and input distribution by number density, which provides a more natural and direct way to describe the hypothesis distribution than volume.

  • We present theoretical improvement guarantees on error rate difference, fall-back analysis, and label complexities under noise conditions for the halfspace which halves the number density of the input distribution.

  • We propose a distribution-splitting strategy for learning the structure of the input distribution.

  • We propose a “target-independent” AL algorithm called DA^2 which is independent of the initial labeled set and classifier.

The outline of the paper is as follows. Related studies are mentioned in Section 2. In Section 3, we describe the preliminaries of this paper. The theoretical motivation of hypothesis and distribution is presented in Section 4. The proposed distribution-splitting strategy and the advantages of halfspaces are reported in Section 5. Experimental results of halving and querying abilities of the proposed algorithm are presented in Section 6. Conclusions are drawn in Section 7.

2 Related Work

AL algorithms based on the input distribution aim to find data that are “highly informative or representative” of the unseen queries. However, the sampling targets heavily depend on the labeled set and the classifier.

To find an efficient labeling-tolerant algorithm, the agnostic active learner studies the ε-optimal hypothesis in the VC class (or hypothesis class). The goal is to improve the performance guarantee of the current hypothesis by minimizing the label complexity in the version space over different distribution assumptions. In this section, Section 2.1 introduces AL strategies with the input distribution, Section 2.2 presents theoretical AL with the VC class, and Section 2.3 describes halfspaces and volume-splitting.

2.1 Active Learning with Input Distribution

An AL algorithm samples the data which can significantly improve the performance of the current learning model. By applying AL sampling strategies, even the worst annotator no longer needs to annotate a large amount of data, which is expensive in terms of human cost. To optimize the querying process, the membership-query strategy (Freund et al., 1997) tries to find the most distinguishing sample in a set by asking subset membership queries. In structural data spaces with margins, AL algorithms optimize the queries close to the hyperplane with the support of SVM theory. These approaches can be loosely described as finding the “highly informative” samples (Tosh and Dasgupta, 2017) (Gal et al., 2017) (McCallumzy and Nigamy, 1998) to improve the performance of the learned model.

Another sampling criterion is to pick representative samples from the unlabeled data pool. With each query, the goal is to minimize the distribution difference between the labeled set and the original input set, for example via experimental design (Yu et al., 2006) (Zhang et al., 2011), which measures the regression between data and their labels. To strengthen the sampling performance, (Huang et al., 2014) combined informativeness and representativeness into a uniform standard with a given evaluation function. Subsequently, a series of AL algorithms which exploit data with both characteristics have appeared (Du et al., 2017).

However, these strategies heavily depend on the categories of classifiers and the labeled set. For example, AL algorithms using the maximum margin of a classification hyperplane cannot break the curse of the SVM classifier. Moreover, the influence of the size of the input labeled set is underestimated. In our investigation of related AL work, researchers usually query at least 50% of the total data as the initial labeled set, as in (Cai and He, 2012) (Du et al., 2017) (Shi and Shen, 2016). However, in the real world only very few labeled data may be available, and the labeled data may account for only a tiny percentage of the total data set.

2.2 Active Learning with VC Class

To reduce the dependence on the labeled set, Agnostic AL tries to find an ε-optimal hypothesis from the hypothesis class in the version space. In this theoretical learning task, we are given access to a stream of unlabeled data drawn i.i.d. from a fixed distribution. Since agnostic AL can achieve a dramatic reduction in label complexity, substantial agnostic AL frameworks under various assumptions on classifiers and labeled sets have been proposed (Dasgupta, 2006) (Balcan et al., 2006). Among them, the most popular strategy is pool-based AL, which assumes there exists a perfect classifier given a fixed sample number. With this prior premise, the goal is to minimize the number of queries needed to decrease the difference between the perfect and the final hypotheses. However, most of these agnostic AL algorithms either make strong distribution assumptions, such as separability or a uniform input distribution, or are computationally prohibitive, so they cannot effectively be applied in AL application tasks (Dasgupta et al., 2008).

2.3 Halfspaces and Volume-Splitting

A halfspace is either of the two convex sets into which a hyperplane divides an enclosed, bounded affine space. In halfspace learning, researchers study perceptron training under a uniform distribution to find the optimal halfspace over a unit sphere, as in Figure 1.(a). The sampling goal is to reduce the classifier vector angle θ as fast as possible, since θ decides the volume of the candidate hypothesis pool. Under this training assumption, researchers try to reduce the label complexity by two methods: (1) halving the volume of the candidate pool into a halfspace, and (2) binary search for halving (Tong and Koller, 2001).

By halving, the learner can quickly reduce the VC dimension of the version space and thus decrease the label complexity of querying, since a part of the hypotheses is removed. Therefore, the volume-splitting strategy is an effective solution in both AL theory and application tasks.

Notation Definition
Input data set
Hypothesis (or VC) class, which covers all feasible hypotheses
Hypothesis with a special setting
The optimal hypothesis in the VC class
Distribution over the input data
Distribution of the halfspace
Distribution with a special setting
Error rate of prediction after training
Error rate of prediction after training with a special setting
The sphere with given radius and center settings
The halfspace obtained by 1/2-splitting
VC dimension of the input set
Conditional probability
Error rate
Error rate with a special setting
Probability
Label complexity with settings of error rate and hypothesis probability
Volume of the input object
Number density of the input space
Vector angle or disagreement coefficient
Vector angle with a special setting
The t-th query
The total number of queries
The number of unlabeled data in the candidate pool
The number of data in the halfspace
The distance between two input objects
Table 1: Mathematical notations

3 Preliminaries

In this section, we present the preliminaries for the learning rule of A^2 algorithms, which use a disagreement coefficient to control the sampling inputs, in Section 3.1. Specifically, we present one important volume-splitting strategy, called -splitting, in Section 3.2. Then, we define target-dependent and target-independent Agnostic AL via halfspace learning in Section 3.3.

3.1 Learning Rule

The A^2 algorithm queries the label of an example based on an empirical rule about the error rate difference after assigning positive and negative labels. To describe the basic model of A^2 algorithms, we present some preliminaries in this section. In addition, the mathematical notations of this paper are described in Table 1.

Given a data set with binary class labels and a distribution over it, we divide the data into two groups: the labeled set and the unlabeled set. Let denote the error rate of prediction after training on the input set, denote the queried data under different label assumptions, denote the t-th query, denote the labeled set at the t-th query, and denote the total number of queries. Here we define the empirical rule for querying by the sampling margin,

(1)

where and represent the classification hypotheses after querying under different label assumptions. By using this rule, the margin is increased to quickly shrink the number of unlabeled data in the candidate pool. This in turn shrinks the volume of the version space and reduces the VC dimension.

To estimate it, A^2 algorithms use the disagreement coefficient (Hanneke, 2007) to control the sampling inputs,

(2)

where is the hypothesis class over the data, is the hypothesis with the lowest error rate, and denotes the metric distance between the two inputs.

Generally, the disagreement coefficient bounds the querying amount by calculating the probability mass of hypotheses in the ball. It is a theoretical notion in the version space which highlights the learning model of AL.
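
For concreteness, a standard way to write Hanneke's (2007) disagreement coefficient, using our own notation (B(h*, r) for the ball of hypotheses within distance r of the best hypothesis h*, and DIS(·) for its disagreement region, not necessarily the paper's symbols), is:

\theta \;=\; \sup_{r>0} \frac{\Pr_{x\sim\mathcal{D}}\!\big[x \in \mathrm{DIS}\big(B(h^{*},r)\big)\big]}{r},
\qquad
B(h^{*},r) \;=\; \{\, h \in \mathcal{H} : d(h,h^{*}) \le r \,\}.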

3.2 Volume-Splitting

To find the data which can reduce the VC dimension of the unlabeled data pool after its label is queried, the volume-splitting strategy designs a margin-factor evaluation rule, of which -splitting (Dasgupta et al., 2008) is one important approach. The rule is to find the subset of hypotheses which satisfies -splitting in all finite edge-sets, in which is a fraction, is the error rate, and is the number of unlabeled data needed. Under this policy, the -splitting rule, which splits the hypothesis edges, is defined as

(3)

Following this, -splitting is defined as

(4)
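
For orientation, these definitions closely mirror the splitting index of Dasgupta (2006); in that (possibly different) notation, a query point x ρ-splits a finite edge-set Q, and the class is (ρ, ε, τ)-splittable, when

\max\Big\{\big|Q \cap (\mathcal{H}_{x}^{+}\times\mathcal{H}_{x}^{+})\big|,\;\big|Q \cap (\mathcal{H}_{x}^{-}\times\mathcal{H}_{x}^{-})\big|\Big\} \;\le\; (1-\rho)\,|Q|,

\Pr_{x\sim\mathcal{D}}\big[x \ \rho\text{-splits } Q\big] \;\ge\; \tau
\quad\text{for every finite } Q \subseteq \{\{h,h'\} : d(h,h') > \varepsilon\},

where H_x^+ and H_x^- denote the hypotheses labeling x positive and negative, respectively.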

To find the instance with the highest informativeness, agnostic AL algorithms select the data which maximally split the edge-set, and then return the remaining hypotheses. Considering the effectiveness of the splitting algorithm, we study it in the agnostic AL task, which is independent of the structural assumption of a uniform distribution. Without a fixed assumption on the distribution, the hypothesis distances become the most important splitting factor. Let be the current hypothesis and , be two candidate sampling points; there then exists a new disagreement coefficient for the agnostic AL task,

(5)

Letting denote the number of unlabeled data in the candidate pool, we obtain the following hypothesis relationship,

(6)

Based on this relationship, we can remove some hypotheses to reduce the VC dimension.

3.3 Target-Dependent and Target-Independent Agnostic AL

Halfspace learning provides a clear visualization of the hypothesis relationship. Based on this advantage, we define target-independent active learning over a unit sphere in this section.

Target-dependent label complexity is a challenging problem in AL. It means that the performance of the unseen sampling process heavily depends on the initial hypothesis, since the algorithms always select the points which maximize the hypothesis or distribution update. Here we present the related mathematical definitions for target-dependent Agnostic AL via halfspace learning. First, we define halfspaces:

Definition 1.

Halfspaces. Learning a halfspace (Alabdulmohsin et al., 2015) (Chen et al., 2017) in a unit instance space is to estimate an unknown vector ,

(7)
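
A standard form of this definition (our notation: w is the unknown unit vector and x a point of the unit instance space) is:

h_{w}(x) \;=\; \mathrm{sign}\big(\langle w, x\rangle\big), \qquad \lVert w\rVert = 1,\; \lVert x\rVert \le 1 .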

Let be the perceptron vector at the t-th query and be the angle between it and the target vector; we give the following definitions:

Definition 2.

Target-dependent Agnostic AL. With a probability of selecting an arbitrary point over a unit sphere, the probability of obtaining a lower error rate is . Even using the halving algorithm, the label complexity is .

Different from target-dependent AL, target-independent AL requires that the unseen sampled targets be independent of the initial hypothesis.

Definition 3.

Target-independent Agnostic AL. With a probability of selecting an arbitrary point over a unit sphere, the probability of obtaining a lower error rate is (where ), but with a bounded label complexity .

Intuitively, learning the structure of the distribution can help address the target-dependent problem with a certain sampling selection. In real AL tasks, target-independence means that the sampled targets are independent of the initial training set.
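
For reference, the classical perceptron-based analysis over a uniform unit sphere (Dasgupta et al., 2005) contrasts the two regimes roughly as follows; this is the standard result, not a restatement of the paper's own bounds:

\underbrace{\Omega\!\left(\tfrac{1}{\varepsilon^{2}}\right)}_{\text{standard perceptron updates}}
\qquad\text{vs.}\qquad
\underbrace{\tilde{O}\!\left(d\,\log\tfrac{1}{\varepsilon}\right)}_{\text{active halving with a modified perceptron}}
\quad\text{labels to reach error } \varepsilon .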

4 Hypothesis and Distribution

In Section 4.1, we first reveal the monotonic property of the active query set to show that the error rate change after querying is uncertain. Then, we discuss the bottleneck of informative AL and describe our splitting rule based on representative sampling in Section 4.2. Finally, we discuss the relationship between the error rate and the number density of the input distribution in Section 4.3.

Based on these theoretical analyses, we are motivated to perform the splitting on the input distribution, learn its structure, and thereby address the target-dependence challenge.

4.1 Monotonic Property of Active Query Set

To observe how the error rate changes as the size of the active query set increases, we follow the perceptron training setting (see Figure 1(a)) (Dasgupta et al., 2005) to analyze the hypothesis relationship. In our perspective, training the updated hypothesis leads to two uncertain situations: (1) the error rate declines after querying, or (2) the error rate shows negative (or slow) improvement even after querying a lot of unlabeled data. Therefore, the monotonic relationship between the active query set size and the error rate is unknown. The following theorem gives a mathematical description of this observation.

Theorem 4.1.

The monotonic property between the active query set and the error rate is unsatisfactory or negative. Suppose and are the error rates obtained by training on the active query sets and , respectively. There must exist an uncertain probability which satisfies .

Proof.

Suppose the vector is the optimal diameter which classifies the unit circle correctly, is the current classification perceptron at the t-th query, and is the angle between the two vectors; we then have (w.r.t. is the label of sample ), in which is trained by , and is trained by . We divide the circle into two parts: - and -, where - denotes the area between and , and - denotes the area outside it. Consider the two cases: (1) when the query is in -, i.e., , we have ; (2) when the query is in -, i.e., , we have . Finally, Theorem 1 holds. ∎

Theorem 1 describes our first perspective of this paper on the relationship between the performance of the hypothesis and the size of the active query set. It shows that a larger active query set does not guarantee a lower error rate.
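
A minimal simulation of this non-monotonic behavior (our own illustrative sketch, not the paper's experiment): points uniform on the unit circle are labeled by a target halfspace, random queries trigger standard perceptron updates, and the angle to the target, a proxy for the error rate, is tracked.

import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, 0.0])                       # target halfspace (unit vector)

# Uniform samples on the unit circle, labelled by the target halfspace.
angles = rng.uniform(0.0, 2.0 * np.pi, size=2000)
X = np.stack([np.cos(angles), np.sin(angles)], axis=1)
y = np.sign(X @ w_star)
y[y == 0] = 1.0                                     # break ties on the boundary

w = np.array([np.cos(2.5), np.sin(2.5)])            # a poor initial hypothesis
errors = []
for _ in range(200):
    i = rng.integers(len(X))                        # query a random point
    if np.sign(X[i] @ w) != y[i]:
        w = w + y[i] * X[i]                         # standard perceptron update
        w = w / np.linalg.norm(w)                   # renormalise (does not change predictions)
    # Error proxy: normalised angle between the current and the target vector.
    errors.append(np.arccos(np.clip(w @ w_star, -1.0, 1.0)) / np.pi)

# The error sequence is typically not monotone: some updates increase the angle.
print(any(b > a for a, b in zip(errors, errors[1:])), errors[0], errors[-1])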

4.2 Bottleneck of Querying Informative Sample

Researchers want to find the optimal hypothesis path in the version space by querying informative points (see Figure 3.(a)). However, the hypothesis distances are very hard to calculate. Especially when the initial hypothesis is far from the optimal hypothesis, the path-finding process becomes even more difficult. Thus, there exists a bottleneck for AL sampling by querying informative samples.

Since the VC dimension largely affects the path-finding process, we use volume-splitting to split the version space into a number of near-optimal spheres (Figure 3.(b)). Then, we retain the structure of the version space for AL sampling (Figure 3.(c)). This aims to shrink the number of candidate paths and thereby reduce the uncertainty of sampling in the path-finding process. However, since this strategy operates in the version space, we need a new splitting strategy that can both achieve the same representation results as volume-splitting and be applied to AL over the input distribution. Therefore, we apply the splitting idea to the input distribution by finding local balls with the following rules.

Given the ball which tightly encloses the data and the local balls which satisfy the condition , and letting and respectively denote the volume (Cao et al., 2018a) and radii of the input hypothesis class, the splitting must satisfy (1) , (2) , (3) , and (4) , where denotes the center of the i-th split ball, and denotes the center of the enclosing ball.

Figure 3: Illustration of volume-splitting over the version space. The graph structure denotes the version space, each node denotes one hypothesis, and the red lines denote the hypothesis distances. In Figure 3.(a), researchers try to find the optimal path from the initial hypothesis to the hypothesis with error rate . In Figure 3.(b), researchers split the version space into a number of near-optimal spheres to reduce the VC dimension. In Figure 3.(c), researchers use one hypothesis to represent each sphere in order to learn the structure of the version space.

4.3 Error Rate with Number Density

Following the perceptron training in the unit circle with a uniform distribution, we find that the error rate grows with the number density of the input distribution.

Theorem 4.2.

Assume ; we then have (w.r.t. the volume of the unit circle, which is π), where denotes the number density of the distribution.

Proof.

Let denote a unit circle containing samples; we have . After querying more samples, we obtain . In the unit circle, the volume is π. Then, the theorem is as stated. ∎
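
Concretely, with n samples falling in a region of volume Vol(·), the number density used in this argument takes the usual form (the symbols here are ours):

\rho(\mathcal{X}) \;=\; \frac{n}{\mathrm{Vol}(\mathcal{X})},
\qquad
\rho(\text{unit circle}) \;=\; \frac{n}{\pi}.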

The error rate difference represents the distance between two arbitrary hypotheses. From the above theorem, we see that the number density affects the hypothesis distances. Besides this, the number density largely decides the VC dimension since . For these two reasons, number density is a more direct way to describe the hypothesis distribution than the volume of the version space. Therefore, we are motivated to map the volume of the version space into the number density of the input distribution, both to reduce the VC dimension and to obtain a lower label complexity. In addition, we will define the number density for real AL tasks in Section 5.3.2.

5 Distribution-Splitting for Agnostic Active Learning

Using a heuristic greedy selection, we halve the number density of the input distribution to obtain halfspaces in Section 5.1. Then, we discuss their theoretical advantages in Section 5.2. With these guarantees, Section 5.3 splits the halfspace of the input distribution into a certain number of local balls to find a near-optimal representation structure.

5.1 Halving Number Density for Halfspaces

By sorting the hypothesis distance of each pair, we can use a halving threshold to cut the number density of the input distribution in half. The cutting rule is: if , we remove from . After the cutting, is reduced to one halfspace .

In the hypothesis class , the VC dimensions of and can be written as and (Cao et al., 2018a). Based on these assumptions, let us discuss the advantages of halfspaces in terms of label complexity and the upper bound on the number of queries.

Lemma 1.

Label complexity. Let each hypothesis hold with probability at least ; the label complexity is then

(8)
Proof.

The proof could be adapted from (Balcan et al., 2007) (Dasgupta et al., 2005). ∎

Lemma 2.

Upper bound of queries. Following (Balcan et al., 2006), let us assume , ; then A^2 will make at most queries, where is denoted as .

Proof.

In the halfspace , let . Then, we obtain , in which is described in (Balcan et al., 2006). Then, . The lemma is as stated. ∎

Based on the above discussion, we can observe that the values of these two properties for the halfspace are lower than those of the original hypothesis class, since the VC dimension is reduced.

5.2 Advantages of Halfspaces

To observe the advantages of halfspaces, Section 5.2.1 analyzes the bounds on the error difference between the hypotheses under positive and negative labeling assumptions, Section 5.2.2 discusses the upper bound of the error rate via fall-back analysis, and Section 5.2.3 describes the label complexities under -bounded and -adversarial noise conditions.

5.2.1 Bounds of Error Difference in Halfspaces

In this learning process, we still use the greedy halving strategy to split the local unit ball . Before splitting, we present the halving guarantees on the error rate difference over the distribution of halfspaces.

Theorem 5.1.

Let be the distribution over , be a family of functions , be the th shatter coefficient with infinite VC dimension, , be the empirical average of over a subset with probability at least . Then, we have

Proof.

Following (Dasgupta et al., 2008), assume for any , where ; then an i.i.d. sample of size from satisfies

(9)

With a similar inequality in the halfspace, let , ,

(10)

Let us rewrite the above equation as four parts, each within a pair of brackets. Since the number density of is smaller than that of , there exist and . Therefore, . For the VC dimension, since , we have . Then, . ∎

By this theorem, the error rate of the halfspace is guaranteed to decrease. However, it is related to the size of . To obtain the structure of the version space, we continue to use the halving approach to split into local balls, with fall-back and bounded-noise-tolerance guarantees.

5.2.2 Fall-Back Analysis in Halfspaces

Fall-back analysis helps us to observe the upper bound of error rate in the halfspace. Before analyzing the fall-back of querying, we need some technical lemmas.

Lemma 3.

(Dasgupta et al., 2008) Under a normalized uniform assumption, can be defined as: , where .

Lemma 4.

With the assumptions of , and it is consistent with the labeled set for all .

Proof.

Applying in Lemma 4, we have . Then, we obtain the inequality

(11)

Rewriting this inequality as , we then have

(12)
(13)

Therefore,

(14)

Now, we have . Then, the lemma follows. ∎

Theorem 5.2.

Assume there exists a hypothesis which satisfies . If the A^2 algorithm is given queries with probability , and letting , the error rate of the halfspace is at most .

Proof.

Using Lemma 4, we obtain . Let then

(15)

With this upper bound, the halfspace has a convergent error rate guarantee, and this bound is lower than that of the input distribution without halving. Next, let us analyze the bounds on the label complexity under noise settings.

5.2.3 Bounded Noise Analysis of Halfspace

Under the uniform assumption, noise affects the unseen queries. Here we discuss the label complexities of the halfspace under the - and - noise settings (Yan and Zhang, 2017).

Theorem 5.3.

For some with respect to (w.r.t. the definition of halfspaces), if for any , , we say the distribution of satisfies the -bounded noise condition (Massart et al., 2006). Under this assumption, (1) there are at most unlabeled data, and (2) the number of queries is at most , where .

Theorem 5.4.

For some with respect to , if for any , , we say the distribution of satisfies the -adversarial noise condition (Awasthi et al., 2014). Under this assumption, (1) there are at most unlabeled data, and (2) the number of queries is at most .

Compared with the original input distribution, the label complexity of the proposed halfspace is lower since its VC dimension is reduced.

5.3 Distribution-Splitting for AL Tasks

Halfspaces provide theoretical advantages without special distribution assumptions, since the number density is independent of the specific distribution. Therefore, for application tasks, we first halve the number density of the input distribution to learn a halfspace via an active scoring strategy in Section 5.3.1. After obtaining the halfspace, Section 5.3.2 splits it into near-optimal balls via the distribution density. Then, Section 5.3.3 proposes the DA^2 algorithm for AL querying.

5.3.1 Active Scoring for Halving

Active scoring measures the local representativeness of an arbitrary data point, where the score value grows monotonically with the representativeness. Here we use the sequential optimization of Yu et al. (2006) to find the data with the highest scores to represent the input data distribution. Its definition is as follows.

(16)

where is the kernel matrix of the input data, denotes the sequence position of a candidate in , denotes the sequence position of the data point with the current highest confidence score in , and is the penalty factor of the global optimization.
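
One common sequential form of this scoring (following the sequential design of Yu et al. (2006); here μ is the penalty factor, K_{·i} is the i-th column of the kernel matrix, and the notation is ours rather than necessarily the paper's) selects at each step

i^{*} \;=\; \arg\max_{i}\; \frac{\lVert K_{\cdot i}\rVert^{2}}{K_{ii} + \mu},
\qquad\text{then deflates}\qquad
K \;\leftarrow\; K \;-\; \frac{K_{\cdot i^{*}}\,K_{i^{*}\cdot}}{K_{i^{*} i^{*}} + \mu}.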

5.3.2 Splitting by Distribution Density

Considering our theoretical perspective on number density, splitting the input distribution according to it has already been shown to be effective. However, calculating the number density of a bounded region in a high-dimensional space is challenging. Therefore, we use an exponential function of the distribution density to quickly split the input distribution. Here we define the number density as

(17)

where and respectively denote the mean value and variance of the local ball . Then, we propose the splitting rule:

(18)

To solve the above minimization problem, we use the (1+ε)-approximation approach (Tsang et al., 2005) to grow the ball radius until it converges, where ε is set by an empirical threshold.
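
A minimal sketch of this step (our own reading, since the exact forms of Eqs. (17)-(18) are not reproduced above): a Gaussian-style local density per candidate ball and a (1+ε)-style radius growth, with all function names and the coverage criterion chosen by us for illustration.

import numpy as np

def local_density(points):
    # Gaussian-style density score for the points of one candidate ball
    # (assumed form; the paper's Eq. (17) may differ in detail).
    mu = points.mean(axis=0)
    var = points.var(axis=0).sum() + 1e-12
    return float(np.exp(-((points - mu) ** 2).sum(axis=1) / (2.0 * var)).mean())

def grow_ball(X, center, eps=0.1, target_frac=0.25):
    # (1+eps)-style radius growth: enlarge the radius geometrically until the
    # ball covers a target fraction of the halfspace data (illustrative stop rule).
    d = np.linalg.norm(X - center, axis=1)
    pos = d[d > 0]
    r = float(pos.min()) if pos.size else 1e-6
    while (d <= r).mean() < target_frac:
        r *= (1.0 + eps)
    members = np.where(d <= r)[0]
    return r, members, local_density(X[members])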

5.3.3 Querying by DA

How to query the unlabeled data is an important step in AL tasks. In this section, we propose a Distribution-based A^2 algorithm (DA^2) to implement the proposed distribution-splitting strategy. The algorithm has two steps. Step 1 finds a halfspace which contains the optimal data sequences via the active scoring of Eq. (16). Step 2 solves the optimization of Eq. (18). Finally, the output of the algorithm is used as the AL queries.

while  do
    for i = 1, 2, 3, …, n do
        Calculate the score of  by using Eq. (16) and store .
    end for
    Find the sequence with the maximum value of .
    Add  to .
    Update the matrix K by Eq. (16).
end while
Solve the optimization of Eq. (18) with random settings of  and  in .
Output .
Algorithm 1: The DA^2 algorithm
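
An end-to-end sketch of Algorithm 1 under our own simplifications (sequential kernel scoring for Step 1, and scikit-learn's KMeans as an explicit stand-in for the ball-splitting of Step 2; the function names, parameters, and the stand-in are ours, not the paper's):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def da2_queries(X, n_half, n_balls, mu=0.1, gamma=0.5, seed=0):
    # Step 1: sequential scoring to keep the top-scoring half of the data.
    K = rbf_kernel(X, gamma=gamma)
    kept = []
    for _ in range(n_half):
        scores = (K ** 2).sum(axis=0) / (np.diag(K) + mu)
        if kept:
            scores[kept] = -np.inf                 # do not re-select kept points
        i = int(np.argmax(scores))
        denom = K[i, i] + mu
        k_i = K[:, i:i + 1].copy()
        K = K - (k_i @ k_i.T) / denom              # deflate the kernel matrix
        kept.append(i)
    H = X[kept]
    # Step 2: split the halfspace into balls; KMeans stands in for the
    # minimum-enclosing-ball splitting of Eq. (18).
    km = KMeans(n_clusters=n_balls, n_init=10, random_state=seed).fit(H)
    queries = [kept[int(np.argmin(((H - c) ** 2).sum(axis=1)))]
               for c in km.cluster_centers_]
    return queries                                 # indices into X to be labelled

# Example: 20 queries from a random 2-D data set, after keeping half of it.
X = np.random.default_rng(0).normal(size=(400, 2))
print(da2_queries(X, n_half=200, n_balls=20))
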
Figure 4: Error rate changes of passive sampling in the original input space and in the halfspace on different data sets (panels: (a) sonar, (b) german, (c) iris, (d) monk1, (e) vote, (f) ionosphere, (g) DvsP, (h) A-D). The classifier is LIBSVM and the error rate is the evaluation standard. (a)-(f) are six UCI real data sets; (g)-(h) are sub data sets of letter.

6 Evaluation

As described in the algorithm, the DA^2 algorithm performs its querying in the halfspace. Thus, the halving ability plays an important role in its AL querying.

In this section, we investigate the halving and querying performance of the DA^2 algorithm on six structured clustering data sets and seven unstructured symbolic data sets. Before the tests, we use the RBF kernel to transfer the input space into a non-linear kernel space for the DA^2 algorithm, where the kernel sigma parameter is 1.8. To compare the classification performance of the different AL algorithms, LIBSVM (Chang and Lin, 2011) and the error rate are set as the default classifier and evaluation standard, respectively. Furthermore, we present probabilistic explanations and a model-difference analysis for the two groups of experimental results.
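
Here the “sigma parameter of 1.8” is taken to be the bandwidth of the usual RBF kernel (this parameterization is our assumption; LIBSVM expresses the same kernel via γ):

K(x, x') \;=\; \exp\!\left(-\frac{\lVert x - x'\rVert^{2}}{2\sigma^{2}}\right),
\qquad \sigma = 1.8 \;\Longleftrightarrow\; \gamma = \frac{1}{2\sigma^{2}} \approx 0.154 .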

6.1 Effectiveness of Halving

To verify the halving ability of the DA^2 algorithm, we perform passive sampling in the input space and in the halfspace and compare their prediction abilities on the input space. The tested data sets are six UCI real data sets, namely sonar, german, iris, monk1, vote, and ionosphere, and two subsets of the letter data set, DvsP and A-D, where A-D denotes the instances of AvsBvsCvsD. In the experiments, we perform passive sampling 10 times to obtain the mean error rate under different query-number settings in the two spaces. Figure 4 shows the results.

By observing the two curves, we can see that the querying ability of the halfspace is better than that of the original input space, since the halfspace has removed some “redundant points” (Cao et al., 2018b) which have little influence on the performance of the training model. Assume there are “effective points” (Cao et al., 2018b) which largely decide the performance of the final learning model in the input space. Under a limited training input number, if we do not consider the influence of the classifier and parameter settings, the probabilities of obtaining an ε-optimal hypothesis in the two spaces are and , respectively, where is the number of effective points in the halfspace. Since the sequential optimization has been successful in representative sampling, would be very close to . Therefore, .

Figure 5: The error rate performance of the five AL approaches in the active learning test (panels: (a) german, (b) iris, (c) monk1, (d) A-T, (e) A-L, (f) A-P, (g) A-X, (h) A-Z). (a)-(c) are UCI real data sets; (d)-(h) are selected sub data sets of letter, whose class numbers are 12, 16, 20, 24, and 26.

6.2 Optimal Accuracy of Querying

The halving experiments have shown that the halfspace can deliver better passive sampling performance than the original input space, which provides a guarantee for distribution-splitting in the halfspace. However, most AL methods are based on the labeled set. To verify the target-independence of the proposed algorithm, we set the size of the initial training set to the number of class categories by randomly selecting one data point from each class. Because labeled-set-based AL algorithms tend to show poor performance when the input set is small, we try to minimize the influence of the labeled set by tuning their parameters. Under different query-number settings, we collect the results of 100 runs to obtain their optimal prediction results.

In this group of experiments, we compare the AL abilities of three mainstream algorithms: Hieral, TED, and GEN. Their descriptions are as follows:

  • Hieral (Dasgupta and Hsu, 2008) utilizes the prior knowledge of hierarchical clustering to actively annotate more unlabeled data via an established probability evaluation model, but it is sensitive to the cluster structure.

  • TED (Yu et al., 2006), also called T-optimization, prefers data points that are on the one hand hard to predict and on the other hand representative of the rest of the unlabeled pool.

  • GEN (Du et al., 2017) focuses on the data which minimize the difference between the distributions of the labeled and unlabeled sets.

Figure 5 presents the error rate curves of the five AL approaches on the different tested data sets. Although we have maximized the model performance of the labeled-set-based AL algorithms, the DA^2 algorithm still outperforms them. To reveal the potential reasons behind their performance, we provide probabilistic and model explanations for this group of tests.

In the probabilistic analysis, if we assume that all the learning algorithms can obtain an ε-optimal hypothesis in each test, we have the following results: (1) for the uncertain AL algorithms, an arbitrary input set can lead to an ε-optimal hypothesis, and ; (2) in this group of tests, an arbitrary input set has an approximate probability of of obtaining the best classification result; (3) for the target-independent DA^2 algorithm, the trained hypothesis is unique and ; (4) in this group of tests, an arbitrary input set has a certain probability of of obtaining the best classification result.

To analyze the model differences among these algorithms, we present the following discussion. (1) The idea of Hieral is active annotation, and it depends on the cluster assumption over the version space. Its classification ability on unstructured data sets such as the letter data sets is unstable, and the recorded error rates are higher than those of the other approaches, even though we increased the number of tests. (2) TED tends to select points with large norms, which may be hard to predict but do not necessarily best represent the whole data set. Noisy points are also sampled into its query set, so the reported classification results are good but not the best. (3) GEN always yields disappointing results at the beginning of training on all tested data sets, and its error rate declines rapidly as the number of queries increases. The reason is that its objective function prefers data located in the central areas of classes, which cannot reflect the complete class structure well. (4) Compared with the above algorithms, the DA^2 algorithm halves the number density of the data distribution into a halfspace, which removes most of the redundant points. The remaining points, which represent the local data distributions, help the learner obtain the structure of the original data distribution. In the reported error rate curves, this represented structure provides effective sampling guidance when the number of queries is not large.

7 Conclusion

The A^2 algorithm can provide strong theoretical guarantees for learning an agnostic AL task under fixed distribution and noise conditions. However, the label complexity of querying heavily depends on the initial hypothesis. This creates a challenging gap between the theoretical guarantees and the application performance of A^2 algorithms.

To bridge this gap, we use a distribution-splitting strategy to halve the number density of the input distribution and thereby reduce the VC dimension of the version space. With the reduction in error rate and label complexity, we split the original data distribution into a halfspace which retains the most highly informative data for the unseen sampling. In the constructed halfspace, we then continue to use the distribution-splitting approach to find a number of near-optimal spheres to represent an arbitrary agnostic input distribution. The proposed DA^2 algorithm further demonstrates the effectiveness of the halving and querying abilities of the proposed distribution-splitting strategy.

References

  • Alabdulmohsin et al. (2015) Alabdulmohsin IM, Gao X, Zhang X (2015) Efficient active learning of halfspaces via query synthesis. In: AAAI, pp 2483–2489
  • Awasthi et al. (2014) Awasthi P, Balcan MF, Long PM (2014) The power of localization for efficiently learning linear separators with noise. In: Proceedings of the forty-sixth annual ACM symposium on Theory of computing, ACM, pp 449–458

  • Balcan et al. (2006) Balcan MF, Beygelzimer A, Langford J (2006) Agnostic active learning. In: ICML, ACM, pp 65–72
  • Balcan et al. (2007) Balcan MF, Broder A, Zhang T (2007) Margin based active learning. In: COLT, Springer, pp 35–50
  • Cai and He (2012) Cai D, He X (2012) Manifold adaptive experimental design for text categorization. IEEE Transactions on Knowledge and Data Engineering 24(4):707–719
  • Cao et al. (2018a) Cao X, Tsang IW, Xu G (2018a) A structured perspective of volumes on active learning. arXiv:1807.08904
  • Cao et al. (2018b) Cao X, Tsang IW, Xu J, Shi Z, Xu G (2018b) Geometric active learning via enclosing ball boundary. arXiv:1805.12321
  • Chang and Lin (2011) Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3):27

  • Chen et al. (2017) Chen L, Hassani SH, Karbasi A (2017) Near-optimal active learning of halfspaces via query synthesis in the noisy setting. In: AAAI, pp 1798–1804
  • Dasgupta (2006) Dasgupta S (2006) Coarse sample complexity bounds for active learning. In: NIPS, pp 235–242
  • Dasgupta and Hsu (2008) Dasgupta S, Hsu D (2008) Hierarchical sampling for active learning. In: ICML, pp 208–215
  • Dasgupta et al. (2005) Dasgupta S, Kalai AT, Monteleoni C (2005) Analysis of perceptron-based active learning. In: COLT, Springer, pp 249–263
  • Dasgupta et al. (2008) Dasgupta S, Hsu DJ, Monteleoni C (2008) A general agnostic active learning algorithm. In: NIPS, pp 353–360
  • Du et al. (2017) Du B, Wang Z, Zhang L, Zhang L, Liu W, Shen J, Tao D (2017) Exploring representativeness and informativeness for active learning. IEEE transactions on cybernetics 47(1):14–26
  • Fang et al. (2017) Fang M, Yin J, Hall LO, Tao D (2017) Active multitask learning with trace norm regularization based on excess risk. IEEE transactions on cybernetics 47(11):3906–3915
  • Freund et al. (1997) Freund Y, Seung HS, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm. Machine learning 28(2-3):133–168
  • Gal et al. (2017) Gal Y, Islam R, Ghahramani Z (2017) Deep bayesian active learning with image data. ICML
  • Gonen et al. (2013) Gonen A, Sabato S, Shalev-Shwartz S (2013) Efficient active learning of halfspaces: an aggressive approach. The Journal of Machine Learning Research 14(1):2583–2615
  • Hanneke (2007) Hanneke S (2007) A bound on the label complexity of agnostic active learning. In: ICML, ACM, pp 353–360
  • Harpale and Yang (2010) Harpale A, Yang Y (2010) Active learning for multi-task adaptive filtering. In: ICML
  • Huang et al. (2014) Huang SJ, Jin R, Zhou ZH (2014) Active learning by querying informative and representative examples. IEEE Transactions on Pattern Analysis & Machine Intelligence (10):1936–1949
  • Huang et al. (2015) Huang SJ, Chen S, Zhou ZH (2015) Multi-label active learning: Query type matters. In: IJCAI, pp 946–952
  • Huang et al. (2017) Huang SJ, Gao N, Chen S (2017) Multi-instance multi-label active learning. In: IJCAI, pp 1886–1892
  • Li et al. (2012) Li L, Jin X, Pan SJ, Sun JT (2012) Multi-domain active learning for text classification. In: SIGKDD, pp 1086–1094
  • Li and Guo (2013) Li X, Guo Y (2013) Adaptive active learning for image classification. In: CVPR, pp 859–866
  • Massart et al. (2006) Massart P, Nédélec É, et al. (2006) Risk bounds for statistical learning. The Annals of Statistics 34(5):2326–2366
  • McCallumzy and Nigamy (1998) McCallumzy AK, Nigamy K (1998) Employing em and pool-based active learning for text classification. In: ICML, Citeseer, pp 359–367
  • Shi and Shen (2016) Shi L, Shen YD (2016) Diversifying convex transductive experimental design for active learning. In: IJCAI, pp 1997–2003
  • Tong and Koller (2001) Tong S, Koller D (2001) Support vector machine active learning with applications to text classification. Journal of machine learning research 2(Nov):45–66
  • Tosh and Dasgupta (2017) Tosh C, Dasgupta S (2017) Diameter-based active learning. ICML
  • Tsang et al. (2005) Tsang IW, Kwok JT, Cheung PM (2005) Core vector machines: Fast svm training on very large data sets. Journal of Machine Learning Research 6(Apr):363–392
  • Vapnik and Chervonenkis (2015) Vapnik VN, Chervonenkis AY (2015) On the uniform convergence of relative frequencies of events to their probabilities. In: Measures of complexity, Springer, pp 11–30
  • Yan and Zhang (2017) Yan S, Zhang C (2017) Revisiting perceptron: Efficient and label-optimal learning of halfspaces. In: NIPS, pp 1056–1066
  • Yu et al. (2006) Yu K, Bi J, Tresp V (2006) Active learning via transductive experimental design. In: ICML, pp 1081–1088
  • Zhang et al. (2011) Zhang L, Chen C, Bu J, Cai D, He X, Huang TS (2011) Active learning based on locally linear reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(10):2026–2038