Geometric Active Learning via Enclosing Ball Boundary

05/31/2018, by Xiaofeng Cao et al., University of Technology Sydney, Hong Kong Baptist University, University of Amsterdam

Active Learning (AL) requires learners to retrain the classifier with minimal human supervision or labeling from the unlabeled data pool when the current training set is not sufficient. However, general AL sampling strategies with limited label support inevitably suffer from performance degradation. To identify which samples determine the performance of the classification hyperplane, the Core Vector Machine (CVM) and Ball Vector Machine (BVM) use the geometric boundary points of each Minimum Enclosing Ball (MEB) to train the classification hypothesis. Their theoretical analyses and experimental results show that the improved classifiers not only converge faster but also obtain higher accuracies than the Support Vector Machine (SVM). Inspired by this, we formulate cluster boundary point detection as an MEB boundary problem after presenting a convincing proof of this observation. Because the enclosing ball boundary may have a high fitting ratio when it cannot enclose the class tightly, we split the global ball problem into two kinds of small Local Minimum Enclosing Ball (LMEB) problems, the Boundary ball (B-ball) and the Core ball (C-ball), to tackle its over-fitting problem. By calculating the updates of radius and center when extending the local ball space, we adopt the minimum-update ball to obtain a geometric update optimization scheme for B-balls and C-balls. After proving their update relationship, we design the LEB (Local Enclosing Ball) algorithm, which uses the centers of the B-balls of each class to detect the enclosing ball boundary points for AL sampling. Experimental and theoretical studies show that the classification accuracy, time, and space performance of our proposed method are significantly superior to the state-of-the-art algorithms.


1 Introduction

Active learning [1] is a well-studied subject in many machine learning and data mining scenarios such as text AL [17], image AL [18] [47] [48], transfer AL [19] [20], online learning [2], and semi-supervised learning [3], where unannotated resources are abundant and cheap but collecting massive annotated data is expensive, time-consuming, and impractical. In this learning process, the goal of an active learner is to reduce the prediction error rate on the version space (data set) through fewer queries and less training. To improve the performance of the classifier, the learner is allowed to sample a subset from an unlabeled data pool, selecting those instances that provide the main support for constructing the classification model. Usually, training the optimal classification hypothesis requires accessing the unlabeled data pool and querying a certain number of true labels, but this may lead to a selection difficulty because of the large amount of unlabeled data in the pool.

To tackle this issue, uncertainty sampling [4] was proposed to guide AL by selecting the most important instances under a given sampling scheme or distribution assumption, such as margin [5], uncertainty probability [6], maximum entropy [7], confused votes by committee [8], maximum model diameter [9], maximum unreliability [44], and so on. The main issue for AL is therefore to find a way to reduce the number of queries or to make the classifier converge quickly, so as to reduce the total cost of the learning process. Accompanied by multiple iterations, querying stops when the defined sampling number is met or a satisfactory model is found. However, although this technique performs well, it still needs to traverse the huge version space repeatedly.
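For concreteness, the following is a minimal sketch (not taken from any of the cited works) of one such uncertainty criterion, maximum entropy, assuming an sklearn-style classifier that exposes predict_proba; the function names are illustrative only.

```python
import numpy as np

def entropy_uncertainty(proba):
    """Shannon entropy of predicted class probabilities; higher = more uncertain."""
    eps = 1e-12                                 # avoid log(0)
    return -np.sum(proba * np.log(proba + eps), axis=1)

def select_queries(clf, X_pool, n_queries):
    """Pick the n_queries pool points the current classifier is least sure about."""
    proba = clf.predict_proba(X_pool)           # shape (n_pool, n_classes)
    scores = entropy_uncertainty(proba)
    return np.argsort(scores)[-n_queries:]      # indices of the most uncertain points
```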

(a) Complete version space
(b) Cluster bone points
(c) Cluster boundary points
Figure 1: Motivation for training on cluster boundary points in AL. In each sub-figure, the black lines represent the SVM classification model trained on the data points shown. (a) Training on the complete version space (data set). (b) Training on the cluster bone points. (c) Training on the cluster boundary points. The classification line generated in (c) is similar to the models in (a) and (b). In this paper, we use the cluster boundary points as the AL sampling points because of their decisive influence on the classification hyperplane.
(a) MEB
(b) LMEB
Figure 2: How to obtain cluster boundary points by MEB in two-dimensional space, where the first figure describes the basic MEB problem and the second figure illustrates our proposed LMEB approach. (a) The MEB problem finds the minimum enclosing ball of the class, defined by its ball center and ball radius. The boundary of the MEB may not be tight to the class, so it cannot be used to represent the real cluster boundary. (b) LMEB splits the MEB problem into two kinds of local MEB problems, the Core ball (C-ball) and the Boundary ball (B-ball), where a C-ball is a local ball within the class, a B-ball is a local ball located at the edge area of the class, and the radii of these balls differ. We call the centers of the B-balls the enclosing ball boundary points.

Querying the labels of sampled data is a reasonable way to improve the AL prediction model when the training set is insufficient, but devising such positive evaluation rules is awkward because neither the learners nor the human annotators know which instances in the pool are the most important. In general, we seek methods with the advantages of (1) high efficiency in querying the most effective or important instances; and (2) low redundancy by reducing queries on redundant or useless instances. Intuitively, training a robust prediction model that performs well on unannotated data is the common goal of the different AL approaches, and many uncertainty evaluation strategies have been proposed to achieve this goal. However, they always suffer from one main limitation: heuristically searching the complete version space to obtain the optimal sampling subset is impossible because of the unpredictable scale of the candidate set.

In practice, it might be more efficient if the optimal classification model could still be trained on a sub-space without any prior experience, which would address the above limitation in a different way [14] [45] [46]. For reliable space scaling, hierarchical sampling utilizes unsupervised learning to obtain the cluster bone points and improve the sampling (see Figure 1(b)). Although this provides positive support with more informative instances, the data points within clusters always have weak or no influence on the current model because of their clear class labels. We call these data points core points. On the other hand, hierarchical sampling does sample some redundant points when it annotates a subtree with its root node's label. Interestingly, after removing the core points, a similar trained model can still be obtained even though only the cluster boundary points are retained (see Figure 1(c)).

In this paper, the cluster boundary point detection problem is considered equivalent to a geometric description problem in which boundary points are located on the geometric surface of a high dimensional enclosing space. Utilizing the geometric features of manifold space, [23] reconstructed the geometric space by local representative sampling for AL, and [24] mapped the underlying geometry of the data by its manifold adaptive kernel space for AL. We therefore treat cluster boundary point detection as enclosing ball boundary fitting, which is popular in hard-margin support vector data description (SVDD) [16]. In this one-class classification problem, fitting the hyperplane of the high dimensional ball is used to improve the generalization of the training model when the trained data labels are imbalanced. To reduce the time consumed by multiple quadratic programming (QP) runs on large scale data, [51] [52] changed the SVM into a minimum enclosing ball (MEB) problem and then iteratively calculated the ball center and radius in a (1+ε) approximation. Trained on the detected core sets, the proposed Core Vector Machine (CVM) performed faster than the SVM and needed fewer support vectors. Especially in the Gaussian kernel, a fixed radius was used to simplify the MEB problem to the EB (Enclosing Ball) problem and accelerate the calculation process of the Ball Vector Machine (BVM) [54]. Without sophisticated heuristic searches in the kernel space, a model trained on points of the high dimensional ball surface can still approximate the optimal solution. However, the MEB alone can neither calculate the fitting hyperplane of the ball nor obtain the real boundary points of the ball, because the kernel data space might not be a complete ball space, or the ball surface might not be tight to the class within the ball (Figure 2(a)).

To obtain a tighter [55] enclosing ball boundary, we split the MEB, which is a global optimization problem, into two types of local minimum enclosing ball (LMEB) problems, where one type is the B-ball (boundary ball) and the other is the C-ball (core ball), and the centers of the B-balls are the enclosing ball boundary points. This approach optimizes the goodness of fit to obtain the complete set of geometric boundary points for each cluster. Figure 2(b) shows the motivation for this approach. The above observations and investigations motivated us to propose a new AL strategy, the Local Enclosing Ball (LEB), which utilizes the MEB approach to obtain the enclosing ball boundary points for AL sampling. Our contributions in this paper are:

  • We propose an idea of reducing the uncertainty sampling space to an enclosing ball boundary hyperplane and validate it in various settings of classification.

  • We develop an AL approach termed LEB that samples independently, without iteration or help from labeled data.

  • We break the theoretical curse of uncertainty sampling via the enclosing ball boundary in AL, since LEB is neither a model-based nor a label-based strategy and has fixed time and space complexity.

  • We conduct experiments to verify that LEB can be applied in multi-class settings to overcome the binary classification limitation of many existing AL approaches.

The remainder of this paper is structured as follows. The preliminaries are described in Section 2, and the performance of the cluster boundary is defined in Section 3.1 (Theorem 1). To prove it, we discuss the model distance (Lemma 1 of Section 3.2) and the inclusion relation of classifiers (Lemma 2 of Section 3.3) between cluster boundary and core points in binary and multi-class settings of low and high dimensional space, respectively. The background for the MEB problem is presented in Section 4.1. Then, we optimize the geometric updates of the radius (Section 4.2) and center (Section 4.3) when extending the local ball space, and the resulting update optimization equation is analyzed in Lemmas 3-5 of Section 4.4. Based on these findings, we design the LEB algorithm in Section 4.5, analyze its time and space complexities in Section 4.6, and discuss its advantages in Section 4.7. The experiments and results, including eight geometric clustering data sets and one unstructured letter recognition data set, are reported in Sections 5.1-5.3. Section 5.4 then further discusses the time performance of different AL approaches. Finally, we conclude the paper in Section 6.

2 Preliminaries

In this section, we first describe the general AL problem, and then classify the unlabeled data into two kinds of objects according to whether a sampled data point will benefit the classifier training. As we cast the AL sampling issue as a geometric cluster boundary detection problem, we also introduce some related geometric structures for geometric AL; the related definitions, main notations, and variables are briefly summarized in Table I.

Given the data space , where , and the label space , consider the classifier:

(1)

where is the parameter vector and is the constant vector, here gives:
Definition 1. Active Learning. Optimize to obtain the minimum RSS (residual sum of squares) [22] [23]:

(2)

i.e.,

(3)

where is the labeled data, is the queried data, and is the updated training set.

Given hypothesis , the error rate change of predicting after adding the queried data is

(4)

where represents the prediction error rate of when training the input classification model.
Definition 2. Effective point: If , then is an effective point that provides positive help for the next training round after it is added to . Here , which is an impact factor that decides whether the data point will affect .
Definition 3. Redundant point: If , is a redundant point that has weak or negative influence on the current and future model .
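As an illustration of Definitions 2 and 3, the sketch below estimates the error-rate change caused by adding a single queried point to the labeled set; it assumes an sklearn-style classifier and a held-out evaluation set, and the helper name error_rate_change is hypothetical rather than part of the paper.

```python
import numpy as np
from sklearn.base import clone

def error_rate_change(clf, X_lab, y_lab, x_new, y_new, X_test, y_test):
    """Change in prediction error after adding one queried point to the training set.
    A positive change above a chosen threshold marks an effective point; otherwise
    the queried point is treated as redundant."""
    base = clone(clf).fit(X_lab, y_lab)
    err_before = np.mean(base.predict(X_test) != y_test)

    X_upd = np.vstack([X_lab, x_new.reshape(1, -1)])   # updated training set
    y_upd = np.append(y_lab, y_new)
    updated = clone(clf).fit(X_upd, y_upd)
    err_after = np.mean(updated.predict(X_test) != y_test)

    return err_before - err_after
```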

In an enclosing geometric space, we cast the AL sampling issue as a cluster boundary point detection problem. Here we introduce some related geometric structures in AL.
Definition 4. Cluster boundary point[11]:
A boundary point is an object that satisfies the following conditions:
1. It is within a dense region .
2. region near , or .
Definition 5. Core point: A core point is an object that satisfies the following conditions:
1. It is within a dense region .
2. an expanded region based on , .
Definition 6. Enclosing ball boundary: An enclosed high dimensional hyperplane connects all the boundary points.
Let define the boundary points of one class, and , where is the number of boundary points, then the closed hyperplane satisfies the following conditions:
1. Most of the boundary points are distributed in the hyperplane.
2.
where , and is a -dimension constant vector.

Notation Definition
classifiers
hyperplane of the enclosing ball
prediction error rate of when training
data set
data number of
number of labeled, unlabeled, queried data
label set
a data point in
labeled data points in
queried data points in
training set after querying
distance function
core points
cluster boundary points
noises
training set of []
core points located inside the positive class
core points located inside the negative class
cluster boundary points located near
noises
core points
boundary points
noises
distribution function
variables
constant
ball
ball center
radius
relaxation variable
C user-defined parameter
coefficient vector
Lagrange parameter
kernel matrix
K(, ) kernel change between and
a point of

’s KNN

Table I: A summary of notations

3 Motivation

In clustering-based AL work, core points are redundant because their class labels are clear, and they provide little help for training the parameters of classifiers. Considering that cluster boundary points may be decisive for the support vectors, CVM and BVM use the points distributed on the hyperplane of an enclosing ball to quickly train core support vectors on large-scale data sets. Their significant success motivates the work of this paper.

To further show the importance of cluster boundary points, we (1) clarify the performance of training on cluster boundary points in Section 3.1, (2) discuss the model distance of boundary and core points to the classification line or hyperplane in Section 3.2, and (3) analyze the inclusion relation of classifiers trained on boundary and core points in Section 3.3, where the cases discussed in (2) and (3) are binary and multi-class classification in low and high dimensional space, respectively.

3.1 Performance of cluster boundary

In this paper, we consider that the performance of the classification model is determined by the cluster boundary points. Therefore, we have

Theorem 1. The performance of the classification model trained on the cluster boundary points is similar to that of the model trained on the complete data set, that is to say,

(5)

where represents the core points, represents the cluster boundary points, and =[].

Theorem 1 aims to show that core points are redundant and have little influence on training h. The objective function is supported by Lemmas 1 and 2 in the next subsections: one states that cluster boundary points are closer to the classification model than other data, and the other that the models trained on core points are a subset of those trained on boundary points. The detailed proofs of the two lemmas are then discussed in binary and multi-class settings of low and high dimensional space, respectively.

3.2 Model distance

The model distance function is defined as the distance from a data point to the classification line or hyperplane. The model distance relation between boundary points and core points is described in the following Lemma 1.
Lemma 1. The model distance of boundary points is smaller than that of core points, that is to say,

(6)

Lemma 1 is divided into three different cases:

  • Corollary 1: binary classification in low dimensional space, where Corollaries 1.1 and 1.2 prove Lemma 1 for the adjacent classes and separation classes cases, respectively.

  • Corollary 2: multi-class classification problem in low dimensional space.

  • Corollary 3: high dimensional space.

Corollary 1: Binary classification in low dimensional space

Two facts hold in this classification setting: (1) data points far from h usually have clearly assigned labels with a high predicted class probability; (2) h is always surrounded by noise points and part of the boundary points. Based on these facts, the proof is as follows.

Corollary 1.1: Adjacent classes

Proof.

For the binary classification of the adjacent classes problem (see Figure 3(a)) with {-1,+1}, we get the result:

(7)

where represents the core points located inside the positive class, represents the core points located inside the negative class, represents the cluster boundary points near h, and represents the noises near h. Here , and represent their numbers of the four types of points.

Because noise points always misguide model training, we focus only on the differences between the core and boundary points, that is to say,

(8)

The distance function between and in space is:

(9)

Because the classifier definition is , , then Lemma 1 is established when (see Figure 3(b)). ∎

Corollary 1.2: Separation classes

Proof.

In the separation classes problem (see Figure 3(c)), a model trained on any of the data points will lead to a strong classification result; that is to say, all AL approaches will perform well in this setting since:

(10)

where represents the boundary points near h in the positive class, represents the boundary points near h in the negative class, , and . Let , , we can still have the results of Eq. (8) and (9). ∎

Corollary 2: Multi-class classification in low dimensional space

Proof.

In this setting, , the classifier set is , and the cluster boundary points are segmented into parts , where represents the data points close to (see Figure 3(d)). Based on the result of Corollary 1, dividing the multi-class classification problem into binary classification problems, we obtain:

(11)

and

(12)

where represents the core points near . Then, the following holds:

(13)

Corollary 3: High dimensional space

Proof.

In high dimensional space, the distance function between and hyperplane is

(14)

where , and is a -dimension vector. Because the above equation is the m-dimension extension of Eq. (9), the proof relating to low dimensional space is still valid in high dimensional space. ∎
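The distance function in Eqs. (9) and (14) is presumably the standard point-to-hyperplane distance d(x) = |w·x + b| / ||w||; a minimal numpy sketch with illustrative values:

```python
import numpy as np

def model_distance(x, w, b):
    """Distance from point x to the hyperplane w.x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Illustrative values: a core-like point far from the separator
# versus a boundary-like point close to it.
w, b = np.array([1.0, -1.0]), 0.0
print(model_distance(np.array([3.0, -3.0]), w, b))   # about 4.24 (core-like)
print(model_distance(np.array([0.2, -0.1]), w, b))   # about 0.21 (boundary-like)
```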

(a)
(b)
(c)
(d)
Figure 3: (a) An example of adjacent classes in two-dimensional space. is a linear classifier. The red diamonds represent Class 1 and the blue squares represent Class 2. are core points and are boundary points. This figure illustrates Eq. (7) and its conclusions. (b) An example of in the binary classification problem, illustrating Eq. (11). (c) An example of separation classes in two-dimensional space, illustrating Eq. (10). (d) An example of segmenting in the multi-class classification problem with = 6.

3.3 Inclusion relation of classifiers

The inclusion relation of classifiers is the containment relation among models trained on different data subsets. Lemma 2 shows this relation for models trained on boundary and core points, respectively.
Lemma 2. Models trained on are a subset of the models trained on , that is to say,

(15)

It shows that models trained on can predict well, but a model trained on may sometimes fail to predict well. To prove this relation, we discuss three different cases:

  • Corollary 4: binary classification in low dimensional space, where Corollary 4.1 and Corollary 4.2 prove Lemma 2 in one-dimension space and two-dimension space, respectively.

  • Corollary 5: binary classification in high dimensional space.

  • Corollary 6: multi-class classification.

Corollary 4: Binary classification in low dimensional space

Corollary 4.1: Linear one-dimension space

Proof.

Given point classifier in the linear one-dimension space as described in Figure 4(a),

(16)

where are core points. In comparison, the boundary points of have smaller distances to the optimal classification model , i.e., . Therefore, it is easy to conclude: Then, classifying and by is successful, but we cannot classify and by , or , respectively. ∎

Corollary 4.2: Two dimensional space

Proof.

Given two core points in the two dimensional space, the line segment between them is described as follows:

(17)

Training and can get the following classifier:

(18)

where is the angle between (see Figure 4(b)).

Similarly, the classifier trained by is subject to:

(19)

where is the line segment between and . Intuitively, the difference between and lies in their constraint equations. Because , we can conclude:

(20)

This shows that cannot classify and when or in the constraint equation. But for any , it can classify correctly. ∎

Corollary 5: High dimensional space

Proof.

Given two core points , the Bounded Hyperplane between them is:

(21)

Training the two data points can get the following classifier:

(22)

where is the angle between and , is the normal vector of . Given point , which is located on , if , in the positive class or in the negative class, cannot predict and correctly. It can also be described as follows: if segments the bounded hyperplane between and , or and , the trained cannot classify and . Then Lemma 2 is established. ∎
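The separation argument in Corollaries 4.2 and 5 reduces to checking whether a classifier's hyperplane cuts the segment (or bounded hyperplane) between two training points, i.e., whether the endpoints fall on opposite sides; a small sketch of that check, with an illustrative function name:

```python
import numpy as np

def separates(w, b, p, q):
    """True if the hyperplane w.x + b = 0 cuts the segment between points p and q,
    i.e. the two endpoints lie on opposite sides, so one of them would be misclassified
    by a classifier with this hyperplane."""
    return np.sign(np.dot(w, p) + b) != np.sign(np.dot(w, q) + b)
```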

Corollary 6: Multi-class classification

Proof.

Like the multi-class classification proof of Lemma 1, the multi-class problem can be segmented into parts of binary classification problems. ∎

(a)
(b)
Figure 4: (a) An example of in one-dimensional space. are two point classifiers. (b) An example of in two-dimensional space.

4 Enclosing ball boundary

The hard-margin support vector data description in one-class classification is equivalent to the MEB (Minimum Enclosing Ball) problem, which attempts to find the radius component of the radius-margin bound and its center; it is described in Section 4.1. To improve the fit of the ball boundary, we split the ball of each cluster into two kinds of small balls: C-balls (core balls) and B-balls (boundary balls), where core balls are located within the clusters and boundary balls are located at the edges of clusters.

Our task in this section is to detect the B-balls of each cluster by calculating the increments of the ball radius (Section 4.2) and center (Section 4.3) when extending the local space, where both types of increments are larger for B-balls than for C-balls. To enhance the difference between the two types of local features, we consider both the radius and center updates and propose an optimization scheme in Section 4.4, then develop the LEB algorithm in Section 4.5. The time and space complexities are analyzed in Section 4.6. Finally, the advantages of our approach are further discussed in Section 4.7.

4.1 MEB in SVDD

The MEB problem is to optimize [51] [52]:

(23)

where is the ball radius, and is the ball center. The corresponding dual is to optimize:

(24)

where is the relaxation variable and is a user-defined parameter to describe . The optimization result is:

(25)

where . According to the conclusion in [51] [52], is close to a constant, and then the optimization task changes to:

(26)
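For intuition, one well-known (1+ε)-approximation of the MEB in input space is the Bădoiu-Clarkson core-set iteration sketched below; it illustrates the MEB idea only and is not the kernelized CVM/BVM solver of [51] [52] [54].

```python
import numpy as np

def approx_meb(X, eps=0.1):
    """Badoiu-Clarkson style (1+eps)-approximate minimum enclosing ball in input space.
    Returns (center, radius) after roughly 1/eps^2 furthest-point updates."""
    c = X[0].astype(float)                    # start from an arbitrary point
    for i in range(1, int(np.ceil(1.0 / eps ** 2)) + 1):
        d = np.linalg.norm(X - c, axis=1)     # distances to the current center
        p = X[np.argmax(d)]                   # furthest point from the center
        c = c + (p - c) / (i + 1)             # shrinking step toward the furthest point
    return c, np.linalg.norm(X - c, axis=1).max()
```

Running it on a two-dimensional cluster returns a center and radius that enclose all points up to the (1+ε) factor.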

4.2 Update radius

Intuitively, the geometric volume of a B-ball is larger than that of a C-ball according to the global characteristics. Therefore, the local characteristic of the radius update when adding more data to the current enclosing ball helps to enhance the scale of this difference.

When the data point is added to at time , the new radius is:

(27)

where , and is the updated kernel matrix after adding. Then, the square increment of the radius is:

(28)

where , , is the -th row of matrix and is the -th column of matrix . Therefore, after - times of adding, the kernel matrix changes to :

(29)

Let , and . The square increment of adding features to is close to:

(30)

In the kernel matrix, , therefore the optimization task changes to:

(31)

4.3 Update center

The path change of the ball center when adding more data to the current enclosing ball is another important local characteristic for distinguishing between B-balls and C-balls, where the length of the center path update of a B-ball is larger than that of a C-ball.

Given as the ball center at time t, the optimization objective function is [54]:

(32)

and the Lagrange equation is:

(33)

On setting its derivative to zero, we obtain:

(34)

where . As such,

(35)

where is a constant. Therefore, the increment of adding features to can be written as:

(36)

The matrix form is:

(37)

where .

4.4 Geometric update optimization

To enhance the difference in the local geometric features of B-balls and C-balls, we consider both the radius and center updates and discuss the properties of the optimized objective function.

The kernel update of and is:

(38)

Let to calculate the update:

(39)

where , .
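The exact kernel-matrix forms of Eqs. (28)-(39) are not reproduced here, so the following is only a rough Euclidean-space illustration of the combined criterion: extend a point's local ball from a small to a larger neighborhood and add the squared radius growth to the squared center shift. B-ball (boundary) points should tend to receive larger scores than C-ball (core) points; the function name and the centroid/max-distance approximations are assumptions for illustration.

```python
import numpy as np

def local_update_score(X, idx, neighbors, k_core=5, k_ext=10):
    """Squared radius growth plus squared center shift when the local ball of point idx
    is extended from its k_core to its k_ext nearest neighbors. neighbors[idx] is a
    precomputed kNN index list (nearest first)."""
    core = X[neighbors[idx][:k_core]]
    ext = X[neighbors[idx][:k_ext]]

    c_core, c_ext = core.mean(axis=0), ext.mean(axis=0)      # approximate ball centers
    r_core = np.linalg.norm(core - c_core, axis=1).max()     # approximate radii
    r_ext = np.linalg.norm(ext - c_ext, axis=1).max()

    return (r_ext ** 2 - r_core ** 2) + np.linalg.norm(c_ext - c_core) ** 2
```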

Next, we establish some properties of this objective function. The detailed proofs of the following lemmas are presented in the Appendix.

Lemma 3. Suppose that , where , . Otherwise, when .

Lemma 4. is a monotonically increasing function on .

Lemma 5.

4.5 LEB algorithm

Based on the conclusions of Lemmas 3-5, we find that the updates of radius and center both increase with the extension of the local ball volume. Therefore, we propose an AL sampling algorithm called LEB (see Algorithm 1).

To calculate the updates of radius and center, we need to capture the neighbors of each data point. After initialization in Lines 1-3, Line 6 calculates the kNN matrix of the data set using a Kd-tree. Then, Lines 7-8 iteratively calculate the update value by Eq. (39) and store it, where the kernel function used is the RBF kernel.

However, the radius and center updates of noise points may sometimes be larger than those of ball boundary points. To smooth out noise, Line 10 sorts the update values in descending order, where the querying number determines how many are kept. After sorting, the sorted values and their corresponding positions are stored in two matrices.

Intuitively, a data point whose update value falls in the top interval is a noise point, and one whose value falls in the following interval is a ball boundary point, where the round-down operation determines the interval boundary. In other words, the input noise-ratio parameter provides an effective linear segmentation between noise and query data according to their update values from Eq. (39).

After capturing the update range of the ball boundary points, Line 11 finds the positions of the queried data, and Lines 12-14 then return the queried data accordingly. Finally, the expert annotates the queried data in Line 15.

Algorithm 1. LEB
Input: data set with samples,
 number of queries ,
 nearest neighbor number k,
 noise ratio .
Output: Queried data
1: Initialize:
2:
3:
4: Begin:
5: for i=1 to do
6:  Calculate the NN of by Kd-tree and store them in
7: Let , then calculate using Eq. (39)
8: and store in
9:  endfor
10: [ sort(, descending, )
11:
12: for i=1 to do
13:  add to matrix
14:  endfor
15: Query the labels of all data of
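A compact Python sketch of Algorithm 1 follows. It mirrors the structure above (Kd-tree kNN, per-point update score, descending sort, noise skipping), but because the exact update value of Eq. (39) is not reproduced here, the score uses a simple RBF-based stand-in; gamma and the helper name leb_sampling are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def leb_sampling(X, n_queries, k=10, noise_ratio=0.05, gamma=1.0):
    """Sketch of Algorithm 1 (LEB): score every point by a local-ball update value,
    sort the scores in descending order, skip the top noise fraction, and return the
    next n_queries indices as enclosing-ball boundary candidates."""
    tree = cKDTree(X)                                   # Line 6: kNN via a Kd-tree
    _, knn = tree.query(X, k=k + 1)                     # first neighbour is the point itself

    scores = np.empty(len(X))
    for i in range(len(X)):                             # Lines 7-8: per-point update score
        nbrs = X[knn[i, 1:]]
        rbf = np.exp(-gamma * np.sum((nbrs - X[i]) ** 2, axis=1))  # RBF similarities
        scores[i] = 1.0 - rbf.mean()                    # stand-in for the Eq. (39) update

    order = np.argsort(-scores)                         # Line 10: sort descending
    skip = int(np.floor(noise_ratio * len(X)))          # top scores treated as noise
    return order[skip:skip + n_queries]                 # Lines 11-14: queried indices
```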

4.6 Time and space complexities

In model-based approaches, the time complexity of training classifiers determines the time consumption of the sampling process. Since the time complexity of SVM training is to , we predict that Margin's time cost will rise to to with a given query number of , where is the number of labeled data. For the Hierarchical [10] AL approach, hierarchical clustering is its main time-consuming process, which costs . Similarly, calculating the kernel matrix also costs in TED [22]. Although Re-active [21] is a novel idea, it still needs to visit the whole version space to select a data point, using approximately runs of SVM training. This means that the time complexity of one selection is to , and the time consumption of sampling data points is to . (Detailed descriptions of Hierarchical, TED, and Re-active are presented in Section 5.1.)

In our LEB approach, Line 6 uses the Kd-tree to calculate the kNN matrix of the data set with a time complexity of , Lines 7-8 cost to calculate the radius and center updates of each data point, Line 10 costs for sorting, and Lines 11-14 return the boundary points of . After that, we can train on the boundary points within a short time . Therefore, the total time complexity is

(40)

Standard SVM training has a space complexity of when the training set size is , and is thus computationally expensive on data sets with a large number of samples. By observing the iterative sampling process of model-based AL approaches, we conclude that these approaches incur space complexity. However, our LEB approach uses a tree structure to calculate the kNN matrix, which is cheap, with a space consumption of . Therefore, the space complexity of LEB is lower than that of the other model-based AL approaches.

4.7 Advantages of LEB

Our investigation finds that many existing AL algorithms that need labeled data for training are model-based and suffer from the model curse. To describe this problem, we summarize the iterative sampling model in Algorithm 2. In its description, Lines 6-10 calculate the uncertainty function, Line 11 finds the position of the data point with the maximum uncertainty, where denotes this operation, and Lines 12-13 update the labeled and unlabeled sets. After iterations, Lines 15-16 train the classifier and return the error rate of predicting .

Interestingly, different labeled data will lead to different iterative sampling sets because is always retrained after updating and . The matrix must then be recalculated in each iteration. In addition, some AL algorithms only work in special scenarios, for example: (1) margin-based AL approaches only work with SVM classification; (2) entropy-based AL only works with probabilistic classifiers or probability return values. Table II summarizes the properties of the different AL approaches.

From the analysis we can see that all of the reported approaches need iterative sampling and the support of labeled data, and incur high time consumption. Many AL algorithms pay too much attention to the uncertainty of the classification model, since the data the model is unfamiliar with are their main sampling objects. In contrast, our proposed LEB algorithm needs neither iteration nor labeled data to sample, and the sampled points can be trained with any available classifier, whether in binary or multi-class settings.

Approach Model Iteration Label support Classifier Multi-class Time consumption
Margin SVM Y Y SVM Y Uncertain
Entropy Uncertain probability Y Y Probability classifier Y Uncertain
Hierarchical Clustering Y Y Any Y
TED Experimental optimization Y Y Any Y
Re-active Maximize the model difference Y Y Any N Uncertain, but high
LEB Enclosing ball boundary N N Any N
Table II: Properties of different active learning strategies. 'Y' represents 'Yes', 'N' represents 'No', and 'Uncertain' means the time consumption is hard to evaluate because it depends on the sampling number or the time complexity of the classifier.
Algorithm 2. Iterative sampling
Input: ,
 number of queries ,
 labeled data
Output: prediction error rate
1:Initialize: uncertainty function ,
2: ,
3:  
4:  unlabeled data and it has data
5: while
6:  for i=1:1:
7:   =train()
8:   calculate the based on
9:   store it in matrix
10:  endfor
11: )
12: ]
13: update
14: endwhile
15: =train()
16: return = err()
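For comparison, a minimal sketch of the iterative sampling loop in Algorithm 2, assuming an sklearn-style probabilistic classifier and least-confidence uncertainty; y_pool stands in for the oracle that supplies labels on request.

```python
import numpy as np
from sklearn.base import clone

def iterative_sampling(clf, X_lab, y_lab, X_pool, y_pool, X_test, y_test, n_queries):
    """Sketch of Algorithm 2: retrain, score the pool, query the most uncertain point,
    move it to the labeled set, repeat; finally report the prediction error rate."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    pool_idx = list(range(len(X_pool)))

    for _ in range(n_queries):
        model = clone(clf).fit(X_lab, y_lab)                 # Line 7: retrain
        proba = model.predict_proba(X_pool[pool_idx])        # Lines 8-9: uncertainty scores
        uncertainty = 1.0 - proba.max(axis=1)
        j = pool_idx[int(np.argmax(uncertainty))]            # Line 11: most uncertain point
        X_lab = np.vstack([X_lab, X_pool[j:j + 1]])          # Lines 12-13: update the sets
        y_lab = np.append(y_lab, y_pool[j])                  # (oracle supplies the label)
        pool_idx.remove(j)

    final = clone(clf).fit(X_lab, y_lab)                     # Lines 15-16: train and evaluate
    return np.mean(final.predict(X_test) != y_test)
```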

5 Experiments

To demonstrate the effectiveness of our proposed LEB algorithm, we evaluate and compare its classification performance with existing algorithms on eight clustering data sets (structured data sets), which have clear geometric boundaries, and one letter recognition data set (unstructured data set). The structure of this section is as follows: Sections 5.1 and 5.2 describe the baselines and the tested data sets, respectively; Section 5.3 describes the experimental settings and analyzes the results; and Section 5.4 discusses the time and space performance of the different AL approaches.

5.1 Baselines

Several algorithms proposed in the literature [5] [10] [22] [21] are compared with LEB, where Random is a sampling strategy without any guidance, Margin is based on SVM, Hierarchical is a clustering-based AL approach, TED is a statistical experimental optimization approach, and Re-active maximizes the model differences:

  • Random, which uses a random sampling strategy to query unlabeled data, and can be applied to any AL task but with an uncertain result.

  • Margin [5], which selects the unlabeled data point with the shortest distance to the classification model, can only be used with the SVM [26] [27] [28] classification model (a minimal sketch of this criterion is given after this list).

  • Hierarchical [10] sampling is a very different idea compared with many existing AL approaches. It labels a subtree with its root node's label when the subtree meets the objective probability function, but incorrect labeling always leads to a bad classification result.

  • TED [22] favors data points that are hard to predict on the one hand and representative of the remaining unlabeled data on the other.

  • Re-active [21] learning finds the data point that has the maximum influence on the future prediction result after annotating the selected data with positive and negative labels. This novel idea does not need to query the label information of unlabeled data when relabeling, but it needs a well-trained classification model at the beginning. Furthermore, the reported approach cannot be applied to multi-class classification problems without extension.
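As referenced in the Margin entry above, a minimal binary-classification sketch of margin sampling with an sklearn SVM (the helper name is illustrative):

```python
import numpy as np

def margin_sampling(svm, X_pool, n_queries):
    """Margin baseline (binary case): query the pool points closest to the decision
    boundary of a fitted sklearn.svm.SVC (or any model with decision_function)."""
    dist = np.abs(svm.decision_function(X_pool))   # unsigned margin-like score
    return np.argsort(dist)[:n_queries]            # smallest margins first
```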

5.2 Data sets

We compare the best classification results of the different algorithms on several structured data sets [30] and one unstructured letter recognition data set, letter.

  • g2-2-30 [31]: 2048×2. There are 2 adjacent clusters in the data set.

  • Flame [36]: 240×2. It has 2 adjacent clusters with similar densities.

  • Jain [35]: 373×2. It has two adjacent clusters with different densities.

  • Pathbased [33]: 300×2. Two clusters are close together and surrounded by an arc-shaped cluster.

  • Spiral [33]: 312×2. There are three spiral curve clusters which are linearly inseparable.

  • Aggregation [32]: 788×2. There are 7 adjacent clusters in the data set, connected by noise.

  • R15 [34]: 600×2. There are 7 separate clusters and 8 adjacent clusters.

  • D31 [34]: 3100×2. It has 31 adjacent Gaussian clusters.

  • letter [37] [38]: 20000×16. It is a classical letter recognition data set with 26 English letters. We select 5 pairs of letters which are difficult to distinguish from each other to test the above AL algorithms in a two-class setting: DvsP, EvsF, IvsJ, MvsN, and UvsV. For the multi-class tests, we select A-D, A-H, A-L, A-P, A-T, A-X, and A-Z, where A-D is the letter set A to D, A-H is the letter set A to H, ..., and A-Z is the letter set A to Z. The seven multi-class sets have 4, 8, 12, 16, 20, 24, and 26 classes, respectively.

In addition to these descriptions, all of the two-dimensional data sets are shown in Figure 5.

Figure 5: The classical clustering data sets. (a) g2-2-30 (b) Flame (c) Jain (d) Pathbased (e) Spiral (f) Aggregation (g) R15 (h) D31.
Figure 6: The marked cluster boundary points of Aggregation and Flame are in blue circles.
(a)
(b)
(c)
Figure 7: The AL process on Flame, where represents the -th sampled data point and Acc represents the prediction accuracy.
Data sets Num_C Algorithms Number of queries (percentage of the data set)
1% 5% 10% 15% 20% 30% 40% 50% 60%
g2-2-30 2 Random .516±.026 .546±.012 .603±.028 .652±.029 .693±.031 .767±.026 .815±.026 .849±.021 .881±.022
Margin .500±.000 .509±.015 .551±.047 .590±.076 .644±.103 .709±.153 .822±.139 .882±.161 .927±.188
Hierarchical .504±.000 .550±.000 .585±.000 .615±.000 .668±.000 .774±.014 .847±.000 .920±.011 .974±.000
TED .610±.000 .619±.009 .651±.003 .759±.006 .848±.007 .875±.005 .901±.005 .964±.005 .972±.000
Re-active .506±.008 .531±.029 .554±.052 .593±.065 .634±.058 .744±.060 .715±.047 .811±.000 .816±.000
LEB .724±.163 .725±.022 .790±.021 .825±.018 .886±.012 .909±.013 .927±.011 .994±.008 1.00±.000
Flame 2 Random .670±.142 .794±.106 .904±.059 .944±.036 .958±.025 .976±.014 .984±.008 .987±.005 .990±.006
Margin .499±.137 .596±.102 .740±.162 .872±.158 .930±.159 .935±.145 .961±.120 .963±.109 .944±.165
Hierarchical .720±.041 .607±.042 .855±.062 .972±.010 .999±.000