1 Introduction
Active learning (AL), which leverages abundant unlabeled data to improve model performance, has been widely adopted in machine learning tasks such as text processing (Li et al., 2012), image annotation (Li and Guo, 2013), multi-label classification (Huang et al., 2017; Huang et al., 2015), and multi-task learning (Harpale and Yang, 2010; Fang et al., 2017). Through AL, researchers can strategically query "highly informative" data (McCallumzy and Nigamy, 1998) to reduce the error rate of the current learning model in different classification tasks. However, a natural question arises: if we increase the size of the active query set, will the error rate keep decreasing? Moreover, can we find a hypothesis whose prediction error rate is nearly as low as we desire? This question has been considered in the Agnostic Active Learning (A²) community, which studies hypothesis relationships in the VC (Vapnik-Chervonenkis) class (Vapnik and Chervonenkis, 2015) of the version space by splitting hypothesis edges. With the goal of minimizing the number of queries on unlabeled data, A² algorithms (Dasgupta et al., 2008; Balcan et al., 2006) try to improve the learning model with labeled data sampled from various distributions and diverse noise conditions. In particular, halfspace learning (Gonen et al., 2013) is an agnostic active learning problem over a unit sphere that explores theoretical guarantees on the error rate and the label complexity (the number of labels requested before achieving a desired error).
Under the assumption of a uniform distribution over an enclosed unit sphere with a bounded noise condition, an A² algorithm that actively learns a homogeneous halfspace can decrease the label complexity. However, the learned halfspaces depend heavily on the initial hypothesis. For example, Figure 1(a) describes a binary classification task on a two-dimensional sphere (circle) with uniform distribution, where an arbitrary halfspace generates a perceptron classifier (Dasgupta et al., 2005). To reduce the error rate of the initial hypothesis, researchers sample target points from the colored candidate pool to shrink its "volume" (area) as fast as possible, since the volume size decides the label complexity. However, there exists a challenging gap for those A² algorithms: the VC dimension of sampling depends heavily on the distance between the initial and target (optimal) hypotheses, and this problematic setting is called "target-dependent" label complexity (Gonen et al., 2013). That is, a poor input hypothesis which is far away from the desired hypothesis leads to a large volume of the candidate hypothesis pool, and the label complexity of querying then increases dramatically. Therefore, the volume of the version space decides how strongly the querying target depends on the initial hypothesis, and a series of AL approaches that shrink the volume of the version space can be loosely described as volume-splitting. To decrease the dependence on the initial hypothesis, Dasgupta (2006) removed the hypothesis edges with large model distances to split the volume of the version space. In that work, the version space, which includes all feasible hypotheses, is embedded as a graph in a high-dimensional space. After splitting, the VC dimension decreases dramatically, and the upper bound on the label complexity is also reduced. However, volume-splitting of the version space still cannot tackle the target-dependent problem successfully since: 1) volume-splitting in the candidate data pool can reduce the effect of the initial hypothesis but cannot completely remove the dependence; 2) volume-splitting is hard to implement in active learning tasks over the input distribution, although this theoretical description has attracted the attention of researchers.
Thus, it is necessary to find an alternative splitting strategy that both reaches the same theoretical goal as volume-splitting of the version space and handles application tasks over the input distribution. Inspired by these observations, we collect evidence on the relationship between the version space and the input distribution.
From the perspective of (Cao et al., 2018a), the VC dimension of the version space decides the number of hypotheses and plays an important role in describing its volume. We are therefore motivated to map the notion of volume onto the number density of the input distribution. Especially for a bounded local distribution, the more data located in the input distribution space, the larger the volume of the version space (see Figure 2). Another perspective that supports our assumption is that the input distribution induces a natural topology on the version space, and a local hypothesis easily captures its relevant local distribution (Dasgupta et al., 2008). Based on these perspectives, we perform the splitting on the input distribution with the following advantages: 1) it provides theoretical guarantees on reducing label complexity and error rate, as volume-splitting does; 2) it breaks the curse of the initial hypothesis; 3) it provides model guidance for a target-independent AL algorithm in real AL tasks.
In this paper, we first shrink the volume of the version space by halving the number density of the input distribution, obtaining halfspaces with reduced VC dimension. With the theoretical advantages of halfspaces on error difference and label complexity, we then split the input distribution into near-optimal spheres to represent its structure. The contributions of this paper are as follows.

We connect the version space and the input distribution through number density, which provides a more natural and direct way to describe the hypothesis distribution than volume.

We present theoretical improvement guarantees on the error rate difference, fallback analysis, and label complexities under noise conditions for the halfspace that halves the number density of the input distribution.

We propose a distribution-splitting strategy for learning the structure of the input distribution.

We propose a "target-independent" AL algorithm called DA² which is independent of the initial labeled set and classifier.
The outline of the paper is as follows. Related studies are mentioned in Section 2. In Section 3, we describe the preliminaries of this paper. The theoretical motivation of hypothesis and distribution is presented in Section 4. The proposed distributionsplitting strategy and the advantages of halfspaces are reported in Section 5. Experimental results of halving and querying abilities of the proposed algorithm are presented in Section 6. Conclusions are drawn in Section 7.
2 Related Work
AL algorithms over the input distribution aim to find the data that are "highly informative or representative" for the unseen queries. However, the sampling targets depend heavily on the labeled set and the classifiers.
To find an efficient labeling-tolerant algorithm, agnostic active learners study the optimal hypothesis in the VC class (or hypothesis class). The goal is to improve the performance guarantee of the current hypothesis by minimizing the label complexity in the version space under different distribution assumptions. In this section, Section 2.1 introduces AL strategies with the input distribution, Section 2.2 presents theoretical AL with the VC class, and Section 2.3 describes halfspaces and volume-splitting.
2.1 Active Learning with Input Distribution
An AL algorithm samples the data that can significantly improve the performance of the current learning model. With AL sampling strategies, annotators no longer need to label such large amounts of data, which is expensive in human cost. To optimize the querying process, the membership-query strategy (Freund et al., 1997) tried to find the most distinguishing sample in a set by asking subset membership queries. In structured data spaces with margins, AL algorithms optimized the queries close to the hyperplane with the support of SVM theory. These approaches can be loosely described as finding the "highly informative" samples (Tosh and Dasgupta, 2017; Gal et al., 2017; McCallumzy and Nigamy, 1998) to improve the performance of the learned model. Another sampling criterion is to pick representative samples from the unlabeled data pool. With each query, researchers want to minimize the distribution difference between the labeled and original input sets, as in experimental design (Yu et al., 2006; Zhang et al., 2011), which measures the regression between data and their labels. To strengthen the sampling performance, (Huang et al., 2014) combined informativeness and representativeness into a uniform standard with a given evaluation function, and a series of AL algorithms exploiting both characteristics followed (Du et al., 2017).
However, these strategies depend heavily on the categories of classifiers and the labeled set. For example, AL algorithms using the maximum margin of a classification hyperplane cannot break the curse of the SVM classifier. Moreover, the influence of the size of the input labeled set is underestimated. In our investigation of previous AL work, researchers often query at least 50% of the total data as the initial labeled set, as in (Cai and He, 2012; Du et al., 2017; Shi and Shen, 2016). However, the real world might provide only very few labeled data, and the percentage of labeled data may be only a tiny fraction of the total data set.
2.2 Active Learning with VC Class
To reduce the dependence on the labeled set, agnostic AL tries to find an optimal hypothesis from the hypothesis class in the version space. In this theoretical learning task, we are given access to a stream of unlabeled data drawn i.i.d. from a fixed distribution. Since agnostic AL can achieve dramatic reductions in label complexity, substantial agnostic AL frameworks under various assumptions on the classifiers and the labeled set have been proposed (Dasgupta, 2006; Balcan et al., 2006). Among them, the most popular strategy is pool-based AL, which assumes there exists a perfect classifier given a fixed sample number. With this prior assumption, one wants to minimize the number of queries needed to decrease the difference between the perfect and final hypotheses. However, most of these agnostic AL algorithms either make strong distribution assumptions, such as separability or a uniform input distribution, or are computationally prohibitive, so they cannot be applied effectively in AL application tasks (Dasgupta et al., 2008).
2.3 Halfspaces and VolumeSplitting
A halfspace is either of the two convex sets into which a hyperplane divides a bounded affine space. In halfspace learning, researchers study perceptron training under the uniform distribution to find the optimal halfspace over a unit sphere, as in Figure 1(a). The sampling goal is to reduce the angle between the classifier vector and the optimal one as fast as possible, since this angle decides the volume of the candidate hypothesis pool. Under this training assumption, researchers try to reduce the label complexity by two methods: (1) halving the volume of the candidate pool into a halfspace, and (2) binary search for halving (Tong and Koller, 2001). By halving, the learner can quickly reduce the VC dimension of the version space and thereby the label complexity of querying, since part of the hypotheses are removed. Therefore, the volume-splitting strategy is an effective solution in both AL theory and application tasks.
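To make the halving intuition concrete, here is a minimal sketch (our illustration, not the paper's procedure): the hypothesis is a single angle on the unit circle, and each label query halves the surviving angle interval, so reaching precision `eps` costs on the order of log(1/eps) labels. The oracle angle `theta_star` is a hypothetical stand-in used only to answer queries.

```python
import math

def binary_search_halfspace(theta_star, eps):
    """Halve the candidate angle interval with one label query per step.

    theta_star: the unknown optimal decision angle (used only as the oracle).
    Returns (estimate, number_of_label_queries). Illustrative sketch of
    binary-search halving, not the exact procedure of Tong & Koller (2001).
    """
    lo, hi = 0.0, math.pi              # candidate interval for the angle
    queries = 0
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        queries += 1                   # one label query picks the half to keep
        if theta_star > mid:
            lo = mid                   # optimal angle lies in the upper half
        else:
            hi = mid                   # optimal angle lies in the lower half
    return (lo + hi) / 2.0, queries

est, q = binary_search_halfspace(theta_star=1.0, eps=1e-3)
```

Because the interval width shrinks geometrically, about log2(pi/eps) (roughly 12 for eps = 1e-3) queries suffice, which is the label-complexity saving that halving promises.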
Table 1: Mathematical notations.

Notation  Definition
X  Input data set
H  Hypothesis (or VC) class, which covers all feasible hypotheses
h  Hypothesis with a special setting
h*  The optimal hypothesis in the VC class
D  Distribution over X
D_1/2  Distribution of the halfspace
D'  Distribution with a special setting
err_D(h)  Error rate of predicting X by training h over D
err_D'(h)  Error rate of predicting X by training h over D'
B(c, r)  The sphere with center c and radius r
X_1/2  The halfspace obtained by 1/2-splitting
d  VC dimension of the input set
P(·|·)  Conditional probability
ε  Error rate
ε'  Error rate with a special setting
δ  Probability
Λ(ε, δ)  Label complexity with error rate ε and hypothesis probability δ
vol(·)  Volume of the input object
ρ  Number density of the input space
θ  Vector angle or disagreement coefficient
θ'  Vector angle with a special setting
q_i  The i-th query
Q  The total number of queries
n  The number of unlabeled data in the candidate pool
n_1/2  The number of data in the halfspace
dist(·, ·)  The distance between the two input objects
3 Preliminaries
In this section, we present the preliminaries for the learning rule of A² algorithms, which use a disagreement coefficient to control the sampling inputs, in Section 3.1. We then present one important volume-splitting strategy, known as splitting, in Section 3.2. Finally, we define target-dependent and target-independent agnostic AL via halfspace learning in Section 3.3.
3.1 Learning Rule
The A² algorithm queries the label of an example based on an empirical rule: the error rate difference after assigning positive and negative labels. To describe the basic model of A² algorithms, we present some preliminaries in this section. The mathematical notations of this paper are summarized in Table 1.
Given a data set with binary class labels and a distribution over it, we divide the data into two groups: the labeled set and the unlabeled set. For an input set, we consider the error rate of prediction after training on it, the queried data under different label assumptions, the i-th query, the labeled set at the i-th query, and the total number of queries. We define the empirical rule for querying by the sampling margin,
(1) 
where the two terms represent the classification hypotheses after querying with different label assumptions. Using this rule, the margin quickly shrinks the number of unlabeled data in the candidate pool; this in turn shrinks the volume of the version space and reduces the VC dimension.
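This querying rule can be illustrated with 1-D threshold classifiers (a hedged sketch of disagreement-based querying in the same spirit, not the paper's algorithm): a point is queried only when the surviving hypotheses would disagree on it under the two label assumptions, i.e., when it falls strictly inside the current version-space interval. `true_t` is a hypothetical oracle threshold.

```python
import random

def disagreement_based_al(xs, true_t, seed=0):
    """Stream-based sketch: query a point only when the surviving threshold
    classifiers (predict +1 iff x >= t) disagree on its label."""
    rng = random.Random(seed)
    xs = xs[:]
    rng.shuffle(xs)                          # simulate an unlabeled stream
    lo, hi = min(xs) - 1.0, max(xs) + 1.0    # all thresholds still feasible
    queries = 0
    for x in xs:
        if not (lo < x < hi):
            continue                          # label inferred, no query needed
        queries += 1
        if x >= true_t:                       # oracle label +1 -> t <= x
            hi = x
        else:                                 # oracle label -1 -> t > x
            lo = x
    return (lo + hi) / 2.0, queries

xs = [i / 100.0 for i in range(100)]
t_hat, q = disagreement_based_al(xs, true_t=0.5)
```

Every query removes one side of the interval, so far fewer labels are spent than the 100 points in the pool while the threshold estimate still lands on the correct grid cell.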
To estimate the sampling region, A² algorithms use the disagreement coefficient θ (Hanneke, 2007) to control the sampling inputs,

θ = sup_{r>0} P[DIS(B(h*, r))] / r,  (2)

where B(h*, r) = {h in the hypothesis class : dist(h, h*) ≤ r} is the ball of hypotheses around the optimal hypothesis h*, which has the lowest error rate, DIS(·) is the region of the input space on which some pair of hypotheses in the ball disagree, and dist denotes the metric distance between two hypotheses.
Generally, θ bounds the querying amount by calculating the probability mass of the disagreement region of the ball B(h*, r). It is a theoretical quantity in the version space which highlights the learning model of AL.
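The coefficient can also be estimated numerically. The sketch below (our illustration, assuming threshold classifiers under the uniform distribution on [0, 1], where the distance between two thresholds is the probability mass between them) estimates the disagreement mass of a ball of radius r by Monte Carlo; dividing by r recovers the well-known value θ = 2 for thresholds.

```python
import random

def disagreement_mass(t_star, r, n=100000, seed=1):
    """Monte-Carlo mass of DIS(B(t_star, r)) for threshold classifiers
    under the uniform distribution on [0, 1]. Two thresholds disagree on x
    iff x lies between them, so the ball of radius r around t_star
    disagrees exactly on the interval (t_star - r, t_star + r)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = rng.random()
        if t_star - r < x < t_star + r:
            hits += 1
    return hits / n

r = 0.1
theta = disagreement_mass(0.5, r) / r    # estimated disagreement coefficient
```

The estimate concentrates near 2, matching the textbook value for this class and showing how θ translates a hypothesis-space radius into an input-space query mass.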
3.2 VolumeSplitting
To find the data that could reduce the VC dimension of the unlabeled data pool after querying their labels, the volume-splitting strategy designs a margin-based evaluation rule; splitting (Dasgupta et al., 2008) is one important approach. The rule is to find the subset of hypotheses which satisfies the splitting condition (Dasgupta et al., 2008) over all finite edge-sets, given a splitting fraction, an error rate, and the number of unlabeled data needed. Under this policy, the splitting rule for the hypothesis edges is defined as
(3) 
Following this, the splitting condition is defined as
(4) 
To find the instance with the highest informativeness, agnostic AL algorithms select the data which maximally split the hypothesis edges, and then return the remaining hypotheses. Considering the effectiveness of the splitting algorithm, we study it in an agnostic AL task which is independent of the structural assumption of a uniform distribution. Without a fixed assumption on the distribution, the hypothesis distances become the most important splitting factor. Given the current hypothesis and two candidate sampling points, there exists a new disagreement coefficient for the agnostic AL task,
(5) 
Letting the candidate pool contain a given number of unlabeled data, we obtain the following hypothesis relationship,
(6) 
Based on this relationship, we remove some hypotheses to shrink the VC dimension.
3.3 TargetDependent and TargetIndependent Agnostic AL
Halfspace learning provides a clear visualization of the hypothesis relationships. Based on this advantage, we define target-independent active learning over a unit sphere in this section.
Target-dependent label complexity is a challenging problem in AL. The performance of the unseen sampling process depends heavily on the initial hypothesis, since AL methods always select the points which maximize the hypothesis or distribution update. Here we present the related mathematical definitions for target-dependent agnostic AL via halfspace learning. First, we define the halfspaces:
Definition 1.
Let the perceptron vector at the i-th query and the angle between it and the optimal vector be given; we state the following definitions:
Definition 2.
Target-dependent agnostic AL. With a given probability of selecting an arbitrary point over a unit sphere, the probability of achieving a lower error rate is bounded. Even using the halving algorithm, the label complexity depends on the target.
Different from target-dependent AL, target-independent AL requires that the unseen sampled targets be independent of the initial hypothesis.
Definition 3.
Target-independent agnostic AL. With a given probability of selecting an arbitrary point over a unit sphere, the probability of achieving a lower error rate is bounded, but with a bounded label complexity that does not depend on the initial hypothesis.
Intuitively, learning the structure of the distribution could help to address the target-dependent problem with a certain sampling selection. In real AL tasks, target-independence means the sampled target depends only on the input training set, not on the initial hypothesis.
4 Hypothesis and Distribution
In Section 4.1, we first reveal the monotonicity property of the active query set, showing that the error rate change after querying is uncertain. We then discuss the bottleneck of informative AL and describe our splitting rule by representation sampling in Section 4.2. Finally, we discuss the relationship between the error rate and the number density of the input distribution in Section 4.3.
Based on these theoretical analyses, we are motivated to perform the splitting on the input distribution, learning its structure to address the target-dependence challenge.
4.1 Monotonic Property of Active Query Set
To observe the error rate change after increasing the size of the active query set, we follow the perceptron training setting (see Figure 1(a)) (Dasgupta et al., 2005) to analyze the hypothesis relationship. In our view, training the updated hypothesis results in two uncertain situations: (1) the error rate declines after querying, or (2) the error rate shows negative (or slow) improvement even after querying a large amount of unlabeled data. Therefore, the monotonicity between the active query set size and the error rate is unknown. The following theorem gives a mathematical description of this observation.
Theorem 4.1.
The monotonicity between the active query set and the error rate is unsatisfactory or negative. Let two error rates result from training on an active query set and on an enlarged one, respectively. The probability that the enlarged set yields a lower error rate is uncertain.
Proof.
Suppose the vector is the optimal diameter which classifies the unit circle correctly, the current classification perceptron at the i-th query is given, and the angle between the two vectors is known. We divide the circle into two parts: the area between the two vectors and the area outside it. Consider the two cases: (1) when the query lies outside the disagreement area, the hypothesis is unchanged and so is the error rate; (2) when the query lies inside the disagreement area, the update may either decrease or increase the error rate. Hence the probability relationship of Theorem 1 holds. ∎
Theorem 1 describes the first perspective of this paper on the relationship between hypothesis performance and active query set size: increasing the active query set does not guarantee that the error rate decreases.
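A small simulation (our illustration, in the spirit of the perceptron setting of Dasgupta et al. (2005), with hypothetical parameter choices) exhibits both behaviors described by Theorem 1: the error rate trends downward over many queries, yet individual perceptron updates can increase it.

```python
import math, random

def angle_error(w, w_star):
    """Error of the halfspace at angle w under the uniform distribution on
    the circle: proportional to the angle between w and w_star."""
    d = abs(w - w_star) % (2 * math.pi)
    return min(d, 2 * math.pi - d) / math.pi

def perceptron_on_circle(n_queries=200, seed=3):
    rng = random.Random(seed)
    w_star, w = 0.0, 2.0                         # poor initial hypothesis
    errors = [angle_error(w, w_star)]
    for _ in range(n_queries):
        phi = rng.uniform(0, 2 * math.pi)        # random query point
        x = (math.cos(phi), math.sin(phi))
        y = 1 if math.cos(phi - w_star) >= 0 else -1   # oracle label
        pred = 1 if math.cos(phi - w) >= 0 else -1
        if pred != y:                            # perceptron update on mistake
            w = math.atan2(math.sin(w) + y * x[1], math.cos(w) + y * x[0])
        errors.append(angle_error(w, w_star))
    return errors

errs = perceptron_on_circle()
increases = sum(1 for a, b in zip(errs, errs[1:]) if b > a + 1e-12)
```

The error sequence improves overall, but `increases` counts steps where the update overshoots and the error grows, which is exactly the non-monotonicity the theorem asserts.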
4.2 Bottleneck of Querying Informative Sample
Researchers want to obtain the optimal hypothesis path in the version space by querying informative points (see Figure 3(a)). However, the hypothesis distances are very hard to calculate. Especially when the initial hypothesis is far from the optimal hypothesis, the path-finding process becomes even more difficult. Thus, there exists a bottleneck for AL sampling via querying informative samples.
Since the VC dimension strongly affects the path-finding process, we use volume-splitting to split the version space into a number of near-optimal spheres (Figure 3(b)), and then retain the structure of the version space for AL sampling (Figure 3(c)). This strategy aims to shrink the number of candidate paths and thereby reduce the uncertainty of sampling during path finding. However, since this strategy operates in the version space, we need a new splitting strategy which not only achieves the same representation results as volume-splitting but also applies to AL with the input distribution. Therefore, we apply the splitting idea to the input distribution by finding local balls with the following rules.
Given the ball which tightly encloses the input set and a set of local balls contained in it, let the volume (Cao et al., 2018a) and radii of the input hypothesis class be defined accordingly. The splitting must satisfy four conditions on the volumes, radii, and centers of the local balls and of the enclosing ball, where the i-th split ball has its own center.
4.3 Error Rate with Number Density
Following the perceptron training in the unit circle with uniform distribution, we find that the error rate grows with the number density of the input distribution.
Theorem 4.2.
Under the stated assumption, the error rate grows with the number density of the distribution (the volume of the unit circle being π).
Proof.
Let a unit circle contain a given number of samples; the number density is the sample count divided by the volume. After querying more samples, the count and hence the density increase. Since the volume of the unit circle is π, the density grows linearly with the sample count, and the theorem is as stated. ∎
The error rate difference represents the distance between two arbitrary hypotheses. From the above theorem, number density affects the hypothesis distances. Moreover, the number density largely decides the VC dimension, since the VC dimension grows with the number of data in the space. For these two reasons, number density is a more direct way to describe the hypothesis distribution than the volume of the version space. Therefore, we are motivated to map the volume of the version space onto the number density of the input distribution, both to reduce the VC dimension and to obtain a lower label complexity. We will further define the number density for real AL tasks in Section 5.3.2.
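As a toy numeric check of the density-as-count-per-volume notion (our illustration; the gamma-function formula for the volume of a d-dimensional ball is the standard one):

```python
import math

def number_density(n_points, radius, dim=2):
    """Number density of a ball: points per unit volume (area in 2-D).
    vol(B_d(r)) = pi^(d/2) / Gamma(d/2 + 1) * r^d is the standard formula."""
    vol = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim
    return n_points / vol

rho_full = number_density(1000, 1.0)   # unit circle holding 1000 samples
rho_half = number_density(500, 1.0)    # after halving the sample count
```

Halving the sample count at fixed volume halves the density, which is exactly the quantity the halving step of Section 5.1 manipulates.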
5 DistributionSplitting for Agnostic Active Learning
Using a heuristic greedy selection, we halve the number density of the input distribution to obtain halfspaces in Section 5.1, and then discuss their theoretical advantages in Section 5.2. With these guarantees, Section 5.3 splits the halfspace of the input distribution into a number of local balls to find a near-optimal representation structure.
5.1 Halving Number Density for Halfspaces
By sorting the hypothesis distance of each pair, we use a halving threshold to cut the number density of the input distribution in half. The cutting rule is: if the distance of a pair exceeds the halving threshold, we remove the corresponding data from the pool. After cutting, the input set is reduced to one halfspace.
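A minimal sketch of this cutting rule (our illustration; a single 1-D reference point stands in for the current hypothesis, and plain points stand in for the pool):

```python
def halve_by_distance(points, ref):
    """Keep the half of the pool closest to a reference point.

    Sketch of the halving rule of Section 5.1: sort distances to `ref` and
    cut the pool at the median, so the retained half has half the original
    number density. Using a single reference point is an assumption of
    this sketch, not the paper's exact pairwise rule.
    """
    dists = sorted((abs(p - ref), p) for p in points)
    kept = [p for _, p in dists[:len(dists) // 2]]
    return kept

pool = list(range(10))                 # toy 1-D pool
half = halve_by_distance(pool, ref=0)
```

The retained half keeps the points nearest the reference, so the density (and with it the effective VC dimension of the pool) is halved in one pass.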
In the hypothesis class, the VC dimensions of the full space and of the halfspace can be written in terms of their numbers of data (Cao et al., 2018a). Based on these assumptions, let us discuss the advantages of halfspaces in terms of label complexity and the upper bound on the number of queries.
Lemma 1.
Label complexity. Suppose each hypothesis holds with at least a given probability; the label complexity is then
(8) 
Lemma 2.
Upper bound of queries. Following (Balcan et al., 2006), under the stated assumptions, A² will make at most a bounded number of queries.
Proof.
In the halfspace, substituting the corresponding quantities into the bound of (Balcan et al., 2006) yields the stated number of queries. The lemma is as stated. ∎
Based on the above discussion, we can observe that the values of these two quantities for halfspaces are lower than those of the original hypothesis class, since the VC dimension is reduced.
5.2 Advantages of Halfspaces
To observe the advantages of halfspaces, Section 5.2.1 analyzes the bounds of the error differences between hypotheses under positive or negative labeling assumptions, Section 5.2.2 discusses the upper bound of the error rate by fallback analysis, and Section 5.2.3 describes the label complexities under bounded and adversarial noise conditions.
5.2.1 Bounds of Error Difference in Halfspaces
In this learning process, we still use the greedy halving strategy to split the local unit ball. Before splitting, we present the halving guarantees on the error rate difference over the distribution of halfspaces.
Theorem 5.1.
Let the distribution over the input set, a family of functions, the n-th shatter coefficient, and the empirical average over a subset be given; then, with probability at least the stated level, we have
Proof.
Following (Dasgupta et al., 2008), assume the deviation condition holds for any function in the family; then an i.i.d. sample of the stated size from the distribution satisfies
(9) 
With a similar inequality in the halfspace, and the corresponding quantities defined there,
(10) 
Let us rewrite the above equation in four parts, each within a pair of brackets. Since the number density of the halfspace is smaller than that of the full space, the corresponding parts are smaller for the halfspace. For the VC dimension, since the halfspace contains fewer data, its shatter coefficient is also smaller. Therefore, the bound for the halfspace is smaller. ∎
By this theorem, the error rate of halfspaces is guaranteed to decrease. However, it is related to the size of the halfspace. To obtain the structure of the version space, we continue to use the halving approach to split the halfspace into local balls, with fallback and bounded-noise-tolerance guarantees.
5.2.2 FallBack Analysis in Halfspaces
Fallback analysis helps us observe the upper bound of the error rate in the halfspace. Before analyzing the fallback of querying, we need some technical lemmas.
Lemma 3.
(Dasgupta et al., 2008) Under an assumption of normalized uniformity, the quantity can be defined by the stated expression.
Lemma 4.
Under the stated assumptions, the hypothesis is consistent with the labeled set for all queries.
Proof.
Applying the assumption of Lemma 4, there exists the inequality
(11) 
Rewriting this inequality, we then have
(12) 
(13) 
Therefore,
(14) 
Combining the above bounds, the lemma follows. ∎
Theorem 5.2.
Assume there exists a hypothesis satisfying the stated condition. If the A² algorithm is given the stated number of queries with the stated probability, the error rate of the halfspace is bounded as follows.
Proof.
Using Lemma 4, we obtain the bound; substituting the stated quantity then gives
(15) 
∎
With this upper-bound performance, the halfspaces have a convergent error rate guarantee, and this bound is lower than that of the input distribution without halving. Next, we analyze the bounds on the label complexity under noise settings.
5.2.3 Bounded Noise Analysis of Halfspace
Under the uniform assumption, noise affects the unseen queries. Here we discuss the label complexities of the halfspace under the bounded and adversarial noise settings (Yan and Zhang, 2017).
Theorem 5.3.
For a halfspace hypothesis (w.r.t. the definition of halfspaces), if the conditional label probabilities are bounded away from 1/2 for any input, we say the distribution satisfies the bounded (Massart) noise condition (Massart et al., 2006). Under this assumption, (1) there are at most a bounded number of unlabeled data, and (2) the number of queries is also bounded.
Theorem 5.4.
For a halfspace hypothesis, if the noise rate is allowed to be an arbitrary bounded fraction, we say the distribution satisfies the adversarial noise condition (Awasthi et al., 2014). Under this assumption, (1) there are at most a bounded number of unlabeled data, and (2) the number of queries is also bounded.
Compared with the original input distribution, the label complexity of our proposed halfspace is lower since the VC dimension of the halfspaces is reduced.
5.3 DistributionSplitting for AL Tasks
Halfspaces provide theoretical advantages without special distribution assumptions, since number density is independent of the particular distribution. Therefore, in application tasks, we first halve the number density of the input distribution to learn halfspaces via an active scoring strategy in Section 5.3.1. After obtaining the halfspace, Section 5.3.2 splits it into near-optimal balls via the distribution density. Then, Section 5.3.3 proposes the DA² algorithm for AL querying.
5.3.1 Active Scoring for Halving
Active scoring measures the local representativeness of an arbitrary data point, where the score grows monotonically with the representativeness. Here we use the sequential optimization of (Yu et al., 2006) to find the data with the highest scores to represent the input data distribution. It is defined as follows,
(16) 
where the kernel matrix of the input set is used, the sequence positions of each candidate and of the data point with the current highest confidence score are tracked, and a penalty factor controls the global optimization.
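The sequential optimization can be sketched as a greedy kernel-deflation loop (our illustration in the spirit of Yu et al. (2006); the exact score and update below are assumptions of this sketch, not the paper's Eq. (16)): pick the point whose kernel column has the largest regularized norm, then deflate the kernel so directions already covered score low on later rounds.

```python
import math

def rbf_kernel(xs, gamma=1.0):
    """RBF (Gaussian) kernel matrix of a 1-D pool."""
    return [[math.exp(-gamma * (a - b) ** 2) for b in xs] for a in xs]

def sequential_design(K, n_select, mu=0.1):
    """Greedy sequential experimental design: at each step pick the point
    whose kernel column best covers the pool, then subtract the rank-one
    component it explains (kernel deflation)."""
    n = len(K)
    K = [row[:] for row in K]              # work on a copy
    chosen = []
    for _ in range(n_select):
        scores = [sum(K[i][j] ** 2 for j in range(n)) / (K[i][i] + mu)
                  for i in range(n)]
        best = max(range(n), key=lambda i: scores[i])
        chosen.append(best)
        kb = K[best][:]
        denom = kb[best] + mu
        for i in range(n):                 # deflate covered directions
            for j in range(n):
                K[i][j] -= kb[i] * kb[j] / denom
    return chosen

xs = [0.0, 0.1, 0.2, 5.0]                  # a 3-point cluster and an outlier
chosen = sequential_design(rbf_kernel(xs), n_select=2)
```

On this toy pool the first pick is the center of the dense cluster and the second is the isolated point, illustrating how the score favors representatives of still-uncovered regions.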
5.3.2 Splitting by Distribution Density
Given our theoretical perspective on number density, splitting the input distribution by number density has already been shown effective. However, calculating the number density of a high-dimensional bounded space is challenging. Therefore, we use the exponential value of the distribution density to split the input distribution quickly. Here we define the number density as
(17) 
where the mean value and variance of the local ball are used. Then, we propose the splitting rule:
(18) 
To solve the above minimization problem, we use the (1+ε)-approximation approach (Tsang et al., 2005) to grow the ball radius until convergence, where ε is set by an empirical threshold.
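The ball-growing step can be sketched with the classic Badoiu-Clarkson (1+eps)-approximation for the minimum enclosing ball, which the core-set method of Tsang et al. (2005) builds on (our illustration, not the paper's exact procedure): repeatedly step the center toward the farthest point with a shrinking step size.

```python
import math

def approx_min_enclosing_ball(points, eps=0.01):
    """(1+eps)-approximate minimum enclosing ball, Badoiu-Clarkson style.

    Start at any point and, for ceil(1/eps^2) iterations, move the center
    a fraction 1/(k+1) of the way toward the current farthest point. The
    resulting ball is a (1+eps)-approximation of the optimal one."""
    c = list(points[0])
    iters = int(math.ceil(1.0 / eps ** 2))
    for k in range(1, iters + 1):
        far = max(points,
                  key=lambda p: sum((a - b) ** 2 for a, b in zip(p, c)))
        c = [a + (b - a) / (k + 1) for a, b in zip(c, far)]  # shrinking step
    radius = max(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))
                 for p in points)
    return c, radius

pts = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
c, radius = approx_min_enclosing_ball(pts, eps=0.1)
```

For the unit-square corners above, the center converges near (1, 1) and the radius near sqrt(2), within the (1+eps) factor after about 1/eps² iterations.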
5.3.3 Querying by DA
How to query the unlabeled data is the key step in AL tasks. In this section, we propose a Distribution-based A² algorithm (DA²) to implement the proposed distribution-splitting strategy. The algorithm has two steps. Step 1 finds a halfspace which contains the optimal data sequences by the active scoring of Eq. (16). Step 2 solves the optimization of Eq. (18). The output of the algorithm is used as the AL queries.
6 Evaluation
From the algorithm description, the DA² algorithm performs its querying in the halfspace, so the halving ability plays an important role in its AL querying.
In this section, we investigate the halving and querying performance of the DA² algorithm on six structured clustering data sets and seven unstructured symbolic data sets. Before the tests, we use the RBF kernel to map the input space into a nonlinear kernel space for the DA² algorithm, with the kernel sigma parameter set to 1.8. To compare the classification performance of the different AL algorithms, LIBSVM (Chang and Lin, 2011) is used as the default classifier and the error rate as the evaluation standard. Furthermore, we present probabilistic explanations and model-difference analyses for the two groups of experimental results.
6.1 Effectiveness of Halving
To verify the halving ability of the DA² algorithm, we perform passive sampling in the input space and in the halfspace and compare their prediction abilities on the input space. The tested data sets are six UCI real data sets (sonar, german, iris, monk1, vote, ionosphere) and two subsets of the letter data set: DvsP and AD, where AD represents the instances of AvsBvsCvsD. In each experiment, we perform passive sampling 10 times and report the mean error rate under different query number settings in the two spaces. Figure 4 presents the results.
By observing the two curves, we find that the querying ability of the halfspace is better than that of the original input space, since the halfspace has removed some "redundant points" (Cao et al., 2018b) which have little influence on the performance of the training model. Assume there are "effective points" (Cao et al., 2018b) which largely decide the performance of the final learning model in the input space. Under a limited training input number, if we ignore the influence of classifiers and parameter settings, the probability of obtaining an optimal hypothesis in each space is the ratio of its effective points to its pool size. Since the sequential optimization has been successful in representation sampling, the number of effective points retained in the halfspace is very close to that of the full space, while the halfspace pool is half the size; the halfspace therefore yields the higher probability.
6.2 Optimal Accuracy of Querying
The halving experiments have shown that the halfspace yields a better passive sampling performance than the original input space, which provides a guarantee for distribution-splitting in the halfspace. However, most AL work is based on a labeled set. To verify the target-independence of the proposed algorithm, we set the size of the initial training set to the number of classes by randomly selecting one instance from each class. Because labeled-set-based AL algorithms necessarily perform poorly when the input set is this small, we try to minimize the influence of the labeled set by tuning their parameters. Under different query-number settings, we collect the results of 100 runs and report their optimal prediction results.
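The one-instance-per-class initialization and best-of-100-runs protocol can be sketched as below. This is an assumed reconstruction of the evaluation loop, not the authors' code: random querying stands in for each baseline's selection rule, and a 1-nearest-neighbour rule stands in for the tuned classifier.

```python
import numpy as np

def one_per_class_seed(y, rng):
    """Pick one random instance from each class as the initial labeled set."""
    return np.array([rng.choice(np.where(y == c)[0]) for c in np.unique(y)])

def best_of_runs(X, y, query_budget, n_runs=100, seed=0):
    """Optimal (minimum) held-out error over repeated runs."""
    rng = np.random.RandomState(seed)
    best = 1.0
    for _ in range(n_runs):
        labeled = set(one_per_class_seed(y, rng).tolist())
        # Random querying stands in for each baseline's selection strategy
        pool = [i for i in range(len(y)) if i not in labeled]
        labeled.update(rng.choice(pool, size=query_budget, replace=False).tolist())
        idx = np.array(sorted(labeled))
        rest = np.setdiff1d(np.arange(len(y)), idx)
        # 1-NN prediction of the held-out points from the labeled set
        d = np.linalg.norm(X[rest][:, None, :] - X[idx][None, :, :], axis=2)
        pred = y[idx][np.argmin(d, axis=1)]
        best = min(best, float(np.mean(pred != y[rest])))
    return best

# Two separated Gaussian blobs; initial set = 1 point per class, then 10 queries
rng = np.random.RandomState(2)
X = np.vstack([rng.randn(40, 2), rng.randn(40, 2) + 6.0])
y = np.array([0] * 40 + [1] * 40)
best = best_of_runs(X, y, query_budget=10, n_runs=100)
```

Reporting the minimum over runs matches the intent of the protocol: it gives each labeled-set-based baseline its most favorable initialization before comparison.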
In this group of experiments, we compare the AL abilities of three mainstream algorithms: Hieral, TED, and GEN, described as follows:
Hieral (Dasgupta and Hsu, 2008) utilizes the prior knowledge of hierarchical clustering to actively annotate more unlabeled data via an established probability evaluation model, but it is sensitive to the cluster structure.

TED (Yu et al., 2006), also called T-optimization, prefers data points that are, on the one hand, hard to predict and, on the other hand, representative of the rest of the unlabeled pool.

GEN (Du et al., 2017) focuses on the data that minimize the difference between the distributions of the labeled and unlabeled sets.
Figure 5 presents the error rate curves of the five AL approaches on the tested data sets. Although we have maximized the model performance of the labeled-set-based AL algorithms, the DA algorithm still outperforms them. To reveal the potential reasons behind these performances, we report probabilistic and model explanations for this group of tests.
In the probabilistic analysis, if we assume every learning algorithm could obtain its optimal hypothesis in each test, we have the following results: (1) for the uncertainty-based AL algorithms, different input sets can lead to different optimal hypotheses; (2) in this group of tests, an arbitrary input set therefore only has a probability strictly less than 1 of obtaining the best classification result; (3) for the target-independent DA algorithm, the trained hypothesis is unique; (4) in this group of tests, an arbitrary input set thus obtains the best classification result with probability 1.
To analyze the model differences among these algorithms, we present the following discussion. (1) The idea of Hieral is active annotation, and it depends on the cluster assumption over the version space. Its classification ability on unstructured data sets such as the letter data sets is unstable, and its recorded error rates are higher than those of the other approaches even after we increased the number of tests. (2) TED tends to select points with large norms, which may be hard to predict but do not necessarily best represent the whole data set; moreover, noisy points are sampled into its query set. Its reported classification results are therefore good but not the best. (3) GEN always shows disappointing results at the beginning of training on all tested data sets, and its error rate declines rapidly as the number of queries increases. The reason is that its objective function prefers data located in the central area of each class, which cannot reflect the complete class structure well. (4) Compared with the above algorithms, the DA algorithm halves the number density of the data distribution into a halfspace which removes most of the redundant points. The remaining points, which represent the local data distributions, help the learner capture the structure of the original data distribution. In the reported error rate curves, this represented structure provides effective sampling guidance when the number of queries is not large.
7 Conclusion
The A² algorithm can provide strong theoretical guarantees for learning an agnostic AL task under fixed distribution and noise conditions. However, the label complexity of querying heavily depends on the initial hypothesis. This creates a challenging gap between the theoretical guarantees and the application performance of A² algorithms.
To bridge this gap, we use the distribution-splitting strategy to halve the number density of the input distribution, thereby reducing the VC dimension of the version space. With the reduction of error rate and label complexity, we split the original data distribution into a halfspace which retains the most highly informative data for unseen sampling. In the constructed halfspace, we then continue to use the distribution-splitting approach to find a number of near-optimal spheres to represent an arbitrary agnostic input distribution. The proposed DA algorithm further demonstrates the effectiveness of the halving and querying abilities of the distribution-splitting strategy.
References
 Alabdulmohsin et al. (2015) Alabdulmohsin IM, Gao X, Zhang X (2015) Efficient active learning of halfspaces via query synthesis. In: AAAI, pp 2483–2489
 Awasthi et al. (2014) Awasthi P, Balcan MF, Long PM (2014) The power of localization for efficiently learning linear separators with noise. In: Proceedings of the forty-sixth annual ACM symposium on Theory of computing, ACM, pp 449–458
 Balcan et al. (2006) Balcan MF, Beygelzimer A, Langford J (2006) Agnostic active learning. In: ICML, ACM, pp 65–72
 Balcan et al. (2007) Balcan MF, Broder A, Zhang T (2007) Margin based active learning. In: COLT, Springer, pp 35–50
 Cai and He (2012) Cai D, He X (2012) Manifold adaptive experimental design for text categorization. IEEE Transactions on Knowledge and Data Engineering 24(4):707–719
 Cao et al. (2018a) Cao X, Tsang IW, Xu G (2018a) A structured perspective of volumes on active learning. arXiv:1807.08904
 Cao et al. (2018b) Cao X, Tsang IW, Xu J, Shi Z, Xu G (2018b) Geometric active learning via enclosing ball boundary. arXiv:1805.12321
 Chang and Lin (2011) Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3):27
 Chen et al. (2017) Chen L, Hassani SH, Karbasi A (2017) Nearoptimal active learning of halfspaces via query synthesis in the noisy setting. In: AAAI, pp 1798–1804
 Dasgupta (2006) Dasgupta S (2006) Coarse sample complexity bounds for active learning. In: NIPS, pp 235–242
 Dasgupta and Hsu (2008) Dasgupta S, Hsu D (2008) Hierarchical sampling for active learning. In: ICML, pp 208–215
 Dasgupta et al. (2005) Dasgupta S, Kalai AT, Monteleoni C (2005) Analysis of perceptronbased active learning. In: COLT, Springer, pp 249–263
 Dasgupta et al. (2008) Dasgupta S, Hsu DJ, Monteleoni C (2008) A general agnostic active learning algorithm. In: NIPS, pp 353–360
 Du et al. (2017) Du B, Wang Z, Zhang L, Zhang L, Liu W, Shen J, Tao D (2017) Exploring representativeness and informativeness for active learning. IEEE transactions on cybernetics 47(1):14–26
 Fang et al. (2017) Fang M, Yin J, Hall LO, Tao D (2017) Active multitask learning with trace norm regularization based on excess risk. IEEE transactions on cybernetics 47(11):3906–3915
 Freund et al. (1997) Freund Y, Seung HS, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm. Machine Learning 28(2-3):133–168
 Gal et al. (2017) Gal Y, Islam R, Ghahramani Z (2017) Deep bayesian active learning with image data. ICML
 Gonen et al. (2013) Gonen A, Sabato S, ShalevShwartz S (2013) Efficient active learning of halfspaces: an aggressive approach. The Journal of Machine Learning Research 14(1):2583–2615
 Hanneke (2007) Hanneke S (2007) A bound on the label complexity of agnostic active learning. In: ICML, ACM, pp 353–360
 Harpale and Yang (2010) Harpale A, Yang Y (2010) Active learning for multitask adaptive filtering. In: ICML
 Huang et al. (2014) Huang SJ, Jin R, Zhou ZH (2014) Active learning by querying informative and representative examples. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(10):1936–1949
 Huang et al. (2015) Huang SJ, Chen S, Zhou ZH (2015) Multilabel active learning: Query type matters. In: IJCAI, pp 946–952
 Huang et al. (2017) Huang SJ, Gao N, Chen S (2017) Multiinstance multilabel active learning. In: IJCAI, pp 1886–1892
 Li et al. (2012) Li L, Jin X, Pan SJ, Sun JT (2012) Multidomain active learning for text classification. In: SIGKDD, pp 1086–1094
 Li and Guo (2013) Li X, Guo Y (2013) Adaptive active learning for image classification. In: CVPR, pp 859–866
 Massart et al. (2006) Massart P, Nédélec É, et al. (2006) Risk bounds for statistical learning. The Annals of Statistics 34(5):2326–2366
 McCallumzy and Nigamy (1998) McCallumzy AK, Nigamy K (1998) Employing em and poolbased active learning for text classification. In: ICML, Citeseer, pp 359–367
 Shi and Shen (2016) Shi L, Shen YD (2016) Diversifying convex transductive experimental design for active learning. In: IJCAI, pp 1997–2003
 Tong and Koller (2001) Tong S, Koller D (2001) Support vector machine active learning with applications to text classification. Journal of machine learning research 2(Nov):45–66
 Tosh and Dasgupta (2017) Tosh C, Dasgupta S (2017) Diameterbased active learning. ICML
 Tsang et al. (2005) Tsang IW, Kwok JT, Cheung PM (2005) Core vector machines: Fast svm training on very large data sets. Journal of Machine Learning Research 6(Apr):363–392
 Vapnik and Chervonenkis (2015) Vapnik VN, Chervonenkis AY (2015) On the uniform convergence of relative frequencies of events to their probabilities. In: Measures of complexity, Springer, pp 11–30
 Yan and Zhang (2017) Yan S, Zhang C (2017) Revisiting perceptron: Efficient and labeloptimal learning of halfspaces. In: NIPS, pp 1056–1066
 Yu et al. (2006) Yu K, Bi J, Tresp V (2006) Active learning via transductive experimental design. In: ICML, pp 1081–1088
 Zhang et al. (2011) Zhang L, Chen C, Bu J, Cai D, He X, Huang TS (2011) Active learning based on locally linear reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(10):2026–2038