Abstract
Acknowledgments
1 Introduction
How do humans learn? Say we look at the process of a child learning how to recognize a cat. We can focus on two types of input. The first type of input is when a child’s parent points at a cat and states “Look, a cat!”. The second type of input is an answer to the child’s frequent question “What is that?”, which the child may pose when seeing a cat, but also when seeing a dog, a mouse, a rabbit, or any other small animal.
These two types of input were the basis for the learning model originally suggested in the celebrated paper “A theory of the learnable” (Valiant, 1984). In Valiant’s learning model, the learning algorithm has access to two sources of information: EXAMPLES and ORACLE. The learning algorithm can call EXAMPLES to receive an example with its label (sampled from “nature”). Additionally, the learning algorithm can use ORACLE, which provides the label of any example presented to it. With these two input types, we can look at two models of learning: learning using only calls to EXAMPLES, and learning using calls to both EXAMPLES and ORACLE. The first is the standard Probably Approximately Correct (PAC) model. The second is the so-called PAC+MQ (Membership Queries) model. Much theoretical work has investigated how much additional power membership queries provide. The use of membership queries in addition to examples was proven to be stronger than the standard PAC model in many cases (Angluin, 1987; Blum and Rudich, 1992; Bshouty, 1995; Jackson, 1994) (see section 2).
Although the MQ model seems much stronger, both intuitively and formally, it is rarely used in practice. This is commonly attributed to the difficulty of implementing MQ algorithms, which create new and artificial examples to be labeled as part of the training phase. This problem of labeling artificial examples was highlighted by the experiment of Baum and Lang (1992). Baum and Lang implemented a membership query algorithm proposed by Baum (1991) for learning halfspaces. Their algorithm had very poor results, which was attributed to the fact that the algorithm created artificial and unnatural examples, which resulted in noisy labeling. We elaborate on this experiment and criticize its conclusions in section 2.
A solution to the problem of unnatural examples was proposed by Awasthi et al. (2012). They suggested a midway model of learning with queries, but only restricted ones. The queries that their model allows the algorithm to ask are only local queries, i.e., queries that are close in some sense to examples from the sample set. Hopefully, examples which are similar to natural examples will also appear to be natural, or at least close to natural, and in any case will be far from appearing random or artificial. In their work, Awasthi et al. started to investigate the power and the limitations of this model of local queries. They proved positive results on learning sparse polynomials with local queries under what they defined as locally smooth distributions (the class of distributions for which the logarithm of the density function is Lipschitz with respect to the Hamming distance), which in some sense generalize the uniform and product distributions. They also proposed an algorithm that learns DNF formulas under the uniform distribution in quasi-polynomial time using only local queries.
The exciting ideas of Awasthi et al. (2012) leave many directions for future work. One issue is that their analysis holds for a restricted family of distributions. While these results provide evidence of the extra power of local queries, the distributional assumptions are rather strong.
Our work follows Awasthi et al., and is focused on 1-local queries, which are the closest to the original PAC model. We formulate an arguably natural distributional assumption, and present an algorithm that uses 1-local membership queries to learn DNF formulas under this assumption. We also provide a matching lower bound: namely, we prove that learning DNFs under our assumption is hard without the use of queries, assuming that learning decision trees is hard. This is the first example of a natural problem in which 1-local queries are stronger than the vanilla PAC model (it complements the work of Awasthi et al., who showed a similar result for a highly artificial problem).
Finally, we provide some empirical evidence that using local queries can be helpful in practice, and importantly, that the implementation of the queries is easy, straightforward, and can be acquired by crowdsourcing without the use of an expert. We present a method for using local queries to perform a user-induced feature selection process, and present results of this protocol on the task of sentiment analysis of tweets. Our results show that by acquiring a more expressive data set, using (a variant of) 1-local queries, we can achieve better results with fewer examples. Because a smaller data set is sufficient, we gain twice: we need less manpower for the labeling process and less computing power for the training process. We note that similar experiments also present encouraging results along this line (Raghavan and Allan, 2007; Raghavan et al., 2005; Settles, 2011; Druck et al., 2009). This supplies more evidence that such query-based methods can be useful in practice.
2 Previous Work
2.1 PAC
Valiant’s Probably Approximately Correct (PAC) model of learning (Valiant, 1984) formulates the problem of learning a concept from examples. Examples are chosen according to a fixed but unknown and arbitrary distribution on the instance space. The learner’s task is to find a prediction rule. The requirement is that with high probability, the prediction rule will be correct on all but a small fraction of the instances.
A few positive results are known in this model, i.e., concept classes that have been proven to be PAC-learnable. Perhaps the most significant example is the class of halfspaces. Other examples include relatively weak classes such as DNFs and CNFs with a constant number of terms (Valiant, 1984), and rank-$r$ decision trees (Ehrenfeucht and Haussler, 1989) for a constant $r$.
Despite these positive results, most PAC learning problems are probably intractable. In fact, beyond the results mentioned above, almost no positive results are known. Furthermore, several negative results are known. For example, learning automata, logarithmic-depth circuits, and intersections of polynomially many halfspaces are all intractable, assuming the security of various cryptographic schemes (Kearns and Valiant, 1994; Klivans et al., 2006). In (Daniely et al., 2014; Daniely and Shalev-Shwartz, 2014; Daniely et al., 2013), it is shown that learning DNF formulas and learning intersections of halfspaces are intractable under the assumption that refuting random SAT is hard.
2.2 Membership Queries
The PAC model is a “passive” model in which the learner receives a random data set of examples and their labels and then outputs a classifier. A stronger version would be an active model in which the learner gathers information about the world by asking questions and receiving responses. Several types of active models have been proposed: Membership Query Synthesis, Stream-Based Selective Sampling, and Pool-Based Sampling (Settles, 2010). Our work is in the area of the “Membership Queries” (MQ) model, which was presented in (Valiant, 1984). In this model the learner is allowed to query for the label of any particular example that it chooses (even examples that are not in the given sample).
This model has been shown to be stronger in several scenarios. Some examples of concept classes that have been proven to be PAC-learnable only if membership queries are available include: the class of deterministic finite automata (Angluin, 1987), the class of k-term DNF for (Blum and Rudich, 1992), the class of decision trees and k-almost monotone DNF formulas (Bshouty, 1995), the class of intersections of k halfspaces (Baum, 1991) and the class of DNF formulas under the uniform distribution (Jackson, 1994). The last of these results was built upon Freund’s boosting algorithm (Freund, 1995) and the Fourier-based technique for learning using membership queries due to Kushilevitz and Mansour (1993).
It should be noted that there are cases in which the additional strength of MQ does not help, e.g., in the case of learning DNF and CNF formulas (Angluin and Kharitonov, 1995), and in the case of distribution-free agnostic learning (although in the distribution-specific agnostic setting membership queries do increase the power of the learner) (Feldman, 2009).
2.3 Baum and Lang
As discussed above, there has been widespread and significant theoretical work in the PAC + MQ model. On the other hand, almost no practical work on implementing these ideas has been done. A well-known exception is the work of Baum and Lang (1992). They applied a variation of the MQ algorithm for learning a linear classifier proposed in Baum (1991). This algorithm uses the idea that given two examples, one positive and one negative, and a query oracle, it is possible to find an approximately accurate separating halfspace by using a binary search on the line between the positive and negative examples. Their experiment attempts to evaluate this idea in practice. The task that they chose is binary digit classification. The algorithm would receive two examples, one positive and one negative (say, an image of the digit 4 and an image of the digit 7) and would return the weights of the halfspace. The generalization error of the halfspace would then be tested on other examples from the data. The query technique they used in the experiment is different from the one in the original algorithm: “A direct implementation of this algorithm would repeatedly flash images on the screen during the binary search and would require the test subject to type in the correct label for each image. Because this process seemed likely to be error prone, we instead provided an interface that permitted the test subject to scan through the input space using the mouse and then click on an image that seemed to lie right at the edge of recognizability” (from Baum and Lang (1992)).
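The binary search described above can be sketched as follows. This is only an illustration of the search along the segment, not Baum's full algorithm: the function name `boundary_point`, the `oracle` callback and the `steps` parameter are hypothetical, and the sketch assumes a noiseless oracle whose label changes exactly once between the two endpoints.

```python
from typing import Callable, List, Sequence


def boundary_point(pos: Sequence[float], neg: Sequence[float],
                   oracle: Callable[[List[float]], bool],
                   steps: int = 30) -> List[float]:
    """Binary-search the segment from `pos` to `neg` for a boundary point.

    Hypothetical helper: `oracle(point)` answers a membership query and
    returns True for a positive label.
    """
    def at(t: float) -> List[float]:
        # Point at fraction t along the segment (t=0 is pos, t=1 is neg).
        return [p + t * (q - p) for p, q in zip(pos, neg)]

    lo, hi = 0.0, 1.0  # invariant: at(lo) is positive, at(hi) is negative
    for _ in range(steps):
        mid = (lo + hi) / 2
        if oracle(at(mid)):
            lo = mid   # still on the positive side, move toward neg
        else:
            hi = mid   # crossed the boundary, move back toward pos
    return at((lo + hi) / 2)
```

Every query issued here is a synthetic point on the segment, which is exactly the kind of artificial example that human annotators found hard to label.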
For an example of what the users saw on the screen see figure 1.
They compared the performance of their algorithm to five other variants: three classic PAC (sample-based) algorithms (Backpropagation, Perceptron and simplex) and two baselines. The first baseline returns the perpendicular bisector of the line segment connecting the two examples, and the second returns a randomly oriented hyperplane through the midpoint of the line. The query learning algorithm uses the additional information obtained from the users as described above, while the three PAC algorithms use additional examples drawn from the data set. All three PAC algorithms outperformed the query-based algorithm. More surprisingly, even the baseline of choosing the perpendicular bisector had significantly better results than the halfspace created by the query algorithm. The only method that was worse than the query-based method was the random bisector method. They suggest that the reason for the poor results is that the question the users had to answer, to find the boundary pattern, lay outside the range of human competence.
This work led many to the conclusion that membership queries are not useful in practice (Settles (2010); Balcan et al. (2006); Dasgupta (2004) and more). We argue that there are several problems with this conclusion. First and foremost, the task that the users were asked to perform (scanning through images and finding the boundary between digits) is not an intuitive task, and it is very easy to think of other variants for queries which would be more suitable. It is therefore not surprising that the labeling turned out to be noisy considering the nature of the question at hand. Second, their algorithm did not use the PAC abilities; it used queries but did not use the additional option to sample extra points for the data.
2.4 Local Membership Queries
Several suggestions have been made of ways to solve the problem of the algorithm’s generation of unnatural examples. The most common one was to drop the whole framework of membership queries and focus on the other types of active learning: stream-based and pool-based. The idea is to filter existing examples taken from a large unlabeled data set drawn from the distribution rather than creating artificial examples. Another suggestion is to give the human annotator the option of answering “I don’t know”, or to be tolerant of some incorrect answers. The theoretical framework is the model of an incomplete membership oracle in which the answers to a random subset of the queries may be missing. This notion was first presented in Angluin and Slonim (1994), and then followed by the notions of limited MQ and malicious MQ (Angluin et al. (1997); Blum et al. (1995); Sloan and Turán (1994); Bisht et al. (2008)).
The third method is to restrict the examples that the learning algorithm can query to examples that are similar to examples drawn from the distribution. This is formalized in the work of Awasthi et al. (2012). They present the concept of learning using only local membership queries. This framework deals with the problem raised by Baum and Lang (1992). By asking about examples which are close to examples from the distribution, we escape the problem of generating random or non-classifiable examples.
The work of Awasthi et al. focused on the $n$-dimensional boolean hypercube and on local queries, i.e., the learning algorithm is given the option to query the label of any point for which there exists a point in the training sample with Hamming distance lower than . The model they suggested is a midway model between the PAC model (0-local queries) and the PAC + MQ model ($n$-local queries). Their main result is that $t$-sparse polynomials are learnable under locally smooth distributions using local queries. Another interesting result that they presented is that the class of DNF formulas is learnable under the uniform distribution in quasi-polynomial time () using local queries. They also presented some results regarding the strength of local MQ. They proved that under standard cryptographic assumptions, using local queries is more powerful than using local queries (for every ). They also showed that local queries do not always help: if a concept class is agnostically learnable under the uniform distribution using local queries (for constant ) then it is also agnostically learnable (under the uniform distribution) in the PAC model.
2.5 Other Related Work
In section 5, we give some experimental evidence that the use of extra information from the user is helpful. There have been other works along the same line. Druck et al. (2009) propose a pool-based active learning approach in which the user provides “labels” for input features, rather than instances. A labeled input feature denotes that a particular feature is highly indicative of a particular label. Following that, Settles (2011) presented an active learning annotation interface, in which the users label instances and features simultaneously. At any point in time, an instance and a list of features for each label are presented on the screen. The user can choose to either label the instance, choose a feature from the list as being indicative, or add a new feature of his or her choice. Similar work was done by Raghavan and Allan (2007) and Raghavan et al. (2005). They studied the problem of tandem learning, where they combine uncertainty sampling for instances with co-occurrence-based interactive feature selection. All the above experiments were conducted on the text domain and the features were always unigrams. The experiments presented encouraging results of using the human annotators, either by reaching better results, or by showing that the additional use of annotators can reduce the size of the data set, and sometimes both.
3 Setting
3.1 The PAC Model
Our framework is an extension of the PAC (Probably Approximately Correct) model of learning. Before introducing it, we will briefly review PAC learning. We will only consider binary classification where the instance space is and the label space is . A learning problem is defined by a hypothesis class . We assume that the learner receives a training set
where the ’s are sampled i.i.d. from some unknown distribution on and is some unknown hypothesis. We will focus on the so-called realizable case where is assumed to be in . The learner returns (a description of) a hypothesis . The goal is to approximate , namely to find with loss as small as possible, where the loss is defined as . We will require our algorithms to return a hypothesis with loss in time that is polynomial in and . Concretely,
Definition 1 (Learning algorithm)
We say that a learning algorithm PAC learns if

There exists a function , such that for every distribution over , every and every , if is given a training sequence
where the ’s are sampled i.i.d. from and , then with probability of at least (over the choice of ), the output of satisfies . (The success probability can be amplified to by repetition.)

Given a training set of size

runs in time .

The hypothesis returned by can be evaluated in time .

Definition 2 (PAC learnability)
We say that a hypothesis class is PAC learnable if there exists a PAC learning algorithm for this class.
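As a reading aid, the correctness requirement of Definition 1 can be written in symbols as sketched below. The notation is an assumption of this sketch and not necessarily the original one: $\mathcal{A}$ is the learning algorithm, $\mathcal{D}$ the distribution, $h^\star$ the target hypothesis, $S$ a training sequence of size $m$, $\epsilon$ the accuracy parameter, and the constant $3/4$ is taken from the success probability used in the proofs of section 4.

```latex
L_{\mathcal{D},h^\star}(h) \;=\; \Pr_{x\sim\mathcal{D}}\bigl[h(x)\neq h^\star(x)\bigr],
\qquad
\Pr_{S\sim\mathcal{D}^{m}}\Bigl[L_{\mathcal{D},h^\star}\bigl(\mathcal{A}(S)\bigr)\le\epsilon\Bigr]\;\ge\;\frac{3}{4}.
```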
3.2 (Local) Membership Queries Model
Learning with membership queries is an extension of the PAC model in which the learning algorithm is allowed to query the labels of specific examples in the domain set. A membership query is a call to an ORACLE which receives as input some and returns . This is called a “membership query” because the ORACLE returns if is in the set of examples positively labeled by .
Definition 3 (MembershipQuery Learning Algorithm)
We say that a learning algorithm learns with membership queries if

There exists a function , such that for every distribution over , every and every , if is given access to membership queries, and a training sequence
where the ’s are sampled i.i.d. from and , then with probability of at least (over the choice of ), the output of satisfies .

Given a training set of size

asks at most membership queries.

runs in time .

The hypothesis returned by can be evaluated in time .

Our work will deal with a specific type of membership queries, ones that are in some way close to examples that are already in the sample. Concretely, we say that a membership query is $q$-local if there exists a training example whose Hamming distance from the queried point is at most $q$. (We only consider the instance space , so the Hamming distance is natural. However, the definition can be extended to other metrics.)
Definition 4 (LocalQuery Learning Algorithm)
We say that a learning algorithm learns with $q$-local membership queries if learns with membership queries that are all $q$-local.
Definition 5
We say that a hypothesis class is $q$-LQ learnable if there exists a $q$-local-query learning algorithm for this class.
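The locality condition can be checked mechanically. The following is a minimal sketch that assumes examples are encoded as tuples of 0/1 bits; the function names are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, Sequence, Tuple

Point = Tuple[int, ...]


def hamming(x: Sequence[int], y: Sequence[int]) -> int:
    """Number of coordinates on which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def is_q_local(query: Point, sample: Iterable[Point], q: int) -> bool:
    """True if some training example is within Hamming distance q of the query."""
    return any(hamming(query, x) <= q for x in sample)


def one_local_queries(x: Point) -> Iterator[Point]:
    """All points at Hamming distance exactly 1 from x (one bit flipped)."""
    for j in range(len(x)):
        yield x[:j] + (1 - x[j],) + x[j + 1:]
```

With `q = 0` only the training examples themselves may be queried, which matches the PAC model, and with `q` equal to the dimension every point is allowed, which matches unrestricted membership queries.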
Learning Under a Specific Family of Distributions
In the classic PAC model discussed above, the learning algorithm needs to be probably approximately correct for any distribution on and any hypothesis . In this work we will have guarantees with respect to more restricted families. We will say that learns w.r.t. a family of pairs of distributions on and hypotheses in if the following holds: the algorithm satisfies the requirements of a learning algorithm whenever the pair and in the definition of a learning algorithm belongs to . Similar considerations apply also to the notion of learning with (local) membership queries.
4 Learning DNFs with Evident Examples Using 1local MQ
4.1 Definitions and Notations
Definition 6 (Disjunctive Normal Form Formula)
A DNF term is a conjunction of literals. A DNF formula is a disjunction of DNF terms.
Each DNF formula over n variables naturally induces a function (when we standardly identify with “True” and “False”). We denote by the function induced by the DNF formula .
Remark 1
We will consider succinctly described hypotheses (e.g., a DNF with a small number of terms) and small, but non-negligible, probabilities. For simplicity, we will take the convention that small is at most and non-negligible is at least . All of our results can be easily generalized to the case where “small” and “non-negligible” are defined as and for any constants .
Definition 7
Denote by the hypothesis class of all functions that can be realized by a DNF with a small number of terms. That is
Intuitively, when evaluating a DNF formula on a given example, we check a few conditions (corresponding to the formula’s terms), and deem the example positive if one of the conditions holds. We will consider the case that for each of these conditions, there is some chance to see a “prototype example”. Namely, an example that satisfies only this condition in a strong (or evident) way.
Definition 8
Let be a DNF formula. An example satisfies a term (with respect to the formula ) evidently if:

It satisfies . (In particular, )

It does not satisfy any other term (for ) from F.

No coordinate change will turn False and another term True. Concretely, if for we denote , then for every coordinate , if satisfies (i.e. if ) then satisfies and only .
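The three conditions of Definition 8 can be checked directly for small formulas. The sketch below assumes a hypothetical encoding that is not taken from the paper: an example is a tuple of 0/1 bits, a term is a dictionary mapping a variable index to the bit it requires, and a formula is a list of terms.

```python
from typing import Dict, List, Tuple

Term = Dict[int, int]      # variable index -> required bit (assumed encoding)
Point = Tuple[int, ...]


def satisfies(x: Point, term: Term) -> bool:
    return all(x[i] == bit for i, bit in term.items())


def flip(x: Point, j: int) -> Point:
    """Copy of x with coordinate j flipped."""
    return x[:j] + (1 - x[j],) + x[j + 1:]


def satisfied_terms(x: Point, formula: List[Term]) -> List[int]:
    return [k for k, term in enumerate(formula) if satisfies(x, term)]


def satisfies_evidently(x: Point, formula: List[Term], i: int) -> bool:
    # Conditions 1 and 2: x satisfies term i and no other term.
    if satisfied_terms(x, formula) != [i]:
        return False
    # Condition 3: any single-coordinate flip that still satisfies the
    # formula must satisfy term i and only term i.
    for j in range(len(x)):
        hit = satisfied_terms(flip(x, j), formula)
        if hit and hit != [i]:
            return False
    return True
```

For the formula with terms `{0: 1, 1: 1}` and `{0: 0, 2: 1}`, the example `(1, 1, 0)` satisfies the first term evidently, while `(1, 1, 1)` does not, because flipping coordinate 0 switches it to the second term.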
The first distributional assumption that we consider is that each positive example satisfies one term evidently.
Definition 9
A pair of a distribution over and is realized by a small DNF with evident examples if there exists a DNF formula over with such that and additionally, every positive example with satisfies one of ’s terms evidently.
One of the assumptions in our definition is that the target function can be realized by a formula for which every example satisfies at most one term. For a function that is realized by a decision tree this always holds. So, in a sense, our assumption holds for functions that can be realized by a “stable” decision tree.
The above definition makes a strong assumption, namely that every positive example is an evidence for one term. The next definition relaxes that assumption and only assumes that for every term there is a nonnegligible probability to see an evident example.
Definition 10
A pair of a distribution over and is weakly realized by a small DNF with evident examples if there exists a DNF formula over with such that and for every term there is a non-negligible probability (recall that non-negligible is at least ) to see an example that satisfies this term evidently.
For example, our assumption holds for every distribution , provided that can be realized by a DNF formula in which any pair of different terms contains two opposite literals.
4.2 Upper Bounds
We will now present two learning algorithms that use 1-LQ, and prove that each of these algorithms learns the class with respect to the families of distributions defined above. Both algorithms use the following claim, which follows directly from definition 8.
Claim 1
Let be a formula over . Then for every that satisfies a term evidently (with respect to ), for every it holds that:
Theorem 1
The hypothesis class is 1-LQ learnable with respect to distributions that are realized by a small DNF with evident examples.
Proof We will prove that algorithm 1 learns with 1local membership queries. First, it is easy to see that this algorithm is efficient: For a training set of size the algorithm asks for at most local membership queries, and runs in time . Likewise, the hypothesis that the algorithm returns is a formula with at most m terms and every term is of size at most n, therefore it can be evaluated in time polynomial in .
Now, let be a distribution on and be a hypothesis such that the pair is realized by a small DNF with evident examples. Let be that small formula (in particular and ). For we take a sample where are sampled i.i.d. from and .
Let be the formula returned by the algorithm after running on , and let be the function induced by . We will prove that with probability of at least 3/4 (over the choice of the examples) .
From the assumption on the distribution we get that every instance that satisfies the formula (in our case every such that ), satisfies exactly one term . For every one of these positive instances from , we will show that we add that exact term to . For every such we start with a full term (containing all the possible literals) and then for every , at iteration :

if we know from claim 1 that the variable cannot appear in , so we remove it and its negation from the current term.

if and we know that either or appears in and we remove the one that cannot appear in according to the value of .
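The term-recovery step described in the two cases above can be sketched as follows. Algorithm 1 itself is not reproduced here, so this is an illustration under stated assumptions: examples are tuples of 0/1 bits, a term is a dictionary from variable index to required bit, `oracle` is a hypothetical callback that answers 1-local membership queries, and the two cases are read as "the flipped neighbor is still positive" versus "the flipped neighbor is negative".

```python
from typing import Callable, Dict, List, Tuple

Term = Dict[int, int]      # variable index -> required bit (assumed encoding)
Point = Tuple[int, ...]


def flip(x: Point, j: int) -> Point:
    """Copy of x with coordinate j flipped (a 1-local query around x)."""
    return x[:j] + (1 - x[j],) + x[j + 1:]


def recover_term(x: Point, oracle: Callable[[Point], int]) -> Term:
    """Recover the term that a positive example x satisfies evidently."""
    term: Term = {}
    for j in range(len(x)):
        if oracle(flip(x, j)) == 0:
            # Flipping j destroys the label, so variable j is in the term,
            # with the polarity that x itself satisfies.
            term[j] = x[j]
        # Otherwise the label survives the flip and variable j is dropped.
    return term


def learn_dnf(sample: List[Tuple[Point, int]],
              oracle: Callable[[Point], int]) -> List[Term]:
    """Collect one recovered term per positive training example."""
    terms: List[Term] = []
    for x, y in sample:
        if y == 1:
            term = recover_term(x, oracle)
            if term not in terms:
                terms.append(term)
    return terms
```

Each positive example costs one query per coordinate, so the number of queries is at most the sample size times the dimension.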
After iterations we get exactly , the term that satisfies evidently. Therefore, will contain every term from for which there was an instance in that satisfies it, and other than that will contain no other terms. In other words,
and we get that
Denote by the probability to sample (from ) that will satisfy , and let be the event that did not contain any which satisfies . Then
Notice that since is the probability to sample we get that
Now if we look at the expectation we get
Since we get and using Markov’s inequality we obtain
Theorem 2
The hypothesis class is 1-LQ learnable with respect to distributions that are weakly realized by a small DNF with evident examples.
Proof We will prove that algorithm 2 learns with 1-local membership queries. In this case we will have two sample sets: of size , which will be used as before to build the terms of , and of size , a separate set used to check the terms that were built. Again, it is easy to see that this algorithm is efficient. For training sets of size and of size the algorithm asks for at most local membership queries. The running time of the first loop is and in that loop we add at most terms to , so the running time of the second loop is . All in all, the running time is polynomial in . Also, the hypothesis that the algorithm returns is a formula with at most terms and every term is of size at most n, therefore it can be evaluated in time polynomial in .
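The checking loop on the second sample can be sketched as below. Algorithm 2 is not reproduced here, so the encoding (0/1 tuples, terms as index-to-bit dictionaries) and the function names are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Term = Dict[int, int]      # variable index -> required bit (assumed encoding)
Point = Tuple[int, ...]


def satisfies(x: Point, term: Term) -> bool:
    return all(x[i] == bit for i, bit in term.items())


def prune_terms(terms: List[Term],
                check_sample: List[Tuple[Point, int]]) -> List[Term]:
    """Drop every candidate term that fires on a negatively labeled example."""
    kept: List[Term] = []
    for term in terms:
        contradicted = any(y == 0 and satisfies(x, term)
                           for x, y in check_sample)
        if not contradicted:
            kept.append(term)
    return kept


def predict(terms: List[Term], x: Point) -> int:
    """Evaluate the resulting DNF hypothesis on x."""
    return int(any(satisfies(x, term) for term in terms))
```

A term of the target formula never fires on a negative example, so pruning can only remove wrong terms that were built from non-evident examples.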
Now, let be a distribution on and be a hypothesis such that the pair is weakly realized by a small DNF with evident examples. Let be that small formula (in particular and ). Denote by the DNF formula algorithm 2 returns. Following the same argument from the last proof, a term will be added to H in the first loop if contains an example that satisfies evidently. We will define so that with high probability for every term there will be such that satisfies evidently.
Denote by the probability to sample (from ) that satisfies evidently, and let . Since for every term the probability to see an evident example is nonnegligible, .
For every i, the probability of not seeing an example in that satisfies evidently is
If we set to be we get that the probability of not seeing an example that satisfies evidently (when sampling from ) is less than and from the union bound we get that the probability that the sample will contain an evident example for every term is at least . Therefore with probability of at least we will add every to in the first loop. In the second loop, when we remove terms from , we only remove terms which contradict one of the examples in . Since all of the examples in the sample set are labeled by , we will never remove a term that is a part of . Therefore with probability of at least will contain all of ’s terms. Formally,
Note that we are not done, as the algorithm might create a wrong term (when using a “non-evident” example). For this reason we add the second loop. We use the sample to test every term that was added to in the first loop. If we see an example such that but we remove and continue to the next term. Now denote by the probability to sample (from ) that will satisfy , and by the event that is a wrong term (not from F) but the “checking” step did not discover that. Then
Note that since is the event that there wasn’t any example in which satisfied (otherwise the checking step would discover that is wrong) this is the same situation as in the proof of theorem 1, so
By the same analysis of the former proof, we get that if the size of is then
Finally we notice that , because for each example in the algorithm adds at most one term to . So we can set as above and and if we run algorithm 2 on and we get that with probability of at least over sampling and
4.3 A Lower Bound
In this section we provide evidence that the use of queries in our upper bounds is crucial. We will show that the problem of learning poly-sized decision trees can be reduced to the problem of learning DNFs w.r.t. distributions that are realized by a small DNF with evident examples. As learning decision trees is widely believed to be intractable (in fact, even learning the much smaller class of juntas is conjectured to be hard), this reduction serves as an indication that the problems we considered are hard without membership queries.
Definition 11
A decision tree over is a binary tree with labels chosen from on the internal nodes, and labels from on the leaves. Each internal node’s left branch is viewed as the branch; the right branch is the branch. Each decision tree over variables induces a function in the following way: for a decision tree , a vector defines a path in the tree from the root to a specific leaf by choosing ’s branch at each node, and the value that the function returns on is defined to be the label of the leaf at the end of this path.
Definition 12
Denote by the hypothesis class of all functions that can be realized by a decision tree with a small number of leaves. That is
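Evaluating the function induced by a decision tree can be sketched as follows. The encoding is an assumption of this sketch: a leaf is the integer 0 or 1, an internal node is a tuple `(variable_index, left_subtree, right_subtree)`, and the left branch is taken when the tested bit is 0.

```python
from typing import Sequence, Union

Tree = Union[int, tuple]   # leaf label, or (variable index, left, right)


def evaluate(tree: Tree, x: Sequence[int]) -> int:
    """Follow the path that x defines from the root and return the leaf label."""
    node = tree
    while not isinstance(node, int):
        var, left, right = node
        node = right if x[var] == 1 else left
    return node


def count_leaves(tree: Tree) -> int:
    """Number of leaves, the size measure used for small decision trees."""
    if isinstance(tree, int):
        return 1
    _var, left, right = tree
    return count_leaves(left) + count_leaves(right)
```

For example, the tree `(0, (1, 0, 1), 1)` tests variable 0 and then variable 1, and computes the OR of the two bits with three leaves.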
Theorem 3
PAC learning the hypothesis class w.r.t. distributions that are realized by a small DNF with evident examples is as hard as PAC learning .
The proof will follow from the following claim:
Claim 2
There exists a mapping (a reduction) , that can be evaluated in time so that for every decision tree over there exists a formula over such that the following holds:

The number of terms in is upper bounded by the number of leaves in


such that , satisfies some term in evidently.
Proof We will denote by and by .
Define as follows:
Now, for every tree , we will build the desired formula as follows: First we build , a formula over . Every leaf labeled ’1’ in defines a term as follows: take the path from the root to that leaf and form the logical AND of the literals describing the path. will be a disjunction of these terms. Now, for every term in we will define a term over in the following way: Let and . So
Define
Finally, define to be the formula over by
We will now prove that and satisfy the required conditions. First, can be evaluated in linear time in . Second, it is easy to see that , and as every term in matches one of ’s leaves, the number of terms in cannot exceed the number of leaves in . It is left to show that the third requirement holds. Let be such that ; then is matched to one and only one path from ’s root to a leaf labeled ’1’. From the construction of , satisfies one and only one term in , because every term is matched to exactly one path from ’s root to a leaf labeled ’1’. Regarding the last requirement, that no coordinate change will make one term from False and another one True, we made sure this will not happen by “doubling” each variable. By this construction, in order to change a term from False to True at least two coordinates must change their value.
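The construction in this proof can be sketched in code. The exact definitions of the mapping and of the doubled terms are not available here, so the sketch assumes the natural reading of “doubling”: the mapping repeats every coordinate twice, and every literal on variable `i` becomes two literals on coordinates `2*i` and `2*i + 1`. The tree encoding (leaf as 0 or 1, internal node as `(variable_index, left, right)` with left taken on bit 0) and all function names are likewise hypothetical, and each variable is assumed to be tested at most once on any root-to-leaf path.

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union

Tree = Union[int, tuple]   # leaf label, or (variable index, left, right)
Term = Dict[int, int]      # variable index -> required bit


def tree_to_terms(tree: Tree, path: Optional[Term] = None) -> List[Term]:
    """One term per leaf labeled 1: the AND of the literals on its path."""
    path = {} if path is None else path
    if isinstance(tree, int):
        return [dict(path)] if tree == 1 else []
    var, left, right = tree
    return (tree_to_terms(left, {**path, var: 0}) +
            tree_to_terms(right, {**path, var: 1}))


def double_example(x: Sequence[int]) -> Tuple[int, ...]:
    """Assumed mapping: (x1, x2, ...) -> (x1, x1, x2, x2, ...)."""
    return tuple(bit for bit in x for _ in range(2))


def double_term(term: Term) -> Term:
    """Replace each literal on variable i by literals on 2*i and 2*i + 1."""
    doubled: Term = {}
    for i, bit in term.items():
        doubled[2 * i] = bit
        doubled[2 * i + 1] = bit
    return doubled


def reduce_tree(tree: Tree) -> List[Term]:
    """DNF over the doubled variables that agrees with the tree on mapped inputs."""
    return [double_term(term) for term in tree_to_terms(tree)]


def evaluate_dnf(terms: List[Term], x: Sequence[int]) -> int:
    return int(any(all(x[i] == bit for i, bit in term.items())
                   for term in terms))
```

Under this reading, flipping a single coordinate of a mapped example makes the two copies of one variable disagree, so no term that mentions that variable can become true, which is the intuition behind the third requirement.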
Proof [of theorem 3]
Suppose we have an efficient algorithm that PAC learns with respect to distributions that are realized by a small DNF with evident examples. Using the reduction from claim 2 we will build an efficient algorithm that will PAC learn .
For every training set with examples from :
we define a matching training set with examples from , using from the above claim:
The algorithm will work as follows:
Given a training set , will construct and then run with input . Let be the output of when running on , will return . Since can be evaluated in time and is efficient, we get that is also efficient.
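The wrapper construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `lift_learner`, `double_coordinates` and `memorizing_learner` are hypothetical names, the coordinate-doubling map is a stand-in suggested by the “doubling” remark in the proof of Claim 2 rather than the exact mapping, and the memorizing learner is a toy placeholder for the assumed efficient algorithm.

```python
from typing import Callable, List, Tuple

Instance = Tuple[int, ...]
Example = Tuple[Instance, int]
Hypothesis = Callable[[Instance], int]


def double_coordinates(x: Instance) -> Instance:
    """Hypothetical stand-in for the mapping: repeat every coordinate twice."""
    return tuple(bit for bit in x for _ in range(2))


def lift_learner(
    base_learner: Callable[[List[Example]], Hypothesis],
    mapping: Callable[[Instance], Instance],
) -> Callable[[List[Example]], Hypothesis]:
    """Build a learner for the source class from a learner for the target class."""

    def lifted(sample: List[Example]) -> Hypothesis:
        # Map every training instance, keeping its label unchanged.
        mapped_sample = [(mapping(x), y) for x, y in sample]
        # Run the base learner on the mapped training set.
        h = base_learner(mapped_sample)
        # Return the composition of the learned hypothesis with the mapping.
        return lambda x: h(mapping(x))

    return lifted


def memorizing_learner(sample: List[Example]) -> Hypothesis:
    """Toy base learner (assumption): memorize the sample, predict 0 elsewhere."""
    table = {x: y for x, y in sample}
    return lambda x: table.get(x, 0)


sample = [((0, 1), 1), ((1, 1), 0)]
hypothesis = lift_learner(memorizing_learner, double_coordinates)(sample)
```

The wrapper adds one evaluation of the mapping per training example and one per prediction, which is why the efficiency of the composed algorithm follows from the efficiency of its two parts.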
We will prove that algorithm is a learning algorithm for the class . Since is a learning algorithm for the class with respect to distributions that are realized by a small with evident examples, there exists a function , such that for every that is realized by a small with evident examples and every , if is given a training sequence
where the ’s are sampled i.i.d. from and , then with probability of at least (over the choice of ), the output of satisfies .
Let be a distribution on and let be a hypothesis that can be realized by a small . Define a distribution on by,
Since is one-to-one, is well defined and is a valid distribution on .
Now, as is realized by a small , then from the conditions that satisfies we get that there exists a formula such that and the pair is realized by a small with evident examples. Now for every we take a sample with and obtain that with probability of at least it holds that
So is indeed a learning algorithm for the class
5 Experiments
Membership queries are a means by which we can use human knowledge to improve performance in learning tasks. Human beings have rich knowledge and understanding of many problems that the ML community works on. They can provide much more information than merely the category of an object or an answer to a “yes” or “no” question. This knowledge is often basic, and can be acquired without the use of an expert (e.g., using crowdsourcing). In this section we present empirical results of an algorithm which takes advantage of this extensive knowledge in order to perform smart feature selection.
In standard supervised classification tasks the user is only asked to give the label of each example. In this task we asked for additional information. Specifically, we considered a setting with a large number of features, each of which has an easily understood interpretation. For every example in the sample set, we asked the user for its label and, in addition, which features indicate that the instance is labeled as such. After iterating over the entire sample, we used the information on the relevant features to narrow down the feature space. Concretely, we trained linear classifiers only on the features that the users chose as indicative.
Arguably, this algorithm gathers additional information in a manner that is similar to using 1-local membership queries. A 1-local query tests whether changing the value of a single feature changes the label. This can be seen as asking whether this feature is relevant to the prediction or not. In the algorithm presented here, we ask for the relevant features in a broader way. Namely, we explicitly ask which words are relevant to the corresponding label.
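To make the notion of a 1-local query concrete, the following minimal Python sketch flips one binary feature at a time and asks a labeling oracle whether the label changes. The function name `one_local_relevant_features` and the toy oracle are illustrative assumptions; in the experiments below the annotators are asked directly for the relevant words, and no such coordinate flipping is performed.

```python
from typing import Callable, List, Tuple

Instance = Tuple[int, ...]


def one_local_relevant_features(
    x: Instance, label_oracle: Callable[[Instance], int]
) -> List[int]:
    """Return the indices whose single flip changes the oracle's label of x."""
    base_label = label_oracle(x)
    relevant = []
    for i in range(len(x)):
        flipped = list(x)
        flipped[i] = 1 - flipped[i]  # change exactly one coordinate
        if label_oracle(tuple(flipped)) != base_label:
            relevant.append(i)
    return relevant


def toy_oracle(x: Instance) -> int:
    """Toy labeling function (assumption): label 1 iff x[0] = 1 and x[2] = 0."""
    return int(x[0] == 1 and x[2] == 0)


relevant = one_local_relevant_features((1, 0, 0), toy_oracle)
```

Each instance with d features costs d oracle calls in this sketch, which is one reason why asking the annotator for the relevant features directly can be the cheaper option.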
5.1 Is the additional data useful?
When humans make decisions, it is often by very complex thought processes, and we do not know whether we can access the specific considerations that were used in the decision making process. The first goal of this experiment is to show that, at least for some tasks, important parts of this thought process are easily accessible, that is, that the annotators’ knowledge can be retrieved by asking simple questions. The second goal is to show that using this extra knowledge can significantly decrease the number of tagged examples that are required.
We will formulate the above goals using the notion of error decomposition. Let $h$ be the classifier returned by the algorithm, and let $\mathcal{H}$ be the class of linear classifiers. We decompose the error $\mathrm{err}(h)$ as a sum of the approximation error $\epsilon_{\mathrm{app}}$ (the error of the best linear classifier) and the estimation error $\epsilon_{\mathrm{est}}$ (the difference between $\mathrm{err}(h)$ and the approximation error):
$$\mathrm{err}(h) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}}, \qquad \epsilon_{\mathrm{app}} = \min_{h' \in \mathcal{H}} \mathrm{err}(h'), \qquad \epsilon_{\mathrm{est}} = \mathrm{err}(h) - \epsilon_{\mathrm{app}}.$$
The approximation error measures how good the class of linear classifiers that we restrict ourselves to is. In other words, since the class is linear, it measures how informative the features we use are. The estimation error measures to what extent the algorithm overfits the data.
We can now formulate the above goals as claims on the approximation and estimation errors. Applying the user-induced feature selection mentioned above can only increase the approximation error, as we reduce the hypothesis class to a smaller one. We want to show that the feature space chosen by the users is still expressive enough, so that the increase in the approximation error is minor. In addition, we will show that the feature selection is effective in the sense that the estimation error decreases significantly.
5.2 Experimental setup
5.2.1 Sentiment Analysis
Sentiment analysis (SA) is the Natural Language Processing task of identifying the attitude of a given text (usually whether it is positive, neutral or negative). This task has been studied in the NLP community for many years at different levels of granularity. It started as a document-level classification task (Pang and Lee, 2004), and then the focus shifted to the sentence level (Hu and Liu, 2004; Kim and Hovy, 2004). The most recent focus is sentiment analysis of microblog data such as Twitter. Working with these informal text genres, in which users post their opinions, emotions, and reactions about practically everything, presents new challenges for natural language processing beyond those encountered with more traditional text genres such as newswire or product reviews. Indeed, classical approaches to sentiment analysis (Pang and Lee, 2008) are not directly applicable to tweets: most of them focus on relatively large texts, e.g. movie or product reviews, whereas tweets are very short and fine-grained. Nevertheless, the prominence of social media in recent years has encouraged a focus on sentiment detection in the microblogging domain, and there has been a lot of recent work on sentiment analysis of Twitter data (Pak and Paroubek, 2010; Kouloumpis et al., 2011; Davidov et al., 2010; Barbosa and Feng, 2010).
We chose this task to demonstrate our method for two reasons. First, each example (tweet) is constructed from a limited number of features (words), making each of these features very important for classification. Therefore, it seems that information supplied by users can be useful in focusing our attention on the important features. Second, if the two claims above hold, our method will enable us to use a smaller data set, which is very important for this kind of task, since SA (like many other NLP tasks) requires a large labeled data set, which is often costly to obtain.
5.2.2 Dataset
Negative  Neutral  Positive  All  
Train  1234  3012  8439  
Test  4701  
We worked with the data set from SemEval (Nakov et al., 2013), a shared task for Sentiment Analysis of Tweets. This dataset consists of 13,140 tweets (8,439 train+development and 4,701 test, see Table 1), which were collected over a one-year period spanning from January 2012 to January 2013. The tweets were labeled using the crowdsourcing tool Amazon Mechanical Turk, and the labels were filtered to remove spammers.
For each sentence (tweet), the users were asked to indicate the overall sentiment of the sentence (positive, negative or neutral) and also to mark all the subjective (positive or negative) words/phrases in the sentence. (The original labeling had four classes, objective, positive, negative, and neutral, but since the turkers tended to confuse objective with neutral, the two classes were combined in the final task.) This labelling procedure was originally intended for two separate tasks. The first is, given a tweet containing a marked instance of a word or a phrase, to identify the sentiment of that instance (i.e., whether the word is negative or positive). The second is to identify the sentiment of the whole tweet (without using the marked words). The learning task that we worked on is classifying the sentiment of the entire sentence. Although we only want to predict the sentiment of the tweet, we use these two labellings to obtain one “richer” labelled dataset. That is, each instance in our training set holds, in addition to its sentiment, information on which words/phrases in the sentence indicate a positive or negative sentiment.
5.2.3 Preprocessing
Besides simple text, tweets may contain URL addresses, references to other Twitter users (appearing as @username) or content tags (also called hashtags) assigned by the tweeter (#tag). During preprocessing, we performed the following standard manipulations:

Words were switched to lower case and punctuation marks were removed (apart from a fixed set of smileys)

Every hyperlink was replaced by the metaword URL

Every word starting with @, i.e. a username in Twitter syntax, was replaced by the metaword USR.

The hashtag sign was removed from every tag to get a simple word. For example, #perfect was changed to perfect.
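The four manipulations above can be sketched as a small tokenizer. This is a minimal illustration, not the code used in the experiments: the smiley set `KEPT_SMILEYS`, the URL pattern, and whitespace tokenization are all assumptions, since the text does not specify them.

```python
import re
from typing import List

# Assumption: the text only says "a fixed set of smileys"; this set is illustrative.
KEPT_SMILEYS = {":)", ":(", ":D", ";)"}

# Assumption: a token counts as a hyperlink if it starts with http(s):// or www.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def preprocess(tweet: str) -> List[str]:
    """Normalize a raw tweet into a list of tokens."""
    tokens = []
    for raw in tweet.split():
        if raw in KEPT_SMILEYS:
            tokens.append(raw)  # smileys survive punctuation removal
        elif URL_RE.match(raw):
            tokens.append("URL")  # every hyperlink becomes the metaword URL
        elif raw.startswith("@") and len(raw) > 1:
            tokens.append("USR")  # every username becomes the metaword USR
        else:
            word = raw.lstrip("#").lower()  # drop the hashtag sign, lower-case
            word = re.sub(r"[^\w\s]", "", word)  # remove punctuation marks
            if word:
                tokens.append(word)
    return tokens


tokens = preprocess("@bob That movie was #perfect!!! :) http://t.co/abc")
```

The metawords are kept in upper case so that they cannot collide with ordinary lower-cased words. Smileys glued to a neighbouring word are not recognized by this sketch.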
5.2.4 Language Model
We used simple bag-of-words language models over n-grams (in our case unigrams, bigrams and trigrams). That is, each tweet is represented as a sparse vector in $\{0,1\}^d$, where $d$ is the size of the dictionary and the $i$'th coordinate equals 1 if and only if the $i$'th word in the dictionary appears in the tweet. We performed a standard cutoff of rare n-grams. (Without this cutoff, the results for the non-query variant are much worse.)
5.2.5 Scoring
The results were evaluated on averaged $F_1$ scores. This scoring function is used in the SemEval shared task and is, overall, a very common scoring function for NLP tasks. The $F_1$ score is the harmonic mean of Precision and Recall. Every label has its own $F_1$ score. For the positive label, the Precision is the number of tweets that were correctly labeled as positive divided by the total number of tweets that were labeled as positive:
$$P_{pos} = \frac{\#\{\text{tweets correctly labeled as positive}\}}{\#\{\text{tweets labeled as positive}\}}$$
The Recall of the positive label is the number of tweets that were correctly labeled as positive divided by the total number of positive tweets in the data:
$$R_{pos} = \frac{\#\{\text{tweets correctly labeled as positive}\}}{\#\{\text{positive tweets in the data}\}}$$
The positive label score is computed as follows:
$$F_1^{pos} = \frac{2 \, P_{pos} \, R_{pos}}{P_{pos} + R_{pos}}$$
The negative label score is computed similarly. The final score that the results are evaluated on is the average of the above two:
$$F_1 = \frac{F_1^{pos} + F_1^{neg}}{2}$$
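The scoring function can be written down directly from these definitions. The sketch below is illustrative: the function names and the label strings "positive", "negative" and "neutral" are assumptions, and the convention of returning 0 when a denominator is empty is a choice made here rather than one stated in the text.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[str], predicted: Sequence[str], label: str) -> float:
    """F1 score of a single label: harmonic mean of its precision and recall."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    labeled_as = sum(1 for p in predicted if p == label)
    in_data = sum(1 for g in gold if g == label)
    # Assumption: an empty denominator yields 0 rather than an error.
    precision = correct / labeled_as if labeled_as else 0.0
    recall = correct / in_data if in_data else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def averaged_f1(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Average of the positive and negative F1 scores."""
    positive = f1_for_label(gold, predicted, "positive")
    negative = f1_for_label(gold, predicted, "negative")
    return (positive + negative) / 2


gold = ["positive", "positive", "negative", "neutral"]
predicted = ["positive", "negative", "negative", "positive"]
score = averaged_f1(gold, predicted)
```

Neutral tweets never get a score of their own, but they still influence the result: a neutral tweet predicted as positive lowers the positive precision, as the last example in the toy data does.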
5.2.6 The algorithm
We compare two variants of the feature space: the entire feature space (after cutting off the rare n-grams), and the “query acquired” feature space, which contains only features that were selected by the users as positive or negative for some example. Information about the data and the number of features is given in Table 2.
                                Unigrams  Bigrams  Trigrams
Overall number of features      18257     89788    128699
Features after cutoff           3182      3099     1718
Features selected by the users  1391      1368     846
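The two feature spaces compared here can be sketched as follows. This is a hedged illustration: the helper names, the document-frequency cutoff `min_count`, and the choice to intersect the user-marked words with the post-cutoff dictionary are assumptions, since the text gives neither the cutoff threshold nor the exact construction.

```python
from collections import Counter
from typing import List, Sequence, Set


def build_dictionary(tweets: Sequence[Sequence[str]], min_count: int) -> List[str]:
    """Full feature space: words appearing in at least min_count tweets."""
    counts = Counter(word for tokens in tweets for word in set(tokens))
    return sorted(word for word, count in counts.items() if count >= min_count)


def query_acquired_dictionary(
    dictionary: Sequence[str], marked_words: Sequence[Set[str]]
) -> List[str]:
    """Restricted feature space: words some user marked for some example."""
    selected = set().union(*marked_words)
    return [word for word in dictionary if word in selected]


def to_binary_vector(tokens: Sequence[str], dictionary: Sequence[str]) -> List[int]:
    """Coordinate i is 1 iff the i'th dictionary word appears in the tweet."""
    present = set(tokens)
    return [1 if word in present else 0 for word in dictionary]


tweets = [["great", "movie"], ["bad", "movie"], ["great", "day"]]
marked = [{"great"}, {"bad"}, {"great"}]  # words the annotators marked as subjective

full_space = build_dictionary(tweets, min_count=2)
query_space = query_acquired_dictionary(full_space, marked)
```

The same classifier can then be trained on `to_binary_vector(tokens, full_space)` or on `to_binary_vector(tokens, query_space)`, so the two variants differ only in the dictionary they are given.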
We used a simple Naive Bayes classifier with a small smoothing parameter. We also checked other classification algorithms: random forests, logistic regression, and multiclass SVM (with $\ell_1$ regularization and $\ell_2$ regularization), but the results of the Naive Bayes predictor were the highest for both feature spaces.
5.3 Results
We present the results of the unigram model. The test scores of the other language models (unigram+bigram and unigram+bigram+trigram) are almost identical for both feature spaces, and the training scores get higher with the model complexity, as expected. Since our training set contains only approximately 8000 instances, we chose to present the results of the simplest model, so that the number of features would be comparable to the number of instances.
The results of both variants are presented in Figure 2. As the test scores show, our algorithm outperforms the variant that does not use the additional information. The difference in test performance is approximately constant across different training sizes. Returning to our claims: regarding the approximation error, the final training scores (using the largest possible training set) show that both variants are almost identical in all of the measurements. This indicates that we did not increase the approximation error. Regarding the improvement in the estimation error, it can be seen clearly in the gap between the test scores and the train scores: the gap in the query acquired model is smaller than the gap in the other model.
5.3.1 Precision and Recall
Additional interesting properties can be seen in the precision and recall graphs (Figure 3). For example, the results for positive samples (a & b) show that the improvement from using the query model is due almost entirely to the improvement in the precision scores. If we only use of the data, the query model reaches 0.77 test precision, while the non-query model only reaches 0.71 test precision even when using the whole data set. Another interesting property is that when a small training set is used, the difference in the test scores between the query and non-query methods is about twice as large as the difference when the largest possible training set is used.
5.3.2 Overfitting
When using the Naive Bayes algorithm, we estimate for every feature and every label . This term measures how much the appearance of contributes to the fact that is the correct label. (By the naive assumption that all of the features are independent given the label, this is actually the only information we use in order to build the classifier.) Using those terms, we can sort the features in an order that conveys their informativeness. Since our features are words (or bigrams or trigrams), we can get some interesting insights by looking at the most informative features that each variant uses. If we only look at the top of the list (the top 20), the chosen features