A Robust UCB Scheme for Active Learning in Regression from Strategic Crowds

01/25/2016 · by Divya Padmanabhan, et al.

We study the problem of training an accurate linear regression model by procuring labels from multiple noisy crowd annotators, under a budget constraint. We propose a Bayesian model for linear regression in crowdsourcing and use variational inference for parameter estimation. To minimize the number of labels crowdsourced from the annotators, we adopt an active learning approach. In this specific context, we prove the equivalence of well-studied criteria of active learning like entropy minimization and expected error reduction. Interestingly, we observe that we can decouple the problems of identifying an optimal unlabeled instance and identifying an annotator to label it. We observe a useful connection between the multi-armed bandit framework and the annotator selection in active learning. Due to the nature of the distribution of the rewards on the arms, we use the Robust Upper Confidence Bound (UCB) scheme with truncated empirical mean estimator to solve the annotator selection problem. This yields provable guarantees on the regret. We further apply our model to the scenario where annotators are strategic and design suitable incentives to induce them to put in their best efforts.


1 Introduction

Crowdsourcing platforms such as Amazon Mechanical Turk have become popular avenues for getting large-scale human intelligence tasks executed at low cost. In particular, they have been widely used to procure labels to train learning models. These platforms are characterized by a large pool of diverse yet inexpensive annotators. To leverage these platforms for learning tasks, the following issues need to be addressed: (1) a learning model that encompasses parameter estimation and annotator quality estimation; (2) identifying the best yet minimal set of instances from the pool of unlabeled data; (3) determining an optimal subset of annotators to label the instances; (4) providing suitable incentives to elicit best efforts from the chosen annotators under a budget constraint. We provide an end-to-end solution addressing the above issues for a regression task.

Identifying the best yet minimal set of instances to be labeled is important to minimize the generalization error, as the learner has only a limited budget. This involves selecting those unlabeled instances whose labels, when fed to the learner, yield the maximum performance enhancement of the underlying model. The question of choosing an optimal set of unlabeled examples occupies center stage in the realm of active learning. Past work on active learning in crowdsourcing applies to classification [27, 23], and most of it does not carry over to regression, where the space of labels is unbounded. For instance, the Markov Decision Process (MDP) based method [23] relies on the label space, and thereby the state space, being finite, which is not the case in regression.

Similar to the instance selection problem, the choice of annotator to label an instance also has a bearing on the accuracy of the learnt model. Optimal annotator selection, in the context of classification, has been addressed using multi-armed bandit (MAB) algorithms [1]. Here the annotators are considered as the arms and their qualities as the stochastic rewards. In classification, the quality of an annotator is modeled as a Bernoulli random variable, making it suitable for algorithms such as UCB1 [2, 7]. For regression tasks, however, the labels provided by the annotators are naturally modeled as having Gaussian noise, the variance of which is a measure of the quality of the annotator; this variance, in turn, is a function of the effort put in. Therefore, the optimal annotator selection problem involves identifying annotators with low variance. Though existing work has adopted MAB algorithms for estimating variance [22] and for several other applications [30], there is a research gap in their applicability to active learning and regression tasks, in particular where heavy-tailed distributions arise as a result of squaring the Gaussian noise. To bridge this gap, we invoke ideas from Robust UCB [8] and establish theoretical guarantees for annotator selection in active learning.

Another non-trivial challenge emerges when we are required to account for the strategic behavior of the human agents. An agent, in the absence of suitable incentives, may not find it beneficial to put in effort while labeling the data. To induce best efforts from agents, the learner can appropriately incentivize them. In the field of mechanism design, several incentive schemes exist [13, 34]. To the best of our knowledge, such schemes have not been explored in the context of active learning for regression.

Contributions: The key contributions of this paper are as follows.
(1) Bayesian model for regression: In Section 3, we set up a novel Bayesian model for regression using labels from multiple annotators with varying noise levels, which makes the problem challenging. We use variational inference for parameter estimation to overcome intractability issues.
(2) Active learning for crowd regression, decoupling instance selection and annotator selection: In Section 4.1, we focus on various active learning criteria as applicable to the proposed regression model. Interestingly, in our setting, we show that the criteria of minimizing estimator error and minimizing the estimator's entropy are equivalent. These criteria also, remarkably, enable us to decouple the problems of instance selection and annotator selection.
(3) Annotator selection with multi-armed bandits: In Section 4.2, we describe the problem of selecting an annotator having the least variance. We establish an interesting connection of this problem to the multi-armed bandit problem. In our formulation, we work with the square of the label noise to cast the problem into a variance minimization framework; the square of the noise follows a sub-exponential distribution. We show that standard UCB strategies based on $\psi$-UCB [7] are not applicable, and we propose the use of Robust UCB [8] with the truncated empirical mean. We show that the logarithmic regret bound of Robust UCB is preserved in this setting as well; moreover, the number of samples discarded is also logarithmic.
(4) Handling strategic agents: In Section 5, we consider the case of strategic annotators, where the learner needs to induce them to put in their best efforts. For this, we propose the notion of 'quality compatibility' and introduce a payment scheme that induces agents to put in their best efforts and is also individually rational.
(5) Experimental validation: We describe our experimental findings in Section 6. We compare the RMSE and regret of our proposed models with state-of-the-art benchmarks on several real-world datasets. Our experiments demonstrate superior performance.

2 Related Work

A rich body of literature exists in the field of active learning for statistical models where labels are provided by a single source [28, 12, 9, 10]. Popular techniques include minimizing the variance or uncertainty of the learner, query-by-committee schemes [33], and expected gradient length [32], to name a few. In the literature on Optimal Experimental Design in Statistics, the selection of the most informative data instances is captured by concepts such as A-optimality and D-optimality [16, 18]; the idea is to construct confidence regions for the learner and bound these regions. A survey on active learning approaches for various problems is presented in [31].

The works that have looked into active learning for regression apply only to a single noisy source, not to a crowd. In crowdsourcing, several learning models for regression have been proposed; for instance, [25, 26] obtain the maximum likelihood estimate (MLE) and the maximum-a-posteriori (MAP) estimate respectively, while [17] proposes a scheme to aggregate information from multiple annotators for regression using Gaussian Processes. [4, 24] develop models for classification using crowds. However, these do not employ techniques from active learning; they also do not obtain a posterior distribution over the parameters, and hence do not perform probabilistic inference. Of late, there have been a few crowdsourcing classification models employing the active learning paradigm [27, 36, 35, 23, 15], including uncertainty-based methods and MDPs. To the best of our knowledge, active learning for regression using the crowd has not been looked at explicitly.

When an annotator is requested to label an instance and, being strategic, does not put in the best effort, the learning algorithm could seriously underperform. We must therefore incentivize the annotator to induce the best effort. Such studies are not reported in the current literature. [11, 14] propose payment schemes for linear regression with crowds; however, both assume that an instance is provided only to a single annotator, and neither looks at the active learning paradigm. The idea in our work is to design incentives for active learning in the context of crowdsourced regression that induce the annotators to put in their best efforts.

In the next section, we explain our model for regression using the crowd, assuming non-strategic annotators.

3 Bayesian Linear Regression from a Non-strategic Crowd

Given a data instance $\mathbf{x} \in \mathbb{R}^d$, the linear regression model aims at predicting its label $y$ such that $y = \mathbf{w}^{\top}\mathbf{x}$. Instead of $\mathbf{x}$, non-linear basis functions of $\mathbf{x}$ can be used; to avoid notational clutter, we work with $\mathbf{x}$ throughout this paper. The coefficient vector $\mathbf{w}$ is unknown, and training a linear regression model essentially involves finding $\mathbf{w}$. Let $\mathcal{D}$ be the initially procured training dataset and let $\mathcal{U}$ denote the pool of unlabeled instances. We later (in Section 4.1) select instances from $\mathcal{U}$ via active learning to enhance our model.

In classical linear regression, the labels are assumed to be provided by a single noisy source. In crowdsourcing, however, there are multiple annotators, denoted by the set $\{1, \ldots, K\}$. Each annotator $k$ provides a label vector $\mathbf{y}_k$, where $y_{ik}$ denotes the label for instance $i$. Each annotator may or may not provide the label for every instance in the training set. We therefore define an indicator matrix $Z$, where $z_{ik} = 1$ if annotator $k$ labels instance $i$, else $z_{ik} = 0$. We denote by $m_k$ the number of labels provided by annotator $k$, that is, $m_k = \sum_i z_{ik}$. We also define a matrix $X_k$ whose rows contain the instances labeled by annotator $k$. The true label of a data instance $\mathbf{x}_i$ is given by $\mathbf{w}^{\top}\mathbf{x}_i$. Each annotator introduces a Gaussian noise in the label he provides, that is, $y_{ik} = \mathbf{w}^{\top}\mathbf{x}_i + \epsilon_k$, where $\epsilon_k \sim \mathcal{N}(0, \gamma_k^{-1})$ and $\gamma_k$ is the precision, or inverse variance, of the noise distribution. Intuitively, $\gamma_k$ is directly proportional to the effort put in by annotator $k$. We assume that there is always a maximum level of effort that annotator $k$ can put in; the precision corresponding to his best effort is $\bar{\gamma}_k$, which is unknown to the learner as well as to the other annotators.
In general, an annotator may be strategic and may exert a lower effort level if appropriate incentives are not provided. In this section, however, we adhere to the assumption that annotators are non-strategic and that annotator $k$ always labels with precision $\bar{\gamma}_k$, thereby setting $\gamma_k = \bar{\gamma}_k$. The parameters of the linear regression model from crowds therefore become $(\mathbf{w}, \gamma_1, \ldots, \gamma_K)$. The aim of training is to obtain estimates of these parameters using the training data $\mathcal{D}$. We now describe a Bayesian framework for this.

Figure 1: Plate notation for our model

Bayesian Model and Variational Inference for Parameter Estimation:

A Bayesian framework for parameter estimation is well suited for active learning, as incremental learning can be done conveniently. A Bayesian framework has been developed for estimating the parameters of the linear regression model when the labels of training data are supplied by a single noisy source [5]. To the best of our knowledge, the counterpart of such a framework in the presence of multiple annotators has not been explicitly explored. We assume a Gaussian prior for $\mathbf{w}$ with mean $\boldsymbol{\mu}_0$ and precision matrix (inverse covariance matrix) $\Lambda_0$. We assume Gamma priors for the $\gamma_k$'s, that is, $\gamma_k \sim \text{Gamma}(a_0, b_0)$. The plate notation of the Bayesian model described above is provided in Figure 1. The computation of the posterior distributions $p(\mathbf{w} \mid \mathcal{D})$ and $p(\gamma_k \mid \mathcal{D})$ for $k = 1, \ldots, K$ is not tractable. Therefore, we appeal to variational approximation methods [3], which approximate the posterior distributions using mean field assumptions. We use $q(\mathbf{w})$ and $q(\gamma_k)$ to represent the mean field variational approximations of $p(\mathbf{w} \mid \mathcal{D})$ and $p(\gamma_k \mid \mathcal{D})$ respectively. The variational approximation begins by initializing the parameters of the prior distributions, $(\boldsymbol{\mu}_0, \Lambda_0)$ and $(a_0, b_0)$ for all $k$. At each iteration of the algorithm, the parameters of the posterior approximation are updated, and the steps are repeated until convergence.

Lemma 1

The variational update rules for the posterior approximations under mean field assumptions are $q(\mathbf{w}) = \mathcal{N}(\boldsymbol{\mu}, \Lambda^{-1})$ and $q(\gamma_k) = \text{Gamma}(a_k, b_k)$, where

(1)  $\Lambda = \Lambda_0 + \sum_{k=1}^{K} \mathbb{E}[\gamma_k]\, X_k^{\top} X_k$
(2)  $\boldsymbol{\mu} = \Lambda^{-1}\left(\Lambda_0 \boldsymbol{\mu}_0 + \sum_{k=1}^{K} \mathbb{E}[\gamma_k]\, X_k^{\top} \mathbf{y}_k\right)$
(3)  $a_k = a_0 + \frac{m_k}{2}$
(4)  $b_k = b_0 + \frac{1}{2}\, \mathbb{E}\left[\|\mathbf{y}_k - X_k \mathbf{w}\|^2\right]$
Proof

Let $p(\boldsymbol{\theta} \mid \mathcal{D})$ and $q(\boldsymbol{\theta})$ denote the true and approximate posterior joint distributions of the parameters $\boldsymbol{\theta} = (\mathbf{w}, \gamma_1, \ldots, \gamma_K)$ respectively. We know that $\ln p(\mathcal{D}) = \mathcal{L}(q) + \text{KL}(q \,\|\, p)$, where $\mathcal{L}(q) = \int q(\boldsymbol{\theta}) \ln \frac{p(\mathcal{D}, \boldsymbol{\theta})}{q(\boldsymbol{\theta})}\, d\boldsymbol{\theta}$ and $\text{KL}(q \,\|\, p)$ is the KL divergence between $q(\boldsymbol{\theta})$ and $p(\boldsymbol{\theta} \mid \mathcal{D})$. By the mean field assumption, the joint distribution factorizes as $q(\boldsymbol{\theta}) = q(\mathbf{w}) \prod_{k=1}^{K} q(\gamma_k)$. For simplicity we denote $q(\mathbf{w})$ by $q_{\mathbf{w}}$ and $q(\gamma_k)$ by $q_k$. Viewed as a function of $q_{\mathbf{w}}$ alone,

(5)  $\mathcal{L}(q) = -\,\text{KL}\left(q_{\mathbf{w}} \,\big\|\, \tilde{p}\right) + \text{const},$

where $\ln \tilde{p}(\mathbf{w}) = \mathbb{E}_{\prod_k q_k}\left[\ln p(\mathcal{D}, \boldsymbol{\theta})\right] + \text{const}$. In order to minimize $\text{KL}(q \,\|\, p)$, we must maximise $\mathcal{L}(q)$. Eqn (5) shows that $\mathcal{L}(q)$ is the negative KL-divergence between $q_{\mathbf{w}}$ and $\tilde{p}$, and it is maximised when this KL-divergence is minimized. Therefore, we must set $q_{\mathbf{w}} = \tilde{p}$. By similar calculations, we must set $\ln q_k(\gamma_k) = \mathbb{E}_{q_{\mathbf{w}} \prod_{j \neq k} q_j}\left[\ln p(\mathcal{D}, \boldsymbol{\theta})\right] + \text{const}$.

By completing the squares we get the update rules (1) and (2) for $q(\mathbf{w})$. Similar steps yield the variational updates (3) and (4) for $q(\gamma_k)$. Due to constraints on space, we have not included these steps.

The variational updates for $\Lambda$ and $\boldsymbol{\mu}$ defined in Eqns (1) and (2) involve $\mathbb{E}[\gamma_k]$. The updates for $b_k$ given in Eqn (4) involve $\mathbb{E}[\mathbf{w}]$ and $\mathbb{E}[\mathbf{w}\mathbf{w}^{\top}]$. This interdependency between the update equations leads to an iterative algorithm.
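To make the iterative scheme concrete, the following is a minimal sketch of the mean-field updates in Python with NumPy. All names (e.g. `fit_vi`, `X_list`, `y_list`) are ours for illustration, and the update expressions follow Eqns (1)-(4) as reconstructed above rather than any code released with the paper.

```python
import numpy as np

def fit_vi(X_list, y_list, mu0, Lambda0, a0, b0, n_iters=100, tol=1e-8):
    """Mean-field variational inference for crowd linear regression.

    X_list[k]: (m_k, d) instances labeled by annotator k
    y_list[k]: (m_k,)   labels given by annotator k
    Returns posterior mean mu, precision Lambda, and the Gamma
    parameters (a_k, b_k) of each annotator's noise precision.
    """
    K = len(X_list)
    a = np.array([a0 + 0.5 * len(y) for y in y_list])   # Eqn (3), fixed once
    b = np.full(K, float(b0))                            # refined each iteration
    mu, Lam = mu0.copy(), Lambda0.copy()
    for _ in range(n_iters):
        E_gamma = a / b                                  # mean of Gamma(a_k, b_k)
        # Eqns (1)-(2): Gaussian update for q(w)
        Lam = Lambda0 + sum(g * X.T @ X for g, X in zip(E_gamma, X_list))
        rhs = Lambda0 @ mu0 + sum(g * X.T @ y
                                  for g, X, y in zip(E_gamma, X_list, y_list))
        mu_new = np.linalg.solve(Lam, rhs)
        # Eqn (4): E||y_k - X_k w||^2 = ||y_k - X_k mu||^2 + tr(X_k Lam^{-1} X_k^T)
        Lam_inv = np.linalg.inv(Lam)
        b = np.array([b0 + 0.5 * (np.sum((y - X @ mu_new) ** 2)
                                  + np.trace(X @ Lam_inv @ X.T))
                      for X, y in zip(X_list, y_list)])
        if np.linalg.norm(mu_new - mu) < tol:
            mu = mu_new
            break
        mu = mu_new
    return mu, Lam, a, b
```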

Remark 1 (Parameter Estimation): Our approach is not tied to the variational inference approximation scheme; for example, MCMC can be used instead.

Lemma 2

Asymptotic convergence of Bayes estimators: Let $\mathbf{w}^{*}$ be the true underlying value of $\mathbf{w}$ and let $\hat{\mathbf{w}}$ be the Bayes estimator for $\mathbf{w}$ under the least squares loss, namely the posterior mean. Then $\hat{\mathbf{w}} \to \mathbf{w}^{*}$ as the number of labeled samples grows.

Proof

Let $\boldsymbol{\mu}_n$ and $\Lambda_n$ be the mean and precision, respectively, of the approximate posterior distribution $q(\mathbf{w})$ estimated from a training set of $n$ labeled samples, so that $\hat{\mathbf{w}} = \boldsymbol{\mu}_n$. Let $\mathbf{w}^{*}$ be the realized value of the underlying $\mathbf{w}$. Taking expectations over the labels in Eqn (2),

(6)  $\mathbb{E}[\hat{\mathbf{w}}] = \mathbf{w}^{*} + \Lambda_n^{-1} \Lambda_0 \left(\boldsymbol{\mu}_0 - \mathbf{w}^{*}\right).$

If the second term in Eqn (6) approaches $\mathbf{0}$ as $n \to \infty$, the estimate $\hat{\mathbf{w}}$ is an asymptotically unbiased estimate of $\mathbf{w}^{*}$. Using standard linear algebra results, we can show that the determinant of the precision matrix grows without bound as the number of samples increases, that is, $|\Lambda_n| \to \infty$, while $\Lambda_0$ stays fixed. Hence the second term in Eqn (6) approaches zero, and therefore $\hat{\mathbf{w}} \to \mathbf{w}^{*}$.

Lemma 2 establishes a desirable property of the estimator, one that in general holds true for Bayes estimators.
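As a quick sanity check on Lemma 2, the snippet below (our illustration, reusing the hypothetical `fit_vi` from above) simulates a crowd with a known $\mathbf{w}^{*}$ and verifies that the posterior mean approaches it as more labels arrive.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 3, 4
w_star = np.array([1.0, -2.0, 0.5])
sigmas = np.array([0.1, 0.3, 0.5, 1.0])        # per-annotator noise std devs

for n in [20, 200, 2000]:
    X = rng.normal(size=(n, d))
    # each annotator labels every instance with his own Gaussian noise
    X_list = [X for _ in range(K)]
    y_list = [X @ w_star + rng.normal(0, s, size=n) for s in sigmas]
    mu, *_ = fit_vi(X_list, y_list, mu0=np.zeros(d), Lambda0=np.eye(d),
                    a0=1.0, b0=1.0)
    print(n, np.linalg.norm(mu - w_star))       # error shrinks with n
```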

Inference:

We now describe an inference scheme to make a prediction about the label of a test data instance. We denote by $\hat{y}$ the predicted label for the test instance $\hat{\mathbf{x}}$. From the Bayesian framework of parameter estimation, the posterior predictive distribution for $\hat{y}$ turns out to be $\hat{y} \sim \mathcal{N}\left(\boldsymbol{\mu}^{\top}\hat{\mathbf{x}},\; \hat{\mathbf{x}}^{\top} \Lambda^{-1} \hat{\mathbf{x}}\right)$. This follows from standard results in [5]. We use this distribution later in scenarios like active learning.
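A minimal sketch of this predictive step, under the same assumed notation (posterior mean `mu` and precision `Lam` from the variational fit):

```python
import numpy as np

def predict(x_hat, mu, Lam):
    """Posterior predictive mean and variance of the label for x_hat."""
    mean = mu @ x_hat
    var = x_hat @ np.linalg.solve(Lam, x_hat)   # x^T Lam^{-1} x
    return mean, var
```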

4 Active Learning for Linear Regression from the Crowd

We now discuss various active learning [31] strategies in our framework. Let $\mathcal{U}$ be the set of unlabeled instances. The goal is to identify an instance, say $\mathbf{x}^{*}$, for which seeking a label and retraining the model with this additional training example will improve the model the most in terms of the generalization error. In the crowdsourcing context, since multiple annotators are involved, we also need to identify the annotator from whom we should obtain the label for $\mathbf{x}^{*}$. The active learning criterion thus involves finding a pair $(\mathbf{x}^{*}, k^{*})$ so that retraining with the new labeled set provides the maximum improvement in the model.

4.1 Instance Selection

To our crowdsourcing model, we now apply two criteria well studied in active learning from a single source. We also show that these two seemingly different criteria embody the same logic.

4.1.1 Minimizing Estimator Error

Minimizing estimator error is a natural criterion for active learning [29]. Let $\hat{\mathbf{w}}_{\text{new}}$ denote the estimator obtained after a pair $(\mathbf{x}, k)$ is chosen, the label of $\mathbf{x}$ is procured from annotator $k$, and the model is retrained; its error is $\mathbb{E}\,\|\hat{\mathbf{w}}_{\text{new}} - \mathbf{w}\|^2$. The error in the estimator $\hat{\mathbf{w}}$ before including the instance in the training set is $\mathbb{E}\,\|\hat{\mathbf{w}} - \mathbf{w}\|^2$.

Lemma 3

The relation between the errors in $\hat{\mathbf{w}}_{\text{new}}$ and $\hat{\mathbf{w}}$ is given by

(7)  $\mathbb{E}\,\|\hat{\mathbf{w}}_{\text{new}} - \mathbf{w}\|^2 \;\geq\; \frac{\mathbb{E}\,\|\hat{\mathbf{w}} - \mathbf{w}\|^2}{\left(1 + \gamma_k\, \mathbf{x}^{\top} \Lambda^{-1} \mathbf{x}\right)^2}.$
Proof

We first compute the new precision matrix $\Lambda_{\text{new}}$:

(8)  $\Lambda_{\text{new}} = \Lambda + \gamma_k\, \mathbf{x}\mathbf{x}^{\top}.$

Making the necessary substitutions and rearranging the terms, and then subtracting $\mathbf{w}$ from both sides, yields $\hat{\mathbf{w}}_{\text{new}} - \mathbf{w} = \Lambda_{\text{new}}^{-1} \Lambda\, (\hat{\mathbf{w}} - \mathbf{w})$ plus a zero-mean noise term. We now bound $\|\hat{\mathbf{w}}_{\text{new}} - \mathbf{w}\|$ in terms of the old error via the matrix $M = \Lambda_{\text{new}}^{-1} \Lambda = I - \frac{\gamma_k\, \Lambda^{-1}\mathbf{x}\mathbf{x}^{\top}}{1 + \gamma_k\, \mathbf{x}^{\top}\Lambda^{-1}\mathbf{x}}$ (by the Sherman-Morrison identity). Since $\mathbf{x}\mathbf{x}^{\top}$ is a rank-one matrix, $M$ has $d - 1$ eigenvalues equal to $1$ and one eigenvalue equal to $\frac{1}{1 + \gamma_k\, \mathbf{x}^{\top}\Lambda^{-1}\mathbf{x}}$. Note that $\mathbf{x}^{\top}\Lambda^{-1}\mathbf{x} > 0$ since $\Lambda^{-1}$ is a positive definite matrix. Therefore the spectral norm of $M$ is $1$, its minimum eigenvalue is $\frac{1}{1 + \gamma_k\, \mathbf{x}^{\top}\Lambda^{-1}\mathbf{x}}$, and we arrive at the error bound.

From Lemma 3, it is clear that to reduce the value of the lower bound, we must pick a pair $(\mathbf{x}, k)$ for which the score $\gamma_k\, \mathbf{x}^{\top} \Lambda^{-1} \mathbf{x}$ is maximum.

4.1.2 Minimizing Estimator’s Entropy

This is another natural criterion for active learning, which suggests that the entropy of the estimator after adding an example should decrease [20, 21]. Formally, let $H$ and $H_{\text{new}}$ denote the entropies of the estimator before and after adding an example, respectively. Again, let us assume the $\gamma_k$'s are known for the time being. The entropy of the Gaussian posterior before adding an example satisfies $H = \frac{1}{2}\ln |\Lambda^{-1}| + \text{const}$. After adding the example $(\mathbf{x}, k)$, the entropy behaves as follows:

(9)  $H_{\text{new}} = H - \frac{1}{2}\ln\left(1 + \gamma_k\, \mathbf{x}^{\top} \Lambda^{-1} \mathbf{x}\right),$

which follows from $|\Lambda_{\text{new}}| = |\Lambda + \gamma_k\, \mathbf{x}\mathbf{x}^{\top}| = |\Lambda|\left(1 + \gamma_k\, \mathbf{x}^{\top}\Lambda^{-1}\mathbf{x}\right)$.

From (9), we would like to choose an instance and an annotator that jointly maximize $\gamma_k\, \mathbf{x}^{\top} \Lambda^{-1} \mathbf{x}$, so that $H_{\text{new}}$, the estimator's entropy, is minimized. Recall that the same selection strategy was obtained while using the minimum estimator error criterion. Moreover, since the score factorizes into a term depending only on the instance and a term depending only on the annotator, the pair $\left(\arg\max_{\mathbf{x} \in \mathcal{U}} \mathbf{x}^{\top}\Lambda^{-1}\mathbf{x},\; \arg\max_{k} \gamma_k\right)$ maximizes the joint score.

We observe that the selection of the best instance and the best annotator can thus be decoupled. That is, we can first select an instance $\mathbf{x}$ for which $\mathbf{x}^{\top}\Lambda^{-1}\mathbf{x}$ is maximum and independently select an annotator $k$ for whom $\gamma_k$ is maximum. But this scheme of annotator selection may lead to starvation of the best annotators if the annotators have not been explored sufficiently. Hence we use this strategy only for selecting an instance and not for selecting the annotator.
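The decoupled instance-selection rule is a one-liner in code; the sketch below (our naming) scores every unlabeled instance with $\mathbf{x}^{\top}\Lambda^{-1}\mathbf{x}$ under the posterior precision from the variational fit.

```python
import numpy as np

def select_instance(U, Lam):
    """Pick the row of the unlabeled pool U (n x d) maximizing x^T Lam^{-1} x."""
    scores = np.einsum('ij,ij->i', U, np.linalg.solve(Lam, U.T).T)
    return int(np.argmax(scores))
```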

4.2 Selection of an Annotator

Having chosen the instance $\mathbf{x}^{*}$, the learner must next decide which annotator should label it. Consider any arbitrary sequential selection algorithm A for the annotators. If the variances of the annotators' labels were known upfront, the best strategy would be to always select the annotator introducing the minimum variance $\sigma^{*2} = \min_k \sigma_k^2$, where $\sigma_k^2 = \gamma_k^{-1}$. The variances of the annotators' labels are unknown, however, and hence a sequential selection algorithm A incurs a regret, defined by Regret-Seq(A, t) below. We denote the sub-optimality of annotator $k$ by $\Delta_k = \sigma_k^2 - \sigma^{*2}$.

Definition 1

Regret-Seq(A, t): If $T_k(t)$ is the number of times annotator $k$ is selected in $t$ runs of A, the expected regret of A in $t$ runs, with respect to the choice of annotator, is computed as $\text{Regret-Seq}(A, t) = \sum_{k=1}^{K} \Delta_k\, \mathbb{E}[T_k(t)]$.

The problem is to formally establish an annotator selection strategy that yields as low a regret as possible. The main challenge is that the annotators' noise levels are unknown and must be estimated while simultaneously deciding on the selection strategy. We observe connections between this problem and the multi-armed bandit (MAB) problem. In a MAB problem, there are $K$ arms, each producing rewards from a fixed distribution with unknown mean $\mu_i$. The goal is to maximise the overall reward, and for this, at every time step a decision has to be made as to which arm must be pulled. We denote the sub-optimality of arm $i$ by $\Delta_i = \mu^{*} - \mu_i$, where $\mu^{*} = \max_i \mu_i$.

Definition 2

Regret-MAB(M, t): If $T_i(t)$ is the number of times arm $i$ is selected in $t$ runs of a MAB algorithm M, the expected regret of M in $t$ runs, Regret-MAB(M, t), is computed as $\text{Regret-MAB}(M, t) = \sum_{i=1}^{K} \Delta_i\, \mathbb{E}[T_i(t)]$.

We now show that the active learning problem in crowdsourcing regression tasks can be mapped to the MAB problem. We know that $\mathbb{E}[\epsilon_k^2] = \sigma_k^2$. Since we are interested in the annotator introducing the minimum variance, we can work with a MAB framework where the rewards of the arms (annotators, in our case) are drawn from the distribution of $-\epsilon_k^2$. This idea was used in [22] in the context of sequential selection from a pool of Monte Carlo estimators. If the selection strategy appeals to a MAB algorithm M defined on the distributions of $-\epsilon_k^2$, then Regret-Seq(A, t) is the same as Regret-MAB(M, t), as proved in [22]. This implies that, for the selection strategy, we could work with any standard MAB algorithm, such as UCB, on the distribution of $-\epsilon_k^2$.
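For concreteness, here is how a bandit reward can be formed from an annotator's label in our setting. Since $\mathbf{w}$ is unknown, the residual is taken against the current estimate $\hat{\mathbf{w}}$; this is our sketch, and the paper formalizes the estimate in Eqn (15) below.

```python
def bandit_reward(y_label, x, w_hat):
    """Negative squared residual: higher reward = lower-variance annotator."""
    return -float((y_label - w_hat @ x) ** 2)
```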

4.2.1 UCB Algorithm on $-\epsilon_k^2$

As mentioned, we can work with MAB algorithms on $-\epsilon_k^2$, for which we look at the widely used UCB family of MAB algorithms. A UCB algorithm is an index-based scheme which, at time instant $t$, selects an arm that has the maximum value of the sum of the estimated mean ($\hat{\mu}_i$) and a carefully designed confidence interval $c_i(t)$, chosen to provide the desired guarantees. To design the UCB confidence interval, a fairly general class of algorithms called $\psi$-UCB [7] can be used. The procedure for applying $\psi$-UCB to a random variable $X$ with some arbitrary distribution involves choosing a convex function $\psi$ such that $\psi(\lambda) \geq \ln \mathbb{E}\left[e^{\lambda(X - \mathbb{E}[X])}\right]$ for all $\lambda$. An application of Chernoff bounds then gives the confidence interval. In particular, when $X$ satisfies the sub-Gaussian property, the choice of $\psi$ is easy. In our setting, we will see that $\psi$-UCB is inapplicable.

Lemma 4

Inapplicability of $\psi$-UCB: Let the random variables $\epsilon_k$ follow a zero-mean normal distribution for $k = 1, \ldots, K$. The distribution of $\epsilon_k^2$ is sub-exponential, which is heavy-tailed. For a MAB framework where the rewards of the arms are sampled from the distributions of $-\epsilon_k^2$, $\psi$-UCB is not applicable.

Proof

A random variable $X$ with mean $\mu = \mathbb{E}[X]$ is sub-exponential if $\mathbb{E}\left[e^{\lambda(X - \mu)}\right] \leq e^{\nu^2 \lambda^2 / 2}$ for all $|\lambda| \leq 1/b$, for some constants $\nu, b > 0$. We first prove that the random variable $X = \epsilon^2$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$, is sub-exponential.

(10)  $\mathbb{E}\left[e^{\lambda(X - \sigma^2)}\right] = e^{-\lambda \sigma^2}\, \mathbb{E}\left[e^{\lambda \epsilon^2}\right]$
(11)  $\mathbb{E}\left[e^{\lambda \epsilon^2}\right] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{\lambda x^2}\, e^{-x^2/(2\sigma^2)}\, dx$
(12)  $= \frac{1}{\sqrt{1 - 2\lambda\sigma^2}}, \qquad \lambda < \frac{1}{2\sigma^2}$
(13)  $\Rightarrow\; \mathbb{E}\left[e^{\lambda(X - \sigma^2)}\right] = \frac{e^{-\lambda\sigma^2}}{\sqrt{1 - 2\lambda\sigma^2}}$
(14)  $\leq\; e^{2\sigma^4 \lambda^2} \qquad \text{for } |\lambda| \leq \frac{1}{4\sigma^2}.$

Setting $\nu = 2\sigma^2$ and $b = 4\sigma^2$ shows that $X$ is sub-exponential. A random variable $X$ is sub-exponential iff $-X$ is sub-exponential; therefore $-\epsilon^2$ is sub-exponential.

Note that the moment generating function in Eqn (13) is infinite for $\lambda \geq \frac{1}{2\sigma^2}$. In order to apply $\psi$-UCB to the MAB framework where the rewards of the arms are sampled from the distributions of $-\epsilon_k^2$, we need a convex function $\psi$ such that $\psi(\lambda) \geq \ln \mathbb{E}\left[e^{\lambda(X - \mathbb{E}[X])}\right]$ for all $\lambda$, along with its Legendre transform $\psi^{*}$, which yields the confidence interval. Since the log moment generating function is not even defined for $\lambda \geq \frac{1}{2\sigma^2}$, the required function cannot be computed. Therefore $\psi$-UCB cannot be applied to this framework.

In our setting, $\epsilon_k$ follows a normal distribution, so $\epsilon_k^2$ has a sub-exponential distribution, which is heavy-tailed. Therefore, from Lemma 4, an upper confidence interval cannot be obtained using $\psi$-UCB.
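The closed form in Eqn (13) makes the failure easy to see numerically. The short check below (our illustration) evaluates the MGF of $\epsilon^2 - \sigma^2$ as $\lambda$ approaches $1/(2\sigma^2)$ and watches it blow up.

```python
import numpy as np

sigma2 = 1.0
# critical point is lambda = 1 / (2 * sigma^2) = 0.5 here
for lam in [0.25, 0.45, 0.49, 0.499]:
    mgf = np.exp(-lam * sigma2) / np.sqrt(1 - 2 * lam * sigma2)
    print(f"lambda={lam}: MGF={mgf:.2f}")        # diverges as lam -> 0.5
```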

4.2.2 Robust-UCB with Truncated Empirical Mean

To devise upper confidence intervals for heavy-tailed distributions, Robust UCB [8] prescribes working with 'robust' estimators such as the truncated empirical mean, where samples that lie beyond a carefully chosen range are discarded. The necessary condition for applying Robust UCB is that the reward distribution of each arm should have finite moments of order $1 + \varepsilon$ for some $\varepsilon \in (0, 1]$. Since the distribution of $\epsilon_k^2$ has finite variance, Robust UCB with the truncated empirical mean can be used by setting $\varepsilon = 1$. The truncated empirical mean is computed from the samples whose absolute values do not exceed the truncation level:

(15)  $\hat{\mu}_k = \frac{1}{n_k} \sum_{s=1}^{n_k} r_{ks}\, \mathbb{1}\left\{ |r_{ks}| \leq \sqrt{\frac{u\, s}{\ln \delta^{-1}}} \right\}, \qquad r_{ks} = -\left(y_{ks} - \hat{\mathbf{w}}^{\top} \mathbf{x}_s\right)^2,$

where $n_k$ is the number of samples obtained from annotator $k$ so far, $y_{ks}$ is the label provided by annotator $k$ for the $s$-th instance $\mathbf{x}_s$ assigned to him, and $\hat{\mathbf{w}}$ is the estimator of $\mathbf{w}$ obtained from the variational inference algorithm. In Eqn (15), $\delta$ is the desired confidence on the deviation of $\hat{\mu}_k$ from $\mu_k = \mathbb{E}[-\epsilon_k^2]$, and $u$ is an upper bound on the second moment $\mathbb{E}[r_{ks}^2]$. From Lemma 2, $\hat{\mathbf{w}}$ is an asymptotically unbiased estimate of $\mathbf{w}$, and hence we use $\hat{\mathbf{w}}$ instead of $\mathbf{w}$. The parameter $\delta$ can be tuned appropriately to get tight bounds on the regret. We now describe the algorithm.
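A minimal sketch of Eqn (15) and the corresponding Robust UCB index, assuming $\varepsilon = 1$ and our own function names; the confidence radius follows the truncated-mean deviation bound in the form used in Eqn (16) below.

```python
import numpy as np

def truncated_mean(rewards, u, delta):
    """Truncated empirical mean of Eqn (15): sample s is kept only if
    |r_s| <= sqrt(u * s / log(1/delta)); discarded samples contribute zero."""
    r = np.asarray(rewards, dtype=float)
    s = np.arange(1, len(r) + 1)
    keep = np.abs(r) <= np.sqrt(u * s / np.log(1.0 / delta))
    return np.sum(r[keep]) / len(r)

def robust_ucb_index(rewards, u, delta):
    """Index = truncated mean + confidence radius 4*sqrt(u*log(1/delta)/n)."""
    n = len(rewards)
    return truncated_mean(rewards, u, delta) + 4 * np.sqrt(u * np.log(1.0 / delta) / n)
```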
 

Input: No. of annotators $K$, unlabeled set $\mathcal{U}$, labeled set $\mathcal{D}$, $\delta$, $u$, priors $(\boldsymbol{\mu}_0, \Lambda_0)$ and $(a_0, b_0)$ for $k = 1, \ldots, K$
1: Set $\hat{\mathbf{w}}$ using the variational inference procedure described earlier; $t \leftarrow 0$;
2: Set $\hat{\mu}_k$ for all the annotators using Eqn (15);
3: while (the learner has budget and the model has not attained the desired RMSE) do
4:   Choose an instance $\mathbf{x}^{*} = \arg\max_{\mathbf{x} \in \mathcal{U}} \mathbf{x}^{\top}\Lambda^{-1}\mathbf{x}$;
5:   Get a label $y$ from an annotator $k^{*}$ such that $k^{*} = \arg\max_k \left[\hat{\mu}_k + c_k(t)\right]$, where $c_k(t)$ is the Robust UCB confidence interval;
6:   $\mathcal{U} \leftarrow \mathcal{U} \setminus \{\mathbf{x}^{*}\}$; $\mathcal{D} \leftarrow \mathcal{D} \cup \{(\mathbf{x}^{*}, y)\}$; $t \leftarrow t + 1$;
7:   Run the variational inference procedure described earlier and update $\hat{\mathbf{w}}, \Lambda$;
8:   If the new sample $-\left(y - \hat{\mathbf{w}}^{\top}\mathbf{x}^{*}\right)^2$ does not exceed the truncation level in absolute value, update $\hat{\mu}_{k^{*}}$ using Eqn (15);
9: end while
Algorithm 1: Robust UCB for selecting the annotators

     
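Putting the pieces together, the following is a compact end-to-end sketch of the loop in Algorithm 1, reusing the hypothetical `fit_vi`, `select_instance`, and `robust_ucb_index` helpers defined above; `query(k, x)` stands in for actually crowdsourcing a label from annotator k.

```python
import numpy as np

def active_learning_loop(U, X_list, y_list, query, budget,
                         mu0, Lambda0, a0=1.0, b0=1.0, u=10.0, delta=0.1):
    K = len(X_list)
    rewards = [[] for _ in range(K)]
    for t in range(budget):
        mu, Lam, _, _ = fit_vi(X_list, y_list, mu0, Lambda0, a0, b0)
        i = select_instance(U, Lam)                        # step 4
        x = U[i]
        # step 5: Robust UCB over annotators (unqueried ones get priority)
        idx = [robust_ucb_index(r, u, delta) if r else np.inf for r in rewards]
        k = int(np.argmax(idx))
        y = query(k, x)                                    # crowdsource a label
        X_list[k] = np.vstack([X_list[k], x])              # step 6
        y_list[k] = np.append(y_list[k], y)
        U = np.delete(U, i, axis=0)
        rewards[k].append(-(y - mu @ x) ** 2)              # step 8 sample
    return fit_vi(X_list, y_list, mu0, Lambda0, a0, b0)
```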

Theorem 4.1

Regret-Seq(A, t) $= O(\ln t)$.

Proof

We first prove that, with probability at least $1 - \delta$,

(16)  $|\hat{\mu}_k - \mu_k| \;\leq\; 4\sqrt{\frac{u \ln \delta^{-1}}{n_k}}.$

Let $B_s = \sqrt{\frac{u\, s}{\ln \delta^{-1}}}$ denote the truncation level for the $s$-th sample, and let the random variable $X_s = r_{ks}\, \mathbb{1}\{|r_{ks}| \leq B_s\}$. As mentioned earlier, $\mathbb{E}[r_{ks}^2] \leq u$. Note that

(17)  $\left|\mathbb{E}\left[r_{ks}\, \mathbb{1}\{|r_{ks}| > B_s\}\right]\right| \;\leq\; \sqrt{\mathbb{E}[r_{ks}^2]}\; \sqrt{\mathbb{P}\left(|r_{ks}| > B_s\right)}$
(18)  $\leq\; \sqrt{u} \cdot \frac{\sqrt{u}}{B_s} \;=\; \frac{u}{B_s}.$

Equation (17) arises due to Holder's inequality, and Eqn (18) follows from Chebyshev's inequality together with $\mathbb{E}[r_{ks}^2] \leq u$. Further, with probability at least $1 - \delta$,

(19)  $|\hat{\mu}_k - \mu_k| \;\leq\; \frac{1}{n_k} \sum_{s=1}^{n_k} \frac{u}{B_s} \;+\; \sqrt{\frac{2u \ln \delta^{-1}}{n_k}} \;+\; \frac{B_{n_k} \ln \delta^{-1}}{3\, n_k}.$

The first term in Eqn (19) arises as a consequence of Eqn (18), and the remaining terms arise as a result of Bernstein's inequality, with some simplification. Further algebraic simplification of Eqn (19) gives Eqn (16).
For a MAB algorithm M using $\hat{\mu}_k$ as an estimator for $\mu_k$, the regret satisfies the following bound when $\delta = t^{-2}$, where $t$ is the total time horizon of plays of the MAB algorithm:

(20)  $\text{Regret-MAB}(M, t) \;\leq\; \sum_{k : \Delta_k > 0} \left( \frac{32\, u \ln t}{\Delta_k} + 5\, \Delta_k \right).$

The proof of Eqn (20) involves bounding the number of trials in which a sub-optimal arm is pulled, similar to the technique in [2, 8]. A pull of a sub-optimal arm indicates that one of the following three events occurs: (1) the mean corresponding to the best arm is underestimated; (2) the mean corresponding to a sub-optimal arm is overestimated; (3) the mean corresponding to the sub-optimal arm is close to that of the optimal arm. We bound the probability of each of the three events and use the union bound to get the final result; Eqn (16) is used to get the bounds for events (1) and (2). Finally, Regret-Seq(A, t) = Regret-MAB(M, t) from [22], which completes the proof.

Theorem 4.2

The expected number of samples discarded by the Robust UCB algorithm in $t$ trials of the algorithm is $O(\ln t)$.

Proof

As per the Robust UCB algorithm, at the $s$-th time instant, the probability of the random variable $r_{ks}$ exceeding the truncation level $B_s$ in absolute value satisfies, by Chebyshev's inequality,

$\mathbb{P}\left(|r_{ks}| > B_s\right) \;\leq\; \frac{\mathbb{E}[r_{ks}^2]}{B_s^2} \;\leq\; \frac{u \ln \delta^{-1}}{u\, s} \;=\; \frac{\ln \delta^{-1}}{s}.$

The expected number of samples discarded up to time $t$ is therefore at most $\sum_{s=1}^{t} \frac{\ln \delta^{-1}}{s} = O\left(\ln \delta^{-1}\, \ln t\right)$, which is logarithmic in $t$ for fixed $\delta$.
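A quick empirical check of Theorem 4.2 (our illustration, with arbitrary constants): simulate one annotator's squared-noise rewards and count how many fall outside the truncation level of Eqn (15).

```python
import numpy as np

rng = np.random.default_rng(1)
u, delta, sigma = 10.0, 0.1, 1.0
for t in [100, 1000, 10000]:
    s = np.arange(1, t + 1)
    r = rng.normal(0, sigma, size=t) ** 2           # squared label noise
    discarded = np.sum(r > np.sqrt(u * s / np.log(1 / delta)))
    print(t, discarded)                              # grows roughly like log t
```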

    5 The Case of Strategic Annotators

    Till now, we have inherently assumed that annotators are non-strategic. Now we look at the scenario where an annotator who has been allocated an instance is strategic about how much effort to put in. For this, we assume that, for each annotator , the precision introduced while labeling an instance is proportional to the effort put in by annotator . We now refer to the effort as for simplicity. It is best for the learning algorithm when the annotator puts in as much effort (high ) as possible thereby reducing the variance in the labeled data. A given level of effort incurs a cost to the annotator . We assume that is a non-negative strictly increasing function of