Task control of workflows over micro-task crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), has received significant attention in AI literature [Weld et al.2015]. Typically, a requester needs to balance three competing objectives – (1) total cost, owing to payments made to workers for their responses (or ballots), (2) overall quality, usually evaluated as accuracy of the final output, and (3) the total time for completing the task. These criteria are inter-related: increasing the pay per task attracts more workers to the task, thereby reducing completion time. However, it also exhausts the budget sooner, so requesters can afford fewer ballots per task, likely reducing the overall quality.
Most prior work on crowd controllers has focused on the tradeoff between cost (or no. of ballots) and quality [Dai et al.2013, Lin, Mausam, and Weld2012, Bragg, Mausam, and Weld2013, Kamar et al.2013, Parameswaran et al.2012]
. A common approach is to define a Partially Observable Markov Decision Process (POMDP)per task, which decides on whether to get another ballot or submit the best answer for that task. However, this work is time-agnostic, and assumes that pay per ballot is given as input.
Recent work has also studied the tradeoff between cost and completion time for a batch of tasks [Gao and Parameswaran2014]. They model the problem as a Markov Decision Process (MDP) that changes the pay per ballot, so that all ballots can be obtained by the given deadline in a cost-efficient manner. However, this work assumes that the number of ballots needed to complete the whole batch is a constant known to the requester in advance.
There is limited research on simultaneously addressing tradeoffs between cost, quality and latency. We know of only one work that studies this for the specific workflow for finding max of a set of items [Venetis et al.2012]. This work assumes that latency is pay-independent – an assumption well-known to be incorrect [Faradani, Hartmann, and Ipeirotis2011, Gao and Parameswaran2014]. Under a fixed latency-per-response assumption, they speed up the task and save cost by taking fewer responses. Our three-way optimization is for the broader case of answering a batch of tasks, and uses variable pricing to alter the latency of task completion, in line with crowdsourced marketplace dynamics.
Building upon these strands of research, we present Octopus, an AI agent that can balance all three objectives (cost, quality, time) in concert on real crowdsourced marketplaces, by optimizing a requester-specified, joint utility function for a batch of tasks. It achieves this by controlling both the pay per ballot and the (predicted) accuracy of each individual task.
We could model the whole problem as a single POMDP, however, that is unlikely to scale. An alternative could be to use multi-objective MDPs, but they are also less tractable, because they produce a pareto-optimal set of solution policies [Chatterjee, Majumdar, and Henzinger2006]. Octopus uses a three-component architecture – one to set the pay per ballot (CostSetter), another to choose the next available task (TaskSelector) and a third to control each task’s quality (QualityManager). A key technical novelty is in the careful modeling of the CostSetter’s state space in order to circumvent intractability – the state space contains aggregate statistics regarding completion levels of all tasks, so that it can decide the next best pay to set.
We perform extensive experiments using both simulated and real data, as well as online experiments on AMT. Since no existing system performs direct 3-way optimization in crowdsourced marketplaces, our experiments compare against existing state-of-the art approaches that optimize 2 of the 3 objectives. We find that in most settings, Octopus simultaneously outperforms, or is at par with multiple variants of these baselines. Our contributions are:
We describe Octopus, a novel framework to address cost-quality-time optimization for a batch of tasks in a crowdsourced marketplace setting. It contains three components that set the pay per ballot, select the next task and control each task’s quality. A key technical novelty is the use of aggregate statistics of all tasks in the state space design for the CostSetter, ensuring tractability for real-time deployment.
Octopus consistently performs at par with or better than state-of-the-art baselines, yielding up to 37% reward improvements on real data.
We deploy Octopus on AMT and demonstrate that it is able to optimize utility effectively in a live, online experiment.111Code can be found at https://github.com/krandiash/octopus.
2 Related Work
Cost-Quality Optimization. There is significant work on getting more quality out of a fixed budget. One branch of this research focuses on collective classification, which develops aggregation mechanisms to infer the best output per task, given a static set of ballots [Whitehill et al.2009, Welinder et al.2010, Oleson et al.2011, Welinder and Perona2010]. The other branch studies intelligent control, which dynamically decides whether to ask for a new ballot on a task, or stop and submit the answer. These include control of binary or multiple choice tasks [Dai et al.2013, Parameswaran et al.2012, Kamar et al.2013], multi-label tasks [Bragg, Mausam, and Weld2013], and tasks beyond multiple choice answers [Lin, Mausam, and Weld2012, Dai, Mausam, and Weld2011]. All these works design agents to control a single task and assume a constant pay per ballot. Our work closely follows the POMDP formulation laid down in Dai et al. dai2013pomdp for binary tasks.
Cost-Time Optimization. Increasing pay per ballot can reduce completion times. Faradani et al. faradani2011s develop models to find upfront, the static price per ballot so that a desired deadline can be met. Gao & Parameswaran gao2014finish extend this by varying pay at discrete time-steps using an MDP. Both approaches assume a fixed number of ballots known a-priori, without dynamic quality control of tasks. There is also some work on price-independent latency reduction [Haas et al.2015].
Cost-Quality-Time Optimization. There is limited work in this area. The only paper we are aware of is Venetis et al. venetis2012max, which addresses cost-quality-time optimization but in a restrictive setting with important distinctions from our work: (i) they look at max-finding for a set of items, while our task-type is classification; (ii) they consider latency to be pay-independent and fixed per task, while we study the more realistic setting in which changing pay directly impacts workers’ desire to work on our tasks; (iii) unlike us, they don’t change pay per task directly, and instead, change the number of responses sought per task to control both cost and latency. Qualitatively, our work thus also highlights how workers perceive pay changes in a crowdsourcing marketplace and its overall effect on task completion.
Worker Retention. Previous work deals with incentivizing workers to perform more tasks via bonuses or diversification [Rzeszotarski et al.2013, Ipeirotis and Gabrilovich2014, Difallah et al.2014, Dai et al.2015]. Recently, Kobren et al. kobren2015getting model the process of worker retention. We present empirical results that suggest worker retention plays a dominant role in determining task completion rates.
Task Routing. Prior work deals with two issues; deciding which task from the batch to solve next, or which worker to route a task to. Ambati et al.
ambati2011towards rank tasks based on user preferences using a max-entropy classifier. Other work uses low-rank matrix approximations[Karger, Oh, and Shah2014] for equal difficulty tasks. Rajpal et al. rajpalpomdp decide which worker pool to route a task to. Other papers study task routing on volunteer platforms [Bragg et al.2014, Shahaf and Horvitz2010].
Like AMT, we assume no control on which worker picks a ballot job, but we select the best next task to assign to an incoming worker. Following [Mason and Watts2010, Gao and Parameswaran2014], we assume that worker quality is independent of the pay per ballot. We re-verify this for our data in our experiments.
Decentralized Approaches. There is related work in decentralized Wald stopping problems [Teneketzis and Ho1987] which considers how to optimize a common utility function given a set of agents who each make independent observations. However, these approaches do not scale well with the number of agents (tasks in our setting), which can be quite large. There is also work in decentralized metareasoning [Hansen and Zilberstein2001] to decide when to stop optimizing a utility function. Metareasoning approaches typically assume that utility is monotonically increasing over time, which is not true in our setting since for instance, conflicting ballots on a single task would decrease utility.
3 Problem Definition
A requester provides a batch of binary tasks , each having a response. They also provide a utility function , which describes how to tradeoff cost, time and quality. The agent can dynamically change pay per ballot , and choose a variable number of ballots per task to optimize the final objective. We study the setting where is expressed as a sum of task-level utilities () minus cost, i.e. . Here, is total money spent on . We assume that answers to all tasks are to be returned to the requester as one single batch.
MDP/POMDP background. An MDP models the long-term reward optimization problem under full observability and is defined by a five tuple . Here, is a set of states, a set of actions,
denotes the probability of transitioning to stateafter taking action in state , and maps a state-action pair to a real-valued reward.
is the discount factor for making infinite-horizon MDPs well formed. A POMDP extends an MDP into a partially-observable setting, where the state is not fully observable and only a belief (probability distribution) over possible states can be maintained using observations from the model. A POMDP is represented astuple, where a new function denotes the distribution over observations on taking an action and arriving in a new state . Lack of space precludes a long discussion of the subject – there are existing solvers for solving MDPs and POMDPs of reasonable sizes, e.g. [Smith and Simmons2012], which we use in our work.
To formulate the problem optimally, we would need to define a single, centralized POMDP over a state containing answers and difficulty estimates of all individual tasks as well as the current pay per ballot and the current time. The actions will include requesting a ballot on a task, changing the pay, and a terminal submit action. Solving this POMDP would yield the optimal policy, which would decide which task to get ballots on next, when to change pay and when to submit. Since the number of tasks in a batch can be huge, this POMDP is unlikely to scale due to a large state and action space. Naive extensions to Dai et al.’s or Gao et al.’s state-of-the-art models for cost-quality and cost-time optimization respectively are not possible either – Dai et al.’s model is solved per task, whereas pay must be set based on progress of the whole batch of tasks; Gao et al.’s model assumes a fixed number of ballots per task, and has no natural way to optimize quality by taking a variable number of ballots based on each task’s difficulty.
4 Octopus for Three-Way Optimization
We propose a three-component architecture (see Figure 1). In Octopus each task has its own QualityManager that decides, based on the current pay, whether it is worth taking another ballot for this task (light edge) or not (dark edge). This information is conveyed to the TaskSelector, which selects an available (light) task to route to an incoming worker. Based on the current progress of the whole batch, the CostSetter decides what pay per ballot to set; this action is taken at periodic intervals. We design Octopus such that it can allocate tasks as and when workers arrive and works instantaneously in practice, therefore utilizing crowdsourcing marketplaces with full parallelism.
Background. QualityManagers follow the worker response model and POMDP formulation of Dai et al. dai2013pomdp. Each worker is assumed to have an error parameter ( is error-free). An average worker has error parameter . Each task has an unknown true Boolean answer , and an associated difficulty (
is easy) – these are estimated using an Expectation-Maximization algorithm[Whitehill et al.2009] as data is received. Each task has a prior difficulty distribution . A worker’s ballot for depends on their , and .
In Dai et al. dai2013pomdp the POMDP for maintains a belief state over state tuples. For instance indicates a 40% belief that and the answer to is . The POMDP has two actions; (1) request another ballot, or (2) mark as completed. The POMDP policy maps every belief state to an action. Each time the POMDP receives another ballot, a Bayesian update is performed to re-estimate a new belief. For Octopus, the POMDP optimizes . We note that depends on the pay per ballot : if is smaller, the POMDP can afford more ballots. Dai et al.’s original model keeps constant, but in our case the CostSetter can change it, triggering a change in .
We define as the confidence in the most probable answer for task ; , where are the current probabilities that ’s answer is 0 or 1 respectively. can be computed from by summing out .
Computation of Aggregate Statistics. We now define two batch-level statistics, which will be a key part of the state space representation for the CostSetter.
We first define a normalized estimate of task quality, ; normalizes so that it lies in . A high value of (near 1) indicates the POMDP’s high confidence in its estimated answer for , and vice versa for a low (near 0) value. We also define a related notion of batch quality, , which is an aggregate statistic estimating the current quality for the entire batch of tasks. Finally, we construct the batch quality histogram – a histogram built by binning tasks into equally sized bins based on their values. The bin width is denoted as .
The CostSetter also needs an estimate of the number of ballots remaining. We define as an estimate of the expected number of ballots that will be needed (starting from the current time) until task will be marked completed. For notational ease we write to denote . Recall that can change with a change in . Hence also depends on .
How can we compute ? We use a trajectory-tree approach (similar to [Kearns, Mansour, and Ng1999]) called FrontierFinding. We construct a binary tree of future observations, rooted at the current time step, where each edge corresponds to an observation (a worker response of ). The node below any edge contains the belief state generated by updating the POMDP, using the observation associated with that edge. Each trajectory is a path from the root to a leaf, and is generated with an associated path probability (using Dai et al.’s generative model assuming an average worker). A leaf is created whenever the policy takes the ‘mark as completed’ action, or when the path probability drops below a threshold. is simply the expected length of a trajectory in this tree.
We also estimate the expected ballots to completion for the batch: . ’s role is similar to that of – it is an aggregate statistic that describes how far the batch is from completion. It also helps in quantifying the expected cost of completion: if and , we would expect to spend units of money to complete the batch.
In summary, we described the design of a per-task QualityManager. Collectively, of these help us in estimating two aggregated quantities, and , which measure the overall quality, and degree of completion of the batch, respectively. All notation for this and future sections is summarized in Table 1.
The TaskSelector decides which incomplete task to assign to the next incoming worker. It must have an ‘anytime’ behavior, i.e. it must increase utility quickly. This is because the time of final submission is not in its control, and the batch might be submitted at any time by the CostSetter.
To be prepared for any contingency, the TaskSelector uses a 1-step greedy policy over expected utility gain. We define each task’s priority () as the difference between the current utility () of and the expected utility after receiving 1 ballot () from an average worker (error rate ), given the current belief state of ’s QualityManager. Thus, . TaskSelector assigns the task with the maximum value to the next available worker.
Unfortunately, (and therefore ) is neither monotonic (conflicting ballots decrease utility) nor submodular (a skilled worker could arrive after an error-prone one), so we cannot utilize prior work on adaptive submodularity [Golovin and Krause2011, Bragg et al.2014] to guarantee solution quality. Providing quality bounds is left for future work.
Lastly, note that the task allocation process is instantaneous, as well as completely parallelized, since we don’t wait for a task to be returned before allocating another task. Given enough workers, we could get ballots on every single task in parallel. This is important, since it allows us to take full advantage of micro-task marketplaces.
|Overall utility (requester defined function)|
|Pay per ballot|
|Total cost incurred for all tasks|
|Total cost incurred for task|
|Policy for task ’s QualityManager|
|,||Worker error parameter, average error|
|Difficulty of task|
|True binary answer for task|
|Belief state for task|
|Mapping from to for task under|
|Granularity (e.g. )|
The CostSetter is an MDP that changes (pay per ballot) in order to maximize . It uses information about the completion level of each task to assess whether the batch of tasks is completing on schedule or needs to be sped up or slowed down. To influence the rate of completion of the batch, it sets at discrete time steps .
The key challenge for the CostSetter is in defining the state space. Ideally, as stated earlier, each task’s belief should be part of the state, but that would make computations intractable. Instead, we approximate by using aggregate statistics over the whole batch of tasks. We describe the state space, actions, transition functions, and rewards of this MDP below.
State Space. The choice of the best pay per ballot depends on its current value, the current time, and the aggregate degree of completion of the batch. We choose the state to be a 4-tuple . Both and are important for this decision; estimates the expected accuracy on the batch, while gives us an idea of how much more improvement in is possible (at the current ). For a fixed and , a high would indicate the presence of several unsolved tasks and the possibility of improving . On the other hand, a low would indicate that most tasks are solved and there is little improvement possible. therefore captures the spread of the distribution of task qualities , while is the mean of this distribution. We use this intuition later to construct the transition function for the CostSetter.
As another example consider a case where both and are high. This indicates that the QualityManagers consider there be to scope for quality improvement despite the batch quality being high already, possibly due to very low . If we did not have in the state, we would instead base our decision on the high value, and believe that further improvement in utility was not possible.
All state variables are continuous, and for tractability we discretize them. is discretized with a granularity , with a granularity , and with a granularity . We assume that can take values . These values are defined by the requester, and in practice would respect marketplace constraints, such as minimum wages. Interestingly, the model is robust in that if a requester provides a very poor starting wage, workers will likely not pick up the task, and the model will subsequently respond by increasing the wage.
Actions and Rewards. Every state has access to two pay-change actions: and . increases to while does the opposite. The and actions incur no cost to the system. However, in practice we assign a small cost to these actions to prevent frequent cyclical pay-changes in a policy. Changing has no impact on and , but it does change , since the number of ballots remaining per task depends on the current pay per ballot (via the policy ). We discuss how to compute this when defining the transition function.
We also have a no-change action, which increments by , along with asking workers for more ballots at , which remains unchanged (for the next duration). This is essentially a marketplace action, where we post tasks to the marketplace with a pay per ballot equaling . At the end of minutes, we would then arrive in a new state. The cost of this transition (to the nearest value) is just the amount paid to workers during this duration on the marketplace, equaling the number of ballots received during this time, multiplied by .
The final action is a ‘terminate’ action that submits all answers to the requester. Its reward should be based on batch quality and current time (cost is not needed, since that was already accounted in the no-change action). Unfortunately, the MDP has access to only the aggregate statistics, and not the full batch quality histogram, which is needed for computing . We now describe a novel -reconstruction procedure that allows us to extrapolate the full histogram from aggregate statistics, useful for computing this reward as well as the transition function.
-Reconstruction. The goal is to reconstruct an approximate batch quality histogram given the aggregate statistics, and , and the current
. The procedure assumes that the histogram can be approximated with a two parameter Beta distribution,. Also assume that we are provided a function that maps a pair to ; it returns the expected number of ballots needed for a task given its current quality and pay per ballot. Note also that will be a non-increasing function of . We now show that we can find suitable and given , and .
To compute the best fit values, we solve two equations. The equation enforces that the mean of the reconstructed distribution is : . We can reparameterize using ; and write it as . Our task now reduces to finding the best fit . A equation imposes as an expectation over the batch quality distribution: .
Since is computed using a POMDP policy, it will rarely be available in closed form. The integral above is approximated using a numerical algorithm. The best is found via using a linear search, and works instantaneously in practice. Having found a suitable value, we now bin all tasks into the distribution to recover the task quality histogram, as desired.
Finally, we describe the procedure for estimation of . First we calculate the corresponding . Intuitively, assesses the number of ballots taken by the QualityManager when its belief in the current answer is and it has difficulty . However, we haven’t reconstructed – we use the prior distribution as its belief on . To compute we initiate the QualityManager’s POMDP from such a belief state, and compute under policy using FrontierFinding.
Transitions. We now describe the transitions for action in a given state to reach . As we change , the main change is in . We quantify this change by a simple observation: regardless of pay, the number of ballots taken until this point is fixed. Suppose that at we compute the estimated number of ballots for the batch as . If we have taken ballots till now, our current estimate of will be simply . At pay , our next state’s estimate should be . Thus, when changing pay from to we can simply add to compute . A similar analysis works for the action.
No-Change Transitions. The no-change action emulates the setting that the tasks are posted on the platform for time at pay . Let the next state be . Estimation of and requires a model of task completion. Similar to Gao & Parameswaran gao2014finish, we maintain a pay-dependent ballot completion model, , as the probability that Octopus will receive ballots in duration at pay . However, different from their work, the probability model combines the effects of worker arrival, retention and time taken per task (and not just arrival). Thus, will reduce by with probability . The cost of the transition will be . Since is discretized upto granularity, the cost will be rounded off to the nearest bucket.
For updating , we -reconstruct the batch quality histogram using the current state. We then bin tasks into this histogram, and simulate the TaskSelector on this reconstructed batch. To do this, we first create a POMDP belief state for each reconstructed task by choosing from the histogram to recover , and using the prior difficulty distribution. We then select a task, simulate a ballot using an average worker () and compute the posterior belief. We continue until all ballots are used up. At the end we compute the based on the updated state of the batch. For robustness, we repeat this entire procedure multiple times and average the values from different runs.
Implementation Details. We construct the whole MDP with all transitions and rewards using simulations and -reconstructions as described above. Since time can be unbounded, we keep a max time when defining the total state space. We also recognize that the transition from to depends on but not on any other part of the state. By caching a table of values once, we can save on a lot of simulations when computing the transition function of the no-change action in various states.
We use Value Iteration to learn the policy with as our start state. Intuitively, we are starting at 0 quality, the maximum possible number of ballots to completion, and the lowest price.
In a real execution environment, it is possible that over time, the CostSetter’s aggregates diverge from the set of QualityManager’s beliefs in the batch. To improve performance, we synchronize the CostSetter’s and to those of the real batch after every time interval. Experimentally Octopus performs well even without synchronization, indicating that the learned transitions are effective.
In summary, we described the CostSetter, an MDP that keeps track of the aggregate statistics and for a batch of tasks, and changes pay to optimize utility . A key contribution is a novel -reconstruction procedure that approximates the batch quality histogram for a state.
We conduct three sets of experiments: (i) experiments on simulated data; (ii) offline evaluation on real data; (iii) live experiments on Amazon MTurk (AMT).
Model Parameters. For experiments, we initialize Octopus with a hard deadline utility function: if ; otherwise, for every incorrect answer and zero for a correct answer. This joint utility combines the utilities from Dai et al.’s and Gao et al.’s models. Here, penalty represents how important quality is to the requester. Notice that since a POMDP doesn’t know whether it is submitting the correct answer, it cannot compute the utility exactly. It uses its belief to estimate expected utility as . This linear utility makes the reward computations for CostSetter’s ‘terminate’ action simple. Given the of the state, the reward is calculated as on termination.
Having a hard deadline makes the CostSetter state space finite, since we only consider states with . We use a time-independent Poisson process for the ballot completion model . These parameters are illustrative – Octopus can accommodate other utilities/distributions.
Baselines. Our main goal is to compare Octopus’s performance with state-of-the-art methods. However, we know of no algorithms that performs direct three-way optimization in a crowdsourced marketplace setting.222We don’t compare to Venetis et al. venetis2012max since they run a tournament for max-finding, different from our task type. They also don’t change pay directly or model latency as done by us. So we compare against state-of-the-art methods that optimize 2 of the 3 objectives, while giving them the added benefit of our TaskSelector. We compare Octopus to Dai et al.’s and Gao et al.’s models, which are related to our work. Code was provided by the authors.
All comparisons are on the requester’s utility function, computed against gold labels with . We normalize so that Octopus always has 1.0 utility, i.e. the performance of baselines is represented as a proportion of Octopus.
Data collection. We collect data on AMT at 6 pay points using a Twitter Sentiment dataset [Sheshadri and Lease2013]. Workers are asked to classify the sentiments of tweets into either positive or negative sentiment. At each price point per ballot), we post 400 tweets, and seek 20 ballots/tweet. A single HIT contains 10 tweets for a worker to solve. 40 tweets are common to all prices for a total of 2200 tweets. Each price is posted on a different weekday at the same time, and remains active for 24 hours. This ensures consistency in data collection across prices and minimizes interaction between different prices.
Figure 2a (task completion rates vs. pay) verifies that higher pricing results in faster task completion: at 0.1 cent, only 5500 ballots are received even after 24 hours, whereas at 0.6 cents, all 8000 ballots are received within 12 hours. We estimate the Poisson parameters using this data.
Worker Retention vs. Arrival. Contrary to prior work [Faradani, Hartmann, and Ipeirotis2011, Gao and Parameswaran2014], we observe that the increase in task completion rate with pay is predominantly due to higher worker retention, rather than a higher rate or number of worker arrivals (possibly due to the large number of HITs that we sought). Figure 3a shows that both retention and task completion rates are highly correlated, doubling as price goes from 0.1 to 0.6 cents. However, Figure 3b shows that the worker arrival rate does not rise much. To the best of our knowledge, there is no prior marketplace model that handles both retention and arrivals.
We also verify that worker quality is independent of pay. We run a K-S test for every pair of costs. We could not reject the null hypothesis (error rates drawn from cost-independent distributions) at(Figure 2b).
5.1 Simulation Experiments
Our main aim through simulations is to assess the quality of -reconstruction. We use a variety of parameter settings and simulate Octopus with ballot arrivals simulated according to a pay-dependent Poisson process, and answers generated based on worker models. Note that this experiment is without any synchronization between the CostSetter state and the real batch. Figure 5a compares the values between the system and the batch (for one such setting333 tasks with , worker errors sampled from , , mins, , .). Even after 10 time intervals, and across multiple cost changes (the sharp drops in the curve), our tracking is extremely accurate. Figure 5b shows that our tracking is extremely effective as well, with very little divergence from the true value. Even when the values do diverge in both cases, they are highly correlated.
High quality tracking of and are essential. If we diverge too far from the real values, we would likely optimize the long-term expected reward poorly. Overall, this reaffirms the hypothesis that the CostSetter is able to capture the global state of the whole batch using just the aggregate statistics.
Our other goal is to compare Octopus’s performance with baselines in simulation, which we do below.
Comparison to Dai. Comparing Octopus with Dai highlights the benefit of changing cost on real-world utility. Ideally, our method should be able to vary pay to match or exceed the utility of the static cost baselines. We run both Octopus and Dai with deadlines ranging from 60 to 360 minutes. We run Dai for different static pays, ranging from 1 to 6 (measured in a tenth of a cent), allowing it to take ballots until the deadline. Statistical significance is indicated on the plots.
Figure 4 shows the comparison. Octopus simultaneously outperforms (or is at par with) all static cost baselines for every deadline, whereas each static cost baseline has a ‘sweet spot’ range of deadlines where it does best. Octopus plans robustly, where despite making pricing decisions 24 times without synchronizing for the hr deadline, it still outperforms Dai. Figure 4b depicts the utility averaged across all deadlines as a function of task-completion rates. As shown, Octopus is robust to different task-completion rates, and continues to outperform Dai on changing them. Figure 4d & 4e explain why Octopus achieves better utility scores than Dai – it maintains very high accuracy while keeping costs reasonable. Qualitatively, for every deadline, we observe that Octopus tends to have an average cost that is close to the best static cost.
Comparison to Gao. Comparing with Gao allows us to delineate the effect of the QualityManagers while optimizing for batch quality using aggregate statistics.
For fairness, we augment Gao’s framework; fixing ballots/task up-front, and using the same worker response model as Octopus for ballot aggregation. Figure 4c demonstrates Octopus’s performance against Gao for different values of . Octopus consistently outperforms Gao for all values of and across all deadlines. Octopus’s performance improves for longer deadlines, due to the higher quality achieved by the QualityManager. For instance, Octopus is around better than Gao for the 6 hr deadline. We see that Gao-3 and Octopus incur nearly the same cost (Figure 4g) but Octopus is 6% better in terms of accuracy (Figure 4f). Examining Figure 6b reveals that this is in part due to Octopus collecting more ballots than Gao-3 by changing pay/ballot more intelligently.
It is interesting to note that increasing does not significantly improve baseline performance. This is due to the fact that the QualityManager only takes more ballots on tasks that really need them. For Gao the extra cost spent on every task is not offset by a corresponding increase in quality, especially for tasks that are very easy (don’t require ballots) or too hard (unsolved even with ballots). In Figure 4f & 4g, we notice that Octopus once again has near-highest accuracy, but spends much less than Gao variants that have higher accuracy than Octopus.
Lastly, Figure 6 demonstrates that Octopus has intuitive behavior – as the deadline length increases, Octopus spends less pay/ballot on average since more time can be taken to finish the batch of tasks.
These experiments demonstrate that Octopus simultaneously achieves higher utility than every variant of baseline methods, and does so without sacrificing accuracy or incurring high costs.
5.2 Offline Experiments on Real Data
We run extensive offline experiments on our collected data against the state-of-the-art baselines described earlier.
When comparing algorithms we sample different execution trajectories from the real data by choosing a random ballot for each task (since quality is cost-independent), while keeping ballot arrivals as per the real data. Multiple trajectories help compute statistical significance over algorithms’ performances. These experiments synchronize the CostSetter’s state with the real batch after every
time. We use an uninformed uniform distribution for the difficulty prior.
Figure 7a shows Octopus when compared to Dai at different static price points, with the -axis being different values. Statistical significance is marked on the plots. Octopus outperforms most Dai costs across all deadlines, with upto increase in real utility. No single static cost is able to match Octopus across all deadlines. This underscores the benefits of changing pay dynamically based on current task completion. In Figure 7d & 7e, we see that Octopus achieves high accuracy at relatively low cost.
Figure 7b compares Octopus with Gao-, with denoting the static number of ballots per task. We find that Octopus outperforms Gao for most deadlines. Further analysis reveals that our batch exhibits a bi-modal task difficulty distribution – no algorithm exceeds around accuracy, while getting is easy with 1 ballot per task (in Figure 7f, we see a sharp jump in accuracy only when the deadline is long enough to take a lot of ballots). For shorter deadlines, where solving difficult tasks is infeasible, Gao outperforms Octopus slightly, since Octopus optimizes with respect to a uniform prior. For longer deadlines, estimates of task difficulties are refined by the QualityManagers, and Octopus gives large gains.
Lastly, Octopus outperforms TaskSelector baselines where Greedy task selection is replaced by a random policy (Random), or a single round robin followed by random selection (Random-Robin), by large margins (Figure 7c). Note also that the baselines start to converge over time, as the advantage of doing intelligent task selection diminishes when nearly all tasks are run till completion.
Overall, we find that Octopus learns robust policies, consistently outperforming all baselines.
5.3 Live Online Experiments
Lastly, we deploy Octopus on AMT to test performance in a dynamic, online setting, as well as gain qualitative insight into worker behavior. In this, we keep exactly 3 HITS (of 10 tasks each) on AMT at a time. If a worker accepts a HIT, another one is posted immediately – this enables greedy task routing while ensuring full power of worker parallelism. After every mins, all available HITs are taken down and reposted with the new price output by Octopus. Octopus’s policies are learned using task completion rates estimated from real data earlier. At runtime, querying Octopus is instantaneous. We solve a batch of 500 tweets for 3 deadlines – 1, 2, and 4 hours. We compare against Gao-, the best baseline in offline expts in Table 2.
For the 1 hr deadline, Octopus maintains pay at 0.5 cent/ballot, before decreasing it to 0.1 cent/ballot in the last 15 minutes. This allows Octopus to receive around 1 ballot/task; more ballots stagnate for short deadlines due to conflicting workers, and Octopus prefers to save money to optimize utility. On the other hand, Gao maintains pay at 0.6 cent throughout to also ensure it receives 1 ballot/task, but is unlucky in the responses it receives. In terms of decision making, the superiority of Octopus is clear, since it recognizes the danger posed by disagreement on task quality estimates.
In the 2 hr deadline, Octopus once again increases pay initially, but gets more ballot arrivals than expected. In response, after 45 minutes Octopus decreases pay so that it can take a higher number of ballots for difficult tasks, getting a substantial accuracy and utility improvement over Gao.
For the 4 hr deadline, Octopus is aggressive in trying to solve all tasks till completion. Due to the bi-modal difficulty of the tasks, workers provide several conflicting ballots on harder tasks, which the TaskSelector prefers to re-route for utility gain. The overall accuracy increases marginally over the 2 hour deadline.
Qualitatively, workers respond as predicted – flocking to the tasks when pay was set at 0.5 cent or more, and staying away at very low pay. Workers respond naturally to the price changing algorithm; dropping out immediately if the pay is suddenly lowered, and coming back if it is increased once again. No worker complained about the fluctuating pay.
In this work, we focused on experiments with Octopus in a fixed time deadline setting to compare with past work. Extension to the popular fixed budget setting (where is constrained) is simple: (i) never synchronize so that its value in any state is simply the start value of (which is fixed and known) minus the ballots received, making it a proxy for cost incurred so far; (ii) modify the CostSetter’s reward to give a reward for total cost (now computable using ) exceeding the budget. Another extension involves using a different formulation of the QualityManager; any algorithm that defines an appropriate policy and from which we can compute and is suitable.
We present Octopus, one of the first AI agents for a 3-way optimization of total cost, work quality and completion time in crowdsourcing. The agent combines three different subagents that control quality per task, select the best next task and set pay for the whole batch. A key technical contribution is the computation of aggregate statistics of the quality and completeness of the whole batch – this is used as the state for best setting the next pay.
Octopus outperforms state-of-the-art baselines in a variety of simulated and real world settings, demonstrating the superiority of our approach. We also showcase Octopus’s real world applicability by deploying it directly on AMT. In the future, we hope to develop general purpose formulations of Octopus as a plug-and-play architecture for practitioners.
This work is supported by Google language understanding and knowledge discovery focused research grants, a Bloomberg award, a Microsoft Azure sponsorship, and a Visvesvaraya faculty award by Govt. of India to the third author. We thank Chris Lin, Yihan Gao and Aditya Parameswaran for sharing code, and all the AMT workers who participated in our experiments.
- [Ambati, Vogel, and Carbonell2011] Ambati, V.; Vogel, S.; and Carbonell, J. G. 2011. Towards task recommendation in micro-task markets. In AAAI Workshop on Human Computation.
- [Bragg et al.2014] Bragg, J.; Kolobov, A.; Mausam; and Weld, D. S. 2014. Parallel task routing for crowdsourcing. In HCOMP.
- [Bragg, Mausam, and Weld2013] Bragg, J.; Mausam; and Weld, D. S. 2013. Crowdsourcing multi-label classification for taxonomy creation. In HCOMP.
- [Chatterjee, Majumdar, and Henzinger2006] Chatterjee, K.; Majumdar, R.; and Henzinger, T. A. 2006. Markov decision processes with multiple objectives. In STACS 2006. Springer.
- [Dai et al.2013] Dai, P.; Lin, C. H.; Mausam; and Weld, D. S. 2013. Pomdp-based control of workflows for crowdsourcing. Artif. Intell. 202:52–85.
- [Dai et al.2015] Dai, P.; Rzeszotarski, J. M.; Paritosh, P.; and Chi, E. H. 2015. And now for something completely different: Improving crowdsourcing workflows with micro-diversions. In CSCW. ACM.
- [Dai, Mausam, and Weld2011] Dai, P.; Mausam; and Weld, D. S. 2011. Artificial intelligence for artificial artificial intelligence. In AAAI.
- [Difallah et al.2014] Difallah, D. E.; Catasta, M.; Demartini, G.; and Cudré-Mauroux, P. 2014. Scaling-up the crowd: Micro-task pricing schemes for worker retention and latency improvement. In HCOMP.
- [Faradani, Hartmann, and Ipeirotis2011] Faradani, S.; Hartmann, B.; and Ipeirotis, P. G. 2011. What’s the right price? pricing tasks for finishing on time. In AAAI Workshop on Human Computation.
- [Gao and Parameswaran2014] Gao, Y., and Parameswaran, A. G. 2014. Finish them!: Pricing algorithms for human computation. PVLDB.
Golovin, D., and Krause, A.
Adaptive submodularity: Theory and applications in active learning and stochastic optimization.Journal of Artificial Intelligence Research.
- [Haas et al.2015] Haas, D.; Wang, J.; Wu, E.; and Franklin, M. J. 2015. Clamshell: speeding up crowds for low-latency data labeling. VLDB.
- [Hansen and Zilberstein2001] Hansen, E. A., and Zilberstein, S. 2001. Monitoring and control of anytime algorithms: A dynamic programming approach. Artificial Intelligence 126(1-2):139–157.
- [Ipeirotis and Gabrilovich2014] Ipeirotis, P. G., and Gabrilovich, E. 2014. Quizz: Targeted crowdsourcing with a billion (potential) users. In WWW. ACM.
- [Kamar et al.2013] Kamar, E.; Kapoor, A.; Horvitz, E.; and Redmond, W. 2013. Lifelong learning for acquiring the wisdom of the crowd. In IJCAI. Citeseer.
- [Karger, Oh, and Shah2014] Karger, D. R.; Oh, S.; and Shah, D. 2014. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research.
- [Kearns, Mansour, and Ng1999] Kearns, M. J.; Mansour, Y.; and Ng, A. Y. 1999. Approximate planning in large pomdps via reusable trajectories. Citeseer.
- [Kobren et al.2015] Kobren, A.; Tan, C. H.; Ipeirotis, P.; and Gabrilovich, E. 2015. Getting more for less: optimized crowdsourcing with dynamic tasks and goals. In WWW.
- [Lin, Mausam, and Weld2012] Lin, C. H.; Mausam; and Weld, D. S. 2012. Crowdsourcing control: Moving beyond multiple choice. In UAI.
- [Mason and Watts2010] Mason, W., and Watts, D. J. 2010. Financial incentives and the performance of crowds. ACM SigKDD Explorations Newsletter 11(2):100–108.
- [Oleson et al.2011] Oleson, D.; Sorokin, A.; Laughlin, G. P.; Hester, V.; Le, J.; and Biewald, L. 2011. Programmatic gold: Targeted and scalable quality assurance in crowdsourcing. Human computation.
- [Parameswaran et al.2012] Parameswaran, A. G.; Garcia-Molina, H.; Park, H.; Polyzotis, N.; Ramesh, A.; and Widom, J. 2012. Crowdscreen: algorithms for filtering data with humans. In SIGMOD.
- [Rajpal, Goel, and Mausam2015] Rajpal, S.; Goel, K.; and Mausam. 2015. Pomdp-based worker pool selection for crowdsourcing. CrowdML Workshop, ICML.
- [Rzeszotarski et al.2013] Rzeszotarski, J. M.; Chi, E.; Paritosh, P.; and Dai, P. 2013. Inserting micro-breaks into crowdsourcing workflows. In HCOMP.
- [Shahaf and Horvitz2010] Shahaf, D., and Horvitz, E. 2010. Generalized task markets for human and machine computation. In AAAI.
- [Sheshadri and Lease2013] Sheshadri, A., and Lease, M. 2013. Square: A benchmark for research on computing crowd consensus. In HCOMP.
- [Smith and Simmons2012] Smith, T., and Simmons, R. 2012. Point-based pomdp algorithms: Improved analysis and implementation. arXiv preprint arXiv:1207.1412.
- [Teneketzis and Ho1987] Teneketzis, D., and Ho, Y.-C. 1987. The decentralized wald problem. Information and Computation 73(1):23–44.
- [Venetis et al.2012] Venetis, P.; Garcia-Molina, H.; Huang, K.; and Polyzotis, N. 2012. Max algorithms in crowdsourcing environments. In Proceedings of the 21st international conference on World Wide Web, 989–998. ACM.
- [Weld et al.2015] Weld, D. S.; Mausam; Lin, C. H.; and Bragg, J. 2015. Artificial intelligence and collective intelligence. Handbook of Collective Intelligence.
- [Welinder and Perona2010] Welinder, P., and Perona, P. 2010. Online crowdsourcing: rating annotators and obtaining cost-effective labels.
- [Welinder et al.2010] Welinder, P.; Branson, S.; Perona, P.; and Belongie, S. J. 2010. The multidimensional wisdom of crowds. In NIPS.
- [Whitehill et al.2009] Whitehill, J.; Wu, T.-f.; Bergsma, J.; Movellan, J. R.; and Ruvolo, P. L. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS.