Crowdsourcing has been used for a wide variety of tasks, from image labeling to language transcription and translation. Many complex jobs can be decomposed into small micro-tasks [23, 3, 29, 5]. After such decomposition, the primary challenge becomes ensuring that independent individual judgments result in accurate global answers. Approaches ranging aggregation via majority vote  to programmatic filtering via gold-standard questions 
have all been created to achieve this goal. Further improvements have led to more intelligent aggregation such as expectation maximization (EM)[7, 38, 37]. However, EM may still fall short, especially on hard problems where individual judgments are unreliable. Indeed, some researchers have concluded that crowdsourcing is incapable of achieving perfect accuracy .
Yet recently, argumentation has been shown to be an effective way to improve the accuracy of both individual and aggregate judgments. For example, Drapeau et al.’s MicroTalk  used a pipelined approach of: 1) asking crowdworkers to assess a question’s answer, 2) prompting them to justify their reasoning, 3) showing them counterarguments written by other workers, and 4) allowing them to reconsider their original answers to improve individual judgments. In principle, this simplified form of argumentation allows a single dissident worker, through force of reason, to steer others to the right answer. Furthermore, the authors showed that argumentation was compatible with EM; combining the two methods resulted in substantial gains in accuracy.
However, while asynchronous argumentation systems like MicroTalk attempt to resolve disagreement, the steering power of a one-round debate is limited. Workers are only shown a pre-collected justification for an opposing answer; they aren’t challenged by a specific and personalized argument against the flaws in their original reasoning. There is also no back-and-forth interaction that could illuminate subtle aspects of a problem or resolve a worker’s misconceptions — something which may only become apparent after several turns of discussion. Furthermore, since justifications are pre-collected, workers need to write a generic counter argument; while this works for binary answer tasks, it is completely impractical for tasks with many answers; such a counter-argument would typically be prohibitively long, refuting alternatives.
This paper presents Cicero, a new workflow that engages workers in multi-turn and contextual argumentation to improve crowd accuracy on difficult tasks. Cicero selects workers with opposing answers to questions and pairs them into a discussion session using a chat-style interface, in which they can respond to each other’s reasoning and debate the best answer. During these exchanges, workers are able to write context-dependent counter-arguments addressing their partner’s specific claims, cite rules from the training materials to support their answers, point out oversights of other workers, and resolve misconceptions about the rules and task which can impact their future performance on the task. As a result of these effects, workers are more likely to converge to correct answers, improving individual accuracy. Our experiments on two difficult text based task domains, relation extraction and a word association task, show that contextual multi-turn discussion yields vastly improved worker accuracy compared to traditional argumentation workflows.
In summary, we make the following contributions:
We propose Cicero, a novel workflow that induces multi-turn and contextual argumentation, facilitating focused discussions about the answers to objective questions.
We introduce a new type of worker training to ensure that workers understand the process of argumentation (in addition to the task itself) and produce high quality arguments.
We develop Cicero-Sync, a synchronous implementation of our workflow using real-time crowdsourcing, and apply it to conduct the following experiments:
In the Relation Extraction domain introduced by MicroTalk , we show that contextual, multi-turn argumentation results in significantly higher improvement for individual worker accuracy over existing one-shot argumentation: a 16.7 percentage point improvement v.s 6.8.
Using a version of the Codenames domain , that has many answer choices (making MicroTalk’s non-contextual argumentation unworkable), we show that Cicero is quite effective, improving individual worker accuracy from 66.7% to a near-perfect 98.8%.
We qualitatively analyze the discussion transcripts produced from our experiments with Cicero-Sync, identifying several characteristics present in contextual, multi-turn argumentation.
2 Previous Work
Quality control has been a central concern in crowdsourcing, and space precludes a complete discussion of post-hoc methods such as majority vote , gated instruction , and programmatic generation of gold-standard questions . Similarly, many different approaches have been suggested to improve accuracy by assigning certain question to specific workers [17, 6, 16].
Expectation maximization [7, 38, 37] is especially popular, but all methods embody greedy optimization and hence are prone to local maxima. If the data set includes challenging problems, where a significant majority of workers gets the answer wrong, EM will likely converge to the incorrect answer.
Other researchers have investigated methods to handle cases where the majority may be wrong, e.g., tournament voting  and Bayesian truth serum . Unfortunately, these methods are unlikely to work for difficult questions where individual answers require detailed analysis.
2.1 Rationales & Feedback Can Improve Accuracy
Dow et al. dow-cscw12 conduct experiments demonstrating that timely, task-specific feedback helps crowd workers learn, persevere, and produce better results. Wiebe et al. wiebe-acl99 go a step further (with small-group, in-person studies), showing the benefits of getting annotators to reconsider their positions and discuss them with other workers. ConsiderIt  takes this kind of principled debate online and into the political spectrum, using pro/con points and encouraging participants to restate alternative positions to help voters make informed choices. Recently, Kobayashi et al.  have shown that self-correction in the form of reviewing other workers’ answers is effective in getting workers to reconsider and correct their own.
Many of these papers inspired the MicroTalk system, which is the primary inspiration for this work . MicroTalk combines three microtasks — assess, justify & reconsider — asking crowd workers to assess a question, prompting them to justify their reasoning, confronting them with another worker’s counterargument, and then encouraging them to reconsider their original decision. Our work builds on MicroTalk in several ways: 1) we support contextual communication that allows participants to respond to specific points made by their partner, and 2) we support multi-turn dialog that allows workers to “dive deep” if necessary to resolve the disagreement.
Liu et al. Zhu:2014:RVD:2531602.2531718 also noted that workers who review others’ work perform better on subsequent tasks, which inspired us to also examine effects of multi-turn argumentation on workers’ future tasks.
2.2 Real-Time Crowdsourcing
In order to facilitate synchronous and real-time dialog, our implementation, Cicero-Sync, has to connect workers in real-time. There are significant logistical challenges with recruiting a real-time crowd , especially in the context of our experiments, where we wished to have several dozen workers working simultaneously. Fortunately, there have been enough real-time, crowd deployments (e.g., [3, 4]) that many useful lessons have been distilled .
2.3 Instructions for the Task & the Meta-Task
It’s generally agreed that good instructions are essential for high inter-annotator agreement . We were frequently reminded of this as we iterated on our task design. Perhaps we should not have been surprised to discover that good instructions were also necessary for our ‘meta-task’ — arguing about the task. Argument forms and norms that contribute to positive discussion have been long-studied in the education community, termed ‘accountable talk’ . Our training for each experiment domain instructs workers on some of these guidelines, e.g. asking workers to rate justifications to ensure that participants understand accepted standards of reasoning, arguments that emphasize logical connections, and the ability to draw a reasonable conclusion.
3 Cicero Design
We designed the Cicero workflow to address issues in existing crowdsourced argumentation systems by using contextual, multi-turn discussions to address drawbacks in one-shot reconsider systems.
In this section, we present the Cicero workflow as well as design considerations in a synchronous implementation of the workflow used for our experiments. We first explain the rationale for contextual, multi-turn discussions and give an overview of our Cicero workflow. We then talk about the decision to implement our workflow in a synchronous system—Cicero-Sync. Finally, we discuss the design choices we made to (1) create an interface for effective real-time discussion, as well as (2) improve instructions and training for the domains we examined.
3.1 Contextual and Multi-Turn Discussion
In natural forms of debate, participants who disagree take turns presenting arguments which can refute or supplement prior arguments. Our Cicero workflow is designed around the concept of emulating this process in a crowd work setting by using paired discussions facilitated by a dynamic matching system. Participants are matched with partners based on their current beliefs and are encouraged to present their arguments over multiple turns.
While real-life debates may include multiple participants each responsible for addressing arguments on different aspects of a problem, in the crowd setting we can utilize the diversity of workers to cover a broad set of views and reasoning; thus, to simplify the process, we focus on a two-participant discussion model.
3.2 Workflow Overview
Since argumentation happens on an ad-hoc basis, it’s much more flexible to have our workflow focus on managing transitions between different states a worker may be in instead of defining a single pipeline. Due to this, our design of the Cicero workflow follows an event-based definition model where the task assigner allocates tasks as workers’ state changes. Figure 2 summarizes how our workflow allocates worker resources and questions in a dynamic way.
Initially, workers are recruited from a crowd work platform (such as Amazon Mechanical Turk) and are immediately assigned to a training task. Workers who pass training and the associated gating tests  enter the worker pool and wait to be assigned work. Then, instead of a fixed workflow, our event-based task allocator decides which type of task and question to assign to a worker subject to a set of constraints. As workers complete their tasks and update the beliefs of questions in the working set, new candidate tasks are dynamically selected and allocated.
In Cicero, there are two main types of tasks that the assigner may assign to an idle worker: assess and discussion.
The assess task acquires one worker () from the worker pool who is then presented with one question () — in our case a single question in the domain — that asks for an answer to a multiple choice question () and optionally a free-form justification for their position (). This task is a combination of the assess and justify microtasks in MicroTalk  as a single task.
The discussion task acquires two workers () from the worker pool who are both shown a discussion interface for a question (). At the end of a discussion, may be updated for . We will cover details on the design of the discussion task in later sections.
The task allocator manages which type of task can be allocated when a worker changes their state and, depending on domain, can be adjusted to prioritize specific kinds of tasks, particular questions or qualities such as minimizing worker wait time and increasing concurrent work.
In general, the task assigned can be adapted to the goals of the requester. However, there are a few general constraints that the task assigner must follow:
Incompatible beliefs: A discussion may only be assigned to workers () if they have incompatible beliefs (). Implicitly, this also requires existence of the both beliefs, implying they must have been collected.
No repeated discussions: Two workers () may only discuss question if they have never discussed question with each other before ().
These constraints guarantee that the workflow will eventually terminate when there are no more workers who disagree and have never paired with each other. It’s important to note that requesters can set up the task allocator to terminate the workflow sooner if, for example, thresholds for agreement on questions are reached.
3.3 Cicero-Sync: A Real-Time Implementation
While the Cicero workflow does not constrain the type of interaction during a discussion task, we decided to test out the effectiveness of our workflow using synchronous discussions where both workers are simultaneously online and engaged in a chat-like discussion environment.
In this implementation of Cicero, Cicero-Sync, once workers are matched into a discussion, they will not be assigned other tasks for the duration of that discussion and are expected to give each other their undivided attention. A synchronous and real-time discussion task allows us to maintain a continuous dialogue spanning many turns while preserving discussion context in a simple and natural way. However, we came to learn that systems relying on synchronous, real-time worker interactions have some disadvantages: The synchronous nature of discussions means that some workers will have to wait for a partner to become available and workers need to be online and active within the same time window, both of which imply a higher cost to the requester.
Additionally, there are many practical challenges to implementing and setting up synchronous real-time experiments with crowd workers, including implementing real-time client-server communication and working with APIs for worker recruitment and payment. We elected to build Cicero-Sync on top of the TurkServer  toolkit. Various tools built into TurkServer simplify the interfacing with Amazon Mechanical Turk for worker recruitment and task management and allow us to automatically track worker state as well as building a “worker pool” through the lobby. These allowed us to quickly design and prototype Cicero-Sync, which builds upon TurkServer’s lobby-assigner-experiment model. Our training, assess, and discussion tasks types in Cicero-Sync map experiment instances in TurkServer.
3.4 Discussion Interface
The discussion task is the most important and defining task of the Cicero workflow. We considered multiple different options for the discussion interface focusing on ways to organize discussion structure and facilitate discoverability.
Early proposals included designs that were inspired by the posts-and-replies interfaces in social network timelines and the split-view pros-and-cons interfaces used in ConsiderIt, a political, argumentation system . Our pilot studies showed that these methods were cumbersome and non-intuitive, so we decided on a free-form instant messaging (chat) metaphor for the discussion task; shown in Figure 1.
When a pair of workers enter a discussion, they are placed into a familiar instant messaging setting, where they can freely send and receive messages. An additional “exit” section below the chat interface allows either participant to terminate the discussion if they feel that it is no longer useful. Workers can utilize this exit mechanism to indicate that a consensus was reached or that no agreement is possible between the pair. The discussion interface can be easily adapted to specific needs of each experiment domain: In the Relation Extraction domain, the justifications collected from earlier assess or discussion tasks are used to seed the system, which we found to be beneficial in starting a conversation. In the Codenames domain, a drop-down menu below the text input field accommodates switching to alternate answers during the discussion addressing the non-binary nature of the questions.
We found that workers required minimal training to understand the discussion interface and were quickly able to effectively participate in argumentation with their partners in both domains.
3.5 Optimizing Task Instructions
We observed in early pilot experiments that arguments which refer explicitly to parts of task guidelines were more effective at convincing a partner. However, our original task guidelines and training did nothing to encourage this practice. Workers came up with different ways to refer to parts of the instructions or training examples, but this was inconsistent and frequently caused confusion. References to the guidelines were hard to identify making it harder for workers to determine correct invocations of rules in the Relation Extraction domain pilots. Since arguing in synchronous discussion sessions is time-sensitive, creating rules and shorthands that are easy to cite is important for discussion efficiency.
We adjusted the task guidelines for the Relation Extraction task from those in MicroTalk, re-organizing them into five concrete and easy-to-cite rules as shown in Figure 3. Each rule was given a shorthand so that workers can unambiguously refer to a specific rule and aid in identification of proper or improper rule usage during the discussions. We observed that citing behavior became more consistent within discussions with workers frequently utilizing our shorthands.
In the Codenames domain, which has simple instructions but a lot of example cases, we designed the instructions to both show the general guidelines and also provide a way for workers to review examples from training if they decide to reference them.
3.6 Selecting and Training Effective Workers
In initial pilots with Cicero-Sync, we provided minimal training to workers. New workers were presented with task descriptions, instructions and one sample problem which could be attempted multiple times. After several pilot experiments, we noticed that workers were performing inconsistently. Many workers’ discussions were ineffective, with one or both showing lack of basic understanding of the instructions. Inspection of timing data also showed that many workers were spending significantly less time in training and assess tasks without putting the necessary thought into learning the rules.
One method for improving worker quality, suggested by Drapeau et al. , is to select for “discerning workers” by finding workers who write justifications that are more complex than those of the median worker e.g., using Flesch-Kincaid  readability grade levels on gold standard questions. Following Drapeau et al., we tried filtering for “discerning workers” using the Flesh-Kincaid score but our pilot experiments on the Relation Extraction domain showed limited effect and significant reductions in worker pool size. Filtering of workers based only on gold standard question performance was also ineffective as it did not train workers to understand the rules required for our complex tasks.
Since our tasks require worker training, we opted to implement a gating process , that can both train and select workers at the same time. Workers are presented with questions laid out in a quiz-like format. Each training question is provided along an introduction of related concepts from the task instructions. The questions are interleaved with the instructions so that new questions are presented as new concepts are introduced to reinforce the understanding. Feedback is given when a workers selects an answer. At the end, workers’ performance on a set of quiz questions is recorded. If a worker’s accuracy on the quiz falls below a certain threshold, the worker will be asked to retry the training section (a limited number of times) with the order of the quiz questions randomized. Workers are dismissed if they exceed the retry limit and still do not meet the passing threshold.
After the improved training and gating, we observed no significant correlation between workers’ initial accuracy and the “discerning worker” measure in the Relation Extraction domain.
3.7 Selecting and Training Effective Argue-ers
In addition to gating for the task instructions, we also designed a novel justification training task incorporated as a part of the training process with the goal being to train the workers on recognizing good arguments and justifications. This justification training task is presented in the form of an assess task followed by providing feedback to workers explaining good arguments and justifications based on their assessment.
In the Relation Extraction domain, workers are asked to identify the best justification in a list after providing an assessment. Incorrect options aim to address potential mistakes a worker may make in writing a good justification, such as: failure to cite rules, incomplete or incorrect references to the rules, or making extended and inappropriate inferences. Workers are given feedback explaining why their selected justification was incorrect or correct with both the good and bad parts of the justification explained.
Since questions in the Codenames domain have ten or more possible answers, it’s not practical to collect justifications (which would have to rule out alternatives) at assess time. Therefore, justification training is adjusted to instead show a context-dependent counter-argument when a worker selects an incorrect answer refuting the incorrect choice and supporting the correct one. By training workers to recognize and analyze arguments, the goal of justification training is to promote more critical discussion.
3.8 Worker Retention and Real-Time Quality Control
Due to the synchronous nature of discussions in Cicero-Sync, workers may become idle for short periods of time when they are waiting to be matched to a partner. In these circumstances, workers are kept in the worker pool in the form of a lobby. Our lobby design was mainly inspired by both the default lobby provided in TurkServer  and from a worker-progress feedback design developed by Huang et al.  for low-latency crowdsourcing. While in the lobby, workers are presented with information on their peers’ current status, such as how many workers are currently online and which workers may become available soon. Workers also see statistics on their work, which is tied to bonus payments, and are encouraged to wait. In Cicero-Sync, the task assigner is configured to immediately assign work as it becomes available, but we also allow a worker to voluntarily exit after they have waited for a period without matches or completed a sufficiently large amount of tasks.
In addition, while our gating process is designed to select workers serious about the task, we do incorporate several techniques to assure that workers stay active when a task gets assigned to them. Individual tasks, such as assess tasks, impose anti-cheating mechanisms to discourage spammers from quickly progressing. These mechanisms include character and word count minimums and disabling of copy-paste for free-form entries. Workers are also encouraged to peer-regulate during discussion — participants can indicate a partner’s inactivity upon ending a discussion with no agreement. Paired with corresponding payout incentives, these methods ensure that most workers stay active throughout the duration of an experiment.
We deployed our experiments on our synchronous implementation, Cicero-Sync, to address the following questions: 1) Does multi-turn discussion improve individual accuracy more compared to existing one-shot reconsider based workflows? 2) Is multi-turn discussion effective in cases where acquiring justifications to implement one-shot argumentation (reconsider) is impractical? 3) Do discussions exhibit multi-turn and contextual properties?
We selected two domains to evaluate the research questions above: a traditional NLP binary answer task, Relation Extraction, for comparing against one-shot argumentation and a multi choice answer task, inspired by the word relation game Codenames, to evaluate Cicero in a non-binary choice domain.
In the following sections, we first introduce the experiment setup and configuration, then we introduce each domain and present our results for experiments on that domain. At the end, we present a qualitative analysis of discussion characteristics and explore whether discussions can improve future accuracy.
4.1 Experiment Setup
Cicero’s design enables interleaved assignment of different task types (assessments or discussions) for individual workers. This can be beneficial in reducing worker waiting overhead by assigning individual tasks when paired tasks are not available. However, in order to evaluate the effects of contextual, multi-turn argumentation under a controlled setting, we need to isolate the process of assessment and argumentation. For our experiments, we implemented a “blocking” task assigner that avoids interleaved concurrent tasks and is designed to assign the same type of task to a worker until they have answered all questions of that type.
The blocking assigner includes a few extra constraints in addition to those required by the workflow:
Gold Standard Assessments: The task assigner assigns assess tasks for gold standard questions to evaluate quality of workers who passed the training and gating quiz phase. Workers are assigned these questions before any other questions. No discussions are ever initiated for these questions; they let us control for worker quality and filter workers that do not pass the gating threshold.
Greedy Matching: The task assigner tries to assign a discussion as soon as such a task is available. In the case of multiple candidates, the task assigner picks one randomly.
Additionally, the blocking assigner doesn’t allocate any discussions until a worker has finished Assess-ing all questions. This allows us to collect the initial answers of a worker before they participate in any argumentation.
We adjusted Cicero-Sync to include these experimental constraints and collected multi-turn, contextual arguments. The system used in experiments consists of three distinct stages: Training, Assess and Discussion. Workers progress through each stage sequentially. Within the same stage, workers will be allocated tasks on demand by the blocking assigner. This system represents the discussion condition.
In addition to Cicero-Sync, we also implemented the adaptive workflow described in MicroTalk  using our platform to compare with one-shot argumentation. We reproduced the reconsider task interface from MicroTalk which replaces our discussion task and created a specialized assigner that allocates tasks following the adaptive algorithm described in MicroTalk. In this system, workers will be adaptively asked to justify or do reconsider tasks depending on their initial answer. When a worker is the only worker with a particular answer for a question, they will be asked to provide a justification for their answer. Reconsider tasks are only assigned to a worker if there is a previously justified answer that their current answer disagrees with. That justification will be the one shown to the worker. This system represents the reconsider condition.
4.2 Recruiting and Incentives
We ran experiments on Amazon Mechanical Turk, using workers who had completed at least 100 tasks with a 95% acceptance rate for both of our experiment domains.
Within each domain, we calibrated our subtask payments by observing the average worker time for that subtask from a pilot run and allocating an approximately $7 hourly wage. Our training bonus of $1.00 for successfully completing training and the gating quiz is also calibrated using the average time it takes workers to complete the training session.
For the Relation Extraction domain, workers are paid $0.10 as base payment and $1.00 for training. Workers are paid a per-question bonus of $0.05 for an assessment, $0.05 for a justification, and — depending on their condition — a bonus of $0.50 for a discussion or $0.05 for a reconsider task. Since we always collect a justification for each question in the discussion condition, workers in that condition are always given the full $0.10 per-question bonus. Per-question incentives are chosen to match those used in MicroTalk .
For the Codenames domain, workers are paid $0.20 as base and $1.00 for training. Workers are paid a per-question bonus of $0.20 for each correct answer and a per-discussion bonus of $0.50 for participating in a discussion with an extra $0.25 for holding the correct answer at the end of that.
While it is possible to design a more complex incentive structure, our main goal for this set of incentives is to discourage cheating behavior and align with that of MicroTalk. We think these incentives are consistent with those used in other, recent crowdsourcing research .
4.3 Relation Extraction Domain: Binary Answer
In the interest of comparing to previous work, we evaluated our method on a tradition NLP annotation task of information extraction (IE) — identifying structured, semantic information (relational tuples, such as would be found in a SQL database) from natural language text . The task is of considerable interest in the NLP community, since most IE approaches use machine learning and many exploit crowdsourced training data [41, 31, 2, 24].
Specifically, we consider the problem of annotating a sentence to indicate whether it encodes the TAC KBP LivedIn relation — does a sentence support the conclusion that a person lived in a location? While such a judgment may seem simple, the official LDC annotation guidelines are deceptively complex . For example, one can conclude that a national official lives in her country, but not that a city official lives in her city. Figure 3 defines the task, showing the instructions given to our workers.
We created a set of 23 challenging TAC KBP questions drawing from the 20 used in MicroTalk  and adding 3 additional questions from Liu et al. . This set was then divided into 3 batches of size 7, 8, and 8 for our discussion experiments. For gold standard questions, we selected 3 simple questions from the TAC KBP set, each of which can be resolved with an invocation of one rule. Upon recruitment, each worker is also presented with a 6 question gating quiz and are allowed 2 attempts to pass the gating threshold. Gating questions were written to be simple and unambiguous, testing whether the worker was diligent and had absorbed the guidelines.
4.4 Multi-turn vs. One-shot Workflows
Our first experiment compares worker accuracy for the multi-turn, contextual discussion workflow design against that of an one-shot (non-contextual) reconsider workflow on the binary answer Relation Extraction domain (i.e., Cicero vs. MicroTalk). We deployed both conditions with the systems described in the experiment setup. Since workers need to complete all assessments before starting discussions which would cause increased waiting time on a large set of questions, we deployed the discussion condition experiments in 3 batches (N=9, 16, 13) corresponding to the 3 batches the experiment questions were divided into. In the reconsider condition (N = 12), workers were put through our implementation of the adaptive workflow from MicroTalk on all questions.
In both conditions, the gating threshold was set at — workers needed to answer all gold standard questions correctly to be included. From the plot shown in Figure 4 we can see that the discussion condition (Cicero) improves average worker accuracy by percentage points over the initial accuracy compared to for the reconsider condition (statistically significant at ).
We performed a t-test on the initial accuracy of workers across both conditions for each batch and found no statistically significant difference () indicating that workers of similar quality were recruited for each of our batches. On average, workers participated in discussions () and were presented with reconsider prompts () in the one-shot workflow.
We do note that discussions are more costly, largely due to paying workers for time spent waiting for their partner to respond. Each Cicero-Sync discussion took an average of seconds () of worker time compared to a one-shot reconsider task averaging seconds (). We believe that an asynchronous implementation of Cicero could reduce overhead and dramatically lower costs.
4.5 Codenames Domain: Multiple Choice with Many Answers
|Candidates||business, card, knot|
|Positive Clues||suit, tie|
|Negative Clues||corporation, speed|
Previous work using one-shot argumentation focused mainly on evaluating argumentation in domains that only acquired binary answers such as Relation Extraction. However, we observed that this is not sufficient to represent a wide variety of real world tasks. In a binary answer setting, workers are able to write justifications arguing against the single opposing belief. As the number of answer options grows, it becomes increasingly inefficient and even infeasible to collect justifications as a part of each worker’s assessment.
Effective justifications for multiple choice answers need to address not only the selected answer, but also argue against remaining options, making them long and difficult to understand. Multi-turn discussion can address these scaling issues through back-and-forth dialog through which workers can argue against their partner’s specific choice.
We created a synthetic task inspired by the popular word association Codenames board game as a multiple choice answer domain. The Codenames game has been adopted as the main task in other cooperative work designs such as DreamTeam , which utilized a cooperative version of the game. The objective in the game is for each team to identify the tiles assigned to them from a shared list of word tiles. Clue words are given by one team member (the “spymaster”) who can see the assignment of word tiles (which ones belong to which team) while other teammates have to find the correct word tiles for their team while avoiding the tiles assigned to the other team.
Our Codenames task domain draws inspiration from the competitive aspect of the game. We observe that late into the game, good word guesses are often informed by both the teammate clues (which should be matched) and opponent clues (which should be avoided). With this observation, we created tasks which consist of a list of candidate words, several positive and several negative clue words. Workers, in the role of a team member, are instructed to find the single best candidate word that is related in meaning to some positive clue word but none of the negative clues. An example of this task can be seen in Table 1. Each question contains around 2 positive clues, 2-3 negative clues and 7-10 candidate words. We created 3 gating questions, 7 experiment questions and 1 individual assessment question for this task. We used a gating threshold of . While Codenames is not a typical task for crowd work, as also noted in DreamTeam, we think its aspect of multiple choice answers is representative of a whole class of similar tasks that lack effective one-shot argumentation strategies.
The loose definition of words being “related” in Codenames domain reduces the amount of worker training required for participation since it utilizes common knowledge of language. However, this may lead to ambiguity in reference answers which would be undesirable. We elected to manually create a set of questions which were validated to have only 1 objectively best answer. The distractors for each question and our reference argument were evaluated with a group of expert pilot testers. We confirmed that all participants agreed with our reference counter-arguments against the distractors and also with our reference answer. In the pilot test, we also noted that this task can be very challenging even for experts as multiple word senses are involved in distractors.
4.6 Evaluating on Multiple Choice Tasks with Many Answers
Our second experiment examines the performance of Cicero-Sync on multiple choice answer tasks from the Codenames domain, a domain that would be very inefficient for one-shot argumentation (justifications would need to address up to 9 alternatives). While initial worker accuracy is only , Cicero-Sync achieves a final average worker accuracy of , a significant improvement ().
We performed the ANOVA omnibus test with a mixed effects model using worker initial accuracy as a random effect and found that the experiment stage (initial v.s. final) is statistically significant at (). The average duration of each discussion was seconds () and workers had an average of discussions () in the Codenames domain.
4.7 Discussion Characteristics
We can see from the previous experiments that multi-turn, contextual argumentation is effective at improving worker accuracy across a variety of tasks, but are the discussions actually taking advantage of multi-turn arguments and the context being available? To answer this question, we collected and analyzed the transcripts recorded for each domain: Relation Extraction and Codenames.
We computed statistics on multi-turn engagement by analyzing the number of worker initiated messages – each of which is considered a turn. We found that in the Relation Extraction domain, discussions averaged turns (, median of ) while in Codenames discussions averaged turns (, median of ). We also found that in Codenames, the number of turns correlates to convergence on the correct answer () while we found no significant relation between turns and convergence () in the Relation Extraction domain. We note that in Relation Extraction, discussions are seeded with workers’ justifications from the assess task (equivalent to 2 non-contextual turns, not counted in the average) whereas discussions in the Codenames domain use actual contextual turns to communicate this information. Codenames discussions also utilized extra turns to argue about alternative choices.
Additionally, we noticed several patterns in the discussion text that appeared in both domains. We further examined these patterns by coding the the discussion transcripts (147 from Relation Extraction and 38 from Codenames). We surveyed the discussions looking only at patterns specific to argumentation and came up with 8 patterns related to argumentation techniques and 6 reasons workers changed their answer.
We then narrowed down the argumentation patterns by removing any that were highly correlated or any that had just 1-2 examples and finalized the following 4 prominent patterns as codes:
Refute: Argue by directly giving a reason for why the partner’s specific answer is believed to be incorrect. Examples: “Small [partner choice] is the opposite of large [negative clue] and will not work”; “Louisana isn’t a country, therefore NonCountry applies.”
Query: Ask the partner to explain their answer, a part of their answer or ask for a clarification in their explanation. Examples: “Why do you think it should be bill?”; “How would bridge work?”
Counter: Pose a counter-argument to a partner in response to their explanation. Example: A: “Erdogan’s government is nationally affiliated with Turkey.” B: “[…] The sentence could be interpreted as one of Turkey’s allies is helping them with the EU thing.”
Previous: Explicitly state that knowledge/line of reasoning acquired from a previous discussion is being used. Example: “I had window at first too, but someone else had bridge, but they thought bridge because of the card game bridge, and that made sense to me”;
We found that workers used these contextual patterns frequently during their discussions for both domains with and of all discussions utilizing at least one pattern in the Relation Extraction and Codenames domains respectively. We can also see that distribution of patterns across the two domains (Table 2) on discussions converging to the correct answer indicates that the utility of each pattern may be different in different domains. We hypothesize that the higher frequency of Counter and lower frequency of Query in Relation Extraction is likely due to the justification seeding which reduced need for workers to ask for explanations but encouraged more counter-arguments.
We also condensed the reasons for workers changing their answer down to 3 basic categories: learning about the task (rules), agreeing on meaning of concepts in a question, and being convinced by an argument. After coding the discussions, we found that the distribution of the reason for changing answers was for Relation Extraction domain and for Codenames, across each category (task, question, convinced) respectively showing that discussions could help workers understand the task.
We also observed that of all discussions and of discussions converging to the right answer used our rule shorthands when referring to the rules instead of describing them. However, we note that simply citing shorthands doesn’t correlate with convergence of a discussion ().
4.8 Do Workers Learn Through Discussion?
While we didn’t design discussions to be used as a way of training workers, many reported that they “understood the task much better” after discussions in pilot experiment feedback so we explored the effects of discussions on workers’ future accuracy. We tested a worker’s performance by adding post-test questions after they finished their corresponding experiment condition. We selected questions for the Relation Extraction domain and for the Codenames domain, all of comparable difficulty to the main questions, to be individually evaluated.
Average accuracy on the individual evaluation sections trended higher for the discussion condition: accuracies were , , and for the baseline (no argumentation), reconsider and discussion conditions respectively in the Relation Extraction domain and and for the baseline and discussion conditions in the Codenames domain. However, ANOVA on all conditions for each domain shows no statistically significant interaction ( and for Codenames and Relation Extraction respectively) between the experiment condition and the accuracy on the individual evaluation questions. We conjecture that need for argumentation may be reduced as workers better learn the guidelines through peer interaction [21, 10], but the difficult questions will likely always warrant some debate.
In this paper we explored the potential for multi-turn, contextual argumentation as a next step for improving crowdsourcing accuracy. Multi-turn argumentation systems provide some distinct benefits over one-shot reconsider systems. Since workers can respond directly to each other’s arguments, it is much more likely that a correct, minority opinion may convince the majority without the need to recruit a huge number of workers. This power does come with caveats, noted earlier, such as higher costs and the need for training workers how to recognize and make good arguments (in addition to training them on the base task). Our results suggest that workers can be trained to discuss successfully.
While each discussion task in Cicero-Sync required more worker time, we found significantly higher gains to individual worker accuracy compared to the reconsider condition from MicroTalk. We believe that much of the increase in work time stems from our decision to use synchronous, real-time crowdsourcing in Cicero-Sync, leading to higher per-argument-task costs. Under a synchronous environment, workers must wait for other workers’ actions during and in-between discussions. Since our experiments are focused on evaluating the multi-turn argumentation workflow, synchronized discussions allowed us to better collect data in a controlled way. Many efficiency optimizations, that we did not explore, could be implemented to run the Cicero workflow at scale in a more cost effective way. Specifically, an asynchronous implementation of Cicero would eliminate the need for workers to wait for each other, reducing costs. Alternatively, if a synchronous implementation were run at larger scale on a much larger set of problems, there would be proportionately less overhead.
Argumentation (both contextual and one-shot) is likely unnecessary for many crowdsourcing tasks. For example, if one is merely labeling training data for supervised machine learning (a common application), then it may be more cost effective to eschew most forms of quality control (majority vote, EM or argumentation) and instead collect a larger amount of noisy data . However, if one needs data of the highest possible accuracy, then argumentation — specifically contextual, multi-turn argumentation — is the best option. Cicero may cost more than other approaches, but as shown in Figure 6, it achieves higher accuracy than any other approach, regardless of cost. Furthermore, the Cicero workflow can handle questions with many possible answers such as those in Codenames. Single-shot argumentation systems, such as MicroTalk, aren’t practical in these situations, because of the need to pre-collect arguments against so many alternative answers.
In the end, the most cost effective crowd technique depends on both problem difficulty and quality requirements.
High cost methods, like argumentation, should be reserved for the most difficult tasks, such as developing challenging machine learning test sets, or tasks comprising a high-stakes decision, where a corresponding explanation is desirable.
6 Conclusion & Future Work
In this paper, we presented Cicero, a novel workflow that engages workers in multi-turn, contextual argumentation (discussion) to improve crowd accuracy on difficult tasks. We implemented this workflow using a synchronous, real-time design for discussions tasks and created the Cicero-Sync system. Since the quality of a discussion depends on its participants, we also designed and implemented gated instructions and a novel justification training task for Cicero-Sync to ensure competent discussions through improving workers’ ability to recognize and synthesize good arguments.
We demonstrate that our implementation of Cicero-Sync, the synchronous version of the Cicero workflow, is able to achieve
Higher improvement of individual worker accuracy compared to a state-of-art, one-shot argumentation system on a difficult NLP annotation task: 16.7 vs. 6.8 percentage points improvement, at a higher cost, and
Very high accuracy in a non-binary choice answer task that would be impractical with one-shot argumentation: 98.8% accuracy (a 32.1 percentage point improvement over the initial accuracy)
Both these accuracies are much higher than can be achieved without argumentation. Traditional majority vote and expectation-maximization without argumentation approaches plateau at 65% on similar questions . Additionally, we observed several interesting patterns of discourse that are enabled by multi-turn, contextual argumentation and note that many successful discussions utilize these patterns.
There are many future directions for improving the argumentation workflow and system implementation. Currently, the cost of argumentation is still relatively high but cost may be reduced further. It’s common for people to context switch between several ongoing discussions as well as have group discussions with many people, both of which could be interesting modifications to the workflow. We also envision that better models of discussions could allow a system to only pair arguments where the outcome reduces uncertainty.
Furthermore, there is potential in incorporating natural language processing techniques to identify and support positive behavior patterns during argumentation and opportunities for learning from misconceptions surfaced during discussion to improve training and task instructions.
We would like to thank Eunice Jun, Gagan Bansal and Tongshuang Wu for their helpful feedback and participation in pilot experiments. This work was supported in part by NSF grant IIS-1420667, ONR grants N00014-15-1-2774 and N00014-18-1-2193, the WRF/Cable Professorship and support from Google.
-  Gabor Angeli, Julie Tibshirani, Jean Wu, and Christopher D. Manning. 2014. Combining Distant and Partial Supervision for Relation Extraction. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, Alessandro Moschitti, Bo Pang, and Walter Daelemans (Eds.). ACL, 1556–1567. http://aclweb.org/anthology/D/D14/D14-1164.pdf
-  Michael S Bernstein, Greg Little, Robert C Miller, Björn Hartmann, Mark S Ackerman, David R Karger, David Crowell, and Katrina Panovich. 2010. Soylent: a word processor with a crowd inside. In UIST ’10 Proceedings of the 23nd annual ACM symposium on User interface software and technology. ACM Press, 313–322.
-  J. Bigham, C. Jayant, H. Ji, G. Little, , A. Miller, R. Miller, A. Tatarowicz, B. White, S. White, and T. Yeh. 2010. VizWiz: Nearly Real-Time Answers to Visual Questions. In UIST.
-  L. Chilton, G. Little, D. Edge, D. Weld, and J. Landay. 2013. Cascade: Crowdsourcing Taxonomy Creation. In CHI ’13. ACM Press, New York, NY, USA.
-  Peng Dai, Christopher H. Lin, Mausam, and Daniel S. Weld. 2013. POMDP-based control of workflows for crowdsourcing. Artificial Intelligence 202 (2013), 52–85.
A.P. Dawid and A. M. Skene. 1979.
Maximum Likelihood Estimation of Observer Error-rates using the EM Algorithm.Applied Statistics 28, 1 (1979), 20–28.
-  Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. 2012. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-scale Entity Linking. In Proceedings of the 21st International Conference on World Wide Web (WWW ’12). ACM, New York, NY, USA, 469–478. DOI:http://dx.doi.org/10.1145/2187836.2187900
-  Jeff Donahue and Kristen Grauman. 2011. Annotator rationales for visual recognition. In ICCV 2011. DOI:http://dx.doi.org/10.1109/ICCV.2011.6126394
-  Steven Dow, Anand Pramod Kulkarni, Scott R. Klemmer, and Björn Hartmann. 2012. Shepherding the crowd yields better work. In CSCW.
-  Ryan Drapeau, Lydia B. Chilton, Jonathan Bragg, and Daniel S. Weld. 2016. MicroTalk: Using Argumentation to Improve Crowdsourcing Accuracy.
-  Ralph Grishman. 1997. Information extraction: Techniques and challenges. In Information extraction a multidisciplinary approach to an emerging information technology. Springer, 10–27.
-  Ting-Hao Kenneth Huang and Jeffrey P Bigham. 2017. A 10-Month-Long Deployment Study of On-Demand Recruiting for Low-Latency Crowdsourcing. In Proceedings of The fifth AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2017).
Ting-Hao Kenneth Huang, Walter S Lasecki, Amos Azaria, and Jeffrey P
" Is There Anything Else I Can Help You With?" Challenges in Deploying an On-Demand Crowd-Powered Conversational Agent. InFourth AAAI Conference on Human Computation and Crowdsourcing.
-  R. L. Rogers J. P. Kincaid, R. P. Fishburne Jr and B. S. Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. In Technical report, DTIC Document.
-  Ece Kamar, Severin Hacker, and Eric Horvitz. 2012. Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 467–474.
-  David R. Karger, Sewoong Oh, and Devavrat Shah. 2011. Budget-optimal Crowdsourcing using Low-rank Matrix Approximations. In Conference on Communication, Control, and Computing.
-  Masaki Kobayashi, Hiromi Morita, Masaki Matsubara, Nobuyuki Shimizu, and Atsuyuki Morishima. 2018. An Empirical Study on Short- and Long-Term Effects of Self-Correction in Crowdsourced Microtasks. In HCOMP.
-  Travis Kriplean, Jonathan T. Morgan, Deen Freelon, Alan Borning, and Lance Bennett. 2011. ConsiderIt: improving structured public deliberation. In CHI Extended Abstracts.
-  Anand Kulkarni, Matthew Can, and Björn Hartmann. 2012. Collaboratively crowdsourcing workflows with Turkomatic. In CSCW. ACM Press, New York, New York, USA. DOI:http://dx.doi.org/10.1145/2145204.2145354
-  Chinmay Kulkarni, Koh Pang Wei, Huy Le, Daniel Chia, Kathryn Papadopoulos, Justin Cheng, Daphne Koller, and Scott R Klemmer. 2015. Peer and self assessment in massive online classes. In Design thinking research. Springer, 131–168.
-  Christopher H. Lin, Mausam, and Daniel S. Weld. 2014. To Re(label), or Not To Re(label). In HCOMP.
-  Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. 2009. TurKit: Tools for Iterative Tasks on Mechanical Turk. In Human Computation Workshop (HComp2009).
-  Angli Liu, Stephen Soderland, Jonathan Bragg, Christopher H. Lin, Xiao Ling, and Daniel S. Weld. 2016a. Effective Crowd Annotation for Relation Extraction. In Proceedings of NAACL and HLT 2016.
-  Angli Liu, Stephen Soderland, Jonathan Bragg, Christopher H. Lin, Xiao Ling, and Daniel S. Weld. 2016b. Effective Crowd Annotation for Relation Extraction. In Proceedings of NAACL and HLT 2016.
-  Andrew Mao, Yiling Chen, Krzysztof Z. Gajos, David C. Parkes, Ariel D. Procaccia, and Haoqi Zhang. 2012. TurkServer: Enabling Synchronous and Longitudinal Online Experiments. In The Fourth Workshop on Human Computation (HCOMP 2012).
-  Tyler McDonnell, Matthew Lease, Tamer Elsayad, and Mucahid Kutlu. 2016. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP). 10.
-  Sarah Michaels, Catherine O’Connor, and Lauren B Resnick. 2008. Deliberative discourse idealized and realized: Accountable talk in the classroom and in civic life. Studies in philosophy and education 27, 4 (2008), 283–297.
-  Jon Noronha, Eric Hysen, Haoqi Zhang, and Krzysztof Z Gajos. 2011. Platemate: crowdsourcing nutritional analysis from food photographs. In Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 1–12.
-  David Oleson, Alexander Sorokin, Greg P Laughlin, Vaughn Hester, John Le, and Lukas Biewald. 2011. Programmatic Gold: Targeted and Scalable Quality Assurance in Crowdsourcing.. In Human Computation Workshop. 11.
-  Maria Pershina, Bonan Min, Wei Xu, and Ralph Grishman. 2014. Infusion of labeled data into distant supervision for relation extraction. In Proceedings of ACL.
-  Drazen Prelec and H. Sebastian Seung. 2007. An algorithm that finds truth even if most people are wrong. (2007).
-  Daniela Retelny, Sébastien Robaszkiewicz, Alexandra To, Walter S Lasecki, Jay Patel, Negar Rahmati, Tulsee Doshi, Melissa Valentine, and Michael S Bernstein. 2014. Expert crowdsourcing with flash teams. In Proceedings of the 27th annual ACM symposium on User interface software and technology. ACM, 75–85.
-  Rion Snow, Brendan O’Connor, Daniel Jurafsky, and A. Ng. 2008. Cheap and Fast — But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In EMNLP’08.
-  Yu-An Sun, Christopher R Dance, Shourya Roy, and Greg Little. 2011. How to assure the quality of human computation tasks when majority voting fails. In Workshop on Computational Social Science and the Wisdom of Crowds, NIPS.
-  Mihai Surdeanu. 2013. Overview of the TAC2013 Knowledge Base Population Evaluation: English Slot Filling and Temporal Slot Filling. In TAC 2013.
-  Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. 2010. The Multidimensional Wisdom of Crowds. In NIPS.
-  Jacob Whitehill, Paul Ruvolo, Tingfan Wu, Jacob Bergsma, and Javier Movellan. 2009. Whose Vote Should Count More: Optimal Integration of Labels from Laberlers of Unknown Expertise. In In Proc. of NIPS. 2035–2043.
-  J. Wiebe, R. Bruce, and T. O’Hara. 1999. Development and use of a gold-standard data set for subjectivity classifications. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics. 246–253. http://dl.acm.org/citation.cfm?id=1034678.1034721
-  Omar F Zaidan, Jason Eisner, and Christine D Piatko. 2007. Using “annotator rationales" to improve machine learning for text categorization. In Proceedings of NAACL and HLT 2007.
-  Ce Zhang, Feng Niu, Christopher Ré, and Jude Shavlik. 2012. Big data versus the crowd: Looking for relationships in all the right places. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 825–834.
-  Sharon Zhou, Melissa Valentine, and Michael S. Bernstein. 2018. In Search of the Dream Team: Temporally Constrained Multi-Armed Bandits for Identifying Effective Team Structures.
-  Haiyi Zhu, Steven P. Dow, Robert E. Kraut, and Aniket Kittur. 2014. Reviewing Versus Doing: Learning and Performance in Crowd Assessment. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’14). ACM, New York, NY, USA, 1445–1455. DOI:http://dx.doi.org/10.1145/2531602.2531718