Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants

by Max Bartolo, et al.

In Dynamic Adversarial Data Collection (DADC), human annotators are tasked with finding examples that models struggle to predict correctly. Models trained on DADC-collected training data have been shown to be more robust in adversarial and out-of-domain settings, and are considerably harder for humans to fool. However, DADC is more time-consuming than traditional data collection and thus more costly per example. In this work, we examine if we can maintain the advantages of DADC, without suffering the additional cost. To that end, we introduce Generative Annotation Assistants (GAAs), generator-in-the-loop models that provide real-time suggestions that annotators can either approve, modify, or reject entirely. We collect training datasets in twenty experimental settings and perform a detailed analysis of this approach for the task of extractive question answering (QA) for both standard and adversarial data collection. We demonstrate that GAAs provide significant efficiency benefits in terms of annotation speed, while leading to improved model fooling rates. In addition, we show that GAA-assisted data leads to higher downstream model performance on a variety of question answering tasks.



1 Introduction

Figure 1: Example interaction between an annotator and the models in the loop. The annotator selects an answer from the passage, for which the Generative Annotation Assistant (GAA) prompts a question. The annotator can then freely modify the question and/or answer, or generate another prompt. In the adversarial data collection setting, a model-in-the-loop provides predictions with the aim of encouraging annotators to find model-fooling examples. In the answer prompting setting, an answer suggestion is prompted by the assistive model instead of being selected by the annotator.

Natural language processing has become increasingly reliant on large datasets obtained using crowdsourcing. However, crowdsourcing as an unconstrained annotation approach is known to result in machine-exploitable annotator artefacts Jia and Liang (2017); Schwartz et al. (2017); Gururangan et al. (2018); Geva et al. (2019), leading to poor out-of-distribution generalisation Chen et al. (2016); Weissenborn et al. (2017); Yogatama et al. (2019); McCoy et al. (2019). Dynamic Adversarial Data Collection (DADC) aims to address these issues by introducing state-of-the-art models into the data collection loop and asking human annotators to produce examples that these models find challenging Kiela et al. (2021). The intuition behind this approach is that it leads human annotators to better explore the space of possible examples. Previous work has found that DADC leads to improved model robustness on adversarial datasets Nie et al. (2020); Bartolo et al. (2020), increased sample diversity Bartolo et al. (2020); Wallace et al. (2021), better training data Wallace et al. (2021) and better domain generalisation Bartolo et al. (2021).

Despite these advantages, a downside to DADC is that it increases the human effort necessary to annotate a single example and thus the overall annotation cost. In fact, to date, only a limited number of large-scale training datasets have been produced using DADC and its application has been primarily restricted to producing challenge sets or as additional training data to improve the performance of models already trained on non-DADC curated datasets. To make better use of DADC data, Bartolo et al. (2021) propose generating synthetic adversarial training sets to further improve model robustness. However, this approach inevitably limits example diversity as it relies on examples ultimately generated by a model with no additional human input, and provides no guarantees that useful synthetic examples would transfer across target adversary models of varying capabilities or across annotation rounds.

In this work, we propose having generative models assist human annotators in the data collection loop. Concretely, we utilise a Generative Annotation Assistant (GAA) model that provides prompt suggestions to crowdworkers while leaving them free to edit or rewrite each suggestion, supporting example generation without constraining human creativity (see Figure 1). We explore GAAs in a broad range of experimental settings, including standard and adversarial data collection approaches, training on various source datasets, and employing sampling methodologies based on likelihood, adversarial feedback, and uncertainty. We showcase the value of this approach on the task of extractive question answering (QA), and find that GAAs can help improve both the standard and adversarial data collection paradigms. We find considerable efficiency gains, with an observed annotation speed-up of around 28%, as well as improved data effectiveness, with up to a 4.5 F1 improvement in downstream performance for adversarial data collection.

Figure 2: The Annotation Interface used for data collection. This example shows a question generated by a generative assistant trained on the AdversarialQA data and selected using an adversarial sampler, which allowed the annotator to beat the QA model in the loop.

2 Related Work

2.1 Dynamic Adversarial Data Collection (DADC)

There exists a rich body of recent work showing the value of dynamic adversarial data collection in model evaluation Yang et al. (2017); Dua et al. (2019); Dinan et al. (2019); Nie et al. (2020); Bartolo et al. (2020); Kiela et al. (2021); Wallace et al. (2021), although the approach has also been challenged for not necessarily leading to better generalisation on non-adversarial test sets Kaushik et al. (2021a) and being unfair to the model that was used in the loop Bowman and Dahl (2021); Phang et al. (2021). This work builds on previous work in adversarial data collection methods for QA Bartolo et al. (2020), and work investigating the use of generative models to create synthetic adversarial data to improve QA model robustness Bartolo et al. (2021).

2.2 Generative Model Annotation Support

A long line of prior work has trained generative models for question answering (Du et al., 2017; Du and Cardie, 2018; Zhao et al., 2018; Lewis and Fan, 2019; Alberti et al., 2019; Puri et al., 2020; Yang et al., 2020; Bartolo et al., 2021; Lewis et al., 2021). In many cases, these approaches filter out questions that an external QA model gets wrong, in order to ensure correctness of the generated questions; our filtering strategies instead focus on generated questions that QA models get wrong as we hypothesise that these would serve as more useful initial prompts to human annotators.

Generative models have also been used to aid experts with writing contrast sets (Wu et al., 2021; Ross et al., 2021), but to the best of our knowledge, this is the first work to investigate the use of generative annotation assistants for crowdworkers directly in the annotation loop for NLP. Recent work on supporting crowdworkers for textual entailment in a non-adversarial setting shows no improvements on downstream transfer performance over baseline, albeit with reductions in previously observed issues with annotation artefacts Bowman et al. (2020). Subsequent work highlights the need for further data collection efforts focusing on improving writing-based annotation processes Vania et al. (2021), which we aim to investigate in this work. Separately, Ettinger et al. (2017) provide breakers with the ability to minimally edit original data to identify the boundaries of system capabilities, while Potts et al. (2021) analyse the use of prompts to assist crowdworkers in beating a model in the loop for sentiment analysis. In both cases, prompts are sourced from existing datasets and are not generated on the fly.

2.3 Active Learning and Weak Supervision

Active learning approaches have been used to accelerate annotation Tsuruoka et al. (2008), although this typically assumes access to a pool or stream of unlabelled data for which the learning algorithm can query labels (Settles, 2009). In our setting, no unlabelled questions are provided, necessitating the use of a generative model to suggest questions instead. Moreover, our annotators are free to edit and browse generated questions, whereas annotators in active learning typically only provide labels and have no choice in what to label. Some of our sampling and filtering strategies based on entropy are inspired by uncertainty sampling, a standard active learning algorithm (Lewis and Gale, 1994).

3 Experimental Setup

Our study focuses on the effects of incorporating generative annotation assistants and their interactions with annotators and discriminative models-in-the-loop in a DADC context for QA. We provide crowdworkers with a short passage from Wikipedia and ask them to write five questions and highlight the span in the passage that best answers the question for each (see Figure 2). We pay workers equally across experiment modes to avoid creating an incentive imbalance and pay out an additional bonus for each question that successfully beats the discriminative QA model i.e., for each question that the model fails to answer correctly. Finally, we validate all collected examples using a distinct worker pool and ask three additional workers to report on the validity of each example.

Selected Passages

We select passages from KILT (Petroni et al., 2021) to allow the possibility of future investigation into cross-domain and task transfer of knowledge intensive language understanding in the context of data collected in a DADC setting. We filter KILT passages to those with between 100 and 600 tokens that are used by at least 5 KILT tasks. We further filter out any passages with any 8-gram overlap (after normalisation) to the SQuAD1.1 training or development sets, seeking to ensure that all passages used in our study are novel and previously unseen by the discriminative QA models in the loop. This leaves a total of 10,109 passages from 421 Wikipedia pages. We retain and supply all passage-relevant KILT metadata (such as IDs and provenances) with our collected datasets to facilitate future work.
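The passage filter described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' pipeline: the whitespace tokenisation, the lowercase-and-strip-punctuation normalisation, and the names `normalise`, `ngrams`, and `keep_passage` are all assumptions made for the example.

```python
from __future__ import annotations

import string


def normalise(text: str) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace (assumed normalisation)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()


def ngrams(tokens: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def keep_passage(
    passage: str,
    num_kilt_tasks: int,
    squad_ngrams: set[tuple[str, ...]],
    min_tokens: int = 100,
    max_tokens: int = 600,
    min_tasks: int = 5,
    n: int = 8,
) -> bool:
    """Apply the three filters: length, KILT task coverage, no n-gram overlap."""
    tokens = normalise(passage)
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    if num_kilt_tasks < min_tasks:
        return False
    # Reject the passage if it shares any 8-gram with the SQuAD1.1 text.
    return ngrams(tokens, n).isdisjoint(squad_ngrams)
```

In practice, `squad_ngrams` would be built once by taking the union of `ngrams(normalise(p))` over all SQuAD1.1 training and development passages, so that each KILT passage is checked with a single set operation.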


The discriminative QA model in the loop is ELECTRALarge (Clark et al., 2020) trained on SQuAD1.1 and AdversarialQA, and enhanced using SynQA to improve adversarial robustness as investigated by Bartolo et al. (2021). This was the best-performing model on the Dynabench Kiela et al. (2021) leaderboard at the time of conducting this study, obtaining a word-overlap F1 score of 94.5% on the SQuAD1.1 dev set, and it represents the state of the art on AdversarialQA, achieving 77.6% on the D_BiDAF subset, 71.5% on D_BERT, and 63.2% on D_RoBERTa.


For our generative model, we use the fairseq (Ott et al., 2019) implementation of BART Lewis et al. (2020), fine-tuning the decoder to generate questions conditioned on the passage and the answer highlighted by the annotator. To provide a diverse set of questions to annotators, we decode using nucleus sampling, as decoding using standard beam search results in questions that are too similar to each other and therefore likely to be of less use as question prompts to annotators. To speed up inference and model-annotator interaction, we preemptively identify answer candidates for each passage and generate questions to build up a large cache from which we serve questions during annotation. Once there are no questions remaining in the cache for a particular answer, or if the annotator selects an answer that is not in the cache, we fall back to querying the generative model in real time. In this work, we investigate generative assistants trained on three different sources of questions: SQuAD1.1, AdversarialQA, and the combination of both SQuAD and AdversarialQA.
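The cache-then-fallback serving logic can be illustrated with a small sketch. The class name `QuestionCache`, its methods, and the `generate` callable (a stand-in for a real-time call to the generative model) are hypothetical and chosen only for this example; the source does not describe the actual implementation.

```python
from collections import defaultdict, deque
from typing import Callable, Iterable


class QuestionCache:
    """Serve pre-generated questions per (passage, answer); fall back to the model."""

    def __init__(self, generate: Callable[[str, str], str]):
        # `generate(passage, answer)` is a hypothetical real-time generator call.
        self._generate = generate
        self._cache = defaultdict(deque)

    def preload(self, passage_id: str, answer: str, questions: Iterable[str]) -> None:
        """Store questions generated ahead of time for an answer candidate."""
        self._cache[(passage_id, answer)].extend(questions)

    def next_question(self, passage_id: str, passage: str, answer: str) -> str:
        """Return the next cached question, or query the generator if none is left."""
        queue = self._cache.get((passage_id, answer))
        if queue:  # cache hit with at least one question remaining
            return queue.popleft()
        # Cache exhausted, or the annotator picked an uncached answer.
        return self._generate(passage, answer)
```

Each cached question is served at most once per answer, so repeated clicks on a "Generate Question" button walk through the pre-computed candidates before any real-time model call is made.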

Question Sampling

We investigate three different selection strategies for presenting the generated questions as prompts to annotators: i) generator likelihood samples candidates in the order prescribed by the generative model’s associated likelihood values; ii) adversarial sampling selects generated questions in order of the least word-overlap F1 scores when queried against the discriminative QA model; and iii) uncertainty sampling is inspired by active learning and selects generated questions in order of the least span selection confidence when queried against the QA model. The latter two provide an interesting trade-off for exploration as we would expect the quality of the generated questions to be worse than if sampled based on likelihood. However, we hope that such prompts could serve to inspire annotators and provide a “starting point” beyond the answering capabilities of the QA model, irrespective of correctness. We hypothesise that modifying such examples might be a more effective process for annotators to undertake than when starting from higher quality but less model-confusing prompts, and investigate this question thoroughly.
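A minimal sketch of the three orderings follows. It rests on several assumptions: each candidate already carries a generator log-likelihood, the QA model's predicted span, and a span-selection confidence; likelihood sampling is taken to mean "most likely first"; and the `f1` helper is a simplified word-overlap F1 that omits the article and punctuation stripping of the official SQuAD evaluation script. The names `Candidate`, `f1`, and `rank` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Candidate:
    question: str
    log_likelihood: float  # generator score for the question
    qa_prediction: str     # answer span predicted by the QA model
    qa_confidence: float   # QA model's span-selection confidence


def f1(prediction: str, gold: str) -> float:
    """Simplified word-overlap F1 between a predicted and a gold answer."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def rank(candidates: list, gold_answer: str, strategy: str) -> list:
    """Order question prompts according to one of the three strategies."""
    if strategy == "likelihood":
        key = lambda c: -c.log_likelihood              # most likely first (assumed)
    elif strategy == "adversarial":
        key = lambda c: f1(c.qa_prediction, gold_answer)  # lowest QA F1 first
    elif strategy == "uncertainty":
        key = lambda c: c.qa_confidence                # least confident first
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(candidates, key=key)
```

Because `sorted` is stable, candidates that tie under a strategy keep their input order, so passing in a likelihood-ordered list makes likelihood the implicit tie-breaker.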

Answer Prompts

We also investigate the effects of abstracting away the answer selection task from the annotator. To identify potential candidate answers, we use Self-Attention Labelling (SAL) Bartolo et al. (2021) and investigate providing annotators with both answer prompts as well as the corresponding generated questions.

Experimental Settings

In total, there are twenty different experimental settings involving combinations of the above-mentioned annotation pipeline components. We collect 1,000 validated training examples for each of these settings, for a total of 20,000 examples. For downstream evaluation we train ELECTRALarge QA models on the training datasets collected in each setting, and perform identical model selection and hyper-parameter tuning.

Annotation Interface

We use a variant of the Dynabench (Kiela et al., 2021) QA interface that allows annotators to interact with the models in the loop, and further allows them to edit and modify generated questions and answers as required. The same base interface is used across experimental settings and only varied minimally depending on the current setting, for example by changing the title and instructions in the adversarial annotation setting, or by adding a “Generate Question” button when the setting involves GAAs. In the GAA settings, annotators are not informed what generative model they are interacting with, or what sampling mechanism is being used.

Adversary-in-the-loop?  t (s)  vMER (%)  t/vMFE (s)  SQuADdev  D_BiDAF  D_BERT  D_RoBERTa  MRQA
No                      56.3   0.63      11274       45.4      14.7     9.2     8.8        25.2
Yes                     61.2   1.62      4863        82.0      44.4     29.2    22.4       53.8

Table 1: Baseline results comparing standard and adversarial data collection. t shows the median time taken per example in seconds and median absolute deviation (subscript). vMER is the validated model error rate. t/vMFE is the time per validated model-fooling example. Lower is better for the time-dependent metrics. Downstream evaluation is measured by training an ELECTRALarge QA model on the collected datasets and evaluating F1 scores on the SQuAD1.1 dev set, the AdversarialQA test sets, and the MRQA dev sets for domain generalisation.

Crowdsourcing Protocol

We use Amazon Mechanical Turk to recruit workers for this study. To help ensure proficiency in English, crowdworkers are required to be based in Canada, the UK, or the US. They are also required to meet a minimum Human Intelligence Task (HIT) Approval Rate, to have previously completed at least 1,000 HITs, and to undergo a dedicated onboarding process. Workers were randomly assigned to one of the possible experiment modes and were all presented with passages sampled from the same set, for which they were tasked with writing and answering five questions. All collected questions were then validated for correctness by a separate group of crowdworkers. We collect three validations per question and use this information, along with manual verification of a subset of the annotated examples, to maintain a high level of quality and remove examples from workers who were generating examples with an incorrectness rate above an acceptability threshold of 95%. Workers were provided an additional $0.50 bonus for each example validated as having successfully fooled the model in the adversarial data collection settings. In total, 1,388 workers participated in the study, with 1,113 contributing to the final datasets. We also continuously validate both annotators and validators based on signals such as repetitiveness, agreement, and manual checks.


We evaluate the outcomes in each of the experimental settings using a selection of metrics: i) median time per example as a measure of annotation efficiency, where a lower time taken is better; ii) validated Model Error Rate (vMER) (Bartolo et al., 2021), which evaluates the effectiveness of annotators at generating valid question-answer pairs that the QA model fails to answer correctly; iii) median time per validated model-fooling example, which serves as a single metric incorporating both method efficiency and effectiveness and thus provides a convenient metric for comparison across the various experimental settings; and iv) downstream effectiveness, in which we evaluate the performance (by word-overlap F1 score) of a QA model trained on the data collected in each of the experimental modes on the standard SQuAD1.1 benchmark, on the AdversarialQA benchmark, and in terms of domain generalisation ability on the MRQA (Fisch et al., 2019) dev sets. Lower values are better for the time-dependent metrics. From the perspective of training data, however, we consider a higher vMER to be better, guided by the performance benefits observed for adversarial over standard data collection. This is corroborated by comparison with downstream results.
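The annotation-side metrics can be made concrete with a short sketch. The exact aggregation behind the reported figures is not spelled out in the text, so two choices below are assumptions: vMER is computed over all collected examples, and time per validated model-fooling example is computed as total annotation time divided by the number of such examples. The `Example` record and `summarise` function are hypothetical names.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Example:
    seconds: float      # annotation time for this example
    valid: bool         # judged valid by the validators
    model_fooled: bool  # the QA model answered incorrectly


def summarise(examples: list) -> dict:
    """Compute median time, vMER, and time per validated model-fooling example."""
    n_vmfe = sum(e.valid and e.model_fooled for e in examples)
    total_time = sum(e.seconds for e in examples)
    return {
        "median_time_s": median(e.seconds for e in examples),
        # Assumption: the denominator is every collected example.
        "vmer": n_vmfe / len(examples),
        # Assumption: total time spent per validated model-fooling example.
        "time_per_vmfe_s": total_time / n_vmfe if n_vmfe else float("inf"),
    }
```

An example only counts towards vMER when it is both valid and model-fooling, which is what separates the validated rate from a raw model error rate.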

Sampling Strategy  t (s)  vMER (%)  t/vMFE (s)  SQuADdev  D_BiDAF  D_BERT  D_RoBERTa  MRQA
Likelihood         40.2   0.69      6331        53.6      15.9     11.0    9.9        31.4
Adversarial        56.7   3.22      2277        80.1      39.1     21.1    18.7       49.5
Uncertainty        56.9   2.93      2643        80.1      40.1     24.3    22.6       51.1

Table 2: Results for the investigation into supporting standard data collection using GAAs. Since this setting assumes no access to adversarially-sourced data, we use a generative model trained only on questions from SQuAD1.1. There is no adversarial QA model in the loop in this setting.

4 Results

Our study allows us to perform a thorough investigation into both the efficiency and effectiveness of the different data annotation methodologies. It also allows us to build on work investigating the various differences between standard and adversarial data collection Kaushik et al. (2021b).

4.1 Standard versus Adversarial Data Collection

The standard and adversarial data collection settings we use as baselines do not make use of GAAs, and are designed to replicate the SQuAD1.1 Rajpurkar et al. (2016) and AdversarialQA Bartolo et al. (2020) annotation setups as closely as possible. However, in contrast to AdversarialQA, our setting only provides annotators with a financial incentive to try to beat the model in the loop through the use of a bonus, and does not restrict annotators to only submitting model-fooling examples.

The results, shown in Table 1, highlight the differences between the two annotation approaches. As expected, standard data collection is more efficient in terms of the time taken per example, as there is no requirement for annotators to make any effort to try to beat a model. However, the efficiency differences are not as large as seen in settings where annotators have to submit model-fooling examples Bartolo et al. (2020). We also find considerable benefits from adversarial data collection in terms of the validated model error rate and subsequent downstream performance.

We note that the training data sizes in both these settings are relatively small, and the benefits of adversarial data collection have been shown to be more pronounced in the low data regime, likely due to increased example diversity. We would not necessarily expect these differences to be as pronounced with larger scale collection efforts. We also note that while our passages are sourced from Wikipedia, there may exist characteristic differences between these and the passages used in SQuAD. Furthermore, we highlight the considerably lower (i.e., better) adversarial human evaluation vMER scores achieved for our synthetically-augmented ELECTRALarge model-in-the-loop compared to the 8.8% reported for RoBERTaLarge by Bartolo et al. (2021). We hypothesise that this is primarily due to two factors: the improved robustness of ELECTRA in comparison to RoBERTa, and more stringent example validation. For further evidence of the improved robustness of ELECTRA, see Appendix A.

GAA Training   Sampling     t (s)  vMER (%)  t/vMFE (s)  SQuADdev  D_BiDAF  D_BERT  D_RoBERTa  MRQA
SQuAD          Likelihood   66.2   2.40      3489        81.2      44.2     27.8    21.3       52.3
SQuAD          Adversarial  63.3   2.87      2831        80.2      41.7     28.8    20.9       49.3
SQuAD          Uncertainty  65.7   2.34      3505        82.6      45.1     29.0    23.0       52.4
AdversarialQA  Likelihood   59.0   2.63      3034        79.9      40.8     30.2    24.9       52.6
AdversarialQA  Adversarial  64.7   3.95      2077        75.7      38.7     28.8    23.1       50.3
AdversarialQA  Uncertainty  66.7   3.79      2305        78.3      41.9     29.4    22.9       51.0
Combined       Likelihood   52.7   2.51      2827        79.6      40.7     29.9    24.2       53.3
Combined       Adversarial  71.0   2.76      3450        78.7      39.8     26.6    22.0       49.6
Combined       Uncertainty  66.7   3.08      2854        81.0      44.0     26.4    22.2       52.7

Table 3: Results for the investigation into supporting adversarial data collection using GAAs. We investigate three different GAA training dataset sources, and three sampling strategies. The adversarial QA model-in-the-loop is identical for all settings.

4.2 Improving Standard Data Collection

We now investigate whether it might be possible to improve standard data collection practices using generative assistants – can we achieve similar performance to adversarial data collection without access to any adversarial data?

We therefore use a GAA trained on SQuAD1.1, and investigate the three sampling techniques namely: likelihood, adversarial, and uncertainty sampling. Results are shown in Table 2.

We find that using a GAA with likelihood sampling considerably improves the efficiency of the annotation process in comparison to the standard data collection baseline in Table 1, while giving comparable, if slightly improved, vMER and downstream results.

Furthermore, both the adversarial and uncertainty sampling strategies prove effective. While the time taken per example is not as impressive as for standard likelihood sampling, and is comparable to the standard data collection baseline, the vMER – an indicator of the diversity of the collected training data – is substantially improved and outperforms the adversarial data collection baseline. The downstream results are also very promising, considerably improving on the standard data collection setting. They also approach the values for the adversarial data collection baseline although, despite the improved vMER, overall downstream performance is better in the adversarial data collection setting. In summary, this result shows that we can encourage annotators to come up with more challenging examples and approach the downstream performance achieved using adversarial data collection without requiring any adversarially-collected data or an adversarial model in the loop, simply through the use of GAAs paired with an appropriate sampling strategy. While impressive, this is in line with our initial hypothesis that sampling generated prompts from regions of known model uncertainty, or prompts that we know the model finds challenging to answer, irrespective of generated sample quality, provides annotators with a better starting point for example creation.

GAA Training   Sampling     t (s)  vMER (%)  t/vMFE (s)  SQuADdev  D_BiDAF  D_BERT  D_RoBERTa  MRQA
AdversarialQA  Likelihood   49.9   6.08      1086        78.2      44.0     33.7    26.2       52.0
AdversarialQA  Adversarial  43.8   2.22      2587        79.9      44.2     30.6    23.6       52.1
AdversarialQA  Uncertainty  50.9   4.04      1667        80.4      42.8     28.8    22.1       51.1
Combined       Likelihood   49.0   2.72      2510        79.6      42.7     31.1    23.8       50.2
Combined       Adversarial  65.2   4.41      2042        80.2      44.7     31.5    24.8       53.0
Combined       Uncertainty  54.1   2.94      2740        81.1      44.8     27.9    23.8       51.2

Table 4: Results for the investigation into supporting adversarial data collection using GAAs equipped with answer prompting. We investigate two different GAA training dataset sources, and three sampling strategies. The adversarial QA model-in-the-loop is identical for all settings.

4.3 Improving Adversarial Data Collection

Following the impressive gains observed for standard data collection, we investigate whether it is possible for GAAs to provide further improvements over adversarial data collection. Here, we experiment with GAAs trained on three different datasets: SQuAD1.1, AdversarialQA, and the combination of both. We combine each of these with the three previously discussed sampling strategies giving nine different experimental settings. Results are shown in Table 3.

We find that when annotators are incentivised to try to beat an adversarial QA model-in-the-loop, the previously seen efficiency gains are not as clear cut. In fact, annotators are slightly slower than the adversarial data collection baseline when using a SQuAD-trained GAA. When using a GAA that has been trained on adversarially-sourced questions, likelihood sampling provides efficiency gains over the baseline. However, both adversarial and uncertainty sampling (which naturally lead to more complex prompts that might be more challenging to work with) actually slow annotators down, although they do provide improved validated model error rates. In terms of downstream performance, there is no clear best setting, but the best settings outperform the adversarial data collection baseline. We also observe that a SQuAD-trained GAA with uncertainty sampling gives best performance on the less challenging evaluation sets, while an AdversarialQA-trained GAA with adversarial sampling gives best performance on the evaluation datasets collected using a more performant adversary. This is also in line with the observations made by Bartolo et al. (2020) showing a distributional shift in question type and complexity with an increasingly stronger model-in-the-loop.

The general takeaway therefore in terms of the ideal experimental setting from the perspective of downstream performance is that this depends on the particular evaluation setting, with GAAs trained on examples from a particular setting yielding better performance when the downstream model is also evaluated in similar conditions. Another key observation is that both the validated model error rate and time per validated model-fooling example comfortably outperform the baselines across the board, highlighting the enhancements to the effectiveness of the annotation process provided by incorporating GAAs in the loop.

4.4 Investigating Answer Prompting

The previously explored settings focus on investigating the effects of assisting free-text generation using GAAs. However, the QA crowdsourcing setting also involves answer annotation, which we also explore in search of efficiency gains. Here, we explore GAAs trained on datasets with adversarially-sourced components and the same three sampling strategies as previously, with the addition of providing annotators with an answer suggestion. In essence, this is similar to an answer and question validation setting, with the difference that annotators have the ability to freely modify both answer and question, or request additional suggestions. Results are shown in Table 4.


We find that answer prompting is highly effective at improving annotation efficiency, providing gains in all six experimental settings while also providing improved vMER results in some cases. We also see very similar downstream performance result patterns to the previous set of experiments – for performance on the more challenging evaluation sets (D_BERT and D_RoBERTa), an AdversarialQA-trained GAA with likelihood sampling gives best performance, while for performance on SQuAD and D_BiDAF, a GAA trained on examples including SQuAD coupled with uncertainty sampling gives best performance. This consistency in performance patterns serves to further highlight our previous observation that, while using GAAs provides considerable gains in both the efficiency of the annotation process and effectiveness in terms of downstream results, the ideal annotation setup should be selected based on the target downstream evaluation.

5 Annotator Interaction with GAAs

While we provide annotators with instructions explaining how they can use the GAAs to aid their annotation, they are free to query the generative models as many times as they like, if at all, during annotation. We are interested to see how the three main factors affecting interaction with the GAAs that we explore – training data, sampling strategy, and answer prompting – affect the ways in which annotators interact or use the GAAs.

Results, shown in Table 5, indicate that annotators query the GAA less frequently when being shown simpler prompts i.e. those obtained using a GAA trained on non-adversarially sourced examples, or selected using likelihood sampling which tends to provide higher quality and less complex generated texts. We also find that annotators query the GAA more frequently when an answer prompt is also provided. We believe that this can be attributed to the fact that the answer and question prompt setting is more similar to a validation workflow, allowing annotators to generate prompts until a satisfactory one is found.

Feature Setting Avg. #Generations per Example
GAA Training SQuAD 0.72
AdversarialQA 0.86
Combined 0.80
Sampling Likelihood 0.59
Adversarial 0.88
Uncertainty 0.81
Answer Prompt? 0.79
Table 5: Results showing how often annotators query the GAA in different experimental settings.

6 Discussion and Conclusion

In this work, we introduce Generative Annotation Assistants (GAAs) and investigate their potential to aid crowdworkers with creating more effective training data more efficiently. We perform a thorough analysis of how GAAs can be used for improving QA dataset annotation in different settings, including different generative model training data, sampling strategies, and whether to also provide annotators with answer suggestions.

We find that GAAs are beneficial in both the standard and adversarial data collection settings. In the standard data collection setting, and under the assumption of no access to adversarially-collected data, GAAs with prompts sampled based on likelihood provide annotation speed-ups, while prompts sampled by adversarial performance or uncertainty metrics provide benefits to both the model error rates on the collected data as well as subsequent downstream QA performance. We find that we can get near-adversarial data collection downstream performance using GAAs without involving an adversarial model in the loop.

For adversarial data collection, we demonstrate improved effectiveness of the annotation process over the non-GAA baseline, although this comes at a cost of reduced annotation efficiency. We show that also aiding annotators with answer prompts boosts data collection efficiency even beyond that achieved for standard data collection, while retaining downstream performance. We find that the ideal annotation setting differs for different intended evaluations, with an uncertainty sampled GAA trained on data that was not entirely adversarially-collected providing best performance on simpler questions, while an adversarially sampled GAA trained on adversarially-collected data provides best downstream performance on more challenging evaluation sets. Overall, we see annotation speed-ups over a baseline of 28.6% for standard and 28.4% for adversarial data collection. We also see a 3.75x improvement in vMER for adversarial data collection, along with best downstream performance gains of 0.6 F1 on SQuADdev, 0.7 F1 on D_BiDAF, 4.5 F1 on D_BERT, and 3.8 F1 on D_RoBERTa. Furthermore, we see benefits in domain generalisation for standard data collection, and show that annotators interact with the GAA more frequently when it has been trained on adversarially-collected data, is sampled from based on adversarial or uncertainty feedback, and also provides answer prompts.
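The headline figures can be re-derived from the tables as a sanity check. Which table rows feed each figure is not stated in the text; the pairing below (baselines from Table 1, the fastest GAA setting from Tables 2 and 4, the highest vMER from Table 4, and the per-column best downstream score across Tables 3 and 4) is an inference from matching the numbers.

```python
def speed_up(baseline_s: float, assisted_s: float) -> float:
    """Percentage reduction in median annotation time relative to a baseline."""
    return 100 * (baseline_s - assisted_s) / baseline_s


# Standard collection: Table 1 baseline (56.3 s) vs. Table 2 likelihood sampling (40.2 s).
standard = speed_up(56.3, 40.2)
# Adversarial collection: Table 1 baseline (61.2 s) vs. the fastest Table 4 setting (43.8 s).
adversarial = speed_up(61.2, 43.8)
# vMER: best Table 4 value (6.08%) over the adversarial baseline (1.62%).
vmer_ratio = 6.08 / 1.62

# Downstream gains over the adversarial baseline (best GAA score minus baseline score).
gains = {
    "SQuADdev": 82.6 - 82.0,
    "D_BiDAF": 45.1 - 44.4,
    "D_BERT": 33.7 - 29.2,
    "D_RoBERTa": 26.2 - 22.4,
}

print(f"standard speed-up: {standard:.1f}%")
print(f"adversarial speed-up: {adversarial:.1f}%")
print(f"vMER improvement: {vmer_ratio:.2f}x")
print({name: round(gain, 1) for name, gain in gains.items()})
```

Under this pairing the computed values round to the reported 28.6%, 28.4%, 3.75x, and gains of 0.6, 0.7, 4.5, and 3.8 F1 respectively.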

While our analysis is limited by the size of the collected data, we believe that GAAs can help drive further innovation into improved data collection methodologies based on these observations. We hope that our analysis of various aspects of GAA incorporation into the annotation pipeline can help inform future work exploring broader aspects of GAA use, such as for other NLP tasks or for larger scale annotation efforts.

7 Ethical Considerations

We collect training datasets as part of the analysis in this work. The passages are sourced from Wikipedia through KILT. As described in the main text, our incentive structure is designed to ensure that crowdworkers were fairly compensated. Our datasets focus on the English language. As this data is not collected for the purpose of designing NLP applications, we do not foresee any risks associated with the use of this data.


The authors would like to thank the Dynabench team for their feedback and continuous support.


Appendix A Adversarial Robustness of ELECTRA and RoBERTa

Model Training Data SQuADdev AddSent AddOneSent
BERTLarge SQuAD 90.3 73.7 80.3
SQuAD + AdversarialQA 93.3 80.1 85.2
RoBERTaLarge SQuAD 93.5 82.4 86.9
SQuAD + AdversarialQA 92.5 83.4 86.7
SQuAD + AdversarialQA + SynQA 94.8 86.0 89.0
SQuAD + AdversarialQA + SynQAExt 94.9 87.1 90.1
ELECTRALarge SQuAD 94.4 85.0 89.0
SQuAD + AdversarialQA 94.7 86.1 89.9
SQuAD + AdversarialQA + SynQA 94.8 85.7 89.2
Table 6: Word-overlap F1 results for BERT, RoBERTa, and ELECTRA on the SQuAD1.1 dev set and the AddSent and AddOneSent adversarial evaluation sets Jia and Liang (2017).

Table 6 shows adversarial robustness performance evaluated on the AddSent and AddOneSent evaluation datasets introduced by Jia and Liang (2017). We observe that even when trained only on SQuAD1.1, ELECTRA performs considerably better than RoBERTa in this setting, suggesting that it is considerably more robust “out of the box”.