Natural language processing has become increasingly reliant on large datasets obtained through crowdsourcing. However, crowdsourcing as an unconstrained annotation approach is known to result in machine-exploitable annotator artefacts (Jia and Liang, 2017; Schwartz et al., 2017; Gururangan et al., 2018; Geva et al., 2019), leading to poor out-of-distribution generalisation (Chen et al., 2016; Weissenborn et al., 2017; Yogatama et al., 2019; McCoy et al., 2019). Dynamic Adversarial Data Collection (DADC) aims to address these issues by introducing state-of-the-art models into the data collection loop and asking human annotators to produce examples that these models find challenging (Kiela et al., 2021). The intuition behind this approach is that it leads human annotators to better explore the space of possible examples. Previous work has found that DADC leads to improved model robustness on adversarial datasets (Nie et al., 2020; Bartolo et al., 2020), increased sample diversity (Bartolo et al., 2020; Wallace et al., 2021), better training data (Wallace et al., 2021), and better domain generalisation (Bartolo et al., 2021).
Despite these advantages, a downside of DADC is that it increases the human effort required to annotate a single example, and thus the overall annotation cost. In fact, to date, only a limited number of large-scale training datasets have been produced using DADC, and its application has primarily been restricted to producing challenge sets or to supplying additional training data for models already trained on non-DADC-curated datasets. To make better use of DADC data, Bartolo et al. (2021) propose generating synthetic adversarial training sets to further improve model robustness. However, this approach inevitably limits example diversity, as it relies on examples ultimately generated by a model with no additional human input, and provides no guarantees that useful synthetic examples will transfer across target adversary models of varying capabilities or across annotation rounds.
In this work, we propose aiding human annotators in the data collection loop with generative models. Concretely, we use a Generative Annotation Assistant (GAA) model that provides prompt suggestions to crowdworkers, while allowing full flexibility for edits and rewrites so as to support example generation without constraining human creativity, as shown in Figure 1. We explore GAAs in a broad range of experimental settings, including standard and adversarial data collection approaches, training on various source datasets, and employing sampling methodologies based on likelihood, adversarial feedback, and uncertainty. We showcase the value of this approach on the task of extractive question answering (QA), and find that GAAs help improve both the standard and adversarial data collection paradigms. We find considerable efficiency gains, with an observed annotation speed-up of around 28%, as well as improved data effectiveness, with up to a 4.5 F1 improvement in downstream performance for adversarial data collection.
2 Related Work
2.1 Dynamic Adversarial Data Collection (DADC)
There exists a rich body of recent work showing the value of dynamic adversarial data collection in model evaluation (Yang et al., 2017; Dua et al., 2019; Dinan et al., 2019; Nie et al., 2020; Bartolo et al., 2020; Kiela et al., 2021; Wallace et al., 2021), although the approach has also been challenged for not necessarily leading to better generalisation on non-adversarial test sets (Kaushik et al., 2021a) and for being unfair to the model that was used in the loop (Bowman and Dahl, 2021; Phang et al., 2021). This work builds on previous work in adversarial data collection methods for QA (Bartolo et al., 2020), and on work investigating the use of generative models to create synthetic adversarial data to improve QA model robustness (Bartolo et al., 2021).
2.2 Generative Model Annotation Support
A long line of prior work has trained generative models for question answering (Du et al., 2017; Du and Cardie, 2018; Zhao et al., 2018; Lewis and Fan, 2019; Alberti et al., 2019; Puri et al., 2020; Yang et al., 2020; Bartolo et al., 2021; Lewis et al., 2021). In many cases, these approaches filter out questions that an external QA model gets wrong, in order to ensure correctness of the generated questions; our filtering strategies instead focus on generated questions that QA models get wrong as we hypothesise that these would serve as more useful initial prompts to human annotators.
Generative models have also been used to aid experts with writing contrast sets (Wu et al., 2021; Ross et al., 2021), but to the best of our knowledge, this is the first work to investigate the use of generative annotation assistants for crowdworkers directly in the annotation loop for NLP. Recent work on supporting crowdworkers for textual entailment in a non-adversarial setting shows no improvement in downstream transfer performance over baseline, albeit with reductions in previously observed issues with annotation artefacts (Bowman et al., 2020). Subsequent work highlights the need for further data collection efforts focused on improving writing-based annotation processes (Vania et al., 2021), which we aim to investigate in this work. Separately, Ettinger et al. (2017) provide breakers with the ability to minimally edit original data to identify the boundaries of system capabilities, while Potts et al. (2021)
analyse the use of prompts to assist crowdworkers in beating a model in the loop for sentiment analysis. In both cases, prompts are sourced from existing datasets and are not generated on the fly.
2.3 Active Learning and Weak Supervision
Active learning approaches have been used to accelerate annotation (Tsuruoka et al., 2008), although this typically assumes access to a pool or stream of unlabelled data for which the learning algorithm can query labels (Settles, 2009). In our setting, no unlabelled questions are provided, necessitating the use of a generative model to suggest questions instead. Moreover, our annotators are free to edit and browse generated questions, whereas annotators in active learning typically only provide labels and have no choice in what to label. Some of our sampling and filtering strategies based on entropy are inspired by uncertainty sampling, a standard active learning algorithm (Lewis and Gale, 1994).
3 Experimental Setup
Our study focuses on the effects of incorporating generative annotation assistants, and on their interactions with annotators and discriminative models-in-the-loop, in a DADC context for QA. We provide crowdworkers with a short passage from Wikipedia and ask them to write five questions, and for each to highlight the span in the passage that best answers it (see Figure 2). We pay workers equally across experiment modes to avoid creating an incentive imbalance, and pay out an additional bonus for each question that successfully beats the discriminative QA model, i.e., for each question that the model fails to answer correctly. Finally, we validate all collected examples using a distinct worker pool, asking three additional workers to report on the validity of each example.
We select passages from KILT (Petroni et al., 2021) to allow the possibility of future investigation into cross-domain and task transfer of knowledge intensive language understanding in the context of data collected in a DADC setting. We filter KILT passages to those with between 100 and 600 tokens that are used by at least 5 KILT tasks. We further filter out any passages with any 8-gram overlap (after normalisation) to the SQuAD1.1 training or development sets, seeking to ensure that all passages used in our study are novel and previously unseen by the discriminative QA models in the loop. This leaves a total of 10,109 passages from 421 Wikipedia pages. We retain and supply all passage-relevant KILT metadata (such as IDs and provenances) with our collected datasets to facilitate future work.
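The 8-gram decontamination step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact normalisation used is not specified, so the lowercasing and punctuation stripping here are assumptions.

```python
import re

def ngrams(text, n=8):
    # Simple normalisation: lowercase and strip punctuation
    # (an assumption; the paper does not detail its normalisation).
    tokens = re.sub(r"[^\w\s]", "", text.lower()).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def has_ngram_overlap(passage, reference_texts, n=8):
    """True if the passage shares any normalised n-gram with a reference text."""
    passage_grams = ngrams(passage, n)
    return any(passage_grams & ngrams(ref, n) for ref in reference_texts)
```

Passages for which `has_ngram_overlap` returns True against the SQuAD1.1 training or development sets would be discarded.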
The discriminative QA model in the loop is ELECTRA-Large (Clark et al., 2020) trained on SQuAD1.1 and AdversarialQA, and enhanced using SynQA to improve adversarial robustness, as investigated by Bartolo et al. (2021) (the model can be interacted with at https://dynabench.org/models/109). It was the best-performing model on the Dynabench (Kiela et al., 2021) leaderboard at the time of conducting this study, obtaining a word-overlap F1 score of 94.5% on the SQuAD1.1 dev set, and represents the state of the art on AdversarialQA, achieving 77.6% on the BiDAF subset, 71.5% on the BERT subset, and 63.2% on the RoBERTa subset.
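The word-overlap F1 referred to here is the standard SQuAD-style token-level F1 between predicted and gold answer spans. A minimal version looks like the following (SQuAD's article and punctuation normalisation is omitted for brevity):

```python
from collections import Counter

def word_overlap_f1(prediction, gold):
    # Token-level F1 between a predicted and a gold answer span.
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```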
For our generative model, we use the fairseq (Ott et al., 2019) implementation of BART (Lewis et al., 2020), fine-tuning the decoder to generate questions conditioned on the passage and the answer highlighted by the annotator. To provide a diverse set of questions to annotators, we decode using nucleus sampling, as decoding using standard beam search results in questions that are too similar to each other and therefore likely to be of less use as question prompts to annotators. To speed up inference and model-annotator interaction, we preemptively identify answer candidates for each passage and generate questions to build up a large cache from which we serve questions during annotation. Once there are no questions remaining in the cache for a particular answer, or if the annotator selects an answer that is not in the cache, we fall back to querying the generative model in real time. In this work, we investigate generative assistants trained on three different sources of questions: SQuAD1.1, AdversarialQA, and the combination of both.
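The cache-with-fallback scheme can be sketched as follows. This is a simplified illustration under our own naming; `generate_fn` stands in for real-time nucleus-sampled decoding from the fine-tuned BART model.

```python
from collections import defaultdict

class QuestionCache:
    """Serves pre-generated questions for (passage, answer) pairs,
    falling back to live generation when the cache is exhausted."""

    def __init__(self, generate_fn):
        # generate_fn(passage, answer) stands in for real-time decoding.
        self.generate_fn = generate_fn
        self.cache = defaultdict(list)

    def preload(self, passage, answer, questions):
        # Fill the cache ahead of annotation time.
        self.cache[(passage, answer)].extend(questions)

    def next_question(self, passage, answer):
        queue = self.cache[(passage, answer)]
        if queue:
            return queue.pop(0)
        # Cache miss (exhausted, or annotator chose an uncached answer):
        # query the generative model in real time.
        return self.generate_fn(passage, answer)
```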
We investigate three different selection strategies for presenting the generated questions as prompts to annotators: i) generator likelihood sampling presents candidates in the order prescribed by the generative model's associated likelihood values; ii) adversarial sampling selects generated questions in order of the lowest word-overlap F1 scores when queried against the discriminative QA model; and iii) uncertainty sampling, inspired by active learning, selects generated questions in order of the lowest span-selection confidence when queried against the QA model. The latter two provide an interesting trade-off for exploration, as we would expect the quality of the generated questions to be worse than if sampled based on likelihood. However, we hope that such prompts can serve to inspire annotators and provide a “starting point” beyond the answering capabilities of the QA model, irrespective of correctness. We hypothesise that modifying such examples might be a more effective process for annotators than starting from higher-quality but less model-confusing prompts, and investigate this question thoroughly.
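The three strategies reduce to different sort keys over a pool of generated candidates. A schematic version, where the field names (`log_likelihood`, `qa_f1`, `qa_confidence`) are ours and assume each candidate has already been scored by the generator and the QA model:

```python
def rank_by_likelihood(candidates):
    # Highest generator log-likelihood first.
    return sorted(candidates, key=lambda c: c["log_likelihood"], reverse=True)

def rank_adversarial(candidates):
    # Lowest word-overlap F1 of the QA model's answer first.
    return sorted(candidates, key=lambda c: c["qa_f1"])

def rank_by_uncertainty(candidates):
    # Lowest QA-model span-selection confidence first.
    return sorted(candidates, key=lambda c: c["qa_confidence"])
```

Candidates would then be served to annotators in the resulting order.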
We also investigate the effects of abstracting the answer selection task away from the annotator. To identify potential candidate answers, we use Self-Attention Labelling (SAL; Bartolo et al., 2021) and investigate providing annotators with both answer prompts and the corresponding generated questions.
In total, there are twenty different experimental settings involving combinations of the above-mentioned annotation pipeline components. We collect 1,000 validated training examples for each of these settings, for a total of 20,000 examples. For downstream evaluation, we train ELECTRA-Large QA models on the training datasets collected in each setting, and perform identical model selection and hyper-parameter tuning.
We use a variant of the Dynabench (Kiela et al., 2021) QA interface that allows annotators to interact with the models in the loop, and further allows them to edit and modify generated questions and answers as required. The same base interface is used across experimental settings and only varied minimally depending on the current setting, for example by changing the title and instructions in the adversarial annotation setting, or by adding a “Generate Question” button when the setting involves GAAs. In the GAA settings, annotators are not informed what generative model they are interacting with, or what sampling mechanism is being used.
We use Amazon Mechanical Turk to recruit workers for this study. To help ensure proficiency in English, crowdworkers are required to be based in Canada, the UK, or the US. They are also required to have a high Human Intelligence Task (HIT) Approval Rate, to have previously completed at least 1,000 HITs, and to undergo a dedicated onboarding process. Workers were randomly assigned to one of the possible experiment modes and were all presented with passages sampled from the same set, for which they were tasked with writing and answering five questions. All collected questions were then validated for correctness by a separate group of crowdworkers. We collect three validations per question and use this information, along with manual verification of a subset of the annotated examples, to maintain a high level of quality and to remove examples from workers whose rate of valid examples fell below an acceptability threshold of 95%. Workers were paid an additional $0.50 bonus for each example validated as having successfully fooled the model in the adversarial data collection settings. In total, 1,388 workers participated in the study, with 1,113 contributing to the final datasets. We also continuously vet both annotators and validators based on signals such as repetitiveness, agreement, and manual checks.
We evaluate the outcomes in each of the experimental settings using a selection of metrics: i) median time per example, a measure of annotation efficiency for which lower is better; ii) validated Model Error Rate (vMER) (Bartolo et al., 2021), which evaluates the effectiveness of annotators at generating valid question-answer pairs that the QA model fails to answer correctly; iii) median time per validated model-fooling example, a single metric incorporating both method efficiency and effectiveness that provides a convenient point of comparison across the various experimental settings; and iv) downstream effectiveness, for which we evaluate the performance (by word-overlap F1 score) of a QA model trained on the data collected in each of the experimental modes on the standard SQuAD1.1 benchmark, on the AdversarialQA benchmark, and in terms of domain generalisation ability on the MRQA (Fisch et al., 2019) dev sets. Lower values are better for the time-dependent metrics; however, from the perspective of training data we consider a higher vMER to be better, guided by the performance benefits observed for adversarial over standard data collection. This is corroborated by comparison with downstream results.
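To make the relationship between the time-based metrics concrete, the combined metric can be read as the per-example annotation time scaled by the rate at which validated model-fooling examples are produced. The following is our simplified reading, not the paper's exact definition:

```python
import statistics

def median_time_per_example(times_seconds):
    # Metric (i): median annotation time per collected example.
    return statistics.median(times_seconds)

def median_time_per_fooling_example(times_seconds, num_validated_fooling):
    # Metric (iii), simplified reading: median per-example time divided
    # by the fraction of examples that are validated model-fooling,
    # i.e. the expected time to obtain one such example.
    rate = num_validated_fooling / len(times_seconds)
    return median_time_per_example(times_seconds) / rate
```

Under this reading, a setting that is slower per example can still win on the combined metric if its validated model error rate is high enough.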
4 Results

Our study allows us to perform a thorough investigation into both the efficiency and effectiveness of the different data annotation methodologies. It also allows us to build on work investigating the various differences between standard and adversarial data collection (Kaushik et al., 2021b).
4.1 Standard versus Adversarial Data Collection
The standard and adversarial data collection settings we use as baselines do not make use of GAAs, and are designed to replicate the SQuAD1.1 Rajpurkar et al. (2016) and AdversarialQA Bartolo et al. (2020) annotation setups as closely as possible. However, in contrast to AdversarialQA, our setting only provides annotators with a financial incentive to try to beat the model in the loop through the use of a bonus, and does not restrict annotators to only submitting model-fooling examples.
The results, shown in Table 1, highlight the differences between the two annotation approaches. As expected, standard data collection is more efficient in terms of the time taken per example, as there is no requirement for annotators to make any effort to try to beat a model. However, the efficiency differences are not as large as seen in settings where annotators have to submit model-fooling examples Bartolo et al. (2020). We also find considerable benefits from adversarial data collection in terms of the validated model error rate and subsequent downstream performance.
We note that the training data sizes in both these settings are relatively small, and the benefits of adversarial data collection have been shown to be more pronounced in the low-data regime, likely due to increased example diversity. We would not necessarily expect these differences to be as pronounced in larger-scale collection efforts. We also note that, while our passages are sourced from Wikipedia, there may be characteristic differences between these and the passages used in SQuAD. Furthermore, we highlight the considerably lower (i.e., better) adversarial human evaluation vMER scores achieved for our synthetically-augmented ELECTRA-Large model-in-the-loop compared to the 8.8% reported for RoBERTa-Large by Bartolo et al. (2021). We hypothesise that this is primarily due to two factors: the improved robustness of ELECTRA in comparison to RoBERTa, and more stringent example validation. For further evidence of the improved robustness of ELECTRA, see Appendix A.
4.2 Improving Standard Data Collection
We now investigate whether it might be possible to improve standard data collection practices using generative assistants – can we achieve similar performance to adversarial data collection without access to any adversarial data?
We therefore use a GAA trained on SQuAD1.1, and investigate the three sampling techniques, namely likelihood, adversarial, and uncertainty sampling. Results are shown in Table 2.
We find that using a GAA with likelihood sampling considerably improves the efficiency of the annotation process in comparison to the standard data collection baseline in Table 1, while giving comparable, if slightly improved, vMER and downstream results.
Furthermore, both the adversarial and uncertainty sampling strategies prove effective. While the time taken per example is not as impressive as for standard likelihood sampling, and is comparable to the standard data collection baseline, the vMER – an indicator of the diversity of the collected training data – is substantially improved and outperforms the adversarial data collection baseline. The downstream results are also very promising, considerably improving on the standard data collection setting. They also approach the values for the adversarial data collection baseline, although, despite the improved vMER, overall downstream performance remains better in the adversarial data collection setting. In summary, this result shows that we can encourage annotators to come up with more challenging examples, and approach the downstream performance achieved using adversarial data collection, without requiring any adversarially-collected data or an adversarial model in the loop, simply through the use of GAAs paired with an appropriate sampling strategy. While impressive, this is in line with our initial hypothesis that sampling generated prompts from regions of known model uncertainty, or prompts that we know the model finds challenging to answer, irrespective of generated sample quality, provides annotators with a better starting point for example creation.
4.3 Improving Adversarial Data Collection
Following the impressive gains observed for standard data collection, we investigate whether it is possible for GAAs to provide further improvements over adversarial data collection. Here, we experiment with GAAs trained on three different datasets: SQuAD1.1, AdversarialQA, and the combination of both. We combine each of these with the three previously discussed sampling strategies giving nine different experimental settings. Results are shown in Table 3.
We find that when annotators are incentivised to try to beat an adversarial QA model-in-the-loop, the previously seen efficiency gains are not as clear cut. In fact, annotators are slightly slower than the adversarial data collection baseline when using a SQuAD-trained GAA. When using a GAA that has been trained on adversarially-sourced questions, likelihood sampling provides efficiency gains over the baseline, however, both adversarial and uncertainty sampling (which naturally lead to more complex prompts that might be more challenging to work with) actually slow annotators down, although they do provide improved validated model error rates. In terms of downstream performance, there is no clear best setting, but the best settings outperform the adversarial data collection baseline. We also observe that a SQuAD-trained GAA with uncertainty sampling gives best performance on the less challenging evaluation sets, while an AdversarialQA-trained GAA with adversarial sampling gives best performance on the evaluation datasets collected using a more performant adversary. This is also in line with the observations made by Bartolo et al. (2020) showing a distributional shift in question type and complexity with an increasingly stronger model-in-the-loop.
The general takeaway, in terms of the ideal experimental setting from the perspective of downstream performance, is that it depends on the particular evaluation setting: GAAs trained on examples from a particular setting yield better performance when the downstream model is also evaluated in similar conditions. Another key observation is that both the validated model error rate and the time per validated model-fooling example comfortably outperform the baselines across the board, highlighting the gains in annotation effectiveness provided by incorporating GAAs in the loop.
4.4 Investigating Answer Prompting
The previously explored settings focus on investigating the effects of assisting free-text question generation using GAAs. However, the QA crowdsourcing setting also involves answer annotation, which we also explore in search of efficiency gains. Here, we explore GAAs trained on datasets with adversarially-sourced components and the same three sampling strategies as before, with the addition of providing annotators with an answer suggestion. In essence, this is similar to an answer and question validation setting, with the difference that annotators have the ability to freely modify both answer and question, or to request additional suggestions. Results are shown in Table 4.
We find that answer prompting is highly effective at improving annotation efficiency, providing gains in all six experimental settings while also improving vMER results in some cases. We also see downstream performance patterns very similar to the previous set of experiments – for performance on the more challenging BERT and RoBERTa evaluation subsets, an AdversarialQA-trained GAA with likelihood sampling gives best performance, while for performance on SQuAD and the BiDAF subset, a GAA trained on examples including SQuAD coupled with uncertainty sampling gives best performance. This consistency further supports our previous observation that, while using GAAs provides considerable gains in both the efficiency of the annotation process and effectiveness in terms of downstream results, the ideal annotation setup should be selected based on the target downstream evaluation.
5 Annotator Interaction with GAAs
While we provide annotators with instructions explaining how they can use the GAAs to aid their annotation, they are free to query the generative models as many times as they like, if at all, during annotation. We are interested in how the three main factors we explore – training data, sampling strategy, and answer prompting – affect the ways in which annotators interact with the GAAs.
Results, shown in Table 5, indicate that annotators query the GAA less frequently when shown simpler prompts, i.e., those obtained using a GAA trained on non-adversarially sourced examples, or those selected using likelihood sampling, which tends to provide higher-quality and less complex generated texts. We also find that annotators query the GAA more frequently when an answer prompt is also provided. We believe this can be attributed to the answer-and-question prompt setting being more similar to a validation workflow, allowing annotators to generate prompts until a satisfactory one is found.
Table 5: Average number of generations per example, by feature and setting.
6 Discussion and Conclusion
In this work, we introduce Generative Annotation Assistants (GAAs) and investigate their potential to aid crowdworkers with creating more effective training data more efficiently. We perform a thorough analysis of how GAAs can be used for improving QA dataset annotation in different settings, including different generative model training data, sampling strategies, and whether to also provide annotators with answer suggestions.
We find that GAAs are beneficial in both the standard and adversarial data collection settings. In the standard data collection setting, and under the assumption of no access to adversarially-collected data, GAAs with prompts sampled based on likelihood provide annotation speed-ups, while prompts sampled by adversarial performance or uncertainty metrics provide benefits to both the model error rates on the collected data and subsequent downstream QA performance. We find that we can approach the downstream performance of adversarial data collection using GAAs without involving an adversarial model in the loop.
For adversarial data collection, we demonstrate improved effectiveness of the annotation process over the non-GAA baseline, although this comes at the cost of reduced annotation efficiency. We show that additionally aiding annotators with answer prompts boosts data collection efficiency even beyond that achieved for standard data collection, while retaining downstream performance. We find that the ideal annotation setting differs for different intended evaluations, with an uncertainty-sampled GAA trained on data that was not entirely adversarially-collected providing best performance on simpler questions, while an adversarially-sampled GAA trained on adversarially-collected data provides best downstream performance on more challenging evaluation sets. Overall, we see annotation speed-ups over a baseline of 28.6% for standard and 28.4% for adversarial data collection. We also see a 3.75x improvement in vMER for adversarial data collection, along with best downstream performance gains of 0.6 F1 on the SQuAD dev set, 0.7 F1 on the BiDAF subset, 4.5 F1 on the BERT subset, and 3.8 F1 on the RoBERTa subset. Furthermore, we see benefits in domain generalisation for standard data collection, and show that annotators interact with the GAA more frequently when it has been trained on adversarially-collected data, is sampled from based on adversarial or uncertainty feedback, and also provides answer prompts.
While our analysis is limited by the size of the collected data, we believe that GAAs can help drive further innovation into improved data collection methodologies based on these observations. We hope that our analysis of various aspects of GAA incorporation into the annotation pipeline can help inform future work exploring broader aspects of GAA use, such as for other NLP tasks or for larger scale annotation efforts.
7 Ethical Considerations
We collect training datasets as part of the analysis in this work. The passages are sourced from Wikipedia through KILT. As described in the main text, our incentive structure is designed to ensure that crowdworkers are fairly compensated. Our datasets focus on the English language. As this data is not collected for the purpose of designing NLP applications, we do not foresee any risks associated with its use.
Acknowledgements

The authors would like to thank the Dynabench team for their feedback and continuous support.
- Alberti et al. (2019) Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019. Synthetic QA corpora generation with roundtrip consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6168–6173, Florence, Italy. Association for Computational Linguistics.
- Bartolo et al. (2020) Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension. Transactions of the Association for Computational Linguistics, 8:662–678.
- Bartolo et al. (2021) Max Bartolo, Tristan Thrush, Robin Jia, Sebastian Riedel, Pontus Stenetorp, and Douwe Kiela. 2021. Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation.
- Bowman and Dahl (2021) Samuel R Bowman and George E Dahl. 2021. What will it take to fix benchmarking in natural language understanding? arXiv preprint arXiv:2104.02145.
- Bowman et al. (2020) Samuel R. Bowman, Jennimaria Palomaki, Livio Baldini Soares, and Emily Pitler. 2020. New protocols and negative results for textual entailment data collection. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8203–8214, Online. Association for Computational Linguistics.
- Chen et al. (2016) Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2358–2367, Berlin, Germany. Association for Computational Linguistics.
- Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.
- Dinan et al. (2019) Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4537–4546, Hong Kong, China. Association for Computational Linguistics.
- Du and Cardie (2018) Xinya Du and Claire Cardie. 2018. Harvesting paragraph-level question-answer pairs from Wikipedia. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1907–1917, Melbourne, Australia. Association for Computational Linguistics.
- Du et al. (2017) Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1342–1352, Vancouver, Canada. Association for Computational Linguistics.
- Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.
- Ettinger et al. (2017) Allyson Ettinger, Sudha Rao, Hal Daumé III, and Emily M. Bender. 2017. Towards linguistically generalizable NLP systems: A workshop and shared task. CoRR, abs/1711.01505.
- Fisch et al. (2019) Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering (MRQA) at EMNLP.
- Geva et al. (2019) Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166, Hong Kong, China. Association for Computational Linguistics.
- Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.
- Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.
- Kaushik et al. (2021) Divyansh Kaushik, Douwe Kiela, Zachary C. Lipton, and Wen-tau Yih. 2021. On the efficacy of adversarial data collection for question answering: Results from a large-scale randomized study. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6618–6633, Online. Association for Computational Linguistics.
- Kiela et al. (2021) Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. 2021. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4110–4124, Online. Association for Computational Linguistics.
- Lewis and Gale (1994) David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In SIGIR, pages 3–12. ACM/Springer.
- Lewis and Fan (2019) Mike Lewis and Angela Fan. 2019. Generative question answering: Learning to answer the whole question. In International Conference on Learning Representations.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- Lewis et al. (2021) Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. 2021. PAQ: 65 million probably-asked questions and what you can do with them. arXiv preprint arXiv:2102.07033.
- McCoy et al. (2019) Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.
- Nie et al. (2020) Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.
- Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
- Petroni et al. (2021) Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2021. KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2523–2544, Online. Association for Computational Linguistics.
- Phang et al. (2021) Jason Phang, Angelica Chen, William Huang, and Samuel R Bowman. 2021. Adversarially constructed evaluation sets are more challenging, but may not be fair. arXiv preprint arXiv:2111.08181.
- Potts et al. (2021) Christopher Potts, Zhengxuan Wu, Atticus Geiger, and Douwe Kiela. 2021. DynaSent: A dynamic benchmark for sentiment analysis. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2388–2404, Online. Association for Computational Linguistics.
- Puri et al. (2020) Raul Puri, Ryan Spring, Mohammad Shoeybi, Mostofa Patwary, and Bryan Catanzaro. 2020. Training question answering models from synthetic data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5811–5826, Online. Association for Computational Linguistics.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
- Ross et al. (2021) Alexis Ross, Tongshuang Wu, Hao Peng, Matthew E. Peters, and Matt Gardner. 2021. Tailor: Generating and perturbing text with semantic controls. arXiv preprint arXiv:2107.07150.
- Schwartz et al. (2017) Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 15–25, Vancouver, Canada. Association for Computational Linguistics.
- Settles (2009) Burr Settles. 2009. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
- Tsuruoka et al. (2008) Yoshimasa Tsuruoka, Jun’ichi Tsujii, and Sophia Ananiadou. 2008. Accelerating the annotation of sparse named entities by dynamic sentence selection. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, pages 30–37, Columbus, Ohio. Association for Computational Linguistics.
- Vania et al. (2021) Clara Vania, Phu Mon Htut, William Huang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, and Samuel R. Bowman. 2021. Comparing test sets with item response theory. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1141–1158, Online. Association for Computational Linguistics.
- Wallace et al. (2021) Eric Wallace, Adina Williams, Robin Jia, and Douwe Kiela. 2021. Analyzing dynamic adversarial training data in the limit. arXiv preprint arXiv:2110.08514.
- Weissenborn et al. (2017) Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 271–280, Vancouver, Canada. Association for Computational Linguistics.
- Wu et al. (2021) Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel Weld. 2021. Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6707–6723, Online. Association for Computational Linguistics.
- Yang et al. (2020) Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. 2020. Generative data augmentation for commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1008–1025, Online. Association for Computational Linguistics.
- Yang et al. (2017) Zhilin Yang, Saizheng Zhang, Jack Urbanek, Will Feng, Alexander H Miller, Arthur Szlam, Douwe Kiela, and Jason Weston. 2017. Mastering the dungeon: Grounded language learning by mechanical turker descent. arXiv preprint arXiv:1711.07950.
- Yogatama et al. (2019) Dani Yogatama, Cyprien de Masson d'Autume, Jerome T. Connor, Tomás Kociský, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.
- Zhao et al. (2018) Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. 2018. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3901–3910, Brussels, Belgium. Association for Computational Linguistics.
Appendix A Adversarial Robustness of ELECTRA and RoBERTa
| Training Data | SQuAD | AddSent | AddOneSent |
| --- | --- | --- | --- |
| SQuAD + AdversarialQA | 93.3 | 80.1 | 85.2 |
| SQuAD + AdversarialQA | 92.5 | 83.4 | 86.7 |
| SQuAD + AdversarialQA + SynQA | 94.8 | 86.0 | 89.0 |
| SQuAD + AdversarialQA + SynQAExt | 94.9 | 87.1 | 90.1 |
| SQuAD + AdversarialQA | 94.7 | 86.1 | 89.9 |
| SQuAD + AdversarialQA + SynQA | 94.8 | 85.7 | 89.2 |
Table 6 shows adversarial robustness performance evaluated on the AddSent and AddOneSent evaluation datasets introduced by Jia and Liang (2017). We observe that even when trained only on SQuAD1.1, ELECTRA performs considerably better than RoBERTa in this setting, suggesting that it is substantially more robust "out of the box".