Riposte! A Large Corpus of Counter-Arguments

by Paul Reisert, et al.
Tohoku University

Constructive feedback is an effective method for improving critical thinking skills. Counter-arguments (CAs), one form of constructive feedback, have been proven to be useful for critical thinking skills. However, little work has been done for constructing a large-scale corpus of them which can drive research on automatic generation of CAs for fallacious micro-level arguments (i.e. a single claim and premise pair). In this work, we cast providing constructive feedback as a natural language processing task and create Riposte!, a corpus of CAs, towards this goal. Produced by crowdworkers, Riposte! contains over 18k CAs. We instruct workers to first identify common fallacy types and produce a CA which identifies the fallacy. We analyze how workers create CAs and construct a baseline model based on our analysis.




1 Introduction

Critical thinking is a crucial skill necessary for valid reasoning, especially for students in a pedagogical context. Towards improving critical thinking skills for students, educators have evaluated the contents of a work and provided constructive feedback (i.e. criticism) to the student. Although such methods are effective, they require educators to articulately evaluate the contents of an essay, which can be time-consuming and varies depending on an educator’s critical thinking skills.

Figure 1: CAs in Riposte! produced by crowdworkers. The fallacy type selected by a worker is shown in parentheses.
| Fallacy Type | Definition | Template |
| --- | --- | --- |
| Begging the Question | The truth of the premise is already assumed by the claim. | “If [something] is assumed to be true, then [something else] is already assumed to be true”. |
| Hasty Generalization | Someone assumes something is generally always the case based on a few instances. | “It’s too hasty to assume that [text]”. |
| Questionable Cause | The cause of an effect is questionable. | “There is a questionable cause in the argument because [questionable cause] does/will not cause [effect]”. |
| Red Herring | Someone reverts attention away from the original claim by changing the topic. | “The topic being discussed is [first topic], but it is being changed to [second topic]”. |

Table 1: Definitions and templates of the fallacy types used in our experiments.

In the field of educational research, the usefulness of identifying fallacies and counter-arguments, henceforth CAs, as constructive feedback has been emphasized de Lima Alves (2008); Oktavia et al. (2014); Indah and Kusuma (2015); Song and Ferretti (2013), as both can help writers produce high-quality arguments while simultaneously improving their critical thinking skills. Figure 1 shows an example of an argument with a fallacy (i.e. an error in the logical reasoning of the argument) and its CAs (i.e. attacks on the argument). In the field of NLP, previous works have addressed fallacy identification Habernal et al. (2018a), CA retrieval Wachsmuth et al. (2017), and CA generation for macro-level arguments Hua and Wang (2018). Essay criteria such as thesis clarity Persing and Ng (2013), argument strength Persing and Ng (2015), and stance Persing and Ng (2016) have also been evaluated. However, in the pedagogical context, macro-level arguments (e.g., an essay) may consist of several micro-level arguments (i.e. one claim/premise pair) that can each contain multiple fallacies. To bridge this gap, we create CAs for micro-level arguments, which can be useful for automatic constructive feedback generation.

Several challenges exist for creating a corpus of CAs for constructive feedback. First, the corpus must contain a variety of different topics and arguments to both train and evaluate a model for unseen topics. Second, an argument can have many different fallacies which are not easily identifiable Oktavia et al. (2014); Indah and Kusuma (2015); El Khoiri and Widiati (2017). Third, producing CAs is costly and time-consuming.

In this work, we design a task for automatic constructive feedback and create Riposte!, a large-scale corpus of CAs via crowdsourcing. Workers are first instructed to identify common fallacy types (begging the question, hasty generalization, questionable cause, and red herring) in educational research de Lima Alves (2008); Oktavia et al. (2014); Indah and Kusuma (2015); Song and Ferretti (2013) and create a CA for micro-level arguments. In total, we collect 18,887 CAs (see Figure 1 for examples of CAs in Riposte!). We then cast automatic constructive feedback as a text generation task and create a baseline model.

| Criteria |  |  |  |  | Total |
| --- | --- | --- | --- | --- | --- |
| Unsure | 2,043 | 315 | 2,136 | 1,879 | 6,373 |
| CAs (FS) | 3,365 | 3,818 | 2,121 | 1,772 | 11,076 |
| CAs (O) | 907 | 2,182 | 2,058 | 2,664 | 7,811 |
| CAs (total) | 4,272 | 6,000 | 4,179 | 4,436 | 18,887 |

Table 2: Full statistics of the Riposte! corpus, where FS represents fallacy-specific CAs and O represents other.

2 The Riposte! corpus

In this section, we determine if training data can easily be created. To the best of our knowledge, this is the first research that addresses corpus construction for automatic constructive feedback.

2.1 Counter-arguments as an NLP task

When designing a task for automatic constructive feedback, one must take into account real-world situations. In the pedagogical context, educators can choose the same topic for students annually. With automatic constructive feedback, educators may choose to use a pretrained, supervised model for a single topic with editable background knowledge (i.e., educators can choose which knowledge is necessary to automatically construct feedback). On the other hand, educators may choose a new topic each year, and thus a conditioned model for multiple topics may also be considered. The input to a model should be a topic and several claim and premise argument pairs, and the output would be a set of CAs useful for improving the argument.

2.2 Existing corpus of arguments

When training a model for constructive feedback, the data should consist of many CAs for a wide variety of topics. We use the Argument Reasoning Comprehension (ARC) dataset Habernal et al. (2018b), a corpus of 1,263 unique topic-claim-premise pairs (172 unique topics and 264 unique claims). We assume the arguments in ARC contain many fallacies because they were created by non-expert crowdworkers (i.e., workers are not experts in the field of argumentation).

2.3 Riposte! creation

For creating Riposte!, we use the crowdsourcing platform Amazon Mechanical Turk.

Data Collection

One challenge for collecting training data for automatic constructive feedback is that the CAs should be useful for improving an argument. To assist with collecting such CAs, we adopt the protocol of Reisert et al. (2019) for collecting CAs using crowdsourcing, with several modifications for our data collection (see Appendix). We create 4 separate crowdsourcing tasks (i.e., one for each fallacy type). For each of the 1,263 arguments in ARC, we ask 5 workers to produce a CA. For each fallacy type, we assist workers by providing them with a “fill-in-the-blank” template, where workers are instructed to fill in text boxes for a given pattern. The fallacy types and templates are shown in Table 1.

2.4 Riposte! statistics

The statistics of Riposte! are shown in Table 2. Of the 18,887 CAs, 11,076 are fallacy-specific (i.e. workers first identified a fallacy and then created the CA), and 7,811 were created when a worker did not believe the specified fallacy existed in the argument. A further 6,373 instances were labeled as unsure (i.e. the worker was unsure about the fallacy type).
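The counts in Table 2 can be cross-checked against the collection setup in Section 2.3 with simple arithmetic: 1,263 arguments, 5 workers per argument, and 4 fallacy tasks give 25,260 responses, which should equal the CAs plus the unsure instances. The sketch below is only an illustrative check; the variable names are ours, and the per-task lists follow the column order of Table 2.

```python
# Counts copied from Table 2 and Section 2.3; variable names are illustrative.
N_ARGUMENTS = 1263
WORKERS_PER_ARGUMENT = 5
N_FALLACY_TASKS = 4

unsure = [2043, 315, 2136, 1879]
fallacy_specific = [3365, 3818, 2121, 1772]
other = [907, 2182, 2058, 2664]

# Each fallacy task collects one response per (argument, worker) pair.
responses_per_task = N_ARGUMENTS * WORKERS_PER_ARGUMENT
total_responses = responses_per_task * N_FALLACY_TASKS

# Every response is either unsure, a fallacy-specific CA, or an "other" CA.
per_task_totals = [u + f + o for u, f, o in zip(unsure, fallacy_specific, other)]
total_cas = sum(fallacy_specific) + sum(other)

print(responses_per_task, per_task_totals)
print(total_cas, sum(unsure), total_responses)
```

Each task column sums to the same number of responses, which is consistent with every argument receiving five responses in every fallacy task.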

3 How did workers create CAs?

| Criteria |  |  |  |  | Total |
| --- | --- | --- | --- | --- | --- |
| Score | 0.61 | 0.17 | 0.35 | 0.36 | 0.24 |

Table 3: The average Jaccard’s similarity scores between CAs for a single argument for each fallacy type.

When creating training data for automatic constructive feedback, CAs should be useful and diverse. We determine how workers create CAs by calculating the similarity between i) a CA and argument and ii) CAs for single arguments.

How similar is one CA to the premise-claim?

Figure 2: BLEU scores calculated between each worker-produced CA and the original argument (claim and premise); panel (e) shows the overall distribution. The results indicate that workers used keywords directly from the argument.

In order to determine how annotators created their CAs, we calculate the BLEU Papineni et al. (2002) score of each CA and the argument (e.g., premise/claim). The distribution in Figure 2 indicates that workers copied keywords directly from the original argument in some cases.
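As a rough illustration of this measurement, the sketch below computes an unsmoothed sentence-level BLEU score in plain Python. It is not the scoring script used for Figure 2: the function names are ours, the argument is treated as the reference and the CA as the hypothesis (an assumption), and scores are on a 0 to 1 scale rather than 0 to 100.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed BLEU of `hypothesis` against a single `reference` (token lists)."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        total = sum(hyp.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision gives BLEU 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hypothesis) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(hypothesis))
    return brevity * math.exp(sum(log_precisions) / max_n)


argument = "the u.s. should lift sanctions with cuba because the embargo hurts our own economy".split()
ca = "the embargo hurts our own economy".split()
print(sentence_bleu(argument, ca))
```

A CA that copies a long span from the argument receives a non-zero score, while a CA that shares no four-word span with the argument scores zero under this unsmoothed variant.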

How similar are the CAs across annotators?

One design decision when building Riposte! was to use several annotators per argument so that we could collect a wide variety of CAs for a single argument, regardless of the fallacy type. We first calculate the similarity of the CAs across annotators for a single argument. We tokenize the corpus using spaCy and remove stop words and punctuation. We then calculate the average Jaccard similarity score over all pairs of CAs per unique argument and average over all arguments. The results (see Table 3) indicate that the CAs are diverse.
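The pairwise computation described above can be sketched as follows. This is a minimal illustration, not the original script: whitespace tokenization and the small stop-word set stand in for spaCy (an assumption made to keep the example self-contained), and all function names are ours.

```python
import string
from itertools import combinations

# Placeholder stop-word list; the paper uses spaCy's tokenizer and stop words.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "that", "it", "be", "in", "and"}


def content_tokens(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    tokens = (t.strip(string.punctuation).lower() for t in text.split())
    return {t for t in tokens if t and t not in STOP_WORDS}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(cas):
    """Average Jaccard similarity over all pairs of CAs for one argument."""
    pairs = list(combinations([content_tokens(c) for c in cas], 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def corpus_score(cas_by_argument):
    """Average the per-argument scores over all arguments with at least two CAs."""
    scores = [mean_pairwise_jaccard(c) for c in cas_by_argument.values() if len(c) > 1]
    return sum(scores) / len(scores) if scores else 0.0
```

Lower scores mean that workers chose different content words for the same argument, which is the sense in which the CAs are called diverse.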

4 Experiment

4.1 Experimental design

In Section 3, we observed that workers copied keywords from the argument when creating a CA. Based on this observation, we experiment with different input settings to the model to better understand which parts of the argument annotators used to create their CAs (e.g., topic (T) only, premise (P) only, claim (C) only, and so forth). We cast the task of automatic constructive feedback as a generation task and experiment with such settings.

Since both new and existing essay topics can be used and introduced by educators, we consider two possible settings: i) in-domain (i.e. topics are shared between splits) and ii) out-of-domain (i.e. topics are not shared).

For our generation model, we use gold fallacy type information. This allows us to understand how well the model can generate CAs when correct fallacy types are predicted. (We built an LSTM-encoder multi-label classifier, which achieved an F1 score of 36.02% on 4-way classification, indicating that more sophisticated features such as background knowledge and reasoning are necessary.)

4.2 Data preparation

We filter out all unsure instances. We use majority vote for selecting CAs and their fallacy types. We split the data into 80% train, 10% test, and 10% dev. In each setting, we ensure that no unique claim-premise pairs are shared across splits.
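One way to realize both settings is a group-wise split, sketched below under our own assumptions: instances are dictionaries with hypothetical `topic`, `claim`, and `premise` keys, and the 80/10/10 ratio is applied to groups rather than to individual CAs, so instance-level proportions are only approximate. Grouping by claim-premise pair gives the in-domain setting, and grouping by topic gives the out-of-domain setting.

```python
import random


def group_split(instances, key, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split instances into train/test/dev so that no group spans two splits."""
    groups = sorted({key(x) for x in instances})
    random.Random(seed).shuffle(groups)
    n_train = int(ratios[0] * len(groups))
    n_test = int(ratios[1] * len(groups))
    assignment = {}
    for i, group in enumerate(groups):
        if i < n_train:
            assignment[group] = "train"
        elif i < n_train + n_test:
            assignment[group] = "test"
        else:
            assignment[group] = "dev"
    splits = {"train": [], "test": [], "dev": []}
    for x in instances:
        splits[assignment[key(x)]].append(x)
    return splits


# Toy data: 10 topics, 2 claim-premise pairs per topic, 3 CAs per pair.
data = [
    {"topic": f"t{t}", "claim": f"c{t}", "premise": f"p{t}-{p}", "ca": f"ca{k}"}
    for t in range(10) for p in range(2) for k in range(3)
]
in_domain = group_split(data, key=lambda x: (x["claim"], x["premise"]))
out_of_domain = group_split(data, key=lambda x: x["topic"])
```

In the in-domain split a topic may appear in several splits while each claim-premise pair stays in exactly one; in the out-of-domain split even the topics are disjoint.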

For each experiment, we tokenize using spaCy and lowercase all tokens. For hasty generalization CAs, we replace the template with a special token (i.e. hg). For all other CAs, we discard the original template and add a special token between slot-fillers. This allows our baseline model to focus more on the content words found in the original argument.
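One possible reading of this preprocessing step is sketched below. Only the `hg` token is named in the text; the separator token `<sep>`, the fallacy-type labels, and the function name are hypothetical, and whitespace normalization stands in for spaCy tokenization.

```python
HG_TOKEN = "hg"      # special token named in the text
SLOT_SEP = "<sep>"   # hypothetical separator; the source does not name this token


def build_target(fallacy_type, slot_fillers):
    """Turn a worker's template slot-fillers into a target sequence for training."""
    # Lowercase and collapse whitespace (a stand-in for spaCy tokenization).
    fillers = [" ".join(f.lower().split()) for f in slot_fillers]
    if fallacy_type == "hasty_generalization":
        # Single-slot template: the template text is replaced by one token.
        return f"{HG_TOKEN} {fillers[0]}"
    # Multi-slot templates: drop the template text, keep fillers with a separator.
    return f" {SLOT_SEP} ".join(fillers)


print(build_target("hasty_generalization", ["All  students cheat"]))
print(build_target("red_herring", ["the U.S.", "the embargo"]))
```

Dropping the fixed template words leaves only the worker-written content, so the model is not rewarded for reproducing boilerplate.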

4.3 Baselines

Based on our observations in Section 4.1, we create a baseline for determining which parts of an argument annotators used to create CAs and how well a model can generate a CA.

Simple Overlap (SO)

As a baseline, we calculate the simple BLEU overlap between the input in each setting and the CA. To allow a direct comparison with our seq2seq baseline model, we calculate the BLEU scores on the data preprocessed for that model, including its unknown-word tokens.


We preprocess the data and train our seq2seq model using fairseq Ott et al. (2019). We use pre-trained word embeddings (300-dimensional GloVe embeddings Pennington et al. (2014)), which are useful for generation tasks Qi et al. (2018). We create two models (seq2seq-i and seq2seq-o) for in-domain and out-of-domain settings, respectively. (For seq2seq-i and seq2seq-o, we use the best hyperparameters from seq2seq-i (P+C) and seq2seq-o (P+C) across all settings, respectively.)

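| Baseline | T | C | P | T+P+C | T+C | T+P | P+C |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SO | 3.98 | 6.37 | 15.59 | 13.56 | 10.69 | 13.76 | 18.16 |
| seq2seq-i | 12.28 | 12.31 | 5.96 | 14.54 | 12.63 | 13.37 | 16.57 |
| seq2seq-o | 1.31 | 1.05 | 1.49 | 4.78 | 1.60 | 1.53 | 5.53 |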

Table 4: BLEU scores of our baselines using gold fallacy type for topic (T), premise (P), and claim (C).

4.4 Evaluation

| Attribute | Scores (GO) | α (GO) | Scores (GE) | α (GE) |
| --- | --- | --- | --- | --- |
| Strength | 2.3 | 0.20 | 1.98 | 0.20 |
| Persuasiveness | 2.26 | 0.71 | 1.94 | 0.15 |
| Relevance | 2.74 | 0.20 | 2.84 | 0.72 |

Table 5: Mean scores and agreement (Krippendorff’s α) scores for gold (GO) and generated (GE) CAs.

We evaluate the results of our baselines using BLEU (see Table 4). Our SO results indicate that workers mainly used the premise and claim when creating CAs. We observe that seq2seq-o’s performance is low, indicating a simple model is not sufficient when unknown topics are introduced.

For evaluation, we would also like to compare the quality of gold CAs against generated CAs. We conduct an annotation study using AMT and evaluate CA quality along 3 dimensions: Strength, Persuasiveness, and Relevance. (We use the guidelines of Carlile et al. (2018), slightly modified for CAs; please see the Appendix for our criteria.) In total, we show 50 arguments and their gold/generated CAs, where each argument is annotated by 3 workers. (We use 50 generated CAs from seq2seq-i (P+C).) The results are shown in Table 5. (We convert from a 5-point to a 3-point scale for score calculation.) Workers rated generated CAs as slightly more relevant than gold CAs (2.84 vs. 2.74), but weaker (1.98 vs. 2.3) and less persuasive (1.94 vs. 2.26). Examples of the generated output for our best model (seq2seq-i P+C) are shown in Table 6.

| Source | Reference | Hypothesis |
| --- | --- | --- |
| home - schoolers should play for high school teams because all children should be able to participate in sports . | all children are to play in sports even home - schoolers will be playing sports . | all children should be able to participate in sports home - schoolers should play for high school teams . |
| the u.s . should lift sanctions with cuba because the embargo hurts our own economy . | the u.s . the embargo . | us sanctions our own economy . |

Table 6: Examples of output from seq2seq-i (P+C).

5 Conclusion and future work

In this work, we construct Riposte!, a large corpus of 18,887 crowdworker-produced CAs. Our analysis on Riposte! reveals that non-expert crowdworkers can produce reasonably diverse CAs. We cast automatic constructive feedback as a text generation task and create a baseline model.

In our future work, we will explore injecting background knowledge and reasoning into our model to generate CAs for unknown topics and provide detailed information to students about how to improve their original argumentation.


  • A. de Lima Alves (2008) Constructive feedback. a strategy to enhance learning. Medicina 68 (1), pp. 88–92. Cited by: §1, §1.
  • N. El Khoiri and U. Widiati (2017) Logical fallacies in indonesian efl learners’ argumentative writing: students’ perspectives. Dinamika Ilmu 17 (1), pp. 71–81. Cited by: §1.
  • I. Habernal, P. Pauli, and I. Gurevych (2018a) Adapting Serious Game for Fallacious Argumentation to German: Pitfalls, Insights, and Best Practices. In Proceedings of the Eleventh International Conference on LREC, Cited by: §1.
  • I. Habernal, H. Wachsmuth, I. Gurevych, and B. Stein (2018b) The argument reasoning comprehension task: identification and reconstruction of implicit warrants. In Proceedings of the 2018 Conference of NAACL: HLT, Volume 1 (Long Papers), pp. 1930–1940. Cited by: §2.2.
  • X. Hua and L. Wang (2018) Neural argument generation augmented with externally retrieved evidence. arXiv preprint arXiv:1805.10254. Cited by: §1.
  • R. N. Indah and A. W. Kusuma (2015) Fallacies in english department students’ claims: a rhetorical analysis of critical thinking. Jurnal Pendidikan Humaniora 3 (4), pp. 295–304. Cited by: §1, §1, §1.
  • W. Oktavia, A. Yasin, et al. (2014) An analysis of students’ argumentative elements and fallacies in students’ discussion essays. English Language Teaching 2 (3). Cited by: §1, §1, §1.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, Cited by: §4.3.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Meeting of ACL, pp. 311–318. Cited by: §3.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §4.3.
  • I. Persing and V. Ng (2013) Modeling thesis clarity in student essays. In Proceedings of the 51st Annual Meeting of ACL (Volume 1: Long Papers), Vol. 1, pp. 260–269. Cited by: §1.
  • I. Persing and V. Ng (2015) Modeling argument strength in student essays. In Proceedings of the 53rd Annual Meeting of ACL and the 7th IJCNLP (Volume 1: Long Papers), pp. 543–552. Cited by: §1.
  • I. Persing and V. Ng (2016) Modeling stance in student essays. In Proceedings of the 54th Annual Meeting of ACL (Volume 1: Long Papers), Vol. 1, pp. 2174–2184. Cited by: §1.
  • Y. Qi, D. S. Sachan, M. Felix, S. J. Padmanabhan, and G. Neubig (2018) When and why are pre-trained word embeddings useful for neural machine translation? CoRR abs/1804.06323. Cited by: §4.3.
  • Y. Song and R. P. Ferretti (2013) Teaching critical questions about argumentation through the revising process: effects of strategy instruction on college students’ argumentative essays. Reading and Writing 26 (1), pp. 67–90. Cited by: §1, §1.
  • H. Wachsmuth, N. Naderi, Y. Hou, Y. Bilu, V. Prabhakaran, T. A. Thijm, G. Hirst, and B. Stein (2017) Computational argumentation quality assessment in natural language. In Proceedings of the 15th Conference of the EACL: Volume 1, Long Papers, pp. 176–187. Cited by: §1.

Appendix A Annotation Interface and Guidelines

We show the annotation interface used in our full-fledged crowdsourcing experiment in Figure 3. The conditions shown to workers for 3 fallacy types are shown in Figure 4. The interface for is shown in Figure 5.

Figure 3: Interface shown to crowdworkers for our hasty generalization full-fledged experiment.

The guidelines shown to workers are shown in Figure 6.

Figure 4: Conditions for rejecting workers’ responses, shown to workers for three of the fallacy-type experiments.
Figure 5: Interface for .
Figure 6: Guidelines shown to crowdworkers.
Figure 7: CAs produced for a single argument (hasty generalization) with perfect annotator agreement. All 5 workers agreed the fallacy existed.

Appendix B Crowdsourcing settings

For our full-fledged experiment, we use the following settings: workers were required to have at least 100 approved Human Intelligence Tasks (HITs) and a HIT Approval Rate of at least 96%. For each HIT, workers were rewarded with $0.20 (in the case of hasty generalization, workers were rewarded with $0.10). An example of the guidelines for one fallacy type (questionable cause) is shown in Figure 6. If workers selected no or unsure, they were required to provide a CA or reason, respectively. We inform workers that their work will be rejected if one or more of the following conditions is met. The CA is i) blank, ii) not a sentence, iii) a direct copy-paste of the original argument in the text box or copy-paste of the guidelines, or iv) not written in English. We manually reject responses that meet any of these criteria.

Appendix C Model Hyperparameters

| hyperparameter | values |
| --- | --- |
| dropout | 0.1, 0.2, 0.3, 0.4, 0.5 |
| encoder/decoder layers | 1, 2, 3 |
| hidden layers | 128, 256, 512, 1024 |
| learning rate | 0.1, 0.01, 0.001 |
| optimizers | adam, sgd |

Table 7: Hyperparameters used in our experiments for seq2seq-i (P+C) and seq2seq-o (P+C).

For seq2seq-i (P+C) and seq2seq-o (P+C), we experiment with the hyperparameters shown in Table 7. The best hyperparameters for our experiment are as follows. For seq2seq-i, we use the following settings. The dropout is set to 0.4. We use SGD as an optimizer with a learning rate of 0.01. The number of encoder/decoder layers is set to 1, and the encoder/decoder hidden size is 256.

For seq2seq-o, we use the following settings. The dropout is set to 0.2. We use SGD as an optimizer with a learning rate of 0.01. The number of encoder/decoder layers is set to 1, and the encoder/decoder hidden size is 256.
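To give a sense of the size of this search space, the sketch below enumerates every combination of the values in Table 7. Whether the original experiments ran an exhaustive grid is not stated, so this is only an illustration; the dictionary keys and function name are ours.

```python
from itertools import product

# Values copied from Table 7; key names are illustrative.
SEARCH_SPACE = {
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
    "layers": [1, 2, 3],
    "hidden_size": [128, 256, 512, 1024],
    "learning_rate": [0.1, 0.01, 0.001],
    "optimizer": ["adam", "sgd"],
}


def grid(space):
    """Yield one configuration dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(grid(SEARCH_SPACE))

# Best settings reported in the text for the two models.
BEST_IN_DOMAIN = {"dropout": 0.4, "layers": 1, "hidden_size": 256,
                  "learning_rate": 0.01, "optimizer": "sgd"}
BEST_OUT_OF_DOMAIN = {"dropout": 0.2, "layers": 1, "hidden_size": 256,
                      "learning_rate": 0.01, "optimizer": "sgd"}

print(len(configs))
```

An exhaustive search over these values would cover 5 × 3 × 4 × 3 × 2 = 360 configurations per model, and the two reported best settings differ only in dropout.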

| Attribute | Description (Strong) |
| --- | --- |
| Relevant | Anyone can see how the counter-argument attacks the argument. The relationship between the two components is either explicit or extremely easy to infer. The relationship is thoroughly explained in the text because the two components contain the same words or exhibit coreference. |
| Persuasive | A very strong, clear counter-argument. It would persuade most readers and is devoid of errors that might detract from its strength or make it difficult to understand. |
| Strength | A very strong counter-argument with no fallacies. Not much can be improved in order to attack the argument better. |

Table 8: Guidelines for annotating the quality of the CAs in our corpus, where the description is shown for the highest score (5). Each dimension is scored from 1 to 5. Annotators are shown the criteria for the highest and lowest scores only.

Appendix D Annotation Criteria and Examples

The guidelines shown to crowdworkers when annotating the quality of CAs are shown in Table 8. We show the description for strong dimensions (i.e., score of 5).

Examples of CAs for one argument are shown in Figure 7.