Argument Generation with Retrieval, Planning, and Realization

by   Xinyu Hua, et al.
Northeastern University

Automatic argument generation is an appealing but challenging task. In this paper, we study the specific problem of counter-argument generation, and present a novel framework, CANDELA. It consists of a powerful retrieval system and a novel two-step generation model, where a text planning decoder first decides on the main talking points and a proper language style for each sentence, then a content realization decoder reflects the decisions and constructs an informative paragraph-level argument. Furthermore, our generation model is empowered by a retrieval system indexed with 12 million articles collected from Wikipedia and popular English news media, which provides access to high-quality content with diversity. Automatic evaluation on a large-scale dataset collected from Reddit shows that our model yields significantly higher BLEU, ROUGE, and METEOR scores than the state-of-the-art and non-trivial comparisons. Human evaluation further indicates that our system arguments are more appropriate for refutation and richer in content.






1 Introduction

Counter-argument generation aims to produce arguments of a different stance, in order to refute the given proposition on a controversial issue Toulmin (1958); Damer (2012). A system that automatically constructs counter-arguments can effectively present alternative perspectives along with associated evidence and reasoning, and thus facilitate a more comprehensive understanding of complicated problems when controversy arises.

Figure 1: Sample counter-argument for a pro-death penalty statement from Reddit /r/ChangeMyView. The argument consists of a sequence of propositions that synthesize opinions and facts from diverse sources. Sentences in italics contain stylistic language used for argumentative purposes.

Nevertheless, constructing persuasive arguments is a challenging task, as it requires an appropriate combination of credible evidence, rigorous logical reasoning, and sometimes emotional appeal Walton et al. (2008); Wachsmuth et al. (2017a); Wang et al. (2017). A sample counter-argument for a pro-death penalty post is shown in Figure 1. As can be seen, a sequence of talking points on the “imperfect justice system” is presented: it starts with the fundamental concept, then follows up with a more specific evaluative claim and a supporting fact. Although retrieval-based methods have been investigated to construct counter-arguments Sato et al. (2015); Reisert et al. (2015), they typically produce a collection of sentences from disparate sources, and thus fall short of coherence and conciseness. Moreover, humans often deploy stylistic language with specific argumentative functions to promote persuasiveness, such as making a concessive move (e.g., “In theory I agree with you”). This further requires the generation system to have better control of language style.

Our goal is to design a counter-argument generation system to address the above challenges and produce paragraph-level arguments with rich-yet-coherent content. To this end, we present CANDELA, a novel framework to generate Counter-Arguments with two-step Neural Decoders and ExternaL knowledge Augmentation (code and data are publicly available). Concretely, CANDELA has three major distinct features:

First, it is equipped with two decoders: one for text planning—selecting talking points to cover for each sentence to be generated, the other for content realization—producing a fluent argument to reflect decisions made by the text planner. This enables our model to produce longer arguments with richer information.

Furthermore, multiple objectives are designed for our text planning decoder to both handle content selection and ordering, and select a proper argumentative discourse function of a desired language style for each sentence generation.

Lastly, the input to our argument generation model is augmented with keyphrases and passages retrieved from a large-scale search engine, which indexes 12 million articles from Wikipedia and four popular English news media of varying ideological leanings. This ensures access to reliable evidence, high-quality reasoning, and diverse opinions from different sources, as opposed to recent work that mostly considers a single origin, such as Wikipedia Rinott et al. (2015) or online debate portals Wachsmuth et al. (2018b).

We experiment with argument and counter-argument pairs collected from the Reddit /r/ChangeMyView group. Automatic evaluation shows that the proposed model significantly outperforms our prior argument generation system Hua and Wang (2018) and other non-trivial comparisons. Human evaluation further suggests that our model produces more appropriate counter-arguments with richer content than other automatic systems, while maintaining a fluency level comparable to human-constructed arguments.

2 Related Work

To date, the majority of the work on automatic argument generation has led to rule-based models, e.g., designing operators that reflect strategies from argumentation theory Reed et al. (1996); Carenini and Moore (2000). Information retrieval systems have recently been developed to extract arguments relevant to a given debate motion Sato et al. (2015). Although content ordering has been investigated Reisert et al. (2015); Yanase et al. (2015), the output arguments are usually a collection of sentences from heterogeneous information sources, and thus lack coherence and conciseness. Our work aims to close the gap by generating eloquent and coherent arguments, assisted by an argument retrieval system.

Recent progress in sequence-to-sequence (seq2seq) text generation models has delivered both fluent and content-rich outputs by explicitly conducting content selection and ordering Gehrmann et al. (2018); Wiseman et al. (2018), which is a promising avenue for enabling end-to-end counter-argument construction Le et al. (2018). In particular, our prior work Hua and Wang (2018) leverages passages retrieved from Wikipedia to improve the quality of generated arguments, yet Wikipedia itself has the limitation of containing mostly facts. By leveraging Wikipedia and popular news media, our proposed pipeline can enrich the factual evidence with high-quality opinions and reasoning.

Our work is also in line with argument retrieval research, where prior effort mostly considers a single-origin information source Rinott et al. (2015); Levy et al. (2018); Wachsmuth et al. (2017b, 2018b). Recent work by Stab et al. (2018) indexes all web documents collected in Common Crawl, which inevitably incorporates noisy, low-quality content. Besides, existing work treats individual sentences as arguments, disregarding their crucial discourse structures and logical relations with adjacent sentences. Instead, we use multiple high-quality information sources, and construct paragraph-level passages to retain the context of arguments.

Figure 2: Architecture of CANDELA. (1) Argument retrieval (§ 4): a set of passages are retrieved and ranked based on relevance and stance (§ 4.1, 4.3), from which a set of keyphrases are extracted (§ 4.2), with both as input for argument generation. (2) The biLSTM encoder consumes the input statement and passages returned from step 1. (3) A text planning decoder outputs a representation per sentence, and simultaneously predicts an argumentative function and selects keyphrases to include for the next sentence to be generated (§ 5.2). (4) A content realization decoder produces the counter-argument (§ 5.3).

3 Overview of CANDELA

Our counter-argument generation framework, as shown in Figure 2, has two main components: an argument retrieval model (§ 4), which takes the input statement and a search engine and outputs relevant passages and keyphrases, and an argument generation model (§ 5), which uses them as input to produce a fluent and informative argument.

Concretely, the argument retrieval component retrieves a set of candidate passages from Wikipedia and news media (§ 4.1), then further selects passages according to their stances towards the input statement (§ 4.3). A keyphrase extraction module distills the refined passages into a set of talking points, which comprise the keyphrase memory as additional input for generation (§ 4.2).

The argument generation component first runs the text planning decoder (§ 5.2) to produce a sequence of hidden states, each corresponding to a sentence-level representation that encodes the selection of keyphrases to cover, as well as the predicted argumentative function for a desired language style. The content realization decoder (§ 5.3) then generates the argument conditioned on the sentence representations.

4 Argument Retrieval

4.1 Information Sources and Indexing

We aim to build a search engine from diverse information sources with factual evidence and varied opinions of high quality. To achieve that, we use Common Crawl to collect a large-scale online news dataset covering four major English news media: The New York Times (NYT), The Washington Post (WaPo), Reuters, and The Wall Street Journal (WSJ). HTML files are processed using the open-source tool jusText Pomikálek (2011) to extract article content. We deduplicate articles and remove those that fall below a minimum word count. We also download a Wikipedia dump. About 12 million articles are processed in total, with basic statistics shown in Table 1.

We segment articles into passages with a sliding window of three sentences and a step size of two. We further constrain passages to a minimum length in words; for shorter passages, we keep adding subsequent sentences until the length requirement is met. Per Table 1, about 120 million passages are preserved and indexed with Elasticsearch Gormley and Tong (2015), as done in Stab et al. (2018).
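The segmentation step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `segment_passages`, the default `min_words` value, and the choice to stop once a passage reaches the end of the article are all assumptions made for the example.

```python
from typing import List


def _word_count(sentences: List[str]) -> int:
    """Count whitespace-separated words across a list of sentences."""
    return sum(len(s.split()) for s in sentences)


def segment_passages(sentences: List[str], window: int = 3, step: int = 2,
                     min_words: int = 20) -> List[str]:
    """Slide a `window`-sentence window with stride `step` over an article.

    `min_words` is a hypothetical threshold; passages shorter than it are
    extended with the sentences that follow them.
    """
    passages = []
    for start in range(0, len(sentences), step):
        end = min(start + window, len(sentences))
        # Extend a short passage with subsequent sentences.
        while end < len(sentences) and _word_count(sentences[start:end]) < min_words:
            end += 1
        passages.append(" ".join(sentences[start:end]))
        if end >= len(sentences):
            break  # Assumed: no further passages once the article end is reached.
    return passages
```

With a step of two and a window of three, consecutive passages share one sentence, which keeps some context across passage boundaries.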

| Source | # Articles | # Passages | Date Range |
|---|---|---|---|
| Wikipedia | 5,743,901 | 42,797,543 | dump of 12/2016 |
| WaPo | 1,109,672 | 22,564,532 | 01/1997 - 10/2018 |
| NYT | 1,952,446 | 28,904,549 | 09/1895 - 09/2018 |
| Reuters | 1,052,592 | 9,913,400 | 06/2005 - 09/2018 |
| WSJ | 2,059,128 | 16,109,392 | 01/1996 - 09/2018 |
| Total | 11,917,739 | 120,289,416 | - |

Table 1: Statistics on information sources for argument retrieval. News media are sorted by ideological leaning from left to right.

Query Formulation. For an input statement with multiple sentences, one query is constructed per sentence, provided the sentence contains enough content words (a separate threshold applies to questions) and enough of them are distinct. For each query, the top-ranked passages under BM25 Robertson et al. (1995) are retained, per medium. All passages retrieved for the input statement are merged and deduplicated, and they are then ranked as discussed in § 4.3.
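To make the BM25 ranking concrete, here is a small standalone sketch of Okapi BM25 scoring. The system described above relies on Elasticsearch for this step; the function `bm25_rank`, the smoothed IDF variant, and the parameter values `k1 = 1.2` and `b = 0.75` are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter
from typing import List, Tuple


def bm25_rank(query: List[str], docs: List[List[str]], top_k: int = 2,
              k1: float = 1.2, b: float = 0.75) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs for the top_k documents under BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed, non-negative IDF.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scored.append((idx, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In the retrieval setting above, this ranking would be run once per query and per medium before the results are merged.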

4.2 Keyphrase Extraction

Here we describe a keyphrase extraction procedure for both input statements and retrieved passages, which will be utilized for passage ranking as detailed in the next section.

For the input statement, our goal is to identify a set of phrases representing the issues under discussion, such as “death penalty” in Figure 1. We thus first extract the topic signature words Lin and Hovy (2000) for input representation, and expand them into phrases that better capture semantic meanings.

Concretely, topic signature words of an input statement are calculated against all input statements in our training set with a log-likelihood ratio test. In order to cover phrases with related terms, we further expand this set with their synonyms, hyponyms, hypernyms, and antonyms based on WordNet Miller (1994). The statements are first tagged with the Stanford part-of-speech tagger Manning et al. (2014). Regular expressions are then applied to extract candidate noun phrases and verb phrases (details in Appendix A.1). A keyphrase is selected if it contains: (1) at least one content word, (2) no more than 10 tokens, and (3) at least one topic signature word or a Wikipedia article title.
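The log-likelihood ratio test behind topic signatures can be sketched as below. This is an illustration under assumptions: the binomial formulation follows the usual description of the test, the cutoff `10.83` (the chi-square critical value for p < 0.001 with one degree of freedom) is a common choice rather than a value stated here, and the function names are hypothetical.

```python
import math
from typing import Dict, Set


def _log_likelihood(p: float, k: int, n: int) -> float:
    """Binomial log-likelihood of k successes in n trials, guarded against log(0)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def llr(k1: int, n1: int, k2: int, n2: int) -> float:
    """-2 log(lambda) for a word seen k1/n1 times in the input, k2/n2 in the background."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2.0 * (_log_likelihood(p1, k1, n1) + _log_likelihood(p2, k2, n2)
                  - _log_likelihood(p, k1, n1) - _log_likelihood(p, k2, n2))


def topic_signatures(fg_counts: Dict[str, int], bg_counts: Dict[str, int],
                     threshold: float = 10.83) -> Set[str]:
    """Words significantly more frequent in the foreground than in the background."""
    n1, n2 = sum(fg_counts.values()), sum(bg_counts.values())
    signatures = set()
    for word, k1 in fg_counts.items():
        k2 = bg_counts.get(word, 0)
        # Keep only words that are over-represented in the foreground text.
        if k1 / n1 > k2 / n2 and llr(k1, n1, k2, n2) > threshold:
            signatures.add(word)
    return signatures
```

Here the foreground would be the word counts of one input statement and the background the counts over all training statements.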

For retrieved passages, their keyphrases are extracted using the same procedure as above, except that the input statement’s topic signature words are used as references again.

4.3 Passage Ranking and Filtering

We merge the retrieved passages from all media and rank them based on the number of words in keyphrases that overlap with the input statement. To break a tie, with the input as the reference, we further consider the number of its topic signature words that are covered by the passage, then the coverage of non-stopword bigrams and unigrams. In order to encourage diversity, we discard a passage if more than a set proportion of its content words are already covered by a higher-ranked passage. In the final step, we filter out passages that have the same stance as the input statement for the given topics. We determine the stances of passages by adopting the stance scoring model proposed by Bar-Haim et al. (2017). More details can be found in Appendix A.2.
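The ranking and diversity filter can be illustrated with a short sketch. The dictionary keys, the function name `rank_and_filter`, and the `max_overlap` value are assumptions for the example, and the stance filter is left out because it depends on an external stance scoring model.

```python
from typing import Dict, List


def rank_and_filter(passages: List[Dict], max_overlap: float = 0.5) -> List[Dict]:
    """Rank passages by lexicographic tie-breaking, then drop near-duplicates.

    Each passage is a dict with precomputed features:
      kp_overlap   - words in keyphrases shared with the input statement
      sig_cov      - input topic signature words covered by the passage
      bigram_cov   - covered non-stopword bigrams
      unigram_cov  - covered non-stopword unigrams
      content_words - set of the passage's content words
    `max_overlap` is a hypothetical threshold.
    """
    ranked = sorted(
        passages,
        key=lambda p: (p["kp_overlap"], p["sig_cov"], p["bigram_cov"], p["unigram_cov"]),
        reverse=True,
    )
    kept: List[Dict] = []
    for passage in ranked:
        words = passage["content_words"]
        # Discard if any higher-ranked kept passage already covers too many words.
        redundant = any(
            words and len(words & other["content_words"]) / len(words) > max_overlap
            for other in kept
        )
        if not redundant:
            kept.append(passage)
    return kept
```

Sorting on a tuple of features gives the tie-breaking order described above: later features only matter when all earlier ones are equal.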

5 Argument Generation

5.1 Task Formulation

Given an input statement $X = \{x_i\}$, a set of passages, and a keyphrase memory $\mathcal{M}$, our goal is to generate a counter-argument $Y = \{y_t\}$ of a different stance from $X$, where $x_i$ and $y_t$ are tokens at timestamps $i$ and $t$. Built upon the sequence-to-sequence (seq2seq) framework with input attention Sutskever et al. (2014); Bahdanau et al. (2015), the input statement and the passages selected in § 4 are encoded by a bidirectional LSTM (biLSTM) encoder into a sequence of hidden states $h_i$. The last hidden state of the encoder is used as the first hidden state of both the text planning decoder and the content realization decoder.

As depicted in Figure 2, the counter-argument is generated as follows. A text planning decoder (§ 5.2) first calculates a sequence of sentence representations $s_j$ (for the $j$-th sentence) by encoding the keyphrases selected at the previous timestamp $j-1$. During this step, an argumentative function label is predicted to indicate a desired language style for each sentence, and a subset of the keyphrases is selected from $\mathcal{M}$ (content selection) for the next sentence. In the second step, a content realization decoder (§ 5.3) generates the final counter-argument conditioned on previously generated tokens and the corresponding sentence representation $s_j$.

5.2 Text Planning Decoder

Text planning is an important component for natural language generation systems to decide on content structure for the target generation Lavoie and Rambow (1997); Reiter and Dale (2000). We propose a text planner with two objectives: selecting talking points from the keyphrase memory $\mathcal{M}$, and choosing a proper argumentative function per sentence. Concretely, we train a sentence-level LSTM that learns to generate a sequence of sentence representations $\{s_j\}$ given the selected keyphrase set $C(j)$ as input for the $j$-th sentence:

$$s_j = f\Big(s_{j-1}, \sum_{e_k \in C(j)} e_k\Big) \qquad (1)$$

where $f$ is an LSTM network and $e_k$ is the embedding of a selected phrase, represented by summing up the GloVe embeddings Pennington et al. (2014) of all its words in our experiments.

Content Selection $C(j)$. We propose an attention mechanism to conduct content selection and yield $C(j)$, the keyphrase set for the $j$-th sentence, from the representation of the previous sentence $s_{j-1}$ to encourage topical coherence. To allow the selection of multiple keyphrases, we use the sigmoid function to calculate the score:

$$P(e_k \in C(j) \mid C_{1:j-1}) = \sigma\big(e_k^{\top} W^{v} s_{j-1}\big) \qquad (2)$$

where $e_k$ is the embedding of the $k$-th keyphrase, $W^{v}$ are trainable parameters, keyphrases whose score exceeds a fixed threshold are included in $C(j)$, and the keyphrase with the top attention value is always selected. We further prohibit a keyphrase from being chosen more than once across sentences. For the first sentence, the selected set contains only <start>, whose embedding is randomly initialized. During training, the true labels of $C(j)$ are constructed as follows: a keyphrase in the keyphrase memory is selected for the $j$-th gold-standard argument sentence if the two share any content word.
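A minimal sketch of this selection step, written in plain Python for readability, is given below. It assumes a bilinear sigmoid score between each keyphrase embedding and the previous sentence representation; the function name `select_keyphrases`, the `used` bookkeeping set, and the `threshold` value of 0.5 are assumptions for illustration.

```python
import math
from typing import List, Set, Tuple


def select_keyphrases(embeddings: List[List[float]], W: List[List[float]],
                      s_prev: List[float], used: Set[int],
                      threshold: float = 0.5) -> Tuple[Set[int], List[float]]:
    """Pick keyphrases for the next sentence with a sigmoid bilinear score.

    Returns the indices of the selected keyphrases and all scores.
    """
    # Project the previous sentence representation once: W s_prev.
    projected = [sum(w * s for w, s in zip(row, s_prev)) for row in W]
    scores = []
    for emb in embeddings:
        logit = sum(e * v for e, v in zip(emb, projected))
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    # Keyphrases already used in earlier sentences cannot be chosen again.
    available = [k for k in range(len(embeddings)) if k not in used]
    if not available:
        return set(), scores
    selected = {k for k in available if scores[k] > threshold}
    # The top-scoring available keyphrase is always selected.
    selected.add(max(available, key=lambda k: scores[k]))
    return selected, scores
```

Because each keyphrase gets an independent sigmoid score rather than a share of a softmax, several keyphrases can be selected for the same sentence.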

Argumentative Function Prediction. As shown in Figure 1, humans often deploy stylistic language to achieve better persuasiveness, e.g., agreement as a concessive move. We aim to inform the realization decoder about the choice of style, and thus distinguish between two types of argumentative functions: argumentative content sentences, which deliver the critical ideas (e.g., “unreliable evidence is used when there is no witness”), and argumentative filler sentences, which contain stylistic language or general statements (e.g., “you can’t bring dead people back to life”).

Since we do not have argumentative function labels, during training we use the following rules to automatically label each sentence. A sentence is a content sentence if it meets a minimum word count (a separate threshold applies to questions) and satisfies one of the following conditions: (1) it has at least two topic signature words of the input statement or a gold-standard counter-argument (when calculating topic signatures for gold-standard arguments, all replies in the training set are used as background), or (2) it has at least one topic signature word and a discourse marker at the beginning of the sentence. If the first three words in a content sentence contain a pronoun, the previous sentence is labeled as a content sentence too. Discourse markers are selected from PDTB discourse connectives (e.g., as a result, eventually, or in contrast). The full list is included in Appendix A.3. All other sentences become filler sentences. In future work, we will consider utilizing learning-based methods, e.g., Hidey et al. (2017), to predict richer argumentative functions.
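The labeling rules can be sketched as a small function. This is an illustration only: `min_words` is a hypothetical threshold, the separate threshold for questions is omitted, tokenization is plain whitespace splitting, and the marker and pronoun lists are supplied by the caller rather than taken from the appendix.

```python
from typing import Iterable, List, Set


def label_sentences(sentences: List[str], signatures: Set[str],
                    markers: Iterable[str], pronouns: Set[str],
                    min_words: int = 5) -> List[str]:
    """Label each sentence as 'content' or 'filler' with heuristic rules."""
    labels = []
    for sent in sentences:
        lowered = sent.lower()
        tokens = lowered.split()
        n_sig = len(set(tokens) & signatures)
        starts_with_marker = any(lowered.startswith(m + " ") for m in markers)
        is_content = len(tokens) >= min_words and (
            n_sig >= 2 or (n_sig >= 1 and starts_with_marker)
        )
        labels.append("content" if is_content else "filler")
    # A content sentence opening with a pronoun promotes the sentence before it.
    for i in range(1, len(sentences)):
        first_three = sentences[i].lower().split()[:3]
        if labels[i] == "content" and any(t in pronouns for t in first_three):
            labels[i - 1] = "content"
    return labels
```

The pronoun rule reflects the idea that a sentence starting with "it" or "they" usually depends on the preceding sentence for its referent, so the two should share a label.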

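The argumentative function label $y^{s}_{j}$ for the $j$-th sentence is calculated as follows:

$$P(y^{s}_{j} \mid y^{s}_{<j}) = \mathrm{softmax}\big(w_{s}^{\top} \tanh(W^{s}[c_j; s_j]) + b_{s}\big) \qquad (3)$$

$$c_j = \sum_{e_k \in \mathcal{M}} \alpha_{jk}\, e_k \qquad (4)$$

where $\alpha_{jk}$ is the alignment score computed as in Eq. 2, $c_j$ is the attention weighted context vector over the keyphrase embeddings $e_k$, $s_j$ is the sentence representation, and $w_{s}$, $W^{s}$, and $b_{s}$ are trainable parameters.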

5.3 Content Realization Decoder

The content realization decoder generates the counter-argument word by word, with another LSTM network $g$. We denote the sentence id of the $t$-th word in the argument as $J(t)$; the sentence representation $s_{J(t)}$ from the text planning decoder, together with the embedding of the previously generated token $y_{t-1}$, is then fed as input to calculate the hidden state $z_t$:

$$z_t = g\big(z_{t-1},\, s_{J(t)},\, y_{t-1}\big)$$

The conditional probability of the next token $y_t$ is then computed with a standard softmax, with an attention mechanism applied on the encoder hidden states $h_i$ to obtain the context vector $c^{*}_t$:

$$c^{*}_t = \sum_i \beta_{ti}\, h_i$$

where $\beta_{ti}$ is the input attention, and all weight matrices and bias terms are learnable.

Reranking-based Beam Search. Our content realization decoder utilizes beam search enhanced with a reranking mechanism, where we sort the beams at the end of each sentence by the number of selected keyphrases that are generated. We also discard beams that contain repeated n-grams at or above a minimum length n.
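The reranking step can be sketched as follows. The function names, the value of `n`, and the use of the beam log-probability as a tie-breaker are assumptions for illustration; the sketch covers only the sentence-end reranking, not the beam expansion itself.

```python
from typing import List, Tuple


def has_repeated_ngram(tokens: List[str], n: int) -> bool:
    """True if any n-gram occurs twice.

    Checking length n is enough to catch repeats of longer n-grams too,
    since a repeated longer n-gram contains a repeated n-gram of length n.
    """
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


def rerank_beams(beams: List[Tuple[List[str], float]],
                 selected_keyphrases: List[str], n: int = 4) -> List[Tuple[List[str], float]]:
    """Drop repetitive beams, then sort by keyphrase coverage (log-prob breaks ties)."""

    def coverage(tokens: List[str]) -> int:
        # Pad with spaces so keyphrases match on whole-token boundaries.
        text = " " + " ".join(tokens) + " "
        return sum(1 for kp in selected_keyphrases if " " + kp + " " in text)

    kept = [beam for beam in beams if not has_repeated_ngram(beam[0], n)]
    return sorted(kept, key=lambda beam: (coverage(beam[0]), beam[1]), reverse=True)
```

Sorting by coverage first favors hypotheses that actually realize the talking points chosen by the planner, even when a shorter, blander hypothesis has a higher model score.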

5.4 Training Objective

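Given all model parameters $\theta$, our mixed objective considers the target argument ($\mathcal{L}_{\text{arg}}$), the argumentative function type ($\mathcal{L}_{\text{func}}$), and the next sentence keyphrase selection ($\mathcal{L}_{\text{sel}}$):

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{arg}}(\theta) + \gamma\, \mathcal{L}_{\text{func}}(\theta) + \eta\, \mathcal{L}_{\text{sel}}(\theta)$$

Each term is accumulated over the training corpus $D$, where $(x, y)$ are input statement and counter-argument pairs and $y^{s}$ are the sentence function labels; the keyphrase selection labels are predicted as in Eq. 2. For simplicity, $\gamma$ and $\eta$ are fixed to the same constant in our experiments, although they can be further tuned as hyper-parameters.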

6 Experimental Setups

6.1 Data Collection and Preprocessing

We use the same methodology as in our prior work Hua and Wang (2018) to collect an argument generation dataset from Reddit /r/ChangeMyView, further crawling threads from July 2017 to December 2018 in addition to the previously collected dataset. To construct input statement and counter-argument pairs, we treat the original post (OP) of each thread as the input. We then consider the high-quality root replies, defined as the ones awarded deltas (Δ) or with more upvotes than downvotes (i.e., karma > 0). We observe that each paragraph often makes a coherent argument. Therefore, these replies are broken down into paragraphs, and a paragraph is retained as a target argument to the OP if it exceeds a minimum word count and contains at least one argumentative content sentence.

We then identify threads in the domains of politics and policy, and remove posts with offensive language. The most recent threads are used as the test set, and the remaining threads (OPs) with their arguments are divided into training and validation sets. All texts are split into sentences and then tokenized with the Stanford CoreNLP toolkit Manning et al. (2014).

Training Data Construction for Passages and Keyphrase Memory. Since no gold-standard annotation is available for the input passages and keyphrases, we acquire training labels by constructing queries from the gold-standard arguments as described in § 4.1, and reranking retrieved passages based on the following criteria in order: (1) coverage of topic signature words in the input statement; (2) a weighted summation of the coverage of n-grams in the argument, with separate weights for 4-grams, trigrams, and bigrams; (3) the magnitude of stance score, where we keep the passages of the same polarity as the argument; (4) content word overlap with the argument; and (5) coverage of topic signature words in the argument.

6.2 System and Oracle Retrieved Passages

For evaluation, we employ both system retrieved passages (i.e., constructing queries from the OP) with their keyphrase memory (KM) (§ 4), and oracle retrieved passages (i.e., constructing queries from the target argument) with their KM, as described in training data construction. Statistics on the final dataset are listed in Table 2.

| | Training | System | Oracle |
|---|---|---|---|
| Avg. # words per OP | 383.7 | 373.0 | 373.0 |
| Avg. # words per argument | 66.0 | 65.1 | 65.1 |
| Avg. # passage | 4.3 | 9.6 | 4.2 |
| Avg. # keyphrase | 57.1 | 128.6 | 56.6 |

Table 2: Statistics on the datasets for experiments.

6.3 Comparisons

In addition to a Retrieval model, where the top ranked passage is used as counter-argument, we further consider four systems for comparison. (1) A standard Seq2seq model with attention, where we feed the OP as input and train the model to generate counter-arguments. Regular beam search with the same beam size as our model is used for decoding. (2) A Seq2seqAug model with additional input of the keyphrase memory and ranked passages, both concatenated with OP to serve as the encoder input. The reranking-based decoder in our model is also implemented for Seq2seqAug to enhance the coverage of input keyphrases. (3) An ablated Seq2seqAug model where the passages are removed from the input. (4) We also reimplement the argument generation model in our prior work Hua and Wang (2018) (H&W) with PyTorch Paszke et al. (2017), which is also used for the CANDELA implementation. H&W takes as input the OP and ranked passages, and then uses two separate decoders to first generate all keyphrases and then the counter-argument. For our model, we also implement a variant where the input only contains the OP and the keyphrase memory.

6.4 Training Details

For all models, we use a two-layer LSTM for all encoders and decoders, with dropout applied between layers Gal and Ghahramani (2016). All layers share the same hidden-state dimensionality. We cap the length of the input statement, the ranked passages, and the target counter-argument at fixed numbers of tokens. A single vocabulary is used for both input and output, with word embeddings initialized with GloVe Pennington et al. (2014) and fine-tuned during model training. We use AdaGrad Duchi et al. (2011) as the optimizer, with gradient norm clipping. Early stopping is implemented according to the perplexity on the validation set. All our models are trained on a Quadro P5000 GPU card. For beam search, the beam size is tuned on the validation set.

We also pre-train a biLSTM for the encoder based on all OPs from the training set, and an LSTM for the content realization decoder based on two sources of data: counter-arguments that are high-quality root reply paragraphs extended with posts of non-negative karma, and retrieved passages randomly sampled from the training set. Both are trained as done in Bengio et al. (2003). We then use the first layer’s parameters to initialize all models, including our comparisons.

7 Results and Analysis

7.1 Automatic Evaluation

| Model | Sys. B-2 | Sys. B-4 | Sys. R-2 | Sys. MTR | Sys. #Word | Sys. #Sent | Ora. B-2 | Ora. B-4 | Ora. R-2 | Ora. MTR | Ora. #Word | Ora. #Sent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | - | - | - | - | 66 | 22 | - | - | - | - | 66 | 22 |
| Retrieval | 7.55 | 1.11 | 8.64 | 14.38 | 123 | 23 | 10.97 | 3.05 | 23.49 | 20.08 | 140 | 21 |
| Seq2seq | 6.92 | 2.13 | 13.02 | 15.08 | 68 | 15 | 6.92 | 2.13 | 13.02 | 15.08 | 68 | 15 |
| Seq2seqAug | 8.26 | 2.24 | 13.79 | 15.75 | 78 | 14 | 10.98 | 4.41 | 22.97 | 19.62 | 71 | 14 |
| Seq2seqAug w/o psg | 7.94 | 2.28 | 10.13 | 15.71 | 75 | 12 | 9.89 | 3.34 | 14.20 | 18.40 | 66 | 12 |
| H&W Hua and Wang (2018) | 3.64 | 0.92 | 8.83 | 11.78 | 51 | 12 | 8.51 | 2.86 | 18.89 | 17.18 | 58 | 12 |
| CANDELA (ours) | 12.02 | **2.99** | **14.93** | **16.92** | 119 | 22 | 15.80 | **5.00** | **23.75** | **20.18** | 116 | 22 |
| CANDELA w/o psg (ours) | **12.33** | 2.86 | 14.53 | 16.60 | 123 | 23 | **16.33** | 4.98 | 23.65 | 19.94 | 123 | 23 |

Table 3: Main results on argument generation with system retrieval (Sys.) and oracle retrieval (Ora.). We report BLEU-2 (B-2), BLEU-4 (B-4), ROUGE-2 (R-2) recall, METEOR (MTR), and the average number of words per argument (#Word) and per sentence (#Sent). Best scores are in bold. Statistical significance is assessed with an approximate randomization test Noreen (1989). Input is the same for Seq2seq for both system and oracle setups.

We employ ROUGE Lin (2004), a recall-oriented metric, BLEU Papineni et al. (2002), based on n-gram precision, and METEOR Denkowski and Lavie (2014), measuring unigram precision and recall by considering synonyms, paraphrases, and stemming. BLEU-2, BLEU-4, ROUGE-2 recall, and METEOR are reported in Table 3 for both setups.

Under the system setup, our model CANDELA statistically significantly outperforms all comparisons and the retrieval model on all metrics, based on a randomization test Noreen (1989). Furthermore, our model generates longer sentences whose lengths are comparable with human arguments, both with about 22 words per sentence. This also results in longer arguments. Under the oracle setup, all models are notably improved due to the higher quality of reranked passages, and our model achieves statistically significantly better BLEU scores. Interestingly, we observe a decrease of ROUGE and METEOR, but a marginal increase of BLEU-2, when removing passages from our model input. This could be because the passages introduce divergent content, albeit probably on-topic, that cannot be captured by BLEU.
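For readers unfamiliar with the significance test, a paired approximate randomization test can be sketched as below. This is a simplified illustration: it assumes per-example scores whose mean is the test statistic (corpus-level metrics such as BLEU would need to be recomputed after each shuffle), and the function name, trial count, and seed are arbitrary choices.

```python
import random
from typing import List


def approximate_randomization(scores_a: List[float], scores_b: List[float],
                              trials: int = 10000, seed: int = 0) -> float:
    """Estimate a two-sided p-value for the difference in mean scores.

    In each trial, the two systems' scores are swapped per example with
    probability 0.5, and the shuffled difference is compared to the observed one.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) / n >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an exact zero.
    return (hits + 1) / (trials + 1)
```

A small p-value means that randomly reassigning outputs between the two systems rarely produces a gap as large as the observed one.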

Figure 3: Average number of distinct n-grams per argument.

| System | K = 100 | K = 500 | K = 1000 | K = 2000 |
|---|---|---|---|---|
| Human | 44.1 | 25.8 | 18.5 | 12.0 |
| Retrieval | 50.6 | 33.3 | 26.0 | 18.6 |
| Seq2seq | 25.0 | 7.5 | 3.2 | 1.2 |
| Seq2seqAug | 28.2 | 9.2 | 4.6 | 1.8 |
| H&W Hua and Wang (2018) | 38.6 | 24.0 | 19.5 | 16.2 |
| CANDELA | 30.0 | 10.5 | 5.3 | 2.3 |

Figure 4: Percentage of words in arguments that are not in the top-K (K = 100, 500, 1000, 2000) frequent words seen in training.

Content Diversity. We further measure whether our model is able to generate diverse content. First, borrowing the diversity measurement from dialogue generation research Li et al. (2016), we report the average number of distinct n-grams per argument under the system setup in Figure 3. Our system generates more unique unigrams and bigrams than other automatic systems, underscoring its capability of generating diverse content. Our model also maintains a type-token ratio (TTR) comparable to systems that generate shorter arguments, such as Seq2seqAug and Seq2seq, when measured on bigrams. Retrieval, containing top ranked passages of human-edited content, produces the most distinct words.
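Both diversity measures are simple to compute; the following sketch shows one way to do so for a single tokenized argument. The function names are illustrative, and averaging over arguments is left to the caller.

```python
from typing import List


def distinct_ngrams(tokens: List[str], n: int) -> int:
    """Number of unique n-grams in a token sequence."""
    return len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})


def type_token_ratio(tokens: List[str], n: int) -> float:
    """Unique n-grams divided by total n-grams (0.0 for too-short inputs)."""
    total = len(tokens) - n + 1
    return distinct_ngrams(tokens, n) / total if total > 0 else 0.0
```

The raw distinct count grows with output length, which is why the type-token ratio is a useful companion when comparing systems that produce arguments of different lengths.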

Next, we compare how each system generates content beyond the common words. As shown in Figure 4, human-edited text, including gold-standard arguments (Human) and retrieved passages, tends to have higher usage of uncommon words than automatic systems, suggesting the gap between human vs. system arguments. Among the four automatic systems, our prior model Hua and Wang (2018) generates a significantly higher portion of uncommon words, yet further inspection shows that the output often includes more off-topic information.
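The uncommon-word statistic in Figure 4 can be computed along these lines. The function name `uncommon_word_pct` and the whitespace-token inputs are assumptions for the example.

```python
from collections import Counter
from typing import List


def uncommon_word_pct(argument_tokens: List[str], training_tokens: List[str],
                      top_k: int) -> float:
    """Percentage of argument tokens outside the top_k most frequent training words."""
    common = {word for word, _ in Counter(training_tokens).most_common(top_k)}
    if not argument_tokens:
        return 0.0
    uncommon = sum(1 for tok in argument_tokens if tok not in common)
    return 100.0 * uncommon / len(argument_tokens)
```

As K grows, more of the vocabulary counts as common, so the percentage can only stay the same or fall, which matches the left-to-right trend in each row of Figure 4.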

7.2 Human Evaluation

Human judges are asked to rate arguments on a Likert scale of 1 (worst) to 5 (best) on the following three aspects: grammaticality, which denotes language fluency; appropriateness, which indicates whether the output is on-topic and on the opposing stance; and content richness, which measures the amount of distinct talking points. In order to promote consistency of annotation, we provide descriptions and sample arguments for each scale. For example, one of the appropriateness levels is described as a counter-argument that contains relevant words and is likely to be on a different stance. The judges are then asked to rank all arguments for the same input based on their overall quality.

We randomly sampled threads from the test set, and hired three native or proficient English speakers to evaluate arguments generated by Seq2seqAug, our prior argument generation model (H&W), and the new model CANDELA, along with gold-standard Human arguments and the top passage by Retrieval.

| System | Gram. | Appr. | Cont. | Top-1 | Top-2 |
|---|---|---|---|---|---|
| Human | 4.95 | 4.23 | 4.39 | 75.8% | 85.8% |
| Retrieval | 4.85 | 3.04 | 3.68 | 17.5% | 55.8% |
| Seq2seqAug | **4.83** | 2.67 | 2.47 | 1.7% | 22.5% |
| H&W Hua and Wang (2018) | 3.86 | 2.27 | 2.10 | 1.7% | 7.5% |
| CANDELA | 4.59 | **2.97** | **2.93** | 3.3% | 28.3% |

Table 4: Human evaluation on grammaticality (Gram.), appropriateness (Appr.), and content richness (Cont.), on a scale of 1 to 5 (best). The best result among automatic systems is highlighted in bold, with statistical significance assessed by an approximate randomization test. Top-1/2: percentage of evaluations in which a system is ranked in the top 1 or 2 for overall quality.

Results. An initial subset of examples is used only for calibration, and the remainder are used to report results in Table 4. Inter-annotator agreement is measured with Krippendorff’s α for each of the three aspects, implying general consensus to intermediate agreement.

Our system obtains the highest appropriateness and content richness among all automatic systems. This confirms the previous observation that our model produces more informative arguments than other neural models. Seq2seqAug has a marginally better grammaticality score, likely because our arguments are longer and tend to contain less fluent generation towards the end.

Furthermore, we see that human arguments are ranked as the best in about 76% of the evaluations, followed by Retrieval. Our model is more likely to be ranked top than any other automatic model. Notably, our model is rated better than either Human or Retrieval, i.e., human-edited text, more often than Seq2seqAug or our prior model.

7.3 Sample Arguments and Discussions

We show sample outputs of different systems alongside a human-constructed counter-argument in Figure 5. As can be seen, our system arguments cover many relevant talking points, including the idea of “taking care of children” that is also used by the human. This further illustrates the effectiveness of our retrieval system and the usage of keyphrase selection as part of text planning to guide argument generation. Moreover, we observe that our model is able to switch between argumentative content sentences and filler sentences, though better control is needed to improve coherence. Meanwhile, Seq2seq frequently echoes words from the OP, and both Seq2seq and Seq2seqAug suffer from the problems of “hallucination” (e.g., the first sentence in Seq2seqAug) and repetition (e.g., the second and third sentences in Seq2seq).

OP: The wage gap isn’t about gender. (…) So one factor definitely the fact that there are some government jobs that are seriously dominated by women and happen to pay really low (…)
Human: Children are the first factor. I work in a traditionally male-dominated field, and questions are asked to determine familial status, even though those are illegal. Employers know single moms have to stay home if their child is sick. They know single moms can only work during daycare hours. They know single moms are unavailable for weekend and late night work. They know single moms cannot travel. The workplace is very family-unfriendly.
Seq2seq: the problem with this is that there are a lot of people who do n’t want to hire women . there are plenty of women who do n’t have the money to pay for it . the problem is that women do n’t need to pay for their work . they do n’t have to worry about it . it ’s a matter of money .
Seq2seqAug: men and women are not 39.5 % of the pay gap . the problem is not that women are more likely to be victims of sexism , but rather that they are more natural good-looking/attractive action . this is not the case .
CANDELA: the problem with this argument is that the wage gap does not have to do with the gender pay gap . it is a fact that women are more likely to be able to take care of their children than their male counterparts . this is not to say that it is unfair to assume that women are being paid less than men , but that does not mean that it is not the case that women are discriminated against . it is not a matter of the wage gap , it is a matter of opinion . it is the job of the employer to make sure that the job is not the same as the other
Keyphrase Memory: wage gap; discrimination; gender pay gaps; raise the child; male colleagues; paid maternity leave; underlying gender discrimination
Figure 5: Sample arguments generated by different systems along with a sample human argument. For our model CANDELA, additionally shown are the keyphrase memory with selected phrases in color, and argumentative filler sentence in italics.

Nonetheless, there is substantial room for improvement. First, our model tends to overuse negation, such as “this is not to say that it is unfair…”. This is likely due to overfitting on specific stylistic language, e.g., negation is often observed for refutation in debates (Wang et al., 2017). Second, human arguments are significantly better organized and often deploy complicated argumentation strategies (Wachsmuth et al., 2018a), which so far are not well captured by any automatic system. Both points inspire future work on (1) controlling language styles and the corresponding content, and (2) mining argumentation structures to guide generation with better planning.

8 Conclusion

We present a novel counter-argument generation framework, CANDELA. Given an input statement, it first retrieves arguments of different perspectives from millions of high-quality articles collected from diverse sources. An argument generation component then employs a text planning decoder to conduct content selection and specify a suitable language style at the sentence level, followed by a content realization decoder to produce the final argument. Automatic evaluation and human evaluation indicate that our model generates more appropriate arguments with richer content than non-trivial comparisons, with fluency comparable to human-edited content.


This research is supported in part by the National Science Foundation through Grants IIS-1566382 and IIS-1813341. We thank Varun Raval for helping with data processing and search engine indexing. We are grateful to the three anonymous reviewers for their constructive suggestions.


  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Bar-Haim et al. (2017) Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam Slonim. 2017. Stance classification of context-dependent claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251–261. Association for Computational Linguistics.
  • Bengio et al. (2003) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.
  • Carenini and Moore (2000) Giuseppe Carenini and Johanna Moore. 2000. A strategy for generating evaluative arguments. In INLG’2000 Proceedings of the First International Conference on Natural Language Generation, pages 47–54, Mitzpe Ramon, Israel. Association for Computational Linguistics.
  • Damer (2012) T Edward Damer. 2012. Attacking faulty reasoning. Cengage Learning.
  • Denkowski and Lavie (2014) Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA. Association for Computational Linguistics.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
  • Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1019–1027. Curran Associates, Inc.
  • Gehrmann et al. (2018) Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109, Brussels, Belgium. Association for Computational Linguistics.
  • Gormley and Tong (2015) Clinton Gormley and Zachary Tong. 2015. Elasticsearch: The definitive guide: A distributed real-time search and analytics engine. O’Reilly Media, Inc.
  • Hidey et al. (2017) Christopher Hidey, Elena Musi, Alyssa Hwang, Smaranda Muresan, and Kathy McKeown. 2017. Analyzing the semantic types of claims and premises in an online persuasive forum. In Proceedings of the 4th Workshop on Argument Mining, pages 11–21. Association for Computational Linguistics.
  • Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177. ACM.
  • Hua and Wang (2018) Xinyu Hua and Lu Wang. 2018. Neural argument generation augmented with externally retrieved evidence. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 219–230. Association for Computational Linguistics.
  • Lavoie and Rambow (1997) Benoit Lavoie and Owen Rambow. 1997. A fast and portable realizer for text generation systems. In Fifth Conference on Applied Natural Language Processing.
  • Le et al. (2018) Dieu-Thu Le, Cam Tu Nguyen, and Kim Anh Nguyen. 2018. Dave the debater: a retrieval-based and generative argumentative dialogue agent. In Proceedings of the 5th Workshop on Argument Mining, pages 121–130. Association for Computational Linguistics.
  • Levy et al. (2018) Ran Levy, Ben Bogin, Shai Gretz, Ranit Aharonov, and Noam Slonim. 2018. Towards an argumentative content search engine using weak supervision. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2066–2081. Association for Computational Linguistics.
  • Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
  • Lin and Hovy (2000) Chin-Yew Lin and Eduard Hovy. 2000. The automated acquisition of topic signatures for text summarization. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics.
  • Manning et al. (2014) Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.
  • Miller (1994) George A. Miller. 1994. Wordnet: A lexical database for english. In HUMAN LANGUAGE TECHNOLOGY: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.
  • Noreen (1989) Eric W Noreen. 1989. Computer-intensive methods for testing hypotheses. Wiley New York.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. 2017. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Pomikálek (2011) Jan Pomikálek. 2011. Removing boilerplate and duplicate content from web corpora. Ph.D. thesis, Masaryk university, Faculty of informatics, Brno, Czech Republic.
  • Prasad et al. (2008) Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K Joshi, and Bonnie L Webber. 2008. The penn discourse treebank 2.0. In LREC. Citeseer.
  • Reed et al. (1996) Chris Reed, Derek Long, and Maria Fox. 1996. An architecture for argumentative dialogue planning. In International Conference on Formal and Applied Practical Reasoning, pages 555–566. Springer.
  • Reisert et al. (2015) Paul Reisert, Naoya Inoue, Naoaki Okazaki, and Kentaro Inui. 2015. A computational approach for generating toulmin model argumentation. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 45–55, Denver, CO. Association for Computational Linguistics.
  • Reiter and Dale (2000) Ehud Reiter and Robert Dale. 2000. Building natural language generation systems. Cambridge university press.
  • Rinott et al. (2015) Ruty Rinott, Lena Dankin, Carlos Alzate Perez, Mitesh M. Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show me your evidence - an automatic method for context dependent evidence detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 440–450, Lisbon, Portugal. Association for Computational Linguistics.
  • Robertson et al. (1995) Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at trec-3. Nist Special Publication Sp, 109:109.
  • Sato et al. (2015) Misa Sato, Kohsuke Yanai, Toshinori Miyoshi, Toshihiko Yanase, Makoto Iwayama, Qinghua Sun, and Yoshiki Niwa. 2015. End-to-end argument generation system in debating. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 109–114, Beijing, China. Association for Computational Linguistics and The Asian Federation of Natural Language Processing.
  • Stab et al. (2018) Christian Stab, Johannes Daxenberger, Chris Stahlhut, Tristan Miller, Benjamin Schiller, Christopher Tauchmann, Steffen Eger, and Iryna Gurevych. 2018. Argumentext: Searching for arguments in heterogeneous sources. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 21–25. Association for Computational Linguistics.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.
  • Toulmin (1958) Stephen Edelston Toulmin. 1958. The use of argument. Cambridge University Press.
  • Wachsmuth et al. (2017a) Henning Wachsmuth, Nona Naderi, Ivan Habernal, Yufang Hou, Graeme Hirst, Iryna Gurevych, and Benno Stein. 2017a. Argumentation quality assessment: Theory vs. practice. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 250–255. Association for Computational Linguistics.
  • Wachsmuth et al. (2017b) Henning Wachsmuth, Martin Potthast, Khalid Al Khatib, Yamen Ajjour, Jana Puschmann, Jiani Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff, and Benno Stein. 2017b. Building an argument search engine for the web. In Proceedings of the 4th Workshop on Argument Mining, pages 49–59. Association for Computational Linguistics.
  • Wachsmuth et al. (2018a) Henning Wachsmuth, Manfred Stede, Roxanne El Baff, Khalid Al Khatib, Maria Skeppstedt, and Benno Stein. 2018a. Argumentation synthesis following rhetorical strategies. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3753–3765. Association for Computational Linguistics.
  • Wachsmuth et al. (2018b) Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018b. Retrieval of the best counterargument without prior topic knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 241–251. Association for Computational Linguistics.
  • Walton et al. (2008) Douglas Walton, Christopher Reed, and Fabrizio Macagno. 2008. Argumentation schemes. Cambridge University Press.
  • Wang et al. (2017) Lu Wang, Nick Beauchamp, Sarah Shugars, and Kechen Qin. 2017. Winning on the merits: The joint effects of content and style on debate outcomes. Transactions of the Association for Computational Linguistics, 5:219–232.
  • Wiseman et al. (2018) Sam Wiseman, Stuart Shieber, and Alexander Rush. 2018. Learning neural templates for text generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3174–3187, Brussels, Belgium. Association for Computational Linguistics.
  • Yanase et al. (2015) Toshihiko Yanase, Toshinori Miyoshi, Kohsuke Yanai, Misa Sato, Makoto Iwayama, Yoshiki Niwa, Paul Reisert, and Kentaro Inui. 2015. Learning sentence ordering for opinion generation of debate. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 94–103, Denver, CO. Association for Computational Linguistics.

Appendix A Appendices

a.1 Chunking Grammar for Keyphrase Extraction

In order to construct keyphrase candidates, we compile a set of regular expressions based on the following grammar rules, and extract all matched NP and VP patterns as candidates.

NP: {<DT|PP$>?<JJ|JJR>*<NN.*|CD|JJ>+}
PP: {<IN><NP>}
VP: {<MD>?<VB.*><NP|PP>}
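The three rules above form a cascade: NP chunks are found first, PP is built from a preposition plus an NP chunk, and VP from a verb plus an NP or PP chunk. The following is a minimal sketch of how such a cascade could be applied to POS-tagged tokens using only regular expressions. The hand-translated patterns, the function name `chunk`, and the example sentence are assumptions made for illustration; the input is assumed to carry Penn Treebank style tags, and this is not the authors' implementation.

```python
import re

# Each rule is a regex over a string of "<TAG>" tokens (hand-translated from the
# grammar above; this translation is an assumption of the sketch).
RULES = [
    ("NP", r"(?:<(?:DT|PP\$)>)?(?:<(?:JJ|JJR)>)*(?:<(?:NN[^<>]*|CD|JJ)>)+"),
    ("PP", r"<IN><NP>"),
    ("VP", r"(?:<MD>)?<VB[^<>]*>(?:<NP>|<PP>)"),
]


def chunk(tagged):
    """Return (label, phrase) pairs for every NP and VP chunk, in rule order."""
    nodes = [(tag, [word]) for word, tag in tagged]
    candidates = []
    for label, pattern in RULES:
        # Serialize current node labels and remember where each node starts.
        text, starts = "", {}
        for i, (node_label, _) in enumerate(nodes):
            starts[len(text)] = i
            text += f"<{node_label}>"
        starts[len(text)] = len(nodes)
        # Merge every matched run of nodes into a single chunk node.
        merged, prev = [], 0
        for m in re.finditer(pattern, text):
            lo, hi = starts[m.start()], starts[m.end()]
            words = [w for _, ws in nodes[lo:hi] for w in ws]
            merged.extend(nodes[prev:lo])
            merged.append((label, words))
            prev = hi
            if label in ("NP", "VP"):
                candidates.append((label, " ".join(words)))
        merged.extend(nodes[prev:])
        nodes = merged
    return candidates


# Hypothetical tagged sentence for illustration.
example = [
    ("women", "NNS"), ("should", "MD"), ("take", "VB"), ("paid", "JJ"),
    ("maternity", "NN"), ("leave", "NN"), ("from", "IN"), ("the", "DT"),
    ("employer", "NN"),
]
for label, phrase in chunk(example):
    print(label, "->", phrase)
```

Because NP chunks are recorded before they are absorbed into a VP, a phrase such as "paid maternity leave" appears both on its own and inside the enclosing verb phrase, which matches the idea of extracting all matched NP and VP patterns as candidates.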

a.2 Stance Scoring Model

Our stance scoring model calculates the score by aggregating the sentiment words surrounding the opinion targets. Here we choose the keyphrases of the input statement as opinion targets, denoted as . We then tally sentiment words, collected from Hu and Liu (2004), towards targets in , with positive words counted as and negative words as . Each score is discounted by , with being the distance between the sentiment word and the target . The stance score of a text (an input statement or a retrieved passage) towards opinion targets is calculated as:


In our experiments, we only keep passages with a stance score of the opposite sign to that of the input statement, and with a magnitude greater than , i.e. (determined by manual inspection on training set).
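A minimal sketch of this scoring and filtering step is given below under explicit assumptions: sentiment words carry unit weights of +1 and -1, the distance discount is an inverse power with a free exponent `decay`, targets are single tokens rather than multi-word keyphrases, and the cut-off is a free `threshold`. These choices, the tiny lexicon, and the function names are illustrative and should not be read as the values used in the experiments.

```python
# Toy lexicon (hypothetical); the paper uses the Hu and Liu (2004) word lists.
POSITIVE = {"good", "fair", "benefit"}
NEGATIVE = {"bad", "unfair", "harm"}


def stance_score(tokens, targets, decay=1.0):
    """Sum signed, distance-discounted sentiment around single-token targets.

    Assumptions: +1/-1 word weights and a `distance ** -decay` discount.
    """
    target_positions = [i for i, tok in enumerate(tokens) if tok in targets]
    score = 0.0
    for i, tok in enumerate(tokens):
        sign = (tok in POSITIVE) - (tok in NEGATIVE)
        if sign == 0:
            continue
        for j in target_positions:
            if i != j:  # skip the degenerate case of a target that is itself a sentiment word
                score += sign * abs(i - j) ** -decay
    return score


def keep_passage(statement_score, passage_score, threshold):
    """Keep a passage whose score has the opposite sign and a large enough magnitude."""
    return statement_score * passage_score < 0 and abs(passage_score) > threshold


statement = "the wage gap is good".split()
passage = "the wage gap is unfair".split()
s = stance_score(statement, {"gap"})
p = stance_score(passage, {"gap"})
print(s, p, keep_passage(s, p, threshold=0.25))
```

The sign test mirrors the filtering rule in the text: a passage is only useful for a counter-argument when its stance towards the shared targets opposes that of the input statement.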

a.3 List of Discourse Markers

As described in §5.2 of the main paper, we use a list of discourse markers together with topic signature words to label argumentative content sentences. The following discourse markers are manually selected from Appendix B of Prasad et al. (2008).

  • Contrast: although, though, even though, by comparison, by contrast, in contrast, however, nevertheless, nonetheless, on the contrary, regardless, whereas

  • Restatement/Equivalence/Generalization: eventually, in short, in sum, on the whole, overall

  • Result: accordingly, as a result, as it turns out, consequently, finally, furthermore, hence, in fact, in other words, in short, in the end, in turn, therefore, thus, ultimately
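The marker list above can be turned into a simple lookup. The sketch below only covers detecting markers in a sentence; how marker hits are combined with topic signature words to produce the final sentence label is specified in §5.2 and is not reproduced here. The dictionary keys and the function name `find_markers` are illustrative choices.

```python
import re

# Marker lists copied from the three categories above.
DISCOURSE_MARKERS = {
    "contrast": [
        "although", "though", "even though", "by comparison", "by contrast",
        "in contrast", "however", "nevertheless", "nonetheless",
        "on the contrary", "regardless", "whereas",
    ],
    "restatement": ["eventually", "in short", "in sum", "on the whole", "overall"],
    "result": [
        "accordingly", "as a result", "as it turns out", "consequently",
        "finally", "furthermore", "hence", "in fact", "in other words",
        "in short", "in the end", "in turn", "therefore", "thus", "ultimately",
    ],
}


def find_markers(sentence):
    """Return (category, marker) pairs for markers that occur as whole words."""
    lowered = sentence.lower()
    hits = []
    for category, markers in DISCOURSE_MARKERS.items():
        for marker in markers:
            if re.search(r"\b" + re.escape(marker) + r"\b", lowered):
                hits.append((category, marker))
    return hits


print(find_markers("However, the gap is therefore smaller."))
```

Whole-word matching keeps "though" from firing inside "although", while a sentence containing "even though" reports both "though" and "even though".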

a.4 Human Evaluation Guideline

Each human annotator is presented with 43 short argumentative text statements, where the first 3 statements are used to calibrate the annotator and are excluded from the final study. The annotators are asked to evaluate 5 counter-arguments for every statement. For each counter-argument, they rate the following aspects on a scale of 1 to 5, and also rank the 5 counter-arguments by overall quality. We display a sample statement and score-level explanations in Table 5.

Three aspects of arguments to be evaluated:

  • Grammaticality: whether the counterargument is fluent and has no grammar errors.

  • Appropriateness: whether the counterargument is on topic and on the right stance.

  • Content Richness: how many distinct talking points the counterargument conveys.

Statement: Legislative bodies should be required to explain, in formal writing, why they voted a certain way when it comes to legislation.
Grammaticality:
1 With plenty of grammatical errors and not readable at all
e.g., “the way the way etc. ’m not ’s important”
3 With a noticeable amount of grammatical errors but is generally readable
e.g., “is a good example. i don’t think should be the case. i’re not going to talk whether or not it’s a bad thing.”
5 With no grammatical errors at all and is clear to read
e.g., “i agree that the problem lies in the fact that too many representatives do n’t understand the issues or have money influencing their decisions.”
Appropriateness:
1 Not relevant to the prompt at all
e.g., “ i don’t think it ’s fair to say that people should n’t be able to care for their children”
2 Remotely relevant to the prompt or relevant but poses an unclear stance, or contains obvious contradictions
e.g., “the problem with the current system is that there are many people who don’t want to vote and they also don’t want to vote.”
3 Relevant to the prompt but stance is unclear
e.g., “i don’t agree with you and i think legislative bodies do need to explain why they vote that way”
4 Relevant to the prompt and is overall on the opposing stance with minor logical contradictions
e.g., “while i agree with you but i don’t think it’s a good idea for house reps to explain it because they have other work to do.”
5 Relevant to the prompt and on the opposing stance, with no unnatural repetitions or logical contradictions
e.g., “there are hundreds of votes a year . how do you decide which ones are worth explaining ? so many votes are bipartisan if not nearly unanimous . do those all need explanations ? they only have two years right now and i do n’t want them spending less time legislating .”
Content Richness:
1 Generic response with no useful information about the topic
e.g., “i do n’t agree with your point about legislation but i ’m not going to change your view.”
3 With one or two key pieces of information that are useful as a counterargument
e.g., “i agree that this is a problem for congress term because currently it is too short.”
5 With sufficient key information that is useful as a counterargument
e.g., “congressional terms are too short and us hourse reps have to spend half of their time compaigning and securing campaign funds. they really have like a year worth of time to do policy and another year to meet with donors and do favors.”
Table 5: Sample statement with explanations on aspect scales. Due to the likely ambiguity in Appropriateness, we provide explanations on every possible score. Example counter-arguments are also given alongside explanations.
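The protocol above yields, for each system, a 1 to 5 rating per aspect from every annotator on every statement, with the first 3 statements per annotator set aside for calibration. A minimal sketch of the resulting per-aspect averaging is shown below; the nested-list layout, the aspect keys, and the function name `mean_scores` are assumptions made for illustration, and the toy ratings are not data from the study.

```python
from statistics import mean

ASPECTS = ("grammaticality", "appropriateness", "content_richness")


def mean_scores(annotations, n_calibration=3):
    """Average each aspect over annotators and non-calibration statements.

    `annotations[a][s]` is a dict of 1-5 ratings that annotator `a` gave to one
    system's counter-argument for statement `s`.
    """
    kept = [
        rating
        for per_annotator in annotations
        for rating in per_annotator[n_calibration:]
    ]
    return {aspect: mean(r[aspect] for r in kept) for aspect in ASPECTS}


# Toy ratings (hypothetical): two annotators, three calibration items plus one scored item.
calibration = {"grammaticality": 5, "appropriateness": 5, "content_richness": 5}
toy = [
    [calibration] * 3 + [{"grammaticality": 4, "appropriateness": 3, "content_richness": 2}],
    [calibration] * 3 + [{"grammaticality": 2, "appropriateness": 3, "content_richness": 4}],
]
print(mean_scores(toy))
```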

a.5 Sample Output

In Figure 6 we show two sample snippets of our model outputs where reused keyphrases are highlighted in colors. Notice that even though at test time we disallow the same keyphrase from being selected more than once as input to the text planning decoder, the content realization decoder can still generate the same keyphrase multiple times. In the second example, the phrase death penalty is used three times across two sentences, yet it remains natural and relevant in context. We further include three complete sample outputs for different systems alongside the reranked passages in Figures 7 to 11.

Input: Gun control will do nothing to solve gun violence
Keyphrase Memory: gun control; gun crime; reduce gun violence; homicide rate; gun violence; gun owners; american gun deaths; …
Model Output: I’m not sure if I’m going to try to change your view, but I don’t think it’s accurate to say that gun control is the only way to reduce gun violence. However, I think it’s important to note that there is a significant amount of gun violence in the United States, which is why it’s been used as a way to reduce crime …
Input: The US should abolish the death penalty. It is morally wrong, economically inefficient, (…)
Keyphrase Memory: death penalty; capital punishment; killing someone; crime; murder; …
Model Output: the death penalty is not the only thing that can be done in the first place . death penalty is n’t just about killing people , it ’s about the fact that the death penalty itself is not the same thing as murder . (…)
Figure 6: Sample arguments by our model. We highlight the reused keyphrases from keyphrase memory with colors, and filler sentence with italics.
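The test-time constraint mentioned above, that a keyphrase may be selected at most once as planning input, can be sketched as a running mask over the keyphrase memory. The per-step selection scores, the 0.5 threshold, and the function name `select_keyphrases` are hypothetical; the sketch illustrates only the masking idea and says nothing about how the realization decoder later reuses phrases in its output.

```python
def select_keyphrases(step_scores, memory, threshold=0.5):
    """Pick keyphrases for each sentence step, never reusing one across steps.

    `step_scores[t][i]` is a hypothetical selection score for keyphrase `i`
    at sentence step `t`; the threshold is an assumption of this sketch.
    """
    used = set()
    plan = []
    for scores in step_scores:
        chosen = [
            i for i, s in enumerate(scores) if s > threshold and i not in used
        ]
        used.update(chosen)
        plan.append([memory[i] for i in chosen])
    return plan


memory = ["death penalty", "capital punishment", "murder"]
scores = [
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.3],
    [0.6, 0.6, 0.9],
]
print(select_keyphrases(scores, memory))
```

In the toy run, "death penalty" scores above the threshold at every step but is only handed to the planner once; any later repetition in the generated text would have to come from the realization side.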
Input: The wage gap isn’t about gender. (…) So one factor definitely the fact that there are some government jobs that are seriously dominated by women and happen to pay really low (…)
Passage 1 Source: Wikipedia Stance: -24.65
Research has found that women steer away from STEM fields because they believe they are not qualified for them; the study suggested that this could be fixed by encouraging girls to participate in more mathematics classes. One of the factors behind girls’ lack of confidence might be unqualified or ineffective teachers. Teachers’ gendered perceptions on their students’ capabilities can create an unbalanced learning environment and deter girls from pursuing further STEM education. They can also pass these stereotyped beliefs unto their students. Studies have also shown that student-teacher interactions affect girls’ engagement with STEM. Teachers often give boys more opportunity to figure out the solution to a problem by themselves while telling the girls to follow the rules.
Passage 2 Source: The New York Times Stance: -24.01
How are the these pressures different for girls and women than they are for boys and men? If you could change one thing about typical and/or stereotypical gender roles, what would it be? 2. As a class, read and discuss the article “Girls Will be Girls” , focusing on the following questions: a. What does the author, Peggy Orenstein, mean when she says that many women are “struggling to find an ideal mix of feminism and femininity”? Do you agree? Why or why not? b. Why did some people get upset about the implicit “Girls Keep Out” sign on the cover of the “Dangerous Book for Boys”?
Passage 3 Source: The New York Times Stance: -7.91
Poverty is becoming defeminized because the working conditions of many men are becoming more feminized. Whether they realize it or not, men now have a direct stake in policies that advance gender equity. Most of the wage gap between women and men is no longer a result of blatant male favoritism in pay and promotion. Much of it stems from general wage inequality in society at large. IN most countries, women tend to be concentrated in lower-wage jobs. The United States actually has a higher proportion of skilled and highly paid female workers than countries like Sweden and Norway. Yet as a whole, Swedish and Norwegian women earn a higher proportion of the average male wage than American women because the gap between high and low wages is much smaller in those countries.
Passage 4 Source: The New York Times Stance: -21.75
Site Navigation Site Mobile Navigation Women and the Pay Gap In “How to Attack the Gender Wage Gap? Speak Up” (Dec. 16), a solution is proposed for the problem of pay inequality: make women stronger negotiators in securing their own salaries. But we should always remember that employers have an obligation to follow the law in the first place, and to pay men and women working in the same jobs the same pay. Fifty years ago, Congress decided as much by passing the Equal Pay Act - but since then the wage gap has narrowed little. There’s nothing wrong in women honing their negotiating skills - and some will succeed in getting higher pay.
Passage 5 Source: The Washington Post Stance: -6.89
But, under this metric for people with a college degree, there is virtually no pay gap at all.”) To be specific, “The [Bureau of Labor Standards] reports that single women who have never married earned 96% of men’s earnings in 2012,” two conservative economists relate. (The Post’s Wonkblog comes up with 91 percent. ) Job selection, risk taking (“Nearly all the most dangerous occupations, such as loggers or iron workers, are majority male and 92% of work-related deaths in 2012 were to men. Dangerous jobs tend to pay higher salaries to attract workers”), the choice to work part-time and factors other than discrimination explain the differences: While the BLS [Bureau of Labor Statistics]reports that full-time female workers earned 81% of full-time males, that is very different than saying that women earned 8% of what men earned for doing the same jobs, while working the same hours, with the same level of risk, with the same educational background and the same years of continuous, uninterrupted work experience, and assuming no gender differences in family roles like child care.
Figure 7: Top 5 retrieved passages after ranking for example in Figure 5 in the main paper.
Input: all drugs should be decriminalized. I am a firm believer that MDMA and LSD can be very therapeutic and eye opening. They can expand your consciousness and allow you to see things from a different perspective and have the ability to alter your life. (…)
Human: If harder drugs like cocaine, meth, heroin were legalized and easy to obtain, people like me would be curious and try them, not knowing that they definitely can ruin your life. Do you honestly believe someone can be a responsible user of legal heroin?
Seq2seq: i do n’t think it ’s a bad idea . there are plenty of people who are addicted to drugs . i do n’t want to be able to do drugs . however , i do n’t really care about anyone else . i ’m not going to try to change your view , but i ’m not sure what you ’re talking about .
Seq2seqAug: as far as i can tell , drug use is the source of the use of drug use . in fact , there is a lot of evidence to support that . drug use can be used as a way to reduce crime . the problem is that drug use does n’t make sense . it ’s like saying " if you do n’t like it , it ’s illegal . "
CANDELA: there is a lot of research that suggests that there is a substantial amount of drugs that can be used to cause harm to others . however , there is a significant amount of research that can lead to a lot of drug addiction in the us . there are a lot of things that can be attributed to drug use , drug use , substance abuse , etc . i do n’t think it ’s fair to say that drug use should be illegal , but i think it ’s important to note that it ’s not a gateway drug . it ’s just a symptom of the root cause of the drug addiction problem . it ’s not just about the
CANDELA input w/o passages: if you want to look at the effects of drug use , you ’ll find that there are a lot of people who are addicted to drugs like meth , heroin , and so on . in fact , there are some studies that show that there is a correlation between drug use and the availability of drugs in the us . in other words , there is a significant correlation between the effects of marijuana and drug use , and the correlation between marijuana and marijuana use . this is not to say that drug use is bad , but it is also a symptom of drug use . there are many studies that show a correlation between marijuana use and marijuana are more likely to result in a decrease in violent crime . this is a study that shows that it is not the case that the study found out that it was a result of drug use and it was found that it was found in the study .
Figure 8: Sample outputs for human and different systems.
Input: all drugs should be decriminalized. I am a firm believer that MDMA and LSD can be very therapeutic and eye opening. They can expand your consciousness and allow you to see things from a different perspective and have the ability to alter your life. (…)
Passage 1 Source: Wikipedia Stance: -28.88
In 2010, DrugScience produced a ranking of drug harms in the UK, the results of which garnered significant media attention. Drugs were assessed on two metrics - harm to users and harms to society. The report found heroin, crack cocaine, and methamphetamine to be the most harmful drugs to individuals, with alcohol, heroin, and crack cocaine as the most harmful to others. Overall, alcohol was the most harmful drug, with heroin and crack cocaine in second and third places. Most notably, it found the legal status of most drugs bears little relation to the harms associated with them - several class A drugs including ecstacy (MDMA), LSD and magic mushrooms featured at the very bottom of the list.The report found heroin, crack cocaine, and methamphetamine to be the most harmful drugs to individuals, with alcohol, heroin, and crack cocaine as the most harmful to others. Overall, alcohol was the most harmful drug, with heroin and crack cocaine in second and third places. Most notably, it found the legal status of most drugs bears little relation to the harms associated with them - several class A drugs including ecstacy (MDMA), LSD and magic mushrooms featured at the very bottom of the list. Similar findings were found by a Europe-wide study conducted by 40 drug experts in 2015.
Passage 2 Source: The Wall Street Journal Stance: -5.89
Something drastic needs to be done, and the steps suggested by Mr. Murdoch may be a good start. 3:13 pm October 8, 2010 Anonymous wrote: SHOW ME THE MONEY! 12:09 pm October 11, 2010 JustFacts wrote: Where is the "accountability" for the CIA and other corrupt govt. Wall Street-affiliated players involved with international drug smuggling for decades (!) – deliberately inundating communities & specific neighborhoods with heroin, cocaine, meth, pills (MDMA/ecstacy), etc. It is a documented fact that the CIA & corrupt elements of the U.S. govt. & freemasons have been involved in large-scale heroin distribution operations and also involved in the deliberately induced crack cocaine epidemic targeting black neighborhoods (for the purposes of social undermining & political-economic control).
Passage 3 Source: The Washington Post Stance: -10.05
(Photo by Brian Vastag) In the last few years, he saw a resurgence in legitimate research on MDMA as academics restarted clinical trials with MDMA as a therapeutic tool, publishing studies showing that the drug can help veterans come to terms with the trauma of war. “He was very depressed once MDMA was criminalized,” said Rick Doblin, president of the Multidisciplinary Association for Psychedelic Studies, which funds clinical trials of MDMA and LSD. “Sasha always felt these drugs didn’t open people up to drug experiences, but opened us up to human experiences of ourselves.” In 1985, with first lady Nancy Reagan’s “Just Say No” campaign in full swing, federal authorities banned MDMA with an unusual emergency action.
Passage 4 Source: The Washington Post Stance: -7.96
It wasn’t MDMA after all, but methamphetamine. A new review board quickly signed on to support Mithoefer’s study, but the irony of the wasted year wasn’t lost on him: The misidentified drug that had been deemed too toxic to evaluate for medical use, the drug that was far more toxic than MDMA, was already a prescription drug. Meanwhile, in the four years the MDMA study lingered between concept and reality, Donna Kilgore had been driven to the brink. She took "every anti-depressant you can name," tried a dozen therapists and an almost equal number of therapeutic approaches. But nothing made that numbness, panic and rage recede.
Passage 5 Source: The New York Times Stance: -5.67

For alcohol, the safety margin is 10 (330 divided by 33 equals 10). In other words, it takes 10 times as much alcohol to kill you as it does to give you a buzz. Note: All such calculations are very rough estimates, and severe toxic reactions can occur at much lower doses depending on the health of the individual. Based on my research, the safety margin of recreational substances as normally used is: 6 for heroin; 10 for alcohol; 15 for cocaine; 16 for MDMA; 20 for codeine; and 1,000 for LSD or marijuana. Cigarettes have little immediate risk of death because most of the very lethal nicotine is destroyed in the smoke.

Figure 9: Top 5 retrieved passages after ranking for example in Figure 8.
Input: the presidents new plan to restrict welfare beneficiaries from obtaining a green card is a bad policy . (…) legal immigrants who receive any significant form of Welfare or public benefits are given a large "black mark" on their record that is heavily factored in should they attempt to apply for a green card. (…)
Human: The policy is impacting individuals, not the group. Each individual is judged based on the merit they bring forward. If an individual takes government assistance, they are most likely not contributing positively to the US.
Seq2seq: i do n’t think it ’s fair to say that the government should be able to do this . if you want to get rid of illegal immigration , you need to be able to find a way to do it . you need to make sure that you do n’t want to get a job . it ’s just a matter of if you do n’t have a problem with the government .
Seq2seqAug: there are a lot of people who do n’t want to have children , but they do n’t have to be able to pay for it . it ’s the government ’s job to make sure that the immigrants do n’t have a job . they do n’t need to be deported . they have to go through the process of becoming a citizen .
CANDELA: i ’m not going to try to change your view , but i ’d like to point out that there are a number of undocumented immigrants in the united states , and i ’m not sure what you ’re talking about , but i do n’t think it ’s fair to say that undocumented immigrants are going to be more likely to be able to get the benefits of citizenship . i ’m not a fan of this , but i ’m going to focus on the fact that there is a significant amount of people who do n’t have access to welfare programs in the us , and that ’s not the case . it ’s not like
CANDELA input w/o passages: a lot of illegal immigrants in the united states are not in the same position as they are in the us . in fact , the majority of illegal immigrants do not want to pay for the services provided by the federal government . it ’s not that they are undocumented , it ’s that they do n’t have to worry about the fact that they ’re undocumented . they ’re not going to be able to get a job because they ca n’t afford to live in a society where they ca n’t get a job . if you ’re talking about illegal immigrants , it ’s not like they ’re in the same boat as you
Figure 10: Sample outputs for human and different systems.
Input: the presidents new plan to restrict welfare beneficiaries from obtaining a green card is a bad policy . (…) legal immigrants who receive any significant form of Welfare or public benefits are given a large "black mark" on their record that is heavily factored in should they attempt to apply for a green card. (…)
Passage 1 Source: The Wall Street Journal Stance: -6.57
Who wrote this? Is that you Obama? 11:00 pm February 2, 2011 Welfare Worker in Washington State wrote: I think it is erroneous to assume that a large percentage of welfare recipients are people of color. I see more white people with their hands out in our area. We do have a large percentage of illegals with US born children and immigrants from Russia here in Washington that are receiving benefits. The Russian immigrants bring in their parents and extended family that get SSI benefits for the first 5 years if they are 65 or older. They have large families - and even if they are working have a tendency to get close to 1000 in food benefits each month.
Passage 2 Source: The New York Times Stance: -9.50
Temporary workers 883 entries in 2015 Employees (and their families) on non-immigrant work visas like H-1B for specialty workers and H-2B for agricultural workers. Fiances of U.S. citizens 669 entries in 2015 Temporary visas for fiances of U.S. citizens and for spouses and children of U.S. citizens or green card holders who have pending immigrant visas. BARRED New Immigrants Like the original order, the new ban also applies to people from the six countries newly arriving on immigrant visas, which are issued based on employment or family status. People issued immigrant visas become legal permanent residents on arrival in the United States and are issued a green card soon after.
Passage 3 Source: The Wall Street Journal Stance: -12.09
The family of the Boston bombers (although here legally as "refugees") collected significant amounts of food stamps, housing subsidies, college subsidies and the like. At the same time, they had sufficient funds to travel to their homeland and other European destinations, despite having reported only modest earnings. Last week, it was reported that a Pakistani owner (presumably a legal immigrant) of a chain of 7-11’s was importing illegals to work in his stores. We can have a welfare society or open borders but we can’t have both. Any immigration reform has to stipulate that immigrants cannot receive any taxpayer funded benefits (federal, state or local) until after they have achieved citizenship.
Passage 4 Source: The New York Times Stance: -6.16
They are not entitled to a passport or a green card because they bypassed the legal mechanisms for obtaining such documents. In any other country they would be promptly deported, justifiably. I support immigration reform and personally do not feel any economic competition from illegal aliens. That being said, there is a difference between immigrants who have applied for and received citizeship or green cards and those who have not. There should be a fast track naturalization system for children of illegals, such as this student. He grew up here because of his parents’ actions, not his own. This is similar to being born here, which has historically entailed citizenship.
Passage 5 Source: The Washington Post Stance: -13.81
In 1996, Congress enacted a requirement that legal immigrants be present for five years before becoming eligible for benefits. But we have never categorically excluded immigrants from receiving public benefits. Until now. If approved, the new policy would effectively deter legal immigrants from using public benefits for which they are eligible, lest they later be denied a green card or be removed. The DHS could also apply “public charge” to legal immigrants who use benefits for their children (such as CHIP), even if the children are U.S. citizens. The Migration Policy Institute estimates that the new policy could have a chilling effect on some 18 million noncitizens and 9 million U.S.-citizen children who reside in families where at least one person uses Medicaid/CHIP, welfare, food stamps or SSI.
Figure 11: Top 5 retrieved passages after ranking for example in Figure 10.