Argumentative relation classification (ARC) is dedicated to determining the class of the relation which may hold between two arguments or elementary argumentative units (EAUs; we use this term to denote clauses or small clause-complexes – e.g., (0), (1) or (2) – which can be ‘instantiated’ in an argumentative debate). For instance, consider the following premises given the topic or conclusion (0) “Overall, marijuana is detrimental to your health.”:
(1) Use of marijuana causes chronic bronchitis and airflow obstruction.
(2) Cannabis does not need to be smoked to receive its potential health benefits.
In this case, (1) has a positive stance towards the conclusion (0); in contrast to (2), which has a negative stance towards the conclusion. Additionally, but not less importantly, we can say that (2) weakens (1): it casts doubt on its generality by hinting at cannabis application methods which do not involve combustion or inhalation. In this work, we subsume all relations which aim at undermining or weakening another argument or premise (‘undercut’, ‘rebuttal’, etc.) under attack (for a more in-depth view and discussion of argumentative relations we refer the reader to, e.g., Pollock:1995:CCB:526901, wd2009 and besnard2014constructing). The EAUs from our example and their connecting relations are outlined in the graph in Figure 1.
In a rhetorically structured argumentative text (e.g., an argumentative essay), (1) and (2) may appear in configurations such as On the one hand (1), on the other (2); (1), however, (2); etc. Under these circumstances, discourse context can predict argumentative relations very well. However, when moving from such a ‘closed scenario’ to a more ‘open-world’ setting, e.g., where EAUs have been mined from heterogeneous documents, we need to determine relations based on their content. In this paper, we show that our method works well in both scenarios. In fact, it is in the more general and more difficult content-based setting where our method provides the most benefits over previous work.
Systems which have learned to predict general argumentative relations have a decisive advantage over systems that have ‘only’ learned to predict argumentative stances: in an argumentative debate, a debater often will not simply bring forth an arbitrary argument that supports their stance on the topic. Instead, or additionally, they may select an argument which also attacks the opponent’s most recent argument. Therefore, we need not only knowledge about the stances of arguments towards topics, but also about their relations to other arguments. Our experiments show that our approach is a step towards this goal.
The remainder of this paper is structured as follows: after discussing related work in Section 2, we propose a simple reconstruction trick which allows us to embed an argumentative source–target pair in a relational discourse context, yielding a plausible and an implausible text variant (Section 3). In Section 4, we conduct experiments and ablation studies using (i) a standard task setup, where systems are allowed to see EAUs in their document context, and (ii) a more difficult ‘content-focused’ task setup, where systems are only allowed to see the spans of the EAU clauses. The code for this paper is available at https://gitlab.cl.uni-heidelberg.de/opitz/pr4arc.
2 Related work
In this section, we first provide an overview of the data, and the data issues people are confronted with when developing argumentative relation classification (ARC) systems. We proceed with an overview of existing ARC approaches and conclude by touching on other related tasks.
Argumentative relation data
For general argumentative relations, not many data sets have been developed. One of the largest consists of 402 argumentative student essays and is henceforth denoted by Essay [stab2014identifying, stab2017parsing]. It has been annotated, i.a., with EAU clauses and more than 3,000 relations which hold among them. By Essay-Content, we denote a version of Essay from which the discourse context is stripped, so that systems can only access the spans of the EAU clauses [opitzdissec]. This setup is more difficult since systems have to learn to model the content of two EAUs in order to successfully predict their relation. Essay and Essay-Content are described more extensively in Section 4.1, where we also show that our method is effective in both setups.
Another data set which is annotated with in-depth argumentative annotations is the Microtext corpus covering a variety of political debates in Germany [PeldszusStede-ECA:16]. While it has been annotated with a more fine-grained set of relations (e.g., rebutting attack, undercutting attack, linked support, example support) it is rather small in size (the recently extended version [skeppstedt-etal-2018-less] contains about 700 relation tuples). Similar to Essay-Content, a variant of the Microtext corpus exists where argumentative units are detached from discourse context [wachsmuth-etal-2018-argumentation]. We believe that systems that have learned to predict argumentative relations based on the content of argumentative units have advantages over systems which focus too much on contextual discourse clues. For example, content-focused systems can better be expected to solve large-scale cross-document tasks where EAUs are mined from many heterogeneous documents. Our reconstruction trick provides one step towards this goal: it exploits potential discourse configurations without depending on seeing the true discourse context.
A key reason for the scarcity of annotated general argumentative relations is that creating high-quality data for ‘premise-premise’ relations is a challenging task. It is perhaps even more challenging than creating data for argumentative stance detection, since there topics or conclusions are often ‘a priori’ well understood (e.g., Cannabis should be legalized) and always occur as the stance-relation target. In that sense, it may be easier and quicker to tell whether an argument supports a conclusion than to decide whether an argument supports another argument.
A linear SVM classifier that is trained on a diverse set of features provides competitive performance on Essay [stab2017parsing]. A subsequent joint global graph optimization step, similar to [peldszus2015joint, hou2017argument], yields no further improvement for classifying the relations in this data. The SVM classifier incorporates features extracted from the EAU spans as well as from their context (e.g., leading or trailing words). On Essay-Content, where systems only see the EAU clause spans, the performance of the SVM drops by more than 10 pp. macro F1 [opitzdissec] – an analysis indicates that the SVM focuses immoderately on features extracted from the EAU context and tends to neglect their actual content. This underpins the need for argumentative relation classification systems with a deeper understanding of argumentation, i.e., systems that base their predictions on the actual content of two EAUs – the method we present in this paper aims at exactly this.
The first neural approach for ARC [cocarascu2017identifying] proposes a neural network with a Siamese structure [koch2015siamese, mueller2016siamese]. By means of a shared weight space, it projects source and target EAU into a joint distributional vector space. Finally, it classifies the vector offset using a softmax function. The authors conduct experiments on a data set which comprises texts about movies, technology and politics.
A similar model has been adopted recently where (symbolic) knowledge from large background knowledge graphs is injected into the Siamese model by concatenating highly abstracted multi-hop knowledge paths to the source–target offset [kobbe_et_al:OASIcs:2019:10372]. Although consistent gains are observed when including the knowledge, they appear to be relatively small. In this respect, we believe that incorporating knowledge of the right form could further enhance the system we propose in this paper. However, as of now, it is an active topic of discussion whether (symbolic) background knowledge may help in automatic argumentation and, even more so, which (form of) knowledge would be needed.
Computational argument mining and analysis
Argumentation is ubiquitous and argumentative structures can be recovered from a broad spectrum of texts. For example, they can be recovered from online dialogue [swanson2015argument, budzynska2014towards] and scientific research articles [lauscher2018arguminsci, lauscher-etal-2018-argument], where, e.g., researchers may directly or indirectly convey arguments for why some method is better than another. By now, there exists a substantial body of research publications covering a variety of argument analysis topics. For a general overview, we refer the reader to lippi2016argumentation and peldszus2013argument.
Another task that can be addressed as a text plausibility ranking task is the resolution of difficult pronouns in the Winograd Schema Challenge [Levesque:2012:WSC:3031843.3031909, opitz-frank-2018-addressing]. To resolve shell nouns and abstract anaphora (e.g., ‘I like that.’), marasovic-etal-2017-mention utilize syntactic patterns to gather plausible candidate resolutions from a background corpus in order to extend the scarce training data.
3 Context reconstruction and model
In this section, we first propose a simple reconstruction trick which allows us to build minimal pairs of plausible and implausible argumentative texts. Then, we describe a Siamese neural sequence ranking model which addresses the task of ranking texts according to their plausibility.
Constructing plausible and implausible argumentative discourse contexts
Consider two EAU clauses s (source) and t (target), where we need to decide whether s supports or attacks t. In the absence of contextual discourse clues (to name just one such situation: consider a cross-document relation classification setup where s stems from a different document than t – any specific textual discourse context would not only be more or less unimportant, but would also bear the potential to confuse the system), a system must learn to predict this relation by considering the semantic content of s and t. We approach this task by offering two alternative context reconstructions and asking our model in which context s and t are more likely to appear. More precisely, our reconstruction trick is as follows:
(a) s. Additionally, t.

(b) s. Admittedly, t.
where (a) signals that two argumentative units likely stand in a support relation and (b) signals the opposite (‘attack’). In our experiments (Section 4), we also examine other possible discourse connectors for our reconstruction (e.g., moreover/however). From here, we ask our model which of the two reconstructions leads to a more plausible ‘reading’: (a) or (b)? E.g., consider the cannabis example from Section 1; applying our reconstruction trick yields the following implausible–plausible minimal pair (3a, 3b):
(3a) Use of marijuana causes chronic bronchitis and airflow obstruction. Additionally, cannabis does not need to be smoked to receive its potential health benefits.

(3b) Use of marijuana causes chronic bronchitis and airflow obstruction. Admittedly, cannabis does not need to be smoked to receive its potential health benefits.
Clearly, (3b) constitutes a more plausible reconstruction compared with (3a). Exactly this is what we desire our model to learn: assessing the fine-grained differences between two texts which differ in only one phrase. This phrase, however, determines whether the text in its entirety is implausible or plausible.
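To make the reconstruction step concrete, it can be sketched in a few lines of Python (the function name and the lower-casing of the target are our own choices; the default connectors are reconstruction (a)/(b) from above):

```python
def build_minimal_pair(source, target,
                       support_connector="Additionally,",
                       attack_connector="Admittedly,"):
    """Embed a source/target EAU pair into two competing discourse
    reconstructions: one signalling 'support', one signalling 'attack'."""
    def reconstruct(connector):
        src = source.rstrip(".") + "."          # ensure a sentence boundary
        tgt = target[0].lower() + target[1:]    # connector now opens the sentence
        return f"{src} {connector} {tgt}"
    return reconstruct(support_connector), reconstruct(attack_connector)

source = "Use of marijuana causes chronic bronchitis and airflow obstruction."
target = "Cannabis does not need to be smoked to receive its potential health benefits."
support_reading, attack_reading = build_minimal_pair(source, target)
```

For the example pair, this produces exactly the two readings (3a) and (3b).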
3.1 Loss and model
We argue that a ranking approach (which reading is more plausible?) is more suitable for addressing our problem than a classification approach (plausible vs. implausible). The reason is that ranking allows for a more relaxed and graded notion of textual plausibility: we want the model to prefer one variant rather than to categorically choose one variant. This is accomplished by minimizing the margin ranking loss over the training data D:

L(θ) = Σ_{(T+, T−) ∈ D} max(0, δ − f_θ(T+) + f_θ(T−)),    (1)

where T+ is the plausible and T− the implausible reconstruction of a training pair, δ ≥ 0 is the ranking margin, and f_θ is a plausibility prediction model parameterized by θ. The plausibility-prediction model which we use is described in detail in the following paragraphs. Since L is differentiable with respect to the model’s parameters θ, we can learn them with gradient descent.
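For a single training pair, the margin ranking loss can be sketched in plain Python (the concrete margin value used in the experiments is not stated in the text; the common default of 1.0 is our assumption):

```python
def margin_ranking_loss(score_plausible, score_implausible, margin=1.0):
    """Hinge-style margin ranking loss for one (plausible, implausible) pair.

    The loss is zero as soon as the plausible reading outscores the
    implausible one by at least `margin`; otherwise the model is penalized
    proportionally to how far it is from satisfying the margin.
    """
    return max(0.0, margin - (score_plausible - score_implausible))

def total_loss(pairs, margin=1.0):
    # sum the per-pair loss over all reconstructed training pairs
    return sum(margin_ranking_loss(p, i, margin) for p, i in pairs)
```

Note that the loss only cares about the difference between the two scores, which is what enforces a preference rather than a hard classification.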
We desire the function f_θ to return a number reflecting the plausibility of a text sequence made up of words w_1, …, w_n. In our case, this function is instantiated with (i) a Siamese reading encoder (Reading Encoder, Figure 2) and (ii) a Siamese plausibility prediction layer which produces a plausibility score for any given text (Plausibility Prediction, Figure 2). We now describe these two components more closely.
First, (i) we use a contextual language model (we use BERT [devlin-etal-2019-bert] to infer the contextual embeddings; in our ablation experiments, we also present results based on ELMo embeddings [peters-etal-2018-deep]) to infer a sequence of word embeddings e_1, …, e_n, which correspond to the words w_1, …, w_n. Here, we hope that already the contextual language model provides statistical information indicating whether a specific word sequence may be considered rather plausible or rather implausible (‘inductive bias’). The sequence of word embeddings is further multiplied, element-wise, by a sequence of positive indicator coefficient embeddings c_1, …, c_n (similar to opitz-frank-2019-argument). This allows the model to learn to better distinguish between the source, the target and the connector text (we learn three corresponding indicator embeddings). The resulting sequence is further processed by (ii) a Bi-LSTM [hochreiter1997long] to construct hidden states h_1, …, h_n (we concatenate the hidden states of the forward and backward read) and (iii) a four-headed scaled dot-product self-attention mechanism [NIPS2017_7181]:

head_i = softmax( (H W_i^Q)(H W_i^K)^T / sqrt(d_k) ) (H W_i^V),  i = 1, …, 4,

where H = [h_1; …; h_n] and the W_i^Q, W_i^K, W_i^V are parameters of the model. Finally, we compute a weighted average of the final sequence of hidden states to construct a vectorized reading representation [felbo-etal-2017-using]:

alpha_t = exp(w_a · h’_t) / Σ_i exp(w_a · h’_i),    r = Σ_t alpha_t h’_t,

where h’_t is the vector corresponding to time step t computed by the previous scaled dot-product attention step and r is the final vectorized representation of the input reading.
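The attention-weighted average over the hidden states can be sketched in NumPy (a sketch assuming a single learned attention vector in the style of Felbo et al.; all names are ours):

```python
import numpy as np

def attentive_average(H, w_a):
    """Pool a sequence of hidden states H (shape: T x d) into one reading
    representation r (shape: d) via softmax attention with vector w_a."""
    scores = H @ w_a                         # unnormalized score per time step
    alpha = np.exp(scores - scores.max())    # numerically stable softmax
    alpha = alpha / alpha.sum()
    return alpha @ H                         # r = sum_t alpha_t * h_t
```

The softmax makes the pooled representation focus on the time steps the learned vector scores highest, rather than averaging all of them uniformly.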
At plausibility prediction time, the vector representation r obtained in the previous step is mapped to a single score by means of a linear combination with a weight vector w_p. Lastly, a selu function [DBLP:journals/corr/KlambauerUMH17] produces the desired plausibility score:

f_θ(T) = selu(w_p · r).
This score, computed once for each of the two competing reconstructions, allows a comparison with respect to their (predicted) plausibility. For our ARC experiments, where we desire a final classification, we predict the argumentative relation class by inspecting the discourse connector of the reconstruction which obtains the higher plausibility score. E.g., if the reconstruction built with the supporting connector (e.g., Additionally) obtains the higher score, we predict the argumentative ‘support’ relation – otherwise we predict the ‘attack’ relation.
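A sketch of the scoring head and the final decision rule (the SELU constants are those of Klambauer et al.; tie handling is our assumption, as the text does not specify it):

```python
import numpy as np

SELU_ALPHA, SELU_SCALE = 1.6732632423543772, 1.0507009873554805

def selu(x):
    # SELU activation for a scalar input
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (np.exp(x) - 1.0))

def plausibility_score(r, w_p):
    # linear combination of the reading representation, squashed by selu
    return selu(float(np.dot(w_p, r)))

def predict_relation(r_support_reading, r_attack_reading, w_p):
    """The relation whose reconstruction reads as more plausible wins;
    ties default to 'support' (an assumption)."""
    s_sup = plausibility_score(r_support_reading, w_p)
    s_att = plausibility_score(r_attack_reading, w_p)
    return "support" if s_sup >= s_att else "attack"
```

Since selu is monotonic, comparing the two selu scores is equivalent to comparing the two linear scores; the activation mainly shapes the score range during training.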
4 Experiments

We begin this section by describing the experimental setup used to evaluate our neural plausibility ranker. Next, we present our main results and finally perform several analyses and study the effects of ablating model components.
To construct plausible and implausible texts, we experiment with eight different discourse connectors which have the potential to ‘signal’ argumentative relation types. They make up, in total, four minimal pairs (Table 1).
| Pair | support connector | attack connector |
| A/A  | Additionally,     | Admittedly,      |
| A/D  | I agree,          | I disagree,      |
| M/H  | Moreover,         | However,         |
| Y/N  | Yes,              | No,              |

Table 1: The four discourse connector minimal pairs.
4.1 Data

We use the student essay corpus v02 [stab2017parsing] in two versions: Essay and Essay-Content. Common to both is that they contain data from the same 402 argumentative essays written by students about a variety of topics. The essays have been annotated with, i.a., the spans of argumentative units and their relations with each other (support vs. attack). Since only the argumentative clauses have been annotated, we can detach the EAUs from their discourse context, which yields Essay-Content. For example, consider the sequence s. To add on this, t. While in Essay a system is allowed to see the EAU-surrounding tokens (to add on this), in Essay-Content systems are allowed to see only the spans of the EAUs to predict their relation (i.e., only s and t). In the easy case, to add on this may be enough to predict a support relation with high confidence and accuracy without even seeing the content of the EAUs – in the hard case, however, a system must learn to assess the actual content of the premises. In Essay-Content, the performance of the feature-based SVM described by stab2017parsing drops by more than 23% macro F1 compared to the standard setup (Essay), where shallow discourse context is accessible [opitzdissec].
We display the results of a competitive feature-based SVM. It requires, i.a., syntactic parsing, constituency-tree sentiment annotation [socher-etal-2013-recursive] and discourse parsing [pdtbdiscou] as pre-processing steps [stab2017parsing, opitzdissec]. In contrast, our method does not depend on any pre-processing.
For each possible minimal pair, we instantiate a different model based on the pre-trained BERT model (the BERT model remains fixed during optimization). More specifically, we infer the word embeddings and average over the last four layers to produce a sequence of vectors with 1024 dimensions. The forward and backward LSTMs have 256 neurons each. For development purposes, we split off 1149 examples from the training data. The ranking loss (Eq. 1) is minimized by performing stochastic gradient descent with Adam [kingma2014adam] (the learning rate is set to 0.001, the mini-batch size to 64 and the maximum number of epochs to 25). After each epoch, the model is evaluated on the development data. Finally, we select the parameters from the epoch with the maximum F1 score on the development data.
In our tables, each model is denoted ArgRanker_x, where x ∈ {A/A, A/D, M/H, Y/N} indicates which pair of discourse connectors was used for reconstruction. ArgRanker_ens denotes a model where we aggregate the predictions of the four different minimal-pair single models (‘ensemble model’). All results are averaged over five runs.
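The ensemble aggregation can be sketched as a simple majority vote (with four voters ties can occur; how they are broken is not stated in the text, so falling back to the majority class ‘support’ is our assumption):

```python
from collections import Counter

def ensemble_vote(predictions):
    """Majority vote over the relation predictions of the single models."""
    counts = Counter(predictions)
    if counts["support"] == counts["attack"]:
        return "support"   # tie-break assumption: fall back to the majority class
    return counts.most_common(1)[0][0]
```
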
| Model             | Essay | Essay-Content |
| SVM with features | 68.0  | 57.3          |

Table 2 (excerpt): macro F1 results; †: the improvement over the SVM holds even when accounting for the standard deviation.
4.2 Results

Macro F1 results
Table 2 lists the macro F1 results of our experiments (macro F1 in our case is defined as the unweighted mean over the F1 scores for our two classes).
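Macro F1, the unweighted mean over the per-class F1 scores, can be computed from per-class counts as follows (a sketch; the helper name and input format are ours):

```python
def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1.

    `per_class_counts` maps each class label to a (tp, fp, fn) triple of
    true positives, false positives and false negatives.
    """
    f1_scores = []
    for tp, fp, fn in per_class_counts.values():
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```

Because each class contributes equally, macro F1 is sensitive to performance on the rare attack class, unlike accuracy.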
On Essay, our method is competitive with the SVM that relies on extensive pre-processing. On Essay-Content, where models are forced to learn to assess the content of EAUs, our method outperforms the feature-based SVM across all configurations. The best performance on this data is provided by ArgRanker_M/H, which is trained on Moreover–However reconstructions (+6.5 pp. macro F1, relative improvement: 11%). Our ensemble model ArgRanker_ens, which aggregates the predictions of the individual ArgRankers in a simple vote, achieves an improvement of +3.4 pp. macro F1 (relative improvement: 6%).
More detailed results
Table 3 indicates that our method offers further advantages beyond the raw macro F1 gains. The very rare attack class is detected with much greater precision than by the SVM. The improvement ranges from +5.6 pp. (relative improvement: 28%) for one connector configuration up to a maximum of +31.4 pp. (relative improvement: 157%) for another. With such a large increase in precision, one might expect a drop in recall – however, this is the case only to a very small extent.
| Model             | attack P | attack R | support P | support R |
| SVM with features | 20.0     | 22.9     | 93.0      | 91.8      |

Table 3 (excerpt): per-class precision (P) and recall (R).
The greatest drop in recall is incurred by ArgRanker (-5.2 pp.) and thus can be said to lie in the shadow of its precision gains (+31.4 pp.). Moreover, when we use the discourse connector minimal pairs A/D and M/H, our model outperforms the SVM on the attack class in both precision and recall. Most notably, when we instantiate our reconstructions with Moreover/However, we see a large gain in precision (+19.9 pp., relative improvement: 99.5%) as well as an observable gain in recall (+7.1 pp., relative improvement: 31.0%).
With regard to the majority class (support), we make two observations: (i) precision-wise, all of our models outperform or are on par with the SVM; (ii) recall-wise, all of our models outperform the SVM. The greatest gain in recall for support is achieved by ArgRanker (+6.6 pp.).
4.3 Ablation experiments and analysis
| Model                              | macro F1 | -coeff | -att | ELMo |
| SVM [stab2017parsing, opitzdissec] | 57.3     | –      | –    | –    |

Table 4 (excerpt): ablation results (macro F1 on Essay-Content).
Linguistically motivated discourse reconstruction
What is the outcome of instantiating the discourse reconstructions with ‘meaningless’ connectors? That is, instead of instantiating the attack/support context with linguistically motivated connectors such as I agree/I disagree, we instantiate the contexts with the meaningless tokens ‘+’ and ‘-’. On the one hand, this means that the new discourse configuration is still discriminative (either supporting or attacking). On the other hand, the discriminating reconstruction is no longer linguistically motivated. Thus, we hypothesize that the linguistically motivated reconstructions better ‘trigger’ the contextual BERT model into providing a useful inductive bias about whether a certain reading is plausible or not.
From Table 4 and Figure 3, we see that, indeed, our model works better when provided with linguistically motivated reconstructions instead of the non-linguistically motivated reconstruction (Figure 3: columns A/A, A/D, M/H, Y/N vs. the bottom row in Table 4 and the right column in Figure 3). This holds true across all model configurations and all linguistically motivated discourse connector pairs (one exception is the model based on ELMo embeddings, which appears to work better when provided with the non-linguistically motivated connector pair).
More specifically, we find that the Moreover/However reconstruction appears to offer the most useful inductive bias (middle column, Figure 3). Our ArgRanker_M/H based on this reconstruction outperforms all other configurations by more than 4 pp. macro F1 (compared with Agree/Disagree) and by more than 6 pp. macro F1 compared with the non-linguistically motivated reconstruction. One reason could be that BERT was trained, i.a., on the Wikipedia corpus: we compute a simple word frequency statistic over this corpus and see that the terms Moreover and However appear more frequently (e.g., however: approx. 29,900,000 occurrences) than, e.g., Admittedly (approx. 17,000 occurrences). Also, by manually inspecting a small number of occurrences in Wikipedia, we find that moreover and however tend to occur in more ‘argumentative’ contexts, or, at least, connect two discourse units in a contrasting (however) or supporting (moreover) way. On the other hand, I agree, for instance, tends to occur in less argumentative contexts, such as in I agree to the terms of service. We believe that contextual language models trained on interactive discourse texts (e.g., online discussion platforms) instead of encyclopedic texts would greatly help to provide our model with better embeddings in situations where we want to compose plausible and implausible texts by means of more ‘interactive’ connectors (I agree/I disagree; Yes/No; etc.).
BERT vs. ELMo
In this ablation experiment, we replace the BERT embeddings with ELMo embeddings – we want to ‘probe’ which of the two embedding generators is better suited to rank argumentative texts according to their plausibility. First, we see that the ELMo embeddings provide better performance than the feature-based baseline, with one exception: ArgRanker_A/A, where we reconstruct contexts by inserting Additionally and Admittedly (Table 4, ELMo). Second, the ELMo embeddings in most cases fall short of the BERT embeddings – again, however, with one exception: the ArgRanker which does not use the linguistically motivated reconstruction.
Indicator embedding coefficients
Now, we want to investigate whether learning coefficients to better distinguish between source and target has helped our model. Recall that the three coefficient indicator embeddings correspond to the source EAU span, the target EAU span and the discourse connector span, and allow the model to weight certain word embedding indices differently with respect to these three spans. For most connector pairs, learning the coefficients helps, and their ablation leads to a performance drop (Table 4, -coeff; e.g., ArgRanker: -2.7 pp. macro F1).
Finally, we plot the learnt coefficients of the three different indicator embeddings against each other to analyze their appearance after training. Figure 4 displays all values from all discourse connector parameterizations of ArgRanker. More specifically, we are interested in the following question: Have we learned that certain contextual word embedding indices are important to inflate (deflate) with respect to the source or the target? From inspecting Figure 4, we see that this appears to be the case. For example, there is a set of embedding indices where coefficients are used to magnify the corresponding values in the target EAU and deflate them in the source EAU (Figure 3(a), top left region) – while for another set of embedding indices the opposite is true (Figure 3(a), bottom right). Furthermore, the learnt coefficients have assumed a normal-like distribution after training (distribution plots on the sides of Figures 3(a), 3(b), 3(c)).
Self-attention

Finally, we investigate the effect of ablating the self-attention mechanism from our model. More precisely, we predict the plausibility scores based on a concatenation of the last states of the forward and backward reads of the Bi-LSTM. Across all discourse reconstruction strategies, we see drops in performance (Table 4, -att). However, while the drops are considerable in some cases (ArgRanker: -4.5 pp. macro F1), they are comparatively small in others (ArgRanker: -0.1 pp.).
5 Conclusion

We have treated argumentative relation classification in a new light: as a task where we learn to rank candidate texts according to their plausibility. To this aim, we have proposed a simple reconstruction trick which allows us to embed source and target argumentative units into plausible and implausible argumentative discourse contexts. In order to learn to rank such texts according to their plausibility, we have adapted a neural Siamese ranking model. Our experiments on an established data set have shown that the approach is competitive with previous work even though it does not require any pre-processing. In the ‘content-based’ setting – which is more difficult because models cannot base their decisions on shallow clues in the discourse context – the method outperforms previous work by a considerable margin. In particular, with respect to the scarce class attack, we observed substantial improvements in precision.
Acknowledgments

We are grateful to Anette Frank for valuable discussions and feedback on an earlier draft of this paper. This work has been supported by the Leibniz ScienceCampus “Empirical Linguistics and Computational Language Modeling”, supported by the Leibniz Association grant no. SAS-2015-IDS-LWC and by the Ministry of Science, Research, and Art of Baden-Württemberg.