DeepA2: A Modular Framework for Deep Argument Analysis with Pretrained Neural Text2Text Language Models

10/04/2021 ∙ by Gregor Betz, et al. ∙ KIT ∙ Allen Institute for Artificial Intelligence

In this paper, we present and implement a multi-dimensional, modular framework for performing deep argument analysis (DeepA2) using current pre-trained language models (PTLMs). ArgumentAnalyst – a T5 model (Raffel et al. 2020) set up and trained within DeepA2 – reconstructs argumentative texts, which advance an informal argumentation, as valid arguments: It inserts, e.g., missing premises and conclusions, formalizes inferences, and coherently links the logical reconstruction to the source text. We create a synthetic corpus for deep argument analysis, and evaluate ArgumentAnalyst on this new dataset as well as on existing data, specifically EntailmentBank (Dalvi et al. 2021). Our empirical findings vindicate the overall framework and highlight the advantages of a modular design, in particular its ability to emulate established heuristics (such as hermeneutic cycles), to explore the model's uncertainty, to cope with the plurality of correct solutions (underdetermination), and to exploit higher-order evidence.




1 Introduction

Argumentative text analysis is an interpretation method for clarifying arguments (Fisher, 2004). Being studied in argumentation theory, logic, or epistemology, it is widely taught and applied as a key critical thinking skill in, e.g., law (Alexy, 1989), the humanities (Bruce and Barbone, 2011), social sciences (Fairclough and Fairclough, 2012), policy advice (Hansson and Hirsch-Hadorn, 2016), or public debate (Beck, Neupane, and Carroll, 2019). This paper presents a computational approach for deep argument analysis, i.e., for reconstructing natural-language arguments from a given text, as in the following example (adapted from Siegel, 2018):

source text

It is unethical to destroy human embryos. The most basic argument supporting this claim just stresses that it is wrong to intentionally kill innocent human beings.

reconstructed argument

(P1) It is impermissible to kill innocent human beings.
(P2) The human embryo is an innocent human being.
(C) Thus: It is impermissible to kill the human embryo.

The literature on argument reconstruction (cf. Feldman, 1998; Scholz, 2000; Lau, 2011; Bowell and Kemp, 2014; Brun, 2014; Brun and Betz, 2016) characterizes deep argument analysis as:

  • a complex task involving a variety of sub-tasks, such as identifying reasons and conclusions in a text, formalizing sentences, checking validity of an inference, logical streamlining, or explicating implicit premises.

  • a non-conservative, creative task that goes beyond mere text annotation and essentially generates a new, more transparent text.

  • an iterative process through which reconstructions are built and revised step-by-step, and the solution space is gradually explored.

  • a hermeneutical task, guided by the principle of charity, which urges one to come up with an interpretation (reconstruction) as strong and plausible as possible.

  • assuming a normative background theory about what constitutes a strong and plausible argument in the first place.

  • being affected by severe underdetermination, both in terms of the process and the final outcome; in particular, there typically exist rival, yet equally legitimate reconstructions of one and the same text.

Given these special characteristics, deep argument analysis poses many challenges for machine models of natural language understanding. In this paper, we introduce a novel modular modeling approach for analysing complex argumentation that builds on recent pre-trained text2text transformers (Raffel et al., 2020). Our approach – DeepA2 (illustrated in Figure 1) – works by systematically decomposing a complex reconstruction problem into smaller text2text sub-tasks (see Section 3), which allows for emulating the types of interpretation strategies and heuristics studied in argument theory. Referring to the different components of a comprehensive argumentative analysis, we may also define tailor-made metrics for assessing argument reconstructions. To demonstrate the benefits of our approach, we construct a new argumentation dataset (AAAC) that exhibits several complex interpretive dimensions, show how to map other existing datasets into our framework (Section 4), and train and evaluate our main model, referred to as ArgumentAnalyst, within DeepA2 (Section 5).

input: conjectures: source: Socrates is mortal because every human is.
output: Socrates is mortal

input: argdown: source: Socrates is mortal because every human is. conjectures: Socrates is mortal
output:
(1) Socrates is human.
(2) If someone is human, then they are mortal.
(3) Socrates is mortal.

input: formalize: premises: Socrates is human | If someone is human, then they are mortal
output: F a | (x): F x -> G x

Figure 1: Example text-to-text tasks for deep argument analysis, defined by DeepA2.

Our empirical results show that

  1. ArgumentAnalyst generates – out-of-domain – semantically meaningful argument reconstructions, 70% of which are logically valid. By pooling alternative reconstructions, virtually every source text in the synthetic dataset can be reconstructed as a valid argument.

  2. Modular generation chains which emulate iterative reconstruction strategies are highly successful: they yield, in particular, a more coherent interpretation of an argumentative text, exploit the text more thoroughly, and generally outperform one-step generation as soon as problems become difficult.

  3. ArgumentAnalyst outperforms EntailmentWriter (Dalvi et al., 2021) on difficult EntailmentBank problems with respect to telling apart relevant premises from distractors.

  4. ArgumentAnalyst generates reliable higher-order evidence (Christensen, 2010) which can be used for diagnosing logical fallacies – despite the fact that ArgumentAnalyst is maximally charitable and is trained to reconstruct any input whatsoever as a logically valid argument, even if the input argument, taken at face value, is painstakingly fallacious.

In concluding this paper, we sum up and interpret these findings as a general vindication of DeepA2’s modular, multi-angular design (Section 6).

2 Related Work

Taking transformers as soft reasoners, recent work, pioneered by Clark, Tafjord, and Richardson (2020), has shown that pre-trained language models (PTLMs) possess basic deductive and abductive reasoning capabilities on diverse domains (Banerjee and Baral, 2020; Betz, Voigt, and Richardson, 2020; Bostrom et al., 2021), but are equally prone to fallacies and biases (Kassner and Schütze, 2020; Talmor et al., 2020). Besides drawing the correct conclusion, transformers are able to generate correct reasoning chains that justify an answer, which in turn further increases answer accuracy (Saha et al., 2020; Tafjord, Mishra, and Clark, 2020; Gontier et al., 2020; Saha, Yadav, and Bansal, 2021; Dalvi et al., 2021).

Neural semantic parsing uses sequence models to formalize natural language sentences (Kamath and Das, 2019). Shin et al. (2021) show that PTLMs are zero-shot parsers, and that intermediate steps which rephrase and streamline the original input before parsing it to a formal language improve accuracy.

Argument mining is an active research field that studies computational methods for retrieving argumentative components from a text corpus (Wachsmuth et al., 2017; Moens, 2018; Potthast et al., 2019; Lawrence and Reed, 2020). Recently, work in this field has started to use PTLMs: Ein-Dor et al. (2020) and Gretz et al. (2020) succeed in retrieving relevant pro- or con-arguments for a given topic from a large corpus with a fine-tuned BERT model (Devlin et al., 2019). Using BERT, Bar-Haim et al. (2020) map argumentative texts to key points that succinctly summarize the argument’s gist. Akiki and Potthast (2020) explore abstractive argument retrieval by means of text generation with GPT2 (Radford et al., 2019). Similarly, Syed et al. (2021) deploy BART (Lewis et al., 2019) to generate conclusions of argumentative texts on a challenging corpus compiled from Reddit and various online debate corpora. Rodrigues et al. (2020), revisiting the argument comprehension task (Habernal, Eckle-Kohler, and Gurevych, 2014; Habernal et al., 2018), demonstrate that identifying implicit premises – and deep argument analysis a fortiori – remains a hard, unsolved task.

3 Framework

3.1 Problem Definition

Deep argument analysis of a given text seeks to answer the following central question: Can we make sense of the text as a presentation of a rational argument? And if so, what exactly is the argument; and how precisely is it related to the text?

In carrying out a deep argument analysis, one explicates, rephrases and rebuilds – even repairs – the text’s argument in one’s own words. That is why deep argument analysis is also referred to as rational reconstruction (cf. Leitgeb and Carus, 2021). The reconstructed argument forms, together with details about its logical properties and about its relation to the source text, a comprehensive argumentative analysis of a text. The latter can be seen as an interpretative hypothesis that is abductively inferred from a source text by means of an inference to the best explanation.

A compelling argumentative analysis yields (i) a rational argument that is (ii) closely related to the source text. Deep argument analysis is, accordingly, guided by a dual goal (cf. Brun and Betz, 2016). An argument reconstruction should both be

  • systematically correct, i.e., the reconstructed argument itself is, e.g., transparent, deductively valid, non-circular, or doesn’t contain irrelevant premises; and

  • exegetically adequate, i.e., the reconstructed argument accounts for the original text, because, e.g., its premises merely reformulate parts of the text, or because its overall inferential structure can be traced within the source text.

The fact that there typically exists – regarding a specific text – a trade-off between these two goals is one major reason for the underdetermination of deep argument analysis and the plurality of legitimate reconstructions of a given text (cf. Brun and Betz, 2016).

Against this background, we may finally define the problem of

Deep artificial argument analysis:

Describe, analyse and implement an effective computational system for deep argument analysis!

3.2 Multi-angular Data

The DeepA2 framework is built upon a multi-angular data structure whose dimensions represent the essential components of a comprehensive argumentative analysis (see Section 3.1). Structured argumentative data is rendered as plain text (cf. Voigt, 2014). The different data dimensions, which are related as shown in Figure 2, are (with an illustrating example):

argument source text (S)

It is unethical to destroy human embryos. The basic argument supporting this claim just stresses that it is wrong to intentionally kill innocent human beings.

verbatim reason statements in the source text (R)

it is wrong to intentionally kill innocent human beings (ref: (1))

verbatim conjectures in the source text (J)

It is unethical to destroy human embryos (ref: (3))

argument reconstruction (A)

(1) It is impermissible to kill innocent human beings.
(2) The human embryo is an innocent human being.
– with hypothetical syllogism from (1) (2) –
(3) It is impermissible to kill the human embryo.

premises of the reconstructed argument (P)

It is impermissible to kill innocent human beings | The human embryo is an innocent human being

final conclusion of the reconstructed argument (C)

It is impermissible to kill the human embryo

formalizations of premises (F)

(x): F x -> G x | (x): H x -> F x

formalization of conclusion (O)

(x): H x -> G x

keys for the formalizations’ constants (K)

F: innocent human being | G: must not be killed | H: human embryo

Each record in a DeepA2 dataset contains a source text plus a legitimate comprehensive argumentative analysis, which is, given underdetermination, not necessarily the only compelling reconstruction of the text; moreover, a dataset may contain different records with one and the same source text analysed in several ways.
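A record with these nine dimensions could be represented roughly as follows. This is a minimal sketch: the class name `DeepA2Record` and all field names are illustrative assumptions, not the format of the released datasets, and the arrow notation for formalizations follows Figure 1.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DeepA2Record:
    """One source text plus one comprehensive argumentative analysis.

    Field names are hypothetical; the letters in the comments are the
    dimension labels used in the text.
    """
    source_text: str                                         # S
    reasons: List[str] = field(default_factory=list)         # R
    conjectures: List[str] = field(default_factory=list)     # J
    argument: str = ""                                       # A
    premises: List[str] = field(default_factory=list)        # P
    conclusion: str = ""                                     # C
    premise_forms: List[str] = field(default_factory=list)   # F
    conclusion_form: str = ""                                # O
    keys: Dict[str, str] = field(default_factory=dict)       # K


record = DeepA2Record(
    source_text=(
        "It is unethical to destroy human embryos. The basic argument "
        "supporting this claim just stresses that it is wrong to "
        "intentionally kill innocent human beings."
    ),
    reasons=["it is wrong to intentionally kill innocent human beings"],
    conjectures=["It is unethical to destroy human embryos"],
    premises=[
        "It is impermissible to kill innocent human beings",
        "The human embryo is an innocent human being",
    ],
    conclusion="It is impermissible to kill the human embryo",
    premise_forms=["(x): F x -> G x", "(x): H x -> F x"],
    conclusion_form="(x): H x -> G x",
    keys={
        "F": "innocent human being",
        "G": "must not be killed",
        "H": "human embryo",
    },
)
```

Because a dataset may analyse the same source text in several ways, several such records can share one `source_text` value.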



[Diagram with edges labelled: quotes reasons in, quotes conjectures in, reconstructs argument in, refers to, quotes premises, quotes conclusion, provides keys for.]

Figure 2: Relationships between dimensions of the multi-angular argumentative data.

Beyond this structural and functional characterization, DeepA2 is agnostic about the nature and origin of the argumentative data. Synthetically generated, automatically retrieved, manually created datasets as well as translations of other databases are all compatible with the framework and can be used side by side.

3.3 Generative Modes and Chains

Given DeepA2’s multi-dimensional data structure described in the previous section, a generative mode maps data from some input dimensions to a target dimension. For example, the mode S → A takes a source text (S) as input and outputs an argument reconstruction (A); the mode R J → A reconstructs the argument (A) given the verbatim reasons (R) and conjectures (J). All in all, we define and investigate 21 different generative modes (see Appendix A.2). Every mode represents a task on which a text-to-text model can be trained.

By taking some mode’s output as another mode’s input, modes can be concatenated into generative chains. For example, the output of modes S → R and S → J (reasons and conjectures from source) can be fed into mode R J → A to reconstruct an argument. Such generative chains allow us to emulate different strategies (heuristics) for analysing a given argumentative text (see Appendix A.3 for technical details).
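The chaining idea can be sketched as follows. The prompt layout follows the pattern visible in Figure 1 ("target: dimension: text ..."); everything else is an assumption made for illustration: the dimension name `reasons`, the helper names, and `stub_generate`, which merely stands in for a trained text-to-text model and returns canned strings.

```python
from typing import Callable, Dict, List, Tuple

Mode = Tuple[Tuple[str, ...], str]  # (input dimensions, target dimension)


def format_prompt(mode: Mode, data: Dict[str, str]) -> str:
    """Render a mode as a text prompt, following the pattern in Figure 1."""
    inputs, target = mode
    parts = " ".join(f"{name}: {data[name]}" for name in inputs)
    return f"{target}: {parts}"


def run_chain(chain: List[Mode], source: str,
              generate: Callable[[str], str]) -> Dict[str, str]:
    """Apply modes in order; each output becomes available as later input."""
    data = {"source": source}
    for mode in chain:
        target = mode[1]
        data[target] = generate(format_prompt(mode, data))
    return data


def stub_generate(prompt: str) -> str:
    """Placeholder for a trained model; the outputs are hypothetical."""
    if prompt.startswith("reasons:"):
        return "every human is"
    if prompt.startswith("conjectures:"):
        return "Socrates is mortal"
    return ("(1) Socrates is human. (2) If someone is human, then they are "
            "mortal. (3) Socrates is mortal.")


# Reasons and conjectures from the source, then the argument from both.
chain = [
    (("source",), "reasons"),
    (("source",), "conjectures"),
    (("reasons", "conjectures"), "argdown"),
]
result = run_chain(chain, "Socrates is mortal because every human is.",
                   stub_generate)
```

Swapping `stub_generate` for a call to an actual model would leave the chaining logic unchanged, which is the point of the modular design.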

Three generative chains which model distinct interpretative strategies, taking a source text (S) as sole input, are:



  • straight

  • hermeneutic cycle

  • logical streamlining


For evaluation, we append to each generative chain the following sub-chain that formalizes the reconstructed argument:



A generative chain can be construed as a hypergraph on the dimensions of DeepA2’s multi-angular datasets, with each of its modes representing a directed hyper-edge. Summing up the number of input dimensions (except S) over all modes yields a simple graph centrality measure, which gauges a chain’s sophistication. Thus, straight, hermeneutic cycle and logical streamlining display a sophistication of 0, 4, and 11, respectively.
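The sophistication measure is simple enough to compute directly. In the sketch below, the two chains are hypothetical examples written for illustration; they are not taken from the paper's chain definitions, and the tuple representation of a mode is likewise an assumption.

```python
from typing import List, Tuple

Mode = Tuple[Tuple[str, ...], str]  # (input dimensions, target dimension)


def sophistication(chain: List[Mode]) -> int:
    """Sum the number of input dimensions other than S over all modes."""
    return sum(len([d for d in inputs if d != "S"]) for inputs, _ in chain)


# Hypothetical chains for illustration only.
direct_chain = [(("S",), "A"), (("S",), "R"), (("S",), "J")]
cyclic_chain = [
    (("S",), "A"),
    (("S", "A"), "R"),
    (("S", "A"), "J"),
    (("R", "J"), "A"),
]

direct_score = sophistication(direct_chain)
cyclic_score = sophistication(cyclic_chain)
```

A chain that only ever reads the source text scores 0, while every additional intermediate result that is fed back into a later mode raises the score by one.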

3.4 Metrics

As discussed in Section 3.1, an argument reconstruction should both be sound and make sense of the text to-be-interpreted. In line with the dual goal of argument analysis, we propose metrics both for the systematic correctness and for the exegetic adequacy of a given analysis. The following metrics measure the degree to which a given generated argument is systematically correct:


SYS-PP

1 if the argument is not a petitio principii (i.e., if no premise is identical with its final conclusion), 0 otherwise;

SYS-RP

1 if the argument has no redundant premises (i.e., if no premise occurs more than once), 0 otherwise;

SYS-RC

1 if the argument has no redundant conclusions (i.e., if no conclusion – intermediary or final – occurs more than once), 0 otherwise;

SYS-US

1 if all statements in the argument other than the final conclusion are explicitly used in an inference, 0 otherwise;

SYS-SCH

ratio of sub-arguments which correctly instantiate the explicitly stated inference scheme (e.g., hypothetical syllogism);

SYS-VAL

1 if the argument is globally valid (i.e., if the final conclusion deductively follows from the premises), 0 otherwise;
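The four checks above that concern petitio, redundant premises, redundant conclusions and unused statements might be sketched as follows. The representation of an argument (lists of statement strings, integer labels, and inferences given as pairs of used labels and derived label) is an assumption made for this example, as are the function names.

```python
from typing import List, Tuple


def no_petitio(premises: List[str], final_conclusion: str) -> int:
    """1 if no premise is identical with the final conclusion, else 0."""
    return int(final_conclusion not in premises)


def no_redundancy(statements: List[str]) -> int:
    """1 if no statement in the list occurs more than once, else 0."""
    return int(len(statements) == len(set(statements)))


def no_unused(labels: List[int], final_label: int,
              inferences: List[Tuple[List[int], int]]) -> int:
    """1 if every statement except the final conclusion is used, else 0."""
    used = {label for used_labels, _ in inferences for label in used_labels}
    return int(all(label in used for label in labels if label != final_label))


premises = [
    "It is impermissible to kill innocent human beings.",
    "The human embryo is an innocent human being.",
]
conclusions = ["It is impermissible to kill the human embryo."]
inferences = [([1, 2], 3)]  # statement (3) is inferred from (1) and (2)

scores = {
    "petitio_free": no_petitio(premises, conclusions[-1]),
    "premises_unique": no_redundancy(premises),
    "conclusions_unique": no_redundancy(conclusions),
    "all_used": no_unused([1, 2, 3], 3, inferences),
}
```

For the embryo example all four checks return 1; duplicating a premise, or adding a premise that no inference refers to, would flip the corresponding check to 0.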

All six systematic metrics can be computed automatically (SYS-SCH tries to parse the argument based on the inference schemes and templates used to construct the synthetic dataset in the first place; SYS-VAL passes the model-generated formalizations of premises and conclusion to a symbolic theorem prover (De Moura and Bjørner, 2008); and the remaining metrics check for string identity).
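To make the validity check concrete without depending on an external prover, the sketch below swaps in a brute-force model enumeration for the symbolic theorem prover mentioned above. It only handles sentences of the form (x): P x -> Q x over one-place predicates, with no constants and no identity; under that restriction a model is characterized by the set of element types it realizes, so all relevant models can be enumerated. The function names are hypothetical.

```python
from itertools import combinations, product

PREDICATES = ("F", "G", "H")


def all_models(predicates):
    """Yield every non-empty set of element types.

    An element type is the frozenset of predicates true of an element.
    For this restricted fragment, a model is characterized by the set of
    types it realizes.
    """
    types = [frozenset(p for p, keep in zip(predicates, bits) if keep)
             for bits in product([False, True], repeat=len(predicates))]
    for size in range(1, len(types) + 1):
        for realized in combinations(types, size):
            yield realized


def forall_implies(antecedent, consequent):
    """Build the sentence (x): antecedent x -> consequent x."""
    return lambda model: all(consequent in t for t in model if antecedent in t)


def is_valid(premises, conclusion, predicates=PREDICATES):
    """True if no model satisfies all premises yet falsifies the conclusion."""
    for model in all_models(predicates):
        if all(p(model) for p in premises) and not conclusion(model):
            return False
    return True


# (x): F x -> G x, (x): H x -> F x, therefore (x): H x -> G x
syllogism_valid = is_valid(
    [forall_implies("F", "G"), forall_implies("H", "F")],
    forall_implies("H", "G"),
)
# (x): F x -> G x, therefore (x): G x -> F x (affirming the converse)
converse_valid = is_valid([forall_implies("F", "G")], forall_implies("G", "F"))
```

With three predicates there are only 255 candidate models, so exhaustive search is cheap here; a general-purpose prover is needed once constants, relations or richer connectives enter the formalizations.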

Whereas systematic metrics apply primarily to the generated argument (A), a reconstruction’s interpretative adequacy will also depend on how reasons (R) and conjectures (J) coherently link the argument’s components to the original text. As a first set of exegetic metrics, we thus propose


EXE-MEQ

1 if the reasons and conjectures are mutually exclusive verbatim quotes from the source text, 0 otherwise;

EXE-RSS

semantic similarity (BLEURT, see Sellam, Das, and Parikh, 2020) of each reason statement and its counterpart premise in the reconstructed argument (if such exists, -1 otherwise);

EXE-JSS

semantic similarity (see EXE-RSS) of each conjecture statement and its counterpart in the reconstructed argument (if such exists, -1 otherwise).

Each source text presents (more or less faithfully) an underlying target argument, which in turn marks some of the text’s statements as ‘target’ reasons, others as ‘target’ conjectures. The following two metrics assess the degree to which a comprehensive argumentative analysis correctly predicts (R, J) those target reasons and conjectures.


EXE-PPR

predictive performance (F1-score) for identifying (target) reason statements in the source text;

EXE-PPJ

predictive performance (F1-score) for identifying (target) conjecture statements in the source text.
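For a single source text, the F1-score over predicted versus target statements can be computed as in the sketch below. Treating statements as exact strings, and leaving open how per-text scores are aggregated over a dataset, are assumptions of this example.

```python
from typing import Set


def f1_score(predicted: Set[str], target: Set[str]) -> float:
    """Harmonic mean of precision and recall over two sets of statements."""
    true_positives = len(predicted & target)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(target)
    return 2 * precision * recall / (precision + recall)


target_reasons = {"it is wrong to intentionally kill innocent human beings"}
predicted_reasons = {
    "it is wrong to intentionally kill innocent human beings",
    "It is unethical to destroy human embryos",  # actually a conjecture
}
score = f1_score(predicted_reasons, target_reasons)
```

Here recall is 1 and precision is 0.5, so the score comes out at two thirds: mistaking a conjecture for a reason is penalized even though the true reason was found.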

An argument’s final conclusion may be implicit or explicit in a given text. The ability to fully exploit a text can be measured by verifying whether the reconstructed argument’s final conclusion is implicit (= prediction) if and only if the target argument’s one is.


EXE-TE

text exploitation, as measured by ability (F1-score) to reconstruct arguments with explicit final conclusions (prediction) if and only if the target final conclusions are explicit.

3.5 Models

Any text-to-text language model is compatible with the proposed DeepA2 framework. We refer to models used within the framework as ArgumentAnalyst. In this study, we train and evaluate the transformer model T5 (Raffel et al., 2020) with 770M parameters as implemented by Wolf et al. (2020).

3.6 Limitations

In the DeepA2 framework, arguments are reconstructed from relatively short and isolated texts, disregarding both the broader context of the argument and domain-specific background knowledge. This limits the framework, as presented here, in important ways: Implicit premises that are explicated in an argument reconstruction can be checked neither for plausibility nor for agreement with the author’s broader convictions. In addition, the framework cannot assess an argument’s dialectic function in a wider debate. It seems worthwhile to explore corresponding extensions of the framework in future research.

4 Datasets

For the experiments reported below, we synthetically create two artificial argument analysis corpora that comply with the DeepA2 framework (see also Appendix A.1): AAAC01 and AAAC02. In addition, we translate the synthetic RuleTaker (Clark, Tafjord, and Richardson, 2020) and the manually compiled EntailmentBank (Dalvi et al., 2021) datasets into our framework.

In argument analysis, one proceeds from a source text to its reconstruction. Creating the synthetic corpora, we reverse-engineer this process:

  1. We sample, first of all, a possibly complex argument (A) from a set of valid inference schemes. In doing so, we use a multi-step templating strategy (inspired by Betz, Voigt, and Richardson, 2020) to translate symbolic forms into natural language schemes and to substitute natural language terms for placeholders. Premises (P), conclusion (C) and their formalization (F, O, K) are side-products of such a construction of an argument.

  2. Given the fully explicit argument (A), we compose a text (S) that presents the argument in a more or less transparent and faithful way. Such text creation involves: rendering the argument tree as a linear story, leaving out premises or conclusions (implicit premises and conclusions), inserting irrelevant material (distractors), using templates that obfuscate the logical form of a sentence, limiting the use of premise and conclusion indicators (such as “therefore”), applying rule-based and automatic paraphrasing. In composing the argumentative text (S), we may record its reasons (R) and conjectures (J).
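The two steps above can be illustrated with a toy generator. Everything in it is a hypothetical stand-in: a single hand-written scheme, a fixed vocabulary and one distractor, whereas the actual corpora use many schemes, multi-step arguments and paraphrasing.

```python
import random

# Hypothetical scheme, vocabulary and distractor for illustration.
SCHEME = {
    "premises": ["If someone is {F}, then they are {G}.", "{a} is {F}."],
    "conclusion": "{a} is {G}.",
}
TERMS = {"F": "a badminton player", "G": "an athlete", "a": "Alex"}
DISTRACTORS = ["Cooking is a popular hobby."]


def build_record(rng: random.Random, drop_premise: bool) -> dict:
    """Step 1: instantiate the scheme. Step 2: compose a source text."""
    premises = [p.format(**TERMS) for p in SCHEME["premises"]]
    conclusion = SCHEME["conclusion"].format(**TERMS)
    # Leaving out the first premise makes it an implicit premise.
    stated = premises[1:] if drop_premise else list(premises)
    sentences = stated + [conclusion] + DISTRACTORS
    rng.shuffle(sentences)  # render the argument as a linear story
    return {
        "S": " ".join(sentences),
        "P": premises,
        "C": conclusion,
        "R": stated,        # reasons recorded while composing the text
        "J": [conclusion],  # conjectures recorded while composing the text
    }


record = build_record(random.Random(0), drop_premise=True)
```

Because the argument is built before the text, the premises, conclusion, reasons and conjectures are known by construction and need no annotation.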

Our two datasets AAAC01 and AAAC02 differ in the following ways:

  • predicates and names are sampled from different, disjoint domains (texts are about, e.g., allergies and family relations versus, e.g., badminton and cooking);

  • AAAC01 applies automatic paraphrasing (Alisetti, 2021) to the final source text whereas AAAC02 doesn’t;

  • AAAC02 allows for imprecise renditions of logical formulas, while AAAC01 sticks to plain formulations.

Each dataset contains diverse texts and arguments. Broadly speaking, data records may differ in terms of properties of the argument (step 1 above) and properties of the argument’s presentation (step 2). Along these two dimensions, we define five homogeneous subsets of the data:

simple inference:

arguments with a single inference step that neither involves negation nor compositional predicates;

complex inference:

arguments with four inference steps that heavily rely on syntactically intricate schemes (e.g., transposition, or de Morgan);

plain presentation:

all premises and conclusions are explicit in the source text which, in addition, contains no distractors;

mutilated presentation:

at least two premises and one conclusion are implicit, while the text contains two distractors and explicitly states the final conclusion;


complex & mutilated (C&M):

the argument’s inference is complex, plus the text contains at least two distractors.

The RuleTaker and EntailmentBank datasets contain multi-hop inference trees (A). To import these into the DeepA2 framework, we create source texts (S) for the given arguments by means of simple templates (such as “{theory} All this entails: {hypothesis}”) and record reasons (R) and conjectures (J) on the fly. Unlike AAAC and EntailmentBank, RuleTaker (as updated in Tafjord, Mishra, and Clark, 2020) contains an equal share of arguments for which (i) the conclusion follows from the premises, (ii) the conclusion contradicts the premises, (iii) the conclusion is independent of the premises.
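The import step can be sketched with the template quoted above. The function name, the argument layout and the example sentences are assumptions; in particular, the sketch takes for granted that the sentences actually used in the inference tree are known, so that distractors stay out of the recorded reasons.

```python
TEMPLATE = "{theory} All this entails: {hypothesis}"


def to_deepa2(facts, relevant, hypothesis):
    """Build source text (S), reasons (R) and conjectures (J) for one item.

    facts:      all sentences of the theory, including distractors
    relevant:   the subset of facts used in the inference tree
    hypothesis: the statement to be derived
    """
    theory = " ".join(facts)
    return {
        "S": TEMPLATE.format(theory=theory, hypothesis=hypothesis),
        "R": [fact for fact in facts if fact in relevant],
        "J": [hypothesis],
    }


# Hypothetical example item.
item = to_deepa2(
    facts=["Ice is frozen water.", "Heat melts ice.", "The moon orbits the earth."],
    relevant={"Ice is frozen water.", "Heat melts ice."},
    hypothesis="Heat turns frozen water into liquid water.",
)
```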

5 Experiments and Results

As a first and main experiment, we train our base model (see Section 3.5) on the AAAC01 corpus, and evaluate the resulting ArgumentAnalyst model out-of-domain on AAAC02. ArgumentAnalyst undergoes multi-task training on 21 generative modes, which are interpreted as sequence-to-sequence tasks (the training set-up is further described in Appendix A.2).

The evaluation of ArgumentAnalyst on AAAC02 proceeds in two steps:

  • prediction: produces output in accordance with 16 different generative chains (Appendix A.3);

  • metrics application: assesses the quality of the generated output by means of the systematic and exegetic metrics of the DeepA2 framework (see Section 3.4).

              systematic metrics (SYS-*)            exegetic metrics (EXE-*)
chain         PP    RP    RC    US    SCH   VAL     MEQ   RSS   JSS   PPR   PPJ   TE
straight      .95   .97   .96   .96   .33   .73     .80   -.08  -.10  .93   .93   .63
herm. cy.     .95   .98   .95   .93   .31   .72     .82   .16   .12   .93   .92   .71
logic. str.   .95   .97   .96   .95   .32   .72     .82   .11   .00   .93   .92   .69
pooling       1.0   1.0   1.0   1.0   .73   1.0     1.0   .26   .29   .96   .96   .97
oracle        1.0   1.0   1.0   1.0   1.0   1.0     1.0   .30   .37   1.0   1.0   1.0

Table 1: Performance of ArgumentAnalyst on the AAAC02 data as measured by systematic and exegetic metrics. Rows display results for three illustrative generative chains (straight, hermeneutic cycle, logical streamlining), for the item-wise best performing generative chain out of all 16 chains (pooling), and for oracle performance (oracle), which one obtains by applying the metrics to the target data itself.

Table 1 reports the ability of ArgumentAnalyst to generate systematically correct and exegetically adequate argument reconstructions. We obtain similar global results with the three chains straight, hermeneutic cycle, and logical streamlining, whose generated reconstructions mainly differ in terms of internal coherence (EXE-RSS, EXE-JSS) and text exploitation (EXE-TE). However, the different generative chains complement each other, as shown by pooling, which does not only outperform individual chains, but nearly attains oracle performance.
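Pooling, understood as picking the item-wise best chain, can be sketched in a few lines. The scores below are invented for illustration, and applying the maximum separately per metric is an assumption about how the pooled row is obtained.

```python
def pool(scores_by_chain):
    """Return, for each item, the best score achieved by any chain.

    scores_by_chain maps a chain name to a list of per-item scores;
    all lists are assumed to be aligned by item.
    """
    return [max(scores) for scores in zip(*scores_by_chain.values())]


# Hypothetical per-item validity scores (1 = valid reconstruction).
validity = {
    "straight": [1, 0, 0, 1],
    "hermeneutic cycle": [0, 1, 0, 1],
    "logical streamlining": [1, 0, 0, 1],
}
pooled = pool(validity)
pooled_mean = sum(pooled) / len(pooled)
best_single_mean = max(sum(s) / len(s) for s in validity.values())
```

In this toy example each chain alone reaches a mean of 0.5, while the pooled mean is 0.75: chains that fail on different items complement each other, which is the effect the pooling row reports.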

              inference           presentation
              simple   compl.     plain    mutil.   C&M
chain         N=1274   N=180      N=330    N=114    N=70

SYS-PP & SYS-RP & SYS-RC & SYS-US
straight      .95      .72        .98      .61      .69
herm. c.      .94      .68        .96      .67      .61
log. str.     .95      .68        .98      .64      .61

SYS-VAL
straight      .84      .48        .88      .40      .34
herm. c.      .83      .56        .84      .49      .50
log. str.     .82      .47        .86      .46      .37

EXE-RSS
straight      .03      -.25       .05      -.31     -.30
herm. c.      .20      .08        .15      .08      .11
log. str.     .17      -.01       .13      .01      -.06

EXE-JSS
straight      .06      -.32       .10      -.37     -.37
herm. c.      .23      -.06       .21      -.03     -.21
log. str.     .13      -.26       .07      -.26     -.40

Table 2: Performance of ArgumentAnalyst on specific subsets (columns) of the AAAC02 data as measured by selected systematic and exegetic metrics (sub-tables). Rows display results for three illustrative generative chains (straight, hermeneutic cycle, logical streamlining).
         ArgAnEB                   ArgAnAAAC,EB              EntWr
steps    straight   herm. cycle    straight   herm. cycle
1        .863       .866           .816       .871           .951
2        .798       .815           .813       .826           .886
3        .812       .815           .826       .806           .858
4        .757       .791           .820       .822           .838
5        .795       .811           .786       .773           .742
any      .819       .830           .816       .834           .879

Table 3: Predictive performance of ArgumentAnalyst (ArgAnEB, ArgAnAAAC,EB) and EntailmentWriter (EntWr) for identifying reason statements in an input text (metric EXE-PPR) on the EntailmentBank task2 dataset.

Table 2 assesses ArgumentAnalyst’s reconstructions on specific subsets of the AAAC02 dataset (defined in Section 4). As expected, ArgumentAnalyst produces, all in all, much better reconstructions of simple inferences and plain presentations – compared to complex inferences and/or mutilated presentations, i.e., difficult problems. In addition, within one and the same subset, substantial performance-differences show up between the three generative chains. Globally speaking, hermeneutic cycle outperforms the other two chains for difficult problems.

Is ArgumentAnalyst capable of reliable self-evaluation? We have validated the logic metric (SYS-VAL), which passes on a self-generated formalization of the reconstructed argument to a theorem prover, in three ways: First of all, ArgumentAnalyst correctly recognizes target arguments as valid (with accuracy 92.7%), which has been verified by running the formalization subchain on target data. Secondly, virtually every generated argument with all-correct scheme instantiations (i.e., SYS-SCH = 1) is also – and correctly – recognized as logically valid. Thirdly, a manual analysis of 100 generated arguments with incorrect scheme instantiation (i.e., SYS-SCH < 1) reveals a high rate of false negatives: roughly one half of all inferences that are not automatically identified as an instantiation of the given scheme actually do correctly instantiate it. The accordingly adjusted global ratio of correct scheme instantiations (Table 1) equals roughly 0.65 (rather than 0.31–0.33), which is consistent with the ratio of logically valid arguments being 0.72–0.73.
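One plausible way to arrive at the adjusted figure is to add back the estimated false negatives. The sketch assumes that the false-negative rate of roughly one half, observed in the manual sample, applies uniformly to all inferences that the automatic check rejects; the function name is hypothetical.

```python
def adjusted_ratio(automatic_ratio: float, false_negative_rate: float) -> float:
    """Add the estimated share of wrongly rejected inferences back in."""
    rejected = 1.0 - automatic_ratio
    return automatic_ratio + false_negative_rate * rejected


low = adjusted_ratio(0.31, 0.5)   # lower end of the automatic range
high = adjusted_ratio(0.33, 0.5)  # upper end of the automatic range
```

Both ends of the range land near 0.66, in line with the reported value of roughly 0.65.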

Do reconstructed arguments exhibit basic semantic flaws? Regarding the full dataset, ArgumentAnalyst produces nearly flawless argument reconstructions, committing basic errors (petitio, redundancy, unused statements) only very rarely (Table 1). And even for very difficult problems, two thirds of all generated arguments display no basic flaw whatsoever (Table 2, SYS-PP & SYS-RP & SYS-RC & SYS-US).

Are reconstructed arguments logically valid? Roughly 70% of all arguments generated by one of the three chains are logically valid (Table 1). More importantly, though, for virtually every source text in the dataset, there is at least one chain (out of 16) which reconstructs the text as a valid argument (pooling). Given that logical validity can be automatically assessed, the pooled system may thus guarantee to yield a valid reconstruction. Concerning different problem types (Table 2), hermeneutic cycle clearly outperforms the other chains as soon as the problem gets difficult. Additional analysis shows that ArgumentAnalyst can also cope with underdetermination, as 68% of all generated arguments whose final conclusion differs () from the target argument’s one – i.e., arguments that are not reconstructed as expected given the target data – are still logically valid.

Are the generated interpretations internally coherent? The generative chain hermeneutic cycle yields comprehensive argument reconstructions where premises (P) and conclusions (C) fit much better to detected reasons (R) and conjectures (J) than straight or logical streamlining (EXE-RSS, EXE-JSS). This holds globally (Table 1), as well as for easy, and for difficult problems (Table 2). Note that the oracle baseline for metrics EXE-RSS, EXE-JSS is well below 1, which reflects the fact that source texts may present arguments in highly mutilated ways; it is nearly attained by pooling the 16 different generative chains (Table 1).

Can ArgumentAnalyst detect reasons and conjectures, and fully exploit the text? The evaluation demonstrates that reason/conjecture detection on AAAC02 is a relatively easy task (EXE-PPR, EXE-PPJ). In contrast, fully exploiting a text (i.e., generating an argument with implicit final conclusion if and only if the underlying target argument has an implicit final conclusion, EXE-TE) is seemingly more challenging (Table 1). Again, hermeneutic cycle achieves best text exploitation, performing, however, clearly below oracle baseline – which may simply reflect the degree of underdetermination in the AAAC02 corpus.

In a second experiment, we train two models on the imported EntailmentBank (task1 and task2) dataset (see Section 4), namely:

  1. our base model (T5), which yields ArgumentAnalystEB;

  2. the ArgumentAnalyst model pretrained on AAAC02 (resulting in an intermediary pre-training set-up similar to Phang, Févry, and Bowman, 2018; Geva, Gupta, and Berant, 2020), which yields ArgumentAnalystAAAC,EB.

Since the EntailmentBank data doesn’t contain formalizations, we can only train on 14 modes, which are interpreted as sequence-to-sequence tasks (see Appendix A.2). We evaluate the models on task2 of EntailmentBank only, which contains problems with a relatively large number of distractors, and proceed in two steps as before: prediction (with 11 different generative chains) and metrics application. Dalvi et al. (2021) report the ability of EntailmentWriter (a fine-tuned T5-11b model) to correctly distinguish relevant premises of an argument from distractors in terms of an F1-score, which corresponds to our metric EXE-PPR. That’s why the sole focus in this second experiment is on EXE-PPR.

Table 3 describes the ability of ArgumentAnalyst models to correctly tell apart relevant premises from mere distractors in the EntailmentBank task2 dataset for two generative chains (straight, which directly outputs reason statements, and hermeneutic cycle, which tries to reconstruct the argument first and uses both source text and argument to identify reasons), and compares this with the performance of EntailmentWriter (scores from Dalvi et al., 2021). The results, shown separately for arguments with a specific number of inference steps, let us draw three conclusions:

  • ArgumentAnalyst outperforms EntailmentWriter on difficult problems with more than 4 inference steps / sub-arguments.

  • Using the sophisticated chain hermeneutic cycle improves predictive performance compared to the simple straight chain.

  • The chain hermeneutic cycle (unlike straight) generally benefits from intermediary pre-training on AAAC – caveat: not so for arguments with more than 4 steps. This latter observation might be due to the fact that the AAAC02 corpus, by construction, doesn’t contain arguments with more than 4 steps, so that pre-training biases the model towards shorter arguments.

In a third experiment, we explore the following hypothesis:

Informative higher-order evidence.

The degree to which ArgumentAnalyst struggles in reconstructing a given argument (presented in the source text) as logically valid is a reliable indicator for whether the original argument is fallacious or not.

To test this hypothesis, we apply ArgumentAnalyst (trained on AAAC02, see above) to the RuleTaker data as imported into the DeepA2 framework (see Section 4): ArgumentAnalyst produces – by means of 13 generative chains – comprehensive reconstructions, to which the systematic and exegetic metrics are applied. RuleTaker contains an equal share of arguments whose conclusions follow from (label=valid), contradict (label=contradiction), or are independent of (label=neutral) the corresponding premises. Now, informative higher-order evidence would allow us to correctly predict these labels. And this is exactly what we observe: First, if reconstructions of one and the same source text which are independently generated with different chains agree (disagree), then the original argument tends to be valid (invalid). Second, by training simple classifiers on our argumentative metrics and further properties of the reconstructions, we robustly achieve a predictive accuracy 10% above the random baseline. While this is far below the SOTA results of tailor-made RuleTaker (Clark, Tafjord, and Richardson, 2020) and ProofWriter (Tafjord, Mishra, and Clark, 2020) models on this data, our findings nonetheless confirm the above hypothesis.
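The first observation, agreement between independently generated reconstructions as a signal of validity, can be sketched as follows. This is a hypothetical illustration only: how agreement was actually measured is not specified here, so the sketch assumes that two reconstructions agree when their conclusions match after whitespace and case normalization, and the names `agreement_rate`, `predict_valid`, and the threshold of 0.5 are invented for the example.

```python
from itertools import combinations


def agreement_rate(conclusions):
    """Share of chain pairs whose reconstructed conclusions coincide.

    Assumption: agreement means identical conclusions after normalization.
    """
    normalized = [" ".join(c.lower().split()) for c in conclusions]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        return 1.0  # a single reconstruction trivially agrees with itself
    return sum(a == b for a, b in pairs) / len(pairs)


def predict_valid(conclusions, threshold=0.5):
    """Guess 'valid' when the chains mostly agree (threshold is hypothetical)."""
    return agreement_rate(conclusions) >= threshold


# Hypothetical conclusions produced by three different generative chains.
outputs = ["The cat is nice.", "the cat is  nice.", "The cat is not nice."]
rate = agreement_rate(outputs)
```

Such an agreement rate could serve as one input feature, alongside the argumentative metrics, for the simple classifiers mentioned above.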

6 Conclusion

In this paper, we have presented and implemented a multi-angular, modular framework for deep argument analysis (DeepA2). It allows for defining a large variety of generative modes by combining different dimensions of the data. These modes, in turn, can be concatenated into complex generative chains. ArgumentAnalyst – a text-to-text model set up and trained within the DeepA2 framework – yields plausible reconstructions of argumentative texts. Our empirical findings vindicate the overall framework and highlight the following advantages of a multi-angular, modular design in general: First of all, modular chains may emulate established, well-proven, typically piecemeal, scholarly techniques for text analysis (heuristics), which hence may provide normative, methodological guidance in setting up NLP systems. Secondly, by defining and implementing different modular chains, and investigating the plurality of generated solutions, one can systematically explore the system’s uncertainty as well as the task’s underdetermination. Thirdly, monitoring the system during modular computation yields diagnostically useful information (e.g., intermediary results) which not only describes the model’s performance on the given problem, but which additionally allows us – as higher-order evidence – to characterize (e.g., classify) the original problem in the first place. Fourthly, breaking down a complex task into sub-tasks with intermediary results that can be further processed and re-combined helps to overcome input size limitations of neural language models. Fifthly, modular generation with meaningful modes allows users to follow the system, comprehend generated solutions, verify sub-steps and detect errors – the NLP system becomes a transparent, explainable AI (Miller, 2019). Finally, modular NLP systems as described by DeepA2 may be connected to a user interface which promises fine-grained interactive control of modular generations and seamless cognitive cooperation of AI and human experts in analysing texts.

Appendix A

Table 4: 21 generative modes with corresponding weights in AAAC (w1) and EntailmentBank (w2) training data. [Table body not recoverable: the mode labels are missing; the surviving weight values are 1., .7, and .2.]
#    mode sequence    len.   soph.
1                     3      0
2                     3      1
3                     3      1
4                     3      2
5                     3      3
6                     3      3
7                     3      3
8                     3      3
9                     4      4
10                    4      4
11                    7      8
12                    9      11
13                    9      11
14                    15     20
15                    11     18
16                    12     21
Table 5: 16 generative chains (without final formalization sub-sequences) evaluated in this study. The illustrative chains highlighted in the main paper are #1 (straight), #9 (hermeneutic cycle), and #13 (logical streamlining).

A.1 Synthetic Argument Data

A synthetically generated AAAC record, which nicely illustrates the underdetermination of argument reconstruction, with two implicit premises, one distracting statement and a simple (one-step) argument (formatted as presented to the model):

source: It is not the case that Tracy is not an admirer of Fullerton and Tracy has seen La Habra. Plus, if someone loves Chico, then they haven’t visited Monterey, owing to the fact that loving Laguna Beach is sufficient for not having visited Monterey.

reasons: loving Laguna Beach is sufficient for not having visited Monterey (ref: (2))

conjectures: if someone loves Chico, then they haven’t visited Monterey (ref: (4))

(1) If someone is an admirer of Chico, then they are an admirer of Laguna Beach or a visitor of Stockton.
(2) If someone admires Laguna Beach, then they haven’t visited Monterey.
(3) If someone has visited Stockton, then they haven’t visited Monterey.
with generalized dilemma (neg variant) from (1) (2) (3)
(4) If someone admires Chico, then they haven’t visited Monterey.

premises: If someone is an admirer of Chico, then they are an admirer of Laguna Beach or a visitor of Stockton. (ref: (1)) | If someone admires Laguna Beach, then they haven’t visited Monterey. (ref: (2)) | If someone has visited Stockton, then they haven’t visited Monterey. (ref: (3))

conclusion: If someone admires Chico, then they haven’t visited Monterey. (ref: (4))

premises_form: (x): F x -> (G x v H x) (ref: (1)) | (x): G x -> not I x (ref: (2)) | (x): H x -> not I x (ref: (3))

conclusion_form: (x): F x -> not I x (ref: (4))

keys: F: admirer of Chico | G: admirer of Laguna Beach | H: visitor of Stockton | I: visitor of Monterey
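That the formalized inference in this record is deductively valid can be checked mechanically. The following Lean 4 snippet is a minimal sketch added for illustration (it is not part of the AAAC data): it states the three formalized premises and the formalized conclusion over an arbitrary domain `α`, with predicate names taken from the keys above, and proves the conclusion by case distinction on premise (1).

```lean
-- F: admirer of Chico, G: admirer of Laguna Beach,
-- H: visitor of Stockton, I: visitor of Monterey
example {α : Type} (F G H I : α → Prop)
    (h1 : ∀ x, F x → G x ∨ H x)   -- premise (1)
    (h2 : ∀ x, G x → ¬ I x)       -- premise (2)
    (h3 : ∀ x, H x → ¬ I x) :     -- premise (3)
    ∀ x, F x → ¬ I x := by        -- conclusion (4)
  intro x hF
  cases h1 x hF with
  | inl hG => exact h2 x hG
  | inr hH => exact h3 x hH
```

Note that only premise (2) is stated explicitly in the source text; premises (1) and (3) are the implicit premises that the reconstruction has to supply.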

A.2 Training Set-up

By interpreting a generative mode as a sequence-to-sequence task, we may translate a multi-angular DeepA2 dataset (e.g., AAAC01) into a multi-task sequence-to-sequence format, on which a sequence-to-sequence model can be trained. For each record in the multi-angular DeepA2 dataset, we randomly sample 14 modes in accordance with the weights provided in Table 4 and add, for each mode, a corresponding sequence-to-sequence record to the training data. This results, for AAAC01, in a sequence-to-sequence training dataset with records.
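The translation step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the actual training pipeline: modes are represented as hypothetical `(input_keys, output_key)` tuples, sampling is assumed to be with replacement and proportional to the weights, and the textual input format (`"<target>: <key>: <value> ..."`) as well as the three example modes and their weights are invented for the example.

```python
import random


def sample_modes(mode_weights, k, rng):
    """Draw k modes with probability proportional to their weights.

    Assumption: sampling is with replacement.
    """
    modes = list(mode_weights)
    weights = [mode_weights[m] for m in modes]
    return rng.choices(modes, weights=weights, k=k)


def to_seq2seq(record, mode_weights, k, rng):
    """Turn one multi-angular record into k sequence-to-sequence examples."""
    examples = []
    for input_keys, output_key in sample_modes(mode_weights, k, rng):
        # Serialize the selected input dimensions and name the target dimension.
        inputs = " ".join(f"{key}: {record[key]}" for key in input_keys)
        examples.append({"input": f"{output_key}: {inputs}",
                         "target": record[output_key]})
    return examples


# Hypothetical record and mode weights (not the values of Table 4).
record = {
    "source": "If someone loves Chico, then they haven't visited Monterey ...",
    "reasons": "loving Laguna Beach is sufficient for not having visited Monterey",
    "conclusion": "If someone admires Chico, then they haven't visited Monterey.",
}
mode_weights = {
    (("source",), "reasons"): 1.0,
    (("source", "reasons"), "conclusion"): 0.7,
    (("source",), "conclusion"): 0.2,
}
examples = to_seq2seq(record, mode_weights, k=14, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.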

Our models (base model T5-large with 770M parameters, and pretrained ArgumentAnalyst) are trained with batch size 2 and learning rate 0.00001. For AAAC01, eval loss starts to increase at epoch 8; with EntailmentBank data, eval loss increases from epoch 2 onwards.

A.3 Iterative Prediction with Generative Chains

Generative chains are implemented with a dynamic dictionary (9 keys, corresponding to the dimensions of DeepA2 data), which is initialized with the source text, provides input for the generative modes, and is updated after each generative step with the mode’s generated output. Output is generated with beam search decoding and beam width 2.
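The dictionary-based procedure just described can be sketched as follows. This is a simplified illustration, not the actual implementation: the key names, the prompt format, and the two-step example chain are assumptions, only three of the nine dimensions are used, and `dummy_generate` is a stand-in for a call to the trained model (which would decode with beam search and beam width 2).

```python
def run_chain(source_text, chain, generate):
    """Run a generative chain over a dynamic dictionary.

    chain: list of (input_keys, output_key) modes, applied in order.
    generate: callable mapping a prompt string to generated text.
    """
    state = {"source": source_text}  # initialized with the source text only
    for input_keys, output_key in chain:
        # Build the mode's input from the current state of the dictionary.
        inputs = " ".join(f"{key}: {state[key]}" for key in input_keys)
        # Store the mode's output so that later modes can use it as input.
        state[output_key] = generate(f"{output_key}: {inputs}")
    return state


def dummy_generate(prompt):
    """Placeholder for the text-to-text model (hypothetical)."""
    target = prompt.split(":", 1)[0]
    return f"<generated {target}>"


# Hypothetical two-step chain: reconstruct the argument first, then use
# source text and argument together to identify the reason statements.
example_chain = [
    (("source",), "argument"),
    (("source", "argument"), "reasons"),
]
result = run_chain("Some argumentative text.", example_chain, dummy_generate)
```

Because every intermediary result stays in the dictionary, the final state documents each step of the chain and can be inspected or evaluated afterwards.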

Table 5 displays all generative chains we resort to in this study, all of which are used in the first experiment. The second experiment makes use of chains 1–11. The third experiment deploys chains 1–13.