
Automatic Debate Evaluation with Argumentation Semantics and Natural Language Argument Graph Networks

by   Ramon Ruiz-Dolz, et al.
Universitat Politècnica de València

The lack of annotated data on professional argumentation and complete argumentative debates has led to the oversimplification of, and the inability to approach, more complex natural language processing tasks such as automatic debate evaluation. In this paper, we propose an original hybrid method to automatically evaluate argumentative debates. For that purpose, we combine concepts from argumentation theory, such as argumentation frameworks and semantics, with Transformer-based architectures and neural graph networks. Furthermore, we obtain promising results that lay the basis for a previously unexplored instance of the automatic analysis of natural language arguments.





1 Introduction

The automatic evaluation of argumentative debates is a Natural Language Processing (NLP) task that can support judges in debate tournaments, assist analysts of political debates, and even help to understand the human reasoning used in social media (e.g., Twitter debates), where argumentation may be difficult to follow. This task belongs to computational argumentation, a broad, multidisciplinary research area that has evolved rapidly in recent years. Classically, computational argumentation research focused on formal abstract logic and computational (i.e., graph) representations of arguments and their relations. In this approach, the evaluation of arguments relied exclusively on logical and topological properties of the argument representations. These techniques have been thoroughly studied and analysed, but from a theoretical and formal viewpoint, considering specific cases and configurations rather than large, informal debates.

The significant advances in NLP have enabled the study of new, less formal approaches to argumentative analysis tasks. One of the tasks that has gained the most popularity in recent years is Argument Mining, which is aimed at finding argumentative elements in natural language inputs (i.e., argumentative discourse segmentation), defining their argumentative purpose (i.e., argumentative component classification), and detecting argumentative structures between these elements (i.e., argumentative relation identification). Even though most of the NLP research applied to computational argumentation has focused on argument mining, other NLP-based tasks have also been researched, such as natural language argument generation, argument persuasiveness assessment, and argument summarisation, among others. However, there is a notable lack of research on the large-scale evaluation of argumentative debates approached with NLP-based algorithms. Furthermore, most of the existing research on this topic has been contextualised in online debate forums, considering only text messages and lacking professional human evaluation.

In this paper, we propose a hybrid method for evaluating argumentative debates that considers the complete lines of reasoning presented by professional debaters. Our method combines concepts from classical computational argumentation theory (i.e., argumentation semantics) with models and algorithms used effectively in other NLP tasks (i.e., Transformer-based sentence vector representations and graph networks). This way, our method takes a complete debate as input and predicts the winning stance (i.e., in favour or against) for a given argumentative topic as output. For that purpose, we define a computational modelling of natural language debates that makes it possible to carry out the argumentative debate evaluation task. Finally, we present a complete, step-by-step implementation of our method applied to a real debate. Thus, this paper's contribution is threefold: (i) it introduces a new instance of the argument assessment task, linguistically deeper (i.e., considering complete lines of reasoning) than the previously existing instances and reported research; (ii) it defines an original hybrid method combining formal logic and NLP-based algorithms for approaching the automatic evaluation of argumentative debates; and (iii) it evaluates and implements the proposed method in a real case study.

2 Related Work

From the formal logic and theoretical (computational) argumentation approach, the representation and evaluation of arguments is conducted through argumentation frameworks and argumentation semantics Dung (1995); Baroni et al. (2011). However, this line of research has been mainly focused on abstract argumentation and formal logic argumentative structures.

On the other hand, the automatic assessment of natural language arguments is a research topic that has been addressed from different viewpoints in the literature. Depending on the domain, the available corpora, and their annotations, different instances of this task have been proposed in recent years. Most of this research has focused on the individual evaluation of arguments or argumentative lines of reasoning Wachsmuth et al. (2017), rather than on a global, interactive viewpoint in which complete debates consisting of multiple, conflicting lines of reasoning are analysed. Typically, the automatic evaluation of natural language arguments has been carried out by comparing the convincingness of pairs of arguments Gleize et al. (2019); analysing user features such as interests or personality to predict argument persuasiveness Al Khatib et al. (2020); and analysing natural language features of argumentative text to estimate its persuasive effect El Baff et al. (2020). However, a global debate approach to the evaluation of natural language arguments was also presented in Potash and Rumshisky (2017), where a Recurrent Neural Network was trained to evaluate non-professional debates in a corpus of limited size. Similarly, in Shirafuji et al. (2019), the authors evaluate machine learning algorithms for predicting the outcome of debates in online forums, but concepts from computational argumentation theory are not brought into consideration. Here, we propose a new method that combines the advantages of both approaches to improve the automatic analysis of human argumentation: a hybrid approach for the automatic evaluation of complete, natural language debates of professional argumentation.

3 Data

In this paper, we approach the automatic evaluation of complete natural language argumentative debates. For that purpose, we use the VivesDebate corpus Ruiz-Dolz et al. (2021b) to conduct all the experiments and the evaluation of our proposed method. This corpus contains annotations of the complete lines of reasoning presented by the debaters, based on the IAT standard Budzynska and Reed (2011), and professional jury evaluations of the quality of the argumentation presented in each debate. Previously published corpora widely used in argumentation-based NLP tasks have tended to simplify the annotated argumentative reasoning by considering only individual arguments, pairs of arguments, or small sets of arguments, instead of deeper, complete lines of argumentative reasoning; for example, in argument mining (e.g., US2016 Visser et al. (2020)), argument assessment (e.g., IBM-EviConv Gleize et al. (2019)), or natural language argument generation/summarisation (e.g., GPR-KB, DebateSum Orbach et al. (2019); Roush and Balaji (2020)). Thus, the VivesDebate corpus is the only publicly available corpus we have identified that enables the study of the automatic evaluation of professional, spoken, natural language argumentative debates.

The VivesDebate corpus contains 29 different argumentative debates (139,756 words) from a university debate tournament, annotated in their entire structure: without partitions or structural divisions, and capturing the complete lines of reasoning proposed by the debaters. The natural language text is segmented into Argumentative Discourse Units (7,810 ADUs) Peldszus and Stede (2013). Each ADU contains its own text, its stance (i.e., in favour of or against the topic of the debate), the phase of the debate in which it was uttered (i.e., introduction, argumentation, or conclusion), and a set of argumentative relations (i.e., inference, conflict, and rephrase) that make it possible to capture argumentative structures, the sequentiality of the debate, and the existing major lines of reasoning. Additionally, each debate has the scores of the jury, which indicate which team has proposed the more solid and stronger argumentative reasoning.
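The annotation scheme described above can be pictured with a minimal data structure. The following Python sketch uses illustrative field names, which are assumptions for exposition and not the actual corpus column names:

```python
from dataclasses import dataclass, field

@dataclass
class ADU:
    """One Argumentative Discourse Unit (field names are illustrative,
    not the VivesDebate column names)."""
    adu_id: int
    text: str
    stance: str           # "in favour" or "against"
    phase: str            # "introduction", "argumentation", "conclusion"
    # relations: list of (relation_type, target_adu_id) pairs, with
    # relation_type in {"inference", "conflict", "rephrase"}
    relations: list = field(default_factory=list)
```

Each debate would then be a list of such units plus the jury scores.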

4 Automatic Evaluation of Argumentative Debates

The human evaluation of argumentative debates is a complex task that involves many different aspects, such as thesis solidity, argumentation quality, adaptability to the opponents' reasoning, and other linguistic aspects of the debate (e.g., oral fluency, grammatical correctness, etc.). Both the underlying logic of the argumentation and its linguistic realisation play a major role in the evaluation of argumentative debates. Therefore, the method proposed in this paper is designed to capture both aspects of argumentation by combining concepts from argumentation theory and Natural Language Processing. Our method is divided into two different phases: first, (i) determining the acceptability of arguments (i.e., their logical validity) in a debate based on their logical structures and relations; second, (ii) scoring the resulting acceptable arguments by analysing aspects of their underlying natural language to determine the winner of the debate. Figure 1 presents a scheme with the most important phases and elements of the proposed method. The code implementation of the proposed method is publicly available.

Figure 1: Structural scheme of the proposed automatic debate evaluation method.

Before describing both phases of our method, it is important to contextualise our proposal within the area of computational argumentation research. We assume that the whole argument analysis of the natural language text has already been carried out: the argumentative discourse has been segmented Jo et al. (2019), the argument components have been classified Bao et al. (2021), and argument relations have been identified among the segmented argumentative text spans Ruiz-Dolz et al. (2021a). Thus, a graph structure can be clearly defined from a given natural language argumentative input (e.g., a debate). Looking at Figure 1, the Argument Analysis graph containing the text of the arguments (i.e., node content), their stance (i.e., node colour), inference relations (i.e., green edges), conflict relations (i.e., red edges), and rephrase relations (i.e., yellow edges) among arguments is a valid starting point for the proposed method.

4.1 Phase I: Argument Acceptability

The first phase of the method proposed in this paper for the automatic evaluation of argumentative debates relies on concepts from computational argumentation theory. This phase can be understood as a natural language pre-processing step from the NLP research viewpoint. Thus, the main goal of Phase I is to pre-process all the information contained in an argument graph resulting from an argumentative analysis (e.g., a complete argument mining pipeline Mirko et al. (2020)), and to computationally encode this information focusing on the most relevant aspects for natural language argumentation (see Figure 1, Framework Encoding and Argumentation Semantics).

For that purpose, it is necessary to introduce the concepts of abstract argumentation frameworks and argumentation semantics. Originally proposed by Dung in Dung (1995), an argumentation framework is a graph-based representation of abstract (i.e., non-structured) arguments and their attack relations:

Definition 1 (Argumentation Framework)

An Argumentation Framework (AF) is a tuple AF = ⟨A, R⟩, where A is a finite set of arguments, and R is the attack relation on A such that R ⊆ A × A.

Furthermore, argumentation semantics were proposed together with AFs as sets of logical rules to determine the acceptability of an abstract argument or a set of arguments. In this paper, following one of the most popular notations in argumentation theory, we will refer to these sets as acceptable extensions. These semantics rely on two essential set properties: conflict-freeness and admissibility. Thus, we consider that a set of arguments is conflict-free if there are no attacks between arguments belonging to the same set:

Definition 2 (Conflict-free)

Let AF = ⟨A, R⟩ be an argumentation framework and S ⊆ A. The set of arguments S is conflict-free iff ∀a, b ∈ S: (a, b) ∉ R.

We can also consider that a set of arguments is admissible if, in addition to being conflict-free, it is able to defend itself from external attacks:

Definition 3 (Admissible)

Let AF = ⟨A, R⟩ be an argumentation framework and S ⊆ A. The set of arguments S is admissible iff S is conflict-free and ∀a ∈ S, ∀b ∈ A: if (b, a) ∈ R, then ∃c ∈ S such that (c, b) ∈ R.

In this paper, we compare the behaviour of these two properties through the use of Naïve (conflict-free) and Preferred (admissible) semantics to compute all the acceptable extensions of arguments from the AF representations of debates. Naïve semantics are defined as the maximal (w.r.t. set inclusion) conflict-free sets of arguments in a given AF. Similarly, Preferred semantics are defined as the maximal (w.r.t. set inclusion) admissible sets of arguments in a given AF. A complete formal definition of both semantics can be found in Baroni et al. (2011). At this point, it is important to remark that the criterion for selecting these two semantics for our method is the principle of maximality. Since the acceptable extensions will be used as samples to train an NLP model in the subsequent phase of the proposed method, we selected the semantics that allow us to obtain the highest number of extensions while retaining most of the natural language information and maximising the differences among extensions (i.e., not accepting subsets of a given maximal extension, which would result in data redundancy and hamper the model's distribution learning).
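To make the two semantics concrete, the following is a minimal brute-force Python sketch that enumerates the Naïve (maximal conflict-free) and Preferred (maximal admissible) extensions of a small AF. It is purely illustrative, only scales to small frameworks, and is not the NetworkX-based implementation used in the paper:

```python
from itertools import combinations

def maximal_sets(sets):
    """Keep only the sets that are maximal w.r.t. set inclusion."""
    return [s for s in sets if not any(s < t for t in sets)]

def naive_extensions(args, attacks):
    """Naïve semantics: maximal conflict-free sets of the AF (A, attacks)."""
    conflict_free = []
    for r in range(len(args) + 1):
        for subset in combinations(sorted(args), r):
            s = set(subset)
            if all((a, b) not in attacks for a in s for b in s):
                conflict_free.append(s)
    return maximal_sets(conflict_free)

def preferred_extensions(args, attacks):
    """Preferred semantics: maximal admissible sets of the AF."""
    admissible = []
    for r in range(len(args) + 1):
        for subset in combinations(sorted(args), r):
            s = set(subset)
            if any((a, b) in attacks for a in s for b in s):
                continue  # not conflict-free
            # every attacker b of a member a must be counter-attacked from s
            if all(any((c, b) in attacks for c in s)
                   for a in s for b in args if (b, a) in attacks):
                admissible.append(s)
    return maximal_sets(admissible)
```

For a toy AF where a and b attack each other and b attacks c, both semantics yield the extensions {a, c} and {b}; on larger AFs they generally differ, with Preferred being stricter.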

Therefore, we encode the argument graphs, resulting from a natural language analysis of the debate, as abstract AFs using the algorithm depicted in Algorithm 1. ADUs that follow the same line of reasoning (i.e., related with inference or rephrase) are grouped into abstract arguments, and the existing conflicts between ADUs are represented with the attack relation of the AF. Then, both Naïve and Preferred semantics are computed on the AF representation of the debate. This leads to a finite set of extensions, each one of them consisting of a set of acceptable arguments under the logic of computational argumentation theory. These extensions will be used as learning samples for training and evaluating an NLP model in the subsequent phase of the proposed method.

1:function GraphToAF(G)
2:    A ← the sets of ADUs connected by inference or rephrase edges in G
3:    R ← ∅
4:    for each conflict edge (u, v) in G do
5:        for each pair (a, b) ∈ A × A with u ∈ a and v ∈ b do
6:            R ← R ∪ {(a, b)}
7:        end for
8:    end for
9:    return ⟨A, R⟩
10:end function
Algorithm 1 Argumentation Framework Encoding.
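A possible Python rendering of this encoding, assuming the debate graph is given as ADU identifiers plus typed edges (a sketch; the data structures of the released implementation may differ):

```python
def graph_to_af(adus, edges):
    """Encode a debate graph as an abstract Argumentation Framework.

    adus:  iterable of ADU identifiers
    edges: iterable of (source, target, kind) with kind in
           {"inference", "rephrase", "conflict"}
    Returns (arguments, attacks): arguments is a list of frozensets of
    ADUs (one abstract argument per line of reasoning); attacks is a set
    of (attacker_index, attacked_index) pairs over that list.
    """
    # Union-find: ADUs linked by inference/rephrase share one argument.
    parent = {a: a for a in adus}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, t, kind in edges:
        if kind in ("inference", "rephrase"):
            parent[find(s)] = find(t)

    groups = {}
    for a in adus:
        groups.setdefault(find(a), set()).add(a)
    arguments = [frozenset(g) for g in groups.values()]
    index = {adu: i for i, arg in enumerate(arguments) for adu in arg}

    # Conflicts between ADUs become attacks between abstract arguments.
    attacks = {(index[s], index[t])
               for s, t, kind in edges if kind == "conflict"}
    return arguments, attacks
```

For instance, two inference-linked ADUs and two rephrase-linked ADUs joined by one conflict edge collapse into two abstract arguments with a single attack between them.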

4.2 Phase II: Argument Scoring

The second phase of the method focuses on analysing the natural language arguments contained in the acceptable extensions, and determining the winner of a given debate. For that purpose, we use the Graph Network (GN) architecture Battaglia et al. (2018) combined with Transformer-based sentence embeddings Reimers and Gurevych (2019) generated from the natural language arguments contained in the acceptable extensions. A GN is a machine learning algorithm aimed at learning computational representations for graph-based data structures. This way, a GN receives as input a graph containing initialised node features (i.e., v_i), edge data (i.e., (e_k, r_k, s_k), where e_k are the edge features, r_k is the receiver node, and s_k is the sender node), and global features (i.e., u); and updates them according to three learnt update functions and three static aggregation functions:

e'_k = φ^e(e_k, v_{r_k}, v_{s_k}, u),   with ē'_i = ρ^{e→v}(E'_i)
v'_i = φ^v(ē'_i, v_i, u),               with ē'  = ρ^{e→u}(E')      (1)
u'   = φ^u(ē', v̄', u),                  with v̄'  = ρ^{v→u}(V')

This way, φ^e computes an edge-wise update of the edge features, φ^v updates the features of the nodes, and φ^u is computed at the end, updating the global graph features. Finally, the ρ aggregation functions must be commutative; they calculate the aggregated features, which are used in the subsequent update functions. A complete description of the algorithm is given in Battaglia et al. (2018).
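A drastically simplified, pure-Python sketch of one GN block may help to fix ideas. The update functions here are plain callables standing in for the learnt MLPs, the aggregations are means (which are commutative, as required), and scalar features are used for brevity:

```python
def gn_block(nodes, edges, u, phi_e, phi_v, phi_u):
    """One Graph Network block (after Battaglia et al., 2018), simplified.

    nodes: {node_id: feature}        scalar features for brevity
    edges: [(e_k, r_k, s_k)]         edge feature, receiver id, sender id
    u:     global feature
    phi_*: update functions (stand-ins for the learnt MLPs)
    """
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0

    # 1. edge-wise update: e'_k = phi_e(e_k, v_{r_k}, v_{s_k}, u)
    new_edges = [(phi_e(e, nodes[r], nodes[s], u), r, s)
                 for e, r, s in edges]
    # 2. node-wise update: v'_i = phi_v(aggregated incoming e'_k, v_i, u)
    new_nodes = {
        i: phi_v(mean([e for e, r, _ in new_edges if r == i]), v, u)
        for i, v in nodes.items()
    }
    # 3. global update: u' = phi_u(aggregated e', aggregated v', u)
    new_u = phi_u(mean([e for e, _, _ in new_edges]),
                  mean(list(new_nodes.values())), u)
    return new_nodes, new_edges, new_u
```

In the real architecture the features are vectors and the phi functions are trained jointly by backpropagation; this sketch only mirrors the order of the three updates in Equation 1.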

Thus, the first step in Phase II is to build the learning samples from the previously computed extensions of the AFs (see Figure 1, Learning Sample). An extension is a set of logically acceptable arguments under the principles of conflict-freeness or admissibility. However, there are no explicit relations between the acceptable arguments, since AF representations only consider attacks between arguments, and the conflict-free principle states that there must be no attacks between arguments belonging to the same extension. Thus, in order to structure the data and make it useful for learning linguistic features for the debate evaluation task, we generate a complete bipartite graph from each extension. The two disjoint sets of arguments are determined by their stance (i.e., one set consisting of all the acceptable arguments in favour, and the other of those against), since argumentation semantics allow us to define sets of logically acceptable arguments but do not guarantee that they share the same claim or a similar stance.
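Assuming each argument carries a stance label, the bipartite sample construction might be sketched as follows (function and variable names are illustrative):

```python
from itertools import product

def extension_to_sample(extension, stance):
    """Turn an acceptable extension into a complete bipartite graph.

    extension: set of argument ids
    stance:    {arg_id: 0 (in favour) or 1 (against)}
    Returns (in_favour, against, edges), where edges connect every
    in-favour argument with every against argument in both directions,
    mirroring the complete bipartite samples fed to the GN.
    """
    in_favour = sorted(a for a in extension if stance[a] == 0)
    against = sorted(a for a in extension if stance[a] == 1)
    edges = [(a, b) for a, b in product(in_favour, against)]
    edges += [(b, a) for a, b in product(in_favour, against)]
    return in_favour, against, edges
```

With two in-favour arguments and one against argument, this produces a sample with 2 × 1 × 2 = 4 directed edges.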

The second step consists of initialising all the required features of the learning samples for the GN architecture (see Figure 1, Sample Init.). Thus, we need to define which features will encode the edge, node, and global information of the previously processed bipartite graph samples. Edges do not contain any relevant natural language information, so we initialise the edge features identically (similar to previous research on the application of GNs to abstract argumentative reasoning Craandijk and Bex (2021)), so that node influence can be stronger when learning the edge update functions. Nodes, however, are a pivotal aspect of this second phase, since they contain all the natural language data. Node features are initialised from sentence embedding representations of the natural language arguments contained in each node. Thus, we propose the use of Transformer-based language models (i.e., BERT, RoBERTa, XLNet, etc.) to generate dense vector representations of these arguments and initialise the feature vectors for subsequent learning. It may be interesting to use different language models for initialising node features depending on the application domain or the language of the data. In this paper, we use an XLM-RoBERTa architecture, since it was the most suitable option for the multilingual data used in our experiments. Finally, the global features of our learning samples encode the probability distribution of winning/losing a given debate (represented as acceptable-extension-based bipartite graphs), and the label is binary, indicating the winning stance.
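In the proposed method, node features come from a pre-trained multilingual sentence encoder. In the self-contained sketch below, embed() is a deterministic hash-based stand-in for that encoder (an assumption for illustration, not the real XLM-RoBERTa model), producing unit vectors of the paper's embedding size:

```python
import hashlib
import math

DIM = 768  # sentence-embedding dimensionality used in the paper

def embed(text, dim=DIM):
    """Deterministic stand-in for a sentence embedding.

    In the actual method this would be a Transformer-based sentence
    encoder; here we hash the text into a unit vector purely so the
    sketch runs without a pre-trained model.
    """
    h = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [(h[i % len(h)] - 127.5) / 127.5 for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

def init_node_features(arguments):
    """arguments: {node_id: argument text} -> {node_id: feature vector}."""
    return {i: embed(text) for i, text in arguments.items()}
```

Swapping embed() for a real encoder call is the only change needed to obtain the actual initialisation.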

The last step in Phase II is aimed at learning the task of automatic evaluation of argumentative debates (see Figure 1, Graph Network). Having defined all the features of our graph learning samples, we need to describe how the GN model handles the update functions in order to properly approach the automatic debate evaluation task. In a classical debate, there are always two teams/stances: in favour of and against some specific claim. In this paper, we approach the debate evaluation as a binary classification task. Therefore, at the end of the proposed method, we model the classification problem as follows:

y* = argmax_y P(y | G),   y ∈ {0, 1}      (2)

where y ∈ {0, 1} depending on the winning team of each debate, and G is a complete bipartite graph generated from an acceptable extension during the AF pre-processing described in Phase I of our method. Thus, we approach this probabilistic modelling with three Multi-Layer Perceptrons (MLPs), one for each of the φ update functions. Since debate evaluation is an instance of the graph prediction task, it is important to point out that the architectures of the two MLPs approaching φ^e and φ^v are equivalent, and their parameters are learnt from the backpropagation of the MLP architecture for φ^u, which has a 2-unit linear layer (for binary classification) and a softmax function (for modelling the probability distribution) on top.

5 Experiments

5.1 Experimental Setup

All the experiments and results reported in this paper were implemented using Python 3 and run under the following setup. The initial corpus pre-processing and data structuring (i.e., Phase I) was carried out using the Pandas McKinney et al. (2010) and NetworkX Hagberg et al. (2008) libraries. The argumentation semantics were implemented over the NetworkX-based AF graph structures. Regarding Phase II, the language model and the dense sentence vector embeddings were implemented through the Sentence Transformers Reimers and Gurevych (2019) library. We used a pre-trained XLM-RoBERTa Reimers and Gurevych (2020) able to encode multilingual natural language inputs into a 768-dimensional dense vector space (i.e., the word embedding size). Finally, the Jraph Godwin et al. (2020) library was used to implement the graph network architecture, to learn its update functions (i.e., Equation 1), and for the probabilistic modelling of Equation 2. We used an Intel Core i7-9700k computer with an NVIDIA RTX 3090 GPU and 32GB of RAM to run our experiments.

Furthermore, it is important to define precisely the notion of a learning sample, and how the data pipeline manages all these samples and structures them for training/evaluation. In our proposal, we define a learning sample as an acceptable extension of a given debate. Thus, different debates may produce different numbers of learning samples depending on the argumentation semantics and/or the topology of the argumentation framework. This way, learning samples can be managed from a debate-wise or an extension-wise viewpoint. Even though we used the learning samples individually (i.e., extension-wise) for the training and evaluation of the proposed model, we always consider debate-wise partitions of our data. This decision was made because it would be unfair to include learning samples belonging to the same debate in both our train and test data partitions: the reported results could be misleading, and would not properly reflect the generalisation capability of the proposed model.
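The debate-wise partitioning described above can be sketched as follows (a minimal illustration; the sample representation and split bookkeeping are assumptions):

```python
def debate_wise_split(samples, train_fraction=0.8):
    """Split extension-wise samples into train/test partitions by debate.

    samples: list of (debate_id, sample) pairs
    All samples of a given debate land in the same partition, so no
    debate leaks across the train/test boundary.
    """
    debates = sorted({d for d, _ in samples})
    n_train = round(len(debates) * train_fraction)
    train_debates = set(debates[:n_train])
    train = [s for d, s in samples if d in train_debates]
    test = [s for d, s in samples if d not in train_debates]
    return train, test
```

Note that because debates yield unequal numbers of extensions, an 80%-20% split of debates generally does not give an exact 80%-20% split of samples, which is precisely the irregularity discussed in the results.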

Therefore, we used 28 complete debates in our experiments (17 in favour, 11 against), divided into training and test partitions following an 80%-20% split. Once the partitions were defined, we applied the different semantics to generate the acceptable extensions (i.e., the learning samples) and trained the GN model, reporting the results presented in the following section.

5.2 Results

Regarding Phase I, we computed the Naïve and Preferred acceptable extension sets from the AF representations of the 28 debates. This led to a total of 467 Naïve and 31 Preferred extensions. We can observe that the admissibility principle is much stricter than the conflict-free principle from the logical viewpoint, and that this has a significant repercussion on the number of learning samples produced. The 467 Naïve extensions are distributed as follows: 203 learning samples belong to class 0 (i.e., the in-favour team wins, 43.4% of the cases), and 264 samples belong to class 1 (i.e., the against team wins, 56.6% of the cases). Similarly, the 31 Preferred extensions are distributed as follows: 19 learning samples belong to class 0, and 12 samples to class 1. Given the irregular generation of acceptable extensions per debate mentioned above, we carried out all of our experiments with a fixed train/test partition, with 80% of the debates used for training and 20% for evaluation, such that the acceptable extensions produced in both partitions did not significantly alter this distribution. An example of this irregularity is the AF encoded from Debate3.csv, which produces 92 Naïve extensions, versus the AF encoded from Debate29.csv, which produces only 4. Given this restriction, and to provide solid results (i.e., not using learning samples from the same debate for both train and test), we were not able to perform a K-fold evaluation without producing data splits with significant differences in sample distributions. Thus, in all of our experiments we used 23 debates producing 369 Naïve and 26 Preferred samples for training, and 5 debates producing 98 Naïve and 5 Preferred samples for testing.

With the data configurations defined above, we trained two different GN models to approach the task of automatically evaluating argumentative debates: a Naïve-GN and a Preferred-GN. Furthermore, we defined four baselines against which to compare the performance of our proposed method: a Random Baseline (RB) that randomly assigns a class to each extension; a Naïve Argumentation Theory Baseline (Naïve-ATB) that classifies each Naïve extension by counting the majority number of acceptable arguments belonging to each stance; a Preferred Argumentation Theory Baseline (Preferred-ATB) that performs a classification similar to the Naïve-ATB but considers Preferred extensions; and a Language Modelling Baseline (LMB), implemented by ignoring Phase I of the proposed method and training the GN directly on the whole argumentative analysis graphs. The results of our experiments are depicted in Table 1.

Model          Train D  Train S  Test D  Test S   Acc.  Macro-F1
Naïve-GN            23      369       5      98   0.72      0.48
Preferred-GN        23       26       5       5   0.40      0.40
Naïve-ATB            -        -       -       -   0.16      0.14
Preferred-ATB        -        -       -       -   0.20      0.16
LMB                 23       23       5       5   0.60      0.37
RB                   -        -       -       -   0.48      0.33
Table 1: Accuracy and Macro-F1 results of the automatic debate evaluation task. D and S indicate the number of debates and learning samples respectively used in the Train and Test data partitions in our experiments.

It is possible to observe that the best results are achieved by the Naïve-GN model, which is also the one with the largest number of samples for training and testing. The Preferred-GN and the LMB provide less informative results due to the small number of available test samples; we observed that the LMB always learns to predict the majority class. Both ATBs performed worse, which suggests that relying only on aspects of computational argumentation theory might not be enough to conduct a proper automatic evaluation of natural language debates. Therefore, we can point out that the hybrid method proposed in this paper benefits from theoretical aspects of argumentation logic, which increase the amount of useful data obtainable from a small set of natural language debates, together with NLP techniques that enable a probabilistic modelling of natural language inputs, improving the automatic evaluation of human debates.

6 Method Implementation

In this section, we present a step-by-step implementation of the Naïve-GN model and the proposed method for automatic debate evaluation in a real debate. For that purpose, we use the Debate7.csv file, which was excluded from the previous experiments. Figure 2 presents the argumentative graph resulting from the preliminary analysis of the argumentative natural language in this debate file. The nodes represent ADUs, and the edges represent the argumentative relations between them. The size of a node in the argument graph indicates the phase of the debate in which the ADU was uttered (i.e., introduction, argumentation, or conclusion), and its colour indicates the team stance. The colour of the edges indicates the kind of argumentative relation existing between ADUs: a red arrow represents a conflict, a green one an inference, and a yellow one a rephrase.

Figure 2: Argumentation graph resulting from a preliminary analysis of a natural language debate.
(a) Complete Argumentation Framework generated from the argumentation graph.
(b) Conflicting arguments of the Argumentation Framework.
Figure 3: Argumentation Framework visualisation.

Phase I: The first step is to encode the natural language argumentation graph into a compact computational representation that simplifies the data structures and correctly condenses all the argumentative information. For that purpose, Algorithm 1 is applied, taking the argumentation graph (Figure 2) as input and producing an Argumentation Framework (Figure 2(a)) as output. Here, the size of a node represents the natural language size of each abstract argument. Next, we compute the acceptable extensions of the AF under Naïve semantics. Four different extensions are obtained from this AF. All the nodes (i.e., arguments) that do not belong to any attack relation are included in all of the acceptable extensions; the variations between extensions therefore happen among the conflicting nodes. Under the conflict-free principle, arguments [3, 8, 26, 65] are accepted within the first extension; [3, 8, 26, 67] within the second; [3, 8, 27, 65] within the third; and [3, 8, 27, 67] within the fourth (see Figure 2(b) for the node numeration).

Phase II: The next step is to generate the learning samples from the four acceptable extensions. Thus, a complete bipartite graph is produced for each extension, in which the two disjoint sets contain all the arguments belonging to the same stance. Before predicting the outcome of the debate, each of these samples is initialised with the XLM-RoBERTa sentence embedding vector representations of the natural language arguments encoded within its nodes. Then, each of the four samples is fed into the Naïve-GN model trained in our experiments; all four samples produce the same prediction, class 0 (i.e., the team in favour wins). In this specific case, the model has correctly predicted all the debate samples, but this might not always be the case. In situations where there are conflicting predictions, a normalised aggregation of the softmax probability distributions of the different samples is performed to determine the most probable class for the debate.
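That final aggregation step can be sketched as follows (a minimal illustration of the normalised aggregation described above):

```python
def aggregate_predictions(distributions):
    """Combine per-extension softmax outputs into one debate verdict.

    distributions: [[p_class0, p_class1], ...], one per extension.
    Returns (winning_class, normalised class probabilities): the
    per-class probabilities are summed over the extensions and
    renormalised to pick the most probable class for the debate.
    """
    totals = [sum(d[c] for d in distributions) for c in (0, 1)]
    z = sum(totals)
    probs = [t / z for t in totals]
    return (0 if probs[0] >= probs[1] else 1), probs
```

For instance, three extension-level distributions [0.9, 0.1], [0.6, 0.4], and [0.2, 0.8] aggregate to class 0 even though one extension individually favours class 1.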

7 Conclusions

In this paper, we propose an original hybrid method to approach the automatic evaluation of natural language argumentative debates. For that purpose, we present a new instance of the argument assessment task, in which argumentative debates and their underlying lines of reasoning are considered in a comprehensive, undivided manner. The proposed method combines aspects of formal logic and computational argumentation theory with NLP and Deep Learning.

From the observed results, several conclusions can be drawn. First, our method performed better than approaching the debate evaluation task independently from either the argumentation theory or the NLP viewpoint. Furthermore, we also observed in our experiments that conflict-free semantics produce a higher number of acceptable extensions from each AF than admissibility-based semantics. This helped to improve the learning of the argument evaluation task in a way similar to that achieved by data augmentation techniques. Thus, our model can learn better probabilistic distributions of natural language features and dependencies that are not too constrained by formal logic and graph topology (as they are when using admissibility-based semantics).

This paper represents a solid starting point for research on the evaluation of natural language argumentative debates. However, much research remains to be done and many aspects remain unexplored. As future work, we plan to consider finer-grained features for the evaluation of argumentation, such as thesis solidity, argumentation quality, and adaptability. Furthermore, a broader investigation can be carried out by considering other argumentation semantics and deep learning architectures than the ones researched in this paper. Finally, there is still a problem with the amount of annotated data: subsequent research on this topic would significantly benefit from extending the number of completely annotated professional debates.


Acknowledgements

This work is partially supported by the Spanish Government project PID2020-113416RB-I00.


References

  • K. Al Khatib, M. Völske, S. Syed, N. Kolyada, and B. Stein (2020) Exploiting personal characteristics of debaters for predicting persuasiveness. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7067–7072.
  • J. Bao, C. Fan, J. Wu, Y. Dang, J. Du, and R. Xu (2021) A neural transition-based model for argumentation mining. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6354–6364.
  • P. Baroni, M. Caminada, and M. Giacomin (2011) An introduction to argumentation semantics. The Knowledge Engineering Review 26 (4), pp. 365–410.
  • P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
  • K. Budzynska and C. Reed (2011) Whence inference. University of Dundee Technical Report.
  • D. Craandijk and F. Bex (2021) Deep learning for abstract argumentation semantics. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 1667–1673.
  • P. M. Dung (1995) On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artificial Intelligence 77 (2), pp. 321–357.
  • R. El Baff, H. Wachsmuth, K. Al Khatib, and B. Stein (2020) Analyzing the persuasive effect of style in news editorial argumentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3154–3160.
  • M. Gleize, E. Shnarch, L. Choshen, L. Dankin, G. Moshkowich, R. Aharonov, and N. Slonim (2019) Are you convinced? Choosing the more convincing evidence with a Siamese network. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 967–976.
  • J. Godwin, T. Keck, P. Battaglia, V. Bapst, T. Kipf, Y. Li, K. Stachenfeld, P. Veličković, and A. Sanchez-Gonzalez (2020) Jraph: a library for graph neural networks in JAX.
  • A. Hagberg, P. Swart, and D. S. Chult (2008) Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Laboratory, Los Alamos, NM.
  • Y. Jo, J. Visser, C. Reed, and E. Hovy (2019) A cascade model for proposition extraction in argumentation. In Proceedings of the 6th Workshop on Argument Mining, pp. 11–24.
  • M. Lenz, P. Sahitaj, S. Kallenberg, C. Coors, L. Dumani, R. Schenkel, and R. Bergmann (2020) Towards an argument mining pipeline transforming texts to argument graphs. In Computational Models of Argument: Proceedings of COMMA 2020, Vol. 326, pp. 263.
  • W. McKinney et al. (2010) Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, Vol. 445, pp. 51–56.
  • M. Orbach, Y. Bilu, A. Gera, Y. Kantor, L. Dankin, T. Lavee, L. Kotlerman, S. Mirkin, M. Jacovi, R. Aharonov, and N. Slonim (2019) A dataset of general-purpose rebuttal. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5591–5601.
  • A. Peldszus and M. Stede (2013) From argument diagrams to argumentation mining in texts: a survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI) 7 (1), pp. 1–31.
  • P. Potash and A. Rumshisky (2017) Towards debate automation: a recurrent model for predicting debate winners. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2465–2475.
  • N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992.
  • N. Reimers and I. Gurevych (2020) Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.
  • A. Roush and A. Balaji (2020) DebateSum: a large-scale argument mining and summarization dataset. In Proceedings of the 7th Workshop on Argument Mining, pp. 1–7.
  • R. Ruiz-Dolz, J. Alemany, S. M. H. Barberá, and A. García-Fornes (2021a) Transformer-based models for automatic identification of argument relations: a cross-domain evaluation. IEEE Intelligent Systems 36 (6), pp. 62–70.
  • R. Ruiz-Dolz, M. Nofre, M. Taulé, S. Heras, and A. García-Fornes (2021b) VivesDebate: a new annotated multilingual corpus of argumentation in a debate tournament. Applied Sciences 11 (15), pp. 7160.
  • D. Shirafuji, R. Rzepka, and K. Araki (2019) Debate outcome prediction using automatic persuasiveness evaluation and counterargument relations. In LaCATODA/BtG@IJCAI, pp. 24–29.
  • J. Visser, B. Konat, R. Duthie, M. Koszowy, K. Budzynska, and C. Reed (2020) Argumentation in the 2016 US presidential elections: annotated corpora of television debates and social media reaction. Language Resources and Evaluation 54 (1), pp. 123–154.
  • H. Wachsmuth, N. Naderi, Y. Hou, Y. Bilu, V. Prabhakaran, T. A. Thijm, G. Hirst, and B. Stein (2017) Computational argumentation quality assessment in natural language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 176–187.