Robust Argument Unit Recognition and Classification

04/22/2019 · Dietrich Trautmann, et al.

Argument mining is generally performed on the sentence-level -- it is assumed that an entire sentence (not parts of it) corresponds to an argument. In this paper, we introduce the new task of Argument unit Recognition and Classification (ARC). In ARC, an argument is generally a part of a sentence -- a more realistic assumption since several different arguments can occur in one sentence and longer sentences often contain a mix of argumentative and non-argumentative parts. Recognizing and classifying the spans that correspond to arguments makes ARC harder than previously defined argument mining tasks. We release ARC-8, a new benchmark for evaluating the ARC task. We show that token-level annotations for argument units can be gathered using scalable methods. ARC-8 contains 25% more arguments than a dataset annotated on the sentence-level would. We cast ARC as a sequence labeling task, develop a number of methods for ARC sequence tagging and establish the state of the art for ARC-8. A focus of our work is robustness: both robustness against errors in sentence identification (which are frequent for noisy text) and robustness against divergence in training and test data.


1 Introduction

Argument mining Peldszus and Stede (2013) has gained substantial attention from researchers in the NLP community, mostly due to its complexity as a task requiring sophisticated reasoning, but also due to the availability of high-quality resources. Those resources include discourse-level closed-domain datasets for political, educational or legal applications Walker et al. (2012); Stab and Gurevych (2014); Wyner et al. (2010), as well as open-domain datasets for topic-dependent argument retrieval from heterogeneous sources Stab et al. (2018b); Shnarch et al. (2018). While discourse-level argument mining aims to parse argumentative structures in a fine-grained manner within single documents (thus, mostly in single domains or applications), topic-dependent argument retrieval focuses on argumentative constructs such as claims or evidence with regard to a given topic that can be found in very different types of discourse. Argument retrieval typically frames the argumentative unit (argument, claim, evidence etc.) on the sentence-level, i.e., it seeks to detect sentences that are relevant supporting (PRO) or opposing (CON) arguments, as in the examples given in Fig. 1.

In this work, we challenge the assumption that arguments should be detected on the sentence-level. This is partly justified by the difficulty of “unitizing”, i.e., of segmenting a sentence into meaningful units for argumentation tasks Stab et al. (2018b); Miller et al. (2019). We show that re-framing the argument retrieval task as Argument unit Recognition and Classification (ARC), i.e., as recognition and classification of spans within a sentence on the token-level, is feasible, not just in terms of the reliability of recognizing argumentative spans, but also in terms of the scalability of generating training data.

Figure 1: Examples of sentences with two arguments as well as with annotated spans and stances.

Framing argument retrieval as in ARC, i.e., on the token-level, has several advantages:

  • It prevents merging otherwise separate arguments into a single argument (e.g., for the topic death penalty in Fig. 1).

  • It can handle two-sided argumentation adequately (e.g., for the topic gun control in Fig. 1).

  • It can be framed as a sequence labeling task, a common scenario in NLP with many available architectures for experimentation Eger et al. (2017).

To assess the feasibility of ARC, we address the following questions. First, we discuss how to select suitable data for annotating arguments on the token-level. Second, we analyze whether the annotation of arguments on the token-level can be conducted reliably, both with trained experts and with untrained workers in a crowdsourcing setup. Third, we test a few basic as well as state-of-the-art sequence labeling methods on ARC.

A focus of our work is robustness. (i) The assumption that arguments correspond to complete sentences makes argument mining brittle – when the assumption does not hold, sentence-level argument mining makes mistakes. In addition, sentence identification is error-prone for noisy text (e.g., text crawled from the web), resulting in noisy non-sentence units being equated with arguments. (ii) The properties of argument topics vary considerably from topic to topic. An ARC method trained on one topic will not necessarily perform well on another. We set up ARC-8 to make it easy to test the robustness of argument mining by including a cross-domain split, and we demonstrate that cross-domain generalization is challenging for ARC-8.

2 Related Work

Our work follows the established line of work on argument mining in the NLP community, which can loosely be divided into approaches detecting and classifying arguments on the discourse level Palau and Moens (2009); Stab and Gurevych (2014); Eger et al. (2017) and ones focusing on topic-dependent argument retrieval Levy et al. (2014); Wachsmuth et al. (2017); Hua and Wang (2017); Stab et al. (2018b). Our work is in line with the latter: we model arguments as self-contained pieces of information which can be verified as relevant arguments for a given topic with no or minimal surrounding context.

As one of the main contributions of this work, we show how to create training data for token-level argument mining with the help of crowdsourcing. Stab et al. (2018b) and Shnarch et al. (2018) annotated topic-dependent arguments on the sentence-level using crowdsourcing. The reported Fleiss’ agreement scores were 0.45 for crowd workers in Shnarch et al. (2018) and 0.72 for experts in Stab et al. (2018b). Miller et al. (2019) present a multi-step approach to crowdsource more complex argument structures in customer reviews. Like us, they annotate arguments on the token-level – however, they annotate argument components from the discourse-level perspective. Their inter-annotator agreement (roughly between 0.4 and 0.5) is low, demonstrating the difficulty of this task. In this work, to capture argument spans more precisely, we test the validity of arguments using a slot filling approach. Reisert et al. (2018) also use argument templates, i.e., slots, to determine arguments.

Close to the spirit of this work, Ajjour et al. (2017) compare various token-level argumentative unit segmentation approaches across three corpora. They use a feature-based approach and various architectures for segmentation and find that BiLSTMs work best on average. However, as opposed to this work, they study argumentation on the discourse level, i.e., they do not consider topic-dependency and only account for arguments and non-arguments (no argumentative types or relations like PRO and CON). Eger et al. (2017) model discourse-level argument segmentation, identification (claims, premises and major claims) and relation extraction as sequence tagging, dependency parsing and entity-relation extraction. For a dataset of student essays Stab and Gurevych (2014), they find that sequence tagging and an entity-relation extraction approach Miwa and Bansal (2016) work best. In particular, for the unit segmentation task (vanilla BIO), they find that state-of-the-art sequence tagging approaches can perform as well as or even better than human experts. Stab and Gurevych (2017) propose a CRF-based approach with manually defined features for the unit segmentation task on student essays Stab and Gurevych (2014) and also achieve performance close to human experts.

3 Corpus Creation

Collecting annotations on the token-level is challenging. First, the unit of annotation needs to be clearly defined. This is straightforward for tasks with short spans (sequences of words) such as named entities, but much harder for longer spans – as in the case of argument units. Second, labels from multiple annotators need to be merged into a single gold standard. (One could also learn from “soft” labels, i.e., a distribution created from the votes of multiple annotators. However, this does not solve the problem that some annotators deliver low-quality work and their votes should be outvoted by a (hopefully) higher-quality majority of annotators.) This merging is also more difficult for longer sequences because simple majority voting over individual words will likely create invalid (e.g., disrupted or grammatically incorrect) spans.

To address these challenges, we carefully designed the selection of sources, the sampling and the annotation of input for ARC-8, our novel argument unit dataset. We first describe how we processed and retrieved data from a large web crawl. Next, we outline the sentence sampling process that accounts for a balanced selection of both (non-)argument types and source documents. Finally, we describe how we crowdsource annotations of argument units within sentences in a scalable way.

3.1 Data Source

We used the February 2016 Common Crawl archive (http://commoncrawl.org/2016/02/february-2016-crawl-archive-now-available/), which was indexed with Elasticsearch (https://www.elastic.co/products/elasticsearch) following the description in Stab et al. (2018a). For the sake of comparability, we adopt the eight topics of Stab et al. (2018b) (cf. Table 1). The topics are general enough to have good coverage in Common Crawl. They are also of a controversial nature and hence a potentially good choice for argument mining, with an expected broad set of supporting and opposing arguments.

3.2 Retrieval Pipeline

For document retrieval, we queried the indexed data for the topics of Stab et al. (2018b) and collected the first results per topic ordered by their Elasticsearch document score; a higher document score indicates higher relevance for the topic. Each document was checked for its corresponding WARC file at the Common Crawl Index (http://index.commoncrawl.org/CC-MAIN-2016-07). We then downloaded and parsed the original HTML document for the next steps of our pipeline; this ensures reproducibility. Following this, we used justext (http://corpus.tools/wiki/Justext) to remove HTML boilerplate. The resulting document was segmented into separate sentences, and each sentence into single tokens, using spaCy (https://spacy.io/). We only consider sentences whose number of tokens falls within a fixed range.
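A minimal sketch of this pipeline is shown below. The index name, the queried field, the storage of raw HTML in the index and the sentence-length bounds are illustrative assumptions (the exact settings are not restated here), and the Elasticsearch client call style depends on the client version.

```python
import justext
import spacy
from elasticsearch import Elasticsearch

nlp = spacy.load("en_core_web_sm")  # tokenization and sentence segmentation

def retrieve_and_segment(topic, es_host="http://localhost:9200",
                         index="commoncrawl", size=500):
    """Query an Elasticsearch index of Common Crawl documents for a topic,
    strip HTML boilerplate with justext and return tokenized sentences."""
    es = Elasticsearch(es_host)
    hits = es.search(index=index, size=size,
                     body={"query": {"match": {"text": topic}}})["hits"]["hits"]
    sentences = []
    for hit in hits:
        doc_score = hit["_score"]      # Elasticsearch relevance for the topic
        html = hit["_source"]["html"]  # assumes the raw HTML is stored in the index
        paragraphs = justext.justext(html, justext.get_stoplist("English"))
        text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
        for sent in nlp(text).sents:
            tokens = [t.text for t in sent]
            if 8 <= len(tokens) <= 60:  # illustrative length bounds
                sentences.append({"topic": topic, "doc_score": doc_score,
                                  "tokens": tokens})
    return sentences
```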

3.3 Sentence Sampling

The sentences were pre-classified with a sentence-level argument mining model following Stab et al. (2018b), available via the ArgumenText Classify API (https://api.argumentsearch.com/en/doc). For each sentence, the API returns (i) an argument confidence score (we discard sentences whose score falls below a threshold), (ii) the stance on the sentence-level (PRO or CON) and (iii) the stance confidence score. This information was used together with the document score to rank sentences for selection in the following crowd annotation process. First, all three scores (document, argument and stance confidence) were converted to ranks over the available sentences; second, the ranks were summed up to create a combined rank for each sentence (see Eq. 1), with the three terms being the ranks of the document, argument and stance confidence scores, respectively.

rank(s) = rank_doc(s) + rank_arg(s) + rank_stance(s)    (1)

The ranked sentences were divided by topic and pre-classified sentence-level stance and ordered by their combined rank (where a lower rank indicates a better candidate). We then went down the ranked list and selected each sentence with a fixed probability until the target size per stance and topic was reached; otherwise we did additional passes through the list. Table 1 gives dataset creation statistics.
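The rank-sum scoring and the probabilistic pass over the ranked list can be sketched as follows; the selection probability and the target size per stance and topic are placeholders, not the values used for ARC-8.

```python
import random

def to_ranks(values):
    """Rank values from best (1) to worst; higher scores are better."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def sample_candidates(cands, target_size=500, keep_prob=0.5, seed=0):
    """cands: dicts with doc_score, arg_score and stance_score.
    Order by the sum of the three ranks (Eq. 1, lower is better), then walk
    down the list keeping each sentence with a fixed probability until the
    target size is reached, doing additional passes if necessary."""
    r_d = to_ranks([c["doc_score"] for c in cands])
    r_a = to_ranks([c["arg_score"] for c in cands])
    r_s = to_ranks([c["stance_score"] for c in cands])
    combined = (d + a + s for d, a, s in zip(r_d, r_a, r_s))
    ranked = [c for c, _ in sorted(zip(cands, combined), key=lambda pair: pair[1])]
    rng = random.Random(seed)
    selected, pool = [], ranked
    while len(selected) < target_size and pool:
        rest = []
        for c in pool:
            if len(selected) < target_size and rng.random() < keep_prob:
                selected.append(c)
            else:
                rest.append(c)
        pool = rest
    return selected
```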

# topic #docs #text #sentences #candidates #final #arg-sent. #arg-segm. #non-arg
T1 abortion 491 454 39,083 3,282 1,000 424 472 (+11.32%) 576
T2 cloning 495 252 30,504 2,594 1,000 353 400 (+13.31%) 647
T3 marijuana legalization 490 472 45,644 6,351 1,000 630 759 (+20.48%) 370
T4 minimum wage 494 479 43,128 8,290 1,000 630 760 (+20.63%) 370
T5 nuclear energy 491 470 43,576 5,056 1,000 623 726 (+16.53%) 377
T6 death penalty 491 484 32,253 6,079 1,000 598 711 (+18.90%) 402
T7 gun control 497 479 38,443 4,576 1,000 529 624 (+17.96%) 471
T8 school uniforms 495 475 40,937 3,526 1,000 713 891 (+24.96%) 287
total 3,944 3,565 314,568 39,754 8,000 4,500 5,343 (+18.73%) 3,500
Table 1: Number of documents and sentences in the selection process and the final corpus size; arg-sent. is the number of argumentative sentences; arg-segm. is the number of argumentative segments; the percentage value compares the number of argumentative segments to the number of argumentative sentences.

3.4 Crowd Annotations

The goal of this work was to come up with a scalable approach to annotate argument units on the token-level. Given that arguments need to be annotated with regard to a specific topic, large amounts of (cross-topic) training data need to be created. As has been shown by previous work on topic-dependent argument mining Shnarch et al. (2018); Stab et al. (2018b), crowdsourcing can be used to obtain reliable annotations for argument mining datasets. However, as outlined above, token-level annotation significantly increases the difficulty of the annotation task, so it was unclear whether agreement among untrained crowd workers would be sufficiently high.

We use the agreement measure of Krippendorff et al. (2016) in this work. It is designed for annotation tasks that involve unitizing textual continua – i.e., segmenting continuous text into meaningful subunits – and for measuring chance-corrected agreement in those tasks. It is also a good fit for argument spans within a sentence: typically these spans are long and the context is a single sentence that may contain any type of argument and any number of arguments. Krippendorff et al. (2016) define a family of unitizing reliability coefficients that improve upon several weaknesses of previous measures. From these, we chose the coefficient that also takes into account agreement on “blanks” (non-arguments in our case). The rationale behind this was that ignoring agreement on sentences without any argument spans would over-proportionally penalize disagreement in sentences that contain arguments while ignoring agreement in sentences without arguments.

To determine agreement, we initially carried out an in-house expert study with three graduate employees (who were trained on the task beforehand) and randomly sampled 160 sentences (10 per topic and stance) from the overall data. In the first round, we did not impose any restrictions on the span of words to be selected, other than that the selected span should be the shortest self-contained span that forms an argument. This resulted in unsatisfying agreement (0.51, averaged over topics), one reason being inconsistency in selecting argument spans (the median length of arguments ranged from nine to 16 words among the three experts). In a second round, we therefore decided to restrict the spans that could be selected by applying a slot filling approach that enforces valid argument spans that match a template. We use the template “[topic] should be supported/opposed, because [candidate span]”. The guidelines specify that the resulting sentence had to be a grammatically sound statement. Although this choice unsurprisingly increased the length of spans and reduced the total number of arguments selected, it increased the consistency of spans substantially (the minimum/maximum median length was now between 15 and 17 words). Furthermore, the agreement between the three experts rose to 0.61 (averaged over topics). Compared to other studies on token-level argument mining Eckle-Kohler et al. (2015); Li et al. (2017); Stab and Gurevych (2014), this score is in an acceptable range and we deem it sufficient to proceed with crowdsourcing.

In our crowdsourcing setup, workers could select one or multiple spans, where each span’s permissible length is between one token and the entire sentence. Workers had to either choose at least one argument span and its stance (supporting/opposing), or indicate that the sentence did not contain a valid argument and instead solve a simple math problem. We introduced further quality control measures in the form of a qualification test and periodic attention checks. (Workers had to be located in the US, CA, AU, NZ or GB, with an acceptance rate of 95% or higher. Payment was $0.42 per HIT, corresponding to the US federal minimum wage of $7.25/hour. The annotators in the expert study were salaried research staff.) On an initial batch of 160 sentences, we collected votes from nine workers. To determine the optimal number of workers for the final study, we did majority voting on the token-level (ties broken as non-arguments) for both the expert study and workers from the initial crowd study. We artificially reduced the number of workers (1-9) and calculated the percentage overlap averaged across all worker combinations (for worker numbers lower than 9). Whereas the overlap was highest at nine votes, it dropped only slightly at five votes (and decreased more significantly for fewer votes). We deemed five votes to be an acceptable compromise between quality and cost. The agreement with experts in the five-worker setup is 0.71, which is substantial Landis and Koch (1977).
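The token-level majority vote used for this comparison (ties broken as non-arguments) amounts to the following; this is only the comparison heuristic, not the final label aggregation described next.

```python
from collections import Counter

def token_majority_vote(annotations, non_label="NON"):
    """annotations: one label sequence per worker over the same tokens,
    e.g. [["PRO", "PRO", "NON"], ["PRO", "NON", "NON"], ["PRO", "PRO", "PRO"]].
    Returns one label per token; ties are broken as non-argument."""
    merged = []
    for token_labels in zip(*annotations):
        top = Counter(token_labels).most_common()
        if len(top) > 1 and top[0][1] == top[1][1]:  # tie between the most frequent labels
            merged.append(non_label)
        else:
            merged.append(top[0][0])
    return merged
```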

The final gold standard labels on the 8000 sampled sentences were determined using a variant of Bayesian Classifier Combination Kim and Ghahramani (2012), referred to as IBCC in the modular framework for Bayesian aggregation of sequence labels of Simpson and Gurevych (2018). This method has been shown to yield results superior to majority voting or MACE Hovy et al. (2013).

3.5 Dataset Splits

We create two different dataset splits. (i) An in-domain split. This lets us evaluate how models perform on known vocabulary and data distributions. (ii) A cross-domain split. This lets us evaluate how well a model generalizes to unseen topics and to distributions different from the training set. In the cross-domain setup, we defined topics T1-T5 to be the train set, topic T6 the development set and topics T7 and T8 the test set. For the in-domain setup, we excluded topics T7 and T8 (the cross-domain test set) and used the first portion of the sentences from topics T1-T6 for train, the next for dev and the remaining for test. The samples from the in-domain test set were also excluded from the cross-domain train and development sets. As a result, there are 4000 samples in train, 800 in dev and 2000 in test for the cross-domain split; and 4200 samples in train, 600 in dev and 1200 in test for the in-domain split. We work with two different splits so as to guarantee that train/dev sets (in-domain or cross-domain) do not overlap with test sets (in-domain or cross-domain). The assignment of sentences to the two splits is released as part of ARC-8.

3.6 Dataset Statistics

The resulting data set, ARC-8 (we will make the dataset available at www.ukp.tu-darmstadt.de/data), consists of 8000 annotated sentences, 3500 (43.75%) of which are non-argumentative. The 4500 argumentative sentences are divided into 1951 (43.36%) sentences with a single pro argument and 1799 (39.98%) sentences with a single contra argument; the remaining 750 (16.67%) sentences contain various combinations of supporting (PRO) and opposing (CON) arguments with up to five single argument segments in a sentence. Thus, the token-level annotation leads to a higher (+18.73%) total count of 5343 arguments, compared to 4500 with a sentence-level approach. If we propagate the label of a sentence to all its tokens, then 100% of the tokens of argumentative sentences are argumentative. This ratio drops to 69.94% in our token-level setup, reducing the amount of non-argumentative tokens that would otherwise be incorrectly selected as argumentative in a sentence.

4 Methods

We model ARC as a sequence labeling task. The input is a topic and a sentence. The goal is to select spans of words, each of which corresponds to an argument unit. Following Stab et al. (2018b), we distinguish between PRO and CON arguments (“[topic] should be supported/opposed, because [argument unit]”). To measure the difficulty of ARC, we estimate the performance of simple baselines as well as of current NLP models that achieve state-of-the-art results on other sequence labeling data sets Devlin et al. (2018).
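To make the input/output format concrete, here is a hypothetical training example in the three-label setup; both the sentence and its token labels are purely illustrative.

```python
example = {
    "topic": "gun control",
    "tokens": ["Stricter", "laws", "reduce", "gun", "deaths", ",", "critics",
               "say", "they", "violate", "the", "second", "amendment", "."],
    # one label per token: PRO/CON argument spans plus NON for non-argumentative tokens
    "labels": ["PRO", "PRO", "PRO", "PRO", "PRO", "NON", "NON",
               "NON", "CON", "CON", "CON", "CON", "CON", "NON"],
}
assert len(example["tokens"]) == len(example["labels"])
```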

4.1 1-class Baselines

The 1-class baseline labels the data set completely (i.e., every token of every sentence) with one of the three labels PRO, CON and NON.

4.2 Sentence-Level Baselines

As the sentence-level baseline, we used labels produced by the previously mentioned ArgumenText Classify API of Stab et al. (2018a). Since it is a sentence-level classifier, we projected the sentence-level prediction onto all tokens of a sequence to enable token-level evaluation.

4.3 BERT

Furthermore, we used the BERT base (cased) model (https://github.com/huggingface/pytorch-pretrained-BERT) Devlin et al. (2018), a recent state-of-the-art model which has achieved impressive results on many tasks including sequence labeling. For this model we considered two scenarios. First, we kept the parameters as they are and used the model as a frozen feature extractor. Second, we fine-tuned the parameters for the ARC task and the corresponding tagsets.
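The two scenarios can be sketched as below with the current Hugging Face transformers API (rather than the older pytorch-pretrained-BERT package referenced above); the hyperparameters, the toy batch and the choice to train only the classification head in the frozen setup are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

NUM_LABELS = 3  # PRO, CON, NON

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased",
                                                   num_labels=NUM_LABELS)

FREEZE = True  # True: frozen feature extractor; False: fine-tune all parameters
if FREEZE:
    for param in model.bert.parameters():
        param.requires_grad = False  # only the token classification head is trained

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5)

# One illustrative training step on a toy, pre-tokenized batch with dummy labels.
batch = tokenizer([["Cloning", "should", "be", "banned", "because", "it", "is", "unethical"]],
                  is_split_into_words=True, return_tensors="pt", padding=True)
labels = torch.zeros(batch["input_ids"].shape, dtype=torch.long)  # all-NON dummy labels
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```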

5 Experiments

In total, we run three different experiments on the ARC-8 dataset with the previously introduced models, which we describe in this section. Additionally, we experimented with different tagsets for the ARC task. All experiments were conducted on a single GPU with 11 GB memory.

5.1 1-class Baselines

For the simple baselines, we applied 1-class sequence tagging on the corresponding development and test sets for the in-domain and cross-domain setups. This allowed us to estimate the expected lower bounds for more complex models.

5.2 Token- vs. Sentence-Level

To further investigate the performance of a token-level model vs. a sentence-level model, we run four different training procedures and evaluate the results on both the token- and the sentence-level. First, we train models on the token-level (sequence labeling) and also evaluate on the token-level. Second, we train a model on the sentence-level (as a text classification task) and project the predictions onto all tokens of the sentence, which we then compare to the token-level labels of the gold standard. Third, we train models on the token-level and aggregate a sentence-level label from the predicted token labels, which we evaluate against an aggregated sentence-level gold standard. Finally, we train a model on the sentence-level and compare it against the aggregated sentence-level gold standard. In the latter two cases, we aggregate on the sentence-level as follows (see the sketch below): for each sentence, the occurrences of all label types are counted. If there is only one label type, the sentence is labeled with it. Otherwise, if the NON label occurs together with exactly one other label (PRO or CON), the NON label is dropped and the sentence is labeled with the remaining label. In all other cases, a majority vote determines the final sentence label, or, in the case of ties, the NON label is assigned.
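A direct implementation of this aggregation rule:

```python
from collections import Counter

def aggregate_sentence_label(token_labels, non_label="NON"):
    """Map token-level labels (PRO/CON/NON) to a single sentence label:
    a single label type is used directly; NON plus exactly one other type
    yields the other type; otherwise a majority vote decides, with ties
    resolved to NON."""
    counts = Counter(token_labels)
    types = set(counts)
    if len(types) == 1:
        return next(iter(types))
    if non_label in types and len(types) == 2:
        return (types - {non_label}).pop()
    top = counts.most_common()
    if top[0][1] == top[1][1]:  # tie between the two most frequent labels
        return non_label
    return top[0][0]

# e.g. aggregate_sentence_label(["NON", "PRO", "PRO", "CON", "CON", "CON"]) -> "CON"
```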

5.3 Sequence Labeling with Different Tagsets

In the sequence labeling experiments with the new ARC-8 data set, we investigate the performance of BERT (cf. Section 4.3). The base scenario uses the three labels PRO, CON and NON (TAGS=3), but we also use two extended label sets. In the first, we extend the PRO and CON labels with BI tags (TAGS=5), with B marking the beginning of a segment and I a within-segment token, resulting in the tags B-PRO, I-PRO, B-CON, I-CON and NON. The second extension uses BIES tags, where we add E for the end of a segment and S for single-token segments (TAGS=9), resulting in the tag set B-PRO, I-PRO, E-PRO, S-PRO, B-CON, I-CON, E-CON, S-CON and NON. The sketch below illustrates the conversion.
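An illustrative conversion from the three-label scheme to the extended tagsets, assuming that adjacent tokens with the same label belong to one segment:

```python
def extend_tags(labels, scheme="BIO"):
    """Convert PRO/CON/NON token labels into BIO (TAGS=5) or BIES (TAGS=9) tags."""
    out, n = [], len(labels)
    for i, lab in enumerate(labels):
        if lab == "NON":
            out.append("NON")
            continue
        starts = i == 0 or labels[i - 1] != lab
        ends = i == n - 1 or labels[i + 1] != lab
        if scheme == "BIO":
            prefix = "B-" if starts else "I-"
        elif starts and ends:
            prefix = "S-"
        elif starts:
            prefix = "B-"
        elif ends:
            prefix = "E-"
        else:
            prefix = "I-"
        out.append(prefix + lab)
    return out

# extend_tags(["PRO", "PRO", "NON", "CON"], scheme="BIES")
# -> ["B-PRO", "E-PRO", "NON", "S-CON"]
```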

5.4 Adding Topic Information

The methods described so far do not use topic information. We also test methods for ARC that make use of topic information (see the sketch below). In the first scenario, we simply add the topic information to the labels, resulting in 25 TAGS: 2 span tags (B and I) × 2 stances (PRO and CON) × 6 topics (in-domain, topics T1-T6), plus the NON label; for example, B-PRO-CLONING. In the scenario “TAGS=25++”, in addition to the TAGS=25 setup, we add the topic at the beginning of a sequence. Additionally, in the TAGS=25++ scenario, we add all sentences of the other topics as negative examples to the training set, with all labels set to NON. For example, a sentence with PRO tokens for the topic CLONING was added as is (argumentative, for CLONING) and as non-argumentative for the other five topics. Since all topics need to be known beforehand, this is done only on the in-domain datasets. This last experiment investigates whether the model is able to learn the topic-dependency of argument units.
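A sketch of how the topic-augmented labels and the TAGS=25++ negative examples could be constructed; the label naming follows the B-PRO-CLONING example above, while prepending the topic tokens with NON labels is an assumption about the preprocessing.

```python
def to_topic_tags(bio_labels, topic):
    """Append the topic to every argumentative BIO tag, e.g. B-PRO -> B-PRO-CLONING."""
    suffix = topic.upper().replace(" ", "_")
    return [lab if lab == "NON" else f"{lab}-{suffix}" for lab in bio_labels]

def tags25pp_examples(example, all_topics):
    """TAGS=25++: prepend the topic to the token sequence; add the sentence once
    per other topic as a negative example with all labels set to NON."""
    out = []
    for topic in all_topics:
        topic_tokens = topic.split()
        tokens = topic_tokens + example["tokens"]
        if topic == example["topic"]:
            labels = ["NON"] * len(topic_tokens) + to_topic_tags(example["bio_labels"], topic)
        else:
            labels = ["NON"] * len(tokens)  # negative example for a different topic
        out.append({"topic": topic, "tokens": tokens, "labels": labels})
    return out
```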

6 Evaluation

Columns per domain (In-Domain: columns 1-8; Cross-Domain: columns 9-16), each grouped as: EVAL on Token (Train on Token: Dev, Test; Train on Sentence: Dev, Test), then EVAL on Sentence (Train on Token: Dev, Test; Train on Sentence: Dev, Test).
Model TAGS
Baseline (PRO) 3 10.46 10.91 10.46 10.91 14.10 14.10 14.10 14.10 6.46 12.36 6.46 12.36 9.87 16.43 9.87 16.43
Baseline (CON) 3 10.24 10.53 10.24 10.53 13.48 14.07 13.48 14.07 15.78 11.55 15.78 11.55 20.13 15.03 20.13 15.03
Baseline (NON) 3 25.83 25.43 25.83 25.43 21.57 21.13 21.57 21.13 24.54 24.01 24.54 24.01 18.83 18.43 18.83 18.43
ArgumenText 3 25.10 23.46 25.10 23.46 32.72 29.86 32.72 29.86 19.87 24.81 19.87 24.81 26.56 31.26 26.56 31.26
BERT (frozen) 3 55.60 52.93 49.95 49.24 62.93 59.99 49.97 50.10 38.91 40.86 37.47 34.60 43.98 49.50 38.56 34.37
BERT (frozen) 5 55.38 52.23 - - 61.93 60.20 - - 38.49 40.73 - - 43.45 48.71 - -
BERT (frozen) 9 54.50 51.37 - - 61.16 60.09 - - 37.86 39.96 - - 42.82 48.54 - -
BERT (fine-tuned) 3 68.95 63.35 64.83 63.78 72.51 65.49 64.92 64.26 53.66 52.28 46.47 52.19 55.54 51.21 46.56 51.68
BERT (fine-tuned) 5 68.34 64.67 - - 70.21 65.80 - - 53.32 52.52 - - 53.07 51.98 - -
BERT (fine-tuned) 9 67.58 64.98 - - 67.19 64.27 - - 53.50 54.96 - - 52.45 51.90 - -
BERT (fine-tuned) 25 71.18 63.23 - - 72.91 64.66 - - - - - - - - - -
BERT (fine-tuned) 25++ 66.58 64.19 - - 65.72 64.21 - - - - - - - - - -
Table 2: F1 scores for all methods; training was done with the TAGS given in the table, while evaluation was always on three labels (PRO, CON, NON), with aggregation if necessary; missing values (-) correspond to experiment setups that were not possible or not applicable and are hence omitted.

In this section we evaluate the results and analyze the errors of the models in the different ARC experiments. All reported results are macro F1 scores unless otherwise stated. For the computation of the scores we used scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html), concatenating the true values and the predictions over all sentences per set.
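Concretely, the evaluation boils down to a call like the following (the function wrapper is illustrative):

```python
from itertools import chain
from sklearn.metrics import precision_recall_fscore_support

def macro_f1(gold_sentences, pred_sentences, labels=("PRO", "CON", "NON")):
    """gold_sentences / pred_sentences: lists of per-sentence token label lists.
    All sentences of a set are concatenated before computing the macro F1."""
    y_true = list(chain.from_iterable(gold_sentences))
    y_pred = list(chain.from_iterable(pred_sentences))
    _, _, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(labels), average="macro", zero_division=0)
    return f1
```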

Model TAGS Train time (sec./it.), token-level (T) Train time (sec./it.), sentence-level (S)
BERT (base, cased), frozen 3 37 18
BERT (base, cased), fine-tuned 3 39 29
Table 3: BERT average runtimes with training on token-level (T) and training on sentence-level (S), with 32 sentences per batch on a single GPU with 11 GB memory.
Model TAGS Dev (In-Domain) Test (In-Domain) Dev (Cross-Domain) Test (Cross-Domain)
BERT (base, cased), frozen 3 55.60 52.93 38.91 40.86
BERT (base, cased), frozen 5 34.25 32.45 23.20 24.88
BERT (base, cased), frozen 9 18.93 17.92 12.82 13.76
BERT (base, cased), fine-tuned 3 68.95 63.35 53.66 52.28
BERT (base, cased), fine-tuned 5 58.32 54.23 41.81 42.65
BERT (base, cased), fine-tuned 9 39.66 36.01 28.35 30.34
Table 4: Sequence labeling with BERT for 3, 5 and 9 labels.
Model TAGS Dev (In-Domain) Test (In-Domain)
BERT (base, cased) 25 51.38 41.73
BERT (base, cased) 25++ 45.50 42.83
Table 5: BERT Experiments with added topic information; for 25, the topic information is only in the labels; for 25++, the topic information is in the labels, negative examples are added and the topic information is provided at the beginning of a sequence.

6.1 Results

We present the results in the following manner: Table 2 shows experiments across domains and for different tagsets in the training step, always evaluating on three labels (PRO, CON and NON). In Table 3 we compare the runtimes of the token- and sentence-level training. Finally, we show the results of the evaluation on the same tags as we used for the training in Table 4 and Table 5.

For the results in Table 2, we see that the baseline for the NON label (the most frequent label) and the “ArgumenText” model (ArgumenText Classify API) are clearly worse than all BERT-based models. This shows that we clearly improve upon the pipeline that we used to select the data.

Token- vs. Sentence-Level

The four experiments on token- and sentence-level for both the in- and cross-domain setups (Table 2) work significantly better with a BERT model fine-tuned for the ARC task, similar to the findings of Peters et al. (2019) for many other NLP tasks. Furthermore, training on the token-level always leads to better results, which was one of our motivations and objectives for this task and the dataset. For an evaluation on the token-level, a model trained on the token-level with TAGS=9 works best, while TAGS=5 works best for an evaluation on the sentence-level. However, in terms of average runtime per iteration (Table 3), sentence-level models are between 25% and 50% faster than token-level models.

Sequence Labeling Across Domains

The results for the evaluation on three labels are in Table 2; the best F1 scores for in-domain on the token-level (64.26) and the sentence-level (65.80) are higher than the corresponding scores for cross-domain (54.96 and 51.98, respectively). This validates our assumption that the ARC problem depends on the topic at hand and that cross-topic (cross-domain) transfer is more difficult to learn.

Sequence Labeling with Different Tagsets

The results in Table 4 are from evaluations of models that were trained on the corresponding TAGS (3, 5 and 9) and are again better for the in-domain setting and the fine-tuned model (63.35, 54.23 and 36.01, respectively). Results for the larger tagsets are clearly lower, which is to be expected given the increased complexity of the task and the low number of training examples for some of the tags.

Adding Topic Information

Adding the topic information to the labels or before a sequence generally does not help when evaluating on three tags (results for TAGS=25 and TAGS=25++ in Table 2); we therefore suggest more complex models that can better exploit the provided topic information. However, the results in Table 5 show that the additional information about the topic and from the negative examples helps to train the model (42.83): the model is able to learn the topic relevance of a sentence for the six topics in the in-domain sets.

6.2 Error Analysis

We classified errors in three ways: (i) the span is not correctly recognized, (ii) the stance is not correctly classified, or (iii) the topic is not correctly classified.

Span

The errors made by the models regarding the span can be divided into two further cases: (a) the beginning and/or end of a segment is incorrectly recognized, and/or (b) the segment is broken into several segments or merged into fewer segments, such that tokens inside or outside an actual argument unit are misclassified as non-argumentative. To quantify this, we used the predictions of the best token-level model with TAGS=9 in both the in-domain and cross-domain settings and analyzed the average length of segments as well as the total count of segments for the true and predicted labels. For the average length of segments (in tokens), we obtained 17.66 for true and 13.73 for predicted labels in-domain, and 16.35 for true and 13.14 for predicted labels cross-domain, showing that predicted segments are on average four tokens shorter in-domain and three tokens shorter cross-domain than the true segments. Regarding the count of segments, there are 297 more segments in the predicted labels for in-domain and 372 more segments in the predicted labels for cross-domain than there are in the gold standard.
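These segment statistics can be computed from BIO/BIES-style tag sequences roughly as follows (an illustrative helper, not the exact analysis script):

```python
def segments(tag_sequence):
    """Extract (start, end) spans of argumentative segments from a tag sequence
    such as ["B-PRO", "I-PRO", "NON", "S-CON"]; NON tokens separate segments."""
    spans, start = [], None
    for i, tag in enumerate(tag_sequence):
        if tag == "NON":
            if start is not None:
                spans.append((start, i))
                start = None
        elif tag.startswith(("B-", "S-")) or start is None:
            if start is not None:  # a new segment starts directly after another
                spans.append((start, i))
            start = i
    if start is not None:
        spans.append((start, len(tag_sequence)))
    return spans

def segment_stats(tagged_sentences):
    """Average segment length (in tokens) and total segment count over a dataset."""
    lengths = [end - beg for sent in tagged_sentences for beg, end in segments(sent)]
    return (sum(lengths) / len(lengths) if lengths else 0.0), len(lengths)
```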

Stance

The complete misclassification of the stance occurred for the best token-level model (TAGS=9) in 7.67% of the test sentences in-domain and in 16.50% of the test sentences cross-domain. A frequent error is that apparently stance-specific words are assigned a label that is not consistent with the overall segment stance.

Topic

We looked for errors where the topic-independent tag was correct (e.g., B-CON, the beginning of a con argument), but the topic was incorrect. This type of error occurred only four times on the test set for TAGS=25++, affecting some of the tokens but never a full sequence. The model, for example, misclassified the actual topic nuclear energy as the topic abortion, or confused the actual topic death penalty with the topic minimum wage. A possible reason is topic-specific vocabulary that the model learned, although none of the affected tokens are words one would associate with the misclassified topics.

7 Conclusion

We introduced a new task, argument unit recognition and classification (ARC), and release the benchmark ARC-8 for this task. We demonstrated that ARC-8 has good quality in terms of annotator agreement: the required annotations can be crowdsourced using specific data selection and filtering methods as well as a slot filling approach. We cast ARC as a sequence labeling task and established a state of the art for ARC-8, using baseline as well as advanced methods for sequence labeling. In the future, we plan to find better models for this task, especially models with the ability to better incorporate the topic information in the learning process.

Acknowledgments

We gratefully acknowledge support by Deutsche Forschungsgemeinschaft (DFG) (SPP-1999 Robust Argumentation Machines (RATIO), SCHU2246/13), as well as by the German Federal Ministry of Education and Research (BMBF) under the promotional reference 03VP02540 (ArgumenText).

References