Argument mining (Peldszus and Stede, 2013) has gained substantial attention from researchers in the NLP community, mostly due to its complexity as a task requiring sophisticated reasoning, but also due to the availability of high-quality resources. Those resources include discourse-level closed-domain datasets for political, educational or legal applications (Walker et al., 2012; Stab and Gurevych, 2014; Wyner et al., 2010), as well as open-domain datasets for topic-dependent argument retrieval from heterogeneous sources (Stab et al., 2018b; Shnarch et al., 2018). While discourse-level argument mining aims to parse argumentative structures in a fine-grained manner within single documents (and thus mostly within single domains or applications), topic-dependent argument retrieval focuses on argumentative constructs such as claims or evidence with regard to a given topic, which can be found in very different types of discourse. Argument retrieval typically frames the argumentative unit (argument, claim, evidence, etc.) at the sentence level, i.e., it seeks to detect sentences that are relevant supporting (PRO) or opposing (CON) arguments, as in the examples given in Fig. 1.
In this work, we challenge the assumption that arguments should be detected on the sentence level. This is partly justified by the difficulty of “unitizing”, i.e., of segmenting a sentence into meaningful units for argumentation tasks (Stab et al., 2018b; Miller et al., 2019). We show that re-framing the argument retrieval task as Argument unit Recognition and Classification (ARC), i.e., as recognition and classification of spans within a sentence on the token level, is feasible, not just in terms of the reliability of recognizing argumentative spans, but also in terms of the scalability of generating training data.
Framing argument retrieval as in ARC, i.e., on the token-level, has several advantages:
It prevents merging otherwise separate arguments into a single argument (e.g., for the topic death penalty in Fig. 1).
It can handle two-sided argumentation adequately (e.g., for the topic gun control in Fig. 1).
It can be framed as a sequence labeling task, which is a common scenario for many NLP applications with many available architectures for experimentation (Eger et al., 2017).
To assess the feasibility of ARC, we address the following questions. First, we discuss how to select suitable data for annotating arguments on the token level. Second, we analyze whether the annotation of arguments on the token level can be conducted reliably, both with trained experts and with untrained workers in a crowdsourcing setup. Third, we test basic as well as state-of-the-art sequence labeling methods on ARC.
A focus of our work is robustness. (i) The assumption that arguments correspond to complete sentences makes argument mining brittle: when the assumption does not hold, sentence-level argument mining makes mistakes. In addition, sentence identification is error-prone for noisy text (e.g., text crawled from the web), resulting in noisy non-sentence units being equated with arguments. (ii) The properties of argument topics vary considerably from topic to topic. An ARC method trained on one topic will not necessarily perform well on another. We set up ARC-8 to make it easy to test the robustness of argument mining by including a cross-domain split, and we demonstrate that cross-domain generalization is challenging for ARC-8.
2 Related Work
Our work follows the established line of work on argument mining in the NLP community, which can loosely be divided into approaches detecting and classifying arguments on the discourse level (Palau and Moens, 2009; Stab and Gurevych, 2014; Eger et al., 2017) and approaches focusing on topic-dependent argument retrieval (Levy et al., 2014; Wachsmuth et al., 2017; Hua and Wang, 2017; Stab et al., 2018b). Our work is in line with the latter: we model arguments as self-contained pieces of information which can be verified as relevant arguments for a given topic with no or minimal surrounding context.
As one of the main contributions of this work, we show how to create training data for token-level argument mining with the help of crowdsourcing. Stab et al. (2018b) and Shnarch et al. (2018) annotated topic-dependent arguments on the sentence level using crowdsourcing; the reported Fleiss’ κ agreement scores were 0.45 for crowd workers (Shnarch et al., 2018) and 0.72 for experts (Stab et al., 2018b). Miller et al. (2019) present a multi-step approach to crowdsourcing more complex argument structures in customer reviews. Like us, they annotate arguments on the token level; however, they annotate argument components from the discourse-level perspective. Their inter-annotator agreement (roughly between 0.4 and 0.5) is low, demonstrating the difficulty of this task. In this work, to capture argument spans more precisely, we test the validity of arguments using a slot filling approach. Reisert et al. (2018) also use argument templates, i.e., slots, to determine arguments.
Close in spirit to this work, Ajjour et al. (2017) compare various argumentative unit segmentation approaches on the token level across three corpora. They use a feature-based approach and various architectures for segmentation and find that BiLSTMs work best on average. However, in contrast to this work, they study argumentation on the discourse level, i.e., they do not consider topic dependency and only distinguish arguments from non-arguments (no argumentative types or relations like PRO and CON). Eger et al. (2017) model discourse-level argument segmentation, identification (claims, premises and major claims) and relation extraction as sequence tagging, dependency parsing and entity-relation extraction. For a dataset of student essays (Stab and Gurevych, 2014), they find that sequence tagging and an entity-relation extraction approach (Miwa and Bansal, 2016) work best. In particular, for the unit segmentation task (vanilla BIO), they find that state-of-the-art sequence tagging approaches can perform as well as or even better than human experts. Stab and Gurevych (2017) propose a CRF-based approach with manually defined features for the unit segmentation task on student essays (Stab and Gurevych, 2014) and also achieve performance close to human experts.
3 Corpus Creation
Collecting annotations on the token level is challenging. First, the unit of annotation needs to be clearly defined. This is straightforward for tasks with short spans (sequences of words) such as named entities, but much harder for longer spans, as in the case of argument units. Second, labels from multiple annotators need to be merged into a single gold standard. (One could also learn from “soft” labels, i.e., a distribution created from the votes of multiple annotators. However, this does not solve the problem that some annotators deliver low-quality work and their votes should be outvoted by a (hopefully) higher-quality majority of annotators.) Merging is also more difficult for longer sequences because simple majority voting over individual words will likely create invalid (e.g., disrupted or grammatically incorrect) spans.
To address these challenges, we carefully designed the source selection, sampling and annotation of the input for ARC-8, our novel argument unit dataset. We first describe how we retrieved and processed data from a large web crawl. Next, we outline the sentence sampling process, which accounts for a balanced selection of both (non-)argument types and source documents. Finally, we describe how we crowdsource annotations of argument units within sentences in a scalable way.
3.1 Data Source
We used the February 2016 Common Crawl archive (http://commoncrawl.org/2016/02/february-2016-crawl-archive-now-available/), which was indexed with Elasticsearch (https://www.elastic.co/products/elasticsearch) following the description in Stab et al. (2018a). For the sake of comparability, we adopt the eight topics of Stab et al. (2018b) (cf. Table 1). The topics are general enough to have good coverage in Common Crawl. They are also of a controversial nature and hence a potentially good choice for argument mining, with an expected broad set of supporting and opposing arguments.
3.2 Retrieval Pipeline
For document retrieval, we queried the indexed data for the topics of Stab et al. (2018b) and collected the top results per topic, ordered by their Elasticsearch document score (a higher score indicates higher relevance for the topic). Each document was checked for its corresponding WARC file at the Common Crawl Index (http://index.commoncrawl.org/CC-MAIN-2016-07). We then downloaded and parsed the original HTML document for the next steps of our pipeline; this ensures reproducibility. Following this, we used justext (http://corpus.tools/wiki/Justext) to remove HTML boilerplate. The resulting document was segmented into sentences, and each sentence into single tokens, using spaCy (https://spacy.io/). We only consider sentences whose token count lies within a fixed range.
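The sentence-level filtering step can be sketched as follows. This is a hypothetical, dependency-free sketch: the naive regex splitter and tokenizer stand in for spaCy, and the token-count bounds `min_tokens`/`max_tokens` are placeholders, since the excerpt does not state the exact range used.

```python
import re

def split_sentences(text: str) -> list:
    # naive sentence splitter standing in for spaCy's sentencizer
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence: str) -> list:
    # crude word/punctuation tokenizer standing in for spaCy's tokenizer
    return re.findall(r"\w+|[^\w\s]", sentence)

def keep_sentence(sentence: str, min_tokens: int = 3, max_tokens: int = 60) -> bool:
    # keep only sentences whose token count lies within a fixed range;
    # the bounds here are hypothetical, not the paper's actual values
    return min_tokens <= len(tokenize(sentence)) <= max_tokens
```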
3.3 Sentence Sampling
The sentences were pre-classified with a sentence-level argument mining model following Stab et al. (2018b), available via the ArgumenText Classify API (https://api.argumentsearch.com/en/doc). For each sentence, the API returns (i) an argument confidence score, (ii) the stance on the sentence level (PRO or CON) and (iii) a stance confidence score; sentences below a minimum argument confidence threshold were discarded. This information was used together with the document score to rank sentences for selection in the following crowd annotation process. First, all three scores (document, argument and stance confidence) were converted to ranks over the available sentences, and second, the ranks were summed up to create a rank score for each sentence:

rank(s) = r_d(s) + r_a(s) + r_s(s)    (1)

where r_d, r_a and r_s are the ranks of the document, argument and stance confidence scores, respectively.
The ranked sentences were divided by topic and pre-classified sentence-level stance, and ordered by rank score (where a lower rank score indicates a better candidate). We then went down the ranked list and selected each sentence with a fixed probability until the target size per stance and topic was reached; otherwise, we did additional passes through the list. Table 1 gives data set creation statistics.
|T3||marijuana legalization||490||472||45,644||6,351||1,000||630||759 (+20.48%)||370|
|T4||minimum wage||494||479||43,128||8,290||1,000||630||760 (+20.63%)||370|
|T5||nuclear energy||491||470||43,576||5,056||1,000||623||726 (+16.53%)||377|
|T6||death penalty||491||484||32,253||6,079||1,000||598||711 (+18.90%)||402|
|T7||gun control||497||479||38,443||4,576||1,000||529||624 (+17.96%)||471|
|T8||school uniforms||495||475||40,937||3,526||1,000||713||891 (+24.96%)||287|
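The rank-sum scoring of Eq. 1 can be sketched as follows; a minimal implementation under the assumption that all three scores are "higher is better" and that rank 0 marks the best candidate.

```python
def ranks(scores, higher_is_better=True):
    """Map each score to its rank (0 = best)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    r = [0] * len(scores)
    for position, idx in enumerate(order):
        r[idx] = position
    return r

def rank_scores(doc_scores, arg_confs, stance_confs):
    """Eq. 1: sum the document, argument and stance-confidence ranks
    per sentence; a lower total marks a better candidate."""
    r_d = ranks(doc_scores)
    r_a = ranks(arg_confs)
    r_s = ranks(stance_confs)
    return [d + a + s for d, a, s in zip(r_d, r_a, r_s)]
```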
3.4 Crowd Annotations
The goal of this work was to come up with a scalable approach to annotate argument units on the token level. Given that arguments need to be annotated with regard to a specific topic, large amounts of (cross-topic) training data need to be created. As previous work on topic-dependent argument mining has shown (Shnarch et al., 2018; Stab et al., 2018b), crowdsourcing can be used to obtain reliable annotations for argument mining datasets. However, as outlined above, token-level annotation significantly increases the difficulty of the annotation task, so it was unclear whether agreement among untrained crowd workers would be sufficiently high.
We use the agreement measures of Krippendorff et al. (2016) in this work. They are designed for annotation tasks that involve unitizing textual continua – i.e., segmenting continuous text into meaningful subunits – and measure chance-corrected agreement in those tasks. This setting is also a good fit for argument spans within a sentence: typically these spans are long, and the context is a single sentence that may contain any type and any number of arguments. Krippendorff et al. (2016) define a family of uα-reliability coefficients that improve upon several weaknesses of previous measures. From these, we chose the cuα coefficient, which also takes into account agreement on “blanks” (non-arguments in our case). The rationale behind this was that ignoring agreement on sentences without any argument spans would over-proportionally penalize disagreement in sentences that contain arguments, while ignoring agreement in sentences without arguments.
To determine agreement, we initially carried out an in-house expert study with three graduate employees (who were trained on the task beforehand) on 160 randomly sampled sentences (10 per topic and stance) from the overall data. In the first round, we did not impose any restrictions on the span of words to be selected, other than that the selected span should be the shortest self-contained span that forms an argument. This resulted in unsatisfactory agreement (cuα = 0.51, averaged over topics), one reason being inconsistency in selecting argument spans (the median length of arguments ranged from nine to 16 words among the three experts). In a second round, we therefore restricted the spans that could be selected by applying a slot filling approach that enforces valid argument spans matching a template: “⟨topic⟩ should be supported/opposed, because ⟨argument⟩”. The guidelines specify that the resulting sentence has to be a grammatically sound statement. Although this choice unsurprisingly increased the length of spans and reduced the total number of arguments selected, it increased the consistency of spans substantially (the min./max. median length was now between 15 and 17). Furthermore, the agreement between the three experts rose to cuα = 0.61 (averaged over topics). Compared to other studies on token-level argument mining (Eckle-Kohler et al., 2015; Li et al., 2017; Stab and Gurevych, 2014), this score is in an acceptable range and we deemed it sufficient to proceed with crowdsourcing.
In our crowdsourcing setup, workers could select one or multiple spans, where each span's permissible length ranges from one token to the entire sentence. Workers had to either choose at least one argument span and its stance (supporting/opposing), or indicate that the sentence did not contain a valid argument and instead solve a simple math problem. We introduced further quality control measures in the form of a qualification test and periodic attention checks. (Workers had to be located in the US, CA, AU, NZ or GB, with an acceptance rate of 95% or higher. Payment was $0.42 per HIT, corresponding to US federal minimum wage ($7.25/hour). The annotators in the expert study were salaried research staff.) On an initial batch of 160 sentences, we collected votes from nine workers. To determine the optimal number of workers for the final study, we applied majority voting on the token level (ties broken as non-arguments) to both the expert study and the workers from the initial crowd study. We artificially reduced the number of workers (1–9) and calculated the percentage overlap averaged across all worker combinations (for worker numbers lower than nine). Whereas the overlap was highest at nine votes, it dropped only slightly at five votes (and decreased more significantly for fewer votes). We deemed five votes an acceptable compromise between quality and cost. The agreement with experts in the five-worker setup is cuα = 0.71, which is substantial (Landis and Koch, 1977).
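The token-level majority vote with ties broken as non-arguments can be sketched as follows (a minimal sketch; `votes` holds one label sequence per annotator over the same sentence):

```python
from collections import Counter

def token_majority(votes, non_label="NON"):
    """Per-token majority vote over multiple annotations of one sentence.
    Ties between the top labels are resolved as non-argumentative."""
    n_tokens = len(votes[0])
    merged = []
    for t in range(n_tokens):
        counts = Counter(v[t] for v in votes).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            merged.append(non_label)  # tie -> NON
        else:
            merged.append(counts[0][0])
    return merged
```

Note that, as discussed in Section 3, such per-token voting can produce disrupted spans; the paper's final gold standard therefore uses Bayesian aggregation instead.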
The final gold standard labels for the 8000 sampled sentences were determined using a variant of Bayesian Classifier Combination (Kim and Ghahramani, 2012), referred to as IBCC, within the modular framework of Simpson et al. (2018) for Bayesian aggregation of sequence labels. This method has been shown to yield results superior to majority voting or MACE (Hovy et al., 2013).
3.5 Dataset Splits
We create two different dataset splits: (i) an in-domain split, which lets us evaluate how models perform on known vocabulary and data distributions, and (ii) a cross-domain split, which lets us evaluate how well a model generalizes to unseen topics and to distributions different from the training set. In the cross-domain setup, we assigned topics T1–T5 to the train set, topic T6 to the development set, and topics T7 and T8 to the test set. For the in-domain setup, we excluded topics T7 and T8 (the cross-domain test set) and used the first 70% of topics T1–T6 for train, the next 10% for dev and the remaining 20% for test. The samples from the in-domain test set were also excluded from the cross-domain train and development sets. As a result, there are 4000 samples in train, 800 in dev and 2000 in test for the cross-domain split; and 4200 samples in train, 600 in dev and 1200 in test for the in-domain split. We work with two different splits so as to guarantee that train/dev sets (in-domain or cross-domain) do not overlap with test sets (in-domain or cross-domain). The assignment of sentences to the two splits is released as part of ARC-8.
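The split logic can be sketched as follows. The 70/10/20 fractions are inferred from the reported set sizes (4200/600/1200 of 6000 in-domain sentences); whether the split is applied per topic or over the pooled sentences is an assumption here (per topic yields the same totals for 1000 sentences per topic).

```python
# cross-domain split: fixed topic assignment as described in the text
CROSS_DOMAIN_TOPICS = {
    "train": ["T1", "T2", "T3", "T4", "T5"],
    "dev":   ["T6"],
    "test":  ["T7", "T8"],
}

def in_domain_split(sentences_by_topic, train_frac=0.7, dev_frac=0.1):
    """Split topics T1-T6 into train/dev/test; fractions inferred, not stated."""
    splits = {"train": [], "dev": [], "test": []}
    for topic in ["T1", "T2", "T3", "T4", "T5", "T6"]:
        sents = sentences_by_topic[topic]
        a = round(len(sents) * train_frac)
        b = a + round(len(sents) * dev_frac)
        splits["train"].extend(sents[:a])
        splits["dev"].extend(sents[a:b])
        splits["test"].extend(sents[b:])
    return splits
```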
3.6 Dataset Statistics
The resulting data set, ARC-8 (we will make the dataset available at www.ukp.tu-darmstadt.de/data), consists of 8000 annotated sentences, 3500 (43.75%) of which are non-argumentative. The 4500 argumentative sentences divide into 1951 (43.36%) sentences with a single PRO argument, 1799 (39.98%) sentences with a single CON argument, and 750 (16.67%) sentences with combinations of supporting (PRO) and opposing (CON) arguments, with up to five argument segments in a sentence. Thus, the token-level annotation leads to a higher (+18.73%) total count of arguments, 5343, compared to 4500 with a sentence-level approach. If we propagate the label of a sentence to all its tokens, then 100% of the tokens of argumentative sentences are argumentative. This ratio drops to 69.94% in our token-level setup, reducing the number of non-argumentative tokens that would otherwise be incorrectly labeled as argumentative in a sentence.
4 Models

We model ARC as a sequence labeling task. The input is a topic t and a sentence s = (w_1, …, w_n). The goal is to select spans of words, each corresponding to an argument unit. Following Stab et al. (2018b), we distinguish between PRO and CON arguments (“⟨topic⟩ should be supported/opposed, because ⟨argument⟩”). To measure the difficulty of ARC, we estimate the performance of simple baselines as well as current NLP models that achieve state-of-the-art results on other sequence labeling data sets (Devlin et al., 2018).
4.1 1-class Baselines
The 1-class baseline labels the entire data set (i.e., every token of every sentence) with one of the three labels PRO, CON or NON.
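The baseline is trivial to implement; a minimal sketch:

```python
def one_class_baseline(token_sequences, label="NON"):
    """Predict a single class for every token of every sentence."""
    return [[label] * len(tokens) for tokens in token_sequences]
```

Running it once per class (PRO, CON, NON) gives the expected lower bounds for more complex models.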
4.2 Sentence-Level Baselines
As the sentence-level baseline, we used labels produced by the previously mentioned ArgumenText Classify API (Stab et al., 2018a). Since it is a sentence-level classifier, we projected the sentence-level prediction onto all tokens of a sequence to enable token-level evaluation.
4.3 BERT

Furthermore, we used the BERT base (cased) model (Devlin et al., 2018; https://github.com/huggingface/pytorch-pretrained-BERT), a recent state-of-the-art model that has achieved impressive results on many tasks including sequence labeling. For this model, we considered two scenarios. First, we kept the pre-trained parameters fixed and used the model as a feature extractor (frozen). Second, we fine-tuned the parameters for the ARC task and the corresponding tagsets.
5 Experiments

In total, we run three different types of experiments on the ARC-8 dataset with the previously introduced models, which we describe in this section. Additionally, we experimented with different tagsets for the ARC task. All experiments were conducted on a single GPU with 11 GB of memory.
5.1 1-class Baselines
For the simple baselines, we applied 1-class sequence tagging on the corresponding development and test sets for the in-domain and cross-domain setups. This allowed us to estimate the expected lower bounds for more complex models.
5.2 Token- vs. Sentence-Level
To further investigate the performance of a token-level model vs. a sentence-level model, we run four different training procedures and evaluate the results on both the token and the sentence level. First, we train models on the token level (sequence labeling) and also evaluate on the token level. Second, we train a model on the sentence level (as a text classification task) and project the predictions to all tokens of the sentence, which we then compare to the token-level labels of the gold standard. Third, we train models on the token level and aggregate a sentence-level label from the predicted labels, which we evaluate against an aggregated sentence-level gold standard. Finally, we train a model on the sentence level and compare it against the aggregated sentence-level gold standard. In the latter two cases, we aggregate on the sentence level as follows: for each sentence, the occurrences of all label types are counted. If there is only one label type, the sentence is labeled with it. If the NON label occurs with only one other label (PRO or CON), the NON label is omitted and the sentence is labeled with the remaining label. In all other cases, a majority vote determines the final sentence label or, in the case of ties, the NON label is assigned.
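The aggregation rule above can be written down directly (a minimal sketch of the described rule, not the paper's actual code):

```python
from collections import Counter

def aggregate_sentence_label(token_labels, non="NON"):
    """Aggregate per-token labels to one sentence label, as described above."""
    counts = Counter(token_labels)
    labels = set(counts)
    if len(labels) == 1:                      # only one label type
        return token_labels[0]
    if labels in ({non, "PRO"}, {non, "CON"}):  # NON + exactly one other label
        (other,) = labels - {non}
        return other
    top = counts.most_common()                # otherwise: majority vote
    if len(top) > 1 and top[0][1] == top[1][1]:
        return non                            # ties -> NON
    return top[0][0]
```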
5.3 Sequence Labeling with Different Tagsets
In the sequence labeling experiments on the new ARC-8 data set, we investigate the performance of BERT (cf. Section 4.3). The base scenario uses the three labels PRO, CON and NON (TAGS=3), but we also use two extended label sets. In the first, we extend the PRO and CON labels with BI tags (TAGS=5), with B marking the beginning of a segment and I a within-segment token, resulting in the tags B-PRO, I-PRO, B-CON, I-CON and NON. The second extension uses BIES tags, where we add E for the end of a segment and S for single-unit segments (TAGS=9), resulting in the tag set B-PRO, I-PRO, E-PRO, S-PRO, B-CON, I-CON, E-CON, S-CON and NON.
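The extended tagsets can be derived mechanically from the 3-label annotation; a sketch (assuming a segment is a maximal run of identical non-NON labels, so two adjacent same-stance segments would be merged):

```python
def to_bio(labels, non="NON"):
    """TAGS=3 -> TAGS=5: prefix segment-initial tokens with B-, others with I-."""
    out, prev = [], non
    for lab in labels:
        if lab == non:
            out.append(non)
        elif lab != prev:
            out.append("B-" + lab)
        else:
            out.append("I-" + lab)
        prev = lab
    return out

def to_bies(labels, non="NON"):
    """TAGS=3 -> TAGS=9: additionally mark segment ends (E-) and singletons (S-)."""
    out = to_bio(labels, non)
    for i, tag in enumerate(out):
        if tag == non:
            continue
        last = i == len(labels) - 1 or labels[i + 1] != labels[i]
        if tag.startswith("B-") and last:
            out[i] = "S-" + labels[i]
        elif tag.startswith("I-") and last:
            out[i] = "E-" + labels[i]
    return out
```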
5.4 Adding Topic Information
The methods described so far do not use topic information. We also test methods for ARC that make use of it. In the first scenario, we simply add the topic information to the labels, resulting in 25 TAGS: 2 span tags (B and I) × 2 stances (PRO and CON) × 6 topics (in-domain, topics T1–T6) = 24 tags, plus the NON label; for example, B-PRO-CLONING. In the scenario “TAGS=25++”, in addition to the TAGS=25 setup, we prepend the topic to the beginning of each sequence. Additionally, in the TAGS=25++ scenario, we add all sentences of the other topics as negative examples to the training set, with all labels set to NON. For example, a sentence with PRO tokens for the topic CLONING was added as is (argumentative, for CLONING) and as non-argumentative for each of the other five topics. Since all topics need to be known beforehand, this is done only on the in-domain datasets. This last experiment investigates whether the model is able to learn the topic dependency of argument units.
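The label extension and the negative-example augmentation can be sketched as follows (a sketch of the described data preparation; the tuple layout and topic names are illustrative assumptions):

```python
def topic_tags(labels, topic, non="NON"):
    """TAGS=25: extend span/stance tags with the topic, e.g. B-PRO -> B-PRO-CLONING."""
    return [lab if lab == non else f"{lab}-{topic}" for lab in labels]

def add_negative_examples(dataset, topics, non="NON"):
    """TAGS=25++: re-add every sentence as an all-NON example
    for each of the other (in-domain) topics."""
    out = list(dataset)  # dataset: list of (topic, tokens, labels)
    for topic, tokens, labels in dataset:
        for other in topics:
            if other != topic:
                out.append((other, tokens, [non] * len(tokens)))
    return out
```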
6 Results and Analysis

In this section, we evaluate the results and analyze the errors of the models in the different ARC experiments. All reported results are macro F1 scores unless otherwise stated. To compute the scores we used scikit-learn's precision_recall_fscore_support (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html), concatenating the true values and the predictions over all sentences per set.
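With scikit-learn, this corresponds to calling precision_recall_fscore_support with average='macro' on the concatenated token labels; a dependency-free sketch of the same computation:

```python
def macro_f1(true_seqs, pred_seqs):
    """Concatenate all sentences, compute per-class F1, average (macro)."""
    y_true = [lab for seq in true_seqs for lab in seq]
    y_pred = [lab for seq in pred_seqs for lab in seq]
    f1s = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```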
|Model||TAGS (token-level)||Train time (sec./it.)||TAGS (sentence-level)||Train time (sec./it.)|
|BERT (base, cased)||3 (T)||37||3 (S)||18|
|BERT (base, cased)||3 (T)||39||3 (S)||29|
|Model||TAGS||Dev (In-Domain)||Test (In-Domain)||Dev (Cross-Domain)||Test (Cross-Domain)|
|BERT (base, cased), frozen||3||55.60||52.93||38.91||40.86|
|BERT (base, cased), frozen||5||34.25||32.45||23.20||24.88|
|BERT (base, cased), frozen||9||18.93||17.92||12.82||13.76|
|BERT (base, cased), fine-tuned||3||68.95||63.35||53.66||52.28|
|BERT (base, cased), fine-tuned||5||58.32||54.23||41.81||42.65|
|BERT (base, cased), fine-tuned||9||39.66||36.01||28.35||30.34|
|Model||TAGS||Dev (In-Domain)||Test (In-Domain)|
|BERT (base, cased)||25||51.38||41.73|
|BERT (base, cased)||25++||45.50||42.83|
6.1 Results

We present the results in the following manner: Table 2 shows experiments across domains and for different tagsets in the training step, always evaluating on three labels (PRO, CON and NON). In Table 3 we compare the runtimes of token- and sentence-level training. Finally, Table 4 and Table 5 show the results of the evaluation on the same tags as used for training.
From the results in Table 2 we see that the baseline for the NON label (the most frequent label) and the “ArgumenText” model (the ArgumenText Classify API) are clearly worse than all BERT-based models. This shows that we clearly improve upon the pipeline that we used to select the data.
Token- vs. Sentence-Level
The four experiments on the token and sentence level for both in- and cross-domain setups (Table 2) work significantly better with a fine-tuned BERT model, similar to what Peters et al. (2019) observe for many other NLP tasks. Furthermore, training on the token level always leads to better results, which was one of our motivations for this task and dataset. For evaluation on the token level, a model trained on the token level with TAGS=9 works best, while TAGS=5 works best for evaluation on the sentence level. However, sentence-level models train on average between 25% and 50% faster per iteration than token-level models (Table 3).
Sequence Labeling Across Domains
Table 2 reports the evaluation on three labels; the best F1 scores for in-domain on the token level (64.26) and sentence level (65.80) are higher than the corresponding cross-domain scores (54.96 and 51.98, respectively). This validates our assumption that the ARC problem depends on the topic at hand and that cross-topic (cross-domain) transfer is harder to learn.
Sequence Labeling with Different Tagsets
The results in Table 4 are from evaluations of models that were trained on the corresponding TAGS 3, 5 and 9; they are again better in-domain and with a fine-tuned model (63.35, 54.23 and 36.01, respectively). Results for larger tagsets are clearly lower, which is to be expected from the increased complexity of the task and the low number of training examples for some of the tags.
Adding Topic Information
Adding the topic information to the labels or before a sequence generally does not help when evaluating on three tags (results for 25 and 25++ TAGS in Table 2); we therefore suggest investigating more complex models that can better exploit the provided topic information. However, the results in Table 5 show that the additional information about the topic and from the negative examples (42.83) helps to train the model: the model is able to learn the topic relevance of a sentence for the six topics in the in-domain sets.
6.2 Error Analysis
We classified errors into three types: (i) the span is not correctly recognized, (ii) the stance is not correctly classified, or (iii) the topic is not correctly classified.
Errors on the span can be divided into two further cases: (a) the beginning and/or end of a segment is recognized incorrectly, and/or (b) the segment is broken into several segments or merged into fewer segments, such that tokens inside or outside an actual argument unit are misclassified as non-argumentative. To quantify this, we used the predictions of the best token-level model with TAGS=9 in both in-domain and cross-domain settings and analyzed the average length as well as the total count of segments for the true and predicted labels. The average segment length (in tokens) is 17.66 for true vs. 13.73 for predicted labels in-domain, and 16.35 for true vs. 13.14 for predicted labels cross-domain; predicted segments are thus on average four tokens (in-domain) and three tokens (cross-domain) shorter than the true segments. Regarding the count of segments, the predictions contain 297 more segments in-domain and 372 more segments cross-domain than the gold standard.
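These segment statistics can be computed directly from 3-label token sequences; a sketch, assuming a segment is a maximal run of identical non-NON labels:

```python
def segments(labels, non="NON"):
    """Return (start, end, label) for each maximal run of one non-NON label."""
    segs, start = [], None
    for i, lab in enumerate(labels + [non]):  # sentinel closes a trailing segment
        if start is not None and (lab == non or lab != labels[start]):
            segs.append((start, i, labels[start]))
            start = None
        if start is None and lab != non and i < len(labels):
            start = i
    return segs

def avg_segment_length(all_labels, non="NON"):
    """Average segment length (in tokens) and total segment count over a corpus."""
    segs = [s for labels in all_labels for s in segments(labels, non)]
    return sum(end - begin for begin, end, _ in segs) / len(segs), len(segs)
```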
A complete misclassification of the stance occurred for the best token-level model (TAGS=9) in 7.67% of the test sentences in-domain and in 16.50% of the test sentences cross-domain. A frequent error is that apparently stance-specific words are assigned a label that is not consistent with the overall segment stance.
We also looked for errors where the topic-independent tag was correct (e.g., B-CON, beginning of a CON argument), but the topic was incorrect. This type of error occurred only four times on the test set for TAGS=25++, on some of the tokens, but never for a full sequence. For example, the model misclassified the actual topic nuclear energy as the topic abortion, and the actual topic death penalty was confused with the topic minimum wage. A possible reason is topic-specific vocabulary the model learned, but none of the affected words are ones one would assign to the misclassified topics.
We introduced a new task, argument unit recognition and classification (ARC), and release the benchmark ARC-8 for this task. We demonstrated that ARC-8 has good quality in terms of annotator agreement: the required annotations can be crowdsourced using specific data selection and filtering methods as well as a slot filling approach. We cast ARC as a sequence labeling task and established a state of the art for ARC-8, using baseline as well as advanced methods for sequence labeling. In the future, we plan to find better models for this task, especially models with the ability to better incorporate the topic information in the learning process.
We gratefully acknowledge support by Deutsche Forschungsgemeinschaft (DFG) (SPP-1999 Robust Argumentation Machines (RATIO), SCHU2246/13), as well as by the German Federal Ministry of Education and Research (BMBF) under the promotional reference 03VP02540 (ArgumenText).
- Ajjour et al. (2017) Yamen Ajjour, Wei-Fan Chen, Johannes Kiesel, Henning Wachsmuth, and Benno Stein. 2017. Unit segmentation of argumentative texts. In Proceedings of the 4th Workshop on Argument Mining, pages 118–128, Copenhagen, Denmark. Association for Computational Linguistics.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Eckle-Kohler et al. (2015) Judith Eckle-Kohler, Roland Kluge, and Iryna Gurevych. 2015. On the role of discourse markers for discriminating claims and premises in argumentative discourse. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2236–2242, Lisbon, Portugal. Association for Computational Linguistics.
- Eger et al. (2017) Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. Neural end-to-end learning for computational argumentation mining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Volume 1: Long Papers, pages 11–22. Association for Computational Linguistics.
- Hovy et al. (2013) Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1120–1130.
- Hua and Wang (2017) Xinyu Hua and Lu Wang. 2017. Understanding and detecting supporting arguments of diverse types. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 203–208, Vancouver, Canada. Association for Computational Linguistics.
- Kim and Ghahramani (2012) Hyun-Chul Kim and Zoubin Ghahramani. 2012. Bayesian classifier combination. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 619–627, La Palma, Canary Islands. PMLR.
- Krippendorff et al. (2016) K. Krippendorff, Y. Mathet, S. Bouvry, and A. Widlöcher. 2016. On the reliability of unitizing textual continua: Further developments. Quality & Quantity, 50(6):2347–2364.
- Landis and Koch (1977) J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.
- Levy et al. (2014) Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. 2014. Context dependent claim detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1489–1500, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.
- Li et al. (2017) Mengxue Li, Shiqiang Geng, Yang Gao, Shuhua Peng, Haijing Liu, and Hao Wang. 2017. Crowdsourcing argumentation structures in Chinese hotel reviews. In Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics, pages 87–92.
- Miller et al. (2019) Tristan Miller, Maria Sukhareva, and Iryna Gurevych. 2019. A streamlined method for sourcing discourse-level argumentation annotations from the crowd. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics.
- Miwa and Bansal (2016) Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using lstms on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1116, Berlin, Germany. Association for Computational Linguistics.
- Palau and Moens (2009) Raquel Mochales Palau and Marie-Francine Moens. 2009. Argumentation mining: The detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law, ICAIL ’09, pages 98–107, New York, NY, USA. ACM.
- Peldszus and Stede (2013) Andreas Peldszus and Manfred Stede. 2013. From argument diagrams to argumentation mining in texts: A survey. International Journal of Cognitive Informatics and Natural Intelligence, 7(1):1–31.
- Peters et al. (2019) Matthew Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987.
- Reisert et al. (2018) Paul Reisert, Naoya Inoue, Tatsuki Kuribayashi, and Kentaro Inui. 2018. Feasible annotation scheme for capturing policy argument reasoning using argument templates. In Proceedings of the 5th Workshop on Argument Mining, pages 79–89. Association for Computational Linguistics.
- Shnarch et al. (2018) Eyal Shnarch, Carlos Alzate, Lena Dankin, Martin Gleize, Yufang Hou, Leshem Choshen, Ranit Aharonov, and Noam Slonim. 2018. Will it blend? blending weak and strong labeled data in a neural network for argumentation mining. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 599–605. Association for Computational Linguistics.
- Simpson and Gurevych (2018) Edwin Simpson and Iryna Gurevych. 2018. Bayesian ensembles of crowds and deep learners for sequence tagging. arXiv preprint arXiv:1811.00780.
- Stab et al. (2018a) Christian Stab, Johannes Daxenberger, Chris Stahlhut, Tristan Miller, Benjamin Schiller, Christopher Tauchmann, Steffen Eger, and Iryna Gurevych. 2018a. Argumentext: Searching for arguments in heterogeneous sources. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 21–25.
- Stab and Gurevych (2014) Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 1501–1510. Dublin City University and Association for Computational Linguistics.
- Stab and Gurevych (2017) Christian Stab and Iryna Gurevych. 2017. Parsing argumentation structures in persuasive essays. Computational Linguistics, 43(3):619–659.
- Stab et al. (2018b) Christian Stab, Tristan Miller, Benjamin Schiller, Pranav Rai, and Iryna Gurevych. 2018b. Cross-topic argument mining from heterogeneous sources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3664–3674. Association for Computational Linguistics.
- Wachsmuth et al. (2017) Henning Wachsmuth, Martin Potthast, Khalid Al Khatib, Yamen Ajjour, Jana Puschmann, Jiani Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff, and Benno Stein. 2017. Building an argument search engine for the web. In Proceedings of the 4th Workshop on Argument Mining, pages 49–59, Copenhagen, Denmark. Association for Computational Linguistics.
- Walker et al. (2012) Marilyn Walker, Jean Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. 2012. A corpus for research on deliberation and debate. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 812–817, Istanbul, Turkey. European Language Resources Association (ELRA).
- Wyner et al. (2010) Adam Wyner, Raquel Mochales-Palau, Marie-Francine Moens, and David Milward. 2010. Approaches to text mining arguments from legal cases. In Enrico Francesconi, Simonetta Montemagni, Wim Peters, and Daniela Tiscornia, editors, Semantic Processing of Legal Texts: Where the Language of Law Meets the Law of Language, pages 60–79. Springer Berlin Heidelberg, Berlin, Heidelberg.