The automatic extraction of arguments from natural language texts, also known as argumentation mining, has recently become a hot topic in artificial intelligence (cf. [Lippi and Torroni2015a]). An argument is a basic unit people use to persuade their audiences to accept a particular state of affairs [Eckle-Kohler et al.2015], and it usually consists of a claim and some premises offered in support of the claim. As a concrete example, consider the following text extracted from a hotel review posted on Tripadvisor.com:
Example 1: 1⃝ Appalling in room television/radio/technology. 2⃝ There was an old, small, black, CRT TV. 3⃝ The channel selection was minimal, 4⃝ picture quality average, and 5⃝ movie options unimpressive.
The review excerpt in Example 1 can be viewed as an argument: clause 1⃝ is the claim of the argument, and the other four clauses are premises supporting the claim. Argumentation mining consists of three sub-tasks: i) segmenting clauses, ii) distinguishing different argument components (e.g. claims, premises) from non-argumentative clauses, and iii) predicting the relations between argument components (e.g. support/attack). In this work, we term the second sub-task argument component detection (ACD); it is the focus of many existing argumentation mining papers and of this work alike.
Motivation. When human annotators manually perform the ACD task, they decide the label of a clause not only based on the clause’s own linguistic features, but also on its context. For instance, consider again Example 1: if we consider clause 2⃝ alone and ignore its surrounding clauses, we are very likely to label it as a claim; however, if we additionally consider the content and label of 1⃝, we may instead label 2⃝ as a premise. To obtain the contextual information, human annotators usually need to read and label a document for multiple rounds [Stab and Gurevych2014a]. However, despite the importance of contextual information, it is ignored by most existing automatic ACD methods. In this work, we consider a specific form of contextual information called historical annotations (HAs), and investigate how to effectively use it in ACD.
In particular, given a target clause to be annotated, human annotators may consider two types of HAs during their multi-round annotating process: type-L (L stands for ’last round’): labels of some clauses surrounding the target clause, made in the previous round of annotating; and type-C (C stands for ’current round’): labels of some clauses preceding the target clause, made in the current round of annotating. Fig. 1 illustrates these two types of HAs. We consider HAs rather than other types of contextual information (e.g. the topic of a document, linguistic features of some surrounding clauses, etc.) for two reasons: i) HAs take only a few bits to encode in vectorised representations; and ii) HAs have been widely used in some NLP tasks, e.g. text summarisation [Rioux et al.2014] and dialogue generation [Pietquin et al.2011]. To the best of our knowledge, this is the first work that studies the usage and influence of HAs in the ACD task.
Objectives. Our first objective is to present the design and implementation of the first reinforcement learning (RL) based ACD technique. When HAs are used in the annotating process, the label of the current clause is part of the contextual information of surrounding clauses (see Fig. 1); thus, the annotating process can be modelled as a sequential decision making problem, as the current decision (i.e. label for the current clause) influences the future decisions. We formally formulate ACD as a sequential decision making problem, and select suitable RL algorithms to solve it.
Our second objective is to study the influences of HAs on different ACD methods. We evaluate the performances of both RL-based ACD and some state-of-the-art supervised-learning (SL) based ACD on two corpora; the results suggest that using HAs yields no significant performance change for SL-based ACD tools, but leads to significant performance improvements for RL-based ACD; in particular, by using appropriate HAs, RL’s accuracy is improved by 8.90% and 17.85% in the two test corpora. In addition, HAs-augmented RL significantly outperforms (in terms of accuracy) the state-of-the-art SL algorithm by 5.56% and 11.94% in the two test corpora.
2 Related Work
Works on ACD.
Most existing automatic ACD methods model ACD as a classification task, and they focus mostly on designing useful features to represent clauses and on selecting appropriate SL-based classifiers. Widely used features include structural, lexical, syntactic and contextual features, and popular classifiers include SVM, naive Bayes, decision tree and random forest. For well-structured documents, using these SL classifiers and conventional features leads to relatively good performances: for example, SVM achieves .726 and .741 macro-$F_1$ in corpora consisting of persuasive essays [Stab and Gurevych2014a, Stab and Gurevych2014b] and legal documents [Palau and Moens2009], resp. However, for some less well-structured texts, e.g. Wikipedia articles, these methods have significantly poorer performances: [Levy et al.2014, Lippi and Torroni2015b] report that in the task of detecting claims from Wikipedia articles, only around .17 $F_1$ is achieved, although they have tried different features (topic-dependent features and partial constituency trees, resp.) and different classifiers (logistic regression and SVM, resp.).
Some works are devoted to using unsupervised-learning techniques to extract features. [Lawrence et al.2014] assume that clauses belonging to the same argument are likely to share the same topic; thus, they employ the LDA-based topic modelling technique [Blei et al.2003] to extract each clause’s topics and use these topics as features. [Nguyen and Litman2015] use LDA to extract the argument words (i.e. words used as argument indicators, e.g. ‘think’, ‘reason’) and domain words (i.e. terminologies commonly used within a certain topic, e.g. ‘education’, ‘art’), and add indicator features for these words. However, these features have only been tested on small corpora constructed from well-structured documents (the former is tested on documents obtained from a 19th century philosophical book, while the latter is tested on the persuasive essay corpus proposed in [Stab and Gurevych2014a]), and the computational expense of LDA is high.
Works on contextual information in ACD. HAs have been implicitly used in some SL-based ACD tools. In [Habernal and Gurevych2015], ACD is modelled as a sequence tagging problem, and SVM-HMM [Altun et al.2003] is used to solve this problem; SVM-HMM implicitly considers type-C HAs during the labelling process. Their technique achieves macro-$F_1$ between .185 and .304 in debate portal documents. Type-C HAs are arguably the de facto features used by RL-based NLP tools. In RL-based text summarisation techniques [Ryang and Abekawa2012, Rioux et al.2014], annotations of all preceding sentences, made in the current round of scanning, are included in the feature vector; in RL-based dialogue generation systems, e.g. [Pietquin et al.2011, Williams and Young2007], the full history of dialogue acts in the current dialogue is used in the state representation. However, none of these works compares the performances of HAs-augmented and HAs-free versions of their techniques; thus, they fail to investigate to what extent the usage of HAs can improve performance.
As for other forms of contextual information, in [Levy et al.2014], the topic of a document is used to build features to identify claims. To be more specific, given a clause, the similarity between this clause and the topic sentence is used to decide whether this clause is a claim or not. However, the importance of the topic information is questionable, as [Lippi and Torroni2015b] report that similar performances can be obtained without using the topic information.
3 Formulating ACD as a Sequential Decision Making Problem
Markov decision processes (MDPs) are widely used mathematical models for formulating sequential decision making problems. In this work, we consider ACD formally as episodic MDPs. An episodic MDP is a tuple $(S, A, P, R, T)$. $S$ is the set of states; a state $s \in S$ is a representation of the current status of the problem at hand. $A$ is the set of actions. By performing an action $a \in A$ in state $s$, the agent is transited to some new state $s'$ and receives a numerical reward $R(s, a)$, where $R: S \times A \to \mathbb{R}$ is the reward function. $P: S \times A \times S \to [0, 1]$ is the transition function: $P(s, a, s')$ gives the probability of moving from state $s$ to $s'$ by performing action $a$. $T \subseteq S$ is the set of terminal states: when the agent is transited to a state $s \in T$, the current episode ends. The components of our MDP-based ACD formulation are as follows:
State set $S$. Each state $s \in S$ represents a clause to be annotated. Thus, we let $s$ be a feature vector, which includes not only the current clause’s linguistic features, but may also include type-L and type-C HAs. We let $n$ denote the length of the conventional linguistic features, $w_L$ and $w_C$ denote the window sizes for type-L and type-C HAs, resp., and $d$ denote the length of the state vector; thus, $d$ is determined by $n$, $w_L$ and $w_C$.
Action set $A$. Given a state $s$, performing action $a$ on $s$ means labelling the corresponding clause of $s$ as type $a$. Thus, each action in $A$ corresponds to a type of label.
Transition probability $P$. Given the above formulations of $S$ and $A$, $P$ gives the next clause to be annotated after the current clause’s labelling finishes. In other words, $P$ decides the sequence of labelling. As such, $P$ is deterministic: after $s$ is annotated, no matter what its annotation is, the next clause to be annotated is a fixed clause $s'$; in this work, for simplicity, we let $s'$ be the clause ensuing $s$; investigating the effectiveness of other sequences of labelling is left as future work.
Reward function $R$. $R(s, a)$ evaluates the goodness of annotating $s$ as type $a$. Thus, we let $R(s, a)$ be positive (negative, resp.) if $a$ is (not, resp.) the same as the gold-standard annotation of $s$. Note that function $R$ is known during the training phase but is unknown in the test phase.
Terminal states set $T$. We view labelling a document for one round as an episode. Thus, $s \in T$ if and only if $s$ corresponds to the last clause in a document.
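The state representation described above can be sketched in Python as follows. This is only an illustration: the integer label codes, the `UNKNOWN` padding value, the symmetric type-L window, and all function and parameter names are assumptions introduced here, not details specified in this paper.

```python
from typing import List, Sequence

UNKNOWN = -1  # assumed padding code for positions with no annotation available


def build_state(
    features: Sequence[float],
    index: int,
    last_round: Sequence[int],
    current_round: Sequence[int],
    w_l: int,
    w_c: int,
) -> List[float]:
    """Concatenate linguistic features with type-L and type-C HAs.

    last_round:    labels of all clauses from the previous round (type-L source)
    current_round: labels assigned so far in the current round, i.e. to the
                   clauses preceding `index` (type-C source)
    """
    state = list(features)

    # Type-L: labels of the w_l clauses on each side of the target clause,
    # taken from the previous round (a symmetric window is assumed).
    for offset in list(range(-w_l, 0)) + list(range(1, w_l + 1)):
        j = index + offset
        state.append(last_round[j] if 0 <= j < len(last_round) else UNKNOWN)

    # Type-C: labels of the w_c clauses preceding the target clause,
    # taken from the current round.
    for offset in range(-w_c, 0):
        j = index + offset
        state.append(current_round[j] if 0 <= j < len(current_round) else UNKNOWN)

    return state
```

Under these assumptions the state vector has `len(features) + 2 * w_l + w_c` entries; a different label encoding (e.g. one-hot) or window shape would give a different length.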
A policy $\pi$ specifies the action to take in each state. RL amounts to algorithms for obtaining (near-)optimal policies for MDPs, even when some components (e.g. the reward function $R$) of the MDP are unknown. To obtain the optimal policy, RL maintains a Q-function, which provides a quantitative evaluation of the current policy being used. Specifically, given a policy $\pi$, its Q-function $Q^{\pi}(s, a)$ gives the expected discounted sum of rewards that will be received by performing action $a$ in state $s$ and following policy $\pi$ thereafter:
$$Q^{\pi}(s, a) = E_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_0 = s,\ a_0 = a\Big] \qquad (1)$$
where $r_t$ is the immediate reward received in time step $t$, $E_{\pi}$ is the expectation operator with respect to policy $\pi$, and $\gamma$ is a real-valued parameter known as the discount factor.
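To make the discounted sum of rewards concrete, the short Python sketch below computes it for one observed labelling episode. The reward values (+1 for a correct label, -1 for an incorrect one) and the discount factor are illustrative assumptions; the paper only requires rewards to be positive or negative.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    total = 0.0
    # Accumulate from the last reward backwards so that each step applies
    # one more factor of gamma to everything that follows it.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three clauses labelled correctly, incorrectly, correctly (assumed rewards).
print(discounted_return([1.0, -1.0, 1.0], gamma=0.5))  # 1 - 0.5 + 0.25 = 0.75
```

The Q-function is the expectation of this quantity over the episodes that the policy can generate from a given state-action pair.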
In the ACD task, the RL agent proceeds as follows to obtain the optimal policy: the RL agent first uses some random policy to annotate the input documents for one round, and collects the rewards (produced by reward function $R$) during the labelling process; then the RL agent uses these rewards to build the Q-function of the policy, so as to evaluate the goodness of the current policy and derive an improved policy. The newly-obtained policy is used to label the input documents for another round, and the improvement cycle repeats until the policy converges (i.e. two consecutive policies are the same). Most RL algorithms ensure that the converged policy is optimal. Fig. 2 illustrates the workflow of RL-based ACD.
4 RL-based ACD Framework
Selecting suitable RL algorithms for an MDP is not a trivial task, as RL algorithms fall into many different categories, each suitable for certain types of MDPs. We consider the following two factors when we select RL algorithms for our MDP-based ACD formulation presented in Sect. 3:
Data efficiency. Since there exist few high-quality and large-scale ACD corpora (cf. [Lippi and Torroni2015a]), we need to select RL algorithms with strong generalisation capabilities, so as to learn the optimal policies with a limited amount of training data.
Computational efficiency. Obtaining the optimal policy usually requires many rounds of policy improvement. Thus, the computational expense for each round of improvement should be small enough.
To strike a trade-off between the above two factors, we decide to use the least-squares policy iteration (LSPI) [Lagoudakis and Parr2003] algorithm to solve our MDP-based ACD. LSPI is a model-free RL algorithm that uses the training data efficiently, as the same samples can be reused in every round of policy improvement. The computational complexity of LSPI increases linearly with the growth of the sample size, and some works have been proposed to further reduce its complexity (e.g. [Geramifard et al.2006, Sutton et al.2009]).
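For readers unfamiliar with LSPI, the following pure-Python sketch shows the generic algorithm of [Lagoudakis and Parr2003] with a linear Q-function. It is not the implementation used in our experiments: the block basis `phi`, the small ridge term, the sample format `(s, a, r, s_next, terminal)` and all names are assumptions made for illustration.

```python
def phi(state, action, n_actions):
    """Block basis: copy the state vector into the block of the chosen action."""
    d = len(state)
    vec = [0.0] * (d * n_actions)
    vec[action * d:(action + 1) * d] = list(state)
    return vec


def q_value(w, state, action, n_actions):
    """Linear Q-function: dot product of the weights and the basis vector."""
    return sum(wi * fi for wi, fi in zip(w, phi(state, action, n_actions)))


def greedy(w, state, n_actions):
    """Action with the highest estimated Q-value (ties go to the lowest index)."""
    return max(range(n_actions), key=lambda a: q_value(w, state, a, n_actions))


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def lstdq(samples, w, n_actions, gamma, ridge=1e-3):
    """One policy-evaluation step: fit weights for the greedy policy of w."""
    k = len(samples[0][0]) * n_actions
    A = [[ridge if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for s, a, r, s_next, terminal in samples:
        f = phi(s, a, n_actions)
        if terminal:
            f_next = [0.0] * k  # no future reward after the last clause
        else:
            f_next = phi(s_next, greedy(w, s_next, n_actions), n_actions)
        diff = [x - gamma * y for x, y in zip(f, f_next)]
        for i in range(k):
            if f[i] != 0.0:
                for j in range(k):
                    A[i][j] += f[i] * diff[j]
                b[i] += f[i] * r
    return solve(A, b)


def lspi(samples, n_actions, gamma=0.9, max_iter=20, tol=1e-6):
    """Alternate evaluation and greedy improvement until the weights settle."""
    w = [0.0] * (len(samples[0][0]) * n_actions)
    for _ in range(max_iter):
        w_new = lstdq(samples, w, n_actions, gamma)
        if max(abs(x - y) for x, y in zip(w, w_new)) < tol:
            return w_new
        w = w_new
    return w
```

Each pass over the samples is linear in their number, which is the property that motivates the choice of LSPI above; the cost of the final linear solve depends only on the length of the weight vector.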
The LSPI-based ACD framework is presented in Alg. 1. In the training phase (lines 4-8), the RL agent labels each document for a fixed number of rounds so as to obtain the optimal policy; each round of labelling is called an episode (line 5). The obtained policy is output after the training phase finishes (line 9).
In the test phase (lines 11-19), since type-L HAs are used in the state representation (see Sect. 3), the feature vector for the same clause can be different in different rounds (see Fig. 1); as a result, the algorithm needs to label the same document for multiple rounds until the annotations converge, i.e. the annotations made in the current round (line 14) are the same as those made in the last round. Once the annotations converge, the algorithm breaks the loop (line 15) and begins to label the next test document; otherwise, if the annotations fail to converge within a maximum number of rounds (a positive integer provided a priori; see line 2), the algorithm outputs the annotations obtained in the final round (line 18).
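The test-phase loop described above can be sketched as follows. The policy interface (a callable receiving the clause features, the clause index, the labels of the last round and the labels assigned so far in the current round), the initial placeholder label and the toy policy are hypothetical and serve only to illustrate the convergence check.

```python
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[float], int, Sequence[int], Sequence[int]], int]


def annotate_document(
    clauses: Sequence[Sequence[float]],
    policy: Policy,
    max_rounds: int,
    initial_label: int = -1,
) -> List[int]:
    """Label a document for several rounds until the annotations stop changing."""
    previous = [initial_label] * len(clauses)
    for _ in range(max_rounds):
        current: List[int] = []
        for i, features in enumerate(clauses):
            # `previous` supplies type-L HAs, `current` supplies type-C HAs.
            current.append(policy(features, i, previous, current))
        if current == previous:
            break  # annotations converged
        previous = current
    return previous  # converged labels, or those of the final round


def toy_policy(features, index, last_round, current_round):
    """Assumed toy rule: 0 = claim, 1 = premise, 2 = others."""
    if index == 0:
        return 0
    # A later clause is a premise only if clause 0 was a claim last round.
    return 1 if last_round[0] == 0 else 2
```

With `toy_policy`, the labels of later clauses change between the first and second round because they depend on the previous round's label of the first clause, which is why a single pass is not enough once type-L HAs are part of the state.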
Now we discuss the computational complexity of LSPI-based ACD. Suppose there are clauses in the training set, and each document is labelled for rounds; also recall that the state vector size is (see Sect. 3). As such, the complexity of each episode (line 6) using LSPI is , and the complexity for obtaining the final policy (line 9) is [Lagoudakis and Parr2003]; thus, the overall complexity in the training phase is .
As for the complexity of SVM-based ACD, again, we suppose that there are clauses in the training set and each document is labelled for rounds. For each clause, as its annotation can be different in different rounds of labelling (see Fig. 1), each clause has different vector representations. Thus, there are in total input vectors in the training phase. In line with most existing SL-based ACD (see Sect. 2), we select SVM with RBF kernel as the classifier, and its complexity in the training phase is between and (using LIBSVM [Chang and Lin2011]). As for SVM-HMM [Altun et al.2003], its complexity is no cheaper than standard SVM. To summarise, the training complexity of SVM/SVM-HMM is (at least) quadratic with the number of samples, while the complexity of LSPI is linear with the number of samples; thus, LSPI-based ACD scales better when applied to large-scale corpora, and is more suitable for applications with short feature vectors.
In the test phase, the complexity for computing the annotation for one clause is when using LSPI, but is approximately [Claesen et al.2014] when using SVM and SVM-HMM (with RBF kernel). Since the number of annotation types is usually much smaller than the vector length , the computational complexity of LSPI is usually lower in the test phase.
When selecting corpora for testing our methods, we primarily consider the labelling quality of the corpora, because the corpora’s quality heavily influences the quality of the ACD tools trained on them [Habernal and Gurevych2015]. The inter-rater agreement (IRA) score is a widely used metric to evaluate the reliability of annotations and the quality of corpora. Fleiss’ kappa [Fleiss1971] is among the most widely used IRA metrics, because it can compute the agreement between two or more raters, and it accounts for the possibility of agreement occurring by chance, thus giving a more robust measure than simple percentage agreement. If the Fleiss’ kappa score equals 1, the raters are in perfect agreement; the lower the score, the poorer the agreement. In this work, all IRA scores reported are Fleiss’ kappa values.
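As an illustration of how Fleiss’ kappa corrects for chance agreement, the following Python sketch computes it from a table of rating counts. The input layout and the function name are assumptions; the formula is the standard one from [Fleiss1971].

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2), and
    the ratings must not all fall into a single category (otherwise the
    chance agreement is 1 and kappa is undefined).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Mean observed agreement: fraction of agreeing rater pairs per item.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Agreement expected by chance, from the overall category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )

    return (p_bar - p_e) / (1 - p_e)


# Two clauses, five raters, two categories, all raters agree on each clause.
print(fleiss_kappa([[5, 0], [0, 5]]))  # 1.0
```

Simple percentage agreement would already be fairly high for a near-even 3-versus-2 split on every clause, whereas kappa for such a table is close to zero or negative.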
Since there exist few well-annotated and publicly available argumentation corpora, we create our own argumentation corpus (details of the creation of our hotel corpus are presented in a separate paper, which is currently under review). We randomly sampled 200 hotel reviews of appropriate length (50-200 words) from the hotel review dataset provided by [Wachsmuth et al.2014]. We presented these hotel reviews on a crowdsourcing platform, and asked five workers to independently annotate each review. Similar to [Wachsmuth et al.2015], we viewed each sub-sentence as a clause. We asked the workers to label each clause as one of the following six categories:
major claim: summarises the main opinion of a review;
claim: an opinion on a certain aspect of a hotel;
premise: a reason/evidence supporting a claim;
background: an objective description that does not give direct opinions but provides some background information; for example “this is my second staying at this hotel”, “we arrived at at midnight”;
recommendation: a positive or negative recommendation for the hotel, e.g. “do not come to this place if you want a luxury hotel”, “I would definitely come to this hotel again the next time I visit London”; and
others, for all the other clauses.
A detailed annotation guide and some examples were presented to the workers before they started their labelling. We asked the workers to give one and only one major claim for each hotel review, and informed them that a claim can have no premises, but each premise must support some claim. The annotation process lasted 4 weeks, and 216 workers participated in total. We removed the annotations with obvious mistakes, and finally obtained annotations for 105 hotel reviews. In total, the corpus contains 1575 sub-sentences and 14756 tokens; some statistics are given in Table 1. Since the IRA for type others is lower than 0.5, we manually checked and calibrated all others annotations. Except for type others, all types have IRA scores above 0.6, suggesting that the agreement is substantial [Landis and Koch1977].
Another corpus we used to test our approach is the persuasive essays corpus proposed in [Stab and Gurevych2016]. This corpus contains 402 essays on a variety of different topics, and it has three argument component types: major claim, claim and premise; the IRA scores for these three argumentative types are 0.88, 0.64 and 0.83, resp.; however, the IRA for type others is not reported.
To the best of our knowledge, these two corpora are among the most well-annotated argumentation corpora (in terms of IRA scores). Some larger corpora, e.g. the one in [Levy et al.2014], have much lower IRA scores (.39); the legal texts corpus proposed in [Palau and Moens2009] is not publicly available, and the web texts corpus proposed in [Habernal et al.2014] has relatively low IRA (below .50) for most argument component types.
6 Experimental Settings and Results
In this section, we denote an HAs combination with type-L window size $w_L$ and type-C window size $w_C$ as a pair $(w_L, w_C)$. Under each HAs combination setting, we used a 10-fold cross-validation setup and ensured that clauses from the same document are not distributed over the train and test sets; in addition, we repeated the cross-validation 10 times, which yields a total of 100 folds. All results presented are average values over the 100 folds. We let the significance level be 0.05. As for the conventional linguistic features (see Sect. 3), we used exactly the same features as those in [Stab and Gurevych2014b]. For model selection and hyper-parameter tuning, we randomly sampled 25% of the documents (from both corpora) and performed 5-fold cross-validation.
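The evaluation protocol above (repeated 10-fold cross-validation with all clauses of a document kept on the same side of the split) can be sketched as follows. The function name, the round-robin assignment of documents to folds and the fixed seed are illustrative assumptions, not a description of our experimental scripts.

```python
import random
from typing import Hashable, Iterator, List, Sequence, Tuple


def repeated_document_folds(
    doc_ids: Sequence[Hashable],
    n_folds: int = 10,
    n_repeats: int = 10,
    seed: int = 0,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) lists of clause indices, split at document level.

    doc_ids[i] is the identifier of the document that clause i belongs to.
    """
    rng = random.Random(seed)
    docs = sorted(set(doc_ids))
    for _ in range(n_repeats):
        shuffled = docs[:]
        rng.shuffle(shuffled)  # a fresh document order for every repetition
        for k in range(n_folds):
            # Every n_folds-th document (starting at k) forms the test fold.
            test_docs = set(shuffled[k::n_folds])
            train = [i for i, d in enumerate(doc_ids) if d not in test_docs]
            test = [i for i, d in enumerate(doc_ids) if d in test_docs]
            yield train, test
```

With the default arguments this yields 100 train/test pairs, matching the 100 folds over which results are averaged.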
[Results panels: in the essay corpus (left); in the hotel corpus (right).]
We select SVM and SVM-HMM as our baselines, because these two algorithms are among the most widely used and best-performing algorithms to build ACD tools (see Sect. 2). As for the algorithm implementations, we used LIBSVM [Chang and Lin2011] for SVM and a revision of SVM$^{struct}$ [Joachims et al.2009] for SVM-HMM.
Although SVM-HMM considers type-C HAs, it ignores type-L HAs and it does not consider HAs explicitly. For these reasons, and also to ensure a fair comparison between SL- and RL-based ACD tools, we also test the performances of SVM and SVM-HMM using the HAs-augmented features. We try all HAs combinations from (0,0) to (9,5) in both baseline algorithms, and find that using HAs-augmented features does not result in significant changes in the performances of SVM and SVM-HMM in either corpus. We have also tried using HAs-augmented features in some other SL algorithms (J48 decision tree, naive Bayes and random forest provided in WEKA [Hall et al.2009]), and we make similar observations. These results suggest that most existing SL-based ACD tools can hardly take advantage of HAs to improve their performances, either implicitly (e.g. SVM-HMM, which implicitly considers type-C HAs) or explicitly (i.e. by directly adding HAs to the feature vector).
As for the relative goodness of SVM and SVM-HMM, these two baseline approaches have comparable performances in both corpora: consider the best performances achieved by SVM and SVM-HMM (presented in the first two rows in Table 2); in both corpora, although the macro-$F_1$ of SVM-HMM is marginally higher than that of SVM, the accuracy of SVM is higher than that of SVM-HMM, and the gaps between their accuracy and macro-$F_1$ scores are all insignificant.
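Accuracy and macro-$F_1$ can rank two systems differently because macro-$F_1$ weights every label type equally regardless of its frequency. The sketch below computes both metrics, assuming that macro-$F_1$ is the unweighted mean of the per-type $F_1$ scores; the function name and the toy labels are illustrative only.

```python
from typing import Hashable, Sequence, Tuple


def accuracy_and_macro_f1(
    gold: Sequence[Hashable], predicted: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (accuracy, macro-F1) for two equally long label sequences."""
    pairs = list(zip(gold, predicted))
    labels = sorted(set(gold) | set(predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); 0 if the label is never hit correctly.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return accuracy, sum(f1_scores) / len(f1_scores)


gold = ["claim", "premise", "premise", "premise"]
pred = ["premise", "premise", "premise", "premise"]
acc, macro_f1 = accuracy_and_macro_f1(gold, pred)
print(acc)  # 0.75
```

In this toy case a classifier that always predicts the majority type reaches an accuracy of 0.75, but its macro-$F_1$ is only 3/7, since the $F_1$ for type claim is 0.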
First, we study the influences of HAs on RL-based ACD. The performances of RL-based ACD using different HAs combinations are presented in Fig. 3. We can see that, in both corpora, the worst performances are obtained when no HAs are used, and the performances increase almost linearly with the growth of the type-L and type-C window sizes. To evaluate the significance of the improvement, we performed t-tests between the performances in four settings: no HAs are used; type-L is used to the maximum and no type-C is used; type-C is used to the maximum and no type-L is used; and both types of HAs are fully used. The results suggest that the performance without HAs is significantly inferior to the performances obtained in the other three settings, and the performance with both types of HAs fully used is significantly superior to all the other performances. These observations indicate that both type-L and type-C HAs can significantly improve RL’s performance in the ACD task, and the two types of HAs can be used together. As for the relative importance of type-L and type-C, since the performances’ growth rates along the type-L and type-C dimensions are almost the same, we believe that the relative importance of these two types of HAs is comparable and their influences on the performances are independent.
Second, we compare the performances of RL-based ACD and the baseline approaches. Main results are presented in Table 2. The results for HAs-augmented RL (the last row in the table) are obtained using the HAs combination . We make two key observations from these results:
HAs-free RL underperforms the baseline approaches in both corpora, but the performance gaps are mostly marginal and insignificant. To be more specific, except for the gap between the macro-$F_1$ score of HAs-free RL (.521) and SVM-HMM (.582) in the hotel corpus, all other gaps between HAs-free RL’s performance (namely accuracy and macro-$F_1$) and those of the baseline approaches are not statistically significant.
HAs-augmented RL outperforms the baseline approaches in both corpora, and the performance gaps are mostly substantial and significant. Specifically, except for the gap between the macro-$F_1$ score of HAs-augmented RL (.696) and SVM-HMM (.683) in the essay corpus, all other gaps between HAs-augmented RL’s performance (accuracy and macro-$F_1$) and those of the baseline approaches are significant.
HAs-free RL underperforms SVM/SVM-HMM because the RL algorithm we use (namely the LSPI algorithm; see Sect. 4 and Alg. 1) has much weaker “expressiveness” than SVM/SVM-HMM: LSPI employs a linear function to approximate the Q-function (see Sect. 3 and Eq. (1)); thus, when the Q-function is complex, LSPI can hardly provide a precise estimation of it; since the Q-function is used to derive annotation policies, a poor estimation of the Q-function harms RL’s performance. In contrast, SVM and SVM-HMM use RBF kernels to perform the classification, which have much stronger expressiveness than LSPI’s linear function. Thus, we believe that by using more sophisticated RL algorithms, e.g. the kernel-based RL algorithms [Taylor and Parr2009, Ormoneit and Sen2002] and the recently proposed deep RL [Mnih et al.2015], the performance of RL-based ACD can be substantially improved (though at the price of higher computational complexity).
6.3 Discussion and Error Analysis
To obtain further insights into how the usage of HAs improves the performance of RL, we look into the confusion matrices of each algorithm and manually investigate some misclassified cases. In both corpora, we find the biggest error source is the misclassification between premises and claims: for example, in the essay corpus, 607 out of 1506 claims are misclassified as premises by SVM-HMM; in the hotel corpus, 88 out of 180 premises are misclassified as claims by HAs-free RL. Similar observations are also reported in [Stab and Gurevych2014b], and we believe that ignoring contextual information (including HAs) is a major factor leading to these misclassifications, as illustrated in Example 1. We find that this problem is considerably mitigated by HAs-augmented RL: from Table 2 we can see that, in the essay corpus, HAs-augmented RL’s performance for type premise exceeds the baselines’ performances by around 8%; in the hotel corpus, HAs-augmented RL’s performance for type claim exceeds the baselines’ performances by over 10%. As a concrete example, HAs-augmented RL correctly labels 2⃝ to 5⃝ in Example 1 as premises, while the other approaches fail.
In the hotel corpus, another major error source is the misclassification of type others: from the right-most column in Table 2 we can see that the score for type others is the lowest among all types’ performances. We believe the reason is that it is challenging even for human annotators to identify clauses of type others; this can be seen from the fact that the IRA score for others is the lowest (0.35; see Table 1). As a concrete example, consider the following review excerpt in our hotel corpus:
Example 2: 1⃝ The rooms come in different sizes. 2⃝ The two other families we were traveling with had larger rooms – 3⃝ see if you can book something larger (especially if traveling with kids).
On the crowdsourcing platform, 3 workers labelled 3⃝ as others while the other 2 workers labelled it as recommendation. We think the reason is that, when considering 3⃝ alone, it looks like a recommendation; but when considering all three clauses, 3⃝ is more like others. Thus, HAs are important for identifying type others, and this may explain why HAs-augmented RL outperforms the other three ACD techniques by such a big margin for type others: when both the type-L and type-C window sizes are larger than 3, HAs-augmented RL successfully labels 3⃝ as others, while the other approaches label it as recommendation.
In this work, we propose a novel RL-based ACD technique, and study the influences of HAs therein. Empirical results on two corpora suggest that using HAs can significantly improve RL-based ACD’s performance, and that the HAs-augmented RL’s performance is significantly superior to those of the state-of-the-art SL-based ACD techniques. To the best of our knowledge, this is the first work systematically studying the influences of HAs and the applicability of RL in the ACD task. Future work includes studying the influences of some other contextual information, e.g. linguistic features of surrounding clauses, in SL- and RL-based ACD methods, and studying the applicability of RL for some other sub-tasks in argumentation mining, e.g. argument relation prediction.
- [Altun et al.2003] Y. Altun, I. Tsochantaridis, T. Hofmann, et al. Hidden Markov support vector machines. In Proc. of ICML, 2003.
- [Blei et al.2003] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 3, 2003.
- [Chang and Lin2011] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 2011.
- [Claesen et al.2014] M. Claesen, F. De Smet, J. Suykens, and B. De Moor. Fast prediction with SVM models containing RBF kernels. arXiv preprint arXiv:1403.0736, 2014.
- [Eckle-Kohler et al.2015] J. Eckle-Kohler, R. Kluge, and I. Gurevych. On the role of discourse markers for discriminating claims and premises in argumentative discourse. In Proc. of EMNLP, 2015.
- [Fleiss1971] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971.
- [Geramifard et al.2006] A. Geramifard, M. Bowling, and R. S. Sutton. Incremental least-squares temporal difference learning. In Proc. of AAAI, 2006.
- [Habernal and Gurevych2015] I. Habernal and I. Gurevych. Exploiting debate portals for semi-supervised argumentation mining in user-generated web discourse. In Proc. of EMNLP, 2015.
- [Habernal et al.2014] I. Habernal, J. Eckle-Kohler, and I. Gurevych. Argumentation mining on the web from information seeking perspective. In Proc. of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing, 2014.
- [Hall et al.2009] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10–18, 2009.
- [Joachims et al.2009] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1), 2009.
- [Lagoudakis and Parr2003] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. JMLR, 4:1107–1149, 2003.
- [Landis and Koch1977] J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977.
- [Lawrence et al.2014] J. Lawrence, C. Ree, A. Colin, S. MacAlister, A. Ravenscroft, and D. Bourget. Mining arguments from 19th century philosophical texts using topic based modelling. In Proc. of Workshop on Argumentation Mining, 2014.
- [Levy et al.2014] R. Levy, Y. Bilu, D. Hershcovich, E. Aharoni, and N. Slonim. Context dependent claim detection. In Proc. of COLING, 2014.
- [Lippi and Torroni2015a] M. Lippi and P. Torroni. Argumentation mining: State of the art and emerging trends. ACM Transactions on Internet Technology, 2015.
- [Lippi and Torroni2015b] M. Lippi and P. Torroni. Context-independent claim detection for argument mining. In Proc. of IJCAI, 2015.
- [Mnih et al.2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- [Nguyen and Litman2015] H. V. Nguyen and D. J. Litman. Extracting argument and domain words for identifying argument components in texts. In Proc. of Workshop on Argumentation Mining, 2015.
- [Ormoneit and Sen2002] D. Ormoneit and Ś. Sen. Kernel-based reinforcement learning. Machine Learning, 49(2-3):161–178, 2002.
- [Palau and Moens2009] R. M. Palau and M.-F. Moens. Argumentation mining: the detection, classification and structure of arguments in text. In Proc. of ICAIL, 2009.
- [Pietquin et al.2011] O. Pietquin, M. Geist, and S. Chandramohan. Sample efficient on-line learning of optimal dialogue policies with Kalman temporal differences. In Proc. of IJCAI, 2011.
- [Rioux et al.2014] C. Rioux, S. A. Hasan, and Y. Chali. Fear the REAPER: A system for automatic multi-document summarization with reinforcement learning. In Proc. of EMNLP, 2014.
- [Ryang and Abekawa2012] S. Ryang and T. Abekawa. Framework of automatic text summarization using reinforcement learning. In Proc. of EMNLP, 2012.
- [Stab and Gurevych2014a] C. Stab and I. Gurevych. Annotating argument components and relations in persuasive essays. In Proc. of COLING, 2014.
- [Stab and Gurevych2014b] C. Stab and I. Gurevych. Identifying argumentative discourse structures in persuasive essays. In Proc. of EMNLP, 2014.
- [Stab and Gurevych2016] C. Stab and I. Gurevych. Parsing argumentation structures in persuasive essays. arXiv preprint, arXiv:1604.07370, 2016.
- [Sutton et al.2009] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proc. of ICML. ACM, 2009.
- [Taylor and Parr2009] G. Taylor and R. Parr. Kernelized value function approximation for reinforcement learning. In Proc. of ICML, pages 1017–1024. ACM, 2009.
- [Wachsmuth et al.2014] H. Wachsmuth, M. Trenkmann, B. Stein, G. Engels, and T. Palakarska. A review corpus for argumentation analysis. In Computational Linguistics and Intelligent Text Processing, pages 115–127. Springer, 2014.
- [Wachsmuth et al.2015] H. Wachsmuth, J. Kiesel, and B. Stein. Sentiment flow: a general model of web review argumentation. In Proc. of EMNLP, 2015.
- [Williams and Young2007] J. D. Williams and S. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2), 2007.