1 Introduction
Extractive document summarisation, as a challenging instance of natural language generation (NLG), is a popular summarisation paradigm, which builds summaries by selecting an appropriate sequence of important phrases or sentences from the input document(s). Extractive summarisation can be formulated as a sequential decision-making problem, and hence can be tackled by Reinforcement Learning (RL) algorithms. RL searches for (near-)optimal trajectories (i.e. sequences of decisions) by directly optimising the objective functions, e.g. the ROUGE metrics [Lin2004]. Such objectives are non-differentiable and therefore difficult to optimise directly with deep neural networks. In addition, RL alleviates the exposure bias problem faced by sequential supervised learning paradigms in NLG. Combining RL algorithms such as REINFORCE [Williams1992] with neural techniques (e.g. sequence-to-sequence models) yields state-of-the-art performance in summarisation [Narayan et al.2018, Yao et al.2018].

Existing RL-based summarisation systems fall into two categories: cross-input RL and input-specific RL. In cross-input RL (upper part in Fig. 1), at training time, RL agents interact with a ground-truth reward oracle for multiple episodes so as to learn a policy that maximises the accumulated rewards in the episode; at test time, the learnt policy is applied to unseen data to generate the summary. However, learning such cross-input policies is very expensive because of the huge search space. Another issue is the delayed rewards: the ROUGE scores can be calculated only once the complete summary has been generated. Such delayed rewards cause RL-based summarisers to take even longer to converge. Although multiple techniques have been proposed to speed up RL training, e.g. memory replay [Mnih et al.2015], MIXER [Ranzato et al.2016] and fast policy learning through imitation [Cheng et al.2018], training cross-input RL still requires considerable time, data and parameter tuning.

On the other hand, input-specific RL (middle part in Fig. 1) requires neither parallel data (i.e. input documents and reference summaries for them) nor a reward oracle. Instead, it employs a hand-crafted reward function with which the RL agent interacts in order to learn a policy specifically for the given input. By doing so, the size of the search space decreases significantly, which in turn reduces the training time and computational resources. However, designing such a reward function is highly challenging, as it should fit all inputs [Rioux et al.2014]. This explains the poor performance of input-specific RL summarisers [Ryang and Abekawa2012].
To tackle the problems faced by the above two RL-based summarisation paradigms, we propose a novel paradigm called REward Learning for Input-Specific reinforcement learning (RELIS). Instead of learning a cross-input policy, RELIS learns a cross-input reward oracle at training time, and then uses the learnt reward to train an input-specific policy for each input at test time (bottom part in Fig. 1). RELIS is inspired by inverse RL [Abbeel and Ng2004], which requires a demonstrator to present optimal trajectories. Because such a demonstrator is hardly available in summarisation, RELIS leverages Learning-to-Rank (L2R) algorithms [Li2011] to approximate the ground-truth reward oracle from “weak supervision”, e.g., numeric scores that indicate the quality of a summary or preferences over summary pairs.
Our contributions are threefold: (i) We propose RELIS (§2), a new RL-based summarisation framework that enjoys the strong performance of cross-input RL and the low computational cost of input-specific RL. (ii) Theoretically, we prove that by employing appropriate L2R and RL algorithms, RELIS is guaranteed to generate near-optimal outputs (§3). (iii) Empirically, we evaluate RELIS on multi-document extractive summarisation (§4). Compared to state-of-the-art methods, RELIS provides comparable performance but requires much less training time and data. Because the proof for RELIS is generic, we believe RELIS has the potential to be applied to other NLG tasks, e.g. translation and sentence simplification. Source code and supplementary material are available at https://github.com/UKPLab/ijcai2019relis.
2 RELIS
We first formally define the summarisation task, and then detail the L2R and RL modules of RELIS.
2.1 Extractive Summarisation
In line with Peyrard and Eckle-Kohler [2017], we formulate summarisation as a discrete optimisation problem. Let $\mathcal{X}$ be the set of all possible input documents and $X \subseteq \mathcal{X}$ be the set of inputs available at training time. An input $x \in \mathcal{X}$ can be either a single document or a cluster of documents on the same topic. For input $x$, let $\mathcal{Y}_x$ denote the set of all extractive summaries for $x$ that comply with the length requirement. Then, the task of summarisation is to map each input $x$ to its best summary in $\mathcal{Y}_x$ with respect to a ranking function $\sigma$. For a candidate summary $y \in \mathcal{Y}_x$, $\sigma(y;\mathcal{Y}_x)$ returns the number of candidates in $\mathcal{Y}_x$ that have equal or lower quality than $y$, including $y$ itself. For example, if $y$ is the highest-quality candidate in $\mathcal{Y}_x$, then $\sigma(y;\mathcal{Y}_x) = |\mathcal{Y}_x|$.
$\sigma$ can be obtained from human evaluators, automatic metrics (e.g. ROUGE) or heuristics measuring the quality of outputs. We denote the ground-truth ranking function on $\mathcal{Y}_x$ by $\sigma^*$.

Given the above definition of summarisation, we define a summariser agent as a tuple $\langle \mathcal{Y}_x, \sigma, \mathcal{M} \rangle$, where $\mathcal{M}$ is an optimisation model for finding the (near-)optimal summary in $\mathcal{Y}_x$ with respect to $\sigma$. In the cross-input paradigm, the RL agent learns a policy that solves the optimisation problem for any $x \in \mathcal{X}$ at training time. In RELIS, instead, the agent learns a ranking $\hat{\sigma}$ at training time so that $\hat{\sigma}$ is as “close” as possible to $\sigma^*$ (“close” will be formally defined in §3). At test time, for each $x$, RELIS formulates the resulting optimisation problem as a Markov Decision Process (MDP) and learns an RL policy specifically for $x$.
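To make the ranking function concrete, the sketch below (our own illustration, not code from the paper; the candidate summaries and quality scores are hypothetical) computes $\sigma(y;\mathcal{Y}_x)$ from per-candidate quality scores.

```python
from typing import Dict

def ranking_sigma(quality: Dict[str, float]) -> Dict[str, int]:
    """For each candidate y, count the candidates whose quality is equal or
    lower (including y itself), as in the definition of sigma."""
    return {
        y: sum(1 for other in quality.values() if other <= score)
        for y, score in quality.items()
    }

# Hypothetical candidate summaries with quality scores (e.g. ROUGE-based).
quality = {"summary_a": 0.42, "summary_b": 0.31, "summary_c": 0.55}
print(ranking_sigma(quality))  # the best candidate gets rank |Y_x| = 3
```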
2.2 Learning to Rank (L2R)
L2R algorithms learn to reconstruct the ranking over objects produced by an oracle [Li2011]. L2R induces the approximated ranking $\hat{\sigma}$ by learning a utility function $U$ from the oracle, such that $\hat{\sigma}(y;\mathcal{Y}_x) \geq \hat{\sigma}(y';\mathcal{Y}_x)$ if and only if $U(y) \geq U(y')$. L2R can learn from different types of oracles, including pointwise oracles that provide point-based scores for objects, pairwise oracles that provide preferences over pairs of objects, and listwise oracles that provide certain metrics (e.g. accuracy) for candidate rankings. Here, we focus on pointwise and pairwise oracles, as humans reliably provide such judgements for short texts [Kreutzer et al.2018]. The function $U$ can be learnt by any function approximation technique, e.g., neural networks.
In pointwise L2R, for every $x \in X$, we draw $N$ sample summaries $y_1, \dots, y_N$ from $\mathcal{Y}_x$ using some sampling strategy, e.g., random sampling without replacement. Then, we query their $\sigma^*$ values from the oracle and use a regression algorithm to minimise the averaged mean squared error (MSE) between $U(y_i)$ and $\sigma^*(y_i;\mathcal{Y}_x)$. Formally, the loss function is

$\mathcal{L}_{\mathrm{MSE}}(U) = \frac{1}{|X|} \sum_{x \in X} \frac{1}{N} \sum_{i=1}^{N} \big( U(y_i) - \sigma^*(y_i;\mathcal{Y}_x) \big)^2 .$   (1)
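As a concrete illustration of the pointwise variant, the following sketch (our own, under the assumption of a linear utility $U(y) = w^\top \phi(y)$; the feature vectors and oracle ranks are made up) fits $w$ by least squares, i.e. it minimises the MSE loss of Eq. (1) for a single input.

```python
import numpy as np

def fit_pointwise_utility(features: np.ndarray, ranks: np.ndarray) -> np.ndarray:
    """Least-squares fit of a linear utility U(y) = w . phi(y) to the oracle
    ranks, i.e. the minimiser of the MSE loss in Eq. (1) for one input."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(X, ranks, rcond=None)
    return w

# Hypothetical data: 4 sampled summaries, 3 features each, oracle ranks 1..4.
phi = np.array([[0.2, 0.1, 0.5], [0.4, 0.3, 0.2], [0.9, 0.6, 0.1], [0.3, 0.8, 0.7]])
sigma_star = np.array([1.0, 2.0, 4.0, 3.0])
print(fit_pointwise_utility(phi, sigma_star))
```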
In pairwise L2R, for every $x \in X$ we sample pairs of summaries from $\mathcal{Y}_x$ and then query the oracle about their preferences. We denote the collected preferences for input $x$ by $P_x = \{(y^1_i, y^2_i, p_i)\}_{i=1}^{M}$, where $y^1_i$ and $y^2_i$ are the candidate outputs presented to the oracle in the $i$-th iteration; $p_i$ equals $1$ if the oracle prefers $y^1_i$ over $y^2_i$, and equals $0$ otherwise. Different loss functions can be used to learn $U$ from the preferences in $P_x$. First, we consider the cross-entropy loss:

$\mathcal{L}_{\mathrm{CE}}(U) = -\frac{1}{|X|} \sum_{x \in X} \frac{1}{M} \sum_{i=1}^{M} \Big[ p_i \log \Pr(y^1_i \succ y^2_i) + (1-p_i) \log \big( 1 - \Pr(y^1_i \succ y^2_i) \big) \Big] ,$   (2)

where $\Pr(y^1_i \succ y^2_i) = \frac{\exp U(y^1_i)}{\exp U(y^1_i) + \exp U(y^2_i)}$. An alternative is the margin ranking (a.k.a. pairwise hinge) loss:

$\mathcal{L}_{\mathrm{MR}}(U) = \frac{1}{|X|} \sum_{x \in X} \frac{1}{M} \sum_{i=1}^{M} \max \big( 0,\, 1 - \ell_i \, [ U(y^1_i) - U(y^2_i) ] \big) ,$   (3)
where $\ell_i = 1$ if $y^1_i$ is preferred over $y^2_i$, and $\ell_i = -1$ otherwise. Additionally, we consider an improved margin ranking loss proposed by Agarwal and Collins [2010], which gives larger penalties to misranked pairs with wider margins:

$\mathcal{L}_{\mathrm{IMR}}(U) = \frac{1}{|X|} \sum_{x \in X} \frac{1}{M} \sum_{i=1}^{M} \big| \sigma^*(y^1_i;\mathcal{Y}_x) - \sigma^*(y^2_i;\mathcal{Y}_x) \big| \cdot \max \big( 0,\, 1 - \ell_i \, [ U(y^1_i) - U(y^2_i) ] \big) .$   (4)
Note that the improved margin ranking loss mixes pointwise and pairwise L2R, as it requires both the oracle scores $\sigma^*$ and the pairwise preferences.
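The following sketch (our own illustration, assuming a linear utility and made-up feature vectors) computes the cross-entropy preference loss of Eq. (2), the loss we later adopt in our experiments.

```python
import numpy as np

def pairwise_ce_loss(w, phi_1, phi_2, prefs):
    """Cross-entropy preference loss of Eq. (2) for a linear utility
    U(y) = w . phi(y); prefs[i] = 1 iff the first summary of pair i is preferred."""
    margin = phi_1 @ w - phi_2 @ w            # U(y_i^1) - U(y_i^2)
    p_first = 1.0 / (1.0 + np.exp(-margin))   # = exp(U1) / (exp(U1) + exp(U2))
    eps = 1e-12                               # numerical safety for the logarithms
    return -np.mean(prefs * np.log(p_first + eps)
                    + (1 - prefs) * np.log(1 - p_first + eps))

# Hypothetical data: 3 preference pairs over 2-dimensional summary features.
w = np.array([0.5, -0.2])
phi_1 = np.array([[0.9, 0.1], [0.2, 0.7], [0.4, 0.4]])
phi_2 = np.array([[0.3, 0.6], [0.8, 0.2], [0.1, 0.9]])
prefs = np.array([1.0, 0.0, 1.0])
print(pairwise_ce_loss(w, phi_1, phi_2, prefs))
```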
2.3 Reinforcement Learning (RL)
RL refers to algorithms that search for optimal solutions in Markov Decision Processes (MDPs). We formulate the optimisation problems for NLG tasks as episodic MDPs [Ryang and Abekawa2012, Rioux et al.2014]. An episodic MDP is a tuple $\langle S, A, T, R, S_F \rangle$, where $S$ is a set of states, $A$ is a set of actions, $T$ is the transition function with $T(s,a)$ giving the next state after performing action $a$ in state $s$, $R$ is the reward function with $R(s,a)$ giving the immediate reward for performing action $a$ in state $s$, and $S_F \subseteq S$ is a set of terminal states that mark the end of an episode. Furthermore, to formulate the delayed rewards in summarisation, we let $R(s,a) = 0$ whenever $T(s,a) \notin S_F$, so as to ensure that non-zero rewards only appear in the last step of each episode.
In extractive summarisation, the components of the MDP for input $x$ are defined as follows. $S$ is the set of all possible draft summaries, i.e. permutations of sentences from $x$. Note that $\mathcal{Y}_x$ is a subset of $S$, because $\mathcal{Y}_x$ only includes the summaries complying with the length requirement, whereas $S$ includes all possible extractive summaries. Two types of action constitute $A$: add a new sentence from the input document cluster to the current draft summary, or terminate the generation. $S_F$ includes all the over-length summaries and an absorbing state $s_\perp$; if the action is terminate, $T(s, \textit{terminate}) = s_\perp$ regardless of the current state $s$. The reward function $R(s,a)$ returns $\sigma(s;\mathcal{Y}_x)$ if action $a$ is terminate, a negative reward if the current state is an over-length summary, and 0 otherwise. We denote this MDP by $\mathcal{M}_x^{\sigma}$ to highlight that it uses $\sigma$ as its reward.
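The sketch below (our own simplification; the sentence list, utility callable, length limit and penalty value are placeholders) illustrates this episodic MDP: states are sequences of selected sentences, the reward is delayed until the terminate action, and over-length drafts are terminal with a negative reward.

```python
class SummarisationMDP:
    """Illustrative episodic MDP for extractive summarisation (not the paper's code).
    A state is a tuple of selected sentence indices; the delayed reward at terminate
    is given by a caller-supplied utility, e.g. the learnt ranking surrogate."""

    TERMINATE = -1

    def __init__(self, sentences, utility, length_limit=100, overlength_penalty=-1.0):
        self.sentences = sentences          # sentences of the input document cluster
        self.utility = utility              # callable: draft summary text -> reward
        self.length_limit = length_limit    # maximum number of words allowed
        self.overlength_penalty = overlength_penalty

    def actions(self, state):
        """Add any not-yet-selected sentence, or terminate the generation."""
        unused = [i for i in range(len(self.sentences)) if i not in state]
        return unused + [self.TERMINATE]

    def step(self, state, action):
        """Return (next_state, reward, done)."""
        if action == self.TERMINATE:
            draft = " ".join(self.sentences[i] for i in state)
            return state, self.utility(draft), True          # delayed reward
        next_state = state + (action,)
        n_words = sum(len(self.sentences[i].split()) for i in next_state)
        if n_words > self.length_limit:
            return next_state, self.overlength_penalty, True  # over-length: terminal
        return next_state, 0.0, False                         # otherwise no reward
```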
A policy $\pi$ defines how actions are selected: $\pi(a|s)$ is the probability of selecting action $a$ in state $s$. In the context of summarisation, we slightly abuse the notation by letting $\pi(y|x)$ denote the probability of policy $\pi$ generating summary $y$ given input $x$. We say a policy is proper if $\sum_{y \in \mathcal{Y}_x} \pi(y|x) = 1$, i.e. $\pi$ does not generate illegal outputs. Then, the expected reward of performing a proper policy $\pi$ in $\mathcal{M}_x^{\sigma}$ is:

$R_{\sigma}(\pi) = \sum_{y \in \mathcal{Y}_x} \pi(y|x) \, \sigma(y;\mathcal{Y}_x) .$   (5)
The goal of RL is to find the optimal policy $\pi^*_{\sigma}$ with the highest expected reward: $\pi^*_{\sigma} = \arg\max_{\pi} R_{\sigma}(\pi)$. Note that $\pi^*_{\sigma}$ is a probability distribution that assigns non-zero probabilities only to the optimal outputs for $\sigma$, i.e. $\pi^*_{\sigma}(y|x) > 0$ only if $\sigma(y;\mathcal{Y}_x) = |\mathcal{Y}_x|$. Hence $R_{\sigma}(\pi^*_{\sigma}) = |\mathcal{Y}_x|$.

3 Proof of Convergence for RELIS
In this section, we prove that if the errors in both the L2R and RL steps are bounded, then the error of the final output of RELIS is linear in the errors of the two steps. First, we define convergent L2R, which learns an approximated ranking $\hat{\sigma}$ “close” to the ground-truth ranking $\sigma^*$.
Definition 1.
Let $U$ be the utility function learnt by an L2R algorithm. For $x \in \mathcal{X}$, let $\hat{\sigma}$ be the ranking derived from $U$ on $\mathcal{Y}_x$. The L2R algorithm is convergent with respect to $\sigma^*$ if there exists a non-negative integer $\epsilon$ such that, for all $y \in \mathcal{Y}_x$,

$\big| \hat{\sigma}(y;\mathcal{Y}_x) - \sigma^*(y;\mathcal{Y}_x) \big| \leq \epsilon .$   (6)
Some L2R algorithms are convergent under realistic conditions. For example, the quicksort-based L2R algorithm proposed by Maystre and Grossglauser [2017] is convergent with respect to pairwise oracles when it obtains a sufficiently large number of preferences. Next, we define convergent RL algorithms, which are guaranteed to learn near-optimal policies.
Definition 2.
Let $\mathcal{M}_x^{\sigma}$ denote the episodic MDP for an NLG task given input $x$, which uses a ranking function $\sigma$ over $\mathcal{Y}_x$ as its reward function. An RL algorithm is convergent for $\mathcal{M}_x^{\sigma}$ if its learnt policy satisfies

$\lim_{T \to \infty} \big[ R_{\sigma}(\pi^*_{\sigma}) - R_{\sigma}(\pi_T) \big] = 0 ,$   (7)

where $\pi^*_{\sigma}$ is the optimal policy for $\mathcal{M}_x^{\sigma}$ and $\pi_T$ is the policy of the RL algorithm after $T$ episodes of learning. (Since $\pi^*_{\sigma}$ is the optimal policy for $\mathcal{M}_x^{\sigma}$, $R_{\sigma}(\pi^*_{\sigma})$ is at least as high as the expected reward of any other policy; cf. Eq. (5).)
Many RL algorithms have been proven to be convergent, including value-based RL [Sutton et al.2009], actor-critic [Sutton et al.1999] and policy gradient [Fazel et al.2018].
Recall that for an input $x$ from the test set, our goal is to find the optimal policy for the NLG agent $\langle \mathcal{Y}_x, \sigma^*, \mathcal{M} \rangle$ (see §2). The theorem below shows that if RELIS uses convergent L2R and RL, its output is near-optimal, i.e. the error between the RELIS output and the optimal output is bounded.
Theorem 1.
For a test input $x$, we denote the optimal NLG agent for $x$ by $\langle \mathcal{Y}_x, \sigma^*, \mathcal{M} \rangle$, and its RELIS approximation by $\langle \mathcal{Y}_x, \hat{\sigma}, \mathcal{M} \rangle$, where $\hat{\sigma}$ is the approximation of $\sigma^*$ learnt by a convergent L2R algorithm with error bound $\epsilon$ (Definition 1). Let $\pi^*$ be the optimal policy for $\mathcal{M}_x^{\sigma^*}$ and $\pi_T$ the proper policy for $\mathcal{M}_x^{\hat{\sigma}}$ learnt by a convergent RL algorithm for $T$ episodes. Then, as $T$ approaches positive infinity, $R_{\sigma^*}(\pi^*) - R_{\sigma^*}(\pi_T) \leq \epsilon$.
Proof.
We denote the optimal policy for $\mathcal{M}_x^{\hat{\sigma}}$ as $\hat{\pi}^*$. Hence from Eq. (7) we have:

$\lim_{T \to \infty} \big[ R_{\hat{\sigma}}(\hat{\pi}^*) - R_{\hat{\sigma}}(\pi_T) \big] = 0 .$   (8)

For any $y \in \mathcal{Y}_x$, from Eq. (6) we have

$\big| \hat{\sigma}(y;\mathcal{Y}_x) - \sigma^*(y;\mathcal{Y}_x) \big| \leq \epsilon .$   (9)

Using Eq. (9), we can bound $\big| R_{\hat{\sigma}}(\pi) - R_{\sigma^*}(\pi) \big|$ for any proper policy $\pi$ (see Eq. (5)):

$\big| R_{\hat{\sigma}}(\pi) - R_{\sigma^*}(\pi) \big| = \Big| \sum_{y \in \mathcal{Y}_x} \pi(y|x) \big[ \hat{\sigma}(y;\mathcal{Y}_x) - \sigma^*(y;\mathcal{Y}_x) \big] \Big| \leq \epsilon \sum_{y \in \mathcal{Y}_x} \pi(y|x) = \epsilon .$   (10)

Note that $\sum_{y \in \mathcal{Y}_x} \pi(y|x) = 1$ in Eq. (10) because $\pi$ is a proper policy (see §2). Combining Eq. (8) and (10), we get

$\lim_{T \to \infty} \big[ R_{\hat{\sigma}}(\hat{\pi}^*) - R_{\sigma^*}(\pi_T) \big] \leq \epsilon .$   (11)

Since $\pi^*$ is the optimal policy for $\mathcal{M}_x^{\sigma^*}$, according to Eq. (5), $R_{\sigma^*}(\pi^*) = |\mathcal{Y}_x|$. Similarly, $R_{\hat{\sigma}}(\hat{\pi}^*) = |\mathcal{Y}_x|$. Hence we can replace $R_{\hat{\sigma}}(\hat{\pi}^*)$ in Eq. (11) with $R_{\sigma^*}(\pi^*)$ and obtain $\lim_{T \to \infty} \big[ R_{\sigma^*}(\pi^*) - R_{\sigma^*}(\pi_T) \big] \leq \epsilon$. ∎
4 Experimental Setup
Task and datasets.
We evaluate RELIS for extractive multi-document summarisation on three benchmark datasets from the Document Understanding Conferences (DUC, https://duc.nist.gov/), described in Table 1. Each dataset contains a set of document clusters. For each cluster, the target is to create a summary of at most 100 words. Each cluster is accompanied by several human-generated reference summaries, which we use for training and evaluation. To decide the best parameters, we perform 10-fold cross-validation on DUC'01.
Ground-truth reward oracle.
For each document cluster $x$, we use a ROUGE-based reward to induce the ground-truth ranking $\sigma^*$. We measure the quality of a summary $y$ by $R_1(y) + 2 R_2(y)$, where $R_1$ and $R_2$ denote the average ROUGE-1 and ROUGE-2 recall metrics, respectively. ROUGE-1 and ROUGE-2 are arguably the most widely used metrics to approximate human evaluation of summary quality. The ROUGE-2 scores of optimal extractive summaries are usually around half of their ROUGE-1 scores [Gao et al.2018], so we multiply $R_2$ by 2 to balance the impact of $R_1$ and $R_2$. Next, we describe how we approximate this ground-truth reward.
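A sketch of this reward is shown below, using the open-source rouge_score package as a stand-in scorer (an assumption on our part; the paper's exact ROUGE implementation and pre-processing are not specified here).

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

def rouge_reward(summary, references):
    """Average ROUGE-1 recall plus twice the average ROUGE-2 recall
    over the human reference summaries."""
    r1 = sum(_scorer.score(ref, summary)["rouge1"].recall for ref in references)
    r2 = sum(_scorer.score(ref, summary)["rouge2"].recall for ref in references)
    n = len(references)
    return r1 / n + 2.0 * (r2 / n)

# Hypothetical usage with toy texts.
print(rouge_reward("the cat sat on the mat", ["a cat was sitting on the mat"]))
```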
Table 1: Statistics of the DUC datasets.

Dataset  # Cluster  # Doc  # Sent/Cluster
DUC'01   30         308    378
DUC'02   59         567    271
DUC'04   50         500    265
4.1 Approximated Reward Oracle
Recall that the goal of L2R is to learn the utility function $U$ using pointwise or pairwise preferences (see §2). In pointwise L2R, we sample a set of summaries for each document cluster in the training set by randomly selecting sentences from the cluster until the length limit is reached. Using more samples does not significantly increase the quality of $\hat{\sigma}$ but increases the training time. We use the ground-truth ranking $\sigma^*$ over the samples to learn the utility function $U$ by minimising the MSE loss from Eq. (1). In pairwise L2R, we randomly draw a subset of the possible pairs of the sampled summaries to compute $U$ by minimising $\mathcal{L}_{\mathrm{CE}}$, $\mathcal{L}_{\mathrm{MR}}$ or $\mathcal{L}_{\mathrm{IMR}}$ from Eq. (2)-(4). Preliminary results suggest that drawing more pairs does not benefit the performance but significantly slows down the training, while drawing fewer pairs significantly harms the quality of $\hat{\sigma}$.
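Below is an illustrative sketch (our own; the length limit, pair count and oracle are placeholders) of the two sampling steps: building random length-compliant summaries, and drawing labelled preference pairs from them.

```python
import itertools
import random

def sample_summary(sentences, length_limit=100):
    """Randomly pick sentences (without replacement) until the word limit is hit."""
    order = random.sample(range(len(sentences)), len(sentences))
    picked, words = [], 0
    for i in order:
        n = len(sentences[i].split())
        if words + n > length_limit:
            break
        picked.append(i)
        words += n
    return [sentences[i] for i in picked]

def sample_preference_pairs(summaries, oracle_score, n_pairs):
    """Draw a random subset of summary pairs, labelled with the oracle's preference."""
    pairs = random.sample(list(itertools.combinations(range(len(summaries)), 2)), n_pairs)
    return [(i, j, 1 if oracle_score(summaries[i]) >= oracle_score(summaries[j]) else 0)
            for i, j in pairs]
```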
We use a linear model to approximate the utility function, i.e. $U(y) = w^\top \phi(y)$, where $\phi(y)$ encodes summary $y$ for input $x$ as a vector by concatenating the following features:
- The negative Jensen-Shannon divergence between the unigram, bigram and named-entity distributions of the summary and of the input documents (a sketch of this feature follows this list).
- The summary quality evaluation heuristics proposed by Rioux et al. [2014].
- TF-IDF values averaged over all tokens and named entities in the summary [Peyrard and Gurevych2018].
- The average number of document clusters a word or named entity appears in.
- The rate of named entities from the input documents that appear in the summary.
- The percentage of tokens in the summary that belong to named entities in the input documents.
- The redundancy feature proposed by Peyrard and Gurevych [2018], applied to unigrams, bigrams and named entities.
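As referenced in the first item above, the following sketch (our own; the whitespace tokenisation and the unigram-only restriction are simplifications) computes the negative Jensen-Shannon divergence between the summary and document word distributions.

```python
import math
from collections import Counter

def js_divergence(p: Counter, q: Counter) -> float:
    """Jensen-Shannon divergence between two unigram count distributions."""
    vocab = set(p) | set(q)
    p_sum, q_sum = sum(p.values()), sum(q.values())
    m = {w: 0.5 * (p.get(w, 0) / p_sum + q.get(w, 0) / q_sum) for w in vocab}

    def kl(a, a_sum):
        # KL divergence from the mixture m to distribution a (base-2 logarithm).
        return sum((a[w] / a_sum) * math.log2((a[w] / a_sum) / m[w])
                   for w in a if a[w] > 0)

    return 0.5 * kl(p, p_sum) + 0.5 * kl(q, q_sum)

def neg_js_feature(summary: str, documents: str) -> float:
    """Negative JS divergence between summary and document unigram distributions."""
    return -js_divergence(Counter(summary.lower().split()),
                          Counter(documents.lower().split()))
```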
We additionally use two Convolutional Neural Network (CNN) architectures to generate auto-encoded features: StdCNN [Kim2014] and PriorSum [Cao et al.2015]. To train these auto-encoders, we follow Cao et al. [2015] and feed only the embeddings of the words of the summary into these models. We add a linear layer as the last layer to map the output vector of these models to a scalar value. Full settings of these models are in the supplementary material. For each document cluster, we measure the quality of the approximated ranking $\hat{\sigma}$ by its ranking correlation to the ground-truth $\sigma^*$. We consider two ranking correlation metrics: Spearman's $\rho$ and the Normalized Discounted Cumulative Gain (ndcg) on the top-ranked items. ndcg puts more emphasis on the top elements through logarithmic decay weighting. For both metrics, higher values indicate stronger correlations.

We train the utility function with the loss functions in Eq. (1)-(4) and find that using different loss functions does not significantly change performance (in all experiments, we test statistical significance with a two-tailed t-test; see the supplementary material). However, the cross-entropy loss consistently results in marginally better performance than the others, and hence we use it throughout our experiments. We find that under all examined settings, the CNN-based features underperform the other features. Using the CNN-based features together with the other features also worsens the quality of $\hat{\sigma}$. The reason is that both PriorSum and StdCNN only encode the summaries' document-independent features [Cao et al.2015]. Encoding document-dependent features requires more sophisticated neural models [Wu and Hu2018, Narayan et al.2018], which in turn require considerable time, data and parameter tuning. This would undermine the benefits of RELIS. We leave efficient reward representation learning for future work.
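A sketch of the two correlation measures is given below (our own illustration; the cut-off k and the toy scores are placeholders), using scipy for Spearman's ρ and a direct ndcg implementation with logarithmic discounting.

```python
import numpy as np
from scipy.stats import spearmanr

def ndcg_at_k(true_scores, predicted_scores, k=10):
    """ndcg@k: order candidates by predicted scores, accumulate the true
    scores with logarithmic decay, and normalise by the ideal ordering."""
    true_scores = np.asarray(true_scores, dtype=float)
    order = np.argsort(predicted_scores)[::-1][:k]
    ideal = np.sort(true_scores)[::-1][:k]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = float(np.sum(true_scores[order] * discounts[:len(order)]))
    idcg = float(np.sum(ideal * discounts[:len(ideal)]))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ground-truth ranks and predicted utilities for 6 sampled summaries.
sigma_star = [6, 5, 4, 3, 2, 1]
u_hat = [0.9, 0.7, 0.8, 0.2, 0.4, 0.1]
rho, _ = spearmanr(sigma_star, u_hat)
print(rho, ndcg_at_k(sigma_star, u_hat, k=3))
```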
Table 2: Correlation (Spearman's ρ and ndcg) between different reward functions and the ground-truth ranking.

         DUC'01        DUC'02        DUC'04
         ρ      ndcg   ρ      ndcg   ρ      ndcg
ASRL     .176   .555   .131   .537   .145   .558
REAPER   .316   .638   .301   .639   .372   .701
JS       .549   .736   .525   .700   .570   .763
Ours     .601   .764   .560   .727   .617   .802
4.2 RELIS Setup
We test RELIS on all three DUC datasets as follows. In line with Cao et al. [2017] and Ren et al. [2018], we split the train and test data in a “leave-one-out” manner, so that the documents from two datasets are used as the training set and the documents from the remaining dataset as the test set. In each run of the “leave-one-out” experiments, we randomly select 30% of the training set as the dev set, and select the model with the best performance on the dev set. We optimise with Adam; the initial learning rate, number of epochs and batch size are given in the supplementary material. As for RL in RELIS, we use the same temporal-difference (TD) algorithm as Rioux et al. [2014]. Full details of our parameter selection and results of using different loss functions are also in the supplementary material.
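To illustrate the test-time stage, the sketch below (our own simplification, not the exact TD algorithm of Rioux et al. [2014]; it uses a tabular TD(0) update with an ε-greedy policy, and all hyper-parameters are placeholders) learns an input-specific policy that treats the learnt utility as the delayed terminal reward.

```python
import random
from collections import defaultdict

def td_input_specific_policy(sentences, utility, episodes=1000,
                             length_limit=100, alpha=0.1, epsilon=0.2):
    """Tabular TD(0) sketch: for a single test input, learn state values using
    the learnt utility as the delayed reward, then extract a summary greedily."""
    values = defaultdict(float)   # state (frozenset of sentence ids) -> value

    def words(state):
        return sum(len(sentences[i].split()) for i in state)

    def draft(state):
        return " ".join(sentences[i] for i in sorted(state))

    for _ in range(episodes):
        state, done = frozenset(), False
        while not done:
            candidates = [i for i in range(len(sentences)) if i not in state]
            actions = candidates + ["terminate"]
            if random.random() < epsilon:                 # explore
                action = random.choice(actions)
            else:                                          # greedy on value estimates
                action = max(actions, key=lambda a: utility(draft(state))
                             if a == "terminate" else values[state | {a}])
            if action == "terminate":
                target, done = utility(draft(state)), True      # delayed reward
            elif words(state | {action}) > length_limit:
                target, done = -1.0, True                        # over-length penalty
            else:
                target = values[state | {action}]                # bootstrap
            values[state] += alpha * (target - values[state])    # TD(0) update
            if not done:
                state = state | {action}

    # Greedy rollout with the learnt values to build the final summary.
    state = frozenset()
    while True:
        candidates = [i for i in range(len(sentences))
                      if i not in state and words(state | {i}) <= length_limit]
        if not candidates:
            return draft(state)
        state = state | {max(candidates, key=lambda i: values[state | {i}])}
```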
5 Results
Table 2 compares the quality of our learnt reward $\hat{\sigma}$ with other widely used rewards for input-specific RL (see §4). $\hat{\sigma}$ has a significantly higher correlation to the ground-truth ranking than all other approaches, confirming that our proposed L2R method yields a superior reward oracle.
Table 3: ROUGE-1 (R-1) and ROUGE-2 (R-2) scores on the DUC datasets.

                         DUC'01          DUC'02          DUC'04
                         R-1     R-2     R-1     R-2     R-1     R-2
ICSI                     33.31   7.33    35.04   8.51    37.31   9.36
PriorSum                 35.98   7.89    36.63   8.97    38.91   10.07
TCSum                    36.45   7.66    36.90   8.61    38.27   9.66
TCSum (w/o pretraining)  33.45   6.07    34.02   7.39    35.66   8.66
SRSum                    36.04   8.44    38.93   10.29   39.29   10.70
DeepTD                   28.74   5.95    31.63   7.09    33.57   7.96
REAPER                   32.43   6.84    35.03   8.11    37.22   8.64
RELIS                    34.73   8.66    37.11   9.12    39.34   10.73
In Table 3, we compare RELIS with non-RL-based and RL-based summarisation systems. For the non-RL-based systems, we report ICSI [Gillick and Favre2009], which maximises the bigram overlap of summary and input using integer linear programming; PriorSum [Cao et al.2015], which learns sentence quality with CNNs; TCSum [Cao et al.2017], which employs text classification of the input documents; the variant of TCSum without the text-classification pre-training (TCSum w/o pretraining); and SRSum [Ren et al.2018], which learns sentence relations with both word- and sentence-level attentive neural networks to estimate salience.

For the RL-based systems, we reimplement REAPER [Rioux et al.2014] as an input-specific RL system, and DeepTD as a cross-input RL system. DeepTD is adapted from the DQN-based RL summariser [Yao et al.2018] and is trained by taking the ground-truth ROUGE-based scores from §4 as rewards. It uses InferSent to represent summaries. To improve the efficiency and performance of DeepTD, we use memory replay and periodic update. Further details and the algorithm of DeepTD are in the supplementary material.
RELIS significantly outperforms the other RL-based systems. Note that RELIS and REAPER use the identical RL algorithm for input-specific policy learning; hence the improvement of RELIS is due to the higher quality of the L2R-learnt reward $\hat{\sigma}$. RELIS outperforms DeepTD because training cross-input policies requires much more data than is available in the DUC datasets. At the same time, RELIS performs on par with the neural-based TCSum and SRSum, while requiring significantly less data and time to train, as shown next.
We run RELIS, SRSum, DeepTD and REAPER on the same workstation with a 4-core CPU, 8 GB memory and no GPUs. Table 4 compares their average training and test time per document cluster. RELIS reduces the training time of SRSum by two orders of magnitude. At test time, RELIS takes reasonable time to train the input-specific policy, and we believe that this can be further reduced by using more efficient RL algorithms and by employing techniques like memory replay or reward shaping.
Unlike TCSum, RELIS requires no additional training data: TCSum uses 1.8 million news articles from the New York Times and their category annotations (e.g. Health, Politics, Business) for training. It is worth noting that without such massive extra data for the text-classification step, the performance of TCSum drops significantly (see TCSum w/o pretraining in Table 3). The training time of TCSum is unlikely to be shorter than that of RELIS, since TCSum requires training a CNN-based text classifier before training a CNN-based sentence selector.

Table 4: Average training and test time per document cluster.

          SRSum   DeepTD   REAPER   RELIS
Training  810 s   1,560 s  N/A      2 s
Test      7 s     4 s      31 s     34 s
To summarise, RELIS significantly outperforms other RL-based models, and it yields competitive performance compared with state-of-the-art neural summarisers, with the benefit of needing much less training time and data.
6 Discussion & Related Work
RELIS is proposed as an RL-based summarisation paradigm, but it can be applied to other NLG tasks where RL is widely used, e.g., translation, sentence simplification and dialogue generation. For example, to apply RELIS to translation, we simply let $\mathcal{X}$ (see §2) be the set of texts in the source language and let $\mathcal{Y}_x$ be the set of all possible translations in the target language for input $x$; the error bound of RELIS (Theorem 1) still holds, as it is indifferent to the contents of $\mathcal{X}$ and $\mathcal{Y}_x$. Hence, in this section, we discuss the related work in the context of NLG in general.
Reward learning has recently received increasing interest from the machine learning community [Ibarz et al.2018, Zheng et al.2018], but it has been largely overlooked in NLG until now. Unlike classic RL applications, e.g. robotics and games, where rewards are provided or easy to design, NLG tasks lack strong metrics to measure the quality of the output. Well-known rewards, e.g. ROUGE, are criticised for their low correlation with human evaluations [Chaganty et al.2018]. Recent work suggests that improving the reward function boosts the performance of RL-based NLG [Kryscinski et al.2018].

Besides using metrics such as BLEU and ROUGE as rewards to train RL-based NLG, novel rewards have been designed. Kryscinski et al. [2018] propose a new reward function that encourages RL-based summarisers to prefer novel words in abstractive summaries. For sentence simplification, Zhang and Lapata [2017] propose a reward function measuring the simplicity, relevance and grammatical correctness of candidate outputs. For machine translation, Nguyen et al. [2017] propose a simulated user, which simulates human ratings of translations, and they use the simulated user to provide rewards for an RL-based translator. However, all the rewards discussed above require reference outputs, unlike our learnt reward function, which can provide rewards at test time when no reference outputs are available.
Wu and Hu [2018] and Bosselut et al. [2018] recently propose to use a large volume of unlabelled data to learn a scorer measuring the discourse coherence of sentences, and use the scorer as a reward to train cross-input RL. We go beyond their work by proving the soundness of combining reward learning and RL, and by training input-specific instead of cross-input policies so as to reduce training time and data.
RL-based interactive NLG methods elicit human feedback as rewards. Kreutzer et al. [2018] elicit pointwise and pairwise feedback on candidate translations to train a cross-input policy. Gao et al. [2018] propose using active learning to select appropriate candidate summary pairs and acquire human preferences to improve a heuristics-based reward. However, these methods require much feedback to achieve satisfactory results, e.g., at least 50 summary pairs to yield significant improvements [Gao et al.2018]. RELIS can be used as a pre-training stage for interactive methods: RELIS first learns a high-quality cross-input reward function, and interactive NLG techniques then elicit a small amount of feedback to fine-tune a user- and input-specific reward function, so as to generate higher-quality and personalised results.

7 Conclusion
We propose a novel RL paradigm called RELIS, which learns a reward function from a reward oracle with learning-to-rank (L2R) algorithms at training time, and then uses the learnt reward to train input-specific RL policies at test time. Compared with the widely employed cross-input RL-based summarisation approaches, RELIS avoids the expensive learning of cross-input policies and, instead, efficiently performs L2R and input-specific RL learning. Moreover, RELIS avoids the arduous reward design required by input-specific RL-based summarisation approaches. We prove that, with proper L2R and RL algorithms, RELIS is guaranteed to produce near-optimal outputs. Our experiments show that even with linear L2R and standard RL algorithms, RELIS yields performance on par with the state of the art while requiring only a small fraction of the data and time to train. Our work lays the theoretical foundation for reward learning in NLG, and we hope it will encourage further research in this direction.
Acknowledgements
This work has been supported by the German Research Foundation (DFG), as part of the QA-EduInf project (GU 798/18-1 and RI 803/12-1) and through the German-Israeli Project Cooperation (DIP, DA 1600/1-1 and GU 798/17-1).
References
 [Abbeel and Ng2004] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
 [Agarwal and Collins2010] Shivani Agarwal and Michael Collins. Maximum margin ranking algorithms for information retrieval. In ECIR, pages 332–343, 2010.
 [Bosselut et al.2018] Antoine Bosselut, Asli Çelikyilmaz, Xiaodong He, Jianfeng Gao, Po-Sen Huang, and Yejin Choi. Discourse-aware neural rewards for coherent text generation. In NAACL-HLT, pages 173–184, 2018.
 [Cao et al.2015] Ziqiang Cao, Furu Wei, Sujian Li, Wenjie Li, Ming Zhou, and Houfeng Wang. Learning summary prior representation for extractive summarization. In ACL, pages 829–833, 2015.
 [Cao et al.2017] Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. Improving multidocument summarization via text classification. In AAAI, pages 3053–3059, 2017.
 [Chaganty et al.2018] Arun Tejasvi Chaganty, Stephen Mussmann, and Percy Liang. The price of debiasing automatic metrics in natural language evaluation. In ACL, pages 643–653, 2018.
 [Cheng et al.2018] Ching-An Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning through imitation and reinforcement. In UAI, pages 845–855, 2018.
 [Fazel et al.2018] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In ICML, pages 1466–1475, 2018.
 [Gao et al.2018] Yang Gao, Christian M. Meyer, and Iryna Gurevych. APRIL: Interactively learning to summarise by combining active preference learning and reinforcement learning. In EMNLP, pages 4120–4130, 2018.
 [Gillick and Favre2009] Dan Gillick and Benoit Favre. A scalable global model for summarization. In ILP, pages 10–18. Association for Computational Linguistics, 2009.
 [Ibarz et al.2018] Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in Atari. In NeurIPS, pages 8022–8034, 2018.
 [Kim2014] Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, pages 1746–1751, 2014.
 [Kreutzer et al.2018] Julia Kreutzer, Joshua Uyheng, and Stefan Riezler. Reliability and learnability of human bandit feedback for sequencetosequence reinforcement learning. In ACL, pages 1777–1788, 2018.
 [Kryscinski et al.2018] Wojciech Kryscinski, Romain Paulus, Caiming Xiong, and Richard Socher. Improving abstraction in text summarization. In EMNLP, pages 1808–1817, 2018.
 [Li2011] Hang Li. A short introduction to learning to rank. IEICE Transactions, 94D(10):1854–1862, 2011.
 [Lin2004] ChinYew Lin. ROUGE: A package for automatic evaluation of summaries. In ACL Workshop “Text Summarization Branches Out”, pages 74–81, 2004.
 [Maystre and Grossglauser2017] Lucas Maystre and Matthias Grossglauser. Just sort it! A simple and effective approach to active preference learning. In ICML, pages 2344–2353, 2017.
 [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [Narayan et al.2018] Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Ranking sentences for extractive summarization with reinforcement learning. In NAACL-HLT, pages 1747–1759, 2018.

 [Nguyen et al.2017] Khanh Nguyen, Hal Daumé III, and Jordan L. Boyd-Graber. Reinforcement learning for bandit neural machine translation with simulated human feedback. In EMNLP, pages 1465–1475, 2017.
 [Peyrard and Eckle-Kohler2017] Maxime Peyrard and Judith Eckle-Kohler. A principled framework for evaluating summarizers: Comparing models of summary quality against human judgments. In ACL, pages 26–31, 2017.
 [Peyrard and Gurevych2018] Maxime Peyrard and Iryna Gurevych. Objective function learning to match human judgements for optimizationbased summarization. In NAACLHLT, pages 654–660, 2018.

 [Ranzato et al.2016] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
 [Ren et al.2018] Pengjie Ren, Zhumin Chen, Zhaochun Ren, Furu Wei, Liqiang Nie, Jun Ma, and Maarten de Rijke. Sentence relations for extractive summarization with deep neural networks. ACM Trans. Inf. Syst., 36(4):39:1–39:32, 2018.
 [Rioux et al.2014] Cody Rioux, Sadid A. Hasan, and Yllias Chali. Fear the REAPER: A system for automatic multidocument summarization with reinforcement learning. In EMNLP, pages 681–690, 2014.
 [Ryang and Abekawa2012] Seonggi Ryang and Takeshi Abekawa. Framework of automatic text summarization using reinforcement learning. In EMNLP/CoNLL, pages 256–265, 2012.
 [Sutton et al.1999] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057–1063, 1999.
 [Sutton et al.2009] Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradientdescent methods for temporaldifference learning with linear function approximation. In ICML, pages 993–1000, 2009.
 [Williams1992] Ronald J. Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
 [Wu and Hu2018] Yuxiang Wu and Baotian Hu. Learning to extract coherent summary via deep reinforcement learning. In AAAI, pages 5602–5609, 2018.
 [Yao et al.2018] Kaichun Yao, Libo Zhang, Tiejian Luo, and Yanjun Wu. Deep reinforcement learning for extractive document summarization. Neurocomputing, 284:52–62, 2018.
 [Zhang and Lapata2017] Xingxing Zhang and Mirella Lapata. Sentence simplification with deep reinforcement learning. In EMNLP, pages 584–594, 2017.
 [Zheng et al.2018] Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. In NeurIPS, pages 4649–4659, 2018.