1 Introduction
Interactive Natural Language Processing (NLP) approaches that put the human in the loop have gained increasing research interest recently (Amershi et al., 2014; Gurevych et al., 2018; Kreutzer et al., 2018a). The user–system interaction enables personalised and user-adapted results by incrementally refining the underlying model based on a user’s behaviour and by optimising the learning through actively querying for feedback and judgements. Interactive methods can start with no or only little input data and adjust their output to the needs of human users.
Previous research has explored eliciting different forms of feedback from users in interactive NLP, for example mouse clicks for information retrieval (Borisov et al., 2018), post-edits and ratings for machine translation (Denkowski et al., 2014; Kreutzer et al., 2018a), error markings for semantic parsing (Lawrence and Riezler, 2018), bigrams for summarisation (P.V.S. and Meyer, 2017), and preferences for translation (Kreutzer et al., 2018b). Controlled experiments suggest that asking for preferences places a lower cognitive burden on the human subjects than asking for absolute ratings or categorised labels (Thurstone, 1927; Kendall, 1948; Kingsley and Brown, 2010). But it remains unclear whether people can easily provide reliable preferences over summaries. In addition, preference-based interactive NLP faces the high sample complexity problem: a preference is a binary decision and hence contains only a single bit of information, so NLP systems usually need to elicit a large number of preferences from their users to improve their performance. For example, the machine translation system by Sokolov et al. (2016a) needs to collect hundreds of thousands of preferences from a simulated user before it converges.
Collecting such large amounts of user input and using it to train a “one-fits-all” model might be feasible for tasks such as machine translation, because the learnt model can generalise to many unseen texts. However, for highly subjective tasks, such as document summarisation, this procedure is not effective, since the notion of importance is specific to a certain topic or user. For example, the information that Lee Harvey Oswald shot president Kennedy might be important when summarising the assassination, but less important for a summary on Kennedy’s childhood. Likewise, a user who is not familiar with the assassination might consider the information more important than a user who has been analysing the political background for many years. Therefore, we aim at an interactive system that adapts a model for a given topic and user context based on user feedback – instead of training a single model across all users and topics, which hardly fits anyone’s needs perfectly. In this scenario, it is essential to overcome the high sample complexity problem and learn to adapt the model using a minimum of user interaction.
In this article, we propose the Active Preference-based ReInforcement Learning (APRIL) framework.¹ Our core research idea is to split the preference-based interactive learning process into two stages. First, we estimate the user’s ranking over candidate summaries using active preference learning (APL) in an interaction loop. Second, we use the learnt ranking to guide a neural reinforcement learning (RL) agent to search for the (near-)optimal summary. The use of APL allows us to maximise the information gain from a small number of preferences, helping to reduce the sample complexity. Fig. 1 shows this general idea in comparison to the state-of-the-art preference-based interactive NLP paradigm, Structured Prediction from Partial Information (SPPI) (Sokolov et al., 2016b; Kreutzer et al., 2017). In §3, we discuss the technical background of RL, preference learning and SPPI, before we introduce our solution APRIL in §4.

¹ We first introduced APRIL in Gao et al. (2018). Towards the end of §1, we discuss how this article substantially extends our previous work.

We apply APRIL to the Extractive Multi-Document Summarisation (EMDS) task. Given a cluster of documents on the same topic, an EMDS system needs to extract important sentences from the input documents to generate a summary that complies with a given length requirement and fits the needs of the user and her/his task. For the first time, we provide evidence for the efficacy of preference-based interaction in EMDS based on a user study, in which we measure the usability and the noise of preference feedback, yielding a mathematical model we can use for simulation and for analysing our results (§5). To evaluate APRIL, we then perform experiments on standard EMDS benchmark datasets. We compare the effectiveness of multiple APL and RL algorithms and select the best algorithms for our full system. We compare APRIL to SPPI and non-interactive methods, in both simulation (§6) and real-user experiments (§7). Our results suggest that with only ten rounds of user interaction, APRIL produces summaries better than those produced by both non-interactive methods and SPPI.
This work extends our earlier work (Gao et al., 2018) in three aspects. (i) We present a new user study on the reliability and usability of preference-based interaction (§5). Based on this study, we propose a realistic simulated user, which is used in our experiments. (ii) We evaluate multiple new APL strategies and a novel neural RL algorithm, and compare them with the counterpart methods used in Gao et al. (2018). The use of these new algorithms further boosts the efficiency and performance of APRIL (§6). (iii) We conduct additional user studies to compare APRIL with both non-interactive baselines and SPPI under more realistic settings (§7). APRIL can be applied to a wide range of other NLP tasks, including machine translation, semantic parsing and information exploration. All source code and experimental setups can be found at https://github.com/UKPLab/irjneuralapril.
2 Related Work
SPPI.
The method most similar to ours is SPPI (Sokolov et al., 2016b; Kreutzer et al., 2017). The core of SPPI is a policy-gradient RL algorithm, which receives rewards derived from the preference-based feedback. It maintains a policy that approximates the utility of each candidate output and selects higher-utility candidates with higher probability. As discussed in §1, SPPI suffers heavily from the high sample complexity problem. We will present the technical details of SPPI in §3.3 and compare it to APRIL in §6 and §7.

Preferences.
The use of preference-based feedback in NLP attracts increasing research interest. Zopf (2018) learns a sentence ranker from human preferences on sentence pairs, which can be used to evaluate the quality of summaries by counting how many high-ranked sentences are included in a summary. Simpson and Gurevych (2018) develop an improved Gaussian process preference learning (Chu and Ghahramani, 2005) algorithm to learn an argument convincingness ranker from noisy preferences. Unlike these methods, which focus on learning a ranker from preferences, we focus on using preferences to generate better summaries. Kreutzer et al. (2018b) ask real users to provide cardinal (5-point ratings) and ordinal (pairwise preferences) feedback over translations, and use the collected data to train an off-policy RL system to improve the translation quality. Their study suggests that the inter-rater agreement for the cardinal and ordinal feedback is similar. However, they do not measure or consider the influence of the questions’ difficulty on the agreement, which we find significant for EMDS (see §5). In addition, their system is not interactive, but uses log data instead of actively querying users.
Interactive Summarisation.
The iNeATS (Leuski et al., 2003) and IDS (Jone et al., 2002) systems allow users to tune several parameters (e.g., size, redundancy, focus) to customise the produced summaries. Further work presents automatically derived summary templates (Orǎsan et al., 2003; Orǎsan and Hasler, 2006) or hierarchically ordered summaries (Christensen et al., 2014; Shapira et al., 2017) that allow users to drill down from a general overview to detailed information. However, these systems do not employ the users’ feedback to update their internal summarisation models. P.V.S. and Meyer (2017) propose an interactive EMDS system that asks users to label important bigrams within candidate summaries. Given the important bigrams, they use integer linear programming to optimise important bigram coverage in the summary. In simulation experiments, their system can achieve near-optimal performance in ten rounds of interaction, collecting up to 350 important bigrams. However, labelling important bigrams places a large burden on the users, as they have to read through many potentially unimportant bigrams (see §5). Also, they assume that the users’ feedback is always perfect.

Reinforcement Learning.
RL has been applied to both extractive and abstractive summarisation in recent years (Ryang and Abekawa, 2012; Rioux et al., 2014; Gkatzia et al., 2014; Henß et al., 2015; Paulus et al., 2017; Pasunuru and Bansal, 2018; Kryscinski et al., 2018). Most existing RL-based document summarisation systems use as rewards either heuristic functions (e.g., Ryang and Abekawa, 2012; Rioux et al., 2014), which do not rely on reference summaries, or ROUGE scores, which require reference summaries (Paulus et al., 2017; Pasunuru and Bansal, 2018; Kryscinski et al., 2018). However, neither ROUGE nor the heuristics-based rewards can precisely reflect real users’ requirements on summaries (Chaganty et al., 2018); hence, using these imprecise rewards can severely mislead the RL-based summariser. The quality of the rewards has been recognised as the bottleneck for RL-based summarisation systems (Kryscinski et al., 2018). Our work learns how to give good rewards from users’ preferences. In this work, we assume that our system has no access to the reference summaries, but can query a user for preferences over summary pairs.

Some RL work directly uses the users’ ratings as rewards. Nguyen et al. (2017) employ user ratings on translations as rewards when training an RL-based encoder-decoder translator. However, eliciting ratings on summaries is very expensive, as users have high variance in their ratings of the same summary (Chaganty et al., 2018), which is why we consider preference-based feedback and a learnt reward surrogate.

Preference-based RL (PbRL)
is a recently proposed paradigm at the intersection of preference learning, RL, active learning (AL) and inverse RL (Wirth et al., 2017). Unlike apprenticeship learning (Dethlefs and Cuayáhuitl, 2011), which requires the user to demonstrate (near-)optimal sequences of actions (called action trajectories), PbRL only asks for the user’s preferences (either partial or total order) on several action trajectories. Wirth et al. (2016) apply PbRL to several simulated robotics tasks. They show that their method can achieve near-optimal performance by interacting with a simulated perfect user for 15–40 rounds. Christiano et al. (2017) use PbRL in training simulated robotics tasks, Atari-playing agents and a simulated backflipping agent by collecting feedback from both simulated oracles and real crowdsourcing workers. They find that human feedback can be noisy and partial (i.e., capturing only a fraction of the true reward), but that it is much easier for people to provide consistent comparisons than consistent absolute scores in their robotics use case. In §5, we evaluate this for document summarisation.
However, the approach by Christiano et al. (2017) fails to obtain satisfactory results in some robotics tasks even after 5,000 interaction rounds. In a follow-up work, Ibarz et al. (2018) elicit demonstrations from experts, use the demonstrations to pre-train a model with imitation learning techniques, and successfully fine-tune the pre-trained model with PbRL. In EMDS, extractive reference summaries might be viewed as demonstrations, but they are expensive to collect and not available in popular summarisation corpora (e.g., the DUC datasets). APRIL does not require demonstrations, but learns a reward function based on user preferences on entire summaries, which is then used to train an RL policy.

3 Background
In this section, we recap the necessary details of RL (§3.1), preference learning (§3.2) and SPPI (§3.3). We adapt them to the EMDS use case, so as to lay the foundation for APRIL. To ease the reading, we summarise the notation used in the remainder of this article in Table 1.
Table 1: Notation used in the remainder of this article.

$x$ : a document cluster from the set $\mathbb{X}$ of all possible inputs
$y$ : a summary from the set $\mathcal{Y}_x$ of all legal summaries for $x$
$\langle S, A, T, R, S^* \rangle$ : the MDP of the EMDS task for $x$: states $S$, actions $A$, transition function $T$, reward function $R$ and terminal states $S^*$
$R(y)$ : the reward of summary $y$ in $x$
$\pi(y|x)$ : policy in RL: the probability of selecting summary $y$ in $x$
$p_w(y_1, y_2 | x)$ : policy in SPPI, parameterised by $w$: the probability of presenting the pair $(y_1, y_2)$ to the oracle (Eq. (6))
$U^*(y; x)$ : the ground-truth utility function on $\mathcal{Y}_x$
$\hat{U}(y; w)$ : the approximation of $U^*$
$\Delta U(y_1, y_2) = U(y_1) - U(y_2)$ : the utility gap, where $U$ is a utility function on $\mathcal{Y}_x$
$r^*(y; x)$ : the ranking function on $\mathcal{Y}_x$ induced by $U^*$ (Eq. (2))
$\hat{r}(y; x)$ : the approximation of $r^*$, induced by $\hat{U}$
$s(y_1, y_2; x)$ : the preference direction function, which returns 1 if the oracle/user prefers $y_1$ over $y_2$ for $x$
$J_{RL}(\pi)$ : the objective function in RL (Eq. (1))
$J_{PL}(w)$ : the objective function in preference learning (Eq. (3))
$J_{SPPI}(w)$ : the objective function in SPPI (Eq. (5))
3.1 Reinforcement Learning
RL refers to a family of algorithms for efficiently searching for optimal solutions in Markov Decision Processes (MDPs). MDPs are widely used to formulate sequential decision-making problems. Let $\mathbb{X}$ be the input space and let $\mathcal{Y}_x$ be the set of all possible outputs for input $x \in \mathbb{X}$. An episodic MDP is a tuple $\langle S, A, T, R, S^* \rangle$ for input $x$, where $S$ is the set of states, $A$ is the set of actions and $T$ is the transition function, with $T(s, a)$ giving the next state after performing action $a$ in state $s$. $R$ is the reward function, with $R(s, a)$ giving the immediate reward for performing action $a$ in state $s$. $S^* \subseteq S$ is the set of terminal states; visiting a terminal state terminates the current episode.

EMDS can be formulated as an episodic MDP, as the summariser has to sequentially select sentences from the original documents to add to the draft summary. Our MDP formulation of EMDS matches previous approaches by Ryang and Abekawa (2012) and Rioux et al. (2014): $x$ is a cluster of documents and $\mathcal{Y}_x$ is the set of all legal summaries for cluster $x$ (i.e., all permutations of sentences in $x$ that fulfil the given summary length constraint). In the MDP for document cluster $x$, the state set $S$ includes all possible draft summaries of any length. The action set $A$ includes two types of actions: concatenate a sentence in $x$ to the current draft summary, or terminate the draft summary construction. The transition function $T$ is trivial in EMDS, because given the current draft summary and an action, the next state can easily be identified as the draft summary plus the selected sentence or as a terminating state. The reward function $R$ returns an evaluation score of the summary once the action terminate is performed; otherwise it returns 0, because the summary is still under construction and thus not ready to be evaluated (so-called delayed rewards). Providing non-zero rewards before the action terminate can lead to even worse results, as reported by Rioux et al. (2014). The terminal state set $S^*$ includes all states corresponding to summaries exceeding the given length requirement and an absorbing state. By performing the action terminate, the agent transitions to the absorbing state regardless of its current state.
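To make this formulation concrete, the following minimal Python sketch implements the episodic EMDS environment as described above (draft-summary states, sentence-selection and terminate actions, delayed reward). All names and the reward_fn hook are illustrative assumptions, not the authors’ released code.

```python
# Minimal sketch of the episodic EMDS MDP described above.
# All names (EMDSEnvironment, reward_fn, ...) are illustrative, not the authors' code.
from typing import Callable, List, Tuple

class EMDSEnvironment:
    TERMINATE = -1  # special action index for "terminate"

    def __init__(self, sentences: List[str], length_limit: int,
                 reward_fn: Callable[[List[str]], float]):
        self.sentences = sentences          # all sentences in the document cluster x
        self.length_limit = length_limit    # summary length constraint (in words)
        self.reward_fn = reward_fn          # evaluates a finished summary, e.g. Eq. (8)
        self.draft: List[int] = []          # current state: indices of selected sentences
        self.done = False

    def actions(self) -> List[int]:
        # one action per sentence not yet in the draft, plus "terminate"
        remaining = [i for i in range(len(self.sentences)) if i not in self.draft]
        return remaining + [self.TERMINATE]

    def step(self, action: int) -> Tuple[List[int], float, bool]:
        if action == self.TERMINATE:
            self.done = True
            summary = [self.sentences[i] for i in self.draft]
            return self.draft, self.reward_fn(summary), True   # delayed reward
        self.draft = self.draft + [action]
        words = sum(len(self.sentences[i].split()) for i in self.draft)
        if words > self.length_limit:       # over-length drafts are terminal states
            self.done = True
            return self.draft, 0.0, True
        return self.draft, 0.0, False        # intermediate steps receive zero reward
```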
A policy in an MDP defines how actions are selected: $\pi(a|s)$ is the probability of selecting action $a$ in state $s$. Note that in many sequential decision-making tasks, $\pi$ is learnt across all inputs in $\mathbb{X}$. However, for our EMDS use case, we learn an input-specific policy for a given $x$ in order to reflect the subjectivity of the summarisation task introduced in §1. We let $\mathcal{Y}_x^{\pi} \subseteq \mathcal{Y}_x$ be the set of all possible summaries a policy $\pi$ can construct in document cluster $x$, and $\pi(y|x)$ denotes the probability of policy $\pi$ generating a summary $y$ in $x$. Likewise, $R(y)$ denotes the accumulated reward received by building summary $y$. Finally, the expected reward of performing $\pi$ is:

$J_{RL}(\pi) = \mathbb{E}_{y \sim \pi}[R(y)] = \sum_{y \in \mathcal{Y}_x^{\pi}} \pi(y|x)\, R(y).$  (1)

The goal of an MDP is to find the optimal policy $\pi^*$ that has the highest expected reward: $\pi^* = \arg\max_{\pi} J_{RL}(\pi)$.
3.2 Preference Learning
For a document cluster $x$ and its set of legal summaries $\mathcal{Y}_x$, we let $U^*(\cdot\,; x)$ be the ground-truth utility function measuring the quality of summaries in $\mathcal{Y}_x$. We additionally assume that no two items in $\mathcal{Y}_x$ have the same $U^*$ value. Let $r^*$ be the ascending ranking induced by $U^*$: for $y \in \mathcal{Y}_x$,

$r^*(y; x) = \sum_{y' \in \mathcal{Y}_x} \mathbb{1}[U^*(y'; x) < U^*(y; x)],$  (2)

where $\mathbb{1}[\cdot]$ is the indicator function. In other words, $r^*(y; x)$ gives the rank of $y$ among all elements in $\mathcal{Y}_x$ with respect to $U^*$. The goal of preference learning is to approximate $r^*$ from pairwise preferences on some elements in $\mathcal{Y}_x$. The preferences are provided by an oracle.
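As a small illustration of Eq. (2), the rank of a summary can be computed by counting the candidates with strictly lower utility; the sketch below assumes the utilities are given as a plain list.

```python
# Illustrative sketch of the ascending ranking induced by a utility function (Eq. (2)):
# the rank of a summary is the number of candidates with strictly lower utility.
def rank(index: int, utilities: list) -> int:
    return sum(1 for u in utilities if u < utilities[index])

# e.g. rank(0, [0.3, 0.1, 0.7]) == 1: exactly one candidate is worse than summary 0
```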
The Bradley-Terry (BT) model (Bradley and Terry, 1952) is a widely used preference learning model, which approximates the ranking $r^*$ by approximating the utility function $U^*$: Suppose we have observed $N$ preferences $\{(y_1^i, y_2^i, p^i)\}_{i=1}^{N}$, where $y_1^i$ and $y_2^i$ are the summaries presented to the oracle in the $i$-th round, and $p^i$ indicates the preference direction of the oracle: $p^i = 1$ if the oracle prefers $y_1^i$ over $y_2^i$, and $p^i = 0$ otherwise. The objective in BT is to maximise the following likelihood function:

$J_{PL}(w) = \sum_{i=1}^{N} \big[\, p^i \log \hat{P}_w(y_1^i \succ y_2^i) + (1 - p^i) \log \hat{P}_w(y_2^i \succ y_1^i) \,\big],$  (3)

where

$\hat{P}_w(y_1 \succ y_2) = \dfrac{\exp \hat{U}(y_1; w)}{\exp \hat{U}(y_1; w) + \exp \hat{U}(y_2; w)}.$  (4)

Here, $\hat{U}(\cdot\,; w)$ is the approximation of $U^*$, parameterised by $w$, which can be learnt by any function approximation technique, e.g. neural networks or linear models. By maximising Eq. (3), the resulting $\hat{w}$ will be used to obtain $\hat{U}$, which in turn can be used to induce the approximated ranking function $\hat{r}$.

3.3 The SPPI Framework
SPPI can be viewed as a combination of RL and preference learning. For an input $x$, the objective of SPPI is to maximise

$J_{SPPI}(w) = \sum_{y_1, y_2 \in \mathcal{Y}_x} p_w(y_1, y_2 | x)\, s(y_1, y_2; x),$  (5)

where $s$ is the same preference direction function as in preference learning (§3.2), and $p_w$ is a policy that decides the probability of presenting a pair of summaries $(y_1, y_2)$ to the oracle:

$p_w(y_1, y_2 | x) = \dfrac{\exp[\hat{U}(y_1; w) - \hat{U}(y_2; w)]}{\sum_{y_1', y_2' \in \mathcal{Y}_x} \exp[\hat{U}(y_1'; w) - \hat{U}(y_2'; w)]}.$  (6)

In line with preference learning, $\hat{U}$ is the utility function for estimating the quality of summaries, parameterised by $w$. The policy samples pairs with larger utility gaps with higher probability; as such, both “good” and “bad” summaries have a chance to be presented to the oracle, which encourages the exploration of the summary space. To maximise Eq. (5), SPPI uses gradient ascent to update $w$ incrementally. Algorithm 1 presents the pseudo code of our adaptation of SPPI to EMDS.
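The pair-selection policy of Eq. (6) can be sketched as follows, with a linear utility model and a Gibbs distribution over ordered pairs; this follows our reading of the description above and is not the reference SPPI implementation.

```python
# Illustrative sketch of the SPPI pair-selection policy (Eq. (6)):
# pairs with larger estimated utility gaps are sampled with higher probability.
# The linear utility model and all names are our own simplifications.
import itertools
import numpy as np

def sample_pair(features: np.ndarray, w: np.ndarray, rng: np.random.Generator):
    """features: (n_summaries, d) matrix of summary representations phi(y; x)."""
    utilities = features @ w                                  # estimated utilities U_hat(y; w)
    pairs = list(itertools.permutations(range(len(utilities)), 2))
    gaps = np.array([utilities[i] - utilities[j] for i, j in pairs])
    probs = np.exp(gaps - gaps.max())                         # Gibbs distribution over pairs
    probs /= probs.sum()
    idx = rng.choice(len(pairs), p=probs)
    return pairs[idx]                                         # indices of the pair to show the user

# usage: i, j = sample_pair(phi, w, np.random.default_rng(0))
```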
Note that the objective function in SPPI (Eq. (5)) and the expected reward function in RL (Eq. (1)) have a similar form: if we view the preference direction function $s$ in Eq. (5) as a reward function, we can consider SPPI as an RL problem. The major difference between SPPI and RL is that the policy in SPPI selects pairs of summaries (Eq. (6)), while the policy in RL selects single summaries (see §3.1). For APRIL, we will exploit this connection to propose our new objective function and learning paradigm.
4 The APRIL Framework
SPPI suffers from the high sample complexity problem, which we attribute to two major reasons: First, the policy in SPPI (Eq. (6)) is good at distinguishing the “good” summaries from the “bad” ones, but poor at selecting the “best” summaries from “good” summaries, because it only queries the summaries with large quality gaps. Second, SPPI makes inefficient use of the collected preferences: After each round of interaction, SPPI performs one step of the policy gradient update, but does not generalise or reuse the collected preferences. This potentially wastes expensive user information. To alleviate these two problems, we exploit the connection between SPPI, RL and preference learning and propose the APRIL framework detailed in this section.
Recall that in EMDS, the goal is to find the optimal summary for a given document cluster $x$, namely the summary that is preferred over all other possible summaries in $\mathcal{Y}_x$ according to $U^*$. Based on this understanding and in line with the RL formulation of EMDS from §3.1, we define a new expected reward function for policy $\pi$ as follows:

$J_{APRIL}(\pi) = \sum_{y \in \mathcal{Y}_x} \pi(y|x) \sum_{y' \in \mathcal{Y}_x} s(y, y'; x).$  (7)

Note that $s(y, y'; x)$ equals 1 if $y$ is preferred over $y'$ and equals 0 otherwise (see §3.2). Thus, $\sum_{y'} s(y, y'; x)$ counts the number of summaries that are less preferred than summary $y$, and hence equals $r^*(y; x)$ (see Eq. (2)). A policy that maximises this new objective function will select the summaries with the highest rankings and hence outputs the optimal summary.

This new objective function decomposes the learning problem into two stages: (i) approximating the ranking function $r^*$, and (ii) based on the approximated ranking function, searching for the optimal policy that maximises the new objective function. These two stages can be solved by (active) preference learning and reinforcement learning, respectively, and they constitute our APRIL framework, illustrated in Figure 2.
4.1 Stage 1: Active Preference Learning
For an input document cluster $x$, the task in the first stage of APRIL is to obtain $\hat{r}$, the approximated ranking function on $\mathcal{Y}_x$, by collecting a small number of preferences from the oracle. It involves four major components: a Summary Database (DB) storing the summary candidates, an AL-based Querier that selects candidates from the Summary DB to present to the user, a Preference DB storing the preferences collected from the user, and a Preference Learner that learns from the preferences. The left cycle in Fig. 2 illustrates this stage, and Alg. 2 presents the corresponding pseudo code. Below, we detail these four components.
Summary DB.
Ideally, the Summary DB should include all legal extractive summaries for a document cluster $x$, namely $\mathcal{Y}_x$. Since this is impractical for large clusters, we either randomly sample a large set of summary candidates or use pre-trained summarisation models and heuristics to generate the Summary DB. Note that the Summary DB can be built offline, i.e. before the interaction with the user starts. This improves the real-time responsiveness of the system.
Preference DB.
The preference database stores all collected user preferences $\{(y_1^i, y_2^i, p^i)\}_{i=1}^{T}$, where $T$ is the query budget (i.e., how many times a user may be asked for a preference), $y_1^i$ and $y_2^i$ are the summaries presented to the user in the $i$-th round of interaction, and $p^i$ is the user’s preference (see §3.2).
Preference Learner.
We use the BT model introduced in §3.2 to learn from the preferences in the Preference DB. In order to increase the real-time responsiveness of our system, we use a linear model to approximate the utility function: $\hat{U}(y; w) = w^\top \phi(y; x)$, where $\phi(y; x)$ is a vectorised representation of summary $y$ for input cluster $x$. However, purely using $\hat{U}$ to approximate $U^*$ is sensitive to the noise in the preferences, especially when the number of collected preferences is small. To mitigate this, we approximate $U^*$ not only using $\hat{U}$ (the posterior), but also using some prior knowledge (the prior), for example the heuristics-based summary evaluation functions proposed by Ryang and Abekawa (2012) and Rioux et al. (2014). Note that these heuristics do not require reference summaries; see §2. Formally, we define the reward as

$R(y) = \lambda\, U_{prior}(y; x) + (1 - \lambda)\, \hat{U}(y; w),$  (8)

where $\lambda \in [0, 1]$ is a real-valued parameter trading off between the prior and the posterior.
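A minimal sketch of such a linear BT preference learner and the prior/posterior mixture of Eq. (8) is shown below; the plain gradient-ascent update, the learning rate and all names are our own illustrative choices.

```python
# Sketch of a linear Bradley-Terry preference learner with a prior reward (Eq. (8)).
# Hyper-parameters, names and the plain gradient-ascent update are illustrative only.
import numpy as np

class LinearBTLearner:
    def __init__(self, dim: int, lr: float = 0.01):
        self.w = np.zeros(dim)
        self.lr = lr

    def utility(self, phi: np.ndarray) -> float:
        return float(self.w @ phi)          # U_hat(y; w) = w . phi(y; x)

    def update(self, phi_pref: np.ndarray, phi_other: np.ndarray) -> None:
        # one gradient-ascent step on the BT log-likelihood (Eq. (3))
        # for a single preference "phi_pref is preferred over phi_other"
        p = 1.0 / (1.0 + np.exp(self.utility(phi_other) - self.utility(phi_pref)))
        self.w += self.lr * (1.0 - p) * (phi_pref - phi_other)

def reward(phi: np.ndarray, prior_fn, learner: LinearBTLearner, lam: float) -> float:
    # Eq. (8): trade off a heuristic prior against the preference-learnt posterior
    return lam * prior_fn(phi) + (1.0 - lam) * learner.utility(phi)
```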
AL-based Querier.
The active-learning-based querier receives the collected preferences and selects which candidate pair from the Summary DB to present to the user in each round of interaction. To reduce the reading burden on the oracle, inspired by the preference collection workflows in robot training (Wirth et al., 2016), we use the following setup to obtain summary pairs: In each interaction round, one summary of the pair is old (i.e. it has been presented to the user in the previous round) and the other one is new (i.e. it has not been read by the user before). As such, the user only needs to read $T+1$ summaries in $T$ rounds of interaction.
Any pool-based active learning strategy (Settles, 2010) can be used to implement the querier, e.g., uncertainty sampling (Lewis and Gale, 1994). We explore four computationally efficient active learning strategies:

Diversity-based heuristic: This strategy minimises the vector space similarity of the presented summaries. For a pair $(y_1, y_2)$, we define

$\mathrm{div}(y_1, y_2; x) = 1 - \cos(\phi(y_1; x), \phi(y_2; x)),$

where $\cos$ is the cosine similarity. This heuristic encourages querying dissimilar summaries, so as to encourage exploration and facilitate generalisation. In addition, dissimilar summaries are more likely to have large utility gaps and hence can be compared more accurately by the users (discussed later in §5).

Density-based heuristic: This strategy encourages querying summaries from “dense” areas in the vector space, so as to avoid querying outliers and to facilitate generalisation. Formally, for a summary $y$ in the Summary DB for cluster $x$, we define its density as its average similarity to the other candidates in the Summary DB.

Uncertainty-based heuristic: This strategy encourages querying the summaries whose approximated utility is most uncertain. In line with P.V.S. and Meyer (2017), we define the uncertainty as follows: For a summary $y$, we estimate the probability $p(y)$ of $y$ being the optimal summary from the current utility estimates, and let the uncertainty of $y$ be $1 - p(y)$ if $p(y) \ge 0.5$, and $p(y)$ otherwise.
To exploit the strengths of all these AL strategies, we normalise their output values to the same range and use their weighted sum to select the new summary to present to the user:

$y_{new} = \arg\max_{y \in \mathcal{D}} \sum_j \beta_j\, h_j(y, y_{old}; x),$  (9)

where $y_{old}$ is the old summary, i.e. the one from the previous interaction round, $\mathcal{D}$ is the Summary DB, $h_j$ are the normalised heuristics and $\beta_j$ denote their weights. To select the first summaries, for which no old summary exists yet, we drop the terms that depend on $y_{old}$.
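The querier can be sketched as follows. The concrete forms of the density and uncertainty scores and the example weights are assumptions for illustration; only the overall scheme (normalise the heuristics, take the weighted sum, never re-show the old summary) follows the description above.

```python
# Sketch of the AL-based querier: pick the new summary that maximises a weighted
# sum of normalised heuristic scores (cf. Eq. (9)).  The heuristic implementations
# and the default weights below are illustrative assumptions.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def normalise(scores: np.ndarray) -> np.ndarray:
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def select_new_summary(features, utilities, old_idx, weights=(0.5, 0.25, 0.25)):
    """features: (n, d) summary vectors; utilities: current U_hat estimates."""
    n = len(features)
    diversity = np.array([1.0 - cosine(features[i], features[old_idx]) for i in range(n)])
    density = np.array([np.mean([cosine(features[i], features[j]) for j in range(n)])
                        for i in range(n)])
    p = np.exp(utilities - utilities.max()); p /= p.sum()     # prob. of being the best summary
    uncertainty = np.where(p >= 0.5, 1.0 - p, p)
    score = (weights[0] * normalise(diversity)
             + weights[1] * normalise(density)
             + weights[2] * normalise(uncertainty))
    score[old_idx] = -np.inf                                   # never re-show the old summary
    return int(np.argmax(score))
```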
4.2 Stage 2: RL-based Summariser
Given the approximated ranking $\hat{r}$ learnt in the first stage, the target of the second stage in APRIL is to obtain the policy that maximises the expected reward under this approximated ranking (cf. Eq. (7)). We consider two RL algorithms to obtain this policy: the linear Temporal Difference (TD) algorithm, and a neural version of the TD algorithm.
TD (Sutton, 1984) has proven effective for solving the MDP in EMDS (Rioux et al., 2014; Ryang and Abekawa, 2012). The core of TD is to approximate the $V$ values: In EMDS, $V(s)$ estimates the “potential” of the (draft) summary $s$ for input cluster $x$ given policy $\pi$: the higher the value, the more likely $s$ is contained in the optimal summary for $x$. Given the $V$ values, a policy can be derived using the softmax strategy:

$\pi(a|s) = \dfrac{\exp V(T(s, a))}{\sum_{a'} \exp V(T(s, a'))},$  (10)

where $a'$ ranges over all available actions in state $s$. The intuition behind Eq. (10) is that the probability of performing action $a$ increases if the resulting state of $a$, namely $T(s, a)$, has a higher $V$ value. Note the similarity between the policy of TD (Eq. (10)) and the policy of SPPI (Eq. (6)): they both use a Gibbs distribution to assign probabilities to different actions, but the difference is that an action in SPPI is selecting a pair of summaries, while in TD an action is adding a sentence to the current draft summary or terminating (see §3.1).
Existing work uses linear functions to approximate the $V$ values (Rioux et al., 2014; Ryang and Abekawa, 2012). To approximate the $V$ values more precisely, we use a neural network and term the resulting algorithm Neural TD (NTD). Inspired by DQN (Mnih et al., 2015), we employ the memory replay and periodic update techniques to boost and stabilise the performance of NTD. We use NTD rather than DQN (Mnih et al., 2015) because in MDPs with discrete actions and continuous states, as in our EMDS formulation, Q-learning needs to maintain a $Q$ network for each action, which is very expensive when the size of the action set is large. Instead, the TD algorithms only have to maintain the $V$ network, whose size is independent of the number of actions. In EMDS, the size of the action set typically exceeds several hundred (see Table 2), because each sentence corresponds to one action.
Alg. 3 shows the pseudo code of NTD. We use the Summary DB as the memory replay; this helps us to reduce the sample generation time, which is critical in interactive systems. We select samples from the Summary DB using softmax sampling (line 3 in Alg. 3):

$p(y) = \dfrac{\exp V(y; \theta)}{\sum_{y'} \exp V(y'; \theta)},$  (11)

where $\theta$ stands for the parameters of the neural network. Given the selected summary $y$, we build a sequence of states (line 4), where the $i$-th state is the draft summary including the first $i$ sentences of $y$. Then, we update the TD error as in the standard TD algorithms (lines 5 to 9) and perform back-propagation with gradient descent (line 10). We perform the periodic update every fixed number of episodes (line 11), as in DQN, to stabilise the performance of NTD. After finishing all training, the obtained $V$ values can be used to derive the policy by Eq. (10).
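The core NTD loop, as we understand it from Alg. 3, can be sketched as follows: sample a summary from the Summary DB via softmax over the current value estimates (Eq. (11)), unroll it into a state sequence and regress each state’s value towards the standard TD target. The value and train_step callables stand in for the neural network and its gradient step; all names are placeholders.

```python
# Sketch of the NTD training loop (cf. Alg. 3).  `value` and `train_step` stand in
# for the neural V-network and its gradient update; names are illustrative.
import numpy as np

def softmax_sample(values: np.ndarray, rng: np.random.Generator) -> int:
    p = np.exp(values - values.max())
    return int(rng.choice(len(values), p=p / p.sum()))         # Eq. (11)

def draft_state(y, i):
    # state s_i: vector representation of the draft containing the first i sentences of y
    return np.sum(y[:i], axis=0) if i > 0 else np.zeros_like(y[0])

def run_ntd(summary_db, reward_fn, value, train_step, episodes=3000, gamma=1.0, seed=0):
    """summary_db: list of summaries, each a list of sentence feature vectors."""
    rng = np.random.default_rng(seed)
    for episode in range(episodes):
        v_db = np.array([value(draft_state(y, len(y))) for y in summary_db])
        y = summary_db[softmax_sample(v_db, rng)]               # memory replay from the Summary DB
        states = [draft_state(y, i) for i in range(len(y) + 1)]
        for i in range(len(states) - 1):
            target = gamma * value(states[i + 1])               # zero intermediate reward
            train_step(states[i], target)                       # regress V(s_i) towards the TD target
        train_step(states[-1], reward_fn(y))                    # delayed reward at termination
```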
5 Preference-based Interaction for Summarisation
To date, there is little knowledge about the usability and the reliability of user feedback in summarisation. This is a major limitation for designing interactive systems and for effectively experimenting with simulated users before an actual user study. In this section, we therefore study preferencebased feedback for our EMDS use case and derive a mathematical model to simulate real users’ preferencegiving behaviour.
Hypotheses.
Our study tests two hypotheses: (H1) We assume that users find it easier to provide preference feedback than providing other forms of feedback for summaries. In particular, we measure the user satisfaction and the time needed for preferencebased interaction and bigrambased interaction proposed by P.V.S. and Meyer (2017), which has also been used in interactive summarisation.
(H2) Previous research suggests that the more difficult the questions, the lower the correct rate of the answers or, in other words, the higher the noise in the answers (Huang et al., 2016; Donmez and Carbonell, 2008). In our preferencegiving scenario, we assume that the difficulty of comparing a pair of items can be measured by the utility gap between the presented items: the wider the utility gap, the easier it is for the user to identify the better item. We term this the widergaplessnoise hypothesis in this article.
The widergaplessnoise hypothesis is an essential motivation for the policy in SPPI (Eq. (6)) and the diversitybased active learning strategy in APRIL (see §4.1), but yet there is little empirical evidence for validating this hypothesis. Based on the findings in our user study, we provide evidence towards H1 and H2, and we propose a realistic user model, which we employ in our simulation experiments in §6.
Study setup.
We invite 12 users to participate in our user study. All users are native or fluent English speakers from our university. We ask each user to provide feedback for newswire summaries from two topics (d074b from DUC’02 and d32f from DUC’01) in the following way.
We first allow the users to familiarise themselves with the topic by means of two 200-word abstracts. This is necessary, since the events discussed in the news documents are several years old and may be unknown to our participants. Without such background information, it would not be possible for users to judge importance in the early stages of the study. We ask each user to provide preferences for ten summary pairs and to label all important bigrams in five additional summaries. For collecting preference-based feedback, we ask the participants to select the better summary (i.e. the one containing more important information) in each pair. For collecting bigram-based feedback, we adopt the setup of P.V.S. and Meyer (2017), who proposed a successful EMDS system using bigram-based interaction. At the end of the study, we ask the participants to rate the usability (i.e., user-friendliness) of preference- and bigram-based interaction on a 5-point Likert scale, where higher scores indicate higher usability.
To evaluate H2, we require summary pairs with different utility gaps. To this end, we measure the utility (see §3.2) of a summary $y$ for document cluster $x$ as

$U^*(y; x) = \dfrac{10}{3}\left( \dfrac{R_1(y, \mathcal{R}_x)}{0.47} + \dfrac{R_2(y, \mathcal{R}_x)}{0.22} + \dfrac{R_{SU4}(y, \mathcal{R}_x)}{0.18} \right),$  (12)

where $\mathcal{R}_x$ are the reference summaries for document cluster $x$ (provided in the DUC datasets), and $R_1$, $R_2$ and $R_{SU4}$ stand for the average ROUGE-1, ROUGE-2 and ROUGE-SU4 recall metrics (Lin, 2004), respectively. These ROUGE metrics are widely used to measure the quality of summaries. The denominator values 0.47, 0.22 and 0.18 are the upper-bound ROUGE scores reported by P.V.S. and Meyer (2017). They are used to balance the weights of the three ROUGE scores. As such, each ROUGE score is normalised to $[0, 1]$, and we further multiply the sum of the ROUGE scores by $\frac{10}{3}$ to normalise the $U^*$ values to $[0, 10]$, which facilitates our analyses afterwards.

For document cluster $x$, the utility gap of two summaries $y_1$ and $y_2$ is thus $|U^*(y_1; x) - U^*(y_2; x)|$. For the ten summary pairs in our user study, we select four pairs with the smallest predefined gap width, and three, two and one pair with increasingly wider gap widths, where each selected pair’s actual utility gap lies very close to its predefined gap width. Figure 3 shows two example summary pairs and their utility gaps. As for the five summaries for bigram-based feedback, we select summaries with high utility $U^*$, but ensure that they have low overlap in order to simulate the AL setup of P.V.S. and Meyer (2017).
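Eq. (12) can be computed as in the sketch below, assuming the three ROUGE recall scores have already been obtained from an external ROUGE implementation.

```python
# Sketch of the gold-standard utility in Eq. (12) and the utility gap used in the
# user study.  The ROUGE recall values are assumed to be pre-computed.
ROUGE_UPPER_BOUNDS = {"rouge1": 0.47, "rouge2": 0.22, "rougesu4": 0.18}

def utility(rouge1: float, rouge2: float, rougesu4: float) -> float:
    normalised = (rouge1 / ROUGE_UPPER_BOUNDS["rouge1"]
                  + rouge2 / ROUGE_UPPER_BOUNDS["rouge2"]
                  + rougesu4 / ROUGE_UPPER_BOUNDS["rougesu4"])
    return 10.0 / 3.0 * normalised          # each term lies in [0, 1]; sum rescaled to [0, 10]

def utility_gap(u1: float, u2: float) -> float:
    return abs(u1 - u2)
```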
Usability assessment.
To evaluate hypothesis H1, we measure the easiness of providing preferences for summaries with two metrics: the average interaction time a participant spends on providing a preference and the participant’s usability rating on the 5-point scale. We compare both metrics for preference-based interaction with bigram-based interaction.

Fig. 4 visualises the interaction time and the usability ratings for preference- and bigram-based interaction as notched boxplots (notches indicate the 95% confidence interval). Both plots confirm the clear difference between preference- and bigram-based feedback for summaries: We measure an average interaction time of 102 s for annotating bigrams in a single summary, which is over twice the time spent on providing a preference for a summary pair (43 s). The users identified 7.2 bigrams per summary, which took 14 s per bigram on average. As for the usability ratings, providing preferences is rated 3.8 on average (median at 4), while labelling bigrams is rated 2.4 on average (median at 2). These results suggest that humans can more easily provide preferences over summaries than point-based feedback in the form of bigrams.
Reliability assessment.
To evaluate hypothesis H2, we measure the reliability of the users’ preferences, i.e. the percentage of the pairs in which the user’s preference agrees with the preference induced by $U^*$. Figure 5 shows the reliability scores for the varying utility gaps employed in our study. The results clearly suggest that, for summary pairs with wider utility gaps, the participants can more easily identify the better summary in the pair, resulting in higher reliability. This observation validates the wider-gap-less-noise assumption.
Realistic user simulation.
We observe that the shape of the reliability curves in Figure 5 is similar to that of the logistic function: when the utility gap approaches 0, the reliability score approaches 0.5, and as the utility gap increases, the reliability asymptotically approaches 1. Hence, we adopt the logistic model proposed by Viappiani and Boutilier (2010) to estimate the real users’ preferences. We term this model the logistic noise oracle (LNO): For two summaries $y_1, y_2$, we assume the probability that a user prefers $y_1$ over $y_2$ is:

$P(y_1 \succ y_2) = \dfrac{1}{1 + \exp\!\left(-\dfrac{U^*(y_1; x) - U^*(y_2; x)}{m}\right)},$  (13)

where $m > 0$ is a real-valued parameter controlling the “flatness” of the curve: higher $m$ values yield a flatter curve, which in turn suggests that asking users to distinguish summaries of similar quality causes high noise.

We estimate $m$ based on the observations we made in the user study by maximising the likelihood function

$L(m) = \prod_{u} \prod_{i} P(y_1^{u,i} \succ y_2^{u,i})^{\,p^{u,i}} \left(1 - P(y_1^{u,i} \succ y_2^{u,i})\right)^{1 - p^{u,i}},$

where $u$ ranges over all users and $i$ ranges over the number of preferences provided by each user. $y_1^{u,i}$ and $y_2^{u,i}$ are the summaries presented to user $u$ in round $i$, and $p^{u,i}$ is the user’s preference direction: $p^{u,i}$ equals 1 if $y_1^{u,i}$ is preferred by the user over $y_2^{u,i}$, and equals 0 otherwise. Maximising this likelihood yields our estimate of $m$. The green curve in Figure 5 is the reliability curve for the LNO with the estimated $m$. We find that it fits the reliability curves of the real users well. As a concrete example, consider the summary pairs in Figure 3: for the pair with the small utility gap, the LNO prefers the better summary only with moderate probability, whereas for the pair with the large utility gap it prefers the better summary with probability close to 1. This is consistent with our observations that 7 out of 12 users preferred the better summary in the first pair, while all users preferred the better summary in the second pair.
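A minimal sketch of the logistic noise oracle follows; the flatness parameter m must be set to the value estimated from the user study, which we leave as an input here.

```python
# Sketch of the logistic noise oracle (LNO, Eq. (13)) used to simulate user preferences.
# The flatness parameter m is whatever value was estimated from the user study.
import math
import random

def lno_prefers_first(u1: float, u2: float, m: float, rng: random.Random) -> bool:
    """Return True if the simulated user prefers the summary with utility u1 over u2."""
    p_first = 1.0 / (1.0 + math.exp(-(u1 - u2) / m))   # wider utility gap -> less noise
    return rng.random() < p_first

# usage:
# rng = random.Random(0)
# preferred_first = lno_prefers_first(utility_1, utility_2, m=estimated_m, rng=rng)
```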
6 Simulation Experiments
In this section, we study APRIL in a simulation setup. We use the LNO-based user model with the $m$ value estimated in §5 to simulate user preferences. We separately study the first and the second stage of APRIL by comparing multiple active learning and RL techniques in each stage. Then, we combine the best-performing strategy from each stage to build the overall APRIL pipeline and compare our method with SPPI. We perform our experiments on three multi-document summarisation benchmark datasets from the Document Understanding Conferences (DUC, https://duc.nist.gov/): DUC’01, DUC’02 and DUC’04. Table 2 shows the main properties of these datasets. To ease the reading, we summarise the parameters we used in our simulation experiments in Table 3.
Table 2: Dataset statistics.

Dataset | # Topics | # Docs | Summary length (words) | # Sentences/Topic
DUC’01 | 30 | 308 | 100 | 378
DUC’02 | 59 | 567 | 100 | 271
DUC’04 | 50 | 500 | 100 | 265
6.1 APL Strategy Comparison
We compare our AL-based querying strategy introduced in §4.1 (see Eq. (9)) with three baseline AL strategies:

Random: In each interaction round, select a new candidate summary from the Summary DB uniformly at random and ask the user to compare it to the old one from the previous interaction round. In the first round, we randomly select two summaries to present.

J&N: The robust query selection algorithm proposed by Jamieson and Nowak (2011). It assumes that the items’ preferences depend on their distances to an unknown reference point in the embedding space: the farther an item is from the reference point, the more preferred the item is. After each round of interaction, the algorithm uses all collected preferences to locate the area where the reference point may fall into and identifies the query pairs which can reduce the size of this area, termed ambiguous query pairs. To combat noise in the preferences, the algorithm selects the most-likely-correct ambiguous pair to query the oracle in each round.
Gibbs: In each interaction round, sample a pair of summaries from the Summary DB with the SPPI querying policy (Eq. (6)), i.e. pairs with larger estimated utility gaps are sampled with higher probability.

Note that Gibbs presents two new summaries to the user in each round, while the other querying strategies we consider present only one new summary per round (see §4.1). Thus, in $T$ rounds of interaction with a user, the user needs to read $2T$ summaries with Gibbs, but only $T+1$ with the other querying strategies.
Table 3: Parameters used in our simulation experiments.

For APL (stage 1 in APRIL); see Alg. 2:
$T$ : query round budget
$|\mathcal{D}|$ : Summary DB size for each cluster (see §4.1)
$U_{prior}$ : heuristics-based prior reward (see §4.1 and Eq. (8)); we use the reward heuristics proposed by Ryang and Abekawa (2012)
$\lambda$ : trade-off between prior and posterior rewards (see Eq. (8))
learning rate for preference learning
$\phi(y; x)$ : vectorised representation of summary $y$ for document cluster $x$ (see Eq. (8)); we use the same vector representation as Rioux et al. (2014)
$\beta_j$ : weights of the preference learning strategies (see Eq. (9); selection details presented in §6.1)

For RL (stage 2 in APRIL); see Alg. 3:
episode budget
update frequency in NTD
$V(\cdot\,; \theta)$ : neural approximation of the $V$ values (see §6.2 for setup details)

For SPPI; see Alg. 1:
learning rate in SPPI
To find the best weights for our AL querying strategy in Eq. (9), we run a grid search: We select each weight from a fixed grid of values and ensure that the sum of the four weights is 1.0. The query budget $T$ was set to 10, 50 and 100. For each cluster $x$, we generated 5,000 extractive summaries to construct the Summary DB. Each summary contains no more than 100 words and is generated by randomly selecting sentences from the original documents in $x$. The prior used in Eq. (8) is the reward function proposed by Ryang and Abekawa (2012). All querying strategies we test take less than 500 ms to decide the next summary pair to present.

The performance of the querying strategies is measured by the quality of the resulting reward function $R$ (see Eq. (8)). For each cluster $x$, we measure the quality of $R$ by its Spearman’s rank correlation (Spearman, 1904) to the gold-standard utility scores $U^*$ (Eq. (12)) over all summaries in the Summary DB. We normalise $R$ to the same range as $U^*$ (i.e. [0, 10]). For the vector representation $\phi$, we use the same 200-dimensional bag-of-bigrams representation as Rioux et al. (2014).
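The evaluation of a learnt reward function can be sketched as follows, using SciPy’s Spearman correlation; the min-max rescaling to [0, 10] reflects the normalisation step described above.

```python
# Sketch of the reward-quality evaluation: Spearman correlation between the learnt
# reward (Eq. (8)) and the gold-standard utility (Eq. (12)) over the Summary DB.
import numpy as np
from scipy.stats import spearmanr

def reward_quality(learnt_rewards: np.ndarray, gold_utilities: np.ndarray) -> float:
    # rescale the learnt rewards to the same [0, 10] range as the gold utilities
    lo, hi = learnt_rewards.min(), learnt_rewards.max()
    rescaled = 10.0 * (learnt_rewards - lo) / (hi - lo) if hi > lo else learnt_rewards
    rho, _ = spearmanr(rescaled, gold_utilities)
    return float(rho)
```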
Table 4: Quality of the learnt reward (Spearman correlation with $U^*$) for different querying strategies and query budgets.

Querying strategy | $T$ = 10 | $T$ = 50 | $T$ = 100
Random | .232 | .235 | .243
J&N | .238 | .240 | .247
Gibbs | .246 | .275 | .289
Single heuristic | .236 | .241 | .261
Single heuristic | .288 | .297 | .319
Single heuristic | .211 | .238 | .263
Single heuristic | .257 | .285 | .303
Best combination | .288 | .298 | .320
Lower bound (no interaction) | .194 | .194 | .194
Table 4 compares the performance of the different querying strategies. We find that all querying strategies outperform the zero-interaction lower bound even with 10 rounds of interaction, suggesting that even a small number of preferences helps to improve the quality of the learnt reward. Among all baseline strategies, Gibbs significantly outperforms the other two (we used double-tailed t-tests to compute the p-values), and we believe the reason is that Gibbs exploits the wider-gap-less-noise assumption (see §5). Of all 56 possible AL weight combinations, 48 combinations outperform the Random and J&N baselines, and 27 outperform Gibbs. This shows the overall strength of our AL-based strategy. The best combination of the weights is closely followed by using the diversity-based strategy alone. We believe the reason behind the effectiveness of the diversity-based strategy is that it not only exploits the wider-gap-less-noise assumption by querying dissimilar summaries, but also explores summaries from different areas of the embedding space, which helps generalisation. Due to its simplicity, we use the diversity-based strategy alone henceforth, since its performance is almost identical to the best combination, with no statistically significant difference.

6.2 RL Comparison
We compare NTD (Alg. 3) to two baselines: TD (Sutton, 1984) and LSTD (Boyan, 1999). TD has been successfully used by Ryang and Abekawa (2012) and Rioux et al. (2014) for EMDS. LSTD improves TD by using least-squares optimisation, and it has been shown to perform better than TD in large-scale problems (Lagoudakis and Parr, 2003). Note that both TD and LSTD use linear models to approximate the $V$ values.
We use the following settings, which yielded good performance in pilot experiments: a learning episode budget of 3,000 episodes and a fixed learning step size in TD and NTD. For NTD, the input of the value network is the same 200-dimensional draft summary representation as in Rioux et al. (2014); after the input layer, we add a fully connected ReLU (Glorot et al., 2011) layer with dimension 100 as the first hidden layer, followed by an identical fully connected 100-dimensional ReLU layer as the second hidden layer; finally, a linear output layer outputs the $V$ value. Fig. 6 illustrates the structure of the value network. We use Adam (Kingma and Ba, 2014) as the gradient optimiser (line 10 in Alg. 3) with its default setup. For LSTD, we initialise its square matrix as a diagonal matrix and let the diagonal elements be random numbers between 0 and 1, as suggested by Lagoudakis and Parr (2003).

The rewards we use here are based on $U^*$ defined in Eq. (12). Note that this serves as the upper-bound performance, because $U^*$ is the gold-standard scoring function, which is not accessible in the interactive settings (see §6.3 and §7). We measure the performance of the three RL algorithms by the quality of their generated summaries in terms of multiple ROUGE scores. Results are presented in Table 5. NTD outperforms the other two RL algorithms by a large margin. We attribute this to its more precise approximation of the $V$ values using the neural network.
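Based on this description (200-dimensional input, two fully connected 100-dimensional ReLU hidden layers, a linear scalar output, Adam with default settings), one possible PyTorch realisation of the value network is sketched below; it is our reconstruction, not the authors’ released model.

```python
# Sketch of the value network described above: 200-d input, two 100-d ReLU hidden
# layers, a scalar linear output, optimised with Adam (default settings).
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    def __init__(self, input_dim: int = 200, hidden_dim: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)    # V(s; theta)

value_net = ValueNetwork()
optimizer = torch.optim.Adam(value_net.parameters())   # default Adam setup, as in the text
```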
Table 5: ROUGE scores of the summaries generated by the three RL algorithms (using the upper-bound reward based on $U^*$).

Dataset | RL | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-SU4
DUC’01 | NTD | .452 | .169 | .359 | .177
DUC’01 | TD | .442 | .161 | .349 | .172
DUC’01 | LSTD | .432 | .151 | .362 | .179
DUC’02 | NTD | .483 | .181 | .379 | .193
DUC’02 | TD | .475 | .179 | .374 | .189
DUC’02 | LSTD | .462 | .163 | .363 | .183
DUC’04 | NTD | .492 | .189 | .391 | .203
DUC’04 | TD | .473 | .174 | .378 | .192
DUC’04 | LSTD | .457 | .156 | .360 | .182
In terms of computation time, TD takes around 30 seconds to finish the 3,000 episodes of training and produce a summary, NTD takes around 2 minutes, and LSTD takes around 5 minutes (all RL experiments were performed on a workstation with a quad-core CPU and 8 GB RAM, without using GPUs). Since the RL computation is performed only once, after all online interaction has finished, we find this computation time acceptable. However, without using the Summary DB as the memory replay, NTD takes around 10 minutes to run 3,000 episodes of training.
6.3 Full System Performance
We compare SPPI with two variants of APRIL: APRIL-TD and APRIL-NTD, which use TD and NTD, respectively. Both implementations of APRIL use the diversity-based AL strategy. All other parameter values are the same as those described in §6.1 and §6.2 (see Table 3).
Table 6: Results on DUC’01 for different query budgets.

Query budget | System | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-SU4
0 | SPPI | .323 | .068 | .259 | .098
0 | APRIL-TD | .324 | .070 | .257 | .099
0 | APRIL-NTD | .325 | .069 | .260 | .100
10 | SPPI | .323 | .068 | .259 | .099
10 | APRIL-TD | .338 | .075 | .268 | .105
10 | APRIL-NTD | .339 | .075 | .269 | .106
50 | SPPI | .325 | .067 | .261 | .099
50 | APRIL-TD | .340 | .081 | .271 | .106
50 | APRIL-NTD | .345 | .082 | .276 | .107
100 | SPPI | .325 | .070 | .261 | .100
100 | APRIL-TD | .349 | .083 | .275 | .113
100 | APRIL-NTD | .357 | .086 | .281 | .115
Results on DUC’01 are presented in Table 6. When no interaction is allowed (i.e., the query budget is 0), the performance of the three algorithms shows no significant differences. With increasing query budget, the gap between both APRIL implementations and SPPI becomes larger, suggesting the advantage of APRIL over SPPI. Also note that for the smaller query budgets, APRIL-NTD does not have a significant advantage over APRIL-TD, but at the largest budget, APRIL-NTD significantly outperforms APRIL-TD in terms of ROUGE-1 and ROUGE-L. This is because when the query budget is small, the learnt reward function contains much noise (i.e. it has low correlation with $U^*$; see Table 4), and the poor quality of the learnt reward limits the advantage of the NTD algorithm. This problem is alleviated as the query budget increases. The above observations also apply to the experiments on DUC’02 and DUC’04; their results are presented in Tables 7 and 8, respectively.
Table 7: Results on DUC’02 for different query budgets.

Query budget | System | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-SU4
0 | SPPI | .350 | .077 | .278 | .112
0 | APRIL-TD | .351 | .078 | .278 | .113
0 | APRIL-NTD | .350 | .078 | .279 | .112
10 | SPPI | .349 | .076 | .277 | .111
10 | APRIL-TD | .359 | .084 | .281 | .116
10 | APRIL-NTD | .361 | .085 | .283 | .116
50 | SPPI | .351 | .077 | .279 | .112
50 | APRIL-TD | .361 | .083 | .283 | .117
50 | APRIL-NTD | .364 | .086 | .287 | .118
100 | SPPI | .351 | .078 | .277 | .113
100 | APRIL-TD | .368 | .088 | .290 | .123