Preference-based Interactive Multi-Document Summarisation

06/07/2019, by Yang Gao, et al.

Interactive NLP is a promising paradigm to close the gap between automatic NLP systems and the human upper bound. Preference-based interactive learning has been successfully applied, but the existing methods require several thousand interaction rounds even in simulations with perfect user feedback. In this paper, we study preference-based interactive summarisation. To reduce the number of interaction rounds, we propose the Active Preference-based ReInforcement Learning (APRIL) framework. APRIL uses Active Learning to query the user, Preference Learning to learn a summary ranking function from the preferences, and neural Reinforcement Learning to efficiently search for the (near-)optimal summary. Our results show that users can easily provide reliable preferences over summaries and that APRIL outperforms the state-of-the-art preference-based interactive method in both simulation and real-user experiments.


1 Introduction

Interactive Natural Language Processing (NLP) approaches that put the human in the loop have gained increasing research interest recently (Amershi et al., 2014; Gurevych et al., 2018; Kreutzer et al., 2018a). The user–system interaction enables personalised and user-adapted results by incrementally refining the underlying model based on a user’s behaviour and by optimising the learning through actively querying for feedback and judgements. Interactive methods can start with little or no input data and adjust their output to the needs of human users.

Previous research has explored eliciting different forms of feedback from users in interactive NLP, for example mouse clicks for information retrieval (Borisov et al., 2018), post-edits and ratings for machine translation (Denkowski et al., 2014; Kreutzer et al., 2018a), error markings for semantic parsing (Lawrence and Riezler, 2018), bigrams for summarisation (P.V.S. and Meyer, 2017), and preferences for translation (Kreutzer et al., 2018b). Controlled experiments suggest that asking for preferences places a lower cognitive burden on human subjects than asking for absolute ratings or categorised labels (Thurstone, 1927; Kendall, 1948; Kingsley and Brown, 2010). But it remains unclear whether people can easily provide reliable preferences over summaries. In addition, preference-based interactive NLP faces a high sample complexity problem: a preference is a binary decision and hence conveys only a single bit of information, so NLP systems usually need to elicit a large number of preferences from the users to improve their performance. For example, the machine translation system by Sokolov et al. (2016a) needs to collect hundreds of thousands of preferences from a simulated user before it converges.

Collecting such large amounts of user inputs and using them to train a “one-fits-all” model might be feasible for tasks such as machine translation, because the learnt model can generalise to many unseen texts. However, for highly subjective tasks, such as document summarisation, this procedure is not effective, since the notion of importance is specific to a certain topic or user. For example, the information that Lee Harvey Oswald shot president Kennedy might be important when summarising the assassination, but less important for a summary on Kennedy’s childhood. Likewise, a user who is not familiar with the assassination might consider the information more important than a user who has been analysing its political background for many years. Therefore, we aim at an interactive system that adapts a model for a given topic and user context based on user feedback – instead of training a single model across all users and topics, which hardly fits anyone’s needs perfectly. In this scenario, it is essential to overcome the high sample complexity problem and learn to adapt the model using a minimum of user interaction.

In this article, we propose the Active Preference-based ReInforcement Learning (APRIL) framework, which we first introduced in Gao et al. (2018); towards the end of §1 we discuss how this article substantially extends that previous work. Our core research idea is to split the preference-based interactive learning process into two stages. First, we estimate the user’s ranking over candidate summaries using active preference learning (APL) in an interaction loop. Second, we use the learnt ranking to guide a neural reinforcement learning (RL) agent to search for the (near-)optimal summary. The use of APL allows us to maximise the information gain from a small number of preferences, helping to reduce the sample complexity. Fig. 1 shows this general idea in comparison to the state-of-the-art preference-based interactive NLP paradigm, Structured Prediction from Partial Information (SPPI) (Sokolov et al., 2016b; Kreutzer et al., 2017). In §3, we discuss the technical background of RL, preference learning and SPPI, before we introduce our solution APRIL in §4.

We apply APRIL to the Extractive Multi-Document Summarisation (EMDS) task. Given a cluster of documents on the same topic, an EMDS system needs to extract important sentences from the input documents to generate a summary complying with a given length requirement that fits the needs of the user and her/his task. For the first time, we provide evidence for the efficacy of preference-based interaction in EMDS based on a user study, in which we measure the usability and the noise of preference feedback, yielding a mathematical model we can use for simulation and for analysing our results (§5). To evaluate APRIL, we then perform experiments on standard EMDS benchmark datasets. We compare the effectiveness of multiple APL and RL algorithms and select the best algorithms for our full system. We compare APRIL to SPPI and non-interactive methods, in both simulation (§6) and real-user experiments (§7). Our results suggest that with only ten rounds of user interaction, APRIL produces summaries better than those produced by both non-interactive methods and SPPI.

(a) SPPI workflow
(b) APRIL workflow
Figure 1: SPPI (a) directly uses the collected preferences to “teach” its summary-generator, while APRIL (b) learns a reward function as the proxy of the user/oracle, and uses the learnt reward to “teach” the RL-based summariser.

This work extends our earlier work (Gao et al., 2018) in three aspects. (i) We present a new user study on the reliability and usability of preference-based interaction (§5). Based on this study, we propose a realistic simulated user, which is used in our experiments. (ii) We evaluate multiple new APL strategies and a novel neural RL algorithm, and compare them with the counterpart methods used in Gao et al. (2018). The use of these new algorithms further boosts the efficiency and performance of APRIL (§6). (iii) We conduct additional user studies to compare APRIL with both non-interactive baselines and SPPI under more realistic settings (§7). APRIL can be applied to a wide range of other NLP tasks, including machine translation, semantic parsing and information exploration. All source code and experimental setups are available at https://github.com/UKPLab/irj-neural-april.

2 Related Work

SPPI.

The method most similar to ours is SPPI (Sokolov et al., 2016b; Kreutzer et al., 2017). The core of SPPI is a policy-gradient RL algorithm, which receives rewards derived from the preference-based feedback. It maintains a policy that approximates the utility of each candidate output and selects higher-utility candidates with higher probability. As discussed in §1, SPPI suffers heavily from the high sample complexity problem. We will present the technical details of SPPI in §3.3 and compare it to APRIL in §6 and §7.

Preferences.

The use of preference-based feedback in NLP is attracting increasing research interest. Zopf (2018) learns a sentence ranker from human preferences on sentence pairs, which can be used to evaluate the quality of summaries by counting how many highly ranked sentences are included in a summary. Simpson and Gurevych (2018) develop an improved Gaussian process preference learning (Chu and Ghahramani, 2005) algorithm to learn an argument convincingness ranker from noisy preferences. Unlike these methods, which focus on learning a ranker from preferences, we focus on using preferences to generate better summaries. Kreutzer et al. (2018b) ask real users to provide cardinal (5-point ratings) and ordinal (pairwise preferences) feedback on translations, and use the collected data to train an off-policy RL system to improve translation quality. Their study suggests that the inter-rater agreement for cardinal and ordinal feedback is similar. However, they do not measure or consider the influence of the questions’ difficulty on the agreement, which we find significant for EMDS (see §5). In addition, their system is not interactive, but uses log data instead of actively querying users.

Interactive Summarisation.

The iNeATS (Leuski et al., 2003) and IDS (Jone et al., 2002) systems allow users to tune several parameters (e.g., size, redundancy, focus) to customise the produced summaries. Further work presents automatically derived summary templates (Orǎsan et al., 2003; Orǎsan and Hasler, 2006) or hierarchically ordered summaries (Christensen et al., 2014; Shapira et al., 2017) allowing users to drill down from a general overview to detailed information. However, these systems do not employ the users’ feedback to update their internal summarisation models. P.V.S. and Meyer (2017) propose an interactive EMDS system that asks users to label important bigrams within candidate summaries. Given the important bigrams, they use integer linear programming to optimise important bigram coverage in the summary. In simulation experiments, their system achieves near-optimal performance within ten rounds of interaction, collecting up to 350 important bigrams. However, labelling important bigrams places a large burden on the users, as they have to read through many potentially unimportant bigrams (see §5). Also, the authors assume that the users’ feedback is always perfect.

Reinforcement Learning.

RL has been applied to both extractive and abstractive summarisation in recent years (Ryang and Abekawa, 2012; Rioux et al., 2014; Gkatzia et al., 2014; Henß et al., 2015; Paulus et al., 2017; Pasunuru and Bansal, 2018; Kryscinski et al., 2018). Most existing RL-based document summarisation systems use as rewards either heuristic functions (e.g., Ryang and Abekawa, 2012; Rioux et al., 2014), which do not rely on reference summaries, or ROUGE scores, which require reference summaries (Paulus et al., 2017; Pasunuru and Bansal, 2018; Kryscinski et al., 2018). However, neither ROUGE nor the heuristics-based rewards can precisely reflect real users’ requirements on summaries (Chaganty et al., 2018); hence, using these imprecise rewards can severely mislead the RL-based summariser. The quality of the rewards has been recognised as the bottleneck for RL-based summarisation systems (Kryscinski et al., 2018). Our work learns how to give good rewards from users’ preferences. We assume that our system has no access to reference summaries, but can query a user for preferences over summary pairs.

Some RL work directly uses the users’ ratings as rewards. Nguyen et al. (2017) employ user ratings on translations as rewards when training an RL-based encoder-decoder translator. However, eliciting ratings on summaries is very expensive, as users have high variance in their ratings of the same summary (Chaganty et al., 2018), which is why we consider preference-based feedback and a learnt reward surrogate.

Preference-based RL (PbRL) is a recently proposed paradigm at the intersection of preference learning, RL, active learning (AL) and inverse RL (Wirth et al., 2017). Unlike apprenticeship learning (Dethlefs and Cuayáhuitl, 2011), which requires the user to demonstrate (near-)optimal sequences of actions (called action trajectories), PbRL only asks for the user’s preferences (either partial or total order) over several action trajectories. Wirth et al. (2016) apply PbRL to several simulated robotics tasks. They show that their method can achieve near-optimal performance by interacting with a simulated perfect user for 15–40 rounds. Christiano et al. (2017) use PbRL to train simulated robotics tasks, Atari-playing agents and a simulated back-flipping agent by collecting feedback from both simulated oracles and real crowdsourcing workers. They find that human feedback can be noisy and partial (i.e., capturing only a fraction of the true reward), but that it is much easier for people to provide consistent comparisons than consistent absolute scores in their robotics use case. In §5, we evaluate this for document summarisation.

However, the approach by Christiano et al. (2017) fails to obtain satisfactory results in some robotics tasks even after 5,000 interaction rounds. In a follow-up work, Ibarz et al. (2018) elicit demonstrations from experts, use the demonstrations to pre-train a model with imitation learning techniques, and successfully fine-tune the pre-trained model with PbRL. In EMDS, extractive reference summaries might be viewed as demonstrations, but they are expensive to collect and not available in popular summarisation corpora (e.g., the DUC datasets). APRIL does not require demonstrations, but learns a reward function based on user preferences on entire summaries, which is then used to train an RL policy.

3 Background

In this section, we recap the necessary details of RL (§3.1), preference learning (§3.2) and SPPI (§3.3). We adapt them to the EMDS use case so as to lay the foundation for APRIL. To ease reading, we summarise the notation used in the remainder of this article in Table 1.

Notation Description
a document cluster from the set of all possible inputs
a summary from the set of all legal summaries for
MDP of the EMDS task for : states , actions , transition function , reward function and terminal states
the reward of summary in
policy in RL: the probability of selecting summary in
policy in SPPI, parameterised by : the probability of presenting pair to the oracle (Eq. (6))
the ground-truth utility function on
the approximation of
, where is a utility function on
the ranking function on  induced by (Eq. (2))
the approximation of induced by
the preference direction function, which returns 1 if the oracle/user prefers over for
the objective function in RL (Eq. (1))
the objective function in preference learning (Eq. (3))
the objective function in SPPI (Eq. (5))
Table 1: Overview of the notation used in this article

3.1 Reinforcement Learning

RL amounts to algorithms for efficiently searching optimal solutions in Markov Decision Processes (MDPs). MDPs are widely used to formulate sequential decision-making problems. Let be the input space and let be the set of all possible outputs for input . An episodic MDP is a tuple  for input , where is the set of states, is the set of actions and is the transition function with giving the next state after performing action in state . is the reward function with giving the immediate reward for performing action in state . is the set of terminal states; visiting a terminal state terminates the current episode.

EMDS can be formulated as an episodic MDP, as the summariser has to sequentially select sentences from the original documents to add to the draft summary. Our MDP formulation of EMDS matches previous approaches by Ryang and Abekawa (2012) and Rioux et al. (2014): the input is a cluster of documents and the output space is the set of all legal summaries for the cluster (i.e., all permutations of sentences in the cluster that fulfil the given summary length constraint). In the MDP for a document cluster, the state set includes all possible draft summaries of any length. The action set includes two types of actions: concatenate a sentence in the cluster to the current draft summary, or terminate the draft summary construction. The transition function is trivial in EMDS, because given the current draft summary and an action, the next state can easily be identified as the draft summary plus the selected sentence or as a terminating state. The reward function returns an evaluation score of the summary once the action terminate is performed; otherwise it returns 0, because the summary is still under construction and thus not ready to be evaluated (so-called delayed rewards). Providing non-zero rewards before the action terminate can lead to even worse results, as reported by Rioux et al. (2014). The set of terminal states includes all states corresponding to summaries exceeding the given length requirement and an absorbing state. By performing the action terminate, the agent is transitioned to the absorbing state regardless of its current state.

A policy in an MDP defines how actions are selected: it gives the probability of selecting each action in a given state. Note that in many sequential decision-making tasks, a single policy is learnt across all inputs. However, for our EMDS use case, we learn an input-specific policy for a given document cluster in order to reflect the subjectivity of the summarisation task introduced in §1. A policy induces a probability of generating each summary in the cluster, and building a summary yields an accumulated reward. Finally, the expected reward of performing a policy is:

(1)

The goal of an MDP is to find the optimal policy that has the highest expected reward.
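To make this formulation concrete, the following sketch implements a minimal EMDS environment with delayed rewards; the class and method names are our own illustration, not part of the released code.

```python
# Minimal sketch of the episodic EMDS MDP described above (names are illustrative).
# A state is the current draft summary (a list of sentence indices); actions are
# "add sentence i" or "terminate"; the reward is delayed until termination.
from typing import Callable, List, Tuple

class EMDSEnv:
    def __init__(self, sentences: List[str], length_limit: int,
                 reward_fn: Callable[[List[int]], float]):
        self.sentences = sentences          # sentences of the document cluster
        self.length_limit = length_limit    # max summary length in tokens
        self.reward_fn = reward_fn          # evaluates a *finished* summary
        self.TERMINATE = len(sentences)     # index of the terminate action

    def reset(self) -> List[int]:
        return []                           # empty draft summary

    def step(self, state: List[int], action: int) -> Tuple[List[int], float, bool]:
        if action == self.TERMINATE:
            # Delayed reward: only the finished summary is evaluated.
            return state, self.reward_fn(state), True
        next_state = state + [action]
        n_tokens = sum(len(self.sentences[i].split()) for i in next_state)
        if n_tokens > self.length_limit:
            # Exceeding the length requirement ends the episode with zero reward.
            return next_state, 0.0, True
        return next_state, 0.0, False
```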

3.2 Preference Learning

For a document cluster and its legal summaries set , we let be the ground-truth utility function measuring the quality of summaries in . We additionally assume that no two items in have the same value. Let be the ascending ranking induced by : for ,

(2)

where is the indicator function. In other words, gives the rank of among all elements in with respect to . The goal of preference learning is to approximate from the pairwise preferences on some elements in . The preferences are provided by an oracle.

The Bradley-Terry (BT) model (Bradley and Terry, 1952) is a widely used preference learning model, which approximates the ranking by approximating the utility function : Suppose we have observed preferences: , where are the summaries presented to the oracle in the round, and indicates the preference direction of the oracle: if the oracle prefers over , and otherwise. The objective in BT is to maximise the following likelihood function:

(3)

where

(4)

is the approximation of the ground-truth utility function, parameterised by weights that can be learnt with any function approximation technique, e.g. neural networks or linear models. By maximising Eq. (3), the resulting approximated utility can be used to induce the approximated ranking function.
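As an illustration of this step, the sketch below fits a linear Bradley–Terry utility model from observed pairwise preferences by gradient ascent on the log-likelihood, in the spirit of Eq. (3); the feature representation, learning rate and epoch count are placeholders rather than the paper’s configuration.

```python
import numpy as np

def bt_probability(w, phi_a, phi_b):
    """Bradley-Terry probability that summary a is preferred over summary b,
    assuming a linear utility U_w(y) = w . phi(y)."""
    return 1.0 / (1.0 + np.exp(-(w @ phi_a - w @ phi_b)))

def fit_bt(preferences, dim, lr=0.01, epochs=200):
    """preferences: list of (phi_preferred, phi_other) feature-vector pairs."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for phi_win, phi_lose in preferences:
            p = bt_probability(w, phi_win, phi_lose)
            # Gradient of the log-likelihood of the observed preference.
            w += lr * (1.0 - p) * (phi_win - phi_lose)
    return w
```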

3.3 The SPPI Framework

SPPI can be viewed as a combination of RL and preference learning. For an input , the objective of SPPI is to maximise

(5)

where is the same preference direction function as in preference learning (§3.2). is a policy that decides the probability of presenting a pair of summaries to the oracle:

(6)

In line with preference learning, the utility function estimating the quality of summaries is parameterised by a weight vector. The policy samples pairs with larger utility gaps with higher probability; as such, both “good” and “bad” summaries have a chance to be presented to the oracle, which encourages exploration of the summary space. To maximise Eq. (5), SPPI uses gradient ascent to update the weights incrementally. Algorithm 1 presents the pseudo code of our adaptation of SPPI to EMDS.

Input : sequence of learning rates ; query budget ; document cluster
1 initialise ;
2 while  do
3       sampling using (Eq. 6);
4       get preference ;
5       (Eq. 5);
6      
7 end while
Output : 
Algorithm 1 Adaptation of SPPI (Kreutzer et al., 2017, Alg. 1) for preference-based EMDS.
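For intuition, a simplified version of the Gibbs pair-selection policy of Eq. (6) could be sampled as follows; the joint normalisation over pairs and the temperature are simplifications of our own, not the exact SPPI formulation.

```python
import numpy as np

def sample_pair_gibbs(utilities, temperature=1.0, rng=None):
    """Sample a pair (i, j) of candidate summaries, favouring pairs with a
    large estimated utility gap, in the spirit of the SPPI pair-selection
    policy (Eq. (6)); the normalisation used here is a simplification."""
    rng = rng or np.random.default_rng()
    utilities = np.asarray(utilities, dtype=float)
    n = len(utilities)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    gaps = np.array([(utilities[i] - utilities[j]) / temperature for i, j in pairs])
    probs = np.exp(gaps - gaps.max())
    probs /= probs.sum()
    return pairs[rng.choice(len(pairs), p=probs)]
```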

Note that the objective function in SPPI (Eq. (5)) and the expected reward function in RL (Eq. (1)) have a similar form: if we view the preference direction function in Eq. (5) as a reward function, we can consider SPPI as an RL problem. The major difference between SPPI and RL is that the policy in SPPI selects pairs (Eq. (6)), while the policy in RL selects single summaries (see §3.1). For APRIL, we will exploit this connection to propose our new objective function and learning paradigm.

4 The APRIL Framework

SPPI suffers from the high sample complexity problem, which we attribute to two major reasons: First, the policy in SPPI (Eq. (6)) is good at distinguishing the “good” summaries from the “bad” ones, but poor at selecting the “best” summaries from “good” summaries, because it only queries the summaries with large quality gaps. Second, SPPI makes inefficient use of the collected preferences: After each round of interaction, SPPI performs one step of the policy gradient update, but does not generalise or re-use the collected preferences. This potentially wastes expensive user information. To alleviate these two problems, we exploit the connection between SPPI, RL and preference learning and propose the APRIL framework detailed in this section.

Recall that in EMDS, the goal is to find the optimal summary for a given document cluster , namely the summary that is preferred over all other possible summaries in according to . Based on this understanding and in line with the RL formulation of EMDS from §3.1, we define a new expected reward function for policy as follows:

(7)

Note that the preference direction function equals 1 if one summary is preferred over the other and equals 0 otherwise (see §3.2). Thus, the new objective counts the number of summaries that are less preferred than the selected summary and hence equals its rank (see Eq. 2). A policy that maximises this new objective function will select the summaries with the highest rankings and hence output the optimal summary.

This new objective function decomposes the learning problem into two stages: (i) approximating the ranking function , and (ii) based on the approximated ranking function, searching for the optimal policy that can maximise the new objective function. These two stages can be solved by (active) preference learning and reinforcement learning, respectively, and they constitute our APRIL framework, illustrated in Figure 2.

Figure 2: Detailed workflow of APRIL (extended version of the workflow presented in Fig. 0(b))

4.1 Stage 1: Active Preference Learning

For an input document cluster , the task in the first stage of APRIL is to obtain , the approximated ranking function on by collecting a small number of preferences from the oracle. It involves four major components: a summary Database (DB) storing the summary candidates, an AL-based Querier that selects candidates from the Summary DB to present to the user, a Preference DB storing the preferences collected from the user, and a Preference Learner that learns from the preferences. The left cycle in Fig. 2 illustrates this stage, and Alg. 2 presents the corresponding pseudo code. Below, we detail these four components.

Input : Query budget ; document cluster ; Summary DB ; heuristic ; tradeoff , learning rate
1 let ;
2 get first summary by Eq. (9) ;
3 initialise while  do
4       select according to Eq. (9) ;
5       get preference from the oracle, add to ;
6       (Eq. (3)) ;
7      
8 end while
9 (Eq. (8)) ;
Output :  and its induced ranking
Algorithm 2 Active preference learning (Stage 1 in APRIL).

Summary DB .

Ideally, the Summary DB should include all legal extractive summaries for a document cluster. Since this is impractical for large clusters, we either randomly sample a large set of summary candidates or use pre-trained summarisation models and heuristics to generate the candidates. Note that the Summary DB can be built offline, i.e. before the interaction with the user starts. This improves the real-time responsiveness of the system.
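A minimal way to populate such a database by random sampling (one of the options mentioned above) might look as follows; the 100-word limit and the sample size of 5,000 mirror the setup in §6.1, while the function itself is illustrative.

```python
import random
from typing import List

def sample_candidate_summaries(sentences: List[str], n_samples: int = 5000,
                               max_tokens: int = 100, seed: int = 0) -> List[List[int]]:
    """Randomly build extractive summary candidates (as lists of sentence indices)
    that respect the token-length constraint."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_samples):
        order = list(range(len(sentences)))
        rng.shuffle(order)
        summary, n_tokens = [], 0
        for idx in order:
            length = len(sentences[idx].split())
            if n_tokens + length > max_tokens:
                break
            summary.append(idx)
            n_tokens += length
        candidates.append(summary)
    return candidates
```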

Preference DB .

The preference database stores all collected user preferences , where is the query budget (i.e., how many times a user may be asked for a preference), are the summaries presented to the user in the round of interaction, and is the user’s preference (see §3.2).

Preference Learner.

We use the BT model introduced in §3.2 to learn from the preferences in the Preference DB. In order to increase the real-time responsiveness of our system, we use a linear model to approximate the utility function, i.e. a dot product between a weight vector and a vectorised representation of the summary for the input cluster. However, purely using this learnt model to approximate the utility is sensitive to noise in the preferences, especially when the number of collected preferences is small. To mitigate this, we approximate the utility not only with the learnt model (the posterior), but also with some prior knowledge (the prior), for example the heuristics-based summary evaluation function proposed by Ryang and Abekawa (2012) and Rioux et al. (2014). Note that these heuristics do not require reference summaries; see §2. Formally, we define the combined reward as

(8)

where a real-valued parameter trades off between the prior and the posterior.
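A sketch of this combined reward, assuming a convex combination of the heuristic prior and the linear posterior (the precise form of Eq. (8) may differ), is:

```python
import numpy as np

def combined_reward(summary_vec, w_learnt, prior_score, tradeoff=0.5):
    """Reward used to train the RL summariser, in the spirit of Eq. (8):
    a mixture of a heuristic prior score and the posterior utility learnt
    from preferences. Whether the mixture is convex, and how the two terms
    are normalised, is an implementation choice, not prescribed here."""
    posterior = float(np.dot(w_learnt, summary_vec))
    return tradeoff * prior_score + (1.0 - tradeoff) * posterior
```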

AL-based Querier.

The active-learning-based querier selects which candidate pair from the Summary DB to present to the user in each round of interaction. To reduce the reading burden of the oracle, inspired by the preference collection workflows in robot training (Wirth et al., 2016), we use the following setup to obtain summary pairs: in each interaction round, one summary of the pair is old (i.e. it has been presented to the user in the previous round) and the other one is new (i.e. it has not been read by the user before). As such, the user only needs to read one new summary per round of interaction.

Any pool-based active learning strategy (Settles, 2010) can be used to implement the querier, e.g., uncertainty sampling (Lewis and Gale, 1994). We explore four computationally efficient active learning strategies:

  • Utility gap: Inspired by the pair-selection policy of SPPI (see §3.3 and Eq. (6)), this strategy presents summaries with large estimated utility gaps.

  • Diversity-based heuristic: This strategy minimises the vector-space similarity (measured by cosine similarity) between the presented summaries. It encourages querying dissimilar summaries, so as to encourage exploration and facilitate generalisation. In addition, dissimilar summaries are more likely to have large utility gaps and hence can be answered more accurately by the users (discussed later in §5).

  • Density-based heuristic: This strategy encourages querying summaries from “dense” areas in the vector space, so as to avoid querying outliers and to facilitate generalisation. Formally, the density of a summary is computed from its similarity to the other candidate summaries in the vector space.

  • Uncertainty-based heuristic: This strategy encourages querying the summaries whose approximated utility is most uncertain. In line with P.V.S. and Meyer (2017), we estimate the probability of each summary being the optimal one and derive its uncertainty from this probability.

To exploit the strengths of all these AL strategies, we normalise their output values to the same range and use their weighted sum to select the new summary to present to the user:

(9)

where the old summary is the one presented in the previous interaction round; a special case applies when selecting the very first summary, for which no old summary exists yet. The four weights denote the relative contributions of the four heuristics (their selection is discussed in §6.1).
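Putting the heuristics together, the querier can be sketched as follows: normalise each heuristic’s scores over the candidate pool and pick the candidate that maximises the weighted sum with respect to the previously shown summary (Eq. (9)). The heuristic callables and weights below are placeholders.

```python
import numpy as np

def normalise(scores):
    """Rescale heuristic scores to [0, 1] so that they are comparable."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def select_next_summary(candidates, old_idx, heuristics, weights):
    """candidates: feature matrix of the summary DB; old_idx: index of the summary
    shown in the previous round; heuristics: dict name -> callable returning one
    score per candidate (given the candidate pool and the old summary)."""
    total = np.zeros(len(candidates))
    for name, weight in weights.items():
        total += weight * normalise(heuristics[name](candidates, old_idx))
    total[old_idx] = -np.inf          # never present the same summary twice
    return int(np.argmax(total))
```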

4.2 Stage 2: RL-based Summariser

Given the approximated ranking function learnt in the first stage, the target of the second stage in APRIL is to obtain the optimal policy with respect to the induced rewards. We consider two RL algorithms to obtain this policy: the linear Temporal Difference (TD) algorithm, and a neural version of the TD algorithm.

TD (Sutton, 1984) has proven effective for solving the MDP in EMDS (Rioux et al., 2014; Ryang and Abekawa, 2012). The core of TD is to approximate the V-values: in EMDS, the V-value estimates the “potential” of a (draft) summary for the input cluster given the current policy: the higher the value, the more likely the draft is contained in the optimal summary. Given the V-values, a policy can be derived using the softmax strategy:

(10)

where the normalisation ranges over all available actions in the current state. The intuition behind Eq. (10) is that the probability of performing an action increases if its resulting state has a higher V-value. Note the similarity between the policy of TD (Eq. (10)) and the policy of SPPI (Eq. (6)): both use a Gibbs distribution to assign probabilities to different actions, but the difference is that an action in SPPI is a pair of summaries, while in TD an action is adding a sentence to the current draft summary or terminating (see §3.1).

Existing work uses linear functions to approximate the V-values (Rioux et al., 2014; Ryang and Abekawa, 2012). To approximate the V-values more precisely, we use a neural network and term the resulting algorithm Neural TD (NTD). Inspired by DQN (Mnih et al., 2015), we employ the memory replay and periodic update techniques to boost and stabilise the performance of NTD. We use NTD rather than DQN (Mnih et al., 2015) because in MDPs with discrete actions and continuous states, as in our EMDS formulation, Q-learning needs to maintain a network for each action, which is very expensive when the action set is large. Instead, the TD algorithms only have to maintain the V-value network, whose size is independent of the number of actions. In EMDS, the size of the action set typically exceeds several hundred (see Table 2), because each sentence corresponds to one action.

Alg. 3 shows the pseudo code of NTD. We use the Summary DB as the memory replay; this helps us to reduce the sample generation time, which is critical in interactive systems. We select samples from the Summary DB using softmax sampling (line 3 in Alg. 3):

(11)

where the sampling distribution depends on the parameters of the neural network. Given a selected summary, we build the sequence of its draft states (line 4): the i-th state is the draft summary consisting of the first i sentences of the selected summary. Then, we update the TD error as in the standard TD algorithms (lines 5 to 9) and perform back propagation with gradient descent (line 10). As in DQN, we update the target network periodically (line 11) to stabilise the performance of NTD. After training, the obtained V-values are used to derive the policy by Eq. (10).

Input :  Learning episode budget ; document cluster ; summary DB ; approximated ranking function ; update frequency
1 while  do
2       initialise randomly, let ;
3       sample by Eq. (11);
4       build states from : ;
5       ;
6       while  do
7             ;
8            
9       end while
10      update with gradient descent;
11       let every episodes;
12      
13 end while
Output : Policy by Eq. (10)
Algorithm 3 NTD algorithm for EMDS.
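The inner loop of NTD can be sketched as follows (softmax sampling from the summary DB, TD updates along the draft-state sequence, and a periodically synchronised target network); this follows the spirit of Alg. 3 with illustrative function names, rather than reproducing the released implementation.

```python
import copy
import torch

def ntd_train(value_net, summary_db, reward_fn, state_seq_fn,
              episodes=3000, sync_every=10, lr=1e-3, temperature=1.0):
    """value_net: maps a draft-summary feature vector to a scalar V-value.
    summary_db: tensor of candidate-summary feature vectors (memory replay).
    reward_fn(i): reward of candidate i under the learnt ranking (Eq. (8)).
    state_seq_fn(i): list of draft-state feature tensors for candidate i."""
    target_net = copy.deepcopy(value_net)
    optim = torch.optim.Adam(value_net.parameters(), lr=lr)
    for episode in range(episodes):
        # Softmax sampling of a candidate summary from the DB (Eq. (11)).
        with torch.no_grad():
            probs = torch.softmax(value_net(summary_db).squeeze(-1) / temperature, dim=0)
        i = torch.multinomial(probs, 1).item()
        states, reward = state_seq_fn(i), reward_fn(i)
        if not states:
            continue
        loss = 0.0
        for t in range(len(states)):
            # Delayed reward: non-zero only on the final (terminating) step.
            r = reward if t == len(states) - 1 else 0.0
            with torch.no_grad():
                v_next = target_net(states[t + 1]) if t + 1 < len(states) else torch.zeros(1)
            loss = loss + (r + v_next - value_net(states[t])) ** 2
        optim.zero_grad()
        loss.backward()
        optim.step()
        if episode % sync_every == 0:
            target_net.load_state_dict(value_net.state_dict())
    return value_net
```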

5 Preference-based Interaction for Summarisation

To date, there is little knowledge about the usability and the reliability of user feedback in summarisation. This is a major limitation for designing interactive systems and for effectively experimenting with simulated users before an actual user study. In this section, we therefore study preference-based feedback for our EMDS use case and derive a mathematical model to simulate real users’ preference-giving behaviour.

Hypotheses.

Our study tests two hypotheses: (H1) We assume that users find it easier to provide preference feedback than providing other forms of feedback for summaries. In particular, we measure the user satisfaction and the time needed for preference-based interaction and bigram-based interaction proposed by P.V.S. and Meyer (2017), which has also been used in interactive summarisation.

(H2) Previous research suggests that the more difficult the questions, the lower the rate of correct answers or, in other words, the higher the noise in the answers (Huang et al., 2016; Donmez and Carbonell, 2008). In our preference-giving scenario, we assume that the difficulty of comparing a pair of items can be measured by the utility gap between the presented items: the wider the utility gap, the easier it is for the user to identify the better item. We term this the wider-gap-less-noise hypothesis in this article.

The wider-gap-less-noise hypothesis is an essential motivation for the pair-selection policy in SPPI (Eq. (6)) and the diversity-based active learning strategy in APRIL (see §4.1), yet there is little empirical evidence validating it. Based on the findings of our user study, we provide evidence towards H1 and H2, and we propose a realistic user model, which we employ in our simulation experiments in §6.

Study setup.

We invite 12 users to participate in our user study. All users are native or fluent English speakers from our university. We ask each user to provide feedback for newswire summaries from two topics (d074b from DUC’02 and d32f from DUC’01) in the following way.

We first allow the users to familiarise themselves with the topic by means of two 200-word abstracts. This is necessary, since the events discussed in the news documents are several years old and may be unknown to our participants. Without such background information, it would not be possible for users to judge importance in the early stages of the study. We ask each user to provide preferences for ten summary pairs and to label all important bigrams in five additional summaries. For collecting preference-based feedback, we ask the participants to select the better summary (i.e. the one containing more important information) in each pair. For collecting bigram-based feedback, we adopt the setup of P.V.S. and Meyer (2017), who propose a successful EMDS system using bigram-based interaction. At the end of the study, we ask the participants to rate the usability (i.e., user-friendliness) of preference- and bigram-based interaction on a 5-point Likert scale, where higher scores indicate higher usability.

To evaluate H2, we require summary pairs with different utility gaps. To this end, we measure the utility (see §3.2) of a summary for a document cluster as

(12)

where the reference summaries for the document cluster are provided in the DUC datasets, and the three terms stand for the average ROUGE-1, ROUGE-2 and ROUGE-SU4 recall metrics (Lin, 2004), respectively. These ROUGE metrics are widely used to measure the quality of summaries. The denominator values 0.47, 0.22 and 0.18 are the upper-bound ROUGE scores reported by P.V.S. and Meyer (2017). They are used to balance the weights of the three ROUGE scores. As such, each ROUGE score is normalised to [0, 1], and we further rescale the sum of the three scores to normalise the utility values to [0, 10], which facilitates our analyses afterwards.
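Assuming the ROUGE recall scores against the references are computed by an external scorer, the normalised utility can be sketched as below; the final rescaling to [0, 10] is our reading of the normalisation described above.

```python
# Sketch of the gold-standard utility in Eq. (12), assuming the average
# ROUGE recall scores against the reference summaries are already computed.
ROUGE_UPPER_BOUNDS = {"rouge1": 0.47, "rouge2": 0.22, "rougeSU4": 0.18}

def utility(rouge1: float, rouge2: float, rougeSU4: float) -> float:
    """Normalise each ROUGE recall by its upper bound and rescale the sum to [0, 10]."""
    total = (rouge1 / ROUGE_UPPER_BOUNDS["rouge1"]
             + rouge2 / ROUGE_UPPER_BOUNDS["rouge2"]
             + rougeSU4 / ROUGE_UPPER_BOUNDS["rougeSU4"])
    return total * 10.0 / 3.0
```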

For a document cluster, the utility gap of two summaries is the difference of their utilities. For the ten summary pairs in our user study, we select four pairs, three pairs, two pairs and one pair at four different predefined gap widths (each selected pair’s actual utility gap is very close to its predefined gap width). Figure 3 shows two example summary pairs and their utility gaps. As for the five summaries for bigram-based feedback, we select summaries with high utility, but ensure that they have low overlap in order to simulate the AL setup of P.V.S. and Meyer (2017).

: “I think he’s doing a beautiful job up there. “; President Bush, asked at a news conference whether Thomas’ claim not to have an opinion on abortion is credible, answered, “That’s a question for the Senate to decide. In their respective careers, the Thomases have embraced the view that women and minorities are hindered, rather than helped, by affirmative action and government programs. True equality is achieved by holding everyone to the same standard, they believe. “; Before Thomas’ testimony ended, the unflappable 43-year-old federal judge was criticized, sometimes in harsh terms, by several liberal Democrats. Hatch asked. : They see a woman with strong opinions on issues that are bound to come before the court. Dean Kelley, the National Council of Churches’ counselor on religious liberty, wrote a critique of Clarence Thomas that was used as grounds for his organization’s opposition to the Supreme Court nominee. “; Thomas said Senate confirmation of his nomination would give him “an opportunity to serve and give back” and to “bring something different to the court. True equality is achieved by holding everyone to the same standard, they believe. “; “He’s handling himself very well,” the president said. Hatch asked. : Heflin cited the “appearance of a confirmation conversion” and said it may raise questions of Thomas’ “integrity and temperament. The ministers were recently organized into a conservative Coalition for the Restoration of the Black Family and Society, with the first item on its agenda being Thomas’ confirmation. After still another Thomas answer, Biden said, “That’s not the question I asked you, judge. Several committee members said they expected the committee to recommend, by a 10-4 or 9-5 vote, that the Senate confirm Thomas. But others see a different symbolism. But others see a different symbolism. But they hope Sens. : During the early ’80s, Virginia Thomas enrolled in Lifespring, a self-help course that challenges students to take responsibility for their lives. RADIO; (box) KQED, 88.5 FM Tape delay beginning at 9 a.m. repeated at 9:30 p.m. (box) KPFA, 94.1 FM Live coverage begins at 6:30 a.m. TELEVISION; (box) C-SPAN Live coverage begins at 7 a.m. repeated at 5 p.m. (box) CNN Intermittent coverage. “; On natural law : “At no time did I feel, nor do I feel now, that natural law is anything more than the background to our Constitution. “I’m not satisfied with the answers,” Leahy said.

Figure 3: Two summary pairs from topic d074b with utility gaps (pair A, the upper two summaries) and (pair B, the bottom two summaries).

Usability assessment.

To evaluate hypothesis H1, we measure the ease of providing preferences for summaries with two metrics: the average interaction time a participant spends on providing a preference and the participant’s usability rating on the 5-point scale. We compare both metrics for preference-based interaction with bigram-based interaction.

Fig. 4 visualises the interaction time and the usability ratings for preference- and bigram-based interaction as notched boxplots. Both plots confirm the clear difference between preference- and bigram-based feedback for summaries: we measure an average interaction time of 102 s for annotating bigrams in a single summary, which is over twice the time spent on providing a preference for a summary pair (43 s). The users identified 7.2 bigrams per summary, which took 14 s per bigram on average. As for the usability ratings, providing preferences is rated 3.8 on average (median at 4), while labelling bigrams is rated 2.4 on average (median at 2). These results suggest that humans can more easily provide preferences over summaries than point-based feedback in the form of bigrams.

(a) Interaction time
(b) Usability ratings
Figure 4:

Comparison of interaction time and usability ratings for preference and bigram-based interaction. Notches indicate 95% confidence interval.

Reliability assessment.

To evaluate hypothesis H2, we measure the reliability of the users’ preferences, i.e. the percentage of pairs in which the user’s preference agrees with the preference induced by the gold-standard utility. Figure 5 shows the reliability scores for the varying utility gaps employed in our study. The results clearly suggest that, for summary pairs with wider utility gaps, the participants can more easily identify the better summary in the pair, resulting in higher reliability. This observation validates the wider-gap-less-noise assumption.

Figure 5: The reliability of users’ preferences increases with the growth of the utility gaps between presented summaries. Error bars indicate standard errors.

Realistic user simulation.

We observe that the shape of the reliability curve in Figure 5 is similar to that of the logistic function: when the utility gap approaches 0, the reliability approaches 0.5, and as the gap increases, the reliability asymptotically approaches 1. Hence, we adopt the logistic model proposed by Viappiani and Boutilier (2010) to estimate the real users’ preferences. We term the model logistic noise oracle (LNO): for two summaries, we assume the probability that a user prefers the first over the second is:

(13)

where a real-valued noise parameter controls the “flatness” of the curve: higher values yield a flatter curve, which in turn means that asking users to distinguish summaries of similar quality causes high noise.

We estimate based on the observations we made in the user study by maximising the likelihood function:

where ranges over all users and ranges over the number of preferences provided by each user. and are the summaries presented to the user in round . is the user’s preference direction function: equals 1 if is preferred by the user over , and equals 0 otherwise. By letting , we obtain . The green curve in Figure 5 is the reliability curve for the LNO with . We find that it fits well with the reliability curves of the real users. As a concrete example, consider the summary pairs in Figure 3: LNO prefers over with probability and prefers over with probability , which is consistent with our observations that 7 out of 12 users prefer over , while all users prefer over .
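For the simulation experiments, the LNO user can be implemented as below, assuming the logistic form described above with the fitted noise parameter passed in as an argument.

```python
import math
import random

class LogisticNoiseOracle:
    """Simulated user: prefers the higher-utility summary with a probability
    that grows with the utility gap (the wider-gap-less-noise behaviour)."""
    def __init__(self, noise_m: float, seed: int = 0):
        self.m = noise_m                      # fitted flatness parameter
        self.rng = random.Random(seed)

    def prefer_first(self, utility_1: float, utility_2: float) -> bool:
        """Return True if the simulated user prefers the first summary."""
        p = 1.0 / (1.0 + math.exp(-(utility_1 - utility_2) / self.m))
        return self.rng.random() < p
```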

6 Simulation Experiments

In this section, we study APRIL in a simulation setup. We use the LNO-based user model with the noise parameter fitted in §5 to simulate user preferences. We separately study the first and the second stage of APRIL, comparing multiple active learning and RL techniques in each stage. Then, we combine the best-performing strategy from each stage to build the overall APRIL pipeline and compare our method with SPPI. We perform our experiments on three multi-document summarisation benchmark datasets from the Document Understanding Conferences (DUC; https://duc.nist.gov/): DUC’01, DUC’02 and DUC’04. Table 2 shows the main properties of these datasets. To ease reading, we summarise the parameters used in our simulation experiments in Table 3.

Dataset # Topic # Doc SumLen # Sent/Topic
DUC’01 30 308 100 378
DUC’02 59 567 100 271
DUC’04 50 500 100 265
Table 2: For our experiments, we use standard benchmark datasets from the Document Understanding Conference (DUC). # Doc: the overall number of documents across all topics. SumLen: the length of each summary (in tokens). # Sent/Topic: average number of sentences in a topic.

6.1 APL Strategy Comparison

We compare our AL-based querying strategy introduced in §4.1 (see Eq. (9)) with three baseline AL strategies:

  • Random: In each interaction round, select a new candidate summary from uniformly at random and ask the user to compare it to the old one from the previous interaction round. In the first round, we randomly select two summaries to present.

  • J&N is the robust query selection algorithm proposed by Jamieson and Nowak (2011). It assumes that the items’ preferences depend on their distances to an unknown reference point in the embedding space: the farther an item is from the reference point, the more preferred the item is. After each round of interaction, the algorithm uses all collected preferences to locate the area in which the reference point may fall and identifies the query pairs that can reduce the size of this area, termed ambiguous query pairs. To combat noise in the preferences, the algorithm selects the most-likely-correct ambiguous pair to query the oracle in each round.

  • Gibbs: This is the querying strategy used in SPPI. In each round, it selects summaries with the Gibbs distribution (Eq. 6), and updates the weights for the utility function as in line 5 in Alg. 1.

Note that Gibbs presents two new summaries to the user in each round, while the other querying strategies we consider present only one new summary per round (see §4.1). Thus, for the same number of interaction rounds, the user has to read roughly twice as many summaries with Gibbs as with the other querying strategies.

Parameter Description
For APL (stage 1 in APRIL); see Alg. 2:
query round budget
Summary DB size for each cluster (see §4.1)
heuristics-based prior reward (see §4.1 and Eq. (8)); we use the reward heuristics proposed by Ryang and Abekawa (2012)
trade-off between prior and posterior rewards (see Eq. (8))
learning rate for preference learning
vectorised representation of summary for document cluster (see Eq. (8)); we use the same vector representation as Rioux et al. (2014)
weights of the preference learning strategies (see Eq. (9); selection details presented in §6.1)
For RL (stage 2 in APRIL); see Alg. 3:
episode budget
update frequency in NTD
neural approximation of the V-values (see §6.2 for setup details)
For SPPI; see Alg. 1:
learning rate in SPPI.
Table 3: Overview of the parameters used in simulation experiments.

To find the best weights , , and for our AL querying strategy in Eq. (9), we run grid search: We select each weight from and ensure that the sum of the four weights is 1.0. The query budget was set to 10, 50 and 100. For each cluster , we generated 5,000 extractive summaries to construct . Each summary contains no more than 100 words, generated by randomly selecting sentences in the original documents in . The prior used in Eq. 8 is the reward function proposed by Ryang and Abekawa (2012), and we set the trade-off parameter to . All querying strategies we test take less than 500 ms to decide the next summary pair to present.

The performance of the querying strategies is measured by the quality of their resulting reward function (see Eq. (8)). For each cluster, we measure the quality of the learnt reward by its Spearman’s rank correlation (Spearman, 1904) with the gold-standard utility scores (Eq. (12)) over all summaries in the Summary DB. We normalise the learnt reward to the same range (i.e. [0, 10]). For the vector representation of summaries, we use the same 200-dimensional bag-of-bigram representation as Rioux et al. (2014).
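This evaluation amounts to a rank correlation between the learnt rewards and the gold utilities over the summary DB; a sketch using SciPy:

```python
import numpy as np
from scipy.stats import spearmanr

def reward_quality(reward_scores, gold_utilities):
    """Spearman's rank correlation between the learnt rewards and the
    gold-standard utilities over the same set of candidate summaries."""
    rho, _ = spearmanr(np.asarray(reward_scores), np.asarray(gold_utilities))
    return rho
```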

Strategy budget=10 budget=50 budget=100
Random .232 .235 .243
J&N .238 .240 .247
Gibbs .246 .275 .289
Utility-gap only .236 .241 .261
Diversity only .288 .297 .319
Density only .211 .238 .263
Uncertainty only .257 .285 .303
BestCombination .288 .298 .320
Lower bound (no interaction, prior only): .194
Table 4: Spearman’s rank correlation between the learnt reward and the gold-standard utility, averaged over 20 independent runs on all clusters in DUC’01. The reward is learnt with different querying strategies and query budgets of 10, 50 and 100. The lower bound prohibits all interactions and uses only the heuristics-based prior in Eq. 8. Results marked with an asterisk are significantly better than all baselines.

Table 4 compares the performance of different querying strategies. We find that all querying strategies outperform the zero-interaction lower bound even with 10 rounds of interaction, suggesting that even collecting a small number of preferences helps to improve the quality of the learnt reward. Among all baseline strategies, Gibbs significantly outperforms the other two (we used two-tailed t-tests to compute the p-values), and we believe the reason is that Gibbs exploits the wider-gap-less-noise assumption (see §5). Of all 56 possible AL weight combinations, 48 combinations outperform the Random and J&N baselines, and 27 outperform Gibbs. This shows the overall strength of our AL-based strategy. The best combination of the weights is closely followed by using the diversity-based strategy alone. We believe the reason behind the effectiveness of the diversity-based strategy is that it not only exploits the wider-gap-less-noise assumption by querying dissimilar summaries, but also explores summaries from different areas of the embedding space, which helps generalisation. Due to its simplicity, we henceforth use the diversity-based strategy alone, since its performance is almost identical to, and has no statistically significant difference from, the best combination.

6.2 RL Comparison

We compare NTD (Alg. 3) to two baselines: TD (Sutton, 1984) and LSTD (Boyan, 1999). TD has been successfully used by Ryang and Abekawa (2012) and Rioux et al. (2014) for EMDS. LSTD improves TD by using least-squares optimisation, and it has been shown to perform better than TD in large-scale problems (Lagoudakis and Parr, 2003). Note that both TD and LSTD use linear models to approximate the V-values.

We use the following settings, which yield good performance in pilot experiments: a learning episode budget of 3,000 and a fixed learning step size in TD and NTD. For NTD, the input of the V-value network is the same 200-dimensional draft summary representation as in Rioux et al. (2014); after the input layer, we add a fully connected ReLU (Glorot et al., 2011) layer with dimension 100 as the first hidden layer, followed by an identical fully connected 100-dimensional ReLU layer as the second hidden layer; finally, a linear output layer outputs the V-value. Fig. 6 illustrates the structure of the V-value network. We use Adam (Kingma and Ba, 2014) as the gradient optimiser (line 10 in Alg. 3) with its default setup. For LSTD, we initialise its square matrix as a diagonal matrix and let the diagonal elements be random numbers between 0 and 1, as suggested by Lagoudakis and Parr (2003).

Figure 6: The structure of the -values network used in NTD.
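In PyTorch-like code (illustrative, not the released implementation), the described network corresponds to:

```python
import torch.nn as nn

# 200-dim draft-summary features -> two 100-dim ReLU layers -> scalar V-value,
# matching the architecture described above.
value_net = nn.Sequential(
    nn.Linear(200, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 1),
)
```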

The rewards we use here are based on the gold-standard utility defined in Eq. (12). Note that this serves as the upper-bound performance, because the gold-standard scoring function is not accessible in the interactive settings (see §6.3 and §7). We measure the performance of the three RL algorithms by the quality of their generated summaries in terms of multiple ROUGE scores. Results are presented in Table 5. NTD outperforms the other two RL algorithms by a large margin. We attribute this to its more precise approximation of the V-values using the neural network.

Dataset RL R-1 R-2 R-L R-SU4
DUC’01 NTD .452 .169 .359 .177
TD .442 .161 .349 .172
LSTD .432 .151 .362 .179
DUC’02 NTD .483 .181 .379 .193
TD .475 .179 .374 .189
LSTD .462 .163 .363 .183
DUC’04 NTD .492 .189 .391 .203
TD .473 .174 .378 .192
LSTD .457 .156 .360 .182
Table 5: NTD outperforms the other TD algorithms across all DUC datasets. All results are averaged over 10 independent runs across all topics in each dataset. Asterisk: significant advantage.

In terms of computation time (all RL experiments were performed on a workstation with a quad-core CPU and 8 GB RAM, without using GPUs), TD takes around 30 seconds to finish the 3,000 episodes of training and produce a summary, NTD takes around 2 minutes, while LSTD takes around 5 minutes. Since the RL computation is performed only once after all online interaction has finished, we find this computation time acceptable. However, without using the Summary DB as the memory replay, NTD takes around 10 minutes to run 3,000 episodes of training.

6.3 Full System Performance

We compare SPPI with two variants of APRIL: APRIL-TD and APRIL-NTD, which use TD and NTD, respectively. Both implementations of APRIL use the diversity-based AL strategy. All other parameter values are the same as those described in §6.1 and §6.2 (see Table 3).

SPPI .323 .068 .259 .098
APRIL-TD .324 .070 .257 .099
APRIL-NTD .325 .069 .260 .100
SPPI .323 .068 .259 .099
APRIL-TD .338 .075 .268 .105
APRIL-NTD .339 .075 .269 .106
SPPI .325 .067 .261 .099
APRIL-TD .340 .081 .271 .106
APRIL-NTD .345 .082 .276 .107
SPPI .325 .070 .261 .100
APRIL-TD .349 .083 .275 .113
APRIL-NTD .357 .086 .281 .115
Table 6: Results with increasing numbers of interaction rounds with the LNO-based simulated user. All results are averaged over all document clusters in DUC’01. Asterisk: significantly outperforms SPPI. Dagger: significantly outperforms both SPPI and APRIL-TD.

Results on DUC’01 are presented in Table 6. When no interaction is allowed, we find that the performance of the three algorithms shows no significant differences. As the number of interaction rounds increases, the gap between both APRIL implementations and SPPI becomes larger, suggesting the advantage of APRIL over SPPI. Also note that for the smaller query budgets APRIL-NTD does not have a significant advantage over APRIL-TD, but for the largest budget APRIL-NTD significantly outperforms APRIL-TD in terms of ROUGE-1 and ROUGE-L. This is because when the number of collected preferences is small, the learnt reward function contains much noise (i.e. it has low correlation with the gold-standard utility; see Table 4), and the poor quality of the learnt reward limits the advantage of the NTD algorithm. The problem is relieved as the number of interactions increases. The above observations also apply to the experiments on DUC’02 and DUC’04; their results are presented in Tables 7 and 8, respectively.

SPPI .350 .077 .278 .112
April-TD .351 .078 .278 .113
April-NTD .350 .078 .279 .112
SPPI .349 .076 .277 .111
April-TD .359 .084 .281 .116
April-NTD .361 .085 .283 .116
SPPI .351 .077 .279 .112
April-TD .361 .083 .283 .117
April-NTD .364 .086 .287 .118
SPPI .351 .078 .277 .113
April-TD .368 .088 .290 .123