Unsupervised Dual-Cascade Learning with Pseudo-Feedback Distillation for Query-based Extractive Summarization

11/01/2018 · Haggai Roitman, et al. · IBM

We propose Dual-CES -- a novel unsupervised, query-focused, multi-document extractive summarizer. Dual-CES is designed to better handle the tradeoff between saliency and focus in summarization. To this end, Dual-CES employs a two-step dual-cascade optimization approach with saliency-based pseudo-feedback distillation. Overall, Dual-CES significantly outperforms all other state-of-the-art unsupervised alternatives. Dual-CES is even shown to be able to outperform strong supervised summarizers.


1 Introduction

The vast amount of textual data end users need to consume motivates the need for automatic summarization [7]. An automatic summarizer gets as input one or more documents and possibly a limit on summary length (e.g., a maximum number of words). The summarizer then needs to produce a textual summary that captures the most salient (general and informative) content parts within the input documents. Moreover, the summarizer may also be required to satisfy a specific user information need, expressed by one or more queries; in that case, it needs to produce a focused summary which includes the information most relevant to that need.

1.1 Motivation

While both the saliency and focus goals should be considered within a query-focused summarization setting, the two goals may actually conflict with each other [2]. Higher saliency usually comes at the expense of lower focus and vice versa. Moreover, this tradeoff may directly depend on summary length.

Figure 1: Illustration of the tradeoff between summary saliency and focus goals using varying summary length upper bounds (DUC 2007 dataset).

To illustrate the effect of summary length on this tradeoff, Figure 1 reports, for the DUC 2007 dataset, the summarization quality obtained by the Cross Entropy Summarizer (CES), a state-of-the-art unsupervised query-focused multi-document extractive summarizer [6]. Saliency was measured by the cosine similarity between the summary's bigram representation and that of the input documents. Focus was measured by how much the summary's induced unigram model is "concentrated" around query-related words.

As we can observe in Figure 1, as the summary length limit is relaxed and a lengthier summary is allowed, saliency increases at the expense of focus. Leaning towards more saliency yields better coverage of general and more informative content, yet it results in the inclusion of content that is less relevant to the specific information need at hand.

1.2 Towards a better tradeoff handling

Aiming to better handle the saliency-versus-focus tradeoff, in this work we propose Dual-CES, an extension of the CES summarizer [6]. Like CES, Dual-CES is an unsupervised, query-focused, multi-document, extractive summarizer. To this end, like CES, Dual-CES utilizes the Cross Entropy method [21] to select a subset of sentences extracted from the input documents whose combination is predicted to produce a good summary.

Yet, unlike CES, Dual-CES does not attempt to address both saliency and focus in a single optimization step. Instead, Dual-CES implements a novel two-step dual-cascade optimization approach built from two sequential CES-like invocations. Using this approach, Dual-CES handles the tradeoff by gradually shifting from generating a long, more salient summary in the first step to generating a short, more focused summary in the second. Moreover, Dual-CES utilizes the long summary generated in the first step for saliency-based pseudo-feedback distillation, which allows it to generate a final focused summary with better saliency. Dual-CES thus provides a fully unsupervised, end-to-end, query-focused multi-document extractive summarization solution.

Using an evaluation over the DUC 2005, 2006 and 2007 benchmarks, we show that Dual-CES generates a focused (and shorter) summary with much higher saliency, and hence better tradeoff handling. Overall, Dual-CES provides significantly better summarization quality than alternative unsupervised summarizers; in many cases, it even outperforms state-of-the-art supervised summarizers.

2 Related Work

In this work we employ an unsupervised learning approach for the task of query-based multi-document extractive summarization. Many previous works have employed various unsupervised and/or supervised learning methods for the same task. Some systems rank sentences based on their surface and/or graph-level features [3, 15, 18]. Others have used sparse coding techniques to select a subset of sentences that minimizes a given document-reconstruction error [12, 26, 16, 9, 11], or a variational auto-encoder for sentence representation [13].

Attention models incorporated within deep-learning summarization architectures have further been suggested for improving sentence ranking and selection [1, 12, 20]. Such models try to simulate human attentive reading behaviour, which allows context-sensitive features to be better accounted for during summarization. Compared to these works, we do not attend for sentence ranking or selection; instead, we distill informative hints from the summarized documents, aiming to improve the saliency of the produced focused summaries.

Finally, reinforcement learning methods have recently been considered [4, 6, 17, 19]. Among these, the CES summarizer [6] is the only one that is both query-sensitive and unsupervised. Like CES, we utilize the Cross Entropy (CE) method [21], a global policy-search optimization framework, for solving the sentence subset selection problem. Yet, unlike CES, we invoke the CE method twice, each time with a slightly different summarization goal in mind (first saliency, then focus). Moreover, we utilize the distilled saliency-based pseudo-feedback to improve the summarization policy search between these switched (dual) goals. To the best of our knowledge, this on its own is a novel aspect of our work.

3 Background

Here we provide background details on our summarization task and the Cross Entropy method which we use for implementing Dual-CES.

3.1 Summarization task

We address the query-focused, multi-document summarization task. Formally, let q denote a user information need for document summarization, which may be expressed by one or more queries. Let D denote a set of one or more matching documents to be summarized, and let L be the maximum allowed summary length (in words).

We implement an extractive summarization approach. Our goal is to produce a length-limited summary by extracting salient content parts of D that are also relevant (focused) with respect to q.

Following [6], we cast the summarization task as a sentence subset selection problem. To this end, we produce a summary S (with maximum length L) by choosing the subset of sentences in D that maximizes a given quality target Q(S).

3.2 Unsupervised summarization

Dual-CES is an unsupervised summarizer. Similar to CES, it utilizes the Cross Entropy method [21] for selecting the most "promising" subset of sentences in D. Since we assume an unsupervised setting, no actual reference summaries are available for training, nor can we directly optimize an actual quality target. Instead, following [6], Q(S) is "surrogated" by several summary quality prediction measures. Each "predictor" q_i(S) is designed to estimate the level of saliency or focus of a given candidate summary S and is presumed to correlate (to some extent) with actual summarization quality, e.g., ROUGE [14]. For simplicity, similar to CES, the various predictions are assumed to be independent and are combined into a single optimization objective by taking their product, i.e., Q(S) = ∏_i q_i(S).

3.3 Using the Cross Entropy method

The CE-method provides a generic Monte-Carlo optimization framework for solving hard combinatorial problems [21]. Previously, it was utilized for solving the sentence subset selection problem [6].

To this end, the CE-method receives as input the document set D, a constraint L on maximum summary length, a quality target Q(·), and an optional pseudo-reference summary S_ref, whose usage will be explained later on. Let CE(D, L, Q(·)[, S_ref]) denote a single invocation of the CE-method. The result of such an invocation is a single length-feasible summary containing a subset of sentences selected from D that maximizes Q(·). For example, CES is implemented by a single such invocation CE(D, L, Q(·)).

We next briefly explain how the CE-method solves this problem. For a given sentence s_i in D, let p_i denote the likelihood that it should be included in summary S. Starting with the selection policy having the highest entropy (i.e., p_i^(0) = 0.5 for all i), the CE-method learns a selection policy that maximizes Q(·). This policy is incrementally learned using an importance sampling approach [21]. At each iteration t, a sample of N sentence subsets S_1, ..., S_N is generated according to the selection policy learned in the previous iteration t-1. The likelihood of picking a sentence s_i at iteration t is then estimated (via cross-entropy minimization) as follows:

p_i^(t) = ( Σ_{j=1..N} δ[Q(S_j) ≥ γ_t] · δ[s_i ∈ S_j] ) / ( Σ_{j=1..N} δ[Q(S_j) ≥ γ_t] )    (1)

Here, δ[·] denotes the Kronecker-delta (indicator) function and γ_t denotes the (1-ρ)-quantile (0 < ρ < 1) of the sample performances Q(S_1), ..., Q(S_N). Therefore, the likelihood of picking a sentence increases when it is included in more (subset) samples whose performance is above the current minimum required quality target value γ_t. We further smooth p_i^(t) as follows: p_i^(t) ← α · p_i^(t) + (1-α) · p_i^(t-1), with a smoothing hyperparameter 0 < α < 1 [21].

Upon its termination, the CE-method is expected to converge to the globally optimal selection policy [21], from which we then produce a single summary. To enforce that only feasible summaries are produced, following [6], we set Q(S) = 0 whenever a sampled summary's length exceeds the word limit.
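For concreteness, the following Python sketch mirrors this procedure: per-sentence Bernoulli inclusion probabilities, elite-sample selection at the (1-ρ)-quantile, the Eq. (1) update with smoothing, and Q(S) = 0 for over-length samples. The hyperparameter defaults and helper conventions here are illustrative assumptions, not the values used in [6].

```python
import random

def cross_entropy_select(sentences, quality, max_words,
                         n_samples=1000, rho=0.05, alpha=0.7, n_iters=30):
    """Sketch of the CE-method for sentence-subset selection.

    `sentences` is a list of (text, word_count) pairs and `quality` maps a
    list of selected indices to a positive score Q(S).
    """
    n = len(sentences)
    p = [0.5] * n  # maximum-entropy starting policy: p_i^(0) = 0.5

    for _ in range(n_iters):
        samples = []
        for _ in range(n_samples):
            subset = [i for i in range(n) if random.random() < p[i]]
            length = sum(sentences[i][1] for i in subset)
            # infeasible (over-length) samples are forced to Q(S) = 0
            q = quality(subset) if length <= max_words else 0.0
            samples.append((subset, q))

        # gamma_t: the (1 - rho)-quantile of the sample performances
        scores = sorted(q for _, q in samples)
        gamma = scores[int((1.0 - rho) * (len(scores) - 1))]
        elite = [s for s, q in samples if q >= gamma and q > 0] \
                or [s for s, _ in samples]

        # Eq. (1): fraction of elite samples containing each sentence,
        # then smoothed against the previous policy
        for i in range(n):
            p_new = sum(1 for s in elite if i in s) / len(elite)
            p[i] = alpha * p_new + (1.0 - alpha) * p[i]

    # emit one feasible summary from the converged policy, greedily
    # taking the most likely sentences that still fit the budget
    chosen, used = [], 0
    for i in sorted(range(n), key=lambda i: -p[i]):
        if used + sentences[i][1] <= max_words:
            chosen.append(i)
            used += sentences[i][1]
    return chosen
```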

4 The Dual-CES summarizer

Figure 2: Dual-CES implementation flow

Unlike CES, Dual-CES does not attempt to maximize both the saliency and focus goals in a single optimization step. Instead, Dual-CES implements a novel two-step dual-cascade optimization approach (see Figure 2), which utilizes two CES-like invocations. Both invocations consider the same sentence-powerset solution space, yet each utilizes a slightly different set of summary quality predictors, depending on whether the summarizer's goal leans towards higher summary saliency or higher focus.

In the first step, Dual-CES relaxes the summary length constraint, aiming to produce a longer and more salient summary. This summary is then treated as a pseudo-effective reference summary from which saliency-based pseudo-feedback is distilled. This pseudo-feedback is then utilized in the second step of the cascade to set an additional auxiliary saliency-driven goal. Yet, in the second step, similar to CES, the primary goal is to produce a focused summary (with maximum length limit L). Overall, Dual-CES is simply implemented as follows:

S_ref ← CE(D, L', Q_s(·));    S* ← CE(D, L, Q_f(· ; S_ref))

Here, Q_s(·) and Q_f(·) denote the saliency and focus summary quality objectives optimized during the cascade, respectively. Both Q_s(·) and Q_f(·) are implemented as products of several basic predictors. L' (with L' > L) denotes the relaxed summary-length hyperparameter. We next elaborate on the implementation details of Dual-CES's dual optimization steps.
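To make the cascade concrete, the following sketch wires the two CE invocations together, with the step-1 output supplying pseudo-feedback for step 2. It reuses cross_entropy_select from the sketch above; the objective callables, the factory pattern, and the default budgets are illustrative plumbing assumptions, not the paper's notation.

```python
def dual_ces(sentences, q_saliency, q_focus_factory, L=250, L_prime=1500):
    """Sketch of the dual-cascade flow.

    q_saliency(S) scores a candidate subset for saliency (step 1);
    q_focus_factory(long_summary) builds the focus objective once the
    step-1 pseudo-reference is known (step 2).
    """
    # Step 1: saliency-oriented pass with the relaxed budget L' > L
    long_summary = cross_entropy_select(sentences, q_saliency,
                                        max_words=L_prime)

    # Distill pseudo-feedback from the long pseudo-reference summary:
    # its most frequent unigrams (Predictor 7) and the average sentence
    # start position (re-tuning Predictor 2's hyperparameter).
    q_focus = q_focus_factory(long_summary)

    # Step 2: focus-oriented pass under the real budget L
    return cross_entropy_select(sentences, q_focus, max_words=L)
```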

4.1 Step 1: Saliency-oriented summarization

The purpose of the first step is to produce a single longer summary (with length limit L' > L) to be used as a pseudo-reference for saliency-based feedback distillation. As illustrated in Figure 1, with a longer summary length, a more salient summary may be produced.

This step is simply implemented by invoking the CE-method as CE(D, L', Q_s(·)). The target measure Q_s(·) guides the optimization towards producing a summary with the highest possible saliency. Similar to CES, Q_s(·) is calculated as the product of several summary quality predictors. Overall, we use five different predictors; four of them (Predictors 1, 2, 3 and 5) were previously used in CES [6], while the additional one that we introduce (Predictor 4) is designed to "drive" the optimization even further towards higher saliency. We next shortly describe each predictor.

4.1.1 Predictor 1: coverage

This predictor estimates to what extent a (candidate) summary S (generally) covers the document set D. Here, we represent both S and D as term-frequency vectors, considering only bigrams, which commonly represent more important content units [6]. The coverage predictor is then defined by the similarity between the two bigram vectors (consistent with the bigram cosine saliency measure of Figure 1).
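A minimal sketch of this predictor, assuming plain whitespace tokenization and cosine similarity over bigram counts (the exact vector similarity used in [6] may differ in detail):

```python
import math
from collections import Counter

def bigrams(text):
    """Bigram term-frequency vector of a text, as a Counter."""
    toks = text.lower().split()
    return Counter(zip(toks, toks[1:]))

def coverage(summary_text, docs_text):
    """Predictor 1 sketch: bigram cosine similarity between S and D."""
    s, d = bigrams(summary_text), bigrams(docs_text)
    dot = sum(c * d[b] for b, c in s.items())
    norm = math.sqrt(sum(c * c for c in s.values())) * \
           math.sqrt(sum(c * c for c in d.values()))
    return dot / norm if norm else 0.0
```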

4.1.2 Predictor 2: position-bias

This predictor biases sentence selection towards sentences that appear earlier in their containing documents. It is calculated as a product, over the selected sentences, of a decaying function of pos(s), where pos(s) is the relative start position (in characters) of sentence s in its containing document and β is a position-bias hyperparameter (fixed following [6]).
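A sketch under the assumption that the decay is exponential in the relative start position; since only the predictor's qualitative behavior is specified above, both the functional form and the default β are illustrative:

```python
import math

def position_bias(sentence_positions, beta=0.5):
    """Predictor 2 sketch: rewards subsets whose sentences start early.

    `sentence_positions` holds each selected sentence's relative start
    position in [0, 1]; earlier sentences contribute factors closer to 1.
    """
    score = 1.0
    for pos in sentence_positions:
        score *= math.exp(-pos / beta)
    return score
```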

4.1.3 Predictor 3: summary length

This predictor biases selection towards summaries that are closer to the maximum permitted length. Such summaries contain fewer and longer sentences, and therefore tend to be more informative. Let len(x) denote the length of text x (in number of words), where x may be either a single sentence s or a whole summary S. The predictor is then calculated as an increasing function of len(S) relative to the permitted length limit.

4.1.4 Predictor 4: asymmetric coverage

To target even higher saliency, we suggest a fourth predictor, inspired by the risk minimization framework [27]. To this end, we measure the Kullback-Leibler (KL) "similarity" between the two (unsmoothed) unigram language models induced from the centroid representations of S (denoted θ_S) and D (denoted θ_D); formally, the predictor is the exponent of the negative KL divergence between θ_S and θ_D. (Such a centroid representation is simply given by concatenating the text of the sentences in S or the documents in D.)
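A sketch of this predictor, reading the "KL similarity" as the exponentiated negative KL divergence from the summary's unigram model to the documents' model; the divergence direction shown is one plausible reading of the asymmetric coverage described above:

```python
import math
from collections import Counter

def unigram_lm(text):
    """Unsmoothed unigram language model of a text."""
    toks = text.lower().split()
    if not toks:
        return {}
    return {w: c / len(toks) for w, c in Counter(toks).items()}

def asym_coverage(summary_text, docs_text):
    """Predictor 4 sketch: exp(-KL(theta_S || theta_D))."""
    theta_s, theta_d = unigram_lm(summary_text), unigram_lm(docs_text)
    kl = sum(p * math.log(p / theta_d[w])
             for w, p in theta_s.items() if w in theta_d)
    return math.exp(-kl)
```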

4.1.5 Predictor 5: focus-drift

While producing a longer summary may result in higher saliency, as further illustrated in Figure 1, such a summary may be less focused. Hence, to avoid such focus-drift, even though we optimize for higher saliency at this step, the target information need should still be considered. To this end, we add a predictor that acts as a "query anchor" and measures to what extent summary S's unigram model is devoted to the information need q.
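A sketch of the query-anchor, assuming the "devoted" mass is the total unigram-model probability the summary assigns to query-related words (reusing unigram_lm from the previous sketch):

```python
def focus_drift(summary_text, query_terms):
    """Predictor 5 sketch: probability mass of query-related words under
    the summary's unigram model.  Summing per-term probabilities is an
    assumed instantiation of the anchor."""
    theta_s = unigram_lm(summary_text)
    return sum(theta_s.get(w, 0.0) for w in query_terms)
```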

4.2 Step 2: Focus-oriented summarization

The input to the second step of the cascade consists of the same document set D, the summary length constraint L, and the pseudo-reference summary S_ref that was generated in the previous step. This step is simply implemented by invoking the CE-method as CE(D, L, Q_f(· ; S_ref)). Here, the target measure Q_f(·) guides the optimization towards producing a focused summary, while still keeping saliency as high as possible. To achieve this, we use an additional focus-driven predictor which biases summary production towards higher focus. Moreover, using the pseudo-reference summary S_ref, we introduce an additional auxiliary saliency-based predictor, whose goal is to enhance the saliency of the produced focused summary. Overall, Q_f(·) is calculated as the product of the previous five summary quality predictors (Predictors 1-5) and the two additional predictors, whose details are described next.

4.2.1 Predictor 6: query-relevancy

This predictor estimates the relevancy of summary S to q. For that, we use two similarity measures. The first, following [6], is the Bhattacharyya similarity (coefficient) between the two (unsmoothed) unigram language models of q and S, i.e., Σ_w √(θ_q(w) · θ_S(w)). The second is the cosine similarity between the unigram term-frequency representations of q and S. The two similarity measures are then combined into a single measure by taking their geometric mean.
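A sketch combining the two similarities as described, with whitespace tokenization assumed (reusing unigram_lm from above):

```python
import math
from collections import Counter

def query_relevancy(summary_text, query_text):
    """Predictor 6 sketch: geometric mean of the Bhattacharyya coefficient
    (over unsmoothed unigram models) and the unigram TF cosine similarity."""
    theta_q, theta_s = unigram_lm(query_text), unigram_lm(summary_text)
    bc = sum(math.sqrt(p * theta_s[w])
             for w, p in theta_q.items() if w in theta_s)

    tf_q = Counter(query_text.lower().split())
    tf_s = Counter(summary_text.lower().split())
    dot = sum(c * tf_s[w] for w, c in tf_q.items())
    norm = math.sqrt(sum(c * c for c in tf_q.values())) * \
           math.sqrt(sum(c * c for c in tf_s.values()))
    cos = dot / norm if norm else 0.0

    return math.sqrt(bc * cos)
```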

4.2.2 Predictor 7: reference summary (distillation) coverage

We further make use of the pseudo-reference summary S_ref, which was produced in the first step, and introduce an additional auxiliary saliency-based predictor. This predictor utilizes pseudo-feedback distilled from the unique unigram words of S_ref: it measures to what extent a candidate summary S covers these distilled words. Following [10, 27], we only consider the top-100 most frequent unigrams in S_ref.

Intuitively speaking, S_ref will usually be longer (in words) than any candidate summary that may be chosen in the second step; hence, S_ref is expected to be more salient. Therefore, this predictor is expected to "drive" the optimization to prefer those candidate summaries that include as many salient words from S_ref as possible, acting as if they were by themselves longer (and more salient) summaries, over candidates that include fewer such salient words.
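A sketch of the distillation step and the predictor, assuming the coverage measure is the fraction of the top-100 distilled words that appear in the candidate summary:

```python
from collections import Counter

def distill_feedback(reference_summary_text, k=100):
    """Top-k most frequent unigrams of the step-1 pseudo-reference."""
    counts = Counter(reference_summary_text.lower().split())
    return {w for w, _ in counts.most_common(k)}

def distillation_coverage(summary_text, feedback_words):
    """Predictor 7 sketch: rewards candidates covering many distilled
    salient words.  Plain fractional overlap is an assumed instantiation."""
    present = set(summary_text.lower().split())
    return len(present & feedback_words) / max(1, len(feedback_words))
```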

4.2.3 Adaptive hyperparameter adjustment

Apart from the salient words in S_ref that are used as feedback, we note that the sentences in S_ref may also provide additional "hints" about other properties of informative sentences in D that may potentially be selected to improve saliency. One such property is the relative start position of sentences in S_ref. We therefore assign the average start position of the feedback sentences in S_ref as the value of the position-bias hyperparameter β within Q_f(·) (Predictor 2).
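A minimal sketch of this adjustment, assuming feedback sentences are carried as (text, relative_start) pairs, a representation introduced here for illustration:

```python
def adjusted_beta(feedback_sentences):
    """Section 4.2.3 sketch: re-tune Predictor 2's position-bias
    hyperparameter to the mean relative start position of the
    step-1 feedback sentences."""
    return sum(pos for _, pos in feedback_sentences) / len(feedback_sentences)
```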

4.3 An extension: Length-adaptive Dual-CES

We conclude this section with an extension to Dual-CES that adaptively adjusts the value of the hyperparameter L'. To this end, we introduce a new learning parameter l^(t), which defines the maximum length limit for summary production (sampling) allowed at iteration t of the CE-method. We now assume that summary lengths follow a Poisson distribution of word occurrences with mean l. Using importance sampling, this parameter is estimated at iteration t as follows:

l^(t) = ( Σ_{j=1..N} δ[Q(S_j) ≥ γ_t] · len(S_j) ) / ( Σ_{j=1..N} δ[Q(S_j) ≥ γ_t] )    (2)

Similar to p_i^(t), we further smooth l^(t) as follows: l^(t) ← α · l^(t) + (1-α) · l^(t-1), where α is the same smoothing hyperparameter used to smooth the selection policy.
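A sketch of the Eq. (2) update: the elite samples' mean length is the maximum-likelihood estimate of the Poisson mean, smoothed exactly as the selection policy is. Names and the default α are illustrative:

```python
def update_length_limit(samples, gamma, prev_limit, alpha=0.7):
    """Length-adaptive update (Eq. 2) for Dual-CES-A.

    `samples` is a list of (summary_word_count, quality) pairs from the
    current CE iteration; `gamma` is the iteration's elite threshold."""
    elite = [length for length, q in samples if q >= gamma]
    if not elite:
        return prev_limit
    new_limit = sum(elite) / len(elite)  # Poisson-mean MLE over elites
    return alpha * new_limit + (1.0 - alpha) * prev_limit
```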

5 Evaluation

5.1 Datasets

Our evaluation is based on the Document Understanding Conferences (DUC) 2005, 2006 and 2007 benchmarks (http://www-nlpir.nist.gov/projects/duc/data.html). These benchmarks are commonly used for evaluating the query-based multi-document summarization task by all of the related works we compare to. Given a topic statement, which is expressed by one or more questions, and a set of English documents, the main task is to produce a 250-word (i.e., L = 250) topic-focused summary [5]. The DUC 2005, 2006 and 2007 benchmarks contain 50, 50 and 45 topics, respectively, with 25-50, 25 and 25 documents to be summarized per topic. Each document was pre-segmented (by NIST) into sentences. Following [6], we use Lucene's English analysis (https://lucene.apache.org/) for processing the text of topics and documents.

5.1.1 Dual-CES implementation

We evaluated both Dual-CES and its adaptive-length variant (hereinafter denoted Dual-CES-A). For the first, saliency-driven step, Dual-CES's (strict) upper-bound limit on summary length was fixed to L' = 1500 words (see the sensitivity analysis in Section 5.3.4). Dual-CES-A, on the other hand, adaptively adjusts this length limit during the CE-method's run. Both variants were further set with the summary limit L = 250 for their second, focus-driven step.

We implemented both Dual-CES and Dual-CES-A in Java (JRE 8). Further following [6], to reduce the CE-method's runtime, we applied a preliminary sentence-pruning step, where only the top-150 sentences with the highest (unigram) Bhattacharyya similarity to the topic's queries were considered for summarization. The CE-method hyperparameters (sample size, elite quantile ρ and smoothing factor α) were fixed to the same values used in [6].

Finally, to handle DUC's complex information needs, we closely followed [6], as follows. First, for each summarized topic, we calculated the query-focused predictions (i.e., Predictors 5 and 6) per each one of its questions. To this end, each question was represented as a sub-query by concatenating the main topic's text to the question's text. Each sub-query was further expanded with its top-scoring (unigram) Wikipedia related-words [25]. We then obtained the topic's query-sensitive predictions by summing up its various sub-queries' predictions.

5.1.2 Evaluation measures

The three DUC benchmarks include four reference (ground-truth) human-written summaries per topic [5]. We measured summarization quality using ROUGE [14], the official measure for this task [5]. To this end, we used the ROUGE toolkit with its standard parameter settings (ROUGE-1.5.5.pl -a -c 95 -m -n 2 -2 4 -u -p 0.5 -l 250). We report both Recall and F-Measure of ROUGE-1, ROUGE-2 and ROUGE-SU4. ROUGE-1 and ROUGE-2 measure the overlap in unigrams and bigrams between the produced and the reference summaries, respectively; ROUGE-SU4 measures the overlap in skip-bigrams separated by up to four words.

Finally, since Dual-CES depends on the CE-method, which is stochastic in nature, its quality may depend on the specific seed used for random sampling. Hence, following [6], to reduce sensitivity to random seed selection, per each summarization task (i.e., topic-documents pair), we ran each Dual-CES variant 30 times (each time with a different random seed) and recorded its mean performance (and confidence interval).
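A sketch of this protocol's reporting step, assuming a normal-approximation 95% confidence interval over the 30 per-seed ROUGE scores:

```python
import statistics

def mean_and_ci(scores, z=1.96):
    """Mean ROUGE score over seeded runs with a normal-approximation CI."""
    mean = statistics.mean(scores)
    half = z * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, (mean - half, mean + half)
```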

5.2 Baselines

We compare the summary quality of Dual-CES to the results previously reported for several competitive summarization baselines. These baselines include both supervised and unsupervised methods and apply various strategies for handling the saliency-versus-focus tradeoff; supervised baselines are explicitly noted as such in the discussion below.

The first group of baselines utilizes various surface- and graph-level features, namely: BI-PLSA [22], CTSUM [24], HierSum [8], HybHSum [3], MultiMR [23], QODE [28] and SubMod-F [15]. The second group applies various sparse-coding or auto-encoding techniques, namely: DocRebuild [16], RA-MDS [11], SpOpt [26] and VAEs-A [13]. The third group incorporates various attention models, namely: AttSum [1], C-Attention [12] and CRSum+SF [20]. We further note that some baselines, such as DocRebuild, SpOpt and C-Attention, use hand-crafted rules for sentence compression.

Finally, we directly compare with two CES variants, which serve as direct alternatives to Dual-CES. The first is the original CES summarizer, whose results are reported in [6]. The second, denoted hereinafter CES†, utilizes Predictors 1-6 combined within a single optimized objective (by taking their product). This variant therefore allows us to directly evaluate the contribution of the dual-cascade learning approach employed by the two Dual-CES variants.

5.3 Results

The main results of our evaluation are reported in Table 1 (ROUGE F-Measure) and Table 2 (ROUGE Recall). The numbers reported for the various baselines are the best numbers reported in their respective works. Unfortunately, not all baselines reported results for all benchmarks and measures; whenever a measure is missing, we use the symbol '-'.

System        R-1         R-2         R-SU4
DUC 2005
MultiMR       36.90       6.83        -
CES           37.76(03)   7.45(03)    13.02(02)
CES†          36.94(01)   7.21(04)    12.82(04)
Dual-CES-A    38.13(07)   7.58(04)    13.24(04)
Dual-CES      38.08(06)   7.54(03)    13.17(03)
DUC 2006
RA-MDS        39.10       8.10        13.60
MultiMR       40.30       8.50        -
DocRebuild    40.86       8.48        14.45
C-Attention   39.30       8.70        14.10
VAEs-A        39.60       8.90        14.30
CES           40.46(02)   9.13(01)    14.71(01)
CES†          39.93(08)   9.02(05)    14.42(05)
Dual-CES-A    41.07(07)   9.42(06)    14.89(05)
Dual-CES      41.23(07)   9.47(04)    14.97(03)
DUC 2007
RA-MDS        40.80       9.70        15.00
MultiMR       42.04       10.30       -
DocRebuild    42.72       10.30       15.81
CTSUM         42.66       10.83       16.16
C-Attention   42.30       10.70       16.10
VAEs-A        42.10       11.10       16.40
CES           42.84(01)   11.33(01)   16.50(01)
CES†          41.90(08)   11.14(06)   16.17(05)
Dual-CES-A    43.25(06)   11.73(06)   16.80(04)
Dual-CES      43.24(07)   11.78(05)   16.83(05)
Table 1: Results of ROUGE F-Measure evaluation on the DUC 2005, 2006 and 2007 benchmarks.
System        R-1         R-2         R-SU4
DUC 2005
SubMod-F      -           8.38        -
CRSum+SF      39.52       8.41        -
BI-PLSA       36.02       6.76        -
CES           40.33(03)   7.94(02)    13.89(02)
CES†          39.56(11)   7.71(04)    13.73(05)
Dual-CES-A    40.85(07)   8.11(04)    14.19(04)
Dual-CES      40.82(06)   8.07(04)    14.13(04)
DUC 2006
AttSum        40.90       9.40        -
SubMod-F      -           9.75        -
HybHSum       43.00       9.10        15.10
CRSum+SF      41.70       10.03       -
HierSum       40.10       8.60        14.30
SpOpt         39.96       8.68        14.22
QODE          40.15       9.28        14.79
CES           43.00(01)   9.69(01)    15.63(01)
CES†          42.57(09)   9.61(06)    15.38(06)
Dual-CES-A    43.78(07)   10.04(06)   15.88(05)
Dual-CES      43.94(07)   10.09(05)   15.96(03)
DUC 2007
AttSum        43.92       11.55       -
SubMod-F      -           12.38       -
HybHSum       45.60       11.40       17.20
CRSum+SF      44.60       12.48       -
HierSum       42.40       11.80       16.70
SpOpt         42.36       11.10       16.47
QODE          42.95       11.63       16.85
CES           45.43(01)   12.02(01)   17.50(01)
CES†          44.65(01)   11.85(01)   17.21(06)
Dual-CES-A    46.01(07)   12.47(06)   17.87(04)
Dual-CES      46.02(08)   12.53(06)   17.91(05)
Table 2: Results of ROUGE Recall evaluation on the DUC 2005, 2006 and 2007 benchmarks.

5.3.1 Dual-CES vs. other baselines

First, we note that, among the various baseline methods we compared with, CES on its own serves as the strongest baseline to outperform in most cases. Overall, Dual-CES provides better results than any other baseline (and specifically the unsupervised ones) on both ROUGE-1 and ROUGE-2 F-Measure. On Recall, Dual-CES achieves the best ROUGE-1 on all benchmarks. On ROUGE-2 Recall, Dual-CES was better on the DUC 2006 and 2007 benchmarks, while it was slightly inferior to SubMod-F and CRSum+SF on DUC 2005. Yet, SubMod-F and CRSum+SF are supervised, whereas Dual-CES is fully unsupervised. Therefore, Dual-CES's ability to match, and in many cases outperform, the quality of strong supervised counterparts only further emphasizes its potential.

5.3.2 Dual-CES vs. CES variants

Dual-CES significantly improves over the two CES variants in all benchmarks, on both the F-Measure and the Recall of ROUGE-1 and ROUGE-2. By distilling saliency-based pseudo-feedback between step transitions, Dual-CES manages to better utilize the CE-method for selecting a more promising subset of sentences. A case in point is the CES† variant, which is even inferior to CES: a simple combination of all predictors (except Predictor 7, which is unique to Dual-CES since it requires a pseudo-reference summary) does not directly translate into better tradeoff handling. This, therefore, serves as strong empirical evidence of the importance of the dual-cascade optimization approach implemented by Dual-CES, which allows it to produce focused summaries with better saliency.

5.3.3 Comparison with attentive baselines

The pseudo-feedback distillation approach employed between the two steps of Dual-CES bears some resemblance to the attention models used by state-of-the-art deep-learning summarization methods [1, 12, 20]. First, we note that Dual-CES significantly improves over these attentive baselines on ROUGE-1. On ROUGE-2, Dual-CES is significantly better than C-Attention and AttSum, while it provides (more or less) similar quality to CRSum+SF.

A closer analysis of the attention strategies employed within these baselines reveals that, while AttSum attends only at the sentence-representation level, C-Attention and CRSum+SF further attend at the word level. Such finer-grained attention results in improved saliency for the latter two. Yet, while C-Attention first attends to sentences and then to words, CRSum+SF performs its attentions in the reverse order. Using Dual-CES as a reference method for comparison, CRSum+SF's strategy of attending to salient words first, and then to salient sentences based on those words, appears to be the better one.

In a sense, similar to CRSum+SF, Dual-CES also first "attends" to salient words, which are distilled from the pseudo-feedback reference summary. Dual-CES then utilizes these salient words for better selection of salient sentences within its second step of focused summary production. Yet, compared to CRSum+SF, and similar to C-Attention, Dual-CES's saliency "attention" process is unsupervised. Moreover, Dual-CES further "attends" to salient sentence positions, which results in better tuning of the position-bias hyperparameter.

L' (words)              R-1      R-2      R-SU4
500                     45.52    12.32    17.69
750                     45.84    12.46    17.85
1000                    45.88    12.48    17.84
1250                    45.91    12.50    17.86
1500                    46.02    12.53    17.91
1750                    45.99    12.46    17.87
2000                    45.97    12.44    17.83
Adaptive (Dual-CES-A)   46.01    12.47    17.87
Table 3: Sensitivity of Dual-CES (ROUGE Recall) to the value of the hyperparameter L' (DUC 2007 benchmark)
Figure 3: Illustration of the adaptive-length learning by Dual-CES-A (DUC 2007 benchmark)

5.3.4 Hyperparameter sensitivity analysis

Table 3 reports the sensitivity of Dual-CES (measured by ROUGE Recall) to the value of the hyperparameter L', using the DUC 2007 benchmark. To this end, we ran Dual-CES with increasing L' values. For further comparison, we also report in Table 3 the results of the adaptive-length variant Dual-CES-A, which adaptively adjusts this hyperparameter. Figure 3 illustrates the (average) learning curve of its adaptive-length parameter l^(t).

Overall, Dual-CES's summarization quality remains quite stable, exhibiting low sensitivity to L'. Similar stability was further observed on the two other DUC benchmarks. In addition, Figure 3 depicts an interesting empirical outcome: Dual-CES-A converges (more or less) to the best-performing hyperparameter value (i.e., L' = 1500 in Table 3). Dual-CES-A therefore serves as a robust alternative that flexibly estimates this hyperparameter value at runtime, providing quality similar to Dual-CES and sometimes outperforming it.

6 Conclusions and Future work

We proposed Dual-CES, an unsupervised, query-focused, extractive multi-document summarizer. Dual-CES was shown to better handle the tradeoff between saliency and focus, providing the best summarization quality among state-of-the-art unsupervised summarizers. Moreover, in many cases, Dual-CES even outperforms state-of-the-art supervised summarizers. As future work, we would like to learn to distill from additional pseudo-feedback sources.

References

  • [1] Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei, and Yanran Li. Attsum: Joint learning of focusing and summarization with neural attention. In COLING, 2016.
  • [2] Jaime Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pages 335–336, New York, NY, USA, 1998. ACM.
  • [3] Asli Celikyilmaz and Dilek Hakkani-Tur. A hybrid hierarchical model for multi-document summarization. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 815–824, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
  • [4] Yen-Chun Chen and Mohit Bansal. Fast abstractive summarization with reinforce-selected sentence rewriting. CoRR, abs/1805.11080, 2018.
  • [5] Hoa Trang Dang. Overview of DUC 2005. In Proceedings of the Document Understanding Conference, volume 2005, pages 1–12, 2005.
  • [6] Guy Feigenblat, Haggai Roitman, Odellia Boni, and David Konopnicki. Unsupervised query-focused multi-document summarization using the cross entropy method. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17. ACM, 2017.
  • [7] Mahak Gambhir and Vishal Gupta. Recent automatic text summarization techniques: A survey. Artif. Intell. Rev., 47(1):1–66, January 2017.
  • [8] Aria Haghighi and Lucy Vanderwende. Exploring content models for multi-document summarization. In Proceedings of the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL ’09, pages 362–370, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
  • [9] Zhanying He, Chun Chen, Jiajun Bu, Can Wang, Lijun Zhang, Deng Cai, and Xiaofei He. Document summarization based on data reconstruction. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI'12, pages 620–626. AAAI Press, 2012.
  • [10] Victor Lavrenko and W. Bruce Croft. Relevance based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’01, pages 120–127, New York, NY, USA, 2001. ACM.
  • [11] Piji Li, Lidong Bing, Wai Lam, Hang Li, and Yi Liao. Reader-aware multi-document summarization via sparse coding. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pages 1270–1276. AAAI Press, 2015.
  • [12] Piji Li, Wai Lam, Lidong Bing, Weiwei Guo, and Hang Li. Cascaded attention based unsupervised information distillation for compressive summarization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2081–2090, 2017.
  • [13] Piji Li, Zihao Wang, Wai Lam, Zhaochun Ren, and Lidong Bing. Salience estimation via variational auto-encoders for multi-document summarization. In AAAI, pages 3497–3503, 2017.
  • [14] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8. Barcelona, Spain, 2004.
  • [15] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 510–520, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
  • [16] Shulei Ma, Zhi-Hong Deng, and Yunlun Yang. An unsupervised multi-document summarization framework based on neural document model. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, pages 1514–1523, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.
  • [17] Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Ranking sentences for extractive summarization with reinforcement learning. CoRR, abs/1802.08636, 2018.
  • [18] You Ouyang, Wenjie Li, Sujian Li, and Qin Lu. Applying regression models to query-focused multi-document summarization. Information Processing & Management, 47(2):227–237, 2011.
  • [19] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. In Proceedings of the 6th International Conference on Learning Representations, ICLR '18, 2018.
  • [20] Pengjie Ren, Zhumin Chen, Zhaochun Ren, Furu Wei, Jun Ma, and Maarten de Rijke. Leveraging contextual sentence relations for extractive summarization using a neural attention model. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, pages 95–104, New York, NY, USA, 2017. ACM.
  • [21] Reuven Y. Rubinstein and Dirk P. Kroese. The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning. Springer, 2004.
  • [22] Chao Shen, Tao Li, and Chris H. Q. Ding. Integrating clustering and multi-document summarization by bi-mixture probabilistic latent semantic analysis (PLSA) with sentence bases. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI'11, pages 914–920. AAAI Press, 2011.
  • [23] Xiaojun Wan and Jianguo Xiao. Graph-based multi-modality learning for topic-focused multi-document summarization. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI'09, pages 1586–1591, San Francisco, CA, USA, 2009. Morgan Kaufmann Publishers Inc.
  • [24] Xiaojun Wan and Jianmin Zhang. CTSUM: Extracting more certain summaries for news articles. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '14, pages 787–796, New York, NY, USA, 2014. ACM.
  • [25] Yang Xu, Gareth J.F. Jones, and Bin Wang. Query dependent pseudo-relevance feedback based on wikipedia. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’09, pages 59–66, New York, NY, USA, 2009. ACM.
  • [26] Jin-ge Yao, Xiaojun Wan, and Jianguo Xiao. Compressive document summarization via sparse optimization. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pages 1376–1382. AAAI Press, 2015.
  • [27] ChengXiang Zhai and John Lafferty. A risk minimization framework for information retrieval. Inf. Process. Manage., 42(1):31–55, January 2006.
  • [28] Sheng-hua Zhong, Yan Liu, Bin Li, and Jing Long. Query-oriented unsupervised multi-document summarization via deep learning model. Expert Syst. Appl., 42(21):8146–8155, November 2015.