
Scaling Up Query-Focused Summarization to Meet Open-Domain Question Answering

by   Weijia Zhang, et al.

Query-focused summarization (QFS) requires generating a textual summary given a query using a set of relevant documents. However, in practice, such relevant documents are not readily available but should be first retrieved from a document collection. Therefore, we show how to extend this task to make it more realistic. Thereby the task setup also resembles the settings of the open-domain question answering task, where the answer is a summary of the top-retrieved documents. To address this extended task, we combine passage retrieval with text generation to produce the summary of the retrieved passages given the input query. We demonstrate the first evaluation results on the proposed task and show that a few samples are sufficient to fine-tune a large generative model with retrieved passages.





1 Introduction

Query-focused summarization (QFS) aims to produce a multi-sentence summary from a set of topic-related documents to answer a given query. One of the main challenges of this task is dataset construction. In a typical pipeline, several professional assessors first read candidate documents on a topic, then design a query reflecting an information need and choose a subset of 25-50 documents relevant to this query. Finally, they write reference summaries based on the selected documents to answer the query [2]. This labor-intensive process is very time-consuming. As a result, existing QFS datasets (e.g., DUC [2]) do not contain enough samples to train data-hungry neural models from scratch and can be used only for fine-tuning and evaluation.

Another notable limitation of the existing QFS task setup is that the relevant documents for the given query are assumed to be readily available, which is clearly not the case in practice. In a more realistic scenario, the relevant documents should be first retrieved from a large document collection, e.g., the Web, and after that summarised to answer the query. It is important to consider the end-to-end task to be able to analyse and mitigate the errors propagating from the retrieval phase.

We propose to extend the standard QFS task by replacing the set of relevant documents with a large-scale document collection. We refer to this new task setup as open-domain QFS. Such a modification in the task setup also makes it more similar to the open-domain question answering (QA) task, where a natural language question should be answered using a large document collection [1].

Most of the datasets for open-domain QA, such as Natural Questions [10], assume that the answer is a single contiguous span that can be extracted from a single relevant document (extractive QA). Reusing the QFS datasets in the open-domain setting allows us to obtain high-quality data for abstractive QA, in which the answer to the given question is a summary of multiple relevant documents.

Due to the low-resource setting, previous work on QFS relied mainly on unsupervised approaches that extract sentences to produce the output summary [6, 13, 16]. With the rising popularity of pre-trained language models, more recent studies [11, 20, 21] also explored weak/distant supervision from the QA and single-document summarization tasks. Such approaches incorporate a sentence selector trained on large QA datasets, such as SQuAD 2.0 [19], to select relevant sentences, or train a summarization model on large non-query-focused summarization datasets, such as CNN-DM [7].

In contrast to all previous work, we explore the potential of the retriever-generator framework, recently proposed in the context of the open-domain QA task [5, 4], to tackle the task of open-domain QFS that we introduce in this work. In this framework, the retriever outputs the top-k relevant passages for a given query. Then, the generator, which uses an encoder-decoder architecture, takes the query and the passages retrieved at the previous stage as input and produces the output summary.

Our main contributions are twofold: (i) to our knowledge, we are the first to extend the QFS task to the open-domain scenario, and (ii) we report the results of a state-of-the-art model on this task and show that it can be trained efficiently in a low-resource scenario.

2 Task Formulation

We briefly introduce the original QFS task and then describe our proposed extension to open-domain QFS. Let the tuple $(q, D, s)$ denote a sample in the QFS dataset (a cluster), where $q$ is a given query, $D$ is a set of topic-related documents and $s$ is the reference summary for the document set $D$ and the query $q$. The goal is to produce the summary $s$ given the documents $D$ and the query $q$ as input: $(q, D) \rightarrow s$. Note that the documents in $D$ are selected manually based on $q$ by human annotators, and $D$ usually includes tens of documents.

We extend the original QFS task to the open-domain setting as $(q, P) \rightarrow s$ by replacing the document set $D$ with a passage collection $P$, where $P$ contains millions of passages processed from a large-scale document collection, i.e., $|P| \gg |D|$. Our goal remains to generate the summary $s$ with respect to the query $q$. However, unlike in the original QFS setup, there is no evidence that the given passages are relevant to the query. More importantly, the number of passages is so large that it is impossible to consider all of them, which calls for an effective information retrieval model to supplement the extended task.

3 Method

Inspired by the approach previously proposed for the long-form QA task [5], we adopt a retriever-generator framework to address the task of open-domain QFS introduced in the previous section.
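The two-stage control flow of this framework can be sketched in a few lines. This is only an illustrative skeleton: the function names, the toy passage collection, and the word-overlap retriever and concatenating generator are hypothetical stand-ins, not the DPR and FID components described in the following subsections.

```python
from typing import Callable, List

# Type aliases for the two stages (hypothetical names for illustration).
Retriever = Callable[[str, int], List[str]]   # (query, k) -> top-k passages
Generator = Callable[[str, List[str]], str]   # (query, passages) -> summary


def open_domain_qfs(query: str, retrieve: Retriever, generate: Generator, k: int = 50) -> str:
    """Retrieve the top-k passages for the query, then generate a summary from them."""
    passages = retrieve(query, k)
    return generate(query, passages)


# Toy stand-ins so the sketch runs end to end.
collection = ["Tides follow the Moon.", "Stocks fell today.", "The Moon orbits Earth."]


def toy_retrieve(query: str, k: int) -> List[str]:
    """Rank passages by the number of words shared with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        collection,
        key=lambda p: -len(q_words & set(p.lower().rstrip(".").split())),
    )
    return ranked[:k]


def toy_generate(query: str, passages: List[str]) -> str:
    """Placeholder 'generator' that simply joins the retrieved passages."""
    return " ".join(passages)


summary = open_domain_qfs("moon tides", toy_retrieve, toy_generate, k=2)
```

Keeping the two stages behind separate callables makes it easy to swap one retriever for another while leaving the generator unchanged, which is how the model variants in the experiments differ.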

3.1 Retriever

The goal of the retriever is to return the top-$k$ most relevant passages given the input query. Our retriever is based on the DPR model [9], which consists of a query encoder $E_Q$ and a passage encoder $E_P$. The encoders map the input query $q$ and passage $p$ to dense vectors $E_Q(q)$ and $E_P(p)$, respectively. The similarity score between them is defined as the dot product of their corresponding vectors:

$$\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)$$

Then, the top-$k$ passages are retrieved based on the above scores. We use the DPR model that was pre-trained on the KILT benchmark [17] by fine-tuning the BERT model [3].
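As a minimal sketch of the scoring step, the snippet below ranks passages by the dot product between a query vector and each passage vector and keeps the top-k. The vectors are toy values standing in for encoder outputs, the function names are hypothetical, and a real system over millions of passages would use an approximate or batched nearest-neighbour index rather than this exhaustive loop.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length dense vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    query_vec: Sequence[float],
    passage_vecs: List[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest dot-product scores."""
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    # Sort by descending score; ties keep the lower passage index first.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy 3-dimensional embeddings standing in for encoder outputs.
query = [1.0, 0.0, 2.0]
passages = [
    [0.5, 0.5, 0.5],  # score 1.5
    [1.0, 0.0, 1.0],  # score 3.0
    [0.0, 3.0, 0.0],  # score 0.0
]
top2 = retrieve_top_k(query, passages, k=2)
```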

3.2 Generator

Our generator is based on the Fusion-in-Decoder (FID) approach, a recent encoder-decoder architecture proposed for open-domain QA [8]. Following this work, we use the pre-trained T5 model [18] to initialize our generator. The generator takes the query $q$ and the set of the top-$k$ retrieved passages $P_k = \{p_1, \dots, p_k\}$ as input and produces the final summary. More specifically, every supporting passage from $P_k$ is concatenated with the query and used as input to the encoder independently. Afterwards, the encoded representations of all the retrieved passages are concatenated and used as input to the decoder. Let $h_i$ denote the output representation of the encoder for the passage $p_i$; then the decoder generates a summary word by word:

$$p(s \mid q, P_k) = \prod_{t=1}^{|s|} p\left(s_t \mid s_{<t}, [h_1; \dots; h_k]\right)$$
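The data flow of this fusion step can be illustrated without a neural network. In the sketch below, the "question: ... context: ..." input template and both function names are assumptions made for illustration, and the nested lists of floats are hypothetical encoder outputs; the point is only that each passage is paired with the query separately and that the per-passage token representations are concatenated along the sequence axis before decoding.

```python
from typing import List


def build_fid_inputs(query: str, passages: List[str]) -> List[str]:
    """Pair the query with each retrieved passage; each string is encoded independently."""
    return [f"question: {query} context: {p}" for p in passages]


def fuse_encoder_outputs(encoded: List[List[List[float]]]) -> List[List[float]]:
    """Concatenate per-passage token representations along the sequence axis.

    encoded[i] holds one vector per token of passage i; the decoder attends
    over the flattened list of all token vectors.
    """
    return [token_vec for passage in encoded for token_vec in passage]


inputs = build_fid_inputs("What causes tides?", ["The Moon's gravity ...", "Tides are ..."])

# Hypothetical encoder outputs: 2 passages with 3 and 2 tokens, hidden size 2.
fake_encoded = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0]],
]
fused = fuse_encoder_outputs(fake_encoded)
```

Because the encoder processes each query-passage pair on its own, encoding cost grows linearly with the number of passages, while the decoder still sees evidence from all of them at once.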

| Statistic | DUC 2005 | DUC 2006 | DUC 2007 |
|---|---|---|---|
| #Clusters | 50 | 50 | 45 |
| #Queries/Cluster | 1 | 1 | 1 |
| #Documents/Cluster | 32 | 25 | 25 |
| #Summaries/Cluster | 4-9 | 4 | 4 |
| #Words/Query | 13.6 | 12.5 | 15.1 |
| #Words/Summary | 277.8 | 281.0 | 275.7 |

Table 1: Dataset statistics

4 Experiments

To support the new task of open-domain QFS, we construct a new dataset based on the standard DUC collection and Wikipedia. Our experiments are set to compare the performance of different models and different training setups for this task.

4.1 Dataset Construction

We derive the new dataset for open-domain QFS using the original QFS dataset, which consists of the three subsets collected for the Document Understanding Conferences (DUC) from 2005 to 2007. Each subset contains 45-50 clusters, and each cluster contains a query, a set of relevant documents and several reference summaries (see Table 1 for more detailed dataset statistics).

We then complement the query-summary pairs from DUC with three different passage collections. The first one (’WIKI’) is the Wikipedia dump used in the KILT benchmark [17], which contains about 5.9 million Wikipedia documents. Following previous work [9], we split each document into disjoint passages of at most 100 words. This way we obtain about 21 million passages in total. However, we cannot guarantee that ’WIKI’ contains enough evidence to answer the queries from DUC, since the reference summaries are based on the content of the original DUC documents. Therefore, we mix all the DUC documents from different clusters together to construct the second document collection (’DUC’), using the same approach to split the documents into passages, which results in 32 thousand passages in total. Finally, we combine both collections to form the third collection (’MIX’).
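The passage-splitting step can be sketched as follows, assuming whitespace tokenization as the definition of a word (the text does not state which tokenizer is used) and a hypothetical function name.

```python
from typing import List


def split_into_passages(document: str, max_words: int = 100) -> List[str]:
    """Split a document into disjoint, consecutive passages of at most max_words words."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# A synthetic 250-word document yields passages of 100, 100 and 50 words.
doc = " ".join(f"w{i}" for i in range(250))
chunks = split_into_passages(doc)
chunk_lengths = [len(c.split()) for c in chunks]
```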

Following the previous work on DUC datasets [14], we use the data of the first two years as the training set and the third subset as the test set. We randomly select 10% from the training set for validation.
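A minimal sketch of the validation split is shown below. The cluster identifiers, the seed and the function name are hypothetical; the only facts taken from the text are that the training pool is the 100 clusters of DUC 2005 and 2006 and that 10% of it is held out at random.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_validation_split(
    items: Sequence[T], val_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the items with a fixed seed and hold out val_fraction of them."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder identifiers for the 50 + 50 clusters of DUC 2005 and DUC 2006.
clusters = [f"duc2005_{i}" for i in range(50)] + [f"duc2006_{i}" for i in range(50)]
train, val = train_validation_split(clusters)
```

With 100 clusters in the pool, this leaves 90 clusters for training and 10 for validation.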

4.2 Experimental Setup

In this work, we name our model DPR+FID. For comparison, we also use BM25 as an alternative retrieval method and call the resulting model BM25+FID. Moreover, we implement RAG [12], a recent state-of-the-art model for open-domain question answering in which passage retrieval and answer generation are learned jointly.
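For readers unfamiliar with the sparse baseline, the following is a compact sketch of Okapi BM25 scoring over tokenized passages. The parameter values `k1 = 1.2` and `b = 0.75` and the smoothed IDF form are common defaults chosen here as assumptions; the text does not state which BM25 implementation or settings were used.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.2, b: float = 0.75
) -> List[float]:
    """Score every tokenized passage in docs against the tokenized query."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["tides", "moon", "gravity"],
    ["stock", "market", "news"],
    ["moon", "landing", "history"],
]
scores = bm25_scores(["moon", "gravity"], docs)
```

Unlike the dense retriever, this score depends only on exact term overlap, so a passage that shares no term with the query receives a score of zero.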

Since we have three document collections (’WIKI’, ’DUC’ and ’MIX’), we have variants of each model trained on the passages retrieved from the respective collection, e.g., DPR+FID (WIKI) denotes that we use the passages from ’WIKI’ for training and testing. For DUC, we use the passages from the first two years for training/validation and the passages from the third year for testing.

We only train the FID model and set the number of retrieved passages $k$ to 50. For training, the learning rate and the batch size are set to 1e-4 and 1, respectively. All of our models are trained on a single TITAN-XP GPU. We set the maximum number of training epochs to 30, which took about 12 hours. For inference, we set the maximum length of a generated summary to 250.
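The reported settings can be collected in a small configuration object. The class and field names below are hypothetical; only the values come from the text, and the unit of the maximum summary length is not stated there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FidTrainingConfig:
    """Hyper-parameters reported for training and inference (field names are illustrative)."""
    top_k_passages: int = 50        # retrieved passages fed to the generator per query
    learning_rate: float = 1e-4
    batch_size: int = 1
    max_epochs: int = 30
    max_summary_length: int = 250   # unit (words or tokens) is not stated in the text


config = FidTrainingConfig()
# Number of query-passage pairs the encoder processes in one training step.
passages_per_step = config.top_k_passages * config.batch_size
```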

4.3 Results

In this section, we compare the retrieval performance of BM25 versus DPR, and evaluate the quality of the summaries generated by all our models. We also investigate the impact of the number of retrieved passages on the summarization performance.

4.3.1 Retrieval Performance.

To evaluate the retrieval components, BM25 and DPR, we treat the passages that come from the same cluster as the query as the ground truth. Following the related work on DPR [9], we use top-$k$ accuracy as the retrieval metric for this task. Table 2 compares BM25 and DPR on the DUC 2007 dataset using the passages from the ’DUC’ collection. DPR outperforms BM25 at every cutoff, and the gap widens from 2.2 points at top-10 to 6.1 points at top-50. As the number of retrieved passages $k$ increases, accuracy decreases for both models.
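The exact formula for the metric is not spelled out here. One reading that is consistent with the scores falling as the cutoff grows is the fraction of the top-ranked passages that originate from the query's own cluster (the hit-based top-k accuracy used for DPR in QA can only stay equal or rise as the cutoff grows). The sketch below implements that reading as an assumption; the function name and cluster identifiers are hypothetical.

```python
from typing import List


def top_k_accuracy(retrieved_clusters: List[str], query_cluster: str, k: int) -> float:
    """Fraction of the top-k retrieved passages that come from the query's own cluster."""
    top = retrieved_clusters[:k]
    if not top:
        return 0.0
    return sum(c == query_cluster for c in top) / len(top)


# Source cluster of each retrieved passage, in ranked order (placeholder identifiers).
ranked = ["D0701", "D0701", "D0702", "D0701", "D0703"]
acc_at_2 = top_k_accuracy(ranked, "D0701", k=2)
acc_at_5 = top_k_accuracy(ranked, "D0701", k=5)
```

In this toy ranking the score drops from 1.0 at a cutoff of 2 to 0.6 at a cutoff of 5, mirroring the downward trend across cutoffs.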

| Model | Top-10 | Top-30 | Top-50 |
|---|---|---|---|
| BM25 | 94.2 | 88.7 | 84.9 |
| DPR | 96.4 | 94.4 | 91.0 |

Table 2: Top-$k$ retrieval accuracy (%) on the open-domain QFS task for DUC 2007 (’DUC’).

| Model | R-1 | R-2 | R-SU4 |
|---|---|---|---|
| RAG (WIKI) | 32.3 | 5.2 | 10.8 |
| BM25+FID (WIKI) | 39.0 | 8.4 | 14.4 |
| DPR+FID (WIKI) | 38.5 | 9.1 | 14.5 |
| RAG (DUC) | 28.9 | 5.7 | 10.1 |
| BM25+FID (DUC) | 43.1 | 12.1 | 17.0 |
| DPR+FID (DUC) | 42.9 | 11.8 | 17.0 |
| RAG (MIX) | 27.0 | 4.6 | 9.0 |
| BM25+FID (MIX) | 39.2 | 8.8 | 14.7 |
| DPR+FID (MIX) | 41.0 | 10.0 | 15.6 |

Table 3: Summarization results (ROUGE) on the extended DUC 2007 dataset.

4.3.2 Summarization Performance.

The results for the generated summaries on the extended DUC 2007 dataset are shown in Table 3. For evaluation, we use ROUGE [15] as an automatic metric of summarization quality. Following previous work [20, 21], we report ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-SU4 (R-SU4) with a maximum length of 250 words.
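To make the n-gram metrics concrete, here is a simplified sketch of ROUGE-N recall against a single reference, with the candidate truncated to 250 words. It is not the official ROUGE toolkit: it assumes lowercased whitespace tokenization, omits stemming and multi-reference handling, does not cover ROUGE-SU4, and the text does not say whether recall or F-measure is reported.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int = 1, max_words: int = 250) -> float:
    """Clipped n-gram overlap divided by the number of n-grams in the reference."""
    cand = candidate.lower().split()[:max_words]
    ref = reference.lower().split()
    ref_counts = ngrams(ref, n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    cand_counts = ngrams(cand, n)
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1 = rouge_n_recall(candidate, reference, n=1)  # 5 of 6 reference unigrams matched
r2 = rouge_n_recall(candidate, reference, n=2)  # 3 of 5 reference bigrams matched
```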

We observe that the RAG-based models score lowest on every collection (e.g., 9.0-10.8 R-SU4 versus 14.4-17.0 for the FID-based models), which suggests that they cannot be trained well on such a small dataset. In contrast, BM25+FID achieves the best performance among all models, but only on the smallest collection (DUC). As the number of passages increases from thousands to millions in the MIX collection, the DPR-based model outperforms its BM25-based counterpart (15.6 versus 14.7 R-SU4).

Note that BM25+FID (DUC) outperforms DPR+FID (DUC), while DPR was previously shown to outperform BM25 on the passage retrieval task for the same collection in Table 2. This may indicate that using all passages from the cluster is not sufficiently accurate to evaluate the retrieval performance in this case since only a few passages actually contain the evidence required to address the given query.

In Figure 1, we report the change in ROUGE-SU4 of the DPR+FID (WIKI) model with respect to the number of retrieved passages, i.e., the sensitivity of the summarization model to the hyper-parameter $k$. We observe that increasing the number of passages from 10 to 30 brings an improvement. However, the performance drops slightly when the number of passages exceeds 30, i.e., the summarization model is not sufficiently robust to noisy input, and retrieving more passages may hurt the summarization performance.

Figure 1: Performance of DPR+FID (WIKI) with different number of retrieved passages.

5 Conclusion

In this paper, we propose to extend the original QFS task setup to the open-domain setting and investigate the performance of the retriever-generator framework on the extended DUC datasets. We show that a large summarization model for the open-domain QFS task can be successfully trained with fewer than one hundred samples by incorporating the passages obtained from the retriever. We release the code of our experiments and the extended DUC dataset to make our results easier to reproduce.


  • [1] D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1870–1879. Cited by: §1.
  • [2] H. T. Dang (2006) DUC 2005: evaluation of question-focused summarization systems. In Proceedings of the Workshop on Task-Focused Summarization and Question Answering, pp. 48–55. Cited by: §1.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §3.1.
  • [4] A. Fan, C. Gardent, C. Braud, and A. Bordes (2019) Using local knowledge graph construction to scale seq2seq models to multi-document inputs. In 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing. Cited by: §1.
  • [5] A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, and M. Auli (2019) ELI5: long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3558–3567. Cited by: §1, §3.
  • [6] Z. He, C. Chen, J. Bu, C. Wang, L. Zhang, D. Cai, and X. He (2012) Document summarization based on data reconstruction. In Twenty-sixth AAAI conference on artificial intelligence. Cited by: §1.
  • [7] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching machines to read and comprehend. Advances in neural information processing systems 28, pp. 1693–1701. Cited by: §1.
  • [8] G. Izacard and É. Grave (2021) Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 874–880. Cited by: §3.2.
  • [9] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 6769–6781. Cited by: §3.1, §4.1, §4.3.1.
  • [10] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. (2019) Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 453–466. Cited by: §1.
  • [11] M. T. R. Laskar, E. Hoque, and X. Huang (2020) WSL-ds: weakly supervised learning with distant supervision for query focused multi-document abstractive summarization. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 5647–5654. Cited by: §1.
  • [12] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. Cited by: §4.2.
  • [13] P. Li, Z. Wang, W. Lam, Z. Ren, and L. Bing (2017) Salience estimation via variational auto-encoders for multi-document summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: §1.
  • [14] W. Li, X. Zhang, Y. Wu, F. Wei, and M. Zhou (2019) Document-based question answering improves query-focused multi-document summarization. In CCF International Conference on Natural Language Processing and Chinese Computing, pp. 41–52. Cited by: §4.1.
  • [15] C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §4.3.2.
  • [16] H. Liu, H. Yu, and Z. Deng (2015) Multi-document summarization based on two-level sparse representation model. In Twenty-ninth AAAI conference on artificial intelligence, Cited by: §1.
  • [17] F. Petroni, A. Piktus, A. Fan, P. Lewis, M. Yazdani, N. D. Cao, J. Thorne, Y. Jernite, V. Plachouras, T. Rocktäschel, and S. Riedel (2020) KILT: a benchmark for knowledge intensive language tasks. In arXiv:2009.02252, Cited by: §3.1, §4.1.
  • [18] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, pp. 1–67. Cited by: §3.2.
  • [19] P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784–789. Cited by: §1.
  • [20] Y. Xu and M. Lapata (2020) Coarse-to-fine query focused multi-document summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 3632–3645. Cited by: §1, §4.3.2.
  • [21] Y. Xu and M. Lapata (2021) Generating query focused summaries from query-free resources. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6096–6109. Cited by: §1, §4.3.2.