With the rapid growth of textual documents on the internet, accessing information from the web has become a challenging issue. Often users want the summary of a topic from various sources to fulfill their information needs. The Query Focused Multi-Document Summarization (QF-MDS) task deals with such problems, where the goal is to summarize a set of documents to answer a given query.
In the QF-MDS task, the summaries generated by the summarizer can be either extractive or abstractive [43, 16]. An extractive summarizer extracts relevant text spans from the source document(s), whereas an abstractive summarizer generates a summary in natural language that may contain words that do not appear in the source document(s) [34, 28, 30]. With the rising popularity of virtual assistants in recent years, there is a growing interest in integrating abstractive summarization capabilities into these systems for natural response generation.
One major challenge for the QF-MDS task is that the datasets (e.g., DUC 2005, 2006, 2007) used for such tasks do not contain any labeled training data. Therefore, neural summarization models that leverage supervised training cannot be used in these datasets. Note that for other related tasks [1, 23, 27], how to reduce the demands for labeling the data and how to leverage unlabeled data were also identified as major challenges. While using datasets similar to the target dataset as the training data for the QF-MDS task, we find that these datasets only contain multi-document gold summaries. However, the state-of-the-art transformer-based summarization models [24, 18] cannot be used on long documents due to computational complexities [6, 46]. To tackle these issues, we propose a novel weakly supervised approach that utilizes distant supervision to generate a weak reference summary for each single document from the multi-document gold reference summaries. We train our model on each document with weak supervision and find that our proposed approach, which generates abstractive summaries, is very effective for the QF-MDS task. More concretely, we make the following contributions:
Second, to address the computational issue of training neural models on long documents [46, 6], we propose an iterative approach that adopts a pre-trained single-document generic summarization model, leverages the effectiveness of fine-tuning such models for query focused abstractive summarization, and extends it to the QF-MDS task.
Experimental results on the DUC 2005-07 datasets show that our proposed approach sets a new state-of-the-art result in terms of various ROUGE scores. As a secondary contribution, we will make our source code publicly available here: https://github.com/tahmedge/WSL-DS-COLING-2020.
2 Related Work
Early work on multi-document summarization was mostly focused on generic summarization, whereas the amount of work on QF-MDS has been very limited. Due to the lack of training data for the QF-MDS task, most previous works were based on various unsupervised approaches that could only generate extractive summaries [39, 10, 37, 42, 47, 38, 26, 8].
To generate abstractive summaries for the QF-MDS task, prior work proposed a transfer learning technique to tackle the issue of no training data. The authors adopted the Pointer-Generator Network (PGN), pre-trained for the generic abstractive summarization task on a large dataset, to predict the query focused summaries in the target dataset by modifying the attention mechanism of the PGN model. However, their model failed to outperform different extractive approaches in terms of various ROUGE scores [8, 33].
Identifying sentences that are relevant to the query is an important step for the QF-MDS task. For this purpose, various approaches have been utilized, such as counting word overlaps or the Cross-Entropy Method. Though neural models based on supervised training have significantly outperformed various non-neural models for the answer selection task in recent years [17, 19], such neural models have not yet been used effectively for the QF-MDS task due to the absence of labeled data for the relevant sentences in the QF-MDS datasets.
Recently, it was shown that neural models pre-trained on a large Question Answering (QA) dataset could effectively select answers in other QA datasets. More recently, such pre-trained answer selection models were applied to the QF-MDS task: distant supervision from various QA datasets was utilized, using the fine-tuned BERT model, to filter out the irrelevant sentences from the documents. However, it has also been shown that filtering sentences as an early step could lead to performance deterioration for the QF-MDS task. Thus, instead of applying distant supervision to filter out some sentences from the document, we apply it to generate the weak reference summary of each unlabeled document in our training datasets. Our proposed weakly supervised learning approach not only allows us to leverage the advantage of fine-tuning pre-trained generic summarization models, but also allows us to overcome the limitation of training neural models on long documents [6, 46].
3 Our Proposed Approach
Suppose we have a query consisting of a sequence of words, together with a set of documents. For the QF-MDS task, the goal is to generate a summary, itself a sequence of words, from the document set for the given query.
In Figure 1, we show the overall architecture of our proposed approach. Since there is no training data available for the QF-MDS task, we provide supervised training for our target dataset by using other QF-MDS datasets as the training data. However, the available QF-MDS datasets only contain the gold summaries generated by human experts from multiple documents and do not contain the gold summary of each individual document. Due to the limitations of using neural models on long documents [6, 46], we propose an iterative approach to train our model on each document of a document set. For this purpose, we generate the weak reference summary of each document from the multi-document gold summaries using distant supervision to train our model for the QF-MDS task. Finally, we rank the generated query focused summaries via an answer selection model. In the following, we give a detailed description of our proposed approach.
3.1 Weakly Supervised Learning with Distant Supervision
To generate the weak reference summaries using distant supervision, we utilize the pre-trained RoBERTa model in two steps (see Figure 1a). First, we generate the weak extractive reference summary of each individual document using a RoBERTa sentence similarity model fine-tuned for the answer selection task. Then, we measure the similarity score between each sentence in the human-written (abstractive) multi-document gold summaries and each sentence in the weak extractive reference summary using a RoBERTa sentence similarity model fine-tuned for the paraphrase identification task. Based on the similarity score, we select the most relevant sentences from the gold reference summaries as the weak abstractive reference summary for each document. Below we describe these steps in detail.
RoBERTa Answer Selection Model:
In this step, we first generate the initial weak extractive reference summary of each individual document by measuring the relevance between the query and each sentence in the document. For this purpose, we adopt an existing RoBERTa sentence similarity model for its impressive performance in the answer sentence selection task and fine-tune it on the QA-ALL dataset of MS-MARCO. The fine-tuned RoBERTaMS-MARCO model is then applied to our training dataset to measure the similarity score between each sentence in the document and the query. Based on the similarity score, we select the top 3 most relevant sentences as the weak extractive reference summary. Note that we use the value of 3 because extracting only 3 sentences was found effective in different extractive summarizers such as the LEAD-3 baseline [35, 24], as well as the BERTSUMEXT model.
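The selection step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word-overlap scorer `overlap_score` is a hypothetical stand-in for the fine-tuned RoBERTa answer selection model, the regex sentence splitter is deliberately naive, and the parameter name `k` (defaulting to 3, mirroring the LEAD-3 baseline mentioned above) is an assumption.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(query: str, sentence: str) -> float:
    """Hypothetical stand-in for a learned relevance model:
    the fraction of distinct query words that also occur in the sentence."""
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / len(q) if q else 0.0


def weak_extractive_summary(
    query: str,
    document: str,
    k: int = 3,
    score: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    """Return the k sentences of `document` most relevant to `query`,
    ordered from most to least relevant (ties keep document order)."""
    sentences = split_sentences(document)
    ranked = sorted(sentences, key=lambda s: score(query, s), reverse=True)
    return ranked[:k]
```

Any scorer with the same `(query, sentence) -> float` signature, such as a wrapper around a fine-tuned sentence-pair classifier, could be passed as `score` without changing the selection logic.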
RoBERTa Paraphrase Identification Model:
We provide distant supervision to generate the weak abstractive reference summary by replacing each sentence in the weak extractive reference summary (generated in the previous step) with the most similar sentence found in the multi-document gold summaries. For this purpose, we fine-tune the RoBERTa model for the paraphrase identification task on the MRPC dataset. Then, for each document in a document set, we utilize the fine-tuned RoBERTaMRPC paraphrase identification model to replace each sentence in that document's weak extractive reference summary with the most similar sentence found in the gold summaries of the document set (among the sentences that are not already selected for the same document).
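The replacement step can be illustrated with a short sketch. It assumes a greedy, sentence-by-sentence replacement, and it uses a token-level Jaccard similarity (`jaccard`, a hypothetical stand-in) in place of the fine-tuned paraphrase identification model; the function and parameter names are illustrative only.

```python
import re
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Hypothetical stand-in for a paraphrase model: Jaccard overlap of word sets."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def weak_abstractive_summary(
    extractive_summary: List[str],
    gold_sentences: List[str],
    similarity: Callable[[str, str], float] = jaccard,
) -> List[str]:
    """Replace each extractive sentence with its most similar gold sentence.

    A gold sentence already chosen for this document is not chosen again,
    matching the "not already selected for the same document" constraint.
    """
    used = set()
    result = []
    for sentence in extractive_summary:
        candidates = [i for i in range(len(gold_sentences)) if i not in used]
        if not candidates:
            break  # fewer gold sentences than extractive sentences
        best = max(candidates, key=lambda i: similarity(sentence, gold_sentences[i]))
        used.add(best)
        result.append(gold_sentences[best])
    return result
```

Because the `used` set is local to one call, the same gold sentence can still be assigned to several different documents of a document set, while never appearing twice in one document's weak reference summary.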
3.2 Iterative Fine-Tuning for Multi-Document Summarization
For the QF-MDS task, we adopt the transformer-based BERTSUM model pre-trained for generic abstractive summarization on the CNN/DailyMail dataset to leverage the advantages of fine-tuning it for the query focused abstractive summarization task. However, BERTSUM was trained for the single-document summarization task by considering at most 512 tokens [24, 6, 46]. To address this issue for the multi-document scenario, we take an iterative approach (see Figure 1b). First, we incorporate query relevance by concatenating the query with each document, similar to prior work on query focused abstractive summarization. Then, we fine-tune BERTSUM using the weak abstractive reference summary to generate the query focused abstractive summary of each document in a document set. The sentences in the generated query focused summaries of each document set are then ranked using the fine-tuned RoBERTaMS-MARCO answer selection model to select the sentences that are most relevant to the query (see Figure 1c).
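The iterative procedure can be outlined in a few lines. The sketch below is an assumption-laden illustration: `summarize` and `score` are placeholders for the fine-tuned summarization and answer selection models, the query and document are joined with a plain space (a real model would use its own separator tokens), and the 512-token limit is approximated with whitespace tokens rather than subword tokens.

```python
from typing import Callable, List

MAX_TOKENS = 512  # input limit mentioned in the text


def build_input(query: str, document: str, max_tokens: int = MAX_TOKENS) -> str:
    """Concatenate the query with the document and truncate to `max_tokens`
    whitespace-separated tokens (an approximation of subword truncation)."""
    tokens = (query + " " + document).split()
    return " ".join(tokens[:max_tokens])


def summarize_document_set(
    query: str,
    documents: List[str],
    summarize: Callable[[str], List[str]],
    score: Callable[[str, str], float],
    top_n: int,
) -> List[str]:
    """Summarize each document separately, pool the generated sentences,
    and keep the `top_n` sentences most relevant to the query."""
    candidates: List[str] = []
    for document in documents:
        # One pass per document keeps every model input within the length limit.
        candidates.extend(summarize(build_input(query, document)))
    ranked = sorted(candidates, key=lambda s: score(query, s), reverse=True)
    return ranked[:top_n]
```

Processing one document at a time is what keeps the input short enough for a single-document model; the cross-document view is only recovered at the final ranking stage.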
4 Experimental Setup
We now describe the datasets used in this paper, followed by the details of our implementation.
We use the DUC 2005, 2006, and 2007 datasets for the QF-MDS task. The numbers of document sets are 50, 50, and 45, and the numbers of documents in each document set are 32, 25, and 25 in the DUC 2005, 2006, and 2007 datasets, respectively. Each document set is associated with a query, and the objective is to generate a summary containing at most 250 words from the document set based on the given query. Given the absence of training data, to evaluate our model on each year's dataset we use the datasets from the other two years for training. From each year's training data, we randomly select 20% of the document sets for validation and use the rest for training.
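The data split described above can be sketched as follows. The function name, the fixed random seed, and the document set identifiers are hypothetical; only the leave-one-year-out scheme and the 20% validation share per training year come from the text.

```python
import random
from typing import Dict, List, Tuple


def leave_one_year_out(
    sets_by_year: Dict[int, List[str]],
    test_year: int,
    val_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Hold out `test_year` for evaluation and split every other year's
    document sets into training and validation parts."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    train: List[str] = []
    val: List[str] = []
    for year, doc_sets in sets_by_year.items():
        if year == test_year:
            continue
        shuffled = list(doc_sets)
        rng.shuffle(shuffled)
        n_val = round(len(shuffled) * val_fraction)
        val.extend(shuffled[:n_val])
        train.extend(shuffled[n_val:])
    return train, val, list(sets_by_year[test_year])
```

With 50, 50, and 45 document sets per year, evaluating on DUC 2007 under this sketch gives 80 training, 20 validation, and 45 test document sets.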
For the RoBERTa model, we used its Large version [25, 19] and implemented it using HuggingFace's Transformers. For fine-tuning the summarization model, we used the BERTSUMEXT-ABS model (https://github.com/nlpyang/PreSumm) pre-trained on the CNN/DailyMail dataset. While selecting the most relevant sentences as the final query focused summary, we used Trigram Blocking to reduce redundancy. To fine-tune the BERTSUM model, we kept most parameters similar to the original work and ran 50 steps with a batch size of 200. Among these 50 steps, we selected for evaluation the step that performed best on the validation set. All of our models were run in multi-GPU settings using 4 NVIDIA V100 GPUs. We report the results based on both Recall and F1 scores in terms of the ROUGE-1, ROUGE-2, and ROUGE-SU4 metrics. From now on, we denote ROUGE as R.
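Trigram Blocking, as commonly described, skips a candidate sentence when it shares any word trigram with the sentences already selected. The sketch below shows that idea under stated assumptions: tokenization is a simple lowercase regex, and the way the 250-word limit is enforced (skipping sentences that would exceed it) is an illustrative choice, not something specified in the text.

```python
import re
from typing import List, Set, Tuple


def trigrams(sentence: str) -> Set[Tuple[str, ...]]:
    """Set of lowercase word trigrams in a sentence."""
    tokens = re.findall(r"\w+", sentence.lower())
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_with_trigram_blocking(
    ranked_sentences: List[str], max_words: int = 250
) -> List[str]:
    """Walk the sentences in ranked order and keep a sentence only if it
    shares no trigram with the summary so far and fits the word budget."""
    selected: List[str] = []
    seen: Set[Tuple[str, ...]] = set()
    words_used = 0
    for sentence in ranked_sentences:
        candidate = trigrams(sentence)
        if candidate & seen:
            continue  # redundant with an already selected sentence
        n_words = len(sentence.split())
        if words_used + n_words > max_words:
            continue  # assumption: skip sentences that overflow the budget
        selected.append(sentence)
        seen |= candidate
        words_used += n_words
    return selected
```

The input is expected to be sorted by relevance already, so blocking always favors the higher-ranked of two overlapping sentences.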
Table 1: (a) F1 Score and (b) Recall on DUC 2005, DUC 2006, and DUC 2007.
Table 2: Ablation test results based on the average of R-1 scores across all datasets.
|without Distant Supervision||41.77 (-2.25%)||41.88 (-2.24%)||..., based on paired t-test|
|without Trigram Blocking||40.92 (-4.24%)||41.01 (-4.27%)||No, based on paired t-test|
|without Weakly Supervised Learning||40.01 (-6.37%)||40.12 (-6.35%)||Yes, based on paired t-test|
5 Results and Discussions
We now analyze the performance of our proposed model by comparing it with other models (see Table 1). We also perform an ablation test to further investigate its effectiveness (see Table 2). We denote our approach of using the Pre-trained models (RoBERTa and BERTSUM) for Query focused SUMmary generation via utilizing Weakly Supervised Learning with Distant Supervision (WSL-DS) as PQSUMWSL-DS. For performance comparisons, we use two baselines that do not utilize weak supervision and fine-tuning. Note that both of these baselines use the BERTSUM model pre-trained on the CNN/DailyMail dataset. One of them is pre-trained for extractive summarization: PQSUMEXT; while the other is pre-trained for abstractive summarization: PQSUMABS. These baselines generate the summaries of all documents in a document set, which are then ranked using RoBERTaMS-MARCO. Moreover, we compare our model with four recent works: i) CES-50, ii) RSA, iii) QUERYSUM, and iv) DUAL-CES.
From Table 1(a), we find that our PQSUMWSL-DS model sets a new state-of-the-art in all datasets based on the F1 metric for all ROUGE scores. Specifically, based on R-1, it improves over prior work by 5.88% in DUC 2005 and by 4.54% and 3.28% in DUC 2006 and 2007, respectively. From Table 1(b), we find that our model also sets a new state-of-the-art in terms of R-2 Recall in DUC 2005 and 2006, but fails to outperform DUAL-CES in DUC 2007. In comparison to the abstractive RSA model, we find that our model outperforms it in all datasets in terms of both R-1 and R-2 Recall, but fails to outperform it in R-SU4 scores. Moreover, we find based on a paired t-test that weakly supervised learning significantly outperforms the baselines in terms of both Recall and F1.
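For readers unfamiliar with the paired t-test used for the significance claims, the statistic can be computed as in the sketch below. It uses only the Python standard library; the function name is illustrative, and any scores passed in a usage example would be per-query ROUGE values for two systems evaluated on the same queries.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return the paired t statistic and degrees of freedom for two systems
    scored on the same items (for example, per-query ROUGE scores)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 items")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1
```

The resulting statistic is then compared against the critical value of the t distribution with the returned degrees of freedom at the chosen significance level.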
The result of our ablation test, based on the average of R-1 scores across all datasets, is shown in Table 2. We find that the performance deteriorates if we exclude Distant Supervision by removing the RoBERTaMRPC model, as well as if Trigram Blocking is not used. Moreover, the performance degrades significantly if the summary is generated by only ranking the sentences in the documents using the fine-tuned RoBERTaMS-MARCO without utilizing Weakly Supervised Learning.
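The percentages in Table 2 can be read as relative changes with respect to the full model. This reading is an interpretation rather than something stated explicitly, but it can be checked with a small calculation: under that assumption, every ablation row should imply the same full-model score. The helper names below are illustrative.

```python
def relative_change(ablated: float, full: float) -> float:
    """Relative change (in percent) of an ablated score with respect to the full model."""
    return (ablated - full) / full * 100.0


def implied_full_score(ablated: float, pct_change: float) -> float:
    """Invert `relative_change`: recover the full-model score from an
    ablated score and its reported relative change in percent."""
    return ablated / (1.0 + pct_change / 100.0)


# Rows of Table 2: (ablated score, reported change in %) for the two score columns.
first_column = [(41.77, -2.25), (40.92, -4.24), (40.01, -6.37)]
second_column = [(41.88, -2.24), (41.01, -4.27), (40.12, -6.35)]

for score, change in first_column + second_column:
    print(f"{score:.2f} ({change:+.2f}%) -> full model ~ {implied_full_score(score, change):.2f}")
```

All three rows of the first score column imply a full-model average of about 42.73, and all three rows of the second column imply about 42.84, which supports the relative-change reading.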
6 Conclusions and Future Work
In this paper, we propose a novel weakly supervised approach for the Query Focused Multi-Document Abstractive Summarization task to tackle the lack of labeled training data for such tasks. We also propose an iterative approach to address the computational problem that occurs while training neural models on long documents [15, 6, 46]. Experimental results on three datasets show that our proposed approach sets a new state-of-the-art result in various evaluation metrics. In the future, we will apply our models to more tasks, such as information retrieval applications [11, 12, 44, 13, 22, 45], learning from imbalanced or unlabeled datasets [21, 3, 4], and automatic chart question answering.
We gratefully thank the area chair(s) and the reviewers for their excellent review comments. This research is supported by the Natural Sciences & Engineering Research Council (NSERC) of Canada, the York Research Chairs (YRC) program and an ORF-RE (Ontario Research Fund-Research Excellence) award in BRAIN Alliance. We acknowledge Compute Canada for providing us with the computing resources.
References
- [1] (2003) Challenges in information retrieval and language modeling: report of a workshop held at the center for intelligent information retrieval, University of Massachusetts Amherst, September 2002. In ACM SIGIR Forum, Vol. 37, pp. 31–47.
- [2] (2016) MS MARCO: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
- [3] Zero-resource cross-lingual named entity recognition. arXiv preprint arXiv:1911.09812.
- [4] (2020) MultiMix: a robust data augmentation strategy for cross-lingual NLP. arXiv preprint arXiv:2004.13240.
- [5] (2018) Query focused abstractive summarization: incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models. arXiv preprint arXiv:1801.07704.
- [6] (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
- [7] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186.
- [8] (2017) Unsupervised query-focused multi-document summarization using the cross entropy method. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 961–964.
- [9] (2019) TANDA: transfer and adapt pre-trained transformer models for answer sentence selection. arXiv preprint arXiv:1911.04118.
- [10] (2009) Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 362–370.
- [11] (2009) A Bayesian learning approach to promoting diversity in ranking for biomedical information retrieval. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 307–314.
- [12] Applying machine learning to text segmentation for information retrieval. Information Retrieval 6 (3-4), pp. 333–362.
- [13] (2005) York University at TREC 2005: genomics track. In Proceedings of the Fourteenth Text REtrieval Conference, TREC.
- [14] (2020) Answering questions about charts and generating visual explanations. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–13.
- [15] (2019) Reformer: the efficient transformer. In International Conference on Learning Representations.
- [16] (2020) AQuaMuSe: automatically generating datasets for query-based multi-document summarization. arXiv preprint arXiv:2010.12694.
- [17] (2019) Utilizing bidirectional encoder representations from transformers for answer selection task. In Proceedings of the V AMMCS International Conference: Extended Abstract, pp. 221.
- [18] Query focused abstractive summarization via incorporating query relevance and transfer learning with transformer models. Canadian Conference on Artificial Intelligence, pp. 342–348.
- [19] (2020) Contextualized embeddings based transformer encoder for sentence similarity modeling in answer selection task. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 5505–5514.
- [20] (2004-07) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81.
- [21] (2006) Boosting prediction accuracy on imbalanced datasets with SVM ensembles. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, pp. 107–118.
- [22] (2007) ARSA: a sentiment-aware model for predicting sales performance using blogs. In Proceedings of the 30th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 607–614.
- [23] (2008) Modeling and predicting the helpfulness of online reviews. In Proceedings of the 8th IEEE International Conference on Data Mining, pp. 443–452.
- [24] Text summarization with pretrained encoders. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 3721–3731.
- [25] (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- [26] (2016) An unsupervised multi-document summarization framework based on neural document model. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1514–1523.
- [27] (2012) Proximity-based Rocchio's model for pseudo relevance. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 535–544.
- [28] (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 280–290.
- [29] (2018) Abstractive unsupervised multi-document summarization using paraphrastic sentence fusion. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1191–1204.
- [30] Diversity driven attention model for query-based abstractive summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1063–1072.
- [31] (2019) Multi-style generative reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2273–2284.
- [32] (2018) A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.
- [33] (2020) Unsupervised dual-cascade learning with pseudo-feedback distillation for query-focused extractive summarization. In Proceedings of The Web Conference 2020, pp. 2577–2584.
- [34] (2015) A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 379–389.
- [35] (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1073–1083.
- [36] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- [37] (2009) Graph-based multi-modality learning for topic-focused multi-document summarization. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pp. 1586–1591.
- [38] (2014) CTSUM: extracting more certain summaries for news articles. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 787–796.
- [39] (2008) Integrating clustering and multi-document summarization to improve document understanding. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 1435–1436.
- [40] (2019) HuggingFace's transformers: state-of-the-art natural language processing. ArXiv, pp. arXiv–1910.
- [41] (2020) Query focused multi-document summarization with distant supervision. arXiv preprint arXiv:2004.03027.
- [42] (2015) Compressive document summarization via sparse optimization. In Proceedings of the 24th International Conference on Artificial Intelligence, pp. 1376–1382.
- [43] (2017) Recent advances in document summarization. Knowledge and Information Systems 53 (2), pp. 297–336.
- [44] (2013) A survival modeling approach to biomedical search result diversification using Wikipedia. IEEE Transactions on Knowledge and Data Engineering 25 (6), pp. 1201–1212.
- [45] (2012) Mining online reviews for predicting sales performance: a case study in the movie domain. IEEE Transactions on Knowledge and Data Engineering 24 (4), pp. 720–734.
- [46] (2020) Big bird: transformers for longer sequences. arXiv preprint arXiv:2007.14062.
- [47] Query-oriented unsupervised multi-document summarization via deep learning model. Expert Systems with Applications 42 (21), pp. 8146–8155.