Multi-Document Keyphrase Extraction: A Literature Review and the First Dataset

10/03/2021 ∙ by Ori Shapira, et al.

Keyphrase extraction has been comprehensively researched within the single-document setting, with an abundance of methods and a wealth of datasets. In contrast, multi-document keyphrase extraction has been infrequently studied, despite its utility for describing sets of documents, and its use in summarization. Moreover, no dataset existed for multi-document keyphrase extraction, hindering the progress of the task. Recent advances in multi-text processing make the task an even more appealing challenge to pursue. To initiate this pursuit, we present here the first literature review and the first dataset for the task, MK-DUC-01, which can serve as a new benchmark. We test several keyphrase extraction baselines on our data and show their results.




1 Introduction

Keyphrase extraction (KE) is the task of selecting important and topical phrases from within a body of text (Turney, 2000). Single-document KE has garnered extensive research due to its vast practical uses. For example, keyphrases are listed on scientific or news articles, product descriptions and meeting transcripts to give the reader a hint at the matters of the source text. Additionally, these keyphrases are serviceable for downstream tasks like document categorization (Hulth and Megyesi, 2006), clustering (Jones and Mahoui, 2000), summarization (Jones et al., 2002) and search (Gutwin et al., 1999). Hence, single-document KE is resourced with a multitude of datasets across several domains (e.g. Kim and Kan (2009) and Krapivin et al. (2009) for scientific papers, or Wan and Xiao (2008) and Marujo et al. (2012) for news), and is frequently reviewed in survey papers to report on the advancements of the task (e.g. Hasan and Ng, 2014; Siddiqi and Sharan, 2015; Merrouni et al., 2019; Papagiannopoulou and Tsoumakas, 2020).

Conversely, multi-document KE (MDKE) has been sporadically researched. To the best of our knowledge, only a handful of works have explicitly targeted the task (§2.1). This is despite it being just as valuable for gaining a high-level depiction of a set of related documents, and it being leveraged as a medium for supporting multi-document summarization (§2.2). It could additionally assist the aspect-based summarization task in extracting the aspects around which summaries are to be generated, which is especially useful when the texts are information-rich with high sub-topical variation (e.g. Frermann and Klementiev, 2019; Gerani et al., 2019; Hayashi et al., 2021). Moreover, the multi-document setting, with its large inputs and information variability, introduces an interesting and challenging dimension of complexity that recent advances in text processing may tackle.

To make matters trickier, no dataset was previously available for MDKE, i.e. one consisting of sets of documents and their corresponding gold lists of keyphrases. Previous works therefore either did not evaluate with standard automatic keyphrase extraction methods, or conducted extrinsic evaluations within the summarization task (see §2 for details).

To initiate a more established research line on MDKE, we provide here the first literature review on the task, and propose an MDKE dataset, which can provide a benchmark for the task. The dataset is based on the existing DUC-2001 single-document KE dataset (Wan and Xiao, 2008) in the news domain. We leverage the properties of the original DUC-2001 multi-document summarization dataset to convert the single-document KE dataset to a multi-document one. We run several KE algorithms on the dataset to demonstrate the current state of the task on the new benchmark.

We start with the literature review on MDKE in §2, and describe the new dataset and its experimentation in §3.

2 Multi-Document KE Review

We outline the research conducted explicitly and indirectly on MDKE. Few works have expressly tackled multi-document KE; however, it has also been applied as an intermediate step in several studies on multi-document summarization (MDS).

2.1 Works on Multi-Document KE

Hammouda et al. (2005) seem to be the first to have worked on the MDKE task, targeting the web-document domain. Word-sequences common to all documents are ranked based on term frequency, term position in documents, and use in titles or headlines. To evaluate, 10 sets of  30 documents are retrieved via query search. The system keyphrases are compared, by word-stem overlap, to the single corresponding document-set search query.

Berend and Farkas (2013) approach the task by merging keyphrase lists from individual documents in a document set. Within a document, n-grams that satisfy several rules are classified as keyphrases with a trained maxent model that uses features of word surface form and basic Wikipedia knowledge. An information gain metric ranks the unified list of keyphrases, from which the top-15 are taken out of a subset of documents. To evaluate, scientific paper sets from ACL workshops (Schäfer et al., 2012) (110 workshops with ~14 articles each) were paired with their respective “call-for-papers” (CFP) website sections. A system keyphrase list on a paper set was then compared to the CFP text via word-level cosine similarity. Also, NLP experts assessed whether keyphrase lists indeed properly characterized the corresponding workshop.

Bayatmakou et al. (2017) also take a single-to-multi-document approach. A set of documents is retrieved with a search query, and individual documents’ keyphrases are extracted with RAKE (Rose et al., 2010). The keyphrases in the unified lists are re-scored according to word co-occurrence within keyphrases and term-frequency within documents, weighted by document salience (similarity to the search query). While an automatic evaluation is proposed (measuring against the query and co-occurrence of keywords and query in documents), the actual assessment is a manual satisfaction rating against the search query. Experiments were performed over a large dataset of scientific abstracts.

As for keyword extraction (unigrams), Bharti et al. (2017) propose several methods of term-frequency for scoring words in sets of news articles. They evaluate the resulting keyword list against the aggregated words in the articles’ headlines, with recall and precision. Qing-sheng (2007) design cluster-based and MMR-based algorithms, and test against annotated data. YangJie et al. (2008) use tf-idf and word-level features to score words. No further details are available on the latter two references, as their papers could not be retrieved.

Relating to MDKE, Wan and Xiao (2008) propose a method for single-document KE, CollabRank, that ranks a document’s keyphrases with respect to similar “collaborating” documents. To compute word saliency, they produce an affinity graph reflecting co-occurrence relationships between documents’ words. That paper also introduces the single-document KE dataset that we build upon for MDKE (§3).

2.2 KE for Multi-Document Summarization

MDS aims to generate a passage that covers the main issues of the source document-set. Keyphrases naturally point to the central aspects, and can therefore assist in marking the important information for a summary.

Some works extract keyphrases from the document-set, e.g. with conventional term-frequency methods (Alshahrani and Bikdash, 2019), by using single-document KE algorithms on the concatenated documents (Nayeem and Chali, 2017), or through query similarity for query-focused summarization (Ma et al., 2008). These keyphrases are then used to rank sentences across the documents for the potential summaries. Output summaries are standardly evaluated against reference summaries from MDS datasets. While keyphrases have also been extracted per document for the same purpose of MDS (e.g. Bhaskar, 2013; Fejer and Omar, 2015), it is difficult to assess which approach is most effective since various factors impact the resulting summaries.

Other works focus on word importance rather than keyphrase extraction. Hong and Nenkova (2014) assign importance to documents’ content words based on their appearance in reference summaries. Alternative methods include ILP frameworks for bigram selection and weighting that use syntactic and external information (Li et al., 2015), and a joint sentence/keyword reinforcement process that converges to a high quality summary (Li and Zheng, 2020).

2.3 KE Evaluation

Most single-document KE works automatically evaluate a keyphrase list against a gold list. As seen in §2.1 above, the few works on MDKE do not conduct such evaluations due to the lack of test datasets, a gap we address with our proposed MDKE dataset (§3). They all conduct disparate assessments, and evaluate against references that are limited in informativeness or that are unrepresentative of keyphrase requisites.

The most prominent KE metric for comparing against a gold keyphrase list is the $F_1@k$ metric, which considers the recall and precision of the predicted list, truncated to $k$ items, against the full gold list. Words are often stemmed to allow some reasonable variation of word forms. To allow for a broader valuation, the unigram-level $F_1@k$ score is also used, where the two lists of keyphrases are each flattened out to a single list of words. Other, less exploited metrics include the Mean Reciprocal Rank (Voorhees and Tice, 2000) and Mean Average Precision, which also take list order into account. (See a KE review paper, such as Sun et al. (2020), for more details on the evaluation metrics.)
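As a rough illustration of the two metrics described above, the following minimal sketch computes phrase-level and unigram-level F1@k. The names `simple_stem`, `normalize`, `f1_at_k` and `unigram_f1_at_k` are hypothetical, the toy suffix-stripping stemmer merely stands in for a real stemmer, and treating the flattened word lists as sets is an assumption rather than a detail taken from the paper.

```python
from typing import List, Set, Tuple


def simple_stem(word: str) -> str:
    """Toy stemmer (assumption): lowercase and strip a few common suffixes."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(phrase: str) -> Tuple[str, ...]:
    """Represent a phrase as a tuple of stemmed tokens."""
    return tuple(simple_stem(tok) for tok in phrase.split())


def f1(predicted: Set, gold: Set) -> float:
    """Harmonic mean of precision and recall between two sets."""
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def f1_at_k(predicted: List[str], gold: List[str], k: int) -> float:
    """Phrase-level F1@k: top-k predictions against the full gold list."""
    pred_set = {normalize(p) for p in predicted[:k]}
    gold_set = {normalize(g) for g in gold}
    return f1(pred_set, gold_set)


def unigram_f1_at_k(predicted: List[str], gold: List[str], k: int) -> float:
    """Unigram-level F1@k: both lists flattened to sets of stemmed words."""
    pred_words = {w for p in predicted[:k] for w in normalize(p)}
    gold_words = {w for g in gold for w in normalize(g)}
    return f1(pred_words, gold_words)
```

Note that only the predicted list is truncated to k items; the gold list is always used in full, so recall at small k is bounded by k divided by the gold list length.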

3 New Dataset

3.1 Dataset Formation

Our proposed MDKE dataset, which we name MK-DUC-01, builds upon the DUC-2001 single-document KE dataset introduced by Wan and Xiao (2008), for the news domain.

The DUC-2001 MDS dataset (Over, 2001) consists of 30 topics, each containing an average of 10.27 related news articles (308 in total). Experts summarized each individual article, as well as each of the document-sets, with different length summaries. Overall, there are three 100-token-long summaries per document, and three summaries per document-set, at lengths 50, 100, 200 and 400 tokens. Wan and Xiao (2008) further annotated each document with a list of keyphrases. On average, there are 8.08 keyphrases per document, with 2.09 words per keyphrase. This data is still widely used for the news-domain single-document KE task.

To restructure the single-document KE dataset for the multi-document setting, we carried out an automatic merging and reranking process, followed by a manual refinement procedure:

Automatic merging and reranking.

For each topic $t$ with its corresponding document-set $D_t$ and 400-token topic reference summaries $S_t$, we first provided a score, $\mathrm{score}(w)$, for each stemmed word $w$. The score is based on document frequency over $D_t$ and $S_t$, where $\mathrm{df}_X(w)$ stands for $w$’s document-frequency in a document-set $X$, i.e. the number of documents of $X$ in which $w$ appears.

We then unified the keyphrase lists of the documents in $D_t$ (from the single-document KE dataset), removing exact phrase duplicates and phrases that do not appear anywhere in $D_t$, to form a single list of potential keyphrases, $K_t$. (Since we consider the task of keyphrase extraction, all keyphrases must be contained within the document set; 16 of 2488 keyphrases were removed in total.) Each phrase $p \in K_t$ was then scored as $\mathrm{score}(p) = \frac{1}{|p|} \sum_{w \in p} \mathrm{score}(w)$, i.e. the average of $p$’s stem scores. This generated a ranked list of keyphrases, $K_t^{\mathrm{ranked}}$, ordered by a salience score.

Lastly, we merged pairs of phrases in $K_t^{\mathrm{ranked}}$ where one was contained within the other (stemmed and disregarding word order), leaving only the longer variant or the one earlier in $K_t^{\mathrm{ranked}}$, e.g., merging “routine training”/“routine train flight”. Due to the variance of keyphrases’ informativeness across documents, we found that this heuristic effectively filtered out overly generic or repetitive keyphrases.
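The automatic stage can be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation: the word score is assumed here to be plain document frequency over the topic documents, tokens are only lowercased rather than stemmed, and placing a longer phrase at the rank of the shorter phrase it absorbs is one possible reading of the merging rule. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def tokens(text: str) -> List[str]:
    """Lowercased whitespace tokens; a real pipeline would also stem."""
    return text.lower().split()


def word_scores(documents: List[str]) -> Dict[str, float]:
    """ASSUMPTION: score each word by its document frequency in the set."""
    df: Counter = Counter()
    for doc in documents:
        df.update(set(tokens(doc)))
    return dict(df)


def merge_and_rerank(documents: List[str],
                     keyphrase_lists: List[List[str]]) -> List[str]:
    scores = word_scores(documents)
    doc_texts = [" " + " ".join(tokens(d)) + " " for d in documents]

    # Unify per-document lists, dropping duplicates and absent phrases.
    candidates: List[str] = []
    seen = set()
    for kp_list in keyphrase_lists:
        for phrase in kp_list:
            key = tuple(tokens(phrase))
            if not key or key in seen:
                continue
            needle = " " + " ".join(key) + " "
            if any(needle in text for text in doc_texts):
                seen.add(key)
                candidates.append(phrase)

    # Score a phrase as the average of its word scores; sort by salience.
    def phrase_score(phrase: str) -> float:
        words = tokens(phrase)
        return sum(scores.get(w, 0.0) for w in words) / len(words)

    ranked = sorted(candidates, key=phrase_score, reverse=True)

    # Merge phrases whose word set is contained in another phrase's word set.
    kept: List[str] = []
    for phrase in ranked:
        words = set(tokens(phrase))
        if any(words <= set(tokens(k)) for k in kept):
            continue  # contained in (or equal to) an earlier kept phrase
        shorter = [i for i, k in enumerate(kept) if set(tokens(k)) < words]
        if shorter:
            kept[shorter[0]] = phrase  # longer variant takes the best rank
            for i in reversed(shorter[1:]):
                del kept[i]
        else:
            kept.append(phrase)
    return kept
```

For instance, given the per-document lists `["routine training flight"]` and `["routine training"]`, only the longer phrase survives the final merging step.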

| | Full | Trunc-20 |
|---|---|---|
| # topics | 30 | 30 |
| Avg (StD) # docs per topic | 10.27 (2.24) | 10.27 (2.24) |
| Avg (StD) # KPs per topic | 43.8 (15.6) | 19.97 (0.18) |
| Avg (StD) KP word-length | 2.13 (0.66) | 2.17 (0.66) |
| # KPs with substitute cluster | 142 of 1314 | 104 of 599 |
| Avg (StD) # KPs in clusters | 2.82 (1.26) | 3.07 (1.37) |

Table 1: MK-DUC-01 stats, on the full data and when truncating the keyphrase lists to 20. (KP = keyphrase)

Manual refinement.

As we strove to generate a high-quality MDKE benchmark dataset, we further refined the keyphrase lists produced by the automatic stage above. One of the authors looked over the 30 lists with the relevant topic documents and reference summaries open for assistance, and carried out the following: (1) removed phrases that were particularly scarce or of low informativeness (e.g., “similar transmission” in the “Mad Cow Disease” topic); (2) removed phrases that were not synonymous with others, but were clearly implied from other phrases (e.g., “U.S. Senate” where other keyphrases mention the Senate); (3) clustered together phrases that can be used interchangeably (e.g., “1990 census” and “1990 population count”) to form keyphrase substitute clusters, with the more commonly used variant as the preferred alternative; (4) produced substitute clusters for persons’ titled proper nouns, when the title is optional (e.g., a cluster for “Bill Clinton” containing “President Clinton” and “Governor Bill Clinton”), leaving the untitled version as the preferred alternative.

Note that due to the variability of similar keyphrases across documents, we were able to form substitute clusters of replaceable variations of keyphrases, which is a novel concept in KE datasets. This can assist in the evaluation process when a system outputs a keyphrase that is worded differently from its counterpart in the gold list of keyphrases (see examples in annotation points (3) and (4) above), which is a major shortcoming in standard KE evaluation. We marked a preferred variant in each such cluster to enable standard evaluation, which requires a flat list of gold keyphrases.
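One way the substitute clusters might be used in evaluation is sketched below. The paper does not prescribe a cluster-aware metric, so this is only an assumed variant of F1@k in which a prediction counts as correct when it equals any variant of a gold cluster, and each cluster is credited at most once. The names `flat_gold` and `cluster_f1_at_k` are hypothetical, and matching is by lowercased exact string rather than by stems.

```python
from typing import List


def norm(phrase: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(phrase.lower().split())


def flat_gold(clusters: List[List[str]]) -> List[str]:
    """Standard evaluation view: the preferred (first) variant per cluster."""
    return [cluster[0] for cluster in clusters]


def cluster_f1_at_k(predicted: List[str],
                    clusters: List[List[str]], k: int) -> float:
    """F1@k where any variant of a gold cluster counts as a match."""
    variant_to_cluster = {
        norm(variant): idx
        for idx, cluster in enumerate(clusters)
        for variant in cluster
    }
    top_k = predicted[:k]
    matched = set()
    for phrase in top_k:
        idx = variant_to_cluster.get(norm(phrase))
        if idx is not None:
            matched.add(idx)  # each gold cluster is credited only once
    if not matched:
        return 0.0
    precision = len(matched) / len(top_k)
    recall = len(matched) / len(clusters)
    return 2 * precision * recall / (precision + recall)
```

A keyphrase without substitutes is simply a cluster of size one, so the same function covers both cases, while `flat_gold` recovers the flat list needed for standard evaluation.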

This whole dataset formation procedure yielded the final MK-DUC-01 dataset, with basic statistics appearing in Table 1. We suggest a version of the dataset where the keyphrase lists are truncated at 20 items, denoted here Trunc-20. This establishes a more representative task setting, since lead keyphrases in the gold lists are more salient in their corresponding topics, while those low in the list are less expected as topic-level keyphrases.

| Algorithm | Concat F1@1 | Concat F1@5 | Concat F1@10 | Concat F1@20 | Concat unigram-F1@1 | Concat unigram-F1@5 | Concat unigram-F1@10 | Concat unigram-F1@20 | Merge F1@1 | Merge F1@5 | Merge F1@10 | Merge F1@20 | Merge unigram-F1@1 | Merge unigram-F1@5 | Merge unigram-F1@10 | Merge unigram-F1@20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tf-Idf | 0.32 | 1.87 | 4.44 | 5.67 | 4.56 | 16.77 | 26.58 | 35.11 | 0.63 | 1.60 | 2.44 | 4.50 | 4.52 | 17.04 | 24.04 | 31.00 |
| KPMiner (El-Beltagy and Rafea, 2009) | 1.27 | 4.80 | 7.56 | 9.68 | 5.52 | 21.27 | 30.55 | 34.88 | 1.59 | 5.34 | 7.56 | 10.84 | 5.39 | 19.45 | 28.85 | 38.93 |
| YAKE (Campos et al., 2020) | 2.54 | 7.20 | 10.67 | 13.17 | 6.24 | 24.80 | 33.30 | 36.04 | 2.86 | 5.87 | 8.23 | 10.84 | 7.06 | 25.75 | 33.87 | 37.72 |
| TextRank (Mihalcea and Tarau, 2004) | 0.63 | 4.00 | 4.67 | 8.01 | 10.11 | 28.25 | 32.03 | 30.72 | 2.54 | 9.88 | 13.79 | 17.17 | 9.08 | 29.66 | 39.90 | 41.28 |
| SingleRank (Wan and Xiao, 2008) | 0.95 | 5.08 | 6.90 | 12.01 | 12.51 | 29.52 | 32.81 | 31.98 | 2.86 | 8.80 | 14.23 | 18.51 | 9.26 | 28.96 | 38.56 | 42.01 |
| TopicRank (Bougouin et al., 2013) | 1.59 | 6.40 | 9.33 | 11.33 | 5.47 | 18.69 | 28.75 | 36.72 | 4.13 | 10.94 | 16.01 | 18.68 | 6.75 | 25.22 | 39.24 | 44.88 |
| TopicalPageRank (Sterckx et al., 2015) | 1.27 | 5.61 | 7.79 | 13.35 | 12.16 | 29.81 | 33.67 | 32.44 | 2.86 | 9.08 | 15.56 | 19.84 | 9.06 | 28.82 | 39.11 | 42.41 |
| PositionRank (Florescu and Caragea, 2017) | 1.90 | 8.28 | 11.80 | 16.35 | 9.31 | 27.09 | 32.20 | 33.22 | 3.17 | 8.53 | 15.56 | 19.51 | 8.44 | 27.50 | 39.20 | 44.11 |
| MultipartiteRank (Boudin, 2018) | 1.27 | 6.13 | 10.23 | 12.17 | 6.04 | 20.08 | 30.12 | 38.70 | 4.13 | 11.21 | 17.34 | 21.34 | 7.64 | 25.63 | 39.55 | 45.61 |
| CollabRank (Wan and Xiao, 2008) | - | - | - | - | - | - | - | - | 2.86 | 9.61 | 14.9 | 17.84 | 9.37 | 28.03 | 37.68 | 41.26 |
| (Bayatmakou et al., 2017) [multi-doc] | - | - | - | - | - | - | - | - | 0.09 | 0.93 | 1.46 | 1.88 | 1.84 | 7.74 | 11.69 | 15.18 |

Table 2: Results on various KE algorithms tested with the Trunc-20 version of our MK-DUC-01 dataset. In Concat mode all topic documents are concatenated as a single text input, and in Merge mode algorithms are run on individual documents after which keyphrase lists are merged and reranked. The bottom two algorithms are multi-document based KE algorithms, and work in Merge mode only.

3.2 Dataset Experimentation

We demonstrate the use of MK-DUC-01 by testing ten existing single-document KE algorithms and a multi-document one. The single-document algorithms are applied in two modes: (1) Concat, where all topic documents are concatenated into a single text that is then fed to the algorithm to output a list of keyphrases per topic; (2) Merge, where for each topic, the algorithm is fed one document at a time, and the generated lists of keyphrases are merged using a similar strategy as in the automatic merging and reranking procedure in §3.1, except that $S_t = \emptyset$, i.e., it does not consider the reference summary set $S_t$, which is naturally unavailable in the KE task. CollabRank (Wan and Xiao, 2008) uses its collaborating documents, hence only Merge is applied. The algorithm by Bayatmakou et al. (2017) was designed as a multi-document algorithm, but follows a similar approach to our Merge mode, running RAKE (Rose et al., 2010) per document, and merging keyphrase lists with a method different from ours.
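The difference between the two modes can be sketched as below, for any single-document extractor exposed as a function from a text and a list size to a list of keyphrases. The `Extractor` type, `toy_extractor`, and the simplified reranking by average document frequency are all assumptions made for illustration; they do not reproduce the exact merging used in the experiments or the API of any KE toolkit.

```python
from collections import Counter
from typing import Callable, List

Extractor = Callable[[str, int], List[str]]


def concat_mode(extract: Extractor, documents: List[str], k: int) -> List[str]:
    """Concatenate all topic documents and run the extractor once."""
    return extract("\n".join(documents), k)


def merge_mode(extract: Extractor, documents: List[str], k: int) -> List[str]:
    """Run the extractor per document, then merge and rerank the lists."""
    per_doc = [extract(doc, k) for doc in documents]

    # ASSUMPTION: word salience is document frequency within the topic.
    df: Counter = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))

    # Deduplicate while preserving first-seen order, skipping empty phrases.
    merged = list(dict.fromkeys(
        p.lower() for kps in per_doc for p in kps if p.strip()
    ))

    def salience(phrase: str) -> float:
        words = phrase.split()
        return sum(df[w] for w in words) / len(words)

    merged.sort(key=salience, reverse=True)
    return merged[:k]


def toy_extractor(text: str, k: int) -> List[str]:
    """Hypothetical stand-in: the k most frequent words over 3 letters."""
    counts = Counter(w for w in text.lower().split() if len(w) > 3)
    return [w for w, _ in counts.most_common(k)]
```

In Merge mode the extractor only ever sees one document at a time, so cross-document evidence enters solely through the reranking step.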

For evaluation, we employ the standard stemmed $F_1@k$ and unigram-$F_1@k$ metrics for $k \in \{1, 5, 10, 20\}$, on the MK-DUC-01 data, both in the Trunc-20 version (Table 2) and in its full version (Table 4 in the appendix). We observe a clear benefit of the Merge strategy across nearly all algorithms, and a strong improvement over the dedicated MDKE algorithm (Bayatmakou et al., 2017).

4 Conclusion

We review the research conducted on multi-document KE, which is far less studied than its single-document counterpart. Only a few works have tackled the MDKE task head-on, all without a suitable dataset. Meanwhile, the notion of MDKE has been indirectly applied in multi-document summarization research, and can potentially assist in new trends within the summarization community, using recent advances in multi-text processing. We introduce the first MDKE dataset as a benchmark, and evaluate different KE algorithms on it to provide baseline results for the task.


This work was supported in part by the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1); by the Israel Science Foundation (grant 1951/17); and by a grant from the Israel Ministry of Science and Technology.


  • S. Alshahrani and M. Bikdash (2019) Multi-Document Summarization Based on Keyword Fusion. In 2019 SoutheastCon, pp. 1–5.
  • F. Bayatmakou, A. Ahmadi, and A. Mohebi (2017) Automatic Query-based Keyword and Keyphrase Extraction. In 2017 Artificial Intelligence and Signal Processing Conference (AISP), pp. 325–330.
  • G. Berend and R. Farkas (2013) Single-Document Keyphrase Extraction for Multi-Document Keyphrase Extraction. Computación y Sistemas 17 (2), pp. 179–186.
  • S. K. Bharti, K. S. Babu, A. Pradhan, S. Devi, T. Priya, E. Orhorhoro, O. Orhorhoro, V. Atumah, E. Baruah, P. Konwar, et al. (2017) Automatic Keyword Extraction for Text Summarization in Multi-document E-newspapers Articles. European Journal of Advances in Engineering and Technology 4 (6), pp. 410–427.
  • P. Bhaskar (2013) Multi-Document Summarization using Automatic Key-Phrase Extraction. In Proceedings of the Student Research Workshop associated with RANLP 2013, Hissar, Bulgaria, pp. 22–29.
  • F. Boudin (2016) PKE: an Open Source Python-based Keyphrase Extraction Toolkit. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, pp. 69–73.
  • F. Boudin (2018) Unsupervised Keyphrase Extraction with Multipartite Graphs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 667–672.
  • A. Bougouin, F. Boudin, and B. Daille (2013) TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, Nagoya, Japan, pp. 543–551.
  • R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, and A. Jatowt (2020) YAKE! Keyword Extraction from Single Documents using Multiple Local Features. Information Sciences 509, pp. 257–289.
  • S. R. El-Beltagy and A. Rafea (2009) KP-Miner: A keyphrase extraction system for English and Arabic documents. Information systems 34 (1), pp. 132–144.
  • H. N. Fejer and N. Omar (2015) Automatic Multi-Document Arabic Text Summarization using Clustering and Keyphrase Extraction. Journal of Artificial Intelligence 8 (1), pp. 1.
  • C. Florescu and C. Caragea (2017) PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1105–1115.
  • L. Frermann and A. Klementiev (2019) Inducing Document Structure for Aspect-based Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6263–6273.
  • S. Gerani, G. Carenini, and R. T. Ng (2019) Modeling Content and Structure for Abstractive Review Summarization. Computer Speech and Language 53, pp. 302–331. ISSN 0885-2308.
  • C. Gutwin, G. Paynter, I. Witten, C. Nevill-Manning, and E. Frank (1999) Improving Browsing in Digital Libraries with Keyphrase Indexes. Decision Support Systems 27 (1-2), pp. 81–104.
  • K. M. Hammouda, D. N. Matute, and M. S. Kamel (2005) Corephrase: Keyphrase Extraction for Document Clustering. In International workshop on machine learning and data mining in pattern recognition, pp. 265–274.
  • K. S. Hasan and V. Ng (2014) Automatic Keyphrase Extraction: A Survey of the State of the Art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 1262–1273.
  • H. Hayashi, P. Budania, P. Wang, C. Ackerson, R. Neervannan, and G. Neubig (2021) WikiAsp: A Dataset for Multi-domain Aspect-based Summarization. Transactions of the Association for Computational Linguistics 9, pp. 211–225. ISSN 2307-387X.
  • K. Hong and A. Nenkova (2014) Improving the Estimation of Word Importance for News Multi-Document Summarization. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, pp. 712–721.
  • M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd (2020) spaCy: Industrial-strength Natural Language Processing in Python.
  • A. Hulth and B. B. Megyesi (2006) A Study on Automatically Extracted Keywords in Text Categorization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, pp. 537–544.
  • S. Jones, S. Lundy, and G. W. Paynter (2002) Interactive Document Summarisation using Automatically Extracted Keyphrases. In Proceedings of the 35th Annual Hawaii International Conference on System Sciences, pp. 1160–1169.
  • S. Jones and M. Mahoui (2000) Hierarchical Document Clustering using Automatically Extracted Keyphrases. Computer Science Working Papers.
  • S. N. Kim and M. Kan (2009) Re-examining automatic keyphrase extraction approaches in scientific articles. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009), Singapore, pp. 9–16.
  • M. Krapivin, A. Autaeu, and M. Marchese (2009) Large Dataset for Keyphrases Extraction.
  • C. Li, Y. Liu, and L. Zhao (2015) Using External Resources and Joint Learning for Bigram Weighting in ILP-Based Multi-Document Summarization. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 778–787.
  • Z. Li and X. Zheng (2020) Unsupervised Summarization by Jointly Extracting Sentences and Keywords. arXiv preprint arXiv:2009.07481.
  • L. Ma, T. He, F. Li, Z. Gui, and J. Chen (2008) Query-focused Multi-document Summarization using Keyword Extraction. In 2008 International Conference on Computer Science and Software Engineering, Vol. 1, pp. 20–23.
  • L. Marujo, A. Gershman, J. Carbonell, R. Frederking, and J. P. Neto (2012) Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, pp. 399–403.
  • Z. A. Merrouni, B. Frikh, and B. Ouhbi (2019) Automatic Keyphrase Extraction: a Survey and Trends. Journal of Intelligent Information Systems 54, pp. 391–424.
  • R. Mihalcea and P. Tarau (2004) TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, pp. 404–411.
  • M. T. Nayeem and Y. Chali (2017) Extract with Order for Coherent Multi-Document Summarization. In Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing, Vancouver, Canada, pp. 51–56.
  • P. Over (2001) Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems. In Proceedings of DUC 2001 Document Understanding Conference, Vol. 49.
  • E. Papagiannopoulou and G. Tsoumakas (2020) A Review of Keyphrase Extraction. WIREs Data Mining and Knowledge Discovery 10 (2), pp. e1339.
  • C. Qing-sheng (2007) Research on Keyword-Extraction from Multi-Document in User Model. Computer Simulation.
  • S. Rose, D. Engel, N. Cramer, and W. Cowley (2010) Automatic Keyword Extraction from Individual Documents. In Text Mining, pp. 1–20. ISBN 9780470689646.
  • U. Schäfer, J. Read, and S. Oepen (2012) Towards an ACL Anthology Corpus with Logical Document Structure. An Overview of the ACL 2012 Contributed Task. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, Jeju Island, Korea, pp. 88–97.
  • S. Siddiqi and A. Sharan (2015) Keyword and Keyphrase Extraction Techniques: A Literature Review. International Journal of Computer Applications 109, pp. 18–23.
  • L. Sterckx, T. Demeester, J. Deleu, and C. Develder (2015) Topical Word Importance for Fast Keyphrase Extraction. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion, New York, NY, USA, pp. 121–122. ISBN 9781450334730.
  • C. Sun, L. Hu, S. Li, T. Li, H. Li, and L. Chi (2020) A Review of Unsupervised Keyphrase Extraction Methods Using Within-Collection Resources. Symmetry 12 (11). ISSN 2073-8994.
  • P. D. Turney (2000) Learning Algorithms for Keyphrase Extraction. Information Retrieval 2, pp. 303–336.
  • E. M. Voorhees and D. M. Tice (2000) The TREC-8 Question Answering Track. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00), Athens, Greece.
  • X. Wan and J. Xiao (2008) CollabRank: Towards a Collaborative Approach to Single-Document Keyphrase Extraction. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, pp. 969–976.
  • J. YangJie, C. Dong-feng, L. Xiao-qing, and B. Yu (2008) Keyword Extraction in Multi-Document Based on Joint Weight. Journal of Chinese Information Processing, pp. 06.

Appendix A Further Experiment Details

| Algorithm/Mode | Concat | Merge |
|---|---|---|
| TfIdf | 1.30 | 2.22 |
| KPMiner | 1.42 | 1.39 |
| YAKE | 1.99 | 2.58 |
| TextRank | 3.64 | 2.68 |
| SingleRank | 3.24 | 2.57 |
| TopicRank | 1.51 | 2.08 |
| TopicalPageRank | 3.14 | 2.52 |
| PositionRank | 2.52 | 2.32 |
| MultipartiteRank | 1.51 | 2.12 |
| CollabRank | - | 2.54 |
| (Bayatmakou et al., 2017) | - | 3.00 |

Table 3: Average number of words per keyphrase produced by the different algorithms, using the two generation modes (Concat and Merge) on MK-DUC-01.

| Algorithm | Concat F1@1 | Concat F1@5 | Concat F1@10 | Concat F1@20 | Concat unigram-F1@1 | Concat unigram-F1@5 | Concat unigram-F1@10 | Concat unigram-F1@20 | Merge F1@1 | Merge F1@5 | Merge F1@10 | Merge F1@20 | Merge unigram-F1@1 | Merge unigram-F1@5 | Merge unigram-F1@10 | Merge unigram-F1@20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tf-Idf | 0.11 | 1.22 | 2.91 | 4.29 | 2.62 | 9.93 | 16.71 | 24.53 | 0.54 | 1.21 | 1.67 | 3.33 | 2.58 | 10.58 | 15.59 | 22.59 |
| KPMiner (El-Beltagy and Rafea, 2009) | 0.95 | 3.03 | 4.90 | 7.26 | 3.26 | 12.40 | 19.25 | 25.36 | 1.06 | 3.29 | 5.04 | 7.90 | 3.11 | 11.31 | 18.33 | 27.32 |
| YAKE (Campos et al., 2020) | 1.58 | 4.56 | 7.18 | 9.42 | 3.66 | 15.28 | 22.34 | 27.10 | 1.54 | 3.37 | 5.12 | 7.52 | 4.13 | 16.14 | 23.16 | 29.56 |
| TextRank (Mihalcea and Tarau, 2004) | 0.45 | 2.76 | 3.19 | 6.24 | 6.55 | 20.28 | 25.41 | 28.91 | 1.42 | 5.53 | 8.16 | 11.53 | 5.20 | 18.80 | 27.62 | 33.54 |
| SingleRank (Wan and Xiao, 2008) | 0.42 | 3.37 | 4.74 | 8.94 | 7.82 | 20.46 | 25.23 | 28.28 | 1.56 | 4.91 | 8.47 | 12.20 | 5.16 | 18.29 | 26.86 | 33.24 |
| TopicRank (Bougouin et al., 2013) | 1.08 | 4.13 | 5.91 | 8.76 | 3.52 | 11.84 | 18.75 | 28.15 | 2.32 | 6.45 | 9.99 | 13.39 | 3.97 | 16.14 | 26.07 | 34.23 |
| TopicalPageRank (Sterckx et al., 2015) | 0.73 | 3.71 | 5.44 | 9.94 | 7.72 | 20.61 | 25.79 | 28.48 | 1.62 | 5.10 | 9.11 | 13.27 | 5.18 | 18.09 | 26.87 | 33.50 |
| PositionRank (Florescu and Caragea, 2017) | 1.28 | 5.51 | 8.14 | 12.48 | 6.24 | 18.10 | 23.75 | 28.61 | 1.76 | 4.61 | 9.36 | 13.07 | 4.93 | 17.00 | 26.55 | 33.83 |
| MultipartiteRank (Boudin, 2018) | 0.93 | 3.66 | 6.70 | 9.63 | 3.75 | 12.46 | 20.21 | 29.41 | 2.45 | 6.90 | 10.65 | 15.24 | 4.65 | 16.28 | 26.19 | 34.37 |
| CollabRank (Wan and Xiao, 2008) | - | - | - | - | - | - | - | - | 1.56 | 5.37 | 8.80 | 11.89 | 5.28 | 17.83 | 25.82 | 33.20 |
| (Bayatmakou et al., 2017) [multi-doc] | - | - | - | - | - | - | - | - | 0.32 | 1.60 | 2.00 | 2.17 | 3.57 | 11.29 | 15.04 | 16.31 |

Table 4: Results on various KE algorithms tested with our MK-DUC-01 dataset. In Concat mode all topic documents are concatenated as a single text input, and in Merge mode algorithms are run on individual documents after which keyphrase lists are merged and reranked. The bottom two algorithms are multi-document based KE algorithms, and work in Merge mode only.

| Algorithm | F1@1 | F1@5 | F1@10 | F1@20 | unigram-F1@1 | unigram-F1@5 | unigram-F1@10 | unigram-F1@20 | Avg. KP Length |
|---|---|---|---|---|---|---|---|---|---|
| Tf-Idf | 3.45 | 9.28 | 11.02 | 11.02 | 11.12 | 29.46 | 33.64 | 30.99 | 1.50 |
| KPMiner | 6.43 | 14.31 | 13.98 | 11.40 | 14.15 | 34.06 | 37.82 | 37.61 | 1.17 |
| YAKE | 4.43 | 12.44 | 13.85 | 14.09 | 13.04 | 28.55 | 29.17 | 27.41 | 2.02 |
| TextRank | 3.25 | 11.41 | 17.16 | 19.57 | 15.97 | 29.55 | 30.08 | 27.73 | 2.84 |
| SingleRank | 7.05 | 19.83 | 24.02 | 23.85 | 21.21 | 35.54 | 34.92 | 30.97 | 2.59 |
| TopicRank | 8.48 | 19.64 | 21.41 | 19.43 | 14.85 | 35.55 | 39.74 | 37.21 | 1.57 |
| TopicalPageRank | 7.62 | 20.63 | 24.24 | 24.59 | 21.20 | 35.88 | 35.22 | 31.83 | 2.50 |
| PositionRank | 8.01 | 22.94 | 27.42 | 26.56 | 17.61 | 35.61 | 38.19 | 35.06 | 2.08 |
| MultipartiteRank | 9.33 | 21.63 | 23.30 | 22.16 | 15.78 | 37.07 | 42.42 | 40.40 | 1.55 |
| CollabRank | 8.94 | 22.88 | 26.92 | 25.50 | 22.37 | 36.92 | 37.38 | 33.31 | 2.41 |

Table 5: The results of various single-document KE algorithms on the single-document DUC-2001 KE dataset, for reference as a comparison to algorithms’ results in the multi-document setting (Tables 2, 4 and 3). The average number of KPs in each document’s gold list in the dataset is 8.08, and all keyphrases are used in the evaluation. CollabRank is a single-document KE algorithm that uses related documents (within the same topic) in its operation.
# Keyphrase
1 drug testing
2 illegal steroid use
drug use
illegal performance-enhancing drugs
3 Olympics gold medal
4 Seoul Olympics
5 banned steroid
illegal anabolic steroid
6 Ben Johnson
Canadian Ben Johnson
Sprinter Ben Johnson
Canadian Olympic sprinter
7 world record
8 anabolic steroid stanozolol
illegal steroid stanzolol
9 world championships
10 Charlie Francis
Canadian coach Charlie Francis
Canadian national sprint coach
11 100-meter dash
100-metre sprint
12 stanozolol use
13 Carl Lewis
American Carl Lewis
U.S. sprinter Carl Lewis
14 urine sample
15 steroid furazabol
16 Jamie Astaphan
17 steroid combination
18 Toronto
19 personal physician
20 disgraced Olympic sprinter
21 Canadian inquiry
federal inquiry
22 drug scandal
23 Angella Issajenko
24 Johnson scandal
25 stripping
26 controlled substance
27 world record-holder
28 Hamilton spectator indoor games
29 disappointed nation
30 record crowd
31 world-class sprinter
32 two-year suspension
33 news conference
34 first race
35 second-place finish
36 Lynda Huey
37 first indoor loss
38 slow start
39 Daron Council
40 homecoming
41 expectation
Table 6: The keyphrases in our MK-DUC-01 dataset for topic d31, which contains 13 documents about the Ben Johnson steroid scandal. Keyphrases with multiple items represent substitute clusters, where the first item in a cluster is the preferred wording, used when standard KE evaluation requires a flat list of gold keyphrases. The top 20 keyphrases are used in the Trunc-20 dataset version.

Additional baseline evaluations.

Table 4 presents the results on the full gold keyphrase lists (non-truncated). When compared to the results on the Trunc-20 truncated lists (Table 2), there is an expected degradation in all scores, since the keyphrases lower in the lists are less representative of the respective document sets. This, together with the longer absolute lengths of the lists, makes it less likely for the KE algorithms to extract correct keyphrases, and hence yields considerably lower recall scores across the board (not shown here), while precision scores mostly remain the same or are slightly higher.

Evaluation details.

When computing F1@k, the system keyphrase and the gold keyphrase are compared using stemmed exact match.

When computing unigram-F1@k: (1) the top-k items in the system keyphrase list are retrieved; (2) those keyphrases are flattened out to a single list of stems; (3) the gold list is also flattened to a list of stems; (4) the former is evaluated against the latter with F1.

The average of the F1 scores over all instances is the final score presented.
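
The two metrics described above can be sketched as follows. This is a minimal illustration and not the evaluation code used for the reported results. Several details are assumptions: the function names are hypothetical, the stemmer is passed in as a parameter (the default only lowercases, so a real stemmer such as Porter's would need to be supplied), duplicate stems are collapsed into sets, and precision is computed over the number of keyphrases actually returned in the top-k.

```python
from typing import Callable, Iterable, Sequence, Tuple

Stemmer = Callable[[str], str]


def stem_phrase(phrase: str, stem: Stemmer) -> Tuple[str, ...]:
    """Lowercase a phrase, split it on whitespace, and stem each token."""
    return tuple(stem(token) for token in phrase.lower().split())


def f1(num_correct: int, num_predicted: int, num_gold: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if num_correct == 0 or num_predicted == 0 or num_gold == 0:
        return 0.0
    precision = num_correct / num_predicted
    recall = num_correct / num_gold
    return 2 * precision * recall / (precision + recall)


def f1_at_k(system: Sequence[str], gold: Sequence[str], k: int,
            stem: Stemmer = str.lower) -> float:
    """F1@k with stemmed exact match between whole keyphrases."""
    top_k = {stem_phrase(p, stem) for p in system[:k]}
    gold_set = {stem_phrase(p, stem) for p in gold}
    return f1(len(top_k & gold_set), len(top_k), len(gold_set))


def unigram_f1_at_k(system: Sequence[str], gold: Sequence[str], k: int,
                    stem: Stemmer = str.lower) -> float:
    """unigram-F1@k: flatten both lists to stems, then compare with F1.

    Assumption: stems are compared as sets (duplicates are ignored).
    """
    system_stems = {s for p in system[:k] for s in stem_phrase(p, stem)}
    gold_stems = {s for p in gold for s in stem_phrase(p, stem)}
    return f1(len(system_stems & gold_stems), len(system_stems), len(gold_stems))


def mean_score(scores: Iterable[float]) -> float:
    """Average the per-instance scores to obtain the final reported score."""
    scores = list(scores)
    return sum(scores) / len(scores) if scores else 0.0
```

With a system list of "Ben Johnson", "drug testing", "gold medal" and a gold list of "drug testing", "Ben Johnson", "world record", "Seoul Olympics", the sketch gives F1@2 = 2/3 (two of two predictions correct, two of four gold keyphrases found) and unigram-F1@3 = 4/7 (four shared stems out of six system stems and eight gold stems).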

Keyphrase sizes.

Table 3 presents the average token length of the keyphrases produced by each algorithm, when using the Concat and Merge generation modes. The keyphrase sizes in Concat are representative of the corresponding algorithms’ output sizes, while the sizes in Merge go through an additional process, hence slightly altering the natural output sizes of the algorithms.

Single-document KE results.

We ran the relevant algorithms from Tables 2 and 4 on the single-document DUC-2001 KE dataset (308 documents and 8.08 keyphrases per document), to get a sense of their comparable quality in the single and multiple document settings. Results are presented in Table 5. There are 7 documents that were not processed in the KPMiner algorithm due to processing errors.

Overall, we see that the algorithm rankings are quite similar in the two settings, across k values and in both metrics.

Algorithm implementations.

We used the PKE Python toolkit package (Boudin, 2016) for all KE algorithms except for that of Bayatmakou et al. (2017), which we implemented ourselves. The Bayatmakou et al. (2017) algorithm uses RAKE (Rose et al., 2010) as its underlying single-document KE component, for which we used the nltk-rake library. As RAKE outputted very long keyphrases yielding low scores, we used only those of up to 3 words. For CollabRank, we considered all other documents in its original topic document-set as “collaborating” documents, and computed their similarity scores using spaCy (Honnibal et al., 2020) text similarity.
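
The length restriction applied to the RAKE output can be illustrated with a small filter. This is only a sketch: the function name and the assumption that the RAKE output is available as a ranked list of plain strings are ours, and whitespace tokenization is used to count words.

```python
from typing import List


def filter_by_length(ranked_phrases: List[str], max_words: int = 3) -> List[str]:
    """Keep phrases with at most `max_words` words, preserving rank order."""
    return [p for p in ranked_phrases if len(p.split()) <= max_words]


# Hypothetical ranked output, for illustration only.
candidates = [
    "canadian national sprint coach charlie francis",
    "anabolic steroid stanozolol",
    "urine sample",
]
short_candidates = filter_by_length(candidates)
```

Here only the two phrases of three words or fewer are kept, in their original order.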

Execution resources.

All algorithms and automatic methods used for annotation and experimentation were run on a standard laptop, and no special hardware was required.

Run times were up to about a second per keyphrase extraction instance, except for CollabRank, which required about 15-20 seconds per document. Running the Merge mode on the document-sets required tens of seconds for some algorithms, as the process iterates over all documents separately. The Concat mode, which requires a single run per document-set, was substantially faster overall.
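
The difference in cost between the two modes follows from their structure, which can be sketched as below. The extractor interface (a callable returning scored keyphrases) and the toy extractor are hypothetical, and the reranking rule in Merge mode (summing each keyphrase's scores across documents) is an assumption made for illustration, not necessarily the rule used in our experiments.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Scored = List[Tuple[str, float]]
Extractor = Callable[[str], Scored]  # hypothetical interface: text -> ranked (phrase, score)


def concat_mode(docs: Sequence[str], extract: Extractor, top_n: int = 20) -> Scored:
    """Concatenate all topic documents and run the extractor once."""
    return extract("\n".join(docs))[:top_n]


def merge_mode(docs: Sequence[str], extract: Extractor, top_n: int = 20) -> Scored:
    """Run the extractor once per document, then merge and rerank.

    Assumption: scores of the same keyphrase are summed across documents.
    """
    totals: Dict[str, float] = defaultdict(float)
    for doc in docs:  # one extractor call per document
        for phrase, score in extract(doc):
            totals[phrase] += score
    reranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return reranked[:top_n]


def toy_extract(text: str) -> Scored:
    """Stand-in extractor that scores single words by frequency."""
    counts = Counter(text.lower().split())
    return sorted(((w, float(c)) for w, c in counts.items()),
                  key=lambda item: (-item[1], item[0]))


docs = ["steroid scandal steroid", "steroid inquiry"]
merged = merge_mode(docs, toy_extract)
concatenated = concat_mode(docs, toy_extract)
```

Merge mode calls the extractor once per document, whereas Concat mode calls it once per document-set, which is consistent with the run-time difference noted above.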

Appendix B Dataset Example

Table 6 presents an example list of keyphrases from our MK-DUC-01 dataset. The top 20 keyphrases are used in the Trunc-20 dataset version, while the full list is used in the full dataset version. Some keyphrases have multiple wording variations, acting as the substitute clusters. The first item in a cluster is used in the standard evaluation when a flat list of keyphrases is required.
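
One possible in-memory representation of such a list is sketched below, using the first five entries of Table 6. The type alias and function names are hypothetical and do not reflect the released file format of the dataset.

```python
from typing import List

Cluster = List[str]  # substitute wordings; the first item is the preferred one

# First five entries of topic d31 (Table 6), in ranked order.
topic_d31_head: List[Cluster] = [
    ["drug testing"],
    ["illegal steroid use", "drug use", "illegal performance-enhancing drugs"],
    ["Olympics gold medal"],
    ["Seoul Olympics"],
    ["banned steroid", "illegal anabolic steroid"],
]


def flat_gold(clusters: List[Cluster]) -> List[str]:
    """Flat gold list for standard evaluation: first item of each cluster."""
    return [cluster[0] for cluster in clusters]


def truncate(clusters: List[Cluster], n: int = 20) -> List[Cluster]:
    """Keep the top-n clusters, as in the Trunc-20 version when n is 20."""
    return clusters[:n]


flat = flat_gold(topic_d31_head)
top_two = truncate(topic_d31_head, n=2)
```

The flat list keeps one wording per cluster, so its length equals the number of clusters rather than the number of distinct wordings.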