Keyphrase extraction (KE) is the task of selecting important and topical phrases from within a body of text (Turney, 2000). Single-document KE has garnered extensive research due to its vast practical uses. For example, keyphrases are listed on scientific or news articles, product descriptions and meeting transcripts to give the reader a glimpse of the source text's subject matter. Additionally, these keyphrases are serviceable for downstream tasks like document categorization (Hulth and Megyesi, 2006), clustering (Jones and Mahoui, 2000), summarization (Jones et al., 2002) and search (Gutwin et al., 1999). Hence, single-document KE is resourced with a multitude of datasets across several domains (e.g. Kim and Kan (2009) and Krapivin et al. (2009) for scientific papers, or Wan and Xiao (2008) and Marujo et al. (2012) for news), and is frequently reviewed in survey papers to report on the advancements of the task (e.g. Hasan and Ng, 2014; Siddiqi and Sharan, 2015; Merrouni et al., 2019; Papagiannopoulou and Tsoumakas, 2020).
Conversely, multi-document KE (MDKE) has been researched only sporadically. To the best of our knowledge, only a handful of works have explicitly targeted the task (§2.1).
This is despite it being just as valuable for gaining a high-level depiction of a set
of related documents, and it being leveraged as a medium for supporting multi-document summarization (§2.2). It could additionally assist the aspect-based summarization task in extracting the aspects around which summaries are to be generated -- especially useful when the texts are information-rich with high sub-topical variation (e.g., Frermann and Klementiev, 2019; Gerani et al., 2019; Hayashi et al., 2021). Moreover, the multi-document setting, with its large inputs and information variability, introduces an interesting and challenging dimension of complexity that recent advances in text processing may tackle.
To make matters trickier, no dataset was previously available for MDKE, i.e. one consisting of sets of documents and their corresponding gold lists of keyphrases. Previous works, therefore, did not evaluate with standard automatic keyphrase evaluation methods, or conducted extrinsic evaluations within the summarization task (see §2 for details).
To initiate a more established research line on MDKE, we provide here the first literature review of the task, and propose an MDKE dataset (github.com/OriShapira/MkDUC-01), which can provide a benchmark for the task. The dataset is based on the existing DUC-2001 single-document KE dataset (Wan and Xiao, 2008) in the news domain. We leverage the properties of the original DUC-2001 multi-document summarization dataset (https://www-nlpir.nist.gov/projects/duc/guidelines/2001.html) to convert the single-document KE dataset to a multi-document one. We run several KE algorithms on the dataset to demonstrate the current state of the task on the new benchmark.
2 Multi-Document KE Review
We outline the research conducted explicitly and indirectly on MDKE. Few works have expressly tackled multi-document KE; however, it has also been applied, as an intermediate step, in several studies on multi-document summarization (MDS).
2.1 Works on Multi-Document KE
Hammouda et al. (2005) seem to be the first to have worked on the MDKE task, targeting the web-document domain. Word-sequences common to all documents are ranked based on term frequency, term position in documents, and use in titles or headlines. To evaluate, 10 sets of 30 documents are retrieved via query search. The system keyphrases are compared, by word-stem overlap, to the single corresponding document-set search query.
Berend and Farkas (2013) approach the task by merging keyphrase lists from individual documents in a document set. Within a document, n-grams that satisfy several rules are classified as keyphrases with a trained maxent model that uses features of word surface form and basic Wikipedia knowledge. An information gain metric ranks the unified list of keyphrases, from which the top-15 are taken out of a subset of documents.
To evaluate, scientific paper sets from ACL workshops (Schäfer et al., 2012) (110 workshops with ~14 articles each) were paired with their respective ‘‘call-for-papers’’ (CFP) website sections. A system keyphrase list on a paper set was then compared to the CFP text via word-level cosine similarity. Also, NLP experts assessed whether keyphrase lists indeed properly characterized the corresponding workshop.
Bayatmakou et al. (2017) also take a single-to-multi-document approach. A set of documents is retrieved with a search query, and individual documents' keyphrases are extracted with RAKE (Rose et al., 2010). The keyphrases in the unified list are re-scored according to word co-occurrence within keyphrases and term-frequency within documents, weighted by document salience (similarity to the search query). While an automatic evaluation is proposed (measuring against the query and co-occurrence of keywords and query in documents), the actual assessment is a manual satisfaction rating against the search query. Experiments were performed over a large dataset of scientific abstracts (https://www.webofknowledge.com).
As for keyword extraction (unigrams), Bharti et al. (2017) propose several term-frequency methods for scoring words in sets of news articles. They evaluate the resulting keyword list against the aggregated words in the articles' headlines, with recall and precision. Qing-sheng (2007) designs cluster-based and MMR-based algorithms, and tests against annotated data. YangJie et al. (2008) use tf-idf and word-level features to score words. No further details are available on the latter two references, as their papers could not be retrieved.
Relating to MDKE, Wan and Xiao (2008) propose a method for single-document KE, CollabRank, that ranks a document’s keyphrases with respect to similar ‘‘collaborating’’ documents. To compute word saliency, they produce an affinity graph reflecting co-occurrence relationships between documents’ words. That paper also introduces the single-document KE dataset that we build upon for MDKE (§3).
2.2 KE for Multi-Document Summarization
MDS aims to generate a passage that covers the main issues of the source document-set. Keyphrases naturally point to the central aspects, and can therefore assist in marking the important information for a summary.
Some works extract keyphrases from the document-set, e.g. with conventional term-frequency methods (Alshahrani and Bikdash, 2019), by using single-document KE algorithms on the concatenated documents (Nayeem and Chali, 2017), or through query similarity for query-focused summarization (Ma et al., 2008). These keyphrases are then used to rank sentences across the documents for the potential summaries. Output summaries are standardly evaluated against reference summaries from MDS datasets. While keyphrases have also been extracted per document for the same purpose of MDS (e.g. Bhaskar, 2013; Fejer and Omar, 2015), it is difficult to assess which approach is most effective since various factors impact the resulting summaries.
Other works focus on word importance rather than keyphrase extraction. Hong and Nenkova (2014) assign importance to documents’ content words based on their appearance in reference summaries. Alternative methods include ILP frameworks for bigram selection and weighting, using syntactic and external information (Li et al., 2015), or through a joint sentence/keyword reinforcement process that converges to a high quality summary (Li and Zheng, 2020).
2.3 KE Evaluation
Most single-document KE works automatically evaluate a keyphrase list against a gold list. As seen in §2.1 above, the few works on MDKE do not conduct such evaluations due to the lack of test datasets, a gap we address with our proposed MDKE dataset (§3). They instead conduct disparate assessments, evaluating against references that are limited in informativeness or that are unrepresentative of keyphrase requisites.
The most prominent KE metric for comparing against a gold keyphrase list is F1@k, which considers the recall and precision of the predicted list, truncated to k items, against the full gold list. Words are often stemmed to allow some reasonable variation of word forms. To allow for a broader valuation, the unigram-level F1@k score is also used -- where the two lists of keyphrases are each flattened out to a single list of words. Other, less exploited metrics include the Mean Reciprocal Rank (Voorhees and Tice, 2000) and Mean Average Precision, which also take list-order into account. (See a KE review paper, such as Sun et al. (2020), for more details on the evaluation metrics.)
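As an illustration, the phrase-level F1@k computation can be sketched as follows. This is a minimal sketch, not a reference implementation: the toy `stem` function is a hypothetical stand-in for a real stemmer (e.g., Porter's), and actual evaluation scripts differ in details such as duplicate handling.

```python
def stem(word):
    # Toy suffix-stripping stemmer (an assumption; real setups use e.g. Porter stemming).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(phrase):
    # Lowercase and stem each word, so near-identical word forms match.
    return " ".join(stem(w) for w in phrase.lower().split())

def f1_at_k(predicted, gold, k):
    """F1@k: precision/recall of the top-k predicted phrases against the
    full gold list, matched by stemmed exact match."""
    pred = [normalize(p) for p in predicted[:k]]
    gold_set = {normalize(g) for g in gold}
    correct = sum(1 for p in set(pred) if p in gold_set)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return 0.0 if correct == 0 else 2 * precision * recall / (precision + recall)
```

For instance, a predicted list whose top-3 items cover both gold phrases out of three guesses yields precision 2/3 and recall 1, i.e. F1@3 = 0.8.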
3 New Dataset
3.1 Dataset Formation
Our proposed MDKE dataset, which we name MK-DUC-01, builds upon the DUC-2001 single-document KE dataset introduced by Wan and Xiao (2008), for the news domain.
The DUC-2001 MDS dataset (Over, 2001) consists of 30 topics, each containing an average of 10.27 related news articles (308 in total). Experts summarized each individual article, as well as each of the document-sets, with different length summaries. Overall, there are three 100-token-long summaries per document, and three summaries per document-set, at lengths 50, 100, 200 and 400 tokens. Wan and Xiao (2008) further annotated each document with a list of keyphrases. On average, there are 8.08 keyphrases per document, with 2.09 words per keyphrase. This data is still widely used for the news-domain single-document KE task.
To restructure the single-document KE dataset for the multi-document setting, we carried out an automatic merging and reranking process, followed by a manual refinement procedure:
Automatic merging and reranking.
For each topic with its corresponding document-set D and 400-token topic reference summary set R, we first provided a score for each stemmed word w as score(w) = df(w, D∪R), where df(w, S) stands for w's document-frequency in the text collection S, i.e., the number of texts of S in which w appears.
We then unified D's lists of keyphrases (from the single-document KE dataset), removing exact phrase duplicates and phrases that do not appear anywhere in D (since we consider the task of keyphrase extraction, all keyphrases must be contained within the document set; 16 of 2488 keyphrases were removed in total), to form a single list of potential keyphrases, P. Each phrase p in P was then scored as score(p) = (1/|p|) · Σ_{w∈p} score(w), i.e., the average of p's stem scores. This generated a ranked list of keyphrases, K, ordered by salience score.
Lastly, we merged pairs of phrases in K where one was contained within the other (stemmed and disregarding word order), leaving only the longer variant or the one appearing earlier in K, e.g., merging ‘‘routine training’’/ ‘‘routine train flight’’.
Due to the variance of keyphrases’ informativeness across documents, we found that this heuristic effectively filtered out overly generic or repetitive keyphrases.
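The automatic stage can be sketched roughly as follows. This is a simplified sketch under stated assumptions, not our exact implementation: words are scored by document frequency over the topic's texts (the documents, plus reference summaries where available), stemming is omitted, and tie-breaking is left to sort stability.

```python
from collections import Counter

def doc_frequency(texts):
    """df over a collection of texts: how many texts contain each word."""
    df = Counter()
    for text in texts:
        df.update(set(text.lower().split()))
    return df

def rank_keyphrases(doc_keyphrase_lists, texts):
    df = doc_frequency(texts)
    # Unify the per-document keyphrase lists, dropping exact duplicates.
    phrases = list(dict.fromkeys(p for lst in doc_keyphrase_lists for p in lst))

    # Score each phrase as the average df score of its words.
    def score(p):
        words = p.lower().split()
        return sum(df[w] for w in words) / len(words)

    ranked = sorted(phrases, key=score, reverse=True)
    # Merge phrase pairs where one's word set is contained in the other's,
    # keeping the longer (or earlier-ranked) variant.
    merged = []
    for p in ranked:
        p_words = set(p.lower().split())
        kept = False
        for i, q in enumerate(merged):
            q_words = set(q.lower().split())
            if p_words <= q_words:   # p is contained in q: keep q
                kept = True
                break
            if q_words <= p_words:   # q is contained in p: keep the longer p
                merged[i] = p
                kept = True
                break
        if not kept:
            merged.append(p)
    return merged
```

Note that the containment merge operates on word sets, so it disregards word order, mirroring the merging criterion described above.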
Table 1: Basic statistics of the MK-DUC-01 dataset, for the full and Trunc-20 versions.

                               | Full         | Trunc-20
Avg (StD) # docs per topic     | 10.27 (2.24) | 10.27 (2.24)
Avg (StD) # KPs per topic      | 43.8 (15.6)  | 19.97 (0.18)
Avg (StD) KP word-length       | 2.13 (0.66)  | 2.17 (0.66)
# KPs with substitute cluster  | 142 of 1314  | 104 of 599
Avg (StD) # KPs in clusters    | 2.82 (1.26)  | 3.07 (1.37)
Manual refinement.
As we strove to generate a high-quality MDKE benchmark dataset, we further refined the keyphrase lists produced by the automatic stage above. One of the authors went over the 30 lists, with the relevant topic documents and reference summaries open for assistance, and carried out the following: (1) removed phrases that were particularly scarce or of low informativeness (e.g., ‘‘similar transmission’’ in the ‘‘Mad Cow Disease’’ topic); (2) removed phrases that were not synonymous with others, but were clearly implied from other phrases (e.g., ‘‘U.S. Senate’’ where other keyphrases mention the Senate); (3) clustered together phrases that can be used replaceably (e.g., ‘‘1990 census’’ and ‘‘1990 population count’’) to form keyphrase substitute clusters, with the more commonly used variant as the preferred alternative; (4) produced substitute clusters for persons’ titled proper nouns, when the title is optional (e.g., a cluster for ‘‘Bill Clinton’’ containing ‘‘President Clinton’’ and ‘‘Governor Bill Clinton’’), leaving the untitled version as the preferred alternative.
Note that due to the variability of similar keyphrases across documents, we were able to form substitute clusters of replaceable variations of keyphrases, which is a novel conception in KE datasets. This can assist in the evaluation process when a system outputs a keyphrase that is worded differently within the gold list of keyphrases (see examples in annotation points (3) and (4) above), which is a major shortcoming in standard KE evaluation. We marked a preferred variant in each such cluster to enable standard evaluation, that requires a flat list of gold keyphrases.
This whole dataset formation procedure yielded the final MK-DUC-01 dataset, with basic statistics appearing in Table 1. We suggest a version of the dataset where the keyphrase lists are truncated at 20 items, denoted here Trunc-20. This establishes a more representative task setting, since leading keyphrases in the gold lists are more salient in their corresponding topics, while those lower in the list are less anticipated as topic-level keyphrases.
Table 2: F1@k (F1) and unigram-F1@k (uF1) scores, for k = 5/10/15/20, on the Trunc-20 version of MK-DUC-01, under the Concat and Merge modes ("-" marks an inapplicable mode).

Algorithm                                 | Concat F1@5/10/15/20  | Concat uF1@5/10/15/20    | Merge F1@5/10/15/20    | Merge uF1@5/10/15/20
KPMiner (El-Beltagy and Rafea, 2009)      | 1.27/4.80/7.56/9.68   | 5.52/21.27/30.55/34.88   | 1.59/5.34/7.56/10.84   | 5.39/19.45/28.85/38.93
YAKE (Campos et al., 2020)                | 2.54/7.20/10.67/13.17 | 6.24/24.80/33.30/36.04   | 2.86/5.87/8.23/10.84   | 7.06/25.75/33.87/37.72
TextRank (Mihalcea and Tarau, 2004)       | 0.63/4.00/4.67/8.01   | 10.11/28.25/32.03/30.72  | 2.54/9.88/13.79/17.17  | 9.08/29.66/39.90/41.28
SingleRank (Wan and Xiao, 2008)           | 0.95/5.08/6.90/12.01  | 12.51/29.52/32.81/31.98  | 2.86/8.80/14.23/18.51  | 9.26/28.96/38.56/42.01
TopicRank (Bougouin et al., 2013)         | 1.59/6.40/9.33/11.33  | 5.47/18.69/28.75/36.72   | 4.13/10.94/16.01/18.68 | 6.75/25.22/39.24/44.88
TopicalPageRank (Sterckx et al., 2015)    | 1.27/5.61/7.79/13.35  | 12.16/29.81/33.67/32.44  | 2.86/9.08/15.56/19.84  | 9.06/28.82/39.11/42.41
PositionRank (Florescu and Caragea, 2017) | 1.90/8.28/11.80/16.35 | 9.31/27.09/32.20/33.22   | 3.17/8.53/15.56/19.51  | 8.44/27.50/39.20/44.11
MultipartiteRank (Boudin, 2018)           | 1.27/6.13/10.23/12.17 | 6.04/20.08/30.12/38.70   | 4.13/11.21/17.34/21.34 | 7.64/25.63/39.55/45.61
CollabRank (Wan and Xiao, 2008)           | -                     | -                        | 2.86/9.61/14.9/17.84   | 9.37/28.03/37.68/41.26
Bayatmakou et al. (2017) [multi-doc]      | -                     | -                        | 0.09/0.93/1.46/1.88    | 1.84/7.74/11.69/15.18
3.2 Dataset Experimentation
We demonstrate the use of MK-DUC-01 by testing ten existing single-document KE algorithms and a multi-document one. The single-document algorithms are applied in two modes: (1) Concat, where all topic documents are concatenated into a single text that is then fed to the algorithm to output a list of keyphrases per topic; (2) Merge, where for each topic, the algorithm is fed one document at a time, and the generated lists of keyphrases are merged using a similar strategy as in the automatic merging and reranking procedure in §3.1, except that R = ∅, i.e., it does not consider the reference summary set -- which is naturally unavailable in the KE task. CollabRank (Wan and Xiao, 2008) uses its collaborating documents, hence only Merge is applied. The algorithm by Bayatmakou et al. (2017) was designed as a multi-document algorithm, but follows a similar approach to our Merge mode, running RAKE (Rose et al., 2010) per document, and merging keyphrase lists with a method different from ours.
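Schematically, the two modes amount to the following. This is a hypothetical sketch: `ke_algorithm` stands in for any single-document KE method (returning a ranked keyphrase list for a text), and `merge_fn` for a list-merging procedure such as the one described in §3.1.

```python
def extract_concat(ke_algorithm, documents, k):
    """Concat mode: run the single-document KE algorithm once on the
    concatenation of all topic documents, and keep the top-k phrases."""
    return ke_algorithm(" ".join(documents))[:k]

def extract_merge(ke_algorithm, documents, merge_fn, k):
    """Merge mode: run the algorithm on each document separately, then
    merge and rerank the per-document keyphrase lists."""
    per_doc_lists = [ke_algorithm(doc) for doc in documents]
    return merge_fn(per_doc_lists, documents)[:k]
```

The two modes trade off differently: Concat needs one pass over a long input, while Merge makes one pass per document plus a merging step.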
For evaluation, we employ the standard stemmed F1@k and unigram-F1@k metrics for k ∈ {5, 10, 15, 20}, on the MK-DUC-01 data, both in the Trunc-20 version (Table 2) and in its full version (Table 4 in the appendix). We witness a clear benefit of the Merge strategy across nearly all algorithms, and a strong, significant improvement over the previously proposed MDKE algorithm (Bayatmakou et al., 2017).
4 Conclusion
We review the research conducted on multi-document KE, which is far understudied compared to its single-document counterpart. Only a few works have tackled the MDKE task head-on, all without the existence of a suitable dataset. Meanwhile, the notion of MDKE has been indirectly applied in multi-document summarization research, and can potentially assist in new trends within the summarization community, using recent advances in multi-text processing. We introduce the first MDKE dataset as a benchmark, and evaluate different KE algorithms on it, providing baseline results for the task.
Acknowledgments
This work was supported in part by the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1); by the Israel Science Foundation (grant 1951/17); and by a grant from the Israel Ministry of Science and Technology.
References
- Multi-Document Summarization Based on Keyword Fusion. In 2019 SoutheastCon, pp. 1–5.
- Automatic Query-based Keyword and Keyphrase Extraction. In 2017 Artificial Intelligence and Signal Processing Conference (AISP), pp. 325–330.
- Single-Document Keyphrase Extraction for Multi-Document Keyphrase Extraction. Computación y Sistemas 17 (2), pp. 179–186.
- Automatic Keyword Extraction for Text Summarization in Multi-document E-newspapers Articles. European Journal of Advances in Engineering and Technology 4 (6), pp. 410–427.
- Multi-Document Summarization using Automatic Key-Phrase Extraction. In Proceedings of the Student Research Workshop associated with RANLP 2013, Hissar, Bulgaria, pp. 22–29.
- PKE: an Open Source Python-based Keyphrase Extraction Toolkit. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, pp. 69–73.
- Unsupervised Keyphrase Extraction with Multipartite Graphs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 667–672.
- TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, Nagoya, Japan, pp. 543–551.
- YAKE! Keyword Extraction from Single Documents using Multiple Local Features. Information Sciences 509, pp. 257–289.
- KP-Miner: A Keyphrase Extraction System for English and Arabic Documents. Information Systems 34 (1), pp. 132–144.
- Automatic Multi-Document Arabic Text Summarization using Clustering and Keyphrase Extraction. Journal of Artificial Intelligence 8 (1), pp. 1.
- PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1105–1115.
- Inducing Document Structure for Aspect-based Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6263–6273.
- Modeling Content and Structure for Abstractive Review Summarization. Computer Speech and Language 53, pp. 302–331.
- Improving Browsing in Digital Libraries with Keyphrase Indexes. Decision Support Systems 27 (1-2), pp. 81–104.
- CorePhrase: Keyphrase Extraction for Document Clustering. pp. 265–274.
- Automatic Keyphrase Extraction: A Survey of the State of the Art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 1262–1273.
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization. Transactions of the Association for Computational Linguistics 9, pp. 211–225.
- Improving the Estimation of Word Importance for News Multi-Document Summarization. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, pp. 712–721.
- spaCy: Industrial-strength Natural Language Processing in Python.
- A Study on Automatically Extracted Keywords in Text Categorization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, pp. 537–544.
- Interactive Document Summarisation using Automatically Extracted Keyphrases. In Proceedings of the 35th Annual Hawaii International Conference on System Sciences, pp. 1160–1169.
- Hierarchical Document Clustering using Automatically Extracted Keyphrases. Computer Science Working Papers.
- Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009), Singapore, pp. 9–16.
- Large Dataset for Keyphrases Extraction.
- Using External Resources and Joint Learning for Bigram Weighting in ILP-Based Multi-Document Summarization. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 778–787.
- Unsupervised Summarization by Jointly Extracting Sentences and Keywords. arXiv preprint arXiv:2009.07481.
- Query-focused Multi-document Summarization using Keyword Extraction. In 2008 International Conference on Computer Science and Software Engineering, Vol. 1, pp. 20–23.
- Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, pp. 399–403.
- Automatic Keyphrase Extraction: a Survey and Trends. Journal of Intelligent Information Systems 54, pp. 391–424.
- TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, pp. 404–411.
- Extract with Order for Coherent Multi-Document Summarization. In Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing, Vancouver, Canada, pp. 51–56.
- Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems. In Proceedings of DUC 2001 Document Understanding Conference, Vol. 49.
- A Review of Keyphrase Extraction. WIREs Data Mining and Knowledge Discovery 10 (2), pp. e1339.
- Research on Keyword-Extraction from Multi-Document in User Model. Computer Simulation.
- Automatic Keyword Extraction from Individual Documents. In Text Mining, pp. 1–20.
- Towards an ACL Anthology Corpus with Logical Document Structure. An Overview of the ACL 2012 Contributed Task. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, Jeju Island, Korea, pp. 88–97.
- Keyword and Keyphrase Extraction Techniques: A Literature Review. International Journal of Computer Applications 109, pp. 18–23.
- Topical Word Importance for Fast Keyphrase Extraction. In Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, New York, NY, USA, pp. 121–122.
- A Review of Unsupervised Keyphrase Extraction Methods Using Within-Collection Resources. Symmetry 12 (11).
- Learning Algorithms for Keyphrase Extraction. Information Retrieval 2, pp. 303–336.
- The TREC-8 Question Answering Track. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'00), Athens, Greece.
- CollabRank: Towards a Collaborative Approach to Single-Document Keyphrase Extraction. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, pp. 969–976.
- Keyword Extraction in Multi-Document Based on Joint Weight. Journal of Chinese Information Processing, pp. 06.
Appendix A Further Experiment Details
Table 3: Average keyphrase token-length per algorithm in the Concat and Merge modes (only one row recoverable here): Bayatmakou et al. (2017): -, 3.00.
Table 4: F1@k (F1) and unigram-F1@k (uF1) scores, for k = 5/10/15/20, on the full (non-truncated) version of MK-DUC-01, under the Concat and Merge modes ("-" marks an inapplicable mode).

Algorithm                                 | Concat F1@5/10/15/20  | Concat uF1@5/10/15/20   | Merge F1@5/10/15/20    | Merge uF1@5/10/15/20
KPMiner (El-Beltagy and Rafea, 2009)      | 0.95/3.03/4.90/7.26   | 3.26/12.40/19.25/25.36  | 1.06/3.29/5.04/7.90    | 3.11/11.31/18.33/27.32
YAKE (Campos et al., 2020)                | 1.58/4.56/7.18/9.42   | 3.66/15.28/22.34/27.10  | 1.54/3.37/5.12/7.52    | 4.13/16.14/23.16/29.56
TextRank (Mihalcea and Tarau, 2004)       | 0.45/2.76/3.19/6.24   | 6.55/20.28/25.41/28.91  | 1.42/5.53/8.16/11.53   | 5.20/18.80/27.62/33.54
SingleRank (Wan and Xiao, 2008)           | 0.42/3.37/4.74/8.94   | 7.82/20.46/25.23/28.28  | 1.56/4.91/8.47/12.20   | 5.16/18.29/26.86/33.24
TopicRank (Bougouin et al., 2013)         | 1.08/4.13/5.91/8.76   | 3.52/11.84/18.75/28.15  | 2.32/6.45/9.99/13.39   | 3.97/16.14/26.07/34.23
TopicalPageRank (Sterckx et al., 2015)    | 0.73/3.71/5.44/9.94   | 7.72/20.61/25.79/28.48  | 1.62/5.10/9.11/13.27   | 5.18/18.09/26.87/33.50
PositionRank (Florescu and Caragea, 2017) | 1.28/5.51/8.14/12.48  | 6.24/18.10/23.75/28.61  | 1.76/4.61/9.36/13.07   | 4.93/17.00/26.55/33.83
MultipartiteRank (Boudin, 2018)           | 0.93/3.66/6.70/9.63   | 3.75/12.46/20.21/29.41  | 2.45/6.90/10.65/15.24  | 4.65/16.28/26.19/34.37
CollabRank (Wan and Xiao, 2008)           | -                     | -                       | 1.56/5.37/8.80/11.89   | 5.28/17.83/25.82/33.20
Bayatmakou et al. (2017) [multi-doc]      | -                     | -                       | 0.32/1.60/2.00/2.17    | 3.57/11.29/15.04/16.31
Table 6: Excerpt of the gold keyphrase list for one MK-DUC-01 topic (concerning sprinter Ben Johnson). Left-column numbers are list ranks (shown where available); phrases separated by " / " on a line form a substitute cluster, with the preferred variant first.

2  | illegal steroid use / illegal performance-enhancing drugs
3  | Olympics gold medal
   | illegal anabolic steroid
   | Canadian Ben Johnson / Sprinter Ben Johnson / Canadian Olympic sprinter
8  | anabolic steroid stanozolol / illegal steroid stanzolol
   | Canadian coach Charlie Francis / Canadian national sprint coach
   | American Carl Lewis / U.S. sprinter Carl Lewis
20 | disgraced Olympic sprinter
28 | Hamilton spectator indoor games
37 | first indoor loss
Additional baseline evaluations.
Table 4 presents the results on the full gold keyphrase lists (non-truncated). When compared to the results on the Trunc-20 truncated lists (Table 2), there is an expected degradation in all scores, since the keyphrases lower in the lists are less representative keyphrases of the respective document sets. This, and the longer absolute lengths of the lists, make it less likely for the KE algorithms to extract correct keyphrases, and hence yield considerably lower recall scores across the board (not shown here), while precision scores mostly remain the same or are slightly higher.
When computing F1@k, the system keyphrases and the gold keyphrases are compared using stemmed exact match.
When computing unigram-F1@k: (1) the top-k items in the system keyphrase list are retrieved; (2) those keyphrases are flattened out to a single list of stems; (3) the gold list is also flattened to a list of stems; (4) the former is evaluated against the latter with F1.
The average of the F1 scores over all instances is the final score presented.
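These steps can be sketched as follows. This is a minimal sketch using set-based overlap of stems; the exact treatment of repeated stems may differ from our evaluation scripts, and `stemmer` defaults to the identity function for brevity (a real setup would pass a stemmer such as Porter's).

```python
def unigram_f1_at_k(predicted, gold, k, stemmer=lambda w: w):
    # (1) take the top-k predicted keyphrases
    top_k = predicted[:k]
    # (2) flatten them to a single set of word stems
    pred_stems = {stemmer(w) for p in top_k for w in p.lower().split()}
    # (3) flatten the gold keyphrase list to a set of stems
    gold_stems = {stemmer(w) for g in gold for w in g.lower().split()}
    # (4) score the predicted stems against the gold stems with F1
    overlap = len(pred_stems & gold_stems)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_stems)
    recall = overlap / len(gold_stems)
    return 2 * precision * recall / (precision + recall)
```

For example, predicted phrases covering 3 of 4 gold stems with 5 predicted stems give precision 0.6, recall 0.75, and unigram-F1 of 2/3.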
Table 3 presents the average token length as produced overall by each algorithm, when using the Concat and Merge generation modes. The keyphrase sizes in Concat are representative of the corresponding algorithms' output sizes, while the sizes in Merge go through an additional process, hence slightly altering the natural output sizes of the algorithms.
Single-document KE results.
We ran the relevant algorithms from Tables 2 and 4 on the single-document DUC-2001 KE dataset (308 documents, with 8.08 keyphrases per document on average), to get a sense of their comparable quality in the single- and multi-document settings. Results are presented in Table 5. Seven documents could not be processed by the KPMiner algorithm due to processing errors.
Overall, we see that the algorithm rankings are quite similar in the two settings, across k values and in both metrics.
We used the PKE Python toolkit (Boudin, 2016) for all KE algorithms except for that of Bayatmakou et al. (2017), which we implemented ourselves. The Bayatmakou et al. (2017) algorithm uses RAKE (Rose et al., 2010) as its underlying single-document KE component, for which we used the rake-nltk library (https://pypi.org/project/rake-nltk). As RAKE outputted very long keyphrases, yielding low scores, we used only those up to 3 words long. For CollabRank, we considered all other documents in its original topic document-set as ‘‘collaborating’’ documents, and computed their similarity scores using spaCy (Honnibal et al., 2020) text similarity.
All algorithms and automatic methods used for annotation and experimentation were run on a standard laptop, and no special hardware was required.
Run times were up to about a second per keyphrase-extraction instance, except for CollabRank, which required about 15-20 seconds per document. Running the Merge mode on the document-sets required tens of seconds for some algorithms, as the process iterates over all documents separately. The Concat mode, which requires a single run per document-set, was substantially faster overall.
Appendix B Dataset Example
Table 6 presents an example list of keyphrases from our MK-DUC-01 dataset. The top 20 keyphrases are used in the Trunc-20 dataset version, while the full list is used in the full dataset version. Some keyphrases have multiple wording variations, acting as the substitute clusters. The first item in a cluster is used in the standard evaluation when a flat list of keyphrases is required.