With the increasing amount of freely available text-based data sets, methods for efficient keyphrase detection are becoming highly relevant. These methods, given one or more documents, output a ranked list of short phrases (or single tokens) which represent key aspects of the input text. In recent years, a plethora of keyphrase extraction methods has been presented; broadly, they can be divided into unsupervised and supervised ones. This paper focuses on unsupervised keyphrase extraction, i.e., the process where no training set of documents is needed to learn to estimate keyphrases – they are estimated solely based on statistical/topological properties of a given document. The unsupervised methods can be further divided into those which construct a graph based on token co-occurrences and those which leverage statistical properties of n-grams. Recently, neural language model-based keyphrase extraction was also proposed. With the abundance of methods, optimization of a single metric becomes less relevant – methods which maximize e.g., F1@k are common. This paper aims to inform the reader that a realm of highly relevant properties beyond simple retrieval performance can be meaningful in practice, and should be the focus of any novel method proposed (including the adaptation of an existing one presented in this paper). The contributions of this paper are multifold:
We present RaKUn 2.0, a graph-based keyphrase extractor optimized for the retrieval-efficiency trade-off, i.e., jointly for retrieval capability and computation time.
A polygon-based visualization suitable for studying and comparing multiple criteria for multiple keyphrase detection algorithms.
An extensive benchmark of RaKUn 2.0 against strong baselines (including e.g., the recently introduced KeyBERT).
Friedman-Nemenyi-based analysis of average ranks of the algorithms (and their similarity).
2 Selected related work
This section contains an overview of the existing keyphrase detection methods, their key underlying ideas and possible caveats of different paradigms. This paper focuses exclusively on unsupervised keyphrase extraction – the process of transforming an input document into a ranked collection of keyphrases, i.e., a list [(k_1, s_1), (k_2, s_2), …, (k_n, s_n)], where n represents the number of top hits (detected keyphrases), k_i a given keyphrase and s_i the given keyphrase’s score. The first branch of approaches is based on text-to-graph transformations, followed by subsequent processing of the obtained graphs. Such methods are able to exploit the multilevel structure of a document (MultiPartiteRank) or its hierarchical structure (SingleRank). An example token graph is shown in Figure 2.
One of the first graph-based methods was TextRank, which demonstrated the robustness of graph-based keyphrase detection (and was one of the first to do so). More involved approaches, capable of incorporating topic-level information, were also proposed (TopicalPageRank). One of the key issues with graph-based representations is that of node denoising – the process of identifying the relevant space of nodes which are commonly subject to ranking. The graph-based methods are highly dependent on the graph construction approach (based on co-occurrence, syntactic, semantic and similarity information) and the node ranking algorithm (e.g., degree, closeness, PageRank, selectivity, etc.). A detailed overview of graph-based methods for keyword extraction and various node-ranking measures is provided in the related literature.
Alongside graph-based methods, statistical methods are also actively developed. One of the most recent examples is YAKE!, an approach which considers large amounts of n-gram patterns and scores them so that they represent relevant keyphrases. It operates by extracting statistical features from single documents to select the most important keywords of a text. Keyphrase detection was also considered as a task solvable by neural language models. An example of this family of models is AttentionRank, which exploits a transformer-based neural language model to extract relevant keywords. A more detailed overview of general keyword detection methods is given in the related literature.
The discussed approaches seldom focus on metrics beyond retrieval capabilities (e.g., Precision, Recall and F1). One of the purposes of this paper is a comprehensive evaluation of the discussed algorithms with regards to multiple criteria, including computation time and duplication rates (how frequently a token repeats amongst the space of detected keyphrases).
3 Proposed algorithm
The proposed approach sources its core idea from the recent paper on meta vertex-based keyphrase detection, RaKUn. The extension proposed in this paper is optimized specifically to push the boundary of the retrieval-efficiency front, i.e., the trade-off between retrieval performance and retrieval time. We begin with a general overview of the algorithm, followed by a theoretical analysis of its complexity (space and time). We refer to the proposed approach as RaKUn 2.0. A high-level overview is shown as Algorithm 1.
The main steps include tokenization, token merging, document graph construction and node ranking. Instead of first constructing (larger) graphs which are subject to node merging into meta vertices, RaKUn 2.0 conducts the merging step at the sequence level, making it more efficient. This step was considered based on the observation that pre-merging tokens in close proximity already offers sufficient results – by considering only tokens close to one another, no specialized (possibly expensive) metric for string comparison was needed, which substantially sped up the detection process. The second idea which substantially sped up the process is related to bi-gram hashing. It refers to constructing a mapping between each bi-gram and its count in the document, enabling fast lookup of this information as follows: for each subsequent token pair (t_i, t_{i+1}), term counts are retrieved (they are pre-computed during tokenization). We next compute a merge threshold score as:
MScore(t_i, t_{i+1}) = (count(t_i) + count(t_{i+1})) / (2 · count(t_i t_{i+1})),

where t_i and t_{i+1} are two subsequent tokens, and t_i t_{i+1} is the bi-gram comprised of the two tokens. If MScore is lower than a user-specified threshold (a hyperparameter), the merged token is added as a new token to the token space, and the term counts of the two individual tokens are diminished by MScore, i.e., multiplied with the computed score. Values of MScore lower than one imply more emphasis on multi-term keyphrases (individual terms are not as emphasized), and values larger than one imply more individual token keyphrases. Hence, MScore serves as an intermediary step which emphasizes specific tokens during the ranking step.
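The bi-gram hashing and sequence-level merge step can be sketched in a few lines of Python. The exact MScore formula and merge bookkeeping below are assumptions consistent with the description above (frequent bi-grams score low and trigger a merge), not the library's verbatim implementation:

```python
from collections import Counter

def merge_tokens(tokens, threshold=1.0):
    """Sequence-level token merging, a sketch of the pre-merge idea.

    Token and bi-gram counts are pre-computed once (the "bi-gram hashing"
    step), so each merge decision is an O(1) lookup.
    """
    counts = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    weights = {tok: float(c) for tok, c in counts.items()}
    merged = []
    i = 0
    while i < len(tokens) - 1:
        pair = (tokens[i], tokens[i + 1])
        # Assumed MScore: unigram mass relative to bi-gram frequency;
        # frequent bi-grams yield low scores and trigger a merge.
        mscore = (counts[pair[0]] + counts[pair[1]]) / (2 * bigrams[pair])
        if mscore < threshold:
            merged.append(pair[0] + " " + pair[1])
            # Diminish the individual token weights by multiplying with MScore.
            weights[pair[0]] *= mscore
            weights[pair[1]] *= mscore
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    if i == len(tokens) - 1:
        merged.append(tokens[-1])
    return merged, weights
```

For instance, in a sequence where the bi-gram ‘new york’ appears as often as each of its parts, MScore is 1, so a threshold slightly above 1 merges the pair into a single multi-term candidate.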
The token graph is constructed from the modified list of tokens by considering subsequent, lower-cased tokens as edges. The edge weights are incremented every time a given bi-gram repeats – the transitions between tokens which commonly co-occur are emphasized. The next step is node ranking. Here, a real-valued score is assigned to each (pre-merged) token. We consider the personalized PageRank algorithm, where the personalization vector is constructed based on term counts. This step results in real-valued scores (between 0 and 1) for each token. The final set of scores is obtained by computing an element-wise product between the PageRank scores and token lengths. This step emphasizes longer keyphrases. Finally, we traverse the space of scored tokens and remove case-level duplicates (e.g., ‘City’ and ‘city’).
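The graph construction and ranking steps can be illustrated with a minimal, pure-Python personalized PageRank (power iteration). The damping factor, iteration count and length multiplier below are illustrative choices, not the library's actual configuration:

```python
from collections import Counter, defaultdict

def rank_tokens(tokens, damping=0.85, iters=50):
    """Sketch of graph construction and node ranking.

    Subsequent lower-cased tokens form weighted edges; personalized
    PageRank (personalized by term counts) scores each node, and scores
    are multiplied by token length to favor longer keyphrases.
    """
    tokens = [t.lower() for t in tokens]
    counts = Counter(tokens)
    # Edge weights: incremented every time a bi-gram repeats.
    edges = Counter(zip(tokens, tokens[1:]))
    out_weight = defaultdict(float)
    for (u, v), w in edges.items():
        out_weight[u] += w
    nodes = list(counts)
    total = sum(counts.values())
    # Personalization vector from term counts (restart distribution).
    personal = {n: counts[n] / total for n in nodes}
    rank = dict(personal)
    for _ in range(iters):
        nxt = {n: (1 - damping) * personal[n] for n in nodes}
        for (u, v), w in edges.items():
            nxt[v] += damping * rank[u] * w / out_weight[u]
        # Mass of dangling nodes is redistributed via the personalization.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        for n in nodes:
            nxt[n] += damping * dangling * personal[n]
        rank = nxt
    # Emphasize longer tokens via an element-wise product with length.
    scored = {n: rank[n] * len(n) for n in nodes}
    return sorted(scored.items(), key=lambda kv: -kv[1])
```

Lower-casing before graph construction already collapses case-level duplicates such as ‘City’ and ‘city’ into a single node.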
The described algorithm for keyphrase detection was conceived with simplicity in mind. This property also resonates with its computational complexity. Let n represent the number of tokens after the merge step (the cardinality difference is negligible with regards to the runtime). Both graph construction and merging need one pass across the token sequence (O(n)). The computationally most expensive part is the computation of personalized PageRank. In theory, PageRank’s complexity is O(n + m), where m is the number of links in the constructed token graph. In practice, the obtained graphs are very sparse – only selected bi-grams co-occur. The opposite case, where dense, clique-like graphs would be produced, would imply the appearance of tokens in highly diverse contexts, which is highly unlikely. The final step requires sorting of the tokens based on their scores (O(n log n)). This yields a final complexity of O(n + m + n log n). Assuming very sparse graphs (as observed during the experiments), the complexity remains near-linear with regards to the number of tokens in the token set after the merge step.
We next discuss the evaluation procedures used to estimate the performance of individual algorithms, followed by a discussion regarding their comparison. We evaluate each algorithm with regards to three main aspects: retrieval performance, keyword duplication rate and computation time. The retrieval performance was measured as done in previous work. Let G be the set of gold-standard keyphrases and R_k the set of top-k retrieved keyphrases. Precision@k is defined as |G ∩ R_k| / k. Recall@k is defined as |G ∩ R_k| / |G|. Precision thus represents the number of relevant keyphrases retrieved with regards to the top k hits, while recall relates this number to the whole set of gold-standard keyphrases.
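Under the standard definitions assumed above, the two retrieval scores can be sketched as:

```python
def precision_recall_at_k(retrieved, gold, k):
    """Precision@k and Recall@k over a ranked list of keyphrases.

    `retrieved` is the ranked output of an extractor, `gold` the set of
    gold-standard keyphrases for the document.
    """
    top_k = set(retrieved[:k])
    hits = len(top_k & set(gold))
    return hits / k, hits / len(gold)
```

F1@k then follows as the harmonic mean of the two values.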
The second score is the duplication rate. We compute this score as follows: for each detected keyphrase, we first split it into separate tokens (if a multi-token keyphrase is considered). For each part, we traverse the space of detected tokens. If there is a match, we increment a duplicate counter; otherwise, we increment the non_duplicate counter. The final score is computed as duplicate / (duplicate + non_duplicate), and was observed to be in the interval [0, 1]. The computation time was measured in seconds (for each document).
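The duplication-rate computation can be sketched as follows; the case-insensitive matching and the comparison of each part against the parts of the *other* keyphrases are assumed details of the procedure described above:

```python
def duplication_rate(keyphrases):
    """Fraction of keyphrase parts that repeat across the detected set."""
    duplicate = non_duplicate = 0
    parts = [kp.lower().split() for kp in keyphrases]
    for i, kp_parts in enumerate(parts):
        # Tokens appearing in any *other* detected keyphrase.
        others = {tok for j, p in enumerate(parts) if j != i for tok in p}
        for tok in kp_parts:
            if tok in others:
                duplicate += 1
            else:
                non_duplicate += 1
    return duplicate / (duplicate + non_duplicate)
```

For example, the set {‘machine learning’, ‘deep learning’, ‘graphs’} contains the repeated part ‘learning’, yielding a duplication rate of 2/5 under these assumptions.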
Table 1: Summary of the considered data sets (Dataset, #Docs, #KW, Mean KW tokens, Mean doc len).
For the visualization of retrieval-efficiency tradeoffs with regards to the mentioned scores, it makes sense to have a uniform meaning of large and small values. Hence, we introduce the following adapted scores which reflect this idea. The retrieval capability already corresponds to, e.g., the F1 score, meaning that higher values are preferred. We additionally normalize F1 scores to the range between 0 and 1 based on the worst- and best-performing algorithms (on average). This way, an algorithm scored with 0 is the worst-performing one, while the top-performing one is scored with 1 (see Figure 8). Similar adaptations were considered for time performance (normalized inverse times) and duplication rates (normalized inverse duplication rates). One of the main results of this paper is a visualization which jointly considers all three aspects. The considered collection of data sets is summarized in Table 1.
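The score adaptation amounts to min-max normalization across algorithms, with an inversion for criteria where smaller raw values are better (times, duplication rates); a minimal sketch:

```python
def normalize(values, invert=False):
    """Min-max normalize a list of per-algorithm scores to [0, 1].

    With invert=True, sketches the 'normalized inverse' variant assumed
    for times and duplication rates, so that larger is always better.
    """
    lo, hi = min(values), max(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1 - s for s in scaled] if invert else scaled
```

Applied per criterion, this maps the worst algorithm to 0 and the best to 1 on every axis of the polygon visualization.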
The considered baselines are discussed next. The graph-based baselines include MultiPartiteRank, SingleRank, TextRank and TopicalPageRank. The statistical baseline considered was YAKE!. The language model-based baseline is the recent KeyBERT. For all approaches, we considered the default hyperparameter configurations, as we were interested in out-of-the-box performance. We computed, however, two variants of KeyBERT, one which emits single tokens (KeyBERT-(1,1)) and one which permits two-term tokens (KeyBERT-(1,2)). The default configuration of the KeyBERT variants performed worse than term frequency-based extraction (we considered unigrams; inverse document frequencies were not computed as they require the whole corpus, making them not directly comparable to purely unsupervised methods), and the (1,1) variant offered adequate performance only when we set the ‘maxsum’ and ‘mmr’ flags to ‘true’. The stopwords used were the same for all approaches (NLTK’s default English stopwords). The other algorithms’ implementations were based on the PKE library.
A summary of algorithm run times (relative to one another) is shown in Figure 3.
As expected, the simplest baseline (term frequency) is up to three orders of magnitude faster than, e.g., the BERT-based model. The proposed RaKUn 2.0 performs substantially better while remaining up to two orders of magnitude faster than the slowest approaches. It is closely followed by SingleRank and TopicalPageRank. The duplication levels are shown in Figure 4.
The duplication ablation indicates that the highest duplication levels were observed for YAKE!, TopicalPageRank and TextRank. MultiPartiteRank and SingleRank had notably lower duplication levels, as did KeyBERT-(1,1) and the term frequency (unigram) baseline. The proposed RaKUn 2.0 is at the lower end of the approaches with regards to this score, albeit not optimal.
We continue the discussion by presenting the retrieval performance. A systematic investigation of algorithm performance is shown in Figure 5.
The results indicate that, on average, MultiPartiteRank is the leading algorithm in the low-k scenarios. RaKUn 2.0, however, performs very similarly for up to ten keyphrases, which is one of the most common use cases of such algorithms. A more detailed overview of the scores at the per-data set level is given in Tables 1-5. The color codes represent the top three performers for each data set (gold=first, silver=second, bronze=third).
We additionally conducted a rank-based difference significance evaluation, where the average algorithm ranks are compared across all data sets. If algorithms are linked with a red line, they perform very similarly (the difference in average ranks is not statistically significant). The diagrams are shown as Figures 6 and 7.
The tests indicate that the difference between the top-performing approaches (MultiPartiteRank, YAKE! and RaKUn 2.0) is insignificant. Similar observations can be made based on the tabular summaries. Overall, however, we can observe a marginal dominance of RaKUn 2.0 w.r.t. precision. The similar retrieval performance reinforces the purpose of this paper, which transcends retrieval-only evaluation and also incorporates other properties of either the algorithms or the retrieved space.
In Figure 8, the selected approaches are compared across the three main evaluation criteria – retrieval performance, duplication performance (inverse of duplication rate) and time performance (inverse of normalized times across all algorithms). Larger values are better for each criterion. It can be observed that MultiPartiteRank outperforms the others at the front considering duplication and retrieval performance; RaKUn 2.0, however, outperforms the others when considering retrieval capabilities and computation time.
5.1 Scaling to 14M documents
A direct way of testing the complexity bounds stated in the methods section was to attempt to run RaKUn 2.0 directly on a collection of approximately 14 million biomedical articles – the MeDAL corpus (https://www.reddit.com/r/MachineLearning/comments/jx63fd/r_a_14m_articles_dataset_for_medical_nlp/). The corpus was parsed into a list of documents and fed into the default configuration of RaKUn 2.0. The computation took approximately forty seconds (including text reading) on a virtual machine with 12 cores and 32GB of RAM. The list of the top ten keyphrases is shown as Table 6.
The top keyphrases correspond to rather general biological terms, which are some of the main topics related to the considered documents. The results were obtained by maintaining the merge_threshold hyperparameter set to one – single term keyphrases can be obtained if this threshold is lowered. For example, if set to 0.5, the top three keyphrases are ‘activity’, ‘concentration’ and ‘enzyme’.
6 Discussion and conclusions
In this paper we presented an approach to unsupervised keyphrase detection, aimed specifically at pushing the limits of computation time and retrieval performance. The main contribution of this paper is an algorithm for keyphrase detection that performs substantially faster than current state-of-the-art methods, while maintaining their retrieval performance. The algorithmic novelties introduced touch upon the transformation of token sequences into graphs, and re-address the question of meta vertices by constructing them at the sequence level, which is substantially faster. Further, by exploiting personalized PageRank, global token information is incorporated into keyphrase ranking alongside token lengths. By conducting an extensive benchmark against established baselines, this paper presents an evaluation which incorporates not only retrieval capabilities, but also computation time and duplication rates amongst the retrieved keyphrases.
Analysis of keyphrase detection algorithms with regards to multiple evaluation criteria is becoming increasingly relevant, as many low-latency applications cannot afford an expensive detection phase. To our knowledge, this paper is also one of the first to evaluate the performance based on critical difference diagrams, exactly assessing the significance of the observed differences (in time and retrieval performance).
Further work includes exploration of lower-level implementations of top-performing approaches, alongside their parts that could be subject to parallelism. A potentially interesting endeavor would also include background knowledge (as graphs), possibly enabling detection of keywords beyond the ones found in a given document, while remaining unsupervised.
The RaKUn 2.0 algorithm is available as a simple-to-use Python library at https://github.com/SkBlaz/rakun2.
The work was supported by the Slovenian Research Agency (ARRS) core research programme Knowledge Technologies (P2-0103), and the projects Computer-assisted multilingual news discourse analysis with contextual embeddings (J6-2581) and Quantitative and qualitative analysis of the unregulated corporate financial reporting (J5-2554). The work was also supported by the Ministry of Culture of the Republic of Slovenia through the project Development of Slovene in Digital Environment (RSDO).
- (2000) The NLM indexing initiative. In Proceedings of the AMIA Symposium, pp. 17. Cited by: Table 1.
- (2017) SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 546–555. Cited by: Table 1.
- (2015) An overview of graph-based keyword extraction methods and approaches. Journal of Information and Organizational Sciences 39, pp. 1–20. Cited by: §2.
- (2009) Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc. Cited by: §4.
- PKE: an open source Python-based keyphrase extraction toolkit. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, pp. 69–73. Cited by: §4.
- (2018) Unsupervised keyphrase extraction with multipartite graphs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 667–672. Cited by: §2, §4.
- (2013) TopicRank: graph-based topic ranking for keyphrase extraction. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, Nagoya, Japan, pp. 543–551. Cited by: §2, §4.
- (2020) YAKE! Keyword extraction from single documents using multiple local features. Information Sciences 509, pp. 257–289. Cited by: §2, §4.
- Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (1), pp. 1–30. Cited by: §5.
- (2021) AttentionRank: unsupervised keyphrase extraction using self and cross attentions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 1919–1928. Cited by: §2.
- Extracting keyphrases from research papers using citation networks. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27–31, 2014, Québec City, Québec, Canada, C. E. Brodley and P. Stone (Eds.), pp. 1629–1635. Cited by: Table 1.
- (2020) KeyBERT: minimal keyword extraction with BERT. Zenodo. Cited by: §1, §2, §4.
- (2014) Automatic keyphrase extraction: a survey of the state of the art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 1262–1273. Cited by: §1.
- (2003) Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216–223. Cited by: Table 1.
- (2010) SemEval-2010 task 5: automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, pp. 21–26. Cited by: Table 1.
- (2009) Large dataset for keyphrases extraction. Cited by: Table 1.
- (2022) A comprehensive review of recent automatic speech summarization and keyword identification techniques. Artificial Intelligence in Industrial Applications, pp. 111–126. Cited by: §2.
- (2013) Keyphrase cloud generation of broadcast news. Cited by: Table 1.
- (2009) Human-competitive tagging using automatic keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 1318–1327. Cited by: Table 1.
- (2008) Topic indexing with Wikipedia. In Proceedings of the AAAI WikiAI workshop, Vol. 1, pp. 19–24. Cited by: Table 1.
- (2010) Domain-independent automatic keyphrase indexing with small training sets. ArXiv preprint abs/10.1002. Cited by: Table 1.
- (2009) Human-competitive automatic topic indexing. Ph.D. Thesis, The University of Waikato. Cited by: Table 1.
- (2004) TextRank: bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, pp. 404–411. Cited by: §2, §4.
- (2007) Keyphrase extraction in scientific publications. In Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, D. H. Goh, T. H. Cao, I. T. Sølvberg, and E. Rasmussen (Eds.), Berlin, Heidelberg, pp. 317–326. Cited by: Table 1.
- (1999) The PageRank citation ranking: bringing order to the web. Technical Report 1999-66, Stanford InfoLab. Note: Previous number = SIDL-WP-1999-0120. Cited by: §3.
- (2020) A review of keyphrase extraction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10 (2), pp. e1339. Cited by: §1.
- (2008) Keyphrase extraction from single documents in the open domain exploiting linguistic and statistical methods. M. App. Sc. Thesis. Cited by: Table 1.
- RaKUn: rank-based keyword extraction via unsupervised learning and meta vertex aggregation. In Statistical Language and Speech Processing, C. Martín-Vide, M. Purver, and S. Pollak (Eds.), Cham, pp. 311–323. Cited by: §3.
- (2008) CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, pp. 969–976. Cited by: §2, §4.
- (2020) MeDAL: medical abbreviation disambiguation dataset for natural language understanding pretraining. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, Online, pp. 130–135. Cited by: §5.1.