
Retrieval-efficiency trade-off of Unsupervised Keyword Extraction

Efficiently identifying keyphrases that represent a given document is a challenging task. In recent years, a plethora of keyword detection approaches have been proposed. These approaches can be based on statistical (frequency-based) properties of, e.g., tokens, on specialized neural language models, or on a graph-based structure derived from a given document. Graph-based methods can be among the most computationally efficient ones, while maintaining retrieval performance. A property common to graph-based methods is their immediate conversion of the token space into graphs, followed by subsequent processing. In this paper, we explore a novel unsupervised approach which merges parts of a document in sequential form, prior to construction of the token graph. Further, by leveraging personalized PageRank, which considers frequencies of such sub-phrases alongside token lengths during node ranking, we demonstrate state-of-the-art retrieval capabilities while being up to two orders of magnitude faster than current state-of-the-art unsupervised detectors such as YAKE and MultiPartiteRank. The proposed method's scalability was also demonstrated by computing keyphrases for a biomedical corpus comprising 14 million documents in less than a minute.



1 Introduction

With the increasing amounts of freely available text-based data sets, methods for efficient keyphrase detection are becoming highly relevant [13]. These methods, given a single or multiple documents, output a ranked list of short phrases (or single tokens), which represents key aspects of the input text. In recent years, a plethora of keyphrase extraction methods were presented; broadly, they can be divided into unsupervised and supervised ones. This paper focuses on unsupervised keyphrase extraction, i.e., the process where no training set of documents is needed to learn to estimate keyphrases; they are estimated solely based on statistical/topological properties of a given document. The unsupervised methods can be further divided into the ones which construct a graph based on token co-occurrences and the ones which leverage statistical properties of n-grams [26]. Recently, neural language model-based keyphrase extraction was also proposed [12]. With the abundance of methods, optimization of a single metric becomes less relevant, as methods which maximize, e.g., F1@k are common. This paper aims to inform the reader that a realm of highly relevant properties beyond simple retrieval performance can be meaningful in practice, and should be the focus of any novel method proposed (including the adaptation of an existing one presented in this paper). The contributions of this paper are multifold:

  1. We present RaKUn 2.0, a graph-based keyphrase extractor optimized for the trade-off between retrieval capability and computation time.

  2. A polygon-based visualization suitable for studying and comparing multiple criteria for multiple keyphrase detection algorithms.

  3. An extensive benchmark of RaKUn 2.0 against strong baselines (including, e.g., the recently introduced KeyBERT).

  4. Friedman-Nemenyi-based analysis of average ranks of the algorithms (and their similarity).

Figure 1: Performance trade-off (time vs. performance) of keyphrase detection methods averaged across fifteen data sets.

2 Selected related work

This section contains an overview of the existing keyphrase detection methods, key underlying ideas and possible caveats of different paradigms. This paper focuses exclusively on unsupervised keyphrase extraction, the process of transforming an input document $D$ into a ranked collection of keyphrases, i.e., $D \mapsto K = \{(k_i, s_i)\}$, where $K$ represents the top hits (detected keyphrases), $k_i$ a given keyphrase and $s_i$ a given keyphrase's score. The first branch of approaches is based on text-to-graph transformations, followed by subsequent processing of the obtained graphs. Such methods are able to exploit the multilevel structure of a document [6] (MultiPartiteRank) or its hierarchical structure [29] (SingleRank). An example token graph is shown in Figure 2.

Figure 2: An example token graph.

One of the first graph-based methods was TextRank [23], which demonstrated the robustness of graph-based keyphrase detection. More involved approaches, capable of incorporating topic-level information, were also proposed [7] (TopicalPageRank). One of the key issues with graph-based representations is that of node denoising, the process of identifying the relevant space of nodes which are subject to ranking. The graph-based methods are highly dependent on the graph construction approach (based on co-occurrence, syntactic, semantic and similarity information) and the node ranking algorithm (e.g., degree, closeness, PageRank, selectivity). A detailed overview of graph-based methods for keyword extraction and various node-ranking measures is provided in [3].

Alongside graph-based methods, statistical methods are also actively developed. One of the most recent examples is YAKE! [8], an approach which considers large amounts of n-gram patterns and scores them so that they represent relevant keyphrases. It operates by extracting statistical features from single documents to select the most important keywords of a text. Keyphrase detection was also considered as a task solvable by neural language models [12]. An example of this family of models is AttentionRank [10], which exploits a transformer-based neural language model to extract relevant keywords. A more detailed overview of general keyword detection methods is given in [17].

The discussed approaches seldom focus on metrics beyond retrieval capabilities (e.g., Precision, Recall and F1). One of the purposes of this paper is a comprehensive evaluation of the discussed algorithms with regard to multiple criteria, including computation time and duplication rates (how frequently a token recurs amongst the detected keyphrases).

3 Proposed algorithm

The proposed approach builds on the core idea of the recent paper on meta vertex-based keyphrase detection, RaKUn [28]. The extension proposed in this paper is optimized specifically to push the retrieval-efficiency front, i.e., the trade-off between retrieval performance and retrieval time. We begin with a general overview of the algorithm, followed by a theoretical analysis of its complexity (space and time). We refer to the proposed approach as RaKUn 2.0. A high-level overview is shown as Algorithm 1.

Data: Input document D, merge factor M
tokens ← tokenizeDocument(D)    ▷ Tokenization.
tokens ← mergeTokens(tokens, M)    ▷ Merging.
G ← documentGraph(tokens)    ▷ Weighted graph.
F ← tokenFrequencies(tokens)
tokenRanks ← personalizedPR(G, F)    ▷ Ranking.
K ← sort(tokens, tokenRanks)    ▷ Sorting.
return K;
Algorithm 1 RaKUn 2.0

The main steps include tokenization, token merging, document graph construction and node ranking. Instead of first constructing (larger) graphs which are subject to node merging into meta vertices, RaKUn 2.0 conducts the merging step at the sequence level, making it more efficient. This step was motivated by the observation that pre-merging tokens in close proximity already offers sufficient results: by considering only tokens close to one another, no specialized (and possibly expensive) string comparison metric was needed, which substantially sped up the detection process. The second idea which substantially sped up the process is bi-gram hashing. It refers to constructing a mapping between each bi-gram and its count in the document, enabling fast lookup of this information as follows. For each subsequent token pair $(t_i, t_{i+1})$, term counts are retrieved (they are pre-computed during tokenization). We next compute a merge threshold score, MScore, from the term counts of the two subsequent tokens $t_i$ and $t_{i+1}$ and the count of the bi-gram comprised of the two tokens. If MScore is lower than a user-specified threshold (hyperparameter), the merged token is added as a new token to the token space, and term counts of the two individual tokens are diminished by MScore, i.e., multiplied with the computed score. Values of MScore lower than one imply more emphasis on multi-term keyphrases (individual terms are not as emphasized), and values larger than one imply more individual token keyphrases. Hence, the MScore serves as an intermediary step which emphasizes specific tokens during the ranking step.
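The sequence-level merging step can be sketched in a few lines of Python. This is a minimal illustration rather than the library's implementation. The function names are hypothetical, and `count_difference_score` is a stand-in scoring function chosen only so that the example runs; it is not the MScore formula of the paper. The sketch also assumes that each distinct bi-gram is scored once from the original term counts, and that the merged token receives the bi-gram count as its term count.

```python
from collections import Counter


def count_difference_score(count_a, count_b, bigram_count):
    # Hypothetical stand-in score, NOT the paper's MScore formula.
    return abs(count_a - count_b) / bigram_count


def merge_adjacent_tokens(tokens, merge_threshold, score_fn=count_difference_score):
    """Return merged bi-gram tokens and the adjusted term counts."""
    term_counts = Counter(tokens)                     # pre-computed term counts
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # bi-gram hashing
    adjusted = {token: float(count) for token, count in term_counts.items()}
    merged = []
    # Assumption: every distinct bi-gram is scored once, from the original counts.
    for (left, right), bigram_count in bigram_counts.items():
        score = score_fn(term_counts[left], term_counts[right], bigram_count)
        if score < merge_threshold:
            phrase = f"{left} {right}"
            merged.append(phrase)
            adjusted[phrase] = float(bigram_count)    # assumption: count of merged token
            adjusted[left] *= score                   # diminish the individual tokens
            adjusted[right] *= score
    return merged, adjusted


tokens = "growth hormone levels growth hormone release".split()
merged, counts = merge_adjacent_tokens(tokens, merge_threshold=0.5)
print(merged)
```

Because both dictionaries are built in a single pass and every lookup is a hash lookup, the whole step stays linear in the number of tokens, which is the point of the bi-gram hashing idea.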

The token graph is constructed from the modified list of tokens by considering subsequent, lower-cased tokens as edges. The edge weights are incremented every time a given bi-gram repeats, so that the transitions between tokens which commonly co-occur are emphasized. The next step is node ranking. Here, a real-valued score is assigned to each (pre-merged) token. We consider the personalized PageRank algorithm [25], where the personalization vector is constructed based on term counts. This step results in real-valued scores (between 0 and 1) for each token. The final set of scores is obtained by computing an element-wise product between the PageRank scores and token lengths. This step emphasizes longer keyphrases. Finally, we traverse the space of scored tokens and remove case-level duplicates (e.g., 'City' and 'city').
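The graph construction and ranking steps can be illustrated with a small, dependency-free sketch. It is not the reference implementation, and several details are assumptions made for the example: edges are directed, the damping factor is 0.85, a fixed number of power iterations is used, and "token length" is taken to be the character length. Lower-casing every token up front makes the case-level de-duplication implicit here.

```python
from collections import Counter, defaultdict


def build_token_graph(tokens):
    """Weighted directed graph; an edge weight counts how often a bi-gram repeats."""
    lowered = [token.lower() for token in tokens]
    graph = defaultdict(Counter)
    for source, target in zip(lowered, lowered[1:]):
        graph[source][target] += 1
    for token in lowered:
        graph.setdefault(token, Counter())  # make sure sink nodes exist as keys
    return graph


def personalized_pagerank(graph, personalization, damping=0.85, iterations=100):
    """Power iteration with a teleport vector proportional to `personalization`."""
    nodes = list(graph)
    total = sum(personalization.get(node, 0.0) for node in nodes)
    teleport = {node: personalization.get(node, 0.0) / total for node in nodes}
    ranks = dict(teleport)
    for _ in range(iterations):
        new_ranks = {node: (1.0 - damping) * teleport[node] for node in nodes}
        for source in nodes:
            out_weight = sum(graph[source].values())
            if out_weight == 0:
                # Dangling node: hand its mass back through the teleport vector.
                for node in nodes:
                    new_ranks[node] += damping * ranks[source] * teleport[node]
            else:
                for target, weight in graph[source].items():
                    new_ranks[target] += damping * ranks[source] * weight / out_weight
        ranks = new_ranks
    return ranks


def rank_keyphrases(tokens, top_k=10):
    graph = build_token_graph(tokens)
    term_counts = Counter(token.lower() for token in tokens)
    ranks = personalized_pagerank(graph, term_counts)
    # Element-wise product with token length (assumed: number of characters).
    scored = {token: rank * len(token) for token, rank in ranks.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:top_k]


print(rank_keyphrases(["City", "growth", "city", "growth", "hormone"]))
```

The PageRank scores sum to one, so each score lies between 0 and 1 before the length weighting is applied.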

The described algorithm for keyphrase detection was conceived with simplicity in mind. This property is also reflected in its computational complexity. Let $N$ represent the number of tokens after the merge step (the cardinality difference is negligible with regard to the runtime). Both graph construction and merging need one pass across the token sequence ($\mathcal{O}(N)$). The computationally most expensive part is the computation of personalized PageRank. In theory, PageRank's complexity is $\mathcal{O}(N + E)$, where $E$ is the number of links in the constructed token graph. In practice, the obtained graphs are very sparse, as only selected bi-grams co-occur. The opposite case, where dense, clique-like graphs would be produced, would imply appearance of tokens in highly diverse contexts, which is highly unlikely. The final step requires sorting of tokens based on their scores. This yields the final complexity of $\mathcal{O}(N \log N + E)$. Assuming very sparse graphs (as observed during the experiments), the complexity remains close to linear with regard to the number of tokens in the token set after the merge step.

4 Evaluation

We next discuss the evaluation procedures used to estimate the performance of individual algorithms, followed by a discussion regarding their comparison. We evaluate each algorithm with regard to three main aspects: retrieval performance, keyword duplication rate and computation time. The retrieval performance was measured as done in the previous work [8]. Let $K_{\text{gold}}$ denote the set of gold-standard keyphrases of a document and $K_{\text{pred}}^{k}$ the top $k$ predicted ones. Precision@k is defined as $P@k = \frac{|K_{\text{gold}} \cap K_{\text{pred}}^{k}|}{|K_{\text{pred}}^{k}|}$. Recall@k is defined as $R@k = \frac{|K_{\text{gold}} \cap K_{\text{pred}}^{k}|}{|K_{\text{gold}}|}$. Precision represents the number of keyphrases retrieved with regard to the top $k$ predicted ones, while recall represents the overall retrieval capability. We also computed (macro) F1, which is the harmonic mean of precision and recall, averaged across documents.
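The retrieval scores can be computed as in the following sketch. It assumes exact string matching between predicted and gold keyphrases (no stemming or other normalization, which an actual evaluation may apply), and the function names are illustrative only.

```python
def precision_recall_f1_at_k(predicted, gold, k=10):
    """Precision@k, Recall@k and F1@k for one document (exact-match assumption)."""
    top_k = predicted[:k]
    gold_set = set(gold)
    hits = sum(1 for phrase in top_k if phrase in gold_set)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def macro_average(per_document_scores):
    """Average each of the three scores across documents."""
    count = len(per_document_scores)
    return tuple(sum(scores[i] for scores in per_document_scores) / count for i in range(3))


scores = [
    precision_recall_f1_at_k(["a", "b", "c", "d"], ["b", "d", "e"], k=2),
    precision_recall_f1_at_k(["x", "y"], ["x", "y"], k=2),
]
print(macro_average(scores))
```

Note that averaging F1 per document, as done here, generally gives a different number than taking the harmonic mean of the averaged precision and recall.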

The second score is the duplication rate. We compute this score as follows: each detected keyphrase is first split into separate tokens (if a multi-token keyphrase is considered). For each part, we traverse the space of detected tokens. If there is a match, we increment a duplicate counter; otherwise, we increment the non_duplicate counter. The final score is computed from the two counters, and was observed to be in the interval [0, 1]. The computation time was measured in seconds (for each document).
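One possible reading of the duplication rate is sketched below. Two choices in it are assumptions and are not taken from the text: a token counts as a duplicate when it also occurs in a different detected keyphrase, and the final score is the share of duplicates among all tokens examined.

```python
def duplication_rate(keyphrases):
    """Share of keyphrase tokens that also occur in another detected keyphrase."""
    token_lists = [phrase.lower().split() for phrase in keyphrases]
    duplicate = 0
    non_duplicate = 0
    for index, tokens in enumerate(token_lists):
        # Tokens of all the other detected keyphrases.
        others = {
            token
            for other_index, other in enumerate(token_lists)
            if other_index != index
            for token in other
        }
        for token in tokens:
            if token in others:
                duplicate += 1
            else:
                non_duplicate += 1
    total = duplicate + non_duplicate
    # Assumption: the score is the fraction duplicate / (duplicate + non_duplicate).
    return duplicate / total if total else 0.0


print(duplication_rate(["molecular weights", "molecular weight", "blood flow"]))
```

In this example only "molecular" is shared between keyphrases, so two of the six tokens count as duplicates.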

Dataset                  #Docs   #KW    Mean KW tokens   Mean doc len
wiki20 [20]              20      35.5   2.0              7728.0
fao30 [21]               30      32.2   1.6              4710.3
theses100 [22]           100     6.7    2.0              4813.9
citeulike180 [19]        183     17.4   1.3              4517.9
Nguyen2007 [24]          209     12.0   2.1              4425.6
SemEval2010 [15]         243     15.6   2.2              7093.3
SemEval2017 [2]          493     17.3   2.9              168.3
500N-KPCrowd-v1.1 [18]   500     49.2   1.4              393.9
PubMed [1]               500     14.2   1.9              3880.2
kdd [11]                 755     4.1    2.0              74.1
fao780 [21]              779     8.0    1.6              4685.0
Schutz2008 [27]          1231    45.3   1.5              2362.6
www [11]                 1330    4.8    1.9              82.0
Inspec [14]              2000    14.1   2.2              112.5
Krapivin2009 [16]        2304    5.3    2.1              7094.1
Table 1: Summary of the considered data sets.

For visualization of retrieval-efficiency tradeoffs with regards to the mentioned scores it makes sense to have uniform meaning of large and small values. Hence, we introduce the following adapted scores which reflect this idea. The retrieval capability already corresponds to e.g., F1 score, meaning that higher values are preferred. We additionally normalize F1 scores to range between 0 and 1 based on the worst-best performing algorithms (on average). This way, an algorithm scored with 0 is the worst-performing one, while the top performing is scored with 1 (see Figure 8). Similar adaptations were considered for time performance (normalized inverse times) and duplication rates (normalized inverse duplication rates). One of the main results of this paper is a visualization which jointly considers all three aspects. The considered collection of data sets is summarized in Table 1.
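The adapted scores can be illustrated as follows. The sketch assumes plain min-max normalization over the per-algorithm averages, and it reads "normalized inverse" as one minus the normalized value; the exact form of the inversion is not spelled out in the text, so this is only one plausible choice. The input numbers are illustrative.

```python
def min_max(values):
    """Scale per-algorithm averages to [0, 1]; the worst maps to 0, the best to 1."""
    low = min(values.values())
    high = max(values.values())
    span = high - low
    return {name: (value - low) / span if span else 0.0 for name, value in values.items()}


def inverse_min_max(values):
    """For costs such as time or duplication: the smallest value maps to 1."""
    # Assumption: "normalized inverse" is taken to mean 1 - min-max(value).
    return {name: 1.0 - value for name, value in min_max(values).items()}


mean_f1 = {"A": 1.0, "B": 2.0, "C": 3.0}            # illustrative values only
mean_time_seconds = {"A": 4.0, "B": 1.0, "C": 2.5}  # illustrative values only
print(min_max(mean_f1))
print(inverse_min_max(mean_time_seconds))
```

After both transformations, larger values are better on every axis, which is what makes a joint polygon-style visualization readable.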

The considered baselines are discussed next. The graph-based baselines include MultiPartiteRank [6], SingleRank [29], TextRank [23] and TopicalPageRank [7]. The statistical baseline considered was YAKE [8]. The language model-based baseline is the recent KeyBERT [12]. For all approaches, we considered the default hyperparameter configurations, as we were interested in out-of-the-box performance. We computed, however, two variants of KeyBERT, one which emits single tokens (KeyBERT-(1,1)) and one which permits two-term tokens (KeyBERT-(1,2)). The default configuration of the KeyBERT variants performed worse than term frequency-based extraction (for which we considered unigrams; inverse document frequencies were not computed as they require the whole corpus, making them not directly comparable to purely unsupervised methods), and KeyBERT-(1,1) offered adequate performance only when we set the 'maxsum' and 'mmr' flags to 'true'. The stopwords used were the same for all approaches (NLTK's default English stopwords [4]). Other algorithms' implementations were based on the PKE library [5].

5 Results

A summary of algorithm run times (relative to one another) is shown in Figure 3.

Figure 3: Pairwise time comparison of average algorithm run times.

As expected, the simplest baseline (term frequency) is up to three orders of magnitude faster than, e.g., the BERT-based model. The second-fastest approach, which retrieves substantially better than term frequency while remaining up to two orders of magnitude faster than the other methods, is the proposed RaKUn 2.0. It is closely followed by SingleRank and TopicalPageRank. The duplication levels are shown in Figure 4.

Figure 4: Duplication levels for different algorithms.

The duplication analysis indicates that the highest duplication levels were observed for YAKE, TopicalPageRank and TextRank. MultiPartiteRank and SingleRank had notably lower duplication levels, as did KeyBERT-(1,1) and the term frequency (unigram) baseline. The proposed RaKUn 2.0 is at the lower end of the approaches with regard to this score, albeit not being optimal.

We continue the discussion by presenting the retrieval performance. A systematic investigation of algorithm performance is shown in Figure 5.

Figure 5: F1 score for different top keyphrases, averaged across all data sets.

The results indicate that, on average, MultiPartiteRank is the leading algorithm in the low $k$ scenarios. RaKUn 2.0, however, performs very similarly for up to ten keyphrases, which is one of the most common use cases of such algorithms. A more detailed overview of the scores on the per-data set level is given in Tables 2-5. The color codes represent top three performers for each data set (gold=first, silver=second, bronze=third).

Algorithm KeyBERT-(1,1) KeyBERT-(1,2) MultiPartiteRank RaKUn 2.0 SingleRank TextRank TopicalPageRank YAKE TFreq
500N-KPCrowd-v1.1 0.012 0.012 0.171 0.138 0.164 0.057 0.094 0.127 0.106
Inspec 0.0 0.0 0.22 0.143 0.207 0.126 0.24 0.195 0.041
Krapivin2009 0.051 0.057 0.109 0.097 0.094 0.007 0.02 0.118 0.011
Nguyen2007 0.099 0.058 0.168 0.141 0.152 0.025 0.053 0.188 0.035
PubMed 0.1 0.021 0.087 0.083 0.072 0.002 0.004 0.087 0.036
Schutz2008 0.088 0.023 0.23 0.194 0.219 0.015 0.031 0.15 0.075
SemEval2010 0.071 0.053 0.152 0.139 0.133 0.01 0.023 0.155 0.023
SemEval2017 0.0 0.0 0.216 0.132 0.203 0.122 0.224 0.175 0.056
citeulike180 0.205 0.03 0.172 0.225 0.14 0.004 0.013 0.185 0.097
fao30 0.16 0.027 0.176 0.233 0.161 0.008 0.011 0.15 0.072
fao780 0.116 0.013 0.141 0.138 0.118 0.004 0.009 0.138 0.064
kdd 0.0 0.001 0.107 0.144 0.094 0.058 0.109 0.144 0.056
theses100 0.099 0.017 0.149 0.103 0.128 0.004 0.006 0.093 0.042
wiki20 0.222 0.013 0.186 0.226 0.163 0.0 0.0 0.135 0.021
www 0.0 0.001 0.11 0.113 0.099 0.065 0.109 0.129 0.062
Table 2: F1@10 (gold=first, silver=second, bronze=third, per row)

Algorithm KeyBERT-(1,1) KeyBERT-(1,2) MultiPartiteRank RaKUn 2.0 SingleRank TextRank TopicalPageRank YAKE TFreq
500N-KPCrowd-v1.1 0.046 0.037 0.38 0.323 0.36 0.129 0.173 0.262 0.192
Inspec 0.0 0.0 0.174 0.112 0.165 0.101 0.189 0.152 0.032
Krapivin2009 0.037 0.04 0.079 0.069 0.068 0.005 0.014 0.084 0.008
Nguyen2007 0.09 0.05 0.151 0.124 0.138 0.022 0.048 0.166 0.032
PubMed 0.065 0.013 0.057 0.055 0.047 0.001 0.003 0.057 0.023
Schutz2008 0.193 0.047 0.504 0.433 0.48 0.029 0.065 0.329 0.163
SemEval2010 0.075 0.055 0.159 0.146 0.14 0.011 0.024 0.162 0.023
SemEval2017 0.0 0.001 0.293 0.184 0.278 0.169 0.3 0.235 0.077
citeulike180 0.208 0.03 0.172 0.228 0.14 0.003 0.012 0.183 0.097
fao30 0.183 0.033 0.21 0.28 0.19 0.01 0.013 0.18 0.087
fao780 0.075 0.008 0.092 0.09 0.077 0.002 0.006 0.089 0.041
kdd 0.0 0.001 0.064 0.087 0.056 0.036 0.065 0.085 0.034
theses100 0.064 0.011 0.098 0.068 0.084 0.002 0.004 0.06 0.027
wiki20 0.19 0.01 0.155 0.19 0.135 0.0 0.0 0.12 0.02
www 0.0 0.001 0.066 0.068 0.06 0.04 0.065 0.076 0.037
Table 3: Precision@10 (gold=first, silver=second, bronze=third, per row)

Algorithm KeyBERT-(1,1) KeyBERT-(1,2) MultiPartiteRank RaKUn 2.0 SingleRank TextRank TopicalPageRank YAKE TFreq
500N-KPCrowd-v1.1 0.007 0.007 0.144 0.119 0.139 0.041 0.087 0.129 0.113
Inspec 0.0 0.0 0.356 0.233 0.331 0.194 0.388 0.326 0.07
Krapivin2009 0.097 0.119 0.212 0.187 0.182 0.014 0.041 0.236 0.021
Nguyen2007 0.135 0.087 0.23 0.216 0.205 0.036 0.078 0.279 0.05
PubMed 0.27 0.065 0.223 0.209 0.181 0.009 0.013 0.239 0.096
Schutz2008 0.063 0.017 0.161 0.133 0.153 0.011 0.022 0.104 0.053
SemEval2010 0.071 0.054 0.152 0.139 0.133 0.01 0.024 0.156 0.023
SemEval2017 0.0 0.0 0.183 0.11 0.17 0.101 0.189 0.15 0.046
citeulike180 0.221 0.034 0.187 0.242 0.151 0.005 0.016 0.205 0.104
fao30 0.149 0.023 0.159 0.21 0.147 0.007 0.01 0.134 0.065
fao780 0.335 0.036 0.396 0.39 0.321 0.01 0.025 0.39 0.193
kdd 0.0 0.003 0.384 0.514 0.346 0.188 0.398 0.562 0.194
theses100 0.266 0.045 0.389 0.262 0.326 0.019 0.02 0.254 0.116
wiki20 0.294 0.017 0.251 0.297 0.221 0.0 0.0 0.166 0.023
www 0.0 0.004 0.393 0.412 0.352 0.221 0.394 0.502 0.237
Table 4: Recall@10 (gold=first, silver=second, bronze=third, per row)

Algorithm KeyBERT-(1,1) KeyBERT-(1,2) MultiPartiteRank RaKUn 2.0 SingleRank TextRank TopicalPageRank YAKE TFreq
500N-KPCrowd-v1.1 0.422 0.763 0.477 0.009 0.454 0.417 1.552 0.699 0.0
Inspec 0.202 0.337 0.399 0.006 0.4 0.394 1.554 0.74 0.0
Krapivin2009 1.561 11.282 6.144 0.08 4.332 1.111 2.852 1.423 0.007
Nguyen2007 1.333 6.318 3.528 0.05 2.45 0.87 2.505 1.304 0.005
PubMed 1.237 4.865 2.823 0.046 1.945 0.766 2.312 1.249 0.004
Schutz2008 1.58 6.926 3.993 0.038 2.675 0.772 2.428 1.236 0.004
SemEval2010 1.675 12.479 6.135 0.076 4.213 1.117 2.707 1.378 0.008
SemEval2017 0.213 0.431 0.403 0.007 0.395 0.389 1.596 0.947 0.0
citeulike180 1.556 8.006 3.937 0.051 2.411 0.812 2.446 1.322 0.005
fao30 1.528 7.599 4.665 0.056 2.793 0.877 2.47 1.573 0.005
fao780 1.531 7.806 5.284 0.056 3.111 0.838 2.53 1.479 0.005
kdd 0.153 0.268 0.394 0.006 0.394 0.383 1.371 0.549 0.0
theses100 1.52 7.069 4.05 0.053 2.603 0.811 2.293 1.644 0.004
wiki20 1.598 10.453 5.674 0.066 3.8 0.952 2.586 1.429 0.006
www 0.152 0.269 0.39 0.006 0.396 0.396 1.283 0.552 0.0
Table 5: Retrieval time (s). (gold=first, silver=second, bronze=third, per row)

We additionally conducted rank-based difference significance evaluation [9], where the average algorithm ranks are compared across all data sets. If the algorithms are linked with a red line, they perform very similarly (the difference between their average ranks is not significant). The diagrams are shown as Figures 6 and 7.

Figure 6: Critical difference diagram - F1@15. RaKUn 2.0’s performance is (statistically) comparable to the recent state-of-the-art approaches.
Figure 7: CD diagrams – time per document. Higher ranks indicate faster compute time. RaKUn 2.0 is significantly faster when compared to other state-of-the-art methods.

The tests indicate that the difference between the top-performing approaches (MultiPartiteRank, YAKE and RaKUn 2.0) is insignificant. Similar observations can be made based on tabular summaries. Overall, however, we can observe a marginal dominance of RaKUn 2.0 w.r.t. precision. Similar retrieval performance amplifies the purpose of this paper, which transcends the retrieval-only evaluation and incorporates also other properties of either the algorithms or the retrieved space.
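The rank-based comparison can be sketched as follows, following the average-rank and critical-difference procedure described by Demšar [9]. The input scores are illustrative. The significance level is an assumption, since it is not stated here: the constant 3.102 is the tabulated Nemenyi critical value for nine algorithms at a significance level of 0.05.

```python
import math


def average_ranks(scores, higher_is_better=True):
    """scores: {dataset: {algorithm: value}} -> {algorithm: mean rank}."""
    totals = {}
    for per_algorithm in scores.values():
        ordered = sorted(per_algorithm.items(), key=lambda item: item[1],
                         reverse=higher_is_better)
        position = 0
        while position < len(ordered):
            end = position
            while end + 1 < len(ordered) and ordered[end + 1][1] == ordered[position][1]:
                end += 1
            shared_rank = (position + end) / 2 + 1  # tied entries share the mean rank
            for name, _ in ordered[position:end + 1]:
                totals[name] = totals.get(name, 0.0) + shared_rank
            position = end + 1
    return {name: total / len(scores) for name, total in totals.items()}


def nemenyi_critical_difference(num_algorithms, num_datasets, q_alpha):
    """Two algorithms differ significantly if their mean ranks differ by more than this."""
    return q_alpha * math.sqrt(num_algorithms * (num_algorithms + 1) / (6.0 * num_datasets))


scores = {  # illustrative scores for three algorithms on two data sets
    "dataset_a": {"MultiPartiteRank": 0.186, "RaKUn 2.0": 0.226, "YAKE": 0.135},
    "dataset_b": {"MultiPartiteRank": 0.107, "RaKUn 2.0": 0.144, "YAKE": 0.144},
}
print(average_ranks(scores))
# Assumption: alpha = 0.05, hence q_alpha = 3.102 for nine algorithms.
print(nemenyi_critical_difference(num_algorithms=9, num_datasets=15, q_alpha=3.102))
```

With nine algorithms and fifteen data sets the square-root term equals one, so under this assumed significance level two algorithms would need average ranks more than about 3.1 apart to be considered significantly different.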

In Figure 8, the selected approaches are compared across the three main evaluation criteria: retrieval performance, duplication performance (inverse of duplication rate) and time performance (inverse of normalized times across all algorithms). Larger values are better for each criterion. It can be observed that MultiPartiteRank outperforms the others at the front considering duplication and retrieval performance; however, RaKUn 2.0 outperforms the others when considering retrieval capabilities and computation time.

Figure 8: A visualization comparing best and worst-performing approaches with regards to three different criteria relevant in practice. Note that the scores are relative with regards to the considered methods’ performances.

5.1 Scaling to 14M documents

A direct way of testing the complexity bounds stated in the methods section was to attempt to run RaKUn 2.0 directly on the collection of approximately 14 million biomedical articles, the MeDAL corpus [30]. The corpus was parsed into a list of documents and fed into the default configuration of RaKUn 2.0. The computation took approximately forty seconds (including text reading) on a virtual machine with 12 cores and 32GB of RAM. The list of top ten keyphrases is shown as Table 6.

Keyphrase              Score
presence               0.02041868080426608
molecular weights      0.01313742352650019
glutamine synthetase   0.01081927396059080
growth hormone         0.01081481738381907
arterial blood         0.00973761662559790
investigated           0.00926714499542069
rate constant          0.00904369510973679
blood flow             0.00899499866920862
molecular weight       0.00865807865159297
sodium dodecyl         0.00865611530561878
Table 6: 14M articles summarized as top ten keyphrases.

The top keyphrases correspond to rather general biological terms, which are some of the main topics related to the considered documents. The results were obtained by maintaining the merge_threshold hyperparameter set to one – single term keyphrases can be obtained if this threshold is lowered. For example, if set to 0.5, the top three keyphrases are ‘activity’, ‘concentration’ and ‘enzyme’.

6 Discussion and conclusions

In this paper we presented an approach to unsupervised keyphrase detection, aimed specifically at pushing the limits of computation time and retrieval performance. The main contribution of this paper is an algorithm for keyphrase detection that performs substantially (significantly) faster than current state-of-the-art methods, while maintaining the retrieval performance. The algorithmic novelties introduced touch upon the transformation of token sequences into graphs, and re-address the question of meta vertices by constructing them at the sequence level, which is substantially faster. Further, by exploiting personalized PageRank, global token information is incorporated into keyphrase ranking alongside token lengths. By conducting an extensive benchmark against established baselines, this paper presents an evaluation which incorporates not only retrieval capabilities, but also computation time and duplication rates amongst the retrieved keyphrases.

Analysis of keyphrase detection algorithms with regard to multiple evaluation criteria is becoming increasingly relevant, as many low-latency applications cannot afford an expensive detection phase. To our knowledge, this paper is also one of the first to evaluate the performance based on critical difference diagrams, assessing the significance of observed differences (in time and retrieval performance).

Further work includes exploration of lower-level implementations of top-performing approaches, alongside their parts that could be subject to parallelism. A potentially interesting endeavor would also include background knowledge (as graphs), possibly enabling detection of keywords beyond the ones found in a given document, while remaining unsupervised.

7 Replicability

The RaKUn 2.0 algorithm is available as a simple-to-use Python library.


The work was supported by the Slovenian Research Agency (ARRS) core research programme Knowledge Technologies (P2-0103), and projects Computer-assisted multilingual news discourse analysis with contextual embeddings (J6-2581) and Quantitative and qualitative analysis of the unregulated corporate financial reporting (J5-2554). The work was also supported by the Ministry of Culture of Republic of Slovenia through project Development of Slovene in Digital Environment (RSDO).


  • [1] A. R. Aronson, O. Bodenreider, H. F. Chang, S. M. Humphrey, J. G. Mork, S. J. Nelson, T. C. Rindflesch, and W. J. Wilbur (2000) The NLM indexing initiative. In Proceedings of the AMIA Symposium, pp. 17.
  • [2] I. Augenstein, M. Das, S. Riedel, L. Vikraman, and A. McCallum (2017) SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 546–555.
  • [3] S. Beliga, A. Meštrović, and S. Martincic-Ipsic (2015) An overview of graph-based keyword extraction methods and approaches. Journal of Information and Organizational Sciences 39, pp. 1–20.
  • [4] S. Bird, E. Klein, and E. Loper (2009) Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.
  • [5] F. Boudin (2016) Pke: an open source python-based keyphrase extraction toolkit. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, pp. 69–73.
  • [6] F. Boudin (2018) Unsupervised keyphrase extraction with multipartite graphs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 667–672.
  • [7] A. Bougouin, F. Boudin, and B. Daille (2013) TopicRank: graph-based topic ranking for keyphrase extraction. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, Nagoya, Japan, pp. 543–551.
  • [8] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, and A. Jatowt (2020) YAKE! keyword extraction from single documents using multiple local features. Information Sciences 509, pp. 257–289. ISSN 0020-0255.
  • [9] J. Demšar (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (1), pp. 1–30.
  • [10] H. Ding and X. Luo (2021) AttentionRank: unsupervised keyphrase extraction using self and cross attentions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 1919–1928.
  • [11] S. D. Gollapalli and C. Caragea (2014) Extracting keyphrases from research papers using citation networks. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada, C. E. Brodley and P. Stone (Eds.), pp. 1629–1635.
  • [12] M. Grootendorst (2020) KeyBERT: minimal keyword extraction with BERT. Zenodo.
  • [13] K. S. Hasan and V. Ng (2014) Automatic keyphrase extraction: a survey of the state of the art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 1262–1273.
  • [14] A. Hulth (2003) Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216–223.
  • [15] S. N. Kim, O. Medelyan, M. Kan, and T. Baldwin (2010) SemEval-2010 task 5: automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, pp. 21–26.
  • [16] M. Krapivin, A. Autaeu, and M. Marchese (2009) Large dataset for keyphrases extraction.
  • [17] T. Kumar, M. Mahrishi, and G. Meena (2022) A comprehensive review of recent automatic speech summarization and keyword identification techniques. Artificial Intelligence in Industrial Applications, pp. 111–126.
  • [18] L. Marujo, M. Viveiros, and J. P. da Silva Neto (2013) Keyphrase cloud generation of broadcast news. arXiv:1306.4606.
  • [19] O. Medelyan, E. Frank, and I. H. Witten (2009) Human-competitive tagging using automatic keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 1318–1327.
  • [20] O. Medelyan, I. H. Witten, and D. Milne (2008) Topic indexing with Wikipedia. In Proceedings of the AAAI WikiAI workshop, Vol. 1, pp. 19–24.
  • [21] O. Medelyan and I. H. Witten (2010) Domain-independent automatic keyphrase indexing with small training sets. ArXiv preprint abs/10.1002.
  • [22] O. Medelyan (2009) Human-competitive automatic topic indexing. Ph.D. Thesis, The University of Waikato.
  • [23] R. Mihalcea and P. Tarau (2004) TextRank: bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, pp. 404–411.
  • [24] T. D. Nguyen and M. Kan (2007) Keyphrase extraction in scientific publications. In Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, D. H. Goh, T. H. Cao, I. T. Sølvberg, and E. Rasmussen (Eds.), Berlin, Heidelberg, pp. 317–326. ISBN 978-3-540-77094-7.
  • [25] L. Page, S. Brin, R. Motwani, and T. Winograd (1999) The PageRank citation ranking: bringing order to the web. Technical Report 1999-66, Stanford InfoLab. Previous number: SIDL-WP-1999-0120.
  • [26] E. Papagiannopoulou and G. Tsoumakas (2020) A review of keyphrase extraction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10 (2), pp. e1339.
  • [27] A. T. Schutz et al. (2008) Keyphrase extraction from single documents in the open domain exploiting linguistic and statistical methods. M. App. Sc Thesis.
  • [28] B. Škrlj, A. Repar, and S. Pollak (2019) RaKUn: rank-based keyword extraction via unsupervised learning and meta vertex aggregation. In Statistical Language and Speech Processing, C. Martín-Vide, M. Purver, and S. Pollak (Eds.), Cham, pp. 311–323. ISBN 978-3-030-31372-2.
  • [29] X. Wan and J. Xiao (2008) CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, pp. 969–976.
  • [30] Z. Wen, X. H. Lu, and S. Reddy (2020) MeDAL: medical abbreviation disambiguation dataset for natural language understanding pretraining. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, Online, pp. 130–135.