Keyphrase extraction is concerned with automatically extracting a set of representative phrases from a document that concisely summarize its content [hasan+ng2014]. There exist both supervised and unsupervised keyphrase extraction methods. Unsupervised methods are popular because they are domain independent and do not need labeled training data, i.e., manual annotation of keyphrases, which comes with subjectivity issues as well as a significant investment of time and money. Supervised methods, on the other hand, have more powerful modeling capabilities and typically achieve higher accuracy than unsupervised ones, according to previous studies [DBLP:journals/lre/KimMKB13, caragea2014citation, meng2017deep].
The versatility of keyphrases renders keyphrase extraction a very important document processing task. Keyphrases can be used to semantically index a collection of documents, either in place of their full text or in addition to it, enabling semantic and faceted search [gutwin1999improving]. In addition, they can be used for query expansion in the context of pseudo-relevance feedback [DBLP:conf/jcdl/SongSAO06]. They can also serve as features for document clustering and classification [DBLP:conf/acl/HulthM06]. Furthermore, the set of extracted keyphrases can be viewed as an extreme summary of the corresponding document for human inspection, while the individual keyphrases can guide the extraction of sentences in automatic document summarization systems [ZhangZM04]. Keyphrase extraction is particularly important in the (academic) publishing industry for carrying out a number of important tasks, such as the recommendation of new articles or books to customers, highlighting missing citations to authors, identifying potential reviewers for submissions, and the analysis of content trends [augenstein2017semeval].
There exist a number of noteworthy keyphrase extraction surveys. [hasan+ng2014] focus on the errors that are made by state-of-the-art keyphrase extractors: (a) evaluation errors (a returned keyphrase is semantically equivalent to a gold one but is evaluated as erroneous), (b) redundancy errors (a method returns correct but semantically equivalent keyphrases), (c) infrequency errors (a keyphrase appears only once or twice in a text and the method fails to detect it), and (d) overgeneration errors (a system correctly returns a phrase as a keyphrase because it contains a word that appears frequently in the document, but erroneously outputs additional phrases that contain this frequent word). Although their analysis is not based on a large number of documents, it is quite interesting and well presented. An earlier survey by the same authors presents the results of an experimental study of state-of-the-art unsupervised keyphrase extraction methods, conducted with the aim of gaining deeper insights into these methods [hasan+ng2010]. The main conclusions are the following: (a) methods should be evaluated on multiple datasets, (b) post-processing steps (e.g., phrase formation) have a large impact on the performance of methods, and (c) TfIdf is a strong baseline. [DBLP:conf/aclnut/BoudinMC16] study the effect of document pre-processing pipelines on the keyphrase extraction process, while [DBLP:conf/ecir/FlorescuC17] examine how keyphrase extraction is affected by phrase ranking schemes.
Our article constitutes a contemporary review of the keyphrase extraction task, making the following main contributions:
A systematic presentation of both unsupervised (Section 2) and supervised (Section 3) keyphrase extraction methods via comprehensive categorization schemes based on the main properties of these methods. Our article reviews 37 additional methods compared to [hasan+ng2014]. In addition, we contribute a time line of unsupervised and supervised methods to shed light on their evolution, as well as a presentation of the main types of features employed in supervised methods, along with a discussion of the issue of class imbalance.
We present the different approaches that can be followed for evaluating keyphrase extraction methods, as well as the different evaluation measures that exist, along with their popularity in the literature (Section LABEL:evaluation).
We provide a list of popular keyphrase extraction datasets, including their sources and properties, as well as a comprehensive catalogue of commercial APIs and free software (Section LABEL:data-comp-software) related to keyphrase extraction.
We present a thorough empirical study, both quantitative and qualitative, among commercial APIs and state-of-the-art unsupervised methods, which allows us to gain a deeper understanding of how the results are affected by different evaluation approaches, evaluation measures and ground truth standards (Section LABEL:comparative-eval).
Our article search strategy involved searching for “keyphrase extraction” in the following databases of scientific literature: Google Scholar, Springer Link, IEEE Xplore, ACM Digital Library and DBLP. We focused mainly on articles appearing in the high-quality journals and conference proceedings listed in Appendix LABEL:sources.
2 Unsupervised Methods
The basic steps of an unsupervised keyphrase extraction system are the following [hasan+ng2010, hasan+ng2014]:
Selection of the candidate lexical units based on some heuristics. Examples of such heuristics are the exclusion of stopwords and the selection of words that belong to a specific part-of-speech (POS).
Ranking of the candidate lexical units.
Formation of the keyphrases by selecting words from the top-ranked ones or by selecting a phrase with a high rank score or whose parts have a high score.
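These three steps can be illustrated with a minimal Python sketch; the toy stopword list and raw-frequency ranking below are stand-ins for the POS filters and the scoring schemes discussed in the rest of this section:

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "a", "and", "is", "to", "in"}   # toy stopword list

def extract_keywords(text, top_n=3):
    """Toy instance of the three-step pipeline: (1) candidate selection via
    stopword filtering (a POS filter would normally also be applied),
    (2) ranking by raw frequency as a stand-in for any scoring scheme,
    (3) keyword formation by keeping the top-ranked units."""
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = [t for t in tokens if t not in STOPWORDS]   # step 1
    ranked = Counter(candidates).most_common()               # step 2
    return [word for word, _ in ranked[:top_n]]              # step 3
```

Real systems differ mainly in how step 2 is implemented, which is the focus of the following subsections.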
Table 1 presents the time line of the main research approaches related to unsupervised keyphrase extraction that are cited in this survey, along with the key characteristics that support the presentation structure adopted in the following sections. Each method can be characterized as (i) statistics-based (Stat.), (ii) graph-based incorporating statistics (Stats into Graph), (iii) topic-based, using clustering (Clust.), LDA or knowledge graphs (KG) to find the document’s topics, (iv) using citation networks or neighbors’ information (C/N Info), (v) using semantics (Sem.), or (vi) language model-based (Lang. Mod.). According to Table 1, graph-based methods are the most popular ones. However, the statistics-based methods still hold the attention of the research community. Additionally, the incorporation of semantics seems to be helpful for the task, as more and more methods use it. Figure 1 shows the presentation structure of the section. First, we present the statistics-based (Section 2.1) and the graph-based ranking methods (Section 2.2). Then, we discuss the methods that are based on embeddings (Section 2.3), as well as the category of language model-based methods (Section 2.4).
[Table 1: time line (2003–2019) of the unsupervised keyphrase extraction methods cited in this survey, e.g., [tomokiyo2003language], [mihalcea+tatau2004], SingleRank and ExpandRank [wan+xiao2008], TopicRank [bougouin2013topicrank] and [wonautomatic], annotated with their key characteristics: Stat., Stats into Graph, Clust., LDA, KG, C/N Info, Sem., Lang. Mod.]
2.1 Statistics-based Methods
TfIdf is the common baseline for the task. This method scores and ranks the phrases according to the formula:
$$\mathrm{TfIdf}(p) = \mathrm{Tf}(p) \times \log\frac{N}{|D_p|}$$
where $\mathrm{Tf}(p)$ is the raw phrase frequency, $N$ is the number of documents in the document set $D$, and $|D_p|$ is the number of documents in which the phrase $p$ appears. Effective variations of TfIdf are also implemented, such as taking the logarithm of the phrase frequency instead of the raw frequency to saturate the increase for high-frequency words. Phrase frequency also contributes to alternative scoring schemes, such as the one recently proposed by [DBLP:conf/ecir/FlorescuC17]. Specifically, the score of a phrase $p$ is calculated as:
$$S(p) = \mathrm{mean}(p) \times \mathrm{Tf}(p)$$
where $\mathrm{mean}(p)$ is the mean of the scores of the words that constitute the phrase and $\mathrm{Tf}(p)$ is the phrase frequency in the text document. This is not a free-standing scoring scheme, but it can be considered as an intermediate stage of an unsupervised method.
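As a concrete illustration, the TfIdf baseline can be sketched as follows (the function and variable names are ours, not taken from any particular implementation; documents are treated as plain strings for simplicity):

```python
import math

def tfidf_scores(doc_phrases, corpus):
    """Score each candidate phrase of a document by Tf x Idf.

    doc_phrases: list of candidate phrases (strings) from the target document,
    with repetitions, so raw phrase frequency can be counted.
    corpus: list of documents, each a plain-text string (used only for Idf).
    """
    N = len(corpus)
    scores = {}
    for p in set(doc_phrases):
        tf = doc_phrases.count(p)                 # raw phrase frequency
        df = sum(1 for d in corpus if p in d)     # documents containing p
        scores[p] = tf * (math.log(N / df) if df else 0.0)
    return scores
```

Frequent phrases that are rare across the corpus get the highest scores, which is exactly the behavior the baseline relies on.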
KP-Miner [DBLP:journals/is/El-BeltagyR09] is a keyphrase extraction system that exploits various types of statistical information beyond the Tf and Idf scores. It follows a quite effective filtering process of candidate phrases and uses a scoring function similar to TfIdf. Particularly, the system keeps as candidates those phrases that are not separated by punctuation marks/stopwords, considering at the same time a least allowable seen frequency (lasf) factor and a cutoff constant (CutOff), defined in terms of the number of words after which a phrase appears for the first time. Then, the system ranks the candidate phrases taking into account the Tf and Idf scores as well as the term position and a boosting factor for compound terms over single terms.
Around the same time, co-occurrence statistics and statistical metrics based on external resources started to be used for calculating semantic similarity among a document’s candidate terms. KeyCluster [liu2009clustering]
tries to extract keyphrases that cover all the major topics of a document. First, it removes stopwords and selects the candidate terms. Then, it utilizes a variety of measures (co-occurrence-based or Wikipedia-based) to calculate the semantic relatedness of the candidate terms and groups them into clusters (using spectral clustering). Finally, it finds the exemplar terms of each cluster in order to extract keyphrases from the document.
The great importance of using both statistics and context information is confirmed by recent methods such as YAKE [CamposSpringer] and the method proposed by [wonautomatic]. YAKE, besides each term’s position/frequency, also uses new statistical metrics that capture context information and the spread of the terms in the document. First, YAKE preprocesses the text by splitting it into individual terms. Second, a set of 5 features is calculated for each individual term: Casing, which reflects the casing aspect of a word; Word Positional, which values more those words occurring at the beginning of a document; Word Frequency; Word Relatedness to Context, which computes the number of different terms that occur to the left/right side of the candidate word; and Word DifSentence, which quantifies how often a candidate word appears in different sentences. Then, all these features are combined into the score $S(w)$ of each term (the smaller the value, the more important the word $w$).
Finally, a contiguous sequence of 1, 2 and 3-gram candidate keywords is generated using a sliding window of 3-grams. Each candidate keyword $kw$ is assigned the following score:
$$S(kw) = \frac{\prod_{w \in kw} S(w)}{\mathrm{Tf}(kw) \times \left(1 + \sum_{w \in kw} S(w)\right)}$$
The smaller the score, the more meaningful the keyword will be. In addition, the method of [wonautomatic] shows that using a combination of simple textual statistical features it is possible to achieve results that compete with state-of-the-art methods. The first step of this method is the selection of the candidate phrases using morphosyntactic patterns. Then, for each candidate the following features are calculated: Term Frequency, i.e., the sum of the frequencies of the words of the candidate phrase; Inverse Document Frequency (Idf); Relative First Occurrence, i.e., a cumulative probability computed from the position of the first occurrence and the candidate frequency; and Length, i.e., a simple rule that scores 1 for unigrams and 2 for the remaining sizes. The final score of each candidate is the product of these 4 features. Moreover, based on the observation that larger document datasets are associated with a higher number of keyphrases per document, the top-$n$ candidates are extracted from each document, where the parameter $n$ was determined experimentally.
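Such a product-of-features scheme can be sketched as follows; the exact form of the relative-first-occurrence feature below is an assumption for illustration, not the definition used in [wonautomatic]:

```python
def candidate_score(tf, idf, first_occurrence_ratio, n_words):
    """Illustrative product of the four features described above.

    tf: sum of the frequencies of the words of the candidate phrase.
    idf: inverse document frequency of the candidate.
    first_occurrence_ratio: position of the first occurrence divided by the
    document length (earlier is smaller) -- the exact form used to turn this
    into a probability is an assumption here.
    n_words: number of words in the candidate phrase.
    """
    rfo = (1.0 - first_occurrence_ratio) ** tf   # assumed relative-first-occurrence form
    length = 1 if n_words == 1 else 2            # the Length rule from the text
    return tf * idf * rfo * length
```

Under this sketch, a candidate that appears earlier in the document scores strictly higher than an otherwise identical one appearing later.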
2.2 Graph-based Ranking Methods
The basic idea in graph-based ranking is to create a graph from a document that has the candidate phrases of the document as nodes, with each edge connecting related candidate keyphrases. The final goal is to rank the nodes using a graph-based ranking method, such as Google’s PageRank [grin+page1998], the Positional Function [herings+van+talman] or HITS [kleinberg1999], or, more generally, by solving an optimization problem on the graph. Based on the methods described below, PageRank has been successfully used for graph-based keyphrase extraction. PageRank is based on eigenvector centrality and recursively defines the weight of a vertex as a measure of its influence inside the graph-of-words, regardless of how cohesive its neighborhood is. In contrast, [DBLP:conf/ecir/RousseauV15] consider the vertices of the main core, i.e., the most cohesive connected component of the graph, as the set of keywords to extract from the document; for this reason, these vertices are intuitively appropriate candidates.
TextRank was the first graph-based keyphrase extraction method, proposed by [mihalcea+tatau2004], and it inspired researchers to build upon it, leading to well-known state-of-the-art methods. First of all, the text is tokenized and annotated with POS tags. Then, syntactic filters are applied to the text units, i.e., only nouns and adjectives are kept. Next, the lexical units that pass the filters mentioned above, i.e., the candidates, are added to the graph as nodes, and an edge is added between the nodes that co-occur within a window of words. The graph is undirected and unweighted. The initial score assigned to each node is equal to 1 and then the PageRank algorithm runs until it converges. Specifically, for a node $V_i$ the corresponding score function that is repeatedly computed is:
$$S(V_i) = (1-d) + d \sum_{V_j \in Adj(V_i)} \frac{S(V_j)}{|Adj(V_j)|}$$
where $Adj(V_i)$ is the set of neighbors of $V_i$, $Adj(V_j)$ is the set of neighbors of $V_j$, and $d$ is the damping factor, i.e., the probability of jumping from one node to another. Once the algorithm converges, the nodes are sorted by decreasing score.
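The score iteration can be sketched in a few lines, on an undirected, unweighted word graph, with synchronous updates and a fixed number of iterations in place of a convergence test:

```python
def textrank(neighbors, d=0.85, iters=50):
    """Minimal sketch of the TextRank score iteration.

    neighbors: dict mapping each node to the set of its adjacent nodes
    (the undirected, unweighted co-occurrence graph).
    d: damping factor; iters: number of synchronous update rounds.
    """
    S = {v: 1.0 for v in neighbors}               # initial score of 1 per node
    for _ in range(iters):
        S = {v: (1 - d) + d * sum(S[u] / len(neighbors[u]) for u in neighbors[v])
             for v in neighbors}
    return S
```

A node with many well-connected neighbors ends up with a higher score, which is then used to rank the candidate words.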
SingleRank [wan+xiao2008] is an extension of TextRank that incorporates edge weights. Similarly to the statistics-based methods, co-occurrence statistics provide crucial information about the contexts. Hence, each edge weight is set equal to the number of co-occurrences of the two corresponding words. The score function for a node $V_i$ is then computed in a similar way:
$$S(V_i) = (1-d) + d \sum_{V_j \in Adj(V_i)} \frac{w_{ji}}{\sum_{V_k \in Adj(V_j)} w_{jk}} S(V_j)$$
where $w_{ji}$ is the number of co-occurrences of word $j$ and word $i$. In a post-processing stage, for each continuous sequence of nouns and adjectives in the text document, the scores of the constituent words are summed up, and the top-ranked candidates are returned as keyphrases. Co-occurrences are also used by various graph-based methods, such as RAKE (Rapid Automatic Keyword Extraction) [rose2010automatic], which utilizes both word frequency and word degree to assign scores to phrases. RAKE takes as input parameters a list of stopwords, a set of phrase delimiters and a set of word delimiters, which it uses to partition the text into candidate phrases. Then, a graph of word-word co-occurrences is created, and a score (the word frequency, the word degree, or the ratio of degree to frequency) is assigned to each candidate phrase, computed as the sum of the scores of the words that comprise the phrase. In addition, RAKE is able to identify keyphrases that contain interior stopwords by detecting pairs of words that adjoin one another at least twice in the same document and in the same order. Finally, the top-ranked candidate phrases are selected as keyphrases for the document.
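RAKE's degree-to-frequency scoring can be sketched as follows, assuming the text has already been partitioned into candidate phrases (each a list of content words):

```python
from collections import defaultdict

def rake_scores(candidate_phrases):
    """Sketch of RAKE word scoring with the deg(w)/freq(w) metric.

    candidate_phrases: list of phrases, each a list of words, as produced by
    splitting the text on stopwords and phrase delimiters.
    """
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in candidate_phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)   # w co-occurs with every word of its phrase
    # phrase score = sum of the member-word scores
    return {" ".join(p): sum(degree[w] / freq[w] for w in p)
            for p in candidate_phrases}
```

The degree/frequency ratio favors words that occur mostly inside longer multi-word candidates, which is why RAKE tends to surface longer keyphrases.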
In this vein, the more recent methods SGRank [DBLP:conf/starsem/DaneshSM15] and PositionRank (PR) [DBLP:conf/acl/FlorescuC17] utilize statistical, positional and word co-occurrence information, thus improving the overall performance. In particular, SGRank [DBLP:conf/starsem/DaneshSM15]
first extracts all possible n-grams from the input text, eliminating those that contain punctuation marks or whose words are anything other than nouns, adjectives or verbs. Furthermore, it takes into account term frequency conditions. In the second stage, the candidate n-grams are ranked based on a modified version of TfIdf (similar to KP-Miner). In the third stage, the top-ranking candidates are re-ranked based on additional statistical heuristics, such as position of first occurrence and term length. Finally, the ranking produced in stage three is incorporated into a graph-based algorithm which produces the final ranking of keyphrase candidates. PositionRank (PR) [DBLP:conf/acl/FlorescuC17] is a graph-based unsupervised method that tries to capture frequent phrases, considering the word-word co-occurrences and the corresponding positions of the words in the text. Specifically, it incorporates all positions of a word into a biased, weighted PageRank. Finally, the keyphrases are scored and ranked.
The rest of the graph-based methods are grouped into three main categories, i.e., the methods that incorporate information from similar documents or citation networks (Section 2.2.1), the topic-based methods (Section 2.2.2), and the graph-based methods that utilize semantics (Section 2.2.3), which are discussed in the following sections.
2.2.1 Incorporating Information from Similar Documents/Citation Networks
The graph-based methods discussed earlier assume that the documents are independent of each other. Hence, only the information included in the target document, i.e., the phrase’s TfIdf, position etc., is used during the keyphrase extraction process. However, related documents have mutual influences that help to extract keyphrases. ExpandRank [wan+xiao2008] is an extension of SingleRank that takes into consideration information from neighboring documents to the target document. It constructs an appropriate knowledge context
for the target document, which is used in the keyphrase extraction process and helps to extract important keyphrases from it. According to this method, each document is represented by a vector of TfIdf scores. For a target document $d_0$, its $k$ nearest neighbors are identified, and a larger document set $D$ of $k+1$ documents is created. Based on this document set, a graph is constructed, where each node corresponds to a candidate word, and edges are added between two nodes $v_i$ and $v_j$ that co-occur within a window of $w$ words in the document set. The weight of an edge $e(v_i, v_j)$ is computed as follows:
$$e(v_i, v_j) = \sum_{d_k \in D} sim(d_0, d_k) \times freq_{d_k}(v_i, v_j)$$
where $sim(d_0, d_k)$ is the cosine similarity between $d_0$ and $d_k$, and $freq_{d_k}(v_i, v_j)$ is the co-occurrence frequency of $v_i$ and $v_j$ in document $d_k$. Once the graph is constructed, the rest of the procedure is identical to SingleRank.
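This edge weighting amounts to a similarity-weighted sum of local co-occurrence counts over the document set; a minimal sketch with dictionary-based toy inputs (names are ours):

```python
def expandrank_edge_weight(sims, cooc):
    """Similarity-weighted co-occurrence sum in the spirit of ExpandRank.

    sims: dict doc_id -> cosine similarity of that document to the target
    document (the target itself has similarity 1.0).
    cooc: dict doc_id -> co-occurrence count of the two words in that document.
    """
    return sum(sims[d] * cooc.get(d, 0) for d in sims)
```

Co-occurrences in documents that closely resemble the target thus contribute more to the edge than co-occurrences in loosely related neighbors.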
A knowledge context more closely related to the target document can also be found via citation networks. In a citation network, information flows from one paper to another via the citation relation. In other words, the influence of one paper on another is captured through citation contexts (i.e., short text segments surrounding a paper’s mention). In this vein, CiteTextRank [gollapalli2014extracting] incorporates information from citation networks into the keyphrase extraction process, capturing the information available in such citation contexts. In particular, given a target document $d$ and a citation network $G$, a cited context for $d$ is a context in which $d$ is cited by some paper $d_i$, and a citing context for $d$ is a context in which $d$ cites another paper $d_j$. The content of $d$ itself is dubbed its global context. As a first step, an undirected graph for $d$ is constructed, with the words from all types of contexts of $d$ as nodes, and edges between nodes $v_i$ and $v_j$ that co-occur within a window of continuous tokens in any of the contexts. The weight $w_{ij}$ of an edge is set equal to:
$$w_{ij} = \sum_{t \in TC} \sum_{c \in C_t(d)} \lambda_t \cdot \mathrm{cossim}(c, d) \cdot \#_c(v_i, v_j)$$
where $TC$ is the set of available types of contexts of $d$ (global, citing, cited), $\mathrm{cossim}(c, d)$ is the cosine similarity between the TfIdf vectors of a context $c$ of $d$ and $d$ itself, $\#_c(v_i, v_j)$ is the number of co-occurrences of $v_i$ and $v_j$ in context $c$, $C_t(d)$ is the set of contexts of type $t$, and $\lambda_t$ is the weight for contexts of type $t$. Finally, the vertices are scored using the PageRank algorithm.
2.2.2 Topic-based Methods
Apart from the previous methods, which use classic statistical heuristics (Tf, Idf, position) as well as context-aware statistics such as word-word co-occurrence information, there are also methods that try to return keyphrases related to the topics discussed in the document. Specifically, topic-based methods try to extract keyphrases that are representative of a text document in terms of the topics it covers. Such methods usually apply clustering techniques or Latent Dirichlet Allocation (LDA) [blei2003latent] to detect the main topics discussed.
TopicRank (TR) [bougouin2013topicrank] first preprocesses the text to extract the candidate phrases. Then, the candidate phrases are grouped into separate topics using hierarchical agglomerative clustering. In the next stage, a graph of topics is constructed, whose edges are weighted based on a measure that considers the offset positions of the phrases in the text. Then, TextRank is used to rank the topics, and one keyphrase candidate (the first occurring one) is selected from each of the N most important topics. MultipartiteRank (MR) [Boudin18Multipartite] is a more recent, more advanced method that is very similar to TopicRank; it introduces an in-between step in which edge weights are adjusted to capture position information, biasing the ranking towards keyphrase candidates occurring earlier in the document. Note that this heuristic for promoting a specific group of candidates, e.g., those that appear earlier in the text, can be adapted to satisfy other conditions/needs.
Topical PageRank (TPR) [liu2010automatic] is a topic-based method upon which various other topic-based methods have been built. TPR uses LDA to obtain the topic distribution $pr(z|w)$ of each word $w$ for each topic $z \in \{1, \dots, K\}$, where $K$ is the number of topics, as well as the topic distribution $pr(z|d)$ of a new document $d$ for each topic $z$. Then, for a document $d$, it constructs a word graph based on word-word co-occurrences, adding only the adjectives and nouns. The idea of TPR is to run a biased PageRank for each topic separately. So, for every topic $z$, the topic-specific PageRank scores of the words are calculated as follows:
$$R_z(w_i) = \lambda \sum_{j: w_j \rightarrow w_i} \frac{e(w_j, w_i)}{O(w_j)} R_z(w_j) + (1 - \lambda)\, p_z(w_i)$$
where $p_z(w_i)$ is equal to $pr(z|w_i)$, which is the probability of topic $z$ given word $w_i$, $\lambda$ is a damping factor ranging from 0 to 1, $e(w_j, w_i)$ is the weight of the link $(w_j, w_i)$, and $O(w_j)$ is the out-degree of vertex $w_j$. The final PageRank word scores are obtained by iterating the equation given above until convergence. Using the topic-specific importance scores of the words, the ranking of the candidate phrases with respect to each topic separately is:
$$R_z(p) = \sum_{w_i \in p} R_z(w_i)$$
where $p$ is a candidate phrase. Finally, the topic distribution of the document is taken into account by integrating the topic-specific rankings of the candidates:
$$R(p) = \sum_{z=1}^{K} R_z(p) \times pr(z|d)$$
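The per-topic biased PageRank at the heart of TPR can be sketched as follows, with the random-jump distribution replaced by a topic-specific preference vector (the names, toy data structures and fixed iteration count are ours):

```python
def topic_biased_pagerank(weights, bias, lam=0.85, iters=100):
    """Sketch of a biased PageRank, run once per topic in TPR.

    weights: dict (u, v) -> weight of the directed edge u -> v.
    bias: dict node -> topic-specific preference (e.g. pr(z|w)),
    assumed to sum to 1 over all nodes.
    lam: damping factor; iters: synchronous update rounds.
    """
    nodes = set(bias)
    # weighted out-degree O(u) of each node
    out_sum = {u: sum(w for (a, _), w in weights.items() if a == u) for u in nodes}
    R = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        R = {v: (1 - lam) * bias[v]
                + lam * sum(R[u] * w / out_sum[u]
                            for (u, x), w in weights.items() if x == v)
             for v in nodes}
    return R
```

Words that both sit centrally in the co-occurrence graph and have high probability under the topic accumulate the largest topic-specific scores.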
Single Topical PageRank (Single TPR) [DBLP:conf/www/SterckxDDD15] is an alternative method that avoids the large computational cost of TPR by running only one PageRank per document. In this method, the sum of the topic-specific values of each word is replaced by the notion of topical word importance $W(w_i)$, computed as the cosine similarity between the vector of word-topic probabilities and the vector of document-topic probabilities. The single PageRank then becomes:
$$R(w_i) = \lambda \sum_{j: w_j \rightarrow w_i} \frac{e(w_j, w_i)}{O(w_j)} R(w_j) + (1 - \lambda) \frac{W(w_i)}{\sum_{w \in V} W(w)}$$
where $V$ is the set of graph nodes. Moreover, in a related study, [DBLP:conf/www/SterckxDDD15a] propose the utilization of multiple topic models for the keyphrase extraction task, showing the benefit of a combination of models. Particularly, models trained on different corpora disagree, as the contexts differ between the corpora. This leads to different topic models and to disagreement about word importance. This disagreement is leveraged by computing a combined topical word importance value, which is used as a weight in a Topical PageRank, improving the performance in cases where the topic models differ substantially.
In this spirit, Salience Rank [DBLP:conf/acl/TenevaC17] is quite close to Single TPR, as it runs PageRank only once, incorporating a word metric called word salience $S(w)$, which is a linear combination of the topic specificity [DBLP:conf/avi/ChuangMH12] and the corpus specificity of a word (the latter can be calculated by counting word frequencies in a specific corpus). Intuitively, topic specificity measures how much a word is shared across topics (the less the word is shared across topics, the higher its topic specificity). Users can balance the topic and corpus specificity of the extracted keyphrases, tuning the results to particular use cases. The single PageRank thus becomes:
$$R(w_i) = \lambda \sum_{j: w_j \rightarrow w_i} \frac{e(w_j, w_i)}{O(w_j)} R(w_j) + (1 - \lambda)\, S(w_i)$$
An interesting and probably effective direction in LDA-based methods would be the utilization of a topic model similar to the one proposed by [DBLP:journals/pvldb/El-KishkySWVH14]. Indeed, most topic modeling algorithms model text corpora with unigrams, whereas human interpretation often relies on inherent grouping of terms into phrases. Particularly, [DBLP:journals/pvldb/El-KishkySWVH14] first propose an efficient phrase mining technique to extract frequent significant phrases and segment the text at the same time, which uses frequent phrase mining and a statistical significance measure. Then, they introduce a simple but effective topic model that restricts all constituent terms within a phrase to share the same latent topic, and assigns the phrase to the topic of its constituent words. Besides, the first part of the method (phrase mining) could be exploited to filter out false candidate keyphrases in the context of a keyphrase extraction pipeline.
2.2.3 Graph-based Methods with Semantics
Semantics from Knowledge Graphs/Bases
The main problem of the topic-based methods is that the topics are often too general and vague. In addition, the co-occurrence-based methods suffer from information loss, i.e., if two words never co-occur within a window in a document, there will be no edge to connect them in the corresponding graph-of-words, even though they may be semantically related, whereas the statistics-based methods suffer from information overload, i.e., the real meanings of words in the document may be overwhelmed by the large amount of external text used for the computation of statistical information. To deal with such problems and incorporate semantics into keyphrase extraction, [DBLP:journals/dase/ShiZYCZ17] propose a keyphrase extraction system that uses knowledge graphs. First, nouns and named entities (keyterms) are selected and grouped by semantic similarity through clustering. Then, the keyterms of each cluster are connected to entities of DBpedia. For each cluster, the relations between the keyterms are detected by extracting the $h$-hop keyterm graph from the knowledge graph, i.e., the subgraph of DBpedia that includes all paths of length no longer than $h$ between two different nodes of the cluster. Then, all the extracted keyterm graphs of the clusters are integrated into one, and a Personalized PageRank (PPR) [DBLP:conf/www/Haveliwala02] is applied to it to get the ranking score of each keyterm. The final ranking scheme of the candidate phrases uses the PPR score of a phrase, which is the sum of the PPR scores of the keyterms in it, as well as the frequency and first occurrence position of the phrase. This method could also be categorized with the topic-based methods that use clustering (see Table 1).
Similarly, [wikirank2018yu] propose WikiRank, an unsupervised automatic keyphrase extraction method that tries to link semantic meaning to text. First, they use TAGME [WikiRankFerragina2010], a tool for topic/concept annotation that detects meaningful text phrases and matches them to a relevant Wikipedia page. Additionally, they extract as candidate keyphrases the noun groups whose pattern is zero or more adjectives followed by one or more nouns. Then, a semantic graph $G$ is built whose vertex set is the union of the concept set and the candidate keyphrase set. If a candidate keyphrase contains a concept, according to the annotation of TAGME, an edge is added between the corresponding nodes. The weight $w(c)$ of a concept $c$ is equal to the frequency of the concept in the full-text document. Moreover, they propose the score of a concept $c$ in a subgraph $G'$ of $G$ to be:
$$score(c) = \sum_{i=1}^{d_{G'}(c)} \left(\frac{1}{2}\right)^{i-1} w(c)$$
where $w(c)$ is the weight of the concept $c$, and $d_{G'}(c)$ is the degree of $c$ in the corresponding subgraph, so that each additional candidate covering the same concept contributes with diminishing returns. The goal is to find the candidate keyphrase set with the best coverage, i.e., the one that maximizes the sum of the scores of the concepts annotated from the phrases in the set.
Semantics from Pretrained Word Embeddings
Although the methods that utilize semantics from knowledge graphs/bases have shown improvements, the keyphrase extraction process requires more background knowledge than just semantic relation information. Thus, [Wang2014] propose a graph-based ranking model that considers semantic information coming from distributed word representations as background knowledge. Again, a graph of words is created, with edges representing the co-occurrence of words within a window of $M$ consecutive words. Then, a weight called the word attraction score is assigned to every edge; it is the product of two individual scores: (a) the attraction force between two words, which uses the frequencies of the words and the Euclidean distance between the corresponding word embeddings, and (b) the dice coefficient [dice1945measures, Stubbs2003], which measures the probability that two words co-occur as a pattern rather than by chance. Particularly, given a document $d$ as a sequence of words, the dice coefficient is computed as:
$$dice(w_i, w_j) = \frac{2 \times freq(w_i, w_j)}{freq(w_i) + freq(w_j)}$$
where $freq(w_i, w_j)$ is the co-occurrence frequency of words $w_i$ and $w_j$, and $freq(w_i)$ and $freq(w_j)$ are the occurrence frequencies of $w_i$ and $w_j$ in $d$. Once more, a weighted PageRank algorithm is utilized to rank the words. [DBLP:conf/adc/WangLM15] propose an improved method that uses a personalized weighted PageRank model with pretrained word embeddings and more effective edge weights. Particularly, the strength of the relation between a pair of words is calculated as the product of their semantic relatedness and a local co-occurrence coefficient:
$$strength(w_i, w_j) = srs(w_i, w_j) \times PMI(w_i, w_j)$$
where $srs(w_i, w_j)$ is the semantic relatedness, computed from the cosine similarity between the corresponding vectors, and the point-wise mutual information $PMI(w_i, w_j)$ is used as the local co-occurrence coefficient. Finally, the score of a word $w_i$ is calculated as follows:
$$R(w_i) = \lambda \sum_{w_j \in Adj(w_i)} \frac{strength(w_j, w_i)}{\sum_{w_k \in Adj(w_j)} strength(w_j, w_k)} R(w_j) + (1 - \lambda)\, pr(w_i)$$
where $strength(w_j, w_i)$ is the strength of relatedness score between the two words calculated previously, $pr(w_i)$ is the probability distribution of the word, calculated as $freq(w_i)/n$ (with $freq(w_i)$ the occurrence frequency and $n$ the total number of words), and $Adj(w_i)$ is the set of vertices incident to $w_i$.
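The dice coefficient can be computed directly from a token sequence; the window-based co-occurrence counting below is a simplifying assumption for illustration:

```python
from collections import Counter

def dice_coefficients(tokens, window=2):
    """Dice coefficient for every pair of words co-occurring within a window:
    2 * freq(wi, wj) / (freq(wi) + freq(wj)).

    tokens: the document as a list of words; window: co-occurrence span.
    """
    freq = Counter(tokens)
    cofreq = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            cofreq[tuple(sorted((tokens[i], tokens[j])))] += 1
    return {pair: 2.0 * c / (freq[pair[0]] + freq[pair[1]])
            for pair, c in cofreq.items()}
```

Pairs that always appear together receive a coefficient of 1, while pairs that co-occur only occasionally relative to their individual frequencies score close to 0.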
However, [DBLP:conf/adc/WangLM15] do not use domain-specific word embeddings, although they note that training such embeddings might lead to improvements. This motivated [key2vec2018Mahata] to present Key2Vec, an unsupervised keyphrase extraction technique for scientific articles that represents the candidate keyphrases of a document by domain-specific phrase embeddings and ranks them using a theme-weighted PageRank algorithm [pagerankLangville2003]. After exhaustive text preprocessing on a corpus of scientific abstracts, which is well described in their work, Fasttext [DBLP:journals/tacl/BojanowskiGJM17] is utilized for training multiword phrase embeddings. First, the same text preprocessing is applied to the target document in order to get a set of unique candidate keyphrases. Then, a theme excerpt, i.e., the first sentence(s), is extracted from the document. Afterwards, a unique set of thematic phrases, i.e., named entities, noun phrases and unigram words, is also extracted from the theme excerpt. Next, the vector representation of each thematic phrase is obtained from the trained phrase embedding model, and vector addition is performed to get the final theme vector. The phrase embedding model is also used to get the vector representation of each candidate keyphrase. Then, the cosine distance between the theme vector and the vector of each candidate keyphrase is calculated, assigning a score (thematic weight) to each candidate. Next, a directed graph is constructed with the candidate keyphrases as vertices. Two candidate keyphrases are connected if they co-occur within a window of size 5 (bidirectional edges are used). In addition, edge weights are calculated using the semantic similarity between the candidate keyphrases obtained from the phrase embedding model and their frequency of co-occurrence, as in [DBLP:conf/adc/WangLM15]. For the final ranking of the candidate keyphrases, a weighted personalized PageRank algorithm is used.
2.3 Keyphrase Extraction based on Embeddings
Many methods for the representation of words have been proposed. Representative techniques based on a co-occurrence matrix are Latent Dirichlet Allocation (LDA) [blei2003latent] and Latent Semantic Analysis (LSA) [deerwester1990indexing]. However, word embeddings came to the foreground with [DBLP:journals/corr/abs-1301-3781], who presented the popular Continuous Bag-of-Words (CBOW) and Continuous Skip-gram models. Additionally, sentence embeddings (Doc2Vec [DBLP:conf/rep4nlp/LauB16] or Sent2vec [DBLP:conf/naacl/PagliardiniGJ18]) as well as the popular GloVe (Global Vectors) [Pennington14glove:global] method are utilized by keyphrase extraction methods.
EmbedRank [DBLP:journals/corr/abs-1801-04470] extracts candidate phrases based on POS sequences (phrases consisting of zero or more adjectives followed by one or more nouns). EmbedRank uses sentence embeddings (Doc2Vec or Sent2vec) to represent both the candidate phrases and the document in the same high-dimensional vector space. Finally, the system ranks the candidate phrases by the cosine similarity between the embedding of each candidate phrase and the document embedding.
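The ranking step of EmbedRank reduces to a cosine-similarity comparison in the shared vector space. A minimal sketch, assuming the document and candidate embeddings (e.g., from Sent2vec) are already computed:

```python
import numpy as np

def embedrank_score(doc_vec, candidate_vecs):
    """Rank candidate phrases by the cosine similarity of their
    embedding to the document embedding; `candidate_vecs` maps each
    candidate phrase to its sentence-embedding vector."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    scores = {c: cos(doc_vec, v) for c, v in candidate_vecs.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```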
Moreover, [papagiannopoulou2018local] present the Reference Vector Algorithm (RVA), a keyphrase extraction method whose main innovation is the use of local word embeddings/semantics (in particular, GloVe vectors), i.e., embeddings trained on the single document under consideration. The local training of GloVe on a single document and the graph-based family of methods can be considered two alternative views of the same information source, as both utilize the statistics of word-word co-occurrence in a text. After training, the mean (reference) vector of the words in the document's title and abstract is computed; this mean vector is a vector representation of the semantics of the whole publication. Finally, candidate keyphrases are extracted from the title and abstract and ranked by their cosine similarity to the reference vector, under the assumption that the closer a word vector is to the reference vector, the more representative the corresponding word is for the publication.
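The aggregation and ranking steps of RVA can be sketched as follows; the local GloVe training itself is assumed already done, so `word_vecs` simply maps each word of the document to its locally trained vector:

```python
import numpy as np

def rva_rank(title_abstract_tokens, word_vecs):
    """Compute the mean (reference) vector of the title/abstract tokens
    and score every word by cosine similarity to it."""
    vecs = [word_vecs[t] for t in title_abstract_tokens if t in word_vecs]
    reference = np.mean(vecs, axis=0)   # semantics of the whole publication
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return sorted(((w, cos(v, reference)) for w, v in word_vecs.items()),
                  key=lambda kv: -kv[1])
```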
2.4 Language Model-based Methods
Language modeling plays an important role in natural language processing tasks [DBLP:journals/csl/ChenG99]. Generally, an $n$-gram language model assigns a probability value to every sequence of words $w_1, \ldots, w_m$, i.e., the probability $P(w_1, \ldots, w_m)$ can be decomposed as
$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}).$$
For example, a trigram language model is the following:
$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1}).$$
Apart from $n$-gram language models, there are also various other types of models, such as the popular neural language models that use neural networks to learn the context of words [DBLP:journals/corr/abs-1803-08240].
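A count-based maximum-likelihood estimate of the trigram conditional probability above can be sketched in a few lines:

```python
from collections import Counter

def train_trigram_lm(tokens):
    """Maximum-likelihood trigram model: P(w_i | w_{i-2}, w_{i-1}) is
    estimated as count(w_{i-2}, w_{i-1}, w_i) / count(w_{i-2}, w_{i-1})."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    def prob(word, history):        # history = (w_{i-2}, w_{i-1})
        if bi[history] == 0:
            return 0.0              # unseen history (no smoothing here)
        return tri[(history[0], history[1], word)] / bi[history]
    return prob
```

In practice such models are combined with smoothing (e.g., the techniques studied in [DBLP:journals/csl/ChenG99]) to handle unseen $n$-grams.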
For keyphrase extraction with $n$-gram language models, [tomokiyo2003language] create both unigram and $n$-gram language models on a foreground corpus (the target document) and a background corpus (a document set). Their main idea is that the loss between two language models can be measured using the Kullback-Leibler divergence. In particular, at the phrase level, the phraseness of each phrase is computed as the divergence between the unigram and $n$-gram language models on the foreground corpus, and its informativeness is calculated as the divergence between the $n$-gram language models on the foreground and the background corpus. The phraseness and informativeness are then summed into a final score for each phrase, and phrases are ranked according to this score.
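The per-phrase score can be illustrated with its pointwise KL-divergence terms. This is a hedged sketch, not the authors' exact formulation: the three probabilities are assumed to be given by the respective language models, with `p_fg_unigrams` being the phrase probability under the foreground unigram model (the product of its word probabilities) and the other two its probabilities under the foreground and background $n$-gram models.

```python
import math

def phrase_score(p_fg_ngram, p_fg_unigrams, p_bg_ngram):
    """Sum of the pointwise KL contributions of a single phrase:
    phraseness compares the foreground n-gram and unigram models,
    informativeness compares the foreground and background n-gram models."""
    phraseness = p_fg_ngram * math.log(p_fg_ngram / p_fg_unigrams)
    informativeness = p_fg_ngram * math.log(p_fg_ngram / p_bg_ngram)
    return phraseness + informativeness
```

A phrase whose words co-occur far more often than chance (high phraseness) and which is much more frequent in the document than in the background corpus (high informativeness) receives the highest score.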
3 Supervised Methods
In this section, we present traditional supervised methods (Section LABEL:sec:traditional_sup_methods) as well as deep learning methods (Section LABEL:sec:deep_learning_methods), along with the main categories of features that are used (Section LABEL:sec:types_features). Section LABEL:sec:exp_comparison_supervised discusses the performance of some earlier, state-of-the-art, and more recent supervised methods on three popular keyphrase extraction datasets. Finally, we discuss the main problems of the supervised learning methods along with the proposed solutions (Section LABEL:imbalanced). Figure 2 shows the presentation structure of the supervised keyphrase extraction methods, which is also consistent with their corresponding taxonomy.