Rank-based Unsupervised Keyword Extraction via Metavertex Aggregation
Keyword extraction is used for summarizing the content of a document and supports efficient document retrieval, making it an indispensable part of modern text-based systems. We explore how load centrality, a graph-theoretic measure, applied to graphs derived from a given text, can be used to efficiently identify and rank keywords. By introducing meta vertices (aggregates of existing vertices) and systematic redundancy filters, the proposed method performs on par with the state of the art for the keyword extraction task on 14 diverse datasets. The proposed method is unsupervised and interpretable, and can also be used for document visualization.
Keywords are terms (i.e., expressions) that best describe the subject of a document. A good keyword effectively summarizes the content of the document and allows it to be efficiently retrieved when needed. Traditionally, keyword assignment was a manual task, but with the emergence of large amounts of textual data, automatic keyword extraction methods have become indispensable. Despite considerable effort from the research community, state-of-the-art keyword extraction algorithms leave much to be desired, and their performance is still lower than on many other core NLP tasks. The first keyword extraction methods mostly followed a supervised approach [14, 24, 31]: they first extract keyword features and then train a classifier on a gold standard dataset. For example, KEA, a state-of-the-art supervised keyword extraction algorithm, is based on the Naive Bayes machine learning algorithm. While these methods offer quite good performance, they rely on an annotated gold standard dataset and require a (relatively) long training process. In contrast, unsupervised approaches need no training and can be applied directly, without relying on a gold standard document collection. They can be further divided into statistical and graph-based methods. The former, such as YAKE [7, 6], KP-MINER and RAKE, use statistical characteristics of the texts to capture keywords, while the latter, such as Topic Rank, TextRank, Topical PageRank and Single Rank, build graphs to rank words based on their position in the graph. Among statistical approaches, the state-of-the-art keyword extraction algorithm is YAKE [7, 6], which is also one of the best performing keyword extraction algorithms overall; it defines a set of five features capturing keyword characteristics, which are heuristically combined to assign a single score to every keyword. On the other hand, among graph-based approaches, Topic Rank can be considered state-of-the-art; candidate keywords are clustered into topics and used as vertices in the final graph, which is used for keyword extraction. Next, a graph-based ranking model is applied to assign a significance score to each topic, and keywords are generated by selecting a candidate from each of the top-ranked topics. Network-based methodology has also been successfully applied to the task of topic extraction.
The method that we propose in this paper, RaKUn, is a graph-based keyword extraction method. We exploit ideas from the area of graph aggregation-based learning, where, for example, graph convolutional neural networks and similar approaches were shown to yield high-quality vertex representations by aggregating the feature space of their neighborhoods. This work implements similar ideas (albeit not in a neural network setting), aggregating redundant information into meta vertices in a comparable manner. Similar efforts were shown to be useful for hierarchical subnetwork aggregation in sensor networks and in biological use cases for the simulation of large proteins.
The main contributions of this paper are as follows. First, the notion of load centrality has, to our knowledge, not yet been sufficiently exploited for keyword extraction; we show that this fast measure offers performance competitive with other widely used centralities, such as the PageRank centrality. Second, to our knowledge, this work is the first to introduce the notion of meta vertices with the aim of aggregating similar vertices, following ideas similar to those of the statistical method YAKE, which is considered state-of-the-art for keyword extraction. Third, as part of the proposed RaKUn algorithm, we extend the extraction from unigrams to bigram and trigram keywords, based on the load centrality scores computed for the considered tokens. Last but not least, we demonstrate how arbitrary textual corpora can be transformed into weighted graphs whilst maintaining global sequential information, offering the opportunity to exploit potential context not naturally present in statistical methods.
The paper is structured as follows. We first present the text to graph transformation approach (Section 2), followed by the introduction of the RaKUn keyword extractor (Section 3). We continue with qualitative evaluation (Section 4) and quantitative evaluation (Section 5), before concluding the paper in Section 6.
We first discuss how the texts are transformed to graphs, on which RaKUn operates. Next, we formally state the problem of keyword extraction and discuss its relation to graph centrality metrics.
In this work we consider directed graphs. Let G = (V, E) represent a graph comprised of a set of vertices V and a set of edges E ⊆ V × V, which are ordered pairs. Further, each edge can have a real-valued weight assigned. Let D = (t_1, t_2, ..., t_n) represent a document comprised of tokens t_i. The order in which tokens appear in the text is known, thus D is a totally ordered set. A potential way of constructing a graph from a document is by simply observing word co-occurrences: when two words co-occur, they form an edge. However, such approaches do not take into account the sequential nature of the words, meaning that the order is lost. We attempt to take this aspect into account as follows. The given corpus is traversed, and for each token t_i, its successor t_{i+1}, together with the given token, forms a directed edge (t_i, t_{i+1}) ∈ E. Finally, such edges are weighted according to the number of times they appear in the given corpus. Thus the graph, constructed after traversing a given corpus, consists of all local neighborhoods (of order one), merged into a single joint structure. Global contextual information is potentially kept intact (via the weights), even though it needs to be detected via network analysis, as proposed next.
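The construction above can be sketched in a few lines of Python. This is a minimal illustration using the networkx library; the function and variable names are ours, not taken from the reference implementation:

```python
from collections import Counter

import networkx as nx

def text_to_graph(tokens):
    """Build a weighted directed graph from an ordered token sequence:
    each consecutive pair (t_i, t_{i+1}) becomes a directed edge,
    weighted by the number of times the pair occurs."""
    graph = nx.DiGraph()
    for (u, v), count in Counter(zip(tokens, tokens[1:])).items():
        graph.add_edge(u, v, weight=count)
    return graph

tokens = "the cat sat on the mat while the cat slept".split()
g = text_to_graph(tokens)
print(g["the"]["cat"]["weight"])  # "the cat" occurs twice -> weight 2
```

Because repeated pairs only increase an edge weight, the graph stays compact while still encoding how often each local transition occurs.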
A naïve approach to constructing a graph, as discussed in the previous section, commonly yields noisy graphs, rendering learning tasks harder. Therefore, we next discuss the selected approaches we employ in order to reduce both the computational complexity and the spatial complexity of constructing the graph, as well as increasing its quality (for the given down-stream task).
First, we consider the following heuristics which reduce the complexity of the graph that we construct for keyword extraction: considered token length (while traversing the document D, only tokens whose length exceeds a given threshold are considered), and lemmatization (tokens can be lemmatized, offering spatial benefits and avoiding redundant vertices in the final graph). The two modifications yield a potentially "simpler" graph, which is more suitable and faster for mining.
Even if the optional lemmatization step is applied, one can still aim at further reducing the graph complexity by merging similar vertices. This step is called meta vertex construction. The motivation can be explained by the fact that even similar lemmas can be mapped to the same keyword (e.g., mechanic and mechanical; normal and abnormal). This step also captures spelling errors (similar vertices that would not be handled by lemmatization), spelling differences (e.g., British vs. American English), non-standard writing (e.g., in Twitter data), mistakes in lemmatization, or an unavailable or omitted lemmatization step.
The meta-vertex construction step works as follows. Let V represent the set of vertices, as defined above. A meta vertex M is comprised of a set of vertices that are elements of V, i.e., M ⊆ V. Let M_i denote the i-th meta vertex. We construct each M_i so that, for each vertex u ∈ M_i, u's initial edges (prior to merging it into the meta vertex) are rewired to the newly added M_i. Note that such edges connect to vertices which are not a part of M_i. Thus, both the number of vertices and the number of edges are reduced substantially. This feature is implemented via the following procedure:
Meta vertex candidate identification. Edit distance and word length difference are used to determine whether two words should be merged into a meta vertex (the more expensive edit distance is computed only if the word length difference threshold is met).
Meta vertex creation. As common identifiers, we use the stemmed versions of the original vertices; if there is more than one resulting stem, we select, from the identified candidates, the vertex with the highest centrality value in the graph, and its stemmed version is introduced as a new vertex (the meta vertex).
The edges of the words entailed in the meta vertex are next rewired to the meta vertex.
The two original words are removed from the graph.
The procedure is repeated for all candidate pairs.
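The steps above can be sketched as follows. This is a simplified illustration, not the reference implementation: the threshold values mirror the defaults reported in Section 5 (word length difference 3, edit distance 2), and all function names are ours:

```python
import networkx as nx

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def merge_candidates(u, v, length_threshold=3, edit_threshold=2):
    """Cheap length gate first; the costlier edit distance only if it passes."""
    if abs(len(u) - len(v)) > length_threshold:
        return False
    return edit_distance(u, v) <= edit_threshold

def merge_into_meta_vertex(graph, u, v, meta):
    """Rewire the edges of u and v to the meta vertex, then drop the
    originals; edges internal to {u, v} are discarded."""
    for old in (u, v):
        for pred, _, data in list(graph.in_edges(old, data=True)):
            if pred not in (u, v):
                graph.add_edge(pred, meta, **data)
        for _, succ, data in list(graph.out_edges(old, data=True)):
            if succ not in (u, v):
                graph.add_edge(meta, succ, **data)
    graph.remove_nodes_from([u, v])

g = nx.DiGraph()
g.add_edge("a", "mechanic", weight=1)
g.add_edge("mechanical", "b", weight=2)
if merge_candidates("mechanic", "mechanical"):
    merge_into_meta_vertex(g, "mechanic", "mechanical", "mechanic*")
print(sorted(g.nodes()))  # ['a', 'b', 'mechanic*']
```

Running the length gate before the edit distance matters in practice: the gate is O(1) per pair, while the dynamic program is quadratic in word length.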
A schematic representation of meta vertex construction is shown in Figure 1. The yellow and blue groups of vertices both form a meta vertex, the resulting (right) graph is thus substantially reduced, both with respect to the number of vertices, as well as the number of edges.
Up to this point, we discussed how the graph used for keyword extraction is constructed. In this work, we exploit the notion of load centrality, a fast measure for estimating the importance of vertices in graphs. This metric can be defined as follows.
The load centrality of a vertex v falls under the family of centralities defined via the number of shortest paths that pass through a given vertex, i.e., c(v) = Σ_{s ≠ t ≠ v} σ_{s,t}(v) / σ_{s,t}, where σ_{s,t}(v) represents the number of shortest paths that pass from vertex s to vertex t via v, and σ_{s,t} the number of all shortest paths between s and t (see [4, 11]). The considered load centrality measure is subtly different from the better known betweenness centrality; specifically, it is assumed that each vertex sends a package to each other vertex to which it is connected, with routing based on a priority system: given an input of flow x arriving at vertex v with destination v', x is divided equally among all neighbors of v on a minimum shortest path to the target. The total flow passing through a given v via this process is defined as v's load. Load centrality thus maps from the set of vertices to real values. A detailed description and a computational complexity analysis can be found in the cited literature. Intuitively, vertices of the graph with the highest load centrality represent key vertices in a given network. In this work, we assume such vertices are good descriptors of the input document (i.e., keywords); thus, ranking the vertices yields a priority list of (potential) keywords.
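Load centrality is readily available in off-the-shelf graph libraries. A minimal illustration with networkx (the graph here is a toy example, not taken from the paper):

```python
import networkx as nx

# On the undirected path graph 0-1-2-3-4, the middle vertex lies on the
# most shortest paths, so it receives the highest load centrality score.
g = nx.path_graph(5)
load = nx.load_centrality(g)
ranked = sorted(load, key=load.get, reverse=True)
print(ranked[0])  # 2, the middle of the path
```

Ranking vertices by this score, as in the last sentence above, is exactly the `sorted(..., key=load.get, reverse=True)` step.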
We next discuss how the considered centrality is used as part of the whole keyword extraction algorithm RaKUn, summarized in Algorithm 1.
The algorithm consists of three main steps, described next. First, a graph is constructed from a given ordered set of tokens (e.g., a document) and meta vertices are formed (lines 1 to 8). The initial graph is commonly very sparse, as most words rarely co-occur; the result of this step is a smaller, denser graph, where both the number of vertices and the number of edges are lower. Once the graph is constructed, load centrality is computed for each vertex (line 10). Note that at this point, should the top vertices by centrality be considered, only single-term keywords would emerge. As can be seen from line 11, to extend the selection to 2-grams and 3-grams, the following procedure is proposed:
Keywords comprised of two terms are constructed as follows. First, pairs of first order keywords (all tokens) are counted. If the support (= number of occurrences) is higher than the count threshold (line 11 in Algorithm 1), the token pair is considered a potential 2-gram keyword. The load centralities of the two tokens are averaged, i.e., the score is (c(t_1) + c(t_2)) / 2, and the obtained keywords are considered for final selection along with the computed ranks.
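A sketch of this 2-gram step (the count threshold and centrality values below are illustrative, and the function name is ours):

```python
from collections import Counter

def bigram_keywords(tokens, centrality, count_threshold=2):
    """Keep token pairs whose support reaches the threshold and score
    each pair by the average load centrality of its two members."""
    support = Counter(zip(tokens, tokens[1:]))
    return {
        pair: (centrality.get(pair[0], 0.0) + centrality.get(pair[1], 0.0)) / 2
        for pair, count in support.items()
        if count >= count_threshold
    }

tokens = "data mining tools support data mining tasks".split()
centrality = {"data": 0.4, "mining": 0.6, "tools": 0.1,
              "support": 0.1, "tasks": 0.1}
print(bigram_keywords(tokens, centrality))  # {('data', 'mining'): 0.5}
```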
For the construction of 3-gram keywords, we follow an idea similar to that for bigrams. The obtained 2-gram keywords (previous step) are further explored as follows. For each candidate 2-gram keyword, we consider two extension scenarios. Extending the 2-gram from the left side: here, the in-neighborhood of the left token is considered as a potential extension to a given keyword; ranks of such candidates are computed by averaging the centrality scores in the same manner as done for the 2-gram case. Extending the 2-gram from the right side: the difference with the previous scenario is that all outgoing connections of the rightmost vertex are considered as potential extensions. The candidate keywords are ranked, as before, by averaging the load centralities of the three tokens, i.e., (c(t_1) + c(t_2) + c(t_3)) / 3.
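The two extension scenarios can be sketched as follows (illustrative code; the graph and scores are toy values, and the function name is ours):

```python
import networkx as nx

def extend_bigram(graph, centrality, bigram):
    """Generate 3-gram candidates from a 2-gram: prepend in-neighbors of
    the left token, append out-neighbors of the right token, and rank each
    candidate by the average load centrality of its three tokens."""
    left, right = bigram
    candidates = {}
    for pred in graph.predecessors(left):       # left-side extension
        tri = (pred, left, right)
        candidates[tri] = sum(centrality[t] for t in tri) / 3
    for succ in graph.successors(right):        # right-side extension
        tri = (left, right, succ)
        candidates[tri] = sum(centrality[t] for t in tri) / 3
    return candidates

g = nx.DiGraph([("deep", "neural"), ("neural", "network"),
                ("network", "model")])
scores = {"deep": 0.3, "neural": 0.6, "network": 0.6, "model": 0.3}
print(extend_bigram(g, scores, ("neural", "network")))
# {('deep', 'neural', 'network'): 0.5, ('neural', 'network', 'model'): 0.5}
```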
Having obtained a set of (keyword, score) pairs, we finally sort the set according to the scores (descendingly) and take the top keywords as the result. We next discuss the evaluation of the proposed algorithm.
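The final selection then reduces to a sort (a trivial sketch; the name and values are ours):

```python
def top_keywords(scored, k=10):
    """Sort (keyword, score) pairs by score, descending, and keep the top k."""
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:k]

scores = {"graph": 0.9, "centrality": 0.7, "random": 0.1}
print(top_keywords(scores, k=2))  # [('graph', 0.9), ('centrality', 0.7)]
```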
RaKUn can be used also for visualization of keywords in a given document or document corpus. A visualization of extracted keywords is applied to an example from wiki20  (for dataset description see Section 5.1), where we visualize both the global corpus graph, as well as a local (document) view where keywords are emphasized, see Figures 2 and 3, respectively. It can be observed that the global graph’s topology is far from uniform — even though we did not perform any tests of scale-freeness, we believe the constructed graphs are subject to distinct topologies, where keywords play prominent roles.
This section discusses the experimental setting used to validate the proposed RaKUn approach against state-of-the-art baselines. We first describe the datasets, and continue with the presentation of the experimental setting and results.
For the RaKUn evaluation, we used 14 gold standard datasets from the list of [7, 6], from which we selected the datasets in English. Detailed dataset descriptions and statistics can be found in Table 1, while the full statistics and files for download can be found online (https://github.com/LIAAD/KeywordExtractor-Datasets). Most datasets are from the domain of computer science or span multiple domains. They are very diverse in terms of the number of documents (ranging from 20 in wiki20 to 2,000 in Inspec), in terms of the average number of gold standard keywords per document (from 5.07 in kdd to 48.92 in 500N-KPCrowd-v1.1), and in terms of the average document length (from 75.97 in kdd to 8332.34 in SemEval2010).
| Dataset | Desc. | No. docs | Avg. keywords | Avg. doc length |
| 500N-KPCrowd-v1.1 | Broadcast news transcriptions | 500 | 48.92 | 408.33 |
| Inspec | Scientific journal papers from Computer Science collected between 1998 and 2002 | 2000 | 14.62 | 128.20 |
| Nguyen2007 | Scientific conference papers | 209 | 11.33 | 5201.09 |
| PubMed | Full-text papers collected from PubMed Central | 500 | 15.24 | 3992.78 |
| Schutz2008 | Full-text papers collected from PubMed Central | 1231 | 44.69 | 3901.31 |
| SemEval2010 | Scientific papers from the ACM Digital Library | 243 | 16.47 | 8332.34 |
| SemEval2017 | 500 paragraphs selected from 500 ScienceDirect journal articles, evenly distributed among the domains of Computer Science, Material Sciences and Physics | 500 | 18.19 | 178.22 |
| citeulike180 | Full-text papers from CiteULike.org | 180 | 18.42 | 4796.08 |
| fao30 | Agricultural documents from two datasets based on the Food and Agriculture Organization (FAO) of the UN | 30 | 33.23 | 4777.70 |
| fao780 | Agricultural documents from two datasets based on the Food and Agriculture Organization (FAO) of the UN | 779 | 8.97 | 4971.79 |
| kdd | Abstracts from the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2004-2014 | 755 | 5.07 | 75.97 |
| theses100 | Full master and Ph.D. theses from the University of Waikato | 100 | 7.67 | 4728.86 |
| wiki20 | Computer science technical research reports | 20 | 36.50 | 6177.65 |
| www | Abstracts of WWW conference papers, 2004-2014 | 1330 | 5.80 | 84.08 |
We adopted the same evaluation procedure as used for the series of results recently introduced by the YAKE authors (we attempted to reproduce the YAKE evaluation procedure based on their experimental setup description, and thank the authors for additional explanations regarding the evaluation; for comparison of results we refer to their online repository at https://github.com/LIAAD/yake). Five-fold cross validation was used to determine the overall performance, for which we measured Precision, Recall and F1 score, with the latter reported in Table 2 (the complete results and the code are available at https://github.com/SkBlaz/rakun). Keywords were stemmed prior to evaluation, this being a standard procedure, as suggested by the authors of YAKE. As the number of keywords in a gold standard document is not necessarily equal to the number of extracted keywords k (in our experiments k = 10), for the recall we divide the number of correctly extracted keywords by k whenever the number of gold standard keywords is higher than k.
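The recall convention described above (capping the denominator at k when the gold standard contains more than k keywords) can be made precise as follows; this is a sketch of the metric, not the official evaluation script:

```python
def precision_recall_f1(extracted, gold, k=10):
    """Precision, recall and F1 at k; the recall denominator is capped
    at k when the gold standard holds more than k keywords."""
    extracted = extracted[:k]
    correct = len(set(extracted) & set(gold))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / min(len(gold), k) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

p, r, f = precision_recall_f1(["graph", "vertex", "noise"],
                              ["graph", "vertex", "edge", "path"], k=3)
print(round(f, 3))  # 0.667
```

Without the cap, a method extracting only k keywords could never reach full recall on documents with more than k gold standard keywords, which would unfairly depress its F1.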
Selecting the default configuration. First, we used a dedicated run to determine the default parameters. The cross validation was performed as follows. For each train-test dataset split, we kept the documents in the test fold intact, whilst performing a grid search on the train part to find the best parametrization. Finally, the selected configuration was used to extract keywords on the unseen test set. For each train-test split, we thus obtained the numbers of true and false positives and of true and false negatives, which were summed up and, after all folds were considered, used to obtain the final F1 scores that served for default parameter selection. The grid search was conducted over the following parameters: Num keywords; Num tokens (the number of tokens a keyword can consist of); Count threshold (minimum support used to determine potential bigram candidates); Word length difference threshold (maximum difference in word length used to determine whether a given pair of words shall be aggregated); Edit distance threshold (maximum edit distance allowed to consider a given pair of words for aggregation); Lemmatization (yes, no).
Even if one can use the described grid-search fine-tuning procedure to select the best setting for each individual dataset, we observed that in nearly all cases the best settings were the same. We therefore selected this configuration as the default, which can also be used on new, unlabeled data. The default parameter setting was as follows: the number of tokens was set to 1 (hence the Count threshold was not needed, as only unigrams are extracted); for meta vertex construction, the Word length difference threshold was set to 3 and the Edit distance threshold to 2; words were initially lemmatized. Next, we report the results using these selected parameters (the same across all datasets), by which we also test the general usefulness of the approach.
The results are presented in Table 2, where we report the F1 score with the default parameter setting of RaKUn, together with the results from related work, as reported in the github table of YAKE (https://github.com/LIAAD/yake/blob/master/docs/YAKEvsBaselines.jpg, accessed June 11, 2019).
| Dataset | RaKUn | YAKE | Single Rank | KEA | KP-MINER | Text Rank | Topic Rank | Topical PageRank |
We first observe that, on this selection of datasets, the proposed RaKUn wins on more datasets than any other method. We also see that it performs notably better on some of the datasets, whereas on the remainder it performs worse than state-of-the-art approaches. Such results demonstrate that the proposed method finds keywords differently, indicating that load centrality, combined with meta vertices, represents a promising research avenue. The datasets where the proposed method outperforms the current state-of-the-art results are 500N-KPCrowd-v1.1, Schutz2008, fao30 and wiki20. In addition, RaKUn also achieves competitive results on citeulike180. A look at the gold standard keywords in these datasets reveals that they contain many single-word units, which is why the default configuration (which returns unigrams only) was able to perform so well.
Four of these five datasets (500N-KPCrowd-v1.1, Schutz2008, fao30, wiki20) are also the ones with the highest average number of keywords per document (at least 33.23), while the fifth dataset (citeulike180) also has a relatively high value (18.42). Similarly, four of the five well-performing datasets (Schutz2008, fao30, citeulike180, wiki20) consist of long documents (more than 3,900 words), with the exception being 500N-KPCrowd-v1.1. For details, see Table 1. We observe that the proposed RaKUn outperforms the majority of other competitive graph-based methods. For example, the most similar variants, Topical PageRank and TextRank, do not perform as well on the majority of the considered datasets. Furthermore, RaKUn also outperforms KEA, a supervised keyword learner (note, e.g., the very large difference in performance on the 500N-KPCrowd-v1.1 and Schutz2008 datasets), indicating that unsupervised learning from the graph's structure offers a more robust keyword extraction method than training a classifier directly.
In this work we proposed RaKUn, a novel unsupervised keyword extraction algorithm which exploits the efficient computation of load centrality, combined with the introduction of meta vertices, which notably reduce corpus graph sizes. The method is fast and performs well compared to state-of-the-art approaches such as YAKE and graph-based keyword extractors. In further work, we will test the method on other languages. We also believe additional semantic background knowledge could be used to prune the graph's structure even further, and potentially introduce keywords that are not even present in the text. The proposed method does not attempt to exploit meso-scale graph structure, such as convex skeletons or communities, which are known to play prominent roles in real-world networks and could allow for vertex aggregation based on additional graph properties. We believe the proposed method could also be extended using Ollivier-Ricci flows on weighted graphs.
The work was supported by the Slovenian Research Agency through a young researcher grant [BŠ], core research programme (P2-0103), and projects Semantic Data Mining for Linked Open Data (N2-0078) and Terminology and knowledge frames across languages (J6-9372). This work was supported also by the EU Horizon 2020 research and innovation programme, Grant No. 825153, EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media).
Bougouin, A., Boudin, F., Daille, B.: Topicrank: Graph-based topic ranking for keyphrase extraction. In: International Joint Conference on Natural Language Processing (IJCNLP). pp. 543–551 (2013)
Gollapalli, S.D., Caragea, C.: Extracting keyphrases from research papers using citation networks. In: Twenty-Eighth AAAI Conference on Artificial Intelligence (2014)