Retrieval-efficiency trade-off of Unsupervised Keyword Extraction

08/15/2022
by Blaž Škrlj, et al.

Efficiently identifying keyphrases that represent a given document is a challenging task. In recent years, a plethora of keyword detection approaches have been proposed. These approaches can be based on statistical (frequency-based) properties of, e.g., tokens, on specialized neural language models, or on a graph-based structure derived from a given document. Graph-based methods can be among the most computationally efficient while maintaining retrieval performance. One of the main properties common to graph-based methods is that they convert the token space into a graph immediately and perform all subsequent processing on that graph. In this paper, we explore a novel unsupervised approach which merges parts of a document in sequential form prior to construction of the token graph. Further, by leveraging personalized PageRank, which considers the frequencies of such sub-phrases alongside token lengths during node ranking, we demonstrate state-of-the-art retrieval capabilities while being up to two orders of magnitude faster than current state-of-the-art unsupervised detectors such as YAKE and MultiPartiteRank. The proposed method's scalability was also demonstrated by computing keyphrases for a biomedical corpus comprising 14 million documents in less than a minute.
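The ranking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the adjacent-token graph, the choice of frequency times token length as the personalization weight, and all function names (`tokenize`, `build_graph`, `personalized_pagerank`, `extract_keywords`) are assumptions made for illustration, and the sequential merging of sub-phrases before graph construction is omitted.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase alphabetic tokens; a deliberately naive tokenizer (assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_graph(tokens):
    """Undirected graph whose edge weights count adjacent-token co-occurrences."""
    graph = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops from repeated tokens
            graph[left][right] += 1
            graph[right][left] += 1
    return graph


def personalized_pagerank(graph, personalization, damping=0.85, iterations=50):
    """Power iteration with a non-uniform teleport distribution."""
    nodes = list(graph)
    total = sum(personalization[n] for n in nodes)
    teleport = {n: personalization[n] / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Teleport mass first, then mass propagated along weighted edges.
        new_rank = {n: (1 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            out_weight = sum(graph[node].values())
            for neighbor, weight in graph[node].items():
                new_rank[neighbor] += damping * rank[node] * weight / out_weight
        rank = new_rank
    return rank


def extract_keywords(text, top_k=5):
    """Rank tokens, biasing the walk by frequency * token length (assumption)."""
    tokens = tokenize(text)
    graph = build_graph(tokens)
    counts = Counter(tokens)
    personalization = {t: counts[t] * len(t) for t in graph}
    scores = personalized_pagerank(graph, personalization)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Calling `extract_keywords(document_text, top_k=10)` returns the ten highest-scoring tokens. The teleport distribution is what makes the PageRank "personalized": frequent, longer tokens receive more restart probability than they would under uniform teleportation.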

research
07/15/2019

RaKUn: Rank-based Keyword extraction via Unsupervised learning and Meta vertex aggregation

Keyword extraction is used for summarizing the content of a document and...
research
08/21/2020

Keywords lie far from the mean of all words in local vector space

Keyword extraction is an important document process that aims at finding...
research
04/16/2021

Back to the Basics: A Quantitative Analysis of Statistical and Graph-Based Term Weighting Schemes for Keyword Extraction

Term weighting schemes are widely used in Natural Language Processing an...
research
04/04/2023

Rethinking the Role of Token Retrieval in Multi-Vector Retrieval

Multi-vector retrieval models such as ColBERT [Khattab and Zaharia, 2020...
research
05/01/2020

HipoRank: Incorporating Hierarchical and Positional Information into Graph-based Unsupervised Long Document Extractive Summarization

We propose a novel graph-based ranking model for unsupervised extractive...
research
01/17/2019

Unsupervised Graph-based Rank Aggregation for Improved Retrieval

This paper presents a robust and comprehensive graph-based rank aggregat...
research
11/02/2020

Biased TextRank: Unsupervised Graph-Based Content Extraction

We introduce Biased TextRank, a graph-based content extraction method in...
