Benchmarking Top-K Keyword and Top-K Document Processing with T^2K^2 and T^2K^2D^2

04/20/2018
by   Ciprian-Octavian Truică, et al.

Top-k keyword and top-k document extraction are very popular text analysis techniques. Top-k keywords and documents are often computed on the fly, but they rely on weighted vocabularies that are costly to build. Benchmarking is the customary way to compare competing weighting schemes and database implementations. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present T^2K^2, a top-k keywords and documents benchmark, and its decision support-oriented evolution T^2K^2D^2. Both benchmarks feature a real tweet dataset and queries with various complexities and selectivities. They help evaluate weighting schemes and database implementations in terms of computing performance. To illustrate our benchmarks' relevance and genericity, we successfully ran performance tests on the TF-IDF and Okapi BM25 weighting schemes, on the one hand, and on different relational (Oracle, PostgreSQL) and document-oriented (MongoDB) database implementations, on the other hand.
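To make the notion of a weighted vocabulary concrete, the following minimal Python sketch ranks the top-k keywords of a small tokenized corpus. It is an illustration under stated assumptions, not the benchmark's own implementation: the function name `top_k_keywords`, the parameters `k1` and `b` with their default values, the use of log(N/df) as the inverse document frequency for both schemes, and the choice to sum a term's weights over all documents are all assumptions made here. Textbook Okapi BM25 often uses a smoothed IDF instead.

```python
import math
from collections import Counter


def top_k_keywords(docs, k, scheme="tfidf", k1=1.6, b=0.75):
    """Rank terms of a non-empty tokenized corpus by summed weight.

    docs   : list of documents, each a list of string tokens
    scheme : "tfidf" or "bm25" (hypothetical labels for this sketch)
    k1, b  : BM25 saturation and length-normalization parameters
             (assumed defaults, not taken from the paper)
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    totals = Counter()
    for d in docs:
        tf = Counter(d)  # raw term frequency within this document
        for term, f in tf.items():
            idf = math.log(n_docs / df[term])
            if scheme == "tfidf":
                weight = f * idf
            else:
                # BM25-style term-frequency saturation with document
                # length normalization, applied to the same IDF.
                norm = k1 * (1 - b + b * len(d) / avg_len)
                weight = idf * f * (k1 + 1) / (f + norm)
            totals[term] += weight

    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


if __name__ == "__main__":
    corpus = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]
    print(top_k_keywords(corpus, 2, scheme="tfidf"))
    print(top_k_keywords(corpus, 2, scheme="bm25"))
```

Every term weight depends on corpus-wide statistics (document frequencies and average document length), which is why the weighted vocabulary is costly to build and why a benchmark would measure that cost across weighting schemes and storage back ends.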


