T^2K^2: The Twitter Top-K Keywords Benchmark

09/14/2017
by Ciprian-Octavian Truică, et al.

Information retrieval from textual data focuses on the construction of vocabularies that contain weighted term tuples. Such vocabularies can then be exploited by various text analysis algorithms to extract new knowledge, e.g., top-k keywords, top-k documents, etc. Top-k keywords are routinely used for various purposes and are often computed on the fly, so they must be computed efficiently. Benchmarking is the customary way to compare competing weighting schemes and database implementations. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present a top-k keywords benchmark, T^2K^2, which features a real tweet dataset and queries with various complexities and selectivities. T^2K^2 helps evaluate weighting schemes and database implementations in terms of computing performance. To illustrate T^2K^2's relevance and genericity, we successfully performed tests on the TF-IDF and Okapi BM25 weighting schemes, on one hand, and on different relational (Oracle, PostgreSQL) and document-oriented (MongoDB) database implementations, on the other hand.
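
To make the notion of a top-k keywords query concrete, the following is a minimal in-memory sketch in Python. It is not the benchmark's implementation: the function name `top_k_keywords`, the parameter defaults `k1=1.2` and `b=0.75`, and the exact TF-IDF and Okapi BM25 formulas are common textbook variants assumed here for illustration, and the abstract does not state which variants T^2K^2 uses. The sketch also assumes a non-empty list of already tokenized documents.

```python
import math
from collections import Counter


def top_k_keywords(docs, k=3, scheme="tfidf", k1=1.2, b=0.75):
    """Return the k terms with the highest summed weight over `docs`.

    `docs` is a non-empty list of token lists. The weighting formulas are
    common textbook variants, assumed here for illustration only.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = Counter()
    for d in docs:
        for term, f in Counter(d).items():
            if scheme == "tfidf":
                # Raw term frequency times log inverse document frequency.
                w = f * math.log(n_docs / df[term])
            else:
                # Okapi BM25 with a non-negative ("plus one") IDF variant.
                idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
                norm = f + k1 * (1 - b + b * len(d) / avg_len)
                w = idf * f * (k1 + 1) / norm
            scores[term] += w
    # Sort by descending weight, then alphabetically to break ties.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


docs = [
    ["storm", "rain", "rain"],
    ["storm", "sun"],
    ["storm", "rain", "wind"],
]
print(top_k_keywords(docs, k=2, scheme="tfidf"))
print(top_k_keywords(docs, k=2, scheme="bm25"))
```

In this toy corpus, "storm" appears in every document, so its TF-IDF weight is zero and it drops out of the TF-IDF ranking. In a database setting the same aggregation would typically be expressed as a grouped, ordered, and limited query over a stored vocabulary, which is the kind of workload the benchmark is described as measuring.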


