Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes

Proximity full-text search is commonly implemented in contemporary full-text search systems. Let us assume that the search query is a list of words. It is natural to consider a document as relevant if the queried words are near each other in the document. The proximity factor is even more significant for the case where the query consists of frequently occurring words. Proximity full-text search requires the storage of information for every occurrence in documents of every word that the user can search. For every occurrence of every word in a document, we employ additional indexes to store information about nearby words, that is, the words that occur in the document at distances from the given word of less than or equal to the MaxDistance parameter. We showed in previous works that these indexes can be used to improve the average query execution time by up to 130 times for queries that consist of words occurring with high-frequency. In this paper, we consider how both the search performance and the search quality depend on the value of MaxDistance and other parameters. Well-known GOV2 text collection is used in the experiments for reproducibility of the results. We propose a new index schema after the analysis of the results of the experiments. This is a pre-print of a contribution published in Supplementary Proceedings of the XXII International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2020), Voronezh, Russia, October 13-16, 2020, P. 336-350, published by CEUR Workshop Proceedings. The final authenticated version is available online at: http://ceur-ws.org/Vol-2790/

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/06/2020

Proximity full-text searches of frequently occurring words with a response time guarantee

Full-text search engines are important tools for information retrieval. ...
research
09/06/2020

An Improved Algorithm for Fast K-Word Proximity Search Based on Multi-Component Key Indexes

A search query consists of several words. In a proximity full-text searc...
research
06/14/2020

An efficient algorithm for three-component key index construction

In this paper, proximity full-text searches in large text arrays are con...
research
11/18/2018

Proximity Full-Text Search with a Response Time Guarantee by Means of Additional Indexes

Full-text search engines are important tools for information retrieval. ...
research
08/01/2021

Relevance ranking for proximity full-text search based on additional indexes with multi-component keys

The problem of proximity full-text search is considered. If a search que...
research
07/18/2020

About a structure of easily updatable full-text indexes

We consider strategies to organize easily updatable associative arrays i...
research
01/26/2021

pdfPapers: shell-script utilities for frequency-based multi-word phrase extraction from PDF documents

Biomedical research is intensive in processing information in the previo...

Please sign up or login with your details

Forgot password? Click here to reset