Information Retrieval in long documents: Word clustering approach for improving Semantics

02/20/2023
by Paul Mbate Mekontchou, et al.

In this paper, we propose an alternative to deep neural networks for semantic information retrieval in long documents. The approach exploits clustering techniques to take the meaning of words into account in Information Retrieval systems targeting long as well as short documents. It uses a specially designed clustering algorithm to group words with similar meanings into clusters. The dual representation (lexical and semantic) of documents and queries is based on the vector space model proposed by Gerard Salton, applied in the vector space formed by the clusters. Our proposal is original in three respects: first, we propose an efficient algorithm that builds clusters of semantically close words from word embeddings; second, we define a formula for weighting these clusters; third, we propose a function that efficiently combines the meanings of words with a lexical model widely used in Information Retrieval. The evaluation of our proposal in three contexts on two datasets, SQuAD and TREC-CAR, shows that it significantly improves on classical keyword-only approaches without degrading the lexical aspect.
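The pipeline described in the abstract (cluster words by embedding similarity, represent queries and documents in the cluster space, then combine the semantic score with a lexical score) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' method: the toy 2-d embeddings, the greedy threshold clustering in `cluster_words`, the raw-frequency cluster weights, the term-frequency cosine used as the lexical model, and the linear mix controlled by `alpha` are all hypothetical stand-ins for the paper's clustering algorithm, weighting formula, lexical model, and combination function, none of which the abstract specifies.

```python
import math
from collections import Counter

# Toy 2-d "embeddings", invented for illustration only.
# Real embeddings would come from a trained word-embedding model.
EMBEDDINGS = {
    "car": (0.9, 0.1),
    "automobile": (0.88, 0.15),
    "vehicle": (0.8, 0.3),
    "banana": (0.1, 0.9),
    "fruit": (0.2, 0.85),
}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_words(embeddings, threshold=0.95):
    """Greedy one-pass clustering (an assumed stand-in algorithm).

    Each word joins the first cluster whose seed vector is at least
    `threshold` similar; otherwise it seeds a new cluster.
    Returns a dict mapping word -> cluster id.
    """
    seeds = []
    assignment = {}
    for word in sorted(embeddings):  # sorted for deterministic output
        vec = embeddings[word]
        for cluster_id, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                assignment[word] = cluster_id
                break
        else:
            assignment[word] = len(seeds)
            seeds.append(vec)
    return assignment


def bag(tokens, key=lambda t: t):
    """Count tokens after mapping them through `key`; unmapped tokens are dropped."""
    mapped = (key(t) for t in tokens)
    return Counter(k for k in mapped if k is not None)


def sparse_cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def combined_score(query, doc, assignment, alpha=0.5):
    """Linear mix of a lexical score and a cluster-space (semantic) score.

    `alpha` and the linear form are assumptions made for this sketch.
    """
    q, d = query.lower().split(), doc.lower().split()
    lexical = sparse_cosine(bag(q), bag(d))
    semantic = sparse_cosine(bag(q, assignment.get), bag(d, assignment.get))
    return alpha * lexical + (1 - alpha) * semantic


if __name__ == "__main__":
    assignment = cluster_words(EMBEDDINGS)
    for doc in ("car", "automobile", "banana"):
        print(doc, combined_score("car", doc, assignment))
```

In this toy setup, a query for "car" gives a nonzero score to a document containing only "automobile", because both words fall into the same cluster, while an exact keyword match still scores highest. That is the intended effect of pairing the semantic representation with the lexical one.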


