Implementation Notes for the Soft Cosine Measure

08/28/2018
by Vít Novotný, et al.

The standard bag-of-words vector space model (VSM) is efficient and ubiquitous in information retrieval, but it underestimates the similarity of documents that have the same meaning but use different terminology. To overcome this limitation, Sidorov et al. proposed the Soft Cosine Measure (SCM), which incorporates term similarity relations. Charlet and Damnati showed that the SCM is highly effective in question answering (QA) systems. However, the orthonormalization algorithm proposed by Sidorov et al. has an impractical time complexity of O(n^4), where n is the size of the vocabulary. In this paper, we prove a tighter, lower worst-case time complexity bound of O(n^3). We also present an algorithm for computing the similarity between documents, and we show that its worst-case time complexity is O(1) given realistic conditions. Lastly, we describe implementation in general-purpose vector databases such as Annoy and Faiss, and in the inverted indices of text search engines such as Apache Lucene and ElasticSearch. Our results enable the deployment of the SCM in real-world information retrieval systems.
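To make the measure concrete, the following is a minimal sketch of the SCM in its commonly stated form, SCM(x, y) = xᵀSy / (√(xᵀSx) · √(yᵀSy)), where S is a term similarity matrix. It is an illustration only, not the algorithm presented in the paper: the function names, the toy similarity value of 0.9, and the two example documents are all assumptions made for this sketch. Documents are stored as sparse term-to-weight dictionaries, so only nonzero terms are visited.

```python
import math


def soft_dot(x, y, sim):
    """Return x^T S y for sparse bag-of-words dicts x and y.

    sim(a, b) is a caller-supplied term similarity (an entry of S).
    """
    return sum(wx * sim(a, b) * wy
               for a, wx in x.items()
               for b, wy in y.items())


def soft_cosine(x, y, sim):
    """Soft Cosine Measure between two sparse documents."""
    denom = math.sqrt(soft_dot(x, x, sim)) * math.sqrt(soft_dot(y, y, sim))
    return soft_dot(x, y, sim) / denom if denom else 0.0


# Hypothetical toy similarity table; 0.9 is an invented value.
TOY_SIM = {frozenset(("car", "automobile")): 0.9}


def toy_sim(a, b):
    """Identical terms have similarity 1; unlisted pairs have 0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def identity_sim(a, b):
    """With S = I, the SCM reduces to the ordinary cosine similarity."""
    return 1.0 if a == b else 0.0


doc1 = {"car": 1.0, "repair": 1.0}
doc2 = {"automobile": 1.0, "repair": 1.0}

print(soft_cosine(doc1, doc2, identity_sim))  # ordinary cosine
print(soft_cosine(doc1, doc2, toy_sim))       # soft cosine
```

With the identity similarity, the two documents share only the term "repair", so the measure understates how close they are. Once "car" and "automobile" are declared similar, the score rises, which is the effect the SCM is designed to capture.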


