The exact probability law for the approximated similarity from the Minhashing method

09/20/2022
by Soumaila Dembélé, et al.

We propose a probabilistic setting in which we study the probability law of the Rajaraman and Ullman (RU) algorithm and of a modified version of it, denoted RUM. These algorithms aim to estimate the similarity index between very large texts in the context of the web. We give a foundation for this method by showing that, in the ideal case of carefully chosen probability laws, the exact similarity is the mathematical expectation of the random similarity provided by the RUM algorithm. Some extensions are given.
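
To make the expectation claim concrete, here is a minimal sketch of the textbook minhashing estimator, not the RU or RUM procedure as specified in the full paper, which is not reproduced on this page. It assumes that the similarity index is the Jaccard index of two non-empty sets and that the hash functions are idealized as uniformly random permutations of the union of the two sets. The names `jaccard` and `minhash_similarity` and the parameters `num_perm` and `seed` are hypothetical, chosen for illustration only.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two non-empty sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def minhash_similarity(a, b, num_perm=200, seed=0):
    """Estimate the Jaccard similarity of two non-empty sets by minhashing.

    Assumption: each "hash function" is a uniformly random permutation of
    the union of the two sets, which is the idealized setting. For one
    permutation, the two minimum ranks coincide with probability equal to
    the exact Jaccard similarity, so the fraction of coincidences over
    num_perm permutations has that similarity as its expectation.
    """
    a, b = set(a), set(b)
    rng = random.Random(seed)
    # Sort the union so that results are reproducible for a given seed.
    universe = sorted(a | b)
    matches = 0
    for _ in range(num_perm):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)  # one uniformly random permutation
        rank = dict(zip(universe, ranks))
        if min(rank[x] for x in a) == min(rank[x] for x in b):
            matches += 1
    return matches / num_perm


if __name__ == "__main__":
    s1 = {"the", "quick", "brown", "fox"}
    s2 = {"the", "quick", "red", "fox"}
    print("exact    :", jaccard(s1, s2))
    print("estimated:", minhash_similarity(s1, s2, num_perm=2000))
```

Under these assumptions the estimate is an average of independent Bernoulli trials whose success probability is the exact similarity, so it fluctuates around the exact value and tightens as `num_perm` grows.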


