DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication

05/27/2023
by Igor Nunes, et al.

Metrics for set similarity are central to several data mining tasks. To remove duplicate results in a Web search, for example, a common approach computes the Jaccard index between all pairs of pages. In social network analysis, a much-celebrated metric is the Adamic-Adar index, widely used to compare node neighborhood sets when predicting links. However, as the amount of data to be processed grows, calculating the exact similarity between all pairs can become intractable. The challenge of working at this scale has motivated research into efficient estimators for set similarity metrics. The two most popular estimators, MinHash and SimHash, are used in applications such as document deduplication and recommender systems, where large volumes of data need to be processed. Given the importance of these tasks, there is a clear demand for better estimators. We propose DotHash, an unbiased estimator for the intersection size of two sets. DotHash can be used to estimate the Jaccard index and, to the best of our knowledge, is the first method that can also estimate the Adamic-Adar index and a family of related metrics. We formally define this family of metrics, provide theoretical bounds on the probability of estimation errors, and analyze DotHash's empirical performance. Our experimental results indicate that DotHash is more accurate than the other estimators in link prediction and duplicate document detection, with the same complexity and similar comparison time.
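
To make the quantities in the abstract concrete, the following Python sketch computes the exact Jaccard and Adamic-Adar indices and then illustrates one way an intersection-size estimate could be turned into a Jaccard estimate. The abstract does not describe how DotHash builds its sketches, so the random sign-vector construction here is an assumption made for illustration. All function names, the `dim` and `seed` parameters, and the `weight` callback are hypothetical, not the authors' implementation.

```python
import math
import random


def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def adamic_adar(neighbors, u, v):
    """Exact Adamic-Adar index: sum of 1 / log(degree) over common neighbors."""
    return sum(
        1.0 / math.log(len(neighbors[w]))
        for w in neighbors[u] & neighbors[v]
        if len(neighbors[w]) > 1  # guard against log(1) == 0
    )


def element_vector(element, dim, seed=0):
    """Pseudo-random +/-1 vector tied to an element (assumed construction)."""
    rng = random.Random(f"{seed}:{element!r}")
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def sketch(items, dim, seed=0, weight=None):
    """Sum of the element vectors of a set, each scaled by sqrt(weight)."""
    total = [0.0] * dim
    for item in items:
        scale = math.sqrt(weight(item)) if weight else 1.0
        for i, x in enumerate(element_vector(item, dim, seed)):
            total[i] += scale * x
    return total


def estimate_intersection(sketch_a, sketch_b):
    """Dot product of two sketches, normalized by the dimension."""
    return sum(x * y for x, y in zip(sketch_a, sketch_b)) / len(sketch_a)


def jaccard_from_intersection(size_a, size_b, intersection):
    """Jaccard index via |A ∪ B| = |A| + |B| - |A ∩ B|."""
    union = size_a + size_b - intersection
    return intersection / union if union else 0.0
```

Under this assumed construction, shared elements each contribute exactly 1 to the normalized dot product and non-shared pairs contribute zero in expectation, so the estimate is unbiased for the intersection size. Passing a `weight` such as `1 / log(degree)` would make shared elements contribute their weight instead, which is one plausible route to an Adamic-Adar estimate.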

research
09/22/2018

Adversarial Link Prediction in Social Networks

Link prediction is one of the fundamental tools in social network analys...
research
08/01/2020

Learning-based link prediction analysis for Facebook100 network

In social network science, Facebook is one of the most interesting and w...
research
06/22/2019

Predicting kills in Game of Thrones using network properties

TV series such as HBO's most popular show Game of Thrones have seen a hi...
research
12/13/2019

Fast Computation of Katz Index for Efficient Processing of Link Prediction Queries

Network proximity computations are among the most common operations in v...
research
09/30/2014

Predicting missing links via correlation between nodes

As a fundamental problem in many different fields, link prediction aims ...
research
03/14/2021

Collaborative Filtering Approach to Link Prediction

Link prediction is a fundamental challenge in network science. Among var...
research
07/27/2020

Measuring similarity in co-occurrence data using ego-networks

The co-occurrence association is widely observed in many empirical data....
