LSH methods for data deduplication in a Wikipedia artificial dataset

12/10/2021
by Juan Ciro, et al.

This paper illustrates locality sensitive hashing (LSH) models for identifying and removing nearly redundant data in a text dataset. To evaluate the different models, we create an artificial dataset for data deduplication using English Wikipedia articles. Area-Under-Curve (AUC) scores above 0.9 were observed for most models, with the best model reaching 0.96. Deduplication enables more effective model training by preventing the model from learning a distribution that, because of the repeated data, differs from the true one.
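For a concrete picture of the technique, the sketch below is a minimal, illustrative MinHash-plus-banding LSH pipeline, not the paper's implementation: it shingles documents into character k-grams, computes MinHash signatures, and groups signature bands into buckets so that documents sharing a bucket become near-duplicate candidates, which can then be verified by their Jaccard similarity. The shingle length, number of permutations, band count, and toy documents are assumptions chosen for demonstration.

```python
# Minimal MinHash + LSH banding sketch (standard library only).
# NOT the authors' implementation; all parameters below are illustrative.
import hashlib
from collections import defaultdict

NUM_PERM = 64      # signature length (number of hash functions), assumed
SHINGLE_K = 5      # character shingle length, assumed
BANDS = 16         # LSH bands; rows per band = NUM_PERM // BANDS
MAXHASH = (1 << 61) - 1

def shingles(text, k=SHINGLE_K):
    """Set of overlapping character k-grams of the normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    h = hashlib.blake2b(token.encode(), digest_size=8,
                        salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little") % MAXHASH

def minhash_signature(shingle_set, num_perm=NUM_PERM):
    """MinHash signature: per seed, the minimum hash over all shingles."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_perm)]

def lsh_candidate_pairs(signatures, bands=BANDS):
    """Band each signature; documents sharing any band bucket become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add(tuple(sorted((ids[i], ids[j]))))
    return candidates

def jaccard(a, b):
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    # Toy documents (assumed): "a" and "b" are near-duplicates, "c" is not.
    docs = {
        "a": "Locality sensitive hashing finds near duplicate documents quickly.",
        "b": "Locality-sensitive hashing finds near-duplicate documents quickly!",
        "c": "A completely unrelated sentence about Wikipedia article quality.",
    }
    shingle_sets = {k: shingles(v) for k, v in docs.items()}
    sigs = {k: minhash_signature(s) for k, s in shingle_sets.items()}
    for x, y in sorted(lsh_candidate_pairs(sigs)):
        print(x, y, round(jaccard(shingle_sets[x], shingle_sets[y]), 2))
```

Running the sketch prints the candidate pair (a, b) with a high Jaccard score; unrelated documents rarely share a band bucket, which is what makes LSH-based deduplication scale.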

Related research

09/14/2019
Multi-class Multilingual Classification of Wikipedia Articles Using Extended Named Entity Tag Set
Wikipedia is a great source of general world knowledge which can guide N...

11/02/2020
Analyzing Wikidata Transclusion on English Wikipedia
Wikidata is steadily becoming more central to Wikipedia, not just in mai...

06/24/2020
WikipediaBot: Automated Adversarial Manipulation of Wikipedia Articles
This paper presents an automated adversarial mechanism called WikipediaB...

05/10/2021
Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia
Wikipedia is the largest online encyclopedia, used by algorithms and web...

11/14/2022
Wikigender: A Machine Learning Model to Detect Gender Bias in Wikipedia
The way Wikipedia's contributors think can influence how they describe i...

03/30/2021
Tracking Knowledge Propagation Across Wikipedia Languages
In this paper, we present a dataset of inter-language knowledge propagat...

10/20/2020
AutoMeTS: The Autocomplete for Medical Text Simplification
The goal of text simplification (TS) is to transform difficult text into...
