A Pre-trained Data Deduplication Model based on Active Learning

07/31/2023
by Xinyao Liu, et al.

In the era of big data, data quality has become an increasingly prominent concern. One of the main challenges is duplicate data, which can arise from repeated entry or from merging multiple data sources. Such "dirty data" can significantly limit the effective application of big data. To address data deduplication, we propose a pre-trained deduplication model based on active learning, which is the first work to use active learning for deduplication at the semantic level. The model is built on a pre-trained Transformer and fine-tuned to treat deduplication as a sequence-to-classification task. It is the first to integrate the Transformer with active learning in an end-to-end architecture that selects the most valuable data for training the deduplication model, and the first to employ the R-Drop method for data augmentation on each round of labeled data, which reduces the cost of manual labeling and improves the model's performance. Experimental results demonstrate that our proposed model outperforms the previous state of the art (SOTA) for duplicate data identification, achieving up to a 28 datasets.
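The two ingredients named in the abstract, selecting the most valuable unlabeled data and regularizing with R-Drop, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the selection criterion, so entropy-based uncertainty sampling is assumed here. The R-Drop term follows the commonly cited formulation (negative log-likelihood of two dropout-perturbed forward passes plus a weighted symmetric KL divergence). The function names, the `alpha` weight, and the toy probabilities are hypothetical, and plain Python lists stand in for the outputs of a fine-tuned Transformer classifier.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, k):
    """Indices of the k unlabeled record pairs with the highest entropy.

    Assumption: uncertainty sampling is used as the active learning
    criterion; the abstract does not specify one.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def r_drop_loss(p1, p2, label, alpha=1.0, eps=1e-12):
    """R-Drop style loss for one example.

    p1 and p2 are the class probabilities from two forward passes of the
    same input with different dropout masks. alpha is a hypothetical
    weight on the consistency term.
    """
    nll = -math.log(p1[label] + eps) - math.log(p2[label] + eps)
    symmetric_kl = 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))
    return nll + alpha * symmetric_kl


# Toy pool: probabilities for (not duplicate, duplicate) on three record pairs.
pool = [[0.99, 0.01], [0.5, 0.5], [0.8, 0.2]]
to_label = select_most_uncertain(pool, k=2)  # -> [1, 2]

# Two dropout passes over one labeled pair whose true class is "duplicate" (1).
loss = r_drop_loss([0.3, 0.7], [0.4, 0.6], label=1)
```

In a full active learning loop, the selected pairs would be sent to an annotator, added to the labeled set, and the classifier fine-tuned again before the next selection round.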


