A Fast Randomized Algorithm for Massive Text Normalization

10/06/2021
by   Nan Jiang, et al.

Many popular machine learning techniques in natural language processing and data mining rely heavily on high-quality text sources. However, real-world text datasets contain a significant number of spelling errors and improperly punctuated variants, on which the performance of these models quickly deteriorates. Moreover, real-world, web-scale datasets contain hundreds of millions or even billions of lines of text, over which existing text cleaning tools are prohibitively expensive to run and may require additional overhead to learn the corrections. In this paper, we present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data. Our algorithm relies on the Jaccard similarity between words to suggest corrections. We handle the pairwise word-to-word comparisons efficiently via Locality Sensitive Hashing (LSH). We also propose a novel stabilization process to address hash collisions between dissimilar words, which are a consequence of the randomized nature of LSH and are exacerbated by the massive scale of real-world datasets. Compared with existing approaches, our method is more efficient, both asymptotically and in empirical evaluations, and does not rely on additional features such as lexical/phonetic similarity or word embeddings. In addition, FLAN does not require any annotated data or supervised learning. We further show theoretically that our algorithm is robust, with upper bounds on the false positive and false negative rates of corrections. Our experimental results on real-world datasets demonstrate the efficiency and efficacy of FLAN.
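
To make the two building blocks named in the abstract concrete, here is a minimal Python sketch of Jaccard similarity between words and a MinHash-style LSH signature that estimates it. This is not FLAN itself: the abstract does not say how words are turned into sets, so the padded character bigrams, the function names, the number of hash functions, and the use of salted BLAKE2b hashes are all assumptions made for illustration. The stabilization process is not shown.

```python
import hashlib


def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word.

    Assumption: padded character bigrams; the abstract does not
    specify how words are tokenized into sets.
    """
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(token, seed):
    """64-bit hash of a token, salted by seed (one hash function per seed)."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens, num_hashes=64):
    """MinHash signature: the minimum hash value under each hash function."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature positions; estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    a, b = char_ngrams("colour"), char_ngrams("color")
    print("exact Jaccard:   ", jaccard(a, b))
    print("MinHash estimate:", estimate_similarity(
        minhash_signature(a), minhash_signature(b)))
```

For a single hash function, the probability that two sets share the same minimum hash equals their Jaccard similarity, so words that agree on many signature positions are likely near-duplicates. Grouping words by parts of their signatures is what avoids comparing every pair of words directly.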


