
A Fast Randomized Algorithm for Massive Text Normalization

Many popular machine learning techniques in natural language processing and data mining rely heavily on high-quality text sources. However, real-world text datasets contain a significant amount of spelling errors and improperly punctuated variants, on which the performance of these models quickly deteriorates. Moreover, real-world, web-scale datasets contain hundreds of millions or even billions of lines of text, over which existing text cleaning tools are prohibitively expensive to run and may require additional overhead to learn the corrections. In this paper, we present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data. Our algorithm relies on the Jaccard similarity between words to suggest correction results. We efficiently handle the pairwise word-to-word comparisons via Locality Sensitive Hashing (LSH). We also propose a novel stabilization process to address the issue of hash collisions between dissimilar words, which is a consequence of the randomized nature of LSH and is exacerbated by the massive scale of real-world datasets. Compared with existing approaches, our method is more efficient, both asymptotically and in empirical evaluations, and does not rely on additional features, such as lexical/phonetic similarity or word embedding features. In addition, FLAN does not require any annotated data or supervised learning. We further theoretically show the robustness of our algorithm with upper bounds on the false positive and false negative rates of corrections. Our experimental results on real-world datasets demonstrate the efficiency and efficacy of FLAN.



1. Introduction

Many Natural Language Processing (NLP) algorithms rely on high-quality text sources to obtain state-of-the-art results (10.1145/1401890.1401965; gudivada2017data). Recent studies have shown that model performance deteriorates when state-of-the-art models are evaluated on real-world noisy texts (DBLP:conf/aclnut/KreekA18; DBLP:journals/corr/abs-2005-00295; damaschk-etal-2019-multiclass). Specifically, text data extracted from web sources such as Twitter, Reddit, and search query logs contain numerous instances of spelling errors, typos, and non-standard punctuation marks (sikdar2013cutting; DBLP:conf/emnlp/VolskePSS17). This noise can render pretrained neural models trained on clean data sources ineffective and is challenging to clean with traditional text normalization methods on web-scale datasets.

This challenge motivates the need for lexical normalization, which is the task of cleaning noisy input words into canonicalized forms. Prior techniques for lexical normalization involve 1) combining similar words based on a rich set of features, such as phonetic similarity, lexical edit distances, $n$-gram probabilities, and word-embedding features (DBLP:journals/tist/HanCB13; DBLP:conf/emnlp/KajiK14), 2) supervised learning, where annotated datasets are required to learn a correction mapping from unnormalized words to normalized ones (DBLP:journals/ijdar/ChoudhurySJMSB07), and 3) similarity search with word embeddings, where the top-ranked words under a vector similarity measure are considered as the correction candidates.


In this paper, we present FLAN, a scalable randomized algorithm for lexical normalization. Compared with the existing methods, FLAN can 1) eliminate the need for additional annotation for supervised learning, 2) scale better on large datasets, especially those with hundreds of millions or billions of lines of text, and 3) be robust to errors, reducing the likelihood of normalizing a word into a dissimilar one through our proposed graph stabilization technique.

FLAN harnesses Locality-Sensitive Hashing (LSH) (DBLP:conf/aaai/ZhaoLM14) to find normalized words in a graph. FLAN consists of two stages, an indexing step and an inference step. The input to the indexing stage is the set of tokens found in the data (word unigrams in our experiments) and the output is a word-to-word directed graph, built via LSH, where all words in a connected component point to a canonicalized representative. At inference time, we use LSH again to hash an unknown word to its appropriate graph component and substitute this noisy word with the canonicalized representative from the graph.

We further boost the probability of LSH bucketing similar words together by taking independent repetitions of the hashing process and building a weighted word-to-word graph where the weights represent the number of repetitions in which two tokens shared the same hash signature. As a stabilization step, we then remove those insignificant edges with weight below a predefined threshold. In the pruned graph, the words in every connected component are regarded as sharing the same meaning. This edge pruning operation reduces the likelihood of a word being normalized to a dissimilar one. We further derive upper bounds on the false positive and false negative rate of this graph construction process.

In our experiments, we compare FLAN with several popular text normalization methods over large-scale Twitter and Reddit datasets. In terms of running time, we find that FLAN is faster than baselines across both the indexing and inference stages. In a human evaluation of the correction quality across the Twitter and Reddit datasets, FLAN achieves a higher F1-Score than the competing methods on both datasets. We also demonstrate the impact of FLAN on downstream NLP tasks. On the Twitter sentiment analysis and various perturbed GLUE benchmark tasks, FLAN demonstrates consistent improvement over the baselines. We also conduct an ablation study of the impact of the threshold parameter on the algorithm's performance. We further provide a case study of applying FLAN in an industrial setting on the task of normalizing search queries and product titles on a dataset sampled from the search logs of a large e-commerce website. On this dataset with hundreds of millions of lines of text, we find that FLAN completed normalizing the data in a few hours while competing spell correction methods required days to finish. Our contributions in this paper can be summarized as follows:

  • We present an efficient algorithm for lexical normalization that uses the Jaccard similarity between words for lexical correction. To the best of our knowledge, this similarity measurement has not been fully explored in this domain and is different from existing word embedding search and lexical/phonetic edit distance models. Our technique does not require supervised training or annotated data. FLAN also scales better to large datasets thanks to the efficiency of LSH over competing algorithmic primitives.

  • While LSH provides an efficient approach to map similar words together, its randomized nature introduces the possibility of dissimilar words sharing the same signature due to undesirable hash collisions, a problem that becomes very prevalent at massive scales. We address this challenge of dealing with unfavorable collisions through a novel approach of modeling the LSH outputs as a word-to-word graph and using multiple repetitions to identify connected components of similar entities in this graph. We show that FLAN is robust to errors and scales well to large datasets both theoretically and in our empirical experiments.

  • We compare FLAN with several existing popular methods over different datasets, comparing the average running time, examining the quality of the word corrections via human evaluations, and providing several case studies for the performance over perturbed GLUE benchmark datasets, Twitter sentiment analysis, and a large-scale product search dataset.

The remainder of the paper is organized as follows: Section 2 provides background information on lexical normalization and LSH; in Section 3, we present the details of the FLAN system, using LSH to hash individual words into bins and then employing a weighted word-to-word graph to determine how to convert tokens into a canonicalized representation; Section 4 presents our experimental studies showing the empirical advantages of our proposed algorithm; Section 5 recaps the contributions of this paper and also identifies some directions for future work in extending this method.

2. Background & Related Work

2.1. Lexical Normalization

Recently, lexical normalization has received great interest with the advent of mobile computing and social networks (coddington2014correction; DBLP:conf/cikm/BonchiFNSV12), where typing on a small keyboard increases the opportunity for typos, and the rise of social media (DBLP:conf/aclnut/BaldwinMHKRX15), where users are accustomed to using slang, abbreviations, and other types of informal language. Lexical normalization refers to the process of transforming non-standard, informal, or misspelled tokens into their standardized counterparts as well as converting words of various tenses or pluralization into a consistent representation (muller2019enhancing). This process has emerged as a crucial step for applying neural NLP models, which are often pretrained on clean text corpora, to noisy, real-world datasets.

Prior techniques in lexical normalization all involve either: 1) combining features, such as phonetic similarity and lexical distances, with standard word and $n$-gram probabilities (DBLP:conf/cikm/IslamI09; DBLP:journals/tist/HanCB13; DBLP:conf/emnlp/KajiK14), 2) supervised learning, where annotated datasets are required to learn a correction mapping (DBLP:journals/ijdar/ChoudhurySJMSB07), or 3) nearest neighbor search within the space of word embeddings, where the top-ranked words under a vector similarity measure are considered as the correction candidates (DBLP:journals/siamcomp/ColeH02).

In the literature, the classic approaches for lexical normalization usually encompass a combination of spelling correction, stemming (4410417), and regular expression filtering (10.1145/980972.980996). More recent works have introduced unsupervised statistical models for text cleaning (contractor2010cleansing; aw2006phrase) or combined multiple heuristics to identify and normalize out-of-vocabulary words. Another line of work explores learning robust word representations through end-to-end neural networks as opposed to normalizing the data beforehand (malykh-etal-2018-robust; doval2019towards) or directly fine-tuning BERT models for the lexical normalization task (muller2019enhancing). Another group of works focuses on directly learning over subword-level information, where character sequences or subword pairs are directly used for learning the representation without any correction steps (muller2019enhancing).

However, several issues limit the use of the aforementioned approaches. The pattern of typos may vary across data sources and languages, which may require training separate supervised learning models or collecting additional labels. The current methods for lexical normalization are also either prohibitively slow when applied over massive datasets or require expensive and time-consuming model training.

2.2. Locality-Sensitive Hashing

LSH is a family of functions such that a function uniformly sampled from this family has the property that, under the hash mapping, similar points have a higher probability of having the same hash value (DBLP:conf/aaai/ZhaoLM14). More precisely, consider a family $\mathcal{H}$ of hash functions mapping points from $\mathbb{R}^d$ to a discrete integer set $\mathbb{N}$.

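Definition 2.1 (LSH Family (DBLP:conf/uai/Shrivastava015; DBLP:books/cu/LeskovecRU14)).

A hashing family $\mathcal{H}$ is called $(S_0, cS_0, p_1, p_2)$-sensitive if, for any two points $x, y \in \mathbb{R}^d$, a function $h$ chosen uniformly from $\mathcal{H}$ satisfies the following properties:

  • If $\mathrm{sim}(x, y) \ge S_0$, then $\Pr[h(x) = h(y)] \ge p_1$;

  • If $\mathrm{sim}(x, y) \le cS_0$, then $\Pr[h(x) = h(y)] \le p_2$.

In practice, we assume $p_1 > p_2$ and $c < 1$.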

A collision occurs when the hash values for two points are equal: $h(x) = h(y)$. The collision probability is proportional to some monotonic function of the similarity between the two points: $\Pr[h(x) = h(y)] \propto f(\mathrm{sim}(x, y))$, where $\mathrm{sim}(x, y)$ is the similarity under consideration and $f$ is a monotonically increasing function. Essentially, similar items are more likely to collide with each other under the LSH mapping.

Minwise hashing (MinHash) is the LSH for set resemblance, also known as the Jaccard similarity (DBLP:journals/rsa/BroderM01). The minwise hashing family applies a random permutation $\pi$ on the given set $S$ and stores only the minimum value after the permutation mapping. Given two sets $S_1, S_2$, the probability of the sets having the same MinHash value is the Jaccard similarity between the given two sets:

$$\Pr[\min \pi(S_1) = \min \pi(S_2)] = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = J(S_1, S_2).$$
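The MinHash collision property can be checked empirically. The following is a minimal sketch rather than the paper's implementation: it approximates each random permutation with a salted `blake2b` hash (an assumption made for illustration), and the helper names `jaccard`, `minhash`, and `minhash_collision_rate` are hypothetical.

```python
import hashlib


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _rank(item: str, seed: int) -> int:
    """Pseudo-random rank of an item; each seed plays the role of one permutation."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(items: set, seed: int) -> int:
    """Minimum rank over a non-empty set under the seed-th pseudo-permutation."""
    return min(_rank(x, seed) for x in items)


def minhash_collision_rate(a: set, b: set, trials: int = 2000) -> float:
    """Fraction of pseudo-permutations under which both sets share a MinHash value."""
    hits = sum(minhash(a, s) == minhash(b, s) for s in range(trials))
    return hits / trials


if __name__ == "__main__":
    s1 = {"a", "b", "c", "d"}
    s2 = {"c", "d", "e", "f"}
    print(jaccard(s1, s2))                 # exact similarity, 2 / 6
    print(minhash_collision_rate(s1, s2))  # empirical estimate of the same quantity
```

With enough trials, the empirical collision rate should concentrate around the exact Jaccard similarity, which is what makes a MinHash signature usable as a similarity estimator.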


For computing several LSH signatures of the data vector, the last decade has witnessed a tremendous advance in reducing the amortized computational and memory requirements. For random projections based LSH, of which signed random projection is a special case, we can calculate $k$ LSH hashes of the data vector, with dimensions $d$, in time $O(d \log d + k)$, a significant improvement over $O(dk)$. This speedup is possible due to the theory of the Fast Johnson-Lindenstrauss transformation (DBLP:journals/siamcomp/AilonC09). On the orthogonal side, an even better speedup of $O(d + k)$ has been obtained with permutation-based LSH, such as minwise hashing, using ideas of densification (DBLP:conf/icml/Shrivastava17; DBLP:conf/icml/Shrivastava014; DBLP:conf/uai/Shrivastava014). These drastic reductions in hashing time have been instrumental in making LSH-based methods more appealing and practical, and we leverage these advances in our work.

In this research, we explore methods based on the Jaccard similarity between words. If two words have more subsequences or subwords in common, they have a higher similarity with each other. This measure of similarity focuses solely on the morphology of the words (aronoff1994morphology), representing the structures and meanings within words. It does not incorporate semantic or syntactic meaning, which requires the context of the words. We leave this study for future work.

3. Lexical Normalization via Randomized Hashing

3.1. Motivations

To measure the distance or similarity between two words, extensive research has been conducted over two metrics: edit distance and cosine similarity. Edit distance and its variations, including Levenshtein, Damerau–Levenshtein, and Jaro-Winkler distance (cohen2003comparison), are all defined around computing the minimal sequence of edit operations (i.e., deletion, insertion, and replacement) for changing a given word into another. Information on neighboring characters on keyboards as well as phonetic relationships is commonly applied to adjust the cost of deletion, insertion, and replacement. In the cosine similarity paradigm (pmlr-v89-ding19a), words are embedded into the Euclidean space, and the distance between two words is the angle between their corresponding word vectors. In the domain of lexical normalization, these two metrics incur prohibitive computational cost when dealing with large data (DBLP:journals/tist/HanCB13; DBLP:conf/www/GuzmanL16). In this work, we consider Jaccard similarity as the similarity measurement between word pairs. Here, the Jaccard similarity is the ratio of character spans (or subwords) that two words share. The advantage of this metric is that it can handle web-scale data via recent algorithmic advances in computing LSH signatures (DBLP:conf/aaai/ZhaoLM14; DBLP:conf/uai/Shrivastava014).

3.2. Vocabulary as Lexically Similar Components

Figure 1. An overview of hashing a word into a signature value via LSH. The input is a sequence of characters, sliced into substrings via the hyperparameter CHARLENS. Then, we use a 2-universal hash function to convert each substring into a numerical (integer) value. Finally, we apply MinHash to the integer array to obtain a signature value of the input word.

Locality-Sensitive Hashing for Words. As shown in Figure 1, the detailed LSH algorithm is composed of several steps. For the first step, a word $w = c_1 c_2 \cdots c_m$ of $m$ characters is sliced into a set of substrings $S(w)$:

$$S(w) = \bigcup_{N \in \mathrm{CHARLENS}} G_N(w), \qquad G_N(w) = \{\, c_i c_{i+1} \cdots c_{i+N-1} : 1 \le i \le m - N + 1 \,\}.$$

Here, $S(w)$ is the union of the substring sets $G_N(w)$. In our experiments, we introduce a hyper-parameter CHARLENS to denote which substrings will be included in the set. For example, CHARLENS $= [1, 3, 5]$ signifies that the character-level unigram, trigram, and 5-gram sets will be included in the overall set $S(w)$. If the substring length $N$ is longer than the input word length $m$, its N-gram set $G_N(w)$ is defined to be the empty set.
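As an illustration of the slicing step, here is a minimal sketch. The function name `substring_set` is hypothetical, the default lengths follow the unigram, trigram, and 5-gram example in the text, and treating a length longer than the word as contributing no substrings is an assumption.

```python
from typing import Iterable, Set


def substring_set(word: str, charlens: Iterable[int] = (1, 3, 5)) -> Set[str]:
    """Union of the character n-gram sets of `word`, one set per length in `charlens`."""
    grams: Set[str] = set()
    for n in charlens:
        # Assumption: a length longer than the word contributes nothing,
        # because the range below is then empty.
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams


if __name__ == "__main__":
    correct = substring_set("hello")
    typo = substring_set("helo")
    shared = correct & typo
    print(sorted(correct))
    print(sorted(typo))
    # Jaccard similarity between the word and its misspelling
    print(len(shared) / len(correct | typo))
```

A single dropped character leaves most short substrings intact, so the two sets overlap heavily and the Jaccard similarity stays high.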


After obtaining the substring set, we use a hash function from a 2-universal hash family to map every substring into a large universe $U$ of integers.
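A common 2-universal construction is the Carter-Wegman family $h(x) = ((ax + b) \bmod p) \bmod |U|$. The sketch below assumes that family; the text does not confirm the exact function here. The prime $2^{61} - 1$, the byte-to-integer encoding of substrings, and the name `make_universal_hash` are illustrative choices.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1  # a prime larger than any short-substring key


def make_universal_hash(universe_size: int, rng: random.Random):
    """Draw h(x) = ((a*x + b) mod p) mod universe_size from the Carter-Wegman family."""
    a = rng.randrange(1, MERSENNE_PRIME)
    b = rng.randrange(0, MERSENNE_PRIME)

    def h(substring: str) -> int:
        # Assumption: a substring is read as a big-endian integer. Keys of at most
        # 7 bytes stay below the prime; longer keys are reduced modulo the prime,
        # which weakens the strict 2-universality guarantee.
        x = int.from_bytes(substring.encode("utf-8"), "big") % MERSENNE_PRIME
        return ((a * x + b) % MERSENNE_PRIME) % universe_size

    return h


if __name__ == "__main__":
    h = make_universal_hash(universe_size=2 ** 20, rng=random.Random(0))
    for gram in ("h", "hel", "hello"):
        print(gram, h(gram))
```

Fixing the generator seed makes the drawn function reproducible, which matters because every word must be hashed with the same function within one repetition.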

Next, we use one permutation hashing (DBLP:conf/nips/0001OZ12) to hash the output of the above step. We partition the universe $U$ into $K$ bins of equal width, and the set of hashed integers is partitioned correspondingly: each integer is placed into the bin whose range contains it, as illustrated in Figure 1. By the MinHash process, we preserve only the minimum value in each non-empty bin.

One existing issue of one permutation hashing is that empty bins have no signature. DBLP:conf/icml/Shrivastava17 proposes to borrow the signature value of a neighboring bin for each empty bin. In particular, for a given empty bin, we flip a coin and borrow the value of the first non-empty bin to either the left or the right. This borrowing process is known as densification. After this densification operation, we obtain an array of signature values to represent the input word $w$.
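The binning and borrowing steps can be sketched as follows. This is a simplified illustration, not the cited densification scheme in full: it assumes equal-width bins, wraps around at the ends of the array, and draws one coin flip per bin from a seeded generator so that every word sees the same borrowing directions. The function names are hypothetical.

```python
import random
from typing import List, Optional


def one_permutation_hash(values: List[int], universe_size: int,
                         num_bins: int) -> List[Optional[int]]:
    """Keep the minimum hashed value that falls in each equal-width bin (None if empty)."""
    bin_width = -(-universe_size // num_bins)  # ceiling division
    bins: List[Optional[int]] = [None] * num_bins
    for v in values:
        k = v // bin_width
        if bins[k] is None or v < bins[k]:
            bins[k] = v
    return bins


def densify(bins: List[Optional[int]], seed: int = 0) -> List[int]:
    """Fill each empty bin with the value of the nearest non-empty bin on one side."""
    if all(b is None for b in bins):
        raise ValueError("cannot densify: every bin is empty")
    n = len(bins)
    rng = random.Random(seed)
    # One coin flip per bin, shared by all words hashed with the same seed.
    directions = [rng.choice((-1, 1)) for _ in range(n)]
    dense = list(bins)
    for k in range(n):
        if bins[k] is None:
            j = (k + directions[k]) % n
            while bins[j] is None:  # walk circularly until a non-empty bin is found
                j = (j + directions[k]) % n
            dense[k] = bins[j]
    return dense


if __name__ == "__main__":
    sparse = one_permutation_hash([3, 17, 18, 40], universe_size=64, num_bins=4)
    print(sparse)
    print(densify(sparse, seed=7))
```

Sharing the coin flips across words is the design choice that keeps two words with identical bin patterns mapped to identical densified arrays.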

Next, we randomly hash the signature array into an integer in another universe $U'$. Here, we need another hash function $h'$ from the 2-universal hash family that recursively hashes the array of signature values $(v_1, \dots, v_K)$ into one element. Each step takes the sum of the current signature value and the hashed value of the previous step as input and outputs the hashed value for the current step: $z_k = h'(v_k + z_{k-1})$, where we use the last element $z_K$ as the signature value for the input word $w$. We show the detailed process of mapping an input word into a signature in the universe $U'$ in Figure 1.
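The recursive folding step can be sketched as below. The Carter-Wegman form of the hash, the prime, the initial value of zero before the first step, and the name `make_fold_hash` are all assumptions made for illustration.

```python
import random
from typing import Callable, List

PRIME = (1 << 61) - 1  # illustrative prime for the 2-universal hash


def make_fold_hash(universe_size: int, seed: int) -> Callable[[List[int]], int]:
    """Return a function that folds an array of signature values into one integer."""
    rng = random.Random(seed)
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)

    def h2(x: int) -> int:
        return ((a * x + b) % PRIME) % universe_size

    def fold(signature: List[int]) -> int:
        state = 0  # assumed starting value before the first step
        for value in signature:
            # hash(current signature value + hashed value of the previous step)
            state = h2(value + state)
        return state  # the last hashed value is the word's signature

    return fold


if __name__ == "__main__":
    fold = make_fold_hash(universe_size=2 ** 32, seed=1)
    print(fold([3, 17, 40, 40]))
```

Because each step depends on the previous one, two words receive the same final value essentially only when their whole signature arrays agree.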

To conclude, given two words $w_1$ and $w_2$, the probability that the LSH algorithm assigns them the same signature value is proportional to their Jaccard similarity (DBLP:conf/aaai/ZhaoLM14):

$$\Pr[\mathrm{sig}(w_1) = \mathrm{sig}(w_2)] \propto J(S(w_1), S(w_2)).$$


Here we make the assumption that all words grouped together via their signature value are lexically similar (as shown in Figure 2). These grouped words are usually variants of one canonical representation, which we call the representative word. In this work, we use this representative to replace all of the grouped words to normalize the text data.

Figure 2. The pipeline of the FLAN algorithm. (a) Inputs are a list of words with their frequencies, which are then converted into a graph with weighted cliques via randomized hashing. (b) Afterward, we prune the insignificant edges and consider every component as words sharing the same meaning. (c) The output is a linked list-style vocabulary, with all similar words pointing to their pivots. (d) At inference, a new word is hashed with the same randomized hashing to check whether it is similar to words in the vocabulary.

However, due to the other property of LSH families introduced in Definition  2.1, dissimilar words can also have the same signature with an unavoidable small probability. For real-world datasets with millions of distinct words, the chance of at least one pair of dissimilar words sharing a signature becomes significant, which leads to poor results. Motivated by the Count-Min Sketch data structure (DBLP:journals/jal/CormodeM05; luo2017arrays; chen2018), we propose a graphical stabilization method to greatly decrease the likelihood of a word mapping to any dissimilar words while at the same time maintaining a high likelihood among similar words.

Stabilization by Weighted Cliques. We stabilize the word mappings by repeating the hashing process $R$ times and aggregating the results using a graph data structure (luo2017arrays). In the graph $G = (V, E)$, let every vertex represent a word in the dataset such that the number of vertices in the graph is equal to the number of unique words in the dataset. We also define the edge weight $w_{ij}$ to be the number of times two words $i$ and $j$ have the same signature value, where $0 \le w_{ij} \le R$. If two words have no hash collision, then they do not have an edge between them. In one repetition, words sharing the same signature form a clique. For $R$ independent repetitions, we have a weighted graph where the weight of each edge represents the number of repetitions in which two words shared the same LSH signature. Figure 2(b) provides an illustration of this LSH word-to-word graph.

Furthermore, we introduce an extra criterion with a threshold parameter $\alpha$ to determine if an edge weight is significant. Insignificant edges are pruned to decrease the likelihood of a word mapping to any dissimilar words. To be specific, if the edge weight satisfies $w_{ij} \ge \alpha R$, where $R$ is the number of repetitions, then the two words are assumed to be sufficiently similar; if there is no edge between words $i$ and $j$ or $w_{ij} < \alpha R$, then these two words are considered as distinct entities and we remove any edge between them in the graph. After $R$ repetitions and edge pruning, we interpret the words left in every connected component as sharing the same meaning. Note that the case $\alpha R \le 1$ means no edge will be pruned, which is simply the union of edges over all repetitions. For the $\alpha = 1$ case, only words with the same signature across all the repetitions are preserved, which is the intersection over all the repetitions.
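The stabilization procedure can be sketched end to end as follows. This is a simplified illustration under stated assumptions: a single salted MinHash over character n-grams stands in for the full signature pipeline, the pruning rule keeps an edge when its weight is at least `alpha * repetitions`, the input words are distinct and non-empty, and all function names are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict
from itertools import combinations


def ngrams(word, charlens=(1, 2, 3)):
    """Character n-gram set of a word (illustrative lengths)."""
    return {word[i:i + n] for n in charlens for i in range(len(word) - n + 1)}


def signature(word, rep):
    """Stand-in LSH signature: one MinHash of the n-gram set, salted by repetition."""
    def rank(gram):
        digest = hashlib.blake2b(
            gram.encode("utf-8"), digest_size=8, salt=rep.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little")
    return min(rank(g) for g in ngrams(word))


def build_pruned_graph(words, repetitions, alpha):
    """Count signature collisions over all repetitions, then drop light edges."""
    weights = Counter()
    for rep in range(repetitions):
        buckets = defaultdict(list)
        for w in words:
            buckets[signature(w, rep)].append(w)
        for bucket in buckets.values():  # words in one bucket form a clique
            for u, v in combinations(sorted(bucket), 2):
                weights[(u, v)] += 1
    threshold = alpha * repetitions
    return {edge: wt for edge, wt in weights.items() if wt >= threshold}


def connected_components(words, edges):
    """Group words linked by surviving edges, using a small union-find."""
    parent = {w: w for w in words}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = defaultdict(set)
    for w in words:
        groups[find(w)].add(w)
    return list(groups.values())


if __name__ == "__main__":
    vocab = ["hello", "helloo", "zzz"]
    kept = build_pruned_graph(vocab, repetitions=50, alpha=0.5)
    print(kept)
    print(connected_components(vocab, kept))
```

Raising `alpha` trades recall for precision: fewer edges survive, so components become smaller and less likely to merge dissimilar words.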

Finally, the output of the algorithm is a linked list-style vocabulary, where the lists of misspelled words point to their representatives. Here we let the most frequent token in every connected component be the representative for this group of similar words. Figure 2(c) gives one example of the output.
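Selecting representatives is a simple lookup once the components are known. The sketch below assumes the components and word frequencies are already available; the name `build_vocabulary` and the alphabetical tie-breaking rule are assumptions, since the text does not say how ties are resolved.

```python
from typing import Dict, Iterable, Set


def build_vocabulary(components: Iterable[Set[str]],
                     freq: Dict[str, int]) -> Dict[str, str]:
    """Map every word to the most frequent word of its connected component."""
    vocab: Dict[str, str] = {}
    for comp in components:
        # Highest frequency wins; ties are broken alphabetically (an assumption).
        representative = min(comp, key=lambda w: (-freq.get(w, 0), w))
        for word in comp:
            vocab[word] = representative
    return vocab


if __name__ == "__main__":
    components = [{"hello", "helo", "helloo"}, {"world"}]
    freq = {"hello": 100, "helo": 3, "helloo": 2, "world": 50}
    print(build_vocabulary(components, freq))
```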

Inference Criterion. Once we create a vocabulary by FLAN, we can go through our input dataset and remap words to their morphological representatives. However, in the inference stage, there may exist words in the testing set that are not covered simply because they were not present in the indexing corpus. Thus, we introduce a criterion to decide if these new unseen words can be mapped back to the FLAN graph and determine which word in the vocabulary would be the best fit.

Given the extracted vocabulary from the indexing procedure and an unseen word $u$ in the testing set, we apply the LSH method to the word $u$ and check whether it has a collision with any word in the vocabulary. After $R$ repetitions, we have several edges that link a set of words in the vocabulary to this word $u$. Then we reuse our prior criterion, $w_{iu} \ge \alpha R$, for every edge that links to $u$. If there are no edges ($w_{iu} = 0$ for every vocabulary word $i$) or none of them satisfies the criterion, this word is declared not similar to any word in the vocabulary. If more than one word satisfies the criterion, we pick the word with the largest weight. Figure 2(d) presents two examples.
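The inference criterion can be sketched as a lookup against per-repetition hash tables. The signature function is passed in as a parameter, so any LSH signature can be plugged in; the toy prefix-based signature in the usage example is purely illustrative and is not an LSH family. The threshold form `alpha * repetitions`, the alphabetical tie-breaking, and the name `infer` are assumptions.

```python
from collections import Counter, defaultdict
from typing import Callable, Hashable, Iterable, Optional


def infer(word: str, vocab_words: Iterable[str],
          signature: Callable[[str, int], Hashable],
          repetitions: int, alpha: float) -> Optional[str]:
    """Return the vocabulary word that collides with `word` most often, if any passes."""
    vocab_words = list(vocab_words)
    weights = Counter()
    for rep in range(repetitions):
        # Index the vocabulary by signature for this repetition.
        table = defaultdict(list)
        for v in vocab_words:
            table[signature(v, rep)].append(v)
        for v in table.get(signature(word, rep), []):
            weights[v] += 1
    threshold = alpha * repetitions
    candidates = [v for v, wt in weights.items() if wt >= threshold]
    if not candidates:
        return None  # the word is not similar to anything in the vocabulary
    # Largest weight wins; ties are broken alphabetically (an assumption).
    return min(candidates, key=lambda v: (-weights[v], v))


if __name__ == "__main__":
    # Toy signature: a prefix whose length cycles through 1, 2, 3.
    toy_signature = lambda w, rep: w[: rep % 3 + 1]
    vocabulary = ["hello", "hexagon", "world"]
    print(infer("helo", vocabulary, toy_signature, repetitions=6, alpha=0.5))
    print(infer("zebra", vocabulary, toy_signature, repetitions=6, alpha=0.5))
```

In a real deployment the per-repetition tables would be built once at indexing time rather than rebuilt for every query, as done here for brevity.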

3.3. Error Analysis of Similarity Estimation

Given a set $V$ of distinct words and several clusters $C_1, \dots, C_k$, each word belongs to one and only one cluster. Let $c(\cdot)$ be a mapping from a node to its appropriate partition. Each partition can be viewed as a connected component of lexically similar words. Similar to Definition 2.1, let $p$ denote the minimum probability of an intra-cluster edge and $q$ be the maximum probability of an inter-cluster edge. Let $G = (V, E)$ denote the FLAN graph as described in the previous section. The probability of the graph having the edge $(i, j)$ is:

$$\Pr[(i, j) \in E] \; \begin{cases} \ge p, & \text{if } c(i) = c(j), \\ \le q, & \text{otherwise.} \end{cases}$$

In practice, we can think of $p > q$, as we expect the lexical similarity between words in the same component to be larger than those across the components. LSH seeks to estimate $p$ and $q$ as modeling the Jaccard similarity between words. The stabilization step is $R$ coin flips with probabilities $p$ or $q$. We first upper bound the probability of an unrelated word being included in the wrong connected component, which is the false positive probability. Then, we bound the probability that a word will not be assigned to its proper cluster by edge pruning, referred to as the false negative probability.

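Prop. 3.1 (False Positive Probability).

Fix a node $i$. The probability that FLAN will connect $i$ to a node in a cluster $C$ with $i \notin C$ is at most $|C| \, e^{-c_1 R}$, where $|C|$ is cluster $C$'s size, $R$ is the number of repetitions, and $c_1 > 0$ is a constant determined by $q$ and the pruning threshold.

Proof. Recall that $w_{ij}$ denotes the weight assigned to edge $(i, j)$, and that an edge survives pruning when $w_{ij} \ge \alpha R$. Using a union bound and a Chernoff bound, we have

$$\Pr[\exists\, j \in C : w_{ij} \ge \alpha R] \le \sum_{j \in C} \Pr[w_{ij} \ge \alpha R] \le |C| \, e^{-c_1 R},$$

where the second inequality follows from setting the threshold ratio $\alpha > q$. ∎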

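Prop. 3.2 (False Negative Probability).

Fix a node $i$ in a cluster $C$. The probability that FLAN will not add an edge from $i$ to any of the other nodes in $C$ is at most $e^{-c_2 R (|C| - 1)}$, where $R$ is the number of repetitions and $c_2 > 0$ is a constant determined by $p$ and the pruning threshold.

Proof. We note that $i$ does not share an edge with another word $j$ in the FLAN graph if the edge weight $w_{ij}$ is smaller than the threshold $\alpha R$ after applying $R$ repetitions. By the fact that the presence of each edge is an independent event and another Chernoff bound, we have that

$$\Pr[\forall\, j \in C \setminus \{i\} : w_{ij} < \alpha R] = \prod_{j \in C \setminus \{i\}} \Pr[w_{ij} < \alpha R] \le e^{-c_2 R (|C| - 1)},$$

where the inequality follows from setting the threshold ratio $\alpha < p$. ∎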

The propositions imply that the probabilities of a false positive and a false negative event decrease exponentially with more repetitions. Furthermore, we note that one meaningful difference between the two bounds is the dependence on the cluster size $|C|$. In the first case, a larger size increases the error probability, while it decreases this quantity in the latter bound.
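One way to see where bounds of this shape come from is the following sketch. It is not taken from the propositions themselves: it assumes that each repetition contributes an independent Bernoulli indicator to an edge weight $w_{ij}$, that pruning keeps an edge when $w_{ij} \ge \alpha R$ for $R$ repetitions, and that $q < \alpha < p$. It uses Hoeffding's inequality, so the constants may differ from those obtained with another Chernoff-type bound.

```latex
% False positive: node i outside cluster C, each inter-cluster collision has probability at most q.
\Pr\bigl[\exists\, j \in C : w_{ij} \ge \alpha R\bigr]
  \;\le\; \sum_{j \in C} \Pr\bigl[w_{ij} \ge \alpha R\bigr]
  \;\le\; |C| \exp\bigl(-2R(\alpha - q)^2\bigr).

% False negative: node i inside cluster C, each intra-cluster collision has probability at least p,
% and edges are treated as independent.
\Pr\bigl[\forall\, j \in C \setminus \{i\} : w_{ij} < \alpha R\bigr]
  \;=\; \prod_{j \in C \setminus \{i\}} \Pr\bigl[w_{ij} < \alpha R\bigr]
  \;\le\; \exp\bigl(-2R(p - \alpha)^2 (|C| - 1)\bigr).

% With the midpoint choice \alpha = (p + q)/2, both exponents use 2R((p-q)/2)^2 = R(p-q)^2/2.
```

Under these assumptions the first bound grows linearly with the cluster size while the second shrinks exponentially in it, which matches the qualitative contrast noted above.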

Running Time Analysis. In FLAN, the time complexity of computing the hash values grows with the number of words in the dataset $n$, the number of repetitions of LSH $R$, and the average number of characters in a word in the dataset $\ell$. Here $\ell$ is usually small relative to $n$. Afterward, the complexity of constructing the graph is governed by the expected number of items in each bucket of the hash table and the number of buckets, followed by the final graph pruning. In practice, $n$ is on the order of millions or billions and thus dominates asymptotically, which simplifies the overall bound. Note that the major speedup of this method comes from prior breakthroughs in efficiently computing the MinHash signatures (DBLP:journals/rsa/BroderM01).

For comparison, the running time of a spell correction algorithm based on edit distance scales with the number of possible characters to be deleted, replaced, or inserted. Given a word, a spell corrector will consider all the neighboring words within, for example, one and two steps of edit distance, then pick the neighboring word with the maximum score in the dictionary. Usually, the number of possible characters is much larger than the number of repetitions $R$. FLAN would further improve upon the speed of edit distance-based algorithms in distributed settings where we can compute these repetitions in parallel.
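To make the candidate growth concrete, the sketch below enumerates the one-step and two-step neighbors of a word under deletion, replacement, and insertion. It is a simplified illustration in the style of classic dictionary-based spell correctors, not any specific baseline from the experiments; the lowercase ASCII alphabet and the function names are assumptions.

```python
import string


def edits1(word: str, alphabet: str = string.ascii_lowercase) -> set:
    """All strings one deletion, replacement, or insertion away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    replaces = {left + c + right[1:] for left, right in splits if right for c in alphabet}
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return deletes | replaces | inserts


def edits2(word: str, alphabet: str = string.ascii_lowercase) -> set:
    """All strings reachable with two edit steps."""
    return {e2 for e1 in edits1(word, alphabet) for e2 in edits1(e1, alphabet)}


if __name__ == "__main__":
    print(len(edits1("hello")))  # number of one-step candidates
    print(len(edits2("hello")))  # number of two-step candidates
```

The one-step candidate count grows with both the word length and the alphabet size, and the two-step count grows roughly with its square, whereas the number of LSH repetitions is fixed in advance.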

3.4. Connection to Existing Approaches

The distance measurement used in our method is an extension and relaxation of classic stemming operations (10.5555/188490.188499), where two words with the same stem would be of identical meaning. Our method not only identifies two words sharing the same prefix or suffix strings with high similarity, but also matches on any substrings of the word, depending on the composition of CHARLENS.

The FLAN graph also captures common tendencies in human errors, such as substituting adjacent characters on a keyboard or similar-sounding characters. It reduces the effort of generating features for finding patterns in typos. For a connected component of the graph, words with adjacent or similar sounding characters are included with high probability. These misspelled words are then mapped to the representative word in the final pruned graph.

Popular spell correction methods like Hunspell and Aspell find words that have a “sounds like” word within a given edit distance of the original string. For FLAN, the words in a given connected component of the graph include those with small edit distance with high probability, but this component is also likely to include words with longer edit distances, offering a dynamic and generalized way for correction. Experimental evidence for this property is collected in Table 4.

Moreover, supervised learning methods that build upon rich feature sets about human typing and spelling patterns work well on small-scale and domain-specific datasets. However, different languages and various data domains usually require adjustments, additional labeled annotations, and further feature engineering. Such expert knowledge becomes quite expensive to acquire when we scale to massive data and various languages. Our method, with no such features over typing, spelling, devices, or languages, uses multiple repetitions and pruned edge weights as statistical estimators. FLAN can effectively and efficiently normalize words to a canonical form without any supervised learning, annotations, or feature engineering.

4. Experimental Study

4.1. Experiments Setup

Datasets. We consider datasets from Twitter, Reddit, the GLUE benchmark (wang-etal-2018-glue) with perturbed text, and data sampled from the logs of a large e-commerce search engine. The Twitter sentiment140 dataset contains 1.6 million tweets with 0.7 million distinct words (DBLP:conf/comsnets/SahniCCS17). The Reddit dataset has 10 million sentences with 2.7 million distinct words (DBLP:conf/emnlp/VolskePSS17). For the GLUE benchmark, we consider the MRPC, STSB, RTE, CoLA, and SST2 datasets, which cover single-sentence prediction, sentence similarity and paraphrase, and language inference tasks. The e-commerce product search logs contain 100 million lines of product and search texts with 3.2 million unique words. Note that the Reddit dataset is unlabeled, so we only use this corpus to measure the time efficiency and correction quality of various normalization techniques and not the performance on downstream machine learning tasks.

Baselines. We consider methods with different similarity measurements for comparison: 1) edit distance with a standard word dictionary, where the current popular algorithm (DBLP:journals/jdiq/Al-Hussaini17) as well as the classic method (norvig2009natural) are included; and 2) cosine similarity over a pretrained word-embedding space, where we use Glove (pennington2014glove) and Fasttext (mikolov-etal-2018-advances) as the word embeddings and apply maximum inner-product search via the FAISS library for searching over the high-dimensional space (8733051). Note that several lexical normalization methods are not included in this research, because either the source code is not shared (DBLP:conf/aclnut/SupranovichP15), the methods require annotated lexical normalization datasets (DBLP:conf/lrec/GootRCCM20), a long pipeline with several human-defined rules is needed (DBLP:conf/www/GuzmanL16), the methods are built upon morphological and phonetic features that are defined by domain experts (han2011lexical), or the code dependencies are no longer maintained (DBLP:conf/acl/Goot19).

Evaluation Metrics. We evaluate FLAN as well as the aforementioned lexical normalization baselines in terms of: 1) computational efficiency, which evaluates the exact running time of every algorithm, 2) correction quality, measuring the goodness of correction with human evaluators, and 3) impact on downstream applications, namely Twitter sentiment classification and perturbed GLUE benchmark datasets.

Hyperparameter Settings. For the hyperparameters in FLAN, we set CHARLENS to be . Furthermore, we set the universe size to and partition the space into bins. The -universal hashing function we use in our experiments is , where and the prime number . The random seed is also fixed for reproducibility. The number of repetitions is set to . We note that the number of repetitions determines both the memory and the running time of the FLAN algorithm: a higher number of repetitions gives a higher-quality normalization, while a lower number gives a faster algorithm. Every algorithm runs on a 20-core CPU with a frequency of 3.8 GHz. We set the threshold ratio for removing low-weight edges. Figure 3 provides a detailed analysis for selecting this threshold.
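As an illustration of how these hyperparameters interact, the sketch below builds MinHash signatures over character n-grams with one hash function per repetition. The numeric values of `PRIME`, `NUM_BINS`, `CHAR_LEN` and `NUM_REPS` are placeholders, since the actual settings are not recoverable from the text above, and the functional form `((a*x + b) mod p) mod m` is the standard Carter-Wegman construction, assumed here rather than taken from the paper. The helper names are likewise hypothetical.

```python
import random

PRIME = 2_147_483_647   # assumed prime (2**31 - 1), not the paper's value
NUM_BINS = 1 << 16      # assumed number of bins
CHAR_LEN = 3            # assumed character n-gram length
NUM_REPS = 4            # assumed number of repetitions


def make_hash(rng):
    """Draw one function h(x) = ((a*x + b) mod PRIME) mod NUM_BINS."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)
    return lambda x: ((a * x + b) % PRIME) % NUM_BINS


def char_ngrams(word, n=CHAR_LEN):
    """Set of character n-grams; short words fall back to the word itself."""
    if len(word) <= n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def gram_to_int(gram):
    """Deterministic integer encoding of an n-gram (Python's hash() is salted)."""
    return int.from_bytes(gram.encode("utf-8"), "big")


def minhash_signature(word, hash_fns):
    """One MinHash value per repetition: the minimum hashed n-gram."""
    ids = [gram_to_int(g) for g in char_ngrams(word)]
    return tuple(min(h(x) for x in ids) for h in hash_fns)


rng = random.Random(0)  # fixed seed for reproducibility
hash_fns = [make_hash(rng) for _ in range(NUM_REPS)]
```

Each extra repetition adds one more hash function and one more signature entry per word, which is why memory and running time both grow with the number of repetitions.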

4.2. Correction Efficiency

We compare the running time of all the methods over large scale datasets. For the “Indexing” procedure, we first extract all the words from the text corpora along with the frequencies of the words. Then, the words are fed into every algorithm, where the output is either the original word or the corrected one. This measures the overall time to create the correction mapping for the whole training set. Only the time used for lexical normalization is calculated for these benchmarks. Specifically, for the “Single” case, the whole algorithm is applied over one process. For the “Multi” case, we partition the workload equally over 20 processors. For the “Inference” step, we benchmark the overall time for mapping words to their normalized form following the indexing stage. As shown in Table 1, we observe that FLAN has a faster running time and scales better to the dataset size than the baseline methods across both the indexing and inference stages.
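The two benchmarked stages can be sketched as follows. This is only an outline of the measurement setup under stated assumptions: `toy_normalize` is a placeholder normalizer invented for the example (it is not FLAN or any baseline), and whitespace tokenization with lowercasing is also an assumption.

```python
from collections import Counter


def word_frequencies(lines):
    """Indexing input: every word in the corpus with its frequency."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def toy_normalize(word, counts):
    """Placeholder normalizer (NOT FLAN): drop one doubled final letter
    when the shorter form is at least as frequent in the corpus."""
    if len(word) > 2 and word[-1] == word[-2]:
        shorter = word[:-1]
        if counts[shorter] >= counts[word]:
            return shorter
    return word


def build_correction_map(counts, normalize):
    """Indexing: run the normalizer once per unique word."""
    return {w: normalize(w, counts) for w in counts}


def apply_correction_map(line, mapping):
    """Inference: one dictionary lookup per token; unseen words pass through."""
    return " ".join(mapping.get(tok, tok) for tok in line.lower().split())
```

Because the mapping is built once per unique word, the indexing cost depends on vocabulary size rather than on the number of lines, while inference reduces to dictionary lookups.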

Dataset Methods Indexing, Single (mins) Indexing, Multi (mins) Inference (mins)
Twitter FLAN ()
Reddit FLAN ()
Table 1. Running time of lexical normalization methods over Twitter and Reddit Datasets. FLAN scales better to the dataset size and is faster over Indexing and Inference scenarios than the competing approaches.

We acknowledge that the computation time is impacted by the choice of programming language, specific libraries, and software engineering optimizations such as caching and precomputation. The core algorithm in FAISS is implemented in C++ while the rest of the methods we benchmark are implemented in Python.

4.3. Correction Effectiveness

To evaluate the quality of the corrections made by each lexical normalization method, we conducted a study with native English speakers. We first select 100 sentences from both the Twitter and Reddit datasets, feed the sentences into each of our algorithms, and then extract the corrected output sentences. We create a questionnaire for the corrected sentences and deploy it on Amazon Mechanical Turk. Five different native speakers evaluated the quality of each sentence after correction. Each reviewer was asked to label every corrected sentence as either “Good”, “Neutral”, “Bad”, or “Not Sure”. The label “Good” signifies that the corrections make the meaning of the text clearer or more grammatically correct. The label “Bad” denotes that the corrections make the meaning of the text less clear or less grammatically correct. The label “Neutral” signifies that the corrections neither improve nor diminish the clarity of the text.

Datasets Methods Precision Recall F1-Score
Twitter FLAN ()
Reddit FLAN (=)
Table 2. Human evaluation of the quality of word corrections. For the Twitter dataset, FLAN has a higher Recall and F1-Score than the baselines. For the Reddit dataset, FLAN has a higher Precision and F1-Score than the baselines.

To evaluate recall, we conduct a second human evaluation to judge whether the input sentences contain any spelling errors or typos that require lexical normalization. We repeat the same process as above, but the reviewers label every input sentence as “Yes” or “No”.

To assess the results from this study, we consider “Good” and “Neutral” as correct results and regard “Bad” as incorrect. Precision is calculated as the ratio of the number of correct results to the total number of corrections. Recall is defined as the fraction of problematic sentences that are corrected to “Good”. The F1-Score is calculated from Precision and Recall (muller2019enhancing; DBLP:conf/lrec/GootRCCM20).
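The metric definitions above can be written out as a short sketch. It assumes one aggregated label per corrected sentence (how the five reviewer votes are combined is not specified here), treats “Not Sure” as excluded from the precision denominator, and takes F1 as the usual harmonic mean of precision and recall; the function name `correction_scores` is hypothetical.

```python
def correction_scores(labels, needs_fix):
    """labels[i]: aggregated label for corrected sentence i
    ('good', 'neutral', 'bad', or 'not sure').
    needs_fix[i]: True if reviewers judged the input sentence to contain errors."""
    # Precision: correct results ('good' or 'neutral') over judged corrections.
    judged = [l for l in labels if l in ("good", "neutral", "bad")]
    correct = sum(l in ("good", "neutral") for l in judged)
    precision = correct / len(judged) if judged else 0.0

    # Recall: problematic inputs whose correction was labeled 'good'.
    problematic = [l for l, flag in zip(labels, needs_fix) if flag]
    fixed = sum(l == "good" for l in problematic)
    recall = fixed / len(problematic) if problematic else 0.0

    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Note that “Neutral” counts toward precision but not toward recall, so a method that leaves problematic sentences unchanged can score high precision with low recall.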

The results are presented in Table 2. For the Twitter dataset, we observe that FLAN has the highest recall and F1 score, while the FAISS-Glove method has the highest precision. For the Reddit dataset, FLAN has the highest precision and F1 score among the compared methods. However, we still observe some failure cases with FLAN, such as mapping “evga” and “vga” together. Disambiguating such pairs would likely require more information on the surrounding context of a word. We defer this investigation to future work.

4.4. Impact on Downstream Applications

Twitter Sentiment Analysis:

We evaluate the impact of lexical normalization on real-world noisy tweets. The task is to classify the sentiment of a given tweet as positive or negative. For the neural learning model, we use the sum of word vectors as the sentence representation, which is then mapped to a two-dimensional output vector via an affine transformation. The learning objective is to minimize the logistic loss between the predicted label and the ground-truth label. The word vectors inside the model are randomly initialized and we set the dimension to . Prior to training the model, we apply each of the lexical normalization techniques we study in our experiments. We report the accuracy on the testing set, which we also normalize, at the point where the model reaches its best result on the corresponding validation set.
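The classifier described above can be sketched in plain Python as a forward pass plus loss. This is an illustrative reconstruction, not the authors' code: the class name, the initialization scale, and the default `dim=8` are assumptions (the actual embedding dimension is not recoverable from the text), and the training loop is omitted.

```python
import math
import random


class BagOfVectorsClassifier:
    """Sum of word vectors followed by an affine map to two logits."""

    def __init__(self, vocab, dim=8, seed=0):  # dim is an assumed placeholder
        rng = random.Random(seed)
        self.index = {w: i for i, w in enumerate(vocab)}
        # Randomly initialized word vectors and affine weights.
        self.emb = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in vocab]
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(2)]
        self.b = [0.0, 0.0]

    def logits(self, tokens):
        dim = len(self.W[0])
        sent = [0.0] * dim
        for tok in tokens:
            if tok in self.index:  # unknown words are skipped
                vec = self.emb[self.index[tok]]
                sent = [s + v for s, v in zip(sent, vec)]
        return [sum(w * s for w, s in zip(row, sent)) + b
                for row, b in zip(self.W, self.b)]

    def loss(self, tokens, label):
        """Logistic (softmax cross-entropy) loss for label in {0, 1}."""
        z = self.logits(tokens)
        m = max(z)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        return log_norm - z[label]
```

With a model this simple, every misspelled variant has its own randomly initialized vector, so merging variants through normalization directly increases the amount of training signal per vector.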

Methods Valid Accuracy Test Accuracy
No Correction
FAISS + Glove
FAISS + Fasttext
Table 3. Accuracy results on the Twitter dataset. FLAN improves the accuracy on the validation set by and on the testing set by against all the baselines.

As shown in Table 3, we observe that the edit-distance baselines (DBLP:journals/jdiq/Al-Hussaini17; norvig2009natural) do not improve the classification result, because of the large mismatch between the language style on Twitter and formal writing. FLAN does not introduce such a domain mismatch.

Perturbed GLUE Benchmark: To further investigate the impact of lexical normalization tools on related NLP tasks, we consider 5 subtasks of the popular GLUE benchmark (wang-etal-2018-glue). As the GLUE datasets are of high quality, we follow previous approaches (doval2019towards; DBLP:journals/corr/abs-2104-08420) in randomly perturbing the words in the validation and testing datasets while keeping the training set fixed. We generate synthetic lexical errors at and rates of noise: we perturb a sentence with probability equal to this rate and then select 1-2 characters uniformly at random in every word of the sentence to delete or replace with another random character. Note that the synthesized typos differ from real errors, which follow a more structured distribution. We use a pretrained DistilBert model (DBLP:journals/corr/abs-1910-01108), which we then fine-tune on the training set for 10 epochs. We then evaluate on the perturbed test sets after applying a normalization algorithm as a cleaning step. We also include a “No correction” baseline as part of our study.
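The perturbation procedure can be sketched as below. Details the text leaves open are filled in as labeled assumptions: deletion and replacement are chosen with equal probability, replacement characters come from lowercase ASCII, and single-character words are never deleted so that the word count is preserved. The function names are hypothetical.

```python
import random
import string


def perturb_word(word, rng):
    """Delete or replace 1-2 randomly chosen characters of a word."""
    chars = list(word)
    k = min(rng.choice([1, 2]), len(chars))
    # Visit positions right to left so deletions do not shift later indices.
    for pos in sorted(rng.sample(range(len(chars)), k), reverse=True):
        if len(chars) > 1 and rng.random() < 0.5:  # assumed 50/50 split
            del chars[pos]                                   # deletion
        else:
            chars[pos] = rng.choice(string.ascii_lowercase)  # replacement
    return "".join(chars)


def perturb_sentence(sentence, rate, rng):
    """With probability `rate`, perturb every word; otherwise return unchanged."""
    if rng.random() >= rate:
        return sentence
    return " ".join(perturb_word(w, rng) for w in sentence.split())
```

A replacement may occasionally draw the original character, so a perturbed word is not guaranteed to differ from its input; passing a seeded `random.Random` keeps the generated test sets reproducible.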

The results can be found in Table 5. We observe that as the noise rate becomes higher, the relative improvement of FLAN over the “No correction” baseline and the other competing approaches becomes larger on all the chosen subtasks. This shows that FLAN is better able to recover the words and improve the quality of the sentences.

4.5. Detailed Inspections

Ablation Study on Threshold . In Figure 3, we plot the effect of the graph pruning threshold on the behavior of FLAN. When we see that FLAN corrects nearly every word in the corpus. However, when we set or , we note that this correction coverage drops rapidly, which empirically demonstrates the exponential decay from applying more repetitions that we discussed previously. We also plot the correction coverage of our baseline methods for reference. Based on these results, we selected as the pruning threshold in our experiments since it provided a balance between covering words and not introducing too much noise.

Figure 3. Percentage of words covered by the vocabulary () on the Twitter and Reddit datasets. FLAN’s coverage is determined by the threshold .
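To illustrate how a pruning threshold of this kind could act on the word graph, the sketch below assumes that an edge weight counts the repetitions in which two words fall into the same hash bucket, and that an edge is kept when its weight divided by the number of repetitions is at least the threshold. Both the weighting and the comparison rule are assumptions made for the example, as are the function names.

```python
from collections import Counter
from itertools import combinations


def collision_edge_weights(signatures):
    """signatures: {word: tuple of bucket ids, one per repetition}.
    Edge weight = number of repetitions in which two words share a bucket."""
    weights = Counter()
    num_reps = len(next(iter(signatures.values())))
    for r in range(num_reps):
        buckets = {}
        for word, sig in signatures.items():
            buckets.setdefault(sig[r], []).append(word)
        for members in buckets.values():
            for u, v in combinations(sorted(members), 2):
                weights[(u, v)] += 1
    return weights, num_reps


def prune_edges(weights, num_reps, threshold):
    """Keep edges whose collision ratio is at least `threshold`."""
    return {e: w for e, w in weights.items() if w / num_reps >= threshold}
```

Under these assumptions a threshold of zero keeps every edge, including those created by a single spurious collision, whereas a larger threshold keeps only pairs that collide repeatedly, which trades coverage for fewer unrelated matches.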

Case Study for Connected Components in the FLAN Graph. We collect the results of select connected components after the LSH mapping, repetition, and pruning steps. The left column in Table 4 is the representative word for the connected component while the right column illustrates other words in the connected component that are mapped to the representative.

Representative Connected Components
there thereâ, therem, therea, ithere, therer
night gnight, nightï, nightâ, gnightâ, dnight, nighti
friends friend, friendsss, friendz, friendss, friendzz, friendsssss, myfriends, friendssss, vfriends, myfriend, friendâ, friend1
feeling feelin, feelingz, feelingg, feelinga, feelinf,
morning mornings, gmorning, morningg, gmornin, morningss, morningo, gmorningg, smorning, morningstar, morningâ, morningon
amazing amazingg, amazinggg, mazing, mazinggg, amazinggggg, amazinggggggg, amazingggg, amazinggggggggggg, amazingggggggggg, mazingggg, amazinggggggggg, soamazing, amazings, amazingggggg
Table 4. Connected components in the graph constructed by FLAN over the Twitter dataset. The left column is the representative word for every connected component and the right column shows other words in the corresponding connected component. We observe that FLAN can capture patterns from typographical errors on keyboards.

From Table 4, we observe that FLAN can successfully group words with minor character differences into the same connected component. These results also provide evidence for the effectiveness of our graph pruning strategy in preventing spurious hash collisions from leading to unrelated word matches. The table also shows that FLAN tends to convert plural forms into singular forms, or vice versa, based on the frequency distribution of these variations in the dataset. In addition, FLAN is able to map infrequent words, such as “amazinggggg” and “amazingggggg”, to a meaningful and frequent word in the indexed vocabulary. FLAN also captures typos involving characters in close keyboard proximity, such as “feelinf”, and fixes the habit of double typing (“feelingg” and “gmorningg”).
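The grouping shown in Table 4 can be mimicked with a short union-find sketch. It assumes that the representative of a connected component is its most frequent word, which is consistent with the frequency-based behavior described above but is not confirmed as the exact selection rule; the function names and toy counts are hypothetical.

```python
def connected_components(words, edges):
    """Union-find over the pruned graph; returns {word: component root}."""
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return {w: find(w) for w in words}


def representative_map(counts, edges):
    """Map every word to the most frequent word of its component."""
    roots = connected_components(counts, edges)
    best = {}
    for word, root in roots.items():
        if root not in best or counts[word] > counts[best[root]]:
            best[root] = word
    return {word: best[root] for word, root in roots.items()}
```

For example, with counts `{"amazing": 50, "amazingg": 3, "mazing": 1}` and edges linking them in a chain, all three words map to “amazing” even though “mazing” is not directly connected to it.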

Subtask Perturb Rate Metrics No corr. FLAN () norvig2009natural DBLP:journals/jdiq/Al-Hussaini17 FAISS-Glove FAISS-Fasttext
MRPC Accuracy
MRPC Accuracy
MRPC Accuracy
STSB Pearson
STSB Pearson
STSB Pearson
RTE Accuracy
RTE Accuracy
RTE Accuracy
CoLA Matthews
CoLA Matthews
CoLA Matthews
SST2 Accuracy
SST2 Accuracy
SST2 Accuracy
Table 5. Perturbed GLUE benchmark with all the lexical normalization algorithms. We observe that as the noise level becomes larger, FLAN helps recover more words and obtains better results than all the competing methods.

Ultimately, these results suggest that lexical normalization can aid in improving the quality of text-based models applied to noisy data, and FLAN provides a computationally scalable alternative to existing methods.

4.6. Large-Scale Case Study: Product Search

We further conducted offline experiments applying FLAN to normalize a dataset of hundreds of millions of search queries and product titles sampled from the logs of a large e-commerce search engine (DBLP:conf/kdd/NigamSMLDSTGY19). The structure of the neural model, the learning objective function and the evaluation metrics follow the same settings as DBLP:conf/kdd/NigamSMLDSTGY19. We observed that the edit-distance based methods (DBLP:journals/jdiq/Al-Hussaini17; norvig2009natural) were prohibitively slow to apply at this scale, requiring days to complete. Meanwhile, FLAN finished normalizing the entire dataset in roughly hours. The FAISS model, on the other hand, achieved poor recall when compared to FLAN and required the additional overhead of learning these word representations on the e-commerce query-product logs.

5. Conclusion

In this work, we investigated lexical normalization for cleaning real-world text data. We propose FLAN, a scalable randomized algorithm for cleaning and canonicalizing massive text data. By leveraging advances in randomized hashing, FLAN considerably reduces the computational complexity of large-scale text normalization. In particular, MinHash allows the all-pairs word comparisons to be approximated efficiently. Compared with existing approaches, FLAN does not need extra annotation, rule definition or feature generation.

Moreover, we propose using a graphical structure to detect and clean undesirable word associations due to random hash collisions to stabilize the correction quality. We further provide theoretical guarantees on the robustness of our algorithm with upper bounds on the false positive and false negative probabilities.

In experimental studies, we benchmark against several prevalent methods on several large-scale datasets. In running time analyses, FLAN is faster than methods based on edit distance and on maximum inner product search in high-dimensional word-embedding spaces. When measuring the quality of corrections, FLAN has a relatively higher recall and F1 score than the baselines as measured by human evaluation. Finally, we evaluate the end-to-end benefit of FLAN on two machine learning tasks, Twitter sentiment analysis and perturbed GLUE benchmarks, where we find that FLAN consistently improves the quality of noisy texts and helps the generalization of the model.