1 Introduction
Measuring similarities between strings is an essential component of a large number of language and string processing tasks, including information retrieval, computational biology, and natural language processing. A string metric is a function that quantifies the distance between two strings. The most widely known string metric is the edit distance, also known as the Levenshtein distance, which counts the number of substitution, insertion, or deletion operations needed to transform one string into another [levenshtein1966binary]. The fewer operations needed to go from one string to the other, the more similar the two strings are.
String metrics are key to the approximate string matching problem present in many fields. For example, natural language processing needs automatic spelling correction, and in bioinformatics measuring similarities between DNA sequences is a crucial task. Both are approximate string matching problems. Common to all string similarity metrics is that they are used to find matching patterns for a string that underwent some distortion, including, but not limited to, misspellings, abbreviations, slang, or irregularities in DNA sequences. The focus of this paper is a novel distance metric with advantageous properties not present in other string metrics, including consideration of misspellings and nonstandard usage of words, and inclusion of the string context.
The evolution and distortion of languages is nothing new. However, a consequence of the global social media era is the nonstandardization of languages, which means that the same phrase, and even the same word, can be communicated in a variety of ways within the same language. This evolution is a challenge for natural language processing, as any data handling, such as classification or translation, becomes less formal. All natural language processing would be much easier if everyone wrote the same way, which is unrealistic. A mitigation is to normalize nonstandard words to a more standard format that is easier to handle.
To normalize misspellings, nonstandard words and phrases, abbreviations, dialects, sociolects, and other text variations (referred to here as nonstandard spellings and nonstandard words), three approaches other than string metrics are available in the literature. The first is to view normalization of any nonstandard spelling as a translation problem [aw2006phrase]. One example is based on statistical tools that map nonstandard words to their English counterparts based on a probability distribution [kobus2008normalizing]. The translation method is certainly a promising strategy; however, using a method designed to capture the complex relationship between two different languages for word normalization is an overreach, given the strong relationship between English words and their nonstandard forms. A second approach is to consider the nonstandard spelling challenge as plain spell checking, correcting misspelled words based on a probability model [aw2006phrase]. The challenge with the latter is that the distance between nonstandard and standard spellings is often substantial; hence, far from all normalizations can be viewed as corrections. Third, normalizing nonstandard spelling can be viewed as a speech recognition problem [aw2006phrase, pennell2011toward]. In this approach, the texts are regarded as phonetic approximations of the correctly spelled message. What motivates this view is that many nonstandard spellings are written based on their phonetic rather than their normative spelling. However, this view is also an oversimplification of the nature of the texts, which contain nonstandard spellings that are not just phonetic spellings of the correct word. For example, texts containing abbreviations (lol for laugh out loud), truncated words (n for and), and leetspeak (4ever for forever) cannot be handled by this approach.

This paper proposes a new method that maps each word in a vocabulary into a real vector space. As a result, the distance between two words is the distance between the vector representations of those words in the real vector space. The mapping between the word vocabulary and the real vector space must satisfy two premises. The first premise is that the distance between the vector representations of a nonstandard word and its corrected form should be shorter than the distance between the nonstandard word and any other unrelated known word.
To achieve this constraint, the vector representation needs to spur positive correlations between a corrected spelling and every possible nonstandard spelling of that word, while minimizing correlation with all other words and their nonstandard spellings. The second premise is that the vector representation of a word should also be such that words with similar meanings have similar representations. We assume that two words have a similar meaning when they are often used in the same context. The context of a word is the collection of words surrounding it. To obtain such a representation, we mix a predictive word embedding method with a denoising autoencoder [bengio2009learning]. A denoising autoencoder is an artificial neural network that takes a data set as input, adds some noise to the data, and then tries to reconstruct the initial data from the noisy version. By performing this reconstruction, the denoising autoencoder learns the features present in the initial data in its hidden layer. In our approach, we consider the nonstandard spellings to be the noisy versions of the corrected word forms. In a predictive word embedding method, each word is represented by a real-valued vector learned based on the usage of words and their context. A neural network learns this real-valued vector representation in a way that minimizes the loss of predicting a word based on its context. This representation is in contrast to the representation in a bag-of-words model where, unless explicitly managed, different words have different representations, regardless of their use.

2 Background
A string metric or string distance function defines a distance d between every pair of elements of a set of strings S. Any distance function d on S must satisfy the following conditions for all x, y, z ∈ S [hazewinkel2013encyclopaedia]:
(1) d(x, y) ≥ 0 (non-negativity)
(2) d(x, y) = 0 ⟺ x = y (identity of indiscernibles)
(3) d(x, y) = d(y, x) (symmetry)
(4) d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Hence, the comparison between any two strings is greater than or equal to 0 (Equation 1), identical strings have distance 0 (Equation 2), the distance between two strings is independent of whether the first is compared to the second or vice versa (Equation 3), and the distance between two strings is always equal to or smaller than the sum of the distances obtained by including a third string in the measurement (Equation 4).
Over the years, several attempts at defining an all-encompassing string metric have been carried out. The most well known is the edit distance (Levenshtein distance), which has been one of the most widely used string comparison functions since its introduction in 1965 [levenshtein1966binary]. It counts the minimum number of operations (deletion, insertion, and substitution of a character) required to transform one string into another, and assigns a cost to each operation. For example, if the weight assigned to each operation is one, the distance between the words “vector” and “doctor” is two, since only two substitutions are required for a transformation. The edit distance satisfies all the requirements of a distance function (Equations 1, 2, 3, and 4).
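To make the operation counting concrete, here is a minimal dynamic-programming implementation of the unit-cost edit distance (an illustrative sketch, not code from the paper):

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances between "" and prefixes of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]
```

For example, `edit_distance("vector", "doctor")` returns 2, matching the two substitutions described above.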
The edit distance is called a simple edit distance when all operations have the same cost and a general edit distance when operations have different costs. Beyond that, the edit distance has four notable variants. First, the longest common subsequence (LCS) distance allows only insertions and deletions, each with cost one [bakkelund2009lcs]. A second simplification allows only substitutions; in this case, the distance is called the Hamming distance [hamming1950error]. Third, the Damerau-Levenshtein distance adds the transposition of two adjacent characters to the operations allowed by the edit distance [damerau1964technique]. Finally, the episode distance allows only insertions, each with cost 1. The episode distance is not symmetric and does not satisfy Equation 3: since insertions alone cannot always transform one string into another, the distance is either the difference in length between the two strings or infinite.
In 1992, Ukkonen [ukkonen1992approximate] introduced the q-gram distance. It is based on counting the number of occurrences of common q-grams (substrings of length q) in each string; the more q-grams the strings have in common, the closer their distance. The q-gram distance is not a metric function, since it does not obey the identity of indiscernibles requirement (Equation 2).
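A small sketch of the q-gram distance (one reading of Ukkonen's definition: the total difference in q-gram occurrence counts) also shows why it violates the identity of indiscernibles; the helper names here are ours:

```python
from collections import Counter

def qgrams(s: str, q: int) -> Counter:
    """Multiset of the substrings of length q occurring in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_distance(x: str, y: str, q: int = 2) -> int:
    """Sum over all q-grams of the difference in occurrence counts."""
    cx, cy = qgrams(x, q), qgrams(y, q)
    return sum(abs(cx[g] - cy[g]) for g in set(cx) | set(cy))
```

Two distinct strings can share exactly the same q-gram multiset: for instance `qgram_distance("aabab", "abaab")` is 0 even though the strings differ, so Equation 2 fails.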
Later, Kondrak [kondrak2005n] developed the notion of N-gram distance, in which he extended the edit and LCS distances to consider deletions, insertions, and substitutions of N-grams. The use of N-grams enabled some new statistical methods for string metrics originally from the field of samples and sets, introducing the notion of statistical string metrics: metrics that measure statistical properties of the compared strings. As an example, the Sørensen-Dice coefficient, initially a method used to compare the similarity between two samples, was used as a metric to measure the similarity between two strings [sorensen1948method, dice1945measures]. In the case of strings, the coefficient is computed as follows:

(5) s(x, y) = 2 n_t / (n_x + n_y)

where n_t is the number of character N-grams found in both strings, n_x is the number of N-grams in string x, and n_y is the number of N-grams in string y.
The Jaccard index is another statistical method used to compare the similarity between two sample sets, including strings. It is calculated as one minus the quotient of shared N-grams over all observed N-grams in both strings.
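Both coefficients can be sketched over character bigrams as follows (a set-based variant; counting N-grams with multiplicity is an equally valid reading of Equation 5):

```python
def ngram_set(s: str, n: int = 2) -> set:
    """Distinct character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def dice_coefficient(x: str, y: str, n: int = 2) -> float:
    """Sørensen-Dice: twice the shared n-grams over the total n-grams."""
    gx, gy = ngram_set(x, n), ngram_set(y, n)
    return 2 * len(gx & gy) / (len(gx) + len(gy))

def jaccard_distance(x: str, y: str, n: int = 2) -> float:
    """One minus shared n-grams over all observed n-grams in both strings."""
    gx, gy = ngram_set(x, n), ngram_set(y, n)
    return 1 - len(gx & gy) / len(gx | gy)
```

For “vector” and “doctor”, three of the seven observed bigrams are shared, giving a Dice coefficient of 0.6 and a Jaccard distance of 4/7.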
Some vector similarity functions have been extended to include string similarity as well, of which the most notable is the string cosine similarity. It measures the cosine similarity between vector representations of the two strings to be compared. For English words, the vectors have size 26, one element for each letter of the alphabet, holding the number of occurrences of that letter in the string.
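A sketch of this character-count cosine similarity (the helper names are ours):

```python
import math
import string
from collections import Counter

def char_vector(s: str) -> list:
    """26-element vector of letter counts (a-z), ignoring case."""
    counts = Counter(s.lower())
    return [counts[ch] for ch in string.ascii_lowercase]

def string_cosine_similarity(x: str, y: str) -> float:
    """Cosine of the angle between the two letter-count vectors."""
    u, v = char_vector(x), char_vector(y)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Note that this representation ignores character order: anagrams such as “listen” and “silent” get similarity 1, which is one plausible reason the measure fares poorly on the normalization task in Section 4.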
The use of machine learning techniques for vector representations of words has been around since 1986 thanks to the work of Rumelhart, Hinton, and Williams [rumelhart1988learning]. String similarity measurements are used as features in supervised natural language processing tasks to increase the performance of the classifier. More recently, a method called locally linear embedding was introduced, which computes a low-dimensional, neighborhood-preserving embedding of high-dimensional input; it was applied to generate a two-dimensional embedding of words that conserves their semantics [roweis2000nonlinear]. Later, feedforward neural networks were used to generate distributed vector representations of words [bengio2003neural]. By predicting the next word given the previous words in the context, the neural network learns a vector representation of the words in its hidden layer. The method was extended to take into consideration the surrounding words, not only the previous words [mikolov2013distributed]. In the same context, the feedforward neural network was replaced by a restricted Boltzmann machine to produce the vector representations [mnih2007three]. A word vector representation variant learns, for each word, a low-dimensional linear projection of the one-hot encoding of the word by incorporating the projection into the energy function of a restricted Boltzmann machine [dahl2012training, hinton2009replicated]. Finally, GloVe is one of the most successful attempts at producing vector representations of words for string comparisons [pennington2014glove]. GloVe learns a log-bilinear model that combines the advantages of global matrix factorization and local context windows to produce a vector representation of words based on word counts. A vector similarity measure such as the Euclidean distance, the cosine similarity, or another vector measure can then be used to measure the similarity between two strings.

3 Word coding approach
The objective of this research is to find a function that maps words into a real vector space in such a way that the distance between the mappings of two similar words (i.e., nonstandard spellings of the same word, or words used in the same context) is small. To achieve this goal, the mapping needs to obey two constraints. The first constraint is that the distance in real vector space between the mapping of a word and its nonstandard versions must be shorter than the distance between that word and nonstandard versions of other words. The second constraint is that the distance in real vector space between the mappings of words with similar meanings must be shorter than the distance between words with dissimilar meanings. We define meaning by similar context: we assume that words used in the same context have a similar meaning. To model the first constraint, we use a denoising autoencoder, and to model the second constraint, we introduce a context encoder.
The denoising autoencoder and the context encoder are explained in Sections 3.1 and 3.2, respectively. The overall method is explained in Section 3.3. The notation, parameters, and functions used in this section are summarized in Appendix A.
3.1 Denoising autoencoder
An autoencoder is an unsupervised learning algorithm based on artificial neural networks in which the target value is equal to the input [bengio2009learning]. An autoencoder can, in its simplest form, be represented by a network composed of:
An input layer representing the feature vector of the input.

A hidden layer that applies a nonlinear transformation of the input.

An output layer representing the target value or the label.
Given a training example x, the autoencoder tries to learn a function h such that h(x) ≈ x, an approximation of the identity function. The identity function seems to be a trivial function to learn; however, if we put some constraints on the autoencoder, it can learn a function that captures features and structures in the data. For example, limiting the number of hidden units in the network to fewer than the input units forces the network to learn a compressed representation of the input. Instead of copying the value of the input into the hidden layer, the network must learn which parts of the input are more important and lead to a better reconstruction. Adding noise to the input is another constraint that forces the autoencoder to learn the most salient features of the data. By reconstructing the data based on its noisy version, the autoencoder undoes the effect of the noise. Undoing the noise can only be performed when the autoencoder learns the statistical dependencies between the inputs. In the latter case, the autoencoder is called a denoising autoencoder.
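The mechanics can be illustrated with a toy denoising autoencoder on one-hot vectors; the dimensions, the masking-noise model, and the learning rate below are assumptions for illustration, not the configuration used later in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoder:
    """One-hidden-layer denoising autoencoder trained by plain backpropagation."""

    def __init__(self, n_visible: int, n_hidden: int, lr: float = 0.5):
        self.W = rng.normal(0, 0.1, (n_hidden, n_visible))   # input -> hidden
        self.b = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_visible, n_hidden))  # hidden -> output
        self.b2 = np.zeros(n_visible)
        self.lr = lr

    def encode(self, x):
        return sigmoid(self.W @ x + self.b)

    def train_step(self, x_clean, x_noisy):
        """One gradient step: reconstruct the clean input from the noisy one."""
        h = self.encode(x_noisy)
        x_hat = sigmoid(self.W2 @ h + self.b2)
        # squared-error loss against the *clean* target, backprop by hand
        delta_out = (x_hat - x_clean) * x_hat * (1 - x_hat)
        delta_hid = (self.W2.T @ delta_out) * h * (1 - h)
        self.W2 -= self.lr * np.outer(delta_out, h)
        self.b2 -= self.lr * delta_out
        self.W -= self.lr * np.outer(delta_hid, x_noisy)
        self.b -= self.lr * delta_hid
        return float(np.mean((x_hat - x_clean) ** 2))

# Toy data: 8 "words" as one-hot vectors; masking noise drops ~30% of inputs.
dae = DenoisingAutoencoder(n_visible=8, n_hidden=4)
data = np.eye(8)
losses = []
for epoch in range(300):
    total = 0.0
    for x in data:
        x_noisy = x * (rng.random(8) > 0.3)
        total += dae.train_step(x, x_noisy)
    losses.append(total)
```

After training, the reconstruction loss has decreased from its initial value, showing that the hidden layer has learned to undo part of the corruption.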
In our approach, we input the nonstandard spelling to a denoising autoencoder and try to reconstruct the original word. Any nonstandard spelling of a word can be seen as a noisy version of the original word. The aim is that the network should learn two essential features: (1) The relations between nonstandard and standard word spellings, and (2) what separates the standard words. Both features should lie in the hidden layer, which is used to reconstruct the standard word from the nonstandard spellings.
The denoising autoencoder in our approach includes a vocabulary of words, which can be standard English words (the approach is not limited to English; however, all our examples are from the English language) and nonstandard variants of those words. The vocabulary consists of the following subsets: the set of standard words and, for every standard word, the set of its nonstandard spellings. We define an initialization function I that transforms a word in the vocabulary into a vector of real numbers. I can be a function that performs a one-hot encoding of the words in the vocabulary, maps each character in a word to a unique number, or assigns a random vector to each word. I can be represented by a matrix of free parameters.
The input of the denoising autoencoder is I(w̃), the initialization of the nonstandard spelling w̃ corresponding to a word w, and the output of the hidden layer is h. The reconstructed word x̂ output by the autoencoder should ideally be equal to I(w). The details are presented in Equation 6, where W is a matrix of weights and b is a bias term. Each element in W is associated with the connection between an element of I(w̃) and a hidden unit of the autoencoder.
(6) h = a(W I(w̃) + b), where a is the activation function of the hidden layer
The reconstruction x̂ of the original word by the output layer of the autoencoder is given by Equation 7, where W′ is a matrix of weights and b′ is a bias term. Each element in W′ is associated with the connection between a hidden unit of the autoencoder and an element of the reconstruction x̂.
(7) x̂ = a(W′ h + b′)
The overall architecture of the denoising autoencoder is presented in Figure 1.
For a distance function d in real vector space, the autoencoder learns the parameters W, W′, b, and b′ that minimize the loss function L, given by the distance between the initialization of the standard word w and the reconstruction produced from its nonstandard version (Equation 8).

(8) L = d(I(w), x̂)
The output of the hidden layer of an autoencoder can be fed as input to another autoencoder, which tries to reconstruct it. In this case, the second autoencoder learns features of the features learned by the first autoencoder: a second-degree feature abstraction of the input data. This process of stacking autoencoders can be repeated indefinitely, and the obtained network is called a deep belief network [bengio2009learning]. For each layer, all elements of W, W′, b, and b′ are updated using backpropagation and stochastic gradient descent [lecun2012efficient].
3.2 Context based coding
To increase the relevance of the denoising autoencoder, we connect each word with its context: the text close to the word as it is used in a setting. We define the context of a word as the sequence of words preceding it. The objective of the context-based encoding is to learn a model representing the probability of a word given its context, i.e., the likelihood of the word appearing after a given sequence of words. This method was first introduced by Bengio et al. [bengio2003neural]. We decompose the function in two parts:

A mapping C from each element of the vocabulary to a vector: the vector associated with each word in the vocabulary.

A probability function over the vector representations assigned by C: it maps an input sequence of vector representations of the words in a context to a conditional probability distribution over the words in the vocabulary for the next word.
Hence, the function is a composition of the two parts, C and the probability function. Some parameters are associated with each of these two parts: the parameters of C are the elements of the matrix containing the word vector representations, while the probability function may be implemented by a neural network with its own set of weights. Training is achieved by looking for the parameters that maximize the training corpus log-likelihood:
(9) L = (1/T) Σ_t log P(w_t | w_{t−1}, …, w_{t−n+1})
The neural network representing the probability function has a softmax output layer, which guarantees positive probabilities summing to 1:

(10) P(w_t = i | context) = e^{y_i} / Σ_j e^{y_j}

where y is computed from the output of the hidden layer of the neural network:

(11) y = b + W x + U a(d + H x)
where a is the activation function of the hidden layer, W, U, and H are matrices of weights, b and d are biases, and x is the feature vector built from the word vector representations in the matrix C: x = (C(w_{t−1}), …, C(w_{t−n+1})). The parameters of the model are the biases, the weight matrices, and the word representation matrix C. The overall architecture of the context encoder is presented in Figure 2.

3.3 Distance over word space
If the initialization I and the hidden-layer mapping are bijections, the denoising autoencoder transformation of the initialization of a word is a bijection from the word space to the real vector space. This observation provides the function D, giving the distance between the autoencoding representations of the words w₁ and w₂ (Equation 12), with the properties of a metric on the vocabulary. First, the non-negativity, triangle inequality, and symmetry of D are derived from the same properties of the underlying distance d. Second, the identity of indiscernibles is deduced from the observation that the composed functions are bijective. The advantage of this distance is that the hidden-layer mapping contains, in its weight matrix and bias, an encoding that captures the stochastic structure of the misspelling patterns of the words observed during the learning phase of the autoencoder. It is important to note that, in practice, D is a pseudometric on the vocabulary: since the purpose of the autoencoder is to minimize the distance between the vector representations of a correct word and its nonstandard versions, the identity of indiscernibles cannot be guaranteed.
(12) D(w₁, w₂) = d(h(I(w₁)), h(I(w₂)))
To get the mapping, the matrix C from Section 3.2 can be updated by combining the autoencoder from Section 3.1 with the context encoder. In this case, both methods work in parallel to update the vector representations of the words: the vector representations of standard words in the matrix C are learned using the context method, while the vector representations of nonstandard words are additionally calculated based on the autoencoder (see Figure 3). The denoising autoencoder we use in this case is a seven-layer deep autoencoder. The initialization function is a one-hot encoding of the words in the vocabulary, which determines the number of nodes in the input layer of the denoising autoencoder (not shown in Figure 3 for the sake of presentability). The combination of the autoencoder and the context coding to produce the mapping is used to define the function D_C in Equation 13. D_C can be seen as an extension of D that also includes the context of the words: it finds the distance between the words w₁ and w₂ using the combined mapping. D_C is not a metric on the vocabulary for the same reason D is not, and since the context encoding gives words used in the same context similar vector representations, the identity of indiscernibles is not guaranteed here either.
(13) D_C(w₁, w₂) = d(C(w₁), C(w₂))
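As a concrete illustration of the context encoder described in Section 3.2, a minimal forward pass might look as follows; the vocabulary size, embedding width, context length, and hidden width are toy values chosen for illustration, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions: 10 words, embeddings of size 4, 2 context words, 8 hidden units.
V, m, n_ctx, n_hid = 10, 4, 2, 8

C = rng.normal(0, 0.1, (V, m))               # word vector matrix C
H = rng.normal(0, 0.1, (n_hid, n_ctx * m))   # context-to-hidden weights
d = np.zeros(n_hid)                          # hidden bias
U = rng.normal(0, 0.1, (V, n_hid))           # hidden-to-output weights
W = rng.normal(0, 0.1, (V, n_ctx * m))       # direct context-to-output weights
b = np.zeros(V)                              # output bias

def next_word_distribution(context_ids):
    """Softmax distribution over the vocabulary for the word after the context."""
    x = np.concatenate([C[i] for i in context_ids])  # stacked context vectors
    a = np.tanh(H @ x + d)                           # hidden layer
    y = b + W @ x + U @ a                            # unnormalized scores
    e = np.exp(y - y.max())                          # numerically stable softmax
    return e / e.sum()
```

Training would adjust all of these parameters, including the rows of C, by gradient ascent on the corpus log-likelihood; the rows of C are the word representations that the distance functions of this section compare.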
4 Results and discussion
To test our approach, we used a data set composed of the 1051 most frequent words from Twitter paired with their various misspellings, from a data set that was initially used in an IBM data normalization challenge in 2015 (the data is available at https://noisytext.github.io/normsharedtask.html). To train the context-based encoder, we used a data set containing 97191 different sentences with vocabulary words and their nonstandard forms. It is important to note that the data is imbalanced: some words have only one nonstandard form, while other words have multiple nonstandard forms. This imbalance may introduce some challenges, since the autoencoder might not learn accurate features for words with few nonstandard versions. Since we used softmax units, the number of nodes in the last hidden layer of the autoencoder is fixed, and it also determines the length of the obtained encoding vector. The autoencoder is trained using mini-batch gradient descent with batches of 100 examples each and a learning rate of 0.01. The closest standard word is picked as the most likely standard version of the nonstandard spelling. By closest, we mean the word that has the smallest distance D or D_C to the nonstandard spelling, as defined in Section 3.3 (the code and data for these experiments are available at https://github.com/mehdimbl/WordCoding).
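The closest-word selection can be sketched as a simple nearest-neighbor lookup over the learned vectors; the vectors below are made-up stand-ins for the trained embeddings:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def closest_standard_word(query_vec, standard_vecs):
    """Return the standard word whose vector is most similar to the query."""
    return max(standard_vecs,
               key=lambda w: cosine_similarity(standard_vecs[w], query_vec))

# Illustrative embeddings only; real vectors come from the trained model.
standard = {
    "thing": [0.9, 0.1, 0.0],
    "water": [0.1, 0.8, 0.2],
    "staring": [0.0, 0.2, 0.9],
}
```

For a nonstandard spelling embedded near “thing”, e.g. a query vector [0.8, 0.2, 0.1], the lookup returns “thing”.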
Distance  Correct word is closest  Correct word among 5 closest

Cosine similarity  46.33%  60.22%
Q-gram  47.57%  62.41%
Sørensen-Dice coefficient  47.85%  60.03%
Edit distance  55.75%  68.22%
Weighted Levenshtein  55.85%  67.93%
Damerau-Levenshtein distance  56.51%  68.03%
N-gram  58.23%  76.49%
Metric Longest Common Subsequence  60.89%  75.73%
Longest Common Subsequence  61.37%  74.31%
Normalised Levenshtein  63.17%  78.30%
with Cosine similarity  83.82%  89.53% 
with distance  76.37%  81.53% 
with Euclidean distance  82.71%  87.35% 
with Cosine similarity  85.37%  89.61% 
Table 1 compares the results produced by our approaches with the existing string metrics presented in Section 2 in finding the correct version of a nonstandard spelling. The table shows a considerable increase in accuracy, from 63.17% for the best metric available in the literature (Normalised Levenshtein) to 85.37% when the combined distance D_C is used. The reason is that, unlike the state-of-the-art metrics, D_C captures stochastic word patterns shared between the nonstandard word and its correct form. Figure 4 shows the performance of the Normalised Levenshtein, D, and D_C in finding the standard spelling of a nonstandard word among its nearest neighbors; the x-axis represents the number of nearest neighbors. Figure 4 shows that after ten neighbors D starts to outperform D_C, because D is modeled by an autoencoder whose main purpose is to model such nonstandard words. With D_C, as we move farther from a word, the nearest words will include words used in a similar context, which are not necessarily standard versions of the word (see Table 3).
Our approach is not limited to one vector distance. In fact, the neural representations from the inner autoencoder can be compared with any vector similarity measure. Table 1 also compares the performance of the combined approach with different vector distances. The cosine similarity produces the best performance in this task, with 85.37%.
Table 2 shows the closest correct words to a sample of nonstandard spellings using the autoencoder without the context encoder. The results show that the closest words share some patterns with their nonstandard counterpart. For example, the four closest words to “starin” all end with “ing” and three of them start with “s” (staring, praying, sucking, slipping). Notice also the similarity in characters between the two closest words, “staring” and “praying”. The same can be said about the closest words to “ddnt”. In the case of “omg” and “justunfollow”, all the closest words are combinations of more than one word, which suggests that the autoencoder learns that they are abbreviations or combinations of words. The next examples in Table 2 present nonstandard spellings for which the approach with the denoising autoencoder fails to recognize the correct version among the five closest words: the correct version of “wada” is “water”, but our algorithm chooses “wanna” as the closest correct version. Even though it is not the correct guess, the resemblance between “wada” and “wanna” justifies the guess, and a human could arguably have made the same mistake. The same can also be said about “bea” and “the”. For “dats”, the algorithm picks the correct word as the fourth closest word. However, the first pick (“what’s”) can also be justified, since it shares three characters with the nonstandard spelling.
Nonstandard spellings  Closest word  2nd closest word  3rd closest word  4th closest word  5th closest word  Correct word 
thng  thing  there  wanna  right  where  thing 
starin  staring  praying  sucking  slipping  weekend  staring 
omg  oh my god  at least  in front  in spite  what’s up  oh my god 
ddnt  didn’t  that’s  aren’t  what’s  better  didn’t 
justunfollow  just unfollow  what about you  ultra violence  direct message  what are you doing  just unfollow 
wada  wanna  sugar  sense  never  speed  water 
dats  what’s  wasn’t  aren’t  that’s  give a  that’s 
bea  the  why  kid  old  yes  tea 
Table 3 shows the closest words to a sample of words in terms of the distance D_C. In addition to the nonstandard spellings being close to the standard word, Table 3 shows that words similar in meaning are also among the closest words. For example, the third closest word to “dogg” is “cat”, both being domestic animals. Notice also that “boy” and “kid” come next, because in many of the training sentences a boy or a kid is mentioned in the same context as a dog or cat. The same can be said about the closest words to “tomorrow”. In the case of “txt” and “birthday”, most of the closest words are their standard/nonstandard versions. The next example in Table 3 presents a nonstandard spelling, “teering”, for which the approach with the distance D_C fails to recognize a word with a close meaning among the five closest words. The correct version of “teering” is “tearing”, which is the third closest word in Table 3. Even though the closest word to “teering” is not related to it, the resemblance between “wearing” and “teering” can justify it being the closest word. Table 3 also shows that D_C finds the standard version of “bea” as the third nearest word, which is an improvement in this case over D.
Nonstandard spellings  Closest word  2nd closest word  3rd closest word  4th closest word  5th closest word 
dogg  dog  doog  cat  boy  kid 
txt  text  texting  txted  texted  work 
teering  wearing  meeting  tearing  shaking  picking 
tomorrow  tmrw  tmr  today  yesterday  judgment 
video  vid  vids  videos  sleep  remix 
birthday  bday  biryhday  birthdayyy  drinking  dropping 
thng  ting  thing  think  right  stuff 
starin  staring  looking  glaring  slipping  praying 
omg  omgg  omfg  ohmygod  ohmygad  oh my god 
ddnt  didn  didnt  didn’t  havn’t  aren’t 
wada  wanna  sugar  sense  never  speed 
dats  dat  that  thats  thts  that’s 
bea  the  tea  yes  old  coffee 
5 Conclusion
In this paper, we proposed to combine a denoising autoencoder with a context encoder to learn a mapping between a vocabulary and a real vector space. This mapping allows us to define a metric in word space that encompasses nonstandard spelling variations of words as well as words used in similar contexts. This work is a first attempt at defining a fully learned metric in word space using neural networks. Granted, the resulting metric does not satisfy all the theoretical properties of a metric. However, the experimental results showed that the resulting metric succeeds in 85.4% of the cases in finding the correct version of a nonstandard spelling, a considerable increase in accuracy over the established Normalised Levenshtein at 63.2%. In addition, we showed that words used in similar contexts have a shorter distance between them than words used in different contexts.
Appendix A Model’s parameters and functions
Parameters/functions  Description 

Vocabulary  
Set of standard words in  
Set of nonstandard version of the word  
Initialization function  
Output of the hidden layer of the denoising autoencoder  
,  Weights of the denoising autoencoder 
,  Biases of the denoising autoencoder 
Output of the denoising autoencoder  
Loss function  
Metric in real vector space  
Activation function  
Probability distribution over  
Mapping function  
Probability distribution over mappings of words produced by  
Output of the context encoder  
,  Weights of the context encoder 
,  Biases of the context encoder 
Mapping function learned by the combination of the denoising autoencoder and context encoder  
Metric resulting from the denoising autoencoder mapping
Metric resulting from the combined denoising autoencoder and context encoder mapping