Normalizing Numeronyms -- An NLP Approach

07/31/2019, by Avishek Garain, et al.

This paper presents a method that applies Natural Language Processing to normalize numeronyms, making them understandable by humans. We approach the problem through a two-step mechanism: we first make use of the Levenshtein distance between words, and then apply cosine similarity to select the normalized text, reaching greater accuracy in solving the problem. Our approach garners accuracies of 71% and 72% for Bengali and English, respectively.


1 Introduction

A numeronym is a number-based word. Most commonly, a numeronym is a word where a number is used to form an abbreviation [1, 2, 7]. Pronouncing the letters and numbers may sound similar to the full word: "K9" for "canine" (phonetically: "kay" + "nine").

Nowadays, the use of numeronyms is widespread due to the concept of language localization. Language localization is the process of adapting a product to the language suited to a particular culture and geographical location/market. The need to communicate and connect with a younger audience is the main reason to adopt language localization services.

There is a thin line between localization and translation. Translation addresses grammar and spelling issues, which vary by geographical location. Localization deals more with significant, non-textual components of products or services. It addresses aspects such as adapting graphics, using appropriate date and time formats, adopting the local currency, choices of colors, and cultural references, amongst many other details.

Now, when short segments of letters are replaced by numbers, the resulting word is still readable; but more often, we see more complex forms of numeronyms, such as L10N (Localization) and I18N (Internationalization). Deciphering these forms can be quite non-trivial and needs acquaintance with such language. To the best of our knowledge, no previous state-of-the-art work has been done in this domain.
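For illustration (a sketch for exposition, not part of the normalization pipeline itself), numeronyms of the L10N/I18N kind keep the first and last letters of a word and replace the interior letters with their count:

```python
def to_numeronym(word):
    """Form a numeronym of the L10N/I18N kind: keep the first and last
    letters and replace the interior letters with their count."""
    if len(word) < 4:
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"

print(to_numeronym("localization"))          # l10n
print(to_numeronym("internationalization"))  # i18n
```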

Hence, to find the normalized version of the words, we have used the concepts of Damerau-Levenshtein distance [3] and cosine similarity. This approach is able to counter the problem to a large extent; when checked manually, it gives accuracies of 71% and 72% for Bengali and English, respectively.

The rest of the paper is organized as follows. Section 2 describes the data used for the experiment. The working of the algorithm is described in detail in Section 3. This is followed by results and concluding remarks in Sections 4 and 5, respectively.

2 Data

No state-of-the-art corpus consisting of numeronym words and their corresponding normalized forms is available, so we developed our own dataset by collecting 8000 English and 2000 Bengali numeronym words from various digital resources; we refer to these as the English and Bengali numeronym lists. In addition, a large number of English and Bengali Wikipedia pages were scraped, and their text was tokenized using the Stanford Tokenizer (https://nlp.stanford.edu/software/tokenizer.html). This data later serves as our generative data.
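A minimal sketch of this preparation step; a simple regex tokenizer stands in here for the Stanford Tokenizer used in the paper:

```python
import re
from collections import Counter

def build_frequency_dict(pages):
    """Tokenize scraped Wikipedia page texts into words and count their
    frequencies (regex tokenizer as a stand-in for the Stanford Tokenizer)."""
    freq = Counter()
    for text in pages:
        freq.update(re.findall(r"\w+", text.lower()))
    return freq
```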

3 Methodology

Initially, the collected English and Bengali sentences from Wikipedia were tokenized into words. This led to the formation of two dictionaries, one for English and one for Bengali, where each entry holds a word and its corresponding frequency. Both dictionaries had close to 100k words. Note that our hypothesis is that the length of the word remains the same before and after normalization, i.e., a numeronym and its correct form have equal length. So, we take each word from the numeronym lists and find its Levenshtein distance [5] to every dictionary word of the same length.

3.1 Damerau-Levenshtein Distance

The Damerau-Levenshtein distance is defined as the least number of insertions, deletions, substitutions, and transpositions of two adjacent characters required to convert one word into another. It can be implemented using a dynamic programming approach.

3.2 Algorithm

The distance between the first i characters of word a and the first j characters of word b is given by the recurrence

$$
d_{a,b}(i,j) = \min
\begin{cases}
0 & \text{if } i = j = 0,\\
d_{a,b}(i-1,j) + 1 & \text{if } i > 0,\\
d_{a,b}(i,j-1) + 1 & \text{if } j > 0,\\
d_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} & \text{if } i, j > 0,\\
d_{a,b}(i-2,j-2) + 1 & \text{if } i, j > 1 \text{ and } a_i = b_{j-1} \text{ and } a_{i-1} = b_j,
\end{cases}
$$

where $1_{(a_i \neq b_j)}$ is the indicator function equal to 0 when $a_i = b_j$ and equal to 1 otherwise. Each recursive call matches one of the cases covered by the Damerau-Levenshtein distance:

  • $d_{a,b}(i-1,j) + 1$ corresponds to a deletion (from a to b).

  • $d_{a,b}(i,j-1) + 1$ corresponds to an insertion (from a to b).

  • $d_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}$ corresponds to a match or mismatch, depending on whether the respective symbols are the same.

  • $d_{a,b}(i-2,j-2) + 1$ corresponds to a transposition between two successive symbols.
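A minimal Python sketch of this recurrence, using the bottom-up dynamic-programming formulation (the restricted, optimal-string-alignment variant):

```python
def damerau_levenshtein(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant),
    computed bottom-up with dynamic programming."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(b) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1  # indicator 1_(a_i != b_j)
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or mismatch
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("k1n9", "king"))  # 2 (two substitutions)
```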

3.3 Workflow

For each numeronym, the dictionary words with the minimum Levenshtein distance are extracted. This creates a one-to-many mapping from each numeronym word to dictionary words, as shown in Figure 1. The maximum degree of tolerance for change was kept at 2.

Figure 1: Mapping of numeronyms and dictionary entries after calculating Levenshtein distance.
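A sketch of this retrieval step, reusing the damerau_levenshtein function above; it assumes candidates are restricted to words of the same length, per the hypothesis in Section 3:

```python
def candidate_words(numeronym, freq_dict, tolerance=2):
    """Extract the same-length dictionary words at minimum
    Damerau-Levenshtein distance, within the tolerance of 2."""
    same_length = [w for w in freq_dict if len(w) == len(numeronym)]
    distances = {w: damerau_levenshtein(numeronym, w) for w in same_length}
    within = {w: d for w, d in distances.items() if d <= tolerance}
    if not within:
        return []
    best = min(within.values())
    return [w for w, d in within.items() if d == best]
```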

Initially, we thought of extracting the correct normalized form of a numeronym by selecting the word with the highest frequency from the one-to-many mapping. But selecting the most frequent word is not a practical approach, so to select the most probable word, we took the help of the cosine similarity algorithm.

3.4 Cosine Similarity Algorithm

Cosine similarity is particularly useful in positive space, where the outcome is neatly bounded in [0, 1]. The formula used in our approach is as follows:

$$
\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}},
$$

where A and B are the vectors representing the source numeronym word and one of the target dictionary words, respectively.
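The paper does not specify how words are vectorized for this step; the sketch below assumes character-frequency vectors, so its scores will not reproduce Table 1 exactly:

```python
import math
from collections import Counter

def cosine_similarity(word_a, word_b):
    """Cosine similarity between two words represented as
    character-frequency vectors (an assumed vectorization)."""
    va, vb = Counter(word_a), Counter(word_b)
    dot = sum(va[ch] * vb[ch] for ch in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pick_replacement(numeronym, candidates):
    """Select the candidate word with the highest cosine similarity."""
    return max(candidates, key=lambda w: cosine_similarity(numeronym, w),
               default=None)
```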

The word with the highest cosine similarity value was selected as the probable replacement of the numeronym. An example of numeronym replacement is shown in Table 1.

Input numeronym: k1n9

Most similar words   Cosine similarity score
king                 0.80
kind                 0.69
kin                  0.26

Table 1: Example of selecting the probable replacement of a numeronym word.

4 Results

Since the experiment was done on both English and Bengali numeronyms, we took the help of two linguists who were fluent in both languages. The linguists were asked to classify the results of our algorithm into two classes: correct and incorrect. The inter-annotator agreement (Cohen's kappa [4]) for the two languages is shown in Tables 2 and 3, respectively.

                         Linguist B
                      Correct   Incorrect
Linguist A  Correct      5563         429
            Incorrect     416        1592

Kappa: 0.720

Table 2: Inter-annotator agreement for English.

                         Linguist B
                      Correct   Incorrect
Linguist A  Correct      1487         110
            Incorrect      77         326

Kappa: 0.718

Table 3: Inter-annotator agreement for Bengali.
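As a check, Cohen's kappa can be recomputed directly from the confusion matrices above; a short sketch (the rounded outputs match the reported values):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 agreement matrix:
    a = both correct, b = A correct/B incorrect,
    c = A incorrect/B correct, d = both incorrect."""
    n = a + b + c + d
    p_observed = (a + d) / n
    p_a = (a + b) / n          # P(linguist A says correct)
    p_b = (a + c) / n          # P(linguist B says correct)
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_observed - p_expected) / (1 - p_expected)

print(round(cohens_kappa(5563, 429, 416, 1592), 3))  # 0.720 (English, Table 2)
print(round(cohens_kappa(1487, 110, 77, 326), 3))    # 0.718 (Bengali, Table 3)
```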

5 Conclusion

Through this work, we enable machines to understand numeronyms and decode them as a person would on sight. After understanding them, such systems can respond more accurately to various language-processing requirements. Our results are quite satisfactory in demonstrating the success of the algorithm, and we hope it finds use in everyday systems in the near future. The lack of datasets has led to lower accuracies, which leaves room for further work based on this study. A better similarity measurement metric and selection model might give more accurate results.

References

  • [1] Arbekova, T.: Lexicology of the english language. M.: High School (1977)
  • [2] Borisov, V.: Abbreviations and acronyms. Military and scientific-technical shortenings in foreign languages./Ed. Schweitzer, AD–M.: Higher School (2004)
  • [3] Damerau, F.J.: A technique for computer detection and correction of spelling errors. Communications of the ACM 7(3), 171–176 (1964)
  • [4] Fleiss, J.L., Cohen, J.: The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and psychological measurement 33(3), 613–619 (1973)
  • [5] Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet physics doklady. vol. 10, pp. 707–710 (1966)
  • [6] Mahata, S., Das, D., Pal, S.: Wmt2016: A hybrid approach to bilingual document alignment. In: Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. vol. 2, pp. 724–727 (2016)
  • [7] Ware, J.: Localization: For Starters. CreateSpace Independent Publishing Platform, USA (2016)