Spell checking and correction is a well-known and well-researched problem in Natural Language Processing [46, 8, 17, 20]. However, most state-of-the-art research has been done on spell checkers for English [9, 13]. Some systems might be extended to other languages, but research on spell checkers for other languages has not been as extensive. People have tried to build spell checkers for individual languages: Bengali, Czech, Danish, Dutch, Finnish, French [44, 16], German [39, 28], Greek, Hindi [14, 24], Indonesian, Marathi, Polish, Portuguese, Russian [42, 43], Spanish, Swedish, Tamil, Thai, etc. This is because languages are very different in nature and pose different challenges, making it difficult to have one solution that works for all languages. Many systems do not work in real time. There are some rule-based spell checkers (like LanguageTool, www.languagetool.org) which try to capture grammar and spelling rules [35, 33]. This is not scalable and requires language expertise to add new rules. Another problem is evaluating the performance of a spell-check system for each language, due to the lack of quality test data. Spelling errors are classified into two categories: non-word errors, where the word is unknown, and real-word errors, where the word itself is correct but used in a wrong form or context.
We present a context-sensitive real-time spell-checker system which can be adapted to any language. One of the biggest problems earlier was the absence of data for languages other than English, so we propose three approaches to create noisy channel datasets of real-world typographic errors. We use Wikipedia data for creating dictionaries and synthesizing test data. To compensate for the resource scarcity of most languages, we also use manually curated movie subtitles, since they provide information about how people communicate.
Our system outperforms industry-wide accepted English spell checkers (Hunspell and Aspell), and we report its performance on benchmark datasets for English. We present our performance on a synthetic dataset for 24 languages, viz., Bengali, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Marathi, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Telugu, Thai and Turkish. We also compare 11 of these languages to one of the most popular rule-based systems. We did not customize our spell checker to suit local variants or dialects of a language. For example, the spelling “color” is used in American English whereas the spelling “colour” is preferred in other versions of English. Our system will not flag either of these spellings.
The paper makes the following contributions:
- We show the system’s time performance for each step in the process, demonstrating its real-time effectiveness.
- Our system outperforms existing rule-based and industry-wide accepted spell checking tools.
- We show that our system can be adapted to other languages with minimal effort, reporting precision@k and mean reciprocal rank (MRR) for 24 languages.
The paper is divided into four sections. Section II explains the preprocessing steps and approach to generate a ranked list of suggestions for any detected error. Section III presents different synthetic data-generation algorithms. Section IV describes the experiments and reports their results. Finally, Section V concludes the paper and discusses future endeavours.
Our system takes a sentence as input, tokenizes the sentence, identifies misspelled words (if any), generates a list of suggestions and ranks them to return the top corrections. For ranking the suggestions, we use n-gram conditional probabilities. As a preprocessing step, we create frequency dictionaries which will aid in generation of n-gram conditional probabilities.
II-A Preprocessing: Building n-gram dictionaries
We calculated unigram, bigram and trigram frequencies of tokens from the corpus. Using these frequencies, we calculated the conditional probabilities expressed in equation 1, where $P$ is the conditional probability and $c$ is the count of the n-gram in the corpus. For unigrams, we calculate the probability of occurrence in the corpus.

$$P(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})} \tag{1}$$
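As a rough illustration of this preprocessing step, the following sketch counts n-grams over a toy tokenized corpus and derives a count-based conditional probability. The function names (`ngram_counts`, `conditional_probability`) and the toy corpus are hypothetical, and returning 0 for an unseen history is an assumption rather than something stated in the paper.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count every n-gram of order n over a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def conditional_probability(ngram, counts_n, counts_prev):
    """P(last token | preceding tokens) = c(ngram) / c(ngram without its last token)."""
    denom = counts_prev.get(ngram[:-1], 0)
    if denom == 0:
        return 0.0  # assumption: an unseen history yields probability 0
    return counts_n.get(ngram, 0) / denom


# Toy corpus (hypothetical); the paper uses Wikipedia and subtitle text.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)

# Unigram probability of occurrence in the corpus.
p_the = unigrams[("the",)] / sum(unigrams.values())
# Bigram conditional probability P(cat | the).
p_cat_given_the = conditional_probability(("the", "cat"), bigrams, unigrams)
```

In this toy corpus "the" accounts for 3 of 9 tokens and is followed by "cat" in 2 of its 3 occurrences.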
We used Wikipedia dumps (Wikimedia Downloads: https://dumps.wikimedia.org) along with manually curated movie subtitles for all languages. We capped Wikipedia articles at 1 million and subtitle files at 10K. On average, each subtitle file contains 688 subtitle blocks and each block contains 6.4 words. We considered words of minimum length 2 that occur more than 5 times in the corpus. Similarly, only bigrams and trigrams where each token was known were considered.
One issue we encountered while building these dictionaries using such a huge corpus was its size. For English, the number of unique unigrams was approx. , bigrams was and trigrams was . If we store these files as uncompressed Python Counters, these files end up being , and respectively. To reduce the size, we compressed these files using a word-level Trie with hashing. We created a hash map for all the words in the dictionary (unigram token frequency) assigning a unique integer id to each word. Using each word’s id, we created a trie-like structure where each node represented one id and its children represented n-grams starting with that node’s value. The Trie ensured that the operation to lookup an n-gram was bounded in O(1) and reduced the size of files by on an average. For English, the hashmap was , unigram probabilities’ file was , bigram was and trigram was .
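A minimal sketch of a word-level trie keyed by integer word ids, in the spirit of the structure described above. The class name `NgramTrie`, the sentinel key `VALUE` and the nested-dict layout are illustrative assumptions; the paper does not describe its storage format in this detail.

```python
VALUE = -1  # sentinel key for a node's stored value; real word ids are >= 0


class NgramTrie:
    """Word-level trie: each node is a dict keyed by integer word ids."""

    def __init__(self):
        self.word_ids = {}  # hash map: word -> unique integer id
        self.root = {}

    def add(self, ngram, value):
        node = self.root
        for word in ngram:
            # Assign the next free id to a word the first time it is seen.
            wid = self.word_ids.setdefault(word, len(self.word_ids))
            node = node.setdefault(wid, {})
        node[VALUE] = value

    def get(self, ngram, default=0.0):
        node = self.root
        for word in ngram:
            wid = self.word_ids.get(word)
            if wid is None or wid not in node:
                return default
            node = node[wid]
        return node.get(VALUE, default)


trie = NgramTrie()
trie.add(("new", "york"), 0.4)
trie.add(("new", "york", "city"), 0.7)
```

A lookup performs at most one hash access per token, so for n-grams of order at most 3 the cost is bounded by a constant.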
II-B Tokenization
There are a number of solutions available for creating tokenizers for multiple languages. Some solutions (like [34, 40, 7]) try to use publicly available data to train tokenizers, whereas others (like the Europarl preprocessing tools) are rule-based. Neither approach is extensible, and both are typically not real-time.
For each language, we create a list of supported characters using writing-system information (https://en.wikipedia.org/wiki/List_of_languages_by_writing_system) and language recognition charts (https://en.wikipedia.org/wiki/Wikipedia:Language_recognition_chart). We included uppercase and lowercase characters (if applicable) and numbers in that writing system, ignoring all punctuation. Any character which does not belong to this list is treated as foreign to that language and is tokenized as a separate token. Using a regex rule, we extract all continuous sequences of characters in the supported list.
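The tokenization rule above could be sketched with a single regular expression, as below. The helper name `build_tokenizer` is hypothetical, ASCII letters and digits stand in for the English supported-character list, and skipping whitespace is an assumption.

```python
import re
import string


def build_tokenizer(supported_chars):
    """Return a function that splits text into runs of supported characters
    and single foreign characters (whitespace is skipped)."""
    char_class = "".join(re.escape(ch) for ch in supported_chars)
    pattern = re.compile(f"[{char_class}]+|[^{char_class}\\s]")
    return pattern.findall


# Assumption: ASCII letters and digits stand in for the English character list.
tokenize_en = build_tokenizer(string.ascii_letters + string.digits)
tokens = tokenize_en("Hello wörld, 42!")
```

Here the unsupported character "ö" and the punctuation marks each become their own token, while "Hello" and "42" are kept as continuous sequences.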
II-C Error Detection
We restricted our error search strictly to non-word errors: for every token in a sentence, we checked for its occurrence in the dictionary. However, to make the system more efficient, we only considered misspelled tokens of length greater than 2. On manual analysis of the Wikipedia misspellings dataset for English, we found that misspellings of length 1 and 2 do not make sense, so computing and ranking suggestions for them is not worthwhile.
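The detection rule amounts to a dictionary membership test with a length filter. The sketch below assumes the dictionary is a plain Python set and that matching is case-sensitive; the function name `detect_errors` is hypothetical.

```python
def detect_errors(tokens, dictionary, min_length=3):
    """Return (index, token) pairs for tokens longer than 2 characters
    that are not present in the dictionary (non-word errors only)."""
    return [
        (i, tok)
        for i, tok in enumerate(tokens)
        if len(tok) >= min_length and tok not in dictionary
    ]


# Toy dictionary (hypothetical); the paper builds it from unigram frequencies.
dictionary = {"the", "quick", "fox", "is"}
errors = detect_errors(["the", "quikc", "fox", "xz", "is"], dictionary)
```

The unknown token "xz" is ignored because of its length, so only "quikc" at index 1 is reported.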
II-D Generating candidate suggestions
Given an unknown token, we generated a list of all known words within an edit distance of 2, calling them candidate suggestions. We present the edit distance distribution of publicly available datasets for English in Section IV-C. Two intuitive approaches to generating the list of suggestions work fairly well on a small dataset: first, checking the edit distance of the incorrect spelling against all words in the dictionary; and second, generating all strings within edit distance 2 of the incorrect spelling and keeping the known words (https://norvig.com/spell-correct.html). The obvious problem with the first approach is the size of the dictionary, which is typically in the range of hundreds of thousands of words. The problem with the second approach is the length of the word: for longer words there can be thousands of such strings, and building the list is time consuming.
We considered four approaches: the Trie data structure, the Burkhard-Keller Tree (BK Tree), Directed Acyclic Word Graphs (DAWGs) and the Symmetric Delete algorithm (SDA, https://github.com/wolfgarbe/SymSpell). Table I reports the performance of these algorithms for edit distance 2; results for BK Trees are omitted because their running time was in the range of a couple of seconds. We used the Wikipedia misspelling dataset (https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings) to create a list of 2062 unique misspellings, of lengths varying from 3 to 16, which were not present in our English dictionary. For each algorithm, we extracted the list of suggestions within edit distance 1 and 2 for each token in the dataset.
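To make the symmetric delete idea concrete, here is a small self-contained sketch. It is not the SymSpell implementation: the function names are hypothetical, candidates are verified with plain Levenshtein distance (an assumption, chosen so that an adjacent swap counts as two edits), and no memory optimizations are attempted.

```python
from collections import defaultdict
from itertools import combinations


def deletes(word, max_distance=2):
    """All strings obtained by deleting up to max_distance characters."""
    results = {word}
    for d in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - d):
            results.add("".join(word[i] for i in keep))
    return results


def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_index(dictionary, max_distance=2):
    """Precompute: delete-variant -> set of dictionary words producing it."""
    index = defaultdict(set)
    for word in dictionary:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index


def candidates(token, index, max_distance=2):
    """Known words within max_distance edits of token."""
    found = set()
    for variant in deletes(token, max_distance):
        for word in index.get(variant, ()):
            # Shared delete-variants can over-generate, so verify the distance.
            if levenshtein(token, word) <= max_distance:
                found.add(word)
    return found


index = build_index({"grow", "grief", "green", "crow"})
suggestions = candidates("gorw", index)
```

Only deletions are generated at query time, which keeps the number of lookups small compared with enumerating all insertions and substitutions over the alphabet.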
II-E Ranking suggestions
Using SDA, we generate a list of candidates which are to be ranked in order of relevance in the given context. Authors of , demonstrate the effectiveness of n-grams for English to auto-correct real-word errors and unknown word errors. However, they use high-order n-grams in isolation. We propose a weighted sum of unigrams, bigrams and trigrams to rank the suggestions. Authors in , use character embeddings to generate embeddings for each misspelling for clinical free-text and then similar to , rank on basis of contextual similarity score.
We create a context score ($S$) for each suggestion and rank in decreasing order of that score, returning the top suggestions. The context score is a weighted sum of the unigram context score ($S_1$), bigram context score ($S_2$) and trigram context score ($S_3$), as defined by equation 2. This score is calculated for each suggestion by replacing the misspelled token with the suggestion. For n-grams where any token is unknown, the count is considered to be .

$$S = W_1 S_1 + W_2 S_2 + W_3 S_3 \tag{2}$$

where
$i$ = index of misspelled token
$W_n$ = the weight for the $n$-gram’s score
$c(w_i^j)$ = occurrence frequency of sequence ($w_i \ldots w_j$)
$P$ = conditional probability.
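The ranking step can be sketched as follows. The per-order context scores are not fully specified in the text, so this sketch assumes each one is the sum of the stored probabilities of all n-grams that cover the misspelled position, with unseen n-grams contributing 0. The weights, probabilities and function names are all hypothetical.

```python
def ngram_windows(tokens, i, n):
    """All n-grams of tokens that include position i."""
    first = max(0, i - n + 1)
    last = min(i, len(tokens) - n)
    return [tuple(tokens[s:s + n]) for s in range(first, last + 1)]


def context_score(tokens, i, suggestion, probs, weights):
    """Weighted sum of unigram, bigram and trigram context scores."""
    replaced = tokens[:i] + [suggestion] + tokens[i + 1:]
    score = 0.0
    for n, weight in zip((1, 2, 3), weights):
        # Assumption: the order-n score sums probabilities of all n-grams
        # covering position i; n-grams missing from probs contribute 0.
        s_n = sum(probs.get(gram, 0.0) for gram in ngram_windows(replaced, i, n))
        score += weight * s_n
    return score


def rank(tokens, i, suggestions, probs, weights=(0.2, 0.5, 0.3), top_k=10):
    """Sort suggestions by decreasing context score and keep the top_k."""
    ordered = sorted(
        suggestions,
        key=lambda s: context_score(tokens, i, s, probs, weights),
        reverse=True,
    )
    return ordered[:top_k]


# Hypothetical probabilities for illustration only.
probs = {
    ("form",): 0.06,
    ("from",): 0.05,
    ("came", "from"): 0.3,
    ("from", "home"): 0.2,
}
ranked = rank(["she", "came", "frmo", "home"], 2, ["form", "from"], probs)
```

In this toy example "form" has the higher unigram probability, but the bigram context moves "from" to the top of the list.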
III Synthesizing Spelling Errors
The biggest challenge in evaluating the spell checker was the lack of a quality test dataset. Most of the publicly available datasets are for English. We propose three strategies to introduce typographical errors into correct words to represent a noisy channel. We selected all sentences in which we did not find any spelling error and introduced exactly one error per sentence.
III-A Randomized Characters
From a sentence, we pick one word at random and make one of three edits: insertion, deletion or substitution with a random character from that language’s supported character list. Since this is a completely randomized strategy, the incorrect words created are not very “realistic”. For example, in English for edit distance 2, the word “moving” was changed to “moviAX”, “your” to “mouk” and “chest” to “chxwt”. We repeated the process for edit distance 1 (introducing only one error) and edit distance 2 (introducing two errors) and created a dataset of 20,000 sentences for each.
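A possible sketch of the randomized strategy is shown below. The names `random_edit` and `corrupt_sentence` are hypothetical and ASCII letters stand in for the supported character list. Note that applying two random edits in sequence does not strictly guarantee an edit distance of exactly 2, since the second edit can partly undo the first.

```python
import random
import string


def random_edit(word, alphabet, rng):
    """Apply one random insertion, deletion or substitution to word."""
    op = rng.choice(("insert", "delete", "substitute"))
    if op == "delete" and len(word) > 1:
        pos = rng.randrange(len(word))
        return word[:pos] + word[pos + 1:]
    if op == "insert":
        pos = rng.randrange(len(word) + 1)
        return word[:pos] + rng.choice(alphabet) + word[pos:]
    # Substitution (also the fallback when a 1-character word drew "delete").
    pos = rng.randrange(len(word))
    replacement = rng.choice([c for c in alphabet if c != word[pos]])
    return word[:pos] + replacement + word[pos + 1:]


def corrupt_sentence(tokens, alphabet, edits=1, seed=None):
    """Pick one word at random and apply `edits` random edits to it."""
    rng = random.Random(seed)
    i = rng.randrange(len(tokens))
    noisy = tokens[i]
    for _ in range(edits):
        noisy = random_edit(noisy, alphabet, rng)
    return tokens[:i] + [noisy] + tokens[i + 1:], i


original = ["keep", "moving", "forward"]
noisy_sentence, changed_index = corrupt_sentence(
    original, string.ascii_letters, edits=1, seed=0
)
```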
III-B Characters Swap
On analyzing common misspellings for English, we discovered that the majority of edit-distance 2 errors are swaps of two adjacent characters. For example, “grow” is misspelled as “gorw” and “grief” as “greif”. Since one swap implies an edit distance of two, we created a dataset of 20,000 samples for such cases.
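The swap strategy can be sketched in a few lines. The function names are hypothetical, and skipping pairs of identical characters (where a swap would leave the word unchanged) is an assumption.

```python
import random


def adjacent_swaps(word):
    """All distinct words produced by swapping two different adjacent characters."""
    return [
        word[:p] + word[p + 1] + word[p] + word[p + 2:]
        for p in range(len(word) - 1)
        if word[p] != word[p + 1]
    ]


def random_swap(word, rng):
    """One random adjacent swap, or None if no swap changes the word."""
    options = adjacent_swaps(word)
    return rng.choice(options) if options else None


swaps = adjacent_swaps("grow")
sample = random_swap("grief", random.Random(0))
```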
III-C Character Bigrams
Introducing errors randomly produces unrealistic words. To create more realistic errors, we decided to use character bigram information. From all the words in the dictionary for a language, we calculate occurrence probabilities for character bigrams. For a given word, we select a character bigram at random and replace the second character of the selected bigram with a possible substitute drawn from the pre-computed character bigram probabilities. This way, we were able to generate words which were more plausible. For example, in English for edit distance 1, the word “heels” was changed to “heely”, “triangle” to “triajgle” and “knee” to “kyee”. On a shallow manual analysis of the generated words, most of them look quite realistic. For English, some of the generated words are representative of keyboard-stroke errors (errors that occur due to mistakenly pressing a nearby key on the keyboard or input device). For example, we generated samples like “Allow” to “Alkow”, “right” to “riggt”, “flow” to “foow” and “Stand” to “Stabd”. We generated a sample of 40,000 sentences each for edit distance 1 and edit distance 2.
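One way to read the character-bigram strategy is sketched below: count which characters follow each character in the dictionary, then replace the second character of a randomly chosen bigram with a different character sampled in proportion to those counts. The function names and the toy word list are hypothetical, and the exact sampling scheme is an assumption.

```python
import random
from collections import Counter, defaultdict


def bigram_model(words):
    """follow[a][b] = number of times character b follows character a."""
    follow = defaultdict(Counter)
    for word in words:
        for a, b in zip(word, word[1:]):
            follow[a][b] += 1
    return follow


def bigram_substitute(word, follow, rng):
    """Replace the second character of a random bigram with a plausible
    alternative, sampled in proportion to the bigram counts."""
    positions = list(range(1, len(word)))
    rng.shuffle(positions)
    for p in positions:
        options = {
            c: n for c, n in follow.get(word[p - 1], {}).items() if c != word[p]
        }
        if options:
            chars, weights = zip(*options.items())
            new_char = rng.choices(chars, weights=weights, k=1)[0]
            return word[:p] + new_char + word[p + 1:]
    return None  # no bigram in the word has an alternative continuation


# Toy dictionary (hypothetical).
model = bigram_model(["heels", "heely", "knee", "key"])
noisy = bigram_substitute("heels", model, random.Random(0))
```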
IV Experiments and Results
IV-A Synthetic Data evaluation
For each language, we created a dataset of 140,000 sentences with one misspelling each (with the exception of Czech, Greek, Hebrew and Thai, where the dataset was smaller due to the unavailability of good samples). The best performance for each language is reported in Table II. We present Precision@k (the percentage of cases where the expected output was in the top k results) and mean reciprocal rank (MRR). The system performs well on the synthetic dataset, with a minimum of 80% P@1 and 98% P@10.
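For reference, the two metrics can be computed as in the sketch below. The function names and the toy data are hypothetical, and treating a missing correction as a reciprocal rank of 0 is an assumption.

```python
def precision_at_k(ranked_lists, expected, k):
    """Fraction of cases where the expected word is among the top k suggestions."""
    hits = sum(1 for ranked, gold in zip(ranked_lists, expected) if gold in ranked[:k])
    return hits / len(expected)


def mean_reciprocal_rank(ranked_lists, expected):
    """Average of 1 / rank of the expected word (0 when it is not suggested)."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, expected):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(expected)


# Toy suggestions and gold corrections (hypothetical).
ranked_lists = [["from", "form"], ["grew", "grow", "crow"], ["cat"]]
expected = ["from", "grow", "dog"]

p_at_1 = precision_at_k(ranked_lists, expected, 1)
p_at_3 = precision_at_k(ranked_lists, expected, 3)
mrr = mean_reciprocal_rank(ranked_lists, expected)
```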
|Time (s)||ED=1 (ms)||ED=2 (ms)||Time (ms)|
The system is able to perform each sub-step in real time; the average time taken for each sub-step is reported in Table III. All the sentences used for this analysis had exactly one error according to our system. Detection time is the average time weighted over the number of tokens in the query sentence, suggestion time is weighted over the character length of the misspelling, and ranking time is weighted over the number of suggestions generated.
Table IV presents the system’s performance on each error generation algorithm. We included only P@1 and P@10 to show the trend across all languages. “Random Character” and “Character Bigrams” include data for edit distance 1 and 2, whereas “Characters Swap” includes data for edit distance 2 only. Table V presents the system’s performance separately on edit distance 1 and 2. We included only P@1, P@3 and P@10 to show the trend across all languages.
Table IV columns: Language | Random Character | Characters Swap | Character Bigrams
We experimented with the importance of each n-gram weight; Figure 1 presents the results of this experiment. We kept two weights constant and varied the third to compare performance. For example, to determine the importance of the unigram weight, we held the bigram and trigram weights fixed and varied the unigram weight. As shown in Figure 1(a) and Figure 1(b), if unigrams or trigrams are given more importance, the performance of the system worsens. Figure 1(c) shows that removing lower-order n-grams and giving importance only to trigrams also decreases performance. Therefore, finding the right balance between the weights is crucial for the system’s best performance.
Table V columns: Language | Edit Distance = 1 | Edit Distance = 2
IV-B Comparison with LanguageTool
We compared the performance of our system with one of the most popular rule-based systems, LanguageTool (LT). Due to license issues, we could only run LT for 11 languages, viz., Danish, Dutch, French, German, Greek, Polish, Portuguese, Romanian, Russian, Spanish and Swedish.
As shown in Figure 2, LT does not detect any error in many cases. For example, for German it did not detect any error in 42% of sentences. In 25% of sentences (8% “No Match” + 17% “Detected more than one error”) it detected more than one error, and in the 8% “No Match” sentences the error detected by our system was not detected by LT. Only in 33% of sentences did LT detect exactly one error, the same one detected by our system. Results for Portuguese seem very skewed, which may be because Portuguese has two major versions, Brazilian Portuguese (pt-BR) and European Portuguese (pt-PT); LT has a different set of rules for each version, whereas the dataset used was a mix of both.
IV-C Public Datasets results
We used four publicly available datasets for English: birkbeck, which contains errors from the Birkbeck spelling error corpus (http://ota.ox.ac.uk/); hollbrook, which contains spelling errors extracted from passages in the book English for the Rejected; aspell, errors collected to test GNU Aspell (http://aspell.net/); and wikipedia, the most common spelling errors on Wikipedia. Each dataset is a list of misspellings and their corresponding corrections. We ignored all entries that had more than one token. We extracted 5,987 unique correct words and 31,589 misspellings. Figure 3 shows the distribution of edit distance between a misspelling and its correction. Figure 3(a) shows the same distribution excluding the birkbeck dataset, leaving 2,081 unique words and 2,725 misspellings. The birkbeck dataset is the biggest of the four, but its quality is questionable: as explained by the dataset owners, it was created using poor resources. Figure 3(a) supports our assumption that most common misspellings lie within an edit distance of 2.
We use every correct and incorrect token in this dataset to check whether they are, respectively, present in and absent from our dictionary, in order to test whether our detection system identifies correct and incorrect tokens reliably. The detection system was able to detect of correct tokens and of incorrect tokens accurately. The percentage of incorrect tokens detected is comparatively low because there are many tokens in the dataset which are actually correct but were included as misspellings, such as “flower”, “representative” and “mysteries”. Some correct words in the dataset which were detected as incorrect were also noise, because some words start with a capital letter but appear in the dataset in lowercase, such as “beverley”, “philippines” and “wednesday”. A comparison with the most popular spell checkers for English (GNU Aspell and Hunspell, http://hunspell.github.io/) on this data is presented in Table VI. Since these tools only work at the word level, we used only unigram probabilities for ranking. Our system outperforms both.
IV-D False Positive evaluation
For a spell checker, a false positive occurs when a spelling error is detected where there was none. We combined three public datasets (the OpenSubtitles dataset, the OPUS Books dataset and the OPUS Tatoeba dataset) to generate a dataset with a minimum of 15,000 words for each of the 24 languages. Since these datasets are human curated, we assume every token should be detected as a known word.
As shown in Table VII, most of the words for each language were detected as known, but a small percentage were still detected as errors. For English, the most frequent errors in the complete corpus were either proper nouns or foreign-language words: “Pencroft”, “Oblonsky”, “Spilett”, “Meaulnes” and “taient”. This indicates the effectiveness of the system against false positives.
Table VII columns: Language | # Sentences | # Total Words | # Detected | %
V Conclusion
We presented a novel context-sensitive spell checker system which works in real time. Most of the available literature discusses spell checkers for English and sometimes for some European (like German, French) and Indian languages (like Hindi, Marathi), but there are no publicly available non-rule-based systems which can work for all languages.
Our proposed system outperformed industry-wide accepted spell checkers (GNU Aspell and Hunspell) and a rule-based spell checker (LanguageTool). First, we proposed three different approaches to create typographic errors for any language, which has not been done earlier in a multilingual setting. Second, we divided our proposed system into five steps: preprocessing, tokenization, error detection, candidate suggestion generation and suggestion ranking. We used n-gram conditional probability dictionaries to capture context, rank suggestions and present the top suggestions.
We showed the adaptability of our system to 24 languages using precision@k and mean reciprocal rank (MRR). The system performs at a minimum of 80% P@1 and 98% P@10 on the synthetic dataset. We showed the robustness of our system to false positives. In future, we can extend support to real-word errors and compound-word errors.
[1] (2002) Implementation of directed acyclic word graph. Kybernetika 38, pp. 91–103.
[2] (2006) A constraint grammar based spellchecker for danish with a special focus on dyslexics. A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday. Special Supplement to SKY Journal of Linguistics 19, pp. 387–396.
[3] (2013) Double dutch: the dutch spelling system and learning to spell in dutch. In Handbook of orthography and literacy, pp. 149–164.
[4] (1973) Some approaches to best-match file searching. Commun. ACM 16, pp. 230–236.
[5] (2006) Spelling error patterns in spanish for word processing applications. In LREC, pp. 93–98.
[6] Memory-based context-sensitive spelling correction at web scale. Sixth International Conference on Machine Learning and Applications (ICMLA 2007), pp. 166–171.
[7] (2008) Optimizing chinese word segmentation for machine translation performance. In WMT@ACL.
[8] (2007) Improving query spelling correction using web search results. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
[9] (2007) How difficult is it to develop a perfect spell-checker? a cross-linguistic analysis through complex network approach.
[10] (2005) Correcting spelling errors by modelling their causes.
[11] (2003) Tamil spell checker. In Sixth Tamil Internet 2003 Conference, Chennai, Tamilnadu, India.
[12] (2005) Design and implementation of a morphology-based spellchecker for marathi, and indian language. Archives of Control Science 15 (3), pp. 301.
[13] (2012) Multidimensional pareto optimization of touchscreen keyboards for speed, familiarity and improved spell checking. In CHI.
[14] Automatic spelling correction for resource-scarce languages using deep learning. In Proceedings of ACL 2018, Student Research Workshop, pp. 146–152.
[15] (2017) Unsupervised context-sensitive spelling correction of clinical free-text with word and character n-gram embeddings. In BioNLP.
[16] Developing a lexicon for a new french spell-checker.
[17] (2010) A large scale ranker-based system for search query spelling correction. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 358–366.
[18] (2014) The wiked error corpus: a corpus of corrective wikipedia edits and its application to grammatical error correction. In Advances in Natural Language Processing – Lecture Notes in Computer Science, A. Przepiórkowski and M. Ogrodniczuk (Eds.), Vol. 8686, pp. 478–490.
[19] (2013) Automatic extraction of polish language errors from text edition history. In International Conference on Text, Speech and Dialogue, pp. 129–136.
[20] (2019) Problems with automating translation of movie/tv show subtitles. ArXiv abs/1909.05362.
[21] Unsupervised quality estimation without reference corpus for subtitle machine translation using word embeddings. In 2019 IEEE 13th International Conference on Semantic Computing (ICSC), pp. 32–38.
[22] (2000) Design and evaluation of grammar checkers in multiple languages. In COLING.
[23] (2007) A light weight stemmer for bengali and its use in spelling checker.
[24] (2014) Auto spell suggestion for high quality speech synthesis in hindi. CoRR abs/1402.3648.
[25] (2001) Implementation aspects and applications of a spelling correction algorithm. Text as a Linguistic Paradigm: Levels, Constituents, Constructs. Festschrift in honour of Ludek Hrebicek 60, pp. 108–123.
[26] (1997) A thai soundex system for spelling correction. In Proceeding of the National Language Processing Pacific Rim Symposium, pp. 633–636.
[27] (2015) An ensemble method for spelling correction in consumer health questions. AMIA Annual Symposium proceedings. AMIA Symposium 2015, pp. 727–36.
[28] (2000) A word analysis system for german hyphenation, full text search, and spell checking, with regard to the latest reform of german orthography. In TSD.
[29] (2005) Europarl: a parallel corpus for statistical machine translation.
[30] (1992) Techniques for automatically correcting words in text. ACM Comput. Surv. 24, pp. 377–439.
[31] (2016) OpenSubtitles2016: extracting large parallel corpora from movie and tv subtitles. In LREC.
[32] (1998) Linguistic issues in the development of regra: a grammar checker for brazilian portuguese. Natural Language Engineering 4 (4), pp. 287–307.
[33] Developing an open-source, rule-based proofreading tool. Software: Practice and Experience 40 (7), pp. 543–566.
[34] (2018) Multilingual word segmentation: training many language-specific tokenizers smoothly thanks to the universal dependencies corpus. In LREC.
[35] (2003) A rule-based style and grammar checker. Citeseer.
[36] (2001) A greek morphological lexicon and its exploitation by a greek controlled language checker. In Proceedings of the 8th Panhellenic Conference on Informatics, pp. 8–10.
[37] (2010) Finite-state spell-checking with weighted language and error models. In Proceedings of LREC 2010 Workshop on creation and use of basic lexical resources for less-resourced languages.
[38] (2012) Korektor–a system for contextual spell-checking and diacritics completion. Proceedings of COLING 2012: Posters, pp. 1019–1028.
[39] (2008) Evaluating automatic detection of misspellings in german. Language Learning & Technology 12 (3), pp. 73–92.
[40] (2008) Unsupervised multilingual learning for morphological segmentation. In ACL.
[41] (2011) A non word error spell checker for indonesian using morphologically analyzer and hmm. In Proceedings of the 2011 International Conference on Electrical Engineering and Informatics, pp. 1–6.
[42] (2016) Automatic spelling correction for russian social media texts. In Proceedings of the International Conference “Dialog” (Moscow), pp. 688–701.
[43] (2017) Spelling correction for morphologically rich language: a case study of russian. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pp. 45–53.
[44] (2002) Corpus-based evaluation of a french spelling and grammar checker. In LREC.
[45] (2012) Parallel data, tools and interfaces in opus. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), N. C. (. Chair), K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis (Eds.), Istanbul, Turkey.
[46] (2009) Using the web for language independent spellchecking and autocorrection. In EMNLP.