Learning Multi-level Dependencies for Robust Word Recognition

11/22/2019 ∙ by Zhiwei Wang, et al. ∙ Michigan State University

Robust language processing systems are becoming increasingly important given the recent awareness of dangerous situations where brittle machine learning models can be easily broken by the presence of noise. In this paper, we introduce a robust word recognition framework that captures multi-level sequential dependencies in noised sentences. The proposed framework employs a sequence-to-sequence model over the characters of each word, whose output is given to a word-level bi-directional recurrent neural network. We conduct extensive experiments to verify the effectiveness of the framework. The results show that the proposed framework outperforms state-of-the-art methods by a large margin, and they also suggest that character-level dependencies can play an important role in word recognition.





Introduction

Most widely used language processing systems are built on neural networks that are highly effective, achieving performance comparable to that of humans [6, 29, 30]. They are also very brittle, however, as they can be easily broken by the presence of noise [2, 32, 8]. In contrast, the language processing mechanism of humans is very robust. One representative example is the following Cambridge sentence:

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

Although a human can read the above sentence with little difficulty, it can cause a complete failure of existing natural language processing systems such as the Google Translation Engine (https://translate.google.com/). Building robust natural language processing systems is becoming increasingly important given the severe consequences that adversarial samples can cause [28]: carefully misspelled spam emails that fool spam detection systems [9], deliberately designed input sentences that force chatbots to emit offensive language [25, 7, 15], etc. Thus, in this work, we focus on building a word recognition framework that can denoise misspellings such as those shown in the Cambridge sentence. As suggested by psycholinguistic studies [20, 5], humans can comprehend text that is noised by jumbling internal characters while leaving the first and last characters of a word unchanged. Thus, an ideal word recognition model is expected to emulate the robustness of the human language processing mechanism.

The benefits of such a framework are twofold. First, its recognition ability can be used directly to correct misspellings. Second, it contributes to the robustness of other natural language processing systems: serving as a denoising component, the word recognition framework can clean noised sentences before they are fed into other natural language processing systems [19, 33].

Figure 1: The graphical illustration of the proposed framework, MUDE. (a) Train stage. (b) Test stage.

From the human perspective, two types of information play an essential role in recognizing noised words [18]. The first is character-level dependencies. Take the word ‘wlohe’ in the Cambridge sentence as an example: it is extremely rare to see a ‘w’ next to an ‘l’ in an English word, whereas ‘wh’ is far more natural. Thus, it is quite easy for humans to narrow down the possible correct forms of ‘wlohe’ to ‘whole’ or ‘whelo’. To determine that it should be ‘whole’, we often need the second type of information: context such as ‘but the wrod as a wlohe.’, which is denoted as word-level dependencies in this paper. Intuitively, an effective word recognition framework should capture these multi-level dependencies. However, multi-level dependencies are rarely exploited by existing works such as scRNN [21]. Hence, we propose a framework, MUDE, that is able to fully utilize multi-level dependencies for the robust word recognition task. It integrates a character-level sequence-to-sequence model and a word-level sequential learning model into a coherent framework. The major contributions of our work are summarized as follows:

  • We identify the importance of character-level dependencies for recognizing a noised word;

  • We propose a novel framework, MUDE, that utilizes both character-level and word-level dependencies for the robust word recognition task;

  • We conduct extensive experiments on various types of noises to verify the effectiveness of MUDE.

In the rest of the paper, we first give a detailed description of MUDE. We then conduct experiments to verify its effectiveness, showing that MUDE achieves state-of-the-art performance regardless of the type of noise present and outperforms widely used baselines by a large margin. Next, we introduce the relevant literature in the related work section, followed by a conclusion and a discussion of possible future research directions.

The Proposed Framework: MUDE

In this section, we describe MUDE, which is able to capture both character-level and word-level dependencies. The overall architecture is illustrated in Figure 1. It consists of three major components: a sequence-to-sequence model, a bi-directional recurrent neural network, and a prediction layer. Next, we detail each component.

Learning Character-level Dependencies

As mentioned previously, there exist sequential patterns in the characters of a word. For example, vocabulary roots such as cur and bio can be found in many words. In this subsection, we propose a sequence-to-sequence model to learn a better representation of a given word by incorporating character-level dependencies. The model consists of an encoder and a decoder, which we will describe next.


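Encoder

Let $\{c_1, c_2, \dots, c_m\}$ be the sequence of characters of a given noised word $w$. We first map each character $c_i$ to a $d$-dimensional character embedding as follows:

$$\mathbf{x}_i = \mathbf{E}\,\mathbf{o}_i \qquad (1)$$

where $\mathbf{E} \in \mathbb{R}^{d \times C}$ is the embedding matrix given that the total number of unique characters is $C$, and $\mathbf{o}_i \in \mathbb{R}^{C}$ is the one-hot representation of $c_i$. Since there could be some noise in $w$, the sequential order of the $c_i$ can be misleading. Thus, instead of using a sequential learning model such as a recurrent neural network, we choose the multi-head attention mechanism [24] to model the dependencies between characters without considering their order. To do so, we add a special character $c_0$ whose final representation will be used as the representation of the word.

Specifically, the multi-head attention mechanism obtains a refined representation for each character in $w$. Without loss of generality, we use $c_i$ as an example. To obtain the refined representation of $c_i$, $\mathbf{x}_i$ is first projected into the query space, and every $\mathbf{x}_j$ ($j = 0, \dots, m$) is projected into the key and value spaces as follows:

$$\mathbf{q}_i = \mathbf{W}^{Q}\mathbf{x}_i, \qquad \mathbf{k}_j = \mathbf{W}^{K}\mathbf{x}_j, \qquad \mathbf{v}_j = \mathbf{W}^{V}\mathbf{x}_j \qquad (2)$$

where $\mathbf{W}^{Q}$, $\mathbf{W}^{K}$, $\mathbf{W}^{V}$ are the projection matrices for the query, key, and value spaces, respectively. With $\mathbf{q}_i$, $\mathbf{k}_j$ and $\mathbf{v}_j$, the refined representation $\mathbf{z}_i$ of $c_i$ can be calculated as the weighted sum of the $\mathbf{v}_j$:

$$\mathbf{z}_i = \sum_{j=0}^{m} \alpha_{ij}\,\mathbf{v}_j \qquad (3)$$

where $\alpha_{ij}$ is the attention score obtained by the following equation:

$$\alpha_{ij} = \frac{\exp\left(\mathbf{q}_i^{\top}\mathbf{k}_j / \sqrt{d_k}\right)}{\sum_{l=0}^{m} \exp\left(\mathbf{q}_i^{\top}\mathbf{k}_l / \sqrt{d_k}\right)} \qquad (4)$$

with $d_k$ the dimension of the key space.

To capture the dependencies of characters from different aspects, multiple sets of projection matrices are usually used, which result in multiple sets of $\mathbf{q}_i$, $\mathbf{k}_j$ and $\mathbf{v}_j$, and thus multiple $\mathbf{z}_i$. Concretely, assume that there are $h$ sets of projection matrices; from Eq. (2) and Eq. (3), we obtain $h$ refined representations, which are denoted as $\{\mathbf{z}_i^{1}, \dots, \mathbf{z}_i^{h}\}$. With this, the refined representation of $c_i$ is obtained by the concatenation operation:

$$\tilde{\mathbf{z}}_i = \mathrm{concat}(\mathbf{z}_i^{1}, \dots, \mathbf{z}_i^{h}) \qquad (5)$$

where $\tilde{\mathbf{z}}_i$ is the new representation of $c_i$ and contains the dependency information of $c_i$ on the other characters in $w$ from $h$ aspects.

Following [24], we also add a position-wise feedforward layer to $\tilde{\mathbf{z}}_i$ as follows:

$$\mathbf{u}_i = \mathbf{W}_2\,\mathrm{ReLU}(\mathbf{W}_1 \tilde{\mathbf{z}}_i) \qquad (6)$$

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are the learnable parameters and $\mathbf{u}_i$ is the final representation of $c_i$. Note that several such layers can be stacked together to form a deep structure.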
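To make the order-insensitivity of this kind of encoder concrete, here is a minimal pure-Python sketch of single-head scaled dot-product attention with no positional information. The random embedding table, the identity projections standing in for the learned projection matrices, and the `<w>` name for the special character are all assumptions for illustration, not the authors' implementation. It suggests why ‘whole’ and ‘wlohe’ receive the same word representation: both contain the same characters.

```python
import math
import random

def embed_table(chars, dim, seed=0):
    """Hypothetical random embedding table: one vector per character."""
    rng = random.Random(seed)
    return {c: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for c in chars}

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def word_representation(word, table, special="<w>"):
    """Representation read off the special character after one attention layer."""
    vectors = [table[special]] + [table[c] for c in word]
    # Identity projections stand in for learned query/key/value matrices (assumption).
    return attend(vectors[0], vectors, vectors)

table = embed_table(["<w>"] + list("abcdefghijklmnopqrstuvwxyz"), dim=8)
clean = word_representation("whole", table)
noised = word_representation("wlohe", table)
other = word_representation("where", table)
```

Because the weighted sum runs over an unordered collection of characters, `clean` and `noised` agree up to floating-point rounding, while `other` differs. Distinguishing words that share the same characters is then left to the decoder and to the word-level context.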

At this point, we have obtained the refined representation vector for each character, and we use that of the special character $c_0$ as the representation of the given noised word, which is denoted as $\mathbf{r}$.

Decoder

To capture the sequential dependency in the correct words, the Gated Recurrent Unit (GRU), which has achieved great performance in many sequence learning tasks [27, 1, 26], is used as the decoder. Specifically, in the decoding process, the hidden state of the GRU is initialized with the noised word representation $\mathbf{r}$. Then, at each time step $t$, the GRU recursively outputs a hidden state $\mathbf{h}_t$ given the hidden state $\mathbf{h}_{t-1}$ at the previous time step. Due to the page limit, we do not show the details of the GRU, which is well described in [3]. In addition, each hidden state emits a predicted character $\hat{c}_t$. The decoding process ends when the special character denoting the end of the word is emitted. Formally, the whole decoding process is as follows:

$$\mathbf{h}_0 = \mathbf{r}, \qquad \mathbf{h}_t = \mathrm{GRU}(\mathbf{h}_{t-1}), \qquad \mathbf{p}_t = \mathrm{softmax}(\mathbf{W}_p \mathbf{h}_t), \qquad \hat{c}_t = \arg\max_{i}\, \mathbf{p}_t[i] \qquad (7)$$

where $\mathbf{W}_p$ is a trainable parameter, $\mathbf{p}_t$ gives the emission probability of each character, and $\mathbf{p}_t[i]$ denotes the $i$-th entry of the vector $\mathbf{p}_t$.

Sequence-to-sequence Loss

To train the previously described character-level sequence-to-sequence model, we define the loss function as follows:

$$\mathcal{L}_{seq2seq} = -\sum_{t=1}^{T} \log \mathbf{p}_t[y_t] \qquad (8)$$

where $\mathbf{p}_t$ is the emission probability vector at decoding step $t$, $T$ is the length of the correct word, and $y_t$ is the index of the ground-truth character at position $t$ of the correct word. By minimizing $\mathcal{L}_{seq2seq}$, the sequence-to-sequence model can learn a meaningful representation that incorporates character-level sequential dependencies for the noised word. Next, we describe the framework component that captures the word-level dependencies.

Capturing Word-level Dependencies

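From the human perspective, it is vitally important to consider the context of the whole sentence in order to understand a noised word. For example, it would be very hard to know that ‘frist’ means ‘first’ until a context such as ‘the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae.’ is given. Thus, to utilize the context information and word-level dependencies, we design a recurrent neural network (RNN) to incorporate them into the noised word representation. Specifically, the word representations obtained from the character-level encoder are passed into a bi-directional long short-term memory (LSTM) network. Concretely, given a sequence of word representations $\{\mathbf{r}_1, \dots, \mathbf{r}_n\}$ obtained from character-level dependencies, we calculate a sequence of refined word representation vectors $\{\mathbf{g}_1, \dots, \mathbf{g}_n\}$ as follows:

$$\overrightarrow{\mathbf{g}}_i = \overrightarrow{\mathrm{LSTM}}(\mathbf{r}_i, \overrightarrow{\mathbf{g}}_{i-1}), \qquad \overleftarrow{\mathbf{g}}_i = \overleftarrow{\mathrm{LSTM}}(\mathbf{r}_i, \overleftarrow{\mathbf{g}}_{i+1}), \qquad \mathbf{g}_i = \overrightarrow{\mathbf{g}}_i \,\Vert\, \overleftarrow{\mathbf{g}}_i \qquad (9)$$

where $\Vert$ denotes concatenation. $\overrightarrow{\mathrm{LSTM}}$ indicates that the $\mathbf{r}_i$ are processed from $\mathbf{r}_1$ to $\mathbf{r}_n$, while $\overleftarrow{\mathrm{LSTM}}$ processes the word representations in the opposite direction, namely from $\mathbf{r}_n$ to $\mathbf{r}_1$. Compared to the original LSTM, where only the forward pass is performed, the bi-directional LSTM can include both ‘past’ and ‘future’ information in the representation $\mathbf{g}_i$.

With the aforementioned procedure, the representation of each word now incorporates both character-level and word-level dependencies. Thus, the correct word is predicted as follows:

$$\mathbf{P}_i = \mathrm{softmax}(\mathbf{W}_v \mathbf{g}_i) \qquad (10)$$

where $\mathbf{W}_v \in \mathbb{R}^{V \times d_g}$ is a trainable matrix, $d_g$ is the dimension of $\mathbf{g}_i$, and $V$ is the size of the vocabulary that contains all possible words. Moreover, $\mathbf{P}_i \in \mathbb{R}^{V}$ is the probability distribution over the vocabulary for the $i$-th word in a sentence.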

Word Prediction Loss

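To effectively train MUDE for correct word prediction, similar to the character-level sequence-to-sequence model, we define the following objective function:

$$\mathcal{L}_{pred} = -\sum_{i=1}^{n} \log \mathbf{P}_i[y^{w}_i] \qquad (11)$$

where $\mathbf{P}_i$ is the predicted probability distribution over the vocabulary for the $i$-th word of an $n$-word sentence and $y^{w}_i$ is the index of the correct word.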

Training Procedure

So far we have described MUDE, which includes a character-level sequence-to-sequence model and a word-level sequential learning model. To train both models simultaneously, we design a loss function for the whole framework as follows:

$$\mathcal{L} = \mathcal{L}_{pred} + \beta\,\mathcal{L}_{seq2seq} \qquad (12)$$

where $\mathcal{L}_{pred}$ is the word prediction loss, $\mathcal{L}_{seq2seq}$ is the character-level sequence-to-sequence loss, and $\beta$ is a hyperparameter that controls the contribution of the character-level sequence-to-sequence model. Since the major goal of the framework is to predict the correct word given the noised word, we decrease the value of $\beta$ gradually as training proceeds, allowing the optimizer to increasingly focus on improving the word prediction performance.
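As a rough illustration of this joint objective, the sketch below combines two cross-entropy terms with a weight that shrinks over epochs. The exponential decay schedule, the `decay` rate, the name `beta`, and the toy probabilities are assumptions made for the example; the text only states that the weight is decreased gradually during training.

```python
import math

def cross_entropy(probabilities, target_index):
    """Negative log-likelihood of the target under one probability vector."""
    return -math.log(probabilities[target_index])

def beta_schedule(initial_beta, epoch, decay=0.5):
    """Hypothetical schedule: multiply the weight by `decay` after each epoch."""
    return initial_beta * (decay ** epoch)

def total_loss(pred_loss, seq2seq_loss, beta):
    """Word prediction loss plus the weighted character-level seq2seq loss."""
    return pred_loss + beta * seq2seq_loss

# Toy values: one word prediction and a two-character decoded sequence.
pred_loss = cross_entropy([0.1, 0.7, 0.2], target_index=1)
seq2seq_loss = cross_entropy([0.6, 0.4], 0) + cross_entropy([0.2, 0.8], 1)

for epoch in range(4):
    beta = beta_schedule(1.0, epoch)
    print(epoch, beta, round(total_loss(pred_loss, seq2seq_loss, beta), 4))
```

As the weight decays, the total loss approaches the word prediction loss alone, which mirrors the stated intent of letting the optimizer focus on word prediction late in training.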

Test Stage

As shown in Figure 1, in the test stage, we simply remove the decoder of the sequence-to-sequence model and only keep the encoder in the framework.


Experiment

In this section, we conduct extensive experiments on the spell correction task to verify the effectiveness of MUDE. We first introduce the experimental settings, followed by an analysis of the experimental results.

Experimental Settings


Data

We use the publicly available Penn Treebank [16] as the dataset. Following previous work [21], we first experiment on four different types of noise: Permutation (PER), Deletion (DEL), Insertion (INS), and Substitution (SUB), which only operate on the internal characters of words, leaving the first and last characters unchanged. Table 1 shows a toy example of a noised sentence. These four types of noise cover most realistic cases of misspellings and are commonly tested in previous works [2, 19]. For each type of noise, we construct a noised dataset from the original dataset by altering all the words that have more than 3 characters with the corresponding noise. We use the same training, validation and testing split as in [21], which contains 39,832, 1,700 and 2,416 sentences, respectively.

Noise Type Sentence
Correct An illustrative example of noised text
PER An isulvtriatle epaxmle of nsieod txet
DEL An illstrative examle of nosed tet
INS An ilelustrative edxample of nmoised texut
SUB An ilkustrative exsmple of npised test
Table 1: Toy examples of noised text
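The four noise types can be sketched in a few lines of Python. This is an illustrative sketch rather than the generation script used for the experiments: applying exactly one edit per word, drawing inserted or substituted characters from lowercase ASCII letters, and the function names are all assumptions.

```python
import random
import string

def _split(word):
    """Split a word into first character, internal characters, last character."""
    return word[0], list(word[1:-1]), word[-1]

def permute(word, rng):
    """PER: shuffle the internal characters (may occasionally leave them unchanged)."""
    first, inner, last = _split(word)
    rng.shuffle(inner)
    return first + "".join(inner) + last

def delete(word, rng):
    """DEL: drop one internal character."""
    first, inner, last = _split(word)
    del inner[rng.randrange(len(inner))]
    return first + "".join(inner) + last

def insert(word, rng):
    """INS: insert one random letter somewhere between the first and last character."""
    first, inner, last = _split(word)
    inner.insert(rng.randrange(len(inner) + 1), rng.choice(string.ascii_lowercase))
    return first + "".join(inner) + last

def substitute(word, rng):
    """SUB: replace one internal character with a random letter (may pick the same one)."""
    first, inner, last = _split(word)
    inner[rng.randrange(len(inner))] = rng.choice(string.ascii_lowercase)
    return first + "".join(inner) + last

NOISE = {"PER": permute, "DEL": delete, "INS": insert, "SUB": substitute}

def add_noise(sentence, kind, seed=0):
    """Apply one noise type to every word with more than 3 characters."""
    rng = random.Random(seed)
    noise = NOISE[kind]
    return " ".join(noise(w, rng) if len(w) > 3 else w for w in sentence.split())

for kind in NOISE:
    print(kind, add_noise("An illustrative example of noised text", kind))
```

Words of three characters or fewer pass through untouched, and the first and last characters of every longer word are preserved, matching the constraints described for PER, DEL, INS and SUB.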


Baselines

To show the effectiveness of MUDE, we compare it with two strong and widely used baselines. The first is the Enchant spell checker (https://abiword.github.io/enchant/), which is based on dictionaries. The second is scRNN [21], a recurrent neural network based word recognition model that achieved the previous state-of-the-art results on spell correction tasks. This baseline only considers sequential dependencies at the word level with a recurrent neural network and ignores those at the character level. Note that other baselines, including CharCNN [22], have been significantly outperformed by scRNN. Thus, we do not include them in the experiments.

Implementation Details

Both scRNN and MUDE are implemented with PyTorch. The number of hidden units of word representations is set to 650, as suggested by previous work [21]. The learning rate is chosen from {0.1, 0.01, 0.001, 0.0001}, and the loss weight $\beta$ in Eq. (12) is chosen from {1, 0.1, 0.001}, according to the model performance on the validation datasets. The parameters of MUDE are learned with the stochastic gradient descent algorithm, and we choose RMSprop [23] as the optimizer, as in [21]. To make the comparison fair, scRNN is trained with the same settings as MUDE.

Comparison Results

The comparison results are shown in Table 2, from which several observations can be made. First, the model-based methods (scRNN and MUDE) achieve much better performance than the dictionary-based one (Enchant). This is not surprising, as model-based methods can effectively utilize the sequential dependencies of words in the sentences. Moreover, MUDE consistently outperforms scRNN in all cases, which we attribute to the design of MUDE that captures both character-level and word-level dependencies. A more detailed analysis of the contribution of character-level dependencies is given later in this section. In addition, we observe that the difficulty brought by different types of noise varies significantly. Generally, for model-based methods, permutation and insertion noise are easier to deal with than deletion and substitution noise. We argue this is because the former do not lose any character information; in other words, the original character information is largely preserved under permutation and insertion. In contrast, both deletion and substitution cause information loss, which makes it harder to recognize the original words. This again demonstrates how important character-level information is. Finally, the results also show that in the more difficult situations where deletion or substitution noise is present, the advantage of MUDE becomes even more obvious, which suggests the effectiveness of MUDE.

Method    PER     DEL     INS     SUB
Enchant   72.33   71.23   93.93   79.77
scRNN     98.23   91.55   95.95   87.09
MUDE      98.81   95.86   97.16   90.52
Table 2: Performance comparison with different types of noise in terms of accuracy (%). Best results are highlighted with bold numbers.

Next, we go one step further by removing the constraint that the noise does not affect the first and last characters of each word. More specifically, we define four new types of noise, W-PER, W-DEL, W-INS, and W-SUB, which stand for altering a word by permuting the whole word, and by deleting, inserting, and substituting characters at any position of the word, respectively. Similarly, for each new type of noise, we construct a noised dataset. The results are shown in Table 3.

Figure 2: Learning curves of MUDE during training with different values of the seq2seq loss weight $\beta$. (a) Prediction loss. (b) Seq2Seq loss.

Method    W-PER   W-DEL   W-INS   W-SUB
Enchant   59.08   69.84   93.77   77.23
scRNN     97.36   89.99   95.96   81.12
MUDE      98.75   93.29   97.10   85.17
Table 3: Performance comparison with different types of noise in terms of accuracy (%). Best results are highlighted with bold numbers.

From the table, we first observe that the performance of nearly all methods decreases compared to Table 2. This suggests that the new types of noise are more difficult to handle, which is expected as they cause more variations of noised words. In fact, without keeping the first and last characters of each word, it also becomes difficult for humans to comprehend the noised sentences [20]. Second, MUDE still achieves higher accuracy than the baselines, which is consistent with the observations from Table 2. More importantly, as the difficulty of the task increases, the advantage of MUDE over scRNN also becomes more obvious. Take substitution noise for example: in Table 2, MUDE has an absolute accuracy gain of around 3.5% over scRNN (90.52 vs. 87.09). Under the more difficult noise (W-SUB), the gain becomes about 4% (85.17 vs. 81.12), as shown in Table 3. This observation is also consistent with previous findings.
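The quoted gains can be checked with a few lines of arithmetic. The snippet below copies the scRNN and MUDE accuracies from Tables 2 and 3, assuming the last column of each table is the substitution column, which is consistent with the gains quoted above; the dictionary layout and function name are illustrative.

```python
# Accuracy (%) on substitution noise, copied from Table 2 (SUB) and Table 3 (W-SUB).
accuracy = {
    "SUB": {"scRNN": 87.09, "MUDE": 90.52},
    "W-SUB": {"scRNN": 81.12, "MUDE": 85.17},
}

def absolute_gain(noise):
    """Absolute accuracy gain of MUDE over scRNN, in percentage points."""
    scores = accuracy[noise]
    return round(scores["MUDE"] - scores["scRNN"], 2)

gains = {noise: absolute_gain(noise) for noise in accuracy}
print(gains)  # {'SUB': 3.43, 'W-SUB': 4.05}
```

The gap widens from 3.43 to 4.05 percentage points when the first and last characters are no longer protected.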

In summary, Tables 2 and 3 clearly demonstrate the robustness of MUDE and its advantages over scRNN, which cannot utilize character-level dependencies. Thus, in the next subsection, we analyze the contribution of character-level dependencies to gain a better understanding of MUDE.

Parameter Analysis

In this subsection, we analyze the contribution of character-level dependencies to better word representations by showing the model performance with different values of the loss weight $\beta$, which controls the contribution of the character-level sequence-to-sequence loss. Specifically, we compare $\beta = 0$ with the value of $\beta$ that achieves the best accuracy on the validation set. When $\beta = 0$, MUDE totally ignores the character-level dependencies. The prediction loss and seq2seq loss during the training stage with the different $\beta$ values are shown in Fig. 2. Note that the trends in Fig. 2 are similar across all types of noise, and we only show the W-PER case due to the page limit.

As the upper sub-figure shows, when $\beta$ is nonzero the prediction loss converges faster and to a lower value than when $\beta = 0$. The seq2seq loss remains constant when $\beta = 0$, as the model does not learn anything regarding the seq2seq task. On the other hand, when $\beta$ is nonzero, the seq2seq loss decreases steadily, suggesting that MUDE is trained to obtain a better representation of each word. The clear positive correlation between these two losses demonstrates the importance of learning character-level dependencies in misspelling correction tasks.

Train noise (rows) vs. test noise (columns)
Train    PER     W-PER   DEL     W-DEL   INS     W-INS   SUB     W-SUB   MEAN
PER      98.81   --      82.55   79.61   92.21   92.37   71.39   69.88   85.70
W-PER    --      98.75   81.31   78.3    91.32   91.25   69.55   67.91   84.64
DEL      90.83   90.83   --      86.02   79.96   79.97   81.99   76.02   85.18
W-DEL    86.75   86.75   94.08   --      78.83   78.87   80.35   79.07   84.74
INS      94.79   94.79   77.3    74.81   --      97.15   82.86   80.42   87.41
W-INS    95.67   95.67   78.34   75.95   97.01   --      82.96   80.78   87.91
SUB      91.71   91.71   88.34   81.49   81.19   81.21   --      83.65   86.22
W-SUB    87.05   87.05   83.42   82.42   79.27   79.17   85.67   --      83.65

Table 4: Generalization analysis results. The best results are highlighted. MEAN shows the average value of each row.
Train noise (rows) vs. test noise (columns)
Train    PER     W-PER   DEL     W-DEL   INS     W-INS   SUB     W-SUB   MEAN
W-ALL    96.45   96.45   94.26   93.34   95.3    95.28   91.51   90.48   94.13

Table 5: Data augmentation results. The values that are higher than those of Table 4 are bold.

Generalization Analysis

In this subsection, we conduct experiments to understand the generalization ability of MUDE. Concretely, we train the framework on one type of noise and test it on a dataset with another type of noise. The results are shown in Table 4.

From the results, we make the following two observations. First, between datasets with similar types of noise, MUDE generalizes quite well (e.g., trained on W-PER and tested on PER), which is not surprising. Second, MUDE trained on one type of noise performs much worse on types of noise that are very different. These observations suggest that it is hard for MUDE to generalize across noise types, possibly because of the small overlap between the distributions of the different types of noise.

Thus, we next apply the commonly used adversarial training method, augmenting the training data with all types of noise and testing MUDE on each type of noise individually. As W-* (* ∈ {PER, DEL, INS, SUB}) subsumes *, in this experiment we only combine the W-* datasets instead of all types of noise. We denote the newly constructed training dataset as W-ALL. The results are shown in Table 5. It can be observed from the table that MUDE trained on W-ALL has much better generalization ability (i.e., its mean accuracy is much higher). In addition, it is interesting to see that the performance of MUDE decreases slightly in the relatively easy cases where permutation or insertion noise is present, while increasing substantially in the difficult cases where deletion or substitution noise is present.
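The size of the improvement in mean accuracy can be worked out directly from the reported numbers. The snippet below copies the W-ALL row of Table 5 and the MEAN column of Table 4; the variable names are illustrative.

```python
# Accuracy (%) of MUDE trained on W-ALL, copied from Table 5 (MEAN column excluded).
w_all = [96.45, 96.45, 94.26, 93.34, 95.3, 95.28, 91.51, 90.48]

# MEAN column of Table 4: one value per single-noise training set.
single_noise_means = {
    "PER": 85.70, "W-PER": 84.64, "DEL": 85.18, "W-DEL": 84.74,
    "INS": 87.41, "W-INS": 87.91, "SUB": 86.22, "W-SUB": 83.65,
}

w_all_mean = round(sum(w_all) / len(w_all), 2)
best_single = max(single_noise_means, key=single_noise_means.get)
improvement = round(w_all_mean - single_noise_means[best_single], 2)
```

The recomputed mean matches the 94.13 reported in Table 5, and it exceeds the best single-noise mean (training on W-INS, 87.91) by 6.22 percentage points.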

Case Study

In this subsection, we take the Cambridge sentence, which is not in the training set, as an example to give a qualitative illustration of MUDE’s misspelling correction performance. Due to space constraints, we only show the results for two types of noise: W-PER and W-INS. The example is shown in Table 6. We can see from the table that it is quite difficult even for humans to comprehend the noised sentence when the first and last characters are also changed. However, MUDE can still recognize almost all of the words. In addition, in both cases, MUDE makes fewer errors in the corrected sentence than scRNN, which is consistent with the previous quantitative results.

Correct: According to a researcher at Cambridge University , it does n’t matter in what order the letters in a word are , the only important thing is that the first and last letter be at the right place . The rest can be a total mess and you can still read it without problem . This is because the human mind does not read every letter by itself , but the word as a whole .

W-PER
Noised: iodrcAngc ot a reeachsr at meigaCdbr srtiinUyve , it seod tn’ amrtte in wtah rerdo het tserelt in a rdwo rae , the onyl onmtiaptr ingth si tath hte itfrs dan stla treelt be ta het tgrhi place . hTe rset nca be a aotlt mess dan ouy anc lsilt drae ti tthwuoi lorbmpe . hTsi is aubeecs the huamn dmni edos nto erad evrye lteter by etfisl , but het rdwo sa a eholw .
scRNN: According to a research at Cambridge University , it does n’t matter in what order the letters in a word are , the only important thing is that the first and last letter be at the right place . The rest can be a total mess and you can still read it without problem . This is because the human mind does not read very letter by itself , but the word as a whole .
MUDE: According to a research at Cambridge University , it does n’t matter in what order the letters in a word are , the only important thing is that the first and last letter be at the right place . The rest can be a total mess and you can still read it without problem . This is because the human mind does not read every letter by itself , but the word as a whole .

W-INS
Noised: Acxcording to a reysearch at Cazmbridge Univversity , it doesw n’t msatter in whmat orderh the letteros in a fword are , the oynly wimportant tghing is tyhat the fircst and ldast legtter be at the rightv placeu . The resty can be a totalp mesus and you can stillb rnead it withougt promblem . Txhis is bebcause the humgan minnd doess not reabd everyb lettfer by itslelf , but the whord as a whvole .
scRNN: according to a research at Cambridge University , it does n’t matter in what order the letters in a word are , the only important thing is that the first and last better be at the right place . The rest can be a total less and you can still read it without problem . This is because the human mind does not rated every better by itself , but the word a a whole .
MUDE: According to a research at Cambridge University , it does n’t matter in what order the letters in a word are , the only important thing is that the first and last letter be at the right place . The rest can be a total uses and you can still read it without problem . This is because the human mind does not bear every letter by itself , but the word as a whole .

Table 6: An illustrative example of spelling correction outputs for the Cambridge sentence. Words that the models fail to correct are underlined and bold.

Related Work

In this section, we briefly review the related literature, which is grouped into two categories. The first category includes existing works on similar tasks, and the second contains previous works that have applied word recognition models to improve the robustness of other NLP systems.

Grammatical Error Correction

Since the CoNLL-2014 shared task [17], Grammatical Error Correction (GEC) has gained great attention from the NLP community [31, 11, 14, 4, 13]. Currently, the most effective approaches regard GEC as a machine translation problem that translates erroneous sentences into correct sentences. Thus, many methods based on statistical or neural machine translation architectures have been proposed. However, most of the existing GEC systems have focused on the correction of grammar errors instead of noised spellings. For example, most of the words in an erroneous sentence in the CoNLL-2014 shared task [17] are spelled correctly, as in ‘Nothing is absolute right or wrong’, where the only error comes from the specific form ‘absolute’. One of the existing works most similar to this paper is scRNN [21], where each word is represented in a fixed ‘bag of characters’ way. It only consists of a word-level RNN and focuses on relatively easy noise. In contrast, our proposed framework is more flexible and can obtain meaningful representations that incorporate both character-level and word-level dependencies. In addition, we have experimented on more difficult types of noise than those in [21] and achieved much better performance.

Denoising text for downstream tasks

Robust NLP systems are becoming increasingly important given the severe consequences adversarial samples can cause [10, 12, 28]. For instance, previous works have shown that neural machine translation models can be easily broken by words whose characters are permuted [2]. To solve this problem, researchers have found that misspelling correction models can play an extremely effective role in improving the robustness of such systems [19, 33]. For example, Pruthi et al. [19] first applied the pre-trained scRNN model to the source sentence to remove noise, and then fed the denoised source sentence into the neural translation model to obtain the correctly translated sentence. In addition, Zhou et al. [33] directly integrated such denoising models into a machine translation system that was trained end to end. Either way, these works suggest that the proposed framework, which has demonstrated strong performance, has great potential for improving the robustness of other NLP systems.


Conclusion

As most current NLP systems are very brittle, it is extremely important to develop robust neural models. In this paper, we have presented a word recognition framework, MUDE, that achieves very strong and robust performance in the presence of different types of noise. The proposed framework is able to capture both character-level and word-level dependencies to obtain effective word representations. Extensive experiments on datasets with various types of noise have demonstrated its superior performance over existing popular models.

There are several meaningful future research directions worth exploring. The first is to extend MUDE to deal with sentences where word-level noise is present. For example, in noised sentences, some of the words might be swapped, dropped, inserted or replaced. In addition, it is also meaningful to improve the generality of MUDE such that it can achieve strong performance in the presence of types of noise not seen in the training dataset. Another possible future direction is to utilize MUDE to improve the robustness of other NLP systems, including machine translation, reading comprehension, and text classification. Lastly, as this work primarily focuses on English, it would be very meaningful to evaluate the proposed framework on other languages.


  • [1] S. Andermatt, S. Pezold, and P. Cattin (2016) Multi-dimensional gated recurrent units for the segmentation of biomedical 3d-data. In Deep Learning and Data Labeling for Medical Applications, pp. 142–151. Cited by: Decoder.
  • [2] Y. Belinkov and Y. Bisk (2017) Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173. Cited by: Introduction, Data, Denoising text for downstream tasks.
  • [3] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv:1406.1078. Cited by: Decoder.
  • [4] S. Chollampatt and H. T. Ng (2017) Connecting the dots: towards human-level grammatical error correction. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 327–333. Cited by: Grammatical Error Correction.
  • [5] M. Davis (2012) Psycholinguistic evidence on scrambled letters in reading. Cited by: Introduction.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Cited by: Introduction.
  • [7] E. Dinan, S. Humeau, B. Chintagunta, and J. Weston (2019) Build it break it fix it for dialogue safety: robustness from adversarial human attack. arXiv:1908.06083. Cited by: Introduction.
  • [8] J. Ebrahimi, A. Rao, D. Lowd, and D. Dou (2017) Hotflip: white-box adversarial examples for text classification. arXiv:1712.06751. Cited by: Introduction.
  • [9] G. Fumera, I. Pillai, and F. Roli (2006) Spam filtering based on the analysis of text information embedded into images. Journal of Machine Learning Research 7 (Dec), pp. 2699–2720. Cited by: Introduction.
  • [10] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel (2017) Adversarial examples for malware detection. In European Symposium on Research in Computer Security, pp. 62–79. Cited by: Denoising text for downstream tasks.
  • [11] R. Grundkiewicz and M. Junczys-Dowmunt (2018) Near human-level performance in grammatical error correction with hybrid machine translation. arXiv:1804.05945. Cited by: Grammatical Error Correction.
  • [12] M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer (2018) Adversarial example generation with syntactically controlled paraphrase networks. arXiv:1804.06059. Cited by: Denoising text for downstream tasks.
  • [13] J. Ji, Q. Wang, K. Toutanova, Y. Gong, S. Truong, and J. Gao (2017) A nested attention neural hybrid model for grammatical error correction. arXiv:1707.02026. Cited by: Grammatical Error Correction.
  • [14] M. Junczys-Dowmunt, R. Grundkiewicz, S. Guha, and K. Heafield (2018) Approaching neural grammatical error correction as a low-resource machine translation task. arXiv:1804.05940. Cited by: Grammatical Error Correction.
  • [15] H. Liu, T. Derr, Z. Liu, and J. Tang (2019) Say what I want: towards the dark side of neural dialogue models. arXiv:1909.06044. Cited by: Introduction.
  • [16] M. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a large annotated corpus of English: the Penn Treebank. Cited by: Data.
  • [17] H. T. Ng, S. M. Wu, T. Briscoe, C. Hadiwinoto, R. H. Susanto, and C. Bryant (2014) The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pp. 1–14. Cited by: Grammatical Error Correction.
  • [18] M. Perea, M. Jiménez, F. Talero, and S. López-Cañada (2015) Letter-case information and the identification of brand names. British Journal of Psychology 106 (1), pp. 162–173. Cited by: Introduction.
  • [19] D. Pruthi, B. Dhingra, and Z. C. Lipton (2019) Combating adversarial misspellings with robust word recognition. arXiv:1905.11268. Cited by: Introduction, Data, Denoising text for downstream tasks.
  • [20] K. Rayner, S. J. White, and S. Liversedge (2006) Raeding wrods with jubmled lettres: there is a cost. Cited by: Introduction, Comparison Results.
  • [21] K. Sakaguchi, K. Duh, M. Post, and B. Van Durme (2017) Robsut wrod reocginiton via semi-character recurrent neural network. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: Introduction, Data, Baselines, Implementation Details, Grammatical Error Correction.
  • [22] I. Sutskever, J. Martens, and G. E. Hinton (2011) Generating text with recurrent neural networks. In ICML-11, pp. 1017–1024. Cited by: Baselines.
  • [23] T. Tieleman and G. Hinton (2012) RMSProp. COURSERA: Lecture 7017. Cited by: Implementation Details.
  • [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: Encoder, Encoder.
  • [25] M. J. Wolf, K. Miller, and F. S. Grodzinsky (2017) Why we should have seen that coming: comments on Microsoft’s Tay experiment, and wider implications. ACM SIGCAS Computers and Society 47 (3), pp. 54–64. Cited by: Introduction.
  • [26] D. Xu, W. Cheng, D. Luo, Y. Gu, X. Liu, J. Ni, B. Zong, H. Chen, and X. Zhang (2019) Adaptive neural network for node classification in dynamic networks. In ICDM. Cited by: Decoder.
  • [27] D. Xu, W. Cheng, D. Luo, X. Liu, and X. Zhang (2019) Spatio-temporal attentive rnn for node classification in temporal attributed graphs. In IJCAI, pp. 3947–3953. Cited by: Decoder.
  • [28] H. Xu, Y. Ma, H. Liu, D. Deb, H. Liu, J. Tang, and A. Jain (2019) Adversarial attacks and defenses in images, graphs and text: a review. arXiv:1909.08072. Cited by: Introduction, Denoising text for downstream tasks.
  • [29] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv:1906.08237. Cited by: Introduction.
  • [30] A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le (2018) QANet: combining local convolution with global self-attention for reading comprehension. arXiv:1804.09541. Cited by: Introduction.
  • [31] W. Zhao, L. Wang, K. Shen, R. Jia, and J. Liu (2019) Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data. arXiv:1903.00138. Cited by: Grammatical Error Correction.
  • [32] Z. Zhao, D. Dua, and S. Singh (2017) Generating natural adversarial examples. arXiv:1710.11342. Cited by: Introduction.
  • [33] S. Zhou, X. Zeng, Y. Zhou, A. Anastasopoulos, and G. Neubig (2019) Improving robustness of neural machine translation with multi-task learning. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp. 565–571. Cited by: Introduction, Denoising text for downstream tasks.