Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network

08/07/2016 · by Keisuke Sakaguchi et al. · Johns Hopkins University

The language processing mechanism of humans is generally more robust than that of computers. The Cmabrigde Uinervtisy (Cambridge University) effect from the psycholinguistics literature demonstrates such a robust word processing mechanism, in which jumbled words (e.g. Cmabrigde / Cambridge) are recognized with little cost. On the other hand, computational models for word recognition (e.g. spelling checkers) perform poorly on data with such noise. Inspired by the findings from the Cmabrigde Uinervtisy effect, we propose a word recognition model based on a semi-character level recurrent neural network (scRNN). In our experiments, we demonstrate that scRNN is significantly more robust in word spelling correction (i.e. word recognition) than existing spelling checkers and a character-based convolutional neural network. Furthermore, we demonstrate that the model is cognitively plausible by replicating a psycholinguistics experiment on human reading difficulty using our model.


Introduction

Despite the rapid improvement in natural language processing by computers, humans still have advantages in situations where the text contains noise. For example, the following passage, introduced by a psycholinguist [davis2003aoccdrnig], provides a great demonstration of the robust word recognition mechanism in humans:

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

This example shows the Cmabrigde Uinervtisy (Cambridge University) effect, which demonstrates that human reading is resilient to (particularly internal) letter transposition.

Figure 1: Schematic Illustration of semi-character recurrent neural network (scRNN).

Robustness is an important and useful property for various tasks in natural language processing, and we propose a computational model which replicates this robust word recognition mechanism. The model is based on a standard recurrent neural network (RNN) with a memory cell as in long short-term memory (LSTM) [hochreiter1997long]. We use an RNN because it has been shown to achieve state-of-the-art performance in language modeling [DBLP:conf/interspeech/MikolovKBCK10] and because it is flexible enough to incorporate the findings from the Cmabrigde Uinervtisy effect. Technically, the input layer of our model consists of three sub-vectors: the beginning ($b_n$), internal ($i_n$), and ending ($e_n$) character(s) of the input word (Figure 1). We refer to this semi-character level recurrent neural network as scRNN.

Cond. Example # of fixations Regression (%) Avg. fixation (ms)
N The boy could not solve the problem so he asked for help. 10.4 15.0 236
INT The boy cuold not slove the probelm so he aksed for help. 11.4 17.6 244
END The boy coudl not solev the problme so he askde for help. 12.6 17.5 246
BEG The boy oculd not oslve the rpoblem so he saked for help. 13.0 21.5 259
Table 1: Example sentences and results for measures of fixation, excerpted from Rayner01032006 (Rayner01032006). There are 4 conditions: N = normal text; INT = internally jumbled letters; END = letters at word endings are jumbled; BEG = letters at word beginnings are jumbled.

First, we review previous work on the robust word recognition mechanism from the psycholinguistics literature. Next, we describe the technical details of scRNN, which captures this robust human mechanism using recent developments in neural networks. As closely related work, we also explain the character-based convolutional neural network (CharCNN) proposed by DBLP:journals/corr/KimJSR15 (DBLP:journals/corr/KimJSR15). Our experiments show that scRNN significantly outperforms commonly used spelling checkers and CharCNN by (at least) 42% for jumbled word correction and by 3% and 14% for the other noise types (insertion and deletion). We also show that scRNN replicates recent findings from psycholinguistics experiments on reading difficulty depending on the position of jumbled letters, which indicates that scRNN successfully mimics (at least a part of) the robust word recognition mechanism in humans.

Raeding Wrods with Jumbled Lettres

Sentence processing with jumbled words has been a major research topic in the psycholinguistics literature. One popular experimental paradigm is masked priming, in which a (lower-cased) stimulus, called the prime, is presented for a short duration (e.g. 60 milliseconds) followed by the (upper-cased) target word, and participants are asked to judge whether the target word exists in English as quickly as possible (Figure 2).[1] The prime is consciously imperceptible due to the instantaneous presentation, but it is nevertheless processed by the participants' visual word recognition system. The masked priming paradigm therefore allows us to investigate the machinery of lexical processing and the effect of the prime in a pure manner.

[1] There is another variant of the masked priming technique, in which a backward mask is inserted between the prime and target in addition to the forward mask.

forster1987masked (forster1987masked) show that a jumbled word (e.g. gadren-GARDEN) produces a priming effect as large as an identity prime (garden-GARDEN), and these results have been confirmed in cases where the transposed letters are not adjacent (caniso-CASINO) [Perea2004231] and even in more extreme cases (sdiwelak-SIDEWALK) [doi:10.1080/01690960701579722].

Figure 2: Example of the masked priming procedure.

These findings about the robust word processing mechanism in humans have been further investigated with other types of noise beyond simple letter transpositions. Humphreys1990517 (Humphreys1990517) show that deleting a letter from a word still produces a significant priming effect (e.g. blck-BLACK), and similar results have been reported in other work [Peressotti1999, grainger2006letter]. van2006study (van2006study) demonstrate that a priming effect remains when a character is inserted into a word (e.g. juastice-JUSTICE).

Another popular experimental paradigm in psycholinguistics is eye-movement tracking. In comparison to the masked priming technique, the eye-movement paradigm provides data from participants' normal reading process. Regarding word recognition, the eye-tracking method has revealed a relationship between word difficulty and eye fixation time on the word: when a word is difficult to process, the average fixation duration becomes longer. In addition, words that are difficult to process often induce regressions to previously read words.

With the eye-movement paradigm, Rayner01032006 (Rayner01032006) and johnson2007transposed (johnson2007transposed) conduct detailed experiments on the robust word recognition mechanism with jumbled letters. They show that letter transposition affects fixation time measures during reading, depending on which part of the word is jumbled. Table 1 presents the results from Rayner01032006 (Rayner01032006). It is clear that people can read smoothly (i.e. with fewer fixations, fewer regressions, and shorter average fixation durations) when a given sentence has no noise (we refer to this condition as N). When the characters at the beginning of words are jumbled (condition BEG), participants have more difficulty (e.g. longer fixation times). The other two conditions, where words are internally jumbled (INT) or letters at word endings are jumbled (END), have a similar amount of effect, although the number of fixations between them shows a statistically significant difference. In short, the reading difficulty under the different jumble conditions is summarized as follows: N < INT ≤ END < BEG.

It may be surprising that there is a statistically significant difference between the END and BEG conditions, despite the difference being very subtle (i.e. fixing either the first or the last character). This result demonstrates the importance of beginning letters for human word recognition.[2]

[2] While there is still ongoing debate in the psycholinguistics community as to exactly how (little) the order of internal letters matters, here we follow the formulation of Rayner01032006 (Rayner01032006), considering only the letter order distinctions of BEG, INT, and END.

Semi-Character Recurrent Neural Network

In order to achieve the human-like robust word processing mechanism, we propose a semi-character based recurrent neural network (scRNN). The model takes a semi-character vector ($x_n$) for a given jumbled word and predicts a (correctly spelled) word ($y_n$) at each time step. The structure of scRNN is based on a standard recurrent neural network, where the current input ($x_n$) and previous information are connected through hidden states ($h_n$) by applying a certain (e.g. sigmoid) function ($g$) with linear transformation parameters ($W$, $U$) and a bias ($b$) at each time step ($n$):

$h_n = g(W x_n + U h_{n-1} + b)$

One critical issue with vanilla recurrent neural networks is that they are unable to learn long-distance dependencies in the inputs due to the vanishing gradient [bengio1994learning]. To address this problem, hochreiter1997long (hochreiter1997long) introduced long short-term memory (LSTM), which is able to learn long-term dependencies by adding a memory cell ($c_n$). The memory cell has the ability to discard or keep previous information in its state. Technically, the LSTM architecture is given by the following equations,

$i_n = \sigma(W_i x_n + U_i h_{n-1} + b_i)$   (1)
$f_n = \sigma(W_f x_n + U_f h_{n-1} + b_f)$   (2)
$o_n = \sigma(W_o x_n + U_o h_{n-1} + b_o)$   (3)
$g_n = \tanh(W_g x_n + U_g h_{n-1} + b_g)$   (4)
$c_n = f_n \odot c_{n-1} + i_n \odot g_n$   (5)
$h_n = o_n \odot \tanh(c_n)$   (6)

where $\sigma$ is the (element-wise) sigmoid function and $\odot$ is the element-wise multiplication.
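
As a concrete illustration, Eqs. (1)-(6) can be written directly in NumPy. This is a minimal sketch, not the authors' implementation; the parameter shapes and random initialization below are assumptions for the toy usage only.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_n, h_prev, c_prev, p):
        """One LSTM step implementing Eqs. (1)-(6); p holds W_*, U_*, b_* per gate."""
        i_n = sigmoid(p["W_i"] @ x_n + p["U_i"] @ h_prev + p["b_i"])   # (1) input gate
        f_n = sigmoid(p["W_f"] @ x_n + p["U_f"] @ h_prev + p["b_f"])   # (2) forget gate
        o_n = sigmoid(p["W_o"] @ x_n + p["U_o"] @ h_prev + p["b_o"])   # (3) output gate
        g_n = np.tanh(p["W_g"] @ x_n + p["U_g"] @ h_prev + p["b_g"])   # (4) candidate
        c_n = f_n * c_prev + i_n * g_n      # (5) memory cell keeps or discards information
        h_n = o_n * np.tanh(c_n)            # (6) hidden state
        return h_n, c_n

    # toy usage with assumed sizes (hidden = 4, input = 6)
    rng = np.random.default_rng(0)
    p = {}
    for gate in ("i", "f", "o", "g"):
        p["W_" + gate] = rng.standard_normal((4, 6))
        p["U_" + gate] = rng.standard_normal((4, 4))
        p["b_" + gate] = np.zeros(4)
    h, c = lstm_step(rng.standard_normal(6), np.zeros(4), np.zeros(4), p)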

While a standard input vector for an RNN derives from either a word or a character, the input vector in scRNN consists of three sub-vectors ($b_n$, $i_n$, $e_n$) that correspond to the characters' positions:

$x_n = [b_n; i_n; e_n]$   (7)

The first and third sub-vectors ($b_n$, $e_n$) represent the first and last character of the $n$-th word. These two sub-vectors are therefore one-hot representations. The second sub-vector ($i_n$) represents a bag of characters of the word without the initial and final positions. For example, the word “University” is represented as $b_n = \{U = 1\}$, $i_n = \{e = 1, i = 2, n = 1, r = 1, s = 1, t = 1, v = 1\}$, and $e_n = \{y = 1\}$, with all the other elements being zero. The size of each sub-vector ($b_n$, $i_n$, $e_n$) is equal to the number of characters ($N$) in our language, and $x_n$ therefore has size $3N$ by concatenating the sub-vectors.
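
For concreteness, the following is a minimal sketch of building $x_n$ (our own illustration; the 26-letter lower-case alphabet and the handling of very short words are assumptions):

    import numpy as np
    import string

    ALPHABET = string.ascii_lowercase            # assumed character set, so N = 26
    CHAR_INDEX = {c: k for k, c in enumerate(ALPHABET)}
    N = len(ALPHABET)

    def semi_character_vector(word):
        """Return x_n = [b_n; i_n; e_n] (Eq. 7) for an alphabetic word."""
        word = word.lower()
        b, i, e = np.zeros(N), np.zeros(N), np.zeros(N)
        b[CHAR_INDEX[word[0]]] = 1.0             # one-hot first character
        e[CHAR_INDEX[word[-1]]] = 1.0            # one-hot last character
        for c in word[1:-1]:                     # bag of internal characters
            i[CHAR_INDEX[c]] += 1.0
        return np.concatenate([b, i, e])         # size 3N

    # Jumbling the internal letters leaves the representation unchanged:
    # np.array_equal(semi_character_vector("Uinervtisy"),
    #                semi_character_vector("University"))  -> True

Because $i_n$ is a bag of characters, any permutation of the internal letters maps to the same $x_n$, which is precisely what makes the representation insensitive to internal jumbling.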

Regarding the final output (i.e. the predicted word $y_n$), the hidden state vector ($h_n$) of the LSTM is taken as input to the following softmax layer with a fixed vocabulary size ($v$):

$y_n = \mathrm{softmax}(W_s h_n)$   (8)

We use the cross-entropy training criterion applied to the output layer, as in most LSTM language modeling work; the model learns the weight matrices ($W_\ast$, $U_\ast$, $W_s$) so as to maximize the likelihood of the training data. This should approximately correlate with maximizing the number of exact word matches in the predicted outputs. Figure 1 shows a pictorial overview of scRNN.
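
As an illustrative sketch of the overall architecture in PyTorch (our own reconstruction, not the authors' released code; the hidden size and the 10k vocabulary size are assumptions based on the setup described in the experiments section):

    import torch
    import torch.nn as nn

    class SemiCharRNN(nn.Module):
        """Semi-character RNN: x_n (size 3N) -> LSTM -> softmax over v words (Eq. 8)."""
        def __init__(self, n_chars=26, hidden_size=650, vocab_size=10000):
            super().__init__()
            self.lstm = nn.LSTM(input_size=3 * n_chars, hidden_size=hidden_size,
                                batch_first=True)
            self.out = nn.Linear(hidden_size, vocab_size)   # plays the role of W_s

        def forward(self, x):
            # x: (batch, sentence_length, 3 * n_chars) semi-character vectors
            h, _ = self.lstm(x)
            return self.out(h)   # unnormalized scores; softmax is folded into the loss

    model = SemiCharRNN()
    criterion = nn.CrossEntropyLoss()   # cross-entropy training criterion
    # logits = model(x)                                  # x: semi-character inputs
    # loss = criterion(logits.reshape(-1, 10000), targets.reshape(-1))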

In order to check whether scRNN can recognize jumbled words correctly, we test it in spelling correction experiments. If the hypothesis about the robust word processing mechanism is correct, scRNN should also be able to read sentences with jumbled words robustly.

Character-based Neural Network

Another possible approach to reading jumbled words with neural networks is a (pure) character-level neural network [sutskever2011generating], where both the input and output are characters instead of words. Character-based neural networks have been investigated and used for a variety of NLP tasks such as segmentation [DBLP:journals/corr/Chrupala13], dependency parsing [ballesteros-dyer-smith:2015:EMNLP], machine translation [DBLP:journals/corr/LingTDB15], and text normalization [chrupala:2014:P14-2].

For spelling correction, schmaltz-EtAl:2016:BEA11 (schmaltz-EtAl:2016:BEA11) use the character-level convolutional neural network (CharCNN) proposed by DBLP:journals/corr/KimJSR15 (DBLP:journals/corr/KimJSR15), in which the input is characters but the prediction is at the word level. More technically, according to DBLP:journals/corr/KimJSR15 (DBLP:journals/corr/KimJSR15), CharCNN concatenates the character embedding vectors into a matrix $C^k$ whose $i$-th column corresponds to the $i$-th character embedding vector (of size $d$) of the $k$-th word, which contains $l$ characters. A narrow convolution is applied between $C^k$ and a filter $H$ of width $w$, and then the feature map $f^k$ is obtained by the following transformation[3] with a bias $b$:

$f^k[i] = \tanh(\langle C^k[:, i:i+w-1], H \rangle + b)$   (9)

This is interpreted as capturing the most important feature produced by filter $H$, taken as the word representation via the max-over-time:

$y^k = \max_i f^k[i]$   (10)

[3] In the equations, $C^k[:, i:i+w-1]$ means the $i$-to-$(i+w-1)$-th columns of $C^k$.
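
A small NumPy sketch of Eqs. (9) and (10) for a single word and a single filter (the dimensions and random values below are purely illustrative):

    import numpy as np

    d, l, w = 4, 7, 3                      # embedding size, word length, filter width
    rng = np.random.default_rng(0)
    C = rng.standard_normal((d, l))        # C^k: character embeddings of the k-th word
    H = rng.standard_normal((d, w))        # filter H
    b = 0.1                                # bias

    # Eq. (9): narrow convolution -- one feature per window of w columns
    f = np.array([np.tanh(np.sum(C[:, i:i + w] * H) + b)
                  for i in range(l - w + 1)])

    # Eq. (10): max-over-time pooling keeps the strongest response for this filter
    y = f.max()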

Although CharCNN and scRNN are similar in that both use a recurrent neural network to make word-level predictions from character-level input, CharCNN is able to store a richer representation than scRNN. In the following section, we compare the performance of CharCNN and scRNN on the jumbled word recognition task.

Experiments

We conducted spelling correction experiments to judge how well scRNN can recognize sentences containing noisy words. In order to make the task more realistic, we tested three different noise types: jumble, delete, and insert, where jumble permutes the internal characters (e.g. Cambridge → Cmbarigde), delete randomly deletes one of the internal characters (Cambridge → Camridge), and insert randomly inserts a letter at an internal position (Cambridge → Cambpridge). None of the noise types change the first and last characters. We used the Penn Treebank for training, tuning, and testing.[4] The data include 39,832 sentences in the training set (898k/950k tokens are covered by the top 10k vocabulary), 1,700 sentences in the tuning set (coverage 38k/40k), and 2,416 sentences in the test set (coverage 54k/56k).

[4] Sections 2-21 for training, 22 for tuning, and 23 for testing; https://catalog.ldc.upenn.edu/ldc99t42.
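
The three noise types can be generated along the following lines (a sketch under our own assumptions, e.g. that words too short to have enough internal characters are left untouched):

    import random

    def jumble(word):
        """Permute the internal characters; the first and last stay in place."""
        if len(word) <= 3:
            return word
        internal = list(word[1:-1])
        random.shuffle(internal)
        return word[0] + "".join(internal) + word[-1]

    def delete(word):
        """Delete one randomly chosen internal character."""
        if len(word) <= 3:
            return word
        k = random.randrange(1, len(word) - 1)
        return word[:k] + word[k + 1:]

    def insert(word):
        """Insert a random letter at an internal position."""
        if len(word) <= 2:
            return word
        k = random.randrange(1, len(word))
        return word[:k] + random.choice("abcdefghijklmnopqrstuvwxyz") + word[k:]

    # e.g. jumble("Cambridge") may yield "Cmbarigde"; delete -> "Camridge"; insert -> "Cambpridge"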