Progress in natural language processing has recently come from deriving sentence representations using recurrent neural networks (RNNs) (Elman, 1990; Sutskever et al., 2014; Goldberg, 2017). Yet while these networks have had great success, the nature of the representations they learn is unclear, which poses problems for interpretability, accountability, and controllability of NLP systems. More specifically, the success of RNNs has raised the question of whether and how these networks learn robust syntactic generalizations about natural language, which would enable robust performance even on data that differs from the peculiarities of the training set.
Here we build upon recent work studying RNN language models with the same techniques used to study language processing in the human mind: by examining their behavior on targeted sentences chosen to probe particular aspects of the learned representations. Linzen et al. (2016), followed more recently by others (Bernardy and Lappin, 2017; Enguehard et al., 2017; Gulordava et al., 2018), use an agreement prediction task (Bock and Miller, 1991) to study whether RNNs learn a hierarchical morphosyntactic dependency: for example, that The key to the cabinets… can grammatically continue with was but not with were. This dependency turns out to be learnable at human performance from a language modeling objective alone (Gulordava et al., 2018). In the present work we extend this general approach to a wider range of grammatical phenomena.
We draw on the rich literature in human sentence processing to subject RNNs to much the same scrutiny as a human experimental subject in a psycholinguistic study might undergo: what linguistic knowledge does the subject’s incremental processing behavior reflect? We focus on two central types of knowledge that may be evident in processing: syntactic state, a representation of syntactic events that may have occurred, that are currently unfolding, and that are yet to come; and grammatical dependency, the set of conditions characterizing syntactically mediated relations among elements in a sentence. We investigate syntactic state by studying RNNs’ behavior under garden-path disambiguation, multiple center-embedding, and the maintenance of obligatory upcoming syntactic events. For grammatical dependency, we extend the above recent studies of verb agreement by investigating the binding of reflexive pronouns and the licensing of negative polarity items. Beyond helping characterize the representations of contemporary RNNs, this work may bear on classic learnability questions in acquisition: what grammatical knowledge can be learned from a childhood’s or a lifetime’s worth of string input by a flexible sequence model without a strong domain-specific inductive bias?
2 General methods
We investigate RNN behavior primarily by studying the surprisal, or log inverse probability, that an RNN assigns to each word in a sentence:

$$S(x_i) = -\log_2 p(x_i \mid h_{i-1}),$$

where $x_i$ is the current word or character, $h_{i-1}$ is the RNN’s hidden state before consuming $x_i$, the probability is calculated from the RNN’s softmax activation, and the logarithm is taken in base 2, so that surprisal is measured in bits.
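As a minimal sketch of the definition above, surprisal can be computed from the output scores a model produces at each step. The function names, the toy logits, and the two-step example are illustrative assumptions and are not taken from the models studied here.

```python
import math

def surprisal_bits(prob):
    """Surprisal in bits of an outcome with probability `prob` (0 < prob <= 1)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)

def softmax(logits):
    """Numerically stable softmax over a list of raw output scores."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def word_surprisals(step_logits, word_ids):
    """Per-word surprisal for a sentence.

    step_logits[i] holds the model's output scores computed before
    consuming word i; word_ids[i] is the vocabulary index of word i.
    """
    return [
        surprisal_bits(softmax(logits)[word])
        for logits, word in zip(step_logits, word_ids)
    ]

# Hypothetical toy input: uniform scores over 2 and then 4 vocabulary entries.
toy_logits = [[0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
toy_words = [0, 3]
print(word_surprisals(toy_logits, toy_words))  # [1.0, 2.0]
```

A word with probability 1/2 carries 1 bit of surprisal and a word with probability 1/4 carries 2 bits, which is what the toy input yields.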
A common practice in psycholinguistics is to study a measure of reaction time per word (for example reading time as measured by an eyetracker), as a measure of the word-by-word difficulty of online language processing. These reading times are often taken to reflect the extent to which humans expect certain words in context, and may be generally proportional to surprisal given the comprehender’s probabilistic language model (Hale, 2001; Levy, 2008; Smith and Levy, 2013). In this study, we take RNN surprisal as the analogue of human reading time, using it to probe the RNNs’ expectations about what words will follow in certain contexts. While we are not interested in directly modeling human processing difficulty here, we note that there is a long tradition linking RNN performance to human language processing (Elman, 1990; Christiansen and Chater, 1999; MacDonald and Christiansen, 2002; Frank and Bod, 2011).
2.1 Experimental methodology
In each experiment presented below, we design a set of sentences such that the word-by-word surprisal values will show evidence for syntactic representations.[1] We analyze by-word surprisal profiles for these sentences using regression analysis.
[1] Our experiments and analyses were preregistered on aspredicted.org: blind preregistration codes 5vr6ze, f8yd86, vh82i7, pt3x3i, yt6pi4.
Except where otherwise noted, all statistics are derived from linear mixed-effects models (Baayen et al., 2008) with sum-coded fixed-effects predictors and by-item random intercepts, where the dependent variable is the summed surprisal across words within the region in question. Random slopes and interactions are not necessary in these models to avoid anti-conservativity (Barr et al., 2013) because we do not have repeated observations within any item/condition combination. This method allows us to factor out by-item variation in surprisal and focus on the contrasts between conditions.
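To make the dependent variable and the sum coding concrete, the sketch below computes region-summed surprisal and the sum-coded interaction term of a 2×2 design for each item. It does not fit a mixed-effects model; it only illustrates, on hypothetical numbers, the contrast that such a model estimates once by-item variation is factored out. All names and values are assumptions made for illustration.

```python
from statistics import mean

# Sum coding for a two-level factor: the two levels are coded +1 and -1.
SUM_CODE = {True: 1, False: -1}

def region_surprisal(word_surprisals, start, end):
    """Summed surprisal (bits) over the words of a region, given as [start, end)."""
    return sum(word_surprisals[start:end])

def interaction_contrast(cells):
    """Sum-coded 2x2 interaction for one item.

    `cells` maps (factor_a, factor_b) booleans to region surprisal.
    The result is the coefficient on the product of the two +1/-1 codes
    in a saturated model, i.e. one quarter of the difference of differences.
    An item's overall surprisal level cancels out of this quantity.
    """
    return sum(SUM_CODE[a] * SUM_CODE[b] * y for (a, b), y in cells.items()) / 4

# Hypothetical region surprisals for two items, keyed by (reduced, ambiguous).
items = [
    {(True, True): 14.0, (True, False): 9.0, (False, True): 6.0, (False, False): 5.0},
    {(True, True): 20.0, (True, False): 13.0, (False, True): 11.0, (False, False): 10.0},
]
per_item = [interaction_contrast(cells) for cells in items]
print(per_item, mean(per_item))  # [1.0, 1.5] 1.25
```

The second toy item has uniformly higher surprisal than the first, yet the contrast depends only on the pattern across its four conditions.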
2.2 LSTMs tested
We study the behavior of two LSTMs for English. The first is the model presented in Jozefowicz et al. (2016) as “BIG LSTM+CNN Inputs”, which we call “JRNN”; it was trained on the One Billion Word Benchmark (Chelba et al., 2013) with two hidden layers of 8196 units each and CNN character embeddings as input. The second is the model described in the supplementary materials of Gulordava et al. (2018), which we call “GRNN”, trained on 90 million tokens of English Wikipedia with two hidden layers of 650 hidden units each. In Section 6.2, we also study the behavior of an LSTM for Japanese (JPRNN), a character-based model (cf. Kim et al., 2016)[2] with 650 hidden units, trained on 900,000 paragraphs (800,000 for training and 100,000 for validation) of Japanese Wikipedia. After 100 epochs, we obtained the best validation perplexity, at 12.67. All LSTMs are trained on a pure language modeling objective.
[2] Our model does not have a convolutional layer, but rather only an embedding layer, due to the considerable size of the vocabulary.
Our goal in examining these models is not to draw contrasts between them, since they are very similar in their architecture and performance (in terms of perplexity); rather our goal is to provide results from samples of state-of-the-art models. Future work may examine how our findings differ across model architectures and draw causal connections between model architectures and the ability to represent complex syntax.
3 Garden path effects
One of the major questions in human sentence processing has been how people represent the incremental parse of a sentence during online language comprehension. The major phenomenon that has been used to probe these representations is garden path effects. Garden path effects arise from local ambiguities, where a context leads a person to believe one parse is likely, but then a disambiguating word forces her to drastically revise her beliefs. In effect, the comprehender is “led down the garden path” by a locally likely but ultimately incorrect parse (Bever, 1970).
In psycholinguistics, garden path effects have been studied in order to answer questions like: do humans represent multiple possible parses in parallel, ranked by probability, or do they only represent the single most likely parse? What information affects the parse tree a person will consider most likely given a locally ambiguous context (MacDonald et al., 1994)? For RNNs, these methods can be used to answer the questions above, and also: what properties of the input lead to syntactic representations that are more or less accurate?
Garden-pathing in RNNs has very recently been demonstrated by van Schijndel and Linzen (2018), albeit over only a short (two-word) stretch of ambiguity-maintaining material. Here we investigate a garden path previously unstudied in RNNs, induced by the classic Main Verb/Reduced Relative (MV/RR) ambiguity, in which a word is locally ambiguous between being the main verb of a sentence or introducing a reduced relative clause, and that ambiguity is maintained over a longer stretch of material:
(a) The woman brought the sandwich from the kitchen tripped on the carpet. [reduced, ambiguous]
(b) The woman given the sandwich from the kitchen tripped on the carpet. [reduced, unambiguous]
(c) The woman who was brought the sandwich from the kitchen tripped on the carpet. [unreduced, ambiguous]
(d) The woman who was given the sandwich from the kitchen tripped on the carpet. [unreduced, unambiguous]
In example (a) above, the phrase “brought the sandwich from the kitchen” is initially analyzed as a main verb phrase, but upon reaching the verb “tripped” (the disambiguator), the reader must re-analyze it as a relative clause. In these examples, there are two possible cues that the first verb is introducing a relative clause: either (i) it is preceded by “who was” (the RC is unreduced), or (ii) it is unambiguously a past participle, such as “given”, rather than ambiguous between a past participle and a past tense verb, such as “brought”. The garden path effect should only arise in the critical condition where the relative clause is reduced and the verb is ambiguous.
For our purposes, the key dependent variable for MV/RR sentences is the surprisal at the disambiguator. If the surprisal at the disambiguator is higher in the critical condition than in the other conditions, this indicates that the network had a preferred syntactic analysis for the previous material which did not lead it to expect the disambiguator.
The surprisal in the region following the disambiguator is also interesting. If it is the same across conditions, that indicates that the network successfully revised its syntactic analysis. If it is high in the critical condition, then the network did not recover from the garden path event: that is, either the disambiguator did not provide sufficient information to cause the network to revise its syntactic analysis, or it caused the network to enter a confused state from which only poor predictions can be made.
We present results from two manipulations of the MV/RR ambiguity. First, we present an experiment manipulating verb ambiguity, as in examples (a)–(d) above. Second, we present an experiment where the subject NP provides cues about whether a following ambiguous verb should be interpreted as a main verb or a reduced RC.
3.1 Verbform ambiguity and RC reduction
Figure 1 shows surprisals for each RNN in each sentence region for 29 items we constructed following the template of examples (a)–(d) above. Surprisals are summed over words in the region, giving the total unexpectedness of the phrase, and then averaged across items. At the critical main-clause verb, surprisal is superadditively highest in the reduced ambiguous condition (the dotted blue line; a positive interaction between the reduced and ambiguous conditions is significant in both models), the key predicted human-like garden-path disambiguation effect. Within the reduced conditions (represented by blue lines), surprisal is lower when the participial verbform is unambiguous than when it is ambiguous (in both JRNN and GRNN), demonstrating that the models have learned the distinctive syntactic behavior of participial verbs. But strikingly, even when the participial verbform is unambiguous, surprisal is still higher when the RC is reduced than when it is unreduced (compare the red and blue solid lines; in both models), suggesting that the models have not fully representationally separated participial verbs from finite verbs. Apparently, the network treats an unambiguous participial verb as only a noisy cue to the presence of an RC.
Post-disambiguation, surprisals are higher in the unreduced conditions than in the reduced conditions (in both models), suggesting that the models may not fully recover to a clean syntactic state following garden-path disambiguation.
3.2 Subject animacy
Syntactic garden-pathing in humans has been demonstrated to be sensitive to fine-grained lexical and semantic cues, such as the animacy of the NP subject in the case of MV/RR garden-pathing (Trueswell et al., 1994; though see Ferreira and Clifton, 1986, and Clifton Jr. et al., 2003, for controversy regarding time-course). Are RNNs similarly sensitive? We examined this question with 30 items on the Trueswell et al. template below:
(a) The witness examined by the lawyer turned out to be unreliable. [reduced, animate]
(b) The evidence examined by the lawyer turned out to be unreliable. [reduced, inanimate]
(c) The witness that was examined by the lawyer turned out to be unreliable. [unreduced, animate]
(d) The evidence that was examined by the lawyer turned out to be unreliable. [unreduced, inanimate]
If RNNs have human-like sensitivity to the fine-grained covariance of syntactic structure and lexico-semantic information, then surprisal should be superadditively higher in the reduced/animate condition (a). The strongest evidence for such a contingency would be if this effect shows up at the RC-internal by-phrase, which has poor compatibility with an active-voice analysis of the preceding verbform.
Figure 2 shows surprisals by region, condition, and RNN, averaged across items. At the by-phrase there is a large main effect of RC reduction (compare the blue vs. red lines; both models), and a small interaction between reduction and animacy in the predicted direction: with animate subject nouns, surprisal in this region is higher when the RC is reduced, but not when the RC is unreduced. This interaction is significant in GRNN but not in JRNN. At the main verb, GRNN shows once again a large main effect of RC reduction, indicating that a main verb is still somewhat surprising following a reduced relative; JRNN shows no such effect. The final region has high total surprisal because it is several words long. It shows a small but significant effect of RC reduction in JRNN, and no other significant contrasts between conditions.
Our result shows that GRNN can exploit fine-grained information about the covariance of lexical forms and syntactic structure in order to infer syntactic state. However, the presence of residual “spillover” effects at the main verb suggests that the network has not encountered sufficient evidence at the main verb to close the relative clause.
4 Obligatory upcoming syntactic events
Garden-path disambiguation effects are diagnostic of cases where a syntactic state is weighted strongly against a syntactic event at a particular moment in incremental processing. Other grammatical contexts create an obligation for a certain type of syntactic event in the future. In an incremental processing system such as a human or an RNN, this obligation must be maintained for an indefinitely long time. It is trivial to do so in rule-based processing architectures with a stack, and sentence processing research clearly demonstrates that humans maintain such expectations in syntactic processing (Staub and Clifton, 2006; Lau et al., 2006; Levy et al., 2012), but it remains unclear whether RNNs learn to use their memory this way in natural language sequence prediction.
We tested two simple grammatical configurations inducing an obligatory upcoming syntactic event: object-extracted relative clauses and subordination.
4.1 Relative clause completions
A prefix such as the one in (a) below signals the onset of an object-extracted relative clause (ORC). A grammatical continuation of the prefix must include two verb phrases: one to finish the relative clause, and one to finish the main clause. Similarly, example (b) signals the onset of two nested ORCs: a grammatical continuation must contain three verb phrases. Humans can reliably generate grammatical continuations for prefixes such as (a), but struggle with prefixes such as (b) (Yngve, 1960; Miller and Chomsky, 1963; Gibson and Thomas, 1999; Vasishth et al., 2010; Frank et al., 2016; Futrell and Levy, 2017; sample grammatical completions follow the ellipsis):
(a) The author who the editor…disliked sent in the manuscript.
(b) The manuscript that the author who the editor…disliked sent in was of low quality.
We tested the LSTMs’ ability to represent the requirement for two verb phrases in the first case and three verb phrases in the second case, using 20 prefixes on the template above, based on materials from Gibson and Thomas (1999). We sampled 9 completions per prefix per condition per LSTM by recurrently sampling from the softmax distribution of following words up until the generation of the end-of-sequence symbol. We (the authors) then judged grammaticality of completions by hand.[3] We judged grammaticality solely based on whether the network generated the right number of syntactically appropriate verb phrases, where a verb phrase was counted as appropriate if it matched its subject in number. Grammatical errors in irrelevant parts of the continuations were ignored. We ignored continuations where a judgment was impossible due to generation of an <UNK> token. Statistical analysis for this study was performed using a mixed logit model with fixed and by-item random effects of embedding depth, LSTM, and their interaction, to account for repeated measures.
[3] We carried out these judgments ourselves because they require some linguistic sophistication to identify the relevant verb phrases.
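The sampling procedure described above can be sketched as follows. The `toy_model` function is a hypothetical stand-in for an LSTM’s softmax output, and the `max_len` cap is a safety assumption that is not part of the procedure described in the text.

```python
import random

EOS = "<eos>"

def sample_completion(next_word_probs, prefix, rng, max_len=50):
    """Sample a completion word by word until end-of-sequence is drawn.

    next_word_probs(words) must return a dict mapping each candidate next
    word to its probability given the words so far. Each sampled word is
    appended to the context before the next draw. max_len is a safety cap.
    """
    words = list(prefix)
    completion = []
    for _ in range(max_len):
        dist = next_word_probs(words)
        vocab = list(dist)
        word = rng.choices(vocab, weights=[dist[w] for w in vocab], k=1)[0]
        if word == EOS:
            break
        completion.append(word)
        words.append(word)
    return completion

def toy_model(words):
    """Hypothetical next-word distribution, standing in for a trained LSTM."""
    last = words[-1]
    if last == "editor":
        return {"disliked": 0.7, "hired": 0.3}
    if last in ("disliked", "hired"):
        return {"left": 0.5, "resigned": 0.5}
    return {EOS: 1.0}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
prefix = "The author who the editor".split()
samples = [sample_completion(toy_model, prefix, rng) for _ in range(9)]
for completion in samples:
    print(" ".join(prefix + completion))
```

Each of the nine sampled completions would then be judged by hand for whether it contains the required number of verb phrases.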
Figure 3 shows the proportions of completions judged grammatical. Both JRNN and GRNN generate a high proportion of grammatical continuations for one ORC, with JRNN outperforming GRNN. Neither network can reliably generate the required three verb phrases for a prefix with two nested ORCs, but GRNN suffers less than JRNN from the additional embedding, a significant interaction.
Relative clause completions seem to be a case where limitations in RNN performance mirror limitations in human performance. However, the networks have lower accuracy than human subjects across the board. Mechanical Turk subjects can complete single-ORC prefixes of this form grammatically with near 100% accuracy and double-ORC prefixes with around 40–60% accuracy (unpublished data: Gibson, p.c.).
4.2 Subordination
If an English sentence begins with a subordinate clause, the expectation for the onset of the matrix clause must be maintained for however long the subordinate clause lasts, as in (a) below. Ending the sentence without a matrix clause, as in (b), is surprising to humans and should be surprising to a human-like language model. The strength of this syntactic obligation in a language model can be quantified by the size of the interaction effect between subordinator presence/absence and matrix-clause presence/absence on the joint surprisal of all post-subordinate clause material.[4]
[4] It is necessary to look at this interaction rather than simply comparing a single pair of conditions because we need to control for the surprisal of the following matrix clause. The logic of this interaction is similar to the “licensing interaction” used to study filler–gap dependencies in Wilcox et al. (2018).
(a) As the doctor studied the textbook, the nurse walked into the office.
(b) *As the doctor studied the textbook.
(c) ?The doctor studied the textbook, the nurse walked into the office.
(d) The doctor studied the textbook.
We designed 23 items on the pattern above; Figure 4 (left and center panels) shows results from both LSTMs in terms of the difference in surprisal for matrix-clause and non-matrix-clause continuations depending on whether a subordinator is present or not. A positive effect in this figure indicates that a subordinator makes the continuation more surprising; a negative effect means the subordinator makes the continuation less surprising. We see a strong facilitative effect of the subordinator on matrix-clause continuations, and a somewhat weaker penalty on non-matrix-clause continuations in both models. As predicted, a negative interaction of matrix clause presence and subordinator presence is significant in both models, and it is numerically much larger for GRNN.
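The sign conventions in the paragraph above can be made explicit with a small sketch. The surprisal values are hypothetical, and the function names are illustrative assumptions rather than part of the analysis code used in the study.

```python
def subordinator_effect(s_with, s_without):
    """Surprisal change (bits) due to the subordinator.

    Positive means the subordinator makes the continuation more surprising;
    negative means it makes the continuation less surprising.
    """
    return s_with - s_without

def licensing_interaction(s):
    """Effects and interaction for one item.

    `s` maps (subordinator_present, matrix_present) to the summed surprisal
    of the material after the subordinate clause.
    """
    matrix_effect = subordinator_effect(s[(True, True)], s[(False, True)])
    no_matrix_effect = subordinator_effect(s[(True, False)], s[(False, False)])
    return matrix_effect, no_matrix_effect, matrix_effect - no_matrix_effect

# Hypothetical surprisals (bits) for a single item.
toy = {
    (True, True): 30.0,    # subordinator, matrix clause follows
    (False, True): 36.0,   # no subordinator, matrix clause follows
    (True, False): 9.0,    # subordinator, sentence ends
    (False, False): 5.0,   # no subordinator, sentence ends
}
print(licensing_interaction(toy))  # (-6.0, 4.0, -10.0)
```

In this toy item the subordinator facilitates the matrix clause (negative effect) and penalizes the premature ending (positive effect), so the interaction is negative.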
We included a further manipulation in this study, optionally modifying each NP in the subordinate clause with a prepositional phrase or subject- or object-extracted relative clause, on the hypothesis that lengthening and increasing syntactic complexity of the subordinate clause might weaken the expectation for an upcoming matrix clause.
Results can be seen in Figure 4, which shows the size of the interaction between presence of a subordinator and presence of a matrix clause (that is, the difference between the two bars for each model in the left panel of Figure 4). A positive interaction corresponds to a licensing relationship where the subordinator makes the matrix clause more likely and a premature ending less likely. GRNN exhibits a strong interaction when the intervening material is short and syntactically simple (Figure 4, left bottom), and the interaction gets progressively weaker as the intervening material becomes longer and more complex (significant for subject postmodifiers but not for object postmodifiers).
JRNN has more complex behavior: object interveners actually make the matrix clause more likely. Overall, in a linear regression, subject interveners decrease the size of the licensing interaction and object interveners increase it.
Taken together, the results of this study indicate that both LSTMs derive and maintain in memory an expectation for an upcoming matrix clause from a sentence-initial subordinator; that this expectation decays in the presence of complex intervening material; and that this expectation is stronger and exhibits more clearly understandable behavior in GRNN, even though that model is smaller and trained on less data.
5 Reflexive pronoun binding and c-command
Having investigated the representation and maintenance of syntactic state in both RNNs, we move on to examining their representation of grammatical dependencies that are defined with respect to syntactic state in human grammatical competence. Linzen et al. (2016) and Gulordava et al. (2018) provide one such case, that of subject–verb agreement. Here we extend the scope of these studies to binding, which characterizes the syntactic restrictions on pronouns and their antecedents (Chomsky, 1981). In English, a reflexive pronoun is subject to two constraints: (1) it must agree in gender and number with its antecedent, and (2) approximately, its antecedent must be the syntactically most local NP that c-commands it (Reinhart, 1981; see Pollard and Sag, 1994 for a closely related characterization in a lexicalist syntactic framework). Provided that an RNN learns NP stereotypical gender, a reasonable prospect given the results of Caliskan et al. (2017) and Rudinger et al. (2018), we can use reflexive pronoun surprisal to assess whether the model also learns the structure of the grammatical dependency that must characterize the relationship between a reflexive and its antecedent.
We chose 30 nouns referring to professions likely to have strong stereotypic gender, based on government statistical data (Bureau of Labor Statistics, 2017). To assess whether each model learned this stereotypic gender, we constructed an item for each noun involving a simple transitive clause with a reflexive pronoun object of each gender:
(a) The hairdresser washed herself. [match]
(b) The hairdresser washed himself. [mismatch]
(c) The lumberjack cut himself. [match]
(d) The lumberjack cut herself. [mismatch]
For JRNN, we find higher surprisal at reflexive pronouns mismatching the antecedent’s stereotypical gender than at pronouns matching the antecedent’s stereotypical gender (Figure 5, left panel). We did not find a reliable effect of stereotypical gender for GRNN (not depicted), so we do not examine it further in this section.
Next, we tested whether JRNN’s probabilistic dependency between preceding-NP stereotypical gender and reflexive pronoun gender reflects a humanlike grammatical binding domain. For each item we introduced a second profession noun, either matching or mismatching reflexive pronoun gender, in a position that linearly intervenes but is outside the reflexive’s binding domain:
(a) The lumberjack who is related to the soldier cut himself.
(b) The lumberjack who is related to the hairdresser cut himself.
(c) The lumberjack who is related to the soldier cut herself.
(d) The lumberjack who is related to the hairdresser cut herself.
Experimental evidence suggests that humans do not consider antecedents for reflexives outside the binding domain (Sturt, 2003; Xiang et al., 2009; Dillon et al., 2013). Likewise, if the RNN properly represents binding domains for reflexives, then only subject noun stereotypical gender, and not intervener stereotypical gender, should affect surprisal at the reflexive pronoun.
Results are in Figure 5 (right panel). Among conditions where intervener gender mismatches that of the reflexive pronoun (blue bars), surprisal is lower when the true antecedent matches reflexive gender. However, there is no evidence that JRNN has learned the proper binding domain for reflexives: among conditions where true antecedent gender mismatches that of the reflexive pronoun (bars on the right), surprisal is lower when the intervener matches reflexive gender, a facilitative effect just as large as that of matching gender for the true antecedent.
6 Negative Polarity Items
Finally, we turn to another type of grammatical dependency: negative polarity items (NPIs). These are items such as English “ever”, which must be licensed by having a semantically negative element in a structurally appropriate context, e.g.:
(a) No one has ever climbed that mountain.
(b) *Someone has ever climbed that mountain.
We examine NPIs in English and Japanese, which (i) differ in the relative order of NPI and licensor, and (ii) have subtly different licensing domains, both of which differ from the reflexive binding domain of Section 5. Although JRNN failed to learn reflexive binding domains, the different distributional characteristics of NPIs might well make them more learnable.
6.1 Negative Polarity Items in English
For present purposes, the licensing condition for English NPIs is effectively c-command by a semantically negative (downward-entailing) operator. We investigated English NPI licensing by testing surprisal at “any” and “ever” when the potential licensor “no” preceded, in an appropriate position to license the NPI, as in (c) and (d) below, and/or in a non-licensing position, as in (b) and (d):
(a) *The bill that the senator likes has ever found any support in the senate.
(b) *The bill that no senator likes has ever found any support in the senate.
(c) No bill that the senator likes has ever found any support in the senate.
(d) No bill that no senator likes has ever found any support in the senate.
Humans are measurably surprised when encountering an unlicensed NPI as in (a) or (b) above, but a non-licensing intervener as in (b) elicits a “semanticality illusion effect” manifesting as reduced incremental processing disruption (Vasishth et al., 2008). To test whether RNNs learn the grammatical licensing condition for English NPIs, we designed 26 examples following the pattern above in three variants: (i) both “ever” and “any” absent; (ii) “ever” present, “any” absent; (iii) “ever” absent, “any” present. We quantified NPI licensing effects by examining the surprisal of the NPI itself.
Figure 6 shows the NPI surprisal for each of the four conditions (a)–(d) for the word “ever”, and Figure 7 shows the same for “any”. The relatively high left-side bars for each condition indicate higher NPI surprisal in the absence of a grammatical licensor, as in examples (a) and (b). However, the relatively lower height of the red bars in both conditions indicates that surprisal is also reduced in the presence of a non-grammatical licensor in a relative clause, as in examples (b) and (d).
If a model has learned the appropriate licensing conditions for English NPIs, we would expect strong surprisal reduction from “no” in the licensing position and zero surprisal reduction from “no” in the distractor position. We do find significant surprisal reduction coming from a matrix-clause “no” (for both “ever” and “any”, in both JRNN and GRNN), but also significant surprisal reduction coming from the distractor-position “no” (in both models and for both NPIs), indicating that the models have learned a spurious licensing relationship between a negative word embedded in a relative clause and an NPI in a higher clause, or have perhaps learned simply that any negative word licenses an NPI at any linearly following position.
6.2 Negative Polarity Items in Japanese
As described above, NPIs require negative words, but negative words do not require NPIs. In sentences and languages where the negative licensors follow NPIs, the grammatical dependency changes to one of an obligatory upcoming event. Since such events were found to be well-represented in Section 4, sentences where negative items follow NPIs might more clearly show whether LSTMs correctly capture the grammatical dependency. Here we consider the Japanese NPI shika, ‘only’, which follows this pattern:
(a) ‘Only the bus came.’
(b) *bus-shika ki-ta. ‘Only the bus came.’
In more complex sentences with embedded clauses, shika must appear in the same clause as the negative verb (the Clausemate Condition; Muraki, 1978): the negation must be in the main clause when shika is not embedded (first pattern below) and in the embedded clause when shika is embedded (second pattern below).[5]
[5] Linguists have reported variable acceptability of the embedded-shika pattern (when the matrix verb is negative) depending on the grammatical role of NP-shika. Shika in the embedded subject position is reported to be more acceptable than the direct object (Aoyagi and Ishii, 1994; Tanaka, 1997) and the indirect object position is the worst (Kataoka, 2006). We did not find any experimental research on this issue.
(a) …shika…[…V-/neg…] V-neg.
(b) *…shika…[…V-/neg…] V-.

(a) …[…shika…V-neg…] V-/neg.
(b) *…[…shika…V-…] V-/neg.
We tested whether JPRNN is sensitive to these grammatical conditions by creating 83 single-clause items on the pattern of the single-clause example above and automatically generating 2218 items with embedded complement clauses on the two embedded-clause patterns above, varying also the case of the NP on which shika appears. If the model has learned the proper contingency between shika and verbal negation, then the case of NP should be irrelevant, but different cases may show different effect sizes in RNN language models because they appear with shika with varying frequency.
We assess how well a model has learned the shika licensing condition by assessing the difference in surprisal at each verb depending on whether shika is present in a particular position in context, or absent (similar to Figure 4 in our study of subordination). A licensing condition would manifest as shika reducing the surprisal of a negative verb (relative to that verb’s surprisal if shika is absent) more than the surprisal of an affirmative verb; an affirmative verb in a required licensing position should show an increase in surprisal when shika is present.
Figure 8 shows the difference in surprisal for each condition. Unembedded sentences (Figure 8a) show a licensing effect for all NP cases (blue bars below 0), though we fail to get a surprisal increase for affirmative verbs when the topic is shika-marked (the red bars are above 0 only for accusative and dative NPs). In complex sentences where shika is in the matrix clause, shika on the topic NP does not lead to interpretable behavior. (Footnote 6: There is no accusative condition in Figures 8b and 8d because there is no verb that naturally takes both accusative object and complement clause arguments in Japanese.) On the dative NP, shika inappropriately leads to an expectation for negation on the embedded-clause verb (in Figure 8b, the blue bar is below the red bar). Furthermore, when the embedded-clause verb is affirmative, the expectation for shika is spuriously passed on to the matrix-clause verb (as shown by the negative green bar for the dative case in Figure 8d). When the embedded-clause verb is negated, the expectation for a further negative verb is partially discharged (purple bar). Finally, in complex sentences where shika is in the embedded clause, the model generates a strong expectation for a negative embedded-clause verb (Figure 8c, blue bars below red bars), but inappropriately passes that expectation on to the matrix clause when the embedded-clause verb is affirmative (Figure 8e, green bars).
In sum, as was the case with English NPIs, our RNN clearly learns the requirement that shika imposes for a following negative verb, but it does not learn the appropriate grammatical dependency between the NPI and the licensor.
7 General Discussion and Conclusion
We have applied the methods of controlled psycholinguistic experimentation to assess the evidence in contemporary RNN models for incremental syntactic state and for the proper representation of a range of grammatical dependencies. This approach builds on previous work in a similar spirit (Linzen et al., 2016; Gulordava et al., 2018; Wilcox et al., 2018) and complements a variety of other approaches currently practiced (Shi et al., 2016; Belinkov et al., 2018; Blevins et al., 2018; Kádár et al., 2017; Williams et al., 2018; Ettinger et al., 2017; Lake and Baroni, 2017; Weber et al., 2018).
In both of the English RNNs we studied, we found clear evidence of incremental syntactic state representation, with important qualifications. The garden path results show that the models represent an incremental parse state inside relative clauses, and that they can partially exploit verb-form cues that indicate the onset of reduced relative clauses (Section 3.1). The results on relative clause completion and subordination show strong evidence of the maintenance of expectations for obligatory upcoming material, which decays as intervening material becomes longer or more complex.
The results on grammatical dependency show more room for improvement. None of the models tested learned the appropriate licensing conditions for reflexive pronoun binding or NPI licensing in English or Japanese.
We believe that the psycholinguistic methodology employed in this paper provides a valuable lens on the internal representations of systems that are currently widely seen as black boxes. We have found that proper syntactic representation can emerge, but does not necessarily generalize across constructions. Future work can examine how these properties vary as a function of network architecture and objective function structure, in the pursuit of human-like syntactic competence.
EGW would like to acknowledge support from the Mind Brain Behavior Graduate Student Grant, as well as Emmanuel Dupoux and the Cognitive Machine Learning Group at the ENS. RPL gratefully acknowledges support to his laboratory from Elemental Cognition and from the MIT-IBM Watson AI Lab. This work was supported by a GPU Grant from the NVIDIA corporation. All scripts, experimental materials, and results are available online at http://github.com/Futrell/rnn_psycholinguistic_subjects.
- Aoyagi and Ishii (1994) Hiroshi Aoyagi and Toru Ishii. 1994. On agreement-inducing vs. non-agreement-inducing NPIs. In Proceedings of the North East Linguistic Society 24, pages 1--15. GLSA Publications.
- Baayen et al. (2008) R.H. Baayen, D.J. Davidson, and D.M. Bates. 2008. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4):390--412.
- Barr et al. (2013) Dale J Barr, Roger Levy, Christoph Scheepers, and Harry J Tily. 2013. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3):255--278.
- Belinkov et al. (2018) Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2018. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. arXiv preprint arXiv:1801.07772.
- Bernardy and Lappin (2017) Jean-Philippe Bernardy and Shalom Lappin. 2017. Using deep neural networks to learn syntactic agreement. Linguistic Issues in Language Technology, 15:1--15.
- Bever (1970) Thomas G. Bever. 1970. The cognitive basis for linguistic structures. In J. R. Hayes, editor, Cognition and the Development of Language. Wiley, New York.
- Blevins et al. (2018) Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. arXiv preprint arXiv:1805.04218.
- Bock and Miller (1991) Kathryn Bock and Carol A Miller. 1991. Broken agreement. Cognitive Psychology, 23(1):45--93.
- Bureau of Labor Statistics (2017) Bureau of Labor Statistics. 2017. Labor force statistics from the current population survey.
- Caliskan et al. (2017) Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183--186.
- Chelba et al. (2013) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
- Chomsky (1981) N. Chomsky. 1981. Lectures on Government and Binding. Foris Publications, Dordrecht, The Netherlands.
- Christiansen and Chater (1999) Morten H. Christiansen and Nick Chater. 1999. Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23(2):157--205.
- Clifton Jr. et al. (2003) Charles Clifton Jr., Matthew J. Traxler, Mohamed Taha Mohamed, Rihana S. Williams, Robin K. Morris, and Keith Rayner. 2003. The use of thematic role information in parsing: Syntactic processing autonomy revisited. Journal of Memory and Language, 49:317--334.
- Dillon et al. (2013) Brian Dillon, Alan Mishler, Shayne Sloggett, and Colin Phillips. 2013. Contrasting intrusion profiles for agreement and anaphora: Experimental and modeling evidence. Journal of Memory and Language, 69(2):85--103.
- Elman (1990) J.L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179--211.
- Enguehard et al. (2017) Emile Enguehard, Yoav Goldberg, and Tal Linzen. 2017. Exploring the syntactic abilities of RNNs with multi-task learning. arXiv preprint arXiv:1706.03542.
- Ettinger et al. (2017) Allyson Ettinger, Sudha Rao, Hal Daumé III, and Emily M Bender. 2017. Towards linguistically generalizable NLP systems: A workshop and shared task. In Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems, pages 1--10.
- Ferreira and Clifton (1986) Fernanda Ferreira and Charles Clifton. 1986. The independence of syntactic processing. Journal of Memory and Language, 25(3):348--368.
- Frank and Bod (2011) Stefan L. Frank and Rens Bod. 2011. Insensitivity of the human sentence-processing system to hierarchical structure. Psychological Science, 22(6):829--834.
- Frank et al. (2016) Stefan L. Frank, Thijs Trompenaars, Richard L. Lewis, and Shravan Vasishth. 2016. Cross-linguistic differences in processing double-embedded relative clauses: Working-memory constraints or language statistics? Cognitive Science, 40:554--578.
- Futrell and Levy (2017) Richard Futrell and Roger Levy. 2017. Noisy-context surprisal as a human sentence processing cost model. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 688--698, Valencia, Spain.
- Gibson and Thomas (1999) Edward Gibson and James Thomas. 1999. Memory limitations and structural forgetting: The perception of complex ungrammatical sentences as grammatical. Language and Cognitive Processes, 14(3):225--248.
- Goldberg (2017) Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1--309.
- Gulordava et al. (2018) K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, and M. Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of NAACL.
- Hale (2001) John T. Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics and Language Technologies, pages 1--8.
- Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
- Kádár et al. (2017) Ákos Kádár, Grzegorz Chrupała, and Afra Alishahi. 2017. Representation of linguistic form and function in recurrent neural networks. Computational Linguistics, 43(4):761--780.
- Kataoka (2006) Kiyoko Kataoka. 2006. Nihongo hitei-bun no koozoo: kakimaze-bun to hitei-koou hyoogen. Kuroshio, Tokyo.
- Kim et al. (2016) Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 2741--2749. AAAI Press.
- Lake and Baroni (2017) Brenden M Lake and Marco Baroni. 2017. Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350.
- Lau et al. (2006) Ellen Lau, Clare Stroud, Silke Plesch, and Colin Phillips. 2006. The role of structural prediction in rapid syntactic analysis. Brain & Language, 98:74--88.
- Levy (2008) Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106(3):1126--1177.
- Levy et al. (2012) Roger Levy, Evelina Fedorenko, Mara Breen, and Ted Gibson. 2012. The processing of extraposed structures in English. Cognition, 122(1):12--36.
- Linzen et al. (2016) Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521--535.
- MacDonald and Christiansen (2002) Maryellen C. MacDonald and Morten H. Christiansen. 2002. Reassessing working memory: Comment on Just and Carpenter (1992) and Waters and Caplan (1996). Psychological Review, 109(1):35--54.
- MacDonald et al. (1994) Maryellen C. MacDonald, Neal J. Pearlmutter, and Mark S. Seidenberg. 1994. The lexical nature of syntactic ambiguity resolution. Psychological Review, 101(4):676.
- Masson and Loftus (2003) Michael EJ Masson and Geoffrey R Loftus. 2003. Using confidence intervals for graphically based data interpretation. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale, 57(3):203.
- Miller and Chomsky (1963) G.A. Miller and N. Chomsky. 1963. Finitary models of language users. Handbook of Mathematical Psychology, 2:419--491.
- Muraki (1978) Masatake Muraki. 1978. The sika nai construction and predicate restructuring. In John Hinds and Irwin Howard, editors, Problems in Japanese syntax and semantics, pages 155--177. Kaitakusha, Tokyo.
- Pollard and Sag (1994) Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. Center for the Study of Language and Information, Stanford, CA.
- Reinhart (1981) Tanya Reinhart. 1981. Definite NP anaphora and c-command domains. Linguistic Inquiry, 12(4):605--635.
- Rudinger et al. (2018) Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301.
- van Schijndel and Linzen (2018) Marten van Schijndel and Tal Linzen. 2018. Modeling garden path effects without explicit hierarchical syntax. In Proceedings of the 40th Annual Meeting of the Cognitive Science Society.
- Shi et al. (2016) Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526--1534.
- Smith and Levy (2013) Nathaniel J. Smith and Roger Levy. 2013. The effect of word predictability on reading time is logarithmic. Cognition, 128(3):302--319.
- Staub and Clifton (2006) Adrian Staub and Charles Clifton. 2006. Syntactic prediction in language comprehension: Evidence from either …or. Journal of Experimental Psychology: Learning, Memory, & Cognition, 32(2):425--436.
- Sturt (2003) Patrick Sturt. 2003. The time-course of the application of binding constraints in reference resolution. Journal of Memory and Language, 48:542--562.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104--3112.
- Tanaka (1997) Hidekazu Tanaka. 1997. Invisible movement in sika-nai and the linear crossing constraint. Journal of East Asian Linguistics, 6(2):143--188.
- Trueswell et al. (1994) John C. Trueswell, Michael K. Tanenhaus, and S. Garnsey. 1994. Semantic influences on parsing: Use of thematic role information in syntactic ambiguity resolution. Journal of Memory and Language, 33:285--318.
- Vasishth et al. (2008) Shravan Vasishth, Sven Brüssow, Richard L Lewis, and Heiner Drenhaus. 2008. Processing polarity: How the ungrammatical intrudes on the grammatical. Cognitive Science, 32(4):685--712.
- Vasishth et al. (2010) Shravan Vasishth, Katja Suckow, Richard L Lewis, and Sabine Kern. 2010. Short-term forgetting in sentence comprehension: Crosslinguistic evidence from verb-final structures. Language and Cognitive Processes, 25(4):533--567.
- Weber et al. (2018) Noah Weber, Leena Shekhar, and Niranjan Balasubramanian. 2018. The fine line between linguistic generalization and failure in seq2seq-attention models. In Workshop on New Forms of Generalization in Deep Learning and NLP (NAACL 2018).
- Wilcox et al. (2018) Ethan G. Wilcox, Roger P. Levy, Takashi Morita, and Richard Futrell. 2018. What do RNNs learn about filler--gap dependencies? In Proceedings of BlackboxNLP 2018.
- Williams et al. (2018) Adina Williams, Andrew Drozdov, and Samuel R Bowman. 2018. Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253--267.
- Xiang et al. (2009) Ming Xiang, Brian Dillon, and Colin Phillips. 2009. Illusory licensing effects across dependency types: ERP evidence. Brain & Language, 108(1):40--55.
- Yngve (1960) Victor H. Yngve. 1960. A model and an hypothesis for language structure. Proceedings of the American Philosophical Society, 104(5):444--466.