An analysis of the utility of explicit negative examples to improve the syntactic abilities of neural language models

04/06/2020 ∙ by Hiroshi Noji, et al.

We explore the utility of explicit negative examples in training neural language models. Negative examples here are incorrect words in a sentence, such as "barks" in "*The dogs barks". Neural language models are commonly trained only on positive examples, that is, the sentences in the training data, but recent studies suggest that models trained in this way cannot robustly handle complex syntactic constructions, such as long-distance agreement. In this paper, using English data, we first demonstrate that appropriately using negative examples about particular constructions (e.g., subject-verb agreement) boosts the model's robustness on them, with a negligible loss of perplexity. The key to our success is an additional margin loss between the log-likelihoods of a correct word and an incorrect word. We then provide a detailed analysis of the trained models. One of our findings is the difficulty of object-relative clauses for RNNs. We find that even with our direct learning signals the models still struggle to resolve agreement across an object-relative clause. Augmenting the training sentences involving the constructions helps somewhat, but the accuracy still does not reach the level of subject-relative clauses. Although not directly cognitively appealing, our method can be a tool to analyze the true architectural limitation of neural models on challenging linguistic constructions.




1 Introduction


Despite not being exposed to explicit syntactic supervision, neural language models (LMs), such as recurrent neural networks, are able to generate fluent and natural sentences, suggesting that they induce syntactic knowledge about the language to some extent. However, it is still under debate whether such induced knowledge about grammar is robust enough to deal with syntactically challenging constructions such as long-distance subject-verb agreement. So far, the results for RNN language models (RNN-LMs) trained only with raw text are overall negative; prior work has reported low performance on the challenging test cases marvin-linzen:2018:EMNLP even with massive data and model sizes van-schijndel-EtAl:2019:EMNLP1, or has argued for the necessity of an architectural change to track the syntactic structure explicitly wilcox-etal-2019-structural; kuncoro-EtAl:2018:Long. Here the task is to evaluate whether a model assigns a higher likelihood to a grammatically correct sentence (1a) than to an incorrect sentence (1b) that is minimally different from the original one Q16-1037.

(1a) The author that the guards like laughs.
(1b) *The author that the guards like laugh.

In this paper, to obtain a new insight into the syntactic abilities of neural LMs, in particular RNN-LMs, we perform a series of experiments under a different condition from the prior work. Specifically, we extensively analyze the performance of models that are exposed to explicit negative examples. In this work, negative examples are sentences or tokens that are grammatically incorrect, such as the starred sentence in (1) above.

Since these negative examples provide a direct learning signal for the task used at test time, it may not be very surprising if the task performance goes up. We acknowledge this, and argue that our motivation for this setup is to deepen understanding, in particular of the limitation or the capacity of the current architectures, which we expect can be reached with such strong supervision. Another motivation is engineering: negative examples could be exploited in different ways, and establishing a better way will be of practical importance toward building an LM or generator that is robust on particular linguistic constructions.

The first research question we pursue is about this latter point: what is a better method to utilize negative examples that help LMs to acquire robustness on the target syntactic constructions? Regarding this point, we find that adding an additional token-level loss that tries to guarantee a margin between the log-probabilities of the correct and incorrect words (e.g., $\log p(\text{laughs} \mid h)$ and $\log p(\text{laugh} \mid h)$ for (1), where $h$ is the preceding context) is superior to the alternatives. On the test set of marvin-linzen:2018:EMNLP, we show that LSTM language models (LSTM-LMs) trained with this loss reach a near perfect level on most syntactic constructions for which we create negative examples, with only a slight increase of perplexity of about 1.0 point.

Past work conceptually similar to ours is enguehard-etal-2017-exploring, which, while not directly exploiting negative examples, trains an LM with additional explicit supervision signals for the evaluation task. They hypothesize that LSTMs do have enough capacity to acquire robust syntactic abilities but the learning signals given by the raw text are weak, and show that multi-task learning with a binary classification task to predict the upcoming verb form (singular or plural) helps models become aware of the target syntax (subject-verb agreement). Our experiments basically confirm and strengthen this argument, with even stronger learning signals from negative examples, and we argue this allows us to evaluate the true capacity of the current architectures. In our experiments (Section 4), we show that our margin loss achieves higher syntactic performance.

Another relevant work on the capacity of LSTMs is kuncoro-etal-2019-scalable, which shows that by distilling from syntactic LMs dyer-EtAl:2016:N16-1, LSTM-LMs can be robust on syntax. We show that our LMs with the margin loss outperform theirs in most aspects, providing further evidence for the capacity of LSTMs, and we also discuss their limitation.

The latter part of this paper is a detailed analysis of the trained models and introduced losses. Our second question is about the true limitation of LSTM-LMs: are there still any syntactic constructions that the models cannot handle robustly even with our direct learning signals? This question can be seen as a finer-grained version of the one raised by enguehard-etal-2017-exploring, pursued with a stronger tool and an improved evaluation metric. Among the tested constructions, we find that syntactic agreement across an object relative clause (RC) is challenging. To inspect whether this is due to an architectural limitation, we train another LM on a dataset in which we artificially augment sentences involving object RCs. Since it is known that object RCs are relatively rare compared to subject RCs hale-2001-probabilistic, frequency may be the main reason for the lower performance. Interestingly, even when increasing the number of sentences with an object RC by eight times (more than twice the number of sentences with a subject RC), the accuracy does not reach the same level as agreement across a subject RC. This result suggests an inherent difficulty in tracking a syntactic state across an object RC for sequential neural architectures.

We finally provide an ablation study to understand the linguistic knowledge encoded in the models learned with the help of our method. We experiment under reduced supervision at two different levels: (1) at a lexical level, by not giving negative examples on verbs that appear in the test set; (2) at a construction level, by not giving negative examples about a particular construction, e.g., verbs after a subject RC. Neither ablation causes a large drop in scores. This suggests that our learning signals at a lexical level (negative words) strengthen the abstract syntactic knowledge about the target constructions, and also that the models can generalize the knowledge acquired from negative examples to similar constructions for which negative examples are not explicitly given. The result also implies that negative examples do not have to be complete and can be noisy, which is appealing from an engineering perspective.

2 Target Task and Setup

The most common evaluation metric of an LM is perplexity. Although neural LMs achieve impressive perplexity merity2018regularizing, it is an average score across all tokens and does not inform the models’ behaviors on linguistically challenging structures, which are rare in the corpus. This is the main motivation to separately evaluate the models’ syntactic robustness by a different task.

2.1 Syntactic evaluation task

As introduced in Section 1, the task for a model is to assign a higher probability to the grammatical sentence than to the ungrammatical one, given a pair of sentences that differ minimally at a critical position affecting the grammaticality. For example, the two sentences in (1) differ only in the form of the final verb, and to assign a higher probability to the grammatical one, models need to be aware of the agreement dependency between author and laughs over an RC.
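The evaluation protocol described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: `pair_accuracy` and `toy_log_prob` are hypothetical names, and the toy unigram probabilities are made up purely so the example runs without a trained LM.

```python
import math
from typing import Callable, List, Tuple


def pair_accuracy(
    pairs: List[Tuple[str, str]],
    log_prob: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs for which the
    grammatical sentence gets the strictly higher log-probability."""
    correct = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return correct / len(pairs)


# Hypothetical stand-in for an LM: a unigram scorer with made-up probabilities.
TOY_UNIGRAM = {"laughs": 0.02, "laugh": 0.01}


def toy_log_prob(sentence: str) -> float:
    words = sentence.lower().rstrip(".").split()
    return sum(math.log(TOY_UNIGRAM.get(w, 0.05)) for w in words)


pairs = [
    ("The author that the guards like laughs.",
     "The author that the guards like laugh."),
]
accuracy = pair_accuracy(pairs, toy_log_prob)
```

In practice `log_prob` would be replaced by the sentence log-likelihood of the trained LM; a unigram scorer cannot, of course, capture agreement.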

marvin-linzen:2018:EMNLP test set

While initial work Q16-1037; N18-1108 has collected test examples from naturally occurring sentences, this approach suffers from the coverage issue, as syntactically challenging examples are relatively rare. We use the test set compiled by marvin-linzen:2018:EMNLP, which consists of synthetic examples (in English) created by a fixed vocabulary and a grammar. This approach allows us to collect varieties of sentences with complex structures.

The test set is divided by the syntactic ability each subset requires. Many subsets are about different patterns of subject-verb agreement, including local agreement (2a), non-local agreement across a prepositional phrase or a subject/object RC, and coordinated verb phrases (2b). (1) is an example of agreement across an object RC.

(2a) The senators smile/*smiles.
(2b) The senators like to watch television shows and are/*is twenty three years old.

Previous work has shown that non-local agreement is particularly challenging for sequential neural models marvin-linzen:2018:EMNLP.

The other patterns are reflexive anaphora dependencies between a noun and a reflexive pronoun (2c), and negative polarity items (NPIs), such as ever, which require a preceding negation word (e.g., no and none) at an appropriate scope (2d):

(2c) The authors hurt themselves/*himself.
(2d) No/*Most authors have ever been popular.

Note that NPI examples differ from the others in that the context determining the grammaticality of the target word (No/*Most) does not precede it. Rather, the grammaticality is determined by the following context. As we discuss in Section 3, this property makes it difficult to apply training with negative examples to NPIs for most of the methods studied in this work.

All examples above are actual test sentences, and since they are synthetic, some may sound somewhat unnatural. The main argument for using this dataset is that, even if not very natural, the sentences are still strictly grammatical, and an LM equipped with robust syntactic abilities should be able to handle them as a human would.

2.2 Language models


Training data

Following common practice, we train LMs on a dataset not directly relevant to the test set. Throughout the paper, we use an English Wikipedia corpus assembled by N18-1108, which has been used as training data for the present task marvin-linzen:2018:EMNLP; kuncoro-etal-2019-scalable, consisting of 80M/10M/10M tokens for training/dev/test sets. It is tokenized and rare words are replaced by a single unknown token, resulting in a vocabulary size of 50,000.
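The rare-word replacement mentioned above can be illustrated as follows. This is a sketch under assumptions: the spelling `<unk>`, the helper names, and the choice to count the unknown token as one of the vocabulary slots are ours, not details confirmed by the corpus description.

```python
from collections import Counter
from typing import Iterable, List, Set

UNK = "<unk>"  # assumed spelling of the unknown token


def build_vocab(sentences: Iterable[List[str]], max_size: int) -> Set[str]:
    """Keep the most frequent words, reserving one slot for UNK (assumption)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    most_common = [word for word, _ in counts.most_common(max_size - 1)]
    return set(most_common) | {UNK}


def replace_rare(sentence: List[str], vocab: Set[str]) -> List[str]:
    """Map every out-of-vocabulary token to the single unknown token."""
    return [tok if tok in vocab else UNK for tok in sentence]


# Tiny illustrative corpus; the real setting uses max_size = 50,000.
corpus = [["the", "dog", "barks"], ["the", "dog", "sleeps"], ["the", "cat"]]
vocab = build_vocab(corpus, max_size=3)
mapped = replace_rare(["the", "cat", "barks"], vocab)
```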

Baseline LSTM-LM

Since our focus in this paper is an additional loss exploiting negative examples (Section 3), we fix the baseline LM throughout the experiments. Our baseline is a three-layer LSTM-LM with 1,150 hidden units at internal layers trained with the standard cross-entropy loss. Word embeddings are 400-dimensional, and input and output embeddings are tied DBLP:journals/corr/InanKS16. Deviating from some prior work marvin-linzen:2018:EMNLP; van-schijndel-EtAl:2019:EMNLP1, we train LMs at sentence level as in sequence-to-sequence models sutskever2014sequence. This setting has been employed in some previous work kuncoro-EtAl:2018:Long; kuncoro-etal-2019-scalable. [Footnote 1: On the other hand, the LSTM-LM of marvin-linzen:2018:EMNLP, which is prepared by N18-1108, is trained at document level through truncated backpropagation through time (BPTT) conf/icassp/MikolovKBCK11. Since our training regime is more akin to the task setting of syntactic evaluation, it may provide some advantage at test time.]

Parameters are optimized by SGD. For regularization, we apply dropout on word embeddings and outputs of every layer of LSTMs, with weight decay of 1.2e-6, and anneal the learning rate by 0.5 if the validation perplexity does not improve successively, checking every 5,000 mini-batches. Mini-batch size, dropout weight, and initial learning rate are tuned by perplexity on the dev set of Wikipedia dataset.
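The learning-rate schedule above could look roughly like the sketch below. It is only an illustration: the class name, the `patience` parameter (our reading of "does not improve successively" as a configurable number of consecutive non-improving checks), the initial learning rate, and the perplexity values are all assumptions.

```python
class PlateauAnnealer:
    """Multiply the learning rate by `factor` after `patience` consecutive
    validation checks without a new best perplexity."""

    def __init__(self, lr: float, factor: float = 0.5, patience: int = 1):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_perplexity: float) -> float:
        if val_perplexity < self.best:
            self.best = val_perplexity
            self.bad_checks = 0
        else:
            self.bad_checks += 1
            if self.bad_checks >= self.patience:
                self.lr *= self.factor
                self.bad_checks = 0
        return self.lr


# One call per validation check (every 5,000 mini-batches in the text).
# The initial rate and the perplexities below are made-up illustration values.
annealer = PlateauAnnealer(lr=20.0)
history = [annealer.step(ppl) for ppl in [100.0, 90.0, 95.0, 89.0]]
```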

The size of our three-layer LM is the same as the state-of-the-art LSTM-LM at document-level merity2018regularizing. marvin-linzen:2018:EMNLP's LSTM-LM is two-layer with 650 hidden units and word embeddings. Comparing the two, since the word embeddings of our models are smaller (400 vs. 650), the total model sizes are comparable (40M for ours vs. 39M for theirs). Nonetheless, we will see in the first experiment that our carefully tuned three-layer model achieves much higher syntactic performance than their model (Section 4), making it a stronger baseline for our extensions, which we introduce next.
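As a rough sanity check of the model sizes quoted above, the parameter counts can be approximated with a short calculation. This sketch rests on several assumptions that the text does not state: the last LSTM layer of the three-layer model is taken to have the embedding size (400), as needed for weight tying in an AWD-LSTM-style model; each gate has a single bias vector; the output bias is ignored; and one embedding matrix is counted for each model.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    # Four gates, each with input weights, recurrent weights, and one bias.
    return 4 * hidden_size * (input_size + hidden_size + 1)


def lm_params(vocab_size: int, emb_size: int, layer_sizes: list) -> int:
    total = vocab_size * emb_size  # one embedding matrix (assumption)
    prev = emb_size
    for hidden in layer_sizes:
        total += lstm_layer_params(prev, hidden)
        prev = hidden
    return total


# Assumed layer layout for the three-layer model: 1150, 1150, then 400.
ours = lm_params(50_000, 400, [1150, 1150, 400])
# Two-layer model with 650 hidden units and 650-dimensional embeddings.
theirs = lm_params(50_000, 650, [650, 650])
```

Under these assumptions the counts come out at roughly 40.2M and 39.3M, in line with the 40M vs. 39M comparison; a different bias or tying convention would shift them slightly.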

3 Learning with Negative Examples


Now we describe four additional losses for exploiting negative examples. The first two are existing ones, proposed for a similar purpose or under a different motivation. As far as we know, the latter two have not appeared in past work. [Footnote 2: The loss for large-margin language models huang-EtAl:2018:EMNLP3 is similar to our sentence-level margin loss. Whereas their formulation is more akin to the standard large-margin setting, aiming to learn a reranking model, our margin loss is simpler, just comparing two log-likelihoods of predefined positive and negative sentences.]

We note that we create negative examples by modifying the original Wikipedia training sentences. As a running example, let us consider the case where sentence (3a) exists in a mini-batch, from which we create a negative example (3b).

(3a) An industrial park with several companies is located in the close vicinity.
(3b) *An industrial park with several companies are located in the close vicinity.


By a target word, we mean a word for which we create a negative example (e.g., is). We distinguish two types of negative examples: a negative token and a negative sentence; the former means a single incorrect word (e.g., are), and the latter means the whole sentence containing it.

3.1 Negative Example Losses

Binary-classification loss

This is proposed by enguehard-etal-2017-exploring to complement a weak inductive bias in LSTM-LMs for learning syntax. It is multi-task learning across the cross-entropy loss ($L_{lm}$) and an additional loss ($L_{add}$):

$$L = L_{lm} + \beta L_{add},$$

where $\beta$ is a relative weight for $L_{add}$. Given outputs of LSTMs, a linear layer and a binary softmax layer predict whether the next token is singular or plural. $L_{add}$ is a loss for this classification, only defined for the contexts preceding a target token $x_i$:

$$L_{add} = \sum_{x_{1:i} \in h^*} -\log p(\mathrm{num}(x_i) \mid x_{1:i-1}),$$

where $x_{1:i} = x_1 \cdots x_i$ is a prefix sequence and $h^*$ is a set of all prefixes ending with a target word (e.g., An industrial park with several companies is) in the training data. $\mathrm{num}(x) \in \{\text{singular}, \text{plural}\}$ is a function returning the number of $x$. In practice, for each mini-batch for $L_{lm}$, we calculate $L_{add}$ for the same set of sentences and add these two to obtain a total loss for updating parameters.

As we mentioned in Section 1, this loss does not exploit negative examples explicitly; essentially a model is only informed of a key position (target word) that determines the grammaticality. This is a rather indirect learning signal, and we expect that it does not outperform the other approaches.
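A minimal sketch of how this multi-task objective combines the two terms is given below. The function names are hypothetical, and the inputs are plain probabilities rather than the outputs of a real linear-plus-softmax head; the point is only the shape of the computation: a summed negative log-probability of the correct number at each target position, added to the LM loss with a relative weight.

```python
import math
from typing import List


def binary_number_loss(p_correct_number: List[float]) -> float:
    """Sum of -log p(correct number | prefix) over all target positions.

    Each entry is the probability that the binary classifier assigns to the
    true number (singular or plural) of one upcoming target word."""
    return sum(-math.log(p) for p in p_correct_number)


def total_loss(lm_loss: float, add_loss: float, beta: float) -> float:
    """Weighted sum of the cross-entropy LM loss and the additional loss."""
    return lm_loss + beta * add_loss


# Illustration with made-up values: two target positions in a mini-batch.
add = binary_number_loss([0.5, 0.5])
loss = total_loss(lm_loss=10.0, add_loss=add, beta=1.0)
```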

Unlikelihood loss

This is recently proposed welleck2019neural for resolving the repetition issue, a known problem for neural text generators holtzman2019curious. Aiming at learning a model that can suppress repetition, they introduce an unlikelihood loss, which is an additional loss at a token level and explicitly penalizes choosing words previously appeared in the current context.

We customize their loss for negative tokens (e.g., are in (3)). Since this loss is added at token-level, instead of the weighted sum $L = L_{lm} + \beta L_{add}$ the total loss is $L_{lm}$, which we modify as:

$$L_{lm} = \sum_{x \in D} \sum_{x_i \in x} \Big( -\log p(x_i \mid x_{1:i-1}) + \sum_{x_i^* \in \mathrm{neg}_t(x_i)} g(x_i^*) \Big), \qquad g(x_i^*) = -\alpha \log\big(1 - p(x_i^* \mid x_{1:i-1})\big),$$

where $\mathrm{neg}_t(\cdot)$ returns negative tokens for a target $x_i$. [Footnote 3: Empty for non-target tokens. It may return multiple tokens sometimes, e.g., themselves → {himself, herself}.] $\alpha$ controls the weight. $x$ is a sentence in the training data $D$. The unlikelihood loss strengthens the signal to penalize undesirable words in a context by explicitly reducing the likelihood of negative tokens $x_i^*$. This is a more direct learning signal than the binary classification loss.
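The per-token computation can be sketched as follows, assuming the usual unlikelihood form in which each negative token adds a weighted $-\log(1 - p)$ penalty to the standard negative log-likelihood of the correct token. The function name and the probabilities are illustrative only.

```python
import math
from typing import Dict, Iterable


def unlikelihood_token_loss(
    probs: Dict[str, float],
    target: str,
    negatives: Iterable[str],
    alpha: float,
) -> float:
    """-log p(target) plus alpha * -log(1 - p(neg)) for each negative token.

    `probs` stands in for the LM's next-word distribution at this position;
    `negatives` is empty for non-target tokens."""
    loss = -math.log(probs[target])
    for neg in negatives:
        loss += -alpha * math.log(1.0 - probs[neg])
    return loss


# Made-up next-word probabilities after "An industrial park with several companies".
probs = {"is": 0.6, "are": 0.3}
with_negative = unlikelihood_token_loss(probs, "is", ["are"], alpha=1.0)
without_negative = unlikelihood_token_loss(probs, "is", [], alpha=1.0)
```

The penalty grows as the model puts more mass on the negative token, and vanishes when that probability goes to zero.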

Sentence-level margin loss

We propose a different loss, in which the likelihoods for correct and incorrect sentences are more tightly coupled. As in the binary classification loss, the total loss is the weighted sum $L = L_{lm} + \beta L_{add}$. We consider the following loss for $L_{add}$:

$$L_{add} = \sum_{x \in D} \sum_{x_j^* \in \mathrm{neg}_s(x)} \max\big(0, \delta - (\log p(x) - \log p(x_j^*))\big),$$

where $\delta$ is a margin value between the log-likelihood of the original sentence $x$ and negative sentences $\{x_j^*\}$. $\mathrm{neg}_s(x)$ returns a set of negative sentences by modifying the original one. Note that we change only one token for each $x_j^*$, and thus may obtain multiple negative sentences from one $x$ when it contains multiple target tokens (e.g., she leaves there but comes back …). [Footnote 4: In principle, one can accumulate this loss within a single mini-batch for $L_{lm}$ as we do for the binary-classification loss. However, obtaining $L_{add}$ requires running an LM entirely on negative sentences as well, which demands a lot of GPU memory. We avoid this by separating mini-batches for $L_{lm}$ and $L_{add}$. We precompute all possible pairs of ($x$, $x_j^*$) and create a mini-batch by sampling from them. We make the batch size for $L_{add}$ (the number of pairs) half of that for $L_{lm}$, to make the number of sentences contained in both kinds of batches equal. Finally, in each epoch, we sample at most half as many mini-batches for $L_{add}$ as for $L_{lm}$, to reduce the total amount of training time.]

Compared to the unlikelihood loss, this loss not only decreases the likelihood of a negative example but also tries to guarantee a minimal difference between the two likelihoods. The learning signal of this loss seems stronger in this sense; however, it lacks token-level supervision, which may provide a more direct signal to learn a clear contrast between correct and incorrect words. This is an empirical question we pursue in the experiments.
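A sentence-level margin term of this kind can be sketched with a hinge on the difference of sentence log-likelihoods. We assume the standard hinge form here; the function name and the log-likelihood values are illustrative, and in a real setting they would come from running the LM on the original and the negative sentences.

```python
from typing import Iterable


def sentence_margin_loss(
    logp_original: float,
    logp_negatives: Iterable[float],
    delta: float,
) -> float:
    """Sum over negative sentences of max(0, delta - (log p(x) - log p(x*))).

    The term is zero once the original sentence is at least `delta` nats
    more likely than the negative sentence."""
    return sum(
        max(0.0, delta - (logp_original - logp_neg))
        for logp_neg in logp_negatives
    )


# Made-up sentence log-likelihoods; delta = 10 follows the margin used later.
satisfied = sentence_margin_loss(-20.0, [-35.0], delta=10.0)
violated = sentence_margin_loss(-20.0, [-24.0, -40.0], delta=10.0)
```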

Token-level margin loss

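Our final loss is a combination of the previous two, obtained by replacing the penalty term on each negative token $x_i^*$ in the unlikelihood loss by a margin loss:

$$g(x_i^*) = \max\big(0, \delta - (\log p(x_i \mid x_{1:i-1}) - \log p(x_i^* \mid x_{1:i-1}))\big),$$

where $x_i$ is the correct target token, $x_{1:i-1}$ is its preceding context, and $\delta$ is the margin value.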
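The token-level variant can be sketched in the same style: the usual negative log-likelihood of the correct token plus a hinge on the log-probability gap to each negative token at the same position. As before, the hinge form, the function name, and the probabilities are assumptions made for illustration.

```python
import math
from typing import Dict, Iterable


def token_margin_loss(
    probs: Dict[str, float],
    target: str,
    negatives: Iterable[str],
    delta: float,
) -> float:
    """-log p(target) plus max(0, delta - (log p(target) - log p(neg)))
    for every negative token at this position."""
    logp_target = math.log(probs[target])
    loss = -logp_target
    for neg in negatives:
        gap = logp_target - math.log(probs[neg])
        loss += max(0.0, delta - gap)
    return loss


# Made-up next-word probabilities; the margin is violated in the first case
# and satisfied in the second, where only the cross-entropy term remains.
close = token_margin_loss({"is": 0.6, "are": 0.3}, "is", ["are"], delta=10.0)
far = token_margin_loss({"is": 0.99, "are": 1e-6}, "is", ["are"], delta=10.0)
```

Because both log-probabilities come from the same forward pass over the original sentence, this term does not require running the LM on a separate negative sentence.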

3.2 Parameters

Each method employs a few additional hyperparameters. For the binary classification ($\beta$) and unlikelihood ($\alpha$) losses, we select the weight that achieves the best average syntactic performance. For the two margin losses, we fix the loss weights and only examine the effects of the margin value.

| Type | Construction | LSTM-LM: M&L18 | LSTM-LM: Ours | Margin loss: Sentence-level | Margin loss: Token-level | Additional loss: Binary-pred. | Additional loss: Unlike. | Distilled: K19 |
|---|---|---|---|---|---|---|---|---|
| Agreement | Simple | 94.0 | 98.1 (1.3) | 100.0 (0.0) | 100.0 (0.0) | 99.1 (1.2) | 99.7 (0.6) | 100.0 (0.0) |
| Agreement | In a sent. complement | 99.0 | 96.1 (2.0) | 95.8 (0.7) | 99.3 (0.4) | 96.9 (2.4) | 92.7 (3.1) | 98.0 (2.0) |
| Agreement | Short VP coordination | 90.0 | 93.6 (3.0) | 100.0 (0.0) | 99.4 (1.1) | 93.8 (3.3) | 95.6 (3.0) | 99.0 (2.0) |
| Agreement | Long VP coordination | 61.0 | 82.2 (3.4) | 94.5 (1.0) | 99.0 (0.8) | 83.9 (3.2) | 90.0 (2.4) | 80.0 (2.0) |
| Agreement | Across a PP | 57.0 | 92.6 (1.4) | 98.8 (0.4) | 98.6 (0.3) | 92.7 (1.3) | 95.2 (1.2) | 91.0 (3.0) |
| Agreement | Across a SRC | 56.0 | 91.5 (3.4) | 99.6 (0.4) | 99.8 (0.2) | 91.9 (2.5) | 97.1 (0.7) | 90.0 (2.0) |
| Agreement | Across an ORC | 50.0 | 84.5 (3.1) | 93.5 (4.0) | 93.7 (2.0) | 86.3 (3.2) | 88.7 (4.1) | 84.0 (3.0) |
| Agreement | Across an ORC (no that) | 52.0 | 75.7 (3.3) | 86.7 (4.2) | 89.4 (2.7) | 78.6 (4.0) | 86.4 (3.5) | 77.0 (2.0) |
| Agreement | In an ORC | 84.0 | 84.3 (5.5) | 99.8 (0.2) | 99.9 (0.1) | 89.3 (6.2) | 92.4 (3.5) | 92.0 (4.0) |
| Agreement | In an ORC (no that) | 71.0 | 81.8 (2.3) | 97.0 (1.0) | 98.6 (0.9) | 83.0 (5.1) | 88.9 (2.4) | 92.0 (2.0) |
| Reflexive | Simple | 83.0 | 94.1 (1.9) | 99.4 (1.1) | 99.9 (0.2) | 91.8 (2.9) | 98.0 (1.1) | 91.0 (4.0) |
| Reflexive | In a sent. complement | 86.0 | 80.8 (1.7) | 99.2 (0.6) | 97.9 (0.8) | 79.0 (3.1) | 92.6 (2.9) | 82.0 (3.0) |
| Reflexive | Across an ORC | 55.0 | 74.9 (5.0) | 72.8 (2.4) | 73.9 (1.3) | 72.3 (3.0) | 78.9 (8.6) | 67.0 (3.0) |
| NPI | Simple | 40.0 | 99.2 (0.7) | 98.7 (1.6) | 97.7 (2.0) | 98.0 (3.1) | 98.2 (1.2) | 94.0 (4.0) |
| NPI | Across an ORC | 41.0 | 63.5 (15.0) | 56.8 (6.0) | 64.1 (13.8) | 64.5 (14.0) | 48.5 (6.4) | 91.0 (7.0) |
| | Perplexity | 78.6 | 49.5 (0.2) | 56.4 (0.5) | 50.4 (0.6) | 49.6 (0.3) | 50.3 (0.2) | 56.7 (0.2) |
Table 1: Comparison of syntactic dependency evaluation accuracies across different types of dependencies and perplexities. Numbers in parentheses are standard deviations. M&L18 is the result of two-layer LSTM-LMs in marvin-linzen:2018:EMNLP. K19 is the result of distilled two-layer LSTM-LMs from RNNGs kuncoro-etal-2019-scalable. VP: verb phrase; PP: prepositional phrase; SRC: subject relative clause; and ORC: object-relative clause. Margin values are set to 10, which works better according to Figure 1. Perplexity values are calculated on the test set of the Wikipedia dataset. The values of M&L18 and K19 are copied from kuncoro-etal-2019-scalable.


Figure 1: Margin value vs. macro average accuracy over the same type of constructions, or perplexity, with standard deviation for the sentence and token-level margin losses. The baseline LSTM-LM without additional loss is also shown.


3.3 Scope of Negative Examples

Since our goal is to understand to what extent LMs can be sensitive to the target syntactic constructions by giving explicit supervision via negative examples, we only prepare negative examples on the constructions that are directly tested at evaluation. Specifically, we mark the following words in the training data, and create negative examples:

Present verb

To create negative examples on subject-verb agreement, we mark all present verbs and change their numbers. [Footnote 5: We use the Stanford tagger toutanova2003feature to find the present verbs. We change the number of verbs tagged by VBZ or VBP using (…)]

Reflexive pronoun

We also create negative examples on reflexive anaphora, by flipping between {themselves} ↔ {himself, herself}.
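The creation of negative examples for these two word types can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the tiny `VERB_FLIP` lookup is a hypothetical stand-in for a real inflection tool, the part-of-speech tags are supplied by hand rather than by a tagger, and all function names are ours.

```python
from typing import List, Tuple

# Hypothetical lookup standing in for a real verb inflection tool.
VERB_FLIP = {"is": "are", "are": "is", "laughs": "laugh", "laugh": "laughs"}
REFLEXIVE_FLIP = {
    "themselves": ["himself", "herself"],
    "himself": ["themselves"],
    "herself": ["themselves"],
}


def negative_tokens(token: str, tag: str) -> List[str]:
    """Return the negative tokens for a target word, or [] for non-targets."""
    if tag in ("VBZ", "VBP") and token in VERB_FLIP:
        return [VERB_FLIP[token]]
    return REFLEXIVE_FLIP.get(token, [])


def negative_sentences(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Create one negative sentence per negative token, changing one word."""
    tokens = [tok for tok, _ in tagged]
    result = []
    for i, (tok, tag) in enumerate(tagged):
        for neg in negative_tokens(tok, tag):
            result.append(tokens[:i] + [neg] + tokens[i + 1:])
    return result


tagged = [("An", "DT"), ("industrial", "JJ"), ("park", "NN"), ("with", "IN"),
          ("several", "JJ"), ("companies", "NNS"), ("is", "VBZ"),
          ("located", "VBN")]
negatives = negative_sentences(tagged)
```

Each negative sentence differs from the original in exactly one token, so a sentence with several target words yields several negative sentences.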

These two are both related to the syntactic number of a target word. For binary classification we regard both as target words, unlike the original work, which only deals with subject-verb agreement enguehard-etal-2017-exploring. We use a single common linear layer for both constructions.

In this work, we do not create negative examples for NPIs. This is mainly for technical reasons. Among the four losses, only the sentence-level margin loss can correctly handle negative examples for NPIs, essentially because the other losses are token-level. For NPIs, left contexts do not have information to decide the grammaticality of the target token (a quantifier; no, most, etc.) (Section 2.1). Instead, in this work, we use NPI test cases as a proxy to see possible negative (or positive) side effects of specially targeting some constructions. We will see that, in particular for our margin losses, such negative effects are very small.

4 Experiments on Additional Losses


We first see the overall performance of baseline LMs as well as the effects of additional losses. Throughout the experiments, for each setting, we train five models from different random seeds and report the average score and standard deviation.

Naive LSTM-LMs perform well

The main accuracy comparison across target constructions for different settings is presented in Table 1. We first notice that our baseline LSTM-LMs (Section 2.2) perform much better than marvin-linzen:2018:EMNLP's LM. A similar observation is recently made by kuncoro-etal-2019-scalable. [Footnote 6: We omit the comparison due to space limitations, but the performance is overall similar.] This suggests that the original work underestimates the true syntactic ability induced by LSTM-LMs. The table also shows the results by their distilled LSTMs from RNNGs (Section 1).

Higher margin value is effective

For the two types of margin loss, which margin value should we use? Figure 1 reports average accuracies within the same types of constructions. For both token and sentence-levels, the task performance increases with the margin value $\delta$, but too large a value (15) has a negative effect, in particular on reflexive anaphora. Both methods increase perplexity. However, this effect is much smaller for the token-level loss. In the following experiments, we fix the margin value to 10 for both, which achieves the best syntactic performance.
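To make the notion of a macro average within a construction type concrete, the following sketch averages the token-level margin accuracies reported in Table 1 (margin 10). The grouping of rows into agreement, reflexive, and NPI blocks is our reading of the table, and whether Figure 1 is computed in exactly this way is an assumption.

```python
from statistics import mean

# Token-level margin loss accuracies from Table 1, grouped by construction
# type (the grouping is an assumption based on the test set description).
token_level = {
    "agreement": [100.0, 99.3, 99.4, 99.0, 98.6, 99.8, 93.7, 89.4, 99.9, 98.6],
    "reflexive": [99.9, 97.9, 73.9],
    "npi": [97.7, 64.1],
}

# Unweighted mean over constructions within each type.
macro_average = {name: mean(scores) for name, scores in token_level.items()}
```

Each construction contributes equally to its type's average regardless of how many test sentences it contains, so a single hard construction (such as reflexives across an ORC) pulls the average down noticeably.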

Which additional loss works better?

We see a clear tendency that our token-level margin loss achieves overall better performance. The unlikelihood loss does not work unless we choose a huge weight parameter, and even then it does not outperform ours, with a similar value of perplexity. The improvements by the binary-classification loss are smaller, indicating that its signals are weaker than those of the methods with explicit negative examples. The sentence-level margin loss is conceptually advantageous in that it can deal with any type of negative example defined in a sentence, including NPIs. We see that it is often competitive with the token-level margin loss, but it comes with a relatively large increase of perplexity (4.9 points). This increase is observed even with smaller margin values (Figure 1). Understanding the cause of this degradation as well as alleviating it is an important future direction.

Figure 2: Accuracies on “Across an ORC” (with and without complementizer “that”) by models trained on augmented data with additional sentences containing an object RC. Margin is set to 10. X-axis denotes the total number of object RCs in the training data. 0.37M roughly equals the number of subject RCs in the original data. “animate only” is a subset of examples (see body). Error bars are standard deviations across 5 different runs.


5 Limitations of LSTM-LMs

In Table 1, the accuracies on dependencies across an object RC are relatively low. The central question in this experiment is whether this low performance is due to the limitation of current architectures, or other factors such as frequency. We base our discussion on the contrast between object (5a) and subject (5b) RCs:

(5a) The authors (that) the chef likes laugh.
(5b) The authors that like the chef laugh.

Importantly, the accuracies for a subject RC are more stable, reaching 99.8% with the token-level margin loss, even though the content words used in the two sets of examples are shared. [Footnote 7: Precisely, they are not the same. Examples of object RCs are divided into two categories by the animacy of the main subject (animate or not), while subject RCs only contain animate cases. If we select only animate examples from object RCs, the vocabularies for both RCs are the same, leaving only differences in word order and inflection, as in (5a) and (5b).]

It is known that object RCs are less frequent than subject RCs hale-2001-probabilistic; Levy2008-LEVESC, and it could be the case that the use of negative examples still does not fully alleviate this factor. Here, to understand the true limitation of the current LSTM architecture, we try to eliminate such other factors as much as possible under a controlled experiment.


We first inspect the frequencies of object and subject RCs in the training data, by parsing them with the state-of-the-art Berkeley neural parser kitaev-klein:2018:Long. In total, while subject RCs occur 373,186 times, object RCs only occur 106,558 times. We create three additional training datasets by adding sentences involving object RCs to the original Wikipedia corpus (Section 2.2). To this end, we randomly pick 30 million sentences from Wikipedia (not overlapping with any sentences in the original corpus), parse them with the same parser, and keep the sentences containing an object RC, amounting to 680,000 sentences. Among the test cases about object RCs, we compare accuracies on subject-verb agreement, to make a comparison with subject RCs. We also evaluate on the "animate only" subset, which corresponds to the test cases for subject RCs with only differences in word order and inflection (like the object and subject RC pair in (5); see footnote 7). Of particular interest to us is the accuracy on these animate cases. Since the vocabularies are exactly the same, we hypothesize that the accuracy will reach the same level as that on subject RCs with our augmentation.
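The frequency figures above can be put side by side with a short calculation. The counts are taken from the text; treating each of the 680,000 added sentences as contributing exactly one object RC is our simplifying assumption, so the augmented total is a lower bound.

```python
subject_rc = 373_186       # subject RC occurrences in the original training data
object_rc = 106_558        # object RC occurrences in the original training data
added_sentences = 680_000  # extra sentences containing an object RC

# Object RCs relative to subject RCs before augmentation.
ratio_before = object_rc / subject_rc

# Assumption: one object RC per added sentence (a lower bound).
object_rc_augmented = object_rc + added_sentences
growth = object_rc_augmented / object_rc
ratio_after = object_rc_augmented / subject_rc

summary = (f"before: {ratio_before:.2f}x subject RCs; "
           f"after: {growth:.1f}x more object RCs, {ratio_after:.1f}x subject RCs")
```

Under this assumption, object RCs go from under a third of the subject RC count to more than twice that count, with a growth factor somewhat above seven.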

Figure 3: An ablation study to see the performance of models trained with reduced explicit negative examples (token-level and construction-level). One color represents the same models across plots, except the last bar (construction-level), which is different for each plot.



However, for both all and animate cases, accuracies are below those for subject RCs (Figure 2). Although we see improvements from the original score (93.7), the highest average accuracy by the token-level margin loss on the "animate" subset is 97.1 ("with that"), not beyond 99%. This result indicates some architectural limitation of LSTM-LMs in handling object RCs robustly at a near perfect level. Explaining why the accuracy does not reach (almost) 100%, perhaps in terms of other empirical properties or inductive biases khandelwal-etal-2018-sharp; ravfogel-etal-2019-studying, is future work.

6 Do models generalize explicit supervision, or just memorize it?

One distinguishing property of our margin loss, in particular the token-level loss, is that it is highly lexical, making an explicit contrast between correct and incorrect words. This direct signal may make models acquire very specialized knowledge about each target word, rather than knowledge that generalizes across similar words and contexts. In this section, to get insights into the transferability of syntactic knowledge induced by our margin losses, we provide an ablation study by removing certain negative examples during training.

Second verb (V1 and V2)

| Models | All verbs | like | other verbs |
|---|---|---|---|
| LSTM-LM | 82.2 (3.4) | 13.0 (12.2) | 89.9 (3.6) |
| Margin (token) | 99.0 (0.8) | 94.0 (6.5) | 99.6 (0.5) |
| -Token | 90.8 (3.3) | 51.0 (29.9) | 95.2 (2.6) |
| -Pattern | 90.1 (4.6) | 50.0 (30.6) | 94.6 (2.2) |

Table 2: Accuracies on long VP coordinations by the models with/without ablations. "All verbs" scores are overall accuracies. "like" scores are accuracies on examples on which the second verb (target verb) is like.


First verb (V1 and V2)

| Models | likes | other verbs |
|---|---|---|
| LSTM-LM | 61.5 (20.0) | 93.5 (3.4) |
| Margin (token) | 97.0 (4.5) | 99.9 (0.1) |
| -Token | 63.5 (18.5) | 99.2 (1.1) |
| -Pattern | 67.0 (21.2) | 98.0 (1.4) |

Table 3: Further analysis of accuracies on the "other verbs" cases of Table 2. Among these cases, the second column ("likes") shows accuracies on cases where the first verb (not target) is likes.



We perform two kinds of ablation. For token-level ablation (-Token), we avoid creating negative examples for all verbs that appear as a target verb in the test set. [Footnote 8: swim, smile, laugh, enjoy, hate, bring, interest, like, write, admire, love, know, and is.] The other is construction-level ablation (-Pattern), which removes all negative examples occurring in a particular syntactic pattern. We ablate a single construction at a time for -Pattern, from four non-local subject-verb dependencies (across a prepositional phrase (PP), subject RC, object RC, and long verb phrase (VP)). [Footnote 9: We identify all these cases from the parsed training data, which we prepared for the analysis in Section 5.] We hypothesize that models are less affected by token-level ablation, as knowledge transfer across words appearing in similar contexts is promoted by the language modeling objective. We expect that construction-level supervision would be necessary to induce robust syntactic knowledge, as perhaps different phrases, e.g., a PP and a VP, are processed differently.
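The two ablations amount to filtering the pool of negative examples before training, which can be sketched as below. The `NegativeExample` record, its field names, and the pattern labels are hypothetical; in particular, we assume each example carries a lemma for its target verb and an optional label for the syntactic pattern it occurs in.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class NegativeExample:
    target: str             # correct word in the training sentence
    negative: str           # incorrect counterpart
    lemma: str              # lemma of the target verb (assumed to be available)
    pattern: Optional[str]  # e.g. "PP", "SRC", "ORC", "long_VP", or None


def ablate_token(examples: List[NegativeExample],
                 held_out_lemmas: Set[str]) -> List[NegativeExample]:
    """-Token: drop negative examples whose target verb appears in the test set."""
    return [e for e in examples if e.lemma not in held_out_lemmas]


def ablate_pattern(examples: List[NegativeExample],
                   pattern: str) -> List[NegativeExample]:
    """-Pattern: drop negative examples occurring in one syntactic pattern."""
    return [e for e in examples if e.pattern != pattern]


# Made-up pool of negative examples for illustration.
pool = [
    NegativeExample("likes", "like", "like", "ORC"),
    NegativeExample("laughs", "laugh", "laugh", None),
    NegativeExample("runs", "run", "run", "PP"),
]
token_ablated = ablate_token(pool, {"like", "laugh"})
pattern_ablated = ablate_pattern(pool, "PP")
```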


Figure 3 shows the main results. Across models, we restrict the evaluation to four non-local dependency constructions, which we also selected as ablation candidates. For a model with -Pattern, we evaluate only on examples of the construction ablated in training (see caption). To our surprise, both -Token and -Pattern have similar effects, except on "Across an ORC", on which the degradation by -Pattern is larger. This may be related to the inherent difficulty of object RCs for LSTM-LMs that we verified in Section 5. For such particularly challenging constructions, models may need explicit supervision signals. We observe smaller score degradation when ablating prepositional phrases and subject RCs. This suggests that, for example, the syntactic knowledge strengthened for prepositional phrases with negative examples could be exploited to learn the syntactic patterns about subject RCs, even when direct learning signals on subject RCs are missing.

We see approximately 10.0 points of score degradation on long VP coordination by both ablations. Does this mean that long VPs are particularly hard in terms of transferability? We find that the main reason for this drop, relative to other cases, is rather technical, essentially due to the target verbs used in the test cases. See Tables 2 and 3, which show that failed cases for the ablated models are often characterized by the existence of either like or likes. Excluding these cases ("other verbs" in Table 3), the accuracies reach 99.2 and 98.0 by -Token and -Pattern, respectively. These verbs do not appear in the test cases of other tested constructions. This result suggests that the transferability of syntactic knowledge to a particular word may depend on some characteristics of that word. We conjecture that the reason for the weak transferability to likes and like is that they are polysemous; e.g., in the corpus, like is much more often used as a preposition, and its use as a present tense verb is rare. This type of frequency issue may be one factor that reduces transferability. In other words, like can be seen as a verb whose usage is challenging to learn from the corpus alone, and our margin loss helps in such cases.

7 Conclusion

We have shown that by exploiting negative examples explicitly, the syntactic abilities of LSTM-LMs greatly improve, demonstrating a new capacity for handling syntax robustly. Given the success of our approach using negative examples, and our final analysis of transferability, which indicates that the negative examples do not have to be complete, one interesting future direction is to extend our approach to automatically induce the negative examples themselves, possibly with orthographic and/or distributional indicators.


We would like to thank Naho Orita and the members of Computational Psycholinguistics Tokyo for their valuable suggestions and comments. This paper is based on results obtained from projects commissioned by the New Energy and Industrial Technology Development Organization (NEDO).