Controlled Evaluation of Grammatical Knowledge in Mandarin Chinese Language Models

09/22/2021 · by Yiwen Wang, et al. · MIT · Harvard University

Prior work has shown that structural supervision helps English language models learn generalizations about syntactic phenomena such as subject-verb agreement. However, it remains unclear if such an inductive bias would also improve language models' ability to learn grammatical dependencies in typologically different languages. Here we investigate this question in Mandarin Chinese, which has a logographic, largely syllable-based writing system; different word order; and sparser morphology than English. We train LSTMs, Recurrent Neural Network Grammars, Transformer language models, and Transformer-parameterized generative parsing models on two Mandarin Chinese datasets of different sizes. We evaluate the models' ability to learn different aspects of Mandarin grammar that assess syntactic and semantic relationships. We find suggestive evidence that structural supervision helps with representing syntactic state across intervening content and improves performance in low-data settings, suggesting that the benefits of hierarchical inductive biases in acquiring dependency relationships may extend beyond English.


1 Introduction

A rich collection of targeted linguistic evaluations has shown that neural language models can, perhaps surprisingly, learn many aspects of grammar from unlabeled linguistic input (e.g., linzen-etal-2016-assessing; gulordava-etal-2018-colorless; warstadt2020blimp; hu-etal-2020-systematic; xiang2021climp). There is also growing evidence that explicit modeling of syntax helps neural network-based language models represent syntactic state and exhibit human-like processing of non-local grammatical dependencies, including number agreement (kuncoro-etal-2018-lstms), negative polarity licensing and filler-gap dependencies (wilcox-etal-2019-structural; hu-etal-2020-systematic), and garden-path effects (futrell-etal-2019-neural; hu-etal-2020-systematic). However, this line of research has focused primarily on the syntax of English. It is unclear to what extent structural supervision may help neural language models generalize in languages with differing typologies. Expanding these analyses beyond English has the potential to inform scientific questions about inductive biases for language acquisition, as well as practical questions about model architectures that approach language-independence (bender_achieving_2011).

Here, we perform a controlled case study of grammatical knowledge in Mandarin Chinese language models. The orthography and grammar of Chinese provide a useful testing ground given their differences from English and other Indo-European languages (see, e.g., shopen_language_1985; li_chinese_2015). Whereas today's Indo-European languages like English generally use phone-based orthography, Chinese uses a logographic system where each character generally corresponds to a syllable. Most Mandarin Chinese words are one or two syllables long, which influences the distribution of tokens. Grammatically, Chinese has almost no inflectional morphology, and corpus studies suggest that the average dependency length of Mandarin Chinese sentences is larger than that of English sentences (jiang2015effects), with potential implications for language modeling. On the one hand, the need to track input across long dependencies may make structural supervision more beneficial for Mandarin Chinese language models; on the other hand, the prevalence of these dependencies may make it easier for models to learn to maintain non-local information without explicitly modeling syntax. Other fine-grained typological differences also affect the types of syntactic tests that can be conducted. For example, since relative clauses precede the head noun in Chinese (unlike in English), we can manipulate the distance of a verb–object dependency by inserting relative clauses in between. These characteristics motivate our choice of Mandarin Chinese as a language for evaluating structurally supervised neural language models.

We design six classes of Mandarin test suites covering a range of syntactic and semantic relationships, some specific to Mandarin and some comparable to English. We train neural language models with differing inductive biases on two datasets of different sizes, and compare the models' performance on our targeted evaluation materials. While most prior work investigating syntactically guided language models has used Recurrent Neural Network Grammar models (dyer2016recurrent) – potentially conflating structural supervision with a particular parameterization – this work further explores structured Transformer language models (qian-etal-2021-structural).

Our results are summarized as follows. We find that structural supervision yields the greatest performance advantages in low-data settings, in line with prior work on English language models. Our results also suggest a potential benefit of structural supervision in deriving garden-path effects induced by local classifier–noun mismatch, and in maintaining syntactic expectations across intervening content within a dependency relation. These findings suggest that the benefits of hierarchical inductive biases in acquiring dependency relationships may not be specific to English.

2 Targeted Linguistic Evaluation

Linguistic minimal pairs have been used to construct syntactic test suites in English (e.g., linzen-etal-2016-assessing; marvin2018targeted; mueller2020crosslinguistic; davis2020recurrent; warstadt2020blimp) and other languages such as Italian, Spanish, French, and Russian (ravfogel2018lstm; gulordava-etal-2018-colorless; an-etal-2019-representation; mueller2020crosslinguistic; davis2020recurrent). A minimal pair is formed by two sentences that differ in grammaticality or acceptability but are otherwise matched in structure and lexical content. (1) and (2) form an example English minimal pair differing in subject-verb agreement:

(1) The man drinks coffee every day.

(2) The man drink coffee every day.

The two sentences differ only at the main verb ‘drinks’/‘drink’, which must agree with its subject ‘The man’ in number. Since only the third-person singular form ‘drinks’ agrees with the subject, (1) is grammatical, whereas (2) is not.

The closest work to ours is the corpus of Chinese linguistic minimal pairs (CLiMP; xiang2021climp), which provides a benchmark for testing the syntactic generalization of Mandarin Chinese language models. While CLiMP focuses on building a comprehensive challenge set, the current work performs controlled experiments to investigate the effect of structural supervision on language models' ability to learn syntactic and semantic relationships. Moreover, the test items in CLiMP are semi-automatically generated, which may result in semantically anomalous sentences and introduce noise into the evaluation. In contrast, we manually construct the items in our test suites to sound as natural as possible.

3 Methods

3.1 Evaluation Paradigm

The general structure of our evaluation paradigm follows that of wilcox-etal-2019-structural and hu-etal-2020-systematic. We use surprisal as a linking function between language model output and human expectations (hale-2001-probabilistic). The surprisal of a word $w_t$ is its negative log probability conditioned on the preceding words in the same context:

$$S(w_t) = -\log p(w_t \mid w_1, \ldots, w_{t-1})$$
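To make the linking function concrete, the following is a minimal sketch of how per-token surprisals can be computed from an autoregressive neural language model. It assumes a HuggingFace-style causal LM interface (such as the GPT-2 architecture mentioned in Section 3.3); the function name and setup are ours, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def token_surprisals(model, input_ids):
    """Per-token surprisals -log p(w_t | w_1..w_{t-1}) for one sentence.

    Assumes `model` is a causal LM whose forward pass returns logits of
    shape (batch, seq_len, vocab_size), as in HuggingFace GPT-2, and
    `input_ids` has shape (1, seq_len).
    """
    with torch.no_grad():
        logits = model(input_ids).logits              # (1, T, V)
    log_probs = F.log_softmax(logits, dim=-1)
    # Logits at position t-1 predict the token at position t, so shift.
    target_ids = input_ids[0, 1:].unsqueeze(-1)       # (T-1, 1)
    token_log_probs = log_probs[0, :-1].gather(1, target_ids).squeeze(-1)
    return -token_log_probs                           # (T-1,) surprisals
```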

Our test suites take the form of a group of hand-written, controlled sentence sets. Each sentence set, or test item, contains at least two minimally differing sentences, and each sentence contains a stimulus prefix and a downstream target region. The content of the target region remains fixed across the sentence variants within a test item, while the content of the stimulus varies in a minimal way that modulates the sentence's grammaticality or acceptability. The target region, where we measure a model's surprisal output, is identified explicitly for each of the example items described in Section 3.2.

For each test item, we measure success by computing the difference in surprisals assigned by the model to the target region, conditioned on the ungrammatical vs. grammatical stimulus prefixes. If the model successfully captures the dependency, it should be less surprised at the grammatical target region than the ungrammatical one, leading to a positive surprisal difference (ungrammatical − grammatical). If this criterion is satisfied, then the model achieves a success score of 1, and 0 otherwise. These binary scores are averaged over test suite items and/or classes to obtain accuracy scores.
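As an illustration, the success criterion for a single test item can be expressed in a few lines of code. This is a sketch with hypothetical helper names; `surprisal_fn` stands in for any of the models' surprisal estimators described in Section 3.3.

```python
def item_success(surprisal_fn, gram_prefix, ungram_prefix, target):
    """1 if the model is less surprised by the target region after the
    grammatical prefix than after the ungrammatical one, else 0."""
    diff = surprisal_fn(ungram_prefix, target) - surprisal_fn(gram_prefix, target)
    return 1 if diff > 0 else 0

def suite_accuracy(surprisal_fn, items):
    """Average the binary success scores over the items of a test suite."""
    scores = [item_success(surprisal_fn, *item) for item in items]
    return sum(scores) / len(scores)
```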

3.2 Test Suites

We organize our materials into six classes of test suites, each of which assesses models' knowledge of a particular linguistic phenomenon.[1] Within each class, we develop individual test suites with different types of modifiers that intervene between the two ends of the dependency (including the no-modifier case). For each test suite, we manually construct 30 test items, taking care to maintain semantic plausibility wherever possible. For details about individual test suites, see Section A.1. While here we present example items with a specific type of modifier to introduce each test suite, a full set of examples with the other types of modifiers is included in Section A.2.

[1] All code and data, including test suites, can be found at https://github.com/YiwenWang03/syntactic-generalization-mandarin

The test suite classes include both syntactic and semantic dependencies, some of which do not exist in languages that have been the focus of targeted evaluation, and thus have not yet been explored. Missing Object evaluates syntactic knowledge of argument structure, Subordination and Garden Path subject/object assess representation of syntactic state, and classifier–noun Compatibility and verb–noun Compatibility evaluate a combination of syntactic and semantic factors (Sergio). Looking cross-linguistically, one class assesses a phenomenon present in Mandarin but not English (classifier–noun Compatibility); two classes assess an expectation-violation phenomenon that is present in both Mandarin and English but arises from different sources (Garden Path subject/object); and three classes assess phenomena present in both languages (verb–noun Compatibility, Missing Object, and Subordination).

For each phenomenon targeted by a given test suite class, the two components of the syntactic/semantic dependency often occur adjacently in a sentence. However, if a language model robustly represents the dependency, then it should maintain its expectations even when intervening content is present between the upstream and downstream ends of the dependency. We assess the robustness of the models’ grammatical knowledge on each test suite class by inserting three commonly-used types of modifiers to create non-local dependencies: adjectives, subject-extracted relative clauses (SRCs) and their variants, and object-extracted relative clauses (ORCs). The resulting set of test suites is described in greater detail in the following sections.

3.2.1 Classifier–Noun Compatibility

Classifiers are a special class of words in Chinese languages which are obligatorily used with numerals in a noun phrase. Each specific classifier in Mandarin is compatible only with a largely semantically delimited set of nouns. The general classifier “个” (cl), in contrast, is compatible with most nouns. In the classifier–noun Compatibility test suites, we evaluate whether a model expects nouns from a semantically compatible class over those from a semantically incompatible class, given a specific classifier.

(1.a) 孩子 听到 了 一 首 熟悉 的 歌曲 。
child hear pst one cl familiar de song .
“The child heard a familiar song.”

(1.b) 孩子 听到 了 一 张 熟悉 的 歌曲 。
child hear pst one cl familiar de song .
“The child heard a familiar song.”

(1.c) 孩子 听到 了 一 张 熟悉 的 专辑 。
child hear pst one cl familiar de album .
“The child heard a familiar album.”

(1.d) 孩子 听到 了 一 首 熟悉 的 专辑 。
child hear pst one cl familiar de album .
“The child heard a familiar album.”

Example (1) shows a test item from the suite with adjectival modifiers; we also consider ORCs and SRCs as modifiers in this test suite class (see Section A.2.1). Here we consider the classifier “首” (cl), which is compatible with the noun “歌曲” (song) but not the noun “专辑” (album), and the classifier “张” (cl), which is compatible with “专辑” but not “歌曲”. The four variants (1.a-d) show the four possible combinations of the two classifiers and the two nouns. The target region, where we measure surprisal, is the sentence-final noun together with the period. We also check that the two nouns compared within each test item have similar frequencies in the training data. A human-like language model should assign lower surprisals to the target regions in (1.a) and (1.c), the variants with a compatible classifier–noun pair, and higher surprisals to the target regions in (1.b) and (1.d), the variants with mismatched classifier–noun pairs. In other words, we evaluate four pairwise comparisons of target-region surprisal: (1.b) > (1.a), (1.d) > (1.c), (1.d) > (1.a), and (1.b) > (1.c). We report the mean accuracy averaged across all four pairwise comparisons as a model's accuracy on a given test suite.
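Concretely, the scoring for one classifier–noun item can be sketched as follows (our own illustrative code; `s` maps the condition labels of (1.a)-(1.d) to target-region surprisals):

```python
def classifier_noun_item_accuracy(s):
    """Mean accuracy over the four pairwise surprisal comparisons
    (b > a, d > c, d > a, b > c) for one 2x2 classifier-noun item."""
    comparisons = [("b", "a"), ("d", "c"), ("d", "a"), ("b", "c")]
    return sum(s[mis] > s[match] for mis, match in comparisons) / len(comparisons)

# Example: made-up surprisals for the four variants of one item.
s = {"a": 7.2, "b": 11.5, "c": 8.0, "d": 10.9}
print(classifier_noun_item_accuracy(s))  # 1.0: all four comparisons hold
```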

3.2.2 Garden-Path Effects

Garden-path effects are a class of phenomena in human sentence processing, where the incremental parsing state of a sentence prefix needs to be reanalyzed as the comprehender processes a downstream disambiguator region (bever1970cognitive). We construct a set of test suites evaluating whether models exhibit garden-path effects induced by locally mismatched classifier–noun pairs situated within a globally coherent sentence, inspired by previous human behavioral studies (wu2018effects).

To illustrate the classifier-induced garden-path effect, consider examples (2.a) and (2.b):

(2.a) 他 离开 了 那 间 朋友 开 的 工厂 。
he leave pst that cl friend start de factory .

(2.b) 他 离开 了 那 个 朋友 开 的 工厂 。
he leave pst that cl friend start de factory .
“He left the factory that the friend started.”

In (2.b), the general classifier “个”(cl) is compatible with the immediately following noun, “朋友”(friend), resulting in a garden-path interpretation that this noun is the object of the main-clause verb “离开”(leave). This interpretation is disconfirmed, however, by the next verb, “开”(start), which indicates that the noun “朋友” is actually the start of a relative clause preceding the main-clause object, “工厂”(factory). This means that the relative clause verb “开” should be highly surprising in (2.b). In (2.a), in contrast, the specific classifier “间”(cl) is incompatible with “朋友” (though it is compatible with “工厂”), cueing the upcoming relative clause structure (wu2018effects). A human-like language model should thus show a higher surprisal at the target verb “开” for (2.b) than (2.a).

We design test suites following the structure of examples (2) and (3). Examples (2.a-b) show the structure of items in the Garden Path object set, where an ORC modifies the object of the main-clause verb and the target region is the verb immediately following the noun closest to the classifier. Examples (3.a-b) show the basic structure of items in the Garden Path subject set, where an ORC modifies the sentence subject.

(3.a) 那 间 朋友 开 的 工厂 倒闭 了 。
that cl friend start de factory close pst .

(3.b) 那 个 朋友 开 的 工厂 倒闭 了 。
that cl friend start de factory close pst .
“The factory that the friend started has closed.”

In (3.b), the garden-path interpretation arises because the RC-internal subject “朋友” may initially be analyzed as the subject of the whole sentence during incremental processing; the target region of the garden-path effect is the disambiguating word “的” (de), which ends the relative clause and precedes the head noun of the true main subject of the sentence, “工厂”.

The criterion for a correct test item is that the model shows lower surprisal at the target region “开” (start) in (2.a) than in (2.b), and lower surprisal at the target region “的” in (3.a) than in (3.b). Examples (2) and (3) have no modifiers between the classifier stimulus and the target region. To manipulate the length of the dependency, the full test suites also use the same types of modifiers as in classifier–noun Compatibility.

3.2.3 Verb–Noun Compatibility

Similar to classifier–noun Compatibility, verb–noun Compatibility is a group of semantic test suites assessing the consistency between a transitive verb[2] and its direct object noun. (4) shows an example with an adjectival modifier between the verb and its object, where (4.b) is semantically inconsistent since the verb “阅读” (read) does not match the object noun “电脑” (computer).

[2] We test the transitivity of the verbs with a Tregex search (levy-andrew-2006-tregex) on the CTB dataset, along with human judgments from native Mandarin speakers.

(4.a) 我 修理 了 这 个 新 的 电脑 。
I fix pst this cl new de computer .
“I have fixed this new computer.”

(4.b) 我 阅读 了 这 个 新 的 电脑 。
I read pst this cl new de computer .
“I have read this new computer.”
The stimulus is the transitive verb, and the target region is the object noun and the period (which captures the possibility of an incomplete but potentially grammatical sentence). We insert adjective, ORC, and SRC modifiers (the same as in classifier–noun Compatibility) between the verb and object. The expected behavior is that the surprisal at the target region is lower in the semantically consistent variant (here, (4.a)).

3.2.4 Missing Object

Next, we turn to phenomena primarily characterized by syntactic expectations. The first of these test suite classes, Missing Object, assesses models’ ability to track a direct object required by a transitive main verb. Consider (5.a) and (5.b) (no modifier case):

(5.a) 记者 采访 了 科学家 。
journalist interview pst scientist .
“The journalist interviewed the scientist.”

(5.b) 记者 采访 了 。
journalist interview pst .
“The journalist interviewed.”
(5.a) is grammatical and (5.b) is not, since the main verb “采访” (interview) requires a downstream direct object. To test whether models learn this dependency, we record the model's surprisal at the sentence-final period “。”. The model should be more surprised to see “。” in (5.b), since a sentence is unlikely to end without the object required by the verb. Note that this is a case where we assess the human-likeness of an autoregressive language model by whether it is temporarily confused about the structural interpretation midway through the sentence, as evidenced by its next-word predictions.

To continue our investigation of long-distance dependencies, we add three types of modifiers: a single SRC, coordinated SRCs, and embedded SRCs, exemplified in Section A.2.5.[3] We expect the insertion of modifiers after the main verb to increase difficulty, as the model must track the verb–object dependency over a greater amount of content. In addition, the coordinated and embedded SRCs are longer and more syntactically complex than the single SRCs.

[3] We do not consider the conventional modifier types used in the other test suite classes because adding an ORC can be confounding — the language model might be surprised at the ungrammatical target region due to the RC verb instead of the main transitive verb. Therefore, we instead focus on SRCs for this particular class.

3.2.5 Subordination

Finally, our Subordination test suites assess the ability of a model to maintain global expectations for a main clause while inside a local subordinate clause. For example, consider (6.a) and (6.b) (no modifier case):

(6.a) 如果 他 不 尝试 , 他 将 失去 机会 。
if he neg try , he will lose opportunity .
“If he doesn't try, he will lose the opportunity.”

(6.b) 如果 他 不 尝试 。
if he neg try .
“If he doesn't try.”
In this case, we test the surprisal at the sentence-final period. If the model correctly represents the gross syntactic state within the subordinate clause, it should assign higher surprisal to the period in sentences like (6.b) than in sentences like (6.a). We include modifiers (the same types as in classifier–noun Compatibility) before the subject noun inside the subordinate clause.

3.3 Models

To investigate the effect of syntax modeling in learning the dependencies described in Section 3.2, we train four classes of neural language models by crossing two types of parameterization with two types of supervision. Two of our model classes are trained for vanilla next-word prediction: Long Short-Term Memory networks (LSTM; HochSchm97) and Transformers (attention). The remaining two model classes are based on the LSTM and Transformer architectures, but explicitly incorporate syntactic structure during training: Recurrent Neural Network Grammars (RNNG; dyer2016recurrent) and Transformer-parameterized parsing-as-language-modeling models (PLM; qian-etal-2021-structural). While prior work on structural supervision in English language models has focused primarily on RNNGs, both RNNGs and PLMs are joint probabilistic models of terminal word sequences along with the corresponding constituency parses. Thus, they both explicitly model syntax (in contrast to their vanilla language modeling counterparts), while featuring different parameterizations.

We use the PyTorch implementation of the LSTM (pytorch). The Transformer and PLM models are based on the HuggingFace GPT-2 architecture (transformer). While we use a model architecture equivalent in size to pre-trained GPT-2, we do not use the pre-trained tokenizer. All of our models are trained on a pre-tokenized Mandarin corpus and share the same vocabulary for each training dataset. Model sizes are reported in Table 3(a) in Appendix B.

For the LSTM and Transformer models, we calculate the surprisal at the target region by taking the negative log of the model's predicted conditional probability. We estimate the RNNGs' and PLMs' word surprisals with word-synchronous beam search (stern-etal-2017-effective), following hale-etal-2018-finding and wilcox-etal-2019-structural. The action beam size is 100 and the word beam size is 10. For regions with multi-token content, we sum the surprisals of the individual tokens.
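Because surprisals are negative log probabilities, summing a region's token-level surprisals is equivalent to multiplying the tokens' conditional probabilities; a trivial sketch (our own helper name):

```python
def region_surprisal(surprisals, region_indices):
    """Total surprisal of a (possibly multi-token) target region: the sum
    of its tokens' surprisals, i.e. -log of the product of their
    conditional probabilities."""
    return sum(surprisals[i] for i in region_indices)
```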

As a baseline, we additionally implement an n-gram model with Kneser-Ney smoothing (kneser1995improved) using the SRILM toolkit (stolcke2002srilm). For cases where the smoothed n-gram model assigns identical probabilities to the target region across different conditions in a test item, we break the tie by flipping a fair coin to determine the outcome for that particular item.

3.4 Corpus Data

We consider two datasets to explore how training data size affects models' ability to acquire grammatical knowledge (similar to hu-etal-2020-systematic). The LSTM and Transformer models are trained on the raw text only, whereas the RNNG and PLM models are trained with additional syntactic annotations.

Chinese Treebank (CTB)

The Chinese Treebank (CTB 9.0; ctb9) is a Chinese language corpus annotated with Penn Treebank-style (Penn) constituency parses. We use the Newswire, Magazine articles, Broadcast news, Broadcast conversations, and Weblogs sections, as we expect these sources to contain well-formed sentences with a variety of syntactic constructions. We follow the split defined by shao2017character to construct training, development, and test sets.

Xinhua News Data

To investigate the effects of increased training data size on models' syntactic generalization, we create a larger corpus combining CTB with a subset of the Xinhua News corpus (xinhua).[4] The Xinhua corpus contains metadata and content for 406K Mandarin news articles, collected from three mainstream media sources. Only the article contents are used for training. The texts from the Xinhua corpus are first split into sentences and then tokenized into words with SpaCy (spacy). We then obtain sentence parses with the Berkeley Neural Parser (kitaev-klein-2018-constituency; kitaev-etal-2019-multilingual). We filter out extremely long sentences (>100 tokens) and map tokens occurring less than twice in the training data to fine-grained UNK tokens. Appendix C reports full statistics of the training corpora.

[4] We include CTB to guarantee that this larger dataset covers the CTB vocabulary.
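The length-filtering and UNK-mapping steps described above can be sketched as follows. This is our own simplified code; in particular, the single `<unk>` placeholder stands in for the paper's fine-grained UNK classes, whose exact scheme we do not know.

```python
from collections import Counter

def filter_and_unkify(sentences, max_len=100, min_count=2):
    """Drop overly long sentences and map rare words to an UNK token.

    `sentences` is a list of token lists (already sentence-split and
    word-tokenized). Words occurring fewer than `min_count` times are
    replaced; a single placeholder approximates the fine-grained UNK
    categories used in the paper.
    """
    kept = [s for s in sentences if len(s) <= max_len]
    counts = Counter(w for s in kept for w in s)
    return [
        [w if counts[w] >= min_count else "<unk>" for w in s]
        for s in kept
    ]
```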

We train ten types of language models, crossing model architecture (n-gram, LSTM, RNNG, Transformer, and PLM) with dataset (Chinese Treebank and the hybrid Xinhua dataset).[5] Each model type is trained with multiple random seeds,[6] and results are reported as averages across these instances. Model perplexity scores are reported in Table 3(b).

[5] We denote the language models trained on the hybrid Xinhua dataset with the suffix “-Xinhua”.

[6] We train RNNG models with 2 random seeds, and all other neural models with 3 random seeds.

Figure 1: Accuracy by test suite class and model. For each test suite class, we average the accuracy across different modifier types (including the no modifier case). Error bars denote 95% CIs of the mean accuracy score.
                     CTB                                    Xinhua
                     RNNG vs. LSTM  PLM vs. Transformer     RNNG vs. LSTM  PLM vs. Transformer
classifier–noun      -              -                       -              ↑**
Garden Path object   -              ↑**                     -              ↓*
Garden Path subject  -              -                       -              -
verb–noun            ↑**            -                       ↓*             -
Missing Object       -              -                       -              ↓***
Subordination        ↑**            -                       ↑**            -
Table 1: Comparison between language models that perform explicit syntax modeling (RNNGs and PLMs) and their vanilla counterparts (LSTMs and Transformers). ↑ represents a statistically significant improvement for the structurally supervised language models, and ↓ represents the opposite direction. *: p < 0.05, **: p < 0.01, ***: p < 0.001.

4 Results

We begin by reporting the overall performance of the models on the test suite classes introduced in Section 3.2. Figure 1 shows accuracy scores averaged across test suites within each class.[7] First, we note that the n-gram baseline performs the worst overall among all language models, which matches our expectations, since syntactic dependencies beyond the 5-token window are difficult for the model to capture.

[7] For numerical accuracy scores, see Table 6 in Appendix E. For results on individual test suites, see Figure 7 in Appendix F.

Turning to the neural models, we assess the effects of training data size and architecture by examining the mean accuracy scores across test suites for each model. We fit separate linear mixed-effects models comparing the effects of data size, using the lme4 package in R (lme4). The dependent variable is the mean accuracy score (Figure 2). For each language model type, the main effect is a binary indicator of whether the model is trained on the CTB dataset or the larger Xinhua dataset. We include test suite class, modifier type, and model seed as random factors with random intercepts and slopes. Across test suite classes, the Transformer-Xinhua models significantly outperform their smaller CTB counterparts, but the effect of data size is less clear for the LSTM, RNNG, and PLM models.
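The paper fits these models with lme4 in R; a simplified Python analogue using statsmodels might look like the following sketch. It keeps only a random intercept per test suite class rather than the full random-effect structure, and the toy data frame and column names are our own assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in for the real results table: one row per (model seed,
# test suite), with a mean accuracy and a binary data-size indicator.
df = pd.DataFrame({
    "accuracy":  [0.62, 0.71, 0.64, 0.75, 0.80, 0.85,
                  0.78, 0.88, 0.55, 0.60, 0.57, 0.66],
    "is_xinhua": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "suite_class": ["cl-noun"] * 4 + ["subord"] * 4 + ["missing-obj"] * 4,
})

# Random intercept per test suite class; the paper's lme4 models add
# further random slopes and factors that we omit here for brevity.
model = smf.mixedlm("accuracy ~ is_xinhua", df, groups=df["suite_class"])
print(model.fit().summary())  # coefficient on is_xinhua = data-size effect
```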

Figure 2: Mean accuracy scores by model type across all test suites.
Figure 3: Accuracy on Missing Object as a function of modifier complexity, for each model class.

Comparing different model architectures, we find that in the Subordination test suite class, RNNGs trained on the smaller CTB dataset achieve performance comparable to LSTMs trained on the larger Xinhua dataset (linear mixed-effects model with model type as the main factor). Furthermore, for both verb–noun Compatibility and Subordination, the RNNGs perform better than the LSTMs and Transformers when trained on the smaller CTB dataset (see Figure 1). These results suggest that an inductive bias for learning hierarchical structure may help in low-data Chinese settings, consistent with prior experiments on English language models (hu-etal-2020-systematic).

Overall, we find only suggestive benefits of structural supervision. While prior work in English (kuncoro-etal-2018-lstms; wilcox-etal-2019-structural; hu-etal-2020-systematic; qian-etal-2021-structural) has shown that RNNGs and PLMs can improve upon their vanilla LSTM and Transformer counterparts, the improvement is relatively smaller for Mandarin. To compare the performance of models with and without structural supervision, we fit a generalized linear mixed-effects model on the binary outcome (whether or not the model predicts a positive surprisal difference) for each test item, within each combination of test suite class, training data, and model parameterization. We take a binary indicator of whether or not the model performs explicit syntax modeling as the main factor, and include test item, modifier type, and model seed as random factors. Table 1 summarizes the results. For the CTB-only models, the structurally supervised models (RNNG and PLM) achieve accuracy either significantly greater than or comparable to the corresponding vanilla models (LSTM and Transformer) across all test suite classes. However, the pattern is less clear for the Xinhua-trained models: structural supervision leads to both gains and losses in accuracy compared to the vanilla models. We conjecture that word segmentation and parsing errors in the automatically annotated Xinhua dataset might have affected the learning process of the models. In addition, the Xinhua training data explored in this work is still not very large (<10 million tokens), so further benefits of syntactic supervision may be more pronounced with much larger training datasets. Nevertheless, the suggestive benefits of explicitly modeling syntax with very small amounts of data could have implications for language modeling in low-resource settings.

We also assess whether our models are better at capturing syntactic or semantic relationships. To do this, we group the six classes of test suites into three categories: syntactic dependency (Missing Object and Subordination), semantic relationship (classifier–noun Compatibility and verb–noun Compatibility), and a hybrid category capturing semantically driven syntactic state representations (Garden Path subject and Garden Path object). We find that the average accuracy score is significantly higher on the syntactic test suites than on the semantic test suites,[8] suggesting that the language models in our study – including those with no explicit syntax modeling – find it easier to learn syntactic dependencies than semantic relationships.

[8] See Section D.1 for details.

4.1 Robustness to Intervening Content

Next, we investigate the effect of structural supervision on tracking dependencies across intervening content. We focus our analysis on Missing Object,[9] as the modifiers considered in these test suites can be ordered according to their syntactic complexity (no SRC < single SRC < coordinated SRCs < embedded SRCs).

[9] The modifiers used in the other test suite classes (adjective, ORC, SRC) are not as directly comparable, since they vary in multiple dimensions, not just complexity.

(a) Garden Path object
(b) Garden Path subject
Figure 4: Mean surprisal difference at the target region (matched classifier − mismatched classifier).

Figure 3 shows the models' performance on this test suite class as a function of modifier complexity, ranging from least to most difficult along the horizontal axis of each subplot. The vanilla LSTMs and Transformers clearly degrade in performance as the intervening material between the stimulus and the target region grows in length and complexity. In contrast, it is visually apparent that the RNNG models do not degrade as sharply as modifiers get longer and more complex. We fit linear mixed-effects models to investigate the relationship between modifier type and accuracy for each language model.[10] Our results appear to confirm that neither the RNNGs nor the PLMs degrade significantly on the single SRC and coordinated SRC modifiers (compared to the no-modifier baseline). For the most complex modifier (embedded SRC), all models suffer in accuracy, but the magnitude of this effect is smaller for the RNNGs and PLMs than for the LSTMs and Transformers. Taken together, our results suggest that while structural supervision does not give the language models a significant overall accuracy advantage over their vanilla counterparts, it does seem to help the models maintain syntactic expectations across syntactically complex intervening content.

[10] See Section D.2 for details.

4.2 Garden-Path Effects

Building upon the classifier–noun Compatibility results, we investigate whether a mismatched local classifier–noun pair may serve as an early cue for an upcoming RC structure, inducing a garden-path effect. Figure 1 shows that the neural models systematically perform better on Garden Path object than on Garden Path subject. We conjecture that neural language models may implicitly predict an ORC modifying a sentence-initial subject noun regardless of the type of classifier. The language models may therefore be more prepared to see an ORC modifying the subject “工厂” in (3.b) than one modifying the object “工厂” in (2.b).

To gain a better understanding of model performance on these two test suite classes, we examine the average difference in target-region surprisal between sentences with and without a local classifier–noun mismatch (Figure 4). The average difference gives detailed information on how each language model processes the garden-path region, complementary to the binary success/failure score achieved by a model on a given test item. Figure 4(a) shows that the neural models have a positive average surprisal difference across test items for Garden Path object. Furthermore, the magnitude of this difference increases with the inclusion of the larger Xinhua dataset, suggesting that with more data, models become more confident in taking the incongruence between the classifier and the noun as a pre-RC cue.[11] On the other hand, recall that all models perform rather poorly on Garden Path subject (Figure 1). Figure 4(b) shows that only the RNNGs trained on the Xinhua corpus output a statistically significant positive average surprisal difference (one-sample t-test). PLM-Xinhua, although not reaching statistical significance, has a positive mean surprisal difference as well. This is because the magnitude of the surprisal differences predicted by the RNNG-Xinhua and PLM-Xinhua models is greater when they exhibit the predicted garden-path effects, and smaller when they do not follow the predicted direction. Therefore, structural supervision may help models represent syntactic state in a more human-like way.

[11] This result also accords with prior findings that classifiers facilitate object-modifying RC processing (wu14; wu2018effects).
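The significance test here is a standard one-sample t-test of the per-item surprisal differences against zero; in Python it amounts to the following sketch, where `diffs` and its values are illustrative stand-ins for the per-item differences.

```python
import numpy as np
from scipy import stats

# Per-item surprisal differences at the target region
# (matched − mismatched classifier); illustrative numbers only.
diffs = np.array([0.8, 1.2, -0.1, 0.9, 1.5, 0.3])
t_stat, p_value = stats.ttest_1samp(diffs, popmean=0.0)
print(t_stat, p_value)  # significantly positive mean -> garden-path effect
```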

5 Conclusion

This work evaluates Mandarin Chinese language models on six classes of grammatical relationships, including syntactic dependencies and semantic compatibilities. We use Mandarin as a case study for analyzing how the potential advantages of explicit syntax modeling (as performed by RNNGs and PLMs) generalize from English to a typologically different language. Although structural supervision does not boost performance on all relationships tested in this study, we find that it does allow the RNNG and PLM models to learn dependencies that are robust to increasingly complex modifiers, as seen in the Missing Object test suites. Compared to the vanilla sequence-based LSTM and Transformer models, explicit syntax modeling also seems to help with grammatical generalization in settings with small training data. We also find that Mandarin syntactic dependencies (such as tracking gross syntactic state within a subordinate clause) tend to be easier to learn than semantic dependencies (such as the compatibility between classifiers and nouns). This study is one of the first steps towards understanding the role structural inductive biases may play in learning semantic and syntactic relationships in typologically diverse languages.

Acknowledgements

We thank three anonymous reviewers for their feedback. J.H. acknowledges support from an NSF Graduate Research Fellowship under Grant No. 1745302. This work was supported by the MIT–IBM AI Research Lab and MIT’s Quest for Intelligence.

References

Appendix A Individual Test Suites

A.1 List of Test Suites

See Table 2 for a summary of test suite classes constructed in this study.

Test Suite Class Modifier Type # Test Items Prior Work in English Prior Work in Chinese
classifier–noun Compatibility None 30 - zhan-levy-2018-comparing; xiang2021climp
classifier–noun Compatibility Adjective 30 - xiang2021climp
classifier–noun Compatibility Object-Extracted RC 30 - xiang2021climp
classifier–noun Compatibility Subject-Extracted RC 30 - xiang2021climp
Garden Path object None 31 - wu2018effects
Garden Path object Adjective 31 - -
Garden Path object Object-Extracted RC 31 - -
Garden Path object Subject-Extracted RC 31 - -
Garden Path subject None 31 futrell-etal-2019-neural; hu-etal-2020-systematic -
Garden Path subject Adjective 31 - -
Garden Path subject Object-Extracted RC 31 hu-etal-2020-systematic -
Garden Path subject Subject-Extracted RC 31 hu-etal-2020-systematic -
verb–noun Compatibility None 31 - -
verb–noun Compatibility Adjective 31 - -
verb–noun Compatibility Object-Extracted RC 31 - -
verb–noun Compatibility Subject-Extracted RC 31 - -
Missing Object None 30 warstadt2020blimp -
Missing Object Subject-Extracted RC 30 - -
Missing Object Coordinated Subject-Extracted RC 30 - -
Missing Object Embedded Subject-Extracted RC 30 - -
Subordination None 30 futrell-etal-2019-neural; hu-etal-2020-systematic -
Subordination Adjective 30 - -
Subordination Object-Extracted RC 30 futrell-etal-2019-neural; hu-etal-2020-systematic -
Subordination Subject-Extracted RC 30 futrell-etal-2019-neural; hu-etal-2020-systematic -
Table 2: Summary of individual test suites used in our experiments.

A.2 Complementary Test Suite Examples

In this section, we provide complementary test items for those in Section 3.2, giving a full set of examples for each test suite class. Note that except for Missing Object, where we consider variants of SRCs as the modifier (single SRC, coordinated SRC, and embedded SRC), the general modifier types are adjective, ORC, and SRC.

A.2.1 Classifier–Noun Compatibility

No modifier:

(7.a) 孩子 听到 了 一 首 歌曲 。
child hear pst one cl song .
“The child heard a song.”

(7.b) 孩子 听到 了 一 张 歌曲 。
child hear pst one cl song .
“The child heard a song.”

(7.c) 孩子 听到 了 一 张 专辑 。
child hear pst one cl album .
“The child heard an album.”

(7.d) 孩子 听到 了 一 首 专辑 。
child hear pst one cl album .
“The child heard an album.”

ORC as modifier:

(8.a) 孩子 听到 了 一 首 他 熟悉 的 歌曲 。
child hear pst one cl he familiar de song .
“The child heard a song that he is familiar with.”

(8.b) 孩子 听到 了 一 张 他 熟悉 的 歌曲 。
child hear pst one cl he familiar de song .
“The child heard a song that he is familiar with.”

(8.c) 孩子 听到 了 一 张 他 熟悉 的 专辑 。
child hear pst one cl he familiar de album .
“The child heard an album that he is familiar with.”

(8.d) 孩子 听到 了 一 首 他 熟悉 的 专辑 。
child hear pst one cl he familiar de album .
“The child heard an album that he is familiar with.”

SRC as modifier:

(9.a) 孩子 听到 了 一 首 令 他 熟悉 的 歌曲 。
child hear pst one cl make he familiar de song .
“The child heard a song that makes him feel familiar.”

(9.b) 孩子 听到 了 一 张 令 他 熟悉 的 歌曲 。
child hear pst one cl make he familiar de song .
“The child heard a song that makes him feel familiar.”

(9.c) 孩子 听到 了 一 张 令 他 熟悉 的 专辑 。
child hear pst one cl make he familiar de album .
“The child heard an album that makes him feel familiar.”

(9.d) 孩子 听到 了 一 首 令 他 熟悉 的 专辑 。
child hear pst one cl make he familiar de album .
“The child heard an album that makes him feel familiar.”

A.2.2 Garden-Path Object

Adjectival modifier:

(10.a) 他 离开 了 那 间 负责 的 朋友 开 的 工厂 。
he leave pst that cl conscientious de friend start de factory .

(10.b) 他 离开 了 那 个 负责 的 朋友 开 的 工厂 。
he leave pst that cl conscientious de friend start de factory .
“He left the factory that the conscientious friend started.”

ORC as modifier:

(11.a) 他 离开 了 那 间 我 尊敬 的 朋友 开 的 工厂 。
he leave pst that cl I respect de friend start de factory .

(11.b) 他 离开 了 那 个 我 尊敬 的 朋友 开 的 工厂 。
he leave pst that cl I respect de friend start de factory .
“He left the factory that the friend whom I respect started.”

SRC as modifier:

(12.a) 他 离开 了 那 间 帮助 过 我 的 朋友 开 的 工厂 。
he leave pst that cl help pst I de friend start de factory .

(12.b) 他 离开 了 那 个 帮助 过 我 的 朋友 开 的 工厂 。
he leave pst that cl help pst I de friend start de factory .
“He left the factory that the friend who helped me before started.”

A.2.3 Garden-Path Subject

Adjectival modifier:

(13.a) 那 间 负责 的 朋友 开 的 工厂 倒闭 了 。
that cl conscientious de friend start de factory close pst .

(13.b) 那 个 负责 的 朋友 开 的 工厂 倒闭 了 。
that cl conscientious de friend start de factory close pst .
“The factory that the conscientious friend started has closed.”

ORC as modifier:

(14.a) 那 间 我 尊敬 的 朋友 开 的 工厂 倒闭 了 。
that cl I respect de friend start de factory close pst .

(14.b) 那 个 我 尊敬 的 朋友 开 的 工厂 倒闭 了 。
that cl I respect de friend start de factory close pst .
“The factory that the friend whom I respect started has closed.”

SRC as modifier:

(15.a) 那 间 帮助 过 我 的 朋友 开 的 工厂 倒闭 了 。
that cl help pst I de friend start de factory close pst .

(15.b) 那 个 帮助 过 我 的 朋友 开 的 工厂 倒闭 了 。
that cl help pst I de friend start de factory close pst .
“The factory that the friend who helped me before started has closed.”

A.2.4 Verb–Noun Compatibility

No modifier:

(16.a) 我 修理 了 这 个 电脑 。
I fix pst this cl computer .
“I have fixed this computer.”

(16.b) 我 阅读 了 这 个 电脑 。
I read pst this cl computer .
“I have read this computer.”

ORC as modifier:

(17.a) 我 修理 了 这 个 父亲 使用 的 电脑 。
I fix pst this cl father use de computer .
“I have fixed this computer that the father uses.”

(17.b) 我 阅读 了 这 个 父亲 使用 的 电脑 。
I read pst this cl father use de computer .
“I have read this computer that the father uses.”

SRC as modifier:

(18.a) 我 修理 了 这 个 计算 公式 的 电脑 。
I fix pst this cl calculate formula de computer .
“I have fixed this computer that calculates formulas.”

(18.b) 我 阅读 了 这 个 计算 公式 的 电脑 。
I read pst this cl calculate formula de computer .
“I have read this computer that calculates formulas.”

A.2.5 Missing Object

SRC as modifier:

(19.a) 记者 采访 了 研发 产品 的 科学家 。
journalist interview pst develop product de scientist .
“The journalist interviewed the scientist who developed the product.”

(19.b) 记者 采访 了 研发 产品 。
journalist interview pst develop product .
“The journalist interviewed who developed the product.”

Coordinated SRCs as modifier:

(20.a) 记者 采访 了 研发 产品 并且 获 了 奖 的 科学家 。
journalist interview pst develop product and win pst prize de scientist .
“The journalist interviewed the scientist who developed the product and won a prize.”

(20.b) 记者 采访 了 研发 产品 并且 获 了 奖 。
journalist interview pst develop product and win pst prize .
“The journalist interviewed who developed the product and won a prize.”

Embedded SRCs as modifier:

(21.a) 记者 采访 了 研发 帮助 老人 的 产品 的 科学家 。
journalist interview pst develop help elderly de product de scientist .
“The journalist interviewed the scientist who developed the product that helps the elderly.”

(21.b) 记者 采访 了 研发 帮助 老人 的 产品 。
journalist interview pst develop help elderly de product .
“The journalist interviewed who developed the product that helps the elderly.”

A.2.6 Subordination

Adjectival modifier:

(22.a) 如果 内向 的 他 不 尝试 , 他 将 失去 机会 。
if introverted de he neg try , he will lose opportunity .
“If he who is introverted doesn't try, he will lose the opportunity.”

(22.b) 如果 内向 的 他 不 尝试 。
if introverted de he neg try .
“If he who is introverted doesn't try.”

ORC as modifier:

(23.a) 如果 父亲 期待 的 他 不 尝试 , 他 将 失去 机会 。
if father expect de he neg try , he will lose opportunity .
“If he who the father has expectations of doesn't try, he will lose the opportunity.”

(23.b) 如果 父亲 期待 的 他 不 尝试 。
if father expect de he neg try .
“If he who the father has expectations of doesn't try.”

SRC as modifier:

(24.a) 如果 没有 工作 的 他 不 尝试 , 他 将 失去 机会 。
if neg job de he neg try , he will lose opportunity .
“If he who has no job doesn't try, he will lose the opportunity.”

(24.b) 如果 没有 工作 的 他 不 尝试 。
if neg job de he neg try .
“If he who has no job doesn't try.”

Appendix B Model Information

(a) Model architecture sizes:

Model        # layers  # hidden units  Emb. size
LSTM         2         256             256
RNNG         2         256             256
Transformer  12        768             768
PLM          12        768             768

(b) Perplexity results on CTB test data:

Model        CTB  Xinhua
n-gram       330  332
LSTM         161  190
RNNG         227  194
Transformer  234  170
PLM          297  244

Table 3: (a) Model architecture sizes. (b) Perplexity results on CTB test data.

We find that the perplexity reported in Table 3(b) is comparatively high for the RNNG compared to that reported in dyer2016recurrent. This may be because the CTB data we use includes some informal and spoken language, such as weblogs and broadcast conversations.

Appendix C Corpus Statistics

Corpus # Tokens Vocab Size
CTB 974K 27K
Xinhua 7M 88K
Table 4: Statistics of training corpora.
Figure 5: Comparing the distribution of sentence length in Mandarin and English corpora.

Figure 5 shows differences in the distribution of sentence lengths between the Mandarin corpus and a comparable English corpus of similar register. The maximum sentence length is 242 words for Xinhua+CTB, and only 70 words for the English BLLIP-sm corpus used to train hu-etal-2020-systematic’s (hu-etal-2020-systematic) models. The distribution of sentence lengths also has a heavier tail for the Mandarin corpus than the English corpus.

Appendix D Additional Analysis

D.1 Comparing Syntactic and Semantic Test Suites

In this section, we compare how well models learn syntactic dependencies and semantic compatibility.

Recall that we group classifier–noun Compatibility and verb–noun Compatibility as semantic relationships, and Missing Object and Subordination as syntactic dependencies. We compute the accuracy scores of these two categories, as shown in Figure 6. Here we exclude the n-gram models since their performance deviates greatly from that of the other language models. Adding modifiers between the stimulus and the target region appears to narrow the gap somewhat, which might again be because, for Missing Object, we intentionally make the intervening content increasingly hard to learn, dragging down the average accuracy score for syntactic dependencies.

Figure 6: Accuracy on semantic and syntactic test suite classes, with and without intervening content. n-gram results are excluded.

Similar to testing the effect of data size, we determine statistical significance here with a linear mixed-effects model on the accuracy score. We predict accuracy with a binary indicator of whether or not the test suite is in the syntactic group defined above, including model type and test item as random factors. We find that syntactic relationships appear to be easier to learn than semantic ones regardless of the intervening content.

D.2 Intervening Content in Missing Object

In this section, we describe our statistical analysis of the language models' robustness to intervening content.

We fit separate linear mixed-effects models for each language model class, with the accuracy score as the dependent variable and modifier type as the predictor. We include model seed and data size as random factors, both with random intercepts and random slopes. Table 5 summarizes the results. Recall that in Missing Object, we consider four types of modifiers: none, single SRC, coordinated SRCs, and embedded SRCs. Each cell in Table 5 gives the coefficient of a particular modifier with respect to the no-modifier baseline for a particular language model class. For the single SRC and coordinated SRC modifiers, neither the RNNG nor the PLM models show significant degradation in accuracy. All models suffer from the embedded SRC modifier, with negative coefficients that are all statistically significant. However, the RNNGs and PLMs seem to be affected the least, with coefficients smaller in magnitude than those of the vanilla LSTMs and Transformers. This suggests that structural supervision helps language models learn syntactic dependencies that are more robust to intervening content.

Model Single SRC Coordinated SRC Embedded SRC
LSTM -0.06* -0.14*** -0.3***
RNNG -0.016 -0.025 -0.175***
Transformer -0.1** -0.13* -0.35***
PLM 0.05* -0.01 -0.2***
Table 5: Coefficients of the modifiers in Missing Object with the no-modifier condition as the baseline. *: p < 0.05, **: p < 0.01, ***: p < 0.001.

Appendix E Accuracy by Model and Test Suite Class

See Table 6 for a summary of accuracy scores achieved by each model across test suite classes.

Model               classifier–noun  Garden Path object  Garden Path subject  verb–noun  Missing Object  Subordination
n-gram              0.552            0.508               0.484                0.484      0.575           0.675
LSTM                0.598            0.659               0.320                0.624      0.847           0.789
RNNG                0.609            0.690               0.359                0.714      0.838           0.854
Transformer         0.603            0.672               0.376                0.581      0.761           0.817
PLM                 0.615            0.750               0.419                0.589      0.775           0.836
n-gram-Xinhua       0.594            0.516               0.516                0.500      0.700           0.500
LSTM-Xinhua         0.650            0.691               0.355                0.782      0.819           0.850
RNNG-Xinhua         0.636            0.750               0.367                0.714      0.854           0.913
Transformer-Xinhua  0.708            0.720               0.363                0.755      0.792           0.931
PLM-Xinhua          0.746            0.656               0.395                0.745      0.692           0.908
Table 6: Accuracy score by model and test suite class. Blue denotes the best score within the CTB dataset.

Appendix F Results on Individual Test Suites

Figure 7 shows the language models' accuracy scores on the six test suite classes: classifier–noun Compatibility, Garden Path object, Garden Path subject, verb–noun Compatibility, Missing Object, and Subordination. The x-axis of each subplot shows the four types of modifiers tested for that test suite class.

(a) Classifier–Noun Compatibility
(b) Verb–Noun Compatibility
(c) Garden Path Object
(d) Garden Path Subject
(e) Missing Object
(f) Subordination
Figure 7: Accuracy on individual test suites used in our experiments.