
Multilingual Syntax-aware Language Modeling through Dependency Tree Conversion

Incorporating stronger syntactic biases into neural language models (LMs) is a long-standing goal, but research in this area often focuses on modeling English text, where constituent treebanks are readily available. Extending constituent tree-based LMs to the multilingual setting, where dependency treebanks are more common, is possible via dependency-to-constituency conversion methods. However, this raises the question of which tree formats are best for learning the model, and for which languages. We investigate this question by training recurrent neural network grammars (RNNGs) using various conversion methods, and evaluating them empirically in a multilingual setting. We examine the effect on LM performance across nine conversion methods and five languages through seven types of syntactic tests. On average, the performance of our best model represents a 19% increase in accuracy over the worst choice across all languages. Our best model also shows an advantage over sequential/overparameterized LMs, suggesting a positive effect of syntax injection in a multilingual setting. Our experiments highlight the importance of choosing the right tree formalism, and provide insights into making an informed decision.



1 Introduction

The importance of language modeling has grown considerably in recent years, as methods based on large pre-trained neural language models (LMs) have become the state of the art for many problems (devlin-etal-2019; radford-etal-2019). However, these neural LMs are based on general architectures and therefore do not explicitly model linguistic constraints; they have been shown to capture only a subset of the syntactic representations typically found in constituency treebanks (warstadt-etal-2020). An alternative line of LM research aims to explicitly model the parse tree in order to make the LM syntax-aware. A representative example of this paradigm, the recurrent neural network grammar (RNNG; dyer-etal-2016), is reported to perform better than sequential LMs on tasks that require complex syntactic analysis (kuncoro-etal-2019; hu-etal-2020; noji-oseki-2021).

The aim of this paper is to extend syntax-aware LMs to the multilingual setting. This attempt is important in two main ways. First, English has been dominant in research on syntax-aware LMs. While multilingual LMs have received increasing attention in recent years, most approaches do not explicitly model syntax, such as multilingual BERT (mBERT; devlin-etal-2019) or XLM-R (conneau-etal-2020). Although these models have shown high performance on some cross-lingual tasks (conneau-etal-2018), they perform poorly on syntactic tasks (mueller-etal-2020). Second, syntax-aware LMs have interesting properties beyond their high syntactic ability. One example is the validity of the RNNG as a cognitive model, demonstrated in an English-based setting by hale-etal-2018. Since human cognitive functions are universal while natural languages are diverse, it would be ideal to conduct such experiments on multiple languages.

The main obstacle to multilingual syntax-aware modeling is that it is unclear how to inject syntactic information during training. A straightforward approach is to make use of a multilingual treebank, such as Universal Dependencies (UD; nivre-etal-2016; nivre-etal-2020), where trees are represented in a dependency tree (DTree) formalism. matthews-etal-2019 evaluated parsing and language modeling performance on three typologically different languages using a generative dependency model. Unfortunately, they found that dependency-based models are less suited to language modeling than comparable constituency-based models, highlighting the apparent difficulty of extending syntax-aware LMs to other languages using existing resources.

This paper revisits the issue of the difficulty of constructing multilingual syntax-aware LMs, by exploring the performance of multilingual language modeling using constituency-based models. Since our domain is a multilingual setting, our focus turns to how dependency-to-constituency conversion techniques result in different trees, and how these trees affect the model’s performance. We obtain constituency treebanks from UD-formatted dependency treebanks of five languages using nine tree conversion methods. These treebanks are in turn used to train an RNNG, which we evaluate on perplexity and CLAMS (mueller-etal-2020).

Our contributions are: (1) we propose a methodology for training multilingual syntax-aware LMs through dependency tree conversion; (2) we identify an optimal structure that brings out the potential of the RNNG across five languages; (3) we demonstrate the advantage of our multilingual RNNG over sequential/overparameterized LMs.

2 Background

2.1 Recurrent Neural Network Grammars

Action        Partial tree (stack)
NT(S)         (S
NT(NP)        (S (NP
GEN(The)      (S (NP The
GEN(pilot)    (S (NP The pilot
REDUCE        (S (NP The pilot)
NT(VP)        (S (NP The pilot) (VP

Figure 1: Illustration of stack-RNNG behavior on "The pilot laughs". The stack-LSTM represents the current partial tree, in which adjacent vectors are connected in the network. At a REDUCE action, the corresponding vectors are replaced with a closed-nonterminal embedding computed by the composition function.

RNNGs are generative models that estimate the joint probability p(x, y) of a sentence x and a constituency tree (CTree) y. The probability is decomposed over the sequence of top-down constituency parsing actions a = (a_1, ..., a_n) that produces (x, y):

    p(x, y) = ∏_{t=1}^{n} p(a_t | a_1, ..., a_{t-1})

kuncoro-etal-2017 proposed a stack-only RNNG that computes the next-action probability based on the current partial tree. Figure 1 illustrates its behavior. The model represents the current partial tree with a stack-LSTM, which holds three types of embeddings: nonterminal, word, and closed-nonterminal. The next action is predicted from the last hidden state of the stack-LSTM. There are three types of actions:

  • NT(X): Push the nonterminal embedding of the nonterminal X onto the stack.

  • GEN(x): Push the word embedding of the word x onto the stack.

  • REDUCE: Pop elements from the stack until a nonterminal embedding appears. From the nonterminal embedding and all of the popped embeddings, compute a closed-nonterminal embedding using the composition function Comp, and push it onto the stack.
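The three actions can be simulated symbolically, ignoring the neural scoring. In this minimal sketch (all names are illustrative, not the authors' implementation), open nonterminals are tagged tuples so that REDUCE can find the most recent one:

```python
def run_actions(actions):
    """Replay RNNG actions and return the final stack as bracketed strings."""
    stack = []
    for act in actions:
        if act[0] == "NT":
            stack.append(("open", act[1]))      # open nonterminal, e.g. "(S"
        elif act[0] == "GEN":
            stack.append(act[1])                # terminal word
        else:                                   # REDUCE
            popped = []
            while not isinstance(stack[-1], tuple):
                popped.append(stack.pop())
            _, label = stack.pop()              # the matching open nonterminal
            # a real RNNG would apply Comp here; we just close the bracket
            stack.append("(%s %s)" % (label, " ".join(reversed(popped))))
    return stack

actions = [("NT", "S"), ("NT", "NP"), ("GEN", "The"), ("GEN", "pilot"),
           ("REDUCE",), ("NT", "VP"), ("GEN", "laughs"),
           ("REDUCE",), ("REDUCE",)]
print(run_actions(actions))  # ['(S (NP The pilot) (VP laughs))']
```

Replaying the action sequence from Figure 1 yields the completed tree once the final REDUCE closes the S constituent.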

The RNNG can be regarded as a language model that injects syntactic knowledge explicitly, and various appealing features have been reported (kuncoro-etal-2017; hale-etal-2018). We focus on its high performance on syntactic evaluation, which is described below.

Difficulty in extending to other languages

In principle, an RNNG can be trained on any corpus that contains CTree annotation. However, it is not evident which tree formats are best in a multilingual setting: applying the same technique as for English can be inappropriate, because each language has its own characteristics that may differ from those of English. This question is the fundamental motivation for this research.

2.2 Cross-linguistic Syntactic Evaluation

To investigate the capability of LMs to capture syntax, previous work has created evaluation sets that require analysis of sentence structure (linzen-etal-2016). One typical example is subject-verb agreement, the rule that the form of a verb is determined by a grammatical category of the subject, such as person or number:

The pilot that the guards love laughs/*laugh. (1)

In (1), the form of laugh is determined by the subject pilot, not guards. This judgment requires syntactic analysis: guards is not the subject of the target verb because it sits inside the relative clause modifying the true subject pilot.

marvin-linzen-2018 designed the English evaluation set using a grammatical framework. mueller-etal-2020 extended this framework to other languages (French, German, Hebrew, and Russian) and created an evaluation set named CLAMS (Cross-Linguistic Assessment of Models on Syntax). CLAMS covers 7 categories of agreement tasks, including local agreement (e.g. The author laughs/*laugh) and non-local agreement that contains an intervening phrase between subject and verb as in (1). They evaluated LMs on CLAMS and demonstrated that sequential LMs often fail to assign a higher probability to the grammatical sentence in cases that involve non-local dependency.
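At its core, this evaluation reduces to comparing the probabilities an LM assigns to the two members of a minimal pair. A toy sketch of the protocol (the unigram log-probability table and all names here are illustrative stand-ins for a real LM):

```python
# An LM "passes" a CLAMS-style item if it assigns a higher probability
# to the grammatical variant. The toy unigram table below is a
# stand-in for a real model's per-token log-probabilities.
TOY_LOGPROBS = {"the": -1.0, "pilot": -3.0, "laughs": -4.0, "laugh": -6.0}

def score(sentence):
    # total log-probability of the sentence under the (toy) model
    return sum(TOY_LOGPROBS[w] for w in sentence.lower().split())

def prefers_grammatical(grammatical, ungrammatical):
    return score(grammatical) > score(ungrammatical)

print(prefers_grammatical("The pilot laughs", "The pilot laugh"))  # True
```

Accuracy on a test suite is then simply the fraction of minimal pairs on which the model prefers the grammatical variant.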

Previous work has explored the syntactic capabilities of LMs with these evaluation sets. kuncoro-etal-2019 compared the performance of an LSTM LM and an RNNG using the evaluation set proposed by marvin-linzen-2018, demonstrating the superiority of the RNNG in predicting agreement. noji-takamura-2020 suggested that LSTM LMs potentially have a limitation in handling object relative clauses. Since these analyses are based on English text, it is unclear whether they hold in a multilingual setting. In this paper, we investigate this point by training RNNGs in other languages and evaluating them on CLAMS.

1  Function flat(w, ldeps, rdeps):
2      lNT ← [flat(d, d.ldeps, d.rdeps) for d in ldeps];
3      rNT ← [flat(d, d.ldeps, d.rdeps) for d in rdeps];
4      return [lNT [w] rNT].removeEmptyList;
5  Function lf(w, ldeps, rdeps):
6      if ldeps is not empty then
           /* Pop left-most dependent */
7          d ← ldeps.pop();
8          lNT ← [lf(d, d.ldeps, d.rdeps)];
9          rNT ← [lf(w, ldeps, rdeps)];
10     else if rdeps is not empty then
           /* Pop right-most dependent */
11         d ← rdeps.pop();
12         lNT ← [lf(w, ldeps, rdeps)];
13         rNT ← [lf(d, d.ldeps, d.rdeps)];
14     else return [w];
15     return [lNT rNT];
Algorithm 1: lf is short for left-first conversion. We omit right-first conversion because it can be defined simply by swapping the codeblocks at lines 6-9 and 10-13 of left-first conversion.
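The two conversions in Algorithm 1 can be rendered as executable Python. In this sketch, a constituent is a nested list and leaves are bare words; the `Node` class, the leaf handling, and the example tree are our own illustrative choices, not the paper's implementation:

```python
# Dependency-to-constituency conversion sketches (illustrative reading
# of Algorithm 1, not the authors' code).

class Node:
    def __init__(self, word, ldeps=None, rdeps=None):
        self.word = word
        self.ldeps = ldeps or []   # left dependents, leftmost first
        self.rdeps = rdeps or []   # right dependents, leftmost first

def flat(node):
    """Flat conversion: a head and all its dependents form one constituent."""
    if not node.ldeps and not node.rdeps:
        return node.word           # a dependent-less token stays a terminal
    return ([flat(d) for d in node.ldeps]
            + [node.word]
            + [flat(d) for d in node.rdeps])

def lf(head, ldeps, rdeps):
    """Left-first conversion: binarize by splitting off the leftmost
    dependent first, then (once the left side is empty) the rightmost."""
    if ldeps:
        d = ldeps[0]
        return [lf(d, d.ldeps, d.rdeps), lf(head, ldeps[1:], rdeps)]
    if rdeps:
        d = rdeps[-1]
        return [lf(head, ldeps, rdeps[:-1]), lf(d, d.ldeps, d.rdeps)]
    return [head.word]

def to_str(tree):
    """Render a nested-list tree with unlabeled 'X' nonterminals."""
    if isinstance(tree, str):
        return tree
    return "(X " + " ".join(to_str(c) for c in tree) + ")"

# "The pilot laughs": The <- pilot <- laughs
root = Node("laughs", ldeps=[Node("pilot", ldeps=[Node("The")])])
print(to_str(flat(root)))                      # (X (X The pilot) laughs)
print(to_str(lf(root, root.ldeps, root.rdeps)))
# (X (X (X The) (X pilot)) (X laughs))
```

Note how flat produces one constituent per head, while left-first produces a strictly binary tree.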

3 Method: Dependency Tree Conversion

As a source of multilingual syntactic information, we use Universal Dependencies (UD), a collection of cross-linguistic dependency treebanks with a consistent annotation scheme. Since the RNNG requires CTree-formatted data for training, we perform DTree-to-CTree conversion, which is fully algorithmic and therefore works regardless of language. Our method consists of two procedures, structural conversion and nonterminal labeling: we first obtain a CTree skeleton with unlabeled nonterminal nodes, then assign labels by leveraging the syntactic information contained in the dependency annotations. While our structural conversion is identical to the baseline approach of collins-etal-1999, we introduce a novel labeling method that relies on dependency relations rather than POS tags.
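One plausible instantiation of deprel-based labeling, shown here only for illustration (the `Tok` class, the example relations, and the labeling rule are our assumptions; the paper's exact scheme may differ), labels each constituent with the UD relation of its head token:

```python
# Labeling sketch: each constituent produced by the structural
# conversion takes the dependency relation of its head token as its
# nonterminal label. This is one plausible reading, not necessarily
# the paper's scheme.

class Tok:
    def __init__(self, word, deprel, ldeps=None, rdeps=None):
        self.word = word
        self.deprel = deprel       # UD relation of this token to its head
        self.ldeps = ldeps or []
        self.rdeps = rdeps or []

def flat_labeled(node):
    """Flat structural conversion plus deprel-derived nonterminal labels."""
    if not node.ldeps and not node.rdeps:
        return node.word           # dependent-less tokens stay terminals
    kids = ([flat_labeled(d) for d in node.ldeps]
            + [node.word]
            + [flat_labeled(d) for d in node.rdeps])
    return "(" + node.deprel.upper() + " " + " ".join(kids) + ")"

# "The pilot laughs": det(pilot, The), nsubj(laughs, pilot), root(laughs)
root = Tok("laughs", "root",
           ldeps=[Tok("pilot", "nsubj", ldeps=[Tok("The", "det")])])
print(flat_labeled(root))          # (ROOT (NSUBJ The pilot) laughs)
```

Unlike POS-based labeling, the labels here encode the function of each constituent within its parent, which is the kind of information the dependency annotation provides for free.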