
Multilingual Syntax-aware Language Modeling through Dependency Tree Conversion

Incorporating stronger syntactic biases into neural language models (LMs) is a long-standing goal, but research in this area often focuses on modeling English text, where constituent treebanks are readily available. Extending constituent tree-based LMs to the multilingual setting, where dependency treebanks are more common, is possible via dependency-to-constituency conversion methods. However, this raises the question of which tree formats are best for learning the model, and for which languages. We investigate this question by training recurrent neural network grammars (RNNGs) using various conversion methods, and evaluating them empirically in a multilingual setting. We examine the effect on LM performance across nine conversion methods and five languages through seven types of syntactic tests. On average, the performance of our best model represents a 19% increase in accuracy over the worst choice across all languages. Our best model also shows an advantage over sequential/overparameterized LMs, suggesting a positive effect of syntax injection in a multilingual setting. Our experiments highlight the importance of choosing the right tree formalism, and provide insights into making an informed decision.



1 Introduction

The importance of language modeling has grown considerably in recent years, as methods based on large pre-trained neural language models (LMs) have become the state of the art for many problems (devlin-etal-2019; radford-etal-2019). However, these neural LMs are based on general architectures and therefore do not explicitly model linguistic constraints; they have been shown to capture only a subset of the syntactic representations typically found in constituency treebanks (warstadt-etal-2020). An alternative line of LM research aims to explicitly model the parse tree in order to make the LM syntax-aware. A representative example of this paradigm, the recurrent neural network grammar (RNNG; dyer-etal-2016), is reported to perform better than sequential LMs on tasks that require complex syntactic analysis (kuncoro-etal-2019; hu-etal-2020; noji-oseki-2021).

The aim of this paper is to extend syntax-aware LMs to the multilingual setting. This attempt is important in two main ways. First, English has been dominant in research on syntax-aware LMs. While multilingual LMs have received increasing attention in recent years, most approaches do not explicitly model syntax, such as multilingual BERT (mBERT; devlin-etal-2019) or XLM-R (conneau-etal-2020). Although these models have shown high performance on some cross-lingual tasks (conneau-etal-2018), they perform poorly on syntactic tasks (mueller-etal-2020). Second, syntax-aware LMs have interesting properties beyond their high syntactic ability. One example is the validity of the RNNG as a cognitive model, demonstrated in an English-based setting by hale-etal-2018. Since human cognitive functions are universal while natural languages are diverse, it would be ideal to conduct such experiments on multiple languages.

The main obstacle to multilingual syntax-aware modeling is that it is unclear how to inject syntactic information during training. A straightforward approach is to make use of a multilingual treebank, such as Universal Dependencies (UD; nivre-etal-2016; nivre-etal-2020), where trees are represented in a dependency tree (DTree) formalism. matthews-etal-2019 evaluated parsing and language modeling performance on three typologically different languages using a generative dependency model. Unfortunately, they found that dependency-based models are less suited to language modeling than comparable constituency-based models, highlighting the apparent difficulty of extending syntax-aware LMs to other languages using existing resources.

This paper revisits the issue of the difficulty of constructing multilingual syntax-aware LMs, by exploring the performance of multilingual language modeling using constituency-based models. Since our domain is a multilingual setting, our focus turns to how dependency-to-constituency conversion techniques result in different trees, and how these trees affect the model’s performance. We obtain constituency treebanks from UD-formatted dependency treebanks of five languages using nine tree conversion methods. These treebanks are in turn used to train an RNNG, which we evaluate on perplexity and CLAMS (mueller-etal-2020).

Our contributions are: (1) we propose a methodology for training multilingual syntax-aware LMs through dependency tree conversion; (2) we identify an optimal structure that brings out the potential of the RNNG across five languages; (3) we demonstrate the advantage of our multilingual RNNG over sequential/overparameterized LMs.

2 Background

2.1 Recurrent Neural Network Grammars

Action        Partial tree (stack)
NT(S)         (S
NT(NP)        (S (NP
GEN(The)      (S (NP The
GEN(pilot)    (S (NP The pilot
REDUCE        (S (NP The pilot)
NT(VP)        (S (NP The pilot) (VP

Figure 1: Illustration of stack-RNNG behavior on "The pilot laughs". The stack-LSTM represents the current partial tree, in which adjacent vectors are connected in the network. At a REDUCE action, the corresponding vectors are replaced with a closed-nonterminal embedding computed by the composition function.

RNNGs are generative models that estimate the joint probability p(x, y) of a sentence x and a constituency tree (CTree) y. The probability is decomposed over the sequence of top-down constituency parsing actions a = (a_1, ..., a_n) that produces (x, y):

    p(x, y) = ∏_{t=1}^{n} p(a_t | a_1, ..., a_{t-1})

kuncoro-etal-2017 proposed a stack-only RNNG that computes the next-action probability based on the current partial tree. Figure 1 illustrates its behavior. The model represents the current partial tree with a stack-LSTM, which holds three types of embeddings: nonterminal, word, and closed-nonterminal. The next action is predicted from the last hidden state of the stack-LSTM. There are three types of actions:

  • NT(X): Push the nonterminal embedding of the nonterminal X onto the stack.

  • GEN(x): Push the word embedding of the word x onto the stack.

  • REDUCE: Pop elements from the stack until a nonterminal embedding appears. From the nonterminal embedding and all of the popped embeddings, compute a closed-nonterminal embedding using the composition function Comp, and push it onto the stack.
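The three actions can be simulated symbolically, ignoring the neural scoring. In this minimal sketch (all names are illustrative, not the authors' implementation), open nonterminals are tagged tuples so that REDUCE can find the most recent one:

```python
def run_actions(actions):
    """Replay RNNG actions and return the final stack as bracketed strings."""
    stack = []
    for act in actions:
        if act[0] == "NT":
            stack.append(("open", act[1]))      # open nonterminal, e.g. "(S"
        elif act[0] == "GEN":
            stack.append(act[1])                # terminal word
        else:                                   # REDUCE
            popped = []
            while not isinstance(stack[-1], tuple):
                popped.append(stack.pop())
            _, label = stack.pop()              # the matching open nonterminal
            # a real RNNG would apply Comp here; we just close the bracket
            stack.append("(%s %s)" % (label, " ".join(reversed(popped))))
    return stack

actions = [("NT", "S"), ("NT", "NP"), ("GEN", "The"), ("GEN", "pilot"),
           ("REDUCE",), ("NT", "VP"), ("GEN", "laughs"),
           ("REDUCE",), ("REDUCE",)]
print(run_actions(actions))  # ['(S (NP The pilot) (VP laughs))']
```

Replaying the action sequence from Figure 1 yields the completed tree once the final REDUCE closes the S constituent.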

The RNNG can be regarded as a language model that injects syntactic knowledge explicitly, and various appealing features have been reported (kuncoro-etal-2017; hale-etal-2018). We focus on its high performance on syntactic evaluation, which is described below.

Difficulty in extending to other languages

In principle, an RNNG can be trained on any corpus that contains CTree annotation. However, it is not evident which tree formats are best in a multilingual setting: applying the same technique as for English can be inappropriate, because each language has its own characteristics that may differ from those of English. This question is the fundamental motivation for this research.

2.2 Cross-linguistic Syntactic Evaluation

To investigate the capability of LMs to capture syntax, previous work has created evaluation sets that require analysis of sentence structure (linzen-etal-2016). One typical example is subject-verb agreement, the rule that the form of a verb is determined by a grammatical category of the subject, such as person or number:

The pilot that the guards love laughs/*laugh. (1)

In (1), the form of laugh is determined by the subject pilot, not guards. This judgment requires syntactic analysis: guards is not the subject of the target verb because it sits inside the relative clause modifying the true subject pilot.

marvin-linzen-2018 designed the English evaluation set using a grammatical framework. mueller-etal-2020 extended this framework to other languages (French, German, Hebrew, and Russian) and created an evaluation set named CLAMS (Cross-Linguistic Assessment of Models on Syntax). CLAMS covers 7 categories of agreement tasks, including local agreement (e.g. The author laughs/*laugh) and non-local agreement that contains an intervening phrase between subject and verb as in (1). They evaluated LMs on CLAMS and demonstrated that sequential LMs often fail to assign a higher probability to the grammatical sentence in cases that involve non-local dependency.
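At its core, this evaluation reduces to comparing the probabilities an LM assigns to the two members of a minimal pair. A toy sketch of the protocol (the unigram log-probability table and all names here are illustrative stand-ins for a real LM):

```python
# An LM "passes" a CLAMS-style item if it assigns a higher probability
# to the grammatical variant. The toy unigram table below is a
# stand-in for a real model's per-token log-probabilities.
TOY_LOGPROBS = {"the": -1.0, "pilot": -3.0, "laughs": -4.0, "laugh": -6.0}

def score(sentence):
    # total log-probability of the sentence under the (toy) model
    return sum(TOY_LOGPROBS[w] for w in sentence.lower().split())

def prefers_grammatical(grammatical, ungrammatical):
    return score(grammatical) > score(ungrammatical)

print(prefers_grammatical("The pilot laughs", "The pilot laugh"))  # True
```

Accuracy on a test suite is then simply the fraction of minimal pairs on which the model prefers the grammatical variant.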

Previous work has explored the syntactic capabilities of LMs with these evaluation sets. kuncoro-etal-2019 compared the performance of an LSTM LM and an RNNG using the evaluation set proposed by marvin-linzen-2018, demonstrating the superiority of the RNNG in predicting agreement. noji-takamura-2020 suggested that LSTM LMs potentially have a limitation in handling object relative clauses. Since these analyses are based on English text, it is unclear whether they hold in a multilingual setting. In this paper, we investigate this point by training RNNGs in other languages and evaluating them on CLAMS.

1  Function flat(w, ldeps, rdeps):
2      lNT ← [flat(d, d.ldeps, d.rdeps) for d in ldeps];
3      rNT ← [flat(d, d.ldeps, d.rdeps) for d in rdeps];
4      return [lNT [w] rNT].removeEmptyList;
5  Function lf(w, ldeps, rdeps):
6      if ldeps is not empty then
           /* Pop left-most dependent */
7          d ← ldeps.pop();
8          lNT ← [lf(d, d.ldeps, d.rdeps)];
9          rNT ← [lf(w, ldeps, rdeps)];
10     else if rdeps is not empty then
           /* Pop right-most dependent */
11         d ← rdeps.pop();
12         lNT ← [lf(w, ldeps, rdeps)];
13         rNT ← [lf(d, d.ldeps, d.rdeps)];
14     else return [w];
15     return [lNT rNT];
Algorithm 1: lf is short for left-first conversion. We omit right-first conversion because it can be defined simply by swapping the codeblocks at lines 6-9 and 10-13 of left-first conversion.
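The two conversions in Algorithm 1 can be rendered as executable Python. In this sketch, a constituent is a nested list and leaves are bare words; the `Node` class, the leaf handling, and the example tree are our own illustrative choices, not the paper's implementation:

```python
# Dependency-to-constituency conversion sketches (illustrative reading
# of Algorithm 1, not the authors' code).

class Node:
    def __init__(self, word, ldeps=None, rdeps=None):
        self.word = word
        self.ldeps = ldeps or []   # left dependents, leftmost first
        self.rdeps = rdeps or []   # right dependents, leftmost first

def flat(node):
    """Flat conversion: a head and all its dependents form one constituent."""
    if not node.ldeps and not node.rdeps:
        return node.word           # a dependent-less token stays a terminal
    return ([flat(d) for d in node.ldeps]
            + [node.word]
            + [flat(d) for d in node.rdeps])

def lf(head, ldeps, rdeps):
    """Left-first conversion: binarize by splitting off the leftmost
    dependent first, then (once the left side is empty) the rightmost."""
    if ldeps:
        d = ldeps[0]
        return [lf(d, d.ldeps, d.rdeps), lf(head, ldeps[1:], rdeps)]
    if rdeps:
        d = rdeps[-1]
        return [lf(head, ldeps, rdeps[:-1]), lf(d, d.ldeps, d.rdeps)]
    return [head.word]

def to_str(tree):
    """Render a nested-list tree with unlabeled 'X' nonterminals."""
    if isinstance(tree, str):
        return tree
    return "(X " + " ".join(to_str(c) for c in tree) + ")"

# "The pilot laughs": The <- pilot <- laughs
root = Node("laughs", ldeps=[Node("pilot", ldeps=[Node("The")])])
print(to_str(flat(root)))                      # (X (X The pilot) laughs)
print(to_str(lf(root, root.ldeps, root.rdeps)))
# (X (X (X The) (X pilot)) (X laughs))
```

Note how flat produces one constituent per head, while left-first produces a strictly binary tree.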

3 Method: Dependency Tree Conversion

As a source of multilingual syntactic information, we use Universal Dependencies (UD), a collection of cross-linguistic dependency treebanks with a consistent annotation scheme. Since the RNNG requires CTree-formatted data for training, we perform DTree-to-CTree conversion, which is fully algorithmic and therefore works regardless of language. Our method consists of two procedures, structural conversion and nonterminal labeling: we first obtain a CTree skeleton with unlabeled nonterminal nodes, then assign labels by leveraging the syntactic information contained in the dependency annotations. While our structural conversion is identical to the baseline approach of collins-etal-1999, we introduce a novel labeling method that relies on dependency relations rather than POS tags.
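One plausible instantiation of deprel-based labeling, shown here only for illustration (the `Tok` class, the example relations, and the labeling rule are our assumptions; the paper's exact scheme may differ), labels each constituent with the UD relation of its head token:

```python
# Labeling sketch: each constituent produced by the structural
# conversion takes the dependency relation of its head token as its
# nonterminal label. This is one plausible reading, not necessarily
# the paper's scheme.

class Tok:
    def __init__(self, word, deprel, ldeps=None, rdeps=None):
        self.word = word
        self.deprel = deprel       # UD relation of this token to its head
        self.ldeps = ldeps or []
        self.rdeps = rdeps or []

def flat_labeled(node):
    """Flat structural conversion plus deprel-derived nonterminal labels."""
    if not node.ldeps and not node.rdeps:
        return node.word           # dependent-less tokens stay terminals
    kids = ([flat_labeled(d) for d in node.ldeps]
            + [node.word]
            + [flat_labeled(d) for d in node.rdeps])
    return "(" + node.deprel.upper() + " " + " ".join(kids) + ")"

# "The pilot laughs": det(pilot, The), nsubj(laughs, pilot), root(laughs)
root = Tok("laughs", "root",
           ldeps=[Tok("pilot", "nsubj", ldeps=[Tok("The", "det")])])
print(flat_labeled(root))          # (ROOT (NSUBJ The pilot) laughs)
```

Unlike POS-based labeling, the labels here encode the function of each constituent within its parent, which is the kind of information the dependency annotation provides for free.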