Meta-learning for fast cross-lingual adaptation in dependency parsing

by Anna Langedijk et al. (April 2021)

Meta-learning, or learning to learn, is a technique that can help to overcome resource scarcity in cross-lingual NLP problems, by enabling fast adaptation to new tasks. We apply model-agnostic meta-learning (MAML) to the task of cross-lingual dependency parsing. We train our model on a diverse set of languages to learn a parameter initialization that can adapt quickly to new languages. We find that meta-learning with pre-training can significantly improve upon the performance of language transfer and standard supervised learning baselines for a variety of unseen, typologically diverse, and low-resource languages, in a few-shot learning setup.


1 Introduction

The field of natural language processing (NLP) has seen substantial performance improvements due to large-scale language model pre-training Devlin et al. (2019). Whilst providing an informed starting point for subsequent task-specific fine-tuning, such models still require large annotated training sets for the task at hand Yogatama et al. (2019). This limits their applicability to a handful of languages for which such resources are available and leads to an imbalance in NLP technology's quality and availability across linguistic communities. Aiming to address this problem, recent research has focused on the development of multilingual sentence encoders, such as multilingual BERT (mBERT) Devlin et al. (2019) and XLM-R Conneau et al. (2019), trained on as many as 93 languages. Such pre-trained multilingual encoders enable zero-shot transfer of task-specific models across languages Wu and Dredze (2019), offering a possible solution to resource scarcity. Zero-shot transfer, however, is most successful among typologically similar, high-resource languages, and less so for languages distant from the training languages and in resource-lean scenarios Lauscher et al. (2020). This stresses the need to develop techniques for fast cross-lingual model adaptation that can transfer knowledge across a wide range of typologically diverse languages with limited supervision.

In this paper, we focus on the task of universal dependency (UD) parsing and present a novel approach for effective and resource-lean cross-lingual parser adaptation via meta-learning. Meta-learning is a learning paradigm that leverages previous experience from a set of tasks to solve a new task efficiently. As our goal is fast cross-lingual model adaptation, we focus on optimization-based meta-learning, where the main objective is to find a set of initial parameters from which rapid adaptation to a variety of different tasks becomes possible (Hospedales et al., 2020). Optimization-based meta-learning has been successfully applied to a variety of NLP tasks. Notable examples include neural machine translation Gu et al. (2018), semantic parsing Huang et al. (2018), pre-training text representations Lv et al. (2020), word sense disambiguation Holla et al. (2020) and cross-lingual natural language inference and question answering Nooralahzadeh et al. (2020). To the best of our knowledge, meta-learning has not yet been explored in the context of dependency parsing.

We take inspiration from recent research on universal dependency parsing Tran and Bisazza (2019); Kondratyuk and Straka (2019). We employ an existing UD parsing framework — UDify, a multi-task learning model (Kondratyuk and Straka, 2019) — and extend it to perform few-shot model adaptation to previously unseen languages via meta-learning. We pre-train the dependency parser on a high-resource language prior to applying the model-agnostic meta-learning (maml) algorithm (Finn et al., 2017) to a collection of few-shot tasks in a diverse set of languages. We evaluate our model on its ability to perform few-shot adaptation to unseen languages, from as few as 20 examples. Our results demonstrate that our methods outperform language transfer and multilingual joint learning baselines, as well as existing (zero-shot) UD parsing approaches, on a range of language families, with the most notable improvements among the low-resource languages. We also investigate the role of a pre-training language as a starting point for cross-lingual adaptation and the effect of typological properties on the learning process.

2 Related work

2.1 Meta-learning

In meta-learning, the datasets are separated into episodes that correspond to training tasks. Each episode contains a support and a query set, which include samples for adaptation and evaluation, respectively. Meta-learning serves as an umbrella term for algorithms from three categories. Metric-based methods classify new samples based on their similarity to the support set (e.g. Snell et al., 2017). Model-based methods explicitly store meta-knowledge within their architectures – e.g. through an external memory (Santoro et al., 2016). Optimization-based methods, on which we focus, estimate parameter initializations that can be fine-tuned with a few steps of gradient descent (e.g. Finn et al., 2017; Nichol and Schulman, 2018).

Finn et al. (2017) proposed Model-Agnostic Meta-Learning (maml) to learn parameter initializations that generalize well to similar tasks. During the meta-training phase, maml iteratively selects a batch of episodes, on which it fine-tunes the original parameters given the support set in an inner learning loop, and tests it on the query set. The gradients of the query set with respect to the original parameters are used to update those in the outer learning loop, such that these weights become a better parameter initialization over iterations. Afterwards, during meta-testing, one selects a support set for the test task, adapts the model using that set and evaluates it on new samples from the test task.

maml has provided performance benefits for cross-lingual transfer for tasks such as machine translation (Gu et al., 2018), named entity recognition (Wu et al., 2020), hypernymy detection (Yu et al., 2020) and mapping lemmas to inflected forms (Kann et al., 2020). The closest approach to ours is by Nooralahzadeh et al. (2020), who focus on natural language inference and question answering. Their method, x-maml, involves pre-training a model on a high-resource language prior to applying maml. This yielded performance benefits over standard supervised learning for cross-lingual transfer in a zero-shot and fine-tuning setup (albeit using 2500 training samples to fine-tune on test languages). The performance gains were the largest for languages sharing morphosyntactic features. Besides the focus on dependency parsing, our approach can be distinguished from Nooralahzadeh et al. (2020) in several ways. We focus on fast adaptation from a small number of examples (using only 20, 40 or 80 sentences). Whilst they use one language for meta-training, we use seven typologically diverse languages, with the aim of explicitly learning to adapt to a variety of languages.

2.2 Universal dependency parsing

The Universal Dependencies project is an ongoing community effort to construct a cross-linguistically consistent morphosyntactic annotation scheme (Nivre, 2018). The project makes results comparable across languages and eases the evaluation of cross-lingual (structure) learning. The task of dependency parsing involves predicting a dependency tree for an input sentence, which is a directed graph of binary, asymmetrical arcs between words. These arcs are labeled and denote dependency relation types, which hold between a head-word and its dependent. A parser is tasked to assign rankings to the space of all possible dependency graphs and to select the optimal candidate.

Dependency parsing of under-resourced languages has long been of substantial interest in NLP. Well-performing UD parsers, such as the winning model in the CoNLL 2018 Shared Task by Che et al. (2018), do not necessarily perform well on low-resource languages (Zeman et al., 2018). Cross-lingual UD parsing is typically accomplished by projecting annotations between languages with parallel corpora (Agić et al., 2014), through model transfer (e.g. Guo et al., 2015; Ammar et al., 2016; Ahmad et al., 2018), through hybrid methods combining annotation projections and model transfer (Tiedemann et al., 2014), or by aligning word embeddings across languages (Schuster et al., 2019).

State-of-the-art methods for cross-lingual dependency parsing exploit pre-trained mBERT with a dependency parsing classification layer that is fine-tuned on treebanks of high-resource languages, and transferred to new languages: Wu and Dredze (2019) only fine-tune on English, whereas Tran and Bisazza (2019) experiment with multiple sets of fine-tuning languages. Including diverse language families and scripts benefits transfer to low-resource languages, in particular. UDify, the model of Kondratyuk and Straka (2019), is jointly fine-tuned on data from 75 languages, with a multi-task learning objective that combines dependency parsing with predicting part-of-speech tags, morphological features, and lemmas. Üstün et al. (2020), instead, freeze the mBERT parameters and train adapter modules that are interleaved with mBERT’s layers, and take a language embedding as input. This embedding is predicted from typological features. Model performance strongly relies on the availability of those features, since using proxy embeddings from different languages strongly degrades low-resource languages’ performance.

3 Dataset

We use data from the Universal Dependencies v2.3 corpus (Nivre, 2018). We use treebanks from 26 languages that are selected for their typological diversity. We adopt the categorization of high-resource and low-resource languages from Tran and Bisazza (2019) and employ their set of training and test languages for comparability. The set covers languages from six language families (Indo-European, Korean, Afro-Asiatic, Uralic, Dravidian, Austro-Asiatic) and 16 subfamilies. Their training set (expMix) includes eight languages: English, Arabic, Czech, Hindi, Italian, Korean, Norwegian, and Russian. These languages fall into the language families of Indo-European, Korean and Afro-Asiatic and have diverse word orders (i.e. VSO, SVO and SOV). Joint learning on data from this diverse set yielded state-of-the-art zero-shot transfer performance on low-resource languages in the experiments of Tran and Bisazza (2019).

Per training language we use up to 20,000 example trees, predicting dependency arc labels from 132 classes total. We select Bulgarian (Indo-European) and Telugu (Dravidian) as validation languages to improve generalization to multiple language families. Among the 16 unseen test languages, six are low-resource ones: Armenian, Breton, Buryat, Faroese, Kazakh, and Upper Sorbian. The remaining test languages are high-resource: Finnish, French, Japanese, Persian, German, Tamil, Urdu, Vietnamese, Hungarian, Swedish. The test languages cover three new language families that were unseen during training, i.e. Austro-Asiatic, Dravidian, and Uralic. Furthermore, three of our test languages (Buryat, Faroese, and Upper Sorbian) are not included in the pre-training of mBERT. We refer the reader to Appendix B for details about the treebank sizes and the language families.

4 Method

4.1 The UDify model

The UDify model concurrently predicts part-of-speech tags, morphological features, lemmas and dependency trees Kondratyuk and Straka (2019). UDify exploits the pre-trained mBERT model (Devlin et al., 2019), which is a self-attention network with 12 transformer encoder layers.

The model takes single sentences as input. Each sentence is tokenized into subword units using mBERT's word piece tokenizer, after which contextual embedding lookup provides input for the self-attention layers. A weighted sum of the outputs of all layers is computed (Equation 1) and fed to a task-specific classifier:

$e_j^{task} = c^{task} \sum_{i=1}^{12} \mathrm{softmax}(s^{task})_i \, U_{i,j}$     (1)

Here, $e_j^{task}$ denotes the contextual output embedding for the task at token position $j$; in our case, the task is UD parsing. In contrast to the multi-task objective of the original UDify model, our experiments only involve UD parsing. The term $U_{i,j}$ represents the mBERT representation for layer $i$ at token position $j$. The terms $s_i^{task}$ and $c^{task}$ denote trainable scalars, where the former weight the mBERT layers and the latter scales the normalized average. For words that were tokenized into multiple word pieces, only the first word piece was fed to the UD-parsing classifier.
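To make the layer-mixing step concrete, the following is a minimal PyTorch sketch of Equation 1. It assumes the twelve mBERT layer outputs have already been stacked into one tensor; the class and tensor names are illustrative and are not taken from the UDify codebase.

```python
import torch
import torch.nn as nn

class LayerAttention(nn.Module):
    """Scalar-weighted sum of mBERT layer outputs (a sketch of Equation 1)."""

    def __init__(self, num_layers: int = 12):
        super().__init__()
        # s: one trainable scalar per mBERT layer, softmax-normalized in forward().
        self.s = nn.Parameter(torch.zeros(num_layers))
        # c: trainable scalar that rescales the normalized average.
        self.c = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs: torch.Tensor) -> torch.Tensor:
        # layer_outputs: (num_layers, batch, seq_len, hidden)
        weights = torch.softmax(self.s, dim=0)                      # (num_layers,)
        mixed = (weights[:, None, None, None] * layer_outputs).sum(dim=0)
        return self.c * mixed                                       # (batch, seq_len, hidden)

# Dummy usage: mix 12 layers of 768-dimensional outputs for a batch of 2 sentences.
e_task = LayerAttention()(torch.randn(12, 2, 16, 768))
```

The mixed embedding of a word's first word piece is what the UD-parsing classifier receives.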

The UD-parsing classifier is a graph-based biaffine attention classifier (Dozat and Manning, 2017) that projects the embeddings through arc-head and arc-dep feedforward layers. The resulting outputs are combined using biaffine attention to produce a probability distribution of arc heads for each word. Finally, the dependency tree is decoded using the Chu-Liu/Edmonds algorithm (Chu, 1965; Edmonds, 1967). We refer the reader to the work of Kondratyuk and Straka (2019) for further details on the architecture and its training procedure.
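The arc-scoring part of the classifier can be sketched as below. This is a simplified illustration of graph-based biaffine attention rather than the UDify implementation: label scoring and Chu-Liu/Edmonds decoding are omitted, and the dimensions merely follow Appendix A.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Simplified biaffine attention producing head scores for each word."""

    def __init__(self, hidden: int = 768, arc_dim: int = 768):
        super().__init__()
        self.arc_head = nn.Sequential(nn.Linear(hidden, arc_dim), nn.ELU())
        self.arc_dep = nn.Sequential(nn.Linear(hidden, arc_dim), nn.ELU())
        self.W = nn.Parameter(torch.zeros(arc_dim, arc_dim))  # biaffine weight
        self.b = nn.Parameter(torch.zeros(arc_dim))           # head bias term

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, hidden), one vector per word.
        head = self.arc_head(embeddings)                        # (batch, seq, arc_dim)
        dep = self.arc_dep(embeddings)                          # (batch, seq, arc_dim)
        # scores[b, i, j]: score of word j being the head of word i.
        scores = dep @ self.W @ head.transpose(1, 2)            # (batch, seq, seq)
        scores = scores + (head @ self.b).unsqueeze(1)          # broadcast head bias
        # Distribution over candidate heads for every dependent word.
        return torch.log_softmax(scores, dim=-1)

arc_log_probs = BiaffineArcScorer()(torch.randn(2, 16, 768))    # (2, 16, 16)
```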

4.2 Meta-learning procedure

We apply first-order maml to the UDify model (for details on first-order versus second-order maml, see Finn et al. (2017) and Holla et al. (2020)). The model's self-attention layers are initialized with parameters from mBERT and the classifier's feedforward layers are randomly initialized. The model is pre-trained on a high-resource language using standard supervised learning and further meta-trained on a set of seven languages with maml. It is then evaluated using meta-testing. We refer to maml with pre-training as simply maml. The entire procedure can be described as follows:

  1. Pre-train on a high-resource language to yield the initial parameters $\theta$.

  2. Meta-train on all other training languages. For each language $l$, we partition the UD training data into two disjoint sets, $\mathcal{D}_l^{tr}$ and $\mathcal{D}_l^{test}$, and perform the following inner loop:

    (a) Temporarily update the model parameters with stochastic gradient descent on a support set $\mathcal{S}_l$, sampled from $\mathcal{D}_l^{tr}$, with step size $\alpha$ for $k$ gradient descent adaptation steps. When using a single gradient step, the update becomes:

      $\theta_l' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{S}_l}(f_{\theta})$     (2)

    (b) Compute the loss of the adapted parameters $\theta_l'$ on a query set $\mathcal{Q}_l$, sampled from $\mathcal{D}_l^{test}$, denoted by $\mathcal{L}_{\mathcal{Q}_l}(f_{\theta_l'})$.

  3. Sum up the test losses and perform a meta-update in the outer learning loop on the model with parameters $\theta$, using the step size $\beta$:

    $\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{l} \mathcal{L}_{\mathcal{Q}_l}(f_{\theta_l'})$     (3)

    In our experiments, the update is a first-order approximation, replacing $\nabla_{\theta}$ by $\nabla_{\theta_l'}$ (see the sketch after this list).

  4. After meta-training, we apply meta-testing to unseen languages. For each language, we sample a support set $\mathcal{S}$ from the UD training data. We then fine-tune our model on $\mathcal{S}$, and evaluate the model on the entire test set. Thereby, meta-testing mimics the adaptation from the inner loop. We repeat this process multiple times to get a reliable estimate of how well the model adapts to unseen languages.
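The following PyTorch sketch condenses steps 2 and 3 for one outer-loop iteration under the first-order approximation: gradients of the query loss are taken at the adapted parameters and applied directly to the shared initialization. `sample_episode` and `parse_loss` are placeholders for the corresponding parts of our UDify-based pipeline, not functions from its codebase, and the default hyperparameter values are illustrative.

```python
import copy
import torch

def first_order_maml_step(model, languages, sample_episode, parse_loss,
                          meta_optimizer, inner_lr=1e-3, inner_steps=20):
    """One outer-loop update over a batch of language episodes (a sketch)."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]

    for lang in languages:
        support, query = sample_episode(lang)                   # 20 sentences each

        # Inner loop: adapt a copy of the shared parameters on the support set.
        adapted = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            inner_opt.zero_grad()
            parse_loss(adapted, support).backward()
            inner_opt.step()

        # Query loss at the adapted parameters; under the first-order
        # approximation its gradients serve as the meta-gradient.
        adapted.zero_grad()
        parse_loss(adapted, query).backward()
        for acc, p in zip(meta_grads, adapted.parameters()):
            if p.grad is not None:
                acc += p.grad

    # Outer loop: apply the summed query gradients to the shared initialization.
    meta_optimizer.zero_grad()
    for p, g in zip(model.parameters(), meta_grads):
        p.grad = g
    meta_optimizer.step()
```

In the actual setup, the inner loop uses separate SGD learning rates for mBERT and the decoder, and Adam is used as the meta-optimizer (see Appendix A).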

5 Experimental setup

We extend the existing UDify code (github.com/Hyperparticle/udify) to be used in a meta-learning setup.

5.1 Training and evaluation

Pre-training

We use either English or Hindi as the pre-training language. This allows us to draw more general conclusions about how well maml generalizes with typologically different pre-training languages, and about the impact of pre-training prior to cross-lingual adaptation. Whilst English and many of our training languages have an SVO word order, Hindi has an SOV word order. Hindi treebanks have a larger percentage of non-projective dependency trees (Mannem et al., 2009), in which dependency arcs are allowed to cross one another. Non-projective trees are more challenging to parse (Nivre, 2009). Pre-training on Hindi thus allows us to test the effects of projectivity on cross-lingual adaptation. We pre-train for 60 epochs, during which we use UDify's default parameters.

Meta-training

We apply meta-training using seven languages, thus excluding the pre-training language from meta-training. We train for 500 episodes per language, using a cosine-based learning rate scheduler with 10% warm-up. We use the Adam optimizer (Kingma and Ba, 2014) in the outer loop and SGD in the inner loop. Support and query sets are of size 20. Due to the sequence labelling paradigm, the number of shots per class varies per batch. With 20 sentences in the support set, the average class will appear 16 times. A frequent class such as punct may appear up to 100 times, whereas less frequent classes may not appear at all.
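A small sketch of how episodes can be drawn, assuming each treebank is a list of annotated sentences and each sentence is a list of (token, head, relation) triples; only the support/query size of 20 comes from our setup, the rest is illustrative.

```python
import random
from collections import Counter

def sample_episode(treebank, support_size=20, query_size=20):
    """Draw disjoint support and query sets of sentences from one treebank."""
    sentences = random.sample(treebank, support_size + query_size)
    return sentences[:support_size], sentences[support_size:]

def shots_per_relation(sentences):
    """Count how many training 'shots' each dependency relation receives."""
    return Counter(rel for sentence in sentences for _, _, rel in sentence)
```

Because shots are counted over tokens rather than sentences, frequent relations such as punct receive on the order of a hundred shots per episode, while rare relations may receive none.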

To select hyperparameters, we independently vary the number of inner-loop updates and the learning rates in the inner loop and outer loop for mBERT and the parser, while performing meta-validation with the languages Bulgarian and Telugu. To meta-validate, we follow the procedure described in Section 4.2 for both languages, mimicking the meta-testing setup with a support set size of 20. The hyperparameters are estimated independently for Hindi and English pre-training (see Appendix A).

Meta-testing

At meta-testing time, we use SGD with the same learning rates and the same number of update steps used in the inner loop during meta-training. We vary the support set size over 20, 40 and 80 sentences, to monitor performance gains from using more data.

5.2 Baselines

We define several baselines that are evaluated using meta-testing, i.e. by fine-tuning the models on a support set of a test language prior to evaluation on that language. This allows us to directly compare their ability to adapt quickly to new languages with that of the meta-learner.

Monolingual baselines (en, hin)

These baselines measure the impact of meta-training on data from seven additional languages. The model is initialized using mBERT and trained using data from English (en) or Hindi (hin), without meta-training.

Multilingual non-episodic baseline (ne)

Instead of episodic training, this baseline treats support and query sets as regular mini-batches and updates the model parameters directly using a joint learning objective, similar to Kondratyuk and Straka (2019) and Tran and Bisazza (2019). The model is pre-trained on English or Hindi; this baseline thus indicates the advantages of maml over standard supervised learning. The training learning rate and meta-testing learning rate are estimated separately, since there is no inner-loop update in this setup.

maml without pre-training

We evaluate the effects of pre-training by running a maml setup without any pre-training. Instead, the pre-training language is included during meta-training as one of now eight languages. maml without pre-training is trained on 2000 episodes per language.

Meta-testing only

The simplest baseline is a decoder randomly initialized on top of mBERT, without pre-training and meta-training. Dependency parsing is only introduced at meta-testing time.

5.3 Evaluation

Hyperparameter selection and evaluation is performed using Labeled Attachment Score (LAS) as computed by the CoNLL 2018 Shared Task evaluation script (universaldependencies.org/conll18/evaluation.html). LAS evaluates the correctness of both the dependency class and the dependency head. We use the standard splits of Universal Dependencies for training and evaluation when available. Otherwise, we remove the support set from the test set first. We train each model with seven different seeds and compare maml to a monolingual baseline and ne using paired t-tests, adjusting for multiple comparisons using the Bonferroni correction.
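As a sketch of this statistical comparison, SciPy's paired t-test can be applied to the per-seed LAS scores of two systems on a language, with the Bonferroni correction implemented as a simple rescaling of the p-value; the scores and the number of comparisons below are placeholders, not our results.

```python
import numpy as np
from scipy.stats import ttest_rel

def compare_systems(las_a, las_b, num_comparisons):
    """Paired t-test over per-seed LAS scores, Bonferroni-corrected p-value."""
    t_stat, p_value = ttest_rel(las_a, las_b)
    return t_stat, min(1.0, p_value * num_comparisons)

# Dummy scores for seven seeds of maml and ne on one test language.
maml_scores = np.array([64.2, 63.9, 64.5, 63.7, 64.1, 64.4, 63.8])
ne_scores = np.array([63.3, 63.1, 63.6, 63.0, 63.4, 63.5, 63.2])
print(compare_systems(maml_scores, ne_scores, num_comparisons=16))
```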

6 Results and analysis

Language T&B K&S Üst. | en ne maml (support size 20) | en ne maml (support size 40) | en ne maml (support size 80)
Low-Resource Languages
Armenian 58.95 49.8 63.34 63.84 50.59 63.54 64.30 51.99 63.79 64.78
Breton 52.62 39.84 58.5 60.34 61.44 64.18 61.32 61.67 65.12 62.76 62.20 66.14
Buryat 23.11 26.28 28.9 23.66 25.56 25.77 23.82 25.67 26.38 24.17 25.88 27.33
Faroese 61.98 59.26 69.2 68.50 67.83 68.95 69.56 68.12 69.88 70.59 68.62 71.12
Kazakh 44.56 63.66 60.7 47.25 55.02 55.07 47.80 55.08 55.46 49.08 55.23 56.15
U.Sorbian 49.74 62.82 54.2 49.29 54.47 56.40 50.55 54.70 57.55 52.11 55.07 58.81
Mean 48.45 49.81 54.61 55.70 50.61 54.80 56.45 51.78 55.13 57.38
High-Resource Languages
Finnish 62.29 56.61 64.94 64.89 56.99 65.07 65.40 57.73 65.18 65.82
French 59.54 65.21 66.55 66.85 65.33 66.59 66.97 65.63 66.65 67.25
German 70.93 72.47 76.15 76.41 72.6 76.17 76.54 72.93 76.21 76.72
Hungar. 61.11 56.50 62.93 62.71 56.23 63.09 62.81 56.73 63.21 62.52
Japanese 24.10 18.87 36.49 39.06 20.05 37.15 42.17 22.80 38.40 46.81
Persian 56.92 43.43 52.55 52.81 44.53 52.76 53.63 46.42 53.11 54.74
Swedish 78.70 80.26 80.73 81.36 80.41 80.81 81.53 80.57 80.79 81.59
Tamil 32.78 31.58 41.12 44.34 32.67 41.72 46.73 34.81 42.88 50.73
Urdu 63.06 25.71 57.25 55.16 26.89 57.36 56.16 29.30 57.68 57.60
Vietnam. 29.71 43.24 42.73 43.34 43.65 42.82 43.74 44.28 43.02 44.34
Mean 53.91 49.39 58.14 58.69 49.93 58.35 59.57 51.12 58.71 60.81
Mean 51.88 49.55 56.82 57.57 50.19 57.02 58.4 51.37 57.37 59.52
Table 1: Mean LAS aligned accuracy per support set size (20, 40, 80) for unseen test languages. Best results per category are bolded; significant results are underlined. Previous work consists of Tran and Bisazza (2019) (T&B), UDify (Kondratyuk and Straka, 2019) (K&S) and UDapter (Üstün et al., 2020) (Üst.). Buryat, Faroese and Upper Sorbian were absent from mBERT pre-training.

maml with English pre-training

We report the average LAS score for models pre-trained on English in Table 1. We compare these results to related approaches that use mBERT and have multiple training languages. With support set size 20, maml already outperforms the zero-shot transfer setup of Tran and Bisazza (2019) for all test languages except Persian and Urdu. maml is competitive with UDify (Kondratyuk and Straka, 2019) and UDapter (Üstün et al., 2020) for low-resource languages, despite the stark difference in the number of training languages compared to UDify (75), and without relying on fine-grained typological features of languages, as is the case for UDapter. (UDify is trained on the low-resource languages, while we only test on them; for a fair comparison, we only list UDify results on languages with a small number of sentences in the training set, to mimic a few-shot generalisation setup.)

maml consistently outperforms the en and ne baselines. Large improvements over the en baseline are seen on low-resource and non-Germanic languages. The difference between maml and the baselines increases as the support set size becomes larger. The largest improvements over ne are on Tamil and Japanese, but ne outperforms maml on Hungarian and Urdu. maml consistently outperforms ne on low-resource languages, with an average 1.1% improvement per low-resource language for a support set of size 20, up to a 2.2% average improvement for a support set of size 80.

maml with Hindi pre-training

The results for models pre-trained on Hindi can be seen in Table 3. Although there are large differences between the monolingual en and hin baselines, both maml (hin) and ne (hin) achieve, on average, similar LAS scores to their English counterparts. maml still outperforms ne for the majority of languages: the mean improvement on low-resource languages is 0.8% per language for a support set of size 20, which increases to 1.6% per language for a support set of size 80.

Effects of pre-training

We further investigate the effectiveness of pre-training by omitting the pre-training phase. A comparison between maml and maml without pre-training is shown in Table 2. maml without pre-training underperforms for the majority of languages and its performance does not increase as much with a larger support set size. This suggests that pre-training provides a better starting point for meta-learning than plain mBERT.

In the meta-testing only setting, the fine-tuned model reaches an average LAS of 6.9% over all test languages for a support set of size 20, increasing to 15% for a support set of size 80, indicating that meta-testing alone is not sufficient to learn the task (full results are listed in Appendix C).

Language | maml maml- (support size 20) | maml maml- (support size 80)
Low-Resource Languages
Armenian 63.84 59.70 64.78 60.03
Breton 64.18 59.33 66.14 60.84
Buryat 25.77 26.02 27.33 27.05
Faroese 68.95 65.30 71.12 66.79
Kazakh 55.07 53.92 56.15 54.99
U.Sorbian 56.40 51.67 58.78 52.38
Mean 55.7 52.66 57.38 53.68
High-Resource Languages
Finnish 64.89 61.97 65.82 62.47
French 66.85 63.42 67.25 64.15
German 76.41 74.38 76.72 74.72
Hungar. 62.71 58.47 62.52 57.48
Japanese 39.06 39.72 46.81 43.87
Persian 52.81 50.31 54.74 51.08
Swedish 81.36 77.57 81.59 78.10
Tamil 44.34 46.55 50.68 50.54
Urdu 55.16 55.4 57.60 56.28
Vietnam. 43.34 42.62 44.33 43.78
Mean 58.4 55.95 59.52 56.53
Table 2: Mean LAS per unseen language for support set sizes 20 and 80, for maml without pre-training (denoted maml-) versus maml (EN). Buryat, Faroese and Upper Sorbian were absent from mBERT pre-training.
Language | hin ne maml (support size 20) | hin ne maml (support size 40) | hin ne maml (support size 80)
Low-Resource Languages
Armenian 48.41 63.30 63.76 48.87 63.41 64.17 49.70 63.59 64.76
Breton 34.06 62.09 61.56 36.09 62.40 62.47 38.95 63.05 63.75
Buryat 24.24 25.05 26.27 24.71 25.18 26.79 25.54 25.40 27.37
Faroese 50.72 65.31 66.82 52.30 65.57 67.31 54.64 66.17 68.25
Kazakh 49.80 53.77 54.23 49.90 53.94 54.45 50.49 54.08 55.00
U.Sorbian 36.22 53.36 54.97 37.08 53.58 55.64 38.22 53.94 56.56
Mean 40.57 53.81 54.60 41.49 54.01 55.14 42.92 54.37 55.95
High-Resource Languages
Finnish 50.49 64.05 64.64 50.93 64.20 65.05 51.79 64.40 65.61
French 31.16 64.44 65.73 31.59 64.44 65.68 33.39 64.42 65.69
German 44.83 74.40 75.15 45.46 74.41 75.23 46.65 74.46 75.31
Hungarian 46.72 60.98 62.51 46.97 61.33 62.89 47.91 61.68 62.91
Japanese 40.25 39.97 41.96 43.03 40.56 43.61 46.87 41.58 45.90
Persian 28.60 53.73 53.63 29.51 53.85 54.00 31.11 54.06 54.53
Swedish 46.96 79.24 79.89 47.73 79.32 80.14 49.15 79.31 80.21
Tamil 46.51 39.44 39.57 47.35 39.84 40.84 48.55 40.73 42.81
Urdu 67.72 50.64 49.16 67.96 50.93 50.16 68.17 51.50 51.57
Vietnamese 26.96 42.13 42.12 27.92 42.23 42.37 29.61 42.46 42.87
Mean 43.02 56.9 57.44 43.85 57.11 58.0 45.32 57.46 58.74
Mean 42.1 55.74 56.37 42.96 55.95 56.92 44.42 56.3 57.69
Table 3: Mean LAS aligned accuracy per unseen language for support set sizes 20, 40 and 80, for models pre-trained on Hindi. Best results per category are listed in bold; significant results are underlined. Buryat, Faroese and Upper Sorbian were absent from mBERT pre-training.

Further Analysis

Performance increases over the monolingual baselines vary strongly per language – e.g. consider the difference between Japanese and French in Table 1. The performance increase is largest for languages that differ from the pre-training language with respect to their syntactic properties. We conduct two types of analysis, based on typological features and projectivity, to quantify this effect and correlate these properties with the performance increase over the monolingual baselines. (No clear correlation was found by Tran and Bisazza (2019); by using the increase in performance rather than "plain" performance, we may see a stronger effect.)

Firstly, we use 103 binary syntactic features from URIEL (Littell et al., 2017) to compute syntactic cosine similarities between languages. With this metric, a language such as Italian is syntactically closer to English than Urdu is, even though both are Indo-European. For each unseen language, we collect the cosine similarities to each (pre-)training language.

Then, we collect the difference in performance between the monolingual baselines and the ne or maml setups. For each training language, we compute the correlations between the performance increases for the test languages and their similarity to this training language, visualised in Figure 1.

When pre-training on Hindi, there is a significant positive correlation with syntactic similarity to English and related languages. When pre-training on English, a positive correlation is seen with similarity to Hindi and Korean. Positive correlations imply that, on unseen languages, the improvement increases when similarity to the training language increases. Negative correlations mean there is less improvement when similarity to the training languages increases, suggesting that those languages do not contribute as much to adaptation. On average, the selection of meta-training languages contributes significantly to the increase in performance for the Hindi pre-training models. This effect is stronger for maml (HIN) than for ne (HIN), which may indicate that the meta-training procedure is better at incorporating knowledge from those unrelated languages.
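This analysis can be sketched as follows; the binary URIEL syntax vectors are assumed to be available (for instance via the lang2vec package), and all numeric values below are placeholders rather than our results.

```python
import numpy as np
from scipy.stats import spearmanr

def syntactic_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two binary URIEL syntax feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# For one training language: correlate each test language's similarity to it
# with that test language's LAS increase over the monolingual baseline.
similarity_to_train_lang = np.array([0.81, 0.55, 0.43, 0.90, 0.62])  # placeholders
las_increase = np.array([1.2, 4.8, 6.1, 0.9, 3.5])                   # placeholders
rho, p_value = spearmanr(similarity_to_train_lang, las_increase)
```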

Figure 1: Spearman's ρ between the performance increase over the monolingual baseline and the cosine similarity to the syntax of each training language, for models using pre-training. Asterisks mark significant correlations.

Secondly, we analyze which syntactic features impact performance most. We correlate individual URIEL features with maml’s performance increases over monolingual baselines (see Figure 2).

Features related to word order and negation show a significant correlation. Considering the presence of these features in both pre-training languages of maml, a pattern emerges: when a feature is absent in the pre-training language, there is a positive correlation with increase in performance. Similarly, when a feature is present in the pre-training language, there is a negative correlation, and thus a smaller increase in performance after meta-training. This indicates that maml is successfully adapting to these specific features during meta-training.

We analysed maml's performance improvements over ne on each of the 132 dependency relations, and found that they are consistent across relations (the same holds for the 37 coarse-grained relations: universaldependencies.org/u/dep/index.html).

Lastly, we detect non-projective dependency trees in all datasets. The Hindi treebank used has 14% non-projective trees (in line with Mannem et al., 2009), whereas English only has 5% (full results can be found in Appendix B). We correlate the increase in performance with the percentage of non-projective trees in a language's treebank. The correlation is significant for both ne (EN) and maml (EN). Figure 3 visualizes the correlation for maml (EN). We do not find significant correlations for models pre-trained on Hindi. This suggests that a model trained on a mostly projective language can benefit more from further training on non-projective languages than the other way around.
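Detecting a non-projective tree amounts to checking whether any two dependency arcs cross; a small self-contained sketch (head indices are 1-based as in CoNLL-U, with 0 denoting the root):

```python
def is_nonprojective(heads):
    """Return True if any two arcs in the tree cross.

    `heads[i-1]` is the head position of token i (1-based); 0 is the root.
    """
    arcs = [(min(dep, head), max(dep, head)) for dep, head in enumerate(heads, start=1)]
    for i, (a1, b1) in enumerate(arcs):
        for a2, b2 in arcs[i + 1:]:
            # Two arcs cross iff exactly one endpoint of one arc lies
            # strictly inside the span of the other arc.
            if a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1:
                return True
    return False

# "A hearing is scheduled on the issue today": the arc hearing -> on crosses
# the arc scheduled -> today, so the tree is non-projective.
print(is_nonprojective([2, 4, 4, 0, 2, 7, 5, 4]))  # True
print(is_nonprojective([2, 0, 2]))                 # False ("John saw Mary")
```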

Figure 2: Spearman's ρ between the performance increase over the monolingual baselines and individual URIEL features, for maml with pre-training. Features present in English and in Hindi are marked. Asterisks mark significant correlations.
Figure 3: Spearman's ρ between the percentage of non-projective dependency trees and maml's improvement over the English baseline.

7 Discussion

Our experiments show that meta-learning, specifically maml, is indeed able to adapt to unseen languages on the task of cross-lingual dependency parsing more effectively than a non-episodic model. The difference between the two methods is most apparent for languages that differ strongly from those in the training set (e.g. Japanese in Table 1), where effective few-shot adaptation is crucial. This shows that maml is successful at learning to learn from a few examples, and can efficiently incorporate new information. Furthermore, we see a clear increase in the performance of maml if we increase the test support size for the unseen languages, while ne only slightly improves. This suggests that maml may be a promising method for cross-lingual adaptation more generally, also outside of the few-shot learning scenario.

Our ablation experiments on pre-training show that it is beneficial for maml to start from a strong set of parameters, pre-trained on a high-resource language. Moreover, this benefit is not tied to a specific pre-training language: maml performs well with either Hindi or English, although improvements for unseen languages vary. When a model is pre-trained on English, there is a large positive correlation for improvements in languages that are syntactically dissimilar to English, such as Japanese and Tamil. During meta-training, dissimilar training languages such as Hindi contribute most to the model's ability to generalize. Syntactic features, especially those related to word order, that have already been learned during pre-training require less adaptation. The same is true, vice versa, for Hindi pre-training.

This effect is also observed, though only in one direction, when correlating performance increase with non-projectivity. It is beneficial to meta-train on a set of languages that vary in projectivity after pre-training on one which is mostly projective. However, not all variance is explained by the difference in typological features. The fact that maml outperforms maml without pre-training suggests that pre-training also contributes language-agnostic syntactic features, which is indeed the overall goal of multi-lingual UD models.

8 Conclusion

In this paper, we present a meta-learning approach for the task of cross-lingual dependency parsing. Our experiments show that meta-learning can improve few-shot universal dependency parsing performance on unseen, unrelated test languages, including low-resource languages and those not covered by mBERT. In addition, we see that it is beneficial to pre-train before meta-training, as in the x-maml approach (Nooralahzadeh et al., 2020). In particular, the pre-training language can affect how much adaptation is necessary on languages that are typologically different from it.

Therefore, an important direction for future research is to investigate a wider range of pre-training/meta-training language combinations, based on specific hypotheses about language relationships and relevant syntactic features. Task performance may be further improved by including a larger set of syntax-related tasks, such as POS-tagging, to sample from during meta-training (Kondratyuk and Straka, 2019).

References

Appendix A Training details and hyperparameters

Figure 4: Visualization of the maml algorithm for three training languages. The green lines represent the meta-update from the outer learning loop. The red dotted lines represent the gradient computed on the support set for each language in the inner learning loop.
Parameter Value
Dependency tag dimension 256
Dependency arc dimension 768
Dropout 0.5
BERT Dropout 0.2

Mask probability 0.2
Layer dropout 0.1
Table 4: Hyperparameters for the UDify model architecture.

All models use the same architecture: an overview of all model parameters can be seen in Table 4. The model contains 196M parameters, of which 178M are mBERT.

At pre-training time, we use the default parameters of UDify (Kondratyuk and Straka, 2019). The Adam optimizer is used with a 1e-3 learning rate for the decoder and a 5e-5 learning rate for BERT layers. Weight decay of 0.01 is applied. We employ a gradual unfreezing scheme, freezing the BERT layer weights for the first epoch.
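A sketch of this optimizer setup, assuming the model exposes its encoder as `model.bert` and its parsing layers as `model.decoder` (attribute names are illustrative, not from the UDify codebase):

```python
import torch

def build_pretraining_optimizer(model):
    """Adam with the discriminative learning rates and weight decay used above."""
    return torch.optim.Adam(
        [
            {"params": model.bert.parameters(), "lr": 5e-5},     # mBERT layers
            {"params": model.decoder.parameters(), "lr": 1e-3},  # parser decoder
        ],
        weight_decay=0.01,
    )

def set_bert_frozen(model, frozen: bool):
    """Gradual unfreezing: keep the BERT weights fixed for the first epoch."""
    for p in model.bert.parameters():
        p.requires_grad = not frozen
```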

At meta-training time, we vary the learning rates as shown in Table 5. We also vary the number of update steps at training/testing time. We applied weight decay at meta-training time in initial experiments, but this yielded no improvements. No gradual unfreezing is applied at meta-training time. We use Adam for the outer loop updates and SGD for the inner loop updates and at testing time. We sample 500 episodes per language, using a query and support set size of 20. The best hyperparameters are chosen with respect to their final performance on the meta-validation set consisting of Bulgarian and Telugu. We run two seeds for hyperparameter selection, and seven seeds for all the final models. Labeled Attachment Score (LAS) is used for hyperparameter selection and final evaluation.

We train all models on an NVIDIA TITAN RTX. Pre-training takes around 3 hours, meta-training takes around 1 hour for 100 episodes per language when the amount of updates is set to 20. For maml, this amounts to approximately 5 hours, and for the ablated maml-, it amounts to approximately 20 hours, which can be seen as another benefit of pre-training. Finally, training in a non-episodic fashion (NE) also takes up less time, namely 2 to 3 hours.

All final learning rates can be seen in Table 6. The same number of update steps was selected for all models except the random decoder baseline, whose best configuration was found through a separate hyperparameter search over higher learning rates and numbers of update steps (compensating for the lack of prior dependency parsing training).

LR mBERT Decoder
Inner/Test {1e-4, 5e-5, 1e-5} {1e-3, 5e-4, 1e-4, 5e-5}
Outer {5e-5, 1e-5, 7e-6} {1e-3, 7e-4, 5e-4, 1e-4, 5e-5}
Table 5: Learning rates independently varied for maml and ne. For the ablated maml-, only underlined learning rates were tried due to the long training times.
Inner/Test LR Outer LR
Decoder BERT Decoder BERT
Meta-test only 5e-3 1e-3 n/a n/a
en/hin 1e-4 1e-4 n/a n/a
ne (en/hin) 5e-4 1e-5 1e-4 7e-6
maml (en) 1e-3 1e-4 5e-4 1e-5
maml (hin) 5e-4 5e-5 5e-4 5e-5
maml- 1e-3 1e-5 5e-4 1e-5
Table 6: Final hyperparameters, as selected by few-shot performance on the validation set. Inner loop/Test learning rates are used with SGD, outer loop LRs are used with the Adam optimizer.

Appendix B Information about datasets used

All information about the datasets used can be found in Table 7, along with corresponding statistics about non-projective trees. The syntactic cosine similarities are visualized in Figure 5.

Appendix C Full results

We show the full results for each model in Table 8, Table 9 and Table 10.

Figure 5: Syntactic cosine similarities from each training language to every other language, calculated using URIEL's 103 binary syntactic features (Littell et al., 2017). Average cosine similarities are shown in the rightmost column and bottommost row. For instance, Japanese and Kazakh have a relatively low average cosine similarity to the training languages.
Language Family Subcategory UD Dataset Train Val. Test Non-proj. %
Low-Resource Test Languages
Armenian IE Armenian ArmTDP 560 0 470 10.2
Breton IE Celtic KEB 0 0 888 2.7
Buryat Mongolic Mongolic BDT 19 0 908 15.6
Faroese IE Germanic OFT 0 0 1208 2.7
Kazakh Turkic Northwestern KTB 31 0 1047 12.1
Upper Sorbian IE Slavic UFAL 23 0 623 11.3
High-Resource Test Languages
Finnish Uralic Finnic TDT 12217 1364 1555 6.3
French IE Romance Spoken 1153 907 726 5.5
German IE Germanic GSD 13814 799 977 9.2
Hungarian Uralic Ugric Szeged 910 441 449 27.1
Japanese Japanese Japanese GSD 7133 511 551 2.6
Persian IE Iranian Seraji 4798 599 600 6.7
Swedish IE Germanic PUD 0 0 1000 3.8
Tamil Dravidian Southern TTB 400 80 120 2.8
Urdu IE Indic UDTB 4043 552 535 22.6
Vietnamese Austro-As. Viet-Muong VTB 1400 800 800 2.9
Validation Languages
Bulgarian IE Slavic BTB 8907 1115 1116 3.1
Telugu Dravidian South Central MTG 1051 131 146 0.2
Train Languages
Arabic Afro-As. Semitic PADT 6075 909 680 9.2
Czech IE Slavic PDT 68495 9270 10148 11.9
English IE Germanic EWT 12543 2002 2077 4.8
Hindi IE Indic HDTB 13304 1659 1684 13.6
Italian IE Romance ISDT 13121 564 482 2.1
Korean Korean Korean Kaist 23010 2066 2287 21.7
Norwegian IE Germanic Nynorsk 14174 1890 1511 8.2
Russian IE Slavic SynTagRus 48814 6584 6491 8.0
Table 7: All datasets used for testing (first 16 rows) and for validation and training (final 10 rows), along with the number of sentences in each split and the percentage of non-projective trees in each dataset.
Language M.T. only en hin ne (EN) ne (HIN) maml maml (HIN) maml-
Unseen Languages
Armenian 4.970.007 49.80.005 48.410.002 63.340.002 63.30.005 63.840.002 63.760.003 59.70.004
Breton 10.770.019 60.340.003 34.060.005 61.440.005 62.090.004 64.180.003 61.560.002 59.330.005
Buryat 9.630.018 23.660.002 24.240.002 25.560.003 25.050.003 25.770.002 26.270.002 26.020.004
Faroese 13.860.024 68.50.004 50.720.004 67.830.006 65.310.006 68.950.003 66.820.002 65.30.005
Kazakh 13.970.012 47.250.004 49.80.002 55.020.002 53.770.003 55.070.002 54.230.003 53.920.005
U.Sorbian 3.440.005 49.290.004 36.220.003 54.470.003 53.360.003 56.40.004 54.970.005 51.670.004
Finnish 6.950.014 56.610.002 50.490.003 64.940.003 64.050.004 64.890.003 64.640.002 61.970.005
French 6.810.011 65.210.001 31.160.003 66.550.001 64.440.002 66.850.001 65.730.001 63.420.003
German 7.520.012 72.470.001 44.830.004 76.150.002 74.40.002 76.410.002 75.150.001 74.380.003
Hungarian 5.580.004 56.50.003 46.720.004 62.930.003 60.980.002 62.710.003 62.510.004 58.470.002
Japanese 4.020.008 18.870.002 40.250.005 36.490.008 39.970.003 39.060.003 41.960.005 39.720.007
Persian 1.910.004 43.420.005 28.660.004 52.620.006 53.780.004 52.820.005 53.590.003 50.310.004
Swedish 5.150.008 80.260.001 46.960.004 80.730.001 79.240.002 81.360.001 79.890.001 77.570.002
Tamil 5.180.013 31.580.005 46.510.003 41.120.009 39.440.006 44.340.005 39.570.008 46.550.01
Urdu 2.860.01 25.710.004 67.720.001 57.250.004 50.640.004 55.160.004 49.160.002 55.40.003
Vietnamese 7.140.008 43.240.002 26.960.002 42.730.001 42.130.002 43.340.001 42.120.001 42.620.003
Validation & Training Languages
Bulgarian 8.650.01 71.190.002 46.760.003 78.420.003 77.620.001 78.640.002 78.40.001 75.30.003
Telugu 42.360.078 64.390.018 66.780.014 68.50.006 64.80.008 69.910.01 65.80.008 67.580.008
Arabic 3.250.007 38.530.006 20.740.004 71.510.002 69.760.002 68.860.002 73.090.002 66.40.002
Czech 6.370.006 67.30.002 43.240.002 83.150.001 81.650.001 82.00.001 83.210.001 80.060.001
English 8.430.008 89.290.001 44.480.003 82.150.004 79.480.001 83.740.001 81.890.001 78.040.001
Hindi 3.380.007 35.420.002 90.990.0 76.560.002 74.030.004 74.150.003 71.330.003 74.480.004
Italian 7.150.008 82.50.001 36.860.006 87.340.002 85.280.001 86.50.001 87.180.002 83.090.002
Korean 7.820.011 36.440.004 40.30.002 66.350.003 68.040.003 63.930.003 74.080.001 63.620.005
Norwegian 5.680.013 74.70.001 43.70.003 80.090.001 77.650.001 78.670.001 81.330.002 75.610.001
Russian 6.760.013 68.940.003 47.290.005 80.960.001 79.410.001 79.930.001 81.680.001 76.480.002
Table 8: Full meta-testing results for all models and baselines, including validation and training languages, for a support set of size 20. The meta-testing only baseline is denoted as "M.T. only".
Language M.T. only en hin ne (EN) ne (HIN) maml maml (HIN) maml-
Unseen Languages
Armenian 5.820.007 50.590.005 48.870.002 63.540.002 63.410.004 64.30.002 64.170.003 59.850.004
Breton 14.520.02 61.320.004 36.090.005 61.670.005 62.40.005 65.120.003 62.470.004 59.960.004
Buryat 13.360.017 23.820.002 24.710.003 25.670.003 25.180.003 26.380.003 26.790.003 26.490.004
Faroese 20.40.019 69.560.004 52.30.005 68.120.006 65.570.006 69.880.004 67.310.003 65.950.004
Kazakh 17.110.014 47.80.004 49.90.003 55.080.002 53.940.003 55.460.003 54.450.004 54.290.005
U.Sorbian 4.490.008 50.550.005 37.080.003 54.70.004 53.580.003 57.550.004 55.640.005 52.090.005
Finnish 9.420.011 56.990.003 50.930.003 65.070.003 64.20.004 65.40.003 65.050.003 62.260.004
French 8.410.019 65.330.002 31.590.005 66.590.001 64.440.002 66.970.001 65.680.002 63.780.003
German 10.40.016 72.60.001 45.460.004 76.170.002 74.410.002 76.540.002 75.230.002 74.530.003
Hungarian 6.80.007 56.230.003 46.970.004 63.090.003 61.330.002 62.810.002 62.890.003 58.090.003
Japanese 5.850.014 20.050.003 43.030.006 37.150.008 40.560.002 42.170.004 43.610.004 41.510.006
Persian 3.320.01 44.540.004 29.550.004 52.720.006 53.850.004 53.650.005 54.020.003 50.830.005
Swedish 8.270.008 80.410.001 47.730.003 80.810.001 79.320.002 81.530.001 80.140.002 77.940.002
Tamil 9.370.018 32.670.004 47.350.004 41.720.009 39.840.004 46.730.005 40.840.006 48.540.008
Urdu 5.880.01 26.890.005 67.960.002 57.360.004 50.930.004 56.160.004 50.160.004 55.840.003
Vietnamese 9.650.014 43.650.002 27.920.002 42.820.002 42.230.003 43.740.001 42.370.001 43.230.004
Validation & Training Languages
Bulgarian 10.870.021 71.210.002 47.290.004 78.420.003 77.620.001 78.650.002 78.390.001 75.440.003
Telugu 49.110.068 66.640.014 67.70.013 68.690.006 64.750.01 70.750.012 66.10.009 67.970.007
Arabic 4.830.006 41.60.012 22.650.005 71.530.002 69.780.002 68.950.002 73.010.003 66.470.002
Czech 7.950.008 67.740.003 43.920.002 83.150.001 81.640.001 82.030.001 83.190.001 80.120.001
English 11.010.012 89.30.001 45.320.004 82.210.004 79.490.001 83.960.002 82.050.002 78.070.001
Hindi 7.640.017 36.640.002 90.990.0 76.580.002 74.240.004 74.280.002 72.160.004 74.50.004
Italian 9.30.014 82.680.001 38.610.007 87.350.002 85.280.001 86.510.001 87.370.003 83.110.002
Korean 10.260.014 36.770.004 40.740.003 66.40.003 68.070.003 64.230.003 74.010.002 63.910.004
Norwegian 9.240.015 74.70.002 44.10.006 80.060.002 77.530.004 78.690.001 81.20.004 75.640.001
Russian 9.090.012 69.30.003 48.030.005 80.980.001 79.430.001 80.00.001 81.660.001 76.570.002
Table 9: Full meta-testing results for all models and baselines, including validation and training languages, for a support set of size 40. The meta-testing only baseline is denoted as "M.T. only".
Language M.T. only en hin ne (EN) ne (HIN) maml maml (HIN) maml-
Unseen Languages
Armenian 8.190.006 51.990.005 49.70.002 63.790.002 63.590.004 64.780.003 64.760.003 60.030.003
Breton 22.540.018 62.760.004 38.950.004 62.20.006 63.050.004 66.140.003 63.750.004 60.840.004
Buryat 16.870.007 24.170.003 25.540.003 25.880.003 25.40.003 27.330.003 27.370.004 27.050.004
Faroese 27.760.019 70.590.004 54.640.005 68.620.006 66.170.005 71.120.004 68.250.003 66.790.004
Kazakh 21.890.009 49.080.004 50.490.003 55.230.003 54.080.003 56.150.003 55.00.004 54.990.005
U.Sorbian 7.490.01 52.110.005 38.220.004 55.080.004 53.940.004 58.780.005 56.560.006 52.380.005
Finnish 11.910.012 57.730.004 51.790.002 65.180.003 64.40.004 65.820.005 65.610.004 62.470.004
French 12.420.026 65.630.002 33.390.006 66.650.001 64.420.002 67.250.002 65.690.003 64.150.003
German 16.570.017 72.930.002 46.650.003 76.210.002 74.460.002 76.720.002 75.310.002 74.720.003
Hungarian 13.00.013 56.730.003 47.910.003 63.210.003 61.680.002 62.520.002 62.910.002 57.480.004
Japanese 14.380.015 22.80.004 46.870.004 38.40.007 41.580.003 46.810.003 45.90.004 43.870.005
Persian 6.160.019 46.40.006 31.110.01 53.080.006 54.010.004 54.730.006 54.540.005 51.070.004
Swedish 12.990.011 80.570.001 49.150.002 80.790.002 79.310.002 81.590.001 80.210.002 78.10.002
Tamil 18.460.011 34.810.007 48.550.002 42.880.008 40.730.004 50.680.003 42.810.006 50.540.008
Urdu 13.060.01 29.30.004 68.170.004 57.630.004 51.50.004 57.60.004 51.570.004 56.280.004
Vietnamese 15.360.015 44.280.002 29.610.002 42.990.002 42.460.003 44.330.002 42.880.002 43.780.004
Validation & Training Languages
Bulgarian 16.260.025 71.420.003 48.070.006 78.430.003 77.670.002 78.670.003 78.450.002 75.680.003
Telugu 54.480.016 69.080.011 68.790.01 68.970.006 65.050.009 71.520.012 66.860.008 68.410.008
Arabic 9.870.015 46.240.015 25.50.006 71.540.002 69.790.002 69.070.002 73.040.002 66.510.002
Czech 10.740.012 68.40.003 45.280.002 83.160.001 81.650.001 82.040.001 83.20.001 80.150.001
English 16.860.016 89.30.001 46.870.002 82.320.003 79.510.001 84.280.002 82.080.002 78.070.002
Hindi 16.70.018 39.250.003 90.960.0 76.610.002 74.650.004 74.460.003 73.30.003 74.630.004
Italian 16.860.027 82.960.001 41.80.009 87.350.002 85.290.001 86.570.002 87.390.003 83.170.002
Korean 15.160.017 37.770.005 41.530.003 66.460.003 68.160.003 64.360.004 74.050.002 64.210.005
Norwegian 13.080.012 74.930.002 45.30.004 80.080.002 77.560.004 78.760.001 81.220.004 75.690.001
Russian 13.370.012 69.790.003 49.020.004 81.010.001 79.450.001 80.040.001 81.670.001 76.560.002
Table 10: Full meta-testing results for all models and baselines, including validation and training languages, for a support set of size 80. The meta-testing only baseline is denoted as "M.T. only".