Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank

09/29/2020 ∙ by Ethan C. Chau, et al. ∙ University of Washington

Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties. This presents a challenge for language varieties unfamiliar to these models, whose labeled and unlabeled data is too limited to train a monolingual model effectively. We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings. Using dependency parsing of four diverse low-resource language varieties as a case study, we show that these methods significantly improve performance over baselines, especially in the lowest-resource cases, and demonstrate the importance of the relationship between such models' pretraining data and target language varieties.




1 Introduction

Contextual word representations (CWRs) from pretrained language models have improved many NLP systems. Such language models include BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018), which are conventionally “pretrained” on large unlabeled datasets before their internal representations are “finetuned” during supervised training on downstream tasks like parsing. However, many language varieties lack large annotated and even unannotated datasets, raising questions about the broad applicability of such data-hungry methods. [Footnote 1: Sociolinguists define “language varieties” broadly to encompass any distinct form of a language. In addition to standard varieties (conventionally referred to as “languages”), this includes dialects, registers, and styles (Trudgill, 2003).]

One exciting way to compensate for the lack of unlabeled data in low-resource language varieties is to finetune a large, multilingual language model that has been pretrained on the union of many languages’ data (Devlin et al., 2019; Lample and Conneau, 2019). This enables the model to transfer some of what it learns from high-resource languages to low-resource ones, demonstrating benefits over monolingual methods in some cases (Conneau et al., 2020a; Tsai et al., 2019), though not always (Agerri et al., 2020; Rönnqvist et al., 2019).

Specifically, multilingual models face the transfer-dilution tradeoff (Conneau et al., 2020a): increasing the number of languages during pretraining improves positive crosslingual transfer but decreases the model capacity allocated to each language. Furthermore, such models are only pretrained on a finite amount of data and may lack exposure to specialized domains of certain languages or even entire low-resource language varieties. The result is a challenge for these language varieties, which must rely on positive transfer from a sufficient number of similar high-resource languages. Indeed, Wu and Dredze (2020) find that multilingual models often underperform monolingual baselines for such languages and question their off-the-shelf viability.

We take inspiration from previous work on domain adaptation, where general-purpose monolingual models have been effectively adapted to specialized domains through additional pretraining on domain-specific corpora (Gururangan et al., 2020). We hypothesize that we can improve the performance of multilingual models on low-resource language varieties analogously, through additional pretraining on language-specific corpora.

However, additional pretraining on more data in the target language does not ensure its full representation in the model’s vocabulary, which is constructed to maximally represent the model’s original pretraining data (Sennrich et al., 2016; Wu et al., 2016). Artetxe et al. (2020) find that target languages’ representation in the vocabulary affects these models’ transferability, suggesting that language varieties on the fringes of the vocabulary may not be sufficiently well-modeled. Can we incorporate vocabulary from the target language into multilingual models’ existing alignment?

We introduce the use of additional language-specific pretraining for multilingual CWRs in a low-resource setting, before use in a downstream task; to better model language-specific tokens, we also augment the existing vocabulary with frequent tokens from the low-resource language (§2). Our experiments consider dependency parsing in four typologically diverse low-resource language varieties with different degrees of relatedness to a multilingual model’s pretraining data (§3). Our results show that these methods consistently improve performance on each target variety, especially in the lowest-resource cases (§4). In doing so, we demonstrate the importance of accounting for the relationship between a multilingual model’s pretraining data and the target language variety.

Because the pretraining-finetuning paradigm is now ubiquitous, many experimental findings for one task can now inform work on other tasks. Thus, our findings on dependency parsing—whose annotated datasets cover many more low-resource language varieties than those of other NLP tasks—are expected to interest researchers and practitioners facing low-resource situations for other tasks. To this end, we make our code, data, and hyperparameters publicly available.


2 Overview

We are chiefly concerned with the adaptation of pretrained multilingual models to a target language by optimally using available data. As a case study, we use the multilingual cased BERT model (mBert) of Devlin et al. (2019), a transformer-based (Vaswani et al., 2017) language model which has produced strong CWRs for many languages (Kondratyuk and Straka, 2019, inter alia). mBert is pretrained on the 104 languages with the most Wikipedia data and encodes input tokens using a fixed wordpiece vocabulary (Wu et al., 2016) learned from this data. Low-resource languages are slightly oversampled in its pretraining data, but high-resource languages are still more prevalent, resulting in a language imbalance. [Footnote 3: Sampling is done based on an exponentially smoothed distribution of the amount of data in each language, which slightly increases the representation of low-resource languages.]
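The exponentially smoothed sampling mentioned in the footnote can be sketched as follows. This is a minimal illustration, not the procedure used to build mBert: the smoothing exponent of 0.7 and the corpus sizes are assumptions chosen only to show how smoothing raises the share of a small language.

```python
def smoothed_sampling_probs(sizes, exponent=0.7):
    """Exponentially smooth a data-size distribution over languages.

    sizes: dict mapping language -> amount of data (any positive unit).
    exponent: assumed smoothing factor in (0, 1]; 1.0 leaves the
        distribution unchanged, smaller values flatten it.
    """
    total = sum(sizes.values())
    raw = {lang: n / total for lang, n in sizes.items()}
    # Raise each probability to the exponent, then renormalize.
    smoothed = {lang: p ** exponent for lang, p in raw.items()}
    norm = sum(smoothed.values())
    return {lang: p / norm for lang, p in smoothed.items()}


# Hypothetical corpus sizes (not taken from the paper).
sizes = {"high": 1_000_000, "low": 10_000}
probs = smoothed_sampling_probs(sizes)
```

Under these assumed sizes, the low-resource language is sampled more often than its raw share of the data, but the high-resource language still dominates, which matches the imbalance described above.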

We observe that two types of target language varieties may be disadvantaged by this training scheme: the lowest-resource languages in mBert’s pretraining data (which we call Type 1); and unseen low-resource languages (Type 2). Although Type 1 languages are oversampled during training, they are still overshadowed by high-resource languages. Type 2 languages must rely purely on crosslingual vocabulary overlap. In both cases, the wordpieces that encode the input tokens in these languages may not fully capture the senses in which they are used, or they may be completely unseen. [Footnote 4: Wordpiece tokenization is done greedily based on a fixed vocabulary. The model returns a special “unknown” token for unseen characters and other subword units that cannot be represented by the vocabulary.] However, other low-resource varieties with more representation in mBert’s pretraining data (Type 0) may not be as disadvantaged. Optimally using mBert in low-resource settings thus requires accounting for limitations with respect to a target language variety.
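The greedy tokenization described in the footnote can be illustrated with a small sketch. It assumes the BERT convention of marking word-internal pieces with a "##" prefix and of mapping a whole token to the unknown symbol when any part of it cannot be matched; the toy vocabulary is hypothetical.

```python
def wordpiece_tokenize(token, vocab, unk="[UNK]"):
    """Greedy longest-match-first wordpiece tokenization of one token."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the token becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary.
toy_vocab = {"par", "##sing", "##s", "tree"}
known = wordpiece_tokenize("parsing", toy_vocab)
unknown = wordpiece_tokenize("treebank", toy_vocab)
```

Here "parsing" splits into two known pieces, while "treebank" is mapped to the unknown symbol because no vocabulary entry covers "bank".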

2.1 Methods

We evaluate three methods of adapting mBert to better model target language varieties.

Language   Type   # Sentences   # Tokens   WP/Token   UNK Tokens
ga         1      199k          3.6M       2.10       12807
mt         2      62k           1M         2.95       49791
sing       0      80k           1.2M       1.24       3
vi         0      255k          5.6M       1.33       6955
Table 1: Unlabeled dataset statistics: number of sentences, number of tokens, average wordpieces per token, and tokens containing an unknown wordpiece under original mBert vocabulary.

Language-Adaptive Pretraining (lapt)

Under the assumption that language varieties function analogously to domains for mBert, we adapt the domain-adaptive pretraining method of Gururangan et al. (2020) to a multilingual setting. With language-adaptive pretraining, mBert is pretrained for additional epochs on monolingual data in the target language variety to improve the alignment of the wordpiece embeddings.

Vocabulary Augmentation (va)

To better model unseen or language-specific wordpieces, we explore performing lapt after augmenting mBert’s vocabulary from a target language variety. We train a new wordpiece vocabulary on monolingual data in the target language, tokenize the monolingual data with the new vocabulary, and augment mBert’s vocabulary with the 99 most common wordpieces in the new vocabulary that replaced the “unknown” wordpiece token. [Footnote 5: mBert’s fixed-size vocabulary contains 99 tokens designated as “unused,” whose representations were not updated during initial pretraining and can be repurposed for vocabulary augmentation without modifying the pretrained model.] Full details of this process are given in the Appendix.

Tiered Vocabulary Augmentation (tva)

We consider a variant of va with a larger learning rate for the embeddings of the 99 new wordpieces than for the other parameters. We expect this method to learn the embeddings more thoroughly without overfitting the model’s remaining parameters. Learning rate details are given in the Appendix.
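The idea of a tiered learning rate can be sketched on a toy embedding table. Plain SGD is swapped in here for simplicity, and the learning rates and row indices are illustrative assumptions rather than the settings used in the experiments.

```python
def tiered_sgd_step(embeddings, grads, new_rows, base_lr, new_lr):
    """Apply one SGD update with a larger learning rate for new rows.

    embeddings, grads: lists of equal-length float lists, one row per
        wordpiece.
    new_rows: set of row indices that hold newly added wordpieces.
    """
    updated = []
    for i, (row, grad) in enumerate(zip(embeddings, grads)):
        # New wordpieces get the larger rate; everything else the base rate.
        lr = new_lr if i in new_rows else base_lr
        updated.append([w - lr * g for w, g in zip(row, grad)])
    return updated


# Hypothetical two-row table: row 0 is pretrained, row 1 is newly added.
emb = [[1.0, 1.0], [1.0, 1.0]]
grad = [[1.0, 1.0], [1.0, 1.0]]
out = tiered_sgd_step(emb, grad, new_rows={1}, base_lr=0.01, new_lr=0.1)
```

With identical gradients, the newly added row moves ten times further than the pretrained row, which is the behavior the tiered variant relies on.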

2.2 Evaluation

We perform evaluation on dependency parsing. Following Kondratyuk and Straka (2019), we take a weighted sum of the activations at each mBert layer as the CWR for each token. We then pass the representations into the graph-based dependency parser of Dozat and Manning (2017). This parser, which is also used in related work (Kondratyuk and Straka, 2019; Mulcaire et al., 2019a; Schuster et al., 2019), uses a biaffine attention mechanism between word representations to score a parse tree.
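A weighted sum over layers for a single token can be sketched as below. The softmax normalization of the layer weights and the global scale `gamma` are assumptions borrowed from common practice; the text above only states that a weighted sum of layer activations is used.

```python
import math


def scalar_mix(layer_activations, layer_logits, gamma=1.0):
    """Combine per-layer vectors for one token into a single vector.

    layer_activations: list of vectors (one per layer), all the same size.
    layer_logits: one unnormalized weight per layer (assumed learnable).
    gamma: assumed global scaling factor.
    """
    # Numerically stable softmax over the layer weights.
    m = max(layer_logits)
    exps = [math.exp(s - m) for s in layer_logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(layer_activations[0])
    return [
        gamma * sum(w * layer[d] for w, layer in zip(weights, layer_activations))
        for d in range(dim)
    ]


# Two hypothetical layers with equal weights.
mixed = scalar_mix([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With equal logits the result is the plain average of the layers; during training the logits would shift weight toward the layers most useful for parsing.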

3 Experiments

We consider two variants of each mBert method: one in which the pretrained CWRs are frozen; and one where they are further finetuned during parser training (ft). Following prior work involving these two variants (Beltagy et al., 2019), ft variants perform biaffine attention directly on the outputs of mBert instead of first passing them through a BiLSTM, as in Dozat and Manning (2017).

We perform additional pretraining for up to 20 epochs, selecting our final models based on average validation LAS downstream. Full training details are given in the Appendix. We report average scores and standard errors based on five random initializations. Code and data are publicly available.


3.1 Languages and Datasets

We perform experiments on four typologically diverse low-resource languages: Irish (ga), Maltese (mt), Vietnamese (vi), and Singlish (Singapore Colloquial English; sing). Singlish is an English-based creole spoken in Singapore, which incorporates lexical and syntactic borrowings from other languages spoken in Singapore: Chinese, Malay, and Tamil. Wang et al. (2017) provide an extended motivation for evaluating on Singlish.

These language varieties are exemplars of the three types discussed in §2. mBert is trained on the 104 largest Wikipedias, which includes Irish and Vietnamese but excludes Maltese and Singlish. However, the Irish Wikipedia is several orders of magnitude smaller than the full Vietnamese one. So, we view Irish and Maltese as Type 1 and Type 2 language varieties, respectively. Though Singlish lacks its own Wikipedia and is likely not included in mBert’s pretraining data per se, its component languages (English, Chinese, Malay, and Tamil) are all well-represented in the data. We thus consider it to be a Type 0 variety along with Vietnamese.

Unlabeled Datasets

Representations   Irish (ga)      Maltese (mt)    Singlish (sing)   Vietnamese (vi)
                  Type 1          Type 2          Type 0            Type 0
fastT             65.36 ± 1.33    68.23 ± 0.61    66.42 ± 0.92      53.37 ± 0.95
elmo              68.25 ± 0.37    74.33 ± 0.53    68.63 ± 1.04      56.91 ± 0.41
mBert             68.19 ± 0.43    67.06 ± 0.61    74.01 ± 0.39      62.96 ± 0.41
lapt              73.03 ± 0.25    78.51 ± 0.41    76.48 ± 0.63      64.67 ± 0.22
va                72.68 ± 0.47    79.88 ± 0.55    76.71 ± 0.70      64.28 ± 0.44
tva               73.11 ± 0.37    79.32 ± 0.45    76.92 ± 0.77      64.46 ± 0.44
mBert + ft        72.67 ± 0.22    76.74 ± 0.35    78.24 ± 0.52      66.13 ± 0.38
lapt + ft         75.45 ± 0.28    82.77 ± 0.24    79.30 ± 0.57      67.50 ± 0.25
va + ft           76.17 ± 0.08    83.53 ± 0.21    79.89 ± 0.46      67.28 ± 0.38
tva + ft          76.23 ± 0.22    83.16 ± 0.25    80.09 ± 0.34      67.82 ± 0.27
Table 2: Results (LAS) on downstream UD parsing, with standard deviations from five random initializations. Bolded results are within one standard deviation of the maximum for each category (frozen/ft).

Additional pretraining for Irish, Maltese, and Vietnamese uses unlabeled articles from Wikipedia. To simulate a truly low-resource setting for Vietnamese, we use a random sample of 5% of the articles. Singlish data is crawled from the SG Talk Forum and provided by Wang et al. (2017). To ensure robust evaluation, we remove all sentences that appear in the labeled validation and test sets from the unlabeled data. Full details are provided in the Appendix.

Tab. 1 gives the average number of wordpieces per token and the number of tokens with unknown wordpieces in each of the unlabeled datasets, computed based on the original mBert vocabulary. While the high number of wordpieces per token for Irish and Maltese may be due in part to morphological richness, it also suggests that these languages stand to benefit more from improved alignment of the wordpieces’ embeddings. Furthermore, the higher rates of unknown wordpieces leave room for enhanced performance with an improved vocabulary.
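The two statistics discussed here can be computed from any wordpiece tokenization of a corpus. The sketch below assumes the corpus has already been tokenized into one list of wordpieces per token and that the unknown symbol is written "[UNK]"; the example input is hypothetical.

```python
def wordpiece_stats(tokenized, unk="[UNK]"):
    """Return (average wordpieces per token, tokens containing an unknown).

    tokenized: list of wordpiece lists, one list per original token.
    """
    n_tokens = len(tokenized)
    n_pieces = sum(len(pieces) for pieces in tokenized)
    # A token counts once, however many unknown pieces it contains.
    n_unk_tokens = sum(1 for pieces in tokenized if unk in pieces)
    return n_pieces / n_tokens, n_unk_tokens


# Hypothetical three-token corpus.
example = [["par", "##sing"], ["[UNK]"], ["tree"]]
wp_per_token, unk_tokens = wordpiece_stats(example)
```

A higher wordpieces-per-token ratio means words are fragmented into more pieces, and a higher unknown count means more tokens that the vocabulary cannot represent at all.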

Labeled Datasets

Parsers for Irish, Maltese, and Vietnamese are trained on the corresponding treebanks and train/test splits from Universal Dependencies 2.5 (34): IDT, MUDT, and VTB, respectively. For Singlish, we use the extended treebank component of Wang et al. (2019), which we randomly partition into train (80%), valid. (10%), and test sets (10%). [Footnote 7: Our partition of the data is available online.] We use the provided gold word segmentation but no POS tag features.

3.2 Baselines

For each language, we evaluate the performance of mBert in frozen and ft variants, without any adaptations. We additionally benchmark each method against strong prior work that represents conventional approaches for representing low-resource languages: static fastText embeddings (fastT; Bojanowski et al., 2017), which can be learned effectively even on small datasets; and monolingual ELMo models (elmo; Peters et al., 2018), a monolingual contextual approach. We choose ELMo over training a new BERT model because the high computational and data requirements of the latter make it unviable in a low-resource setting. Both baselines are trained on our unlabeled datasets.

4 Results and Discussion

Tab. 2 shows the performance of each of the method variants on the four Universal Dependencies datasets, with standard deviations from five different initializations. Our experiments demonstrate that additional language-specific pretraining results in more effective representations. lapt consistently outperforms baselines, especially for Irish and Maltese, where overlap with the original pretraining data is low and frozen mBert underperforms elmo. This suggests that the insights of Gururangan et al. (2020) on additional pretraining for domain adaptation are also applicable to transferring multilingual models to low-resource languages, even without much additional data.

lapt with our vocabulary augmentation methods yields small but significant improvements over lapt alone, especially for ft configurations and Type 1/2 languages. This demonstrates that accurate vocabulary modeling is important for improving representations in the target language, and that va and tva are effective methods for doing so while maintaining overall alignment. For Maltese, va’s stronger performance compared to tva can be explained by the overall lack of unlabeled data: one would expect tva to overfit more quickly on a very small dataset.

Furthermore, the relative error reductions between unadapted mBert and each of our methods correlate with each language variety’s relationship to mBert pretraining data. Maltese (Type 2) improves by up to 39% and Irish (Type 1) by up to 15%, compared to 11% for Singlish and 5% for Vietnamese (both Type 0). While this trend is by no means comprehensive, it suggests that effective use of mBert requires considering the target language variety.
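The percentages above can be reproduced from Table 2 under one reading: error is taken as 100 minus LAS, and each language's frozen mBert score is compared with its best frozen adapted score. That reading is an assumption, since the text does not state which rows are compared.

```python
def relative_error_reduction(baseline_las, adapted_las):
    """Percentage reduction in error, where error = 100 - LAS."""
    baseline_error = 100.0 - baseline_las
    adapted_error = 100.0 - adapted_las
    return 100.0 * (baseline_error - adapted_error) / baseline_error


# Frozen rows of Table 2: (unadapted mBert, best of lapt/va/tva).
frozen = {
    "mt": (67.06, 79.88),
    "ga": (68.19, 73.11),
    "sing": (74.01, 76.92),
    "vi": (62.96, 64.67),
}
reductions = {
    lang: relative_error_reduction(base, best)
    for lang, (base, best) in frozen.items()
}
```

For Maltese, for example, the error falls from 32.94 to 20.12 points, a relative reduction of about 39%.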

Our results thus support our hypotheses and give insight into the limitations of mBert. Wordpieces appear in different contexts in different languages, and mBert initially lacks enough exposure to wordpiece usage in Type 1/2 target languages to outperform baselines. However, increased exposure through additional language-specific pretraining can ameliorate this issue. Likewise, despite mBert’s attempt to balance its pretraining data, the existing vocabulary still favors languages that have been seen more. Augmenting the vocabulary can produce additional improvement for languages with greater proportions of unseen wordpieces. Overall, our findings are promising for low-resource language varieties, demonstrating that large improvements in performance are possible with the help of a little unlabeled data, and that the performance discrepancy of multilingual models for low-resource languages (Wu and Dredze, 2020) can be overcome.

5 Further Related Work

Our work builds on prior empirical studies on multilingual models, which probe the behavior and components of existing models to explain why they are effective. Cao et al. (2020), Pires et al. (2019), and Wu and Dredze (2019) note the importance of both vocabulary overlap and the relationship between languages in determining the effectiveness of multilingual models, but they primarily consider high-resource languages. On the other hand, Conneau et al. (2020b) and K et al. (2020) find vocabulary overlap to be a less significant factor, instead attributing such models’ successes to typological similarity and parameter sharing. Artetxe et al. (2020) emphasize the importance of sufficiently representing the target language in the vocabulary. Unlike these studies, we primarily consider how to improve the performance of multilingual models for a given target language variety. Though our experiments do not directly probe the impact of vocabulary overlap, we contribute further evaluation of the importance of improved modeling of the target variety.

Recent work has also proposed additional pretraining for general-purpose language models, especially with respect to domain (Alsentzer et al., 2019; Chakrabarty et al., 2019; Gururangan et al., 2020; Han and Eisenstein, 2019; Howard and Ruder, 2018; Logeswaran et al., 2019; Sun et al., 2019). Lakew et al. (2018) and Zoph et al. (2016) perform additional training on parallel data to adapt bilingual translation models to unseen target languages, while Mueller et al. (2020) improve a polyglot task-specific model by finetuning on labeled monolingual data in the target variety. To the best of our knowledge, our work is the first to demonstrate the effectiveness of additional pretraining for massively multilingual language models toward a target low-resource language variety, using only unlabeled data in the target variety.

6 Conclusion

We explore additional language-specific pretraining and vocabulary augmentation for multilingual contextual word representations in low-resource settings and find them to be effective for dependency parsing, especially in the lowest-resource cases. Our results demonstrate the significance of the relationship between a multilingual model’s pretraining data and a target language. We expect that our findings can benefit practitioners in low-resource settings, and our data, code, and models are publicly available to accelerate further study.


We thank Jungo Kasai, Phoebe Mulcaire, members of UW NLP, and the anonymous reviewers for their helpful comments on preliminary versions of this paper. This work was supported by an NSF Graduate Research Fellowship to LHL and by NSF grant 1813153.


  • R. Agerri, I. San Vicente, J. A. Campos, A. Barrena, X. Saralegi, A. Soroa, and E. Agirre (2020) Give your text representation models some love: the case for Basque. In Proc. of LREC. ISBN 979-10-95546-34-4.
  • E. Alsentzer, J. Murphy, W. Boag, W. Weng, D. Jindi, T. Naumann, and M. McDermott (2019) Publicly available clinical BERT embeddings. In Proc. of Clinical Natural Language Processing Workshop.
  • M. Artetxe, S. Ruder, and D. Yogatama (2020) On the cross-lingual transferability of monolingual representations. In Proc. of ACL.
  • I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: pretrained language model for scientific text. In Proc. of EMNLP.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. TACL 5, pp. 135–146.
  • S. Cao, N. Kitaev, and D. Klein (2020) Multilingual alignment of contextual word representations. In Proc. of ICLR.
  • T. Chakrabarty, C. Hidey, and K. McKeown (2019) IMHO fine-tuning improves claim detection. In Proc. of NAACL-HLT.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2020a) Unsupervised cross-lingual representation learning at scale. In Proc. of ACL.
  • A. Conneau, S. Wu, H. Li, L. Zettlemoyer, and V. Stoyanov (2020b) Emerging cross-lingual structure in pretrained language models. In Proc. of ACL.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT.
  • J. Dodge, S. Gururangan, D. Card, R. Schwartz, and N. A. Smith (2019) Show your work: improved reporting of experimental results. In Proc. of EMNLP-IJCNLP.
  • T. Dozat and C. D. Manning (2017) Deep biaffine attention for neural dependency parsing. In Proc. of ICLR. arXiv:1611.01734.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer (2018) AllenNLP: a deep semantic natural language processing platform. In Proc. of Workshop for NLP Open Source Software (NLP-OSS).
  • S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020) Don’t stop pretraining: adapt language models to domains and tasks. In Proc. of ACL.
  • X. Han and J. Eisenstein (2019) Unsupervised domain adaptation of contextualized embeddings for sequence labeling. In Proc. of EMNLP.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proc. of ACL.
  • K. K, Z. Wang, S. Mayhew, and D. Roth (2020) Cross-lingual ability of multilingual BERT: an empirical study. In Proc. of ICLR. arXiv:1912.07840.
  • D. Kondratyuk and M. Straka (2019) 75 languages, 1 model: parsing Universal Dependencies universally. In Proc. of EMNLP-IJCNLP.
  • S. M. Lakew, A. Erofeeva, M. Negri, M. Federico, and M. Turchi (2018) Transfer learning in multilingual neural machine translation with dynamic vocabulary. In Proc. of IWSLT. arXiv:1811.01137.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. In Proc. of NeurIPS.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692 [cs.CL].
  • L. Logeswaran, M. Chang, K. Lee, K. Toutanova, J. Devlin, and H. Lee (2019) Zero-shot entity linking by reading entity descriptions. In Proc. of ACL.
  • D. Mueller, N. Andrews, and M. Dredze (2020) Sources of transfer in multilingual named entity recognition. In Proc. of ACL.
  • P. Mulcaire, J. Kasai, and N. A. Smith (2019a) Low-resource parsing with crosslingual contextualized representations. In Proc. of CoNLL.
  • P. Mulcaire, J. Kasai, and N. A. Smith (2019b) Polyglot contextual representations improve crosslingual transfer. In Proc. of NAACL-HLT.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proc. of NAACL-HLT.
  • T. Pires, E. Schlinger, and D. Garrette (2019) How multilingual is multilingual BERT?. In Proc. of ACL.
  • S. Rönnqvist, J. Kanerva, T. Salakoski, and F. Ginter (2019) Is multilingual BERT fluent in language generation?. In Proc. of the First NLPL Workshop on Deep Learning for Natural Language Processing.
  • T. Schuster, O. Ram, R. Barzilay, and A. Globerson (2019) Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proc. of NAACL-HLT.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proc. of ACL.
  • C. Sun, X. Qiu, Y. Xu, and X. Huang (2019) How to fine-tune BERT for text classification?. arXiv:1905.05583 [cs.CL].
  • P. Trudgill (2003) A glossary of sociolinguistics. Edinburgh University Press. ISBN 9780748616237.
  • H. Tsai, J. Riesa, M. Johnson, N. Arivazhagan, X. Li, and A. Archer (2019) Small and practical BERT models for sequence labeling. In Proc. of EMNLP-IJCNLP.
  • [34] Universal dependencies 2.5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. of NeurIPS.
  • H. Wang, J. Yang, and Y. Zhang (2019) From genesis to creole language: transfer learning for Singlish universal dependencies parsing and POS tagging. ACM Transactions on Asian and Low-Resource Language Information Processing 19 (1). ISSN 2375-4699.
  • H. Wang, Y. Zhang, G. L. Chan, J. Yang, and H. L. Chieu (2017) Universal dependencies parsing for colloquial Singaporean English. In Proc. of ACL.
  • S. Wu and M. Dredze (2019) Beto, bentz, becas: the surprising cross-lingual effectiveness of BERT. In Proc. of EMNLP-IJCNLP.
  • S. Wu and M. Dredze (2020) Are all languages created equal in multilingual BERT?. In Proc. of the 5th Workshop on Representation Learning for NLP.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144 [cs.CL].
  • B. Zoph, D. Yuret, J. May, and K. Knight (2016) Transfer learning for low-resource neural machine translation. In Proc. of EMNLP.

Appendix A Supplementary Material to Accompany Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank

This supplement contains further details about the experiments presented in the main paper.

A.1 Vocabulary Augmentation and Statistics

Language Original Augmented
ga 12807 228
mt 49791 1124
sing 3 1
vi 6955 421
Table 3: Number of tokens with unknown wordpieces in the unlabeled dataset under original and augmented vocabularies.

We choose the vocabulary size to minimize the number of unknown wordpieces while maintaining a similar wordpiece-per-token ratio as the original mBert vocabulary. Empirically, we find a vocabulary size of 5000 to best meet these criteria. Then, we tokenize the unlabeled data using both the new and original vocabularies. We compare the tokenizations of each word and note cases where the new vocabulary yields a tokenization with fewer unknown wordpieces than the original one. We select the 99 most common wordpieces that occur in these cases and use them to fill the 99 unused slots in mBert’s vocabulary. For Singlish, 99 such wordpieces are not available; we fill the remaining slots with the most common wordpieces in the new vocabulary.
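The selection step can be sketched as follows. The two tokenizer functions are passed in as arguments and the toy tokenizers at the bottom are hypothetical stand-ins; whether wordpieces already present in the original vocabulary are filtered out is not specified above, so this sketch does not filter them.

```python
from collections import Counter


def select_new_wordpieces(tokens, tokenize_old, tokenize_new,
                          unk="[UNK]", k=99):
    """Pick the k most common wordpieces from tokens where the new
    vocabulary yields fewer unknown wordpieces than the original one."""
    counts = Counter()
    for token in tokens:
        old_pieces = tokenize_old(token)
        new_pieces = tokenize_new(token)
        if new_pieces.count(unk) < old_pieces.count(unk):
            # Count the pieces that made this token representable.
            counts.update(p for p in new_pieces if p != unk)
    return [piece for piece, _ in counts.most_common(k)]


# Hypothetical tokenizers: the "old" one cannot represent the letter "ħ".
def toy_old(token):
    return ["[UNK]"] if "ħ" in token else [token]


def toy_new(token):
    return [token]


corpus = ["ħobż", "ħobż", "ħalib", "dar"]
selected = select_new_wordpieces(corpus, toy_old, toy_new, k=2)
```

The selected wordpieces would then overwrite the unused slots of the original vocabulary, so the embedding matrix keeps its size and only those rows need to be learned from scratch.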

Tab. 3 gives a comparison of the number of tokens with unknown wordpieces under the original and augmented mBert vocabularies. The augmented vocabulary significantly decreases the number of unknowns, resulting in a specific embedding for most of the wordpieces.

A.2 Data Extraction and Preprocessing

In this section, we detail the steps used to obtain the pretraining data. After dataset-specific preprocessing, all datasets are tokenized with the multilingual spaCy tokenizer. We then generate pretraining shards in a format accepted by mBert using scripts provided by Devlin et al. (2019) and the parameters listed in Tab. 7, which includes artificially augmenting each dataset five times by masking different words with a probability of 0.15. Statistics for labeled datasets, which we use without modification, are provided in Tab. 4.
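The five-fold masking augmentation can be illustrated with a simplified sketch. It masks each token independently with probability 0.15 and produces five differently masked copies; the BERT data-generation scripts differ in detail (for instance, a selected token is sometimes kept or replaced by a random token instead of the mask symbol), so this is an approximation under stated assumptions, with a hypothetical seed and sentence.

```python
import random


def make_masked_copies(tokens, n_copies=5, mask_prob=0.15,
                       mask_token="[MASK]", seed=0):
    """Return n_copies of the token list, each with its own random mask."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        masked = [
            mask_token if rng.random() < mask_prob else tok
            for tok in tokens
        ]
        copies.append(masked)
    return copies


# Hypothetical input sentence.
sentence = "we adapt a multilingual model to a new language variety".split()
copies = make_masked_copies(sentence)
```

Because each copy draws a fresh mask, the model sees the same sentence with different words hidden, which stretches a small corpus further during pretraining.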


Language   Partition   # Sentences   # Tokens
ga         Train       858           20k
ga         Valid.      451           9.8k
ga         Test        454           10k
mt         Train       1123          23k
mt         Valid.      433           11k
mt         Test        518           10k
sing       Train       2465          22k
sing       Valid.      286           2.5k
sing       Test        299           2.7k
vi         Train       1400          24k
vi         Valid.      800           13k
vi         Test        800           14k
Table 4: Statistics for labeled Universal Dependencies datasets.

Wikipedia Data

We draw data from the newest available Wikipedia dump for the language at the time it was obtained: October 20, 2019 (Irish) and January 1, 2020 (Maltese, Vietnamese). We use WikiExtractor to extract the article text, split sentences at periods, and remove the following items:

  • Document start and end line

  • Article titles and section headers

  • Categories

  • HTML content (e.g., <br>)

Articles are kept contiguous. The full Vietnamese Wikipedia consists of nearly 6.5 million sentences (141 million tokens); to simulate a truly low-resource setting, we randomly select 5% of the articles without replacement to use in our pretraining.
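Sampling 5% of the articles without replacement can be sketched in a few lines. The seed, the rounding of the sample size, and the placeholder article list are assumptions made for the illustration.

```python
import random


def sample_articles(articles, fraction=0.05, seed=0):
    """Randomly select a fraction of articles without replacement."""
    rng = random.Random(seed)
    k = round(len(articles) * fraction)
    # random.sample never returns the same element twice.
    return rng.sample(articles, k)


# Hypothetical corpus of 200 articles, each kept as one contiguous unit.
articles = [f"article_{i}" for i in range(200)]
subset = sample_articles(articles)
```

Sampling whole articles rather than individual sentences keeps each selected article contiguous, as described above.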

Singlish Data

Beginning with the raw crawled sentences from Wang et al. (2017), we remove any sentences that appear verbatim in the validation or test sets of either their original treebank or our partition. Furthermore, we remove any sentences with fewer than five tokens or more than 50 tokens, as we observe that a large proportion of these sentences are either nonsensical or extended quotes from Standard English. We note that this dataset is non-contiguous: most sentences do not appear in a larger context.
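The filtering rules for the Singlish data can be sketched as below. Counting tokens by whitespace splitting is an assumption made for the example, and the sentences are invented placeholders.

```python
def filter_sentences(sentences, held_out, min_len=5, max_len=50):
    """Drop held-out sentences and those outside the length bounds."""
    held_out = set(held_out)
    kept = []
    for sentence in sentences:
        n_tokens = len(sentence.split())  # assumed whitespace tokens
        if sentence in held_out:
            continue  # appears verbatim in a validation or test set
        if n_tokens < min_len or n_tokens > max_len:
            continue  # too short or too long
        kept.append(sentence)
    return kept


# Hypothetical raw sentences and held-out evaluation sentences.
raw = [
    "this one is in the test set",
    "too short",
    "this sentence is long enough to keep",
]
kept = filter_sentences(raw, held_out=["this one is in the test set"])
```

Only the last sentence survives: the first is held out for evaluation and the second has fewer than five tokens.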

A.3 Training Procedure

During pretraining, we use the original implementation of Devlin et al. (2019) but modify it to optimize based only on the masked language modeling (MLM) loss. Although Devlin et al. (2019) also trained on a next sentence prediction (NSP) loss, subsequent work has found joint optimization of NSP and MLM to be less effective than MLM alone (K et al., 2020; Lample and Conneau, 2019; Liu et al., 2019). Furthermore, in certain low-resource language varieties, fully contiguous data may not be available, rendering the NSP task ill-posed. We perform additional pretraining for up to 20 epochs, selecting our final model based on average validation LAS downstream.

Following prior work on parsing with mBert (Kondratyuk and Straka, 2019), parsers are trained with an inverse square root learning rate decay and linear warmup, along with gradual unfreezing and discriminative finetuning of the layers. These models are trained for up to 200 epochs with early stopping based on validation performance. All parsers are implemented in AllenNLP, version 0.9.0 (Gardner et al., 2018).
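For readers unfamiliar with the schedule, one common formulation of inverse square root decay with linear warmup is sketched below. This is an assumption about the general shape, not a transcription of the parser's scheduler, whose exact form may differ. The function name and the step-based `warmup_steps` argument are hypothetical.

```python
import math


def inverse_sqrt_lr(step, base_lr, warmup_steps):
    """Learning rate at a 1-indexed optimizer step.

    Rises linearly to base_lr over warmup_steps, then decays in
    proportion to 1 / sqrt(step).
    """
    if step < 1 or warmup_steps < 1:
        raise ValueError("step and warmup_steps must be >= 1")
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * math.sqrt(warmup_steps / step)


# Illustrative values only: 100 warmup steps is a made-up stand-in for
# one epoch of warmup, since steps per epoch depend on treebank size.
for step in (1, 50, 100, 400, 1600):
    print(step, inverse_sqrt_lr(step, base_lr=0.001, warmup_steps=100))
```

The two branches agree at `step == warmup_steps`, so the schedule is continuous at the peak.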

Tab. 7 gives all hyperparameters kept constant during mBert pretraining and parser training. The values for these hyperparameters largely reflect the defaults or recommendations specified in the implementations we used. For instance, the base learning rate for lapt, va, and tva reflects recommendations in the code of Devlin et al. (2019), and the tva embedding learning rate is equal to the learning rate used in the original pretraining of mBert.

Due to the large number of parameters in mBert, large batch sizes are sometimes infeasible. We reduce the batch size until training completes successfully on our GPU.

elmo models are trained with the original implementation and default hyperparameter settings of Peters et al. (2018). However, following the implementation of Mulcaire et al. (2019b), we use a variable-length character vocabulary instead of a fixed-size one to fully model the distribution in each language. fastT is trained using the skipgram model for five epochs, with the default hyperparameters of Bojanowski et al. (2017). Each experiment is run on a single NVIDIA Titan X or Titan XP GPU.

a.4 Hyperparameter Optimization

Representations ga mt sing vi
elmo 10 10 5 10
lapt 5 20 5 5
va 10 15 1 5
tva 15 20 20 5
lapt + ft 20 10 1 5
va + ft 10 10 1 5
tva + ft 15 15 5 5
Table 5: Number of pretraining epochs used in final models, selected based on validation LAS scores.

For our experiments, we fix both the pretraining and downstream architectures and tune only the number of pretraining epochs. For lapt, va, and tva, we pretrain for an additional {1, 5, 10, 15, 20} epochs. For elmo, we pretrain for {1, 3, 5, 10} epochs. Final selections are given in Tab. 5.

Measuring Variation

We use Allentune (Dodge et al., 2019) to compute standard deviations for our experiments. For a given representation source, we draw five random assignments of the training hyperparameters by sampling uniformly from the ranges specified in Tab. 6. To choose the number of pretraining epochs for each method, we select the model with the highest validation LAS averaged over these five assignments. We then report the mean and standard deviation of the test LAS across the same five assignments.

In cases where a hyperparameter assignment yields exploding gradients and/or trends toward an infinite loss, we rerun the experiment to yield a feasible initialization.
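The selection and reporting procedure can be sketched as follows. This illustration uses only the Python standard library and does not reproduce the Allentune interface. The dictionary keys, function names, and the choice of the sample (n - 1) standard deviation are assumptions, and the LAS values in the usage example are invented placeholders, not results.

```python
import random
import statistics

# Ranges from Tab. 6; the key names are hypothetical.
BOUNDS = {
    "adam_beta1": (0.9, 0.9999),
    "adam_beta2": (0.9, 0.9999),
    "grad_norm": (1.0, 10.0),
}
SEED_NAMES = ("python_seed", "numpy_seed", "pytorch_seed")


def sample_assignment(rng):
    """Draw one hyperparameter assignment uniformly from the bounds."""
    assignment = {name: rng.uniform(lo, hi) for name, (lo, hi) in BOUNDS.items()}
    for name in SEED_NAMES:
        assignment[name] = rng.randint(0, 100000)
    return assignment


def select_and_summarize(results):
    """Pick the epoch count with the best mean validation LAS.

    results maps a pretraining epoch count to a list of
    (validation_las, test_las) pairs, one pair per assignment.
    Returns (best_epochs, mean_test_las, std_test_las).
    """
    def mean_valid(epochs):
        return statistics.mean([valid for valid, _ in results[epochs]])

    best = max(results, key=mean_valid)
    test_scores = [test for _, test in results[best]]
    # Assumption: sample standard deviation over the assignments.
    return best, statistics.mean(test_scores), statistics.stdev(test_scores)


# Placeholder scores for two candidate epoch counts and two assignments.
example = {5: [(70.0, 68.0), (72.0, 70.0)], 10: [(75.0, 71.0), (73.0, 73.0)]}
print(select_and_summarize(example))
```

Test scores are consulted only after the epoch count has been fixed on validation LAS, so the reported mean and standard deviation play no part in model selection.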

Hyperparameter Minimum Maximum
Adam, Beta 1 0.9 0.9999
Adam, Beta 2 0.9 0.9999
Gradient Norm 1.0 10.0
Random Seed, Python 0 100000
Random Seed, Numpy 0 100000
Random Seed, PyTorch 0 100000
Table 6: Hyperparameter bounds for measuring variation.
Stage Hyperparameter Value
Data Creation Max Sequence Length 128
Max Predictions per Sequence 20
Masked LM Probability 0.15
Duplication Factor 5
Pretraining Max Sequence Length 128
Warmup Steps 1000
Batch Size {12, 16}
Max Predictions per Sequence 20
Masked LM Probability 0.15
Learning Rate 0.00002
tva Embedding Learning Rate 0.0001
Parser Dependency Arc Dimension 100
Dependency Tag Dimension 100
mBert Layer Dropout 0.1
elmo Dropout 0.5
Input Dropout 0.3
Parser Dropout 0.3
Optimizer Adam
Parser Learning Rate 0.001
mBert Learning Rate 0.00005
Learning Rate Warmup Epochs 1
Epochs 200
Early Stopping (Patience) 20
Batch Size {8, 24, 64}
BiLSTM Layers 3
BiLSTM Hidden Size 400
Table 7: Hyperparameters for data creation, pretraining, and parser.
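As a rough illustration of how the data creation settings in Tab. 7 interact, the sketch below computes how many positions would be masked in a sequence. It assumes the masking budget is the rounded product of sequence length and masking probability, at least one and capped at the maximum predictions per sequence, which is our understanding of the reference BERT data creation script. The function name is hypothetical.

```python
def num_masked(seq_len, masked_lm_prob=0.15, max_predictions=20):
    """Number of positions to mask in a sequence of seq_len tokens."""
    # Assumed rule: round(len * prob), at least 1, at most the cap.
    return min(max_predictions, max(1, int(round(seq_len * masked_lm_prob))))


for length in (8, 64, 128):
    print(length, num_masked(length))
```

Under this assumption, a full-length sequence of 128 tokens yields 19 masked positions, so the cap of 20 is not reached at the maximum sequence length.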