Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages

C. M. Downey et al. ∙ University of Washington ∙ October 16, 2021

We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K'iche', a Mayan language. We compare our model to a monolingual baseline, and show that the multilingual pre-trained approach yields much more consistent segmentation quality across target dataset sizes, including a zero-shot performance of 20.6 F1, and exceeds the monolingual performance in 9/10 experimental settings. These results have promising implications for low-resource NLP pipelines involving human-like linguistic units, such as the sparse transcription framework proposed by Bird (2020).


1 Introduction

Unsupervised sequence segmentation (at the word, morpheme, and phone level) has long been an area of interest for languages without whitespace-delimited orthography (e.g. Chinese; Uchiumi et al., 2015; Sun and Deng, 2018), morphologically complex languages without rule-based morphological analyzers (Creutz and Lagus, 2002), and automatically phone-transcribed speech data (Goldwater et al., 2009; Lane et al., 2021), respectively. It has been particularly important for lower-resource languages in which there is little or no gold-standard data on which to train supervised models.

In modern neural end-to-end systems, partially unsupervised segmentation is usually performed via information-theoretic algorithms such as BPE (Sennrich et al., 2016) and SentencePiece (Kudo and Richardson, 2018). However, the segmentations these produce are largely nonsensical to humans. The motivating tasks listed above instead require unsupervised approaches that correlate more closely with human judgements of the boundaries of linguistic units. For example, in a human-in-the-loop framework such as the sparse transcription proposed by Bird (2020), candidate lexical items are automatically proposed to native speakers for confirmation, and it is important that these candidates be (close to) meaningful pieces of language that the speaker would recognize.

In this paper, we investigate the utility of recent models developed to conduct unsupervised segmentation jointly with, or as a byproduct of, a language modeling objective (e.g. Kawakami et al., 2019; Downey et al., 2021; see Section 2). The key idea is that recent breakthroughs in crosslingual language modeling and transfer learning (Conneau and Lample, 2019; Artetxe et al., 2020, inter alia) can be leveraged to transfer unsupervised segmentation performance to a new target language when using these types of language models.

Specifically, we investigate the effectiveness of multilingual pre-training in a Masked Segmental Language Model (Downey et al., 2021) when applied to a low-resource target. We pre-train our model on the ten Indigenous languages of the 2021 AmericasNLP shared task dataset (Mager et al., 2021), and apply it to another low-resource, Indigenous, and morphologically complex language of Central America: K’iche’ (quc), which is phylogenetically unrelated to all of the pre-training languages (Campbell et al., 1986).

We hypothesize that multilingual pre-training on similar, possibly contact-related, languages will outperform a monolingual baseline trained from scratch on the same target data. In particular, we expect the multilingual model to perform increasingly better than the monolingual baseline as the target corpus shrinks.

Indeed, our experiments show that a pre-trained multilingual model provides stable performance across all dataset sizes and almost always outperforms the monolingual baseline. We additionally show that the multilingual model achieves a zero-shot segmentation performance of 20.6 F1 on the K’iche’ data, whereas the monolingual baseline yields a score of zero. These results suggest that transferring from a multilingual model can greatly assist unsupervised segmentation in very low-resource languages, even those that are morphologically rich. It may also support the idea that transfer-learning via multilingual pre-training may be possible at a more moderate scale (in terms of data and parameters) than is typical for recent crosslingual models.

In the following section, we overview important work relating to unsupervised segmentation, crosslingual pre-training, and transfer-learning (Section 2). We then introduce the multilingual data used in our experiments, as well as the additional pre-processing we performed to prepare the data for multilingual pre-training (Section 3). Next we provide a brief overview of the type of Segmental Language Model used for our experiments here, as well as our multilingual pre-training process (Section 4). After this, we provide details of our experimental process applying the pre-trained and from-scratch models to varying sizes of target data (Section 5). Finally, we discuss the results of our experiments and their significance for low-resource pipelines, both in the framework of unsupervised segmentation and for other NLP tasks more generally (Sections 6 and 7).

2 Related Work

Work related to the present study has largely fallen either into the field of (unsupervised) word segmentation, or into the field(s) of crosslingual language modeling and transfer learning. To our knowledge, we are the first to propose a crosslingual model for unsupervised word/morpheme-segmentation.

Unsupervised Segmentation

Current state-of-the-art unsupervised segmentation performance has largely been achieved with Bayesian models such as Hierarchical Dirichlet Processes (Teh et al., 2006; Goldwater et al., 2009) and Nested Pitman-Yor (Mochihashi et al., 2009; Uchiumi et al., 2015). Adaptor Grammars (Johnson and Goldwater, 2009) have been successful as well. Models such as Morfessor (Creutz and Lagus, 2002), which are based on Minimum Description Length (Rissanen, 1989), are also widely used for unsupervised morphology.

As Kawakami et al. (2019) note, most of these models are weak in terms of their actual language modeling ability, being unable to take into account much beyond the immediate local context of the sequence. Another line of work has focused on models that are both strong language models and good at sequence segmentation. Many are in some way based on Connectionist Temporal Classification (Graves et al., 2006), and include Sleep-WAke Networks (Wang et al., 2017), Segmental RNNs (Kong et al., 2016), and Segmental Language Models (Sun and Deng, 2018; Kawakami et al., 2019; Wang et al., 2021; Downey et al., 2021). In this work, we conduct experiments using the Masked Segmental Language Model of Downey et al. (2021), due to its good performance and scalability, the latter usually regarded as an obligatory feature of crosslingual models (Conneau et al., 2020a; Xue et al., 2021, inter alia).

Crosslingual and Transfer Learning

Crosslingual modeling and training became an especially active area of research following the introduction of language-general encoder-decoders in Neural Machine Translation, which offered the possibility of zero-shot translation (i.e. translation for language pairs not seen during training; Ha et al., 2016; Johnson et al., 2017).

The arrival of crosslingual language model pre-training (XLM, Conneau and Lample, 2019) further invigorated the subfield by demonstrating that large models pre-trained on multiple languages yielded state-of-the-art performance across an abundance of multilingual tasks including zero-shot text classification (e.g. XNLI, Conneau et al., 2018), and that pre-trained transformer encoders provide great initializations for MT systems and language models in very low-resource languages.

Since XLM, numerous studies have attempted to single out exactly which components of the crosslingual training process contribute to the ability to transfer performance from one language to another (e.g. Conneau et al., 2020b). Others have questioned the importance of multilingual training, and have instead proposed that even monolingual pre-training can provide effective transfer to new languages (Artetxe et al., 2020). And though some like Lin et al. (2019) have tried to systematically study which aspects of pre-training languages/corpora enable effective transfer, in practice the choice is often driven by availability of data and other ad-hoc factors.

Currently, large crosslingual successors to XLM such as XLM-R (Conneau et al., 2020a), MASS (Song et al., 2019), mBART (Liu et al., 2020), and mT5 (Xue et al., 2021) have achieved major success, and are the starting point for a large portion of multilingual NLP systems. These models all rely on an enormous amount of parameters and pre-training data, the bulk of which comes from very high-resource languages. In contrast, in this paper we wish to assess whether multilingual pre-training on a suite of very low-resource languages, which combine to yield a moderate amount of unlabeled data, can provide a good starting point for similar languages which are also very low-resource, within the framework of the unsupervised segmentation task.

3 Data and Pre-processing

We draw data from three main datasets. The AmericasNLP 2021 open task dataset (Mager et al., 2021) contains text from ten Indigenous languages of Central and South America, which we use to pre-train our multilingual model. The multilingual dataset from Kann et al. (2018) consists of morphologically segmented sentences in several Indigenous languages, two of which overlap with the AmericasNLP set, and serves as segmentation validation data for our pre-training process in these languages. Finally, the K’iche’ data collected for Tyers and Henderson (2021) and Richardson and Tyers (2021) contains both raw and morphologically-segmented sentences. We use the former as the training data for our experiments transferring to K’iche’, and the latter as the validation and test data for these experiments.

AmericasNLP 2021

The AmericasNLP data consists of train and validation files for ten low-resource Indigenous languages: Asháninka (cni), Aymara (aym), Bribri (bzd), Guaraní (gug), Hñähñu (oto), Nahuatl (nah), Quechua (quy), Rarámuri (tar), Shipibo Konibo (shp), and Wixarika (hch). For each language, AmericasNLP also includes parallel Spanish sets, which we do not use. The data was originally curated for the AmericasNLP 2021 shared task on Machine Translation for low-resource languages (Mager et al., 2021; https://github.com/AmericasNLP/americasnlp2021).

We augment the Asháninka and Shipibo-Konibo training sets with additional available monolingual data from Bustamante et al. (2020) (https://github.com/iapucp/multilingual-data-peru), which is linked in the official AmericasNLP repository. We add both the training and validation data from this corpus to the training set of our splits.

To prepare the AmericasNLP data for a multilingual language modeling setting, we first remove lines that contain urls or copyright boilerplate, or that contain no alphabetic characters. We also split lines that are longer than 2000 characters into sentences/clauses where evident. Because we use the Nahuatl and Wixarika data from Kann et al. (2018) as validation data, we remove any overlapping lines from the AmericasNLP set. We create a combined train file as the concatenation of the training data from each of the ten languages, and a combined validation file in the same way.
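As a rough illustration, the following Python sketch shows the kind of line filtering and corpus concatenation described above. The regular expression, the simplification of dropping (rather than splitting) over-long lines, and the function names are illustrative assumptions, not the exact rules used.

```python
import re

URL_RE = re.compile(r"https?://|www\.")

def keep_line(line: str, max_len: int = 2000) -> bool:
    """Heuristic filter: drop lines with URLs or no alphabetic characters.
    Very long lines are dropped here for simplicity; the paper instead
    splits them into sentences/clauses where evident."""
    if URL_RE.search(line):
        return False
    if not any(ch.isalpha() for ch in line):
        return False
    return len(line) <= max_len

def build_multilingual_corpus(paths, held_out=frozenset()):
    """Concatenate per-language training files, skipping lines that fail the
    filter or that overlap with held-out validation data."""
    corpus = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for raw in f:
                line = raw.strip()
                if line and keep_line(line) and line not in held_out:
                    corpus.append(line)
    return corpus
```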

Because the original proportion of Quechua training data is so high compared to all other languages (Figure 1), we randomly downsample this data to 2^15 (32,768) examples, the closest order of magnitude to the next-largest training set. A plot of the balanced (final) composition of our AmericasNLP train and validation sets can be seen in Figure 2. A table with the detailed composition of this data is available in Appendix A.

Figure 1: Original (imbalanced) language composition of the AmericasNLP training set
Figure 2: Final language composition of our AmericasNLP splits after downsampling Quechua

Kann et al. (2018)

The data from Kann et al. (2018) consists of morphologically-segmented sentences and phrases for Spanish, Purepecha (pua/tsz), Wixarika (hch), Yorem Nokki (mfy), Mexicanero (azd/azn), and Nahuatl (nah). This data was originally curated for a supervised neural morphological segmentation task for polysynthetic minimal-resource languages. We clean this data in the same manner as the AmericasNLP sets. Because Nahuatl and Wixarika are two of the languages in our multilingual pre-training set, we use these examples as validation data for segmentation quality during the pre-training process.

K’iche’ data

All of the K’iche’ data used in our study was curated for Tyers and Henderson (2021). The raw (non-gold-segmented) data used as training data in our transfer experiments comes from a section of this data web-scraped by the Crúbadán project (Scannell, 2007). This data is relatively noisy, so we clean it by removing lines with urls or lines where more than half of the characters are non-alphabetic. This cleaned data consists of 62,695 examples and is used as our full-size training set for K’iche’. Our experiments involve testing transfer at different resource levels, so we also create smaller training sets by downsampling the original to lower orders of magnitude.

For evaluating segmentation performance on K’iche’, we use the segmented sentences from Richardson and Tyers (2021) (https://github.com/ftyers/global-classroom), which were created for a shared task on morphological segmentation. These segmentations were created by a hand-crafted FST, and then manually disambiguated. The sentences originally came in a train/validation/test split, but because gold-segmented sentences are so rare, we concatenate these sets and then split them in half into final validation and test sets.

4 Model and Pre-training

This section gives an overview of the Masked Segmental Language Model (MSLM), introduced in Downey et al. (2021), along with a description of our model’s multilingual pre-training.

MSLMs

An MSLM is a variant of a Segmental Language Model (SLM) (Sun and Deng, 2018; Kawakami et al., 2019; Wang et al., 2021), which takes as input a sequence of characters x and outputs a probability distribution over sequences of segments y such that the concatenation of the segments of y is equal to x. An MSLM is composed of a Segmental Transformer Encoder and an LSTM-based Segment Decoder (Downey et al., 2021). See Figure 3.

The training objective for an MSLM is based on the prediction of masked-out spans. During a forward pass, the encoder generates an encoding for every position i in x, for segments up to k symbols long; the encoding for position i corresponds to every possible segment that starts at position i+1, and therefore approximates the probability of each such segment given the surrounding context. To ensure that the encodings are generated based only on the portions of x that are outside of the predicted span, the encoder uses a Segmental Attention Mask (Downey et al., 2021) to mask out the tokens inside the segment. Figure 3 shows an example of such a mask.
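The masking idea can be made concrete with a small sketch. Assuming the encoding at position i is used for candidate segments beginning at the following position, a boolean matrix of this shape blocks attention into the predicted span; this is a simplified reading of the mechanism, not the exact MSLM implementation.

```python
import torch

def segmental_attention_mask(seq_len: int, max_seg_len: int) -> torch.Tensor:
    """Boolean matrix where entry (i, j) is True if query position i must NOT
    attend to key position j, i.e. j lies inside the span of up to max_seg_len
    characters that the encoding at position i will be used to predict."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j > i) & (j <= i + max_seg_len)

# Example: a 6-character sequence with maximum segment length 3
print(segmental_attention_mask(6, max_seg_len=3))
```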

Finally, the Segment Decoder of an SLM determines the probability of each character of the segment of y that begins at index i, using the encoded context at that index. The output of the decoder is therefore based entirely on the context of the sequence, and not on the determination of other segment boundaries. The probability of y is modeled as the marginal probability over all possible segmentations of x. Because direct marginalization would be computationally intractable, the marginal is computed using dynamic programming over a forward-pass lattice, and the maximum-probability segmentation is determined using Viterbi decoding. The MSLM training objective maximizes language-modeling performance, measured in Bits per Character (bpc) over each sentence.
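The lattice computation can be sketched as follows. Here seg_logprob is a stand-in for the decoder's log-probability of a segment given the encoded context; the recursion marginalizes over all segmentations, and replacing the log-sum-exp with a max recovers the Viterbi (best-segmentation) score. This is a schematic sketch, not the authors' implementation.

```python
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_marginal(seg_logprob, n, max_seg_len):
    """Forward lattice over all segmentations of an n-character sequence.
    seg_logprob(start, end) returns the log-probability of the segment
    x[start:end] given the encoded context at `start` (a placeholder here).
    Returns log p(x) marginalized over segmentations."""
    alpha = [float("-inf")] * (n + 1)
    alpha[0] = 0.0
    for end in range(1, n + 1):
        alpha[end] = logsumexp([alpha[start] + seg_logprob(start, end)
                                for start in range(max(0, end - max_seg_len), end)])
    return alpha[n]

# Toy usage: a uniform per-character segment score over a 5-character sequence
print(log_marginal(lambda s, e: -1.0 * (e - s), n=5, max_seg_len=3))
```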

Figure 3: Masked Segmental Language model (left) and Segmental Attention Mask (right). (Figure 3 in Downey et al., 2021)

Multilingual Pre-training

In our experiments, we test the transfer capabilities of a multilingual pre-trained MSLM. We train this model on the AmericasNLP 2021 data, which was pre-processed as described in Section 3. Since Segmental Language Models operate on plain text, we can train the model directly on the multilingual concatenation of this data, and evaluate it by its language modeling performance on the concatenated validation data, which is relatively language-balanced in comparison to the training set (see Figure 2).

We train an MSLM with four encoder layers for 16,768 steps, using the Adam optimizer (Kingma and Ba, 2015). We apply a linear warmup for 1,024 steps, followed by linear decay. The transformer layers have hidden size 256, feedforward size 512, and 4 attention heads. The LSTM-based segment decoder has a hidden size of 256. Character embeddings are initialized using Word2Vec (Mikolov et al., 2013) over the training data. The maximum possible segment size is set to 10. We sweep eight learning rates on a grid, and the best model is chosen as the one that minimizes the Bits per Character (bpc) language-modeling loss on the validation set. For further details of the pre-training procedure, see Appendix B.
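For reference, these settings can be collected into a single configuration sketch; the key names below are hypothetical (not taken from the released code), and the values only restate what is reported above and in Appendix B.

```python
# Hypothetical field names; values restate the reported pre-training settings.
pretrain_config = {
    "encoder_layers": 4,
    "encoder_hidden_size": 256,
    "feedforward_size": 512,
    "attention_heads": 4,
    "decoder_hidden_size": 256,    # LSTM-based segment decoder
    "max_segment_length": 10,
    "total_steps": 16_768,
    "warmup_steps": 1024,          # linear warmup, then linear decay
    "optimizer": "adam",
    "embedding_init": "word2vec",  # CBOW over the training data (Appendix B)
    "checkpoint_every": 128,       # from Appendix B
}
```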

To evaluate the effect of pre-training on segmentation quality for languages within the pre-training set, we also log the Matthews Correlation Coefficient (MCC) between the model output and the gold-segmented secondary validation sets available for Nahuatl and Wixarika (Kann et al., 2018; see Section 3). As Figure 4 shows, the unsupervised segmentation quality for Nahuatl and Wixarika increases almost monotonically during pre-training.

Figure 4: Plot of segmentation quality for Nahuatl and Wixarika during multilingual pre-training (measured by Matthews Correlation Coefficient with gold segmentation)

5 Experiments

We seek to evaluate whether crosslingual pre-training facilitates effective low-resource transfer learning for segmentation. To do this, we pre-train a Segmental Language Model on the AmericasNLP 2021 dataset (Mager et al., 2021) and transfer it to a new target language: K’iche’ (Tyers and Henderson, 2021). As a baseline, we train a monolingual K’iche’ model from scratch. We evaluate model performance with respect to the size of the target training set, simulating varying degrees of resource scarcity. To manipulate this variable, we randomly downsample the K’iche’ training set to 8 smaller sizes, for 9 total: {256, 512, ..., 2^14, 2^15, 62,695 (full)}. For each size, we both train a monolingual model and fine-tune the pre-trained multilingual model described in Section 4. (All of the data and software required to run these experiments can be found at https://github.com/cmdowney88/XLSLM.)
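A minimal sketch of how such size-controlled subsets might be produced (the helper name and fixed seed are assumptions for illustration):

```python
import random

# None stands for the full 62,695-line corpus.
SIZES = [256, 512, 1024, 2048, 4096, 8192, 2**14, 2**15, None]

def make_training_subsets(lines, sizes=SIZES, seed=0):
    """Randomly downsample the full K'iche' training set to each target size."""
    rng = random.Random(seed)
    subsets = {}
    for size in sizes:
        key = size if size is not None else len(lines)
        subsets[key] = list(lines) if size is None else rng.sample(lines, size)
    return subsets
```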

Architecture and Modelling

Both the pre-trained crosslingual model and the baseline monolingual model are Masked Segmental Language Models (MSLMs) with the architecture described in Section 4. The only difference is that the baseline monolingual model is initialized with a character vocabulary only covering the particular K’iche’ training set (size-specific). The character vocabulary of the K’iche’ data is a subset of the AmericasNLP vocabulary, so in the multilingual case we are able to transfer without changing our embedding and output layers. The character embeddings for the monolingual model are initialized using Word2Vec (Mikolov et al., 2013) on the training set (again, size-specific).

Evaluation Metrics

Segmental Language Models can be trained in either a fully unsupervised or “lightly” supervised manner (Downey et al., 2021). In the former case, only the language modeling objective (Bits Per Character, bpc) is considered in picking parameters and checkpoints. In the latter, the segmentation quality over gold-segmented validation data can be considered. Though our validation set is gold-segmented, we pick the best parameters and checkpoints based only on the bpc performance, thus simulating the unsupervised case. However, in order to monitor the change in segmentation quality during training, we also use Matthews Correlation Coefficient (MCC). This measure frames segmentation as a character-wise binary classification task (i.e. boundary vs no boundary), and measures correlation with the gold segmentation.
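As an illustration of the boundary-classification framing, the following sketch converts segmentations into character-wise boundary labels and scores them with scikit-learn's MCC; the label convention (a label per inter-character position) is an assumption about alignment, not the exact evaluation code.

```python
from sklearn.metrics import matthews_corrcoef

def boundary_vector(segments):
    """Label each position 1 if a segment boundary follows that character,
    else 0; the end of the whole sequence is not counted as a boundary."""
    labels = []
    for seg in segments:
        labels.extend([0] * (len(seg) - 1) + [1])
    return labels[:-1]

gold = boundary_vector(["ab", "cde"])   # illustrative strings, not real K'iche'
pred = boundary_vector(["abc", "de"])
print(matthews_corrcoef(gold, pred))
```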

To make our results comparable with the wider word-segmentation literature, we use the scoring script from the SIGHAN Segmentation Bakeoff (Emerson, 2005) to obtain our final segmentation F1 score. For each model and dataset size, we choose the best checkpoint (by bpc), apply the model to the combined validation and test set, and use the SIGHAN script to score the output segmentation quality.
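For comparison, word-level F1 in the SIGHAN style can be computed from matching segment spans. The sketch below follows the standard definition (a predicted word is correct only if both of its boundaries match gold); it is not the official bakeoff script.

```python
def spans(segments):
    """Convert a list of segments into a set of (start, end) character spans."""
    out, pos = set(), 0
    for seg in segments:
        out.add((pos, pos + len(seg)))
        pos += len(seg)
    return out

def segmentation_f1(gold_segments, pred_segments):
    """Word-level precision, recall, and F1 over matching spans."""
    gold, pred = spans(gold_segments), spans(pred_segments)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(segmentation_f1(["ab", "cde"], ["abc", "de"]))  # (0.0, 0.0, 0.0): no matching spans
```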

Parameters and Trials

For our training procedure (both training the baseline from scratch and fine-tuning the multilingual model) we tune hyperparameters on three of the nine dataset sizes (256, 2048, and full) and choose the optimal parameters as those that obtain the lowest bpc. For each of the other sizes, we directly apply the chosen parameters from the tuned dataset of the closest size (on a log scale). We tune over five learning rates and three encoder dropout values.

Models are trained using the Adam optimizer (Kingma and Ba, 2015) for 8192 steps on all but the two smallest sizes, which are trained for 4096 steps. A linear warmup is used for the first 1024 steps (512 for the smallest sets), followed by linear decay. We set the maximum segment length to 10.
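One common way to realize a linear warmup followed by linear decay is an LR-multiplier schedule, as in the PyTorch sketch below; this is a plausible reading of the schedule, not necessarily the exact implementation used.

```python
import torch

def linear_warmup_then_decay(optimizer, warmup_steps, total_steps):
    """LR multiplier rises linearly from 0 to 1 over warmup_steps,
    then decays linearly back to 0 by total_steps."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage sketch: call scheduler.step() after each optimizer.step()
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = linear_warmup_then_decay(optimizer, warmup_steps=1024, total_steps=8192)
```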

For more details on our training procedure, see Appendix B.

6 Results

The results of our K’iche’ transfer experiments at various target sizes can be found in Table 1. In general, the pre-trained multilingual model demonstrates good performance across dataset sizes, with the lowest segmentation quality (20.6 F1) in the zero-shot case and the highest (42.0) achieved when trained on 2^14 (16,384) examples. The best segmentation quality of the monolingual model is very close to that of the multilingual one (41.9, at size 4096), but this performance is not consistent across dataset sizes. Further, there is no clear trend across dataset sizes for the monolingual model, except that performance increases from approximately 0 F1 in the zero-shot case up to 4096 examples.

Model          Segmentation F1 (by target training size)
               0       256*    512     1024    2048*   4096    8192    16,384  32,768  62,695 (full)*
Multilingual   20.6    32.2    37.8    37.8    37.9    38.1    40.6    42.0    39.1    38.0
Monolingual    0.002   6.1     23.6    22.9    27.3    41.9    25.5    30.5    15.7    32.6
Table 1: Segmentation quality on the combined validation and test set for each model, at each target training corpus size. Star (*) marks sizes at which hyperparameter tuning was conducted; for tuned sizes, only the performance of the model that achieved the best bpc is shown.
Model          Segmentation F1
               256           2048          62,695 (full)
Multilingual   31.9 ± 0.4    37.7 ± 1.1    37.5 ± 1.8
Monolingual     4.1 ± 1.4    21.9 ± 6.0    31.1 ± 1.8
Table 2: Variation of segmentation quality across the best four hyperparameter combinations (by bpc) for a single size; mean ± standard deviation.

Interpretation

The above results show that the multilingual pre-trained MSLM provides consistent segmentation performance across dataset sizes as small as 512 examples. Even for size 256, there is only a 15% (relative) drop in segmentation quality from the next-largest size. Further, the pre-trained model yields an impressive zero-shot performance of 20.6 F1, where the baseline is approximately 0 F1.

On the other hand, the monolingual model can achieve good segmentation quality on the target language, but the pattern of success across target corpus sizes is not clear (note that the quality at size 2^15 (32,768) is almost halved compared to the two neighboring sizes).

This variation in the monolingual baseline may be partially explainable by sensitivity to hyperparameters. Table 2 shows that across the best four hyperparameters, the segmentation quality of the monolingual model varies considerably. This is especially noticeable at smaller sizes: at size 2048, the F1 standard deviation is 27.4% of the mean, and at size 256 it is 34.1% of the mean.

A related explanation could be that the hyperparameters tuned at specific target sizes don’t transfer well to other sizes. However, it should be noted that even at the sizes for which hyperparameters were tuned (256, 2048, and full), the monolingual performance lags behind the multilingual. Further, the best segmentation quality achieved by the monolingual model is at size 4096, at which the hyperparameters tuned for size 2048 were applied directly.

In sum, the pre-trained multilingual model yields far more stable performance across target dataset sizes, and almost always outperforms its monolingual from-scratch counterpart.

7 Analysis and Discussion

Standing of Hypotheses

Within the framework of unsupervised segmentation via language modeling, the results of these experiments provide strong evidence that relevant linguistic patterns can be learned over a collection of low-resource languages, and then transferred to a new language without much (or any) training data. Further, it is shown that the target language need not be (phylogenetically) related to any of the pre-training languages, even though the details of morphological structure are ultimately language-specific.

The hypothesis that multilingual pre-training would yield increasing advantage over a from-scratch baseline at smaller target sizes is also strongly supported. This result is consistent with related work showing this to be a key advantage of the multilingual approach (Wu and Dredze, 2020). Perhaps more interestingly, the monolingual model does not come to outperform the multilingual one at the largest dataset sizes, which also tends to be the case in related studies (e.g. Wu and Dredze, 2020; Conneau et al., 2020a). However, it is useful to bear in mind that segmentation quality is an unsupervised objective, and as such it will not necessarily always follow trends in supervised objectives.

Significance

The above results, especially the non-trivial zero-shot transferability of segmentation performance, suggest that the type of language model used here learns some abstract linguistic pattern(s) that are generalizable between languages (even ones on which the model has not been trained). It is possible that these generalizations could take the form of abstract stem/affix or word-order patterns, corresponding roughly to the lengths and order of morphosyntactic units. Because MSLMs operate on the character level (and in these languages orthographic characters mostly correspond to phones), it is also possible the model could recognize syllable structure in the data (the ordering of consonants and vowels in human languages is relatively constrained), and learn to segment on syllable boundaries.

It is also helpful to remember that we select the training suite and target language to have some characteristics in common that may help to facilitate transfer. The AmericasNLP training languages are almost all morphologically rich, with many being considered polysynthetic (Mager et al., 2021), a feature that K’iche’ shares (Suárez, 1983). Further, all of the languages, including K’iche’, are spoken in countries where either Spanish or Portuguese is the official language, and are very likely to have had close contact with these Iberian languages and to have borrowed lexical items from them. Finally, the target language family (Mayan) has also been shown to be in close historical contact with the families of several languages in the AmericasNLP set (Nahuatl, Rarámuri, Wixarika, Hñähñu), forming a Linguistic Area or Sprachbund (Campbell et al., 1986).

It is possible that one or several of these shared characteristics facilitates the strong transfer shown in our experiments. However, our current study does not conclusively show this to be the case. Lin et al. (2019) show that factors like linguistic similarity and geographic contact are often not as important for transfer success as non-linguistic features such as the raw size of the source dataset. Furthermore, Artetxe et al. (2020) show that even monolingually-trained models can be rapidly adapted to a new language by simply training a new embedding layer and adding lightweight adapter layers.

Future Work

There are some future studies that we believe would shed light on the nuances of segmentation transfer-learning. First, pre-training monolingually on a language that is typologically or geographically close to the target could help disentangle the benefit given by multilingual training from that achieved by pre-training on a similar language in general (though the source language would need to be sufficiently high-resource to enable this comparison). Second, pre-training either multilingually or monolingually on languages that are not linguistically similar to the target language could help isolate the advantage given by pre-training on any language data.

In this way, we hope future experiments will refine our understanding of the dynamics that facilitate effective transfer into low-resource languages, both in the framework of unsupervised segmentation and in other tasks in which language model pre-training has enabled transfer learning.

8 Conclusion

This study has shown that unsupervised sequence segmentation performance can be transferred via multilingual pre-training to a novel target language with little or no target data. The target language also does not need to be from the same family as a pre-training language for this transfer to be successful. While training a monolingual model from scratch on larger amounts of target data can result in good segmentation quality, our experiments show that success in this approach is much more sensitive to hyperparameters, and the multilingual model outperforms the monolingual one in 9/10 of our experimental settings.

One finding that may have broader implications is that pre-training can be conducted over a set of low-resource languages that may have some typological or geographic similarity to the target, rather than over a crosslingual suite centered around high-resource languages like English and other European languages. As mentioned in Section 2, most modern crosslingual models have huge numbers of parameters (XLM has 570 million, mT5 has up to 13 billion, Xue et al., 2021), and are trained on enormous amounts of data, usually bolstered by hundreds of gigabytes of data in the highest-resource languages (Conneau et al., 2020a).

In contrast, our results suggest that effective transfer learning may be possible at smaller scales, by combining the data of low-resource languages and training moderately-sized, more targeted pre-trained multilingual models (our model has 3.15 million parameters). Of course, the present study can only support this possibility within the unsupervised segmentation task, and so future work will be needed to investigate whether crosslingual transfer to and from low-resource languages can be extended to other tasks.

References

  • Ž. Agić and I. Vulić (2019) JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3204–3210. External Links: Link, Document Cited by: Appendix A.
  • M. Artetxe, S. Ruder, and D. Yogatama (2020) On the Cross-lingual Transferability of Monolingual Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4623–4637. External Links: Link, Document Cited by: §1, §2, §7.
  • S. Bird (2020) Sparse Transcription. Computational Linguistics 46 (4), pp. 713–744. External Links: Link, Document Cited by: Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages, §1.
  • D. Brambila (1976) Diccionario Rarámuri-castellano (Tarahumar). Obra Nacional de la Buena Prensa. Cited by: Appendix A.
  • G. Bustamante, A. Oncevay, and R. Zariquiey (2020) No Data to Crawl? Monolingual Corpus Creation from PDF Files of Truly low-Resource Languages in Peru. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 2914–2923 (English). External Links: ISBN 979-10-95546-34-4, Link Cited by: Appendix A, §3.
  • L. Campbell, T. Kaufman, and T. C. Smith-Stark (1986) Meso-America as a Linguistic Area. Language 62 (3), pp. 530–570. Note: Publisher: Linguistic Society of America External Links: ISSN 00978507, 15350665, Link Cited by: §1, §7.
  • L. Chiruzzo, P. Amarilla, A. Ríos, and G. Giménez Lugo (2020) Development of a Guarani - Spanish Parallel Corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 2629–2633 (English). External Links: ISBN 979-10-95546-34-4, Link Cited by: Appendix A.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2020a) Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 8440–8451. External Links: Link, Document Cited by: §2, §2, §7, §8.
  • A. Conneau and G. Lample (2019) Cross-lingual Language Model Pretraining. In Advances in Neural Information Processing Systems, Vol. 32, Vancouver, Canada. External Links: Link Cited by: §1, §2.
  • A. Conneau, R. Rinott, G. Lample, A. Williams, S. Bowman, H. Schwenk, and V. Stoyanov (2018) XNLI: Evaluating Cross-lingual Sentence Representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2475–2485. External Links: Link, Document Cited by: Appendix A, §2.
  • A. Conneau, S. Wu, H. Li, L. Zettlemoyer, and V. Stoyanov (2020b) Emerging Cross-lingual Structure in Pretrained Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 6022–6034. External Links: Link, Document Cited by: §2.
  • A. Constenla, F. Elizondo, and F. Pereira (2004) Curso Básico de Bribri. Editorial de la Universidad de Costa Rica. Cited by: Appendix A.
  • M. Creutz and K. Lagus (2002) Unsupervised Discovery of Morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pp. 21–30. External Links: Link Cited by: §1, §2.
  • R. Cushimariano Romano and R. C. Sebastián Q. (2008) Ñaantsipeta asháninkaki birakochaki. Diccionario Asháninka-Castellano. Versión preliminar. External Links: Link Cited by: Appendix A.
  • C. M. Downey, F. Xia, G. Levow, and S. Steinert-Threlkeld (2021) A Masked Segmental Language Model for Unsupervised Natural Language Segmentation. arXiv:2104.07829 [cs]. Note: arXiv: 2104.07829 External Links: Link Cited by: Appendix B, Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages, §1, §1, §2, Figure 3, §4, §4, §4, §5.
  • A. Ebrahimi, M. Mager, A. Oncevay, V. Chaudhary, L. Chiruzzo, A. Fan, J. Ortega, R. Ramos, A. Rios, I. Vladimir, G. A. Giménez-Lugo, E. Mager, G. Neubig, A. Palmer, R. A. C. Solano, N. T. Vu, and K. Kann (2021) AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages. arXiv:2104.08726 [cs]. Note: arXiv: 2104.08726 External Links: Link Cited by: Appendix A.
  • T. Emerson (2005) The Second International Chinese Word Segmentation Bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, External Links: Link Cited by: §5.
  • I. Feldman and R. Coto-Solano (2020) Neural Machine Translation Models with Back-Translation for the Extremely Low-Resource Indigenous Language Bribri. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 3965–3976. Cited by: Appendix A.
  • S. Flores Solórzano (2017) Corpus Oral Pandialectal de la Lengua Bribri. External Links: Link Cited by: Appendix A.
  • A. Galarreta, A. Melgar, and A. Oncevay (2017) Corpus Creation and Initial SMT Experiments between Spanish and Shipibo-konibo. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, Varna, Bulgaria, pp. 238–244. External Links: Link, Document Cited by: Appendix A.
  • S. Goldwater, T. L. Griffiths, and M. Johnson (2009) A Bayesian framework for word segmentation: Exploring the effects of context. Cognition 112 (1), pp. 21–54 (en). External Links: ISSN 0010-0277, Link, Document Cited by: §1, §2.
  • A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA. Cited by: §2.
  • X. Gutierrez-Vasques, G. Sierra, and I. H. Pompa (2016) Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, pp. 4210–4214. External Links: Link Cited by: Appendix A.
  • T. L. Ha, J. Niehues, and A. Waibel (2016) Toward Multilingual Neural Machine Translation with Universal Encoder and Decoder. In Proceedings of the 13th International Conference on Spoken Language Translation, Cited by: §2.
  • D. Huarcaya Taquiri (2020) Traducción Automática Neuronal para Lengua Nativa Peruana. Bachelor’s Thesis, Universidad Peruana Unión. Cited by: Appendix A.
  • C. V. Jara Murillo and A. G. Segura (2013) Se’ ttö’ bribri ie Hablemos en bribri. EDigital. External Links: Link Cited by: Appendix A.
  • C. V. Jara Murillo (2018a) Gramática de la Lengua Bribri. EDigital. External Links: Link Cited by: Appendix A.
  • C. V. Jara Murillo (2018b) I Ttè Historias Bribris. 2 edition, Editorial de la Universidad de Costa Rica. External Links: Link Cited by: Appendix A.
  • M. Johnson and S. Goldwater (2009) Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado, pp. 317–325. External Links: Link Cited by: §2.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean (2017) Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351. External Links: Link, Document Cited by: §2.
  • K. Kann, J. M. Mager Hois, I. V. Meza-Ruiz, and H. Schütze (2018) Fortification of Neural Morphological Segmentation Models for Polysynthetic Minimal-Resource Languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 47–57. External Links: Link, Document Cited by: §3, §3, §3, §4.
  • K. Kawakami, C. Dyer, and P. Blunsom (2019) Learning to Discover, Ground and Use Words with Segmental Neural Language Models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6429–6441. External Links: Link, Document Cited by: §1, §2, §4.
  • D. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings, San Diego, CA, USA. Cited by: §4, §5.
  • L. Kong, C. Dyer, and N. A. Smith (2016) Segmental Recurrent Neural Networks. In 4th International Conference on Learning Representations, ICLR 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), San Juan, Puerto Rico. External Links: Link Cited by: §2.
  • T. Kudo and J. Richardson (2018) SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 66–71. External Links: Link, Document Cited by: §1.
  • W. Lane, M. Bettinson, and S. Bird (2021) A Computational Model for Interactive Transcription. In Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances, Online, pp. 105–111. External Links: Link, Document Cited by: §1.
  • Y. Lin, C. Chen, J. Lee, Z. Li, Y. Zhang, M. Xia, S. Rijhwani, J. He, Z. Zhang, X. Ma, A. Anastasopoulos, P. Littell, and G. Neubig (2019) Choosing Transfer Languages for Cross-Lingual Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3125–3135. External Links: Link, Document Cited by: §2, §7.
  • Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual Denoising Pre-training for Neural Machine Translation. Transactions of the Association for Computational Linguistics 8, pp. 726–742. External Links: Link, Document Cited by: §2.
  • J. Loriot, E. Lauriault, and D. Day (1993) Diccionario Shipibo-Castellano. Ministerio de Educación. Cited by: Appendix A.
  • M. Mager, D. Carrillo, and I. Meza (2018) Probabilistic Finite-State Morphological Segmenter for Wixarika (Huichol) Language. Journal of Intelligent & Fuzzy Systems 34 (5), pp. 3081–3087. Cited by: Appendix A.
  • M. Mager, A. Oncevay, A. Ebrahimi, J. Ortega, A. Rios, A. Fan, X. Gutierrez-Vasques, L. Chiruzzo, G. Giménez-Lugo, R. Ramos, I. V. Meza Ruiz, R. Coto-Solano, A. Palmer, E. Mager-Hois, V. Chaudhary, G. Neubig, N. T. Vu, and K. Kann (2021) Findings of the AmericasNLP 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, Online, pp. 202–217. External Links: Link, Document Cited by: Appendix A, Appendix A, Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages, §1, §3, §3, §5, §7.
  • E. Margery (2005) Diccionario Fraseológico Bribri-Español Español-Bribri. 2 edition, Editorial de la Universidad de Costa Rica. Cited by: Appendix A.
  • E. Mihas (2011) Añaani katonkosatzi parenini, El idioma del alto Perené. Milwaukee, WI: Clarks Graphics. Cited by: Appendix A.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations, ICLR 2013, Workshop Track Proceedings, Scottsdale, AZ, USA. Cited by: Appendix B, §4, §5.
  • D. Mochihashi, T. Yamada, and N. Ueda (2009) Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, pp. 100–108. External Links: Link Cited by: §2.
  • H. E. G. Montoya, K. D. R. Rojas, and A. Oncevay (2019) A Continuous Improvement Framework of Machine Translation for Shipibo-Konibo. In Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, Dublin, Ireland, pp. 17–23. External Links: Link Cited by: Appendix A.
  • J. Ortega, R. A. Castro-Mamani, and J. R. Montoya Samame (2020) Overcoming Resistance: The Normalization of an Amazonian Tribal Language. In Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, Suzhou, China, pp. 1–13. External Links: Link Cited by: Appendix A.
  • P. Prokopidis, V. Papavassiliou, and S. Piperidis (2016) Parallel Global Voices: a Collection of Multilingual Corpora with Citizen Media Stories. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, pp. 900–905. External Links: Link Cited by: Appendix A.
  • I. Richardson and F.M. Tyers (2021) A morphological analyser for K’iche’. Procesamiento de Lenguaje Natural 66, pp. 99–109. Cited by: §3, §3.
  • J. Rissanen (1989) Stochastic Complexity in Statistical Inquiry. Series in Computer Science, Vol. 15, World Scientific, Singapore. Cited by: §2.
  • K. Scannell (2007) The Crúbadán Project: Corpus building for under-resourced languages. Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop. Cited by: §3.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: §1.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: Masked Sequence to Sequence Pre-training for Language Generation. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA. Cited by: §2.
  • G. Suárez (1983) The Mesoamerican Indian Languages. Cambridge Language Surveys, Cambridge University Press, Cambridge. Cited by: §7.
  • Z. Sun and Z. Deng (2018) Unsupervised Neural Word Segmentation for Chinese via Segmental Language Modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4915–4920. External Links: Link Cited by: §1, §2, §4.
  • Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei (2006) Hierarchical Dirichlet Processes. Journal of the American Statistical Association 101 (476), pp. 1566–1581. Cited by: §2.
  • J. Tiedemann (2012) Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, pp. 2214–2218. External Links: Link Cited by: Appendix A.
  • F. Tyers and R. Henderson (2021) A corpus of K’iche’ annotated for morphosyntactic structure. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, Online, pp. 10–20. External Links: Link, Document Cited by: §3, §3, §5.
  • K. Uchiumi, H. Tsukahara, and D. Mochihashi (2015) Inducing Word and Part-of-Speech with Pitman-Yor Hidden Semi-Markov Models. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1774–1782. External Links: Link, Document Cited by: §1, §2.
  • C. Wang, Y. Wang, P. Huang, A. Mohamed, D. Zhou, and L. Deng (2017) Sequence Modeling via Segmentations. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 3674–3683. External Links: Link Cited by: §2.
  • L. Wang, Z. Li, and X. Zheng (2021) Unsupervised Word Segmentation with Bi-directional Neural Language Model. arXiv:2103.01421 [cs]. Note: arXiv: 2103.01421 External Links: Link Cited by: §2, §4.
  • S. Wu and M. Dredze (2020) Are All Languages Created Equal in Multilingual BERT?. In Proceedings of the 5th Workshop on Representation Learning for NLP, Online, pp. 120–130. External Links: Link, Document Cited by: §7.
  • L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel (2021) mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 483–498. External Links: Link, Document Cited by: §2, §2, §8.

Appendix A AmericasNLP Datasets

Composition

The detailed composition of our preparation of the AmericasNLP 2021 training and validation sets can be found in Tables 3 and 4 respectively. train_1.mono.cni, train_2.mono.cni, train_1.mono.shp, and train_2.mono.shp are the additional monolingual sources for Asháninka and Shipibo-Konibo obtained from Bustamante et al. (2020). train_downsampled.quy is the version of the Quechua training set downsampled to 2^15 (32,768) lines to be more balanced with the other languages. train.anlp is the concatenation of the training set of every language before Quechua downsampling, and train_balanced.anlp is the version after Quechua downsampling. Our pre-training process uses train_balanced.anlp.

Citations

A more detailed description of the sources and citations for the AmericasNLP set can be found in the original shared task paper (Mager et al., 2021). Here, we attempt to give a brief listing of the proper citations.

All of the validation data originates from AmericasNLI (Ebrahimi et al., 2021) which is a translation of the Spanish XNLI set (Conneau et al., 2018) into the 10 languages of the AmericasNLP 2021 open task.

The training data for each of the languages comes from a variety of different sources. The Asháninka training data is sourced from Ortega et al. (2020); Cushimariano Romano and Sebastián Q. (2008); Mihas (2011) and consists of stories, educational texts, and environmental laws. The Aymara training data consists mainly of news text from the GlobalVoices corpus (Prokopidis et al., 2016) as available through OPUS (Tiedemann, 2012). The Bribri training data is from six sources (Feldman and Coto-Solano, 2020; Margery, 2005; Jara Murillo, 2018a; Constenla et al., 2004; Jara Murillo and Segura, 2013; Jara Murillo, 2018b; Flores Solórzano, 2017) ranging from dictionaries and textbooks to story books. The Guaraní training data consists of blogs and web news sources collected by Chiruzzo et al. (2020). The Nahuatl training data comes from the Axolotl parallel corpus (Gutierrez-Vasques et al., 2016). The Quechua training data was created from the JW300 Corpus (Agić and Vulić, 2019), including Jehovah’s Witnesses text and dictionary entries collected by Huarcaya Taquiri (2020). The Rarámuri training data consists of phrases from the Rarámuri dictionary (Brambila, 1976). The Shipibo-Konibo training data consists of translations of a subset of the Tatoeba dataset (Montoya et al., 2019), translations from bilingual education books (Galarreta et al., 2017), and dictionary entries (Loriot et al., 1993). The Wixarika training data consists of translated Hans Christian Andersen fairy tales from Mager et al. (2018).

No formal citation was given for the source of the Hñähñu training data (see Mager et al., 2021).

Appendix B Hyperparameter Details

Pre-training

The character embeddings for our multilingual model are initialized by training CBOW (Mikolov et al., 2013) on the AmericasNLP training set for 32 epochs, with a window size of 5. Special tokens like <bos> that do not appear in the training corpus are randomly initialized. These pre-trained embeddings are not frozen during training. During pre-training, a dropout rate of 12.5% is applied within the (transformer) encoder layers. A dropout rate of 6.25% is applied both to the embeddings before being passed to the encoder, and to the hidden-state and start-symbol encodings input to the decoder (see Downey et al., 2021). Checkpoints are taken every 128 steps. The optimal learning rate was 7.5e-4.
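A plausible way to reproduce this initialization with gensim's CBOW implementation is sketched below (gensim >= 4.x API); the character-level tokenization, the direct read of train_balanced.anlp, and the 256-dimensional vector size are assumptions for illustration rather than details from the released code.

```python
from gensim.models import Word2Vec

# Treat each line of the balanced multilingual training file as a character sequence.
with open("train_balanced.anlp", encoding="utf-8") as f:
    char_sentences = [list(line.strip()) for line in f if line.strip()]

# sg=0 selects CBOW; window and epoch counts follow the settings described above.
w2v = Word2Vec(sentences=char_sentences, vector_size=256, window=5,
               sg=0, epochs=32, min_count=1)

# Embedding matrix rows for each character in the vocabulary.
char_vectors = {ch: w2v.wv[ch] for ch in w2v.wv.index_to_key}
```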

K’iche’ Transfer Experiments

Similar to the pre-trained model, character embeddings are initialized using CBOW on the given training set for 32 epochs with a window size of 5, and these embeddings are not frozen during training. As in pre-training, a dropout rate of 6.25% is applied to the input embeddings and to the start-symbol encoding for the decoder. Checkpoints are taken every 64 steps for sizes 256 and 512, and every 128 steps for every other size.

For all training set sizes, we sweep 5 learning rates and 3 encoder dropout rates, but the swept set is different for each. For size 256, we sweep learning rates {5e-5, 7.5e-5, 1e-4, 2.5e-4, 5e-4} and (encoder) dropout rates {12.5%, 25%, 50%}. For size 2048, we sweep learning rates {1e-4, 2.5e-4, 5e-4, 7.5e-4, 1e-3} and dropouts {12.5%, 25%, 50%}. For the full training size, we sweep learning rates {1e-4, 2.5e-4, 5e-4, 7.5e-4, 1e-3} and dropouts {6.5%, 12.5%, 25%}.

Language File Lines Total Tokens Unique Tokens Total Characters Unique Characters Mean Token Length
All train.anlp 259,207 2,682,609 400,830 18,982,453 253 7.08
All train_balanced.anlp 171,830 1,839,631 320,331 11,981,011 241 6.51
Asháninka train.cni 3,883 26,096 12,490 232,494 65 8.91
Asháninka train_1.mono.cni 12,010 99,329 27,963 919,897 48 9.26
Asháninka train_2.mono.cni 593 4,515 2,325 42,093 41 9.32
Aymara train.aym 6,424 96,075 33,590 624,608 156 6.50
Bribri train.bzd 7,508 41,141 7,858 167,531 65 4.07
Guaraní train.gug 26,002 405,449 44,763 2,718,442 120 6.70
Hñähñu train.oto 4,889 72,280 8,664 275,696 90 3.81
Nahuatl train.nah 16,684 351,702 53,743 1,984,685 102 5.64
Quechua train.quy 120,145 1,158,273 145,899 9,621,816 114 8.31
Quechua train_downsampled.quy 32,768 315,295 64,148 2,620,374 95 8.31
Rarámuri train.tar 14,720 103,745 15,691 398,898 74 3.84
Shipibo Konibo train.shp 14,592 62,850 17,642 397,510 56 6.32
Shipibo Konibo train_1.mono.shp 22,029 205,866 29,534 1,226,760 61 5.96
Shipibo Konibo train_2.mono.shp 780 6,424 2,618 39,894 39 6.21
Wixarika train.hch 8,948 48,864 17,357 332,129 67 6.80
Table 3: Composition of the AmericasNLP 2021 training sets
Language File Lines Total Tokens Unique Tokens Total Characters Unique Characters Mean Token Length
All dev.anlp 9,122 79,901 27,597 485,179 105 6.07
Asháninka dev.cni 883 6,070 3,100 53,401 63 8.80
Aymara dev.aym 996 7,080 3,908 53,852 64 7.61
Bribri dev.bzd 996 12,974 2,502 50,573 73 3.90
Guaraní dev.gug 995 7,191 3,181 48,516 70 6.75
Hñähñu dev.oto 599 5,069 1,595 22,712 69 4.48
Nahuatl dev.nah 672 4,300 1,839 31,338 56 7.29
Quechua dev.quy 996 7,406 3,826 58,005 62 7.83
Rarámuri dev.tar 995 10,377 2,964 55,644 48 5.36
Shipibo Konibo dev.shp 996 9,138 3,296 54,996 65 6.02
Wixarika dev.hch 994 10,296 3,895 56,142 62 5.45
Table 4: Composition of the AmericasNLP 2021 validation sets