Weakly Supervised POS Taggers Perform Poorly on Truly Low-Resource Languages

04/28/2020 ∙ by Katharina Kann et al. ∙ University of Copenhagen, New York University

Part-of-speech (POS) taggers for low-resource languages which are exclusively based on various forms of weak supervision - e.g., cross-lingual transfer, type-level supervision, or a combination thereof - have been reported to perform almost as well as supervised ones. However, weakly supervised POS taggers are commonly only evaluated on languages that are very different from truly low-resource languages, and the taggers use sources of information, like high-coverage and almost error-free dictionaries, which are likely not available for resource-poor languages. We train and evaluate state-of-the-art weakly supervised POS taggers for a typologically diverse set of 15 truly low-resource languages. On these languages, given a realistic amount of resources, even our best model gets fewer than half of the words right. Our results highlight the need for new and different approaches to POS tagging for truly low-resource languages.








Part-of-speech (POS) tagging can be very helpful in many natural language processing (NLP) applications, including machine translation, question answering, or relation extraction, especially for low-resource languages [32, 26, 11]. POS taggers assign a syntactic category, i.e., a part of speech, to each token in a sentence, thus providing a rudimentary syntactic analysis of the sentence. While ambiguity and unseen words make POS tagging non-trivial, supervised POS tagging is often considered a relatively simple problem, reaching performance close to inter-annotator agreement for English [9]. The fact that weakly supervised POS taggers sometimes obtain almost-as-good scores makes POS tagging come across as a task with little room for improvement. We show that, on the contrary, there is a lot of room for improvement.

The accuracy of POS taggers naturally depends on the quality of the available training data, as well as on the granularity of the POS inventory, i.e., the size of the tag set. Nevertheless, for almost all resource-rich languages considered in the literature, supervised POS taggers have reported tagging accuracies in the high 90s. (The CoNLL 2018 shared task [41] on universal dependencies reported high macro-average F1 scores on a set of 61 treebanks in high-resource languages.) However, supervised POS taggers typically require corpora that have been manually created by professional linguistic annotators. (See Hovy et al. (2015) for experiments with crowdsourcing POS annotation, indicating that lay annotators can sometimes be used at a small cost in accuracy.) In the absence of such corpora, we have to resort to alternative sources of (weak) supervision. One very popular source of such supervision is tag dictionaries [23, 15, 37, 40, 25, 30].

In this paper, we revisit and critically examine existing methods for weakly supervised POS tagging. While such approaches have been created to meet the needs of low-resource languages, many experiments with weakly supervised POS taggers have been limited to languages for which good resources actually exist. There are practical reasons for this: it is easier to obtain translations, dictionaries, and benchmark data for widely studied languages. However, since resource-rich languages are not a representative sample of the world’s languages, we have no guarantee that results scale to truly low-resource languages.

Contributions We study the performance of weakly supervised POS taggers for actual low-resource languages. To this end, we consider a language truly low-resource for a task when no or almost no resources, i.e., annotated corpora or manually created dictionaries, are available for that task in that language. Assuming a setting in which we have access to no annotated data in the target language at all, we investigate several state-of-the-art baselines, as well as two cross-lingual transfer variants of a state-of-the-art model for the high-resource case [31], extended to a multi-task learning (MTL) architecture in order to provide additional character-level supervision [21]. Our results show that POS tagging is still a difficult problem in the truly low-resource setting: Our strongest baseline obtains a macro-average accuracy of only 0.39 over 15 languages, even though it was developed specifically for low-resource languages. Our best architecture, based on cross-lingual transfer and character-level MTL, is slightly better, with a macro-average accuracy of 0.42, but our experiments emphasize the need for additional research on POS tagging for truly low-resource languages. We provide our preprocessed data for all languages under https://bitbucket.org/olacroix/truly-low-resource in order to facilitate future research.

Disclaimer In this study, we do not consider the case where small amounts of labeled data are (also) available for the target language. While it is easy to think that annotating a few hundred sentences should be cheap, developing guidelines, as well as finding and training annotators for truly low-resource languages, is challenging and often impossible. Moreover, we do not look at the potential gains from using contextualized word embeddings from pretrained language models [29, 12], since even high-quality language models can be hard to train for truly low-resource languages: data is scarce and exhibits significant spelling variation, and language detection software has limited precision for truly low-resource languages [5].

Unsupervised and Weakly Supervised POS Tagging

Since the resources required to train supervised taggers are expensive to create and unlikely to exist for the majority of the world’s languages, many unsupervised and weakly supervised methods—the latter often building on the former—have been developed. We now briefly survey this work.

Unsupervised POS tagging Hidden Markov models (HMMs) and their variants have been standard models for POS induction [24, 16]. Christodoulopoulos et al. (2010) evaluated seven different POS induction systems, from the well-known clustering method of Brown et al. (1992) to that of Berg-Kirkpatrick et al. (2010), who incorporated linguistically motivated features into HMMs. Subsequently, Christodoulopoulos et al. (2011) proposed a Bayesian multinomial mixture model instead of a sequence model to infer clusters from unlabeled text. More recent studies include [35], which proposes anchor HMMs.

POS tags can then be assigned to clusters, e.g., by using monolingual dictionaries. In most studies, however, gold-annotated data are used to find the best cluster-to-tag mapping, so the reported numbers are upper bounds on system performance. While all approaches mentioned above were reported to perform reasonably well, e.g., for English [8] and for German and Arabic [10], none of the results were obtained for actual low-resource languages or in a low-resource setting (such as when using low-resource dictionaries).

Weakly supervised POS tagging Weakly supervised methods for POS tagging tend to rely on the projection of information across word alignments between parallel corpora [37, 1, 30], on cross-lingual features obtained from parallel data, comparable data, or seed bilingual dictionaries [17, 14], or on monolingual tag dictionaries that constrain the search space of unsupervised POS induction [23, 40, 15]. Strategies using dictionaries extracted from Wiktionary have been popular for learning from noisily labeled sentences [23, 40]. Li et al. (2012) proposed a feature-based maximum entropy emission model for learning POS taggers from texts which have been labeled using monolingual dictionaries. However, they evaluate their system exclusively on resource-rich languages for which such dictionaries cover a large part of the vocabulary, resulting in high performance. Garrette and Baldridge (2013) proposed a maximum entropy Markov model for learning POS taggers from noisily labeled sentences. They labeled a raw corpus with the help of a manually annotated dictionary that was expanded by label propagation. Notably, the authors did evaluate on actual low-resource languages. However, to achieve reasonable results, they bootstrapped their model with a small set of manually annotated sentences.

Our work is also related to that of Fang and Cohn (2017), in that we do not rely on parallel text, since we are interested in a setting in which large amounts of parallel corpora are not available, as might be the case for truly low-resource languages. However, in contrast to them, we do not assume a small amount of training data for the target language. Further, our approach is similar to that of Täckström et al. (2013) in creating silver-standard training data from type-level dictionaries, but we make use of the state-of-the-art architecture of Plank et al. (2016) instead of employing HMMs. Our study is also similar to Agić et al. (2015) in evaluating approaches to weakly supervised POS tagging across many low-resource languages, but differs in that we are interested in methods that do not need parallel corpora.

Model Architecture

Hierarchical POS-tagging long short-term memory networks (LSTMs), such as the architecture proposed by Plank et al. (2016), receive both word-level and subword-level input. They perform well even on unseen words, due to their ability to associate subword-level patterns with POS tags. This is important in a low-resource setting. However, hierarchical LSTMs are also very expressive, and thus prone to overfitting. To counteract this, we additionally train our models on subword-level auxiliary tasks [21] to regularize the character-level encoding in hierarchical LSTMs. Such a model is still able to make predictions for unknown words, but the subword-level auxiliary task should prevent it from overfitting.

Hierarchical LSTMs with Character-Level Decoding

For the hierarchical sequence labeling LSTM, we follow Plank et al. (2016): our subword-level LSTM is bidirectional and operates on the character level [7]. Its input is the character sequence of each input word, represented by the embedding sequence $c_1, \ldots, c_m$. The final character-based representation of each word is the concatenation of the two last LSTM hidden states:

$$\vec{c}_w = [\overrightarrow{\mathrm{LSTM}}(c_1, \ldots, c_m); \overleftarrow{\mathrm{LSTM}}(c_m, \ldots, c_1)]$$

Here, $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote a forward and a backward LSTM, respectively.

Second, a context bi-LSTM operates on the word level. Like Plank et al. (2016), we use the term “context bi-LSTM” to denote a bidirectional LSTM which, in order to generate a representation for input element $i$, encodes all elements up to position $i$ with a forward LSTM and all elements from position $n$ down to $i$ with a backward LSTM. For each sentence represented by embeddings $w_1, \ldots, w_n$, its input is the concatenation of the word embeddings with the outputs of the subword-level LSTM: $x_i = [w_i; \vec{c}_{w_i}]$. The final representation which gets forwarded to the next part of the network is again the concatenation of the two last hidden LSTM states:

$$h_i = [\overrightarrow{\mathrm{LSTM}}(x_1, \ldots, x_i); \overleftarrow{\mathrm{LSTM}}(x_n, \ldots, x_i)]$$
This is then passed on to a classification layer.

Character-Based Seq2Seq Model

The second component of our model is based on a character-level sequence-to-sequence (seq2seq) architecture. It consists of a bi-LSTM encoder which is connected to an LSTM decoder [36].

Encoding The encoder is the character-level bi-LSTM described above and thus yields the representation

$$\vec{c}_w = [\overrightarrow{\mathrm{LSTM}}(c_1, \ldots, c_m); \overleftarrow{\mathrm{LSTM}}(c_m, \ldots, c_1)]$$

for an input word embedded as $c_1, \ldots, c_m$. Parameters of the character-level LSTM are shared between the sequence labeling and the seq2seq parts of our model.

Decoding The decoder receives the concatenation of the last hidden states, $\vec{c}_w$, as input. In particular, we do not use an attention mechanism [4], since our goal is not to improve performance on the auxiliary task, but instead to encourage the encoder to learn better word representations. The decoder is trained to predict each output character $y_t$ dependent on $\vec{c}_w$ and the previous predictions $y_1, \ldots, y_{t-1}$ as

$$p(y_t \mid \vec{c}_w, y_1, \ldots, y_{t-1}) = g(y_{t-1}, s_t, \vec{c}_w)$$

for a non-linear function $g$ and the LSTM hidden state $s_t$. The final softmax output layer is calculated over the character vocabulary of the language.

Multi-Task Learning

The character-level LSTM is shared between the sequence labeling and the seq2seq components of our network. All model parameters, including all embeddings, are updated during training.

We want to train our neural model jointly on (i) a low-resource main task, i.e., POS tagging, and (ii) a character-level auxiliary task, namely word autoencoding. Therefore, we want to maximize the following joint log-likelihood:

$$\mathcal{L}(\theta) = \sum_{(x, y) \in \mathcal{T}} \log p_\theta(y \mid x) + \sum_{(w, w) \in \mathcal{A}} \log p_\theta(w \mid w)$$

Here, $\mathcal{T}$ denotes the POS tagging training data, with $x$ being the input sentence and $y$ the corresponding label sequence. $\mathcal{A}$ is our auxiliary task training data, with examples consisting of input and output $w$. The set of model parameters $\theta$ is the union of the parameters of the sequence labeling and the seq2seq parts. Parameters of the character-level LSTM are shared between tasks.

Word autoencoding (AE) Our auxiliary task consists of reproducing a given input character sequence in the output. Thus, training examples are of the form $(w, w)$, where $w \in \Sigma^*$ for the character vocabulary $\Sigma$ of the language. Word autoencoding has been used as an auxiliary task before, e.g., by Vosoughi et al. (2016).
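Constructing the autoencoding training data from raw text is straightforward; the following is a minimal sketch (the function name is ours, not the authors'):

```python
def autoencoding_examples(raw_sentences):
    """Build word-autoencoding examples (input word -> identical output
    word) from tokenized raw text; each word is later consumed as a
    character sequence by the seq2seq component."""
    vocab = {tok for sent in raw_sentences for tok in sent}
    return [(w, w) for w in sorted(vocab)]
```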

Language: code, family | Treebank data (test): sentences, tokens | Unimorph: entries | Wikidata+Panlex: translations | Wikipedia (# tagged): sentences, tokens | Embeddings: entries
am AA 1,095 10k - 2.7k 777 17.9k 10k
be IE 68 1.3k - 35.3k 7,385 101.9k 93k
br IE 888 10.3k - 12.2k 9,083 112.9k 39k
fo IE 1,208 10.0k 45.4k 2.9k 9,958 144.6k 12k
hsb IE 623 10.7k - 4.6k 1,858 30.2k 10k
hy IE 514 11.4k 338k 65.1k 3,560 71.4k 47k
kmr IE 734 10.1k - 4.6k 3,225 48.3k 24k
lt IE 55 1.0k 34.1k 38.9k 11,464 117.2k 100k
mr IE 47 0.4k - 23.4k 4,886 55.2k 47k
mt AA 100 2.3k - 2.1k 2,361 43.9k 16k
bxr Mo 908 10.0k - 2.7k 2,308 37.8k 28k
kk Tu 1,047 10.1k - 63.5k 12,273 122.4k 100k
ta Dr 120 2.2k - 27.1k 5,772 76.2k 100k
te Dr 146 0.7k - 28.0k 7,872 90.9k 100k
tl Au 55 0.2k - 6.8k 5,871 97.6k 41k
de IE 1,000 21.3k 179.3k 90.2k 12,162 195.1k 100k
es IE 1,000 23.3k 382.9k 59.7k 15,209 276.6k 100k
it IE 1,000 23.7k 509.5k 59.7k 10,254 170.0k 100k
pt IE 1,000 23.4k 303.9k 47.9k 12,674 195.2k 100k
sv IE 1,000 19.1k 78.4k 58.8k 10,243 134.5k 100k
Table 1: Resources for our low-resource languages (top) and high-resource languages (bottom). Language families: Afro-Asiatic (AA), Austronesian (Au), Dravidian (Dr), Indo-European (IE), Mongolic (Mo), and Turkic (Tu).

Weak and Cross-Lingual Supervision

The model described above relies on full supervision. In our setting, however, we need to rely on alternative resources, including raw corpora and linguistic resources, and on cross-lingual transfer to create training examples. The raw corpora which we leverage for our approaches consist of cleaned Wikipedia texts for our languages. Preprocessing involves (a) segmentation of text into sentences; (b) tokenization; (c) removal of sentences that contain (i) mostly foreign characters, (ii) mostly symbols or punctuation, or (iii) no words from our dictionaries. The texts that we extract for our low-resource languages consist predominantly of ambiguous sentences (i.e., most words in these sentences are mapped to more than one POS tag in our dictionaries), which makes it impossible to extract for training only sentences that would be fully and unambiguously tagged with our dictionaries. Numbers of sentences and tokens extracted for each language are shown in Table 1.
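The three sentence filters above can be sketched as follows; the thresholds, the helper name, and the shape of the dictionary are illustrative assumptions, not details given in the paper:

```python
def keep_sentence(tokens, dictionary, alphabet,
                  max_foreign=0.5, max_symbol=0.5):
    """Apply the three filters to one tokenized sentence.

    `dictionary` maps known words to tag sets; `alphabet` is the set of
    characters considered native to the language. The 0.5 thresholds are
    assumptions for illustration only.
    """
    chars = [c for tok in tokens for c in tok]
    if not chars:
        return False
    # (i) mostly foreign characters
    foreign = sum(1 for c in chars if c.isalpha() and c not in alphabet)
    if foreign / len(chars) > max_foreign:
        return False
    # (ii) mostly symbols or punctuation
    symbols = sum(1 for c in chars if not c.isalnum())
    if symbols / len(chars) > max_symbol:
        return False
    # (iii) no words from the dictionary
    if not any(tok.lower() in dictionary for tok in tokens):
        return False
    return True
```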

In order to obtain silver-standard training data for POS tagging in our low-resource languages, we rely on cross-lingual transfer. In particular, we assume the following to be available for each low-resource language: (i) raw text, e.g., the Wikipedia corpora we just described; (ii) a bilingual dictionary containing translations, i.e., pairs of words in the low-resource language and a high-resource language, respectively; (iii) large amounts of gold POS-annotated data in the high-resource language; and, optionally, (iv) a monolingual tag dictionary. Given those resources, we propose two cross-lingual transfer methods, which we outline in the following.

Frequency-based annotation (FREQ) The high-level idea of our first approach is to tag each token with the POS tag which has been most frequently assigned to its high-resource language translations:

$$\mathrm{pos}(x) = \arg\max_{t \in T(x)} \mathrm{count}(t)$$

Here, $D_{bi}(x)$ denotes all possible translations of token $x$ in the bilingual dictionary $D_{bi}$, $T(x)$ is the set of all POS tags attested for those translations in the high-resource corpus $C_{HR}$, and $\mathrm{count}(t)$ is the number of times tag $t$ has been assigned to any word in $D_{bi}(x)$ in $C_{HR}$. $D_{bi}$ is a type-level resource, while the training data in the high-resource language is token-based, i.e., words are tagged in context.

A word which appears in neither the bilingual nor the monolingual dictionary is not tagged. We switch off learning for those tokens, i.e., we mask the calculation of the loss and thus do not take them into account during optimization. Raw sentences which do not contain at least one tagged word are discarded.
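A minimal sketch of the FREQ annotation step, assuming the bilingual dictionary maps low-resource words to lists of high-resource translations and the annotated corpus is a list of (word, tag) pairs; all function names are hypothetical, not the authors' code:

```python
from collections import Counter

def build_tag_counts(hr_corpus):
    """Count how often each tag is assigned to each high-resource word
    in the token-level annotated corpus (list of (word, tag) pairs)."""
    counts = {}
    for word, tag in hr_corpus:
        counts.setdefault(word, Counter())[tag] += 1
    return counts

def freq_tag(token, bi_dict, tag_counts, mono_dict=None):
    """Return the tag most frequently assigned to any HR translation of
    `token`, or None if the token cannot be tagged (its loss is masked)."""
    pooled = Counter()
    for translation in bi_dict.get(token, []):
        pooled.update(tag_counts.get(translation, {}))
    if pooled:
        return pooled.most_common(1)[0][0]
    if mono_dict and token in mono_dict:  # optional monolingual fallback
        return next(iter(mono_dict[token]))
    return None  # untagged -> masked during training
```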

Ambiguous annotation (AMB) We next propose to annotate our raw sentences with noisy or ambiguous labels. We tag each token $x$ with all POS tags we consider possible, given the bilingual dictionary $D_{bi}$ and the high-resource language data $C_{HR}$, i.e., we assign to a token all tags in

$$T(x) = \bigcup_{x' \in D_{bi}(x)} \mathrm{tags}(x', C_{HR})$$

with all variables denoting the same quantities as before. In order to include our monolingual dictionaries $D_{mono}$, which give a set of possible tags for each word in the low-resource language, we extend this to

$$T'(x) = T(x) \cup D_{mono}(x)$$

Words that do not appear in any of these resources are tagged with all possible tags. As before, we discard raw sentences which contain only tokens without information.
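The AMB tag-set construction can be sketched as follows, under the same illustrative data shapes as for FREQ (the function name is ours):

```python
def amb_tags(token, bi_dict, tag_counts, mono_dict, all_tags):
    """Collect every tag attested for any high-resource translation of
    `token`, extended with the monolingual dictionary entries; tokens
    unknown to all resources receive the full tag set."""
    tags = set()
    for translation in bi_dict.get(token, []):
        tags.update(tag_counts.get(translation, {}))  # keys = attested tags
    tags.update(mono_dict.get(token, set()))
    return tags if tags else set(all_tags)
```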

For training, we adapt the calculation of the cross entropy to ambiguous annotation. In particular, we obtain the loss of each example by treating the annotated tag which obtains the highest probability under our model as the gold label.
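The adapted loss for one token can be sketched as follows, assuming the model's per-tag probabilities are available as a dictionary; this is an illustration of the idea, not the authors' implementation:

```python
import math

def ambiguous_ce(probs, allowed):
    """Cross entropy against the allowed tag with the highest model
    probability, treating it as the gold label.

    `probs` maps tags to model probabilities for one token;
    `allowed` is the set of tags annotated for that token."""
    best = max(allowed, key=lambda t: probs.get(t, 0.0))
    return -math.log(probs[best])
```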

Languages and Resources

The goal of this work is to evaluate state-of-the-art unsupervised or weakly supervised POS-tagging strategies on truly low-resource languages. For this, we select 15 languages from the Universal Dependencies (UD) project v2.1 [28] (http://universaldependencies.org/). All chosen languages are low-resource languages: 10 of them (Belarusian (be), Buryat (bxr), Upper Sorbian (hsb), Armenian (hy), Kazakh (kk), Kurmanji (kmr), Lithuanian (lt), Marathi (mr), Tamil (ta), and Telugu (te)) have fewer than 10k tokens for training in UD, and the other 5 languages (Amharic (am), Breton (br), Faroese (fo), Maltese (mt), and Tagalog (tl)) have no training data at all. Note, however, that we consider the training sets only to determine which languages are low-resource; we do not make use of any training data in our experiments. Our languages represent six different language families: Afro-Asiatic (am and mt), Austronesian (tl), Dravidian (ta and te), Indo-European (be, br, fo, hsb, hy, kmr, lt, and mr), Mongolic (bxr), and Turkic (kk); cf. Table 1.

English is used as the high-resource language for cross-lingual transfer in our experiments. Note that a more informed choice of source language might obtain better results, but bilingual dictionaries are most likely to exist from/to English.


Bilingual dictionaries A possible resource for weakly supervised approaches to cross-lingual POS tagging are bilingual dictionaries that contain word-to-word translations. They can be used for transferring information from resource-rich to low-resource languages, either by replacing words directly, by transferring statistics, or by inducing cross-lingual word representations as done, e.g., by Faruqui and Dyer (2014). Some bilingual dictionaries can be downloaded from the Wiktionary user page of Matthias Buchmeier (https://en.wiktionary.org/wiki/User:Matthias_Buchmeier) or from the Wikt2Dict project website (https://github.com/juditacs/wikt2dict). Unfortunately, those dictionaries do not include many low-resource languages: only one of our 15 languages (lt) is covered. We therefore rely on freely available resources for extracting bilingual dictionaries: Panlex Swadesh (https://panlex.org) and Wikidata (https://www.wikidata.org). The Panlex Swadesh Corpora [6] gather lists of 317 words (and synonyms) for over 600 languages. From this, we retrieve small bilingual dictionaries for almost all low-resource languages we consider. (Panlex Swadesh data is not available for kmr and mr.) Our second resource, Wikidata, is an online database collecting more than 28 million links to Wikipedia and Wiktionary pages across different languages. The English pages align the multilingual sites, which enables us to extract bilingual dictionary entries. We extract bilingual dictionaries (English to target language) from these pages for our 15 low-resource languages. (We use the Mongolian and Kurdish versions for Buryat and Kurmanji, respectively, which are closely related languages.) Statistics are provided in Table 1.
Monolingual dictionaries A common approach for obtaining monolingual tag dictionaries, i.e., lists of tokens matched with their possible POS tags, for low-resource languages is to extract this information from Wiktionary (https://www.wiktionary.org), a multilingual free online dictionary. However, as an online collaborative tool that can be edited by any user, it is noisy and prone to errors. This is particularly true for low-resource languages, which is one of the reasons why studies such as Li et al. (2012) and Wisniewski et al. (2014) limited themselves to a few (not truly resource-poor) languages.

The Unimorph project [22], in contrast, was professionally created and contains morpho-syntactic information for 107 languages. However, this resource is only available for 3 of our languages (see Table 1), and POS tags are restricted to nouns, verbs, adjectives, as well as a small number of adverbs. We convert the Unimorph tagging scheme into UD POS tags for all languages.
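The conversion can be sketched as a simple lookup; the exact Unimorph label strings below are an assumption for illustration, since the paper does not list them:

```python
# Hypothetical mapping from Unimorph POS labels to UD POS tags; the paper
# states that the covered inventory is nouns, verbs, adjectives, and a
# small number of adverbs.
UNIMORPH_TO_UD = {"N": "NOUN", "V": "VERB", "ADJ": "ADJ", "ADV": "ADV"}

def convert_entry(lemma, unimorph_pos):
    """Map one Unimorph dictionary entry to a (word, UD tag) pair,
    skipping labels outside the covered inventory."""
    ud = UNIMORPH_TO_UD.get(unimorph_pos)
    return (lemma, ud) if ud else None
```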


Word embeddings We leverage monolingual word embeddings to improve our ability to generalize to unseen words in the target language. This can be seen as MTL with a language modeling auxiliary task. However, since the use of word embeddings is de facto standard in NLP, we consider them a basic part of our model. We employ the Polyglot embeddings, which have been built from Wikipedia texts and made available by Al-Rfou et al. (2013) (https://sites.google.com/site/rmyeid/projects/polyglot). See Table 1 (last column) for the number of words covered by those embeddings.


We explore a set of state-of-the-art unsupervised and weakly supervised POS tagging strategies. Note that we do not experiment with information projected from parallel corpora as in Agić et al. (2015), since suitable datasets are only available for 7 of our 15 languages. As mentioned in the disclaimer above, we also do not consider limited supervision or contextualized word embeddings.
CHR11 We employ the fully unsupervised method of Christodoulopoulos et al. (2011) (https://github.com/christos-c/bmmm; cf. related work section) as our first baseline. We learn clusters on raw text extracted from Wikipedia using their Bayesian multinomial mixture model. To infer POS tags, we use our monolingual dictionaries to match each cluster with the POS tag that is most frequently associated with its tokens.
GAR13 Second, we compare to the system of Garrette and Baldridge (2013) (https://github.com/dhgarrette/low-resource-pos-tagging-2014; cf. related work section), which consists of a maximum entropy Markov model that learns from a corpus of noisily labeled sentences. For this, we make use of the raw Wikipedia texts and our monolingual dictionaries.
PLA16 Plank et al. (2016) propose a hierarchical LSTM POS tagger relying on a combination of character embeddings and word embeddings. They train this architecture in a multi-task fashion, using the prediction of the log frequency of the next word as an auxiliary task. We compute log frequencies for each word $w$ as $\mathrm{int}(\log(\mathrm{freq}(w)))$. We then learn POS taggers in an MTL setup: the main task learns to predict POS tags using the ambiguous learning strategy (data is created the same way as for AMB), and the auxiliary task learns a simplified language model. The auxiliary task training data consists of sentences from Wikipedia.
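The auxiliary labels follow Plank et al. (2016): the integer part of the log of each word's training frequency. A minimal sketch (function name is ours; natural log assumed):

```python
import math
from collections import Counter

def log_freq_labels(corpus):
    """Auxiliary-task labels: int(log(freq(w))) for each word in the
    training corpus (a list of tokenized sentences)."""
    freq = Counter(tok for sent in corpus for tok in sent)
    return {w: int(math.log(c)) for w, c in freq.items()}
```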

We mostly adopt the hyperparameters of Plank et al. (2016). The number of dimensions of our word embeddings is 64 (which is also the dimension of the Polyglot embeddings). We use one hidden layer for both the word and the character LSTM. In our use of dropout, we also follow Plank et al. (2016); however, we further add character dropout to improve regularization. We train for a bounded number of epochs and terminate early if no improvement in the training loss is detected for several consecutive epochs. At test time, we use the model which obtained the lowest loss.

We compute global accuracy scores over POS tags for all systems. Our evaluation is token-based, and each reported result is an average over 5 runs with different random seeds.
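For reference, the evaluation reduces to token-level accuracy per language and an unweighted (macro) mean across languages; a minimal sketch:

```python
def token_accuracy(gold, predicted):
    """Token-level tagging accuracy for one language."""
    assert len(gold) == len(predicted)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def macro_average(per_language_scores):
    """Unweighted mean over languages, as reported in the result tables."""
    return sum(per_language_scores) / len(per_language_scores)
```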


The results of all experiments are presented in Table 2; the first three columns show the baseline results, and the last four columns contain our proposed approaches. Our most important observation is that none of the approaches perform well on these languages. In fact, on average, no method gets even half of the tags right. This shows that there is still a lot of room for exploring new and radically different approaches to learning POS taggers for (truly) low-resource languages. In addition, we make the following four observations: (i) The unsupervised baseline, CHR11, obtains the lowest accuracy on average over all languages. This shows that, while all POS taggers are poor, distant supervision from tag dictionaries may provide some signal. (ii) On average over all languages, both single-task cross-lingual neural approaches, i.e., AMB and FREQ, outperform all baselines. Further, the difference between AMB and FREQ is small. Thus, it seems that both cross-lingual strategies, i.e., ambiguous labeling and using the most frequent tag as the gold label, work similarly well on average. Looking at individual languages, however, differences of up to around 0.10 in accuracy can be found for am, hsb, hy, kmr, and te. Most likely, this is explained by the frequency difference between the most frequent and other possible tags in the respective languages. (iii)

Looking at PLA16, we find that a language modeling auxiliary task seems to hurt performance on average. PLA16 uses the ambiguous annotation strategy for cross-lingual transfer and should, thus, directly improve over AMB. However, it only performs better for three languages: am, ta, and tl. This contrast with Plank et al. (2016)'s results could be due to the small corpus sizes and, thus, to errors in estimating word frequencies.

(iv) The approaches with character-level MTL, i.e., AMB+AE and FREQ+AE, both improve over the respective single-task approaches. AMB+AE also obtains the highest average accuracy overall. Thus, in contrast to the language modeling auxiliary task, a character-level multi-task approach improves model performance in our setting.

  Language CHR11 GAR13 PLA16 AMB FREQ AMB+AE FREQ+AE
  am 0.3441 0.1392 0.1643 (0.00) 0.1595 (0.03) 0.2479 (0.00) 0.1651 (0.00) 0.2608 (0.01)
  be 0.2366 0.3524 0.4234 (0.03) 0.4627 (0.03) 0.4805 (0.01) 0.4285 (0.02) 0.5027 (0.02)
  br 0.3442 0.3119 0.3267 (0.01) 0.3449 (0.02) 0.3247 (0.01) 0.3375 (0.02) 0.3325 (0.01)
  bxr 0.4432 0.5295 0.3140 (0.09) 0.3783 (0.07) 0.4114 (0.02) 0.4153 (0.09) 0.4372 (0.01)
  fo 0.4048 0.5671 0.5928 (0.01) 0.5992 (0.00) 0.5559 (0.02) 0.6016 (0.00) 0.5341 (0.01)
  hsb 0.1886 0.3657 0.3573 (0.06) 0.4306 (0.01) 0.3540 (0.01) 0.4125 (0.03) 0.3446 (0.01)
  hy 0.3706 0.3821 0.4940 (0.02) 0.5131 (0.01) 0.4302 (0.02) 0.5061 (0.01) 0.4482 (0.01)
  kk 0.451 0.4271 0.4801 (0.07) 0.4809 (0.06) 0.4524 (0.02) 0.5370 (0.02) 0.4469 (0.02)
  kmr 0.3201 0.3501 0.3165 (0.06) 0.3898 (0.01) 0.3020 (0.01) 0.3865 (0.00) 0.2948 (0.01)
  lt 0.383 0.4460 0.5251 (0.01) 0.5266 (0.02) 0.4813 (0.01) 0.5226 (0.01) 0.4857 (0.02)
  mr 0.2522 0.3862 0.3670 (0.00) 0.3710 (0.01) 0.3781 (0.01) 0.3799 (0.01) 0.3808 (0.01)
  mt 0.3126 0.3002 0.2208 (0.03) 0.2666 (0.04) 0.3326 (0.01) 0.2924 (0.06) 0.3544 (0.02)
  ta 0.3275 0.2758 0.3302 (0.04) 0.3163 (0.05) 0.3193 (0.01) 0.3562 (0.00) 0.3259 (0.01)
  te 0.5035 0.5062 0.4430 (0.06) 0.4746 (0.01) 0.5734 (0.02) 0.4888 (0.01) 0.5615 (0.01)
  tl 0.2774 0.5274 0.4931 (0.01) 0.4924 (0.01) 0.5157 (0.02) 0.5075 (0.04) 0.4651 (0.04)
  Average 0.3440 0.3911 0.3899 0.4138 0.4106 0.4225 0.4117
Table 2: POS tagging accuracy. For all neural models, the standard deviation is given in parentheses.

Error Analysis

The most frequent POS tags in the test treebanks are NOUN, PUNCT, and VERB. We suspect this to be the main reason why the cluster-based method CHR11 achieves competitive results: it tends to tag a huge part of the tokens with the most frequent POS tags, i.e., NOUN, PUNCT, and VERB. However, it fails on less frequent ones, losing against the other approaches. Bi-LSTMs tag more tokens belonging to less frequent categories correctly. However, a possible source of errors might be that the training sets we created for AMB and FREQ are fairly unbalanced: the NOUN tag is assigned to a smaller share of tokens for FREQ than in the treebank test sets, the PUNCT tag is much more frequent for FREQ than in the test sets, and the distribution of the VERB tag is similarly skewed. For AMB, NOUN is the most frequently assigned possible tag, followed by VERB and PROPN; PUNCT is slightly less frequent than in the test sets. (Annotations for AMB are ambiguous, i.e., each token can have multiple tags, such that the tag percentages do not sum to 100%.)

Comparison with High-Resource Languages

Finally, the result that state-of-the-art taggers do not perform particularly well for the languages in our experiments is more valuable if we can find a plausible explanation for it. We thus repeat a subset of the main experiments for 5 high-resource languages: German (de), Spanish (es), Italian (it), Portuguese (pt), and Swedish (sv). We compare two different settings: First, we employ resources comparable to the previous experiment (-). Second, we use the dictionaries (named Wiki-ly) provided by Li et al. (2012), which are extracted from Wiktionary and contain more reliable POS tags, and replace Wikipedia texts with UD texts for training (+). (Note that Li et al. (2012) use the universal POS tags of Petrov et al. (2012), which do not exactly correspond to UD POS tags. We map the CONJ and “.” tags to CCONJ and PUNCT, respectively. No PROPN, AUX, or SCONJ tags occur in the dictionaries.) We keep all hyperparameters the same as in the previous experiments. Again, we perform 5 training runs for all neural models and report average scores. We evaluate on the PUD test sets [18] of the UD treebanks for which a Wiki-ly dictionary is available.

Language Setting CHR11 GAR13 AMB FREQ
de - 0.29 0.42 0.47 0.44
es - 0.47 0.36 0.46 0.56
it - 0.39 0.32 0.41 0.48
pt - 0.49 0.34 0.37 0.53
sv - 0.32 0.42 0.43 0.49
Average - 0.39 0.37 0.43 0.50
de + 0.59 0.59 0.66 0.72
es + 0.63 0.74 0.72 0.80
it + 0.67 0.75 0.74 0.75
pt + 0.62 0.66 0.73 0.77
sv + 0.59 0.69 0.70 0.80
Average + 0.62 0.69 0.72 0.77
Table 3: High-resource POS tagging accuracy. "-": data comparable to the previous experiments; "+": higher-quality data.

Results and discussion Results are shown in Table 3. Comparing the average results of the neural models with those for the low-resource languages, we find an improvement of for AMB and of for FREQ. Thus, these approaches work slightly better for our high-resource languages, possibly because those languages are closer to English, our source language for cross-lingual transfer. However, when higher-quality resources are used, tagging performance increases even further for AMB. Similarly, CHR11 and GAR13 improve strongly over their results for the resource-poor languages. For FREQ, performance nearly doubles: it increases by . This highlights the importance of using high-quality resources.
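All scores in Table 3 are token-level accuracies, i.e., the fraction of tokens whose predicted tag matches the gold tag. As a reference, a minimal sketch of this metric:

```python
def token_accuracy(gold, pred):
    """Token-level tagging accuracy over parallel lists of gold and
    predicted POS tags."""
    assert len(gold) == len(pred), "tag sequences must be aligned"
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold) if gold else 0.0
```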

Related Work

POS tagging and other NLP sequence labeling tasks have been successfully approached using bidirectional LSTMs [39, 31]. Early work using such architectures relied on large annotated datasets, but Plank16multi showed that bi-LSTMs are not as reliant on data size as previously assumed. Their approach obtained state-of-the-art results for POS tagging in several languages, which is why we build upon it. rei:2017:Long showed how an additional language modeling objective can improve performance for POS tagging. Neural networks make MTL via parameter sharing easy; thus, different task combinations have been investigated exhaustively [33, 3]. An analysis of task combinations was performed by Bingel2017. ruder2017sluice presented a more flexible architecture, which learns what to share between the main and auxiliary tasks. Augenstein2018NAACL combined MTL with semi-supervised learning for strongly related tasks with different output spaces. However, work on combining sequence labeling main tasks and seq2seq auxiliary tasks is harder to find. dai2015semi pretrained an LSTM as part of a sequence autoencoder on unlabeled data to obtain better performance on a sequence classification task. However, they reported poor results for joint training. We obtain different results: an autoencoding seq2seq task is beneficial for low-resource POS tagging. Cross-lingual approaches have been used for a large variety of tasks, e.g., automatic speech recognition [19], entity recognition [38], and parsing [34, 27, 2]. In the realm of seq2seq models, work exists on cross-lingual machine translation [13, 20]. Another example is the character-based approach by kann2017one for morphological generation.


Conclusion

We analyzed state-of-the-art approaches for low-resource POS tagging of truly low-resource languages: POS tagging in these languages is still difficult because resources are limited and of poor quality, with average tagging accuracies well below 50%. Weakly supervised approaches only slightly outperform a state-of-the-art unsupervised baseline.


Acknowledgments

We would like to thank Barbara Plank, Isabelle Augenstein, and Johannes Bjerva for conversations on this topic. AS is funded by a Google Focused Research Award.


References

  • [1] Ž. Agić, D. Hovy, and A. Søgaard (2015) If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages. In ACL-IJCNLP, Cited by: Unsupervised and Weakly Supervised POS Tagging.
  • [2] W. Ammar, G. Mulcaire, M. Ballesteros, C. Dyer, and N. Smith (2016) Many languages, one parser. TACL 4, pp. 431–444. Cited by: Related Work.
  • [3] I. Augenstein and A. Søgaard (2018) Multi-task learning of pairwise sequence classification tasks over disparate label spaces. In NAACL, Cited by: Related Work.
  • [4] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: Character-Based Seq2Seq Model.
  • [5] T. Baldwin and M. Lui (2010) Language identification: the long and the short of the matter. In NAACL, Cited by: Introduction.
  • [6] T. Baldwin, J. Pool, and S. M. Colowick (2010) PanLex and LEXTRACT: Translating all Words of all Languages of the World. In COLING, Cited by: Dictionaries.
  • [7] M. Ballesteros, C. Dyer, and N. A. Smith (2015) Improved transition-based parsing by modeling characters instead of words with LSTMs. In EMNLP, Cited by: Hierarchical LSTMs with Character-Level Decoding.
  • [8] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein (2010) Painless unsupervised learning with features. In ACL, Cited by: Unsupervised and Weakly Supervised POS Tagging.
  • [9] B. Bohnet, R. McDonald, G. Simões, D. Andor, E. Pitler, and J. Maynez (2018) Morphosyntactic tagging with a meta-biLSTM model over context sensitive token encodings. In ACL, Cited by: Introduction.
  • [10] C. Christodoulopoulos, S. Goldwater, and M. Steedman (2011) A Bayesian Mixture Model for PoS Induction Using Multiple Features. In EMNLP, Cited by: Unsupervised and Weakly Supervised POS Tagging.
  • [11] A. Currey and K. Heafield (2019) Incorporating source syntax into transformer-based neural machine translation. In WMT, Cited by: Introduction.
  • [12] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, External Links: Link, Document Cited by: Introduction.
  • [13] D. Dong, H. Wu, W. He, D. Yu, and H. Wang (2015) Multi-task learning for multiple language translation. In ACL-IJCNLP, Cited by: Related Work.
  • [14] M. Fang and T. Cohn (2017) Model Transfer for Tagging Low-resource Languages using a Bilingual Dictionary. In ACL, Cited by: Unsupervised and Weakly Supervised POS Tagging.
  • [15] D. Garrette and J. Baldridge (2013) Learning a Part-of-Speech Tagger from Two Hours of Annotation. In NAACL, Cited by: Introduction, Unsupervised and Weakly Supervised POS Tagging.
  • [16] S. Goldwater and T. Griffiths (2007) A fully Bayesian approach to unsupervised part-of-speech tagging. In ACL, Cited by: Unsupervised and Weakly Supervised POS Tagging.
  • [17] S. Gouws and A. Søgaard (2015) Simple task-specific bilingual word embeddings. In NAACL, Cited by: Unsupervised and Weakly Supervised POS Tagging.
  • [18] J. Hajič and D. Zeman (2017) Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In CoNLL, Cited by: Comparison with High-Resource Languages.
  • [19] J. Huang, J. Li, D. Yu, L. Deng, and Y. Gong (2013) Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In ICASSP, Cited by: Related Work.
  • [20] M. Johnson, M. Schuster, Q. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean (2017) Google’s multilingual neural machine translation system: enabling zero-shot translation. TACL 5, pp. 339–351. Cited by: Related Work.
  • [21] K. Kann, J. Bjerva, I. Augenstein, B. Plank, and A. Søgaard (2018) Character-level supervision for low-resource POS tagging. In DeepLo, Cited by: Introduction, Model Architecture.
  • [22] C. Kirov, J. Sylak-Glassman, R. Que, and D. Yarowsky (2016) Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms. In LREC, Cited by: Dictionaries.
  • [23] S. Li, J. V. Graça, and B. Taskar (2012) Wiki-ly Supervised Part-of-Speech Tagging. In EMNLP, Cited by: Introduction, Unsupervised and Weakly Supervised POS Tagging.
  • [24] B. Merialdo (1994) Tagging English text with a probabilistic model. Computational Linguistics 20 (2), pp. 155–171. Cited by: Unsupervised and Weakly Supervised POS Tagging.
  • [25] R. Moore (2015) An improved tag dictionary for faster part-of-speech tagging. In EMNLP, Cited by: Introduction.
  • [26] M. Nadejde, S. Reddy, R. Sennrich, T. Dwojak, M. Junczys-Dowmunt, P. Koehn, and A. Birch (2017) Predicting target language CCG supertags improves neural machine translation. In WMT, Cited by: Introduction.
  • [27] T. Naseem, R. Barzilay, and A. Globerson (2012) Selective sharing for multilingual dependency parsing. In ACL, Cited by: Related Work.
  • [28] J. Nivre, Ž. Agić, L. Ahrenberg, M. J. Aranzabe, M. Asahara, A. Atutxa, M. Ballesteros, J. Bauer, et al. (2017) Universal dependencies 2.0. Note: LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University External Links: Link Cited by: Languages and Resources.
  • [29] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep Contextualized Word Representations. In ACL, External Links: Link, Document Cited by: Introduction.
  • [30] B. Plank and Ž. Agić (2018) Distant supervision from disparate sources for low-resource part-of-speech tagging. In EMNLP, Cited by: Introduction, Unsupervised and Weakly Supervised POS Tagging.
  • [31] B. Plank, A. Søgaard, and Y. Goldberg (2016) Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In ACL, Cited by: Introduction, Related Work.
  • [32] R. Sennrich and B. Haddow (2016) Linguistic input features improve neural machine translation. In WMT, Cited by: Introduction.
  • [33] A. Søgaard and Y. Goldberg (2016) Deep multi-task learning with low level tasks supervised at lower layers. In ACL, Cited by: Related Work.
  • [34] A. Søgaard (2011) Data point selection for cross-language adaptation of dependency parsers. In ACL-HLT, Cited by: Related Work.
  • [35] K. Stratos, M. Collins, and D. Hsu (2016) Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models. TACL 4. External Links: Link, Document Cited by: Unsupervised and Weakly Supervised POS Tagging.
  • [36] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NIPS, Cited by: Character-Based Seq2Seq Model.
  • [37] O. Täckström, D. Das, S. Petrov, R. McDonald, and J. Nivre (2013) Token and Type Constraints for Cross-Lingual Part-of-Speech Tagging. TACL 1, pp. 1–12. Cited by: Introduction, Unsupervised and Weakly Supervised POS Tagging.
  • [38] M. Wang and C. D. Manning (2014) Cross-lingual pseudo-projected expectation regularization for weakly supervised learning. TACL 2, pp. 55–66. Cited by: Related Work.
  • [39] P. Wang, Y. Qian, F. K. Soong, L. He, and H. Zhao (2015) A unified tagging solution: Bidirectional LSTM recurrent neural network with word embedding. arXiv:1511.00215. Cited by: Related Work.
  • [40] G. Wisniewski, N. Pécheux, S. Gahbiche-Braham, and F. Yvon (2014) Cross-Lingual Part-of-Speech Tagging through Ambiguous Learning. In EMNLP, Cited by: Introduction, Unsupervised and Weakly Supervised POS Tagging.
  • [41] D. Zeman, J. Hajič, M. Popel, M. Potthast, M. Straka, F. Ginter, J. Nivre, and S. Petrov (2018) CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In CoNLL, Cited by: footnote 1.