Cross-Lingual Transfer for Distantly Supervised and Low-Resource Indonesian NER

07/25/2019 · by Fariz Ikhwantri, et al.

Manually annotated corpora for low-resource languages are usually small in quantity (gold), or large but distantly supervised (silver). Inspired by recent progress in injecting pre-trained language models (LM) into many Natural Language Processing (NLP) tasks, we propose to fine-tune pre-trained language models from high-resource languages to low-resource languages to improve performance in both scenarios. Our experiments demonstrate a significant improvement when fine-tuning a pre-trained language model in cross-lingual transfer scenarios on a small gold corpus, and competitive results on a large silver corpus compared to supervised cross-lingual transfer, which is useful when no parallel annotation for the same task exists to begin with. We compare our proposed method of cross-lingual transfer using a pre-trained LM against other sources of transfer, such as a monolingual LM and Part-of-Speech (POS) tagging, on the downstream NER task with both large silver and small gold datasets, exploiting the character-level input of a bidirectional language model.


1 Introduction

Building a large gold-standard named entity corpus for a low-resource language is challenging because it is time consuming and technical and local expertise are in limited supply. Thus, manually annotated corpora for low-resource languages are usually small, or large but automatically annotated. In most cases, the former are used as test sets to evaluate models trained on the latter.

To reduce annotation effort, previous work [19] utilized parallel corpora to project annotations from high-resource languages to low-resource languages using word alignment. Another promising approach is to use a knowledge base, e.g. DBPedia [2, 1], or semi-structured multi-lingual documents, e.g. Wikipedia [20], to generate named entity seeds.

Previous work on multi-lingual Wikipedia, motivated by acquiring a general corpus [20] and by knowledge alignment between high-resource and low-resource languages, encounters a low-recall problem because of incomplete and inconsistent alignments [22]. Work on monolingual data with intensive labelling rules [1] and label validation [2] to create automatic annotations faces the same problem.

Our contribution in this paper consists of two parts. First, we propose to improve the NER performance of a low-resource language, namely Indonesian, trained on noisily annotated Wikipedia data by (1) fine-tuning an English NER model, and (2) using contextual word representations derived from English (EN), Indonesian (ID), or cross-lingual (EN to ID) fine-tuning of pre-trained language models that exploit character-level input. Second, we analyze why using the pre-trained English language model of [26] yields an improvement compared to a monolingual Indonesian language model, by looking at dataset size, shared characteristics such as orthography, and differences from the source language (English) such as grammar and morphology. We show that fine-tuning ELMo in unsupervised cross-lingual transfer can significantly improve performance over the baselines Stanford-NER [8] and CNN-LSTM-CRF [18], over previous work using a state-of-the-art multi-task NER model with language modeling as an auxiliary task [29, 16] trained on conversational texts, and over its monolingual counterpart trained on different dataset sizes in the target language, in our case Indonesian unlabeled corpora retrieved from Wikipedia and a news dataset [33].

2 Related Works

Recently, Peters et al. [26] proposed to use pre-trained embeddings from a language model (ELMo) trained on large corpora for many NLP tasks such as NER [34], semantic role labeling [21], textual entailment [5], question answering [27] and sentiment analysis [31]. Motivated by deep character embeddings for word representation, which are useful in many linguistic probing and downstream tasks [24] and are trained on large corpora with a language model objective, we investigate ELMo embeddings as weight initialization for the NER task in a low-resource language.

2.1 Deep Character Embedding

Character embeddings are important for handling the out-of-vocabulary problem, such as in out-of-domain data [16] or in another language with a shared orthography [7]. The input word representations to the bidirectional LM are computed by concatenating the outputs of multiple convolution filters applied over the word's character sequence [12, 11], followed by a highway network of depth 2 [32] and a linear projection.

The input to the highway layers, $h^{(0)}$, is the concatenation of the max-pooled convolution filter outputs. The output of the highway layer at depth $l$ is computed as in Equation (1), where $g^{(l)} = \sigma(W_g^{(l)} h^{(l-1)} + b_g^{(l)})$ is the transform gate and $h^{(0)}$ is the input to the first highway layer:

h^{(l)} = g^{(l)} \odot f(W^{(l)} h^{(l-1)} + b^{(l)}) + (1 - g^{(l)}) \odot h^{(l-1)}   (1)
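To make the computation concrete, the following is a minimal PyTorch sketch of such a character encoder (character convolutions, two highway layers as in Equation (1), and a linear projection). The filter widths and dimensions are illustrative assumptions, not the exact ELMo configuration.

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Character-CNN word encoder: convolutions over character embeddings,
    max-pooling, two highway layers (Eq. 1), and a linear projection."""
    def __init__(self, n_chars=262, char_dim=16,
                 filters=((1, 32), (2, 32), (3, 64)), n_highway=2, out_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, n_out, kernel_size=w) for w, n_out in filters])
        conv_dim = sum(n for _, n in filters)
        # Highway: h = g * f(W h + b) + (1 - g) * h
        self.transforms = nn.ModuleList(
            [nn.Linear(conv_dim, conv_dim) for _ in range(n_highway)])
        self.gates = nn.ModuleList(
            [nn.Linear(conv_dim, conv_dim) for _ in range(n_highway)])
        self.proj = nn.Linear(conv_dim, out_dim)

    def forward(self, char_ids):
        # char_ids: (num_words, max_chars) integer character ids
        x = self.char_emb(char_ids).transpose(1, 2)         # (N, char_dim, max_chars)
        pooled = [torch.relu(conv(x)).max(dim=-1).values for conv in self.convs]
        h = torch.cat(pooled, dim=-1)                       # concatenated filter outputs
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(h))
            h = g * torch.relu(transform(h)) + (1 - g) * h  # highway layer (Eq. 1)
        return self.proj(h)                                 # context-insensitive word vector
```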

2.2 Bidirectional Language Models (BiLM)

Language modeling (LM) computes the probability of token $t_k$ in a sequence of $N$ tokens given the preceding tokens, $p(t_k \mid t_1, \ldots, t_{k-1})$. A reversed-order LM computes the probability of token $t_k$ given the succeeding tokens, $p(t_k \mid t_{k+1}, \ldots, t_N)$. The bidirectional LM jointly maximizes the log-likelihood of both directions:

\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \ldots, t_N) \Big)   (2)

In a downstream task such as NER sequence labeling, the ELMo [26] output used as the contextual word representation is the concatenation of the projected highway-layer output of the deep character embedding [12, 11, 32] with the forward and backward hidden-layer outputs of the LM-LSTM. There are several ways to use the ELMo layer for a sequence labeling task; one of them is to use only the last-layer output of the BiLM-LSTM. In this research, we only explore using the last hidden layer of the BiLM-LSTM [25].
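A minimal sketch of this "last layer only" setting is shown below: a toy bidirectional LM whose top-layer forward and backward LSTM states are concatenated into the contextual representation fed to the tagger. The hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LastLayerBiLM(nn.Module):
    """Toy bidirectional LM; the top-layer hidden states serve as the
    contextual word representation (last-layer-only setting)."""
    def __init__(self, word_dim=128, hidden=256, n_layers=2, vocab_size=50000):
        super().__init__()
        self.fwd = nn.LSTM(word_dim, hidden, n_layers, batch_first=True)
        self.bwd = nn.LSTM(word_dim, hidden, n_layers, batch_first=True)
        self.softmax_fwd = nn.Linear(hidden, vocab_size)   # scores for the next token
        self.softmax_bwd = nn.Linear(hidden, vocab_size)   # scores for the previous token

    def forward(self, word_vectors):
        # word_vectors: (batch, seq_len, word_dim), e.g. character-encoder outputs
        h_fwd, _ = self.fwd(word_vectors)
        h_bwd, _ = self.bwd(torch.flip(word_vectors, dims=[1]))
        h_bwd = torch.flip(h_bwd, dims=[1])                # re-align with forward order
        context = torch.cat([h_fwd, h_bwd], dim=-1)        # representation for the tagger
        return context, self.softmax_fwd(h_fwd), self.softmax_bwd(h_bwd)
```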

2.3 Cross-lingual Transfer via Multi-Task Learning

Cross-lingual transfer learning aims to leverage high–resources languages for low-resource languages. Yang et al., (2016)

[36] proposed to transfer character embedding from English to Spanish because they shared same alphabet, while Cotterell et al., (2017) [7] study several languages transfer within the same family and orthographic representation using character embedding as shared input representation. In their proposed model, they shared character convolutions for composing words but not the LSTM layer. In the previous works above, the training process minimizes the joint loss of low-resource and high-resource languages as supervised multi-task learning (MTL) objective. However we found that due to grammatical and morphological different, it is more significant to do pre-training scenario (INIT) instead of joint-training objective.
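The following sketch contrasts the joint MTL objective with the INIT (pre-train then fine-tune) regime on a deliberately trivial shared model; the model, batches and loss are synthetic placeholders, not our actual NER setup, and in practice one regime is chosen rather than both being run in sequence.

```python
import torch
import torch.nn as nn

# Toy shared model and synthetic batches standing in for the high-resource
# (English) and low-resource (Indonesian) task data.
model = nn.Linear(10, 5)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
make_batch = lambda: (torch.randn(8, 10), torch.randint(0, 5, (8,)))
src_batches = [make_batch() for _ in range(100)]   # "English"
tgt_batches = [make_batch() for _ in range(100)]   # "Indonesian"

def step(loss):
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Joint MTL objective: minimize the summed loss of both languages together.
for (xs, ys), (xt, yt) in zip(src_batches, tgt_batches):
    step(criterion(model(xs), ys) + criterion(model(xt), yt))

# INIT: pre-train on the high-resource language, then fine-tune the same
# weights on the target language.
for xs, ys in src_batches:
    step(criterion(model(xs), ys))
for xt, yt in tgt_batches:
    step(criterion(model(xt), yt))
```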

3 Proposed Method

Figure 1: Cross-lingual transfer learning using character-level pre-training. Left: our proposed Unsupervised-Supervised Cross-lingual Transfer, where we fine-tune ELMo on the target task (NER) but in the source language. Right: our proposed cross-lingual language model fine-tuning, where we fine-tune ELMo on the target language, Indonesian.

In this section we briefly explain our two proposed methods. The first extends supervised cross-lingual transfer using ELMo (Figure 1, left). The second fine-tunes ELMo from English to the Indonesian News dataset for use on the distantly supervised and small gold Indonesian NER datasets.

3.1 Supervised Cross-lingual Transfer with ELMo

Alfina et al. [2] observed that automatically annotated corpora fail to tag many orthographically similar entities such as "America" vs. "Amerika" in Indonesian. We also confirmed that there are many false negatives for orthographically similar LOCATION aliases, such as "Pacific" vs. "Pasifik", in Indonesian Wikipedia. Given these many false-negative errors, we propose to increase recall through supervised cross-lingual transfer [36] using pre-trained weights from a state-of-the-art NER model that uses a bidirectional language model. In the experiment results (Table 4), this model corresponds to [English NER Source] ELMo EN-1B Tokens in the "Supervised CL Transfer with ELMo" scenario.

3.2 Unsupervised Cross-lingual Transfer via ELMo fine-tuning

We propose to use a pre-trained language model of a high-resource language such as English in order to initialize better weights for a low-resource language. The cross-lingual transfer in our research is simple and almost the same as [10], with a language modeling objective, but we replace the English target vocabulary with an Indonesian one via random initialization (Figure 1, right).

Our motivation for this method is the observation that there is only a marginal improvement from a monolingual Indonesian LM trained on 82M tokens of Wikipedia, compared to an English LM trained on 1B tokens, when applying ELMo to the distantly supervised NER dataset. This might be attributed to the large difference in publicly available unlabeled corpus size: 82M tokens in Indonesian Wikipedia (as of the 20-08-2018 Wikipedia database dump) vs. the 1B-token language model benchmark or the 2.9B tokens of English Wikipedia available for training. In the experiment results (Table 4), this model corresponds to ELMo EN-ID Transfer in the "CL via ELMo EN" scenario group.
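As a sketch of the transfer step (not the exact bilm-tf code), the idea is to copy every pre-trained BiLM parameter except the vocabulary-dependent softmax layer, which is re-initialized for the Indonesian vocabulary. The `softmax` name prefix and the `id_model` argument are assumptions for illustration.

```python
import torch

def transfer_bilm_weights(en_state_dict, id_model):
    """Copy pre-trained English BiLM weights into an Indonesian BiLM,
    skipping the softmax/vocabulary layer, which stays randomly initialized.
    Assumes `id_model` shares the character-CNN/LSTM architecture and that
    vocabulary-dependent parameters are named with a 'softmax' prefix."""
    target = id_model.state_dict()
    for name, weight in en_state_dict.items():
        if name.startswith("softmax"):          # vocabulary-dependent parameters
            continue                            # keep the Indonesian random init
        if name in target and target[name].shape == weight.shape:
            target[name] = weight.clone()       # language-independent parameters
    id_model.load_state_dict(target)
    return id_model
```

The Indonesian BiLM is then fine-tuned for a few epochs on the target-language unlabeled corpus before being used as the ELMo module of the NER model.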

4 Dataset

In this research, we use gold- and silver-annotated named entity corpora in English as sources for transfer learning. For the target language, we use a large silver-annotated Indonesian corpus as training data. We also use two small clean sets: one as test data in the model comparison scenarios and the other as training data in the ablation scenario for analysis, in addition to unlabeled data from Wikipedia and newswire.

4.1 Gold named entity corpus

4.1.1 CoNLL 2003

The CoNLL 2003 dataset is a well-known shared-task benchmark used in many NLP experiments. We follow the standard training, validation (testa), and test (testb) split. The labels consist of PERSON, LOCATION, ORG, and MISC. We run additional cross-lingual transfer scenarios that ignore the MISC label.

4.1.2 Clean 1.2K DBPedia

Human annotation of a subset of the silver-annotated corpus is important to measure the quality of the automatic annotation. Thus, we asked an Indonesian linguist to re-label a subset of the data and computed metrics for the DEE, MDEE and +Gazz silver annotations. The precision, recall and F1 score of the subset with respect to our clean annotation can be found in Table 2. The clean annotation can be found in the supplementary material. We used this in-house annotation for ablation analysis after training distantly supervised NER. We will make this cleaned subset of DBPedia entities publicly available so that others can replicate our results in the low-resource (gold) scenario.

4.2 Noisy named entity corpus

4.2.1 Wikipedia Named Entity

WP2 and WP3 are two versions of the dataset of [20]. The corpus was obtained from the repository at https://github.com/dice-group/FOX/tree/master/input/Wikiner, because the link mentioned in [20] is down. In this research we use these two versions, corresponding to WP2 and WP3 of this silver-standard named entity recognition dataset. We evaluate models trained on it on the CoNLL test set [34] and WikiGold [3].

Dataset       PER    LOC    ORG   #Tok    #Sent
DEE           13641  16014  2117  599600  20240
MDEE          13336  17571  2270  599600  20240
+Gazz         13269  22211  2815  599600  20240
Gold (Test)   569    510    353   14427   737
Clean 1.2K    1068   1773   720   38423   1220
Table 1: Dataset statistics used in our experiments. #Tok: number of tokens. #Sent: number of sentences. Alfina et al. [1, 2] use Gold as their test set. Clean 1.2K is used to measure the noise percentage of DEE, MDEE, and +Gazz and in the low-resource scenario.

Annotation    Prec   Recall  F1
DEE (1.2K)    60.85  33.08   42.86
MDEE (1.2K)   61.77  35.07   44.74
+Gazz (1.2K)  63.83  40.44   49.51
Table 2: Performance of the 1.2K silver-annotated instances with respect to the Clean 1.2K annotation. Clean 1.2K is a subset of DEE, MDEE and +Gazz.

4.2.2 DBPedia Entity Expansion

Our research uses the publicly available DBPedia Entity Expansion (DEE, Gold) [1] and Modified Rule (MDEE, +Gazetteers) [2] datasets for Indonesian. Interested readers should check the original references for further details. The dataset label statistics can be found in Table 1. We use the same test set (Gold) as the silver-annotated Indonesian NER dataset. However, because of the entity expansion technique, previous works [1, 2] only consider entities without their span (BIO) labels. To alleviate this difference, we transform contiguous entities with the same label into BIO spans, as sketched below. This rule-based conversion does not appear to affect exact-match span-based F1 metrics in the distantly supervised scenarios when we reproduce the models in the same configuration.
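A minimal sketch of this rule-based conversion, assuming each token comes with a plain entity label ("PERSON", "LOCATION", ..., or "O"):

```python
def to_bio(labels):
    """Convert contiguous identical entity labels into BIO spans,
    e.g. [PER, PER, O, LOC] -> [B-PER, I-PER, O, B-LOC]."""
    bio, prev = [], "O"
    for label in labels:
        if label == "O":
            bio.append("O")
        elif label == prev:
            bio.append("I-" + label)     # continuation of the same entity
        else:
            bio.append("B-" + label)     # start of a new entity span
        prev = label
    return bio

# Example: two adjacent but distinct entities of the same type are merged,
# which is the known limitation of this rule-based conversion.
print(to_bio(["PERSON", "PERSON", "O", "LOCATION", "LOCATION"]))
```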

4.3 ID-POS Corpus

The ID-POS corpus [28] contains 10K sentences and 250K tokens from the news domain, with 23 labels. For the POS tagging model, we train 5 models under 5-fold cross-validation, following the dataset split of [15]. For each fold, we transfer the pre-trained weights to all NER training datasets in both the large distantly supervised and the low-resource gold NER scenarios.

4.4 Unlabeled Corpus for Language Model

The Indonesian Wikipedia corpus contains 100K unique vocabulary items across 2 million sentences and 82 million tokens, while the Kompas & Tempo dataset [33] contains 130K unique vocabulary items across 85K sentences and 11 million tokens.

5 Experiments

Our main cross-lingual experiments target Indonesian, an Austronesian language. We choose Indonesian because of its language characteristics: it is morphologically distant from the Indo-European family but shares the Latin alphabet with English. It contains many loanwords for verbs and named entities from several languages, and most named entities keep the same form as in the original language's lexicon. It is also categorized as low-resource, as there is no large-scale, standardized, publicly available gold annotated dataset for the NER task.

We use the AllenNLP [9] implementation for the baseline BiLSTM-CRF and extend our own implementation for Supervised Cross-lingual Transfer, Cross-lingual using ELMo from EN, Monolingual ELMo, and Unsupervised-Supervised Cross-lingual Transfer. We make our extensions and the pre-trained monolingual and cross-lingual bi-LMs available on GitHub (link anonymized). We do not tune model hyper-parameters such as dropout or learning rate, as there is no gold validation set in a scenario comparable to [2]. In addition, we found that tuning hyper-parameters on a noisy validation set does not improve results and can even lead to worse ones, such as over-fitting to false negatives.

5.0.1 General Model Configuration

We initialize all NER neural models, both monolingual and cross-lingual with Indonesian as the target, using pre-trained GloVe [23] word embeddings trained on our Wikipedia dump. The GloVe-ID vectors are frozen during training on the DEE, MDEE and +Gazz data. All Indonesian NER models on distantly supervised data are trained for 10 epochs using Adam [13] with a batch size of 32. For models using the ELMo module, we apply a dropout rate of 0.5 after the last layer output and before concatenation with the word embedding, and L2 regularization [14] on the ELMo weights to prevent over-fitting and retain pre-trained knowledge. We use a 2-layer BiLSTM-CRF with hidden size 200 and a word embedding dimension of 50.
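A minimal PyTorch sketch of this tagger configuration (50-dimensional frozen word embeddings, a 2-layer BiLSTM with hidden size 200, dropout 0.5 on the contextual features). The CRF layer and the actual ELMo module are omitted; a linear emission layer and a fixed-size placeholder for the ELMo vectors stand in for them, and the ELMo dimension of 1024 is an assumption.

```python
import torch
import torch.nn as nn

class NERTagger(nn.Module):
    """BiLSTM tagger matching the configuration in the text; the CRF decoding
    layer is omitted and replaced by per-token emission scores."""
    def __init__(self, vocab_size, n_tags, word_dim=50, hidden=200, elmo_dim=1024):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.word_emb.weight.requires_grad = False           # GloVe-ID vectors are frozen
        self.dropout = nn.Dropout(0.5)                        # applied to the ELMo output
        self.encoder = nn.LSTM(word_dim + elmo_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.emission = nn.Linear(2 * hidden, n_tags)         # a CRF would sit on top of this

    def forward(self, word_ids, elmo_repr):
        # elmo_repr: (batch, seq_len, elmo_dim) contextual vectors from the BiLM
        feats = torch.cat([self.word_emb(word_ids), self.dropout(elmo_repr)], dim=-1)
        out, _ = self.encoder(feats)
        return self.emission(out)

model = NERTagger(vocab_size=100_000, n_tags=7)
optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```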

5.0.2 Unsupervised Cross-lingual NER Transfer via ELMo

In the cross-lingual bidirectional LM ("CL via ELMo EN") scenario, we transfer the pre-trained weights of the English 1B-token model to the Indonesian News dataset (IDNews) [33]. We use the bidirectional language model implementation of Peters et al. (2018) [25, 26] (https://github.com/allenai/bilm-tf) and modify it for the cross-lingual transfer scenario. We fine-tune the model for 3 epochs after replacing the Softmax vocabulary layer with randomly initialized weights. We fine-tune the language model for only 3 epochs instead of 10 to prevent catastrophic forgetting [30, 10]. We call this model ELMo EN-ID Transfer. As a baseline, we use the ELMo EN-1B Tokens model directly in the CL via ELMo EN scenario.

Figure 2: Left: baseline scenario for supervised cross-lingual transfer learning. Right: baseline scenario for directly using the ELMo 1B Tokens EN initializer.

5.0.3 Supervised Cross-lingual NER Transfer

For the cross-lingual transfer learning baseline scenario, we use the WP2, WP3 [20] and CoNLL 2003 [34] English datasets to train a standard BiLSTM-CRF without the ELMo initializer trained on the 1B-token language model benchmark. The models are trained on English and the pre-trained weights are then used as initializers for both supervised and unsupervised transfer learning on the DEE, MDEE, and +Gazz datasets. For the pre-trained English models, we report our reproduced baseline, a recent state-of-the-art NER model, and ELMo LSTM-CRF on the WikiNER dataset [20] to show the improvement on noisy mono-lingual data, and we use them as pre-trained models. We train the English NER models for 75 epochs with a patience of 25 epochs for early stopping. In the experiment results (Table 4), the models correspond to [Source] BiLSTM-CRF in the "Supervised CL NER Transfer" scenario.

5.0.4 Mono-lingual ELMo

In this scenario, we directly use a pre-trained bi-LM on a mono-lingual corpus: the 1-billion-word English corpus [6], 82 million tokens of Indonesian Wikipedia, or 11 million tokens of Indonesian News [33], as illustrated in Figure 2 (right). In the experiment results (Table 4), these models correspond to ELMo ([Unlabeled corpus]) in the "Mono-lingual ELMo" scenario.

5.0.5 POS Tagging Transfer

In this scenario, we train a standard Bi-LSTM model with a Softmax output and cross-entropy loss on the Indonesian POS tagging dataset. The transfer procedure is almost the same as Supervised Cross-lingual NER Transfer, as illustrated in Figure 2 (right), with two differences: i) the top-most layer is a linear layer with Softmax activation instead of a CRF, and ii) the source task is POS tagging instead of English NER. We train 5 models based on the 5-fold cross-validation split provided by Kurniawan et al. (2018) [15] and report the averaged F1 over the k-fold models used as pre-trained weights, in both the large silver and small clean annotation settings. In the experiment results (Table 4), the model corresponds to ID-POS BiLSTM-CRF in the "POS Tagging Transfer" scenario.

This experiment scenario serves as a comparison to transfer learning from a different but related task, as in Yang et al. (2017) [36]. In addition, previous work by Blevins et al. (2018) [4] shows that LMs capture syntactic information, so it also serves as a comparison to the pre-trained monolingual bidirectional LM.

5.0.6 Multi-Task NER with BiLM

We also train and evaluate a recent state-of-the-art model for Indonesian conversational data, Multi-Task NER with a BiLM auxiliary task (BiLM-NER) [17]. In the experiment results (Table 4), this model corresponds to the BiLM-NER row.

6 Results & Analysis

In this research, we report our English dataset results, which are mainly used to show the improvement from the pre-trained BiLM and to provide source weights for transfer learning. We report our main experiments on several versions of the large silver data for model comparison and on a small clean annotation in the ablation scenario. Finally, we analyze our proposed methods of supervised cross-lingual transfer with BiLM and cross-lingual transfer via language model fine-tuning.

6.1 English Dataset Results

As shown in Table 3, models trained with pre-trained ELMo and randomly initialized word embeddings (WE+ELMo LSTM-CRF) are better by an average of 4.925 F1 points across the four WikiNER scenarios on the CoNLL test set, compared to word embeddings initialized with Glove 6B plus a character-CNN (Glove+CharEmb). However, the two are tied on the WikiGold test set, where Glove+CharEmb without MISC labels performs better than WE+ELMo, whereas the latter is better with MISC labels. Overall, combining Glove and ELMo yields the best results, except when using WP2 as training data and testing on the CoNLL test set.

Train Data WikiGold CoNLL Pre-Init
Glove+CharEmb LSTM-CRF
WP2 71.75 61.78 Glove 6B
WP3 71.40 62.51 Glove 6B
CoNLL 58.00 90.47 Glove 6B
WP2-w/o MISC 75.12 65.35 Glove 6B
WP3-w/o MISC 75.02 63.69 Glove 6B
CoNLL-w/o MISC 58.30 91.37 Glove 6B
WE (Random Init) +ELMo LSTM-CRF
WP2 76.96 71.48 ELMo 1B
WP3 74.95 68.54 ELMo 1B
CoNLL 74.07 90.18 ELMo 1B
WP2-w/o MISC 73.47 66.50 ELMo 1B
WP3-w/o MISC 72.91 66.51 ELMo 1B
CoNLL-w/o MISC 74.52 91.59 ELMo 1B
Glove +ELMo LSTM-CRF
WP2 77.14 69.91 Glove 6B & ELMo 1B
WP3 76.92 70.31 Glove 6B & ELMo 1B
CoNLL 75.12 91.98 Glove 6B & ELMo 1B
WP2-w/o MISC 80.55 73.05 Glove 6B & ELMo 1B
WP3-w/o MISC 81.09 75.60 Glove 6B & ELMo 1B
CoNLL-w/o MISC 79.49 93.53 Glove 6B & ELMo 1B
Table 3: F1 score performance on the WikiGold and CoNLL test sets. The English NER models without (w/o) MISC and with pre-trained Glove 6B & ELMo 1B weights are used as pre-trained models for the cross-lingual transfer scenarios.

6.2 Indonesian Dataset Results

Model                              DEE     MDEE    +Gazz
Previous Works
Alfina et al. [2]                  41.33   41.87   51.61
BiLM-NER                           40.36   41.03   51.77
Baseline
Stanford-NER-BIO [2]               40.68   41.17   51.01
BiLSTM-CRF                         46.09   45.59   52.04
POS Tagging Transfer
ID-POS BiLSTM-CRF                  52.58   51.07   60.57
Supervised CL NER Transfer
WP2 BiLSTM-CRF                     49.88   52.35   62.57
WP3 BiLSTM-CRF                     51.21   50.95   62.90
CoNLL BiLSTM-CRF                   52.56   50.75   60.81
CL via ELMo EN
ELMo EN-1B Tokens                  51.08   53.19   60.66
ELMo EN-ID Transfer                52.63   54.74   63.02
Mono-lingual ELMo
ELMo (ID-Wiki)                     50.68   52.38   60.51
ELMo (ID-News)                     49.49   51.91   60.73
Supervised CL Transfer with ELMo
WP2 ELMo (EN)                      52.99   55.39*  63.99
WP3 ELMo (EN)                      54.15*  55.28   63.84
CoNLL ELMo (EN)                    53.52   53.48   64.35*
Table 4: Experiments on silver-standard annotation of Indonesian NER evaluated on the Gold test set [1] in the large distantly supervised NER scenario. Bold F1 scores are the best result per scenario (Baseline, Supervised Cross-lingual Transfer, Cross-lingual using ELMo from EN, Mono-lingual ELMo and Unsupervised-Supervised Cross-lingual Transfer). * marks the best model on a dataset (DEE, MDEE, or +Gazz) across all scenarios.

Model                 Prec    Rec     F1
Stanford-NER          71.42   53.84   61.39
BiLM-NER              63.65   63.29   63.47
BiLSTM-CRF
W+C+E                 76.42   56.32   64.85
W+C                   56.23   56.39   56.31
W+E                   73.53   53.32   61.81
C+E                   69.13   68.60   68.86
G                     63.65   48.50   55.05
G+C                   69.17   62.31   65.56
G+E                   75.30   65.32   69.96
G+C+E                 72.05   68.73   70.35
E                     76.27   55.41   64.19
G+C+I                 74.53   78.43   76.43
G+I                   75.57   77.94   76.74
I                     78.55   73.62   76.00
G+C+J                 83.26   82.62   82.94
G+J                   83.77   83.60   83.68
J                     82.36   83.74   83.04
INIT from ID-POS
W+C                   72.97   78.97   75.68
INIT from CoNLL 2003
W+C                   66.23   56.25   60.83
G+C                   70.18   65.87   67.96
C+E                   71.84   64.27   67.85
W+C+E                 73.63   65.46   69.30
G+E                   73.38   69.08   71.17
G+C+E                 72.63   72.99   72.85
Table 5: Ablation experiment results using Clean 1.2K as training data in the small clean (human-annotated) scenario, also evaluated on the Gold test set. W: word embedding (random init), C: Char-CNN (+EN if INIT from CoNLL 2003) embedding, E: ELMo (EN), G: Glove-ID (+EN if in cross-lingual transfer from English) [23], I: ELMo (ID-Wiki), J: ELMo (EN-ID-News) Transfer.

We reproduce roughly the same results as [2] using Stanford NER. Our experiment using a recent state-of-the-art model for Indonesian conversational data, namely Multi-Task NER with a BiLM auxiliary task (BiLM-NER) [17], obtains performance comparable to the log-linear model but lower than BiLSTM-CRF [18].

The mono-lingual pre-trained BiLM on 1B English words (ELMo EN-1B Tokens) performs comparably to the pre-trained BiLMs on 82 million Indonesian Wikipedia tokens (ELMo (ID-Wiki)) and 11 million news tokens (ELMo (ID-News)). All of the mono-lingual embeddings from pre-trained BiLMs perform worse on the silver-standard annotation than the supervised cross-lingual scenarios with and without BiLM.

6.3 Cross-lingual Transfer Analysis

We hypothesize that the performance of ELMo in cross-lingual settings, although a little counter-intuitive, is not entirely surprising and can be attributed to two factors: i) most named entities available in multi-lingual documents are orthographically similar; for instance, "America" is "Amerika" in Indonesian, while "Obama" is still "Obama" and "President Barack Obama" is "Presiden Barack Obama"; ii) because of the orthographic similarity of many entity names, the fact that English and Indonesian are typologically different (e.g. in S-V-O word order and Determiner-Noun word order) is not relevant on noisy data, as long as the character sequences of named entities are similar in both languages [7, 35].

We confirm our first hypothesis by computing the unique-word (vocabulary) overlap rate between the Gold ID-NER [1] and three English datasets, namely WP2, WP3 [20] and the CoNLL training set [34], as well as the per-tag (PER, LOC, ORG, and O) word-tag overlap rates for WP2 and CoNLL. More details of the unique-word overlap rates between the Indonesian DBPedia Entity data, WP2, WP3 and CoNLL can be seen in Figure 3. In the Supervised Cross-lingual Transfer scenario, which only utilizes character embeddings and pre-trained monolingual word embeddings, the model trained on the CoNLL dataset performs worse on both MDEE and +Gazz than the models trained on WP2 and WP3.
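A small sketch of how such overlap rates can be computed, assuming each corpus is available as a list of (token, tag) pairs; the toy inputs below are illustrative, the real inputs being Gold ID-NER vs. WP2/WP3/CoNLL.

```python
from collections import defaultdict

def overlap_rates(target, source):
    """Per-tag unique-word overlap rate: the fraction of the target corpus's
    vocabulary for each tag that also appears in the source corpus."""
    tgt_vocab, src_vocab = defaultdict(set), defaultdict(set)
    for tok, tag in target:
        tgt_vocab[tag].add(tok.lower())
    for tok, tag in source:
        src_vocab[tag].add(tok.lower())
    return {tag: len(words & src_vocab[tag]) / len(words)
            for tag, words in tgt_vocab.items() if words}

gold_id = [("Amerika", "LOC"), ("Obama", "PER"), ("presiden", "O")]
conll_en = [("America", "LOC"), ("Obama", "PER"), ("president", "O")]
print(overlap_rates(gold_id, conll_en))   # {'LOC': 0.0, 'PER': 1.0, 'O': 0.0}
```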

We support our second hypothesis with an ablation on the clean annotation (Table 5). The ablation shows that ELMo (ID-Wiki) outperforms ELMo (EN-1B Tokens) on the small clean annotation data, but ELMo EN nonetheless still outperforms the plain BiLSTM-CRF, especially when combined with supervised pre-training on CoNLL 2003 English NER [18].

Figure 3: Word-tag overlap rate breakdown between mono-lingual and cross-lingual corpora. (-) horizontal lines: WP2 & DBPedia Gold; right slope: WP2 & DBPedia Train; (+) crosses: WP3 & DBPedia Gold; (|) vertical lines: WP3 & DBPedia Train; (/) left slope: CoNLL Train & DBPedia Gold; (o) dots: CoNLL Train & DBPedia Train.

7 Conclusion

In this research, we extend the idea of character-level embeddings pre-trained with a language model objective to cross-lingual, distantly supervised and low-resource scenarios. We observe that training character-level language model embeddings requires an enormous corpus [26]. Addressing this problem, we demonstrate that as long as the orthography is shared and some lexical items in the target language, such as loanwords, can act as pivots, we can leverage a high-resource language model.

Acknowledgments

We would like to thank Samuel Louvan, Kemal Kurniawan, Adhiguna Kuncoro, and Rezka Aufar L. for reviewing an early version of this work. We are also grateful to Suci Brooks and Pria Purnama for their relentless support.

References