Kata Word Analogy Task for Indonesian.
We introduced KaWAT (Kata Word Analogy Task), a new word analogy task dataset for Indonesian. We evaluated several existing pretrained Indonesian word embeddings on it, along with embeddings trained on an Indonesian online news corpus. We also tested them on two downstream tasks and found that pretrained word embeddings helped either by reducing the number of training epochs or by yielding significant performance gains.
Despite the existence of various Indonesian pretrained word embeddings, there are no publicly available Indonesian analogy task datasets on which to evaluate these embeddings. Consequently, it is unknown whether Indonesian word embeddings introduced in, e.g., [Al-Rfou et al.2013] and [Grave et al.2018] capture syntactic or semantic information as measured by analogy tasks. Also, such embeddings are usually trained on Indonesian Wikipedia [Al-Rfou et al.2013, Bojanowski et al.2017], which is relatively small at approximately 60M tokens. Therefore, in this work, we introduce KaWAT (Kata Word Analogy Task), an Indonesian word analogy task dataset, and new Indonesian word embeddings pretrained on 160M tokens of an online news corpus. We evaluated these embeddings on KaWAT, and also tested them on POS tagging and text summarization as representatives of syntactic and semantic downstream tasks, respectively.
We asked an Indonesian linguist to help build KaWAT based on English analogy task datasets such as Google Word Analogy [Mikolov et al.2013a] and BATS [Gladkova et al.2016]. Following those works, we split the analogy tasks into two categories, syntactic and semantic. We included mostly morphological analogies in the syntactic category, leveraging the richness of Indonesian inflectional morphology. For the semantic category, we included analogies such as antonyms, country capitals and currencies, gender-specific words, measure words, and Indonesian province capitals. In total, we have 15K syntactic and 19K semantic analogy queries. KaWAT is available online at https://github.com/kata-ai/kawat.
One of the goals of this work is to evaluate and compare existing Indonesian pretrained word embeddings. We used the fastText pretrained embeddings introduced in [Bojanowski et al.2017] and [Grave et al.2018], which were trained on Indonesian Wikipedia and on Indonesian Wikipedia plus Common Crawl data, respectively. We refer to them as Wiki/fastText and CC/fastText hereinafter. We also used two other pretrained embeddings: the polyglot embedding trained on Indonesian Wikipedia [Al-Rfou et al.2013] and the NLPL embedding trained on the Indonesian portion of the CoNLL 2017 corpus [Fares et al.2017].
For training our word embeddings, we used an online news corpus obtained from Tempo (https://www.tempo.co). We used Tempo newspaper and magazine articles up to the year 2014. This corpus contains roughly 400K articles, 160M word tokens, and 600K word types. To train the word embeddings, we experimented with three algorithms: word2vec [Mikolov et al.2013b], fastText [Bojanowski et al.2017], and GloVe [Pennington et al.2014]. We refer to the resulting embeddings henceforth as Tempo/word2vec, Tempo/fastText, and Tempo/GloVe, respectively. We used gensim (https://radimrehurek.com/gensim) to run word2vec and fastText, and the original C implementation (https://github.com/stanfordnlp/GloVe) for GloVe. For all three, we used the default hyperparameters, i.e., no tuning was performed. Our three embeddings are available online at https://drive.google.com/open?id=1T9RmF0nHwN742aDkkQbjimpUCLgeVkhT.
Evaluation on KaWAT was done using gensim with its KeyedVectors.most_similar method. Since the vocabularies of the word embeddings differ, for a fair comparison we first removed analogy queries containing any word missing from at least one vocabulary. In other words, we kept only queries whose words all exist in all vocabularies. After this process, there were roughly 6K syntactic and 1.5K semantic queries. We performed the evaluation by computing the 95% confidence interval of the accuracy at rank 1 by bootstrapping. Our implementation code is available online at https://github.com/kata-ai/id-word2vec.
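As an illustration of the filtering and rank-1 scoring steps described above, the following is a minimal Python sketch, not the released implementation. It assumes additive cosine scoring over unit-normalized vectors with the query words excluded from the candidates, which is our understanding of what gensim's most_similar computes for positive and negative words. The helper names (unit, filter_queries, answer_analogy), the toy vectors, and the example words are hypothetical and are not taken from KaWAT.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def filter_queries(queries, vocabularies):
    """Keep only queries whose four words appear in every vocabulary."""
    shared = set.intersection(*(set(v) for v in vocabularies))
    return [q for q in queries if all(w in shared for w in q)]

def answer_analogy(emb, a, b, c):
    """Return the top-ranked word for a : b :: c : ? (additive scoring)."""
    ua, ub, uc = unit(emb[a]), unit(emb[b]), unit(emb[c])
    target = unit([y - x + z for x, y, z in zip(ua, ub, uc)])
    best_word, best_score = None, -math.inf
    for word, vec in emb.items():
        if word in (a, b, c):  # query words are not valid answers
            continue
        score = sum(t * u for t, u in zip(target, unit(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Hypothetical 3-d toy embedding, for illustration only.
emb = {
    "indonesia": [1.0, 0.0, 0.0],
    "jakarta": [1.0, 0.0, 1.0],
    "jepang": [0.0, 1.0, 0.0],
    "tokyo": [0.0, 1.0, 1.0],
    "kyoto": [0.0, 1.0, 0.2],
    "rupiah": [1.0, 0.0, -1.0],
}
other_vocab = ["indonesia", "jakarta", "jepang", "tokyo", "yen"]

queries = [
    ("indonesia", "jakarta", "jepang", "tokyo"),
    ("indonesia", "rupiah", "jepang", "yen"),  # dropped: words not shared
]
kept = filter_queries(queries, [emb.keys(), other_vocab])
correct = [answer_analogy(emb, a, b, c) == d for a, b, c, d in kept]
print(kept, correct)
```

A query counts as correct at rank 1 only when the single best-scoring candidate equals the expected fourth word, so accuracy is the mean of the `correct` flags over the retained queries.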
We found that on syntactic analogies, Wiki/fastText achieved 2.7% accuracy, significantly outperforming the others, even CC/fastText, which was trained on a much larger corpus. The other embeddings performed poorly, mostly with less than 1% accuracy. The overall trend of low accuracy scores attests to the difficulty of the syntactic KaWAT analogies, making the dataset suitable as a benchmark for future research.
On semantic analogies, Tempo/GloVe, with 20.42% accuracy, clearly outperformed all other embeddings except Tempo/word2vec. Surprisingly, Tempo/fastText performed very poorly with less than 1% accuracy, even worse than Wiki/fastText, which was trained on much less data. Overall, accuracies on the semantic analogies were also low, below 25%, which again attests to the suitability of KaWAT as a benchmark for future work.
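The accuracies above are reported with 95% confidence intervals obtained by bootstrapping. The following minimal Python sketch shows one common way to compute such an interval, the percentile bootstrap over per-query correctness flags. The function name bootstrap_accuracy_ci, the number of resamples, the seed, and the toy outcomes are assumptions made for illustration; the exact bootstrap variant used in our evaluation may differ.

```python
import random

def bootstrap_accuracy_ci(outcomes, n_resamples=1000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for mean accuracy.

    outcomes: list of 0/1 flags, one per query (1 = correct at rank 1).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the queries with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical outcomes: 20 correct answers out of 100 queries.
outcomes = [1] * 20 + [0] * 80
low, high = bootstrap_accuracy_ci(outcomes)
print(f"accuracy = {sum(outcomes) / len(outcomes):.2f}, "
      f"95% CI = [{low:.2f}, {high:.2f}]")
```

Fixing the seed makes the interval reproducible across runs, which helps when comparing several embeddings on the same filtered query set.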
To check how useful these embeddings are for downstream tasks, we evaluated them on POS tagging and text summarization. For each task, we compared two embeddings: the best off-the-shelf pretrained embedding and our best proposed embedding on the syntactic (for POS tagging) and semantic (for summarization) analogy tasks, respectively. We performed paired t-test and found Tempo/GloVe to be the best among our Tempo embeddings (). We used the same model and setting as [Kurniawan and Aji2018] for POS tagging and [Kurniawan and Louvan2018] for summarization. However, for computational reasons, we tuned only the learning rate using grid search, and used only the first fold of the summarization dataset. Our key finding from the POS tagging experiment is that using either of the two embeddings did not yield a significant gain in test score compared with not using any pretrained embedding (around ). However, on average, using Wiki/fastText resulted in 20% fewer training epochs, compared with only 4% when using Tempo/GloVe. For the summarization experiment, Tempo/GloVe was significantly better (as evidenced by the 95% confidence interval reported by the ROUGE script) than CC/fastText in ROUGE-1 and ROUGE-L scores (66.63 and 65.93 respectively). The scores obtained with CC/fastText were on par with those obtained without any pretrained word embedding, and we did not observe fewer training epochs when using pretrained word embeddings.
We introduced KaWAT, a new dataset for the Indonesian word analogy task, and evaluated several Indonesian pretrained word embeddings on it. We found that (1) in general, accuracies on the analogy tasks were low, suggesting that improvements for Indonesian word embeddings are still possible and that KaWAT is hard enough to serve as a benchmark dataset for that purpose, (2) on syntactic analogies, the embedding of [Bojanowski et al.2017] performed best and yielded 20% fewer training epochs when employed for POS tagging, and (3) on semantic analogies, the GloVe embedding trained on the Tempo corpus performed best and produced significant gains in ROUGE-1 and ROUGE-L scores when used for text summarization.
We thank Tempo for their support and access to their news and magazine corpora. We also thank Rezka Leonandya and Fariz Ikhwantri for reviewing the earlier version of this manuscript.
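[Bojanowski et al.2017] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
[Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.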