Part-of-speech (POS) tagging is the process of labeling tokens in a string with their corresponding part of speech (e.g., noun, verb, etc.). POS tagging is considered one of the most basic tasks in NLP, as it is usually the first component in an NLP pipeline. This is because POS tags have been shown to be useful features in various NLP tasks, such as named entity recognition [1, 2], machine translation [3, 4], and constituency parsing. Therefore, for any language, building a successful NLP system usually requires a well-performing POS tagger.
There is a considerable amount of research on Indonesian POS tagging [6, 7, 8, 9]. However, almost none of it is evaluated on a common dataset, and even when the dataset is the same, the train-test splits are not. This lack of a common benchmark makes a fair comparison among these works difficult. Moreover, despite the success of neural network models for English POS tagging [10, 11], neural networks remain largely unexplored for Indonesian. As a result, published results may not reflect the actual state-of-the-art performance of Indonesian POS taggers.
In this work, we explored different neural network architectures for Indonesian POS tagging. We evaluated our models on the IDN Tagged Corpus. Our best model achieves an F1 score of 97.47, a new state-of-the-art result for Indonesian POS tagging on this dataset. We release the dataset splits that we used to serve as a benchmark for future work.
II Related Work
Pisceldo et al. built an Indonesian POS tagger by employing a conditional random field (CRF) and a maximum entropy model. They used contextual unigram and bigram features and achieved accuracy scores of 80-90% on the PANL10N dataset (http://www.panl10n.net), which was tagged manually using their proposed tagset. The dataset consists of 15K sentences. Another work used a hidden Markov model enhanced with an affix tree to better handle out-of-vocabulary (OOV) words. They evaluated their models on the same PANL10N dataset and achieved more than 90% overall accuracy and roughly 70% accuracy on OOV words. We note that while the datasets are the same, the splits could be different, which makes a fair comparison between the two works difficult.
Dinakaramani et al. proposed the IDN Tagged Corpus, a new manually annotated POS tagging corpus for Indonesian. The corpus consists of 10K sentences and 250K tokens, and its tagset differs from that of the PANL10N dataset. The corpus is available online (https://github.com/famrashel/idn-tagged-corpus). A rule-based tagger was later developed on the aforementioned dataset and achieves an accuracy of 80%.
One neural network-based POS tagger for Indonesian uses a feedforward neural network with an architecture similar to earlier work. The authors evaluated their method on the new POS tagging corpus, separating the evaluation of multi-word and single-word expressions. They experimented with several word embedding algorithms trained on Indonesian Wikipedia data and reported macro-averaged F1 scores of 91 and 73 for the single- and multi-word expression cases, respectively. We remark that the macro-averaged F1 score is more suitable than accuracy for POS tagging because of the class imbalance in the dataset: NN is by far the most frequent tag, so accuracy is not the best metric in this case.
We used the IDN Tagged Corpus described above. The corpus contains 10K sentences and 250K tokens that were tagged manually. Because of this small size (for comparison, the Penn Treebank corpus for English has 40K sentences), we used 5-fold cross-validation to split the corpus into training, development, and test sets. We did not split multi-word expressions but treated each as a single token. All 5 folds of the dataset are publicly available (https://github.com/kmkurn/id-pos-tagging/blob/master/data/dataset.tar.gz) to serve as a benchmark for future work.
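The exact procedure used to produce the released folds is not described in detail; the following is a minimal sketch of one way to derive train/development/test sets from a k-fold split, assuming each fold takes one contiguous chunk as the test set and carves a development set out of the remaining data:

```python
def five_fold_splits(sentences, k=5, dev_ratio=0.1):
    """Sketch of a k-fold train/dev/test split (assumed procedure)."""
    n = len(sentences)
    folds = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        test = sentences[lo:hi]              # fold i is the test set
        rest = sentences[:lo] + sentences[hi:]
        n_dev = int(len(rest) * dev_ratio)   # carve dev out of the rest
        folds.append((rest[n_dev:], rest[:n_dev], test))
    return folds

folds = five_fold_splits(list(range(100)))   # 100 dummy "sentences"
train, dev, test = folds[0]
print(len(train), len(dev), len(test))
```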
We used two simple baselines: majority tag (Major) and memorization (Memo). Major predicts the majority POS tag found in the training set for every word. Memo remembers the word-tag assignments seen in the training set and uses them to predict tags at test time; for unknown words, it simply outputs the majority tag found in the training set.
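Both baselines can be sketched in a few lines; the toy training sentences and tag names below are illustrative only:

```python
from collections import Counter, defaultdict

def train_baselines(tagged_sents):
    """Learn the Major and Memo baselines from (word, tag) training pairs."""
    tag_counts = Counter(tag for sent in tagged_sents for _, tag in sent)
    majority_tag = tag_counts.most_common(1)[0][0]
    word_tags = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            word_tags[word][tag] += 1
    # Memo keeps the most frequent tag seen for each training word.
    memo = {w: c.most_common(1)[0][0] for w, c in word_tags.items()}
    return majority_tag, memo

def memo_predict(words, majority_tag, memo):
    # Unknown words fall back to the majority tag, as described above.
    return [memo.get(w, majority_tag) for w in words]

train = [[("saya", "PRP"), ("makan", "VB"), ("nasi", "NN")],
         [("dia", "PRP"), ("membaca", "VB"), ("buku", "NN")],
         [("rumah", "NN"), ("besar", "JJ")]]
major, memo = train_baselines(train)
print(major)                                     # NN is the majority tag here
print(memo_predict(["saya", "tidur"], major, memo))
```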
III-C1 Rule-based tagger
We adopted the rule-based tagger designed by Rashel et al. as one of our comparisons. First, the tagger tags named entities and multi-word expressions based on a dictionary. Then, it uses MorphInd to tag the remaining words. Finally, it applies 15 hand-crafted rules to resolve ambiguous tags in a post-processing step. Note that we did not use their provided tokenizer, since the IDN Tagged Corpus is already tokenized. Their implementation is available online (https://github.com/andryluthfi/indonesian-postag).
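The three-stage pipeline can be sketched as follows; `ne_mwe_dict`, `morphind_tag`, and `rules` are hypothetical stand-ins for the dictionary, MorphInd, and the 15 hand-crafted rules, not their actual interfaces:

```python
def rule_based_tag(words, ne_mwe_dict, morphind_tag, rules):
    """Sketch of the dictionary -> MorphInd -> rules pipeline."""
    tags = [ne_mwe_dict.get(w) for w in words]        # stage 1: dictionary lookup
    tags = [t if t else morphind_tag(w)               # stage 2: MorphInd fallback
            for w, t in zip(words, tags)]
    for rule in rules:                                # stage 3: post-processing
        tags = rule(words, tags)
    return tags

# Toy stand-ins for illustration only.
ne_mwe_dict = {"Jakarta": "NNP"}
morphind_tag = lambda w: "NN"
rules = [lambda ws, ts: ["VB" if w == "makan" else t for w, t in zip(ws, ts)]]
print(rule_based_tag(["Jakarta", "makan"], ne_mwe_dict, morphind_tag, rules))
```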
III-C2 Conditional random field (CRF)
We used a CRF as another comparison, since it is the most common non-neural model for sequence labeling tasks. We employed contextual words as well as affixes as features. For a context window size $c$, the complete list of features is:
- the current word, as well as the $c$ preceding and succeeding words;
- the two and three leading characters of the current word and of the $c$ preceding and succeeding words;
- the two and three trailing characters of the current word and of the $c$ preceding and succeeding words.
The last two features are meant to capture prefixes and suffixes in Indonesian, which usually consist of two or three characters. One advantage of this feature extraction approach is that it does not require language-specific tools such as a stemmer or morphological segmenter. This is particularly useful for Indonesian, which does not have well-established tools for such purposes. We padded the input sentence with padding tokens to ensure that every token has enough preceding and succeeding words for the context window size. For the implementation, we used pycrfsuite (https://github.com/scrapinghub/python-crfsuite).
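The feature template above can be sketched as a function producing a feature dictionary per token; the feature names are illustrative, not taken from the original implementation:

```python
def extract_features(words, t, c=2):
    """Contextual word and affix features for position t, window size c."""
    padded = ["<pad>"] * c + list(words) + ["<pad>"] * c
    i = t + c                     # position of the current word after padding
    feats = {}
    for off in range(-c, c + 1):  # current word plus c neighbors on each side
        w = padded[i + off]
        feats[f"word[{off}]"] = w
        for n in (2, 3):          # leading/trailing 2-3 chars as affix proxies
            feats[f"prefix{n}[{off}]"] = w[:n]
            feats[f"suffix{n}[{off}]"] = w[-n:]
    return feats

feats = extract_features(["saya", "makan", "nasi"], 1, c=1)
print(feats["word[0]"], feats["prefix3[0]"], feats["suffix2[1]"])
```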
III-C3 Neural network-based tagger
Our neural network-based POS tagger can be divided into three steps: embedding, encoding, and prediction. First, the tagger embeds the words and, optionally, additional features of those words (e.g., affixes). From this embedding process, we get vector representations of the words and features. Next, the tagger learns contextual information in the encoding step via either a feedforward network with a context window or a bidirectional LSTM. Finally, in the prediction step, the tagger predicts the POS tags from the output of the encoding step using either a softmax or a CRF layer.
Embedding. In the embedding step, the tagger obtains vector representations of each word and of the additional features. We experimented with several additional features: prefixes, suffixes, and characters. Prefix features are the first 2 and 3 characters of the word; likewise, suffix features are the last 2 and 3 characters of the word. (If the word has fewer than 3 characters, then all the prefixes and suffixes are equal to the word itself.) For the character features, we follow earlier work by embedding each character and composing the resulting vectors with a max-pooled CNN. The final embedding of a word is then the concatenation of all these vectors. Fig. 1 shows an illustration of the process.
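A max-pooled character CNN can be sketched with plain NumPy; the ord()-based character lookup and all dimensions here are assumptions for illustration, not the paper's actual vocabulary or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def char_cnn_embedding(word, char_emb, filters, width=3):
    """Max-pooled character CNN for one word (toy sketch)."""
    idx = [ord(ch) % char_emb.shape[0] for ch in word]   # toy char-to-index map
    X = char_emb[idx]                                    # (word_len, emb_dim)
    if X.shape[0] < width:                               # pad very short words
        X = np.vstack([X, np.zeros((width - X.shape[0], X.shape[1]))])
    windows = np.stack([X[i:i + width].ravel()
                        for i in range(X.shape[0] - width + 1)])
    conv = windows @ filters                             # convolution as matmul
    return conv.max(axis=0)                              # max-pool over positions

char_emb = rng.normal(size=(64, 30))      # 30-dim character embeddings
filters = rng.normal(size=(3 * 30, 30))   # 30 filters of width 3
word_vec = rng.normal(size=100)           # 100-dim word embedding
final = np.concatenate([word_vec, char_cnn_embedding("makan", char_emb, filters)])
print(final.shape)
```

The final word representation concatenates the word embedding with the pooled character features, mirroring the concatenation described in the text.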
Encoding. In the encoding step, the tagger learns contextual information using either a feedforward network with a context window or a bidirectional LSTM (biLSTM). The feedforward network accepts as input the concatenation of the embeddings of the current word and of the $c$ preceding and succeeding words, for some context window size $c$. Formally, given a sequence of word embeddings $\mathbf{x}_1, \ldots, \mathbf{x}_T$, the input of the feedforward network at timestep $t$ is

$$\mathbf{a}_t = \mathbf{x}_{t-c} \oplus \cdots \oplus \mathbf{x}_t \oplus \cdots \oplus \mathbf{x}_{t+c}$$

where $\oplus$ denotes concatenation. The feedforward network then computes

$$\mathbf{z}_t = \mathbf{W}^{(2)} \sigma\left(\mathbf{W}^{(1)} (\mathbf{a}_t \odot \mathbf{m})\right)$$

where $\sigma$ is a nonlinear activation, $\mathbf{z}_t$ is the output vector, $\mathbf{m}$ is a dropout mask vector, and $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ are parameters. (Bias vectors are included as parameters but omitted from the equation for brevity.) The output vector $\mathbf{z}_t$ has length equal to the number of possible tags; its $j$-th component defines the (unnormalized) log probability of the $t$-th word having the $j$-th tag.
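The window feedforward computation described above can be sketched in NumPy; the tanh activation, inverted-dropout scaling, and all dimensions are assumptions of this toy sketch, and biases are omitted as in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def feedforward_tagger_scores(embeds, W1, W2, c=2, p_drop=0.5):
    """Per-word tag scores from a window feedforward net (toy sketch)."""
    d = embeds.shape[1]
    padded = np.vstack([np.zeros((c, d)), embeds, np.zeros((c, d))])
    scores = []
    for t in range(len(embeds)):
        a = padded[t:t + 2 * c + 1].ravel()                 # window concatenation
        m = (rng.random(a.shape) >= p_drop) / (1 - p_drop)  # dropout mask
        h = np.tanh(W1 @ (a * m))                           # tanh is an assumption
        scores.append(W2 @ h)                               # unnormalized log-probs
    return np.stack(scores)

n_tags, d, c = 5, 10, 2
embeds = rng.normal(size=(4, d))                  # embeddings of a 4-word sentence
W1 = rng.normal(size=(16, (2 * c + 1) * d))
W2 = rng.normal(size=(n_tags, 16))
scores = feedforward_tagger_scores(embeds, W1, W2, c=c)
print(scores.shape)                               # one score vector per word
```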
On the other hand, the biLSTM accepts as input the whole sequence of word embeddings and, at each timestep, concatenates the outputs of the forward and backward LSTMs to form the final output. Formally, the output at timestep $t$ can be expressed as

$$\mathbf{h}_t = \overrightarrow{\mathbf{h}}_t \oplus \overleftarrow{\mathbf{h}}_t$$

where $\overrightarrow{\mathbf{h}}_t$ and $\overleftarrow{\mathbf{h}}_t$ are the forward and backward LSTM outputs at timestep $t$, respectively. The vector $\mathbf{h}_t$ is then passed through a linear layer as before to obtain the output vector $\mathbf{z}_t$.
Prediction. In the prediction step, the tagger predicts the POS tag of the $t$-th word based on the output vector $\mathbf{z}_t$. We tested two approaches: a softmax layer with greedy decoding and a CRF layer with Viterbi decoding. With a softmax layer, the tagger simply normalizes $\mathbf{z}_t$ and predicts greedily, i.e., it picks the tag with the highest probability at each position. In contrast, with a CRF layer, the tagger treats $\mathbf{z}_t$ as emission probability scores, models the tag-to-tag transition probability scores, and uses the Viterbi algorithm to select the most probable tag sequence as the prediction. We refer readers to the literature for details on how the CRF layer and Viterbi decoding work. Note that when we only embed words, encode with the feedforward network, and predict with greedy decoding, the tagger is effectively the same as that of prior work; likewise, when only the word and character features are used with a biLSTM and a CRF layer, the tagger is effectively the same as that of prior work. Our implementation is available online (https://github.com/kmkurn/id-pos-tagging).
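The two decoding strategies can be contrasted in a short sketch. This is a generic Viterbi decoder over emission and transition score matrices, not the paper's exact CRF implementation, and the toy scores are illustrative; the transition matrix discourages repeating the same tag, so Viterbi and greedy decoding disagree:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Most probable tag sequence given per-position emission scores
    (T x K) and tag-to-tag transition scores (K x K)."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in tag j coming from tag i.
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for t in range(T - 1, 0, -1):     # follow backpointers
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]

def greedy_decode(emissions):
    # Softmax + greedy: argmax per position, no transition model.
    return emissions.argmax(axis=1).tolist()

emissions = np.array([[2.0, 0.0], [0.1, 0.0], [2.0, 0.0]])
transitions = np.array([[-5.0, 0.0], [0.0, -5.0]])  # penalize same-tag runs
print(greedy_decode(emissions))                 # [0, 0, 0]
print(viterbi_decode(emissions, transitions))   # [0, 1, 0]
```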
III-D Experimental Setup
For all models, we preprocessed the dataset by lowercasing all words, except when characters were embedded. For the CRF model, we used L2 regularization whose coefficient was tuned on the development set. As mentioned previously, we also tuned the context window size on the development set.
For the neural tagger, we set the sizes of the word, affix, and character embeddings to 100, 20, and 30, respectively. We applied dropout regularization to the embedding layers. The max-pooled CNN has 30 filters for each filter width. We set the feedforward network and the biLSTM to have 100 hidden units each, and we put a dropout layer before the biLSTM input layer. We tuned the learning rate, dropout rate, context window size, and CNN filter width on the development set. As noted earlier, we experimented with different configurations of the embedding, encoding, and prediction steps; we evaluated each configuration on the development set as well.
At training time, we used a batch size of 8, halved the learning rate if the F1 score on the development set did not improve after 2 epochs, and stopped training early if the score still did not improve after decaying the learning rate 5 times. To address the exploding gradient problem, we normalized the gradient norm at 1, following suggestions in prior work. To handle the out-of-vocabulary problem, we converted singleton words and affixes occurring fewer than 5 times in the training data into a special token for unknown words/affixes.
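The decay-and-early-stop schedule above can be sketched as a small loop; the exact interaction of patience and decay counting is an assumption of this sketch:

```python
def run_schedule(dev_scores, lr=0.001, patience=2, max_decays=5):
    """Halve lr when the dev score plateaus for `patience` epochs;
    stop after the score still fails to improve past `max_decays` halvings."""
    best, since_improve, decays = float("-inf"), 0, 0
    for epoch, score in enumerate(dev_scores):
        if score > best:
            best, since_improve = score, 0
        else:
            since_improve += 1
            if since_improve >= patience:
                if decays == max_decays:
                    return epoch, lr, best      # early stop
                lr /= 2                         # decay on plateau
                decays += 1
                since_improve = 0
    return len(dev_scores), lr, best

# Toy run: score improves for 3 epochs, then plateaus (max_decays=2 for brevity).
epoch, lr, best = run_schedule([1, 2, 3, 3, 3, 3, 3, 3, 3, 3],
                               lr=0.001, patience=2, max_decays=2)
print(epoch, lr, best)
```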
Since the dataset is highly imbalanced (the majority of words are nouns), accuracy is not an appropriate evaluation metric: it gives a high score to a model that always predicts nouns regardless of the input. Therefore, we used the F1 score, which considers both precision and recall of the predictions.
Since there are multiple tags, there are two flavors of the overall F1 score: micro- and macro-averaged. For a POS tagging task where tags do not span multiple words, the micro-averaged F1 score is exactly the same as accuracy, so the macro-averaged score is our only option. However, there is still an issue: the macro-averaged F1 score averages the per-tag F1 scores uniformly, so when the model wrongly predicts a rarely occurring tag (e.g., foreign word), it is penalized as heavily as when it wrongly predicts a frequent tag. To address this problem, we used the weighted macro-averaged F1 score, which takes the tag proportion imbalance into account: it computes a weighted average of the per-tag F1 scores, where each weight equals the corresponding tag's proportion in the dataset. This functionality is available in the scikit-learn library (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).
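The weighted macro-average can be computed directly, mirroring scikit-learn's `f1_score(..., average='weighted')` without the dependency; the toy labels below show how it penalizes a majority-vote model:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted macro-average F1: per-tag F1 weighted by tag frequency."""
    support = Counter(y_true)
    total = 0.0
    for tag in set(y_true):
        tp = sum(t == p == tag for t, p in zip(y_true, y_pred))
        fp = sum(p == tag and t != tag for t, p in zip(y_true, y_pred))
        fn = sum(t == tag and p != tag for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += support[tag] / len(y_true) * f1   # weight by tag proportion
    return total

y_true = ["NN", "NN", "NN", "VB", "X"]
y_pred = ["NN", "NN", "NN", "NN", "NN"]   # a majority-vote model
print(round(weighted_f1(y_true, y_pred), 3))   # 0.45, far below its 0.8 accuracy
```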
IV Results and Discussion
Table I: F1 scores of the neural tagger configurations on the development set.

  Feedforward + softmax    97.25
  Feedforward + CRF        97.46
  biLSTM + softmax         97.57
  biLSTM + CRF             97.60
Table II: F1 score of each method on the test set, averaged over 5 folds. Standard deviations are given in parentheses.

  Rashel et al.    85.77 (0.22)
  biLSTM + CRF     97.47 (0.11)
Table III: Feature ablation results on the development set.

  biLSTM + CRF                              96.06 (+0.00)
  biLSTM + CRF + chars                      97.42 (+1.36)
  biLSTM + CRF + chars + prefix             97.50 (+0.08)
  biLSTM + CRF + chars + prefix + suffix    97.60 (+0.10)
First, we report on our tuning experiments for the neural tagger. Table I shows the evaluation results of the various configurations of our neural tagger on the development set. We group the results by the encoding and prediction configurations; for each group, we show the highest score among the embedding configurations. As the table shows, the biLSTM with a CRF layer achieves an F1 score of 97.60, the best score on the development set. This result agrees with much previous work in neural sequence labeling finding that a bidirectional LSTM with a CRF layer performs best [11, 18, 10]. Therefore, we use this tagger to represent the neural model hereinafter.
To understand the performance of the neural model on each tag, we plot the confusion matrix for the development set of the first fold in Fig. 2. The figure shows that the model predicts most tags almost perfectly, except for the X and WH tags. The X tag is described as "a word or part of a sentence which its category is unknown or uncertain" (http://bahasa.cs.ui.ac.id/postag/downloads/Tagset.pdf). The X tag is rather rare, appearing only 397 times out of over 250K tokens. Some words annotated as X are typos and slang words; some foreign terms and abbreviations are also annotated with X. The model may get confused because such words are usually tagged with a noun tag (NN or NNP). We also see that the model tends to confuse question words (WH) such as apa (what) or siapa (who) with SC, since these words may also be used in subordinate clauses. Looking at the data closely, we found that the tagging of such words is inconsistent, and this inconsistency contributes to the model's inability to distinguish the two tags.
Next, we present the results of evaluating the baselines and other comparisons on the test set in Table II. The scores are averaged over the 5 cross-validation folds. We see that the Major baseline performs very poorly compared to the Memo baseline, which surprisingly achieves over 90 points. This result suggests that Memo is a more suitable baseline for this dataset than Major. It also provides evidence of the usefulness of our evaluation metric, which heavily penalizes a simple majority-vote model. Furthermore, we notice that the rule-based tagger by Rashel et al. performs worse than Memo, indicating that Memo is not just a suitable but also quite a strong baseline. Moving on, the CRF has a 6-point advantage over Memo, signaling that incorporating contextual features and modeling tag-to-tag transitions are useful. Lastly, the biLSTM with a CRF layer performs best, with an F1 score of 97.47.
To understand how each feature in the embedding step affects the neural tagger, we performed feature ablation on the development set and report the results in Table III. With only words as features (first row), the neural tagger achieves an F1 score of only 96.06. Employing character features boosts the score to 97.42, a gain of 1.36 points. Adding prefix and suffix features improves performance further by 0.08 and 0.10 points, respectively. From this result, we see that the character features have the largest positive effect on the neural tagger.
We experimented with several baselines and comparisons for the Indonesian POS tagging task. Our comparisons include a rule-based tagger, a well-established probabilistic model for sequence labeling (CRF), and a neural model. We tested many configurations for our neural model: features (words, affixes, characters), architecture (feedforward, biLSTM), and output layer (softmax, CRF). We evaluated all our models on the IDN Tagged Corpus, a manually annotated and publicly available Indonesian POS tagging dataset. Our best model achieves an F1 score of 97.47, a new state-of-the-art result on the dataset. We make our cross-validation splits publicly available to serve as a benchmark for future work.
-  G. Luo, X. Huang, C.-Y. Lin, and Z. Nie, “Joint entity recognition and disambiguation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 879–888.
-  G. Aguilar, S. Maharjan, A. P. L. Monroy, and T. Solorio, "A multi-task approach for named entity recognition in social media data," in Proceedings of the 3rd Workshop on Noisy User-generated Text, 2017, pp. 148–153.
-  R. Sennrich and B. Haddow, "Linguistic Input Features Improve Neural Machine Translation," in Proceedings of the First Conference on Machine Translation. Berlin, Germany: Association for Computational Linguistics, August 2016, pp. 83–91. [Online]. Available: http://www.aclweb.org/anthology/W16-2209.pdf
-  J. Niehues and E. Cho, “Exploiting linguistic resources for neural machine translation using multi-task learning,” arXiv preprint arXiv:1708.00993, 2017.
-  C. Dyer, A. Kuncoro, M. Ballesteros, and N. A. Smith, “Recurrent neural network grammars,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California: Association for Computational Linguistics, 2016, pp. 199–209.
-  F. Pisceldo, M. Adriani, and R. Manurung, “Probabilistic Part Of Speech Tagging for Bahasa Indonesia,” 2009, p. 6.
-  A. F. Wicaksono and A. Purwarianti, “HMM Based Part-of-Speech Tagger for Bahasa Indonesia,” Jan. 2010.
-  F. Rashel, A. Luthfi, A. Dinakaramani, and R. Manurung, “Building an Indonesian rule-based part-of-speech tagger,” in 2014 International Conference on Asian Language Processing (IALP), Oct. 2014, pp. 70–73.
-  A. F. Abka, “Evaluating the use of word embeddings for part-of-speech tagging in Bahasa Indonesia,” in 2016 International Conference on Computer, Control, Informatics and Its Applications (IC3INA), Oct. 2016, pp. 209–214.
-  X. Ma and E. Hovy, “End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2016, pp. 1064–1074.
-  M. Rei, G. K. O. Crichton, and S. Pyysalo, “Attending to characters in neural sequence labeling models,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 2016, pp. 309–318.
-  A. Dinakaramani, F. Rashel, A. Luthfi, and R. Manurung, “Designing an Indonesian part of speech tagset and manually tagged Indonesian corpus,” in 2014 International Conference on Asian Language Processing (IALP). IEEE, 2014, pp. 66–69.
-  J. D. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," in Proceedings of the Eighteenth International Conference on Machine Learning, ser. ICML '01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001, pp. 282–289.
-  R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, no. Aug, pp. 2493–2537, 2011.
-  F. Rashel, A. Luthfi, A. Dinakaramani, and R. Manurung, "Building an Indonesian rule-based part-of-speech tagger," in Asian Language Processing (IALP), 2014 International Conference on. IEEE, 2014, pp. 70–73.
-  S. Larasati, V. Kuboň, and D. Zeman, "Indonesian morphology tool (MorphInd): Towards an Indonesian corpus," Systems and Frameworks for Computational Morphology, pp. 119–129, 2011.
-  S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
-  G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2016, p. 260–270. [Online]. Available: http://aclanthology.coli.uni-saarland.de/pdf/N/N16/N16-1030.pdf
-  N. Reimers and I. Gurevych, “Reporting score distributions makes a difference: Performance study of lstm-networks for sequence tagging,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Sep 2017, p. 338–348. [Online]. Available: https://www.aclweb.org/anthology/D17-1035