Named Entity Recognition (NER) is a foundational NLP task that labels each atomic element of a sentence with a specific category such as PERSON, LOCATION or ORGANIZATION (Collobert et al., 2011). There has been extensive NER research on English, German, Dutch and Spanish (Lample et al., 2016; Ma and Hovy, 2016; Devlin et al., 2018; Peters et al., 2018; Akbik et al., 2018), and notable research on low-resource South Asian languages such as Hindi (Athavale et al., 2016), Indonesian (Gunawan et al., 2018) and other Indian languages such as Kannada, Malayalam, Tamil and Telugu (Gupta et al., 2018). However, there has been no study on developing a neural NER for the Nepali language. In this paper, we propose a neural Nepali NER using a recent state-of-the-art architecture based on grapheme-level representations, which requires neither hand-crafted features nor data pre-processing.
Recent neural architectures like that of Lample et al. (2016) relax the need to hand-craft features and to use part-of-speech tags to determine the category of an entity. However, these architectures have been studied for languages like English and German, and have not been applied to low-resource languages like Nepali, for which only a limited dataset is available to train a model. Traditional methods such as Hidden Markov Models (HMM) with rule-based approaches (Dey and Prukayastha, 2013; Dey et al., 2014) and Support Vector Machines (SVM) with manual feature engineering (Bam and Shahi, 2014) have been applied, but they perform poorly compared to neural approaches. Moreover, there has been no research on Nepali NER using neural networks. Therefore, we created a named entity annotated dataset, partly with the help of Dataturk111https://dataturks.com/, to train a neural model. The texts used for this dataset were collected from various daily news sources from Nepal222https://github.com/sndsabin/Nepali-News-Classifier around the years 2015-2016.
Following are our contributions:
We present a novel Named Entity Recognizer (NER) for the Nepali language. To the best of our knowledge, we are the first to propose a neural-network-based Nepali NER.
As there is no good-quality dataset for training a Nepali NER model, we release a dataset to support future research.
We perform an empirical evaluation of our model against state-of-the-art models, achieving a relative improvement of up to 10%.
In this paper, we present works similar to ours in Section 2. We describe our approach and dataset statistics in Sections 3 and 4, followed by our experiments, evaluation and discussion in Sections 5, 6 and 7. We conclude with our observations in Section 8.
To facilitate further research, our code and dataset will be made available at github.com/link-yet-to-be-updated.
2 Related Work
There has been a handful of research on the Nepali NER task, based on approaches like Support Vector Machines with gazetteer lists (Bam and Shahi, 2014) and Hidden Markov Models with gazetteer lists (Dey and Prukayastha, 2013; Dey et al., 2014).
Bam and Shahi (2014) use an SVM along with features such as the first word, word length, digit features and gazetteers (person, organization, location, middle name, verb, designation and others). They use a one-vs-rest classification model to classify each word into the different entity classes. However, they do not take context words into account while training the model. Similarly, Dey and Prukayastha (2013) and Dey et al. (2014) use a Hidden Markov Model with an n-gram technique for extracting POS tags. POS tags of common nouns, proper nouns or combinations of both are merged, and a gazetteer list is then used as a look-up table to identify the named entities.
Researchers have shown that neural networks like CNNs (LeCun et al., 1989), RNNs (Rumelhart et al., 1988), LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Chung et al., 2014) can capture the semantic knowledge of a language better with the help of pre-trained embeddings like word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) or fastText (Bojanowski et al., 2016).
Similar approaches have been applied to many South Asian languages like Hindi (Athavale et al., 2016), Indonesian (Gunawan et al., 2018) and Bengali (Banik and Rahman, 2018). In this paper, we present a neural network architecture for the NER task in the Nepali language which requires neither manual feature engineering nor any data pre-processing during training. First, we compare BiLSTM (Hochreiter and Schmidhuber, 1997), BiLSTM+CNN (Chiu and Nichols, 2015), BiLSTM+CRF (Lample et al., 2016) and BiLSTM+CNN+CRF (Ma and Hovy, 2016) models with the CNN model (Collobert et al., 2011) and the Stanford CRF model (Finkel et al., 2005). Second, we compare models trained on general word embeddings, word embedding + character-level embedding, word embedding + part-of-speech (POS) one-hot encoding, and word embedding + grapheme-clustered (sub-word) embedding (Park and Shin, 2018). The experiments were performed on the dataset that we created and on the dataset received from the ILPRL lab333http://ilprl.ku.edu.np/. Our extensive study shows that augmenting word embeddings with character- or grapheme-level representations and POS one-hot encoding vectors yields better results than using general word embeddings alone.
3.1 Bidirectional LSTM
We used a bi-directional LSTM to capture the representation of a sentence in both the forward and the reverse direction. Generally, an LSTM takes input from the left (past) of the sentence and computes a hidden state. However, it has proven beneficial (Dyer et al., 2015) to use a bi-directional LSTM, where hidden states are also computed from the right (future) of the sentence, and both hidden states are concatenated to produce the final output h_t = [h_t_forward; h_t_backward], where h_t_forward and h_t_backward are the hidden states computed in the forward and backward directions respectively.
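As a minimal sketch, PyTorch's bidirectional LSTM performs exactly this forward/backward concatenation; the dimensions below are illustrative, not the paper's hyper-parameters:

```python
import torch
import torch.nn as nn

# A bidirectional LSTM: at each time step the forward hidden state and
# the backward hidden state are concatenated, so the output size is
# twice the hidden size.
emb_dim, hidden = 8, 5
lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)

x = torch.randn(1, 4, emb_dim)   # one sentence of 4 word embeddings
out, _ = lstm(x)                 # out[t] = [h_t(forward); h_t(backward)]
print(out.shape)                 # torch.Size([1, 4, 10])
```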
3.2.1 Word embeddings
We have used Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016) word vectors of 300 dimensions. These vectors were trained on the corpus obtained from the Nepali National Corpus444https://www.sketchengine.eu/nepali-national-corpus/. This pre-lemmatized corpus consists of 14 million words from books, web texts and newspapers. It was mixed with the texts from our dataset before training the CBOW and skip-gram versions of word2vec using the gensim library (Řehůřek and Sojka, 2010). The trained model contains vectors for 72,782 unique words.
Light pre-processing was performed on the corpus before training. For example, invalid characters and characters outside Devanagari were removed, but punctuation and numbers were kept. We set the context window to 10 and dropped rare words with a count below 5. These word embeddings were not frozen during training, because fine-tuning word embeddings helps achieve better performance than keeping them frozen (Chiu and Nichols, 2015).
We have used fastText embeddings in particular because of their sub-word representation ability, which is very useful for a highly inflectional language, as shown in Table 3. We trained the embedding so that the sub-word size remains between 1 and 4. We chose this range because in Nepali a single letter can itself be a word, and a single grapheme (sub-word) can be formed by combining dependent vowel signs with a consonant letter, so that several Unicode characters together form one sub-word.
The two-dimensional visualization of an example word नेपाल is shown in Figure 4. Principal Component Analysis (PCA) was used to generate this visualization, which helps us analyze the nearest-neighbor words of a given sample word. 84 and 104 nearest neighbors were observed using the word2vec and fastText embeddings respectively on the same corpus.
3.2.2 Character-level embeddings
Chiu and Nichols (2015) and Ma and Hovy (2016) demonstrated that character-level embeddings extracted using a CNN, when combined with word embeddings, enhance NER model performance significantly, as they capture the morphological features of a word. Figure 1 shows the grapheme-level CNN used in our model, where the inputs to the CNN are graphemes. The character-level CNN is built in a similar fashion, except that the inputs are characters. Grapheme- or character-level embeddings of dimension 30 are randomly initialized with real values drawn uniformly from [0, 1].
3.2.3 Grapheme-level embeddings
A grapheme is the smallest meaningful unit in the writing system of a language. Since Nepali is highly morphologically inflectional, we compared grapheme-level representation with character-level representation to evaluate its effect. For example, in character-level embedding, the word नेपाल decomposes into न + े + प + ा + ल, and each character has its own embedding. In grapheme-level embedding, the word नेपाल is clustered into graphemes, resulting in ने + पा + ल, where each grapheme has its own embedding. Grapheme-level embedding yields scores on par with character-level embedding in highly inflectional languages like Nepali, because graphemes also capture syntactic information similarly to characters. We created grapheme clusters using the uniseg555https://uniseg-python.readthedocs.io/en/latest/index.html package, which performs Unicode text segmentation.
3.2.4 Part-of-speech (POS) one hot encoding
We created one-hot encoded vectors of the POS tags and concatenated them with the pre-trained word embeddings before passing the result to the BiLSTM network. A sample of the data is shown in Figure 3.
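A minimal sketch of this concatenation; the tag inventory and dimensions below are illustrative assumptions:

```python
import numpy as np

# Concatenate a pre-trained word embedding with a one-hot POS vector
# before the combined vector enters the BiLSTM.
POS_TAGS = ["NN", "VB", "PPR"]          # hypothetical tag inventory

def pos_one_hot(tag):
    vec = np.zeros(len(POS_TAGS))
    vec[POS_TAGS.index(tag)] = 1.0
    return vec

word_emb = np.random.rand(300)          # 300-d pre-trained word vector
combined = np.concatenate([word_emb, pos_one_hot("NN")])
print(combined.shape)                   # (303,)
```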
4 Dataset Statistics
4.1 OurNepali dataset
Since there was no publicly available standard Nepali NER dataset and we did not receive any dataset from previous researchers, we had to create our own. It contains sentences collected from daily newspapers of the years 2015-2016 and has three major classes: Person (PER), Location (LOC) and Organization (ORG). Pre-processing was performed on the text before the creation of the dataset; for example, all punctuation and numbers besides ',', '-', '—' and '.' were removed. Currently, the dataset is in the standard CoNLL-2003 IO format (Tjong Kim Sang and De Meulder, 2003).
Since this dataset is not lemmatized originally, we lemmatized only the post-positions (for example को, ले and मा, a few of the 299 post-positions in the Nepali language). We obtained these post-positions from sanjaalcorps666https://github.com/sanjaalcorps/NepaliStemmer and added a few more to match our dataset. We will release this list in our GitHub repository. We found that lemmatizing the post-positions boosted the F1 score by almost 10%.
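A hedged sketch of this post-position stripping; the suffix list below is a small illustrative subset, not the full 299-item list used in the paper:

```python
# Strip a trailing Nepali post-position (e.g. को, ले, मा, लाई) from a
# token, keeping the root; longer suffixes are tried first.
POST_POSITIONS = ["लाई", "को", "ले", "मा"]   # illustrative subset only

def strip_post_position(token):
    for suffix in sorted(POST_POSITIONS, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[:-len(suffix)]
    return token

print(strip_post_position("नेपालको"))   # "नेपाल" ("of Nepal" -> "Nepal")
```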
To label our dataset with POS tags, we first created a POS-annotated dataset of 6,946 sentences and 16,225 unique words extracted from the POS-tagged Nepali National Corpus, and trained a BiLSTM model with 95.14% accuracy, which was then used to generate POS tags for our dataset.
The dataset released in our GitHub repository contains one word per line, followed by its space-separated POS tag and entity tag. Sentences are separated by an empty line. A sample sentence from the dataset is presented in Table 3.
4.2 ILPRL dataset
After much time, we received this dataset from Bal Krishna Bal, ILPRL, KU. It follows the standard CoNLL-2003 IOB format (Tjong Kim Sang and De Meulder, 2003) with POS tags, and was prepared by the ILPRL Lab777http://ilprl.ku.edu.np/, KU and KEIV Technologies. A few corrections, such as fixing NER tags, had to be made to the dataset. The statistics of both datasets are presented in Table 1.
Total entities w/o O    1176     11183
Others (O)              12683    67904
Total entities w/ O     13859    79087
Table 2 presents the total entities (PER, LOC, ORG and MISC) from both of the datasets used in our experiments. Each dataset is divided into training, development and test sets containing 64%, 16% and 20% of the data respectively.
In this section, we present the details of training our neural networks. The architectures are implemented using the PyTorch framework (Paszke et al., 2017), and training is performed on a single Nvidia Tesla P100 SXM2. We first run our experiments on BiLSTM, BiLSTM-CNN, BiLSTM-CRF and BiLSTM-CNN-CRF using the hyper-parameters listed in Table 4. Training and evaluation are done at the sentence level. The RNN weights are initialized randomly from a uniform distribution.
First, we loaded our dataset and built the vocabulary using the torchtext library888https://torchtext.readthedocs.io/en/latest/, whose SequenceTaggingDataset class eased data loading. We trained our models on the shuffled training set using the Adam optimizer with the hyper-parameters listed in Table 4. All models were trained with a single LSTM layer. We found that Adam gave better performance and faster convergence than stochastic gradient descent (SGD). We chose these hyper-parameters after extensive ablation studies. A dropout of 0.5 is applied after the LSTM layer.
For the CNN, we used 30 filters each of sizes 3, 4 and 5. The embeddings of the characters or graphemes of a given word are passed through a pipeline of convolution, rectified linear unit (ReLU) and max-pooling. The resulting vectors are concatenated and, after a dropout of 0.5, passed through a linear layer to obtain a 30-dimensional embedding for the word. This embedding is concatenated with the word embedding, which in turn is concatenated with the one-hot POS vector.
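A sketch of this pipeline in PyTorch; the filter counts, sizes and the 30-d projection follow the text, while the padding and the example word length are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Character/grapheme CNN: 30 filters each of widths 3, 4 and 5 slide
# over the 30-d unit embeddings of a word, followed by ReLU and
# max-pooling over time; the pooled vectors are concatenated and
# projected to a 30-d word-level representation.
unit_emb = torch.rand(1, 30, 12)        # (batch, emb_dim, units in word)
convs = [nn.Conv1d(30, 30, k, padding=k // 2) for k in (3, 4, 5)]
pooled = [F.relu(conv(unit_emb)).max(dim=2).values for conv in convs]
char_repr = torch.cat(pooled, dim=1)    # (1, 90)
word_level = nn.Linear(90, 30)(char_repr)
print(word_level.shape)                 # torch.Size([1, 30])
```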
5.1 Tagging Scheme
Currently, our models are trained using the IO (Inside, Outside) scheme for both datasets; hence the dataset does not contain any B- annotations, unlike the BIO (Beginning, Inside, Outside) scheme.
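The practical difference between the two schemes shows up only at entity boundaries; a small example with placeholder tokens:

```python
# Two adjacent person names: IO tagging cannot mark where the first
# entity ends and the second begins, while BIO marks each start with B-.
tokens   = ["राम", "श्याम", "घर", "गयो"]
io_tags  = ["I-PER", "I-PER", "O", "O"]   # reads as one two-word entity
bio_tags = ["B-PER", "B-PER", "O", "O"]   # two separate entities
print(list(zip(tokens, io_tags)))
```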
5.2 Early Stopping
We used a simple early-stopping technique: if the validation loss did not decrease for 10 epochs, training was stopped; otherwise it ran for up to 100 epochs. In our experience, training usually stops after around 30-50 epochs.
5.3 Hyper-parameters Tuning
We searched for the best hyper-parameters by varying the learning rate over [0.1, 0.01, 0.001, 0.0001], the weight decay over several values, the batch size over [1, 2, 4, 8, 16, 32, 64, 128] and the hidden size over [8, 16, 32, 64, 128, 256, 512, 1024]. Table 4 shows all other hyper-parameters used in our experiments for both datasets.
LSTM hidden size     100
CNN filter sizes     [3, 4, 5]
CNN filter number    30
5.4 Effect of Dropout
Figure 5 shows how we chose 0.5 as the dropout rate. Without a dropout layer, the F1 score is at its lowest. As we gradually increase the dropout rate, the F1 score also increases; however, beyond a dropout rate of 0.5 the F1 score starts to fall. We therefore used a dropout rate of 0.5 in all other experiments.
In this section, we present the details regarding evaluation and comparison of our models with other baselines.
Table 3 compares the various embeddings on the OurNepali dataset. Here, the raw dataset is the one in which post-positions are not lemmatized. We observe that pre-trained embeddings significantly improve the score compared to randomly initialized embeddings, and that skip-gram models perform better than CBOW models for both word2vec and fastText. fastText_Pretrained denotes the embeddings readily available on the fastText website999https://fasttext.cc/docs/en/crawl-vectors.html, while the other embeddings were trained on the Nepali National Corpus as described in Sub-section 3.2.1. From Table 3, we can clearly see that the model using fastText skip-gram embeddings outperforms all others.
Table 5 compares the architectures of all the models in our experiments. The features used for the Stanford CRF classifier are the word itself, letter n-grams of up to length 6, the previous word and the next word; this model is trained until the objective value converges. The hyper-parameters of the neural network experiments are set as shown in Table 4. Since the character-level and grapheme-level embeddings are randomly initialized, their scores are close to each other.
All models are evaluated using the CoNLL-2003 evaluation script (Tjong Kim Sang and De Meulder, 2003) to calculate entity-wise precision, recall and F1 score.
BiLSTM + POS              83.65   84.09   81.25   85.39
BiLSTM + CNN (C)          86.45   87.45   80.51   85.45
BiLSTM + CNN (G)          86.71   86.00   78.24   83.49
BiLSTM + CNN (C) + POS    85.40   86.50   81.46   86.64
BiLSTM + CNN (G) + POS    85.46   86.43   83.08   82.99
Model                       OurNepali   ILPRL
Bam et al. SVM              66.26       46.26
Ma and Hovy w/ GloVe        83.63       72.1
Lample et al. w/ fastText   85.78       82.29
Lample et al. w/ word2vec   86.49       78.63
BiLSTM + CNN (G)            86.71       78.24
BiLSTM + CNN (G) + POS      85.46       83.08
In this paper, we show that we can exploit the power of neural networks to train models for downstream NLP tasks like named entity recognition even in the Nepali language. We showed that word vectors learned with the fastText skip-gram model perform better than other word embeddings because of their ability to represent sub-words, which is particularly important for capturing the morphological structure of words and sentences in highly inflectional languages like Nepali. This idea can come in handy for other Devanagari-script languages as well, since their written scripts have similar syntactic structure.
We also found that stemming post-positions helps considerably in improving model performance, again because of the inflectional character of Nepali: separating out inflections and morphemes reduces the variation of the same word, giving the root word a stronger vector representation than its inflected versions.
8 Conclusion and Future work
In this paper, we proposed a novel NER model for the Nepali language, achieved a relative improvement of up to 10%, and studied different factors affecting the performance of NER for Nepali.
We also presented a neural architecture, BiLSTM+CNN (grapheme-level), which performs on par with BiLSTM+CNN (character-level) under the same configuration. We believe this will help not only Nepali but also other languages written in Devanagari scripts. Our BiLSTM+CNN (grapheme-level) and BiLSTM+CNN(G)+POS models outperform all other models experimented with, on the OurNepali and ILPRL datasets respectively.
Since this is the first named entity recognition research for the Nepali language using neural networks, there is much room for improvement. We believe initializing the grapheme-level embeddings with fastText embeddings, rather than randomly, might boost performance. In the future, we plan to apply recent techniques like BERT, ELMo and FLAIR to study their effect on a low-resource language like Nepali. We also plan to improve the model using cross-lingual or multi-lingual parameter-sharing techniques by jointly training with other Devanagari languages like Hindi and Bengali.
Finally, we would like to contribute our dataset to the Nepali NLP community to move forward the research in the language understanding domain. We believe there should be a special committee to create and maintain such datasets for Nepali NLP and to organize competitions that would elevate NLP research in Nepal.
Some of the future works are listed below:
Proper initialization of the grapheme-level embeddings from fastText embeddings.
Apply a robust POS tagger to the Nepali dataset.
Lemmatize the OurNepali dataset with a robust and efficient lemmatizer.
Improve Nepali language scores with cross-lingual learning techniques.
Create more datasets using the Wikipedia/Wikidata framework.
The authors would like to express sincere thanks to Bal Krishna Bal, Professor at Kathmandu University, for providing the POS-tagged Nepali NER data.
- Akbik, A., Blythe, D., and Vollgraf, R. (2018). Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1638–1649.
- Athavale, V., Bharadwaj, S., Pamecha, M., Prabhu, A., and Shrivastava, M. (2016). Towards deep learning in Hindi NER: an approach to tackle the labelled data sparsity. CoRR abs/1610.09756.
- Bam, S. B., and Shahi, T. B. (2014). Named entity recognition for Nepali text using support vector machines. Intelligent Information Management 6(02), pp. 21.
- Banik, N., and Rahman, M. H. H. (2018). GRU based named entity recognition system for Bangla online newspapers. In 2018 International Conference on Innovation in Engineering and Technology (ICIET), pp. 1–6.
- Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
- Chiu, J. P. C., and Nichols, E. (2015). Named entity recognition with bidirectional LSTM-CNNs. CoRR abs/1511.08308.
- Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555.
- Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. CoRR abs/1103.0398.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
- Dey, A., and Prukayastha, B. S. (2013). Named entity recognition for Nepali language: a semi hybrid approach. International Journal of Engineering and Innovative Technology (IJEIT) 3, pp. 21–25.
- Dey, A., et al. (2014). Named entity recognition using gazetteer method and n-gram technique for an inflectional language: a hybrid approach. International Journal of Computer Applications 84(9).
- Dyer, C., Ballesteros, M., Ling, W., Matthews, A., and Smith, N. A. (2015). Transition-based dependency parsing with stack long short-term memory. CoRR abs/1505.08075.
- Finkel, J. R., Grenager, T., and Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 363–370.
- Gunawan, W., et al. (2018). Named-entity recognition for Indonesian language using bidirectional LSTM-CNNs. Procedia Computer Science 135, pp. 425–432.
- Gupta, et al. (2018). Raiden11@IECSIL-FIRE-2018: named entity recognition for Indian languages. FIRE Working Notes.
- Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
- Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition. CoRR abs/1603.01360.
- LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4), pp. 541–551.
- Ma, X., and Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. CoRR abs/1603.01354.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. CoRR abs/1310.4546.
- Park, S., and Shin, H. (2018). Grapheme-level awareness in word embeddings for morphologically rich languages. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC), Miyazaki, Japan.
- Paszke, A., et al. (2017). Automatic differentiation in PyTorch.
- Pennington, J., Socher, R., and Manning, C. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
- Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. CoRR abs/1802.05365.
- Řehůřek, R., and Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50.
- Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning representations by back-propagating errors. In Neurocomputing: Foundations of Research, J. A. Anderson and E. Rosenfeld (Eds.), pp. 696–699.
- Tjong Kim Sang, E. F., and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147.