Table 1: Word | Level-1 Tag | Level-2 Tag | Joint Tag
Named-entity recognition (NER) is an important task in information extraction. The task is to identify spans of text that are entities and classify them into pre-defined categories. Several conferences and shared tasks have evaluated NER systems in English and other languages, such as MUC-6 Sundheim (1995), CoNLL 2002 Sang (2002), and CoNLL 2003 Sang and Meulder (2003).
For Vietnamese, the VLSP 2016 NER evaluation Huyen and Luong (2016) was the first campaign that aimed to systematically compare NER systems. As in the CoNLL 2003 shared task, VLSP 2016 considered four named-entity types: person (PER), organization (ORG), location (LOC), and miscellaneous entities (MISC). In VLSP 2016, the organizers provided training and test data with gold word segmentation, PoS tags, and chunking tags. While that setting lets participating teams reduce data-processing effort and focus solely on developing NER algorithms, it is not a realistic setting. In the VLSP 2018 NER evaluation, only raw texts with XML tags were provided. Therefore, we need to choose appropriate Vietnamese NLP tools for preprocessing steps such as word segmentation, PoS tagging, and chunking.
In this report, we describe our NER system for the VLSP 2018 NER evaluation campaign. We applied a feature-based model that combines word features, word-shape features, Brown-cluster-based features, and word-embedding-based features, and adopted Conditional Random Fields (CRF) Lafferty et al. (2001) for training and testing.
In the VLSP 2018 NER task, as in VLSP 2016, the NER dataset contains nested entities: an entity may contain other entities inside it. We categorize entities in the VLSP 2018 NER dataset into three levels.
Level-1 entities are entities that do not contain other entities inside them. For example: <ENAMEX TYPE="LOC">Hà Nội</ENAMEX>.
Level-2 entities are entities that contain only level-1 entities inside them. For example: <ENAMEX TYPE="ORG">UBND thành phố <ENAMEX TYPE="LOC">Hà Nội</ENAMEX></ENAMEX>.
Level-3 entities are entities that contain at least one level-2 entity and may contain some level-1 entities. For example: <ENAMEX TYPE="ORG">Khoa Toán, <ENAMEX TYPE="ORG">ĐHQG <ENAMEX TYPE="LOC">Hà Nội</ENAMEX></ENAMEX></ENAMEX>.
In our data statistics, the number of level-3 entities is very small compared with the number of level-1 and level-2 entities, so we decided to ignore them when building the model and consider only level-1 and level-2 entities.
To deal with nested named entities, we investigated two methods. The first method trains a separate model for each level of entities. The second method trains a single model on training data in which the tags are generated by combining the entity tags of all levels. Table 1 shows an example of how we combined the entity tags at all levels of a token to create joint tags.
We show that combining the tags of entities at all levels for training a sequence labeling model (joint-tag model) improved the accuracy of nested named-entity recognition.
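The tag combination described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `make_joint_tags`, the `+` separator, and the example tag sequences are assumptions, since the text does not specify the exact joint-tag format.

```python
def make_joint_tags(level1_tags, level2_tags, sep="+"):
    """Combine per-token level-1 and level-2 BIO tags into joint tags.

    The separator and the "level1+level2" order are hypothetical choices.
    """
    if len(level1_tags) != len(level2_tags):
        raise ValueError("tag sequences must have the same length")
    return [l1 + sep + l2 for l1, l2 in zip(level1_tags, level2_tags)]


# Word-segmented example: "UBND thành phố Hà Nội" (an ORG containing a LOC).
words = ["UBND", "thành_phố", "Hà_Nội"]
level1 = ["O", "O", "B-LOC"]
level2 = ["B-ORG", "I-ORG", "I-ORG"]
joint = make_joint_tags(level1, level2)
for word, tag in zip(words, joint):
    print(word, tag)
```

A single sequence labeler trained on such tags predicts both levels at once, at the cost of a larger label set.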
2 System description
We formalize the NER task as a sequence labeling problem using the B-I-O tagging scheme, and we apply a popular sequence labeling model, Conditional Random Fields, to the problem. In this section, we first present how we preprocess the data and then present the features used in our model.
In our NER system, we performed sentence and word segmentation on the data. For sentence segmentation, we used a simple regular expression to detect sentence boundaries that match the pattern: a period followed by a space and an upper-case character. To produce our submissions, we also tried skipping sentence segmentation.
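A rule of this kind might look like the sketch below. The exact regular expression used in the system is not given, so this is only one way to realize "period, space, upper-case character"; the upper-case check uses `str.isupper()` so that Vietnamese capitals such as "Đ" are covered, and the function name is hypothetical.

```python
import re


def split_sentences(text):
    """Split text at a period followed by a space and an upper-case character."""
    sentences = []
    start = 0
    for match in re.finditer(r"\. ", text):
        nxt = match.end()
        # Only split when the character after the space is upper-case.
        if nxt < len(text) and text[nxt].isupper():
            sentences.append(text[start:match.start() + 1])
            start = nxt
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Tôi ở Hà Nội. Đây là nhà tôi. giá 3.5 triệu."))
```

Such a rule is deliberately simple: it does not split after abbreviations followed by lower-case text, but it will split after an abbreviation that precedes a capitalized word.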
For word segmentation, we adopted RDRsegmenter Nguyen et al. (2018), which is the state-of-the-art Vietnamese word segmentation tool. Both the training and development data are then converted into data files in CoNLL 2003 format with two columns: words and their BIO tags. Due to errors of the word segmentation tool, entity boundaries may conflict with word boundaries. In such cases, we decided to tag the words as "O" (outside entity).
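The conversion to word-level BIO tags, including the boundary-conflict rule, can be sketched as below. The data layout is an assumption made for illustration: each word is a list of syllables, and entities are `(start, end, type)` triples over syllable indices with an exclusive end. The sketch also assumes one reading of the rule, namely that an entity whose boundaries do not align with word boundaries is left entirely as "O".

```python
def words_to_bio(words, entities):
    """Assign BIO tags to segmented words from syllable-level entity spans."""
    # Syllable offsets at which each word starts and ends.
    starts, ends = [], []
    pos = 0
    for word in words:
        starts.append(pos)
        pos += len(word)
        ends.append(pos)

    tags = ["O"] * len(words)
    for start, end, etype in entities:
        if start in starts and end in ends:
            first, last = starts.index(start), ends.index(end)
            tags[first] = "B-" + etype
            for i in range(first + 1, last + 1):
                tags[i] = "I-" + etype
        # Otherwise the entity boundary falls inside a word: keep "O".
    return tags


good = [["UBND"], ["thành", "phố"], ["Hà", "Nội"]]
bad = [["thành"], ["phố", "Hà"], ["Nội"]]  # segmentation error
print(words_to_bio(good, [(3, 5, "LOC")]))
print(words_to_bio(bad, [(2, 4, "LOC")]))
```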
Features in the proposed NER model are categorized into word features, word-shape features, and features based on word representations, including word clusters and word embeddings. We extract unigram and bigram features from the context surrounding the current token with a window size of 5. More specifically, for a feature $f$ of the current word, the unigram and bigram features are as follows.
unigrams: $f[-2]$, $f[-1]$, $f[0]$, $f[1]$, $f[2]$
bigrams: $f[-2]f[-1]$, $f[-1]f[0]$, $f[0]f[1]$, $f[1]f[2]$
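As an illustration of these context templates, the sketch below generates unigram and bigram features for one feature type over a window of five tokens. The feature-name format, the `<PAD>` value for positions outside the sentence, and the function name are assumptions, not details taken from the system.

```python
def context_features(values, i, name):
    """Unigram and bigram features of one feature type around position i.

    values: the feature's value for every token in the sentence.
    """
    def at(offset):
        j = i + offset
        return values[j] if 0 <= j < len(values) else "<PAD>"

    feats = {}
    for k in range(-2, 3):  # unigrams at offsets -2 .. 2
        feats[f"{name}[{k}]"] = at(k)
    for k in range(-2, 2):  # bigrams of adjacent offsets
        feats[f"{name}[{k}]|{name}[{k + 1}]"] = at(k) + "|" + at(k + 1)
    return feats


tokens = ["a", "b", "c", "d", "e"]
feats = context_features(tokens, 2, "w")
print(len(feats))
```

The same helper can be reused for any per-token feature (surface form, lower-cased form, word shape) by passing a different `values` list and `name`.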
2.2.1 Word Features
We extract word-identity unigrams and bigrams within the window of size 5. We use both word surfaces and their lower-case forms. Besides words, we also extract prefixes and suffixes of the surface forms of words within the context of the current word. In our model, we use prefixes and suffixes of lengths from 1 to 4 characters.
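A minimal sketch of the per-word part of these features is given below. The feature names are hypothetical, and skipping affixes longer than the word itself is an assumption, since the text does not say how short words are handled.

```python
def word_features(word, max_affix=4):
    """Surface form, lower-cased form, and prefixes/suffixes of length 1..4."""
    feats = {"word": word, "lower": word.lower()}
    for n in range(1, max_affix + 1):
        if len(word) >= n:  # assumption: no affix longer than the word
            feats[f"prefix{n}"] = word[:n]
            feats[f"suffix{n}"] = word[-n:]
    return feats


print(word_features("Hanoi"))
```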
2.2.2 Word Shapes
In addition to word identities, we use word shapes to improve prediction ability, especially for unknown or rare words, and to reduce the data sparseness problem. We used the same word shapes as presented in Minh (2018).
2.2.3 Brown cluster-based features
The Brown clustering algorithm is a hierarchical clustering algorithm for assigning words to clusters Brown et al. (1992). Each cluster contains words which are semantically similar. Output clusters are represented as bit-strings. Brown-cluster-based features in our NER model include the whole bit-string representations of words and their prefixes of lengths 2, 4, 6, 8, 10, 12, 16, and 20. Note that we only extract unigrams for Brown-cluster-based features.
In our experiments, we used the Brown clustering implementation of Liang (2005) and applied the tool to raw text data collected from a Vietnamese news portal. We performed word clustering on the same preprocessed text data that were used to generate word embeddings in Le-Hong et al. (2017). The number of word clusters used in our experiments is 5120.
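The bit-string features can be sketched as follows, assuming the clustering output has been loaded into a dictionary from word to bit-string. The dictionary, the feature names, and the choice to skip prefixes longer than the bit-string are assumptions for illustration; the example bit-string is made up.

```python
PREFIX_LENGTHS = (2, 4, 6, 8, 10, 12, 16, 20)


def brown_features(word, clusters):
    """Whole bit-string plus fixed-length prefixes of a word's Brown cluster."""
    bits = clusters.get(word)
    if bits is None:  # word not seen during clustering
        return {}
    feats = {"brown": bits}
    for n in PREFIX_LENGTHS:
        if len(bits) >= n:  # assumption: skip prefixes longer than the string
            feats[f"brown_p{n}"] = bits[:n]
    return feats


clusters = {"Hanoi": "1101001110"}  # hypothetical cluster assignment
print(brown_features("Hanoi", clusters))
```

Shorter prefixes correspond to coarser clusters higher in the hierarchy, so the model can back off from fine-grained to coarse word classes.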
2.2.4 Word embeddings
Word-embedding features have been used for a CRF-based Vietnamese NER model in Le-Hong et al. (2017). The basic idea is to add unigram features corresponding to the dimensions of word representation vectors.
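One way to picture this idea is the sketch below, which turns each dimension of a word vector into a real-valued feature. The embedding table, the feature-name scheme, and the treatment of out-of-vocabulary words are assumptions; the vector values are made up.

```python
def embedding_features(word, embeddings, prefix="emb"):
    """One real-valued unigram feature per embedding dimension."""
    vector = embeddings.get(word)
    if vector is None:  # assumption: no embedding features for unknown words
        return {}
    return {f"{prefix}_{d}": float(v) for d, v in enumerate(vector)}


embeddings = {"Hanoi": [0.1, -0.3, 0.5]}  # hypothetical 3-dimensional vectors
print(embedding_features("Hanoi", embeddings))
```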
3.1 Data sets
Table 2 shows the data statistics on the training set, development set, and official test set. The number of organization entities (ORG) at level 3 is very small, so we only consider level-1 and level-2 entities in training and evaluation. Almost all level-2 entities are of type ORG.
3.2 Evaluation Measures
We used Precision, Recall, and F1 score as evaluation measures. Because word segmentation may cause boundary conflicts between entities and words, we convert words in the data into syllables before computing Precision, Recall, and F1 scores.
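The word-to-syllable conversion can be sketched as below. It assumes that the syllables of a segmented word are joined by underscores and that only the first syllable of a word tagged `B-` keeps the `B-` prefix; the function name is hypothetical.

```python
def word_tags_to_syllable_tags(words, tags):
    """Expand word-level BIO tags to syllable-level BIO tags."""
    out = []
    for word, tag in zip(words, tags):
        for k, syllable in enumerate(word.split("_")):
            if tag.startswith("B-") and k > 0:
                # Later syllables of a B- word continue the same entity.
                out.append((syllable, "I-" + tag[2:]))
            else:
                out.append((syllable, tag))
    return out


pairs = word_tags_to_syllable_tags(
    ["UBND", "thành_phố", "Hà_Nội"], ["B-ORG", "I-ORG", "B-LOC"]
)
print(pairs)
```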
We consider four entity types in evaluation (LOCATION, MISCELLANEOUS, ORGANIZATION, and PERSON) and use the CoNLL-2003 evaluation script.
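For reference, entity-level Precision, Recall, and F1 over BIO tags can be computed as in the sketch below. It is a simplified stand-in for the evaluation script, not the script itself: it scores a single tag sequence, counts an entity as correct only when both its span and its type match, and treats an `I-` tag with no open entity as the start of a new entity.

```python
def bio_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes last span
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = ["B-LOC", "I-LOC", "O", "B-PER"]
pred = ["B-LOC", "I-LOC", "O", "O"]
print(precision_recall_f1(gold, pred))
```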
3.3 NER models
For evaluation on the development set, we train the following three NER models on the training data of the VLSP 2018 NER task.
Level-1 model is trained by using level-1 entity tags.
Level-2 model is trained by using level-2 entity tags.
Joint model is trained using joint tags which combine level-1 and level-2 tags of each word.
Table 3 and Table 4 show the evaluation results on the development set for recognizing level-1 and level-2 entities, respectively. The level-1 model obtained a slightly better F1 score than the joint model in recognizing level-1 entities, while the joint model outperformed the level-2 model in recognizing level-2 entities. We also see that the level-2 model got higher precision but much lower recall than the joint model. A plausible explanation for this phenomenon is that the information in level-1 tags helps to recognize more level-2 entities.
3.5 Result Submissions
We trained models on the data set obtained by combining provided training and development data and used the trained models for recognizing entities on the test set.
To produce the submitted results, we use the following methods.
We use the level-1 and level-2 models for recognizing level-1 and level-2 entities, respectively. We refer to this method as the Separated method.
We use the joint model to predict a joint tag for each word of a sentence, then split the joint tags into level-1 and level-2 tags. We refer to this method as the Joint method.
We use the joint model for recognizing level-2 entities and the level-1 model for recognizing level-1 entities. We refer to this method as the Hybrid method.
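The splitting step of the Joint method can be sketched as follows. The `+` separator and the "level1+level2" order are hypothetical, since the exact joint-tag format is not specified in the text.

```python
def split_joint_tags(joint_tags, sep="+"):
    """Split predicted joint tags back into level-1 and level-2 tag sequences."""
    level1, level2 = [], []
    for tag in joint_tags:
        l1, l2 = tag.split(sep, 1)
        level1.append(l1)
        level2.append(l2)
    return level1, level2


predicted = ["O+B-ORG", "O+I-ORG", "B-LOC+I-ORG"]
lvl1, lvl2 = split_joint_tags(predicted)
print(lvl1, lvl2)
```

Under the Hybrid method, only the level-2 half of such a split would be kept, with level-1 tags taken from the separate level-1 model.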
During recognition, there are some cases in which a predicted level-1 entity contains predicted level-2 entities. In such cases, we omit the predicted level-2 entities inside the predicted level-1 entity. The reason is that the accuracy of level-1 entity recognition on the development set is much higher than that of level-2 entity recognition.
We submitted six runs to the VLSP 2018 NER evaluation campaign, as shown in Table 5. We tried two preprocessing approaches: with sentence segmentation and without sentence segmentation. We tried both because we wanted to know the influence of sequence length on the accuracy of our model.
Table 6 shows the official evaluation results for our six submitted runs. As indicated in the table, run 4, which uses the Joint model, obtained the highest F1 score among the six runs. Using the Joint or Hybrid method obtained better F1 scores than using the Separated method. We also see that the difference between a system that performs sentence segmentation and a system that does not is very small.
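This conflict-resolution rule can be sketched as below, with predicted entities represented as `(start, end, type)` spans over token indices (end exclusive). The span representation and the choice to also drop a level-2 span that coincides exactly with a level-1 span are assumptions made for illustration.

```python
def drop_level2_inside_level1(level1_spans, level2_spans):
    """Keep only level-2 predictions that are not inside a level-1 prediction."""
    kept = []
    for s2, e2, t2 in level2_spans:
        inside = any(s1 <= s2 and e2 <= e1 for s1, e1, _ in level1_spans)
        if not inside:
            kept.append((s2, e2, t2))
    return kept


level1_pred = [(0, 5, "ORG")]
level2_pred = [(1, 3, "ORG"), (6, 9, "ORG")]
print(drop_level2_inside_level1(level1_pred, level2_pred))
```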
Table 7 shows the Precision, Recall, F1 scores for each entity category of run 4.
Run-6 (using level-1 and level-2 models separately, without sentence segmentation) obtained the best accuracy in recognizing level-1 entities among the submitted runs, and Run-3 (Joint model, with sentence segmentation) obtained the best accuracy in recognizing level-2 entities.
Using the joint model obtained better F1 scores for recognizing both levels of entities than using the models trained solely on level-1 or level-2 entity tags. That result is consistent with the result on the development set.
We have presented a feature-based model for Vietnamese named-entity recognition and its evaluation results at the VLSP 2018 NER evaluation campaign. We compared several methods for recognizing nested entities. Experimental results showed that combining the tags of entities at all levels for training a sequence labeling model improved the accuracy of nested named-entity recognition. As future work, we plan to investigate deep learning methods such as BiLSTM-CNN-CRF Ma and Hovy (2016) for nested named-entity recognition.
- Brown et al. (1992) Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Comput. Linguist., 18(4):467–479.
- Huyen and Luong (2016) Nguyen Thi Minh Huyen and Vu Xuan Luong. 2016. VLSP 2016 shared task: Named entity recognition. In Proceedings of Vietnamese Speech and Language Processing (VLSP).
- Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289.
- Le-Hong et al. (2017) Phuong Le-Hong, Quang Nhat Minh Pham, Thai-Hoang Pham, Tuan-Anh Tran, and Dang-Minh Nguyen. 2017. An empirical study of discriminative sequence labeling models for Vietnamese text processing. In Proceedings of the 9th International Conference on Knowledge and Systems Engineering (KSE 2017).
- Liang (2005) Percy Liang. 2005. Semi-supervised learning for natural language. Ph.D. thesis, Massachusetts Institute of Technology.
- Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1064–1074.
- Minh (2018) Pham Quang Nhat Minh. 2018. A feature-rich Vietnamese named-entity recognition model. arXiv preprint arXiv:1803.04375.
- Nguyen et al. (2018) Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras, and Mark Johnson. 2018. A Fast and Accurate Vietnamese Word Segmenter. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018).
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Sang (2002) Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. CoRR, cs.CL/0209010.
- Sang and Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.
- Sundheim (1995) Beth Sundheim. 1995. Overview of results of the MUC-6 evaluation. In MUC.