Named entity recognition (NER) is a fundamental task in NLP. It aims to identify entities in a text and classify them into predefined types. It is an essential tool for information retrieval, question answering Banerjee et al. (2019), and text summarization tasks Patil et al. (2016); Li et al. (2020). However, except for a few resource-rich languages, NER annotated data for most languages are still very limited Kruengkrai et al. (2020). Moreover, annotating such data is usually time-consuming, particularly in low-resource settings such as multilingual and code-mixed text Liu et al. (2021).
This paper presents the system description for named entity recognition on the code-mixed dataset. Code-mixing is defined as using two or more languages in a single sentence or utterance Dowlagar and Mamidi (2021). The use of code-mixed language is prevalent in most multilingual societies. Due to the linguistic complexity arising from mixing two languages, processing code-mixed sentences is a challenging task Bali et al. (2014). As a result, models trained on monolingual and multilingual datasets typically fail to handle code-mixed inputs Khanuja et al. (2020). Therefore, to encourage research on code-mixing, the speech and NLP communities are organizing several shared tasks. These shared tasks have concentrated on language identification, POS tagging, sentiment analysis, and hate speech detection, and several datasets exist for these as well. Similarly, SemEval-2022 Task 11, sub-task 13, was devoted to identifying named entities in code-mixed languages Malmasi et al. (2022b). This task aims to classify the given tokens in the code-mixed sentences as persons, corporations, locations, and others. An example is shown in Table 1.
The lack of annotated data is a crucial issue for code-mixed datasets. It leads to overfitting and poor entity recognition: language models trained on such low-resource datasets cannot generalize beyond the training data and thus perform poorly on the test datasets. Several previous studies have used monolingual data as training signals for transfer learning, and these data can also be used for pre-training. We therefore followed a similar approach and included the multilingual data along with the code-mixed dataset.
We used the multilingual pre-trained BERT model as our model for code-mixed NER. The model is trained on the code-mixed training data together with the multilingual training and multilingual validation data.
We have observed that Bi-LSTM models with CRF show improved accuracy on token classification tasks such as POS tagging, language identification, and NER. An ensemble of BERT or XLM-RoBERTa with Bi-LSTM and CRF might therefore show a further improvement in code-mixed NER. Also, using language identification as a downstream task with the current method might have improved the NER accuracy.
The paper is organized as follows. Section 2 provides related work on named entity recognition on code-mixed (CM) social media text. Section 3 provides information on the task and examples from the datasets. Section 4 describes the proposed work. Section 5 presents the experimental setup, and Section 6 reports the performance of the model. Section 7 concludes our work.
2 Related Work
Code-mixed NER has attracted a lot of attention in the NLP community this decade. This section lists the latest works on code-mixed named entity recognition.
Priyadharshini et al. (2020); Winata et al. (2019) generated multilingual meta representations from pre-trained monolingual word embeddings. The model learned to construct the best word representation by mixing multiple sources without explicit language identification.
Aguilar et al. (2019) presented a shared task on named entity recognition in the CALCS workshop. The language pairs used were English-Spanish (ENG-SPA) and Modern Standard Arabic-Egyptian (MSA-EGY). They used Twitter data and nine entity types to establish a new dataset for code-switched NER benchmarks. The participating teams used LSTM, CNN, CRF, and word representations to recognize named entities.
Singh et al. (2018) presented a new code-mixed Hinglish corpus for NER. Different machine learning classification algorithms with word, character, and lexical features were used to establish baselines. The algorithms used were decision tree, Long Short-Term Memory (LSTM), and Conditional Random Field (CRF).
Meng et al. (2021); Fetahu et al. (2021) presented a novel CM NER model. They proposed a gated architecture that enhances existing multilingual Transformers by dynamically infusing multilingual knowledge bases, a.k.a. gazetteers. Evaluation on code-mixed queries shows that this approach efficiently utilizes gazetteers to recognize entities, with F1 = 68%, an absolute improvement of +31% over a non-gazetteer baseline.
Meng et al. (2021) mentioned that including Gazetteer features could cause models to overuse or underuse them, leading to poor generalization. They proposed a new approach for gazetteer knowledge integration by including Context in Gazetteer Representation using encoder and Mixture-of-Experts gating network models. These models overcome the feature overuse issue by learning to conditionally combine the context and gazetteer features instead of assigning them fixed weights.
3 Task Setup
The shared task focuses on detecting semantically ambiguous and complex entities in short and low-context settings. Complex NEs, like the titles of creative works (movie/book/song/software names), are not simple nouns. They often take the form of imperative clauses or resemble typical syntactic constituents. Such NEs are harder to recognize Ashwini and Choi (2014). Syntactic parsing of such complex noun phrases is hard, and most NER systems fail to identify them. The inside-outside-beginning (IOB) format Ramshaw and Marcus (1999) is used for annotating entities. A few examples of complex NEs and ambiguous NEs from the code-mixed dataset are given in Table 2.
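To make the IOB annotation concrete, the following is a minimal sketch of how entity spans could be read back from a sequence of IOB tags. The function name `iob_to_spans` and the example sentence are hypothetical and not taken from the shared-task data; only the CW (creative work) label comes from the task.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collect (entity_type, start, end_exclusive) spans from IOB tags."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            # Assumption: a stray I- tag with no open span starts a new span.
            start, current = i, label
    return spans


# Hypothetical low-context query containing a creative-work title.
tokens = ["watch", "gone", "with", "the", "wind", "tonight"]
tags = ["O", "B-CW", "I-CW", "I-CW", "I-CW", "O"]
spans = iob_to_spans(tags)
titles = [" ".join(tokens[s:e]) for _, s, e in spans]
```

The example also illustrates why such entities are hard: the title is made of ordinary words and looks like a regular syntactic constituent.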
So the MultiCoNER shared task encourages the models to handle such complex NEs. A huge dataset Malmasi et al. (2022a) is released for this task Malmasi et al. (2022b). The languages focused on in this shared task are: English, Spanish, Dutch, Russian, Turkish, Korean, Farsi, German, Chinese, Hindi, and Bangla. The shared task also offered an additional track with code-mixed and multilingual datasets. In this paper, we will be concentrating on the code-mixed dataset.
4 System Overview
We finetuned the pre-trained multilingual BERT model by using the multilingual training and validation datasets for code-mixed named entity recognition. We found that the training data is insufficient for the deep learning language model to identify the named entities in the validation data correctly. The data scarcity of low-resource languages has been a significant challenge for building NLP systems since they require a large amount of data to learn a robust model. We observed that the multilingual NER training data is similar to the code-mixed dataset. Also, it is relatively large when compared to the code-mixed dataset. In our approach, the multilingual training and validation data is combined with the code-mixed training dataset. Using the combined dataset, we finetune the deep neural network model. Our method thus attempts to learn language-agnostic features by using the combined multilingual and code-mixed dataset. This finetuned model can be used to infer named entity information at a token level on a code-mixed low resource language.
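The data combination described above can be sketched as follows. This is an illustration under stated assumptions, not our exact preprocessing code: it assumes CoNLL-style files with one token per line, the tag in the last column, blank lines between sentences, and `#` lines as sentence headers. The helper `read_conll` and the toy sentences are hypothetical.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(text: str) -> List[Sentence]:
    """Parse CoNLL-style text into sentences of (token, tag) pairs.

    Assumption: lines starting with '#' are sentence headers, so a literal
    '#' token at the start of a line would be skipped by this sketch.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Toy stand-ins for the three data sources (contents are invented).
code_mixed_train = read_conll("# id cm-1\nparis _ _ B-LOC\nghumna _ _ O\n")
multi_train = read_conll("# id ml-1\nberlin _ _ B-LOC\n\n# id ml-2\nhello _ _ O\n")
multi_valid = read_conll("# id ml-3\nmadrid _ _ B-LOC\n")

# The combined set used for finetuning: code-mixed train + multilingual train/valid.
combined_train = code_mixed_train + multi_train + multi_valid
```

The code-mixed validation data is deliberately left out of the combined set so that it can still be used for model selection.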
We used the pre-trained mBERT model Devlin et al. (2018) for code-mixed NER. mBERT is a transformer-based model with multi-head self-attention that is pre-trained on a massive collection of multilingual data and can be finetuned for our NER task. As the model is pre-trained on a large corpus, the semantic and syntactic information is well modeled, and the model can be directly finetuned for a specific task. BERT is a bi-directional transformer model Vaswani et al. (2017). It analyzes the meaning of a term depending on the context on both sides. The transformer part of BERT uses an attention mechanism capable of learning the contextual relationships between the terms in a sentence.
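For reference, the attention mechanism mentioned above is usually written as the scaled dot-product attention of Vaswani et al. (2017). The symbols are those of that paper rather than notation introduced here: $Q$, $K$, and $V$ are the query, key, and value matrices, $d_k$ is the key dimension, and $h$ is the number of heads.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},
\quad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})
```

Because every token attends to all other tokens in the sentence, each representation depends on both the left and the right context.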
5 Experimental Setup
This section presents the baselines, hyper-parameter settings, and analysis of observed results.
5.1 Baselines
The baselines used for the proposed work are:
Conditional random field (CRF) Lafferty et al. (2001)
CRF is a statistical model and a well-known approach for handling NER problems. The CRF model considers the neighboring samples by modeling the prediction as a graphical model. It assumes that the tag for the present word (denoted as $y_i$) is dependent on the tag of its previous/next word (denoted as $y_{i-1}$ or $y_{i+1}$).
MultiCoNER baseline Malmasi et al. (2022b)
The XLM-RoBERTa base with CRF model is used as a baseline for NER.
Pre-trained multilingual BERT (mBERT) Devlin et al. (2018)
A pre-trained multilingual BERT model with token classification without leveraging the multilingual data is used as a baseline.
5.2 Hyperparameters and libraries
For developing our model, the neural network library used is PyTorch. The pre-trained multilingual BERT model (bert-base-multilingual-cased) and the XLM-RoBERTa base model (xlm-roberta-base) are obtained from the Hugging Face Transformers library and are finetuned for the code-mixed NER task. The model is implemented in a Kaggle Notebook with GPU processing.
The batch size is set to 64. Inputs are encoded and padded to the maximum sentence length in the training data. Due to subword tokenization, we used the first subword token of each word for predicting the tag. The optimizer used is weighted Adam with a learning rate of 2e-5 and an epsilon value of 1e-5. The dropout is set to 0.1. The loss function is the cross-entropy loss that is built into the Transformers BERT model. The model is trained for up to 30 epochs, and training is stopped when there is no change in validation accuracy for more than four epochs.
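Two details of this setup, first-subword tagging and early stopping, can be sketched in plain Python. This is a sketch under assumptions rather than our training code: `word_ids` is assumed to be a per-subword list of word indices with `None` for special or padding tokens (the shape returned by the `word_ids()` method of Hugging Face fast tokenizers), "no change" is read as "no improvement", and the function names are hypothetical.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def first_subword_labels(word_ids: List[Optional[int]],
                         word_labels: List[int]) -> List[int]:
    """Assign each word's label to its first subword and mask the rest."""
    labels, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special token, padding, or a continuation piece of the same word.
            labels.append(IGNORE_INDEX)
        else:
            labels.append(word_labels[word_id])
        previous = word_id
    return labels


def stop_epoch(val_scores: List[float], patience: int = 4) -> int:
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation score has failed to improve for
    more than `patience` consecutive epochs, or when the scores run out.
    """
    best, stale = float("-inf"), 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
        if stale > patience:
            return epoch
    return len(val_scores)


# Three words, the second split into two subwords, wrapped in special tokens.
aligned = first_subword_labels([None, 0, 1, 1, 2, None], [0, 3, 4])
stopped = stop_epoch([0.5, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.7])
```

Masking continuation pieces means that the loss and the predictions are computed once per word, which keeps the output aligned with the word-level IOB tags.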
Table 3: F1-score of the models on the valid data and test data, without (w/o) and with multilingual data.
The CRF is obtained from the python-crfsuite library (https://github.com/scrapinghub/python-crfsuite). The previous word and its tag, and the next word and its tag, are used as the features to predict the tag of the current word.
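A context-window feature function of this kind might look like the sketch below. It is an assumption-laden illustration: the feature names, the lowercasing, and the `<BOS>`/`<EOS>` boundary markers are hypothetical, and the dependence on neighboring tags is left to the CRF's transition weights instead of being encoded as explicit features.

```python
from typing import Dict, List


def token_features(words: List[str], i: int) -> Dict[str, str]:
    """Context-window features for the word at position i."""
    return {
        "word": words[i].lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "<EOS>",
    }


def sentence_features(words: List[str]) -> List[Dict[str, str]]:
    """One feature dict per token, the usual input shape for a linear-chain CRF."""
    return [token_features(words, i) for i in range(len(words))]


# Hypothetical three-word sentence.
features = sentence_features(["Visit", "New", "Delhi"])
```

Each sentence then becomes a sequence of feature dicts paired with its sequence of IOB tags for CRF training.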
6 Results and Analysis
Table 3 presents the F1-score of the models on the code-mixed dataset. From the above results, it is clear that our system, i.e., leveraging the multilingual NER data in a low-resource code-mixed setting, improves the NER task compared to the baseline models. The CRF model did not perform well on the given NER task, as this statistical model does not capture the semantics of the tokens. Even the CRF with multilingual data performed poorly on this task compared to the baseline NN models. This shows the importance of capturing semantic, syntactic, and contextual information while building the NER model on these complex datasets.
Our submission, the pre-trained mBERT leveraging the multilingual dataset, performed better than the MultiCoNER baseline by 6%. However, the MultiCoNER baseline trained with the multilingual dataset performed better than our submission.
The confusion matrices of our submission on the code-mixed NER validation dataset, with and without multilingual data, are shown in Figures 1 and 2. From the confusion matrices, we observed that the multilingual data (Figure 2) helped better identify the CW, PROD, CORP, and LOC entities when compared to the baseline model.
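As an illustration of how such a confusion matrix can be tallied, the sketch below counts (gold type, predicted type) pairs at the token level after stripping the IOB prefix. Whether the figures are computed exactly this way is an assumption, and the helper names and toy tags are hypothetical.

```python
from collections import Counter
from typing import List


def entity_type(tag: str) -> str:
    """Strip the IOB prefix: 'B-LOC' -> 'LOC', 'O' -> 'O'."""
    return tag.split("-", 1)[-1]


def confusion_counts(gold: List[str], pred: List[str]) -> Counter:
    """Count (gold_type, predicted_type) pairs over aligned tag sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have equal length")
    return Counter((entity_type(g), entity_type(p)) for g, p in zip(gold, pred))


# Toy gold and predicted tags for one sentence.
gold_tags = ["B-CW", "I-CW", "O", "B-LOC"]
pred_tags = ["B-CW", "O", "O", "B-CORP"]
counts = confusion_counts(gold_tags, pred_tags)
```

Off-diagonal entries such as `("LOC", "CORP")` show which entity types the model confuses with each other.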
7 Conclusion and Future Work
In this paper, we addressed the shared task on named entity recognition for the code-mixed dataset. As code-mixed data is low-resource and there are no pre-trained models for it, we leveraged the multilingual dataset for training the NER model. The model used for testing our method is the pre-trained multilingual BERT. We finetuned the pre-trained mBERT for the code-mixed NER task by using the code-mixed training data and the multilingual training and validation datasets.
The use of meta embeddings for dealing with code-mixed datasets has recently attracted a lot of attention. It is possible that meta embedding-based NER will work better on this code-mixed dataset. Unlike social media data, where code-mixed sentences/words are written in Roman script, this dataset uses the native script for each word, so language identification should work better on it. Using language identification or POS tagging as a downstream task for NER on this dataset might help improve code-mixed NER.
References
- Aguilar et al. (2019). Named entity recognition on code-switched data: overview of the CALCS 2018 shared task. arXiv preprint arXiv:1906.04138.
- Ashwini and Choi (2014). Targetable named entity recognition in social media. arXiv preprint arXiv:1408.0782.
- Bali et al. (2014). "I am borrowing ya mixing?" An analysis of English-Hindi code mixing in Facebook. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 116–126.
- Banerjee et al. (2019). A information retrieval based on question and answering and NER for unstructured information without using SQL. Wireless Personal Communications 108 (3), pp. 1909–1931.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dowlagar and Mamidi (2021). OFFLangOne@DravidianLangTech-EACL2021: transformers with the class balanced loss for offensive language identification in Dravidian code-mixed text. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pp. 154–159.
- Fetahu et al. (2021). Gazetteer enhanced named entity recognition for code-mixed web queries. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1677–1681.
- Khanuja et al. (2020). A new dataset for natural language inference from code-mixed conversations. arXiv preprint arXiv:2004.05051.
- Kruengkrai et al. (2020). Improving low-resource named entity recognition using joint sentence and token labeling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5898–5905.
- Lafferty et al. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data.
- Li et al. (2020). A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering 34 (1), pp. 50–70.
- Liu et al. (2021). MulDA: a multilingual data augmentation framework for low-resource cross-lingual NER. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5834–5846.
- Malmasi et al. (2022a). MultiCoNER: a large-scale multilingual dataset for complex named entity recognition.
- Malmasi et al. (2022b). SemEval-2022 Task 11: multilingual complex named entity recognition (MultiCoNER). In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022).
- Meng et al. (2021). GEMNET: effective gated gazetteer representations for recognizing complex entities in low-context input. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1499–1512.
- Patil et al. (2016). Survey of named entity recognition systems with respect to Indian and foreign languages. International Journal of Computer Applications 134 (16).
- Priyadharshini et al. (2020). Named entity recognition for code-mixed Indian corpus using meta embedding. In 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 68–72.
- Ramshaw and Marcus (1999). Text chunking using transformation-based learning. In Natural Language Processing Using Very Large Corpora, pp. 157–176.
- Singh et al. (2018). Named entity recognition for Hindi-English code-mixed social media text. In Proceedings of the Seventh Named Entities Workshop, pp. 27–35.
- Vaswani et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30.
- Winata et al. (2019). Learning multilingual meta-embeddings for code-switching named entity recognition. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pp. 181–186.