MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition

08/30/2022
by Shervin Malmasi, et al.

We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We evaluated two NER models on our dataset: a baseline XLM-RoBERTa model, and a state-of-the-art GEMNET model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%). GEMNET, which uses gazetteers, improves on this significantly (average improvement of macro-F1=+30%). MultiCoNER poses challenges even for large pre-trained language models, and we believe that it can help further research in building robust NER systems. MultiCoNER is publicly available at https://registry.opendata.aws/multiconer/ and we hope that this resource will help advance research in various aspects of NER.
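
The abstract reports results as macro-F1. As a rough illustration of what that metric measures, the sketch below computes an entity-level macro-F1 under the assumption that it is the unweighted mean of per-class F1 with exact span and label matching. The function name `macro_f1`, the span tuple layout, and the `PER`/`LOC` labels are illustrative choices, not taken from the paper or its evaluation scripts.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over exact-match entity spans.

    gold, pred: sets of (sentence_id, start, end, label) tuples.
    This tuple layout is an assumption made for this sketch.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for span in pred:
        label = span[-1]
        if span in gold:
            tp[label] += 1  # predicted span matches gold exactly
        else:
            fp[label] += 1  # predicted span with wrong boundary or label
    for span in gold - pred:
        fn[span[-1]] += 1   # gold span the system missed

    labels = {span[-1] for span in gold | pred}
    scores = []
    for label in sorted(labels):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    # Every class counts equally, regardless of how frequent it is.
    return sum(scores) / len(scores) if scores else 0.0


gold = {(0, 0, 2, "PER"), (0, 3, 5, "LOC"), (1, 0, 1, "PER")}
pred = {(0, 0, 2, "PER"), (0, 3, 5, "PER"), (1, 0, 1, "PER")}
print(round(macro_f1(gold, pred), 3))  # PER F1 = 0.8, LOC F1 = 0.0 -> 0.4
```

Because each class contributes equally to the average, a system that does well on frequent classes but misses rare ones scores low, which is relevant to the long-tail entity distributions the dataset targets.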


