Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

12/20/2022
by   Arnav Mhaske, et al.

We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. For 9 of the 11 languages, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization). The training dataset has been created automatically from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence onto the corresponding Indian language sentence. We also create manually annotated test sets for 8 languages, each containing approximately 1000 sentences. We demonstrate the utility of the resulting dataset on existing test sets and on the Naamapadam-test data for 8 Indic languages. We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set. On the Naamapadam-test set, IndicNER achieves a higher F1 than an mBERT model fine-tuned on existing datasets, and it achieves an F1 score above 80 for 7 of the 11 Indic languages. The dataset and models are available under open-source licenses at https://ai4bharat.iitm.ac.in/naamapadam.
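The projection step described above can be illustrated with a minimal sketch. It assumes the English sentence already carries BIO tags and that word alignments between the two sentences are given as (source index, target index) pairs; the abstract does not say how the alignments are obtained or how conflicts are resolved, so the function name `project_entities`, its parameters, and the overlap rule are hypothetical choices made for illustration, not the authors' pipeline.

```python
from typing import List, Tuple


def project_entities(
    src_tags: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project BIO tags from source tokens onto target tokens.

    Hypothetical helper: `alignment` holds (source_index, target_index)
    pairs produced by some word aligner, which is assumed to exist.
    """
    # Collect source entity spans as (start, end_exclusive, label).
    spans = []
    i = 0
    while i < len(src_tags):
        tag = src_tags[i]
        if tag.startswith("B-"):
            label = tag[2:]
            j = i + 1
            while j < len(src_tags) and src_tags[j] == f"I-{label}":
                j += 1
            spans.append((i, j, label))
            i = j
        else:
            i += 1

    tgt_tags = ["O"] * tgt_len
    for start, end, label in spans:
        # Target positions aligned to any token of this source entity.
        tgt_idx = sorted({t for s, t in alignment if start <= s < end})
        if not tgt_idx:
            continue  # entity has no aligned target tokens; drop it
        lo, hi = tgt_idx[0], tgt_idx[-1]
        # Assumption: skip a projection that would overlap an earlier one.
        if any(tgt_tags[k] != "O" for k in range(lo, hi + 1)):
            continue
        # Tag the contiguous target range covering all aligned positions.
        tgt_tags[lo] = f"B-{label}"
        for k in range(lo + 1, hi + 1):
            tgt_tags[k] = f"I-{label}"
    return tgt_tags


# English: "Sachin lives in Mumbai" -> Hindi: "सचिन मुंबई में रहता है"
src_tags = ["B-PER", "O", "O", "B-LOC"]
alignment = [(0, 0), (1, 3), (2, 2), (3, 1)]  # illustrative alignment
print(project_entities(src_tags, alignment, tgt_len=5))
```

Tagging the full range between the first and last aligned target token keeps each projected entity contiguous even when the aligner misses a word in the middle, at the cost of occasionally including an unrelated token.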


