Cross Script Hindi English NER Corpus from Wikipedia

10/08/2018
by Mohd Zeeshan Ansari, et al.

Text generated on social media platforms is essentially mixed-lingual. Language mixing in any form creates considerable difficulty for language processing systems. Moreover, advances in language processing research depend on the availability of standard corpora. The development of mixed-lingual Indian Named Entity Recognition (NER) systems is hindered by the unavailability of standard evaluation corpora. Such corpora may be mixed-lingual in nature, with text written in multiple languages but predominantly in a single script. The motivation of our work is to emphasize the automatic generation of such corpora in order to encourage mixed-lingual Indian NER. The paper presents the preparation of a Cross Script Hindi-English corpus from Wikipedia category pages. The corpus is successfully annotated using the standard CoNLL-2003 categories PER, LOC, ORG, and MISC. It is evaluated with a variety of machine learning algorithms, and favorable results are achieved.

research
09/15/2023

AlbNER: A Corpus for Named Entity Recognition in Albanian

Scarcity of resources such as annotated text corpora for under-resourced...
research
03/05/2020

Neural Cross-Lingual Transfer and Limited Annotated Data for Named Entity Recognition in Danish

Named Entity Recognition (NER) has greatly advanced by the introduction ...
research
11/22/2019

Zero-Resource Cross-Lingual Named Entity Recognition

Recently, neural methods have achieved state-of-the-art (SOTA) results i...
research
07/08/2017

Weakly Supervised Cross-Lingual Named Entity Recognition via Effective Annotation and Representation Projection

The state-of-the-art named entity recognition (NER) systems are supervis...
research
07/03/2023

Exploring Spoken Named Entity Recognition: A Cross-Lingual Perspective

Recent advancements in Named Entity Recognition (NER) have significantly...
research
06/17/2022

BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers

Code-Mixed text data consists of sentences having words or phrases from ...
research
07/15/2020

Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook

This paper presents two colloquial Sinhala language corpora from the lan...
