PEYMA: A Tagged Corpus for Persian Named Entities

01/30/2018
by Mahsa Sadat Shahshahani, et al.

The goal of the named entity recognition (NER) task is to classify the proper nouns in a text into classes such as person, location, and organization. It is an important preprocessing step in many NLP tasks, such as question answering and summarization. Although NER has been studied extensively for English, where state-of-the-art systems achieve F1 scores above 90 percent, very few studies address the task in Persian. One of the main causes may be the lack of a standard Persian NER dataset for training and testing NER systems. In this research we create a standard, sufficiently large tagged Persian NER dataset, which will be distributed for free for research purposes. To construct such a dataset, we studied the standard NER datasets built for English research and found that almost all of them are constructed from news texts, so we collected documents from ten news websites. Then, to give annotators guidelines for tagging these documents, we studied the guidelines used to construct the standard English CoNLL and MUC datasets and defined our own guidelines that take Persian linguistic rules into account.
