PEYMA: A Tagged Corpus for Persian Named Entities

01/30/2018
by Mahsa Sadat Shahshahani, et al.

The goal of the named entity recognition (NER) task is to classify the proper nouns in a text into classes such as person, location, and organization. It is an important preprocessing step in many NLP tasks, such as question answering and summarization. Although NER has been studied extensively for English, and state-of-the-art systems exceed 90 percent F1, there are very few studies of this task for Persian. One of the main causes may be the lack of a standard Persian NER dataset for training and testing NER systems. In this research we create a standard, sufficiently large tagged Persian NER dataset, which will be distributed free of charge for research purposes. To construct it, we studied the standard NER datasets built for English and found that almost all of them are constructed from news texts, so we collected documents from ten news websites. To give annotators guidelines for tagging these documents, we studied the guidelines used to construct the standard English CoNLL and MUC datasets and then set our own guidelines, taking Persian linguistic rules into account.
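As an illustration of the F1 measure mentioned above, the following is a minimal sketch of entity-level F1 scoring for NER output. It assumes a BIO tagging scheme and the tag names PER and LOC, neither of which is stated in the abstract; the function names `extract_spans` and `entity_f1` are hypothetical and are not taken from the paper or from any library.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity label)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (BIO is an assumption here)."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end of the sequence.
    for i, tag in enumerate(tags + ["O"]):
        continues_span = tag.startswith("I-") and label == tag[2:]
        if continues_span:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag == "O":
            start, label = None, None
        else:
            # "B-X" opens a new span; a stray "I-X" is also treated as an opening.
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a predicted span counts only if boundaries and label match."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tag sequences for a four-token sentence.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "I-PER", "O", "O"]
score = entity_f1(gold_tags, pred_tags)
```

In this example the prediction recovers one of the two gold entities and proposes no wrong ones, so precision is 1, recall is 0.5, and F1 is 2/3. Exact-match span scoring is one common convention; other evaluations may give partial credit for overlapping spans.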


research
12/22/2018

A Survey on Deep Learning for Named Entity Recognition

Named entity recognition (NER) is the task to identify text spans that m...
research
04/28/2022

HiNER: A Large Hindi Named Entity Recognition Dataset

Named Entity Recognition (NER) is a foundational NLP task that aims to p...
research
04/29/2022

What do we Really Know about State of the Art NER?

Named Entity Recognition (NER) is a well researched NLP task and is wide...
research
10/29/2018

A Pragmatic Guide to Geoparsing Evaluation

Empirical methods in geoparsing have thus far lacked a standard evaluati...
research
10/27/2021

Towards Realistic Single-Task Continuous Learning Research for NER

There is an increasing interest in continuous learning (CL), as data pri...
research
07/18/2022

GOAL: Towards Benchmarking Few-Shot Sports Game Summarization

Sports game summarization aims to generate sports news based on real-tim...
research
04/09/2023

RISC: Generating Realistic Synthetic Bilingual Insurance Contract

This paper presents RISC, an open-source Python package data generator (...
