Automatically Annotated Turkish Corpus for Named Entity Recognition and Text Categorization using Large-Scale Gazetteers

02/08/2017
by H. Bahadir Sahin, et al.

The Turkish Wikipedia Named-Entity Recognition and Text Categorization (TWNERTC) dataset is a collection of automatically categorized and annotated sentences obtained from Wikipedia. We constructed large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The constructed gazetteers contain approximately 300K entities with thousands of fine-grained entity types under 77 different domains. Since automated processes are prone to ambiguity, we also introduce two new content-specific noise reduction methodologies. Moreover, we map the fine-grained entity types to four equivalent coarse-grained types: person, loc, org, misc. Finally, we construct six different dataset versions and evaluate the quality of the annotations by comparing them against ground truths from human annotators. We make these datasets publicly available to support studies on Turkish named-entity recognition (NER) and text categorization (TC).
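
To make the gazetteer-based annotation and the fine-to-coarse type mapping more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy gazetteer keyed by lowercased token sequences, a greedy longest-match lookup, and BIO-style tags. The entries, the type strings, the `annotate` and `to_coarse` helpers, and the tagging scheme are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy gazetteer: lowercased token sequence -> fine-grained type.
# The type strings are illustrative and are not taken from the dataset.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("mustafa", "kemal", "atatürk"): "government/politician",
    ("ankara",): "location/citytown",
    ("galatasaray",): "sports/sports_team",
}

# Hypothetical mapping from fine-grained types to the four coarse types.
FINE_TO_COARSE: Dict[str, str] = {
    "government/politician": "person",
    "location/citytown": "loc",
    "sports/sports_team": "org",
}


def to_coarse(fine_type: str) -> str:
    """Map a fine-grained type to person/loc/org, falling back to misc."""
    return FINE_TO_COARSE.get(fine_type, "misc")


def annotate(tokens: List[str],
             gazetteer: Dict[Tuple[str, ...], str],
             max_len: int = 3) -> List[str]:
    """Tag tokens with coarse BIO labels using greedy longest-match lookup."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            fine = gazetteer.get(tuple(lowered[i:i + n]))
            if fine is not None:
                coarse = to_coarse(fine)
                tags[i] = "B-" + coarse
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + coarse
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


tokens = ["Mustafa", "Kemal", "Atatürk", "Ankara", "kentine", "geldi"]
print(list(zip(tokens, annotate(tokens, GAZETTEER))))
```

Exact matching of this kind cannot resolve ambiguous surface forms, which is the sort of ambiguity the noise reduction steps mentioned above are meant to address. The sketch does not attempt any such filtering.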


